# Chapter 8: Multimodal Data Processing

**Data-Juicer User Guide**

- Git Commit: `v1.0.5`
- Commit Date: 2026-01-16
- Repository: https://github.com/datajuicer/data-juicer

---

This chapter covers multimodal data format conversion and processing. See the [Multimodal Tools README](https://datajuicer.github.io/data-juicer/en/main/tools/fmt_conversion/multimodal/README.html) for complete documentation.

## Data-Juicer Multimodal Format

Data-Juicer uses a unified text-based interleaved format for multimodal data:

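```json
{
  "text": "<__dj__image> Antarctica is Earth's southernmost continent. <|__dj__eoc|> <__dj__audio> <__dj__video> It lies almost entirely south of the Antarctic Circle. <|__dj__eoc|>",
  "images": ["path/to/antarctica.jpg"],
  "audios": ["path/to/audio.wav"],
  "videos": ["path/to/video.mp4"]
}
```

**Special tokens:**
- `<__dj__image>`, `<__dj__audio>`, `<__dj__video>`: Modality placeholders. Each one marks where a file from the `images`, `audios`, or `videos` list appears in the text, in list order.
- `<|__dj__eoc|>`: End of chunk (semantic unit separator)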
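
To make the token conventions concrete, here is a minimal sketch (not part of Data-Juicer) that splits a sample's text into chunks on the end-of-chunk token and counts placeholders per modality. The helper names `split_chunks` and `count_placeholders` are hypothetical; only the token strings and field names come from the format above.

```python
# Token strings and field names are taken from the format description above.
EOC_TOKEN = "<|__dj__eoc|>"
PLACEHOLDERS = {
    "images": "<__dj__image>",
    "audios": "<__dj__audio>",
    "videos": "<__dj__video>",
}


def split_chunks(text: str) -> list[str]:
    """Split interleaved text on the end-of-chunk token, dropping empty pieces."""
    return [chunk.strip() for chunk in text.split(EOC_TOKEN) if chunk.strip()]


def count_placeholders(sample: dict) -> dict[str, tuple[int, int]]:
    """Return {field: (placeholders found in text, files listed)} per modality."""
    text = sample.get("text", "")
    return {
        field: (text.count(token), len(sample.get(field, [])))
        for field, token in PLACEHOLDERS.items()
    }


sample = {
    "text": "<__dj__image> Antarctica is Earth's southernmost continent. <|__dj__eoc|>",
    "images": ["path/to/antarctica.jpg"],
}

print(split_chunks(sample["text"]))
print(count_placeholders(sample))
```

A mismatch between the two numbers in any pair is a quick signal that a sample's text and file lists have drifted apart.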

## Format Conversion Tools

Data-Juicer provides bidirectional conversion between its format and popular multimodal formats:

| Format | Type | To DJ | From DJ |
|--------|------|-------|----------|
| **LLaVA-like** | image-text | `llava_to_dj.py` | `dj_to_llava.py` |
| **MMC4-like** | image-text | `mmc4_to_dj.py` | `dj_to_mmc4.py` |
| **WavCaps-like** | audio-text | `wavcaps_to_dj.py` | `dj_to_wavcaps.py` |
| **Video-ChatGPT-like** | video-text | `video_chatgpt_to_dj.py` | `dj_to_video_chatgpt.py` |
| **Youku-mPLUG-like** | video-text | `youku_to_dj.py` | `dj_to_youku.py` |
| **InternVid-like** | video-text | `internvid_to_dj.py` | `dj_to_internvid.py` |

**Additional tool:**
- `absolute_path_to_relative_path.py`: Convert absolute paths to relative paths for data migration
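
The script names in the table follow a regular pattern, so the paths can be derived from a format's short name. The sketch below assumes every script lives in the same two directories as the LLaVA scripts used later in this chapter; `script_paths` is a hypothetical helper, not part of Data-Juicer.

```python
from pathlib import PurePosixPath

# Directory names as they appear in the LLaVA commands later in this chapter.
# Assumption: the other formats' scripts sit in the same two directories.
TOOLS_ROOT = PurePosixPath("data-juicer/tools/fmt_conversion/multimodal")
TO_DJ_DIR = "source_format_to_data_juicer_format"
FROM_DJ_DIR = "data_juicer_format_to_target_format"

# Short names as used in the script file names in the table above.
FORMATS = ("llava", "mmc4", "wavcaps", "video_chatgpt", "youku", "internvid")


def script_paths(fmt: str) -> tuple[str, str]:
    """Return (to-DJ script path, from-DJ script path) for a format short name."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    to_dj = TOOLS_ROOT / TO_DJ_DIR / f"{fmt}_to_dj.py"
    from_dj = TOOLS_ROOT / FROM_DJ_DIR / f"dj_to_{fmt}.py"
    return str(to_dj), str(from_dj)


for fmt in FORMATS:
    print(fmt, *script_paths(fmt))
```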

In [None]:
!git clone --depth 1 https://github.com/datajuicer/data-juicer.git

In [None]:
!uv pip install -e "data-juicer[dev]"

## Example: LLaVA Format Conversion and Processing

In [None]:
import json
import os

# Create a sample LLaVA-format dataset.
# Image paths are relative to the dataset file's directory (./data/llava_raw).
os.makedirs('./data/llava_raw', exist_ok=True)

llava_samples = [
    {
        "id": "sample_1",
        "image": "../../data-juicer/tests/ops/data/img1.png",
        "conversations": [
            {"from": "human", "value": "<image>\nWhat do you see in this image?"},
            {"from": "gpt", "value": "I can see a comfortable bed."}
        ]
    },
    {
        "id": "sample_2",
        "image": "../../data-juicer/tests/ops/data/img3.jpg",
        "conversations": [
            {"from": "human", "value": "<image>\nDescribe the weather in this photo."},
            {"from": "gpt", "value": "The weather in the photo is distinctly rainy."}
        ]
    }
]

with open('./data/llava_raw/dataset.json', 'w') as f:
    json.dump(llava_samples, f, indent=2)

print(f"Created LLaVA dataset with {len(llava_samples)} samples")

In [None]:
# Convert LLaVA to Data-Juicer format
!python data-juicer/tools/fmt_conversion/multimodal/source_format_to_data_juicer_format/llava_to_dj.py \
    --llava_ds_path ./data/llava_raw/dataset.json \
    --target_ds_path ./data/dj_format/dataset.jsonl

In [None]:
# Check converted Data-Juicer format
with open('./data/dj_format/dataset.jsonl', 'r') as f:
    dj_samples = [json.loads(line) for line in f]

print("Data-Juicer format sample:")
print(json.dumps(dj_samples[0], indent=2))
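
To see what the conversion does conceptually, the following is a simplified illustration, not the actual `llava_to_dj.py` logic. It swaps LLaVA's `<image>` token for the Data-Juicer placeholder, joins the conversation turns, and appends the end-of-chunk token. The `[[speaker]]: ` turn prefix and the function name `llava_to_dj_sketch` are assumptions made for this sketch; compare against the real output printed by the cell above.

```python
IMAGE_TOKEN_LLAVA = "<image>"
IMAGE_TOKEN_DJ = "<__dj__image>"
EOC_TOKEN = "<|__dj__eoc|>"


def llava_to_dj_sketch(sample: dict) -> dict:
    """Simplified illustration of the conversion idea, NOT the real script."""
    turns = []
    for turn in sample["conversations"]:
        value = turn["value"].replace(IMAGE_TOKEN_LLAVA, IMAGE_TOKEN_DJ)
        # Assumed layout: one "[[speaker]]: text" line per conversation turn.
        turns.append(f"[[{turn['from']}]]: {value}")
    return {
        "id": sample["id"],
        "text": "\n".join(turns) + f" {EOC_TOKEN}",
        "images": [sample["image"]],
    }


llava_sample = {
    "id": "sample_1",
    "image": "../../data-juicer/tests/ops/data/img1.png",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat do you see in this image?"},
        {"from": "gpt", "value": "I can see a comfortable bed."},
    ],
}

dj_sample = llava_to_dj_sketch(llava_sample)
print(dj_sample["text"])
```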

In [None]:
# Write the Data-Juicer config for the multimodal pipeline
config = """project_name: 'multimodal_demo'
dataset_path: './data/dj_format/dataset.jsonl'
export_path: './data/processed/dataset.jsonl'
np: 2

image_key: 'images'

process:
  - text_length_filter:
      min_len: 10
      max_len: 1000
  
  - image_aspect_ratio_filter:
      min_ratio: 0.5
      max_ratio: 2.0

  - image_size_filter:
      min_size: "1KB"
      max_size: "10MB"
"""

os.makedirs('./configs', exist_ok=True)
with open('./configs/multimodal.yaml', 'w') as f:
    f.write(config)

print("Configuration created")

In [None]:
# Run Data-Juicer processing
!dj-process --config ./configs/multimodal.yaml
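
The three filters in the config can be approximated in plain Python to reason about which samples survive. This is a sketch under stated assumptions, not Data-Juicer's implementation: it assumes text length is a character count, aspect ratio is width divided by height, `1KB` means 1024 bytes, and each sample has exactly one image. `keep_sample` is a hypothetical name.

```python
def keep_sample(text: str, width: int, height: int, size_bytes: int) -> bool:
    """Approximate the config's three filters for a single-image sample."""
    # text_length_filter: min_len=10, max_len=1000 (assumed to count characters)
    text_ok = 10 <= len(text) <= 1000
    # image_aspect_ratio_filter: min_ratio=0.5, max_ratio=2.0 (assumed width / height)
    ratio_ok = 0.5 <= width / height <= 2.0
    # image_size_filter: min_size="1KB", max_size="10MB" (assumed 1KB = 1024 bytes)
    size_ok = 1024 <= size_bytes <= 10 * 1024**2
    return text_ok and ratio_ok and size_ok


caption = "<__dj__image> I can see a comfortable bed. <|__dj__eoc|>"
print(keep_sample(caption, width=640, height=480, size_bytes=50_000))   # passes all three
print(keep_sample(caption, width=3000, height=500, size_bytes=50_000))  # too wide
print(keep_sample("short", width=640, height=480, size_bytes=50_000))   # text too short
```

A sample is kept only if it passes every operator in the `process` list, which is why the checks are combined with `and`.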

## Path Management: Absolute to Relative Path Conversion

After processing with Data-Juicer, the output dataset contains absolute paths to the multimodal files. To make the dataset portable across machines, you can convert these absolute paths to relative paths and optionally copy the files into a single directory.
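
The core of the path rewrite can be sketched with the standard library. This is only an illustration of the idea; the real `absolute_path_to_relative_path.py` tool may differ in details such as file copying. `relativize_paths` and the `/data/imgs` directory are hypothetical.

```python
import os


def relativize_paths(sample: dict, absolute_dir: str, path_key: str = "images") -> dict:
    """Return a copy of `sample` whose paths are relative to `absolute_dir`."""
    converted = dict(sample)  # shallow copy, so the input sample is left untouched
    converted[path_key] = [
        os.path.relpath(path, start=absolute_dir)
        for path in sample.get(path_key, [])
    ]
    return converted


sample = {
    "text": "<__dj__image> A comfortable bed. <|__dj__eoc|>",
    "images": ["/data/imgs/img1.png"],  # hypothetical absolute path
}
print(relativize_paths(sample, "/data/imgs"))
```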

In [None]:
# Inspect the processed output from the previous section.
abs_dir = None
with open("./data/processed/dataset.jsonl", "r") as f:
    for line in f:
        sample = json.loads(line)
        print(sample)
        image_paths = sample.get("images", [])
        if image_paths:
            # Record the absolute image directory for the path conversion below.
            # This assumes all images live in the same directory.
            abs_dir = os.path.dirname(image_paths[0])
print(f"Absolute directory: {abs_dir}")

In [None]:
# Convert absolute paths to relative paths
!python data-juicer/tools/fmt_conversion/multimodal/absolute_path_to_relative_path.py \
    --dj_ds_path ./data/processed/dataset.jsonl \
    --absolute_dir {abs_dir} \
    --path_key images \
    --target_dj_ds_path data/converted_data/relative_paths.jsonl \
    --target_mt_dir data/converted_data/

In [None]:
# Compare before and after conversion
print("Before conversion (absolute paths):")
with open('./data/processed/dataset.jsonl', 'r') as f:
    original = json.loads(f.readline())
print(json.dumps(original, indent=2))

print("\nAfter conversion (relative paths):")
with open("./data/converted_data/relative_paths.jsonl", "r") as f:
    lines = f.readlines()
    for item in lines:
        print(json.loads(item))

In [None]:
# Convert back to LLaVA format
!python data-juicer/tools/fmt_conversion/multimodal/data_juicer_format_to_target_format/dj_to_llava.py \
    --dj_ds_path ./data/converted_data/relative_paths.jsonl \
    --target_llava_ds_path ./data/converted_data/llava_final_dataset.json

In [None]:
# Check final result
with open('./data/converted_data/llava_final_dataset.json', 'r') as f:
    final_samples = json.load(f)

print(f"Final dataset: {len(final_samples)} samples")
print("\nSamples:")
print(json.dumps(final_samples, indent=2))
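
After the round trip it is useful to know which samples the filters removed. The sketch below compares sample IDs between the original and final datasets. It assumes the `id` field survives both conversions, which is worth verifying against the output above. `dropped_ids` is a hypothetical helper, and the inline data stands in for `llava_samples` and `final_samples`.

```python
def dropped_ids(original: list[dict], final: list[dict]) -> list[str]:
    """Return IDs present in `original` but missing from `final`, in input order."""
    final_ids = {sample.get("id") for sample in final}
    return [sample["id"] for sample in original if sample["id"] not in final_ids]


# Stand-in data for illustration; in the notebook, pass llava_samples and final_samples.
original = [{"id": "sample_1"}, {"id": "sample_2"}]
final = [{"id": "sample_1"}]

print(dropped_ids(original, final))
```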

## Practical Workflow

A typical multimodal data processing workflow:

1. **Format Conversion**: Convert source format to Data-Juicer format
2. **Data Processing**: Apply filters and mappers using Data-Juicer pipeline
3. **Path Management**: Convert absolute paths to relative paths (if needed)
4. **Format Export**: Convert back to target format for downstream use
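
The four steps can be collected into one place as argument lists, reusing the exact commands from the LLaVA example in this chapter. This sketch only builds and prints the commands; nothing is executed. `build_commands` is a hypothetical helper and `/abs/path/to/images` is a placeholder for the directory detected in the path management section.

```python
import shlex

TOOLS = "data-juicer/tools/fmt_conversion/multimodal"


def build_commands(abs_dir: str) -> list[list[str]]:
    """Return the four workflow steps as argv lists (nothing is executed)."""
    return [
        # 1. Format conversion: LLaVA -> Data-Juicer
        ["python", f"{TOOLS}/source_format_to_data_juicer_format/llava_to_dj.py",
         "--llava_ds_path", "./data/llava_raw/dataset.json",
         "--target_ds_path", "./data/dj_format/dataset.jsonl"],
        # 2. Data processing
        ["dj-process", "--config", "./configs/multimodal.yaml"],
        # 3. Path management: absolute -> relative paths
        ["python", f"{TOOLS}/absolute_path_to_relative_path.py",
         "--dj_ds_path", "./data/processed/dataset.jsonl",
         "--absolute_dir", abs_dir,
         "--path_key", "images",
         "--target_dj_ds_path", "data/converted_data/relative_paths.jsonl",
         "--target_mt_dir", "data/converted_data/"],
        # 4. Format export: Data-Juicer -> LLaVA
        ["python", f"{TOOLS}/data_juicer_format_to_target_format/dj_to_llava.py",
         "--dj_ds_path", "./data/converted_data/relative_paths.jsonl",
         "--target_llava_ds_path", "./data/converted_data/llava_final_dataset.json"],
    ]


commands = build_commands("/abs/path/to/images")  # placeholder directory
for argv in commands:
    print(shlex.join(argv))
```

To run the steps for real, each list could be passed to `subprocess.run(argv, check=True)` so that a failing step stops the workflow.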

The [Multimodal Tools README](https://datajuicer.github.io/data-juicer/en/main/tools/fmt_conversion/multimodal/README.html) documents:
- All supported formats and their specifications
- Tool usage and parameters
- Format conversion considerations


For specific multimodal operators, refer to the [Operators Documentation](https://datajuicer.github.io/data-juicer/en/main/docs/Operators.html).