# Chapter 9: Multimodal Data Processing

**Data-Juicer User Guide**

- Git Commit: `v1.4.6`
- Commit Date: 2026-02-02
- Repository: https://github.com/datajuicer/data-juicer

---

This chapter covers multimodal data format conversion and processing. See the [Multimodal Tools README](https://datajuicer.github.io/data-juicer/en/main/tools/fmt_conversion/multimodal/README.html) for complete documentation.

## Table of Contents

1. [Data-Juicer Multimodal Format](#data-juicer-multimodal-format)
2. [Format Conversion Tools](#format-conversion-tools)
3. [Example: LLaVA Format Conversion and Processing](#example-llava-format-conversion-and-processing)
4. [Path Management: Absolute to Relative Path Conversion](#path-management-absolute-to-relative-path-conversion)
5. [Practical Workflow](#practical-workflow)

## Data-Juicer Multimodal Format

Data-Juicer uses a unified text-based interleaved format for multimodal data:

```json
{
  "text": "<__dj__image> Antarctica is Earth's southernmost and least-populated continent. <|__dj__eoc|> "
          "<__dj__video> <__dj__audio> Situated almost entirely south of the Antarctic Circle and surrounded by the "
          "Southern Ocean (also known as the Antarctic Ocean), it contains the geographic South Pole. <|__dj__eoc|> "
          "Antarctica is the fifth-largest continent, being about 40% larger than Europe, "
          "and has an area of 14,200,000 km2 (5,500,000 sq mi). <__dj__image> <__dj__image> <|__dj__eoc|> "
          "Most of Antarctica is covered by the Antarctic ice sheet, "
          "with an average thickness of 1.9 km (1.2 mi). <|__dj__eoc|>",
  "images": [
    "path/to/the/image/of/antarctica_snowfield",
    "path/to/the/image/of/antarctica_map",
    "path/to/the/image/of/europe_map"
  ],
  "audios": [
    "path/to/the/audio/of/sound_of_waves_in_Antarctic_Ocean"
  ],
  "videos": [
    "path/to/the/video/of/remote_sensing_view_of_antarctica"
  ]
}
```

**Special tokens:**
- `<__dj__image>`, `<__dj__audio>`, `<__dj__video>`: Modality placeholders
- `<|__dj__eoc|>`: End of chunk (semantic unit separator)
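
To make the role of these tokens concrete, here is a minimal sketch that splits a `text` field into chunks at `<|__dj__eoc|>` and counts the modality placeholders in each chunk. It is not part of Data-Juicer: the helper name `split_chunks` and the shortened sample text are hypothetical, and only the token strings come from the format above.

```python
# Token strings copied from the format description above.
EOC = "<|__dj__eoc|>"
PLACEHOLDERS = {
    "images": "<__dj__image>",
    "audios": "<__dj__audio>",
    "videos": "<__dj__video>",
}


def split_chunks(text):
    """Split interleaved text into chunks and count placeholders per chunk.

    Hypothetical helper for illustration only.
    """
    chunks = []
    for raw in text.split(EOC):
        if not raw.strip():
            continue  # skip the empty tail after the last end-of-chunk token
        counts = {key: raw.count(tok) for key, tok in PLACEHOLDERS.items()}
        # Remove placeholders and collapse whitespace to get the plain text.
        plain = raw
        for tok in PLACEHOLDERS.values():
            plain = plain.replace(tok, " ")
        chunks.append({"text": " ".join(plain.split()), **counts})
    return chunks


# Made-up sample text that follows the same token layout.
sample_text = (
    "<__dj__image> A snowfield. <|__dj__eoc|> "
    "<__dj__video> <__dj__audio> Waves at the coast. <|__dj__eoc|>"
)
for chunk in split_chunks(sample_text):
    print(chunk)
```

In the Antarctica example above, the per-modality placeholder totals match the lengths of the `images`, `audios`, and `videos` lists, which is a quick consistency check for a converted dataset.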

## Format Conversion Tools

Data-Juicer provides bidirectional conversion between its format and popular multimodal formats:

| Format | Type | To DJ | From DJ |
|--------|------|-------|----------|
| **LLaVA-like** | image-text | [`llava_to_dj.py`](https://github.com/datajuicer/data-juicer/blob/main/tools/fmt_conversion/multimodal/source_format_to_data_juicer_format/llava_to_dj.py) | [`dj_to_llava.py`](https://github.com/datajuicer/data-juicer/blob/main/tools/fmt_conversion/multimodal/data_juicer_format_to_target_format/dj_to_llava.py) |
| **MMC4-like** | image-text | [`mmc4_to_dj.py`](https://github.com/datajuicer/data-juicer/blob/main/tools/fmt_conversion/multimodal/source_format_to_data_juicer_format/mmc4_to_dj.py) | [`dj_to_mmc4.py`](https://github.com/datajuicer/data-juicer/blob/main/tools/fmt_conversion/multimodal/data_juicer_format_to_target_format/dj_to_mmc4.py) |
| **WavCaps-like** | audio-text | [`wavcaps_to_dj.py`](https://github.com/datajuicer/data-juicer/blob/main/tools/fmt_conversion/multimodal/source_format_to_data_juicer_format/wavcaps_to_dj.py) | [`dj_to_wavcaps.py`](https://github.com/datajuicer/data-juicer/blob/main/tools/fmt_conversion/multimodal/data_juicer_format_to_target_format/dj_to_wavcaps.py) |
| **Video-ChatGPT-like** | video-text | [`video_chatgpt_to_dj.py`](https://github.com/datajuicer/data-juicer/blob/main/tools/fmt_conversion/multimodal/source_format_to_data_juicer_format/video_chatgpt_to_dj.py) | [`dj_to_video_chatgpt.py`](https://github.com/datajuicer/data-juicer/blob/main/tools/fmt_conversion/multimodal/data_juicer_format_to_target_format/dj_to_video_chatgpt.py) |
| **Youku-mPLUG-like** | video-text | [`youku_to_dj.py`](https://github.com/datajuicer/data-juicer/blob/main/tools/fmt_conversion/multimodal/source_format_to_data_juicer_format/youku_to_dj.py) | [`dj_to_youku.py`](https://github.com/datajuicer/data-juicer/blob/main/tools/fmt_conversion/multimodal/data_juicer_format_to_target_format/dj_to_youku.py) |
| **InternVid-like** | video-text | [`internvid_to_dj.py`](https://github.com/datajuicer/data-juicer/blob/main/tools/fmt_conversion/multimodal/source_format_to_data_juicer_format/internvid_to_dj.py) | [`dj_to_internvid.py`](https://github.com/datajuicer/data-juicer/blob/main/tools/fmt_conversion/multimodal/data_juicer_format_to_target_format/dj_to_internvid.py) |

**Additional tool:**
- [`absolute_path_to_relative_path.py`](https://github.com/datajuicer/data-juicer/blob/main/tools/fmt_conversion/multimodal/absolute_path_to_relative_path.py): Convert absolute paths to relative paths for data migration

In [1]:
!git clone --depth 1 https://github.com/datajuicer/data-juicer.git

Cloning into 'data-juicer'...
remote: Enumerating objects: 1246, done.
remote: Counting objects: 100% (1246/1246), done.
remote: Compressing objects: 100% (932/932), done.
remote: Total 1246 (delta 361), reused 818 (delta 285), pack-reused 0 (from 0)
Receiving objects: 100% (1246/1246), 34.37 MiB | 40.08 MiB/s, done.
Resolving deltas: 100% (361/361), done.


In [2]:
# If running in Google Colab, use 'pip install' instead of 'uv pip install'
!uv pip install py-data-juicer[dev]

Audited 1 package in 22ms


## Example: LLaVA Format Conversion and Processing

In [3]:
import json
import os

# Create sample LLaVA-format dataset
os.makedirs('./data/llava_raw', exist_ok=True)

llava_samples = [
    {
        "id": "sample_1",
        "image": "../../data-juicer/tests/ops/data/img1.png",
        "conversations": [
            {"from": "human", "value": "<image>\nWhat do you see in this image?"},
            {"from": "gpt", "value": "I can see a comfortable bed."}
        ]
    },
    {
        "id": "sample_2", 
        "image": "../../data-juicer/tests/ops/data/img3.jpg",
        "conversations": [
            {"from": "human", "value": "<image>\nDescribe the weather in this photo."},
            {"from": "gpt", "value": "The weather in the photo is distinctly rainy."}
        ]
    }
]

with open('./data/llava_raw/dataset.json', 'w') as f:
    json.dump(llava_samples, f, indent=2)

print(f"Created LLaVA dataset with {len(llava_samples)} samples")

Created LLaVA dataset with 2 samples


In [4]:
# Convert LLaVA to Data-Juicer format
!python data-juicer/tools/fmt_conversion/multimodal/source_format_to_data_juicer_format/llava_to_dj.py \
    --llava_ds_path ./data/llava_raw/dataset.json \
    --target_ds_path ./data/dj_format/dataset.jsonl

2026-02-12 09:38:05.711 | INFO     | __main__:main:134 - Create directory [./data/dj_format] for the target dataset.
2026-02-12 09:38:05.711 | INFO     | __main__:main:163 - Loading original LLaVA dataset.
2026-02-12 09:38:05.711 | INFO     | __main__:main:165 - Load [2] samples.
100%|██████████████████████████████████████████| 2/2 [00:00<00:00, 20815.40it/s]
2026-02-12 09:38:05.715 | INFO     | __main__:main:289 - Store the target dataset into [./data/dj_format/dataset.jsonl].


In [5]:
# Check converted Data-Juicer format
with open('./data/dj_format/dataset.jsonl', 'r') as f:
    dj_samples = [json.loads(line) for line in f]

print("Data-Juicer format sample:")
print(json.dumps(dj_samples[0], indent=2))

Data-Juicer format sample:
{
  "id": "sample_1",
  "text": "[[human]]: <image>\nWhat do you see in this image?\n[[gpt]]: I can see a comfortable bed. <|__dj__eoc|>",
  "images": [
    "../../data-juicer/tests/ops/data/img1.png"
  ]
}
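
The shape of the converted `text` field can be reproduced with a few lines of Python. This is only a sketch that mirrors the output shown above: the function name `flatten_conversations` is hypothetical, and the real `llava_to_dj.py` may handle cases this sketch does not.

```python
EOC = "<|__dj__eoc|>"


def flatten_conversations(conversations):
    """Join LLaVA turns into one text field, mirroring the output shown above.

    Illustrative only; not the converter's actual implementation.
    """
    turns = [f"[[{turn['from']}]]: {turn['value']}" for turn in conversations]
    return "\n".join(turns) + f" {EOC}"


conversations = [
    {"from": "human", "value": "<image>\nWhat do you see in this image?"},
    {"from": "gpt", "value": "I can see a comfortable bed."},
]
print(repr(flatten_conversations(conversations)))
```

Each turn is prefixed with its speaker in double brackets, and the whole conversation ends with a single end-of-chunk token.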


In [6]:
# Process with Data-Juicer
config = """project_name: 'multimodal_demo'
dataset_path: './data/dj_format/dataset.jsonl'
export_path: './data/processed/dataset.jsonl'
np: 2

image_key: 'images'

process:
  - text_length_filter:
      min_len: 10
      max_len: 1000
  
  - image_aspect_ratio_filter:
      min_ratio: 0.5
      max_ratio: 2.0

  - image_size_filter:
      min_size: "1KB"
      max_size: "10MB"
"""

os.makedirs('./configs', exist_ok=True)
with open('./configs/multimodal.yaml', 'w') as f:
    f.write(config)

print("Configuration created")

Configuration created
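
As a rough mental model of what the first two filters in this config decide, the sketch below expresses them as plain predicates. The helper names are hypothetical, and the semantics (inclusive bounds, length counted in characters, aspect ratio computed as width divided by height) are assumptions for illustration, not a statement of how the operators are implemented.

```python
def keep_by_text_length(text, min_len=10, max_len=1000):
    """Keep a sample whose character count lies in [min_len, max_len] (assumed inclusive)."""
    return min_len <= len(text) <= max_len


def keep_by_aspect_ratio(width, height, min_ratio=0.5, max_ratio=2.0):
    """Keep an image whose width/height ratio lies in [min_ratio, max_ratio] (assumed inclusive)."""
    return min_ratio <= width / height <= max_ratio


print(keep_by_text_length("I can see a comfortable bed."))  # True: 28 characters
print(keep_by_aspect_ratio(640, 480))    # True: ratio is about 1.33
print(keep_by_aspect_ratio(1920, 400))   # False: ratio is 4.8
```

A sample passes the pipeline only if every operator in the `process` list keeps it.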


In [7]:
# Run Data-Juicer processing
!dj-process --config ./configs/multimodal.yaml

2026-02-12 09:38:12.433 | INFO     | data_juicer.config.config:695 - dataset_path config is set and a valid local path
2026-02-12 09:38:12.492 | INFO     | data_juicer.config.config:1012 - Back up the input config file [/workspaces/data-juicer-hub/configs/multimodal.yaml] into the work_dir [/workspaces/data-juicer-hub/data/processed]
2026-02-12 09:38:12.498 | INFO     | data_juicer.config.config:1033 - Configuration table: 
╒══════════════════════════╤═════════════════════════════════════════════════════════════════════════════════════════════╕
│ key                      │ values                                                                                      │
╞══════════════════════════╪═════════════════════════════════════════════════════════════════════════════════════════════╡
│ config                   │ [Path_fr(./configs/multimodal.yaml, cwd=/work

## Path Management: Absolute to Relative Path Conversion

After processing with Data-Juicer, the output dataset contains absolute paths to the multimodal files, as the sample output below shows. To make the dataset portable across machines, you can convert these absolute paths to relative paths and optionally copy the files to a unified directory.
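
The idea behind the conversion can be sketched with the standard library. This is not the tool's implementation: the helper name `to_relative` and the example paths are made up, and `posixpath` is used so the result is the same on any platform.

```python
import posixpath


def to_relative(path, absolute_dir):
    """Return `path` relative to `absolute_dir` (POSIX-style paths)."""
    return posixpath.relpath(path, start=absolute_dir)


# Hypothetical paths, for illustration only.
abs_dir = "/data/project/images"
print(to_relative("/data/project/images/img1.png", abs_dir))       # img1.png
print(to_relative("/data/project/images/cats/img2.png", abs_dir))  # cats/img2.png
```

Once paths are relative to a known directory, moving the dataset only requires moving that directory along with the JSONL file.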

In [8]:
# Check the results obtained in the previous section.
abs_dir = None
with open("./data/processed/dataset.jsonl", "r") as f:
    lines = f.readlines()
    for item in lines:
        sample = json.loads(item)
        print(sample)
        images = sample.get('images', [])
        if images:
            # Get absolute directory path for subsequent path conversion operations
            abs_dir = os.path.dirname(images[0])
print(f"Absolute directory: {abs_dir}")

{'id': 'sample_1', 'text': '[[human]]: <image>\nWhat do you see in this image?\n[[gpt]]: I can see a comfortable bed. <|__dj__eoc|>', 'images': ['/workspaces/data-juicer-hub/data/dj_format/../../data-juicer/tests/ops/data/img1.png']}
{'id': 'sample_2', 'text': '[[human]]: <image>\nDescribe the weather in this photo.\n[[gpt]]: The weather in the photo is distinctly rainy. <|__dj__eoc|>', 'images': ['/workspaces/data-juicer-hub/data/dj_format/../../data-juicer/tests/ops/data/img3.jpg']}
Absolute directory: /workspaces/data-juicer-hub/data/dj_format/../../data-juicer/tests/ops/data


In [9]:
# Convert absolute paths to relative paths
!python data-juicer/tools/fmt_conversion/multimodal/absolute_path_to_relative_path.py \
    --dj_ds_path ./data/processed/dataset.jsonl \
    --absolute_dir {abs_dir} \
    --path_key images \
    --target_dj_ds_path data/converted_data/relative_paths.jsonl \
    --target_mt_dir data/converted_data/

2026-02-12 09:38:21.211 | INFO     | __main__:convert_absolute_path_to_relative_path:105 - Create directory [data/converted_data] for the target dataset.
2026-02-12 09:38:21.211 | INFO     | __main__:convert_absolute_path_to_relative_path:115 - Start to convert absolute path to relative path.
2it [00:00, 2212.19it/s]
2026-02-12 09:38:21.217 | INFO     | __main__:convert_absolute_path_to_relative_path:152 - Start to write the converted dataset to [data/converted_data/relative_paths.jsonl]...


In [10]:
# Compare before and after conversion
print("Before conversion (absolute paths):")
with open('./data/processed/dataset.jsonl', 'r') as f:
    original = json.loads(f.readline())
print(json.dumps(original, indent=2))

print("\nAfter conversion (relative paths):")
with open("./data/converted_data/relative_paths.jsonl", "r") as f:
    lines = f.readlines()
    for item in lines:
        print(json.loads(item))

Before conversion (absolute paths):
{
  "id": "sample_1",
  "text": "[[human]]: <image>\nWhat do you see in this image?\n[[gpt]]: I can see a comfortable bed. <|__dj__eoc|>",
  "images": [
    "/workspaces/data-juicer-hub/data/dj_format/../../data-juicer/tests/ops/data/img1.png"
  ]
}

After conversion (relative paths):
{'id': 'sample_1', 'text': '[[human]]: <image>\nWhat do you see in this image?\n[[gpt]]: I can see a comfortable bed. <|__dj__eoc|>', 'images': ['img1.png'], '__dj__meta__': {'abs_dir': {'images': ['/workspaces/data-juicer-hub/data/dj_format/../../data-juicer/tests/ops/data/']}}}
{'id': 'sample_2', 'text': '[[human]]: <image>\nDescribe the weather in this photo.\n[[gpt]]: The weather in the photo is distinctly rainy. <|__dj__eoc|>', 'images': ['img3.jpg'], '__dj__meta__': {'abs_dir': {'images': ['/workspaces/data-juicer-hub/data/dj_format/../../data-juicer/tests/ops/data/']}}}
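
The converted samples keep the original directory under `__dj__meta__`, so the absolute paths can be rebuilt if needed. The sketch below assumes that the first entry under `abs_dir` for a given key is the base directory for every path in that sample. That assumption is read off the output above rather than taken from the tool's documentation, and the helper name `restore_absolute_paths` is hypothetical.

```python
import posixpath


def restore_absolute_paths(sample, path_key="images"):
    """Rebuild absolute paths from a converted sample.

    Assumes the first recorded directory applies to every path in the sample.
    """
    abs_dirs = sample.get("__dj__meta__", {}).get("abs_dir", {}).get(path_key, [])
    if not abs_dirs:
        return list(sample.get(path_key, []))  # nothing recorded; leave as-is
    base = abs_dirs[0]
    # normpath also collapses the "../.." segments seen in the output above.
    return [posixpath.normpath(posixpath.join(base, p)) for p in sample[path_key]]


converted = {
    "images": ["img1.png"],
    "__dj__meta__": {"abs_dir": {"images": [
        "/workspaces/data-juicer-hub/data/dj_format/../../data-juicer/tests/ops/data/"
    ]}},
}
print(restore_absolute_paths(converted))
```

Note that `normpath` rewrites paths purely as strings, so it can change the meaning of a path that passes through symbolic links.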


In [11]:
# Convert back to LLaVA format
!python data-juicer/tools/fmt_conversion/multimodal/data_juicer_format_to_target_format/dj_to_llava.py \
    --dj_ds_path ./data/converted_data/relative_paths.jsonl \
    --target_llava_ds_path ./data/converted_data/llava_final_dataset.json

2026-02-12 09:38:24.268 | INFO     | __main__:main:148 - Start to convert.
2it [00:00, 950.77it/s]
2026-02-12 09:38:24.276 | INFO     | __main__:main:236 - Start to write the converted dataset to [./data/converted_data/llava_final_dataset.json]...


In [12]:
# Check final result
with open('./data/converted_data/llava_final_dataset.json', 'r') as f:
    final_samples = json.load(f)

print(f"Final dataset: {len(final_samples)} samples")
print("\nSample:")
print(json.dumps(final_samples, indent=2))

Final dataset: 2 samples

Sample:
[
  {
    "id": "sample_1",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat do you see in this image?"
      },
      {
        "from": "gpt",
        "value": "I can see a comfortable bed."
      }
    ],
    "image": "img1.png"
  },
  {
    "id": "sample_2",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nDescribe the weather in this photo."
      },
      {
        "from": "gpt",
        "value": "The weather in the photo is distinctly rainy."
      }
    ],
    "image": "img3.jpg"
  }
]
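
A round trip like this one is easy to sanity-check: the ids and conversations should be unchanged, while the image paths are expected to differ. The sketch below uses hypothetical helper names and inline copies of one sample so that it runs on its own.

```python
def conversation_key(sample):
    """Reduce a LLaVA sample to its id and (speaker, text) turns."""
    turns = tuple((t["from"], t["value"]) for t in sample["conversations"])
    return sample["id"], turns


def same_conversations(before, after):
    """Return True if both datasets hold the same ids and conversations.

    Image paths are ignored on purpose, since the path conversion rewrites them.
    """
    return sorted(map(conversation_key, before)) == sorted(map(conversation_key, after))


turns = [
    {"from": "human", "value": "<image>\nWhat do you see in this image?"},
    {"from": "gpt", "value": "I can see a comfortable bed."},
]
before = [{"id": "sample_1", "image": "../../data-juicer/tests/ops/data/img1.png",
           "conversations": turns}]
after = [{"id": "sample_1", "conversations": turns, "image": "img1.png"}]
print(same_conversations(before, after))  # True
```

In this notebook the same check could be run as `same_conversations(llava_samples, final_samples)`. It returns `False` whenever a filter drops a sample, so it only applies when no samples are expected to be removed.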


In [13]:
# Remove cloned Data-Juicer repository
!rm -rf data-juicer

## Practical Workflow

A typical multimodal data processing workflow:

1. **Format Conversion**: Convert source format to Data-Juicer format
2. **Data Processing**: Apply filters and mappers using Data-Juicer pipeline
3. **Path Management**: Convert absolute paths to relative paths (if needed)
4. **Format Export**: Convert back to target format for downstream use
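
The four steps can be collected into one script. The sketch below only builds the command lines used earlier in this chapter and prints them; nothing is executed. The function name `build_workflow` is hypothetical and `/path/to/media` is a placeholder, while the script paths and flags are the ones shown in the cells above.

```python
import shlex

TOOLS = "data-juicer/tools/fmt_conversion/multimodal"


def build_workflow(llava_path, dj_path, config_path, processed_path,
                   abs_dir, relative_path, media_dir, final_path):
    """Return the four workflow steps as argv lists (nothing is executed)."""
    return [
        # 1. Format conversion: LLaVA -> Data-Juicer
        ["python", f"{TOOLS}/source_format_to_data_juicer_format/llava_to_dj.py",
         "--llava_ds_path", llava_path, "--target_ds_path", dj_path],
        # 2. Data processing
        ["dj-process", "--config", config_path],
        # 3. Path management
        ["python", f"{TOOLS}/absolute_path_to_relative_path.py",
         "--dj_ds_path", processed_path, "--absolute_dir", abs_dir,
         "--path_key", "images", "--target_dj_ds_path", relative_path,
         "--target_mt_dir", media_dir],
        # 4. Format export: Data-Juicer -> LLaVA
        ["python", f"{TOOLS}/data_juicer_format_to_target_format/dj_to_llava.py",
         "--dj_ds_path", relative_path, "--target_llava_ds_path", final_path],
    ]


steps = build_workflow(
    llava_path="./data/llava_raw/dataset.json",
    dj_path="./data/dj_format/dataset.jsonl",
    config_path="./configs/multimodal.yaml",
    processed_path="./data/processed/dataset.jsonl",
    abs_dir="/path/to/media",  # placeholder: directory holding the media files
    relative_path="data/converted_data/relative_paths.jsonl",
    media_dir="data/converted_data/",
    final_path="./data/converted_data/llava_final_dataset.json",
)
for argv in steps:
    print(shlex.join(argv))
```

To actually run the steps, each list could be passed to `subprocess.run(argv, check=True)` so that a failing step stops the workflow.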

The [Multimodal Tools README](https://datajuicer.github.io/data-juicer/en/main/tools/fmt_conversion/multimodal/README.html) provides details on:
- All supported formats and their specifications
- Tool usage and parameters
- Format conversion considerations


For specific multimodal operators, refer to the [Operators Documentation](https://datajuicer.github.io/data-juicer/en/main/docs/Operators.html).