# Chapter 3: Data Formats

**Data-Juicer User Guide**

- Git Commit: `v1.0.5`
- Commit Date: 2026-01-16
- Repository: https://github.com/datajuicer/data-juicer

---

Real-world datasets come in many formats: some from academic sources, others from companies or platforms. To process them effectively with Data-Juicer, we need a consistent way to handle this diversity.

This chapter introduces:

- **DJ Format (Unified Format)**: A standardized intermediate format that Data-Juicer uses internally
- **Format Conversion Tools**: Scripts to convert between DJ format and popular data formats
- **Practical Examples**: Step-by-step conversion for dialog and multimodal datasets

By the end of this chapter, you'll know how to convert your datasets to DJ format for processing with Data-Juicer.

# Table of Contents

1. [Data-Juicer Unified Format (DJ Format)](#data-juicer-unified-format-dj-format)
   - [Core Contents](#core-contents)
   - [Extra Data Contents](#extra-data-contents)
   - [Meta Info & Stats](#meta-info--stats)
2. [Format Conversion Tools](#format-conversion-tools)
   - [Supported Conversions](#supported-conversions)
3. [Example: Dialog Format Conversion](#example-dialog-format-conversion)
4. [Multimodal Format Conversion](#multimodal-format-conversion)

## Data-Juicer Unified Format (DJ Format)

Data-Juicer uses a unified intermediate format to standardize diverse datasets. A DJ format record groups its fields into three main categories:

### Core Contents

Fields directly related to training, fine-tuning, or pretraining:

```python
{
  "text": "xxx",           # For pretraining and general language modeling
  "query": "xxx",          # For dialog and question-answering
  "response": "xxx",       # For dialog responses and assistant output
  "instruction": "xxx"     # For instruction-tuning datasets (like Alpaca)
}
```

**Note**: Different dataset types use different core fields. A dialog dataset would have `query` and `response`, while a pretraining dataset would have only `text`.
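
As a rough illustration of the note above, the following sketch guesses a record's dataset type from the core fields it carries. The helper name `guess_record_type` and the category labels are hypothetical and not part of Data-Juicer.

```python
# Hypothetical helper (not a Data-Juicer API): infer the dataset type
# of a record from the core fields that are present and non-empty.
CORE_FIELDS = ("text", "query", "response", "instruction")


def guess_record_type(record: dict) -> str:
    present = {key for key in CORE_FIELDS if record.get(key)}
    if {"query", "response"} <= present:
        return "dialog"
    if "instruction" in present:
        return "instruction-tuning"
    if "text" in present:
        return "pretraining"
    return "unknown"


print(guess_record_type({"text": "Some raw document."}))          # pretraining
print(guess_record_type({"query": "Hi", "response": "Hello!"}))   # dialog
```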

### Extra Data Contents

Paths to multimodal data (images, audio, video):

```python
{
  "images": ["path/to/image1.jpg", "path/to/image2.jpg"],
  "audios": ["path/to/audio.wav"],
  "videos": ["path/to/video.mp4"]
}
```

These fields store file paths rather than the actual media, keeping the JSON records lightweight.

### Meta Info & Stats

Metadata and statistics computed during processing:

```python
{
  "meta": {"src": "custom", "version": "0.1"},
  "stats": {"lang": "en", "text_length": 256}
}
```

These are typically added by Data-Juicer operators during processing and help track data lineage.
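
To make the shape of the `stats` field concrete, here is a minimal sketch that attaches a `text_length` value to a record. It only mimics the resulting structure; the function `add_text_length` is a made-up name, and Data-Juicer's own operators may compute and store statistics differently.

```python
# Illustrative only: this is not how Data-Juicer operators are implemented.
# It shows what a record might look like once a text-length stat is attached.
def add_text_length(record: dict) -> dict:
    updated = dict(record)                    # shallow copy; the input stays untouched
    stats = dict(updated.get("stats", {}))    # keep any stats that already exist
    stats["text_length"] = len(updated.get("text", ""))
    updated["stats"] = stats
    return updated


sample = {"text": "Hello, world!", "meta": {"src": "custom"}}
print(add_text_length(sample)["stats"])  # {'text_length': 13}
```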

### Complete Example

Here's a complete DJ format record:

```json
{
  "text": "Machine learning is a subset of artificial intelligence...",
  "images": ["dataset/image_001.jpg"],
  "meta": {"src": "wikipedia", "version": "1.0"},
  "stats": {"lang": "en", "text_length": 150}
}
```
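
The conversion example in this chapter reads DJ data from a `.jsonl` file with one record per line. The sketch below writes a couple of records in that layout and reads them back, using only the standard library; the file name `sample.jsonl` and the second record are invented for illustration.

```python
import json
import os
import tempfile

records = [
    {
        "text": "Machine learning is a subset of artificial intelligence...",
        "images": ["dataset/image_001.jpg"],
        "meta": {"src": "wikipedia", "version": "1.0"},
    },
    {"text": "A second, text-only record.", "meta": {"src": "custom"}},
]

# Write to a throwaway directory so the example leaves the working tree alone.
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")

# One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
with open(path, "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Read the file back line by line, skipping blank lines.
with open(path, "r", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), loaded == records)  # 2 True
```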

**Complete Documentation**: [Format Conversion README](https://datajuicer.github.io/data-juicer/en/main/tools/fmt_conversion/README.html)

## Format Conversion Tools

### How Conversion Works

Data-Juicer provides format conversion tools in `tools/fmt_conversion/` to convert between DJ format and various popular formats. The typical workflow is:

```
Your Format → DJ Format → Processing → Target Format
```

### Supported Conversions

**Dialog/Fine-tuning Formats:**
- **Messages format** (ModelScope-Swift): Standard multi-turn dialog format
- **ShareGPT format** (LLaMA-Factory/Swift): Popular instruction-following format
- **Alpaca format**: Simple instruction-response format
- **Query-Response format**: Simple Q&A format
- ...

Each format has dedicated conversion scripts in the [post_tuning_dialog/](https://github.com/datajuicer/data-juicer/tree/main/tools/fmt_conversion/post_tuning_dialog) directory.

**Multimodal Formats:**
- LLaVA, MMC4, InternVid, Video-ChatGPT, WavCaps, MSR-VTT, Youku
- See [Chapter 8: Multimodal Data Processing](./08_Multimodal_Data_Processing.ipynb) for detailed examples

### Finding Conversion Scripts

Conversion scripts are organized by data type:

```
tools/fmt_conversion/
├── post_tuning_dialog/          # Dialog format conversions
│   ├── source_format_to_data_juicer_format/  # Other → DJ
│   └── data_juicer_format_to_target_format/  # DJ → Other
└── multimodal/                  # Multimodal format conversions
    └── ...
```
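
Given that layout, the path to a dialog conversion script can be assembled programmatically. The following is a small sketch; the directory names come from the tree above, while the helper `dialog_script_path` and its `direction` keys are hypothetical conveniences rather than anything shipped with Data-Juicer.

```python
from pathlib import Path

# Directory names are taken from the layout above; the helper itself is hypothetical.
DIALOG_ROOT = Path("tools/fmt_conversion/post_tuning_dialog")
DIRECTION_DIRS = {
    "to_dj": "source_format_to_data_juicer_format",    # Other → DJ
    "from_dj": "data_juicer_format_to_target_format",  # DJ → Other
}


def dialog_script_path(repo_root: str, direction: str, script_name: str) -> Path:
    if direction not in DIRECTION_DIRS:
        raise ValueError(f"direction must be one of {sorted(DIRECTION_DIRS)}")
    return Path(repo_root) / DIALOG_ROOT / DIRECTION_DIRS[direction] / script_name


print(dialog_script_path("data-juicer", "to_dj", "messages_to_dj.py").as_posix())
print(dialog_script_path("data-juicer", "from_dj", "dj_to_alpaca.py").as_posix())
```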

## Example: Dialog Format Conversion

Let's walk through a practical example of converting between different dialog formats. We'll:

1. Create sample data in **Messages format** (Swift)
2. Convert to **DJ Format** using the conversion tool
3. Convert from DJ Format to **Alpaca Format**

**Reference**: [Post-tuning Dialog Formats Documentation](https://datajuicer.github.io/data-juicer/en/main/tools/fmt_conversion/post_tuning_dialog/README.html)

In [1]:
!git clone --depth 1 https://github.com/datajuicer/data-juicer.git

fatal: destination path 'data-juicer' already exists and is not an empty directory.


In [2]:
!uv pip install "py-data-juicer[dev]"

Using Python 3.11.13 environment at: /home/cmgzn/data-juicer.worktrees/data-juicer-nk/.venv
Audited 1 package in 107ms


In [3]:
import json
import os

# Create output directory
os.makedirs('./data/formats', exist_ok=True)

# Step 1: Create sample data in Messages format (Swift)
# Messages format is a list of objects with a 'messages' field
# Each message has 'role' (system/user/assistant) and 'content'

messages_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is machine learning?"},
            {"role": "assistant", "content": "Machine learning is a subset of AI that enables systems to learn from data."}
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "Explain neural networks."},
            {"role": "assistant", "content": "Neural networks are computing systems inspired by biological neurons."}
        ]
    }
]

with open('./data/formats/messages.json', 'w') as f:
    json.dump(messages_data, f, indent=2)

print(f"✓ Created Messages format data with {len(messages_data)} samples")
print("\nSample (Messages Format):")
print(json.dumps(messages_data[0], indent=2))

✓ Created Messages format data with 2 samples

Sample (Messages Format):
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "What is machine learning?"
    },
    {
      "role": "assistant",
      "content": "Machine learning is a subset of AI that enables systems to learn from data."
    }
  ]
}


In [4]:
# Step 2: Convert Messages format to DJ format
# This uses the provided conversion script

print("Converting Messages → DJ Format...")
!python data-juicer/tools/fmt_conversion/post_tuning_dialog/source_format_to_data_juicer_format/messages_to_dj.py \
    --src_ds_path ./data/formats/messages.json \
    --tgt_ds_path ./data/formats/dj_format.jsonl

print("✓ Conversion complete")

Converting Messages → DJ Format...
2026-01-22 10:19:44.324 | INFO     | llama_factory_sharegpt_to_dj:main:185 - Loading original dataset.
2026-01-22 10:19:44.324 | INFO     | llama_factory_sharegpt_to_dj:main:187 - Load [2] samples.
100%|██████████████████████████████████████████| 2/2 [00:00<00:00, 37449.14it/s]
2026-01-22 10:19:44.327 | INFO     | llama_factory_sharegpt_to_dj:main:201 - Store the target dataset into [./data/formats/dj_format.jsonl].
✓ Conversion complete


In [5]:
# Inspect the converted DJ format
with open('./data/formats/dj_format.jsonl', 'r') as f:
    dj_data = [json.loads(line) for line in f]

print(f"✓ Converted to DJ format: {len(dj_data)} samples")
print("\nSample (DJ Format):")
print(json.dumps(dj_data[0], indent=2))
print("\nNotice:")
print("  - Multi-turn dialog converted to single 'query' and 'response' fields")
print("  - Format is now standardized for processing")

✓ Converted to DJ format: 2 samples

Sample (DJ Format):
{
  "system": "You are a helpful assistant.",
  "instruction": "",
  "query": "What is machine learning?",
  "response": "Machine learning is a subset of AI that enables systems to learn from data.",
  "history": []
}

Notice:
  - Multi-turn dialog converted to single 'query' and 'response' fields
  - Format is now standardized for processing
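
To make the mapping visible, here is a pure-Python sketch that reproduces the record shown above from the Messages sample. It is not the code in `messages_to_dj.py`; in particular, placing earlier user/assistant turns into `history` as `[query, response]` pairs is an assumption, since the single-turn samples in this chapter only ever produce an empty `history`.

```python
# Hypothetical re-implementation for illustration; the real script may differ.
def messages_to_dj_record(sample: dict) -> dict:
    system = ""
    pairs = []            # completed [query, response] turns, in order
    pending_query = None  # last user message still waiting for a reply

    for msg in sample["messages"]:
        role, content = msg["role"], msg["content"]
        if role == "system":
            system = content
        elif role == "user":
            pending_query = content
        elif role == "assistant" and pending_query is not None:
            pairs.append([pending_query, content])
            pending_query = None

    if not pairs:
        raise ValueError("sample contains no complete user/assistant turn")

    query, response = pairs[-1]
    return {
        "system": system,
        "instruction": "",
        "query": query,
        "response": response,
        "history": pairs[:-1],  # assumption: earlier turns are kept here
    }


example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"},
        {"role": "assistant", "content": "Machine learning is a subset of AI that enables systems to learn from data."},
    ]
}
print(messages_to_dj_record(example))
```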


In [6]:
# Step 3: Convert DJ format to Alpaca format
# This shows we can convert to different target formats from DJ format

print("Converting DJ Format → Alpaca Format...")
!python data-juicer/tools/fmt_conversion/post_tuning_dialog/data_juicer_format_to_target_format/dj_to_alpaca.py \
    --src_ds_path ./data/formats/dj_format.jsonl \
    --tgt_ds_path ./data/formats/alpaca_format.json

print("✓ Conversion complete")

Converting DJ Format → Alpaca Format...
2it [00:00, 19737.90it/s]
2026-01-22 10:19:44.614 | INFO     | __main__:main:97 - Store the target dataset into [./data/formats/alpaca_format.json].
✓ Conversion complete


In [7]:
# Inspect the Alpaca format output
with open('./data/formats/alpaca_format.json', 'r') as f:
    # Alpaca format might be newline-delimited JSON or regular JSON
    content = f.read().strip()
    if content.startswith('['):
        alpaca_data = json.loads(content)
    else:
        alpaca_data = [json.loads(line) for line in content.split('\n') if line]

print(f"✓ Converted to Alpaca format: {len(alpaca_data)} samples")
print("\nSample (Alpaca Format):")
print(json.dumps(alpaca_data[0], indent=2))
print("\nConversion Summary:")
print("  Messages (multi-turn) → DJ → Alpaca (instruction-response)")

✓ Converted to Alpaca format: 2 samples

Sample (Alpaca Format):
{
  "system": "You are a helpful assistant.",
  "input": "What is machine learning?",
  "output": "Machine learning is a subset of AI that enables systems to learn from data."
}

Conversion Summary:
  Messages (multi-turn) → DJ → Alpaca (instruction-response)
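
The field mapping visible in the output above can be sketched in a few lines. This is an illustrative guess rather than the logic of `dj_to_alpaca.py`: it assumes that `query` becomes `input`, `response` becomes `output`, and empty fields are dropped (which would explain why `instruction` is absent from the sample). It ignores `history` entirely.

```python
# Hypothetical mapping for illustration; the real script may handle more cases.
def dj_to_alpaca_record(record: dict) -> dict:
    mapped = {
        "system": record.get("system", ""),
        "instruction": record.get("instruction", ""),
        "input": record.get("query", ""),
        "output": record.get("response", ""),
    }
    # Assumption: fields with empty values are omitted from the result.
    return {key: value for key, value in mapped.items() if value}


dj_record = {
    "system": "You are a helpful assistant.",
    "instruction": "",
    "query": "What is machine learning?",
    "response": "Machine learning is a subset of AI that enables systems to learn from data.",
    "history": [],
}
print(dj_to_alpaca_record(dj_record))
```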


In [8]:
# Cleanup: Remove cloned repository and temporary data
!rm -rf data-juicer
!rm -rf ./data/formats

print("✓ Cleanup complete")

✓ Cleanup complete


## Multimodal Format Conversion

### Supported Multimodal Formats

For datasets containing images, videos, or audio, Data-Juicer supports conversion between DJ format and the following formats:

| Format | Type | Use Case |
|--------|------|----------|
| **LLaVA** | Image-Text | Vision-language models |
| **MMC4** | Multimodal Documents | Document understanding |
| **InternVid** | Video Metadata | Video classification |
| **Video-ChatGPT** | Video-Dialog | Video question-answering |
| **WavCaps** | Audio Captions | Audio understanding |
| **Youku** | Video Platform | Chinese video understanding |

The multimodal conversion scripts live in `tools/fmt_conversion/multimodal/` and follow the same workflow as the dialog conversion scripts.

### Resources

- **Multimodal Conversion Scripts**: [GitHub](https://github.com/datajuicer/data-juicer/tree/main/tools/fmt_conversion/multimodal)
- **Multimodal Processing Guide**: See [Chapter 8: Multimodal Data Processing](./08_Multimodal_Data_Processing.ipynb)

**Important**: Multimodal conversion scripts handle media file references. Ensure all referenced files exist or are accessible during processing.
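
A quick pre-flight check can catch missing media before processing starts. The sketch below scans the `images`, `audios`, and `videos` fields of a DJ record and reports paths that do not exist on disk. The helper `missing_media` and its `base_dir` parameter are hypothetical, and the check assumes the paths are local files relative to `base_dir`.

```python
import os
import tempfile

MEDIA_KEYS = ("images", "audios", "videos")


def missing_media(record: dict, base_dir: str = ".") -> list:
    """Return the media paths in `record` that do not exist under `base_dir`."""
    missing = []
    for key in MEDIA_KEYS:
        for rel_path in record.get(key, []):
            if not os.path.exists(os.path.join(base_dir, rel_path)):
                missing.append(rel_path)
    return missing


# Demo: create one placeholder file and reference a second file that does not exist.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "image_001.jpg"), "wb") as f:
    f.write(b"")  # empty placeholder, not a real image

demo_record = {"text": "A caption.", "images": ["image_001.jpg"], "audios": ["missing.wav"]}
print(missing_media(demo_record, base_dir=demo_dir))  # ['missing.wav']
```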