# Chapter 1: Getting Started

**Data-Juicer User Guide**

- Git Commit: `v1.4.6`
- Commit Date: 2026-02-02
- Repository: https://github.com/datajuicer/data-juicer

---

# Table of Contents

1. [Install Data-Juicer](#install-data-juicer)
2. [Create Sample JSONL Data](#create-sample-jsonl-data)
3. [Write Basic YAML Config](#write-basic-yaml-config)
4. [Execute Pipeline](#execute-pipeline)
   - [Check Command-line Output](#check-command-line-output)
5. [Learning Path](#learning-path)
   - [Core Concepts (Recommended Order)](#core-concepts-recommended-order)
   - [Advanced Topics (Appendices)](#advanced-topics-appendices)

## Install Data-Juicer

Data-Juicer can be easily installed via pip. We recommend using `uv` for faster installation, but standard `pip` works too.

For detailed instructions, see the [installation tutorial](https://datajuicer.github.io/data-juicer/en/main/docs/tutorial/Installation.html).

In [1]:
# Install data-juicer package if you are NOT in the Playground
# If running in Google Colab, use 'pip install' instead of 'uv pip install'
!uv pip install py-data-juicer

## Create Sample JSONL Data

Data-Juicer works with JSONL (JSON Lines) format, where each line is a valid JSON object. This format is efficient for streaming large datasets and is widely used in the ML community.

In [2]:
import json
import os

# Create data directory
os.makedirs('./data', exist_ok=True)

# Sample data
samples = [
    {"text": "Today is Sunday and it's a happy day!", "meta": {"src": "web", "date": "2024-01-01"}},
    {"text": "Do you need a cup of coffee?", "meta": {"src": "social", "author": "user123"}},
    {"text": "Machine learning is transforming the world.", "meta": {"src": "article"}},
    {"text": "Short.", "meta": {"src": "web"}},
    {"text": "This is a longer text with more content to demonstrate filtering capabilities.", "meta": {"src": "blog"}}
]

# Write JSONL file
with open('./data/sample.jsonl', 'w') as f:
    for sample in samples:
        f.write(json.dumps(sample) + '\n')

print(f"Created sample dataset with {len(samples)} samples")

Created sample dataset with 5 samples
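The streaming property mentioned above can be illustrated with a small reader that yields one record at a time. This is a minimal sketch using only the standard library; `iter_jsonl` and the temporary `demo.jsonl` file are hypothetical names introduced for illustration and are not part of Data-Juicer.

```python
import json
import os
import tempfile
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line, without loading the whole file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines such as a trailing empty line
                yield json.loads(line)


# Write a tiny two-record file so the example is self-contained.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"text": "Short.", "meta": {"src": "web"}}) + "\n")
    f.write(json.dumps({"text": "Do you need a cup of coffee?", "meta": {"src": "social"}}) + "\n")

# Records are consumed one at a time; only the current line is parsed at any moment.
streamed_texts = [record["text"] for record in iter_jsonl(demo_path)]
print(streamed_texts)
```

Because each line is an independent JSON object, a reader like this can stop early or skip records without parsing the rest of the file.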


## Write Basic YAML Config

Data-Juicer uses YAML configuration files ("recipes") to define processing pipelines. A recipe specifies:
- **Input/Output paths**: Where to read and write data
- **Processing operators**: What transformations to apply
- **Execution settings**: Parallelism, caching, etc.

Let's create a simple recipe that filters text by length and language, then removes duplicates.

In [3]:
import yaml

# Create config as Python dictionary
config_dict = {
    'project_name': 'getting_started',
    
    # Input/Output paths
    'dataset_path': './data/sample.jsonl',
    'export_path': './outputs/processed.jsonl',
    
    # Number of parallel processes
    'np': 1,
    
    # Processing pipeline
    'process': [
        # 1. Filter by text length
        {
            'text_length_filter': {
                'min_len': 10,
                'max_len': 200
            }
        },
        # 2. Filter by language (English)
        # Learn more: https://datajuicer.github.io/data-juicer/en/main/docs/operators/filter/language_id_score_filter.html
        {
            'language_id_score_filter': {
                'lang': 'en',
                'min_score': 0.8
            }
        },
        # 3. Remove duplicates
        {
            'document_deduplicator': {
                'lowercase': True
            }
        }
    ]
}

# Save config dict to YAML file
os.makedirs('./configs', exist_ok=True)
with open('./configs/basic.yaml', 'w') as f:
    yaml.dump(config_dict, f, default_flow_style=False, allow_unicode=True, sort_keys=False)

print("Config saved to ./configs/basic.yaml")

Config saved to ./configs/basic.yaml
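Before running a recipe, it can help to sanity-check the shape of its `process` list. The sketch below assumes the convention used in the recipe above, where every entry is a dictionary with exactly one key (the operator name) mapped to a dictionary of parameters. `check_process_list` is a hypothetical helper written for this guide, not a Data-Juicer function, and it does not verify that the operator names or parameters actually exist.

```python
def check_process_list(process: list) -> list:
    """Return operator names in pipeline order, raising ValueError on malformed entries."""
    names = []
    for index, entry in enumerate(process):
        if not isinstance(entry, dict) or len(entry) != 1:
            raise ValueError(f"process[{index}] must be a dict with exactly one operator name")
        name, params = next(iter(entry.items()))
        # Assumption: an operator without arguments may carry None instead of a dict.
        if params is not None and not isinstance(params, dict):
            raise ValueError(f"parameters of '{name}' must be a dict or None")
        names.append(name)
    return names


# Same structure as config_dict['process'] in the cell above.
demo_process = [
    {"text_length_filter": {"min_len": 10, "max_len": 200}},
    {"language_id_score_filter": {"lang": "en", "min_score": 0.8}},
    {"document_deduplicator": {"lowercase": True}},
]
operator_names = check_process_list(demo_process)
print(operator_names)
```

The order of the returned names is the order in which the operators run, so a check like this also doubles as a quick summary of the pipeline.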


## Execute Pipeline

Data-Juicer provides two ways to run pipelines:
1. **Command-line**: Using the `dj-process` command
2. **Programmatic**: Using Python API for more control

Both methods apply the same operators and keep the same samples. Choose based on your workflow preference.

### Option 1: Command-line Execution

In [4]:
!dj-process --config ./configs/basic.yaml

2026-02-12 09:14:55.729 | INFO     | data_juicer.utils.lazy_loader:_install_package:390 - Installing ray using uv...

#### Check Command-line Output

Let's verify the results from the command-line execution:

In [5]:
import json

# Read processed data from YAML config execution
with open('./outputs/processed.jsonl', 'r') as f:
    processed = [json.loads(line) for line in f]

print(f"Original samples: 5")
print(f"Processed samples: {len(processed)}")
print("\nProcessed data:")
for i, sample in enumerate(processed, 1):
    print(f"\n{i}. {sample['text']}")
    print(f"   Metadata: {sample.get('meta', {})}")

Original samples: 5
Processed samples: 4

Processed data:

1. Today is Sunday and it's a happy day!
   Metadata: {'src': 'web', 'date': 1704067200000, 'author': None}

2. Do you need a cup of coffee?
   Metadata: {'src': 'social', 'date': None, 'author': 'user123'}

3. Machine learning is transforming the world.
   Metadata: {'src': 'article', 'date': None, 'author': None}

4. This is a longer text with more content to demonstrate filtering capabilities.
   Metadata: {'src': 'blog', 'date': None, 'author': None}


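With this sample data, only one record was removed:
- `text_length_filter` dropped "Short." (6 characters, below `min_len: 10`).
- `language_id_score_filter` removes texts whose English confidence score is below 0.8; the four remaining texts all passed.
- `document_deduplicator` removes duplicate entries; the sample data contains none.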
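To make the length and duplicate checks concrete, here is a plain-Python approximation applied to the five sample texts. It is a simplified sketch and not Data-Juicer's implementation: `keep_by_length` and `drop_exact_duplicates` are hypothetical helpers, the SHA-256 digest is an arbitrary choice for illustration, and the language filter is omitted because it depends on a fastText model.

```python
import hashlib

texts = [
    "Today is Sunday and it's a happy day!",
    "Do you need a cup of coffee?",
    "Machine learning is transforming the world.",
    "Short.",
    "This is a longer text with more content to demonstrate filtering capabilities.",
]


def keep_by_length(text: str, min_len: int = 10, max_len: int = 200) -> bool:
    """Keep a text whose character count lies within [min_len, max_len]."""
    return min_len <= len(text) <= max_len


def drop_exact_duplicates(items: list, lowercase: bool = True) -> list:
    """Keep the first occurrence of each text, comparing by a digest of the text."""
    seen = set()
    unique = []
    for text in items:
        key = text.lower() if lowercase else text
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


after_length = [t for t in texts if keep_by_length(t)]
after_dedup = drop_exact_duplicates(after_length)
print(len(texts), len(after_length), len(after_dedup))  # prints: 5 4 4
```

The counts agree with the run above: one text falls below the minimum length, and no two of the remaining texts are identical after lowercasing.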

### Option 2: Programmatic Execution

Alternatively, you can run the pipeline entirely in Python **without any YAML file**. This approach directly uses Data-Juicer's low-level APIs for maximum flexibility:

In [6]:
from data_juicer.ops import load_ops
from data_juicer.core.exporter import Exporter
from data_juicer.core.data import NestedDataset

# Step 1: Load dataset directly from samples list
ds = NestedDataset.from_list(samples)
print(f"Loaded {len(ds)} samples")

# or from a JSONL file
# from jsonargparse import Namespace
# from data_juicer.core.data.dataset_builder import DatasetBuilder

# cfg = Namespace({"dataset_path": config_dict["dataset_path"]})

# builder = DatasetBuilder(cfg)
# ds = builder.load_dataset()

# Step 2: Define operators as Python list
process_list = config_dict["process"]

# Step 3: Load operators from the process list
ops = load_ops(process_list)
print(f"Loaded {len(ops)} operators: {[op._name for op in ops]}")

# Step 4: Process dataset through each operator
for op in ops:
    ds = op.run(ds)
    print(f"After {op._name}: {len(ds)} samples remaining")

# Step 5: Export results
exporter = Exporter("./outputs/processed_programmatic.jsonl")
exporter.export(ds)

# Display results
print(f"Original samples: {len(samples)}")
print(f"Processed samples: {len(ds)}")  # count the programmatic result, not the CLI output
for i, sample in enumerate(ds):
    print(f"{i+1}. {sample['text']}")
    print(f"   Metadata: {sample.get('meta', {})}")

2026-02-12 09:19:22.643 | INFO     | data_juicer.utils.model_utils:prepare_fasttext_model:502 - Loading fasttext language identification model...
2026-02-12 09:19:22.787 | INFO     | data_juicer.utils.process_utils:calculate_np:161 - Set the auto `num_proc` to 4 of Op[text_length_filter] based on the required memory: NoneGB and required cpu: 1.


Loaded 5 samples
Loaded 3 operators: ['text_length_filter', 'language_id_score_filter', 'document_deduplicator']


Adding new column for stats (num_proc=4): 100%|██████████| 5/5 [00:00<00:00, 20.30 examples/s]
2026-02-12 09:19:23.051 | INFO     | data_juicer.utils.process_utils:calculate_np:161 - Set the auto `num_proc` to 4 of Op[text_length_filter] based on the required memory: NoneGB and required cpu: 1.
text_length_filter_compute_stats (num_proc=4): 100%|██████████| 5/5 [00:00<00:00, 17.41 examples/s]
2026-02-12 09:19:23.374 | INFO     | data_juicer.utils.process_utils:calculate_np:161 - Set the auto `num_proc` to 4 of Op[text_length_filter] based on the required memory: NoneGB and required cpu: 1.
text_length_filter_process (num_proc=4): 100%|██████████| 5/5 [00:00<00:00, 16.07 examples/s]
2026-02-12 09:19:23.737 | INFO     | data_juicer.utils.process_utils:calculate_np:161 - Set the auto `num_proc` to 4 of Op[language_id_score_filter] based on

After text_length_filter: 4 samples remaining


2026-02-12 09:19:23.870 | INFO     | data_juicer.utils.model_utils:prepare_fasttext_model:502 - Loading fasttext language identification model...
2026-02-12 09:19:23.896 | INFO     | data_juicer.utils.model_utils:prepare_fasttext_model:502 - Loading fasttext language identification model...
2026-02-12 09:19:23.923 | INFO     | data_juicer.utils.model_utils:prepare_fasttext_model:502 - Loading fasttext language identification model...
2026-02-12 09:19:23.949 | INFO     | data_juicer.utils.model_utils:prepare_fasttext_model:502 - Loading fasttext language identification model...
language_id_score_filter_compute_stats (num_proc=4): 100%|██████████| 4/4 [00:00<00:00,  6.70 examples/s]

After language_id_score_filter: 4 samples remaining


document_deduplicator_compute_hash (num_proc=4): 100%|██████████| 4/4 [00:00<00:00,  5.74 examples/s]
Filter: 100%|██████████| 4/4 [00:00<00:00, 1116.84 examples/s]
2026-02-12 09:19:25.366 | INFO     | data_juicer.core.exporter:_export_impl:154 - Exporting computed stats into a single file...


After document_deduplicator: 4 samples remaining


Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 752.75ba/s]
2026-02-12 09:19:25.375 | INFO     | data_juicer.core.exporter:_export_impl:190 - Export dataset into a single file...
Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 1163.79ba/s]

Original samples: 5
Processed samples: 4
1. Today is Sunday and it's a happy day!
   Metadata: {'author': None, 'date': '2024-01-01', 'src': 'web'}
2. Do you need a cup of coffee?
   Metadata: {'author': 'user123', 'date': None, 'src': 'social'}
3. Machine learning is transforming the world.
   Metadata: {'author': None, 'date': None, 'src': 'article'}
4. This is a longer text with more content to demonstrate filtering capabilities.
   Metadata: {'author': None, 'date': None, 'src': 'blog'}





In [7]:
!cat ./outputs/processed_programmatic.jsonl

{"text":"Today is Sunday and it's a happy day!","meta":{"author":null,"date":"2024-01-01","src":"web"}}
{"text":"Do you need a cup of coffee?","meta":{"author":"user123","date":null,"src":"social"}}
{"text":"Machine learning is transforming the world.","meta":{"author":null,"date":null,"src":"article"}}
{"text":"This is a longer text with more content to demonstrate filtering capabilities.","meta":{"author":null,"date":null,"src":"blog"}}


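Both execution methods keep the same four samples. In the runs shown above, the one visible difference is the `date` field of the first sample: the command-line output stores it as the integer `1704067200000`, while the programmatic output keeps the string `"2024-01-01"`. Choose between the two based on your workflow:
- **Option 1 (Command-line)**: Simple and quick for one-off processing with YAML configs
- **Option 2 (Programmatic)**: Pure Python API without any YAML files, ideal for:
  - Integration into larger Python workflows
  - Fine-grained control over each processing step
  - Debugging and step-by-step inspection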
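If you want to compare the two output files record by record, the integer `date` written by the command-line run has to be reconciled with the string kept by the programmatic run. The sketch below assumes the integer is a Unix timestamp in milliseconds (UTC), which is arithmetically consistent with `2024-01-01`; this chapter does not state why or where the conversion happens. `normalize_date` is a hypothetical helper, not a Data-Juicer function.

```python
from datetime import datetime, timezone


def normalize_date(value):
    """Return an ISO date string for epoch-millisecond numbers; pass other values through."""
    if isinstance(value, bool):  # bool is a subclass of int, so exclude it explicitly
        return value
    if isinstance(value, (int, float)):
        # Assumption: the number counts milliseconds since 1970-01-01 00:00 UTC.
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc).date().isoformat()
    return value


# Metadata of the first sample, copied from the two outputs shown earlier.
cli_meta = {"src": "web", "date": 1704067200000, "author": None}
api_meta = {"author": None, "date": "2024-01-01", "src": "web"}

cli_normalized = {**cli_meta, "date": normalize_date(cli_meta["date"])}
print(cli_normalized["date"])
print(cli_normalized == api_meta)
```

Strings and `None` values pass through unchanged, so the same helper can be applied to every record regardless of which run produced it.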

## Learning Path

Here are the remaining chapters available in this tutorial series:

### Core Concepts (Recommended Order)

1. **[Chapter 2: Building Recipes](./02_Building_Recipes.ipynb)**
   - Understand recipe structure (global parameters, process pipeline, operator parameters)
   - Create basic and custom recipes
   - Override parameters via CLI
   - Explore pre-defined recipes from the Recipe Gallery (data-juicer-hub)

2. **[Chapter 3: Data Formats and Loading](./03_Data_Formats_and_Loading.ipynb)**
   - Learn Data-Juicer's unified format (DJ Format)
   - Convert between dialog formats (Messages, ShareGPT, Alpaca, Query-Response)
   - Handle multimodal format conversion (LLaVA, MMC4, InternVid, etc.)

3. **[Chapter 4: DJ Dataset API](./04_DJ_Dataset_API.ipynb)**
   - Use NestedDataset (HuggingFace-compatible) and RayDataset (distributed)
   - Access nested fields with dot notation (e.g., `ds['meta.source']`)
   - Apply operators via `.process()` method

4. **[Chapter 5: Operators Usage](./05_Operators_Usage.ipynb)**
   - Use operators programmatically via Python API
   - Chain operators sequentially or batch process
   - Inspect operator statistics

5. **[Chapter 6: Analysis & Visualization](./06_Analysis_and_Visualization.ipynb)**
   - Run data analysis with `dj-analyze`
   - Interpret statistics and visualizations
   - Compare datasets before and after processing

6. **[Chapter 7: Distributed Processing with Ray](./07_Distributed_Processing_with_Ray.ipynb)**
   - Set up Ray clusters (local and multi-node)
   - Use demo configs from `demos/process_on_ray/`
   - Monitor resources via Ray Dashboard
   - Run distributed deduplication

### Advanced Topics (Appendices)

- **[Chapter 8: Pre-processing](./08_Preprocessing.ipynb)**
  - Split datasets by language
  - Convert raw formats (arXiv, Stack Exchange) to JSONL
  - Serialize complex metadata fields

- **[Chapter 9: Multimodal Data Processing](./09_Multimodal_Data_Processing.ipynb)**
  - Understand multimodal format with special tokens
  - Process image-text, video-text, audio-text data
  - Convert between multimodal formats (LLaVA, Video-ChatGPT, WavCaps, etc.)
  - Apply multimodal operators (image/video/audio filters)

- **[Chapter 10: Advanced Dataset Configuration](./10_Advanced_Dataset_Configuration.ipynb)**
  - Mix multiple datasets with custom weights
  - Sample subsets from large datasets

## Additional Resources

- **Documentation**: https://datajuicer.github.io/data-juicer
- **GitHub**: https://github.com/datajuicer/data-juicer
- **Recipe Gallery**: https://datajuicer.github.io/data-juicer-hub
- **Operator Reference**: https://datajuicer.github.io/data-juicer/en/main/docs/Operators.html