# Getting Started with Data-Juicer

Welcome to Data-Juicer! This notebook guides you through the basics of using Data-Juicer for data processing. We'll cover installation and core concepts, then build and run your first data processing pipeline.

## What is Data-Juicer?

Data-Juicer is a one-stop system to process text and multimodal data for and with foundation models (typically LLMs). It provides:

- A rich library of operators for data cleaning, filtering, and transformation
- Support for text, image, audio, and video data
- Both single-machine and distributed processing capabilities
- Easy extensibility with custom operators

## In this Notebook

1. Installation and setup
2. Core concepts: operators, recipes, and executors
3. Your first data processing pipeline
4. Basic configuration
5. Running a simple processing job

## Installation

First, let's install Data-Juicer. If you're in the Data-Juicer playground, this step is already done for you. Otherwise, uncomment and run the appropriate command:

In [7]:
# Installation Guide: https://modelscope.github.io/data-juicer/en/main/docs/tutorial/Installation.html
# Install from PyPI
# !uv pip install py-data-juicer

# Or install from source
# !pip install git+https://github.com/modelscope/data-juicer

## Core Concepts

Data-Juicer is built around several core concepts:

### Operators
Operators are the basic building blocks that perform specific data processing tasks. There are several types:
- **Mapper**: Edits and transforms samples (e.g., text cleaning)
- **Filter**: Filters out low-quality samples (e.g., language filtering)
- **Deduplicator**: Detects and removes duplicate samples
- **Selector**: Selects top samples based on ranking
- **Grouper**: Groups samples into batched samples
- **Aggregator**: Aggregates batched samples, e.g., into a summary or conclusion
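
To make these roles concrete, here is a minimal plain-Python sketch of what a mapper, a filter, a deduplicator, and a selector each do to a list of samples. The function names are hypothetical and this is not Data-Juicer's operator API; it only illustrates the behaviour described in the list above.

```python
# Plain-Python illustration of operator roles (hypothetical names, not the Data-Juicer API).

def strip_mapper(sample):
    """Mapper: return an edited copy of a single sample."""
    return {**sample, "text": sample["text"].strip()}

def min_length_filter(sample, min_len=10):
    """Filter: return True to keep the sample, False to drop it."""
    return len(sample["text"]) >= min_len

def exact_deduplicator(samples):
    """Deduplicator: keep the first occurrence of each distinct text."""
    seen, unique = set(), []
    for sample in samples:
        if sample["text"] not in seen:
            seen.add(sample["text"])
            unique.append(sample)
    return unique

def longest_selector(samples, top_k=1):
    """Selector: rank samples (here by text length) and keep the top ones."""
    return sorted(samples, key=lambda s: len(s["text"]), reverse=True)[:top_k]

samples = [
    {"text": "  A reasonably long sentence.  "},
    {"text": "A reasonably long sentence."},
    {"text": "short"},
    {"text": "Another, even longer example sentence."},
]

mapped = [strip_mapper(s) for s in samples]              # 4 samples, whitespace trimmed
filtered = [s for s in mapped if min_length_filter(s)]   # "short" is dropped
deduped = exact_deduplicator(filtered)                   # the repeated sentence is dropped
selected = longest_selector(deduped, top_k=1)            # only the longest text remains

print(len(mapped), len(filtered), len(deduped), len(selected))
```

Note that a mapper and a filter work on one sample at a time, while a deduplicator and a selector need to see the whole dataset.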

### Recipes
Recipes are YAML configuration files that define a sequence of operators to apply to your data.

### Executors
Executors run the processing pipeline defined in a recipe. Data-Juicer provides both local and distributed executors.
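
Conceptually, an executor walks through the recipe's steps in order and hands the output of one step to the next. The sketch below shows that loop in plain Python under simplifying assumptions: `run_pipeline` and the `(kind, function)` step format are hypothetical, and Data-Juicer's real executors do considerably more (loading, exporting, parallelism).

```python
# Hypothetical sketch of an executor loop; not Data-Juicer's implementation.

def run_pipeline(samples, steps):
    """Apply each (kind, function) step in order and return the surviving samples."""
    for kind, fn in steps:
        if kind == "mapper":
            samples = [fn(s) for s in samples]
        elif kind == "filter":
            samples = [s for s in samples if fn(s)]
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return samples

steps = [
    # Mapper: collapse runs of whitespace into single spaces.
    ("mapper", lambda s: {**s, "text": " ".join(s["text"].split())}),
    # Filter: keep texts with at least 10 characters.
    ("filter", lambda s: len(s["text"]) >= 10),
]

result = run_pipeline(
    [{"text": "too short"}, {"text": "long   enough   to   keep"}],
    steps,
)
print(result)
```

Because steps run in sequence, their order matters: a filter placed before a mapper sees the unmodified text.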

## Your First Data Processing Pipeline

Let's create a simple data processing pipeline. First, we'll create a sample dataset to work with.

In [8]:
import os
import json

# Create a sample dataset
sample_data = [
    {"text": "Hello world! This is a sample text with good quality."},
    {"text": "hello world! this is a sample text with good quality."},
    {"text": "a text"},
    {"text": "Bonjour le monde! Ceci est un texte d'exemple de bonne qualité."},
    {"text": "Hello world! \tHello world! This is a sample text with good quality. Hello world! This is a sample text with good quality."}
]

# Create data directory if it doesn't exist
os.makedirs('data', exist_ok=True)

# Write sample data to a JSONL file
with open('data/sample_dataset.jsonl', 'w') as f:
    for item in sample_data:
        f.write(json.dumps(item) + '\n')

print("Sample dataset created with", len(sample_data), "samples")
print("First sample:", sample_data[0])

Sample dataset created with 5 samples
First sample: {'text': 'Hello world! This is a sample text with good quality.'}
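
Each line of a JSONL file is one standalone JSON object. A quick way to sanity-check such a file is to read it back and compare it with the data you wrote. The sketch below does this against a temporary file (the file name `check.jsonl` and the two records are made up for illustration), so it does not touch `data/sample_dataset.jsonl`.

```python
import json
import os
import tempfile

records = [{"text": "first sample"}, {"text": "texte d'exemple de bonne qualité"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "check.jsonl")

    # Write one JSON object per line; an explicit encoding avoids platform defaults.
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    # Read the file back, skipping any blank lines.
    with open(path, "r", encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)
```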


Now let's create a simple recipe that:
1. Normalizes whitespace
2. Filters by language (English)
3. Filters by text length
4. Removes duplicates

In [9]:
recipe_content = """
project_name: 'getting_started_tutorial'
dataset_path: './data/sample_dataset.jsonl'
export_path: './outputs/processed_dataset.jsonl'
np: 1  # Number of processes

# Processing pipeline
process:
  - whitespace_normalization_mapper: {}
  - language_id_score_filter:
      lang: 'en'
      min_score: 0.8
  - text_length_filter:
      min_len: 10
      max_len: 200
  - document_deduplicator:
      lowercase: true
"""

# Write recipe to file
os.makedirs('configs', exist_ok=True)
with open('configs/basic_recipe.yaml', 'w') as f:
    f.write(recipe_content)

print("Recipe created at configs/basic_recipe.yaml")
print("Recipe content:")
print(recipe_content)

Recipe created at configs/basic_recipe.yaml
Recipe content:

project_name: 'getting_started_tutorial'
dataset_path: './data/sample_dataset.jsonl'
export_path: './outputs/processed_dataset.jsonl'
np: 1  # Number of processes

# Processing pipeline
process:
  - whitespace_normalization_mapper: {}
  - language_id_score_filter:
      lang: 'en'
      min_score: 0.8
  - text_length_filter:
      min_len: 10
      max_len: 200
  - document_deduplicator:
      lowercase: true
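
Structurally, the recipe is a mapping whose `process` entry is an ordered list of one-entry mappings, each pairing an operator name with its parameters. The sketch below writes that structure out by hand as a Python dict mirroring the recipe above and lists the operators in execution order. It assumes this is the shape a YAML parser would produce, and it does not call Data-Juicer.

```python
# Hand-written Python equivalent of the recipe above (assumed parsed shape).
recipe = {
    "project_name": "getting_started_tutorial",
    "dataset_path": "./data/sample_dataset.jsonl",
    "export_path": "./outputs/processed_dataset.jsonl",
    "np": 1,
    "process": [
        {"whitespace_normalization_mapper": {}},
        {"language_id_score_filter": {"lang": "en", "min_score": 0.8}},
        {"text_length_filter": {"min_len": 10, "max_len": 200}},
        {"document_deduplicator": {"lowercase": True}},
    ],
}

for position, step in enumerate(recipe["process"], start=1):
    # Each step is a one-entry mapping: {operator_name: parameters}.
    (op_name, params), = step.items()
    print(f"{position}. {op_name} {params}")

op_names = [next(iter(step)) for step in recipe["process"]]
```

An operator with no parameters, such as `whitespace_normalization_mapper`, still needs an empty mapping (`{}`) so that it runs with its defaults.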



## Running the Processing Pipeline

Now let's run our data processing pipeline using the recipe we created.

In [10]:
# Run the pipeline with the command-line tool
!dj-process --config configs/basic_recipe.yaml
# Alternatively, if you installed from source, run this from the root of the data-juicer repository:
# !python tools/process_data.py --config configs/basic_recipe.yaml

2025-09-28 18:20:44.499 | INFO     | data_juicer.ops:timing_context:12 - Importing operator modules took 3.24 seconds
2025-09-28 18:20:45.862 | INFO     | data_juicer.config.config:660 - dataset_path config is set and a valid local path
2025-09-28 18:20:45.893 | INFO     | data_juicer.config.config:970 - Back up the input config file [/home/cmgzn/data-juicer/notebook/juicybook/configs/basic_recipe.yaml] into the work_dir [/home/cmgzn/data-juicer/notebook/juicybook/outputs]
2025-09-28 18:20:45.898 | INFO     | data_juicer.config.config:991 - Configuration table: 
╒══════════════════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│ key                      │ values                                                                              
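
The `!` prefix only works inside a notebook. In a regular Python script, one option is to launch the same `dj-process --config ...` command with `subprocess`. The sketch below is a minimal version of that idea; `build_command` and `run_recipe` are hypothetical helper names, and the only Data-Juicer detail assumed is the command line already used in the cell above.

```python
import shutil
import subprocess

def build_command(config_path):
    """Return the argument list for the command used in the cell above."""
    return ["dj-process", "--config", config_path]

def run_recipe(config_path):
    """Run dj-process if it is on PATH; return its exit code, or None if it is missing."""
    if shutil.which("dj-process") is None:
        print("dj-process not found on PATH; is Data-Juicer installed?")
        return None
    completed = subprocess.run(build_command(config_path), check=False)
    return completed.returncode

print(build_command("configs/basic_recipe.yaml"))
```

Calling `run_recipe("configs/basic_recipe.yaml")` then starts the job and returns the process exit code, which a script can check before reading the exported file.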

## Inspecting the Results

Let's examine what happened during processing:

In [11]:
# Let's look at the original data vs processed data

with open('outputs/processed_dataset.jsonl', 'r') as f:
    dataset = [json.loads(line) for line in f]

print("Original samples:")
for i, sample in enumerate(sample_data):
    print(f"{i+1}. {sample['text']}")

print("\nProcessed samples:")
for i, sample in enumerate(dataset):
    print(f"{i+1}. {sample['text']}")

print("\nWhat happened during processing?")
print("1. Whitespace normalization: Fixed spacing issues")
print("2. Language filtering: Kept only English texts (removed French)")
print("3. Length filtering: Removed texts that were too short")
print("4. Deduplication: Removed duplicate texts (case-insensitive)")

Original samples:
1. Hello world! This is a sample text with good quality.
2. hello world! this is a sample text with good quality.
3. a text
4. Bonjour le monde! Ceci est un texte d'exemple de bonne qualité.
5. Hello world! 	Hello world! This is a sample text with good quality. Hello world! This is a sample text with good quality.

Processed samples:
1. Hello world! This is a sample text with good quality.
2. Hello world!  Hello world! This is a sample text with good quality. Hello world! This is a sample text with good quality.

What happened during processing?
1. Whitespace normalization: Fixed spacing issues
2. Language filtering: Kept only English texts (removed French)
3. Length filtering: Removed texts that were too short
4. Deduplication: Removed duplicate texts (case-insensitive)
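
To see why each sample was kept or dropped, the sketch below approximates three of the four steps in plain Python. It is an illustration under stated assumptions, not Data-Juicer's implementation: whitespace normalization is modelled as replacing each tab with one plain space (which matches the two consecutive spaces in the processed output above), deduplication as exact matching on lowercased text, and the language filter is skipped entirely because it needs a language identification model. As a result, the French sample survives here although the real pipeline removes it.

```python
samples = [
    "Hello world! This is a sample text with good quality.",
    "hello world! this is a sample text with good quality.",
    "a text",
    "Bonjour le monde! Ceci est un texte d'exemple de bonne qualité.",
    "Hello world! \tHello world! This is a sample text with good quality. "
    "Hello world! This is a sample text with good quality.",
]

# Step 1 (approximation): map each tab to a single plain space, without collapsing runs.
normalized = [text.replace("\t", " ").strip() for text in samples]

# Step 2 (language filter) is intentionally skipped in this sketch.

# Step 3: keep texts whose length is between min_len=10 and max_len=200 characters.
in_range = [text for text in normalized if 10 <= len(text) <= 200]

# Step 4 (approximation): drop texts that match an earlier one after lowercasing.
seen, kept = set(), []
for text in in_range:
    key = text.lower()
    if key not in seen:
        seen.add(key)
        kept.append(text)

for text in kept:
    print(len(text), repr(text))
```

Here "a text" fails the length check (6 characters), and the second sample is dropped as a case-insensitive duplicate of the first.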


Finally, we remove the temporary recipe and dataset files:

In [None]:
import os
os.remove('configs/basic_recipe.yaml')
os.remove('data/sample_dataset.jsonl')
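
The cell above raises `FileNotFoundError` if it is run a second time, and it leaves the `outputs` directory (the processed dataset and the backed-up config) in place. A more forgiving variant, sketched below, ignores missing files and can optionally delete `outputs` as well. `clean_up` is a hypothetical helper, and the demonstration runs against a throwaway directory rather than your working directory.

```python
import shutil
import tempfile
from pathlib import Path

def clean_up(root, remove_outputs=False):
    """Delete the tutorial's generated files under root, ignoring any that are missing."""
    root = Path(root)
    for relative in ("configs/basic_recipe.yaml", "data/sample_dataset.jsonl"):
        (root / relative).unlink(missing_ok=True)  # missing_ok requires Python 3.8+
    if remove_outputs:
        shutil.rmtree(root / "outputs", ignore_errors=True)

# Demonstrate on a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    recipe_file = Path(tmp_dir) / "configs" / "basic_recipe.yaml"
    recipe_file.parent.mkdir(parents=True)
    recipe_file.write_text("process: []\n")

    clean_up(tmp_dir, remove_outputs=True)   # removes the recipe; other paths are absent
    clean_up(tmp_dir, remove_outputs=True)   # safe to call again
    recipe_removed = not recipe_file.exists()

print(recipe_removed)
```

In the notebook itself you would call `clean_up(".")`, adding `remove_outputs=True` only if you no longer need the processed dataset.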

## Next Steps

Continue with the next notebook in the series to dive deeper into operators and their usage.