# Chapter 2: Building Recipes

**Data-Juicer User Guide**

- Git Commit: `v1.0.5`
- Commit Date: 2026-01-16
- Repository: https://github.com/datajuicer/data-juicer

# Table of Contents

1. [Recipe Structure](#recipe-structure)
2. [Basic Recipe Example](#basic-recipe-example)
3. [Overriding Parameters via CLI](#overriding-parameters-via-cli)
   - [Configuration Hierarchy](#configuration-hierarchy)
4. [Compare Results](#compare-results)
5. [Advanced Dataset Configuration](#advanced-dataset-configuration)
6. [Using Pre-defined Recipes](#using-pre-defined-recipes)
   - [Getting Started with Recipes](#getting-started-with-recipes)
   - [Example: Refining Alpaca-CoT Dataset](#example-refining-alpaca-cot-dataset)
   - [Exploring More Recipes](#exploring-more-recipes)
7. [Further Reading](#further-reading)

## Recipe Structure

A Data-Juicer recipe is a YAML configuration file with three main sections:

1. **Global Parameters**: Project settings, paths, parallelism
2. **Process Pipeline**: Ordered list of operators
3. **Operator Parameters**: Configuration for each operator
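
To make the three sections concrete, here is a minimal sketch that takes a recipe as the Python dict a YAML parser would produce and separates it into those parts. The keys and operator names are borrowed from the example later in this chapter. The splitting logic is only an illustration, not how Data-Juicer itself loads a config.

```python
# A recipe as the dict a YAML parser would return (illustrative values).
recipe = {
    "project_name": "recipe_demo",
    "dataset_path": "./data/recipe_demo.jsonl",
    "export_path": "./outputs/recipe_demo.jsonl",
    "np": 2,
    "process": [
        {"whitespace_normalization_mapper": {}},
        {"text_length_filter": {"min_len": 20, "max_len": 500}},
    ],
}

# 1. Global parameters: every top-level key except the pipeline itself.
global_params = {k: v for k, v in recipe.items() if k != "process"}

# 2. Process pipeline: operator names, in execution order.
pipeline = [next(iter(step)) for step in recipe["process"]]

# 3. Operator parameters: the mapping stored under each operator name.
op_params = {name: params for step in recipe["process"] for name, params in step.items()}

print(global_params)
print(pipeline)
print(op_params)
```

Each entry under `process` is a single-key mapping, which is why the operator name can be read with `next(iter(step))`.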

Data-Juicer uses a **Unified Intermediate Format** for data processing. For detailed format specifications, please refer to **[./03_Data_Formats.ipynb](./03_Data_Formats.ipynb)**.

The cell below creates a small sample dataset used throughout this chapter.

In [None]:
import json
import os

# Create sample data
os.makedirs('./data', exist_ok=True)

samples = [
    {"text": "This is a high-quality English text sample."},
    {"text": "Short"},
    {"text": "Another good example with sufficient length and quality."},
    {"text": "Bonjour! Ceci est un texte en français."},
    {"text": "Data processing is essential for machine learning projects."}
]

with open('./data/recipe_demo.jsonl', 'w') as f:
    for sample in samples:
        f.write(json.dumps(sample) + '\n')

print("Sample data created")

## Basic Recipe Example

For all available parameters and their usage, see [config_all.yaml](https://github.com/datajuicer/data-juicer/blob/main/data_juicer/config/config_all.yaml).

In [None]:
basic_recipe = """# Global Parameters
project_name: 'recipe_demo'
dataset_path: './data/recipe_demo.jsonl'
export_path: './outputs/recipe_demo.jsonl'
np: 2  # Number of parallel processes

# Process Pipeline
process:
  # Step 1: Normalize whitespace
  - whitespace_normalization_mapper: {}
  
  # Step 2: Filter by language
  - language_id_score_filter:
      lang: 'en'
      min_score: 0.8
  
  # Step 3: Filter by text length
  - text_length_filter:
      min_len: 20
      max_len: 500
  
  # Step 4: Remove duplicates
  - document_deduplicator:
      lowercase: true
"""

os.makedirs('./configs', exist_ok=True)
with open('./configs/basic_recipe.yaml', 'w') as f:
    f.write(basic_recipe)

print("Basic recipe created")

In [None]:
# Run with default config
!dj-process --config ./configs/basic_recipe.yaml
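
To build intuition for what the length filter and the deduplicator in this recipe do, the following plain-Python sketch approximates both steps. It is not Data-Juicer's implementation. It assumes inclusive bounds on character count and exact-match deduplication after optional lowercasing, and it skips the language filter because that step needs a language-identification model. The last sample is added here only to show deduplication.

```python
samples = [
    "This is a high-quality English text sample.",
    "Short",
    "Another good example with sufficient length and quality.",
    "Data processing is essential for machine learning projects.",
    # Extra sample (not in the demo data) that differs only by case.
    "DATA PROCESSING IS ESSENTIAL FOR MACHINE LEARNING PROJECTS.",
]

def length_ok(text, min_len=20, max_len=500):
    """Approximate text_length_filter: keep texts whose character count is in range."""
    return min_len <= len(text) <= max_len

def dedup(texts, lowercase=True):
    """Approximate exact-match deduplication, optionally ignoring case."""
    seen, kept = set(), []
    for text in texts:
        key = text.lower() if lowercase else text
        if key not in seen:
            seen.add(key)
            kept.append(text)  # keep the first occurrence
    return kept

after_length = [t for t in samples if length_ok(t)]
after_dedup = dedup(after_length, lowercase=True)
print(len(samples), len(after_length), len(after_dedup))
```

Because filters run in the order listed under `process`, cheap filters placed early reduce the work left for later operators.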

## Overriding Parameters via CLI

In [None]:
# Override parameters via CLI
!dj-process --config ./configs/basic_recipe.yaml \
    --dataset_path ./data/recipe_demo.jsonl \
    --export_path ./outputs/cli_override.jsonl \
    --np 4
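
When runs are scripted from Python rather than typed into a shell, the same overrides can be assembled programmatically. The helper below is hypothetical (`build_dj_command` is not part of Data-Juicer). It only builds an argument list from the `--key value` pattern shown above and does not validate the keys.

```python
import shlex

def build_dj_command(config_path, **overrides):
    """Hypothetical helper: build a dj-process argument list with CLI overrides."""
    cmd = ["dj-process", "--config", config_path]
    for key, value in overrides.items():
        cmd += [f"--{key}", str(value)]
    return cmd

cmd = build_dj_command(
    "./configs/basic_recipe.yaml",
    dataset_path="./data/recipe_demo.jsonl",
    export_path="./outputs/cli_override.jsonl",
    np=4,
)
print(shlex.join(cmd))  # shell-safe rendering of the command
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell-quoting problems when paths contain spaces.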

### Configuration Hierarchy

Parameters are resolved in this order (highest to lowest priority):

1. **CLI arguments** (e.g., `--np 4`)
2. **YAML config file** (e.g., `configs/recipe.yaml`)
3. **Default values** (defined in operator code)
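
The precedence order above can be pictured as a layered lookup. The sketch below models it with `collections.ChainMap` from the standard library. The default values are made up for illustration, and this is a model of the behavior, not Data-Juicer's actual resolution code.

```python
from collections import ChainMap

# Lowest priority: defaults (hypothetical values for illustration).
defaults = {"np": 1, "export_path": "./outputs/default.jsonl"}
# Middle priority: values read from the YAML recipe.
yaml_config = {"np": 2, "export_path": "./outputs/recipe_demo.jsonl"}
# Highest priority: values passed on the command line.
cli_args = {"np": 4}

# ChainMap searches its maps left to right and returns the first match.
resolved = ChainMap(cli_args, yaml_config, defaults)

print(resolved["np"])           # taken from the CLI layer
print(resolved["export_path"])  # not set on the CLI, so the YAML value is used
```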

## Compare Results

In [None]:
import json

def load_jsonl(path):
    with open(path, 'r') as f:
        return [json.loads(line) for line in f]

# Load results from the default run and the CLI-override run
basic_results = load_jsonl('./outputs/recipe_demo.jsonl')
override_results = load_jsonl('./outputs/cli_override.jsonl')

print("Original samples: 5")
print(f"Basic recipe output: {len(basic_results)} samples")
print(f"CLI override output: {len(override_results)} samples")
print("\nThe CLI run only changed export_path and np, so both counts should match")
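
Sample counts alone do not show which records were removed. As a rough sketch, the hypothetical helper below lists the input samples whose text no longer appears in the output. Matching on exact text is an assumption that breaks down when a mapper rewrites the text (for example, whitespace normalization), so treat the result as approximate.

```python
def dropped_samples(original, processed, key="text"):
    """Return samples from `original` whose `key` value is absent from `processed`."""
    kept = {sample[key] for sample in processed}
    return [sample for sample in original if sample[key] not in kept]

# Tiny in-memory stand-ins for the input and output JSONL files.
original = [
    {"text": "Short"},
    {"text": "A sufficiently long English sentence."},
]
processed = [{"text": "A sufficiently long English sentence."}]

for sample in dropped_samples(original, processed):
    print("dropped:", sample["text"])
```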

## Advanced Dataset Configuration

For complex data loading scenarios (data mixing, sampling, field mapping, remote datasets), see **[Chapter 9: Advanced Dataset Configuration](./09_Advanced_Dataset_Configuration.ipynb)**.

Key features covered:
- **Data Mixing**: Combine multiple datasets with custom weights
- **Sampling**: Extract subsets from large datasets
- **Field Mapping**: Handle datasets with custom field names
- **Remote Loading**: Load datasets from HuggingFace

📖 **Documentation**: [DatasetCfg Reference](https://datajuicer.github.io/data-juicer/en/main/docs/DatasetCfg.html)

---

## Using Pre-defined Recipes

Data-Juicer provides a comprehensive [Recipe Gallery](https://datajuicer.github.io/data-juicer-hub/en/main/docs/RecipeGallery.html) in the [data-juicer-hub](https://github.com/datajuicer/data-juicer-hub) repository, containing ready-to-use configuration files for various scenarios.

### Getting Started with Recipes

First, clone the data-juicer-hub repository to access all recipes:

In [None]:
# Clone data-juicer-hub repository
!git clone --depth 1 https://github.com/datajuicer/data-juicer-hub.git

### Example: Refining Alpaca-CoT Dataset

The [Alpaca-CoT recipes](https://datajuicer.github.io/data-juicer-hub/en/main/docs/AlpacaCOT.html) demonstrate how to refine instruction-tuning datasets. The reported results are:

- **English dataset**: 136M → 73M samples (54.48% retention)
- **Chinese dataset**: 21M → 10M samples (46.58% retention)

Each recipe applies a pipeline of quality-focused operators to the raw dataset.

In [None]:
# Example: Use Alpaca-CoT English refinement recipe
# (Assuming you have the Alpaca-CoT dataset)

# View the recipe configuration
!head -n 30 data-juicer-hub/refined_recipes/alpaca_cot/alpaca-cot-en-refine.yaml

In [None]:
# Run the refinement (modify dataset_path to your actual data location)
# !dj-process --config data-juicer-hub/refined_recipes/alpaca_cot/alpaca-cot-en-refine.yaml \
#     --dataset_path /path/to/your/alpaca-cot-data.jsonl \
#     --export_path ./outputs/alpaca-cot-refined.jsonl

### Exploring More Recipes

Browse the complete collection:

- **Recipe Gallery**: https://datajuicer.github.io/data-juicer-hub/en/main/docs/RecipeGallery.html
- **GitHub Repository**: https://github.com/datajuicer/data-juicer-hub
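
For browsing locally, a short sketch like the one below lists every YAML file in the cloned repository. It assumes the clone lives at `./data-juicer-hub` as in the cell above and that recipes use the `.yaml` extension. `list_recipes` is a hypothetical helper, not part of Data-Juicer.

```python
from pathlib import Path

def list_recipes(root):
    """Return all YAML files under `root`, sorted, as paths relative to it."""
    root = Path(root)
    return sorted(p.relative_to(root) for p in root.rglob("*.yaml"))

repo = Path("data-juicer-hub")
if repo.is_dir():
    for recipe in list_recipes(repo):
        print(recipe)
else:
    print("data-juicer-hub not found; clone it first")
```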

In [1]:
# Cleanup: Remove cloned repository
!rm -rf data-juicer-hub

## Further Reading

- [Recipe Gallery](https://datajuicer.github.io/data-juicer/en/main/docs/RecipeGallery.html)
- [config_all.yaml](https://github.com/datajuicer/data-juicer/blob/main/data_juicer/config/config_all.yaml)
- [Operators Documentation](https://datajuicer.github.io/data-juicer/en/main/docs/Operators.html)
- [Format Conversion Tools](https://datajuicer.github.io/data-juicer/en/main/tools/fmt_conversion/README.html)