# Chapter 2: Building Recipes

**Data-Juicer User Guide**

- Git Commit: `v1.4.6`
- Commit Date: 2026-02-02
- Repository: https://github.com/datajuicer/data-juicer

# Table of Contents

1. [Recipe Structure](#recipe-structure)
2. [Basic Recipe Example](#basic-recipe-example)
   - [Global Arguments](#global-arguments)
   - [Operator List](#operator-list)
3. [Build Method](#build-method)
4. [Overriding Parameters via CLI](#overriding-parameters-via-cli)
   - [Configuration Hierarchy](#configuration-hierarchy)
5. [Compare Results](#compare-results)
6. [Using Pre-defined Recipes](#using-pre-defined-recipes)
   - [Getting Started with Recipes](#getting-started-with-recipes)
   - [Example: Refining Alpaca-CoT Dataset](#example-refining-alpaca-cot-dataset)
   - [Exploring More Recipes](#exploring-more-recipes)
7. [Further Reading](#further-reading)

In [1]:
# Install Data-Juicer (if not installed)
# If running in Google Colab, use 'pip install' instead of 'uv pip install'
# !uv pip install py-data-juicer

## Recipe Structure

A Data-Juicer recipe is a YAML configuration file with three main sections:

1. **Global Parameters**: Project settings, paths, parallelism
2. **Process Pipeline**: Ordered list of operators
3. **Operator Parameters**: Configuration for each operator

Data-Juicer handles various data formats (JSONL, CSV, Parquet, etc.) and loading methods (local files, S3, Hugging Face datasets, etc.), and automatically converts them to a unified format for processing. For detailed format specifications and supported data sources, see **[./03_Data_Formats_and_Loading.ipynb](./03_Data_Formats_and_Loading.ipynb)**.
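
To make the three sections concrete, the sketch below models a recipe as a plain Python dict instead of YAML, so it runs with the standard library alone. The helper `split_recipe` is hypothetical and not part of Data-Juicer; the keys mirror the demo recipe used later in this chapter.

```python
# Hypothetical in-memory view of a recipe; Data-Juicer itself reads this from a YAML file.
recipe = {
    "project_name": "recipe_demo",
    "dataset_path": "./data/recipe_demo.jsonl",
    "export_path": "./outputs/recipe_demo.jsonl",
    "np": 2,
    "process": [
        {"whitespace_normalization_mapper": {}},
        {"text_length_filter": {"min_len": 20, "max_len": 500}},
    ],
}


def split_recipe(recipe):
    """Split a recipe dict into global parameters, operator order, and operator parameters."""
    # Everything except the `process` key is a global parameter.
    global_params = {k: v for k, v in recipe.items() if k != "process"}
    # Each pipeline entry is a one-key dict: {operator_name: parameters}.
    pipeline = [next(iter(op)) for op in recipe["process"]]
    # Note: an operator listed twice would keep only its last parameters here.
    op_params = {name: params for op in recipe["process"] for name, params in op.items()}
    return global_params, pipeline, op_params


global_params, pipeline, op_params = split_recipe(recipe)
print("Global parameters:", sorted(global_params))
print("Pipeline order:", pipeline)
print("text_length_filter parameters:", op_params["text_length_filter"])
```

The operator parameters live inside the pipeline entries, which is why the recipe file itself only shows two top-level parts: the global keys and the `process` list.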

Below we create a simple example for demonstration.

In [2]:
import json
import os

# Create sample data
os.makedirs('./data', exist_ok=True)

samples = [
    {"text": "This is a high-quality English text sample."},
    {"text": "Short"},
    {"text": "Another good example with sufficient length and quality."},
    {"text": "Bonjour! Ceci est un texte en français."},
    {"text": "Data processing is essential for machine learning projects."}
]

with open('./data/recipe_demo.jsonl', 'w') as f:
    for sample in samples:
        f.write(json.dumps(sample) + '\n')

print("Sample data created")

Sample data created


## Basic Recipe Example

The recipe below consists of two main parts (operator parameters are nested inside each entry of the operator list):
- Global arguments
- Operator list

### Global Arguments

Global arguments comprise a small set of required arguments plus optional ones for optimization and debugging.

For all available parameters and detailed usage, please refer to [`config_all.yaml`](https://github.com/datajuicer/data-juicer/blob/main/data_juicer/config/config_all.yaml).

### Operator List

Data-Juicer offers a wide range of operators for modifying, cleaning, filtering, and deduplicating data.

A data recipe must list the operators it needs together with their arguments. Data-Juicer runs the operators sequentially, in the order they appear in the `process` list.

In [3]:
basic_recipe = """
# Global Parameters
project_name: 'recipe_demo'
dataset_path: './data/recipe_demo.jsonl'
export_path: './outputs/recipe_demo.jsonl'
np: 2  # Number of parallel processes

# Process Pipeline
process:
  # Step 1: Normalize whitespace
  - whitespace_normalization_mapper: {}
  
  # Step 2: Filter by language
  - language_id_score_filter:
      lang: 'en'
      min_score: 0.8
  
  # Step 3: Filter by text length
  - text_length_filter:
      min_len: 20
      max_len: 500
  
  # Step 4: Remove duplicates
  - document_deduplicator:
      lowercase: true
"""

os.makedirs('./configs', exist_ok=True)
with open('./configs/basic_recipe.yaml', 'w') as f:
    f.write(basic_recipe)

print("Basic recipe created")

Basic recipe created
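
Because operators run in the order listed, the position of each step matters. The toy sketch below illustrates this with three pure-Python stand-ins for the mapper, length filter, and deduplicator. These functions are hypothetical and only approximate the real operators; the language filter is omitted because it needs a language-identification model.

```python
def normalize_whitespace(samples):
    # Mapper stand-in: collapse runs of whitespace into single spaces.
    return [{"text": " ".join(s["text"].split())} for s in samples]


def filter_by_length(samples, min_len=20, max_len=500):
    # Filter stand-in: keep samples whose character count is within bounds.
    return [s for s in samples if min_len <= len(s["text"]) <= max_len]


def deduplicate(samples, lowercase=True):
    # Deduplicator stand-in: drop exact duplicates, optionally ignoring case.
    seen, kept = set(), []
    for s in samples:
        key = s["text"].lower() if lowercase else s["text"]
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept


# Operators are applied one after another, like the `process` list in a recipe.
pipeline = [normalize_whitespace, filter_by_length, deduplicate]

samples = [
    {"text": "Hello   world, this is a   sample text."},
    {"text": "hello world, this is a sample text."},
    {"text": "Short"},
]

for op in pipeline:
    samples = op(samples)
    print(f"{op.__name__}: {len(samples)} sample(s) left")

print(samples)
```

In this toy run, the second sample is recognized as a duplicate only because whitespace was normalized first; with the deduplicator placed ahead of the mapper, both variants would survive.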


In [4]:
# Run with default config
!dj-process --config ./configs/basic_recipe.yaml

2026-02-12 09:23:30.371 | INFO     | data_juicer.config.config:695 - dataset_path config is set and a valid local path
2026-02-12 09:23:30.436 | INFO     | data_juicer.config.config:1012 - Back up the input config file [/workspaces/data-juicer-hub/configs/basic_recipe.yaml] into the work_dir [/workspaces/data-juicer-hub/outputs]
2026-02-12 09:23:30.443 | INFO     | data_juicer.config.config:1033 - Configuration table: 
╒══════════════════════════╤════════════════════════════════════════════════════════════════════════════════════════════╕
│ key                      │ values                                                                                     │
╞══════════════════════════╪════════════════════════════════════════════════════════════════════════════════════════════╡
│ config                   │ [Path_fr(./configs/basic_recipe.yaml, cwd=/workspaces

## Build Method

There are two ways to build a data recipe.

### Customize the Default Configuration File

[`config_all.yaml`](https://github.com/datajuicer/data-juicer/blob/main/data_juicer/config/config_all.yaml) contains all operators and their default arguments.

You only need to **remove** the operators you won't use and adjust the arguments of those you keep.

### Create a New Configuration from Scratch

Refer to the example config file [`config_all.yaml`](https://github.com/datajuicer/data-juicer/blob/main/data_juicer/config/config_all.yaml), the [operator documentation](https://datajuicer.github.io/data-juicer/en/main/docs/Operators.html), and the advanced [Build-Up Guide for developers](https://datajuicer.github.io/data-juicer/en/main/docs/DeveloperGuide.html#build-your-own-data-recipes-and-configs).

## Overriding Parameters via CLI

In [5]:
# Override parameters via CLI
!dj-process --config ./configs/basic_recipe.yaml \
    --dataset_path ./data/recipe_demo.jsonl \
    --export_path ./outputs/cli_override.jsonl \
    --np 4

2026-02-12 09:23:44.293 | INFO     | data_juicer.config.config:695 - dataset_path config is set and a valid local path
2026-02-12 09:23:44.354 | INFO     | data_juicer.config.config:1012 - Back up the input config file [/workspaces/data-juicer-hub/configs/basic_recipe.yaml] into the work_dir [/workspaces/data-juicer-hub/outputs]
2026-02-12 09:23:44.360 | INFO     | data_juicer.config.config:1033 - Configuration table: 
╒══════════════════════════╤════════════════════════════════════════════════════════════════════════════════════════════╕
│ key                      │ values                                                                                     │
╞══════════════════════════╪════════════════════════════════════════════════════════════════════════════════════════════╡
│ config                   │ [Path_fr(./configs/basic_recipe.yaml, cwd=/workspaces
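
When launching runs from a Python script rather than a notebook shell escape, the same overrides can be assembled as an argument list. The sketch below uses only the flags shown in the cell above; `build_dj_command` is a hypothetical helper, not a Data-Juicer API.

```python
import shlex


def build_dj_command(config, **overrides):
    """Build a `dj-process` argument list with CLI overrides appended as --key value pairs."""
    cmd = ["dj-process", "--config", config]
    for key, value in overrides.items():
        cmd += [f"--{key}", str(value)]
    return cmd


cmd = build_dj_command(
    "./configs/basic_recipe.yaml",
    dataset_path="./data/recipe_demo.jsonl",
    export_path="./outputs/cli_override.jsonl",
    np=4,
)

# Print the command as a single shell-quoted string for inspection.
print(shlex.join(cmd))

# To actually run it (requires Data-Juicer to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues when paths contain spaces.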

### Configuration Hierarchy

Parameters are resolved in this order (highest to lowest priority):

1. **CLI arguments** (e.g., `--np 4`)
2. **YAML config file** (e.g., `configs/recipe.yaml`)
3. **Default values** (defined in operator code)
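
The lookup order above can be mimicked with `collections.ChainMap`, which searches its mappings from left to right. This is only an illustration of the precedence rule, not how Data-Juicer resolves its configuration internally; the default values shown are assumed for the example.

```python
from collections import ChainMap

# Illustrative values only; the real defaults come from Data-Juicer's code.
defaults = {"np": 1, "open_tracer": False, "export_path": "./outputs/default.jsonl"}
yaml_config = {"np": 2, "export_path": "./outputs/recipe_demo.jsonl"}
cli_args = {"np": 4}

# The leftmost mapping wins: CLI first, then the YAML file, then defaults.
resolved = ChainMap(cli_args, yaml_config, defaults)

print("np:", resolved["np"])                    # taken from the CLI arguments
print("export_path:", resolved["export_path"])  # taken from the YAML file
print("open_tracer:", resolved["open_tracer"])  # falls back to the default
```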

## Compare Results

In [6]:
import json

def load_jsonl(path):
    with open(path, 'r') as f:
        return [json.loads(line) for line in f]

# Load results
basic_results = load_jsonl('./outputs/recipe_demo.jsonl')
overriding_results = load_jsonl('./outputs/cli_override.jsonl')

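print("Original samples: 5")
print(f"Basic recipe output: {len(basic_results)} samples")
print(f"Overriding parameter output: {len(overriding_results)} samples")

# The CLI overrides changed only the export path and parallelism,
# so both runs should produce the same samples.
assert basic_results == overriding_results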

Original samples: 5
Basic recipe output: 3 samples
Overriding parameter output: 3 samples
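
Counts alone do not show which samples were removed. The sketch below compares the input against an output list by text. The helper `dropped_samples` is hypothetical, and the `results` list is an assumed stand-in for the loaded output (in the notebook, `basic_results` would be used instead).

```python
original = [
    {"text": "This is a high-quality English text sample."},
    {"text": "Short"},
    {"text": "Another good example with sufficient length and quality."},
    {"text": "Bonjour! Ceci est un texte en français."},
    {"text": "Data processing is essential for machine learning projects."},
]

# Assumed for illustration: the three longer English samples survive.
results = [original[0], original[2], original[4]]


def dropped_samples(original, results, key="text"):
    """Return the input samples whose text does not appear in the results."""
    kept_texts = {r[key] for r in results}
    return [s for s in original if s[key] not in kept_texts]


for sample in dropped_samples(original, results):
    print("dropped:", sample["text"])
```

This comparison matches on exact text, so a sample whose text was rewritten by a mapper would also be reported as dropped.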


## Using Pre-defined Recipes

Data-Juicer provides a comprehensive [Recipe Gallery](https://datajuicer.github.io/data-juicer-hub/en/main/docs/RecipeGallery.html) in the [data-juicer-hub](https://github.com/datajuicer/data-juicer-hub) repository, containing ready-to-use configuration files for various scenarios.

### Getting Started with Recipes

First, clone the data-juicer-hub repository to access all recipes:

In [7]:
# Clone data-juicer-hub repository
!git clone --depth 1 https://github.com/datajuicer/data-juicer-hub.git

Cloning into 'data-juicer-hub'...
remote: Enumerating objects: 88, done.
remote: Counting objects: 100% (88/88), done.
remote: Compressing objects: 100% (54/54), done.
remote: Total 88 (delta 34), reused 69 (delta 31), pack-reused 0 (from 0)
Receiving objects: 100% (88/88), 44.89 KiB | 901.00 KiB/s, done.
Resolving deltas: 100% (34/34), done.


### Example: Refining Alpaca-CoT Dataset

The [Alpaca-CoT recipes](https://datajuicer.github.io/data-juicer-hub/en/main/docs/AlpacaCOT.html) demonstrate how to refine instruction-tuning datasets.

Each recipe chains deduplication with quality filters (for example alphanumeric ratio, character repetition, flagged words, and text length) to remove low-quality samples.

In [8]:
# Example: Use Alpaca-CoT English refinement recipe
# (Assuming you have the Alpaca-CoT dataset)

# View the recipe configuration
!head -n 30 data-juicer-hub/refined_recipes/alpaca_cot/alpaca-cot-en-refine.yaml

# global parameters
project_name: 'Data-Juicer-recipes-alpaca-cot-en'
dataset_path: '/path/to/your/dataset'  # path to your dataset directory or file
export_path: '/path/to/your/dataset.jsonl'

np: 50  # number of subprocess to process your dataset
open_tracer: true

# process schedule
# a list of several process operators with their arguments
process:
  - document_deduplicator: # 104636705
      lowercase: true
      ignore_non_character: true

  - alphanumeric_filter: # 104636381
      tokenization: false
      min_ratio: 0.1
  - character_repetition_filter: # 104630030
      rep_len: 10
      max_ratio: 0.6
  - flagged_words_filter: # 104576967
      lang: en
      tokenization: true
      max_ratio: 0.017
  - maximum_line_length_filter: # 104575811
      min_len: 20
  - text_length_filter: # 104573711
      min_len: 30
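
For a quick overview of which operators a recipe uses, the operator names can be pulled out of the `process` list. The sketch below is a rough line scan with a regular expression over an excerpt of the recipe shown above, using only the standard library; it is not a YAML parser, and a real tool would more likely load the file with a YAML library.

```python
import re

# Excerpt of the recipe printed above, embedded so the example is self-contained.
recipe_text = """\
process:
  - document_deduplicator: # 104636705
      lowercase: true
      ignore_non_character: true

  - alphanumeric_filter: # 104636381
      tokenization: false
      min_ratio: 0.1
  - text_length_filter: # 104573711
      min_len: 30
"""

# A list item of the form "- operator_name:" at the start of a line.
OP_LINE = re.compile(r"^[ \t]*-[ \t]+(\w+):", re.MULTILINE)

operators = OP_LINE.findall(recipe_text)
for position, name in enumerate(operators, start=1):
    print(f"{position}. {name}")
```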



In [9]:
# Run the refinement (modify dataset_path to your actual data location)
# !dj-process --config data-juicer-hub/refined_recipes/alpaca_cot/alpaca-cot-en-refine.yaml \
#     --dataset_path /path/to/your/alpaca-cot-data.jsonl \
#     --export_path ./outputs/alpaca-cot-refined.jsonl

### Exploring More Recipes

Browse the complete collection:

- **Recipe Gallery**: https://datajuicer.github.io/data-juicer-hub/en/main/docs/RecipeGallery.html
- **GitHub Repository**: https://github.com/datajuicer/data-juicer-hub

In [10]:
# Cleanup: Remove cloned repository
!rm -rf data-juicer-hub

## Further Reading

- [Recipe Gallery](https://datajuicer.github.io/data-juicer/en/main/docs/RecipeGallery.html)
- [config_all.yaml](https://github.com/datajuicer/data-juicer/blob/main/data_juicer/config/config_all.yaml)
- [Operators Documentation](https://datajuicer.github.io/data-juicer/en/main/docs/Operators.html)
- [Format Conversion Tools](https://datajuicer.github.io/data-juicer/en/main/tools/fmt_conversion/README.html)