# Building Data Recipes

In the previous notebooks, we learned the basic concepts of Data-Juicer and how to use operators. In this notebook, we take a closer look at building data recipes, a central concept in Data-Juicer.

## What are Data Recipes?

Data recipes are configuration files written in YAML format that define a complete data processing workflow. Recipes combine various operators in a specific order to form an executable data processing pipeline.
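
To make "operators applied in a specific order" concrete, here is a minimal pure-Python sketch. The helpers `normalize_whitespace`, `long_enough`, and `run_pipeline` are hypothetical and are not part of Data-Juicer; they only imitate the mapper-then-filter flow that a recipe describes.

```python
# Conceptual sketch only: these helpers are hypothetical, not Data-Juicer APIs.

def normalize_whitespace(sample):
    """Mapper-style step: return a transformed copy of the sample."""
    return {**sample, "text": " ".join(sample["text"].split())}


def long_enough(sample, min_len=10):
    """Filter-style step: return True if the sample should be kept."""
    return len(sample["text"]) >= min_len


def run_pipeline(samples, steps):
    """Apply each (kind, function) step in the order given."""
    for kind, fn in steps:
        if kind == "map":
            samples = [fn(s) for s in samples]
        elif kind == "filter":
            samples = [s for s in samples if fn(s)]
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return samples


toy_samples = [{"text": "  Hello    world!  This is fine. "}, {"text": "Short"}]
toy_steps = [("map", normalize_whitespace), ("filter", long_enough)]
toy_result = run_pipeline(toy_samples, toy_steps)
print(toy_result)
```

Swapping the two steps would measure text length before whitespace is collapsed, which is why a recipe fixes the order of its operators.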

## In This Notebook

1. Basic recipe structure
2. Global parameters
3. Process pipeline
4. Recipe design best practices

## Setup

First, let's import the necessary modules and create some sample data to work with.

In [None]:
import os
import json

# Create sample data
sample_data = [
    {"text": "Hello world! This is a sample text with good quality."},
    {"text": "This text has many repeated words words words words words words words words words words words words"},
    {"text": "Short"},
    {"text": "This is a high quality English text with appropriate length and good content."},
    {"text": "Bonjour le monde! Ceci est un texte d'exemple de bonne qualité."},
    {"text": "Visit https://example.com for more info. Email me at test@example.com"},
    {"text": "This is a high quality English text with appropriate length and good content."}
]

# Create data directory if it doesn't exist
os.makedirs('data', exist_ok=True)

# Write sample data to a JSONL file
with open('data/sample_dataset.jsonl', 'w') as f:
    for item in sample_data:
        f.write(json.dumps(item) + '\n')

print("Sample dataset created with", len(sample_data), "samples")
print("\nOriginal samples:")
for i, sample in enumerate(sample_data):
    print(f"{i+1}. {sample['text']}")

## Basic Recipe Structure

A typical Data-Juicer recipe contains the following parts:

- **Global parameters**: Define project name, dataset path, export path, etc.
- **Process pipeline**: Define the sequence of operators to execute and their parameters

Let's start with a simple example to understand the basic structure of a recipe:

In [None]:
import tempfile
from data_juicer.config import init_configs

basic_recipe = """
# Basic recipe example
project_name: 'my_first_recipe'
dataset_path: './data/sample_dataset.jsonl'
export_path: './outputs/processed_dataset.jsonl'
np: 4

process:
  - whitespace_normalization_mapper: {}
  - language_id_score_filter:
      lang: 'en'
      min_score: 0.8
  - text_length_filter:
      min_len: 10
      max_len: 1000
"""

# Write recipe to file
with tempfile.NamedTemporaryFile(mode='w+', delete=False) as basic:
    basic.write(basic_recipe)
    basic.flush()

# Load and check the recipe
cfg = init_configs(args=f'--config {basic.name}'.split())
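
The cells in this notebook repeat the same write-to-a-temporary-file pattern. A small helper can wrap it. The sketch below uses only the standard library; `write_recipe` is a hypothetical name, not a Data-Juicer function, and the `.yaml` suffix is an assumption made for readability.

```python
import os
import tempfile


def write_recipe(recipe_text, suffix=".yaml"):
    """Write recipe text to a temporary file and return its path.

    Hypothetical convenience helper; not part of Data-Juicer.
    """
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=suffix, delete=False, encoding="utf-8"
    ) as f:
        f.write(recipe_text)
    # The file is closed (and therefore flushed) once the with-block exits.
    return f.name


demo_path = write_recipe("project_name: 'demo'\n")
with open(demo_path, encoding="utf-8") as f:
    demo_text = f.read()
print(demo_text)

# delete=False means cleanup is our responsibility.
os.remove(demo_path)
```

The returned path can then be passed to `init_configs` through `--config`, in the same way as `basic.name` above.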

## Global Configuration Parameters

Global parameters control the overall behavior of the data processing pipeline. They fall into three categories: dataset configuration, export configuration, and system and runtime configuration.

Note: We use tools such as `DatasetBuilder` and `init_configs` here to show each step for educational purposes. In practice, Data-Juicer's `dj-process` command handles these steps automatically. You can also use `DefaultExecutor` or `RayExecutor`.

### 1. Dataset Configuration

Dataset configuration defines where to load data from and how to handle it:

#### Simple Dataset Configuration

For basic use cases, you can use the `dataset_path` parameter:

In [None]:
from data_juicer.core.data.dataset_builder import DatasetBuilder

simple_recipe = """
# Simple dataset configuration
project_name: 'simple_dataset_config'
dataset_path: './data/sample_dataset.jsonl'
"""

# Write recipe to file
with tempfile.NamedTemporaryFile(mode='w+', delete=False) as simple:
    simple.write(simple_recipe)
    simple.flush()

# Load and check the recipe
cfg = init_configs(args=f'--config {simple.name}'.split(), load_configs_only=True)

# Use the DatasetBuilder to load the dataset
dataset_builder = DatasetBuilder(cfg)
dataset = dataset_builder.load_dataset()

print(dataset.to_list())

#### Advanced Dataset Configuration

For more complex scenarios, Data-Juicer provides flexible dataset loading methods. This approach allows you to:

1. Load different types of datasets (local, remote, HuggingFace, etc.)
2. Mix multiple datasets with different weights
3. Apply data validation rules
4. Configure advanced loading parameters

Here's how to use the advanced dataset configuration:

##### Local Datasets

You can load local datasets in various formats:

In [None]:
# Formats like parquet, jsonl, json, csv, tsv, txt, and jsonl.gz are supported
local_json_recipe = """
# Loading a local JSON dataset
project_name: 'local_json_dataset'

dataset:
  configs:
    - type: 'local'
      path: './data/sample_dataset.jsonl'
      format: 'json'

# Optional data validators
validators:
  - type: required_fields
    required_fields:
      - "text"
"""

with tempfile.NamedTemporaryFile(mode='w+', delete=False) as local_json:
    local_json.write(local_json_recipe)
    local_json.flush()

cfg = init_configs(args=f'--config {local_json.name}'.split(), load_configs_only=True)

dataset_builder = DatasetBuilder(cfg)
dataset = dataset_builder.load_dataset()

print(dataset.to_list())
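
The `required_fields` validator in the recipe above names the fields every sample must carry. As a rough illustration of that kind of check, here is a pure-Python sketch. `find_missing_fields` is a hypothetical helper, not the validator Data-Juicer runs, and how Data-Juicer reacts to a failed validation is not covered here.

```python
# Conceptual sketch: `find_missing_fields` is hypothetical, not a Data-Juicer API.

def find_missing_fields(samples, required_fields):
    """Return (index, missing_field_names) for every sample that fails."""
    problems = []
    for index, sample in enumerate(samples):
        missing = [name for name in required_fields if name not in sample]
        if missing:
            problems.append((index, missing))
    return problems


toy_rows = [{"text": "has the field"}, {"content": "wrong key"}]
toy_problems = find_missing_fields(toy_rows, ["text"])
print(toy_problems)  # [(1, ['text'])]
```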

##### Remote Datasets

You can also load datasets from remote sources:

In [None]:
remote_hf_recipe = """
# Loading a HuggingFace dataset (example)
project_name: 'remote_hf_dataset'
dataset:
  configs:
    - type: 'remote'
      source: 'huggingface'
      path: "wikimedia/wikipedia"
      name: "20231101.kl"
      split: "train"
"""

with tempfile.NamedTemporaryFile(mode='w+', delete=False) as remote_hf:
    remote_hf.write(remote_hf_recipe)
    remote_hf.flush()

cfg = init_configs(args=f'--config {remote_hf.name}'.split(), load_configs_only=True)

dataset_builder = DatasetBuilder(cfg)
dataset = dataset_builder.load_dataset()

print(dataset.to_list())

##### Dataset Mixtures

You can mix multiple datasets with different weights:

In [None]:
sample_data_zh = [
    {"text": "你好世界！这是一个样本文本。"},
    {"text": "这是一段重复的文本文本文本文本文本文本文本文本文本文本"},
    {"text": "短文本"},
    {"text": "欢迎来到阿里巴巴！"}
]

with open('data/sample_dataset_zh.jsonl', 'w') as f:
    for item in sample_data_zh:
        f.write(json.dumps(item) + '\n')

In [None]:
mixture_recipe = """
# Mixing multiple datasets
project_name: 'dataset_mixture'
dataset:
  max_sample_num: 10
  configs:
    - type: 'local'
      weight: 1.0
      path: './data/sample_dataset.jsonl'
    - type: 'local'
      weight: 0.5
      path: './data/sample_dataset_zh.jsonl'
"""

with tempfile.NamedTemporaryFile(mode='w+', delete=False) as mixture:
    mixture.write(mixture_recipe)
    mixture.flush()

cfg = init_configs(args=f'--config {mixture.name}'.split(), load_configs_only=True)
dataset_builder = DatasetBuilder(cfg)
dataset = dataset_builder.load_dataset()

for i, sample in enumerate(dataset):
    print(f"{i+1}. {sample['text']}")
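
This notebook does not spell out how `weight` and `max_sample_num` interact, so the sketch below shows only one plausible reading: weights are normalized and used to split the sample budget proportionally. This is an assumption for illustration, not Data-Juicer's confirmed sampling rule, and `allocate_by_weight` is a hypothetical helper.

```python
# Illustration only: one possible weighting rule, not Data-Juicer's confirmed logic.
import math


def allocate_by_weight(weights, max_sample_num):
    """Split a sample budget across datasets in proportion to their weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return [math.floor(max_sample_num * w / total) for w in weights]


toy_allocation = allocate_by_weight([1.0, 0.5], max_sample_num=10)
print(toy_allocation)  # [6, 3]
```

Compare these numbers with the output of the cell above to see how Data-Juicer actually treats the weights on your installed version.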

### 2. Export Configuration

Export configuration controls how and where the processed data is saved:

In [None]:
import os
import tempfile
from data_juicer.config import init_configs
from data_juicer.core.data.dataset_builder import DatasetBuilder
from data_juicer.core.exporter import Exporter

export_config_recipe = """
# Export configuration example
project_name: 'export_config_example'
dataset_path: './data/sample_dataset.jsonl'

# Export settings
export_path: './outputs/exported_data.json'
export_type: 'json'  # or 'parquet', 'jsonl', etc. Optional.
export_shard_size: 0  # 0 means no sharding. Optional.
export_in_parallel: false  # Whether to export in parallel. Optional.
"""

with tempfile.NamedTemporaryFile(mode="w+", delete=False) as export_cfg:
    export_cfg.write(export_config_recipe)
    export_cfg.flush()

cfg = init_configs(args=f"--config {export_cfg.name}".split(), load_configs_only=True)

dataset = DatasetBuilder(cfg).load_dataset()
# Let's export the dataset loaded in the previous step to JSON format
# This is merely a demonstration
# In executor, the Exporter will be used to export the final results of data processing
exporter = Exporter(cfg.export_path, cfg.export_type, cfg.export_shard_size, cfg.export_in_parallel)
exporter.export(dataset)

if os.path.exists(cfg.export_path):
    print(f"Exported data saved to {cfg.export_path}")
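
Checking that the file exists does not tell you how many records were written. The standard-library sketch below counts records in an exported file. Because the exact on-disk layout for each `export_type` is not described here, the hypothetical helper `count_records` accepts both a single JSON array and JSON Lines.

```python
import json
import os
import tempfile


def count_records(path):
    """Count records in a file holding either a JSON array or JSON Lines."""
    with open(path, encoding="utf-8") as f:
        content = f.read()
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError:
        # Not a single JSON document, so treat each non-empty line as a record.
        return sum(1 for line in content.splitlines() if line.strip())
    return len(parsed) if isinstance(parsed, list) else 1


# Self-contained demo on throwaway files in both layouts.
toy_records = [{"text": "a"}, {"text": "b"}, {"text": "c"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    array_path = os.path.join(tmp_dir, "array.json")
    lines_path = os.path.join(tmp_dir, "lines.jsonl")
    with open(array_path, "w", encoding="utf-8") as f:
        json.dump(toy_records, f)
    with open(lines_path, "w", encoding="utf-8") as f:
        for record in toy_records:
            f.write(json.dumps(record) + "\n")
    array_count = count_records(array_path)
    lines_count = count_records(lines_path)

print(array_count, lines_count)  # 3 3
```

In this notebook, `count_records(cfg.export_path)` could be compared with `len(sample_data)`.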

### 3. System and Runtime Configuration

System and runtime configuration controls system-level behavior and resource usage:

Note: The following parameters are for illustration only. For a complete list of available options, please refer to [`config_all.yaml`](https://github.com/modelscope/data-juicer/blob/main/configs/config_all.yaml).

In [None]:
import tempfile
from data_juicer.config import init_configs

system_config_recipe = """
# System configuration example
project_name: 'system_config_example'
dataset_path: './data/sample_dataset.jsonl'
export_path: './outputs/system_result.jsonl'

# Runtime settings
np: 4  # Number of processes
text_keys: 'text'  # Default text field name
image_key: 'images'  # Default image field name

# Caching and performance
use_cache: true
cache_compress: 'gzip'
ds_cache_dir: '~/.cache/huggingface/datasets'  # Dataset cache directory

# Monitoring and debugging
open_monitor: true  # Enable system monitoring
open_tracer: false  # Enable operation tracing
debug: false  # Debug mode

# Checkpointing for long-running jobs
use_checkpoint: false

process:
  - whitespace_normalization_mapper: {}
"""

with tempfile.NamedTemporaryFile(mode='w+', delete=False) as system_cfg:
    system_cfg.write(system_config_recipe)
    system_cfg.flush()

cfg = init_configs(args=f'--config {system_cfg.name}'.split(), load_configs_only=True)
print("System configuration loaded successfully")

## Process Pipeline Configuration

The `process` section defines the sequence of operators to be applied to the dataset. Each operator has its own parameter configuration.

### Operator Configuration Syntax

Operators are listed in order in the `process` list, with each operator's configuration following this syntax:

```yaml
process:
  - operator_name:
      parameter1: value1
      parameter2: value2
      ...
```

If an operator needs no explicit parameters, its entry can be shortened to:

```yaml
process:
  - operator_name: {}
```
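
Once parsed, each item of `process` is a mapping with exactly one key, the operator name, whose value holds that operator's parameters. The sketch below walks such a list in plain Python. `parsed_process` is a hand-written stand-in for the parser's output, and `split_entry` is a hypothetical helper, not a Data-Juicer function.

```python
# Assumption: `parsed_process` mimics what a YAML parser would return for the
# `process` list; the operator names come from the recipes in this notebook.
parsed_process = [
    {"whitespace_normalization_mapper": {}},
    {"text_length_filter": {"min_len": 10, "max_len": 1000}},
]


def split_entry(entry):
    """Return (operator_name, params) from a single-key mapping."""
    if len(entry) != 1:
        raise ValueError(f"expected one operator per entry, got {list(entry)}")
    name, params = next(iter(entry.items()))
    # A bare `operator_name:` with no value parses as None; treat it like {}.
    return name, params or {}


for position, entry in enumerate(parsed_process, start=1):
    op_name, op_params = split_entry(entry)
    print(position, op_name, op_params)
```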

### Example: Complete Process Pipeline

Let's look at a more complete example showing how to configure a process pipeline with multiple operators:

In [None]:
from data_juicer.ops import load_ops

complete_recipe = """
# Complete recipe example
project_name: 'complete_recipe_example'

# Text field configuration
text_keys: 'text'

# Process pipeline
process:
  # 1. Text cleaning and normalization
  - whitespace_normalization_mapper: {}
  - punctuation_normalization_mapper: {}
  
  # 2. Language detection and filtering
  - language_id_score_filter:
      lang: 'en'
      min_score: 0.8
  
  # 3. Text length filtering
  - text_length_filter:
      min_len: 10
      max_len: 1000
  
  # 4. Quality filtering
  - alphanumeric_filter:
      tokenization: false
      min_ratio: 0.5
  - average_line_length_filter:
      min_len: 10
      max_len: 1000
  
  # 5. Deduplication
  - document_simhash_deduplicator:
      tokenization: false
      window_size: 6
      lower: true
      ignore_pattern: null
      num_blocks: 6
      hamming_distance: 4
"""

with tempfile.NamedTemporaryFile(mode='w+', delete=False) as complete:
    complete.write(complete_recipe)
    complete.flush()

cfg = init_configs(args=f'--config {complete.name}'.split())

# Load operators (this is what happens internally)
ops = load_ops(cfg.process)
print(f"Loaded {len(ops)} operators:")
for i, op in enumerate(ops):
    print(f"  {i+1}. {op.__class__.__name__}")
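
The deduplication step takes parameters such as `window_size`, `lower`, and `hamming_distance`. To give a feel for what they control, here is a simplified SimHash sketch built on the standard library. It is not Data-Juicer's implementation: the choice of SHA-256 and character n-grams is an assumption, and `num_blocks` and `ignore_pattern` are ignored entirely.

```python
import hashlib


def simhash(text, window_size=6, lower=True, bits=64):
    """Compute a simplified SimHash fingerprint from character n-grams."""
    if lower:
        text = text.lower()
    last_start = max(1, len(text) - window_size + 1)
    shingles = [text[i:i + window_size] for i in range(last_start)]
    counts = [0] * bits
    for shingle in shingles:
        digest = hashlib.sha256(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for bit in range(bits):
            counts[bit] += 1 if (value >> bit) & 1 else -1
    # A bit is set in the fingerprint when its votes are positive overall.
    return sum(1 << bit for bit in range(bits) if counts[bit] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


text_a = "This is a high quality English text with appropriate length and good content."
text_b = "Bonjour le monde! Ceci est un texte d'exemple de bonne qualité."

same_distance = hamming(simhash(text_a), simhash(text_a))  # identical texts give 0
other_distance = hamming(simhash(text_a), simhash(text_b))
print(same_distance, other_distance)
```

Under this scheme, two samples would count as near-duplicates when their distance is at most the configured `hamming_distance`, which is why the two identical English sentences in the sample data are candidates for removal.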

## Best Practices for Recipe Design

There are three approaches to constructing a data recipe.

### Customize the Default Configuration File

The [`config_all.yaml`](https://github.com/modelscope/data-juicer/blob/main/configs/config_all.yaml) contains all operators and their default arguments. 

To build your own recipe from it, **remove** the operators you don't need and adjust the arguments of the ones you keep.
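
The same remove-and-adjust workflow can be expressed in code once the `process` list has been parsed. The sketch below is plain Python. `full_process` is a tiny hand-written stand-in with placeholder values (not the real defaults from `config_all.yaml`), and `customize` is a hypothetical helper.

```python
# Hypothetical stand-in for a parsed `process` list; values are placeholders.
full_process = [
    {"whitespace_normalization_mapper": {}},
    {"punctuation_normalization_mapper": {}},
    {"text_length_filter": {"min_len": 10, "max_len": 10000}},
]


def customize(process, keep, overrides=None):
    """Keep only the named operators and apply per-operator overrides."""
    overrides = overrides or {}
    customized = []
    for entry in process:
        name, params = next(iter(entry.items()))
        if name not in keep:
            continue
        merged = {**(params or {}), **overrides.get(name, {})}
        customized.append({name: merged})
    return customized


slim_process = customize(
    full_process,
    keep={"whitespace_normalization_mapper", "text_length_filter"},
    overrides={"text_length_filter": {"max_len": 1000}},
)
print(slim_process)
```

The original order of the kept operators is preserved, which matters because operators run in the order listed.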

### Create a New Configuration from Scratch

You can refer to our example config file [`config_all.yaml`](https://github.com/modelscope/data-juicer/blob/main/configs/config_all.yaml), the [operator documentation](https://modelscope.github.io/data-juicer/en/main/docs/Operators.html), and the [Dataset Configuration Guide](https://modelscope.github.io/data-juicer/en/main/docs/DatasetCfg.html).

### Reusable Recipes

Data-Juicer provides a rich collection of reusable data recipes in the [Recipe Gallery](https://modelscope.github.io/data-juicer/en/main/docs/RecipeGallery.html). These recipes cover various scenarios including:

1. **Minimal Example Recipes**: Basic configurations to get you started
2. **Reproducing Open Source Datasets**: Recipes that reproduce the processing pipelines of popular datasets like RedPajama and BLOOM
3. **Refined Pre-training Datasets**: Improved versions of existing pre-training datasets with better quality
4. **Post-tuning Dataset Improvements**: Refined instruction datasets for fine-tuning
5. **Multimodal Dataset Processing**: Recipes for image-text and video datasets
6. **Synthetic Dataset Generation**: Recipes for generating contrastive learning datasets

We recommend exploring the [Recipe Gallery](https://modelscope.github.io/data-juicer/en/main/docs/RecipeGallery.html) to find recipes that match your use case, which can serve as a starting point for your own data processing workflows.

## Next Steps

Continue with the next notebook in the series to learn how to actually run these recipes to process datasets.