# Chapter 5: Operators Usage

**Data-Juicer User Guide**

- Git Commit: `v1.4.6`
- Commit Date: 2026-02-02
- Repository: https://github.com/datajuicer/data-juicer

---

Operators are the core building blocks of Data-Juicer pipelines. This chapter demonstrates how to use operators programmatically through the Python API.

There are two primary ways to use operators:

1. **YAML Configuration** (declarative): Define your pipeline in a YAML file and execute it with the CLI
2. **Python API** (programmatic): Instantiate and chain operators directly in Python code

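The YAML route suits reproducible recipes run from the CLI, while the Python API suits interactive work where you want to inspect intermediate results. Choose whichever fits your workflow.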

**Note:** For a complete list of operators and their parameters, refer to the [Operators Documentation](https://datajuicer.github.io/data-juicer/en/main/docs/Operators.html).

## Table of Contents

1. [YAML-Based Configuration](#yaml-based-configuration)
2. [Operator Types](#operator-types)
3. [Setup](#setup)
4. [Create Sample Dataset](#create-sample-dataset)
5. [Initialize and Call a Single Operator](#initialize-and-call-a-single-operator)
6. [Chain Multiple Operators Sequentially](#chain-multiple-operators-sequentially)
7. [Batch Processing with Operator List](#batch-processing-with-operator-list)
8. [Inspect Operator Statistics](#inspect-operator-statistics)
9. [Further Reading](#further-reading)

## YAML-Based Configuration

For declarative configuration, define your operator pipeline in a YAML file:

```yaml
project_name: 'operators_demo'
dataset_path: './data/operators_demo.jsonl'
export_path: './outputs/operators_demo.jsonl'
np: 1

process:
  - whitespace_normalization_mapper: {}
  - clean_email_mapper: {}
  - language_id_score_filter:
      lang: 'en'
      min_score: 0.8
  - text_length_filter:
      min_len: 20
      max_len: 200
  - alphanumeric_filter:
      min_ratio: 0.5
```

Execute the configuration file using the following command:

```bash
dj-process --config config.yaml
```

For detailed guidance on creating and using recipe YAML files, please refer to [Building Recipes](./02_Building_Recipes.ipynb).
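
The recipe above reads its input from `dataset_path`. As a minimal sketch, the snippet below writes a JSON Lines file, assuming one JSON object per line with a `text` field, which matches the samples used later in this chapter. The helper name `write_jsonl` is hypothetical and not part of Data-Juicer.

```python
import json
from pathlib import Path


def write_jsonl(samples, path):
    """Write one JSON object per line and return the output path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for sample in samples:
            # ensure_ascii=False keeps non-ASCII text (e.g. "français") readable
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
    return path


if __name__ == "__main__":
    demo_samples = [
        {"text": "This is a high-quality English text sample."},
        {"text": "Bonjour! Ceci est un texte en français."},
    ]
    # Path taken from `dataset_path` in the recipe above
    print(write_jsonl(demo_samples, "./data/operators_demo.jsonl"))
```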

## Operator Types

Data-Juicer provides several operator categories:

| Operator Type | Purpose | Examples |
|---|---|---|
| **Mapper** | Edits and transforms samples. | `CleanEmailMapper`, `WhitespaceNormalizationMapper` |
| **Filter** | Removes low-quality samples based on criteria. | `LanguageIDScoreFilter`, `TextLengthFilter`, `AlphanumericFilter` |
| **Deduplicator** | Detects and removes duplicate samples. | `DocumentDeduplicator`, `ImageDeduplicator` |
| **Selector** | Selects top samples based on ranking. | `TopkSpecifiedFieldSelector` |
| **Grouper** | Groups samples into batched samples. | `KeyValueGrouper` |
| **Aggregator** | Aggregates batched samples, for example into a summary or conclusion. | `MetaTagsAggregator` |
| **Pipeline** | Applies dataset-level processing; both input and output are datasets. | `RayVLLMEnginePipeline` |

Each operator can be configured with specific parameters to suit your data processing requirements.
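
To make the difference between the two most common categories concrete, here is a conceptual sketch in plain Python. It does not use Data-Juicer's classes; the functions `normalize_whitespace` and `long_enough` are hypothetical stand-ins that only illustrate the idea that a mapper edits every sample while a filter decides which samples to keep.

```python
def normalize_whitespace(sample):
    """Mapper-style step: return an edited copy of the sample."""
    return {**sample, "text": " ".join(sample["text"].split())}


def long_enough(sample, min_len=20):
    """Filter-style step: return True to keep the sample."""
    return len(sample["text"]) >= min_len


samples = [
    {"text": "This   has\textra whitespace issues."},
    {"text": "Short"},
]

mapped = [normalize_whitespace(s) for s in samples]  # same number of samples
kept = [s for s in mapped if long_enough(s)]         # possibly fewer samples

print(len(mapped), len(kept))
print(kept[0]["text"])
```

The mapper step leaves the sample count unchanged, whereas the filter step can shrink it.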

## Setup

In [1]:
# Install Data-Juicer (if not installed)
# If running in Google Colab, use 'pip install' instead of 'uv pip install'
# !uv pip install py-data-juicer

In [2]:
from data_juicer.core.data import NestedDataset as Dataset
from data_juicer.ops.filter import LanguageIDScoreFilter, TextLengthFilter, AlphanumericFilter
from data_juicer.ops.mapper import CleanEmailMapper, WhitespaceNormalizationMapper



## Create Sample Dataset

We'll create a sample dataset with varied data quality to demonstrate how different operators handle various scenarios.

In [3]:
# Sample data with various quality levels
samples = [
    {"text": "This is a high-quality English text sample."},
    {"text": "Short"},
    {"text": "Contact us at support@example.com for more information."},
    {"text": "Bonjour! Ceci est un texte en français."},
    {"text": "Machine learning is transforming industries worldwide."},
    {"text": "a@#$%^&*()_+{}[]|\\:;<>?,./"},
    {"text": "This has\textra	whitespace　issues."}
]

# Create Dataset object
dataset = Dataset.from_list(samples)
print(f"Created dataset with {len(dataset)} samples")

Created dataset with 7 samples


## Initialize and Call a Single Operator

Start by applying a single operator to see how operators work. Here we use `LanguageIDScoreFilter` to keep only samples identified as English with a confidence score of at least `min_score`.

In [4]:
# Initialize LanguageIDScoreFilter
lang_filter = LanguageIDScoreFilter(
    lang='en',      # Keep English samples
    min_score=0.6   # Minimum confidence score
)

# Apply the filter
filtered_dataset = lang_filter.run(dataset)

print(f"Original: {len(dataset)} samples")
print(f"After language filter: {len(filtered_dataset)} samples")
print("\nFiltered samples:")
for i, sample in enumerate(filtered_dataset, 1):
    print(f"{i}. {sample['text']}")


Original: 7 samples
After language filter: 5 samples

Filtered samples:
1. This is a high-quality English text sample.
2. Short
3. Contact us at support@example.com for more information.
4. Machine learning is transforming industries worldwide.
5. This has	extra	whitespace　issues.
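
When tuning a filter it often helps to see what was removed, not only what was kept. The sketch below is a plain-Python helper (the name `dropped_texts` is hypothetical) that compares two iterables of samples by their `text` field. It is demonstrated on ordinary lists of dicts; it should also work with any dataset that yields dict-like samples when iterated, as in the loop above.

```python
def dropped_texts(before, after, key="text"):
    """Return texts present in `before` but missing from `after`, in order."""
    kept = {sample[key] for sample in after}
    return [sample[key] for sample in before if sample[key] not in kept]


before = [
    {"text": "This is a high-quality English text sample."},
    {"text": "Short"},
    {"text": "Bonjour! Ceci est un texte en français."},
]
after = [before[0], before[1]]  # pretend a language filter kept these two

print(dropped_texts(before, after))
```

This comparison only makes sense for filters, which leave the text untouched. After a mapper has edited the text, the strings no longer match.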





## Chain Multiple Operators Sequentially

In practice, you'll often want to apply multiple operators in sequence. This approach gives you fine-grained control over the pipeline and allows you to inspect intermediate results.

In [5]:
# Method 1: Sequential application
dataset = Dataset.from_list(samples)

# Step 1: Normalize whitespace
print("Step 1: Normalizing whitespace...")
whitespace_mapper = WhitespaceNormalizationMapper()
dataset = whitespace_mapper.run(dataset)
print(f"  → {len(dataset)} samples")

# Step 2: Filter by language
print("Step 2: Filtering by language (English, min_score=0.6)...")
lang_filter = LanguageIDScoreFilter(lang='en', min_score=0.6)
dataset = lang_filter.run(dataset)
print(f"  → {len(dataset)} samples")

# Step 3: Filter by text length
print("Step 3: Filtering by text length (20-200 chars)...")
length_filter = TextLengthFilter(min_len=20, max_len=200)
dataset = length_filter.run(dataset)
print(f"  → {len(dataset)} samples")

# Step 4: Filter by alphanumeric ratio
print("Step 4: Filtering by alphanumeric ratio (min=0.5)...")
alpha_filter = AlphanumericFilter(min_ratio=0.5)
dataset = alpha_filter.run(dataset)
print(f"  → {len(dataset)} samples")

print("\nFinal output:")
for i, sample in enumerate(dataset, 1):
    print(f"{i}. {sample['text']}")

Step 1: Normalizing whitespace...
  → 7 samples
Step 2: Filtering by language (English, min_score=0.6)...
  → 5 samples
Step 3: Filtering by text length (20-200 chars)...
  → 4 samples
Step 4: Filtering by alphanumeric ratio (min=0.5)...
  → 4 samples

Final output:
1. This is a high-quality English text sample.
2. Contact us at support@example.com for more information.
3. Machine learning is transforming industries worldwide.
4. This has extra whitespace issues.
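
The sequential pattern above repeats the same run-then-count step for every operator. A small loop can do this for you. The sketch below assumes only what the cell above relies on: each operator exposes `run(dataset)` and the result supports `len()`. The helper `run_sequentially` and the class `MinLengthStandIn` are hypothetical; the stand-in exists only so the example runs without Data-Juicer installed.

```python
def run_sequentially(dataset, operators):
    """Apply each operator's run() in order and record the sample counts."""
    counts = [("input", len(dataset))]
    for op in operators:
        dataset = op.run(dataset)
        counts.append((type(op).__name__, len(dataset)))
    return dataset, counts


class MinLengthStandIn:
    """Hypothetical stand-in with a run() method; not a Data-Juicer class."""

    def __init__(self, min_len):
        self.min_len = min_len

    def run(self, dataset):
        return [s for s in dataset if len(s["text"]) >= self.min_len]


data = [
    {"text": "Short"},
    {"text": "Machine learning is transforming industries worldwide."},
]
result, counts = run_sequentially(data, [MinLengthStandIn(20)])
print(counts)
```

The recorded counts show at a glance which step removes the most samples.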





## Batch Processing with Operator List

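For cleaner code and better performance, you can pass the full list of operators to the dataset's `process()` method in a single call instead of calling `run()` on each operator. The list below mirrors the YAML recipe shown earlier: compared with the sequential example, it adds `CleanEmailMapper` and raises `min_score` to 0.8.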

In [6]:
# Method 2: Using process() with operator list
dataset = Dataset.from_list(samples)

# Define operator pipeline
operators = [
    WhitespaceNormalizationMapper(),
    CleanEmailMapper(),
    LanguageIDScoreFilter(lang='en', min_score=0.8),
    TextLengthFilter(min_len=20, max_len=200),
    AlphanumericFilter(min_ratio=0.5)
]

# Apply all operators in one call
dataset = dataset.process(operators)

print(f"Processed dataset: {len(dataset)} samples")
print("\nFinal output:")
for i, sample in enumerate(dataset, 1):
    print(f"{i}. {sample['text']}")


Processed dataset: 4 samples

Final output:
1. This is a high-quality English text sample.
2. Contact us at  for more information.
3. Machine learning is transforming industries worldwide.
4. This has extra whitespace issues.
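
Sample 2 in the output above contains a double space where the email address used to be. The sketch below reproduces that effect with a simplified regular expression. The pattern is an illustrative assumption, not the expression Data-Juicer uses, and `strip_emails` is a hypothetical helper.

```python
import re

# Simplified, illustrative pattern; not the expression Data-Juicer uses.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def strip_emails(text, repl=""):
    """Replace every match of EMAIL_PATTERN in `text` with `repl`."""
    return EMAIL_PATTERN.sub(repl, text)


text = "Contact us at support@example.com for more information."
cleaned = strip_emails(text)
print(repr(cleaned))

# Collapsing whitespace afterwards removes the leftover gap.
print(" ".join(cleaned.split()))
```

If the leftover gap matters for your data, one option is to normalize whitespace after removing emails rather than before.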


## Inspect Operator Statistics

Filters can also compute their statistics without removing any samples: pass `reduce=False` to `run()`. Each sample then carries its statistics under the `__dj__stats__` field, which helps you understand your dataset before deciding on filter thresholds.

In [7]:
# Create fresh dataset
dataset = Dataset.from_list(samples)

# Compute statistics without filtering
length_filter = TextLengthFilter(min_len=20, max_len=200)
dataset_with_stats = length_filter.run(dataset, reduce=False)  # Compute stats without filtering

# Check statistics
print("Text length statistics:")
for i, sample in enumerate(dataset_with_stats, 1):
    stats = sample.get('__dj__stats__', {})
    print(f"{i}. Text: {sample['text'][:50]}...")
    print(f"   Length: {stats.get('text_len', 'N/A')} chars")


Text length statistics:
1. Text: This is a high-quality English text sample....
   Length: 43 chars
2. Text: Short...
   Length: 5 chars
3. Text: Contact us at support@example.com for more informa...
   Length: 55 chars
4. Text: Bonjour! Ceci est un texte en français....
   Length: 39 chars
5. Text: Machine learning is transforming industries worldw...
   Length: 54 chars
6. Text: a@#$%^&*()_+{}[]|\:;<>?,./...
   Length: 26 chars
7. Text: This has	extra	whitespace　issues....
   Length: 33 chars
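
Once the statistics are available, you can try candidate thresholds before committing to one. The sketch below uses the seven lengths printed above and counts how many samples each `min_len` would keep. It assumes the bounds are inclusive, which is an assumption of this example rather than a statement about `TextLengthFilter`; the helper `count_kept` is hypothetical.

```python
# Lengths copied from the statistics output above
text_lengths = [43, 5, 55, 39, 54, 26, 33]


def count_kept(lengths, min_len, max_len):
    """Count lengths inside [min_len, max_len] (inclusive bounds assumed)."""
    return sum(min_len <= n <= max_len for n in lengths)


for min_len in (10, 20, 30, 40):
    kept = count_kept(text_lengths, min_len, max_len=200)
    print(f"min_len={min_len}: {kept}/{len(text_lengths)} samples kept")
```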





## Further Reading

- [Complete Operators List](https://datajuicer.github.io/data-juicer/en/main/docs/Operators.html)
- [Building Recipes with YAML](./02_Building_Recipes.ipynb)
- [Developer Guide](https://datajuicer.github.io/data-juicer/en/main/docs/DeveloperGuide.html)