# Understanding Data-Juicer Operators

In this notebook, we'll explore the different types of operators in Data-Juicer and how to use them. Operators are the building blocks of data processing pipelines in Data-Juicer.

## What are Operators?

Operators in Data-Juicer are specialized components that each perform one data processing task. Each operator handles a particular aspect of data cleaning, transformation, or filtering.

## Operator Types

Data-Juicer provides 6 main types of operators:

- **Mapper**: Edits and transforms samples (e.g., text cleaning)
- **Filter**: Filters out low-quality samples (e.g., language filtering)
- **Deduplicator**: Detects and removes duplicate samples
- **Selector**: Selects top samples based on ranking
- **Grouper**: Groups samples into batched samples
- **Aggregator**: Aggregates batched samples, for example into a summary or conclusion
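
As a rough illustration of the Grouper and Aggregator roles listed above, the plain-Python sketch below splits samples into fixed-size batches and then reduces each batch to one summary record. The names `group_by_size` and `aggregate_lengths` are hypothetical and are not Data-Juicer APIs; the sketch only shows the shape of the data flow.

```python
from typing import Dict, List

Sample = Dict[str, str]


def group_by_size(samples: List[Sample], size: int) -> List[List[Sample]]:
    """Hypothetical grouper: split samples into consecutive batches of `size`."""
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def aggregate_lengths(batch: List[Sample]) -> Dict[str, int]:
    """Hypothetical aggregator: reduce one batch to a single summary record."""
    return {
        "num_samples": len(batch),
        "total_chars": sum(len(s["text"]) for s in batch),
    }


samples = [{"text": "abc"}, {"text": "de"}, {"text": "f"}]
batches = group_by_size(samples, size=2)               # many samples -> batches
summaries = [aggregate_lengths(b) for b in batches]    # one record per batch
print(summaries)
```

The grouper changes how samples are packaged, while the aggregator changes what each package contains; keeping the two steps separate lets either one be swapped independently.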

In this notebook, we'll explore the first four types (Mapper, Filter, Deduplicator, and Selector) with practical examples.

## Setup

First, let's import the necessary modules and create some sample data to work with.

In [None]:
from data_juicer.core.data import NestedDataset as Dataset

# Create sample data
sample_data = [
    {"text": "Hello world! This is a sample text with good quality.Visit https://example.com for more info. Email me at test@example.com"},
    {"text": "This text has many repeated words words words words words words words words words words words words"},
    {"text": "Short"},
    {"text": "This is a high quality English text with appropriate length and good content."},
    {"text": "Bonjour le monde! Ceci est un texte d'exemple de bonne qualité."},
    {"text": "hello world! this is a sample text with good quality."},
    {"text": "This is a high quality English text with appropriate length and good content."}
]

# Create dataset
dataset = Dataset.from_list(sample_data)

print("Sample dataset created with", len(sample_data), "samples")
print("\nOriginal samples:")
for i, sample in enumerate(sample_data):
    print(f"{i+1}. {sample['text']}")

## 1. Mapper Operators

Mapper operators transform data samples. They take one sample and return a transformed version of that sample.
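
To make the one-sample-in, one-sample-out contract concrete, here is a minimal plain-Python sketch of a mapper-like function. The function `strip_links` and the simplified `URL_PATTERN` are assumptions for illustration only and do not reproduce how Data-Juicer's own link cleaning works.

```python
import re
from typing import Dict

# Simplified URL pattern for illustration only; real link cleaning
# has to handle many more cases than this.
URL_PATTERN = re.compile(r"https?://\S+")


def strip_links(sample: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical mapper: return a copy of the sample with URLs removed."""
    cleaned = dict(sample)  # copy so the input sample is left untouched
    cleaned["text"] = URL_PATTERN.sub("", sample["text"])
    return cleaned


before = {"text": "Visit https://example.com for more info."}
after = strip_links(before)
print(after["text"])
```

A mapper never changes the number of samples; it only changes their content.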

Let's try the `clean_links_mapper` and `clean_email_mapper`:

In [None]:
from data_juicer.ops.mapper import CleanLinksMapper, CleanEmailMapper

# Create mapper operators
clean_links_op = CleanLinksMapper()
clean_email_op = CleanEmailMapper()

# You can apply mappers one by one
print("Original text:")
sample_text = sample_data[0]['text']
print(sample_text)

# Process with mappers
# Batched operators expect each field to hold a list of values,
# so even a single sample wraps its text in a list.
sample = {'text': [sample_text]}
sample = clean_links_op.process(sample)
sample = clean_email_op.process(sample)

print("\nAfter processing:")
print(sample['text'][0])
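
The batched format used in the cell above (one dict whose values are lists) can be pictured as a transpose of the usual list of per-sample dicts. The helpers `to_batched` and `from_batched` below are hypothetical names for a plain-Python sketch of that conversion, and they assume every sample has the same keys.

```python
from typing import Any, Dict, List


def to_batched(samples: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn a list of samples into one dict whose values are per-field lists."""
    keys = list(samples[0]) if samples else []
    return {key: [s[key] for s in samples] for key in keys}


def from_batched(batch: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Inverse of to_batched: rebuild one dict per sample."""
    keys = list(batch)
    return [dict(zip(keys, values)) for values in zip(*batch.values())]


single = to_batched([{"text": "Hello world!"}])
print(single)                # the single sample's text is now inside a list
print(from_batched(single))  # back to a list with one per-sample dict
```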

In [None]:
# Alternatively, apply the mappers as a pipeline over the whole dataset

print("Original dataset:")
for i, sample in enumerate(dataset):
    print(f"{i+1}. {sample['text']}")

# Apply mappers
# You can use `dataset.process` to apply operators in a pipeline
# or use `op.run` to apply a single operator
# dataset = clean_links_op.run(dataset)
# dataset = clean_email_op.run(dataset)
dataset = dataset.process([clean_links_op, clean_email_op])

print("\nAfter processing:")
for i, sample in enumerate(dataset):
    print(f"{i+1}. {sample['text']}")

## 2. Filter Operators

Filter operators remove data samples that don't meet certain criteria. They compute statistics and then decide whether to keep or remove a sample.
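
The two phases described above (compute statistics, then decide) can be sketched in plain Python as follows. The functions `compute_stats` and `keep`, the `"stats"` key, and the thresholds are assumptions chosen for illustration; they are not Data-Juicer's implementation.

```python
from typing import Any, Dict


def compute_stats(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Phase 1: attach statistics to the sample without dropping anything."""
    text = sample["text"]
    alnum = sum(ch.isalnum() for ch in text)
    stats = {
        "text_len": len(text),
        "alnum_ratio": alnum / len(text) if text else 0.0,
    }
    return {**sample, "stats": stats}


def keep(sample: Dict[str, Any], min_len: int = 10, min_ratio: float = 0.5) -> bool:
    """Phase 2: decide from the stored statistics whether to keep the sample."""
    stats = sample["stats"]
    return stats["text_len"] >= min_len and stats["alnum_ratio"] >= min_ratio


samples = [
    {"text": "Short"},
    {"text": "A reasonably long sentence."},
    {"text": "!!! ??? ... --- ###"},
]
with_stats = [compute_stats(s) for s in samples]
kept = [s["text"] for s in with_stats if keep(s)]
print(kept)
```

Separating the phases means the statistics can be inspected before anything is removed, which helps when tuning thresholds.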

Let's try the `text_length_filter`, `alphanumeric_filter`, and `language_id_score_filter`:

In [None]:
from data_juicer.ops.filter import TextLengthFilter, AlphanumericFilter, LanguageIDScoreFilter

# Create filter operators
length_filter = TextLengthFilter(min_len=10, max_len=100)
alpha_filter = AlphanumericFilter(min_ratio=0.5)
lang_filter = LanguageIDScoreFilter(lang='en', min_score=0.8)

print("Applying filters to dataset:")
print("Original dataset size:", len(dataset))

# Apply filters
filtered_dataset = length_filter.run(dataset, reduce=False)  # Just compute stats
filtered_dataset = alpha_filter.run(filtered_dataset, reduce=False)  # Just compute stats
filtered_dataset = lang_filter.run(filtered_dataset, reduce=False)  # Just compute stats

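# Show stats
def fmt(value):
    """Format numeric stats to two decimals; leave placeholders like 'N/A' as-is."""
    return f"{value:.2f}" if isinstance(value, (int, float)) else str(value)

print("\nDataset with computed stats:")
for i, sample in enumerate(filtered_dataset):
    stats = sample.get('__dj__stats__', {})
    print(f"{i+1}. Text: {sample['text']}")
    print(f"   Stats: Length={stats.get('text_len', 'N/A')}, Alpha ratio={fmt(stats.get('alnum_ratio', 'N/A'))}, Language score={fmt(stats.get('lang_score', 'N/A'))}")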

In [None]:
# Now apply actual filtering
final_dataset = length_filter.run(dataset)  # Compute stats and filter
final_dataset = alpha_filter.run(final_dataset)  # Compute stats and filter
final_dataset = lang_filter.run(final_dataset)  # Compute stats and filter

print("\nAfter filtering:")
print("Final dataset size:", len(final_dataset))
for i, sample in enumerate(final_dataset):
    print(f"{i+1}. {sample['text']}")

## 3. Deduplicator Operators

Deduplicator operators identify and remove duplicate samples from the dataset.
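
One common way to detect exact duplicates is to hash each text and keep only the first sample per hash. The sketch below illustrates that idea in plain Python; `dedup_exact` is a hypothetical helper, and the choice of SHA-256 is an assumption, not a statement about what Data-Juicer uses internally.

```python
import hashlib
from typing import Dict, List


def dedup_exact(samples: List[Dict[str, str]], lowercase: bool = True) -> List[Dict[str, str]]:
    """Keep the first sample for each distinct text hash."""
    seen = set()
    unique = []
    for sample in samples:
        text = sample["text"].lower() if lowercase else sample["text"]
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


samples = [
    {"text": "Hello world!"},
    {"text": "hello world!"},
    {"text": "Something else."},
    {"text": "Hello world!"},
]
case_insensitive = dedup_exact(samples)
case_sensitive = dedup_exact(samples, lowercase=False)
print(len(case_insensitive), len(case_sensitive))
```

Lowercasing before hashing treats texts that differ only in letter case as duplicates, so the case-insensitive run keeps fewer samples.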

Let's try the `document_deduplicator`:

In [None]:
from data_juicer.ops.deduplicator import DocumentDeduplicator

# Create deduplicator operator
dedup_op = DocumentDeduplicator(lowercase=True)  # Case-insensitive deduplication

print("Dataset before deduplication:")
for i, sample in enumerate(dataset):
    print(f"{i+1}. {sample['text']}")

# Apply deduplication
deduped_dataset = dedup_op.run(dataset)

print("\nDataset after deduplication:")
print("Size before:", len(dataset), "Size after:", len(deduped_dataset))
for i, sample in enumerate(deduped_dataset):
    print(f"{i+1}. {sample['text']}")

## 4. Selector Operators

Selector operators keep a subset of samples chosen by ranking, for example the top-k samples by the value of a given field.
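
As a minimal sketch of ranking-based selection, the plain-Python code below resolves a dotted key such as `meta.quality_score` against nested dicts and keeps the highest-scoring samples. The helpers `get_nested` and `select_topk` are hypothetical and only illustrate the idea.

```python
from functools import reduce
from typing import Any, Dict, List


def get_nested(sample: Dict[str, Any], dotted_key: str) -> Any:
    """Resolve a dotted key such as 'meta.quality_score' against nested dicts."""
    return reduce(lambda node, part: node[part], dotted_key.split("."), sample)


def select_topk(samples: List[Dict[str, Any]], field_key: str, topk: int) -> List[Dict[str, Any]]:
    """Return the `topk` samples with the largest value at `field_key`."""
    ranked = sorted(samples, key=lambda s: get_nested(s, field_key), reverse=True)
    return ranked[:topk]


samples = [
    {"text": "a", "meta": {"quality_score": 0.8}},
    {"text": "b", "meta": {"quality_score": 0.9}},
    {"text": "c", "meta": {"quality_score": 0.7}},
    {"text": "d", "meta": {"quality_score": 0.95}},
]
top2 = select_topk(samples, "meta.quality_score", topk=2)
print([s["text"] for s in top2])
```

Unlike a filter, which judges each sample on its own, a selector has to see the whole dataset before it can decide which samples to keep.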

Let's create a dataset with metadata and use `topk_specified_field_selector`:

In [None]:
from data_juicer.ops.selector import TopkSpecifiedFieldSelector

# Create sample data with metadata
sample_data_with_meta = [
    {"text": "Sample text 1", "meta": {"quality_score": 0.8}},
    {"text": "Sample text 2", "meta": {"quality_score": 0.9}},
    {"text": "Sample text 3", "meta": {"quality_score": 0.7}},
    {"text": "Sample text 4", "meta": {"quality_score": 0.95}},
    {"text": "Sample text 5", "meta": {"quality_score": 0.6}},
]

meta_dataset = Dataset.from_list(sample_data_with_meta)

print("Dataset with quality scores:")
for i, sample in enumerate(meta_dataset):
    print(f"{i+1}. {sample['text']} (Score: {sample['meta']['quality_score']})")

# Create selector operator
selector_op = TopkSpecifiedFieldSelector(field_key='meta.quality_score', topk=3)

# Apply selection
selected_dataset = selector_op.process(meta_dataset)

print("\nTop 3 samples by quality score:")
for i, sample in enumerate(selected_dataset):
    print(f"{i+1}. {sample['text']} (Score: {sample['meta']['quality_score']})")

## Different Ways to Use Operators

There are three main ways to use operators in Data-Juicer:

1. **Direct processing**: `op.process(sample)` - For single sample processing
2. **Functional programming style**: `op.run(dataset)` - For turnkey pipeline processing
3. **Chain call style**: `dataset.process([op1, op2, op3])` - For batch processing with automatic control

Let's see these in action:

In [None]:
# 1. Direct processing
clean_links_op = CleanLinksMapper()
sample = {'text': ["Visit https://example.com"]}
result = clean_links_op.process(sample)
print("1. Direct processing:")
print("   Original:", sample['text'][0])
print("   Processed:", result['text'][0])

# 2. Functional programming style
small_dataset = Dataset.from_list([{"text": "Visit https://example.com"}, {"text": "test"}])
result_dataset2 = clean_links_op.run(small_dataset)
print("\n2. Functional programming style:")
for i, sample in enumerate(result_dataset2):
    print(f"   {i+1}. {sample['text']}")

# 3. Chain call style
text_length_filter = TextLengthFilter(min_len=10)
result_dataset3 = small_dataset.process([text_length_filter, clean_links_op])
print("\n3. Chain call style:")
for i, sample in enumerate(result_dataset3):
    print(f"   {i+1}. {sample['text']}")
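
When operators are chained, their order can change the result. The plain-Python sketch below mimics the chain above with two hypothetical stand-ins, `length_filter` and `clean_links`, and a hypothetical `run_pipeline` helper that applies operators in sequence. The simplified URL pattern is an assumption, so the real operators may produce different text.

```python
import re
from typing import Callable, Dict, List

Samples = List[Dict[str, str]]
Op = Callable[[Samples], Samples]

URL_PATTERN = re.compile(r"https?://\S+")  # simplified, for illustration only


def length_filter(samples: Samples, min_len: int = 10) -> Samples:
    """Hypothetical filter: keep samples whose text has at least `min_len` characters."""
    return [s for s in samples if len(s["text"]) >= min_len]


def clean_links(samples: Samples) -> Samples:
    """Hypothetical mapper: remove URLs from every sample's text."""
    return [{**s, "text": URL_PATTERN.sub("", s["text"])} for s in samples]


def run_pipeline(samples: Samples, ops: List[Op]) -> Samples:
    """Apply each operator in order, feeding its output to the next one."""
    for op in ops:
        samples = op(samples)
    return samples


data = [{"text": "Visit https://example.com"}, {"text": "test"}]
filter_first = run_pipeline(data, [length_filter, clean_links])
clean_first = run_pipeline(data, [clean_links, length_filter])
print(filter_first)
print(clean_first)
```

Filtering first keeps the first sample because its length is measured with the URL still present; cleaning first shortens it below the threshold, so nothing survives.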

For more operators and their usage, please refer to the [Operator Schemas](https://modelscope.github.io/data-juicer/en/main/docs/Operators.html).

## Next Steps

Continue with the next notebook to learn how to build data recipes that combine multiple operators into processing pipelines.