# Chapter 7: Pre-processing

**Data-Juicer User Guide**

- Git Commit: `v1.0.5`
- Commit Date: 2026-01-16
- Repository: https://github.com/datajuicer/data-juicer

---

Pre-processing tools prepare raw data before it enters the main Data-Juicer pipeline. See the [Preprocess README](https://datajuicer.github.io/data-juicer/en/main/tools/preprocess/README.html) for complete documentation.

## Available Pre-processing Tools

Data-Juicer provides several pre-processing utilities:

- **dataset_split_by_language.py**: Split datasets by language
- **raw_arxiv_to_jsonl.py**: Convert arXiv data to JSONL
- **raw_stackexchange_to_jsonl.py**: Convert Stack Exchange data to JSONL
- **serialize_meta.py**: Serialize metadata fields

In [None]:
!git clone --depth 1 https://github.com/datajuicer/data-juicer.git

In [None]:
# Quote the path so shells such as zsh do not treat [dev] as a glob pattern
!uv pip install -e "data-juicer[dev]"

## Example: Split Dataset by Language

In [None]:
import json
import os

# Create sample multilingual dataset
os.makedirs('./data/raw', exist_ok=True)

samples = [
    {"text": "This is an English text sample.", "id": 1},
    {"text": "Ceci est un texte en français.", "id": 2},
    {"text": "Another English sample for testing.", "id": 3},
    {"text": "这是一个中文文本示例。", "id": 4},
    {"text": "Machine learning is transforming industries.", "id": 5},
    {"text": "Bonjour le monde!", "id": 6}
]

with open('./data/raw/multilingual.jsonl', 'w') as f:
    for sample in samples:
        f.write(json.dumps(sample) + '\n')

print(f"Created multilingual dataset with {len(samples)} samples")

In [None]:
# Split by language
!python data-juicer/tools/preprocess/dataset_split_by_language.py \
    --src_dir ./data/raw \
    --target_dir ./data/split_by_lang \
    --suffixes jsonl \
    --text_key text \
    --num_proc 2

In [None]:
# Check split results
import os
import json

split_dir = './data/split_by_lang'
if os.path.exists(split_dir):
    for filename in os.listdir(split_dir):
        if filename.endswith('.jsonl'):
            filepath = os.path.join(split_dir, filename)
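            # Read as UTF-8 explicitly so non-English splits load on any platform
            with open(filepath, 'r', encoding='utf-8') as f:
                split_samples = [json.loads(line) for line in f]
            print(f"\n{filename}: {len(split_samples)} samples")
            for sample in split_samples[:2]:  # Show first 2
                print(f"  - {sample['text'][:50]}...")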
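
To make the idea behind the split concrete, here is a minimal sketch of grouping records by a language label. It is not the script's implementation: `toy_detect_language` and `group_by_language` are hypothetical names, and the detector is a deliberately crude stand-in that only looks at Unicode character names, so it cannot tell English from French.

```python
import unicodedata
from collections import defaultdict


def toy_detect_language(text):
    """Hypothetical stand-in detector: labels text by Unicode script only."""
    for ch in text:
        # CJK ideographs have Unicode names starting with "CJK UNIFIED IDEOGRAPH"
        if "CJK" in unicodedata.name(ch, ""):
            return "zh"
    return "other"


def group_by_language(records, text_key="text", detect=toy_detect_language):
    """Bucket records by the label the detector assigns to their text field."""
    buckets = defaultdict(list)
    for record in records:
        buckets[detect(record[text_key])].append(record)
    return dict(buckets)


demo = [
    {"text": "This is an English text sample.", "id": 1},
    {"text": "这是一个中文文本示例。", "id": 4},
    {"text": "Bonjour le monde!", "id": 6},
]
groups = group_by_language(demo)
for lang, items in groups.items():
    print(lang, [item["id"] for item in items])
```

Each bucket would then be written to its own JSONL file. A real language identifier can be passed in through the `detect` parameter of this sketch.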

## Example: Serialize Metadata

In [None]:
# Create dataset with complex metadata
samples_with_meta = [
    {
        "text": "Sample text one",
        "meta": {
            "source": "web",
            "date": "2024-01-01",
            "author": "user123",
            "tags": ["tech", "ai"]
        }
    },
    {
        "text": "Sample text two",
        "meta": {
            "source": "social",
            "date": "2024-01-02",
            "author": "user456",
            "tags": ["news"]
        }
    }
]

os.makedirs('./data/with_meta', exist_ok=True)
with open('./data/with_meta/data.jsonl', 'w') as f:
    for sample in samples_with_meta:
        f.write(json.dumps(sample) + '\n')

print("Created dataset with metadata")

In [None]:
# Serialize metadata to string
!python data-juicer/tools/preprocess/serialize_meta.py \
    --src_dir ./data/with_meta \
    --target_dir ./data/serialized \
    --text_key text \
    --serialized_key meta_str \
    --num_proc 1

In [None]:
# Check serialized results
with open('./data/serialized/data.jsonl', 'r', encoding='utf-8') as f:
    serialized = [json.loads(line) for line in f]

print("Serialized metadata:")
for i, sample in enumerate(serialized, 1):
    print(f"\n{i}. Text: {sample['text']}")
    print(f"   Meta (serialized): {sample.get('meta_str', 'N/A')[:100]}...")
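
As a rough illustration of what serializing metadata means, the sketch below collapses every field other than the text field into a single JSON string. This is an assumption about the general technique, not a copy of `serialize_meta.py`; the helper name `serialize_non_text_fields` is hypothetical, and the real script's output layout may differ.

```python
import json


def serialize_non_text_fields(record, text_key="text", serialized_key="meta_str"):
    """Collapse every field except the text field into one JSON string."""
    rest = {key: value for key, value in record.items() if key != text_key}
    return {
        text_key: record[text_key],
        serialized_key: json.dumps(rest, ensure_ascii=False),
    }


record = {
    "text": "Sample text one",
    "meta": {"source": "web", "date": "2024-01-01", "tags": ["tech", "ai"]},
}
flat = serialize_non_text_fields(record)
print(flat)

# The nested structure can be recovered later by parsing the string again
restored = json.loads(flat["meta_str"])
print(restored["meta"]["tags"])
```

After this step every record has the same two string-valued fields, whatever shape the original metadata had.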

## Practical Workflow

A typical pre-processing workflow:

1. **Raw Data Collection**: Gather data from various sources
2. **Format Conversion**: Convert to JSONL (if needed)
3. **Language Splitting**: Separate by language for targeted processing
4. **Metadata Handling**: Serialize complex metadata
5. **Main Pipeline**: Feed into Data-Juicer processing pipeline
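
For step 2, a format conversion can be as small as the sketch below, which turns CSV text into JSONL lines using only the standard library. The function name `csv_to_jsonl_lines` and the sample columns are assumptions made for illustration. For real datasets, the dedicated conversion scripts are likely a better fit.

```python
import csv
import io
import json


def csv_to_jsonl_lines(csv_text):
    """Turn CSV text with a header row into a list of JSON lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps(row, ensure_ascii=False) for row in reader]


raw_csv = "id,text\n1,Hello world\n2,Bonjour le monde!\n"
lines = csv_to_jsonl_lines(raw_csv)
print("\n".join(lines))
```

Note that `csv.DictReader` yields every value as a string, so numeric columns such as `id` would need an explicit cast if downstream steps expect numbers.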

In [None]:
# Example: Complete pre-processing + main pipeline
config = """project_name: 'preprocessed_data'
dataset_path: './data/split_by_lang/en.jsonl'  # Use English split
export_path: './outputs/preprocessed_final.jsonl'
np: 2

process:
  - text_length_filter:
      min_len: 10
      max_len: 500
  - alphanumeric_filter:
      min_ratio: 0.5
"""

os.makedirs('./configs', exist_ok=True)
with open('./configs/preprocess_pipeline.yaml', 'w') as f:
    f.write(config)

print("Pipeline config created")

In [None]:
# Run main pipeline on pre-processed data
!dj-process --config ./configs/preprocess_pipeline.yaml
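
To build intuition for the two filters in the config above, here is a plain-Python approximation of their conditions. It is a sketch under the assumption that length is counted in characters and that the ratio is the share of alphanumeric characters. Data-Juicer's operators may compute these statistics differently, and the function names are hypothetical.

```python
def passes_text_length(text, min_len=10, max_len=500):
    """Keep text whose character count falls inside [min_len, max_len]."""
    return min_len <= len(text) <= max_len


def alphanumeric_ratio(text):
    """Share of characters that are letters or digits (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def passes_filters(text, min_ratio=0.5):
    """Apply both conditions, mirroring the order used in the config."""
    return passes_text_length(text) and alphanumeric_ratio(text) >= min_ratio


candidates = [
    "Hi",                                            # too short
    "Machine learning is transforming industries.",  # passes both checks
    "!!! ??? ... --- ###",                           # no alphanumeric characters
]
kept = [text for text in candidates if passes_filters(text)]
print(kept)
```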

## Cleanup

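After completing the pre-processing tasks, you can remove the cloned repository to save space. Because Data-Juicer was installed in editable mode (`-e`) from this directory, removing it also breaks the installation, so run this cell only once you are done with Data-Juicer in this environment.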

In [None]:
# Remove cloned Data-Juicer repository
!rm -rf data-juicer

## Further Reading

- [Pre-processing Tools Documentation](https://datajuicer.github.io/data-juicer/en/main/tools/preprocess/README.html)
- [Pre-processing Scripts Source Code](https://github.com/datajuicer/data-juicer/blob/main/tools/preprocess/)
- [Format Conversion Tools](https://datajuicer.github.io/data-juicer/en/main/tools/fmt_conversion/README.html)