# Chapter 8: Pre-processing

**Data-Juicer User Guide**

- Git Commit: `v1.4.6`
- Commit Date: 2026-02-02
- Repository: https://github.com/datajuicer/data-juicer

---

Pre-processing tools prepare raw data before it enters the main Data-Juicer pipeline. See the [Preprocess README](https://datajuicer.github.io/data-juicer/en/main/tools/preprocess/README.html) for complete documentation.

## Table of Contents

1. [Available Pre-processing Tools](#available-pre-processing-tools)
2. [Example: Split Dataset by Language](#example-split-dataset-by-language)
3. [Example: Serialize Metadata](#example-serialize-metadata)
4. [Practical Workflow](#practical-workflow)
5. [Cleanup](#cleanup)
6. [Further Reading](#further-reading)

## Available Pre-processing Tools

Data-Juicer provides several pre-processing utilities:

- **dataset_split_by_language.py**: Split a dataset into one file per detected language
- **raw_arxiv_to_jsonl.py**: Convert raw arXiv data to JSONL
- **raw_stackexchange_to_jsonl.py**: Convert raw Stack Exchange data to JSONL
- **serialize_meta.py**: Serialize metadata fields into a single string field

In [1]:
# Clone only if the repository is not already present
![ -d data-juicer ] || git clone --depth 1 https://github.com/datajuicer/data-juicer.git


In [2]:
# If running in Google Colab, use 'pip install' instead of 'uv pip install'
!uv pip install py-data-juicer[dev]

Resolved 216 packages in 1.62s
Preparing packages... (0/54)

## Example: Split Dataset by Language

In [3]:
import json
import os

# Create sample multilingual dataset
os.makedirs('./data/raw', exist_ok=True)

samples = [
    {"text": "This is an English text sample.", "id": 1},
    {"text": "Ceci est un texte en français.", "id": 2},
    {"text": "Another English sample for testing.", "id": 3},
    {"text": "这是一个中文文本示例。", "id": 4},
    {"text": "Machine learning is transforming industries.", "id": 5},
    {"text": "Bonjour le monde!", "id": 6}
]

with open('./data/raw/multilingual.jsonl', 'w') as f:
    for sample in samples:
        f.write(json.dumps(sample) + '\n')

print(f"Created multilingual dataset with {len(samples)} samples")

Created multilingual dataset with 6 samples


In [4]:
# Split by language
!python data-juicer/tools/preprocess/dataset_split_by_language.py \
    --src_dir ./data/raw \
    --target_dir ./data/split_by_lang \
    --suffixes jsonl \
    --text_key text \
    --num_proc 2

2026-02-12 09:37:28.060 | INFO     | data_juicer.core.data.dataset_builder:__init__:48 - found dataset_path setting: ./data/raw
2026-02-12 09:37:28.060 | INFO     | data_juicer.core.data.load_strategy:get_strategy_class:84 - Getting strategy class for exec: default, data_type: local, data_source: None
INFO:httpx:HTTP Request: HEAD https://s3.amazonaws.com/datasets.huggingface.co/datasets/datasets/json/json.py "HTTP/1.1 200 OK"
Setting num_proc from 2 back to 1 for the jsonl split to disable multiprocessing as it only contains one shard.
Generating jsonl split: 6 examples [00:00, 893.23 examples/s]
2026-02-12 09:37:28.318 | INFO     | data_juicer.format.formatter:unify_format:174 - Unifying the input dataset formats...

In [5]:
# Check split results
import os
import json

split_dir = './data/split_by_lang'
if os.path.exists(split_dir):
    for filename in os.listdir(split_dir):
        if filename.endswith('.jsonl'):
            filepath = os.path.join(split_dir, filename)
            with open(filepath, 'r', encoding='utf-8') as f:
                samples = [json.loads(line) for line in f]
            print(f"\n{filename}: {len(samples)} samples")
            for sample in samples[:2]:  # Show first 2
                print(f"  - {sample['text'][:50]}...")


zh.jsonl: 1 samples
  - 这是一个中文文本示例。...

en.jsonl: 3 samples
  - This is an English text sample....
  - Another English sample for testing....

fr.jsonl: 2 samples
  - Ceci est un texte en français....
  - Bonjour le monde!...
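
Conceptually, the split groups samples by a detected language label and writes one JSONL file per label. The sketch below illustrates only that grouping step. `toy_detect_language` is a hypothetical stand-in (a crude character and keyword heuristic), not the language-identification logic that `dataset_split_by_language.py` actually uses, and `split_by_language` is an illustrative name rather than a Data-Juicer function.

```python
import json
import os
import tempfile
from collections import defaultdict


def toy_detect_language(text):
    """Hypothetical stand-in for a real language-ID model (illustration only)."""
    # Any CJK Unified Ideograph marks the text as Chinese.
    if any('\u4e00' <= ch <= '\u9fff' for ch in text):
        return 'zh'
    # A few French markers; far too crude for real data.
    lowered = text.lower()
    if any(marker in lowered for marker in ('ç', 'bonjour', ' est ')):
        return 'fr'
    return 'en'


def split_by_language(samples, target_dir, text_key='text'):
    """Group samples by detected language and write one JSONL file per group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[toy_detect_language(sample[text_key])].append(sample)

    os.makedirs(target_dir, exist_ok=True)
    for lang, items in groups.items():
        path = os.path.join(target_dir, f'{lang}.jsonl')
        with open(path, 'w', encoding='utf-8') as f:
            for item in items:
                f.write(json.dumps(item, ensure_ascii=False) + '\n')
    return {lang: len(items) for lang, items in groups.items()}


demo_samples = [
    {"text": "This is an English text sample.", "id": 1},
    {"text": "Ceci est un texte en français.", "id": 2},
    {"text": "这是一个中文文本示例。", "id": 4},
    {"text": "Bonjour le monde!", "id": 6},
]
# Write into a throwaway directory so the real ./data folder is untouched.
demo_counts = split_by_language(demo_samples, tempfile.mkdtemp())
print(demo_counts)
```

For real data, rely on the script itself: a keyword heuristic like this one misclassifies anything outside its tiny marker list.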


## Example: Serialize Metadata

In [6]:
# Create dataset with complex metadata
samples_with_meta = [
    {
        "text": "Sample text one",
        "meta": {
            "source": "web",
            "date": "2024-01-01",
            "author": "user123",
            "tags": ["tech", "ai"]
        }
    },
    {
        "text": "Sample text two",
        "meta": {
            "source": "social",
            "date": "2024-01-02",
            "author": "user456",
            "tags": ["news"]
        }
    }
]

os.makedirs('./data/with_meta', exist_ok=True)
with open('./data/with_meta/data.jsonl', 'w') as f:
    for sample in samples_with_meta:
        f.write(json.dumps(sample) + '\n')

print("Created dataset with metadata")

Created dataset with metadata


In [7]:
# Serialize metadata to string
!python data-juicer/tools/preprocess/serialize_meta.py \
    --src_dir ./data/with_meta \
    --target_dir ./data/serialized \
    --text_key text \
    --serialized_key meta_str \
    --num_proc 1

data/with_meta/data.jsonl


In [8]:
# Check serialized results
with open('./data/serialized/data.jsonl', 'r') as f:
    serialized = [json.loads(line) for line in f]

print("Serialized metadata:")
for i, sample in enumerate(serialized, 1):
    print(f"\n{i}. Text: {sample['text']}")
    print(f"   Meta (serialized): {sample.get('meta_str', 'N/A')[:100]}...")

Serialized metadata:

1. Text: Sample text one
   Meta (serialized): {"meta": {"source": "web", "date": "2024-01-01", "author": "user123", "tags": ["tech", "ai"]}}...

2. Text: Sample text two
   Meta (serialized): {"meta": {"source": "social", "date": "2024-01-02", "author": "user456", "tags": ["news"]}}...
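
Judging from the output above, every field other than the text field ends up packed into one JSON string stored under the serialized key. The sketch below reproduces that behavior in plain Python. It is inferred from the printed result, not taken from the source of `serialize_meta.py`, and both function names are hypothetical.

```python
import json


def serialize_non_text_fields(sample, text_key='text', serialized_key='meta_str'):
    """Return a new sample with all non-text fields packed into one JSON string."""
    extra = {k: v for k, v in sample.items() if k != text_key}
    return {text_key: sample[text_key], serialized_key: json.dumps(extra)}


def deserialize_fields(sample, serialized_key='meta_str'):
    """Inverse step: expand the JSON string back into top-level fields."""
    restored = {k: v for k, v in sample.items() if k != serialized_key}
    restored.update(json.loads(sample[serialized_key]))
    return restored


original = {
    "text": "Sample text one",
    "meta": {"source": "web", "date": "2024-01-01", "author": "user123", "tags": ["tech", "ai"]},
}
packed = serialize_non_text_fields(original)
print(packed["meta_str"])
```

Packing nested or irregular metadata into a single string gives every sample the same flat schema, and `json.loads` recovers the original structure whenever it is needed again.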


## Practical Workflow

A typical pre-processing workflow:

1. **Raw Data Collection**: Gather data from various sources
2. **Format Conversion**: Convert to JSONL (if needed)
3. **Language Splitting**: Separate by language for targeted processing
4. **Metadata Handling**: Serialize complex metadata
5. **Main Pipeline**: Feed into Data-Juicer processing pipeline
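
As a small illustration of the format-conversion step, the sketch below turns CSV text into JSONL lines using only the standard library. The column name `body` and the function `csv_to_jsonl_lines` are hypothetical. For arXiv or Stack Exchange dumps, the dedicated scripts listed earlier are the better choice.

```python
import csv
import io
import json


def csv_to_jsonl_lines(csv_text, text_column='body'):
    """Convert CSV text to JSONL lines, renaming `text_column` to "text"."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {"text": row.pop(text_column)}
        record.update(row)  # keep the remaining columns as extra fields
        lines.append(json.dumps(record, ensure_ascii=False))
    return lines


raw_csv = "id,body\n1,Hello world\n2,Bonjour le monde\n"
jsonl_lines = csv_to_jsonl_lines(raw_csv)
print("\n".join(jsonl_lines))
```

Note that `csv.DictReader` yields every value as a string, so numeric columns such as `id` stay strings unless they are converted explicitly.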

In [9]:
# Example: Complete pre-processing + main pipeline
config = """project_name: 'preprocessed_data'
dataset_path: './data/split_by_lang/en.jsonl'  # Use English split
export_path: './outputs/preprocessed_final.jsonl'
np: 2

process:
  - text_length_filter:
      min_len: 10
      max_len: 500
  - alphanumeric_filter:
      min_ratio: 0.5
"""

os.makedirs('./configs', exist_ok=True)
with open('./configs/preprocess_pipeline.yaml', 'w') as f:
    f.write(config)

print("Pipeline config created")

Pipeline config created
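
To make the two configured filters concrete, here is a rough plain-Python approximation of what they check: a character-count window and a minimum share of alphanumeric characters. This is a sketch under the assumption that both filters operate on characters and use inclusive bounds. The function names are hypothetical and the real operators may differ in detail.

```python
def passes_text_length(text, min_len=10, max_len=500):
    """Keep text whose character count lies within [min_len, max_len]."""
    return min_len <= len(text) <= max_len


def passes_alphanumeric(text, min_ratio=0.5):
    """Keep text whose share of alphanumeric characters is at least min_ratio."""
    if not text:
        return False
    ratio = sum(ch.isalnum() for ch in text) / len(text)
    return ratio >= min_ratio


candidates = [
    "This is an English text sample.",  # passes both checks
    "Too short",                        # 9 characters, below min_len
    "!!! ??? ### $$$ %%% ^^^ &&&",      # no alphanumeric characters
]
kept = [t for t in candidates if passes_text_length(t) and passes_alphanumeric(t)]
print(kept)
```

Only the first candidate survives: the second fails the length window and the third fails the alphanumeric ratio.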


In [10]:
# Run main pipeline on pre-processed data
!dj-process --config ./configs/preprocess_pipeline.yaml

2026-02-12 09:37:39.021 | INFO     | data_juicer.config.config:695 - dataset_path config is set and a valid local path
2026-02-12 09:37:39.055 | INFO     | data_juicer.config.config:1012 - Back up the input config file [/workspaces/data-juicer-hub/configs/preprocess_pipeline.yaml] into the work_dir [/workspaces/data-juicer-hub/outputs]
2026-02-12 09:37:39.060 | INFO     | data_juicer.config.config:1033 - Configuration table: 
╒══════════════════════════╤════════════════════════════════════════════════════════════════════════════════╕
│ key                      │ values                                                                         │
╞══════════════════════════╪════════════════════════════════════════════════════════════════════════════════╡
│ config                   │ [Path_fr(./configs/preprocess_pipeline.yaml, cwd=/workspaces/data-juicer-hub)] │

## Cleanup

After completing the pre-processing tasks, clean up the cloned repository to save space.

In [11]:
# Remove cloned Data-Juicer repository
!rm -rf data-juicer
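
The examples in this chapter also create `./data`, `./configs`, and `./outputs`. If those are no longer needed, a helper along the following lines could remove them as well. `remove_generated_dirs` is a hypothetical name, and the demonstration runs against a throwaway directory so that nothing in the real workspace is deleted by accident.

```python
import shutil
import tempfile
from pathlib import Path


def remove_generated_dirs(base_dir, names=('data', 'configs', 'outputs')):
    """Delete the named subdirectories of base_dir; return the ones removed."""
    removed = []
    for name in names:
        path = Path(base_dir) / name
        if path.is_dir():
            shutil.rmtree(path)
            removed.append(name)
    return removed


# Demonstrate on a scratch directory rather than the real workspace.
scratch = Path(tempfile.mkdtemp())
(scratch / 'data' / 'raw').mkdir(parents=True)
(scratch / 'outputs').mkdir()
removed_dirs = remove_generated_dirs(scratch)
print(removed_dirs)
```

Before pointing this at the working directory, check that `./data` holds nothing besides the files generated here.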

## Further Reading

- [Pre-processing Tools Documentation](https://datajuicer.github.io/data-juicer/en/main/tools/preprocess/README.html)
- [Pre-processing Scripts Source Code](https://github.com/datajuicer/data-juicer/blob/main/tools/preprocess/)
- [Format Conversion Tools](https://datajuicer.github.io/data-juicer/en/main/tools/fmt_conversion/README.html)