# Chapter 10: Advanced Dataset Configuration

**Data-Juicer User Guide**

- Git Commit: `v1.4.6`
- Commit Date: 2026-02-02
- Repository: https://github.com/datajuicer/data-juicer

# Table of Contents

1. [Advanced Loading with DatasetCfg](#advanced-loading-with-datasetcfg)
2. [Example 1: Sampling from Large Dataset](#example-1-sampling-from-large-dataset)
3. [Example 2: Data Mixing with Weights](#example-2-data-mixing-with-weights)
4. [Further Reading](#further-reading)

## Advanced Loading with DatasetCfg

For complex data loading scenarios, Data-Juicer provides `DatasetCfg` to handle:
- Multiple data sources with different weights
- Sampling from large datasets
- Remote dataset loading

See [DatasetCfg](https://datajuicer.github.io/data-juicer/en/main/docs/DatasetCfg.html) for complete documentation.
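
The shape of a `dataset` block can be sanity-checked before it is handed to Data-Juicer. The helper below is a minimal sketch, not part of Data-Juicer: `validate_dataset_cfg` is a hypothetical name, and the rules it enforces (a positive `max_sample_num`, a `type` and `path` for every source, a positive `weight` that defaults to 1.0) are assumptions drawn from the examples in this chapter rather than from the library's own validation.

```python
def validate_dataset_cfg(dataset_cfg):
    """Return a list of problems found in a `dataset` config dict (empty if none).

    Hypothetical helper: the rules are assumptions based on this chapter's examples.
    """
    problems = []

    max_n = dataset_cfg.get("max_sample_num")
    if max_n is not None and (not isinstance(max_n, int) or max_n <= 0):
        problems.append("max_sample_num must be a positive integer")

    configs = dataset_cfg.get("configs") or []
    if not configs:
        problems.append("configs must list at least one data source")

    for idx, source in enumerate(configs):
        for key in ("type", "path"):
            if key not in source:
                problems.append(f"configs[{idx}] is missing '{key}'")
        weight = source.get("weight", 1.0)  # assumed default when no weight is given
        if not isinstance(weight, (int, float)) or weight <= 0:
            problems.append(f"configs[{idx}] has a non-positive weight")

    return problems


good_cfg = {
    "max_sample_num": 10,
    "configs": [
        {"type": "local", "path": "./data/mix/en.jsonl", "weight": 0.7},
        {"type": "local", "path": "./data/mix/zh.jsonl", "weight": 0.3},
    ],
}
broken_cfg = {"configs": [{"type": "local", "weight": 0}]}

print(validate_dataset_cfg(good_cfg))    # → []
print(validate_dataset_cfg(broken_cfg))
```

A check like this catches typos in hand-written configs early; the authoritative rules remain those in the DatasetCfg documentation.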

In [1]:
import json
import os

In [2]:
# Install Data-Juicer (if not installed)
# If running in Google Colab, use 'pip install' instead of 'uv pip install'
# !uv pip install py-data-juicer

### Example 1: Sampling from Large Dataset

This example demonstrates how to sample a specific number of items from a large dataset.

In [3]:
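# Create the directories used in this chapter, then write a 100-sample dataset
os.makedirs('./data', exist_ok=True)
os.makedirs('./configs', exist_ok=True)
large_data = [{"text": f"Sample text number {i}"} for i in range(100)]
with open('./data/large.jsonl', 'w', encoding='utf-8') as f:
    for item in large_data:
        f.write(json.dumps(item) + '\n')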

print(f"Created dataset with {len(large_data)} samples")

Created dataset with 100 samples


#### Loading via Python API
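Use `DatasetBuilder` to verify that `max_sample_num` limits the number of loaded samples. In the run below, the first loaded sample is number 59, so the 15 samples are not simply the first 15 lines of the file.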

In [4]:
from jsonargparse import Namespace
from data_juicer.core.data.dataset_builder import DatasetBuilder

cfg = Namespace({
    'dataset': {
        'max_sample_num': 15,  # Load only 15 samples
        'configs': [
            {
                'type': 'local',
                'path': './data/large.jsonl'
            }
        ]
    }
})

builder = DatasetBuilder(cfg)
ds = builder.load_dataset()

print(f"Original samples: {len(large_data)}")
print(f"Loaded samples: {len(ds)}")
print(f"First sample: {ds[0]}")

2026-02-12 09:39:24.769 | INFO     | data_juicer.core.data.dataset_builder:__init__:51 - found dataset setting: {'max_sample_num': 15, 'configs': [{'type': 'local', 'path': './data/large.jsonl'}]}
2026-02-12 09:39:24.770 | INFO     | data_juicer.core.data.load_strategy:get_strategy_class:84 - Getting strategy class for exec: default, data_type: local, data_source: None
INFO:httpx:HTTP Request: HEAD https://s3.amazonaws.com/datasets.huggingface.co/datasets/datasets/json/json.py "HTTP/1.1 200 OK"
Genera

Original samples: 100
Loaded samples: 15
First sample: {'text': 'Sample text number 59'}
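
For files too large to hold in memory, a fixed-size random sample can be drawn in a single pass. The sketch below uses reservoir sampling (Algorithm R) to illustrate the idea. It is not a description of how Data-Juicer implements `max_sample_num`, and `reservoir_sample` is a hypothetical helper name.

```python
import json
import random


def reservoir_sample(records, k, seed=None):
    """Pick k items uniformly at random from an iterable in one pass (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(record)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


# A generator stands in for reading a large JSONL file line by line.
stream = (json.dumps({"text": f"Sample text number {i}"}) for i in range(100))
picked = reservoir_sample(stream, k=15, seed=0)
print(len(picked))  # → 15
```

Because the input is consumed lazily, memory use stays proportional to `k` rather than to the size of the dataset.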


#### Equivalent YAML Configuration for CLI Usage

In [5]:
%%writefile configs/sample_config.yaml
project_name: 'sampling_demo'
export_path: './outputs/sampled.jsonl'
np: 1

dataset:
  max_sample_num: 15
  configs:
    - type: 'local'
      path: './data/large.jsonl'

Writing configs/sample_config.yaml


In [6]:
!dj-process --config ./configs/sample_config.yaml

2026-02-12 09:39:30.663 | INFO     | data_juicer.config.config:698 - dataset_path config is empty; dataset is present
2026-02-12 09:39:30.678 | INFO     | data_juicer.config.config:1012 - Back up the input config file [/workspaces/data-juicer-hub/configs/sample_config.yaml] into the work_dir [/workspaces/data-juicer-hub/outputs]
2026-02-12 09:39:30.681 | INFO     | data_juicer.config.config:1033 - Configuration table: 
╒══════════════════════════╤══════════════════════════════════════════════════════════════════════════╕
│ key                      │ values                                                                   │
╞══════════════════════════╪══════════════════════════════════════════════════════════════════════════╡
│ config                   │ [Path_fr(./configs/sample_config.yaml, cwd=/workspaces/data-juicer-hub)] │
├──────────────────────────┼────

#### Result Verification

In [7]:
with open('./outputs/sampled.jsonl', 'r') as f:
    sampled_count = sum(1 for line in f)
print(f"Original: {len(large_data)} samples")
print(f"Sampled: {sampled_count} samples")

Original: 100 samples
Sampled: 15 samples
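
Counting lines confirms the size of the export but not its contents. The sketch below adds two further checks: every exported record also appears in the original file, and no record appears twice. `check_sample` is a hypothetical helper, and it assumes the exported records are identical to the originals, which may not hold if a pipeline adds or rewrites fields. Temporary files stand in for `./data/large.jsonl` and `./outputs/sampled.jsonl`.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Load a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def check_sample(original_path, sampled_path, expected_count):
    """Compare a sampled JSONL file against its source file."""
    def key(record):
        # Canonical string form so dicts can be compared and stored in a set.
        return json.dumps(record, sort_keys=True, ensure_ascii=False)

    original = {key(r) for r in read_jsonl(original_path)}
    sampled = [key(r) for r in read_jsonl(sampled_path)]
    return {
        "count_ok": len(sampled) == expected_count,
        "all_from_original": all(s in original for s in sampled),
        "no_duplicates": len(set(sampled)) == len(sampled),
    }


with tempfile.TemporaryDirectory() as tmp:
    rows = [{"text": f"Sample text number {i}"} for i in range(100)]
    original_path = os.path.join(tmp, "large.jsonl")
    sampled_path = os.path.join(tmp, "sampled.jsonl")
    for path, subset in ((original_path, rows), (sampled_path, rows[5:20])):
        with open(path, "w", encoding="utf-8") as f:
            for row in subset:
                f.write(json.dumps(row) + "\n")
    report = check_sample(original_path, sampled_path, expected_count=15)

print(report)
# → {'count_ok': True, 'all_from_original': True, 'no_duplicates': True}
```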


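### Example 2: Data Mixing with Weights

This example mixes an English and a Chinese dataset using per-source `weight` values. With `max_sample_num` set to 10 and weights of 0.7 and 0.3, the runs below return 7 English and 3 Chinese samples.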

In [8]:
os.makedirs('./data/mix', exist_ok=True)

en_data = [
    {"text": "Deep learning models require large amounts of training data."},
    {"text": "Attention mechanisms help models focus on relevant parts of input."},
    {"text": "Fine-tuning adapts pre-trained models to specific tasks."},
    {"text": "Batch normalization stabilizes neural network training."},
    {"text": "Transfer learning leverages knowledge from one domain to another."},
    {"text": "Loss functions measure the difference between predictions and targets."},
    {"text": "Backpropagation computes gradients for model optimization."},
    {"text": "Embeddings represent words or tokens as dense vectors."},
    {"text": "Overfitting can be mitigated with dropout or regularization."},
    {"text": "The transformer architecture enables parallel sequence processing."}
]

zh_data = [
    {"text": "今天天气晴朗，适合外出散步。"},
    {"text": "多读书可以开阔视野，增长知识。"},
    {"text": "保持规律作息对身体健康非常重要。"},
    {"text": "与家人共度时光是幸福的源泉。"},
    {"text": "学习新技能需要坚持和耐心。"},
    {"text": "听音乐有助于缓解压力和焦虑。"}
]

with open('./data/mix/en.jsonl', 'w', encoding='utf-8') as f:
    for item in en_data:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

with open('./data/mix/zh.jsonl', 'w', encoding='utf-8') as f:
    for item in zh_data:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

print("✅ Created bilingual datasets for clear mixing verification")

✅ Created bilingual datasets for clear mixing verification


#### Loading via Python API

In [9]:
from jsonargparse import Namespace
from data_juicer.core.data.dataset_builder import DatasetBuilder

cfg = Namespace({
    'dataset': {
        'max_sample_num': 10,
        'configs': [
            {
                'type': 'local',
                'path': './data/mix/en.jsonl',
                'weight': 0.7
            },
            {
                'type': 'local',
                'path': './data/mix/zh.jsonl',
                'weight': 0.3
            }
        ]
    }
})

builder = DatasetBuilder(cfg)
ds = builder.load_dataset()

print(f"Loaded {len(ds)} samples")
for sample in ds:
    print(sample)

2026-02-12 09:39:32.397 | INFO     | data_juicer.core.data.dataset_builder:__init__:51 - found dataset setting: {'max_sample_num': 10, 'configs': [{'type': 'local', 'path': './data/mix/en.jsonl', 'weight': 0.7}, {'type': 'local', 'path': './data/mix/zh.jsonl', 'weight': 0.3}]}
2026-02-12 09:39:32.397 | INFO     | data_juicer.core.data.load_strategy:get_strategy_class:84 - Getting strategy class for exec: default, data_type: local, data_source: None
2026-02-12 09:39:32.398 | INFO     | data_juicer.core.data.load_strategy:get_strategy_class:84 - Getting strategy class for exec: default, data_type: local, data_source: None
INFO:httpx:HTTP Request: HEAD https://s3.amazonaws.com/datasets.huggingface.co/datasets/datasets/json/json.py "HTTP/1.1 200 OK"
Generating jsonl split: 10 examples [00:00, 4893.03 examples/s]
2026-02-12 09:39:32

Loaded 10 samples
{'text': 'Loss functions measure the difference between predictions and targets.'}
{'text': 'Backpropagation computes gradients for model optimization.'}
{'text': 'Deep learning models require large amounts of training data.'}
{'text': 'Embeddings represent words or tokens as dense vectors.'}
{'text': 'Batch normalization stabilizes neural network training.'}
{'text': 'Fine-tuning adapts pre-trained models to specific tasks.'}
{'text': 'Transfer learning leverages knowledge from one domain to another.'}
{'text': '与家人共度时光是幸福的源泉。'}
{'text': '保持规律作息对身体健康非常重要。'}
{'text': '听音乐有助于缓解压力和焦虑。'}
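
The 7-to-3 split in the output above matches a simple proportional allocation of `max_sample_num` across the weights. The sketch below reproduces that arithmetic with largest-remainder rounding. It is an illustration consistent with this one run, not Data-Juicer's actual mixing logic, and `allocate_counts` is a hypothetical helper.

```python
from fractions import Fraction


def allocate_counts(weights, total):
    """Split `total` samples across sources in proportion to `weights`.

    Uses exact fractions plus largest-remainder rounding so the counts
    always sum to `total`.
    """
    fracs = [Fraction(str(w)) for w in weights]  # e.g. '0.7' becomes exactly 7/10
    weight_sum = sum(fracs)
    exact = [f * total / weight_sum for f in fracs]
    counts = [int(e) for e in exact]  # floor, since all values are non-negative
    leftover = total - sum(counts)
    # Hand the remaining samples to the sources with the largest fractional parts.
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


print(allocate_counts([0.7, 0.3], 10))  # → [7, 3]
```

Note that this sketch ignores source sizes. Here both requests fit (7 of 10 English samples, 3 of 6 Chinese samples); how Data-Juicer behaves when a source is smaller than its share is not shown in this chapter.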


#### Equivalent YAML Configuration for CLI Usage

In [10]:
%%writefile configs/mix_config.yaml
project_name: 'mix'
export_path: './outputs/mix/mixed.jsonl'
np: 1

dataset:
  max_sample_num: 10
  configs:
    - type: 'local'
      path: './data/mix/en.jsonl'
      weight: 0.7
    - type: 'local'
      path: './data/mix/zh.jsonl'
      weight: 0.3

Writing configs/mix_config.yaml


In [11]:
!dj-process --config ./configs/mix_config.yaml

2026-02-12 09:39:38.581 | INFO     | data_juicer.config.config:698 - dataset_path config is empty; dataset is present
2026-02-12 09:39:38.596 | INFO     | data_juicer.config.config:1012 - Back up the input config file [/workspaces/data-juicer-hub/configs/mix_config.yaml] into the work_dir [/workspaces/data-juicer-hub/outputs/mix]
2026-02-12 09:39:38.599 | INFO     | data_juicer.config.config:1033 - Configuration table: 
╒══════════════════════════╤════════════════════════════════════════════════════════════════════════════════╕
│ key                      │ values                                                                         │
╞══════════════════════════╪════════════════════════════════════════════════════════════════════════════════╡
│ config                   │ [Path_fr(./configs/mix_config.yaml, cwd=/workspaces/data-juicer-hub)]          │
├──────

#### Result Verification

In [12]:
with open('./outputs/mix/mixed.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        print(json.loads(line))

{'text': 'Loss functions measure the difference between predictions and targets.'}
{'text': 'Backpropagation computes gradients for model optimization.'}
{'text': 'Deep learning models require large amounts of training data.'}
{'text': 'Embeddings represent words or tokens as dense vectors.'}
{'text': 'Batch normalization stabilizes neural network training.'}
{'text': 'Fine-tuning adapts pre-trained models to specific tasks.'}
{'text': 'Transfer learning leverages knowledge from one domain to another.'}
{'text': '与家人共度时光是幸福的源泉。'}
{'text': '保持规律作息对身体健康非常重要。'}
{'text': '听音乐有助于缓解压力和焦虑。'}


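#### Combining with Operators

The same `dataset` block can be combined with a `process` list of operators. Here, `language_id_score_filter` keeps only samples identified as English (`lang: 'en'`) with a score of at least 0.5, so the three Chinese samples are dropped from the final output.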

In [13]:
%%writefile configs/mix_process.yaml
project_name: 'mix_process'
export_path: './outputs/mix_processed/mixed.jsonl'
np: 1
dataset:
  max_sample_num: 10
  configs:
    - type: 'local'
      path: './data/mix/en.jsonl'
      weight: 0.7
    - type: 'local'
      path: './data/mix/zh.jsonl'
      weight: 0.3

process:
  - language_id_score_filter:
      lang: 'en'
      min_score: 0.5

Writing configs/mix_process.yaml


In [14]:
!dj-process --config ./configs/mix_process.yaml

2026-02-12 09:39:45.972 | INFO     | data_juicer.config.config:698 - dataset_path config is empty; dataset is present
2026-02-12 09:39:45.995 | INFO     | data_juicer.config.config:1012 - Back up the input config file [/workspaces/data-juicer-hub/configs/mix_process.yaml] into the work_dir [/workspaces/data-juicer-hub/outputs/mix_processed]
2026-02-12 09:39:45.999 | INFO     | data_juicer.config.config:1033 - Configuration table: 
╒══════════════════════════╤═══════════════════════════════════════════════════════════════════════════════════════════════════╕
│ key                      │ values                                                                                            │
╞══════════════════════════╪═══════════════════════════════════════════════════════════════════════════════════════════════════╡
│ config                   │ [Path_fr(./configs/m

In [15]:
with open('./outputs/mix_processed/mixed.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        print(json.loads(line))

{'text': 'Loss functions measure the difference between predictions and targets.'}
{'text': 'Backpropagation computes gradients for model optimization.'}
{'text': 'Deep learning models require large amounts of training data.'}
{'text': 'Embeddings represent words or tokens as dense vectors.'}
{'text': 'Batch normalization stabilizes neural network training.'}
{'text': 'Fine-tuning adapts pre-trained models to specific tasks.'}
{'text': 'Transfer learning leverages knowledge from one domain to another.'}
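
The effect of a language filter with a score threshold can be mimicked with a toy heuristic. The sketch below scores each sample by the share of its letters that are ASCII and keeps those at or above `min_score`. This is only a stand-in to show the filtering pattern: the real `language_id_score_filter` operator relies on a language identification model, and `ascii_letter_ratio` and `keep_english` are hypothetical names.

```python
def ascii_letter_ratio(text):
    """Share of alphabetic characters that are ASCII (a crude 'English' score)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isascii() for ch in letters) / len(letters)


def keep_english(samples, min_score=0.5):
    """Keep samples whose toy score reaches the threshold."""
    return [s for s in samples if ascii_letter_ratio(s["text"]) >= min_score]


mixed = [
    {"text": "Fine-tuning adapts pre-trained models to specific tasks."},
    {"text": "听音乐有助于缓解压力和焦虑。"},
    {"text": "Embeddings represent words or tokens as dense vectors."},
]

for sample in keep_english(mixed):
    print(sample)
# → {'text': 'Fine-tuning adapts pre-trained models to specific tasks.'}
# → {'text': 'Embeddings represent words or tokens as dense vectors.'}
```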


## Further Reading

- [DatasetCfg Documentation](https://datajuicer.github.io/data-juicer/en/main/docs/DatasetCfg.html)
- [Format Conversion Tools](https://datajuicer.github.io/data-juicer/en/main/tools/fmt_conversion/README.html)
- [Complete Configuration Reference](https://github.com/datajuicer/data-juicer/blob/main/data_juicer/config/config_all.yaml)