# Chapter 10: Advanced Dataset Configuration

**Data-Juicer User Guide**

- Git Commit: `v1.4.5`
- Commit Date: 2026-01-16
- Repository: https://github.com/datajuicer/data-juicer

# Table of Contents

1. [Advanced Loading with DatasetCfg](#advanced-loading-with-datasetcfg)
2. [Example 1: Sampling from Large Dataset](#example-1-sampling-from-large-dataset)
3. [Example 2: Data Mixing with Weights](#example-2-data-mixing-with-weights)
4. [Further Reading](#further-reading)

## Advanced Loading with DatasetCfg

For complex data loading scenarios, Data-Juicer provides `DatasetCfg` to handle:
- Multiple data sources with different weights
- Sampling from large datasets
- Remote dataset loading

See [DatasetCfg](https://datajuicer.github.io/data-juicer/en/main/docs/DatasetCfg.html) for complete documentation.
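
As a rough illustration of the configuration shape used throughout this chapter, the sketch below assembles the `dataset` mapping from a list of sources. The helper name `make_dataset_cfg` is hypothetical and not part of Data-Juicer; only the keys `max_sample_num`, `configs`, `type`, `path`, and `weight` come from the examples in this chapter.

```python
def make_dataset_cfg(sources, max_sample_num=None):
    """Build the `dataset` mapping used in this chapter's examples.

    `sources` is a list of (path, weight) pairs; pass weight=None to omit it.
    This is a hypothetical convenience helper, not a Data-Juicer API.
    """
    configs = []
    for path, weight in sources:
        entry = {"type": "local", "path": path}
        if weight is not None:
            entry["weight"] = weight
        configs.append(entry)

    dataset = {"configs": configs}
    if max_sample_num is not None:
        dataset["max_sample_num"] = max_sample_num
    return {"dataset": dataset}


cfg_dict = make_dataset_cfg(
    [("./data/mix/en.jsonl", 0.7), ("./data/mix/zh.jsonl", 0.3)],
    max_sample_num=10,
)
print(cfg_dict)
```

The resulting dictionary has the same shape as the mappings passed to `Namespace` in the examples that follow.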

In [None]:
import json
import os

In [None]:
# Install Data-Juicer (if not installed)
# !uv pip install py-data-juicer

### Example 1: Sampling from Large Dataset

This example demonstrates how to sample a specific number of items from a large dataset.

In [None]:
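# Create the working directories used in this chapter
os.makedirs('./data', exist_ok=True)
os.makedirs('./configs', exist_ok=True)

# Create a 100-sample dataset to sample from
large_data = [{"text": f"Sample text number {i}"} for i in range(100)]
with open('./data/large.jsonl', 'w', encoding='utf-8') as f:
    for item in large_data:
        f.write(json.dumps(item) + '\n')

print(f"Created dataset with {len(large_data)} samples")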

#### Loading via Python API
Use `DatasetBuilder` to check that `max_sample_num` limits the number of loaded samples.

In [None]:
from jsonargparse import Namespace
from data_juicer.core.data.dataset_builder import DatasetBuilder

cfg = Namespace({
    'dataset': {
        'max_sample_num': 15,  # Load only 15 samples
        'configs': [
            {
                'type': 'local',
                'path': './data/large.jsonl'
            }
        ]
    }
})

builder = DatasetBuilder(cfg)
ds = builder.load_dataset()

print(f"Original samples: {len(large_data)}")
print(f"Loaded samples: {len(ds)}")
print(f"First sample: {ds[0]}")

#### Equivalent YAML Configuration for CLI Usage

In [None]:
%%writefile configs/sample_config.yaml
project_name: 'sampling_demo'
export_path: './outputs/sampled.jsonl'
np: 1

dataset:
  max_sample_num: 15
  configs:
    - type: 'local'
      path: './data/large.jsonl'

In [None]:
!dj-process --config ./configs/sample_config.yaml

#### Result Verification

In [None]:
with open('./outputs/sampled.jsonl', 'r') as f:
    sampled_count = sum(1 for line in f)
print(f"Original: {len(large_data)} samples")
print(f"Processed & Sampled: {sampled_count} samples")
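
The same check can be wrapped in a small reusable helper. The sketch below is a minimal version that counts the records in a JSONL file and compares the count against a cap; the name `count_jsonl_records` is hypothetical, and the demo writes to a temporary file so it does not depend on the output of `dj-process`.

```python
import json
import os
import tempfile


def count_jsonl_records(path):
    """Count non-empty lines in a JSONL file, checking that each parses as JSON."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                json.loads(line)  # raises ValueError on a malformed record
                count += 1
    return count


# Demo on a throwaway file with 15 records
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as f:
        for i in range(15):
            f.write(json.dumps({"text": f"Sample text number {i}"}) + "\n")
    demo_count = count_jsonl_records(demo_path)

max_sample_num = 15
assert demo_count <= max_sample_num, "more records than the configured cap"
print(f"Counted {demo_count} records")
```

Pointing the helper at `./outputs/sampled.jsonl` instead of the temporary file would apply the same check to the exported result.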

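### Example 2: Data Mixing with Weights

This example mixes an English and a Chinese dataset using per-source `weight` values (0.7 and 0.3), with `max_sample_num` set to 10 for the combined result.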

In [None]:
os.makedirs('./data/mix', exist_ok=True)

en_data = [
    {"text": "Deep learning models require large amounts of training data."},
    {"text": "Attention mechanisms help models focus on relevant parts of input."},
    {"text": "Fine-tuning adapts pre-trained models to specific tasks."},
    {"text": "Batch normalization stabilizes neural network training."},
    {"text": "Transfer learning leverages knowledge from one domain to another."},
    {"text": "Loss functions measure the difference between predictions and targets."},
    {"text": "Backpropagation computes gradients for model optimization."},
    {"text": "Embeddings represent words or tokens as dense vectors."},
    {"text": "Overfitting can be mitigated with dropout or regularization."},
    {"text": "The transformer architecture enables parallel sequence processing."}
]

zh_data = [
    {"text": "今天天气晴朗，适合外出散步。"},
    {"text": "多读书可以开阔视野，增长知识。"},
    {"text": "保持规律作息对身体健康非常重要。"},
    {"text": "与家人共度时光是幸福的源泉。"},
    {"text": "学习新技能需要坚持和耐心。"},
    {"text": "听音乐有助于缓解压力和焦虑。"}
]

with open('./data/mix/en.jsonl', 'w', encoding='utf-8') as f:
    for item in en_data:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

with open('./data/mix/zh.jsonl', 'w', encoding='utf-8') as f:
    for item in zh_data:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

print("✅ Created bilingual datasets for clear mixing verification")

#### Loading via Python API

In [None]:
from jsonargparse import Namespace
from data_juicer.core.data.dataset_builder import DatasetBuilder

cfg = Namespace({
    'dataset': {
        'max_sample_num': 10,
        'configs': [
            {
                'type': 'local',
                'path': './data/mix/en.jsonl',
                'weight': 0.7
            },
            {
                'type': 'local',
                'path': './data/mix/zh.jsonl',
                'weight': 0.3
            }
        ]
    }
})

builder = DatasetBuilder(cfg)
ds = builder.load_dataset()

print(f"Loaded {len(ds)} samples")
for sample in ds:
    print(sample)
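
To build intuition for what a 0.7/0.3 split of 10 samples looks like, the sketch below divides a sample budget across sources in proportion to their weights, using largest-remainder rounding. This is an illustrative assumption about proportional allocation, not a statement of the rule Data-Juicer applies internally; the function name `allocate_by_weight` is hypothetical.

```python
from fractions import Fraction
from math import floor


def allocate_by_weight(weights, max_sample_num):
    """Split max_sample_num across sources in proportion to their weights.

    Uses largest-remainder rounding so the counts always sum to max_sample_num.
    Illustrative only; not taken from the Data-Juicer source code.
    """
    # Parse via str() so that 0.7 becomes exactly 7/10 rather than a binary float.
    exact = [Fraction(str(w)) for w in weights]
    total = sum(exact)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    shares = [max_sample_num * w / total for w in exact]
    counts = [floor(s) for s in shares]

    # Hand any remaining slots to the sources with the largest fractional parts.
    leftover = max_sample_num - sum(counts)
    order = sorted(range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


print(allocate_by_weight([0.7, 0.3], 10))  # [7, 3]
```

Under this assumption, the mix above would contain 7 English and 3 Chinese samples; compare that with the output printed by the cell above.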

#### Equivalent YAML Configuration for CLI Usage

In [None]:
%%writefile configs/mix_config.yaml
project_name: 'mix'
export_path: './outputs/mix/mixed.jsonl'
np: 1

dataset:
  max_sample_num: 10
  configs:
    - type: 'local'
      path: './data/mix/en.jsonl'
      weight: 0.7
    - type: 'local'
      path: './data/mix/zh.jsonl'
      weight: 0.3

In [None]:
!dj-process --config ./configs/mix_config.yaml

#### Result Verification

In [None]:
# Read as UTF-8 so the Chinese samples decode correctly on any platform
with open('./outputs/mix/mixed.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        print(json.loads(line))
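
Printing every record works for ten samples, but a tally makes the mixing ratio easier to read. The sketch below labels each record as Chinese if its text contains a CJK ideograph and as English otherwise. This is a rough heuristic that suits the two datasets in this example, not a general language detector; the helper names are hypothetical.

```python
def contains_cjk(text):
    """Return True if any character falls in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def count_by_script(records, text_key="text"):
    """Tally records as 'zh' (contains CJK ideographs) or 'en' (everything else)."""
    counts = {"en": 0, "zh": 0}
    for record in records:
        label = "zh" if contains_cjk(record[text_key]) else "en"
        counts[label] += 1
    return counts


# In-memory demo; in the notebook, pass the records parsed from mixed.jsonl instead.
demo_records = [
    {"text": "Fine-tuning adapts pre-trained models to specific tasks."},
    {"text": "学习新技能需要坚持和耐心。"},
    {"text": "Embeddings represent words or tokens as dense vectors."},
]
print(count_by_script(demo_records))  # {'en': 2, 'zh': 1}
```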

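#### Combining with Operators

Mixing can be combined with operators in the same configuration. The config below applies `language_id_score_filter` with `lang: 'en'` and `min_score: 0.5` to the mixed dataset, which should keep only the samples identified as English.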

In [None]:
%%writefile configs/mix_process.yaml
project_name: 'mix_process'
export_path: './outputs/mix_processed/mixed.jsonl'
np: 1
dataset:
  max_sample_num: 10
  configs:
    - type: 'local'
      path: './data/mix/en.jsonl'
      weight: 0.7
    - type: 'local'
      path: './data/mix/zh.jsonl'
      weight: 0.3

process:
  - language_id_score_filter:
      lang: 'en'
      min_score: 0.5

In [None]:
!dj-process --config ./configs/mix_process.yaml

In [None]:
# Read as UTF-8 for consistency with how the source files were written
with open('./outputs/mix_processed/mixed.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        print(json.loads(line))

## Further Reading

- [DatasetCfg Documentation](https://datajuicer.github.io/data-juicer/en/main/docs/DatasetCfg.html)
- [Format Conversion Tools](https://datajuicer.github.io/data-juicer/en/main/tools/fmt_conversion/README.html)
- [Complete Configuration Reference](https://github.com/datajuicer/data-juicer/blob/main/data_juicer/config/config_all.yaml)