# Chapter 9: Advanced Dataset Configuration

**Data-Juicer User Guide**

- Git Commit: `v1.0.5`
- Commit Date: 2026-01-16
- Repository: https://github.com/datajuicer/data-juicer

## Advanced Loading with DatasetCfg

For complex data loading scenarios, Data-Juicer provides `DatasetCfg` to handle:
- Multiple data sources with different weights
- Sampling from large datasets
- Field mapping and transformation
- Remote dataset loading

See [DatasetCfg](https://datajuicer.github.io/data-juicer/en/main/docs/DatasetCfg.html) for complete documentation.

### Example 1: Data Mixing with Weights

In [None]:
import json
import os

# Create two datasets
os.makedirs('./data/mix', exist_ok=True)

# Dataset 1: Technical content
tech_data = [
    {"text": "Machine learning algorithms are transforming industries."},
    {"text": "Neural networks can learn complex patterns from data."}
]
with open('./data/mix/tech.jsonl', 'w') as f:
    for item in tech_data:
        f.write(json.dumps(item) + '\n')

# Dataset 2: General content
general_data = [
    {"text": "The weather is beautiful today."},
    {"text": "Reading books expands your knowledge."}
]
with open('./data/mix/general.jsonl', 'w') as f:
    for item in general_data:
        f.write(json.dumps(item) + '\n')

print("Created two datasets for mixing")

In [None]:
# Mix datasets with weights
mix_config = """project_name: 'data_mixing'
export_path: './outputs/mixed.jsonl'
np: 1

# DatasetCfg: Mix with weights
dataset:
  configs:
    - type: 'local'
      path: './data/mix/tech.jsonl'
      weight: 0.7  # 70% technical content
    - type: 'local'
      path: './data/mix/general.jsonl'
      weight: 0.3  # 30% general content

process:
  - text_length_filter:
      min_len: 10
      max_len: 200
"""

os.makedirs('./configs', exist_ok=True)
with open('./configs/mix.yaml', 'w') as f:
    f.write(mix_config)

!dj-process --config ./configs/mix.yaml
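
One way to see how the weights played out is to count how many exported records came from each source. The sketch below is plain Python, not a Data-Juicer API: `load_jsonl` and `count_by_source` are hypothetical helpers, and the approach assumes the exported records keep their original `text` values. With only two samples per source, the counts will not necessarily show a 70/30 split.

```python
import json
import os
from collections import Counter

def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    with open(path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

def count_by_source(mixed, sources):
    """Count mixed records per named source by matching on the 'text' value."""
    text_to_source = {}
    for name, records in sources.items():
        for rec in records:
            text_to_source[rec["text"]] = name
    # Records whose text matches no source are counted as "unknown"
    return Counter(text_to_source.get(rec.get("text"), "unknown") for rec in mixed)

# Paths taken from the example above; the check runs only if all files exist
paths = {
    "mixed": './outputs/mixed.jsonl',
    "tech": './data/mix/tech.jsonl',
    "general": './data/mix/general.jsonl',
}
if all(os.path.exists(p) for p in paths.values()):
    counts = count_by_source(
        load_jsonl(paths["mixed"]),
        {"tech": load_jsonl(paths["tech"]), "general": load_jsonl(paths["general"])},
    )
    print(dict(counts))
```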

### Example 2: Sampling from Large Dataset

In [None]:
# Create large dataset
large_data = [{"text": f"Sample text number {i}"} for i in range(100)]
with open('./data/large.jsonl', 'w') as f:
    for item in large_data:
        f.write(json.dumps(item) + '\n')

print(f"Created dataset with {len(large_data)} samples")

In [None]:
# Sample subset
sample_config = """project_name: 'sampling'
export_path: './outputs/sampled.jsonl'
np: 1

dataset:
  configs:
    - type: 'local'
      path: './data/large.jsonl'
  max_sample_num: 20  # Sample only 20 items

process:
  - text_length_filter:
      min_len: 5
      max_len: 100
"""

with open('./configs/sample.yaml', 'w') as f:
    f.write(sample_config)

!dj-process --config ./configs/sample.yaml

In [None]:
# Check sampled results
with open('./outputs/sampled.jsonl', 'r') as f:
    sampled = [json.loads(line) for line in f]

print(f"Original: {len(large_data)} samples")
print(f"Sampled: {len(sampled)} samples")
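
A slightly stricter check confirms that the export respects the limit and contains only records from the source file. This is a sketch in plain Python: `check_sample` is a hypothetical helper, and it assumes the operators in the config leave the `text` field unchanged.

```python
def check_sample(sampled, original, max_sample_num):
    """Report whether the sample respects the limit and is drawn from the original."""
    original_texts = {rec["text"] for rec in original}
    return {
        "within_limit": len(sampled) <= max_sample_num,
        "is_subset": all(rec.get("text") in original_texts for rec in sampled),
    }

# Small in-memory demonstration with stand-in data
demo_original = [{"text": f"Sample text number {i}"} for i in range(100)]
demo_sampled = demo_original[:20]
print(check_sample(demo_sampled, demo_original, max_sample_num=20))
```

In the notebook, the `sampled` list loaded above and `large_data` can be passed in place of the demo values.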

### Example 3: Field Mapping

In [None]:
# Create dataset with custom field names
custom_data = [
    {"content": "This uses 'content' instead of 'text'", "doc_id": 1},
    {"content": "Field mapping helps standardize datasets", "doc_id": 2}
]
with open('./data/custom_fields.jsonl', 'w') as f:
    for item in custom_data:
        f.write(json.dumps(item) + '\n')

print("Created dataset with custom field names")

In [None]:
# Point operators at the 'content' field via text_keys (the field itself is not renamed)
mapping_config = """project_name: 'field_mapping'
dataset_path: './data/custom_fields.jsonl'
export_path: './outputs/mapped.jsonl'
text_keys: 'content'  # Specify which field contains text
np: 1

process:
  - text_length_filter:
      min_len: 10
      max_len: 200
"""

with open('./configs/mapping.yaml', 'w') as f:
    f.write(mapping_config)

!dj-process --config ./configs/mapping.yaml
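
In this configuration `text_keys` points operators at the `content` field; it is not expected to rename the field in the exported records. If downstream tools need a literal `text` field, a preprocessing pass could rename it before running Data-Juicer. A minimal sketch in plain Python, where `rename_field` is a hypothetical helper and not part of Data-Juicer:

```python
import json

def rename_field(records, old_key, new_key):
    """Return copies of records with old_key renamed to new_key; other fields are kept."""
    renamed = []
    for rec in records:
        rec = dict(rec)  # shallow copy so the input list is not mutated
        if old_key in rec:
            rec[new_key] = rec.pop(old_key)
        renamed.append(rec)
    return renamed

# Stand-in record shaped like the custom_fields.jsonl entries above
demo = [{"content": "Field mapping helps standardize datasets", "doc_id": 2}]
for rec in rename_field(demo, "content", "text"):
    print(json.dumps(rec))
```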

## Loading Remote Datasets (HuggingFace)

In [None]:
# Example configuration for HuggingFace dataset
hf_config = """project_name: 'hf_dataset'
export_path: './outputs/hf_processed.jsonl'
np: 2

dataset:
  configs:
    - type: 'remote'
      source: 'huggingface'
      path: 'HuggingFaceFW/fineweb'
      name: 'CC-MAIN-2024-10'
      split: 'train'
      limit: 1000  # Load only 1000 samples

process:
  - language_id_score_filter:
      lang: 'en'
      min_score: 0.8
  - text_length_filter:
      min_len: 50
      max_len: 1000
"""

print("HuggingFace dataset configuration:")
print(hf_config)
print("\nNote: This requires an internet connection and HuggingFace access")
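
The cell above only prints the configuration. To run it, the config could be written to disk the same way as in the earlier examples. A sketch, assuming a hypothetical helper `write_config` and a hypothetical file name `./configs/hf.yaml`:

```python
import os

def write_config(config_text, path):
    """Write a YAML config string to path, creating parent directories as needed."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, 'w', encoding='utf-8') as f:
        f.write(config_text)
    return path

# In the notebook (file name is an assumption):
# write_config(hf_config, './configs/hf.yaml')
```

After that, `!dj-process --config ./configs/hf.yaml` would run it like the earlier examples, given network access.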

## Further Reading

- [DatasetCfg Documentation](https://datajuicer.github.io/data-juicer/en/main/docs/DatasetCfg.html)
- [Format Conversion Tools](https://datajuicer.github.io/data-juicer/en/main/tools/fmt_conversion/README.html)
- [Complete Configuration Reference](https://github.com/datajuicer/data-juicer/blob/main/data_juicer/config/config_all.yaml)