# Chapter 4: DJ Dataset API

**Data-Juicer User Guide**

- Git Commit: `v1.4.5`
- Commit Date: 2026-01-16
- Repository: https://github.com/datajuicer/data-juicer

---
 
> **Note:** This chapter is for users who want to call Data-Juicer programmatically or who are already familiar with HuggingFace Dataset operations.  
> If you only use YAML-based invocation, you can skip this chapter.

Data-Juicer provides two dataset implementations:
- **NestedDataset**: Built on HuggingFace Datasets, for single-machine processing
- **RayDataset**: Built on Ray Data, for distributed processing

Both share the same `DJDataset` interface, so you can switch backends without changing your operator code.

## Table of Contents

1. [Quick Comparison](#quick-comparison)
2. [NestedDataset: HuggingFace-Compatible API](#nesteddataset-huggingface-compatible-api)
3. [Data-Juicer Enhancements](#data-juicer-enhancements)
4. [RayDataset (Distributed)](#raydataset-distributed)
5. [Production Usage: Via Configuration](#production-usage-via-configuration)
6. [Key Differences](#key-differences)

In [None]:
# Install Data-Juicer (if not installed)
# !uv pip install py-data-juicer

## Quick Comparison

| Feature | Pandas | HuggingFace | Data-Juicer |
|---------|--------|-------------|-------------|
| **Base** | NumPy | Arrow | Built on HF |
| **Indexing** | `df['col']` | `ds['col']` | Same + **nested access** (`ds['meta.source']`) |
| **Processing** | `.apply()` | `.map()`, `.filter()` | Same + **100+ operators** via `.process()` |
| **Multimodal** | Manual | Supported | **Lazy loading** for efficiency |

## NestedDataset: HuggingFace-Compatible API

`NestedDataset` is fully compatible with the HuggingFace Datasets API, so you can use familiar operations directly:

In [None]:
# HuggingFace-style API works directly
from data_juicer.core.data import NestedDataset

# Create dataset (same as HuggingFace)
ds = NestedDataset.from_dict({
    'text': ['Hello world', 'Data processing', 'Machine learning'],
    'label': [0, 1, 1]
})
print(f"Created: {len(ds)} samples, columns: {ds.column_names}")

# Standard operations
ds = ds.map(lambda x: {'text_len': len(x['text'])})     # Transform
ds = ds.filter(lambda x: x['text_len'] > 10)            # Filter
print(f"After filter: {len(ds)} samples")
print(f"First row: {ds[0]}")
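
To make the effect of the `map` and `filter` calls above concrete, here is a plain-Python model of the same two steps. It is only a sketch of the semantics (add a derived column, then keep the rows that satisfy a predicate) and does not use Data-Juicer or HuggingFace; the variable names `rows`, `mapped`, and `kept` are illustrative.

```python
# Plain-Python model of the map/filter steps (no Data-Juicer required).
rows = [
    {'text': 'Hello world', 'label': 0},
    {'text': 'Data processing', 'label': 1},
    {'text': 'Machine learning', 'label': 1},
]

# Model of ds.map(...): the returned dict is merged into each row as new columns.
mapped = [{**row, 'text_len': len(row['text'])} for row in rows]

# Model of ds.filter(...): keep only the rows where the predicate is true.
kept = [row for row in mapped if row['text_len'] > 10]

print([row['text_len'] for row in mapped])  # [11, 15, 16]
print(len(kept))                            # 3
```

All three sample texts are longer than 10 characters, so the filter in this example keeps every row.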

In [None]:
# Convert from Pandas
import pandas as pd
from data_juicer.core.data import NestedDataset

df = pd.DataFrame({'text': ['From pandas!', 'Easy conversion']})
ds = NestedDataset.from_pandas(df)
print(f"From Pandas: {ds['text']}")

# Convert back to Pandas
df_back = ds.to_pandas()
print(f"Back to Pandas: {type(df_back)}")

## Data-Juicer Enhancements

Beyond the HuggingFace API, `NestedDataset` adds two features, nested field access and a built-in operator pipeline:

In [None]:
# 1. Nested Field Access - use dot notation for nested structures
from data_juicer.core.data import NestedDataset

ds = NestedDataset.from_dict({
    'text': ['Sample text', '中文样本'],
    'meta': [{'source': 'wiki', 'date': '2024-01'}, {'source': 'web', 'date': '2024-02'}],
    'stats': [{'lang': 'en', 'length': 100}, {'lang': 'zh', 'length': 20}]
})

# Access nested fields directly with dot notation
print(f"Source: {ds['meta.source']}")  # No need for ds['meta'][0]['source']
print(f"Language: {ds['stats.lang']}")
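
The idea behind dot-notation access can be sketched in plain Python as a walk through nested dictionaries, one key per path segment. The helper `get_nested` below is hypothetical and written only for illustration; it is not Data-Juicer's implementation.

```python
from typing import Any


def get_nested(sample: dict, path: str) -> Any:
    """Resolve a dot-separated path such as 'meta.source' in a nested dict.

    Hypothetical helper for illustration only, not part of Data-Juicer.
    """
    value: Any = sample
    for key in path.split('.'):
        value = value[key]  # raises KeyError if a path segment is missing
    return value


samples = [
    {'text': 'Sample text', 'meta': {'source': 'wiki', 'date': '2024-01'}},
    {'text': '中文样本', 'meta': {'source': 'web', 'date': '2024-02'}},
]

sources = [get_nested(sample, 'meta.source') for sample in samples]
print(sources)  # ['wiki', 'web']
```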

In [None]:
# 2. Built-in Operator Pipeline - chain 100+ operators via .process()
from data_juicer.core.data import NestedDataset
from data_juicer.ops.filter import TextLengthFilter
from data_juicer.ops.mapper import WhitespaceNormalizationMapper

ds = NestedDataset.from_dict({
    'text': [
        'Short',
        'This is a longer text that should pass the filter. aaaaaaaa',
        'Text with various spaces'
    ]
})

# Process with operator pipeline
ds_processed = ds.process([
    TextLengthFilter(min_len=10, max_len=30),      # Keep texts between 10 and 30 characters long
    WhitespaceNormalizationMapper(),               # Whitespace normalization
])

print(f"Before: {len(ds)} -> After: {len(ds_processed)} samples")
for row in ds_processed:
    print(f"  '{row['text']}'")
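
To see which of the three samples survive `min_len=10, max_len=30`, the length check can be modeled in plain Python. This is a sketch that assumes inclusive bounds and uses a hypothetical helper, `keep_by_length`; it is not the operator's actual code. None of the sample lengths sits on a boundary, so the result does not depend on that assumption.

```python
def keep_by_length(text: str, min_len: int, max_len: int) -> bool:
    """Hypothetical model of a character-length filter (inclusive bounds assumed)."""
    return min_len <= len(text) <= max_len


texts = [
    'Short',
    'This is a longer text that should pass the filter. aaaaaaaa',
    'Text with various spaces',
]

lengths = [len(text) for text in texts]
kept = [text for text in texts if keep_by_length(text, min_len=10, max_len=30)]

print(lengths)  # [5, 59, 24]
print(kept)     # ['Text with various spaces']
```

With the strings exactly as written here, the second sample is 59 characters long and exceeds `max_len=30`, so only the third sample remains despite the wording of the second one.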

## RayDataset (Distributed)

`RayDataset` wraps Ray Data for distributed processing across multiple machines or GPUs.

In [None]:
# Direct usage: Create RayDataset from Ray Data
import ray
from data_juicer.core.data.ray_dataset import RayDataset

# Initialize Ray
ray.init(ignore_reinit_error=True)

# Create Ray Data
ray_data = ray.data.from_items([
    {'text': 'Hello distributed world'},
    {'text': 'Ray enables scalable processing'},
    {'text': 'Data-Juicer on Ray'}
])

# Wrap in RayDataset
ds = RayDataset(ray_data)  # dataset_path and cfg can optionally be passed as well
print(f"Created RayDataset with {ds.count()} samples")
print(f"First 2 samples: {ds.get(2)}")

In [None]:
# Same operators work on RayDataset
from data_juicer.ops.filter import TextLengthFilter

# Process with operators - same API as NestedDataset
ds_processed = ds.process([
    TextLengthFilter(min_len=20)
])

print(f"After filter: {ds_processed.count()} samples")
print(f"Results: {ds_processed.get(10)}")

ray.shutdown()

### Production Usage: Via Configuration

For production, use configuration files with `executor_type: 'ray'`:

In [None]:
# Create a Ray config file
import os
os.makedirs('./configs', exist_ok=True)

ray_config = """
project_name: 'ray-demo'
dataset_path: './data/demo.jsonl'
export_path: './outputs/processed'

executor_type: 'ray'        # Enable Ray backend
ray_address: 'auto'         # Or 'ray://hostname:port' for cluster

process:
  - text_length_filter:
      min_len: 10
      max_len: 1000
"""

with open('./configs/ray_demo.yaml', 'w') as f:
    f.write(ray_config)

print("Run with: dj-process --config ./configs/ray_demo.yaml")
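
The config above points at `./data/demo.jsonl`, which this chapter does not create. A minimal sketch for producing such a file with the standard library follows. The sample rows are made up for illustration, `write_jsonl` is a hypothetical helper, and the use of a `text` field assumes the default text key.

```python
import json
from pathlib import Path


def write_jsonl(path: str, rows: list) -> Path:
    """Write one JSON object per line (hypothetical helper for this demo)."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open('w', encoding='utf-8') as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + '\n')
    return target


# Made-up sample rows; the 'text' field is assumed to be the text key.
demo_rows = [
    {'text': 'Too short'},
    {'text': 'This sentence is long enough for the demo config.'},
    {'text': 'Data-Juicer reads one JSON object per line.'},
]

out_path = write_jsonl('./data/demo.jsonl', demo_rows)
print(f"Wrote {len(demo_rows)} rows to {out_path}")
```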

### Key Differences

| Feature | NestedDataset | RayDataset |
|---------|---------------|------------|
| **Backend** | HuggingFace Dataset | Ray Data |
| **Execution** | Eager | Lazy (streaming) |
| **GPU Support** | Manual | Auto GPU allocation |
| **Indexing** | `ds[0]`, `ds['col']` | `ds.get(k)`, `ds.get_column('col')` |

See [Chapter 7: Distributed Processing with Ray](./07_Distributed_Processing_with_Ray.ipynb) for more details.
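
The "Eager" versus "Lazy" row in the table can be illustrated with a toy example in plain Python: a list comprehension runs the transform immediately, while a generator defers the work until a result is requested. This only sketches the general concept; it does not show how HuggingFace Datasets or Ray Data implement execution, and all names in it are illustrative.

```python
calls = []  # records each time the transform actually runs


def add_length(row: dict) -> dict:
    calls.append(row['text'])
    return {**row, 'text_len': len(row['text'])}


rows = [
    {'text': 'Hello distributed world'},
    {'text': 'Ray enables scalable processing'},
    {'text': 'Data-Juicer on Ray'},
]

# Eager: every row is transformed as soon as the statement executes.
eager = [add_length(row) for row in rows]
eager_calls = len(calls)   # 3

calls.clear()

# Lazy: building the generator does no work yet.
lazy = (add_length(row) for row in rows)
calls_before = len(calls)  # 0

# Work happens only when a result is consumed.
first = next(lazy)
calls_after = len(calls)   # 1

print(eager_calls, calls_before, calls_after)  # 3 0 1
```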