# Chapter 4: DJ Dataset API

**Data-Juicer User Guide**

- Git Commit: `v1.4.6`
- Commit Date: 2026-02-02
- Repository: https://github.com/datajuicer/data-juicer

---
 
> **Note:** This chapter is for users who want to call Data-Juicer programmatically or who are already familiar with HuggingFace Dataset operations.  
> If you only care about YAML-based invocation, you can skip this chapter.

Data-Juicer provides two dataset implementations:
- **NestedDataset**: Built on HuggingFace Datasets, for single-machine processing
- **RayDataset**: Built on Ray Data, for distributed processing

Both share the same `DJDataset` interface, so you can switch backends without changing your operator code.

## Table of Contents

1. [Quick Comparison](#quick-comparison)
2. [NestedDataset: HuggingFace-Compatible API](#nesteddataset-huggingface-compatible-api)
3. [Data-Juicer Enhancements](#data-juicer-enhancements)
4. [RayDataset (Distributed)](#raydataset-distributed)
5. [Production Usage: Via Configuration](#production-usage-via-configuration)
6. [Key Differences](#key-differences)

In [1]:
# Install Data-Juicer (if not installed)
# If running in Google Colab, use 'pip install' instead of 'uv pip install'
# !uv pip install py-data-juicer

## Quick Comparison

| Feature | Pandas | HuggingFace | Data-Juicer |
|---------|--------|-------------|-------------|
| **Base** | NumPy | Arrow | Built on HF |
| **Indexing** | `df['col']` | `ds['col']` | Same + **nested access** (`ds['meta.source']`) |
| **Processing** | `.apply()` | `.map()`, `.filter()` | Same + **100+ operators** via `.process()` |
| **Multimodal** | Manual | Supported | **Lazy loading** for efficiency |

## NestedDataset: HuggingFace-Compatible API

`NestedDataset` is fully compatible with the HuggingFace Datasets API, so familiar operations work directly:

In [2]:
# HuggingFace-style API works directly
from data_juicer.core.data import NestedDataset

# Create dataset (same as HuggingFace)
ds = NestedDataset.from_dict({
    'text': ['Hello world', 'Data processing', 'Machine learning'],
    'label': [0, 1, 1]
})
print(f"Created: {len(ds)} samples, columns: {ds.column_names}")

# Standard operations
ds = ds.map(lambda x: {'text_len': len(x['text'])})     # Transform
ds = ds.filter(lambda x: x['text_len'] > 10)            # Filter
print(f"After filter: {len(ds)} samples")
print(f"First row: {ds[0]}")



Created: 3 samples, columns: ['text', 'label']



After filter: 3 samples
First row: {'text': 'Hello world', 'label': 0, 'text_len': 11}
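
All three samples survive the filter because every text is longer than 10 characters. A plain-Python sketch of what the two calls compute is shown here; it mirrors the semantics only, since the real methods work on Arrow-backed tables and return new dataset objects:

```python
rows = [
    {"text": "Hello world", "label": 0},
    {"text": "Data processing", "label": 1},
    {"text": "Machine learning", "label": 1},
]

# map: the dict returned for each row is merged into that row,
# which is how the new 'text_len' column appears
mapped = [{**row, "text_len": len(row["text"])} for row in rows]

# filter: keep only the rows for which the predicate is true
filtered = [row for row in mapped if row["text_len"] > 10]

print(len(filtered))   # 3
print(filtered[0])     # {'text': 'Hello world', 'label': 0, 'text_len': 11}
```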





In [3]:
# Convert from Pandas
import pandas as pd
from data_juicer.core.data import NestedDataset

df = pd.DataFrame({'text': ['From pandas!', 'Easy conversion']})
ds = NestedDataset.from_pandas(df)
print(f"From Pandas: {ds['text']}")

# Convert back to Pandas
df_back = ds.to_pandas()
print(f"Back to Pandas: {type(df_back)}")

From Pandas: Column(['From pandas!', 'Easy conversion'])
Back to Pandas: <class 'pandas.core.frame.DataFrame'>


## Data-Juicer Enhancements

Beyond the HuggingFace API, `NestedDataset` adds dot-notation access to nested fields and a built-in operator pipeline:

In [4]:
# 1. Nested Field Access - use dot notation for nested structures
from data_juicer.core.data import NestedDataset

ds = NestedDataset.from_dict({
    'text': ['Sample text', '中文样本'],
    'meta': [{'source': 'wiki', 'date': '2024-01'}, {'source': 'web', 'date': '2024-02'}],
    'stats': [{'lang': 'en', 'length': 100}, {'lang': 'zh', 'length': 20}]
})

# Access nested fields directly with dot notation
print(f"Source: {ds['meta.source']}")  # No need for ds['meta'][0]['source']
print(f"Language: {ds['stats.lang']}")

Source: Column(['wiki', 'web'])
Language: Column(['en', 'zh'])
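
To make the dot notation concrete, here is a minimal sketch of how a key such as `meta.source` could be resolved against a list of row dicts. The helper `get_nested_column` is hypothetical and is not Data-Juicer's implementation, which may handle edge cases (for example, a column whose name itself contains a dot) differently:

```python
from typing import Any, Dict, List


def get_nested_column(rows: List[Dict[str, Any]], key: str) -> List[Any]:
    """Resolve a dot-separated key against every row (hypothetical helper)."""
    column = []
    for row in rows:
        value: Any = row
        for part in key.split("."):
            value = value[part]  # raises KeyError if a level is missing
        column.append(value)
    return column


rows = [
    {"text": "Sample text", "meta": {"source": "wiki", "date": "2024-01"}},
    {"text": "中文样本", "meta": {"source": "web", "date": "2024-02"}},
]

sources = get_nested_column(rows, "meta.source")
print(sources)  # ['wiki', 'web']
```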


In [5]:
# 2. Built-in Operator Pipeline - chain 100+ operators via .process()
from data_juicer.core.data import NestedDataset
from data_juicer.ops.filter import TextLengthFilter
from data_juicer.ops.mapper import WhitespaceNormalizationMapper

ds = NestedDataset.from_dict({
    'text': [
        'Short',  # below min_len=10, so it is dropped
        'This is a longer text that should pass the filter. aaaaaaaa',  # exceeds max_len=30, so it is dropped too
        'Text with various spaces'
    ]
})

# Process with operator pipeline
ds_processed = ds.process([
    TextLengthFilter(min_len=10, max_len=30),      # Keep texts of 10 to 30 characters
    WhitespaceNormalizationMapper(),               # Whitespace normalization
])

print(f"Before: {len(ds)} -> After: {len(ds_processed)} samples")
for row in ds_processed:
    print(f"  '{row['text']}'")

Before: 3 -> After: 1 samples
  'Text with various spaces'
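
Only one sample remains because the filter enforces both bounds. The `__dj__stats__` field visible in the RayDataset output later in this chapter suggests that a filter first records a statistic and then applies its thresholds. A minimal sketch of that two-step idea, assuming inclusive bounds; `compute_stats`, `keep` and `length_filter` are hypothetical names, not Data-Juicer's API:

```python
from typing import Any, Dict, List

STATS_KEY = "__dj__stats__"  # key name taken from the output shown in this chapter


def compute_stats(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Step 1: record the text length under the stats key."""
    stats = sample.setdefault(STATS_KEY, {})
    stats["text_len"] = len(sample["text"])
    return sample


def keep(sample: Dict[str, Any], min_len: int, max_len: int) -> bool:
    """Step 2: decide from the recorded statistic (inclusive bounds assumed)."""
    return min_len <= sample[STATS_KEY]["text_len"] <= max_len


def length_filter(samples: List[Dict[str, Any]],
                  min_len: int = 10, max_len: int = 30) -> List[Dict[str, Any]]:
    # Copy each sample so the input list is left untouched
    with_stats = [compute_stats(dict(s)) for s in samples]
    return [s for s in with_stats if keep(s, min_len, max_len)]


samples = [
    {"text": "Short"},
    {"text": "This is a longer text that should pass the filter. aaaaaaaa"},
    {"text": "Text with various spaces"},
]
kept = length_filter(samples)
print([s["text"] for s in kept])  # ['Text with various spaces']
```

Separating the statistic from the decision means the recorded value stays attached to each sample, so thresholds can be inspected or tuned afterwards.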


## RayDataset (Distributed)

`RayDataset` wraps Ray Data for distributed processing across multiple machines or GPUs.

In [6]:
# Direct usage: Create RayDataset from Ray Data
import ray
from data_juicer.core.data.ray_dataset import RayDataset

# Initialize Ray
ray.init(ignore_reinit_error=True)

# Create Ray Data
ray_data = ray.data.from_items([
    {'text': 'Hello distributed world'},
    {'text': 'Ray enables scalable processing'},
    {'text': 'Data-Juicer on Ray'}
])

# Wrap in RayDataset
ds = RayDataset(ray_data)  # dataset_path and cfg can also be passed as optional arguments
print(f"Created RayDataset with {ds.count()} samples")
print(f"First 2 samples: {ds.get(2)}")


Created RayDataset with 3 samples
First 2 samples: [{'text': 'Hello distributed world'}, {'text': 'Ray enables scalable processing'}]


In [7]:
# Same operators work on RayDataset
from data_juicer.ops.filter import TextLengthFilter

# Process with operators - same API as NestedDataset
ds_processed = ds.process([
    TextLengthFilter(min_len=20)
])

print(f"After filter: {ds_processed.count()} samples")
print(f"Results: {ds_processed.get(10)}")

ray.shutdown()

After filter: 2 samples
Results: [{'text': 'Hello distributed world', '__dj__stats__': {'text_len': 23}}, {'text': 'Ray enables scalable processing', '__dj__stats__': {'text_len': 31}}]


### Production Usage: Via Configuration

For production, use configuration files with `executor_type: 'ray'`:

In [8]:
# Create a Ray config file
import os
os.makedirs('./configs', exist_ok=True)

ray_config = """
project_name: 'ray-demo'
dataset_path: './data/demo.jsonl'
export_path: './outputs/processed'

executor_type: 'ray'        # Enable Ray backend
ray_address: 'auto'         # Or 'ray://hostname:port' for cluster

process:
  - text_length_filter:
      min_len: 10
      max_len: 1000
"""

with open('./configs/ray_demo.yaml', 'w') as f:
    f.write(ray_config)

print("Run with: dj-process --config ./configs/ray_demo.yaml")

Run with: dj-process --config ./configs/ray_demo.yaml
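
The config points at `./data/demo.jsonl`, which the cell above does not create. A minimal sketch for producing such an input file, assuming a JSON Lines layout with one object per line and a `text` field; the helper `write_demo_jsonl` and the sample records are invented for illustration:

```python
import json
import os
from typing import Dict, Iterable


def write_demo_jsonl(path: str, records: Iterable[Dict[str, str]]) -> int:
    """Write one JSON object per line and return the number of records written."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written


# Illustrative records only; replace with real data
DEMO_RECORDS = [
    {"text": "Short"},
    {"text": "A sentence long enough to pass the length filter."},
]
```

Calling `write_demo_jsonl('./data/demo.jsonl', DEMO_RECORDS)` before running `dj-process` creates the file that `dataset_path` refers to.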


### Key Differences

| Feature | NestedDataset | RayDataset |
|---------|---------------|------------|
| **Backend** | HuggingFace Dataset | Ray Data |
| **Execution** | Eager | Lazy (streaming) |
| **GPU Support** | Manual | Auto GPU allocation |
| **Indexing** | `ds[0]`, `ds['col']` | `ds.get(k)`, `ds.get_column('col')` |
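
The eager versus lazy row can be illustrated with two toy stand-ins. These classes are hypothetical and only imitate the method names from the table (`len`/`ds[0]` versus `count()`/`get(k)`); they say nothing about how Data-Juicer or Ray are implemented internally:

```python
import itertools
from typing import Callable, Dict, Iterable, Iterator, List

Sample = Dict[str, str]
Predicate = Callable[[Sample], bool]


class ToyEagerDataset:
    """Stand-in for an eager backend: process() filters immediately."""

    def __init__(self, rows: Iterable[Sample]) -> None:
        self._rows: List[Sample] = list(rows)

    def process(self, ops: List[Predicate]) -> "ToyEagerDataset":
        return ToyEagerDataset(r for r in self._rows if all(op(r) for op in ops))

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, index: int) -> Sample:
        return self._rows[index]


class ToyLazyDataset:
    """Stand-in for a lazy backend: process() only records the plan."""

    def __init__(self, source: Callable[[], Iterator[Sample]]) -> None:
        self._source = source

    def process(self, ops: List[Predicate]) -> "ToyLazyDataset":
        upstream = self._source
        return ToyLazyDataset(
            lambda: (r for r in upstream() if all(op(r) for op in ops))
        )

    def count(self) -> int:
        return sum(1 for _ in self._source())  # execution happens here

    def get(self, k: int) -> List[Sample]:
        return list(itertools.islice(self._source(), k))


calls: List[str] = []  # records every time the predicate actually runs


def min_len_20(sample: Sample) -> bool:
    calls.append(sample["text"])
    return len(sample["text"]) >= 20


rows = [
    {"text": "Hello distributed world"},
    {"text": "Ray enables scalable processing"},
    {"text": "Data-Juicer on Ray"},
]

eager = ToyEagerDataset(rows).process([min_len_20])
eager_calls = len(calls)         # predicate already ran on every row

calls.clear()
lazy = ToyLazyDataset(lambda: iter(rows)).process([min_len_20])
lazy_calls_before = len(calls)   # nothing has run yet
lazy_count = lazy.count()        # consuming the result triggers the work
```

In the lazy variant the predicate does not run until `count()` or `get(k)` consumes the result, which is why the lazy column of the table offers those methods rather than positional indexing.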

See [Chapter 7: Distributed Processing with Ray](./07_Distributed_Processing_with_Ray.ipynb) for more details.