# Distributed Processing with Data-Juicer

In this notebook, we'll explore how to use Data-Juicer's distributed processing capabilities to handle large-scale datasets efficiently. Data-Juicer provides powerful distributed processing features based on the Ray framework.

## In this Notebook

1. When to use distributed processing
2. Configuring distributed execution
3. Running distributed processing pipelines
4. Performance optimization techniques
5. Other distributed processing methods (DLC, Slurm)

## When to Use Distributed Processing

Distributed processing is beneficial in several scenarios:

1. **Large datasets**: When dealing with millions or billions of records that would take days or weeks to process on a single machine
2. **Compute-intensive operations**: Operations that involve deep learning models (image captioning, text generation, etc.)
3. **Memory constraints**: When datasets are too large to fit in a single machine's memory
4. **Time-sensitive tasks**: When you need to complete processing within a specific timeframe

Data-Juicer's distributed processing is built on Ray, which makes it easy to scale across multiple nodes and fully utilize cluster resources.

### Engine-Agnostic Design

For most implementations of Data-Juicer operators, the core processing functions are engine-agnostic. Interoperability is primarily managed in [RayDataset](https://github.com/modelscope/data-juicer/blob/main/data_juicer/core/ray_data.py) and [RayExecutor](https://github.com/modelscope/data-juicer/blob/main/data_juicer/core/executor/ray_executor.py), which are subclasses of the base DJDataset and BaseExecutor, respectively, and support both Ray [Tasks](https://docs.ray.io/en/latest/ray-core/tasks.html) and [Actors](https://docs.ray.io/en/latest/ray-core/actors.html).
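To make the idea concrete, here is a minimal sketch in plain Python. It is not Data-Juicer's code: `clean_text`, `run_local`, and `run_parallel` are hypothetical names, and a thread pool stands in for a distributed engine such as Ray. The point is only that the per-sample function stays the same while the execution layer changes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Sample = Dict[str, str]


def clean_text(sample: Sample) -> Sample:
    """Engine-agnostic core logic: collapse whitespace in the 'text' field."""
    return {**sample, "text": " ".join(sample["text"].split())}


def run_local(fn: Callable[[Sample], Sample], samples: List[Sample]) -> List[Sample]:
    """Apply fn sequentially, as a standalone executor might."""
    return [fn(s) for s in samples]


def run_parallel(fn: Callable[[Sample], Sample], samples: List[Sample],
                 workers: int = 2) -> List[Sample]:
    """Apply fn through a worker pool (a stand-in for a distributed engine)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with run_local
        return list(pool.map(fn, samples))


samples = [{"text": "hello   world"}, {"text": "  data \n juicer "}]
local_result = run_local(clean_text, samples)
parallel_result = run_parallel(clean_text, samples)
print(local_result == parallel_result)
```

Because `clean_text` knows nothing about how it is scheduled, the same function can be handed to either runner unchanged.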

### Special Consideration for Deduplication Operators

The exception is the deduplication operators, whose standalone implementations are challenging to scale. Data-Juicer provides dedicated distributed versions of these operators, with names prefixed with `ray_`. These include:

- `ray_bts_minhash_deduplicator`: A distributed implementation of Union-Find with load balancing
- `ray_document_deduplicator`: Deduplicates samples at the document level using exact matching in Ray distributed mode
- `ray_image_deduplicator`: Deduplicates samples at the document level using exact matching of images in Ray distributed mode
- `ray_video_deduplicator`: Deduplicates samples at the document level using exact matching of videos in Ray distributed mode

These specialized operators are designed to handle the unique challenges of distributed deduplication efficiently.
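As a rough illustration of what exact matching at the document level means, the sketch below hashes each document's text and keeps only the first sample for every hash. It runs in a single process and is not the implementation of `ray_document_deduplicator`. The function name `exact_dedup`, the `text` field, and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib
from typing import Dict, Iterable, List


def exact_dedup(samples: Iterable[Dict[str, str]],
                lowercase: bool = False) -> List[Dict[str, str]]:
    """Keep the first sample for each distinct text hash."""
    seen = set()
    kept = []
    for sample in samples:
        text = sample["text"].lower() if lowercase else sample["text"]
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(sample)
    return kept


docs = [{"text": "Hello world"}, {"text": "hello world"}, {"text": "Hello world"}]
unique_docs = exact_dedup(docs)                     # case-sensitive matching
unique_docs_ci = exact_dedup(docs, lowercase=True)  # case-insensitive matching
print(len(unique_docs), len(unique_docs_ci))
```

In a distributed setting the `seen` set can no longer live in one process, which is the kind of coordination the `ray_` operators are designed to handle.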

## Configuring Distributed Execution

### Basic Configuration

To enable distributed processing in Data-Juicer, set these parameters in your configuration file:

```yaml
# Specify Ray executor
executor_type: ray

# Optional: Specify Ray cluster address, default is "auto"
ray_address: auto
```

Let's look at some example configuration files that are already available in Data-Juicer:

In [None]:
# Copy the demo configs and dataset into the working directory
%mkdir -p configs/demo
%mkdir -p demos/data
%cp ../configs/demo/dedup-ray-bts.yaml configs/demo/
%cp ../configs/demo/dedup-ray-bts-gpu.yaml configs/demo/
%cp ../demos/data/demo-dataset-deduplication.jsonl demos/data/

In [None]:
# Let's examine the existing demo configuration files for distributed processing
print("Basic distributed deduplication config (configs/demo/dedup-ray-bts.yaml):")
!cat configs/demo/dedup-ray-bts.yaml

print("\n" + "="*80 + "\n")

print("GPU-accelerated distributed deduplication config (configs/demo/dedup-ray-bts-gpu.yaml):")
!cat configs/demo/dedup-ray-bts-gpu.yaml

Let's create a custom distributed configuration based on the existing examples:

In [None]:
%%writefile configs/custom_distributed_example.yaml

# Custom distributed processing config example

# Global parameters
project_name: 'custom-distributed-example'
dataset_path: './demos/data/demo-dataset-deduplication.jsonl'  # Using existing demo data
np: 4  # Number of subprocesses to process your dataset

# Distributed processing parameters
executor_type: ray
ray_address: auto
open_monitor: true
open_tracer: true

# Output path
export_path: './outputs/custom-distributed-example/processed.jsonl'

# Process schedule
process:
  - language_id_score_filter:
      lang: en
      min_score: 0.5
  - ray_bts_minhash_deduplicator:
      tokenization: 'character'
      lowercase: true
      union_find_parallel_num: 2
      # For GPU acceleration, add: accelerator: 'cuda'

## Running Distributed Processing Pipelines

### Starting a Ray Cluster

Before running distributed tasks, you need to start a Ray cluster:

```bash
# On the head node
ray start --head --port=6379

# On worker nodes (replace <HEAD_IP> with head node IP)
ray start --address='<HEAD_IP>:6379'
```
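Once the cluster is up, it can be useful to confirm from Python which resources Ray sees. The sketch below is one way to do that, assuming Ray is installed and a cluster is reachable. `summarize_resources` and `connect_and_summarize` are hypothetical helper names; `ray.init` and `ray.cluster_resources` are Ray's own API.

```python
from typing import Dict


def summarize_resources(resources: Dict[str, float]) -> str:
    """Format a resource mapping such as the one returned by ray.cluster_resources()."""
    cpus = int(resources.get("CPU", 0))
    gpus = int(resources.get("GPU", 0))
    return f"{cpus} CPUs, {gpus} GPUs"


def connect_and_summarize(address: str = "auto") -> str:
    """Connect to a running Ray cluster and summarize its resources."""
    import ray  # imported lazily so the helper above works without Ray installed

    ray.init(address=address, ignore_reinit_error=True)
    return summarize_resources(ray.cluster_resources())


# Works without a cluster: format a hand-written resource mapping.
example = summarize_resources({"CPU": 16.0, "GPU": 2.0, "memory": 6.4e10})
print(example)
# With a cluster running, call: print(connect_and_summarize())
```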

### Running Distributed Processing Jobs

Use the following commands to run distributed processing tasks with the existing configs:

In [None]:
# Run CPU-based distributed deduplication
# Note: This would normally be run with a real Ray cluster
print("Running CPU-based distributed deduplication:")
!dj-process --config configs/custom_distributed_example.yaml

In [None]:
# Run GPU-accelerated distributed deduplication
# Note: This would normally be run with a real Ray cluster with GPUs
# For demonstration purposes, we'll just show the command
print("Command to run GPU-accelerated distributed deduplication:")
print("dj-process --config configs/demo/dedup-ray-bts-gpu.yaml")

## Performance Optimization Techniques

### Streaming Reading of JSON Files

Streaming reading is crucial for processing large JSONL datasets without memory issues.

Many datasets are stored in JSONL format and can be extremely large. The standard Ray Datasets implementation (up to Ray version 2.40 and Arrow version 18.1.0) doesn't support streaming reading of JSON files, which can lead to out-of-memory (OOM) errors.

Data-Juicer addresses this by:

1. **Developing a streaming loading interface**
2. **Contributing a patch to Apache Arrow** ([PR #45084](https://github.com/apache/arrow/pull/45084))
3. **Enabling streaming-read support** for JSON, CSV, and Parquet files

With this optimization, Data-Juicer in Ray mode uses streaming loading by default for JSON files, significantly reducing memory usage for large datasets.
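The general idea of streaming reading can be sketched in a few lines of standard-library Python. This is only an illustration of the concept, not Data-Juicer's loader (which works through Ray and Arrow); `stream_jsonl` and its `batch_size` parameter are hypothetical names. The file is read line by line, so only one batch is held in memory at a time.

```python
import json
import os
import tempfile
from typing import Dict, Iterator, List


def stream_jsonl(path: str, batch_size: int = 2) -> Iterator[List[Dict]]:
    """Yield batches of parsed records, reading the file line by line."""
    batch = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            batch.append(json.loads(line))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final partial batch


# Write a tiny JSONL file to demonstrate.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False,
                                 encoding="utf-8") as tmp:
    for i in range(5):
        tmp.write(json.dumps({"id": i, "text": f"doc {i}"}) + "\n")
    demo_path = tmp.name

batch_sizes = [len(b) for b in stream_jsonl(demo_path, batch_size=2)]
os.remove(demo_path)
print(batch_sizes)
```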

### Subset Splitting

When a cluster has many nodes but the dataset consists of only a few files, Ray's default behavior can be inefficient.

To address this, Data-Juicer automatically splits the dataset into smaller sub-files. The size of a single sub-file is set to 128 MB, and the split ensures that the number of sub-files is at least twice the total number of CPU cores in the cluster. The corresponding tool is available at [tools/data_resplit.py](https://github.com/modelscope/data-juicer/blob/main/tools/data_resplit.py).
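One possible reading of this rule, written out as arithmetic: take enough sub-files that each is at most 128 MB, and raise that count if it is below twice the number of CPU cores. The helper `plan_num_subfiles` is hypothetical, and the real tool may compute the split differently.

```python
import math

TARGET_FILE_BYTES = 128 * 1024 * 1024  # 128 MB per sub-file, as described above


def plan_num_subfiles(dataset_bytes: int, total_cpus: int) -> int:
    """Return a sub-file count that satisfies both constraints."""
    by_size = math.ceil(dataset_bytes / TARGET_FILE_BYTES)  # size constraint
    by_cpus = 2 * total_cpus                                # parallelism constraint
    return max(by_size, by_cpus, 1)


# 1 GiB on 64 cores: the size rule gives 8 files, so the CPU rule dominates.
small = plan_num_subfiles(dataset_bytes=1 * 1024**3, total_cpus=64)
# 100 GiB on 64 cores: the size rule gives 800 files and dominates.
large = plan_num_subfiles(dataset_bytes=100 * 1024**3, total_cpus=64)
print(small, large)
```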

### Distributed Deduplication Optimizations

Deduplication operations are particularly challenging to scale, which is why Data-Juicer provides specialized distributed versions of deduplication operators.

Standard deduplication algorithms don't scale well in distributed environments due to:

1. High memory requirements for hash tables
2. Network communication overhead
3. Load balancing issues

For the `ray_bts_minhash_deduplicator`, Data-Juicer implements:

1. Multiprocess Union-Find set in Ray Actors
2. Load-balanced distributed algorithm (BTS) for equivalence class merging

This optimization enables Data-Juicer to:
- Deduplicate terabyte-sized datasets on 1280 CPU cores in 3 hours
- Achieve 2x to 3x speedups compared to vanilla deduplication operators
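For readers unfamiliar with Union-Find, the single-process sketch below shows how it merges pairs of near-duplicate samples into equivalence classes. It is a textbook version and does not reproduce the distributed BTS algorithm or its load balancing; `UnionFind` and `duplicate_clusters` are hypothetical names, and the input pairs are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


class UnionFind:
    """Union-Find with path halving and union by size."""

    def __init__(self) -> None:
        self.parent: Dict[int, int] = {}
        self.size: Dict[int, int] = {}

    def find(self, x: int) -> int:
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra  # attach the smaller tree under the larger one
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def duplicate_clusters(pairs: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Group sample ids into clusters from near-duplicate pairs."""
    uf = UnionFind()
    for a, b in pairs:
        uf.union(a, b)
    groups = defaultdict(list)
    for x in list(uf.parent):
        groups[uf.find(x)].append(x)
    return sorted(sorted(g) for g in groups.values())


clusters = duplicate_clusters([(1, 2), (2, 3), (7, 8)])
# Keep the smallest id in each cluster and drop the rest.
to_drop = sorted(x for c in clusters for x in c[1:])
print(clusters, to_drop)
```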

## Other Distributed Processing Methods

### DLC (Deep Learning Containers)

Data-Juicer supports running distributed tasks in Alibaba Cloud PAI's DLC environment. Related scripts are in the `./scripts/dlc` directory:

- [`partition_data_dlc.py`](https://github.com/modelscope/data-juicer/blob/main/scripts/dlc/partition_data_dlc.py): Partitions datasets across multiple nodes
- [`run_on_dlc.sh`](https://github.com/modelscope/data-juicer/blob/main/scripts/dlc/run_on_dlc.sh): Script to run processing tasks in DLC environment
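
The sketch below illustrates the general idea of partitioning dataset files across nodes with a simple round-robin assignment. It does not reproduce `partition_data_dlc.py`; the function name `partition_round_robin`, the file names, and the round-robin strategy are assumptions made for the example.

```python
from typing import Dict, List


def partition_round_robin(files: List[str], num_nodes: int) -> Dict[int, List[str]]:
    """Assign files to node ranks in round-robin order."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    shards: Dict[int, List[str]] = {rank: [] for rank in range(num_nodes)}
    for i, path in enumerate(sorted(files)):  # sort for a deterministic layout
        shards[i % num_nodes].append(path)
    return shards


shards = partition_round_robin([f"part-{i:03d}.jsonl" for i in range(5)], num_nodes=2)
print(shards)
```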

### Slurm

Data-Juicer also supports running distributed tasks on Slurm scheduling systems. Related scripts:

- [`run_slurm.sh`](https://github.com/modelscope/data-juicer/blob/main/scripts/run_slurm.sh): Script to run distributed processing tasks on Slurm clusters

## Next Steps

Continue with the next notebook to explore Data-Juicer's sandbox environment for data-model co-development.