# Ray Train + Ray Data + DLRM + Criteo

This report demonstrates training a DLRM model on the Criteo dataset using Ray Train and Ray Data. Compared to [the baseline](https://github.com/mlcommons/training/blob/master/recommendation_v2/torchrec_dlrm/dlrm_main.py), we achieve several improvements with a straightforward setup:

* Process training data on-the-fly during training
* Enable multi-node distributed training
* Profile the program using Ray Data metrics and GPU profiler
* Implement checkpointing with fault tolerance


## Workspace Requirements

* To demonstrate Ray’s support for heterogeneous clusters, we use two g5.12xlarge nodes (four A10G GPUs each) and two CPU-only r7i.12xlarge nodes.

### Note

The original model requires A100 GPUs to run. To enable execution on A10 GPUs, we manually reduced the embedding table size. This adjustment may lead to a degradation in model quality.
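
To make the trade-off concrete, the sketch below estimates the memory footprint of a single embedding table and shows one common way to shrink it: cap the row count and fold out-of-range IDs into the smaller table with a modulo. The row counts, the embedding dimension, and the helper names are illustrative assumptions, not values taken from `RecsysConfig`, and the notebook may reduce the tables differently.

```python
def embedding_table_bytes(num_rows: int, dim: int, bytes_per_value: int = 4) -> int:
    """Memory for the weights of one dense float32 embedding table."""
    return num_rows * dim * bytes_per_value


def cap_rows(num_rows: int, max_rows: int) -> int:
    """Shrink a table to at most `max_rows` rows."""
    return min(num_rows, max_rows)


def fold_id(feature_id: int, capped_rows: int) -> int:
    """Map an ID from the original table into the capped table.

    Distinct IDs can collide after folding, which is one reason a smaller
    table may cost model quality.
    """
    return feature_id % capped_rows


# Hypothetical sizes, for illustration only.
original_rows = 40_000_000
capped = cap_rows(original_rows, max_rows=4_000_000)

print(embedding_table_bytes(original_rows, dim=128) / 1e9)  # 20.48 (GB)
print(embedding_table_bytes(capped, dim=128) / 1e9)         # 2.048 (GB)
print(fold_id(4_000_001, capped))                           # 1
```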

## Install Dependencies

In [1]:
%pip install torch==2.6.0 torchrec==1.1.0 fbgemm-gpu==1.1.0 --extra-index-url https://download.pytorch.org/whl/cu128

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu128
Note: you may need to restart the kernel to use updated packages.


## Import the Configs

In [2]:
# Note: we reduce the embedding table size so that the model can run on A10 GPUs.
from configs import RecsysConfig
import os

recsys_config = RecsysConfig()
# Two g5.12xlarge nodes with 4 GPUs each: one training worker per GPU
recsys_config.num_workers = 8
recsys_config.train_step_limit = 5000

# Enable Ray Train V2
os.environ['RAY_TRAIN_V2_ENABLED'] = '1'

## Ray Data for Criteo Dataset

Instead of preprocessing the training data offline, we process it on the fly and overlap the processing with training. This approach involves the following steps:

* Load the feature mapping table into the [object store](https://docs.ray.io/en/latest/ray-core/key-concepts.html#objects). The benefit of the object store is that processes on the same node can share memory efficiently.
* Start the Ray Data pipeline, which:
    * Reads the raw training data
    * Fills missing data
    * Looks up the feature mapping table to transform categorical features into feature IDs
    * Concatenates and normalizes the features

### Lazily Load the Feature Mapping Table into the Object Store

In [3]:
import ray
from typing import Dict
from criteo import read_feature_mapping_table, DEFAULT_CAT_NAMES

def build_categorical_to_feature_mapping_refs() -> Dict[str, ray.ObjectRef]:
    return {
        cat_feature: read_feature_mapping_table.remote(cat_feature) for cat_feature in DEFAULT_CAT_NAMES
    }

# This call returns immediately; the `read_feature_mapping_table` tasks run in the background.
categorical_to_feature_mapping_refs = build_categorical_to_feature_mapping_refs()


2025-09-03 11:10:34,843	INFO worker.py:1723 -- Connecting to existing Ray cluster at address: 10.0.1.116:6379...
2025-09-03 11:10:34,855	INFO worker.py:1908 -- Connected to Ray cluster. View the dashboard at https://session-25r9wrr33xev8seqqr88bcqqei.i.anyscaleuserdata-staging.com
2025-09-03 11:10:34,864	INFO packaging.py:380 -- Pushing file package 'gcs://_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b.zip' (1.17MiB) to Ray cluster...
2025-09-03 11:10:34,869	INFO packaging.py:393 -- Successfully pushed file package 'gcs://_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b.zip'.


### Build the Ray Data Pipeline

In [4]:
import ray
import pyarrow.csv
from criteo import TRAIN_DATASET_PATH, VAL_DATASET_PATH, DEFAULT_COLUMN_NAMES, fill_missing, map_features_to_indices, concat_and_normalize_dense_features

def get_ray_dataset(path: str) -> ray.data.Dataset:
    ds = ray.data.read_csv(
        path,
        read_options=pyarrow.csv.ReadOptions(column_names=DEFAULT_COLUMN_NAMES),
        parse_options=pyarrow.csv.ParseOptions(delimiter="\t"),
        ray_remote_args={
            # Reading is memory intensive, so reserve memory for each read task.
            'memory': 800 * 1024 * 1024,  # 800 MiB
        },
        shuffle="files",  # coarse file-level shuffle
    )
    ds = ds.map_batches(fill_missing)
    # Reuse the mapping-table references created in the previous cell
    # instead of launching the read tasks again for every dataset.
    ds = ds.map_batches(map_features_to_indices, fn_kwargs={"categorical_to_feature_mapping_refs": categorical_to_feature_mapping_refs})
    ds = ds.map_batches(concat_and_normalize_dense_features)
    return ds

train_dataset = get_ray_dataset(TRAIN_DATASET_PATH)
val_dataset = get_ray_dataset(VAL_DATASET_PATH)
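
The helpers imported from `criteo` are not shown in this notebook. As a rough illustration of what each `map_batches` stage does, here is a minimal sketch that runs on a tiny batch in the dict-of-NumPy-arrays format. The column names, the `log(x + 1)` normalization, the "unseen value maps to 0" rule, and the `*_sketch` function names are all assumptions for illustration; the real implementations may differ. In particular, the real pipeline receives the mapping tables as object references rather than a plain dict.

```python
from typing import Dict

import numpy as np

Batch = Dict[str, np.ndarray]

DENSE = ["int_0", "int_1"]  # hypothetical dense column names
CATS = ["cat_0"]            # hypothetical categorical column name


def fill_missing_sketch(batch: Batch) -> Batch:
    """Replace missing dense values with 0 and missing categories with ''."""
    out = dict(batch)
    for name in DENSE:
        out[name] = np.nan_to_num(batch[name].astype(np.float32), nan=0.0)
    for name in CATS:
        out[name] = np.array(["" if v is None else v for v in batch[name]], dtype=object)
    return out


def map_features_to_indices_sketch(batch: Batch, mapping: Dict[str, Dict[str, int]]) -> Batch:
    """Look up each categorical value in its mapping table; unseen values map to 0."""
    out = dict(batch)
    for name in CATS:
        table = mapping[name]
        out[name] = np.array([table.get(v, 0) for v in batch[name]], dtype=np.int64)
    return out


def concat_and_normalize_dense_features_sketch(batch: Batch) -> Batch:
    """Stack dense columns into one (rows, features) array and apply log(x + 1)."""
    dense = np.stack([batch[name] for name in DENSE], axis=1)
    out = {k: v for k, v in batch.items() if k not in DENSE}
    out["dense"] = np.log1p(np.clip(dense, 0.0, None)).astype(np.float32)
    return out


raw = {
    "int_0": np.array([1.0, np.nan]),
    "int_1": np.array([np.nan, 3.0]),
    "cat_0": np.array(["a1", None], dtype=object),
}
mapping = {"cat_0": {"a1": 7}}

batch = fill_missing_sketch(raw)
batch = map_features_to_indices_sketch(batch, mapping)
batch = concat_and_normalize_dense_features_sketch(batch)

print(batch["cat_0"])        # [7 0]
print(batch["dense"].shape)  # (2, 2)
```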


## Set Up Model and Build Train Function

To integrate the [TorchRec implementation](https://github.com/facebookresearch/dlrm/blob/main/torchrec_dlrm/dlrm_main.py) with Ray Train, minor modifications are required:

* Call `ray.train.get_dataset_shard('train')` to obtain the worker's data shard, which replaces the original dataloader
* Use Ray Train APIs to fetch ranks and world sizes

Check [TorchRecWrapper](torchrec_wrapper.py) for implementation details.

Note: TorchRecWrapper must be initialized inside the training function, after the training worker has finished initializing.

Within the training loop, we enable fault tolerance using checkpoints provided by Ray Train.
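
A minimal sketch of how these pieces might fit together in a per-worker training function. It relies on the Ray Train calls `ray.train.get_context()`, `ray.train.get_dataset_shard()`, `ray.train.get_checkpoint()`, and `ray.train.report()`. The state file name, the config keys other than `train_step_limit`, and the `train_loop_sketch` function itself are hypothetical. The actual `train_loop` in `torchrec_wrapper.py` may be organized differently, for example by saving sharded model state on every rank.

```python
import json
import os
import tempfile


def save_state(directory: str, step: int) -> None:
    """Write the resumable state (here only the step counter) to `directory`."""
    with open(os.path.join(directory, "state.json"), "w") as f:
        json.dump({"step": step}, f)


def load_state(directory: str) -> int:
    """Return the step stored by `save_state`."""
    with open(os.path.join(directory, "state.json")) as f:
        return json.load(f)["step"]


def train_loop_sketch(config: dict) -> None:
    # Imported here so the helpers above stay usable without a Ray cluster.
    import ray.train
    from ray.train import Checkpoint

    ctx = ray.train.get_context()
    rank, world_size = ctx.get_world_rank(), ctx.get_world_size()

    # Resume from the latest checkpoint if Ray Train restarted the workers.
    step = 0
    checkpoint = ray.train.get_checkpoint()
    if checkpoint is not None:
        with checkpoint.as_directory() as ckpt_dir:
            step = load_state(ckpt_dir)

    shard = ray.train.get_dataset_shard("train")
    for batch in shard.iter_torch_batches(batch_size=config.get("batch_size", 1024)):
        step += 1  # the forward/backward/optimizer step would go here
        if step % config.get("checkpoint_every", 1000) == 0:
            with tempfile.TemporaryDirectory() as tmp:
                # Only rank 0 writes files, but every worker calls report().
                ckpt = None
                if rank == 0:
                    save_state(tmp, step)
                    ckpt = Checkpoint.from_directory(tmp)
                ray.train.report({"step": step, "world_size": world_size}, checkpoint=ckpt)
        if step >= config.get("train_step_limit", 5000):
            break
```

Keeping the save and load helpers free of Ray imports makes the resume logic easy to test on its own, before running it on a cluster.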


In [5]:
from torchrec_wrapper import train_loop


## Define TorchTrainer and Start Training

We define the `TorchTrainer` and run `fit()`. Key points to note:

* By configuring `scaling_config.num_workers`, we can easily enable multi-node distributed training. In this notebook, we use 2 g5.12xlarge nodes, providing 8 GPUs in total.
* Setting `{"KINETO_USE_DAEMON": "1", "KINETO_DAEMON_INIT_DELAY_S": "5"}` according to [GPU profiling guidelines](https://docs.anyscale.com/monitoring/workload-debugging/profiling-tools) enables easy profiling of GPU events on any worker.
* The remaining CPUs are allocated for Ray Data processing.
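
As a back-of-the-envelope check on that last point, the sketch below tallies the CPU budget for this cluster. It assumes 48 vCPUs per node for both g5.12xlarge and r7i.12xlarge (the published AWS specifications) and the 5 CPUs per worker reserved in the `ScalingConfig` below. It ignores the head node and any CPUs used by other Ray components, so treat the result as an upper bound.

```python
VCPUS_PER_NODE = 48  # assumed for both g5.12xlarge and r7i.12xlarge
GPU_NODES, CPU_NODES = 2, 2
NUM_WORKERS = 8
CPUS_PER_WORKER = 5

total_cpus = VCPUS_PER_NODE * (GPU_NODES + CPU_NODES)
reserved_for_training = NUM_WORKERS * CPUS_PER_WORKER
left_for_ray_data = total_cpus - reserved_for_training

print(total_cpus, reserved_for_training, left_for_ray_data)  # 192 40 152
```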

We will run 5000 iterations and evaluate every 1000 iterations.

In [6]:
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig, RunConfig, CheckpointConfig, FailureConfig
import logging

logger = logging.getLogger(__name__)

scaling_config = ScalingConfig(
    num_workers=recsys_config.num_workers,
    # Reserving CPUs for the training workers can make training more stable.
    resources_per_worker={"GPU": 1, "CPU": 5},
    use_gpu=True,
)

config_dict = {}
for attr in dir(recsys_config):
    if not attr.startswith('_'):
        value = getattr(recsys_config, attr)
        if not callable(value):
            config_dict[attr] = value

logger.info(f"Starting Ray training with {recsys_config.num_workers} workers")
logger.info(f"Training configuration: {config_dict}")

# Create TorchTrainer
trainer = TorchTrainer(
    train_loop_per_worker=train_loop,
    train_loop_config=config_dict,
    scaling_config=scaling_config,
    run_config=RunConfig(
        failure_config=FailureConfig(max_failures=2),
        worker_runtime_env={'env_vars': {"KINETO_USE_DAEMON": "1", "KINETO_DAEMON_INIT_DELAY_S": "5"}},
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
        ),
        storage_path=recsys_config.checkpoint_dir,
    ),
    datasets={
        "train": train_dataset,
        "val": val_dataset,
    },
)

# Run training
logger.info("Starting distributed training...")
result = trainer.fit()

logger.info("Training completed successfully!")
logger.info(f"Final metrics: {result.metrics}")

(TrainController pid=36480) [State Transition] INITIALIZING -> SCHEDULING.
(TrainController pid=36480) Attempting to start training worker group of size 8 with the following resources: [{'GPU': 1, 'CPU': 5}] * 8
(RayTrainWorker pid=24234, ip=10.0.9.59) INFO:2025-09-03 11:10:45 24234:24234 init.cpp:136] Registering daemon config loader, cpuOnly =  0
(RayTrainWorker pid=24234, ip=10.0.9.59) Setting up process group for: env:// [rank=0, world_size=8]
(TrainController pid=36480) Started training worker group of size 8:
(TrainController pid=36480) - (ip=10.0.9.59, pid=24234) world_rank=0, local_rank=0, node_rank=0
(TrainController pid=36480) - (ip=10.0.9.59, pid=24236) world_rank=1, local_rank=1, node_rank=0
(TrainController pid=36480) - (ip=10.0.9.59, pid=24237) world_rank=2, local_rank=2, node_rank=0
(TrainController pid=36480) - (ip=10.0.9.59, pid=24235) world_rank=3, local_rank=3, node_rank=0
(TrainCo

(pid=36605) Running 0: 0.00 row [00:00, ? row/s]

(pid=36605) - ListFiles 1: 0.00 row [00:00, ? row/s]

(pid=36605) - ReadFiles 2: 0.00 row [00:00, ? row/s]

(pid=36605) - MapBatches(fill_missing)->...->MapBatches(concat_and_normalize_dense_features) 3: 0.00 row [00:0…

(pid=36605) - split(8, equal=True) 4: 0.00 row [00:00, ? row/s]

(SplitCoordinator pid=36605) Starting execution of Dataset train_20_0. Full logs are in /tmp/ray/session_2025-09-03_10-36-42_840557_19002/logs/ray-data
(SplitCoordinator pid=36605) Execution plan of Dataset train_20_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[MapBatches(fill_missing)->MapBatches(map_features_to_indices)->MapBatches(concat_and_normalize_dense_features)] -> OutputSplitter[split(8, equal=True)]
(SplitCoordinator pid=36605) Truncating long operator name to 100 characters. To disable this behavior, set `ray.data.DataContext.get_current().DEFAULT_ENABLE_PROGRESS_BAR_NAME_TRUNCATION = False`.
(raylet, ip=10.0.33.142) Spilled 17047 MiB, 231 objects, write throughput 859 MiB/s.","component":"raylet","filename":"local_object_manager.cc","lineno":250}
(RayTrainWorker pid=23478, ip=10.0.16.4) Successfully loaded: 'fbgemm_gpu_config.so' [repeated 7x across clus

(autoscaler +50s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(autoscaler +50s) [worker-i-01fce145af89a60a8] Node is experiencing disk pressure - current disk utilization: 96.64% (140.29GiB/145.18GiB).


(RayTrainWorker pid=28088, ip=10.0.9.59) INFO:2025-09-03 11:11:26 28088:28088 init.cpp:136] Registering daemon config loader, cpuOnly =  0
(TrainController pid=36480) Traceback (most recent call last): [repeated 16x across cluster]
(TrainController pid=36480)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 472, in process_completed_tasks [repeated 32x across cluster]
(TrainController pid=36480)     bytes_read = task.on_data_ready( [repeated 16x across cluster]
(TrainController pid=36480)            ^^^^^^^^^^^^^^^^^^^ [repeated 48x across cluster]
(TrainController pid=36480)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 134, in on_data_ready [repeated 32x across cluster]
(TrainController pid=36480)     raise ex from None [rep

(autoscaler +55s) [worker-i-083d08465c1578dc3] Node is experiencing disk pressure - current disk utilization: 95.09% (138.05GiB/145.18GiB).


(raylet, ip=10.0.33.142) {"asctime":"2025-09-03 11:11:30,460","levelname":"E","message":"/mnt/local_storage/ is over 95% full, available space: 4.4242 GB; capacity: 145.191 GB. Object creation will fail if spilling is required.","component":"raylet","filename":"file_system_monitor.cc","lineno":116}
[36m(RayTrainWorker pid=28088, ip=10.0.9.59)[0m #####################################################################################################################################################################################################################################################################################################################################################
[36m(RayTrainWorker pid=28088, ip=10.0.9.59)[0m #                                                                                                                                                            --- Planner Statistics ---                                                                  

(pid=37138) Running 0: 0.00 row [00:00, ? row/s]

(pid=37138) - ListFiles 1: 0.00 row [00:00, ? row/s]

(pid=37138) - ReadFiles 2: 0.00 row [00:00, ? row/s]

(pid=37138) - MapBatches(fill_missing)->...->MapBatches(concat_and_normalize_dense_features) 3: 0.00 row [00:0…

(pid=37138) - split(8, equal=True) 4: 0.00 row [00:00, ? row/s]

(SplitCoordinator pid=37138) Registered dataset logger for dataset train_22_0
(SplitCoordinator pid=37138) Starting execution of Dataset train_22_0. Full logs are in /tmp/ray/session_2025-09-03_10-36-42_840557_19002/logs/ray-data
(SplitCoordinator pid=37138) Execution plan of Dataset train_22_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[MapBatches(fill_missing)->MapBatches(map_features_to_indices)->MapBatches(concat_and_normalize_dense_features)] -> OutputSplitter[split(8, equal=True)]
(SplitCoordinator pid=37138) Truncating long operator name to 100 characters. To disable this behavior, set `ray.data.DataContext.get_current().DEFAULT_ENABLE_PROGRESS_BAR_NAME_TRUNCATION = False`.
(SplitCoordinator pid=37138) An exception was raised from a task of operator "ReadFiles". Dataset execution will now abort. To ignore this exception and continue, set DataContext.max_errored_blo

(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 42d18199be4c537a1f816e3a0832233c3f4bb53703000000 Worker ID: 352ef8f2d20ddbd6e7b21dfc2fab5fd504936f1a24f9225c9937cdfd Node ID: 666971dcc76464e898d5560e562d87900a79691d26ba350dd901db25 Worker IP address: 10.0.8.71 Worker port: 10314 Worker PID: 80000 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker exits unexpectedly. Worker exits with an exit code 1. The process receives a SIGTERM.


(TrainController pid=36480) Attempting to start training worker group of size 8 with the following resources: [{'GPU': 1, 'CPU': 5}] * 8
(raylet, ip=10.0.33.142) {"asctime":"2025-09-03 11:11:40,502","levelname":"E","message":"/mnt/local_storage/ is over 95% full, available space: 2.18708 GB; capacity: 145.191 GB. Object creation will fail if spilling is required.","component":"raylet","filename":"file_system_monitor.cc","lineno":116}
(RayTrainWorker pid=29017, ip=10.0.9.59) INFO:2025-09-03 11:11:41 29017:29017 init.cpp:136] Registering daemon config loader, cpuOnly =  0
(TrainController pid=36480) Traceback (most recent call last): [repeated 16x across cluster]
(TrainController pid=36480)   File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 472, in process_completed_tasks [repeated 32x across cluster]
(TrainController pid=36480)     bytes_read = ta

(raylet) Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 2460, in ray._raylet.spill_objects_handler
  File "python/ray/_raylet.pyx", line 2463, in ray._raylet.spill_objects_handler
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/external_storage.py", line 708, in spill_objects
    return _external_storage.spill_objects(object_refs, owner_addresses)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/external_storage.py", line 335, in spill_objects
    return self._write_multiple_objects(f, object_refs, owner_addresses, url)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/external_storage.py", line 175, in _write_multiple_objects
    written_bytes = f.write(payload)
                    ^^^^^^^^^^^^^^^^
OSError: [Errno 28] No space left on devic

(RayTrainWorker pid=29017, ip=10.0.9.59) Sharding Type is data_parallel, caching params will be ignored

(pid=37397) Running 0: 0.00 row [00:00, ? row/s]

(pid=37397) - ListFiles 1: 0.00 row [00:00, ? row/s]

(pid=37397) - ReadFiles 2: 0.00 row [00:00, ? row/s]

(pid=37397) - MapBatches(fill_missing)->...->MapBatches(concat_and_normalize_dense_features) 3: 0.00 row [00:0…

(pid=37397) - split(8, equal=True) 4: 0.00 row [00:00, ? row/s]

(SplitCoordinator pid=37397) Registered dataset logger for dataset train_24_0
(SplitCoordinator pid=37397) Starting execution of Dataset train_24_0. Full logs are in /tmp/ray/session_2025-09-03_10-36-42_840557_19002/logs/ray-data
(SplitCoordinator pid=37397) Execution plan of Dataset train_24_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ListFiles] -> TaskPoolMapOperator[ReadFiles] -> TaskPoolMapOperator[MapBatches(fill_missing)->MapBatches(map_features_to_indices)->MapBatches(concat_and_normalize_dense_features)] -> OutputSplitter[split(8, equal=True)]
(SplitCoordinator pid=37397) Truncating long operator name to 100 characters. To disable this behavior, set `ray.data.DataContext.get_current().DEFAULT_ENABLE_PROGRESS_BAR_NAME_TRUNCATION = False`.
(RayTrainWorker pid=29014, ip=10.0.9.59) model.sparse_arch.embedding_bag_collection
(RayTrainWorker pid=29014, ip=10.0.9.59) t_cat_0
(RayTrainWorker pid=29014, ip=10.0.9.59) Pa

(autoscaler +1m15s) [worker-i-01fce145af89a60a8] Node has recovered from disk pressure - current disk utilization: 90.07% (130.76GiB/145.18GiB).


(raylet, ip=10.0.8.71) {"asctime":"2025-09-03 11:11:52,300","levelname":"E","message":"/mnt/local_storage/ is over 95% full, available space: 1.72724 GB; capacity: 145.191 GB. Object creation will fail if spilling is required.","component":"raylet","filename":"file_system_monitor.cc","lineno":116}
(RayTrainWorker pid=29016, ip=10.0.9.59) Successfully loaded: 'fbgemm_gpu_config.so' [repeated 7x across cluster]
(RayTrainWorker pid=29016, ip=10.0.9.59) Successfully loaded: 'fbgemm_gpu_tbe_utils.so' [repeated 7x across cluster]
(RayTrainWorker pid=29016, ip=10.0.9.59) Successfully loaded: 'fbgemm_gpu_tbe_index_select.so' [repeated 7x across cluster]
(RayTrainWorker pid=29016, ip=10.0.9.59) Successfully loaded: 'fbgemm_gpu_tbe_optimizers.so' [repeated 7x across cluster]
(RayTrainWorker pid=29016, ip=10.0.9.59) Successfully loaded: 'fbgemm_gpu_tbe_inference.so' [repeated 7x across cluster]

TrainingFailedError: Training failed due to worker errors:
[Rank 0]
Traceback (most recent call last):
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 603, in train_loop
    wrapper.run()
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 557, in run
    self._train_epoch()
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 306, in _train_epoch
    batch_loss, _logits, _labels = self._train_pipeline.progress(train_dataloader)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 478, in progress
    self.fill_pipeline(dataloader_iter)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 459, in fill_pipeline
    if not self.enqueue_batch(dataloader_iter):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 434, in enqueue_batch
    batch, context = self.copy_batch_to_gpu(dataloader_iter)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 582, in copy_batch_to_gpu
    batch = self._next_batch(dataloader_iter)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 602, in _next_batch
    batch = next(dataloader_iter, None)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/iterator.py", line 208, in _create_iterator
    for batch in iterator:
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 181, in iter_batches
    next_batch = next(async_batch_iter)
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1092, in make_async_gen
    raise item
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/util.py", line 995, in _run_filling_worker
    for idx, item in enumerate(base_iterator):
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 89, in gen_blocks
    ] = ray.get(future)
        ^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 2849, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 937, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfDiskError): ray::SplitCoordinator.get() (pid=37397, ip=10.0.1.116, actor_id=79f8aaaa54dda692da55167303000000, repr=<ray.data._internal.iterator.stream_split_iterator.SplitCoordinator object at 0x79495e1ba410>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 278, in run
    continue_sched = self._scheduling_loop_step(self._topology)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 336, in _scheduling_loop_step
    num_errored_blocks = process_completed_tasks(
                         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 505, in process_completed_tasks
    raise e from None
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 472, in process_completed_tasks
    bytes_read = task.on_data_ready(
                 ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 138, in on_data_ready
    raise ex from None
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 134, in on_data_ready
    ray.get(block_ref)
           ^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ray.exceptions.RayTaskError(OutOfDiskError): ray::ReadFiles() (pid=79309, ip=10.0.8.71)
  File "python/ray/includes/common.pxi", line 79, in ray._raylet.check_status
ray.exceptions.OutOfDiskError: Local disk is full
The object cannot be created because the local object store is full and the local disk's utilization is over capacity (95% by default).Tip: Use `df` on this node to check disk usage and `ray memory` to check object store memory usage.

[Rank 1]
Traceback (most recent call last):
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 603, in train_loop
    wrapper.run()
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 557, in run
    self._train_epoch()
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 306, in _train_epoch
    batch_loss, _logits, _labels = self._train_pipeline.progress(train_dataloader)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 478, in progress
    self.fill_pipeline(dataloader_iter)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 459, in fill_pipeline
    if not self.enqueue_batch(dataloader_iter):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 434, in enqueue_batch
    batch, context = self.copy_batch_to_gpu(dataloader_iter)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 582, in copy_batch_to_gpu
    batch = self._next_batch(dataloader_iter)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 602, in _next_batch
    batch = next(dataloader_iter, None)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/iterator.py", line 208, in _create_iterator
    for batch in iterator:
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 181, in iter_batches
    next_batch = next(async_batch_iter)
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1092, in make_async_gen
    raise item
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/util.py", line 995, in _run_filling_worker
    for idx, item in enumerate(base_iterator):
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 89, in gen_blocks
    ] = ray.get(future)
        ^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 2849, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 937, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfDiskError): ray::SplitCoordinator.get() (pid=37397, ip=10.0.1.116, actor_id=79f8aaaa54dda692da55167303000000, repr=<ray.data._internal.iterator.stream_split_iterator.SplitCoordinator object at 0x79495e1ba410>)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 278, in run
    continue_sched = self._scheduling_loop_step(self._topology)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 336, in _scheduling_loop_step
    num_errored_blocks = process_completed_tasks(
                         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 505, in process_completed_tasks
    raise e from None
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 472, in process_completed_tasks
    bytes_read = task.on_data_ready(
                 ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 138, in on_data_ready
    raise ex from None
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 134, in on_data_ready
    ray.get(block_ref)
ray.exceptions.RayTaskError(OutOfDiskError): ray::ReadFiles() (pid=79309, ip=10.0.8.71)
  File "python/ray/includes/common.pxi", line 79, in ray._raylet.check_status
ray.exceptions.OutOfDiskError: Local disk is full
The object cannot be created because the local object store is full and the local disk's utilization is over capacity (95% by default).Tip: Use `df` on this node to check disk usage and `ray memory` to check object store memory usage.

[Rank 2]
Traceback (most recent call last):
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 603, in train_loop
    wrapper.run()
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 557, in run
    self._train_epoch()
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 306, in _train_epoch
    batch_loss, _logits, _labels = self._train_pipeline.progress(train_dataloader)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 478, in progress
    self.fill_pipeline(dataloader_iter)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 459, in fill_pipeline
    if not self.enqueue_batch(dataloader_iter):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 434, in enqueue_batch
    batch, context = self.copy_batch_to_gpu(dataloader_iter)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 582, in copy_batch_to_gpu
    batch = self._next_batch(dataloader_iter)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 602, in _next_batch
    batch = next(dataloader_iter, None)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/iterator.py", line 208, in _create_iterator
    for batch in iterator:
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 181, in iter_batches
    next_batch = next(async_batch_iter)
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1092, in make_async_gen
    raise item
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/util.py", line 995, in _run_filling_worker
    for idx, item in enumerate(base_iterator):
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 89, in gen_blocks
    ] = ray.get(future)
        ^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 2849, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 937, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfDiskError): ray::SplitCoordinator.get() (pid=37397, ip=10.0.1.116, actor_id=79f8aaaa54dda692da55167303000000, repr=<ray.data._internal.iterator.stream_split_iterator.SplitCoordinator object at 0x79495e1ba410>)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 278, in run
    continue_sched = self._scheduling_loop_step(self._topology)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 336, in _scheduling_loop_step
    num_errored_blocks = process_completed_tasks(
                         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 505, in process_completed_tasks
    raise e from None
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 472, in process_completed_tasks
    bytes_read = task.on_data_ready(
                 ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 138, in on_data_ready
    raise ex from None
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 134, in on_data_ready
    ray.get(block_ref)
ray.exceptions.RayTaskError(OutOfDiskError): ray::ReadFiles() (pid=79309, ip=10.0.8.71)
  File "python/ray/includes/common.pxi", line 79, in ray._raylet.check_status
ray.exceptions.OutOfDiskError: Local disk is full
The object cannot be created because the local object store is full and the local disk's utilization is over capacity (95% by default).Tip: Use `df` on this node to check disk usage and `ray memory` to check object store memory usage.

[Rank 3]
Traceback (most recent call last):
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 603, in train_loop
    wrapper.run()
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 557, in run
    self._train_epoch()
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 306, in _train_epoch
    batch_loss, _logits, _labels = self._train_pipeline.progress(train_dataloader)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 478, in progress
    self.fill_pipeline(dataloader_iter)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 459, in fill_pipeline
    if not self.enqueue_batch(dataloader_iter):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 434, in enqueue_batch
    batch, context = self.copy_batch_to_gpu(dataloader_iter)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 582, in copy_batch_to_gpu
    batch = self._next_batch(dataloader_iter)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 602, in _next_batch
    batch = next(dataloader_iter, None)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/iterator.py", line 208, in _create_iterator
    for batch in iterator:
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 181, in iter_batches
    next_batch = next(async_batch_iter)
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1092, in make_async_gen
    raise item
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/util.py", line 995, in _run_filling_worker
    for idx, item in enumerate(base_iterator):
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 89, in gen_blocks
    ] = ray.get(future)
        ^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 2849, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 937, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfDiskError): ray::SplitCoordinator.get() (pid=37397, ip=10.0.1.116, actor_id=79f8aaaa54dda692da55167303000000, repr=<ray.data._internal.iterator.stream_split_iterator.SplitCoordinator object at 0x79495e1ba410>)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 278, in run
    continue_sched = self._scheduling_loop_step(self._topology)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 336, in _scheduling_loop_step
    num_errored_blocks = process_completed_tasks(
                         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 505, in process_completed_tasks
    raise e from None
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 472, in process_completed_tasks
    bytes_read = task.on_data_ready(
                 ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 138, in on_data_ready
    raise ex from None
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 134, in on_data_ready
    ray.get(block_ref)
ray.exceptions.RayTaskError(OutOfDiskError): ray::ReadFiles() (pid=79309, ip=10.0.8.71)
  File "python/ray/includes/common.pxi", line 79, in ray._raylet.check_status
ray.exceptions.OutOfDiskError: Local disk is full
The object cannot be created because the local object store is full and the local disk's utilization is over capacity (95% by default).Tip: Use `df` on this node to check disk usage and `ray memory` to check object store memory usage.

[Rank 4]
Traceback (most recent call last):
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 603, in train_loop
    wrapper.run()
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 557, in run
    self._train_epoch()
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 306, in _train_epoch
    batch_loss, _logits, _labels = self._train_pipeline.progress(train_dataloader)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 478, in progress
    self.fill_pipeline(dataloader_iter)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 459, in fill_pipeline
    if not self.enqueue_batch(dataloader_iter):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 434, in enqueue_batch
    batch, context = self.copy_batch_to_gpu(dataloader_iter)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 582, in copy_batch_to_gpu
    batch = self._next_batch(dataloader_iter)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 602, in _next_batch
    batch = next(dataloader_iter, None)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/iterator.py", line 208, in _create_iterator
    for batch in iterator:
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 181, in iter_batches
    next_batch = next(async_batch_iter)
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1092, in make_async_gen
    raise item
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/util.py", line 995, in _run_filling_worker
    for idx, item in enumerate(base_iterator):
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 89, in gen_blocks
    ] = ray.get(future)
        ^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 2849, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 937, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfDiskError): ray::SplitCoordinator.get() (pid=37397, ip=10.0.1.116, actor_id=79f8aaaa54dda692da55167303000000, repr=<ray.data._internal.iterator.stream_split_iterator.SplitCoordinator object at 0x79495e1ba410>)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 278, in run
    continue_sched = self._scheduling_loop_step(self._topology)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 336, in _scheduling_loop_step
    num_errored_blocks = process_completed_tasks(
                         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 505, in process_completed_tasks
    raise e from None
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 472, in process_completed_tasks
    bytes_read = task.on_data_ready(
                 ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 138, in on_data_ready
    raise ex from None
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 134, in on_data_ready
    ray.get(block_ref)
ray.exceptions.RayTaskError(OutOfDiskError): ray::ReadFiles() (pid=79309, ip=10.0.8.71)
  File "python/ray/includes/common.pxi", line 79, in ray._raylet.check_status
ray.exceptions.OutOfDiskError: Local disk is full
The object cannot be created because the local object store is full and the local disk's utilization is over capacity (95% by default).Tip: Use `df` on this node to check disk usage and `ray memory` to check object store memory usage.

[Rank 5]
Traceback (most recent call last):
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 603, in train_loop
    wrapper.run()
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 557, in run
    self._train_epoch()
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 306, in _train_epoch
    batch_loss, _logits, _labels = self._train_pipeline.progress(train_dataloader)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 478, in progress
    self.fill_pipeline(dataloader_iter)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 459, in fill_pipeline
    if not self.enqueue_batch(dataloader_iter):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 434, in enqueue_batch
    batch, context = self.copy_batch_to_gpu(dataloader_iter)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 582, in copy_batch_to_gpu
    batch = self._next_batch(dataloader_iter)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 602, in _next_batch
    batch = next(dataloader_iter, None)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/iterator.py", line 208, in _create_iterator
    for batch in iterator:
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 181, in iter_batches
    next_batch = next(async_batch_iter)
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1092, in make_async_gen
    raise item
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/util.py", line 995, in _run_filling_worker
    for idx, item in enumerate(base_iterator):
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 89, in gen_blocks
    ] = ray.get(future)
        ^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 2849, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 937, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfDiskError): ray::SplitCoordinator.get() (pid=37397, ip=10.0.1.116, actor_id=79f8aaaa54dda692da55167303000000, repr=<ray.data._internal.iterator.stream_split_iterator.SplitCoordinator object at 0x79495e1ba410>)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 278, in run
    continue_sched = self._scheduling_loop_step(self._topology)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 336, in _scheduling_loop_step
    num_errored_blocks = process_completed_tasks(
                         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 505, in process_completed_tasks
    raise e from None
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 472, in process_completed_tasks
    bytes_read = task.on_data_ready(
                 ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 138, in on_data_ready
    raise ex from None
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 134, in on_data_ready
    ray.get(block_ref)
ray.exceptions.RayTaskError(OutOfDiskError): ray::ReadFiles() (pid=79309, ip=10.0.8.71)
  File "python/ray/includes/common.pxi", line 79, in ray._raylet.check_status
ray.exceptions.OutOfDiskError: Local disk is full
The object cannot be created because the local object store is full and the local disk's utilization is over capacity (95% by default).Tip: Use `df` on this node to check disk usage and `ray memory` to check object store memory usage.

[Rank 6]
Traceback (most recent call last):
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 603, in train_loop
    wrapper.run()
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 557, in run
    self._train_epoch()
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 306, in _train_epoch
    batch_loss, _logits, _labels = self._train_pipeline.progress(train_dataloader)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 478, in progress
    self.fill_pipeline(dataloader_iter)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 459, in fill_pipeline
    if not self.enqueue_batch(dataloader_iter):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 434, in enqueue_batch
    batch, context = self.copy_batch_to_gpu(dataloader_iter)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 582, in copy_batch_to_gpu
    batch = self._next_batch(dataloader_iter)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 602, in _next_batch
    batch = next(dataloader_iter, None)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/iterator.py", line 208, in _create_iterator
    for batch in iterator:
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 181, in iter_batches
    next_batch = next(async_batch_iter)
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1092, in make_async_gen
    raise item
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/util.py", line 995, in _run_filling_worker
    for idx, item in enumerate(base_iterator):
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 89, in gen_blocks
    ] = ray.get(future)
        ^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 2849, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 937, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfDiskError): ray::SplitCoordinator.get() (pid=37397, ip=10.0.1.116, actor_id=79f8aaaa54dda692da55167303000000, repr=<ray.data._internal.iterator.stream_split_iterator.SplitCoordinator object at 0x79495e1ba410>)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 278, in run
    continue_sched = self._scheduling_loop_step(self._topology)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 336, in _scheduling_loop_step
    num_errored_blocks = process_completed_tasks(
                         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 505, in process_completed_tasks
    raise e from None
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 472, in process_completed_tasks
    bytes_read = task.on_data_ready(
                 ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 138, in on_data_ready
    raise ex from None
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 134, in on_data_ready
    ray.get(block_ref)
ray.exceptions.RayTaskError(OutOfDiskError): ray::ReadFiles() (pid=79309, ip=10.0.8.71)
  File "python/ray/includes/common.pxi", line 79, in ray._raylet.check_status
ray.exceptions.OutOfDiskError: Local disk is full
The object cannot be created because the local object store is full and the local disk's utilization is over capacity (95% by default).Tip: Use `df` on this node to check disk usage and `ray memory` to check object store memory usage.

[Rank 7]
Traceback (most recent call last):
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 603, in train_loop
    wrapper.run()
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 557, in run
    self._train_epoch()
  File "/tmp/ray/session_2025-09-03_10-36-42_840557_19002/runtime_resources/working_dir_files/_ray_pkg_e52e1da86b771e01b0e1a8670441ff2d173d865b/torchrec_wrapper.py", line 306, in _train_epoch
    batch_loss, _logits, _labels = self._train_pipeline.progress(train_dataloader)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 478, in progress
    self.fill_pipeline(dataloader_iter)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 459, in fill_pipeline
    if not self.enqueue_batch(dataloader_iter):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 434, in enqueue_batch
    batch, context = self.copy_batch_to_gpu(dataloader_iter)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 582, in copy_batch_to_gpu
    batch = self._next_batch(dataloader_iter)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/torchrec/distributed/train_pipeline/train_pipelines.py", line 602, in _next_batch
    batch = next(dataloader_iter, None)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/iterator.py", line 208, in _create_iterator
    for batch in iterator:
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/block_batching/iter_batches.py", line 181, in iter_batches
    next_batch = next(async_batch_iter)
                 ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/util.py", line 1092, in make_async_gen
    raise item
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/util.py", line 995, in _run_filling_worker
    for idx, item in enumerate(base_iterator):
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 89, in gen_blocks
    ] = ray.get(future)
        ^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 2849, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/_private/worker.py", line 937, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OutOfDiskError): ray::SplitCoordinator.get() (pid=37397, ip=10.0.1.116, actor_id=79f8aaaa54dda692da55167303000000, repr=<ray.data._internal.iterator.stream_split_iterator.SplitCoordinator object at 0x79495e1ba410>)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/iterator/stream_split_iterator.py", line 236, in get
    next_bundle = self._output_iterator.get_next(output_split_idx)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/legacy_compat.py", line 75, in get_next
    bundle = self._base_iterator.get_next(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 577, in get_next
    bundle = state.get_output_blocking(output_split_idx)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 331, in get_output_blocking
    raise self._exception
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 278, in run
    continue_sched = self._scheduling_loop_step(self._topology)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor.py", line 336, in _scheduling_loop_step
    num_errored_blocks = process_completed_tasks(
                         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 505, in process_completed_tasks
    raise e from None
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/streaming_executor_state.py", line 472, in process_completed_tasks
    bytes_read = task.on_data_ready(
                 ^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 138, in on_data_ready
    raise ex from None
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/execution/interfaces/physical_operator.py", line 134, in on_data_ready
    ray.get(block_ref)
ray.exceptions.RayTaskError(OutOfDiskError): ray::ReadFiles() (pid=79309, ip=10.0.8.71)
  File "python/ray/includes/common.pxi", line 79, in ray._raylet.check_status
ray.exceptions.OutOfDiskError: Local disk is full
The object cannot be created because the local object store is full and the local disk's utilization is over capacity (95% by default).Tip: Use `df` on this node to check disk usage and `ray memory` to check object store memory usage.


The screenshot below shows how the workload is distributed across the GPU and CPU machines.

![running_progress](./images/running_progress.png)

## Ray Data Metrics

The Ray Data metrics dashboard provides several views for understanding the training pipeline.

For example, `Iteration Blocked Time` is a valuable metric for identifying data loading bottlenecks. If this value keeps increasing over time, the model is frequently waiting for data, which suggests that data loading is the bottleneck of the training pipeline.

![example](./images/data_metrics.png)
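
To make the idea behind this metric concrete, here is a minimal sketch that measures how long a training loop spends blocked on its data iterator. It only illustrates the concept and is not how Ray Data computes `Iteration Blocked Time`. The names `timed_batches` and `slow_loader` are hypothetical helpers, and the `time.sleep` calls stand in for preprocessing and for a training step.

```python
import time
from typing import Iterable, Iterator, Tuple


def timed_batches(batches: Iterable) -> Iterator[Tuple[object, float]]:
    """Yield (batch, cumulative seconds spent waiting for batches so far)."""
    blocked = 0.0
    it = iter(batches)
    while True:
        start = time.perf_counter()
        try:
            batch = next(it)  # the consumer is blocked while this call runs
        except StopIteration:
            return
        blocked += time.perf_counter() - start
        yield batch, blocked


def slow_loader(num_batches: int, delay_s: float) -> Iterator[int]:
    """Hypothetical data source that takes `delay_s` to produce each batch."""
    for i in range(num_batches):
        time.sleep(delay_s)  # stand-in for reading and preprocessing
        yield i


total_blocked = 0.0
loop_start = time.perf_counter()
for batch, total_blocked in timed_batches(slow_loader(5, 0.01)):
    time.sleep(0.001)  # stand-in for one training step
wall = time.perf_counter() - loop_start

blocked_fraction = total_blocked / wall
print(f"blocked {total_blocked:.3f}s of {wall:.3f}s ({blocked_fraction:.0%})")
```

In this toy setup the loader is about ten times slower than the training step, so most of the wall-clock time is spent waiting. A cumulative blocked time that keeps growing from step to step is the same signal the dashboard metric surfaces.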

## GPU Profiling

Setting the environment variables `KINETO_USE_DAEMON=1` and `KINETO_DAEMON_INIT_DELAY_S=5` lets you profile GPU metrics with one click.

![gpu profiling](./images/gpu_profiling.png)
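
One way to set these variables for Ray workers is through the `env_vars` field of a Ray runtime environment. The sketch below only builds the dictionary. Passing it to `ray.init` is shown as a comment and is an assumption about how the job is launched, since this report does not show where the variables are configured.

```python
# Environment variables named in the text above; values must be strings.
kineto_env_vars = {
    "KINETO_USE_DAEMON": "1",
    "KINETO_DAEMON_INIT_DELAY_S": "5",
}

# A Ray runtime environment applies `env_vars` to the worker processes it starts.
runtime_env = {"env_vars": kineto_env_vars}

# Assumed usage (requires Ray and a running cluster):
# import ray
# ray.init(runtime_env=runtime_env)
```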

## Model Quality

We achieve training loss and validation performance comparable to the baseline. The training loss curve is spiky because the model has many sparse features.

![wandb_curve](./images/model_quality.png)

## Throughput Benchmark

We compare the baseline and Ray Train versions under several conditions:

* Baseline (torchrun), single p4d, preprocessed NumPy data (7 minutes to download locally): throughput 1,020k
* Ray Train, single p4d, training data processed on-the-fly (data loading becomes the bottleneck): throughput 220k
* Ray Train, single p4d + 1 r7i: throughput 800k
* Ray Train, single p4d + 2 r7i: throughput 925k

From the benchmarks, we can observe:

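* The task is data-loading bound, and adding CPU-only machines mitigates this bottleneck: throughput rises from 220k on a single p4d to 925k with two additional r7i nodes. This demonstrates the strength of Ray Train on heterogeneous clusters.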
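
The relative numbers behind this observation can be worked out directly from the throughput figures listed above. The sketch below uses only those four values; the dictionary keys are shortened labels introduced here for illustration.

```python
# Throughput figures (in thousands) copied from the benchmark list above.
throughput_k = {
    "baseline (torchrun), 1x p4d": 1020,
    "Ray Train, 1x p4d": 220,
    "Ray Train, 1x p4d + 1x r7i": 800,
    "Ray Train, 1x p4d + 2x r7i": 925,
}

baseline = throughput_k["baseline (torchrun), 1x p4d"]
ray_single = throughput_k["Ray Train, 1x p4d"]

# For each setup: fraction of baseline throughput, and speedup over
# Ray Train on a single p4d with on-the-fly processing.
summary = {
    name: (value / baseline, value / ray_single)
    for name, value in throughput_k.items()
}

for name, (rel, speedup) in summary.items():
    print(f"{name:<30} {rel:6.1%} of baseline, {speedup:.2f}x vs. Ray Train on 1x p4d")
```

With two r7i nodes, the on-the-fly pipeline reaches roughly 91% of the baseline throughput, about 4.2x the single-node Ray Train run, without requiring the preprocessed data to be downloaded first.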

## Multi-node Scalability with EFA

By following the [Cluster-level EFA configuration](https://docs.anyscale.com/configuration/compute/aws#cluster-level-efa-configuration) guide, we can enable EFA (Elastic Fabric Adapter) on the Anyscale platform, which is critical for multi-node training.

We conduct two sets of experiments on `p3dn.24xlarge` machines.

### Without EFA

Without EFA, the throughput on 2 nodes is lower than on 1 node, which suggests that inter-node communication becomes the bottleneck.

![wandb_curve](./images/no_efa.png)

### With EFA

With EFA enabled, we observe clear scalability as the number of nodes increases.

![wandb_curve](./images/with_efa.png)
