# Ray Tasks Advanced: Building Distributed Applications

© 2025, Anyscale. All Rights Reserved

This notebook provides a step-by-step introduction to Ray Tasks, the fundamental building block of Ray that enables distributed computing.

<div class="alert alert-block alert-info">

<b> Here is the roadmap for this notebook </b>

<ol>
  <li>Error handling and task retries</li>
  <li>Task runtime environments</li>
  <li>Resource allocation and management</li>
  <li>Pipeline data processing and waiting for results</li>
  <li>Ray generators</li>
</ol>
</div>

## Imports



In [None]:
import math
import os
import random
import time
import sys

import numpy as np
import pandas as pd
import ray
import requests
import ray.runtime_context
from ray import tune
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

## 1. Error handling and task retries

### 1.1. Understanding exception types

Let's consider two types of exceptions:
1. **System errors**: Worker node dies, out of memory, network issues
2. **Application-level errors**: Python exceptions in your code (ValueError, TypeError, etc.)

Ray will automatically **retry a task up to 3 times** if it fails due to a system error (e.g., a worker node dies).

In [None]:
@ray.remote
def sys_flaky_square(x: int, prob: float) -> int:
    if random.random() < prob:
        sys.exit(1)  # Exit the worker process to simulate a system-level failure
    return x**2

In [None]:
ray.get([sys_flaky_square.remote(x=4, prob=0.2) for _ in range(10)])

### 1.2. Handling application exceptions

The task below is not retried by default because it fails with an application-level exception:

In [None]:
@ray.remote
def incorrect_square(x: int, prob: float) -> int:
    if random.random() < prob:
        raise ValueError("Random failure")
    return x**2

In [None]:
try:
    ray.get([incorrect_square.remote(x=4, prob=0.5) for _ in range(10)])
except ray.exceptions.RayTaskError as e:
    print(f"Task failed with: {e}")
    print(f"Original exception: {e.cause}")  # Access underlying exception

**Exception propagation:**
- Exceptions in tasks are wrapped in `RayTaskError`
- The original exception is available via `.cause` attribute
- `ray.get()` will raise the exception
- ObjectRefs remain valid, but getting them raises the exception

### 1.3. Configuring retries

Ray lets you specify how to handle retries when an exception is encountered:



In [None]:
@ray.remote(retry_exceptions=[ValueError])
def correct_square(x: int, prob: float) -> int:
    if random.random() < prob:
        raise ValueError("Random failure")
    return x**2

Note that you don't have to redefine the remote function to change these settings. Instead, you can create an updated version using `.options`:

In [None]:
correct_square_mod = correct_square.options(
    retry_exceptions=[ValueError],
    max_retries=10,
)

Let's try it out:



In [None]:
try:
    outputs = ray.get([correct_square_mod.remote(x=4, prob=0.5) for _ in range(10)])
except ray.exceptions.RayTaskError:
    print("At least one of the tasks failed after all retries")
else:
    print(f"\nSuccess! Results: {outputs}")
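To get a feel for how often the failure branch above is reached, the arithmetic can be sketched as follows. This assumes each attempt fails independently with probability `prob`, and that `max_retries=10` allows one initial attempt plus up to 10 retries (11 attempts in total). The helper names are illustrative and not part of Ray.

```python
def prob_task_exhausts_retries(prob: float, max_retries: int) -> float:
    """Probability that the initial attempt and every retry all fail."""
    attempts = max_retries + 1  # assumption: 1 initial attempt + max_retries retries
    return prob ** attempts


def prob_any_task_fails(prob: float, max_retries: int, num_tasks: int) -> float:
    """Probability that at least one of num_tasks independent tasks exhausts its retries."""
    p_single = prob_task_exhausts_retries(prob, max_retries)
    return 1 - (1 - p_single) ** num_tasks


p_single = prob_task_exhausts_retries(prob=0.5, max_retries=10)
p_any = prob_any_task_fails(prob=0.5, max_retries=10, num_tasks=10)
print(f"single task: {p_single:.6f}, any of 10 tasks: {p_any:.6f}")
```

Under these assumptions the `except` branch should be reached only rarely, which is why most runs print the success message.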

### 1.4. Idempotency: Critical for reliable retries

**⚠️ WARNING:** Only retry tasks that are **idempotent** (can be safely run multiple times).

**❌ Non-idempotent (dangerous to retry):**



In [None]:
@ray.remote(retry_exceptions=[ValueError])
def append_to_file(data):
    with open("/tmp/data.txt", "a") as f:
        f.write(data)  # Will duplicate data on retry!
    if random.random() < 0.5:
        raise ValueError("Simulated failure")
    return "done"

**If this task fails and retries:**
1. First attempt: Writes "hello" → fails
2. Retry: Writes "hello" again → file now has "hellohello"

**✅ Idempotent (safe to retry):**



In [None]:
@ray.remote(retry_exceptions=[ValueError])
def write_to_file_safe(data, unique_id):
    filename = f"data_{unique_id}.txt"
    with open(filename, "w") as f:  # Overwrites on retry
        f.write(data)
    if random.random() < 0.5:
        raise ValueError("Simulated failure")
    return "done"

**Other idempotent operations:**
- Reading from database
- GET requests (unlike POST, which is not idempotent)
- Mathematical computations
- Overwriting files (not appending)

**Non-idempotent operations to avoid retrying:**
- Appending to files/databases
- Sending emails/notifications
- Charging credit cards
- Incrementing counters
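The difference between appending and overwriting can be checked locally without Ray. The sketch below simulates "first attempt plus one retry" by calling each writer twice on a temporary file; the helper names `append_write` and `overwrite_write` are illustrative only.

```python
import os
import tempfile


def append_write(path: str, data: str) -> None:
    """Non-idempotent: each call adds another copy of data."""
    with open(path, "a") as f:
        f.write(data)


def overwrite_write(path: str, data: str) -> None:
    """Idempotent: repeated calls leave the file in the same final state."""
    with open(path, "w") as f:
        f.write(data)


with tempfile.TemporaryDirectory() as tmp_dir:
    append_path = os.path.join(tmp_dir, "append.txt")
    overwrite_path = os.path.join(tmp_dir, "overwrite.txt")

    # Simulate a first attempt followed by one retry.
    for _ in range(2):
        append_write(append_path, "hello")
        overwrite_write(overwrite_path, "hello")

    with open(append_path) as f:
        appended = f.read()
    with open(overwrite_path) as f:
        overwritten = f.read()

print(appended)     # hellohello
print(overwritten)  # hello
```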

### 1.5. Task timeouts and cancellation

Sometimes you want to set a maximum execution time or cancel tasks:

#### Setting timeouts with ray.get()



In [None]:
@ray.remote
def slow_task():
    time.sleep(100)
    return "done"

ref = slow_task.remote()

try:
    result = ray.get(ref, timeout=5)  # Wait max 5 seconds
except ray.exceptions.GetTimeoutError:
    print("Task took too long!")
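Note that a timeout on `ray.get()` only bounds how long the caller waits. It does not stop the task, which keeps running unless you cancel it. The same caller-side behaviour can be illustrated without a cluster using Python's standard `concurrent.futures` module. This is only an analogue, not Ray code, and `slow_job` is a hypothetical stand-in for `slow_task`.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeoutError


def slow_job() -> str:
    time.sleep(0.5)
    return "done"


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_job)
    try:
        future.result(timeout=0.1)  # the caller gives up after 0.1 s
        timed_out = False
    except FutureTimeoutError:
        timed_out = True

    # The job was not stopped by the timeout; waiting again returns its result.
    final_result = future.result()

print(timed_out, final_result)
```

If the work should actually stop after the deadline, pair the timeout with an explicit cancellation, as described in the next subsection.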

#### Cancelling tasks

You can cancel a task that has already been submitted by passing its `ObjectRef` to `ray.cancel()`:

In [None]:
@ray.remote
def long_running_task(duration):
    time.sleep(duration)
    return "completed"

refs = [long_running_task.remote(10) for _ in range(5)]

# Cancel all tasks
for ref in refs:
    ray.cancel(
        ref,
        force=False,
        recursive=True
    )

# Verify task is cancelled
try:
    ray.get(refs[0])
except ray.exceptions.TaskCancelledError:
    print("Task was cancelled")

**Expected behavior**

The `force` and `recursive` flags control how the cancellation is carried out:
* When `force=False`, Ray raises a `KeyboardInterrupt` in the worker that is running the task.
* When `force=True`, Ray force-kills the worker process running the task (not supported for actor tasks).

If `recursive=True`, all child tasks and actor tasks are cancelled as well.

**Important notes about cancellation:**
- Cancellation is best-effort, not guaranteed
- Task might complete before cancellation takes effect
- Dependent tasks are also cancelled
- Use for cleanup, not critical functionality


## 2. Task runtime environments

A runtime environment defines dependencies such as files, packages, and environment variables needed for a Python script to run.

- **Runtime Environment Management**:
  - Managed by a `RuntimeEnvAgent` gRPC server on each node.
  - The `RuntimeEnvAgent` fate-shares with the raylet, simplifying the failure model and ensuring it is a core component for task and actor scheduling.

- **Environment Creation**:
  - Triggered by the raylet via a gRPC request to the `RuntimeEnvAgent` when a task or actor requires a runtime environment.
  - May involve:
    - Installing packages using `pip install`.
    - Setting environment variables for Ray worker processes.
    - Activating conda environments with `conda activate`.
    - Downloading files from remote cloud storage.

- **Resource Caching**:
  - Runtime environment resources, such as downloaded files and installed conda environments, are cached on each node.
  - The cache allows sharing of resources between different tasks, actors, and jobs.
  - When the cache size limit is exceeded, resources not currently in use are deleted to free up space.

Here is a diagram showcasing the above concepts:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-core/runtime_env.png" width="500">

### 2.1. Setting environment variables

For example, we can set an environment variable:



In [None]:
@ray.remote(runtime_env={"env_vars": {"my_custom_env": "prod"}})
def f():
    env = os.environ["my_custom_env"]
    return f"My custom env is {env}"

In [None]:
ray.get(f.remote())
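Conceptually, `env_vars` are applied to the worker process that runs the task, not to the driver. A rough standard-library analogue is starting a child Python process with an extended environment. This is a sketch of the idea only and does not reflect how Ray launches workers internally.

```python
import os
import subprocess
import sys

# Child code that reads the variable, mirroring the body of `f` above.
child_code = "import os; print(os.environ['my_custom_env'])"

# Start a new Python process with the extra variable added to its environment.
completed = subprocess.run(
    [sys.executable, "-c", child_code],
    env={**os.environ, "my_custom_env": "prod"},
    capture_output=True,
    text=True,
    check=True,
)
child_value = completed.stdout.strip()

print(f"Child process sees: {child_value}")
print(f"Parent process sees: {os.environ.get('my_custom_env')}")
```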

### 2.2. Installing pip dependencies

Ray installs the listed pip packages at runtime and sets up the worker process so that the task runs with those dependencies available:

In [None]:
@ray.remote(runtime_env={"pip": ["requests", "pandas==1.5.0"]})
def fetch_data(url):
    return requests.get(url).json()

<div class="alert alert-info">
pip dependencies add overhead to first task startup but are cached afterwards; for frequently used dependencies, bake them into your cluster image instead.
</div>


### 2.3. Working directory

Files can also be fetched from remote storage at runtime and made available in the worker process's working directory:

In [None]:
@ray.remote(runtime_env={"working_dir": "s3://my-bucket/project/my_directory.zip"})
def load_config():
    with open("config.yaml") as f:
        return f.read()

## 3. Resource allocation and management

### 3.1. Understanding logical vs physical resources

In [None]:
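@ray.remote(num_cpus=1)  # Ray reserves 1 logical CPU slot for scheduling
def cpu_intensive_task(n: int = 2000):
    # Ray sets OMP_NUM_THREADS=1 to match num_cpus
    # Illustrative inputs: two random n x n matrices
    large_matrix_a = np.random.rand(n, n)
    large_matrix_b = np.random.rand(n, n)
    return np.dot(large_matrix_a, large_matrix_b)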

**Key points:**
- `num_cpus` is a scheduling hint, not a hard limit
- Ray automatically sets `OMP_NUM_THREADS` to match `num_cpus` to prevent oversubscription

You can override this if needed, at the risk of oversubscribing the node:


In [None]:
@ray.remote(num_cpus=1)
def mm(n: int = 4000):
    return np.dot(np.random.rand(n, n), np.random.rand(n, n))

# Override to use 8 threads (caution: may oversubscribe)
ray.get(mm.options(runtime_env={"env_vars": {"OMP_NUM_THREADS": "8"}}).remote())

Note that when you assign GPU resources to a task, Ray automatically sets the `CUDA_VISIBLE_DEVICES` environment variable in the worker to restrict it to the assigned GPU IDs.

<div class="alert alert-info">
Learn more about <strong><a href="https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#physical-resources-and-logical-resources" target="_blank">physical resources and logical resources</a></strong>.
</div>

### 3.2. Fractional resources for I/O-bound tasks

Ray supports **fractional CPU requests** to enable efficient oversubscription of I/O-bound tasks.

**When to use fractional CPUs:**

Tasks that spend most of their time waiting (not computing) can share CPU slots:


In [None]:
# Moderately I/O-bound: some computation, some I/O
@ray.remote(num_cpus=0.5)  # Allow 2 tasks per CPU core
def download_and_parse(url):
    data = requests.get(url).text
    # Placeholder for real parsing logic: count the lines in the response
    return len(data.splitlines())

**Benefits:**
- **Higher throughput**: Run more tasks concurrently when they're waiting on I/O
- **Better resource utilization**: Don't waste CPU cores on tasks that are mostly idle
- **Cost efficiency**: Process more work on the same hardware
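The packing arithmetic behind fractional requests can be sketched as follows, assuming CPU is the only constraining resource. The 8-CPU node and the helper name `max_concurrent_tasks` are hypothetical.

```python
from fractions import Fraction


def max_concurrent_tasks(total_cpus: int, num_cpus_per_task: str) -> int:
    """How many tasks fit if each reserves num_cpus_per_task logical CPUs."""
    # Fraction avoids floating-point rounding surprises such as 0.1 * 3 != 0.3.
    return int(Fraction(total_cpus) // Fraction(num_cpus_per_task))


# Hypothetical node with 8 logical CPUs.
for request in ["1", "0.5", "0.25"]:
    print(f"num_cpus={request}: {max_concurrent_tasks(8, request)} concurrent tasks")
```

Because these are logical reservations, halving `num_cpus` doubles the number of tasks that can be scheduled at once. Whether that helps depends on how much of each task's time is really spent waiting on I/O.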

<div class="alert alert-warning">

**Note:** Don't abuse fractional resources and fall into the anti-pattern of launching too many small tasks. Instead, batch work and leverage multi-threading within tasks when possible.

</div>

<div class="alert alert-info">
Fractional resources also apply to accelerators: <strong><a href="https://docs.ray.io/en/latest/ray-core/scheduling/accelerators.html#fractional-accelerators" target="_blank">fractional accelerators</a></strong> allow users to load multiple smaller models onto a single GPU. Learn more about <strong><a href="https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#fractional-resource-requirements" target="_blank">fractional resource requirements</a></strong>.
</div>


### 3.3. Resource availability and cluster inspection

Ray's scheduler matches tasks to nodes based on **resource requirements** like CPUs, GPUs, memory, or custom resources:

In [None]:
@ray.remote(num_cpus=2, num_gpus=1)
def train_model(data):
    # Placeholder: `model` stands for any estimator object defined elsewhere
    return model.fit(data)

**Inspecting cluster resources:** `ray.cluster_resources()` returns the total resources across the cluster.

In [None]:
ray.cluster_resources()

**Inspecting available resources:** `ray.available_resources()` returns the resources that are currently unreserved.

In [None]:
ray.available_resources()

### 3.4. Resource management and autoscaling

Ray's **Global Control Service (GCS)** orchestrates cluster-wide resource management and autoscaling:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-core/resource_management_autoscaling.svg" width="700">

**The resource synchronization loop:**

1. **Raylets report usage**: Each raylet sends its resource usage to the GCS every ~100ms
2. **GCS broadcasts state**: GCS pushes the global resource view back to all raylets every ~100ms  
3. **Autoscaler reconciles**: The autoscaler queries the GCS for cluster load and:
   - Adds nodes when demand exceeds available capacity
   - Removes idle nodes to reduce costs

**Example:** If 10 tasks need `num_gpus=1` but only 4 GPUs exist, the autoscaler provisions additional GPU nodes to meet demand.
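The shortfall in this example can be worked out with simple arithmetic. The sketch below assumes a single hypothetical node type with 4 GPUs per node. The real autoscaler also considers configured node types, cluster limits, and other resource requests, so treat this as an illustration rather than its actual algorithm.

```python
import math


def extra_gpu_nodes_needed(gpu_tasks: int, gpus_per_task: int,
                           existing_gpus: int, gpus_per_node: int) -> int:
    """Nodes required to cover the GPU shortfall (0 if demand already fits)."""
    demand = gpu_tasks * gpus_per_task
    shortfall = max(0, demand - existing_gpus)
    return math.ceil(shortfall / gpus_per_node)


# 10 tasks with num_gpus=1 and 4 existing GPUs, as in the example above.
# gpus_per_node=4 is an assumed node type.
nodes = extra_gpu_nodes_needed(gpu_tasks=10, gpus_per_task=1,
                               existing_gpus=4, gpus_per_node=4)
print(f"Additional GPU nodes to request: {nodes}")
```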


## 4. Pipeline data processing and waiting for results

After launching a number of tasks, you may want to know which ones have finished executing without blocking on all of them. You can do this with `ray.wait()`.

### 4.1. Understanding ray.wait()

`ray.wait()` is a powerful primitive for building pipelines and managing task completion.

Given a sample remote function:

In [None]:
@ray.remote
def remote_fn(x):
    time.sleep(random.uniform(2, 10))
    return x

Unlike `ray.get`, which blocks until all tasks are complete, `ray.wait` allows you to wait for a specified number of tasks to finish and returns two lists: one with the completed tasks and another with the pending tasks.

In [None]:
refs = [remote_fn.remote(i) for i in range(10)]

ready, not_ready = ray.wait(
    refs,
    num_returns=1,      # Number of references to wait for
    timeout=None,       # Max time to wait for (seconds)
    fetch_local=True    # Whether to fetch objects to the local node or not
)

Let's inspect the ready refs:

In [None]:
ready

**Returns:**
- `ready`: List of ObjectRefs that are ready
- `not_ready`: List of ObjectRefs still pending

### 4.2. Pipeline pattern with ray.wait()

| <img src="https://assets-training.s3.us-west-2.amazonaws.com/ray-core/ray-core/pipeline-data-processing.png" width="400px" loading="lazy"> |
| :--- |
| (top panel) Execution timeline when using `ray.get()` to wait for all results before calling `process_results`.<br>(bottom panel) Execution timeline when using `ray.wait()` to process results as soon as they become available. |

Here are functions to match the above diagram:

In [None]:
@ray.remote
def do_some_work(x: int) -> int:
    time.sleep(x)  # varying execution time based on input
    return x

@ray.remote
def process_incremental(result: int) -> int:
    time.sleep(1)
    return result * 2

@ray.remote
def process_results(result_refs: list) -> list:
    results = ray.get(result_refs)  # need to call ray.get explicitly for containers
    out = []
    for result in results:
        time.sleep(1)
        out.append(result * 2)
    return out

This is the **naive approach:** block until all tasks are complete and then process the results.



In [None]:
inputs = [2, 3, 1, 4]
start = time.time()
data_list = [do_some_work.remote(x) for x in inputs]
output = ray.get(process_results.remote(data_list))
print("duration =", time.time() - start, "\nresult = ", output)
# Duration: ~8 seconds (4s max task + 4s processing)

This is the **pipelined** approach: process each item as soon as it becomes available.



In [None]:
start = time.time()
result_ids = [do_some_work.remote(x) for x in inputs]
refs = []
while len(result_ids):
    done_id, result_ids = ray.wait(result_ids, num_returns=1)
    print(done_id)
    refs.append(process_incremental.remote(done_id[0]))
output = ray.get(refs)
print("duration =", time.time() - start, "\nresult = ", output)
# Duration: ~5 seconds (overlapping ~4s computation and 1s processing)
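The two duration estimates in the comments can be reproduced with a little arithmetic. This sketch assumes the cluster has enough CPUs to run all tasks concurrently and ignores scheduling overhead; the function names are illustrative.

```python
def naive_duration(work_times: list[int], process_time: int = 1) -> int:
    """Wait for the slowest task, then process every result sequentially."""
    return max(work_times) + process_time * len(work_times)


def pipelined_duration(work_times: list[int], process_time: int = 1) -> int:
    """Each result is processed as soon as it is ready, in parallel with the rest."""
    return max(t + process_time for t in work_times)


inputs = [2, 3, 1, 4]
print("naive     ~", naive_duration(inputs), "seconds")
print("pipelined ~", pipelined_duration(inputs), "seconds")
```

The saving grows with the number of inputs, because the naive approach pays the processing cost once per item after the slowest task has finished.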

<div class="alert alert-info">
Read more about the <strong><a href="https://docs.ray.io/en/latest/ray-core/tips-for-first-time.html#tip-4-pipeline-data-processing" target="_blank">pipeline data processing</a></strong>
</div>

## 5. Ray generators

[Ray Generators](https://docs.ray.io/en/latest/ray-core/ray-generator.html) let a remote task use the Python generator pattern to produce its results incrementally.

They are useful for:
- Reducing worker heap memory usage **by** avoiding building up a large in-memory collection
- Reducing object store memory usage **by** allowing for garbage collection of objects that are processed

### 5.1. Why use Ray Generators?


**Problem with regular tasks:**
Memory intensive tasks will cause pressure on both the worker process heap and shared object store memory.

In [None]:
@ray.remote(num_returns=2)
def produce_large_dataset():
    task_id = ray.runtime_context.get_runtime_context().get_task_id()
    # Creates all data in memory at once
    results = []
    for i in range(100):
        results.append(np.random.rand(1024**2))  # Each array is ~8 MiB (8 bytes * 1024**2 elements)
    return results, task_id  # ~800 MiB in memory!

# High memory pressure
ref, task_id_ref = produce_large_dataset.remote()
task_id = ray.get(task_id_ref)

Let's see the objects tied to this task:

In [None]:
!ray list objects --filter TASK_STATUS!=NIL --filter TYPE=DRIVER --filter OBJECT_ID={task_id}01000000

In [None]:
# cleanup
%xdel ref

**Solution with generators:**

Avoid memory build up in the task, and start generating blocks or partitions as soon as they are available.


In [None]:
@ray.remote
def produce_large_dataset():
    # Yields one object at a time
    for i in range(100):
        yield np.random.rand(1024**2)  # Only ~8 MiB at a time

@ray.remote
def process(result):
    time.sleep(0.1)

# Process streaming
for obj_ref in produce_large_dataset.remote():
    process.remote(obj_ref)
    # Previous objects can be garbage collected
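The sizes quoted in the two versions of `produce_large_dataset` follow from simple arithmetic. This is a rough estimate of worker heap usage only: it assumes 8-byte `float64` elements and ignores Python and NumPy overhead as well as copies held in the object store.

```python
BYTES_PER_FLOAT64 = 8      # np.random.rand returns float64 values
ELEMENTS_PER_ARRAY = 1024**2
NUM_ARRAYS = 100
MIB = 1024**2

array_mib = ELEMENTS_PER_ARRAY * BYTES_PER_FLOAT64 / MIB
list_peak_mib = array_mib * NUM_ARRAYS   # all arrays held in `results` at once
generator_peak_mib = array_mib           # one array built per yield

print(f"one array: {array_mib:.0f} MiB")
print(f"list-returning task: ~{list_peak_mib:.0f} MiB in the worker heap")
print(f"generator task: ~{generator_peak_mib:.0f} MiB in the worker heap at a time")
```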

### 5.2. Python generator recap

Let's start with a sample Python generator function:



In [None]:
def generator_function():
    for i in range(10):
        time.sleep(1)
        yield i

Here is how we can iterate over the generator function:



In [None]:
for obj in generator_function():
    print(obj)

### 5.3. Converting to Ray generator

Converting it into a Ray generator function is straightforward: simply decorate it with `@ray.remote`.


In [None]:
@ray.remote
def generator_function():
    for i in range(10):
        time.sleep(1)
        yield i

Now, instead of the yielded values, we get back object references:

In [None]:
for obj_ref in generator_function.remote():
    print(obj_ref)  # Prints ObjectRef

# After the loop, obj_ref still refers to the last yielded item
result = ray.get(obj_ref)
print(result)  # Prints the actual value

### 5.4. Memory usage comparison

The script below shows the memory consumption when running with and without a generator:



In [None]:
!RAY_DEDUP_LOGS=0 python scripts/ray_generator_object_store_diff.py

### 5.5. Key differences from Python generators

Unlike Python generators, Ray generators:
- **Don't pause execution** - i.e. they don't require `__next__` to be called to yield the next element
- **Don't support all APIs** like `send` and `throw`

Given that Ray **eagerly executes** a generator task to completion **regardless** of whether the caller is polling the partial results or not, it might lead to **object store spilling.**
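For contrast, a plain Python generator is lazy: its body does not run until the consumer asks for the next item. The sketch below records which iterations have executed to make that visible; `lazy_numbers` and `executed` are illustrative names.

```python
executed = []


def lazy_numbers(n: int):
    for i in range(n):
        executed.append(i)  # record that the body ran for item i
        yield i


gen = lazy_numbers(3)
ran_before_next = list(executed)   # nothing has run yet

first = next(gen)
ran_after_one_next = list(executed)  # only the first item has been produced

print(ran_before_next, first, ran_after_one_next)
```

A Ray generator task does not wait for the consumer in this way. It keeps producing objects even if nobody is reading them, so a slow consumer can let results pile up in the object store.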

### 5.6. When to use Ray Generators

**✅ Use Ray Generators when:**
- Processing large datasets that don't fit in memory
- Streaming results from long-running computations
- Building data pipelines with multiple stages
- You want incremental results (don't wait for everything)

**❌ Don't use Ray Generators when:**
- You need random access to results
- You need the full result set at once