# Ray Tasks Fundamentals: Building Distributed Applications

© 2025, Anyscale. All Rights Reserved

This notebook provides a step-by-step introduction to Ray Tasks, the fundamental building block of Ray that enables distributed computing.

<div class="alert alert-block alert-info">

<b> Here is the roadmap for this notebook </b>

<ol>
  <li>Overview and setup</li>
  <li>Simple task submission (creating, executing, and getting results)</li>
  <li>Task options and configuration</li>
  <li>Object store and memory model</li>
  <li>Chaining tasks and passing data</li>
</ol>
</div>

**Imports**

In [None]:
import math
import os
import random
import time

import numpy as np
import pandas as pd
import ray
import requests
import ray.runtime_context
from ray import tune
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

## 1. Overview and setup

### 1.1. Ray Core at a glance

- **Scales your code** across many CPU cores, machines, and accelerators.  
- **Schedules arbitrary task graphs** thanks to its distributed scheduler.
- **Hides distributed-system overhead** with built-ins for  
  - fast data serialization and transfer,  
  - smart task placement, 
  - distributed memory & reference counting.

Ray's higher-level libraries build on Ray Core to offer ready-made APIs for common workloads.

### 1.2. When to use Ray Tasks

Ray Tasks are ideal for:
- **Parallelizing computationally expensive functions** across multiple cores or machines
- **Processing large datasets** by distributing work across workers
- **Building complex task dependency graphs** (DAGs) for data pipelines
- **Scaling existing Python code** with minimal changes

**When NOT to use Ray Tasks:**
- Functions that execute in < 1ms (overhead not worth it)
- Very fine-grained parallelism (e.g., parallelizing simple arithmetic - use numpy instead)
- When you need mutable shared state (use Ray Actors instead)

### 1.3. Ray cluster architecture

Before diving into tasks, let's understand the key components of a Ray cluster.

<img src="https://docs.ray.io/en/latest/_images/ray-cluster.svg" width="800">

A Ray cluster consists of:
- One or more **worker nodes**, where each worker node consists of the following processes:
    - **worker processes** responsible for task submission and execution.
    - A **raylet** responsible for:
      - resource management and task placement.
      - shared memory management through an object store 
- One of the worker nodes is designated a **head node** and is responsible for running:
  - A **global control service** responsible for keeping track of the **cluster-level state** that is not supposed to change too frequently.
  - An **autoscaler** service responsible for adding and removing worker nodes by integrating with different infrastructure providers (e.g. AWS, GCP, ...) to match the resource requirements of the cluster.


### 1.4. Initializing Ray

`ray.init()` is the primary function for connecting to an existing Ray cluster, or for starting a new one and connecting to it.

In [None]:
ray.init(ignore_reinit_error=True)

<div class="alert alert-info">

**NOTE** If you don't call `ray.init()` yourself in a Python script, Ray calls it for you with default parameters when you define or invoke your first remote function or actor.

</div>


## 2. Simple task submission (creating, executing, and getting results)

### 2.1. Creating remote functions

The first step in using Ray is to create remote functions. A remote function is a regular Python function that can be executed on any process in your cluster.

Given a simple Python function:



In [None]:
def add(a, b):
    return a + b

add

Decorate the function with `@ray.remote` to turn it into a remote function.



In [None]:
@ray.remote
def remote_add(a, b):
    return a + b

remote_add

### 2.2. Executing remote functions (asynchronous by default)

Native Python functions are invoked by calling them:



In [None]:
add(1, 2)  # Returns 3 immediately

Remote Ray functions are executed as tasks by calling their `.remote()` method:



In [None]:
remote_add.remote(1, 2)  # Returns ObjectRef immediately, computation happens async

Here is what happens when you call `{remote_function}.remote`:
1. Ray **immediately** schedules the function execution by submitting a **task** to the cluster
2. The call returns an `ObjectRef` (a reference to the future result) to the submitting process
3. The cluster begins executing the computation in the background

<div class="alert alert-info">
  <strong>A <a href="https://docs.ray.io/en/latest/ray-core/key-concepts.html#tasks" target="_blank">task</a></strong> is a remote, stateless Python function invocation.
</div>


In [None]:
ref = remote_add.remote(1, 2)
ref

**Think of `ObjectRef` as a future**: it's a placeholder for a value that is being computed on the cluster.

Here is a map of how Python code is translated into Ray tasks:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-core/python_to_ray_task_map_v2.png" alt="Python to Ray Task Map" width="800">

### 2.3. Getting results

If we want to wait (block) and retrieve the corresponding object, we can use `ray.get`:



In [None]:
ray.get(ref)

### 2.4. Putting it all together

Here are the three steps:
1. Create the remote function
2. Execute it remotely (non-blocking)
3. Get the result when needed (blocking)

<div class="alert alert-block alert-info">
    
__Activity: define and invoke a Ray task__

Define a remote function `sqrt_add` that accepts two arguments and performs the following steps:
1. computes the square-root of the first
2. adds the second
3. returns the result

Execute it with two different sets of parameters and collect the results.



In [None]:
# Hint: define the below as a remote function
def sqrt_add(a, b):
    ... 

# Hint: invoke it as a remote task and collect the results

In [None]:
# Write your solution here

<div class="alert alert-block alert-info">

<details>

<summary> Click to see solution </summary>


```python
@ray.remote
def sqrt_add(a, b):
    return math.sqrt(a) + b

ray.get([sqrt_add.remote(2, 3), sqrt_add.remote(5, 4)])
```

</details>


### 2.5. Understanding asynchronous execution

The key difference between regular Python and Ray is that `.remote()` **does not block**:



In [None]:
def slow_function(x):
    time.sleep(3)
    return x * x

# Sequential Python (blocks for each call)
start = time.time()
results = [slow_function(i) for i in range(4)]  # Would take 12 seconds!
print(f"Wall time sequential: {time.time() - start:.2f}s")

`.remote()` **immediately submits** the task and returns, whereas `ray.get` **blocks** until the reference(s) it was given are resolved.

In [None]:
@ray.remote
def slow_function(x):
    time.sleep(3)
    return x * x

# Distributed Ray (non-blocking)
start = time.time()
refs = [slow_function.remote(i) for i in range(4)]  # Returns immediately!
print(f"Task submission: {time.time() - start:.2f}s")  # < 0.01s

# Now wait for results (blocks until all complete)
results = ray.get(refs)
print(f"Wall time with ray: {time.time() - start:.2f}s")  # ~3s

In [None]:
# clean up
%xdel refs
%xdel results

</details>

</div>

### 2.6. What can be passed to Ray tasks? (Serialization)

Ray uses **cloudpickle** to serialize functions, arguments, and return values. Most Python objects work, but there are limitations:

**✅ Can serialize:**
- Basic types: int, float, str, bool, None
- Collections: list, dict, tuple, set
- NumPy arrays, Pandas DataFrames
- Most custom classes
- Nested functions and lambdas

**❌ Cannot serialize:**
- File handles (`open()` objects)
- Network sockets
- Threading locks

**Example of serialization issues:**



In [None]:
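# ❌ BAD: File handle won't serialize
file = open("/tmp/data.txt", "w+")  # "w+" creates the file and allows reading

@ray.remote
def read_file(f):
    return f.read()

# ref = read_file.remote(file)  # Will fail: Ray cannot pickle the open file handle (TypeError)
file.close()  # release the handle when done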

<div class="alert alert-warning">
<b>💡 Troubleshooting:</b> If you see <code>pickle.PicklingError</code> or <code>TypeError: cannot pickle</code>, check if you're passing non-serializable objects to your task.
</div>
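
The failure mode can be reproduced without a cluster. The sketch below uses the standard-library `pickle` module as a stand-in (Ray itself uses cloudpickle, and its exact error message may differ) to show that an open file handle cannot be pickled, while its path can. `count_lines` and the temporary file are hypothetical and exist only for illustration.

```python
import os
import pickle
import tempfile

# Create a small file to work with (the temporary path is illustrative only).
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w") as f:
    f.write("hello\nworld\n")

# An open file handle wraps OS-level state, so pickling it raises TypeError.
with open(path) as handle:
    try:
        pickle.dumps(handle)
        handle_is_picklable = True
    except TypeError:
        handle_is_picklable = False

# The path is a plain string, so it survives a pickle round trip.
path_roundtrip = pickle.loads(pickle.dumps(path))


def count_lines(file_path):
    """Open the file inside the function, as a task would on its worker."""
    with open(file_path) as fh:
        return sum(1 for _ in fh)


line_count = count_lines(path_roundtrip)
print(handle_is_picklable, line_count)
```

The same idea carries over to a task: decorate `count_lines` with `@ray.remote` and pass the path, so the file is opened on the worker. On a multi-node cluster this assumes the path is readable from every node.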

### 2.7. Task submission sequence

Here is the sequence of events when you submit a Ray task:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-core/task-submission_old.gif" alt="Task Submission Sequence" width="800">

### 2.8. Task submission under the hood

When a task is submitted, here is how resource fulfillment works:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-core/normal-task-resource-fullfilment.svg" width="700" alt="Resource fulfillment and execution of `double(2)` in a Ray cluster.">

The caller must choose **which node (raylet)** should schedule it.

#### 1️⃣ Choosing the Preferred Raylet
| **Rule**          | **When Used**                    | **How It Works**                                                                                    |
| ----------------- | -------------------------------- | --------------------------------------------------------------------------------------------------- |
| **Data locality** | Task has data dependencies.      | Pick node holding the most object bytes locally (from the object directory, may be slightly stale). |
| **Node affinity** | Task specifies a target node.    | Use the node from `NodeAffinitySchedulingStrategy`.                                                 |
| **Default**       | No data or affinity preferences. | Use the **local raylet**.                                                                           |
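
The three rules in the table can be read as a small decision function. The sketch below is a toy model, not Ray's scheduler code: `choose_preferred_raylet` and its parameters are hypothetical, and checking node affinity before data locality is an assumption, since the table does not state a precedence.

```python
from typing import Dict, Optional


def choose_preferred_raylet(
    local_node: str,
    arg_bytes_by_node: Optional[Dict[str, int]] = None,
    affinity_node: Optional[str] = None,
) -> str:
    """Toy model of the three rules in the table above."""
    # Node affinity: the task names a target node explicitly.
    # (Assumption: an explicit target is checked before data locality.)
    if affinity_node is not None:
        return affinity_node
    # Data locality: pick the node holding the most argument bytes.
    if arg_bytes_by_node:
        return max(arg_bytes_by_node, key=arg_bytes_by_node.get)
    # Default: fall back to the caller's local raylet.
    return local_node


locality_choice = choose_preferred_raylet(
    "node-a", arg_bytes_by_node={"node-b": 10, "node-c": 500}
)
affinity_choice = choose_preferred_raylet(
    "node-a", arg_bytes_by_node={"node-c": 500}, affinity_node="node-b"
)
default_choice = choose_preferred_raylet("node-a")
print(locality_choice, affinity_choice, default_choice)
```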


#### 2️⃣ Request → Lease → Worker
- Caller sends a **resource request** to the preferred raylet.  
- If granted, the raylet **leases a local worker** and returns its address.  
- The **lease stays active** while both caller and worker are alive.  
- Idle or unused leases are returned after a short timeout (~hundreds of ms).


#### 3️⃣ Task Execution on the Leased Worker
The caller can schedule **multiple compatible tasks** on the same worker without re-contacting the scheduler.

Compatibility means matching:
- **Resource shape**, e.g. `{"CPU": 1}`
- **Shared-memory arguments** (large objects must be local; small ones are inlined)
- **Runtime environment**


#### 4️⃣ Optimization Insight
Worker leases act as a *cache* for scheduling decisions — similar tasks can reuse the same worker for lower latency and higher throughput.

Note also that the caller can hold multiple worker leases to increase parallelism. 
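
To make the "lease as a cache" idea concrete, here is a deliberately simplified model. `LeaseCache` is hypothetical and is not how Ray implements leasing: it keys leases only on resource shape and runtime environment, and it ignores shared-memory arguments, lease timeouts, and holding several leases per key.

```python
from typing import Dict, Tuple

SchedulingKey = Tuple[Tuple[Tuple[str, float], ...], str]


class LeaseCache:
    """Toy cache: one leased worker per (resource shape, runtime env) key."""

    def __init__(self) -> None:
        self._leases: Dict[SchedulingKey, str] = {}
        self.lease_requests = 0  # how often the "raylet" had to be contacted

    def worker_for(self, resources: Dict[str, float], runtime_env: str = "default") -> str:
        key = (tuple(sorted(resources.items())), runtime_env)
        if key not in self._leases:
            # Cache miss: simulate a resource request to the preferred raylet.
            self.lease_requests += 1
            self._leases[key] = f"worker-{self.lease_requests}"
        # Cache hit: reuse the leased worker without contacting the scheduler.
        return self._leases[key]


cache = LeaseCache()
first = cache.worker_for({"CPU": 1})
second = cache.worker_for({"CPU": 1})  # compatible task, same lease
third = cache.worker_for({"CPU": 2})   # different resource shape, new lease
print(first, second, third, cache.lease_requests)
```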



## 3. Task options and configuration

You can dynamically configure tasks using the `.options()` method without redefining the function. This is useful for adjusting resources, retries, or other settings per task invocation.

### 3.1. Basic usage of .options()



In [None]:
@ray.remote
def flexible_task(x):
    return x * 2

# Use default configuration (1 CPU)
ref1 = flexible_task.remote(5)

# Override to use 2 CPUs for this specific invocation
ref2 = flexible_task.options(num_cpus=2).remote(10)

### 3.2. Common options

**Resource options:**
- `num_cpus`: Number of CPUs (can be fractional, e.g., 0.5)
- `num_gpus`: Number of GPUs (can be fractional)
- `memory`: Memory in bytes
- `resources`: Dict of custom resources

**Fault tolerance options:**
- `max_retries`: Max number of retries (default: 3 for system errors)
- `retry_exceptions`: `True` to also retry on application-level exceptions, or a list of exception types to retry on (default: `False`)

**Execution options:**
- `runtime_env`: Dict specifying runtime environment
- `scheduling_strategy`: Control task placement
- `name`: Name for debugging/monitoring

### 3.3. Scheduling strategies



In [None]:
# Default: Ray decides based on data locality and load
flexible_task.remote(4)

# SPREAD: Distribute tasks across nodes
flexible_task.options(scheduling_strategy="SPREAD").remote(5)

# Node affinity: Run on specific node
strategy = NodeAffinitySchedulingStrategy(
    node_id=ray.get_runtime_context().get_node_id(),
    soft=True  # soft=True allows fallback if node unavailable
)
flexible_task.options(scheduling_strategy=strategy).remote(6)

## 4. Object store and memory model

Each worker node has its own object store, and collectively, these form a shared object store across the cluster.

Remote objects are immutable. That is, their values cannot be changed after creation. This allows remote objects to be replicated in multiple object stores without needing to synchronize the copies.

| <img src="https://assets-training.s3.us-west-2.amazonaws.com/ray-core/ray-core/ray-cluster.png" width="700px" loading="lazy"> |
| :---------------------------------------------------------------------------------------------------------------------------- |
| A Ray cluster with a head node and two worker nodes. The distributed object store is highlighted in orange.             |

<div class="alert alert-info">
  <strong><a href="https://docs.ray.io/en/latest/ray-core/key-concepts.html#objects" target="_blank">Object</a></strong> - tasks and actors create and work with remote objects, which can be stored anywhere in a cluster. These objects are accessed using <strong>ObjectRef</strong> and are cached in a distributed shared-memory <strong>object store</strong>.
</div>

### 4.1. Ray memory model

Ray manages memory in several ways to efficiently handle distributed tasks:

1. **Heap memory**:
   - Used by workers to execute tasks and actors.
   - Used to store small objects (less than 100KB) and Ray metadata.
   - High memory pressure can cause Ray to terminate some tasks to free up resources.

2. **Shared memory (Object Store)**:
   - Serves as the medium for passing data between tasks.
   - Large objects (greater than 100KB) are stored in a shared memory space, using up to 30% of a node's memory.
   - If the object store fills up, Ray spills objects to disk, where access is slower.

Here is a diagram showing a horizontal slicing of a node's memory.

<img src="https://docs.ray.io/en/latest/_images/memory.svg" width="600">

### 4.2. Example: Producer-consumer pattern with numpy arrays

This example demonstrates how Ray transfers data in the distributed object store. The `producer_task` creates a 4 GiB numpy array, and the `consumer_task` accesses it with zero-copy deserialization when on the same node:



In [None]:
@ray.remote
def producer_task(size_mb: int = 4 * 1024) -> np.ndarray:
    # np.random.rand already returns float64 (8 bytes per element), so no cast is needed
    array = np.random.rand(1024**2 * size_mb // 8)
    return array


@ray.remote
def consumer_task(array: np.ndarray) -> None:
    assert isinstance(array, np.ndarray)
    assert not array.flags.owndata  # Confirms zero-copy

# arr_ref = producer_task.remote()  # Produce a 4 GiB array
# output_ref = consumer_task.remote(arr_ref)  # Pass ObjectRef to consumer

**What happens under the hood:**

1. **Producer task** creates the array in heap memory, then Ray stores it in the shared object store (large objects > 100KB)
2. **Consumer task** receives the `ObjectRef` and directly accesses the array from shared memory with zero-copy deserialization (if on same node)
3. If tasks run on different nodes, Ray copies the array across the network only once

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-data-deep-dive/producer-consumer-object-store-v2.png" width="600">

To see memory usage in action, run this inspection script:



In [None]:
!python scripts/memory_inspection.py

#### On zero-copy deserialization

Ray uses **cloudpickle** for serialization and **pickle protocol 5** for zero-copy deserialization.

**How Ray transfers code and data:**

1. **Code transfer (functions)**: Functions are pickled and stored in the Global Control Store (GCS), then cached for subsequent calls

2. **Data transfer (arguments/return values)**:
   - **Small objects (< 100 KB)**: Pickled and transferred inline with the task metadata
   - **Large objects (> 100 KB)**: Stored in shared memory (object store), only the `ObjectRef` is transferred

**Key performance characteristics:**

- **Zero-copy benefits**: Works for contiguous numpy arrays and PyArrow arrays on the same node, enabling efficient read access without data copying. 
- **Zero-copy limitation**: Does not support PyTorch tensors or other array types
- **Immutability**: Objects in the object store are **immutable once sealed**, enabling safe sharing across processes
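
The mechanism behind zero-copy reads can be illustrated with the standard library alone. The sketch below is not Ray's implementation: it uses `pickle` protocol 5 with out-of-band buffers, keeping the buffers in the current process where Ray would place them in the shared-memory object store.

```python
import pickle

import numpy as np

arr = np.arange(1_000_000, dtype=np.float64)

# Protocol 5 lets large buffers travel "out of band": the pickle stream holds
# only metadata, and the raw bytes are handed to buffer_callback.
buffers = []
payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# Rebuild the array on top of the supplied buffers instead of copying them.
restored = pickle.loads(payload, buffers=buffers)

shares_memory = np.shares_memory(arr, restored)
owns_data = restored.flags.owndata
print(len(payload), arr.nbytes, shares_memory, owns_data)
```

The restored array does not own its data, which is the same property the `consumer_task` assertion on `array.flags.owndata` checks.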

To read more about object serialization in Ray, see the [serialization documentation](https://docs.ray.io/en/latest/ray-core/objects/serialization.html).

### 4.3. Use case: Hyperparameter tuning

**The Problem**: When running hyperparameter tuning or experimentation, you often need to use the same dataset across dozens or hundreds of trials. If you pass the dataset by value to each training function, Ray will serialize it repeatedly, wasting memory and time.

**The Solution**: Store the dataset once in the object store using `ray.put()`, then pass only the lightweight `ObjectRef` to each trial. All workers can access the same data without duplication.

#### Real-world scenario: Grid search with shared data

Imagine running 20 experiments on a 100MB training dataset:


In [None]:
# Simulate a 100MB training dataset
df = pd.DataFrame(np.random.rand(100 * 1024 ** 2 // 8))

@ray.remote
def train_model(data, learning_rate, batch_size):
    # Simulate model training
    result = data.mean().sum() * learning_rate / batch_size
    time.sleep(20)
    return {"lr": learning_rate, "batch_size": batch_size, "score": result}

# Grid search: 20 different hyperparameter combinations
hyperparameters = [
    {"lr": lr, "batch_size": bs}
    for lr in [0.001, 0.01, 0.1, 0.5]
    for bs in [32, 64, 128, 256, 512]
]

Here is an efficient way to run the experiments by passing the dataset **once by reference**:


In [None]:
# ✅ Memory Efficient: Pass ObjectRef (100 MB total memory)
# Ray serializes once, all workers share the same data
df_ref = ray.put(df)
[
    train_model.remote(df_ref, hp["lr"], hp["batch_size"]) 
    for hp in hyperparameters
]
print("Pass once by reference: ~100 MiB memory used")

Let's inspect the object store. We should see a single 100 MiB object shared across all tasks.


In [None]:
!ray list objects --filter TASK_STATUS!=NIL

In [None]:
# clean up
%xdel df_ref

Here is the inefficient way by passing the dataset by value:


In [None]:
# ❌ Memory inefficient: Pass dataframe by value (2 GB total memory!)
# Ray serializes 100MB × 20 times = 2 GB of redundant data
[
    train_model.remote(df, hp["lr"], hp["batch_size"]) 
    for hp in hyperparameters
]
print("Pass by value: ~2GiB memory used")

Let's inspect the object store. We should now see a separate 100 MiB object for each task.


In [None]:
!ray list objects --filter TASK_STATUS!=NIL

**Performance comparison:**
- **Pass by value**: 2 GB memory used (20× serialization overhead)
- **Pass by reference**: 100 MB memory used (1× serialization)

**Rule of thumb**: Pass by value only for small literals (< 100 KiB); otherwise, pass by reference.
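
The figures in the comparison follow from simple arithmetic. The helper below is a hypothetical back-of-the-envelope estimate: it ignores serialization overhead and assumes every by-value call stores its own copy of the dataset.

```python
def object_store_usage_mib(dataset_mib: float, num_tasks: int, by_reference: bool) -> float:
    """Estimate object store usage for one dataset shared by num_tasks tasks."""
    # By reference: one shared copy. By value: one copy per task invocation.
    copies = 1 if by_reference else num_tasks
    return dataset_mib * copies


by_value_mib = object_store_usage_mib(100, 20, by_reference=False)
by_reference_mib = object_store_usage_mib(100, 20, by_reference=True)
print(f"by value: {by_value_mib:.0f} MiB, by reference: {by_reference_mib:.0f} MiB")
```

With 20 trials on a 100 MiB dataset this gives 2000 MiB (roughly 2 GB) by value against 100 MiB by reference.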

#### How Ray Tune leverages this pattern

Ray Tune uses `tune.with_parameters()` to automatically pass large constant objects via the object store:


In [None]:
def trainable(config, data):
    # Each trial receives a reference to the shared data.
    # NOTE: `train` is a placeholder for your own training routine.
    model = train(data, lr=config["lr"], epochs=config["epochs"])
    return {"accuracy": model.eval()}

train_data = pd.DataFrame()  # stand-in for a large training dataset

# Tune automatically stores train_data in the object store
tuner = tune.Tuner(
    tune.with_parameters(trainable, data=train_data),  # Passed by reference
    param_space={"lr": tune.grid_search([0.001, 0.01, 0.1]), "epochs": tune.choice([10, 20, 50])},
)

Without `tune.with_parameters()`, each trial would receive a separate copy of `train_data`, multiplying memory usage by the number of concurrent trials.

### 4.4. Distributed ownership and fate-sharing

Ray uses a **distributed ownership model** to manage objects efficiently across the cluster. Understanding this concept is crucial for building robust distributed applications.

#### How distributed ownership works

In Ray, the process that submits a task (or creates an object with `ray.put`) becomes the **owner** of the resulting object. The owner maintains critical metadata about the object, including:
- Object location(s) in the cluster
- Reference counts
- Object size and other properties

<img src="https://assets-training.s3.us-west-2.amazonaws.com/ray-core/task-actor-lifecycle/v2/scheduling/distributed_ownership_overview_v4.svg" width="800px">

**Benefits of distributed ownership:**
- **Lower latency**: No need to communicate all ownership information back to a central node
- **Better scalability**: No single bottleneck as every worker maintains its own ownership information

Here is a diagram that explains how distributed ownership works in a Ray cluster:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-core/distributed-ownership.png" width="800">

Below is some code based on the above diagram to illustrate distributed ownership:


In [None]:
@ray.remote
def b():
    size_mib = 19
    return np.ones(1024 ** 2 // 8 *  size_mib)

@ray.remote
def a(dep):
    z = b.remote() # z is owned by worker process running task a

    ip = ray.util.get_node_ip_address()
    print(f"{ip=}")

    time.sleep(20)
    return dep.sum() / ray.get(z).sum() 

size_mib = 33
arr = np.ones(1024 ** 2 // 8 *  size_mib)
x = ray.put(arr)  # x is owned by driver process
y = a.remote(x)  # y is owned by driver process

We can verify that the 19 MiB array is owned by the worker that submitted task `b`, i.e. the worker executing task `a`:


In [None]:
!ray list objects --filter TASK_STATUS!=NIL --filter TYPE=WORKER 

<div class="alert alert-info">

Note: only an object's owner tracks task status, so `TASK_STATUS=NIL` matches entries reported by non-owner processes. Filtering on `TASK_STATUS!=NIL` therefore keeps only the owner's entries.

</div>


In [None]:
ray.get(y), 33 / 19

In [None]:
# clean up
%xdel x
%xdel y

#### The fate-sharing limitation

The main tradeoff of distributed ownership is **fate-sharing**: objects are tied to the lifetime of their owner process.

**What this means:**
- Even if an object is stored in the object store on a different node, if the owner process dies, the object becomes unreachable
- The owner maintains critical metadata (locations, reference counts) that other processes need to access the object
- When the owner fails, this metadata is lost, making the object inaccessible even if copies exist elsewhere

<img src="https://assets-training.s3.us-west-2.amazonaws.com/ray-core/task-actor-lifecycle/v2/scheduling/distributed_ownership_fate_share_with_owner_v4.svg" width="900px">

#### Example: Demonstrating fate-sharing

This example creates two actors: an **Owner** that creates an object reference, and a **Borrower** that tries to access it. We'll see what happens when the Owner is terminated:



In [None]:
@ray.remote
def f(data):
    return data

@ray.remote
class Owner:
    def __init__(self):
        self.ref = None

    def set_object_ref(self, data):
        self.ref = f.remote(data)
        return self.ref
    
    def is_alive(self): 
        return True

@ray.remote
class Borrower:
    def get_object(self, ref):
        return ray.get(ref)

owner = Owner.remote()
borrower = Borrower.remote()
assert ray.get(owner.is_alive.remote())

object_ref = owner.set_object_ref.remote(data="test1")
# Since owner is alive we can resolve the object reference
data = ray.get(borrower.get_object.remote(object_ref))
assert data == "test1"
print(f"✓ Successfully retrieved data while Owner is alive: {data}")

ray.kill(owner)
time.sleep(2)

# After killing the owner we can no longer resolve the object reference
try:
    ray.get(borrower.get_object.remote(object_ref))
    print("✗ Unexpected: Should have failed!")
except Exception as e:
    print("✓ Failed as expected after owner termination:")
    print(e)

**What happens:**
1. While the Owner is alive, the Borrower can successfully retrieve the object using the `ObjectRef`
2. After the Owner is killed, the Borrower still has the `ObjectRef`, but attempting to access the object fails
3. Even though the object data may still exist in the object store, the ownership metadata is lost

**Key takeaway:** In Ray's distributed ownership model, object lifetime is tied to the owner's lifetime. When building fault-tolerant applications:
- Keep important owners alive (e.g., use long-running actors or the driver process)
- Consider checkpointing critical data outside Ray's object store for durability
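
A minimal sketch of the checkpointing suggestion, under the assumption that a local temporary directory stands in for durable storage (in practice this would be a shared filesystem or cloud bucket). `save_checkpoint` and `load_checkpoint` are hypothetical helpers, not Ray APIs.

```python
import os
import tempfile

import numpy as np


def save_checkpoint(array: np.ndarray, directory: str) -> str:
    """Write the array to storage that outlives any Ray process."""
    path = os.path.join(directory, "checkpoint.npy")
    np.save(path, array)
    return path


def load_checkpoint(path: str) -> np.ndarray:
    """Reload the array, e.g. after its owner process has died."""
    return np.load(path)


checkpoint_dir = tempfile.mkdtemp()  # stand-in for durable storage
critical = np.arange(10, dtype=np.float64)

checkpoint_path = save_checkpoint(critical, checkpoint_dir)
recovered = load_checkpoint(checkpoint_path)
print(np.array_equal(critical, recovered))
```

Passing `checkpoint_path` between tasks instead of an `ObjectRef` removes the dependency on the owner's lifetime, at the cost of an extra write and read.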

### 4.5. Lineage reconstruction

If, instead, the **owner is still alive** but **the object is lost** (e.g., due to node failure), Ray can recover the object by either:

1. Finding any secondary copies in the object store (if they exist) and returning one of those
2. Re-executing the task or task chain that created it. This is known as **lineage reconstruction**.

The test below, adapted from the [Ray test suite](https://github.com/ray-project/ray/blob/a04cb06bb1a2c09e93b882b611492d62b8d1837a/python/ray/tests/test_reconstruction.py#L126), shows lineage reconstruction in action:

```python
@pytest.mark.parametrize("reconstruction_enabled", [False, True])
def test_basic_reconstruction(config, ray_start_cluster, reconstruction_enabled):
    cluster = ray_start_cluster
    
    # Start head node with reconstruction enabled/disabled
    cluster.add_node(num_cpus=0, _system_config=config, 
                     enable_object_reconstruction=reconstruction_enabled)
    ray.init(address=cluster.address)
    
    # Add worker node to store the object
    node_to_kill = cluster.add_node(
        num_cpus=1, resources={"node1": 1}, object_store_memory=10**8
    )
    cluster.wait_for_nodes()

    @ray.remote(max_retries=1 if reconstruction_enabled else 0)
    def create_large_object():
        return np.zeros(10**7, dtype=np.uint8)

    @ray.remote
    def process_large_object(x):
        return

    # Create object and verify it can be used
    # Note: obj_ref owner is the driver (on head node), so lineage is preserved
    # even when the worker node storing the object is killed
    obj_ref = create_large_object.options(resources={"node1": 1}).remote()
    ray.get(process_large_object.options(resources={"node1": 1}).remote(obj_ref))
    
    # Simulate node failure and replacement
    cluster.remove_node(node_to_kill, allow_graceful=False)
    node_to_kill = cluster.add_node(
        num_cpus=1, resources={"node1": 1}, object_store_memory=10**8
    )

    # With reconstruction: task re-executes and object is recreated
    # Without reconstruction: both task and object are lost
    if reconstruction_enabled:
        ray.get(process_large_object.remote(obj_ref))
    else:
        with pytest.raises(ray.exceptions.RayTaskError):
            ray.get(process_large_object.remote(obj_ref))
        with pytest.raises(ray.exceptions.ObjectLostError):
            ray.get(obj_ref)

    # Second node failure exceeds max_retries
    cluster.remove_node(node_to_kill, allow_graceful=False)
    cluster.add_node(num_cpus=1, resources={"node1": 1}, object_store_memory=10**8)

    expected_error = (
        ray.exceptions.ObjectReconstructionFailedMaxAttemptsExceededError
        if reconstruction_enabled
        else ray.exceptions.ObjectLostError
    )
    with pytest.raises(expected_error):
        ray.get(obj_ref)
```


### 4.6. ObjectRef lifecycle and garbage collection

Objects in the object store are automatically garbage collected when their distributed reference count drops to zero. This happens when all `ObjectRef`s pointing to the object are deleted or go out of scope.

**Example:**



In [None]:
@ray.remote(num_returns=2)
def create_object():
    task_id = ray.runtime_context.get_runtime_context().get_task_id()
    return np.random.rand(1024 ** 2 // 8 * 20), task_id

# Object created and stored
ref1, ref2 = create_object.remote()

# Object still in memory
result = ray.get(ref1)
task_id = ray.get(ref2)

Let's inspect the returned objects in the store. The filter relies on how object IDs are built: the ID of a task's return value is the task ID followed by the return index, so `{task_id}01000000` selects the first return value.


In [None]:
!ray list objects --filter TASK_STATUS!=NIL --filter TYPE=DRIVER --filter OBJECT_ID={task_id}01000000
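
The `01000000` suffix in the filter can be derived rather than hard-coded. The helper below is hypothetical and encodes an assumption inferred from that filter: the object ID is the task ID in hex followed by a 1-based return index stored as a 4-byte little-endian integer.

```python
import struct


def return_object_id(task_id_hex: str, return_index: int) -> str:
    """Build the hex object ID of a task's N-th return value (1-based)."""
    if return_index < 1:
        raise ValueError("return_index is 1-based")
    # "<I" packs an unsigned 32-bit integer in little-endian byte order.
    return task_id_hex + struct.pack("<I", return_index).hex()


fake_task_id = "ab" * 24  # placeholder; use the real task_id from the cell above
first_return_id = return_object_id(fake_task_id, 1)
second_return_id = return_object_id(fake_task_id, 2)
print(first_return_id[-8:], second_return_id[-8:])
```

Under this assumption, the second return value of `create_object` (the task ID string) would be found with the suffix `02000000`.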

In [None]:
# Release reference (allows GC)
del ref1
del ref2

In [None]:
!ray list objects --filter TASK_STATUS!=NIL --filter TYPE=DRIVER --filter OBJECT_ID={task_id}01000000

<div class="alert alert-info">
<b>⚡ Performance Tip:</b> For long-running applications, explicitly delete ObjectRefs you no longer need to free up object store memory.
</div>

## 5. Chaining tasks and passing data

Let's say we now want to execute a graph of two tasks:
1. Square a value using `expensive_square`
2. Add 1 to the `expensive_square` result, by using `remote_add`


In [None]:
@ray.remote
def expensive_square(x):
    time.sleep(1)
    return x**2

This can be achieved without fetching an intermediate result.

**❌ Anti-pattern:**


In [None]:
# 1st task
square_ref = expensive_square.remote(2)
square_value = ray.get(square_ref)  # wait to get the value

# 2nd task
sum_ref = remote_add.remote(1, square_value)  # pass value from 1st task
sum_value = ray.get(sum_ref)

**✅ Better:** Chain the tasks by passing the `ObjectRef` directly to the second task:



In [None]:
square_ref = expensive_square.remote(2)
sum_ref = remote_add.remote(1, square_ref)  # Pass ObjectRef, not value!
sum_value = ray.get(sum_ref)  # Wait only at the end

**Why this is better:**
- No unnecessary data transfer (ObjectRef is just an ID)
- Ray automatically handles dependencies
- Second task waits for first task to complete
- More efficient scheduling

### 5.1. Common task graph patterns

Ray excels at executing complex directed acyclic graphs (DAGs) of tasks:

#### Pattern 1: Linear chain



In [None]:
@ray.remote
def step1(data):
    return process_a(data)

@ray.remote
def step2(data):
    return process_b(data)

@ray.remote
def step3(data):
    return process_c(data)

# Chain tasks
# ref1 = step1.remote(input_data)
# ref2 = step2.remote(ref1)
# ref3 = step3.remote(ref2)
# final_result = ray.get(ref3)

#### Pattern 2: Fan-out / Fan-in (MapReduce)



In [None]:
@ray.remote
def map_task(chunk):
    return process_chunk(chunk)

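@ray.remote
def reduce_task(*results):
    # Top-level ObjectRef arguments are resolved to values before the task runs
    return aggregate(results)

# Map phase (fan-out)
# map_refs = [map_task.remote(chunk) for chunk in data_chunks]

# Reduce phase (fan-in): unpack the list so each ref is a top-level argument;
# refs nested inside a list would arrive unresolved
# final_result = ray.get(reduce_task.remote(*map_refs))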

#### Pattern 3: Tree reduction



In [None]:
@ray.remote
def pairwise_sum(a, b):
    return a + b


refs = [ray.put(i) for i in range(16)]  # Initial values

# Tree reduction (depth = log2(16) = 4)
while len(refs) > 1:
    refs = [pairwise_sum.remote(refs[i], refs[i + 1]) for i in range(0, len(refs), 2)]

result = ray.get(refs[0])

### 5.2. Nested tasks

Tasks can submit other tasks, enabling dynamic workflows:


In [None]:
@ray.remote
def main():
    square_ref_1 = expensive_square.remote(1)
    square_ref_2 = expensive_square.remote(2)
    add_ref = remote_add.remote(square_ref_1, square_ref_2)
    return ray.get(add_ref)

ray.get(main.remote())

**Avoiding deadlocks:** Ray automatically yields CPU resources when blocked on `ray.get()`, preventing deadlocks when nested tasks need the same resources.


In [None]:
# Example: assume the cluster has 2 CPUs in total

@ray.remote(num_cpus=2)
def outer_task():
    inner_refs = [inner_task.remote() for _ in range(10)]
    return ray.get(inner_refs)  # Ray yields the 2 CPUs while waiting

@ray.remote(num_cpus=1)
def inner_task():
    return 

ray.get(outer_task.remote())  # Works! No deadlock

<div class="alert alert-info">
Read more about <strong><a href="https://docs.ray.io/en/latest/ray-core/tasks/nested-tasks.html#yielding-resources-while-blocked" target="_blank">yielding resources while blocked</a></strong>.
</div>