<h2 style="color: #0f2027; background: linear-gradient(90deg, #43cea2 0%, #185a9d 100%); padding: 12px 0; border-radius: 8px; text-align:center; font-size: 2rem; letter-spacing: 1px;">
   <span style="color: #fff;"> Introduction to Ray Core</span> 
</h2>


This notebook provides a step-by-step introduction to Ray Tasks, the fundamental building block of Ray that enables distributed computing.

<div class="alert alert-block alert-info">

<b> Here is the roadmap for this notebook </b>

<ol>
  <li>Overview</li>
  <li>Creating Remote Functions</li>
  <li>Executing Remote Functions</li>
  <li>Getting Results</li>
  <li>Putting It All Together</li>
  <li>Object store and Memory model</li>
  <li>Chaining Tasks and Passing Data</li>
  <li>Task retries</li>
  <li>Task Runtime Environments</li>
  <li>Resource allocation and management</li>
  <li>Pipeline data processing and waiting for results</li>
  <li>Ray Actors</li>
</ol>
</div>

## Imports

In [1]:
import os
import random
import sys
import time

import numpy as np
import ray

## 1. Overview

### Ray Core at a glance

- **Scales your code** across many CPU cores, machines, and accelerators.  
- **Schedules arbitrary task graphs** thanks to its distributed scheduler.
- **Hides distributed-system overhead** with built-ins for  
  - fast data serialization and transfer,  
  - smart task placement, 
  - distributed memory & reference counting.

Ray’s higher-level libraries build on Ray Core to offer ready-made APIs for common workloads.


## 2. Creating Remote Functions

The first step in using Ray is to create remote functions. A remote function is a regular Python function that can be executed on any process in your cluster.

Given a simple Python function:

In [2]:
def add(a, b):
    return a + b

add

<function __main__.add(a, b)>

Decorate the function with `@ray.remote` to turn it into a remote function.

In [3]:
@ray.remote
def remote_add(a, b):
    return a + b

remote_add

<ray.remote_function.RemoteFunction at 0x7d2358be74a0>

## 3. Executing Remote Functions

Native Python functions are invoked by calling them directly:

In [4]:
add(1, 2)

3

Ray remote functions are executed as tasks by calling them with the `.remote()` suffix:

In [5]:
remote_add.remote(1, 2)

2026-01-08 17:49:22,657	INFO worker.py:1833 -- Connecting to existing Ray cluster at address: 10.0.36.30:6379...
2026-01-08 17:49:22,668	INFO worker.py:2004 -- Connected to Ray cluster. View the dashboard at https://session-zhee2uzsi3lhk3sdl5dvqc8x4m.i.anyscaleuserdata.com
2026-01-08 17:49:22,698	INFO packaging.py:380 -- Pushing file package 'gcs://_ray_pkg_65ba782e3db6ef7574cdbf3355d281d661917131.zip' (9.97MiB) to Ray cluster...
2026-01-08 17:49:22,735	INFO packaging.py:393 -- Successfully pushed file package 'gcs://_ray_pkg_65ba782e3db6ef7574cdbf3355d281d661917131.zip'.


ObjectRef(3ca0590f168eaec5ffffffffffffffffffffffff0500000001000000)

<div class="alert alert-info">
  <strong>A <a href="https://docs.ray.io/en/latest/ray-core/key-concepts.html#tasks" target="_blank">Task</a></strong> is a remote, stateless Python function invocation.
</div>


Here is what happens when you call `{remote_function}.remote`:
1. Ray schedules the function execution as a task in a separate process in the cluster
2. Ray returns an `ObjectRef` (a reference to the future result) to you **immediately** 
3. The cluster executes the actual computation in the background
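
The "returns immediately" behavior is the same idea as a future in Python's standard library. The sketch below is only an analogy built on `concurrent.futures`; it is not how Ray is implemented, and `slow_add` is a hypothetical stand-in for a remote function. `submit` hands back a handle right away, and `result()` plays the role of `ray.get`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_add(a, b):
    # Hypothetical stand-in for a remote function that takes a while.
    time.sleep(0.5)
    return a + b


with ThreadPoolExecutor(max_workers=1) as pool:
    start = time.perf_counter()
    future = pool.submit(slow_add, 1, 2)  # comparable to remote_add.remote(1, 2)
    submit_seconds = time.perf_counter() - start  # measured before the work is done

    value = future.result()  # comparable to ray.get(ref): blocks until ready
    total_seconds = time.perf_counter() - start

print(f"submit took {submit_seconds:.4f}s, result arrived after {total_seconds:.2f}s")
```

The key difference is scope: a standard-library future lives inside one process, whereas an `ObjectRef` can point at a result produced on any node in the cluster.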


In [6]:
ref = remote_add.remote(1, 2)
ref

ObjectRef(4482c0d3e15a41a8ffffffffffffffffffffffff0500000001000000)

Here is a map of how Python code is translated into Ray tasks.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-core/python_to_ray_task_map_v2.png" alt="Python to Ray Task Map" width="800">

## 4. Getting Results

To block until the task finishes and retrieve the corresponding object, use `ray.get`:

In [7]:
ray.get(ref)

3

## 5. Putting It All Together

Here are the three steps:
1. Create the remote function
2. Execute it remotely
3. Get the result when needed


<div class="alert alert-block alert-info">
    
__Activity: define and invoke a Ray task__

Define a remote function `sqrt_add` that accepts two arguments and performs the following steps:
1. computes the square-root of the first
2. adds the second
3. returns the result

Execute it with 2 different sets of parameters and collect the results

```python
# Hint: define the below as a remote function
def sqrt_add(a, b):
    ... 

# Hint: invoke it as a remote task and collect the results
```


</div>

In [8]:
# Write your solution here
import math 

@ray.remote
def sqrt_add(a,b):
    return math.sqrt(a) + b


inputs = [(2, 3), (10, 20), (40, 50)]
refs = [sqrt_add.remote(*input) for input in inputs]
ray.get(refs)

[4.414213562373095, 23.162277660168378, 56.324555320336756]

<div class="alert alert-block alert-info">

<details>

<summary> Click to see solution </summary>

```python
import math

@ray.remote
def sqrt_add(a, b):
    return math.sqrt(a) + b

ray.get([sqrt_add.remote(2, 3), sqrt_add.remote(5, 4)])
```

</details>

</div>


### 5.1. Note about Ray ID Specification

IDs for tasks and objects are built according to the [ID specification in Ray](https://github.com/ray-project/ray/blob/master/src/ray/design_docs/id_specification.md).

In [9]:
refs[0]

ObjectRef(68d7b3a94be6e983ffffffffffffffffffffffff0500000001000000)

In [10]:
refs[0].job_id()

JobID(05000000)

In [11]:
refs[0].task_id()

TaskID(68d7b3a94be6e983ffffffffffffffffffffffff05000000)

In [12]:
dir(refs[0])

['__await__',
 '__bytes__',
 '__class__',
 '__class_getitem__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '_on_completed',
 '_set_id',
 'as_future',
 'binary',
 'call_site',
 'from_binary',
 'from_hex',
 'from_random',
 'future',
 'hex',
 'is_nil',
 'job_id',
 'nil',
 'owner_address',
 'redis_shard_hash',
 'size',
 'task_id',
 'tensor_transport']
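
The outputs above suggest how the IDs nest: the 28-byte object ID begins with the 24-byte task ID, and the task ID ends with the 4-byte job ID. The sketch below only slices the hex string printed above to make that visible. The helper name `split_object_id` is hypothetical, and reading the trailing 4 bytes as the object's index within its task is an assumption based on the linked ID specification.

```python
def split_object_id(object_id_hex: str) -> dict:
    """Split a 28-byte ObjectRef hex string into the parts seen in the outputs above."""
    if len(object_id_hex) != 56:
        raise ValueError("expected 56 hex characters (28 bytes)")
    task_id_hex = object_id_hex[:48]  # first 24 bytes: the task that creates the object
    suffix_hex = object_id_hex[48:]   # last 4 bytes: assumed to be the object's index
    job_id_hex = task_id_hex[-8:]     # last 4 bytes of the task ID: the submitting job
    return {"task_id": task_id_hex, "job_id": job_id_hex, "suffix": suffix_hex}


object_id_hex = "68d7b3a94be6e983ffffffffffffffffffffffff0500000001000000"
parts = split_object_id(object_id_hex)
for name, value in parts.items():
    print(f"{name:>8}: {value}")
```

The `task_id` and `job_id` slices match what `refs[0].task_id()` and `refs[0].job_id()` returned above.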

### 5.2. Anti-pattern: Calling ray.get in a loop harms parallelism

|<img src="https://assets-training.s3.us-west-2.amazonaws.com/ray-core/ray-core/ray-get-in-a-loop.png" width="700px" loading="lazy">|
|:--|
|ray.get() is a blocking call. Avoid calling it on every item (left panel). Calling only on the final result improves performance (right panel).|

When trying to collect results for multiple remote function invocations (tasks), don't block and wait for each one individually. Let's consider this remote function:

In [13]:
@ray.remote
def expensive_square(x):
    time.sleep(5)
    return x**2

This implementation will block for each item in the loop:

In [14]:
results = []
for item in range(4):
    output = ray.get(expensive_square.remote(item))
    results.append(output)
results

[0, 1, 4, 9]

Instead, schedule all remote calls first so that they run in parallel, then request all the results at once.

In [15]:
refs = []
for j in range(4):
    refs.append(expensive_square.remote(j))
results = ray.get(refs)
results

[0, 1, 4, 9]

<div class="alert alert-info">
Read more about this <strong><a href="https://docs.ray.io/en/latest/ray-core/patterns/ray-get-loop.html" target="_blank">anti-pattern</a></strong>.
</div>

## 6. Object store and Memory model

Each worker node has its own object store, and collectively, these form a shared object store across the cluster.

Remote objects are immutable. That is, their values cannot be changed after creation. This allows remote objects to be replicated in multiple object stores without needing to synchronize the copies.

|<img src="https://assets-training.s3.us-west-2.amazonaws.com/ray-core/ray-core/ray-cluster.png" width="700px" loading="lazy">|
|:--|
|A Ray cluster with a head node and two worker nodes. The distributed object store is highlighted in orange.|

<div class="alert alert-info">
  <strong><a href="https://docs.ray.io/en/latest/ray-core/key-concepts.html#objects" target="_blank">Object</a></strong> - tasks and actors create and work with remote objects, which can be stored anywhere in a cluster. These objects are accessed using <strong>ObjectRef</strong> and are cached in a distributed shared-memory <strong>object store</strong>.
</div>

Let's consider the following example:

In [16]:
large_matrix = np.random.rand(2, 1024, 1024, 1024//8) # approx. 2 GB
size_in_bytes = large_matrix.nbytes  # size of the array data itself, in bytes

print(f"large_matrix has: {size_in_bytes/1024/1024/1024:.2f} GB")

large_matrix has: 2.00 GB


Add an object to the object store using `ray.put()`

In [17]:
obj_ref = ray.put(large_matrix)
obj_ref

ObjectRef(00ffffffffffffffffffffffffffffffffffffff0500000001e1f505)

Use `ray.get()` to fetch the object that an object ref points to:

In [18]:
large_mat_from_object_store = ray.get(obj_ref)

While the contents are the same, the object store holds its own copy of the array at a different memory location.

In [19]:
np.array_equal(large_mat_from_object_store, large_matrix)

True

The array in the object store lives in shared memory, whereas the array created in the notebook is local to the notebook process.

In [20]:
id(large_mat_from_object_store), id(large_matrix)

(137589151124144, 137589150762448)

In [21]:
large_mat_from_object_store is large_matrix

False

### 6.1 Pattern: Pass large objects **by reference**

Ray distinguishes between a **value** and an **Object Ref** when you pass arguments to remote functions:

* **Value** → Ray serializes the data and writes a fresh copy to the object store for every call.
* **Object Ref** → Ray forwards only a lightweight ID; the worker fetches the data once and reuses it.

Upload big, read‑only objects once, then circulate their `ObjectRef` to every consumer task. This avoids repeated serialization, network traffic, and object‑store pressure.

Pass by value only for small literals (< 100 KiB); otherwise, pass by reference.


In [22]:
large_ref = ray.put(np.random.rand(1000, 1000))  # Approx. 8 MB -> place in object store


@ray.remote
def compute_sum(x):
    return int(x.sum())


# all tasks use same ObjectRef minimizing copies (memory efficient)
results = ray.get([compute_sum.remote(large_ref) for _ in range(100)])

### 6.2. Ray memory model

Ray manages memory in several ways to efficiently handle distributed tasks:

1. **Heap memory**:
   - Used by workers to execute tasks and actors.
   - Used to store small objects (less than 100KB) and Ray metadata.
   - High memory pressure can cause Ray to terminate some tasks to free up resources.

2. **Shared memory (Object Store)**:
   - Serves as the medium for passing data between tasks.
   - Large objects (greater than 100KB) are stored in a shared memory space, using up to 30% of a node's memory.
   - If more space is needed, objects can be spilled to disk or stored on disk in a slower-access format.

Here is a diagram showing a horizontal slicing of a node's memory.

<img src="https://docs.ray.io/en/latest/_images/memory.svg" width="600">

### 6.3. Example: Producer-consumer pattern with numpy arrays

This example demonstrates how Ray transfers data in the distributed object store. The `producer_task` creates a 4 GiB numpy array, and the `consumer_task` accesses it with zero-copy deserialization when on the same node:



In [23]:
@ray.remote
def producer_task(size_mb: int = 4 * 1024) -> np.ndarray:
    array = np.random.rand((1024**2 * size_mb // 8)).astype(np.float64)
    return array


@ray.remote
def consumer_task(array: np.ndarray) -> None:
    assert isinstance(array, np.ndarray)
    assert not array.flags.owndata  # Confirms zero-copy

arr_ref = producer_task.remote()  # Produce a 4 GiB array
output_ref = consumer_task.remote(arr_ref)  # Pass ObjectRef to consumer

**What happens under the hood:**

1. **Producer task** creates the array in heap memory, then Ray stores it in the shared object store (large objects > 100KB)
2. **Consumer task** receives the `ObjectRef` and directly accesses the array from shared memory with zero-copy deserialization (if on same node)
3. If tasks run on different nodes, Ray copies the array across the network only once

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-data-deep-dive/producer-consumer-object-store-v2.png" width="600">

To see memory usage in action, run this inspection script:



In [24]:
!python code/memory_inspection.py

2026-01-08 17:49:56,366	INFO worker.py:1833 -- Connecting to existing Ray cluster at address: 10.0.36.30:6379...
2026-01-08 17:49:56,377	INFO worker.py:2004 -- Connected to Ray cluster. View the dashboard at https://session-zhee2uzsi3lhk3sdl5dvqc8x4m.i.anyscaleuserdata.com
2026-01-08 17:49:56,406	INFO packaging.py:380 -- Pushing file package 'gcs://_ray_pkg_65ba782e3db6ef7574cdbf3355d281d661917131.zip' (9.97MiB) to Ray cluster...
2026-01-08 17:49:56,440	INFO packaging.py:393 -- Successfully pushed file package 'gcs://_ray_pkg_65ba782e3db6ef7574cdbf3355d281d661917131.zip'.
(producer_task pid=4777, ip=10.0.30.21) producer_task: At start
(producer_task pid=4777, ip=10.0.30.21) RSS : 91.7578125 MiB
(producer_task pid=4777, ip=10.0.30.21) Shared memory: 44.125 MiB
(producer_task pid=4777, ip=10.0.30.21) Heap memory (RSS - Shared): 47.6328125 MiB
(producer_task pid=4777, ip=10.0.30.21) ------------------------------
(produ

#### On zero-copy deserialization

Ray uses **cloudpickle** for serialization and **pickle protocol 5** for zero-copy deserialization.

**How Ray transfers code and data:**

1. **Code transfer (functions)**: Functions are pickled and stored in the Global Control Store (GCS), then cached for subsequent calls

2. **Data transfer (arguments/return values)**:
   - **Small objects (< 100 KB)**: Pickled and transferred inline with the task metadata
   - **Large objects (> 100 KB)**: Stored in shared memory (object store), only the `ObjectRef` is transferred

**Key performance characteristics:**

- **Zero-copy benefits**: Works for contiguous numpy arrays and PyArrow arrays on the same node, enabling efficient read access without data copying. 
- **Zero-copy limitation**: Does not support PyTorch tensors or other array types
- **Immutability**: Objects in the object store are **immutable once sealed**, enabling safe sharing across processes

To read more about object serialization in Ray, see the [serialization documentation](https://docs.ray.io/en/latest/ray-core/objects/serialization.html).
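
The mechanism behind zero-copy reads can be seen with plain `pickle` and NumPy, without Ray. The sketch below illustrates pickle protocol 5's out-of-band buffers in a single process; it is not Ray's internal code path, and the variable names are chosen here for illustration. The pickle stream carries only metadata, while the array data stays in a separate buffer that the restored array wraps instead of copying.

```python
import pickle

import numpy as np

array = np.arange(1_000_000, dtype=np.float64)  # about 8 MB of contiguous data

# Protocol 5 can hand large buffers to a callback instead of copying them
# into the pickle byte stream ("out-of-band" serialization).
out_of_band = []
metadata = pickle.dumps(array, protocol=5, buffer_callback=out_of_band.append)

# Rebuilding from the same buffers wraps the existing memory rather than copying it.
restored = pickle.loads(metadata, buffers=out_of_band)

print(f"pickle stream: {len(metadata)} bytes for {array.nbytes} bytes of data")
print("shares memory with the original:", np.shares_memory(array, restored))
print("owns its data:", restored.flags.owndata)
```

This is also why the `consumer_task` above can assert `not array.flags.owndata`: an array that merely wraps an existing buffer does not own its data.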

## 7. Chaining Tasks and Passing Data

Let's say we now want to execute a graph of two tasks:
1. Square a value using `expensive_square`
2. Add 1 to the `expensive_square` result, by using `remote_add`

This can be achieved without fetching an intermediate result.

In [25]:
@ray.remote
def expensive_square(x):
    time.sleep(1)
    return x**2

**❌ Anti-pattern:** Fetch the intermediate value with `ray.get` before submitting the second task:


In [26]:
# 1st task
square_ref = expensive_square.remote(2)
square_value = ray.get(square_ref)  # wait to get the value

# 2nd task
sum_ref = remote_add.remote(1, square_value)  # pass value from 1st task
sum_value = ray.get(sum_ref)

**✅ Better:** Chain the tasks by passing the `ObjectRef` directly to the second task:



In [27]:
square_ref = expensive_square.remote(2)
sum_ref = remote_add.remote(1, square_ref)  # Pass ObjectRef, not value!
sum_value = ray.get(sum_ref)  # Wait only at the end

**Why this is better:**
- No unnecessary data transfer (ObjectRef is just an ID)
- Ray automatically handles dependencies
- Second task waits for first task to complete
- More efficient scheduling

## 8. Task retries

Let's consider two types of failures:
1. **system errors** (e.g., a machine fails)
2. **application-level errors** (e.g., Python-level exceptions)

Ray will automatically **retry a task up to 3 times** if it fails due to a system error (e.g., a worker node dies).

The task below won't be retried by default because it raises an application-level error:

In [28]:
@ray.remote
def incorrect_square(x: int, prob: float) -> int:
    # Simulate potential failures
    if random.random() < prob:  # % chance of failure
        raise ValueError("Random failure")
    return x**2

In [29]:
try:
    ray.get([incorrect_square.remote(x=4, prob=0.5) for _ in range(10)])
except ray.exceptions.RayTaskError:
    print("At least one of the tasks failed", flush=True)

At least one of the tasks failed


2026-01-08 17:50:08,920	ERROR worker.py:430 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::incorrect_square() (pid=4160, ip=10.0.18.182)
  File "/tmp/ipykernel_10055/2967030291.py", line 5, in incorrect_square
ValueError: Random failure
2026-01-08 17:50:08,921	ERROR worker.py:430 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::incorrect_square() (pid=4161, ip=10.0.18.182)
  File "/tmp/ipykernel_10055/2967030291.py", line 5, in incorrect_square
ValueError: Random failure
2026-01-08 17:50:08,922	ERROR worker.py:430 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::incorrect_square() (pid=4159, ip=10.0.18.182)
  File "/tmp/ipykernel_10055/2967030291.py", line 5, in incorrect_square
ValueError: Random failure
2026-01-08 17:50:08,923	ERROR worker.py:430 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::incorrect_square() (pid=4163, ip=10.0.18.182)
  File "

Ray lets you specify how to handle retries when an exception is encountered.

Let's retry on `ValueError`, like below:

In [30]:
@ray.remote(retry_exceptions=[ValueError])
def correct_square(x: int, prob: float) -> int:
    # Simulate potential failures
    if random.random() < prob:  # % chance of failure
        raise ValueError("Random failure")

    return x**2

Note that we do not have to redefine the remote function; instead, we can create an updated version using `.options`:

In [None]:
correct_square_mod = correct_square.options(
    retry_exceptions=[ValueError], max_retries=10,
)

2026-01-08 17:50:09,082	ERROR worker.py:430 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::incorrect_square() (pid=4424, ip=10.0.18.182)
  File "/tmp/ipykernel_10055/2967030291.py", line 5, in incorrect_square
ValueError: Random failure


Let's try it out:

In [32]:
try:
    outputs = ray.get([correct_square_mod.remote(x=4, prob=0.5) for _ in range(10)])
except ray.exceptions.RayTaskError:
    print("At least one of the tasks failed", flush=True)

outputs

(raylet) Task correct_square failed. There are 9 retries remaining, so the task will be retried. Error: User exception:
ray::correct_square() (pid=4424, ip=10.0.18.182)
  File "/tmp/ipykernel_10055/991359300.py", line 5, in correct_square
ValueError: Random failure


[16, 16, 16, 16, 16, 16, 16, 16, 16, 16]

<div class="alert alert-info">
Refer to the <strong><a href="https://docs.ray.io/en/latest/ray-core/tasks/retries.html" target="_blank">retries</a></strong> documentation to learn more.
</div>
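
A quick calculation shows why raising `max_retries` matters in this example. The sketch below assumes that each attempt fails independently with probability `prob` and that a task gets one initial attempt plus `max_retries` retries; the helper name `success_probability` is hypothetical.

```python
def success_probability(fail_prob: float, max_retries: int) -> float:
    """Chance that one task succeeds within 1 + max_retries attempts."""
    return 1 - fail_prob ** (max_retries + 1)


for max_retries in (0, 3, 10):
    one_task = success_probability(0.5, max_retries)
    all_ten = one_task**10  # ten independent tasks must all succeed
    print(f"max_retries={max_retries:>2}: one task {one_task:.4f}, all ten {all_ten:.4f}")
```

Under these assumptions, ten tasks with `prob=0.5` and the default of 3 retries all succeed only about half of the time, whereas `max_retries=10` makes a failed batch rare.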

## 9. Task Runtime Environments

Runtime environments can be used on top of the prepared environment from the Ray Cluster to customize the execution of tasks.

When setting up a worker process to run a task, Ray will first prepare the environment for the task.

This includes things like:
* installing dependencies
* setting environment variables

For example, we can set an environment variable:

In [33]:
@ray.remote(runtime_env={"env_vars": {"my_custom_env": "prod"}})
def f():
    env = os.environ["my_custom_env"]
    return f"My custom env is {env}"

In [34]:
ray.get(f.remote())

'My custom env is prod'

## 10. Resource allocation and management



Here is the sequence of events when you submit a Ray task:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-core/task-submission_old.gif" alt="Task Submission Sequence" width="800">

By default, Ray will schedule a task as long as there is at least one CPU available.

In code, you can state this requirement explicitly in the `@ray.remote` decorator:

In [35]:
@ray.remote(num_cpus=1)
def remote_add(a, b):
    return a + b

However, these resource specifications are not enforced - i.e. they are entirely [logical and not physical](https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#physical-resources-vs-logical-resources).

This means that you can, for instance, perform multiprocessing or multithreading within a task and oversubscribe resources.

In [36]:
@ray.remote(num_cpus=1)
def mm(n: int = 4000):
    A = np.random.rand(n, n)
    B = np.random.rand(n, n)

    # Time the dot product
    start = time.time()
    np.dot(A, B)
    end = time.time()
    print(f"Took {end - start}s")
    
ray.get(mm.options(runtime_env={"env_vars": {"OMP_NUM_THREADS": "1"}}).remote())
ray.get(mm.options(runtime_env={"env_vars": {"OMP_NUM_THREADS": "8"}}).remote())

(mm pid=4896, ip=10.0.18.182) Took 1.9162685871124268s


<div class="alert alert-info">

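Note that by default, Ray sets the `OMP_NUM_THREADS` environment variable to the task's `num_cpus` value (or to 1 if `num_cpus` is not specified), which is why the example above overrides it through `runtime_env`.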

Learn more about <strong><a href="https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#physical-resources-and-logical-resources" target="_blank">physical resources and logical resources</a></strong>.
</div>

### 10.1 Common options

**Resource options:**
- `num_cpus`: Number of CPUs (can be fractional, e.g., 0.5)
- `num_gpus`: Number of GPUs (can be fractional)
- `memory`: Memory in bytes
- `resources`: Dict of custom resources

**Fault tolerance options:**
- `max_retries`: Max number of retries (default: 3 for system errors)
- `retry_exceptions`: List of exception types to retry on

**Execution options:**
- `runtime_env`: Dict specifying runtime environment
- `scheduling_strategy`: Control task placement
- `name`: Name for debugging/monitoring

### 10.2 Note on resource requests, available resources, and configuring large clusters

<p>During the <em>scheduling stage</em>, Ray evaluates the <strong>resource requirements</strong> specified via the <code>@ray.remote</code> decorator or within the <code>resources={...}</code> argument. These requirements may include:</p>

<ul>
    <li><strong>CPU</strong> (e.g., <code>@ray.remote(num_cpus=2)</code>)</li>
    <li><strong>GPU</strong> (e.g., <code>@ray.remote(num_gpus=1)</code>)</li>
    <li><strong>Custom resources</strong>: User-defined custom resources like <code>"TPU"</code></li>
    <li><strong>Memory</strong></li>
</ul>

<p>Ray's scheduler checks the <strong>resource specification</strong> (sometimes referred to as <strong>resource shape</strong>) to match tasks and actors with available resources in the cluster. If no node currently has the requested combination available, the task waits, and on a cluster with autoscaling enabled Ray may add nodes to satisfy the request.</p>

<p>You can inspect the current resource availability using:</p>
<pre><code>
ray.available_resources()
</code></pre>

<p>This returns a dictionary showing the currently available CPUs, GPUs, memory, and any custom resources, for example:</p>

<pre><code>{'CPU': 24.0, 'GPU': 1.0, 'memory': 2147483648.0}</code></pre>

In [37]:
ray.available_resources()

{'memory': 103079215104.0,
 'anyscale/provider:aws': 3.0,
 'anyscale/accelerator_shape:1xT4': 2.0,
 'node:10.0.30.21': 1.0,
 'object_store_memory': 17847714694.0,
 'CPU': 16.0,
 'anyscale/region:us-west-2': 3.0,
 'accelerator_type:T4': 2.0,
 'GPU': 2.0,
 'anyscale/node-group:1xT4:8CPU-32GB': 2.0,
 'anyscale/node-group:head': 1.0,
 'node:__internal_head__': 1.0,
 'node:10.0.36.30': 1.0,
 'anyscale/cpu_only:true': 1.0,
 'node:10.0.18.182': 1.0}
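
The idea of matching a resource shape against nodes can be sketched in a few lines. This is a toy feasibility check, not Ray's scheduler: the function name `feasible_nodes` and the per-node breakdown (two workers with 8 CPUs and 1 GPU each) are assumptions made for illustration, loosely inferred from the node-group labels in the output above. The point it illustrates is that a single task must fit on one node, so cluster-wide totals can be misleading.

```python
def feasible_nodes(request: dict, nodes: dict) -> list:
    """Return the names of nodes whose free resources cover the whole request."""
    return [
        name
        for name, free in nodes.items()
        if all(free.get(resource, 0.0) >= amount for resource, amount in request.items())
    ]


# Assumed free resources per node (hypothetical breakdown of the totals above).
nodes = {
    "worker-1": {"CPU": 8.0, "GPU": 1.0},
    "worker-2": {"CPU": 8.0, "GPU": 1.0},
}

print(feasible_nodes({"CPU": 2, "GPU": 1}, nodes))  # fits on either worker
print(feasible_nodes({"GPU": 2}, nodes))            # cluster has 2 GPUs, but no single node does
print(feasible_nodes({"TPU": 1}, nodes))            # resource that no node offers
```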

<div class="alert alert-info">

<strong>Pattern:</strong> configure the head node to be unavailable for compute tasks.

When scaling to large clusters, it's important to ensure that the <strong>head node</strong> does not handle any compute tasks. Users can indicate that the head node is unavailable for compute by setting its resources:

```resources: {"CPU": 0}```

Learn more about <strong><a href="https://docs.ray.io/en/latest/cluster/vms/user-guides/large-cluster-best-practices.html#configuring-the-head-node" target="_blank">configuring the head node</a></strong>.
</div>

### 10.3 Fractional resources

Fractional resources allow Ray Tasks to request a fraction of a CPU or GPU (e.g., 0.5), enabling finer-grained resource allocation.

Let's consider the above example again:

In [38]:
@ray.remote(num_cpus=0.5)
def remote_add(a, b):
    return a + b

Because each task now requests half a CPU, Ray can schedule up to twice as many of these tasks concurrently as there are CPUs on the machine.

In [39]:
ref = remote_add.remote(2, 3)
ref

ObjectRef(7a636a2779a3d471ffffffffffffffffffffffff0500000001000000)

In [40]:
ray.get(ref)

5

<div class="alert alert-info">
    Fractional resources include support for <strong><a href="https://docs.ray.io/en/latest/ray-core/scheduling/accelerators.html#fractional-accelerators" target="_blank">multiple accelerators</a></strong>, allowing users to load multiple smaller models onto a single GPU. This is especially useful for scenarios like model inference. Learn more about <strong><a href="https://docs.ray.io/en/latest/ray-core/scheduling/resources.html#fractional-resource-requirements" target="_blank">fractional resource requirements</a></strong>.
</div>
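
The effect of a fractional request on concurrency is simple arithmetic. The helper below is a hypothetical back-of-the-envelope calculation (it does not call Ray), using the CPU and GPU counts from the `ray.available_resources()` output shown earlier and treating the cluster as one pool.

```python
import math


def max_concurrent_tasks(available: float, per_task: float) -> int:
    """Upper bound on tasks that can hold `per_task` units of a resource at once."""
    return math.floor(available / per_task)


print("num_cpus=1   ->", max_concurrent_tasks(16.0, 1))
print("num_cpus=0.5 ->", max_concurrent_tasks(16.0, 0.5))
print("num_gpus=0.25 ->", max_concurrent_tasks(2.0, 0.25))
```

Since these are logical resources, the bound only limits scheduling; it does not cap how much CPU time each task actually consumes.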

## 11. Pipeline data processing and waiting for results

After launching a number of tasks, you may want to know which ones have finished executing without blocking on all of them. You can do this with `ray.wait()`.

|<img src="https://assets-training.s3.us-west-2.amazonaws.com/ray-core/ray-core/pipeline-data-processing.png" width="400px" loading="lazy">|
|:--|
|(top panel) Execution timeline when using ray.get() to wait for all results before calling process results. (bottom panel) Execution timeline when using ray.wait() to process results as soon as they become available.|

Here are functions to match the above diagram:

In [41]:
@ray.remote
def do_some_work(x):
    time.sleep(random.uniform(0, 4))  # Replace this with work you need to do.
    return x


def process_incremental(sum, result):
    time.sleep(1)  # Replace this with some processing code.
    return sum + result


def process_results(results):
    sum = 0
    for x in results:
        # process_incremental already returns the updated running total
        sum = process_incremental(sum, x)
    return sum

This is the **naive approach**: block until all tasks are complete, then process the results.

In [42]:
start = time.time()
data_list = ray.get([do_some_work.remote(x) for x in range(20)])
sum = process_results(data_list)
print("duration =", time.time() - start, "\nresult = ", sum)

(raylet, ip=10.0.30.21) Spilled 8192 MiB, 2 objects, write throughput 454 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.


duration = 24.28804039955139 
result =  190


This is the **pipelined** approach: process items as soon as they become available.

In [43]:
start = time.time()
result_ids = [do_some_work.remote(x) for x in range(20)]

sum = 0
while len(result_ids):
    done_id, result_ids = ray.wait(result_ids)
    sum = process_incremental(sum, ray.get(done_id[0]))

print("duration =", time.time() - start, "\nresult = ", sum)

duration = 20.087604522705078 
result =  190


<div class="alert alert-info">
Read more about the <strong><a href="https://docs.ray.io/en/latest/ray-core/tips-for-first-time.html#tip-4-pipeline-data-processing" target="_blank">pipeline data processing</a></strong>
</div>
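
The difference between the two timelines can be reproduced with a small model that does not need a cluster. This is a simplified sketch: it assumes all 20 tasks start at the same time, that processing one result takes exactly one second, and that the function names `naive_duration` and `pipelined_duration` are chosen here for illustration.

```python
import random


def naive_duration(finish_times, process_seconds=1.0):
    # Wait for the slowest task, then process every result one after another.
    return max(finish_times) + process_seconds * len(finish_times)


def pipelined_duration(finish_times, process_seconds=1.0):
    # Process each result as soon as it is ready and the processor is free.
    clock = 0.0
    for ready_at in sorted(finish_times):
        clock = max(clock, ready_at) + process_seconds
    return clock


rng = random.Random(0)  # fixed seed so the sketch is repeatable
finish_times = [rng.uniform(0, 4) for _ in range(20)]  # mirrors do_some_work's sleep

print(f"naive:     {naive_duration(finish_times):.2f}s")
print(f"pipelined: {pipelined_duration(finish_times):.2f}s")
```

In this model the pipelined version hides most of the waiting behind the processing, so its total approaches the 20 seconds of processing time, which is consistent with the durations measured above.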

## 12. Ray Actors

Actors extend the Ray API from functions (tasks) to classes.

An actor is a stateful worker. When a new actor is instantiated, a new worker is created, and methods of the actor are scheduled on that specific worker and can access and mutate the state of that worker. Similarly to Ray Tasks, actors support CPU and GPU compute as well as fractional resources.

Let's look at an example of an actor which maintains a running balance.

In [44]:
@ray.remote
class Accounting:
    def __init__(self):
        # Stored as _total so the attribute does not shadow the total() method.
        self._total = 0
    
    def add(self, amount):
        self._total += amount
        
    def remove(self, amount):
        self._total -= amount
        
    def total(self):
        return self._total

<div class="alert alert-info">
  <strong><a href="https://docs.ray.io/en/latest/ray-core/key-concepts.html#actors" target="_blank">Actor</a></strong> is a remote, stateful Python class.
</div>

<div class="alert alert-info">

The most common use case for actors is with state that is not mutated but is large enough that we may want to load it only once and ensure we can route calls to it over time, such as a large AI model.

</div>

Define an actor with the `@ray.remote` decorator and then use `<class_name>.remote()` to ask Ray to construct an instance of this actor somewhere in the cluster.

We get an actor handle which we can use to communicate with that actor, pass to other code, tasks, or actors, etc.

In [45]:
acc = Accounting.remote()

We can send a message to an actor -- with RPC semantics -- by using `<handle>.<method_name>.remote()`

In [46]:
acc.total.remote()

ObjectRef(756963cc1de4954040f50f2e5c482efea54133b90500000001000000)

Not surprisingly, we get an object ref back

In [47]:
ray.get(acc.total.remote())

0

We can mutate the state inside this actor instance

In [48]:
acc.add.remote(100)

ObjectRef(186010f41212601640f50f2e5c482efea54133b90500000001000000)

In [49]:
acc.remove.remote(10)

ObjectRef(3e0f2a001193943c40f50f2e5c482efea54133b90500000001000000)

In [50]:
ray.get(acc.total.remote())

90

<div class="alert alert-block alert-info">

__Activity: linear model inference__

* Create an actor which applies a model to convert Celsius temperatures to Fahrenheit
* The constructor should take model weights (w1 and w0) and store them as instance state
* A convert method should take a scalar, multiply it by w1 then add w0 (weights retrieved from instance state) and then return the result


```python
# Hint: define the below as a remote actor
class LinearModel:
    def __init__(self, w0, w1):
        # Hint: store the weights

    def convert(self, celsius):
        # Hint: convert the celsius temperature to Fahrenheit

# Hint: create an instance of the LinearModel actor

# Hint: convert 100 Celsius to Fahrenheit
```

</div>

In [51]:
# Write your solution here
@ray.remote
class LinearModel:
    def __init__(self, w0, w1):
        self.w0 = w0
        self.w1 = w1

    def convert(self, celsius):
        result = self.w1 * celsius + self.w0
        return result

# Hint: create an instance of the LinearModel actor
model = LinearModel.remote(w0=32, w1=9/5)
# Hint: convert 100 Celsius to Fahrenheit
ray.get(model.convert.remote(100))


212.0

<div class="alert alert-block alert-info">

<details>

<summary> Click to see solution </summary>

```python
@ray.remote
class LinearModel:
    def __init__(self, w0, w1):
        self.w0 = w0
        self.w1 = w1

    def convert(self, celsius):
        return self.w1 * celsius + self.w0

model = LinearModel.remote(w1=9/5, w0=32)
ray.get(model.convert.remote(100))
```

</details>
</div>


<!-- TODO: add Patterns/antipatterns based on above learnings-->
