<img src="logo-ray.png" width="80px">


# Introduction to Ray Core

# Lifecycle of a task 

We start by detailing the full lifecycle of a **Ray task**, from when it is **created** and submitted until it is **completed** and the **resulting objects are returned** to the user.

## 10,000 feet view

We have a Python function conveniently named `expensive_computation` which executes an expensive computation. To keep it simple, all it does is perform a naive matrix multiplication and return the number of elements in the resulting matrix.


It is called sequentially `n_runs` times.

In [1]:
%%writefile utils.py
from itertools import product

def perform_naive_matrix_multiplication(n):
    matrix1 = matrix2 = [[1 for _ in range(n)] for _ in range(n)]

    result = [[0 for _ in range(n)] for _ in range(n)]
    for i, j, k in product(range(n), range(n), range(n)):
        result[i][j] += matrix1[i][k] * matrix2[k][j]

    return result

Overwriting utils.py


In [2]:
from utils import perform_naive_matrix_multiplication

n_runs = 10
n = 300

def expensive_computation(n):
    result = perform_naive_matrix_multiplication(n)
    n_rows, n_cols = len(result), len(result[0])
    num_elements_in_matrix = n_rows * n_cols
    return num_elements_in_matrix


results = [expensive_computation(n) for _ in range(n_runs)]
assert sum(results) == n_runs * n * n

Below is the execution visualized

<img src="sequential_simple_.jpeg" height=300>

We want to:
- Run the same function but in a distributed fashion - i.e. in parallel on a cluster of machines

We do this by following these steps:
- Convert the `expensive_computation` function to a ray task by decorating it with `ray.remote`
- Submit a task for execution by calling `future = expensive_computation.remote(n)`
- Use the returned `future` object reference to fetch the result of the function by calling `ray.get(future)` 

In [3]:
import ray


@ray.remote  # decorator to convert python function to ray task
def expensive_computation(n):
    result = perform_naive_matrix_multiplication(n)
    n_rows, n_cols = len(result), len(result[0])
    num_elements_in_matrix = n_rows * n_cols
    return num_elements_in_matrix


# submit n_run ray tasks to a ray cluster
# and keep a reference to the task futures
futures = [expensive_computation.remote(n) for _ in range(n_runs)]

# wait for all tasks to complete and get the resulting objects
# results are returned in the same order as submitted
results = ray.get(futures)

# confirm that we got the right result
assert sum(results) == n_runs * n * n

2023-11-21 22:42:12,764	INFO worker.py:1633 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265


Here is what is happening under the hood:

<img src="parallel_simple.jpeg" height="300">

## 1000 feet view

Let's detail the parallel execution of the function a bit more.

More specifically:
- **ray tasks** are executed on a **ray cluster** as part of a **ray job**
   - You can think of a **ray job** as the collection of tasks, objects, and actors originating from the same runtime environment
- **ray worker processes** are the processes that execute the tasks
- **futures** in Ray are called `ObjectRef`s, short for **object references**
- results are stored as **objects** in an "**object store**"
- `ray.get()` is used to wait and fetch the **object value** given the **object reference** from the "**object store**"

Here is a more detailed view of the parallel execution


<img src="parallel_execution_1000ft.png" height="300">

Let's use the ray state client to verify the above.

We declare a stand-in task, `my_task`, that simply sleeps for `task_sleep_time` seconds so we can see its state evolve more clearly. We give it a unique name prefixed with `expensive_computation_` so we can easily track its state.

In [None]:
from uuid import uuid4
import ray

task_sleep_time = 20


@ray.remote
def my_task():
    import time

    time.sleep(task_sleep_time)
    return 1


id_ = str(uuid4())[:8]
name = f"expensive_computation_{id_}"
ray_task = my_task.options(name=name)

We submit the task and inspect the returned future. We see that it is a `ray.ObjectRef` with a given ID.

In [None]:
future_object_ref = ray_task.remote()
future_object_ref

We now request the cluster state to see our task running and transitioning through some of its states

In [None]:
from ray.util.state import get_task
import time

start_time = time.time()

while time.time() - start_time < (task_sleep_time + 10):
    time.sleep(5)
    task = get_task(id=future_object_ref.task_id().hex())
    print(
        f"task {task.name} is in state={task.state} running on worker {task.worker_id[:8]} as part of Job ID {task.job_id}"
    )

In general, the diagram below shows the high-level state transitions a task goes through on its happy path

<img src="state_transition_simplified.png" width="700px">

Finally, we use `ray.get` to fetch the resulting object value now that the task is completed

In [None]:
object_value = ray.get(future_object_ref)
object_value

## 100 feet view

Let's further detail the lifecycle of a ray task.

More specifically here is what a cluster looks like:


<img src="ray_cluster.png" height="500">

Things to keep in mind:

- The **head node** is a special node that runs the **global control service**, **cluster level services** and usually the **driver**
  - The **global control service** keeps track of the **cluster state** that is not supposed to change often
  - Cluster level services are services that are shared across the cluster, such as autoscaling, job submission, etc. 
  - The **driver** can submit tasks but does not execute them 
- Each **worker process** will keep track of all the **tasks** it owns/submits in its **ownership table**
- Small **objects** (< 100KB) are stored in the **in-process object store** of a **worker**
- Large **objects** are stored in the **plasma object store** which is **shared across worker processes** on the same node
  - The **plasma object store** by default is in-memory and takes up **30% of the memory of the node**
  - If the **plasma object store** is full, objects are **spilled to disk**
  - The **plasma object store** is also referred to as the **shared memory object store**
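
The storage rules above can be summarized in a small pure-Python sketch. This is only a toy model and not Ray code: `choose_store`, `plasma_capacity_bytes`, `needs_spilling`, and the two constants are hypothetical names, and the 100KB cutoff and 30% share are simply the defaults quoted above.

```python
# Toy model of the object placement rules described above (not Ray code).
INLINE_THRESHOLD_BYTES = 100 * 1024  # assumed "100KB" cutoff from the text
PLASMA_MEMORY_PERCENT = 30           # assumed default share of node memory


def choose_store(size_bytes: int) -> str:
    """Return which store a new object of this size would land in."""
    if size_bytes < INLINE_THRESHOLD_BYTES:
        return "in-process"
    return "plasma"


def plasma_capacity_bytes(node_memory_bytes: int) -> int:
    """Default plasma capacity for a node with the given amount of memory."""
    return node_memory_bytes * PLASMA_MEMORY_PERCENT // 100


def needs_spilling(used_bytes: int, new_bytes: int, capacity_bytes: int) -> bool:
    """True when adding the object would overflow the plasma store."""
    return used_bytes + new_bytes > capacity_bytes
```

For example, a small integer result stays in the in-process store, while a 1 GB array goes to plasma and may trigger spilling on a node with little memory.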

### How does distributed ownership work in Ray?

- The worker process that submits a task is the **owner** of that task
  - Note that this is not the same as the worker process that executes the task
  
Here is a sample diagram showing how ownership works
<img src="ownership_diagram.png" width="500px">


- The **Driver** submits **task a**
- This means the **Driver** is the **owner** of `x` (result of putting object in store) and `y` from **task a**
- Then the **worker process** executing **task a** will submit **task b**
    - This means the **worker process** executing **task a** is the **owner** of the **resulting object** `z` from **task b**
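
The ownership rules in this example can be mimicked with a toy ownership table per process. This is a sketch for illustration only: `ownership_tables`, `record_put`, `record_submit`, and `owner_of` are hypothetical helpers, not Ray's actual data structures.

```python
from collections import defaultdict

# Toy model: every process keeps its own ownership table (not Ray's real structure).
ownership_tables = defaultdict(dict)  # process name -> {object name: origin}


def record_put(process, obj):
    """The process that puts an object in the store owns that object."""
    ownership_tables[process][obj] = "put"


def record_submit(submitter, task, result_obj):
    """The process that submits a task owns the task's result."""
    ownership_tables[submitter][result_obj] = f"task {task}"


def owner_of(obj):
    """Look up which process owns an object."""
    return next(p for p, table in ownership_tables.items() if obj in table)


record_put("driver", "x")                    # the driver puts x in the store
record_submit("driver", "a", "y")            # the driver submits task a, producing y
record_submit("worker running a", "b", "z")  # task a submits task b, producing z
```

Note that the owner of `z` is the worker that submitted **task b**, not the worker that ends up executing it.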

With the cluster architecture in mind, let's look at the lifecycle of a task in more detail.

#### Submitting a task
<img src="submit_task_detailed.png">

#### Data locality in ray

- The owner will select the **raylet** where the **bulk of the objects the task depends on** are located
  - This can be a **raylet** running on a **different node**!
  - Bulk is determined by the dependency's object size
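
As a rough sketch of this locality rule, the function below picks the node that holds the most bytes of a task's dependencies. The function name, its inputs, and the tie-breaking rule are assumptions made for illustration; Ray's internal bookkeeping is more involved.

```python
def pick_raylet_by_locality(dependency_locations, dependency_sizes):
    """Return the node holding the most dependency bytes.

    dependency_locations: object id -> set of node ids holding a copy
    dependency_sizes: object id -> size in bytes
    """
    bytes_per_node = {}
    for obj_id, nodes in dependency_locations.items():
        for node in nodes:
            bytes_per_node[node] = bytes_per_node.get(node, 0) + dependency_sizes[obj_id]
    if not bytes_per_node:
        return None  # no stored dependencies, so no locality preference
    # Ties go to the smallest node id, only to keep this sketch deterministic
    return max(sorted(bytes_per_node), key=bytes_per_node.get)
```

With a 1 GB model on `node-2` and a 1 KB batch on `node-1`, the sketch selects `node-2`, even if the owner itself runs on `node-1`.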

<img src="selecting_raylet.png">

#### Scheduling a Task

<img src="schedule_task_.png">

## Scheduling policies deep-dive

How does a raylet's scheduler choose a worker node to lease work from?

### Classifying nodes as feasible/infeasible and available/unavailable

Note that every 100ms, the **GCS pulls resource availability** from each **raylet** and then aggregates and **rebroadcasts them back to each raylet**.

In [None]:
from df_utils import styled_df
import pandas as pd

# Read DataFrame from CSV
df_read = pd.read_csv("raylet_node_classification.csv", header=[0, 1])

# Display the read DataFrame
styled_df(df_read)
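
To make the classification concrete, here is a minimal sketch. It assumes the usual definitions: a node is *feasible* if its total resources could ever satisfy the task's demand, and *available* if its currently free resources satisfy it right now. `fits` and `classify_node` are hypothetical helper names.

```python
def fits(demand, resources):
    """True if every requested resource amount is covered."""
    return all(resources.get(name, 0) >= amount for name, amount in demand.items())


def classify_node(demand, total, available):
    """Classify one node for one task's resource demand."""
    if not fits(demand, total):
        return "infeasible"
    if fits(demand, available):
        return "feasible, available"
    return "feasible, unavailable"
```

For example, a task that needs 1 GPU is infeasible on a CPU-only node, and feasible but unavailable on a GPU node whose only GPU is currently in use.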

### Scheduling Policies

#### Default Hybrid policy


This is the default policy used by ray. It is a hybrid policy that combines the following two heuristics:
- Bin packing heuristic
- Load balancing heuristic

**Note that the "local node" here is the node local to the chosen raylet.**

The diagram below shows the two modes in action when scheduling two tasks Task1 and Task2

<img src="scheduling_hybrid_heuristic.png">

**Note** you can set the following environment variables to configure the default hybrid policy:

- `RAY_scheduler_spread_threshold` - default is 0.5 or 50% utilization of the node
- `RAY_scheduler_top_k_fraction` - default is 0.2 or 20% of the nodes
  - You can also set `RAY_scheduler_top_k_absolute` to set an absolute number of nodes to use
  - Note that it is the max of `RAY_scheduler_top_k_fraction` and `RAY_scheduler_top_k_absolute` that is used
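
The interplay of the two heuristics and the settings above can be sketched as follows. This is a simplified model, not Ray's scheduler: the function name `hybrid_pick`, the exact ranking order, the rounding of the top-k fraction, and the default `top_k_absolute=1` are all assumptions made for illustration.

```python
import math
import random


def hybrid_pick(utilization, local_node, spread_threshold=0.5,
                top_k_fraction=0.2, top_k_absolute=1, rng=None):
    """Pick a node id from `utilization` (node id -> utilization in [0, 1]).

    Only feasible and available nodes should be passed in (at least one).
    """
    rng = rng or random

    def rank(node_id):
        util = utilization[node_id]
        # Below the threshold every node scores 0, so ties are broken by a fixed
        # order with the local node first: this is the bin packing mode.
        # At or above the threshold the least utilized node wins: load balancing.
        score = 0.0 if util < spread_threshold else util
        return (score, node_id != local_node, node_id)

    ranked = sorted(utilization, key=rank)
    # Use the larger of the absolute and the fractional top-k settings
    k = max(top_k_absolute, math.ceil(top_k_fraction * len(ranked)))
    return rng.choice(ranked[:k])
```

With a lightly loaded local node the sketch keeps packing tasks locally. Once every node is past the threshold, it falls back to the least utilized nodes and picks randomly among the top `k`.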

In [None]:
import ray

@ray.remote(scheduling_strategy="DEFAULT") # this is the default so we don't need to specify it
def default_schedule_func():
    return 2

ray.get(default_schedule_func.remote())

#### Node Affinity Policy 

Assigns tasks to a given node in either a strict or soft manner.

<img src="node_affinity_policy.png" width="700px">

In [None]:
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy


@ray.remote(
    scheduling_strategy=NodeAffinitySchedulingStrategy(
        node_id=ray.get_runtime_context().get_node_id(),
        soft=False,
    )
)
def node_affinity_schedule():
    return 2


ray.get(node_affinity_schedule.remote())

#### SPREAD Policy 

As the name suggests, the SPREAD policy spreads the tasks across the nodes.

Note that it spreads across all the available nodes first and then the feasible nodes.

It behaves like a best-effort round-robin.
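
A minimal sketch of this round-robin behavior is shown below. The `SpreadScheduler` class is hypothetical and only illustrates the "available nodes first, then feasible nodes" preference; it is not how Ray implements the policy.

```python
import itertools


class SpreadScheduler:
    """Toy round-robin over nodes (not Ray's implementation)."""

    def __init__(self):
        self._counter = itertools.count()

    def pick(self, available_nodes, feasible_nodes):
        # Prefer nodes that can run the task now, then fall back to feasible ones
        candidates = sorted(available_nodes) or sorted(feasible_nodes)
        if not candidates:
            return None  # no node could ever run this task
        return candidates[next(self._counter) % len(candidates)]
```

Calling `pick` four times on three available nodes visits each node once and then wraps around to the first one.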

<img src="spread_scheduling_policy_.png" width="500px">

In [None]:
import ray


@ray.remote(scheduling_strategy="SPREAD")
def spread_default_func():
    return 2


ray.get(spread_default_func.remote())

#### Placement Group Policy

In cases when we want to treat a set of resources as a single unit, we can use placement groups.


<img src="placement_group_policy_.png" width="300px">

**Things to keep in mind**:

- A **placement group** is formed from a set of **resource bundles**
  - A **resource bundle** is a list of resource requirements that fit in a single node
- A **placement group** can specify a **placement strategy** that determines how the **resource bundles** are placed
  - The **placement strategy** can be one of the following:
    - **PACK**: pack the **resource bundles** into as few nodes as possible
    - **SPREAD**: spread the **resource bundles** across as many nodes as possible
    - **STRICT_PACK**: pack all the **resource bundles** into a single node and fail if not possible
    - **STRICT_SPREAD**: place each **resource bundle** on a distinct node and fail if not possible
- **Placement Groups** are **atomic** 
  -  i.e. either all the **resource bundles** are placed or none are placed
  -  GCS uses a two-phase commit protocol to ensure atomicity
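
The all-or-nothing behavior can be illustrated with a prepare/commit sketch. It collapses the protocol into a single process and tracks CPUs only, so it is an assumption-laden toy rather than the GCS implementation; `reserve_placement_group` is a hypothetical name.

```python
def reserve_placement_group(bundles, free_cpus):
    """Try to reserve every bundle; commit all of them or none.

    bundles: list of (node_id, cpus) pairs, i.e. bundles already mapped to nodes
    free_cpus: dict node_id -> free CPUs, mutated only on success
    """
    # Phase 1 (prepare): tentatively reserve each bundle on a scratch copy
    tentative = dict(free_cpus)
    for node_id, cpus in bundles:
        if tentative.get(node_id, 0) < cpus:
            return False  # one bundle cannot be prepared: abort, nothing changes
        tentative[node_id] -= cpus
    # Phase 2 (commit): every bundle was prepared, so apply all reservations
    free_cpus.update(tentative)
    return True
```

If any single bundle does not fit, the function returns `False` and `free_cpus` is left untouched, which mirrors the atomicity guarantee described above.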



In [None]:
import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
# Import placement group related functions
from ray.util.placement_group import (
    placement_group,
    placement_group_table,
    remove_placement_group,
)

# Reserve a placement group of 1 bundle that reserves 0.1 CPU
pg = placement_group([{"CPU": 0.1}], strategy="PACK", name="my_pg")

# Wait until placement group is created.
ray.get(pg.ready(), timeout=10)

# look at placement group states using the table
print(placement_group_table(pg))


@ray.remote(
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg,
    ),
    # task requirement must not exceed the placement group capacity
    num_cpus=0.1,
)
def placement_group_schedule():
    return 2


out = ray.get(placement_group_schedule.remote())
print(out)

# Remove placement group.
remove_placement_group(pg)

#### Fetching task results

<img src="fetch_result_.png">

Note: If the owner is fetching the result from a different node than the one where the task was executed, the result is first copied to the local object store of the owner node and then returned.

### Object management and dependency resolution

Let's drill down on how a task's dependencies are resolved - using the following example of simple batch inference:

- we load a model
- we use the model to make predictions on an input

In [None]:
import ray
import numpy as np


def load_model(size_mb):
    weights = np.ones((1024, 1024, size_mb), dtype=np.uint8)
    assert weights.nbytes / 1024**2 == size_mb
    return weights


@ray.remote
def predict(model, input):
    return model * input

We start with this simple implementation

In [None]:
# load 1 GB model in memory
model = load_model(1_000) 

# submit 3 tasks to the cluster
results = ray.get([predict.remote(model, i) for i in range(3)])

There are 3 `predict` tasks that will be submitted.

- For each task, the owner (here the driver, since it submits all 3 tasks) will need to go over all the task arguments and:
    - check that all the arguments are available
    - store a reference to all the available arguments in the plasma/shared object store or in-process object store
- In the case of our 1 GB "model", the owner will make use of the shared object store given it exceeds the 100KB limit of the in-process object store
- For each of the 3 submissions, the owner will create a new copy of the model in the shared object store and produce an object reference to use as the argument for the task
- Each task is then executed by a worker process

The outcome is that we have made 3 copies of the model in the shared object store.

Instead, to save on memory, we should use the `ray.put` API to store the model in the shared object store once and pass the resulting reference as an argument to each task.

Here is the optimized implementation:


In [None]:
# put the model in the object store and get a reference to it
model_ref = ray.put(model)

# submit 3 tasks to the cluster using the same model reference
results = ray.get([predict.remote(model_ref, i) for i in range(3)])
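
To see the difference in the number of stored copies, here is a toy model of the two approaches. `ToyObjectStore`, `Ref`, and `submit` are hypothetical stand-ins that model large arguments only (they ignore the in-process store); they are not Ray APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Stand-in for an object reference."""
    key: int


class ToyObjectStore:
    """Counts stored copies; stands in for the shared object store."""

    def __init__(self):
        self.objects = {}

    def put(self, value):
        ref = Ref(len(self.objects))
        self.objects[ref] = value
        return ref


def submit(store, arg):
    """Passing a raw value stores a fresh copy; passing a Ref reuses it."""
    return arg if isinstance(arg, Ref) else store.put(arg)


model = bytes(10)  # tiny placeholder for the 1 GB model

naive_store = ToyObjectStore()
for _ in range(3):
    submit(naive_store, model)      # a new copy per task submission

shared_store = ToyObjectStore()
model_ref = shared_store.put(model)
for _ in range(3):
    submit(shared_store, model_ref)  # the same stored copy is reused
```

The naive store ends up with 3 copies of the model while the shared store holds a single one.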

## 10 feet view of ray

### Inspecting debug logs

Given the below code, we can inspect the debug logs to see what is happening under the hood

In [None]:
import ray
import numpy as np


def load_model(size_mb):
    weights = np.ones((1024, 1024, size_mb), dtype=np.uint8)
    assert weights.nbytes / 1024**2 == size_mb
    return weights


@ray.remote
def predict(model, input):
    return model * input


model = load_model(size_mb=1000)
obj_ref = predict.remote(model, 1)
result = ray.get(obj_ref)  # c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000

Below are the worker process debug logs parsed into pandas, color-categorized and annotated

<img src="debug_logs_with_legend.png">

### Fault Tolerance of Ray Tasks and Objects

- If a task raises an application-level exception, the task will fail and the exception will be propagated to the caller.
- If instead there is a system-level failure, i.e. the worker process executing the task crashes, then:
    - Ray will rerun the task until either the task succeeds or the maximum number of retries is exceeded. 
        - The default number of retries is 3 and can be overridden by specifying `max_retries` in the `@ray.remote` decorator.

In [None]:
# application-level failure flakiness but with infinite retries
import ray
import pickle

def write_x(val):
    with open("x.pkl", "wb") as f:
        pickle.dump({"x": val}, f)    

def read_x():
    with open("x.pkl", "rb") as f:
        data = pickle.load(f)
    return data["x"]

# start with x = 0 to force failure
write_x(0)

@ray.remote(max_retries=-1) # infinite retries
def flaky_app_task():
    """Reads x, increments it by 1, writes it and fails, next retry it should pass."""
    x = read_x()
    if x % 2 == 0:
        x += 1
        write_x(x)  # persist the incremented value
        raise ValueError("x is even - that's odd!")
    return 1

try:
    out = ray.get(flaky_app_task.remote())
except ray.exceptions.RayTaskError:
    print("application-level exceptions shortcircuit retries")

**Note** To enable retries on application-level exceptions, set `retry_exceptions=True` or pass a list of exception types to retry on.

Make sure your task is **idempotent** to avoid side-effects due to retries!
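
The retry rules can be summarized in a small pure-Python sketch. `WorkerCrashed` and `run_with_retries` are hypothetical names used to model the behavior described above. The sketch supports only a boolean `retry_exceptions` and is not how Ray implements retries.

```python
class WorkerCrashed(Exception):
    """Hypothetical stand-in for a system-level failure."""


def run_with_retries(task, max_retries=3, retry_exceptions=False):
    """Run task(); max_retries=-1 means retry forever."""
    retries_used = 0
    while True:
        try:
            return task()
        except Exception as exc:
            # System failures are always retryable, application errors only on opt-in
            retryable = isinstance(exc, WorkerCrashed) or retry_exceptions
            out_of_retries = max_retries != -1 and retries_used >= max_retries
            if not retryable or out_of_retries:
                raise
            retries_used += 1


attempts = {"count": 0}


def crashes_twice():
    """Simulates a worker that dies on its first two attempts."""
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise WorkerCrashed("worker died")
    return 1


result = run_with_retries(crashes_twice)
```

Here the simulated crash is retried until the third attempt succeeds, whereas a `ValueError` raised by the task would propagate immediately unless `retry_exceptions=True` is passed.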

In [None]:
# system-level failure flakiness but with infinite retries
import sys
import ray
import pickle

def write_x(val):
    with open("x.pkl", "wb") as f:
        pickle.dump({"x": val}, f)    

def read_x():
    with open("x.pkl", "rb") as f:
        data = pickle.load(f)
    return data["x"]

# start with x = 0 to force failure
write_x(0)

@ray.remote(max_retries=-1) # infinite retries
def flaky_sys_task():
    """Reads x, increments it by 1, writes it and fails, next retry it should pass."""
    x = read_x()
    if x % 2 == 0:
        x += 1
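        write_x(x)  # persist the incremented value so the retry passes
        sys.exit(1)  # sys.exit already raises SystemExit, no `raise` needed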
    return 1

# never raises an error given retries eventually succeed
out = ray.get(flaky_sys_task.remote())
print("returned", out, "after retrying worker failure")

The below diagram shows the fault tolerance of ray objects - taken from https://www.usenix.org/system/files/nsdi21-wang.pdf

<img src="object_fault_tolerance.png" width="700px">

- When an object value is lost from the object store, such as during node failures:
    - Ray will use lineage reconstruction to recover the object.
        - Ray will first automatically attempt to recover the value by looking for copies of the same object on other nodes.
        - If none are found, then Ray will automatically recover the value by re-executing the task that previously created the value. 
        - Arguments to the task are recursively reconstructed through the same mechanism.
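
The recovery steps above can be sketched as a short recursive function. This is a toy model under stated assumptions: plain dicts stand in for the object stores and the lineage, and `recover` is a hypothetical name, not a Ray API.

```python
def recover(obj_id, store, replicas, lineage):
    """Return the value of obj_id, reconstructing it if it was lost.

    store: object id -> value on the local node
    replicas: object id -> value still held on other nodes
    lineage: object id -> (function, list of argument object ids)
    """
    if obj_id in store:
        return store[obj_id]
    if obj_id in replicas:
        # 1) a copy survives on another node: fetch it
        store[obj_id] = replicas[obj_id]
    else:
        # 2) no copy left: re-execute the creating task,
        #    recovering its arguments the same way first
        func, arg_ids = lineage[obj_id]
        args = [recover(arg, store, replicas, lineage) for arg in arg_ids]
        store[obj_id] = func(*args)
    return store[obj_id]


lineage = {
    "a": (lambda: 2, []),
    "b": (lambda a: a * 10, ["a"]),
}
store, replicas = {}, {}  # both objects were lost and no replica survives
value = recover("b", store, replicas, lineage)
```

Recovering `b` first re-executes the task that produced `a`, then the task that produced `b`, so both values end up back in the store.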

