# Lifecycle of a task

We start by detailing the full lifecycle of a task, from when it is created and submitted until it completes and its results are returned to the user.

## 10,000 feet view

We have a Python function conveniently named `expensive_computation` that executes an expensive computation. To keep it simple, all it does is perform a naive matrix multiplication and return the number of elements in the resulting matrix.


It gets called sequentially `n_runs` times:

In [1]:
n_runs = 10
n = 300


def perform_naive_matrix_multiplication(n):
    # Create two n x n matrices
    matrix1 = [[1 for _ in range(n)] for _ in range(n)]
    matrix2 = [[1 for _ in range(n)] for _ in range(n)]

    # Perform matrix multiplication
    result = [[0 for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                result[i][j] += matrix1[i][k] * matrix2[k][j]

    return result


def expensive_computation(n):
    result = perform_naive_matrix_multiplication(n)

    # Return the number of elements in the result matrix
    return len(result) * len(result[0])


results = [expensive_computation(n) for _ in range(n_runs)]
assert sum(results) == n_runs * n * n

Below is the execution visualized

<img src="sequential_simple_.jpeg" height=300>

We want to:
- Run the same function in a distributed fashion - i.e. in parallel on a cluster of machines
- Get the results of the function as they become available

We do this by following these steps:
- Convert the `expensive_computation` function to a ray task by decorating it with `ray.remote`
- Submit a task for execution by calling `future = expensive_computation.remote(n)`
- Use the returned `future` (an object reference) to fetch the result of the function by calling `ray.get(future)`

In [2]:
import ray


@ray.remote  # decorator to convert python function to ray task
def expensive_computation(n):
    result = perform_naive_matrix_multiplication(n)
    return len(result) * len(result[0])


# submit n_runs ray tasks to a ray cluster
# and keep a reference to the task futures
futures = [expensive_computation.remote(n) for _ in range(n_runs)]

# wait for all tasks to complete and get the resulting objects
results = ray.get(futures)

# confirm that we got the right result
assert sum(results) == n_runs * n * n

2023-11-13 20:37:05,883 INFO worker.py:1633 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265


Here is what is happening under the hood:

<img src="parallel_simple.jpeg" height="300">

## 1000 feet view

Let's detail the parallel execution of the function a bit more.

More specifically:
- ray tasks are executed on a ray cluster as part of a ray job
- ray workers are the processes that execute the tasks
- futures in ray are called `ObjectRef`s, short for object references
- results are stored as objects in the ray object store
- ray.get() is used to fetch an object given its object reference

In [3]:
import ray

n = 300

@ray.remote
def expensive_computation(n):
    result = perform_naive_matrix_multiplication(n)
    return len(result) * len(result[0])

# this returns an object reference to the result
object_ref_future = expensive_computation.remote(n)

Here we inspect the object reference and see that it is a ray.ObjectRef with a given id

In [4]:
object_ref_future

ObjectRef(359ec6ce30d3ca2dffffffffffffffffffffffff0100000001000000)

We then call `ray.get`, which blocks until the object reference is resolved and returns the resulting value

In [5]:
object_value = ray.get(object_ref_future) 
object_value

90000

Here is a more detailed view of the parallel execution


<img src="parallel_1000_feet.png" height="300">

Let's use the ray state client to verify the above.

We declare a new task, `my_task`, that stands in for `expensive_computation`. We give it a unique name so we can easily track its state, and a long sleep time so we can watch the state evolve.

In [6]:
from uuid import uuid4

task_sleep_time = 20


@ray.remote
def my_task():
    import time

    time.sleep(task_sleep_time)
    return 1


id_ = str(uuid4())[:8]
name = f"expensive_computation_{id_}"
ray_task = my_task.options(name=name)

We submit the task

In [7]:
ray_task.remote()

ObjectRef(1e8ff6d236132784ffffffffffffffffffffffff0100000001000000)

We now request the cluster state to see our task running

In [8]:
from ray.util.state import list_tasks
import time

start_time = time.time()

while time.time() - start_time < task_sleep_time:
    time.sleep(5)
    task = next(task for task in list_tasks() if task.name == name)
    print(
        f"task {task.name} is in state={task.state} running on worker {task.worker_id[:8]} as part of Job ID {task.job_id}"
    )

task expensive_computation_c18f072b is in state=RUNNING running on worker aecd9247 as part of Job ID 01000000
task expensive_computation_c18f072b is in state=RUNNING running on worker aecd9247 as part of Job ID 01000000
task expensive_computation_c18f072b is in state=RUNNING running on worker aecd9247 as part of Job ID 01000000
task expensive_computation_c18f072b is in state=RUNNING running on worker aecd9247 as part of Job ID 01000000


## 100 feet view

Let's further detail the lifecycle of a ray task.

More specifically here is what a cluster looks like:


<img src="ray_cluster.png" height="500">

Things to keep in mind:

- The head node is a special node that runs the driver and the global control service. 
- The head node can also spawn worker processes to execute tasks
- The Global control service keeps track of cluster state that is not supposed to change often
- Each worker process keeps track of all the tasks it executes and submits in its ownership table
- Small objects (< 100KB) are stored in the in-process object store of a worker
- Large objects are stored in the plasma store which is shared across worker processes on the same node
- plasma store by default is in-memory and takes up 30% of the memory of the node
- if plasma store is full, objects are spilled to disk
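The size-based rules in this list can be sketched as a small helper. This is a toy model rather than Ray's code: `choose_store`, `plasma_capacity_bytes`, `place` and the two constants are hypothetical names that only encode the 100KB threshold and the 30% default mentioned above. Which object actually gets spilled when plasma is full is not modeled.

```python
# Toy model of the rules listed above. All names and constants are
# illustrative assumptions, not Ray internals.
INLINE_THRESHOLD_BYTES = 100 * 1024  # "small" objects, per the list above
PLASMA_MEMORY_PERCENT = 30           # default share of node memory


def choose_store(size_bytes: int) -> str:
    """Return which store an object of the given size would land in."""
    if size_bytes < INLINE_THRESHOLD_BYTES:
        return "in-process"
    return "plasma"


def plasma_capacity_bytes(node_memory_bytes: int) -> int:
    """Default plasma capacity for a node with the given amount of memory."""
    return node_memory_bytes * PLASMA_MEMORY_PERCENT // 100


def place(size_bytes: int, plasma_used: int, plasma_capacity: int) -> str:
    """Decide where a new object goes and whether spilling is needed first."""
    store = choose_store(size_bytes)
    if store == "plasma" and plasma_used + size_bytes > plasma_capacity:
        return "plasma (spill to disk needed)"
    return store
```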

With the cluster architecture in mind, let's look at the lifecycle of a task in more detail.

#### Submitting a task
<img src="submit_task.png">

**On data locality**:

- The owner will select the raylet where most of the objects the task depends on are located
  - this can be a raylet running on a different worker node
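The locality rule can be illustrated with a few lines of plain Python. This is only a sketch under stated assumptions: `pick_raylet_by_locality` and its arguments are hypothetical, and "most of the objects" is interpreted here as "most bytes of arguments already on the node".

```python
from collections import defaultdict


def pick_raylet_by_locality(arg_locations, arg_sizes, local_node):
    """Pick the node that already holds the most bytes of the task's arguments.

    arg_locations: dict mapping object id -> set of node ids holding a copy
    arg_sizes:     dict mapping object id -> size in bytes
    local_node:    node id of the owner, used when no argument has a location
    """
    bytes_per_node = defaultdict(int)
    for object_id, nodes in arg_locations.items():
        for node in nodes:
            bytes_per_node[node] += arg_sizes[object_id]

    if not bytes_per_node:
        return local_node
    # Iterate over sorted node ids so that ties are broken deterministically.
    return max(sorted(bytes_per_node), key=lambda node: bytes_per_node[node])
```

With arguments of 500, 200 and 400 bytes located on `node1`, `node2` and `{node2, node3}` respectively, `node2` holds 600 bytes and is selected even if the owner runs elsewhere.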

#### Scheduling a Task

<img src="schedule_task.png">

**Note**: 
- Step 0 is "raylet scheduler finding a node to schedule the task on"
- Step 0 is not shown in the diagram but instead will be discussed in the next section.

## Scheduling deep-dive

How does a raylet's scheduler choose a worker node to lease work from?

### Classifying nodes as feasible/infeasible and available/unavailable

<img src="availability_feasibility.png" height="500">
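As a rough sketch of this classification, assume that a node is *feasible* when its total resources could ever satisfy the task's request, and *available* when its currently free resources satisfy it right now. The helper names `fits` and `classify_node` are hypothetical and not part of Ray.

```python
def fits(required, resources):
    """True if `resources` covers every amount listed in `required`."""
    return all(resources.get(name, 0) >= amount for name, amount in required.items())


def classify_node(required, total, free):
    """Classify one node for one task's resource request."""
    if not fits(required, total):
        # The node could never run the task, no matter how idle it is.
        return "infeasible"
    if not fits(required, free):
        # The node could run the task, but is too busy at the moment.
        return "feasible, unavailable"
    return "feasible, available"
```

For a request of `{"CPU": 2, "GPU": 1}`, a CPU-only node is infeasible, while a GPU node with only one free CPU is feasible but unavailable.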

### Scheduling Policies

#### Default Hybrid policy


This is the default policy used by ray. It is a hybrid policy that combines the following two modes:
- Bin packing mode
- Load balancing mode
  

The diagram below shows the two modes in action when scheduling two tasks Task1 and Task2

<img src="default_hybrid_policy_new.png">
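One way to picture the two modes is the toy function below. It assumes a utilization threshold (0.5 here, a hypothetical parameter) below which nodes are filled in a fixed order with the local node first, and above which the least utilized node is preferred. This is a simplification for illustration, not Ray's scheduler code.

```python
def hybrid_pick(nodes, utilization, threshold=0.5):
    """Pick a node for the next task and report which mode made the choice.

    nodes:       ordered list of node ids, local node first
    utilization: dict mapping node id -> fraction of resources in use (0..1)
    threshold:   assumed cut-off between bin packing and load balancing
    """
    # Bin packing mode: keep filling the first node that is still
    # below the threshold.
    for node in nodes:
        if utilization[node] < threshold:
            return node, "bin packing"

    # Load balancing mode: every node is at or above the threshold,
    # so prefer the least utilized one.
    least_utilized = min(nodes, key=lambda node: utilization[node])
    return least_utilized, "load balancing"
```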

#### Node Affinity Policy 

Assigns tasks to a given node in either a strict or soft manner.

<img src="node_affinity_policy.png" height="500">
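The difference between strict and soft affinity can be sketched as follows. The function `pick_with_node_affinity` is a hypothetical stand-in for the scheduler's decision, and the fallback rule (first schedulable node in sorted order) is an arbitrary choice made for the example.

```python
def pick_with_node_affinity(target_node, schedulable_nodes, soft):
    """Return the node a task lands on under a node affinity policy.

    target_node:       node id the task asks for
    schedulable_nodes: set of node ids that can currently run the task
    soft:              if True, fall back to another node when the
                       target cannot take the task
    """
    if target_node in schedulable_nodes:
        return target_node
    if soft and schedulable_nodes:
        # Soft affinity: any other schedulable node is acceptable.
        return sorted(schedulable_nodes)[0]
    # Strict affinity (or nothing schedulable): the task cannot be placed.
    raise RuntimeError(f"cannot schedule task on node {target_node!r}")
```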

#### SPREAD Policy 

As the name suggests, the SPREAD policy spreads the tasks across all the available nodes.

<img src="spread_policy_.png" height="500">
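A minimal way to picture spreading is a round-robin assignment over the available nodes. The sketch below assumes exactly that; `spread_assign` is a hypothetical helper, and a real scheduler would also account for resources and node availability.

```python
from itertools import cycle


def spread_assign(tasks, nodes):
    """Assign tasks to nodes in round-robin order."""
    node_cycle = cycle(nodes)
    return {task: next(node_cycle) for task in tasks}


assignment = spread_assign(["t1", "t2", "t3", "t4"], ["n1", "n2", "n3"])
```

Here the fourth task wraps around to the first node, since there are only three nodes to spread over.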

### Placement Group Policy

In cases when we want to treat a set of nodes as a single unit, we can use placement groups.

<img src="placement_group_policy.png" height="500">

**Things to keep in mind**:

- A **placement group** is formed from a set of **resource bundles**
  - A **resource bundle** is a list of resource requirements that fit in a single node
- A **placement group** can specify a **placement strategy** that determines how the **resource bundles** are placed
  - The **placement strategy** can be one of the following:
    - **PACK**: pack the **resource bundles** into as few nodes as possible
    - **SPREAD**: spread the **resource bundles** across as many nodes as possible
    - **STRICT_PACK**: pack all the **resource bundles** into a single node and fail if that is not possible
    - **STRICT_SPREAD**: place each **resource bundle** on a separate node and fail if that is not possible
- **Placement Groups** are **atomic** 
  -  i.e. either all the **resource bundles** are placed or none are placed
  -  GCS uses a two-phase commit protocol to ensure atomicity
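The strategies and the all-or-nothing behaviour can be sketched with a greedy toy placer. Everything here is an assumption made for illustration: `place_bundles`, `_greedy` and `fits` are hypothetical names, STRICT_PACK is taken to mean "all bundles on one node", STRICT_SPREAD "one bundle per node", and the scratch copy only loosely plays the role of the prepare phase of a two-phase commit. A greedy pass is also not guaranteed to find a placement whenever one exists.

```python
import copy


def fits(bundle, free):
    """True if every resource requested by the bundle is free on the node."""
    return all(free.get(name, 0) >= amount for name, amount in bundle.items())


def _greedy(bundles, nodes, order_candidates):
    """Place bundles one by one; return (placement, new free resources).

    Reservations are made on a scratch copy, so a failure part-way
    through leaves `nodes` untouched (all-or-nothing).
    """
    scratch = copy.deepcopy(nodes)
    placement = {}
    for index, bundle in enumerate(bundles):
        used = set(placement.values())
        for node in order_candidates(list(scratch), used):
            if fits(bundle, scratch[node]):
                for name, amount in bundle.items():
                    scratch[node][name] = scratch[node].get(name, 0) - amount
                placement[index] = node
                break
        else:
            return None, nodes  # one bundle failed, so nothing is reserved
    return placement, scratch


def place_bundles(bundles, nodes, strategy):
    """Return ({bundle index: node id} or None, free resources afterwards)."""
    if strategy == "STRICT_PACK":
        # Try each node on its own: all bundles must fit on a single node.
        for node in nodes:
            single = {node: nodes[node]}
            placement, scratch = _greedy(bundles, single, lambda all_nodes, used: all_nodes)
            if placement is not None:
                return placement, {**nodes, **scratch}
        return None, nodes

    orderings = {
        # Prefer nodes that already hold a bundle, then the others.
        "PACK": lambda all_nodes, used: sorted(all_nodes, key=lambda n: n not in used),
        # Prefer nodes without a bundle, but reuse nodes if needed.
        "SPREAD": lambda all_nodes, used: sorted(all_nodes, key=lambda n: n in used),
        # Only ever consider nodes without a bundle.
        "STRICT_SPREAD": lambda all_nodes, used: [n for n in all_nodes if n not in used],
    }
    if strategy not in orderings:
        raise ValueError(f"unknown strategy: {strategy}")
    return _greedy(bundles, nodes, orderings[strategy])
```

For example, two one-CPU bundles on two two-CPU nodes end up on the same node with `PACK` and on different nodes with `SPREAD`, while three bundles with `STRICT_SPREAD` fail as a whole and reserve nothing.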



#### Fetching task results

<img src="fetch_result.png">

Note: If the owner is fetching the result from a different node than the one where the task was executed, the result is first copied to the local object store of the owner node and then returned to the owner.

### Object management and dependency resolution

Let's drill down on how a task's dependencies are resolved - using the following example of simple batch inference:

- we load a model
- we use the model to make predictions on an input

In [9]:
import ray
import numpy as np


def load_model(size_mb):
    weights = np.ones((1024, 1024, size_mb), dtype=np.uint8)
    assert weights.nbytes / 1024**2 == size_mb
    return weights


@ray.remote
def predict(model, input):
    return model * input

We start with this simple implementation

In [10]:
# load 1 GB model in memory
model = load_model(1_000) 

# submit 3 tasks to the cluster and wait for their results
results = ray.get([predict.remote(model, i) for i in range(3)])

There are 3 `predict` tasks that will be submitted.

- For each task, the owner (here, the driver that submitted it) needs to go over all the task arguments and:
    - check that all the arguments are available
    - store each argument that was passed by value in either the in-process or the shared (plasma) object store, and keep a reference to it
- In the case of our 1 GB model, the shared object store is used, given our object is > 100KB
- Because the model is passed by value, each task submission creates a new copy of the model and a new object reference to use as the argument for the task
- Each task is then scheduled and executed using its own reference

The outcome is that we have made 3 copies of the model in the shared object store.

Instead, to save memory, we should use the `ray.put` API to store the model in the shared object store once and pass the resulting reference as the argument to each task.

Here is the optimized implementation:


In [11]:
# put the model in the object store and get a reference to it
model_ref = ray.put(model)

# submit 3 tasks to the cluster using the same model reference
results = ray.get([predict.remote(model_ref, i) for i in range(3)])
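The difference between the two implementations can be mimicked with a toy object store. This sketch does not use Ray: `ToyRef`, `ToyObjectStore` and `submit_task` are hypothetical names, and the only behaviour modeled is that a raw value is stored again on every submission while an existing reference is reused.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyRef:
    """Stand-in for an object reference."""
    object_id: int


class ToyObjectStore:
    """Stand-in for the shared object store."""

    def __init__(self):
        self.objects = {}

    def put(self, value):
        ref = ToyRef(len(self.objects))
        self.objects[ref] = value
        return ref


def submit_task(store, argument):
    """Mimic argument handling at submission time."""
    if isinstance(argument, ToyRef):
        return argument          # already in the store: reuse the reference
    return store.put(argument)   # raw value: implicitly stored again


model = bytes(1024)  # small placeholder for the real model

# Passing the value directly: one stored copy per task.
naive_store = ToyObjectStore()
for _ in range(3):
    submit_task(naive_store, model)

# Putting the model once and passing the reference: a single stored copy.
shared_store = ToyObjectStore()
model_ref = shared_store.put(model)
for _ in range(3):
    submit_task(shared_store, model_ref)

copies = (len(naive_store.objects), len(shared_store.objects))
```

In this toy model the first store ends up holding three entries and the second only one, which mirrors the memory saving described above.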

## 10 feet view of ray

### Inspecting debug logs

Given the below code, we can inspect the debug logs to see what is happening under the hood

In [1]:
import ray
import numpy as np


def load_model(size_mb):
    weights = np.ones((1024, 1024, size_mb), dtype=np.uint8)
    assert weights.nbytes / 1024**2 == size_mb
    return weights


@ray.remote
def predict(model, input):
    return model * input


model = load_model(size_mb=1000)
obj_ref = predict.remote(model, 1)
result = ray.get(obj_ref)  # c8ef45ccd0112571ffffffffffffffffffffffff0100000001000000

2023-11-20 12:19:07,653 INFO worker.py:1633 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265


Below are the debug logs, color-categorized and annotated
<img src="debug_logs_annotated_cropped.png" height=500>

# Lifecycle of an Actor

An actor is a stateful object that can be used to encapsulate state and methods that operate on that state.


### Why use an actor?

- We can't naively share a global variable across tasks
  - Global variables are not shared across worker processes - i.e. across tasks

In [37]:
import ray

global_var = 3

@ray.remote
def increment_global_var():
    global global_var
    global_var += 1
    return global_var

@ray.remote
def decrement_global_var():
    global global_var
    global_var -= 1
    return global_var

step1 = ray.get(increment_global_var.remote())
step2 = ray.get(decrement_global_var.remote())

# we expect 4, 3 but we get 4, 2
# given the two tasks have separate copies of the global variable
print(step1, step2)

4 2


- Storing state in a database is slow
  - Actors are in-memory
  - Actors are distributed across nodes

In [8]:
import time
import ray
import json
from pathlib import Path

def read_from_db(key):
    # only for demo purposes: mimic reading from a database
    time.sleep(1)
    return json.loads(Path("table.json").read_text())[key]


def write_to_db(key, val):
    data = {key: val}
    # only for demo purposes: mimic writing to a database
    time.sleep(1)
    Path("table.json").write_text(json.dumps(data))


@ray.remote
def increment_global_var():
    global_var = read_from_db("global_var")
    global_var += 1
    write_to_db("global_var", global_var)
    return global_var


@ray.remote
def decrement_global_var():
    global_var = read_from_db("global_var")
    global_var -= 1
    write_to_db("global_var", global_var)
    return global_var


write_to_db("global_var", 3)
step1 = ray.get(increment_global_var.remote())
step2 = ray.get(decrement_global_var.remote())
print(step1, step2)

4 3


## 10,000 feet view

Let's take an example of a simple counter actor. We create an actor handle by calling `MyCounter.remote()`.

In [9]:
import time
import ray


@ray.remote
class MyCounter:
    def __init__(self) -> None:
        self.counter = 0

    def increment(self):
        time.sleep(3)
        self.counter += 1

    def get_counter(self):
        return self.counter


my_counter_handle = MyCounter.remote()

We can then call methods on the actor handle to increment the counter and get the current value of the counter. The methods will be executed sequentially against the actor process.

In [14]:
# this will take 3 seconds * 2 = 6 seconds at least
ray.get([my_counter_handle.increment.remote() for _ in range(2)])

[None, None]

In [15]:
ray.get(my_counter_handle.get_counter.remote())

2

Here is a diagram showing the lifecycle of our actor (note that our actor is referred to as a "synchronous" actor)


<img src="actor_simple_.jpeg" height="300">

- A special "create actor" task is executed on the cluster to create the actor process
- The actor process can be thought of as a special worker process
- The actor tasks are executed sequentially on the actor process using a FIFO queue
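The sequential, first-in-first-out execution can be imitated with a single thread that drains a queue. This is only an analogy built from the standard library: `ToyActor` and `Counter` are hypothetical classes, and a real actor is a separate process rather than a thread.

```python
import queue
import threading
from concurrent.futures import Future


class ToyActor:
    """Runs method calls one at a time, in submission order, on one thread."""

    def __init__(self, instance):
        self._instance = instance
        self._mailbox = queue.Queue()  # FIFO queue of pending method calls
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            method_name, args, future = self._mailbox.get()
            try:
                method = getattr(self._instance, method_name)
                future.set_result(method(*args))
            except Exception as error:
                future.set_exception(error)

    def call(self, method_name, *args):
        """Enqueue a method call and return a future for its result."""
        future = Future()
        self._mailbox.put((method_name, args, future))
        return future


class Counter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value


actor = ToyActor(Counter())
futures = [actor.call("increment") for _ in range(5)]
results = [future.result(timeout=5) for future in futures]
```

Because only one thread touches the counter and calls are taken from the queue in order, the results come back as consecutive values without any locking.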

## 1,000 feet view of ray actors

In this section we look at the lifecycle of an actor in more detail.

- Actors are always owned by the GCS (global control service), unlike tasks which are owned by the worker process that submitted them
- The GCS maintains an actor table that keeps track of all the actors in the cluster
- Actors hold the resources they need to execute their tasks until they are killed
- Actors can be launched in a detached mode, in which case they do not fate share with a ray driver/job - instead they need to be killed manually

See the below diagram for more details


<img src="actor_centralized.jpeg" height="300">


Our actors can be asynchronous - this is especially useful for actors whose methods are IO bound and whose state can be easily shared and locked if needed

In [16]:
import ray
from asyncio import sleep


@ray.remote
class MyAsyncService:
    def __init__(self) -> None:
        self.fixed_state = 1

    async def run(self):
        await sleep(15)
        return self.fixed_state


my_async_actor_handle = MyAsyncService.remote()

Given the service run is mostly IO bound (sleeping), we can run it asynchronously using an asynchronous actor implementation

In [17]:
%%time

ray.get([my_async_actor_handle.run.remote() for _ in range(2)])

CPU times: user 13 ms, sys: 12.9 ms, total: 25.8 ms
Wall time: 15.3 s


[1, 1]

Here is a diagram visualizing task execution against an asynchronous actor.

<img src="actor_async.jpeg" height="300">
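The reason two calls finish in roughly the time of one can be reproduced with plain `asyncio`, without Ray. In this sketch `ToyAsyncService` is a hypothetical class and the delay is shortened to 0.3 seconds. While one coroutine is waiting, the event loop is free to run the other.

```python
import asyncio
import time


class ToyAsyncService:
    def __init__(self):
        self.fixed_state = 1

    async def run(self, delay):
        # Yields control to the event loop while "waiting on IO".
        await asyncio.sleep(delay)
        return self.fixed_state


async def main():
    service = ToyAsyncService()
    start = time.perf_counter()
    # Both calls are in flight at the same time on a single event loop.
    results = await asyncio.gather(service.run(0.3), service.run(0.3))
    elapsed = time.perf_counter() - start
    return results, elapsed


results, elapsed = asyncio.run(main())
```

The elapsed time is close to a single delay rather than the sum of both, which matches the wall time observed for the asynchronous actor above.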

## Fault tolerance of Ray Actors