# Lifecycle of a task

We start by detailing the full lifecycle of a task, from when it is created and submitted until it completes and its results are returned to the user.

In [1]:
import ray

In [2]:
from textwrap import TextWrapper

wrapper = TextWrapper(width=80)

In [3]:
# ensure no ray cluster is running
# subprocess.run(["ray", "stop", "--force"], check=True)

## 10,000 feet view

We have a Python function conveniently named `expensive_computation` which executes an expensive computation. To keep things simple, all it does is sleep for one second.

It is called sequentially `n_runs` times.

In [4]:
n_runs = 30

def expensive_computation():
    import time
    time.sleep(1)
    return 1

results = [expensive_computation() for _ in range(n_runs)]
assert sum(results) == n_runs

Below is the execution visualized:

<img src="sequential_simple_.jpeg" height="300">

We want to:
- Run the same function in a distributed fashion - i.e. in parallel on a cluster of machines
- Get the results of the function as they become available

We do this by following these steps:
- Convert the `expensive_computation` function into a Ray task by decorating it with `ray.remote`
- Submit a task for execution by calling `future = expensive_computation.remote()`
- Use the returned `future` object reference to fetch the result of the function by calling `ray.get(future)` 

In [5]:
@ray.remote # decorator to convert function to a ray task
def expensive_computation():
    import time
    time.sleep(1)
    return 1

# submit n_runs tasks to multiple workers in the cluster and keep a reference to each result
futures = [expensive_computation.remote() for _ in range(n_runs)] 
# wait for all tasks to complete and get the resulting objects
results = ray.get(futures) 
# confirm that we got the right result
assert sum(results) == n_runs 

2023-11-07 18:36:15,765	INFO worker.py:1633 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265


Here is what is happening under the hood:

<img src="parallel_simple.jpeg" height="300">
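The cell above waits for every task before returning anything. To illustrate the second goal, consuming results as they become available, here is a minimal sketch that uses only Python's standard library (`concurrent.futures` with threads). It is an analogy, not Ray's API (Ray's counterpart for this pattern is `ray.wait`). The helper `run_as_completed` and the `seconds` parameter are hypothetical names introduced for this illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def expensive_computation(seconds):
    # stand-in for real work: sleep, then return how long we slept
    time.sleep(seconds)
    return seconds


def run_as_completed(delays):
    """Submit one call per delay and collect results in completion order."""
    finished = []
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        futures = [pool.submit(expensive_computation, d) for d in delays]
        # as_completed yields each future as soon as its result is ready,
        # regardless of submission order
        for future in as_completed(futures):
            finished.append(future.result())
    return finished


completion_order = run_as_completed([0.3, 0.1, 0.2])
print(completion_order)
```

The printed list reflects completion order rather than submission order, so shorter calls tend to appear first.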

## 1,000 feet view

Let's detail the parallel execution of the function a bit more.

More specifically:
- Ray tasks are executed on a Ray cluster as part of a Ray job
- Ray workers are the processes that execute the tasks
- Futures in Ray are called `ObjectRef`s, short for object references
- Results are stored as objects in the Ray object store
- `ray.get()` is used to fetch an object given its object reference

In [6]:
@ray.remote # decorator to convert function to a ray task
def expensive_computation():
    import time
    time.sleep(1)
    return 1

# this returns an object reference to the result
object_ref_future = expensive_computation.remote()

In [7]:
object_ref_future

ObjectRef(a631fe8d231813bfffffffffffffffffffffffff0100000001000000)

In [8]:
# wait for the task to complete and get the resulting object
object_value = ray.get(object_ref_future) 
object_value

1

Here is a more detailed view of the parallel execution:


<img src="parallel_1000_feet.jpeg" height="300">

Let's use the ray state client to verify the above.

In [9]:
from ray.util.state import list_tasks
from textwrap import TextWrapper

wrapper = TextWrapper(width=80)

We re-declare the computation as `my_task`, give it a unique name via `.options(name=...)` so we can easily track its state, and use a longer sleep time so we can watch the state evolve more clearly.

In [10]:
from uuid import uuid4

task_sleep_time = 20

@ray.remote
def my_task():
    import time

    time.sleep(task_sleep_time)
    return 1


id_ = str(uuid4())[:8]
name = f"expensive_computation_{id_}"
name

'expensive_computation_ac87bc4f'

In [11]:
ray_task = my_task.options(name=name)
ray_task

<ray.remote_function.RemoteFunction.options.<locals>.FuncWrapper at 0x11d579a30>

In [12]:
# we submit the task
ray_task.remote()

ObjectRef(79cc316456d39201ffffffffffffffffffffffff0100000001000000)

In [13]:
import time

start_time = time.time()

while time.time() - start_time < task_sleep_time:
    time.sleep(5)
    task = next(task for task in list_tasks() if task.name == name)
    print(f"task {task.name} is in state={task.state} running on worker {task.worker_id[:8]} as part of Job ID {task.job_id}")

task expensive_computation_ac87bc4f is in state=RUNNING running on worker 2d866cac as part of Job ID 01000000
task expensive_computation_ac87bc4f is in state=RUNNING running on worker 2d866cac as part of Job ID 01000000
task expensive_computation_ac87bc4f is in state=RUNNING running on worker 2d866cac as part of Job ID 01000000
task expensive_computation_ac87bc4f is in state=RUNNING running on worker 2d866cac as part of Job ID 01000000


## 100 feet view

Let's further detail the lifecycle of a ray task.

More specifically, here is what a cluster looks like:


<img src="ray_cluster_.jpeg" height="500">

Things to keep in mind:

- The head node is a special node that runs the global control service (GCS) and, typically, the driver
- The head node can also spawn worker processes to execute tasks
- The global control service keeps track of cluster state that is not supposed to change often
- Each worker process keeps track of the tasks it submits, and the objects those tasks return, in its ownership table
- Small objects (< 100KB) are stored in the in-process object store of a worker
- Large objects are stored in the plasma store, which is shared across worker processes on the same node
- The plasma store is in-memory by default and takes up 30% of the node's memory
- If the plasma store is full, objects are spilled to disk
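The size-based placement rule in the list above can be written down as a toy function. This is only a sketch of the rule as stated here, not Ray's internal logic. The names `INLINE_THRESHOLD_BYTES`, `choose_store`, and `plasma_capacity_bytes` are hypothetical, and the 100KB and 30% figures are simply the defaults quoted above.

```python
# Hypothetical constants mirroring the defaults quoted in the list above
INLINE_THRESHOLD_BYTES = 100 * 1024   # "small" objects are below ~100KB
PLASMA_MEMORY_PERCENT = 30            # share of node memory given to plasma


def choose_store(size_bytes):
    """Return which store an object of the given size would land in."""
    if size_bytes < INLINE_THRESHOLD_BYTES:
        return "in-process store"
    return "plasma (shared) store"


def plasma_capacity_bytes(node_memory_bytes):
    """Default plasma capacity for a node, using integer arithmetic."""
    return node_memory_bytes * PLASMA_MEMORY_PERCENT // 100


small_result = choose_store(8)                  # e.g. a tiny return value
large_result = choose_store(1_000 * 1024**2)    # e.g. a ~1 GB array
capacity = plasma_capacity_bytes(100 * 1024**3) # node with 100 GiB of memory
print(small_result, large_result, capacity)
```

Under these assumptions, a node with 100 GiB of memory would reserve 30 GiB for the plasma store, and anything beyond that would be a candidate for spilling to disk.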

With the cluster architecture in mind, let's look at the lifecycle of a task in more detail.

<img src="parallel_100feet.jpeg" height="400">

### Object management and dependency resolution

Let's drill down on how a task's dependencies are resolved - using the following example of simple batch inference:

- we load a model
- we use the model to make predictions

In [14]:
import ray
import numpy as np

def load_model(size_mb):
    weights = np.ones((1024, 1024, size_mb), dtype=np.uint8)
    assert weights.nbytes / 1024**2 == size_mb
    return weights


@ray.remote
def predict(model, input):
    return model * input


We start with this simple implementation

In [15]:
# load 1 GB model in memory
model = load_model(1_000) 

# submit 3 tasks to the cluster and wait for their results
results = ray.get([predict.remote(model, i) for i in range(3)])

There are 3 `predict` tasks that will be submitted.

- For each task, its owner (here, the driver that submits it) needs to go over all the task arguments and:
    - check that all the arguments are available
    - store each argument in the in-process or the shared (plasma) object store and keep a reference to it
- In the case of our 1 GB model, the owner uses the shared object store, given that the object is larger than 100KB
- Because the model is passed by value, each task submission stores a fresh copy of the model and produces a new reference to use as the task argument
- Each task is then scheduled on a worker process, which executes it

The outcome is that we have made 3 copies of the model in the shared object store.

Instead, to save memory, we should use the `ray.put` API to store the model in the shared object store once and pass the resulting reference as an argument to each task.

Here is the optimized implementation:


In [16]:
# put the model in the object store and get a reference to it
model_ref = ray.put(model)

# submit 3 tasks to the cluster using the same model reference and wait for their results
results = ray.get([predict.remote(model_ref, i) for i in range(3)])
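To make the memory difference between the two implementations concrete, here is a toy accounting sketch. It does not use Ray: `ToyObjectStore`, `ToyRef`, and `submit` are hypothetical stand-ins that only mimic the behavior described above (arguments passed by value are stored on every submission, while references are reused), and a 1,000-byte `bytes` object stands in for the 1 GB model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyRef:
    """Hypothetical stand-in for an object reference."""
    object_id: int


class ToyObjectStore:
    """Hypothetical stand-in for the shared object store."""

    def __init__(self):
        self._objects = {}

    def put(self, value):
        # every put creates a new entry, even for identical values
        ref = ToyRef(len(self._objects))
        self._objects[ref] = value
        return ref

    def stored_bytes(self):
        return sum(len(value) for value in self._objects.values())


def submit(store, arg):
    """Mimic task submission: values are stored implicitly, refs are reused."""
    return arg if isinstance(arg, ToyRef) else store.put(arg)


model = bytes(1_000)  # stand-in for the 1 GB model

# naive version: the model is passed by value to each of the 3 tasks
by_value_store = ToyObjectStore()
for _ in range(3):
    submit(by_value_store, model)

# optimized version: put once, pass the same reference 3 times
by_ref_store = ToyObjectStore()
model_ref = by_ref_store.put(model)
for _ in range(3):
    submit(by_ref_store, model_ref)

print(by_value_store.stored_bytes(), by_ref_store.stored_bytes())
```

In this toy model the by-value version stores three times as many bytes as the by-reference version, which matches the three-copies outcome described above.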

# Lifecycle of an Actor

An actor is a stateful object that can be used to encapsulate state and methods that operate on that state.


## 10,000 feet view

Let's take an example of a simple counter actor. We create an actor handle by calling `MyCounter.remote()`.

In [17]:
@ray.remote
class MyCounter:
    def __init__(self) -> None:
        self.counter = 0

    def increment(self):
        time.sleep(3)
        self.counter += 1
        
    def get_counter(self):
        return self.counter

my_counter_handle = MyCounter.remote()

We can then call methods on the actor handle to increment the counter and get the current value of the counter. The methods will be executed sequentially against the actor process.

In [18]:
# this will take 3 seconds * 3 = 9 seconds at least
ray.get([my_counter_handle.increment.remote() for _ in range(3)])

[2m[36m(raylet)[0m Spilled 6000 MiB, 6 objects, write throughput 3026 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.


[None, None, None]

In [19]:
ray.get(my_counter_handle.get_counter.remote())

3

Here is a diagram showing the lifecycle of our actor (note that this kind of actor is referred to as a "synchronous" actor):


<img src="actor_simple_.jpeg" height="300">

- A special "create actor" task is executed on the cluster to create the actor process
- The actor process can be thought of as a special worker process
- The actor tasks are executed sequentially on the actor process using a FIFO queue
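The FIFO behavior described above can be imitated with a single thread draining a queue. The sketch below uses only the standard library and is not how Ray implements actors. `ToyActor`, its `call` method, and the plain `Counter` class are hypothetical names chosen for this illustration.

```python
import queue
import threading
from concurrent.futures import Future


class ToyActor:
    """Runs method calls on a wrapped instance one at a time, in FIFO order."""

    def __init__(self, instance):
        self._instance = instance
        self._mailbox = queue.Queue()  # FIFO queue of pending method calls
        # a single daemon thread plays the role of the actor process
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            method_name, future = self._mailbox.get()
            try:
                future.set_result(getattr(self._instance, method_name)())
            except Exception as exc:
                future.set_exception(exc)

    def call(self, method_name):
        # enqueue the call and immediately hand back a future for its result
        future = Future()
        self._mailbox.put((method_name, future))
        return future


class Counter:
    def __init__(self):
        self.counter = 0

    def increment(self):
        self.counter += 1

    def get_counter(self):
        return self.counter


actor = ToyActor(Counter())
pending = [actor.call("increment") for _ in range(3)]
# get_counter is queued after the three increments, so it runs last
count = actor.call("get_counter").result()
print(count)
```

Because only one thread touches the wrapped instance, the state needs no locking, which is the main appeal of the sequential execution model.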

## 1,000 feet view of ray actors

In this section we look at the lifecycle of an actor in more detail.

- Actors are always owned by the GCS (global control service), unlike tasks which are owned by the worker process that submitted them
- The GCS maintains an actor table that keeps track of all the actors in the cluster
- Actors hold the resources they need to execute their tasks until they are killed
- Actors can be launched in a detached mode, in which case they do not fate share with a ray driver/job - instead they need to be killed manually

See the diagram below for more details:


<img src="actor_centralized.jpeg" height="300">


Actors can also be asynchronous. This is especially useful for actors whose methods are I/O-bound and whose state can be easily shared, and locked if needed.

In [20]:
from asyncio import sleep


@ray.remote
class MyAsyncService:
    def __init__(self) -> None:
        self.fixed_state = 1

    async def run(self):
        await sleep(15)
        return self.fixed_state


my_async_actor_handle = MyAsyncService.remote()

Since `run` is mostly I/O-bound (it just sleeps), the asynchronous actor can interleave the three calls below instead of running them one after another, so they finish in roughly 15 seconds rather than 45.

In [21]:
ray.get([my_async_actor_handle.run.remote() for _ in range(3)])

[1, 1, 1]

Here is a diagram visualizing task execution against an asynchronous actor:

<img src="actor_async.jpeg" height="300">
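The interleaving of calls on an asynchronous actor can be reproduced in miniature with plain `asyncio`. The sketch below does not use Ray: `ToyAsyncService` and `call_three_times` are hypothetical names, and the sleep is shortened to 0.2 seconds so the example runs quickly.

```python
import asyncio
import time


class ToyAsyncService:
    """Plain-Python stand-in for the async actor class above."""

    def __init__(self):
        self.fixed_state = 1

    async def run(self, delay):
        # while this call is sleeping, the event loop can serve other calls
        await asyncio.sleep(delay)
        return self.fixed_state


async def call_three_times(delay):
    service = ToyAsyncService()
    # gather schedules the three coroutines concurrently on one event loop
    return await asyncio.gather(*(service.run(delay) for _ in range(3)))


start = time.perf_counter()
results = asyncio.run(call_three_times(0.2))
elapsed = time.perf_counter() - start
print(results, round(elapsed, 2))
```

The three calls overlap on a single event loop, so the total elapsed time stays close to one sleep interval instead of three.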