# Ray Actors Revisited

The [Ray Crash Course](../ray-crash-course/00-Ray-Crash-Course-Overview.ipynb) introduced the core concepts of Ray's API and how they parallelize work. Specifically, we learned how to define Ray _tasks_ and _actors_, run them, and retrieve the results. 

This lesson explores Ray actors in greater depth, including the following:

* Detached actors
* Specifying limits on the number of invocations and retries on failure
* Profiling actors
* More about actor lifecycles and behaviors

In [4]:
!../tools/start-ray.sh


Ray already running or successfully started


In [5]:
import ray, time, sys 
import numpy as np 
sys.path.append("..")
from util.printing import pd, pnd  # convenience methods for printing results.

In [6]:
ray.init(address='auto', ignore_reinit_error=True)



{'node_ip_address': '192.168.1.149',
 'raylet_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:15758',
 'object_store_address': '/tmp/ray/session_2020-05-23_17-34-40_069358_39510/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-05-23_17-34-40_069358_39510/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-05-23_17-34-40_069358_39510'}

The Ray Dashboard URL.

The following URL will only work when running this lesson on your laptop. When using the Anyscale platform, use the URL provided by your instructor to access the Ray Dashboard.

In [11]:
print(f'Ray Dashboard: http://{ray.get_webui_url()}')

Ray Dashboard: http://localhost:8265


## Detached Actors

[Detached actors](https://docs.ray.io/en/latest/advanced.html#detached-actors) are designed to be long-lived actors that can be referenced by name and must be explicitly cleaned up. Unlike regular actors, they are not deleted automatically when references to them go out of scope.

Detached actors are useful for "services", where different tasks and actors in the application want to look up an actor by name and use it.

Here is an example of a "normal" actor definition:

In [12]:
@ray.remote
class Counter:
    def __init__(self, label='Counter'):
        self.label = label
        self.count = 0
    def next(self):
        self.count += 1
        return self.count

Now create a detached instance of it.

In [14]:
counter = Counter.options(name="Counter1", detached=True).remote()

Then we can use it "somewhere else":

In [15]:
c = ray.util.get_actor("Counter1")  # look the actor up by name, as another part of the application would
print(ray.get([c.next.remote() for _ in range(100)]))

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]


See also the notes on detached actors and actor lifecycles in the lesson [03: Ray Internals](03-Ray-Internals.ipynb) and the [detached actors](https://docs.ray.io/en/latest/advanced.html#detached-actors) documentation.

## Limiting Actor Invocations and Retries on Failure

> **Note:** This feature may change in a future version of Ray. See the latest details in the [Ray documentation](https://docs.ray.io/en/latest/package-ref.html#ray.remote). 

Two options you can pass to `ray.remote` when defining an actor control how often the actor is restarted and how often its tasks are retried after a failure:

* `max_restarts`: This specifies the maximum number of times that the actor should be restarted when it dies unexpectedly. The minimum valid value is 0 (default), which indicates that the actor doesn't need to be restarted. A value of -1 indicates that an actor should be restarted indefinitely.
* `max_task_retries`: How many times to retry an actor task if the task fails due to a system error, e.g., the actor has died. If set to -1, the system will retry the failed task until the task succeeds or the actor has reached its `max_restarts` limit. If set to a value `n` greater than 0, the system will retry the failed task up to `n` times, after which the task will throw a `RayActorError` exception when `ray.get` attempts to retrieve a result. Note that Python exceptions are not considered system errors and will not trigger retries.

Example:

```python
@ray.remote(max_restarts=-1, max_task_retries=-1)
class Foo():
    pass
```

See the [ray.remote()](https://docs.ray.io/en/latest/package-ref.html#ray.remote) documentation for all the keyword arguments supported.
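
To make the `max_task_retries` rule concrete, here is a minimal plain-Python model of it. It is a sketch of the behavior described above, not Ray's implementation: `SystemFailure`, `call_with_retries`, and `make_flaky_task` are hypothetical names invented for illustration, and the model ignores the interaction with `max_restarts`.

```python
class SystemFailure(Exception):
    """Hypothetical stand-in for a system-level failure, e.g. the actor's process died."""


def call_with_retries(task, max_task_retries=0):
    """Model of the retry rule: retry only on SystemFailure, at most
    max_task_retries times (-1 means retry until the task succeeds).
    Returns (result, number_of_attempts)."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except SystemFailure:
            retries_used = attempts - 1
            if max_task_retries != -1 and retries_used >= max_task_retries:
                raise  # out of retries; Ray would report a RayActorError here
        # Any other exception propagates immediately: ordinary Python
        # exceptions are not system errors and are never retried.


def make_flaky_task(failures_before_success):
    """Return a task that raises SystemFailure a fixed number of times, then succeeds."""
    remaining = [failures_before_success]

    def task():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise SystemFailure("simulated actor death")
        return "ok"

    return task


# Two simulated system failures, three retries allowed: the third attempt succeeds.
result, attempts = call_with_retries(make_flaky_task(2), max_task_retries=3)
print(result, attempts)
```

With `max_task_retries=1` the same flaky task would exhaust its retries and the failure would reach the caller, which corresponds to the point where `ray.get` raises.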

### Overriding with options()

Remote task and actor objects returned by `@ray.remote` can also be dynamically modified with the same arguments supported by `ray.remote()` using `options()` as in the following example:

```python
@ray.remote(num_cpus=2, resources={"CustomResource": 1})
class Foo:
    def method(self):
        return 1
Bar = Foo.options(num_cpus=1, resources=None)
```

## Actor Performance Profiling with the Ray Dashboard

In lesson [01: Ray Tasks Revisited](01-Ray-Tasks-Revisited.ipynb), we learned how to profile task performance using [ray.timeline(file)](https://ray.readthedocs.io/en/latest/package-ref.html#ray.timeline) and a Chrome web browser to view the data.

Now we'll investigate how to profile performance of Ray actors using the Ray Dashboard ([documentation](https://ray.readthedocs.io/en/latest/ray-dashboard.html#ray-dashboard)). 

### Ray Dashboard Profiling

The Ray Dashboard provides an easy way to profile execution of Ray actors, producing [flame graphs](http://www.brendangregg.com/flamegraphs.html), which show performance characteristics of the application. This feature uses [py-spy](https://github.com/benfred/py-spy) to instrument and profile the application.

> **WARNING:** Py-spy may require `sudo` access. When you follow the instructions we are about to give, you may see a message that `sudo` is required, but there's no way to enter the password in the Dashboard.

We discuss workarounds for the `sudo` issue in [Troubleshooting, Tips, and Tricks](reference/Troubleshooting-Tips-Tricks.ipynb#Profiling-Actors), but this technique may not work for many of you, due to the _sudo_ requirement. Note that a workaround is to run the application you want to profile outside a notebook, using a command line. Then if the Dashboard needs a password, you will be prompted at the terminal. 

In any case, we'll describe the profiling process here and demonstrate it in a video and at live tutorial events.

To profile with the Dashboard, click the _Logical View_ tab. It shows a list of actors that have been executed or are running. Find the running actor that appears to be the one you want to profile. You'll see a line like this:

> Actor <hex_number> (Profile for 10s 30s 60s) Kill Actor

The _10s, 30s, 60s_ are links. We'll use one of them to profile our `RayGame` performance.

More specifically, use this procedure, which helps you find the correct actor:

1. Start the next cell.
2. Copy to the clipboard the hex number shown, something like `[Actor(RayGame, a139e2970100)]`.
3. Immediately go to the Dashboard's _Logical View_.
4. Search for the actor with CTRL-F (Windows/Linux) or CMD-F (macOS) and enter the hex code.
5. For the actor found, click _10s_ to profile it.
6. When the profile run finishes, click _Profile results_ to see the _flame graphs_ in another tab.

A screenshot is below:

In [36]:
game_ids4 = time_ray_games(num_games = 1, max_steps = 2 * max_steps, batch_size=50, grid_size=grid_size)

game_ids:
[Actor(RayGame, a139e2970100)]
Total time for 1 games (max_steps = 400, batch_size = 50) duration: 26.686 seconds


In [52]:
# Uncomment and run if you start getting warnings later.
# cleanup(game_ids4)

Here is an interesting screenshot of a captured _flame graph_:
![Conway's GoL Flame Graph](../images/ConwaysGameOfLife-FlameGraph.png)

You are looking at the call stack, with the top-most stack frame at the bottom. It shows that the _list comprehension_ in the `ConwaysRules.live_neighbors` method, along with the rest of that method's work, is taking the most time. That's where you want to optimize!

A lot of these graphs may look like this, where you're only seeing a "quiet" actor communicating with the local scheduler, etc.:
![Boring Conway's GoL Flame Graph](../images/ConwaysGameOfLife-FlameGraph2.png)

> **Tips:** 
>
> 1. [Lesson 6](06-RecapTipsTricks.ipynb#Profiling-Actors) offers tips on working with this UI.
> 2. The [Ray Dashboard documentation](https://ray.readthedocs.io/en/latest/ray-dashboard.html#debugging-a-blocked-actor) offers tips for using the _Logical View_ to debug actor issues.

We'll explore using this performance information in the exercise below, but first, let's cover a few more important details about Ray Actors.

## More about Actors

Let's finish with a discussion of additional important information about actors. Some of the points were mentioned above.

### Actor Scheduling and Lifetimes

For the most part, when Ray runs actor code, it uses the same _task_ mechanisms we discussed in _Task Parallelism, Parts 1 and 2_. Actor constructor and method invocations work just like task invocations. However, there are a few notable differences:

* Once a task finishes, it is removed from the worker that executed it, while an actor is _pinned_ to the worker until all Python references to it are out of scope. That is, the usual garbage collection mechanism in Python determines when an actor is removed from a worker. The actor must remain in memory because it holds state that might be needed, whereas tasks are stateless.
* Each actor instance uses about 40MB of memory overhead. Hence, just as you should avoid having too many fine-grained tasks, you should avoid too many actor instances. Reducing the overhead per actor is an ongoing effort.
* An actor can be killed with `ray.kill(actor_id)`, which is abrupt. If you want to kill the actor but let pending tasks finish, use `actor_id.__ray_terminate__.remote()` instead, which queues a termination task.

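> **Note:** By default, an actor that is killed stays dead. If the parameter `max_restarts=N` (named `max_reconstructions` in older Ray releases) was passed as an argument to `@ray.remote` for the actor declaration, the actor will be restarted up to `N` times after it dies. If `N` is `-1` (`ray.ray_constants.INFINITE_RECONSTRUCTION` in older releases), a restart will always be attempted, without limit. Whether an explicit `ray.kill()` also triggers a restart depends on the Ray version, so check the `ray.kill` documentation for the release you are using.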

> **Tip:** If you get a warning that Ray can't schedule an actor because it doesn't have enough resources in the available workers, you can use the Ray Dashboard _Logical View_ to kill the ones you no longer need. This can be useful in constrained environments like a notebook session on a small VM instance. Obviously, care is required.

### Durability of Actor State

At this time, Ray provides no built-in mechanism for persisting actor state. Hence, if a worker or whole server goes down with actor instances, their state is lost. 

Actor state is stored in the object store, just like the results of task executions. Even if only a worker goes down and the object store stays healthy, the reference to the object is lost.

This is an area where Ray will evolve and improve in the future. For now, an important design consideration is to decide when you need to _checkpoint_ state and to use an appropriate mechanism for this purpose. Some of the Ray APIs explored in other tutorials have built-in checkpoint features, such as for saving snapshots of trained models to a file system.
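
As one illustration of the checkpointing idea, the following minimal sketch saves a counter's state to a JSON file after every update and restores it on construction. It is plain Python so that it runs anywhere; `CheckpointedCounter` and its file layout are hypothetical names chosen for this example, not a Ray API. In a Ray application you would presumably decorate the class with `@ray.remote` and point `checkpoint_path` at storage that is reachable from whichever node a replacement actor starts on.

```python
import json
import os
import tempfile


class CheckpointedCounter:
    """Plain-Python counter that saves its state to a JSON file after every update."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.count = 0
        if os.path.exists(checkpoint_path):
            # Restore the state left behind by a previous instance.
            with open(checkpoint_path) as f:
                self.count = json.load(f)["count"]

    def next(self):
        self.count += 1
        self._save()
        return self.count

    def _save(self):
        # Write to a temporary file first, then rename it into place, so a
        # crash in the middle of a write never leaves a truncated checkpoint.
        tmp_path = self.checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"count": self.count}, f)
        os.replace(tmp_path, self.checkpoint_path)


checkpoint_dir = tempfile.mkdtemp()
path = os.path.join(checkpoint_dir, "counter.json")

first = CheckpointedCounter(path)
first.next()
first.next()
del first  # simulate losing the instance

second = CheckpointedCounter(path)  # the replacement picks up where the first stopped
print(second.next())
```

Checkpointing on every update is the simplest policy but also the most expensive one. Saving every `k` updates, or on a timer, trades some lost work after a failure for less I/O.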

## Exercise 4 - "Homework"

This exercise is more involved and will mostly appeal to those of you who like the challenges of optimizing low-level performance. Also, for a live tutorial event, this exercise is too much to take on in the limited time available. That's why it's labelled _Homework_. I encourage you to read the solution though.

We already proved that running whole games as actors gives us a performance boost when we need to run many of them at once, simply by running the games concurrently across the available CPU cores.

This exercise explores whether we can improve the performance of a single game. This task isn't easy. We'll try a few ideas, but find that many don't provide much improvement. Only low-level optimizations provide significant improvement at this point. That could be important in a massive system where every bit of efficiency matters, but remember that _premature optimization is the root of all evil_. Why? Because optimizations often obscure the logic of the code, making it harder to maintain and bugs more likely. This code does basic math, but a lot of it, and sometimes it's better to crank through it in a single thread.

For your convenience, here are new versions of `RayGame` (called `RayGame2`) and `ConwaysRules` (called `RayConwaysRules`), both declared as actors. What about `State`? It actually _doesn't_ make sense to make it an actor, because it is really just an _immutable_ holder of data, so making it an actor is not going to bring any benefit. Rather, it would likely slow things down by adding unnecessary overhead. (You can try it and see...)

In [55]:
@ray.remote
class RayConwaysRules:
    """
    Apply the rules to a state and return a new state.
    """
    def step(self, state):
        """
        Determine the next values for all the cells, based on the current
        state. Creates a new State with the changes.
        """
        new_grid = state.grid.copy()
        for i in range(state.size):
            for j in range(state.size):
                lns = self.live_neighbors(i, j, state)
                new_grid[i][j] = self.apply_rules(i, j, lns, state)
        new_state = State(grid = new_grid)
        return new_state

    def apply_rules(self, i, j, live_neighbors, state):
        """
        Determine next value for a cell, which could be the same.
        The rules for Conway's Game of Life:
            Any live cell with fewer than two live neighbours dies, as if by underpopulation.
            Any live cell with two or three live neighbours lives on to the next generation.
            Any live cell with more than three live neighbours dies, as if by overpopulation.
            Any dead cell with exactly three live neighbours becomes a live cell, as if by reproduction.
        """
        cell = state.grid[i][j]  # default value is no change in state
        if cell == 1:
            if live_neighbors < 2 or live_neighbors > 3:
                cell = 0
        elif live_neighbors == 3:
            cell = 1
        return cell

    def live_neighbors(self, i, j, state):
        """
        Wrap at boundaries (i.e., treat the grid as a 2-dim "toroid")
        To wrap at boundaries, when k-1=-1, that wraps itself;
        for k+1=state.size, we mod it (which works for -1, too)
        For simplicity, we count the cell itself, then subtract it
        """
        s = state.size
        g = state.grid
        return sum([g[i2%s][j2%s] for i2 in [i-1,i,i+1] for j2 in [j-1,j,j+1]]) - g[i][j]

In [15]:
@ray.remote
class RayGame2:
    # TODO: Game memory grows unbounded; trim older states?
    def __init__(self, initial_state, rules_id):
        self.states = [initial_state]
        self.rules_id = rules_id

    def step(self, num_steps = 1):
        """Take 1 or more steps, returning a list of new states."""
        start_index = len(self.states)
        for _ in range(num_steps):
            new_state_id = self.rules_id.step.remote(self.states[-1])
            self.states.append(ray.get(new_state_id))
        return self.states[start_index:]  # return the new states only!

Finally, we need a new timing function:

In [62]:
def time_ray_games2(num_games = 10, max_steps = max_steps, batch_size = 1, grid_size = grid_size):
    rules_ids = []
    game_ids = []
    for i in range(num_games):
        rules = RayConwaysRules.remote()
        rules_ids.append(rules)
        game_ids.append(RayGame2.remote(State(size = grid_size), rules))
    print(f'rules_ids:\n{rules_ids}')  # these will produce more interesting flame graphs!
    print(f'game_ids:\n{game_ids}')
    start = time.time()
    state_ids = []
    for game_id in game_ids:
        for i in range(int(max_steps/batch_size)):  # Do a total of max_steps game steps, in max_steps/batch_size batches
            state_ids.append(game_id.step.remote(batch_size))
    ray.get(state_ids)  # wait for everything to finish! We are ignoring what ray.get() returns, but what will it be??
    pd(time.time() - start, prefix = f'Total time for {num_games} games (max_steps = {max_steps}, batch_size = {batch_size})')
    return game_ids  # for cleanup afterwards

Make sure you watch the dashboard again.

In [63]:
game_ids5 = time_ray_games2(num_games = 1, max_steps = 2 * max_steps, batch_size=50, grid_size=grid_size)



rules_ids:
[Actor(RayConwaysRules, 34108bdd0100)]
game_ids:
[Actor(RayGame2, 89003fb90100)]
Total time for 1 games (max_steps = 400, batch_size = 50) duration: 24.300 seconds


In [117]:
# Uncomment and run if you start getting warnings later.
# cleanup(game_ids5)

As written, we still haven't improved performance. The results are within a second of the previous performance run.

Why? Well if you watched the dashboard, you should have noticed that while several workers had part of the code, **one** core was pegged running `RayConwaysRules.step()`. Profiling the result and looking at the flame graph provides more information. As we showed previously in this lesson, most of the time is taken up in `RayConwaysRules.live_neighbors()` and in that method, more than half the time is spent in the _list comprehension_.

For the exercise, there are a few things you could do.

Since profiling shows that `live_neighbors` is the bottleneck, what could be done to reduce its execution time? 

The solutions discussed in the [solutions notebook](solutions/Advanced-Ray-Solutions.ipynb) show that in fact you can reduce its overhead by about 40%. Not bad.
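
Before trying variants, it helps to have a baseline measurement that runs outside Ray. The sketch below is one possible harness, written under a few assumptions: it copies the counting logic of `live_neighbors` into a free function that works on a plain list-of-lists grid (the `State` class is not needed), and `count_all`, the grid size of 50, and the random seed are hypothetical choices made only for this example.

```python
import random
import timeit


def live_neighbors(i, j, grid):
    """Same counting logic as RayConwaysRules.live_neighbors, on a plain list-of-lists grid."""
    s = len(grid)
    return sum([grid[i2 % s][j2 % s]
                for i2 in [i - 1, i, i + 1]
                for j2 in [j - 1, j, j + 1]]) - grid[i][j]


def count_all(grid, neighbors_fn):
    """Call neighbors_fn once per cell, as step() does."""
    size = len(grid)
    return [[neighbors_fn(i, j, grid) for j in range(size)] for i in range(size)]


random.seed(0)  # fixed seed so repeated runs time the same grid
size = 50       # hypothetical grid size for the benchmark
grid = [[random.randint(0, 1) for _ in range(size)] for _ in range(size)]

baseline = count_all(grid, live_neighbors)  # reference answer for checking variants
repeats = 5
seconds = timeit.timeit(lambda: count_all(grid, live_neighbors), number=repeats)
print(f"baseline: {seconds / repeats:.4f} seconds per full-grid pass")
```

To evaluate a candidate replacement, first check that `count_all(grid, candidate) == baseline`, then time it the same way. Absolute numbers will differ from what the flame graph shows inside an actor, so compare only the ratio between variants.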

What about parallel invocations of it or other aspects of grid processing to update the state? Currently, for each step, the code walks through the current grid _synchronously_ to compute the next grid. _Could we do that process in parallel, e.g., blocks of rows at a time?_ This is doable, because updating each cell only depends on the past state, not on updates to other cells. (Blocks of rows is just a convenient choice for breaking up the work.)
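
The independence claim can be checked without Ray. The following sketch, with the hypothetical helper names `next_cell`, `step_rows`, and `step_in_blocks`, computes the next grid one block of rows at a time and confirms that the result matches a single pass over the whole grid. It runs the blocks sequentially; it only demonstrates that the decomposition is valid, not that it is faster.

```python
def live_neighbors(i, j, grid):
    """Count live neighbors of cell (i, j), wrapping at the edges."""
    s = len(grid)
    return sum(grid[i2 % s][j2 % s]
               for i2 in (i - 1, i, i + 1)
               for j2 in (j - 1, j, j + 1)) - grid[i][j]


def next_cell(i, j, grid):
    """Apply Conway's rules to one cell of the current grid."""
    n = live_neighbors(i, j, grid)
    if grid[i][j] == 1:
        return 1 if n in (2, 3) else 0
    return 1 if n == 3 else 0


def step_rows(grid, row_start, row_stop):
    """Compute the next values for rows [row_start, row_stop), reading only the old grid."""
    size = len(grid)
    return [[next_cell(i, j, grid) for j in range(size)]
            for i in range(row_start, row_stop)]


def step_in_blocks(grid, block_size):
    """Compute the next grid one block of rows at a time, then stitch the blocks together."""
    size = len(grid)
    new_grid = []
    for row_start in range(0, size, block_size):
        row_stop = min(row_start + block_size, size)
        # Each call reads only the old grid, so the calls are independent.
        # In Ray, each one could be submitted as a separate remote task.
        new_grid.extend(step_rows(grid, row_start, row_stop))
    return new_grid


# A "blinker" on a 5x5 grid: a vertical bar that flips to a horizontal bar.
blinker = [[0, 0, 0, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0],
           [0, 0, 0, 0, 0]]

whole = step_rows(blinker, 0, 5)        # the whole grid in one pass
in_blocks = step_in_blocks(blinker, 2)  # blocks of 2, 2, and 1 rows
print(in_blocks == whole)
```

Note that every block still needs read access to the rows just outside its own range (and, with wrapping, to the opposite edge of the grid), so a remote version has to ship the whole old grid, or at least the neighboring rows, to each task. That transfer cost is one reason the parallel version may not pay off for small grids.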

But let's step back for a moment; this is the sort of optimization you do when you _really_ have a compelling reason to squeeze optimal performance out of the code. Hence, for this exercise, it's probably overkill, unless you're interested in low-level performance optimizations like this. If you are, see the discussion in the _Solutions_ notebook.

Easier experiments you can try may not produce much improvement, based on the flame graph results above, but consider trying them for "practice". Look at `RayGame2.step()` and `RayConwaysRules.step()`. There are a bunch of remote calls in there. What refactoring could be done that might improve performance? For example, what about extending `RayConwaysRules.step()` to accept a `num_steps` argument like `RayGame2.step()` supports, then modifying the call to it from `RayGame2.step()`? Does this actually improve performance or not? Don't forget to watch the Dashboard...

In [None]:
ray.shutdown()  # "Undo ray.init()". Terminate all the processes started in this notebook.