# Ray Crash Course - Actors

We explored how Ray _tasks_ allow us to maximize utilization of our cluster resources. However, a typical challenge in distributed systems is management of distributed _state_. Ray's solution is _Actors_.

The [Actor Model](https://en.wikipedia.org/wiki/Actor_model) closely resembles what Alan Kay had in mind for object-oriented programming, which he implemented in [Smalltalk](https://en.wikipedia.org/wiki/Smalltalk). Specifically, objects are autonomous agents that communicate and coordinate activities by exchanging messages. Each object/actor manages its own encapsulated state. Modern actor model implementations add constructs for distributed computation and thread-safe concurrency.
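
To make the message-passing idea concrete, here is a minimal sketch in plain Python, with no Ray involved. The `CounterActor` class, its message names, and the `send` method are hypothetical names invented for this illustration. The sketch only shows the core pattern: private state, a mailbox, and one thread that handles messages one at a time.

```python
import queue
import threading


class CounterActor:
    """A toy actor: private state plus a mailbox drained by one worker thread."""

    def __init__(self):
        self._count = 0                    # encapsulated state, only touched by _run
        self._mailbox = queue.Queue()      # FIFO queue of (message, reply_queue) pairs
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Messages are handled strictly one at a time, so the state needs no lock.
        while True:
            message, reply = self._mailbox.get()
            if message == "increment":
                self._count += 1
            # Every message gets exactly one reply; only "get" carries a value.
            reply.put(self._count if message == "get" else None)
            if message == "stop":
                break

    def send(self, message):
        """Enqueue a message and return a queue that will receive the reply."""
        reply = queue.Queue(maxsize=1)
        self._mailbox.put((message, reply))
        return reply


actor = CounterActor()
for _ in range(3):
    actor.send("increment")          # fire and forget
count = actor.send("get").get()      # blocks until the actor replies
actor.send("stop").get()             # shut the worker thread down cleanly
print(count)
```

Because a single sender enqueues the messages in order and the mailbox is first-in, first-out, the `"get"` message is processed after all three increments.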

As we'll see, Ray converts normal Python classes into Ray actors, analogous to the way stateless functions were turned into Ray tasks. Classes are used because they are a familiar tool in languages like Python for encapsulating state with methods for manipulating and exposing the state.

In addition to [Ray's actor model implementation](https://ray.readthedocs.io/en/latest/actors.html), other implementations include [Erlang](https://www.erlang.org/), the first production-grade implementation of actors, [Akka](https://akka.io/) for the JVM, and [many others](https://en.wikipedia.org/wiki/Actor_model#Actor_libraries_and_frameworks).

> **Tip:** For more about Ray, see [ray.io](https://ray.io) or the [Ray documentation](https://ray.readthedocs.io/en/latest/).

We'll demonstrate Ray actors using a simplified implementation of a _parameter server_. 

## Parameter Server Scenario

The idea of a _parameter server_ emerged as a way to manage model parameters at large scales. A parameter server is effectively an optimized database, possibly distributed, for updating and sharing model parameters in model training and serving scenarios, especially for very large deep learning models.

Here is a conceptual sketch of a parameter server and other components that use it.

![Parameter Server](../images/Parameter-Server.png)

New data is stored for use in the next cycle of model training. It is also passed to a model server to score the data with one or more previously trained models. The scored data is sent to other services as required. Training data is fed from storage to the trainers. The parameter server is used as durable storage for parameters used by trainers during model training, and it serves parameters for previously trained models to the model server.

For more information on the concept of a parameter server:

* [Parameter Server for Distributed Machine learning](https://medium.com/coinmonks/parameter-server-for-distributed-machine-learning-fd79d99f84c3)
* [Scaling Distributed Machine Learning with the Parameter Server](https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf)

For a more extensive example of parameter server usage in Ray, see [this section of the Ray documentation](https://ray.readthedocs.io/en/latest/auto_examples/plot_parameter_server.html).

Let's start with some imports and definitions.

In [1]:
import numpy as np
import random
import time

Normally you would let Ray determine the actual number of CPU cores, but on a single machine, we'll pretend we have more of them, which is what `num_workers` will be used for.

**Mini Exercise:** Play with these numbers. For older machines, a smaller `num_workers` or `num_iterations` might be useful. The `num_parameters` value doesn't make much difference in the performance.

In [12]:
num_workers = 16          # Number of CPUs we'll "pretend" we have available
num_parameters = 10000    # Number of parameters (array size) in a model
num_trainers = 4          # Number of trainers that use the parameter server
num_iterations = 500      # Number of iterations for a training run (per trainer)

Let's start with a simple Python class, `ParameterServer`, that wraps a dictionary of parameter objects as values associated with arbitrary names as keys. Each model trainer we implement will use its own key for its set of parameters. This would work if the trainers are all working on the same model, with each trainer working on a partition of the model parameters. It would also work when each trainer has a whole model, with other trainers working on different models.

Finally, the parameter server can also serve trained model parameters to model servers.

We'll leave the type of parameters unspecified, so different kinds of parameter sets can be managed transparently.

In [37]:
class ParameterServer():
    def __init__(self):
        self.parameters = {}

    def get(self, key):
        """
        Returns:
            The parameters for a given key or None by default.
        """
        return self.parameters.get(key, None)

    def set(self, key, new_params):
        self.parameters[key] = new_params
        
    def keys(self):
        """
        Returns:
            The dictionary keys object converted to a list. This conversion isn't
            particularly necessary here, but it will be needed for remote objects, 
            because a dictionary keys object can't be pickled! So, to minimize 
            required changes later, we'll return a list here.
        """
        return list(self.parameters.keys())
    
    def __str__(self):  # Useful for printing below
        return f'ParameterServer(keys={self.keys()})'
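
The pickling remark in the `keys()` docstring can be checked with a short sketch. It uses only the standard library `pickle` module, and the `params` dictionary is made-up sample data. Ray uses its own serialization layer, so its exact behavior may depend on the Ray version. Returning a list is the safe choice either way.

```python
import pickle

# Made-up sample data standing in for the server's internal dictionary.
params = {"Trainer0": [0.0, 0.0], "Trainer1": [0.0, 0.0]}

# A dict keys view is rejected by the standard pickle module.
try:
    pickle.dumps(params.keys())
    keys_view_pickles = True
except TypeError:
    keys_view_pickles = False

# The same keys converted to a list survive a pickle round trip.
restored = pickle.loads(pickle.dumps(list(params.keys())))

print(keys_view_pickles, restored)
```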

The trainer runs a user-specified number of iterations to improve the parameter set. To simulate a real training process like gradient descent, we compute random deltas and apply them to the last state of the parameters. The parameter set is kept here, so this class is also stateful, but the updated parameters are written to the server at the end of every call to `train()`.

Unlike `ParameterServer`, which accepts any kind of parameter object, `Trainer` assumes the parameters are a NumPy array, for simplicity.

In [54]:
class Trainer():
    def __init__(self, name, param_server, initial_parameters):
        """Pushes the initial parameters to the parameter server"""
        self.name = name
        self.parameters = initial_parameters
        self.param_server = param_server
        self.send_to_server()

    def send_to_server(self):
        self.param_server.set(self.name, self.parameters)

    def train(self, num_iterations):
        """
        Simulate training by running 'num_iterations' iterations. For each step, compute 
        random deltas to update the parameters, simulating gradient descent. After the
        last step, push the updated parameters to the parameter server.
        
        Returns:
            (name, params)
        """
        for _ in range(num_iterations):
            deltas = [random.uniform(-0.1,0.1) for i in range(self.parameters.size)]
            self.parameters = np.add(self.parameters, deltas)
        self.send_to_server()
        return (self.name, self.parameters)
    
    def __str__(self):  # Useful for printing below
        return f'Trainer(name={self.name}, parameters={self.parameters}, param_server={self.param_server}'
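
The random-delta update in `train()` builds a Python list on every iteration. As an optional variation, the same kind of update can be drawn in one vectorized NumPy call. The sketch below is an assumption-laden alternative, not the code timed in this notebook: `random_walk_step` is a hypothetical helper, and the seed is fixed only to make the sketch repeatable.

```python
import numpy as np


def random_walk_step(parameters, rng, scale=0.1):
    """Return the parameters plus uniform random deltas drawn from [-scale, scale)."""
    deltas = rng.uniform(-scale, scale, size=parameters.shape)
    return parameters + deltas


rng = np.random.default_rng(seed=42)   # seeded only so the sketch is repeatable
params = np.zeros(10_000)              # same size as num_parameters above
num_steps = 500                        # same count as num_iterations above

for _ in range(num_steps):
    params = random_walk_step(params, rng)

print(params.shape, float(np.abs(params).max()))
```

Drawing all deltas at once avoids the per-element Python loop, which is typically much faster for large arrays. It would also change the timings reported in the cells below, so treat it as a side experiment.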

Instantiate `num_trainers` trainers, simulating training over a partitioned parameter set. For simplicity, we'll use a single parameter server. In a real-world scenario, this might become a bottleneck, so several servers might be used for the partitioned parameter set. 
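
As a rough illustration of the multi-server idea, the sketch below routes each key to one of several in-memory shards. It is not used in the rest of this notebook. `ShardedParameterStore` is a hypothetical class, and each shard is a plain dictionary standing in for a separate parameter server.

```python
import zlib


class ShardedParameterStore:
    """Hypothetical sketch: route each key to one of several in-memory shards."""

    def __init__(self, num_shards):
        # Each dict stands in for an independent parameter server.
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # crc32 gives the same value in every process, unlike Python's
        # built-in hash() for strings, which is randomized per process.
        index = zlib.crc32(key.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def get(self, key):
        return self._shard_for(key).get(key, None)

    def set(self, key, new_params):
        self._shard_for(key)[key] = new_params

    def keys(self):
        return [key for shard in self.shards for key in shard]


store = ShardedParameterStore(num_shards=2)
for i in range(4):
    store.set(f"Trainer{i}", [0.0] * 3)

print(sorted(store.keys()))
```

The interface mirrors `get`, `set`, and `keys` from `ParameterServer`, so a trainer would not need to know how many shards exist.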

In [55]:
ps = ParameterServer()

In [56]:
ts = [Trainer(f'Trainer{i}', ps, np.zeros(num_parameters)) for i in range(num_trainers)]
print(f'{num_trainers} Trainers created:')
for t in ts:
    print(t)

4 Trainers created:
Trainer(name=Trainer0, parameters=[0. 0. 0. ... 0. 0. 0.], param_server=ParameterServer(keys=['Trainer0', 'Trainer1', 'Trainer2', 'Trainer3'])
Trainer(name=Trainer1, parameters=[0. 0. 0. ... 0. 0. 0.], param_server=ParameterServer(keys=['Trainer0', 'Trainer1', 'Trainer2', 'Trainer3'])
Trainer(name=Trainer2, parameters=[0. 0. 0. ... 0. 0. 0.], param_server=ParameterServer(keys=['Trainer0', 'Trainer1', 'Trainer2', 'Trainer3'])
Trainer(name=Trainer3, parameters=[0. 0. 0. ... 0. 0. 0.], param_server=ParameterServer(keys=['Trainer0', 'Trainer1', 'Trainer2', 'Trainer3'])


Check that the initial parameters were written to the server:

In [57]:
for key in ps.keys():
    print(f'{key}: {ps.get(key)}')

Trainer0: [0. 0. 0. ... 0. 0. 0.]
Trainer1: [0. 0. 0. ... 0. 0. 0.]
Trainer2: [0. 0. 0. ... 0. 0. 0.]
Trainer3: [0. 0. 0. ... 0. 0. 0.]


The next cell takes about 5 seconds of both user time and wall time on a 2019 model MacBook Pro 13"...

In [58]:
%time [t.train(num_iterations) for t in ts]

CPU times: user 4.81 s, sys: 77.5 ms, total: 4.89 s
Wall time: 4.92 s


[('Trainer0',
  array([-0.00329442,  0.86496108, -1.51937846, ..., -0.88591523,
          0.3509482 , -1.96414379])),
 ('Trainer1',
  array([-0.65204375, -0.02438378,  0.24815471, ...,  1.05218844,
         -0.02307567,  1.50306938])),
 ('Trainer2',
  array([-0.58644958, -0.52373158, -0.47386362, ...,  1.01543566,
         -1.32350437, -0.36347235])),
 ('Trainer3',
  array([ 1.30368004,  0.37275973, -0.63920378, ..., -1.60705274,
          2.21114751, -0.94467373]))]

In [59]:
for key in ps.keys():
    print(f'{key}: {ps.get(key)}')

Trainer0: [-0.00329442  0.86496108 -1.51937846 ... -0.88591523  0.3509482
 -1.96414379]
Trainer1: [-0.65204375 -0.02438378  0.24815471 ...  1.05218844 -0.02307567
  1.50306938]
Trainer2: [-0.58644958 -0.52373158 -0.47386362 ...  1.01543566 -1.32350437
 -0.36347235]
Trainer3: [ 1.30368004  0.37275973 -0.63920378 ... -1.60705274  2.21114751
 -0.94467373]


## Scaling the Parameter Server with Ray

Let's make this faster with Ray. Clearly the trainers are completely independent, so they should run in parallel. We'll also make the `ParameterServer` a Ray actor, even though there is currently only one. 

In [60]:
import ray
ray.init(num_cpus = num_workers, ignore_reinit_error = True)

2020-04-21 17:16:56,243	ERROR worker.py:682 -- Calling ray.init() again after it has already been called.


In [61]:
print(f'Dashboard URL: http://{ray.get_webui_url()}')

Dashboard URL: http://localhost:8266


### Creating a Ray Actor from a Class

We created Ray _tasks_ by decorating normal Python _functions_ with `@ray.remote`. Similarly, we create Ray _actors_ by decorating normal Python _classes_ with `@ray.remote`.

One important difference between actors and classes is how you access _fields_ (or _attributes_) of an instance. While in Python you can just say `my_instance.field`, this isn't supported at this time for Ray actors. Instead, you have to provide accessor methods for all fields (or derived state) that users might need. 

Fortunately, we can subclass a normal Python class to create an actor, and add the additional accessor methods we might need. In our case, we already have everything we need in `ParameterServer`, specifically the `keys()` and `get()` methods, although you could add a method to return the entire dictionary at once. So, our actor declaration, `RayParameterServer`, is quite simple.

In [62]:
@ray.remote
class RayParameterServer(ParameterServer):
    pass

Actor instances are constructed with `.remote(...)`, like calling tasks, and actor methods are called with `my_instance.method.remote(...)`.

In [63]:
rps = RayParameterServer.remote()

The corresponding Ray trainer has to invoke the remote server using `param_server.x.remote(...)`. Fortunately, we already encapsulated calls to the parameter server in `send_to_server()`, so all we need to do to create a new trainer is to subclass `Trainer` and override that method.

In [64]:
@ray.remote
class RayTrainer(Trainer):
    def __init__(self, name, param_server, initial_parameters):
        super().__init__(name, param_server, initial_parameters)

    def send_to_server(self):
        """Asynchronously send updates."""
        self.param_server.set.remote(self.name, self.parameters)

In [65]:
rts = [RayTrainer.remote(f'RayTrainer{i}', rps, np.zeros(num_parameters)) for i in range(num_trainers)]
print(f'{num_trainers} Ray Trainers created:')
for rt in rts:
    print(rt)

4 Ray Trainers created:
Actor(RayTrainer, 8e3e47920100)
Actor(RayTrainer, 169cae520100)
Actor(RayTrainer, a1651cc50100)
Actor(RayTrainer, a000cd7e0100)


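The next cell takes about 25 **milliseconds** of user time and 1.5 seconds of wall time on the same 2019 model MacBook Pro 13", roughly one third of the wall time above. The improvement reflects the difference between using one core previously and all four cores of the machine's CPU now. The user time is so small because it only measures the driver process; the training work runs in the actors' own worker processes.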

In [67]:
%time ray.get([rt.train.remote(num_iterations) for rt in rts])

CPU times: user 24.7 ms, sys: 9.93 ms, total: 34.6 ms
Wall time: 1.51 s


[('RayTrainer0',
  array([ 0.03254323,  0.32799843,  1.14463483, ..., -1.04798486,
         -3.02949342,  0.87614499])),
 ('RayTrainer1',
  array([ 2.26093849,  1.18781758,  0.90302837, ..., -0.14009133,
          1.36071276,  0.29690826])),
 ('RayTrainer2',
  array([ 0.55147404, -0.91344929, -0.64431792, ...,  0.47301978,
          1.69434414, -0.17184378])),
 ('RayTrainer3',
  array([-1.03341044,  1.18637555, -0.89236703, ...,  1.57739792,
         -0.39743502, -1.08034883]))]
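
For reference, the speedup implied by the two `%time` outputs can be worked out directly. The numbers below are copied from the wall times printed above and will differ on other machines.

```python
sequential_wall = 4.92   # seconds, wall time of the sequential training cell
parallel_wall = 1.51     # seconds, wall time of the Ray training cell
num_trainers = 4         # trainers that ran in parallel

speedup = sequential_wall / parallel_wall
efficiency = speedup / num_trainers   # 1.0 would mean perfect scaling

print(f"speedup: {speedup:.2f}x, efficiency: {efficiency:.0%}")
# → speedup: 3.26x, efficiency: 81%
```

The gap between this figure and the ideal 4x is overhead, which here plausibly includes scheduling the actor calls and serializing the parameter arrays.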

In [68]:
ray.get(rps.keys.remote())

['RayTrainer0', 'RayTrainer2', 'RayTrainer3', 'RayTrainer1']

## Exercise 1

`RayTrainer` asynchronously sends the updates to the `ParameterServer`, but doesn't confirm success. Many distributed systems require a confirmation before an update is considered complete. Modify `send_to_server` to wait for confirmation.

## Exercise 2

Try different values of the parameters, `num_workers`, `num_parameters`, `num_trainers`, and `num_iterations`. How do the results change?