# Ray Crash Course - Actors

We explored how Ray _tasks_ allow us to maximize utilization of our cluster resources. However, a typical challenge in distributed systems is management of distributed _state_. Ray's solution is _Actors_.

The [Actor Model](https://en.wikipedia.org/wiki/Actor_model) closely resembles what Alan Kay had in mind for object-oriented programming, which he implemented in [Smalltalk](https://en.wikipedia.org/wiki/Smalltalk). Specifically, objects are autonomous agents that communicate and coordinate activities by exchanging messages. Each object/actor manages its own encapsulated state. Modern actor model implementations add constructs for distributed computation and thread-safe concurrency.
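To make the message-passing idea concrete, here is a minimal sketch of an actor built only from the Python standard library. It is not how Ray implements actors; the `CounterActor` class, its `send` method, and the message names are hypothetical, chosen for illustration. The point is that the state is private and changes only in response to messages, which are processed one at a time.

```python
import queue
import threading


class CounterActor:
    """A toy actor (hypothetical): private state, a mailbox, one worker thread."""

    def __init__(self):
        self._count = 0                  # encapsulated state, touched only by the actor thread
        self._mailbox = queue.Queue()    # incoming messages, handled in arrival order
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            message, reply = self._mailbox.get()
            if message == "stop":
                reply.put(None)
                break
            if message == "increment":
                self._count += 1
                reply.put(self._count)
            elif message == "get":
                reply.put(self._count)
            else:
                reply.put(None)          # unknown messages get an empty reply

    def send(self, message):
        """Post a message and return a queue that will receive the reply."""
        reply = queue.Queue(maxsize=1)
        self._mailbox.put((message, reply))
        return reply


actor = CounterActor()
replies = [actor.send("increment") for _ in range(3)]   # send without waiting
results = [r.get() for r in replies]                     # now wait for each reply
final_count = actor.send("get").get()
actor.send("stop").get()
print(results, final_count)
```

Sending a message returns immediately with a handle for the eventual reply, which loosely mirrors the pattern used with Ray below: a `.remote()` call followed later by `ray.get(...)`.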

In addition to [Ray's actor model implementation](https://ray.readthedocs.io/en/latest/actors.html), other implementations include [Erlang](https://www.erlang.org/), the first production-grade implementation of actors, [Akka](https://akka.io/) for the JVM, and [many others](https://en.wikipedia.org/wiki/Actor_model#Actor_libraries_and_frameworks).

For more about Ray, see [ray.io](http://ray.io).

We'll demonstrate Ray actors using a simplified implementation of a _parameter server_. 

## Parameter Server Scenario

The idea of a _parameter server_ emerged as a way to manage model parameters at large scales. A parameter server is effectively an optimized database, possibly distributed, for updating and sharing model parameters in model training and serving scenarios, especially for very large deep learning models.

For more information on the idea of a parameter server:

* [Parameter Server for Distributed Machine learning](https://medium.com/coinmonks/parameter-server-for-distributed-machine-learning-fd79d99f84c3)
* [Scaling Distributed Machine Learning with the Parameter Server](https://www.cs.cmu.edu/~muli/file/parameter_server_osdi14.pdf)

For a more extensive example of parameter server usage in Ray, see [this section of the Ray documentation](https://ray.readthedocs.io/en/latest/auto_examples/plot_parameter_server.html).

Let's start with some imports and definitions.

In [17]:
import numpy as np
import random
import time

In [18]:
# Normally you would let Ray determine the actual number of cores, 
# but on a single machine, we'll pretend we have more of them.
# Mini Exercise: Play with these numbers. For older machines, a shorter sleep time 
# or smaller `ncpus` or `niterations` might be useful. The `nparams` value won't 
# make a lot of difference
ncpus = 16             # Number of CPUs we'll "pretend" we have available
nparams = 100          # Number of parameters (array size) in a model
ntrainers = 5          # Number of trainers that use the parameter server
niterations = 100      # Number of iterations for a training run (per trainer)
sleep_interval = 0.1   # Pause to simulate an expensive computation

Let's begin with a simple, "local" implementation, then use Ray. `LocalParameterServer` wraps a dictionary that maps arbitrary names (the keys) to parameter objects (the values). Each model trainer we implement will use its own key.

In [19]:
class LocalParameterServer():
    def __init__(self):
        self.parameters = {}

    def get(self, key):
        """
        Returns:
            The parameters for a given key or None by default.
        """
        return self.parameters.get(key, None)

    def set(self, key, new_params):
        self.parameters[key] = new_params
        
    def keys(self):
        """
        Returns:
            The dictionary keys object converted to a list. This conversion isn't
            particularly necessary here, but it will be needed for remote objects, 
            because a dictionary keys object can't be pickled! So, to minimize 
            required changes later, we'll return a list here.
        """
        return list(self.parameters.keys())        
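The docstring's remark about pickling can be checked with the standard `pickle` module, as in the small sketch below. The `params` dictionary and its contents are made up for illustration. Note the assumption: this only shows the behavior of the standard library pickler; Ray uses its own serialization layer, which may handle dictionary views differently depending on the version.

```python
import pickle

# Hypothetical parameter dictionary, standing in for `self.parameters`.
params = {"trainer0": [0.1, 0.2], "trainer1": [0.3, 0.4]}

# A dictionary keys view cannot be pickled by the standard pickler.
try:
    pickle.dumps(params.keys())
    keys_view_picklable = True
except TypeError:
    keys_view_picklable = False

# Converting the view to a list gives a plain object that round-trips cleanly.
keys_list = list(params.keys())
roundtrip = pickle.loads(pickle.dumps(keys_list))

print(keys_view_picklable, roundtrip)
```

Returning a list from `keys()` therefore keeps the method's result safe to send between processes, whichever serializer is in use.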

The local model trainer runs a specified number of iterations to improve the parameter set. Here, we'll just use
random updates, followed by short sleep calls to simulate expensive work.

In [20]:
class LocalTrainer():
    def __init__(self, name, number_parameters, param_server):
        self.name = name
        self.number_parameters = number_parameters
        self.param_server = param_server

    def train(self, number_iterations):
        """
        Simulate training by running 'number_iterations' iterations. For each one, compute 
        random deltas to update the parameters followed by a small sleep value to simulate
        a long-running computation. When finished, set the updated parameters in the parameter
        server.
        
        Returns:
            (name, params)
        """
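        params = self.param_server.get(self.name)
        if params is None:
            # Test for None explicitly: `params or ...` raises ValueError once
            # params is a NumPy array with more than one element.
            params = np.zeros(self.number_parameters)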
        for _ in range(number_iterations):
            deltas = [random.uniform(-0.1,0.1) for _ in range(self.number_parameters)]
            params = np.add(params, deltas)
            time.sleep(sleep_interval)
        self.param_server.set(self.name, params)
        return (self.name, params)
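One detail in `train` deserves care: choosing a default when the server has no entry yet. A NumPy array with more than one element has no single truth value, so an explicit `is None` test is the safe way to fall back to zeros. The sketch below illustrates this with a hypothetical helper, `params_or_default`, which is not part of the notebook's classes.

```python
import numpy as np


def params_or_default(stored, size):
    """Hypothetical helper: return `stored`, or a zero vector if it is None."""
    return np.zeros(size) if stored is None else stored


first = params_or_default(None, 3)         # nothing stored yet: zeros
second = params_or_default(np.ones(3), 3)  # existing array is returned unchanged

# The short-circuit `or` idiom fails for multi-element arrays, because
# Python has to evaluate the truth value of the array itself.
try:
    np.ones(3) or np.zeros(3)
    or_idiom_works = True
except ValueError:
    or_idiom_works = False

print(first, second, or_idiom_works)
```

This matters on the second and later training runs, when the server already holds an array for the trainer's key.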

Instantiate `ntrainers` trainers, simulating training over a partitioned parameter set. For simplicity, we'll use a single parameter server. In a real-world scenario, this might become a bottleneck, so several servers might be used for the partitioned parameter set. 

In [21]:
lps = LocalParameterServer()

In [22]:
lts = [LocalTrainer(f'LocalTrainer{i}', nparams, lps) for i in range(ntrainers)]
f'{ntrainers} LocalTrainers created'

'5 LocalTrainers created'

The next cell takes 50-55 seconds on a 2019 model MacBook Pro 13"...

In [32]:
[lt.train(niterations) for lt in lts]


In [24]:
lps.keys()

['LocalTrainer0',
 'LocalTrainer1',
 'LocalTrainer2',
 'LocalTrainer3',
 'LocalTrainer4']

In a real model-training scenario, the next step might be to combine the trainers' outputs in some way, such as joining the partitions of a model.

Can we make this faster?

In [25]:
import ray
ray.init(num_cpus = ncpus, ignore_reinit_error = True)

2020-04-20 17:35:16,176	ERROR worker.py:682 -- Calling ray.init() again after it has already been called.


**Note:** Because `LocalParameterServer.keys()` already returns a list rather than a dictionary keys object, we can implement a remote parameter server simply by subclassing `LocalParameterServer` and applying the `@ray.remote` decorator.

In [26]:
@ray.remote
class RemoteParameterServer(LocalParameterServer):
    pass

In [27]:
rps = RemoteParameterServer.remote()  # Constructed with ".remote()", like we saw for tasks.

However, the corresponding remote trainer has to invoke the remote server's methods with `param_server.x.remote(...)` and fetch any results with `ray.get(...)`. We can't reuse `LocalTrainer`, because its direct method calls would fail on the remote server. So, let's implement a version that uses `RemoteParameterServer`:

In [28]:
@ray.remote
class RemoteTrainer():
    def __init__(self, name, number_parameters, param_server):
        self.name = name
        self.number_parameters = number_parameters
        self.param_server = param_server

    def train(self, number_iterations):
        """
        Simulate training by running 'number_iterations' iterations. For each one, compute 
        random deltas to update the parameters followed by a small sleep value to simulate
        a long-running computation. When finished, set the updated parameters in the parameter
        server.
        
        Returns:
            (name, params)
        """
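        params = ray.get(self.param_server.get.remote(self.name))
        if params is None:
            # Test for None explicitly: `params or ...` raises ValueError once
            # params is a NumPy array with more than one element.
            params = np.zeros(self.number_parameters)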
        for _ in range(number_iterations):
            deltas = [random.uniform(-0.1,0.1) for _ in range(self.number_parameters)]
            params = np.add(params, deltas)
            time.sleep(sleep_interval)
        # Wait for the update, so the parameters are stored before train() returns.
        ray.get(self.param_server.set.remote(self.name, params))
        return (self.name, params)

In [29]:
rts = [RemoteTrainer.remote(f'RemoteTrainer{i}', nparams, rps) for i in range(ntrainers)]
f'{ntrainers} RemoteTrainers starting'

'5 RemoteTrainers starting'

The next cell takes about 10 seconds on a 2019 model MacBook Pro 13". Compare with the local, sequential run above.

In [30]:
%time ray.get([rt.train.remote(niterations) for rt in rts])

CPU times: user 277 ms, sys: 107 ms, total: 384 ms
Wall time: 10.3 s


[('RemoteTrainer0',
  array([-0.76578108,  0.64613257, -0.42304991,  0.31097099, -0.53648638,
          0.78404893, -0.80588895, -1.07725911, -0.09400659, -0.15740178,
         -0.90572274,  0.09697633, -0.17633377,  0.16215123,  0.16490174,
         -0.82036564, -0.01410076, -1.05618224, -0.79512241, -0.55465338,
          0.91992343, -0.44730101,  0.12511925,  0.0896709 , -1.25615709,
         -0.09593135,  0.02951344,  0.99751251,  1.6713826 ,  0.44934778,
         -0.08681112, -0.38581867, -0.39818031,  0.22117686,  0.07371581,
         -0.85347809,  0.19613094, -0.44982533, -0.07463884,  0.08599995,
          0.14580855, -0.86564899, -0.69032072,  0.03033596, -0.32182171,
         -0.06738351,  0.51779742,  0.47534912, -0.93897788, -0.17648893,
          0.43458898,  0.05148876,  0.85317384, -0.0750082 ,  0.29649449,
         -0.08501202,  0.09668437, -0.01648455,  0.50063064,  0.21857703,
          0.08161356, -0.65354514,  0.20471728,  1.04165816, -0.00244974,
         -0.502026

In [33]:
ray.get(rps.keys.remote())

['RemoteTrainer3',
 'RemoteTrainer2',
 'RemoteTrainer1',
 'RemoteTrainer4',
 'RemoteTrainer0']

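So, the remote run took about one fifth of the time of the local run: roughly 10 seconds versus 50-55 seconds. Recall that `ntrainers` is `5` at the top of this notebook, and that each trainer sleeps for `niterations * sleep_interval` = 100 × 0.1 = 10 seconds in total. Hence, the results indicate that the `LocalTrainers` ran sequentially, one after another, while the `RemoteTrainers` ran in parallel.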
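The same sequential-versus-parallel effect can be reproduced without Ray. The sketch below uses standard library threads as a stand-in for actors, which works here only because the simulated work is `time.sleep`, which does not hold the interpreter lock. The `fake_train` function and the small values of `n_workers`, `n_iterations`, and `pause` are assumptions for illustration, scaled down so the example finishes quickly.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_train(n_iterations, pause):
    """Hypothetical stand-in for a trainer: sleep once per iteration."""
    for _ in range(n_iterations):
        time.sleep(pause)
    return n_iterations


n_workers, n_iterations, pause = 5, 10, 0.01

# Sequential: each worker starts only after the previous one has finished.
start = time.perf_counter()
for _ in range(n_workers):
    fake_train(n_iterations, pause)
sequential = time.perf_counter() - start

# Parallel: all workers sleep at the same time on separate threads.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    list(pool.map(lambda _: fake_train(n_iterations, pause), range(n_workers)))
parallel = time.perf_counter() - start

print(f"sequential: {sequential:.2f}s, parallel: {parallel:.2f}s")
```

With five workers, the sequential time should be close to five times the parallel time, matching the ratio observed between the local and remote trainers. For CPU-bound work, threads would not give this speedup in Python, which is one reason to run the trainers as separate Ray actors instead.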