# Solutions to exercises in the `ray-core` Lessons

First, import everything we'll need and start Ray:

In [1]:
import ray, time, sys
import numpy as np

In [2]:
def pnd(n, duration, prefix=''):
    """Print an integer and a time duration, with an optional prefix."""
    prefix2 = prefix if len(prefix) == 0 else prefix+' '
    print('{:s}n: {:2d}, duration: {:6.3f} seconds'.format(prefix2, n, duration))

def pd(duration, prefix=''):
    """Print a time duration, with an optional prefix."""
    prefix2 = prefix if len(prefix) == 0 else prefix+' '
    print('{:s}duration: {:6.3f} seconds'.format(prefix2, duration))

In [3]:
ray.init(ignore_reinit_error=True)

2020-04-15 09:27:21,679	INFO resource_spec.py:212 -- Starting Ray with 4.54 GiB memory available for workers and up to 2.29 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-04-15 09:27:22,021	INFO services.py:1148 -- View the Ray dashboard at localhost:8265


{'node_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:11486',
 'object_store_address': '/tmp/ray/session_2020-04-15_09-27-21_670092_4807/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-04-15_09-27-21_670092_4807/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-04-15_09-27-21_670092_4807'}

## Exercise 1 in 02-TaskParallelism-Part1

You were asked to convert the regular Python code to Ray code. Here are the three cells appropriately modified.

The imports and `ray.init()` that the exercise requires were already run at the top of this notebook, so only the converted cells follow.

In [4]:
@ray.remote
def slow_square(n):
    time.sleep(n)
    return n*n

In [5]:
start = time.time()
ids = [slow_square.remote(n) for n in range(4)]
squares = ray.get(ids)
duration = time.time() - start

In [6]:
assert squares == [0, 1, 4, 9]
# passes now that the tasks run in parallel (the longest one sleeps 3 seconds):
assert duration < 4.1, f'duration = {duration}' 
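
As a rough check on the `4.1` second bound, the sleep times alone give an estimate for each approach. This is a back-of-the-envelope sketch, not a measurement: it ignores scheduling overhead and assumes Ray has a free worker for every task.

```python
# slow_square(n) sleeps for n seconds, and the exercise uses n = 0, 1, 2, 3.
sleep_times = list(range(4))

# Running the calls one after another costs the sum of the sleeps.
sequential_estimate = sum(sleep_times)

# Running them all at once costs only the longest sleep
# (assumption: every task gets its own worker immediately).
parallel_estimate = max(sleep_times)

print(f'sequential: about {sequential_estimate} seconds')
print(f'parallel:   about {parallel_estimate} seconds')
```

Under those assumptions only the parallel version can finish inside the bound used in the assertion.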

## Exercise 2 in 03-TaskParallelism-Part2

You were asked to use `ray.wait()` with a shorter timeout, `2.5` seconds. First we need to redefine in this notebook the remote functions we used in that lesson:

In [7]:
@ray.remote
def make_array(n):
    time.sleep(n/10.0)
    return np.random.standard_normal(n)

@ray.remote
def add_arrays(a1, a2):
    time.sleep(a1.size/10.0)
    return np.add(a1, a2)

In [8]:
start = time.time()
array_ids = [make_array.remote(n*10) for n in range(5)]
added_array_ids = [add_arrays.remote(id, id) for id in array_ids]

arrays = []
waiting_ids = list(added_array_ids)        # Assign a working list to the full list of ids
while len(waiting_ids) > 0:                # Loop until all tasks have completed
    # Call ray.wait with:
    #   1. the list of ids we're still waiting to complete,
    #   2. tell it to return immediately as soon as TWO of them complete,
    #   3. tell it to wait up to 2.5 seconds before timing out.
    return_n = 2 if len(waiting_ids) > 1 else 1
    ready_ids, remaining_ids = ray.wait(waiting_ids, num_returns=return_n, timeout=2.5)
    print('Returned {:3d} completed tasks. (elapsed time: {:6.3f})'.format(len(ready_ids), time.time() - start))
    new_arrays = ray.get(ready_ids)
    arrays.extend(new_arrays)
    for array in new_arrays:
        print(f'{array.size}: {array}')
    waiting_ids = remaining_ids  # Reset this list; don't include the completed ids in the list again!
    
print(f"\nall arrays: {arrays}")
pd(time.time() - start, prefix="Total time:")

Returned   2 completed tasks. (elapsed time:  2.018)
0: []
10: [ 1.86775352  2.9454727  -5.04052366 -4.33534297  1.6263961   0.39691723
 -1.25821447 -0.29963352 -2.3948451   3.30066021]
Returned   1 completed tasks. (elapsed time:  5.537)
20: [ 1.47778499  2.54213133 -0.78084779  0.46153321  0.60669328 -1.42004875
  0.25173278  0.83958007 -3.21123112 -3.46694075  0.24307297 -0.62017288
 -0.71827436  0.80699748 -0.17335598 -1.11008536  2.80218211 -2.01117915
 -3.20214683 -2.39029264]
Returned   2 completed tasks. (elapsed time:  8.016)
30: [ 0.79918898  1.41123412  0.80206555  1.57553845 -1.22381679 -0.01631308
  0.09983679  1.4726712   3.00156156 -0.9257089   2.37555648  1.77130338
 -1.67851542  0.80422592 -3.30197833 -0.09043308  1.55741247 -0.09579186
  5.74481545 -2.21937375  0.10755704 -0.20810047 -4.62367772 -5.12056826
  1.30345284  1.27297868  3.98460268  1.82273513  3.09241094  0.37172731]
40: [-1.31749739 -0.48467903  3.29539476  1.52684404 -3.04442408 -0.77205294
  3.61668242

For a timeout of `2.5` seconds, the second call to `ray.wait()` times out before two tasks finish, so it returns only one completed task. Why did the third and final iteration not time out? (That is, it successfully returned two items.) Because all the tasks were running in parallel, the last two had enough time to finish. If you use a shorter timeout, you'll see more timeouts, where zero or one items are returned.

Try `1.5` seconds, where the first iteration returns two items and every other iteration times out and returns one item.
Try `0.5` seconds, where you'll get several iterations that time out and return zero items, while all the other iterations time out and return one item.
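
The loop pattern is not specific to Ray. As an analogy only, here is a minimal sketch of the same wait-with-timeout loop using the standard library's `concurrent.futures.wait` in place of `ray.wait()`. The `slow_double` function and the scaled-down sleep times are made up for illustration, and `return_when=FIRST_COMPLETED` only approximates `num_returns=1`; it has no direct equivalent of `num_returns=2`.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def slow_double(n):
    """Hypothetical stand-in for a remote task: sleep in proportion to n, return 2*n."""
    time.sleep(n / 20.0)
    return 2 * n

results = []
empty_timeouts = 0
with ThreadPoolExecutor(max_workers=5) as pool:
    pending = {pool.submit(slow_double, n) for n in range(5)}
    while pending:
        # Return as soon as at least one future finishes, or after 0.03 seconds.
        done, pending = wait(pending, timeout=0.03, return_when=FIRST_COMPLETED)
        if not done:
            empty_timeouts += 1   # the call timed out with nothing ready
        results.extend(future.result() for future in done)

print(sorted(results))
print(f'{empty_timeouts} calls timed out with nothing ready')
```

As in the Ray version, the second value returned by the wait call becomes the new working set, so finished work is never waited on twice.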

## Exercise 3 in 03-TaskParallelism-Part2

You were asked to convert the code to use Ray, especially `ray.wait()`.

In [9]:
@ray.remote
def slow_square(n):
    time.sleep(n)
    return n*n

start = time.time()
ids = [slow_square.remote(n) for n in range(4)]
squares = []
waiting_ids = ids
while len(waiting_ids) > 0:
    finished_ids, waiting_ids = ray.wait(waiting_ids)  # We just assign the second list to waiting_ids...
    squares.extend(ray.get(finished_ids))
duration = time.time() - start

In [10]:
assert squares == [0, 1, 4, 9]
assert duration < 4.1, f'duration = {duration}' 

## Exercise 4 - "Homework" - in 04-DistributedStateWithActors

Since profiling shows that `live_neighbors` is the bottleneck, what could be done to reduce its execution time? The solution shown here reduces its overhead by about 40%. Not bad. 

The solution also implements parallel invocations of grid updates, rather than doing the whole grid in sequential steps.

As discussed in lesson 4, these kinds of optimizations make sense when you _really_ have a compelling reason to squeeze optimal performance out of the code. Hence, this optimization exercise will mostly appeal to those of you with such requirements or who enjoy low-level performance optimizations like this.

Note that the `util` directory has standalone Python scripts you can play with, such as `util/micro-perf-tests.py`, which tests three variants of `live_neighbors`, including the original version. Also, `util/Ex4-GameOfLife.py` is the program used to develop the solution shown here.

If you tried the "easier experiments" suggested, such as enhancing `RayConwaysRules.step()` to accept a `num_steps` argument, you probably found that they didn't improve performance. As for the non-Ray game, this change only moves processing around but doesn't parallelize it more than before, so performance is about the same.

> **Note:** This solution is only partially done. For a work in progress, see `../../util/Ex4-GameOfLife.py`.

First, let's redefine a few things from that notebook, including the exercise code.

In [11]:
grid_size = 100
max_steps = 200

def cleanup(ids):
    for id in ids: 
        id.__ray_terminate__.remote()

In [12]:
print(f'http://{ray.get_webui_url()}')

http://localhost:8265


For comparison, my runs with the exercise code before improvements were about 12 to 12.5 seconds.

If you look at `RayGame2.step`, it calls `RayConwaysRules.step` one step at a time, using remote calls. This seems like a good place for improvement. Let's extend `RayConwaysRules.step` to do more than one step, just like `RayGame2.step` already supports.

Changes are indicated with comments.

In [13]:
class State:
    """
    Represents a grid of game cells.
    For simplicity, require square grids.
    Each instance is considered immutable.
    """
    def __init__(self, grid = None, size = 10):
        """
        Create a State. Specify either a grid of cells or a size, for
        which a size x size grid will be computed with random values.
        (For simplicity, only use square grids.)
        """
        if grid is not None:  # identity test; comparing a NumPy array with != is elementwise
            assert grid.shape[0] == grid.shape[1]
            self.size = grid.shape[0]
            self.grid = grid.copy()
        else:
            self.size = size
            # Seed: random initialization
            self.grid = np.random.random(size*size).reshape((size, size)).round()

    def living_cells(self):
        """
        Returns ([x1, x2, ...], [y1, y2, ...]) for all living cells.
        Simplifies graphing.
        """
        cells = [(i,j) for i in range(self.size) for j in range(self.size) if self.grid[i][j] == 1]
        return zip(*cells)

    def __str__(self):
        s = ' |\n| '.join([' '.join(map(lambda x: '*' if x else ' ', self.grid[i])) for i in range(self.size)])
        return '| ' + s + ' |'

In [14]:
@ray.remote
class RayConwaysRules:
    """
    Apply the rules to a state and return a new state.
    """
    def step(self, state, num_steps = 1):
        """
        Determine the next values for all the cells, based on the current
        state. Creates a new State for each step and returns a list of the
        new states, which has more than one element when num_steps > 1.
        """
        new_states = []
        for n in range(num_steps):
            new_grid = state.grid.copy()
            for i in range(state.size):
                for j in range(state.size):
                    lns = self.live_neighbors(i, j, state)
                    new_grid[i][j] = self.apply_rules(i, j, lns, state)
            state = State(grid = new_grid)  # the next step must start from this new state
            new_states.append(state)
        return new_states

    def apply_rules(self, i, j, live_neighbors, state):
        """
        Determine next value for a cell, which could be the same.
        The rules for Conway's Game of Life:
            Any live cell with fewer than two live neighbours dies, as if by underpopulation.
            Any live cell with two or three live neighbours lives on to the next generation.
            Any live cell with more than three live neighbours dies, as if by overpopulation.
            Any dead cell with exactly three live neighbours becomes a live cell, as if by reproduction.
        """
        cell = state.grid[i][j]  # default value is no change in state
        if cell == 1:
            if live_neighbors < 2 or live_neighbors > 3:
                cell = 0
        elif live_neighbors == 3:
            cell = 1
        return cell

    def live_neighbors(self, i, j, state):
        """
        This is a faster implementation than the original one.
        Wrap at boundaries (i.e., treat the grid as a 2-dim "toroid")
        """
        s = state.size
        g = state.grid
        im1 = i-1 if i > 0   else s-1
        ip1 = i+1 if i < s-1 else 0
        jm1 = j-1 if j > 0   else s-1
        jp1 = j+1 if j < s-1 else 0
        return g[im1][jm1] + g[im1][j] + g[im1][jp1] + g[i][jm1] + g[i][jp1] + g[ip1][jm1] + g[ip1][j] + g[ip1][jp1]
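
To see the rules and the wraparound neighbor count in action without Ray, here is a minimal sketch that applies them to a "blinker", a row of three live cells that flips between horizontal and vertical on each step. It assumes a plain list-of-lists grid rather than the `State` class, and the standalone functions `step` and `multi_step` are hypothetical simplifications of the methods above. Note that `multi_step` feeds each new grid into the next step.

```python
def live_neighbors(i, j, grid):
    """Count the eight neighbors of cell (i, j), wrapping at the grid edges."""
    s = len(grid)
    im1 = i-1 if i > 0   else s-1
    ip1 = i+1 if i < s-1 else 0
    jm1 = j-1 if j > 0   else s-1
    jp1 = j+1 if j < s-1 else 0
    return (grid[im1][jm1] + grid[im1][j] + grid[im1][jp1] +
            grid[i][jm1]                  + grid[i][jp1] +
            grid[ip1][jm1] + grid[ip1][j] + grid[ip1][jp1])

def apply_rules(cell, lns):
    """Conway's rules: survive with 2 or 3 live neighbors, be born with exactly 3."""
    if cell == 1:
        return 1 if lns in (2, 3) else 0
    return 1 if lns == 3 else 0

def step(grid):
    """Return a new grid; the input grid is not modified."""
    s = len(grid)
    return [[apply_rules(grid[i][j], live_neighbors(i, j, grid))
             for j in range(s)] for i in range(s)]

def multi_step(grid, num_steps):
    """Take num_steps steps, each starting from the result of the previous one."""
    states = []
    for _ in range(num_steps):
        grid = step(grid)
        states.append(grid)
    return states

horizontal = [[0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0]]
vertical   = [[0, 0, 0, 0, 0],
              [0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 0, 0]]

states = multi_step(horizontal, 2)
print(states[0] == vertical, states[1] == horizontal)
```

A known oscillator like this makes a convenient regression test when changing `live_neighbors` or the stepping logic.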

In [15]:
@ray.remote
class RayGame2:
    # TODO: Game memory grows unbounded; trim older states?
    def __init__(self, initial_state, rules_id):
        self.states = [initial_state]
        self.rules_id = rules_id

    def step(self, num_steps = 1):
        """Take 1 or more steps, returning a list of new states."""
        start_index = len(self.states)
        new_state_ids = self.rules_id.step.remote(self.states[-1], num_steps)
        self.states.extend(ray.get(new_state_ids))
        return self.states[start_index:]  # return the new states only!
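
The slice that picks out only the new states is easy to check on a plain list. In this small sketch the strings stand in for `State` objects and are purely illustrative. A slice from `start_index` to the end keeps every appended element, while a slice that ends at `-1` stops before the final one.

```python
states = ['s0']                  # the initial state
start_index = len(states)        # index where the new states will begin
states.extend(['s1', 's2'])      # two new states arrive

new_only = states[start_index:]          # every new state
dropped_last = states[start_index:-1]    # stops before the final element

print(new_only)       # ['s1', 's2']
print(dropped_last)   # ['s1']
```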

In [16]:
def time_ray_games2(num_games = 10, max_steps = max_steps, batch_size = 1, grid_size = grid_size):
    game_ids = [RayGame2.remote(State(size = grid_size), RayConwaysRules.remote()) for i in range(num_games)]
    start = time.time()
    state_ids = []
    for game_id in game_ids:
        for i in range(max_steps // batch_size):  # max_steps game steps in total, taken in batches of batch_size
            state_ids.append(game_id.step.remote(batch_size))
    ray.get(state_ids)  # wait for everything to finish! We are ignoring what ray.get() returns, but what will it be??
    pd(time.time() - start, prefix = f'Total time for {num_games} games (max_steps = {max_steps}, batch_size = {batch_size})')
    return game_ids  # for cleanup afterwards

In [17]:
ids1 = time_ray_games2(num_games = 1, max_steps = max_steps, batch_size=1, grid_size=grid_size)
ids2 = time_ray_games2(num_games = 1, max_steps = max_steps, batch_size=50, grid_size=grid_size)

Total time for 1 games (max_steps = 200, batch_size = 1) duration:  7.174 seconds
Total time for 1 games (max_steps = 200, batch_size = 50) duration:  6.871 seconds


In [18]:
cleanup(ids1)
cleanup(ids2)

The new implementation of `live_neighbors` has a noticeable benefit, but batching doesn't make a large difference.
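
One plausible reading of those timings, offered as an assumption rather than a profiling result: batching changes how many calls are made to `RayGame2.step`, but not how many cells have to be updated. The hypothetical helper `workload` below just does that arithmetic for the two runs above.

```python
def workload(max_steps, batch_size, grid_size):
    """Return (calls to RayGame2.step, total cell updates) for one game."""
    step_calls = max_steps // batch_size
    cell_updates = max_steps * grid_size * grid_size
    return step_calls, cell_updates

for batch_size in (1, 50):
    calls, updates = workload(max_steps=200, batch_size=batch_size, grid_size=100)
    print(f'batch_size = {batch_size:2d}: {calls:3d} calls, {updates:,} cell updates')
```

The call count drops from 200 to 4, while the two million cell updates, each of which calls `live_neighbors`, stay the same.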

In this same `solutions` directory, you'll find standalone Python scripts used to explore optimizations. All have a `--help` option that describes command-line options, e.g., `python micro-perf-tests.py --help`.

* The new `live_neighbors` implementation was tested separately using the standalone Python script `micro-perf-tests.py`.
* The updated Game of Life implementation above was explored in `Ex4-GameOfLife.py`.
* A more aggressive refactoring that processes the grid updates in async (parallel) blocks is `Ex4-GameOfLife-blocks.py`. The comments at the top of this file compare how well it performs vs. `Ex4-GameOfLife.py`. As written, it only improves performance by about 3%.

Therefore, the big win is optimizing `live_neighbors`.
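
The standalone scripts are not reproduced here. As a rough sketch of that kind of micro-test, the block below checks that the conditional-index version of `live_neighbors` agrees with a simpler baseline and then times both with `timeit`. `live_neighbors_mod` is a hypothetical baseline written for this comparison, not necessarily the lesson's original implementation, and the grid is a plain list of lists rather than a NumPy array, so the timings it prints should not be expected to match the 40% figure.

```python
import random
import timeit

def live_neighbors_mod(i, j, grid):
    """Hypothetical baseline: loop over the 3x3 window, wrapping with modulo."""
    s = len(grid)
    total = 0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di != 0 or dj != 0:      # skip the cell itself
                total += grid[(i + di) % s][(j + dj) % s]
    return total

def live_neighbors_fast(i, j, grid):
    """Same index arithmetic as the solution: precompute the wrapped indices."""
    s = len(grid)
    im1 = i-1 if i > 0   else s-1
    ip1 = i+1 if i < s-1 else 0
    jm1 = j-1 if j > 0   else s-1
    jp1 = j+1 if j < s-1 else 0
    return (grid[im1][jm1] + grid[im1][j] + grid[im1][jp1] +
            grid[i][jm1]                  + grid[i][jp1] +
            grid[ip1][jm1] + grid[ip1][j] + grid[ip1][jp1])

random.seed(0)                           # fixed seed so runs are repeatable
size = 20
grid = [[random.randint(0, 1) for _ in range(size)] for _ in range(size)]

def count_all(fn):
    """Apply fn to every cell and return the counts in row-major order."""
    return [fn(i, j, grid) for i in range(size) for j in range(size)]

# Correctness first: both versions must agree on every cell.
same = count_all(live_neighbors_mod) == count_all(live_neighbors_fast)
print(f'implementations agree: {same}')

# Then timing; the numbers depend on the machine and are only indicative.
for fn in (live_neighbors_mod, live_neighbors_fast):
    seconds = timeit.timeit(lambda: count_all(fn), number=20)
    print(f'{fn.__name__}: {seconds:6.3f} seconds')
```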