# Ray Crash Course - Tasks

Let's see some examples that demonstrate how to use the Ray API and the benefits it provides.

> **Tip:** For more about Ray, see [ray.io](https://ray.io) or the [Ray documentation](https://ray.readthedocs.io/en/latest/).

In [1]:
import math, statistics, random, time, sys
import ray

In [2]:
# See the file ./bokeh_util.py for these Bokeh helper functions.
from bokeh_util import square_circle_plot, two_lines_plot, means_stddevs_plot

from bokeh.plotting import show, gridplot

In [3]:
# See the file ./pi_util.py for these helper functions for computing Pi.
from pi_util import monte_carlo_pi, compute_pi_for

To see the benefits of parallel execution, let's use an example where we compute `π` (3.14159...). Imagine a square piece of paper, 2 meters on each side. Draw a circle inside it with radius 1 meter, centered on the center point of the paper. The circle will touch the edges of the paper.

Here is a graph to illustrate what I mean:

In [4]:
scp_plot = square_circle_plot(radius=1.0, title="Square vs. Circle")
show(scp_plot)

Now suppose you throw `N` darts at the paper. Some will land inside the circle (call that number `n`), and the rest, `N-n`, will land outside. The area of this circle is `πrr` (that is, "pi r squared"), where `r=1`, and the area of the square is `(2r)(2r) = 4rr`. If the darts land uniformly at random, the ratio `n/N` approximates the ratio of the circle area to the square area, `πrr/4rr = π/4`. (Does it make sense that this ratio is independent of the actual radius value?)

In other words,

```
π/4 = n/N
π = 4n/N
```
Therefore, to approximate π, we can count the number of darts thrown and the number that land inside the circle.

> **Note:** This is a [_Monte Carlo_](https://en.wikipedia.org/wiki/Monte_Carlo_method) calculation of π, where we randomly sample a _uniform distribution_, one with equal probability of picking any point between -1 and 1.
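
To make the dart-throwing idea concrete, here is a minimal sketch of the calculation using only the standard library. The name `sketch_monte_carlo_pi` and its `seed` parameter are assumptions made for illustration; this is not the `monte_carlo_pi` implementation in `./pi_util.py`, which may differ.

```python
import random

def sketch_monte_carlo_pi(num_samples, seed=None):
    """Estimate pi by sampling points uniformly in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)  # private generator, so a seed makes runs repeatable
    inside = 0
    for _ in range(num_samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:  # the dart landed inside the unit circle
            inside += 1
    return 4.0 * inside / num_samples  # pi = 4n/N

estimate = sketch_monte_carlo_pi(10_000, seed=42)
print(f"approximate pi: {estimate:.4f}")
```

Comparing squared distances avoids a square root for every dart, which matters once `N` gets large.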

The `monte_carlo_pi` function we imported above does this calculation for whatever number of points we tell it to sample. It also returns all the points as separate `x` and `y` lists, which we'll use for plotting:

In [5]:
pi1000,  xs1000,  ys1000  = monte_carlo_pi(1000,  return_points=True) # return the Pi estimate and the points.
pi10000, xs10000, ys10000 = monte_carlo_pi(10000, return_points=True) # return the Pi estimate and the points.
print('π for 1000: {:8.6f}, 10000: {:8.6f}'.format(pi1000, pi10000))

π for 1000: 3.160000, 10000: 3.150400


Note that π isn't very accurate with just 1000 points. Let's plot these points for both `N`. 

In [6]:
scp_plot1000 = square_circle_plot(radius=1.0, title="1000 Points")
scp_plot1000.circle(xs1000, ys1000, color='lightgrey', size=4)
scp_plot10000 = square_circle_plot(radius=1.0, title="10000 Points")
scp_plot10000.circle(xs10000, ys10000, color='lightgrey', size=2)
grid_two_plots = gridplot([[scp_plot1000, scp_plot10000]])
show(grid_two_plots)                

At 10,000 points, the distribution looks reasonably good, but clearly not enough for accuracy better than 3.14.

**Exercise:** Try using smaller and larger numbers in the call to `monte_carlo_pi()`. How do the values for π and the graph change?

Okay, if we want to improve our accuracy, we need to throw a lot more darts (i.e., sample more points `N`), but this quickly gets computationally expensive. Let's see what happens. 

First a few definitions and some helper functions we'll use. The helper functions are also in `./pi_util.py`, but we'll define them here so we can define new versions later. 

We use `just_pi` to call `monte_carlo_pi` and return just the approximate π.

We use `compute_pi_loop` to run our calculation `num_workers` times, so the results can be averaged. 

In [7]:
num_workers = 16  # We'll do this many calculations for a given N and average the results.

Ns = [500, 1000, 5000, 10000, 50000, 100000, 500000, 1000000] #,  5000000, 10000000] # for a LONG wait! 

def just_pi(N):
    approx_pi, xs, ys = monte_carlo_pi(N, return_points=False)
    return approx_pi

def compute_pi_loop(N):
    return [just_pi(N) for i in range(num_workers)]

Finally, there's a helper function in `./pi_util.py` called `compute_pi_for`. It loops over the passed-in `Ns` and calls the passed-in `compute_pi_loop` for each one. It then computes the average of the approximate π results, the error (π - approx. π), and the standard deviation over the `num_workers` results. It also prints out this data and returns all of it as lists.
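
If you don't have `./pi_util.py` open, the following is a rough sketch of what such a helper could look like. The name `sketch_compute_pi_for`, the percent-error formula, and the stand-in `fake_loop` are assumptions made for illustration; the real `compute_pi_for` may be written differently.

```python
import math
import statistics
import time

def sketch_compute_pi_for(Ns, compute_pi_loop):
    """For each N, time compute_pi_loop(N) and summarize the estimates it returns."""
    ns, means, stddevs, errors, durations = [], [], [], [], []
    for N in Ns:
        start = time.time()
        approx_pis = compute_pi_loop(N)        # a list of pi estimates for this N
        duration = time.time() - start
        mean = statistics.mean(approx_pis)
        stddev = statistics.stdev(approx_pis)  # sample std dev; needs at least 2 values
        error = 100.0 * abs(math.pi - mean) / math.pi  # percent error (assumed definition)
        print(f"# samples = {N:9d}: ~pi = {mean:8.6f} (stddev = {stddev:8.6f}), "
              f"error = {error:8.6f}%, duration = {duration:9.5f} seconds")
        ns.append(N)
        means.append(mean)
        stddevs.append(stddev)
        errors.append(error)
        durations.append(duration)
    return ns, means, stddevs, errors, durations

def fake_loop(N):
    """Hypothetical stand-in that ignores N and returns two fixed 'estimates'."""
    return [3.0, 3.2]

demo_ns, demo_means, demo_stddevs, demo_errors, demo_durations = \
    sketch_compute_pi_for([10, 100], fake_loop)
```

Because the loop function is passed in as an argument, the same helper can time both a serial loop and a parallel one without any changes.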

In [8]:
ns, means, stddevs, errors, durations = compute_pi_for(Ns, compute_pi_loop)

# samples =       500: ~pi = 3.147000 (stddev = 0.091582), error = 0.172121%, duration =   0.00622 seconds
# samples =      1000: ~pi = 3.158500 (stddev = 0.040315), error = 0.538178%, duration =   0.01179 seconds
# samples =      5000: ~pi = 3.142600 (stddev = 0.021493), error = 0.032065%, duration =   0.05131 seconds
# samples =     10000: ~pi = 3.140750 (stddev = 0.011877), error = 0.026822%, duration =   0.10640 seconds
# samples =     50000: ~pi = 3.141885 (stddev = 0.008662), error = 0.009306%, duration =   0.50912 seconds
# samples =    100000: ~pi = 3.144852 (stddev = 0.005128), error = 0.103764%, duration =   1.03753 seconds
# samples =    500000: ~pi = 3.142065 (stddev = 0.002811), error = 0.015035%, duration =   5.19971 seconds
# samples =   1000000: ~pi = 3.141930 (stddev = 0.000975), error = 0.010738%, duration =  10.58749 seconds


On my machine, the last calculation takes 10-11 seconds. If you run for 5M and 10M, it takes about 50+ seconds and 100+ seconds, respectively! At least the errors and standard deviations of the `num_workers` results improve (but can oscillate) as we use higher `N`.
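
Why do the standard deviations shrink so slowly? Each dart is an independent trial that lands inside the circle with probability π/4, so the estimate `4n/N` has a standard deviation of `4 * sqrt(p(1-p)/N)` with `p = π/4`. Here is a small sketch of that formula; the function name `expected_stddev` is an assumption made for illustration.

```python
import math

def expected_stddev(num_samples):
    """Theoretical standard deviation of the estimate 4n/N for N = num_samples darts."""
    p = math.pi / 4.0  # probability that one dart lands inside the circle
    return 4.0 * math.sqrt(p * (1.0 - p) / num_samples)

for N in [500, 10_000, 1_000_000]:
    print(f"N = {N:8d}: expected stddev ~ {expected_stddev(N):.6f}")
```

The spread falls as `1/sqrt(N)`, so 100 times more samples buys only one more decimal digit. The measured values above come from just 16 trials per `N`, so they scatter around these theoretical numbers rather than matching them exactly.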

We'll plot this data shortly...

## Parallelism with Ray

We did the previous calculation serially and all on one CPU core, while the rest of the CPU cores were idle. In a cluster, the rest of the cores _on the rest of the machines_ would be idle, too.

We can use Ray to parallelize a lot of this work. Let's see how.

Before using Ray, we need to initialize it. We'll tell Ray to pretend we have `num_workers` cores. The `ignore_reinit_error` argument tells Ray not to raise an error if we rerun this cell for some reason. Both are optional arguments.

In [9]:
ray.init(num_cpus=num_workers, ignore_reinit_error=True)

2020-04-20 17:24:28,429	INFO resource_spec.py:212 -- Starting Ray with 3.86 GiB memory available for workers and up to 1.93 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-04-20 17:24:28,825	INFO services.py:1148 -- View the Ray dashboard at localhost:8266


{'node_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:10292',
 'object_store_address': '/tmp/ray/session_2020-04-20_17-24-28_419341_86381/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-04-20_17-24-28_419341_86381/sockets/raylet',
 'webui_url': 'localhost:8266',
 'session_dir': '/tmp/ray/session_2020-04-20_17-24-28_419341_86381'}

The next cell prints the URL for the Ray Dashboard (see also the output from the previous cell). Click it to open the dashboard.

In [10]:
print(f'Dashboard URL: http://{ray.get_webui_url()}')

Dashboard URL: http://localhost:8266


You create a Ray _task_ by decorating a normal Python function with `@ray.remote`. 

For a first pass at optimization, the function that makes sense to parallelize is either `just_pi` or `monte_carlo_pi`, because the calculation for a given `N` doesn't depend on any other call to the function, whether for the same `N` or a different one.

We also need a new version of `compute_pi_loop`, because a Ray task `foo` is invoked using `foo.remote()`. This starts an asynchronous task somewhere in your Ray cluster (or laptop...). Instead of returning the result, an `ObjectID` for a _future_ is returned. When the task finishes, you can retrieve the result using `ray.get()`. 
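
If futures are new to you, the submit-then-collect pattern can be seen in isolation with Python's standard library. This is only an analogy, using `concurrent.futures` rather than Ray: `submit()` plays roughly the role of `.remote()`, and `result()` plays roughly the role of `ray.get()`. The `square` function is a hypothetical placeholder for real work.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    """Hypothetical placeholder for an expensive calculation."""
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(square, i) for i in range(4)]  # returns immediately with futures
    results = [f.result() for f in futures]               # blocks until each result is ready

print(results)  # [0, 1, 4, 9]
```

Ray differs in important ways. For example, its tasks run in separate worker processes that can be spread across a cluster, whereas this thread pool stays inside one Python process.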

In [11]:
@ray.remote
def ray_just_pi(N):
    return just_pi(N)   # No need to redefine, just call just_pi. 

# No @ray.remote needed here, at least for our first optimizations.
def ray_compute_pi_loop(N):
    ids = [ray_just_pi.remote(N) for i in range(num_workers)]  # ids = [...remote(N)... is new
    return ray.get(ids)       # Blocks until all the tasks for the ids are finished.

Now run it!

In [12]:
ray_ns, ray_means, ray_stddevs, ray_errors, ray_durations = compute_pi_for(Ns, ray_compute_pi_loop)

# samples =       500: ~pi = 3.149500 (stddev = 0.074974), error = 0.251699%, duration =   0.01999 seconds
# samples =      1000: ~pi = 3.146250 (stddev = 0.036689), error = 0.148248%, duration =   0.01153 seconds
# samples =      5000: ~pi = 3.138800 (stddev = 0.018149), error = 0.088893%, duration =   0.02956 seconds
# samples =     10000: ~pi = 3.137900 (stddev = 0.013046), error = 0.117541%, duration =   0.05424 seconds
# samples =     50000: ~pi = 3.142640 (stddev = 0.003944), error = 0.033338%, duration =   0.26011 seconds
# samples =    100000: ~pi = 3.141332 (stddev = 0.005709), error = 0.008281%, duration =   0.54617 seconds
# samples =    500000: ~pi = 3.141412 (stddev = 0.001734), error = 0.005734%, duration =   2.28570 seconds
# samples =   1000000: ~pi = 3.141676 (stddev = 0.001338), error = 0.002645%, duration =   3.36734 seconds


For the largest `N`, it's roughly 3x faster. I ran 16 parallel tasks, so why not a 16x improvement? It's because I have a four-core laptop, which was fully utilized during these runs, whereas the previous calculation kept it only 20-25% utilized. On a Ray cluster with at least 16 cores, you could approach a 16x improvement, minus some factor for networking overhead.

Let's graph our results:

In [13]:
two_lines = two_lines_plot(
    "Execution Times (Smaller Is Better)", 'N', 'Time', 'No Ray', 'Ray', ns, durations, ray_ns, ray_durations)
show(two_lines, plot_width=800, plot_height=400)

Notice that this is a log-log plot.

For relatively small `N` values, the overhead of Ray, although small, is larger than the benefit of using it to parallelize the computation. For larger `N`, using Ray is a win. On a full cluster, the times could be dramatically better for larger `N`.
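
To put numbers on this, here is a small sketch that divides the serial durations by the Ray durations for each `N`. The values are copied from the two printed runs above, so they reflect one four-core laptop; the variable names are assumptions made for illustration.

```python
# Durations in seconds, copied from the two printed runs above.
sample_sizes = [500, 1000, 5000, 10000, 50000, 100000, 500000, 1000000]
serial_durations = [0.00622, 0.01179, 0.05131, 0.10640, 0.50912, 1.03753, 5.19971, 10.58749]
ray_durations_recorded = [0.01999, 0.01153, 0.02956, 0.05424, 0.26011, 0.54617, 2.28570, 3.36734]

# A speedup above 1.0 means the Ray run was faster than the serial run.
speedups = [s / r for s, r in zip(serial_durations, ray_durations_recorded)]

for n, speedup in zip(sample_sizes, speedups):
    verdict = "Ray faster" if speedup > 1.0 else "Ray slower"
    print(f"N = {n:8d}: speedup = {speedup:5.2f}x ({verdict})")
```

On this data the speedup is below 1 for the smallest `N` and climbs toward the number of physical cores as `N` grows, which is what you would expect when a fixed per-task overhead is spread over more work.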

Let's plot the approximate mean values and the standard deviations over the `num_workers` trials for each `N`.

In [14]:
pi_without_ray_plot = means_stddevs_plot(
  ns, means, stddevs, title = 'π Results Without Ray')
# Use a grid to make it lay out better.
pi_without_ray_grid = gridplot([[pi_without_ray_plot]], plot_width=1000, plot_height=400)
show(pi_without_ray_grid)

You may have to scroll horizontally (click and drag) to see all of the graph.

With Ray:

In [15]:
pi_with_ray_plot = means_stddevs_plot(
    ray_ns, ray_means, ray_stddevs, 'π Results With Ray')
# Use a grid to make it lay out better.
pi_with_ray_grid = gridplot([[pi_with_ray_plot]], plot_width=1000, plot_height=400)
show(pi_with_ray_grid)