# Ray Crash Course - Tasks

Let's quickly explore the Ray API using some examples that demonstrate how Ray enables horizontal scalability.

> **Tip:** For more about Ray, see [ray.io](https://ray.io) or the [Ray documentation](https://ray.readthedocs.io/en/latest/).

In [1]:
import math, statistics, random, time, sys
import ray

Some of the "boilerplate" code we need for graphing and for the π calculation is in two Python files in this directory:

* [./bokeh_util.py](bokeh_util.py): For plotting with [Bokeh](https://docs.bokeh.org/en/latest/index.html)
* [./pi_calc.py](pi_calc.py): For calculating π

In [2]:
from bokeh_util import square_circle_plot, two_lines_plot, means_stddevs_plot

from bokeh.plotting import show, save, gridplot

In [3]:
from pi_calc import monte_carlo_pi, compute_pi_for

To see the benefits of parallel execution, let's use an example where we compute `π` (3.14159....). Imagine a square piece of paper, 2 meters by 2 meters. Draw a circle inside it with radius 1 meter, centered on the center point of the paper. The circle will touch the edges of the paper.

Here is a graph to illustrate what I mean:

In [4]:
scp_plot = square_circle_plot(radius=1.0, title="Square vs. Circle")
show(scp_plot)

(If you can't see it, click [here](../images/Circle-vs-Square.png).)

Now suppose you throw $N$ darts at the paper. Some number of them, $n$, will land inside the circle, and the rest, $N-n$, will land outside. The area of the circle is $πr^{2}$ and the area of the square is $(2r)^{2} = 4r^{2}$. The ratio $n/N$ _approximately_ equals the ratio of the circle area to the square area, $πr^{2}/4r^{2} = π/4$. (Does it make sense that this ratio is independent of the actual radius value?)

In other words,

$π/4 \approx n/N$

$π \approx 4n/N$

So, to approximate $π$, we can count the number of darts thrown and the number that land inside the circle.
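
The dart-counting idea can be sketched in a few lines of plain Python. This is only an illustration, not the actual `monte_carlo_pi` from `pi_calc.py`; the name `sketch_monte_carlo_pi` and its `seed` parameter are hypothetical and exist only for this example.

```python
import random

def sketch_monte_carlo_pi(num_points, seed=None):
    """Estimate π by sampling points uniformly in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)          # A seed makes the run repeatable.
    inside = 0                         # n: darts that land inside the unit circle.
    for _ in range(num_points):        # N: total darts thrown.
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:       # Inside (or on) the circle of radius 1.
            inside += 1
    return 4.0 * inside / num_points   # π ≈ 4n/N

estimate = sketch_monte_carlo_pi(100_000, seed=42)
print(f"π ≈ {estimate:.4f}")
```

Without a seed, every run gives a slightly different estimate, which is why the tutorial later averages several runs.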

> **Note:** This is a [_Monte Carlo_](https://en.wikipedia.org/wiki/Monte_Carlo_method) calculation of π, where we randomly sample a _uniform distribution_, one with equal probability of picking points between -1 and 1.

The `monte_carlo_pi` function we imported above does this calculation for whatever number of points we tell it to sample. It also returns all the points as separate `x` and `y` lists, which we'll use for plotting:

In [5]:
pi1000,  xs1000_in,  ys1000_in,  xs1000_out,  ys1000_out  = monte_carlo_pi(1000,  return_points=True) # return the Pi estimate and the points.
pi10000, xs10000_in, ys10000_in, xs10000_out, ys10000_out = monte_carlo_pi(10000, return_points=True) # return the Pi estimate and the points.
print('π for 1000: {:8.6f}, 10000: {:8.6f}'.format(pi1000, pi10000))

π for 1000: 3.040000, 10000: 3.150400


Note that the estimate for π isn't very accurate with just 1000 points. Let's plot these points for both values of `N`.

In [6]:
scp_plot1000 = square_circle_plot(radius=1.0, title="1000 Points")
scp_plot1000.circle(xs1000_in,  ys1000_in,  color='pink',    size=4)
scp_plot1000.circle(xs1000_out, ys1000_out, color='skyblue', size=4)
scp_plot10000 = square_circle_plot(radius=1.0, title="10000 Points")
scp_plot10000.circle(xs10000_in,  ys10000_in,  color='pink',    size=2)
scp_plot10000.circle(xs10000_out, ys10000_out, color='skyblue', size=2)
grid_two_plots = gridplot([[scp_plot1000, scp_plot10000]])
show(grid_two_plots)                

(If you can't see it, click [here](../images/1000-vs-10000-points.png).)

Even with 10,000 points, the distribution shows small gaps, and the estimate above (3.1504) is only good to about two significant digits. Many more points are needed for better accuracy.

**Exercise:** Try using smaller and larger numbers in the call to `monte_carlo_pi()` above. How do the values for π and the graphs change?

Okay, if we want to improve our accuracy, we need to throw a lot more darts (i.e., sample more points `N`), but this quickly gets computationally expensive. Let's see what happens. 

First a few definitions and some helper functions we'll use. The helper functions are also in [`./pi_calc.py`](./pi_calc.py), but we'll define them here so we can define new versions later. 

We use `just_pi` to call `monte_carlo_pi` and return just the approximate π.

We use `compute_pi_loop` to run our calculation `num_workers` times; the results will be averaged.

In [7]:
num_workers = 16  # We'll do this many calculations for a given N and average the results.

Ns = [500, 1000, 5000, 10000, 50000, 100000, 500000, 1000000] #,  5000000, 10000000] # for a LONG wait! 

def just_pi(N):
    approx_pi, xs_in, ys_in, xs_out, ys_out = monte_carlo_pi(N, return_points=False)
    return approx_pi

def compute_pi_loop(N):
    return [just_pi(N) for i in range(num_workers)]

Finally, there's a helper function in [`./pi_calc.py`](pi_calc.py) called `compute_pi_for`. It loops over the passed-in `Ns` and calls the passed-in `compute_pi_loop` for each one. It then computes the mean of the approximate π results, the error (π - approx. π, printed as a percentage of π), and the standard deviation over the `num_workers` results. It also prints out this data and returns all of it as lists.
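
As a rough sketch of the per-`N` statistics described above, the following computes the mean, standard deviation, and percentage error for one list of estimates. The function name `summarize_pi_estimates` is hypothetical, and the use of the sample standard deviation is an assumption; the real `compute_pi_for` in `pi_calc.py` may differ in both respects.

```python
import math
import statistics

def summarize_pi_estimates(estimates):
    """Return (mean, stddev, error %) for a list of π estimates."""
    mean = statistics.mean(estimates)
    stddev = statistics.stdev(estimates)   # Assumption: sample standard deviation.
    error_pct = 100.0 * abs(math.pi - mean) / math.pi
    return mean, stddev, error_pct

# Made-up estimates, for illustration only.
mean, stddev, error_pct = summarize_pi_estimates([3.12, 3.16, 3.14, 3.15])
print(f"~pi = {mean:.6f} (stddev = {stddev:.6f}), error = {error_pct:.6f}%")
```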

In [8]:
ns, means, stddevs, errors, durations = compute_pi_for(Ns, compute_pi_loop)

# samples =       500: ~pi = 3.133500 (stddev = 0.069357), error = 0.257597%, duration =   0.00682 seconds
# samples =      1000: ~pi = 3.137000 (stddev = 0.050838), error = 0.146189%, duration =   0.01261 seconds
# samples =      5000: ~pi = 3.128900 (stddev = 0.014606), error = 0.404020%, duration =   0.05927 seconds
# samples =     10000: ~pi = 3.144475 (stddev = 0.016644), error = 0.091748%, duration =   0.10500 seconds
# samples =     50000: ~pi = 3.140945 (stddev = 0.009299), error = 0.020615%, duration =   0.49687 seconds
# samples =    100000: ~pi = 3.140378 (stddev = 0.003816), error = 0.038680%, duration =   1.03011 seconds
# samples =    500000: ~pi = 3.142784 (stddev = 0.002914), error = 0.037938%, duration =   4.96644 seconds
# samples =   1000000: ~pi = 3.141700 (stddev = 0.001618), error = 0.003417%, duration =   9.92790 seconds


On my machine, the last calculation takes 10-11 seconds. If you run for 5M and 10M, it takes about 50+ seconds and 100+ seconds, respectively! At least the errors and standard deviations of the `num_workers` results improve (but can oscillate) as we use higher `N`.
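
The shrinking standard deviation has a simple explanation. Each dart is an independent trial that lands inside the circle with probability $p = π/4$, so the standard deviation of a single estimate $4n/N$ is $4\sqrt{p(1-p)/N}$, which falls off as $1/\sqrt{N}$. The sketch below evaluates that formula; the helper name `predicted_stddev` is hypothetical. It gives roughly 0.073 for `N = 500` and 0.0016 for `N = 1000000`, in line with the measured 0.069 and 0.0016 in the output above.

```python
import math

def predicted_stddev(num_points):
    """Theoretical standard deviation of one Monte Carlo π estimate."""
    p = math.pi / 4.0                                   # Probability a dart lands inside.
    return 4.0 * math.sqrt(p * (1.0 - p) / num_points)  # Binomial standard error, scaled by 4.

for n in [500, 10_000, 1_000_000]:
    print(f"N = {n:>9}: predicted stddev ≈ {predicted_stddev(n):.6f}")
```

Since the error shrinks only as $1/\sqrt{N}$, each additional decimal digit of accuracy costs about 100 times more samples.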

We'll plot this data shortly...

## Parallelism with Ray

We did the previous calculation serially and all on one CPU core, while the rest of the CPU cores were idle. In a cluster, the rest of the cores _on the rest of the machines_ would be idle, too.

We can use Ray to parallelize a lot of this work. Let's see how.

Before using Ray, we need to initialize it. We'll tell Ray to pretend we have `num_workers` cores. The `ignore_reinit_error` argument tells Ray not to raise an error if we rerun this cell for some reason. Both are optional arguments.

In [9]:
ray.init(num_cpus=num_workers, ignore_reinit_error=True)

2020-04-23 15:24:26,330	INFO resource_spec.py:212 -- Starting Ray with 4.35 GiB memory available for workers and up to 2.18 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-04-23 15:24:26,739	INFO services.py:1148 -- View the Ray dashboard at localhost:8265


{'node_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:14071',
 'object_store_address': '/tmp/ray/session_2020-04-23_15-24-26_315787_48536/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-04-23_15-24-26_315787_48536/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-04-23_15-24-26_315787_48536'}

> **Tip:** Having trouble starting Ray? See the [Troubleshooting](../reference/Troubleshooting-Tips-Tricks.ipynb) tips.

The next cell prints the URL for the Ray Dashboard (see also the output from the previous cell). Click it to open the dashboard.

In [10]:
print(f'Dashboard URL: http://{ray.get_webui_url()}')

Dashboard URL: http://localhost:8265


You create a Ray _task_ by decorating a normal Python function with `@ray.remote`. 

For a first pass at optimization, the natural candidates to parallelize are `just_pi` and `monte_carlo_pi`, because each call is independent of every other call, whether it uses the same `N` or a different one.

We also need a new version of `compute_pi_loop`, because a Ray task `foo` is invoked using `foo.remote()`. This starts an asynchronous task somewhere in your Ray cluster (or laptop...). Instead of returning the result, an `ObjectID` for a _future_ is returned. When the task finishes, you can retrieve the result using `ray.get()`. 

In [11]:
@ray.remote
def ray_just_pi(N):
    return just_pi(N)   # No need to redefine, just call just_pi. 

# No @ray.remote needed here, at least for our first optimizations.
def ray_compute_pi_loop(N):
    ids = [ray_just_pi.remote(N) for i in range(num_workers)]  # ids = [...remote(N)... is new
    return ray.get(ids)       # Blocks until all the tasks for the ids are finished.

Now run it!

In [12]:
ray_ns, ray_means, ray_stddevs, ray_errors, ray_durations = compute_pi_for(Ns, ray_compute_pi_loop)

# samples =       500: ~pi = 3.117500 (stddev = 0.068053), error = 0.766893%, duration =   0.05576 seconds
# samples =      1000: ~pi = 3.140000 (stddev = 0.054101), error = 0.050696%, duration =   0.00726 seconds
# samples =      5000: ~pi = 3.149500 (stddev = 0.021234), error = 0.251699%, duration =   0.02326 seconds
# samples =     10000: ~pi = 3.149925 (stddev = 0.013697), error = 0.265227%, duration =   0.04794 seconds
# samples =     50000: ~pi = 3.140055 (stddev = 0.005672), error = 0.048945%, duration =   0.26022 seconds
# samples =    100000: ~pi = 3.139700 (stddev = 0.003847), error = 0.060245%, duration =   0.56302 seconds
# samples =    500000: ~pi = 3.142432 (stddev = 0.002844), error = 0.026701%, duration =   2.25141 seconds
# samples =   1000000: ~pi = 3.142120 (stddev = 0.001367), error = 0.016794%, duration =   3.04414 seconds


For the largest `N`, it's 3-4x faster. I ran 16 parallel tasks, so why not a 16x improvement? Because I have a four-core laptop, which was fully utilized during these runs, whereas the previous calculation used only 20-25% of it. On a Ray cluster with at least 16 available cores, you could approach a 16x improvement, minus some factor for networking overhead.
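
To make the comparison concrete, here is a small sketch that computes the speedup for each `N` from the durations printed in the two outputs above. The numbers are copied from those outputs, so they reflect one run on one machine; the variable names are chosen for this example to avoid overwriting the notebook's own variables.

```python
sample_sizes = [500, 1000, 5000, 10000, 50000, 100000, 500000, 1000000]

# Durations in seconds, copied from the printed output of the two runs above.
printed_serial = [0.00682, 0.01261, 0.05927, 0.10500, 0.49687, 1.03011, 4.96644, 9.92790]
printed_ray    = [0.05576, 0.00726, 0.02326, 0.04794, 0.26022, 0.56302, 2.25141, 3.04414]

# Speedup > 1 means the Ray version was faster.
speedups = [serial / with_ray for serial, with_ray in zip(printed_serial, printed_ray)]

for n, speedup in zip(sample_sizes, speedups):
    print(f"N = {n:>8}: {speedup:5.2f}x")
```

In this particular run, the Ray version is slower for the smallest `N` and a little over 3x faster for the largest.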

Let's graph our results:

In [13]:
two_lines = two_lines_plot(
    "Execution Times (Smaller Is Better)", 'N', 'Time', 'No Ray', 'Ray', ns, durations, ray_ns, ray_durations)
show(two_lines, plot_width=800, plot_height=400)

(If you can't see it, click [here](../images/Crash-Course-Ray-NoRay.png).)

Notice that this is a log-log plot.

For relatively small `N` values, the small overhead of Ray is larger than the benefit of using it to parallelize the computation. Most of the time, using Ray is a win. On a full cluster, the times could be dramatically better for larger `N`.

Let's plot the approximate mean values and the standard deviations over the `num_workers` trials for each `N`.

In [14]:
pi_without_ray_plot = means_stddevs_plot(
  ns, means, stddevs, title = 'π Results Without Ray')
# Use a grid to make it lay out better.
pi_without_ray_grid = gridplot([[pi_without_ray_plot]], plot_width=1000, plot_height=400)
show(pi_without_ray_grid)

(If you can't see it, click [here](../images/Pi-Results-Without-Ray.png).)

You may have to use the controls shown to zoom or scroll horizontally (click and drag) to see all of the graph.

With Ray:

In [15]:
pi_with_ray_plot = means_stddevs_plot(
    ray_ns, ray_means, ray_stddevs, 'π Results With Ray')
# Use a grid to make it lay out better.
pi_with_ray_grid = gridplot([[pi_with_ray_plot]], plot_width=1000, plot_height=400)
show(pi_with_ray_grid)

(If you can't see it, click [here](../images/Pi-Results-With-Ray.png).)

## ray.get() vs. ray.wait()

Calling `ray.get(ids)` blocks until all the tasks that correspond to the input `ids` have completed. That's fine for this example, but what if you're waiting for a number of tasks where some finish more quickly than others? What if you would like to process the completed results as they become available, even while other tasks are still running? That's where `ray.wait()` comes in. Here we'll provide a brief example. For more details, see the [Ray Core - A Deeper Dive](../ray-core/00-Overview.ipynb) tutorial.

In [18]:
@ray.remote
def slow(n):           # An example long-running task.
    time.sleep(n)
    return n

First, let's just use `ray.get()` as we did before, for comparison:

In [19]:
start = time.time()
running_ids = [slow.remote(n) for n in range(5)]
results = ray.get(running_ids)
print(f'{time.time() - start} seconds: results = {results}')

4.010973215103149 seconds: results = [0, 1, 2, 3, 4]


We had to wait four seconds, until the slowest task finished, before we could see any of the results.

Now here's the idiomatic way to use `ray.wait()` with `ray.get()`.

In [20]:
start = time.time()
running_ids = [slow.remote(n) for n in range(5)]
results = []
while len(running_ids) > 0:
    # Returns two lists: what's done, what's still running.
    # Note that we reset "running_ids" to what's still running.
    # The timeout is optional but always a good idea for safety!
    finished_ids, running_ids = ray.wait(running_ids, timeout=2.0)  
    finished = ray.get(finished_ids)
    results.extend(finished)
    print(f'finished: {finished}')
print(f'{time.time() - start} seconds: results = {results}')

finished: [0]
finished: [1]
finished: [2]
finished: [3]
finished: [4]
4.009230136871338 seconds: results = [0, 1, 2, 3, 4]


The total elapsed time is still about four seconds, but instead of waiting for everything to finish, we were able to process a result about once per second, as each task completed.
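
If you already know Python's standard `concurrent.futures` module, the pattern may look familiar. The sketch below is an analogy only and does not use Ray: `pool.submit()` plays the role of `slow.remote()`, `future.result()` the role of `ray.get()`, and `concurrent.futures.wait()` the role of `ray.wait()`. The function `slow_local` is a hypothetical stand-in for `slow`, with shorter sleeps.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def slow_local(n):
    time.sleep(n / 10)      # Shorter sleeps than the Ray example above.
    return n

with ThreadPoolExecutor(max_workers=5) as pool:
    pending = {pool.submit(slow_local, n) for n in range(5)}
    collected = []
    while pending:
        # Returns two sets: futures that are done and futures still running.
        done, pending = wait(pending, timeout=2.0, return_when=FIRST_COMPLETED)
        collected.extend(future.result() for future in done)

print(sorted(collected))
```

One difference to keep in mind: `ray.wait()` returns lists and, by default, returns as soon as one task is ready (`num_returns=1`), while `concurrent.futures.wait()` returns sets that may contain several completed futures at once.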