# Ray Crash Course - Python Multiprocessing with Ray

This lesson explores how to swap two popular multiprocessing libraries for their Ray equivalents, so that your code can scale beyond a single machine:

* [`multiprocessing.Pool`](https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool) for general management of process pools.
* [`joblib`](https://joblib.readthedocs.io/en/latest/), the underpinnings of [scikit-learn](https://scikit-learn.org/stable/), which Ray can scale to a cluster.

We also examine how Ray can work with Python's [`asyncio`](https://docs.python.org/3/library/asyncio.html).

> **Tip:** For more about Ray, see [ray.io](https://ray.io) or the [Ray documentation](https://docs.ray.io/en/latest/).

In [None]:
import ray, time, sys, os
import numpy as np

In [None]:
!../tools/start-ray.sh

In [None]:
ray.init(address='auto', ignore_reinit_error=True)

If you are running this notebook on a local machine, the next cell prints the URL of the Ray Dashboard:

In [None]:
print(f'Dashboard URL: http://{ray.get_webui_url()}')

## Drop-in Replacements for Popular Single-node, Multiprocessing Libraries

The Python community has three popular libraries for working around the limits of Python's _global interpreter lock_ to enable better multiprocessing and concurrency. Ray offers drop-in replacements for two of them, [`multiprocessing.Pool`](https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool) and [`joblib`](https://joblib.readthedocs.io/en/latest/), and integrates with the third, Python's [`asyncio`](https://docs.python.org/3/library/asyncio.html).

This section explores the `multiprocessing.Pool` and `joblib` replacements.

| Library | Library Docs | Ray Docs | Description |
| :------ | :----------- | :------- | :---------- |
| `multiprocessing.Pool` | [docs](https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool) | [Ray](https://docs.ray.io/en/latest/multiprocessing.html) | Create a pool of processes for running work. The Ray replacement allows scaling to a cluster. |
| `joblib` | [docs](https://joblib.readthedocs.io/en/latest/) | [Ray](https://docs.ray.io/en/latest/joblib.html) | Ray supports running distributed [scikit-learn](https://scikit-learn.org/stable/) programs by implementing a Ray backend for `joblib` using Ray Actors instead of local processes. This makes it easy to scale existing applications that use scikit-learn from a single node to a cluster. |


### Multiprocessing.Pool

If your application already uses `multiprocessing.Pool`, then scaling beyond a single machine only requires changing your import statement from this:

```python
from multiprocessing.pool import Pool
```

To this:

```python
from ray.util.multiprocessing import Pool
```

A local Ray cluster will be started the first time you create a Pool and your tasks will be distributed across it. See [Run on a Cluster](https://docs.ray.io/en/latest/multiprocessing.html#run-on-a-cluster) in the Ray documentation for details on how to use a multi-node Ray cluster instead.

Here is an example:

In [None]:
from ray.util.multiprocessing import Pool

def f(index):
    return index

pool = Pool()
for result in pool.map(f, range(100)):
    print(f'{result}|', end='')

The full `multiprocessing.Pool` API is currently supported. Please see the [multiprocessing documentation](https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing.pool) for details.
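
Beyond `map`, the `Pool` API includes methods such as `apply_async`, `starmap`, and `imap`. The following is a minimal sketch of those calls, not code from the Ray documentation. So that it runs without a cluster, it uses the standard library's `ThreadPool`, which exposes the same methods as `multiprocessing.Pool`. The assumption, based on the statement above, is that swapping the import for `from ray.util.multiprocessing import Pool` yields the same results. The helper functions `square` and `add` are hypothetical.

```python
# Stand-in so this sketch runs without a cluster: ThreadPool exposes the
# same methods as multiprocessing.Pool. To try it on Ray, swap this import
# for `from ray.util.multiprocessing import Pool` (assumed to behave the same).
from multiprocessing.pool import ThreadPool as Pool


def square(x):
    return x * x


def add(x, y):
    return x + y


pool = Pool(4)

# map: blocks until all results are ready and preserves input order.
squares = pool.map(square, range(5))

# apply_async: submit one call now, collect the result later with .get().
pending = pool.apply_async(add, (2, 3))
total = pending.get(timeout=10)

# starmap: unpack each tuple into positional arguments.
sums = pool.starmap(add, [(1, 2), (3, 4)])

# imap: lazy iterator that still yields results in input order.
lazy_squares = list(pool.imap(square, range(5)))

# Stop accepting work and wait for the workers to exit.
pool.close()
pool.join()

print(squares, total, sums, lazy_squares)
```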

### Joblib

Ray supports running distributed [scikit-learn](https://scikit-learn.org/) programs by implementing a Ray backend for [joblib](https://joblib.readthedocs.io/) using Ray Actors instead of local processes. This makes it easy to scale existing applications that use scikit-learn from a single node to a cluster.

> **Note:** This API is new and may be revised in the future. Please [report any issues](https://github.com/ray-project/ray/issues) you encounter.

To get started, use `from ray.util.joblib import register_ray` and then run `register_ray()`. This will register Ray as a `joblib` backend for `scikit-learn` to use. Then run your original `scikit-learn` code inside `with joblib.parallel_backend('ray')`. This will start a local Ray cluster. 

See [Run on a Cluster](https://docs.ray.io/en/latest/joblib.html#run-on-a-cluster) in the Ray documentation for details on how to use a multi-node Ray cluster instead.

Here is an example:

In [None]:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

In [None]:
digits = load_digits()
param_space = {
    'C': np.logspace(-6, 6, 30),
    'gamma': np.logspace(-8, 8, 30),
    'tol': np.logspace(-4, -1, 30),
    'class_weight': [None, 'balanced'],
}
model = SVC(kernel='rbf')
search = RandomizedSearchCV(model, param_space, cv=5, n_iter=300, verbose=10)

In [None]:
import joblib
from ray.util.joblib import register_ray
register_ray()

> **Note:** The next cell will take a while!

In [None]:
with joblib.parallel_backend('ray'):
    search.fit(digits.data, digits.target)

### Using Ray with asyncio

Python's [`asyncio`](https://docs.python.org/3/library/asyncio.html) can be used with Ray actors and tasks.

> **Note:** The Async API support is experimental and work is ongoing to improve it. Please [report any issues](https://github.com/ray-project/ray/issues) you encounter.

#### Actors
Here is an actor example, adapted from the [Ray documentation](https://docs.ray.io/en/latest/async_api.html).

In [None]:
import asyncio

@ray.remote
class AsyncActor:
    # Multiple invocations of this method can be running in
    # the event loop at the same time.
    async def run_concurrent(self):
        print("started")
        await asyncio.sleep(2)   # Concurrent workload here
        print("finished")

actor = AsyncActor.remote()

# regular ray.get
ray.get([actor.run_concurrent.remote() for _ in range(4)])

# async ray.get
await actor.run_concurrent.remote()

#### Async Tasks

For Ray tasks, the object IDs they return can be converted to `asyncio.Future` instances.

In [None]:
@ray.remote
def some_task():
    return 1

# The normal Ray way:
ray.wait([some_task.remote()])
ray.get(some_task.remote())

In [None]:
# asyncio alternative way:
await some_task.remote()
await asyncio.wait([some_task.remote()])

See the [asyncio docs](https://docs.python.org/3/library/asyncio-task.html) for more details on `asyncio` patterns, including timeouts and `asyncio.gather`.
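
As a rough illustration of the two patterns just mentioned, here is a sketch of `asyncio.gather` and a timeout with `asyncio.wait_for`. It uses only the standard library. The coroutine `fetch` is a hypothetical stand-in for an awaitable such as the result of `some_task.remote()`, so the sketch shows the `asyncio` side only and makes no claim about Ray's behavior.

```python
import asyncio


async def fetch(value, delay):
    # Hypothetical stand-in for an awaitable such as `some_task.remote()`.
    await asyncio.sleep(delay)
    return value


async def main():
    # gather: await several awaitables concurrently.
    # Results come back in argument order, not completion order.
    results = await asyncio.gather(fetch(1, 0.05), fetch(2, 0.01), fetch(3, 0.03))

    # wait_for: give up on a slow awaitable after a timeout.
    try:
        await asyncio.wait_for(fetch("slow", 1.0), timeout=0.05)
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True
    return results, timed_out


# In a notebook, where an event loop is already running,
# use `results, timed_out = await main()` instead.
results, timed_out = asyncio.run(main())
print(results, timed_out)
```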

#### Async Actor

Ray also supports concurrent multitasking by executing many actor tasks at once. To do so, you can define an actor with async methods:

In [None]:
@ray.remote
class AsyncActor:
    async def run_task(self):
        print("started")
        await asyncio.sleep(1) # Network, I/O task here
        print("ended")

actor = AsyncActor.remote()

In the following invocation, all 10 tasks should start at once and finish at about the same time, roughly one second later. Note the _wall time_ that is printed.

In [None]:
%time ray.get([actor.run_task.remote() for _ in range(10)])

Under the hood, Ray runs all of the methods inside a single Python event loop.

> **Note:** Running blocking `ray.get` and `ray.wait` inside async actor methods is not allowed, because `ray.get` will block the execution of the event loop.
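
To see why a blocking call inside a coroutine is a problem, consider this sketch, which uses plain `asyncio` rather than Ray. Here `time.sleep` is an assumed stand-in for a blocking call such as `ray.get`, and the function names are hypothetical. The blocking version holds the event loop, so the calls run one after another, while the cooperative version lets them overlap.

```python
import asyncio
import time

DELAY = 0.05  # seconds per call
CALLS = 4


async def blocking_call():
    # time.sleep stands in for a blocking call such as ray.get:
    # it holds the event loop, so no other coroutine can make progress.
    time.sleep(DELAY)


async def cooperative_call():
    # `await` yields control to the event loop while waiting.
    await asyncio.sleep(DELAY)


async def timed(coro_fn):
    # Run CALLS copies of the coroutine concurrently and time the batch.
    start = time.perf_counter()
    await asyncio.gather(*(coro_fn() for _ in range(CALLS)))
    return time.perf_counter() - start


async def main():
    return await timed(blocking_call), await timed(cooperative_call)


# In a notebook, use `await main()` instead of asyncio.run(main()).
blocking_time, cooperative_time = asyncio.run(main())
print(f"blocking: {blocking_time:.2f}s, cooperative: {cooperative_time:.2f}s")
```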

You can limit the number of tasks running concurrently with the `max_concurrency` option. By default, 1000 tasks can run concurrently.

In the following cell, we set `max_concurrency` to `3`, so the subsequent cell will run tasks three at a time. Since there are ten tasks in total, they run in four batches (3 + 3 + 3 + 1) of about one second each, so it should take about four seconds to run.

In [None]:
actor = AsyncActor.options(max_concurrency=3).remote()

In [None]:
%time ray.get([actor.run_task.remote() for _ in range(10)])
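
The four-second estimate can be checked without Ray. The following sketch simulates the concurrency limit with an `asyncio.Semaphore`; this is an assumed analogy for illustration, not a description of how Ray implements `max_concurrency`. The sleep is scaled down to 0.1 seconds, so the expected wall time is about 0.4 seconds instead of 4.

```python
import asyncio
import math
import time

TASKS = 10
MAX_CONCURRENCY = 3
TASK_SECONDS = 0.1  # scaled down from the 1-second sleep in run_task

# Ten tasks, three at a time: ceil(10 / 3) = 4 batches.
expected_batches = math.ceil(TASKS / MAX_CONCURRENCY)


async def run_task(limit):
    # At most MAX_CONCURRENCY coroutines hold the semaphore at once.
    async with limit:
        await asyncio.sleep(TASK_SECONDS)


async def main():
    limit = asyncio.Semaphore(MAX_CONCURRENCY)
    start = time.perf_counter()
    await asyncio.gather(*(run_task(limit) for _ in range(TASKS)))
    return time.perf_counter() - start


# In a notebook, use `elapsed = await main()` instead.
elapsed = asyncio.run(main())
print(f"{expected_batches} batches, elapsed {elapsed:.2f}s")
```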

In [None]:
ray.shutdown()  # "Undo ray.init()". Terminate all the processes started in this notebook.