# Parallel computation with Ray

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/coobas/europython-25/blob/main/04-ray.ipynb)

[Ray](https://docs.ray.io/en/latest/index.html) is a set of libraries that, among other things, make it easy to parallelise Python tasks, both locally and on clusters. This part is called **Ray Core**.

Apart from this, it also provides specialised libraries for data processing (**Ray Data**), for machine learning and even reinforcement learning (**Ray Train**, **RLlib**, ...). We will not deal with those in this workshop.

In [None]:
# Run this in Google Colab; not needed if you install these packages locally
!pip install numpy ray[default]

In [None]:
# Obligatory imports
import numpy as np
import pandas as pd
import ray
import plotly.express as px
from IPython import get_ipython

from pathlib import Path

In [48]:
GOOGLE_COLAB = "google.colab" in str(get_ipython())

# Download the data which are part of this repo
if GOOGLE_COLAB:
    import urllib.request
    url = "https://github.com/coobas/europython-25/raw/refs/heads/main/data.parquet"
    urllib.request.urlretrieve(url, "data.parquet")

In [None]:
def long_running(i: int) -> int:
    """A long running task that we will parallelise."""
    import time
    time.sleep(i)
    return i * i

In [None]:
%%time
long_running(1)

In [None]:
%%time
[long_running(i) for i in range(10)]
    

If these tasks are independent, we can run them in parallel. There are of course options in Python itself:

- multiprocessing (https://docs.python.org/3.13/library/multiprocessing.html) 
- threading (with GIL-releasing code or with caution in free-threaded Python 3.13+)

These do not scale well once there are more tasks than CPUs / GPUs on your machine...
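
As a point of comparison, here is a minimal sketch of the threading route using only the standard library. `short_task` is a hypothetical stand-in for `long_running` with a shorter sleep; since `time.sleep` releases the GIL, the five calls can overlap in threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def short_task(i: int) -> int:
    """Hypothetical stand-in for long_running with a short, fixed sleep."""
    time.sleep(0.2)
    return i * i


start = time.perf_counter()
# Five worker threads, so all five sleeps can run at the same time
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(short_task, range(5)))
elapsed = time.perf_counter() - start

print(results, f"{elapsed:.2f} s")
```

This works on a single machine only; spreading the same work over several machines is where a tool like Ray comes in.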

Other options:
- [celery](https://docs.celeryq.dev/en/stable/)
- [dask](https://docs.dask.org/en/stable/index.html)

Each has its own strengths and weaknesses.

## Use ray

Ray always runs a server (even if only implicitly) and executes tasks on nodes that it manages.

In [None]:
ray.init()

This either connects to an existing local server, or creates a new one. (In the same way, we can connect to a different one).

### Remote functions

Any function we decorate with `ray.remote` becomes a **task** that can be submitted to this server.

In [None]:
@ray.remote
def long_running_ray(i: int) -> int:
    import time
    time.sleep(1)
    return i * i

In [None]:
long_running_ray(1)  # This will fail

Well, the error message is right: we should call the function's `.remote()` method (a different `remote` from the decorator!).

Better (and see that we pass arguments the same way):

In [None]:
%%time
long_running_ray.remote(1)

What happened? The task was submitted to the cluster. But asynchronously. We have to capture the **future** object...

In [None]:
task_id = long_running_ray.remote(2)

...and get its value (synchronously):

In [None]:
ray.get(task_id)

In [None]:
%%time
task_ids = [long_running_ray.remote(i) for i in range(10)]
ray.get(task_ids)

## Exercise: Compute k nearest neighbours in parallel

We will reuse our definitions of the kNN functions (slightly modified) and find the neighbours of some random points:

In [None]:
DEFAULT_K = 4   # How many nearest neighbors to consider

def calculate_distances(query_points: np.ndarray, reference_points: np.ndarray, *, n_dim: int = 3) -> np.ndarray:
    """
    Calculate mutual Euclidean distances between M query and N reference points.

    Parameters:
    ----------
    query_points: np.ndarray
        (M, n_dim+) array of query points
    reference_points: np.ndarray
        (N, n_dim+) array of reference points
    n_dim: int
        Number of dimensions to consider (default: 3, for x, y, floor)

    Returns:
    --------
    distances: np.ndarray
        (M, N) array of the distances
    """
    # Expand for broadcasting
    query_points = query_points[:, np.newaxis,:n_dim]
    reference_points = reference_points[np.newaxis, :, :n_dim]
    return np.sqrt(np.sum((reference_points - query_points) ** 2, axis=-1))


def knn_search(
    query_points: np.ndarray,
    reference_points: np.ndarray,
    k: int,
):
    """
    Find k nearest neighbour reference point indices for N query points.

    Returns:
    --------
    indices: np.ndarray
        (N, k) matrix of integral indices
    """
    distances = calculate_distances(query_points, reference_points).T
    return np.argpartition(distances, k, axis=0)[:k].T
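
A note on `np.argpartition` as used in `knn_search`: it only guarantees that the first `k` indices point at the `k` smallest values, not that they are ordered by distance. A small sketch with made-up distances, compared against a full `np.argsort`:

```python
import numpy as np

# Made-up distances from one query point to six reference points
distances = np.array([0.9, 0.1, 0.5, 0.3, 0.8, 0.2])
k = 3

# Indices of the k smallest distances, in no particular order
nearest = np.argpartition(distances, k)[:k]

# Full sort for comparison: the same set of indices, ordered by distance
nearest_sorted = np.argsort(distances)[:k]

print(sorted(nearest.tolist()), nearest_sorted.tolist())
```

Both approaches select the same neighbours; the partition simply avoids the cost of sorting everything.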

Let's create some points to test this on:

In [None]:
def create_random_points(
    n_points: int, *, n_dim: int = 3, seed: int = 42
) -> np.ndarray:
    # TODO: Fix floor!
    np.random.seed(seed)
    return np.random.sample((n_points, n_dim))

In [33]:
query_points = create_random_points(5, seed=42)
reference_points = create_random_points(10, seed=84)

query_points

array([[0.37454012, 0.95071431, 0.73199394],
       [0.59865848, 0.15601864, 0.15599452],
       [0.05808361, 0.86617615, 0.60111501],
       [0.70807258, 0.02058449, 0.96990985],
       [0.83244264, 0.21233911, 0.18182497]])

In [34]:
calculate_distances(query_points, reference_points)

array([[0.82743077, 0.65400806, 0.78621956, 0.74247728, 0.88559438,
        0.2181387 , 1.02385479, 0.93615029, 0.14672897, 0.61668925],
       [0.59894045, 0.83501096, 0.50990624, 0.59320397, 0.63363197,
        1.0200325 , 0.57310898, 0.94268293, 0.87078142, 0.90995193],
       [0.61223402, 0.74973814, 0.64515347, 0.58989451, 1.0734169 ,
        0.54579517, 1.16489686, 0.89312126, 0.30165584, 0.49965924],
       [1.04399996, 1.28482524, 0.80396027, 0.79439734, 0.4948328 ,
        0.88969344, 0.43420167, 0.53282295, 0.95103093, 1.36282904],
       [0.80437579, 0.80570071, 0.70154747, 0.77224803, 0.49620442,
        1.00473388, 0.45730809, 1.05327225, 0.91005286, 0.98827047]])

First, we run this without ray (for sizable arrays):

In [40]:
%%time
query_points = create_random_points(32768, seed=42)
reference_points = create_random_points(1000, seed=84)
knn_search(query_points, reference_points, k=DEFAULT_K)

CPU times: user 1.88 s, sys: 104 ms, total: 1.98 s
Wall time: 1.99 s


array([[532, 606, 302, 371],
       [271, 552, 872, 544],
       [520, 125, 178, 885],
       ...,
       [820, 102, 966, 620],
       [339, 990, 107, 114],
       [847, 267,  66, 724]], shape=(32768, 4))

**Task**

1. Change the `knn_search` function into a task.
2. Submit the computation to Ray and compare the results.
3. Look at the execution times.

In [42]:
@ray.remote
def knn_search_ray(
    query_points: np.ndarray,
    reference_points: np.ndarray,
    k: int,
):
    """
    Find k nearest neighbour reference point indices for N query points.

    Returns:
    --------
    indices: np.ndarray
        (N, k) matrix of integral indices
    """
    distances = calculate_distances(query_points, reference_points).T
    return np.argpartition(distances, k, axis=0)[:k].T

In [43]:
%%time
knn_id = knn_search_ray.remote(query_points, reference_points, k=DEFAULT_K)
knn_results = ray.get(knn_id)

CPU times: user 10.6 ms, sys: 4.04 ms, total: 14.6 ms
Wall time: 2 s


Is there any improvement yet?

## Monitoring ray

Ray comes with a nice dashboard that lets you observe running jobs. It runs on a local web server, most likely at http://localhost:8265. This address is not accessible when running within Google Colab, so you have to use a special trick to show a mini-window forwarded to the dashboard running in the cloud.

In [49]:
if GOOGLE_COLAB:
    from google.colab import output
    output.serve_kernel_port_as_iframe(8265)  # The port may differ!
else:
    print("Not in google Colab. Try the local link, it should work.")

Not in google Colab. Try the local link, it should work.


In [None]:
%%time
query_points = create_random_points(1000000, seed=42)
reference_points = create_random_points(100, seed=84)
knn_search(query_points, reference_points, k=DEFAULT_K)

CPU times: user 6.21 s, sys: 354 ms, total: 6.56 s
Wall time: 6.59 s


array([[ 8, 93,  5, 26],
       [10, 95, 47, 51],
       [91, 82, 22, 79],
       ...,
       [84, 33, 62, 32],
       [ 8, 98, 93, 63],
       [94, 21, 51, 16]], shape=(1000000, 4))

## Exercise: Parallelize kNN execution

## END THIS

We have predefined reference points in an external file, modeled using an (unknown?) analytical function:

Let's see how the whole thing looks without ray:

In [None]:
query_points = create_random_points(10)
query_points    

In [None]:
prices = compute_prices(query_points, reference_points)
prices

In [None]:
combine_points_and_prices(query_points, prices)

**Tasks**

1. Create a remote function variant of `compute_prices` (called e.g. `compute_prices_ray`).
2. Run it in Ray and get the result.
3. Compare the result to the previous one.
4. Compare the execution time (for a larger number of points).

In [None]:
...

## Exercise: Create a map of house prices

We would like to create a map of prices, i.e. sample data in a grid over the allowed area and use our kNN model to predict a price for each of those (given the floor as a parameter). 

In [None]:
N_POINTS = 10   # Default number of points in each dimension for the grid
LIMIT = 10.0    # +/- Span of the grid

def compute_prices(query_points, reference_points, k: int = DEFAULT_K):
    """
    Estimate prices for N query points as the mean price of their k nearest neighbours.

    Parameters:
    -----------
    query_points: np.ndarray
        (N, 3) array of query points
    reference_points: np.ndarray
        (M, 4) array of data points with x, y, floor, and price
    k: int
        Number of nearest neighbors to consider

    Returns:
    --------
    prices: np.ndarray
        (N,) array of prices
    """
    indices = knn_search(query_points, reference_points, k)
    prices: np.ndarray = reference_points[indices, 3]
    return prices.mean(axis=1)


def combine_points_and_prices(
    query_points: np.ndarray, prices: np.ndarray
) -> pd.DataFrame:
    """
    Prepare human-friendly output from numpy arrays.

    Returns:
    --------
    df: pd.DataFrame
        DataFrame with columns x, y, floor, price
    """
    return pd.DataFrame(
        {
            "x": query_points[:,0],
            "y": query_points[:,1],
            "floor": query_points[:,2],
            "price": prices,
        }
    )
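
To make the indexing step in `compute_prices` concrete, here is a tiny worked sketch with made-up reference points and a pretend kNN result (none of these numbers come from the workshop data):

```python
import numpy as np

# Made-up reference points: columns are x, y, floor, price
reference_points = np.array([
    [0.0, 0.0, 1.0, 100.0],
    [1.0, 0.0, 1.0, 200.0],
    [0.0, 1.0, 1.0, 300.0],
    [5.0, 5.0, 1.0, 900.0],
])

# Pretend kNN result for two query points with k = 2
indices = np.array([[0, 1], [2, 3]])

# Fancy indexing yields a (2, 2) array of neighbour prices; average each row
prices = reference_points[indices, 3].mean(axis=1)
print(prices)
```

The first query point gets the mean of 100 and 200, the second the mean of 300 and 900.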

In [None]:
def load_reference_points_df(path: Path = Path("data.parquet")) -> pd.DataFrame:
    """
    Load reference data points from a parquet file.

    Returns:
    --------
    df: pd.DataFrame
        DataFrame of reference points (expected columns: x, y, floor, price)
    """
    return pd.read_parquet(path)
    # return df[["x", "y", "floor", "price"]].to_numpy().astype(float)

reference_points_df = load_reference_points_df()
reference_points_df

In [None]:
def create_grid(n_points: int = N_POINTS) -> tuple[np.ndarray, ...]:
    """
    Create a homogeneous grid of points to create a map.

    Returns:
    --------
    x: np.ndarray
        Flattened (n_points x n_points,) array of x values
    y: np.ndarray
        Flattened (n_points x n_points,) array of y values
    """
    # Note: Tested indirectly via `create_query_points`
    # TODO: Add floor
    x = np.linspace(-LIMIT, LIMIT, n_points)
    y = np.linspace(-LIMIT, LIMIT, n_points)
    return tuple(arr.flatten() for arr in np.meshgrid(x, y))


def create_query_points(n_points: int = N_POINTS, floor: int = 1) -> np.ndarray:
    """
    Create a homogeneous grid of points with a floor to create a map.

    Returns:
    --------
    query_points: np.ndarray
        (n_points x n_points, 3) array of query points
    """
    x, y = create_grid(n_points=n_points)
    return np.vstack([x, y, np.ones(x.shape[0]) * floor]).T


create_query_points(2)
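
Once a value has been computed for every flattened grid point, it can be reshaped back into a 2-D map. A minimal sketch, assuming the same `np.meshgrid` layout as `create_grid`; `fake_prices` is a placeholder formula standing in for the kNN prediction:

```python
import numpy as np

n_points = 3
axis = np.linspace(-10.0, 10.0, n_points)
xx, yy = np.meshgrid(axis, axis)
x, y = xx.flatten(), yy.flatten()

# Placeholder standing in for the predicted price at each grid point
fake_prices = x + 10 * y

# Back to (n_points, n_points): rows follow y, columns follow x
price_map = fake_prices.reshape(n_points, n_points)
print(price_map)
```

An array of this shape can be passed directly to an image-style plot, with one cell per grid point.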

TODO: Create an actor?

In [None]:
def draw_points(points: np.ndarray) -> None:
    """
    Draw points on a map.

    Parameters:
    -----------
    points: np.ndarray
        (N, 3) array of points to draw
    """
    df = pd.DataFrame({"x": points[:,0], "y": points[:,1]})
    fig = px.scatter(df, x="x", y="y", title="Query points")
    fig.show()

draw_points(create_query_points(5))

## Exercise: Make our computation faster