# Parallel computation with Ray

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/coobas/europython-25/blob/main/98-ray.ipynb)

[Ray](https://docs.ray.io/en/latest/index.html) is a set of libraries that, among other things, make it easy to parallelise Python tasks, both locally and on clusters. This part is called **Ray Core**.

Apart from this, it also provides specialised libraries for data processing (**Ray Data**) and for machine learning and even reinforcement learning (**Ray Train**, **Ray RLlib**, ...). We will not deal with those in this workshop.

In [None]:
# Run this in Google Colab; not needed if you have installed the packages locally
!pip install numpy ray[default]

In [None]:
# Obligatory imports
import numpy as np
import pandas as pd
import ray
import plotly.express as px

from pathlib import Path

In [None]:
def long_running() -> int:
    """A long running task that we will parallelise."""
    import time
    time.sleep(1)
    return 42

In [None]:
%%time
long_running()

In [None]:
%%time
[long_running() for _ in range(10)]

If these tasks are independent, we can run them in parallel. There are of course options in Python itself:

- [multiprocessing](https://docs.python.org/3.13/library/multiprocessing.html)
- threading (with GIL-releasing code or with caution in free-threaded Python 3.13+)

These options do not scale well once there are more tasks than CPUs / GPUs on your machine...
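
As a rough illustration of the threading option, here is a minimal sketch using the standard-library `concurrent.futures.ThreadPoolExecutor`. The function `sleepy_task` is a hypothetical stand-in for `long_running` with a shorter sleep; it works with threads only because `time.sleep` releases the GIL.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def sleepy_task(seconds: float = 0.2) -> int:
    """Hypothetical stand-in for long_running with a shorter sleep."""
    time.sleep(seconds)  # time.sleep releases the GIL, so threads can overlap
    return 42


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    # Submit ten independent tasks; each one runs in its own worker thread
    futures = [pool.submit(sleepy_task) for _ in range(10)]
    results = [future.result() for future in futures]
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f} s")
```

With ten workers the ten sleeps overlap, so the elapsed time should be close to a single sleep rather than ten of them. CPU-bound pure-Python code would not benefit in the same way.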

Other options, each with its own strengths and weaknesses:
- [celery](https://docs.celeryq.dev/en/stable/)
- [dask](https://docs.dask.org/en/stable/index.html)

## Use Ray

Ray always runs a server (started implicitly if necessary) and executes tasks on the nodes that it manages.

In [None]:
ray.init()

In [None]:
@ray.remote
def long_running_ray() -> int:
    import time
    time.sleep(1)
    return 42

In [None]:
long_running_ray()  # This will fail: remote functions cannot be called directly

In [None]:
long_running_ray.remote()

In [None]:
task_id = long_running_ray.remote()

In [None]:
ray.get(task_id)

In [None]:
%%time
task_ids = [long_running_ray.remote() for i in range(10)]
ray.get(task_ids)

## Exercise: Compute prices at many points 

We will reuse our definitions of the kNN functions (slightly modified) and of the random points.

In [None]:

N_POINTS = 10   # Default number of points in each dimension for the grid
LIMIT = 10.0    # +/- Span of the grid
DEFAULT_K = 4   # How many nearest neighbors to consider


def calculate_distances(query_points: np.ndarray, reference_points: np.ndarray) -> np.ndarray:
    """
    Calculate mutual Euclidean distances between M query and N reference points.

    Parameters:
    ----------
    query_points: np.ndarray
        (M, 3) array of query points
    reference_points: np.ndarray
        (N, 3+) array of reference points

    Returns:
    --------
    distances: np.ndarray
        (M, N) array of the distances
    """
    # Expand for broadcasting
    query_points = query_points[:, np.newaxis, :3]
    reference_points = reference_points[np.newaxis, :, :3]
    return np.sqrt(np.sum((reference_points - query_points) ** 2, axis=-1))


def knn_search(
    query_points: np.ndarray,
    reference_points: np.ndarray,
    k: int,
):
    """
    Find k nearest neighbour reference point indices for N query points.

    Returns:
    --------
    indices: np.ndarray
        (N, k) matrix of integral indices
    """
    distances = calculate_distances(query_points, reference_points).T
    return np.argpartition(distances, k, axis=0)[:k].T


In [102]:
def compute_prices(query_points, reference_points, k: int = DEFAULT_K):
    """
    Find prices for N data_points.

    Parameters:
    ----------
    query_points: np.ndarray
        (N, 3) array of query points
    reference_points: np.ndarray
        (M, 4) array of data points with x, y, floor, and price
    k: int
        Number of nearest neighbors to consider

    Returns:
    --------
    prices: np.ndarray
        (N,) array of prices
    """
    indices = knn_search(query_points, reference_points, k)
    prices: np.ndarray = reference_points[indices, 3]
    return prices.mean(axis=1)


def combine_points_and_prices(
    query_points: np.ndarray, prices: np.ndarray
) -> pd.DataFrame:
    """
    Prepare human-friendly output from numpy arrays.

    Returns:
    --------
    df: pd.DataFrame
        DataFrame with columns x, y, floor, price
    """
    return pd.DataFrame(
        {
            "x": query_points[:, 0],
            "y": query_points[:, 1],
            "floor": query_points[:, 2],
            "price": prices,
        }
    )
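
To see what these helpers compute, here is a tiny hand-checkable sketch of the same idea (broadcasted distances, `np.argpartition`, mean price of the neighbours). The four reference points and the single query point are made up for illustration.

```python
import numpy as np

# Four hypothetical reference points: x, y, floor, price
reference = np.array(
    [
        [0.0, 0.0, 1.0, 100.0],
        [1.0, 0.0, 1.0, 200.0],
        [0.0, 1.0, 1.0, 300.0],
        [9.0, 9.0, 1.0, 900.0],
    ]
)
query = np.array([[0.2, 0.1, 1.0]])  # one query point near the origin
k = 2

# (M, N) distance matrix via broadcasting, as in calculate_distances
diff = reference[np.newaxis, :, :3] - query[:, np.newaxis, :]
distances = np.sqrt((diff**2).sum(axis=-1))

# Indices of the k smallest distances per query (unordered within the k)
nearest = np.argpartition(distances, k, axis=1)[:, :k]

# Average the prices of those neighbours
price = reference[nearest, 3].mean(axis=1)
print(sorted(nearest[0].tolist()), price)
```

The two closest reference points are the first two rows, so the estimated price is the mean of 100 and 200. Note that `np.argpartition(..., k, ...)` requires `k` to be smaller than the number of reference points.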

We have predefined reference points in an external file:

In [None]:
def load_reference_points_df(path: Path = Path("data.parquet")) -> pd.DataFrame:
    """
    Load reference data points from a parquet file.

    Returns:
    --------
    df: pd.DataFrame
        DataFrame of data points with x, y, floor, and price columns
    """
    # To get a NumPy array instead:
    # df[["x", "y", "floor", "price"]].to_numpy().astype(float)
    return pd.read_parquet(path)

reference_points_df = load_reference_points_df()
reference_points_df

| | x | y | floor | price |
|---|---|---|---|---|
| 0 | -2.509198 | -2.527184 | 1 | 983.737837 |
| 1 | 9.014286 | -3.341758 | 3 | 520.453425 |
| 2 | 4.639879 | -6.476922 | 12 | 621.498431 |
| 3 | 1.973170 | 2.145333 | 12 | 1751.234358 |
| 4 | -6.879627 | -0.467517 | 16 | 672.690047 |
| ... | ... | ... | ... | ... |
| 9995 | 7.153120 | 7.540773 | 3 | 520.173028 |
| 9996 | 7.950177 | -9.063721 | 3 | 520.453522 |
| 9997 | 8.934158 | -3.926031 | 16 | 673.260987 |
| 9998 | -2.050240 | -1.133600 | 1 | 1325.611747 |


In [None]:
def create_random_points(
    n_points: int, n_dim: int = 3, *, seed: int = 42
) -> np.ndarray:
    # TODO: Fix floor!
    np.random.seed(seed)
    return np.random.sample((n_points, n_dim))

In [91]:
query_points = create_random_points(10)
query_points

array([[0.37454012, 0.95071431, 0.73199394],
       [0.59865848, 0.15601864, 0.15599452],
       [0.05808361, 0.86617615, 0.60111501],
       [0.70807258, 0.02058449, 0.96990985],
       [0.83244264, 0.21233911, 0.18182497],
       [0.18340451, 0.30424224, 0.52475643],
       [0.43194502, 0.29122914, 0.61185289],
       [0.13949386, 0.29214465, 0.36636184],
       [0.45606998, 0.78517596, 0.19967378],
       [0.51423444, 0.59241457, 0.04645041]])

In [103]:
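# Convert the DataFrame to the (M, 4) array that compute_prices expects
reference_points = reference_points_df[["x", "y", "floor", "price"]].to_numpy().astype(float)
prices = compute_prices(query_points, reference_points)
prices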

array([ 502.62571959,  500.02744492,  505.01821533,  500.04099002,
        502.54348979,  495.09202285,  499.95408297, 1099.94876668,
        551.99297849,  500.01361168,  499.9947853 ,  499.91854264,
       2004.94187513,  777.17692896,  499.99960569,  499.99283174,
        497.68848685,  584.1101067 ,  558.52818969,  499.93305831,
        497.64758741,  497.50457342,  499.98461487,  502.53533383,
        497.50167414])

In [None]:
combine_points_and_prices(query_points, prices)

| | x | y | floor | price |
|---|---|---|---|---|
| 0 | 0.37454 | 0.950714 | 0.731994 | 1700.943634 |
| 1 | 0.598658 | 0.156019 | 0.155995 | 1869.514894 |
| 2 | 0.058084 | 0.866176 | 0.601115 | 1791.037249 |
| 3 | 0.708073 | 0.020584 | 0.96991 | 1801.436404 |
| 4 | 0.832443 | 0.212339 | 0.181825 | 1730.902772 |
| 5 | 0.183405 | 0.304242 | 0.524756 | 1854.022345 |
| 6 | 0.431945 | 0.291229 | 0.611853 | 1820.996388 |
| 7 | 0.139494 | 0.292145 | 0.366362 | 1869.514894 |
| 8 | 0.45607 | 0.785176 | 0.199674 | 1700.943634 |
| 9 | 0.514234 | 0.592415 | 0.04645 | 1755.392455 |


**Tasks**
1) Create a remote variant of `compute_prices` (called e.g. `compute_prices_ray`)
2) Run it in Ray and get the result
3) Compare the result to the previous one
4) Compare the execution time (for a larger number of points)

In [None]:
...

**Question**

Did we achieve anything so far?

## Monitoring Ray

Ray comes with a nice dashboard that allows you to observe running jobs. It runs in a local web server, most likely at http://localhost:8265. This address is not accessible when running within Google Colab, so you have to use a special trick to show a mini-window forwarded to the dashboard running in the cloud.

In [None]:
try:
    from google.colab import output
    output.serve_kernel_port_as_iframe(8265)  # The port may differ!
except ImportError:
    print("Not in Google Colab. Try the local link; it might work.")

TODO: Submit objects

TODO: Batching

## Exercise: Create a map of prices

In [95]:
def create_grid(n_points: int = N_POINTS) -> tuple[np.ndarray, ...]:
    """
    Create a homogenous grid of points to create a map.

    Returns:
    --------
    x: np.ndarray
        Flattened (n_points x n_points,) array of x values
    y: np.ndarray
        Flattened (n_points x n_points,) array of y values
    """
    # Note: Tested indirectly via `create_query_points`
    # TODO: Add floor
    x = np.linspace(-LIMIT, LIMIT, n_points)
    y = np.linspace(-LIMIT, LIMIT, n_points)
    return tuple(arr.flatten() for arr in np.meshgrid(x, y))


def create_query_points(n_points: int = N_POINTS, floor: int = 1) -> np.ndarray:
    """
    Create a homogenous grid of points with a floor to create a map.

    Returns:
    --------
    query_points: np.ndarray
        (n_points x n_points, 3) array of query points
    """
    x, y = create_grid(n_points=n_points)
    return np.vstack([x, y, np.ones(x.shape[0]) * floor]).T


create_query_points(2)

array([[-10., -10.,   1.],
       [ 10., -10.,   1.],
       [-10.,  10.,   1.],
       [ 10.,  10.,   1.]])

In [96]:
def draw_points(points: np.ndarray) -> None:
    """
    Draw points on a map.

    Parameters:
    -----------
    points: np.ndarray
        (N, 3) array of points to draw
    """
    df = pd.DataFrame({"x": points[:,0], "y": points[:,1]})
    fig = px.scatter(df, x="x", y="y", title="Query points")
    fig.show()

draw_points(create_query_points(5))