# Computing the Euclidean Distance Matrix with NumPy

In this notebook we implement two functions to compute the Euclidean distance matrix (more precisely, the matrix of squared distances). We use a simple algebra trick that makes it possible to write the function in a completely vectorized way in terms of optimized NumPy functions.

In [None]:
import numpy as np

Now that we know broadcasting, let's use it to implement a function that calculates the Euclidean distance matrix of an array of vectors.

In [None]:
def euclidean_broadcast(x, y):
    """Squared Euclidean distance matrix.

    Inputs:
    x: (N, m) numpy array
    y: (N, m) numpy array

    Output:
    (N, N) squared Euclidean distance matrix:
    r_ij = sum_k (x_ik - y_jk)^2
    """
    diff = x[:, np.newaxis, :] - y[np.newaxis, :, :]

    return (diff * diff).sum(axis=2)

<mark> Question </mark>: At this point you are starting to get acquainted with `numpy.ndarray`s and their memory management. Could you analyse the advantages and possible drawbacks of the `euclidean_broadcast` function? Write one positive and one negative point about it.

***

Let's now consider a more sophisticated implementation:

In [None]:
def euclidean_trick(x, y):
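    """Squared Euclidean distance matrix.

    Inputs:
    x: (N, m) numpy array
    y: (N, m) numpy array

    Output:
    (N, N) squared Euclidean distance matrix:
    r_ij = sum_k (x_ik - y_jk)^2
    """
    # Squared norms of the rows, shaped as a column (N, 1) and a row (1, N)
    x2 = (x*x).sum(axis=1)[:, np.newaxis]
    y2 = (y*y).sum(axis=1)[np.newaxis, :]

    # All pairwise scalar products x_i . y_j, shape (N, N)
    xy = x @ y.T

    # Rounding errors can make some entries slightly negative;
    # abs() keeps the result non-negative
    return np.abs(x2 + y2 - 2. * xy)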

## The `euclidean_trick` function

Each element of the squared Euclidean distance matrix is the scalar product of the difference between two rows with itself. `euclidean_trick` takes advantage of this by expanding that product:
$$
\sum_k {(x_{ik}-y_{jk})^2} = (\vec{x}_i - \vec{y}_j)\cdot(\vec{x}_i - \vec{y}_j) = \vec{x}_i\cdot\vec{x}_i + \vec{y}_j\cdot\vec{y}_j - 2\vec{x}_i\cdot\vec{y}_j
$$

Fortunately, there are NumPy functions to compute each of these terms:

$\vec{x}_i\cdot\vec{y}_j$ $\rightarrow$ `x @ y.T` : Matrix product of $\{\vec{x}\}$ and the transpose of $\{\vec{y}\}$, an $(N,N)$ matrix

$\vec{x}_i\cdot\vec{x}_i$ $\rightarrow$ `(x*x).sum(axis=1)[:, np.newaxis]` : An $(N,1)$ vector of elements $\sum_k x_{ik}x_{ik}$

$\vec{y}_j\cdot\vec{y}_j$ $\rightarrow$ `(y*y).sum(axis=1)[np.newaxis, :]` : A $(1,N)$ vector of elements $\sum_k y_{jk}y_{jk}$

To have all the combinations $ij$ of the sum $\vec{x}_i\cdot\vec{x}_i + \vec{y}_j\cdot\vec{y}_j$, we add a new axis to each of the arrays, in different positions so that one becomes a column $(N,1)$ and the other a row $(1,N)$. Broadcasting then adds them into an $(N,N)$ matrix.
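
As a quick sanity check of the expansion above, here is a minimal sketch that compares the three-term expression with an explicit double loop on a small example. The array sizes, the seed, and the names `ref` and `expanded` are arbitrary choices for illustration, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
x = rng.random((4, 3))
y = rng.random((4, 3))

# Reference: explicit double loop over all pairs of rows
ref = np.empty((4, 4))
for i in range(4):
    for j in range(4):
        d = x[i] - y[j]
        ref[i, j] = d @ d

# The three terms of the expansion
x2 = (x * x).sum(axis=1)[:, np.newaxis]  # column of x_i . x_i
y2 = (y * y).sum(axis=1)[np.newaxis, :]  # row of y_j . y_j
xy = x @ y.T                             # all pairs x_i . y_j

expanded = x2 + y2 - 2.0 * xy

print(x2.shape, y2.shape, xy.shape, expanded.shape)
print(np.allclose(ref, expanded))
```

The column and the row broadcast against each other, so `x2 + y2` already has one entry per pair $ij$ before the cross term is subtracted.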

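We now use `@` to perform the matrix multiplication of the full dataset by itself. We didn't use it before as an alternative to `(x*x).sum(axis=1)` because it doesn't perform row-by-row scalar products: `x @ x.T` computes the scalar product of every pair of rows, and the row-by-row products are only its diagonal.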

In [None]:
nsamples = 10
nfeat = 3

x = 10. * np.random.random([nsamples, nfeat])

xy = x @ x.T
xy.shape
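
To make the relation between the two operations concrete, the following sketch checks that the row-by-row scalar products sit on the diagonal of `x @ x.T`. The names `gram` and `row_norms` and the seed are hypothetical, chosen only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
nsamples, nfeat = 10, 3
x = 10.0 * rng.random((nsamples, nfeat))

gram = x @ x.T                   # scalar products of every pair of rows
row_norms = (x * x).sum(axis=1)  # scalar product of each row with itself

print(gram.shape, row_norms.shape)
print(np.allclose(np.diag(gram), row_norms))
```

Using `@` just to get the diagonal would compute all the off-diagonal entries only to discard them, which is why the element-wise product followed by a sum is used for those terms.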

Let's time both functions and watch the `top` command to see how `@` uses multiple threads (OpenMP threads, if the BLAS library behind NumPy is built with OpenMP). Let's also check that they give the same result ;)

In [None]:
nsamples = 2000
nfeat = 50

x = 10. * np.random.random([nsamples, nfeat])

%timeit euclidean_broadcast(x, x)
%timeit euclidean_trick(x, x)

In [None]:
np.abs(euclidean_broadcast(x, x) - euclidean_trick(x, x)).max()

In [None]:
# True if the two results agree within floating-point tolerance
np.allclose(euclidean_broadcast(x, x), euclidean_trick(x, x))

## Profiling

Let's use `line_profiler` to time every line of our functions.

In [None]:
%load_ext line_profiler

In [None]:
%lprun -f euclidean_trick euclidean_trick(x, x)

In [None]:
%lprun -f euclidean_broadcast euclidean_broadcast(x, x)

## Conclusions

The main points to take from this notebook are:
  * NumPy is all about vectorization. Explicit Python loops over array elements should be avoided.
  * Always consider different vectorized implementations and compare them.
  * Even within NumPy, some functions might bring a more significant speedup than others.
  
> To get an extra speedup, we can use `np.einsum('ij,ij->i', x, x)` to compute the terms $\vec{x}_i\cdot\vec{x}_i$ and $\vec{y}_j\cdot\vec{y}_j$ instead of `(x*x).sum(axis=1)`. Please have a look at the notebook `numpy/04-euclidean-distance-matrix-numpy-advanced.ipynb`. If there are questions, we can discuss them in the Q&A!
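
As a small illustration of the `einsum` expression mentioned in the note, this sketch checks that it gives the same row-wise sums of squares as the element-wise version. The array size mirrors the timing cell; the seed and the names `by_sum` and `by_einsum` are arbitrary. No timing is shown here, since any speedup depends on the machine and the NumPy build.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
x = 10.0 * rng.random((2000, 50))

by_sum = (x * x).sum(axis=1)              # builds a temporary (2000, 50) array
by_einsum = np.einsum('ij,ij->i', x, x)   # multiplies and sums over j in one call

print(by_sum.shape, by_einsum.shape)
print(np.allclose(by_sum, by_einsum))
```

In a notebook, the two expressions can be compared with `%timeit` in the same way as the two distance functions.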