# Worksheet on linear algebra: Pairwise distance


## Please restart the kernel before you submit!

## Your Name: Amay Jain

Computing Euclidean distances is ubiquitous in machine learning. However, the computation becomes expensive when the data lives in a high-dimensional space or when there are many samples.

The goal of this part is to write your own function that computes pairwise distances for a given dataset and to measure its computation time. We then call existing Python library functions for the same task and compare the timings.

Given two data points $x=(x_1,\dots,x_d)$ and $y=(y_1,\dots,y_d)$, the Euclidean distance between $x$ and $y$ is $\|x-y\|_2 = \sqrt{\sum_{k=1}^{d}(x_k-y_k)^2}$.

Given a data matrix $X\in\mathbb{R}^{n\times d}$ whose $i$-th row is the $i$-th data sample $x^{i}\in\mathbb{R}^d$, we want to generate a pairwise distance matrix $D\in\mathbb{R}^{n\times n}$ such that $$D_{i,j} = \|x^i - x^j\|_2$$
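As a concrete check of the definition, here is a minimal sketch that fills the distance matrix entry by entry with explicit loops. The function name `pairwise_distance_loops` and the three-point dataset are illustrative assumptions, not part of the worksheet. The loop version is slow, but it is easy to verify by hand.

```python
import numpy as np

def pairwise_distance_loops(X):
    """Reference version: entry (i, j) is ||X[i] - X[j]||_2, computed with loops."""
    n = X.shape[0]
    dist_matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist = np.sqrt(np.sum((X[i] - X[j]) ** 2))
            dist_matrix[i, j] = dist
            dist_matrix[j, i] = dist  # the matrix is symmetric
    return dist_matrix

# Hypothetical toy data: the corners of a 3-4-5 right triangle in the plane.
X_toy = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 4.0]])
D_toy = pairwise_distance_loops(X_toy)
print(D_toy)
```

For this toy data the diagonal is zero and the off-diagonal distances are 5, 4, and 3, which matches the side lengths of the triangle.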



**Requirement:**
1. You should write a function to compute $D$. Your function should work for any data matrix $X$.

In [1]:
import numpy as np
import time

# your function here

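def D(x):
    """Return the n-by-n matrix of pairwise Euclidean distances between rows of x."""
    rows = x.shape[0]   # number of samples n
    cols = x.shape[1]   # ambient dimension d

    start = time.time()

    # Broadcasting: (n, 1, d) - (1, n, d) gives every difference, shape (n, n, d).
    first = x.reshape(rows, 1, cols)
    second = x.reshape(1, rows, cols)

    # The 2-norm over the last axis collapses (n, n, d) to the (n, n) distance matrix.
    d = np.linalg.norm(first - second, axis = 2)

    stop = time.time()

    print("Computational time is", stop - start, "seconds")

    return d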

Several data matrices are randomly generated below. Please test your function on each of them and compare the computation times. What is your conclusion?
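A single wall-clock measurement of a millisecond-scale call can be noisy, so one option is to repeat the call and keep the best time. The sketch below assumes a hypothetical helper `time_call` and a print-free variant `broadcast_distances` of the broadcasting idea; neither name comes from the worksheet.

```python
import time
import numpy as np

def time_call(fn, *args, repeats=5):
    """Run fn(*args) `repeats` times; return (best time in seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()   # monotonic, high-resolution timer
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return best, result

def broadcast_distances(X):
    # Same broadcasting idea as D(x), without the timing and print statement.
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

X_small = np.random.randn(100, 5)
seconds, D_small = time_call(broadcast_distances, X_small)
print(f"best of 5 runs: {seconds:.6f} seconds")
```

Taking the minimum over several runs reduces the influence of one-off delays such as other processes on the machine; the same helper can wrap any of the distance functions used in this worksheet.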

In [2]:
# test example 1

n = 100                     # number of samples
d = 5                       # ambient dimension
X = np.random.randn(n,d)    # data matrix

D(X)

Computational time is 0.0010249614715576172 seconds


array([[0.        , 3.82626082, 2.61395997, ..., 1.77170269, 3.04404022,
        2.88402877],
       [3.82626082, 0.        , 3.25956143, ..., 2.62624397, 2.34404471,
        4.06154597],
       [2.61395997, 3.25956143, 0.        , ..., 2.96930993, 2.70066133,
        4.30351424],
       ...,
       [1.77170269, 2.62624397, 2.96930993, ..., 0.        , 2.43172078,
        2.38036318],
       [3.04404022, 2.34404471, 2.70066133, ..., 2.43172078, 0.        ,
        4.60464604],
       [2.88402877, 4.06154597, 4.30351424, ..., 2.38036318, 4.60464604,
        0.        ]])

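The following two examples take too long and cannot be handled by my laptop. The likely reason is memory: the broadcasting approach in `D` materializes an intermediate array of shape $(n, n, d)$. For $n=10000$ and $d=5$ that is $5\times 10^8$ float64 values, about 4 GB, and for $d=50$ it is about 40 GB.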

In [3]:
# test example 2

# n = 10000                    # number of samples
# d = 5                        # ambient dimension
# X = np.random.randn(n,d)     # data matrix

# D(X)

In [4]:
# test example 3

# n = 10000                  # number of samples
# d = 50                     # ambient dimension
# X = np.random.randn(n,d)   # data matrix
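
One way to avoid the $(n, n, d)$ intermediate is the identity $\|x-y\|_2^2 = \|x\|_2^2 + \|y\|_2^2 - 2\,x\cdot y$, which needs only $n\times n$ arrays (about 0.8 GB each in float64 for $n=10000$). The sketch below is an alternative technique, not the function written for this worksheet; the name `pairwise_distance_gram` is an assumption, and whether the large examples fit in memory still depends on the machine.

```python
import numpy as np

def pairwise_distance_gram(X):
    """Pairwise Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y."""
    sq_norms = np.sum(X ** 2, axis=1)                   # squared row norms, shape (n,)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)             # clip tiny negatives from round-off
    np.fill_diagonal(sq_dists, 0.0)                     # force exact zeros on the diagonal
    return np.sqrt(sq_dists)

# Compare against the broadcasting approach on a small random dataset.
rng = np.random.default_rng(0)
X_check = rng.standard_normal((200, 5))
reference = np.linalg.norm(X_check[:, None, :] - X_check[None, :, :], axis=2)
print(np.allclose(pairwise_distance_gram(X_check), reference))
```

The trade-off is numerical: subtracting nearly equal quantities loses precision when two points are very close together, which is why the squared distances are clipped at zero before the square root.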

Calling existing Python library functions to compute pairwise distances is usually much more efficient. Here are two functions (from different modules) you can use. Please read the documentation and call them to compute the pairwise distance matrix for each test example, then compare the computation times.


**Documentation:**

1. scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html

2. scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html

In [5]:
from sklearn.metrics.pairwise import pairwise_distances

start = time.time()

test = pairwise_distances(X)

stop = time.time()

print("Computational time is", stop - start, "seconds")

test

Computational time is 0.002955198287963867 seconds


array([[0.        , 3.82626082, 2.61395997, ..., 1.77170269, 3.04404022,
        2.88402877],
       [3.82626082, 0.        , 3.25956143, ..., 2.62624397, 2.34404471,
        4.06154597],
       [2.61395997, 3.25956143, 0.        , ..., 2.96930993, 2.70066133,
        4.30351424],
       ...,
       [1.77170269, 2.62624397, 2.96930993, ..., 0.        , 2.43172078,
        2.38036318],
       [3.04404022, 2.34404471, 2.70066133, ..., 2.43172078, 0.        ,
        4.60464604],
       [2.88402877, 4.06154597, 4.30351424, ..., 2.38036318, 4.60464604,
        0.        ]])

In [6]:
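from scipy.spatial.distance import pdist   # import the submodule explicitly

start = time.time()

# pdist returns the condensed distance vector of length n*(n-1)/2, not an n-by-n matrix
test = pdist(X, metric = 'euclidean')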

stop = time.time()

print("Computational time is", stop - start, "seconds")

test

Computational time is 0.00027632713317871094 seconds


array([3.82626082, 2.61395997, 2.23939349, ..., 2.43172078, 2.38036318,
       4.60464604])
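
The `pdist` output above is a one-dimensional condensed vector that stores each pair $i<j$ once, so it is not directly comparable to the $n\times n$ matrices from the other two methods. A minimal sketch of the conversion with `scipy.spatial.distance.squareform` follows; the small dataset `X_demo` and the chosen indices are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((6, 3))            # hypothetical small dataset, n = 6

condensed = pdist(X_demo, metric="euclidean")   # length n*(n-1)/2 = 15
square = squareform(condensed)                  # full symmetric (6, 6) matrix

print(condensed.shape, square.shape)

# For i < j, the pair (i, j) sits at a fixed offset in the condensed vector.
i, j, n = 1, 4, X_demo.shape[0]
k = n * i - i * (i + 1) // 2 + (j - i - 1)
print(np.isclose(condensed[k], square[i, j]))
```

Storing only the upper triangle roughly halves the memory and the work, which is one plausible reason `pdist` was the fastest of the three calls in the timing above.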