# Distance Metrics

Distance metrics are used to quantify the similarity or dissimilarity between data points. Here are some commonly used distance metrics in machine learning:

- Euclidean distance
- Manhattan distance
- Minkowski distance
- Cosine distance
- Hamming distance
- Jaccard distance

## Euclidean Distance

Euclidean distance is a measure of the distance between two points in a two- or multi-dimensional space. It is based on the Pythagorean theorem, which states that the square of the hypotenuse of a right triangle is equal to the sum of the squares of the other two sides. The Euclidean distance between two points can be calculated by finding the square root of the sum of the squares of the differences between the corresponding coordinates.

For example, let's consider two points in a two-dimensional space: `(x1, y1)` and `(x2, y2)`. The Euclidean distance between these two points can be calculated as follows:

\begin{equation}
\text{distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
\end{equation}

In an $n$-dimensional space, the formula is similar, but includes the differences between all corresponding coordinates. For two points $\mathbf{p}=(p_1, p_2, \dots, p_n)$ and $\mathbf{q}=(q_1, q_2, \dots, q_n)$:

\begin{equation}
\text{distance} = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \dots + (q_n - p_n)^2}
\end{equation}

In general, the Euclidean distance between two $n$-dimensional points $\mathbf{p}$ and $\mathbf{q}$ can be written compactly as:

\begin{equation}
d_{\text{euclidean}}(\mathbf{p}, \mathbf{q}) = \sqrt{\sum_{i=1}^{n}(q_i - p_i)^2}
\end{equation}

To implement Euclidean distance in Python, you can define a function that takes two points as input and returns the distance between them. Here's an example implementation:

In [1]:
import math

def euclidean_distance(point1, point2):
    """Calculates the Euclidean distance between two points."""
    squared_distance = 0
    for i in range(len(point1)):
        squared_distance += (point1[i] - point2[i])**2
    distance = math.sqrt(squared_distance)
    return distance

The function takes two arguments, `point1` and `point2`, which are lists or tuples of the same length containing the coordinates of the two points. The function first initializes a variable `squared_distance` to 0, then iterates over the coordinates of the points and adds the squared differences to `squared_distance`. Finally, it takes the square root of `squared_distance` and returns the resulting value as the Euclidean distance between the two points.

Here's an example usage of the function:

In [2]:
point1 = (1, 2, 3)
point2 = (4, 5, 6)
distance = euclidean_distance(point1, point2)
print(distance)

5.196152422706632


In this example, we calculate the Euclidean distance between two points in a three-dimensional space. The output is the distance between the two points, which is approximately `5.2`.
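
The loop-based function above assumes both points have the same number of coordinates but does not check it. As a minimal sketch, the variant below adds an explicit length check and cross-checks the result against the standard library's `math.dist` (available in Python 3.8 and later). The helper name `euclidean_distance_checked` is an illustrative assumption, not part of the original notebook.

```python
import math

def euclidean_distance_checked(point1, point2):
    """Euclidean distance with an explicit dimension check (illustrative helper)."""
    if len(point1) != len(point2):
        raise ValueError("points must have the same number of coordinates")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point1, point2)))

point1 = (1, 2, 3)
point2 = (4, 5, 6)
checked = euclidean_distance_checked(point1, point2)
builtin = math.dist(point1, point2)  # standard-library equivalent (Python 3.8+)
print(checked, builtin)
```

Both values should agree up to floating-point rounding, and passing points of different lengths raises a `ValueError` instead of silently ignoring the extra coordinates.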

## Manhattan Distance

Manhattan distance is a distance metric used in machine learning to measure the distance between two points in an n-dimensional space. It is also known as taxicab distance or L1 distance. Unlike Euclidean distance, Manhattan distance does not measure the straight-line (shortest) path between two points. Instead, it sums the absolute differences between the corresponding coordinates of the two points, like the distance a taxi would cover driving along a rectangular street grid.

The Manhattan distance between two points $\mathbf{p}=(p_1, p_2, \dots, p_n)$ and $\mathbf{q}=(q_1, q_2, \dots, q_n)$ in an n-dimensional space is given by the formula:

\begin{equation}
d_{\text{manhattan}}(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{n}|q_i - p_i|
\end{equation}

To implement Manhattan distance in Python, we can define a function that takes two n-dimensional vectors as inputs and calculates the Manhattan distance between them using the above formula. Here's an example implementation:

In [3]:
def manhattan_distance(p, q):
    """
    Calculate the Manhattan distance between two vectors p and q.
    
    Args:
    p (list or numpy array): The first vector.
    q (list or numpy array): The second vector.
    
    Returns:
    The Manhattan distance between p and q.
    """
    return sum(abs(qi - pi) for pi, qi in zip(p, q))

In this implementation, we first define a function `manhattan_distance` that takes two vectors `p` and `q` as inputs. We then use the built-in `zip` function to iterate over the corresponding coordinates of `p` and `q`, and calculate the absolute difference between each pair of coordinates. We then sum up these absolute differences to obtain the Manhattan distance between `p` and `q`. Finally, we return the Manhattan distance as the output of the function.

Here's an example usage of the `manhattan_distance` function:

In [4]:
p = [1, 2, 3]
q = [4, 5, 6]
d = manhattan_distance(p, q)
print("The Manhattan distance between", p, "and", q, "is", d)

The Manhattan distance between [1, 2, 3] and [4, 5, 6] is 9
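
To make the contrast with straight-line distance concrete, the short sketch below compares the two metrics on a 3-4-5 right triangle in two dimensions. The function name `manhattan` and the variable names are illustrative; `math.dist` requires Python 3.8 or later.

```python
import math

def manhattan(p, q):
    """Sum of absolute coordinate differences (same idea as manhattan_distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

start, end = (0, 0), (3, 4)
straight_line = math.dist(start, end)  # Euclidean: length of the hypotenuse
grid_path = manhattan(start, end)      # Manhattan: 3 steps across plus 4 steps up
print(straight_line, grid_path)        # 5.0 7
```

The Manhattan distance is never smaller than the Euclidean distance between the same two points, because the grid path cannot be shorter than the straight line.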


## Minkowski Distance

The Minkowski distance is a generalization of other distance measures such as the Euclidean distance and the Manhattan distance. It is defined as:

\begin{equation}
D_{p}(X,Y)=\left(\sum_{i=1}^{n}|x_i-y_i|^p\right)^{\frac{1}{p}}
\end{equation}

where $X$ and $Y$ are two vectors of equal dimensionality $n$, and $p \geq 1$ is a real number (for $p < 1$ the triangle inequality no longer holds, so the result is not a true metric). When $p=1$, this is equivalent to the Manhattan distance, and when $p=2$, this is equivalent to the Euclidean distance.

Here's an example of calculating the Minkowski distance between two vectors in Python:

In [5]:
import numpy as np

def minkowski_distance(x, y, p):
    """
    Calculates the Minkowski distance between two vectors x and y
    of equal dimensionality, using the specified value of p.
    """
    x = np.array(x)
    y = np.array(y)
    distance = np.sum(np.abs(x - y) ** p) ** (1/p)
    return distance

In [6]:
x = [1, 2, 3]
y = [4, 5, 6]
p = 3
distance = minkowski_distance(x, y, p)
print(distance)

4.3267487109222245


In this example, we define a function `minkowski_distance` that takes two vectors `x` and `y` and a value of `p`, and calculates the Minkowski distance between the two vectors using the above formula. We then use this function to calculate the distance between the two vectors `[1, 2, 3]` and `[4, 5, 6]` with `p=3`, which gives a distance of approximately `4.327`.
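
The special cases mentioned above can be checked numerically. The sketch below re-defines the function so that it is self-contained, converting the inputs to floating point (an added assumption, to avoid integer overflow for large `p`), and evaluates it at `p=1` and `p=2` for the same two vectors.

```python
import numpy as np

def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

x = [1, 2, 3]
y = [4, 5, 6]
d1 = minkowski_distance(x, y, 1)  # should match the Manhattan distance, 9
d2 = minkowski_distance(x, y, 2)  # should match the Euclidean distance, about 5.196
print(d1, d2)
```

These agree with the Manhattan and Euclidean results computed for the same points in the earlier sections.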

## Cosine Distance

Cosine distance is a distance measure used in machine learning to quantify the dissimilarity between two non-zero vectors in a high-dimensional space, based on their direction rather than their magnitude. It is derived from the cosine similarity measure, which calculates the cosine of the angle between two vectors. The cosine distance between two vectors $\mathbf{p}$ and $\mathbf{q}$ is defined as 1 minus the cosine similarity between the two vectors:

\begin{equation}
d_{\text{cosine}}(\mathbf{p}, \mathbf{q}) = 1 - \frac{\mathbf{p} \cdot \mathbf{q}}{\|\mathbf{p}\|_2\|\mathbf{q}\|_2}
\end{equation}

where $\cdot$ denotes the dot product of two vectors, and $\|\cdot\|_2$ denotes the `L2` norm of a vector.

To implement cosine distance in Python, we can define a function that takes two vectors as inputs and calculates the cosine distance between them using the above formula. Here's an example implementation:

In [7]:
import numpy as np

def cosine_distance(p, q):
    """
    Calculate the cosine distance between two vectors p and q.
    
    Args:
    p (list or numpy array): The first vector.
    q (list or numpy array): The second vector.
    
    Returns:
    The cosine distance between p and q.
    """
    p = np.asarray(p)
    q = np.asarray(q)
    dot_product = np.dot(p, q)
    norm_p = np.linalg.norm(p)
    norm_q = np.linalg.norm(q)
    cosine_similarity = dot_product / (norm_p * norm_q)
    cosine_distance = 1 - cosine_similarity
    return cosine_distance

In this implementation, we first convert the input vectors `p` and `q` to NumPy arrays using the `numpy.asarray` function. We then use the built-in `numpy.dot` function to calculate the dot product of `p` and `q`, and use the `numpy.linalg.norm` function to calculate the `L2` norm of `p` and `q`. We then use these values to calculate the cosine similarity between `p` and `q`, and subtract it from `1` to obtain the cosine distance between `p` and `q`. Finally, we return the cosine distance as the output of the function.

Here's an example usage of the `cosine_distance` function:

In [8]:
p = [1, 2, 3]
q = [4, 5, 6]
d = cosine_distance(p, q)
print("The cosine distance between", p, "and", q, "is", d)

The cosine distance between [1, 2, 3] and [4, 5, 6] is 0.025368153802923787
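
Because cosine distance depends only on the angle between the vectors, rescaling a vector should not change it, and the formula is undefined when either vector has zero length. The sketch below illustrates both points. The helper name `cosine_distance_safe` and the choice to raise a `ValueError` for zero vectors are assumptions made for illustration.

```python
import numpy as np

def cosine_distance_safe(p, q):
    """Cosine distance that rejects zero vectors (illustrative helper)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    norm_product = np.linalg.norm(p) * np.linalg.norm(q)
    if norm_product == 0:
        # The angle is undefined if either vector has zero length.
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - np.dot(p, q) / norm_product

same_direction = cosine_distance_safe([1, 2, 3], [10, 20, 30])  # close to 0
orthogonal = cosine_distance_safe([1, 0], [0, 1])               # 1
opposite = cosine_distance_safe([1, 0], [-1, 0])                # 2
print(same_direction, orthogonal, opposite)
```

Vectors pointing the same way have a distance of (approximately) 0 regardless of length, perpendicular vectors have a distance of 1, and opposite vectors have the maximum distance of 2.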


### While talking about Manhattan and Cosine distances we saw L1 and L2 norms respectively. What are those?

L1 and L2 norms are ways to measure the size or magnitude of a vector in a high-dimensional space.

The L1 norm, also known as the Manhattan norm or taxicab norm, measures the size of a vector as the sum of the absolute values of its elements:

\begin{equation}
\|\mathbf{v}\|_1 = \sum_{i=1}^n |v_i|
\end{equation}
 
where $\mathbf{v}$ is the vector, and $n$ is the number of elements in the vector.

The L2 norm, also known as the Euclidean norm, measures the distance between the origin and the point represented by the vector. It is calculated as the square root of the sum of the squared elements of the vector:

\begin{equation}
\|\mathbf{v}\|_2 = \sqrt{\sum_{i=1}^n v_i^2}
\end{equation}
 
where $\mathbf{v}$ is the vector, and $n$ is the number of elements in the vector.

In machine learning, both norms are commonly used as regularization terms in the objective function of a learning algorithm. An L2 penalty (as in ridge regression) shrinks all weights toward zero, while an L1 penalty (as in lasso) tends to drive some weights exactly to zero, which makes it useful for sparse feature selection when only a small subset of the features is relevant. The choice between L1 and L2 regularization depends on the specific problem and the desired properties of the learned model.

In Python, the L1 and L2 norms can be calculated using the `numpy.linalg.norm` function. Here's an example usage:

In [9]:
import numpy as np

v = np.array([1, 2, 3])

# Calculate L1 norm
l1_norm = np.linalg.norm(v, ord=1)
print("L1 norm of", v, "is", l1_norm)

# Calculate L2 norm
l2_norm = np.linalg.norm(v, ord=2)
print("L2 norm of", v, "is", l2_norm)

L1 norm of [1 2 3] is 6.0
L2 norm of [1 2 3] is 3.7416573867739413
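
The link back to the distance metrics is that a distance is the norm of a difference vector: the Manhattan distance is the L1 norm of `q - p`, and the Euclidean distance is its L2 norm. A minimal sketch using the same points as the earlier examples (variable names are illustrative):

```python
import numpy as np

p = np.array([1, 2, 3])
q = np.array([4, 5, 6])
diff = q - p  # difference vector [3, 3, 3]

manhattan = np.linalg.norm(diff, ord=1)  # L1 norm of the difference
euclidean = np.linalg.norm(diff, ord=2)  # L2 norm of the difference
print(manhattan, euclidean)
```

The two values should match the Manhattan distance of 9 and the Euclidean distance of about 5.196 obtained for these points earlier.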


## Hamming Distance

The Hamming distance is a measure of dissimilarity between two strings of equal length, defined as the number of positions at which the corresponding symbols are different. In other words, it is the minimum number of substitutions required to change one string into the other.

For example, the Hamming distance between the strings "10110" and "11100" is 2, because the second and fourth symbols differ between the two strings.

The Hamming distance can be calculated using the following formula:

\begin{equation}
d_{\text{hamming}}(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{n}[p_i \neq q_i]
\end{equation}

where $\mathbf{p}$ and $\mathbf{q}$ are the two strings of length $n$, and $[p_i \neq q_i]$ is an indicator function that evaluates to 1 if $p_i \neq q_i$, and 0 otherwise.

In Python, the Hamming distance can be calculated using the `hamming` function from the `scipy.spatial.distance` module, which returns the proportion of differing positions rather than the raw count:

In [10]:
from scipy.spatial.distance import hamming

s1 = "10110"
s2 = "11100"

hamming_distance = hamming(list(s1), list(s2))
print("Hamming distance between", s1, "and", s2, "is", hamming_distance)

Hamming distance between 10110 and 11100 is 0.4


Note that the `hamming` function expects its input arguments to be sequences (e.g., lists) of individual symbols, rather than strings. Therefore, we convert the input strings to lists of characters using the `list` function before passing them to the `hamming` function. The output value of `0.4` indicates that the Hamming distance between the two strings is `2` (since there are a total of `5` symbols and `2` of them differ between the two strings).
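
If the raw count from the formula is what you need, it can be computed directly without SciPy. The following is a minimal sketch; the function name `hamming_count` is an illustrative assumption.

```python
def hamming_count(s1, s2):
    """Number of positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires sequences of equal length")
    # Each comparison is True (1) where the symbols differ, False (0) otherwise.
    return sum(a != b for a, b in zip(s1, s2))

s1 = "10110"
s2 = "11100"
count = hamming_count(s1, s2)  # 2 differing positions
fraction = count / len(s1)     # 0.4, the normalized form returned by SciPy
print(count, fraction)
```

Dividing the count by the sequence length recovers the normalized value shown in the SciPy example.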

## Jaccard Distance

The Jaccard distance is a measure of dissimilarity between two sets. It is defined as 1 minus the Jaccard similarity, which is the ratio of the size of their intersection to the size of their union. It ranges from 0 (identical sets) to 1 (sets with no elements in common).

Formally, the Jaccard distance between two sets $p$ and $q$ is given by:

\begin{equation}
d_{\text{jaccard}}(p, q) = 1 - \frac{|p \cap q|}{|p \cup q|}
\end{equation}

where $|p \cap q|$ represents the size of the intersection of sets $p$ and $q$, and $|p \cup q|$ represents the size of their union.

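In Python, SciPy provides a `jaccard` function in the `scipy.spatial.distance` module. Note that it is documented for boolean 1-D arrays and compares its inputs position by position rather than treating them as sets. As a result, the value printed for the two lists below is not the set-based Jaccard distance defined above, which for the sets {1, 2, 3, 4} and {2, 3, 5, 6} would be 1 - 2/6 ≈ 0.667: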

In [11]:
from scipy.spatial.distance import jaccard

s1 = [1, 2, 3, 4]
s2 = [2, 3, 5, 6]

jaccard_distance = jaccard(s1, s2)
print("Jaccard distance between", s1, "and", s2, "is", jaccard_distance)

Jaccard distance between [1, 2, 3, 4] and [2, 3, 5, 6] is 1.0
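
For inputs that really are sets, the formula can be applied directly with Python's built-in `set` type. The sketch below is one way to do it. The function name `jaccard_distance_sets` and the convention of returning 0 for two empty sets are assumptions made for illustration.

```python
def jaccard_distance_sets(a, b):
    """Set-based Jaccard distance: 1 - |a ∩ b| / |a ∪ b|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)

s1 = [1, 2, 3, 4]
s2 = [2, 3, 5, 6]
d = jaccard_distance_sets(s1, s2)  # intersection {2, 3}, union of 6 elements
print("Set-based Jaccard distance between", s1, "and", s2, "is", d)
```

Here the intersection has 2 elements and the union has 6, so the distance is 1 - 2/6, or about 0.667.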


## How do I choose the right distance metric for my problem?

The choice of distance metric depends on the type of data you are working with and the problem you are trying to solve. Here are some guidelines:

- **Euclidean distance:** This is a good choice when the data is continuous and the dimensions are independent of each other. It is also appropriate when you want to penalize larger differences between values more heavily than smaller differences.

- **Manhattan distance:** This is a good choice when the data is continuous and the dimensions are independent of each other, but you want to penalize larger differences less heavily than Euclidean distance.

- **Minkowski distance:** This is a generalization of Euclidean and Manhattan distance, and is a good choice when you want to control the degree of emphasis given to larger differences between values. If you use a value of p=1, it is equivalent to Manhattan distance, and if you use a value of p=2, it is equivalent to Euclidean distance.

- **Cosine distance:** This is a good choice when you are working with sparse data or text data, and you want to measure the similarity between vectors without being affected by the magnitude of the vectors.

- **Hamming distance:** This is a good choice when you are working with binary data or categorical data, and you want to measure the number of positions at which the two vectors differ.

- **Jaccard distance:** This is a good choice when each data point is naturally a set (or a binary presence/absence vector), and you want to measure how little two sets overlap relative to their combined size.
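
When comparing candidates, it can help to evaluate several metrics on the same pair of vectors. The sketch below does this with the ready-made functions in `scipy.spatial.distance` (`euclidean`, `cityblock` for Manhattan, `minkowski`, and `cosine`) instead of the hand-written versions. The dictionary layout and variable names are illustrative choices.

```python
from scipy.spatial.distance import cityblock, cosine, euclidean, minkowski

u = [1, 2, 3]
v = [4, 5, 6]

# Same pair of vectors, four different notions of "distance".
results = {
    "euclidean": euclidean(u, v),
    "manhattan": cityblock(u, v),
    "minkowski (p=3)": minkowski(u, v, p=3),
    "cosine": cosine(u, v),
}
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

The vectors are far apart in the Euclidean and Manhattan sense but nearly parallel, so their cosine distance is close to 0. This is the kind of difference the guidelines above are meant to capture.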