# Mahalanobis Distance

**Understanding Covariance Matrix and Its Role in Bayesian Decision Theory**

In this tutorial, we explore the **covariance matrix**, its geometric interpretation, and its role in Bayesian decision theory, particularly through the Mahalanobis distance. The explanations draw on [Vision Dummy](https://www.visiondummy.com/2014/04/geometric-interpretation-covariance-matrix/), [SAS Blogs](https://blogs.sas.com/content/iml/2012/02/15/what-is-mahalanobis-distance.html), and [Wikipedia](https://en.wikipedia.org/wiki/Mahalanobis_distance), along with Chapter 2 of Duda's book {cite}`Duda2000`.

---

## Table of Contents
1. [Introduction to Covariance Matrix](#introduction-to-covariance-matrix)
2. [Geometric Interpretation of Covariance Matrix](#geometric-interpretation-of-covariance-matrix)
3. [Covariance Matrix in Bayesian Decision Theory](#covariance-matrix-in-bayesian-decision-theory)
4. [Mahalanobis Distance and Its Role](#mahalanobis-distance-and-its-role)
5. [Relationship to Normal Random Variables](#relationship-to-normal-random-variables)
6. [Definition of Mahalanobis Distance](#definition-of-mahalanobis-distance)
7. [Intuitive Explanation of Mahalanobis Distance](#intuitive-explanation-of-mahalanobis-distance)
8. [Example: Covariance Matrix in Gaussian Distributions](#example-covariance-matrix-in-gaussian-distributions)
9. [Conclusion](#conclusion)

---

## Introduction to Covariance Matrix

The **covariance matrix** is a square matrix that summarizes the variances and covariances of a set of random variables. For a dataset with $d$ features, the covariance matrix $\mathbf{\Sigma}$ is a $d \times d$ matrix where the diagonal elements represent the variances of each feature, and the off-diagonal elements represent the covariances between pairs of features.

Mathematically, the covariance matrix is defined as:

$$
\mathbf{\Sigma} = \begin{bmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1d} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{d1} & \sigma_{d2} & \cdots & \sigma_{dd}
\end{bmatrix}
$$

where:
- $\sigma_{ii}$ is the variance of the $i$-th feature.
- $\sigma_{ij}$ is the covariance between the $i$-th and $j$-th features.

The covariance matrix is symmetric ($\sigma_{ij} = \sigma_{ji}$) and positive semi-definite.
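
As a minimal sketch, the properties above can be checked numerically with NumPy. The data array `X` is made up for illustration, and `np.cov` is used with its default unbiased ($N-1$) normalization.

```python
import numpy as np

# Hypothetical toy data: 5 samples (rows) of d = 3 features (columns).
X = np.array([
    [2.0, 1.0, 0.5],
    [3.0, 2.5, 0.1],
    [4.0, 2.9, 0.9],
    [5.0, 4.2, 0.4],
    [6.0, 5.1, 0.7],
])

# rowvar=False tells NumPy that each column is a feature.
sigma = np.cov(X, rowvar=False)

print(sigma.shape)                   # d x d
print(np.allclose(sigma, sigma.T))   # symmetry: sigma_ij == sigma_ji

# The diagonal holds the per-feature variances.
print(np.allclose(np.diag(sigma), X.var(axis=0, ddof=1)))

# Positive semi-definite: all eigenvalues are >= 0 (up to rounding error).
eigenvalues = np.linalg.eigvalsh(sigma)
print(np.all(eigenvalues >= -1e-12))
```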

---

## Geometric Interpretation of Covariance Matrix

The covariance matrix can be interpreted geometrically as describing the shape and orientation of the data distribution in the feature space. Here’s how:

1. **Eigenvalues and Eigenvectors**: The eigenvectors of the covariance matrix give the principal directions (axes) of the data. The eigenvector with the largest eigenvalue points along the direction of maximum variance, and each eigenvalue is the variance of the data along its eigenvector.
   
2. **Ellipsoid Representation**: The covariance matrix defines an ellipsoid in the feature space. The eigenvectors determine the orientation of the ellipsoid, and the square roots of the eigenvalues (the standard deviations along the eigenvectors) set the lengths of its axes. For example, in a 2D space, the covariance matrix defines an ellipse.

3. **Scaling and Rotation**: The covariance matrix can be decomposed into scaling and rotation components. The eigenvalues represent the scaling factors, and the eigenvectors represent the rotation of the ellipsoid.

![Covariance Shape of the Data](img/cov-shape-of-data.png)

The covariance matrix defines the shape of the data. Diagonal spread is captured by the covariance, while axis-aligned spread is captured by the variance.


![Covariance Ellipse](img/cov-eigenvectors.png)

*Figure 1: Geometric interpretation of the covariance matrix as an ellipse in 2D space. The eigenvectors determine the orientation, and the eigenvalues determine the lengths of the axes.*
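
The decomposition into rotation and scaling can be sketched with `np.linalg.eigh`, which is intended for symmetric matrices. The matrix `sigma` below is a hypothetical example, not taken from the figures.

```python
import numpy as np

# Hypothetical 2D covariance matrix with positively correlated features.
sigma = np.array([[4.0, 2.0],
                  [2.0, 3.0]])

# eigh returns eigenvalues in ascending order; the eigenvectors are the
# columns of `vecs` and are orthonormal.
vals, vecs = np.linalg.eigh(sigma)

# Rotation (vecs) and scaling (vals) rebuild the matrix: sigma = V diag(vals) V^T.
reconstructed = vecs @ np.diag(vals) @ vecs.T
print(np.allclose(reconstructed, sigma))

# Semi-axis lengths of the one-standard-deviation ellipse.
semi_axes = np.sqrt(vals)
print(semi_axes)

# The major axis is the eigenvector with the largest eigenvalue. Its sign is
# arbitrary, so the angle is only defined up to 180 degrees.
major = vecs[:, -1]
angle_deg = np.degrees(np.arctan2(major[1], major[0]))
print(angle_deg)

# The variance of the data projected onto the major axis is the largest eigenvalue.
print(np.isclose(major @ sigma @ major, vals[-1]))
```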



---

## Covariance Matrix in Bayesian Decision Theory

In Bayesian decision theory, the covariance matrix plays a crucial role in defining the **class-conditional probability density functions** for multivariate normal distributions. The probability density function for a multivariate normal distribution is given by:

$$
p(\mathbf{x}|\omega_i) = \frac{1}{(2\pi)^{d/2}|\mathbf{\Sigma}_i|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \mathbf{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right)
$$

where:
- $\mathbf{x}$ is the feature vector.
- $\boldsymbol{\mu}_i$ is the mean vector for class $\omega_i$.
- $\mathbf{\Sigma}_i$ is the covariance matrix for class $\omega_i$.

The covariance matrix $\mathbf{\Sigma}_i$ determines the shape and orientation of the distribution for class $\omega_i$. If the covariance matrix is diagonal, the features are uncorrelated, and the distribution's elliptical contours are aligned with the coordinate axes. If the covariance matrix is non-diagonal, the features are correlated, and the contours are tilted relative to the coordinate axes.
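
The density formula can be evaluated directly, as in the following sketch. The function name `gaussian_density` and the test points are illustrative choices; the code assumes $\mathbf{\Sigma}_i$ is invertible.

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Evaluate the multivariate normal density p(x | mu, sigma)."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = mu.shape[0]
    diff = x - mu
    # Solve sigma @ z = diff rather than forming the inverse explicitly.
    quad = diff @ np.linalg.solve(sigma, diff)
    norm = (2.0 * np.pi) ** (d / 2.0) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm

# At the mean of a standard 2D Gaussian the density is 1 / (2 * pi).
peak = gaussian_density([0.0, 0.0], [0.0, 0.0], np.eye(2))
print(np.isclose(peak, 1.0 / (2.0 * np.pi)))

# With a diagonal covariance the density factorizes into 1D Gaussians.
joint = gaussian_density([1.0, 2.0], [0.0, 0.0], np.diag([1.0, 4.0]))
product = (np.exp(-0.5 * 1.0) / np.sqrt(2.0 * np.pi * 1.0)) * \
          (np.exp(-0.5 * 4.0 / 4.0) / np.sqrt(2.0 * np.pi * 4.0))
print(np.isclose(joint, product))
```

Note that the exponent is $-\tfrac{1}{2}$ times the squared Mahalanobis distance of $\mathbf{x}$ from $\boldsymbol{\mu}_i$, which is why that distance appears throughout the rest of this tutorial.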

---

## Mahalanobis Distance and Its Role

The **Mahalanobis distance** is a measure of the distance between a point and a distribution, taking into account the covariance structure of the distribution. It is defined as:

$$
D_M(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}
$$

The Mahalanobis distance is used in Bayesian decision theory to classify data points based on their distance from the mean of a distribution, normalized by the covariance matrix. This distance is particularly useful when the features are correlated or have different scales.

![Mahalanobis Distance](img/cov-figure-2.png)

*Figure 2: Mahalanobis distance accounts for the covariance structure of the data, unlike Euclidean distance.*
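
A minimal sketch of the contrast with Euclidean distance follows. The mean, covariance, and test points are hypothetical, and `mahalanobis` is a helper defined here rather than a library function.

```python
import numpy as np

def mahalanobis(x, mu, sigma):
    """Mahalanobis distance of x from a distribution with mean mu and covariance sigma."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(sigma, diff)))

mu = np.array([0.0, 0.0])
sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])   # hypothetical: more spread along x1 than along x2

a = np.array([2.0, 0.0])   # lies along the high-variance direction
b = np.array([0.0, 2.0])   # lies along the low-variance direction

# Both points are equally far from the mean in the Euclidean sense.
print(np.linalg.norm(a - mu), np.linalg.norm(b - mu))        # 2.0 2.0

# The Mahalanobis distance treats b as twice as far as a.
print(mahalanobis(a, mu, sigma), mahalanobis(b, mu, sigma))  # 1.0 2.0
```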

---

## Relationship to Normal Random Variables

In general, given a normal (Gaussian) random variable $X$ with variance $S=1$ and mean $\mu = 0$, any other normal random variable $R$ (with mean $\mu_1$ and variance $S_1$) can be defined in terms of $X$ by the equation:

$$
R = \mu_1 + \sqrt{S_1}X.
$$

Conversely, to recover a normalized random variable from any normal random variable, one can typically solve for:

$$
X = \frac{R - \mu_1}{\sqrt{S_1}}.
$$

If we square both sides and take the square root, we get an equation for a metric that looks a lot like the Mahalanobis distance:

$$
D = \sqrt{X^2} = \sqrt{\frac{(R - \mu_1)^2}{S_1}} = \sqrt{(R - \mu_1) S_1^{-1} (R - \mu_1)}.
$$

The resulting magnitude is always non-negative and varies with the distance of the data from the mean, attributes that are convenient when trying to define a model for the data.
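
The one-dimensional case can be sketched in a few lines of plain Python. The mean `mu_1 = 10` and variance `s_1 = 4` are arbitrary values chosen for illustration.

```python
import math

def standardize(r, mu_1, s_1):
    """Map a value r from N(mu_1, s_1) back to the standard normal scale."""
    return (r - mu_1) / math.sqrt(s_1)

def distance_1d(r, mu_1, s_1):
    """One-dimensional analogue of the Mahalanobis distance."""
    return math.sqrt((r - mu_1) * (1.0 / s_1) * (r - mu_1))

mu_1, s_1 = 10.0, 4.0   # hypothetical mean and variance
for r in (6.0, 10.0, 13.0):
    x = standardize(r, mu_1, s_1)
    d = distance_1d(r, mu_1, s_1)
    print(r, x, d)   # d equals abs(x): 2.0, 0.0, 1.5
```

The distance is simply the absolute value of the familiar z-score, which is the sense in which the Mahalanobis distance generalizes standardization to several dimensions.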

---

## Definition of Mahalanobis Distance

Given a probability distribution $Q$ on $\mathbb{R}^N$, with mean $\vec{\mu} = (\mu_1, \mu_2, \mu_3, \dots , \mu_N)^\mathsf{T}$ and covariance matrix $\mathbf{\Sigma}$, assumed positive definite so that $\mathbf{\Sigma}^{-1}$ exists, the Mahalanobis distance of a point $\vec{x} = (x_1, x_2, x_3, \dots, x_N )^\mathsf{T}$ from $Q$ is:

$$
d_M(\vec{x}, Q) = \sqrt{(\vec{x} - \vec{\mu})^\mathsf{T} \mathbf{\Sigma}^{-1} (\vec{x} - \vec{\mu})}.
$$

Given two points $\vec{x}$ and $\vec{y}$ in $\mathbb{R}^N$, the Mahalanobis distance between them with respect to $Q$ is:

$$
d_M(\vec{x}, \vec{y}; Q) = \sqrt{(\vec{x} - \vec{y})^\mathsf{T} \mathbf{\Sigma}^{-1} (\vec{x} - \vec{y})}.
$$

This means that $d_M(\vec{x}, Q) = d_M(\vec{x}, \vec{\mu}; Q)$.

When $\mathbf{\Sigma}$ is invertible, it is positive definite, and so is $\mathbf{\Sigma}^{-1}$; the quantity under each square root is therefore non-negative, and the square roots are always defined.
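
One way to read the two-point definition is as a Euclidean distance measured after a linear change of coordinates. The sketch below assumes a positive-definite $\mathbf{\Sigma}$ so that its Cholesky factor exists; the matrix and points are hypothetical.

```python
import numpy as np

sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])   # hypothetical positive-definite covariance
x = np.array([1.0, 2.0])
y = np.array([-0.5, 0.3])

# Direct evaluation of the definition.
diff = x - y
d_direct = np.sqrt(diff @ np.linalg.inv(sigma) @ diff)

# Cholesky factor: sigma = L @ L.T with L lower triangular.
L = np.linalg.cholesky(sigma)

# Whitening: w = L^{-1} (x - y), obtained by solving L @ w = diff.
w = np.linalg.solve(L, diff)
d_whitened = np.linalg.norm(w)

# The Mahalanobis distance equals the Euclidean norm in whitened coordinates.
print(np.isclose(d_direct, d_whitened))
```

This works because $\mathbf{\Sigma}^{-1} = \mathbf{L}^{-\mathsf{T}} \mathbf{L}^{-1}$, so the quadratic form is the squared length of $\mathbf{L}^{-1}(\vec{x} - \vec{y})$.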

---

## Intuitive Explanation of Mahalanobis Distance

Consider the problem of estimating the probability that a test point in $N$-dimensional Euclidean space belongs to a set, where we are given sample points that definitely belong to that set. Our first step would be to find the centroid or center of mass of the sample points. Intuitively, the closer the point in question is to this center of mass, the more likely it is to belong to the set.

However, we also need to know if the set is spread out over a large range or a small range, so that we can decide whether a given distance from the center is noteworthy or not. The simplistic approach is to estimate the standard deviation of the distances of the sample points from the center of mass. If the distance between the test point and the center of mass is less than one standard deviation, then we might conclude that it is highly probable that the test point belongs to the set. The further away it is, the more likely that the test point should not be classified as belonging to the set.

This intuitive approach can be made quantitative by defining the normalized distance between the test point and the set to be:

$$
\frac{\lVert x - \mu\rVert_2}{\sigma},
$$

which reads: 

$$
\frac{\lVert \text{test point} - \text{sample mean} \rVert}{\text{standard deviation}}.
$$

By plugging this into the normal distribution, we can derive the probability of the test point belonging to the set.

The drawback of the above approach is that we assumed that the sample points are distributed about the center of mass in a spherical manner. Were the distribution to be decidedly non-spherical, for instance ellipsoidal, then we would expect the probability of the test point belonging to the set to depend not only on the distance from the center of mass, but also on the direction. In those directions where the ellipsoid has a short axis, the test point must be closer, while in those where the axis is long, the test point can be further away from the center.

Putting this on a mathematical basis, the ellipsoid that best represents the set's probability distribution can be estimated by building the covariance matrix of the samples. The Mahalanobis distance is the distance of the test point from the center of mass divided by the width of the ellipsoid in the direction of the test point.
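
The "distance divided by width" reading can be checked numerically. In this sketch the width along a unit direction $u$ is taken to be the radius of the unit-Mahalanobis ellipsoid in that direction, $1/\sqrt{u^\mathsf{T} \mathbf{\Sigma}^{-1} u}$; the helper names and the covariance matrix are illustrative assumptions.

```python
import numpy as np

mu = np.array([0.0, 0.0])
sigma = np.array([[9.0, 0.0],
                  [0.0, 1.0]])   # hypothetical: long axis along x1, short axis along x2

def width_along(direction, sigma):
    """Radius of the unit-Mahalanobis ellipsoid along the given direction."""
    u = direction / np.linalg.norm(direction)
    return 1.0 / np.sqrt(u @ np.linalg.solve(sigma, u))

def mahalanobis(x, mu, sigma):
    """Mahalanobis distance computed from the definition."""
    diff = x - mu
    return np.sqrt(diff @ np.linalg.solve(sigma, diff))

results = []
for x in (np.array([3.0, 0.0]), np.array([0.0, 3.0]), np.array([2.0, 2.0])):
    diff = x - mu
    ratio = np.linalg.norm(diff) / width_along(diff, sigma)
    results.append((ratio, mahalanobis(x, mu, sigma)))
    print(x, ratio, mahalanobis(x, mu, sigma))   # the last two values agree
```

The first two test points are both 3 units from the center, yet their Mahalanobis distances are 1 and 3, because the ellipsoid is three times wider along $x_1$ than along $x_2$.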

---

## Example: Covariance Matrix in Gaussian Distributions

Let’s consider an example from Duda’s book (Chapter 2) where we have two classes, $\omega_1$ and $\omega_2$, with the following parameters:

- Mean vectors:
  $$
  \boldsymbol{\mu}_1 = \begin{bmatrix} 3 \\ 6 \end{bmatrix}, \quad \boldsymbol{\mu}_2 = \begin{bmatrix} 3 \\ -2 \end{bmatrix}
  $$

- Covariance matrices:
  $$
  \mathbf{\Sigma}_1 = \begin{bmatrix} 1/2 & 0 \\ 0 & 2 \end{bmatrix}, \quad \mathbf{\Sigma}_2 = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}
  $$

The covariance matrices describe the shape and orientation of the distributions for each class. For $\omega_1$, the variance along the $x_1$-axis is smaller than along the $x_2$-axis, resulting in a "squeezed" distribution along $x_1$. For $\omega_2$, the variances are equal, resulting in a spherical distribution.

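The decision boundary is the set of points where the two class discriminant functions are equal. Each discriminant combines the squared Mahalanobis distance to the class mean with terms that depend on $|\mathbf{\Sigma}_i|$ and the class prior. For equal priors, the boundary is: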

$$
x_2 = 3.514 - 1.125x_1 + 0.1875x_1^2
$$

This boundary is a parabola, reflecting the different covariance structures of the two classes.
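
The boundary coefficients can be reproduced with a short check. This sketch assumes equal priors, $P(\omega_1) = P(\omega_2) = 0.5$, and uses the discriminant $g_i(\mathbf{x}) = -\tfrac{1}{2} D_M^2 - \tfrac{1}{2}\ln|\mathbf{\Sigma}_i| + \ln P(\omega_i)$; the function names are illustrative.

```python
import math

# Parameters from the example; both covariance matrices are diagonal,
# so only their diagonal entries are stored.
mu_1, var_1 = (3.0, 6.0), (0.5, 2.0)
mu_2, var_2 = (3.0, -2.0), (2.0, 2.0)

def discriminant(x, mu, var, prior=0.5):
    """g(x) for a Gaussian class with diagonal covariance.
    Equal priors of 0.5 are assumed here."""
    maha_sq = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, var))
    log_det = sum(math.log(vi) for vi in var)
    return -0.5 * maha_sq - 0.5 * log_det + math.log(prior)

def boundary(x1):
    """Setting g_1 = g_2 gives 1.5 (x1 - 3)^2 - 8 x2 + 16 = ln 4; solve for x2."""
    return (1.5 * (x1 - 3.0) ** 2 + 16.0 - math.log(4.0)) / 8.0

# Expanded coefficients of the parabola: constant, x1, and x1^2 terms.
c0 = (1.5 * 9.0 + 16.0 - math.log(4.0)) / 8.0
c1 = -1.5 * 6.0 / 8.0
c2 = 1.5 / 8.0
print(round(c0, 3), c1, c2)   # 3.514 -1.125 0.1875

# On the boundary the two discriminants coincide.
gaps = []
for x1 in (-2.0, 3.0, 7.5):
    x = (x1, boundary(x1))
    gaps.append(discriminant(x, mu_1, var_1) - discriminant(x, mu_2, var_2))
print(all(abs(g) < 1e-9 for g in gaps))
```

The $\ln 4$ term comes from the different determinants, $|\mathbf{\Sigma}_1| = 1$ and $|\mathbf{\Sigma}_2| = 4$, which is why the boundary is not simply the set of points with equal Mahalanobis distance to the two means.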

---

## Conclusion

The **covariance matrix** is a fundamental concept in multivariate statistics and Bayesian decision theory. It describes the shape, orientation, and scale of data distributions and plays a key role in defining class-conditional probability densities. The **Mahalanobis distance**, which incorporates the covariance matrix, is a powerful tool for classification tasks, especially when dealing with correlated features.

By understanding the geometric interpretation of the covariance matrix and its role in Bayesian decision theory, you can better analyze and classify multivariate data. For further reading, refer to the original [Vision Dummy article](https://www.visiondummy.com/2014/04/geometric-interpretation-covariance-matrix/) and [SAS blog post](https://blogs.sas.com/content/iml/2012/02/15/what-is-mahalanobis-distance.html).
