# Norms and Distances for Data Scientists

## 1. Norms

A **norm** is a function that assigns a non-negative length or size to each vector in a vector space; only the zero vector has length zero. Norms measure the magnitude of a vector and are essential in optimization, regularization, and many machine learning algorithms.
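
For reference, the standard formal definition can be written out as follows. This is a supplement to the informal description above; \( \alpha \) denotes an arbitrary scalar and \( \mathbf{x}, \mathbf{y} \) arbitrary vectors.

$$
\begin{aligned}
&\| \mathbf{x} \| \geq 0, \quad \| \mathbf{x} \| = 0 \iff \mathbf{x} = \mathbf{0} && \text{(positive definiteness)} \\
&\| \alpha \mathbf{x} \| = |\alpha| \, \| \mathbf{x} \| && \text{(absolute homogeneity)} \\
&\| \mathbf{x} + \mathbf{y} \| \leq \| \mathbf{x} \| + \| \mathbf{y} \| && \text{(triangle inequality)}
\end{aligned}
$$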

### Common Norms

1. **L2 Norm (Euclidean Norm)**:
   The L2 norm is the most commonly used norm, measuring the Euclidean distance from the origin in Euclidean space.
   $$
   \| \mathbf{x} \|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2} = \sqrt{\sum_{i=1}^n x_i^2}
   $$
   **Usage**: The default choice in many machine learning algorithms; its square is the penalty term in L2 regularization (ridge regression).

2. **L1 Norm (Manhattan Norm or Taxicab Norm)**:
   The L1 norm is the sum of the absolute values of the components of the vector.
   $$
   \| \mathbf{x} \|_1 = |x_1| + |x_2| + \cdots + |x_n| = \sum_{i=1}^n |x_i|
   $$
   **Usage**: Common in sparsity-inducing regularization (L1 regularization), such as Lasso regression.

3. **L∞ Norm (Maximum Norm or Chebyshev Norm)**:
   The L∞ norm is the maximum absolute value among the components of the vector.
   $$
   \| \mathbf{x} \|_\infty = \max(|x_1|, |x_2|, \dots, |x_n|)
   $$
   **Usage**: Used when the worst-case deviation matters, such as minimizing the maximum error across components.

4. **Lp Norm**:
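   The Lp norm generalizes the L1 and L2 norms to any \( p \geq 1 \), and the L∞ norm is its limit as \( p \to \infty \). For \( p < 1 \) the formula no longer satisfies the triangle inequality, so it is not a norm. The formula for the Lp norm is: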
   $$
   \| \mathbf{x} \|_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}
   $$
   **Usage**: Varying \( p \) changes how strongly large components dominate the result. In regularization, \( p = 1 \) corresponds to the lasso penalty and \( p = 2 \) (in squared form) to the ridge penalty.

5. **Frobenius Norm (Matrix Norm)**:
   The Frobenius norm is used for matrices, and it is the square root of the sum of the absolute squares of the matrix elements.
   $$
   \| A \|_F = \sqrt{\sum_{i,j} |a_{ij}|^2}
   $$
   **Usage**: Measures the error of matrix approximations, such as the reconstruction error of the low-rank approximation underlying PCA.
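
The formulas above can be sketched in a few lines of plain Python. This is a minimal illustration using only the standard library; the function names `lp_norm` and `frobenius_norm` are chosen here for the example and are not from any particular library. In practice, `numpy.linalg.norm` covers the same cases.

```python
import math


def lp_norm(x, p):
    """Lp norm of a vector for p >= 1; p = math.inf gives the maximum norm."""
    if p < 1:
        raise ValueError("p must be >= 1 for the result to be a norm")
    if math.isinf(p):
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def frobenius_norm(matrix):
    """Square root of the sum of squared absolute entries of a matrix."""
    return math.sqrt(sum(abs(a) ** 2 for row in matrix for a in row))


x = [3.0, -4.0]
print(lp_norm(x, 1))         # 7.0  (L1: |3| + |-4|)
print(lp_norm(x, 2))         # 5.0  (L2: sqrt(9 + 16))
print(lp_norm(x, math.inf))  # 4.0  (L-infinity: largest absolute component)
print(frobenius_norm([[1.0, 2.0], [2.0, 4.0]]))  # 5.0  (sqrt(1 + 4 + 4 + 16))
```

For the same vector, the norms are ordered \( \| \mathbf{x} \|_\infty \leq \| \mathbf{x} \|_2 \leq \| \mathbf{x} \|_1 \), which the example values 4, 5, and 7 illustrate.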

---

## 2. Distances

A distance function measures how far apart two points (vectors) are; smaller values mean the points are more similar. Which distance to use depends on the type of data and the application.

### Common Distance Metrics

1. **Euclidean Distance (L2 Distance)**:
   The Euclidean distance is the straight-line distance between two points.
   $$
   d(\mathbf{x}, \mathbf{y}) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2} = \|\mathbf{x} - \mathbf{y}\|_2
   $$
   **Usage**: Commonly used in clustering (e.g., k-means) and nearest neighbors algorithms (e.g., k-NN).

2. **Manhattan Distance (L1 Distance)**:
   The Manhattan distance is the sum of the absolute differences of the components of the vectors.
   $$
   d(\mathbf{x}, \mathbf{y}) = |x_1 - y_1| + |x_2 - y_2| + \cdots + |x_n - y_n| = \|\mathbf{x} - \mathbf{y}\|_1
   $$
   **Usage**: Used in grid-based systems or when the path is restricted to axis-aligned movements.

3. **Cosine Similarity / Cosine Distance**:
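   Cosine similarity measures the cosine of the angle between two vectors, so it depends only on their directions and not on their magnitudes. Cosine distance is \( 1 - \text{cosine similarity} \). It does not satisfy the triangle inequality, so it is not a metric in the strict sense.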
   $$
   \text{cosine similarity}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|_2 \|\mathbf{y}\|_2}
   $$
   $$
   \text{cosine distance}(\mathbf{x}, \mathbf{y}) = 1 - \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|_2 \|\mathbf{y}\|_2}
   $$
   **Usage**: Used in text mining (e.g., document similarity) and with other high-dimensional data where the direction of a vector is more informative than its magnitude.

4. **Minkowski Distance**:
   The Minkowski distance generalizes both the Euclidean and Manhattan distances through a parameter \( p \geq 1 \). It is defined as:
   $$
   d(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^n |x_i - y_i|^p \right)^{1/p}
   $$
   **Usage**: Provides flexibility in selecting different distance measures depending on the value of \( p \). For \( p = 2 \), it becomes Euclidean, and for \( p = 1 \), it becomes Manhattan.

5. **Hamming Distance**:
   The Hamming distance is defined for two sequences of equal length (binary vectors, strings, or categorical feature vectors) and counts the number of positions where the corresponding elements differ.
   $$
   d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^n \mathbb{1}(x_i \neq y_i)
   $$
   **Usage**: Used in error detection and correction codes and for comparing categorical feature vectors.

6. **Jaccard Distance**:
   The Jaccard distance measures the dissimilarity of two sets \( A \) and \( B \) as one minus the ratio of the size of their intersection to the size of their union.
   $$
   \text{Jaccard distance}(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}
   $$
   **Usage**: Used in set-based data analysis, particularly for categorical or binary features.

7. **Mahalanobis Distance**:
   The Mahalanobis distance accounts for the correlations between features and is scale-invariant, because differences along each direction are rescaled by the variance of the data in that direction.
   $$
   d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^T \Sigma^{-1} (\mathbf{x} - \mathbf{y})}
   $$
   where \( \Sigma \) is the covariance matrix of the dataset. When \( \Sigma \) is the identity matrix, this reduces to the Euclidean distance.
   **Usage**: Useful in multivariate anomaly detection and classification.

8. **Earth Mover's Distance (EMD)**:
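   The Earth Mover's Distance, equivalent to the 1-Wasserstein distance for probability distributions, measures the minimum cost of turning one distribution into another, where the cost is the amount of mass moved multiplied by the distance it is moved.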
   **Usage**: Used in applications such as image retrieval and generative models.
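
Several of the distances above can be sketched in plain Python as follows. This is an illustrative sketch using only the standard library: the function names are chosen for this example, the Mahalanobis version is restricted to two dimensions so that the covariance matrix can be inverted by hand, and returning 0 for two empty sets in `jaccard_distance` is a convention assumed here. In practice, `scipy.spatial.distance` provides tested implementations.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance; p = 1 is Manhattan, p = 2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def cosine_distance(x, y):
    """1 - cosine similarity; undefined when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_x * norm_y)


def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(x, y))


def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty sets are identical
    return 1.0 - len(a & b) / len(union)


def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance for 2-D points, given a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix must be invertible")
    dx, dy = x[0] - y[0], x[1] - y[1]
    # Quadratic form v^T cov^{-1} v, with cov^{-1} = [[d, -b], [-c, a]] / det.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(quad)


x, y = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(x, y, 1))                         # 7.0  (Manhattan)
print(minkowski(x, y, 2))                         # 5.0  (Euclidean)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))    # 1.0  (orthogonal vectors)
print(hamming("karolin", "kathrin"))              # 3
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))     # 0.5
print(mahalanobis_2d([2.0, 0.0], [0.0, 0.0], [[4.0, 0.0], [0.0, 1.0]]))  # 1.0
```

The last line shows the rescaling effect: the two points are a Euclidean distance of 2 apart, but because the first feature has variance 4 (standard deviation 2), the Mahalanobis distance is 1.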

---

## 3. Applications in Data Science

- **Clustering**: k-means assigns each point to its nearest centroid under Euclidean distance and minimizes the within-cluster sum of squared Euclidean distances.
- **Classification**: k-NN uses distance metrics (like Euclidean, Manhattan, or Mahalanobis) to classify data points based on their proximity to labeled instances.
- **Dimensionality Reduction**: PCA finds the directions of maximum variance, which is equivalent to minimizing the squared Euclidean distance between the data points and their projections onto a lower-dimensional subspace.
- **Recommendation Systems**: Cosine similarity and Jaccard distance are used to recommend items based on similarity between users or items.
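
To make the role of the distance function concrete, here is a minimal k-NN classifier sketch in which the metric is a swappable argument. It is a toy illustration under simple assumptions (brute-force search, majority vote, made-up data points); `knn_predict` and its parameters are names invented for this example, not a library API.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def knn_predict(query, points, labels, k=3, distance=euclidean):
    """Return the majority label among the k training points nearest to query."""
    ranked = sorted(zip(points, labels), key=lambda pair: distance(query, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Two well-separated toy clusters (made-up data for illustration).
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict([0.5, 0.5], points, labels))                      # a
print(knn_predict([5.5, 5.5], points, labels, distance=manhattan))  # b
```

On well-separated data like this, the choice of metric does not change the prediction; it starts to matter when features have different scales or are correlated.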

## 4. Important Properties of Distances

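A distance function is called a *metric* when it satisfies all four of the following properties for every \( \mathbf{x}, \mathbf{y}, \mathbf{z} \). Not every measure listed in Section 2 qualifies; cosine distance, for example, violates the triangle inequality.

- **Non-negativity**: $d(\mathbf{x}, \mathbf{y}) \geq 0$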
- **Identity of indiscernibles**: $d(\mathbf{x}, \mathbf{y}) = 0$ if and only if $\mathbf{x} = \mathbf{y}$
- **Symmetry**: $d(\mathbf{x}, \mathbf{y}) = d(\mathbf{y}, \mathbf{x})$
- **Triangle inequality**: $d(\mathbf{x}, \mathbf{z}) \leq d(\mathbf{x}, \mathbf{y}) + d(\mathbf{y}, \mathbf{z})$
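
These properties can be checked numerically on specific points. The sketch below, using only the standard library and helper names invented for this example, tests the triangle inequality for one hand-picked triple of vectors. A single passing triple proves nothing in general, but a single failing triple is enough to show that a function is not a metric.

```python
import math


def euclidean(x, y):
    """Straight-line (L2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_distance(x, y):
    """1 - cosine similarity, for non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def satisfies_triangle(d, x, y, z, tol=1e-12):
    """Check d(x, z) <= d(x, y) + d(y, z) for one triple, with a small tolerance."""
    return d(x, z) <= d(x, y) + d(y, z) + tol


# y lies at 45 degrees between the orthogonal vectors x and z.
x, y, z = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]

print(satisfies_triangle(euclidean, x, y, z))        # True
print(satisfies_triangle(cosine_distance, x, y, z))  # False
```

Here the cosine distance from `x` to `z` is 1, while the two legs through `y` each have cosine distance \( 1 - 1/\sqrt{2} \approx 0.29 \), so their sum falls short of the direct distance.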

---

## Conclusion

For a data scientist, a solid understanding of norms and distances is essential for tasks ranging from clustering and classification to similarity search and dimensionality reduction. The choice of norm and distance metric depends on the characteristics of your data and the problem you're solving.
