<a id="table-of-contents"></a>  
# 📖 Table of Contents  
1. [🧭 Overview](#overview)  
2. [📏 Distance Metrics for Numeric Data](#distance-metrics-for-numeric-data)  
   - [📌 Euclidean Distance](#euclidean-distance)  
   - [📌 Manhattan Distance](#manhattan-distance)  
   - [📌 Minkowski Distance](#minkowski-distance)  
   - [📌 Mahalanobis Distance](#mahalanobis-distance)  
3. [🧮 Distance Metrics for Vectors and Angles](#distance-metrics-for-vectors-and-angles)  
   - [📌 Cosine Similarity / Distance](#cosine-similarity--distance)  
4. [🔤 Distance Metrics for Categorical or Binary Data](#distance-metrics-for-categorical-or-binary-data)  
   - [📌 Hamming Distance](#hamming-distance)  
   - [📌 Jaccard Similarity / Distance](#jaccard-similarity--distance)  
5. [📊 Similarity Measures for Continuous Data](#similarity-measures-for-continuous-data)  
   - [📌 Pearson Correlation](#pearson-correlation)  
   - [📌 Spearman Rank Correlation](#spearman-rank-correlation)  
___

<a id="overview"></a>  
# 🧭 Overview  

<details><summary><strong>📖 Click to Expand</strong></summary>  

<p>This notebook covers a wide range of <strong>distance and similarity metrics</strong> that are foundational to machine learning and statistical analysis.</p>

<ul>
  <li>📏 <strong>Distance Metrics for Numeric Data</strong>  
    Includes Euclidean, Manhattan, Minkowski, and Mahalanobis distances—core to algorithms like KNN, clustering, and anomaly detection.
  </li>
  <li>🧮 <strong>Vector-Based Measures</strong>  
    Covers Cosine similarity, useful in high-dimensional spaces like NLP and recommender systems.
  </li>
  <li>🔤 <strong>Distance Metrics for Categorical/Binary Data</strong>  
    Includes Hamming and Jaccard distances, often used in matching and similarity scoring for categorical features.
  </li>
  <li>📊 <strong>Similarity Measures for Continuous Data</strong>  
    Covers Pearson and Spearman correlations, essential for understanding relationships and dependencies between numeric variables.
  </li>
</ul>

<p>Each section contains:</p>
<ul>
  <li>Clear explanation + intuition</li>
  <li>Mathematical formula</li>
  <li>Clean, reproducible code implementation</li>
</ul>

</details>


[Back to the top](#table-of-contents)
___


<a id="distance-metrics-for-numeric-data"></a>  
# 📏 Distance Metrics for Numeric Data  

<details><summary><strong>📖 Click to Expand</strong></summary>  

This section includes distance metrics that operate on **numerical features**. These metrics are used when data points are represented as vectors in a continuous feature space.

They form the backbone of many machine learning algorithms, particularly those that rely on geometric closeness, such as:

- 📌 **K-Nearest Neighbors (KNN)**
- 📌 **K-Means Clustering**
- 📌 **Anomaly Detection**
- 📌 **Distance-based recommender systems**

Each metric here differs in how it defines "closeness"—some are sensitive to scale or outliers, while others account for data correlations.

</details>


#### 📌 Euclidean Distance  
<a id="euclidean-distance"></a>  

<details><summary><strong>📖 Click to Expand</strong></summary>  

🧠 **Intuition**  
The straight-line (as-the-crow-flies) distance between two points in space. Think of it like using a ruler to measure distance on a map.

🧮 **Formula**

$$
d(x, y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }
$$

⚠️ **Sensitivity**  
- Sensitive to scale differences between features  
- Highly affected by outliers  
- Requires normalization when features vary in range

🧰 **Use Cases + Real-World Examples**  
- Used in **KNN** and **K-Means** to compute closeness  
- In **image processing**, for comparing pixel intensities or feature embeddings  
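- Can model physical distances in **geospatial analysis** when coordinates are projected onto a plane with consistent units (raw latitude/longitude pairs call for a great-circle formula instead)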

📝 **Notes**  
- Assumes all features contribute equally  
- Simple, intuitive, but not always reliable without preprocessing  
- Can mislead in high-dimensional spaces, where pairwise distances tend to concentrate around similar values, or with unscaled features

</details>
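
A minimal sketch of the Euclidean formula above, using only the Python standard library. The function name `euclidean_distance` and the sample points are illustrative assumptions, not part of any particular library.

```python
import math

def euclidean_distance(x, y):
    """Straight-line (L2) distance between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Hypothetical points forming the classic 3-4-5 right triangle.
a = [0.0, 0.0]
b = [3.0, 4.0]
print(euclidean_distance(a, b))  # 5.0

# Scale sensitivity: if the second feature is recorded in units 1000x smaller,
# its numeric range explodes and it dominates the distance.
b_rescaled = [3.0, 4000.0]
print(euclidean_distance(a, b_rescaled))
```

The second call illustrates why normalization is usually applied first: the first feature barely influences the result once the second feature's range is inflated.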


<hr style="border: none; height: 1px; background-color: #ddd;" />


#### 📌 Manhattan Distance  
<a id="manhattan-distance"></a>  

<details><summary><strong>📖 Click to Expand</strong></summary>  

🧠 **Intuition**  
Measures distance by summing absolute differences across dimensions. Like navigating a city grid—no diagonal shortcuts, only vertical and horizontal movement.

🧮 **Formula**

$$
d(x, y) = \sum_{i=1}^{n} |x_i - y_i|
$$

⚠️ **Sensitivity**  
- Less sensitive to outliers than Euclidean  
- Still scale-dependent—normalization is recommended  
- Can be more robust in sparse or high-dimensional settings

🧰 **Use Cases + Real-World Examples**  
- Common in **recommender systems** where input vectors are high-dimensional and sparse  
- Shares its L1 norm with **L1-regularized models** like Lasso, where the same absolute-value penalty is applied to coefficients to induce sparsity  
- Helpful when minimizing absolute error is preferred (e.g., **median-based objectives**)

📝 **Notes**  
- Matches the true path cost when movement is restricted to axis-aligned steps (e.g., routing on a grid)  
- Useful when small differences across many features matter more than large differences in a few  
- Can outperform Euclidean in high-dimensional, noisy data, though this depends on the dataset

</details>
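
A minimal sketch of the Manhattan formula above. The function name `manhattan_distance` and the sample vectors are assumptions chosen for illustration.

```python
def manhattan_distance(x, y):
    """Sum of absolute coordinate differences (L1 distance)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# Hypothetical feature vectors.
a = [1, 2, 3]
b = [4, 0, 3]
print(manhattan_distance(a, b))  # |1-4| + |2-0| + |3-3| = 5
```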


<hr style="border: none; height: 1px; background-color: #ddd;" />

#### 📌 Minkowski Distance  
<a id="minkowski-distance"></a>  

<details><summary><strong>📖 Click to Expand</strong></summary>  

🧠 **Intuition**  
A generalization of both Euclidean and Manhattan distances. By adjusting the parameter $p$, it morphs into different distance metrics. Think of it as a flexible distance formula with a sensitivity dial.

🧮 **Formula**

$$
d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
$$

⚠️ **Sensitivity**  
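- Sensitive to the choice of $p$ (it is a true metric only for $p \ge 1$):  
  - $p = 1$: Manhattan Distance  
  - $p = 2$: Euclidean Distance  
  - $p \to \infty$: Chebyshev Distance (the largest single-coordinate difference)  
- Higher $p$ values emphasize larger deviations  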
- Still scale-dependent like its special cases

🧰 **Use Cases + Real-World Examples**  
- Used in **KNN classifiers** to experiment with different notions of "closeness"  
- Helpful in **model tuning**, especially when testing sensitivity to distance metrics  
- Useful in **feature engineering pipelines** with customizable distance needs

📝 **Notes**  
- Acts as a bridge between L1 and L2 distances  
- Not commonly used directly, but understanding it gives you control over distance behavior  
- Can help explore robustness to outliers by adjusting $p$

</details>
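
A minimal sketch of the Minkowski formula above, showing how the special cases fall out of one function. The names `minkowski_distance` and `chebyshev_distance` are illustrative assumptions; the Chebyshev function is the limiting case as p grows without bound.

```python
def minkowski_distance(x, y, p=2):
    """Minkowski distance of order p (p >= 1) between equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if p < 1:
        raise ValueError("p must be >= 1 for the result to be a metric")
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1.0 / p)

def chebyshev_distance(x, y):
    """Limit of the Minkowski distance as p grows: the largest coordinate gap."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

# Hypothetical points.
a = [0.0, 0.0]
b = [3.0, 4.0]
print(minkowski_distance(a, b, p=1))   # Manhattan: 7.0
print(minkowski_distance(a, b, p=2))   # Euclidean: 5.0
print(minkowski_distance(a, b, p=50))  # approaches the Chebyshev value
print(chebyshev_distance(a, b))        # 4.0
```

As p increases, the largest single difference increasingly dominates the sum, which is why the result drifts toward the Chebyshev value.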


<hr style="border: none; height: 1px; background-color: #ddd;" />

#### 📌 Mahalanobis Distance  
<a id="mahalanobis-distance"></a>  

<details><summary><strong>📖 Click to Expand</strong></summary>  

🧠 **Intuition**  
Measures distance between a point and a distribution, not just another point. It accounts for the variance and correlation in the data, effectively "whitening" the space before measuring distance.

🧮 **Formula**

$$
d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}
$$

Where $S$ is the covariance matrix of the data. To measure the distance from a point $x$ to a distribution, set $y$ to the distribution's mean $\mu$.

⚠️ **Sensitivity**  
- Invariant to feature scale: the inverse covariance matrix rescales each feature internally  
- Breaks down under perfect multicollinearity, because a singular covariance matrix has no inverse  
- Requires a well-estimated covariance matrix (large sample size helps)

🧰 **Use Cases + Real-World Examples**  
- Common in **multivariate outlier detection** (e.g., fraud detection in finance)  
- Used in **discriminant analysis** (e.g., LDA)  
- Helpful when features are correlated, unlike Euclidean/Manhattan

📝 **Notes**  
- Allows distance to stretch/shrink based on feature correlation structure  
- Highlights points that are far from the mean *and* unusual based on the data distribution  
- More reliable with large, clean datasets—can break with singular or noisy covariance

</details>
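
A minimal sketch of the Mahalanobis formula above, restricted to two features so the covariance matrix can be inverted in closed form without external libraries. The function names, the 2-D restriction, and the sample data are all assumptions made for illustration; real workloads would typically use a linear-algebra library for the general case.

```python
import math

def mean_and_covariance_2d(points):
    """Return the mean and sample covariance (n - 1 denominator) of 2-D points."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to estimate covariance")
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_2d(x, y, cov):
    """Mahalanobis distance between 2-D points for covariance (sxx, sxy, syy)."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular; it has no inverse")
    dx, dy = x[0] - y[0], x[1] - y[1]
    # (x - y)^T S^{-1} (x - y), using the closed-form inverse of a 2x2 matrix.
    quad = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(quad)

# Hypothetical, strongly correlated data with a mean of roughly (3, 3).
data = [(1, 1.0), (2, 2.1), (3, 2.9), (4, 4.2), (5, 4.8)]
center, cov = mean_and_covariance_2d(data)

# Both queries are the same Euclidean distance from the mean.
along_trend = (4, 4)    # consistent with the correlation
against_trend = (4, 2)  # contradicts the correlation
print(mahalanobis_2d(along_trend, center, cov))
print(mahalanobis_2d(against_trend, center, cov))
```

The point that contradicts the correlation structure gets a much larger distance, even though a ruler would place both points equally far from the mean.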


[Back to the top](#table-of-contents)
___


<a id="distance-metrics-for-vectors-and-angles"></a>  
# 🧮 Distance Metrics for Vectors and Angles  

<details><summary><strong>📖 Click to Expand</strong></summary>  

This section focuses on metrics that measure **angular relationships** between vectors, rather than their raw distance.

These are especially useful in **high-dimensional spaces** where magnitude is less meaningful and **direction** matters more.

Typical scenarios include:
- 🧠 **NLP**: comparing TF-IDF or embedding vectors  
- 🎧 **Recommender Systems**: user/item interaction vectors  
- 🧬 **Similarity Scoring** in sparse or normalized datasets

These metrics shine when you're more interested in **alignment** than absolute difference.

</details>


#### 📌 Cosine Similarity / Distance  
<a id="cosine-similarity--distance"></a>  

<details><summary><strong>📖 Click to Expand</strong></summary>  

🧠 **Intuition**  
Measures the angle between two vectors, not their magnitude. It captures how aligned two directions are—perfect for understanding similarity in high-dimensional, sparse spaces.

🧮 **Formula**

$$
\text{Cosine Similarity} = \frac{\vec{x} \cdot \vec{y}}{||\vec{x}|| \cdot ||\vec{y}||}
$$

$$
\text{Cosine Distance} = 1 - \text{Cosine Similarity}
$$

⚠️ **Sensitivity**  
- Ignores magnitude, focuses only on orientation  
- Not affected by vector scaling (e.g., multiplying a vector by 10 doesn’t change similarity)  
- In very sparse vectors, the score is driven by the few non-zero dimensions the two vectors share

🧰 **Use Cases + Real-World Examples**  
- Dominant metric in **text analysis**, especially with **TF-IDF** vectors  
- Used in **recommender systems** to compute user-item similarity  
- Helps detect directionally similar patterns regardless of intensity (e.g., in topic modeling)

📝 **Notes**  
- Works well when **direction matters more than magnitude**  
- Undefined for zero vectors (division by zero) and unstable for near-zero vectors, so these edge cases need explicit handling  
- In practice, often used with high-dimensional embeddings (e.g., NLP, document matching)

</details>
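
A minimal sketch of the cosine formulas above. The function names are illustrative assumptions, and raising an error for zero vectors is one possible way to handle that edge case, not the only one.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_x * norm_y)

def cosine_distance(x, y):
    """One minus the cosine similarity."""
    return 1.0 - cosine_similarity(x, y)

# Hypothetical term-count vectors: b is a scaled-up copy of a.
a = [1, 2, 3]
b = [10, 20, 30]
print(cosine_similarity(a, b))            # approximately 1: same direction
print(cosine_similarity([1, 0], [0, 1]))  # 0.0: orthogonal vectors
print(cosine_distance(a, b))              # approximately 0
```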


[Back to the top](#table-of-contents)
___


<a id="distance-metrics-for-categorical-or-binary-data"></a>  
# 🔤 Distance Metrics for Categorical or Binary Data  

<details><summary><strong>📖 Click to Expand</strong></summary>  

This section includes metrics tailored for **categorical**, **binary**, or **boolean** feature spaces—where traditional numeric distances don’t make sense.

These are particularly useful when:
- Your data is one-hot encoded  
- You're comparing sequences, strings, or sets  
- Features are **non-numeric** but still informative

Common applications:
- 🧬 **Genomic and text sequence comparison**  
- 📦 **Product recommendation based on binary attributes**  
- 🏷️ **Clustering with categorical features**  

These metrics help quantify **presence/absence** and **set overlap**, making them ideal for discrete comparisons.

</details>


#### 📌 Hamming Distance  
<a id="hamming-distance"></a>  

<details><summary><strong>📖 Click to Expand</strong></summary>  

🧠 **Intuition**  
Counts how many positions two strings (or binary vectors) differ in. Imagine comparing two passwords or binary sequences and marking the mismatches.

🧮 **Formula**

$$
d(x, y) = \sum_{i=1}^{n} \mathbf{1}(x_i \ne y_i)
$$

$$
\text{where } \mathbf{1}(x_i \ne y_i) =
\begin{cases}
1, & \text{if } x_i \ne y_i \\
0, & \text{otherwise}
\end{cases}
$$

⚠️ **Sensitivity**  
- Only works on equal-length vectors  
- Binary/categorical only—makes no sense for continuous values  
- Each mismatch is treated equally, no weighting

🧰 **Use Cases + Real-World Examples**  
- Used in **error detection/correction** (e.g., digital communication, QR codes)  
- Common in **genomic sequence analysis**  
- Helpful for comparing **one-hot encoded categorical features** in clustering or similarity scoring

📝 **Notes**  
- Simple and interpretable for binary comparisons  
- Doesn’t account for *how different* the values are—just whether they differ  
- Can be extended to non-binary categorical data using matching scores

</details>
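
A minimal sketch of the Hamming formula above. The function name `hamming_distance` is an illustrative assumption; it accepts any pair of equal-length sequences, including strings.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(xi != yi for xi, yi in zip(x, y))

# Strings: positions 3, 4, and 5 differ (r/t, o/h, l/r).
print(hamming_distance("karolin", "kathrin"))        # 3

# Binary vectors: the second and third positions differ.
print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))  # 2
```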


<hr style="border: none; height: 1px; background-color: #ddd;" />

#### 📌 Jaccard Similarity / Distance  
<a id="jaccard-similarity--distance"></a>  

<details><summary><strong>📖 Click to Expand</strong></summary>  

🧠 **Intuition**  
Measures the overlap between two sets relative to their union. It tells you *how similar two binary vectors or sets are*, ignoring what they don't share.

🧮 **Formula**

$$
\text{Jaccard Similarity} = \frac{|A \cap B|}{|A \cup B|}
$$

$$
\text{Jaccard Distance} = 1 - \text{Jaccard Similarity}
$$

⚠️ **Sensitivity**  
- Only works on binary/categorical data or sets  
- Ignores true negatives (things both sets don't have)  
- Undefined when both sets are empty (0/0); in very sparse data, the score hinges on a handful of shared items

🧰 **Use Cases + Real-World Examples**  
- Common in **recommender systems** to compare item sets (e.g., users with similar purchase histories)  
- Used in **clustering binary data** (e.g., one-hot encoded attributes)  
- Applied in **text mining** to compare sets of words (bag-of-words or shingled phrases)

📝 **Notes**  
- Especially useful when **presence** is more important than absence  
- Performs well when comparing sparse or asymmetric binary vectors  
- Jaccard Distance is a proper metric (satisfies triangle inequality)

</details>
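
A minimal sketch of the Jaccard formulas above, using Python sets. The function names and the shopping-basket data are illustrative assumptions, and raising an error for two empty sets is one possible convention for the 0/0 case.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        raise ValueError("Jaccard similarity is undefined for two empty sets")
    return len(a & b) / len(union)

def jaccard_distance(a, b):
    """One minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)

# Hypothetical purchase histories: 2 shared items out of 5 distinct items.
basket_1 = {"milk", "bread", "eggs"}
basket_2 = {"bread", "eggs", "tea", "jam"}
print(jaccard_similarity(basket_1, basket_2))  # 0.4
print(jaccard_distance(basket_1, basket_2))
```

Items that neither basket contains never enter the calculation, which is the "ignores true negatives" behavior described above.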


[Back to the top](#table-of-contents)
___


<a id="similarity-measures-for-continuous-data"></a>  
# 📊 Similarity Measures for Continuous Data  

<details><summary><strong>📖 Click to Expand</strong></summary>  

This section covers **correlation-based similarity measures** for continuous variables. Instead of measuring distance, these metrics quantify the **strength and direction of relationships** between variables.

Use cases typically involve:
- 📈 **Exploratory Data Analysis (EDA)**  
- 🧪 **Feature selection** in modeling pipelines  
- 💰 **Financial modeling** (e.g., correlation between asset returns)

These measures are:
- Scale-invariant  
- Useful for spotting patterns in **paired continuous variables**  
- Sensitive to relationship type—linear vs. monotonic

These metrics are key to understanding **how variables move together**, whether for modeling or diagnostics.

</details>


#### 📌 Pearson Correlation  
<a id="pearson-correlation"></a>  

<details><summary><strong>📖 Click to Expand</strong></summary>  

🧠 **Intuition**  
Measures the strength and direction of a **linear relationship** between two continuous variables. A value of +1 means perfect positive linear correlation, -1 means perfect negative, and 0 means no linear relationship.

🧮 **Formula**

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

⚠️ **Sensitivity**  
- Extremely sensitive to **outliers**  
- Assumes **linearity**  
- Affected by non-normal distributions or non-constant variance

🧰 **Use Cases + Real-World Examples**  
- Used in **feature selection** (e.g., removing highly correlated variables)  
- Helps in **exploratory data analysis** to understand relationships  
- Common in **finance** (e.g., correlation between stock returns)

📝 **Notes**  
- Does **not imply causation**—only association  
- Works best when both variables are continuous, normally distributed, and linearly related  
- For monotonic but non-linear relationships, consider Spearman instead

</details>
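
A minimal sketch of the Pearson formula above. The function name `pearson_correlation` and the sample data are illustrative assumptions; raising an error for a constant variable is one way to handle the zero-variance case.

```python
import math

def pearson_correlation(x, y):
    """Pearson's r between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / (math.sqrt(ss_x) * math.sqrt(ss_y))

# Hypothetical data.
x = [1, 2, 3, 4, 5]
linear = [2 * v + 1 for v in x]  # perfectly linear in x
cubic = [v ** 3 for v in x]      # monotonic but not linear

print(pearson_correlation(x, linear))  # approximately 1
print(pearson_correlation(x, cubic))   # below 1: the curve is not a line
```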


<hr style="border: none; height: 1px; background-color: #ddd;" />

#### 📌 Spearman Rank Correlation  
<a id="spearman-rank-correlation"></a>  

<details><summary><strong>📖 Click to Expand</strong></summary>  

🧠 **Intuition**  
Measures the **monotonic relationship** between two variables using their ranks instead of raw values. It tells you whether the relationship is consistently increasing or decreasing, even if not linear.

🧮 **Formula**

If there are no tied ranks (with ties, compute the Pearson correlation of the ranks instead):

$$
\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
$$

$$
\text{where } d_i = \text{rank}(x_i) - \text{rank}(y_i)
$$

⚠️ **Sensitivity**  
- More robust to **outliers** than Pearson, since ranks cap the influence of extreme values  
- Captures **monotonic** (not just linear) trends  
- Still assumes **ordinal** or continuous variables

🧰 **Use Cases + Real-World Examples**  
- Great for **ordinal data** (e.g., survey rankings, Likert scales)  
- Used when variables don’t meet normality assumptions  
- Common in **bioinformatics** or **psychometrics** for measuring association strength

📝 **Notes**  
- Doesn’t assume linearity or equal spacing between values  
- Less statistically powerful than Pearson when the relationship is linear and the data are roughly normal  
- Ideal fallback when data violates Pearson’s assumptions

</details>
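
A minimal sketch of the no-ties Spearman formula above. The function names `ranks` and `spearman_no_ties` are illustrative assumptions, and the code deliberately rejects tied values because the shortcut formula is only valid without ties.

```python
def ranks(values):
    """Assign ranks 1..n by ascending value; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_no_ties(x, y):
    """Spearman's rho via the rank-difference formula (no tied values)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    if len(set(x)) != n or len(set(y)) != n:
        raise ValueError("this formula assumes no tied values")
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical data.
x = [1, 2, 3, 4, 5]
cubic = [v ** 3 for v in x]   # monotonic increasing, not linear
reversed_x = [5, 4, 3, 2, 1]  # strictly decreasing

print(spearman_no_ties(x, cubic))       # 1.0: ranks agree perfectly
print(spearman_no_ties(x, reversed_x))  # -1.0: ranks are fully reversed
```

The cubic example scores a perfect 1.0 here because only the ordering matters, not the spacing between values.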


[Back to the top](#table-of-contents)
___


# Scratch Notes

**Distance Metrics:**
- **Euclidean Distance**: The straight-line distance between two points in space.
- **Manhattan Distance**: Also called L1 distance; the sum of absolute differences.
- **Minkowski Distance**: A generalization of Euclidean and Manhattan distance.
- **Mahalanobis Distance**: Accounts for the correlations in the data set; useful for multivariate analysis.
- **Hamming Distance**: Compares sequences of equal length by counting the positions at which the corresponding elements differ.

**Similarity Measures:**
- **Cosine Similarity**: The cosine of the angle between two vectors; often used in text analysis.
- **Jaccard Similarity**: Measures overlap between finite sets; used for binary attributes.
- **Pearson Correlation**: Measures linear correlation between two variables.
- **Spearman Rank Correlation**: Measures the monotonic relationship between two variables using their rank order.
