# Unsupervised Learning

Unsupervised learning is a type of machine learning where the model learns from unlabeled data — meaning we do not provide output labels or categories.

The goal is to discover hidden patterns, structures, or features in the data.

- No labeled output (no Y values)
- The model finds:
  - Groups / Clusters
  - Structure
  - Anomalies
  - Reduced Dimensions

**Use Cases**
- Customer Segmentation
- Document categorization
- Image grouping

## Types of Unsupervised Learning:

- **Clustering or Cluster Analysis**

- **Association Rule Learning**

- **Dimensionality Reduction**

- **Anomaly Detection**

**Note**: Classification vs Clustering - Classification has pre-defined classes, but Clustering has no known classes.

https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/

### Cluster Analysis

Cluster analysis is a type of unsupervised learning where we group similar data points together. The objective is to group objects so that those in the same cluster are more similar to each other than to those in other clusters.

#### Types of Clustering Algorithms

1. **K-means Clustering**
   - Partitions data into **K** distinct clusters.
   - Each data point belongs to the cluster with the nearest mean.

2. **Hierarchical Clustering**
   - Builds a hierarchy of clusters.
   - Can be agglomerative (bottom-up) or divisive (top-down).

3. **DBSCAN (Density-Based Spatial Clustering of Applications with Noise)**
   - Groups together closely packed points.
   - Identifies outliers as noise.

4. **Gaussian Mixture Models (GMM)**
   - Assumes data is generated from a mixture of several Gaussian distributions.
   - Soft clustering approach (probabilistic).



# 1. Clustering



## 1.1. K-Means Clustering
K-means is an unsupervised machine learning algorithm used to group data into K distinct clusters, where each data point belongs to the cluster with the nearest centroid (mean of the cluster).

- K = Number of clusters you define
- Centroid = Mean position of all points in a cluster
- Each point is assigned to the cluster with the nearest centroid. (Hence, K-means is a distance-based clustering algorithm.)

![K-means Clustering Diagram](https://developers.google.com/static/machine-learning/clustering/images/clustering_example.png?hl=pt-br)

K-means clustering is distance-based. The steps below explain how the clustering takes place.

### Steps in K-Means Algorithm

**Step1**: Choose the number of clusters, K.

**Step2**: Randomly initialize K centroids. (Assume a starting mean for each cluster.)

**Step3**: Assign each point to the nearest centroid.

**Step4**: Recalculate centroids by taking the mean of all assigned points.

**Step5**: Repeat steps 3 and 4 until the centroids no longer change (convergence).
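The five steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names, the toy `points`, and the choice to initialize centroids by sampling K data points with a fixed `seed` are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Step 2: pick K distinct data points as the initial centroids (assumption).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 4: recompute each centroid as the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(mean_point(members) if members else centroids[c])
        # Step 5: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Step 1: choose K (here 2) for a toy dataset with two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

Which cluster gets index 0 depends on the random initialization, so only the grouping itself is meaningful, not the label numbers.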

Animation demonstrating how K-means works:
https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

### Evaluation Metrics for K-Means

- **Inertia (WCSS)**: Sum of squared distances from points to their centroids.

- **Silhouette Score**: How well a point fits in its cluster compared to others.

- **Elbow Method**: Helps choose the best value of K by plotting WCSS vs K.

#### 1.1.1 Inertia / WCSS
Inertia is a metric used to evaluate the quality of K-means clustering. It measures how tightly the data points are clustered around the centroids.

- Low Inertia = points are close to their centroids → good clustering

- High Inertia = points are far from centroids → poor clustering

**Formula**

$$
\text{Inertia} = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2
$$

Where:
- $ k $: Number of clusters  
- $ C_i $: Set of points in cluster $ i $  
- $ \mu_i $: Centroid of cluster $ i $  
- $ x $: Data point in cluster $ i $  
- $ \|x - \mu_i\|^2 $: Squared Euclidean distance between point and centroid
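As a rough illustration of the formula, inertia can be computed by summing the squared distance from every point to the centroid of its assigned cluster. The `points`, `labels`, and `centroids` below are made-up values chosen so the arithmetic is easy to follow by hand.

```python
def inertia(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its own centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[label]))
        for p, label in zip(points, labels)
    )


# Hypothetical toy data: two clusters with known centroids.
points = [(1.0, 2.0), (3.0, 2.0), (10.0, 10.0), (10.0, 14.0)]
labels = [0, 0, 1, 1]
centroids = [(2.0, 2.0), (10.0, 12.0)]

# Cluster 0 contributes 1 + 1, cluster 1 contributes 4 + 4.
wcss = inertia(points, labels, centroids)
print(wcss)  # 10.0
```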

---

⚠️ Note: Inertia **always decreases** when you increase the number of clusters (K), so it shouldn’t be used alone to determine the best K.

##### Elbow Method
The **Elbow Method** is a popular technique used in conjunction with WCSS to determine the optimal number of clusters $ k $ for K-Means clustering.

- As the number of clusters $ k $ increases, the **Within-Cluster Sum of Squares (WCSS)** decreases.
- However, after a certain point, the improvement becomes marginal.
- The **"elbow point"** on the plot (k vs. WCSS) indicates the optimal number of clusters — where the rate of decrease sharply changes.
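A small sketch of the idea: run K-means for several values of k and look at how WCSS falls. To keep the example short and reproducible it uses one-dimensional data and a deterministic initialization (centroids spread evenly over the sorted values); both are assumptions of this sketch, not part of the method itself.

```python
def wcss_for_k(values, k, max_iters=100):
    """Run a small 1-D K-means and return the resulting WCSS (inertia)."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic initialization (assumption, chosen for reproducibility):
    # spread the K starting centroids evenly across the sorted values.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda c: (v - centroids[c]) ** 2)
            clusters[nearest].append(v)
        new_centroids = [
            sum(members) / len(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # WCSS: squared distance from each value to its nearest centroid.
    return sum(min((v - c) ** 2 for c in centroids) for v in values)


# Toy data with three visible groups around 1, 5 and 9.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
wcss_by_k = {k: wcss_for_k(values, k) for k in range(1, 6)}
for k, wcss in wcss_by_k.items():
    print(f"k={k}: WCSS={wcss:.2f}")
```

On this toy data WCSS drops steeply up to k = 3 and barely changes afterwards, which is the "elbow" one would read off the plot.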

<img src="https://www.ibm.com/content/dam/connectedassets-adobe-cms/worldwide-content/creative-assets/s-migr/ul/g/88/64/k-means-clustering-graph.png" alt="Elbow Method" width="600" height="300">


---

#### 1.1.2 Silhouette Score
The **Silhouette Score** is a metric used to evaluate how well the results of a clustering algorithm (like **K-Means**) are formed. It measures how similar each point is to its **own cluster** (cohesion) compared to other clusters (separation).

---

##### 🧠 Intuition

- A **high Silhouette Score** means the point is **well matched to its own cluster** and **far from other clusters**.
- A **low or negative score** indicates the point may be **misclassified** or lies between clusters.

---

##### 📐 Silhouette Score Formula

For a single point **i**:

- `a(i)`: average distance from **i** to all other points in the **same cluster**.
- `b(i)`: average distance from **i** to all points in the **nearest different cluster**.

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}
$$

- `s(i)` ranges between **-1 and +1**:
  - **+1**: ideal clustering
  - **0**: point is on the boundary
  - **-1**: likely misclassified
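The definitions of `a(i)`, `b(i)` and `s(i)` above translate fairly directly into code. The sketch below is illustrative only: the function name and toy data are assumptions, and it scores a point that is alone in its cluster as 0, which is a common convention.

```python
from math import dist


def silhouette_scores(points, labels):
    """Return s(i) for every point, following the a(i)/b(i) definitions."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # a(i): mean distance to the other members of the point's own cluster.
        same = [
            dist(p, q)
            for j, (q, label) in enumerate(zip(points, labels))
            if label == own and j != i
        ]
        if not same:
            scores.append(0.0)  # convention for single-point clusters
            continue
        a = sum(same) / len(same)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(p, q) for q, label in zip(points, labels) if label == other)
            / labels.count(other)
            for other in set(labels)
            if other != own
        )
        scores.append((b - a) / max(a, b))
    return scores


# Hypothetical toy data: two tight, well-separated groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_scores(points, [0, 0, 1, 1])
bad = silhouette_scores(points, [0, 1, 0, 1])
print(sum(good) / len(good), sum(bad) / len(bad))
```

The sensible labelling gives scores close to +1, while the deliberately mixed labelling gives negative scores.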

<img src="https://miro.medium.com/v2/resize:fit:1400/1*Kq52h5BzHqPHQSwOAsLdfw.jpeg" width="800" height="500">

## 1.2. Hierarchical Clustering
Hierarchical clustering is an unsupervised learning algorithm used to group similar data points into nested clusters based on a hierarchy. Unlike K-means, it does not require specifying the number of clusters (K) in advance.

There are two types of hierarchical clustering:
- Agglomerative hierarchical clustering (bottom-up)
- Divisive hierarchical clustering (top-down; **less commonly used in practice**)

### Agglomerative Hierarchical Clustering
- Start with each data point as its own cluster.
- Compute the distance between every pair of clusters.
- Find the two closest clusters and merge them into one.
- Repeat until only a **single cluster is left**, at which point the algorithm terminates.
- The results of hierarchical clustering can be shown using a **dendrogram**.
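The merge loop described above can be sketched as follows. The text does not say how the distance between two multi-point clusters is measured, so this example assumes single linkage (distance between the closest pair of members); the function name and toy data are also illustrative.

```python
from math import dist


def agglomerative(points):
    """Single-linkage agglomerative clustering; returns the merge history."""
    # Start with every point (by index) in its own cluster.
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest single-linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges


# Hypothetical 1-D data: two close pairs and one far-away point.
points = [(0.0,), (1.0,), (5.0,), (5.5,), (20.0,)]
merges = agglomerative(points)
for left, right, height in merges:
    print(f"merge {left} + {right} at distance {height}")
```

The recorded merge distances are exactly the heights at which the branches would join in a dendrogram.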

<img src="https://www.researchgate.net/publication/378433323/figure/fig1/AS:11431281225406409@1708707906747/Hierarchical-clustering-dendrogram-The-figure-displays-the-hierarchical-clustering.png">

### Reading a Dendrogram
- The height of each vertical line represents the **distance between the clusters** being merged: the taller the line, the more **dissimilar** the clusters.

- Each horizontal line joins two clusters at the height where they are merged. Observations are allocated to clusters by drawing a horizontal cut line across the dendrogram.

- Draw the cut so that it crosses the tallest vertical line. The clusters under the cut are your starting cluster count. In the picture above, there are 4 clusters to start with (orange, green, red, and purple).

**Note**: Cluster count = number of vertical lines the horizontal cut crosses = 4








# 2. Association Rule Learning
Association Rule Learning is an unsupervised machine learning technique used to discover interesting relationships, patterns, or associations among variables in large datasets.

It’s most commonly used in market basket analysis (e.g., "If a customer buys bread & beer, they are likely to buy milk"), marketing strategy and stock planning.

Let’s say we have 5 transactions from a small store:

| Transaction ID | Items Bought                    |
|----------------|----------------------------------|
| T1             | Milk, Bread                     |
| T2             | Milk, Diaper, Beer, Eggs        |
| T3             | Milk, Diaper, Beer, Bread       |
| T4             | Diaper, Beer                    |
| T5             | Milk, Bread, Diaper, Beer       |

#### 🔢 Key Metrics
Example: **If a customer buys Diaper and Beer, they are likely to buy Milk**  
Rule: `A{Diaper, Beer} ⇒ B{Milk}`

##### 1. **Support**
Indicates how frequently an item appears in the dataset.

$$
\text{Support}(B) = \frac{\text{Transactions containing } B}{\text{Total transactions}}
$$

- (B) = Milk appears in **T1, T2, T3, and T5** → 4 transactions  
- Total transactions = 5
- Support = 4 / 5 = 0.8 (80%)

**Support for B appearing in the basket is 0.8**

##### 2. **Confidence**
Confidence measures how often the consequent (then-part) occurs when the antecedent (if-part) occurs. **In other words: of the baskets that contain A, what fraction also contain B?**

$$
\text{Confidence}(A \Rightarrow B) = \frac{\text{Support}(A, B)}{\text{Support}(A)}
$$

- (A, B appearing together) = Diaper + Beer + Milk  appears in **T2, T3, T5** → 3 transactions
- A = Diaper + Beer appears in **T2, T3, T4, T5** → 4 transactions
- Total transactions = 5
- Support(A,B) = 3/5
- Support(A) = 4/5
- Confidence = Support(A,B) / Support(A)  = 3 / 4 = 0.75 (75%)

##### 3. **Lift**
Lift measures how much more likely the consequent (then-part) is to occur with the antecedent (if-part) than by random chance.

$$
\text{Lift}(A \Rightarrow B) = \frac{\text{Confidence}(A \Rightarrow B)}{\text{Support}(B)}
$$

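- Lift = 0.75 / 0.8 = 0.9375. Lift is a ratio, not a percentage; a value slightly below 1 means that buying Diaper and Beer makes Milk marginally less likely than its baseline rate.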

> - **Lift > 1**: Positive association  
> - **Lift = 1**: No association  
> - **Lift < 1**: Negative association
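The three metrics can be checked against the five-transaction table with a few lines of Python. This is only a sketch; the helper name `support` and the variable names are chosen for the example.

```python
# The five transactions from the table above, as sets of items.
transactions = [
    {"Milk", "Bread"},
    {"Milk", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Bread"},
    {"Diaper", "Beer"},
    {"Milk", "Bread", "Diaper", "Beer"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


A = {"Diaper", "Beer"}  # antecedent (if-part)
B = {"Milk"}            # consequent (then-part)

support_b = support(B)
confidence = support(A | B) / support(A)
lift = confidence / support_b

print(f"Support(B)    = {support_b:.4f}")   # 0.8000
print(f"Confidence    = {confidence:.4f}")  # 0.7500
print(f"Lift          = {lift:.4f}")        # 0.9375
```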


---

#### ⚙️ Common Algorithms

| Algorithm     | Description                                   |
|---------------|-----------------------------------------------|
| **Apriori**   | Iteratively finds frequent itemsets and prunes unlikely ones |
| **Eclat**     | Uses set intersections for fast frequency counting |
| **FP-Growth** | Uses a compact tree structure (FP-tree) to find frequent patterns without candidate generation |

---

#### 📦 Use Cases

| Domain          | Use Case                                        |
|-----------------|-------------------------------------------------|
| Retail          | Market basket analysis                          |
| E-commerce      | Product recommendation                          |
| Healthcare      | Detect co-occurring symptoms or conditions      |
| Web Usage       | Analyze browsing behavior or click patterns     |

## 2.1 🔍 Apriori Algorithm Explained

The **Apriori algorithm** is a popular algorithm for **association rule mining** — it finds patterns like:

> "If a customer buys **bread** and **butter**, they are also likely to buy **jam**."

It’s commonly used in **market basket analysis**.

---

### 🧠 What is Apriori?

Apriori identifies **frequent itemsets** (items often purchased together) and then generates **association rules** using:

- **Support**: Frequency of occurrence in the dataset.
- **Confidence**: Strength of implication in a rule.
- **Lift**: How much more often items occur together than expected by chance.

---

### ⚙️ How Apriori Works

#### 1. **Set thresholds**
- **Support**: Minimum proportion of transactions containing an itemset.
- **Confidence**: Minimum reliability of a rule.
- **Lift** (optional): Measures the strength of a rule over random chance.

#### 2. **Generate frequent itemsets**
- Count item frequencies.
- Eliminate infrequent items based on the support threshold.
- Use the **Apriori principle**: if a set is infrequent, all supersets are also infrequent.

#### 3. **Generate association rules**
- For each frequent itemset, generate all valid rules and evaluate them using confidence and lift.

---

### 🧾 Example Dataset

| Transaction ID | Items                  |
|----------------|------------------------|
| 1              | Milk, Bread, Butter    |
| 2              | Bread, Butter          |
| 3              | Milk, Bread            |
| 4              | Milk, Butter           |
| 5              | Bread, Butter          |

---

### ✅ Step-by-Step Example

#### Step 1: Set thresholds

- `min_support` = 0.6
- `min_confidence` = 0.7

---

#### Step 2: Frequent 1-itemsets
Eliminate the items that do not appear frequently in the baskets (i.e., whose support is less than `min_support`).

| Item    | Support Count | Support |
|---------|----------------|---------|
| Milk    | 3              | 3/5 = 0.6 ✅ |
| Bread   | 4              | 4/5 = 0.8 ✅ |
| Butter  | 4              | 4/5 = 0.8 ✅ |

---

#### Step 3: Frequent 2-itemsets
- Use the frequent 1-itemsets from Step 2 to build item pairs, and calculate each pair's support.

- Eliminate the item-pairs that have support level less than the `min_support`

| Itemset         | Support Count | Support |
|------------------|----------------|---------|
| Milk, Bread      | 2              | 2/5 = 0.4 ❌ |
| Milk, Butter     | 2              | 2/5 = 0.4 ❌ |
| Bread, Butter    | 3              | 3/5 = 0.6 ✅ |

Only `Bread, Butter` is frequent.

In general, you repeat the process to build 3-itemsets, calculate their support, and eliminate those below `min_support`. Here only one 2-itemset is frequent, so no 3-itemset candidate can be formed and the search stops.

---

#### Step 4: Generate Association Rules

Repeating the steps above yields the frequent itemsets and their support. For each frequent itemset, form candidate rules and compute their confidence; the rules that meet the thresholds set in Step 1 are your association rules.

From `Bread, Butter`, we get:

| Rule             | Support | Confidence |
|------------------|---------|------------|
| Bread → Butter   | 0.6     | 3/4 = 0.75 ✅ |
| Butter → Bread   | 0.6     | 3/4 = 0.75 ✅ |
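The walkthrough above can be reproduced without any library. The following is a simplified, from-scratch sketch of the level-wise search and rule generation on the same five transactions; the variable names and the candidate-generation shortcut (pairwise unions followed by subset pruning) are choices made for this illustration.

```python
from itertools import combinations

transactions = [
    {"Milk", "Bread", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Bread"},
    {"Milk", "Butter"},
    {"Bread", "Butter"},
]
MIN_SUPPORT = 0.6
MIN_CONFIDENCE = 0.7


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


# Step 2: frequent 1-itemsets.
items = sorted(set().union(*transactions))
current = [frozenset([i]) for i in items if support(frozenset([i])) >= MIN_SUPPORT]

frequent = {}
size = 1
while current:
    for itemset in current:
        frequent[itemset] = support(itemset)
    size += 1
    # Step 3: build larger candidates from the frequent sets of the last level.
    candidates = {a | b for a in current for b in current if len(a | b) == size}
    # Apriori principle: drop candidates that have an infrequent subset.
    candidates = {
        c for c in candidates
        if all(frozenset(sub) in frequent for sub in combinations(c, size - 1))
    }
    current = [c for c in candidates if support(c) >= MIN_SUPPORT]

# Step 4: generate rules from every frequent itemset with at least two items.
rules = []
for itemset, sup in frequent.items():
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            antecedent = frozenset(antecedent)
            confidence = sup / frequent[antecedent]
            if confidence >= MIN_CONFIDENCE:
                rules.append((set(antecedent), set(itemset - antecedent), sup, confidence))

for antecedent, consequent, sup, confidence in rules:
    print(f"{antecedent} -> {consequent}: support={sup:.2f}, confidence={confidence:.2f}")
```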

---

### 📈 Final Rules
- Rule: Bread → Butter (Support = 0.6, Confidence = 0.75)
- Rule: Butter → Bread (Support = 0.6, Confidence = 0.75)

### 🧪 Python Code Example

```python
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd

# Sample transactions
transactions = [
    ['Milk', 'Bread', 'Butter'],
    ['Bread', 'Butter'],
    ['Milk', 'Bread'],
    ['Milk', 'Butter'],
    ['Bread', 'Butter']
]

# Encode transactions
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)
df = pd.DataFrame(te_array, columns=te.columns_)

# Apply Apriori
frequent_itemsets = apriori(df, min_support=0.6, use_colnames=True)

# Generate rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)

# Display results
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])
```

# 3. 🌐 Dimensionality Reduction

**Dimensionality Reduction** is a technique used in data science and machine learning to reduce the number of input variables (or features) in a dataset while retaining as much relevant information as possible.

**Dimensionality Reduction** is often used as a pre-processing step.

---

## 🧠 Why it’s useful

- **Removes noise**: Eliminates irrelevant or redundant features.
- **Speeds up training**: Fewer features = faster computations.
- **Improves performance**: Helps reduce overfitting and enhances model generalization.
- **Simplifies visualization**: Makes high-dimensional data viewable in 2D or 3D.

---

## 🧰 Common Techniques

### 🔍 Feature Selection / Elimination Techniques

| Technique                              | Category        | Description |
|---------------------------------------|-----------------|-------------|
| **Missing Value Ratio**               | Filter Method   | Removes features with a high percentage of missing values (e.g., >60%). These features may not contribute meaningfully to the model. |
| **Low Variance Filter**               | Filter Method   | Eliminates features with little to no variance across samples. Such features provide minimal or no predictive value. |
| **High Correlation Filter**           | Filter Method   | Removes one of two features that are highly correlated (e.g., correlation > 0.85), to avoid redundancy and multicollinearity. |
| **Random Forest**           | Embedded Method | Random forest feature importance scores can be used to rank features and keep only the most important ones. |
| **Forward Feature Selection**              | Wrapper Method   | Starts with no features and adds them one by one, selecting the feature that improves model performance the most at each step. |
| **Backward Feature Elimination**                   | Wrapper Method  | Starts with all features and removes the least useful one at each step based on model performance. |
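As a rough sketch of two of the filter methods in the table, the snippet below applies a low variance filter followed by a high correlation filter to a tiny made-up dataset. The column names, the data, and the variance threshold are hypothetical; the 0.85 correlation cut-off is the example value from the table.

```python
from statistics import mean, pvariance

# Hypothetical dataset: column name -> values.
data = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # nearly a copy of height_cm
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],        # carries no information
    "age": [42.0, 23.0, 35.0, 51.0, 29.0],
}
VAR_THRESHOLD = 0.01   # assumed cut-off for "little to no variance"
CORR_THRESHOLD = 0.85  # example cut-off from the table


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


# Low variance filter: drop columns whose variance is at or below the threshold.
kept = [name for name, col in data.items() if pvariance(col) > VAR_THRESHOLD]

# High correlation filter: keep a column only if it is not highly correlated
# with a column that has already been selected.
selected = []
for name in kept:
    if all(abs(pearson(data[name], data[other])) <= CORR_THRESHOLD for other in selected):
        selected.append(name)

print(selected)  # ['height_cm', 'age']
```

Note that variance depends on the scale of each feature, so in practice features are usually put on a comparable scale before a variance threshold is applied.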


---

### 🧪 Feature Engineering Techniques (Also called Extraction Techniques)

| Technique                             | Description |
|--------------------------------------|-------------|
| **PCA (Principal Component Analysis)** | Projects data onto new axes (principal components) that capture the most variance. Reduces dimensionality while preserving as much information as possible. |
| **LDA (Linear Discriminant Analysis)** | Similar to PCA but supervised—it tries to maximize the separation between classes. Useful in classification tasks. |
| **Factor Analysis (FA)**             | A statistical method that models observed variables as combinations of a few underlying latent factors. Assumes these latent factors cause correlations between variables. |
| **Independent Component Analysis (ICA)** | Decomposes data into statistically independent components. Useful for signal separation and uncovering hidden factors. |



---

## 📉 Example

Imagine you have a dataset with **1000 features** (columns) for each product, but only **10** of them are truly useful for predicting customer purchases.  
Dimensionality reduction helps you shrink that down, say to **10 or 20** key features, making the model simpler and often more accurate.

---




## 3.1 Feature Engineering Techniques

### 3.1.1 PCA (Principal Component Analysis)
**Principal Component Analysis (PCA)** is a widely used dimensionality reduction technique in machine learning and data analysis. It helps ***simplify datasets with many features by reducing the number of variables*** while preserving as much of the original variance (information) as possible.

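E.g., using BMI in place of a person's weight and height as a single feature. (This is only an analogy: BMI is not a linear combination of the two, but it shows how several features can be summarized by one.)

PCA is a method to find the ***linear combinations of features that account for as much variability as possible.***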
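To make "the linear combination that accounts for the most variability" concrete, here is a small dependency-free sketch that finds the first principal component. It centers the data, builds the covariance matrix, and uses power iteration as a simple stand-in for a full eigen-decomposition; the function name, toy data, and iteration count are assumptions of this example, and in practice a library routine would normally be used.

```python
def first_principal_component(rows, iters=200):
    """Return (direction, explained variance ratio, projected scores)."""
    n, d = len(rows), len(rows[0])
    # Center each feature on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered features.
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # Power iteration: repeatedly apply the covariance matrix and renormalize,
    # which converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Variance captured along v, relative to the total variance.
    eigenvalue = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)
    )
    total_variance = sum(cov[a][a] for a in range(d))
    # Project every centered row onto the component (the new single feature).
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, eigenvalue / total_variance, scores


# Hypothetical data: two strongly correlated features.
rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
component, ratio, scores = first_principal_component(rows)
print(component, ratio)
```

Because the two features move together, the component weights both about equally and a single projected feature retains nearly all of the variance.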

*   https://www.youtube.com/watch?v=dz8imS1vwIM
*   https://www.youtube.com/watch?v=S51bTyIwxFs


**Good visual explanations of PCA using eigenvalues:**
*   https://www.youtube.com/watch?v=FgakZw6K1QQ
*   https://www.youtube.com/watch?v=FD4DeN81ODY



