# 🧠 Session 5: Introduction to Unsupervised Learning (with Bonus Exercises)

## 🕒 00:00–00:15 — What is Unsupervised Learning?
**Unsupervised Learning** algorithms learn from data **without labels**.

**Key Goals:**
- Discover patterns or structure in data
- Group similar items together
- Reduce the dimensionality of complex datasets

**Use Cases:**
- Customer segmentation
- Anomaly detection
- Topic modeling in documents

**Comparison Table:**

| Task                     | Supervised | Unsupervised        |
|--------------------------|------------|----------------------|
| Email Spam Detection     | ✅         | ❌                   |
| Customer Grouping        | ❌         | ✅                   |
| Image Labeling           | ✅         | ❌                   |
| Topic Modeling in Text   | ❌         | ✅                   |
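
The difference in the table shows up directly in code: a supervised estimator is fitted on features and labels, while an unsupervised one only receives the features. The sketch below is an illustration using scikit-learn's Iris data; the choice of `LogisticRegression` as the supervised model and the variable names are assumptions made for this example, not part of the session material.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: the estimator is fitted on features AND known labels.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# Unsupervised: the estimator only ever sees the features.
km = KMeans(n_clusters=3, n_init=10, random_state=42)
km.fit(X)

print(clf.predict(X[:5]))  # predicted species codes, learned from y
print(km.labels_[:5])      # arbitrary cluster ids, not species codes
```

Note that the cluster ids carry no meaning of their own; which group gets called 0, 1, or 2 depends on the initialization.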

_Prompt: Can you think of data you have access to that has no labels?_

## 🕒 00:15–00:30 — K-Means and Hierarchical Clustering
**K-Means Clustering**
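- Partitions the data into k groups by assigning each point to its nearest centroid
- Fast and scalable
- Sensitive to the initial centroid positions, so different starts can give different clusters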
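
To make the K-Means bullets concrete, here is a minimal NumPy sketch of the classic assign-then-update loop (Lloyd's algorithm). It is an illustration only and not how scikit-learn implements `KMeans`; the function name `kmeans_sketch`, its parameters, and the synthetic data are all made up for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Illustrative Lloyd's algorithm; not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs (made-up data for illustration).
rng = np.random.default_rng(42)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])
labels, centroids = kmeans_sketch(X_demo, k=2)
print(centroids)
```

Because the starting centroids are random, changing `seed` can change the result on harder datasets, which is why library implementations typically run several initializations and keep the best one.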

**Hierarchical Clustering**
- Builds a tree-like structure (dendrogram)
- Good for understanding nested structure
- Computationally expensive for large datasets
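
The nested structure can be seen on a tiny example. The sketch below uses SciPy's `linkage` and `fcluster` on six made-up one-dimensional points (the data and variable names are assumptions for illustration) and cuts the same tree at two different levels.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Six made-up 1-D points forming two obvious groups.
points = np.array([[1.0], [1.1], [1.4], [8.0], [8.2], [8.7]])

# Each row of the linkage matrix records one merge:
# [cluster_a, cluster_b, merge_distance, size_of_new_cluster]
Z = linkage(points, method='ward')
print(Z.shape)

# Cutting the same tree at different levels gives nested groupings.
two_groups = fcluster(Z, t=2, criterion='maxclust')
three_groups = fcluster(Z, t=3, criterion='maxclust')
print(two_groups)
print(three_groups)
```

Unlike K-Means, the tree is built once and the number of clusters is chosen afterwards, at the cost of computing and storing many pairwise merges.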

**Visuals:**
- Clustered scatterplots
- Dendrogram diagrams

_Prompt: How do these two methods differ in flexibility and interpretation?_

## 🕒 00:30–00:45 — Live Coding: Clustering Iris Dataset (No Labels)

In [None]:
# Load the Iris dataset and drop labels
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris(as_frame=True)
df = iris.frame.drop(columns='target')
df.head()

In [None]:
# Apply K-Means clustering
from sklearn.cluster import KMeans

# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(df)

df['cluster'] = clusters
df.head()

In [None]:
# Visualize clustering results using seaborn pairplot
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df, hue='cluster', palette='Set1')
plt.suptitle('K-Means Clustering (Iris without labels)', y=1.02)
plt.show()

In [None]:
# Optional: Apply Hierarchical Clustering and plot dendrogram
from scipy.cluster.hierarchy import dendrogram, linkage

linked = linkage(df.drop(columns='cluster'), method='ward')

plt.figure(figsize=(10, 5))
dendrogram(linked, truncate_mode='lastp', p=20, leaf_rotation=45., leaf_font_size=12.)
plt.title('Hierarchical Clustering Dendrogram')
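# With truncate_mode='lastp', leaves show cluster sizes in parentheses
plt.xlabel('Cluster size (in parentheses) or sample index')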
plt.ylabel('Distance')
plt.show()

## ✅ Session Summary
- Unsupervised learning finds patterns without labeled data
- K-Means and hierarchical clustering are two common clustering techniques
- We used the Iris dataset without labels to group flowers

_Reflection: Did the clusters resemble the species you know?_
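
One way to check the reflection question is to bring the held-back species labels back in after clustering and compare. The sketch below is a self-contained illustration; using `pd.crosstab` and the adjusted Rand index for the comparison is a choice made for this example, and the variable names are assumptions.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris(as_frame=True)
features = iris.frame.drop(columns='target')
species = iris.target_names[iris.target.to_numpy()]

cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(features)

# Rows are true species, columns are cluster ids (which are arbitrary).
table = pd.crosstab(pd.Series(species, name='species'),
                    pd.Series(cluster_ids, name='cluster'))
print(table)

# Adjusted Rand index: 1.0 means identical groupings, about 0 means chance level.
ari = adjusted_rand_score(iris.target, cluster_ids)
print(f'Adjusted Rand index: {ari:.3f}')
```

The labels are used here only for evaluation; the clustering itself never sees them.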

## ⭐ Bonus Exercises

### 🧪 Bonus 1: Elbow Method for Choosing K
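Use the Elbow Method to choose k: plot inertia (the within-cluster sum of squared distances) against k and look for the point where the curve stops dropping sharply.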

In [None]:
# Elbow method to determine best value of k
inertias = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(df.drop(columns='cluster'))
    inertias.append(model.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()
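
To see what the inertia on the y-axis measures, it can be recomputed by hand as the sum of squared distances from each sample to the centroid of its assigned cluster. This is a minimal sketch with illustrative variable names, assuming the same Iris features as above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X_arr = load_iris().data
model_3 = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_arr)

# Look up, for every sample, the centroid of the cluster it was assigned to.
assigned = model_3.cluster_centers_[model_3.labels_]

# Inertia is the total squared distance between samples and those centroids.
manual_inertia = np.sum((X_arr - assigned) ** 2)
print(manual_inertia, model_3.inertia_)
```

Inertia always decreases as k grows, which is why the elbow, rather than the minimum, is the quantity of interest.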

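### 🧪 Bonus 2: Silhouette Score for Cluster Quality
Compute the silhouette score, which ranges from -1 to 1; higher values mean points sit closer to their own cluster than to neighboring clusters.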

In [None]:
from sklearn.metrics import silhouette_score

X = df.drop(columns='cluster')
score = silhouette_score(X, df['cluster'])
print(f'Silhouette Score for k=3: {score:.3f}')
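
A single score is easier to interpret next to the scores for other values of k. The sketch below repeats the calculation for k from 2 to 6; the range and the variable names are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X_iris = load_iris(as_frame=True).frame.drop(columns='target')

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_iris)
    scores[k] = silhouette_score(X_iris, labels_k)

for k, s in scores.items():
    print(f'k={k}: silhouette={s:.3f}')

best_k = max(scores, key=scores.get)
print(f'Highest silhouette at k={best_k}')
```

The k with the highest silhouette need not match the number of known species, so it is worth reading this alongside the elbow plot rather than on its own.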

### 🧪 Bonus 3: PCA Visualization of Clusters
Reduce the dataset to 2D using PCA and color by cluster.

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['cluster'], cmap='Set1', s=50)
plt.title('PCA Projection of Clusters')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid(True)
plt.show()
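
How much to trust a 2D projection depends on how much of the data's variance the two components retain. A minimal sketch, with illustrative variable names, that reads this from `explained_variance_ratio_`:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_feat = load_iris(as_frame=True).frame.drop(columns='target')
pca_2d = PCA(n_components=2).fit(X_feat)

# Fraction of total variance captured by each principal component.
ratios = pca_2d.explained_variance_ratio_
print(ratios)
print(f'Variance retained in 2D: {ratios.sum():.1%}')
```

If the retained fraction is low, clusters that look overlapping in the scatter plot may still be well separated in the original feature space.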