# Session 3 – Unsupervised Learning: Dimensionality Reduction & Clustering

By the end of this notebook, we will:
- Understand unsupervised learning and how it differs from supervised tasks
- Apply Principal Component Analysis (PCA) for dimensionality reduction and visualization
- Try out other methods such as t-SNE and UMAP
- Perform K-Means clustering to find patterns in unlabeled data
- Interpret cluster assignments and compare them with true labels (if available)
- Explore hierarchical clustering

We’ll reuse the **Breast Cancer dataset**.

### Load the Breast Cancer dataset

We’ll ignore labels during clustering — but later compare clusters to the true diagnosis.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
import pandas as pd
import numpy as np

cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = cancer.target  # ground truth, used only for evaluation

X.head()

## Part 1: Dimensionality Reduction

### Principal Component Analysis (PCA)

PCA transforms correlated features into a smaller set of uncorrelated components capturing the most variance.

PCA is sensitive to the scale of each feature, so we must standardize the data before fitting it.
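A minimal sketch of the scaling and PCA step, assuming `StandardScaler` is the intended scaler (the notebook does not name one). The names `X_scaled` and `X_pca` match the ones used later in this notebook; the block reloads the data so it runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data  # same feature matrix as loaded above

# Standardize each feature to zero mean and unit variance
# (assumption: StandardScaler is the scaler we want here)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA keeping all components, then project the samples onto them
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

print(X_scaled.shape, X_pca.shape)
```

Keeping all components here makes it possible to inspect the full variance profile before deciding how many to keep.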

### Variance explained for each of the principal components
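One possible way to inspect the variance explained by each component is the `explained_variance_ratio_` attribute of a fitted `PCA` object. The sketch below refits PCA on the standardized data so that it is self-contained; printing only the first five components is an arbitrary choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)
pca = PCA().fit(X_scaled)

# Fraction of the total variance captured by each component, largest first
explained = pca.explained_variance_ratio_
for i, ratio in enumerate(explained[:5], start=1):
    print(f"PC{i}: {ratio:.3f}")
```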

### Visualize the first two components

### Exercise 1: Visualize the explained variance as a cumulative function

1. Plot the cumulative explained variance against the number of PCA components as a line plot.
2. How many components explain 90% of the variance?
3. How many components would you select?

*Hint: use `np.cumsum()`.*
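To illustrate the hint without giving away the exercise, here is a small sketch of `np.cumsum()` on made-up variance ratios. The numbers in `ratios` are invented for illustration and are not the values for this dataset.

```python
import numpy as np

# Made-up explained variance ratios (illustrative only, they sum to 1)
ratios = np.array([0.5, 0.3, 0.15, 0.05])

# Running total: entry i is the variance explained by the first i+1 components
cumulative = np.cumsum(ratios)

# Index of the first entry reaching the threshold, converted to a count
threshold = 0.9
n_needed = int(np.argmax(cumulative >= threshold) + 1)

print(cumulative)
print(n_needed)
```

The same pattern should apply to `pca.explained_variance_ratio_` once PCA has been fitted.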

### Exercise 2: Visualize the first two components but colored by the target value.

- Compare the range of values of the x-axis and the y-axis. Does it make sense?
- Is PCA able to separate the two classes (malignant and benign)?
- Compute the PCA again but setting the parameter `n_components=2`. Do you see any difference?

### Exercise 3: Plot the PCA component loadings.

Plotting the loadings helps us see which original features contribute the most. 

*Hint:* Use `pca.components_` to get the weights of each feature in each component.
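As a starting point, the loadings can be collected into a labeled table before plotting. This is a sketch under the assumption that two components are enough for the plot; the names `loadings` and `top_pc1` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)
pca = PCA(n_components=2).fit(X_scaled)

# components_ has shape (n_components, n_features); transpose so that
# each row is an original feature and each column a principal component
loadings = pd.DataFrame(
    pca.components_.T,
    index=cancer.feature_names,
    columns=["PC1", "PC2"],
)

# Features with the largest absolute weight on the first component
top_pc1 = loadings["PC1"].abs().sort_values(ascending=False).head(5)
print(top_pc1)
```

From here, something like `loadings.plot.bar()` or a seaborn heatmap could be used for the actual figure.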

### Exercise 4: Compare PCA with t-SNE 

In this exercise we want to compare:
- **PCA** (linear)
- **t-SNE** (non-linear, good for local structure)

Tasks:
1. Compute 2D embeddings for each method using the standardized feature matrix (`X_scaled`).
2. Plot the 2D scatter for each method side-by-side, coloring by the true labels (`y`) to visually compare separation.
3. Discuss differences: Which method gives the most visually separated classes? Which scores are higher? How do runtime and stability compare?
4. (Optional) Try different parameters for TSNE and compare.

*Hints:*
- Use `sklearn.manifold.TSNE` for t-SNE.
- t-SNE has parameters `perplexity` and `n_iter` (named `max_iter` from scikit-learn 1.5 onward). Try `perplexity=30`, `n_iter=1000`.
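A minimal sketch of computing both embeddings, leaving the plotting and discussion to the exercise. It leaves the iteration count at its default because the parameter name depends on the scikit-learn version; `X_pca2` and `X_tsne` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# Linear projection onto the two directions of largest variance
X_pca2 = PCA(n_components=2).fit_transform(X_scaled)

# Non-linear embedding; a fixed random_state makes the run repeatable
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

print(X_pca2.shape, X_tsne.shape)
```

Note that t-SNE has no `transform` method for new data, so `fit_transform` is the only way to obtain the embedding.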


**1. Compute 2D embeddings for each method using the standardized feature matrix (`X_scaled`).**

**2. Plot the 2D scatter for each method side-by-side, coloring by the true labels (`y`) to visually compare separation.**

**4. (Optional) Try different parameters for TSNE and compare.**

### (Optional) Exercise 5: Install and compare with UMAP

If you want, you can install umap and run the comparison with the other methods. To install it, try `pip install umap-learn`.

Then:

`import umap`

`reducer = umap.UMAP(n_components=2, random_state=42)`

`X_umap = reducer.fit_transform(X_scaled)`

## Part 2: Clustering

### K-Means Clustering

Now we’ll let an unsupervised algorithm find groups in the data projected by PCA.

In [None]:
from sklearn.cluster import KMeans

# X_pca needs to have been computed in the previous section first

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
y_kmeans = kmeans.fit_predict(X_pca)

plt.figure(figsize=(7,6))
sns.scatterplot(x=X_pca[:,0], y=X_pca[:,1], hue=y_kmeans, palette='viridis', alpha=0.7)
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], 
            s=200, c='red', marker='X', label='Centers')
plt.title("K-Means Clusters on PCA Projection")
plt.legend()
plt.show()


### Computing some clustering metrics

In [None]:
from sklearn.metrics import adjusted_rand_score, silhouette_score

ari = adjusted_rand_score(y, y_kmeans)
sil = silhouette_score(X_pca, y_kmeans)

print(f"Adjusted Rand Index: {ari:.3f}")
print(f"Silhouette Score: {sil:.3f}")

### Exercise 1: Try different values of k.

- Loop over k=2...10. Store inertia_ and plot.
- What is the appropriate number of clusters?
- Now try different seeds. How is the cluster stability?
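The loop over k could be sketched as follows, assuming we cluster the first two principal components as in the K-Means cell above. Plotting `inertias` against `ks` (the "elbow" plot) is left to the exercise.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

ks = range(2, 11)  # k = 2, ..., 10
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X_pca)
    # inertia_ is the sum of squared distances to the closest center
    inertias.append(km.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

For the stability question, one option is to repeat the fit with several `random_state` values and compare the resulting labelings with `adjusted_rand_score`.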

**Loop over k=2...10. Store inertia_ and plot.**

**Now try different seeds. How is the cluster stability?**

### Exercise 2: Try Hierarchical clustering instead.

- Use `AgglomerativeClustering` with 2 clusters and visualize the dendrogram.

*Hint:* `from sklearn.cluster import AgglomerativeClustering` and `from scipy.cluster.hierarchy import linkage, dendrogram`
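A sketch of how the two hinted tools fit together, assuming Ward linkage on the standardized features (the exercise does not prescribe a linkage). `AgglomerativeClustering` gives the flat cluster labels, while SciPy's `linkage` builds the merge tree that `dendrogram` draws.

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)

# Flat assignment into two clusters (Ward linkage is an assumption)
agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = agg.fit_predict(X_scaled)

# Full merge history: one row per merge, n_samples - 1 rows in total
Z = linkage(X_scaled, method="ward")

# no_plot=True only computes the layout; drop it to draw with matplotlib.
# truncate_mode="lastp" keeps the last p merged clusters for readability.
tree = dendrogram(Z, truncate_mode="lastp", p=12, no_plot=True)

print(labels.shape, Z.shape)
```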

### Exercise 3: Exploring Distance Metrics

Most clustering and dimensionality reduction algorithms rely on a *distance* or *similarity* measure to define how close two samples are.

In this exercise, we’ll explore three common metrics:

| Metric | Notes |
|----------|-------|
| **Euclidean** | Standard straight-line distance (the metric K-Means is built on) |
| **Manhattan** | Sum of absolute differences (“city block” distance) |
| **Cosine** | Based on the *angle* between vectors, so it ignores vector length; cosine distance = 1 − cosine similarity |

You will:
1. Compute pairwise distances between samples.  
2. Visualize how the metric affects clustering using K-Means.  
3. Use t-SNE with custom metrics to show how neighborhood relationships change when using these three different metrics.
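To make the difference between the three metrics concrete, here is a small sketch on three hand-picked 2D points (toy data, not the cancer dataset). With `metric="cosine"`, `pairwise_distances` returns the cosine distance, that is, one minus the cosine similarity.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Toy points: the third is the first one scaled by 2
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 0.0]])

# One (3 x 3) distance matrix per metric
D = {m: pairwise_distances(A, metric=m)
     for m in ("euclidean", "manhattan", "cosine")}

for name, matrix in D.items():
    print(name)
    print(np.round(matrix, 3))
```

Points 0 and 2 point in the same direction, so their cosine distance is zero even though their Euclidean and Manhattan distances are not.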

**1. Compute pairwise distances between samples.**

In [None]:
from sklearn.metrics import pairwise_distances

**2. Visualize how the metric affects clustering using K-Means.**

**3. Use t-SNE with custom metrics to show how neighborhood relationships change when using these three different metrics.**

### (Optional) Exercise 4: Explore density-based clustering and compare with the approaches we learned today 
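One possible starting point is DBSCAN, a common density-based method available in scikit-learn. The sketch assumes we cluster the 2D PCA projection; the values `eps=1.0` and `min_samples=5` are illustrative guesses rather than tuned settings, so expect to adjust them.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_breast_cancer().data)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

# eps and min_samples are illustrative values, not tuned for this data
db = DBSCAN(eps=1.0, min_samples=5)
labels = db.fit_predict(X_pca)

# DBSCAN marks points it considers noise with the label -1
n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```

Unlike K-Means, DBSCAN does not take the number of clusters as input, which is one of the points worth comparing.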