# 🧪 Lab 3: Unsupervised Learning - Mice Protein Expression


## Step 1: Load and Inspect the Dataset

We begin by loading the **Mice Protein Expression** dataset, which contains expression levels of 77 proteins measured in the cerebral cortex of mice.
The mice belong to different experimental groups, but we ignore those labels for this unsupervised task.


In [None]:
import pandas as pd
df = pd.read_csv("Data_Cortex_Nuclear.csv")
df.head()
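
Beyond `df.head()`, a quick summary of the shape and the missing values helps decide how to preprocess. The sketch below is only an illustration: it runs on a small made-up frame (`toy`, with hypothetical column names `protein_A` and `protein_B`) rather than the real CSV, and `summarize` is a helper introduced here, not part of the lab.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV: two protein columns plus metadata.
toy = pd.DataFrame({
    "MouseID": ["m1", "m2", "m3", "m4"],
    "protein_A": [0.50, 0.51, np.nan, 0.44],
    "protein_B": [0.75, np.nan, np.nan, 0.69],
    "class": ["g1", "g1", "g2", "g2"],
})

def summarize(frame):
    """Return the shape, the number of numeric columns, and missing counts."""
    numeric_cols = frame.select_dtypes(include="number").columns
    missing = frame.isna().sum()
    return {
        "shape": frame.shape,
        "n_numeric": len(numeric_cols),
        # Keep only columns that actually have gaps, worst first.
        "missing_per_column": missing[missing > 0].sort_values(ascending=False),
    }

summary = summarize(toy)
print(summary["shape"])
print(summary["missing_per_column"])
```

Calling `summarize(df)` on the loaded dataset would report the same quantities for the real data.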


## Step 2: Preprocessing - Drop Metadata and Handle Missing Values

Since the lab focuses on **unsupervised learning**, we discard all label-related or metadata columns:
- First column: `MouseID`
- Last 4 columns: `Genotype`, `Treatment`, `Behavior`, `class`

We also drop every row that contains a missing value, because the PCA and clustering estimators used below do not accept NaNs.


In [None]:
df_cleaned = df.drop(columns=["MouseID", "Genotype", "Treatment", "Behavior", "class"])
df_cleaned = df_cleaned.dropna()
df_cleaned.head()
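
Dropping rows can discard a large share of the samples, so it is worth checking how many are lost. The sketch below, on a made-up frame with hypothetical column names, counts the rows removed by `dropna()` and shows column-mean imputation as one possible alternative; the lab itself uses `dropna()`.

```python
import numpy as np
import pandas as pd

# Made-up data: two of the four rows contain a missing value.
toy = pd.DataFrame({
    "protein_A": [1.0, 2.0, np.nan, 4.0],
    "protein_B": [0.5, np.nan, 0.7, 0.9],
    "protein_C": [3.0, 3.0, 3.0, 3.0],
})

# Option used in the lab: discard incomplete rows.
dropped = toy.dropna()
rows_lost = len(toy) - len(dropped)

# Alternative (an assumption, not the lab's method): fill each gap
# with the mean of its column so that no rows are lost.
imputed = toy.fillna(toy.mean())

print(f"dropna removed {rows_lost} of {len(toy)} rows")
print(imputed)
```

Mean imputation keeps every sample but shrinks the variance of the imputed columns, which can affect distance-based clustering.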


## Step 3: Normalize the Data and Apply PCA

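Standardization is important before clustering so that proteins with larger numeric ranges do not dominate the distance computations and all features contribute equally.

We use **StandardScaler** to give each feature zero mean and unit variance, and **PCA** to project the data onto 2 dimensions for visualization only; the clustering in Step 4 runs on the full scaled data.
The 2D projection also helps in visual inspection of any natural groupings or separability in the data.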


In [None]:
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Standardize each protein to zero mean and unit variance.
scaler = StandardScaler()
data_scaled = scaler.fit_transform(df_cleaned)

# Project onto the first two principal components (used for plotting only).
pca = PCA(n_components=2)
data_pca = pca.fit_transform(data_scaled)

df_pca = pd.DataFrame(data_pca, columns=["PC1", "PC2"])
df_pca.head()
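
A 2D projection only shows part of the variance, so it is useful to know how much. The sketch below uses synthetic data (an assumption standing in for `data_scaled`) and reads `explained_variance_ratio_` from a full PCA to get the share captured by two components and the number of components needed for 90%.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 samples, 10 features driven by 2 latent factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 10))

X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components to inspect the full variance spectrum.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

two_component_share = cumulative[1]
# First component count whose cumulative share reaches 90%.
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)

print(f"variance kept by 2 components: {two_component_share:.2%}")
print(f"components needed for 90%: {n_for_90}")
```

If two components keep only a small share of the variance on the real data, clusters that overlap in the 2D plots may still be separated in the full feature space.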


## Step 4: Apply Clustering Algorithms

We apply several clustering algorithms to group the mice based solely on the 77 protein expression features:
- **K-Means**: partitions the data into a fixed number K of clusters
- **Gaussian Mixture Models (GMM)**: assigns each point a probability of belonging to each cluster
- **DBSCAN**: identifies clusters as dense regions and labels points in sparse regions as noise


In [None]:
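from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture

# n_init is set explicitly because its default differs across scikit-learn versions.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(data_scaled)

# fit_predict returns the most probable component for each sample.
gmm = GaussianMixture(n_components=8, random_state=42)
gmm_labels = gmm.fit_predict(data_scaled)

# DBSCAN chooses the number of clusters itself; noise points get the label -1.
dbscan = DBSCAN(eps=3, min_samples=5)
dbscan_labels = dbscan.fit_predict(data_scaled)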
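
DBSCAN results depend strongly on `eps`, and a fixed value such as 3 may or may not suit a 77-dimensional space. One common heuristic, sketched here on synthetic blobs (an assumption standing in for `data_scaled`), is to look at each point's distance to its `min_samples`-th nearest neighbor and pick `eps` from the upper end of that distribution. The 95th percentile used below is an arbitrary illustrative choice, not a recommendation from the lab.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled protein matrix.
X, _ = make_blobs(n_samples=300, centers=3, n_features=5,
                  cluster_std=0.6, random_state=0)
X = StandardScaler().fit_transform(X)

min_samples = 5

# Distance from each point to its min_samples-th nearest neighbor.
# The query point itself is counted, which matches DBSCAN's definition.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

# Illustrative choice: an eps large enough for about 95% of points to be core points.
eps_guess = float(np.quantile(k_dist, 0.95))

labels = DBSCAN(eps=eps_guess, min_samples=min_samples).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print(f"eps={eps_guess:.3f}, clusters={n_clusters}, noise points={n_noise}")
```

Plotting `k_dist` and looking for the "elbow" is the usual visual version of the same idea.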


## Step 5: Visualize Clustering Results

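Using the PCA-reduced 2D data, we visualize the clusters formed by each algorithm.
The plotting cell appears after the hierarchical clustering cell in Step 6, because it also draws the hierarchical labels.
These plots help in judging whether the algorithms find well-separated groups.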

**Do the results look similar?**  
- K-Means and GMM give fairly similar groupings.
- DBSCAN marks low-density points as noise (label `-1`) and may form fewer clusters, depending on `eps` and `min_samples`.
- There are visible differences, suggesting that cluster shape assumptions (e.g., spherical for K-Means) influence outcomes.
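
Visual similarity can be backed by a number. The sketch below, on synthetic blobs rather than the protein data, compares a K-Means labeling with a GMM labeling using the adjusted Rand index (ARI) from scikit-learn. ARI ignores how the clusters are numbered, scores 1.0 for identical partitions, and is close to 0 for unrelated ones.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the scaled protein matrix.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
gm = GaussianMixture(n_components=4, random_state=42).fit_predict(X)

# Agreement between the two partitions, independent of label numbering.
ari = adjusted_rand_score(km, gm)

# Renaming the clusters does not change the partition, so ARI stays at 1.
relabeled = (km + 1) % 4
ari_relabeled = adjusted_rand_score(km, relabeled)

print(f"ARI(K-Means, GMM) = {ari:.3f}")
print(f"ARI(K-Means, renamed K-Means) = {ari_relabeled:.3f}")
```

The same call works on `kmeans_labels`, `gmm_labels`, and `dbscan_labels`; for DBSCAN, note that the noise label `-1` is treated as one more cluster.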



## Step 6: Apply Hierarchical Clustering

Hierarchical clustering builds a dendrogram to represent nested groupings.
We use **Ward's linkage** and cut the dendrogram to form 8 clusters for comparison.

**Is this clustering similar to others?**  
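It shows some similarity to K-Means but draws slightly different boundaries. Both methods aim to minimize within-cluster variance, but Ward's linkage builds clusters through greedy bottom-up merges that are never revised, whereas K-Means iteratively reassigns points to centroids.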



In [None]:
from scipy.cluster.hierarchy import linkage, fcluster

linkage_matrix = linkage(data_scaled, method='ward')
hierarchical_labels = fcluster(linkage_matrix, t=8, criterion='maxclust')
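
Two details of the SciPy API are easy to miss: `fcluster` numbers clusters from 1 (scikit-learn starts at 0), and the third column of the linkage matrix holds the merge heights that a dendrogram would display. The sketch below shows both on synthetic blobs, which are an assumption standing in for `data_scaled`.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled protein matrix.
X, _ = make_blobs(n_samples=120, centers=3, cluster_std=0.5, random_state=1)

# Each row of Z records one merge: the two cluster ids, the merge height, the new size.
Z = linkage(X, method="ward")

# Cut the tree so that at most 3 flat clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")
ids, sizes = np.unique(labels, return_counts=True)

# Heights of the last few merges; a large jump suggests a natural cut point.
last_heights = Z[-5:, 2]

print("cluster ids:", ids)
print("cluster sizes:", sizes)
print("last merge heights:", np.round(last_heights, 2))
```

The label offset does not matter for coloring a scatter plot, but it does matter if the labels are used to index arrays.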

In [None]:
# One panel per algorithm, all drawn on the same 2D PCA projection.
results = [
    ("K-Means Clustering", kmeans_labels),
    ("Gaussian Mixture Model Clustering", gmm_labels),
    ("Hierarchical Clustering (Ward Linkage)", hierarchical_labels),
    ("DBSCAN Clustering", dbscan_labels),
]

fig, axs = plt.subplots(len(results), 1, figsize=(10, 20))

for ax, (title, labels) in zip(axs, results):
    ax.scatter(df_pca["PC1"], df_pca["PC2"], c=labels, cmap="tab10", s=30)
    ax.set_title(title)
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")

plt.tight_layout()
plt.show()
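
The scatter plots give a qualitative picture only. As a possible quantitative complement, the sketch below computes the silhouette score for K-Means over a range of cluster counts on synthetic blobs, which are an assumption and not the protein data. Higher scores indicate tighter, better separated clusters, which offers one way to check whether 8 clusters is a reasonable choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled protein matrix.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.5, random_state=0)

# Silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("highest silhouette at k =", best_k)
```

On the real data the score should be computed on `data_scaled`, not on the 2D projection, so that it reflects the space in which the clustering was done.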