
# 🧪 Lab 6: K-Means Clustering for Patient Segmentation

## 🎯 Objective
To apply **K-Means Clustering** on patient health data (e.g., Age, BMI, Blood Pressure, Cholesterol, Glucose)  
to segment patients into groups based on clinical similarity.

---

## 🧠 Concept Recap

### What is Clustering?
- **Unsupervised learning** technique — no labels, the model discovers natural groupings.
- Groups (clusters) are formed based on **similarity** between data points.

### K-Means Algorithm Steps
1. Choose number of clusters **k**.
2. Initialize **k centroids**, for example by picking k random data points (scikit-learn defaults to the `k-means++` initialization).
3. Assign each data point to the nearest centroid.
4. Recalculate centroids based on assigned points.
5. Repeat steps 3 and 4 until convergence (assignments stop changing, or centroids move less than a small tolerance).
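
The steps above can be sketched in plain NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, its parameters, and the toy data are assumptions made for this example, and initialization is plain random sampling rather than `k-means++`.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, tol=1e-6, seed=0):
    """Minimal K-Means (Lloyd's algorithm), for illustration only."""
    rng = np.random.default_rng(seed)
    # Step 2: use k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids barely move
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Hypothetical toy data: two well-separated 2-D blobs of 50 points each
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
toy_labels, toy_centroids = kmeans_sketch(X_toy, k=2)
print(toy_centroids.round(2))
```

On data this cleanly separated, each blob should end up in its own cluster. Real data usually needs several restarts, which is what `n_init` controls in scikit-learn.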

---

### Mathematical Representation
\[
J = \sum_{i=1}^{k} \sum_{x \in C_i} ||x - \mu_i||^2
\]

where  
- \( C_i \) = cluster i  
- \( \mu_i \) = centroid of cluster i  
- \( J \) = within-cluster sum of squared distances, the quantity K-Means minimizes (scikit-learn reports it as `inertia_`)
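
As a quick check of the formula, J can be computed by hand from a fitted model and compared with the `inertia_` attribute. The data here is a hypothetical two-blob toy set, not the patient data used later in this lab.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two 2-D blobs of 40 points each
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(6, 1, (40, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)

# J = sum over clusters i of the squared distances from each x in C_i to mu_i
J = sum(
    np.sum((X_demo[km.labels_ == i] - km.cluster_centers_[i]) ** 2)
    for i in range(km.n_clusters)
)

print(f"J computed by hand: {J:.4f}")
print(f"KMeans.inertia_:    {km.inertia_:.4f}")
```

The two numbers should agree up to floating-point error, which is why the elbow plot later in the lab can use `inertia_` directly as the objective value.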


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

sns.set(style="whitegrid")
np.random.seed(42)


In [None]:
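# Synthetic clinical dataset generator
def create_patient_data(n=300):
    """Return n synthetic patients drawn from three latent groups.

    Age, BMI and SystolicBP differ by group; Cholesterol and Glucose come
    from a single distribution each, so they carry no group signal.
    """
    np.random.seed(42)  # fixed seed so every call returns the same data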
    age1 = np.random.normal(30, 5, n//3)
    age2 = np.random.normal(45, 6, n//3)
    age3 = np.random.normal(60, 5, n - 2*(n//3))

    bmi1 = np.random.normal(22, 2, n//3)
    bmi2 = np.random.normal(27, 3, n//3)
    bmi3 = np.random.normal(31, 3, n - 2*(n//3))

    bp1 = np.random.normal(110, 8, n//3)
    bp2 = np.random.normal(130, 10, n//3)
    bp3 = np.random.normal(150, 12, n - 2*(n//3))

    chol = np.random.normal(200, 30, n)
    glucose = np.random.normal(100, 15, n)

    df = pd.DataFrame({
        'Age': np.concatenate([age1, age2, age3]).round(1),
        'BMI': np.concatenate([bmi1, bmi2, bmi3]).round(1),
        'SystolicBP': np.concatenate([bp1, bp2, bp3]).round(1),
        'Cholesterol': chol.round(1),
        'Glucose': glucose.round(1)
    })
    return df

df = create_patient_data(300)
print("✅ Synthetic patient data created. Shape:", df.shape)
df.head()


In [None]:
print(df.describe().T)

plt.figure(figsize=(12, 8))
for i, col in enumerate(df.columns, 1):
    plt.subplot(2, 3, i)
    sns.histplot(df[col], kde=True, color="royalblue")
    plt.title(col)
plt.tight_layout()
plt.show()


In [None]:
features = ['Age', 'BMI', 'SystolicBP', 'Cholesterol', 'Glucose']
X = df[features]

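# Standardize each feature to zero mean and unit variance so that
# no single feature dominates the Euclidean distances used by K-Means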
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Data standardized. Shape:", X_scaled.shape)


In [None]:
# Elbow method: record the inertia (objective J) for k = 1..10
inertia = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)

plt.figure(figsize=(6,4))
plt.plot(K_range, inertia, 'bo-')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.show()


In [None]:
# The silhouette score needs at least 2 clusters, so k starts at 2
sil_scores = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    sil_scores.append(silhouette_score(X_scaled, labels))

plt.figure(figsize=(6,4))
plt.plot(K_range, sil_scores, 'ro-')
plt.title('Silhouette Scores for different k')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.show()

best_k = K_range[np.argmax(sil_scores)]
print(f"✅ Best k based on Silhouette Score: {best_k}")


In [None]:
kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=20)
labels = kmeans.fit_predict(X_scaled)

# Store labels as strings so seaborn treats Cluster as categorical
df['Cluster'] = labels.astype(str)

df.head()


In [None]:
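# Project the scaled features onto 2 principal components for plotting only;
# the clustering itself was done in the full 5-feature space
pca = PCA(n_components=2)
reduced = pca.fit_transform(X_scaled)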

plt.figure(figsize=(8,6))
sns.scatterplot(x=reduced[:,0], y=reduced[:,1], hue=df['Cluster'], palette='tab10', s=60)
centers_pca = pca.transform(kmeans.cluster_centers_)
plt.scatter(centers_pca[:,0], centers_pca[:,1], s=200, c='black', marker='X', label='Centroids')
plt.title(f"K-Means Clusters Visualization (k={best_k})")
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.legend()
plt.show()


In [None]:
cluster_profile = df.groupby('Cluster')[features].mean().round(2)
print("Mean values per cluster:")
display(cluster_profile)

plt.figure(figsize=(8,4))
sns.heatmap(cluster_profile, annot=True, cmap='coolwarm', cbar=False)
plt.title("Cluster Mean Profile Heatmap")
plt.show()

print("\nCluster sizes:")
print(df['Cluster'].value_counts())


In [None]:
silhouette = silhouette_score(X_scaled, labels)
print(f"Silhouette Score for k={best_k}: {silhouette:.3f}")


In [None]:
# 🧩 Recap: K-Means Clustering for Patient Segmentation

### Key Takeaways:
- K-Means is an **unsupervised algorithm** that groups patients based on feature similarity.
- Used **Age**, **BMI**, **Systolic BP**, **Cholesterol**, and **Glucose** as features.
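- **Feature scaling** is essential before clustering: K-Means uses Euclidean distance, so features with larger numeric ranges (e.g., Cholesterol) would otherwise dominate.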
- **Elbow method** and **Silhouette score** help choose the optimal k.
- Visualized clusters using **PCA**.
- Each cluster is a candidate **patient risk group**; confirm the interpretation from the cluster mean profile before naming it.
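
The scaling takeaway can be checked on a small toy example. The sketch below uses made-up data (not the patient dataset): one informative feature in small units and one uninformative feature in large units. Agreement with the true groups is measured with the adjusted Rand index (ARI), where 1 means perfect agreement and values near 0 mean chance-level agreement. All variable names are assumptions for this illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
true_group = np.repeat([0, 1], 100)

# Informative feature in small units: group means of 20 and 30
small_units = np.where(true_group == 0, 20.0, 30.0) + rng.normal(0, 1, 200)
# Uninformative feature in large units: same distribution for both groups
large_units = rng.normal(5000, 1000, 200)
X_units = np.column_stack([small_units, large_units])

def cluster(X):
    """Fit a 2-cluster K-Means model and return the labels."""
    return KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

ari_raw = adjusted_rand_score(true_group, cluster(X_units))
ari_scaled = adjusted_rand_score(
    true_group, cluster(StandardScaler().fit_transform(X_units))
)

print(f"ARI without scaling: {ari_raw:.2f}")
print(f"ARI with scaling:    {ari_scaled:.2f}")
```

Without scaling, the large-unit feature dominates the distance and the clusters split along it, ignoring the real groups. After standardization the informative feature drives the split.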

---

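### Example Insights:
Cluster numbers are arbitrary labels, and the number of clusters depends on `best_k`. If three clusters are found, the profiles might look like this (check the cluster mean table before assigning names):

- Cluster 0 → Younger, lower BMI, lower BP → *Healthy group*
- Cluster 1 → Mid-age, moderate BMI → *Moderate risk group*
- Cluster 2 → Older, high BMI, high BP → *High-risk group*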

---

### Advantages:
- ✅ Simple and fast for large datasets
- ✅ Easy to interpret when visualized with PCA

### Limitations:
- ❌ Sensitive to outliers
- ❌ Requires specifying `k`
- ❌ Assumes spherical clusters (may fail for irregular data)
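
The spherical-cluster limitation can be illustrated with scikit-learn's two-moons toy dataset, where the true groups are crescent-shaped. This is a sketch on synthetic data, not a result from the patient dataset, and the variable names are assumptions for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: clearly separated, but not spherical
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=0)

moon_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

# Adjusted Rand index: 1.0 would mean the crescents were recovered exactly
ari_moons = adjusted_rand_score(y_moons, moon_labels)
print(f"ARI on two-moons data: {ari_moons:.2f}")
```

K-Means tends to cut this data with a straight boundary, so the ARI typically stays well below 1 even though the two crescents are easy to tell apart by eye.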


In [None]:
## ✅ Viva Questions

1. What type of learning algorithm is K-Means?
2. Why do we standardize features before clustering?
3. What is the purpose of the Elbow method?
4. What does the Silhouette Score represent?
5. How can we determine the number of clusters automatically?
6. How would you interpret a cluster with high BMI and BP but low age?
7. How could PCA help in clustering high-dimensional medical data?
8. What would happen if we didn’t scale the dataset before applying K-Means?
