
# Week 10 — Clustering 1 (Chronic Kidney Disease Dataset)
**Integrated Capstone Project Notebook**  
Generated on 2025-11-10 00:22

This notebook explores **patient subgroups** in the Chronic Kidney Disease dataset using **unsupervised learning (KMeans)**.  
Our objectives are to:
1. Load and prepare the dataset.  
2. Automatically select and scale numeric features.  
3. Explore data structure using PCA.  
4. Apply and tune KMeans clustering.  
5. Interpret the clinical meaning of discovered clusters.  



## 1. Setup

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

pd.set_option("display.max_columns", 100)
print("Setup complete.")


## 2. Load dataset

In [None]:

# Load the provided dataset
DATA_PATH = "Chronic_Kidney_Dsease_data.csv"

df = pd.read_csv(DATA_PATH)
print("Data shape:", df.shape)
display(df.head())

# Quick info
df.info()


## 3. Select and prepare numeric features

In [None]:

# Automatically select numeric columns.
# Note: this also picks up numeric ID or label columns, if any are present.
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print("Numeric columns selected:", numeric_cols)

X = df[numeric_cols].copy()

# Impute missing values with each column's median
X = X.fillna(X.median(numeric_only=True))

# Standardize numeric data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Scaled feature matrix shape:", X_scaled.shape)
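
Selecting every numeric column can also pull in identifiers or the outcome label if they are stored as numbers, and those would distort the clusters. The sketch below shows one way to filter them out before scaling. The helper `select_feature_columns` and the column names `PatientID` and `Diagnosis` are hypothetical; replace them with whatever the dataset actually contains.

```python
import numpy as np
import pandas as pd


def select_feature_columns(frame, exclude=()):
    """Return numeric column names, skipping any listed in `exclude`."""
    excluded = set(exclude)
    numeric = frame.select_dtypes(include=[np.number]).columns
    return [col for col in numeric if col not in excluded]


# Toy frame for illustration only; all column names here are made up.
toy = pd.DataFrame({
    "PatientID": [1, 2, 3],
    "Age": [54, 61, 47],
    "SerumCreatinine": [1.1, 3.4, 0.9],
    "Diagnosis": [0, 1, 0],
    "Doctor": ["a", "b", "c"],  # non-numeric, dropped by select_dtypes
})

features = select_feature_columns(toy, exclude=["PatientID", "Diagnosis"])
print(features)  # ['Age', 'SerumCreatinine']
```

In the notebook, the result would take the place of `numeric_cols` before building `X`.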


## 4. Visualize structure with PCA

In [None]:

pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))

plt.figure(figsize=(6,5))
plt.scatter(X_pca[:,0], X_pca[:,1], s=10, alpha=0.6)
plt.title("PCA Projection (2D)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
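
A 2D projection is only informative if the first two components capture a meaningful share of the variance. As a rough check, the sketch below computes the cumulative explained variance and the number of components needed to reach a threshold. The helper name `components_for_variance`, the 90% threshold, and the random demo data are illustrative assumptions; in the notebook you would pass `X_scaled`.

```python
import numpy as np
from sklearn.decomposition import PCA


def components_for_variance(X, threshold=0.90):
    """Return cumulative explained variance and the component count reaching `threshold`."""
    pca = PCA().fit(X)  # no n_components: keep every component
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # First position where the running total meets the threshold (as a 1-based count),
    # capped at the number of components to guard against rounding near 1.0.
    n_needed = min(int(np.searchsorted(cumulative, threshold)) + 1, len(cumulative))
    return cumulative, n_needed


# Synthetic stand-in for X_scaled (assumption: 200 rows, 5 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))

cumulative, n_needed = components_for_variance(X_demo)
print("Cumulative explained variance:", cumulative.round(3))
print("Components needed for 90%:", n_needed)
```

If the first two entries of `cumulative` are small, the scatter plot above shows only a thin slice of the data's structure and should be read with caution.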


## 5. Determine optimal number of clusters (KMeans)

In [None]:

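# Track three metrics for each candidate k:
#   inertia (lower is better, but it tends to keep falling as k grows),
#   silhouette (higher is better), Davies-Bouldin (lower is better).
inertias, sils, dbs = [], [], []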
ks = range(2, 11)

for k in ks:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    sils.append(silhouette_score(X_scaled, labels))
    dbs.append(davies_bouldin_score(X_scaled, labels))

# Plot elbow and silhouette (the Davies-Bouldin scores in `dbs` are collected but not plotted)
fig, ax = plt.subplots(1,2, figsize=(10,4))
ax[0].plot(ks, inertias, marker='o')
ax[0].set_title("Elbow Plot (Inertia)")
ax[0].set_xlabel("k")
ax[0].set_ylabel("Inertia")

ax[1].plot(ks, sils, marker='o', color='green')
ax[1].set_title("Silhouette Scores")
ax[1].set_xlabel("k")
ax[1].set_ylabel("Silhouette")
plt.tight_layout()
plt.show()

best_k = ks[np.argmax(sils)]
print("Best k by silhouette:", best_k)
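
The loop above also collects Davies-Bouldin scores, but they are never displayed. One way to compare all three metrics side by side is a small table, sketched here on synthetic blobs so that it runs on its own. The helper `score_k_range` and the demo data are assumptions for illustration; in the notebook you would pass `X_scaled`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score


def score_k_range(X, ks, random_state=42):
    """Fit KMeans for each k and return one row of metrics per k."""
    rows = []
    for k in ks:
        km = KMeans(n_clusters=k, random_state=random_state, n_init=10)
        labels = km.fit_predict(X)
        rows.append({
            "k": k,
            "inertia": km.inertia_,                              # lower is better
            "silhouette": silhouette_score(X, labels),            # higher is better
            "davies_bouldin": davies_bouldin_score(X, labels),    # lower is better
        })
    return pd.DataFrame(rows).set_index("k")


# Synthetic data: three tight, well-separated blobs (illustrative only)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in centers])

scores = score_k_range(X_demo, range(2, 7))
print(scores.round(3))
print("Best k by silhouette:", scores["silhouette"].idxmax())
print("Best k by Davies-Bouldin:", scores["davies_bouldin"].idxmin())
```

When the two criteria disagree on real data, that disagreement is itself worth noting in the reflection, since it suggests the cluster structure is not clear-cut.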


## 6. Fit final KMeans model

In [None]:

kmeans = KMeans(n_clusters=best_k, random_state=42, n_init=25)
df["cluster"] = kmeans.fit_predict(X_scaled)

print("Silhouette score:", round(silhouette_score(X_scaled, df['cluster']), 3))
print("Davies-Bouldin score:", round(davies_bouldin_score(X_scaled, df['cluster']), 3))

# PCA visualization with clusters
Xp = PCA(n_components=2, random_state=42).fit_transform(X_scaled)
plt.figure(figsize=(6,5))
plt.scatter(Xp[:,0], Xp[:,1], c=df['cluster'], s=12, alpha=0.7)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title(f"KMeans Clusters (k={best_k})")
plt.show()
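
The overall silhouette score can hide one weak cluster behind several strong ones. A per-cluster breakdown, sketched below with scikit-learn's `silhouette_samples`, shows which groups are poorly separated. The helper name `silhouette_by_cluster` and the synthetic data are assumptions; in the notebook you would pass `X_scaled` and `df['cluster']`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples


def silhouette_by_cluster(X, labels):
    """Summarize per-sample silhouette values for each cluster."""
    per_sample = silhouette_samples(X, labels)
    table = pd.DataFrame({"cluster": np.asarray(labels), "silhouette": per_sample})
    return table.groupby("cluster")["silhouette"].agg(["mean", "min", "count"])


# Synthetic data: three tight, well-separated blobs (illustrative only)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in centers])

labels_demo = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_demo)
summary = silhouette_by_cluster(X_demo, labels_demo)
print(summary.round(3))
```

A cluster with a low mean or a negative minimum contains points that sit closer to a neighbouring cluster than to their own, which is a reason to treat its clinical interpretation cautiously.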


## 7. Cluster profiling

In [None]:

profile = df.groupby("cluster")[numeric_cols].mean(numeric_only=True).round(2)
display(profile)

print("\nCluster sizes:")
display(df['cluster'].value_counts())
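
Because KMeans was fitted on standardized data, `kmeans.cluster_centers_` is expressed in z-scores. Mapping the centroids back through the scaler gives values in the original measurement units, which can be easier to read clinically. The sketch below uses synthetic data and made-up column names (`creatinine`, `urea`) purely for illustration; in the notebook the equivalent call would use the fitted `scaler` and `kmeans` objects.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data with three separated groups (illustrative only)
rng = np.random.default_rng(0)
centers = np.array([[1.0, 30.0], [4.0, 90.0], [8.0, 150.0]])
raw = np.vstack([rng.normal(loc=c, scale=[0.2, 5.0], size=(50, 2)) for c in centers])
demo = pd.DataFrame(raw, columns=["creatinine", "urea"])  # hypothetical names

# Same pipeline as the notebook: standardize, then cluster
scaler_demo = StandardScaler()
X_scaled_demo = scaler_demo.fit_transform(demo.to_numpy())
kmeans_demo = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_scaled_demo)

# Undo the standardization so centroids are in the original units
centroids_original = pd.DataFrame(
    scaler_demo.inverse_transform(kmeans_demo.cluster_centers_),
    columns=demo.columns,
).rename_axis("cluster")
print(centroids_original.round(2))
```

On converged runs these centroids should closely match the per-cluster means in the profile table, so the two views can be used to cross-check each other.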



## 8. Interpretation & Reflection

**Guiding prompts:**
- What clinical patterns do you notice across clusters?  
- Do some clusters show higher serum creatinine or blood urea (markers of reduced kidney function, possibly indicating advanced CKD), or higher blood glucose (a sign of diabetes, a major CKD risk factor)?  
- Are there clusters that show normal ranges (potentially early stage or non-CKD)?  
- Do cluster sizes align with expectations for CKD prevalence?  
- What insights might these clusters provide for targeted intervention or diagnosis?  

**Write your summary paragraph here:**  
> _Example:_  
> "Cluster 0 contained patients with higher serum creatinine and blood urea, likely representing advanced CKD, while Cluster 2 showed lower blood pressure and normal glucose levels, possibly early-stage cases. These findings suggest the clustering may separate patients by disease severity, although this would need to be checked against the recorded diagnoses."
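
If the dataset includes a diagnosis label, one way to ground these reflections is to cross-tabulate cluster membership against it. This is a sketch on a toy frame; the `Diagnosis` column name and its 0/1 coding are assumptions, not something confirmed by the notebook.

```python
import pandas as pd

# Toy example: cluster assignments next to a hypothetical 0/1 diagnosis label
toy = pd.DataFrame({
    "cluster":   [0, 0, 0, 1, 1, 2, 2, 2],
    "Diagnosis": [1, 1, 0, 0, 0, 1, 1, 1],
})

# Raw counts per (cluster, diagnosis) pair
counts = pd.crosstab(toy["cluster"], toy["Diagnosis"])

# Row-normalized shares: the fraction of each cluster carrying each label
shares = pd.crosstab(toy["cluster"], toy["Diagnosis"], normalize="index").round(2)

print(counts)
print(shares)
```

Clusters whose shares are close to the overall label prevalence carry little diagnostic signal, whereas strongly skewed clusters are better candidates for a clinical interpretation.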
