# Week 11 – DBSCAN and Hierarchical Clustering on CKD Dataset

This notebook applies **DBSCAN** and **Hierarchical Agglomerative Clustering (HAC)** to the Chronic Kidney Disease dataset to identify patient subgroups based on numeric clinical features.

The steps follow the Milestone Two summary document and are intended for use as appendix evidence.


In [None]:
import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_score

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

DATA_PATH = "Chronic_Kidney_Dsease_data.csv"

df = pd.read_csv(DATA_PATH)
print(df.shape)
df.head()

## Select Numeric Features and Preprocess

In [None]:
# Drop identifier columns only; Diagnosis stays in df for later interpretation
df = df.drop(columns=["PatientID", "DoctorInCharge"], errors="ignore")

numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Remove Diagnosis from clustering features (keep it for interpretation)
if "Diagnosis" in numeric_cols:
    numeric_cols.remove("Diagnosis")

X = df[numeric_cols].copy()

# Impute + scale numeric features
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

X_scaled = pipe.fit_transform(X)
print("Numeric feature shape:", X_scaled.shape)

## DBSCAN – k-distance Plot for Choosing ε

In [None]:
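# Use NearestNeighbors to help choose epsilon (k-distance graph).
# k is set to the same value as min_samples in the DBSCAN cell.
k = 10
neighbors = NearestNeighbors(n_neighbors=k)
neighbors_fit = neighbors.fit(X_scaled)
distances, indices = neighbors_fit.kneighbors(X_scaled)

# Querying with the training data returns each point as its own nearest
# neighbor (distance 0), so the last column is the distance to the (k-1)th
# other point. DBSCAN's min_samples also counts the point itself, so this is
# the smallest eps at which a point becomes a core point for min_samples=k.
distances_k = np.sort(distances[:, -1])

plt.figure(figsize=(6, 4))
plt.plot(distances_k)
plt.ylabel(f"Distance to neighbor {k} (point itself counted)")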
plt.xlabel("Points sorted by distance")
plt.title("k-distance Plot (use elbow as epsilon guide)")
plt.show()
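
Reading the elbow off the plot by eye is subjective, so a numeric cross-check can help. The sketch below is one simple heuristic, not part of the original analysis: it picks the point on the sorted k-distance curve that lies farthest below the straight line joining the curve's endpoints. The helper names `k_distance_curve` and `elbow_by_chord` are hypothetical, and the `make_blobs` data is a synthetic stand-in for `X_scaled`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def k_distance_curve(X, k):
    """Sorted distance from each point to its k-th nearest *other* point."""
    # kneighbors() with no argument excludes each point from its own
    # neighbor list, so the last column is the true k-th neighbor distance.
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    distances, _ = nn.kneighbors()
    return np.sort(distances[:, -1])


def elbow_by_chord(curve):
    """Return (index, value) of the point farthest below the chord that
    joins the first and last points of an increasing curve."""
    curve = np.asarray(curve, dtype=float)
    n = len(curve)
    x = np.linspace(0.0, 1.0, n)
    span = curve[-1] - curve[0]
    y = (curve - curve[0]) / span if span > 0 else np.zeros(n)
    # After normalisation the chord is the line y = x; the knee of a
    # flat-then-steep curve is where x - y is largest.
    idx = int(np.argmax(x - y))
    return idx, float(curve[idx])


# Synthetic stand-in for X_scaled (assumption: replace with the real matrix)
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# k = min_samples - 1, because min_samples counts the point itself
curve = k_distance_curve(X_demo, k=9)
knee_idx, eps_guess = elbow_by_chord(curve)
print(f"Heuristic eps suggestion: {eps_guess:.3f} (sorted index {knee_idx})")
```

The suggested value is only a starting point; it is still worth trying a few `eps` values around it and checking the cluster and noise counts.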

## DBSCAN Clustering

In [None]:
# Example epsilon/min_samples values – adjust based on k-distance plot
eps_values = [0.8, 1.0, 1.2]
min_samples = 10

for eps in eps_values:
    dbscan = DBSCAN(eps=eps, min_samples=min_samples)
    labels = dbscan.fit_predict(X_scaled)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = list(labels).count(-1)
    print(f"eps={eps}: clusters={n_clusters}, noise points={n_noise}")
    
    # Compute silhouette only if at least 2 clusters
    if n_clusters >= 2:
        mask = labels != -1
        sil = silhouette_score(X_scaled[mask], labels[mask])
        print(f"  Silhouette (excluding noise): {sil:.3f}")
    print()
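
For the appendix, it may be convenient to collect the sweep results in a table and to report cluster sizes alongside the counts, since a single dominant cluster plus a few tiny ones looks very different from a balanced split. The following is a minimal sketch under assumptions: `dbscan_summary` is a hypothetical helper, and the `make_blobs` data and the `eps` grid are placeholders rather than values tuned for the CKD data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


def dbscan_summary(X, eps_values, min_samples):
    """Run DBSCAN for each eps and return one summary row per setting."""
    rows = []
    for eps in eps_values:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        # Cluster sizes, ignoring the noise label -1
        cluster_ids, counts = np.unique(labels[labels != -1], return_counts=True)
        rows.append({
            "eps": eps,
            "n_clusters": len(cluster_ids),
            "noise_frac": float(np.mean(labels == -1)),
            "largest_cluster": int(counts.max()) if len(counts) else 0,
            "smallest_cluster": int(counts.min()) if len(counts) else 0,
        })
    return pd.DataFrame(rows)


# Synthetic stand-in for X_scaled (assumption: replace with the real matrix)
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=2,
                       cluster_std=0.6, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

summary = dbscan_summary(X_demo, eps_values=[0.1, 0.2, 0.4], min_samples=10)
print(summary)
```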

### Visualizing Two Principal Dimensions (Optional)

To visualize DBSCAN results, reduce the scaled data to two dimensions (for example, with PCA) and color each point by its cluster label.
This step is optional: for the appendix, the key evidence is the DBSCAN parameters and the resulting cluster and noise counts.
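
One possible way to do this is sketched below. It is an illustration under assumptions, not the notebook's own code: `plot_labels_in_pca_space` is a hypothetical helper, the `make_blobs` data stands in for `X_scaled`, and the `eps` and `min_samples` values are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def plot_labels_in_pca_space(X, labels, title):
    """Project X onto its first two principal components and color by label."""
    pca = PCA(n_components=2)
    coords = pca.fit_transform(X)
    noise = labels == -1

    fig, ax = plt.subplots(figsize=(6, 4))
    if noise.any():
        # DBSCAN noise points (label -1) are drawn in gray
        ax.scatter(coords[noise, 0], coords[noise, 1],
                   c="lightgray", s=10, label="noise")
        ax.legend()
    if (~noise).any():
        ax.scatter(coords[~noise, 0], coords[~noise, 1],
                   c=labels[~noise], cmap="tab10", s=10)
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    ax.set_title(title)
    return fig, coords, pca.explained_variance_ratio_


# Synthetic stand-in for X_scaled (assumption: replace with the real matrix)
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=6, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)
labels_demo = DBSCAN(eps=1.0, min_samples=10).fit_predict(X_demo)

fig, coords_demo, evr = plot_labels_in_pca_space(
    X_demo, labels_demo, "DBSCAN labels in PCA space (demo data)"
)
print("Explained variance ratio of PC1 and PC2:", np.round(evr, 3))
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards. The explained variance ratio is worth reporting with the plot, because two components may capture only a small share of the variance in a dataset with many features, in which case clusters that overlap in the plot can still be separated in the full space.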


## Hierarchical Agglomerative Clustering (HAC)

In [None]:
# Try HAC with a small number of clusters
for n_clusters in [2, 3, 4]:
    hac = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
    labels = hac.fit_predict(X_scaled)
    sil = silhouette_score(X_scaled, labels)
    print(f"HAC with {n_clusters} clusters – silhouette: {sil:.3f}")

### Dendrogram (on a Random Subset for Clarity)

In [None]:
# For dendrograms, use a smaller random sample to keep it readable
np.random.seed(42)
sample_idx = np.random.choice(X_scaled.shape[0], size=min(200, X_scaled.shape[0]), replace=False)
X_sample = X_scaled[sample_idx]

Z = linkage(X_sample, method="ward")

plt.figure(figsize=(8, 4))
dendrogram(Z, truncate_mode="level", p=5)
plt.title("Hierarchical Clustering Dendrogram (truncated)")
plt.xlabel("Sample index, or (number of samples) for collapsed branches")
plt.ylabel("Distance")
plt.show()
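
A dendrogram becomes a set of cluster labels once it is cut at some height. The sketch below shows two ways of cutting a SciPy linkage matrix with `fcluster`: by a target number of clusters, and by a distance threshold read off the vertical axis. It uses synthetic `make_blobs` data as a stand-in for `X_sample`, and the variable names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic stand-in for X_sample (assumption: replace with the real subset)
X_demo, _ = make_blobs(n_samples=200, centers=3, n_features=5, random_state=42)
Z_demo = linkage(X_demo, method="ward")

# Option 1: ask for at most 3 flat clusters
labels_k = fcluster(Z_demo, t=3, criterion="maxclust")

# Option 2: cut at a height. Column 2 of the linkage matrix holds the merge
# distances in increasing order, so a threshold between the third-last and
# second-last merge leaves three clusters.
t_cut = 0.5 * (Z_demo[-3, 2] + Z_demo[-2, 2])
labels_h = fcluster(Z_demo, t=t_cut, criterion="distance")

print("Clusters from maxclust cut:", len(np.unique(labels_k)))
print(f"Clusters from distance cut at {t_cut:.2f}:", len(np.unique(labels_h)))
```

Note that `fcluster` numbers clusters from 1, whereas scikit-learn's `AgglomerativeClustering` numbers them from 0.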

## Interpreting Clusters with Diagnosis (Optional)

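Cross-tabulating cluster labels against the `Diagnosis` column shows whether any cluster is dominated by patients with a clinical CKD label. `Diagnosis` was excluded from the clustering features, so any alignment reflects structure in the other numeric variables.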


In [None]:
# Example: fit HAC with 3 clusters and compare to Diagnosis
hac3 = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels3 = hac3.fit_predict(X_scaled)

if "Diagnosis" in df.columns:
    ctab = pd.crosstab(labels3, df["Diagnosis"], rownames=["Cluster"], colnames=["Diagnosis"])
    print(ctab)
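
Raw counts can be hard to compare when clusters differ in size. A minimal sketch of two complementary summaries follows: row-normalized proportions from `pd.crosstab(..., normalize="index")` and the adjusted Rand index from scikit-learn. The label arrays here are made-up toy values for illustration only, not results from the CKD data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical toy labels (assumption: substitute labels3 and df["Diagnosis"])
clusters = pd.Series([0, 0, 0, 1, 1, 1, 2, 2], name="Cluster")
diagnosis = pd.Series([1, 1, 1, 0, 0, 1, 1, 1], name="Diagnosis")

# Each row sums to 1: the share of each diagnosis value within a cluster
ctab_prop = pd.crosstab(clusters, diagnosis, normalize="index")
print(ctab_prop.round(2))

# Adjusted Rand index: 1.0 for identical partitions, near 0 for random ones
ari = adjusted_rand_score(diagnosis, clusters)
print(f"Adjusted Rand index: {ari:.3f}")
```

The adjusted Rand index compares partitions, so it penalizes a three-cluster solution against a two-class label even when each cluster is pure; the proportion table is usually the more informative of the two in that case.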