# Clustering Patients for Phenotype Discovery

Time estimate: **20** minutes


## Objectives
After completing this lab, you will be able to:
- Describe why patient clustering is used in healthcare.
- Prepare clinical features for unsupervised learning.
- Apply clustering to discover patient subgroups.
- Examine and describe patient clusters.
- Interpret clusters as potential clinical phenotypes.



## What you will do in this lab

In this lab, you will prepare patient-level clinical data and apply clustering to uncover patient subgroups.

You will:

- Review a patient-level clinical feature dataset.
- Prepare the data so patients can be fairly compared.
- Group similar patients using a clustering algorithm.
- Examine how patient characteristics differ across clusters.
- Interpret clusters in clinical, non-technical terms.



## Overview
In real-world healthcare settings, clinicians often observe that not all patients
with the same diagnosis behave or respond to care in the same way.
Clustering lets you **discover natural groupings of patients** based on their
clinical characteristics, without telling the algorithm in advance what to look for.

These groups, often called *phenotypes*, can support population health management,
care pathway design, and hypothesis generation.



## About the dataset/environment
You will work with a **synthetic, de-identified patient-level dataset**
that represents information typically available after data cleaning and feature
engineering. Each row represents a patient, and each column represents a clinical
characteristic such as lab results or healthcare utilization.


## Setup

In [None]:

# This cell prepares the environment and reads the synthetic dataset used in this lab

# Import pandas for working with tables of data
import pandas as pd

# Import tools for scaling data and clustering
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Load a patient-level feature dataset
# Each row represents a patient
# Each column represents a clinical feature
patient_features = pd.read_csv("https://machine-learning-for-healthcare-applications-f276df.gitlab.io/labs/lab2/patient_features1.csv")



## Step 1: Review patient feature data

In this step, you will **understand what information is available**
before applying any algorithms. You will look at column names, data types,
and basic statistics.

This is similar to a clinician reviewing a patient chart before making decisions.

**Why this matters in healthcare:** Clinically meaningful clustering is only possible
when we clearly understand what each feature represents.
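As a rough illustration of this kind of review, the sketch below counts missing values and lists numeric columns on a small made-up table. The `patient_id` column and all values are hypothetical and are not taken from the lab dataset; the idea is simply to spot gaps and non-numeric columns before they reach a scaler or clustering algorithm.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; column names and values are illustrative only
toy = pd.DataFrame({
    "patient_id": ["P001", "P002", "P003"],   # assumed identifier column
    "avg_glucose": [98.0, np.nan, 152.0],     # one missing lab value
    "encounter_count": [2, 5, 9],
})

# Count missing values in each column
missing_per_column = toy.isna().sum()

# List the columns that hold numeric data and could be passed to a scaler
numeric_columns = toy.select_dtypes(include="number").columns.tolist()

print(missing_per_column)
print(numeric_columns)
```

If a review like this turned up identifiers or missing values in real data, they would typically need to be handled before scaling, since standardization expects numeric input.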


In [None]:
# Show the structure of the dataset, including column types and missing values
patient_features.info()

In [None]:
# Show summary statistics to understand typical values and ranges
patient_features.describe()


## Step 2: Scale features for clustering

Clinical features can be measured on very different scales.
For example, lab values may range in the hundreds, while condition indicators are
binary (0 or 1).

Scaling ensures that **each feature contributes fairly** to the clustering process.

**Why this matters in healthcare:** Without scaling, clusters may reflect measurement
units rather than true clinical similarity.
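To make the effect of standardization concrete, here is a minimal sketch on made-up values. The `has_diabetes` column is a hypothetical binary indicator chosen for illustration. The sketch checks that `StandardScaler` matches the manual calculation (value minus column mean, divided by column standard deviation).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one lab value on a large scale, one binary indicator
toy = pd.DataFrame({
    "avg_glucose": [90.0, 110.0, 150.0, 210.0],
    "has_diabetes": [0, 0, 1, 1],
})

# Standardize with scikit-learn
toy_scaled = StandardScaler().fit_transform(toy)

# Manual z-score: StandardScaler uses the population standard deviation (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

# Both approaches should agree
print(np.allclose(toy_scaled, manual.to_numpy()))

# After scaling, every column has mean 0 and standard deviation 1
print(toy_scaled.mean(axis=0).round(6))
print(toy_scaled.std(axis=0).round(6))
```

After scaling, a difference of one unit means "one standard deviation" in every column, so glucose no longer dominates distance calculations simply because its raw numbers are larger.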


In [None]:
# Create a scaler that standardizes features
# Each column is rescaled to mean 0 and standard deviation 1 so that no single feature dominates
scaler = StandardScaler()

# Fit the scaler to the data and transform it
# The result is a numerical array used for clustering
scaled_features = scaler.fit_transform(patient_features)

# Display the scaled feature values
scaled_features



## Step 3: Apply a clustering algorithm

You will now group patients using a clustering algorithm.
K-means clustering groups patients so that those within a cluster
are more similar to each other than to those in other clusters.

Here, the number of clusters is set to 3 for demonstration purposes.

**Why this matters in healthcare:** Clustering helps reveal hidden patient subgroups
that may benefit from different care strategies.
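Outside a demonstration, the number of clusters is usually compared across several candidates rather than fixed up front. The sketch below shows one common option, the silhouette score, on synthetic stand-in data generated with `make_blobs`. The data, the candidate range, and the `n_init=10` setting are assumptions for illustration, not part of this lab's dataset or method.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with three well-separated groups (not the lab dataset)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)
X_scaled = StandardScaler().fit_transform(X)

# Fit k-means for several candidate values of k and record the silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

# A higher silhouette score means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette:", best_k)
```

On real clinical data the scores are rarely this clear-cut, so a metric like this is best treated as one input alongside clinical judgment about whether the resulting groups make sense.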


In [None]:
# Create a k-means clustering model
# n_clusters defines how many patient groups you want to discover
kmeans = KMeans(n_clusters=3, random_state=42)

# Fit the model and assign each patient to a cluster
clusters = kmeans.fit_predict(scaled_features)

# Display cluster assignments
print(clusters)



## Step 4: Assign cluster labels to patients

After clustering, you will attach the cluster label back to each patient record.
This lets you examine patient characteristics within each cluster.

**Why this matters in healthcare:** Cluster labels enable clinical interpretation
and communication of patient phenotypes.


In [None]:
# Add cluster labels as a new column in the dataset
patient_features["cluster"] = clusters

# Display patient data with cluster labels
patient_features



## Step 5: Examine cluster characteristics

Finally, you will summarize patient characteristics within each cluster.
This helps you understand how the clusters differ and what makes each group distinct.

**Why this matters in healthcare:** Cluster summaries help clinicians describe
and reason about potential phenotypes in plain language.
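As a small sketch of what such a summary can look like, the example below builds a per-cluster table with the number of patients alongside feature averages. The rows and cluster labels are made up for illustration; only the column names `avg_glucose` and `encounter_count` mirror those used in the exercises.

```python
import pandas as pd

# Hypothetical toy data with cluster labels already attached
toy = pd.DataFrame({
    "avg_glucose": [95.0, 105.0, 180.0, 200.0, 120.0],
    "encounter_count": [1, 3, 8, 10, 4],
    "cluster": [0, 0, 1, 1, 0],
})

# Summarize each cluster: how many patients it holds and its average feature values
summary = toy.groupby("cluster").agg(
    n_patients=("avg_glucose", "size"),
    mean_glucose=("avg_glucose", "mean"),
    mean_encounters=("encounter_count", "mean"),
)

print(summary)
```

Reporting cluster size next to the averages matters because a striking average based on only a handful of patients is weak evidence for a phenotype.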


In [None]:
# Group patients by cluster and compute average values
cluster_summary = patient_features.groupby("cluster").mean()

# Display the cluster-level summaries
cluster_summary


## Exercises

### Exercise 1: Create 2 clusters using the same dataset

In [None]:
# your code goes here


<details>
<summary>Click here for a hint</summary>

Try a different value for `n_clusters`.

</details>

<details>
<summary>Click here for solution</summary>

```python
new_clusters = KMeans(n_clusters=2, random_state=42).fit_predict(scaled_features)
print(new_clusters)
```

</details>


### Exercise 2: Inspect cluster sizes

In [None]:
# your code goes here


<details>
<summary>Click here for a hint</summary>

Count how many patients are in each cluster.

</details>

<details>
<summary>Click here for solution</summary>

```python
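# Replace the 3-cluster labels with the 2-cluster labels from Exercise 1
patient_features['cluster'] = new_clusters

# Count how many patients fall into each cluster
patient_features['cluster'].value_counts()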
```

</details>


### Exercise 3: Compare glucose levels across clusters

In [None]:
# your code goes here


<details>
<summary>Click here for a hint</summary>

Compute average glucose per cluster.

</details>

<details>
<summary>Click here for solution</summary>

```python
patient_features.groupby('cluster')['avg_glucose'].mean()
```

</details>


### Exercise 4: Identify the high-utilization cluster

In [None]:
# your code goes here


<details>
<summary>Click here for a hint</summary>

Look at encounter counts.

</details>

<details>
<summary>Click here for solution</summary>

```python
patient_features.groupby('cluster')['encounter_count'].mean()
```

</details>


### Exercise 5: Display cluster summary

In [None]:
# your code goes here


<details>
<summary>Click here for a hint</summary>

Review the cluster summary table.

</details>

<details>
<summary>Click here for solution</summary>

```python
cluster_summary = patient_features.groupby('cluster').mean()
cluster_summary
```

</details>


## Congratulations!

You have completed this lab on using clustering to explore patient phenotypes. You scaled clinical features, grouped patients with k-means, and interpreted the resulting subgroups as potential clinical phenotypes.

## Authors
Ramesh Sannareddy

<br>

© SkillUp. All rights reserved.

Materials may not be reproduced in whole or in part without written permission from SkillUp.
