# K-means Clustering for Patient Segmentation


Time estimate: **30** minutes

## Objectives

After completing this lab, you will be able to:

 - Explain the purpose of clustering in population health research
 - Prepare and preprocess patient datasets for clustering analysis
 - Apply K-means clustering to segment patients into distinct subgroups
 - Determine the optimal number of clusters using the elbow method
 - Analyze and interpret cluster characteristics to identify patterns in patient demographics and health variables
 - Visualize clusters using dimensionality reduction techniques like PCA
 - Develop practical skills in Python for performing clustering on anonymized patient data

## What you will do in this lab

In this lab, you will work with a simulated patient dataset containing demographic and health information. You will apply K-means clustering, a popular unsupervised machine learning technique, to identify distinct patient subgroups based on their characteristics.

You will:

- Load and explore an anonymized patient dataset with multiple health features
- Preprocess the data by handling missing values, encoding categorical variables, and standardizing numeric features
- Apply K-means clustering to segment patients into three distinct groups
- Analyze the characteristics of each cluster to understand patient patterns
- Visualize the clusters using Principal Component Analysis (PCA) for 2D representation
- Practice clustering on an additional brain stroke risk dataset

## Overview

Clustering is an unsupervised machine learning technique that groups similar data points together based on their characteristics. In healthcare, clustering helps identify patient subgroups with similar health profiles, enabling personalized treatment strategies and targeted interventions.

K-means clustering is one of the most widely used clustering algorithms. It works by partitioning data into K distinct clusters, where each data point belongs to the cluster with the nearest centroid (center point). The algorithm iteratively assigns points to clusters and updates centroids until convergence.

In this lab, you will work with a patient dataset containing demographic information, vital signs, laboratory values, and health history. The goal is to discover natural groupings in the data that may represent different patient risk profiles or health conditions. Understanding these patterns can help healthcare providers develop targeted prevention and treatment strategies.

Before applying K-means, proper data preprocessing is essential. This includes handling missing values, encoding categorical variables into numeric form, and standardizing features to ensure all variables contribute equally to the clustering process. You will learn these critical preprocessing steps and understand why they are necessary for effective clustering.

## About the dataset

In this lab, you will work with a simulated patient dataset designed for population health research. The dataset contains anonymized health records with demographic information, vital signs, laboratory values, and medical history.

### Dataset overview

The patient dataset consists of 6,000 anonymized patient records with 16 features covering various aspects of patient health. This dataset combines cardiovascular, metabolic, and lifestyle factors to create a comprehensive health profile for each patient. The data has been specifically designed for clustering analysis to identify distinct patient subgroups that may benefit from targeted healthcare interventions.

The dataset includes both numeric features (such as age, blood pressure, and cholesterol levels) and categorical features (such as gender, residence type, and smoking status). Some features contain missing values, which is common in real-world healthcare datasets and requires proper handling during preprocessing.

### Column descriptions

1. **age** - Patient's age in years (numeric, range: 18-90)
2. **gender** - Patient's gender (categorical: 0 = Female, 1 = Male)
3. **chest_pain_type** - Type of chest pain experienced (categorical: 1 = Typical Angina, 2 = Atypical Angina, 3 = Non-Anginal Pain, 4 = Asymptomatic)
4. **blood_pressure** - Resting blood pressure in mm Hg (numeric, range: 0-300)
5. **cholesterol** - Serum cholesterol level in mg/dl (numeric, range: 120-300)
6. **max_heart_rate** - Maximum heart rate achieved during exercise (numeric, range: 70-220)
7. **exercise_angina** - Angina experienced during exercise (categorical: 0 = No, 1 = Yes)
8. **plasma_glucose** - Plasma glucose concentration in mg/dl (numeric, range: 70-250)
9. **skin_thickness** - Skinfold thickness in mm (numeric, range: 20-100)
10. **insulin** - Serum insulin level in µU/ml (numeric, range: 80-180)
11. **bmi** - Body Mass Index calculated as weight/height² (numeric, range: 10-50)
12. **diabetes_pedigree** - Diabetes pedigree function representing family history risk (numeric, range: 0.1-2.5)
13. **hypertension** - History of hypertension (categorical: 0 = No, 1 = Yes)
14. **heart_disease** - History of heart disease (categorical: 0 = No, 1 = Yes)
15. **residence_type** - Type of area where the patient lives (categorical: Urban, Rural)
16. **smoking_status** - Patient's smoking status (categorical: Smoker, Non-Smoker, Unknown)

## Setup

### Installing required libraries

The following libraries are required to run this lab. If you are running this notebook in a local environment, you may need to install these libraries using pip.

In [None]:
# Install the libraries required for this lab
!pip install pandas numpy matplotlib seaborn scikit-learn

In [None]:
# Optional: suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

### Importing required libraries

In [None]:
# Import necessary libraries for data manipulation and clustering
import pandas as pd      # For loading, inspecting, and manipulating tabular data (DataFrames)
import numpy as np       # For numerical operations, arrays, and mathematical functions
import matplotlib.pyplot as plt   # For creating static plots, charts, and visualizations
import seaborn as sns    # For advanced statistical visualizations (heatmaps, scatterplots, pairplots)

# Import machine learning modules for clustering
from sklearn.cluster import KMeans        # K-means clustering algorithm for segmenting patients into clusters
from sklearn.preprocessing import StandardScaler  # For standardizing features (mean=0, variance=1) before clustering
from sklearn.decomposition import PCA             # Principal Component Analysis for dimensionality reduction and 2D visualization

print("All libraries imported successfully!")
print("Ready to begin K-means clustering analysis.")

## Step 1: Load the patient dataset

The first step in any data analysis project is to load the dataset into a pandas DataFrame. A DataFrame is a tabular data structure that allows you to easily manipulate and analyze data.

In this step, you will load the patient_dataset.csv file and verify that it has been loaded successfully.

In [None]:
# Specify the path to the patient dataset CSV file
file_path = "https://advanced-machine-learning-for-medical-data-8e1579.gitlab.io/labs/lab4/patient_dataset.csv"

# Load the CSV file into a pandas DataFrame
df = pd.read_csv(file_path)

# Display success message
print("File loaded successfully!")

# Display the first few rows to preview the data
print("\nFirst 5 rows of the dataset:")
df.head()

In [None]:
# Display the column names in the dataset
print("\nColumn names:")
print(df.columns.tolist())

# Display dataset dimensions (rows and columns)
print(f"\nDataset shape: {df.shape[0]} rows, {df.shape[1]} columns")

## Step 2: Explore the dataset

Before applying any machine learning algorithm, it is essential to understand the structure and characteristics of your data. In this step, you will examine:

- Basic dataset information (data types, non-null counts)
- Summary statistics for numeric features
- Missing values in each column
- Unique values in categorical features

This exploration helps identify data quality issues and informs preprocessing decisions.

In [None]:
# Display basic information about the dataset
# This shows data types, number of non-null values, and memory usage
print("Dataset information (columns, data types, non-null counts):")
df.info()  # info() prints its report directly and returns None, so it needs no print() wrapper

In [None]:
# Check for missing values in each column
# Missing values must be handled before clustering
print("\nMissing values per column:")
print(df.isnull().sum())

In [None]:
# Display summary statistics for numeric features
# This includes count, mean, standard deviation, min, max, and quartiles
numeric_cols = df.select_dtypes(include='number').columns  # matches every integer and float dtype
print("\nSummary statistics for numeric features:")
df[numeric_cols].describe()

In [None]:
# Display unique values for categorical features
# This helps understand the categories you need to encode
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
print("\nUnique values for categorical features:")
for col in categorical_cols:
    print(f"{col}: {df[col].unique()}")

## Step 3: Preprocess the data

Data preprocessing is a critical step before applying K-means clustering. Raw data often contains missing values, categorical variables, and features with different scales, which can negatively impact clustering results.

In this step, you will perform the following preprocessing tasks:

1. **Separate numeric and categorical columns** - Different data types require different preprocessing approaches
2. **Fill missing values** - K-means cannot handle missing data, so you must impute (fill) missing values
3. **Encode categorical variables** - Convert categorical text into numeric form using one-hot encoding
4. **Standardize numeric features** - Scale all features to have mean=0 and standard deviation=1

### Why preprocessing matters

**Handling missing values:** scikit-learn's KMeans raises an error when the input contains missing values (NaN), so they must be filled first. This lab uses mean imputation for numeric features and mode imputation for categorical features.

**Encoding categorical variables:** K-means uses Euclidean distance, which requires all features to be numeric. One-hot encoding converts categories such as "Urban" and "Rural" into binary columns.

**Standardization:** Features with larger ranges (e.g., cholesterol: 120-300) would dominate distance calculations over smaller ranges (e.g., gender: 0-1). Standardization ensures all features contribute equally.
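
The effect of scale on Euclidean distance can be seen in a small sketch. The three patients and their values below are invented for illustration and do not come from the lab dataset. Before scaling, a difference in gender barely registers next to a difference in cholesterol; after scaling, both features contribute on a comparable scale.

```python
import numpy as np

# Three hypothetical patients described by (cholesterol in mg/dl, gender as 0/1).
# The values are invented for illustration only.
patients = np.array([
    [180.0, 0.0],   # patient A
    [185.0, 1.0],   # patient B: similar cholesterol, different gender
    [240.0, 0.0],   # patient C: much higher cholesterol, same gender as A
])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances: the cholesterol column dominates because its range is larger
raw_ab = euclidean(patients[0], patients[1])
raw_ac = euclidean(patients[0], patients[2])

# Standardize each column to mean 0 and standard deviation 1
scaled = (patients - patients.mean(axis=0)) / patients.std(axis=0)
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

print(f"Raw:    d(A,B) = {raw_ab:.2f}, d(A,C) = {raw_ac:.2f}")
print(f"Scaled: d(A,B) = {scaled_ab:.2f}, d(A,C) = {scaled_ac:.2f}")
```

The manual standardization here uses the population standard deviation, which is the same convention StandardScaler follows.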

In [None]:
# Step 3.1: Separate numeric and categorical columns
# Numeric columns will be scaled, categorical columns will be encoded
numeric_cols = df.select_dtypes(include='number').columns  # matches every integer and float dtype
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

print(f"Numeric columns ({len(numeric_cols)}): {list(numeric_cols)}")
print(f"\nCategorical columns ({len(categorical_cols)}): {list(categorical_cols)}")

In [None]:
# Step 3.2: Fill missing values
# For numeric columns: fill with the mean (average) value
# For categorical columns: fill with the mode (most frequent) value
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
df[categorical_cols] = df[categorical_cols].fillna(df[categorical_cols].mode().iloc[0])

print("Missing values filled successfully!")
print("\nVerifying - Missing values remaining:")
print(df.isnull().sum().sum())  # Should be 0

In [None]:
# Step 3.3: Encode categorical variables using one-hot encoding
# This converts categorical features into binary columns (0 or 1)
# drop_first=True drops one category per feature; the dropped category is implied when all remaining columns are 0
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

print("Categorical variables encoded successfully!")
print(f"\nDataset shape after encoding: {df_encoded.shape[0]} rows, {df_encoded.shape[1]} columns")
print("\nNew encoded columns:")
print([col for col in df_encoded.columns if col not in df.columns])

In [None]:
# Step 3.4: Standardize all features using StandardScaler
# This transforms every column, including the one-hot encoded ones, to mean=0 and standard deviation=1
# Standardization ensures all features contribute equally to distance calculations
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_encoded)

print("Preprocessing complete!")
print("All features have been standardized and are ready for K-means clustering.")

## Step 4: Apply K-means clustering

Now that the data is preprocessed, you can apply the K-means clustering algorithm. K-means groups similar data points into K clusters based on Euclidean distance.

### How K-means works

1. **Initialize K centroids** randomly (or using smart initialization like k-means++)
2. **Assign each point** to the nearest centroid
3. **Update centroids** by calculating the mean of all points in each cluster
4. **Repeat steps 2-3** until centroids stop moving (convergence)
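
The four steps above can be sketched in a few lines of NumPy. This is a simplified illustration with random initialization, not scikit-learn's implementation; the function name `kmeans_sketch` and the two-blob data are made up for the example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain K-means (Lloyd's algorithm) with random initialization."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Invented 2D data: two well-separated blobs of 50 points each
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

labels, centroids = kmeans_sketch(X, k=2)
print("Centroids:\n", np.round(centroids, 2))
print("Cluster sizes:", np.bincount(labels))
```

In practice, scikit-learn's KMeans adds k-means++ initialization, multiple restarts, and faster distance computations on top of this basic loop.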

### Key parameters

- **n_clusters:** The number of clusters (K). In this lab, you use K=3 to identify three patient subgroups
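- **init='k-means++':** Initialization method that spreads the starting centroids apart, which usually speeds up convergence and gives better results than purely random starts
- **max_iter:** Maximum number of iterations per run (300 is the default and is typically sufficient)
- **n_init:** Number of times the algorithm runs with different initial centroids; the best run is kept. This lab sets it to 10 explicitly. Older scikit-learn releases used 10 as the default, but since version 1.4 the default is 'auto', which performs a single run when init='k-means++'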
- **random_state:** Ensures reproducibility by fixing the random seed
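
This lab fixes K=3, but a common way to choose K is the elbow method: fit K-means for a range of K values, record the inertia (within-cluster sum of squared distances), and look for the point where further increases in K give only small improvements. The sketch below uses synthetic data from `make_blobs` as a stand-in for the preprocessed patient data, so the numbers it prints say nothing about the lab dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed patient data: 4 generated groups
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

k_values = list(range(1, 9))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    model.fit(X_scaled)
    # inertia_ is the within-cluster sum of squared distances to the centroids
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"K={k}: inertia={inertia:.1f}")
```

Plotting `k_values` against `inertias` (for example with `plt.plot(k_values, inertias, marker='o')`) gives the elbow curve. Inertia always falls as K grows, so the elbow is a judgment call rather than a strict rule.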

In [None]:
# Convert the scaled NumPy array back to a DataFrame
# This makes it easier to work with and add the cluster labels
df_scaled = pd.DataFrame(df_scaled, columns=df_encoded.columns)

print("Data converted back to DataFrame format.")
print(f"Shape: {df_scaled.shape}")

In [None]:
# Apply K-means clustering with K=3 clusters
optimal_k = 3

# Initialize the K-means model
kmeans = KMeans(
    n_clusters=optimal_k,      # Number of clusters
    init='k-means++',           # Smart initialization
    max_iter=300,               # Maximum iterations
    n_init=10,                  # Number of runs with different initializations
    random_state=42             # For reproducibility
)

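# Fit the model on the feature columns only and store a cluster label for each patient
# (excluding 'Cluster' keeps this cell safe to re-run after the column has been added)
feature_cols = df_scaled.columns.drop('Cluster', errors='ignore')
df_scaled['Cluster'] = kmeans.fit_predict(df_scaled[feature_cols])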

print("K-means clustering complete!")
print(f"\nPatients have been segmented into {optimal_k} clusters.")
print("\nCluster distribution:")
print(df_scaled['Cluster'].value_counts().sort_index())

## Step 5: Analyze cluster characteristics

After clustering, it's important to understand what differentiates each cluster. By examining the mean values of features in each cluster, you can identify distinct patient subgroups and their characteristics.

This analysis helps answer questions like:
- Which cluster has older patients?
- Which cluster has higher cholesterol levels?
- Which cluster is associated with more health risk factors?

These insights can inform targeted healthcare interventions and personalized treatment strategies.

In [None]:
# Calculate the mean value of each feature for each cluster
# This shows the average characteristics of patients in each cluster
print("Cluster summary (mean values):")
cluster_summary = df_scaled.groupby('Cluster').mean()


# Note: Values are standardized (mean=0, std=1)
# Positive values indicate above-average for that feature
# Negative values indicate below-average for that feature
cluster_summary

In [None]:
# Copy the cluster labels to the original DataFrame (rows are matched by index)
df['Cluster'] = df_scaled['Cluster']

In [None]:
# Cluster summary based on the original data frame
df.groupby('Cluster').mean(numeric_only=True)
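
Another way to read the clusters in original units is to map the fitted centroids back through the scaler. The sketch below assumes a small invented DataFrame with two features. In the lab, the equivalent call would use the fitted `scaler` and `kmeans` objects, and the result should agree with the per-cluster means of the scaled columns once K-means has converged.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented values for two features; not taken from the lab dataset
toy = pd.DataFrame({
    "age": [25, 30, 28, 65, 70, 68],
    "cholesterol": [160, 170, 165, 250, 260, 255],
})

scaler = StandardScaler()
X_scaled = scaler.fit_transform(toy)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(X_scaled)

# cluster_centers_ lives in standardized space; inverse_transform maps it back
centers_original = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_),
    columns=toy.columns,
)
print(centers_original.round(1))
```

This is useful when only the fitted model and scaler are available, for example when describing clusters without access to the original records.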

## Step 6: Visualize clusters using PCA

With 16 or more features after encoding, the clusters cannot be plotted directly. Principal Component Analysis (PCA) addresses this by projecting the data onto a small number of dimensions that retain as much of the variance as possible.

### What is PCA?

PCA transforms high-dimensional data into a lower-dimensional representation by finding the directions (principal components) with the most variance. The first two principal components (PC1 and PC2) capture the most significant patterns, allowing you to create a 2D visualization.
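
To make the idea of "directions with the most variance" concrete, the following sketch computes PCA by hand with a singular value decomposition and compares it with scikit-learn's PCA. The data is randomly generated for illustration, with one correlation injected on purpose.

```python
import numpy as np
from sklearn.decomposition import PCA

# Invented data: 200 samples, 5 features, with one correlated pair of columns
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] += 2.0 * X[:, 0]

# PCA by hand: center the data, then take the singular value decomposition
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal directions; project onto the first two
scores_manual = X_centered @ Vt[:2].T

# Each component's share of the total variance
explained_ratio_manual = (S ** 2) / np.sum(S ** 2)

# Same computation with scikit-learn
pca = PCA(n_components=2)
scores_sklearn = pca.fit_transform(X)

print("Manual ratios: ", np.round(explained_ratio_manual[:2], 4))
print("sklearn ratios:", np.round(pca.explained_variance_ratio_, 4))
```

The two sets of scores can differ in sign for a given component, because a principal direction and its negation describe the same axis.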

### Why use PCA for visualization?

- **Dimensionality reduction:** Converts 16+ dimensions to 2 dimensions
- **Pattern preservation:** Retains as much of the variance in the data as two dimensions allow; the explained variance ratio shows how much is kept
- **Visual interpretation:** Makes it easy to see cluster separation and overlap

In the visualization, each point represents a patient, and colors represent cluster assignments.

In [None]:
# Apply PCA to reduce dimensions from 16+ to 2 for visualization
pca = PCA(n_components=2)  # Keep only the first 2 principal components

# Transform the scaled data
principal_components = pca.fit_transform(df_scaled.drop('Cluster', axis=1))

# Create a DataFrame with the 2 principal components
df_pca = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
df_pca['Cluster'] = df_scaled['Cluster'].values

# Create a scatter plot showing clusters in 2D space
plt.figure(figsize=(8, 6))
sns.scatterplot(
    x='PC1', 
    y='PC2', 
    hue='Cluster',  # Color points by cluster
    data=df_pca, 
    palette='viridis',  # Color scheme
    s=50,  # Point size
    alpha=0.7  # Transparency
)
plt.title("PCA Visualization of K-means Clusters", fontsize=14, fontweight='bold')
plt.xlabel("Principal Component 1 (PC1)", fontsize=12)
plt.ylabel("Principal Component 2 (PC2)", fontsize=12)
plt.legend(title='Cluster', fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Display the variance explained by the first 2 components
print(f"\nVariance explained by PC1: {pca.explained_variance_ratio_[0]:.2%}")
print(f"Variance explained by PC2: {pca.explained_variance_ratio_[1]:.2%}")
print(f"Total variance explained: {sum(pca.explained_variance_ratio_):.2%}")

# Exercises

Now it's your turn to apply what you've learned! In these exercises, you will work with a different dataset called **brain_stroke_clusters.csv** (https://advanced-machine-learning-for-medical-data-8e1579.gitlab.io/labs/lab4/brain_stroke_clusters.csv), which contains patient data related to stroke risk factors.

Your goal is to perform K-means clustering on this dataset to identify patient subgroups with different stroke risk profiles.

## Exercise 1: Import the required libraries

Import the necessary libraries for this exercise. You will need pandas, scikit-learn modules, matplotlib, and seaborn.

In [None]:
# your code goes here

<details>
    <summary>Click here for a hint</summary>
    
You need to import: pandas, StandardScaler and KMeans from sklearn, matplotlib.pyplot, and seaborn.

</details>

<details>
    <summary>Click here for solution</summary>

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns

print("Libraries imported successfully!")
```

</details>

## Exercise 2: Load the dataset

Load the brain_stroke_clusters.csv file into a pandas DataFrame and display a success message.

In [None]:
# your code goes here

<details>
    <summary>Click here for a hint</summary>
    
Use pd.read_csv() to load the CSV file. Refer to Step 1 of this lab.

</details>

<details>
    <summary>Click here for solution</summary>

```python
# Specify the path to the brain stroke dataset
file_path = "https://advanced-machine-learning-for-medical-data-8e1579.gitlab.io/labs/lab4/brain_stroke_clusters.csv"

# Load CSV into a DataFrame
df = pd.read_csv(file_path)

# Display success message
print("File loaded successfully!")
```

</details>

## Exercise 3: Explore the dataset

Display the number of rows, column names, and the first 5 rows of the dataset.

In [None]:
# your code goes here

<details>
    <summary>Click here for a hint</summary>
    
Use df.shape[0] for row count, df.columns.tolist() for column names, and df.head() to display the first 5 rows.

</details>

<details>
    <summary>Click here for solution</summary>

```python
print("Number of rows:", df.shape[0])
print("\nColumns:", df.columns.tolist())
print("\nFirst 5 rows:")
print(df.head())
```

</details>

## Exercise 4: Check for categorical features

Check if there are any categorical features in the dataset by displaying unique values for categorical columns.

In [None]:
# your code goes here

<details>
    <summary>Click here for a hint</summary>
    
Use df.select_dtypes(include=['object', 'category']) to find categorical columns. Refer to Step 2 of this lab.

</details>

<details>
    <summary>Click here for solution</summary>

```python
# Check unique values for categorical features
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
print("Unique values for categorical features:")
for col in categorical_cols:
    print(f"{col}: {df[col].unique()}")
```

</details>

## Exercise 5: Prepare and standardize the data

Select the features for clustering (age, avg_glucose_level, bmi) and standardize them using StandardScaler.

In [None]:
# your code goes here

<details>
    <summary>Click here for a hint</summary>
    
First, select the three numeric columns into a variable X. Then use StandardScaler().fit_transform() to standardize. Refer to Step 3 of this lab.

</details>

<details>
    <summary>Click here for solution</summary>

```python
# Select features for clustering
X = df[['age', 'avg_glucose_level', 'bmi']]

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Data standardized successfully!")
```

</details>

## Exercise 6: Apply K-means with k=2

Apply K-means clustering with 2 clusters and add the cluster labels to the original DataFrame.

In [None]:
# your code goes here

<details>
    <summary>Click here for a hint</summary>
    
Create a KMeans object with n_clusters=2 and random_state=42. Use fit_predict() on the scaled data and store the results in a new column called 'cluster'. Refer to Step 4 of this lab.

</details>

<details>
    <summary>Click here for solution</summary>

```python
# Apply K-means with k=2
kmeans = KMeans(n_clusters=2, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)

print("K-means clustering complete!")
print("\nCluster distribution:")
print(df['cluster'].value_counts())
```

</details>

## Exercise 7: Visualize the clusters

Create a scatter plot showing the clusters with avg_glucose_level on the x-axis and bmi on the y-axis.

In [None]:
# your code goes here

<details>
    <summary>Click here for a hint</summary>
    
Use sns.scatterplot() with x='avg_glucose_level', y='bmi', and hue='cluster'. Refer to Step 6 of this lab.

</details>

<details>
    <summary>Click here for solution</summary>

```python
# Visualize clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(
    x='avg_glucose_level',
    y='bmi',
    hue='cluster',
    data=df,
    palette='Set1',
    s=80
)
plt.title('K-Means Clustering of Stroke Risk Groups')
plt.xlabel('Average Glucose Level')
plt.ylabel('BMI')
plt.legend(title='Cluster')
plt.show()
```

</details>

# Congratulations!

You have successfully completed this lab on K-means clustering for patient segmentation! You learned how to preprocess healthcare data, apply K-means clustering to identify distinct patient subgroups, analyze cluster characteristics, and visualize results using PCA. These skills are valuable for population health research and can help healthcare providers develop targeted interventions and personalized treatment strategies.

## Authors

Ramesh Sannareddy

Copyright © 2025 SkillUp. All rights reserved.