# Visualizing Patient Populations with Dimensionality Reduction

Time estimate: **20** minutes


## Objectives
After completing this lab, you will be able to:
- Explain why dimensionality reduction is useful in healthcare analytics.
- Prepare patient-level features for visualization.
- Apply dimensionality reduction techniques to clinical data.
- Visualize patient populations in two dimensions.
- Interpret visual patterns as potential clinical subgroups.



## What you will do in this lab

In this lab, you will prepare clinical features and apply dimensionality reduction to visualize patient populations in two dimensions.

You will:

- Review a patient-level feature dataset with multiple clinical variables.
- Prepare data so patients can be compared fairly.
- Reduce complex clinical data into two visual dimensions.
- Create visual plots of patient populations.
- Explore how patient clusters appear in reduced space.
- Reflect on what these visualizations can and cannot tell you clinically.



## Overview
Healthcare datasets often contain many variables describing patients, such as
lab values, utilization measures, and condition indicators.
While these variables are meaningful individually, they are difficult
to reason about collectively.

Dimensionality reduction helps **compress complex patient information**
into a small number of dimensions that can be visualized.
These visualizations support clinical intuition, hypothesis generation,
and communication with non-technical stakeholders.

The goal of this lab is **understanding and interpretation**, not mathematical depth.



## About the dataset/environment
You will work with a **synthetic, de-identified patient-level dataset**. Each row represents a patient,
and each column represents a clinical feature such as:
- Average lab values
- Encounter frequency
- Chronic condition indicators

This dataset is appropriate for exploratory visualization.


## Setup

In [None]:
# This cell prepares the environment and loads the dataset used in this lab.
# All data is synthetic and safe for instructional purposes.

# Import pandas for data manipulation
import pandas as pd


# Import tools for scaling and dimensionality reduction
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Import matplotlib for plotting
import matplotlib.pyplot as plt


# Load a synthetic patient-level feature dataset

patient_features = pd.read_csv("https://machine-learning-for-healthcare-applications-f276df.gitlab.io/labs/lab3/patient_features.csv")

# Display the dataset
patient_features



## Step 1: Review patient feature data

You will begin by reviewing the patient features that will be visualized.
This helps you understand what each variable represents and whether
it is appropriate for exploratory analysis.

**Why this matters in healthcare:** Visualizations are only meaningful
if the underlying features make clinical sense.


In [None]:
# Display basic information about the dataset
patient_features.info()

# Display summary statistics to understand typical values
patient_features.describe()
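
Beyond `info()` and `describe()`, two quick checks are often worth running before scaling: how many values are missing, and which columns are numeric. The sketch below uses a small hypothetical table; the column names (`patient_id`, `avg_glucose`, `encounter_count`, `has_diabetes`) are assumptions made for illustration and may not match the lab dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: these column names are illustrative and are not
# taken from the lab's patient_features.csv file.
toy_patients = pd.DataFrame({
    "patient_id": ["P1", "P2", "P3", "P4"],
    "avg_glucose": [95.0, 140.0, np.nan, 110.0],
    "encounter_count": [2, 9, 4, 1],
    "has_diabetes": [0, 1, 0, 0],
})

# Count missing values in each column
missing_counts = toy_patients.isna().sum()
print(missing_counts)

# List the numeric columns, since scaling and PCA expect numeric input
numeric_columns = toy_patients.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```

If the same checks on `patient_features` reveal identifier or text columns, those would likely need to be set aside before the scaling step.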



## Step 2: Consider feature scale and comparability

Before visualization, it is important to recognize that different clinical
features are measured on different scales. For example, lab values might be
much larger in magnitude than binary condition indicators.

**Why this matters in healthcare:** Without adjustment, some features might
dominate visual patterns simply because of their units.


In [None]:
# Display the raw feature values to illustrate scale differences
patient_features
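
Scale differences can be easier to see as numbers than by scanning raw rows. As a minimal sketch, assuming a toy table with hypothetical column names, the code below compares the range of each feature.

```python
import pandas as pd

# Hypothetical toy data (illustrative column names, not the lab dataset)
toy_patients = pd.DataFrame({
    "avg_glucose": [95.0, 140.0, 180.0, 110.0],
    "encounter_count": [2, 9, 4, 1],
    "has_diabetes": [0, 1, 1, 0],
})

# Range (max minus min) of each feature
feature_ranges = toy_patients.max() - toy_patients.min()
print(feature_ranges)

# The feature with the widest range would dominate unscaled distances
widest_feature = feature_ranges.idxmax()
print(widest_feature)
```

In this toy example the lab value spans far more units than the binary indicator, which is the imbalance that scaling is meant to remove.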



## Step 3: Scale features for fair comparison

You will now standardize the features so that each one has a mean of 0 and
a standard deviation of 1 and contributes on a comparable scale
to the dimensionality reduction process.

**Why this matters in healthcare:** Scaling ensures that visual patterns
reflect patient similarity, not measurement units.


In [None]:
# Create a scaler to standardize features
scaler = StandardScaler()

# Fit the scaler to the data and transform it
scaled_features = scaler.fit_transform(patient_features)

# Display the scaled feature values
scaled_features
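
To make the effect of `StandardScaler` concrete, the sketch below compares its output with a manual z-score, (value minus column mean) divided by column standard deviation, on a small made-up array. The numbers are illustrative only and are not from the lab dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: rows are patients, columns are features
X = np.array([
    [95.0, 2.0, 0.0],
    [140.0, 9.0, 1.0],
    [180.0, 4.0, 1.0],
    [110.0, 1.0, 0.0],
])

# Standardize with scikit-learn
scaled = StandardScaler().fit_transform(X)

# Manual z-score using the population standard deviation (NumPy's default)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# Check that both approaches agree, and inspect the column means and spreads
print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

Note that `fit_transform` returns a NumPy array rather than a DataFrame, so the column names are no longer attached to the scaled values.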



## Introducing dimensionality reduction

Dimensionality reduction techniques transform high-dimensional data
into a smaller number of dimensions while preserving important structure.

In this lab, you will explore two common techniques:
- Principal component analysis (PCA)
- t-distributed stochastic neighbor embedding (t-SNE)

**Why this matters in healthcare:** These techniques help clinicians
visually explore patient populations.



## Step 4: Apply PCA

PCA identifies directions in the data that capture the most variation.
You will reduce the patient data to two principal components
so that it can be visualized on a 2D plot.

**Why this matters in healthcare:** PCA provides a stable, interpretable
overview of population-level variation.


In [None]:
# Create a PCA model that reduces data to two components
pca = PCA(n_components=2)

# Fit PCA to the scaled data and transform it
pca_components = pca.fit_transform(scaled_features)

# Convert the result into a DataFrame for easier handling
pca_df = pd.DataFrame(
    pca_components,
    columns=["PC1", "PC2"]
)

# Display the PCA-transformed data
pca_df
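
Two attributes of a fitted PCA model can help with interpretation: `explained_variance_ratio_` reports the share of total variance each component captures, and `components_` holds the weight of each original feature in each component. The following is a sketch on synthetic data; the feature names and the hidden `severity` factor are assumptions invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: a hidden "severity" factor drives two of the three features
rng = np.random.default_rng(0)
n_patients = 200
severity = rng.normal(size=n_patients)
toy_patients = pd.DataFrame({
    "avg_glucose": 110 + 25 * severity + rng.normal(scale=5, size=n_patients),
    "encounter_count": 5 + 2 * severity + rng.normal(scale=0.5, size=n_patients),
    "avg_heart_rate": rng.normal(loc=75, scale=8, size=n_patients),
})

# Scale, then fit a two-component PCA
scaled = StandardScaler().fit_transform(toy_patients)
pca_model = PCA(n_components=2).fit(scaled)

# Share of total variance captured by each component
print(pca_model.explained_variance_ratio_)

# Weight of each original feature in each component
loadings = pd.DataFrame(
    pca_model.components_,
    columns=toy_patients.columns,
    index=["PC1", "PC2"],
)
print(loadings)
```

A low combined explained variance would suggest that the 2D plot leaves out much of the variation between patients, which is worth keeping in mind when reading it.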



## Step 5: Visualize patients using PCA

You will now create a scatter plot where each point represents a patient.
Patients that appear close together tend to have similar clinical profiles
based on the original features, to the extent that the two components
capture the variation in the data.

**Why this matters in healthcare:** Visual proximity can suggest
shared phenotypes or care needs.


In [None]:
# Create a scatter plot of the PCA results
plt.figure()

plt.scatter(pca_df["PC1"], pca_df["PC2"])

# Label the axes for clarity
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")

# Add a title to the plot
plt.title("Patient Population Visualization using PCA")

# Display the plot
plt.show()



## Reflecting on PCA visualization

At this point, pause and visually inspect the plot.
Look for:
- Groups of patients that appear close together
- Patients that appear isolated
- Overall spread of the population

**Why this matters in healthcare:** Visual reflection supports
clinical intuition and hypothesis generation.
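
Visual inspection can be complemented by a simple numeric check. As one possible sketch, the code below ranks patients by their distance from the center of the point cloud to flag candidates that look isolated. The coordinates are hypothetical values chosen for illustration, not results from the lab dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical 2D coordinates for five patients (illustrative values)
pca_toy = pd.DataFrame({
    "PC1": [0.1, -0.3, 0.2, 4.0, -0.1],
    "PC2": [0.2, 0.1, -0.4, 3.5, 0.0],
})

# Center of the point cloud
center = pca_toy[["PC1", "PC2"]].mean()

# Euclidean distance of each patient from the center
offsets = pca_toy[["PC1", "PC2"]] - center
pca_toy["distance"] = np.sqrt((offsets ** 2).sum(axis=1))

# Patients sorted from most to least distant
print(pca_toy.sort_values("distance", ascending=False))

# Row index of the most isolated patient
most_isolated = pca_toy["distance"].idxmax()
print(most_isolated)
```

A large distance only marks a patient as unusual in the reduced space; whether that is clinically meaningful still requires looking at the original features.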



## Step 6: Apply t-SNE for nonlinear visualization

t-SNE is a technique that emphasizes local neighborhood structure.
It is often used to highlight clusters that may not be obvious with PCA.

**Why this matters in healthcare:** t-SNE can reveal subtle subgroupings,
but it must be interpreted cautiously.


In [None]:
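# Create a t-SNE model to reduce data to two dimensions
# perplexity is left at its default of 30; recent scikit-learn versions
# require it to be smaller than the number of patients
tsne = TSNE(n_components=2, random_state=42)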

# Fit and transform the scaled data
tsne_components = tsne.fit_transform(scaled_features)

# Convert results into a DataFrame
tsne_df = pd.DataFrame(
    tsne_components,
    columns=["Dim1", "Dim2"]
)

# Display the t-SNE-transformed data
tsne_df
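
One reason t-SNE needs cautious interpretation is its sensitivity to the `perplexity` setting, which roughly controls how many neighbors each point attends to. The sketch below runs t-SNE with two perplexity values on synthetic blob data from `make_blobs`; the data, sample size, and perplexity values are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in for scaled patient features (not the lab dataset)
X, _ = make_blobs(n_samples=60, centers=3, n_features=5, random_state=0)

# Run t-SNE with a small and a larger perplexity
embeddings = {}
for perplexity in (5, 20):
    tsne_model = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    embeddings[perplexity] = tsne_model.fit_transform(X)
    print(perplexity, embeddings[perplexity].shape)
```

Plotting the two embeddings would typically show different layouts for the same patients, so a pattern that appears at only one perplexity value deserves skepticism.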



## Step 7: Visualize patients using t-SNE

You will now visualize the patient population using the t-SNE results.
The t-SNE plot often appears more clustered than the PCA plot.

**Why this matters in healthcare:** Such plots are useful for exploration,
not for definitive conclusions.


In [None]:
# Create a scatter plot of the t-SNE results
plt.figure()

plt.scatter(tsne_df["Dim1"], tsne_df["Dim2"])

# Label axes
plt.xlabel("t-SNE Dimension 1")
plt.ylabel("t-SNE Dimension 2")

# Add title
plt.title("Patient Population Visualization using t-SNE")

# Display the plot
plt.show()



## Comparing PCA and t-SNE visualizations

Compare the two visualizations:
- PCA emphasizes global variation
- t-SNE emphasizes local neighborhoods

**Why this matters in healthcare:** Different techniques answer
different clinical questions.
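
Placing both embeddings next to each other can make the comparison easier. The following sketch draws the two plots side by side using synthetic blob data as a stand-in for patient features; the data and the `perplexity=15` setting are assumptions for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for patient features (not the lab dataset)
X, _ = make_blobs(n_samples=90, centers=3, n_features=6, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Compute both 2D embeddings from the same scaled data
pca_xy = PCA(n_components=2).fit_transform(X_scaled)
tsne_xy = TSNE(n_components=2, perplexity=15, random_state=42).fit_transform(X_scaled)

# Draw the two embeddings in adjacent panels
fig, (ax_pca, ax_tsne) = plt.subplots(1, 2, figsize=(10, 4))

ax_pca.scatter(pca_xy[:, 0], pca_xy[:, 1])
ax_pca.set_title("PCA")
ax_pca.set_xlabel("Principal Component 1")
ax_pca.set_ylabel("Principal Component 2")

ax_tsne.scatter(tsne_xy[:, 0], tsne_xy[:, 1])
ax_tsne.set_title("t-SNE")
ax_tsne.set_xlabel("t-SNE Dimension 1")
ax_tsne.set_ylabel("t-SNE Dimension 2")

# With inline plotting in a notebook, the figure renders when the cell
# finishes; in a script, call plt.show() afterwards.
fig.tight_layout()
```

The axes of the two panels are not comparable: PCA axes are linear combinations of the features, while t-SNE axes have no direct meaning.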



## Understanding the limitations of visual interpretation

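Dimensionality reduction can be misleading if over-interpreted.
A two-dimensional plot discards information, so distances on plots do not always correspond to clinical importance.
In t-SNE plots in particular, the distance between clusters and the apparent size of a cluster are not reliable.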

**Why this matters in healthcare:** Clinical decisions should not
be based solely on visual patterns.
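
One concrete way to see this limitation is to rerun each method and compare the coordinates. The sketch below, on synthetic blob data, fits PCA twice and t-SNE with two different seeds. It sets `init="random"` explicitly so that the seed controls the starting layout; that setting and the data are assumptions for illustration and differ from the lab's t-SNE call.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in for scaled patient features (not the lab dataset)
X, _ = make_blobs(n_samples=60, centers=3, n_features=5, random_state=0)

# PCA with the exact solver gives the same coordinates on every run
pca_a = PCA(n_components=2, svd_solver="full").fit_transform(X)
pca_b = PCA(n_components=2, svd_solver="full").fit_transform(X)

# t-SNE with random initialization depends on the seed
tsne_a = TSNE(n_components=2, perplexity=10, init="random",
              random_state=0).fit_transform(X)
tsne_b = TSNE(n_components=2, perplexity=10, init="random",
              random_state=1).fit_transform(X)

# Compare coordinates between runs
pca_matches = np.allclose(pca_a, pca_b)
tsne_matches = np.allclose(tsne_a, tsne_b)
print(pca_matches, tsne_matches)
```

Because a patient's position in a t-SNE plot can change from run to run, the exact coordinates carry no clinical meaning on their own; only groupings that persist across runs are worth following up.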


## Exercises

Use the dataset at https://machine-learning-for-healthcare-applications-f276df.gitlab.io/labs/lab3/patient_features_exercise_dataset.csv to practice reducing dimensions with PCA and t-SNE. Scale the features first, then visualize both results.

## Congratulations!

You have successfully completed this lab on visualizing patient populations using dimensionality reduction. You practiced translating high-dimensional clinical data into intuitive visualizations that support exploration, pattern recognition, and clinical interpretation.

## Authors
Ramesh Sannareddy

<br>

© SkillUp. All rights reserved.

Materials may not be reproduced in whole or in part without written permission from SkillUp.
