<a href="https://colab.research.google.com/github/ccorbett0116/Fall2025ResearchProject/blob/main/Research_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Title:
# Authors: Jose Henriquez, Cole Corbett
## Description:
The deployment of medical AI systems across different hospitals raises critical questions about whether fairness and representation quality can be reliably transferred across clinical domains. Models trained on one hospital’s imaging data are often reused in new environments where patient demographics, imaging devices, and diagnostic practices differ substantially, potentially resulting in unintended bias against certain groups. This project investigates this challenge by studying fairness-aware representation alignment in medical imaging. The student will train contrastive learning models—such as SimCLR—independently on two large-scale chest X-ray datasets: CheXpert (from Stanford Hospital) and MIMIC-CXR (from Beth Israel Deaconess Medical Center). After learning embeddings in each domain, the student will apply domain alignment techniques such as Procrustes alignment to map representations from the CheXpert embedding space into the MIMIC-CXR space. The aligned embeddings will then be evaluated using fairness metrics designed for representation spaces, including demographic subgroup alignment, intra- vs. inter-group embedding disparity, and cluster-level demographic parity. The expected outcome is a rigorous understanding of whether fairness properties learned in one hospital setting preserve, degrade, or improve when transferred to another, revealing how robust model fairness is to realworld clinical domain shifts. A practical use case involves a healthcare network seeking to deploy a model trained at a major academic hospital (e.g., Stanford) into a community hospital setting: this project helps determine whether the transferred representations remain equitable across patient groups such as older adults, women, or specific disease cohorts. The findings will support responsible AI deployment in healthcare by highlighting the conditions under which fairness is stable across institutions and identifying scenarios where domain-specific mitigation strategies may be required.

In [None]:
#Process is probably different on colab, this is hyperspecific to me because I'm working on Pycharm connected to my WSL
import sys
!{sys.executable} -m pip install kagglehub polars
#We're going to use polars because it's significantly faster, it's build on rust and enables multi-threaded processing as well as some memory optimizations over pandas.

'c:\Users\joseh\Desktop\Masters' is not recognized as an internal or external command,
operable program or batch file.


In [None]:
#Again, this is probably different on colab
import kagglehub
path_chexpert = kagglehub.dataset_download("mimsadiislam/chexpert")
print("Path to chexpert dataset files:", path_chexpert)
path_mimic = kagglehub.dataset_download("simhadrisadaram/mimic-cxr-dataset")
print("Path to mimic dataset files:", path_mimic)

Path to chexpert dataset files: C:\Users\joseh\.cache\kagglehub\datasets\mimsadiislam\chexpert\versions\1
Path to mimic dataset files: C:\Users\joseh\.cache\kagglehub\datasets\simhadrisadaram\mimic-cxr-dataset\versions\2


In [None]:

import os
os.listdir(path_mimic)
os.makedirs("./checkpoints", exist_ok=True)
os.makedirs("./embeddings", exist_ok=True)
os.makedirs("./logs", exist_ok=True)

In [26]:
import polars as pl
import os

dir_chexpert = os.path.join(path_chexpert, "CheXpert-v1.0-small")
dir_mimic = path_mimic

train_csv_chexpert = os.path.join(dir_chexpert, "train.csv")
train_csv_mimic = os.path.join(dir_mimic, "mimic_cxr_aug_train.csv")
valid_csv_chexpert = os.path.join(dir_chexpert, "valid.csv")
valid_csv_mimic = os.path.join(dir_mimic, "mimic_cxr_aug_validate.csv")

df_train_chexpert = pl.read_csv(train_csv_chexpert)
df_train_mimic = pl.read_csv(train_csv_mimic)
df_valid_chexpert = pl.read_csv(valid_csv_chexpert)
df_valid_mimic = pl.read_csv(valid_csv_mimic)

In [None]:
import shutil

# Example: move images for CheXpert
os.makedirs("./CheXpert/train", exist_ok=True)
os.makedirs("./CheXpert/val", exist_ok=True)

for row in df_train_chexpert.iter_rows():
    src = os.path.join(dir_chexpert, row[0])  # Assuming first column is 'Path'
    dst = os.path.join("./CheXpert/train", os.path.basename(src))
    if os.path.exists(src):
        shutil.copy(src, dst)

for row in df_valid_chexpert.iter_rows():
    src = os.path.join(dir_chexpert, row[0])
    dst = os.path.join("./CheXpert/val", os.path.basename(src))
    if os.path.exists(src):
        shutil.copy(src, dst)

# Example: move images for MIMIC-CXR
os.makedirs("./MIMIC-CXR/train", exist_ok=True)     
os.makedirs("./MIMIC-CXR/val", exist_ok=True)
for row in df_train_mimic.iter_rows():
    src = os.path.join(dir_mimic, row[0])  # Assuming first column is 'Path'
    dst = os.path.join("./MIMIC-CXR/train", os.path.basename(src))
    if os.path.exists(src):
        shutil.copy(src, dst)

for row in df_valid_mimic.iter_rows():
    src = os.path.join(dir_mimic, row[0])
    dst = os.path.join("./MIMIC-CXR/val", os.path.basename(src))
    if os.path.exists(src):
        shutil.copy(src, dst)

In [27]:
df_train_chexpert.head()

Path,Sex,Age,Frontal/Lateral,AP/PA,No Finding,Enlarged Cardiomediastinum,Cardiomegaly,Lung Opacity,Lung Lesion,Edema,Consolidation,Pneumonia,Atelectasis,Pneumothorax,Pleural Effusion,Pleural Other,Fracture,Support Devices
str,str,i64,str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""CheXpert-v1.0-small/train/pati…","""Female""",68,"""Frontal""","""AP""",1.0,,,,,,,,,0.0,,,,1.0
"""CheXpert-v1.0-small/train/pati…","""Female""",87,"""Frontal""","""AP""",,,-1.0,1.0,,-1.0,-1.0,,-1.0,,-1.0,,1.0,
"""CheXpert-v1.0-small/train/pati…","""Female""",83,"""Frontal""","""AP""",,,,1.0,,,-1.0,,,,,,1.0,
"""CheXpert-v1.0-small/train/pati…","""Female""",83,"""Lateral""",,,,,1.0,,,-1.0,,,,,,1.0,
"""CheXpert-v1.0-small/train/pati…","""Male""",41,"""Frontal""","""AP""",,,,,,1.0,,,,0.0,,,,


In [28]:
df_train_mimic.head()

Unnamed: 0.1,Unnamed: 0,subject_id,image,view,AP,PA,Lateral,text,text_augment
i64,i64,i64,str,str,str,str,str,str,str
0,0,10000032,"""['files/p10/p10000032/s5041426…","""['PA', 'LATERAL', 'AP']""","""['files/p10/p10000032/s5391176…","""['files/p10/p10000032/s5041426…","""['files/p10/p10000032/s5041426…","""['Findings: There is no focal …","""['Findings: There is no focus,…"
1,1,10000764,"""['files/p10/p10000764/s5737596…","""['AP', 'LATERAL']""","""['files/p10/p10000764/s5737596…","""[]""","""['files/p10/p10000764/s5737596…","""['Findings: PA and lateral vie…","""['Finds: PA and lateral view o…"
2,2,10000898,"""['files/p10/p10000898/s5077138…","""['LATERAL', 'PA']""","""[]""","""['files/p10/p10000898/s5077138…","""['files/p10/p10000898/s5077138…","""['Findings: PA and lateral vie…","""['Finds: PA and side view of t…"
3,3,10000935,"""['files/p10/p10000935/s5057897…","""['AP', 'LATERAL', 'LL', 'PA']""","""['files/p10/p10000935/s5057897…","""['files/p10/p10000935/s5569729…","""['files/p10/p10000935/s5117837…","""['Findings: Lung volumes remai…","""['Results: Pulmonary volumes r…"
4,4,10000980,"""['files/p10/p10000980/s5098509…","""['PA', 'LL', 'AP', 'LATERAL']""","""['files/p10/p10000980/s5196728…","""['files/p10/p10000980/s5098509…","""['files/p10/p10000980/s5457736…","""['Findings: Impression: Compa…","""['Findings: Impression: Compar…"


In [29]:
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121


Looking in indexes: https://download.pytorch.org/whl/cu121



[notice] A new release of pip is available: 25.0.1 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip


In [None]:
!pip install tensorboard


: 

Training with solo learn

In [None]:


# MIMIC
python -m solo.main_pretrain \
  --method simclr \
  --dataset custom \
  --data_dir /path/to/MIMIC-CXR \
  --train_dir train \
  --val_dir val \
  --encoder resnet50 \
  --batch_size 128 \
  --max_epochs 50 \
  --num_workers 8 \
  --gpus 1 \
  --precision 16 \
  --logger tensorboard \
  --log_dir ./logs \
  --project mimic_ssl \
  --name simclr_mimic

# CheXpert
python -m solo.main_pretrain \
  --method simclr \
  --dataset custom \
  --data_dir /path/to/chesXpert \
  --train_dir train \
  --val_dir val \
  --encoder resnet50 \
  --batch_size 128 \
  --max_epochs 50 \
  --num_workers 8 \
  --gpus 1 \
  --precision 16 \
  --logger tensorboard \
  --log_dir ./logs \
  --project mimic_ssl \
  --name simclr_chexpert


Embedding extraction

In [None]:
# CheXpert
python -m solo.main_linear \
  --backbone resnet50 \
  --pretrained_feature_extractor ./checkpoints/simclr_chexpert.ckpt \
  --extract_features \
  --data_dir /path/to/CheXpert \
  --train_dir train \
  --output_features simclr_chexpert_embeddings.pt

# MIMIC
python -m solo.main_linear \
  --backbone resnet50 \
  --pretrained_feature_extractor ./checkpoints/simclr_mimic.ckpt \
  --extract_features \
  --data_dir /path/to/MIMIC-CXR \
  --train_dir train \
  --output_features simclr_mimic_embeddings.pt


In [None]:
import torch

# Load embeddings
chexpert_emb = torch.load("./embeddings/simclr_chexpert_embeddings.pt")
mimic_emb = torch.load("./embeddings/simclr_mimic_embeddings.pt")

# Procrustes alignment
def procrustes_alignment(X, Y):
    U, _, Vt = torch.linalg.svd(Y.T @ X)
    R = U @ Vt
    return X @ R.T, R

chexpert_aligned, R = procrustes_alignment(chexpert_emb, mimic_emb)
torch.save(chexpert_aligned, "./embeddings/chexpert_aligned.pt")

# Example: subgroup distance with Polars
def subgroup_centroid_distance(embeddings, demographics_col):
    groups = demographics_col.unique().to_list()
    centroids = {}
    for g in groups:
        mask = demographics_col == g
        emb_group = embeddings[mask.to_numpy()]
        centroids[g] = emb_group.mean(dim=0)
    dist = {}
    for g1 in groups:
        for g2 in groups:
            dist[(g1, g2)] = torch.norm(centroids[g1] - centroids[g2]).item()
    return dist

demographics = df_train_chexpert["Sex"]  # or "Gender" column
dists = subgroup_centroid_distance(chexpert_aligned, demographics)
print(dists)


In [None]:
%load_ext tensorboard
%tensorboard --logdir ./logs
