<a href="https://colab.research.google.com/github/ccorbett0116/Fall2025ResearchProject/blob/main/Research_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Title:
# Authors: Jose Henriquez, Cole Corbett
## Description:
The deployment of medical AI systems across different hospitals raises critical questions about whether fairness and representation quality can be reliably transferred across clinical domains. Models trained on one hospital’s imaging data are often reused in new environments where patient demographics, imaging devices, and diagnostic practices differ substantially, potentially resulting in unintended bias against certain groups. This project investigates this challenge by studying fairness-aware representation alignment in medical imaging. The student will train contrastive learning models—such as SimCLR—independently on two large-scale chest X-ray datasets: CheXpert (from Stanford Hospital) and MIMIC-CXR (from Beth Israel Deaconess Medical Center). After learning embeddings in each domain, the student will apply domain alignment techniques such as Procrustes alignment to map representations from the CheXpert embedding space into the MIMIC-CXR space. The aligned embeddings will then be evaluated using fairness metrics designed for representation spaces, including demographic subgroup alignment, intra- vs. inter-group embedding disparity, and cluster-level demographic parity. The expected outcome is a rigorous understanding of whether fairness properties learned in one hospital setting preserve, degrade, or improve when transferred to another, revealing how robust model fairness is to realworld clinical domain shifts. A practical use case involves a healthcare network seeking to deploy a model trained at a major academic hospital (e.g., Stanford) into a community hospital setting: this project helps determine whether the transferred representations remain equitable across patient groups such as older adults, women, or specific disease cohorts. The findings will support responsible AI deployment in healthcare by highlighting the conditions under which fairness is stable across institutions and identifying scenarios where domain-specific mitigation strategies may be required.

In [11]:
#Process is probably different on colab, this is hyperspecific to me because I'm working on Pycharm connected to my WSL
import sys
import os
!{sys.executable} -m pip install kagglehub polars
#We're going to use polars because it's significantly faster, it's build on rust and enables multi-threaded processing as well as some memory optimizations over pandas.



In [12]:
#Again, this is probably different on colab
import kagglehub
path_chexpert = kagglehub.dataset_download("mimsadiislam/chexpert")
print("Path to chexpert dataset files:", path_chexpert)
path_mimic = kagglehub.dataset_download("simhadrisadaram/mimic-cxr-dataset")
print("Path to mimic dataset files:", path_mimic)

  from .autonotebook import tqdm as notebook_tqdm


Path to chexpert dataset files: C:\Users\joseh\.cache\kagglehub\datasets\mimsadiislam\chexpert\versions\1
Path to mimic dataset files: C:\Users\joseh\.cache\kagglehub\datasets\simhadrisadaram\mimic-cxr-dataset\versions\2


In [22]:


os.makedirs("./CheXpert/train", exist_ok=True)
os.makedirs("./CheXpert/val", exist_ok=True)
os.makedirs("./MIMIC-CXR/train", exist_ok=True)
os.makedirs("./MIMIC-CXR/val", exist_ok=True)
os.makedirs("./checkpoints", exist_ok=True)
os.makedirs("./embeddings", exist_ok=True)
os.makedirs("./logs", exist_ok=True)

In [23]:
import polars as pl


dir_chexpert = os.path.join(path_chexpert, "CheXpert-v1.0-small")
dir_mimic = path_mimic

train_csv_chexpert = os.path.join(dir_chexpert, "train.csv")
train_csv_mimic = os.path.join(dir_mimic, "mimic_cxr_aug_train.csv")
valid_csv_chexpert = os.path.join(dir_chexpert, "valid.csv")
valid_csv_mimic = os.path.join(dir_mimic, "mimic_cxr_aug_validate.csv")

df_train_chexpert = pl.read_csv(train_csv_chexpert)
df_train_mimic = pl.read_csv(train_csv_mimic)
df_valid_chexpert = pl.read_csv(valid_csv_chexpert)
df_valid_mimic = pl.read_csv(valid_csv_mimic)

In [24]:
from tqdm import tqdm
import shutil

In [25]:

def copy_images_polars(df, source_dir, target_dir, path_column="Path", dataset_name="Dataset"):
    """
    Copies images from source_dir to target_dir based on the paths in Polars DataFrame.
    Preserves patient/study folder structure to avoid filename collisions.
    """
    copied = 0
    skipped = 0
    print(f"Copying {dataset_name} images to {target_dir}...")
    
    for row in tqdm(df.iter_rows(), total=len(df)):
        if path_column in df.columns:
            idx = df.columns.index(path_column)
            rel_path = row[idx]
        else:
            rel_path = row[0]
        
        # Fix: remove duplicated base folder for CheXpert
        if dataset_name.startswith("CheXpert"):
            rel_path_fixed = rel_path.replace("CheXpert-v1.0-small/", "")
        else:
            rel_path_fixed = rel_path
        
        src = os.path.join(source_dir, rel_path_fixed)
        
        # PRESERVE STRUCTURE: Keep last 2 folder levels (patient/study)
        path_parts = Path(rel_path_fixed).parts
        if len(path_parts) >= 3:
            # Keep patient/study folders
            subfolder = os.path.join(path_parts[-3], path_parts[-2])
        else:
            subfolder = ""
        
        # Create subfolder structure in target
        target_subfolder = os.path.join(target_dir, subfolder)
        os.makedirs(target_subfolder, exist_ok=True)
        
        dst = os.path.join(target_subfolder, os.path.basename(rel_path_fixed))
        
        if os.path.exists(src):
            shutil.copy2(src, dst)
            copied += 1
        else:
            skipped += 1
            if skipped <= 5:
                print(f"Warning: Source file not found: {src}")
    
    print(f"Completed: {copied} files copied, {skipped} files skipped\n")
    return copied, skipped

copy_images_polars(df_train_chexpert, dir_chexpert, "./CheXpert/train", path_column="Path", dataset_name="CheXpert Train")
copy_images_polars(df_valid_chexpert, dir_chexpert, "./CheXpert/val", path_column="Path", dataset_name="CheXpert Val")

Copying CheXpert Train images to ./CheXpert/train...


100%|██████████| 223414/223414 [03:51<00:00, 963.76it/s] 


Completed: 223414 files copied, 0 files skipped

Copying CheXpert Val images to ./CheXpert/val...


100%|██████████| 234/234 [00:00<00:00, 767.06it/s]

Completed: 234 files copied, 0 files skipped






(234, 0)

In [46]:
import os
import shutil
import ast
from tqdm import tqdm

# ============================================================================
# MIMIC Image Copy Function - Flat Structure (matches CheXpert style)
# ============================================================================

def find_mimic_image_root(dataset_root):
    """Find where MIMIC images are actually stored"""
    possible_roots = [
        "official_data_iccv_final",
        "files",
        "images",
        "mimic-cxr",
    ]
    
    for root in possible_roots:
        test_path = os.path.join(dataset_root, root)
        if os.path.exists(test_path):
            return test_path
    
    return dataset_root


def copy_mimic_images_polars(df, source_dir, target_dir, path_column="image", dataset_name="MIMIC"):
    """
    Copies MIMIC images from source_dir to target_dir (flat structure).
    
    Args:
        df: Polars DataFrame with image paths
        source_dir: Root directory of MIMIC dataset
        target_dir: Where to copy images
        path_column: Column name containing image paths (default: "image")
        dataset_name: Name for logging
    """
    os.makedirs(target_dir, exist_ok=True)
    
    copied = 0
    skipped = 0
    parse_errors = 0
    
    # Find where images are actually stored
    images_root = find_mimic_image_root(source_dir)
    print(f"Using MIMIC image root: {images_root}")
    print(f"Copying {dataset_name} images to {target_dir}...")
    
    for row in tqdm(df.iter_rows(), total=len(df)):
        # Get the image column value
        if path_column in df.columns:
            idx = df.columns.index(path_column)
            image_value = row[idx]
        else:
            image_value = row[0]
        
        # Parse the list of image paths
        try:
            # MIMIC stores image paths as string representation of list
            if isinstance(image_value, str):
                image_list = ast.literal_eval(image_value)
            elif isinstance(image_value, list):
                image_list = image_value
            else:
                parse_errors += 1
                continue
        except Exception as e:
            parse_errors += 1
            if parse_errors <= 5:
                print(f"Warning: Could not parse image paths: {image_value[:100]}...")
            continue
        
        # Process each image in the list
        for img_path in image_list:
            # Clean the path
            img_path = img_path.lstrip("/\\").replace("\\", "/")
            
            # Try multiple path combinations
            path_attempts = [
                img_path,
                img_path.replace("files/", ""),  # Remove 'files/' prefix
                os.path.join("files", img_path.replace("files/", "")),  # Ensure 'files/' prefix
            ]
            
            found = False
            for attempt in path_attempts:
                src = os.path.join(images_root, attempt)
                
                if os.path.exists(src):
                    # Copy to target directory
                    dst = os.path.join(target_dir, os.path.basename(src))
                    
                    # Handle potential duplicate filenames
                    if os.path.exists(dst):
                        # Add counter to filename
                        base_name, ext = os.path.splitext(os.path.basename(src))
                        counter = 1
                        while os.path.exists(os.path.join(target_dir, f"{base_name}_{counter}{ext}")):
                            counter += 1
                        dst = os.path.join(target_dir, f"{base_name}_{counter}{ext}")
                    
                    shutil.copy2(src, dst)
                    copied += 1
                    found = True
                    break
            
            if not found:
                skipped += 1
                if skipped <= 5:
                    print(f"Warning: Source file not found: {img_path}")
    
    print(f"Completed: {copied} files copied, {skipped} files skipped, {parse_errors} parse errors\n")
    return copied, skipped


# =========================
# 4️⃣ Copy MIMIC images
# =========================
copy_mimic_images_polars(df_train_mimic, path_mimic, "./MIMIC-CXR/train", 
                         path_column="image", dataset_name="MIMIC Train")
copy_mimic_images_polars(df_valid_mimic, path_mimic, "./MIMIC-CXR/val", 
                         path_column="image", dataset_name="MIMIC Validation")

Using MIMIC image root: C:\Users\joseh\.cache\kagglehub\datasets\simhadrisadaram\mimic-cxr-dataset\versions\2\official_data_iccv_final
Copying MIMIC Train images to ./MIMIC-CXR/train...


  0%|          | 12/64586 [00:00<10:24, 103.45it/s]



100%|██████████| 64586/64586 [06:00<00:00, 179.07it/s]


Completed: 259038 files copied, 109922 files skipped, 0 parse errors

Using MIMIC image root: C:\Users\joseh\.cache\kagglehub\datasets\simhadrisadaram\mimic-cxr-dataset\versions\2\official_data_iccv_final
Copying MIMIC Validation images to ./MIMIC-CXR/val...


  4%|▎         | 18/500 [00:00<00:03, 148.15it/s]



100%|██████████| 500/500 [00:02<00:00, 169.32it/s]

Completed: 2099 files copied, 892 files skipped, 0 parse errors






(2099, 892)

In [None]:
import os
import shutil
from tqdm import tqdm
from pathlib import Path
def organize_for_sololearn(flat_dir, output_dir, class_name="images"):
    """
    Reorganize flat image directory into solo-learn expected format
    
    Before: flat_dir/img1.jpg, img2.jpg, ...
    After:  output_dir/class_name/img1.jpg, img2.jpg, ...
    """
    print(f"Organizing {flat_dir} -> {output_dir}/{class_name}/")
    
    # Create class subfolder
    class_dir = os.path.join(output_dir, class_name)
    os.makedirs(class_dir, exist_ok=True)
    
    # Find all images in flat directory
    image_extensions = {'.jpg', '.jpeg', '.png'}
    image_files = []
    
    for ext in image_extensions:
        image_files.extend(Path(flat_dir).glob(f"*{ext}"))
    
    # Move/copy images to class folder
    for img_file in tqdm(image_files):
        dst = os.path.join(class_dir, img_file.name)
        if not os.path.exists(dst):
            shutil.copy2(img_file, dst)
    
    print(f"✓ Organized {len(image_files)} images into {class_dir}")
    return len(image_files)


# Organize MIMIC
print("="*80)
print("ORGANIZING MIMIC-CXR FOR SOLO-LEARN")
print("="*80)
organize_for_sololearn("./MIMIC-CXR/train", "./MIMIC-CXR-organized/train", "xray")
organize_for_sololearn("./MIMIC-CXR/val", "./MIMIC-CXR-organized/val", "xray")




ORGANIZING MIMIC-CXR FOR SOLO-LEARN
Organizing ./MIMIC-CXR/train -> ./MIMIC-CXR-organized/train/xray/


100%|██████████| 259038/259038 [00:17<00:00, 15031.51it/s]


✓ Organized 259038 images into ./MIMIC-CXR-organized/train\xray
Organizing ./MIMIC-CXR/val -> ./MIMIC-CXR-organized/val/xray/


100%|██████████| 2099/2099 [00:00<00:00, 15265.32it/s]

✓ Organized 2099 images into ./MIMIC-CXR-organized/val\xray

ORGANIZING CheXpert FOR SOLO-LEARN





In [27]:
import os
from pathlib import Path
from tqdm import tqdm

def count_images(flat_dir):
    """Count images in directory without copying"""
    flat_path = Path(flat_dir)
    print("Scanning for images...")
    image_files = [
        f for f in flat_path.rglob("*")
        if f.is_file() and f.suffix.lower() in (".jpg", ".jpeg", ".png")
    ]
    print(f"Found {len(image_files)} images total.")
    return image_files

def copy_images_simple(image_files, output_dir, class_name="images"):
    """Link images one at a time with progress bar (much faster than copying)"""
    class_dir = Path(output_dir) / class_name
    class_dir.mkdir(parents=True, exist_ok=True)
    
    print(f"Starting to link {len(image_files)} images...")
    successful = 0
    failed = []
    
    for img_file in tqdm(image_files, desc="Linking"):
        try:
            # Create unique filename from full path
            # e.g., patient00001/study1/view1_frontal.jpg -> patient00001_study1_view1_frontal.jpg
            relative_path = img_file.relative_to(img_file.parents[3])  # Get path from train/val root
            unique_name = str(relative_path).replace(os.sep, '_')
            
            dst = class_dir / unique_name
            
            # Use hard link instead of copy (instantaneous, no disk space used)
            os.link(img_file, dst)
            successful += 1
            
        except Exception as e:
            failed.append((img_file.name, str(e)))
    
    print(f"\nLinked {successful}/{len(image_files)} images successfully.")
    if failed:
        print(f"Failed links: {len(failed)}")
        print("First 5 errors:", failed[:5])


# Example usage
if __name__ == "__main__":
    # TRAIN SET
    print("\n=== Processing TRAIN set ===")
    train_images = count_images("./CheXpert/train")
    copy_images_simple(train_images, "chexpert-organized/train", class_name="xray")
    print("Done! Train images are now in: chexpert-organized/train/xray/")
    
    # VALIDATION SET
    print("\n=== Processing VALIDATION set ===")
    val_images = count_images("./CheXpert/val")
    copy_images_simple(val_images, "chexpert-organized/val", class_name="xray")
    print("Done! Val images are now in: chexpert-organized/val/xray/")
    



=== Processing TRAIN set ===
Scanning for images...
Found 223414 images total.
Starting to link 223414 images...


Linking: 100%|██████████| 223414/223414 [01:27<00:00, 2559.13it/s]



Linked 0/223414 images successfully.
Failed links: 223414
First 5 errors: [('view1_frontal.jpg', "[WinError 183] Cannot create a file when that file already exists: 'CheXpert\\\\train\\\\patient00001\\\\study1\\\\view1_frontal.jpg' -> 'chexpert-organized\\\\train\\\\xray\\\\train_patient00001_study1_view1_frontal.jpg'"), ('view1_frontal.jpg', "[WinError 183] Cannot create a file when that file already exists: 'CheXpert\\\\train\\\\patient00002\\\\study1\\\\view1_frontal.jpg' -> 'chexpert-organized\\\\train\\\\xray\\\\train_patient00002_study1_view1_frontal.jpg'"), ('view2_lateral.jpg', "[WinError 183] Cannot create a file when that file already exists: 'CheXpert\\\\train\\\\patient00002\\\\study1\\\\view2_lateral.jpg' -> 'chexpert-organized\\\\train\\\\xray\\\\train_patient00002_study1_view2_lateral.jpg'"), ('view1_frontal.jpg', "[WinError 183] Cannot create a file when that file already exists: 'CheXpert\\\\train\\\\patient00002\\\\study2\\\\view1_frontal.jpg' -> 'chexpert-organized\

Linking: 100%|██████████| 234/234 [00:00<00:00, 2818.80it/s]


Linked 0/234 images successfully.
Failed links: 234
First 5 errors: [('view1_frontal.jpg', "[WinError 183] Cannot create a file when that file already exists: 'CheXpert\\\\val\\\\patient64541\\\\study1\\\\view1_frontal.jpg' -> 'chexpert-organized\\\\val\\\\xray\\\\val_patient64541_study1_view1_frontal.jpg'"), ('view1_frontal.jpg', "[WinError 183] Cannot create a file when that file already exists: 'CheXpert\\\\val\\\\patient64542\\\\study1\\\\view1_frontal.jpg' -> 'chexpert-organized\\\\val\\\\xray\\\\val_patient64542_study1_view1_frontal.jpg'"), ('view2_lateral.jpg', "[WinError 183] Cannot create a file when that file already exists: 'CheXpert\\\\val\\\\patient64542\\\\study1\\\\view2_lateral.jpg' -> 'chexpert-organized\\\\val\\\\xray\\\\val_patient64542_study1_view2_lateral.jpg'"), ('view1_frontal.jpg', "[WinError 183] Cannot create a file when that file already exists: 'CheXpert\\\\val\\\\patient64543\\\\study1\\\\view1_frontal.jpg' -> 'chexpert-organized\\\\val\\\\xray\\\\val_pati




In [None]:
df_train_chexpert.head()

In [None]:
df_train_mimic.head()

Training with solo learn

In [18]:
import sys, os
sys.path.append(os.path.abspath("solo-learn"))
print("solo-learn added to path.")


solo-learn added to path.


In [None]:
#reset hydra
from hydra.core.global_hydra import GlobalHydra
GlobalHydra.instance().clear()
print("Hydra reset.")


Hydra reset.


In [None]:
# Load config
from omegaconf import OmegaConf

cfg = OmegaConf.load("config/simclr.yaml")
print(OmegaConf.to_yaml(cfg))


dataset: custom
data:
  train_path: CheXpert-organized/train
  val_path: CheXpert-organized/val
  format: image_folder
  batch_size: 128
  num_workers: 8
backbone:
  name: resnet18
  kwargs: {}
method_kwargs:
  proj_hidden_dim: 2048
  proj_output_dim: 128
  temperature: 0.5
optimizer:
  name: adam
  lr: 0.001
  weight_decay: 1.0e-06
scheduler:
  name: cosine
  warmup_epochs: 10
max_epochs: 2



In [None]:
# Prepare data loaders
from solo.data.classification_dataloader import prepare_data
train_loader, val_loader = prepare_data(
    dataset=cfg.dataset,
    train_data_path=cfg.data.train_path,
    val_data_path=cfg.data.val_path,
    data_format="image_folder",
    batch_size=cfg.data.batch_size,
    num_workers=cfg.data.num_workers,
)


In [31]:
# Instantiate model <-- this is where im failing
from solo.methods import METHOD_REGISTRY    
method_class = METHOD_REGISTRY.get(cfg.method.name)
method = method_class(cfg)
print(f"Instantiated method: {cfg.method.name}")
print(method)
# Now you can proceed with training or evaluation using `method`, `train_loader`, and `val_loader`.

ImportError: cannot import name 'METHOD_REGISTRY' from 'solo.methods' (c:\Users\joseh\.conda\envs\py312\Lib\site-packages\solo\methods\__init__.py)

In [None]:



# CheXpert
!python -m solo.main_pretrain \
  --method simclr \
  --dataset custom \
  --data_dir /path/to/chesXpert \
  --train_dir train \
  --val_dir val \
  --encoder resnet50 \
  --batch_size 128 \
  --max_epochs 50 \
  --num_workers 8 \
  --gpus 1 \
  --precision 16 \
  --logger tensorboard \
  --log_dir ./logs \
  --project mimic_ssl \
  --name simclr_chexpert


SyntaxError: invalid syntax (3234304305.py, line 2)

Embedding extraction

In [None]:
# CheXpert
python -m solo.main_linear \
  --backbone resnet50 \
  --pretrained_feature_extractor ./checkpoints/simclr_chexpert.ckpt \
  --extract_features \
  --data_dir /path/to/CheXpert \
  --train_dir train \
  --output_features simclr_chexpert_embeddings.pt

# MIMIC
python -m solo.main_linear \
  --backbone resnet50 \
  --pretrained_feature_extractor ./checkpoints/simclr_mimic.ckpt \
  --extract_features \
  --data_dir /path/to/MIMIC-CXR \
  --train_dir train \
  --output_features simclr_mimic_embeddings.pt


In [None]:
import torch

# Load embeddings
chexpert_emb = torch.load("./embeddings/simclr_chexpert_embeddings.pt")
mimic_emb = torch.load("./embeddings/simclr_mimic_embeddings.pt")

# Procrustes alignment
def procrustes_alignment(X, Y):
    U, _, Vt = torch.linalg.svd(Y.T @ X)
    R = U @ Vt
    return X @ R.T, R

chexpert_aligned, R = procrustes_alignment(chexpert_emb, mimic_emb)
torch.save(chexpert_aligned, "./embeddings/chexpert_aligned.pt")

# Example: subgroup distance with Polars
def subgroup_centroid_distance(embeddings, demographics_col):
    groups = demographics_col.unique().to_list()
    centroids = {}
    for g in groups:
        mask = demographics_col == g
        emb_group = embeddings[mask.to_numpy()]
        centroids[g] = emb_group.mean(dim=0)
    dist = {}
    for g1 in groups:
        for g2 in groups:
            dist[(g1, g2)] = torch.norm(centroids[g1] - centroids[g2]).item()
    return dist

demographics = df_train_chexpert["Sex"]  # or "Gender" column
dists = subgroup_centroid_distance(chexpert_aligned, demographics)
print(dists)


In [None]:
%load_ext tensorboard
%tensorboard --logdir ./logs
