# Multi-Objective Hyperparameter Optimization for Breast Cancer Classification

**Running on Google Colab**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dtobi59/mammography-multiobjective-optimization/blob/main/colab_tutorial.ipynb)

This notebook demonstrates the complete workflow:
1. Setup environment and clone repository
2. Upload or mount datasets
3. Run NSGA-III optimization with checkpointing
4. Analyze results and visualize Pareto front
5. Evaluate on source and target datasets

**Author:** David ([@dtobi59](https://github.com/dtobi59))

## 1. Setup Environment

Check GPU availability and clone the repository.

In [None]:
# Check GPU availability
!nvidia-smi

import torch
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU device: {torch.cuda.get_device_name(0)}")

### Clone Repository from GitHub

In [None]:
# Clone the repository
!git clone https://github.com/dtobi59/mammography-multiobjective-optimization.git

# Change to project directory
%cd mammography-multiobjective-optimization

# List files
!ls -la

In [None]:
# Setup Python path to ensure imports work
import sys
import os

# Get current directory (project root)
project_root = os.getcwd()
print(f"Project root: {project_root}")

# Add to Python path if not already there
if project_root not in sys.path:
    sys.path.insert(0, project_root)
    print(f"Added {project_root} to sys.path")

# Verify path setup
print(f"\nPython sys.path[0]: {sys.path[0]}")
print("[OK] Path setup complete!")

In [None]:
# Install required packages
!pip install -q -r requirements.txt

# Install the package in editable mode
!pip install -q -e .

print("\n[SUCCESS] All dependencies installed!")

## 2. Dataset Setup

**Option C: Mount Google Drive** (ACTIVE - Best for large datasets)

**Option A: Use Small Demo Dataset** (For testing only - commented out)

**Option B: Upload Your Own Datasets** (Alternative - commented out)

We'll use Google Drive by default to access the VinDr and INbreast datasets.

### Option A: Create Demo Dataset (Quick Test)

**COMMENTED OUT - Using Google Drive instead**

This creates a small synthetic dataset for testing the pipeline.

In [None]:
# OPTION A - COMMENTED OUT (Using Google Drive instead)
# Uncomment this cell if you want to test with demo data instead

# import os
# import pandas as pd
# import numpy as np
# from PIL import Image
#
# # Create demo data directories
# os.makedirs("demo_data/vindr/images", exist_ok=True)
# os.makedirs("demo_data/inbreast/images", exist_ok=True)
#
# # Create demo VinDr-Mammo metadata
# vindr_data = []
# for patient_id in range(1, 6):  # 5 patients
#     for laterality in ['L', 'R']:
#         for view in ['CC', 'MLO']:
#             image_id = f"P{patient_id:03d}_{laterality}_{view}"
#             birads = np.random.choice([2, 3, 4, '4A', 5])
#             vindr_data.append({
#                 'image_id': image_id,
#                 'study_id': f'P{patient_id:03d}',
#                 'laterality': laterality,
#                 'view_position': view,
#                 'breast_birads': birads
#             })
#             # Create dummy image (512x512 grayscale)
#             img = Image.fromarray(np.random.randint(0, 256, (512, 512), dtype=np.uint8), mode='L')
#             img.save(f"demo_data/vindr/images/{image_id}.png")
#
# vindr_df = pd.DataFrame(vindr_data)
# vindr_df.to_csv("demo_data/vindr/metadata.csv", index=False)
#
# # Create demo INbreast metadata
# inbreast_data = []
# for patient_id in range(1, 4):  # 3 patients
#     for laterality in ['L', 'R']:
#         for view in ['CC', 'MLO']:
#             file_name = f"INbreast_{patient_id:03d}_{laterality}_{view}.png"
#             birads = np.random.choice([2, 3, '4B', 5])
#             inbreast_data.append({
#                 'patient_id': f'INB{patient_id:03d}',
#                 'laterality': laterality,
#                 'view': view,
#                 'birads': birads,
#                 'file_name': file_name
#             })
#             # Create dummy image
#             img = Image.fromarray(np.random.randint(0, 256, (512, 512), dtype=np.uint8), mode='L')
#             img.save(f"demo_data/inbreast/images/{file_name}")
#
# inbreast_df = pd.DataFrame(inbreast_data)
# inbreast_df.to_csv("demo_data/inbreast/metadata.csv", index=False)
#
# print("[SUCCESS] Demo dataset created!")
# print(f"VinDr-Mammo: {len(vindr_df)} images from 5 patients")
# print(f"INbreast: {len(inbreast_df)} images from 3 patients")
#
# # Set paths for demo data
# VINDR_PATH = "demo_data/vindr"
# INBREAST_PATH = "demo_data/inbreast"

print("Option A is disabled. Using Google Drive (Option C).")

### Option B: Upload Your Own Datasets

Skip if using demo data or Google Drive.

In [None]:
# Uncomment to upload files
# from google.colab import files
#
# print("Upload VinDr-Mammo metadata.csv")
# vindr_metadata = files.upload()
#
# print("Upload INbreast metadata.csv")
# inbreast_metadata = files.upload()
#
# # For images, it's better to use Google Drive for large datasets
# print("For image files, please use Google Drive (Option C)")

### Option C: Mount Google Drive (ACTIVE)

**This option is now active by default.**

Make sure you have uploaded your datasets to Google Drive:
- VinDr dataset: `/content/drive/MyDrive/kaggle_vindr_data/`
- INbreast dataset: `/content/drive/MyDrive/INbreast/`

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Set paths to your data in Google Drive
VINDR_PATH = "/content/drive/MyDrive/kaggle_vindr_data"
INBREAST_PATH = "/content/drive/MyDrive/INbreast"

# Set checkpoint directories in Google Drive (persistent storage)
CHECKPOINT_DIR = "/content/drive/MyDrive/vindr_optimization/checkpoints"
OUTPUT_DIR = "/content/drive/MyDrive/vindr_optimization/results"

# Create checkpoint directories
import os
os.makedirs(CHECKPOINT_DIR, exist_ok=True)
os.makedirs(OUTPUT_DIR, exist_ok=True)

print("\n[SUCCESS] Google Drive mounted!")
print(f"VinDr dataset path: {VINDR_PATH}")
print(f"INbreast dataset path: {INBREAST_PATH}")
print(f"\nCheckpoints will be saved to:")
print(f"  Model checkpoints: {CHECKPOINT_DIR}")
print(f"  Optimization results: {OUTPUT_DIR}")
print(f"\n✓ Checkpoints are persistent and won't be lost when session ends!")
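Before moving on, it can help to confirm that the mounted dataset folders actually exist and are not empty. The following is a minimal sketch using only the standard library; `check_dataset_dirs` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def check_dataset_dirs(paths):
    """Return a dict mapping each label to True if its directory exists and is non-empty."""
    status = {}
    for label, path in paths.items():
        directory = Path(path)
        # A directory counts as usable only if it exists and contains at least one entry
        status[label] = directory.is_dir() and any(directory.iterdir())
    return status


# Example usage in this notebook (VINDR_PATH and INBREAST_PATH are set in the cell above):
# status = check_dataset_dirs({"VinDr": VINDR_PATH, "INbreast": INBREAST_PATH})
# for label, ok in status.items():
#     print(f"{label}: {'[OK]' if ok else '[MISSING]'}")
```

Catching a wrong path here is cheaper than discovering it after the optimization has started.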

## 3. Configure Paths

Update configuration with your dataset paths.

In [None]:
# Read current config
with open('config.py', 'r') as f:
    config_content = f.read()

# Update paths to Google Drive locations
config_content = config_content.replace(
    'VINDR_MAMMO_PATH = "/content/drive/MyDrive/kaggle_vindr_data"',
    f'VINDR_MAMMO_PATH = "{VINDR_PATH}"'
)
config_content = config_content.replace(
    'INBREAST_PATH = "/content/drive/MyDrive/INbreast"',
    f'INBREAST_PATH = "{INBREAST_PATH}"'
)

# Optional: Reduce for testing (uncomment if needed)
# config_content = config_content.replace(
#     '"pop_size": 24,',
#     '"pop_size": 6,  # Reduced for testing'
# )
# config_content = config_content.replace(
#     '"n_generations": 50,',
#     '"n_generations": 5,  # Reduced for testing'
# )
# config_content = config_content.replace(
#     'MAX_EPOCHS = 100',
#     'MAX_EPOCHS = 10  # Reduced for testing'
# )

# Write updated config
with open('config.py', 'w') as f:
    f.write(config_content)

print("[SUCCESS] Configuration updated!")
print(f"VinDr-Mammo path: {VINDR_PATH}")
print(f"INbreast path: {INBREAST_PATH}")
print("\nUsing production settings from config.py")
print("To reduce compute time, uncomment the optional section above.")
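The cell above relies on exact string matches, so a replacement does nothing if `config.py` contains a different default than the one searched for. As a hedged alternative, the sketch below rewrites a single-line assignment by name and raises an error when the name is missing. `set_config_value` is a hypothetical helper, and it assumes each setting sits on one line in the form `NAME = value`.

```python
import re


def set_config_value(config_text, name, value):
    """Replace the single-line assignment `name = ...` in config_text with a new string value.

    Raises ValueError if no such assignment is found, so a silent no-op cannot occur.
    """
    pattern = re.compile(rf"^{re.escape(name)}\s*=.*$", flags=re.MULTILINE)
    replacement = f"{name} = {value!r}"
    # A function is passed as the replacement so backslashes in the value are not
    # interpreted as regex escapes
    new_text, count = pattern.subn(lambda match: replacement, config_text)
    if count == 0:
        raise ValueError(f"{name} not found in config text")
    return new_text


# Illustrative config text, not the real contents of config.py
sample = 'VINDR_MAMMO_PATH = "/old/path"\nMAX_EPOCHS = 100\n'
updated = set_config_value(sample, "VINDR_MAMMO_PATH", "/content/drive/MyDrive/kaggle_vindr_data")
print(updated)
```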

## 4. Verify Setup

Test that data loading works correctly.

In [None]:
# Verify setup and test data loading
import sys
import os

# Ensure project root is in path
project_root = os.getcwd()
if project_root not in sys.path:
    sys.path.insert(0, project_root)

print(f"Project root: {project_root}")
print(f"Python path includes project root: {project_root in sys.path}\n")

# Now import modules
import config
from optimization.nsga3_runner import load_metadata

# Load VinDr-Mammo metadata
print("Loading VinDr-Mammo metadata...")
vindr_metadata = load_metadata(
    dataset_name="vindr",
    dataset_path=config.VINDR_MAMMO_PATH,
    dataset_config=config.VINDR_CONFIG
)
print(f"[OK] Loaded {len(vindr_metadata)} images")
print(f"     Patients: {vindr_metadata['patient_id'].nunique()}")
print(f"     Label distribution: {vindr_metadata['label'].value_counts().to_dict()}")

# Load INbreast metadata
print("\nLoading INbreast metadata...")
inbreast_metadata = load_metadata(
    dataset_name="inbreast",
    dataset_path=config.INBREAST_PATH,
    dataset_config=config.INBREAST_CONFIG
)
print(f"[OK] Loaded {len(inbreast_metadata)} images")
print(f"     Patients: {inbreast_metadata['patient_id'].nunique()}")
print(f"     Label distribution: {inbreast_metadata['label'].value_counts().to_dict()}")

print("\n[SUCCESS] Setup verification complete!")
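Each patient contributes several images (two breasts, two views), so a train/validation split is usually made per patient rather than per image to avoid leakage. The sketch below shows one way to do that with the standard library. It is an illustration under that assumption, not the repository's `create_train_val_split`, and `split_by_patient`, `val_fraction`, and `seed` are hypothetical names.

```python
import random


def split_by_patient(patient_ids, val_fraction=0.2, seed=42):
    """Split row indices into train/val so that every patient lands in exactly one subset."""
    unique_patients = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_patients)
    n_val = max(1, round(len(unique_patients) * val_fraction))
    val_patients = set(unique_patients[:n_val])
    train_idx = [i for i, pid in enumerate(patient_ids) if pid not in val_patients]
    val_idx = [i for i, pid in enumerate(patient_ids) if pid in val_patients]
    return train_idx, val_idx


# Five hypothetical patients with four images each (two breasts, two views)
patient_ids = [f"P{p:03d}" for p in range(1, 6) for _ in range(4)]
train_idx, val_idx = split_by_patient(patient_ids)
train_patients = {patient_ids[i] for i in train_idx}
val_patients = {patient_ids[i] for i in val_idx}
print(f"Train images: {len(train_idx)}, validation images: {len(val_idx)}")
print(f"Patients in both subsets: {train_patients & val_patients}")
```

An empty intersection between the two patient sets is the property worth checking on any real split.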

## 5. Run NSGA-III Optimization

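This step optimizes 5 hyperparameters (learning rate, weight decay, dropout rate, augmentation strength, unfreeze fraction) against 4 objectives (PR-AUC, AUROC, Brier score, robustness degradation) and saves a checkpoint after every generation.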

In [None]:
# Ensure imports work
import sys
import os
if os.getcwd() not in sys.path:
    sys.path.insert(0, os.getcwd())

from optimization.nsga3_runner import NSGA3Runner
from data.dataset import create_train_val_split
from pathlib import Path
import config

# Create train/val split
print("Creating train/validation split...")
train_metadata, val_metadata = create_train_val_split(vindr_metadata)

print(f"Train samples: {len(train_metadata)}")
print(f"Validation samples: {len(val_metadata)}")
print(f"Unique patients - Train: {train_metadata['patient_id'].nunique()}, "
      f"Val: {val_metadata['patient_id'].nunique()}")

# Create runner with checkpoint saving to Google Drive
print("\nInitializing NSGA-III runner...")
image_dir = str(Path(config.VINDR_MAMMO_PATH) / config.VINDR_CONFIG["image_dir"])
runner = NSGA3Runner(
    train_metadata=train_metadata,
    val_metadata=val_metadata,
    image_dir=image_dir,
    output_dir=OUTPUT_DIR,          # Save to Google Drive
    checkpoint_dir=CHECKPOINT_DIR,  # Save to Google Drive
    save_frequency=1  # Save every generation
)

print("\n" + "="*80)
print("STARTING OPTIMIZATION")
print("="*80)
print("This may take a while depending on:")
print("  - Population size (current: from config)")
print("  - Number of generations (current: from config)")
print("  - Dataset size")
print("  - GPU availability")
print("\nCheckpoints are being saved to Google Drive:")
print(f"  {CHECKPOINT_DIR}")
print(f"  {OUTPUT_DIR}")
print("\n✓ Your progress is safe and persistent!")
print("="*80 + "\n")

# Run optimization
result = runner.run()

print("\n" + "="*80)
print("OPTIMIZATION COMPLETE!")
print("="*80)
print(f"Pareto front size: {len(result.F)}")
print(f"Results saved to: {runner.output_dir}")
print(f"\n✓ All checkpoints and results are saved in Google Drive!")
print("="*80)
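The Pareto front reported above is the set of solutions that no other solution beats on every objective at once. The sketch below illustrates that idea in plain Python. It assumes PR-AUC and AUROC are maximized while Brier score and robustness degradation are minimized, and it negates the maximized metrics so that smaller is always better. Whether the repository encodes objectives this way is an assumption, and the metric values are made up.

```python
def to_minimization(metrics):
    """Convert a metrics dict to a tuple in which smaller is better for every entry."""
    return (
        -metrics["pr_auc"],                 # maximized, so negated
        -metrics["auroc"],                  # maximized, so negated
        metrics["brier"],                   # already minimized
        metrics["robustness_degradation"],  # already minimized
    )


def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Return the indices of points that no other point dominates."""
    return [
        i for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Made-up metric values, for illustration only
candidates = [
    {"pr_auc": 0.80, "auroc": 0.90, "brier": 0.10, "robustness_degradation": 0.05},
    {"pr_auc": 0.78, "auroc": 0.88, "brier": 0.12, "robustness_degradation": 0.06},
    {"pr_auc": 0.75, "auroc": 0.91, "brier": 0.11, "robustness_degradation": 0.04},
]
front = pareto_front([to_minimization(c) for c in candidates])
print(front)
```

Here the second candidate is worse than the first on all four objectives, so it drops out, while the first and third each win on something and both stay.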

## 6. Inspect Checkpoints

View saved checkpoints and load Pareto fronts from different generations.

In [None]:
# List all checkpoints
checkpoints = runner.list_checkpoints()
print(f"Found {len(checkpoints)} checkpoints:\n")

for i, checkpoint_path in enumerate(checkpoints):
    print(f"{i+1}. {checkpoint_path.name}")

# Load and display the latest checkpoint
if checkpoints:
    print("\nLoading latest checkpoint...")
    latest_checkpoint = runner.load_checkpoint(checkpoints[-1])
    
    # Get Pareto front from latest checkpoint
    pareto_df = runner.get_pareto_front_from_checkpoint(checkpoints[-1])
    print(f"\nPareto front at generation {latest_checkpoint['generation']}:")
    print(pareto_df)
else:
    print("No checkpoints found.")
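If the runner object is no longer in memory (for example after a Colab session restart), the most recent generation checkpoint can be located directly on disk. The sketch below assumes the `checkpoint_gen_XXXX.pkl` naming shown in the results listing later in this notebook; `latest_generation_checkpoint` is a hypothetical helper and does not load or interpret the file contents.

```python
import re
from pathlib import Path


def latest_generation_checkpoint(directory):
    """Return (generation, path) for the highest-numbered checkpoint_gen_XXXX.pkl, or None."""
    pattern = re.compile(r"checkpoint_gen_(\d+)\.pkl$")
    found = []
    for path in Path(directory).glob("checkpoint_gen_*.pkl"):
        match = pattern.search(path.name)
        if match:
            found.append((int(match.group(1)), path))
    # Compare by parsed generation number rather than by file name
    return max(found, key=lambda item: item[0]) if found else None


# Example usage (the subfolder name is taken from the printed checkpoint structure):
# latest = latest_generation_checkpoint(Path(OUTPUT_DIR) / "optimization_checkpoints")
# print(latest)
```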

## 7. Analyze Results

Load and visualize the final Pareto front.

In [None]:
import glob
import matplotlib.pyplot as plt
import pandas as pd

# Find most recent results file in Google Drive
pareto_files = sorted(glob.glob(f"{OUTPUT_DIR}/pareto_solutions_*.csv"))
if not pareto_files:
    print("No results found. Please run optimization first.")
else:
    latest_results = pareto_files[-1]
    print(f"Loading results from: {latest_results}")
    
    results_df = pd.read_csv(latest_results)
    print(f"\nPareto front contains {len(results_df)} solutions")
    print("\nSummary statistics:")
    print(results_df.describe())

### Visualize Pareto Front

In [None]:
if pareto_files:
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    fig.suptitle('Pareto Front - Objective Space', fontsize=16)
    
    # Plot objective pairs
    objective_pairs = [
        ('pr_auc', 'auroc'),
        ('pr_auc', 'brier'),
        ('pr_auc', 'robustness_degradation'),
        ('auroc', 'brier'),
        ('auroc', 'robustness_degradation'),
        ('brier', 'robustness_degradation')
    ]
    
    for ax, (obj1, obj2) in zip(axes.flat, objective_pairs):
        ax.scatter(results_df[obj1], results_df[obj2], alpha=0.6, s=50)
        ax.set_xlabel(obj1.replace('_', ' ').title())
        ax.set_ylabel(obj2.replace('_', ' ').title())
        ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Plot hyperparameter distributions
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    fig.suptitle('Pareto Front - Hyperparameter Distributions', fontsize=16)
    
    hyperparams = ['learning_rate', 'weight_decay', 'dropout_rate', 
                   'augmentation_strength', 'unfreeze_fraction']
    
    for ax, param in zip(axes.flat, hyperparams):
        ax.hist(results_df[param], bins=20, alpha=0.7, edgecolor='black')
        ax.set_xlabel(param.replace('_', ' ').title())
        ax.set_ylabel('Frequency')
        ax.grid(True, alpha=0.3)
    
    # Hide the last subplot (we have 5 params, 6 subplots)
    axes.flat[-1].axis('off')
    
    plt.tight_layout()
    plt.show()
    
    # Identify extreme solutions
    print("\n" + "="*80)
    print("EXTREME SOLUTIONS (Best for each objective)")
    print("="*80)

    # (label, column, selector): PR-AUC and AUROC are maximized,
    # Brier score and robustness degradation are minimized
    extremes = [
        ('Best PR-AUC', 'pr_auc', 'idxmax'),
        ('Best AUROC', 'auroc', 'idxmax'),
        ('Best Brier Score', 'brier', 'idxmin'),
        ('Best Robustness', 'robustness_degradation', 'idxmin'),
    ]

    for label, column, selector in extremes:
        best_idx = getattr(results_df[column], selector)()
        print(f"\n{label}: {results_df.loc[best_idx, column]:.4f}")
        print(f"Solution ID: {results_df.loc[best_idx, 'solution_id']}")
        print("Hyperparameters:")
        for param in hyperparams:
            print(f"  {param}: {results_df.loc[best_idx, param]:.6f}")
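The Brier score is treated above as "lower is better" because, in its standard binary form, it is the mean squared difference between predicted probabilities and the true 0/1 labels. The sketch below shows that standard definition with made-up predictions; it is not taken from the repository's evaluation code.

```python
def brier_score(probabilities, labels):
    """Mean squared difference between predicted probabilities and binary labels (0 or 1)."""
    if len(probabilities) != len(labels) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probabilities, labels)) / len(labels)


# Made-up predictions, for illustration only
labels = [1, 0, 1, 0]
confident = brier_score([0.9, 0.1, 0.8, 0.2], labels)
uninformative = brier_score([0.5, 0.5, 0.5, 0.5], labels)
print(f"confident: {confident:.3f}, uninformative: {uninformative:.3f}")
```

Confident, correct predictions score close to 0, while always predicting 0.5 scores 0.25.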

## 8. Results in Google Drive

All results and checkpoints are automatically saved to Google Drive!

In [None]:
# Results are already in Google Drive!
print("=" * 80)
print("RESULTS LOCATION")
print("=" * 80)
print(f"\nAll results are saved in Google Drive:")
print(f"  📁 {OUTPUT_DIR}")
print(f"\nCheckpoint structure:")
print(f"  ├── optimization_checkpoints/")
print(f"  │   ├── checkpoint_gen_0001.pkl")
print(f"  │   ├── checkpoint_gen_0002.pkl")
print(f"  │   └── pareto_gen_XXXX.csv")
print(f"  └── pareto_solutions_TIMESTAMP.csv")
print(f"\nModel checkpoints:")
print(f"  📁 {CHECKPOINT_DIR}")
print(f"  ├── eval_1/best_checkpoint.pt")
print(f"  ├── eval_2/best_checkpoint.pt")
print(f"  └── ...")
print("\n✓ Access these files anytime from your Google Drive!")
print("=" * 80)

# Optional: Create a zip file for download
print("\nOptional: Create zip file for download")
print("Uncomment the code below if you want to download results:")
print()
print("# !cd /content/drive/MyDrive/vindr_optimization && zip -r results.zip results/ checkpoints/")
print("# from google.colab import files")
print("# files.download('/content/drive/MyDrive/vindr_optimization/results.zip')")

## Next Steps

1. **Evaluate on INbreast (Zero-Shot Transfer)**
   - Use the best solution to evaluate on the target dataset
   - No fine-tuning or threshold adjustment

2. **Analyze Trade-offs**
   - Compare different Pareto solutions
   - Select based on your priorities (PR-AUC, AUROC, calibration, robustness)

3. **Production Runs**
   - Use at least the `config.py` defaults (population size 24, 50 generations, 100 max epochs) rather than the reduced test settings
   - Increase generations further (e.g., 100) if compute allows
   - Use full datasets

4. **Resume Training**
   - All checkpoints are saved in Google Drive
   - You can resume optimization from any checkpoint
   - Simply load and continue training
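For the trade-off analysis in step 2, one common heuristic is to pick the Pareto solution closest to an ideal point after scaling each objective to the range 0 to 1. The sketch below is one such heuristic, not a method prescribed by this project. `pick_compromise` and its `weights` argument are hypothetical, and the metric values are made up.

```python
def pick_compromise(solutions, maximize=("pr_auc", "auroc"),
                    minimize=("brier", "robustness_degradation"), weights=None):
    """Return the index of the solution closest to the ideal point after min-max scaling."""
    objectives = list(maximize) + list(minimize)
    weights = weights or {name: 1.0 for name in objectives}
    scaled_costs = []
    for name in objectives:
        values = [s[name] for s in solutions]
        low, high = min(values), max(values)
        span = (high - low) or 1.0  # avoid division by zero when all values are equal
        if name in maximize:
            # Cost is 0 at the best (highest) value and 1 at the worst
            scaled_costs.append([(high - v) / span for v in values])
        else:
            # Cost is 0 at the best (lowest) value and 1 at the worst
            scaled_costs.append([(v - low) / span for v in values])
    distances = [
        sum(weights[name] * costs[i] ** 2 for name, costs in zip(objectives, scaled_costs)) ** 0.5
        for i in range(len(solutions))
    ]
    return min(range(len(solutions)), key=distances.__getitem__)


# Made-up Pareto solutions, for illustration only
solutions = [
    {"pr_auc": 0.80, "auroc": 0.90, "brier": 0.10, "robustness_degradation": 0.08},
    {"pr_auc": 0.78, "auroc": 0.89, "brier": 0.11, "robustness_degradation": 0.05},
    {"pr_auc": 0.70, "auroc": 0.85, "brier": 0.15, "robustness_degradation": 0.04},
]
balanced = pick_compromise(solutions)
robust_first = pick_compromise(
    solutions,
    weights={"pr_auc": 1.0, "auroc": 1.0, "brier": 1.0, "robustness_degradation": 100.0},
)
print(balanced, robust_first)
```

With the results loaded earlier, the same function could be applied to `results_df.to_dict("records")`. Raising one weight shifts the choice toward that objective, which makes the priority explicit instead of implicit.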

## Access Your Results

All files are stored in Google Drive:
- **Results:** `/content/drive/MyDrive/vindr_optimization/results/`
- **Checkpoints:** `/content/drive/MyDrive/vindr_optimization/checkpoints/`

You can access these from any Colab session or download them to your computer!

## Resources

- **GitHub Repository:** https://github.com/dtobi59/mammography-multiobjective-optimization
- **Documentation:** See README.md and other guides in the repository
- **Paper:** [Add your paper link here when published]

## Citation

If you use this code in your research, please cite:

```bibtex
@software{mammography_multiobjective_optimization,
  author = {David},
  title = {Multi-Objective Hyperparameter Optimization for Breast Cancer Classification},
  year = {2026},
  url = {https://github.com/dtobi59/mammography-multiobjective-optimization}
}
```