# Multi-Objective Hyperparameter Optimization for Breast Cancer Classification

**Running on Google Colab**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dtobi59/mammography-multiobjective-optimization/blob/main/colab_tutorial.ipynb)

This notebook demonstrates the complete workflow:
1. Setup environment and clone repository
2. Upload or mount datasets
3. Run NSGA-III optimization with checkpointing
4. Analyze results and visualize Pareto front
5. Evaluate on source and target datasets

**Author:** David ([@dtobi59](https://github.com/dtobi59))

## 1. Setup Environment

Check GPU availability and clone the repository.

In [None]:
# Check GPU availability
!nvidia-smi

import torch
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU device: {torch.cuda.get_device_name(0)}")

### Clone Repository from GitHub

In [None]:
# Clone the repository
!git clone https://github.com/dtobi59/mammography-multiobjective-optimization.git

# Change to project directory
%cd mammography-multiobjective-optimization

# List files
!ls -la

In [None]:
# Install required packages
!pip install -q -r requirements.txt

# Install the package in editable mode to fix imports
!pip install -q -e .

print("\n[SUCCESS] All dependencies installed!")
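
Because `pip install -q` suppresses most output, a failed install can go unnoticed until a later cell raises `ImportError`. The sketch below is one way to check up front that the modules this notebook imports are available. The helper name `missing_modules` is hypothetical and not part of the repository, and the module list is taken only from the imports visible in this notebook, so `requirements.txt` may list more.

```python
import importlib.util

def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Top-level modules imported by later cells in this notebook.
required = ["torch", "numpy", "pandas", "PIL", "matplotlib", "seaborn"]
missing = missing_modules(required)
if missing:
    print(f"[WARNING] Missing modules: {missing}")
else:
    print("[OK] All required modules are importable")
```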

## 2. Dataset Setup

**Option A: Use Small Demo Dataset** (Recommended for testing)

**Option B: Upload Your Own Datasets**

**Option C: Mount Google Drive** (Best for large datasets)

Choose one option below:

### Option A: Create Demo Dataset (Quick Test)

This creates a small synthetic dataset for testing the pipeline.

In [None]:
import os
import pandas as pd
import numpy as np
from PIL import Image

# Create demo data directories
os.makedirs("demo_data/vindr/images", exist_ok=True)
os.makedirs("demo_data/inbreast/images", exist_ok=True)

# Create demo VinDr-Mammo metadata
vindr_data = []
for patient_id in range(1, 6):  # 5 patients
    for laterality in ['L', 'R']:
        for view in ['CC', 'MLO']:
            image_id = f"P{patient_id:03d}_{laterality}_{view}"
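            # BI-RADS categories as strings (NumPy would coerce a mixed int/str list to strings anyway)
            birads = np.random.choice(['2', '3', '4', '4A', '5'])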
            vindr_data.append({
                'image_id': image_id,
                'study_id': f'P{patient_id:03d}',
                'laterality': laterality,
                'view_position': view,
                'breast_birads': birads
            })
            # Create dummy image (512x512 grayscale)
            img = Image.fromarray(np.random.randint(0, 256, (512, 512), dtype=np.uint8), mode='L')
            img.save(f"demo_data/vindr/images/{image_id}.png")

vindr_df = pd.DataFrame(vindr_data)
vindr_df.to_csv("demo_data/vindr/metadata.csv", index=False)

# Create demo INbreast metadata
inbreast_data = []
for patient_id in range(1, 4):  # 3 patients
    for laterality in ['L', 'R']:
        for view in ['CC', 'MLO']:
            file_name = f"INbreast_{patient_id:03d}_{laterality}_{view}.png"
            # BI-RADS categories as strings (NumPy would coerce a mixed int/str list to strings anyway)
            birads = np.random.choice(['2', '3', '4B', '5'])
            inbreast_data.append({
                'patient_id': f'INB{patient_id:03d}',
                'laterality': laterality,
                'view': view,
                'birads': birads,
                'file_name': file_name
            })
            # Create dummy image
            img = Image.fromarray(np.random.randint(0, 256, (512, 512), dtype=np.uint8), mode='L')
            img.save(f"demo_data/inbreast/images/{file_name}")

inbreast_df = pd.DataFrame(inbreast_data)
inbreast_df.to_csv("demo_data/inbreast/metadata.csv", index=False)

print("[SUCCESS] Demo dataset created!")
print(f"VinDr-Mammo: {len(vindr_df)} images from 5 patients")
print(f"INbreast: {len(inbreast_df)} images from 3 patients")

# Set paths for demo data
VINDR_PATH = "demo_data/vindr"
INBREAST_PATH = "demo_data/inbreast"
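
Whichever option you choose, it can help to confirm that every metadata row points at an image that exists before starting a long run. The following is a minimal sketch using only the standard library. The helper `find_missing_images` is hypothetical. The column names and the `.png` suffix follow the demo layout above and may differ for the real datasets.

```python
import csv
from pathlib import Path

def find_missing_images(metadata_csv, image_dir, column, suffix=""):
    """Return file names listed in the metadata that are absent from image_dir."""
    image_dir = Path(image_dir)
    missing = []
    with open(metadata_csv, newline="") as f:
        for row in csv.DictReader(f):
            file_name = row[column] + suffix
            if not (image_dir / file_name).is_file():
                missing.append(file_name)
    return missing

# Demo layout: VinDr rows store an ID without extension, INbreast rows store the full file name.
# find_missing_images("demo_data/vindr/metadata.csv", "demo_data/vindr/images", "image_id", ".png")
# find_missing_images("demo_data/inbreast/metadata.csv", "demo_data/inbreast/images", "file_name")
```

An empty list means every referenced image was found.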

### Option B: Upload Your Own Datasets

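Skip this step if you are using the demo data or Google Drive. Note that the cell below uploads only the metadata files; it does not set `VINDR_PATH` or `INBREAST_PATH`, which the configuration cell in Section 3 requires.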

In [None]:
# Uncomment to upload files
# from google.colab import files
#
# print("Upload VinDr-Mammo metadata.csv")
# vindr_metadata = files.upload()
#
# print("Upload INbreast metadata.csv")
# inbreast_metadata = files.upload()
#
# # For images, it's better to use Google Drive for large datasets
# print("For image files, please use Google Drive (Option C)")

### Option C: Mount Google Drive

Best option for large datasets. Upload your data to Google Drive first.

In [None]:
# Uncomment to mount Google Drive
# from google.colab import drive
# drive.mount('/content/drive')
#
# # Set paths to your data in Google Drive
# VINDR_PATH = "/content/drive/MyDrive/datasets/vindr_mammo"
# INBREAST_PATH = "/content/drive/MyDrive/datasets/inbreast"

## 3. Configure Paths

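Rewrite `config.py` in place so that it points at your dataset paths (`VINDR_PATH` and `INBREAST_PATH` from Section 2), and reduce the population size, generation count, and epoch limit for a quick demo run.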

In [None]:
# Read current config
with open('config.py', 'r') as f:
    config_content = f.read()

# Update paths (using demo data by default)
config_content = config_content.replace(
    'VINDR_MAMMO_PATH = "/path/to/vindr_mammo"',
    f'VINDR_MAMMO_PATH = "{VINDR_PATH}"'
)
config_content = config_content.replace(
    'INBREAST_PATH = "/path/to/inbreast"',
    f'INBREAST_PATH = "{INBREAST_PATH}"'
)

# Reduce population size and generations for quick testing
config_content = config_content.replace(
    '"pop_size": 20,',
    '"pop_size": 6,  # Reduced for demo'
)
config_content = config_content.replace(
    '"n_generations": 100,',
    '"n_generations": 5,  # Reduced for demo'
)

# Reduce epochs for faster training in demo
config_content = config_content.replace(
    'MAX_EPOCHS = 100',
    'MAX_EPOCHS = 3  # Reduced for demo'
)

# Write updated config
with open('config.py', 'w') as f:
    f.write(config_content)

print("[SUCCESS] Configuration updated!")
print(f"VinDr-Mammo path: {VINDR_PATH}")
print(f"INbreast path: {INBREAST_PATH}")
print("\nDemo settings: pop_size=6, n_generations=5, max_epochs=3")
print("For production runs, increase these values in config.py")
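
One caveat with the approach above: `str.replace` silently does nothing when the search string is absent, so the cell still prints `[SUCCESS]` even if `config.py` uses different spacing or default values. A stricter variant might look like the sketch below. The helper `replace_exactly_once` is hypothetical and not part of the repository.

```python
def replace_exactly_once(text, old, new):
    """Replace `old` with `new`, raising if `old` does not occur exactly once."""
    count = text.count(old)
    if count != 1:
        raise ValueError(f"Expected exactly one match for {old!r}, found {count}")
    return text.replace(old, new)

# Example on an in-memory string rather than the real config.py
sample = "MAX_EPOCHS = 100\n"
updated = replace_exactly_once(sample, "MAX_EPOCHS = 100", "MAX_EPOCHS = 3  # Reduced for demo")
print(updated)
```

With this variant, re-running the configuration cell raises an error, because the original strings are no longer present after the first rewrite. That is usually the more informative behavior, but restore `config.py` (for example with `git checkout config.py`) before re-running.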

## 4. Verify Setup

Test that data loading works correctly.

In [None]:
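import importlib
import config

# config.py was rewritten on disk in the previous step; reload it in case
# an earlier import left stale values in this session
importlib.reload(config)

from optimization.nsga3_runner import load_metadata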

# Load VinDr-Mammo metadata
print("Loading VinDr-Mammo metadata...")
vindr_metadata = load_metadata(
    dataset_name="vindr",
    dataset_path=config.VINDR_MAMMO_PATH,
    dataset_config=config.VINDR_CONFIG
)
print(f"[OK] Loaded {len(vindr_metadata)} images")
print(f"     Patients: {vindr_metadata['patient_id'].nunique()}")
print(f"     Label distribution: {vindr_metadata['label'].value_counts().to_dict()}")

# Load INbreast metadata
print("\nLoading INbreast metadata...")
inbreast_metadata = load_metadata(
    dataset_name="inbreast",
    dataset_path=config.INBREAST_PATH,
    dataset_config=config.INBREAST_CONFIG
)
print(f"[OK] Loaded {len(inbreast_metadata)} images")
print(f"     Patients: {inbreast_metadata['patient_id'].nunique()}")
print(f"     Label distribution: {inbreast_metadata['label'].value_counts().to_dict()}")

print("\n[SUCCESS] Setup verification complete!")

## 5. Run NSGA-III Optimization

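This step optimizes 5 hyperparameters (learning rate, weight decay, dropout rate, augmentation strength, and unfreeze fraction) against 4 objectives (PR-AUC, AUROC, Brier score, and robustness degradation), saving a checkpoint after every generation.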

In [None]:
from optimization.nsga3_runner import NSGA3Runner
from data.dataset import create_train_val_split
from pathlib import Path

# Create train/val split
print("Creating train/validation split...")
train_metadata, val_metadata = create_train_val_split(vindr_metadata)

print(f"Train samples: {len(train_metadata)}")
print(f"Validation samples: {len(val_metadata)}")
print(f"Unique patients - Train: {train_metadata['patient_id'].nunique()}, "
      f"Val: {val_metadata['patient_id'].nunique()}")

# Create runner with checkpoint saving
print("\nInitializing NSGA-III runner...")
image_dir = str(Path(config.VINDR_MAMMO_PATH) / config.VINDR_CONFIG["image_dir"])
runner = NSGA3Runner(
    train_metadata=train_metadata,
    val_metadata=val_metadata,
    image_dir=image_dir,
    output_dir="./optimization_results",
    checkpoint_dir="./checkpoints",
    save_frequency=1  # Save every generation
)

print("\n" + "="*80)
print("STARTING OPTIMIZATION")
print("="*80)
print("This may take a while depending on:")
print("  - Population size (current: from config)")
print("  - Number of generations (current: from config)")
print("  - Dataset size")
print("  - GPU availability")
print("\nCheckpoints will be saved every generation.")
print("You can monitor progress in the checkpoints folder.")
print("="*80 + "\n")

# Run optimization
result = runner.run()

print("\n" + "="*80)
print("OPTIMIZATION COMPLETE!")
print("="*80)
print(f"Pareto front size: {len(result.F)}")
print(f"Results saved to: {runner.output_dir}")
print("="*80)
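
The cell above prints the number of unique patients in each split. If the split is meant to be patient-level (an assumption here, since the internals of `create_train_val_split` are not shown in this notebook), then no patient should appear on both sides. A minimal sketch of that check, using a hypothetical helper `patient_overlap`:

```python
def patient_overlap(train_ids, val_ids):
    """Return the sorted patient IDs that appear in both splits."""
    return sorted(set(train_ids) & set(val_ids))

# With the DataFrames above this would be called as:
#   patient_overlap(train_metadata["patient_id"], val_metadata["patient_id"])
example = patient_overlap(["P001", "P002", "P002"], ["P003", "P002"])
print(example)  # ['P002']
```

An empty list indicates that no patient contributes images to both the training and validation sets.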

## 6. Inspect Checkpoints

View saved checkpoints and load Pareto fronts from different generations.

In [None]:
# List all checkpoints
checkpoints = runner.list_checkpoints()
print(f"Found {len(checkpoints)} checkpoints:\n")

for i, checkpoint_path in enumerate(checkpoints):
    print(f"{i+1}. {checkpoint_path.name}")

# Load and display the latest checkpoint
if checkpoints:
    print("\nLoading latest checkpoint...")
    latest_checkpoint = runner.load_checkpoint(checkpoints[-1])
    
    # Get Pareto front from latest checkpoint
    pareto_df = runner.get_pareto_front_from_checkpoint(checkpoints[-1])
    print(f"\nPareto front at generation {latest_checkpoint['generation']}:")
    print(pareto_df)
else:
    print("No checkpoints found.")

## 7. Analyze Results

Load and visualize the final Pareto front.

In [None]:
import glob
import pandas as pd  # Needed here if the Option A cell (which imports pandas) was skipped
import matplotlib.pyplot as plt
import seaborn as sns

# Find most recent results file
pareto_files = sorted(glob.glob("optimization_results/pareto_solutions_*.csv"))
if not pareto_files:
    print("No results found. Please run optimization first.")
else:
    latest_results = pareto_files[-1]
    print(f"Loading results from: {latest_results}")
    
    results_df = pd.read_csv(latest_results)
    print(f"\nPareto front contains {len(results_df)} solutions")
    print("\nSummary statistics:")
    print(results_df.describe())
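
The cell above treats the lexicographically last file name as the most recent one. That holds when the names embed zero-padded timestamps, but the naming scheme is not shown in this notebook. An alternative that does not depend on the names is to pick the file with the newest modification time. The helper `latest_by_mtime` below is a hypothetical sketch.

```python
from pathlib import Path

def latest_by_mtime(directory, pattern):
    """Return the most recently modified file matching pattern, or None."""
    files = list(Path(directory).glob(pattern))
    return max(files, key=lambda p: p.stat().st_mtime) if files else None

# Usage with the output directory from Section 5:
#   latest_by_mtime("optimization_results", "pareto_solutions_*.csv")
```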

### Visualize Pareto Front

In [None]:
if pareto_files:
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    fig.suptitle('Pareto Front - Objective Space', fontsize=16)
    
    # Plot objective pairs
    objective_pairs = [
        ('pr_auc', 'auroc'),
        ('pr_auc', 'brier'),
        ('pr_auc', 'robustness_degradation'),
        ('auroc', 'brier'),
        ('auroc', 'robustness_degradation'),
        ('brier', 'robustness_degradation')
    ]
    
    for ax, (obj1, obj2) in zip(axes.flat, objective_pairs):
        ax.scatter(results_df[obj1], results_df[obj2], alpha=0.6, s=50)
        ax.set_xlabel(obj1.replace('_', ' ').title())
        ax.set_ylabel(obj2.replace('_', ' ').title())
        ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Plot hyperparameter distributions
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    fig.suptitle('Pareto Front - Hyperparameter Distributions', fontsize=16)
    
    hyperparams = ['learning_rate', 'weight_decay', 'dropout_rate', 
                   'augmentation_strength', 'unfreeze_fraction']
    
    for ax, param in zip(axes.flat, hyperparams):
        ax.hist(results_df[param], bins=20, alpha=0.7, edgecolor='black')
        ax.set_xlabel(param.replace('_', ' ').title())
        ax.set_ylabel('Frequency')
        ax.grid(True, alpha=0.3)
    
    # Hide the last subplot (we have 5 params, 6 subplots)
    axes.flat[-1].axis('off')
    
    plt.tight_layout()
    plt.show()
    
    # Identify extreme solutions
    print("\n" + "="*80)
    print("EXTREME SOLUTIONS (Best for each objective)")
    print("="*80)
    
    # Best PR-AUC
    best_pr_auc_idx = results_df['pr_auc'].idxmax()
    print(f"\nBest PR-AUC: {results_df.loc[best_pr_auc_idx, 'pr_auc']:.4f}")
    print(f"Solution ID: {results_df.loc[best_pr_auc_idx, 'solution_id']}")
    print("Hyperparameters:")
    for param in hyperparams:
        print(f"  {param}: {results_df.loc[best_pr_auc_idx, param]:.6f}")
    
    # Best AUROC
    best_auroc_idx = results_df['auroc'].idxmax()
    print(f"\nBest AUROC: {results_df.loc[best_auroc_idx, 'auroc']:.4f}")
    print(f"Solution ID: {results_df.loc[best_auroc_idx, 'solution_id']}")
    print("Hyperparameters:")
    for param in hyperparams:
        print(f"  {param}: {results_df.loc[best_auroc_idx, param]:.6f}")
    
    # Best Brier (lowest)
    best_brier_idx = results_df['brier'].idxmin()
    print(f"\nBest Brier Score: {results_df.loc[best_brier_idx, 'brier']:.4f}")
    print(f"Solution ID: {results_df.loc[best_brier_idx, 'solution_id']}")
    print("Hyperparameters:")
    for param in hyperparams:
        print(f"  {param}: {results_df.loc[best_brier_idx, param]:.6f}")
    
    # Best Robustness (lowest degradation)
    best_robust_idx = results_df['robustness_degradation'].idxmin()
    print(f"\nBest Robustness: {results_df.loc[best_robust_idx, 'robustness_degradation']:.4f}")
    print(f"Solution ID: {results_df.loc[best_robust_idx, 'solution_id']}")
    print("Hyperparameters:")
    for param in hyperparams:
        print(f"  {param}: {results_df.loc[best_robust_idx, param]:.6f}")
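
The extreme solutions above each optimize a single objective. What makes the full set a Pareto front is that no solution is dominated: none is matched or beaten on every objective by another solution while being strictly worse on at least one. The sketch below illustrates that definition in plain Python. It is not the repository's implementation (NSGA-III performs its own non-dominated sorting), and the numbers are made up for illustration. The objective directions follow the cell above: higher is better for `pr_auc` and `auroc`, lower is better for `brier` and `robustness_degradation`.

```python
# Direction of each objective, matching the idxmax/idxmin choices above
DIRECTIONS = {"pr_auc": "max", "auroc": "max",
              "brier": "min", "robustness_degradation": "min"}

def dominates(a, b, directions=DIRECTIONS):
    """True if a is at least as good as b on every objective and strictly better on one."""
    strictly_better = False
    for name, direction in directions.items():
        sign = 1 if direction == "max" else -1
        diff = sign * (a[name] - b[name])
        if diff < 0:
            return False  # a is worse than b on this objective
        if diff > 0:
            strictly_better = True
    return strictly_better

def non_dominated(solutions, directions=DIRECTIONS):
    """Keep only the solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(other, s, directions) for other in solutions)]

# Illustrative values only
a = {"pr_auc": 0.80, "auroc": 0.90, "brier": 0.10, "robustness_degradation": 0.05}
b = {"pr_auc": 0.70, "auroc": 0.85, "brier": 0.12, "robustness_degradation": 0.06}
c = {"pr_auc": 0.75, "auroc": 0.92, "brier": 0.15, "robustness_degradation": 0.04}
front = non_dominated([a, b, c])
print(len(front))  # 2
```

Here `b` is dropped because `a` is better on all four objectives, while `a` and `c` trade PR-AUC and calibration against AUROC and robustness. To apply the same check to the loaded results, pass `results_df.to_dict("records")`.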

## 8. Download Results

Zip the optimization results and checkpoints, then download the archive to your local machine.

In [None]:
# Create zip file with all results
!zip -r optimization_results.zip optimization_results/ checkpoints/

print("\n[SUCCESS] Results zipped!")
print("Download 'optimization_results.zip' using the file browser on the left.")

# Optionally, directly download using Colab files
# from google.colab import files
# files.download('optimization_results.zip')

## Next Steps

1. **Evaluate on INbreast (Zero-Shot Transfer)**
   - Use the best solution to evaluate on the target dataset
   - No fine-tuning or threshold adjustment

2. **Analyze Trade-offs**
   - Compare different Pareto solutions
   - Select based on your priorities (PR-AUC, AUROC, calibration, robustness)

3. **Production Runs**
   - Increase population size (e.g., 20)
   - Increase generations (e.g., 100)
   - Increase max epochs (e.g., 100)
   - Use full datasets

4. **Save to Google Drive**
   - Mount Google Drive and save results there for persistence
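
For step 4, one possible approach is to write a timestamped archive of the results directly into a mounted Drive folder, so that repeated runs do not overwrite each other. The sketch below uses only the standard library. The helper `archive_results` and the destination folder in the usage comment are hypothetical, and the code assumes Drive has already been mounted as in Option C.

```python
import shutil
from datetime import datetime
from pathlib import Path

def archive_results(source_dir, dest_dir):
    """Zip source_dir into dest_dir under a timestamped name and return the archive path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    base_name = dest_dir / f"{Path(source_dir).name}_{stamp}"
    return Path(shutil.make_archive(str(base_name), "zip", root_dir=source_dir))

# Usage (the destination folder name is an assumption):
#   archive_results("optimization_results", "/content/drive/MyDrive/mammography_runs")
```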

## Resources

- **GitHub Repository:** https://github.com/dtobi59/mammography-multiobjective-optimization
- **Documentation:** See README.md and other guides in the repository
- **Paper:** [Add your paper link here when published]

## Citation

If you use this code in your research, please cite:

```bibtex
@software{mammography_multiobjective_optimization,
  author = {David},
  title = {Multi-Objective Hyperparameter Optimization for Breast Cancer Classification},
  year = {2026},
  url = {https://github.com/dtobi59/mammography-multiobjective-optimization}
}
```