# MOSAIC: Training and Evaluation Demo

This notebook demonstrates how to use the MOSAIC framework for training and evaluating models on radiological report classification. We'll be using the MIMIC dataset example and the Mosaic-4B model from Hugging Face.

## Prerequisites

Before running this notebook, make sure you have:
1. Created and activated the conda environment:
```bash
conda env create -f environment.yml
conda activate mosaic
```

2. Installed the MOSAIC package:
```bash
pip install -e .
```

Note: If you're running this in a Jupyter notebook, select the `mosaic` environment as the notebook kernel and restart the kernel after creating the environment, so that the imports below resolve against it. You can restart by clicking "Kernel" > "Restart & Run All" in the menu.
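
To confirm that the kernel really is running inside the new environment, a quick check along the following lines can help. This is a minimal sketch that is not part of MOSAIC: `find_missing` is a hypothetical helper, and the package list is assumed from the imports used later in this notebook.

```python
import importlib.util
import sys

def find_missing(packages):
    """Return the names in `packages` that this interpreter cannot import."""
    return [name for name in packages if importlib.util.find_spec(name) is None]

# Assumed list of top-level modules, taken from the imports used in this notebook
required = ["torch", "transformers", "datasets", "peft", "mosaic"]

print(f"Interpreter: {sys.executable}")
missing = find_missing(required)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

If the interpreter path does not point into the `mosaic` environment, the kernel selection is the likely culprit rather than the installation itself.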

## Directory Structure

All outputs from this notebook will be organized in a `mosaic_output` directory with the following structure:

```
mosaic_output/
├── checkpoints/    # Training checkpoints
├── model/         # Final saved model
├── logs/          # Training logs
└── results/       # Evaluation results
```

## Model Information

We'll be using the `AliceSch/mosaic-4b` model from Hugging Face Hub, which is specifically designed for medical text classification tasks. The model will be fine-tuned using LoRA (Low-Rank Adaptation) to make training more efficient.

In [3]:
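# Verify the Python environment and print key package versions
import sys
import os
import torch
import transformers
import datasets
import peft

print(f"Python: {sys.version.split()[0]}")
for module in (torch, transformers, datasets, peft):
    print(f"{module.__name__}: {module.__version__}")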

## 1. Setup Environment and Dependencies

First, let's import the MOSAIC utilities, check whether a GPU is available, and initialize the configuration and dataset loaders.

In [2]:
import os
import torch

# Import MOSAIC utilities from our installed package
from mosaic import ConfigLoader, DatasetLoader
from mosaic.core import finetune, evals
from transformers import TrainingArguments, Trainer

# Check if CUDA is available
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Device: {torch.cuda.get_device_name(0)}")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Initialize configuration
config_loader = ConfigLoader()
dataset_loader = DatasetLoader(config_loader)

Note: If this cell fails with `ModuleNotFoundError: No module named 'torch'`, the kernel's environment is missing the dependencies. Install them (see Section 2), restart the kernel, and re-run the cell.

In [None]:
# Create output directory structure
output_dir = "mosaic_output"
checkpoints_dir = os.path.join(output_dir, "checkpoints")
saved_model_dir = os.path.join(output_dir, "model")
logs_dir = os.path.join(output_dir, "logs")
results_dir = os.path.join(output_dir, "results")

# Create directories
for dir_path in [output_dir, checkpoints_dir, saved_model_dir, logs_dir, results_dir]:
    os.makedirs(dir_path, exist_ok=True)
    print(f"Created directory: {dir_path}")

## 2. Install Required Packages

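Let's install the required packages. The cell below calls `pip` through the kernel's own interpreter (`sys.executable`), so the packages are installed into the environment the notebook is actually running in. All packages except `torch` and `pygments` are pinned to exact versions. If any of them were already imported in this session, restart the kernel after installation.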

In [None]:
# Install required packages using pip
import subprocess
import sys

def install_package(package):
    print(f"Installing {package}...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Install required packages
packages = [
    "torch",
    "transformers==4.51.3",
    "datasets==3.5.0",
    "peft==0.15.1",
    "accelerate==1.6.0",
    "bitsandbytes==0.46.0",
    "wandb==0.19.9",
    "pygments"
]

for package in packages:
    install_package(package)

print("\nPackages installed successfully!")
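
After installation, it can be useful to confirm that the installed versions match the pins. The following is a minimal sketch using only the standard library. `check_pins` is a hypothetical helper, not part of MOSAIC, and the pins are copied from the list above.

```python
from importlib import metadata

def check_pins(pins):
    """Compare installed versions against a {package: expected_version} mapping.

    Returns a dict mapping each mismatched package to (expected, installed),
    where installed is None if the package is not installed at all.
    """
    problems = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            problems[name] = (expected, installed)
    return problems

# Pinned versions copied from the install cell
pins = {
    "transformers": "4.51.3",
    "datasets": "3.5.0",
    "peft": "0.15.1",
    "accelerate": "1.6.0",
    "bitsandbytes": "0.46.0",
    "wandb": "0.19.9",
}

problems = check_pins(pins)
for name, (expected, installed) in problems.items():
    print(f"{name}: expected {expected}, found {installed}")
if not problems:
    print("All pinned packages match.")
```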

## 3. Load and Prepare Dataset

We'll now load the MIMIC dataset from the data directory. The dataset is in CSV format and contains radiological reports with their corresponding labels.

In [None]:
# Load the pre-processed MIMIC dataset using our DatasetLoader
dataset = {
    'train': dataset_loader.load_datasets('mimic', split='train'),
    'val': dataset_loader.load_datasets('mimic', split='val'),
    'test': dataset_loader.load_datasets('mimic', split='test')
}

print("Dataset structure:")
print(f"Train split: {len(dataset['train'])} examples")
print(f"Validation split: {len(dataset['val'])} examples")
print(f"Test split: {len(dataset['test'])} examples")

# Display a sample from the training set
print("\nSample from training set:")
print(dataset['train'][0])
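
Before training, it is worth checking how the labels are distributed, since a strongly imbalanced split affects how the metrics should be read. The sketch below assumes each example is a dict with a `label` field (as used in the inference section of this notebook). `label_distribution` is a hypothetical helper, and the toy examples are placeholders, not MIMIC data.

```python
from collections import Counter

def label_distribution(examples, label_key="label"):
    """Return {label: (count, fraction)} for an iterable of example dicts."""
    counts = Counter(example[label_key] for example in examples)
    total = sum(counts.values())
    return {label: (count, count / total) for label, count in sorted(counts.items())}

# Placeholder examples standing in for a dataset split
toy_split = [
    {"text": "placeholder report A", "label": 0},
    {"text": "placeholder report B", "label": 1},
    {"text": "placeholder report C", "label": 1},
    {"text": "placeholder report D", "label": 0},
    {"text": "placeholder report E", "label": 1},
]

for label, (count, fraction) in label_distribution(toy_split).items():
    print(f"label {label}: {count} examples ({fraction:.1%})")
```

With the real data, the same call would be `label_distribution(dataset['train'])`, assuming the loader returns an iterable of dict-like examples.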

## 4. Initialize Mosaic-4B Model

Now we'll initialize the Mosaic-4B model from Hugging Face and configure it for fine-tuning using LoRA (Low-Rank Adaptation) to make the training more efficient.

In [None]:
# Load configurations
model_config = config_loader.load_yaml('models')
peft_config = config_loader.load_yaml('peft')

# Initialize model using the finetune utility
model, tokenizer = finetune.model_init(
    model_tag="AliceSch/mosaic-4b",
    model_config=model_config,
    peft_config=peft_config
)

print("Model initialized with configuration:")
print(f"Model family: {model_config.get('model_family', 'gemma')}")
print(f"Max sequence length: {model_config.get('max_seq_length', 512)}")
print(f"LoRA rank: {model_config.get('lora_rank', 16)}")
print(f"Batch size: {model_config.get('batch_size', 4)}")
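
To see why LoRA makes training cheaper, compare the number of trainable parameters for a single weight matrix. Instead of updating a full `d_out x d_in` matrix, LoRA trains two low-rank factors of shapes `d_out x r` and `r x d_in`. The sketch below is plain arithmetic. The 2560 x 2560 layer size is a hypothetical value chosen for illustration, not a property of Mosaic-4B, and the rank of 16 is the fallback default printed above.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning updates the whole d_out x d_in matrix. LoRA freezes it
    and trains two factors instead: B (d_out x rank) and A (rank x d_in).
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora

# Hypothetical square projection layer with LoRA rank 16
full, lora = lora_param_counts(2560, 2560, rank=16)
print(f"Full fine-tuning: {full:,} parameters")
print(f"LoRA (r=16):      {lora:,} parameters ({lora / full:.2%} of full)")
```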

## 5. Fine-tune Model

We'll now set up the training arguments and start the fine-tuning process. We'll use a small number of epochs for demonstration purposes.

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Define compute_metrics first so it can be passed to the Trainer below
def compute_metrics(eval_pred):
    # Convert logits to predicted class ids
    predictions = eval_pred.predictions.argmax(-1)
    labels = eval_pred.label_ids

    # 'weighted' averages per-class scores by class support;
    # zero_division=0 scores classes that are never predicted as 0
    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average='weighted', zero_division=0
    )

    return {
        'accuracy': accuracy,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# Define training arguments
training_args = TrainingArguments(
    output_dir=checkpoints_dir,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=2e-4,
    weight_decay=0.01,
    logging_dir=logs_dir,
    logging_steps=10,
    eval_strategy="epoch",  # called `evaluation_strategy` in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1"
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['val'],
    compute_metrics=compute_metrics,
)

print("\nTraining configuration:")
print(f"Number of epochs: {training_args.num_train_epochs}")
print(f"Learning rate: {training_args.learning_rate}")
print(f"Weight decay: {training_args.weight_decay}")
print(f"Batch size: {training_args.per_device_train_batch_size}")

# Start training
print("\nStarting training...")
trainer_stats = trainer.train()
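
The `f1` value used to select the best checkpoint is a support-weighted average of per-class F1 scores. The following sketch recomputes that quantity by hand on a tiny made-up example to show what the number means. `weighted_f1` is a hypothetical helper for illustration only, and the labels are toy values, not model output.

```python
from collections import Counter

def weighted_f1(labels, predictions):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(labels)
    total = len(labels)
    score = 0.0
    for cls, n_true in support.items():
        # True positives and number of predictions for this class
        tp = sum(1 for y, p in zip(labels, predictions) if y == cls and p == cls)
        n_pred = sum(1 for p in predictions if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Each class contributes in proportion to how often it occurs
        score += (n_true / total) * f1
    return score

toy_labels = [0, 0, 0, 1]
toy_predictions = [0, 0, 1, 1]
print(f"Weighted F1: {weighted_f1(toy_labels, toy_predictions):.4f}")
```

Because each class is weighted by its frequency, a model can reach a high weighted F1 while doing poorly on rare classes, which is why the per-class report in the next section is worth inspecting as well.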

## 6. Evaluate Model Performance

Now let's evaluate the model's performance on the test set.

In [None]:
# Evaluate on the test set with the Trainer
print("Evaluating on test set...")
test_results = trainer.evaluate(dataset['test'])

# Print test results
for metric, value in test_results.items():
    print(f"{metric}: {value:.4f}")

# Get raw predictions for a per-class breakdown
predictions = trainer.predict(dataset['test'])
preds = predictions.predictions.argmax(-1)
labels = predictions.label_ids

# Per-class precision, recall, and F1
from sklearn.metrics import classification_report
report = classification_report(labels, preds, target_names=dataset['test'].features['label'].names)
print("\nDetailed Classification Report:")
print(report)

## 7. Run Inference Examples

Finally, let's save the fine-tuned model and demonstrate how to use it for inference on a few example reports from the test set.

In [None]:
# Save the model using the finetune utility
# (assumes save_model is exposed by mosaic.core.finetune, imported above)
finetune.save_model(model, tokenizer, saved_model_dir, model_config)
print(f"Model saved to {saved_model_dir}")

def predict_report(text):
    # Tokenize the input
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(device)
    
    # Get prediction in eval mode (disables dropout) without tracking gradients
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Get predicted class
    predicted_class = outputs.logits.argmax(-1).item()
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    
    # Get class name and probability
    class_name = dataset['test'].features['label'].names[predicted_class]
    probability = probabilities[0][predicted_class].item()
    
    return class_name, probability

# Example reports from test set
example_reports = dataset['test'].select(range(3))

print("Running inference on example reports:")
for example in example_reports:
    text = example['text']
    true_label = dataset['test'].features['label'].names[example['label']]
    
    predicted_label, confidence = predict_report(text)
    
    print(f"\nReport text (truncated): {text[:200]}...")
    print(f"True label: {true_label}")
    print(f"Predicted label: {predicted_label}")
    print(f"Confidence: {confidence:.2%}")
    print("-" * 80)
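
The confidence reported above is the softmax probability of the predicted class. The sketch below shows that computation in plain Python on hypothetical logits, so the relationship between raw scores and the printed percentage is explicit. The logit values are made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three classes
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)

# The predicted class is the index with the highest probability
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(f"Predicted class: {predicted_class}, confidence: {probs[predicted_class]:.2%}")
```

Softmax confidence is not a calibrated probability, so a high value should be read as a relative preference among classes rather than a guarantee of correctness.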

## 8. Cleanup

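Clean up the generated files and directories to free up space. Note that this deletes the entire `mosaic_output` directory, including the saved model and evaluation results, so copy anything you want to keep before running the cell.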

In [None]:
import shutil

# Clean up the output directory
if os.path.exists("mosaic_output"):
    shutil.rmtree("mosaic_output")
    print("Removed mosaic_output directory")

# Clear CUDA cache
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("Cleared CUDA cache")

print("Cleanup complete!")
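
If you would rather keep the final model and results, a more selective cleanup is possible. The sketch below removes only the named subdirectories and is demonstrated on a temporary copy of the directory layout, so running it deletes nothing real. `remove_subdirs` is a hypothetical helper, not part of MOSAIC.

```python
import shutil
import tempfile
from pathlib import Path

def remove_subdirs(root, names):
    """Delete only the named subdirectories of `root`; return the ones removed."""
    removed = []
    for name in names:
        target = Path(root) / name
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(name)
    return removed

# Demonstrate on a throwaway copy of the output layout
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "mosaic_output"
    for name in ("checkpoints", "model", "logs", "results"):
        (root / name).mkdir(parents=True)

    removed = remove_subdirs(root, ["checkpoints", "logs"])
    remaining = sorted(p.name for p in root.iterdir())

print(f"Removed: {removed}")
print(f"Kept: {remaining}")
```

On the real output directory, the equivalent call would be `remove_subdirs("mosaic_output", ["checkpoints", "logs"])`, which frees the space used by intermediate checkpoints while leaving `model/` and `results/` in place.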