# 🤟 MANO - Colab Training Notebook

Train the Colombian Sign Language gesture classifier using Google Colab's free GPU.

## Prerequisites

Before running this notebook, you need to:

1. **Generate preprocessed tensors** locally:
   ```bash
   python -m src.cv_model.preprocessing
   ```
   This creates `data/processed/tensors.pth`.

2. **Upload to Google Drive** at `My Drive/Mano_data/tensors.pth`

3. **Push your code to GitHub** (for `src/` scripts)

## Workflow

1. Mount Google Drive
2. Clone repo → get `src/` scripts
3. Load preprocessed tensors → fast data loading, with no per-image decoding
4. Train model with GPU acceleration
5. Save models to Google Drive so they persist after the Colab session ends


In [None]:
# =============================================================================
# MANO - Colombian Sign Language Translator - Colab Training Notebook
# =============================================================================
# This notebook loads preprocessed tensors from Google Drive for fast training
# =============================================================================

# Install dependencies
%pip install torch torchvision mlflow scikit-learn pillow numpy matplotlib -q

# Verify GPU availability
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("⚠️ No GPU detected. Training will be slow on CPU.")


In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

import os

# =============================================================================
# ⚠️ CONFIGURATION - UPDATE THESE VALUES
# =============================================================================
REPO_URL = "https://github.com/davidrfb/Mano.git"  # Your GitHub repo URL
REPO_DIR = "/content/Mano"

# Preprocessed tensors file (generated locally with: python -m src.cv_model.preprocessing)
TENSOR_PATH = "/content/drive/MyDrive/Mano_data/tensors.pth"

# Where to save trained models
MODELS_DIR = "/content/drive/MyDrive/Mano/models"
# =============================================================================

# Verify tensor file exists
if os.path.exists(TENSOR_PATH):
    size_mb = os.path.getsize(TENSOR_PATH) / 1024 / 1024
    print(f"✅ Tensor file found: {TENSOR_PATH}")
    print(f"   Size: {size_mb:.1f} MB")
else:
    print(f"❌ Tensor file NOT found at: {TENSOR_PATH}")
    print("   Please run locally: python -m src.cv_model.preprocessing")
    print("   Then upload data/processed/tensors.pth to Google Drive")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
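
The check above only prints a warning, so later cells still run and fail with a less obvious error. A minimal sketch of a fail-fast variant follows; `require_file` is a hypothetical helper, not part of the repository.

```python
import os


def require_file(path: str) -> float:
    """Return the size of `path` in MiB, or raise if the file is missing.

    Hypothetical helper: it stops the notebook early instead of printing a warning.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"Tensor file not found at {path}. "
            "Run `python -m src.cv_model.preprocessing` locally and upload the result."
        )
    return os.path.getsize(path) / 1024 / 1024


# Usage in the notebook (TENSOR_PATH comes from the configuration cell):
# size_mb = require_file(TENSOR_PATH)
```

Raising here keeps the failure next to its cause, which is easier to debug than a `torch.load` error several cells later.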


In [None]:
# Clone repository (or pull if already exists)
if os.path.exists(REPO_DIR):
    print(f"Repository already exists at {REPO_DIR}")
    %cd {REPO_DIR}
    !git pull
else:
    print(f"Cloning repository to {REPO_DIR}...")
    !git clone {REPO_URL} {REPO_DIR}
    %cd {REPO_DIR}

print(f"\nCurrent directory: {os.getcwd()}")
!ls -la
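
The clone-or-pull step relies on notebook magics (`%cd`, `!git`). Outside Colab the same decision can be sketched in plain Python; `sync_command` is a hypothetical helper, and using `--ff-only` is an assumption about how you want pulls to behave.

```python
import os


def sync_command(repo_url: str, repo_dir: str) -> list[str]:
    """Return the git command that brings `repo_dir` up to date.

    Pulls when `repo_dir` already holds a git checkout, clones otherwise.
    """
    if os.path.isdir(os.path.join(repo_dir, ".git")):
        # --ff-only refuses to create merge commits in the working copy.
        return ["git", "-C", repo_dir, "pull", "--ff-only"]
    return ["git", "clone", repo_url, repo_dir]


# Usage (runs git, so it needs network access):
# import subprocess
# subprocess.run(sync_command(REPO_URL, REPO_DIR), check=True)
```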


In [None]:
# Load preprocessed tensors from Google Drive
# This is MUCH faster than pulling individual images via DVC

from torch.utils.data import TensorDataset, DataLoader

print(f"Loading tensors from {TENSOR_PATH}...")
data = torch.load(TENSOR_PATH, weights_only=False)

# Extract data
train_images, train_labels = data['train_images'], data['train_labels']
val_images, val_labels = data['val_images'], data['val_labels']
test_images, test_labels = data['test_images'], data['test_labels']
classes = data['classes']
num_classes = data['num_classes']

print(f"\n✅ Data loaded successfully!")
print(f"   Train: {train_images.shape} ({len(train_labels)} samples)")
print(f"   Val: {val_images.shape} ({len(val_labels)} samples)")
print(f"   Test: {test_images.shape} ({len(test_labels)} samples)")
print(f"   Classes ({num_classes}): {classes}")
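
The loading cell assumes the file holds a dictionary with eight specific keys. A small sanity check can catch a stale or partial upload before training starts. This is a sketch: `validate_tensor_dict` is a hypothetical helper, and it relies only on `len()`, which works for tensors and lists alike.

```python
REQUIRED_KEYS = (
    "train_images", "train_labels",
    "val_images", "val_labels",
    "test_images", "test_labels",
    "classes", "num_classes",
)


def validate_tensor_dict(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the dict looks sane."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in data]
    if problems:
        return problems
    # Each split must have exactly one label per image.
    for split in ("train", "val", "test"):
        n_images = len(data[f"{split}_images"])
        n_labels = len(data[f"{split}_labels"])
        if n_images != n_labels:
            problems.append(f"{split}: {n_images} images but {n_labels} labels")
    if len(data["classes"]) != data["num_classes"]:
        problems.append("num_classes does not match len(classes)")
    return problems


# Usage in the notebook:
# assert not validate_tensor_dict(data), validate_tensor_dict(data)
```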


In [None]:
# Data augmentation for training (applied on-the-fly to normalized tensors)
import torchvision.transforms.v2 as T

class AugmentedTensorDataset(torch.utils.data.Dataset):
    """Dataset that applies augmentation to pre-normalized tensors."""
    def __init__(self, images, labels, augment=False):
        self.images = images
        self.labels = labels
        self.augment = augment
        
        # Augmentation for normalized tensors (careful with intensity)
        self.transforms = T.Compose([
            T.RandomHorizontalFlip(p=0.3),
            T.RandomRotation(degrees=15),
            T.RandomAffine(degrees=0, translate=(0.1, 0.1), scale=(0.9, 1.1)),
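            # NOTE: ColorJitter assumes float images in [0, 1] and clamps its output to that
            # range, so on mean/std-normalized tensors it clips negative values to 0.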
            T.ColorJitter(brightness=0.15, contrast=0.15, saturation=0.15),
        ]) if augment else None
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        image = self.images[idx]
        label = self.labels[idx]
        
        if self.augment and self.transforms:
            image = self.transforms(image)
        
        return image, label

print("✅ Augmentation pipeline ready")
print("   Train augmentations: RandomHorizontalFlip, RandomRotation, RandomAffine, ColorJitter")
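
Color augmentation deserves care when the stored tensors are already normalized. Torchvision's color operations treat float images as living in [0, 1] and clamp their output to that range, so negative normalized values are clipped. The scalar sketch below illustrates the effect and the usual workaround of denormalizing, jittering, then renormalizing. The mean and std are the common ImageNet red-channel values and are an assumption, since the notebook does not state what the preprocessing step used.

```python
# Assumed ImageNet statistics for the red channel (not confirmed by the notebook).
MEAN, STD = 0.485, 0.229


def normalize(pixel: float) -> float:
    """Map a raw intensity in [0, 1] to normalized space."""
    return (pixel - MEAN) / STD


def denormalize(value: float) -> float:
    """Map a normalized value back to a raw intensity."""
    return value * STD + MEAN


def jitter_brightness(pixel: float, factor: float) -> float:
    """Scale brightness and clamp to [0, 1], mimicking float-image handling."""
    return min(max(pixel * factor, 0.0), 1.0)


dark_pixel = 0.10                   # raw intensity in [0, 1]
normalized = normalize(dark_pixel)  # negative, because 0.10 is below the mean

# Jitter applied directly to the normalized value: the negative value is clipped to 0.
direct = jitter_brightness(normalized, 1.1)

# Jitter applied in raw space, then renormalized: the value stays meaningful.
round_trip = normalize(jitter_brightness(denormalize(normalized), 1.1))

print(f"direct={direct:.3f}, round_trip={round_trip:.3f}")
```

If the round trip is too costly per sample, another option is to keep only the geometric augmentations on normalized tensors.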

In [None]:
# Imports and setup
import sys
from pathlib import Path
import json
import time
from datetime import datetime

import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
import mlflow
import mlflow.pytorch

# Add repo to path for model imports
sys.path.insert(0, REPO_DIR)
from src.cv_model.train import get_model, train_one_epoch, evaluate

# Create models directory
os.makedirs(MODELS_DIR, exist_ok=True)
print(f"Models will be saved to: {MODELS_DIR}")


In [None]:
# =============================================================================
# HYPERPARAMETER SEARCH CONFIGURATION
# =============================================================================
MODELS_TO_TRAIN = ["mobilenet_v2", "mobilenet_v3_small", "efficientnet_b0", "resnet18"]

# Learning rates to search (1e-3 is excluded because it was already run)
LEARNING_RATES = [5e-4, 3e-3, 5e-3]

# Batch sizes to try
BATCH_SIZES = [32, 64]

# Fixed hyperparameters
EPOCHS = 30
WEIGHT_DECAY = 1e-4
PATIENCE = 10
EXPERIMENT_NAME = "V2_moredata"

# Calculate total runs
total_runs = len(MODELS_TO_TRAIN) * len(LEARNING_RATES) * len(BATCH_SIZES)
print("=" * 60)
print("HYPERPARAMETER SEARCH")
print("=" * 60)
print(f"Models: {MODELS_TO_TRAIN}")
print(f"Learning rates: {LEARNING_RATES}")
print(f"Batch sizes: {BATCH_SIZES}")
print(f"Total experiments: {total_runs}")
print("=" * 60)

# Device setup
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {DEVICE}")

# MLflow setup
MLFLOW_TRACKING_URI = f"file://{MODELS_DIR}/mlruns"
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
mlflow.set_experiment(EXPERIMENT_NAME)
print(f"MLflow tracking URI: {MLFLOW_TRACKING_URI}")
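
The search is a full grid over models, learning rates, and batch sizes. As a sketch, the same grid can be enumerated with `itertools.product`, which is handy for previewing run names or resuming from a given index. The run-name format mirrors the one used in the training cell.

```python
from itertools import product

MODELS_TO_TRAIN = ["mobilenet_v2", "mobilenet_v3_small", "efficientnet_b0", "resnet18"]
LEARNING_RATES = [5e-4, 3e-3, 5e-3]
BATCH_SIZES = [32, 64]

# Same iteration order as three nested for-loops: model, then LR, then batch size.
grid = list(product(MODELS_TO_TRAIN, LEARNING_RATES, BATCH_SIZES))
run_names = [f"{model}_lr{lr}_bs{bs}" for model, lr, bs in grid]

print(len(grid))     # 4 models x 3 learning rates x 2 batch sizes = 24
print(run_names[0])  # mobilenet_v2_lr0.0005_bs32
```

With 30 epochs per run, 24 runs is a long session, so it may be worth checking the grid size against your Colab time budget before starting.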


In [None]:
# Store results for all experiments
all_results = []
run_count = 0

# Hyperparameter search: iterate over all combinations
for MODEL_NAME in MODELS_TO_TRAIN:
    for LEARNING_RATE in LEARNING_RATES:
        for BATCH_SIZE in BATCH_SIZES:
            run_count += 1
            print("\n" + "=" * 70)
            print(f"🚀 EXPERIMENT {run_count}/{total_runs}")
            print(f"   Model: {MODEL_NAME} | LR: {LEARNING_RATE} | Batch: {BATCH_SIZE}")
            print("=" * 70)
            
            # Create DataLoaders with current batch size and augmentation
            train_dataset = AugmentedTensorDataset(train_images, train_labels, augment=True)
            val_dataset = AugmentedTensorDataset(val_images, val_labels, augment=False)
            test_dataset = AugmentedTensorDataset(test_images, test_labels, augment=False)
            
            train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2, pin_memory=True)
            val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True)
            test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, pin_memory=True)
            
            # Create fresh model
            print(f"Initializing {MODEL_NAME} with pretrained weights...")
            model = get_model(MODEL_NAME, num_classes, pretrained=True)
            model = model.to(DEVICE)
            
            # Count parameters
            total_params = sum(p.numel() for p in model.parameters())
            trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
            print(f"Total parameters: {total_params:,}")
            print(f"Trainable parameters: {trainable_params:,}")
            
            # Fresh optimizer and scheduler
            criterion = nn.CrossEntropyLoss()
            optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
            scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS, eta_min=LEARNING_RATE / 100)
            
            # Training loop
            run_name = f"{MODEL_NAME}_lr{LEARNING_RATE}_bs{BATCH_SIZE}"
            best_val_loss = float('inf')  # Track loss (lower is better)
            best_val_acc = 0.0  # Still track for reporting
            epochs_without_improvement = 0
            best_checkpoint_path = None
            
            with mlflow.start_run(run_name=run_name):
                # Log parameters
                mlflow.log_params({
                    "model_name": MODEL_NAME,
                    "epochs": EPOCHS,
                    "batch_size": BATCH_SIZE,
                    "learning_rate": LEARNING_RATE,
                    "weight_decay": WEIGHT_DECAY,
                    "patience": PATIENCE,
                    "num_classes": num_classes,
                    "classes": ",".join(classes),
                    "optimizer": "AdamW",
                    "scheduler": "CosineAnnealingLR",
                    "device": str(DEVICE),
                    "pretrained": True,
                    "augmentation": True,
                    "train_samples": len(train_loader.dataset),
                    "val_samples": len(val_loader.dataset),
                    "test_samples": len(test_loader.dataset),
                    "total_params": total_params,
                    "trainable_params": trainable_params,
                })

                print("-" * 60)
                print(f"Training with augmentation enabled...")
                print("-" * 60)

                for epoch in range(1, EPOCHS + 1):
                    start_time = time.time()

                    # Train
                    train_loss, train_acc = train_one_epoch(
                        model, train_loader, criterion, optimizer, DEVICE
                    )

                    # Validate
                    val_loss, val_acc = evaluate(model, val_loader, criterion, DEVICE)

                    # Update scheduler
                    scheduler.step()
                    current_lr = scheduler.get_last_lr()[0]

                    # Log metrics to MLflow
                    mlflow.log_metrics({
                        "train_loss": train_loss,
                        "train_acc": train_acc,
                        "val_loss": val_loss,
                        "val_acc": val_acc,
                        "learning_rate": current_lr,
                    }, step=epoch)

                    # Logging
                    elapsed = time.time() - start_time
                    print(
                        f"Epoch {epoch:3d}/{EPOCHS} | "
                        f"Train Loss: {train_loss:.4f} Acc: {train_acc:.4f} | "
                        f"Val Loss: {val_loss:.4f} Acc: {val_acc:.4f} | "
                        f"LR: {current_lr:.6f} | "
                        f"Time: {elapsed:.1f}s"
                    )

                    # Save best model (based on validation LOSS - lower is better)
                    if val_loss < best_val_loss:
                        best_val_loss = val_loss
                        best_val_acc = val_acc  # Track best acc at best loss
                        epochs_without_improvement = 0

                        # Save checkpoint
                        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                        filename = f"{MODEL_NAME}_lr{LEARNING_RATE}_bs{BATCH_SIZE}_acc{val_acc:.2f}.pth"
                        filepath = Path(MODELS_DIR) / filename

                        checkpoint = {
                            "model_state_dict": model.state_dict(),
                            "optimizer_state_dict": optimizer.state_dict(),
                            "epoch": epoch,
                            "val_loss": val_loss,
                            "val_acc": val_acc,
                            "model_name": MODEL_NAME,
                            "learning_rate": LEARNING_RATE,
                            "batch_size": BATCH_SIZE,
                            "classes": classes,
                            "num_classes": num_classes,
                        }
                        torch.save(checkpoint, filepath)

                        # Log to MLflow
                        mlflow.log_artifact(str(filepath), artifact_path="checkpoints")
                        best_checkpoint_path = filepath
                        print(f"  ↳ New best! (loss: {val_loss:.4f})")
                    else:
                        epochs_without_improvement += 1

                    # Early stopping (based on validation loss)
                    if epochs_without_improvement >= PATIENCE:
                        print(f"\nEarly stopping at epoch {epoch} (val_loss not improving)")
                        mlflow.log_param("early_stopped_epoch", epoch)
                        break

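                # Final evaluation on test set, using the best checkpoint rather than the last epoch
                if best_checkpoint_path:
                    best_state = torch.load(best_checkpoint_path, map_location=DEVICE, weights_only=False)
                    model.load_state_dict(best_state["model_state_dict"])
                print("\nEvaluating on test set...")
                test_loss, test_acc = evaluate(model, test_loader, criterion, DEVICE)
                print(f"Test Loss: {test_loss:.4f} | Test Accuracy: {test_acc:.4f}")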

                # Log final metrics
                mlflow.log_metrics({
                    "best_val_loss": best_val_loss,
                    "best_val_acc": best_val_acc,
                    "test_loss": test_loss,
                    "test_acc": test_acc,
                })

                # Register best model in MLflow model registry
                if best_checkpoint_path:
                    mlflow.pytorch.log_model(
                        model,
                        artifact_path="model",
                        registered_model_name=f"lsc_{MODEL_NAME}",
                    )

                run_id = mlflow.active_run().info.run_id
                
            # Store results
            all_results.append({
                "model": MODEL_NAME,
                "lr": LEARNING_RATE,
                "batch_size": BATCH_SIZE,
                "best_val_loss": best_val_loss,
                "best_val_acc": best_val_acc,
                "test_acc": test_acc,
                "params": total_params,
                "run_id": run_id,
            })
            
            print(f"\n✅ Complete! Val Loss: {best_val_loss:.4f}, Val Acc: {best_val_acc:.4f}, Test Acc: {test_acc:.4f}")
            
            # Clear GPU memory
            del model, optimizer, scheduler, train_loader, val_loader, test_loader
            torch.cuda.empty_cache()

print("\n" + "=" * 70)
print("🏁 HYPERPARAMETER SEARCH COMPLETE!")
print("=" * 70)
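
The early-stopping rule in the loop above is spread across several variables. Isolated, it amounts to the sketch below; `EarlyStopper` is a hypothetical class, and the loss values are made up purely to exercise the logic.

```python
class EarlyStopper:
    """Stop when validation loss has not improved for `patience` epochs in a row."""

    def __init__(self, patience: int):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def update(self, val_loss: float) -> bool:
        """Record one epoch and return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


stopper = EarlyStopper(patience=3)
losses = [0.9, 0.5, 0.6, 0.55, 0.7, 0.4]  # illustrative values, not real results
stopped_at = None
for epoch, loss in enumerate(losses, start=1):
    if stopper.update(loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_loss)
```

In this example training stops at epoch 5, so the improvement at epoch 6 is never seen. That is the usual trade-off of a short patience, and one reason the notebook uses `PATIENCE = 10` for 30 epochs.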

In [None]:
# Summary comparison of all experiments
import pandas as pd

print("=" * 70)
print("📊 HYPERPARAMETER SEARCH RESULTS")
print("=" * 70)

results_df = pd.DataFrame(all_results)
# Sort by best_val_loss (ascending - lower is better)
results_df = results_df.sort_values("best_val_loss", ascending=True)

# Format for display
display_df = results_df.copy()
display_df['lr'] = display_df['lr'].apply(lambda x: f"{x:.0e}")
display_df['best_val_loss'] = display_df['best_val_loss'].apply(lambda x: f"{x:.4f}")
display_df['best_val_acc'] = display_df['best_val_acc'].apply(lambda x: f"{x:.4f}")
display_df['test_acc'] = display_df['test_acc'].apply(lambda x: f"{x:.4f}")
print(display_df[['model', 'lr', 'batch_size', 'best_val_loss', 'best_val_acc', 'test_acc']].to_string(index=False))

# Best configuration (lowest val_loss)
best = results_df.iloc[0]
print(f"\n🏆 BEST CONFIGURATION (by val_loss):")
print(f"   Model: {best['model']}")
print(f"   Learning Rate: {best['lr']}")
print(f"   Batch Size: {best['batch_size']}")
print(f"   Val Loss: {best['best_val_loss']:.4f}")
print(f"   Val Accuracy: {best['best_val_acc']:.4f}")
print(f"   Test Accuracy: {best['test_acc']:.4f}")

# Best per model
print(f"\n📈 BEST LR/BATCH PER MODEL (by val_loss):")
for model in MODELS_TO_TRAIN:
    model_results = results_df[results_df['model'] == model]
    if len(model_results) > 0:
        best_for_model = model_results.iloc[0]
        print(f"   {model}: LR={best_for_model['lr']:.0e}, BS={best_for_model['batch_size']}, Loss={best_for_model['best_val_loss']:.4f}, Acc={best_for_model['test_acc']:.4f}")


2025/11/27 21:08:55 INFO mlflow.tracking.fluent: Experiment with name 'colab_mobilenet_v3_small' does not exist. Creating a new experiment.


Using device: cuda
MLflow tracking URI: file:///content/drive/MyDrive/Mano/models/mlruns
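
The per-model summary relies on the results being sorted by `best_val_loss`, so that the first row for each model is its best run. A dependency-free sketch of the same selection follows; the rows are hypothetical placeholders, not results from this project.

```python
# Hypothetical rows in the same shape as `all_results`; the numbers are invented.
results = [
    {"model": "resnet18", "lr": 5e-4, "batch_size": 32, "best_val_loss": 0.21},
    {"model": "resnet18", "lr": 3e-3, "batch_size": 64, "best_val_loss": 0.35},
    {"model": "mobilenet_v2", "lr": 5e-4, "batch_size": 64, "best_val_loss": 0.18},
    {"model": "mobilenet_v2", "lr": 5e-3, "batch_size": 32, "best_val_loss": 0.40},
]

# Walk the rows from lowest to highest loss and keep the first one seen per model.
best_per_model = {}
for row in sorted(results, key=lambda r: r["best_val_loss"]):
    best_per_model.setdefault(row["model"], row)

overall_best = min(results, key=lambda r: r["best_val_loss"])

for model, row in best_per_model.items():
    print(f"{model}: LR={row['lr']:.0e}, BS={row['batch_size']}, Loss={row['best_val_loss']:.4f}")
```

Selecting by validation loss and only reporting test accuracy afterwards keeps the test set out of the model-selection decision.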


In [None]:
# View all MLflow runs for this experiment
print(f"MLflow tracking URI: {MLFLOW_TRACKING_URI}")
print(f"To view results locally: mlflow ui --backend-store-uri {MLFLOW_TRACKING_URI.replace('file://', '')}")

# List all runs from this experiment
experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
if experiment:
    runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])
    print(f"\n📋 All runs in '{EXPERIMENT_NAME}' ({len(runs)} total):")
    cols = ['params.model_name', 'params.learning_rate', 'params.batch_size', 
            'metrics.best_val_loss', 'metrics.best_val_acc', 'metrics.test_acc', 'status']
    available_cols = [c for c in cols if c in runs.columns]
    if available_cols:
        display_runs = runs[available_cols].copy()
        display_runs.columns = [c.split('.')[-1] for c in display_runs.columns]
        # Sort by best_val_loss if available
        if 'best_val_loss' in display_runs.columns:
            display_runs = display_runs.sort_values('best_val_loss', ascending=True)
        print(display_runs.to_string(index=False))
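
The cell above turns the tracking URI back into a path with a plain string replace. That works for the path used here, but it would not decode percent-escapes if the Drive folder name contained spaces. A slightly more defensive sketch using only the standard library follows; both helper names are hypothetical.

```python
from urllib.parse import quote, unquote, urlparse


def path_to_tracking_uri(path: str) -> str:
    """Build a file:// URI from an absolute POSIX path, escaping spaces and the like."""
    return "file://" + quote(path)


def tracking_uri_to_path(uri: str) -> str:
    """Recover the filesystem path from a file:// URI."""
    parsed = urlparse(uri)
    if parsed.scheme != "file":
        raise ValueError(f"not a file URI: {uri}")
    return unquote(parsed.path)


uri = path_to_tracking_uri("/content/drive/MyDrive/Mano/models/mlruns")
print(uri)
print(tracking_uri_to_path(uri))
```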


Loading data...
Loaded 1871 images from 26 classes
Split sizes - Train: 1309, Val: 281, Test: 281
Loaded 1871 images from 26 classes
Loaded 1871 images from 26 classes
Loaded 1871 images from 26 classes
Classes (26): ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Train batches: 41
Val batches: 9
Test batches: 9
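
The batch counts in this log follow from the split sizes. Assuming a batch size of 32 and the `DataLoader` default of keeping the final partial batch (`drop_last=False`), the number of batches is the sample count divided by the batch size, rounded up. The batch size is inferred from the counts and is not stated in the log.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a loader yields per epoch."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


split_sizes = {"train": 1309, "val": 281, "test": 281}  # from the log; they sum to 1871
for name, size in split_sizes.items():
    print(f"{name}: {num_batches(size, 32)} batches")
```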


Initializing mobilenet_v3_small with pretrained weights...
Downloading: "https://download.pytorch.org/models/mobilenet_v3_small-047dcff4.pth" to /root/.cache/torch/hub/checkpoints/mobilenet_v3_small-047dcff4.pth


100%|██████████| 9.83M/9.83M [00:00<00:00, 151MB/s]


Total parameters: 1,544,506
Trainable parameters: 1,544,506
Starting training...
Epoch   1/30 | Train Loss: 0.9120 Acc: 0.7762 | Val Loss: 0.2520 Acc: 0.9075 | LR: 0.000997 | Time: 14.7s
  ↳ New best! Saved to mobilenet_v3_small_v1_acc0.91_20251127_210912.pth
Epoch   2/30 | Train Loss: 0.0408 Acc: 0.9893 | Val Loss: 0.3104 Acc: 0.8897 | LR: 0.000989 | Time: 15.9s
Epoch   3/30 | Train Loss: 0.0217 Acc: 0.9954 | Val Loss: 0.2389 Acc: 0.9217 | LR: 0.000976 | Time: 14.4s
  ↳ New best! Saved to mobilenet_v3_small_v1_acc0.92_20251127_210942.pth
Epoch   4/30 | Train Loss: 0.0201 Acc: 0.9947 | Val Loss: 0.3572 Acc: 0.8790 | LR: 0.000957 | Time: 14.9s
Epoch   5/30 | Train Loss: 0.0137 Acc: 0.9954 | Val Loss: 0.1606 Acc: 0.9609 | LR: 0.000934 | Time: 14.4s
  ↳ New best! Saved to mobilenet_v3_small_v1_acc0.96_20251127_211012.pth
Epoch   6/30 | Train Loss: 0.0248 Acc: 0.9916 | Val Loss: 0.0758 Acc: 0.9751 | LR: 0.000905 | Time: 14.5s
  ↳ New best! Saved to mobilenet_v3_small_v1_acc0.98_202



Test Loss: 0.0000 | Test Accuracy: 1.0000


Successfully registered model 'lsc_mobilenet_v3_small'.
Created version '1' of model 'lsc_mobilenet_v3_small'.



Training complete! Best validation accuracy: 1.0000
Test accuracy: 1.0000
Models saved to: /content/drive/MyDrive/Mano/models
MLflow run ID: d4f96492d7224d4ba0f75676c44bcaca
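
The LR column in the training log above is consistent with the closed form of cosine annealing. The sketch below assumes a base learning rate of 1e-3, `eta_min` equal to the base rate divided by 100 (as in the training cell), and `T_max = 30`; the logged run may have used a different `eta_min`, which would not be visible at six decimal places.

```python
import math


def cosine_annealing_lr(epoch: int, base_lr: float, eta_min: float, t_max: int) -> float:
    """Learning rate after `epoch` scheduler steps under cosine annealing."""
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2


BASE_LR = 1e-3            # assumed from the log, not stated in it
ETA_MIN = BASE_LR / 100
T_MAX = 30

for epoch in (1, 2, 6, 30):
    lr = cosine_annealing_lr(epoch, BASE_LR, ETA_MIN, T_MAX)
    print(f"epoch {epoch:2d}: {lr:.6f}")
```

Under these assumptions the values for epochs 1, 2, and 6 match the logged 0.000997, 0.000989, and 0.000905, and the rate reaches `eta_min` at epoch 30.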


MLflow tracking URI: file:///content/drive/MyDrive/Mano/models/mlruns
To view results, download the folder: /content/drive/MyDrive/Mano/models/mlruns
Or run: mlflow ui --backend-store-uri /content/drive/MyDrive/Mano/models/mlruns

Recent runs:
                             run_id  metrics.val_acc  metrics.test_acc  \
0  d4f96492d7224d4ba0f75676c44bcaca              1.0               1.0   

     status  
0  FINISHED  
