# 🤟 MANO - Colab Training Notebook

Train the Colombian Sign Language gesture classifier using Google Colab's free GPU.

## Prerequisites

Before running this notebook, make sure you have:

1. **Pushed your code to GitHub** (or your git remote)
2. **Pushed your data to DVC** with Google Drive storage:
   ```bash
   # On your local machine (Windows)
   dvc add data/raw
   dvc push
   git add data/raw.dvc data/.gitignore
   git commit -m "Update dataset"
   git push
   ```
3. **Confirmed your DVC storage folder exists** in Google Drive at `dvc-storage/mano/`

## Workflow

1. Mount Google Drive
2. Clone repo → get `src/` scripts
3. Configure DVC → point to Google Drive storage
4. Pull data → DVC fetches from `dvc-storage/mano/`
5. Train model with GPU acceleration
6. Save models to Google Drive for persistence


In [None]:
# =============================================================================
# MANO - Colombian Sign Language Translator - Colab Training Notebook
# =============================================================================
# This notebook clones the repo, pulls data from Google Drive via DVC, and trains the model
# =============================================================================

# Install dependencies
%pip install torch torchvision mlflow scikit-learn opencv-python pillow numpy matplotlib dvc -q

# Verify GPU availability
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("⚠️ No GPU detected. Training will be slow on CPU.")


In [None]:
# Mount Google Drive (required for DVC storage)
from google.colab import drive
drive.mount('/content/drive')

# =============================================================================
# ⚠️ CONFIGURATION - UPDATE THESE VALUES
# =============================================================================
REPO_URL = "https://github.com/YOUR_USERNAME/Mano.git"  # TODO: Your GitHub repo URL

# DVC storage path in Google Drive
# Check your .dvc/config for the correct folder name (dvc-storage vs dvc_storage)
DVC_STORAGE = "/content/drive/MyDrive/dvc-storage/mano"

REPO_DIR = "/content/Mano"
# =============================================================================

# Verify DVC storage exists
import os
if os.path.exists(DVC_STORAGE):
    print(f"✅ DVC storage found at: {DVC_STORAGE}")
    print(f"   Contents: {os.listdir(DVC_STORAGE)[:5]}...")  # Show first 5 items
else:
    print(f"❌ DVC storage NOT found at: {DVC_STORAGE}")
    print("   Please check your Google Drive folder structure and update DVC_STORAGE path")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
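
The configuration comment warns that the Drive folder may be named `dvc-storage` or `dvc_storage`. A minimal sketch of probing both spellings instead of editing the path by hand; `find_dvc_storage` and its default arguments are hypothetical helpers for illustration, not part of the repo:

```python
from pathlib import Path
from typing import Iterable, Optional


def find_dvc_storage(
    drive_root: str,
    project: str = "mano",
    folder_names: Iterable[str] = ("dvc-storage", "dvc_storage"),
) -> Optional[Path]:
    """Return the first existing <drive_root>/<folder>/<project> directory, or None."""
    for name in folder_names:
        candidate = Path(drive_root) / name / project
        if candidate.is_dir():
            return candidate
    return None


# Possible use in Colab (Drive root assumed from the configuration cell):
# DVC_STORAGE = str(find_dvc_storage("/content/drive/MyDrive") or DVC_STORAGE)
```

Returning `None` rather than raising keeps the existing "NOT found" message as the single place where a missing folder is reported.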


In [None]:
# Clone repository (or pull if already exists)
if os.path.exists(REPO_DIR):
    print(f"Repository already exists at {REPO_DIR}")
    %cd {REPO_DIR}
    !git pull
else:
    print(f"Cloning repository to {REPO_DIR}...")
    !git clone {REPO_URL} {REPO_DIR}
    %cd {REPO_DIR}

print(f"\nCurrent directory: {os.getcwd()}")
!ls -la


In [None]:
# Configure DVC to use Google Drive storage (Colab path)
# This overrides the local Windows path in .dvc/config

!dvc remote modify gdrive url {DVC_STORAGE}
!dvc remote default gdrive

# Verify DVC config
print("DVC remote configuration:")
!dvc remote list
!cat .dvc/config


In [None]:
# Pull data from DVC
print("Pulling data from DVC storage...")
!dvc pull -v

# Verify data was pulled
DATA_DIR = f"{REPO_DIR}/data/raw"
if os.path.exists(DATA_DIR):
    classes = sorted([d for d in os.listdir(DATA_DIR) if os.path.isdir(os.path.join(DATA_DIR, d))])
    total_images = sum(len(os.listdir(os.path.join(DATA_DIR, c))) for c in classes)
    print("\n✅ Data pulled successfully!")
    print(f"   Classes: {len(classes)}")
    print(f"   Total images: {total_images}")
else:
    print(f"❌ Data directory not found at {DATA_DIR}")
    print("Please check your DVC configuration and storage path.")
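
The check above counts every file in each class folder, so stray non-image files would inflate the total, and a single total hides class imbalance. A small sketch of a per-class count, assuming the one-folder-per-class layout the cell already relies on; `count_images_per_class` and the extension list are illustrative assumptions:

```python
from pathlib import Path
from typing import Dict

# Assumed set of image extensions; adjust to match the dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def count_images_per_class(data_dir: str) -> Dict[str, int]:
    """Map each class subfolder name to the number of image files it contains."""
    counts = {}
    for class_dir in sorted(Path(data_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
            )
    return counts


# Possible use after `dvc pull`:
# counts = count_images_per_class(DATA_DIR)
# print(f"Smallest class: {min(counts.values())}, largest: {max(counts.values())}")
```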


In [None]:
import sys
from pathlib import Path
import json
import time
from datetime import datetime
from typing import Optional

import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision import models
from torch.utils.data import Dataset, DataLoader, Subset
from torchvision import transforms
from PIL import Image
from sklearn.model_selection import train_test_split
import mlflow
import mlflow.pytorch

# Add repo to path for imports
sys.path.insert(0, REPO_DIR)
from src.cv_model.preprocessing import create_dataloaders
from src.cv_model.train import get_model, train_one_epoch, evaluate, save_checkpoint


In [None]:
# Models directory - save to Google Drive for persistence
MODELS_DIR = "/content/drive/MyDrive/Mano/models"
os.makedirs(MODELS_DIR, exist_ok=True)
print(f"Models will be saved to: {MODELS_DIR}")

In [22]:
# Training hyperparameters
MODEL_NAME = "mobilenet_v3_small"  # Options: mobilenet_v2, mobilenet_v3_small, efficientnet_b0, resnet18
EPOCHS = 30
BATCH_SIZE = 32
LEARNING_RATE = 1e-3
WEIGHT_DECAY = 1e-4
PATIENCE = 10
EXPERIMENT_NAME = f"colab_{MODEL_NAME}"

# Device setup
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {DEVICE}")

# MLflow setup
MLFLOW_TRACKING_URI = f"file://{MODELS_DIR}/mlruns"
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
mlflow.set_experiment(EXPERIMENT_NAME)
print(f"MLflow tracking URI: {MLFLOW_TRACKING_URI}")


2025/11/27 21:08:55 INFO mlflow.tracking.fluent: Experiment with name 'colab_mobilenet_v3_small' does not exist. Creating a new experiment.


Using device: cuda
MLflow tracking URI: file:///content/drive/MyDrive/Mano/models/mlruns
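
`PATIENCE` bounds how many consecutive epochs may pass without a new best validation accuracy before the training loop stops. The same counter logic as a standalone sketch; `EarlyStopper` is an illustrative class, not something the repo provides, and the accuracy values are made up for the demo:

```python
class EarlyStopper:
    """Track the best validation accuracy and signal when patience runs out."""

    def __init__(self, patience: int) -> None:
        self.patience = patience
        self.best = 0.0
        self.stale_epochs = 0

    def update(self, val_acc: float) -> bool:
        """Record one epoch; return True if it set a new best accuracy."""
        if val_acc > self.best:
            self.best = val_acc
            self.stale_epochs = 0
            return True
        self.stale_epochs += 1
        return False

    @property
    def should_stop(self) -> bool:
        return self.stale_epochs >= self.patience


stopper = EarlyStopper(patience=2)  # small patience just for the demo
for epoch, acc in enumerate([0.90, 0.88, 0.92, 0.91, 0.90, 0.95], start=1):
    stopper.update(acc)
    if stopper.should_stop:
        print(f"Stopping at epoch {epoch}, best val_acc {stopper.best:.2f}")
        break
```

With a patience of 2 the demo stops at epoch 5 and never sees the later 0.95, which is the trade-off a larger value such as `PATIENCE = 10` is meant to soften.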


In [23]:
# Create dataloaders
print("Loading data...")
train_loader, val_loader, test_loader, num_classes, classes = create_dataloaders(
    data_dir=DATA_DIR,
    batch_size=BATCH_SIZE,
    num_workers=1
)

print(f"Classes ({num_classes}): {classes}")
print(f"Train batches: {len(train_loader)}")
print(f"Val batches: {len(val_loader)}")
print(f"Test batches: {len(test_loader)}")


Loading data...
Loaded 1871 images from 26 classes
Split sizes - Train: 1309, Val: 281, Test: 281
Loaded 1871 images from 26 classes
Loaded 1871 images from 26 classes
Loaded 1871 images from 26 classes
Classes (26): ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
Train batches: 41
Val batches: 9
Test batches: 9
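
The reported sizes work out to roughly a 70/15/15 split, and the batch counts match ceiling division by `BATCH_SIZE`. That suggests the last partial batch is kept rather than dropped, which is an inference from the numbers and not confirmed from `create_dataloaders`. A quick arithmetic check (`batches_per_epoch` is a hypothetical helper):

```python
import math


def batches_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches per epoch when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)


# Split sizes and batch size taken from the cell output above
total_images = 1871
split_sizes = {"train": 1309, "val": 281, "test": 281}

for name, size in split_sizes.items():
    share = size / total_images
    print(f"{name}: {share:.1%} of data, {batches_per_epoch(size, 32)} batches")
```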


In [24]:
# Create model
print(f"Initializing {MODEL_NAME} with pretrained weights...")
model = get_model(MODEL_NAME, num_classes, pretrained=True)
model = model.to(DEVICE)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

# Loss, optimizer, scheduler
criterion = nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS, eta_min=LEARNING_RATE / 100)

# Start MLflow run
run_name = f"{MODEL_NAME}_lr{LEARNING_RATE}_bs{BATCH_SIZE}"
best_val_acc = 0.0
epochs_without_improvement = 0
best_checkpoint_path = None

with mlflow.start_run(run_name=run_name):
    # Log parameters
    mlflow.log_params({
        "model_name": MODEL_NAME,
        "epochs": EPOCHS,
        "batch_size": BATCH_SIZE,
        "learning_rate": LEARNING_RATE,
        "weight_decay": WEIGHT_DECAY,
        "patience": PATIENCE,
        "num_classes": num_classes,
        "classes": ",".join(classes),
        "optimizer": "AdamW",
        "scheduler": "CosineAnnealingLR",
        "device": str(DEVICE),
        "pretrained": True,
        "train_samples": len(train_loader.dataset),
        "val_samples": len(val_loader.dataset),
        "test_samples": len(test_loader.dataset),
        "total_params": total_params,
        "trainable_params": trainable_params,
    })

    print("=" * 60)
    print("Starting training...")
    print("=" * 60)

    for epoch in range(1, EPOCHS + 1):
        start_time = time.time()

        # Train
        train_loss, train_acc = train_one_epoch(
            model, train_loader, criterion, optimizer, DEVICE
        )

        # Validate
        val_loss, val_acc = evaluate(model, val_loader, criterion, DEVICE)

        # Update scheduler
        scheduler.step()
        current_lr = scheduler.get_last_lr()[0]

        # Log metrics to MLflow
        mlflow.log_metrics({
            "train_loss": train_loss,
            "train_acc": train_acc,
            "val_loss": val_loss,
            "val_acc": val_acc,
            "learning_rate": current_lr,
        }, step=epoch)

        # Logging
        elapsed = time.time() - start_time
        print(
            f"Epoch {epoch:3d}/{EPOCHS} | "
            f"Train Loss: {train_loss:.4f} Acc: {train_acc:.4f} | "
            f"Val Loss: {val_loss:.4f} Acc: {val_acc:.4f} | "
            f"LR: {current_lr:.6f} | "
            f"Time: {elapsed:.1f}s"
        )

        # Save best model
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            epochs_without_improvement = 0

            # Save checkpoint
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"{MODEL_NAME}_v1_acc{val_acc:.2f}_{timestamp}.pth"
            filepath = Path(MODELS_DIR) / filename

            checkpoint = {
                "model_state_dict": model.state_dict(),
                "optimizer_state_dict": optimizer.state_dict(),
                "epoch": epoch,
                "val_acc": val_acc,
                "model_name": MODEL_NAME,
                "classes": classes,
                "num_classes": num_classes,
            }
            torch.save(checkpoint, filepath)

            # Save metadata
            metadata = {
                "model_name": MODEL_NAME,
                "epoch": epoch,
                "val_acc": val_acc,
                "classes": classes,
                "num_classes": num_classes,
                "timestamp": timestamp,
            }
            json_path = filepath.with_suffix(".json")
            with open(json_path, "w") as f:
                json.dump(metadata, f, indent=2)

            # Log to MLflow
            mlflow.log_artifact(str(filepath), artifact_path="checkpoints")
            mlflow.log_artifact(str(json_path), artifact_path="checkpoints")

            best_checkpoint_path = filepath
            print(f"  ↳ New best! Saved to {filepath.name}")
        else:
            epochs_without_improvement += 1

        # Early stopping
        if epochs_without_improvement >= PATIENCE:
            print(f"\nEarly stopping at epoch {epoch} (no improvement for {PATIENCE} epochs)")
            mlflow.log_param("early_stopped_epoch", epoch)
            break

    print("-" * 60)

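    # Final evaluation on test set, using the best checkpoint rather than
    # whatever weights the last epoch left in memory
    if best_checkpoint_path:
        # The file was written by this notebook, so full unpickling is acceptable here
        best_state = torch.load(best_checkpoint_path, map_location=DEVICE, weights_only=False)
        model.load_state_dict(best_state["model_state_dict"])
    print("\nEvaluating on test set...")
    test_loss, test_acc = evaluate(model, test_loader, criterion, DEVICE)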
    print(f"Test Loss: {test_loss:.4f} | Test Accuracy: {test_acc:.4f}")

    # Log final metrics
    mlflow.log_metrics({
        "best_val_acc": best_val_acc,
        "test_loss": test_loss,
        "test_acc": test_acc,
    })

    # Log the model to MLflow
    if best_checkpoint_path:
        mlflow.pytorch.log_model(
            model,
            artifact_path="model",
            registered_model_name=f"lsc_{MODEL_NAME}",
        )

    print("\n" + "=" * 60)
    print(f"Training complete! Best validation accuracy: {best_val_acc:.4f}")
    print(f"Test accuracy: {test_acc:.4f}")
    print(f"Models saved to: {MODELS_DIR}")
    print(f"MLflow run ID: {mlflow.active_run().info.run_id}")
    print("=" * 60)


Initializing mobilenet_v3_small with pretrained weights...
Downloading: "https://download.pytorch.org/models/mobilenet_v3_small-047dcff4.pth" to /root/.cache/torch/hub/checkpoints/mobilenet_v3_small-047dcff4.pth


100%|██████████| 9.83M/9.83M [00:00<00:00, 151MB/s]


Total parameters: 1,544,506
Trainable parameters: 1,544,506
Starting training...
Epoch   1/30 | Train Loss: 0.9120 Acc: 0.7762 | Val Loss: 0.2520 Acc: 0.9075 | LR: 0.000997 | Time: 14.7s
  ↳ New best! Saved to mobilenet_v3_small_v1_acc0.91_20251127_210912.pth
Epoch   2/30 | Train Loss: 0.0408 Acc: 0.9893 | Val Loss: 0.3104 Acc: 0.8897 | LR: 0.000989 | Time: 15.9s
Epoch   3/30 | Train Loss: 0.0217 Acc: 0.9954 | Val Loss: 0.2389 Acc: 0.9217 | LR: 0.000976 | Time: 14.4s
  ↳ New best! Saved to mobilenet_v3_small_v1_acc0.92_20251127_210942.pth
Epoch   4/30 | Train Loss: 0.0201 Acc: 0.9947 | Val Loss: 0.3572 Acc: 0.8790 | LR: 0.000957 | Time: 14.9s
Epoch   5/30 | Train Loss: 0.0137 Acc: 0.9954 | Val Loss: 0.1606 Acc: 0.9609 | LR: 0.000934 | Time: 14.4s
  ↳ New best! Saved to mobilenet_v3_small_v1_acc0.96_20251127_211012.pth
Epoch   6/30 | Train Loss: 0.0248 Acc: 0.9916 | Val Loss: 0.0758 Acc: 0.9751 | LR: 0.000905 | Time: 14.5s
  ↳ New best! Saved to mobilenet_v3_small_v1_acc0.98_202



Test Loss: 0.0000 | Test Accuracy: 1.0000


Successfully registered model 'lsc_mobilenet_v3_small'.
Created version '1' of model 'lsc_mobilenet_v3_small'.



Training complete! Best validation accuracy: 1.0000
Test accuracy: 1.0000
Models saved to: /content/drive/MyDrive/Mano/models
MLflow run ID: d4f96492d7224d4ba0f75676c44bcaca
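
The `LR` column in the log can be reproduced from the closed form of cosine annealing, `eta_min + 0.5 * (base_lr - eta_min) * (1 + cos(pi * t / T_max))`. A short sketch using the notebook's settings (`LEARNING_RATE = 1e-3`, `eta_min = LEARNING_RATE / 100`, `T_max = EPOCHS = 30`); the function name is illustrative, and this is a plain-Python check rather than a call into PyTorch:

```python
import math


def cosine_annealing_lr(step: int, base_lr: float, eta_min: float, t_max: int) -> float:
    """Closed-form cosine-annealed learning rate after `step` scheduler steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))


# Compare against the LR values printed for epochs 1 to 6 in the log above
for epoch in range(1, 7):
    print(epoch, f"{cosine_annealing_lr(epoch, 1e-3, 1e-5, 30):.6f}")
```

Because `scheduler.step()` runs before the rate is read, the value logged for epoch `t` is the rate after `t` steps, so epoch 1 already shows 0.000997 rather than the initial 0.001.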


In [25]:
# Show where the MLflow results live (this cell does not start the UI)
# Note: to open the MLflow UI from Colab you need a tunnel such as ngrok,
# or download the mlruns folder and view it locally

print(f"MLflow tracking URI: {MLFLOW_TRACKING_URI}")
print(f"To view results, download the folder: {MODELS_DIR}/mlruns")
print("Or run: mlflow ui --backend-store-uri", MLFLOW_TRACKING_URI.replace("file://", ""))

# Optionally, list recent runs
import mlflow
experiment = mlflow.get_experiment_by_name(EXPERIMENT_NAME)
if experiment:
    runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id], max_results=5)
    print("\nRecent runs:")
    print(runs[['run_id', 'metrics.val_acc', 'metrics.test_acc', 'status']].head())


MLflow tracking URI: file:///content/drive/MyDrive/Mano/models/mlruns
To view results, download the folder: /content/drive/MyDrive/Mano/models/mlruns
Or run: mlflow ui --backend-store-uri /content/drive/MyDrive/Mano/models/mlruns

Recent runs:
                             run_id  metrics.val_acc  metrics.test_acc  \
0  d4f96492d7224d4ba0f75676c44bcaca              1.0               1.0   

     status  
0  FINISHED  
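
Each checkpoint saved during training has a `.json` sidecar with its `val_acc`, so the best one can be located later without loading any weights. A minimal sketch under that assumption; `find_best_checkpoint` is a hypothetical helper, not part of the repo:

```python
import json
from pathlib import Path
from typing import Optional, Tuple


def find_best_checkpoint(models_dir: str) -> Optional[Tuple[Path, dict]]:
    """Return (checkpoint_path, metadata) for the highest val_acc sidecar, or None."""
    best = None
    for json_path in Path(models_dir).glob("*.json"):
        with open(json_path) as f:
            metadata = json.load(f)
        checkpoint_path = json_path.with_suffix(".pth")
        if "val_acc" not in metadata or not checkpoint_path.exists():
            continue  # skip unrelated JSON files and metadata without weights
        if best is None or metadata["val_acc"] > best[1]["val_acc"]:
            best = (checkpoint_path, metadata)
    return best


# Possible use in a later session (names taken from the training cell):
# best = find_best_checkpoint(MODELS_DIR)
# if best:
#     path, meta = best
#     checkpoint = torch.load(path, map_location="cpu")
#     model = get_model(meta["model_name"], meta["num_classes"], pretrained=False)
#     model.load_state_dict(checkpoint["model_state_dict"])
```

The glob is not recursive, so JSON files inside the `mlruns` folder under the same directory are ignored.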
