# SparsAE Training on Google Colab

## 🎯 Quick Start Guide

1. **Runtime Setup:** Runtime → Change runtime type → GPU (T4, V100, or A100)
2. **Run all cells** in order
3. **Monitor training** in the output

---

### Expected Performance:
- **T4 (Free/Pro):** 49M-125M models, ~2-3 hrs
- **V100 (Pro):** 125M-350M models, ~1-2 hrs  
- **A100 (Pro+):** 350M-774M models, <1 hr
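
The pairing of model sizes with GPU tiers is largely a memory question. As a rough sketch, assuming fp32 weights and an Adam-style optimizer (the notebook does not say which precision or optimizer the training script uses), weights, gradients, and two optimizer moments cost about 16 bytes per parameter before any activations. The helper name `estimate_training_state_gb` is hypothetical.

```python
def estimate_training_state_gb(n_params: int, bytes_per_param: int = 16) -> float:
    """Rough memory for weights + gradients + Adam moments, excluding activations.

    Assumes fp32 everywhere: 4 B weights + 4 B gradients + 8 B optimizer state.
    """
    return n_params * bytes_per_param / 1e9


for name, n_params in [("49M", 49_000_000), ("125M", 125_000_000),
                       ("350M", 350_000_000), ("774M", 774_000_000)]:
    print(f"{name}: ~{estimate_training_state_gb(n_params):.1f} GB before activations")
```

Activation memory grows with batch size and sequence length and is often the larger share, so treat these numbers as a lower bound rather than a fit guarantee.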

---

In [None]:
# 1. Check GPU availability
!nvidia-smi

import torch
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

In [None]:
# 2. Clone your repository
!git clone https://github.com/codenlighten/ai-algo-agents.git
%cd ai-algo-agents

In [None]:
# 3. Install dependencies
!pip install -q torch torchvision torchaudio
!pip install -q transformers datasets tokenizers
!pip install -q numpy matplotlib tqdm

print("\n✅ Dependencies installed!")

In [None]:
# 4. Verify installation
import torch
import transformers
import datasets

print(f"✅ PyTorch: {torch.__version__}")
print(f"✅ Transformers: {transformers.__version__}")
print(f"✅ Datasets: {datasets.__version__}")
print(f"✅ CUDA: {torch.cuda.is_available()}")
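
The verification cell stops at the first failed import. A minimal alternative sketch, using only the standard library's `importlib.metadata`, reports every package in one pass and marks missing ones instead of raising. The helper name `installed_versions` is hypothetical, and the package list simply mirrors the `pip install` cell.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(
        ["torch", "transformers", "datasets", "tokenizers"]).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

This only checks that a distribution is installed; it does not confirm that the package imports cleanly or that CUDA is usable.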

## 🔧 Configuration

Choose your experiment configuration:

In [None]:
# 5. Configuration
# Adjust these based on your GPU:
# - T4 (15GB): model_size="tiny" (49M), batch_size=8
# - V100 (16GB): model_size="small" (125M), batch_size=16
# - A100 (40GB): model_size="medium" (350M), batch_size=32

CONFIG = {
    "model_size": "small",  # "tiny" (49M), "small" (125M), "medium" (350M)
    "batch_size": 8,
    "max_steps": 10000,
    "sparsity": 0.8,  # 80% sparse
    "save_checkpoints": True,
    "checkpoint_interval": 1000,
}

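# Auto-detect optimal settings
if torch.cuda.is_available():
    gpu_memory_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
else:
    # No GPU attached: fall through to the most conservative settings below
    gpu_memory_gb = 0.0
print(f"GPU Memory: {gpu_memory_gb:.1f} GB")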

if gpu_memory_gb >= 35:  # A100
    print("🚀 Detected A100-class GPU - optimal for 350M models")
    if CONFIG["model_size"] == "medium":
        CONFIG["batch_size"] = 32
elif gpu_memory_gb >= 14:  # V100/T4
    print("⚡ Detected V100/T4-class GPU - optimal for 125M models")
    if CONFIG["model_size"] == "small":
        CONFIG["batch_size"] = 12
else:
    print("⚠️ Limited GPU memory - recommend tiny model (49M)")
    CONFIG["model_size"] = "tiny"
    CONFIG["batch_size"] = 6

print(f"\n📝 Final Config: {CONFIG}")
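
The auto-detection logic in the cell above can also be written as a pure function, which makes the thresholds easy to check without a GPU attached. This is a sketch that copies the same cut-offs (35 GB and 14 GB); the name `pick_settings` is hypothetical and not part of the repository.

```python
def pick_settings(gpu_memory_gb: float, model_size: str, batch_size: int):
    """Return (model_size, batch_size) using the same thresholds as the cell above."""
    if gpu_memory_gb >= 35:            # A100-class
        if model_size == "medium":
            batch_size = 32
    elif gpu_memory_gb >= 14:          # V100/T4-class
        if model_size == "small":
            batch_size = 12
    else:                              # anything smaller, or no GPU at all
        model_size, batch_size = "tiny", 6
    return model_size, batch_size


print(pick_settings(40.0, "medium", 8))  # ('medium', 32)
print(pick_settings(15.8, "small", 8))   # ('small', 12)
print(pick_settings(8.0, "small", 8))    # ('tiny', 6)
```

Note that the batch size is only overridden when the chosen model size matches the tier; any other combination keeps the value set in `CONFIG`.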

## 💾 Google Drive (Optional)

Mount Google Drive to save checkpoints:

In [None]:
# 6. Mount Google Drive (optional - for saving checkpoints)
from google.colab import drive
import os

try:
    drive.mount('/content/drive')
    
    # Create checkpoint directory
    checkpoint_dir = '/content/drive/MyDrive/sparsae_checkpoints'
    os.makedirs(checkpoint_dir, exist_ok=True)
    print(f"✅ Checkpoints will be saved to: {checkpoint_dir}")
    
    CONFIG['checkpoint_dir'] = checkpoint_dir
except Exception as e:
    print(f"⚠️ Could not mount Drive: {e}")
    print("Checkpoints will be saved locally (lost on session end)")
    CONFIG['checkpoint_dir'] = '/content/checkpoints'
    os.makedirs(CONFIG['checkpoint_dir'], exist_ok=True)

## 🏃 Training

Run the SparsAE training:

In [None]:
# 7. Run training
# This will train the model with the configuration above

!python experiments/sparsae_wikitext.py \
    --model_size {CONFIG['model_size']} \
    --batch_size {CONFIG['batch_size']} \
    --max_steps {CONFIG['max_steps']} \
    --sparsity {CONFIG['sparsity']} \
    --checkpoint_dir {CONFIG['checkpoint_dir']}
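
The training cell relies on IPython expanding `{...}` expressions inside a `!` shell command. As a sketch of the same mapping in plain Python, the hypothetical helper `build_train_command` below turns a config dict into an argument list. It passes only the five flags that appear in the cell above; the values in `example_config` are placeholders.

```python
import shlex
import sys


def build_train_command(config, script="experiments/sparsae_wikitext.py"):
    """Turn the CONFIG dict into an argument list for the training script."""
    flags = ["model_size", "batch_size", "max_steps", "sparsity", "checkpoint_dir"]
    cmd = [sys.executable, script]
    for key in flags:
        cmd += [f"--{key}", str(config[key])]
    return cmd


example_config = {
    "model_size": "small",
    "batch_size": 8,
    "max_steps": 10000,
    "sparsity": 0.8,
    "checkpoint_dir": "/content/checkpoints",
}
cmd = build_train_command(example_config)
print(shlex.join(cmd))
# To launch it from Python: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems if the checkpoint path ever contains spaces.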

## 📊 Monitoring

Check GPU usage during training:

In [None]:
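# 8. Monitor GPU
# Note: Colab executes one cell at a time, so this loop cannot run alongside the
# blocking training cell above; use it before or after training, or start
# training as a background process first.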
import time
from IPython.display import clear_output

for i in range(10):
    clear_output(wait=True)
    !nvidia-smi --query-gpu=timestamp,name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv
    time.sleep(10)
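
If you want to log or plot these readings rather than just display them, the CSV output can be parsed with the standard library. The sketch below assumes the header names follow the pattern `nvidia-smi` typically prints (field name plus a unit in brackets); check them against your own output. The `sample` string is illustrative, not captured from a real run, and `parse_gpu_csv` is a hypothetical helper.

```python
import csv
import io


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv` output into a list of dicts keyed by header."""
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]


# Illustrative sample, not captured from a real run.
sample = (
    "name, utilization.gpu [%], memory.used [MiB], memory.total [MiB]\n"
    "Tesla T4, 87 %, 5120 MiB, 15360 MiB\n"
)
for row in parse_gpu_csv(sample):
    used = int(row["memory.used [MiB]"].split()[0])
    total = int(row["memory.total [MiB]"].split()[0])
    print(f"{row['name']}: {used / total:.0%} of memory in use")
```

In a live session the text would come from `subprocess.run([...], capture_output=True, text=True).stdout` with the same `--query-gpu` and `--format=csv` arguments used in the cell above.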

## 📈 Results & Download

View results and download checkpoints:

In [None]:
# 9. View training results
import matplotlib.pyplot as plt
import pandas as pd

# This assumes your training script saves metrics to a CSV
# Adjust path as needed
try:
    metrics = pd.read_csv('training_metrics.csv')
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot training loss
    axes[0].plot(metrics['step'], metrics['train_loss'])
    axes[0].set_xlabel('Step')
    axes[0].set_ylabel('Training Loss')
    axes[0].set_title('Training Loss Over Time')
    axes[0].grid(True)
    
    # Plot validation perplexity
    val_data = metrics[metrics['val_ppl'].notna()]
    axes[1].plot(val_data['step'], val_data['val_ppl'])
    axes[1].set_xlabel('Step')
    axes[1].set_ylabel('Validation Perplexity')
    axes[1].set_title('Validation Perplexity Over Time')
    axes[1].grid(True)
    
    plt.tight_layout()
    plt.savefig('training_curves.png', dpi=150)
    plt.show()
    
    print("\n📊 Final Results:")
    print(f"Final Training Loss: {metrics['train_loss'].iloc[-1]:.4f}")
    if not val_data.empty:  # validation may not have been logged yet
        final_val_ppl = val_data['val_ppl'].iloc[-1]
        print(f"Final Validation Perplexity: {final_val_ppl:.2f}")
    
except FileNotFoundError:
    print("⚠️ Metrics file not found. Training may still be running.")
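
The two plots are related: perplexity is the exponential of the mean per-token cross-entropy. The sketch below assumes the logged loss is a mean cross-entropy in nats (natural log), which the notebook does not confirm for `train_loss`. The loss values are illustrative, not results from this project.

```python
import math


def perplexity(mean_loss_nats: float) -> float:
    """Perplexity is exp(mean per-token cross-entropy), when the loss is in nats."""
    return math.exp(mean_loss_nats)


for loss in (4.0, 3.5, 3.0):  # illustrative values, not results from this project
    print(f"loss {loss:.2f} -> perplexity {perplexity(loss):.1f}")
```

Because the mapping is exponential, a small drop in loss late in training corresponds to a proportionally larger drop in perplexity.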

In [None]:
# 10. Download checkpoints (optional)
from google.colab import files

# Compress checkpoints
!tar -czf sparsae_checkpoints.tar.gz {CONFIG['checkpoint_dir']}

print("📦 Checkpoint archive created: sparsae_checkpoints.tar.gz")
# Uncomment to download it through the browser:
# files.download('sparsae_checkpoints.tar.gz')
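
The same archive can be built with Python's `tarfile` module instead of the shell. A minimal sketch, with a hypothetical helper `archive_directory` and a throwaway directory standing in for the real checkpoint folder:

```python
import os
import tarfile
import tempfile


def archive_directory(src_dir, archive_path):
    """Write src_dir into a gzip-compressed tar, stored under its base name."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(src_dir, arcname=os.path.basename(os.path.normpath(src_dir)))
    return archive_path


# Self-contained demo on a throwaway directory with a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = os.path.join(tmp, "sparsae_checkpoints")
    os.makedirs(ckpt_dir)
    with open(os.path.join(ckpt_dir, "example.pt"), "wb") as f:
        f.write(b"placeholder")
    archive = archive_directory(ckpt_dir, os.path.join(tmp, "sparsae_checkpoints.tar.gz"))
    with tarfile.open(archive) as tar:
        archived_names = sorted(tar.getnames())
    print(archived_names)
```

Setting `arcname` stores the files under the folder's own name, so the archive does not reproduce the full `/content/drive/...` path when it is unpacked elsewhere.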

## 💡 Tips & Troubleshooting

### GPU Selection:
- **Free Tier:** T4 (15GB) - Good for 49-125M models
- **Colab Pro ($10/mo):** T4/V100 (16GB) - Good for 125M models  
- **Colab Pro+ ($50/mo):** V100/A100 (40GB) - Good for 350M+ models

### Common Issues:

1. **Out of Memory:**
   - Reduce `batch_size` in config
   - Use smaller model size
   - Enable gradient checkpointing

2. **Session Timeout:**
   - Save checkpoints frequently (`checkpoint_interval=500`)
   - Use Google Drive mount for persistence
   - Keep browser tab active

3. **Slow Training:**
   - Check GPU utilization with `!nvidia-smi`
   - Ensure using GPU runtime (not CPU)
   - Increase `num_workers` if CPU-bound

### Resuming from Checkpoint:
```python
# In the training cell, add:
!python experiments/sparsae_wikitext.py \
    --resume_from /path/to/checkpoint.pt \
    ...
```
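
To avoid typing the checkpoint path by hand after a session timeout, you can pick the most recent file programmatically. The sketch below assumes checkpoints are named like `checkpoint_step_1000.pt`; that naming scheme is a guess, not something the notebook states, so adjust `CHECKPOINT_PATTERN` to match what the script writes. `latest_checkpoint` is a hypothetical helper.

```python
import os
import re
import tempfile

# Assumed naming scheme; adjust the pattern to whatever the script really writes.
CHECKPOINT_PATTERN = re.compile(r"checkpoint_step_(\d+)\.pt$")


def latest_checkpoint(checkpoint_dir):
    """Return the path of the highest-step checkpoint, or None if none exist."""
    best_step, best_path = -1, None
    for name in os.listdir(checkpoint_dir):
        match = CHECKPOINT_PATTERN.match(name)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = os.path.join(checkpoint_dir, name)
    return best_path


# Demo with empty placeholder files in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000, 10000):
        open(os.path.join(tmp, f"checkpoint_step_{step}.pt"), "wb").close()
    newest = os.path.basename(latest_checkpoint(tmp))
    print(newest)  # checkpoint_step_10000.pt
```

Comparing the step as an integer matters here: a plain string sort would rank `checkpoint_step_500.pt` above `checkpoint_step_10000.pt`.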

---

## 📚 Next Steps:

1. **Baseline Comparisons:** Train dense, static-pruning, and RigL baselines
2. **Ablation Studies:** Test without distillation, ES, etc.
3. **Scale Up:** Try larger models (350M) with Pro+
4. **Paper Results:** Generate all tables and figures

---

**Questions?** Check the GitHub repo or open an issue!