# 🚀 SparsAE Training on Google Colab

## Quick Start Guide

1. **Runtime Setup:** Runtime → Change runtime type → **GPU** (T4, V100, or A100)
2. **Run all cells** in order (Shift+Enter)
3. **Monitor training** progress below

---

### Expected Performance:
- **T4 (Free/Pro):** 125M model in ~2-3 hrs
- **V100 (Pro):** 125M model in ~1.5 hrs
- **A100 (Pro+):** 350M model in ~12 hrs

---

In [None]:
# Cell 1: Check GPU availability
!nvidia-smi

import torch
print(f"\n{'='*60}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print(f"{'='*60}")
else:
    print("⚠️ NO GPU DETECTED! Please change runtime to GPU.")
    print("Runtime → Change runtime type → Hardware accelerator → GPU")

In [None]:
# Cell 2: Clone repository
import os

# Remove if already exists
if os.path.exists('/content/ai-algo-agents'):
    print("📁 Repository already exists, removing...")
    !rm -rf /content/ai-algo-agents

# Clone from public GitHub (no authentication needed)
print("📥 Cloning repository from GitHub...")
!git clone https://github.com/codenlighten/ai-algo-agents.git /content/ai-algo-agents

# Change to repo directory
%cd /content/ai-algo-agents

# Verify clone successful
print("\n✅ Repository cloned successfully!")
print("\n📂 Repository contents:")
!ls -la experiments/

In [None]:
# Cell 3: Install dependencies
print("📦 Installing dependencies...")
print("This may take 2-3 minutes...\n")

!pip install -q transformers datasets tokenizers
!pip install -q matplotlib tqdm

print("\n✅ All dependencies installed!")

In [None]:
# Cell 4: Verify installation
import torch
import transformers
import datasets

print("\n" + "="*60)
print("ENVIRONMENT VERIFICATION")
print("="*60)
print(f"✅ PyTorch: {torch.__version__}")
print(f"✅ Transformers: {transformers.__version__}")
print(f"✅ Datasets: {datasets.__version__}")
print(f"✅ CUDA: {torch.cuda.is_available()}")
print(f"✅ GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")
print("="*60)
print("\n🎉 Ready to train!")

## 🔧 Configuration

Auto-detects optimal settings based on your GPU.

In [None]:
# Cell 5: Configuration
import torch

# Default configuration (ultra-conservative to prevent OOM)
CONFIG = {
    "model_size": "tiny",  # "tiny" (49M), "small" (125M), "medium" (350M)
    "batch_size": 2,       # Ultra-conservative to prevent OOM
    "max_steps": 10000,
    "sparsity": 0.8,       # 80% sparse
    "checkpoint_interval": 1000,
    "eval_interval": 200,
    "max_train_examples": 10000,  # Very limited to prevent OOM during tokenization
    "max_val_examples": 1000,
    "num_workers": 0,      # 0 workers to prevent multiprocessing OOM
}

# Auto-detect optimal settings based on GPU
if torch.cuda.is_available():
    gpu_memory_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    gpu_name = torch.cuda.get_device_name(0)
    
    print(f"\n{'='*60}")
    print(f"GPU DETECTED: {gpu_name}")
    print(f"GPU Memory: {gpu_memory_gb:.1f} GB")
    print(f"{'='*60}\n")
    
    if gpu_memory_gb >= 35:  # A100
        print("🚀 A100-class GPU detected!")
        print("   → Can use larger models and batch sizes")
        CONFIG["model_size"] = "medium"
        CONFIG["batch_size"] = 16
        CONFIG["max_train_examples"] = 100000
        CONFIG["max_val_examples"] = 10000
    elif gpu_memory_gb >= 14:  # V100/T4
        print("⚡ T4/V100-class GPU detected!")
        print("   → Using the tiny model with moderate settings")
        CONFIG["model_size"] = "tiny"  # Start with tiny to be safe
        CONFIG["batch_size"] = 4
        CONFIG["max_train_examples"] = 20000
        CONFIG["max_val_examples"] = 2000
    else:
        print("⚠️  Limited GPU memory detected")
        print("   → Using minimal settings")
        # Keep ultra-conservative defaults

print(f"\n📝 Final Configuration:")
for key, value in CONFIG.items():
    print(f"   {key}: {value}")

print(f"\n💡 OOM Prevention Strategy:")
print(f"   ✅ Ultra-low defaults (10K examples, batch=2)")
print(f"   ✅ Early exit during tokenization")
print(f"   ✅ No DataLoader workers")
print(f"   ✅ Progress logging every 1000 docs")
print(f"\n⚠️  If still getting OOM, try:")
print(f"   CONFIG['max_train_examples'] = 5000")
print(f"   CONFIG['batch_size'] = 1")
print()

## 💾 Google Drive (Optional)

Mount Google Drive to save checkpoints persistently.

In [None]:
# Cell 6: Mount Google Drive (optional)
from google.colab import drive
import os

try:
    print("📁 Mounting Google Drive...")
    drive.mount('/content/drive')
    
    # Create checkpoint directory
    checkpoint_dir = '/content/drive/MyDrive/sparsae_checkpoints'
    os.makedirs(checkpoint_dir, exist_ok=True)
    print(f"\n✅ Google Drive mounted!")
    print(f"📂 Checkpoints will be saved to: {checkpoint_dir}")
    print("   (These will persist after session ends)\n")
    
    CONFIG['checkpoint_dir'] = checkpoint_dir
except Exception as e:
    print(f"\n⚠️  Could not mount Drive: {e}")
    print("Checkpoints will be saved locally (lost on session end)\n")
    CONFIG['checkpoint_dir'] = '/content/checkpoints'
    os.makedirs(CONFIG['checkpoint_dir'], exist_ok=True)

In [None]:
# Cell 5.5: Memory Diagnostic (run this if you get OOM errors)
import torch
import gc

print("\n" + "="*60)
print("💾 MEMORY DIAGNOSTIC")
print("="*60 + "\n")

# System memory
print("📊 System RAM:")
!free -h | grep Mem

# GPU memory
if torch.cuda.is_available():
    gpu_props = torch.cuda.get_device_properties(0)
    total_memory = gpu_props.total_memory / 1e9
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    free = total_memory - reserved
    
    print(f"\n📊 GPU Memory ({torch.cuda.get_device_name(0)}):")
    print(f"   Total:     {total_memory:.2f} GB")
    print(f"   Allocated: {allocated:.2f} GB ({allocated/total_memory*100:.1f}%)")
    print(f"   Reserved:  {reserved:.2f} GB ({reserved/total_memory*100:.1f}%)")
    print(f"   Free:      {free:.2f} GB ({free/total_memory*100:.1f}%)")
    
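    # Clear cache: collect unreachable Python objects first so their
    # GPU tensors are released before the caching allocator is emptied
    gc.collect()
    torch.cuda.empty_cache()
    
    print(f"\n🧹 After cleanup:")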
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9
    free = total_memory - reserved
    print(f"   Allocated: {allocated:.2f} GB")
    print(f"   Reserved:  {reserved:.2f} GB")
    print(f"   Free:      {free:.2f} GB")
    
    # Check for other processes
    print(f"\n🔍 GPU Processes:")
    !nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
    
    # Memory recommendations
    print(f"\n💡 Recommendations for {total_memory:.0f}GB GPU:")
    if total_memory < 12:
        print(f"   ⚠️  Limited memory - use tiny model, batch_size=2-4")
        print(f"   ⚠️  Set max_train_examples=10000")
    elif total_memory < 16:
        print(f"   ✅ Good for tiny model (batch_size=4-8)")
        print(f"   ⚠️  Small model may OOM - try batch_size=4")
    elif total_memory < 20:
        print(f"   ✅ Good for small model (batch_size=6-10)")
    else:
        print(f"   ✅ Excellent for medium model (batch_size=16+)")

else:
    print("❌ No GPU detected!")
    print("   Change runtime: Runtime → Change runtime type → GPU")

print("\n" + "="*60)


## 🏃 Training

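This trains SparsAE with the configuration above. Expected time: roughly 1.5-3 hours for the 125M model on a T4 or V100; the 350M model on an A100 takes about 12 hours (see Expected Performance at the top).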

**Before running Cell 7:** Make sure Cells 1-6 completed successfully!

In [None]:
# Cell 6.5: Pre-training diagnostic check
import sys
import os

print("\n" + "="*60)
print("PRE-TRAINING DIAGNOSTICS")
print("="*60 + "\n")

# Check Python
print(f"✅ Python: {sys.executable}")
print(f"   Version: {sys.version.split()[0]}\n")

# Check working directory
print(f"📂 Working directory: {os.getcwd()}")
print(f"   Expected: /content/ai-algo-agents\n")

# Check training script exists
script_path = "experiments/sparsae_wikitext.py"
if os.path.exists(script_path):
    print(f"✅ Training script found: {script_path}")
    print(f"   Size: {os.path.getsize(script_path) / 1024:.1f} KB\n")
else:
    print(f"❌ Training script NOT found: {script_path}")
    print(f"   Current files: {os.listdir('.')}\n")

# Check dependencies
print("📦 Checking dependencies:")
try:
    import torch
    print(f"   ✅ PyTorch {torch.__version__}")
    print(f"      CUDA: {torch.cuda.is_available()}")
except ImportError as e:
    print(f"   ❌ PyTorch: {e}")

try:
    import transformers
    print(f"   ✅ Transformers {transformers.__version__}")
except ImportError as e:
    print(f"   ❌ Transformers: {e}")

try:
    import datasets
    print(f"   ✅ Datasets {datasets.__version__}")
except ImportError as e:
    print(f"   ❌ Datasets: {e}")

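# Test script imports
print("\n🧪 Testing script imports...")
import subprocess
# Run without a shell pipe: with `| head`, the reported exit status would be
# that of `head`, so a failing script would still look like a success
result = subprocess.run(
    [sys.executable, script_path, "--help"],
    capture_output=True, text=True,
)
print("\n".join((result.stdout + result.stderr).splitlines()[:20]))
if result.returncode != 0:
    print(f"   ⚠️  Import/help test failed with code {result.returncode}")
    print("   This may indicate missing dependencies or syntax errors")
else:
    print("   ✅ Script imports and argument parsing OK")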

print("\n" + "="*60)
print("If all checks passed ✅, proceed to Cell 7")
print("If any checks failed ❌, rerun cells 1-6")
print("="*60 + "\n")

In [None]:
# Cell 7: Run training
import subprocess
import sys
import os

print("\n" + "="*60)
print("STARTING SPARSAE TRAINING")
print("="*60)
print(f"Model: {CONFIG['model_size'].upper()}")
print(f"Batch size: {CONFIG['batch_size']}")
print(f"Max steps: {CONFIG['max_steps']}")
print(f"Sparsity: {CONFIG['sparsity']*100:.0f}%")
print(f"Checkpoints: {CONFIG['checkpoint_dir']}")
print("="*60 + "\n")

# Verify we're in the right directory
print(f"📂 Current directory: {os.getcwd()}")
print(f"📄 Training script exists: {os.path.exists('experiments/sparsae_wikitext.py')}\n")

# Build command with unbuffered output
cmd = [
    sys.executable,
    "-u",  # Force unbuffered output for real-time logs
    "experiments/sparsae_wikitext.py",
    "--model_size", CONFIG['model_size'],
    "--batch_size", str(CONFIG['batch_size']),
    "--max_steps", str(CONFIG['max_steps']),
    "--sparsity", str(CONFIG['sparsity']),
    "--checkpoint_dir", CONFIG['checkpoint_dir'],
    "--checkpoint_interval", str(CONFIG['checkpoint_interval']),
    "--eval_interval", str(CONFIG['eval_interval']),
    "--max_train_examples", str(CONFIG['max_train_examples']),
    "--max_val_examples", str(CONFIG['max_val_examples']),
    "--num_workers", str(CONFIG['num_workers']),
]

print("🚀 Launching training...")
print(f"💻 Command: {' '.join(cmd)}\n")
print("="*60 + "\n")

# Run training with better error capture
try:
    result = subprocess.run(cmd, check=True, capture_output=False, text=True)
    print("\n" + "="*60)
    print("🎉 TRAINING COMPLETED SUCCESSFULLY!")
    print("="*60)
except subprocess.CalledProcessError as e:
    print(f"\n{'='*60}")
    print(f"❌ TRAINING FAILED")
    print(f"{'='*60}")
    print(f"Exit code: {e.returncode}")
    print(f"\n💡 Common issues:")
    print(f"   - Exit code 2: Usually means argument parsing error or missing imports")
    print(f"   - Exit code -9: Process was killed, usually out of memory (see Troubleshooting OOM)")
    print(f"   - Check that all cells above (1-6) ran successfully")
    print(f"   - Try running: !python3 experiments/sparsae_wikitext.py --help")
except Exception as e:
    print(f"\n❌ Unexpected error: {e}")

## 🔧 Troubleshooting OOM (Out of Memory)

**If you got exit code -9**, the process was killed due to out of memory. Try these fixes:

### Quick Fixes:
1. **Restart runtime**: Runtime → Restart runtime
2. **Re-run Cell 5.5** (Memory Diagnostic) to check available memory
3. **Reduce batch size**: At the end of Cell 5 (after auto-detection), set `CONFIG['batch_size'] = 2`
4. **Reduce dataset size**: At the end of Cell 5, set `CONFIG['max_train_examples'] = 10000`
5. **Use smaller model**: At the end of Cell 5, set `CONFIG['model_size'] = 'tiny'`
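
The `-9` mentioned above comes from how Python's `subprocess` module reports a child that was terminated by a signal: the return code is the negated signal number, so `-9` corresponds to `SIGKILL`. The following is a minimal sketch for decoding the exit code that Cell 7 prints; `describe_exit` is a hypothetical helper, not part of the repository, and the out-of-memory hint is a heuristic rather than a certainty.

```python
import signal


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a short human-readable message."""
    if returncode == 0:
        return "finished successfully"
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        # SIGKILL is what the kernel's out-of-memory killer sends,
        # but other things can send it too, so this is only a hint
        hint = " (often the out-of-memory killer)" if name == "SIGKILL" else ""
        return f"killed by {name}{hint}"
    return f"exited with error code {returncode}"


for code in (0, 2, -9):
    print(code, "->", describe_exit(code))
```

In Cell 7 this could be called as `describe_exit(e.returncode)` inside the `CalledProcessError` handler.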

### Detailed Troubleshooting:
See `COLAB_OOM_FIX.md` for complete guide including:
- Memory requirements by model size
- Gradient accumulation technique
- Advanced optimization strategies
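
As a rough illustration of the gradient accumulation idea listed above: several small micro-batches are processed one after another, their gradients are summed with a scaling factor, and the optimizer steps once per group. Peak memory then follows the micro-batch size while the update matches a larger effective batch. The sketch below is framework-free and uses a one-parameter least-squares model; `grad_mse` and the toy data are illustrative only, and whether `experiments/sparsae_wikitext.py` exposes an accumulation option is not something this notebook states. In PyTorch the same pattern is usually written by dividing each micro-batch loss by the number of accumulation steps, calling `backward()` on each, and calling `optimizer.step()` once per group.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy data (illustrative only)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

# Reference: gradient over the full batch of 8 examples
full_grad = grad_mse(w, xs, ys)

# Accumulated: 4 micro-batches of 2 examples each
micro_batch = 2
accum_steps = len(xs) // micro_batch
accumulated = 0.0
for i in range(accum_steps):
    lo, hi = i * micro_batch, (i + 1) * micro_batch
    # Scale each micro-batch gradient so the sum equals the full-batch mean
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum_steps

print(f"full batch:  {full_grad:.6f}")
print(f"accumulated: {accumulated:.6f}")

# One optimizer step using the accumulated gradient
lr = 0.01
w_new = w - lr * accumulated
```

The two printed gradients agree because the micro-batches have equal size; with unequal micro-batches the scaling would need to weight each one by its share of the examples.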

## 📊 Monitoring (Optional)

Run this in a separate window while training to monitor GPU usage.

In [None]:
# Cell 8: Monitor GPU (run this while training runs)
import time
from IPython.display import clear_output

print("📊 GPU Monitoring (updating every 10 seconds)")
print("Press 'Stop' button to end monitoring\n")

try:
    for i in range(360):  # Monitor for 1 hour
        clear_output(wait=True)
        print(f"📊 GPU Status (Update {i+1}/360)\n")
        !nvidia-smi --query-gpu=timestamp,name,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv
        print("\n✅ Healthy ranges:")
        print("   Temperature: <80°C")
        print("   GPU Utilization: >90%")
        print("   Memory: <90% of total\n")
        print("Press 'Stop' button to end monitoring")
        time.sleep(10)
except KeyboardInterrupt:
    print("\n⏹️  Monitoring stopped")
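
Cells in a single notebook run one at a time, so a monitoring loop like the one above waits in the queue while the training cell is busy. One possible workaround is to poll `nvidia-smi` from a background thread started before Cell 7. The sketch below assumes that approach; `BackgroundMonitor` and `read_gpu_sample` are hypothetical names, not part of the repository, and the samples are kept in memory rather than printed so they do not interleave with the training logs.

```python
import subprocess
import threading
import time
from collections import deque


def read_gpu_sample():
    """Return one CSV line from nvidia-smi: utilization %, used MiB, total MiB."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


class BackgroundMonitor:
    """Polls sample_fn on a daemon thread and keeps the most recent samples."""

    def __init__(self, sample_fn=read_gpu_sample, interval=10.0, keep=360):
        self.sample_fn = sample_fn
        self.interval = interval
        self.samples = deque(maxlen=keep)  # (timestamp, sample) pairs
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            try:
                self.samples.append((time.time(), self.sample_fn()))
            except Exception as exc:  # keep polling even if one sample fails
                self.samples.append((time.time(), f"error: {exc}"))
            # Sleep, but wake up immediately when stop() is called
            self._stop.wait(self.interval)

    def start(self):
        self._thread.start()
        return self

    def stop(self):
        self._stop.set()
        self._thread.join()
```

A typical use would be `monitor = BackgroundMonitor().start()` in a cell before training, then `monitor.stop()` and `list(monitor.samples)[-5:]` afterwards to inspect the last few readings.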

## 📈 View Results

Check training progress and final metrics.

In [None]:
# Cell 9: View results
import os

print("\n" + "="*60)
print("TRAINING RESULTS")
print("="*60 + "\n")

# List checkpoints
checkpoint_dir = CONFIG['checkpoint_dir']
if os.path.exists(checkpoint_dir):
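    # Sort by modification time: a plain name sort would place step 10000 before step 5000
    checkpoints = sorted(
        (f for f in os.listdir(checkpoint_dir) if f.endswith('.pt')),
        key=lambda f: os.path.getmtime(os.path.join(checkpoint_dir, f)),
    )
    print(f"📂 Found {len(checkpoints)} checkpoints:")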
    for cp in checkpoints[-5:]:  # Show last 5
        path = os.path.join(checkpoint_dir, cp)
        size_mb = os.path.getsize(path) / 1e6
        print(f"   {cp} ({size_mb:.1f} MB)")
else:
    print("⚠️  No checkpoints found yet. Training may still be running.")

print("\n" + "="*60)
print("\n💡 To download checkpoints:")
print("   1. Go to Files panel (left sidebar)")
print(f"   2. Navigate to {checkpoint_dir}")
print("   3. Right-click → Download")
print("\nOr run the cell below to create a compressed archive.")

In [None]:
# Cell 10: Download checkpoints (optional)
from google.colab import files
import os

checkpoint_dir = CONFIG['checkpoint_dir']
archive_name = 'sparsae_checkpoints.tar.gz'

if os.path.exists(checkpoint_dir):
    print(f"📦 Creating compressed archive...")
    !tar -czf {archive_name} -C {os.path.dirname(checkpoint_dir)} {os.path.basename(checkpoint_dir)}
    
    size_mb = os.path.getsize(archive_name) / 1e6
    print(f"✅ Archive created: {archive_name} ({size_mb:.1f} MB)")
    print(f"\n📥 Downloading...")
    
    files.download(archive_name)
    print("\n✅ Download started! Check your browser's download folder.")
else:
    print("⚠️  No checkpoints to download yet.")

## 💡 Tips & Next Steps

### Resuming from Checkpoint:
If your session disconnects, add the following to Cell 7 just before the `subprocess.run` call, then rerun the cell:
```python
cmd.extend([
    "--resume_from", "/path/to/checkpoint_step_5000.pt"
])
```
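
Rather than typing the checkpoint path by hand, the newest checkpoint can be looked up automatically. This is a minimal sketch that assumes checkpoints are named `checkpoint_step_<N>.pt`, a pattern inferred only from the example path above; `latest_checkpoint` and `with_resume` are hypothetical helpers, not part of the repository.

```python
import os
import re


def latest_checkpoint(checkpoint_dir):
    """Return the path of the highest-step checkpoint, or None if there is none.

    Assumes file names of the form checkpoint_step_<N>.pt (an assumption
    based on the example path, not confirmed by the training script).
    """
    if not os.path.isdir(checkpoint_dir):
        return None
    pattern = re.compile(r"checkpoint_step_(\d+)\.pt")
    best_step, best_path = -1, None
    for name in os.listdir(checkpoint_dir):
        match = pattern.fullmatch(name)
        # Compare step numbers numerically, not as strings
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = os.path.join(checkpoint_dir, name)
    return best_path


def with_resume(cmd, checkpoint_dir):
    """Return a copy of cmd with --resume_from appended when a checkpoint exists."""
    path = latest_checkpoint(checkpoint_dir)
    return list(cmd) + (["--resume_from", path] if path else [])
```

In Cell 7 this would be used as `cmd = with_resume(cmd, CONFIG['checkpoint_dir'])` before launching training; when no checkpoint is found the command is left unchanged and training starts from scratch.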

### Running Experiments:
Modify `CONFIG` in Cell 5 and rerun from there:
- Try different sparsity levels: `0.7, 0.8, 0.9`
- Compare model sizes: `tiny, small, medium`
- Adjust batch size for memory/speed tradeoff
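
A sparsity comparison like the one suggested above can be scripted by building one command per sparsity level. The sketch below reuses the script path and flags from Cell 7; `build_sweep` and the per-run `sparsity_<value>` subdirectories are hypothetical conventions introduced here so runs do not overwrite each other's checkpoints, and flags that are left out are assumed to fall back to the script's defaults.

```python
import sys


def build_sweep(base_config, sparsities=(0.7, 0.8, 0.9), python=sys.executable):
    """Return one training command (as an argument list) per sparsity level."""
    commands = []
    for sparsity in sparsities:
        # Separate checkpoint directory per run (hypothetical naming scheme)
        run_dir = f"{base_config['checkpoint_dir']}/sparsity_{sparsity}"
        commands.append([
            python, "-u", "experiments/sparsae_wikitext.py",
            "--model_size", base_config["model_size"],
            "--batch_size", str(base_config["batch_size"]),
            "--max_steps", str(base_config["max_steps"]),
            "--sparsity", str(sparsity),
            "--checkpoint_dir", run_dir,
        ])
    return commands


# Example values only; in the notebook, pass CONFIG from Cell 5 instead
example_config = {
    "model_size": "tiny",
    "batch_size": 2,
    "max_steps": 10000,
    "checkpoint_dir": "/content/checkpoints",
}
for cmd in build_sweep(example_config):
    print(" ".join(cmd))
```

Each command can then be passed to `subprocess.run(cmd, check=True)` in turn, as Cell 7 does for a single run. Adding `0.0` to `sparsities` would include the dense baseline.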

### Common Issues:
- **OOM Error:** Reduce `batch_size` in CONFIG
- **Session Timeout:** Use Google Drive mount to save progress
- **Slow Training:** Check GPU utilization with Cell 8

### Next Steps:
1. Run dense baseline: `CONFIG["sparsity"] = 0.0`
2. Compare results across sparsity levels
3. Scale to larger models (upgrade to Pro+)
4. Run ablation studies (modify training script)

---

**Full Documentation:** [COLAB_SETUP.md](https://github.com/codenlighten/ai-algo-agents/blob/main/COLAB_SETUP.md)

**Report Issues:** [GitHub Issues](https://github.com/codenlighten/ai-algo-agents/issues)