# AlphaGomoku Training on Google Colab

This notebook trains AlphaGomoku on Google Colab with GPU acceleration.

**Requirements:**
- GPU Runtime (T4, V100, or A100)
- Google Drive for checkpoint storage

**Estimated Training Time:**
- T4 (16GB): ~6-8 hours/epoch
- V100 (16GB): ~4-5 hours/epoch  
- A100 (40GB): ~2-3 hours/epoch

## 1. Setup Runtime

**IMPORTANT:** Go to `Runtime → Change runtime type → Hardware accelerator → GPU`

In [None]:
# Check GPU availability
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    print("⚠️ No GPU detected! Please enable GPU runtime.")

## 2. Mount Google Drive (for checkpoints)

Checkpoints will be saved to your Google Drive so you can resume training.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Create directory for this project
import os
PROJECT_DIR = '/content/drive/MyDrive/alphagomoku'
os.makedirs(PROJECT_DIR, exist_ok=True)
print(f"Project directory: {PROJECT_DIR}")
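If the mount step fails or is skipped, `os.makedirs` will still succeed by creating a plain local folder, and checkpoints written there disappear with the session. A minimal sketch of a guard, assuming Drive shows up as a regular mount point at `/content/drive` (the helper name `require_mount` is hypothetical, not part of the repository):

```python
import os


def require_mount(mount_point="/content/drive"):
    """Return mount_point if something is mounted there, otherwise raise."""
    # os.path.ismount is True only for real mount points, so a plain local
    # directory left behind by a failed mount does not pass this check.
    if not os.path.ismount(mount_point):
        raise RuntimeError(
            f"{mount_point} is not mounted; re-run drive.mount() first"
        )
    return mount_point
```

Calling `require_mount()` just before creating `PROJECT_DIR` would surface a failed mount immediately instead of after hours of training.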

## 3. Clone Repository

In [None]:
# Clone your repository (update with your actual repo URL)
!git clone https://github.com/YOUR_USERNAME/alphagomoku.git /content/alphagomoku
%cd /content/alphagomoku

# Or upload your code manually:
# from google.colab import files
# uploaded = files.upload()  # Upload a zip file of your code

## 4. Install Dependencies

In [None]:
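# Install required packages (pip skips anything Colab already provides)
!pip install -q torch torchvision torchaudio
# pandas is needed by the monitoring cell in section 8
!pip install -q numpy pandas tqdm matplotlib lmdb psutil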

print("✓ Dependencies installed")

## 5. Training Configuration

Adjust these parameters based on your needs:

In [None]:
# Training configuration
CONFIG = {
    'epochs': 200,
    'selfplay_games': 200,
    'mcts_simulations': 150,
    'parallel_workers': 4,
    'lr': 1e-3,
    'min_lr': 5e-4,
    'difficulty': 'medium',
    
    # Paths (using Google Drive)
    'checkpoint_dir': f'{PROJECT_DIR}/checkpoints',
    'data_dir': f'{PROJECT_DIR}/data',
    
    # Auto-configured based on GPU
    'device': 'auto',  # Will detect CUDA automatically
}

# Create directories
os.makedirs(CONFIG['checkpoint_dir'], exist_ok=True)
os.makedirs(CONFIG['data_dir'], exist_ok=True)

print("Configuration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

## 6. Start Training

**Note:** Training will auto-configure for your GPU. The script will:
- Detect GPU type (T4/V100/A100)
- Set optimal batch size
- Enable checkpointing if needed
- Save checkpoints to Google Drive every epoch
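The batch-size step above could look roughly like the following. This is only an illustration of the idea, not the logic in `scripts/train.py`: the helper name `pick_batch_size` and the fallback value of 512 are assumptions, while the 1024 and 2048 values come from the timing table at the end of this notebook.

```python
def pick_batch_size(gpu_name: str) -> int:
    """Map a GPU name string to the batch size listed in the timing table."""
    name = gpu_name.upper()
    if "A100" in name:
        return 2048
    if "V100" in name or "T4" in name:
        return 1024
    # Hypothetical fallback for GPUs the table does not cover.
    return 512


# Example name strings of the kind torch.cuda.get_device_name(0) returns.
for example in ("Tesla T4", "Tesla V100-SXM2-16GB", "NVIDIA A100-SXM4-40GB"):
    print(example, pick_batch_size(example))
```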

In [None]:
# Run training
!python scripts/train.py \
    --epochs {CONFIG['epochs']} \
    --selfplay-games {CONFIG['selfplay_games']} \
    --mcts-simulations {CONFIG['mcts_simulations']} \
    --parallel-workers {CONFIG['parallel_workers']} \
    --lr {CONFIG['lr']} \
    --min-lr {CONFIG['min_lr']} \
    --warmup-epochs 0 \
    --lr-schedule cosine \
    --difficulty {CONFIG['difficulty']} \
    --checkpoint-dir {CONFIG['checkpoint_dir']} \
    --data-dir {CONFIG['data_dir']} \
    --device {CONFIG['device']} \
    --resume auto
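To get a feel for what `--lr-schedule cosine` with `--warmup-epochs 0` implies, here is a sketch of the standard cosine annealing formula using the values from `CONFIG`. It assumes the schedule is stepped once per epoch and spans all 200 epochs; the training script may differ in those details.

```python
import math


def cosine_lr(epoch, total_epochs=200, lr=1e-3, min_lr=5e-4):
    """Cosine-annealed learning rate for a zero-based epoch index, no warmup."""
    # Clamp progress to [0, 1] so epochs past the end stay at min_lr.
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return min_lr + 0.5 * (lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 50, 100, 150, 200):
    print(f"epoch {epoch:3d}: lr = {cosine_lr(epoch):.6f}")
```

Under these assumptions the rate starts at `lr`, passes the midpoint between `lr` and `min_lr` halfway through, and ends at `min_lr`.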

## 7. Resume Training (if interrupted)

Free Colab sessions run for at most about 12 hours and can disconnect earlier, for example when idle. To resume:

In [None]:
# Just re-run the training cell above!
# The --resume auto flag will automatically find and load the latest checkpoint

# Or manually specify a checkpoint:
# !python scripts/train.py ... --resume {CONFIG['checkpoint_dir']}/model_epoch_50.pt
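One plausible way for `--resume auto` to locate the latest checkpoint is to compare the epoch numbers embedded in the `model_epoch_N.pt` filenames. The sketch below illustrates that idea; `find_latest_checkpoint` is a hypothetical helper, and the actual script may select checkpoints differently.

```python
import os
import re

CHECKPOINT_RE = re.compile(r"model_epoch_(\d+)\.pt")


def find_latest_checkpoint(checkpoint_dir):
    """Return the path of the highest-epoch checkpoint, or None if none exist."""
    if not os.path.isdir(checkpoint_dir):
        return None
    best_epoch, best_path = -1, None
    for name in os.listdir(checkpoint_dir):
        match = CHECKPOINT_RE.fullmatch(name)
        if match is None:
            continue  # skip metrics files and anything else in the folder
        epoch = int(match.group(1))
        if epoch > best_epoch:
            best_epoch, best_path = epoch, os.path.join(checkpoint_dir, name)
    return best_path
```

Comparing epochs as integers matters here: a plain string sort would rank `model_epoch_50.pt` above `model_epoch_100.pt`.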

## 8. Monitor Training

View training progress and metrics:

In [None]:
# View training metrics CSV
import pandas as pd
import matplotlib.pyplot as plt

metrics_path = f"{CONFIG['checkpoint_dir']}/training_metrics.csv"
if os.path.exists(metrics_path):
    df = pd.read_csv(metrics_path)
    
    # Plot metrics
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    
    df.plot(x='epoch', y='loss', ax=axes[0,0], title='Training Loss')
    df.plot(x='epoch', y='policy_acc', ax=axes[0,1], title='Policy Accuracy')
    df.plot(x='epoch', y='value_mae', ax=axes[1,0], title='Value MAE')
    df.plot(x='epoch', y='lr', ax=axes[1,1], title='Learning Rate')
    
    plt.tight_layout()
    plt.show()
    
    # Print latest metrics
    print("\nLatest metrics:")
    print(df.tail())
else:
    print("No metrics file found yet. Training hasn't started or just started.")
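The plotting cell assumes the CSV contains the columns `epoch`, `loss`, `policy_acc`, `value_mae`, and `lr`; if any is missing, `df.plot` raises a `KeyError`. A small sketch of a pre-check using only the standard library (the helper name `missing_columns` is hypothetical):

```python
import csv

REQUIRED_COLUMNS = ("epoch", "loss", "policy_acc", "value_mae", "lr")


def missing_columns(csv_path, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    with open(csv_path, newline="") as f:
        # An empty file yields an empty header, so every column is reported.
        header = next(csv.reader(f), [])
    return [name for name in required if name not in header]
```

Running `missing_columns(metrics_path)` before plotting and printing the result would show which plots to skip.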

## 9. Download Checkpoints

Download trained models to your local machine:

In [None]:
from google.colab import files

# List available checkpoints
!ls -lh {CONFIG['checkpoint_dir']}/*.pt

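# Download the latest checkpoint, chosen by the epoch number in its filename
# (ctime is not creation time on Linux, so it can pick the wrong file)
import glob
import re

def epoch_of(path):
    match = re.search(r'model_epoch_(\d+)\.pt$', path)
    return int(match.group(1)) if match else -1

checkpoints = glob.glob(f"{CONFIG['checkpoint_dir']}/model_epoch_*.pt")
if checkpoints:
    latest = max(checkpoints, key=epoch_of)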
    print(f"\nDownloading: {latest}")
    files.download(latest)
else:
    print("No checkpoints found yet.")

## Tips for Long Training Sessions

1. **Colab Pro**: Paid tiers offer longer sessions (up to about 24 hours) and access to faster GPUs such as the A100, subject to availability
2. **Save frequently**: Checkpoints are saved every epoch to Google Drive
3. **Monitor memory**: The script auto-configures the batch size for your GPU; run `!nvidia-smi` to check actual memory usage
4. **Resume easily**: Just re-run the training cell with `--resume auto`
5. **Download checkpoints**: Save important checkpoints to your local machine

## Expected Training Times (200 epochs)

| GPU | Memory | Batch Size | Time/Epoch | Total Time |
|-----|--------|-----------|-----------|------------|
| T4 | 16 GB | 1024 | ~6h | ~50 days |
| V100 | 16 GB | 1024 | ~4h | ~33 days |
| A100 | 40 GB | 2048 | ~2.5h | ~21 days |
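The totals in the table follow from multiplying the per-epoch time by 200 epochs and assume uninterrupted training. The sketch below reproduces that arithmetic and also estimates how many sessions are needed, assuming a 12-hour session limit and that only whole epochs count because checkpoints are saved once per epoch. The function name `training_estimate` is illustrative.

```python
import math


def training_estimate(hours_per_epoch, epochs=200, session_hours=12):
    """Return (total_days, epochs_per_session, sessions_needed)."""
    total_days = hours_per_epoch * epochs / 24
    # Only fully completed epochs are checkpointed, so round down.
    epochs_per_session = int(session_hours // hours_per_epoch)
    if epochs_per_session == 0:
        # An epoch does not fit in one session; no progress can be saved.
        return total_days, 0, None
    sessions = math.ceil(epochs / epochs_per_session)
    return total_days, epochs_per_session, sessions


for gpu, hours in (("T4", 6), ("V100", 4), ("A100", 2.5)):
    days, per_session, sessions = training_estimate(hours)
    print(f"{gpu}: {days:.1f} days, {per_session} epochs/session, {sessions} sessions")
```

Note that at the upper end of the T4 estimate (~8 hours/epoch), only one epoch fits in a 12-hour session, which doubles the number of sessions required.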

**Recommendation:** Use an A100 if your Colab plan provides one: at ~2.5h per epoch it cuts total training time to less than half of the T4's.