# AlphaGomoku Training - Universal Notebook

Works on:
- ✓ Google Colab
- ✓ Vast.ai
- ✓ RunPod
- ✓ Lambda Labs
- ✓ AWS/GCP/Azure VMs
- ✓ Local Jupyter

**Model:** 18 blocks × 192 channels (5.2M params)

**Recommended:** RTX 4090 (24GB) or A100 (40GB) with 64GB RAM

## 1. Check Environment

In [None]:
import os
import sys
import platform

# Detect environment
IS_COLAB = 'google.colab' in sys.modules
IS_KAGGLE = 'kaggle' in os.environ.get('KAGGLE_URL_BASE', '')

print(f"Platform: {platform.system()}")
print(f"Python: {sys.version}")
print(f"Environment: {'Colab' if IS_COLAB else 'Kaggle' if IS_KAGGLE else 'Standard VM/Local'}")
print(f"Working directory: {os.getcwd()}")

## 2. Check GPU

In [None]:
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    gpu_memory_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU Memory: {gpu_memory_gb:.1f} GB")
    
    # Recommend batch size
    if gpu_memory_gb >= 32:
        print("✓ Recommended batch size: 2048")
    elif gpu_memory_gb >= 20:
        print("✓ Recommended batch size: 1024")
    elif gpu_memory_gb >= 12:
        print("✓ Recommended batch size: 512")
    else:
        print("⚠️  GPU has limited memory, will use checkpointing")
else:
    print("❌ No GPU detected! Training will be very slow.")
    print("   For Colab: Runtime → Change runtime type → GPU")
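
The thresholds printed above can be wrapped in a small helper so that later cells can reuse the value. This is a minimal sketch: `recommend_batch_size` is a hypothetical name, and the batch size that `scripts/train.py` actually selects may differ.

```python
from typing import Optional

def recommend_batch_size(gpu_memory_gb: float) -> Optional[int]:
    """Return the batch size suggested for a GPU with the given memory in GB.

    Mirrors the thresholds printed by the GPU check cell. Returns None below
    12 GB, where the notebook only notes that checkpointing will be used.
    """
    if gpu_memory_gb >= 32:
        return 2048
    if gpu_memory_gb >= 20:
        return 1024
    if gpu_memory_gb >= 12:
        return 512
    return None

# Example values only; in the notebook, pass the measured gpu_memory_gb.
for gb in (40, 24, 16, 8):
    print(f"{gb} GB -> {recommend_batch_size(gb)}")
```

`total_memory` can report slightly less than a card's nominal size, so a nominally 12 GB card may fall just under the 12 GB threshold.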

## 3. Setup Storage (Platform-specific)

Run only the cell that matches your platform. On Colab, skip the VM cell, since it would overwrite `PROJECT_DIR` and `WORK_DIR`:

In [None]:
# === GOOGLE COLAB ONLY ===
# Uncomment and run if using Colab

# from google.colab import drive
# drive.mount('/content/drive')
# PROJECT_DIR = '/content/drive/MyDrive/alphagomoku'
# WORK_DIR = '/content/alphagomoku'
# os.makedirs(PROJECT_DIR, exist_ok=True)

In [None]:
# === VAST.AI / RUNPOD / OTHER VMs ===
# Use local storage (usually faster than network storage)

PROJECT_DIR = os.path.expanduser('~/alphagomoku')
WORK_DIR = PROJECT_DIR
os.makedirs(PROJECT_DIR, exist_ok=True)

print(f"Project directory: {PROJECT_DIR}")
print(f"Working directory: {WORK_DIR}")
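
As an alternative to picking a cell by hand, the two layouts above could be selected from the Colab check. This is a sketch only: `choose_dirs` is a hypothetical helper, and on Colab it assumes Drive has already been mounted at `/content/drive`.

```python
import os
import sys
from typing import Tuple

def choose_dirs(is_colab: bool) -> Tuple[str, str]:
    """Return (project_dir, work_dir) using the paths from the two storage cells."""
    if is_colab:
        # Checkpoints and data live on Drive so they persist across sessions;
        # the code itself runs from the local Colab disk.
        return '/content/drive/MyDrive/alphagomoku', '/content/alphagomoku'
    # On VMs and local machines, a single directory under $HOME holds both.
    project_dir = os.path.expanduser('~/alphagomoku')
    return project_dir, project_dir

IS_COLAB = 'google.colab' in sys.modules
PROJECT_DIR, WORK_DIR = choose_dirs(IS_COLAB)
print(f"Project directory: {PROJECT_DIR}")
print(f"Working directory: {WORK_DIR}")
```

The helper only computes paths; `os.makedirs(PROJECT_DIR, exist_ok=True)` is still needed afterwards.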

## 4. Get Code

Choose one method:

In [None]:
# METHOD 1: Clone from Git (replace YOUR_USERNAME with your GitHub account)
!git clone https://github.com/YOUR_USERNAME/alphagomoku.git {WORK_DIR}
%cd {WORK_DIR}

In [None]:
# METHOD 2: If code already exists (Vast.ai with persistent storage)
# Just navigate to it
%cd {WORK_DIR}

In [None]:
# METHOD 3: Upload code manually (for small updates)
# Uncomment if needed

# from google.colab import files
# uploaded = files.upload()  # Upload zip file
# !unzip -o alphagomoku.zip -d {WORK_DIR}
# %cd {WORK_DIR}

## 5. Install Dependencies

In [None]:
# Check whether the dependencies are already installed
try:
    import lmdb
    import pandas  # used by the monitoring cell in section 8
    import psutil
    print("✓ Dependencies already installed")
except ImportError:
    print("Installing dependencies...")
    !pip install -q numpy tqdm matplotlib pandas lmdb psutil
    print("✓ Dependencies installed")
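
The check above only probes a subset of the packages. A more complete check could look every module up without importing it. This is a sketch: `missing_modules` is a hypothetical helper, and the list assumes each package's import name matches its pip name, which holds for these six.

```python
import importlib.util
from typing import Iterable, List

def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# pandas is included because the monitoring cell in section 8 imports it.
required = ['numpy', 'tqdm', 'matplotlib', 'pandas', 'lmdb', 'psutil']
missing = missing_modules(required)
print("Missing:", ", ".join(missing) if missing else "none")
```

The resulting list can be passed to `pip install` so that only the absent packages are installed.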

## 6. Training Configuration

In [None]:
# Training configuration
CONFIG = {
    'epochs': 200,
    'selfplay_games': 200,
    'mcts_simulations': 150,
    'parallel_workers': 4,
    'lr': 1e-3,
    'min_lr': 5e-4,
    'difficulty': 'medium',
    
    # Paths
    'checkpoint_dir': f'{PROJECT_DIR}/checkpoints',
    'data_dir': f'{PROJECT_DIR}/data',
    
    # Device auto-configured
    'device': 'auto',
}

# Create directories
os.makedirs(CONFIG['checkpoint_dir'], exist_ok=True)
os.makedirs(CONFIG['data_dir'], exist_ok=True)

print("\nConfiguration:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")

print("\n✓ Ready to train!")

## 7. Start Training

**Auto-configuration:**
- Model: 18 blocks × 192 channels (5.2M params)
- Batch size: Auto-configured based on GPU
- Checkpointing: Auto-enabled if needed

**Training will:**
- Save checkpoints every epoch
- Auto-resume from latest checkpoint
- Show progress in real-time

In [None]:
# Run training
!python scripts/train.py \
    --epochs {CONFIG['epochs']} \
    --selfplay-games {CONFIG['selfplay_games']} \
    --mcts-simulations {CONFIG['mcts_simulations']} \
    --parallel-workers {CONFIG['parallel_workers']} \
    --lr {CONFIG['lr']} \
    --min-lr {CONFIG['min_lr']} \
    --warmup-epochs 0 \
    --lr-schedule cosine \
    --difficulty {CONFIG['difficulty']} \
    --checkpoint-dir {CONFIG['checkpoint_dir']} \
    --data-dir {CONFIG['data_dir']} \
    --device {CONFIG['device']} \
    --resume auto
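
With `--lr-schedule cosine`, `--warmup-epochs 0`, `--lr 1e-3`, and `--min-lr 5e-4`, the learning rate would typically decay from 1e-3 to 5e-4 along a half cosine over the run. The sketch below uses the standard cosine annealing formula; the exact formula in `scripts/train.py`, and whether it steps per epoch or per batch, are assumptions here.

```python
import math

def cosine_lr(epoch: int, total_epochs: int, lr: float, min_lr: float) -> float:
    """Standard cosine annealing from lr at epoch 0 to min_lr at total_epochs."""
    # Clamp progress to [0, 1] so out-of-range epochs stay within [min_lr, lr].
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return min_lr + 0.5 * (lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for epoch in (0, 50, 100, 150, 200):
    print(f"epoch {epoch:>3}: lr = {cosine_lr(epoch, 200, 1e-3, 5e-4):.6f}")
```

Under this formula the midpoint of training (epoch 100) sits exactly halfway between the two rates, at 7.5e-4.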

## 8. Monitor Training Progress

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

metrics_path = f"{CONFIG['checkpoint_dir']}/training_metrics.csv"

if os.path.exists(metrics_path):
    df = pd.read_csv(metrics_path)
    
    # Plot metrics
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    df.plot(x='epoch', y='loss', ax=axes[0,0], title='Training Loss', grid=True)
    df.plot(x='epoch', y='policy_acc', ax=axes[0,1], title='Policy Accuracy', grid=True)
    df.plot(x='epoch', y='value_mae', ax=axes[1,0], title='Value MAE', grid=True)
    df.plot(x='epoch', y='lr', ax=axes[1,1], title='Learning Rate', grid=True)
    
    plt.tight_layout()
    plt.show()
    
    # Print summary
    total_epochs = CONFIG['epochs']
    print(f"\nTraining Progress: {len(df)}/{total_epochs} epochs")
    print("\nLatest metrics (last 5 epochs):")
    print(df.tail())
    
    # Estimate time remaining (assumes epoch_time is recorded in seconds)
    if 'epoch_time' in df.columns and len(df) > 0:
        avg_time = df['epoch_time'].mean()
        remaining_epochs = max(total_epochs - len(df), 0)
        remaining_hours = (avg_time * remaining_epochs) / 3600
        print(f"\n⏱️  Estimated time remaining: {remaining_hours:.1f} hours")
else:
    print("No metrics file found yet. Training has not started or has not written any metrics.")

## 9. Check Latest Checkpoint

In [None]:
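import glob
import re

def checkpoint_epoch(path):
    """Extract the epoch number from a model_epoch_<N>.pt filename (-1 if absent)."""
    match = re.search(r'model_epoch_(\d+)\.pt$', path)
    return int(match.group(1)) if match else -1

# List all checkpoints, sorted numerically by epoch
# (a plain sorted() would place model_epoch_10.pt before model_epoch_2.pt)
checkpoints = sorted(
    glob.glob(f"{CONFIG['checkpoint_dir']}/model_epoch_*.pt"),
    key=checkpoint_epoch,
)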

if checkpoints:
    print(f"Found {len(checkpoints)} checkpoints:\n")
    
    # Show last 5
    for cp in checkpoints[-5:]:
        size_mb = os.path.getsize(cp) / 1024**2
        epoch = cp.split('_')[-1].replace('.pt', '')
        print(f"  Epoch {epoch:>3}: {size_mb:>6.1f} MB - {os.path.basename(cp)}")
    
    latest = checkpoints[-1]
    print(f"\n✓ Latest checkpoint: {os.path.basename(latest)}")
else:
    print("No checkpoints found yet.")

## 10. Download Checkpoint (Optional)

On Colab, download the latest checkpoint to your local machine. On other platforms the checkpoint is already on disk:

In [None]:
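# Only works in Colab
if IS_COLAB:
    import re
    from google.colab import files
    
    def _epoch(path):
        # Epoch number from model_epoch_<N>.pt, or -1 if the name does not match
        m = re.search(r'model_epoch_(\d+)\.pt$', path)
        return int(m.group(1)) if m else -1
    
    # Sort numerically so that epoch 10 comes after epoch 9
    checkpoints = sorted(glob.glob(f"{CONFIG['checkpoint_dir']}/model_epoch_*.pt"), key=_epoch)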
    if checkpoints:
        latest = checkpoints[-1]
        print(f"Downloading: {os.path.basename(latest)}")
        files.download(latest)
    else:
        print("No checkpoints to download")
else:
    print("Download not needed - checkpoints are already on VM storage")
    print(f"Checkpoint location: {CONFIG['checkpoint_dir']}")

## Tips by Platform

### Google Colab
- ⚠️ Sessions time out after 12-24 hours
- ✓ Checkpoints saved to Google Drive persist
- ✓ Just re-run training cell to resume
- ✓ Colab Pro offers access to faster GPUs such as the A100, subject to availability

### Vast.ai
- ✓ Hourly billing - pause anytime
- ✓ Use persistent storage for checkpoints
- ✓ Can rent spot instances (cheaper)
- ⚠️ Spot instances can be interrupted

### RunPod / Lambda Labs
- ✓ More reliable than spot instances
- ✓ Good for long training runs
- ✓ Usually have persistent storage

### Local / AWS / GCP
- ✓ Full control over environment
- ✓ Can run indefinitely
- ⚠️ May need to install dependencies

## Expected Training Time (200 epochs)

| GPU | VRAM | Batch | Time/Epoch | Total |
|-----|------|-------|-----------|-------|
| RTX 4090 | 24GB | 1024 | ~2.5h | ~21 days |
| A100 | 40GB | 2048 | ~1.8h | ~15 days |
| V100 | 16GB | 512 | ~3.5h | ~29 days |
| RTX 3080 | 12GB | 512 | ~6h | ~50 days |

## Troubleshooting

**Out of Memory:**
- The script should auto-configure batch size and checkpointing. If OOM errors still occur:
  - Reduce workers: `--parallel-workers 2`
  - Reduce games: `--selfplay-games 100`

**Slow Training:**
- Check that the GPU is being used (the logs should mention CUDA)
- Reduce simulations if needed: `--mcts-simulations 100`
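
The overrides suggested above can be applied without editing the training cell by building the command from `CONFIG`. This is a sketch: `build_train_command` and `FLAG_FOR_KEY` are hypothetical names, the flags are the ones used in the training cell, and the paths in `example_config` are placeholders.

```python
import shlex
from typing import Any, Dict, List

# Maps CONFIG keys to the command-line flags used in the training cell.
FLAG_FOR_KEY = {
    'epochs': '--epochs',
    'selfplay_games': '--selfplay-games',
    'mcts_simulations': '--mcts-simulations',
    'parallel_workers': '--parallel-workers',
    'lr': '--lr',
    'min_lr': '--min-lr',
    'difficulty': '--difficulty',
    'checkpoint_dir': '--checkpoint-dir',
    'data_dir': '--data-dir',
    'device': '--device',
}

def build_train_command(config: Dict[str, Any], **overrides: Any) -> List[str]:
    """Return the train.py argument list with overrides applied on top of config."""
    merged = {**config, **overrides}
    cmd = ['python', 'scripts/train.py']
    for key, flag in FLAG_FOR_KEY.items():
        cmd += [flag, str(merged[key])]
    # Fixed flags, copied from the training cell.
    cmd += ['--warmup-epochs', '0', '--lr-schedule', 'cosine', '--resume', 'auto']
    return cmd

# Same values as CONFIG, with placeholder paths for illustration.
example_config = {
    'epochs': 200, 'selfplay_games': 200, 'mcts_simulations': 150,
    'parallel_workers': 4, 'lr': 1e-3, 'min_lr': 5e-4, 'difficulty': 'medium',
    'checkpoint_dir': 'checkpoints', 'data_dir': 'data', 'device': 'auto',
}

cmd = build_train_command(example_config, parallel_workers=2, selfplay_games=100)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, or the printed string pasted after `!` in a notebook cell.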

**Connection Lost (Colab/Vast.ai):**
- Re-run the training cell; `--resume auto` continues from the latest checkpoint
- Checkpoints are saved every epoch, so at most one epoch of work is lost
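
How `--resume auto` selects a checkpoint is not shown in this notebook. One plausible approach, sketched here as an assumption rather than the script's actual logic, is to pick the `model_epoch_*.pt` file with the highest epoch number. `latest_checkpoint` is a hypothetical helper.

```python
import os
import re
import tempfile
from typing import Optional

EPOCH_RE = re.compile(r'model_epoch_(\d+)\.pt$')

def latest_checkpoint(checkpoint_dir: str) -> Optional[str]:
    """Return the path of the highest-numbered model_epoch_<N>.pt file, or None."""
    best_epoch, best_path = -1, None
    for name in os.listdir(checkpoint_dir):
        match = EPOCH_RE.match(name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best_path = os.path.join(checkpoint_dir, name)
    return best_path

# Demo on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for epoch in (2, 10, 9):
        open(os.path.join(tmp, f'model_epoch_{epoch}.pt'), 'w').close()
    print(os.path.basename(latest_checkpoint(tmp)))
```

Comparing epoch numbers as integers matters here: a plain string sort would rank `model_epoch_9.pt` above `model_epoch_10.pt`.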

**Code Updates:**
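- Prefer `git pull` inside the working directory. On VMs the checkpoints live in the same directory as the code, so deleting it to re-clone would delete them too
- Training will resume from the latest checkpoint as long as the checkpoint directory is intact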