# AlphaGomoku Training - Universal Notebook

Works on:
- ✓ Google Colab
- ✓ Vast.ai
- ✓ RunPod
- ✓ Lambda Labs
- ✓ AWS/GCP/Azure VMs
- ✓ Local Jupyter

**Training Modes:**
- 🖥️ **Single-machine**: Traditional training (self-play + training on same machine)
- 🌐 **Distributed**: Training worker (pulls games from Redis queue)

**Model:** Medium preset = 5.04M params (30 blocks × 192 channels)

**Recommended for Distributed Training:** T4 (16GB), RTX 4090 (24GB), or A100 (40GB)

## 1. Choose Training Mode

**Select your training mode:**
- **Single-machine**: Run self-play and training on the same machine (traditional)
- **Distributed**: Run as training worker, pulling games from Redis queue

Set `TRAINING_MODE` below:

In [None]:
# ============================================
# CONFIGURATION - CHOOSE YOUR MODE
# ============================================

# Choose mode: "single" or "distributed"
TRAINING_MODE = "single"  # Change to "distributed" for queue-based training

# For distributed mode, set Redis URL:
REDIS_URL = "redis://:YOUR_PASSWORD@REDIS_DOMAIN:6379/0"

# ============================================

import os
import sys
import platform

# Detect environment
IS_COLAB = 'google.colab' in sys.modules
IS_KAGGLE = 'kaggle' in os.environ.get('KAGGLE_URL_BASE', '')

print(f"Platform: {platform.system()}")
print(f"Python: {sys.version}")
print(f"Environment: {'Colab' if IS_COLAB else 'Kaggle' if IS_KAGGLE else 'Standard VM/Local'}")
print(f"Working directory: {os.getcwd()}")
print(f"\n{'='*50}")
print(f"Training Mode: {TRAINING_MODE.upper()}")
print(f"{'='*50}")

if TRAINING_MODE == "distributed":
    print("\n✓ Will run as TRAINING WORKER")
    print(f"  Pulls games from: {REDIS_URL.split('@')[1] if '@' in REDIS_URL else REDIS_URL}")
    print("  Publishes trained models back to queue")
elif TRAINING_MODE == "single":
    print("\n✓ Will run SINGLE-MACHINE training")
    print("  Self-play + training on this machine")
else:
    raise ValueError(f"Invalid TRAINING_MODE: {TRAINING_MODE}. Must be 'single' or 'distributed'")

## 2. Check GPU

In [None]:
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    gpu_memory_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU Memory: {gpu_memory_gb:.1f} GB")
    
    # Recommend batch size
    if gpu_memory_gb >= 32:
        print("✓ Recommended batch size: 2048")
    elif gpu_memory_gb >= 20:
        print("✓ Recommended batch size: 1024")
    elif gpu_memory_gb >= 12:
        print("✓ Recommended batch size: 512")
    else:
        print("⚠️  GPU has limited memory, will use checkpointing")
else:
    print("❌ No GPU detected! Training will be very slow.")
    print("   For Colab: Runtime → Change runtime type → GPU")

## 3. Setup Storage (Platform-specific)

Choose the appropriate cell based on your platform:

In [None]:
# === GOOGLE COLAB ONLY ===
# Uncomment and run if using Colab

# from google.colab import drive
# drive.mount('/content/drive')
# PROJECT_DIR = '/content/drive/MyDrive/alphagomoku'
# WORK_DIR = '/content/alphagomoku'
# os.makedirs(PROJECT_DIR, exist_ok=True)

In [None]:
# === VAST.AI / RUNPOD / OTHER VMs ===
# Use local storage (usually faster than network storage)

PROJECT_DIR = os.path.expanduser('~/alphagomoku')
WORK_DIR = PROJECT_DIR
os.makedirs(PROJECT_DIR, exist_ok=True)

print(f"Project directory: {PROJECT_DIR}")
print(f"Working directory: {WORK_DIR}")

## 4. Get Code

Choose one method:

In [None]:
# METHOD 1: Clone from Git (Most Common)
!git clone https://github.com/cheshir/alphagomoku.git {WORK_DIR}
%cd {WORK_DIR}

In [None]:
# METHOD 2: If code already exists (Vast.ai with persistent storage)
# Just navigate to it
%cd {WORK_DIR}

In [None]:
# METHOD 3: Upload code manually (for small updates)
# Uncomment if needed

# from google.colab import files
# uploaded = files.upload()  # Upload zip file
# !unzip -o alphagomoku.zip -d {WORK_DIR}
# %cd {WORK_DIR}

## 5. Install Dependencies

In [None]:
# Install dependencies
try:
    import lmdb
    import psutil
    if TRAINING_MODE == "distributed":
        import redis
    print("✓ Dependencies already installed")
except ImportError:
    print("Installing dependencies...")
    # pandas is needed by the monitoring cell in section 8
    !pip install -q numpy pandas tqdm matplotlib lmdb psutil
    
    if TRAINING_MODE == "distributed":
        print("Installing Redis client for distributed training...")
        !pip install -q redis
    
    print("✓ Dependencies installed")

## 6. Training Configuration

Configuration depends on your chosen mode:

In [None]:
if TRAINING_MODE == "single":
    # Single-machine training configuration
    CONFIG = {
        'epochs': 200,
        'selfplay_games': 200,
        'mcts_simulations': 150,
        'parallel_workers': 4,
        'lr': 1e-3,
        'min_lr': 5e-4,
        'difficulty': 'medium',
        
        # Paths
        'checkpoint_dir': f'{PROJECT_DIR}/checkpoints',
        'data_dir': f'{PROJECT_DIR}/data',
        
        # Device auto-configured
        'device': 'auto',
    }
    
elif TRAINING_MODE == "distributed":
    # Distributed training worker configuration (GPU-optimized)
    CONFIG = {
        'redis_url': REDIS_URL,
        'model_preset': 'medium',  # 5.04M params
        'batch_size': 1024,        # Large batch for GPU efficiency
        'min_batches_for_training': 50,  # Train when 50+ games available
        'publish_frequency': 5,    # Publish model every 5 training iterations
        'device': 'cuda',          # Force CUDA for cloud GPU
        'lr': 1e-3,
        'min_lr': 5e-4,
        
        # Paths
        'checkpoint_dir': f'{PROJECT_DIR}/checkpoints',
    }

# Create directories
os.makedirs(CONFIG['checkpoint_dir'], exist_ok=True)
if TRAINING_MODE == "single":
    os.makedirs(CONFIG['data_dir'], exist_ok=True)

print(f"\n{'='*50}")
print(f"Configuration ({TRAINING_MODE.upper()} mode):")
print(f"{'='*50}")
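for key, value in CONFIG.items():
    if key == 'redis_url' and '@' in str(value):
        # The Redis URL embeds the password, so print only scheme and host
        scheme = value.split('://', 1)[0]
        value = f"{scheme}://***@{value.rsplit('@', 1)[-1]}"
    print(f"  {key}: {value}")

print("\n✓ Ready to train!")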

## 7. Start Training

### Single-machine Mode:
- Self-play generates games
- Neural network trains on those games
- All on this machine

### Distributed Mode:
- This machine acts as **training worker**
- Pulls games from Redis queue (generated by Mac/other workers)
- Trains neural network on GPU
- Publishes trained models back to queue for workers to use
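
The distributed steps above can be sketched as a small loop. This is a simplified illustration, not the code in `scripts/distributed_training_worker.py`: the queue is an in-memory `deque` standing in for Redis, `train_step` and `publish` are hypothetical callbacks, and the defaults mirror the notebook's training threshold (50 games) and publish frequency (5). Unlike a long-running worker, the sketch returns when too few games are waiting instead of blocking.

```python
from collections import deque


def run_training_worker(queue, train_step, publish,
                        min_games=50, publish_frequency=5, max_iterations=100):
    """Train on chunks of `min_games` games until the queue runs low.

    `queue` is any deque-like object (an in-memory stand-in for Redis).
    `train_step(games)` and `publish(iteration)` are hypothetical callbacks.
    Returns the number of training iterations performed.
    """
    iterations = 0
    while iterations < max_iterations and len(queue) >= min_games:
        # Pull one chunk of games from the front of the queue.
        games = [queue.popleft() for _ in range(min_games)]
        train_step(games)
        iterations += 1
        # Publish a fresh model every `publish_frequency` iterations.
        if iterations % publish_frequency == 0:
            publish(iterations)
    return iterations


# Demo with dummy "games" (plain integers) and no-op training.
demo_queue = deque(range(520))
demo_published = []
demo_iterations = run_training_worker(
    demo_queue,
    train_step=lambda games: None,
    publish=demo_published.append,
)
print(f"iterations={demo_iterations}, published at={demo_published}, "
      f"left in queue={len(demo_queue)}")
```

Pulling fixed-size chunks keeps the example deterministic; a real worker might instead drain whatever is available once the threshold is reached.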

In [None]:
if TRAINING_MODE == "single":
    # Run traditional single-machine training
    !python scripts/train.py \
        --epochs {CONFIG['epochs']} \
        --selfplay-games {CONFIG['selfplay_games']} \
        --mcts-simulations {CONFIG['mcts_simulations']} \
        --parallel-workers {CONFIG['parallel_workers']} \
        --lr {CONFIG['lr']} \
        --min-lr {CONFIG['min_lr']} \
        --warmup-epochs 0 \
        --lr-schedule cosine \
        --difficulty {CONFIG['difficulty']} \
        --checkpoint-dir {CONFIG['checkpoint_dir']} \
        --data-dir {CONFIG['data_dir']} \
        --device {CONFIG['device']} \
        --resume auto

elif TRAINING_MODE == "distributed":
    # Run as distributed training worker
    print("Starting distributed training worker...")
    print("This will:")
    print("  1. Connect to Redis queue")
    print("  2. Pull games generated by self-play workers")
    print("  3. Train neural network on GPU")
    print("  4. Publish trained models back to queue")
    print("\nPress Ctrl+C to stop\n")
    
    !python scripts/distributed_training_worker.py \
        --redis-url "{CONFIG['redis_url']}" \
        --model-preset {CONFIG['model_preset']} \
        --batch-size {CONFIG['batch_size']} \
        --device {CONFIG['device']} \
        --min-games-for-training {CONFIG['min_batches_for_training']} \
        --publish-frequency {CONFIG['publish_frequency']} \
        --checkpoint-dir {CONFIG['checkpoint_dir']} \
        --lr {CONFIG['lr']} \
        --min-lr {CONFIG['min_lr']}

## 8. Monitor Progress

### For Single-machine Mode:
Monitor training metrics from checkpoint directory

### For Distributed Mode:
Monitor queue status and training worker progress

In [None]:
if TRAINING_MODE == "distributed":
    # Monitor distributed queue
    print("Checking Redis queue status...\n")
    !python scripts/monitor_queue.py --redis-url "{CONFIG['redis_url']}" --once
    
else:
    # Monitor single-machine training
    import pandas as pd
    import matplotlib.pyplot as plt

    metrics_path = f"{CONFIG['checkpoint_dir']}/training_metrics.csv"

    if os.path.exists(metrics_path):
        df = pd.read_csv(metrics_path)
        
        # Plot metrics
        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        
        df.plot(x='epoch', y='loss', ax=axes[0,0], title='Training Loss', grid=True)
        df.plot(x='epoch', y='policy_acc', ax=axes[0,1], title='Policy Accuracy', grid=True)
        df.plot(x='epoch', y='value_mae', ax=axes[1,0], title='Value MAE', grid=True)
        df.plot(x='epoch', y='lr', ax=axes[1,1], title='Learning Rate', grid=True)
        
        plt.tight_layout()
        plt.show()
        
        # Print summary
        total_epochs = CONFIG['epochs']
        print(f"\nTraining Progress: {len(df)}/{total_epochs} epochs")
        print("\nLatest metrics (last 5 epochs):")
        print(df.tail())
        
        # Estimate time remaining
        if 'epoch_time' in df.columns and len(df) > 0:
            avg_time = df['epoch_time'].mean()
            remaining_epochs = max(total_epochs - len(df), 0)
            remaining_hours = (avg_time * remaining_epochs) / 3600
            print(f"\n⏱️  Estimated time remaining: {remaining_hours:.1f} hours")
    else:
        print("No metrics file found yet. Training hasn't started.")

## 9. Check Latest Checkpoint

In [None]:
import glob

# List all checkpoints
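# Sort by name length first so that e.g. epoch_10 sorts after epoch_9
# even when epoch numbers are not zero-padded
checkpoints = sorted(
    glob.glob(f"{CONFIG['checkpoint_dir']}/model_epoch_*.pt"),
    key=lambda path: (len(os.path.basename(path)), path),
)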

if checkpoints:
    print(f"Found {len(checkpoints)} checkpoints:\n")
    
    # Show last 5
    for cp in checkpoints[-5:]:
        size_mb = os.path.getsize(cp) / 1024**2
        epoch = cp.split('_')[-1].replace('.pt', '')
        print(f"  Epoch {epoch:>3}: {size_mb:>6.1f} MB - {os.path.basename(cp)}")
    
    latest = checkpoints[-1]
    print(f"\n✓ Latest checkpoint: {os.path.basename(latest)}")
else:
    print("No checkpoints found yet.")

## 10. Download Checkpoint (Optional)

In Colab, download the latest checkpoint to your local machine (on other platforms the cell only prints the checkpoint location):

In [None]:
# Only works in Colab
if IS_COLAB:
    from google.colab import files
    
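    # Length-then-name ordering keeps epoch_10 after epoch_9 without zero-padding
    checkpoints = sorted(
        glob.glob(f"{CONFIG['checkpoint_dir']}/model_epoch_*.pt"),
        key=lambda path: (len(os.path.basename(path)), path),
    )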
    if checkpoints:
        latest = checkpoints[-1]
        print(f"Downloading: {os.path.basename(latest)}")
        files.download(latest)
    else:
        print("No checkpoints to download")
else:
    print("Download not needed - checkpoints are already on VM storage")
    print(f"Checkpoint location: {CONFIG['checkpoint_dir']}")

## Tips by Platform and Mode

### Google Colab
- ⚠️ Sessions time out after 12-24 hours
- ✓ Checkpoints saved to Google Drive persist
- ✓ Just re-run training cell to resume
- ✓ Use Colab Pro for A100 access
- ✓ **Perfect for distributed training worker** (free T4 GPU)

### Vast.ai
- ✓ Hourly billing - pause anytime
- ✓ Use persistent storage for checkpoints
- ✓ Can rent spot instances (cheaper)
- ⚠️ Spot instances can be interrupted

### RunPod / Lambda Labs
- ✓ More reliable than spot instances
- ✓ Good for long training runs
- ✓ Usually have persistent storage

### Local / AWS / GCP
- ✓ Full control over environment
- ✓ Can run indefinitely
- ✓ May need to install dependencies

---

## Distributed Training Architecture

When using **distributed mode**, the architecture is:

```
Mac M1 Pro (Self-Play)  →  Redis Queue  →  Colab T4 (Training)
                              ↓                    ↓
                        Stores games        Trains on GPU
                              ↑                    ↓
                        Latest model  ←  Publishes model
```

**How it works:**
1. **Self-play workers** (Mac M1 Pro, CPU): Generate games via MCTS, push to queue
2. **Redis queue** (REDIS_DOMAIN): Buffers games and models
3. **Training worker** (This Colab notebook, GPU): Pulls games, trains NN, publishes model
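
To make the three roles concrete, here is a minimal sketch of how games and models might move through Redis. The key names (`games`, `latest_model`) and the JSON game format are assumptions for illustration, not the repository's actual schema. The helpers only use `rpush`, `lpop`, `set`, and `get`, which exist on a redis-py client; the `FakeRedis` class is a hypothetical in-memory stand-in so the example runs without a server.

```python
import json

GAMES_KEY = "games"          # hypothetical key name for the game queue
MODEL_KEY = "latest_model"   # hypothetical key name for the newest model


def push_game(client, game):
    """Self-play worker: append one finished game to the queue."""
    client.rpush(GAMES_KEY, json.dumps(game))


def pull_games(client, max_games):
    """Training worker: pop up to `max_games` games, oldest first."""
    games = []
    for _ in range(max_games):
        raw = client.lpop(GAMES_KEY)
        if raw is None:  # queue is empty
            break
        games.append(json.loads(raw))
    return games


def publish_model(client, weights):
    """Training worker: overwrite the latest model blob."""
    client.set(MODEL_KEY, weights)


def fetch_model(client):
    """Self-play worker: read the latest model blob (None if absent)."""
    return client.get(MODEL_KEY)


class FakeRedis:
    """In-memory stand-in exposing only the four calls used above."""

    def __init__(self):
        self.lists = {}
        self.values = {}

    def rpush(self, key, value):
        self.lists.setdefault(key, []).append(value)
        return len(self.lists[key])

    def lpop(self, key):
        items = self.lists.get(key, [])
        return items.pop(0) if items else None

    def set(self, key, value):
        self.values[key] = value
        return True

    def get(self, key):
        return self.values.get(key)


client = FakeRedis()
push_game(client, {"moves": [112, 113, 127], "winner": 1})
print(pull_games(client, max_games=10))
```

With a real server, `client` would be created with `redis.Redis.from_url(REDIS_URL)` and the helpers would work unchanged.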

**Benefits:**
- ✅ Continuous training (GPU never idles)
- ✅ Use free Colab GPU + local CPU efficiently
- ✅ Scales to multiple self-play workers
- ✅ CPU and GPU work in parallel (4-6x faster than single-machine)

**Cost:**
- Redis VM: $0/month (minimal resources, 2GB RAM sufficient)
- Colab: $0/month (free tier T4) or $10/month (Colab Pro for A100)
- Total: **$0-10/month** vs $60-120 for dedicated cloud GPU

---

## Expected Training Time

### Single-machine Mode (200 epochs)

| GPU | VRAM | Batch | Time/Epoch | Total |
|-----|------|-------|-----------|-------|
| RTX 4090 | 24GB | 1024 | ~20-30 min | ~3-5 days |
| A100 | 40GB | 2048 | ~15-20 min | ~2-3 days |
| V100 | 16GB | 1024 | ~35-45 min | ~4-6 days |
| T4 | 16GB | 512 | ~60-90 min | ~8-12 days |
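
The "Total" column follows from the per-epoch times and the 200-epoch run length. A small sketch of that arithmetic, assuming training runs around the clock with no interruptions:

```python
def total_days(epochs, minutes_per_epoch):
    """Convert a per-epoch duration into total wall-clock days."""
    return epochs * minutes_per_epoch / 60 / 24


# Per-epoch ranges (minutes) copied from the table above.
time_per_epoch = {
    "RTX 4090": (20, 30),
    "A100": (15, 20),
    "V100": (35, 45),
    "T4": (60, 90),
}

for gpu, (fast, slow) in time_per_epoch.items():
    print(f"{gpu}: {total_days(200, fast):.1f} to {total_days(200, slow):.1f} days")
```

For the RTX 4090 row this works out to roughly 2.8 to 4.2 days, in line with the rounded "~3-5 days" estimate.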

### Distributed Mode (Continuous Training)

| Self-Play Workers | Training GPU | Games/Hour | Training Rate | Effective Speed |
|-------------------|--------------|------------|---------------|-----------------|
| 6 CPU (Mac M1) | Colab T4 | ~120-180 | 600-800 games/hr | 4-6x faster |
| 1 MPS (Mac M1) | Colab T4 | ~40-50 | 600-800 games/hr | Balanced |
| 6 CPU (Mac M1) | Colab A100 | ~120-180 | 1500-2000 games/hr | 8-10x faster |

**Note**: GPU processes games faster than generation, so training is continuous (80-95% GPU utilization vs 25-40% single-machine).
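
One practical consequence of these rates is the wait before the first training round. The sketch below estimates it, assuming the queue starts empty and using the 50-game training threshold from the distributed configuration together with the self-play rates in the table:

```python
def minutes_until_threshold(games_needed, games_per_hour):
    """Minutes of self-play needed to accumulate `games_needed` games."""
    return games_needed * 60 / games_per_hour


# 6 CPU workers generate roughly 120-180 games/hour (from the table above).
slowest = minutes_until_threshold(50, 120)
fastest = minutes_until_threshold(50, 180)
print(f"First training round after about {fastest:.0f}-{slowest:.0f} minutes")
```

With a single MPS worker (~40-50 games/hour) the same threshold takes about 60 to 75 minutes to reach.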

---

## Troubleshooting

### Single-machine Mode

**Out of Memory:**
- Script should auto-configure, but if OOM still happens:
- Reduce workers: `--parallel-workers 2`
- Reduce games: `--selfplay-games 100`

**Slow Training:**
- Check GPU is being used (should see CUDA in logs)
- Reduce simulations if needed: `--mcts-simulations 100`

**Connection Lost (Colab/Vast.ai):**
- Just re-run training cell with `--resume auto`
- Checkpoints saved every epoch

### Distributed Mode

**"Connection refused" to Redis:**
- Check REDIS_URL is correct
- Verify Redis server is running: `redis-cli -u <REDIS_URL> ping`
- Check firewall allows port 6379
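
If `redis-cli` is not installed on the training machine, a plain TCP check from Python can tell a firewall or DNS problem apart from an authentication problem. This sketch uses only the standard library; `parse_redis_url` and `can_connect` are hypothetical helper names, and a successful connection only proves the port is reachable, not that the password is correct.

```python
import socket
from urllib.parse import urlparse


def parse_redis_url(url):
    """Split a redis:// URL into (host, port, db), defaulting to port 6379 and db 0."""
    parsed = urlparse(url)
    port = parsed.port or 6379
    db = int(parsed.path.lstrip("/") or 0)
    return parsed.hostname, port, db


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


host, port, db = parse_redis_url("redis://:YOUR_PASSWORD@example.com:6379/0")
print(f"host={host} port={port} db={db}")
# Then, with your real URL: can_connect(host, port)
```

If `can_connect` returns `False`, look at the firewall, the domain mapping, or whether the Redis container is running; if it returns `True` but the worker still fails, the password or database index is the more likely cause.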

**Training worker idle (no games):**
- Check self-play workers are running
- Monitor queue: `python scripts/monitor_queue.py --redis-url <URL>`
- Expected: 10+ games in queue

**Self-play workers not fetching new model:**
- Workers fetch every 10 games (normal)
- Check model published: Queue shows "Latest model: epoch X"

**GPU utilization low in distributed mode:**
- Should be 80-95% during training batches
- If low, increase `--batch-size` or `--min-games-for-training`

---

## Getting Started with Distributed Training

### Step 1: Deploy Redis Queue
```bash
# On your VM
git clone https://github.com/cheshir/alphagomoku.git
cd alphagomoku
docker-compose up -d

# Set REDIS_PASSWORD in environment
# Map to REDIS_DOMAIN
```

### Step 2: Start Self-Play Workers (Mac M1 Pro)
```bash
# On your Mac
export REDIS_URL="redis://:YOUR_PASSWORD@REDIS_DOMAIN:6379/0"

# Start 6 CPU workers (recommended for large models)
make distributed-selfplay-cpu-workers

# Or start 1 MPS worker (faster per-game, sequential)
# make distributed-selfplay-mps-worker
```

### Step 3: Start Training Worker (This Colab Notebook)
```python
# In the configuration cell (section 1) above:
TRAINING_MODE = "distributed"
REDIS_URL = "redis://:YOUR_PASSWORD@REDIS_DOMAIN:6379/0"

# Then run sections 2-7 to start the training worker
```

### Step 4: Monitor Progress
```bash
# Re-run the monitoring cell (section 8) periodically to check queue status
# Or run on your Mac:
python scripts/monitor_queue.py --redis-url "$REDIS_URL"
```

---

## Code Updates

To get latest code changes:
```python
# Run in a notebook cell
%cd {WORK_DIR}
!git pull origin master

# If you have local changes, stash first:
# !git stash && git pull && git stash pop
```