# Tako HRM - Setup & Performance Benchmark

This notebook sets up the Tako environment on Google Colab and runs performance benchmarks.

## 🚀 Quick Start

1. **Enable GPU:** Runtime → Change runtime type → GPU (T4 or better)
2. **Run all cells:** Runtime → Run all
3. **Check benchmark results** at the bottom

---

## Setup Steps

- Verify GPU availability
- Clone repository
- Install dependencies with `uv`
- Mount Google Drive for checkpoints (optional)
- Verify the installation
- Run performance benchmark

## 1. Check GPU Availability

In [None]:
import torch

print("="*80)
print("GPU Check")
print("="*80)

if torch.cuda.is_available():
    print(f"✅ CUDA GPU Available: {torch.cuda.get_device_name(0)}")
    print(f"   GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    device = 'cuda'
elif torch.backends.mps.is_available():
    print("✅ Apple MPS Available (Metal Performance Shaders)")
    device = 'mps'
else:
    print("⚠️  No GPU available, using CPU (will be slower)")
    device = 'cpu'

print(f"\nUsing device: {device}")
print("="*80)

## 2. Clone Repository & Install Dependencies

**Note:** If you already have the repo, skip the clone and just `cd` into it.

In [None]:
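# Install uv (fast Python package manager)
!curl -LsSf https://astral.sh/uv/install.sh | sh

# Add uv to PATH for this session. Recent installers place the binary in
# ~/.local/bin; older ones used ~/.cargo/bin, so include both.
import os
uv_dirs = [os.path.expanduser(p) for p in ('~/.local/bin', '~/.cargo/bin')]
os.environ['PATH'] = os.pathsep.join(uv_dirs + [os.environ['PATH']])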

In [None]:
# Clone repository (modify URL to your fork if needed)
import os
if not os.path.exists('tako-v2'):
    !git clone https://github.com/zfdupont/tako-v2.git
    print("✅ Repository cloned")
else:
    print("✅ Repository already exists")

%cd tako-v2

# Install dependencies (uv may live in ~/.local/bin or ~/.cargo/bin,
# depending on the installer version, so search both)
!export PATH="$HOME/.local/bin:$HOME/.cargo/bin:$PATH" && uv sync

print("\n✅ Dependencies installed")

## 3. Mount Google Drive (Optional)

Mount your Google Drive to save checkpoints and results.
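
The linking step in this section removes any existing local `checkpoints` directory before creating the symlink. As a hedged alternative, the sketch below keeps existing data by renaming it instead of deleting it. The function name `link_checkpoints` and the `.local_backup` suffix are hypothetical and not part of the Tako repository; the demonstration uses a temporary directory rather than a mounted Drive.

```python
import os
import tempfile
from pathlib import Path


def link_checkpoints(target_dir, link_path="checkpoints"):
    """Point link_path at target_dir without deleting existing data.

    An existing real directory at link_path is renamed to
    '<link_path>.local_backup' rather than removed.
    """
    target = Path(target_dir)
    link = Path(link_path)
    target.mkdir(parents=True, exist_ok=True)

    if link.is_symlink():
        link.unlink()  # a previous link: safe to replace
    elif link.exists():
        backup = link.with_name(link.name + ".local_backup")
        link.rename(backup)  # keep local checkpoints instead of rm -rf
    link.symlink_to(target, target_is_directory=True)
    return link


# Demonstration in a temporary directory; on Colab the target would be
# /content/drive/MyDrive/tako_checkpoints after drive.mount(...)
with tempfile.TemporaryDirectory() as tmp:
    link = link_checkpoints(Path(tmp) / "drive" / "tako_checkpoints",
                            Path(tmp) / "checkpoints")
    print(link.is_symlink(), os.path.realpath(link))
```

If a backup directory from an earlier run already exists, the rename may fail; move or remove that backup first.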

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Create checkpoint directory in Drive
!mkdir -p /content/drive/MyDrive/tako_checkpoints/tictactoe

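# Link to local checkpoint directory
# WARNING: rm -rf deletes any existing local `checkpoints` directory;
# back it up first if it holds checkpoints you want to keep
!rm -rf checkpoints
!ln -s /content/drive/MyDrive/tako_checkpoints checkpoints

print("✅ Google Drive mounted and checkpoints linked")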

## 4. Verify Installation

In [None]:
# Test imports
import sys
sys.path.insert(0, '/content/tako-v2')

import yaml
import torch
import numpy as np

from model.hrm import HRM
from games.tictactoe import TicTacToeGame
from training.mcts import MCTS

print("✅ All imports successful!")
print(f"   PyTorch version: {torch.__version__}")
print(f"   CUDA available: {torch.cuda.is_available()}")

## 5. Run Performance Benchmark

Measure forward pass time and estimate training throughput.
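
The measurement pattern used in this section (warm up, synchronize the device, time each call, then aggregate) can be sketched in a device-independent form. This is a minimal illustration using only the standard library; `time_callable` and `summarize` are hypothetical helper names, and the workload is a stand-in for a call such as `model.predict(...)`.

```python
import statistics
import time


def time_callable(fn, warmup=5, iters=20, sync=None):
    """Time fn() and return per-call durations in seconds.

    sync is an optional zero-argument callable (for example
    torch.cuda.synchronize) run before and after each timed call so that
    asynchronous device work is included in the measurement.
    """
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(iters):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return mean, median, min, and max of the timings in milliseconds."""
    ms = [d * 1000 for d in durations]
    return {
        "mean_ms": statistics.mean(ms),
        "median_ms": statistics.median(ms),
        "min_ms": min(ms),
        "max_ms": max(ms),
    }


# Stand-in workload; in the notebook this would be a model forward pass
stats = summarize(time_callable(lambda: sum(range(10_000))))
print(stats)
```

Reporting the median alongside the mean makes the result less sensitive to a single slow iteration, which is common on shared Colab hardware.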

In [None]:
import yaml
import torch
import time
from model.hrm import HRM
from games.tictactoe import TicTacToeGame

# Determine device
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
else:
    device = 'cpu'

print("="*80)
print("TicTacToe Performance Benchmark")
print("="*80)
print(f"\nDevice: {device}")
if device == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")

# Load config
with open('config/tictactoe.yaml') as f:
    config = yaml.safe_load(f)

model_config = config['model']
mcts_config = config['mcts']

print(f"\nModel Configuration:")
print(f"  d_model: {model_config['d_model']}")
print(f"  n_layers: {model_config['n_layers']}")
print(f"  N×T: {model_config['N']}×{model_config['T']} = {model_config['N']*model_config['T']} timesteps/segment")
print(f"\nMCTS Configuration:")
print(f"  Simulations: {mcts_config['simulations']}")
print(f"  max_segments_inference: {mcts_config.get('max_segments_inference', 1)}")

# Create model
model = HRM(**model_config)
model = model.to(device)
model.eval()

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"\nModel Parameters: {total_params:,} ({total_params/1e6:.2f}M)")

# Create dummy input
game = TicTacToeGame()
tokens = game.to_tokens().unsqueeze(0).to(device)  # [1, seq_len]

# Warmup
print(f"\nWarming up...")
with torch.no_grad():
    for _ in range(5):
        _ = model.predict(tokens, use_act=True, max_segments=mcts_config.get('max_segments_inference', 1))

# Benchmark
print(f"Benchmarking (20 iterations)...")
times = []
with torch.no_grad():
    for _ in range(20):
        # Wait for queued GPU work so the timer covers only this pass
        if device == 'cuda':
            torch.cuda.synchronize()

        start = time.perf_counter()  # monotonic, high-resolution timer
        policy, value, _ = model.predict(
            tokens, use_act=True,
            max_segments=mcts_config.get('max_segments_inference', 1)
        )

        # Wait for the forward pass to finish before stopping the timer
        if device == 'cuda':
            torch.cuda.synchronize()

        elapsed = time.perf_counter() - start
        times.append(elapsed)

# Results
avg_time = sum(times) / len(times)
min_time = min(times)
max_time = max(times)

print(f"\n{'='*80}")
print("Forward Pass Results")
print(f"{'='*80}")
print(f"  Average: {avg_time*1000:.2f}ms")
print(f"  Min: {min_time*1000:.2f}ms")
print(f"  Max: {max_time*1000:.2f}ms")

# Estimate game generation time
avg_moves = 7
sims = mcts_config['simulations']
total_passes = sims * avg_moves
est_time_per_game = avg_time * total_passes

print(f"\n{'='*80}")
print("Estimated Training Throughput")
print(f"{'='*80}")
print(f"  MCTS simulations: {sims}")
print(f"  Avg moves per game: {avg_moves}")
print(f"  Forward passes per game: {total_passes}")
print(f"  Time per game: {est_time_per_game:.2f}s")
print(f"  Games per hour (1 worker): {3600/est_time_per_game:.0f}")
print(f"  Games per hour (8 workers): {8*3600/est_time_per_game:.0f}")

# Speedup vs CPU baseline
cpu_baseline = 0.311  # 311ms from CPU benchmark
baseline_time_per_game = cpu_baseline * 200 * avg_moves  # 200 sims
speedup = baseline_time_per_game / est_time_per_game

print(f"\n{'='*80}")
print(f"Speedup vs CPU Baseline (311ms/pass, 200 sims)")
print(f"{'='*80}")
print(f"  Baseline: {baseline_time_per_game:.1f}s per game")
print(f"  Current: {est_time_per_game:.2f}s per game")
print(f"  Speedup: {speedup:.1f}x faster")
print(f"\n✅ Benchmark complete!\n")

## 🎯 Next Steps

Now that setup is complete, you can:

1. **Train a model:** Open `01_train_tictactoe.ipynb`
2. **Evaluate a model:** Open `02_evaluate_model.ipynb`
3. **Play interactively:** Open `03_interactive_play.ipynb`

---

### Expected Performance on Colab GPUs

| GPU Type | Forward Pass | Games/Hour (8 workers) |
|----------|--------------|------------------------|
| **T4** | ~1-2ms | ~150,000 |
| **V100** | ~0.5-1ms | ~300,000 |
| **A100** | ~0.3-0.5ms | ~500,000 |
| **CPU** | ~3-5ms | ~50,000 |

*(Approximate figures; actual results vary with system load, model configuration, and the number of MCTS simulations.)*
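
The benchmark cell converts forward-pass time into games per hour by multiplying it by the MCTS simulation count and the average number of moves per game. The sketch below reproduces that arithmetic so you can plug in your own measurements. The function name `estimate_throughput` is hypothetical, and the calculation assumes one forward pass per simulation and perfectly linear scaling across workers, as the benchmark cell does.

```python
def estimate_throughput(forward_s, sims, avg_moves=7, workers=1):
    """Estimate seconds per game and games per hour.

    forward_s: measured forward-pass time in seconds
    sims:      MCTS simulations per move
    avg_moves: assumed average moves per game (7, as in the benchmark cell)
    workers:   number of parallel self-play workers (assumed linear scaling)
    """
    passes_per_game = sims * avg_moves
    seconds_per_game = forward_s * passes_per_game
    games_per_hour = workers * 3600 / seconds_per_game
    return seconds_per_game, games_per_hour


# CPU baseline from the benchmark cell: 311 ms per pass, 200 simulations
seconds_per_game, games_per_hour = estimate_throughput(0.311, sims=200)
print(f"{seconds_per_game:.1f}s per game, {games_per_hour:.1f} games/hour")
```

Because the estimate scales inversely with `sims`, throughput figures are only comparable between runs that use the same simulation count.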