# Overcooked Multi-Agent RL - Getting Started

This notebook checks your installation and smoke-tests the environment, the models, and a short rollout loop before you start training.

## 1. Installation Check

First, verify all dependencies are installed correctly.

In [None]:
import sys
import os

# Add src to path
sys.path.insert(0, os.path.abspath('../src'))

# Check imports
try:
    import torch
    print(f"✓ PyTorch {torch.__version__}")
except ImportError:
    print("✗ PyTorch not installed. Install from https://pytorch.org")

try:
    import numpy as np
    print(f"✓ NumPy {np.__version__}")
except ImportError:
    print("✗ NumPy not installed")

try:
    import overcooked_ai_py
    print("✓ Overcooked-AI installed")
except ImportError:
    print("✗ Overcooked-AI not installed. Run: pip install overcooked-ai==1.1.0")

try:
    import matplotlib.pyplot as plt
    print("✓ Matplotlib installed")
except ImportError:
    print("✗ Matplotlib not installed")
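
The four checks above all follow the same pattern, so they can be folded into one loop. The following is a minimal sketch using only the standard library; the helper name `check_imports` is hypothetical and not part of the project.

```python
import importlib


def check_imports(module_names):
    """Try to import each module; map its name to a version string, "unknown", or None."""
    results = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            results[name] = None  # dependency is missing
        else:
            # Not every package exposes __version__, so fall back to a placeholder
            results[name] = getattr(module, "__version__", "unknown")
    return results
```

Calling `check_imports(["torch", "numpy", "overcooked_ai_py", "matplotlib"])` in this notebook would return one entry per dependency, with `None` marking anything that still needs to be installed.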

## 2. Test Environment

Test the Overcooked environment with random actions.

In [None]:
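from overcooked_ai_py.mdp.overcooked_mdp import OvercookedGridworld
from overcooked_ai_py.mdp.overcooked_env import OvercookedEnv, Overcooked

# Build the base environment
layout_name = 'cramped_room'
mdp = OvercookedGridworld.from_layout_name(layout_name)
base_env = OvercookedEnv.from_mdp(mdp, horizon=400)

# Assumption: the Gym-style `Overcooked` wrapper is what provides `action_space`
# and the `both_agent_obs` dict used below; the raw OvercookedEnv exposes neither.
env = Overcooked(base_env=base_env, featurize_fn=base_env.featurize_state_mdp)

print(f"Environment created: {layout_name}")
print(f"Horizon: {base_env.horizon}")

# Test episode with random actions
obs = env.reset()
print(f"\nObservation shape per agent: {obs['both_agent_obs'][0].shape}")

total_reward = 0
num_soups = 0
done = False
steps = 0

while not done and steps < 100:
    # Random actions for both agents
    actions = [env.action_space.sample(), env.action_space.sample()]
    obs, rewards, done, info = env.step(actions)

    # np.sum handles both a single team reward and a per-agent list of rewards
    step_reward = float(np.sum(rewards))
    total_reward += step_reward
    if step_reward > 0:
        num_soups += int(step_reward) // 20  # 20 reward per delivered soup

    steps += 1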

print("\nRandom episode results (up to 100 steps):")
print(f"  Steps: {steps}")
print(f"  Total reward: {total_reward}")
print(f"  Soups delivered: {num_soups}")
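
The soup count above relies on each delivery contributing a fixed reward of 20, so integer-dividing a step's reward by 20 recovers the number of deliveries on that step. A minimal sketch of that arithmetic follows; `count_soups` is a hypothetical helper, and the approach assumes the rewards are sparse delivery rewards only (it would miscount if shaped rewards were mixed in).

```python
SOUP_REWARD = 20  # per-delivery reward assumed by the cell above


def count_soups(step_rewards, soup_reward=SOUP_REWARD):
    """Count deliveries from a sequence of per-step team rewards."""
    return sum(int(r) // soup_reward for r in step_rewards if r > 0)


# Two steps with deliveries: one soup, then two soups on the same step
print(count_soups([0, 0, 20, 0, 40]))
```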

## 3. Test Models

Verify that our PPO models can be created and run.

In [None]:
from models import ActorNetwork, CentralizedCritic
from configs.hyperparameters import HyperParams

# Create networks
device = torch.device('cpu')

# One independent actor per agent (two agents)
actors = [
    ActorNetwork(
        obs_dim=HyperParams.obs_dim,
        action_dim=HyperParams.action_dim,
        hidden_size=HyperParams.hidden_size,
        num_layers=HyperParams.num_layers
    ).to(device)
    for _ in range(2)
]

critic = CentralizedCritic(
    joint_obs_dim=HyperParams.joint_obs_dim,
    hidden_size=HyperParams.hidden_size,
    num_layers=HyperParams.num_layers
).to(device)

print("Networks created successfully!")
print(f"\nActor 0 parameters: {sum(p.numel() for p in actors[0].parameters()):,}")
print(f"Actor 1 parameters: {sum(p.numel() for p in actors[1].parameters()):,}")
print(f"Critic parameters: {sum(p.numel() for p in critic.parameters()):,}")

# Test forward pass
obs_test = torch.randn(1, HyperParams.obs_dim)
joint_obs_test = torch.randn(1, HyperParams.joint_obs_dim)

# No gradients are needed for a shape check
with torch.no_grad():
    action_probs = actors[0](obs_test)
    value = critic(joint_obs_test)

print(f"\nTest forward pass:")
print(f"  Actor output shape: {action_probs.shape}")
print(f"  Critic output shape: {value.shape}")
print("\n✓ Models working correctly!")
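
The parameter counts printed above can be sanity-checked by hand. The sketch below assumes each network is a plain fully connected MLP with bias terms, which is an assumption since the `models` module is not shown here; the function name and the example sizes are hypothetical.

```python
def mlp_param_count(in_dim, hidden_size, num_hidden_layers, out_dim):
    """Weights plus biases of a fully connected MLP with equal-width hidden layers."""
    sizes = [in_dim] + [hidden_size] * num_hidden_layers + [out_dim]
    # Each linear layer has fan_in * fan_out weights and fan_out biases
    return sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))


# Hypothetical sizes: 96 inputs, two hidden layers of 64 units, 6 actions
print(mlp_param_count(96, 64, 2, 6))  # 6208 + 4160 + 390 = 10758
```

If the printed counts differ from this formula, the networks likely include extra components such as normalization layers or a different layer layout.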

## 4. Quick Training Test

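Run a few episodes to check that the data-collection half of the training loop works. This cell only fills the rollout buffer; it does not perform a PPO update.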

In [None]:
from ppo import PPO

# Create PPO agent
ppo = PPO(actors, critic, HyperParams, device=device)

print("Running 5 test episodes...")

for episode in range(5):
    obs = env.reset()
    done = False
    episode_reward = 0
    
    while not done:
        # Get observations
        observations = [obs['both_agent_obs'][0], obs['both_agent_obs'][1]]
        
        # Select actions
        actions, log_probs, entropies, value = ppo.select_actions(observations)
        
        # Step
        next_obs, rewards, done, info = env.step(actions)
        
        # Store in buffer
        joint_obs = np.concatenate(observations)
        ppo.buffer.add(observations, joint_obs, actions, log_probs, rewards, value, done)
        
        obs = next_obs
        # np.sum handles both a single team reward and a per-agent list of rewards
        episode_reward += float(np.sum(rewards))
    
    print(f"  Episode {episode + 1}: Reward = {episode_reward:.1f}, Buffer size = {len(ppo.buffer)}")

print("\n✓ Training loop works!")
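
The buffer filled above stores rewards, value estimates, and done flags, which are the inputs an advantage estimator needs. As an illustration, here is a minimal pure-Python sketch of generalized advantage estimation (GAE), a common choice with PPO. Whether the project's `PPO` class uses GAE is an assumption, as are the function name and the default `gamma` and `lam` values.

```python
def compute_gae(rewards, values, dones, last_value=0.0, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout using GAE."""
    advantages = [0.0] * len(rewards)
    next_value = last_value  # bootstrap value for the state after the final step
    running = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0  # cut the bootstrap at episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    # Value targets are the advantages added back onto the value estimates
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


adv, ret = compute_gae([1.0, 1.0], [0.0, 0.0], [False, True], gamma=1.0, lam=1.0)
print(adv, ret)
```

With `gamma=1` and `lam=1` the estimator reduces to the undiscounted return minus the value estimate, which makes small cases easy to verify by hand.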

## 5. Next Steps

If all cells above ran successfully, you're ready to start training!

### To train agents:

```bash
# Cramped room (easiest - start here)
python src/train.py --layout cramped_room --episodes 50000

# Coordination ring
python src/train.py --layout coordination_ring --episodes 100000

# Counter circuit (hardest)
python src/train.py --layout counter_circuit_o_1order --episodes 150000
```
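
To queue the three runs above from Python, the commands can be built as argument lists. This is a minimal sketch: the layout names, episode counts, and flags are copied from the commands above, while `build_train_commands` is a hypothetical helper that only constructs the commands and does not launch them.

```python
import sys

# Layout names and episode budgets copied from the commands above
TRAINING_RUNS = [
    ("cramped_room", 50000),
    ("coordination_ring", 100000),
    ("counter_circuit_o_1order", 150000),
]


def build_train_commands(runs=TRAINING_RUNS, script="src/train.py"):
    """Return one argv list per training run, using the current Python interpreter."""
    return [
        [sys.executable, script, "--layout", layout, "--episodes", str(episodes)]
        for layout, episodes in runs
    ]


for cmd in build_train_commands():
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to run the trainings one after another.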

### To evaluate trained agents:

```bash
# Evaluate single layout
python src/evaluate.py --layout cramped_room --num_episodes 100

# Evaluate all layouts
python src/evaluate.py --num_episodes 100
```

### To create report graphs:

```bash
python src/visualize.py
```