# Getting Started with ManaCore Gym

This notebook introduces the ManaCore Gym environment for training AI agents to play Magic: The Gathering.

## What you'll learn
1. Basic environment usage
2. Understanding observations and actions
3. Playing games with random actions
4. Training an RL agent with MaskablePPO from sb3-contrib (built on Stable Baselines3)

## Prerequisites
- `pip install manacore-gym`
- `pip install sb3-contrib` (for training)
- Bun runtime (for the game server)

## 1. Basic Environment Usage

Let's start by creating an environment and exploring its structure.

In [None]:
import gymnasium as gym
import numpy as np

# Import manacore_gym to register the environment
import manacore_gym

print("ManaCore Gym version:", manacore_gym.__version__)

In [None]:
# Create the environment
# The server will auto-start if not running
env = gym.make("ManaCore-v0", opponent="greedy")

print(f"Observation space: {env.observation_space}")
print(f"Action space: {env.action_space}")

## 2. Understanding Observations

The observation is a 25-dimensional feature vector representing the game state.

| Index | Feature | Description |
|-------|---------|-------------|
| 0-2 | Life totals | Player, opponent, delta |
| 3-9 | Board state | Creatures, power, toughness |
| 10-14 | Card advantage | Hand size, library size |
| 15-18 | Mana | Lands, untapped lands |
| 19-24 | Game state | Turn, phase, combat flags |

In [None]:
# Reset the environment to start a new game
obs, info = env.reset(seed=42)

print(f"Observation shape: {obs.shape}")
print(f"Observation dtype: {obs.dtype}")
print("\nObservation values:")
print(obs)
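
A raw 25-element vector is hard to read. As a minimal sketch, the index ranges from the table above can be turned into named slices. The group names and the `split_observation` helper are illustrative and not part of `manacore_gym`; the demo uses a placeholder vector so it runs without the game server.

```python
import numpy as np

# Index ranges copied from the table above; the group names are illustrative.
OBS_GROUPS = {
    "life_totals": slice(0, 3),
    "board_state": slice(3, 10),
    "card_advantage": slice(10, 15),
    "mana": slice(15, 19),
    "game_state": slice(19, 25),
}


def split_observation(obs):
    """Return a dict mapping each feature group to its slice of the vector."""
    obs = np.asarray(obs)
    if obs.shape != (25,):
        raise ValueError(f"expected shape (25,), got {obs.shape}")
    return {name: obs[idx] for name, idx in OBS_GROUPS.items()}


# Demo on a placeholder vector (not a real game state)
demo_obs = np.arange(25, dtype=np.float32)
for name, values in split_observation(demo_obs).items():
    print(f"{name:>15}: {values}")
```

With a live environment, `split_observation(obs)` could be called on the array returned by `env.reset()` or `env.step()`.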

In [None]:
# The info dict contains additional game information
print("Info keys:", info.keys())
print(f"\nPlayer life: {info.get('playerLife', 'N/A')}")
print(f"Opponent life: {info.get('opponentLife', 'N/A')}")
print(f"Turn: {info.get('turn', 'N/A')}")
print(f"Phase: {info.get('phase', 'N/A')}")

## 3. Understanding Actions and Action Masking

The set of legal actions in MTG changes with the game state. We use **action masking** to handle this:
- The action space is `Discrete(200)`, so there are at most 200 action indices (0 to 199)
- `info["action_mask"]` tells you which actions are legal (True = legal)

In [None]:
# Get the action mask
action_mask = info["action_mask"]

print(f"Action mask shape: {action_mask.shape}")
print(f"Action mask dtype: {action_mask.dtype}")

# Find legal actions
legal_actions = np.where(action_mask)[0]
print(f"\nNumber of legal actions: {len(legal_actions)}")
print(f"Legal action indices: {legal_actions[:10]}...")

In [None]:
# You can also use the environment's action_masks() method
# This is the standard interface for sb3-contrib's MaskablePPO
mask = env.unwrapped.action_masks()
print(f"action_masks() shape: {mask.shape}")
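
To build intuition for what a masking-aware policy does with this array, here is a minimal NumPy sketch: illegal actions have their logits replaced with negative infinity before the softmax, so they end up with probability zero. This illustrates the general idea only and is not sb3-contrib's actual implementation; `masked_probabilities` is a hypothetical helper and the logits are made up.

```python
import numpy as np


def masked_probabilities(logits, mask):
    """Softmax over legal actions only; illegal actions get probability 0."""
    logits = np.asarray(logits, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("at least one action must be legal")
    # Replace illegal logits with -inf so exp() maps them to exactly 0
    masked = np.where(mask, logits, -np.inf)
    # Subtract the max for numerical stability before exponentiating
    exp = np.exp(masked - masked.max())
    return exp / exp.sum()


# Toy example: 5 actions, only indices 1 and 3 are legal
logits = np.array([2.0, 0.5, 1.0, 0.5, 3.0])
mask = np.array([False, True, False, True, False])
probs = masked_probabilities(logits, mask)
print(probs)
```

Actions 1 and 3 have equal logits, so each receives probability 0.5, while the high-scoring but illegal action 4 receives none.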

## 4. Playing a Game with Random Actions

Let's play a complete game using random (but legal) actions.

In [None]:
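# Reset for a new game
obs, info = env.reset(seed=123)

# Seed NumPy separately: env.reset(seed=...) does not seed NumPy's global RNG
rng = np.random.default_rng(123)

total_reward = 0
step_count = 0
done = False
max_steps = 200  # Safety cap on episode length

while not done and step_count < max_steps:
    # Get legal actions
    mask = info["action_mask"]
    legal_actions = np.where(mask)[0]

    # Sample a random legal action (cast the NumPy integer to a plain int)
    action = int(rng.choice(legal_actions))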

    # Take the action
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    total_reward += reward
    step_count += 1

    # Print progress every 20 steps
    if step_count % 20 == 0:
        print(f"Step {step_count}: Life {info.get('playerLife', '?')}/{info.get('opponentLife', '?')}")

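# Game result
print(f"\n{'=' * 40}")
if terminated:
    print(f"Game finished after {step_count} steps")
    # Assumes a positive total reward means the player won
    print(f"Winner: {'Player' if total_reward > 0 else 'Opponent'}")
else:
    print(f"Game stopped after {step_count} steps without a winner")
print(f"Total reward: {total_reward}")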
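
The loop above can be wrapped in a reusable function, which makes it easier to run many random games as a baseline. The sketch below assumes only what this notebook shows: a Gymnasium-style `reset()`/`step()` interface and an `info["action_mask"]` entry. `play_random_episode` and the `CountdownEnv` stand-in are hypothetical names; the stand-in exists only so the example runs without the game server.

```python
import numpy as np


def play_random_episode(env, rng, max_steps=200):
    """Play one episode with uniformly random legal actions.

    Returns (total_reward, steps, terminated).
    """
    obs, info = env.reset()
    total_reward, steps, terminated = 0.0, 0, False
    for steps in range(1, max_steps + 1):
        legal = np.flatnonzero(info["action_mask"])
        action = int(rng.choice(legal))
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward, steps, terminated


class CountdownEnv:
    """Hypothetical stand-in env: ends with reward +1 after 5 steps."""

    def reset(self, seed=None):
        self.remaining = 5
        return np.zeros(25, dtype=np.float32), {"action_mask": self._mask()}

    def step(self, action):
        assert self._mask()[action], "illegal action"
        self.remaining -= 1
        terminated = self.remaining == 0
        reward = 1.0 if terminated else 0.0
        info = {"action_mask": self._mask()}
        return np.zeros(25, dtype=np.float32), reward, terminated, False, info

    def _mask(self):
        # Only actions 0 and 1 are ever legal in this toy env
        mask = np.zeros(200, dtype=bool)
        mask[:2] = True
        return mask


rng = np.random.default_rng(0)
total, steps, finished = play_random_episode(CountdownEnv(), rng)
print(f"reward={total}, steps={steps}, finished={finished}")
```

Passing the real environment instead of `CountdownEnv()` should work the same way, provided its `info` dict carries the mask on every step as it does in the cells above.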

## 5. Training with Stable Baselines3

Now let's train an RL agent using MaskablePPO from sb3-contrib.

**Note:** Training takes time. For demonstration, we'll use a small number of timesteps.

In [None]:
# Import SB3 components
try:
    from sb3_contrib import MaskablePPO
    from sb3_contrib.common.wrappers import ActionMasker

    SB3_AVAILABLE = True
except ImportError:
    print("sb3-contrib not installed. Run: pip install sb3-contrib")
    SB3_AVAILABLE = False

In [None]:
if SB3_AVAILABLE:
    # Create environment
    from manacore_gym import ManaCoreBattleEnv

    # Close the gym.make() environment from the earlier sections before replacing it
    env.close()
    env = ManaCoreBattleEnv(opponent="greedy")

    # Wrap with ActionMasker
    def mask_fn(env):
        return env.action_masks()

    env = ActionMasker(env, mask_fn)

    # Create model
    model = MaskablePPO(
        "MlpPolicy",
        env,
        verbose=1,
        learning_rate=3e-4,
        n_steps=512,  # Smaller for demo
        batch_size=64,
    )

    print("Model created successfully!")
    print(f"Policy architecture: {model.policy}")

In [None]:
if SB3_AVAILABLE:
    # Train for a short demo (increase timesteps for real training)
    print("Training for 5000 timesteps (demo)...")
    print("For real training, use 100k+ timesteps.\n")

    model.learn(total_timesteps=5000, progress_bar=True)

    print("\nTraining complete!")

In [None]:
if SB3_AVAILABLE:
    # Evaluate the trained agent
    n_games = 10
    print(f"Evaluating trained agent ({n_games} games)...\n")

    eval_env = ManaCoreBattleEnv(opponent="greedy")
    wins = 0

    for game in range(n_games):
        obs, info = eval_env.reset()
        done = False

        while not done:
            action_mask = eval_env.action_masks()
            action, _ = model.predict(obs, action_masks=action_mask, deterministic=True)
            obs, reward, terminated, truncated, info = eval_env.step(action)
            done = terminated or truncated

        # Treat a positive final-step reward as a win
        if reward > 0:
            wins += 1
            print(f"Game {game + 1}: WIN")
        else:
            print(f"Game {game + 1}: LOSS")

    print(f"\nWin rate: {wins}/{n_games} ({wins / n_games:.0%})")
    eval_env.close()
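
Ten games is a small sample, so the win rate printed above is a noisy estimate. One way to gauge the uncertainty is a Wilson score interval, sketched below with the standard library only. The `wilson_interval` helper is hypothetical, and the 7-out-of-10 figure is a made-up example rather than a result from this notebook.

```python
import math


def wilson_interval(wins, games, z=1.96):
    """Approximate 95% Wilson score interval for a win rate."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1 + z**2 / games
    centre = (p + z**2 / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


low, high = wilson_interval(wins=7, games=10)  # 7/10 is a made-up example
print(f"7/10 wins -> 95% interval roughly {low:.0%} to {high:.0%}")
```

For 7 wins in 10 games the interval spans roughly 40% to 89%, which is why more evaluation games are worth running before comparing agents.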

## 6. Cleanup

Always close environments when done to release resources.

In [None]:
env.close()
print("Environment closed.")

## Next Steps

1. **Train longer**: Use `total_timesteps=100_000` or more for meaningful learning
2. **Try different opponents**: `random`, `greedy`, `mcts`, `mcts-strong`
3. **Vectorized training**: Use `make_vec_env()` for parallel environments
4. **Experiment with hyperparameters**: Learning rate, batch size, etc.

See the `examples/` directory for more advanced usage:
- `train_ppo.py` - Full training script with TensorBoard logging
- `evaluate_agent.py` - Comprehensive evaluation against multiple opponents
- `benchmark_throughput.py` - Performance benchmarking