# Getting Started with ManaCore Gym

This notebook introduces the ManaCore Gym environment for training AI agents to play Magic: The Gathering.

## What you'll learn:
1. Basic environment usage
2. Understanding observations and actions
3. Playing games with random actions
4. Training an RL agent with Stable Baselines3

## Prerequisites
- Run `uv sync --extra notebook` from the `packages/python-gym/` directory
- Bun runtime installed (for the game server)
- **That's it!** The server will auto-start when you run the cells below.

## 0. Auto-Start Game Server

ManaCore Gym needs a game server running on port 3333. The first code cell below will:
1. Check if the server is already running
2. Start it automatically if needed
3. Wait for it to be ready

**Note:** The server runs in the background and stays active throughout your session.

## 1. Basic Environment Usage

Let's start by creating an environment and exploring its structure.

In [None]:
import subprocess
import time
from pathlib import Path

import requests


def check_server_health():
    """Check if the ManaCore Gym server is running."""
    try:
        response = requests.get("http://localhost:3333/health", timeout=1)
        return response.status_code == 200
    except Exception:
        return False

def start_server():
    """Start the ManaCore Gym server in the background."""
    # Find the gym-server directory (relative to this notebook)
    notebook_dir = Path.cwd()
    gym_server_dir = notebook_dir.parent.parent.parent / "packages" / "gym-server"
    
    if not gym_server_dir.exists():
        raise FileNotFoundError(f"Could not find gym-server at: {gym_server_dir}")
    
    # Start the server as a background process
    server_script = gym_server_dir / "src" / "index.ts"
    process = subprocess.Popen(
        ["bun", "run", str(server_script)],
        cwd=str(gym_server_dir),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True  # Detach from notebook process
    )
    
    return process

# Check if server is running
if check_server_health():
    print("✅ Game server is already running")
else:
    print("🚀 Starting game server...")
    try:
        process = start_server()
        
        # Wait for server to be ready (max 10 seconds)
        for _i in range(20):
            time.sleep(0.5)
            if check_server_health():
                print("✅ Game server started successfully!")
                break
        else:
            print("⚠️  Server may still be starting. If you get connection errors, wait a few seconds and try again.")
    except Exception as e:
        print(f"❌ Failed to start server: {e}")
        print("\nManual start: Run this in a terminal:")
        print("cd packages/gym-server && bun run src/index.ts")
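
The check-start-wait pattern used in this cell can be factored into a small polling helper. The sketch below is illustrative only: `wait_until` is a hypothetical name, not part of `manacore_gym`, and it uses nothing beyond the standard library.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 10.0, interval: float = 0.5) -> bool:
    """Poll `check` until it returns True or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In this notebook it could be called as `wait_until(check_server_health, timeout=10)`. Using `time.monotonic()` keeps the timeout correct even if the system clock changes while waiting.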

In [1]:
import gymnasium as gym
import numpy as np

# Import manacore_gym to register the environment
import manacore_gym

print("ManaCore Gym version:", manacore_gym.__version__)

ManaCore Gym version: 0.1.0


In [2]:
# Create the environment
# The server will auto-start if not running
env = gym.make("ManaCore-v0", opponent="greedy")

print(f"Observation space: {env.observation_space}")
print(f"Action space: {env.action_space}")

Observation space: Box(-1.0, 1.0, (25,), float32)
Action space: Discrete(200)


## 2. Understanding Observations

The observation is a 25-dimensional `float32` feature vector representing the game state, with every value in the range [-1, 1].

| Index | Feature | Description |
|-------|---------|-------------|
| 0-2 | Life totals | Player, opponent, delta |
| 3-9 | Board state | Creatures, power, toughness |
| 10-14 | Card advantage | Hand size, library size |
| 15-18 | Mana | Lands, untapped lands |
| 19-24 | Game state | Turn, phase, combat flags |
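
To inspect one feature group at a time, the index ranges in the table can be turned into slices. This is a minimal sketch: the group names and the `split_observation` helper are made up for illustration and are not part of the `manacore_gym` API, and a stand-in array replaces a real observation so the snippet runs without the server.

```python
import numpy as np

# Index ranges copied from the table above; the group names are illustrative.
FEATURE_GROUPS = {
    "life_totals": slice(0, 3),
    "board_state": slice(3, 10),
    "card_advantage": slice(10, 15),
    "mana": slice(15, 19),
    "game_state": slice(19, 25),
}


def split_observation(obs: np.ndarray) -> dict[str, np.ndarray]:
    """Return a dict of views into `obs`, one per feature group."""
    if obs.shape != (25,):
        raise ValueError(f"expected shape (25,), got {obs.shape}")
    return {name: obs[idx] for name, idx in FEATURE_GROUPS.items()}


# Stand-in observation; in the notebook this would come from env.reset().
demo_obs = np.linspace(-1.0, 1.0, 25, dtype=np.float32)
groups = split_observation(demo_obs)
for name, values in groups.items():
    print(f"{name:>15}: {values.shape[0]} features")
```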

In [3]:
# Reset the environment to start a new game
obs, info = env.reset(seed=42)

print(f"Observation shape: {obs.shape}")
print(f"Observation dtype: {obs.dtype}")
print("\nObservation values:")
print(obs)

Observation shape: (25,)
Observation dtype: float32

Observation values:
[1.        1.        0.        0.        0.        0.        0.
 0.        0.        0.        1.        1.        0.        0.8833333
 0.8833333 0.        0.        0.        0.        0.02      1.
 0.5       0.        0.        0.       ]


In [4]:
# The info dict contains additional game information
print("Info keys:", info.keys())
print(f"\nPlayer life: {info.get('playerLife', 'N/A')}")
print(f"Opponent life: {info.get('opponentLife', 'N/A')}")
print(f"Turn: {info.get('turn', 'N/A')}")
print(f"Phase: {info.get('phase', 'N/A')}")

Info keys: dict_keys(['turn', 'phase', 'playerLife', 'opponentLife', 'winner', 'stepCount', 'action_mask', 'num_legal_actions', 'legal_actions'])

Player life: 20
Opponent life: 20
Turn: 1
Phase: main1


## 3. Understanding Actions and Action Masking

The set of legal actions in MTG changes with the game state. We use **action masking** to handle this:
- The action space is `Discrete(200)`, so at most 200 actions can be represented at any step
- `info["action_mask"]` is a boolean array of length 200 that marks which actions are currently legal (True = legal)
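
Conceptually, a masked policy only ever assigns probability to legal actions. The sketch below shows one common way to do that with NumPy: set the logits of illegal actions to negative infinity before the softmax. It illustrates the idea and is not the implementation used by any particular library; `masked_softmax` and the toy logits are assumptions made for the example.

```python
import numpy as np


def masked_softmax(logits: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Softmax over `logits` with illegal actions (mask == False) forced to probability 0."""
    if not mask.any():
        raise ValueError("at least one action must be legal")
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked.max()  # subtract the largest legal logit for numerical stability
    exp = np.exp(masked)            # exp(-inf) == 0 for illegal actions
    return exp / exp.sum()


# Toy example: 200 actions, only the first four are legal.
rng = np.random.default_rng(0)
logits = rng.normal(size=200)
mask = np.zeros(200, dtype=bool)
mask[:4] = True

probs = masked_softmax(logits, mask)
action = rng.choice(200, p=probs)  # always one of the legal indices
```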

In [5]:
# Get the action mask
action_mask = info["action_mask"]

print(f"Action mask shape: {action_mask.shape}")
print(f"Action mask dtype: {action_mask.dtype}")

# Find legal actions
legal_actions = np.where(action_mask)[0]
print(f"\nNumber of legal actions: {len(legal_actions)}")
print(f"Legal action indices: {legal_actions[:10]}...")

Action mask shape: (200,)
Action mask dtype: bool

Number of legal actions: 4
Legal action indices: [0 1 2 3]...


In [6]:
# You can also use the environment's action_masks() method
# This is the standard interface for sb3-contrib's MaskablePPO
mask = env.unwrapped.action_masks()
print(f"action_masks() shape: {mask.shape}")

action_masks() shape: (200,)


## 4. Playing a Game with Random Actions

Let's play a complete game using random (but legal) actions.

In [7]:
# Reset for a new game
obs, info = env.reset(seed=123)

total_reward = 0
step_count = 0
done = False

while not done and step_count < 200:
    # Get legal actions
    mask = info["action_mask"]
    legal_actions = np.where(mask)[0]

    # Sample a random legal action
    action = np.random.choice(legal_actions)

    # Take the action
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    total_reward += reward
    step_count += 1

    # Print progress every 20 steps
    if step_count % 20 == 0:
        print(f"Step {step_count}: Life {info.get('playerLife', '?')}/{info.get('opponentLife', '?')}")

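# Game result
print(f"\n{'=' * 40}")
print(f"Game finished after {step_count} steps")
print(f"Final reward: {total_reward}")
# Don't credit the opponent when the loop stops without a decisive reward (e.g. at the 200-step cap)
winner = "Player" if total_reward > 0 else "Opponent" if total_reward < 0 else "Undecided"
print(f"Winner: {winner}")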


Game finished after 19 steps
Final reward: -1.0
Winner: Opponent
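
The loop above draws actions with `np.random.choice`, which uses NumPy's global random state. Passing a seed to `env.reset()` is unlikely to affect that global state, so repeated runs may pick different actions. A minimal sketch of a reproducible alternative, assuming a dedicated `np.random.Generator` (the helper name `sample_legal_action` is hypothetical):

```python
import numpy as np


def sample_legal_action(mask: np.ndarray, rng: np.random.Generator) -> int:
    """Pick one index uniformly at random among the True entries of `mask`."""
    legal = np.flatnonzero(mask)
    if legal.size == 0:
        raise ValueError("no legal actions available")
    return int(rng.choice(legal))


# Stand-in mask with four legal actions, matching the shape used by the environment.
mask = np.zeros(200, dtype=bool)
mask[[0, 1, 2, 3]] = True

# Two generators created with the same seed produce the same action sequence.
rng_a = np.random.default_rng(123)
rng_b = np.random.default_rng(123)
seq_a = [sample_legal_action(mask, rng_a) for _ in range(5)]
seq_b = [sample_legal_action(mask, rng_b) for _ in range(5)]
```

Inside the game loop, the call would replace the `np.random.choice(legal_actions)` line, with `rng` created once before the loop.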


## 5. Training with Stable Baselines3

Now let's train an RL agent using MaskablePPO from sb3-contrib.

**Note:** Training takes time, so this demonstration uses only 5,000 timesteps.

In [8]:
# Import SB3 components
try:
    from sb3_contrib import MaskablePPO
    from sb3_contrib.common.wrappers import ActionMasker

    SB3_AVAILABLE = True
except ImportError:
    print("sb3-contrib not installed. Run: pip install sb3-contrib")
    SB3_AVAILABLE = False

In [9]:
if SB3_AVAILABLE:
    # Create environment
    from manacore_gym import ManaCoreBattleEnv

    env = ManaCoreBattleEnv(opponent="greedy")

    # Wrap with ActionMasker
    def mask_fn(env):
        return env.action_masks()

    env = ActionMasker(env, mask_fn)

    # Create model
    model = MaskablePPO(
        "MlpPolicy",
        env,
        verbose=1,
        learning_rate=3e-4,
        n_steps=512,  # Smaller for demo
        batch_size=64,
    )

    print("Model created successfully!")
    print(f"Policy architecture: {model.policy}")

Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
Model created successfully!
Policy architecture: MaskableActorCriticPolicy(
  (features_extractor): FlattenExtractor(
    (flatten): Flatten(start_dim=1, end_dim=-1)
  )
  (pi_features_extractor): FlattenExtractor(
    (flatten): Flatten(start_dim=1, end_dim=-1)
  )
  (vf_features_extractor): FlattenExtractor(
    (flatten): Flatten(start_dim=1, end_dim=-1)
  )
  (mlp_extractor): MlpExtractor(
    (policy_net): Sequential(
      (0): Linear(in_features=25, out_features=64, bias=True)
      (1): Tanh()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): Tanh()
    )
    (value_net): Sequential(
      (0): Linear(in_features=25, out_features=64, bias=True)
      (1): Tanh()
      (2): Linear(in_features=64, out_features=64, bias=True)
      (3): Tanh()
    )
  )
  (action_net): Linear(in_features=64, out_features=200, bias=True)
  (value_net): Linear(in_features=64, out_fea

In [10]:
if SB3_AVAILABLE:
    # Train for a short demo (increase timesteps for real training)
    print("Training for 5000 timesteps (demo)...")
    print("For real training, use 100k+ timesteps.\n")

    model.learn(total_timesteps=5000, progress_bar=True)

    print("\nTraining complete!")

Training for 5000 timesteps (demo)...
For real training, use 100k+ timesteps.



---------------------------------
| rollout/           |          |
|    ep_len_mean     | 66.3     |
|    ep_rew_mean     | -0.714   |
| time/              |          |
|    fps             | 104      |
|    iterations      | 1        |
|    time_elapsed    | 4        |
|    total_timesteps | 512      |
---------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 44.3        |
|    ep_rew_mean          | -0.739      |
| time/                   |             |
|    fps                  | 101         |
|    iterations           | 2           |
|    time_elapsed         | 10          |
|    total_timesteps      | 1024        |
| train/                  |             |
|    approx_kl            | 0.014504848 |
|    clip_fraction        | 0.132       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.73       |
|    explained_variance   | -0.653      |
|    learning_rate        | 0.0003      |
|    loss                 | 0.00477     |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0205     |
|    value_loss           | 0.0743      |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 45          |
|    ep_rew_mean          | -0.758      |
| time/                   |             |
|    fps                  | 101         |
|    iterations           | 3           |
|    time_elapsed         | 15          |
|    total_timesteps      | 1536        |
| train/                  |             |
|    approx_kl            | 0.008444786 |
|    clip_fraction        | 0.0281      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.51       |
|    explained_variance   | -0.358      |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0154     |
|    n_updates            | 20          |
|    policy_gradient_loss | -0.0152     |
|    value_loss           | 0.0513      |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 52.9        |
|    ep_rew_mean          | -0.737      |
| time/                   |             |
|    fps                  | 95          |
|    iterations           | 4           |
|    time_elapsed         | 21          |
|    total_timesteps      | 2048        |
| train/                  |             |
|    approx_kl            | 0.011593467 |
|    clip_fraction        | 0.117       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.57       |
|    explained_variance   | 0.542       |
|    learning_rate        | 0.0003      |
|    loss                 | -0.00656    |
|    n_updates            | 30          |
|    policy_gradient_loss | -0.024      |
|    value_loss           | 0.0409      |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 49.2        |
|    ep_rew_mean          | -0.769      |
| time/                   |             |
|    fps                  | 97          |
|    iterations           | 5           |
|    time_elapsed         | 26          |
|    total_timesteps      | 2560        |
| train/                  |             |
|    approx_kl            | 0.010406626 |
|    clip_fraction        | 0.122       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.8        |
|    explained_variance   | 0.742       |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0194     |
|    n_updates            | 40          |
|    policy_gradient_loss | -0.0241     |
|    value_loss           | 0.0368      |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 49.5        |
|    ep_rew_mean          | -0.705      |
| time/                   |             |
|    fps                  | 99          |
|    iterations           | 6           |
|    time_elapsed         | 30          |
|    total_timesteps      | 3072        |
| train/                  |             |
|    approx_kl            | 0.014066281 |
|    clip_fraction        | 0.155       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.42       |
|    explained_variance   | 0.426       |
|    learning_rate        | 0.0003      |
|    loss                 | -0.00241    |
|    n_updates            | 50          |
|    policy_gradient_loss | -0.0223     |
|    value_loss           | 0.0465      |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 51.6        |
|    ep_rew_mean          | -0.681      |
| time/                   |             |
|    fps                  | 99          |
|    iterations           | 7           |
|    time_elapsed         | 35          |
|    total_timesteps      | 3584        |
| train/                  |             |
|    approx_kl            | 0.010252215 |
|    clip_fraction        | 0.0869      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.39       |
|    explained_variance   | 0.791       |
|    learning_rate        | 0.0003      |
|    loss                 | -0.00288    |
|    n_updates            | 60          |
|    policy_gradient_loss | -0.0133     |
|    value_loss           | 0.0469      |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 51.5        |
|    ep_rew_mean          | -0.671      |
| time/                   |             |
|    fps                  | 100         |
|    iterations           | 8           |
|    time_elapsed         | 40          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.012526806 |
|    clip_fraction        | 0.0873      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.6        |
|    explained_variance   | 0.727       |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0253     |
|    n_updates            | 70          |
|    policy_gradient_loss | -0.0186     |
|    value_loss           | 0.0389      |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 51.7        |
|    ep_rew_mean          | -0.708      |
| time/                   |             |
|    fps                  | 101         |
|    iterations           | 9           |
|    time_elapsed         | 45          |
|    total_timesteps      | 4608        |
| train/                  |             |
|    approx_kl            | 0.009738848 |
|    clip_fraction        | 0.098       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.35       |
|    explained_variance   | 0.862       |
|    learning_rate        | 0.0003      |
|    loss                 | -0.0035     |
|    n_updates            | 80          |
|    policy_gradient_loss | -0.0179     |
|    value_loss           | 0.051       |
-----------------------------------------


-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 51.1        |
|    ep_rew_mean          | -0.68       |
| time/                   |             |
|    fps                  | 102         |
|    iterations           | 10          |
|    time_elapsed         | 50          |
|    total_timesteps      | 5120        |
| train/                  |             |
|    approx_kl            | 0.008693926 |
|    clip_fraction        | 0.0555      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.33       |
|    explained_variance   | 0.564       |
|    learning_rate        | 0.0003      |
|    loss                 | 0.00878     |
|    n_updates            | 90          |
|    policy_gradient_loss | -0.0146     |
|    value_loss           | 0.0359      |
-----------------------------------------



Training complete!
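
The log ends at 5120 timesteps rather than the requested 5000 because PPO collects experience in whole rollouts of `n_steps` per environment and stops at the first rollout boundary at or beyond `total_timesteps`. The arithmetic, assuming the single environment used above:

```python
import math

total_timesteps = 5000  # value passed to model.learn()
n_steps = 512           # rollout length passed to MaskablePPO
n_envs = 1              # a single environment is wrapped in DummyVecEnv

steps_per_iteration = n_steps * n_envs
iterations = math.ceil(total_timesteps / steps_per_iteration)
collected = iterations * steps_per_iteration

print(f"iterations: {iterations}, timesteps collected: {collected}")
# -> iterations: 10, timesteps collected: 5120
```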


In [13]:
if SB3_AVAILABLE:
    # Evaluate the trained agent
    print("Evaluating trained agent (10 games)...\n")

    eval_env = ManaCoreBattleEnv(opponent="greedy")
    wins = 0

    for game in range(10):
        obs, info = eval_env.reset()
        done = False

        while not done:
            action_mask = eval_env.action_masks()
            action, _ = model.predict(obs, action_masks=action_mask, deterministic=True)
            obs, reward, terminated, truncated, info = eval_env.step(action)
            done = terminated or truncated

        if reward > 0:
            wins += 1
            print(f"Game {game + 1}: WIN")
        else:
            print(f"Game {game + 1}: LOSS")

    print(f"\nWin rate: {wins}/10 ({wins * 10}%)")
    eval_env.close()

Evaluating trained agent (10 games)...

Game 1: LOSS
Game 2: LOSS
Game 3: WIN
Game 4: LOSS
Game 5: LOSS
Game 6: LOSS
Game 7: LOSS
Game 8: LOSS
Game 9: LOSS
Game 10: LOSS

Win rate: 1/10 (10%)
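
Ten games is a small sample, so the 10% figure carries wide uncertainty. One way to quantify this is a Wilson score interval for the win rate. The sketch below is an optional add-on that uses only the standard library; `wilson_interval` is a hypothetical helper, not part of `manacore_gym`.

```python
import math


def wilson_interval(wins: int, games: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a win rate; z=1.96 gives roughly 95% coverage."""
    if games <= 0 or not 0 <= wins <= games:
        raise ValueError("need games > 0 and 0 <= wins <= games")
    p = wins / games
    denom = 1 + z**2 / games
    centre = (p + z**2 / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z**2 / (4 * games**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


low, high = wilson_interval(wins=1, games=10)
print(f"1/10 wins -> 95% interval: {low:.1%} to {high:.1%}")
```

For 1 win in 10 games the interval spans roughly 2% to 40%, so many more evaluation games are needed before comparing agents.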


## 6. Cleanup

Always close environments when done to release resources.

In [14]:
env.close()
print("Environment closed.")

Environment closed.


## Next Steps

1. **Train longer**: Use `total_timesteps=100_000` or more for meaningful learning
2. **Try different opponents**: `random`, `greedy`, `mcts`, `mcts-strong`
3. **Vectorized training**: Use `make_vec_env()` for parallel environments
4. **Experiment with hyperparameters**: Learning rate, batch size, etc.

See the `examples/` directory for more advanced usage:
- `train_ppo.py` - Full training script with TensorBoard logging
- `evaluate_agent.py` - Comprehensive evaluation against multiple opponents
- `benchmark_throughput.py` - Performance benchmarking