# Addition Game: Monte Carlo Agents

## Game Definition (from Blackwell and Girshick's "Theory of Games and Statistical Decisions")

> *The parameters of the game k and N are given. Player I and Player II alternately choose integers, each choice being one of the integers 1,...,k and each choice made with the knowledge of all preceding choices. As soon as the sum of the chosen integers exceeds N, the last player to choose loses the game and pays his opponent one unit.*

In this notebook, we implement **Monte Carlo** reinforcement learning agents. Unlike Q-learning, which bootstraps (updates each estimate from other estimates of future value), Monte Carlo methods learn directly from complete episodes by averaging the returns actually observed.
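
To make the contrast concrete, here is a minimal sketch of the two update targets for a single move. The reward sequence and the value of `q_next_estimate` are made-up numbers chosen for illustration, not output from the agents in this notebook.

```python
# Hypothetical one-episode illustration; all numbers are invented.
gamma = 1.0                  # no discounting, as in the Addition game
rewards = [0.0, 0.0, 1.0]    # rewards following each of the agent's moves
q_next_estimate = 0.4        # current (possibly wrong) estimate of max_a' Q(s', a')

# Monte Carlo target for the first move: the actual return of the whole episode.
mc_target = sum(gamma**t * r for t, r in enumerate(rewards))

# One-step Q-learning target for the same move: immediate reward plus an estimate.
td_target = rewards[0] + gamma * q_next_estimate

print(mc_target, td_target)
```

The Monte Carlo target depends only on what actually happened, while the Q-learning target is only as good as the current estimate it bootstraps from.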


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict
from dataclasses import dataclass
from typing import Tuple, List, Optional, Dict
import random
from tqdm import tqdm

# Set style for visualizations
plt.style.use('dark_background')
plt.rcParams['font.family'] = 'monospace'
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['axes.facecolor'] = '#1a1a2e'
plt.rcParams['figure.facecolor'] = '#0f0f1a'
plt.rcParams['axes.edgecolor'] = '#533483'
plt.rcParams['axes.labelcolor'] = '#ff6f3c'
plt.rcParams['xtick.color'] = '#a0a0a0'
plt.rcParams['ytick.color'] = '#a0a0a0'
plt.rcParams['text.color'] = '#e0e0e0'
plt.rcParams['grid.color'] = '#2d2d44'
plt.rcParams['grid.alpha'] = 0.5


## 1. Game Environment

The state of the game is characterized by:
- The current sum of all chosen integers
- Whose turn it is (Player I or Player II)

The game ends when the sum exceeds N.
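
Before building the full environment, the rules can be sketched as a small standalone function. The name `loser_of` is hypothetical and is not used elsewhere in this notebook; it simply replays a list of moves and reports who pushed the sum past N.

```python
from typing import List, Optional

def loser_of(moves: List[int], k: int, N: int) -> Optional[int]:
    """Return 0 (Player I) or 1 (Player II) for the losing player, or None if unfinished."""
    total = 0
    for turn, move in enumerate(moves):
        if not 1 <= move <= k:
            raise ValueError(f"move {move} is outside 1..{k}")
        total += move
        if total > N:
            return turn % 2   # even turns belong to Player I, odd turns to Player II
    return None

# Running sums: 2, 5, 6, 9, 10, 11. Player II makes the sixth move and exceeds 10.
print(loser_of([2, 3, 1, 3, 1, 1], k=3, N=10))  # prints 1
```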


In [None]:
@dataclass(frozen=True)
class GameState:
    """Represents the current state of the Addition game.

    frozen=True makes instances immutable and lets the dataclass generate
    __eq__ and __hash__ from the two fields.
    """
    current_sum: int
    current_player: int  # 0 for Player I, 1 for Player II


class AdditionGame:
    """The Addition game environment.
    
    Parameters:
        k: Maximum integer that can be chosen (choices are 1, 2, ..., k)
        N: Threshold - the game ends when sum exceeds N
    """
    
    def __init__(self, k: int, N: int):
        self.k = k
        self.N = N
        self.reset()
    
    def reset(self) -> GameState:
        """Reset the game to initial state."""
        self.state = GameState(current_sum=0, current_player=0)
        self.history: List[int] = []
        self.done = False
        self.winner = None
        return self.state
    
    def get_valid_actions(self) -> List[int]:
        """Returns list of valid actions (1 to k)."""
        return list(range(1, self.k + 1))
    
    def step(self, action: int) -> Tuple[GameState, float, float, bool]:
        """Execute an action and return (new_state, reward_p1, reward_p2, done)."""
        if self.done:
            raise ValueError("Game is already over!")
        
        if action < 1 or action > self.k:
            raise ValueError(f"Invalid action {action}. Must be in [1, {self.k}]")
        
        self.history.append(action)
        current_player = self.state.current_player
        new_sum = self.state.current_sum + action
        
        if new_sum > self.N:
            self.done = True
            self.winner = 1 - current_player
            reward_p1 = 1.0 if self.winner == 0 else -1.0
            reward_p2 = -reward_p1
        else:
            reward_p1 = 0.0
            reward_p2 = 0.0
        
        self.state = GameState(
            current_sum=new_sum,
            current_player=1 - current_player
        )
        
        return self.state, reward_p1, reward_p2, self.done
    
    def render(self):
        """Print the current game state."""
        print(f"Sum: {self.state.current_sum} / N={self.N}")
        print(f"History: {self.history}")
        print(f"Current player: {'I' if self.state.current_player == 0 else 'II'}")
        if self.done:
            print(f"Winner: Player {'I' if self.winner == 0 else 'II'}")


## 2. Monte Carlo Methods

**Monte Carlo** methods learn value functions from complete episodes of experience. Key characteristics:

1. **No bootstrapping**: Unlike TD methods (Q-learning), MC methods wait until the end of an episode to update values
2. **Sample returns**: Q(s,a) is estimated by averaging returns observed after taking action a in state s
3. **First-visit vs Every-visit**: 
   - First-visit MC: Only count the first time (s,a) is visited in an episode
   - Every-visit MC: Count every time (s,a) is visited

For the Addition game, since states can only be visited once per episode (sum always increases), first-visit and every-visit are equivalent.
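
The claim that no state repeats within an episode can be checked with a short sketch. It plays one uniformly random game (the seed and the random policy are assumptions made only for this illustration) and confirms that the sums seen by each player strictly increase.

```python
import random

random.seed(0)
k, N = 3, 10

total, turn = 0, 0
visited = [[], []]            # (sum, action) pairs seen by Player I and Player II
while total <= N:
    action = random.randint(1, k)
    visited[turn].append((total, action))
    total += action
    turn = 1 - turn

for pairs in visited:
    sums = [s for s, _ in pairs]
    # Sums strictly increase, so no state (and hence no state-action pair) repeats.
    assert sums == sorted(set(sums))
    print(pairs)
```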

### Update Rule (Incremental Mean)

$$Q(s,a) \leftarrow Q(s,a) + \frac{1}{N(s,a)}(G - Q(s,a))$$

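where $G$ is the return (here, the final reward of $\pm 1$) and $N(s,a)$ is the number of times the pair $(s,a)$ has been visited. Note that $N(s,a)$ is unrelated to the game threshold $N$.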
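
As a quick sanity check, the incremental rule can be compared with a plain batch average. The list of returns below is hypothetical; any sequence of +1 and -1 values would do.

```python
returns = [1.0, -1.0, 1.0, 1.0, -1.0]   # hypothetical returns observed for one (s, a)

q, n = 0.0, 0
for G in returns:
    n += 1
    q += (G - q) / n        # Q <- Q + (G - Q) / N(s, a)

batch_mean = sum(returns) / len(returns)
print(q, batch_mean)        # the two values agree up to floating-point rounding
```

The incremental form needs only the running estimate and the visit count, so no per-pair list of returns has to be stored.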


In [None]:
class MonteCarloAgent:
    """Monte Carlo agent for the Addition game.
    
    Uses first-visit MC with epsilon-greedy exploration.
    Learns Q(s,a) by averaging returns from complete episodes.
    """
    
    def __init__(
        self, 
        player_id: int,
        k: int,
        epsilon_start: float = 1.0,
        epsilon_end: float = 0.05,
        epsilon_decay: float = 0.9995,
        use_constant_alpha: bool = False,
        alpha: float = 0.1
    ):
        self.player_id = player_id
        self.k = k
        self.epsilon = epsilon_start
        self.epsilon_end = epsilon_end
        self.epsilon_decay = epsilon_decay
        self.use_constant_alpha = use_constant_alpha
        self.alpha = alpha
        
        # Q-values: Q[state_key][action] = value
        self.Q: Dict[int, Dict[int, float]] = defaultdict(lambda: {a: 0.0 for a in range(1, k + 1)})
        
        # Visit counts for incremental mean update
        self.N: Dict[int, Dict[int, int]] = defaultdict(lambda: {a: 0 for a in range(1, k + 1)})
        
        # Statistics
        self.wins = 0
        self.losses = 0
        self.games = 0
    
    def get_state_key(self, state: GameState) -> int:
        """Convert state to key for Q-table lookup."""
        return state.current_sum
    
    def select_action(self, state: GameState, valid_actions: List[int], training: bool = True) -> int:
        """Select action using epsilon-greedy policy."""
        if training and random.random() < self.epsilon:
            return random.choice(valid_actions)
        
        state_key = self.get_state_key(state)
        q_values = self.Q[state_key]
        
        # If every Q-value is still 0 (typically an unvisited state), choose randomly
        if all(q_values[a] == 0 for a in valid_actions):
            return random.choice(valid_actions)
        
        # Otherwise pick the action with the highest Q-value
        return max(valid_actions, key=lambda a: q_values[a])
    
    def update_from_episode(self, episode: List[Tuple[int, int]], final_reward: float):
        """Update Q-values from a complete episode.
        
        Args:
            episode: List of (state_key, action) tuples visited by this agent
            final_reward: The return (reward) received at the end of the episode
        """
        # In this game, the return is the same for all state-action pairs
        # (no intermediate rewards, only final +1 or -1)
        G = final_reward
        
        # First-visit MC: only update first occurrence of each (s, a)
        visited = set()
        
        for state_key, action in episode:
            if (state_key, action) not in visited:
                visited.add((state_key, action))
                
                # Increment visit count
                self.N[state_key][action] += 1
                n = self.N[state_key][action]
                
                # Update Q-value using incremental mean or constant alpha
                if self.use_constant_alpha:
                    # Constant step-size (better for non-stationary)
                    self.Q[state_key][action] += self.alpha * (G - self.Q[state_key][action])
                else:
                    # Sample average (converges to true mean)
                    self.Q[state_key][action] += (G - self.Q[state_key][action]) / n
    
    def decay_epsilon(self):
        """Decay exploration rate."""
        self.epsilon = max(self.epsilon_end, self.epsilon * self.epsilon_decay)
    
    def record_result(self, won: bool):
        """Record game result."""
        self.games += 1
        if won:
            self.wins += 1
        else:
            self.losses += 1
    
    def win_rate(self) -> float:
        """Calculate win rate."""
        return self.wins / max(1, self.games)
    
    def get_policy(self) -> Dict[int, int]:
        """Return greedy policy."""
        policy = {}
        for state_key in self.Q:
            q_values = self.Q[state_key]
            policy[state_key] = max(range(1, self.k + 1), key=lambda a: q_values[a])
        return policy


## 3. Training Loop

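We train both agents through self-play. After each complete game (episode), each agent updates its Q-values from the return it observed. The training function, `train_monte_carlo_agents`, is defined in the final code cell of this notebook and must be run before the training call in Section 4.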
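
The structure of such a self-play loop can be sketched without the agent classes. This is a simplified illustration, not the notebook's training function: `self_play_mc` is a hypothetical name, epsilon is held constant rather than decayed, and plain dictionaries stand in for the agents.

```python
import random
from collections import defaultdict

def self_play_mc(k=3, N=10, episodes=2000, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    actions = list(range(1, k + 1))
    # One table per player: Q[player][sum][action] and matching visit counts.
    Q = [defaultdict(lambda: {a: 0.0 for a in actions}) for _ in range(2)]
    visits = [defaultdict(lambda: {a: 0 for a in actions}) for _ in range(2)]
    total_moves = 0

    for _ in range(episodes):
        total, player = 0, 0
        history = [[], []]          # (sum, action) pairs per player
        while total <= N:
            if rng.random() < epsilon:
                action = rng.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[player][total][a])
            history[player].append((total, action))
            total += action
            player = 1 - player
        # The last mover lost; `player` was flipped after that move, so it is the winner.
        winner = player
        total_moves += len(history[0]) + len(history[1])
        for p in (0, 1):
            G = 1.0 if p == winner else -1.0
            for s, a in history[p]:
                visits[p][s][a] += 1
                Q[p][s][a] += (G - Q[p][s][a]) / visits[p][s][a]
    return Q, visits, total_moves

Q, visits, total_moves = self_play_mc()
# Player I's greedy action at each sum it has encountered.
print({s: max(Q[0][s], key=Q[0][s].get) for s in sorted(Q[0])})
```

Updates happen only after the inner `while` loop finishes, which is the defining feature of Monte Carlo learning: every visited pair is credited with the final outcome of the game.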


## 4. Training the Agents

Let's train two Monte Carlo agents on the Addition game with parameters `k=3` and `N=10`.
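
It helps to know how long the exploration schedule takes to reach its floor. The sketch below assumes the multiplicative per-episode decay used by `decay_epsilon` and the values passed to the agents in the next cell.

```python
import math

epsilon_start, epsilon_end, epsilon_decay = 1.0, 0.05, 0.9997

# Smallest n with epsilon_start * epsilon_decay**n <= epsilon_end.
episodes_to_floor = math.ceil(
    math.log(epsilon_end / epsilon_start) / math.log(epsilon_decay)
)
print(episodes_to_floor)
```

The result is roughly 10,000 episodes, so with 50,000 training episodes the agents spend about the first fifth of training decaying epsilon and the rest at the 5% exploration floor.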


In [None]:
# Game parameters
K = 3  # Can choose 1, 2, or 3
N = 10  # Game ends when sum exceeds 10

# Create game and agents
game = AdditionGame(k=K, N=N)

agent1 = MonteCarloAgent(
    player_id=0,
    k=K,
    epsilon_start=1.0,
    epsilon_end=0.05,
    epsilon_decay=0.9997,
    use_constant_alpha=False  # Use sample average
)

agent2 = MonteCarloAgent(
    player_id=1,
    k=K,
    epsilon_start=1.0,
    epsilon_end=0.05,
    epsilon_decay=0.9997,
    use_constant_alpha=False
)

print(f"Training Monte Carlo agents on Addition game")
print(f"Parameters: k={K}, N={N}")
print(f"Player I starts, choices are integers from 1 to {K}")
print(f"First player to make sum exceed {N} loses")


In [None]:
# Train the agents
stats = train_monte_carlo_agents(game, agent1, agent2, num_episodes=50000, log_interval=500)


## 5. Visualizing Training Progress


In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Addition Game: Monte Carlo Training Progress', fontsize=16, fontweight='bold', color='#ff6f3c')

# Win rates
ax1 = axes[0, 0]
ax1.plot(stats['episodes'], stats['agent1_win_rate'], color='#ffc13b', linewidth=2, label='Player I')
ax1.plot(stats['episodes'], stats['agent2_win_rate'], color='#1eb980', linewidth=2, label='Player II')
ax1.axhline(y=50, color='#666', linestyle='--', alpha=0.5, label='50% baseline')
ax1.set_xlabel('Episode')
ax1.set_ylabel('Win Rate (%)')
ax1.set_title('Win Rates Over Training', color='#ff6f3c')
ax1.legend(facecolor='#1a1a2e', edgecolor='#533483')
ax1.grid(True, alpha=0.3)

# Epsilon decay
ax2 = axes[0, 1]
ax2.plot(stats['episodes'], stats['epsilon'], color='#ff6f3c', linewidth=2)
ax2.set_xlabel('Episode')
ax2.set_ylabel('Epsilon')
ax2.set_title('Exploration Rate Decay', color='#ff6f3c')
ax2.grid(True, alpha=0.3)

# Average game length
ax3 = axes[1, 0]
ax3.plot(stats['episodes'], stats['avg_game_length'], color='#a29bfe', linewidth=2)
ax3.set_xlabel('Episode')
ax3.set_ylabel('Average Game Length (moves)')
ax3.set_title('Average Game Length', color='#ff6f3c')
ax3.grid(True, alpha=0.3)

# Total visits (learning progress indicator)
ax4 = axes[1, 1]
ax4.plot(stats['episodes'], stats['total_visits'], color='#fd79a8', linewidth=2)
ax4.set_xlabel('Episode')
ax4.set_ylabel('Total State-Action Visits')
ax4.set_title('Cumulative Experience (Player I)', color='#ff6f3c')
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


## 6. Analyzing Learned Q-Values and Policies

Let's examine the Q-values and visit counts learned by each agent.


In [None]:
def visualize_mc_policy(agent: MonteCarloAgent, k: int, N: int, player_name: str):
    """Visualize the learned Q-values and visit counts."""
    
    sums = list(range(0, N + 1))
    actions = list(range(1, k + 1))
    
    # Create Q-value and visit count matrices
    q_matrix = np.zeros((len(sums), len(actions)))
    visit_matrix = np.zeros((len(sums), len(actions)))
    
    for i, s in enumerate(sums):
        for j, a in enumerate(actions):
            q_matrix[i, j] = agent.Q[s][a]
            visit_matrix[i, j] = agent.N[s][a]
    
    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16, 6))
    fig.suptitle(f'{player_name} Learned Strategy (Monte Carlo)', fontsize=14, fontweight='bold', color='#ff6f3c')
    
    # Q-value heatmap
    im1 = ax1.imshow(q_matrix, aspect='auto', cmap='RdYlGn', vmin=-1, vmax=1)
    ax1.set_yticks(range(len(sums)))
    ax1.set_yticklabels(sums)
    ax1.set_xticks(range(len(actions)))
    ax1.set_xticklabels(actions)
    ax1.set_ylabel('Current Sum')
    ax1.set_xlabel('Action')
    ax1.set_title('Q-Values', color='#ff6f3c')
    plt.colorbar(im1, ax=ax1, label='Q-value')
    
    for i in range(len(sums)):
        for j in range(len(actions)):
            color = 'white' if abs(q_matrix[i, j]) > 0.5 else 'black'
            ax1.text(j, i, f'{q_matrix[i, j]:.2f}', ha='center', va='center', color=color, fontsize=8)
    
    # Visit count heatmap
    im2 = ax2.imshow(visit_matrix, aspect='auto', cmap='Blues')
    ax2.set_yticks(range(len(sums)))
    ax2.set_yticklabels(sums)
    ax2.set_xticks(range(len(actions)))
    ax2.set_xticklabels(actions)
    ax2.set_ylabel('Current Sum')
    ax2.set_xlabel('Action')
    ax2.set_title('Visit Counts N(s,a)', color='#ff6f3c')
    plt.colorbar(im2, ax=ax2, label='Visits')
    
    for i in range(len(sums)):
        for j in range(len(actions)):
            color = 'white' if visit_matrix[i, j] > visit_matrix.max() * 0.5 else 'black'
            ax2.text(j, i, f'{int(visit_matrix[i, j])}', ha='center', va='center', color=color, fontsize=8)
    
    # Best action visualization
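    best_actions = np.argmax(q_matrix, axis=1) + 1
    base_colors = ['#ffc13b', '#1eb980', '#a29bfe']
    # Cycle through the palette so the plot also works when k > 3
    colors = [base_colors[i % len(base_colors)] for i in range(k)]
    
    ax3.barh(sums, best_actions, color=[colors[a-1] for a in best_actions], edgecolor='white', linewidth=0.5)
    ax3.set_xlabel('Greedy Action')
    ax3.set_ylabel('Current Sum')
    ax3.set_title('Learned Greedy Policy', color='#ff6f3c')
    ax3.set_xticks(actions)  # one tick per action instead of a hard-coded [1, 2, 3]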
    ax3.set_xlim(0, k + 0.5)
    ax3.invert_yaxis()
    
    from matplotlib.patches import Patch
    legend_elements = [Patch(facecolor=colors[i], label=f'Action {i+1}') for i in range(k)]
    ax3.legend(handles=legend_elements, loc='lower right', facecolor='#1a1a2e', edgecolor='#533483')
    
    plt.tight_layout()
    plt.show()
    
    return best_actions

print("Player I Strategy:")
p1_actions = visualize_mc_policy(agent1, K, N, "Player I")

print("\nPlayer II Strategy:")
p2_actions = visualize_mc_policy(agent2, K, N, "Player II")


## 7. Theoretical Optimal Strategy

The Addition game has a known optimal strategy based on modular arithmetic:

- **Losing positions** (for the player to move): Sums satisfying `sum ≡ N (mod k+1)`
- **Winning positions**: All other sums

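The reasoning runs backwards from the end of the game. At sum N every move exceeds N, so N is a losing position. From any of the sums N-k, ..., N-1 the player can move to N, so those are winning. From N-(k+1) every move lands on one of those winning positions, so it is losing again, and the pattern repeats with period k+1.

Optimal play from a winning position is therefore to choose the unique move that leaves the opponent on a losing position. From a losing position, every move loses against an optimal opponent.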
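
The modular rule can be cross-checked by brute-force backward induction over all sums. This is a minimal sketch; the function name `losing_positions_by_induction` is hypothetical and independent of the code in the next cell.

```python
def losing_positions_by_induction(k: int, N: int) -> list:
    """Sums from which the player to move loses under optimal play."""
    # wins[s] is True if the player to move at sum s can force a win.
    wins = [False] * (N + 1)
    for s in range(N, -1, -1):
        # A move to a sum above N loses immediately; a move to t <= N wins
        # only if it leaves the opponent in a losing position.
        wins[s] = any(s + a <= N and not wins[s + a] for a in range(1, k + 1))
    return [s for s in range(N + 1) if not wins[s]]

k, N = 3, 10
by_induction = losing_positions_by_induction(k, N)
by_formula = [s for s in range(N + 1) if s % (k + 1) == N % (k + 1)]
print(by_induction, by_formula)  # prints [2, 6, 10] [2, 6, 10]
```

Since 0 is not in the list, Player I can force a win for these parameters by opening with 2.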


In [None]:
def compute_optimal_strategy(k: int, N: int):
    """Compute the theoretically optimal strategy."""
    # Losing positions: sums where sum ≡ N (mod k+1)
    losing_positions = []
    s = N
    while s >= 0:
        losing_positions.append(s)
        s -= (k + 1)
    losing_positions = sorted(losing_positions)
    
    # Optimal actions
    optimal_actions = {}
    for s in range(0, N + 1):
        if s in losing_positions:
            optimal_actions[s] = 1  # Losing position, any move loses
        else:
            for a in range(1, k + 1):
                if s + a in losing_positions:
                    optimal_actions[s] = a
                    break
            else:
                optimal_actions[s] = 1
    
    return losing_positions, optimal_actions

losing_positions, optimal_actions = compute_optimal_strategy(K, N)

print(f"Game parameters: k={K}, N={N}")
print(f"\nLosing positions (for player to move): {losing_positions}")
print(f"These satisfy: sum ≡ {N} (mod {K+1})")

print(f"\nOptimal actions from each position:")
for s in range(0, N + 1):
    status = "LOSING" if s in losing_positions else "winning"
    print(f"  Sum {s:2d}: play {optimal_actions[s]} ({status})")

print(f"\n{'='*50}")
if 0 in losing_positions:
    print("Initial position (0) is LOSING -> Player II wins with optimal play!")
else:
    print("Initial position (0) is winning -> Player I wins with optimal play!")


In [None]:
def compare_with_optimal(agent: MonteCarloAgent, optimal_actions: dict, losing_positions: list,
                         k: int, N: int, player_name: str):
    """Compare learned policy with optimal."""
    matches = 0
    mismatches = []
    
    for s in range(0, N + 1):
        # Get learned action (greedy)
        q_values = agent.Q[s]
        learned_action = max(range(1, k + 1), key=lambda a: q_values[a])
        optimal = optimal_actions[s]
        
        if s in losing_positions:
            # Every move from a losing position loses against optimal play,
            # so any learned action is as good as the reference action.
            matches += 1
        elif learned_action == optimal:
            matches += 1
        else:
            # From a winning position exactly one move reaches a losing
            # position, so any other action is a genuine mismatch.
            mismatches.append((s, learned_action, optimal, q_values))
    
    accuracy = matches / (N + 1) * 100
    
    print(f"\n{player_name} Policy Comparison:")
    print(f"  Accuracy: {accuracy:.1f}% ({matches}/{N + 1} positions)")
    
    if mismatches:
        print(f"  Mismatches:")
        for s, learned, optimal, q_vals in mismatches[:5]:
            q_str = ", ".join([f"Q({a})={q_vals[a]:.2f}" for a in range(1, k + 1)])
            print(f"    Sum {s}: learned={learned}, optimal={optimal} [{q_str}]")
    else:
        print("  Perfect match with optimal strategy!")
    
    return accuracy

acc1 = compare_with_optimal(agent1, optimal_actions, losing_positions, K, N, "Player I")
acc2 = compare_with_optimal(agent2, optimal_actions, losing_positions, K, N, "Player II")


## 8. Evaluation Against Optimal Agent


In [None]:
class OptimalAgent:
    """Agent that plays the theoretically optimal strategy."""
    
    def __init__(self, k: int, N: int):
        _, self.optimal_actions = compute_optimal_strategy(k, N)
    
    def select_action(self, state: GameState, valid_actions: List[int], training: bool = False) -> int:
        return self.optimal_actions.get(state.current_sum, 1)


def evaluate(game: AdditionGame, agent1, agent2, num_games: int = 1000):
    """Evaluate two agents over multiple games."""
    wins = [0, 0]
    
    for _ in range(num_games):
        state = game.reset()
        agents = [agent1, agent2]
        
        while not game.done:
            agent = agents[state.current_player]
            action = agent.select_action(state, game.get_valid_actions(), training=False)
            state, _, _, _ = game.step(action)
        
        wins[game.winner] += 1
    
    return wins[0], wins[1]


optimal_agent = OptimalAgent(K, N)

print("Evaluation Results (1000 games each):\n")

print("Test 1: MC Player I vs Optimal Player II")
w1, w2 = evaluate(game, agent1, optimal_agent, num_games=1000)
print(f"  MC wins: {w1}, Optimal wins: {w2}")
print(f"  MC win rate: {w1/10:.1f}%")

print("\nTest 2: Optimal Player I vs MC Player II")
w1, w2 = evaluate(game, optimal_agent, agent2, num_games=1000)
print(f"  Optimal wins: {w1}, MC wins: {w2}")
print(f"  MC win rate: {w2/10:.1f}%")

print("\nTest 3: MC vs MC (self-play)")
w1, w2 = evaluate(game, agent1, agent2, num_games=1000)
print(f"  Player I wins: {w1}, Player II wins: {w2}")
print(f"  Player I win rate: {w1/10:.1f}%")

print(f"\n{'='*50}")
print("Note: With k=3, N=10, optimal Player I should always win.")


## 9. Watching a Sample Game


In [None]:
def play_game_verbose(game: AdditionGame, agent1, agent2, name1: str, name2: str, show_q: bool = True):
    """Play a game with detailed output."""
    state = game.reset()
    agents = [agent1, agent2]
    names = [name1, name2]
    
    print(f"{'='*60}")
    print(f"ADDITION GAME: k={game.k}, N={game.N}")
    print(f"{name1} (Player I) vs {name2} (Player II)")
    print(f"First to make sum exceed {game.N} loses!")
    print(f"{'='*60}\n")
    
    move = 1
    while not game.done:
        player = state.current_player
        agent = agents[player]
        name = names[player]
        
        action = agent.select_action(state, game.get_valid_actions(), training=False)
        
        # Show Q-values if agent is Monte Carlo
        if show_q and hasattr(agent, 'Q'):
            q_values = agent.Q[state.current_sum]
            q_str = ", ".join([f"Q({a})={q_values[a]:.2f}" for a in range(1, game.k + 1)])
            print(f"Move {move}: {name} [{q_str}] -> chooses {action}")
        else:
            print(f"Move {move}: {name} chooses {action}")
        
        print(f"         Sum: {state.current_sum} -> {state.current_sum + action}")
        
        state, _, _, done = game.step(action)
        
        if done:
            print(f"\n         *** SUM ({state.current_sum}) EXCEEDS {game.N}! ***")
            print(f"\n  WINNER: {names[game.winner]}")
        else:
            print()
        
        move += 1
    
    print(f"\n{'='*60}")
    print(f"History: {game.history}")
    print(f"{'='*60}")

print("Game 1: MC agents playing each other\n")
play_game_verbose(game, agent1, agent2, "MC-I", "MC-II")

print("\n\n")

print("Game 2: MC Player I vs Optimal Player II\n")
play_game_verbose(game, agent1, optimal_agent, "MC-I", "Optimal-II")


## 10. Conclusion

### Summary

We implemented **Monte Carlo** reinforcement learning agents for Blackwell's Addition game. Key aspects:

1. **Learning from Complete Episodes**: MC methods wait until the end of each game to update Q-values, using the actual return rather than bootstrapped estimates.

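2. **Sample Average Updates**: For a fixed pair of policies, Q(s,a) converges to the expected return as the number of visits to (s,a) grows. In self-play both policies keep changing, so the averages track a moving target until the policies settle.

3. **No Bias from Bootstrapping**: Unlike TD targets, an MC return is an unbiased sample of the expected return under the policies that generated it, at the cost of higher variance.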

### Monte Carlo vs Q-Learning

| Aspect | Monte Carlo | Q-Learning (TD) |
|--------|-------------|-----------------|
| Updates | After episode ends | After each step |
| Bootstrapping | No (uses actual returns) | Yes (uses Q estimates) |
| Bias | Unbiased | Biased (but consistent) |
| Variance | Higher | Lower |
| Data efficiency | Lower | Higher |
| Works for continuing tasks | No | Yes |

### Why MC Works Well for This Game

1. **Short Episodes**: A game lasts at most N + 1 moves (11 when N = 10), so waiting until the end of an episode isn't costly
2. **Clear Returns**: The reward structure (+1/-1 at end) makes returns unambiguous
3. **No Intermediate Rewards**: All information comes at episode end anyway
4. **Finite State Space**: Visit counts converge to meaningful statistics

### Connections to Blackwell's Work

Monte Carlo methods have deep connections to game theory:
- **Empirical Play**: MC learning is like players empirically learning from repeated games
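- **Minimax Convergence**: In a game this small, self-play MC with sufficient exploration can recover the minimax strategy, although this is not guaranteed in general
- **Blackwell Approachability**: MC's averaging is loosely reminiscent of Blackwell's approachability theorem, which concerns steering the time-average of vector payoffs in a repeated game toward a target set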

### Extensions

- **Monte Carlo Tree Search (MCTS)**: Combine MC sampling with tree search (used in AlphaGo)
- **Importance Sampling**: Use off-policy MC with behavior policies
- **On-Policy MC Control**: Use epsilon-soft policies throughout
- **Variance Reduction**: Use control variates or baseline subtraction


In [None]:
def train_monte_carlo_agents(
    game: AdditionGame,
    agent1: MonteCarloAgent,
    agent2: MonteCarloAgent,
    num_episodes: int = 50000,
    log_interval: int = 1000
) -> dict:
    """Train two Monte Carlo agents through self-play."""
    agents = [agent1, agent2]
    stats = {
        'episodes': [],
        'agent1_win_rate': [],
        'agent2_win_rate': [],
        'epsilon': [],
        'avg_game_length': [],
        'total_visits': []
    }
    
    game_lengths = []
    
    for episode in tqdm(range(num_episodes), desc="Training"):
        state = game.reset()
        
        # Store episode history for each agent: list of (state_key, action)
        episode_history = [[], []]
        
        # Play the game
        while not game.done:
            current_player = state.current_player
            agent = agents[current_player]
            
            state_key = agent.get_state_key(state)
            valid_actions = game.get_valid_actions()
            action = agent.select_action(state, valid_actions, training=True)
            
            # Record state-action pair
            episode_history[current_player].append((state_key, action))
            
            state, reward_p1, reward_p2, done = game.step(action)
        
        # Episode complete - update both agents
        rewards = [reward_p1, reward_p2]
        game_lengths.append(len(game.history))
        
        for player_id in [0, 1]:
            agent = agents[player_id]
            reward = rewards[player_id]
            
            # Update Q-values from the episode
            agent.update_from_episode(episode_history[player_id], reward)
            agent.record_result(game.winner == player_id)
        
        # Decay epsilon for both agents
        agent1.decay_epsilon()
        agent2.decay_epsilon()
        
        # Log statistics
        if (episode + 1) % log_interval == 0:
            stats['episodes'].append(episode + 1)
            stats['agent1_win_rate'].append(agent1.wins / max(1, agent1.games) * 100)
            stats['agent2_win_rate'].append(agent2.wins / max(1, agent2.games) * 100)
            stats['epsilon'].append(agent1.epsilon)
            stats['avg_game_length'].append(np.mean(game_lengths[-log_interval:]))
            
            # Count total visits
            total = sum(sum(agent1.N[s].values()) for s in agent1.N)
            stats['total_visits'].append(total)
            
            # Reset counters
            agent1.wins = agent1.losses = agent1.games = 0
            agent2.wins = agent2.losses = agent2.games = 0
    
    return stats
