# Addition Game: Gradient Bandit Agents

## Game Definition (from Blackwell and Girshick's "Theory of Games and Statistical Decisions")

> *The parameters of the game k and N are given. Player I and Player II alternately choose integers, each choice being one of the integers 1,...,k and each choice made with the knowledge of all preceding choices. As soon as the sum of the chosen integers exceeds N, the last player to choose loses the game and pays his opponent one unit.*

In this notebook, we implement **Gradient Bandit** agents that learn to play through self-play. Unlike Q-learning, which estimates action values directly, a gradient bandit learns a numerical *preference* for each action and converts the preferences to action probabilities with a softmax.
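
To make the quoted rules concrete, here is a minimal sketch of a single game in which both players move uniformly at random. The function name `play_random_game` and its `seed` parameter are illustrative assumptions and are not used elsewhere in the notebook.

```python
import random

def play_random_game(k: int, N: int, seed: int = 0):
    """Play one game with uniformly random moves; return (moves, loser)."""
    rng = random.Random(seed)
    total, moves, player = 0, [], 0   # player 0 = Player I, 1 = Player II
    while True:
        move = rng.randint(1, k)      # any integer in 1..k is a legal choice
        moves.append(move)
        total += move
        if total > N:                 # the mover who pushes the sum past N loses
            return moves, player
        player = 1 - player           # otherwise the turn passes to the opponent

moves, loser = play_random_game(k=3, N=10)
print(f"moves={moves}, final sum={sum(moves)}, loser=Player {'I' if loser == 0 else 'II'}")
```

The sum of all moves except the last never exceeds `N`, and the loser is determined purely by the parity of the number of moves.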


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict
from dataclasses import dataclass
from typing import Tuple, List, Optional, Dict
import random
from tqdm import tqdm

# Set style for visualizations
plt.style.use('dark_background')
plt.rcParams['font.family'] = 'monospace'
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['axes.facecolor'] = '#16213e'
plt.rcParams['figure.facecolor'] = '#0f0f23'
plt.rcParams['axes.edgecolor'] = '#4a4a6a'
plt.rcParams['axes.labelcolor'] = '#e94560'
plt.rcParams['xtick.color'] = '#a0a0a0'
plt.rcParams['ytick.color'] = '#a0a0a0'
plt.rcParams['text.color'] = '#e0e0e0'
plt.rcParams['grid.color'] = '#1a1a3a'
plt.rcParams['grid.alpha'] = 0.5


## 1. Game Environment

The state of the game is characterized by:
- The current sum of all chosen integers
- Whose turn it is (Player I or Player II)

The game ends as soon as the sum exceeds N; the player whose move caused this loses.


In [None]:
@dataclass(frozen=True)
class GameState:
    """Represents the current state of the Addition game.

    frozen=True makes instances immutable and lets the dataclass generate
    __eq__ and __hash__ from the two fields.
    """
    current_sum: int
    current_player: int  # 0 for Player I, 1 for Player II


class AdditionGame:
    """The Addition game environment.
    
    Parameters:
        k: Maximum integer that can be chosen (choices are 1, 2, ..., k)
        N: Threshold - the game ends when sum exceeds N
    """
    
    def __init__(self, k: int, N: int):
        self.k = k
        self.N = N
        self.reset()
    
    def reset(self) -> GameState:
        """Reset the game to initial state."""
        self.state = GameState(current_sum=0, current_player=0)
        self.history: List[int] = []
        self.done = False
        self.winner = None
        return self.state
    
    def get_valid_actions(self) -> List[int]:
        """Returns list of valid actions (1 to k)."""
        return list(range(1, self.k + 1))
    
    def step(self, action: int) -> Tuple[GameState, float, float, bool]:
        """Execute an action and return (new_state, reward_p1, reward_p2, done)."""
        if self.done:
            raise ValueError("Game is already over!")
        
        if action < 1 or action > self.k:
            raise ValueError(f"Invalid action {action}. Must be in [1, {self.k}]")
        
        self.history.append(action)
        current_player = self.state.current_player
        new_sum = self.state.current_sum + action
        
        if new_sum > self.N:
            self.done = True
            self.winner = 1 - current_player
            reward_p1 = 1.0 if self.winner == 0 else -1.0
            reward_p2 = -reward_p1
        else:
            reward_p1 = 0.0
            reward_p2 = 0.0
        
        self.state = GameState(
            current_sum=new_sum,
            current_player=1 - current_player
        )
        
        return self.state, reward_p1, reward_p2, self.done
    
    def render(self):
        """Print the current game state."""
        print(f"Sum: {self.state.current_sum} / N={self.N}")
        print(f"History: {self.history}")
        print(f"Current player: {'I' if self.state.current_player == 0 else 'II'}")
        if self.done:
            print(f"Winner: Player {'I' if self.winner == 0 else 'II'}")


## 2. Gradient Bandit Algorithm

The **Gradient Bandit** algorithm learns a numerical *preference* $H_t(a)$ for each action. Actions are selected according to a softmax distribution:

$$\pi_t(a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{k} e^{H_t(b)}}$$

After receiving reward $R_t$, preferences are updated using stochastic gradient ascent:

$$H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \bar{R}_t)(1 - \pi_t(A_t))$$
$$H_{t+1}(a) = H_t(a) - \alpha (R_t - \bar{R}_t)\pi_t(a) \quad \text{for } a \neq A_t$$

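where $\bar{R}_t$ is a baseline (typically the average reward so far) and $\alpha > 0$ is the step size. In this notebook the preferences and the baseline are kept separately for each game state (the current sum), so the update is applied to $H_t(s, a)$ for the state $s$ in which $A_t$ was chosen.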
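
The two update equations can be written as one vectorized step, because a one-hot vector minus $\pi_t$ equals $1 - \pi_t(A_t)$ at the chosen action and $-\pi_t(a)$ elsewhere. The sketch below is a stateless illustration of that identity; the function names `softmax` and `gradient_bandit_step` are assumptions made for this example and are separate from the agent class defined in the next cell.

```python
import numpy as np

def softmax(h: np.ndarray) -> np.ndarray:
    """Convert preferences to probabilities."""
    z = np.exp(h - np.max(h))   # subtract the max for numerical stability
    return z / z.sum()

def gradient_bandit_step(h, action, reward, baseline, alpha=0.1):
    """Return updated preferences after one (action, reward) sample."""
    pi = softmax(h)
    one_hot = np.zeros_like(h)
    one_hot[action] = 1.0
    # one_hot - pi is (1 - pi[A]) at the chosen action and -pi[a] elsewhere
    return h + alpha * (reward - baseline) * (one_hot - pi)

h = np.zeros(3)   # three actions, so the initial policy is uniform
h = gradient_bandit_step(h, action=1, reward=1.0, baseline=0.0)
print(h, softmax(h))   # the probability of action index 1 has increased
```

Each step adds a vector whose entries sum to zero, so the update shifts preference between actions without changing the total.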


In [None]:
class GradientBanditAgent:
    """Gradient Bandit agent for the Addition game.
    
    Uses softmax action selection with preference-based learning.
    Each state has its own set of action preferences.
    """
    
    def __init__(
        self, 
        player_id: int,
        k: int,
        alpha: float = 0.1,
        use_baseline: bool = True,
        temperature: float = 1.0
    ):
        self.player_id = player_id
        self.k = k
        self.alpha = alpha  # Learning rate
        self.use_baseline = use_baseline
        self.temperature = temperature
        
        # Preferences H(s, a) for each state-action pair
        # Initialized to 0 (uniform softmax initially)
        self.preferences: Dict[int, np.ndarray] = defaultdict(
            lambda: np.zeros(k)
        )
        
        # Baseline: average reward per state
        self.reward_baseline: Dict[int, float] = defaultdict(float)
        self.state_visits: Dict[int, int] = defaultdict(int)
        
        # Statistics
        self.wins = 0
        self.losses = 0
        self.games = 0
    
    def softmax(self, preferences: np.ndarray) -> np.ndarray:
        """Compute softmax probabilities from preferences."""
        # Apply temperature and subtract max for numerical stability
        scaled = preferences / self.temperature
        exp_prefs = np.exp(scaled - np.max(scaled))
        return exp_prefs / np.sum(exp_prefs)
    
    def get_action_probabilities(self, state: GameState) -> np.ndarray:
        """Get probability distribution over actions for given state."""
        state_key = state.current_sum
        preferences = self.preferences[state_key]
        return self.softmax(preferences)
    
    def select_action(self, state: GameState, valid_actions: List[int], training: bool = True) -> int:
        """Select action using softmax over preferences."""
        probs = self.get_action_probabilities(state)
        
        if training:
            # Sample from distribution
            action_idx = np.random.choice(self.k, p=probs)
        else:
            # Greedy: select highest probability action
            action_idx = np.argmax(probs)
        
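        # Cast to a plain int so that game histories print as [2, 1, ...]
        # rather than as NumPy scalars; actions are 1-indexed
        return int(action_idx) + 1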
    
    def update(self, state: GameState, action: int, reward: float):
        """Update preferences using gradient bandit update rule."""
        state_key = state.current_sum
        action_idx = action - 1  # Convert to 0-indexed
        
        # Get current probabilities
        probs = self.get_action_probabilities(state)
        
        # Update baseline (incremental mean)
        self.state_visits[state_key] += 1
        n = self.state_visits[state_key]
        
        if self.use_baseline:
            old_baseline = self.reward_baseline[state_key]
            self.reward_baseline[state_key] = old_baseline + (reward - old_baseline) / n
            baseline = old_baseline  # Use old baseline for this update
        else:
            baseline = 0
        
        # Gradient bandit update
        advantage = reward - baseline
        
        # Update all preferences
        for a in range(self.k):
            if a == action_idx:
                # Selected action: positive gradient
                self.preferences[state_key][a] += self.alpha * advantage * (1 - probs[a])
            else:
                # Non-selected actions: negative gradient
                self.preferences[state_key][a] -= self.alpha * advantage * probs[a]
    
    def record_result(self, won: bool):
        """Record game result."""
        self.games += 1
        if won:
            self.wins += 1
        else:
            self.losses += 1
    
    def win_rate(self) -> float:
        """Calculate win rate."""
        return self.wins / max(1, self.games)


## 3. Training Loop

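We train both agents through self-play. Each agent learns only from the final outcome: once the game ends, every (state, action) pair that the agent visited is updated with the same undiscounted terminal reward (+1 for a win, -1 for a loss), as in a Monte Carlo policy-gradient method.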
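
As a minimal sketch of this credit assignment, the helper below splits a finished game into the (sum before move, action) pairs seen by each player and attaches the terminal rewards. The name `credit_assignment` is hypothetical and the function is independent of the training loop that follows.

```python
from typing import List

def credit_assignment(moves: List[int], N: int):
    """Split a finished game into per-player (sum_before_move, action) pairs."""
    visited = [[], []]                # index 0 = Player I, 1 = Player II
    total = 0
    for turn, move in enumerate(moves):
        visited[turn % 2].append((total, move))
        total += move
    assert total > N, "the move list must describe a finished game"
    loser = (len(moves) - 1) % 2      # whoever moved last pushed the sum past N
    rewards = (-1.0, 1.0) if loser == 0 else (1.0, -1.0)
    return visited[0], visited[1], rewards

p1, p2, rewards = credit_assignment([2, 3, 1, 3, 1, 1], N=10)
print(p1)        # pairs that would be updated with rewards[0]
print(p2)        # pairs that would be updated with rewards[1]
print(rewards)
```

In this hand-picked game Player II moves from the sums 2, 6 and 10 and loses, so all three of Player II's pairs receive -1 and all of Player I's pairs receive +1.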


In [None]:
def train_gradient_bandit_agents(
    game: AdditionGame,
    agent1: GradientBanditAgent,
    agent2: GradientBanditAgent,
    num_episodes: int = 50000,
    log_interval: int = 1000
) -> dict:
    """Train two gradient bandit agents through self-play."""
    agents = [agent1, agent2]
    stats = {
        'episodes': [],
        'agent1_win_rate': [],
        'agent2_win_rate': [],
        'avg_game_length': [],
        'entropy_p1': [],
        'entropy_p2': []
    }
    
    game_lengths = []
    
    for episode in tqdm(range(num_episodes), desc="Training"):
        state = game.reset()
        
        # Store (state, action) pairs for each agent
        history = [[], []]  # [agent1_history, agent2_history]
        
        while not game.done:
            current_player = state.current_player
            agent = agents[current_player]
            
            valid_actions = game.get_valid_actions()
            action = agent.select_action(state, valid_actions, training=True)
            
            history[current_player].append((state, action))
            
            state, reward_p1, reward_p2, done = game.step(action)
        
        # Game ended - get final rewards
        rewards = [reward_p1, reward_p2]
        game_lengths.append(len(game.history))
        
        # Update each agent's preferences for all visited states
        for player_id in [0, 1]:
            agent = agents[player_id]
            reward = rewards[player_id]
            
            # Update all state-action pairs visited by this agent
            for state, action in history[player_id]:
                agent.update(state, action, reward)
            
            agent.record_result(game.winner == player_id)
        
        # Log statistics
        if (episode + 1) % log_interval == 0:
            stats['episodes'].append(episode + 1)
            stats['agent1_win_rate'].append(agent1.wins / max(1, agent1.games) * 100)
            stats['agent2_win_rate'].append(agent2.wins / max(1, agent2.games) * 100)
            stats['avg_game_length'].append(np.mean(game_lengths[-log_interval:]))
            
            # Calculate policy entropy (measure of exploration)
            entropy1 = np.mean([
                -np.sum(agent1.softmax(p) * np.log(agent1.softmax(p) + 1e-10))
                for p in agent1.preferences.values() if len(p) > 0
            ]) if agent1.preferences else 0
            entropy2 = np.mean([
                -np.sum(agent2.softmax(p) * np.log(agent2.softmax(p) + 1e-10))
                for p in agent2.preferences.values() if len(p) > 0
            ]) if agent2.preferences else 0
            
            stats['entropy_p1'].append(entropy1)
            stats['entropy_p2'].append(entropy2)
            
            # Reset counters
            agent1.wins = agent1.losses = agent1.games = 0
            agent2.wins = agent2.losses = agent2.games = 0
    
    return stats


## 4. Training the Agents

Let's train two gradient bandit agents on the Addition game with parameters `k=3` and `N=10`.


In [None]:
# Game parameters
K = 3  # Can choose 1, 2, or 3
N = 10  # Game ends when sum exceeds 10

# Create game and agents
game = AdditionGame(k=K, N=N)

agent1 = GradientBanditAgent(
    player_id=0,
    k=K,
    alpha=0.15,
    use_baseline=True,
    temperature=1.0
)

agent2 = GradientBanditAgent(
    player_id=1,
    k=K,
    alpha=0.15,
    use_baseline=True,
    temperature=1.0
)

print("Training Gradient Bandit agents on Addition game")
print(f"Parameters: k={K}, N={N}")
print(f"Player I starts, choices are integers from 1 to {K}")
print(f"First player to make sum exceed {N} loses")


In [None]:
# Train the agents
stats = train_gradient_bandit_agents(game, agent1, agent2, num_episodes=50000, log_interval=500)


## 5. Visualizing Training Progress


In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Addition Game: Gradient Bandit Training Progress', fontsize=16, fontweight='bold', color='#e94560')

# Win rates
ax1 = axes[0, 0]
ax1.plot(stats['episodes'], stats['agent1_win_rate'], color='#f7b731', linewidth=2, label='Player I')
ax1.plot(stats['episodes'], stats['agent2_win_rate'], color='#26de81', linewidth=2, label='Player II')
ax1.axhline(y=50, color='#666', linestyle='--', alpha=0.5, label='50% baseline')
ax1.set_xlabel('Episode')
ax1.set_ylabel('Win Rate (%)')
ax1.set_title('Win Rates Over Training', color='#e94560')
ax1.legend(facecolor='#16213e', edgecolor='#4a4a6a')
ax1.grid(True, alpha=0.3)

# Policy entropy
ax2 = axes[0, 1]
ax2.plot(stats['episodes'], stats['entropy_p1'], color='#f7b731', linewidth=2, label='Player I')
ax2.plot(stats['episodes'], stats['entropy_p2'], color='#26de81', linewidth=2, label='Player II')
max_entropy = np.log(K)
ax2.axhline(y=max_entropy, color='#666', linestyle='--', alpha=0.5, label=f'Max entropy ({max_entropy:.2f})')
ax2.set_xlabel('Episode')
ax2.set_ylabel('Policy Entropy')
ax2.set_title('Policy Entropy (exploration measure)', color='#e94560')
ax2.legend(facecolor='#16213e', edgecolor='#4a4a6a')
ax2.grid(True, alpha=0.3)

# Average game length
ax3 = axes[1, 0]
ax3.plot(stats['episodes'], stats['avg_game_length'], color='#a55eea', linewidth=2)
ax3.set_xlabel('Episode')
ax3.set_ylabel('Average Game Length (moves)')
ax3.set_title('Average Game Length', color='#e94560')
ax3.grid(True, alpha=0.3)

# Win rate difference
ax4 = axes[1, 1]
advantage = [a1 - a2 for a1, a2 in zip(stats['agent1_win_rate'], stats['agent2_win_rate'])]
colors = ['#f7b731' if a > 0 else '#26de81' for a in advantage]
ax4.bar(stats['episodes'], advantage, color=colors, alpha=0.7, width=400)
ax4.axhline(y=0, color='#fff', linewidth=1)
ax4.set_xlabel('Episode')
ax4.set_ylabel('Win Rate Difference (%)')
ax4.set_title('Player I Advantage (positive = Player I favored)', color='#e94560')
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


## 6. Analyzing Learned Policies

Let's examine the action probabilities learned by each agent.


In [None]:
def visualize_gradient_bandit_policy(agent: GradientBanditAgent, k: int, N: int, player_name: str):
    """Visualize the learned policy as probability distributions."""
    
    sums = list(range(0, N + 1))
    actions = list(range(1, k + 1))
    
    # Create probability and preference matrices
    prob_matrix = np.zeros((len(sums), len(actions)))
    pref_matrix = np.zeros((len(sums), len(actions)))
    
    for i, s in enumerate(sums):
        prefs = agent.preferences[s]
        probs = agent.softmax(prefs)
        for j in range(k):
            prob_matrix[i, j] = probs[j]
            pref_matrix[i, j] = prefs[j]
    
    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(16, 6))
    fig.suptitle(f'{player_name} Learned Strategy (Gradient Bandit)', fontsize=14, fontweight='bold', color='#e94560')
    
    # Preferences heatmap
    im1 = ax1.imshow(pref_matrix, aspect='auto', cmap='RdYlBu_r')
    ax1.set_yticks(range(len(sums)))
    ax1.set_yticklabels(sums)
    ax1.set_xticks(range(len(actions)))
    ax1.set_xticklabels(actions)
    ax1.set_ylabel('Current Sum')
    ax1.set_xlabel('Action')
    ax1.set_title('Preferences H(s,a)', color='#e94560')
    plt.colorbar(im1, ax=ax1, label='Preference')
    
    for i in range(len(sums)):
        for j in range(len(actions)):
            color = 'white' if abs(pref_matrix[i, j]) > 0.5 else 'black'
            ax1.text(j, i, f'{pref_matrix[i, j]:.2f}', ha='center', va='center', color=color, fontsize=8)
    
    # Probabilities heatmap
    im2 = ax2.imshow(prob_matrix, aspect='auto', cmap='Greens', vmin=0, vmax=1)
    ax2.set_yticks(range(len(sums)))
    ax2.set_yticklabels(sums)
    ax2.set_xticks(range(len(actions)))
    ax2.set_xticklabels(actions)
    ax2.set_ylabel('Current Sum')
    ax2.set_xlabel('Action')
    ax2.set_title('Action Probabilities π(a|s)', color='#e94560')
    plt.colorbar(im2, ax=ax2, label='Probability')
    
    for i in range(len(sums)):
        for j in range(len(actions)):
            color = 'white' if prob_matrix[i, j] > 0.5 else 'black'
            ax2.text(j, i, f'{prob_matrix[i, j]:.2f}', ha='center', va='center', color=color, fontsize=8)
    
    # Best action visualization
    best_actions = np.argmax(prob_matrix, axis=1) + 1
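    # One colour per action; the palette is cycled if k exceeds its length
    palette = ['#f7b731', '#26de81', '#a55eea']
    colors = [palette[i % len(palette)] for i in range(k)]
    
    ax3.barh(sums, best_actions, color=[colors[a-1] for a in best_actions], edgecolor='white', linewidth=0.5)
    ax3.set_xlabel('Most Probable Action')
    ax3.set_ylabel('Current Sum')
    ax3.set_title('Greedy Policy', color='#e94560')
    ax3.set_xticks(actions)  # one tick per action instead of a hardcoded [1, 2, 3]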
    ax3.set_xlim(0, k + 0.5)
    ax3.invert_yaxis()
    
    from matplotlib.patches import Patch
    legend_elements = [Patch(facecolor=colors[i], label=f'Action {i+1}') for i in range(k)]
    ax3.legend(handles=legend_elements, loc='lower right', facecolor='#16213e', edgecolor='#4a4a6a')
    
    plt.tight_layout()
    plt.show()
    
    return best_actions, prob_matrix

print("Player I Strategy:")
p1_actions, p1_probs = visualize_gradient_bandit_policy(agent1, K, N, "Player I")

print("\nPlayer II Strategy:")
p2_actions, p2_probs = visualize_gradient_bandit_policy(agent2, K, N, "Player II")


## 7. Theoretical Optimal Strategy

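For the Addition game, the optimal strategy is deterministic and based on modular arithmetic. Positions are classified from the point of view of the player about to move:

- **Losing positions**: Sums where `sum ≡ N (mod k+1)`
- **Winning positions**: All other sums

At sum `N` every legal move pushes the sum past `N`, so the player to move loses. Whenever the opponent plays `a`, replying with `k + 1 - a` advances the sum by exactly `k + 1`, so a player who hands the opponent a sum congruent to `N` can keep doing so until the opponent faces `N` itself. From a winning position, play the action that moves the opponent to a losing position.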
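
The modular rule can be checked independently with backward induction over the sums `N, N-1, ..., 0`. The sketch below does not use the closed form at all; the function name `losing_sums_by_backward_induction` is an assumption introduced for this check.

```python
def losing_sums_by_backward_induction(k: int, N: int) -> list:
    """Return the sums at which the player about to move loses under optimal play."""
    # wins[s] is True when the mover at sum s can force a win.
    wins = [False] * (N + 1)
    for s in range(N, -1, -1):
        # A move a is good if it keeps the sum at most N and leaves the
        # opponent at a sum from which they cannot force a win.
        wins[s] = any(s + a <= N and not wins[s + a] for a in range(1, k + 1))
    return [s for s in range(N + 1) if not wins[s]]

print(losing_sums_by_backward_induction(k=3, N=10))   # [2, 6, 10]
```

For `k=3` and `N=10` the losing sums are 2, 6 and 10, all congruent to 10 modulo 4, and the starting sum 0 is a winning position for Player I.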


In [None]:
def compare_with_optimal(agent: GradientBanditAgent, optimal_actions: dict, losing_positions: list, 
                         k: int, N: int, player_name: str):
    """Compare learned policy with optimal."""
    matches = 0
    mismatches = []
    
    for s in range(0, N + 1):
        # Get learned action (greedy)
        probs = agent.get_action_probabilities(GameState(s, 0))
        learned_action = np.argmax(probs) + 1
        optimal = optimal_actions[s]
        
        if learned_action == optimal or s in losing_positions:
            # In a losing position every move loses against optimal play,
            # so any learned action is as good as the tabulated one.
            matches += 1
        else:
            # From a winning position, the learned action is correct only if
            # it also hands the opponent a losing position.
            if (s + learned_action) in losing_positions:
                matches += 1
            else:
                mismatches.append((s, learned_action, optimal, probs))
    
    accuracy = matches / (N + 1) * 100
    
    print(f"\n{player_name} Policy Comparison:")
    print(f"  Accuracy: {accuracy:.1f}% ({matches}/{N + 1} positions)")
    
    if mismatches:
        print(f"  Mismatches:")
        for s, learned, optimal, probs in mismatches[:5]:
            probs_str = ", ".join([f"{p:.2f}" for p in probs])
            print(f"    Sum {s}: learned={learned}, optimal={optimal}, probs=[{probs_str}]")
    else:
        print("  Perfect match with optimal strategy!")
    
    return accuracy

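# `optimal_actions` and `losing_positions` come from compute_optimal_strategy(K, N),
# which is defined in the final code cell of this notebook; run that cell first.
acc1 = compare_with_optimal(agent1, optimal_actions, losing_positions, K, N, "Player I")
acc2 = compare_with_optimal(agent2, optimal_actions, losing_positions, K, N, "Player II")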


## 8. Evaluation Against Optimal Agent


In [None]:
class OptimalAgent:
    """Agent that plays the theoretically optimal strategy."""
    
    def __init__(self, k: int, N: int):
        _, self.optimal_actions = compute_optimal_strategy(k, N)
    
    def select_action(self, state: GameState, valid_actions: List[int], training: bool = False) -> int:
        return self.optimal_actions.get(state.current_sum, 1)


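def evaluate(game: AdditionGame, agent1, agent2, num_games: int = 1000):
    """Evaluate two agents over multiple games.

    Both agents are queried with training=False, so a GradientBanditAgent acts
    greedily. When both agents are deterministic, every game is identical and
    the result is all-or-nothing regardless of num_games.
    """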
    wins = [0, 0]
    
    for _ in range(num_games):
        state = game.reset()
        agents = [agent1, agent2]
        
        while not game.done:
            agent = agents[state.current_player]
            action = agent.select_action(state, game.get_valid_actions(), training=False)
            state, _, _, _ = game.step(action)
        
        wins[game.winner] += 1
    
    return wins[0], wins[1]


optimal_agent = OptimalAgent(K, N)

print("Evaluation Results (1000 games each):\n")

print("Test 1: Learned Player I vs Optimal Player II")
w1, w2 = evaluate(game, agent1, optimal_agent, num_games=1000)
print(f"  Learned wins: {w1}, Optimal wins: {w2}")
print(f"  Learned win rate: {w1/10:.1f}%")

print("\nTest 2: Optimal Player I vs Learned Player II")
w1, w2 = evaluate(game, optimal_agent, agent2, num_games=1000)
print(f"  Optimal wins: {w1}, Learned wins: {w2}")
print(f"  Learned win rate: {w2/10:.1f}%")

print("\nTest 3: Learned vs Learned (self-play)")
w1, w2 = evaluate(game, agent1, agent2, num_games=1000)
print(f"  Player I wins: {w1}, Player II wins: {w2}")
print(f"  Player I win rate: {w1/10:.1f}%")

print(f"\n{'='*50}")
print("Note: With k=3, N=10, optimal Player I should always win.")


## 9. Watching a Sample Game


In [None]:
def play_game_verbose(game: AdditionGame, agent1, agent2, name1: str, name2: str, show_probs: bool = True):
    """Play a game with detailed output."""
    state = game.reset()
    agents = [agent1, agent2]
    names = [name1, name2]
    
    print(f"{'='*60}")
    print(f"ADDITION GAME: k={game.k}, N={game.N}")
    print(f"{name1} (Player I) vs {name2} (Player II)")
    print(f"First to make sum exceed {game.N} loses!")
    print(f"{'='*60}\n")
    
    move = 1
    while not game.done:
        player = state.current_player
        agent = agents[player]
        name = names[player]
        
        action = agent.select_action(state, game.get_valid_actions(), training=False)
        
        # Show probabilities if agent is gradient bandit
        if show_probs and hasattr(agent, 'get_action_probabilities'):
            probs = agent.get_action_probabilities(state)
            probs_str = ", ".join([f"{p:.2f}" for p in probs])
            print(f"Move {move}: {name} (probs: [{probs_str}]) -> chooses {action}")
        else:
            print(f"Move {move}: {name} chooses {action}")
        
        print(f"         Sum: {state.current_sum} -> {state.current_sum + action}")
        
        state, _, _, done = game.step(action)
        
        if done:
            print(f"\n         *** SUM ({state.current_sum}) EXCEEDS {game.N}! ***")
            print(f"\n  WINNER: {names[game.winner]}")
        else:
            print()
        
        move += 1
    
    print(f"\n{'='*60}")
    print(f"History: {game.history}")
    print(f"{'='*60}")

print("Game 1: Learned agents playing each other\n")
play_game_verbose(game, agent1, agent2, "Learned-I", "Learned-II")

print("\n\n")

print("Game 2: Learned Player I vs Optimal Player II\n")
play_game_verbose(game, agent1, optimal_agent, "Learned-I", "Optimal-II")


## 10. Conclusion

### Summary

We implemented **Gradient Bandit** agents for Blackwell's Addition game. Key aspects:

1. **Preference-Based Learning**: Instead of estimating action values (like Q-learning), gradient bandits learn action *preferences* and use softmax to convert them to probabilities.

2. **Stochastic Policies**: The softmax distribution naturally provides exploration through stochastic action selection, avoiding the need for epsilon-greedy.

3. **Baseline Importance**: Subtracting a baseline (here, the running average reward per state) leaves the expected gradient unchanged while typically reducing its variance, which tends to stabilize learning.
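
The effect of the baseline in point 3 can be illustrated exactly on a toy case by enumerating all actions instead of sampling. The sketch below assumes a fixed uniform policy over three actions and a hypothetical reward vector; `estimator_variance` is an illustrative helper, and the comparison holds for this example rather than as a general guarantee for arbitrary policies.

```python
import numpy as np

def estimator_variance(pi: np.ndarray, rewards: np.ndarray, baseline: float) -> float:
    """Exact total variance of the one-sample preference gradient for a fixed policy."""
    k = len(pi)
    # Row A holds the gradient sample (r(A) - b) * (one_hot(A) - pi)
    samples = (rewards - baseline)[:, None] * (np.eye(k) - pi[None, :])
    mean = pi @ samples                                  # expected gradient
    second_moment = pi @ np.sum(samples ** 2, axis=1)    # E[||g||^2]
    return float(second_moment - mean @ mean)

pi = np.full(3, 1 / 3)                   # uniform policy over three actions
rewards = np.array([1.0, 1.0, -1.0])     # hypothetical per-action outcomes
v_without = estimator_variance(pi, rewards, baseline=0.0)
v_with = estimator_variance(pi, rewards, baseline=float(pi @ rewards))
print(v_without, v_with)                 # the second value is smaller
```

The expected gradient is identical in both cases because the baseline term is multiplied by a vector whose expectation under the policy is zero; only the spread of the samples changes.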

### Gradient Bandit vs Q-Learning for This Game

| Aspect | Gradient Bandit | Q-Learning |
|--------|----------------|------------|
| Policy Type | Stochastic (softmax) | Deterministic (argmax) |
| Exploration | Built-in via softmax | Requires epsilon-greedy |
| What it learns | Action preferences | Action values |
| Update rule | Policy gradient | Bellman equation |

### Connections to Blackwell's Work

While the Addition game has a deterministic optimal solution (which makes a value-based method such as Q-learning a natural fit), the gradient bandit approach shows how policy gradient methods can also search for optimal play through iterative preference updates. This connects to broader themes in Blackwell's work on:
- Sequential decision problems
- Adaptive strategies that converge to optimality
- The interplay between exploration and exploitation

### Extensions

- **Actor-Critic**: Combine gradient bandits with value estimation
- **Natural Gradient**: Use Fisher information for more efficient updates
- **Larger Games**: Deep policy gradients for games with larger state spaces
- **Imperfect Information**: Extend to games with hidden information


In [None]:
def compute_optimal_strategy(k: int, N: int):
    """Compute the theoretically optimal strategy."""
    # Losing positions: sums where sum ≡ N (mod k+1)
    losing_positions = []
    s = N
    while s >= 0:
        losing_positions.append(s)
        s -= (k + 1)
    losing_positions = sorted(losing_positions)
    
    # Optimal actions from each position
    optimal_actions = {}
    for s in range(0, N + 1):
        if s in losing_positions:
            optimal_actions[s] = 1  # Any move loses, pick smallest
        else:
            for a in range(1, k + 1):
                next_sum = s + a
                if next_sum in losing_positions:
                    optimal_actions[s] = a
                    break
            else:
                optimal_actions[s] = 1
    
    return losing_positions, optimal_actions

losing_positions, optimal_actions = compute_optimal_strategy(K, N)

print(f"Game parameters: k={K}, N={N}")
print(f"\nLosing positions (for player to move): {losing_positions}")
print(f"These satisfy: sum ≡ {N} (mod {K+1})")

print(f"\nOptimal actions from each position:")
for s in range(0, N + 1):
    status = "LOSING" if s in losing_positions else "winning"
    print(f"  Sum {s:2d}: play {optimal_actions[s]} ({status})")

print(f"\n{'='*50}")
if 0 in losing_positions:
    print("Initial position (0) is LOSING → Player II wins with optimal play!")
else:
    print("Initial position (0) is winning → Player I wins with optimal play!")
