# Othello RL Environment - Basic Usage

This notebook demonstrates the basic usage of the Othello RL environment, including:
- Environment creation and configuration
- Observation and action spaces
- Running episodes with random agents
- Action masking
- Rendering and visualization
- State persistence

## Setup

First, let's import the necessary libraries and create an environment.

In [None]:
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
import aip_rl.othello

# Set random seed for reproducibility
np.random.seed(42)

## 1. Creating the Environment

Let's create a basic Othello environment with default settings.

In [None]:
# Create environment with default settings
env = gym.make("Othello-v0")

print("Environment created successfully!")
print(f"Observation space: {env.observation_space}")
print(f"Action space: {env.action_space}")

## 2. Understanding the Observation Space

The observation is a 3D array with shape (3, 8, 8):
- Channel 0: Agent's pieces
- Channel 1: Opponent's pieces
- Channel 2: Valid moves
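
To make the three channels concrete, here is a minimal, self-contained sketch that builds such an array for the standard opening position. It is not the environment's own code: the `EMPTY`/`BLACK`/`WHITE` board encoding, the helper names, and the `float32` dtype are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical board encoding used only in this sketch
EMPTY, BLACK, WHITE = 0, 1, 2
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def initial_board():
    """Return the standard Othello starting position on an 8x8 board."""
    board = np.zeros((8, 8), dtype=np.int8)
    board[3, 3] = board[4, 4] = WHITE
    board[3, 4] = board[4, 3] = BLACK
    return board


def is_valid_move(board, row, col, player):
    """A move is valid if it brackets at least one run of opponent discs."""
    if board[row, col] != EMPTY:
        return False
    opponent = WHITE if player == BLACK else BLACK
    for dr, dc in DIRECTIONS:
        r, c = row + dr, col + dc
        seen_opponent = False
        # Walk over a contiguous run of opponent discs in this direction
        while 0 <= r < 8 and 0 <= c < 8 and board[r, c] == opponent:
            r, c = r + dr, c + dc
            seen_opponent = True
        # The run must be closed off by one of the player's own discs
        if seen_opponent and 0 <= r < 8 and 0 <= c < 8 and board[r, c] == player:
            return True
    return False


def encode_observation(board, player):
    """Stack agent pieces, opponent pieces, and valid moves into (3, 8, 8)."""
    opponent = WHITE if player == BLACK else BLACK
    obs = np.zeros((3, 8, 8), dtype=np.float32)  # dtype is an assumption
    obs[0] = board == player
    obs[1] = board == opponent
    for row in range(8):
        for col in range(8):
            obs[2, row, col] = is_valid_move(board, row, col, player)
    return obs


obs = encode_observation(initial_board(), BLACK)
print(obs.shape, int(obs[2].sum()))
```

In the opening position each side has two discs and Black has four legal moves, so channels 0 and 1 each contain two ones and channel 2 contains four.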

In [None]:
# Reset environment and get initial observation
observation, info = env.reset(seed=42)

print(f"Observation shape: {observation.shape}")
print(f"Observation dtype: {observation.dtype}")
print(f"\nInfo dictionary keys: {list(info.keys())}")
print(f"Initial black count: {info['black_count']}")
print(f"Initial white count: {info['white_count']}")
print(f"Number of valid moves: {np.sum(info['action_mask'])}")

### Visualizing the Observation Channels

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Agent's pieces (Channel 0)
axes[0].imshow(observation[0], cmap='Blues', vmin=0, vmax=1)
axes[0].set_title("Agent's Pieces (Black)")
axes[0].axis('off')

# Opponent's pieces (Channel 1)
axes[1].imshow(observation[1], cmap='Reds', vmin=0, vmax=1)
axes[1].set_title("Opponent's Pieces (White)")
axes[1].axis('off')

# Valid moves (Channel 2)
axes[2].imshow(observation[2], cmap='Greens', vmin=0, vmax=1)
axes[2].set_title("Valid Moves")
axes[2].axis('off')

plt.tight_layout()
plt.show()

## 3. Understanding the Action Space

Actions are integers from 0 to 63, representing board positions:
- action = row * 8 + col
- row = action // 8
- col = action % 8
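
The mapping above is ordinary row-major indexing, so it can be written with `divmod` for single actions or with NumPy's `unravel_index` / `ravel_multi_index` for whole arrays. The helper names below are illustrative and are not part of the environment.

```python
import numpy as np

BOARD_SIZE = 8


def action_to_position(action):
    """Convert a flat action index into a (row, col) pair."""
    return divmod(action, BOARD_SIZE)


def position_to_action(row, col):
    """Convert a (row, col) pair back into a flat action index."""
    return row * BOARD_SIZE + col


# Vectorized equivalents for all 64 actions at once
actions = np.arange(BOARD_SIZE * BOARD_SIZE)
rows, cols = np.unravel_index(actions, (BOARD_SIZE, BOARD_SIZE))
round_trip = np.ravel_multi_index((rows, cols), (BOARD_SIZE, BOARD_SIZE))

print(action_to_position(19), position_to_action(7, 7))
```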

In [None]:
# Show action mapping
print("Action to position mapping (selected examples):")
for action in [0, 7, 19, 27, 63]:
    row = action // 8
    col = action % 8
    print(f"Action {action:2d} -> Position ({row}, {col})")

# Show valid actions
valid_actions = np.where(info['action_mask'])[0]
print(f"\nValid actions at start: {valid_actions}")
print("Valid positions:")
for action in valid_actions:
    row, col = action // 8, action % 8
    print(f"  Action {action} -> ({row}, {col})")

## 4. Running a Single Episode with Random Agent

Let's run a complete episode using a random agent that selects from valid moves.
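
A uniform random agent only needs the indices where the mask is true. A learned policy usually produces a score per action instead, and the same mask can then be applied before sampling. The sketch below shows one common way to do that (set invalid scores to negative infinity, then apply a softmax). The function name, the random scores, and the example mask are hypothetical; the mask simply stands in for `info['action_mask']`.

```python
import numpy as np


def masked_sample(logits, action_mask, rng):
    """Sample an action from softmax(logits) restricted to valid actions."""
    mask = np.asarray(action_mask, dtype=bool)
    if not mask.any():
        raise ValueError("no valid actions to sample from")
    # Invalid actions get -inf so their probability becomes exactly zero
    masked_logits = np.where(mask, logits, -np.inf)
    # Subtract the maximum for numerical stability before exponentiating
    shifted = masked_logits - masked_logits.max()
    probs = np.exp(shifted)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs)), probs


rng = np.random.default_rng(0)
logits = rng.normal(size=64)          # stand-in for policy network outputs
mask = np.zeros(64, dtype=bool)
mask[[19, 26, 37, 44]] = True         # hypothetical set of valid actions
action, probs = masked_sample(logits, mask, rng)
print(action, probs[mask].round(3))
```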

In [None]:
# Reset environment
observation, info = env.reset(seed=42)

# Track episode statistics
episode_rewards = []
episode_length = 0
done = False

print("Running episode with random agent...\n")

while not done:
    # Get valid actions from action mask
    action_mask = info['action_mask']
    valid_actions = np.where(action_mask)[0]
    
    if len(valid_actions) == 0:
        print("No valid moves available!")
        break
    
    # Select random valid action
    action = np.random.choice(valid_actions)
    
    # Take step
    observation, reward, terminated, truncated, info = env.step(action)
    
    episode_rewards.append(reward)
    episode_length += 1
    done = terminated or truncated
    
    # Print progress every 5 steps
    if episode_length % 5 == 0:
        print(f"Step {episode_length}: Black={info['black_count']}, White={info['white_count']}, Reward={reward:.2f}")

# Episode summary
print("\nEpisode finished!")
print(f"Episode length: {episode_length}")
print(f"Total reward: {sum(episode_rewards):.2f}")
print(f"Final score - Black: {info['black_count']}, White: {info['white_count']}")

# Determine winner
if info['black_count'] > info['white_count']:
    print("Winner: Black (Agent)")
elif info['white_count'] > info['black_count']:
    print("Winner: White (Opponent)")
else:
    print("Result: Draw")

### Visualizing Reward Over Time

In [None]:
plt.figure(figsize=(10, 4))
plt.plot(episode_rewards)
plt.xlabel('Step')
plt.ylabel('Reward')
plt.title('Reward per Step (Sparse Reward Mode)')
plt.grid(True, alpha=0.3)
plt.show()

## 5. Rendering the Game

Let's create an environment with rendering enabled and visualize a game.
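
A board can also be drawn as text directly from an observation, which is handy when no render mode was requested. This is a minimal sketch: the symbols (`X`, `O`, `*`, `.`) and the layout are choices made here and are not the format returned by the environment's own `render()`.

```python
import numpy as np


def render_observation(obs):
    """Render a (3, 8, 8) observation as text: X agent, O opponent, * valid move."""
    symbols = np.full((8, 8), ".", dtype="<U1")
    symbols[obs[2] > 0] = "*"
    symbols[obs[0] > 0] = "X"
    symbols[obs[1] > 0] = "O"
    header = "  " + " ".join(str(col) for col in range(8))
    rows = [f"{row} " + " ".join(symbols[row]) for row in range(8)]
    return "\n".join([header] + rows)


# Hand-built observation of the standard opening position (agent plays Black)
obs = np.zeros((3, 8, 8), dtype=np.float32)
obs[0, 3, 4] = obs[0, 4, 3] = 1
obs[1, 3, 3] = obs[1, 4, 4] = 1
obs[2, 2, 3] = obs[2, 3, 2] = obs[2, 4, 5] = obs[2, 5, 4] = 1

text = render_observation(obs)
print(text)
```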

In [None]:
# Create environment with ANSI rendering
env_render = gym.make("Othello-v0", render_mode="ansi")
observation, info = env_render.reset(seed=42)

# Show initial board
print("Initial board state:")
print(env_render.render())

In [None]:
# Play a few moves and show the board after each one
for i in range(5):
    action_mask = info['action_mask']
    valid_actions = np.where(action_mask)[0]
    if len(valid_actions) == 0:
        # np.random.choice raises ValueError on an empty array
        break
    action = np.random.choice(valid_actions)
    
    observation, reward, terminated, truncated, info = env_render.step(action)
    
    print(f"\nAfter move {i+1} (action {action}):")
    print(env_render.render())
    
    if terminated or truncated:
        break

### RGB Array Rendering

In [None]:
# Create environment with RGB rendering
env_rgb = gym.make("Othello-v0", render_mode="rgb_array")
observation, info = env_rgb.reset(seed=42)

# Get RGB frame
rgb_frame = env_rgb.render()

print(f"RGB frame shape: {rgb_frame.shape}")
print(f"RGB frame dtype: {rgb_frame.dtype}")

# Display the frame
plt.figure(figsize=(8, 8))
plt.imshow(rgb_frame)
plt.title("Initial Board State (RGB Rendering)")
plt.axis('off')
plt.show()

## 6. Different Reward Modes

Let's compare sparse and dense reward modes.
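
As a rough mental model, a sparse scheme pays out only when the game ends, while a dense scheme gives feedback on every step. The sketch below shows one common way to define each. Both formulas are assumptions made for illustration; the environment's actual `reward_mode` definitions may differ.

```python
def sparse_reward(agent_count, opponent_count, terminated):
    """Assumed sparse scheme: 0 during play, then +1 / -1 / 0 for win / loss / draw."""
    if not terminated:
        return 0.0
    if agent_count > opponent_count:
        return 1.0
    if agent_count < opponent_count:
        return -1.0
    return 0.0


def dense_reward(prev_counts, new_counts, board_cells=64):
    """Assumed dense scheme: change in disc differential, scaled by board size.

    Each argument is an (agent_count, opponent_count) pair.
    """
    prev_diff = prev_counts[0] - prev_counts[1]
    new_diff = new_counts[0] - new_counts[1]
    return (new_diff - prev_diff) / board_cells


# Opening move: the agent places one disc and flips one opponent disc
print(sparse_reward(4, 1, terminated=False))   # mid-game step
print(dense_reward((2, 2), (4, 1)))            # same step under the dense scheme
```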

In [None]:
def run_episode(env, seed=42):
    """Run a single episode with a random agent and return the per-step rewards."""
    observation, info = env.reset(seed=seed)
    # Use a local generator so every call with the same seed draws the same actions
    rng = np.random.default_rng(seed)
    rewards = []
    done = False
    
    while not done:
        action_mask = info['action_mask']
        valid_actions = np.where(action_mask)[0]
        if len(valid_actions) == 0:
            break
        action = rng.choice(valid_actions)
        observation, reward, terminated, truncated, info = env.step(action)
        rewards.append(reward)
        done = terminated or truncated
    
    return rewards

# Sparse rewards
env_sparse = gym.make("Othello-v0", reward_mode="sparse")
rewards_sparse = run_episode(env_sparse, seed=42)

# Dense rewards
env_dense = gym.make("Othello-v0", reward_mode="dense")
rewards_dense = run_episode(env_dense, seed=42)

# Plot comparison
fig, axes = plt.subplots(1, 2, figsize=(15, 4))

axes[0].plot(rewards_sparse)
axes[0].set_title('Sparse Rewards')
axes[0].set_xlabel('Step')
axes[0].set_ylabel('Reward')
axes[0].grid(True, alpha=0.3)

axes[1].plot(rewards_dense)
axes[1].set_title('Dense Rewards')
axes[1].set_xlabel('Step')
axes[1].set_ylabel('Reward')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Sparse - Total reward: {sum(rewards_sparse):.2f}, Non-zero rewards: {np.count_nonzero(rewards_sparse)}")
print(f"Dense - Total reward: {sum(rewards_dense):.2f}, Non-zero rewards: {np.count_nonzero(rewards_dense)}")

## 7. Different Opponent Policies

Let's test different opponent policies: self-play, random, and greedy.

In [None]:
def evaluate_opponent(opponent_type, num_episodes=10, seed=42):
    """Evaluate agent against a specific opponent."""
    env = gym.make("Othello-v0", opponent=opponent_type)
    
    wins = 0
    losses = 0
    draws = 0
    
    for i in range(num_episodes):
        observation, info = env.reset(seed=seed+i)
        done = False
        
        while not done:
            action_mask = info['action_mask']
            valid_actions = np.where(action_mask)[0]
            if len(valid_actions) == 0:
                break
            action = np.random.choice(valid_actions)
            observation, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
        
        # Count results
        if info['black_count'] > info['white_count']:
            wins += 1
        elif info['white_count'] > info['black_count']:
            losses += 1
        else:
            draws += 1
    
    env.close()
    return wins, losses, draws

# Evaluate against different opponents
print("Evaluating random agent against different opponents...\n")

for opponent in ["self", "random", "greedy"]:
    wins, losses, draws = evaluate_opponent(opponent, num_episodes=10)
    total = wins + losses + draws
    print(f"Against {opponent:8s}: Wins={wins}/{total}, Losses={losses}/{total}, Draws={draws}/{total}")
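
Ten episodes give only a coarse estimate of the win rate, so it helps to attach an uncertainty range before comparing opponents. The sketch below computes a Wilson score interval, a standard choice for proportions from small samples. The function name and the example counts are illustrative and are not results from this notebook.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width


# Hypothetical outcome: 5 wins in 10 episodes
low, high = wilson_interval(5, 10)
print(f"Win rate 5/10, approx. 95% interval: [{low:.2f}, {high:.2f}]")
```

With intervals this wide, differences between opponents are hard to distinguish; raising `num_episodes` narrows them.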

## 8. State Persistence

Demonstrate saving and loading game states.

In [None]:
# Create environment and play some moves
env = gym.make("Othello-v0", render_mode="ansi")
observation, info = env.reset(seed=42)

print("Playing 5 moves...\n")
for i in range(5):
    action_mask = info['action_mask']
    valid_actions = np.where(action_mask)[0]
    if len(valid_actions) == 0:
        break
    action = np.random.choice(valid_actions)
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break

print("Board after 5 moves:")
print(env.render())

# Save state. gym.make() wraps the environment, and Gymnasium 1.0+ wrappers
# no longer forward custom methods, so call save_state() on the unwrapped env.
saved_state = env.unwrapped.save_state()
print(f"\nState saved! Move history length: {len(saved_state['move_history'])}")
print(f"Black: {saved_state['black_count']}, White: {saved_state['white_count']}")

In [None]:
# Continue playing
print("\nPlaying 5 more moves...\n")
for i in range(5):
    action_mask = info['action_mask']
    valid_actions = np.where(action_mask)[0]
    if len(valid_actions) == 0:
        break
    action = np.random.choice(valid_actions)
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break

print("Board after the additional moves:")
print(env.render())
print(f"Black: {info['black_count']}, White: {info['white_count']}")

In [None]:
# Load saved state on the unwrapped env
# (Gymnasium 1.0+ wrappers do not forward custom methods)
print("\nLoading saved state...\n")
env.unwrapped.load_state(saved_state)

print("Board after loading saved state:")
print(env.render())

# Verify state was restored
current_state = env.unwrapped.save_state()
print("\nState after restoring:")
print(f"Move history length: {len(current_state['move_history'])}")
print(f"Black: {current_state['black_count']}, White: {current_state['white_count']}")
print(f"Boards match: {np.array_equal(current_state['board'], saved_state['board'])}")
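
A saved state is only useful if it is not changed by later moves and can outlive the Python session. The sketch below shows a defensive pattern under the assumption that the saved state is a plain, picklable dictionary with the keys used above (`board`, `move_history`, `black_count`, `white_count`). The dictionary here is a hand-built stand-in, not the real output of `save_state()`.

```python
import copy
import pickle

import numpy as np

# Hand-built stand-in for a saved state (opening position, no moves played)
board = np.zeros((8, 8), dtype=np.int8)
board[3, 3] = board[4, 4] = 2   # hypothetical encoding: 2 = White
board[3, 4] = board[4, 3] = 1   # hypothetical encoding: 1 = Black
state = {
    "board": board,
    "move_history": [],
    "black_count": 2,
    "white_count": 2,
}

# A deep copy protects the snapshot from later in-place changes to the board
snapshot = copy.deepcopy(state)

# Serialize to bytes; the same bytes could be written to a file
blob = pickle.dumps(snapshot)
restored = pickle.loads(blob)

# Mutating the original afterwards does not affect the snapshot or the restore
state["board"][0, 0] = 1
state["move_history"].append(0)

print(np.array_equal(restored["board"], snapshot["board"]), restored["move_history"])
```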

## 9. Invalid Move Handling

Demonstrate different invalid move handling modes.
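
The three modes can be pictured as one small decision function. The sketch below is a hypothetical illustration of the semantics, not the environment's implementation. In particular, what happens to the turn in penalty mode is an assumption here (no move is played and the penalty is returned).

```python
import numpy as np


def resolve_action(action, action_mask, mode, rng, penalty=-5.0):
    """Return (action_to_play, extra_reward) for a requested action.

    Hypothetical helper mirroring the "penalty", "random", and "error" modes.
    """
    mask = np.asarray(action_mask, dtype=bool)
    if mask[action]:
        return action, 0.0
    if mode == "penalty":
        # Assumption: no move is played and the agent receives the penalty
        return None, penalty
    if mode == "random":
        # Replace the invalid action with a uniformly sampled valid one
        return int(rng.choice(np.flatnonzero(mask))), 0.0
    if mode == "error":
        raise ValueError(f"action {action} is not a valid move")
    raise ValueError(f"unknown invalid-move mode: {mode!r}")


rng = np.random.default_rng(0)
mask = np.zeros(64, dtype=bool)
mask[[19, 26, 37, 44]] = True   # hypothetical set of valid actions

print(resolve_action(19, mask, "penalty", rng))
print(resolve_action(0, mask, "penalty", rng))
replacement, _ = resolve_action(0, mask, "random", rng)
print(replacement)
```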

In [None]:
# Penalty mode (default)
env_penalty = gym.make("Othello-v0", invalid_move_mode="penalty", invalid_move_penalty=-5.0)
observation, info = env_penalty.reset(seed=42)

# Try an invalid move (position 0 is not valid at start)
invalid_action = 0
print(f"Attempting invalid action {invalid_action}...")
print(f"Is action valid? {info['action_mask'][invalid_action]}")

observation, reward, terminated, truncated, info = env_penalty.step(invalid_action)
print(f"Reward received: {reward}")
print(f"Game terminated: {terminated}")
print("Game continues after invalid move in penalty mode.\n")

In [None]:
# Random mode - automatically selects valid move
env_random = gym.make("Othello-v0", invalid_move_mode="random")
observation, info = env_random.reset(seed=42)

print(f"Attempting invalid action {invalid_action} in random mode...")
observation, reward, terminated, truncated, info = env_random.step(invalid_action)
print(f"Reward received: {reward}")
print("A valid move was automatically selected instead.\n")

In [None]:
# Error mode - raises exception
env_error = gym.make("Othello-v0", invalid_move_mode="error")
observation, info = env_error.reset(seed=42)

print(f"Attempting invalid action {invalid_action} in error mode...")
try:
    observation, reward, terminated, truncated, info = env_error.step(invalid_action)
except ValueError as e:
    print(f"ValueError raised: {e}")
    print("Error mode raises exception for invalid moves.")
else:
    print("No exception was raised; check the invalid_move_mode setting.")

## Summary

In this notebook, we covered:
1. Creating and configuring the Othello environment
2. Understanding observation and action spaces
3. Running episodes with random agents
4. Using action masking for valid moves
5. Different rendering modes (ANSI, RGB)
6. Comparing reward modes (sparse vs dense)
7. Testing different opponent policies
8. Saving and loading game states
9. Handling invalid moves

Next steps:
- See `02_training_with_rllib.ipynb` for training RL agents
- See `03_evaluating_trained_agents.ipynb` for evaluation and analysis