A Jupyter notebook (called project.ipynb) that can be run directly and that
demonstrates your project. Your notebook can import a sample of the data that
you used, import 1 or more models that you built, and generate examples of the
types of predictions or simulations your model can make. The notebook should
not take any longer than 1 minute to run in total (if you have models that
require a lot of training time, train them offline and just upload the models and
some sample data to illustrate them). Feel free to generate examples of your
model(s) in action, e.g., for reviews you could generate examples of reviews
where the models work well and reviews where the models work poorly.

## 1. Environment Setup

First, import the required libraries and the custom `TetrisEnv` class. The environment itself is created in the next section.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from src.env.tetris_env import TetrisEnv

print("✓ Imports successful!")

## 2. Create Environment

The custom `TetrisEnv` wrapper includes:
- **Reward Engineering**: Custom rewards for line clears, penalties for holes/bumpiness
- **State Analysis**: Tracks holes, column heights, and bumpiness
- **Game Statistics**: Monitors lines cleared, episode steps, and total score
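
To make the tracked quantities concrete, here is a minimal NumPy sketch of how holes, column heights, and bumpiness are commonly computed from a board array. It assumes row 0 is the top of the board and any nonzero cell is filled. The function names are hypothetical and are not taken from `TetrisEnv`, whose own definitions may differ.

```python
import numpy as np

def column_heights(board):
    """Height of each column, measured from the floor to its topmost filled cell."""
    filled = np.asarray(board) != 0
    n_rows = filled.shape[0]
    # argmax returns the first filled row from the top; empty columns get height 0.
    first_filled = np.argmax(filled, axis=0)
    return np.where(filled.any(axis=0), n_rows - first_filled, 0)

def count_holes(board):
    """Number of empty cells with at least one filled cell above them."""
    filled = np.asarray(board) != 0
    # A cell is covered once any cell at or above it in the same column is filled.
    covered = np.logical_or.accumulate(filled, axis=0)
    return int(np.sum(covered & ~filled))

def bumpiness(heights):
    """Sum of absolute height differences between adjacent columns."""
    return int(np.sum(np.abs(np.diff(heights))))

# Small hand-made board (assumption: row 0 is the top, 1 means filled).
board = np.array([
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
])
heights = column_heights(board)
print("heights:", heights)
print("holes:", count_holes(board))
print("bumpiness:", bumpiness(heights))
```

On this board the column heights are 3, 2, 0, and 1, there is one hole (the empty cell in the first column under a filled cell), and the bumpiness is 1 + 2 + 1 = 4.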

In [None]:
# Create the environment
env = TetrisEnv(render_mode=None)

# Display environment information
print(f"Observation Space: {env.observation_space}")
print(f"Action Space: {env.action_space}")
print(f"\nNumber of possible actions: {env.action_space.n}")
print("\n✓ Environment created successfully!")

## 3. Test Environment with Random Agent

Let's run a short episode with random actions to demonstrate the environment functionality and reward engineering.

In [None]:
# Reset environment
obs, info = env.reset()

# Run episode
episode_data = {
    'rewards': [],
    'holes': [],
    'heights': [],
    'bumpiness': []
}

print("Running random agent for 50 steps or until game over...")
for step in range(50):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    
    episode_data['rewards'].append(reward)
    episode_data['holes'].append(info['holes'])
    episode_data['heights'].append(info['max_height'])
    episode_data['bumpiness'].append(info['bumpiness'])
    
    if terminated or truncated:
        print(f"Episode ended at step {step + 1}")
        break

print("\n=== Episode Summary ===")
print(f"Total steps: {len(episode_data['rewards'])}")
print(f"Total reward: {sum(episode_data['rewards']):.2f}")
print(f"Lines cleared: {env.total_lines_cleared}")
print(f"Final holes: {episode_data['holes'][-1]}")
print(f"Final max height: {episode_data['heights'][-1]}")
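
A single 50-step random episode is noisy, so a baseline is usually averaged over several episodes. The sketch below wraps the same reset/step loop in a reusable function. To stay self-contained it uses `CoinFlipEnv`, a hypothetical stand-in that only mimics the five-value `step` return used above. It is not part of this project, and `TetrisEnv` could be passed in its place, assuming it follows the same convention.

```python
import random

class CoinFlipEnv:
    """Hypothetical stand-in environment with a Gymnasium-style reset/step API."""

    def __init__(self, max_steps=20, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0, {}

    def step(self, action):
        self.steps += 1
        reward = 1.0 if action == 1 else -1.0
        terminated = self.rng.random() < 0.1       # game-over style ending
        truncated = self.steps >= self.max_steps   # time-limit ending
        return 0, reward, terminated, truncated, {}

def run_episode(env, policy, max_steps=50):
    """Roll out one episode and return (total_reward, steps_taken)."""
    obs, info = env.reset()
    total, steps = 0.0, 0
    for _ in range(max_steps):
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
        steps += 1
        if terminated or truncated:
            break
    return total, steps

demo_env = CoinFlipEnv()
policy_rng = random.Random(1)

def random_policy(obs):
    # Ignores the observation and picks one of two actions uniformly.
    return policy_rng.randint(0, 1)

results = [run_episode(demo_env, random_policy) for _ in range(5)]
mean_reward = sum(total for total, _ in results) / len(results)
print(f"Mean reward over {len(results)} episodes: {mean_reward:.2f}")
```

Keeping `terminated` and `truncated` separate matters later for training, since a time-limit cutoff is not a true game over.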

## 4. Visualize Episode Metrics

Visualize how the game state evolved during the episode.

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Reward over time
axes[0, 0].plot(episode_data['rewards'], color='blue')
axes[0, 0].set_title('Reward per Step')
axes[0, 0].set_xlabel('Step')
axes[0, 0].set_ylabel('Reward')
axes[0, 0].grid(True, alpha=0.3)

# Cumulative reward
cumulative_reward = np.cumsum(episode_data['rewards'])
axes[0, 1].plot(cumulative_reward, color='green')
axes[0, 1].set_title('Cumulative Reward')
axes[0, 1].set_xlabel('Step')
axes[0, 1].set_ylabel('Cumulative Reward')
axes[0, 1].grid(True, alpha=0.3)

# Holes over time
axes[1, 0].plot(episode_data['holes'], color='red')
axes[1, 0].set_title('Holes in Board')
axes[1, 0].set_xlabel('Step')
axes[1, 0].set_ylabel('Number of Holes')
axes[1, 0].grid(True, alpha=0.3)

# Max height over time
axes[1, 1].plot(episode_data['heights'], color='purple')
axes[1, 1].set_title('Maximum Stack Height')
axes[1, 1].set_xlabel('Step')
axes[1, 1].set_ylabel('Height')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 5. Visualize Final Board State

Display the final state of the Tetris board.

In [None]:
board = obs['board']

plt.figure(figsize=(6, 8))
plt.imshow(board, cmap='gray_r', interpolation='nearest')
plt.title('Final Board State')
plt.xlabel('Column')
plt.ylabel('Row')
plt.colorbar(label='Cell Value')
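# Put minor ticks on cell boundaries so grid lines outline cells instead of crossing them
plt.gca().set_xticks(np.arange(-0.5, board.shape[1], 1), minor=True)
plt.gca().set_yticks(np.arange(-0.5, board.shape[0], 1), minor=True)
plt.grid(True, which='minor', alpha=0.3, linestyle='-', linewidth=0.5)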
plt.show()

print(f"Board dimensions: {board.shape}")
print(f"Filled cells: {np.count_nonzero(board)} / {board.size}")

## Next Steps

Now that the environment is set up and tested, we can proceed to:

1. **Phase 2**: Build the CNN architecture (policy and value networks)
2. **Phase 3**: Implement the PPO training algorithm
3. **Phase 4**: Train the agent and evaluate performance
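
As a preview of Phase 3, the core of PPO is its clipped surrogate objective. The NumPy sketch below shows only that objective, under the standard formulation with a clipping range of 0.2. The function name and inputs are illustrative assumptions, and the eventual implementation in this project may be organized differently (for example with value and entropy terms added).

```python
import numpy as np

def ppo_clip_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized)."""
    # Probability ratio between the updated policy and the data-collecting policy.
    ratio = np.exp(np.asarray(new_log_probs) - np.asarray(old_log_probs))
    advantages = np.asarray(advantages)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the elementwise minimum removes the incentive to move the ratio
    # far outside the clipping range.
    return float(np.mean(np.minimum(unclipped, clipped)))

# Two illustrative samples: ratios of 1.5 and 0.5 with advantages +1 and -1.
old_lp = np.log([0.5, 0.5])
new_lp = np.log([0.75, 0.25])
adv = np.array([1.0, -1.0])
objective = ppo_clip_objective(new_lp, old_lp, adv)
print(f"Clipped objective: {objective:.3f}")
```

Here both ratios fall outside the range [0.8, 1.2], so the per-sample terms become 1.2 and -0.8 and the mean is 0.2.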

The environment is ready for RL training with custom reward engineering that encourages:
- Clearing lines (positive reward)
- Avoiding holes (penalty)
- Keeping the stack low (penalty for height)
- Maintaining even column heights (penalty for bumpiness)
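
The four terms above can be combined into a single shaped reward. The sketch below is one simple linear form. The weights and the function name are hypothetical values chosen for illustration, not the coefficients used inside `TetrisEnv`.

```python
def shaped_reward(lines_cleared, holes, max_height, bumpiness,
                  w_lines=10.0, w_holes=0.5, w_height=0.1, w_bump=0.05):
    """Linear shaped reward: bonus for cleared lines, penalties for board defects.

    All weights are illustrative assumptions, not the project's tuned values.
    """
    return (w_lines * lines_cleared
            - w_holes * holes
            - w_height * max_height
            - w_bump * bumpiness)

# A step that clears one line on a tidy board versus a step on a messy board.
clean = shaped_reward(lines_cleared=1, holes=0, max_height=4, bumpiness=2)
messy = shaped_reward(lines_cleared=0, holes=3, max_height=10, bumpiness=8)
print(f"clean step: {clean:.2f}, messy step: {messy:.2f}")
```

A common variant penalizes the change in holes, height, and bumpiness between consecutive steps instead of their absolute values, so the agent is not punished repeatedly for the same defect.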