<h1 style="text-align:center">Mario RL tutorial</h1>

---

<p style="text-align:center">
    <time datetime="2024-09-22">September 22, 2024</time> | <span>Topics: Mario, Stable-Baselines3, and simulated environments</span>
    <br>
</p>

## Project Overview

In this tutorial, we'll teach an AI to play **Super Mario Bros** using reinforcement learning. By interacting with the game, the AI will learn which actions lead to rewards and which do not, gradually improving its performance. We'll use **Stable-Baselines3**, **OpenAI Gym**, and **PyTorch** to build and train our model in a simulated environment.

Reinforcement learning is a technique for training AI agents that learn by trial and error in an environment. This project will help you understand its key concepts and how to apply them to a concrete problem: playing a video game autonomously.

### Objectives

- Set up the Mario environment using OpenAI Gym.
- Train an AI model to play Mario using reinforcement learning.
- Evaluate the performance of the model as it interacts with the game.

### <span style="color:blue">Don't feel like reading? Still not clear? You can listen to an overview of what we'll be working on.</span>

In [None]:
# UNCOMMENT THIS CELL TO LISTEN TO THE AUDIO FILE

# from IPython.display import Audio
# audio_file = 'public/mario.wav'
# Audio(audio_file)

---

## Requirements

- Python == 3.8
- [PyTorch (click the link and scroll down to install)](https://pytorch.org/)
- gym == 0.21.0
- nes-py == 8.2.1
- stable-baselines3[extra] == 1.6.0
- Jupyter Notebook

## Dependencies

Before we start, we need to install a few packages. Here's a brief overview of each:
- **`nes-py`**: Provides an NES emulator environment used to run the Super Mario Bros. game in Python.
- **`gym-super-mario-bros`**: Integrates the Super Mario Bros. game into OpenAI's Gym framework for use in reinforcement learning tasks.
- **`gym`**: Provides environments for developing and comparing reinforcement learning algorithms.
- **`stable-baselines3[extra]`**: A popular library that implements advanced reinforcement learning algorithms, including extra features like visualization and monitoring.

In [None]:
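import subprocess
import sys

packages = [
    "setuptools==65.5.0", "wheel<0.40.0",
    "gym==0.21.0",
    "stable-baselines3[extra]==1.6.0",
    "nes-py",
    "gym-super-mario-bros==7.3.0"
]

# Use the notebook kernel's own interpreter so packages land in the right environment
for package in packages:
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])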

<span style="color:#ef4444">
&#x2B55; If you get the error 'no matches found'
    for the stable-baselines3 package (zsh tries to expand the square brackets), you may have to quote the name: pip install 'stable-baselines3[extra]'
</span>
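
Once the install cell has finished, it can be worth confirming which versions actually landed in the kernel's environment. The sketch below is one way to do that with the standard library; the helper name `installed_versions` is made up for this example, and `importlib.metadata` assumes Python 3.8 or newer.

```python
from importlib import metadata

# Distribution names as they are passed to pip in the install cell
PACKAGES = ["gym", "stable-baselines3", "nes-py", "gym-super-mario-bros"]

def installed_versions(names):
    """Return {name: version string, or None if the package is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If any line reports `NOT INSTALLED`, re-run the install cell before continuing.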

---

## Setting up Mario Environment

### Objective
- Set up the environment where our AI can interact with and learn to play Super Mario Bros.

### Step-by-Step Explanation
To teach our AI how to play Super Mario Bros, we first need to set up the environment in which it will learn and play. We'll use specific libraries to simulate the game and restrict the AI's actions for more efficient learning.

- **Why use `gym_super_mario_bros`?**  
  We use `gym_super_mario_bros` to create a simulation of the game. This provides an environment where the AI can interact with the game, observe the outcomes, and learn from its actions.

- **Why use `JoypadSpace`?**  
  Super Mario Bros. has many possible actions (like jumping, running, etc.), but we simplify this by using `JoypadSpace`. This reduces the action space to a smaller set of key actions (defined in `SIMPLE_MOVEMENT`), making it easier for our AI to learn how to play.

- **How does the environment work?**  
  - **`gym_super_mario_bros.make('SuperMarioBros-v0')`**: This function sets up the game environment.
  - **`JoypadSpace`**: Wraps the environment, limiting the AI's actions to a simpler, predefined set of movements, which helps the AI focus on learning key behaviors.
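
To make the idea concrete, here is a rough, emulator-free sketch of what an action wrapper like `JoypadSpace` does conceptually: each discrete action index stands for a combination of controller buttons. The action list is what we expect `SIMPLE_MOVEMENT` to contain (the cells further down print the real one); the bit values in `BUTTON_BITS` and the `to_bitmask` helper are illustrative assumptions, not the library's internals.

```python
# Illustrative bit for each NES controller button (assumed values for this sketch)
BUTTON_BITS = {
    "NOOP": 0b00000000,
    "A": 0b00000001,      # jump
    "B": 0b00000010,      # run
    "left": 0b01000000,
    "right": 0b10000000,
}

# Expected contents of SIMPLE_MOVEMENT: 7 button combinations
SIMPLE_MOVEMENT_SKETCH = [
    ["NOOP"],
    ["right"],
    ["right", "A"],
    ["right", "B"],
    ["right", "A", "B"],
    ["A"],
    ["left"],
]

def to_bitmask(buttons):
    """Combine a list of button names into a single controller byte."""
    mask = 0
    for button in buttons:
        mask |= BUTTON_BITS[button]
    return mask

# The agent only picks an index 0..6; the wrapper translates it into button presses
for index, buttons in enumerate(SIMPLE_MOVEMENT_SKETCH):
    print(index, buttons, format(to_bitmask(buttons), "08b"))
```

An 8-button controller allows 2^8 = 256 raw combinations, so restricting the agent to 7 of them shrinks the space it has to explore considerably.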

### Outcome
After setting up the environment, our AI is now ready to interact with the game. It can start taking actions within the simplified movement space, and you'll be able to see the game being played in a simulated environment.

In [None]:
# Import dependencies
from nes_py.wrappers import JoypadSpace
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
import gym_super_mario_bros

In [None]:
# What are the possible options for our AI agent?
SIMPLE_MOVEMENT

In [None]:
# create the game environment
env = gym_super_mario_bros.make('SuperMarioBros-v0')
# wrap the environment in JoypadSpace, reducing the possible actions by our AI agent
env = JoypadSpace(env, SIMPLE_MOVEMENT)

#### Let's test our game!

**Code Breakdown**:

- **`for step in range(1000)`**: Runs the game for 1000 steps.
- **`env.reset()`**: Resets the game when it ends.
- **`env.step(env.action_space.sample())`**: Takes a random action from the AI's action space.
- **`state, reward, done, info`**: After each action, the game returns:
  - **`state`**: The current game state.
  - **`reward`**: The reward for the action taken.
  - **`done`**: Whether the game has ended.
  - **`info`**: Extra diagnostic information.
- **`env.render()`**: Renders the game so you can see it in action.
- **`env.close()`**: Closes the game when the loop ends.

In [None]:
# Let's test if our game is working!
done = True
for step in range(1000):
    if done:
        env.reset()
    state, reward, done, info = env.step(env.action_space.sample())
    env.render()
env.close()

In [None]:
# Run this cell to close any leftover windows from the last cell
env.close()

<span style="color:#ef4444">
&#x2B55; If you see an overflow warning but your game is running ok, you can ignore it.
</span>
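
A random agent is also a handy baseline to compare the trained model against later. The sketch below totals the reward per episode using the same `reset`/`step` loop; `random_baseline` is a hypothetical helper, and `CoinFlipEnv` is a tiny stand-in environment invented here so the snippet runs without the emulator. It assumes the gym 0.21 interface, where `step` returns `(state, reward, done, info)`.

```python
import random

class CoinFlipEnv:
    """Made-up stand-in for a gym 0.21-style env: 2 actions, fixed-length episodes."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def sample_action(self):
        return self.rng.randrange(2)

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.episode_length
        return self.t, reward, done, {}

def random_baseline(env, sample_action, n_steps):
    """Play random actions for n_steps; return the total reward of each finished episode."""
    episode_rewards, total = [], 0.0
    done = True
    for _ in range(n_steps):
        if done:
            env.reset()
            total = 0.0
        _, reward, done, _ = env.step(sample_action())
        total += reward
        if done:
            episode_rewards.append(total)
    return episode_rewards

toy_env = CoinFlipEnv()
rewards = random_baseline(toy_env, toy_env.sample_action, n_steps=100)
print(f"{len(rewards)} episodes, mean reward {sum(rewards) / len(rewards):.2f}")
```

With a freshly created Mario environment (before it is wrapped in a vectorized env), the equivalent call would be `random_baseline(env, env.action_space.sample, n_steps=1000)`.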

---

## Frame Stacking and Grayscale Processing

### Objective
Prepare the game environment by simplifying the visual inputs (grayscale) and stacking frames to give the AI temporal information. This will help the AI learn better by providing sequences of images instead of a single frame.

### Step-by-Step Explanation

- **`GrayScaleObservation`**: Converts the game images to grayscale, which simplifies the visual input while retaining important features for the AI.
- **`VecFrameStack`**: Stacks multiple consecutive frames together, allowing the AI to capture motion and temporal information. This helps in understanding movement, such as jumps and enemy approaches.
- **`DummyVecEnv`**: Wraps the environment to make it compatible with vectorized operations, required for `stable-baselines3`.
- **`plt.imshow`**: Visualizes the game state to ensure everything is working properly.
- **`plt.subplot`**: Displays the 4 stacked frames, allowing you to visualize the temporal information given to the AI.

### Outcome
The environment now processes images in grayscale and stacks four frames together, which helps the AI understand motion and time-based actions. You'll also be able to visualize the stacked frames to verify that they contain temporal information.
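
As a rough illustration of the two preprocessing ideas, the sketch below converts an RGB pixel to a single brightness value and keeps a rolling window of the last four frames. It is a plain-Python toy, not the code that `GrayScaleObservation` or `VecFrameStack` actually run: the luminance weights are the common 0.299/0.587/0.114 convention, frames are shrunk to short labels, and `to_gray` and `FrameStack` are names invented for this example.

```python
from collections import deque

def to_gray(rgb_pixel):
    """Weighted sum of R, G, B (common luminance weights), rounded to an int."""
    r, g, b = rgb_pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)

class FrameStack:
    """Keeps the most recent n frames; older frames fall off the front."""

    def __init__(self, n_stack, empty_frame="blank"):
        self.n_stack = n_stack
        self.empty_frame = empty_frame
        self.frames = deque(maxlen=n_stack)

    def reset(self, first_frame):
        # Start with blank frames, then put the first real frame in the newest slot
        self.frames.clear()
        self.frames.extend([self.empty_frame] * (self.n_stack - 1))
        self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Appending to a full deque drops the oldest frame automatically
        self.frames.append(new_frame)
        return list(self.frames)

print(to_gray((255, 0, 0)))  # pure red collapses to one brightness value

stack = FrameStack(n_stack=4)
print(stack.reset("f0"))
for frame in ["f1", "f2", "f3", "f4"]:
    print(stack.step(frame))
```

After four steps the blank placeholders are gone and the stack holds four consecutive real frames, which is why the notebook later asks you to step the environment several times before plotting.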

In [None]:
# Import dependencies
from gym.wrappers import GrayScaleObservation
from stable_baselines3.common.vec_env import VecFrameStack, DummyVecEnv
from matplotlib import pyplot as plt

In [None]:
env = gym_super_mario_bros.make('SuperMarioBros-v0')
env = JoypadSpace(env, SIMPLE_MOVEMENT)
# Grayscale the image
env = GrayScaleObservation(env, keep_dim=True)
# Wrap inside a DummyVecEnv
env = DummyVecEnv([lambda: env])
# Stack the frames
env = VecFrameStack(env, 4, channels_order='last')

In [None]:
# Start game in the background
state = env.reset()

#### Let's analyze the environment state

**Code Breakdown**:

- **`state.shape`**: Returns the shape of the `state` variable, which is:
  - **`(1, 240, 256, 4)`**: 
    - **`1`**: The batch dimension, which allows for multiple environments in parallel (in this case, only one environment).
    - **`240, 256`**: The height and width of the game screen in pixels.
    - **`4`**: The number of stacked frames, which helps the AI capture motion and temporal information.
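
The numbers in that shape also tell us how much data the network receives per observation. A quick back-of-the-envelope check, counting one value per pixel per frame:

```python
batch, height, width, n_frames = 1, 240, 256, 4

values_per_frame = height * width                    # one grayscale value per pixel
values_stacked = batch * values_per_frame * n_frames
values_single_rgb = height * width * 3               # one original color frame, for comparison

print(values_per_frame)    # 61440
print(values_stacked)      # 245760
print(values_single_rgb)   # 184320
```

So four stacked grayscale frames cost only about a third more than a single color frame, while also carrying motion information.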

In [None]:
state.shape

In [None]:
# Let's make Mario jump (action 5 is ['A'] in SIMPLE_MOVEMENT) and add the new frame to our stacked state (run this cell 4 times)
state, reward, done, info = env.step([5])

In [None]:
# Let's visualize one frame (the last channel of the stack) in grayscale
plt.imshow(state[0][:, :, -1], cmap='gray')

In [None]:
# What does the AI agent see?
plt.figure(figsize=(20, 16))
for idx in range(state.shape[3]):
	plt.subplot(1, 4, idx+1)
	plt.imshow(state[0][:, :, idx])
plt.show()

## Model Training

### Objective
Create a reinforcement learning model using Proximal Policy Optimization (PPO) to teach our AI how to play Super Mario Bros, and implement a callback system for saving the model during training.

### Step-by-Step Explanation

- **`PPO` (Proximal Policy Optimization)**: This is the reinforcement learning algorithm we are using. It optimizes the policy that controls the AI’s actions in the game by gradually improving its performance through interaction with the environment.
- **`TrainAndLoggingCallback`**: This custom callback saves the model at regular intervals (defined by `check_freq`). It helps ensure that we can save the model during training and restore it later if needed.
- **`model.learn()`**: This function trains the model over a specified number of timesteps (in this case, 100,000). The model interacts with the game, learning from rewards and adjusting its strategy.
  
**Key components**:
- **`PPO('CnnPolicy', env)`**: This creates the PPO model, using a convolutional neural network (CNN) policy to process image data from the game.
- **`callback=callback`**: Passes our `TrainAndLoggingCallback` instance to `model.learn()`, so the model is saved every 10,000 steps during training.
- **`n_steps`**: The number of environment steps collected in each rollout before the policy is updated; it affects both training speed and stability.

### Outcome
The PPO model will now train the AI by interacting with the game and learning from its experiences. Additionally, the model is saved every 10,000 steps, allowing us to resume training or deploy the saved model later.
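
The "gradual" part of PPO comes from its clipped objective, which limits how far one update can move the policy away from the one that collected the data. The function below is a minimal single-sample sketch of that clipping rule, not Stable-Baselines3's implementation; `clipped_surrogate` is a name chosen for this example, and `clip_range=0.2` is used because it is, to our knowledge, the library's default.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO clipped objective for one sample.

    ratio     = new_policy_prob(action) / old_policy_prob(action)
    advantage = how much better the action was than expected
    """
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Take the more pessimistic of the unclipped and clipped estimates
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action that the policy already favors much more: the gain is capped
print(clipped_surrogate(ratio=1.5, advantage=1.0))
# Bad action that the policy already avoids much more: no extra credit past the clip
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
# Bad action that became MORE likely: the penalty is not clipped
print(clipped_surrogate(ratio=1.5, advantage=-1.0))
```

The asymmetry in the last case is deliberate: the objective never rewards overshooting, but it still fully penalizes moves in the wrong direction.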

In [None]:
# Import os for file path management
import os
# Import PPO for using Proximal Policy Optimization
from stable_baselines3 import PPO
# Import Base Callback for saving models
from stable_baselines3.common.callbacks import BaseCallback

In [None]:
# Create the callback for saving models
class TrainAndLoggingCallback(BaseCallback):
	def __init__(self, check_freq, save_path, verbose=1):
		super(TrainAndLoggingCallback, self).__init__(verbose)
		self.check_freq = check_freq
		self.save_path = save_path

	def _init_callback(self):
		if self.save_path is not None:
			os.makedirs(self.save_path, exist_ok=True)
		
	def _on_step(self):
		if self.n_calls % self.check_freq == 0:
			model_path = os.path.join(self.save_path, f'best_model_{self.n_calls}')
			self.model.save(model_path)

		return True

In [None]:
CHECKPOINT_DIR = './train/'
LOG_DIR = './logs/'

In [None]:
# Create the callback
callback = TrainAndLoggingCallback(check_freq=10000, save_path=CHECKPOINT_DIR)

#### Creating the PPO Model

**Code Breakdown**:

- **`PPO('CnnPolicy', env)`**: Initializes a PPO model with a CNN policy, designed to process image data from `env`.
- **`verbose=1`**: Enables progress output during training.
- **`tensorboard_log=LOG_DIR`**: Sets up logging for TensorBoard, allowing you to track training metrics.
- **`learning_rate=0.000001`**: Defines the learning rate, which controls how large each update to the network's weights is. A small value like this makes learning slow but stable.
- **`n_steps=512`**: Collects 512 environment steps per rollout before each policy update.
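
To see how `n_steps` interacts with the rest of the training run, here is some quick arithmetic for the settings used in this tutorial. The `batch_size=64` and `n_epochs=10` values are what we understand to be PPO's defaults in Stable-Baselines3 (they are not set explicitly in this notebook), so treat the derived counts as estimates.

```python
import math

total_timesteps = 100_000   # passed to model.learn()
n_steps = 512               # rollout length per environment
n_envs = 1                  # a single environment inside DummyVecEnv
check_freq = 10_000         # checkpoint interval of the callback
batch_size = 64             # assumed SB3 default
n_epochs = 10               # assumed SB3 default

rollout_size = n_steps * n_envs
n_rollouts = math.ceil(total_timesteps / rollout_size)   # training ends on a full rollout
steps_actually_run = n_rollouts * rollout_size
minibatches_per_rollout = rollout_size // batch_size
gradient_steps = n_rollouts * n_epochs * minibatches_per_rollout
n_checkpoints = steps_actually_run // check_freq

print(n_rollouts, steps_actually_run, gradient_steps, n_checkpoints)
```

Under these assumptions, training runs slightly past 100,000 steps because it only stops at a rollout boundary, and the callback writes ten checkpoints along the way.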

In [None]:
# Create the model
model = PPO('CnnPolicy', env, verbose=1, tensorboard_log=LOG_DIR, learning_rate=0.000001, n_steps=512)

In [None]:
# Train the model
model.learn(total_timesteps=100000, callback=callback)

## Test It Out

In [None]:
# Load a saved checkpoint; replace [MODEL NAME] with the name of a file in ./train/
model = PPO.load('train/[MODEL NAME]')
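
If you are unsure which file to load, a small helper can list the checkpoints the callback wrote and pick the one with the highest step count. This is a sketch under two assumptions: the files follow the `best_model_<step>` naming from `TrainAndLoggingCallback`, and `model.save()` appended a `.zip` extension. `latest_checkpoint` is a hypothetical helper, not part of Stable-Baselines3.

```python
import os
import re

def latest_checkpoint(directory):
    """Return the path of the best_model_<step>.zip file with the largest step, or None."""
    pattern = re.compile(r"^best_model_(\d+)\.zip$")
    best_step, best_path = -1, None
    if not os.path.isdir(directory):
        return None
    for filename in os.listdir(directory):
        match = pattern.match(filename)
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = os.path.join(directory, filename)
    return best_path

# Prints None if ./train/ does not exist or holds no matching checkpoint yet
print(latest_checkpoint("./train/"))
```

The returned path can then be passed to `PPO.load(...)`. Keep in mind that the highest step count is simply the most recent checkpoint, not necessarily the best-performing one.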

In [None]:
state = env.reset()

In [None]:
# Start the game
state = env.reset()
# Loop through the game (interrupt the kernel to stop)
while True:
    # Ask the trained model for an action, then apply it to the game
    action, _ = model.predict(state)
    state, reward, done, info = env.step(action)
    env.render()
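
The loop above runs until you interrupt it. For a bounded run that also reports how far Mario got, something like the following sketch could be used. It assumes the vectorized interface used in this notebook (rewards, dones, and infos come back as one entry per environment) and that the Mario `info` dict exposes an `x_pos` key; `evaluate`, `StubModel`, and `StubVecEnv` are invented here so the snippet runs without the emulator.

```python
def evaluate(model, env, max_steps=1000):
    """Run one bounded rollout; return (total_reward, furthest_x_pos)."""
    state = env.reset()
    total_reward, furthest_x = 0.0, 0
    for _ in range(max_steps):
        action, _ = model.predict(state)
        state, reward, done, info = env.step(action)
        total_reward += float(reward[0])
        furthest_x = max(furthest_x, info[0].get("x_pos", 0))
        if done[0]:
            break
    return total_reward, furthest_x

class StubModel:
    """Stand-in for a trained model: always chooses action 1 ('right')."""

    def predict(self, state):
        return [1], None

class StubVecEnv:
    """Stand-in for the wrapped Mario env: moving right earns reward until x_pos 50."""

    def reset(self):
        self.x = 0
        return [self.x]

    def step(self, action):
        self.x += 10 if action[0] == 1 else 0
        done = self.x >= 50
        return [self.x], [1.0], [done], [{"x_pos": self.x}]

print(evaluate(StubModel(), StubVecEnv()))
```

With the real objects, the call would be `evaluate(model, env)`; comparing its output across checkpoints gives a simple way to see whether training is making progress.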