# AIIR Project - AI Mario
This Jupyter notebook applies the neural network and reinforcement learning algorithms learnt in the tutorials to train an agent that plays Mario through a variety of levels in a Super Mario Bros Gym environment.

## Mario Environment
We use a Super Mario Bros environment (https://pypi.org/project/gym-super-mario-bros/) with a continuous state space and a discrete action space. The goal of this activity is to complete Mario levels as fast as possible while also achieving a high in-game score. An episode ends when Mario reaches the end of the level, when Mario dies, or when a set amount of time has elapsed.

### Action Space
- 0: No Movement
- 1: Move Right
- 2: Move Right + Jump
- 3: Move Right + Speed Up
- 4: Move Right + Jump + Speed Up
- 5: Jump
- 6: Move Left
- 7: Move Left + Jump
- 8: Move Left + Speed Up
- 9: Move Left + Jump + Speed Up
- 10: Down
- 11: Up
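
Each index above stands for a combination of controller buttons. As a minimal sketch, the list below mirrors the table by hand so an index can be decoded without loading the environment. The names `ACTION_BUTTONS` and `describe_action` are hypothetical, and the mapping of jump to `A` and speed up to `B` is an assumption about the controller layout.

```python
# Hand-written mirror of the action table above (hypothetical helper, not part of the package).
# Assumption: "A" is the jump button and "B" is the speed-up button.
ACTION_BUTTONS = [
    [],                    # 0: No Movement
    ["right"],             # 1: Move Right
    ["right", "A"],        # 2: Move Right + Jump
    ["right", "B"],        # 3: Move Right + Speed Up
    ["right", "A", "B"],   # 4: Move Right + Jump + Speed Up
    ["A"],                 # 5: Jump
    ["left"],              # 6: Move Left
    ["left", "A"],         # 7: Move Left + Jump
    ["left", "B"],         # 8: Move Left + Speed Up
    ["left", "A", "B"],    # 9: Move Left + Jump + Speed Up
    ["down"],              # 10: Down
    ["up"],                # 11: Up
]

def describe_action(index):
    """Return a readable button combination for an action index."""
    buttons = ACTION_BUTTONS[index]
    return " + ".join(buttons) if buttons else "no-op"

print(describe_action(4))
```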

### Observation Space
Each observation is an RGB frame of the game screen. In addition, the `info` dictionary returned by `step` contains the following:

| Key | Type | Description |
| --- | ---- | ----------- |
| coins | int | Number of collected coins |
| flag_get | bool | True if Mario reached a flag |
| life | int | Number of lives left |
| score | int | Cumulative in-game score |
| stage | int | Current stage |
| status | str | Mario's status/power |
| time | int | Time left on the clock |
| world | int | Current world |
| x_pos | int | Mario's x position in the stage |
| y_pos | int | Mario's y position in the stage |
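
As a minimal sketch of how these keys might be consumed, the helper below condenses an `info` dictionary into a one-line progress string. The function name `summarize_info` and the sample values are made up for illustration; only the keys come from the table above.

```python
def summarize_info(info):
    """Format the main progress fields of an info dict (hypothetical helper)."""
    outcome = "flag reached" if info["flag_get"] else "in progress"
    return (f"World {info['world']}-{info['stage']} | x={info['x_pos']} | "
            f"score={info['score']} | coins={info['coins']} | "
            f"time={info['time']} | lives={info['life']} | {outcome}")

# Illustrative values only; a real dictionary comes from env.step(action)
sample_info = {
    "coins": 3, "flag_get": False, "life": 2, "score": 1200, "stage": 1,
    "status": "small", "time": 352, "world": 1, "x_pos": 640, "y_pos": 79,
}
print(summarize_info(sample_info))
```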

### Rewards
| Feature | Description | Positive when | Negative when | Zero when |
|---------|-------------|---------------------|---------------------|------------------|
| Difference in agent x values between states | Controls agent's movement | Moving right | Moving left | Not moving |
| Time difference in the game clock between frames | Prevents agent from staying still | - | Clock ticks | Clock doesn't tick |
| Death Penalty | Discourages agent from death | - | Agent dead | Agent alive |
| Coins | Encourages agent to get coins | Coin collected | - | No coin collected |
| Score | Encourages agent to get higher score | Score increases (by the amount gained) | - | Score unchanged |
| Flag | Encourages agent to reach middle & end flag | Flag collected | - | Flag not collected |
| Powerup | Encourages agent to get powerups | Powerup collected | - | Powerup not collected |
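
The first three rows of the table can be sketched as a single function. This is an illustration under stated assumptions rather than the environment's actual code: it assumes the three terms are summed and clipped to the range [-15, 15] with a death penalty of -15, which is how the package documentation describes its built-in reward. The name `base_reward` and its parameters are hypothetical.

```python
def base_reward(x_prev, x_curr, clock_prev, clock_curr, died,
                death_penalty=-15, clip=15):
    """Sketch of the built-in reward: movement + clock + death, then clipped."""
    velocity = x_curr - x_prev        # positive when moving right, negative when moving left
    clock = clock_curr - clock_prev   # negative when the clock ticks (time left decreases)
    death = death_penalty if died else 0
    total = velocity + clock + death
    return max(-clip, min(clip, total))

print(base_reward(40, 43, 400, 400, died=False))  # moved right, no clock tick
print(base_reward(40, 40, 400, 399, died=False))  # stood still while the clock ticked
print(base_reward(40, 38, 400, 399, died=True))   # died; the sum is clipped at the lower bound
```

The remaining rows (coins, score, flag, powerup) are the extra shaping terms that this notebook adds on top of the built-in reward during training.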

## Installation Guide

In [None]:
%pip install gym-super-mario-bros

In [1]:
import gym
import pybullet as p
import matplotlib.pyplot as plt
from pyvirtualdisplay import Display
from IPython.display import HTML
import matplotlib
import matplotlib.animation as animation
from collections import deque
import numpy as np
import torch
import random
import math
import os
os.environ['PYVIRTUALDISPLAY_DISPLAYFD'] = '0' 

display = Display(visible=0, size=(400, 300))
display.start()

# Function to display the testing video of the agent in the Jupyter notebook
def display_video(frames, framerate=30):
  """Generates video from `frames`.

  Args:
    frames (ndarray): Array of shape (n_frames, height, width, 3).
    framerate (int): Frame rate in units of Hz.

  Returns:
    Display object.
  """
  height, width, _ = frames[0].shape
  dpi = 70
  orig_backend = matplotlib.get_backend()
  matplotlib.use('Agg')  # Switch to headless 'Agg' to inhibit figure rendering.
  fig, ax = plt.subplots(1, 1, figsize=(width / dpi, height / dpi), dpi=dpi)
  matplotlib.use(orig_backend)  # Switch back to the original backend.
  ax.set_axis_off()
  ax.set_aspect('equal')
  ax.set_position([0, 0, 1, 1])
  im = ax.imshow(frames[0])
  def update(frame):
    im.set_data(frame)
    return [im]
  interval = 1000/framerate
  anim = animation.FuncAnimation(fig=fig, func=update, frames=frames,
                                  interval=interval, blit=True, repeat=False)
  return HTML(anim.to_html5_video())



## Hyperparameters

In [2]:
EPISODES = 2500                 # Number of episodes to train the AI on
MEM_SIZE = 100000               # Size of the memory in replay buffer
REPLAY_START_SIZE = 50000       # Number of samples to collect in the replay buffer before training starts
MEM_RETAIN = 0.1                # Fraction of memory that cannot be overwritten (avoids catastrophic forgetting)
BATCH_SIZE = 16                 # Size of random batches when sampling experiences
LEARNING_RATE = 0.00025         # Learning rate for optimizing neural network weights
GAMMA = 0.9                     # Discount factor for future rewards
EPSILON_START = 1.0             # Starting exploration rate
EPSILON_END = 0.0001            # Ending exploration rate
EPSILON_DECAY = 2 * MEM_SIZE    # Decay constant (in learning steps) of the exploration rate
NETWORK_UPDATE_ITERS = 10000    # Number of learning steps between target network (\hat{Q}) updates
MAX_STEPS = 1000                # Number of steps before the episode is terminated
DQN_DIM1 = 256                  # Number of neurons in DQN's first hidden layer
DQN_DIM2 = 256                  # Number of neurons in DQN's second hidden layer

# Metrics for displaying training status
best_reward = 0
average_reward = 0
episode_history = []
episode_reward_history = []
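# Compatibility alias: `np.bool` was removed in NumPy 1.24 but is still referenced by the replay buffer below
np.bool = np.bool_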
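
The three `EPSILON_*` values define an exponential schedule that the agent's epsilon-greedy policy evaluates once learning has started. The sketch below reproduces that formula on its own so the decay can be inspected before training. The helper name `epsilon_at` is hypothetical, and the constants are repeated with the same values as above so the snippet runs on its own.

```python
import math

# Same values as the hyperparameters above, repeated so the sketch is self-contained
EPSILON_START = 1.0
EPSILON_END = 0.0001
EPSILON_DECAY = 2 * 100000  # 2 * MEM_SIZE

def epsilon_at(learn_count):
    """Exploration rate after `learn_count` learning steps."""
    return EPSILON_END + (EPSILON_START - EPSILON_END) * math.exp(-learn_count / EPSILON_DECAY)

for steps in (0, 100_000, 200_000, 1_000_000):
    print(steps, round(epsilon_at(steps), 4))
```

With these settings the exploration rate is still about 0.37 after `EPSILON_DECAY` learning steps, so the agent keeps acting randomly on a large share of steps for a long time.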

## Neural Network

In [4]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np

# Neural network class comprised of CNN and DQN to approximate Q-values for reinforcement learning
class NeuralNetwork(nn.Module):
    # Constructor for Neural Network class
    def __init__(self, env):
        super().__init__()  # Inheriting from torch.nn.Module constructor

        # Getting the input and output shapes for the neural network layers
        self.input_shape = env.observation_space.shape
        self.action_space = env.action_space.n

        # Defining the convolutional layers for CNN
        # Used for processing image data from the environment
        self.conv_layers = torch.nn.Sequential(
            torch.nn.Conv2d(self.input_shape[0], 32, kernel_size=3, stride=2),
            torch.nn.ReLU(),
            torch.nn.Conv2d(32, 64, kernel_size=1, stride=1),
            torch.nn.ReLU(),
            torch.nn.Conv2d(64, 64, kernel_size=1, stride=1),
            torch.nn.ReLU()
        )
        
        # Getting the output shape of the convolutional layers
        conv_out_size = self._get_conv_out(self.input_shape)

        # Defining the layers of the Neural Network
        self.layers = torch.nn.Sequential(
            self.conv_layers,
            torch.nn.Flatten(),
            torch.nn.Linear(conv_out_size, DQN_DIM1),
            torch.nn.ReLU(),
            torch.nn.Linear(DQN_DIM1, DQN_DIM2),
            torch.nn.ReLU(),
            torch.nn.Linear(DQN_DIM2, self.action_space)
        )

        self.optimizer = optim.Adam(self.parameters(), lr=LEARNING_RATE)  # Optimizer for the network
        self.loss = nn.MSELoss()  # Loss function

        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'  # Device to run the network on
        self.to(self.device)  # Moving the network to the device

    # Function to get the output shape of the convolutional layers
    def _get_conv_out(self, shape):
        o = self.conv_layers(torch.zeros(1, *shape))
        return int(np.prod(o.size()))

    # Forward pass through the layers of the Neural Network
    def forward(self, x):
        return self.layers(x)
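
`_get_conv_out` measures the flattened size by pushing a dummy tensor through the convolutions. The same number can be cross-checked by hand with the standard convolution size formula, as sketched below in plain Python. The helper names are hypothetical, and the observation shape of (240, 256, 3) is an assumption about the environment.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length of one spatial dimension after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def flattened_size(height, width, out_channels=64):
    """Flattened feature count after the three conv layers defined above."""
    # (kernel, stride) pairs copied from conv_layers
    for kernel, stride in [(3, 2), (1, 1), (1, 1)]:
        height = conv_out(height, kernel, stride)
        width = conv_out(width, kernel, stride)
    return out_channels * height * width

# Assumption: observations have shape (240, 256, 3). Because input_shape[0] is used as the
# channel count, the first convolution sees 240 "channels" and a 256 x 3 spatial grid.
print(flattened_size(256, 3))

# For comparison, a channels-first (3, 240, 256) layout would give a 240 x 256 grid.
print(flattened_size(240, 256))
```

If the observations are indeed channels-last, the network runs but treats image rows as channels. Permuting to channels-first would be the conventional layout, although with these kernel sizes it would make the first linear layer far larger.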

## Replay Buffer

In [5]:
# Replay Buffer class for storing and retrieving sampled experiences
class ReplayBuffer:
    # Constructor for Replay Buffer class
    def __init__(self, env):
        # Initialising memory count and creating arrays to store experiences
        self.mem_count = 0
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.states = np.zeros((MEM_SIZE, *env.observation_space.shape),dtype=np.float32)
        self.actions = np.zeros(MEM_SIZE, dtype=np.int64)
        self.rewards = np.zeros(MEM_SIZE, dtype=np.float32)
        self.states_ = np.zeros((MEM_SIZE, *env.observation_space.shape),dtype=np.float32)
        self.dones = np.zeros(MEM_SIZE, dtype=np.bool)

    # Function to add experiences to the memory buffer
    def add(self, state, action, reward, state_, done):
        # While the buffer is not full, write to the next free slot; afterwards overwrite older values
        if self.mem_count < MEM_SIZE:
            mem_index = self.mem_count  # Using mem_count if less than max memory size
        else:
            # Avoiding catastrophic forgetting - retain initial 10% of the replay buffer
            mem_index = int(self.mem_count % ((1-MEM_RETAIN) * MEM_SIZE) + (MEM_RETAIN * MEM_SIZE))

        # Adding the states to the replay buffer memory
        self.states[mem_index]  = state     # Storing the state
        self.actions[mem_index] = action    # Storing the action
        self.rewards[mem_index] = reward    # Storing the reward
        self.states_[mem_index] = state_    # Storing the next state
        self.dones[mem_index] = 1 - done    # Storing the not-done mask (0 for terminal transitions)
        self.mem_count += 1  # Incrementing memory count

    # Function to sample random batch of experiences
    def sample(self):
        MEM_MAX = min(self.mem_count, MEM_SIZE)
        batch_indices = np.random.choice(MEM_MAX, BATCH_SIZE, replace=True)  # NumPy indices (arrays have no .to(); tensors are moved to the device in learn())

        states = self.states[batch_indices]
        actions = self.actions[batch_indices]
        rewards = self.rewards[batch_indices]
        states_ = self.states_[batch_indices]
        dones = self.dones[batch_indices]

        # Returning the random sampled experiences
        return np.array(states), np.array(actions), np.array(rewards), np.array(states_), np.array(dones)
    
    def __len__(self):
        return self.mem_count
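
The wrap-around index in `add` is easier to follow on a tiny buffer. The sketch below uses a hypothetical helper `write_index` with integer arithmetic (a slight variation on the float expression above) and shows that, once the buffer is full, writes cycle through the unprotected slots only.

```python
def write_index(mem_count, mem_size, retain):
    """Slot that the next experience is written to (sketch of the logic in add())."""
    protected = int(retain * mem_size)  # slots at the start that are never overwritten
    if mem_count < mem_size:
        return mem_count                # buffer not full yet: fill in order
    return mem_count % (mem_size - protected) + protected

# Buffer of 10 slots with the first 2 protected
indices = [write_index(count, mem_size=10, retain=0.2) for count in range(8, 18)]
print(indices)
```

Slots 0 and 1 are never revisited after the initial fill. Note that the first overwrite lands on slot 4 rather than slot 2, because the modulo is taken over the total count and not over the number of writes since the buffer filled.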

In [None]:
# Alternative deque-based Replay Buffer; running this cell replaces the class defined above
class ReplayBuffer:
    def __init__(self, env, mem_size=MEM_SIZE):
        # Initialising memory count and creating a bounded deque to store experiences
        self.memory = deque(maxlen=mem_size)
        self.mem_count = 0

    def add(self, state, action, reward, state_, done):
        # Adding experience to memory; like the array-based buffer, store the
        # not-done mask (1 - done) because learn() multiplies the bootstrap term by it
        self.memory.append((state, action, reward, state_, 1 - done))
        self.mem_count += 1

    def sample(self):
        # Randomly sample a batch of experiences
        batch_size = min(BATCH_SIZE, self.mem_count)
        batch = random.sample(self.memory, batch_size)

        states, actions, rewards, states_, dones = zip(*batch)
        return np.array(states), np.array(actions), np.array(rewards), np.array(states_), np.array(dones)
    
    def __len__(self):
        return self.mem_count
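
The deque-based variant relies on `maxlen` to evict the oldest experience automatically, so unlike the array-based buffer it has no protected region. A minimal sketch of that behaviour, with made-up placeholder experiences:

```python
import random
from collections import deque

memory = deque(maxlen=3)  # tiny capacity so the eviction is visible
for step in range(5):
    memory.append((f"state{step}", step))  # placeholder experience tuples

# Only the three most recent experiences remain
print(list(memory))

# Sampling without replacement from the retained experiences
batch = random.sample(list(memory), 2)
print(batch)
```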

## Reinforcement Learning

In [6]:
# Reinforcement Learning class
class ReinforcementLearning:
    # Constructor for Reinforcement Learning class
    def __init__(self, env):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Define the device to run the network on
        self.memory = ReplayBuffer(env)  # Creating replay buffer
        self.policy_network = NeuralNetwork(env)  # Q
        self.target_network = NeuralNetwork(env)  # \hat{Q}
        self.target_network.load_state_dict(self.policy_network.state_dict())  # Initially set weights of Q to \hat{Q}
        self.learn_count = 0  # Tracking number of learning iterations

    # Epsilon-greedy policy
    def choose_action(self, observation):
        # Only start decaying the epsilon once we start learning
        if self.memory.mem_count > REPLAY_START_SIZE:
            eps_threshold = EPSILON_END + (EPSILON_START - EPSILON_END) * \
                math.exp(-1. * self.learn_count / EPSILON_DECAY)
        else:
            eps_threshold = 1.0

        # If we rolled a value lower than epsilon, sample a random action
        if random.random() < eps_threshold:
            # Random action with set priors (7 actions, matching SIMPLE_MOVEMENT)
            return np.random.choice(np.array(range(7)), p=[0.05, 0.2, 0.2, 0.2, 0.2, 0.05, 0.1])
            # Priors for the 12-action COMPLEX_MOVEMENT set:
            #return np.random.choice(np.array(range(12)), p=[0.05, 0.1, 0.1, 0.1, 0.1, 0.05, 0.1, 0.1, 0.1, 0.1, 0.05, 0.05])
        
        # Otherwise policy network (Q) chooses action with highest estimated Q value so far
        state = observation.clone().detach().to(self.device)
        state = state.unsqueeze(0)
        self.policy_network.eval()
        with torch.no_grad():
            q_values = self.policy_network(state)  # Get Q-values from policy network

        return torch.argmax(q_values).item()

    # Main training/learning loop
    def learn(self):
        # Sampling a random batch of experiences and converting them to tensors
        states, actions, rewards, states_, dones = self.memory.sample()
        states = torch.tensor(states, dtype=torch.float32).to(self.device)
        actions = torch.tensor(actions, dtype=torch.long).to(self.device)
        rewards = torch.tensor(rewards, dtype=torch.float32).to(self.device)
        states_ = torch.tensor(states_, dtype=torch.float32).to(self.device)
        dones = torch.tensor(dones, dtype=torch.bool).to(self.device)
        batch_indices = torch.from_numpy(np.arange(BATCH_SIZE, dtype=np.int64)).to(self.device)

        self.policy_network.train(True)  # Setting the policy network to training mode
        q_values = self.policy_network(states)  # Getting predicted Q-values from neural network
        q_values = q_values[batch_indices, actions]  # Getting the Q-values for the sampled experience

        self.target_network.eval()
        with torch.no_grad():
            q_values_next = self.target_network(states_)  # Getting Q-values from target network

        q_values_next_max = torch.max(q_values_next, dim=1)[0]  # Getting max Q-values for next state
        q_target = rewards + GAMMA * q_values_next_max * dones  # Getting target Q-values (dones holds the not-done mask, so terminal states do not bootstrap)

        loss = self.policy_network.loss(q_values, q_target)  # Calculating the loss from target and predicted Q-values

        # Computing the gradients and updating Q weights
        self.policy_network.optimizer.zero_grad()
        loss.backward()
        self.policy_network.optimizer.step()  # Updating Q weights
        self.learn_count += 1  # Incrementing learning count

        # Set target network weights to policy network weights every set increment of learning steps
        if self.learn_count % NETWORK_UPDATE_ITERS == NETWORK_UPDATE_ITERS - 1:
            print("Updating target network")
            self.update_target_network()

    # Function to synchronize the weights of the target network with the policy network
    def update_target_network(self):
        self.target_network.load_state_dict(self.policy_network.state_dict())

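    # Function to return the current exploration rate (epsilon) of the agent
    def returning_epsilon(self):
        if self.memory.mem_count <= REPLAY_START_SIZE:
            return 1.0  # No decay until learning has started
        return EPSILON_END + (EPSILON_START - EPSILON_END) * \
            math.exp(-1. * self.learn_count / EPSILON_DECAY)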
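
The target computed in `learn` is the one-step temporal-difference target: the reward plus the discounted best Q-value of the next state, with the bootstrap term switched off for terminal transitions. The sketch below works through one transition in plain Python. The helper name `td_target` and the numbers are illustrative only.

```python
GAMMA = 0.9  # same discount factor as in the hyperparameters

def td_target(reward, next_q_values, not_done, gamma=GAMMA):
    """One-step TD target for a single transition.

    not_done is 1 for a non-terminal transition and 0 for a terminal one,
    matching the mask stored by the array-based replay buffer.
    """
    return reward + gamma * max(next_q_values) * not_done

non_terminal = td_target(1.0, [0.5, 2.0], not_done=1)
terminal = td_target(1.0, [0.5, 2.0], not_done=0)
print(non_terminal, terminal)
```

The non-terminal case bootstraps to 1.0 + 0.9 × 2.0 = 2.8, while the terminal case falls back to the bare reward of 1.0.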

## Training

In [7]:
# Function to apply additional rewards that aren't in the environment already
def reward_shaping(prev_info, info):
    shapedReward = 0  # Container to store the additional reward
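    status_rank = {'small': 0, 'tall': 1, 'fireball': 2}  # Ordering of Mario's power-up states
    reward_values = {  # Container to store keys for rewards
        'coins': 1,
        'score': lambda previous, current: current - previous,
        'flag_get': 50,
        # info has no 'powerup' key; a collected powerup shows up as a higher-ranked 'status'
        'status': lambda previous, current: 1 if status_rank.get(current, 0) > status_rank.get(previous, 0) else 0
    }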

    # Applying the reward values to the shaped reward
    for key, reward in reward_values.items():
        prev_value = prev_info.get(key, 0)  # Getting the previous info values for keys
        curr_value = info.get(key, 0)       # Getting the current info values for keys

        # If the reward is a function, apply the function to the previous and current values
        if callable(reward):
            shapedReward += reward(prev_value, curr_value)

        # Otherwise, apply the reward value to the shaped reward
        elif curr_value > prev_value:
            shapedReward += reward

    return shapedReward  # Return the shaped reward

In [8]:
import gym_super_mario_bros
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from nes_py.wrappers import JoypadSpace

# Checking if GPU is available
if torch.cuda.is_available():
    print("Using CUDA device:", torch.cuda.get_device_name(0))
else:
    print("CUDA is not available")

# Loading the Super Mario Bros gym environment and initialising joypad type
env = gym_super_mario_bros.make('SuperMarioBros-1-1-v0', apply_api_compatibility=True, render_mode="rgb_array")
env = JoypadSpace(env, SIMPLE_MOVEMENT)

# Metrics for displaying training status
prev_info = None
episode_reward = 0
episode_batch_score = 0
agent = ReinforcementLearning(env)
plt.clf()  # Clearing previous plot

env.reset()  # Resetting environment
state_, reward, done, trunc, info = env.step(action=0)  # Taking a step in the environment

# Looping through the episodes to train the model
for episode in range(EPISODES):
    done = False  # Setting default done state
    step_count = 0
    prev_info = None  # Avoid shaping rewards against the previous episode's final info
    state, info = env.reset()  # Resetting environment and getting state
    
    # Running the episode until done or max steps reached
    while not done and step_count < MAX_STEPS:
        # Choosing an action with the epsilon-greedy policy and stepping the environment
        state_copy = np.array(state)
        state_tensor = torch.tensor(state_copy, dtype=torch.float32).to(agent.device)
        action = agent.choose_action(state_tensor)
        state_, reward, done, trunc, info = env.step(action)

        # Adding additional reward system
        if prev_info is not None:
            reward += reward_shaping(prev_info, info)

        agent.memory.add(state, action, reward, state_, done)  # Add experience to replay buffer
        # Only start learning once replay memory has reached set number of samples
        if agent.memory.mem_count >= REPLAY_START_SIZE:
            agent.learn()

        state = state_  # Updating current state
        prev_info = info  # Updating previous info
        step_count += 1  # Incrementing step count
        episode_batch_score += reward  # Updating batch reward
        episode_reward += reward  # Updating episode reward

    # Appending episode and associated reward to history
    episode_history.append(episode)
    episode_reward_history.append(episode_reward)
    episode_reward = 0  # Resetting episode reward

    # Printing episode number every 100 episodes
    if episode % 100 == 0:
        print("Episode: ", episode)

    # Saving the model every 100 episodes once learning has started
    if episode % 100 == 0 and agent.memory.mem_count > REPLAY_START_SIZE:
        save_path = os.path.join(os.getcwd(), "policy_network.pkl")
        torch.save(agent.policy_network.state_dict(), save_path)
        print("average total reward per episode batch since episode ", episode, ": ", episode_batch_score/ float(100))
        episode_batch_score = 0
    elif agent.memory.mem_count < REPLAY_START_SIZE:
        print("waiting for buffer to fill...")
        episode_batch_score = 0

# Plotting the episode history and reward history
plt.plot(episode_history, episode_reward_history)
plt.show()
env.close()  # Closing the environment

Using CUDA device: NVIDIA RTX A2000 Laptop GPU




Episode:  0
waiting for buffer to fill...



## Testing

In [None]:
import gym_super_mario_bros
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from nes_py.wrappers import JoypadSpace

# Loading the Super Mario Bros gym environment and initialising joypad type
# (the action set must match the one used in training so the saved weights fit the output layer)
env = gym_super_mario_bros.make('SuperMarioBros-1-1-v0', apply_api_compatibility=True, render_mode="rgb_array")
env = JoypadSpace(env, SIMPLE_MOVEMENT)
agent = ReinforcementLearning(env)  # Creating reinforcement learning agent

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Defining the device to run the network on
agent.policy_network.load_state_dict(torch.load("policy_network.pkl", map_location=device))  # Loading policy network
agent.policy_network.to(device)  # Moving policy network to device
state, info = env.reset()  # Resetting environment and getting initial state
frames = []  # Frames container for video
frames.append(env.render())  # Appending initial frame to video
agent.policy_network.eval()  # Setting policy network to evaluation mode

# Running the episode until done
while True:
    with torch.no_grad():
        state_copy = np.array(state)  # Copying state
        state_tensor = torch.tensor(state_copy, dtype=torch.float32).unsqueeze(0).to(agent.device)  # Getting state tensor
        q_values = agent.policy_network(state_tensor)  # Getting Q-values from policy network

    # Getting the action with the highest Q-value
    action = torch.argmax(q_values).item()
    state, reward, done, trunc, info = env.step(action)  # Taking a step in the environment
    frames.append(np.copy(env.render()))  # Appending frame to video

    # Breaking the loop if the episode has terminated or been truncated
    if done or trunc:
        break

env.close()
display_video(frames)  # Displaying the video of the agent playing the game