**Step-0: Install dependencies** 

In [3]:
# Packages that are actually imported in this notebook
!pip install stable-baselines3 opencv-python
!pip install gymnasium
!pip install ale-py



In [4]:
!which python
!python -V

/opt/anaconda3/envs/rl_final_env/bin/python
Python 3.11.11


**Step-1: Test Random Environment with Gymnasium**

In [5]:
import random
import ale_py
import gymnasium as gym
gym.register_envs(ale_py)

In [7]:
env = gym.make("SpaceInvaders-v4", render_mode="human")
# The SpaceInvaders-v4 environment returns an RGB image as its observation.
# We extract the image shape below because it determines the input shape of our neural network.

A.L.E: Arcade Learning Environment (version 0.10.2+c9d4b19)
[Powered by Stella]


In [8]:
height, width, channels = env.observation_space.shape
actions = env.action_space.n

**The available actions are listed at https://www.gymlibrary.dev/environments/atari/space_invaders/. To see them here, let's unwrap the environment and check which actions are in action_space.**

In [9]:
env.unwrapped.get_action_meanings()

['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']

**So within our environment, our agent can take the following actions: no operation, fire, move right, move left, move right while firing, and move left while firing.**
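As a small illustration, the list above can be read as an index-to-name lookup, since the integer passed to env.step is simply a position in this list. The list is copied from the output above; `ACTION_MEANINGS` and `describe` are hypothetical names used only for this sketch, not part of Gymnasium.

```python
# Action names copied from the get_action_meanings() output above.
ACTION_MEANINGS = ['NOOP', 'FIRE', 'RIGHT', 'LEFT', 'RIGHTFIRE', 'LEFTFIRE']


def describe(action):
    """Return the name of an integer action (hypothetical helper)."""
    return ACTION_MEANINGS[action]


# Print every action index next to its meaning.
for index, name in enumerate(ACTION_MEANINGS):
    print(index, name)
```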

For the random test, we will play 5 episodes (5 games of Space Invaders).

Let's carefully go through what the next cell does.

1. env.reset() resets the environment to its initial state and returns the first observation together with an info dictionary.
2. done=False is the initial setting so that we keep playing until the episode is over, that is, until we lose the game or the episode is cut short.
3. score is a counter that keeps track of our game score.
4. random.choice picks one of the 6 actions listed above. This shows how our agent performs when it takes actions at random.
5. env.step(action) applies the randomly chosen action to our environment and returns the following values, in this order:
   
    **n_state:** The new state 

    **reward:** The reward for the action 

    **done:** Whether the episode ended (Gymnasium calls this flag terminated)

    **truncated:** Whether the episode was forcefully stopped (time limit, etc.) 

    **info:** Additional environment details

6. The reward for each action is added to our score, which maintains the running total through the course of the game. 
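The steps above can be sketched without an emulator. The block below is a minimal sketch, assuming a hypothetical `ToyEnv` class that only imitates the shape of the Gymnasium API (reset returns two values, step returns five); its rewards and termination rule are invented for illustration and have nothing to do with the real game.

```python
import random


class ToyEnv:
    """Hypothetical stand-in that mimics the Gymnasium reset/step signatures."""

    def __init__(self, max_steps=50, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0, {}  # (observation, info)

    def step(self, action):
        self.steps += 1
        reward = 5.0 if action == 1 else 0.0      # pretend only FIRE scores
        terminated = self.rng.random() < 0.05     # e.g. the game was lost
        truncated = self.steps >= self.max_steps  # e.g. a time limit was hit
        return self.steps, reward, terminated, truncated, {}


def run_episode(env, policy_rng):
    """Play one episode with uniformly random actions and return (score, steps)."""
    state, info = env.reset()
    score = 0.0
    done = False
    while not done:
        action = policy_rng.choice(range(6))
        state, reward, terminated, truncated, info = env.step(action)
        score += reward
        # The episode is over when either flag is set.
        done = terminated or truncated
    return score, env.steps


score, steps = run_episode(ToyEnv(), random.Random(1))
print(f"score={score} after {steps} steps")
```

Combining the two flags with `or` matters: looping on only one of them can run forever in an environment that ends by time limit rather than by game over.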

In [10]:
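episodes = 5
for episode in range(1, episodes+1):
    state, info = env.reset()  # Gymnasium returns (observation, info)
    done = False
    score = 0 
    
    while not done:
        # render_mode="human" already draws every frame, so no env.render() call is needed
        action = random.choice([0,1,2,3,4,5])
        # step() returns five values: observation, reward, terminated, truncated, info
        n_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        score += reward
    print('Episode:{} Score:{}'.format(episode, score))
env.close()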



Episode:1 Score:120.0
Episode:2 Score:265.0
Episode:3 Score:260.0
Episode:4 Score:75.0
Episode:5 Score:30.0


**The random agent scores some points, but its scores are inconsistent, ranging from 30 to 265. Our end goal is to train our model so that the agent earns consistently high rewards as it learns.**
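To put a number on that inconsistency, here is a minimal sketch that summarizes the five random-episode scores printed above using only the standard library. The variable names are illustrative, and five episodes is a very small sample, so treat the figures as a rough baseline.

```python
import statistics

# Scores copied from the five random episodes printed above.
random_scores = [120.0, 265.0, 260.0, 75.0, 30.0]

mean_score = statistics.mean(random_scores)
spread = statistics.pstdev(random_scores)  # population standard deviation

print(f"mean={mean_score:.1f}, std={spread:.1f}, "
      f"min={min(random_scores)}, max={max(random_scores)}")
```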

**Step-2: Create a Deep Learning Model with Stable Baselines3**

In [11]:
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv
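# Stable-Baselines3 requires vectorized environments, so we wrap a fresh environment in DummyVecEnv.
# (The env from Step-1 was closed above, and render_mode="human" would slow training down.)
env = DummyVecEnv([lambda: gym.make("SpaceInvaders-v4")])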

In [13]:
mlp_model = PPO("MlpPolicy", env, verbose=2)
mlp_model.learn(total_timesteps=10000)

# Evaluate MLP Policy
mlp_mean_reward, _ = evaluate_policy(mlp_model, env, n_eval_episodes=15)
mlp_model.save("SpaceInvaders_MLP")

print(f"MLP Policy - Mean Reward: {mlp_mean_reward:.2f}")

Using cpu device
Wrapping the env in a VecTransposeImage.
-----------------------------
| time/              |      |
|    fps             | 18   |
|    iterations      | 1    |
|    time_elapsed    | 109  |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 17          |
|    iterations           | 2           |
|    time_elapsed         | 231         |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.007818995 |
|    clip_fraction        | 0.0362      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.79       |
|    explained_variance   | 0.00237     |
|    learning_rate        | 0.0003      |
|    loss                 | 4.13        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.00585    |
|    value_loss           | 11.1        |
------------------



MLP Policy - Mean Reward: 317.33


In [14]:
# Train model using CNN policy
cnn_model = PPO("CnnPolicy", env, verbose=2)
cnn_model.learn(total_timesteps=10000)

# Evaluate CNN Policy
cnn_mean_reward, _ = evaluate_policy(cnn_model, env, n_eval_episodes=15)
cnn_model.save("SpaceInvaders_CNN")

print(f"CNN Policy - Mean Reward: {cnn_mean_reward:.2f}")

Using cpu device
Wrapping the env in a VecTransposeImage.
-----------------------------
| time/              |      |
|    fps             | 18   |
|    iterations      | 1    |
|    time_elapsed    | 109  |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 15          |
|    iterations           | 2           |
|    time_elapsed         | 270         |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.012101004 |
|    clip_fraction        | 0.126       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.78       |
|    explained_variance   | 0.00262     |
|    learning_rate        | 0.0003      |
|    loss                 | 3.23        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.00217    |
|    value_loss           | 11.5        |
------------------

- It is surprising that the MLP policy performed better than the CNN policy on Space Invaders, so let’s analyze why this might be happening.

- 10,000 timesteps might not be enough for the CNN to outperform the MLP. CNNs typically need more training before they extract spatial features effectively.

- So now, let’s try 50,000 timesteps.
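One rough way to check how far a policy has moved from random play is the entropy_loss row in the training logs. Stable-Baselines3 logs the negative mean policy entropy, and a uniform policy over 6 actions has entropy ln 6. The sketch below compares that value with the -1.78 copied from the CNN log above; note that the log shown covers only an early iteration, so this says little about the end of training.

```python
import math

n_actions = 6  # size of the Space Invaders action space shown earlier

# Entropy of a uniform distribution over n_actions, in nats.
uniform_entropy = math.log(n_actions)

# Value copied from the entropy_loss row of the CNN training log above.
logged_entropy_loss = -1.78

gap = abs(-uniform_entropy - logged_entropy_loss)
print(f"entropy_loss of a uniform policy: {-uniform_entropy:.2f}")
print(f"gap to the logged value: {gap:.3f}")
```

A gap this small suggests that, at the iteration shown, the policy was still choosing actions almost uniformly at random.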

In [15]:
# Train model using CNN policy
cnn_model = PPO("CnnPolicy", env, verbose=2)
cnn_model.learn(total_timesteps=50000)

# Evaluate CNN Policy
cnn_mean_reward, _ = evaluate_policy(cnn_model, env, n_eval_episodes=15)
cnn_model.save("SpaceInvaders_CNN")

print(f"CNN Policy - Mean Reward: {cnn_mean_reward:.2f}")

Using cpu device
Wrapping the env in a VecTransposeImage.
-----------------------------
| time/              |      |
|    fps             | 18   |
|    iterations      | 1    |
|    time_elapsed    | 111  |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 15          |
|    iterations           | 2           |
|    time_elapsed         | 272         |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.010836471 |
|    clip_fraction        | 0.0845      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.78       |
|    explained_variance   | -0.00298    |
|    learning_rate        | 0.0003      |
|    loss                 | 0.773       |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0027     |
|    value_loss           | 8.09        |
------------------

In [16]:
mlp_model = PPO("MlpPolicy", env, verbose=2)
mlp_model.learn(total_timesteps=50000)

# Evaluate MLP Policy
mlp_mean_reward, _ = evaluate_policy(mlp_model, env, n_eval_episodes=15)
mlp_model.save("SpaceInvaders_MLP")

print(f"MLP Policy - Mean Reward: {mlp_mean_reward:.2f}")

Using cpu device
Wrapping the env in a VecTransposeImage.
-----------------------------
| time/              |      |
|    fps             | 31   |
|    iterations      | 1    |
|    time_elapsed    | 64   |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 27          |
|    iterations           | 2           |
|    time_elapsed         | 146         |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.008289233 |
|    clip_fraction        | 0.0787      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.79       |
|    explained_variance   | -0.0028     |
|    learning_rate        | 0.0003      |
|    loss                 | 2.23        |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.0112     |
|    value_loss           | 12.2        |
------------------

**Variations:**

With 50,000 timesteps, the mean reward of MlpPolicy is lower than it was with 10,000 timesteps, and CnnPolicy now has the better reward. We will look at why shortly, but let's first review what the training code above does:

- "CnnPolicy" refers to a Convolutional Neural Network (CNN) policy, which is best suited to image-based environments.
- learning_rate sets the step size for policy optimization. The runs above use the PPO default, which appears as 0.0003 in the logs.
- n_steps=2048 (the default) is the number of environment steps collected before each model update, which is why total_timesteps grows by 2048 per iteration in the logs.
- learn(total_timesteps=50000) trains the agent for 50,000 steps.
- save() stores the trained model for later use.
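The relationship between total_timesteps and n_steps can be worked out by hand. The sketch below assumes one environment and 10 optimization epochs per rollout (which matches n_updates = 10 in the logs after the first rollout); `ppo_schedule` is a hypothetical helper, not a Stable-Baselines3 function. It also assumes that PPO always collects whole rollouts, so the actual number of steps is rounded up to a multiple of n_steps.

```python
import math


def ppo_schedule(total_timesteps, n_steps=2048, n_envs=1, n_epochs=10):
    """Estimate rollouts, actual timesteps, and final n_updates (hypothetical helper)."""
    steps_per_rollout = n_steps * n_envs
    # Whole rollouts are collected until the budget is reached or exceeded.
    iterations = math.ceil(total_timesteps / steps_per_rollout)
    actual_timesteps = iterations * steps_per_rollout
    # Each rollout is followed by n_epochs passes over the collected data.
    n_updates = iterations * n_epochs
    return iterations, actual_timesteps, n_updates


for budget in (10_000, 50_000):
    iterations, actual, updates = ppo_schedule(budget)
    print(f"budget={budget}: {iterations} rollouts, {actual} steps, {updates} updates")
```

Under these assumptions a 10,000-step budget gives the policy only 5 rollouts to learn from, which is consistent with the idea that it is too short for the CNN to pull ahead.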

Explaining the difference in performance between MlpPolicy and CnnPolicy:

- MlpPolicy works well for low-dimensional, vector-based observations (e.g., stock prices, sensor readings, or tabular data).
- Hence, it is more suitable for environments where observations are numerical feature vectors rather than images. Space Invaders provides raw pixel images as observations, which an MLP cannot interpret well.
- Also, an MLP (Multi-Layer Perceptron) flattens the image, so it does not process spatial features well.
- Over 50,000 steps, the MLP policy has a mean reward of 287 while the CNN policy has 291.33.

**Fun fact:** Without convolutional layers, the model struggles to detect enemy positions, the player location, or bullet trajectories. Convolutional networks are known to work well on Atari games, as shown in DeepMind’s DQN paper.
- So we choose CnnPolicy to extract spatial features from images and detect patterns like enemies, bullets, and player movement effectively.
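A quick parameter count illustrates the point about spatial features. The sketch below assumes the standard 210 x 160 RGB Atari frame (check env.observation_space.shape to confirm), a dense first layer with 64 units, and a convolutional first layer with 32 filters of size 8x8. The layer sizes are assumptions chosen for illustration and may differ from what a given policy actually builds.

```python
# Assumed Atari frame size: (height, width, channels).
height, width, channels = 210, 160, 3

# Dense layer on the flattened frame, assuming 64 hidden units.
hidden_units = 64
flat_inputs = height * width * channels
dense_params = flat_inputs * hidden_units + hidden_units  # weights + biases

# Convolutional layer, assuming 32 filters of size 8x8.
filters, kernel = 32, 8
conv_params = filters * channels * kernel * kernel + filters  # weights + biases

print(f"flattened inputs:  {flat_inputs:,}")
print(f"dense first layer: {dense_params:,} parameters")
print(f"conv first layer:  {conv_params:,} parameters")
```

The dense layer needs a separate weight for every pixel and unit, while the convolutional layer reuses the same small filters across the whole frame, so it can recognize an enemy or bullet wherever it appears.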

**Step-3: Test the model performance**

In [20]:
import cv2
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage
from stable_baselines3.common.evaluation import evaluate_policy

def test_model(env, episodes=10, PATH_SAVED="SpaceInvaders_CNN"):
    # Wrap the environment correctly (without VecFrameStack)
    env = DummyVecEnv([lambda: env])
    env = VecTransposeImage(env)  # Ensures correct shape for CNN input

    # Load the saved model
    model = PPO.load(PATH_SAVED, env=env)

    # Reset the environment and get the first observation
    obs = env.reset()

    # Get frame size for saving video
    frame = env.render(mode="rgb_array")
    frame_size = (frame.shape[1], frame.shape[0])  
    output_file = f'{PATH_SAVED}-video.mp4'  
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')  
    fps = 30.0  
    video_writer = cv2.VideoWriter(output_file, fourcc, fps, frame_size)
    
    # Loop through episodes
    total_score = 0
    for i in range(episodes):
        obs = env.reset()  # Reset environment for each episode
        done = False
        score = 0
        while not done:
            frame = env.render(mode="rgb_array")  # Get frame for video
            action, _ = model.predict(obs)  # Get action from the model
            obs, reward, done, _ = env.step(action)  # Take action in environment
            frame_bgr = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)  # Convert frame to BGR (for OpenCV)
            video_writer.write(frame_bgr)  # Write frame to video
            score += reward  # Keep track of the score
        
        # Display score for each episode
        print(f'Episode {i + 1} - Score: {score}')
        total_score += score
    
    # Calculate and print the mean score for all episodes
    mean_score = total_score / episodes
    print(f'Mean Score across {episodes} episodes: {mean_score}')
    
    # Finalize the video file and release the environment
    video_writer.release()
    env.close()

    # Return the mean so the caller can store it
    return mean_score

# Create and wrap the environment properly
env = gym.make('SpaceInvaders-v4', render_mode='rgb_array')

# Test the CNN policy
cnn_mean_score = test_model(env, episodes=10, PATH_SAVED='SpaceInvaders_CNN')

Episode 1 - Score: [380.]
Episode 2 - Score: [500.]
Episode 3 - Score: [170.]
Episode 4 - Score: [220.]
Episode 5 - Score: [185.]
Episode 6 - Score: [175.]
Episode 7 - Score: [460.]
Episode 8 - Score: [165.]
Episode 9 - Score: [80.]
Episode 10 - Score: [240.]
Mean Score across 10 episodes: [257.5]


The CNN policy model gave us a mean reward of 257.5 across the 10 episodes. For reference, let's also check how the MLP policy performs, since their training scores weren't very different. Training for more steps, such as 100,000 or more, might show a clearer distinction between their performances.

In [21]:
import cv2
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecTransposeImage
from stable_baselines3.common.evaluation import evaluate_policy

def test_model(env, episodes=10, PATH_SAVED="SpaceInvaders_MLP"):
    # Wrap the environment correctly (without VecFrameStack)
    env = DummyVecEnv([lambda: env])
    env = VecTransposeImage(env)  # Ensures correct shape for CNN input

    # Load the saved model
    model = PPO.load(PATH_SAVED, env=env)

    # Reset the environment and get the first observation
    obs = env.reset()

    # Get frame size for saving video
    frame = env.render(mode="rgb_array")
    frame_size = (frame.shape[1], frame.shape[0])  
    output_file = f'{PATH_SAVED}-video.mp4'  
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')  
    fps = 30.0  
    video_writer = cv2.VideoWriter(output_file, fourcc, fps, frame_size)
    
    # Loop through episodes
    total_score = 0
    for i in range(episodes):
        obs = env.reset()  # Reset environment for each episode
        done = False
        score = 0
        while not done:
            frame = env.render(mode="rgb_array")  # Get frame for video
            action, _ = model.predict(obs)  # Get action from the model
            obs, reward, done, _ = env.step(action)  # Take action in environment
            frame_bgr = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)  # Convert frame to BGR (for OpenCV)
            video_writer.write(frame_bgr)  # Write frame to video
            score += reward  # Keep track of the score
        
        # Display score for each episode
        print(f'Episode {i + 1} - Score: {score}')
        total_score += score
    
    # Calculate and print the mean score for all episodes
    mean_score = total_score / episodes
    print(f'Mean Score across {episodes} episodes: {mean_score}')
    
    # Finalize the video file and release the environment
    video_writer.release()
    env.close()

    # Return the mean so the caller can store it
    return mean_score

# Create and wrap the environment properly
env = gym.make('SpaceInvaders-v4', render_mode='rgb_array')

# Test the MLP policy
mlp_mean_score = test_model(env, episodes=10, PATH_SAVED='SpaceInvaders_MLP')

Episode 1 - Score: [90.]
Episode 2 - Score: [70.]
Episode 3 - Score: [210.]
Episode 4 - Score: [75.]
Episode 5 - Score: [135.]
Episode 6 - Score: [110.]
Episode 7 - Score: [210.]
Episode 8 - Score: [110.]
Episode 9 - Score: [135.]
Episode 10 - Score: [210.]
Mean Score across 10 episodes: [135.5]


**The CNN policy performed better in this test, with a mean score of 257.5 over 10 episodes, compared to the MLP policy, which gave a mean score of only 135.5.**
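The two means can be reproduced from the per-episode scores printed above, and looking at the spread as well gives a sense of how noisy a 10-episode comparison is. This is a minimal sketch using the standard library; the dictionary layout and names are illustrative.

```python
import statistics

# Per-episode scores copied from the two test outputs above.
scores = {
    "CNN": [380, 500, 170, 220, 185, 175, 460, 165, 80, 240],
    "MLP": [90, 70, 210, 75, 135, 110, 210, 110, 135, 210],
}

# Mean and sample standard deviation for each policy.
summary = {
    name: (statistics.mean(vals), statistics.stdev(vals))
    for name, vals in scores.items()
}

for name, (mean, std) in summary.items():
    low, high = min(scores[name]), max(scores[name])
    print(f"{name}: mean={mean:.1f}  sample std={std:.1f}  range={low}-{high}")
```

The CNN scores range from 80 to 500, so individual episodes overlap heavily with the MLP range of 70 to 210; more evaluation episodes would make the comparison more reliable.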

- We also have an output mp4 video file for each model, showing the gameplay of our agent trained with the MLP and CNN policies!