<a href="https://colab.research.google.com/github/ezzeddinegasmi/DRL_comparative_study/blob/main/PPO_Mard_12_h.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import sys

IN_COLAB = "google.colab" in sys.modules

if IN_COLAB:
    !pip install gymnasium==1.0.0



In [2]:
%pip install "stable-baselines3[extra]" gymnasium wandb numpy



In [3]:
import random
from typing import List, Tuple

import gymnasium as gym
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from IPython.display import clear_output
from torch.distributions import Normal

In [6]:
!git clone https://github.com/Neilus03/DRL_comparative_study

In [None]:
BreakOut_sb3_PPO

In [None]:
BreakOut_sb3_PPO/README.md

In [None]:
# Breakout Reinforcement Learning Implementations

Welcome to the Breakout Reinforcement Learning Implementations repository. This project includes various implementations of Reinforcement Learning algorithms applied to the classic Breakout game environment. It aims to provide reproducible results for the research presented in our paper _A COMPARATIVE STUDY OF DEEP REINFORCEMENT
LEARNING MODELS: DQN VS PPO VS A2C_.

## 🔩 Directory Structure

The repository is organized into the following directories, each containing a specific approach to solving Breakout using different RL algorithms:

- **`BreakOut_base`**:
  - *Description*: Base implementation of the Deep Q-Network (DQN) for the Breakout game.

- **`BreakOut_sb3_A2C`**:
  - Implementation of the Advantage Actor-Critic (A2C) algorithm using Stable Baselines 3.

- **`BreakOut_sb3_PPO`**:
  - Proximal Policy Optimization (PPO) approach using Stable Baselines 3.

- **`Breakout_sb3_DQN`**:
  - Implementation of DQN using Stable Baselines 3.

*Details*: The README in this folder will guide you through the necessary steps for training and testing.

## 📦 Installation

Ensure Python 3.x is installed along with the following dependencies, which are common across all implementations:

```bash
pip install -r requirements.txt
```

## 🎮 Usage

1. **Clone & Set Up** this repository:
    ```bash
    git clone https://github.com/Neilus03/DRL_comparative_study
    cd DRL_comparative_study
    pip install -r requirements.txt
    ```

2. **Navigate** to the desired implementation directory, e.g., for PPO:
    ```bash
    cd BreakOut_sb3_PPO
    ```

3. **Configure** the `config.py` file to adjust the training hyperparameters, the wandb account, and the model-saving options.

4. **Execute** the training script (`train_PPO.py` in the PPO folder) to train the model, or the test script (`test_PPO.py`) to evaluate a pre-trained model.

## 👥 Contributing

We welcome contributions! If you have suggestions or improvements, feel free to create a pull request or open an issue.

## 📧 Contact

For any questions or inquiries, please reach out to:

[Daniel Vidal](https://www.linkedin.com/in/daniel-alejandro-vidal-guerra-21386b266/)

[Neil de la Fuente](https://www.linkedin.com/in/neil-de-la-fuente/)

---

Thank you for visiting! We hope you find this repository useful for reproducing the results presented in our paper.


In [None]:
BreakOut_sb3_PPO/config.py

In [None]:

import os
from stable_baselines3.common.utils import get_latest_run_id
import torch


'''FILE TO STORE ALL THE CONFIGURATION VARIABLES'''

#pretrained is a boolean that indicates if a pretrained model will be loaded
pretrained = False # Set to True if you want to load a pretrained model

#check_freq is how often the callback evaluates the mean reward, counted in callback calls (one call per step of the vectorized environment); here it runs the check every 2000 calls
check_freq = 2000

#save_path is the path where the best model will be saved
save_path = "./breakout_ppo_1M_save_path"

#log_dir is the path where the logs will be saved
log_dir = "./log_dir"


'''
Hyperparameters of the model {learning_rate, gamma, device, n_steps, batch_size, n_epochs, gae_lambda, clip_range, clip_range_vf, ent_coef, vf_coef, max_grad_norm, use_sde, sde_sample_freq, target_kl, normalize_advantage}
'''
#policy is the policy of the model, in this case, the model will use a convolutional neural network
policy = "CnnPolicy"

#learning_rate is the learning rate of the model
#Trial history: 1st 5e-4; 2nd 1e-4; 3rd 1e-3; 4th 5e-5; 5th 5e-5 with gamma = 0.90; 6th 1e-4 with gamma = 0.90; 7th 5e-4 with gamma = 0.90
learning_rate = 5e-4

#gamma is the discount factor
#NOTE: the run and file names below end in "gamma_90" while gamma is set to 0.99 here; keep the two consistent when changing either
gamma = 0.99

#device is the device where the model will be trained, if cuda is available, the model will be trained in the gpu, otherwise, it will be trained in the cpu
device = "cuda" if torch.cuda.is_available() else "cpu"

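#n_steps is the number of steps collected from each environment before every update, so one rollout holds n_steps * n_envs transitions (24 * 4 = 96 here)
n_steps = 24

#batch_size is the minibatch size used in each gradient step; here it equals the rollout size, so each epoch is a single minibatch
batch_size = 96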

#n_epochs is the number of epochs when optimizing the surrogate loss
n_epochs = 6

#gae_lambda is the lambda parameter of the generalized advantage estimation, set to 1 to disable it
gae_lambda = 0.95

#clip_range is the clipping parameter of the surrogate loss
clip_range = 0.2

#clip_range_vf is the clipping parameter of the value function
clip_range_vf = 1

#ent_coef is the entropy coefficient, set to 0 to disable it
ent_coef = 0.01

#vf_coef is the weight of the value function loss in the total loss: loss = policy_loss + ent_coef * entropy_loss + vf_coef * value_loss
vf_coef = 0.5

#max_grad_norm is the maximum value for the gradient clipping
max_grad_norm = 0.5

#use_sde is a boolean that indicates if generalized State-Dependent Exploration (gSDE) will be used instead of independent action noise
#gSDE is an exploration method for continuous action spaces, so it stays disabled for Breakout's discrete actions
use_sde = False

#sde_sample_freq is how often (in steps) a new noise matrix is sampled when gSDE is used. If set to -1, the noise is sampled only at the beginning of each rollout
sde_sample_freq = -1

#rollout_buffer_class is the class of the rollout buffer. If None, the model uses the default RolloutBuffer class
rollout_buffer_class = None

#rollout_buffer_kwargs is a dictionary with the keyword arguments for the rollout buffer. If None, it will use the default arguments
rollout_buffer_kwargs = None

#target_kl is the limit on the KL divergence between the old and updated policy; the update epochs for a rollout stop early once the approximate KL exceeds 1.5 * target_kl
target_kl = 0.5

#normalize_advantage is a boolean that indicates if the advantages will be normalized.
# The advantages scale the policy loss, so normalizing them keeps that loss on a consistent scale and tends to stabilize training
normalize_advantage = False

#stats_window_size is the number of most recent episodes over which the logged rollout statistics (mean episode reward and mean episode length) are averaged
stats_window_size = 100

#tensorboard_log is the path where the tensorboard logs will be saved, in our case, the logs will be saved in the log_dir
tensorboard_log = log_dir

#policy_kwargs is a dictionary with the keyword arguments for the policy. If None, it will use the default arguments
policy_kwargs = None

#verbose is the verbosity level: 0 no output, 1 info, 2 debug
verbose = 2

#seed is the seed for the pseudo random number generator used by the model. It is set to None to use a random seed,
# and set to 0 to use a fixed seed for reproducibility
seed = None

#_init_setup_model is a boolean that indicates if the model will be initialized after being created, set to True to initialize the model
_init_setup_model = True

#total_timesteps is the total number of environment steps used for training, summed over all parallel environments. In this case, the model will be trained for 3e7 timesteps
#Take into account that the number of timesteps is not the number of episodes: each timestep is one action taken by the agent
# (with the default AtariWrapper frame skip of 4, one action spans 4 emulator frames).
#Assuming an average game lasts about 1000 timesteps, 3e7 timesteps is roughly 30,000 games.
total_timesteps = int(3e7)

#log_interval is the number of rollouts (training iterations) between each log for on-policy algorithms such as PPO; here the training process is logged every 100 rollouts, i.e. every 100 * n_steps * n_envs = 9600 timesteps.
log_interval = 100

'''
Saved model path
'''

#for the path to be shorter, use a relative path like the one below instead of the full path
saved_model_path = "./PPO_Breakout_30M_lr_5e-4_gamma_90.zip"
unzip_file_path =  "./PPO_Breakout_30M_lr_5e-4_gamma_90_unzipped"

'''
Environment variables
'''
#n_stack is the number of frames stacked together to form the input to the model
n_stack = 4
#n_envs is the number of environments that will be run in parallel
n_envs = 4

'''
Wandb configuration
'''
#log_to_wandb is a boolean that indicates if the training process will be logged to wandb
log_to_wandb = False

# project is the name of the project in wandb
project_train = "BREAKOUT_SB3_BENCHMARK"
project_test = "breakout-PPO2-test"

#entity is the name of the team in wandb
entity = "ai42"

#name is the name of the run in wandb
name_train = "PPO_breakout_lr_5e-4_gamma_90"
name_test = "PPO2_breakout_test"
#notes is a description of the run
notes = "PPO2_breakout with parameters: {}".format(locals()) #locals() at module level returns every name defined so far in this file (imported modules included), so the notes capture all the settings above
#sync_tensorboard is a boolean that indicates if the tensorboard logs will be synced to wandb
sync_tensorboard = True


'''
Test configuration
'''
test_episodes = 100
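
The relationship between `n_steps`, `n_envs`, `batch_size`, `n_epochs`, `total_timesteps` and `log_interval` is easy to misread, so here is a minimal sketch of the arithmetic. It only copies the values from `config.py` above and assumes the usual on-policy behaviour of Stable Baselines 3 (one rollout collects `n_steps` transitions per environment). The derived variable names are illustrative and not part of the repository.

```python
# Values copied from config.py above.
n_steps = 24
n_envs = 4
batch_size = 96
n_epochs = 6
total_timesteps = int(3e7)
log_interval = 100

# One rollout collects n_steps transitions from each parallel environment.
rollout_size = n_steps * n_envs

# Each epoch splits the rollout into minibatches of batch_size samples.
minibatches_per_epoch = rollout_size // batch_size
gradient_steps_per_rollout = minibatches_per_epoch * n_epochs

# Number of rollouts (training iterations) needed to reach total_timesteps.
n_rollouts = -(-total_timesteps // rollout_size)  # ceiling division

# For on-policy algorithms, log_interval counts rollouts, not timesteps.
timesteps_between_logs = log_interval * rollout_size

# Rough game count, using the author's estimate of ~1000 timesteps per game.
approx_games = total_timesteps // 1000

print(rollout_size, minibatches_per_epoch, gradient_steps_per_rollout)
print(n_rollouts, timesteps_between_logs, approx_games)
```

With these settings each update sees only 96 transitions, which is small for Atari; whether that is intended is a tuning decision for the authors, not something this sketch can settle.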


In [None]:
BreakOut_sb3_PPO/utils.py

In [None]:
import gymnasium as gym
from stable_baselines3.common.atari_wrappers import AtariWrapper
import os
import zipfile
from stable_baselines3.common.callbacks import BaseCallback
import numpy as np
import wandb
from stable_baselines3.common.vec_env import VecFrameStack, DummyVecEnv

'''
The RewardLogger wrapper is used to log the rewards of each episode to wandb
It makes sure that the rewards of each episode are stored in a list and that the current episode reward is reset
'''
class RewardLogger(gym.Wrapper):
    def __init__(self, env):
        super(RewardLogger, self).__init__(env)
        # Store the rewards of each episode
        self.episode_rewards = []
        # Store the current episode reward
        self.current_episode_reward = 0

    # The step function is called every time the agent takes an action in the environment
    def step(self, action):
        # Call the step function of the environment and store the results
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Update the current episode reward
        self.current_episode_reward += reward
        # If the episode ended (terminated or truncated), store the episode reward and reset the current episode reward
        if terminated or truncated:
            self.episode_rewards.append(self.current_episode_reward)
            self.current_episode_reward = 0
        # Return the results as in a normal step function
        return obs, reward, terminated, truncated, info

    # The reset function is called every time the environment is reset (at the beginning of each episode)
    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    # The get_episode_rewards function returns the rewards of each episode
    def get_episode_rewards(self):
        return self.episode_rewards

'''
The CustomWandbCallback is a callback that logs the mean reward of the last 100 episodes to wandb.
A callback is an object whose _on_step method is called after every step of the vectorized environment during training;
in this case, every check_freq calls it logs the mean reward of the last 100 episodes to wandb.
'''
class CustomWandbCallback(BaseCallback):
    def __init__(self, check_freq, save_path, verbose=1):
        super(CustomWandbCallback, self).__init__(verbose)
        # Define the frequency at which the callback is called
        self.check_freq = check_freq
        # Define the path where the best model will be saved
        self.save_path = save_path
        # Define the best mean reward as -inf
        self.best_mean_reward = -np.inf


    def _on_step(self) -> bool:
        '''
        The _on_step function is called after every step of the vectorized environment.
        Returning True continues training; returning False would stop it early.
        Every check_freq calls, it computes the mean reward of the last 100 episodes and logs it to wandb.
        It also saves the model if the mean reward is greater than the best mean reward.
        '''
        # Check if the number of calls to the callback is a multiple of the check_freq
        if self.n_calls % self.check_freq == 0:
            # Gather rewards from every environment in the vectorized environment
            # NOTE: only environments wrapped in RewardLogger (e.g. those built with make_env) contribute rewards here
            all_rewards = []
            for env in self.training_env.envs: # self.training_env is the vectorized environment
                # If the entry is itself a DummyVecEnv, take its first environment; otherwise use it directly
                logger_env = env.envs[0] if isinstance(env, DummyVecEnv) else env
                #Check if the logger_env is the RewardLogger wrapper
                if isinstance(logger_env, RewardLogger):
                    # If it is, get the rewards of each episode and store them in all_rewards
                    all_rewards.extend(logger_env.get_episode_rewards())

            #If there are rewards in all_rewards, compute the mean reward of the last 100 episodes and log it to wandb
            if all_rewards:
                # Compute the mean reward of the last 100 episodes
                mean_reward = np.mean(all_rewards[-100:])
                # Log the mean reward to wandb, but only if a wandb run is active
                if wandb.run is not None:
                    wandb.log({'mean_reward': mean_reward, 'steps': self.num_timesteps})

                # Save the best model
                if mean_reward > self.best_mean_reward:
                    self.best_mean_reward = mean_reward
                    self.model.save(os.path.join(self.save_path, 'best_model'))
        # Always return True so that training continues
        return True


def make_env(env_id, seed=0):
    '''
    Function for creating the environment with the correct wrappers and rendering.
    '''
    def _init():
        # Create the environment with render mode set to human
        env = gym.make(env_id, render_mode='human')
        # Seed the environment to make the results reproducible (Gymnasium seeds through reset; env.seed() no longer exists)
        env.reset(seed=seed)
        # Wrap the environment with the AtariWrapper which wraps the environment correctly
        env = AtariWrapper(env)
        # Wrap the environment with the RewardLogger wrapper which logs the rewards of each episode to wandb
        env = RewardLogger(env)
        # Return the environment
        return env
    # Return the _init function which is used to create the environment
    return _init


def unzip_file(zip_path, extract_to_folder):
    """
    Unzips a zip file to a specified folder.

    Args:
    zip_path (str): The path to the zip file.
    extract_to_folder (str): The folder to extract the files to.
    """
    # Ensure the target folder exists
    os.makedirs(extract_to_folder, exist_ok=True)
    # Extract the zip file to the target folder
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_to_folder)
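
The bookkeeping that `RewardLogger` and `CustomWandbCallback` share (accumulate rewards, close an episode, average the most recent returns) can be illustrated without Gymnasium or wandb. The sketch below is a simplified stand-in, not the repository's code: `EpisodeReturnTracker` and its methods are hypothetical names, and it assumes an episode ends when either the `terminated` or the `truncated` flag is set.

```python
from collections import deque


class EpisodeReturnTracker:
    """Hypothetical helper mirroring RewardLogger's bookkeeping."""

    def __init__(self, window=100):
        # Keep only the most recent `window` episode returns.
        self.returns = deque(maxlen=window)
        self.current = 0.0

    def record(self, reward, terminated, truncated):
        # Accumulate the reward for the episode in progress.
        self.current += reward
        # An episode ends on termination OR truncation (e.g. a time limit).
        if terminated or truncated:
            self.returns.append(self.current)
            self.current = 0.0

    def mean_return(self):
        # None signals that no episode has finished yet.
        if not self.returns:
            return None
        return sum(self.returns) / len(self.returns)


tracker = EpisodeReturnTracker(window=2)
# Three short episodes with returns 1.0, 3.0 and 5.0.
steps = [
    (1.0, True, False),
    (1.0, False, False), (2.0, False, True),
    (5.0, True, False),
]
for reward, terminated, truncated in steps:
    tracker.record(reward, terminated, truncated)

print(tracker.mean_return())  # mean of the last two returns, 3.0 and 5.0
```

Using a bounded `deque` keeps memory constant over long runs, whereas a plain list of every episode reward grows for the whole training session.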

In [None]:
BreakOut_sb3_PPO/train_PPO.py

In [None]:
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack, DummyVecEnv
from stable_baselines3.common.atari_wrappers import AtariWrapper
import gymnasium as gym
import torch
import config
import wandb
from wandb.integration.sb3 import WandbCallback
from utils import make_env, unzip_file, CustomWandbCallback, RewardLogger
import os
from stable_baselines3.common.utils import get_latest_run_id

'''
Set up the appropriate directories for logging and saving the model
'''
os.makedirs(config.log_dir, exist_ok=True)
os.makedirs(config.save_path, exist_ok=True)

#Create the callback that logs the mean reward of the last 100 episodes to wandb
custom_callback = CustomWandbCallback(config.check_freq, config.save_path)


'''
Set up logging to wandb
'''

#Set wandb to log the training process
if config.log_to_wandb:
    wandb.init(project=config.project_train, entity = config.entity, name=config.name_train, notes=config.notes, sync_tensorboard=config.sync_tensorboard)
    #wandb_callback is a callback that logs the training process to wandb, this is done because wandb.watch() does not work with sb3
    wandb_callback = WandbCallback()


'''
Set up the environment
'''
# Create multiple environments and wrap them correctly
env = make_atari_env("BreakoutNoFrameskip-v4", n_envs=config.n_envs, seed=config.seed)
env = VecFrameStack(env, n_stack=config.n_stack)


'''
Set up the model
'''
#Create the model with the parameters specified in config.py, go to config.py to see the meaning of each parameter in detail
model = PPO(policy=config.policy
            ,env=env
            ,learning_rate=config.learning_rate
            ,n_steps=config.n_steps
            ,batch_size=config.batch_size
            ,n_epochs=config.n_epochs
            ,gamma=config.gamma
            ,gae_lambda=config.gae_lambda
            ,clip_range=config.clip_range
            ,clip_range_vf=config.clip_range_vf
            ,normalize_advantage=config.normalize_advantage
            ,ent_coef=config.ent_coef
            ,vf_coef=config.vf_coef
            ,max_grad_norm=config.max_grad_norm
            ,use_sde=config.use_sde
            ,sde_sample_freq=config.sde_sample_freq
            #,rollout_buffer_class=config.rollout_buffer_class
            #,rollout_buffer_kwargs=config.rollout_buffer_kwargs
            ,target_kl=config.target_kl
            ,stats_window_size=config.stats_window_size
            ,tensorboard_log=config.log_dir
            ,policy_kwargs=config.policy_kwargs
            ,verbose=config.verbose
            ,seed=config.seed
            ,device=config.device
            ,_init_setup_model=config._init_setup_model
            )

print("model in device: ", model.device)

#Load the model if config.pretrained is set to True in config.py
if config.pretrained:
    model = PPO.load(config.saved_model_path, env=env, verbose=config.verbose, tensorboard_log=config.log_dir)
    #Unzip the saved model archive into config.unzip_file_path. PPO.load above already restores the policy and optimizer state, so the explicit loading below repeats that step
    unzip_file(config.saved_model_path, config.unzip_file_path)
    model.policy.load_state_dict(torch.load(os.path.join(config.unzip_file_path, "policy.pth")))
    model.policy.optimizer.load_state_dict(torch.load(os.path.join(config.unzip_file_path, "policy.optimizer.pth")))



'''
Train the model and save it
'''
#model.learn trains the model for config.total_timesteps timesteps; a timestep is one action taken by the agent in one environment,
# so the total is roughly the number of timesteps in 1 game multiplied by the number of games played.
#Assuming an average game lasts about 1000 timesteps, 3e7 timesteps is roughly 30,000 games.
#log_interval is the number of rollouts (training iterations) between each log; here the training process is logged every 100 rollouts.
#callback is the callback (or list of callbacks) run during training; WandbCallback handles the wandb logging because wandb.watch() does not work with sb3

if config.log_to_wandb:
    model.learn(total_timesteps=config.total_timesteps, log_interval=config.log_interval, callback=[wandb_callback, custom_callback], progress_bar=True)
else:
    model.learn(total_timesteps=config.total_timesteps, log_interval=config.log_interval, callback=custom_callback, progress_bar=True)
#Save the model
model.save(config.saved_model_path[:-4]) #remove the .zip extension from the path


'''
Close the environment and finish the logging
'''
env.close()
if config.log_to_wandb:
    wandb.finish()
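
The training script trims the extension with `config.saved_model_path[:-4]` before saving. That slice assumes the path always ends in `.zip`; if it does not, four characters of the name are silently lost. A minimal sketch of a safer variant follows, where `strip_zip_suffix` is a hypothetical helper and not part of the repository.

```python
def strip_zip_suffix(path: str) -> str:
    """Hypothetical helper: drop a trailing '.zip', leave other paths alone."""
    suffix = ".zip"
    return path[: -len(suffix)] if path.endswith(suffix) else path


# Same result whether or not the configured path carries the extension.
with_ext = strip_zip_suffix("./PPO_Breakout_30M_lr_5e-4_gamma_90.zip")
without_ext = strip_zip_suffix("./PPO_Breakout_30M_lr_5e-4_gamma_90")
print(with_ext)
print(without_ext)
```

An explicit suffix check is used rather than `pathlib.Path.with_suffix("")` because run names that contain a dot (for example a learning rate written as `0.001`) would otherwise be truncated at that dot.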

In [None]:
BreakOut_sb3_PPO/test_PPO.py

In [None]:
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import VecFrameStack, DummyVecEnv
import torch
import wandb
import os
import tensorboard as tb
import config
from utils import make_env, unzip_file


'''
Set up wandb
'''
if config.log_to_wandb:
    wandb.init(project=config.project_test, entity=config.entity, sync_tensorboard=config.sync_tensorboard, name=config.name_test, notes=config.notes)


'''
Set up the environment and the model to test
'''
# Unzip the saved model if config.pretrained is set to True in config.py
if config.pretrained:
    #Unzip the saved model archive (config.saved_model_path) into the folder config.unzip_file_path
    unzip_file(config.saved_model_path, config.unzip_file_path)

#We start with a single environment for Breakout with render mode set to human
env = make_env("BreakoutNoFrameskip-v4")
#We then wrap the environment with the DummyVecEnv wrapper which converts the environment to a single vectorized environment
env = DummyVecEnv([env]) # Observation shape: (1, 84, 84, 1)
#Finally, we wrap the environment with the VecFrameStack wrapper which stacks the observations over the last 4 frames
env = VecFrameStack(env, n_stack=config.n_stack) # Observation shape: (1, 84, 84, 4)

# Create the model
model = PPO(policy = config.policy
            ,env = env
            ,verbose = config.verbose)

# Load the model if config.pretrained is set to True in config.py
if config.pretrained:
    # Load the model components, including the policy network and the value network
    model.policy.load_state_dict(torch.load(os.path.join(config.unzip_file_path, "policy.pth")))
    model.policy.optimizer.load_state_dict(torch.load(os.path.join(config.unzip_file_path, "policy.optimizer.pth")))


'''
Test the model in the environment and log the results to wandb
'''
# Run the episodes and render the gameplay
for episode in range(config.test_episodes):
    # Reset the environment; VecFrameStack builds the initial frame stack itself
    obs = env.reset()
    done = False
    episode_reward = 0
    while not done:
        # Take an action in the environment according to the policy of the trained agent
        action, _ = model.predict(obs)
        # Take the action in the environment and store the results in the variables
        obs, reward, done, info = env.step(action) # obs shape: (1, 84, 84, 4); reward, done and info each hold one entry per environment
        # Update the total reward
        episode_reward += reward[0]
        # Render the environment to visualize the gameplay of the trained agent
        env.render()
    if config.log_to_wandb:
        # Log the total reward of the episode to wandb
        wandb.log({'test_episode_reward': episode_reward, 'test_episode': episode})


'''
Close the environment and finish the logging
'''
env.close()
if config.log_to_wandb:
    wandb.finish()
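
The test loop logs one reward per episode. For comparing runs it is usually more convenient to report aggregate statistics over all `test_episodes`. The sketch below shows one way to do that with the standard library only; `summarize_returns` is a hypothetical helper, and the numbers passed to it are made up for illustration rather than results from the paper.

```python
from statistics import mean, pstdev


def summarize_returns(episode_returns):
    """Hypothetical helper: aggregate per-episode test rewards."""
    if not episode_returns:
        raise ValueError("need at least one episode return")
    return {
        "episodes": len(episode_returns),
        "mean": mean(episode_returns),
        "std": pstdev(episode_returns),  # population standard deviation
        "min": min(episode_returns),
        "max": max(episode_returns),
    }


# Made-up returns purely for illustration, not results from the paper.
summary = summarize_returns([2.0, 4.0, 4.0, 6.0])
print(summary)
```

In the test script, the per-episode `episode_reward` values could be appended to a list inside the loop and passed to such a helper once the loop finishes.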
