# Import Dependencies

The framework used in this notebook is Stable-Baselines3, a set of improved implementations of Reinforcement Learning (RL) algorithms based on OpenAI Baselines.

In [1]:
import gym 
from stable_baselines3 import A2C
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_atari_env
import os

# Load Environment

OpenAI Gym provides many simulated environments to train a model on. This notebook uses the Atari game Breakout.

What does the environment look like?

The game involves a wall of blocks, a ball, and a bat. If the ball hits a block, you score points and the block is removed. You move the bat along the bottom of the screen to keep the ball in play; letting the ball go out of play costs you one of your five lives.

In [2]:
environment_name = "Breakout-v0"

In [3]:
env = gym.make(environment_name)

A.L.E: Arcade Learning Environment (version 0.7.4+069f8bd)
[Powered by Stella]


Check the environment with random actions for now.

In [4]:
episodes = 5
for episode in range(1, episodes+1):
    state = env.reset()
    done = False
    score = 0 
    
    while not done:
        env.render()
        action = env.action_space.sample()
        n_state, reward, done, info = env.step(action)
        score+=reward
    print('Episode:{} Score:{}'.format(episode, score))
env.close()


Episode:1 Score:3.0
Episode:2 Score:3.0
Episode:3 Score:3.0
Episode:4 Score:0.0
Episode:5 Score:1.0


* An episode is one complete game played by the agent.

* Here, we run a loop of 5 episodes and reset the environment to its initial observation at the start of each one.

* `env.render()` displays the environment graphically.

* `env.action_space.sample()` picks a random action from the set of all possible actions.

* `env.step(action)` applies that randomly chosen action to the environment and returns four values: the next observation, the reward, a boolean `done` that says whether the episode has ended, and an `info` dictionary.

* When `done` is true, the while loop ends and we move on to the next episode.
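The loop described in the list above can be sketched independently of Atari. This is a minimal illustration, assuming the older Gym API used in this notebook, where `reset()` returns only an observation and `step()` returns four values. `CountdownEnv`, `sample_action`, and `run_random_episode` are hypothetical names introduced here; the stub stands in for a real environment so the snippet runs without Gym installed.

```python
import random


class CountdownEnv:
    """Hypothetical stand-in for a Gym environment (old 4-tuple step API)."""

    def __init__(self, horizon=5):
        self.horizon = horizon      # number of steps before the episode ends
        self.steps_left = horizon

    def reset(self):
        self.steps_left = self.horizon
        return self.steps_left      # initial observation

    def sample_action(self):
        # Plays the role of env.action_space.sample()
        return random.choice([0, 1])

    def step(self, action):
        self.steps_left -= 1
        reward = 1.0                # fixed reward per step, for illustration only
        done = self.steps_left == 0
        return self.steps_left, reward, done, {}


def run_random_episode(env):
    """Play one episode with random actions and return the total reward."""
    env.reset()
    done = False
    score = 0.0
    while not done:
        action = env.sample_action()
        _, reward, done, _ = env.step(action)
        score += reward
    return score
```

Calling `run_random_episode(CountdownEnv(horizon=5))` returns `5.0`, because the stub pays a reward of 1.0 on each of its five steps regardless of the action chosen.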

# Vectorise Environment and Train Model

Create 4 copies of the environment that run in parallel to train the model.

In [5]:
env = make_atari_env('Breakout-v0', n_envs=4, seed=0)

Stack the last 4 frames of each environment into a single observation.

In [6]:
env = VecFrameStack(env, n_stack=4)
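A single frame does not show which way the ball is moving, which is why several consecutive frames are stacked into one observation. The sketch below illustrates the idea with a fixed-length queue. It is a conceptual example only, not how `VecFrameStack` is implemented: the real wrapper works on image arrays for all parallel environments at once. `FrameStacker` and the zero padding at the start of an episode are assumptions made for this illustration.

```python
from collections import deque


class FrameStacker:
    """Keeps the most recent n_stack frames (conceptual sketch only)."""

    def __init__(self, n_stack=4, blank=0):
        self.n_stack = n_stack
        self.blank = blank          # placeholder used before real frames arrive
        self.frames = deque([blank] * n_stack, maxlen=n_stack)

    def reset(self, first_frame):
        # Start of an episode: blank history followed by the first frame
        self.frames = deque([self.blank] * self.n_stack, maxlen=self.n_stack)
        self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Appending to a full deque drops the oldest frame automatically
        self.frames.append(new_frame)
        return list(self.frames)
```

With `n_stack=4`, the observation after each step always contains the newest frame plus the three frames before it, oldest first.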

In [7]:
log_path = os.path.join('Training', 'Logs')

In [8]:
model = A2C("CnnPolicy", env, verbose=1, tensorboard_log=log_path)

Using cpu device
Wrapping the env in a VecTransposeImage.


In [9]:
#model.learn(total_timesteps=2000000)

# Load the pretrained Model

I will load a model that has already been trained for 2 million steps; it is also available in the repository.

In [10]:
env = make_atari_env('Breakout-v0', n_envs=1, seed=0)
env = VecFrameStack(env, n_stack=4)

In [11]:
a2c_path = os.path.join('A2C_2M_model')

In [12]:
model = A2C.load(a2c_path, env)

Wrapping the env in a VecTransposeImage.




# Evaluate and Test

In [13]:
evaluate_policy(model, env, n_eval_episodes=10, render=True)

(23.1, 9.782126558167198)

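The trained model performs much better than the random agent: over 10 evaluation episodes it reaches a mean score of 23.1 with a standard deviation of about 9.8, compared with scores between 0 and 3 from random actions. This is a reasonable result for 2 million training steps.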
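The pair returned by `evaluate_policy` is the mean and the standard deviation of the total reward per episode. The sketch below shows how such a pair can be computed from a list of episode totals. It assumes the population form of the standard deviation (dividing by the number of episodes); `summarize_rewards` is a hypothetical helper, and the reward values are made up for illustration, not results from this model.

```python
from statistics import fmean, pstdev


def summarize_rewards(episode_rewards):
    """Return (mean, population standard deviation) of per-episode rewards."""
    return fmean(episode_rewards), pstdev(episode_rewards)


# Made-up episode totals, for illustration only (not results from this model)
example_rewards = [10.0, 20.0, 30.0]
mean_reward, std_reward = summarize_rewards(example_rewards)
```

A large standard deviation relative to the mean, as in the evaluation above, means that individual games vary a lot, so averaging over more evaluation episodes gives a steadier estimate.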