Last time, we looked at an example reinforcement learning problem that balanced an object in space, following [this tutorial](https://youtu.be/cO5g5qLrLSo).

In this lab, you will choose a reinforcement learning problem to explore. Here are some suggestions for problems that you can investigate.

1. Solving the Lunar Landing problem using a Stable Baselines algorithm: [tutorial](https://youtu.be/nRHjymV2PX8), [code on github](https://github.com/nicknochnack/StableBaselinesRL).
2. Solving one of three RL problems (Project Atari, autonomous driving, or building a custom environment): [tutorial](https://youtu.be/Mut_u40Sqz4), [code on github](https://github.com/nicknochnack/ReinforcementLearningCourse).
3. **(Advanced)** Datasets for Deep Data-Driven Reinforcement Learning (D4RL): [environments description](https://sites.google.com/view/d4rl/home), [code on github](https://github.com/rail-berkeley/d4rl).

**Note:** This is a rough guide to the main general steps in a reinforcement learning program. Please add more sections as your implementation requires, with comments describing each section.

# Problem description
In the text cell below, enter the problem that you chose to solve with reinforcement learning.

Project Atari (Breakout)

# **Build an RL environment**

Import packages

Note: Please inform the TA of any additional packages that you need to install for the problem that you selected.

In [3]:
import gym 
from stable_baselines3 import A2C
from stable_baselines3.common.vec_env import VecFrameStack
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.env_util import make_atari_env
import os
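
Before running the imports, it can help to confirm which packages are actually available, so that any missing ones can be reported to the TA. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the package list simply mirrors the imports above.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level packages from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level packages used by the imports in this notebook.
required = ["gym", "stable_baselines3"]
print("Missing packages:", missing_packages(required))
```

An empty list means both packages can be imported in the current environment.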

Create the environment.

In [4]:
# Gym IDs need a version suffix; a bare "Breakout" is rejected as a malformed environment ID.
environment_name = "Breakout-v0"
env = gym.make(environment_name)
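
Gym looks environments up by an ID that ends in a version suffix such as `-v0`; an ID without one is rejected as malformed. The sketch below applies the pattern quoted in Gym's malformed-ID error message to show which IDs pass. The helper `parse_env_id` is hypothetical and is not part of Gym.

```python
import re

# Pattern quoted from Gym's "malformed environment ID" error message.
ENV_ID_PATTERN = re.compile(r"^(?:[\w:-]+\/)?([\w:.-]+)-v(\d+)$")


def parse_env_id(env_id):
    """Return (name, version) for a well-formed ID, or None if malformed."""
    match = ENV_ID_PATTERN.match(env_id)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(parse_env_id("Breakout"))     # None: no version suffix
print(parse_env_id("Breakout-v0"))  # ('Breakout', 0)
```

A well-formed ID only passes the syntax check; whether the environment is actually registered still depends on the installed Gym and Atari packages.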

Test the environment with random choice.

In [2]:
# Trigger Ed's X display
!xdpyinfo

# Add your code here to display the environment with random choice
episodes = 5
for episode in range(1, episodes + 1):
    state = env.reset()  # start a new game
    done = False
    score = 0

    while not done:
        env.render()
        action = env.action_space.sample()  # pick a random action
        n_state, reward, done, info = env.step(action)
        score += reward
    print('Episode:{} Score:{}'.format(episode, score))
env.close()

# Inspect one random action and one random observation
env.action_space.sample()

env.observation_space.sample()


name of display:    :1.0
version number:    11.0
vendor string:    The X.Org Foundation
vendor release number:    12009000
X.Org version: 1.20.9
maximum request size:  16777212 bytes
motion buffer size:  256
bitmap unit, bit order, padding:    32, LSBFirst, 32
image byte order:    LSBFirst
number of supported pixmap formats:    6
supported pixmap formats:
    depth 1, bits_per_pixel 1, scanline_pad 32
    depth 4, bits_per_pixel 8, scanline_pad 32
    depth 8, bits_per_pixel 8, scanline_pad 32
    depth 16, bits_per_pixel 16, scanline_pad 32
    depth 24, bits_per_pixel 32, scanline_pad 32
    depth 32, bits_per_pixel 32, scanline_pad 32
keycode range:    minimum 8, maximum 255
focus:  PointerRoot
number of extensions:    23
    BIG-REQUESTS
    Composite
    DAMAGE
    DOUBLE-BUFFER
    GLX
    Generic Event Extension
    MIT-SCREEN-SAVER
    MIT-SHM
    Present
    RANDR
    RECORD
    RENDER
    SHAPE
    SYNC
    VNC-EXTENSION
    X-Resource
    

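
The random-choice test follows a fixed pattern: reset the environment, then repeatedly sample an action and step until the episode is done, accumulating the reward. The sketch below shows that loop on a toy environment so it can run without Atari ROMs or a display. `CoinFlipEnv` and `run_random_episode` are hypothetical names invented for illustration; the toy class only mimics the classic Gym `reset`/`step` return values and is not a real Gym environment.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment mimicking the classic Gym reset/step interface."""

    def __init__(self, max_steps=10, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0  # dummy observation

    def sample_action(self):
        return self.rng.choice([0, 1])

    def step(self, action):
        self.steps += 1
        reward = 1.0 if action == 1 else 0.0  # action 1 earns a point
        done = self.steps >= self.max_steps   # episode ends after max_steps
        return 0, reward, done, {}


def run_random_episode(env):
    """Play one episode with random actions and return the total reward."""
    env.reset()
    done = False
    score = 0.0
    while not done:
        action = env.sample_action()
        _obs, reward, done, _info = env.step(action)
        score += reward
    return score


toy_env = CoinFlipEnv(max_steps=10)
for episode in range(1, 4):
    print(f"Episode:{episode} Score:{run_random_episode(toy_env)}")
```

A random agent like this gives a baseline score against which the trained model can later be compared.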

# **Build and train the Model**

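Build the model

In [9]:
# Create four parallel Atari environments and stack the last four frames.
# The ID matches the one used when reloading the model for evaluation.
env = make_atari_env('Breakout-v0', n_envs=4, seed=0)
env = VecFrameStack(env, n_stack=4)

# Train an A2C agent with a CNN policy; TensorBoard logs go to Training/Logs.
log_path = os.path.join('Training', 'Logs')
model = A2C("CnnPolicy", env, verbose=1, tensorboard_log=log_path)
model.learn(total_timesteps=100000)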
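
`VecFrameStack(env, n_stack=4)` gives the agent the last four frames at each step, which lets a CNN policy infer motion (for example, the ball's direction) that a single frame cannot show. The following is a simplified sketch of that idea using a `deque`. It is not Stable Baselines3's implementation; the class name `FrameStacker`, the `empty_frame` padding, and the integer stand-ins for frames are assumptions made for illustration.

```python
from collections import deque


class FrameStacker:
    """Keep the most recent `n_stack` frames, oldest first."""

    def __init__(self, n_stack, empty_frame):
        self.n_stack = n_stack
        self.empty_frame = empty_frame
        self.frames = deque(maxlen=n_stack)

    def reset(self, first_frame):
        # Pad with placeholder frames so the stack always has n_stack entries.
        self.frames.clear()
        for _ in range(self.n_stack - 1):
            self.frames.append(self.empty_frame)
        self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(new_frame)
        return list(self.frames)


stacker = FrameStacker(n_stack=4, empty_frame=0)
print(stacker.reset(1))  # [0, 0, 0, 1]
for frame in (2, 3, 4, 5):
    stacked = stacker.step(frame)
print(stacked)           # [2, 3, 4, 5]
```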

# **Save and test the Model**

Edit the code as needed to save and test the model created in the previous sections.

In [14]:
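# Save the trained model, then reload it into a single evaluation environment.
a2c_path = os.path.join('Training', 'Saved Models', 'A2C_model')
model.save(a2c_path)
del model  # discard the in-memory model so the next lines use the saved copy
env = make_atari_env('Breakout-v0', n_envs=1, seed=0)
env = VecFrameStack(env, n_stack=4)
model = A2C.load(a2c_path, env)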
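
After saving, it is worth checking that the model file actually exists before relying on it. The sketch below builds the same path and reports whether the file is on disk. It assumes that Stable Baselines3 appends `.zip` when the save path has no extension; the helper `saved_model_file` is hypothetical and only encodes that assumption.

```python
import os


def saved_model_file(save_path):
    """Return the file name expected on disk for a given save path.

    Assumption: ".zip" is appended only when the path has no extension.
    """
    _root, ext = os.path.splitext(save_path)
    return save_path if ext else save_path + ".zip"


demo_path = os.path.join("Training", "Saved Models", "A2C_model")
model_file = saved_model_file(demo_path)
print(model_file, "exists:", os.path.exists(model_file))
```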


Evaluate and visualize the model

In [15]:
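# Add your code here to output the score and visualize the model.
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10, render=True)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")

# Watch the trained agent play until the environment reports done,
# so that the loop terminates and env.close() is reached.
obs = env.reset()
done = False
while not done:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()
    done = dones[0]  # single environment, so check its done flag

env.close()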
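
`evaluate_policy` summarizes performance as the mean and standard deviation of the total reward per episode. A minimal sketch of that summary is shown below, using the standard library instead of the evaluation helper. The function name `summarize_rewards` is hypothetical, and the reward values are made-up numbers for illustration, not results from this model.

```python
import statistics


def summarize_rewards(episode_rewards):
    """Return (mean, standard deviation) of per-episode total rewards."""
    mean = statistics.fmean(episode_rewards)
    std = statistics.pstdev(episode_rewards)  # population standard deviation
    return mean, std


# Made-up total rewards from five evaluation episodes (illustration only).
rewards_demo = [2.0, 4.0, 4.0, 4.0, 6.0]
demo_mean, demo_std = summarize_rewards(rewards_demo)
print(f"Mean reward: {demo_mean:.2f} +/- {demo_std:.2f}")
```

A large standard deviation relative to the mean suggests that more evaluation episodes are needed before comparing models.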

