# Reinforcement Learning with OpenAI Gym

## General Information

- __Machine Learning__: applications that learn from data and improve their accuracy over time without being explicitly programmed to do so.


- __Supervised Learning__: The machine is fed a dataset with output labels.
- __Unsupervised Learning__: The machine is fed a dataset without output labels.
- __Reinforcement Learning__: An agent interacts with an environment and takes actions, learning through a reward system.

## Key Ideas
- __Reward system__: There is no supervisor (labeled data) to say whether a prediction was right; the agent learns from a reward signal instead.
- __Delayed feedback__: Feedback arrives only after an action is taken, and sometimes much later (when you move right, you don't yet know whether that move will make you lose the game).
- __Agent taking actions__: The actions the agent takes strongly affect the data it receives next.

Reinforcement Learning AI makes a discovery: https://www.youtube.com/watch?v=meE5aaRJ0Zs&ab_channel=PatrykChrab%C4%85szcz

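A heads up that some Python versions may not work; I found Python 3.8 to work fine. The code below also relies on the classic Gym API, where `env.reset()` returns only the observation and `env.step()` returns four values. Gym 0.26 and later changed both, so an older Gym release is needed to run it unchanged.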

OpenAI Gym: https://gym.openai.com/

CartPole-v0: https://gym.openai.com/envs/CartPole-v0/

MsPacman-ram-v0: https://gym.openai.com/envs/MsPacman-ram-v0/

## How does Reinforcement Learning work?
![Reinforcement Learning Diagram](https://www.kdnuggets.com/images/reinforcement-learning-fig1-700.jpg)

### Key Terms
__Agent__: The system that interacts and acts on the environment (like a robot, or like you "reading the pixels" on the screen of a game and taking an action).

__Environment__: The world the agent acts in, which returns a new state and a reward after each action (ex: the game itself).

__State__: The state of the environment (ex: the pixel data at that timestep).

__Reward__: The feedback collected by taking action a in state s. The objective in reinforcement learning is to find a policy that maximizes the total reward. A reward can be the added score in a game, successfully turning a doorknob, or winning a game.

__Action__: A move made by the agent, chosen based on factors such as the current state and the rewards gained from previous actions.

__Policy__: Defines how an agent acts from a specific state (think of it as a guide when you're playing the game). Formally, it is a function that maps a given state to the probabilities of selecting each possible action from that state.

__Episode__: Playing out the whole sequence of states and actions until reaching a terminal state or a predefined number of actions (in the case of Pacman, playing until the agent runs out of lives).
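
The terms above fit together in one loop: the agent observes a state, its policy picks an action, and the environment answers with a new state and a reward until the episode ends. The sketch below illustrates this with a made-up toy environment; `CorridorEnv`, `random_policy`, `always_right` and `run_episode` are hypothetical names written for this example and are not part of Gym.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk from cell 0 to the last cell."""

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position  # the state is just the current cell

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the position.
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        self.steps += 1
        reached_goal = self.position == self.length - 1
        reward = 1.0 if reached_goal else 0.0
        # The episode ends at the goal (terminal state) or after max_steps.
        done = reached_goal or self.steps >= self.max_steps
        return self.position, reward, done


def random_policy(state, rng):
    """A policy maps a state to an action; this one ignores the state."""
    return rng.choice([0, 1])


def always_right(state, rng):
    """A fixed policy that always moves right."""
    return 1


def run_episode(env, policy, rng):
    """Play one episode and return (total reward, number of steps)."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(state, rng)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward, env.steps


rng = random.Random(0)
print(run_episode(CorridorEnv(), always_right, rng))  # (1.0, 4)
random_result = run_episode(CorridorEnv(), random_policy, rng)
```

The always-right policy reaches the goal in four steps, while the random policy may or may not get there before the step limit. The same gap separates random actions from a trained agent in CartPole.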

## Required Python libraries

The libraries below are required for the code that follows.

__TensorFlow__: free and open-source software library for machine learning

__Gym__: toolkit for developing and comparing reinforcement learning algorithms

__Keras__: open-source software library that provides a Python interface for artificial neural networks

__Keras-rl2__: implements some state-of-the-art deep reinforcement learning algorithms in Python and integrates with the deep learning library Keras.

In [1]:
!pip install tensorflow==2.3.0
!pip install gym
!pip install keras
!pip install keras-rl2




## Testing OpenAI Gym with Random Actions

In [2]:
import gym

In [3]:
env = gym.make('CartPole-v0')
states = env.observation_space.shape[0]
actions = env.action_space.n

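__States__:
`states` holds the size of one observation. Each observation is an array of 4 numbers describing the cart and the pole: cart position, cart velocity, pole angle, and pole angular velocity.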
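
To make the four numbers easier to read, an observation can be unpacked into named fields. This is a small sketch using only the standard library; the `CartPoleState` name is made up for this example, and the sample values are one observation of the kind the environment returns.

```python
from collections import namedtuple

# Field order follows CartPole's observation layout.
CartPoleState = namedtuple(
    "CartPoleState",
    ["cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity"],
)

# A sample observation (4 floats) as returned by env.reset() or env.step().
observation = [-0.04354854, 0.00750354, -0.03368896, 0.04902461]
state = CartPoleState(*observation)

print(state.cart_position)  # -0.04354854
print(state.pole_angle)     # -0.03368896
```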

In [4]:
states

4

__Actions__:
`actions` holds the number of possible actions. Gym numbers them from 0:
- `0`: Apply a force to the left.
- `1`: Apply a force to the right.

In [5]:
actions

2

In [6]:
num_episodes = 20
for episode in range(1, num_episodes + 1):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()

[-0.04354854  0.00750354 -0.03368896  0.04902461]
[-0.04339847 -0.18711955 -0.03270847  0.33089092]
[-0.04714086 -0.38176101 -0.02609065  0.61308237]
[-0.05477608 -0.57650882 -0.013829    0.89743489]
[-0.06630626 -0.38120217  0.0041197   0.60043732]
[-0.0739303  -0.57638151  0.01612844  0.89441504]
[-0.08545793 -0.38148195  0.03401675  0.60684527]
[-0.09308757 -0.5770626   0.04615365  0.91004553]
[-0.10462882 -0.38259463  0.06435456  0.63221856]
[-0.11228072 -0.18842675  0.07699893  0.36047683]
[-0.11604925  0.00552106  0.08420847  0.09303242]
[-0.11593883 -0.19070058  0.08606912  0.41104998]
[-0.11975284  0.00310248  0.09429012  0.14669456]
[-0.11969079  0.19675655  0.09722401 -0.11481597]
[-0.11575566  0.3903607   0.09492769 -0.3753119 ]
[-0.10794845  0.1940276   0.08742145 -0.05427132]
[-0.1040679  -0.002232    0.08633602  0.26466305]
[-0.10411254  0.19155837  0.09162929  0.00041146]
[-0.10028137 -0.00475008  0.09163751  0.32054017]
[-0.10037637  0.18895549  0.09804832  0.05810469]


In [7]:
num_episodes = 20
for episode in range(1, num_episodes + 1):
    observation = env.reset()
    score = 0
    for t in range(100):
        env.render()
        
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        score += reward
        if done:
            print("Episode {}: {} points".format(episode, score))
            break
env.close()

Episode 1: 24.0 points
Episode 2: 12.0 points
Episode 3: 10.0 points
Episode 4: 15.0 points
Episode 5: 11.0 points
Episode 6: 15.0 points
Episode 7: 30.0 points
Episode 8: 20.0 points
Episode 9: 85.0 points
Episode 10: 14.0 points
Episode 11: 42.0 points
Episode 12: 30.0 points
Episode 13: 17.0 points
Episode 14: 24.0 points
Episode 15: 19.0 points
Episode 16: 17.0 points
Episode 17: 10.0 points
Episode 18: 12.0 points
Episode 19: 22.0 points
Episode 20: 23.0 points
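
To put the random scores above in context, they can be summarized with the standard library. The list below is copied from the output above; the summary gives a baseline to compare a trained agent against.

```python
from statistics import mean

# Scores of the 20 random-action episodes printed above.
random_scores = [24, 12, 10, 15, 11, 15, 30, 20, 85, 14,
                 42, 30, 17, 24, 19, 17, 10, 12, 22, 23]

average = mean(random_scores)
print(f"mean={average}, min={min(random_scores)}, max={max(random_scores)}")
# mean=22.6, min=10, max=85
```

Random actions keep the pole up for about 23 steps on average, far short of the 200-step cap of CartPole-v0.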


## Building the Model

![Reinforcement Learning Diagram](https://miro.medium.com/max/1200/1*05ExQKJ0nOoWV80SNVEyJg.png)

In [8]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam

In [15]:
def build_model(states, actions):
    model = Sequential()
    model.add(Flatten(input_shape=(1, states)))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(actions, activation='linear'))
    return model

#### Flatten Layer
![Flatten Layer Diagram](https://data-flair.training/blogs/wp-content/uploads/sites/2/2020/07/Flatten-Layer-in-Keras-df.jpg)

#### Dense Layer
![Dense Layer Diagram](https://stackabuse.s3.amazonaws.com/media/deep-learning-in-keras-building-a-deep-learning-model-1.png)


In [17]:
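# Discard a model left over from an earlier run so it can be rebuilt (skip this on a first run, when no model exists yet).
del model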

In [18]:
model = build_model(states, actions)

In [19]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_1 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_3 (Dense)              (None, 24)                120       
_________________________________________________________________
dense_4 (Dense)              (None, 24)                600       
_________________________________________________________________
dense_5 (Dense)              (None, 2)                 50        
Total params: 770
Trainable params: 770
Non-trainable params: 0
_________________________________________________________________
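
The parameter counts in the summary can be checked by hand. A Dense layer has one weight per input-unit pair plus one bias per unit, and the Flatten layer has no parameters. A short sketch of that arithmetic (`dense_params` is a helper written for this example):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# 4 state values -> 24 units -> 24 units -> 2 actions
layer_sizes = [4, 24, 24, 2]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(counts)       # [120, 600, 50]
print(sum(counts))  # 770
```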


## Building the Reinforcement Learning Agent

In [20]:
from rl.agents import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

In [21]:
def build_agent(model, actions):
    policy = BoltzmannQPolicy()
    memory = SequentialMemory(limit=50000, window_length=1)
    dqn = DQNAgent(model=model, policy=policy, memory=memory, nb_actions=actions, nb_steps_warmup=10, target_model_update=1e-2)
    return dqn
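
`BoltzmannQPolicy` chooses actions at random, with higher-valued actions being more likely. Below is a minimal sketch of that idea, assuming a softmax over the Q-values with a temperature `tau`. The function names are made up for this example, and the library's own implementation may differ in details such as value clipping.

```python
import math
import random


def boltzmann_probabilities(q_values, tau=1.0):
    """Turn Q-values into action probabilities with a softmax."""
    scaled = [q / tau for q in q_values]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann_action(q_values, tau=1.0, rng=random):
    """Sample one action index according to the softmax probabilities."""
    probs = boltzmann_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


probs = boltzmann_probabilities([1.0, 2.0])
print([round(p, 3) for p in probs])  # [0.269, 0.731]
action = boltzmann_action([1.0, 2.0], rng=random.Random(0))
```

A larger `tau` flattens the probabilities (more exploration), while a smaller `tau` makes the choice closer to always taking the best-valued action.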

## Training the Reinforcement Learning Agent

In [22]:
dqn = build_agent(model, actions)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])
dqn.fit(env, nb_steps=50000, visualize=True, verbose=1)

Training for 50000 steps ...
Interval 1 (0 steps performed)
   11/10000 [..............................] - ETA: 5:48 - reward: 1.0000



109 episodes - episode_reward: 91.706 [8.000, 200.000] - loss: 2.317 - mae: 18.883 - mean_q: 38.320

Interval 2 (10000 steps performed)
52 episodes - episode_reward: 190.231 [143.000, 200.000] - loss: 6.158 - mae: 38.396 - mean_q: 77.524

Interval 3 (20000 steps performed)
52 episodes - episode_reward: 192.115 [144.000, 200.000] - loss: 6.739 - mae: 41.054 - mean_q: 82.433

Interval 4 (30000 steps performed)
51 episodes - episode_reward: 196.471 [154.000, 200.000] - loss: 5.507 - mae: 39.350 - mean_q: 78.983

Interval 5 (40000 steps performed)
done, took 839.008 seconds


<tensorflow.python.keras.callbacks.History at 0x1ec0bc14f70>
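
A note on `target_model_update=1e-2` from `build_agent`: as far as I understand keras-rl, a value below 1 means the target network is nudged a little toward the trained network at every step (a soft update), instead of being copied wholesale every N steps. A minimal sketch of that blending on plain lists of weights, with `soft_update` being a hypothetical helper and not a library function:

```python
def soft_update(target_weights, model_weights, rate=1e-2):
    """Move each target weight a fraction `rate` of the way to the model weight."""
    return [
        rate * w_model + (1.0 - rate) * w_target
        for w_target, w_model in zip(target_weights, model_weights)
    ]


target = [0.0, 0.0]
model_w = [1.0, 2.0]

target = soft_update(target, model_w)
print(target)  # approximately [0.01, 0.02]

# After many updates the target weights approach the model weights.
for _ in range(1000):
    target = soft_update(target, model_w)
```

The slow-moving target keeps the values the network is trained toward from shifting too quickly.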

## Testing the Reinforcement Learning Agent

In [1]:
scores = dqn.test(env, nb_episodes=100, visualize=True)

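This cell raised `NameError: name 'dqn' is not defined` because it was run in a fresh kernel (note the `In [1]` counter) where the agent had not been built. Run it in the same session as the training cell, or rebuild the agent and load the saved weights first.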

## Save and Load Weights from Trained Model

In [None]:
dqn.save_weights('cartpole_weights.h5f', overwrite=True)
# dqn.load_weights('cartpole_weights.h5f')

## Resources

### Reinforcement Learning
- An overview: https://www.youtube.com/watch?v=JgvyzIkgxF0&ab_channel=ArxivInsights 
- A whole course from the lead researcher of AlphaGo and AlphaZero: https://www.youtube.com/playlist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ
    - Link to the online course page: https://deepmind.com/learning-resources/-introduction-reinforcement-learning-david-silver
- Another easy to follow mini-course: https://www.youtube.com/playlist?list=PLZbbT5o_s2xoWNVdDudn51XM8lOuZ_Njv

### Other Helpful Links
- What are optimizers?: https://www.youtube.com/watch?v=mdKjMPmcWjY&ab_channel=CodeEmporium


## My MsPacman Reinforcement Learning Code


In [None]:
!pip install -f https://github.com/Kojoley/atari-py/releases atari_py

In [None]:
import gym

# Only the layers, policies and classes actually used below are imported.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam

from rl.agents import DQNAgent
from rl.policy import LinearAnnealedPolicy, EpsGreedyQPolicy
from rl.memory import SequentialMemory

env = gym.make('MsPacman-ram-v0')
states = env.observation_space.shape[0]

actions = env.action_space.n
print(actions)

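def build_model(states, actions):
    model = Sequential()

    # Input is a window of 3 consecutive RAM observations (matches window_length=3 in the memory).
    model.add(Flatten(input_shape=(3, states)))
    model.add(Dense(128, activation='relu'))
    model.add(Dense(512, activation='relu'))
    model.add(Dense(512, activation='relu'))
    model.add(Dense(128, activation='relu'))
    # Linear output: DQN estimates unbounded Q-values, not softmax probabilities.
    model.add(Dense(actions, activation='linear'))

    return model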


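def build_agent(model, actions):
    # Epsilon-greedy exploration: epsilon decays linearly from 1.0 to 0.1 over the first 10,000 steps (0.2 while testing).
    policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr='eps', value_max=1., value_min=.1, value_test=.2, nb_steps=10000)
    # Replay memory; window_length=3 must match the first dimension of the model's input_shape.
    memory = SequentialMemory(limit=1000, window_length=3)
    dqn = DQNAgent(model=model, memory=memory, policy=policy, enable_dueling_network=True, dueling_type='avg', nb_actions=actions, nb_steps_warmup=50000)
    return dqn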


model = build_model(states, actions)
model.summary()  # summary() prints the table itself; wrapping it in print() would also print "None"


dqn = build_agent(model, actions)
dqn.compile(Adam(lr=1e-4))
dqn.fit(env, nb_steps=500000, visualize=True, verbose=2)

dqn.save_weights('mspacman_weights.h5f', overwrite=True)

scores = dqn.test(env, nb_episodes=100, visualize=True)
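
The MsPacman agent explores with `LinearAnnealedPolicy` wrapped around `EpsGreedyQPolicy`: with probability epsilon it takes a random action, otherwise the best-valued one, and epsilon shrinks linearly during training. Below is a rough sketch of that schedule using the values from `build_agent` above. The helper names are made up for this example, and the library's own code may differ in details.

```python
import random


def annealed_epsilon(step, value_max=1.0, value_min=0.1, nb_steps=10000):
    """Epsilon falls linearly from value_max to value_min over nb_steps, then stays there."""
    slope = -(value_max - value_min) / nb_steps
    return max(value_min, slope * step + value_max)


def eps_greedy_action(q_values, eps, rng):
    """With probability eps pick a random action, otherwise the highest Q-value."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


for step in (0, 5000, 10000, 50000):
    print(step, round(annealed_epsilon(step), 3))
# 0 1.0
# 5000 0.55
# 10000 0.1
# 50000 0.1

rng = random.Random(0)
greedy = eps_greedy_action([0.1, 0.9, 0.3], eps=0.0, rng=rng)  # always action 1
```

With these settings epsilon reaches its floor of 0.1 after 10,000 steps, well before the 50,000 warm-up steps are over.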