<a href="https://colab.research.google.com/github/anthonymelson/portfolio/blob/master/CartPole_with_DeepQ_Learning_and_a_Sarsa_Agent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CartPole with DeepQ-Learning and a Sarsa Agent

## Import Libraries

In [9]:
import gym
!pip install keras.rl
from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam
from rl.agents import SARSAAgent
from rl.policy import BoltzmannQPolicy, EpsGreedyQPolicy, MaxBoltzmannQPolicy, BoltzmannGumbelQPolicy
import numpy as np

Processing /root/.cache/pip/wheels/7d/4d/84/9254c9f2e8f51865cb0dac8e79da85330c735551d31f73c894/keras_rl-0.4.2-cp36-none-any.whl
Installing collected packages: keras.rl
Successfully installed keras.rl


## The CartPole Environment

CartPole is a classic control problem that has been prominent in the literature since the 1980s and comes in many variations. There is always a pole that can swing freely in two directions, through all 360 degrees, attached to a cart (usually on a track) that can move in the same two directions or stay still (right, left, neither).

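In this variation, the pole starts in a position selected at random from a small range of possibilities, and the agent's goal is to make the series of moves needed to keep the pole balanced above the cart (not falling) while the cart stays within a given range. Note that `CartPole-v0` offers only two actions, left and right, which is why the model summary further down shows two outputs. An episode is cut off after 200 steps, so if the agent balances the pole for 200 consecutive moves, it has reached the maximum score and, for the purposes of this notebook, the game is "solved".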
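
The "stay balanced and stay in range" rule can be made concrete with a small sketch. The thresholds below are the ones documented for Gym's `CartPole-v0` (cart within 2.4 units of center, pole within 12 degrees of vertical, 200-step cap); the helper `episode_over` is hypothetical and not part of Gym, which performs this check internally.

```python
import math

# Thresholds as documented for Gym's CartPole-v0; treat them as assumptions
# if you are using a different version of the environment.
MAX_CART_POSITION = 2.4              # cart must stay within +/- 2.4 units of center
MAX_POLE_ANGLE = math.radians(12)    # pole must stay within +/- 12 degrees of vertical
MAX_EPISODE_STEPS = 200              # CartPole-v0 caps each episode at 200 steps


def episode_over(cart_position, pole_angle, step):
    """Return True when an episode would end (hypothetical helper, not part of gym)."""
    fell = abs(pole_angle) > MAX_POLE_ANGLE
    out_of_bounds = abs(cart_position) > MAX_CART_POSITION
    timed_out = step >= MAX_EPISODE_STEPS
    return fell or out_of_bounds or timed_out
```

For example, `episode_over(0.0, 0.0, 10)` is `False`, while a cart position of 2.5 or a pole angle of 13 degrees ends the episode.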


**Here is a picture of the cart and pole**:

![alt text](https://miro.medium.com/max/900/1*Q9gDKBugQeYNxA6ZBei1MQ.png)


In [10]:
ENV_NAME = 'CartPole-v0'

# Get the environment and extract the number of actions.
env = gym.make(ENV_NAME)
np.random.seed(38)
env.seed(65)
nb_actions = env.action_space.n

## Create Deep Learning Model
Here we build a very simple deep learning model using Keras. Its first layer is a Flatten layer whose output size equals the size of the game's observation space (four values for CartPole). The next layer is an ordinary dense layer with 64 nodes and a rectified linear unit (ReLU) activation. The third layer is also a dense layer, with one node per action available to the agent, and it has a linear activation, so each output can be any real number. That lets the network estimate the value of each action directly, with no need to discretize the value range.

This model is essentially our agent's brain. It maps each state to an estimated long-term reward (a Q-value) for every available action, on a continuous scale. As these estimates become more accurate, the agent makes better (more rewarding) decisions: it learns!

In [11]:
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
print(model.summary())

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten_2 (Flatten)          (None, 4)                 0         
_________________________________________________________________
dense_3 (Dense)              (None, 64)                320       
_________________________________________________________________
activation_3 (Activation)    (None, 64)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 130       
_________________________________________________________________
activation_4 (Activation)    (None, 2)                 0         
Total params: 450
Trainable params: 450
Non-trainable params: 0
_________________________________________________________________
None
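
To see where the parameter counts in the summary come from, here is a minimal NumPy sketch of the same 4 → 64 → 2 architecture. The weights are random placeholders rather than the Keras model's weights, and the names (`W1`, `q_values`, and so on) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_obs, n_hidden, n_actions = 4, 64, 2   # sizes taken from the model summary

# Randomly initialised weights stand in for trained ones (illustrative only).
W1 = rng.normal(scale=0.1, size=(n_obs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_actions))
b2 = np.zeros(n_actions)


def q_values(observation):
    """Map one 4-dimensional observation to one estimated value per action."""
    hidden = np.maximum(0.0, observation @ W1 + b1)   # Dense(64) + ReLU
    return hidden @ W2 + b2                           # Dense(2) + linear activation


# (4*64 + 64) + (64*2 + 2) = 320 + 130 = 450, matching "Total params: 450".
n_params = W1.size + b1.size + W2.size + b2.size

obs = np.array([0.01, -0.02, 0.03, 0.04])   # made-up observation
q = q_values(obs)
greedy_action = int(np.argmax(q))           # index of the highest-valued action
```

The first dense layer has 4 × 64 weights plus 64 biases (320), and the output layer has 64 × 2 weights plus 2 biases (130).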


## Select Exploration Policy

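Here we are going to stick with a classic: epsilon-greedy. This policy takes a random action some fraction of the time (epsilon) and follows what it currently believes is the best action the rest of the time. In many setups epsilon also decreases on a schedule, eventually leaving an agent that does its best. Note that `EpsGreedyQPolicy()` with no arguments keeps epsilon fixed at its default (0.1 in keras-rl); decaying it requires wrapping the policy, for example in keras-rl's `LinearAnnealedPolicy`.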

In [12]:
#policy = BoltzmannQPolicy()
policy = EpsGreedyQPolicy()
#policy = MaxBoltzmannQPolicy()
#policy = BoltzmannGumbelQPolicy()
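
The epsilon-greedy idea can be sketched in a few lines of plain Python. This is not keras-rl's implementation; `epsilon_greedy` and `linear_epsilon` are hypothetical helpers, and the schedule values (start at 1.0, end at 0.1, over 10000 steps) are assumptions chosen only for illustration.

```python
import random


def epsilon_greedy(q_values, eps, rng=random):
    """Pick a random action with probability eps, else the highest-valued one."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))                       # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit


def linear_epsilon(step, eps_start=1.0, eps_end=0.1, decay_steps=10000):
    """Linearly anneal eps from eps_start to eps_end over decay_steps (assumed schedule)."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)
```

With `eps=0.0` the function always returns the greedy action, and with `eps=1.0` it always acts randomly; a schedule such as `linear_epsilon` moves gradually from the second behavior toward the first.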

## Select an Agent

Here we are selecting SARSA (state-action-reward-state-action), which is an on-policy learning algorithm. This means that it updates its value estimates using the action it actually takes in the next state, rather than the highest predicted value among the possible next actions (as Q-learning does). In other words, the more common Q-learning update only considers state-action-reward-state, while SARSA's also includes the next action.

In [13]:
sarsa = SARSAAgent(model=model, nb_actions=nb_actions, nb_steps_warmup=1000, policy=policy)
sarsa.compile(Adam(lr=1e-3), metrics=['mae'])
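
The difference between the two update rules is easiest to see in the targets they regress toward. The sketch below is a simplified illustration, not keras-rl's internals; the discount factor of 0.99 and the value estimates are assumed numbers.

```python
GAMMA = 0.99  # discount factor (assumed value for illustration)


def sarsa_target(reward, next_q_values, next_action, done, gamma=GAMMA):
    """On-policy target: bootstrap from the action actually chosen in the next state."""
    if done:
        return reward
    return reward + gamma * next_q_values[next_action]


def q_learning_target(reward, next_q_values, done, gamma=GAMMA):
    """Off-policy target: bootstrap from the best-looking action in the next state."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


next_q = [1.0, 3.0]    # made-up value estimates for the next state
explored_action = 0    # suppose the policy explored and picked the lower-valued action

sarsa_t = sarsa_target(1.0, next_q, explored_action, done=False)   # 1 + 0.99 * 1.0
q_t = q_learning_target(1.0, next_q, done=False)                   # 1 + 0.99 * 3.0
```

When the policy happens to pick the greedy action, the two targets coincide; they differ only on exploratory steps, which is why SARSA's estimates reflect the cost of its own exploration.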

## Train Agent

Okay, now it's time to learn something!

I'm setting `visualize` to `False` because Colab doesn't support Gym's rendering. However, if you run this in Spyder or in a locally hosted Jupyter notebook, rendering works without a problem. It is actually a great way to understand how your agent is doing, as it provides rich qualitative feedback.

The parameters for sarsa.fit:

**env** : The environment you are solving

**nb_steps** : The total number of training steps, summed across **all** episodes

**visualize** : Whether to render the game during training

**verbose** : How much progress information to print during training (0 prints nothing)

In [17]:

sarsa.fit(env, nb_steps=10000, visualize=False, verbose=0)


<keras.callbacks.callbacks.History at 0x7f5e50375860>

## Save Weights

After training is done, we save the final weights.

In [15]:
sarsa.save_weights('sarsa_EpsGreedyQ_{}_weights.h5f'.format(ENV_NAME), overwrite=True)

## Test Agent

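Finally, we evaluate the trained agent for 5 episodes. Each step the pole stays up earns a reward of 1, so an episode's reward equals its length in steps.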

In [18]:
sarsa.test(env, nb_episodes=5, visualize=False)

Testing for 5 episodes ...
Episode 1: reward: 41.000, steps: 41
Episode 2: reward: 37.000, steps: 37
Episode 3: reward: 43.000, steps: 43
Episode 4: reward: 42.000, steps: 42
Episode 5: reward: 37.000, steps: 37


<keras.callbacks.callbacks.History at 0x7f5e50375908>
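
To put these numbers in context, the short sketch below averages the five episode rewards printed above and compares the result with the 200-step cap of `CartPole-v0`. The rewards are copied by hand from the output rather than read from the returned `History` object.

```python
# Episode rewards copied from the test output above.
episode_rewards = [41.0, 37.0, 43.0, 42.0, 37.0]
MAX_REWARD = 200  # CartPole-v0 gives +1 per step and caps episodes at 200 steps

mean_reward = sum(episode_rewards) / len(episode_rewards)
fraction_of_cap = mean_reward / MAX_REWARD

print(f"mean reward: {mean_reward:.1f} ({fraction_of_cap:.0%} of the {MAX_REWARD}-step cap)")
```

An average of 40 steps is well short of 200, so after 10000 training steps this agent has not yet learned to balance the pole reliably. Training for more steps or trying one of the other policies may help, though that is not tested here.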