# Sequential DQN

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

  from ._conv import register_converters as _register_converters


TensorFlow Version: 1.7.1
Default GPU Device: 


>**Note:** Make sure you have OpenAI Gym cloned into the same directory with this notebook. I've included `gym` as a submodule, so you can run `git submodule --init --recursive` to pull the contents into the `gym` repo.

>**Note:** Make sure you have OpenAI Gym cloned. Then run this command `pip install -e gym/[all]`.

In [2]:
import gym

# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')

We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing in an action as an integer to `env.step` will generate the next step in the simulation.  You can see how many actions are possible from `env.action_space` and to get a random action you can use `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to watch the simulation run.

In [3]:
import numpy as np
state = env.reset()
for _ in range(10):
    # env.render()
    action = env.action_space.sample()
    next_state, reward, done, info = env.step(action) # take a random action
    #print('state, action, next_state, reward, done, info:', state, action, next_state, reward, done, info)
    state = next_state
    if done:
        state = env.reset()

To shut the window showing the simulation, use `env.close()`.

If you ran the simulation above, we can look at the rewards:

In [4]:
# print(rewards[-20:])
# print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
# print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
# print(np.max(np.array(actions)), np.min(np.array(actions)))
# print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
# print(np.max(np.array(rewards)), np.min(np.array(rewards)))
# print(np.max(np.array(states)), np.min(np.array(states)))

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max{Q(s', a')}
$$

where $s$ is a state, $a$ is an action, and $s'$ is the next state from state $s$ and action $a$.

Before we used this equation to learn values for a Q-_table_. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating point precisions, you practically have infinite states. Instead of using a table then, we'll replace it with a neural network that will approximate the Q-table lookup function.

<img src="assets/deep-q-learning.png" width=450px>

Now, our Q value, $Q(s, a)$ is calculated by passing in a state to the network. The output will be Q-values for each available action, with fully connected hidden layers.

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$. 

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.

In [5]:
def model_input(state_size, lstm_size, batch_size=1):
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    # RNN
    gru = tf.nn.rnn_cell.GRUCell(lstm_size)
    cell = tf.nn.rnn_cell.MultiRNNCell([gru], state_is_tuple=False)
    initial_state = cell.zero_state(batch_size, tf.float32)
    return states, actions, targetQs, cell, initial_state

In [6]:
# RNN generator or sequence generator
def generator(states, num_classes, initial_state, cell, lstm_size, reuse=False): 
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        inputs = tf.layers.dense(inputs=states, units=lstm_size)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        # dynamic means adapt to the batch_size
        inputs_rnn = tf.reshape(inputs, [1, -1, lstm_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cell, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state.shape)
        outputs = tf.reshape(outputs_rnn, [-1, lstm_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=num_classes)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the action logits
        return logits, final_state

In [7]:
def model_loss(action_size, hidden_size, states, cell, initial_state, actions, targetQs):
    actions_logits, final_state = generator(states=states, cell=cell, initial_state=initial_state, 
                                            lstm_size=hidden_size, num_classes=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    Qs = tf.reduce_max(actions_logits*actions_labels, axis=1)
    loss = tf.reduce_mean(tf.square(Qs - targetQs))
    return actions_logits, final_state, loss

In [8]:
def model_opt(loss, learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]

    # # Optimize
    # with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
    # #opt = tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=g_vars)

    #grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(loss, g_vars), clip_norm=5) # usually around 1-5
    grads = tf.gradients(loss, g_vars)
    opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(grads, g_vars))

    return opt

In [9]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, cell, self.initial_state = model_input(
            state_size=state_size, lstm_size=hidden_size)
        
        # Create the Model: calculating the loss and forwad pass
        self.actions_logits, self.final_state, self.loss = model_loss(
            action_size=action_size, hidden_size=hidden_size, 
            states=self.states, actions=self.actions, 
            targetQs=self.targetQs, cell=cell, initial_state=self.initial_state)

        # Update the model: backward pass and backprop
        self.opt = model_opt(loss=self.loss, learning_rate=learning_rate)

## Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on. 

Here, we'll create a `Memory` object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maxmium capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a `Memory` object. If you're unfamiliar with `deque`, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.

In [10]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
        self.states = deque(maxlen=max_size)
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx], [self.states[ii] for ii in idx]

## Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon).  That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an **$\epsilon$-greedy policy**.


At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called _exploitation_. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.

## Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in _episodes_. One *episode* is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames. So we can start a new episode once meeting that goal. The game ends if the pole tilts over too far, or if the cart moves too far the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

* Initialize the memory $D$
* Initialize the action-value network $Q$ with random weights
* **For** episode = 1, $M$ **do**
  * **For** $t$, $T$ **do**
     * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s,a)$
     * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
     * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
     * Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
     * Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
     * Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
  * **endfor**
* **endfor**

## Hyperparameters

One of the more difficult aspects of reinforcememt learning are the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.

In [11]:
# print('state:', np.array(states).shape[1], 
#       'action size: {}'.format((np.max(np.array(actions)) - np.min(np.array(actions)))+1))

In [12]:
# Network parameters
action_size = 2
state_size = 4
hidden_size = 4*2               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 128            # memory capacity - 1000 DQN
batch_size = 128             # experience mini-batch size - 20 DQN
gamma = 0.99                 # future reward discount

In [13]:
# Reset/init the graph/session
graph = tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=batch_size)

(?, 4) (?, 8)
(1, ?, 8) (1, 8)
(1, ?, 8) (1, 8)
(?, 8)
(?, 2)


## Populate the memory (exprience memory)

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.

In [14]:
state = env.reset()
for _ in range(memory_size):
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    memory.buffer.append([state, action, next_state, reward, float(done)])
    memory.states.append(np.zeros([1, hidden_size]))
    state = next_state
    if done is True:
        # Reseting the env/first state
        state = env.reset()

In [15]:
# # Training
# batch = memory.buffer
# states = np.array([each[0] for each in batch])
# actions = np.array([each[1] for each in batch])
# next_states = np.array([each[2] for each in batch])
# rewards = np.array([each[3] for each in batch])
# dones = np.array([each[4] for each in batch])

## Training the model

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [16]:
# initial_states = np.array(memory.states)
# initial_states.shape

In [None]:
saver = tf.train.Saver()
episode_rewards_list, rewards_list, loss_list = [], [], []

# TF session for training
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        loss_batch = []
        state = env.reset()
        initial_state = sess.run(model.initial_state)

        # Training steps/batches
        while True:
            action_logits, final_state = sess.run([model.actions_logits, model.final_state],
                                                  feed_dict = {model.states: state.reshape([1, -1]), 
                                                               model.initial_state: initial_state})
            action = np.argmax(action_logits)
            next_state, reward, done, _ = env.step(action)
            memory.buffer.append([state, action, next_state, reward, float(done)])
            memory.states.append(initial_state)
            total_reward += reward
            state = next_state
            initial_state = final_state

            # Training
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            initial_states = np.array(memory.states)
            next_actions_logits = sess.run(model.actions_logits, 
                                           feed_dict = {model.states: next_states,
                                                        model.initial_state: initial_states[1]})
            nextQs = np.max(next_actions_logits, axis=1) * (1-dones)
            targetQs = rewards + (gamma * nextQs)
            loss, _ = sess.run([model.loss, model.opt], feed_dict = {model.states: states, 
                                                                     model.actions: actions,
                                                                     model.targetQs: targetQs,
                                                                     model.initial_state: initial_states[0]})
            # End of training
            loss_batch.append(loss)
            if done is True:
                break
                
        # Outputing: priting out/Potting
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'loss:{:.4f}'.format(np.mean(loss_batch)))
        # Ploting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        loss_list.append([ep, np.mean(loss_batch)])
        # Break episode/epoch loop
        if np.mean(episode_reward) >= 500:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:9.0000 R:9.0000 loss:1.2233
Episode:1 meanR:9.0000 R:9.0000 loss:1.2138
Episode:2 meanR:9.0000 R:9.0000 loss:1.2093
Episode:3 meanR:9.0000 R:9.0000 loss:1.2098
Episode:4 meanR:9.0000 R:9.0000 loss:1.2867
Episode:5 meanR:9.1667 R:10.0000 loss:1.2996
Episode:6 meanR:9.1429 R:9.0000 loss:1.3731
Episode:7 meanR:9.0000 R:8.0000 loss:1.4004
Episode:8 meanR:9.0000 R:9.0000 loss:1.4432
Episode:9 meanR:9.0000 R:9.0000 loss:1.5317
Episode:10 meanR:9.0909 R:10.0000 loss:1.6059
Episode:11 meanR:9.1667 R:10.0000 loss:1.6575
Episode:12 meanR:9.2308 R:10.0000 loss:1.6962
Episode:13 meanR:9.2143 R:9.0000 loss:1.7094
Episode:14 meanR:9.1333 R:8.0000 loss:1.7141
Episode:15 meanR:9.1875 R:10.0000 loss:1.6172
Episode:16 meanR:9.1765 R:9.0000 loss:1.6447
Episode:17 meanR:9.1667 R:9.0000 loss:1.5614
Episode:18 meanR:9.1053 R:8.0000 loss:1.4644
Episode:19 meanR:9.2000 R:11.0000 loss:1.3644
Episode:20 meanR:9.2381 R:10.0000 loss:1.2323
Episode:21 meanR:9.1818 R:8.0000 loss:1.1936
Episode:22 me

Episode:179 meanR:9.3600 R:9.0000 loss:2.0725
Episode:180 meanR:9.3600 R:10.0000 loss:2.0754
Episode:181 meanR:9.3700 R:10.0000 loss:2.0766
Episode:182 meanR:9.3600 R:10.0000 loss:2.0698
Episode:183 meanR:9.3600 R:10.0000 loss:2.0637
Episode:184 meanR:9.3600 R:10.0000 loss:2.0601
Episode:185 meanR:9.3500 R:9.0000 loss:2.0785
Episode:186 meanR:9.3700 R:10.0000 loss:2.0838
Episode:187 meanR:9.3800 R:10.0000 loss:2.0807
Episode:188 meanR:9.3700 R:9.0000 loss:2.0841
Episode:189 meanR:9.3800 R:9.0000 loss:2.0897
Episode:190 meanR:9.3700 R:9.0000 loss:2.0979
Episode:191 meanR:9.3700 R:10.0000 loss:2.1023
Episode:192 meanR:9.3800 R:10.0000 loss:2.0935
Episode:193 meanR:9.3900 R:10.0000 loss:2.0962
Episode:194 meanR:9.3900 R:9.0000 loss:2.1064
Episode:195 meanR:9.3900 R:9.0000 loss:2.1211
Episode:196 meanR:9.4000 R:10.0000 loss:2.1305
Episode:197 meanR:9.4200 R:10.0000 loss:2.1368
Episode:198 meanR:9.4200 R:10.0000 loss:2.1412
Episode:199 meanR:9.4100 R:9.0000 loss:2.1442
Episode:200 meanR:9.4

Episode:356 meanR:9.3100 R:9.0000 loss:3.1579
Episode:357 meanR:9.3100 R:10.0000 loss:3.1415
Episode:358 meanR:9.3100 R:10.0000 loss:3.1512
Episode:359 meanR:9.3000 R:9.0000 loss:3.1679
Episode:360 meanR:9.3100 R:10.0000 loss:3.1799
Episode:361 meanR:9.3100 R:9.0000 loss:3.1813
Episode:362 meanR:9.3000 R:9.0000 loss:3.2070
Episode:363 meanR:9.3200 R:10.0000 loss:3.2154
Episode:364 meanR:9.3200 R:9.0000 loss:3.1990
Episode:365 meanR:9.3300 R:10.0000 loss:3.1959
Episode:366 meanR:9.3300 R:10.0000 loss:3.1897
Episode:367 meanR:9.3300 R:10.0000 loss:3.1985
Episode:368 meanR:9.3500 R:10.0000 loss:3.2054
Episode:369 meanR:9.3400 R:9.0000 loss:3.2184
Episode:370 meanR:9.3300 R:9.0000 loss:3.2264
Episode:371 meanR:9.3300 R:9.0000 loss:3.2545
Episode:372 meanR:9.3200 R:9.0000 loss:3.2771
Episode:373 meanR:9.3100 R:9.0000 loss:3.2844
Episode:374 meanR:9.3200 R:10.0000 loss:3.2958
Episode:375 meanR:9.3300 R:10.0000 loss:3.2860
Episode:376 meanR:9.3300 R:9.0000 loss:3.2892
Episode:377 meanR:9.3400

Episode:533 meanR:9.3000 R:10.0000 loss:4.4262
Episode:534 meanR:9.3100 R:9.0000 loss:4.4238
Episode:535 meanR:9.3000 R:9.0000 loss:4.4294
Episode:536 meanR:9.3100 R:10.0000 loss:4.3963
Episode:537 meanR:9.3100 R:10.0000 loss:4.4053
Episode:538 meanR:9.3100 R:9.0000 loss:4.4266
Episode:539 meanR:9.2900 R:8.0000 loss:4.4844
Episode:540 meanR:9.2800 R:9.0000 loss:4.5282
Episode:541 meanR:9.2800 R:9.0000 loss:4.5626
Episode:542 meanR:9.2700 R:8.0000 loss:4.5945
Episode:543 meanR:9.2600 R:8.0000 loss:4.6315
Episode:544 meanR:9.2700 R:10.0000 loss:4.6606
Episode:545 meanR:9.2600 R:9.0000 loss:4.6482
Episode:546 meanR:9.2600 R:10.0000 loss:4.6388
Episode:547 meanR:9.2600 R:10.0000 loss:4.6333
Episode:548 meanR:9.2500 R:9.0000 loss:4.6418
Episode:549 meanR:9.2700 R:10.0000 loss:4.6266
Episode:550 meanR:9.2800 R:9.0000 loss:4.6478
Episode:551 meanR:9.2800 R:10.0000 loss:4.6612
Episode:552 meanR:9.2700 R:9.0000 loss:4.6690
Episode:553 meanR:9.2600 R:8.0000 loss:4.6770
Episode:554 meanR:9.2500 R

Episode:710 meanR:9.3000 R:8.0000 loss:5.6495
Episode:711 meanR:9.3000 R:9.0000 loss:5.7099
Episode:712 meanR:9.2900 R:9.0000 loss:5.7167
Episode:713 meanR:9.2900 R:10.0000 loss:5.6983
Episode:714 meanR:9.3000 R:9.0000 loss:5.7324
Episode:715 meanR:9.2900 R:9.0000 loss:5.7777
Episode:716 meanR:9.3100 R:10.0000 loss:5.7897
Episode:717 meanR:9.3000 R:9.0000 loss:5.8275
Episode:718 meanR:9.3000 R:10.0000 loss:5.8015
Episode:719 meanR:9.2900 R:9.0000 loss:5.8026
Episode:720 meanR:9.3000 R:9.0000 loss:5.8086
Episode:721 meanR:9.3000 R:10.0000 loss:5.8208
Episode:722 meanR:9.3100 R:10.0000 loss:5.8266
Episode:723 meanR:9.3100 R:8.0000 loss:5.8640
Episode:724 meanR:9.3100 R:9.0000 loss:5.8334
Episode:725 meanR:9.3100 R:10.0000 loss:5.8101
Episode:726 meanR:9.3000 R:9.0000 loss:5.8062
Episode:727 meanR:9.3100 R:9.0000 loss:5.8505
Episode:728 meanR:9.3100 R:9.0000 loss:5.8562
Episode:729 meanR:9.3000 R:8.0000 loss:5.9018
Episode:730 meanR:9.3100 R:10.0000 loss:5.9115
Episode:731 meanR:9.3000 R:

Episode:887 meanR:9.2700 R:8.0000 loss:6.5256
Episode:888 meanR:9.2700 R:10.0000 loss:6.5669
Episode:889 meanR:9.2600 R:8.0000 loss:6.5883
Episode:890 meanR:9.2500 R:9.0000 loss:6.6083
Episode:891 meanR:9.2400 R:9.0000 loss:6.6141
Episode:892 meanR:9.2600 R:10.0000 loss:6.6287
Episode:893 meanR:9.2500 R:8.0000 loss:6.6603
Episode:894 meanR:9.2500 R:9.0000 loss:6.7175
Episode:895 meanR:9.2500 R:9.0000 loss:6.7229
Episode:896 meanR:9.2600 R:10.0000 loss:6.7263
Episode:897 meanR:9.2600 R:10.0000 loss:6.7288
Episode:898 meanR:9.2600 R:10.0000 loss:6.7315
Episode:899 meanR:9.2600 R:9.0000 loss:6.7750
Episode:900 meanR:9.2500 R:9.0000 loss:6.7793
Episode:901 meanR:9.2500 R:10.0000 loss:6.6968
Episode:902 meanR:9.2600 R:11.0000 loss:6.6636
Episode:903 meanR:9.2500 R:9.0000 loss:6.5997
Episode:904 meanR:9.2500 R:9.0000 loss:6.6030
Episode:905 meanR:9.2600 R:9.0000 loss:6.6082
Episode:906 meanR:9.2600 R:9.0000 loss:6.6588
Episode:907 meanR:9.2600 R:8.0000 loss:6.6480
Episode:908 meanR:9.2700 R:

Episode:1063 meanR:9.3600 R:10.0000 loss:7.0364
Episode:1064 meanR:9.3800 R:10.0000 loss:7.0356
Episode:1065 meanR:9.3700 R:10.0000 loss:7.0361
Episode:1066 meanR:9.3900 R:10.0000 loss:7.0352
Episode:1067 meanR:9.3900 R:9.0000 loss:7.0206
Episode:1068 meanR:9.3900 R:10.0000 loss:6.9895
Episode:1069 meanR:9.4000 R:9.0000 loss:6.9729
Episode:1070 meanR:9.3900 R:9.0000 loss:6.9727
Episode:1071 meanR:9.3900 R:9.0000 loss:7.0261
Episode:1072 meanR:9.3800 R:9.0000 loss:7.0748
Episode:1073 meanR:9.4000 R:10.0000 loss:7.0375
Episode:1074 meanR:9.4000 R:10.0000 loss:6.9480
Episode:1075 meanR:9.3900 R:9.0000 loss:6.9279
Episode:1076 meanR:9.3900 R:10.0000 loss:6.9083
Episode:1077 meanR:9.4000 R:10.0000 loss:6.9128
Episode:1078 meanR:9.4000 R:9.0000 loss:6.9377
Episode:1079 meanR:9.4200 R:10.0000 loss:6.9656
Episode:1080 meanR:9.4000 R:8.0000 loss:7.0274
Episode:1081 meanR:9.4100 R:10.0000 loss:7.0153
Episode:1082 meanR:9.4200 R:9.0000 loss:7.0473
Episode:1083 meanR:9.4100 R:8.0000 loss:7.0857
Ep

Episode:1236 meanR:9.3500 R:9.0000 loss:6.8744
Episode:1237 meanR:9.3500 R:8.0000 loss:6.9037
Episode:1238 meanR:9.3400 R:9.0000 loss:6.9887
Episode:1239 meanR:9.3300 R:8.0000 loss:7.0813
Episode:1240 meanR:9.3500 R:10.0000 loss:7.0574
Episode:1241 meanR:9.3700 R:10.0000 loss:7.0569
Episode:1242 meanR:9.3800 R:10.0000 loss:7.0548
Episode:1243 meanR:9.3700 R:9.0000 loss:7.0391
Episode:1244 meanR:9.3800 R:10.0000 loss:7.0522
Episode:1245 meanR:9.3700 R:9.0000 loss:7.0380
Episode:1246 meanR:9.3400 R:8.0000 loss:7.1319
Episode:1247 meanR:9.3400 R:10.0000 loss:7.1381
Episode:1248 meanR:9.3500 R:10.0000 loss:7.0874
Episode:1249 meanR:9.3400 R:8.0000 loss:7.1752
Episode:1250 meanR:9.3600 R:10.0000 loss:7.1225
Episode:1251 meanR:9.3800 R:10.0000 loss:7.0255
Episode:1252 meanR:9.3700 R:9.0000 loss:7.0082
Episode:1253 meanR:9.3600 R:9.0000 loss:6.9591
Episode:1254 meanR:9.3500 R:10.0000 loss:6.9788
Episode:1255 meanR:9.3700 R:10.0000 loss:6.9812
Episode:1256 meanR:9.3700 R:10.0000 loss:6.9808
Ep

Episode:1409 meanR:9.2600 R:8.0000 loss:7.0851
Episode:1410 meanR:9.2500 R:9.0000 loss:7.0829
Episode:1411 meanR:9.2400 R:9.0000 loss:7.1232
Episode:1412 meanR:9.2300 R:8.0000 loss:7.1272
Episode:1413 meanR:9.2300 R:9.0000 loss:7.1996
Episode:1414 meanR:9.2500 R:10.0000 loss:7.1218
Episode:1415 meanR:9.2500 R:9.0000 loss:7.0768
Episode:1416 meanR:9.2500 R:9.0000 loss:7.0653
Episode:1417 meanR:9.2500 R:10.0000 loss:6.9664
Episode:1418 meanR:9.2400 R:9.0000 loss:6.9960
Episode:1419 meanR:9.2300 R:9.0000 loss:6.9985
Episode:1420 meanR:9.2200 R:9.0000 loss:6.9915
Episode:1421 meanR:9.2200 R:9.0000 loss:6.9878
Episode:1422 meanR:9.2100 R:9.0000 loss:6.9830
Episode:1423 meanR:9.2200 R:9.0000 loss:7.0232
Episode:1424 meanR:9.2300 R:9.0000 loss:6.9697
Episode:1425 meanR:9.2200 R:9.0000 loss:6.9644
Episode:1426 meanR:9.2300 R:10.0000 loss:6.8732
Episode:1427 meanR:9.2100 R:8.0000 loss:6.9082
Episode:1428 meanR:9.2100 R:9.0000 loss:6.9483
Episode:1429 meanR:9.2100 R:8.0000 loss:6.9593
Episode:14

Episode:1582 meanR:9.3100 R:9.0000 loss:6.9745
Episode:1583 meanR:9.3000 R:9.0000 loss:7.0241
Episode:1584 meanR:9.3200 R:10.0000 loss:7.0399
Episode:1585 meanR:9.3200 R:9.0000 loss:7.0740
Episode:1586 meanR:9.3200 R:10.0000 loss:6.9909
Episode:1587 meanR:9.3300 R:10.0000 loss:6.9463
Episode:1588 meanR:9.3200 R:9.0000 loss:7.0228
Episode:1589 meanR:9.3200 R:9.0000 loss:6.9734
Episode:1590 meanR:9.3200 R:10.0000 loss:6.9942
Episode:1591 meanR:9.3300 R:9.0000 loss:6.9773
Episode:1592 meanR:9.3300 R:9.0000 loss:7.0288
Episode:1593 meanR:9.3200 R:9.0000 loss:7.0783
Episode:1594 meanR:9.3100 R:8.0000 loss:7.1217
Episode:1595 meanR:9.3200 R:10.0000 loss:7.0827
Episode:1596 meanR:9.3300 R:11.0000 loss:7.0042
Episode:1597 meanR:9.3200 R:9.0000 loss:6.9668
Episode:1598 meanR:9.3200 R:10.0000 loss:6.9869
Episode:1599 meanR:9.3300 R:10.0000 loss:6.9448
Episode:1600 meanR:9.3100 R:8.0000 loss:7.0039
Episode:1601 meanR:9.3200 R:11.0000 loss:7.0082
Episode:1602 meanR:9.3300 R:10.0000 loss:6.9485
Epi

# Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')

## Testing

Let's checkout how our trained agent plays the game.

In [38]:
import gym
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Episode/epoch
    for _ in range(10):
        total_reward = 0
        state = env.reset()
        initial_state = sess.run(model.initial_state) # Qs or current batch or states[:-1]
        
        # Steps/batches
        while True:
            env.render()
            action_logits, initial_state = sess.run([model.actions_logits, model.final_state],
                                                    feed_dict = {model.states: state.reshape([1, -1]), 
                                                                 model.initial_state: initial_state})
            action = np.argmax(action_logits)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break
        # At the end of each episode
        print('total_reward:{}'.format(total_reward))

# Close the env
env.close()

INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0


## Extending this

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.