# Recurrent DQN

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
import tensorflow as tf
print('TensorFlow Version: {}'.format(tf.__version__))
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.12.0
Default GPU Device: /device:GPU:0


>**Note:** Make sure you have OpenAI Gym cloned into the same directory with this notebook. I've included `gym` as a submodule, so you can run `git submodule --init --recursive` to pull the contents into the `gym` repo.

>**Note:** Make sure you have OpenAI Gym cloned. Then run this command `pip install -e gym/[all]`.

In [2]:
import gym

# Create the Cart-Pole game environment
# env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')

We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing in an action as an integer to `env.step` will generate the next step in the simulation.  You can see how many actions are possible from `env.action_space` and to get a random action you can use `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to watch the simulation run.

In [3]:
# import numpy as np
# state = env.reset()
# for _ in range(10):
#     # env.render()
#     action = env.action_space.sample()
#     next_state, reward, done, info = env.step(action) # take a random action
#     #print('state, action, next_state, reward, done, info:', state, action, next_state, reward, done, info)
#     state = next_state
#     if done:
#         state = env.reset()

To shut the window showing the simulation, use `env.close()`.

If you ran the simulation above, we can look at the rewards:

In [4]:
# print(rewards[-20:])
# print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
# print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
# print(np.max(np.array(actions)), np.min(np.array(actions)))
# print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
# print(np.max(np.array(rewards)), np.min(np.array(rewards)))
# print(np.max(np.array(states)), np.min(np.array(states)))

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max{Q(s', a')}
$$

where $s$ is a state, $a$ is an action, and $s'$ is the next state from state $s$ and action $a$.

Before we used this equation to learn values for a Q-_table_. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating point precisions, you practically have infinite states. Instead of using a table then, we'll replace it with a neural network that will approximate the Q-table lookup function.

<img src="assets/deep-q-learning.png" width=450px>

Now, our Q value, $Q(s, a)$ is calculated by passing in a state to the network. The output will be Q-values for each available action, with fully connected hidden layers.

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$. 

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.

In [5]:
def model_input(state_size, hidden_size, batch_size=1):
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    # RNN
    cell = tf.nn.rnn_cell.GRUCell(hidden_size)
    #cell = tf.nn.rnn_cell.LSTMCell(hidden_size)
    cells = tf.nn.rnn_cell.MultiRNNCell([cell], state_is_tuple=True)
    initial_state = cells.zero_state(batch_size, tf.float32)
    return states, actions, targetQs, cells, initial_state

In [6]:
# RNN generator or sequence generator
def generator(states, action_size, initial_state, cells, hidden_size, reuse=False): 
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        #inputs = tf.layers.dense(inputs=states, units=hidden_size) # no xavier
        inputs = tf.contrib.layers.fully_connected(inputs=states, num_outputs=hidden_size, activation_fn=None)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        # dynamic means adapt to the batch_size and
        # static means can NOT adapt to the batch_size
        inputs_rnn = tf.reshape(inputs, [1, -1, hidden_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cells, inputs=inputs_rnn, 
                                                     initial_state=initial_state)
        print(outputs_rnn.shape, final_state)
        outputs = tf.reshape(outputs_rnn, [-1, hidden_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        #logits = tf.layers.dense(inputs=outputs, units=action_size)
        logits = tf.contrib.layers.fully_connected(inputs=outputs, num_outputs=action_size, activation_fn=None)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the action logits
        return logits, final_state

In [7]:
def model_loss(action_size, hidden_size, states, cells, initial_state, actions, targetQs):
    actions_logits, final_state = generator(states=states, cells=cells, initial_state=initial_state, 
                                            hidden_size=hidden_size, action_size=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    Qs = tf.reduce_max(actions_logits*actions_labels, axis=1)
    loss = tf.reduce_mean(tf.square(Qs - targetQs))
    return actions_logits, final_state, loss

In [8]:
def model_opt(loss, learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]

    # # Optimize MLP/CNN
    # with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
    opt = tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=g_vars)

    # # Optimize RNN
    # grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(loss, g_vars), clip_norm=5) # usually around 1-5
    # grads = tf.gradients(loss, g_vars)
    # opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(grads, g_vars))

    return opt

In [9]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, cells, self.initial_state = model_input(
                state_size=state_size, hidden_size=hidden_size)
        
        # Create the Model: calculating the loss and forwad pass
        self.actions_logits, self.final_state, self.loss = model_loss(
            action_size=action_size, hidden_size=hidden_size, 
            states=self.states, actions=self.actions, 
            targetQs=self.targetQs, cells=cells, initial_state=self.initial_state)

        # Update the model: backward pass and backprop
        self.opt = model_opt(loss=self.loss, learning_rate=learning_rate)

## Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on. 

Here, we'll create a `Memory` object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maxmium capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a `Memory` object. If you're unfamiliar with `deque`, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.

In [10]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
        self.states = deque(maxlen=max_size)

## Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon).  That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an **$\epsilon$-greedy policy**.


At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called _exploitation_. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.

## Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in _episodes_. One *episode* is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames. So we can start a new episode once meeting that goal. The game ends if the pole tilts over too far, or if the cart moves too far the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

* Initialize the memory $D$
* Initialize the action-value network $Q$ with random weights
* **For** episode = 1, $M$ **do**
  * **For** $t$, $T$ **do**
     * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s,a)$
     * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
     * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
     * Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
     * Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
     * Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
  * **endfor**
* **endfor**

## Hyperparameters

One of the more difficult aspects of reinforcememt learning are the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.

In [11]:
# Network parameters
action_size = 2
state_size = 4
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 128            # memory capacity - 1000 DQN
batch_size = 128             # experience mini-batch size - 20 DQN
gamma = 0.99                 # future reward discount

In [12]:
# Reset/init the graph/session
graph = tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=batch_size)

(?, 4) (?, 64)
(1, ?, 64) (<tf.Tensor 'MultiRNNCellZeroState/GRUCellZeroState/zeros:0' shape=(1, 64) dtype=float32>,)
(1, ?, 64) (<tf.Tensor 'generator/rnn/while/Exit_3:0' shape=(1, 64) dtype=float32>,)
(?, 64)
(?, 2)


In [13]:
model.initial_state[0]

<tf.Tensor 'MultiRNNCellZeroState/GRUCellZeroState/zeros:0' shape=(1, 64) dtype=float32>

## Populate the memory (exprience memory)

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.

In [14]:
import numpy as np
state = env.reset()
# for _ in range(memory_size):
for _ in range(1000):
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    memory.buffer.append([state, action, next_state, reward, float(done)])
    memory.states.append(np.zeros([1, hidden_size])) # gru
    #memory.states.append([np.zeros([1, hidden_size]), np.zeros([1, hidden_size])]) # lstm
    state = next_state
    
    batch = memory.buffer
    dones = np.array([each[4] for each in batch])
    if np.any(dones) != 1:
    #if np.all(dones) == 0:
        print('all zero', dones)
    else:
        print('all not zero', dones)

    if done is True:
        # Reseting the env/first state
        state = env.reset()

all zero [0.]
all zero [0. 0.]
all zero [0. 0. 0.]
all zero [0. 0. 0. 0.]
all zero [0. 0. 0. 0. 0.]
all zero [0. 0. 0. 0. 0. 0.]
all zero [0. 0. 0. 0. 0. 0. 0.]
all zero [0. 0. 0. 0. 0. 0. 0. 0.]
all zero [0. 0. 0. 0. 0. 0. 0. 0. 0.]
all zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
all zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
all zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
all zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 

all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.

 0. 0. 0. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 

 1. 0. 0. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.
 0. 0. 

 0. 0. 0. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 

all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 1. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.

 0. 0. 0. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 

all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 1. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.

 0. 0. 0. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0.]
all not zero [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 1. 

In [15]:
# Training
batch = memory.buffer
states = np.array([each[0] for each in batch])
actions = np.array([each[1] for each in batch])
next_states = np.array([each[2] for each in batch])
rewards = np.array([each[3] for each in batch])
dones = np.array([each[4] for each in batch])

In [16]:
np.all(dones) == 0

True

In [17]:
dones

array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [18]:
memory.states[0].shape, model.initial_state[0].shape # gru
# memory.states[0][1].shape, model.initial_state[0][1].shape #lstm

((1, 64), TensorShape([Dimension(1), Dimension(64)]))

In [19]:
memory.states[0][0].shape, model.initial_state[0][0].shape

((64,), TensorShape([Dimension(64)]))

## Training the model

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [20]:
# initial_states = np.array(memory.states)
# initial_states.shape

In [None]:
saver = tf.train.Saver()
episode_rewards_list, rewards_list, loss_list = [], [], []

# TF session for training
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        loss_batch = []
        state = env.reset()
        initial_state = sess.run(model.initial_state)

        # Training steps/batches
        while True:
            action_logits, final_state = sess.run([model.actions_logits, model.final_state],
                                                  feed_dict = {model.states: state.reshape([1, -1]), 
                                                               model.initial_state: initial_state})
            action = np.argmax(action_logits)
            next_state, reward, done, _ = env.step(action)
            memory.buffer.append([state, action, next_state, reward, float(done)])
            memory.states.append(initial_state)
            total_reward += reward
            state = next_state
            initial_state = final_state

            # Training
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            initial_states = memory.states
            if total_reward < batch_size*2:
                next_actions_logits = sess.run(model.actions_logits, 
                               feed_dict = {model.states: next_states,
                                            model.initial_state: initial_states[1]})
                nextQs = np.max(next_actions_logits, axis=1) * (1-dones)
                targetQs = rewards + (gamma * nextQs)
                loss, _ = sess.run([model.loss, model.opt], feed_dict = {model.states: states, 
                                                                         model.actions: actions,
                                                                         model.targetQs: targetQs,
                                                                         model.initial_state: initial_states[0]})
            #if total_reward >= batch_size*2:
            else:
                if np.any(dones) != 1:
                    next_actions_logits = sess.run(model.actions_logits, 
                                                   feed_dict = {model.states: next_states,
                                                                model.initial_state: initial_states[1]})
                    nextQs = np.max(next_actions_logits, axis=1) * (1-dones)
                    targetQs = rewards + (gamma * nextQs)
                    loss, _ = sess.run([model.loss, model.opt], feed_dict = {model.states: states, 
                                                                             model.actions: actions,
                                                                             model.targetQs: targetQs,
                                                                             model.initial_state: initial_states[0]})


                
            # End of training
            loss_batch.append(loss)
            if done is True:
                break
                
        # Outputing: priting out/Potting
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'loss:{:.4f}'.format(np.mean(loss_batch)))
        # Ploting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        loss_list.append([ep, np.mean(loss_batch)])
        # Break episode/epoch loop
        if np.mean(episode_reward) >= 500:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:10.0000 R:10.0000 loss:1.0461
Episode:1 meanR:9.0000 R:8.0000 loss:1.0226
Episode:2 meanR:8.6667 R:8.0000 loss:1.0369
Episode:3 meanR:8.7500 R:9.0000 loss:1.0590
Episode:4 meanR:8.8000 R:9.0000 loss:1.0530
Episode:5 meanR:8.8333 R:9.0000 loss:1.1506
Episode:6 meanR:8.7143 R:8.0000 loss:1.2907
Episode:7 meanR:8.8750 R:10.0000 loss:1.5255
Episode:8 meanR:8.8889 R:9.0000 loss:1.6509
Episode:9 meanR:8.9000 R:9.0000 loss:1.7502
Episode:10 meanR:8.9091 R:9.0000 loss:2.0777
Episode:11 meanR:8.9167 R:9.0000 loss:2.4766
Episode:12 meanR:9.0000 R:10.0000 loss:3.0180
Episode:13 meanR:9.0714 R:10.0000 loss:3.7824
Episode:14 meanR:9.1333 R:10.0000 loss:4.6141
Episode:15 meanR:9.1250 R:9.0000 loss:5.2151
Episode:16 meanR:9.1765 R:10.0000 loss:5.4744
Episode:17 meanR:9.2222 R:10.0000 loss:6.0155
Episode:18 meanR:9.2105 R:9.0000 loss:6.5282
Episode:19 meanR:9.2000 R:9.0000 loss:6.6173
Episode:20 meanR:9.1905 R:9.0000 loss:6.0309
Episode:21 meanR:9.1818 R:9.0000 loss:6.3887
Episode:22 m

Episode:176 meanR:23.7500 R:85.0000 loss:3.2936
Episode:177 meanR:24.7800 R:113.0000 loss:4.0674
Episode:178 meanR:24.8800 R:21.0000 loss:6.2252
Episode:179 meanR:25.5300 R:74.0000 loss:4.6256
Episode:180 meanR:25.6200 R:20.0000 loss:6.3641
Episode:181 meanR:26.5900 R:106.0000 loss:3.5971
Episode:182 meanR:27.8400 R:137.0000 loss:2.6298
Episode:183 meanR:29.3200 R:158.0000 loss:3.4979
Episode:184 meanR:29.4000 R:19.0000 loss:5.3051
Episode:185 meanR:29.4700 R:18.0000 loss:9.6686
Episode:186 meanR:29.5400 R:19.0000 loss:14.0093
Episode:187 meanR:29.9900 R:58.0000 loss:10.3577
Episode:188 meanR:30.0000 R:16.0000 loss:10.5548
Episode:189 meanR:29.9700 R:14.0000 loss:14.0711
Episode:190 meanR:29.9700 R:15.0000 loss:18.1053
Episode:191 meanR:29.9300 R:14.0000 loss:22.2209
Episode:192 meanR:29.8800 R:13.0000 loss:26.8222
Episode:193 meanR:29.7800 R:10.0000 loss:29.0232
Episode:194 meanR:29.6800 R:11.0000 loss:33.3440
Episode:195 meanR:29.6000 R:11.0000 loss:37.5003
Episode:196 meanR:29.4900 

Episode:343 meanR:196.4400 R:377.0000 loss:6.8478
Episode:344 meanR:199.7800 R:456.0000 loss:5.1800
Episode:345 meanR:203.5200 R:500.0000 loss:1.3486
Episode:346 meanR:207.4500 R:500.0000 loss:15.9409
Episode:347 meanR:208.5400 R:257.0000 loss:26.3498
Episode:348 meanR:208.4700 R:177.0000 loss:3.4239
Episode:349 meanR:210.2400 R:267.0000 loss:5.6473
Episode:350 meanR:210.1300 R:132.0000 loss:84.8151
Episode:351 meanR:213.6600 R:500.0000 loss:102.5286
Episode:352 meanR:217.9500 R:500.0000 loss:16.2909
Episode:353 meanR:221.5300 R:500.0000 loss:15.5904
Episode:354 meanR:225.9500 R:500.0000 loss:3.0531
Episode:355 meanR:229.5200 R:500.0000 loss:8.1196
Episode:356 meanR:233.0300 R:500.0000 loss:10.0707
Episode:357 meanR:236.5500 R:500.0000 loss:17.0361
Episode:358 meanR:240.4000 R:500.0000 loss:16.1903
Episode:359 meanR:243.9700 R:500.0000 loss:11.8942
Episode:360 meanR:248.5500 R:500.0000 loss:8.9329
Episode:361 meanR:252.8200 R:500.0000 loss:16.2006
Episode:362 meanR:256.5500 R:500.0000 

Episode:505 meanR:421.8700 R:48.0000 loss:24.3613
Episode:506 meanR:417.3500 R:48.0000 loss:26.7019
Episode:507 meanR:412.8900 R:54.0000 loss:26.1423
Episode:508 meanR:408.3700 R:48.0000 loss:27.0389
Episode:509 meanR:403.8000 R:43.0000 loss:30.8578
Episode:510 meanR:399.2700 R:47.0000 loss:28.5032
Episode:511 meanR:394.7400 R:47.0000 loss:27.5654
Episode:512 meanR:390.0900 R:35.0000 loss:28.8347
Episode:513 meanR:385.5800 R:49.0000 loss:30.4454
Episode:514 meanR:381.0900 R:51.0000 loss:26.1368
Episode:515 meanR:376.4500 R:36.0000 loss:25.1532
Episode:516 meanR:371.8700 R:42.0000 loss:29.4760
Episode:517 meanR:367.3100 R:44.0000 loss:29.8976
Episode:518 meanR:362.7200 R:41.0000 loss:32.0583
Episode:519 meanR:358.1700 R:45.0000 loss:33.7952
Episode:520 meanR:353.3900 R:22.0000 loss:37.1411
Episode:521 meanR:348.5900 R:20.0000 loss:48.8424
Episode:522 meanR:343.8800 R:29.0000 loss:78.1563
Episode:523 meanR:339.0400 R:16.0000 loss:86.6178
Episode:524 meanR:334.1900 R:15.0000 loss:93.7336


Episode:674 meanR:12.7000 R:9.0000 loss:4.7905
Episode:675 meanR:12.6400 R:10.0000 loss:4.4826
Episode:676 meanR:12.5600 R:9.0000 loss:5.3991
Episode:677 meanR:12.4900 R:10.0000 loss:5.1819
Episode:678 meanR:12.4100 R:9.0000 loss:4.2815
Episode:679 meanR:12.3300 R:11.0000 loss:5.4918
Episode:680 meanR:12.2500 R:10.0000 loss:5.8960
Episode:681 meanR:12.1600 R:9.0000 loss:5.0027
Episode:682 meanR:12.1000 R:11.0000 loss:5.0877
Episode:683 meanR:12.0100 R:9.0000 loss:6.0857
Episode:684 meanR:11.9300 R:10.0000 loss:5.7440
Episode:685 meanR:11.8400 R:10.0000 loss:4.9518
Episode:686 meanR:11.7500 R:11.0000 loss:6.7921
Episode:687 meanR:11.6500 R:10.0000 loss:6.1087
Episode:688 meanR:11.5300 R:11.0000 loss:5.4566
Episode:689 meanR:11.4200 R:9.0000 loss:5.6826
Episode:690 meanR:11.3200 R:10.0000 loss:6.7342
Episode:691 meanR:11.2100 R:10.0000 loss:6.5494
Episode:692 meanR:11.1200 R:11.0000 loss:6.2132
Episode:693 meanR:11.0300 R:9.0000 loss:6.4274
Episode:694 meanR:10.8900 R:9.0000 loss:6.3848


Episode:848 meanR:9.3900 R:8.0000 loss:22.8542
Episode:849 meanR:9.4200 R:11.0000 loss:22.7646
Episode:850 meanR:9.4100 R:9.0000 loss:23.6150
Episode:851 meanR:9.4100 R:10.0000 loss:22.9432
Episode:852 meanR:9.4200 R:10.0000 loss:25.5837
Episode:853 meanR:9.3900 R:8.0000 loss:24.6805
Episode:854 meanR:9.4000 R:10.0000 loss:24.0692
Episode:855 meanR:9.4000 R:9.0000 loss:23.1020
Episode:856 meanR:9.4100 R:9.0000 loss:24.4383
Episode:857 meanR:9.4100 R:10.0000 loss:25.5834
Episode:858 meanR:9.4200 R:11.0000 loss:24.4010
Episode:859 meanR:9.4200 R:9.0000 loss:23.4450
Episode:860 meanR:9.4100 R:8.0000 loss:25.5985
Episode:861 meanR:9.4100 R:10.0000 loss:24.0113
Episode:862 meanR:9.4200 R:9.0000 loss:26.8224
Episode:863 meanR:9.4200 R:10.0000 loss:27.2345
Episode:864 meanR:9.4100 R:9.0000 loss:24.8070
Episode:865 meanR:9.4000 R:9.0000 loss:25.1116
Episode:866 meanR:9.4100 R:10.0000 loss:25.2405
Episode:867 meanR:9.4100 R:9.0000 loss:26.8479
Episode:868 meanR:9.4100 R:9.0000 loss:28.5969
Epis

Episode:1021 meanR:9.4700 R:9.0000 loss:39.7348
Episode:1022 meanR:9.4800 R:10.0000 loss:55.6169
Episode:1023 meanR:9.4900 R:10.0000 loss:45.9734
Episode:1024 meanR:9.4800 R:9.0000 loss:43.6944
Episode:1025 meanR:9.4800 R:10.0000 loss:44.3618
Episode:1026 meanR:9.4700 R:10.0000 loss:41.3371
Episode:1027 meanR:9.4700 R:10.0000 loss:45.8141
Episode:1028 meanR:9.4700 R:10.0000 loss:42.3654
Episode:1029 meanR:9.4700 R:9.0000 loss:42.9421
Episode:1030 meanR:9.4800 R:10.0000 loss:53.2151
Episode:1031 meanR:9.4900 R:10.0000 loss:49.2492
Episode:1032 meanR:9.5000 R:11.0000 loss:45.4138
Episode:1033 meanR:9.5200 R:10.0000 loss:51.8035
Episode:1034 meanR:9.5200 R:10.0000 loss:42.8154
Episode:1035 meanR:9.5200 R:11.0000 loss:47.9700
Episode:1036 meanR:9.5200 R:9.0000 loss:48.0013
Episode:1037 meanR:9.5400 R:10.0000 loss:40.5743
Episode:1038 meanR:9.5500 R:10.0000 loss:47.7501
Episode:1039 meanR:9.5700 R:10.0000 loss:50.5131
Episode:1040 meanR:9.5600 R:9.0000 loss:45.0501
Episode:1041 meanR:9.5300

Episode:1190 meanR:9.5100 R:10.0000 loss:53.0110
Episode:1191 meanR:9.5200 R:10.0000 loss:61.2980
Episode:1192 meanR:9.5200 R:9.0000 loss:50.6978
Episode:1193 meanR:9.5200 R:10.0000 loss:56.3210
Episode:1194 meanR:9.5300 R:9.0000 loss:55.6725
Episode:1195 meanR:9.5400 R:10.0000 loss:46.3191
Episode:1196 meanR:9.5500 R:10.0000 loss:65.2830
Episode:1197 meanR:9.5600 R:10.0000 loss:70.9511
Episode:1198 meanR:9.5700 R:10.0000 loss:53.0575
Episode:1199 meanR:9.5800 R:10.0000 loss:70.3981
Episode:1200 meanR:9.5700 R:9.0000 loss:67.1932
Episode:1201 meanR:9.5600 R:9.0000 loss:64.8587
Episode:1202 meanR:9.5700 R:11.0000 loss:74.6979
Episode:1203 meanR:9.5700 R:9.0000 loss:63.3575
Episode:1204 meanR:9.5900 R:11.0000 loss:68.7937
Episode:1205 meanR:9.6000 R:10.0000 loss:62.6705
Episode:1206 meanR:9.6000 R:10.0000 loss:72.4889
Episode:1207 meanR:9.6000 R:10.0000 loss:67.8375
Episode:1208 meanR:9.6100 R:10.0000 loss:82.8191
Episode:1209 meanR:9.6000 R:9.0000 loss:66.0274
Episode:1210 meanR:9.6000 

Episode:1357 meanR:174.7400 R:500.0000 loss:18.3772
Episode:1358 meanR:179.6500 R:500.0000 loss:17.8876
Episode:1359 meanR:184.5600 R:500.0000 loss:2.7213
Episode:1360 meanR:189.4500 R:500.0000 loss:12.7177
Episode:1361 meanR:194.3500 R:500.0000 loss:16.1594
Episode:1362 meanR:199.2600 R:500.0000 loss:17.6253
Episode:1363 meanR:204.1700 R:500.0000 loss:17.1249
Episode:1364 meanR:209.0700 R:500.0000 loss:16.7163
Episode:1365 meanR:213.9700 R:500.0000 loss:16.2497
Episode:1366 meanR:218.8700 R:500.0000 loss:15.7353
Episode:1367 meanR:223.7700 R:500.0000 loss:13.7558
Episode:1368 meanR:228.6600 R:500.0000 loss:16.0299
Episode:1369 meanR:233.5600 R:500.0000 loss:14.8963
Episode:1370 meanR:238.4700 R:500.0000 loss:16.0364
Episode:1371 meanR:243.3800 R:500.0000 loss:15.4830
Episode:1372 meanR:248.2900 R:500.0000 loss:15.1818
Episode:1373 meanR:253.1900 R:500.0000 loss:15.3475
Episode:1374 meanR:258.0800 R:500.0000 loss:15.3618
Episode:1375 meanR:262.9800 R:500.0000 loss:15.7135
Episode:1376 

Episode:1515 meanR:402.8400 R:29.0000 loss:98.7559
Episode:1516 meanR:398.1800 R:34.0000 loss:74.6709
Episode:1517 meanR:398.1800 R:500.0000 loss:17.4314
Episode:1518 meanR:398.1800 R:500.0000 loss:11.4501
Episode:1519 meanR:398.1800 R:500.0000 loss:16.1090
Episode:1520 meanR:398.1800 R:500.0000 loss:17.9295
Episode:1521 meanR:398.1800 R:500.0000 loss:17.5267
Episode:1522 meanR:398.1800 R:500.0000 loss:17.0110
Episode:1523 meanR:398.1800 R:500.0000 loss:16.8723
Episode:1524 meanR:398.1800 R:500.0000 loss:15.6686
Episode:1525 meanR:395.4800 R:230.0000 loss:32.7716
Episode:1526 meanR:393.6000 R:312.0000 loss:19.7144
Episode:1527 meanR:393.6000 R:500.0000 loss:2.9918
Episode:1528 meanR:393.6000 R:500.0000 loss:16.4790
Episode:1529 meanR:393.6000 R:500.0000 loss:17.3043
Episode:1530 meanR:390.0400 R:144.0000 loss:44.5516
Episode:1531 meanR:390.0400 R:500.0000 loss:4.0059
Episode:1532 meanR:390.0400 R:500.0000 loss:17.5315
Episode:1533 meanR:390.0400 R:500.0000 loss:17.4757
Episode:1534 mea

Episode:1674 meanR:343.0900 R:117.0000 loss:2.4721
Episode:1675 meanR:338.4300 R:34.0000 loss:3.7417
Episode:1676 meanR:334.4700 R:104.0000 loss:16.9970
Episode:1677 meanR:330.5300 R:106.0000 loss:4.8074
Episode:1678 meanR:326.4100 R:88.0000 loss:4.5866
Episode:1679 meanR:322.3300 R:92.0000 loss:2.7817
Episode:1680 meanR:318.1700 R:84.0000 loss:2.0660
Episode:1681 meanR:314.0400 R:87.0000 loss:1.9202
Episode:1682 meanR:310.1000 R:106.0000 loss:1.8063
Episode:1683 meanR:306.9600 R:186.0000 loss:2.9366
Episode:1684 meanR:303.2000 R:124.0000 loss:69.8986
Episode:1685 meanR:303.2000 R:500.0000 loss:3.4480
Episode:1686 meanR:303.2000 R:500.0000 loss:16.4350
Episode:1687 meanR:303.2000 R:500.0000 loss:17.1081
Episode:1688 meanR:303.2000 R:500.0000 loss:15.4167
Episode:1689 meanR:304.9300 R:500.0000 loss:16.9477
Episode:1690 meanR:307.4800 R:500.0000 loss:14.7706
Episode:1691 meanR:310.5500 R:500.0000 loss:15.0833
Episode:1692 meanR:310.5500 R:500.0000 loss:16.5989
Episode:1693 meanR:310.5500

Episode:1833 meanR:420.7700 R:500.0000 loss:18.2832
Episode:1834 meanR:425.4300 R:500.0000 loss:18.0818
Episode:1835 meanR:429.3100 R:500.0000 loss:17.3192
Episode:1836 meanR:432.8300 R:500.0000 loss:18.2731
Episode:1837 meanR:436.0400 R:500.0000 loss:17.4692
Episode:1838 meanR:439.1700 R:500.0000 loss:15.8292
Episode:1839 meanR:441.9600 R:500.0000 loss:16.1114
Episode:1840 meanR:441.9600 R:500.0000 loss:17.0946
Episode:1841 meanR:442.6600 R:227.0000 loss:43.7793
Episode:1842 meanR:442.6600 R:500.0000 loss:6.0986
Episode:1843 meanR:442.6600 R:500.0000 loss:19.1443
Episode:1844 meanR:442.6600 R:500.0000 loss:16.8899
Episode:1845 meanR:442.6600 R:500.0000 loss:16.0094
Episode:1846 meanR:437.8700 R:21.0000 loss:65.5781
Episode:1847 meanR:432.9800 R:11.0000 loss:154.3963
Episode:1848 meanR:428.0800 R:10.0000 loss:236.1208
Episode:1849 meanR:423.2000 R:12.0000 loss:298.9400
Episode:1850 meanR:423.2000 R:500.0000 loss:35.1156
Episode:1851 meanR:423.2000 R:500.0000 loss:16.4883
Episode:1852 m

Episode:1992 meanR:390.4400 R:500.0000 loss:0.4413
Episode:1993 meanR:390.4400 R:500.0000 loss:13.6312
Episode:1994 meanR:390.4400 R:500.0000 loss:15.2546
Episode:1995 meanR:390.4400 R:500.0000 loss:16.3785
Episode:1996 meanR:390.4400 R:500.0000 loss:19.5763
Episode:1997 meanR:390.4400 R:500.0000 loss:19.7795
Episode:1998 meanR:390.4400 R:500.0000 loss:12.2195
Episode:1999 meanR:390.4400 R:500.0000 loss:19.2819
Episode:2000 meanR:390.4400 R:500.0000 loss:16.6284
Episode:2001 meanR:393.9200 R:500.0000 loss:9.8506
Episode:2002 meanR:393.9200 R:500.0000 loss:16.0254
Episode:2003 meanR:394.7500 R:500.0000 loss:18.7988
Episode:2004 meanR:398.4500 R:500.0000 loss:17.2291
Episode:2005 meanR:401.3100 R:500.0000 loss:15.5316
Episode:2006 meanR:401.3100 R:500.0000 loss:18.8113
Episode:2007 meanR:404.2100 R:500.0000 loss:21.3468
Episode:2008 meanR:408.2000 R:500.0000 loss:18.4296
Episode:2009 meanR:413.0600 R:500.0000 loss:22.4209
Episode:2010 meanR:413.0600 R:13.0000 loss:94.3076
Episode:2011 me

Episode:2151 meanR:334.4800 R:100.0000 loss:4.1486
Episode:2152 meanR:330.3500 R:87.0000 loss:5.0830
Episode:2153 meanR:326.1700 R:82.0000 loss:4.6329
Episode:2154 meanR:322.3700 R:120.0000 loss:3.5555
Episode:2155 meanR:319.1100 R:174.0000 loss:1.3911
Episode:2156 meanR:319.1100 R:500.0000 loss:0.5852
Episode:2157 meanR:319.1100 R:500.0000 loss:14.1187
Episode:2158 meanR:319.1100 R:500.0000 loss:12.3144
Episode:2159 meanR:315.9200 R:181.0000 loss:39.9093
Episode:2160 meanR:315.9200 R:500.0000 loss:1.0204
Episode:2161 meanR:319.9900 R:500.0000 loss:13.7196
Episode:2162 meanR:317.3100 R:232.0000 loss:27.7294
Episode:2163 meanR:317.3100 R:500.0000 loss:0.4962
Episode:2164 meanR:317.3100 R:500.0000 loss:15.8247
Episode:2165 meanR:317.3100 R:500.0000 loss:15.6488
Episode:2166 meanR:314.0600 R:175.0000 loss:41.7216
Episode:2167 meanR:311.3600 R:230.0000 loss:3.0953
Episode:2168 meanR:311.3600 R:500.0000 loss:1.4811
Episode:2169 meanR:311.3600 R:500.0000 loss:13.7616
Episode:2170 meanR:311.3

Episode:2310 meanR:419.5700 R:500.0000 loss:0.4055
Episode:2311 meanR:417.1800 R:261.0000 loss:28.4146
Episode:2312 meanR:415.5700 R:339.0000 loss:0.6832
Episode:2313 meanR:414.5700 R:400.0000 loss:1.3340
Episode:2314 meanR:412.0200 R:245.0000 loss:1.9042
Episode:2315 meanR:409.0700 R:205.0000 loss:1.4083
Episode:2316 meanR:407.2500 R:318.0000 loss:0.7008
Episode:2317 meanR:406.4600 R:421.0000 loss:0.7339
Episode:2318 meanR:404.9600 R:350.0000 loss:0.4033
Episode:2319 meanR:404.9600 R:500.0000 loss:0.1537
Episode:2320 meanR:407.4600 R:500.0000 loss:14.4635
Episode:2321 meanR:407.4600 R:500.0000 loss:17.1962
Episode:2322 meanR:407.4600 R:500.0000 loss:20.8002
Episode:2323 meanR:405.4800 R:302.0000 loss:26.9921
Episode:2324 meanR:405.9600 R:170.0000 loss:1.2462
Episode:2325 meanR:404.0700 R:125.0000 loss:2.1090
Episode:2326 meanR:400.1000 R:103.0000 loss:2.4429
Episode:2327 meanR:396.5300 R:143.0000 loss:3.4792
Episode:2328 meanR:394.9400 R:341.0000 loss:4.6255
Episode:2329 meanR:394.940

Episode:2469 meanR:396.2200 R:126.0000 loss:0.8828
Episode:2470 meanR:395.3300 R:20.0000 loss:3.2100
Episode:2471 meanR:395.0300 R:186.0000 loss:23.3619
Episode:2472 meanR:397.2200 R:373.0000 loss:2.4381
Episode:2473 meanR:399.9000 R:500.0000 loss:0.5950
Episode:2474 meanR:399.9000 R:500.0000 loss:8.1457
Episode:2475 meanR:397.4600 R:256.0000 loss:24.0450
Episode:2476 meanR:393.7600 R:130.0000 loss:24.0074
Episode:2477 meanR:390.0100 R:125.0000 loss:14.5781
Episode:2478 meanR:390.0100 R:500.0000 loss:1.9317
Episode:2479 meanR:390.0100 R:500.0000 loss:12.4701
Episode:2480 meanR:390.7000 R:500.0000 loss:11.7640
Episode:2481 meanR:387.1800 R:148.0000 loss:64.1526
Episode:2482 meanR:383.7600 R:158.0000 loss:10.7577
Episode:2483 meanR:380.2500 R:149.0000 loss:2.2949
Episode:2484 meanR:376.6900 R:144.0000 loss:1.4351
Episode:2485 meanR:373.7000 R:201.0000 loss:1.3447
Episode:2486 meanR:370.6500 R:195.0000 loss:1.3827
Episode:2487 meanR:372.9700 R:303.0000 loss:2.6943
Episode:2488 meanR:372.8

Episode:2629 meanR:300.8900 R:282.0000 loss:30.2587
Episode:2630 meanR:298.7200 R:283.0000 loss:8.2474
Episode:2631 meanR:295.0200 R:130.0000 loss:11.1228
Episode:2632 meanR:292.2800 R:226.0000 loss:3.3541
Episode:2633 meanR:290.1700 R:237.0000 loss:2.1836
Episode:2634 meanR:287.7500 R:258.0000 loss:1.2219
Episode:2635 meanR:287.7500 R:500.0000 loss:1.2143
Episode:2636 meanR:287.7500 R:500.0000 loss:17.9251
Episode:2637 meanR:287.7500 R:500.0000 loss:14.4875
Episode:2638 meanR:287.7500 R:500.0000 loss:18.6538
Episode:2639 meanR:287.7500 R:500.0000 loss:18.3905
Episode:2640 meanR:287.7500 R:500.0000 loss:18.2319
Episode:2641 meanR:287.7500 R:500.0000 loss:14.5969
Episode:2642 meanR:287.7500 R:500.0000 loss:12.1351
Episode:2643 meanR:292.6300 R:500.0000 loss:10.7546
Episode:2644 meanR:297.5100 R:500.0000 loss:16.2251
Episode:2645 meanR:302.4000 R:500.0000 loss:14.6255
Episode:2646 meanR:305.7200 R:343.0000 loss:22.4021
Episode:2647 meanR:303.2400 R:252.0000 loss:20.2215
Episode:2648 mean

Episode:2788 meanR:380.7500 R:500.0000 loss:10.0469
Episode:2789 meanR:375.8800 R:13.0000 loss:74.7477
Episode:2790 meanR:371.0500 R:17.0000 loss:130.6505
Episode:2791 meanR:366.8800 R:83.0000 loss:93.1438
Episode:2792 meanR:366.4700 R:459.0000 loss:13.3895
Episode:2793 meanR:366.5500 R:423.0000 loss:5.7487
Episode:2794 meanR:366.5500 R:500.0000 loss:6.8486
Episode:2795 meanR:366.5500 R:500.0000 loss:7.9346
Episode:2796 meanR:368.6300 R:500.0000 loss:10.7515
Episode:2797 meanR:371.5300 R:500.0000 loss:16.8726
Episode:2798 meanR:373.0300 R:393.0000 loss:17.2058
Episode:2799 meanR:373.4100 R:500.0000 loss:6.7415
Episode:2800 meanR:371.9000 R:349.0000 loss:17.7953
Episode:2801 meanR:368.9400 R:204.0000 loss:9.5447
Episode:2802 meanR:366.3800 R:244.0000 loss:4.2573
Episode:2803 meanR:364.5700 R:319.0000 loss:6.7521
Episode:2804 meanR:361.8000 R:12.0000 loss:37.6472
Episode:2805 meanR:363.9500 R:500.0000 loss:11.6120
Episode:2806 meanR:364.2100 R:500.0000 loss:18.0117
Episode:2807 meanR:365

Episode:2947 meanR:299.5400 R:500.0000 loss:16.6032
Episode:2948 meanR:299.5400 R:500.0000 loss:20.5057
Episode:2949 meanR:304.4400 R:500.0000 loss:17.6667
Episode:2950 meanR:309.3400 R:500.0000 loss:15.8337
Episode:2951 meanR:314.2400 R:500.0000 loss:12.9088
Episode:2952 meanR:317.2100 R:500.0000 loss:9.6908
Episode:2953 meanR:321.3400 R:500.0000 loss:17.9434
Episode:2954 meanR:325.5300 R:500.0000 loss:16.5764
Episode:2955 meanR:329.5600 R:500.0000 loss:12.9165
Episode:2956 meanR:333.4500 R:500.0000 loss:16.2706
Episode:2957 meanR:337.6200 R:500.0000 loss:13.6164
Episode:2958 meanR:338.0700 R:136.0000 loss:56.0611
Episode:2959 meanR:339.5200 R:228.0000 loss:18.4990
Episode:2960 meanR:340.1200 R:165.0000 loss:19.0281
Episode:2961 meanR:338.9500 R:11.0000 loss:38.0017
Episode:2962 meanR:337.9100 R:11.0000 loss:68.1333
Episode:2963 meanR:341.7700 R:500.0000 loss:8.6095
Episode:2964 meanR:345.6600 R:500.0000 loss:14.0633
Episode:2965 meanR:349.4200 R:500.0000 loss:14.5854
Episode:2966 mea

Episode:3106 meanR:422.7500 R:249.0000 loss:35.8499
Episode:3107 meanR:421.0600 R:11.0000 loss:62.3953
Episode:3108 meanR:416.1700 R:11.0000 loss:126.4422
Episode:3109 meanR:411.2700 R:10.0000 loss:148.0002
Episode:3110 meanR:410.0300 R:376.0000 loss:12.8411
Episode:3111 meanR:408.1000 R:236.0000 loss:6.4597
Episode:3112 meanR:408.6200 R:236.0000 loss:3.8888
Episode:3113 meanR:408.7700 R:187.0000 loss:6.1308
Episode:3114 meanR:409.3000 R:185.0000 loss:7.1919
Episode:3115 meanR:409.8800 R:179.0000 loss:4.7528
Episode:3116 meanR:410.5000 R:201.0000 loss:1.9924
Episode:3117 meanR:410.8000 R:212.0000 loss:2.9214
Episode:3118 meanR:410.4900 R:173.0000 loss:1.1967
Episode:3119 meanR:410.3300 R:157.0000 loss:0.7538
Episode:3120 meanR:406.8100 R:148.0000 loss:1.2081
Episode:3121 meanR:403.0500 R:124.0000 loss:0.7965
Episode:3122 meanR:399.2500 R:120.0000 loss:4.3610
Episode:3123 meanR:395.5800 R:133.0000 loss:2.7141
Episode:3124 meanR:391.9000 R:132.0000 loss:2.3600
Episode:3125 meanR:390.9000

Episode:3266 meanR:375.6400 R:500.0000 loss:13.1327
Episode:3267 meanR:380.2300 R:500.0000 loss:13.5866
Episode:3268 meanR:384.7700 R:500.0000 loss:14.7828
Episode:3269 meanR:389.5700 R:500.0000 loss:16.2012
Episode:3270 meanR:394.4500 R:500.0000 loss:13.9748
Episode:3271 meanR:399.2500 R:500.0000 loss:13.6256
Episode:3272 meanR:404.0700 R:500.0000 loss:14.5631
Episode:3273 meanR:408.5400 R:500.0000 loss:15.4746
Episode:3274 meanR:409.8400 R:147.0000 loss:54.2097
Episode:3275 meanR:414.7000 R:500.0000 loss:8.6223
Episode:3276 meanR:418.7900 R:500.0000 loss:15.8907
Episode:3277 meanR:418.0000 R:108.0000 loss:65.6427
Episode:3278 meanR:418.0000 R:500.0000 loss:6.8456
Episode:3279 meanR:418.0000 R:500.0000 loss:15.5766
Episode:3280 meanR:418.0000 R:500.0000 loss:16.1474
Episode:3281 meanR:418.0000 R:500.0000 loss:12.0129
Episode:3282 meanR:416.1900 R:319.0000 loss:20.4751
Episode:3283 meanR:415.2400 R:405.0000 loss:10.0206
Episode:3284 meanR:415.3400 R:500.0000 loss:6.1562
Episode:3285 me

Episode:3425 meanR:397.5200 R:500.0000 loss:12.9405
Episode:3426 meanR:400.3500 R:293.0000 loss:20.4696
Episode:3427 meanR:405.2500 R:500.0000 loss:6.1951
Episode:3428 meanR:408.0900 R:294.0000 loss:19.0606
Episode:3429 meanR:411.0200 R:304.0000 loss:10.5172
Episode:3430 meanR:406.1600 R:14.0000 loss:53.3712
Episode:3431 meanR:402.8600 R:170.0000 loss:81.7941
Episode:3432 meanR:399.8800 R:202.0000 loss:11.0101
Episode:3433 meanR:398.2500 R:186.0000 loss:10.3705
Episode:3434 meanR:395.3700 R:182.0000 loss:4.5895
Episode:3435 meanR:397.6500 R:500.0000 loss:1.0336
Episode:3436 meanR:395.0500 R:240.0000 loss:3.7921
Episode:3437 meanR:395.5000 R:323.0000 loss:1.6180
Episode:3438 meanR:393.9400 R:344.0000 loss:1.5977
Episode:3439 meanR:391.5500 R:261.0000 loss:6.2881
Episode:3440 meanR:387.7800 R:123.0000 loss:6.0025
Episode:3441 meanR:383.9600 R:118.0000 loss:5.2566
Episode:3442 meanR:380.2200 R:126.0000 loss:3.2174
Episode:3443 meanR:376.6000 R:138.0000 loss:2.0282
Episode:3444 meanR:373.2

Episode:3585 meanR:235.6200 R:500.0000 loss:13.4117
Episode:3586 meanR:239.8100 R:500.0000 loss:14.5580
Episode:3587 meanR:243.8400 R:500.0000 loss:10.8144
Episode:3588 meanR:247.8800 R:500.0000 loss:8.6154
Episode:3589 meanR:251.4700 R:500.0000 loss:16.8756
Episode:3590 meanR:255.1400 R:500.0000 loss:14.8022
Episode:3591 meanR:258.9900 R:500.0000 loss:7.8908
Episode:3592 meanR:262.9500 R:500.0000 loss:21.1307
Episode:3593 meanR:266.8600 R:500.0000 loss:16.4171
Episode:3594 meanR:270.5700 R:500.0000 loss:14.1909
Episode:3595 meanR:274.3400 R:500.0000 loss:18.8663
Episode:3596 meanR:277.9500 R:500.0000 loss:14.1085
Episode:3597 meanR:281.7900 R:500.0000 loss:15.0604
Episode:3598 meanR:285.6500 R:500.0000 loss:15.5552
Episode:3599 meanR:289.5000 R:500.0000 loss:12.2110
Episode:3600 meanR:293.4500 R:500.0000 loss:14.4948
Episode:3601 meanR:297.4200 R:500.0000 loss:14.1992
Episode:3602 meanR:301.2800 R:500.0000 loss:11.8871
Episode:3603 meanR:305.2200 R:500.0000 loss:14.0073
Episode:3604 m

Episode:3744 meanR:447.0400 R:500.0000 loss:15.0808
Episode:3745 meanR:447.0400 R:500.0000 loss:13.8546
Episode:3746 meanR:447.0400 R:500.0000 loss:105.3186
Episode:3747 meanR:442.1700 R:13.0000 loss:76.5474
Episode:3748 meanR:437.2600 R:9.0000 loss:132.6165
Episode:3749 meanR:432.3700 R:11.0000 loss:195.4415
Episode:3750 meanR:427.4800 R:11.0000 loss:212.3938
Episode:3751 meanR:422.6000 R:12.0000 loss:214.8519
Episode:3752 meanR:417.6900 R:9.0000 loss:211.3159
Episode:3753 meanR:412.8000 R:11.0000 loss:188.6227
Episode:3754 meanR:412.8000 R:500.0000 loss:59.7730
Episode:3755 meanR:411.7400 R:303.0000 loss:76.5944
Episode:3756 meanR:408.5400 R:180.0000 loss:22.2137
Episode:3757 meanR:405.2600 R:172.0000 loss:9.1537
Episode:3758 meanR:401.8500 R:159.0000 loss:11.8573
Episode:3759 meanR:398.4500 R:160.0000 loss:10.3224
Episode:3760 meanR:394.7900 R:134.0000 loss:11.9189
Episode:3761 meanR:391.1600 R:137.0000 loss:7.9426
Episode:3762 meanR:387.3000 R:114.0000 loss:9.7218
Episode:3763 mean

Episode:3906 meanR:33.3100 R:14.0000 loss:28.6708
Episode:3907 meanR:33.2700 R:65.0000 loss:27.5622
Episode:3908 meanR:33.8400 R:72.0000 loss:62.1313
Episode:3909 meanR:34.4600 R:73.0000 loss:106.4033
Episode:3910 meanR:34.6100 R:84.0000 loss:143.8314
Episode:3911 meanR:34.7700 R:82.0000 loss:128.9876
Episode:3912 meanR:34.8600 R:70.0000 loss:80.5375
Episode:3913 meanR:35.1100 R:83.0000 loss:68.0725
Episode:3914 meanR:35.3200 R:71.0000 loss:68.4324
Episode:3915 meanR:35.3700 R:69.0000 loss:107.4174
Episode:3916 meanR:35.4000 R:78.0000 loss:106.0110
Episode:3917 meanR:34.9400 R:14.0000 loss:96.7037
Episode:3918 meanR:34.5300 R:15.0000 loss:96.2158
Episode:3919 meanR:34.7000 R:80.0000 loss:76.3879
Episode:3920 meanR:34.2200 R:14.0000 loss:83.0319
Episode:3921 meanR:34.1400 R:71.0000 loss:71.8040
Episode:3922 meanR:34.9200 R:92.0000 loss:58.6102
Episode:3923 meanR:35.8900 R:110.0000 loss:80.3234
Episode:3924 meanR:36.7600 R:102.0000 loss:83.1984
Episode:3925 meanR:37.6000 R:96.0000 loss:7

Episode:4066 meanR:403.6200 R:500.0000 loss:4.8482
Episode:4067 meanR:403.6200 R:500.0000 loss:18.4675
Episode:4068 meanR:405.6600 R:500.0000 loss:18.4171
Episode:4069 meanR:406.7600 R:500.0000 loss:17.6761
Episode:4070 meanR:409.4200 R:500.0000 loss:14.3911
Episode:4071 meanR:411.9000 R:500.0000 loss:13.3832
Episode:4072 meanR:414.6500 R:500.0000 loss:12.8128
Episode:4073 meanR:413.2100 R:162.0000 loss:60.1689
Episode:4074 meanR:414.1800 R:500.0000 loss:7.2634
Episode:4075 meanR:414.1800 R:500.0000 loss:15.5438
Episode:4076 meanR:414.1800 R:500.0000 loss:14.4357
Episode:4077 meanR:414.1800 R:500.0000 loss:12.4554
Episode:4078 meanR:414.1800 R:500.0000 loss:13.8090
Episode:4079 meanR:414.1800 R:500.0000 loss:15.4528
Episode:4080 meanR:414.1800 R:500.0000 loss:14.1774
Episode:4081 meanR:414.1800 R:500.0000 loss:13.5192
Episode:4082 meanR:419.0600 R:500.0000 loss:16.1045
Episode:4083 meanR:419.0600 R:500.0000 loss:14.5571
Episode:4084 meanR:419.0600 R:500.0000 loss:13.9307
Episode:4085 m

Episode:4226 meanR:239.4900 R:290.0000 loss:1.2722
Episode:4227 meanR:242.1800 R:380.0000 loss:2.3634
Episode:4228 meanR:244.3700 R:322.0000 loss:0.6681


# Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')

## Testing

Let's checkout how our trained agent plays the game.

In [24]:
import gym
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Episode/epoch
    for _ in range(10):
        total_reward = 0
        state = env.reset()
        initial_state = sess.run(model.initial_state) # Qs or current batch or states[:-1]
        
        # Steps/batches
        while True:
            env.render()
            action_logits, initial_state = sess.run([model.actions_logits, model.final_state],
                                                    feed_dict = {model.states: state.reshape([1, -1]), 
                                                                 model.initial_state: initial_state})
            action = np.argmax(action_logits)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break
        # At the end of each episode
        print('total_reward:{}'.format(total_reward))

# Close the env
env.close()

INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0


## Extending this

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.