# Recurrent DQN

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
import tensorflow as tf
print('TensorFlow Version: {}'.format(tf.__version__))
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


>**Note:** Make sure you have OpenAI Gym cloned into the same directory with this notebook. I've included `gym` as a submodule, so you can run `git submodule --init --recursive` to pull the contents into the `gym` repo.

>**Note:** Make sure you have OpenAI Gym cloned. Then run this command `pip install -e gym/[all]`.

In [2]:
import gym

# Create the Cart-Pole game environment
# env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')

  result = entry_point.load(False)


We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing in an action as an integer to `env.step` will generate the next step in the simulation.  You can see how many actions are possible from `env.action_space` and to get a random action you can use `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to watch the simulation run.

In [3]:
# import numpy as np
# state = env.reset()
# for _ in range(10):
#     # env.render()
#     action = env.action_space.sample()
#     next_state, reward, done, info = env.step(action) # take a random action
#     #print('state, action, next_state, reward, done, info:', state, action, next_state, reward, done, info)
#     state = next_state
#     if done:
#         state = env.reset()

To shut the window showing the simulation, use `env.close()`.

If you ran the simulation above, we can look at the rewards:

In [4]:
# print(rewards[-20:])
# print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
# print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
# print(np.max(np.array(actions)), np.min(np.array(actions)))
# print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
# print(np.max(np.array(rewards)), np.min(np.array(rewards)))
# print(np.max(np.array(states)), np.min(np.array(states)))

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max{Q(s', a')}
$$

where $s$ is a state, $a$ is an action, and $s'$ is the next state from state $s$ and action $a$.

Before we used this equation to learn values for a Q-_table_. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating point precisions, you practically have infinite states. Instead of using a table then, we'll replace it with a neural network that will approximate the Q-table lookup function.

<img src="assets/deep-q-learning.png" width=450px>

Now, our Q value, $Q(s, a)$ is calculated by passing in a state to the network. The output will be Q-values for each available action, with fully connected hidden layers.

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$. 

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.

In [6]:
def model_input(state_size, hidden_size, batch_size=1):
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    # RNN
    cell = tf.nn.rnn_cell.GRUCell(hidden_size)
    #cell = tf.nn.rnn_cell.LSTMCell(hidden_size)
    cells = tf.nn.rnn_cell.MultiRNNCell([cell], state_is_tuple=True)
    initial_state = cells.zero_state(batch_size, tf.float32)
    is_training = tf.placeholder(dtype=tf.bool, shape=[], name='is_training')
    return states, actions, targetQs, cells, initial_state, is_training

In [7]:
# RNN generator or sequence generator
def generator(states, action_size, initial_state, cells, hidden_size, reuse=False, is_training=True): 
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        #inputs = tf.layers.dense(inputs=states, units=hidden_size)
        inputs = tf.contrib.layers.fully_connected(inputs=states, num_outputs=hidden_size, activation_fn=None)
        #inputs = tf.layers.batch_normalization(inputs=inputs, training=is_training)
        #inputs = tf.maximum(0.1*inputs, inputs)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        # dynamic means adapt to the batch_size and
        # static means can NOT adapt to the batch_size
        inputs_rnn = tf.reshape(inputs, [1, -1, hidden_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cells, inputs=inputs_rnn, initial_state=initial_state)
        #outputs_rnn = tf.layers.batch_normalization(inputs=outputs_rnn, training=is_training)
        #final_state = tf.layers.batch_normalization(inputs=final_state, training=is_training)
        print(outputs_rnn.shape, final_state)
        outputs = tf.reshape(outputs_rnn, [-1, hidden_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        #logits = tf.layers.dense(inputs=outputs, units=action_size)
        logits = tf.contrib.layers.fully_connected(inputs=outputs, num_outputs=action_size, activation_fn=None)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the action logits
        return logits, final_state

In [8]:
def model_loss(action_size, hidden_size, states, cells, initial_state, actions, targetQs, is_training):
    actions_logits, final_state = generator(states=states, cells=cells, initial_state=initial_state, 
                                            hidden_size=hidden_size, action_size=action_size, 
                                            is_training=is_training)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    Qs = tf.reduce_max(actions_logits*actions_labels, axis=1)
    loss = tf.reduce_mean(tf.square(Qs - targetQs))
    return actions_logits, final_state, loss

In [9]:
def model_opt(loss, learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]

    # # Optimize MLP/CNN/RNN without clipping grads
    #with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
    opt = tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=g_vars)

    # # Optimize RNN
    #grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(loss, g_vars), clip_norm=5) # usually around 1-5
    #grads = tf.gradients(loss, g_vars)
    #opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(grads, g_vars))
    return opt

In [10]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, cells, self.initial_state, self.is_training = model_input(
                state_size=state_size, hidden_size=hidden_size)
        
        # Create the Model: calculating the loss and forwad pass
        self.actions_logits, self.final_state, self.loss = model_loss(
            action_size=action_size, hidden_size=hidden_size, 
            states=self.states, actions=self.actions, 
            targetQs=self.targetQs, cells=cells, initial_state=self.initial_state, is_training=self.is_training)

        # Update the model: backward pass and backprop
        self.opt = model_opt(loss=self.loss, learning_rate=learning_rate)

## Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on. 

Here, we'll create a `Memory` object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maxmium capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a `Memory` object. If you're unfamiliar with `deque`, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.

In [11]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
        self.states = deque(maxlen=max_size)

## Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon).  That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an **$\epsilon$-greedy policy**.


At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called _exploitation_. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.

## Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in _episodes_. One *episode* is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames. So we can start a new episode once meeting that goal. The game ends if the pole tilts over too far, or if the cart moves too far the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

* Initialize the memory $D$
* Initialize the action-value network $Q$ with random weights
* **For** episode = 1, $M$ **do**
  * **For** $t$, $T$ **do**
     * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s,a)$
     * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
     * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
     * Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
     * Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
     * Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
  * **endfor**
* **endfor**

## Hyperparameters

One of the more difficult aspects of reinforcememt learning are the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.

In [12]:
# Network parameters
action_size = 2
state_size = 4
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 128            # memory capacity - 1000 DQN
batch_size = 128             # experience mini-batch size - 20 DQN
gamma = 0.99                 # future reward discount

In [13]:
# Reset/init the graph/session
graph = tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=batch_size)

(?, 4) (?, 64)
(1, ?, 64) (<tf.Tensor 'MultiRNNCellZeroState/GRUCellZeroState/zeros:0' shape=(1, 64) dtype=float32>,)
(1, ?, 64) (<tf.Tensor 'generator/rnn/while/Exit_3:0' shape=(1, 64) dtype=float32>,)
(?, 64)
(?, 2)


In [14]:
model.initial_state[0]

<tf.Tensor 'MultiRNNCellZeroState/GRUCellZeroState/zeros:0' shape=(1, 64) dtype=float32>

## Populate the memory (exprience memory)

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.

In [15]:
import numpy as np
state = env.reset()
for _ in range(memory_size):
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    memory.buffer.append([state, action, next_state, reward, float(done)])
    memory.states.append(np.zeros([1, hidden_size])) # gru
    #memory.states.append([np.zeros([1, hidden_size]), np.zeros([1, hidden_size])]) # lstm
    state = next_state
    if done is True:
        # Reseting the env/first state
        state = env.reset()

In [16]:
# # Training
# batch = memory.buffer
# states = np.array([each[0] for each in batch])
# actions = np.array([each[1] for each in batch])
# next_states = np.array([each[2] for each in batch])
# rewards = np.array([each[3] for each in batch])
# dones = np.array([each[4] for each in batch])

In [17]:
memory.states[0].shape, model.initial_state[0].shape # gru
# memory.states[0][1].shape, model.initial_state[0][1].shape #lstm

((1, 64), TensorShape([Dimension(1), Dimension(64)]))

In [18]:
# memory.states[0][0].shape, model.initial_state[0][0].shape

## Training the model

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [19]:
# initial_states = np.array(memory.states)
# initial_states.shape

In [None]:
saver = tf.train.Saver()
episode_rewards_list, rewards_list, loss_list = [], [], []

# TF session for training
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        loss_batch = []
        state = env.reset()
        initial_state = sess.run(model.initial_state)

        # Training steps/batches
        while True:
            action_logits, final_state = sess.run([model.actions_logits, model.final_state],
                                                  feed_dict = {model.states: state.reshape([1, -1]), 
                                                               model.is_training: False,
                                                               model.initial_state: initial_state})
            action = np.argmax(action_logits)
            next_state, reward, done, _ = env.step(action)
            memory.buffer.append([state, action, next_state, reward, float(done)])
            memory.states.append(initial_state)
            total_reward += reward
            state = next_state
            initial_state = final_state

            # Training
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            initial_states = memory.states
            next_actions_logits = sess.run(model.actions_logits, 
                                           feed_dict = {model.states: next_states,
                                                        model.is_training: False,
                                                        model.initial_state: initial_states[1]})
            nextQs = np.max(next_actions_logits, axis=1) * (1-dones)
            targetQs = rewards + (gamma * nextQs)
            loss, _ = sess.run([model.loss, model.opt], feed_dict = {model.states: states, 
                                                                     model.actions: actions,
                                                                     model.targetQs: targetQs,
                                                                     model.is_training: True,
                                                                     model.initial_state: initial_states[0]})
            # End of training
            loss_batch.append(loss)
            if done is True:
                break
                
        # Outputing: priting out/Potting
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'loss:{:.4f}'.format(np.mean(loss_batch)))
        # Ploting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        loss_list.append([ep, np.mean(loss_batch)])
        # Break episode/epoch loop
        if np.mean(episode_reward) >= 500:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:9.0000 R:9.0000 loss:1.0467
Episode:1 meanR:9.0000 R:9.0000 loss:0.9682
Episode:2 meanR:9.0000 R:9.0000 loss:0.9560
Episode:3 meanR:9.0000 R:9.0000 loss:0.9286
Episode:4 meanR:8.8000 R:8.0000 loss:0.9060
Episode:5 meanR:8.8333 R:9.0000 loss:0.9004
Episode:6 meanR:8.7143 R:8.0000 loss:0.8640
Episode:7 meanR:8.7500 R:9.0000 loss:0.8014
Episode:8 meanR:8.7778 R:9.0000 loss:0.8661
Episode:9 meanR:8.8000 R:9.0000 loss:0.9917
Episode:10 meanR:8.8182 R:9.0000 loss:0.9834
Episode:11 meanR:8.8333 R:9.0000 loss:1.1289
Episode:12 meanR:8.8462 R:9.0000 loss:1.4240
Episode:13 meanR:8.9286 R:10.0000 loss:1.7760
Episode:14 meanR:8.8667 R:8.0000 loss:2.0664
Episode:15 meanR:8.9375 R:10.0000 loss:2.4236
Episode:16 meanR:8.9412 R:9.0000 loss:2.6084
Episode:17 meanR:9.0000 R:10.0000 loss:3.0851
Episode:18 meanR:9.0000 R:9.0000 loss:3.2646
Episode:19 meanR:8.9500 R:8.0000 loss:3.7522
Episode:20 meanR:8.9048 R:8.0000 loss:3.8816
Episode:21 meanR:8.9091 R:9.0000 loss:4.2070
Episode:22 meanR:

Episode:175 meanR:46.4500 R:32.0000 loss:7.0689
Episode:176 meanR:46.1300 R:28.0000 loss:13.3467
Episode:177 meanR:45.7700 R:41.0000 loss:19.7764
Episode:178 meanR:46.5700 R:119.0000 loss:11.6305
Episode:179 meanR:47.2000 R:106.0000 loss:1.1850
Episode:180 meanR:48.0500 R:122.0000 loss:0.6443
Episode:181 meanR:48.0300 R:33.0000 loss:1.6893
Episode:182 meanR:48.9100 R:134.0000 loss:5.8248
Episode:183 meanR:48.5900 R:18.0000 loss:3.7875
Episode:184 meanR:48.2600 R:18.0000 loss:8.9402
Episode:185 meanR:48.2700 R:17.0000 loss:40.4200
Episode:186 meanR:48.3100 R:18.0000 loss:79.4423
Episode:187 meanR:48.4100 R:24.0000 loss:46.5917
Episode:188 meanR:48.4000 R:16.0000 loss:65.1895
Episode:189 meanR:48.4400 R:18.0000 loss:177.5946
Episode:190 meanR:48.4900 R:18.0000 loss:139.4620
Episode:191 meanR:48.5200 R:20.0000 loss:126.9098
Episode:192 meanR:48.5700 R:20.0000 loss:118.0915
Episode:193 meanR:48.5700 R:16.0000 loss:134.6103
Episode:194 meanR:48.6800 R:21.0000 loss:111.0099
Episode:195 meanR

Episode:340 meanR:285.9300 R:500.0000 loss:2.2083
Episode:341 meanR:284.4600 R:145.0000 loss:55.0653
Episode:342 meanR:285.9800 R:500.0000 loss:2.3889
Episode:343 meanR:288.1400 R:500.0000 loss:15.7232
Episode:344 meanR:288.6700 R:202.0000 loss:38.9608
Episode:345 meanR:293.3700 R:500.0000 loss:4.0650
Episode:346 meanR:293.4800 R:116.0000 loss:61.1797
Episode:347 meanR:296.1900 R:292.0000 loss:4.4649
Episode:348 meanR:301.0000 R:500.0000 loss:5.0194
Episode:349 meanR:305.0900 R:500.0000 loss:15.9677
Episode:350 meanR:308.8000 R:500.0000 loss:15.9677
Episode:351 meanR:312.6800 R:500.0000 loss:16.0498
Episode:352 meanR:313.8600 R:273.0000 loss:29.6934
Episode:353 meanR:317.7300 R:500.0000 loss:7.7973
Episode:354 meanR:317.7300 R:500.0000 loss:5.6743
Episode:355 meanR:320.4700 R:500.0000 loss:15.7821
Episode:356 meanR:324.5100 R:500.0000 loss:15.4527
Episode:357 meanR:327.0500 R:500.0000 loss:15.1766
Episode:358 meanR:329.9100 R:311.0000 loss:15.1078
Episode:359 meanR:329.9600 R:256.0000 

Episode:503 meanR:370.6000 R:293.0000 loss:3.5270
Episode:504 meanR:369.7800 R:235.0000 loss:6.1629
Episode:505 meanR:369.5600 R:201.0000 loss:5.4633
Episode:506 meanR:369.4000 R:291.0000 loss:3.3982
Episode:507 meanR:368.6900 R:366.0000 loss:3.4319
Episode:508 meanR:369.7200 R:500.0000 loss:1.8378
Episode:509 meanR:371.6400 R:500.0000 loss:7.6006
Episode:510 meanR:373.6300 R:500.0000 loss:11.4806
Episode:511 meanR:375.7700 R:500.0000 loss:8.3869
Episode:512 meanR:377.9100 R:500.0000 loss:17.0206
Episode:513 meanR:380.1700 R:500.0000 loss:10.4035
Episode:514 meanR:382.4300 R:500.0000 loss:9.9490
Episode:515 meanR:384.9000 R:500.0000 loss:16.6122
Episode:516 meanR:387.4600 R:500.0000 loss:14.0733
Episode:517 meanR:389.9300 R:500.0000 loss:16.3693
Episode:518 meanR:392.5100 R:500.0000 loss:8.9130
Episode:519 meanR:395.2800 R:500.0000 loss:2.3208
Episode:520 meanR:398.3300 R:500.0000 loss:13.8630
Episode:521 meanR:401.0800 R:500.0000 loss:10.1885
Episode:522 meanR:403.2000 R:500.0000 loss

Episode:665 meanR:464.0200 R:500.0000 loss:13.8148
Episode:666 meanR:464.0200 R:500.0000 loss:16.8639
Episode:667 meanR:464.0200 R:500.0000 loss:15.3410
Episode:668 meanR:464.0200 R:500.0000 loss:18.5413
Episode:669 meanR:459.1100 R:9.0000 loss:74.6169
Episode:670 meanR:454.1900 R:8.0000 loss:135.0093
Episode:671 meanR:449.2800 R:9.0000 loss:180.3061
Episode:672 meanR:444.3700 R:9.0000 loss:213.6908
Episode:673 meanR:439.9900 R:62.0000 loss:174.4552
Episode:674 meanR:439.9900 R:500.0000 loss:16.4388
Episode:675 meanR:439.9900 R:500.0000 loss:20.9716
Episode:676 meanR:439.9900 R:500.0000 loss:19.2024
Episode:677 meanR:439.9900 R:500.0000 loss:17.0914
Episode:678 meanR:439.9900 R:500.0000 loss:16.9162
Episode:679 meanR:439.9900 R:500.0000 loss:15.6760
Episode:680 meanR:439.9900 R:500.0000 loss:16.2475
Episode:681 meanR:439.9900 R:500.0000 loss:16.7801
Episode:682 meanR:439.9900 R:500.0000 loss:16.1077
Episode:683 meanR:439.9900 R:500.0000 loss:16.0893
Episode:684 meanR:439.9900 R:500.000

Episode:827 meanR:421.4500 R:500.0000 loss:16.7476
Episode:828 meanR:421.4500 R:500.0000 loss:16.5852
Episode:829 meanR:421.4500 R:500.0000 loss:17.4901
Episode:830 meanR:421.4500 R:500.0000 loss:14.5099
Episode:831 meanR:421.4500 R:500.0000 loss:17.5423
Episode:832 meanR:421.4500 R:500.0000 loss:16.6442
Episode:833 meanR:421.4500 R:500.0000 loss:16.5223
Episode:834 meanR:421.4500 R:500.0000 loss:17.5643
Episode:835 meanR:421.4500 R:500.0000 loss:17.0826
Episode:836 meanR:421.4500 R:500.0000 loss:16.8257
Episode:837 meanR:421.4500 R:500.0000 loss:16.1577
Episode:838 meanR:421.4500 R:500.0000 loss:13.5846
Episode:839 meanR:421.4500 R:500.0000 loss:17.5997
Episode:840 meanR:421.4500 R:500.0000 loss:17.8539
Episode:841 meanR:421.4500 R:500.0000 loss:16.6840
Episode:842 meanR:421.4500 R:500.0000 loss:17.6159
Episode:843 meanR:421.4500 R:500.0000 loss:17.0170
Episode:844 meanR:421.4500 R:500.0000 loss:15.9543
Episode:845 meanR:421.4500 R:500.0000 loss:16.0632
Episode:846 meanR:426.2400 R:50

Episode:988 meanR:414.0000 R:390.0000 loss:3.1098
Episode:989 meanR:410.4500 R:145.0000 loss:7.9076
Episode:990 meanR:410.4500 R:500.0000 loss:15.5664
Episode:991 meanR:410.4500 R:500.0000 loss:7.2002
Episode:992 meanR:410.4500 R:500.0000 loss:15.1668
Episode:993 meanR:410.4500 R:500.0000 loss:8.3120
Episode:994 meanR:410.4500 R:500.0000 loss:14.7539
Episode:995 meanR:410.4500 R:500.0000 loss:10.0797
Episode:996 meanR:410.4500 R:500.0000 loss:13.7502
Episode:997 meanR:410.4500 R:500.0000 loss:1.6716
Episode:998 meanR:407.1100 R:166.0000 loss:44.1604
Episode:999 meanR:403.9600 R:185.0000 loss:8.4509
Episode:1000 meanR:400.2600 R:130.0000 loss:6.5807
Episode:1001 meanR:396.5000 R:124.0000 loss:4.6192
Episode:1002 meanR:392.7000 R:120.0000 loss:1.5625
Episode:1003 meanR:388.9700 R:127.0000 loss:1.0043
Episode:1004 meanR:385.4900 R:152.0000 loss:4.1244
Episode:1005 meanR:381.9800 R:149.0000 loss:3.3603
Episode:1006 meanR:378.5000 R:152.0000 loss:3.4069
Episode:1007 meanR:375.3200 R:182.000

Episode:1147 meanR:373.8000 R:500.0000 loss:1.7408
Episode:1148 meanR:372.8000 R:400.0000 loss:24.1503
Episode:1149 meanR:370.4300 R:263.0000 loss:13.5825
Episode:1150 meanR:367.0300 R:160.0000 loss:14.4327
Episode:1151 meanR:364.1000 R:207.0000 loss:11.2327
Episode:1152 meanR:361.0900 R:199.0000 loss:8.6173
Episode:1153 meanR:358.2900 R:220.0000 loss:3.7207
Episode:1154 meanR:355.7100 R:242.0000 loss:4.3984
Episode:1155 meanR:352.0200 R:131.0000 loss:10.5910
Episode:1156 meanR:348.1000 R:108.0000 loss:28.5012
Episode:1157 meanR:344.2100 R:111.0000 loss:8.8972
Episode:1158 meanR:340.4300 R:122.0000 loss:7.5888
Episode:1159 meanR:336.6600 R:123.0000 loss:15.8833
Episode:1160 meanR:333.0700 R:141.0000 loss:6.2772
Episode:1161 meanR:329.6100 R:154.0000 loss:3.2040
Episode:1162 meanR:329.6100 R:500.0000 loss:2.0849
Episode:1163 meanR:330.3900 R:500.0000 loss:12.9452
Episode:1164 meanR:330.3900 R:500.0000 loss:15.7760
Episode:1165 meanR:330.3900 R:500.0000 loss:14.5563
Episode:1166 meanR:33

Episode:1306 meanR:412.2900 R:11.0000 loss:104.8844
Episode:1307 meanR:407.3800 R:9.0000 loss:111.9859
Episode:1308 meanR:404.6900 R:11.0000 loss:104.9625
Episode:1309 meanR:399.8000 R:11.0000 loss:98.3439
Episode:1310 meanR:396.3700 R:157.0000 loss:42.3236
Episode:1311 meanR:395.1100 R:374.0000 loss:3.1322
Episode:1312 meanR:395.1100 R:500.0000 loss:7.9798
Episode:1313 meanR:395.1100 R:500.0000 loss:19.7680
Episode:1314 meanR:395.1100 R:500.0000 loss:17.4402
Episode:1315 meanR:399.9900 R:500.0000 loss:14.8028
Episode:1316 meanR:399.9900 R:500.0000 loss:18.6958
Episode:1317 meanR:399.9900 R:500.0000 loss:15.2002
Episode:1318 meanR:396.3000 R:131.0000 loss:53.7690
Episode:1319 meanR:392.5900 R:129.0000 loss:14.3778
Episode:1320 meanR:391.4700 R:141.0000 loss:2.8177
Episode:1321 meanR:391.4700 R:500.0000 loss:1.7678
Episode:1322 meanR:391.4700 R:500.0000 loss:16.5788
Episode:1323 meanR:391.5800 R:232.0000 loss:35.9594
Episode:1324 meanR:388.4300 R:185.0000 loss:23.5344
Episode:1325 meanR

Episode:1465 meanR:462.4200 R:500.0000 loss:17.9916
Episode:1466 meanR:462.4200 R:500.0000 loss:19.2388
Episode:1467 meanR:462.4200 R:500.0000 loss:15.8482
Episode:1468 meanR:463.1300 R:500.0000 loss:15.2927
Episode:1469 meanR:463.1900 R:500.0000 loss:16.5637
Episode:1470 meanR:463.1900 R:500.0000 loss:18.2540
Episode:1471 meanR:464.3500 R:500.0000 loss:14.1503
Episode:1472 meanR:464.3500 R:500.0000 loss:14.6366
Episode:1473 meanR:465.4800 R:500.0000 loss:11.5582
Episode:1474 meanR:465.4800 R:500.0000 loss:14.8158
Episode:1475 meanR:470.2600 R:500.0000 loss:18.9591
Episode:1476 meanR:471.2600 R:118.0000 loss:55.5188
Episode:1477 meanR:476.1100 R:500.0000 loss:5.5733
Episode:1478 meanR:476.1100 R:500.0000 loss:17.8112
Episode:1479 meanR:476.1100 R:500.0000 loss:17.5631
Episode:1480 meanR:476.1100 R:500.0000 loss:16.6429
Episode:1481 meanR:476.1100 R:500.0000 loss:16.3843
Episode:1482 meanR:476.1100 R:500.0000 loss:17.0245
Episode:1483 meanR:476.1100 R:500.0000 loss:17.4207
Episode:1484 

Episode:1625 meanR:221.2900 R:500.0000 loss:1.0265
Episode:1626 meanR:221.5100 R:152.0000 loss:4.9133
Episode:1627 meanR:221.9300 R:162.0000 loss:1.6076
Episode:1628 meanR:222.6200 R:195.0000 loss:2.4083
Episode:1629 meanR:223.8600 R:246.0000 loss:3.4516
Episode:1630 meanR:227.7100 R:500.0000 loss:2.0590
Episode:1631 meanR:231.5500 R:500.0000 loss:16.1068
Episode:1632 meanR:235.4100 R:500.0000 loss:15.6579
Episode:1633 meanR:239.3000 R:500.0000 loss:15.7809
Episode:1634 meanR:243.0500 R:500.0000 loss:13.2456
Episode:1635 meanR:246.9700 R:500.0000 loss:14.3897
Episode:1636 meanR:251.3200 R:500.0000 loss:15.1901
Episode:1637 meanR:255.3600 R:500.0000 loss:13.7688
Episode:1638 meanR:258.4800 R:500.0000 loss:17.8211
Episode:1639 meanR:259.9500 R:301.0000 loss:28.9865
Episode:1640 meanR:261.1900 R:190.0000 loss:26.1155
Episode:1641 meanR:265.7300 R:500.0000 loss:8.1927
Episode:1642 meanR:270.3100 R:500.0000 loss:15.6122
Episode:1643 meanR:274.1500 R:500.0000 loss:8.4628
Episode:1644 meanR:2

Episode:1784 meanR:379.5200 R:500.0000 loss:15.6621
Episode:1785 meanR:379.5200 R:500.0000 loss:16.9907
Episode:1786 meanR:376.2400 R:172.0000 loss:29.8821
Episode:1787 meanR:373.2800 R:204.0000 loss:4.8813
Episode:1788 meanR:373.2800 R:500.0000 loss:2.2122
Episode:1789 meanR:373.2800 R:500.0000 loss:16.3277
Episode:1790 meanR:376.7100 R:500.0000 loss:4.0032
Episode:1791 meanR:381.6000 R:500.0000 loss:17.0274
Episode:1792 meanR:386.5000 R:500.0000 loss:8.8948
Episode:1793 meanR:386.5000 R:500.0000 loss:13.0436
Episode:1794 meanR:388.3400 R:500.0000 loss:14.2799
Episode:1795 meanR:390.2500 R:500.0000 loss:13.0076
Episode:1796 meanR:390.2500 R:500.0000 loss:15.0066
Episode:1797 meanR:391.4600 R:500.0000 loss:12.6649
Episode:1798 meanR:395.1700 R:500.0000 loss:19.1124
Episode:1799 meanR:395.5000 R:148.0000 loss:42.6403
Episode:1800 meanR:395.5100 R:138.0000 loss:9.6472
Episode:1801 meanR:391.8500 R:134.0000 loss:2.8756
Episode:1802 meanR:388.2900 R:144.0000 loss:1.3432
Episode:1803 meanR:

Episode:1943 meanR:447.4200 R:500.0000 loss:11.5162
Episode:1944 meanR:450.3700 R:500.0000 loss:11.2253
Episode:1945 meanR:453.1700 R:500.0000 loss:14.6064
Episode:1946 meanR:456.0400 R:500.0000 loss:13.5088
Episode:1947 meanR:455.6800 R:167.0000 loss:41.4153
Episode:1948 meanR:451.9800 R:130.0000 loss:19.8453
Episode:1949 meanR:451.1100 R:131.0000 loss:5.3257
Episode:1950 meanR:448.6400 R:138.0000 loss:2.3507
Episode:1951 meanR:447.7400 R:127.0000 loss:2.8852
Episode:1952 meanR:445.2300 R:118.0000 loss:2.6975
Episode:1953 meanR:441.5600 R:133.0000 loss:4.1349
Episode:1954 meanR:437.7900 R:123.0000 loss:1.7306
Episode:1955 meanR:434.0600 R:127.0000 loss:2.0055
Episode:1956 meanR:429.9800 R:92.0000 loss:1.8218
Episode:1957 meanR:425.9900 R:101.0000 loss:3.0727
Episode:1958 meanR:422.0700 R:108.0000 loss:2.5720
Episode:1959 meanR:418.1800 R:111.0000 loss:2.7739
Episode:1960 meanR:414.1000 R:92.0000 loss:3.7689
Episode:1961 meanR:410.0100 R:91.0000 loss:2.5218
Episode:1962 meanR:406.0800 

Episode:2102 meanR:421.5000 R:500.0000 loss:16.9919
Episode:2103 meanR:424.6700 R:500.0000 loss:18.0242
Episode:2104 meanR:428.1800 R:500.0000 loss:12.6985
Episode:2105 meanR:431.6100 R:500.0000 loss:11.6813
Episode:2106 meanR:434.8100 R:500.0000 loss:11.9263
Episode:2107 meanR:437.8200 R:500.0000 loss:14.2025
Episode:2108 meanR:434.9700 R:12.0000 loss:85.2197
Episode:2109 meanR:430.0900 R:12.0000 loss:134.9814
Episode:2110 meanR:432.2300 R:500.0000 loss:21.4588
Episode:2111 meanR:433.2300 R:500.0000 loss:8.7990
Episode:2112 meanR:433.2300 R:500.0000 loss:10.6165
Episode:2113 meanR:438.1300 R:500.0000 loss:13.6817
Episode:2114 meanR:438.1300 R:500.0000 loss:17.1965
Episode:2115 meanR:435.8200 R:269.0000 loss:24.0787
Episode:2116 meanR:432.3100 R:149.0000 loss:9.4638
Episode:2117 meanR:430.1800 R:287.0000 loss:3.7019
Episode:2118 meanR:426.7100 R:153.0000 loss:14.8959
Episode:2119 meanR:424.2800 R:257.0000 loss:6.3469
Episode:2120 meanR:421.4100 R:213.0000 loss:19.0762
Episode:2121 mean

Episode:2261 meanR:218.9600 R:191.0000 loss:762.4029
Episode:2262 meanR:218.0300 R:43.0000 loss:1021.7727
Episode:2263 meanR:217.5400 R:104.0000 loss:1878.4880
Episode:2264 meanR:216.9600 R:89.0000 loss:1154.8445
Episode:2265 meanR:216.6700 R:72.0000 loss:895.3315
Episode:2266 meanR:217.1000 R:136.0000 loss:1414.5948
Episode:2267 meanR:217.0700 R:95.0000 loss:826.0140
Episode:2268 meanR:217.9900 R:194.0000 loss:499.3600
Episode:2269 meanR:216.9200 R:23.0000 loss:383.0900
Episode:2270 meanR:218.5100 R:322.0000 loss:807.3749
Episode:2271 meanR:217.4100 R:26.0000 loss:451.4813
Episode:2272 meanR:218.1800 R:223.0000 loss:778.7585
Episode:2273 meanR:215.6900 R:26.0000 loss:66.5100
Episode:2274 meanR:214.5300 R:128.0000 loss:296.6781
Episode:2275 meanR:210.9000 R:137.0000 loss:56.5453
Episode:2276 meanR:211.0200 R:267.0000 loss:741.5386
Episode:2277 meanR:209.6200 R:27.0000 loss:68.4111
Episode:2278 meanR:205.8300 R:121.0000 loss:334.6640
Episode:2279 meanR:202.6700 R:184.0000 loss:15.5824
E

Episode:2422 meanR:9.9500 R:12.0000 loss:425.0192
Episode:2423 meanR:9.9400 R:10.0000 loss:423.1372
Episode:2424 meanR:9.9300 R:8.0000 loss:427.3198
Episode:2425 meanR:9.9300 R:10.0000 loss:419.0976
Episode:2426 meanR:9.9200 R:9.0000 loss:425.6490
Episode:2427 meanR:9.8900 R:9.0000 loss:422.2402
Episode:2428 meanR:9.8900 R:9.0000 loss:417.3763
Episode:2429 meanR:9.8500 R:8.0000 loss:414.9992
Episode:2430 meanR:9.7900 R:12.0000 loss:406.9022
Episode:2431 meanR:9.7900 R:9.0000 loss:399.7598
Episode:2432 meanR:9.7800 R:9.0000 loss:401.7686
Episode:2433 meanR:9.7700 R:10.0000 loss:411.0281
Episode:2434 meanR:9.7800 R:10.0000 loss:410.4541
Episode:2435 meanR:9.7600 R:9.0000 loss:443.3986
Episode:2436 meanR:9.7600 R:9.0000 loss:418.5411
Episode:2437 meanR:9.7400 R:10.0000 loss:418.9617
Episode:2438 meanR:9.7500 R:10.0000 loss:423.3995
Episode:2439 meanR:9.7800 R:12.0000 loss:418.7549
Episode:2440 meanR:9.7500 R:9.0000 loss:418.2703
Episode:2441 meanR:9.7400 R:9.0000 loss:419.8229
Episode:244

Episode:2588 meanR:9.8200 R:9.0000 loss:358.1794
Episode:2589 meanR:9.8300 R:10.0000 loss:379.5706
Episode:2590 meanR:9.8300 R:9.0000 loss:350.5504
Episode:2591 meanR:9.7900 R:9.0000 loss:367.2257
Episode:2592 meanR:9.8100 R:11.0000 loss:376.9030
Episode:2593 meanR:9.8100 R:10.0000 loss:327.3416
Episode:2594 meanR:9.8100 R:11.0000 loss:331.3095
Episode:2595 meanR:9.8100 R:10.0000 loss:324.4184
Episode:2596 meanR:9.8200 R:10.0000 loss:333.2646
Episode:2597 meanR:9.8300 R:10.0000 loss:331.4540
Episode:2598 meanR:9.8200 R:9.0000 loss:322.8317
Episode:2599 meanR:9.8100 R:9.0000 loss:314.4374
Episode:2600 meanR:9.8400 R:15.0000 loss:323.7143
Episode:2601 meanR:9.8800 R:12.0000 loss:320.0905
Episode:2602 meanR:9.8600 R:8.0000 loss:312.2900
Episode:2603 meanR:9.8500 R:9.0000 loss:308.6091
Episode:2604 meanR:9.8400 R:8.0000 loss:318.2644
Episode:2605 meanR:9.8400 R:9.0000 loss:304.4962
Episode:2606 meanR:9.8200 R:10.0000 loss:314.3057
Episode:2607 meanR:9.7700 R:10.0000 loss:310.7912
Episode:2

Episode:2754 meanR:9.8200 R:10.0000 loss:283.7265
Episode:2755 meanR:9.8200 R:10.0000 loss:281.4836
Episode:2756 meanR:9.8100 R:9.0000 loss:276.6579
Episode:2757 meanR:9.8300 R:14.0000 loss:272.5367
Episode:2758 meanR:9.8200 R:9.0000 loss:268.7910
Episode:2759 meanR:9.8200 R:9.0000 loss:317.4495
Episode:2760 meanR:9.8300 R:10.0000 loss:241.6290
Episode:2761 meanR:9.8200 R:9.0000 loss:225.6464
Episode:2762 meanR:9.8200 R:9.0000 loss:240.3109
Episode:2763 meanR:9.8100 R:9.0000 loss:225.3792
Episode:2764 meanR:9.8100 R:10.0000 loss:228.6873
Episode:2765 meanR:9.8000 R:9.0000 loss:232.5682
Episode:2766 meanR:9.7800 R:8.0000 loss:225.0565
Episode:2767 meanR:9.7900 R:9.0000 loss:233.0627
Episode:2768 meanR:9.7900 R:10.0000 loss:236.2890
Episode:2769 meanR:9.8000 R:10.0000 loss:238.6754
Episode:2770 meanR:9.7900 R:9.0000 loss:263.2390
Episode:2771 meanR:9.7800 R:9.0000 loss:236.7460
Episode:2772 meanR:9.7700 R:9.0000 loss:238.2129
Episode:2773 meanR:9.7600 R:9.0000 loss:238.0040
Episode:2774 

Episode:2920 meanR:9.7000 R:10.0000 loss:175.0596
Episode:2921 meanR:9.6400 R:9.0000 loss:186.2088
Episode:2922 meanR:9.6200 R:8.0000 loss:179.1631
Episode:2923 meanR:9.6100 R:9.0000 loss:179.0751
Episode:2924 meanR:9.6000 R:9.0000 loss:183.1270
Episode:2925 meanR:9.6000 R:10.0000 loss:174.7464
Episode:2926 meanR:9.6000 R:10.0000 loss:172.6727
Episode:2927 meanR:9.6000 R:10.0000 loss:170.6753
Episode:2928 meanR:9.6300 R:12.0000 loss:181.1387
Episode:2929 meanR:9.6400 R:10.0000 loss:193.3972
Episode:2930 meanR:9.6400 R:9.0000 loss:165.8876
Episode:2931 meanR:9.6300 R:9.0000 loss:168.2691
Episode:2932 meanR:9.6300 R:10.0000 loss:167.9320
Episode:2933 meanR:9.6500 R:10.0000 loss:168.7272
Episode:2934 meanR:9.5900 R:9.0000 loss:165.5394
Episode:2935 meanR:9.5800 R:9.0000 loss:159.8311
Episode:2936 meanR:9.6000 R:11.0000 loss:162.9053
Episode:2937 meanR:9.6000 R:10.0000 loss:173.4355
Episode:2938 meanR:9.5900 R:9.0000 loss:179.9867
Episode:2939 meanR:9.6000 R:10.0000 loss:183.6461
Episode:2

Episode:3086 meanR:9.6400 R:9.0000 loss:120.2917
Episode:3087 meanR:9.6300 R:9.0000 loss:119.3809
Episode:3088 meanR:9.6700 R:16.0000 loss:132.2842
Episode:3089 meanR:9.6700 R:9.0000 loss:132.1379
Episode:3090 meanR:9.6800 R:10.0000 loss:130.0729
Episode:3091 meanR:9.6800 R:9.0000 loss:137.3533
Episode:3092 meanR:9.6800 R:9.0000 loss:132.9744
Episode:3093 meanR:9.6900 R:10.0000 loss:133.4195
Episode:3094 meanR:9.6600 R:11.0000 loss:136.6207
Episode:3095 meanR:9.6600 R:10.0000 loss:172.3094
Episode:3096 meanR:9.6700 R:10.0000 loss:144.2052
Episode:3097 meanR:9.6600 R:9.0000 loss:134.4454
Episode:3098 meanR:9.6600 R:10.0000 loss:134.3450
Episode:3099 meanR:9.6700 R:10.0000 loss:144.0701
Episode:3100 meanR:9.6700 R:9.0000 loss:199.8040
Episode:3101 meanR:9.6700 R:9.0000 loss:153.3545
Episode:3102 meanR:9.6800 R:9.0000 loss:134.7012
Episode:3103 meanR:9.6900 R:9.0000 loss:133.1085
Episode:3104 meanR:9.7000 R:10.0000 loss:129.7765
Episode:3105 meanR:9.7000 R:10.0000 loss:133.7216
Episode:31

Episode:3252 meanR:10.0000 R:9.0000 loss:110.6198
Episode:3253 meanR:10.0100 R:9.0000 loss:91.3596
Episode:3254 meanR:10.0600 R:14.0000 loss:97.5849
Episode:3255 meanR:10.0600 R:10.0000 loss:96.9505
Episode:3256 meanR:10.0500 R:9.0000 loss:101.5245
Episode:3257 meanR:10.0300 R:9.0000 loss:93.4532
Episode:3258 meanR:10.0100 R:10.0000 loss:92.3173
Episode:3259 meanR:10.0200 R:10.0000 loss:129.0180
Episode:3260 meanR:10.0300 R:10.0000 loss:98.7607
Episode:3261 meanR:10.0300 R:9.0000 loss:98.8633
Episode:3262 meanR:10.0300 R:10.0000 loss:107.3423
Episode:3263 meanR:10.0300 R:10.0000 loss:93.0171
Episode:3264 meanR:10.0100 R:10.0000 loss:88.4781
Episode:3265 meanR:10.0000 R:9.0000 loss:86.5326
Episode:3266 meanR:10.0100 R:10.0000 loss:99.1075
Episode:3267 meanR:10.0000 R:9.0000 loss:94.0499
Episode:3268 meanR:10.0000 R:10.0000 loss:92.0070
Episode:3269 meanR:10.0000 R:9.0000 loss:82.6167
Episode:3270 meanR:10.0300 R:12.0000 loss:85.7225
Episode:3271 meanR:10.0400 R:9.0000 loss:90.9025
Episo

Episode:3417 meanR:10.1900 R:9.0000 loss:58.5782
Episode:3418 meanR:10.1900 R:9.0000 loss:62.7973
Episode:3419 meanR:10.1900 R:10.0000 loss:59.2941
Episode:3420 meanR:10.1700 R:9.0000 loss:59.4489
Episode:3421 meanR:10.2100 R:12.0000 loss:56.0883
Episode:3422 meanR:10.2000 R:9.0000 loss:54.3570
Episode:3423 meanR:10.2100 R:10.0000 loss:57.9770
Episode:3424 meanR:10.2200 R:10.0000 loss:57.0285
Episode:3425 meanR:10.2700 R:14.0000 loss:57.7596
Episode:3426 meanR:10.2700 R:9.0000 loss:57.4513
Episode:3427 meanR:10.2500 R:8.0000 loss:54.7349
Episode:3428 meanR:10.2200 R:9.0000 loss:55.8335
Episode:3429 meanR:10.2200 R:10.0000 loss:57.7824
Episode:3430 meanR:10.2000 R:10.0000 loss:57.8798
Episode:3431 meanR:10.1900 R:9.0000 loss:56.6140
Episode:3432 meanR:10.2000 R:10.0000 loss:55.0660
Episode:3433 meanR:10.2000 R:10.0000 loss:62.9184
Episode:3434 meanR:10.2000 R:10.0000 loss:51.4854
Episode:3435 meanR:10.2600 R:15.0000 loss:41.3799
Episode:3436 meanR:10.2500 R:9.0000 loss:37.2033
Episode:3

Episode:3583 meanR:10.6400 R:10.0000 loss:45.3549
Episode:3584 meanR:10.6300 R:11.0000 loss:36.2564
Episode:3585 meanR:10.6000 R:10.0000 loss:80.5816
Episode:3586 meanR:10.5800 R:11.0000 loss:43.5457
Episode:3587 meanR:10.5800 R:11.0000 loss:38.1459
Episode:3588 meanR:10.5800 R:10.0000 loss:37.9526
Episode:3589 meanR:10.6600 R:18.0000 loss:41.3066
Episode:3590 meanR:10.6800 R:11.0000 loss:35.1837
Episode:3591 meanR:10.6900 R:10.0000 loss:29.4630
Episode:3592 meanR:10.7000 R:10.0000 loss:33.8544
Episode:3593 meanR:10.7200 R:10.0000 loss:28.5961
Episode:3594 meanR:10.7200 R:10.0000 loss:32.5558
Episode:3595 meanR:10.7300 R:10.0000 loss:29.9263
Episode:3596 meanR:10.7800 R:15.0000 loss:33.0343
Episode:3597 meanR:10.8600 R:17.0000 loss:37.0430
Episode:3598 meanR:10.8200 R:10.0000 loss:37.7449
Episode:3599 meanR:10.8400 R:11.0000 loss:37.2379
Episode:3600 meanR:10.8300 R:10.0000 loss:37.6556
Episode:3601 meanR:10.8300 R:10.0000 loss:42.0445
Episode:3602 meanR:10.8300 R:10.0000 loss:47.0973


Episode:3747 meanR:95.0500 R:500.0000 loss:19.0617
Episode:3748 meanR:99.9500 R:500.0000 loss:21.5910
Episode:3749 meanR:104.8600 R:500.0000 loss:12.6512
Episode:3750 meanR:105.3700 R:62.0000 loss:27.6110
Episode:3751 meanR:110.2500 R:500.0000 loss:20.6167
Episode:3752 meanR:115.1100 R:500.0000 loss:15.9992
Episode:3753 meanR:120.0100 R:500.0000 loss:14.4669
Episode:3754 meanR:124.9200 R:500.0000 loss:12.7180
Episode:3755 meanR:129.8100 R:500.0000 loss:11.4531
Episode:3756 meanR:134.6600 R:500.0000 loss:18.4445
Episode:3757 meanR:139.5600 R:500.0000 loss:11.8781
Episode:3758 meanR:144.4600 R:500.0000 loss:10.9887
Episode:3759 meanR:149.3400 R:500.0000 loss:11.1311
Episode:3760 meanR:154.1900 R:500.0000 loss:11.8148
Episode:3761 meanR:159.0900 R:500.0000 loss:11.9222
Episode:3762 meanR:163.9900 R:500.0000 loss:12.9539
Episode:3763 meanR:168.8800 R:500.0000 loss:11.6533
Episode:3764 meanR:173.7500 R:500.0000 loss:12.6574
Episode:3765 meanR:178.6300 R:500.0000 loss:13.3310
Episode:3766 me

Episode:3906 meanR:353.6500 R:500.0000 loss:22.1103
Episode:3907 meanR:352.0100 R:50.0000 loss:60.3594
Episode:3908 meanR:350.7500 R:58.0000 loss:79.6765
Episode:3909 meanR:350.1000 R:117.0000 loss:30.2425
Episode:3910 meanR:347.3400 R:70.0000 loss:7.2665
Episode:3911 meanR:343.7100 R:57.0000 loss:4.6234
Episode:3912 meanR:339.1900 R:48.0000 loss:4.3698
Episode:3913 meanR:334.6900 R:50.0000 loss:6.4677
Episode:3914 meanR:330.2000 R:51.0000 loss:6.0535
Episode:3915 meanR:326.2100 R:101.0000 loss:3.6660
Episode:3916 meanR:324.5600 R:122.0000 loss:2.7996
Episode:3917 meanR:320.7800 R:122.0000 loss:3.2297
Episode:3918 meanR:321.0400 R:165.0000 loss:3.2843
Episode:3919 meanR:320.2900 R:157.0000 loss:3.9043
Episode:3920 meanR:316.8000 R:151.0000 loss:3.2289
Episode:3921 meanR:316.5300 R:164.0000 loss:2.1800
Episode:3922 meanR:316.3100 R:166.0000 loss:1.5176
Episode:3923 meanR:315.7700 R:165.0000 loss:1.4315
Episode:3924 meanR:315.8500 R:278.0000 loss:1.5536
Episode:3925 meanR:315.2100 R:222.

Episode:4065 meanR:390.4400 R:51.0000 loss:61.3422
Episode:4066 meanR:389.2600 R:382.0000 loss:28.6315
Episode:4067 meanR:385.7900 R:153.0000 loss:8.9929
Episode:4068 meanR:382.3700 R:158.0000 loss:6.6784
Episode:4069 meanR:379.0000 R:163.0000 loss:3.5181
Episode:4070 meanR:375.4200 R:142.0000 loss:3.2558
Episode:4071 meanR:371.8100 R:139.0000 loss:2.5701
Episode:4072 meanR:368.1800 R:137.0000 loss:1.7801
Episode:4073 meanR:364.6100 R:143.0000 loss:2.2835
Episode:4074 meanR:361.2300 R:162.0000 loss:3.1950
Episode:4075 meanR:358.1500 R:192.0000 loss:2.6276
Episode:4076 meanR:362.4300 R:440.0000 loss:2.2167
Episode:4077 meanR:362.4300 R:500.0000 loss:1.0593
Episode:4078 meanR:362.4300 R:500.0000 loss:5.5000
Episode:4079 meanR:360.3900 R:296.0000 loss:27.2117
Episode:4080 meanR:360.3900 R:500.0000 loss:1.2678
Episode:4081 meanR:360.3900 R:500.0000 loss:16.3682
Episode:4082 meanR:360.3900 R:500.0000 loss:15.2147
Episode:4083 meanR:360.3900 R:500.0000 loss:18.5703
Episode:4084 meanR:360.390

Episode:4222 meanR:231.7400 R:500.0000 loss:12.1987
Episode:4223 meanR:236.2400 R:500.0000 loss:15.8040
Episode:4224 meanR:240.5800 R:500.0000 loss:18.5237
Episode:4225 meanR:244.9000 R:500.0000 loss:10.5718
Episode:4226 meanR:249.0600 R:500.0000 loss:17.7493
Episode:4227 meanR:253.2300 R:500.0000 loss:17.5936
Episode:4228 meanR:257.2300 R:500.0000 loss:16.8164
Episode:4229 meanR:261.3300 R:500.0000 loss:17.7171
Episode:4230 meanR:265.3600 R:500.0000 loss:16.1886
Episode:4231 meanR:269.5600 R:500.0000 loss:16.1608
Episode:4232 meanR:273.8200 R:500.0000 loss:15.8134
Episode:4233 meanR:277.8500 R:500.0000 loss:15.3769
Episode:4234 meanR:281.6700 R:500.0000 loss:15.0761
Episode:4235 meanR:285.5800 R:500.0000 loss:15.3259
Episode:4236 meanR:289.4500 R:500.0000 loss:14.8505
Episode:4237 meanR:293.2700 R:500.0000 loss:14.4229
Episode:4238 meanR:297.1700 R:500.0000 loss:14.1813
Episode:4239 meanR:301.0700 R:500.0000 loss:14.1932
Episode:4240 meanR:304.9800 R:500.0000 loss:15.3642
Episode:4241

Episode:4381 meanR:403.2000 R:318.0000 loss:0.6368
Episode:4382 meanR:402.3500 R:415.0000 loss:0.8388
Episode:4383 meanR:402.3500 R:500.0000 loss:7.4615
Episode:4384 meanR:401.3300 R:398.0000 loss:35.9788
Episode:4385 meanR:396.9900 R:66.0000 loss:6.1901
Episode:4386 meanR:392.7000 R:71.0000 loss:8.1610
Episode:4387 meanR:388.3700 R:67.0000 loss:4.9665
Episode:4388 meanR:384.0300 R:66.0000 loss:7.9575
Episode:4389 meanR:379.6700 R:64.0000 loss:6.3976
Episode:4390 meanR:375.3000 R:63.0000 loss:3.7749
Episode:4391 meanR:370.9100 R:61.0000 loss:3.9197
Episode:4392 meanR:366.5000 R:59.0000 loss:6.3510
Episode:4393 meanR:362.1500 R:65.0000 loss:4.9962
Episode:4394 meanR:357.7900 R:64.0000 loss:6.5642
Episode:4395 meanR:353.4100 R:62.0000 loss:4.1576
Episode:4396 meanR:349.0600 R:65.0000 loss:5.2329
Episode:4397 meanR:344.7100 R:65.0000 loss:4.3212
Episode:4398 meanR:340.3900 R:68.0000 loss:4.4708
Episode:4399 meanR:336.0000 R:61.0000 loss:3.0814
Episode:4400 meanR:331.6800 R:68.0000 loss:4.

Episode:4542 meanR:167.1700 R:13.0000 loss:16.2380
Episode:4543 meanR:165.3800 R:13.0000 loss:14.6818
Episode:4544 meanR:164.0300 R:12.0000 loss:13.8380
Episode:4545 meanR:162.5300 R:13.0000 loss:12.9515
Episode:4546 meanR:161.1900 R:14.0000 loss:13.4686
Episode:4547 meanR:159.9300 R:13.0000 loss:11.8952
Episode:4548 meanR:158.5900 R:14.0000 loss:13.3830
Episode:4549 meanR:157.1300 R:13.0000 loss:13.4503
Episode:4550 meanR:155.7000 R:14.0000 loss:13.6776
Episode:4551 meanR:153.8400 R:15.0000 loss:11.9896
Episode:4552 meanR:148.9700 R:13.0000 loss:11.6787
Episode:4553 meanR:144.1100 R:14.0000 loss:10.3052
Episode:4554 meanR:139.2600 R:15.0000 loss:12.2855
Episode:4555 meanR:134.4400 R:18.0000 loss:13.8660
Episode:4556 meanR:129.6100 R:17.0000 loss:11.5789
Episode:4557 meanR:124.7500 R:14.0000 loss:11.2845
Episode:4558 meanR:119.9000 R:15.0000 loss:12.5750
Episode:4559 meanR:115.0400 R:14.0000 loss:12.8114
Episode:4560 meanR:110.1700 R:13.0000 loss:13.2304
Episode:4561 meanR:109.3800 R:1

Episode:4707 meanR:12.4400 R:11.0000 loss:9.0848
Episode:4708 meanR:12.4300 R:10.0000 loss:8.2322
Episode:4709 meanR:12.4000 R:11.0000 loss:10.0086
Episode:4710 meanR:12.3800 R:12.0000 loss:8.2809
Episode:4711 meanR:12.3500 R:10.0000 loss:7.9286
Episode:4712 meanR:12.3300 R:10.0000 loss:7.2477
Episode:4713 meanR:12.3000 R:10.0000 loss:6.5588
Episode:4714 meanR:12.3200 R:12.0000 loss:6.3616
Episode:4715 meanR:12.3000 R:12.0000 loss:9.0046
Episode:4716 meanR:12.3100 R:13.0000 loss:7.5903
Episode:4717 meanR:12.2900 R:12.0000 loss:6.3901
Episode:4718 meanR:12.3100 R:13.0000 loss:5.7570
Episode:4719 meanR:12.2900 R:10.0000 loss:6.6736
Episode:4720 meanR:12.3100 R:13.0000 loss:8.4671
Episode:4721 meanR:12.3000 R:12.0000 loss:8.6443
Episode:4722 meanR:12.2700 R:11.0000 loss:8.8320
Episode:4723 meanR:12.2600 R:11.0000 loss:7.4021
Episode:4724 meanR:12.2500 R:12.0000 loss:8.6516
Episode:4725 meanR:12.2700 R:15.0000 loss:9.1609
Episode:4726 meanR:12.2700 R:12.0000 loss:10.3150
Episode:4727 meanR

Episode:4873 meanR:11.3100 R:11.0000 loss:21.9713
Episode:4874 meanR:11.3300 R:12.0000 loss:23.4977
Episode:4875 meanR:11.3300 R:12.0000 loss:20.8306
Episode:4876 meanR:11.3200 R:11.0000 loss:22.8206
Episode:4877 meanR:11.3300 R:12.0000 loss:21.4196
Episode:4878 meanR:11.3300 R:12.0000 loss:25.0917
Episode:4879 meanR:11.3300 R:11.0000 loss:24.4942
Episode:4880 meanR:11.3400 R:12.0000 loss:21.4562
Episode:4881 meanR:11.3300 R:11.0000 loss:19.4424
Episode:4882 meanR:11.3200 R:11.0000 loss:18.5729
Episode:4883 meanR:11.3200 R:13.0000 loss:24.2374
Episode:4884 meanR:11.3400 R:12.0000 loss:18.7634
Episode:4885 meanR:11.3500 R:12.0000 loss:22.2990
Episode:4886 meanR:11.3500 R:11.0000 loss:22.0332
Episode:4887 meanR:11.3500 R:12.0000 loss:18.4540
Episode:4888 meanR:11.3500 R:12.0000 loss:20.6192
Episode:4889 meanR:11.3600 R:11.0000 loss:20.7354
Episode:4890 meanR:11.3700 R:11.0000 loss:19.1184
Episode:4891 meanR:11.3800 R:12.0000 loss:21.4904
Episode:4892 meanR:11.3800 R:11.0000 loss:20.9061


Episode:5034 meanR:151.5600 R:9.0000 loss:117.7085
Episode:5035 meanR:151.5300 R:10.0000 loss:128.3759
Episode:5036 meanR:156.4100 R:500.0000 loss:142.4926
Episode:5037 meanR:156.3900 R:9.0000 loss:202.2130
Episode:5038 meanR:156.4400 R:15.0000 loss:229.2505
Episode:5039 meanR:161.3100 R:500.0000 loss:156.2317
Episode:5040 meanR:161.3500 R:16.0000 loss:182.4472
Episode:5041 meanR:166.2500 R:500.0000 loss:136.1874
Episode:5042 meanR:166.2100 R:10.0000 loss:166.2945
Episode:5043 meanR:171.1100 R:500.0000 loss:122.1448
Episode:5044 meanR:171.1100 R:13.0000 loss:209.2115
Episode:5045 meanR:171.1000 R:9.0000 loss:233.0538
Episode:5046 meanR:176.0000 R:500.0000 loss:187.8890
Episode:5047 meanR:180.8700 R:500.0000 loss:113.1528
Episode:5048 meanR:180.8100 R:9.0000 loss:103.6524
Episode:5049 meanR:185.6900 R:500.0000 loss:105.1871
Episode:5050 meanR:185.6800 R:12.0000 loss:132.5906
Episode:5051 meanR:189.9600 R:500.0000 loss:120.7551
Episode:5052 meanR:192.3500 R:500.0000 loss:101.0373
Episode

# Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')

## Testing

Let's checkout how our trained agent plays the game.

In [24]:
import gym
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Episode/epoch
    for _ in range(10):
        total_reward = 0
        state = env.reset()
        initial_state = sess.run(model.initial_state) # Qs or current batch or states[:-1]
        
        # Steps/batches
        while True:
            env.render()
            action_logits, initial_state = sess.run([model.actions_logits, model.final_state],
                                                    feed_dict = {model.states: state.reshape([1, -1]), 
                                                                 model.initial_state: initial_state})
            action = np.argmax(action_logits)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break
        # At the end of each episode
        print('total_reward:{}'.format(total_reward))

# Close the env
env.close()

INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0
total_reward:500.0


## Extending this

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.