# Sequential Q-learning

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
import gym
import numpy as np

In [2]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


>**Note:** Make sure you have OpenAI Gym cloned into the same directory with this notebook. I've included `gym` as a submodule, so you can run `git submodule --init --recursive` to pull the contents into the `gym` repo.

>**Note:** Make sure you have OpenAI Gym cloned. Then run this command `pip install -e gym/[all]`.

In [5]:
import gym

# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m




We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing in an action as an integer to `env.step` will generate the next step in the simulation.  You can see how many actions are possible from `env.action_space` and to get a random action you can use `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to watch the simulation run.

In [6]:
env.reset()
for _ in range(10):
    # env.render()
    action = env.action_space.sample()
    state, reward, done, info = env.step(action) # take a random action
    print('state, action, reward, done, info:', state, action, reward, done, info)
    if done:
        env.reset()

state, action, reward, done, info: [-0.04328005  0.24254737 -0.00955302 -0.25440812] 1 1.0 False {}
state, action, reward, done, info: [-0.0384291   0.04756311 -0.01464118  0.03524637] 0 1.0 False {}
state, action, reward, done, info: [-0.03747784 -0.14734585 -0.01393625  0.32327413] 0 1.0 False {}
state, action, reward, done, info: [-0.04042475  0.04797174 -0.00747077  0.02622906] 1 1.0 False {}
state, action, reward, done, info: [-0.03946532 -0.14704228 -0.00694619  0.31654555] 0 1.0 False {}
state, action, reward, done, info: [-0.04240616 -0.34206461 -0.00061528  0.60702981] 0 1.0 False {}
state, action, reward, done, info: [-0.04924746 -0.53717795  0.01152532  0.89951888] 0 1.0 False {}
state, action, reward, done, info: [-0.05999101 -0.73245418  0.02951569  1.19580214] 0 1.0 False {}
state, action, reward, done, info: [-0.0746401  -0.92794556  0.05343174  1.49758784] 0 1.0 False {}
state, action, reward, done, info: [-0.09319901 -0.73351216  0.08338349  1.22205542] 1 1.0 False {}


To shut the window showing the simulation, use `env.close()`.

If you ran the simulation above, we can look at the rewards:

In [7]:
# print(rewards[-20:])
# print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
# print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
# print(np.max(np.array(actions)), np.min(np.array(actions)))
# print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
# print(np.max(np.array(rewards)), np.min(np.array(rewards)))
# print(np.max(np.array(states)), np.min(np.array(states)))

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max{Q(s', a')}
$$

where $s$ is a state, $a$ is an action, and $s'$ is the next state from state $s$ and action $a$.

Before we used this equation to learn values for a Q-_table_. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating point precisions, you practically have infinite states. Instead of using a table then, we'll replace it with a neural network that will approximate the Q-table lookup function.

<img src="assets/deep-q-learning.png" width=450px>

Now, our Q value, $Q(s, a)$ is calculated by passing in a state to the network. The output will be Q-values for each available action, with fully connected hidden layers.

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$. 

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.

In [8]:
def model_input(state_size, lstm_size, batch_size=1):
    actions = tf.placeholder(tf.int32, [None], name='actions')
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
        
    gru = tf.nn.rnn_cell.GRUCell(lstm_size)
    cell = tf.nn.rnn_cell.MultiRNNCell([gru], state_is_tuple=False)
    initial_state = cell.zero_state(batch_size, tf.float32)
    
    return actions, states, targetQs, cell, initial_state

In [9]:
# RNN generator or sequence generator
def generator(states, initial_state, cell, lstm_size, num_classes, reuse=False): 
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        inputs = tf.layers.dense(inputs=states, units=lstm_size)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        # dynamic means adapt to the batch_size
        inputs_rnn = tf.reshape(inputs, [1, -1, lstm_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cell, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state.shape)
        outputs = tf.reshape(outputs_rnn, [-1, lstm_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=num_classes)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the action logits
        return logits, final_state

In [10]:
def model_loss(action_size, hidden_size, states, cell, initial_state, actions, targetQs):
    actions_logits, final_state = generator(states=states, cell=cell, initial_state=initial_state, 
                                            lstm_size=hidden_size, num_classes=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    Qs = tf.reduce_max(actions_logits*actions_labels, axis=1)
    loss = tf.reduce_mean(tf.square(Qs - targetQs))
    return actions_logits, final_state, loss

In [11]:
def model_opt(loss, learning_rate):
    """
    Get optimization operations in order
    :param loss: Generator loss Tensor for action prediction
    :param learning_rate: Learning Rate Placeholder
    :return: A tuple of (qfunction training, generator training, discriminator training)
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]

    # # Optimize
    # with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
    # #opt = tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=g_vars)

    #grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(loss, g_vars), clip_norm=5) # usually around 1-5
    grads = tf.gradients(loss, g_vars)
    opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(grads, g_vars))

    return opt

In [12]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.actions, self.states, self.targetQs, cell, self.initial_state = model_input(
            state_size=state_size, lstm_size=hidden_size)
        
        # Create the Model: calculating the loss and forwad pass
        self.actions_logits, self.final_state, self.loss = model_loss(
            action_size=action_size, hidden_size=hidden_size, 
            states=self.states, actions=self.actions, 
            targetQs=self.targetQs, cell=cell, initial_state=self.initial_state)

        # Update the model: backward pass and backprop
        self.opt = model_opt(loss=self.loss, learning_rate=learning_rate)

## Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on. 

Here, we'll create a `Memory` object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maxmium capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a `Memory` object. If you're unfamiliar with `deque`, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.

In [13]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
        self.states = deque(maxlen=max_size)
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx], [self.states[ii] for ii in idx]

## Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon).  That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an **$\epsilon$-greedy policy**.


At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called _exploitation_. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.

## Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in _episodes_. One *episode* is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames. So we can start a new episode once meeting that goal. The game ends if the pole tilts over too far, or if the cart moves too far the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

* Initialize the memory $D$
* Initialize the action-value network $Q$ with random weights
* **For** episode = 1, $M$ **do**
  * **For** $t$, $T$ **do**
     * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s,a)$
     * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
     * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
     * Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
     * Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
     * Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
  * **endfor**
* **endfor**

## Hyperparameters

One of the more difficult aspects of reinforcememt learning are the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.

In [14]:
# print('state:', np.array(states).shape[1], 
#       'action size: {}'.format((np.max(np.array(actions)) - np.min(np.array(actions)))+1))

In [15]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
action_size = 2
state_size = 4
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 10000            # memory capacity
batch_size = 20                # experience mini-batch size
gamma = 0.99                   # future reward discount

In [16]:
# Reset/init the graph/session
graph = tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=batch_size)

(?, 4) (?, 64)
(1, ?, 64) (1, 64)
(1, ?, 64) (1, 64)
(?, 64)
(?, 2)


## Populate the memory (exprience memory)

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.

In [21]:
state = env.reset()
initial_state = np.zeros([1, hidden_size])
final_state = np.zeros([1, hidden_size])
for _ in range(batch_size):
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    memory.buffer.append([state, action, next_state, reward, float(done)])
    memory.states.append([initial_state, final_state])
    state = next_state
    if done is True:
        state = env.reset()

## Training the model

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [22]:
memory.buffer[0], memory.states[0]

([array([ 0.04484891, -0.03715969, -0.01237249,  0.01858427]),
  1,
  array([ 0.04410571,  0.15813749, -0.01200081, -0.2779765 ]),
  1.0,
  0.0],
 [array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]),
  array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])])

In [23]:
# states, rewards, actions

In [None]:
saver = tf.train.Saver()
episode_rewards_list, rewards_list, loss_list = [], [], []

# TF session for training
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    total_step = 0 # Explore or exploit parameter
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        loss_batch = []
        state = env.reset()
        initial_state = sess.run(model.initial_state)

        # Training steps/batches
        while True:
            action_logits, final_state = sess.run([model.actions_logits, model.final_state],
                                                  feed_dict = {model.states: state.reshape([1, -1]), 
                                                               model.initial_state: initial_state})
            # Explore (Env) or Exploit (Model)
            total_step += 1
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            if explore_p > np.random.rand():
                action = env.action_space.sample()
            else:
                action = np.argmax(action_logits)
            next_state, reward, done, _ = env.step(action)
            memory.buffer.append([state, action, next_state, reward, float(done)])
            memory.states.append([initial_state, final_state])
            total_reward += reward
            initial_state = final_state
            state = next_state

            # Training
            batch, rnn_states = memory.sample(batch_size)
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            initial_states = np.array([each[0] for each in rnn_states])
            final_states = np.array([each[1] for each in rnn_states])
            next_actions_logits = sess.run(model.actions_logits, 
                                           feed_dict = {model.states: next_states, 
                                                        model.initial_state: final_states[0].reshape([1, -1])})
            nextQs = np.max(next_actions_logits, axis=1) * (1-dones)
            targetQs = rewards + (gamma * nextQs)
            loss, _ = sess.run([model.loss, model.opt], feed_dict = {model.states: states, 
                                                                     model.actions: actions,
                                                                     model.targetQs: targetQs,
                                                        model.initial_state: initial_states[0].reshape([1, -1])})
            loss_batch.append(loss)
            if done is True:
                break
                
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{}'.format(total_reward),
              'loss:{:.4f}'.format(np.mean(loss_batch)),
              'exploreP:{:.4f}'.format(explore_p))
        # Ploting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        loss_list.append([ep, np.mean(loss_batch)])
        # Break episode/epoch loop
        if np.mean(episode_reward) >= 500:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model-qn-seq.ckpt')

Episode:0 meanR:25.0000 R:25.0 loss:0.9614 exploreP:0.9975
Episode:1 meanR:20.5000 R:16.0 loss:1.0372 exploreP:0.9959
Episode:2 meanR:19.0000 R:16.0 loss:0.8661 exploreP:0.9944
Episode:3 meanR:24.2500 R:40.0 loss:1.2054 exploreP:0.9904
Episode:4 meanR:23.4000 R:20.0 loss:1.1202 exploreP:0.9885
Episode:5 meanR:22.0000 R:15.0 loss:0.3873 exploreP:0.9870
Episode:6 meanR:21.7143 R:20.0 loss:0.4816 exploreP:0.9851
Episode:7 meanR:22.0000 R:24.0 loss:0.8634 exploreP:0.9827
Episode:8 meanR:21.2222 R:15.0 loss:1.4622 exploreP:0.9813
Episode:9 meanR:20.5000 R:14.0 loss:2.0031 exploreP:0.9799
Episode:10 meanR:21.4545 R:31.0 loss:1.4592 exploreP:0.9769
Episode:11 meanR:20.6667 R:12.0 loss:1.4791 exploreP:0.9757
Episode:12 meanR:19.8462 R:10.0 loss:2.9227 exploreP:0.9748
Episode:13 meanR:19.1429 R:10.0 loss:3.3290 exploreP:0.9738
Episode:14 meanR:19.0000 R:17.0 loss:3.2573 exploreP:0.9722
Episode:15 meanR:19.4375 R:26.0 loss:0.7342 exploreP:0.9697
Episode:16 meanR:19.4118 R:19.0 loss:2.2238 explor

Episode:136 meanR:19.8400 R:31.0 loss:6.9841 exploreP:0.7653
Episode:137 meanR:19.8900 R:15.0 loss:11.4703 exploreP:0.7642
Episode:138 meanR:19.8600 R:12.0 loss:13.7368 exploreP:0.7633
Episode:139 meanR:19.9000 R:23.0 loss:11.6131 exploreP:0.7615
Episode:140 meanR:19.9100 R:14.0 loss:9.9076 exploreP:0.7605
Episode:141 meanR:19.8700 R:12.0 loss:14.4467 exploreP:0.7596
Episode:142 meanR:19.6500 R:18.0 loss:14.4579 exploreP:0.7582
Episode:143 meanR:19.9400 R:42.0 loss:5.2489 exploreP:0.7551
Episode:144 meanR:19.8400 R:9.0 loss:10.2327 exploreP:0.7544
Episode:145 meanR:19.8200 R:10.0 loss:17.9956 exploreP:0.7537
Episode:146 meanR:19.8500 R:16.0 loss:13.5597 exploreP:0.7525
Episode:147 meanR:19.5900 R:24.0 loss:32.3791 exploreP:0.7507
Episode:148 meanR:19.5900 R:24.0 loss:8.4303 exploreP:0.7489
Episode:149 meanR:19.6400 R:17.0 loss:18.8821 exploreP:0.7477
Episode:150 meanR:19.5300 R:11.0 loss:11.1681 exploreP:0.7469
Episode:151 meanR:19.5000 R:15.0 loss:16.0980 exploreP:0.7458
Episode:152 m

Episode:270 meanR:16.2400 R:18.0 loss:27.7216 exploreP:0.6116
Episode:271 meanR:16.0700 R:14.0 loss:10.2563 exploreP:0.6108
Episode:272 meanR:15.9600 R:9.0 loss:20.6774 exploreP:0.6102
Episode:273 meanR:15.9600 R:11.0 loss:13.1579 exploreP:0.6096
Episode:274 meanR:15.9200 R:10.0 loss:16.5524 exploreP:0.6090
Episode:275 meanR:15.7600 R:12.0 loss:20.8381 exploreP:0.6082
Episode:276 meanR:15.3300 R:12.0 loss:19.4291 exploreP:0.6075
Episode:277 meanR:15.3700 R:14.0 loss:17.6146 exploreP:0.6067
Episode:278 meanR:15.4900 R:24.0 loss:11.6209 exploreP:0.6053
Episode:279 meanR:15.4600 R:15.0 loss:9.0848 exploreP:0.6044
Episode:280 meanR:15.4400 R:11.0 loss:26.5213 exploreP:0.6037
Episode:281 meanR:15.5100 R:21.0 loss:14.2255 exploreP:0.6025
Episode:282 meanR:15.5100 R:12.0 loss:10.6876 exploreP:0.6018
Episode:283 meanR:15.5700 R:15.0 loss:33.1192 exploreP:0.6009
Episode:284 meanR:15.5700 R:10.0 loss:17.5239 exploreP:0.6003
Episode:285 meanR:15.5000 R:14.0 loss:13.6917 exploreP:0.5995
Episode:28

Episode:404 meanR:15.4700 R:13.0 loss:14.8824 exploreP:0.5018
Episode:405 meanR:15.4900 R:16.0 loss:16.3974 exploreP:0.5010
Episode:406 meanR:15.4000 R:9.0 loss:16.0089 exploreP:0.5006
Episode:407 meanR:15.3100 R:9.0 loss:22.1461 exploreP:0.5001
Episode:408 meanR:15.3200 R:15.0 loss:19.5094 exploreP:0.4994
Episode:409 meanR:15.5000 R:27.0 loss:9.5831 exploreP:0.4981
Episode:410 meanR:15.6400 R:24.0 loss:8.4120 exploreP:0.4969
Episode:411 meanR:15.6500 R:13.0 loss:11.5731 exploreP:0.4963
Episode:412 meanR:15.6500 R:10.0 loss:18.0914 exploreP:0.4958
Episode:413 meanR:15.8700 R:33.0 loss:9.4577 exploreP:0.4942
Episode:414 meanR:15.9500 R:19.0 loss:11.3792 exploreP:0.4933
Episode:415 meanR:16.0100 R:14.0 loss:12.1161 exploreP:0.4926
Episode:416 meanR:16.1600 R:43.0 loss:6.9154 exploreP:0.4905
Episode:417 meanR:16.3100 R:24.0 loss:9.1296 exploreP:0.4894
Episode:418 meanR:16.2800 R:10.0 loss:12.6706 exploreP:0.4889
Episode:419 meanR:16.3700 R:21.0 loss:15.5643 exploreP:0.4879
Episode:420 mea

Episode:537 meanR:15.2600 R:10.0 loss:15.1335 exploreP:0.4063
Episode:538 meanR:15.2300 R:18.0 loss:16.6339 exploreP:0.4056
Episode:539 meanR:15.2500 R:12.0 loss:12.5025 exploreP:0.4051
Episode:540 meanR:15.2400 R:11.0 loss:17.9682 exploreP:0.4047
Episode:541 meanR:15.2300 R:9.0 loss:20.0455 exploreP:0.4043
Episode:542 meanR:15.1900 R:11.0 loss:18.7361 exploreP:0.4039
Episode:543 meanR:15.1300 R:11.0 loss:17.3654 exploreP:0.4035
Episode:544 meanR:15.1000 R:9.0 loss:19.0378 exploreP:0.4031
Episode:545 meanR:15.1100 R:13.0 loss:17.5509 exploreP:0.4026
Episode:546 meanR:15.1200 R:12.0 loss:15.2037 exploreP:0.4021
Episode:547 meanR:15.1300 R:16.0 loss:14.2392 exploreP:0.4015
Episode:548 meanR:15.1600 R:20.0 loss:11.3858 exploreP:0.4007
Episode:549 meanR:15.2500 R:36.0 loss:9.3205 exploreP:0.3993
Episode:550 meanR:15.4100 R:40.0 loss:4.6944 exploreP:0.3978
Episode:551 meanR:15.4500 R:25.0 loss:8.4758 exploreP:0.3968
Episode:552 meanR:15.4600 R:11.0 loss:10.4633 exploreP:0.3964
Episode:553 m

Episode:670 meanR:19.6000 R:30.0 loss:9.6138 exploreP:0.3202
Episode:671 meanR:19.6100 R:14.0 loss:13.1687 exploreP:0.3198
Episode:672 meanR:19.6000 R:9.0 loss:19.4440 exploreP:0.3195
Episode:673 meanR:19.5900 R:10.0 loss:24.0929 exploreP:0.3192
Episode:674 meanR:19.6600 R:18.0 loss:17.3503 exploreP:0.3186
Episode:675 meanR:19.6600 R:10.0 loss:12.9362 exploreP:0.3183
Episode:676 meanR:19.7200 R:16.0 loss:17.0716 exploreP:0.3178
Episode:677 meanR:19.6800 R:10.0 loss:14.9540 exploreP:0.3175
Episode:678 meanR:19.6700 R:9.0 loss:21.5388 exploreP:0.3172
Episode:679 meanR:19.6600 R:9.0 loss:21.7010 exploreP:0.3170
Episode:680 meanR:19.6400 R:8.0 loss:22.9728 exploreP:0.3167
Episode:681 meanR:19.6800 R:16.0 loss:20.0047 exploreP:0.3162
Episode:682 meanR:19.7200 R:16.0 loss:12.7675 exploreP:0.3157
Episode:683 meanR:19.7600 R:14.0 loss:13.9461 exploreP:0.3153
Episode:684 meanR:19.8400 R:17.0 loss:15.3190 exploreP:0.3148
Episode:685 meanR:20.2200 R:50.0 loss:5.3263 exploreP:0.3133
Episode:686 me

Episode:803 meanR:25.5600 R:30.0 loss:12.1836 exploreP:0.2339
Episode:804 meanR:25.5400 R:34.0 loss:10.8515 exploreP:0.2331
Episode:805 meanR:25.5200 R:20.0 loss:16.5785 exploreP:0.2327
Episode:806 meanR:25.1700 R:18.0 loss:16.5962 exploreP:0.2323
Episode:807 meanR:25.0600 R:15.0 loss:17.0114 exploreP:0.2319
Episode:808 meanR:24.1700 R:14.0 loss:20.5879 exploreP:0.2316
Episode:809 meanR:23.9100 R:12.0 loss:21.9887 exploreP:0.2313
Episode:810 meanR:23.8000 R:22.0 loss:18.1655 exploreP:0.2309
Episode:811 meanR:23.6100 R:19.0 loss:15.8566 exploreP:0.2304
Episode:812 meanR:23.5500 R:13.0 loss:16.8769 exploreP:0.2301
Episode:813 meanR:23.4300 R:13.0 loss:23.2520 exploreP:0.2299
Episode:814 meanR:23.4000 R:17.0 loss:20.5552 exploreP:0.2295
Episode:815 meanR:23.1400 R:33.0 loss:11.1281 exploreP:0.2288
Episode:816 meanR:23.8300 R:89.0 loss:4.3926 exploreP:0.2268
Episode:817 meanR:24.0400 R:32.0 loss:10.9253 exploreP:0.2261
Episode:818 meanR:25.0300 R:109.0 loss:3.7042 exploreP:0.2238
Episode:8

Episode:936 meanR:27.3900 R:37.0 loss:11.9247 exploreP:0.1629
Episode:937 meanR:27.5000 R:27.0 loss:14.9050 exploreP:0.1625
Episode:938 meanR:27.4900 R:33.0 loss:13.0573 exploreP:0.1620
Episode:939 meanR:27.4700 R:25.0 loss:17.3361 exploreP:0.1617
Episode:940 meanR:26.4100 R:22.0 loss:19.1825 exploreP:0.1613
Episode:941 meanR:26.8000 R:73.0 loss:5.8828 exploreP:0.1602
Episode:942 meanR:27.4200 R:82.0 loss:6.0075 exploreP:0.1590
Episode:943 meanR:27.4600 R:35.0 loss:12.8446 exploreP:0.1585
Episode:944 meanR:28.6600 R:136.0 loss:3.9604 exploreP:0.1565
Episode:945 meanR:28.7300 R:30.0 loss:17.0255 exploreP:0.1560
Episode:946 meanR:28.7600 R:19.0 loss:23.8981 exploreP:0.1558
Episode:947 meanR:28.6000 R:17.0 loss:23.0550 exploreP:0.1555
Episode:948 meanR:28.5400 R:20.0 loss:24.6889 exploreP:0.1552
Episode:949 meanR:28.3900 R:14.0 loss:22.2698 exploreP:0.1550
Episode:950 meanR:28.2000 R:15.0 loss:27.4134 exploreP:0.1548
Episode:951 meanR:28.0700 R:25.0 loss:19.5564 exploreP:0.1544
Episode:95

Episode:1068 meanR:26.0900 R:69.0 loss:7.6703 exploreP:0.1126
Episode:1069 meanR:26.3900 R:73.0 loss:7.7078 exploreP:0.1119
Episode:1070 meanR:26.0700 R:44.0 loss:12.0611 exploreP:0.1114
Episode:1071 meanR:26.1700 R:36.0 loss:14.8023 exploreP:0.1110
Episode:1072 meanR:26.3200 R:43.0 loss:12.3898 exploreP:0.1106
Episode:1073 meanR:26.3900 R:39.0 loss:13.7586 exploreP:0.1102
Episode:1074 meanR:26.5000 R:32.0 loss:16.3376 exploreP:0.1099
Episode:1075 meanR:26.4700 R:28.0 loss:18.4196 exploreP:0.1096
Episode:1076 meanR:26.2100 R:17.0 loss:26.7162 exploreP:0.1094
Episode:1077 meanR:26.2300 R:24.0 loss:22.4316 exploreP:0.1092
Episode:1078 meanR:26.2200 R:18.0 loss:25.7157 exploreP:0.1090
Episode:1079 meanR:26.2200 R:18.0 loss:28.6127 exploreP:0.1089
Episode:1080 meanR:26.0800 R:19.0 loss:27.0205 exploreP:0.1087
Episode:1081 meanR:25.9500 R:13.0 loss:26.0827 exploreP:0.1085
Episode:1082 meanR:25.7800 R:17.0 loss:31.8662 exploreP:0.1084
Episode:1083 meanR:25.7800 R:17.0 loss:26.1842 exploreP:0

Episode:1199 meanR:34.3500 R:21.0 loss:27.3821 exploreP:0.0770
Episode:1200 meanR:34.2500 R:16.0 loss:28.8186 exploreP:0.0769
Episode:1201 meanR:34.2300 R:17.0 loss:33.7876 exploreP:0.0767
Episode:1202 meanR:34.2200 R:29.0 loss:21.0582 exploreP:0.0765
Episode:1203 meanR:34.2700 R:38.0 loss:15.5090 exploreP:0.0763
Episode:1204 meanR:34.7400 R:70.0 loss:8.7143 exploreP:0.0758
Episode:1205 meanR:34.8400 R:48.0 loss:12.7262 exploreP:0.0755
Episode:1206 meanR:34.7400 R:34.0 loss:16.6016 exploreP:0.0753
Episode:1207 meanR:34.8000 R:30.0 loss:20.4048 exploreP:0.0751
Episode:1208 meanR:34.6600 R:29.0 loss:20.3435 exploreP:0.0749
Episode:1209 meanR:34.6300 R:25.0 loss:22.9333 exploreP:0.0747
Episode:1210 meanR:34.5300 R:22.0 loss:22.9574 exploreP:0.0746
Episode:1211 meanR:34.5600 R:29.0 loss:19.5739 exploreP:0.0744
Episode:1212 meanR:34.3100 R:18.0 loss:29.0889 exploreP:0.0743
Episode:1213 meanR:33.4800 R:25.0 loss:22.0484 exploreP:0.0741
Episode:1214 meanR:33.3100 R:30.0 loss:18.8470 exploreP:

Episode:1330 meanR:38.4700 R:27.0 loss:23.7873 exploreP:0.0508
Episode:1331 meanR:37.9700 R:29.0 loss:20.5437 exploreP:0.0507
Episode:1332 meanR:37.3100 R:26.0 loss:23.2119 exploreP:0.0506
Episode:1333 meanR:37.2300 R:34.0 loss:17.2243 exploreP:0.0504
Episode:1334 meanR:38.0100 R:106.0 loss:6.3023 exploreP:0.0500
Episode:1335 meanR:38.9900 R:126.0 loss:5.7116 exploreP:0.0495
Episode:1336 meanR:39.8500 R:111.0 loss:7.2652 exploreP:0.0491
Episode:1337 meanR:39.9300 R:31.0 loss:24.9752 exploreP:0.0490
Episode:1338 meanR:39.7500 R:21.0 loss:35.1544 exploreP:0.0489
Episode:1339 meanR:39.6300 R:106.0 loss:7.5079 exploreP:0.0485
Episode:1340 meanR:39.6500 R:36.0 loss:22.4669 exploreP:0.0483
Episode:1341 meanR:39.0300 R:23.0 loss:30.2674 exploreP:0.0482
Episode:1342 meanR:38.4400 R:24.0 loss:27.7737 exploreP:0.0481
Episode:1343 meanR:38.1800 R:26.0 loss:24.6237 exploreP:0.0480
Episode:1344 meanR:37.9900 R:18.0 loss:33.1538 exploreP:0.0480
Episode:1345 meanR:37.4700 R:16.0 loss:33.1310 exploreP

Episode:1461 meanR:27.9600 R:23.0 loss:28.0913 exploreP:0.0381
Episode:1462 meanR:28.9400 R:107.0 loss:5.7729 exploreP:0.0378
Episode:1463 meanR:29.0300 R:19.0 loss:34.7186 exploreP:0.0377
Episode:1464 meanR:30.5000 R:157.0 loss:4.3326 exploreP:0.0373
Episode:1465 meanR:30.8000 R:44.0 loss:16.1876 exploreP:0.0372
Episode:1466 meanR:30.9100 R:25.0 loss:25.0744 exploreP:0.0371
Episode:1467 meanR:30.9400 R:19.0 loss:29.1114 exploreP:0.0371
Episode:1468 meanR:31.0000 R:18.0 loss:27.7216 exploreP:0.0370
Episode:1469 meanR:31.0400 R:16.0 loss:29.2979 exploreP:0.0370
Episode:1470 meanR:31.1600 R:21.0 loss:26.5280 exploreP:0.0369
Episode:1471 meanR:31.3500 R:30.0 loss:16.7199 exploreP:0.0368
Episode:1472 meanR:32.2000 R:98.0 loss:5.8842 exploreP:0.0366
Episode:1473 meanR:32.6100 R:53.0 loss:11.2037 exploreP:0.0364
Episode:1474 meanR:33.1700 R:66.0 loss:9.1900 exploreP:0.0362
Episode:1475 meanR:33.4700 R:40.0 loss:15.4651 exploreP:0.0361
Episode:1476 meanR:33.6000 R:22.0 loss:30.5395 exploreP:0

Episode:1592 meanR:67.7000 R:87.0 loss:13.6840 exploreP:0.0220
Episode:1593 meanR:68.2700 R:100.0 loss:11.6947 exploreP:0.0218
Episode:1594 meanR:69.9700 R:189.0 loss:6.8170 exploreP:0.0216
Episode:1595 meanR:70.0300 R:20.0 loss:50.2128 exploreP:0.0216
Episode:1596 meanR:70.9700 R:109.0 loss:10.6528 exploreP:0.0215
Episode:1597 meanR:72.5400 R:173.0 loss:8.0869 exploreP:0.0213
Episode:1598 meanR:74.0400 R:165.0 loss:6.3514 exploreP:0.0211
Episode:1599 meanR:75.2400 R:137.0 loss:8.7334 exploreP:0.0209
Episode:1600 meanR:75.6900 R:59.0 loss:19.9329 exploreP:0.0209
Episode:1601 meanR:74.9500 R:30.0 loss:50.5894 exploreP:0.0208
Episode:1602 meanR:74.2300 R:27.0 loss:49.6247 exploreP:0.0208
Episode:1603 meanR:74.3100 R:38.0 loss:33.3501 exploreP:0.0208
Episode:1604 meanR:77.4600 R:338.0 loss:3.7743 exploreP:0.0204
Episode:1605 meanR:77.5700 R:52.0 loss:29.8532 exploreP:0.0204
Episode:1606 meanR:77.6100 R:98.0 loss:14.8987 exploreP:0.0203
Episode:1607 meanR:77.4000 R:24.0 loss:37.6478 explor

Episode:1723 meanR:29.2900 R:98.0 loss:5.3766 exploreP:0.0167
Episode:1724 meanR:29.2900 R:19.0 loss:55.3543 exploreP:0.0167
Episode:1725 meanR:30.2500 R:115.0 loss:8.1822 exploreP:0.0166
Episode:1726 meanR:30.2900 R:22.0 loss:50.9965 exploreP:0.0166
Episode:1727 meanR:30.3100 R:19.0 loss:24.6344 exploreP:0.0166
Episode:1728 meanR:29.8800 R:13.0 loss:23.6793 exploreP:0.0166
Episode:1729 meanR:29.5400 R:18.0 loss:27.7858 exploreP:0.0166
Episode:1730 meanR:30.4000 R:130.0 loss:4.2694 exploreP:0.0165
Episode:1731 meanR:30.2900 R:23.0 loss:49.6070 exploreP:0.0165
Episode:1732 meanR:30.3300 R:35.0 loss:24.2144 exploreP:0.0165
Episode:1733 meanR:30.4800 R:33.0 loss:25.4379 exploreP:0.0165
Episode:1734 meanR:31.4500 R:116.0 loss:7.4334 exploreP:0.0164
Episode:1735 meanR:31.5600 R:26.0 loss:43.3419 exploreP:0.0164
Episode:1736 meanR:33.0000 R:160.0 loss:3.4612 exploreP:0.0163
Episode:1737 meanR:33.1200 R:24.0 loss:51.6830 exploreP:0.0162
Episode:1738 meanR:33.2900 R:33.0 loss:27.3381 exploreP:

Episode:1853 meanR:89.6200 R:32.0 loss:29.5380 exploreP:0.0123
Episode:1854 meanR:91.2200 R:210.0 loss:4.6086 exploreP:0.0123
Episode:1855 meanR:91.7100 R:75.0 loss:11.6121 exploreP:0.0123
Episode:1856 meanR:91.5100 R:109.0 loss:11.5995 exploreP:0.0122
Episode:1857 meanR:91.7100 R:123.0 loss:17.0779 exploreP:0.0122
Episode:1858 meanR:93.2700 R:187.0 loss:9.6838 exploreP:0.0122
Episode:1859 meanR:92.6200 R:97.0 loss:13.0340 exploreP:0.0121
Episode:1860 meanR:93.2400 R:99.0 loss:22.7459 exploreP:0.0121
Episode:1861 meanR:94.1400 R:143.0 loss:9.3588 exploreP:0.0121
Episode:1862 meanR:94.4200 R:57.0 loss:33.2627 exploreP:0.0121
Episode:1863 meanR:95.3700 R:130.0 loss:13.9283 exploreP:0.0121
Episode:1864 meanR:95.6200 R:62.0 loss:31.5280 exploreP:0.0120
Episode:1865 meanR:96.7000 R:136.0 loss:14.1369 exploreP:0.0120
Episode:1866 meanR:97.4800 R:105.0 loss:17.8593 exploreP:0.0120
Episode:1867 meanR:98.0200 R:207.0 loss:9.3776 exploreP:0.0120
Episode:1868 meanR:99.3300 R:153.0 loss:14.0571 ex

Episode:1981 meanR:234.9300 R:359.0 loss:0.9352 exploreP:0.0101
Episode:1982 meanR:234.4900 R:182.0 loss:4.0224 exploreP:0.0101
Episode:1983 meanR:238.4400 R:500.0 loss:1.0881 exploreP:0.0101
Episode:1984 meanR:243.3100 R:500.0 loss:3.3984 exploreP:0.0101
Episode:1985 meanR:247.0000 R:500.0 loss:11.0814 exploreP:0.0101
Episode:1986 meanR:249.0300 R:395.0 loss:15.0341 exploreP:0.0101
Episode:1987 meanR:248.3300 R:159.0 loss:4.0110 exploreP:0.0101
Episode:1988 meanR:250.4300 R:229.0 loss:2.5817 exploreP:0.0101
Episode:1989 meanR:252.3200 R:199.0 loss:3.6203 exploreP:0.0101
Episode:1990 meanR:253.1500 R:104.0 loss:13.9677 exploreP:0.0101
Episode:1991 meanR:254.1600 R:158.0 loss:3.9295 exploreP:0.0101
Episode:1992 meanR:255.6700 R:242.0 loss:4.0195 exploreP:0.0101
Episode:1993 meanR:254.7500 R:135.0 loss:4.4711 exploreP:0.0101
Episode:1994 meanR:254.5300 R:248.0 loss:1.7388 exploreP:0.0101
Episode:1995 meanR:253.2400 R:142.0 loss:2.4504 exploreP:0.0101
Episode:1996 meanR:254.7000 R:160.0 l

Episode:2109 meanR:373.0200 R:500.0 loss:2.0782 exploreP:0.0100
Episode:2110 meanR:373.6500 R:104.0 loss:59.3559 exploreP:0.0100
Episode:2111 meanR:373.6500 R:500.0 loss:4.3662 exploreP:0.0100
Episode:2112 meanR:373.6500 R:500.0 loss:13.7185 exploreP:0.0100
Episode:2113 meanR:373.6500 R:500.0 loss:8.3877 exploreP:0.0100
Episode:2114 meanR:378.4100 R:500.0 loss:12.1105 exploreP:0.0100
Episode:2115 meanR:378.4100 R:500.0 loss:10.8395 exploreP:0.0100
Episode:2116 meanR:380.7000 R:500.0 loss:10.5767 exploreP:0.0100
Episode:2117 meanR:380.7000 R:500.0 loss:13.1727 exploreP:0.0100
Episode:2118 meanR:383.2500 R:500.0 loss:12.3540 exploreP:0.0100
Episode:2119 meanR:383.2500 R:500.0 loss:4.3527 exploreP:0.0100
Episode:2120 meanR:383.2500 R:500.0 loss:6.5501 exploreP:0.0100
Episode:2121 meanR:383.2500 R:500.0 loss:7.6977 exploreP:0.0100
Episode:2122 meanR:379.1300 R:88.0 loss:56.5727 exploreP:0.0100
Episode:2123 meanR:379.1300 R:500.0 loss:3.6743 exploreP:0.0100
Episode:2124 meanR:379.1300 R:500

Episode:2236 meanR:481.6500 R:500.0 loss:12.2781 exploreP:0.0100
Episode:2237 meanR:476.7800 R:13.0 loss:328.6572 exploreP:0.0100
Episode:2238 meanR:476.7800 R:500.0 loss:15.4137 exploreP:0.0100
Episode:2239 meanR:478.8700 R:500.0 loss:15.5114 exploreP:0.0100
Episode:2240 meanR:482.7400 R:500.0 loss:15.3338 exploreP:0.0100
Episode:2241 meanR:479.6200 R:12.0 loss:321.8780 exploreP:0.0100
Episode:2242 meanR:482.4000 R:500.0 loss:15.5351 exploreP:0.0100
Episode:2243 meanR:485.4400 R:500.0 loss:17.9188 exploreP:0.0100
Episode:2244 meanR:485.4400 R:500.0 loss:15.6532 exploreP:0.0100
Episode:2245 meanR:485.4400 R:500.0 loss:11.9161 exploreP:0.0100
Episode:2246 meanR:480.5600 R:12.0 loss:361.4533 exploreP:0.0100
Episode:2247 meanR:480.5600 R:500.0 loss:14.8529 exploreP:0.0100
Episode:2248 meanR:480.5600 R:500.0 loss:13.0160 exploreP:0.0100
Episode:2249 meanR:475.8500 R:29.0 loss:218.4364 exploreP:0.0100
Episode:2250 meanR:475.8500 R:500.0 loss:11.9272 exploreP:0.0100
Episode:2251 meanR:475.85

Episode:2363 meanR:45.5800 R:12.0 loss:627.8365 exploreP:0.0100
Episode:2364 meanR:40.6800 R:10.0 loss:639.7761 exploreP:0.0100
Episode:2365 meanR:35.7700 R:9.0 loss:449.4649 exploreP:0.0100
Episode:2366 meanR:35.7500 R:9.0 loss:751.0530 exploreP:0.0100
Episode:2367 meanR:35.7500 R:11.0 loss:1014.9219 exploreP:0.0100
Episode:2368 meanR:30.8500 R:10.0 loss:1336.1951 exploreP:0.0100
Episode:2369 meanR:30.8400 R:9.0 loss:1333.0994 exploreP:0.0100
Episode:2370 meanR:25.9300 R:9.0 loss:1407.0212 exploreP:0.0100
Episode:2371 meanR:21.0300 R:10.0 loss:1275.9265 exploreP:0.0100
Episode:2372 meanR:21.0500 R:12.0 loss:793.6238 exploreP:0.0100
Episode:2373 meanR:21.0700 R:12.0 loss:418.0009 exploreP:0.0100
Episode:2374 meanR:21.0500 R:9.0 loss:775.8104 exploreP:0.0100
Episode:2375 meanR:16.1400 R:9.0 loss:826.2268 exploreP:0.0100
Episode:2376 meanR:16.1600 R:11.0 loss:761.0157 exploreP:0.0100
Episode:2377 meanR:16.1700 R:10.0 loss:530.7955 exploreP:0.0100
Episode:2378 meanR:16.1500 R:10.0 loss:40

Episode:2495 meanR:9.6600 R:10.0 loss:578.6719 exploreP:0.0100
Episode:2496 meanR:9.6600 R:9.0 loss:743.3437 exploreP:0.0100
Episode:2497 meanR:9.6800 R:11.0 loss:611.3293 exploreP:0.0100
Episode:2498 meanR:9.7000 R:11.0 loss:457.8791 exploreP:0.0100
Episode:2499 meanR:9.7000 R:9.0 loss:541.3118 exploreP:0.0100
Episode:2500 meanR:9.6900 R:9.0 loss:665.5518 exploreP:0.0100
Episode:2501 meanR:9.7000 R:10.0 loss:612.4380 exploreP:0.0100
Episode:2502 meanR:9.6900 R:9.0 loss:651.1035 exploreP:0.0100
Episode:2503 meanR:9.7100 R:11.0 loss:605.4815 exploreP:0.0100
Episode:2504 meanR:9.7100 R:10.0 loss:383.6659 exploreP:0.0100
Episode:2505 meanR:9.7200 R:11.0 loss:451.8666 exploreP:0.0100
Episode:2506 meanR:9.7300 R:10.0 loss:470.3300 exploreP:0.0100
Episode:2507 meanR:9.7400 R:11.0 loss:613.2170 exploreP:0.0100
Episode:2508 meanR:9.7300 R:10.0 loss:568.6302 exploreP:0.0100
Episode:2509 meanR:9.7500 R:11.0 loss:426.0412 exploreP:0.0100
Episode:2510 meanR:9.7500 R:10.0 loss:438.7925 exploreP:0.0

Episode:2625 meanR:28.1300 R:500.0 loss:1.5539 exploreP:0.0100
Episode:2626 meanR:28.1700 R:15.0 loss:24.8186 exploreP:0.0100
Episode:2627 meanR:32.7400 R:500.0 loss:1.7374 exploreP:0.0100
Episode:2628 meanR:37.4200 R:500.0 loss:10.4298 exploreP:0.0100
Episode:2629 meanR:41.9800 R:500.0 loss:14.4315 exploreP:0.0100
Episode:2630 meanR:46.5700 R:500.0 loss:24.0709 exploreP:0.0100
Episode:2631 meanR:51.1200 R:500.0 loss:14.6361 exploreP:0.0100
Episode:2632 meanR:50.6000 R:18.0 loss:1291.6958 exploreP:0.0100
Episode:2633 meanR:55.4800 R:500.0 loss:19.8372 exploreP:0.0100
Episode:2634 meanR:60.3300 R:500.0 loss:20.4127 exploreP:0.0100
Episode:2635 meanR:64.9800 R:500.0 loss:15.2640 exploreP:0.0100
Episode:2636 meanR:69.6500 R:500.0 loss:14.5598 exploreP:0.0100
Episode:2637 meanR:74.1700 R:500.0 loss:15.8585 exploreP:0.0100
Episode:2638 meanR:78.7900 R:500.0 loss:16.2653 exploreP:0.0100
Episode:2639 meanR:83.1900 R:500.0 loss:16.8516 exploreP:0.0100
Episode:2640 meanR:87.8400 R:500.0 loss:16

Episode:2752 meanR:490.4300 R:500.0 loss:14.0085 exploreP:0.0100
Episode:2753 meanR:490.4300 R:500.0 loss:11.9111 exploreP:0.0100
Episode:2754 meanR:490.4300 R:500.0 loss:13.8373 exploreP:0.0100
Episode:2755 meanR:490.4300 R:500.0 loss:14.7926 exploreP:0.0100
Episode:2756 meanR:490.4300 R:500.0 loss:11.6905 exploreP:0.0100
Episode:2757 meanR:490.4300 R:500.0 loss:8.9471 exploreP:0.0100
Episode:2758 meanR:490.4300 R:500.0 loss:12.8842 exploreP:0.0100
Episode:2759 meanR:490.4300 R:500.0 loss:12.4620 exploreP:0.0100
Episode:2760 meanR:490.4300 R:500.0 loss:12.5712 exploreP:0.0100
Episode:2761 meanR:490.4300 R:500.0 loss:15.8924 exploreP:0.0100
Episode:2762 meanR:490.4300 R:500.0 loss:13.3261 exploreP:0.0100
Episode:2763 meanR:490.4300 R:500.0 loss:10.0993 exploreP:0.0100
Episode:2764 meanR:490.4300 R:500.0 loss:11.0368 exploreP:0.0100
Episode:2765 meanR:490.4300 R:500.0 loss:10.2510 exploreP:0.0100
Episode:2766 meanR:490.4300 R:500.0 loss:8.8520 exploreP:0.0100
Episode:2767 meanR:490.4300

Episode:2878 meanR:198.7700 R:500.0 loss:15.8330 exploreP:0.0100
Episode:2879 meanR:198.7700 R:500.0 loss:17.0256 exploreP:0.0100
Episode:2880 meanR:198.7700 R:500.0 loss:26.6458 exploreP:0.0100
Episode:2881 meanR:198.7700 R:500.0 loss:14.6749 exploreP:0.0100
Episode:2882 meanR:198.7700 R:500.0 loss:16.8254 exploreP:0.0100
Episode:2883 meanR:198.7700 R:500.0 loss:16.5131 exploreP:0.0100
Episode:2884 meanR:198.7700 R:500.0 loss:15.9893 exploreP:0.0100
Episode:2885 meanR:198.7700 R:500.0 loss:19.2054 exploreP:0.0100
Episode:2886 meanR:198.7700 R:500.0 loss:15.5667 exploreP:0.0100
Episode:2887 meanR:203.6400 R:500.0 loss:16.1249 exploreP:0.0100
Episode:2888 meanR:208.5200 R:500.0 loss:16.0130 exploreP:0.0100
Episode:2889 meanR:208.5200 R:500.0 loss:14.2668 exploreP:0.0100
Episode:2890 meanR:213.4100 R:500.0 loss:12.1367 exploreP:0.0100
Episode:2891 meanR:218.3000 R:500.0 loss:13.7578 exploreP:0.0100
Episode:2892 meanR:223.1800 R:500.0 loss:15.1166 exploreP:0.0100
Episode:2893 meanR:228.05

# Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')

## Testing

Let's checkout how our trained agent plays the game.

In [184]:
import gym

# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model-seq.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    initial_state = sess.run(model.initial_state) # Qs or current batch or states[:-1]
    
    # Episode/epoch
    for _ in range(1):
        state = env.reset()
        total_reward = 0
        
        # Steps/batches
        while True:
            env.render()
            action_logits, initial_state = sess.run([model.actions_logits, model.final_state],
                                                    feed_dict = {model.states: state.reshape([1, -1]), 
                                                                 model.initial_state: initial_state})
            action = np.argmax(action_logits)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break
        # At the end of each episode
        print('total_reward:{}'.format(total_reward))

# Close the env
env.close()

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt




total_reward:120.0


## Extending this

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.