# Sequential DQN

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


>**Note:** Make sure you have OpenAI Gym cloned into the same directory with this notebook. I've included `gym` as a submodule, so you can run `git submodule --init --recursive` to pull the contents into the `gym` repo.

>**Note:** Make sure you have OpenAI Gym cloned. Then run this command `pip install -e gym/[all]`.

In [2]:
import gym

# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')

We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing in an action as an integer to `env.step` will generate the next step in the simulation.  You can see how many actions are possible from `env.action_space` and to get a random action you can use `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to watch the simulation run.

In [3]:
import numpy as np
state = env.reset()
for _ in range(10):
    # env.render()
    action = env.action_space.sample()
    next_state, reward, done, info = env.step(action) # take a random action
    print('state, action, next_state, reward, done, info:', state, action, next_state, reward, done, info)
    state = next_state
    if done:
        state = env.reset()

state, action, next_state, reward, done, info: [ 0.03773248 -0.04329336 -0.00223015 -0.0014003 ] 0 [ 0.03686661 -0.23838326 -0.00225815  0.29057815] 1.0 False {}
state, action, next_state, reward, done, info: [ 0.03686661 -0.23838326 -0.00225815  0.29057815] 1 [ 0.03209894 -0.04322918  0.00355341 -0.00281611] 1.0 False {}
state, action, next_state, reward, done, info: [ 0.03209894 -0.04322918  0.00355341 -0.00281611] 1 [ 0.03123436  0.15184163  0.00349709 -0.29437578] 1.0 False {}
state, action, next_state, reward, done, info: [ 0.03123436  0.15184163  0.00349709 -0.29437578] 0 [ 0.03427119 -0.04333    -0.00239043 -0.00059198] 1.0 False {}
state, action, next_state, reward, done, info: [ 0.03427119 -0.04333    -0.00239043 -0.00059198] 1 [ 0.03340459  0.15182615 -0.00240227 -0.29402815] 1.0 False {}
state, action, next_state, reward, done, info: [ 0.03340459  0.15182615 -0.00240227 -0.29402815] 1 [ 0.03644112  0.34698227 -0.00828283 -0.58746775] 1.0 False {}
state, action, next_state, r

To shut the window showing the simulation, use `env.close()`.

If you ran the simulation above, we can look at the rewards:

In [4]:
# print(rewards[-20:])
# print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
# print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
# print(np.max(np.array(actions)), np.min(np.array(actions)))
# print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
# print(np.max(np.array(rewards)), np.min(np.array(rewards)))
# print(np.max(np.array(states)), np.min(np.array(states)))

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max{Q(s', a')}
$$

where $s$ is a state, $a$ is an action, and $s'$ is the next state from state $s$ and action $a$.

Before we used this equation to learn values for a Q-_table_. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating point precisions, you practically have infinite states. Instead of using a table then, we'll replace it with a neural network that will approximate the Q-table lookup function.

<img src="assets/deep-q-learning.png" width=450px>

Now, our Q value, $Q(s, a)$ is calculated by passing in a state to the network. The output will be Q-values for each available action, with fully connected hidden layers.

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$. 

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.

In [5]:
def model_input(state_size, lstm_size, batch_size=1):
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    # RNN
    gru = tf.nn.rnn_cell.GRUCell(lstm_size)
    cell = tf.nn.rnn_cell.MultiRNNCell([gru], state_is_tuple=False)
    initial_state = cell.zero_state(batch_size, tf.float32)
    return states, actions, targetQs, cell, initial_state

In [6]:
# RNN generator or sequence generator
def generator(states, num_classes, initial_state, cell, lstm_size, reuse=False): 
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        inputs = tf.layers.dense(inputs=states, units=lstm_size)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        # dynamic means adapt to the batch_size
        inputs_rnn = tf.reshape(inputs, [1, -1, lstm_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cell, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state.shape)
        outputs = tf.reshape(outputs_rnn, [-1, lstm_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=num_classes)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the action logits
        return logits, final_state

In [7]:
def model_loss(action_size, hidden_size, states, cell, initial_state, actions, targetQs):
    actions_logits, final_state = generator(states=states, cell=cell, initial_state=initial_state, 
                                            lstm_size=hidden_size, num_classes=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    Qs = tf.reduce_max(actions_logits*actions_labels, axis=1)
    loss = tf.reduce_mean(tf.square(Qs - targetQs))
    return actions_logits, final_state, loss

In [8]:
def model_opt(loss, learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]

    # # Optimize
    # with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
    # #opt = tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=g_vars)

    #grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(loss, g_vars), clip_norm=5) # usually around 1-5
    grads = tf.gradients(loss, g_vars)
    opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(grads, g_vars))

    return opt

In [9]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, cell, self.initial_state = model_input(
            state_size=state_size, lstm_size=hidden_size)
        
        # Create the Model: calculating the loss and forwad pass
        self.actions_logits, self.final_state, self.loss = model_loss(
            action_size=action_size, hidden_size=hidden_size, 
            states=self.states, actions=self.actions, 
            targetQs=self.targetQs, cell=cell, initial_state=self.initial_state)

        # Update the model: backward pass and backprop
        self.opt = model_opt(loss=self.loss, learning_rate=learning_rate)

## Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on. 

Here, we'll create a `Memory` object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maxmium capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a `Memory` object. If you're unfamiliar with `deque`, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.

In [10]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
        self.states = deque(maxlen=max_size)
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx], [self.states[ii] for ii in idx]

## Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon).  That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an **$\epsilon$-greedy policy**.


At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called _exploitation_. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.

## Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in _episodes_. One *episode* is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames. So we can start a new episode once meeting that goal. The game ends if the pole tilts over too far, or if the cart moves too far the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

* Initialize the memory $D$
* Initialize the action-value network $Q$ with random weights
* **For** episode = 1, $M$ **do**
  * **For** $t$, $T$ **do**
     * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s,a)$
     * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
     * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
     * Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
     * Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
     * Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
  * **endfor**
* **endfor**

## Hyperparameters

One of the more difficult aspects of reinforcememt learning are the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.

In [11]:
# print('state:', np.array(states).shape[1], 
#       'action size: {}'.format((np.max(np.array(actions)) - np.min(np.array(actions)))+1))

In [12]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
action_size = 2
state_size = 4
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 128            # memory capacity - 1000 DQN
batch_size = 128             # experience mini-batch size - 20 DQN
gamma = 0.99                 # future reward discount

In [13]:
# Reset/init the graph/session
graph = tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=batch_size)

(?, 4) (?, 64)
(1, ?, 64) (1, 64)
(1, ?, 64) (1, 64)
(?, 64)
(?, 2)


## Populate the memory (exprience memory)

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.

In [14]:
state = env.reset()
for _ in range(batch_size):
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    memory.buffer.append([state, action, next_state, reward, float(done)])
    #print(state, action, next_state, reward, float(done))
    state = next_state
    if done is True:
        # This is a hypothetical state
        # Transition state for RNN: in between last state/Done and fist state/reset
        state = np.zeros_like(state)
        action = 0 # constant action
        next_state = np.zeros_like(state)
        reward = 0
        done = True
        memory.buffer.append([state, action, next_state, reward, float(done)])
        # initial_state = np.zeros_like(initial_state)
        # final_state = np.zeros_like(initial_state)
        # memory.states.append([initial_state, final_state])
                    
        # Reseting the env/first state
        state = env.reset()

In [15]:
memory.buffer

deque([[array([ 0.026137  , -0.63516216,  0.0691559 ,  0.99818489]),
        0,
        array([ 0.01343376, -0.8311371 ,  0.0891196 ,  1.31176028]),
        1.0,
        0.0],
       [array([ 0.01343376, -0.8311371 ,  0.0891196 ,  1.31176028]),
        0,
        array([-0.00318898, -1.02726731,  0.11535481,  1.63095457]),
        1.0,
        0.0],
       [array([-0.00318898, -1.02726731,  0.11535481,  1.63095457]),
        1,
        array([-0.02373433, -0.83367404,  0.1479739 ,  1.37633374]),
        1.0,
        0.0],
       [array([-0.02373433, -0.83367404,  0.1479739 ,  1.37633374]),
        0,
        array([-0.04040781, -1.0303021 ,  0.17550057,  1.71139839]),
        1.0,
        0.0],
       [array([-0.04040781, -1.0303021 ,  0.17550057,  1.71139839]),
        1,
        array([-0.06101385, -0.83757671,  0.20972854,  1.47808361]),
        1.0,
        1.0],
       [array([0., 0., 0., 0.]), 0, array([0., 0., 0., 0.]), 0, 1.0],
       [array([-0.02797967,  0.02900578, -0.002157

## Training the model

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [None]:
saver = tf.train.Saver()
episode_rewards_list, rewards_list, loss_list = [], [], []

# TF session for training
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    total_step = 0 # Explore or exploit parameter
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        loss_batch = []
        state = env.reset()
        initial_state = sess.run(model.initial_state)

        # Training steps/batches
        while True:
            action_logits, final_state = sess.run([model.actions_logits, model.final_state],
                                                  feed_dict = {model.states: state.reshape([1, -1]), 
                                                               model.initial_state: initial_state})
            # Explore (Env) or Exploit (Model): NO
            total_step += 1
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            # if explore_p > np.random.rand():
            #     action = env.action_space.sample()
            # else:
            action = np.argmax(action_logits)
            next_state, reward, done, _ = env.step(action)
            memory.buffer.append([state, action, next_state, reward, float(done)])
            memory.states.append([initial_state, final_state])
            total_reward += reward
            initial_state = final_state
            state = next_state

            # Training
            #batch, rnn_states = memory.sample(batch_size)
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            rnn_states = memory.states
            initial_states = np.array([each[0] for each in rnn_states])
            final_states = np.array([each[1] for each in rnn_states])
            next_actions_logits = sess.run(model.actions_logits, 
                                           feed_dict = {model.states: next_states, 
                                                        model.initial_state: final_states[0].reshape([1, -1])})
            nextQs = np.max(next_actions_logits, axis=1) * (1-dones)
            targetQs = rewards + (gamma * nextQs)
            loss, _ = sess.run([model.loss, model.opt], feed_dict = {model.states: states, 
                                                                     model.actions: actions,
                                                                     model.targetQs: targetQs,
                                                        model.initial_state: initial_states[0].reshape([1, -1])})
            loss_batch.append(loss)
            if done is True:
                break
            
        # This is a hypothetical state
        # Transition state for RNN: in between last state/Done and fist state/reset
        state = np.zeros_like(state)
        action = 0 # constant action
        next_state = np.zeros_like(state)
        reward = 0
        done = True
        memory.buffer.append([state, action, next_state, reward, float(done)])
        initial_state = np.zeros_like(initial_state)
        final_state = np.zeros_like(initial_state)
        memory.states.append([initial_state, final_state])
        #total_reward += reward
        #initial_state = final_state
        #state = next_state

        # Outputing: priting out/Potting
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'loss:{:.4f}'.format(np.mean(loss_batch)),
              'exploreP:{:.4f}'.format(explore_p))
        # Ploting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        loss_list.append([ep, np.mean(loss_batch)])
        # Break episode/epoch loop
        if np.mean(episode_reward) >= 500:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:106.0000 R:106.0000 loss:1.2721 exploreP:0.9896
Episode:1 meanR:69.0000 R:32.0000 loss:2.7534 exploreP:0.9864
Episode:2 meanR:56.3333 R:31.0000 loss:2.8834 exploreP:0.9834
Episode:3 meanR:48.2500 R:24.0000 loss:2.9061 exploreP:0.9811
Episode:4 meanR:44.0000 R:27.0000 loss:2.8443 exploreP:0.9785
Episode:5 meanR:41.5000 R:29.0000 loss:5.0049 exploreP:0.9757
Episode:6 meanR:39.0000 R:24.0000 loss:2.6617 exploreP:0.9733
Episode:7 meanR:38.1250 R:32.0000 loss:2.9723 exploreP:0.9703
Episode:8 meanR:36.5556 R:24.0000 loss:3.1742 exploreP:0.9680
Episode:9 meanR:35.6000 R:27.0000 loss:3.4647 exploreP:0.9654
Episode:10 meanR:34.7273 R:26.0000 loss:3.6421 exploreP:0.9629
Episode:11 meanR:34.4167 R:31.0000 loss:3.5883 exploreP:0.9599
Episode:12 meanR:38.5385 R:88.0000 loss:3.1695 exploreP:0.9516
Episode:13 meanR:39.5714 R:53.0000 loss:2.3818 exploreP:0.9466
Episode:14 meanR:42.5333 R:84.0000 loss:2.6478 exploreP:0.9388
Episode:15 meanR:44.5000 R:74.0000 loss:2.6391 exploreP:0.9320


Episode:129 meanR:49.7800 R:118.0000 loss:69.4119 exploreP:0.5559
Episode:130 meanR:50.4800 R:82.0000 loss:16.7232 exploreP:0.5514
Episode:131 meanR:50.9700 R:61.0000 loss:32.1088 exploreP:0.5481
Episode:132 meanR:51.3900 R:55.0000 loss:85.6426 exploreP:0.5452
Episode:133 meanR:51.7400 R:49.0000 loss:47.4558 exploreP:0.5426
Episode:134 meanR:52.4800 R:86.0000 loss:83.8800 exploreP:0.5380
Episode:135 meanR:53.0300 R:69.0000 loss:83.3775 exploreP:0.5344
Episode:136 meanR:53.5700 R:68.0000 loss:87.9345 exploreP:0.5308
Episode:137 meanR:53.9200 R:49.0000 loss:31.7792 exploreP:0.5283
Episode:138 meanR:54.3600 R:58.0000 loss:82.9574 exploreP:0.5253
Episode:139 meanR:54.7800 R:54.0000 loss:49.4840 exploreP:0.5225
Episode:140 meanR:55.4700 R:82.0000 loss:46.6280 exploreP:0.5183
Episode:141 meanR:55.7300 R:41.0000 loss:109.3532 exploreP:0.5162
Episode:142 meanR:56.0200 R:44.0000 loss:17.3827 exploreP:0.5140
Episode:143 meanR:56.1900 R:32.0000 loss:21.5767 exploreP:0.5124
Episode:144 meanR:56.66

Episode:255 meanR:53.8900 R:57.0000 loss:105.2589 exploreP:0.2796
Episode:256 meanR:53.7100 R:48.0000 loss:17.7959 exploreP:0.2784
Episode:257 meanR:54.1300 R:96.0000 loss:14.4195 exploreP:0.2758
Episode:258 meanR:54.5600 R:90.0000 loss:9.9809 exploreP:0.2734
Episode:259 meanR:54.3100 R:46.0000 loss:13.7960 exploreP:0.2722
Episode:260 meanR:54.2300 R:38.0000 loss:15.7816 exploreP:0.2712
Episode:261 meanR:54.4300 R:69.0000 loss:19.8885 exploreP:0.2694
Episode:262 meanR:54.4400 R:64.0000 loss:17.5901 exploreP:0.2678
Episode:263 meanR:54.8000 R:132.0000 loss:70.9094 exploreP:0.2644
Episode:264 meanR:54.7000 R:61.0000 loss:45.4663 exploreP:0.2628
Episode:265 meanR:55.1500 R:93.0000 loss:110.3224 exploreP:0.2605
Episode:266 meanR:55.4700 R:82.0000 loss:90.0984 exploreP:0.2584
Episode:267 meanR:56.2500 R:150.0000 loss:92.3238 exploreP:0.2547
Episode:268 meanR:56.6500 R:120.0000 loss:10.0618 exploreP:0.2518
Episode:269 meanR:57.7900 R:175.0000 loss:71.3413 exploreP:0.2476
Episode:270 meanR:57

Episode:381 meanR:77.3000 R:61.0000 loss:20.6478 exploreP:0.1091
Episode:382 meanR:77.4100 R:87.0000 loss:50.9510 exploreP:0.1083
Episode:383 meanR:77.3000 R:77.0000 loss:54.8420 exploreP:0.1075
Episode:384 meanR:77.3000 R:83.0000 loss:57.0921 exploreP:0.1067
Episode:385 meanR:77.2300 R:68.0000 loss:72.8277 exploreP:0.1061
Episode:386 meanR:76.7100 R:54.0000 loss:51.7582 exploreP:0.1055
Episode:387 meanR:76.8800 R:112.0000 loss:72.6404 exploreP:0.1045
Episode:388 meanR:76.4000 R:58.0000 loss:47.4735 exploreP:0.1039
Episode:389 meanR:75.7900 R:78.0000 loss:98.5917 exploreP:0.1032
Episode:390 meanR:75.4700 R:59.0000 loss:52.8676 exploreP:0.1027
Episode:391 meanR:75.1100 R:56.0000 loss:101.8194 exploreP:0.1021
Episode:392 meanR:74.9800 R:90.0000 loss:27.8116 exploreP:0.1013
Episode:393 meanR:75.0300 R:84.0000 loss:18.7363 exploreP:0.1006
Episode:394 meanR:74.6700 R:66.0000 loss:68.7631 exploreP:0.1000
Episode:395 meanR:74.3600 R:71.0000 loss:120.8825 exploreP:0.0993
Episode:396 meanR:74.5

Episode:507 meanR:77.7800 R:500.0000 loss:5.1428 exploreP:0.0477
Episode:508 meanR:77.3900 R:63.0000 loss:16.7050 exploreP:0.0474
Episode:509 meanR:81.7100 R:500.0000 loss:6.1590 exploreP:0.0456
Episode:510 meanR:83.5800 R:273.0000 loss:9.0418 exploreP:0.0447
Episode:511 meanR:83.3700 R:63.0000 loss:19.4252 exploreP:0.0444
Episode:512 meanR:84.6700 R:209.0000 loss:17.3131 exploreP:0.0437
Episode:513 meanR:84.2800 R:61.0000 loss:20.2267 exploreP:0.0435
Episode:514 meanR:86.6300 R:321.0000 loss:124.4880 exploreP:0.0425
Episode:515 meanR:88.3900 R:262.0000 loss:96.0707 exploreP:0.0416
Episode:516 meanR:89.7900 R:212.0000 loss:111.1577 exploreP:0.0410
Episode:517 meanR:90.6000 R:131.0000 loss:72.9220 exploreP:0.0406
Episode:518 meanR:91.2100 R:166.0000 loss:14.9311 exploreP:0.0401
Episode:519 meanR:91.1700 R:78.0000 loss:22.2102 exploreP:0.0398
Episode:520 meanR:91.4800 R:212.0000 loss:16.7092 exploreP:0.0392
Episode:521 meanR:92.2600 R:145.0000 loss:19.6208 exploreP:0.0388
Episode:522 mea

Episode:630 meanR:209.6800 R:500.0000 loss:16.1459 exploreP:0.0131
Episode:631 meanR:213.2100 R:500.0000 loss:16.3492 exploreP:0.0130
Episode:632 meanR:215.5600 R:363.0000 loss:21.1979 exploreP:0.0129
Episode:633 meanR:219.2200 R:500.0000 loss:16.1170 exploreP:0.0127
Episode:634 meanR:223.1700 R:500.0000 loss:18.0517 exploreP:0.0126
Episode:635 meanR:225.8800 R:500.0000 loss:18.6556 exploreP:0.0125
Episode:636 meanR:229.0300 R:500.0000 loss:18.1238 exploreP:0.0123
Episode:637 meanR:232.1700 R:500.0000 loss:18.7915 exploreP:0.0122
Episode:638 meanR:235.3500 R:500.0000 loss:19.1289 exploreP:0.0121
Episode:639 meanR:238.7400 R:500.0000 loss:18.8456 exploreP:0.0120
Episode:640 meanR:242.0400 R:500.0000 loss:19.2044 exploreP:0.0119
Episode:641 meanR:246.4900 R:500.0000 loss:19.5772 exploreP:0.0118
Episode:642 meanR:249.1800 R:402.0000 loss:23.8270 exploreP:0.0118
Episode:643 meanR:247.5800 R:232.0000 loss:36.8586 exploreP:0.0117
Episode:644 meanR:246.1200 R:128.0000 loss:64.5313 exploreP:0.

Episode:753 meanR:432.1200 R:500.0000 loss:12.7497 exploreP:0.0100
Episode:754 meanR:434.0800 R:500.0000 loss:19.2287 exploreP:0.0100
Episode:755 meanR:436.7700 R:500.0000 loss:26.9413 exploreP:0.0100
Episode:756 meanR:440.7900 R:500.0000 loss:24.4339 exploreP:0.0100
Episode:757 meanR:442.9800 R:500.0000 loss:14.0961 exploreP:0.0100
Episode:758 meanR:445.0400 R:500.0000 loss:25.4104 exploreP:0.0100
Episode:759 meanR:445.0400 R:500.0000 loss:11.6859 exploreP:0.0100
Episode:760 meanR:446.7700 R:500.0000 loss:19.9179 exploreP:0.0100
Episode:761 meanR:446.7700 R:500.0000 loss:26.2256 exploreP:0.0100
Episode:762 meanR:446.7700 R:500.0000 loss:21.5455 exploreP:0.0100
Episode:763 meanR:446.7700 R:500.0000 loss:24.1105 exploreP:0.0100
Episode:764 meanR:451.2300 R:500.0000 loss:25.3431 exploreP:0.0100
Episode:765 meanR:455.7900 R:500.0000 loss:22.4880 exploreP:0.0100
Episode:766 meanR:456.2400 R:112.0000 loss:92.8997 exploreP:0.0100
Episode:767 meanR:456.1400 R:126.0000 loss:32.0293 exploreP:0.

Episode:876 meanR:373.5400 R:500.0000 loss:37.8568 exploreP:0.0100
Episode:877 meanR:373.5400 R:500.0000 loss:23.7334 exploreP:0.0100
Episode:878 meanR:373.5400 R:500.0000 loss:18.2694 exploreP:0.0100
Episode:879 meanR:373.5400 R:500.0000 loss:33.7434 exploreP:0.0100
Episode:880 meanR:373.5400 R:500.0000 loss:135.2586 exploreP:0.0100
Episode:881 meanR:373.5400 R:500.0000 loss:161.7821 exploreP:0.0100
Episode:882 meanR:373.3700 R:483.0000 loss:178.5020 exploreP:0.0100
Episode:883 meanR:371.9900 R:362.0000 loss:226.0865 exploreP:0.0100
Episode:884 meanR:371.7700 R:478.0000 loss:164.3495 exploreP:0.0100
Episode:885 meanR:371.7700 R:500.0000 loss:126.7280 exploreP:0.0100
Episode:886 meanR:371.7700 R:500.0000 loss:103.9667 exploreP:0.0100
Episode:887 meanR:371.7700 R:500.0000 loss:52.0461 exploreP:0.0100
Episode:888 meanR:371.7700 R:500.0000 loss:49.1500 exploreP:0.0100
Episode:889 meanR:371.7700 R:500.0000 loss:48.6543 exploreP:0.0100
Episode:890 meanR:371.7700 R:500.0000 loss:23.2326 expl

Episode:998 meanR:111.3800 R:42.0000 loss:850.8348 exploreP:0.0100
Episode:999 meanR:106.7700 R:39.0000 loss:880.9949 exploreP:0.0100
Episode:1000 meanR:102.2600 R:49.0000 loss:932.8964 exploreP:0.0100
Episode:1001 meanR:97.6700 R:41.0000 loss:900.3249 exploreP:0.0100
Episode:1002 meanR:93.0700 R:40.0000 loss:875.1217 exploreP:0.0100
Episode:1003 meanR:88.4600 R:39.0000 loss:806.5023 exploreP:0.0100
Episode:1004 meanR:83.9800 R:52.0000 loss:940.4077 exploreP:0.0100
Episode:1005 meanR:79.3700 R:39.0000 loss:959.5498 exploreP:0.0100
Episode:1006 meanR:74.8600 R:49.0000 loss:861.0711 exploreP:0.0100
Episode:1007 meanR:70.3900 R:53.0000 loss:918.3151 exploreP:0.0100
Episode:1008 meanR:65.8100 R:42.0000 loss:878.2374 exploreP:0.0100
Episode:1009 meanR:61.2800 R:47.0000 loss:856.9659 exploreP:0.0100
Episode:1010 meanR:56.7400 R:46.0000 loss:916.0070 exploreP:0.0100
Episode:1011 meanR:52.1200 R:38.0000 loss:959.3371 exploreP:0.0100
Episode:1012 meanR:50.6700 R:49.0000 loss:874.5239 exploreP:0

Episode:1120 meanR:50.0600 R:57.0000 loss:1101.7864 exploreP:0.0100
Episode:1121 meanR:50.1500 R:59.0000 loss:1036.8722 exploreP:0.0100
Episode:1122 meanR:50.2700 R:57.0000 loss:932.6665 exploreP:0.0100
Episode:1123 meanR:50.2800 R:44.0000 loss:955.2253 exploreP:0.0100
Episode:1124 meanR:50.4300 R:53.0000 loss:1098.7699 exploreP:0.0100
Episode:1125 meanR:50.4300 R:48.0000 loss:1155.6116 exploreP:0.0100
Episode:1126 meanR:50.6000 R:62.0000 loss:1227.1321 exploreP:0.0100
Episode:1127 meanR:50.7100 R:56.0000 loss:1264.7611 exploreP:0.0100
Episode:1128 meanR:50.6300 R:45.0000 loss:1149.2916 exploreP:0.0100
Episode:1129 meanR:50.6600 R:51.0000 loss:1088.4166 exploreP:0.0100
Episode:1130 meanR:50.7600 R:58.0000 loss:1165.0570 exploreP:0.0100
Episode:1131 meanR:50.8200 R:53.0000 loss:1211.3368 exploreP:0.0100
Episode:1132 meanR:50.8600 R:58.0000 loss:1104.0876 exploreP:0.0100
Episode:1133 meanR:51.0300 R:60.0000 loss:1273.1559 exploreP:0.0100
Episode:1134 meanR:51.1100 R:54.0000 loss:1027.694

Episode:1241 meanR:59.9900 R:63.0000 loss:1276.0282 exploreP:0.0100
Episode:1242 meanR:60.1100 R:72.0000 loss:1359.4663 exploreP:0.0100
Episode:1243 meanR:60.2400 R:68.0000 loss:1320.4843 exploreP:0.0100
Episode:1244 meanR:60.3800 R:69.0000 loss:1221.7216 exploreP:0.0100
Episode:1245 meanR:60.6800 R:81.0000 loss:1305.2716 exploreP:0.0100
Episode:1246 meanR:60.6800 R:57.0000 loss:1243.4111 exploreP:0.0100
Episode:1247 meanR:60.8300 R:65.0000 loss:1090.4180 exploreP:0.0100
Episode:1248 meanR:60.8100 R:51.0000 loss:1345.9778 exploreP:0.0100
Episode:1249 meanR:61.0700 R:75.0000 loss:1240.4772 exploreP:0.0100
Episode:1250 meanR:61.1100 R:61.0000 loss:1352.3099 exploreP:0.0100
Episode:1251 meanR:61.3000 R:73.0000 loss:1207.6157 exploreP:0.0100
Episode:1252 meanR:61.2400 R:52.0000 loss:1406.9475 exploreP:0.0100
Episode:1253 meanR:61.2200 R:56.0000 loss:1184.9658 exploreP:0.0100
Episode:1254 meanR:61.2900 R:62.0000 loss:1358.6505 exploreP:0.0100
Episode:1255 meanR:61.3000 R:57.0000 loss:1263.5

Episode:1362 meanR:62.4500 R:64.0000 loss:1319.7179 exploreP:0.0100
Episode:1363 meanR:62.5400 R:60.0000 loss:1342.9017 exploreP:0.0100
Episode:1364 meanR:62.5800 R:57.0000 loss:1199.7985 exploreP:0.0100
Episode:1365 meanR:62.4900 R:61.0000 loss:1315.6285 exploreP:0.0100
Episode:1366 meanR:62.2300 R:45.0000 loss:1317.4581 exploreP:0.0100
Episode:1367 meanR:62.3100 R:62.0000 loss:1283.8647 exploreP:0.0100
Episode:1368 meanR:62.4400 R:73.0000 loss:1356.7927 exploreP:0.0100
Episode:1369 meanR:62.4800 R:56.0000 loss:1445.6630 exploreP:0.0100
Episode:1370 meanR:62.5600 R:63.0000 loss:1254.2654 exploreP:0.0100
Episode:1371 meanR:62.5400 R:69.0000 loss:1424.0493 exploreP:0.0100
Episode:1372 meanR:62.5200 R:66.0000 loss:1348.7749 exploreP:0.0100
Episode:1373 meanR:62.5900 R:58.0000 loss:1235.8346 exploreP:0.0100
Episode:1374 meanR:62.6600 R:62.0000 loss:1342.2399 exploreP:0.0100
Episode:1375 meanR:62.6400 R:61.0000 loss:1373.9495 exploreP:0.0100
Episode:1376 meanR:62.5900 R:66.0000 loss:1321.5

Episode:1483 meanR:65.3900 R:63.0000 loss:1370.3872 exploreP:0.0100
Episode:1484 meanR:65.3400 R:73.0000 loss:1318.9170 exploreP:0.0100
Episode:1485 meanR:65.4000 R:73.0000 loss:1428.6130 exploreP:0.0100
Episode:1486 meanR:65.4200 R:60.0000 loss:1313.6337 exploreP:0.0100
Episode:1487 meanR:65.4000 R:64.0000 loss:1232.9629 exploreP:0.0100
Episode:1488 meanR:65.4900 R:69.0000 loss:1413.0038 exploreP:0.0100
Episode:1489 meanR:65.5200 R:74.0000 loss:1472.4121 exploreP:0.0100
Episode:1490 meanR:65.5000 R:54.0000 loss:1355.9711 exploreP:0.0100
Episode:1491 meanR:65.8300 R:90.0000 loss:1223.4924 exploreP:0.0100
Episode:1492 meanR:65.7900 R:60.0000 loss:1303.4441 exploreP:0.0100
Episode:1493 meanR:65.9100 R:70.0000 loss:1219.5006 exploreP:0.0100
Episode:1494 meanR:65.8700 R:63.0000 loss:1405.8817 exploreP:0.0100
Episode:1495 meanR:65.9400 R:77.0000 loss:1357.0811 exploreP:0.0100
Episode:1496 meanR:66.1200 R:78.0000 loss:1394.6196 exploreP:0.0100
Episode:1497 meanR:65.9800 R:58.0000 loss:1276.8

Episode:1604 meanR:69.1500 R:54.0000 loss:1241.9777 exploreP:0.0100
Episode:1605 meanR:69.2500 R:76.0000 loss:1335.6320 exploreP:0.0100
Episode:1606 meanR:69.2000 R:63.0000 loss:1338.4165 exploreP:0.0100
Episode:1607 meanR:69.3100 R:78.0000 loss:1317.9785 exploreP:0.0100
Episode:1608 meanR:69.2400 R:58.0000 loss:1471.1124 exploreP:0.0100
Episode:1609 meanR:69.3900 R:76.0000 loss:1143.0457 exploreP:0.0100
Episode:1610 meanR:69.3400 R:54.0000 loss:1408.6948 exploreP:0.0100
Episode:1611 meanR:69.4200 R:79.0000 loss:1410.4940 exploreP:0.0100
Episode:1612 meanR:69.5400 R:66.0000 loss:1375.5679 exploreP:0.0100
Episode:1613 meanR:69.5400 R:64.0000 loss:1360.3431 exploreP:0.0100
Episode:1614 meanR:69.4700 R:66.0000 loss:1406.9807 exploreP:0.0100
Episode:1615 meanR:69.4200 R:66.0000 loss:1326.1283 exploreP:0.0100
Episode:1616 meanR:69.5400 R:78.0000 loss:1354.1598 exploreP:0.0100
Episode:1617 meanR:69.5700 R:64.0000 loss:1393.1401 exploreP:0.0100
Episode:1618 meanR:69.5000 R:53.0000 loss:1162.8

Episode:1725 meanR:69.9800 R:65.0000 loss:1296.2771 exploreP:0.0100
Episode:1726 meanR:69.9700 R:58.0000 loss:1241.5271 exploreP:0.0100
Episode:1727 meanR:70.0500 R:72.0000 loss:1430.6753 exploreP:0.0100
Episode:1728 meanR:69.8700 R:65.0000 loss:1364.5941 exploreP:0.0100
Episode:1729 meanR:69.9100 R:78.0000 loss:1380.0868 exploreP:0.0100
Episode:1730 meanR:69.5700 R:55.0000 loss:1413.4766 exploreP:0.0100
Episode:1731 meanR:69.3700 R:59.0000 loss:1232.0363 exploreP:0.0100
Episode:1732 meanR:69.3100 R:58.0000 loss:1375.1841 exploreP:0.0100
Episode:1733 meanR:69.2700 R:71.0000 loss:1385.6854 exploreP:0.0100
Episode:1734 meanR:69.3000 R:67.0000 loss:1400.0813 exploreP:0.0100
Episode:1735 meanR:69.1700 R:73.0000 loss:1346.2136 exploreP:0.0100
Episode:1736 meanR:69.1600 R:72.0000 loss:1420.0541 exploreP:0.0100
Episode:1737 meanR:69.2300 R:86.0000 loss:1414.5226 exploreP:0.0100
Episode:1738 meanR:69.3000 R:64.0000 loss:1348.9633 exploreP:0.0100
Episode:1739 meanR:69.0600 R:52.0000 loss:1182.9

Episode:1846 meanR:68.1200 R:59.0000 loss:1391.2228 exploreP:0.0100
Episode:1847 meanR:68.1400 R:80.0000 loss:1368.6956 exploreP:0.0100
Episode:1848 meanR:68.0800 R:62.0000 loss:1458.6953 exploreP:0.0100
Episode:1849 meanR:68.1700 R:73.0000 loss:1218.3004 exploreP:0.0100
Episode:1850 meanR:67.9600 R:58.0000 loss:1250.4731 exploreP:0.0100
Episode:1851 meanR:67.9000 R:60.0000 loss:1357.1674 exploreP:0.0100
Episode:1852 meanR:67.7100 R:55.0000 loss:1317.5164 exploreP:0.0100
Episode:1853 meanR:67.5700 R:52.0000 loss:1358.4005 exploreP:0.0100
Episode:1854 meanR:67.5100 R:70.0000 loss:1451.8055 exploreP:0.0100
Episode:1855 meanR:67.6700 R:76.0000 loss:1356.2065 exploreP:0.0100
Episode:1856 meanR:67.4100 R:68.0000 loss:1315.6251 exploreP:0.0100
Episode:1857 meanR:67.3800 R:73.0000 loss:1296.4957 exploreP:0.0100
Episode:1858 meanR:67.4600 R:63.0000 loss:1379.9003 exploreP:0.0100
Episode:1859 meanR:67.5900 R:80.0000 loss:1412.3225 exploreP:0.0100
Episode:1860 meanR:67.4500 R:64.0000 loss:1401.1

Episode:1967 meanR:69.3300 R:80.0000 loss:1274.4062 exploreP:0.0100
Episode:1968 meanR:69.2200 R:63.0000 loss:1215.0618 exploreP:0.0100
Episode:1969 meanR:68.9400 R:58.0000 loss:1148.5797 exploreP:0.0100
Episode:1970 meanR:68.8200 R:64.0000 loss:1362.2361 exploreP:0.0100
Episode:1971 meanR:68.8900 R:74.0000 loss:1399.0707 exploreP:0.0100
Episode:1972 meanR:68.8500 R:72.0000 loss:1408.2609 exploreP:0.0100
Episode:1973 meanR:68.8800 R:57.0000 loss:1437.8424 exploreP:0.0100
Episode:1974 meanR:69.0400 R:71.0000 loss:1268.8004 exploreP:0.0100
Episode:1975 meanR:68.7400 R:56.0000 loss:1328.8898 exploreP:0.0100
Episode:1976 meanR:68.7000 R:58.0000 loss:1189.1118 exploreP:0.0100
Episode:1977 meanR:69.1100 R:103.0000 loss:1378.9246 exploreP:0.0100
Episode:1978 meanR:69.0700 R:65.0000 loss:1308.8091 exploreP:0.0100
Episode:1979 meanR:69.2600 R:77.0000 loss:1347.6765 exploreP:0.0100
Episode:1980 meanR:69.3600 R:68.0000 loss:1350.5641 exploreP:0.0100
Episode:1981 meanR:69.2200 R:55.0000 loss:1134.

Episode:2088 meanR:70.5700 R:59.0000 loss:1320.4917 exploreP:0.0100
Episode:2089 meanR:70.5100 R:74.0000 loss:1352.0619 exploreP:0.0100
Episode:2090 meanR:70.2700 R:57.0000 loss:1448.0481 exploreP:0.0100
Episode:2091 meanR:70.5100 R:81.0000 loss:1296.9644 exploreP:0.0100
Episode:2092 meanR:70.6200 R:72.0000 loss:1350.7253 exploreP:0.0100
Episode:2093 meanR:70.4900 R:70.0000 loss:1273.7069 exploreP:0.0100
Episode:2094 meanR:70.5500 R:74.0000 loss:1348.9003 exploreP:0.0100
Episode:2095 meanR:70.4400 R:57.0000 loss:1399.5200 exploreP:0.0100
Episode:2096 meanR:70.7800 R:91.0000 loss:1374.5525 exploreP:0.0100
Episode:2097 meanR:70.8200 R:57.0000 loss:1369.6852 exploreP:0.0100
Episode:2098 meanR:70.8200 R:74.0000 loss:1253.5006 exploreP:0.0100
Episode:2099 meanR:70.9700 R:74.0000 loss:1400.9641 exploreP:0.0100
Episode:2100 meanR:70.6700 R:59.0000 loss:1353.3817 exploreP:0.0100
Episode:2101 meanR:70.6000 R:62.0000 loss:1286.1514 exploreP:0.0100
Episode:2102 meanR:70.4200 R:58.0000 loss:1464.6

Episode:2209 meanR:69.2000 R:75.0000 loss:1313.0657 exploreP:0.0100
Episode:2210 meanR:69.1500 R:60.0000 loss:1265.9220 exploreP:0.0100
Episode:2211 meanR:69.1300 R:59.0000 loss:1261.3262 exploreP:0.0100
Episode:2212 meanR:68.9100 R:56.0000 loss:1310.3060 exploreP:0.0100
Episode:2213 meanR:68.8300 R:62.0000 loss:1402.3529 exploreP:0.0100
Episode:2214 meanR:68.9000 R:70.0000 loss:1402.7637 exploreP:0.0100
Episode:2215 meanR:69.0000 R:80.0000 loss:1363.2297 exploreP:0.0100
Episode:2216 meanR:68.9100 R:61.0000 loss:1261.6533 exploreP:0.0100
Episode:2217 meanR:68.9500 R:72.0000 loss:1179.9823 exploreP:0.0100
Episode:2218 meanR:69.1000 R:74.0000 loss:1418.3475 exploreP:0.0100
Episode:2219 meanR:69.0400 R:69.0000 loss:1342.0685 exploreP:0.0100
Episode:2220 meanR:69.0100 R:60.0000 loss:1409.4269 exploreP:0.0100
Episode:2221 meanR:69.1000 R:83.0000 loss:1319.8192 exploreP:0.0100
Episode:2222 meanR:69.0900 R:57.0000 loss:1323.9413 exploreP:0.0100
Episode:2223 meanR:68.9700 R:62.0000 loss:1184.6

Episode:2330 meanR:68.2100 R:61.0000 loss:1377.1742 exploreP:0.0100
Episode:2331 meanR:68.2600 R:65.0000 loss:1419.4781 exploreP:0.0100
Episode:2332 meanR:68.3200 R:74.0000 loss:1378.0149 exploreP:0.0100
Episode:2333 meanR:68.1300 R:56.0000 loss:1371.0439 exploreP:0.0100
Episode:2334 meanR:67.9300 R:55.0000 loss:1312.5465 exploreP:0.0100
Episode:2335 meanR:67.9100 R:64.0000 loss:1376.0657 exploreP:0.0100
Episode:2336 meanR:68.0300 R:78.0000 loss:1388.3325 exploreP:0.0100
Episode:2337 meanR:68.0400 R:78.0000 loss:1419.8279 exploreP:0.0100
Episode:2338 meanR:67.9200 R:68.0000 loss:1284.1514 exploreP:0.0100
Episode:2339 meanR:68.1000 R:74.0000 loss:1229.9570 exploreP:0.0100
Episode:2340 meanR:68.1500 R:86.0000 loss:1362.9773 exploreP:0.0100
Episode:2341 meanR:68.1400 R:75.0000 loss:1360.2704 exploreP:0.0100
Episode:2342 meanR:68.2000 R:82.0000 loss:1312.6873 exploreP:0.0100
Episode:2343 meanR:68.1100 R:64.0000 loss:1279.0784 exploreP:0.0100
Episode:2344 meanR:68.1500 R:61.0000 loss:1213.7

Episode:2451 meanR:68.7300 R:57.0000 loss:1315.7198 exploreP:0.0100
Episode:2452 meanR:68.8900 R:79.0000 loss:1302.8438 exploreP:0.0100
Episode:2453 meanR:68.8800 R:62.0000 loss:1452.6884 exploreP:0.0100
Episode:2454 meanR:68.7400 R:56.0000 loss:1312.0447 exploreP:0.0100
Episode:2455 meanR:68.7800 R:56.0000 loss:1344.6251 exploreP:0.0100
Episode:2456 meanR:68.4300 R:56.0000 loss:1337.4854 exploreP:0.0100
Episode:2457 meanR:68.4800 R:74.0000 loss:1384.1534 exploreP:0.0100
Episode:2458 meanR:68.4100 R:59.0000 loss:1419.3351 exploreP:0.0100
Episode:2459 meanR:68.5800 R:76.0000 loss:1330.9354 exploreP:0.0100
Episode:2460 meanR:68.7200 R:73.0000 loss:1376.7255 exploreP:0.0100
Episode:2461 meanR:68.9000 R:83.0000 loss:1301.1439 exploreP:0.0100
Episode:2462 meanR:68.7600 R:57.0000 loss:1300.2040 exploreP:0.0100
Episode:2463 meanR:68.8900 R:71.0000 loss:1165.5155 exploreP:0.0100
Episode:2464 meanR:68.9600 R:77.0000 loss:1341.5100 exploreP:0.0100
Episode:2465 meanR:68.8000 R:62.0000 loss:1084.4

Episode:2572 meanR:68.6700 R:74.0000 loss:1421.6882 exploreP:0.0100
Episode:2573 meanR:68.5300 R:60.0000 loss:1484.7147 exploreP:0.0100
Episode:2574 meanR:68.6700 R:70.0000 loss:1134.8495 exploreP:0.0100
Episode:2575 meanR:68.7800 R:78.0000 loss:1359.4066 exploreP:0.0100
Episode:2576 meanR:68.6600 R:53.0000 loss:1310.8918 exploreP:0.0100
Episode:2577 meanR:68.5900 R:78.0000 loss:1279.0405 exploreP:0.0100
Episode:2578 meanR:68.4700 R:61.0000 loss:1394.6052 exploreP:0.0100
Episode:2579 meanR:68.3800 R:58.0000 loss:1151.6598 exploreP:0.0100
Episode:2580 meanR:68.4800 R:65.0000 loss:1342.4751 exploreP:0.0100
Episode:2581 meanR:68.2700 R:57.0000 loss:1410.6807 exploreP:0.0100
Episode:2582 meanR:68.0800 R:68.0000 loss:1359.6736 exploreP:0.0100
Episode:2583 meanR:68.0700 R:62.0000 loss:1399.2268 exploreP:0.0100
Episode:2584 meanR:68.1100 R:84.0000 loss:1338.9969 exploreP:0.0100
Episode:2585 meanR:68.1600 R:74.0000 loss:1360.7584 exploreP:0.0100
Episode:2586 meanR:68.1500 R:88.0000 loss:1317.6

Episode:2693 meanR:71.0100 R:65.0000 loss:1404.3835 exploreP:0.0100
Episode:2694 meanR:70.9500 R:64.0000 loss:1412.3896 exploreP:0.0100
Episode:2695 meanR:70.9300 R:60.0000 loss:1402.5762 exploreP:0.0100
Episode:2696 meanR:70.9300 R:73.0000 loss:1369.2366 exploreP:0.0100
Episode:2697 meanR:70.9700 R:73.0000 loss:1389.3893 exploreP:0.0100
Episode:2698 meanR:70.8800 R:69.0000 loss:1341.5852 exploreP:0.0100
Episode:2699 meanR:70.7000 R:61.0000 loss:1344.0896 exploreP:0.0100
Episode:2700 meanR:70.8100 R:87.0000 loss:1427.7195 exploreP:0.0100
Episode:2701 meanR:70.8000 R:79.0000 loss:1357.8555 exploreP:0.0100
Episode:2702 meanR:70.6500 R:66.0000 loss:1296.2672 exploreP:0.0100
Episode:2703 meanR:70.6900 R:70.0000 loss:1294.5527 exploreP:0.0100
Episode:2704 meanR:70.6500 R:68.0000 loss:1418.4320 exploreP:0.0100
Episode:2705 meanR:70.4600 R:56.0000 loss:1314.7604 exploreP:0.0100
Episode:2706 meanR:70.5100 R:71.0000 loss:1330.3169 exploreP:0.0100
Episode:2707 meanR:70.5800 R:70.0000 loss:1345.9

Episode:2814 meanR:68.9700 R:87.0000 loss:1234.0306 exploreP:0.0100
Episode:2815 meanR:68.8200 R:63.0000 loss:1306.2808 exploreP:0.0100
Episode:2816 meanR:68.9600 R:73.0000 loss:1228.6122 exploreP:0.0100
Episode:2817 meanR:68.9800 R:79.0000 loss:1313.6716 exploreP:0.0100
Episode:2818 meanR:68.8300 R:60.0000 loss:1308.4830 exploreP:0.0100
Episode:2819 meanR:68.7400 R:72.0000 loss:1227.1331 exploreP:0.0100
Episode:2820 meanR:68.6200 R:56.0000 loss:1471.5725 exploreP:0.0100
Episode:2821 meanR:68.7300 R:70.0000 loss:1247.7626 exploreP:0.0100
Episode:2822 meanR:68.4900 R:63.0000 loss:1371.2366 exploreP:0.0100
Episode:2823 meanR:68.4000 R:54.0000 loss:1322.0659 exploreP:0.0100
Episode:2824 meanR:68.5100 R:68.0000 loss:1355.5358 exploreP:0.0100
Episode:2825 meanR:68.6700 R:81.0000 loss:1410.7240 exploreP:0.0100
Episode:2826 meanR:68.5700 R:67.0000 loss:1395.8507 exploreP:0.0100
Episode:2827 meanR:68.7800 R:80.0000 loss:1313.1799 exploreP:0.0100
Episode:2828 meanR:68.8100 R:75.0000 loss:1346.5

Episode:2935 meanR:67.6700 R:84.0000 loss:1288.6713 exploreP:0.0100
Episode:2936 meanR:67.6700 R:68.0000 loss:1242.6017 exploreP:0.0100
Episode:2937 meanR:67.5600 R:61.0000 loss:1379.9487 exploreP:0.0100
Episode:2938 meanR:67.9200 R:90.0000 loss:1381.2158 exploreP:0.0100
Episode:2939 meanR:67.9000 R:68.0000 loss:1287.0471 exploreP:0.0100
Episode:2940 meanR:67.9900 R:66.0000 loss:1257.2029 exploreP:0.0100
Episode:2941 meanR:68.0100 R:77.0000 loss:1326.8788 exploreP:0.0100
Episode:2942 meanR:67.8100 R:57.0000 loss:1380.2664 exploreP:0.0100
Episode:2943 meanR:67.9500 R:78.0000 loss:1216.4912 exploreP:0.0100
Episode:2944 meanR:67.8100 R:57.0000 loss:1376.6766 exploreP:0.0100
Episode:2945 meanR:67.9300 R:66.0000 loss:1308.0449 exploreP:0.0100
Episode:2946 meanR:67.8800 R:72.0000 loss:1350.5895 exploreP:0.0100
Episode:2947 meanR:68.0200 R:84.0000 loss:1402.1881 exploreP:0.0100
Episode:2948 meanR:68.2000 R:76.0000 loss:1358.8169 exploreP:0.0100
Episode:2949 meanR:68.1900 R:52.0000 loss:1294.6

# Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')

## Testing

Let's checkout how our trained agent plays the game.

In [184]:
import gym

# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model-seq.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    initial_state = sess.run(model.initial_state) # Qs or current batch or states[:-1]
    
    # Episode/epoch
    for _ in range(1):
        state = env.reset()
        total_reward = 0
        
        # Steps/batches
        while True:
            env.render()
            action_logits, initial_state = sess.run([model.actions_logits, model.final_state],
                                                    feed_dict = {model.states: state.reshape([1, -1]), 
                                                                 model.initial_state: initial_state})
            action = np.argmax(action_logits)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break
        # At the end of each episode
        print('total_reward:{}'.format(total_reward))

# Close the env
env.close()

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt




total_reward:120.0


## Extending this

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.