# Sequential DQN

In this notebook, we'll build a neural network that can learn to play games through reinforcement learning. More specifically, we'll use Q-learning to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [11]:
import gym
import numpy as np

In [12]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


>**Note:** Make sure you have OpenAI Gym cloned into the same directory with this notebook. I've included `gym` as a submodule, so you can run `git submodule --init --recursive` to pull the contents into the `gym` repo.

>**Note:** Make sure you have OpenAI Gym cloned. Then run this command `pip install -e gym/[all]`.

In [13]:
import gym

# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m




We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing in an action as an integer to `env.step` will generate the next step in the simulation.  You can see how many actions are possible from `env.action_space` and to get a random action you can use `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to watch the simulation run.

In [14]:
env.reset()
for _ in range(10):
    # env.render()
    action = env.action_space.sample()
    state, reward, done, info = env.step(action) # take a random action
    print('state, action, reward, done, info:', state, action, reward, done, info)
    if done:
        env.reset()

state, action, reward, done, info: [-0.04901964  0.23559912  0.04938931 -0.27629412] 1 1.0 False {}
state, action, reward, done, info: [-0.04430766  0.0398086   0.04386343  0.03154809] 0 1.0 False {}
state, action, reward, done, info: [-0.04351149 -0.15591403  0.04449439  0.33774137] 0 1.0 False {}
state, action, reward, done, info: [-0.04662977  0.03854745  0.05124922  0.05941488] 1 1.0 False {}
state, action, reward, done, info: [-0.04585882 -0.15727043  0.05243752  0.36781672] 0 1.0 False {}
state, action, reward, done, info: [-0.04900423 -0.35309675  0.05979385  0.67656202] 0 1.0 False {}
state, action, reward, done, info: [-0.05606616 -0.54899632  0.07332509  0.98745514] 0 1.0 False {}
state, action, reward, done, info: [-0.06704609 -0.74501942  0.0930742   1.30223797] 0 1.0 False {}
state, action, reward, done, info: [-0.08194648 -0.94119075  0.11911895  1.62254565] 0 1.0 False {}
state, action, reward, done, info: [-0.10077029 -0.74765578  0.15156987  1.36923858] 1 1.0 False {}


To shut the window showing the simulation, use `env.close()`.

If you ran the simulation above, we can look at the rewards:

In [15]:
# print(rewards[-20:])
# print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
# print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
# print(np.max(np.array(actions)), np.min(np.array(actions)))
# print((np.max(np.array(actions)) - np.min(np.array(actions)))+1)
# print(np.max(np.array(rewards)), np.min(np.array(rewards)))
# print(np.max(np.array(states)), np.min(np.array(states)))

The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

## Q-Network

We train our Q-learning agent using the Bellman Equation:

$$
Q(s, a) = r + \gamma \max{Q(s', a')}
$$

where $s$ is a state, $a$ is an action, and $s'$ is the next state from state $s$ and action $a$.

Before we used this equation to learn values for a Q-_table_. However, for this game there are a huge number of states available. The state has four values: the position and velocity of the cart, and the position and velocity of the pole. These are all real-valued numbers, so ignoring floating point precisions, you practically have infinite states. Instead of using a table then, we'll replace it with a neural network that will approximate the Q-table lookup function.

<img src="assets/deep-q-learning.png" width=450px>

Now, our Q value, $Q(s, a)$ is calculated by passing in a state to the network. The output will be Q-values for each available action, with fully connected hidden layers.

<img src="assets/q-network.png" width=550px>


As I showed before, we can define our targets for training as $\hat{Q}(s,a) = r + \gamma \max{Q(s', a')}$. Then we update the weights by minimizing $(\hat{Q}(s,a) - Q(s,a))^2$. 

For this Cart-Pole game, we have four inputs, one for each value in the state, and two outputs, one for each action. To get $\hat{Q}$, we'll first choose an action, then simulate the game using that action. This will get us the next state, $s'$, and the reward. With that, we can calculate $\hat{Q}$ then pass it back into the $Q$ network to run the optimizer and update the weights.

Below is my implementation of the Q-network. I used two fully connected layers with ReLU activations. Two seems to be good enough, three might be better. Feel free to try it out.

In [16]:
def model_input(state_size, lstm_size, batch_size=1):
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    # RNN
    gru = tf.nn.rnn_cell.GRUCell(lstm_size)
    cell = tf.nn.rnn_cell.MultiRNNCell([gru], state_is_tuple=False)
    initial_state = cell.zero_state(batch_size, tf.float32)
    return states, actions, targetQs, cell, initial_state

In [17]:
# RNN generator or sequence generator
def generator(states, initial_state, cell, lstm_size, num_classes, reuse=False): 
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        inputs = tf.layers.dense(inputs=states, units=lstm_size)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        # dynamic means adapt to the batch_size
        inputs_rnn = tf.reshape(inputs, [1, -1, lstm_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cell, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state.shape)
        outputs = tf.reshape(outputs_rnn, [-1, lstm_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=num_classes)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the action logits
        return logits, final_state

In [18]:
def model_loss(action_size, hidden_size, states, cell, initial_state, actions, targetQs):
    actions_logits, final_state = generator(states=states, cell=cell, initial_state=initial_state, 
                                            lstm_size=hidden_size, num_classes=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    Qs = tf.reduce_max(actions_logits*actions_labels, axis=1)
    loss = tf.reduce_mean(tf.square(Qs - targetQs))
    return actions_logits, final_state, loss

In [19]:
def model_opt(loss, learning_rate):
    """
    Get optimization operations in order
    :param loss: Generator loss Tensor for action prediction
    :param learning_rate: Learning Rate Placeholder
    :return: A tuple of (qfunction training, generator training, discriminator training)
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]

    # # Optimize
    # with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
    # #opt = tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=g_vars)

    #grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(loss, g_vars), clip_norm=5) # usually around 1-5
    grads = tf.gradients(loss, g_vars)
    opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(grads, g_vars))

    return opt

In [20]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, cell, self.initial_state = model_input(
            state_size=state_size, lstm_size=hidden_size)
        
        # Create the Model: calculating the loss and forwad pass
        self.actions_logits, self.final_state, self.loss = model_loss(
            action_size=action_size, hidden_size=hidden_size, 
            states=self.states, actions=self.actions, 
            targetQs=self.targetQs, cell=cell, initial_state=self.initial_state)

        # Update the model: backward pass and backprop
        self.opt = model_opt(loss=self.loss, learning_rate=learning_rate)

## Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on. 

Here, we'll create a `Memory` object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maxmium capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a `Memory` object. If you're unfamiliar with `deque`, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.

In [21]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
        self.states = deque(maxlen=max_size)
#     def sample(self, batch_size):
#         idx = np.random.choice(np.arange(len(self.buffer)), 
#                                size=batch_size, 
#                                replace=False)
#         return [self.buffer[ii] for ii in idx], [self.states[ii] for ii in idx]

## Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon).  That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an **$\epsilon$-greedy policy**.


At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called _exploitation_. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.

## Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in _episodes_. One *episode* is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames. So we can start a new episode once meeting that goal. The game ends if the pole tilts over too far, or if the cart moves too far the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

* Initialize the memory $D$
* Initialize the action-value network $Q$ with random weights
* **For** episode = 1, $M$ **do**
  * **For** $t$, $T$ **do**
     * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s,a)$
     * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
     * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
     * Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
     * Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
     * Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
  * **endfor**
* **endfor**

## Hyperparameters

One of the more difficult aspects of reinforcememt learning are the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.

In [22]:
# print('state:', np.array(states).shape[1], 
#       'action size: {}'.format((np.max(np.array(actions)) - np.min(np.array(actions)))+1))

In [23]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
action_size = 2
state_size = 4
hidden_size = 64               # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 128            # memory capacity - 1000 DQN
batch_size = 128             # experience mini-batch size - 20 DQN
gamma = 0.99                 # future reward discount

In [24]:
# Reset/init the graph/session
graph = tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=batch_size)

(?, 4) (?, 64)
(1, ?, 64) (1, 64)
(1, ?, 64) (1, 64)
(?, 64)
(?, 2)


## Populate the memory (exprience memory)

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.

In [25]:
state = env.reset()
# initial_state = np.zeros([1, hidden_size])
# final_state = np.zeros([1, hidden_size])
for _ in range(batch_size):
    action = env.action_space.sample()
    next_state, reward, done, _ = env.step(action)
    memory.buffer.append([state, action, next_state, reward, float(done)])
    #memory.states.append([initial_state, final_state])
    state = next_state
    if done is True:
        state = env.reset()

## Training the model

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [26]:
# memory.buffer[0], memory.states[0]

In [27]:
# states, rewards, actions

In [None]:
saver = tf.train.Saver()
episode_rewards_list, rewards_list, loss_list = [], [], []

# TF session for training
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    total_step = 0 # Explore or exploit parameter
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        loss_batch = []
        state = env.reset()
        initial_state = sess.run(model.initial_state)

        # Training steps/batches
        while True:
            action_logits, final_state = sess.run([model.actions_logits, model.final_state],
                                                  feed_dict = {model.states: state.reshape([1, -1]), 
                                                               model.initial_state: initial_state})
            # Explore (Env) or Exploit (Model)
            total_step += 1
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            if explore_p > np.random.rand():
                action = env.action_space.sample()
            else:
                action = np.argmax(action_logits)
            next_state, reward, done, _ = env.step(action)
            memory.buffer.append([state, action, next_state, reward, float(done)])
            memory.states.append([initial_state, final_state])
            total_reward += reward
            initial_state = final_state
            state = next_state

            # Training
            #batch, rnn_states = memory.sample(batch_size)
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            rnn_states = memory.states
            initial_states = np.array([each[0] for each in rnn_states])
            final_states = np.array([each[1] for each in rnn_states])
            next_actions_logits = sess.run(model.actions_logits, 
                                           feed_dict = {model.states: next_states, 
                                                        model.initial_state: final_states[0].reshape([1, -1])})
            nextQs = np.max(next_actions_logits, axis=1) * (1-dones)
            targetQs = rewards + (gamma * nextQs)
            loss, _ = sess.run([model.loss, model.opt], feed_dict = {model.states: states, 
                                                                     model.actions: actions,
                                                                     model.targetQs: targetQs,
                                                        model.initial_state: initial_states[0].reshape([1, -1])})
            loss_batch.append(loss)
            if done is True:
                break
                
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{}'.format(total_reward),
              'loss:{:.4f}'.format(np.mean(loss_batch)),
              'exploreP:{:.4f}'.format(explore_p))
        # Ploting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        loss_list.append([ep, np.mean(loss_batch)])
        # Break episode/epoch loop
        if np.mean(episode_reward) >= 500:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:58.0000 R:58.0 loss:0.9462 exploreP:0.9943
Episode:1 meanR:37.5000 R:17.0 loss:0.8271 exploreP:0.9926
Episode:2 meanR:30.3333 R:16.0 loss:0.7100 exploreP:0.9910
Episode:3 meanR:32.0000 R:37.0 loss:1.0007 exploreP:0.9874
Episode:4 meanR:30.8000 R:26.0 loss:1.4869 exploreP:0.9849
Episode:5 meanR:28.1667 R:15.0 loss:1.5626 exploreP:0.9834
Episode:6 meanR:25.5714 R:10.0 loss:1.8660 exploreP:0.9824
Episode:7 meanR:25.0000 R:21.0 loss:1.8378 exploreP:0.9804
Episode:8 meanR:23.5556 R:12.0 loss:2.1669 exploreP:0.9792
Episode:9 meanR:23.5000 R:23.0 loss:2.3497 exploreP:0.9770
Episode:10 meanR:24.0000 R:29.0 loss:2.6005 exploreP:0.9742
Episode:11 meanR:23.4167 R:17.0 loss:2.6360 exploreP:0.9726
Episode:12 meanR:22.9231 R:17.0 loss:3.1568 exploreP:0.9709
Episode:13 meanR:23.5000 R:31.0 loss:3.0180 exploreP:0.9680
Episode:14 meanR:23.0000 R:16.0 loss:2.6628 exploreP:0.9664
Episode:15 meanR:23.5000 R:31.0 loss:3.0337 exploreP:0.9635
Episode:16 meanR:22.8235 R:12.0 loss:3.2760 explor

Episode:136 meanR:20.0900 R:19.0 loss:6.5445 exploreP:0.7510
Episode:137 meanR:20.1600 R:16.0 loss:8.9732 exploreP:0.7498
Episode:138 meanR:19.9900 R:11.0 loss:11.4522 exploreP:0.7490
Episode:139 meanR:19.8700 R:13.0 loss:14.0245 exploreP:0.7480
Episode:140 meanR:19.8300 R:18.0 loss:16.2413 exploreP:0.7467
Episode:141 meanR:19.9100 R:24.0 loss:15.9009 exploreP:0.7450
Episode:142 meanR:19.8400 R:11.0 loss:16.5975 exploreP:0.7441
Episode:143 meanR:19.8800 R:21.0 loss:18.1536 exploreP:0.7426
Episode:144 meanR:20.1400 R:39.0 loss:16.2453 exploreP:0.7398
Episode:145 meanR:20.0500 R:13.0 loss:13.8283 exploreP:0.7388
Episode:146 meanR:20.3100 R:37.0 loss:13.5123 exploreP:0.7361
Episode:147 meanR:20.3900 R:30.0 loss:13.1228 exploreP:0.7339
Episode:148 meanR:20.6300 R:35.0 loss:11.8227 exploreP:0.7314
Episode:149 meanR:20.8900 R:39.0 loss:8.7767 exploreP:0.7286
Episode:150 meanR:20.8200 R:23.0 loss:9.1883 exploreP:0.7269
Episode:151 meanR:21.0200 R:42.0 loss:8.8770 exploreP:0.7239
Episode:152 m

Episode:269 meanR:34.7800 R:104.0 loss:8.2292 exploreP:0.4901
Episode:270 meanR:34.9500 R:29.0 loss:14.2891 exploreP:0.4887
Episode:271 meanR:34.8500 R:19.0 loss:13.6637 exploreP:0.4878
Episode:272 meanR:35.1300 R:37.0 loss:17.3314 exploreP:0.4860
Episode:273 meanR:35.2300 R:19.0 loss:23.8897 exploreP:0.4851
Episode:274 meanR:36.2900 R:115.0 loss:14.2140 exploreP:0.4797
Episode:275 meanR:37.0400 R:86.0 loss:8.1489 exploreP:0.4756
Episode:276 meanR:37.4200 R:53.0 loss:10.5380 exploreP:0.4732
Episode:277 meanR:37.3300 R:26.0 loss:11.8645 exploreP:0.4720
Episode:278 meanR:37.3200 R:19.0 loss:16.5027 exploreP:0.4711
Episode:279 meanR:37.4900 R:29.0 loss:20.8529 exploreP:0.4698
Episode:280 meanR:38.0200 R:62.0 loss:20.6774 exploreP:0.4669
Episode:281 meanR:38.6600 R:75.0 loss:14.6266 exploreP:0.4635
Episode:282 meanR:38.9400 R:52.0 loss:13.9695 exploreP:0.4612
Episode:283 meanR:40.4400 R:164.0 loss:8.4188 exploreP:0.4538
Episode:284 meanR:41.8100 R:153.0 loss:7.6141 exploreP:0.4471
Episode:

Episode:401 meanR:102.1700 R:211.0 loss:2.8167 exploreP:0.1509
Episode:402 meanR:103.8000 R:205.0 loss:7.3129 exploreP:0.1480
Episode:403 meanR:107.2700 R:500.0 loss:2.9062 exploreP:0.1413
Episode:404 meanR:106.3600 R:53.0 loss:21.4660 exploreP:0.1406
Episode:405 meanR:105.8600 R:61.0 loss:40.1876 exploreP:0.1398
Episode:406 meanR:106.0800 R:174.0 loss:17.6571 exploreP:0.1376
Episode:407 meanR:108.4000 R:254.0 loss:8.2966 exploreP:0.1344
Episode:408 meanR:110.9700 R:282.0 loss:3.0053 exploreP:0.1309
Episode:409 meanR:112.9100 R:297.0 loss:9.1442 exploreP:0.1274
Episode:410 meanR:113.0300 R:107.0 loss:0.9727 exploreP:0.1261
Episode:411 meanR:113.7000 R:126.0 loss:0.9811 exploreP:0.1247
Episode:412 meanR:114.1500 R:127.0 loss:1.7368 exploreP:0.1232
Episode:413 meanR:114.5100 R:114.0 loss:2.1604 exploreP:0.1219
Episode:414 meanR:117.6300 R:367.0 loss:1.2482 exploreP:0.1179
Episode:415 meanR:118.6800 R:212.0 loss:0.4932 exploreP:0.1156
Episode:416 meanR:118.7400 R:133.0 loss:2.7666 explore

Episode:531 meanR:336.4600 R:500.0 loss:15.8164 exploreP:0.0131
Episode:532 meanR:340.7100 R:500.0 loss:38.8289 exploreP:0.0130
Episode:533 meanR:345.1600 R:500.0 loss:15.4725 exploreP:0.0128
Episode:534 meanR:346.7800 R:500.0 loss:15.2441 exploreP:0.0127
Episode:535 meanR:351.1100 R:500.0 loss:15.2040 exploreP:0.0126
Episode:536 meanR:354.8700 R:500.0 loss:15.0182 exploreP:0.0124
Episode:537 meanR:356.5400 R:500.0 loss:14.5703 exploreP:0.0123
Episode:538 meanR:356.6800 R:500.0 loss:15.3825 exploreP:0.0122
Episode:539 meanR:360.1200 R:500.0 loss:14.6122 exploreP:0.0121
Episode:540 meanR:363.9100 R:500.0 loss:14.2150 exploreP:0.0120
Episode:541 meanR:363.9100 R:500.0 loss:36.8205 exploreP:0.0119
Episode:542 meanR:366.9600 R:500.0 loss:87.9234 exploreP:0.0118
Episode:543 meanR:369.5900 R:500.0 loss:17.1805 exploreP:0.0117
Episode:544 meanR:372.0800 R:500.0 loss:16.3077 exploreP:0.0116
Episode:545 meanR:375.6600 R:500.0 loss:15.5431 exploreP:0.0116
Episode:546 meanR:379.0500 R:500.0 loss:

Episode:660 meanR:472.9700 R:415.0 loss:14.6216 exploreP:0.0100
Episode:661 meanR:469.6400 R:167.0 loss:9.7130 exploreP:0.0100
Episode:662 meanR:466.2000 R:156.0 loss:50.3304 exploreP:0.0100
Episode:663 meanR:463.5900 R:239.0 loss:15.6732 exploreP:0.0100
Episode:664 meanR:460.7900 R:220.0 loss:24.8833 exploreP:0.0100
Episode:665 meanR:460.7900 R:500.0 loss:6.2416 exploreP:0.0100
Episode:666 meanR:457.5100 R:172.0 loss:47.8239 exploreP:0.0100
Episode:667 meanR:454.1400 R:163.0 loss:19.4070 exploreP:0.0100
Episode:668 meanR:450.5600 R:142.0 loss:11.3901 exploreP:0.0100
Episode:669 meanR:450.5600 R:500.0 loss:9.2879 exploreP:0.0100
Episode:670 meanR:450.5600 R:500.0 loss:15.5256 exploreP:0.0100
Episode:671 meanR:447.0000 R:144.0 loss:49.1424 exploreP:0.0100
Episode:672 meanR:443.4200 R:142.0 loss:41.2381 exploreP:0.0100
Episode:673 meanR:439.2700 R:85.0 loss:54.9987 exploreP:0.0100
Episode:674 meanR:435.1200 R:85.0 loss:64.8811 exploreP:0.0100
Episode:675 meanR:431.4300 R:131.0 loss:31.29

Episode:790 meanR:167.5000 R:334.0 loss:134.6154 exploreP:0.0100
Episode:791 meanR:170.0400 R:500.0 loss:35.2718 exploreP:0.0100
Episode:792 meanR:169.8700 R:430.0 loss:24.2056 exploreP:0.0100
Episode:793 meanR:167.9200 R:256.0 loss:14.4269 exploreP:0.0100
Episode:794 meanR:168.6400 R:463.0 loss:3.6451 exploreP:0.0100
Episode:795 meanR:171.8300 R:500.0 loss:173.7887 exploreP:0.0100
Episode:796 meanR:171.0300 R:401.0 loss:29.2262 exploreP:0.0100
Episode:797 meanR:169.8200 R:379.0 loss:142.6872 exploreP:0.0100
Episode:798 meanR:167.3400 R:162.0 loss:341.4327 exploreP:0.0100
Episode:799 meanR:166.1300 R:355.0 loss:4.0659 exploreP:0.0100
Episode:800 meanR:165.9000 R:451.0 loss:98.6209 exploreP:0.0100
Episode:801 meanR:167.1900 R:395.0 loss:112.6389 exploreP:0.0100
Episode:802 meanR:169.7500 R:463.0 loss:85.6366 exploreP:0.0100
Episode:803 meanR:171.6700 R:471.0 loss:122.9507 exploreP:0.0100
Episode:804 meanR:173.7400 R:473.0 loss:55.3427 exploreP:0.0100
Episode:805 meanR:174.6600 R:335.0 l

Episode:919 meanR:418.1300 R:500.0 loss:16.1239 exploreP:0.0100
Episode:920 meanR:420.4700 R:500.0 loss:15.3049 exploreP:0.0100
Episode:921 meanR:420.4700 R:500.0 loss:12.0236 exploreP:0.0100
Episode:922 meanR:423.7800 R:500.0 loss:8.9243 exploreP:0.0100
Episode:923 meanR:423.7800 R:500.0 loss:12.2565 exploreP:0.0100
Episode:924 meanR:423.7800 R:500.0 loss:13.4906 exploreP:0.0100
Episode:925 meanR:423.7800 R:500.0 loss:21.4998 exploreP:0.0100
Episode:926 meanR:425.1700 R:500.0 loss:14.0142 exploreP:0.0100
Episode:927 meanR:425.1700 R:500.0 loss:16.3467 exploreP:0.0100
Episode:928 meanR:427.5100 R:500.0 loss:18.4651 exploreP:0.0100
Episode:929 meanR:429.5100 R:500.0 loss:20.9044 exploreP:0.0100
Episode:930 meanR:426.8700 R:236.0 loss:25.9187 exploreP:0.0100
Episode:931 meanR:426.0900 R:117.0 loss:49.6486 exploreP:0.0100
Episode:932 meanR:423.4700 R:129.0 loss:8.2962 exploreP:0.0100
Episode:933 meanR:419.5800 R:111.0 loss:3.6849 exploreP:0.0100
Episode:934 meanR:418.7200 R:110.0 loss:2.8

Episode:1052 meanR:17.1600 R:449.0 loss:17.4113 exploreP:0.0100
Episode:1053 meanR:22.0500 R:500.0 loss:1.0253 exploreP:0.0100
Episode:1054 meanR:25.5100 R:356.0 loss:27.0155 exploreP:0.0100
Episode:1055 meanR:30.4100 R:500.0 loss:16.4463 exploreP:0.0100
Episode:1056 meanR:35.3200 R:500.0 loss:10.4982 exploreP:0.0100
Episode:1057 meanR:40.2200 R:500.0 loss:15.3005 exploreP:0.0100
Episode:1058 meanR:45.1400 R:500.0 loss:22.5423 exploreP:0.0100
Episode:1059 meanR:50.0500 R:500.0 loss:11.4587 exploreP:0.0100
Episode:1060 meanR:52.1000 R:214.0 loss:46.9470 exploreP:0.0100
Episode:1061 meanR:56.2100 R:420.0 loss:3.4426 exploreP:0.0100
Episode:1062 meanR:56.2800 R:17.0 loss:59.3662 exploreP:0.0100
Episode:1063 meanR:56.3700 R:17.0 loss:123.5584 exploreP:0.0100
Episode:1064 meanR:56.5100 R:23.0 loss:133.1488 exploreP:0.0100
Episode:1065 meanR:57.3700 R:95.0 loss:61.5858 exploreP:0.0100
Episode:1066 meanR:58.4100 R:113.0 loss:6.7759 exploreP:0.0100
Episode:1067 meanR:63.3300 R:500.0 loss:3.730

Episode:1181 meanR:127.2700 R:199.0 loss:12.0141 exploreP:0.0100
Episode:1182 meanR:132.1600 R:500.0 loss:1.2599 exploreP:0.0100
Episode:1183 meanR:137.0500 R:500.0 loss:15.8626 exploreP:0.0100
Episode:1184 meanR:141.9500 R:500.0 loss:10.5004 exploreP:0.0100
Episode:1185 meanR:146.8600 R:500.0 loss:15.9934 exploreP:0.0100
Episode:1186 meanR:151.7600 R:500.0 loss:5.5866 exploreP:0.0100
Episode:1187 meanR:155.9700 R:436.0 loss:14.6699 exploreP:0.0100
Episode:1188 meanR:160.4900 R:500.0 loss:1.5453 exploreP:0.0100
Episode:1189 meanR:160.4900 R:500.0 loss:9.7494 exploreP:0.0100
Episode:1190 meanR:158.5800 R:309.0 loss:12.6675 exploreP:0.0100
Episode:1191 meanR:155.0300 R:29.0 loss:58.3822 exploreP:0.0100
Episode:1192 meanR:153.2600 R:160.0 loss:39.6577 exploreP:0.0100
Episode:1193 meanR:155.2600 R:500.0 loss:5.2878 exploreP:0.0100
Episode:1194 meanR:157.0700 R:500.0 loss:13.2823 exploreP:0.0100
Episode:1195 meanR:159.9000 R:500.0 loss:13.7323 exploreP:0.0100
Episode:1196 meanR:162.3800 R:5

Episode:1308 meanR:444.8000 R:500.0 loss:13.3067 exploreP:0.0100
Episode:1309 meanR:444.8000 R:500.0 loss:12.1382 exploreP:0.0100
Episode:1310 meanR:443.1400 R:334.0 loss:16.4714 exploreP:0.0100
Episode:1311 meanR:438.2700 R:13.0 loss:44.7333 exploreP:0.0100
Episode:1312 meanR:438.2700 R:500.0 loss:11.8982 exploreP:0.0100
Episode:1313 meanR:438.2700 R:500.0 loss:14.4449 exploreP:0.0100
Episode:1314 meanR:438.2700 R:500.0 loss:12.1930 exploreP:0.0100
Episode:1315 meanR:438.2700 R:500.0 loss:12.8848 exploreP:0.0100
Episode:1316 meanR:438.2700 R:500.0 loss:5.3949 exploreP:0.0100
Episode:1317 meanR:438.2700 R:500.0 loss:10.3296 exploreP:0.0100
Episode:1318 meanR:438.2700 R:500.0 loss:11.5736 exploreP:0.0100
Episode:1319 meanR:438.2700 R:500.0 loss:13.0352 exploreP:0.0100
Episode:1320 meanR:438.2700 R:500.0 loss:13.9130 exploreP:0.0100
Episode:1321 meanR:438.2700 R:500.0 loss:11.5907 exploreP:0.0100
Episode:1322 meanR:438.2700 R:500.0 loss:7.4743 exploreP:0.0100
Episode:1323 meanR:438.2700 

Episode:1435 meanR:422.5400 R:11.0 loss:67.4631 exploreP:0.0100
Episode:1436 meanR:417.6500 R:11.0 loss:134.4856 exploreP:0.0100
Episode:1437 meanR:414.9400 R:229.0 loss:60.3820 exploreP:0.0100
Episode:1438 meanR:414.9400 R:500.0 loss:10.8643 exploreP:0.0100
Episode:1439 meanR:414.9400 R:500.0 loss:16.1206 exploreP:0.0100
Episode:1440 meanR:414.9400 R:500.0 loss:14.4403 exploreP:0.0100
Episode:1441 meanR:414.9400 R:500.0 loss:14.2354 exploreP:0.0100
Episode:1442 meanR:414.9400 R:500.0 loss:14.1138 exploreP:0.0100
Episode:1443 meanR:414.9400 R:500.0 loss:13.9561 exploreP:0.0100
Episode:1444 meanR:414.9400 R:500.0 loss:14.2262 exploreP:0.0100
Episode:1445 meanR:410.1000 R:16.0 loss:72.4081 exploreP:0.0100
Episode:1446 meanR:405.2500 R:15.0 loss:137.9974 exploreP:0.0100
Episode:1447 meanR:400.4000 R:15.0 loss:191.3415 exploreP:0.0100
Episode:1448 meanR:395.5300 R:13.0 loss:201.9689 exploreP:0.0100
Episode:1449 meanR:390.6400 R:11.0 loss:193.2248 exploreP:0.0100
Episode:1450 meanR:385.7800

Episode:1562 meanR:414.9500 R:128.0 loss:0.6726 exploreP:0.0100
Episode:1563 meanR:411.0900 R:114.0 loss:0.5155 exploreP:0.0100
Episode:1564 meanR:407.1700 R:108.0 loss:0.7925 exploreP:0.0100
Episode:1565 meanR:403.2800 R:111.0 loss:2.2318 exploreP:0.0100
Episode:1566 meanR:398.4700 R:19.0 loss:5.2094 exploreP:0.0100
Episode:1567 meanR:393.6300 R:16.0 loss:32.3487 exploreP:0.0100
Episode:1568 meanR:389.9300 R:130.0 loss:10.7603 exploreP:0.0100
Episode:1569 meanR:386.2300 R:130.0 loss:1.9683 exploreP:0.0100
Episode:1570 meanR:382.6200 R:139.0 loss:1.0867 exploreP:0.0100
Episode:1571 meanR:379.0200 R:140.0 loss:1.6946 exploreP:0.0100
Episode:1572 meanR:377.9300 R:111.0 loss:0.8561 exploreP:0.0100
Episode:1573 meanR:374.0100 R:108.0 loss:0.7097 exploreP:0.0100
Episode:1574 meanR:370.0600 R:105.0 loss:0.6815 exploreP:0.0100
Episode:1575 meanR:366.1300 R:107.0 loss:1.6092 exploreP:0.0100
Episode:1576 meanR:362.1600 R:103.0 loss:1.0706 exploreP:0.0100
Episode:1577 meanR:358.3000 R:114.0 loss

Episode:1689 meanR:459.7100 R:500.0 loss:16.3830 exploreP:0.0100
Episode:1690 meanR:463.7100 R:500.0 loss:16.5003 exploreP:0.0100
Episode:1691 meanR:466.6200 R:500.0 loss:14.4929 exploreP:0.0100
Episode:1692 meanR:466.6200 R:500.0 loss:14.8645 exploreP:0.0100
Episode:1693 meanR:466.6200 R:500.0 loss:13.4066 exploreP:0.0100
Episode:1694 meanR:466.7500 R:500.0 loss:17.5101 exploreP:0.0100
Episode:1695 meanR:466.7500 R:500.0 loss:13.3433 exploreP:0.0100
Episode:1696 meanR:466.7500 R:500.0 loss:17.6578 exploreP:0.0100
Episode:1697 meanR:467.5300 R:500.0 loss:14.4285 exploreP:0.0100
Episode:1698 meanR:467.5300 R:500.0 loss:14.0961 exploreP:0.0100
Episode:1699 meanR:467.5300 R:500.0 loss:13.7307 exploreP:0.0100
Episode:1700 meanR:467.5300 R:500.0 loss:16.4939 exploreP:0.0100
Episode:1701 meanR:467.5300 R:500.0 loss:11.6359 exploreP:0.0100
Episode:1702 meanR:467.5300 R:500.0 loss:7.3861 exploreP:0.0100
Episode:1703 meanR:467.5300 R:500.0 loss:20.9928 exploreP:0.0100
Episode:1704 meanR:467.530

Episode:1816 meanR:413.8400 R:500.0 loss:16.4515 exploreP:0.0100
Episode:1817 meanR:413.8400 R:500.0 loss:14.8913 exploreP:0.0100
Episode:1818 meanR:413.8400 R:500.0 loss:17.1635 exploreP:0.0100
Episode:1819 meanR:413.8400 R:500.0 loss:16.1686 exploreP:0.0100
Episode:1820 meanR:413.8400 R:500.0 loss:15.0341 exploreP:0.0100
Episode:1821 meanR:413.8400 R:500.0 loss:15.3548 exploreP:0.0100
Episode:1822 meanR:413.8400 R:500.0 loss:12.7706 exploreP:0.0100
Episode:1823 meanR:413.8400 R:500.0 loss:17.5603 exploreP:0.0100
Episode:1824 meanR:413.8400 R:500.0 loss:15.8924 exploreP:0.0100
Episode:1825 meanR:413.8400 R:500.0 loss:18.0334 exploreP:0.0100
Episode:1826 meanR:413.8400 R:500.0 loss:20.9521 exploreP:0.0100
Episode:1827 meanR:418.7200 R:500.0 loss:8.5292 exploreP:0.0100
Episode:1828 meanR:423.6100 R:500.0 loss:20.2746 exploreP:0.0100
Episode:1829 meanR:427.1400 R:365.0 loss:25.4484 exploreP:0.0100
Episode:1830 meanR:422.9300 R:79.0 loss:46.4983 exploreP:0.0100
Episode:1831 meanR:418.7800

Episode:1943 meanR:323.5200 R:500.0 loss:15.8937 exploreP:0.0100
Episode:1944 meanR:326.7200 R:500.0 loss:14.2384 exploreP:0.0100
Episode:1945 meanR:325.0400 R:11.0 loss:80.4388 exploreP:0.0100
Episode:1946 meanR:328.1600 R:500.0 loss:18.2432 exploreP:0.0100
Episode:1947 meanR:331.0900 R:500.0 loss:17.2927 exploreP:0.0100
Episode:1948 meanR:334.1600 R:500.0 loss:13.3785 exploreP:0.0100
Episode:1949 meanR:337.0600 R:500.0 loss:14.4558 exploreP:0.0100
Episode:1950 meanR:340.0500 R:500.0 loss:20.1647 exploreP:0.0100
Episode:1951 meanR:343.1600 R:500.0 loss:14.8843 exploreP:0.0100
Episode:1952 meanR:341.2500 R:11.0 loss:84.5469 exploreP:0.0100
Episode:1953 meanR:344.2000 R:500.0 loss:21.9539 exploreP:0.0100
Episode:1954 meanR:347.0000 R:500.0 loss:16.6641 exploreP:0.0100
Episode:1955 meanR:350.2600 R:500.0 loss:15.1872 exploreP:0.0100
Episode:1956 meanR:353.5200 R:500.0 loss:15.9512 exploreP:0.0100
Episode:1957 meanR:356.6300 R:500.0 loss:20.4703 exploreP:0.0100
Episode:1958 meanR:359.6000

Episode:2071 meanR:264.2400 R:500.0 loss:6.2732 exploreP:0.0100
Episode:2072 meanR:264.2400 R:500.0 loss:15.5054 exploreP:0.0100
Episode:2073 meanR:261.0600 R:182.0 loss:38.0336 exploreP:0.0100
Episode:2074 meanR:261.0600 R:500.0 loss:4.9188 exploreP:0.0100
Episode:2075 meanR:261.0600 R:500.0 loss:11.6680 exploreP:0.0100
Episode:2076 meanR:258.0700 R:201.0 loss:36.9955 exploreP:0.0100
Episode:2077 meanR:258.0700 R:500.0 loss:5.3695 exploreP:0.0100
Episode:2078 meanR:258.0700 R:500.0 loss:20.9295 exploreP:0.0100
Episode:2079 meanR:258.0700 R:500.0 loss:12.5578 exploreP:0.0100
Episode:2080 meanR:254.8600 R:179.0 loss:30.0460 exploreP:0.0100
Episode:2081 meanR:251.1900 R:133.0 loss:3.1819 exploreP:0.0100
Episode:2082 meanR:251.1900 R:500.0 loss:1.6345 exploreP:0.0100
Episode:2083 meanR:251.1900 R:500.0 loss:13.2164 exploreP:0.0100
Episode:2084 meanR:251.1900 R:500.0 loss:13.5536 exploreP:0.0100
Episode:2085 meanR:255.4800 R:500.0 loss:23.2846 exploreP:0.0100
Episode:2086 meanR:255.4800 R:

Episode:2199 meanR:291.2200 R:100.0 loss:4.0602 exploreP:0.0100
Episode:2200 meanR:290.6700 R:98.0 loss:2.4375 exploreP:0.0100
Episode:2201 meanR:290.0700 R:119.0 loss:2.4190 exploreP:0.0100
Episode:2202 meanR:289.2000 R:104.0 loss:6.8152 exploreP:0.0100
Episode:2203 meanR:285.5400 R:134.0 loss:2.1287 exploreP:0.0100
Episode:2204 meanR:281.5000 R:96.0 loss:10.1818 exploreP:0.0100
Episode:2205 meanR:277.7200 R:122.0 loss:2.4574 exploreP:0.0100
Episode:2206 meanR:274.3700 R:126.0 loss:3.8841 exploreP:0.0100
Episode:2207 meanR:272.7000 R:150.0 loss:0.7444 exploreP:0.0100
Episode:2208 meanR:271.8600 R:84.0 loss:3.0066 exploreP:0.0100
Episode:2209 meanR:271.4800 R:81.0 loss:4.1977 exploreP:0.0100
Episode:2210 meanR:271.8800 R:136.0 loss:2.8312 exploreP:0.0100
Episode:2211 meanR:273.6100 R:252.0 loss:0.7304 exploreP:0.0100
Episode:2212 meanR:277.7600 R:500.0 loss:0.4439 exploreP:0.0100
Episode:2213 meanR:282.1900 R:500.0 loss:12.7405 exploreP:0.0100
Episode:2214 meanR:285.9600 R:500.0 loss:7

Episode:2327 meanR:316.9100 R:229.0 loss:2.5306 exploreP:0.0100
Episode:2328 meanR:314.8200 R:291.0 loss:2.9185 exploreP:0.0100
Episode:2329 meanR:314.0500 R:423.0 loss:1.2277 exploreP:0.0100
Episode:2330 meanR:312.6700 R:362.0 loss:3.5129 exploreP:0.0100
Episode:2331 meanR:309.9700 R:230.0 loss:3.0038 exploreP:0.0100
Episode:2332 meanR:310.3900 R:220.0 loss:1.6231 exploreP:0.0100
Episode:2333 meanR:307.6600 R:227.0 loss:1.1267 exploreP:0.0100
Episode:2334 meanR:305.1200 R:246.0 loss:1.1090 exploreP:0.0100
Episode:2335 meanR:305.2700 R:500.0 loss:4.6113 exploreP:0.0100
Episode:2336 meanR:309.6400 R:500.0 loss:13.7174 exploreP:0.0100
Episode:2337 meanR:313.8700 R:500.0 loss:18.1945 exploreP:0.0100
Episode:2338 meanR:318.0600 R:500.0 loss:16.4521 exploreP:0.0100
Episode:2339 meanR:318.0600 R:500.0 loss:15.3434 exploreP:0.0100
Episode:2340 meanR:321.5800 R:500.0 loss:26.3055 exploreP:0.0100
Episode:2341 meanR:325.0700 R:500.0 loss:16.3934 exploreP:0.0100
Episode:2342 meanR:328.5000 R:500.

Episode:2455 meanR:380.9700 R:416.0 loss:20.3067 exploreP:0.0100
Episode:2456 meanR:383.7200 R:500.0 loss:7.2795 exploreP:0.0100
Episode:2457 meanR:385.4600 R:500.0 loss:20.3539 exploreP:0.0100
Episode:2458 meanR:385.8600 R:215.0 loss:13.9143 exploreP:0.0100
Episode:2459 meanR:381.9900 R:71.0 loss:14.7316 exploreP:0.0100
Episode:2460 meanR:381.9900 R:500.0 loss:7.7709 exploreP:0.0100
Episode:2461 meanR:384.0000 R:500.0 loss:18.1087 exploreP:0.0100
Episode:2462 meanR:386.3300 R:421.0 loss:28.3760 exploreP:0.0100
Episode:2463 meanR:389.3900 R:500.0 loss:12.0737 exploreP:0.0100
Episode:2464 meanR:392.5900 R:500.0 loss:21.8766 exploreP:0.0100
Episode:2465 meanR:395.1600 R:500.0 loss:20.5953 exploreP:0.0100
Episode:2466 meanR:395.1600 R:500.0 loss:20.4523 exploreP:0.0100
Episode:2467 meanR:390.4200 R:26.0 loss:65.7022 exploreP:0.0100
Episode:2468 meanR:387.2900 R:187.0 loss:58.1931 exploreP:0.0100
Episode:2469 meanR:382.4700 R:18.0 loss:52.5216 exploreP:0.0100
Episode:2470 meanR:378.4800 R:

Episode:2582 meanR:462.0400 R:500.0 loss:11.9258 exploreP:0.0100
Episode:2583 meanR:462.0400 R:500.0 loss:15.4051 exploreP:0.0100
Episode:2584 meanR:462.0400 R:500.0 loss:14.7463 exploreP:0.0100
Episode:2585 meanR:462.0400 R:500.0 loss:14.7255 exploreP:0.0100
Episode:2586 meanR:462.0400 R:500.0 loss:19.3052 exploreP:0.0100
Episode:2587 meanR:462.0400 R:500.0 loss:18.1792 exploreP:0.0100
Episode:2588 meanR:462.0400 R:500.0 loss:13.1839 exploreP:0.0100
Episode:2589 meanR:462.0400 R:500.0 loss:12.0175 exploreP:0.0100
Episode:2590 meanR:462.0400 R:500.0 loss:19.3518 exploreP:0.0100
Episode:2591 meanR:462.0400 R:500.0 loss:16.9448 exploreP:0.0100
Episode:2592 meanR:462.0400 R:500.0 loss:13.5769 exploreP:0.0100
Episode:2593 meanR:462.0400 R:500.0 loss:19.0259 exploreP:0.0100
Episode:2594 meanR:462.0400 R:500.0 loss:14.2805 exploreP:0.0100
Episode:2595 meanR:462.0400 R:500.0 loss:19.0964 exploreP:0.0100
Episode:2596 meanR:462.0400 R:500.0 loss:15.0199 exploreP:0.0100
Episode:2597 meanR:457.20

Episode:2709 meanR:452.6800 R:500.0 loss:14.2108 exploreP:0.0100
Episode:2710 meanR:452.6800 R:500.0 loss:15.6739 exploreP:0.0100
Episode:2711 meanR:452.6800 R:500.0 loss:14.1954 exploreP:0.0100
Episode:2712 meanR:452.6800 R:500.0 loss:15.0051 exploreP:0.0100
Episode:2713 meanR:452.6800 R:500.0 loss:17.2524 exploreP:0.0100
Episode:2714 meanR:452.6800 R:500.0 loss:16.6662 exploreP:0.0100
Episode:2715 meanR:452.6800 R:500.0 loss:14.0933 exploreP:0.0100
Episode:2716 meanR:452.6800 R:500.0 loss:15.3826 exploreP:0.0100
Episode:2717 meanR:452.6800 R:500.0 loss:15.6474 exploreP:0.0100
Episode:2718 meanR:452.6800 R:500.0 loss:13.3065 exploreP:0.0100
Episode:2719 meanR:452.6800 R:500.0 loss:16.6364 exploreP:0.0100
Episode:2720 meanR:452.6800 R:500.0 loss:21.8820 exploreP:0.0100
Episode:2721 meanR:454.1800 R:500.0 loss:21.4589 exploreP:0.0100
Episode:2722 meanR:454.1800 R:500.0 loss:18.7510 exploreP:0.0100
Episode:2723 meanR:450.5000 R:132.0 loss:61.2616 exploreP:0.0100
Episode:2724 meanR:448.91

Episode:2836 meanR:322.4200 R:500.0 loss:14.1113 exploreP:0.0100
Episode:2837 meanR:322.4200 R:500.0 loss:15.3828 exploreP:0.0100
Episode:2838 meanR:322.4200 R:500.0 loss:13.7864 exploreP:0.0100
Episode:2839 meanR:326.2600 R:500.0 loss:13.7538 exploreP:0.0100
Episode:2840 meanR:323.4800 R:141.0 loss:61.7214 exploreP:0.0100
Episode:2841 meanR:326.9800 R:500.0 loss:8.2504 exploreP:0.0100
Episode:2842 meanR:330.7400 R:500.0 loss:14.3987 exploreP:0.0100
Episode:2843 meanR:334.5400 R:500.0 loss:13.0402 exploreP:0.0100
Episode:2844 meanR:338.2200 R:500.0 loss:12.3262 exploreP:0.0100
Episode:2845 meanR:341.7500 R:500.0 loss:15.4934 exploreP:0.0100
Episode:2846 meanR:345.2900 R:500.0 loss:16.7009 exploreP:0.0100
Episode:2847 meanR:348.7600 R:500.0 loss:16.3046 exploreP:0.0100
Episode:2848 meanR:351.6800 R:500.0 loss:14.8327 exploreP:0.0100
Episode:2849 meanR:354.1100 R:500.0 loss:15.4065 exploreP:0.0100
Episode:2850 meanR:354.1100 R:500.0 loss:18.1850 exploreP:0.0100
Episode:2851 meanR:356.390

Episode:2963 meanR:401.5800 R:500.0 loss:12.6119 exploreP:0.0100
Episode:2964 meanR:401.5800 R:500.0 loss:15.5648 exploreP:0.0100
Episode:2965 meanR:406.1700 R:500.0 loss:16.6081 exploreP:0.0100
Episode:2966 meanR:406.1700 R:500.0 loss:14.4917 exploreP:0.0100
Episode:2967 meanR:406.1700 R:500.0 loss:19.1386 exploreP:0.0100
Episode:2968 meanR:406.1700 R:500.0 loss:21.9718 exploreP:0.0100
Episode:2969 meanR:406.1700 R:500.0 loss:18.5710 exploreP:0.0100
Episode:2970 meanR:401.2900 R:12.0 loss:85.0895 exploreP:0.0100
Episode:2971 meanR:401.2900 R:500.0 loss:20.5494 exploreP:0.0100
Episode:2972 meanR:402.6100 R:500.0 loss:15.6235 exploreP:0.0100
Episode:2973 meanR:398.8300 R:122.0 loss:67.0006 exploreP:0.0100
Episode:2974 meanR:394.8400 R:101.0 loss:30.8935 exploreP:0.0100
Episode:2975 meanR:391.0100 R:117.0 loss:12.2751 exploreP:0.0100
Episode:2976 meanR:387.1200 R:111.0 loss:12.1529 exploreP:0.0100
Episode:2977 meanR:387.9600 R:97.0 loss:12.3366 exploreP:0.0100
Episode:2978 meanR:383.8600

Episode:3089 meanR:357.4000 R:450.0 loss:1006.4779 exploreP:0.0100
Episode:3090 meanR:361.2000 R:490.0 loss:1011.1743 exploreP:0.0100
Episode:3091 meanR:365.1500 R:500.0 loss:1489.7848 exploreP:0.0100
Episode:3092 meanR:368.9300 R:500.0 loss:1616.2584 exploreP:0.0100
Episode:3093 meanR:372.7800 R:500.0 loss:1318.5027 exploreP:0.0100
Episode:3094 meanR:376.6800 R:500.0 loss:1847.5389 exploreP:0.0100
Episode:3095 meanR:380.6000 R:500.0 loss:1589.6613 exploreP:0.0100
Episode:3096 meanR:385.0900 R:500.0 loss:1405.8333 exploreP:0.0100
Episode:3097 meanR:389.0800 R:500.0 loss:1550.7858 exploreP:0.0100
Episode:3098 meanR:392.8000 R:500.0 loss:1408.1571 exploreP:0.0100
Episode:3099 meanR:396.3100 R:500.0 loss:1315.1418 exploreP:0.0100
Episode:3100 meanR:399.8400 R:500.0 loss:1172.8204 exploreP:0.0100
Episode:3101 meanR:403.3400 R:500.0 loss:1328.3895 exploreP:0.0100
Episode:3102 meanR:406.8900 R:500.0 loss:1530.0596 exploreP:0.0100
Episode:3103 meanR:410.4500 R:500.0 loss:1244.7192 exploreP:0.

Episode:3213 meanR:310.7200 R:143.0 loss:189.3528 exploreP:0.0100
Episode:3214 meanR:309.7200 R:154.0 loss:147.9663 exploreP:0.0100
Episode:3215 meanR:308.6500 R:156.0 loss:253.0004 exploreP:0.0100
Episode:3216 meanR:307.0900 R:167.0 loss:106.6475 exploreP:0.0100
Episode:3217 meanR:305.6400 R:165.0 loss:131.5512 exploreP:0.0100
Episode:3218 meanR:303.9300 R:153.0 loss:188.3557 exploreP:0.0100
Episode:3219 meanR:302.5800 R:153.0 loss:128.0088 exploreP:0.0100
Episode:3220 meanR:301.6500 R:179.0 loss:90.1538 exploreP:0.0100
Episode:3221 meanR:300.6100 R:157.0 loss:172.9067 exploreP:0.0100
Episode:3222 meanR:299.7000 R:172.0 loss:131.3521 exploreP:0.0100
Episode:3223 meanR:299.3300 R:175.0 loss:90.3164 exploreP:0.0100
Episode:3224 meanR:297.5200 R:182.0 loss:67.9076 exploreP:0.0100
Episode:3225 meanR:295.7800 R:174.0 loss:114.6668 exploreP:0.0100
Episode:3226 meanR:294.0400 R:172.0 loss:84.7035 exploreP:0.0100
Episode:3227 meanR:292.5500 R:180.0 loss:63.0875 exploreP:0.0100
Episode:3228 me

Episode:3339 meanR:225.6300 R:127.0 loss:216.3912 exploreP:0.0100
Episode:3340 meanR:226.0200 R:290.0 loss:79.2389 exploreP:0.0100
Episode:3341 meanR:224.3900 R:134.0 loss:42.3828 exploreP:0.0100
Episode:3342 meanR:222.7700 R:138.0 loss:198.9293 exploreP:0.0100
Episode:3343 meanR:221.3000 R:114.0 loss:30.4643 exploreP:0.0100
Episode:3344 meanR:219.6900 R:156.0 loss:27.7394 exploreP:0.0100
Episode:3345 meanR:217.4700 R:107.0 loss:44.0625 exploreP:0.0100
Episode:3346 meanR:216.2500 R:155.0 loss:285.1881 exploreP:0.0100
Episode:3347 meanR:213.9000 R:124.0 loss:106.1569 exploreP:0.0100
Episode:3348 meanR:213.8000 R:191.0 loss:232.4126 exploreP:0.0100
Episode:3349 meanR:211.6400 R:133.0 loss:117.9413 exploreP:0.0100
Episode:3350 meanR:208.4800 R:180.0 loss:390.2301 exploreP:0.0100
Episode:3351 meanR:204.8800 R:140.0 loss:295.5752 exploreP:0.0100
Episode:3352 meanR:201.1500 R:127.0 loss:396.4996 exploreP:0.0100
Episode:3353 meanR:197.3100 R:116.0 loss:140.7509 exploreP:0.0100
Episode:3354 me

Episode:3464 meanR:130.2000 R:121.0 loss:142.0017 exploreP:0.0100
Episode:3465 meanR:129.8100 R:111.0 loss:95.9712 exploreP:0.0100
Episode:3466 meanR:130.1500 R:134.0 loss:22.2252 exploreP:0.0100
Episode:3467 meanR:130.5100 R:145.0 loss:80.1501 exploreP:0.0100
Episode:3468 meanR:130.5700 R:122.0 loss:9.9359 exploreP:0.0100
Episode:3469 meanR:130.4300 R:119.0 loss:252.4788 exploreP:0.0100
Episode:3470 meanR:130.6100 R:143.0 loss:314.3817 exploreP:0.0100
Episode:3471 meanR:130.7100 R:127.0 loss:79.2925 exploreP:0.0100
Episode:3472 meanR:130.5700 R:103.0 loss:83.8117 exploreP:0.0100
Episode:3473 meanR:130.7100 R:127.0 loss:231.5226 exploreP:0.0100
Episode:3474 meanR:130.8000 R:120.0 loss:116.0057 exploreP:0.0100
Episode:3475 meanR:130.5400 R:113.0 loss:108.6394 exploreP:0.0100
Episode:3476 meanR:130.4200 R:131.0 loss:102.9911 exploreP:0.0100
Episode:3477 meanR:130.2500 R:109.0 loss:47.8230 exploreP:0.0100
Episode:3478 meanR:129.8000 R:109.0 loss:58.8849 exploreP:0.0100
Episode:3479 meanR:

Episode:3590 meanR:150.6900 R:113.0 loss:17.6320 exploreP:0.0100
Episode:3591 meanR:150.9200 R:131.0 loss:149.2017 exploreP:0.0100
Episode:3592 meanR:150.9800 R:116.0 loss:27.1957 exploreP:0.0100
Episode:3593 meanR:151.0000 R:116.0 loss:77.0101 exploreP:0.0100
Episode:3594 meanR:151.6100 R:172.0 loss:125.5184 exploreP:0.0100
Episode:3595 meanR:151.7000 R:129.0 loss:67.1572 exploreP:0.0100
Episode:3596 meanR:151.9700 R:132.0 loss:112.3095 exploreP:0.0100
Episode:3597 meanR:152.1300 R:128.0 loss:78.6268 exploreP:0.0100
Episode:3598 meanR:152.1800 R:118.0 loss:22.3646 exploreP:0.0100
Episode:3599 meanR:152.7700 R:165.0 loss:131.8463 exploreP:0.0100
Episode:3600 meanR:152.9000 R:124.0 loss:213.5353 exploreP:0.0100
Episode:3601 meanR:152.7200 R:105.0 loss:41.7072 exploreP:0.0100
Episode:3602 meanR:152.3900 R:105.0 loss:12.0261 exploreP:0.0100
Episode:3603 meanR:152.7400 R:149.0 loss:128.4738 exploreP:0.0100
Episode:3604 meanR:153.3200 R:169.0 loss:126.9938 exploreP:0.0100
Episode:3605 meanR

Episode:3716 meanR:223.9300 R:145.0 loss:17.4688 exploreP:0.0100
Episode:3717 meanR:224.2500 R:161.0 loss:22.2865 exploreP:0.0100
Episode:3718 meanR:224.7300 R:151.0 loss:28.6932 exploreP:0.0100
Episode:3719 meanR:225.1100 R:153.0 loss:18.0188 exploreP:0.0100
Episode:3720 meanR:225.2700 R:153.0 loss:15.0235 exploreP:0.0100
Episode:3721 meanR:225.6800 R:175.0 loss:42.4077 exploreP:0.0100
Episode:3722 meanR:225.5000 R:161.0 loss:39.4160 exploreP:0.0100
Episode:3723 meanR:225.6300 R:144.0 loss:50.2473 exploreP:0.0100
Episode:3724 meanR:225.9100 R:157.0 loss:19.7043 exploreP:0.0100
Episode:3725 meanR:226.1600 R:141.0 loss:20.6459 exploreP:0.0100
Episode:3726 meanR:226.5900 R:161.0 loss:33.2488 exploreP:0.0100
Episode:3727 meanR:226.8900 R:171.0 loss:32.4531 exploreP:0.0100
Episode:3728 meanR:227.1500 R:141.0 loss:22.6291 exploreP:0.0100
Episode:3729 meanR:227.5500 R:163.0 loss:31.7652 exploreP:0.0100
Episode:3730 meanR:227.6800 R:159.0 loss:7.3953 exploreP:0.0100
Episode:3731 meanR:227.040

Episode:3843 meanR:280.7100 R:397.0 loss:18.0332 exploreP:0.0100
Episode:3844 meanR:283.2500 R:427.0 loss:6.5946 exploreP:0.0100
Episode:3845 meanR:285.9400 R:426.0 loss:9.4689 exploreP:0.0100
Episode:3846 meanR:288.4800 R:414.0 loss:9.7885 exploreP:0.0100
Episode:3847 meanR:291.1400 R:438.0 loss:9.0259 exploreP:0.0100
Episode:3848 meanR:293.4900 R:403.0 loss:8.1085 exploreP:0.0100
Episode:3849 meanR:296.7500 R:496.0 loss:12.1696 exploreP:0.0100
Episode:3850 meanR:300.0700 R:500.0 loss:13.7935 exploreP:0.0100
Episode:3851 meanR:303.3500 R:500.0 loss:14.1181 exploreP:0.0100
Episode:3852 meanR:306.7700 R:500.0 loss:11.3857 exploreP:0.0100
Episode:3853 meanR:310.0000 R:500.0 loss:8.1396 exploreP:0.0100
Episode:3854 meanR:313.3400 R:500.0 loss:6.3705 exploreP:0.0100
Episode:3855 meanR:316.6100 R:500.0 loss:12.0390 exploreP:0.0100
Episode:3856 meanR:319.4400 R:462.0 loss:10.9436 exploreP:0.0100
Episode:3857 meanR:322.7100 R:500.0 loss:6.8348 exploreP:0.0100
Episode:3858 meanR:325.9000 R:500

Episode:3970 meanR:479.6600 R:500.0 loss:16.1594 exploreP:0.0100
Episode:3971 meanR:475.8100 R:115.0 loss:66.1668 exploreP:0.0100
Episode:3972 meanR:475.8100 R:500.0 loss:7.7672 exploreP:0.0100
Episode:3973 meanR:475.8100 R:500.0 loss:12.3972 exploreP:0.0100
Episode:3974 meanR:475.8100 R:500.0 loss:9.7325 exploreP:0.0100
Episode:3975 meanR:474.2500 R:344.0 loss:5.1049 exploreP:0.0100
Episode:3976 meanR:471.5200 R:227.0 loss:10.3687 exploreP:0.0100
Episode:3977 meanR:468.7900 R:227.0 loss:7.0886 exploreP:0.0100
Episode:3978 meanR:465.7000 R:191.0 loss:5.7465 exploreP:0.0100
Episode:3979 meanR:462.7900 R:209.0 loss:4.9193 exploreP:0.0100
Episode:3980 meanR:460.5300 R:274.0 loss:2.6714 exploreP:0.0100
Episode:3981 meanR:457.8600 R:233.0 loss:3.9348 exploreP:0.0100
Episode:3982 meanR:457.4600 R:460.0 loss:1.9971 exploreP:0.0100
Episode:3983 meanR:457.7400 R:500.0 loss:7.1508 exploreP:0.0100
Episode:3984 meanR:455.6200 R:288.0 loss:17.7221 exploreP:0.0100
Episode:3985 meanR:455.7600 R:500.0

Episode:4097 meanR:424.2300 R:500.0 loss:12.3232 exploreP:0.0100
Episode:4098 meanR:424.2300 R:500.0 loss:13.0891 exploreP:0.0100
Episode:4099 meanR:424.2300 R:500.0 loss:12.5675 exploreP:0.0100
Episode:4100 meanR:424.2300 R:500.0 loss:12.5035 exploreP:0.0100
Episode:4101 meanR:424.2300 R:500.0 loss:11.0645 exploreP:0.0100
Episode:4102 meanR:424.2300 R:500.0 loss:10.0605 exploreP:0.0100
Episode:4103 meanR:424.2300 R:500.0 loss:13.0114 exploreP:0.0100
Episode:4104 meanR:421.7000 R:247.0 loss:38.6967 exploreP:0.0100
Episode:4105 meanR:421.7000 R:500.0 loss:9.2443 exploreP:0.0100
Episode:4106 meanR:421.7000 R:500.0 loss:16.5801 exploreP:0.0100
Episode:4107 meanR:426.5500 R:500.0 loss:16.6711 exploreP:0.0100
Episode:4108 meanR:428.9100 R:500.0 loss:15.3892 exploreP:0.0100
Episode:4109 meanR:428.9100 R:500.0 loss:14.0651 exploreP:0.0100
Episode:4110 meanR:428.9100 R:500.0 loss:17.7703 exploreP:0.0100
Episode:4111 meanR:428.9100 R:500.0 loss:16.4875 exploreP:0.0100
Episode:4112 meanR:428.910

Episode:4225 meanR:244.5300 R:500.0 loss:6.5354 exploreP:0.0100
Episode:4226 meanR:246.9500 R:386.0 loss:15.3626 exploreP:0.0100
Episode:4227 meanR:247.9100 R:500.0 loss:1.8817 exploreP:0.0100
Episode:4228 meanR:245.3400 R:243.0 loss:21.5413 exploreP:0.0100
Episode:4229 meanR:240.4900 R:15.0 loss:62.1140 exploreP:0.0100
Episode:4230 meanR:239.3100 R:13.0 loss:124.0304 exploreP:0.0100
Episode:4231 meanR:239.8700 R:177.0 loss:45.1827 exploreP:0.0100
Episode:4232 meanR:241.8300 R:320.0 loss:18.0321 exploreP:0.0100
Episode:4233 meanR:243.7000 R:326.0 loss:7.1107 exploreP:0.0100
Episode:4234 meanR:245.0000 R:430.0 loss:2.0933 exploreP:0.0100
Episode:4235 meanR:244.9900 R:179.0 loss:18.6692 exploreP:0.0100
Episode:4236 meanR:244.9500 R:337.0 loss:2.6005 exploreP:0.0100
Episode:4237 meanR:244.9500 R:500.0 loss:2.9501 exploreP:0.0100
Episode:4238 meanR:244.9500 R:500.0 loss:16.6730 exploreP:0.0100
Episode:4239 meanR:247.6700 R:500.0 loss:16.8500 exploreP:0.0100
Episode:4240 meanR:249.4000 R:41

Episode:4352 meanR:338.4000 R:231.0 loss:1.6666 exploreP:0.0100
Episode:4353 meanR:338.4000 R:500.0 loss:1.0870 exploreP:0.0100
Episode:4354 meanR:335.6000 R:220.0 loss:5.2593 exploreP:0.0100
Episode:4355 meanR:335.6000 R:500.0 loss:1.8051 exploreP:0.0100
Episode:4356 meanR:335.6000 R:500.0 loss:17.2004 exploreP:0.0100
Episode:4357 meanR:335.6000 R:500.0 loss:18.1029 exploreP:0.0100
Episode:4358 meanR:335.6000 R:500.0 loss:17.6646 exploreP:0.0100
Episode:4359 meanR:340.4600 R:500.0 loss:19.0946 exploreP:0.0100
Episode:4360 meanR:345.0000 R:500.0 loss:17.0828 exploreP:0.0100
Episode:4361 meanR:349.6000 R:500.0 loss:14.5194 exploreP:0.0100
Episode:4362 meanR:351.2000 R:500.0 loss:17.4620 exploreP:0.0100
Episode:4363 meanR:354.3700 R:337.0 loss:21.9715 exploreP:0.0100
Episode:4364 meanR:355.3500 R:402.0 loss:5.0644 exploreP:0.0100
Episode:4365 meanR:356.0600 R:500.0 loss:6.0751 exploreP:0.0100
Episode:4366 meanR:358.6600 R:500.0 loss:16.5436 exploreP:0.0100
Episode:4367 meanR:355.7100 R:2

Episode:4479 meanR:374.7700 R:187.0 loss:29.7372 exploreP:0.0100
Episode:4480 meanR:374.7700 R:500.0 loss:1.4457 exploreP:0.0100
Episode:4481 meanR:374.7700 R:500.0 loss:19.3854 exploreP:0.0100
Episode:4482 meanR:374.7700 R:500.0 loss:13.8990 exploreP:0.0100
Episode:4483 meanR:372.7700 R:300.0 loss:18.6515 exploreP:0.0100
Episode:4484 meanR:372.7700 R:500.0 loss:0.8614 exploreP:0.0100
Episode:4485 meanR:372.7700 R:500.0 loss:14.1715 exploreP:0.0100
Episode:4486 meanR:367.8800 R:11.0 loss:76.1712 exploreP:0.0100
Episode:4487 meanR:367.8800 R:500.0 loss:16.7158 exploreP:0.0100
Episode:4488 meanR:371.5100 R:500.0 loss:16.9525 exploreP:0.0100
Episode:4489 meanR:371.5100 R:500.0 loss:14.0059 exploreP:0.0100
Episode:4490 meanR:371.5100 R:500.0 loss:14.1149 exploreP:0.0100
Episode:4491 meanR:371.5100 R:500.0 loss:13.5218 exploreP:0.0100
Episode:4492 meanR:376.4000 R:500.0 loss:11.1434 exploreP:0.0100
Episode:4493 meanR:381.3100 R:500.0 loss:18.4031 exploreP:0.0100
Episode:4494 meanR:381.3100 

Episode:4606 meanR:363.9200 R:500.0 loss:16.9206 exploreP:0.0100
Episode:4607 meanR:363.9200 R:500.0 loss:17.8592 exploreP:0.0100
Episode:4608 meanR:363.9200 R:500.0 loss:18.4294 exploreP:0.0100
Episode:4609 meanR:363.9200 R:500.0 loss:15.9843 exploreP:0.0100
Episode:4610 meanR:363.9200 R:500.0 loss:16.7612 exploreP:0.0100
Episode:4611 meanR:363.9200 R:500.0 loss:14.1356 exploreP:0.0100
Episode:4612 meanR:363.9200 R:500.0 loss:15.0923 exploreP:0.0100
Episode:4613 meanR:363.9200 R:500.0 loss:14.1304 exploreP:0.0100
Episode:4614 meanR:363.9200 R:500.0 loss:12.1936 exploreP:0.0100
Episode:4615 meanR:363.9200 R:500.0 loss:12.6861 exploreP:0.0100
Episode:4616 meanR:363.9200 R:500.0 loss:14.6354 exploreP:0.0100
Episode:4617 meanR:363.9200 R:500.0 loss:14.7088 exploreP:0.0100
Episode:4618 meanR:363.9200 R:500.0 loss:15.1176 exploreP:0.0100
Episode:4619 meanR:363.9200 R:500.0 loss:13.1977 exploreP:0.0100
Episode:4620 meanR:360.6300 R:171.0 loss:37.2583 exploreP:0.0100
Episode:4621 meanR:360.63

Episode:4733 meanR:435.6500 R:320.0 loss:24.7820 exploreP:0.0100
Episode:4734 meanR:431.8400 R:119.0 loss:5.8277 exploreP:0.0100
Episode:4735 meanR:432.3900 R:500.0 loss:4.0658 exploreP:0.0100
Episode:4736 meanR:432.3900 R:500.0 loss:20.8308 exploreP:0.0100
Episode:4737 meanR:432.3900 R:500.0 loss:18.7952 exploreP:0.0100
Episode:4738 meanR:432.3900 R:500.0 loss:19.7102 exploreP:0.0100
Episode:4739 meanR:432.3900 R:500.0 loss:18.0881 exploreP:0.0100
Episode:4740 meanR:432.3900 R:500.0 loss:18.6875 exploreP:0.0100
Episode:4741 meanR:432.3900 R:500.0 loss:17.4902 exploreP:0.0100
Episode:4742 meanR:432.3900 R:500.0 loss:17.0257 exploreP:0.0100
Episode:4743 meanR:432.3900 R:500.0 loss:16.6512 exploreP:0.0100
Episode:4744 meanR:432.3900 R:500.0 loss:18.8411 exploreP:0.0100
Episode:4745 meanR:432.3900 R:500.0 loss:15.4282 exploreP:0.0100
Episode:4746 meanR:432.3900 R:500.0 loss:14.9363 exploreP:0.0100
Episode:4747 meanR:432.3900 R:500.0 loss:13.6205 exploreP:0.0100
Episode:4748 meanR:432.3900

Episode:4860 meanR:436.9200 R:500.0 loss:15.1444 exploreP:0.0100
Episode:4861 meanR:436.9200 R:500.0 loss:14.8268 exploreP:0.0100
Episode:4862 meanR:436.9200 R:500.0 loss:15.0960 exploreP:0.0100
Episode:4863 meanR:440.5700 R:500.0 loss:16.6076 exploreP:0.0100
Episode:4864 meanR:444.4400 R:500.0 loss:13.1541 exploreP:0.0100
Episode:4865 meanR:448.2600 R:500.0 loss:16.2203 exploreP:0.0100
Episode:4866 meanR:451.9400 R:500.0 loss:15.6082 exploreP:0.0100
Episode:4867 meanR:456.8300 R:500.0 loss:18.1257 exploreP:0.0100
Episode:4868 meanR:456.8300 R:500.0 loss:17.7690 exploreP:0.0100
Episode:4869 meanR:456.8300 R:500.0 loss:18.7480 exploreP:0.0100
Episode:4870 meanR:452.5600 R:73.0 loss:65.0808 exploreP:0.0100
Episode:4871 meanR:452.5600 R:500.0 loss:11.3506 exploreP:0.0100
Episode:4872 meanR:449.5600 R:200.0 loss:42.1178 exploreP:0.0100
Episode:4873 meanR:444.6700 R:11.0 loss:74.9086 exploreP:0.0100
Episode:4874 meanR:439.7700 R:10.0 loss:103.3289 exploreP:0.0100
Episode:4875 meanR:439.7700

Episode:4987 meanR:282.3500 R:500.0 loss:6.6357 exploreP:0.0100
Episode:4988 meanR:285.3300 R:308.0 loss:15.5120 exploreP:0.0100
Episode:4989 meanR:290.2400 R:500.0 loss:1.4561 exploreP:0.0100
Episode:4990 meanR:295.1400 R:500.0 loss:16.6405 exploreP:0.0100
Episode:4991 meanR:295.1400 R:500.0 loss:16.2875 exploreP:0.0100
Episode:4992 meanR:295.1400 R:500.0 loss:12.0549 exploreP:0.0100
Episode:4993 meanR:295.1400 R:500.0 loss:13.2841 exploreP:0.0100
Episode:4994 meanR:295.1400 R:500.0 loss:15.1445 exploreP:0.0100
Episode:4995 meanR:295.1400 R:500.0 loss:14.0430 exploreP:0.0100
Episode:4996 meanR:295.1400 R:500.0 loss:18.4105 exploreP:0.0100
Episode:4997 meanR:300.0300 R:500.0 loss:13.5998 exploreP:0.0100
Episode:4998 meanR:304.9000 R:500.0 loss:17.5732 exploreP:0.0100
Episode:4999 meanR:309.6300 R:500.0 loss:11.8093 exploreP:0.0100
Episode:5000 meanR:314.5000 R:500.0 loss:16.2146 exploreP:0.0100
Episode:5001 meanR:317.5200 R:500.0 loss:17.8368 exploreP:0.0100
Episode:5002 meanR:317.5200

Episode:5114 meanR:417.8400 R:336.0 loss:26.0415 exploreP:0.0100
Episode:5115 meanR:412.9700 R:13.0 loss:56.1014 exploreP:0.0100
Episode:5116 meanR:410.4900 R:11.0 loss:93.6464 exploreP:0.0100
Episode:5117 meanR:409.9800 R:160.0 loss:46.2635 exploreP:0.0100
Episode:5118 meanR:411.1400 R:278.0 loss:14.2432 exploreP:0.0100
Episode:5119 meanR:414.2100 R:500.0 loss:7.1598 exploreP:0.0100
Episode:5120 meanR:416.6500 R:452.0 loss:17.4483 exploreP:0.0100
Episode:5121 meanR:420.0600 R:500.0 loss:9.6994 exploreP:0.0100
Episode:5122 meanR:419.3600 R:430.0 loss:10.9555 exploreP:0.0100
Episode:5123 meanR:421.9200 R:500.0 loss:3.3148 exploreP:0.0100
Episode:5124 meanR:421.9200 R:500.0 loss:17.8452 exploreP:0.0100
Episode:5125 meanR:421.9200 R:500.0 loss:13.2907 exploreP:0.0100
Episode:5126 meanR:421.9200 R:500.0 loss:19.2790 exploreP:0.0100
Episode:5127 meanR:421.9200 R:500.0 loss:18.2900 exploreP:0.0100
Episode:5128 meanR:421.9200 R:500.0 loss:16.3611 exploreP:0.0100
Episode:5129 meanR:421.9200 R:

Episode:5241 meanR:375.9200 R:153.0 loss:3.8694 exploreP:0.0100
Episode:5242 meanR:372.6600 R:174.0 loss:2.5643 exploreP:0.0100
Episode:5243 meanR:369.1900 R:153.0 loss:1.2122 exploreP:0.0100
Episode:5244 meanR:368.6500 R:197.0 loss:1.5370 exploreP:0.0100
Episode:5245 meanR:372.6800 R:500.0 loss:1.7217 exploreP:0.0100
Episode:5246 meanR:374.2100 R:308.0 loss:25.1877 exploreP:0.0100
Episode:5247 meanR:377.2300 R:500.0 loss:1.4773 exploreP:0.0100
Episode:5248 meanR:380.1400 R:500.0 loss:9.7864 exploreP:0.0100
Episode:5249 meanR:380.4700 R:257.0 loss:41.2284 exploreP:0.0100
Episode:5250 meanR:380.5900 R:190.0 loss:4.1665 exploreP:0.0100
Episode:5251 meanR:380.0700 R:143.0 loss:3.0700 exploreP:0.0100
Episode:5252 meanR:379.4700 R:163.0 loss:1.6709 exploreP:0.0100
Episode:5253 meanR:382.4100 R:500.0 loss:0.6530 exploreP:0.0100
Episode:5254 meanR:383.5400 R:305.0 loss:21.3205 exploreP:0.0100
Episode:5255 meanR:386.3900 R:500.0 loss:0.6427 exploreP:0.0100
Episode:5256 meanR:388.3400 R:437.0 l

Episode:5368 meanR:402.0100 R:500.0 loss:15.9475 exploreP:0.0100
Episode:5369 meanR:404.0300 R:500.0 loss:18.1411 exploreP:0.0100
Episode:5370 meanR:404.0300 R:500.0 loss:13.7949 exploreP:0.0100
Episode:5371 meanR:404.0300 R:500.0 loss:15.8580 exploreP:0.0100
Episode:5372 meanR:404.0300 R:500.0 loss:14.5536 exploreP:0.0100
Episode:5373 meanR:404.0300 R:500.0 loss:14.1456 exploreP:0.0100
Episode:5374 meanR:405.1100 R:500.0 loss:13.2360 exploreP:0.0100
Episode:5375 meanR:408.2700 R:500.0 loss:13.0461 exploreP:0.0100
Episode:5376 meanR:408.2700 R:500.0 loss:13.4996 exploreP:0.0100
Episode:5377 meanR:410.2400 R:500.0 loss:14.9347 exploreP:0.0100
Episode:5378 meanR:412.8600 R:500.0 loss:9.4160 exploreP:0.0100
Episode:5379 meanR:415.9200 R:500.0 loss:11.8929 exploreP:0.0100
Episode:5380 meanR:419.0700 R:500.0 loss:10.8919 exploreP:0.0100
Episode:5381 meanR:421.8200 R:500.0 loss:11.0969 exploreP:0.0100
Episode:5382 meanR:424.7400 R:500.0 loss:18.6882 exploreP:0.0100
Episode:5383 meanR:427.190

Episode:5495 meanR:412.8400 R:500.0 loss:15.0189 exploreP:0.0100
Episode:5496 meanR:412.8400 R:500.0 loss:15.5809 exploreP:0.0100
Episode:5497 meanR:409.5300 R:169.0 loss:44.1522 exploreP:0.0100
Episode:5498 meanR:406.5700 R:204.0 loss:14.7146 exploreP:0.0100
Episode:5499 meanR:401.8200 R:25.0 loss:17.4332 exploreP:0.0100
Episode:5500 meanR:396.9900 R:17.0 loss:57.8492 exploreP:0.0100
Episode:5501 meanR:392.1100 R:12.0 loss:97.0936 exploreP:0.0100
Episode:5502 meanR:387.2600 R:15.0 loss:120.8693 exploreP:0.0100
Episode:5503 meanR:382.3700 R:11.0 loss:127.9791 exploreP:0.0100
Episode:5504 meanR:377.4900 R:12.0 loss:107.0105 exploreP:0.0100
Episode:5505 meanR:372.5900 R:10.0 loss:91.9495 exploreP:0.0100
Episode:5506 meanR:372.6000 R:13.0 loss:101.4550 exploreP:0.0100
Episode:5507 meanR:373.5500 R:106.0 loss:51.0615 exploreP:0.0100
Episode:5508 meanR:369.7800 R:123.0 loss:25.2329 exploreP:0.0100
Episode:5509 meanR:366.2800 R:150.0 loss:10.4645 exploreP:0.0100
Episode:5510 meanR:364.7600 R

Episode:5622 meanR:412.0900 R:500.0 loss:7.9283 exploreP:0.0100
Episode:5623 meanR:414.0600 R:312.0 loss:21.7251 exploreP:0.0100
Episode:5624 meanR:417.3400 R:500.0 loss:4.2506 exploreP:0.0100
Episode:5625 meanR:420.7000 R:500.0 loss:16.5969 exploreP:0.0100
Episode:5626 meanR:423.8900 R:500.0 loss:15.6341 exploreP:0.0100
Episode:5627 meanR:426.9200 R:500.0 loss:13.8269 exploreP:0.0100
Episode:5628 meanR:428.3100 R:500.0 loss:15.0919 exploreP:0.0100
Episode:5629 meanR:428.7300 R:500.0 loss:15.9941 exploreP:0.0100
Episode:5630 meanR:431.4200 R:500.0 loss:13.6649 exploreP:0.0100
Episode:5631 meanR:431.4200 R:500.0 loss:15.5426 exploreP:0.0100
Episode:5632 meanR:431.4200 R:500.0 loss:17.4720 exploreP:0.0100
Episode:5633 meanR:432.4500 R:500.0 loss:15.8354 exploreP:0.0100
Episode:5634 meanR:429.0700 R:162.0 loss:57.7895 exploreP:0.0100
Episode:5635 meanR:429.0700 R:500.0 loss:5.1028 exploreP:0.0100
Episode:5636 meanR:429.0700 R:500.0 loss:17.5614 exploreP:0.0100
Episode:5637 meanR:429.0700 

Episode:5749 meanR:172.4600 R:14.0 loss:346.2579 exploreP:0.0100
Episode:5750 meanR:167.6200 R:16.0 loss:357.4025 exploreP:0.0100
Episode:5751 meanR:163.8900 R:127.0 loss:200.6337 exploreP:0.0100
Episode:5752 meanR:160.7700 R:188.0 loss:17.1238 exploreP:0.0100
Episode:5753 meanR:157.8700 R:210.0 loss:15.8740 exploreP:0.0100
Episode:5754 meanR:159.1200 R:217.0 loss:12.1458 exploreP:0.0100
Episode:5755 meanR:160.8200 R:260.0 loss:12.0574 exploreP:0.0100
Episode:5756 meanR:161.2800 R:247.0 loss:10.5431 exploreP:0.0100
Episode:5757 meanR:158.3200 R:204.0 loss:16.4568 exploreP:0.0100
Episode:5758 meanR:163.1600 R:500.0 loss:15.4996 exploreP:0.0100
Episode:5759 meanR:163.1600 R:500.0 loss:31.6875 exploreP:0.0100
Episode:5760 meanR:159.9500 R:179.0 loss:32.8940 exploreP:0.0100
Episode:5761 meanR:159.9500 R:500.0 loss:13.0212 exploreP:0.0100
Episode:5762 meanR:156.2700 R:132.0 loss:39.2494 exploreP:0.0100
Episode:5763 meanR:152.9300 R:166.0 loss:19.9024 exploreP:0.0100
Episode:5764 meanR:148.0

Episode:5876 meanR:353.1400 R:500.0 loss:14.9267 exploreP:0.0100
Episode:5877 meanR:358.0200 R:500.0 loss:15.4764 exploreP:0.0100
Episode:5878 meanR:361.2600 R:500.0 loss:11.2898 exploreP:0.0100
Episode:5879 meanR:364.4000 R:500.0 loss:18.2134 exploreP:0.0100
Episode:5880 meanR:367.6400 R:500.0 loss:15.2860 exploreP:0.0100
Episode:5881 meanR:370.9000 R:500.0 loss:17.4975 exploreP:0.0100
Episode:5882 meanR:370.9600 R:198.0 loss:37.4981 exploreP:0.0100
Episode:5883 meanR:373.9800 R:500.0 loss:12.5802 exploreP:0.0100
Episode:5884 meanR:376.7800 R:500.0 loss:19.6901 exploreP:0.0100
Episode:5885 meanR:380.2200 R:500.0 loss:19.0030 exploreP:0.0100
Episode:5886 meanR:383.1000 R:500.0 loss:18.1846 exploreP:0.0100
Episode:5887 meanR:386.5400 R:500.0 loss:17.2499 exploreP:0.0100
Episode:5888 meanR:387.9500 R:500.0 loss:13.2585 exploreP:0.0100
Episode:5889 meanR:391.7900 R:500.0 loss:15.3707 exploreP:0.0100
Episode:5890 meanR:395.5900 R:500.0 loss:16.7945 exploreP:0.0100
Episode:5891 meanR:399.27

Episode:6003 meanR:396.7000 R:500.0 loss:1.4120 exploreP:0.0100
Episode:6004 meanR:400.4600 R:500.0 loss:18.7897 exploreP:0.0100
Episode:6005 meanR:404.0300 R:500.0 loss:17.6373 exploreP:0.0100
Episode:6006 meanR:404.1800 R:175.0 loss:45.9996 exploreP:0.0100
Episode:6007 meanR:403.9100 R:138.0 loss:1.8526 exploreP:0.0100
Episode:6008 meanR:406.8500 R:500.0 loss:1.8570 exploreP:0.0100
Episode:6009 meanR:405.8700 R:88.0 loss:66.4912 exploreP:0.0100
Episode:6010 meanR:401.4600 R:59.0 loss:31.8263 exploreP:0.0100
Episode:6011 meanR:398.4200 R:130.0 loss:4.5376 exploreP:0.0100
Episode:6012 meanR:398.2800 R:486.0 loss:0.7629 exploreP:0.0100
Episode:6013 meanR:398.2800 R:500.0 loss:8.6410 exploreP:0.0100
Episode:6014 meanR:394.7200 R:144.0 loss:52.1300 exploreP:0.0100
Episode:6015 meanR:391.6200 R:190.0 loss:3.4508 exploreP:0.0100
Episode:6016 meanR:390.1100 R:349.0 loss:1.6799 exploreP:0.0100
Episode:6017 meanR:387.4500 R:234.0 loss:2.5225 exploreP:0.0100
Episode:6018 meanR:384.9000 R:245.0 

Episode:6130 meanR:362.5100 R:223.0 loss:38.7967 exploreP:0.0100
Episode:6131 meanR:361.4100 R:390.0 loss:0.9811 exploreP:0.0100
Episode:6132 meanR:361.7500 R:394.0 loss:0.4918 exploreP:0.0100
Episode:6133 meanR:364.7400 R:500.0 loss:0.8390 exploreP:0.0100
Episode:6134 meanR:361.0400 R:130.0 loss:22.6580 exploreP:0.0100
Episode:6135 meanR:363.6100 R:500.0 loss:1.1483 exploreP:0.0100
Episode:6136 meanR:363.6100 R:500.0 loss:8.3209 exploreP:0.0100
Episode:6137 meanR:363.7600 R:500.0 loss:15.8274 exploreP:0.0100
Episode:6138 meanR:363.7600 R:500.0 loss:11.9851 exploreP:0.0100
Episode:6139 meanR:364.9600 R:500.0 loss:15.0183 exploreP:0.0100
Episode:6140 meanR:364.9600 R:500.0 loss:17.9220 exploreP:0.0100
Episode:6141 meanR:367.7700 R:500.0 loss:10.7583 exploreP:0.0100
Episode:6142 meanR:367.7700 R:500.0 loss:16.9785 exploreP:0.0100
Episode:6143 meanR:367.7700 R:500.0 loss:7.5770 exploreP:0.0100
Episode:6144 meanR:367.7700 R:500.0 loss:15.3907 exploreP:0.0100
Episode:6145 meanR:367.7700 R:5

Episode:6257 meanR:444.3000 R:500.0 loss:14.5364 exploreP:0.0100
Episode:6258 meanR:444.3000 R:500.0 loss:15.6823 exploreP:0.0100
Episode:6259 meanR:444.3000 R:500.0 loss:16.3966 exploreP:0.0100
Episode:6260 meanR:444.3000 R:500.0 loss:16.2353 exploreP:0.0100
Episode:6261 meanR:444.3000 R:500.0 loss:15.7975 exploreP:0.0100
Episode:6262 meanR:444.3000 R:500.0 loss:17.4501 exploreP:0.0100
Episode:6263 meanR:444.3000 R:500.0 loss:17.3884 exploreP:0.0100
Episode:6264 meanR:444.5100 R:500.0 loss:16.4502 exploreP:0.0100
Episode:6265 meanR:443.2000 R:68.0 loss:70.7431 exploreP:0.0100
Episode:6266 meanR:445.9000 R:500.0 loss:17.9178 exploreP:0.0100
Episode:6267 meanR:444.7500 R:385.0 loss:21.0468 exploreP:0.0100
Episode:6268 meanR:440.8400 R:109.0 loss:20.1319 exploreP:0.0100
Episode:6269 meanR:438.8000 R:277.0 loss:19.2796 exploreP:0.0100
Episode:6270 meanR:436.4100 R:261.0 loss:22.6888 exploreP:0.0100
Episode:6271 meanR:434.8800 R:270.0 loss:14.3237 exploreP:0.0100
Episode:6272 meanR:433.120

Episode:6385 meanR:286.0400 R:500.0 loss:14.9727 exploreP:0.0100
Episode:6386 meanR:284.3200 R:328.0 loss:27.7703 exploreP:0.0100
Episode:6387 meanR:285.7400 R:500.0 loss:0.8644 exploreP:0.0100
Episode:6388 meanR:285.7400 R:500.0 loss:16.3450 exploreP:0.0100
Episode:6389 meanR:285.7400 R:500.0 loss:18.9698 exploreP:0.0100
Episode:6390 meanR:285.7400 R:500.0 loss:17.6726 exploreP:0.0100
Episode:6391 meanR:285.7400 R:500.0 loss:16.3734 exploreP:0.0100
Episode:6392 meanR:285.7400 R:500.0 loss:16.4204 exploreP:0.0100
Episode:6393 meanR:281.7200 R:98.0 loss:63.5927 exploreP:0.0100
Episode:6394 meanR:277.3100 R:59.0 loss:27.6730 exploreP:0.0100
Episode:6395 meanR:273.5100 R:120.0 loss:3.5201 exploreP:0.0100
Episode:6396 meanR:270.2800 R:177.0 loss:2.8409 exploreP:0.0100
Episode:6397 meanR:270.2800 R:500.0 loss:1.2632 exploreP:0.0100
Episode:6398 meanR:270.2800 R:500.0 loss:16.0738 exploreP:0.0100
Episode:6399 meanR:270.2800 R:500.0 loss:16.1548 exploreP:0.0100
Episode:6400 meanR:270.2800 R:5

Episode:6512 meanR:386.8100 R:500.0 loss:15.5885 exploreP:0.0100
Episode:6513 meanR:389.6800 R:500.0 loss:15.0330 exploreP:0.0100
Episode:6514 meanR:392.4000 R:500.0 loss:13.7739 exploreP:0.0100
Episode:6515 meanR:394.9900 R:500.0 loss:17.7599 exploreP:0.0100
Episode:6516 meanR:397.5900 R:500.0 loss:16.5340 exploreP:0.0100
Episode:6517 meanR:397.5900 R:500.0 loss:14.5742 exploreP:0.0100
Episode:6518 meanR:394.4100 R:143.0 loss:51.9740 exploreP:0.0100
Episode:6519 meanR:394.4100 R:500.0 loss:10.4341 exploreP:0.0100
Episode:6520 meanR:394.9300 R:283.0 loss:29.2238 exploreP:0.0100
Episode:6521 meanR:394.4300 R:135.0 loss:16.4084 exploreP:0.0100
Episode:6522 meanR:396.9900 R:500.0 loss:8.0857 exploreP:0.0100
Episode:6523 meanR:399.8500 R:500.0 loss:19.7015 exploreP:0.0100
Episode:6524 meanR:400.4900 R:500.0 loss:11.9477 exploreP:0.0100
Episode:6525 meanR:396.6100 R:112.0 loss:13.1870 exploreP:0.0100
Episode:6526 meanR:398.1800 R:407.0 loss:10.3352 exploreP:0.0100
Episode:6527 meanR:394.510

Episode:6640 meanR:252.0200 R:65.0 loss:2.9167 exploreP:0.0100
Episode:6641 meanR:251.4300 R:67.0 loss:1.6176 exploreP:0.0100
Episode:6642 meanR:250.5200 R:65.0 loss:1.5617 exploreP:0.0100
Episode:6643 meanR:246.9800 R:64.0 loss:2.6012 exploreP:0.0100
Episode:6644 meanR:244.7700 R:69.0 loss:5.1150 exploreP:0.0100
Episode:6645 meanR:240.4800 R:71.0 loss:7.6413 exploreP:0.0100
Episode:6646 meanR:239.8300 R:122.0 loss:3.6095 exploreP:0.0100
Episode:6647 meanR:236.9100 R:151.0 loss:5.5030 exploreP:0.0100
Episode:6648 meanR:235.5500 R:139.0 loss:4.7850 exploreP:0.0100
Episode:6649 meanR:235.5600 R:179.0 loss:4.4048 exploreP:0.0100
Episode:6650 meanR:237.9200 R:500.0 loss:1.8120 exploreP:0.0100
Episode:6651 meanR:237.9200 R:500.0 loss:16.3255 exploreP:0.0100
Episode:6652 meanR:236.9500 R:397.0 loss:19.5851 exploreP:0.0100
Episode:6653 meanR:236.5300 R:216.0 loss:4.5955 exploreP:0.0100
Episode:6654 meanR:236.4500 R:151.0 loss:3.0690 exploreP:0.0100
Episode:6655 meanR:238.4700 R:217.0 loss:1.7

Episode:6767 meanR:401.3000 R:287.0 loss:28.6069 exploreP:0.0100
Episode:6768 meanR:404.0500 R:500.0 loss:7.3044 exploreP:0.0100
Episode:6769 meanR:407.5700 R:500.0 loss:14.9486 exploreP:0.0100
Episode:6770 meanR:406.2200 R:112.0 loss:46.8766 exploreP:0.0100
Episode:6771 meanR:402.1200 R:90.0 loss:20.8955 exploreP:0.0100
Episode:6772 meanR:400.7200 R:96.0 loss:12.4844 exploreP:0.0100
Episode:6773 meanR:399.8300 R:114.0 loss:10.0626 exploreP:0.0100
Episode:6774 meanR:395.9200 R:109.0 loss:8.0932 exploreP:0.0100
Episode:6775 meanR:392.1000 R:118.0 loss:10.5831 exploreP:0.0100
Episode:6776 meanR:388.2400 R:114.0 loss:6.7122 exploreP:0.0100
Episode:6777 meanR:384.2700 R:103.0 loss:6.0759 exploreP:0.0100
Episode:6778 meanR:380.3300 R:106.0 loss:5.2870 exploreP:0.0100
Episode:6779 meanR:376.4800 R:115.0 loss:5.0016 exploreP:0.0100
Episode:6780 meanR:372.6400 R:116.0 loss:4.1712 exploreP:0.0100
Episode:6781 meanR:368.8500 R:121.0 loss:3.3845 exploreP:0.0100
Episode:6782 meanR:365.0900 R:124.0

Episode:6895 meanR:348.9100 R:500.0 loss:15.4017 exploreP:0.0100
Episode:6896 meanR:348.9100 R:500.0 loss:12.3059 exploreP:0.0100
Episode:6897 meanR:348.9100 R:500.0 loss:16.5355 exploreP:0.0100
Episode:6898 meanR:353.7900 R:500.0 loss:15.0025 exploreP:0.0100
Episode:6899 meanR:353.7900 R:500.0 loss:18.3769 exploreP:0.0100
Episode:6900 meanR:353.7900 R:500.0 loss:18.6385 exploreP:0.0100
Episode:6901 meanR:353.7900 R:500.0 loss:17.9211 exploreP:0.0100
Episode:6902 meanR:356.1400 R:500.0 loss:17.2270 exploreP:0.0100
Episode:6903 meanR:358.4800 R:500.0 loss:14.7905 exploreP:0.0100
Episode:6904 meanR:360.9400 R:500.0 loss:13.9822 exploreP:0.0100
Episode:6905 meanR:363.4400 R:500.0 loss:14.6760 exploreP:0.0100
Episode:6906 meanR:365.8800 R:500.0 loss:14.5786 exploreP:0.0100
Episode:6907 meanR:365.8800 R:500.0 loss:14.0929 exploreP:0.0100
Episode:6908 meanR:365.8800 R:500.0 loss:15.2409 exploreP:0.0100
Episode:6909 meanR:365.8800 R:500.0 loss:14.9345 exploreP:0.0100
Episode:6910 meanR:364.67

Episode:7022 meanR:398.9500 R:500.0 loss:9.2586 exploreP:0.0100
Episode:7023 meanR:398.9500 R:500.0 loss:17.1281 exploreP:0.0100
Episode:7024 meanR:398.9500 R:500.0 loss:15.2029 exploreP:0.0100
Episode:7025 meanR:398.9500 R:500.0 loss:13.0262 exploreP:0.0100
Episode:7026 meanR:395.4200 R:147.0 loss:50.5861 exploreP:0.0100
Episode:7027 meanR:395.4200 R:500.0 loss:2.5526 exploreP:0.0100
Episode:7028 meanR:395.4200 R:500.0 loss:15.8257 exploreP:0.0100
Episode:7029 meanR:395.4200 R:500.0 loss:13.9885 exploreP:0.0100
Episode:7030 meanR:395.4200 R:500.0 loss:15.8657 exploreP:0.0100
Episode:7031 meanR:390.5500 R:13.0 loss:68.3951 exploreP:0.0100
Episode:7032 meanR:390.5500 R:500.0 loss:14.7665 exploreP:0.0100
Episode:7033 meanR:395.4300 R:500.0 loss:15.4911 exploreP:0.0100
Episode:7034 meanR:391.8900 R:146.0 loss:56.4794 exploreP:0.0100
Episode:7035 meanR:388.2300 R:134.0 loss:4.0538 exploreP:0.0100
Episode:7036 meanR:384.6300 R:140.0 loss:3.6911 exploreP:0.0100
Episode:7037 meanR:385.8200 R:

Episode:7150 meanR:287.7500 R:500.0 loss:3.5905 exploreP:0.0100
Episode:7151 meanR:287.7500 R:500.0 loss:14.2417 exploreP:0.0100
Episode:7152 meanR:287.7500 R:500.0 loss:8.2409 exploreP:0.0100
Episode:7153 meanR:287.7500 R:500.0 loss:16.4460 exploreP:0.0100
Episode:7154 meanR:287.7500 R:500.0 loss:18.5462 exploreP:0.0100
Episode:7155 meanR:284.4400 R:169.0 loss:25.0204 exploreP:0.0100
Episode:7156 meanR:279.5500 R:11.0 loss:28.5565 exploreP:0.0100
Episode:7157 meanR:278.3000 R:11.0 loss:42.9489 exploreP:0.0100
Episode:7158 meanR:281.5500 R:500.0 loss:10.2987 exploreP:0.0100
Episode:7159 meanR:284.7300 R:500.0 loss:17.4661 exploreP:0.0100
Episode:7160 meanR:284.7300 R:500.0 loss:15.6865 exploreP:0.0100
Episode:7161 meanR:284.7300 R:500.0 loss:17.2544 exploreP:0.0100
Episode:7162 meanR:284.7300 R:500.0 loss:12.7371 exploreP:0.0100
Episode:7163 meanR:287.8800 R:500.0 loss:18.7077 exploreP:0.0100
Episode:7164 meanR:287.8800 R:500.0 loss:17.2335 exploreP:0.0100
Episode:7165 meanR:287.8800 R

Episode:7277 meanR:364.7700 R:315.0 loss:104.3800 exploreP:0.0100
Episode:7278 meanR:363.6600 R:389.0 loss:53.1363 exploreP:0.0100
Episode:7279 meanR:364.9300 R:500.0 loss:25.7395 exploreP:0.0100
Episode:7280 meanR:366.5700 R:417.0 loss:29.8620 exploreP:0.0100
Episode:7281 meanR:369.5500 R:500.0 loss:23.1979 exploreP:0.0100
Episode:7282 meanR:372.5800 R:500.0 loss:42.6243 exploreP:0.0100
Episode:7283 meanR:375.9300 R:500.0 loss:41.3550 exploreP:0.0100
Episode:7284 meanR:379.4600 R:500.0 loss:43.5550 exploreP:0.0100
Episode:7285 meanR:383.1100 R:500.0 loss:36.6633 exploreP:0.0100
Episode:7286 meanR:386.4100 R:500.0 loss:24.0396 exploreP:0.0100
Episode:7287 meanR:389.9600 R:500.0 loss:52.9357 exploreP:0.0100
Episode:7288 meanR:392.0800 R:500.0 loss:49.6595 exploreP:0.0100
Episode:7289 meanR:392.0800 R:500.0 loss:43.1244 exploreP:0.0100
Episode:7290 meanR:393.0700 R:500.0 loss:100.8210 exploreP:0.0100
Episode:7291 meanR:396.7200 R:500.0 loss:87.2193 exploreP:0.0100
Episode:7292 meanR:400.

Episode:7404 meanR:417.9300 R:136.0 loss:10.9779 exploreP:0.0100
Episode:7405 meanR:414.2900 R:136.0 loss:10.5255 exploreP:0.0100
Episode:7406 meanR:410.4300 R:114.0 loss:10.7144 exploreP:0.0100
Episode:7407 meanR:406.7200 R:129.0 loss:7.9109 exploreP:0.0100
Episode:7408 meanR:402.9500 R:123.0 loss:9.4394 exploreP:0.0100
Episode:7409 meanR:399.0600 R:111.0 loss:9.0387 exploreP:0.0100
Episode:7410 meanR:395.0900 R:103.0 loss:11.3455 exploreP:0.0100
Episode:7411 meanR:391.2300 R:114.0 loss:11.0146 exploreP:0.0100
Episode:7412 meanR:387.3500 R:112.0 loss:6.9883 exploreP:0.0100
Episode:7413 meanR:383.4000 R:105.0 loss:6.4428 exploreP:0.0100
Episode:7414 meanR:379.5000 R:110.0 loss:6.5298 exploreP:0.0100
Episode:7415 meanR:374.7800 R:28.0 loss:12.1589 exploreP:0.0100
Episode:7416 meanR:370.2400 R:46.0 loss:41.0477 exploreP:0.0100
Episode:7417 meanR:365.5800 R:34.0 loss:62.2123 exploreP:0.0100
Episode:7418 meanR:360.9100 R:33.0 loss:80.5185 exploreP:0.0100
Episode:7419 meanR:357.4200 R:35.0 

Episode:7531 meanR:309.7500 R:500.0 loss:9.2945 exploreP:0.0100
Episode:7532 meanR:314.1300 R:500.0 loss:19.3880 exploreP:0.0100
Episode:7533 meanR:317.8000 R:500.0 loss:14.8874 exploreP:0.0100
Episode:7534 meanR:321.3300 R:500.0 loss:20.2739 exploreP:0.0100
Episode:7535 meanR:325.2800 R:500.0 loss:10.8708 exploreP:0.0100
Episode:7536 meanR:329.1800 R:500.0 loss:17.0411 exploreP:0.0100
Episode:7537 meanR:332.9300 R:500.0 loss:14.5867 exploreP:0.0100
Episode:7538 meanR:337.1800 R:500.0 loss:11.7040 exploreP:0.0100
Episode:7539 meanR:341.3300 R:500.0 loss:10.6510 exploreP:0.0100
Episode:7540 meanR:345.0600 R:500.0 loss:17.4376 exploreP:0.0100
Episode:7541 meanR:349.2800 R:500.0 loss:18.4092 exploreP:0.0100
Episode:7542 meanR:353.1700 R:500.0 loss:9.5973 exploreP:0.0100
Episode:7543 meanR:357.0600 R:500.0 loss:21.2546 exploreP:0.0100
Episode:7544 meanR:361.0100 R:500.0 loss:18.4128 exploreP:0.0100
Episode:7545 meanR:365.0500 R:500.0 loss:17.5179 exploreP:0.0100
Episode:7546 meanR:369.2600

Episode:7659 meanR:9.6500 R:9.0 loss:396.8948 exploreP:0.0100
Episode:7660 meanR:9.6500 R:10.0 loss:399.7632 exploreP:0.0100
Episode:7661 meanR:9.6400 R:9.0 loss:388.2379 exploreP:0.0100
Episode:7662 meanR:9.6400 R:10.0 loss:397.3186 exploreP:0.0100
Episode:7663 meanR:9.6300 R:9.0 loss:400.8425 exploreP:0.0100
Episode:7664 meanR:9.6300 R:10.0 loss:403.7435 exploreP:0.0100
Episode:7665 meanR:9.6300 R:10.0 loss:409.8669 exploreP:0.0100
Episode:7666 meanR:9.6200 R:9.0 loss:417.7964 exploreP:0.0100
Episode:7667 meanR:9.6100 R:9.0 loss:430.0117 exploreP:0.0100
Episode:7668 meanR:9.6000 R:9.0 loss:440.1942 exploreP:0.0100
Episode:7669 meanR:9.6000 R:9.0 loss:451.2596 exploreP:0.0100
Episode:7670 meanR:9.6000 R:9.0 loss:457.5360 exploreP:0.0100
Episode:7671 meanR:9.6000 R:10.0 loss:456.4818 exploreP:0.0100
Episode:7672 meanR:9.6000 R:10.0 loss:454.7762 exploreP:0.0100
Episode:7673 meanR:9.5900 R:9.0 loss:445.8526 exploreP:0.0100
Episode:7674 meanR:9.5700 R:9.0 loss:461.7431 exploreP:0.0100
Ep

Episode:7790 meanR:9.5800 R:9.0 loss:339.8403 exploreP:0.0100
Episode:7791 meanR:9.5800 R:9.0 loss:326.3210 exploreP:0.0100
Episode:7792 meanR:9.5900 R:10.0 loss:340.6141 exploreP:0.0100
Episode:7793 meanR:9.6000 R:10.0 loss:336.5013 exploreP:0.0100
Episode:7794 meanR:9.5900 R:9.0 loss:335.4317 exploreP:0.0100
Episode:7795 meanR:9.5900 R:10.0 loss:336.6073 exploreP:0.0100
Episode:7796 meanR:9.5800 R:10.0 loss:331.1174 exploreP:0.0100
Episode:7797 meanR:9.5700 R:8.0 loss:323.6944 exploreP:0.0100
Episode:7798 meanR:9.5800 R:10.0 loss:320.9638 exploreP:0.0100
Episode:7799 meanR:9.5800 R:10.0 loss:320.4549 exploreP:0.0100
Episode:7800 meanR:9.5800 R:9.0 loss:322.4206 exploreP:0.0100
Episode:7801 meanR:9.5600 R:9.0 loss:317.9604 exploreP:0.0100
Episode:7802 meanR:9.5500 R:8.0 loss:321.4041 exploreP:0.0100
Episode:7803 meanR:9.5700 R:11.0 loss:306.3543 exploreP:0.0100
Episode:7804 meanR:9.5800 R:10.0 loss:306.4927 exploreP:0.0100
Episode:7805 meanR:9.5900 R:10.0 loss:320.3856 exploreP:0.0100

Episode:7921 meanR:9.7100 R:10.0 loss:209.8553 exploreP:0.0100
Episode:7922 meanR:9.7300 R:11.0 loss:202.3216 exploreP:0.0100
Episode:7923 meanR:9.7400 R:10.0 loss:201.4796 exploreP:0.0100
Episode:7924 meanR:9.7400 R:10.0 loss:204.3227 exploreP:0.0100
Episode:7925 meanR:9.7500 R:11.0 loss:212.7112 exploreP:0.0100
Episode:7926 meanR:9.7500 R:10.0 loss:213.7268 exploreP:0.0100
Episode:7927 meanR:9.7500 R:10.0 loss:213.5690 exploreP:0.0100
Episode:7928 meanR:9.7700 R:11.0 loss:209.4829 exploreP:0.0100
Episode:7929 meanR:9.7600 R:8.0 loss:202.1248 exploreP:0.0100
Episode:7930 meanR:9.7700 R:10.0 loss:198.2927 exploreP:0.0100
Episode:7931 meanR:9.7800 R:10.0 loss:201.8738 exploreP:0.0100
Episode:7932 meanR:9.7700 R:9.0 loss:209.4173 exploreP:0.0100
Episode:7933 meanR:9.7800 R:11.0 loss:203.2881 exploreP:0.0100
Episode:7934 meanR:9.7900 R:11.0 loss:192.5194 exploreP:0.0100
Episode:7935 meanR:9.7900 R:10.0 loss:191.4413 exploreP:0.0100
Episode:7936 meanR:9.7900 R:9.0 loss:191.0668 exploreP:0.

Episode:8051 meanR:11.4000 R:10.0 loss:116.8663 exploreP:0.0100
Episode:8052 meanR:11.4100 R:10.0 loss:114.6934 exploreP:0.0100
Episode:8053 meanR:11.4100 R:10.0 loss:110.8352 exploreP:0.0100
Episode:8054 meanR:11.3900 R:9.0 loss:111.1905 exploreP:0.0100
Episode:8055 meanR:11.4000 R:10.0 loss:109.4156 exploreP:0.0100
Episode:8056 meanR:11.4000 R:10.0 loss:111.2543 exploreP:0.0100
Episode:8057 meanR:11.4000 R:10.0 loss:116.1463 exploreP:0.0100
Episode:8058 meanR:11.4100 R:10.0 loss:120.5551 exploreP:0.0100
Episode:8059 meanR:11.4100 R:10.0 loss:117.1601 exploreP:0.0100
Episode:8060 meanR:11.4200 R:10.0 loss:122.8081 exploreP:0.0100
Episode:8061 meanR:11.4100 R:10.0 loss:129.0286 exploreP:0.0100
Episode:8062 meanR:11.4200 R:10.0 loss:118.5031 exploreP:0.0100
Episode:8063 meanR:11.4400 R:11.0 loss:113.3453 exploreP:0.0100
Episode:8064 meanR:11.4600 R:11.0 loss:108.8074 exploreP:0.0100
Episode:8065 meanR:11.4800 R:11.0 loss:111.1652 exploreP:0.0100
Episode:8066 meanR:11.4800 R:10.0 loss:11

Episode:8182 meanR:11.5000 R:10.0 loss:72.0781 exploreP:0.0100
Episode:8183 meanR:11.5200 R:12.0 loss:76.1990 exploreP:0.0100
Episode:8184 meanR:11.5300 R:11.0 loss:78.5603 exploreP:0.0100
Episode:8185 meanR:11.5200 R:9.0 loss:74.9937 exploreP:0.0100
Episode:8186 meanR:11.5200 R:11.0 loss:75.4985 exploreP:0.0100
Episode:8187 meanR:11.5300 R:10.0 loss:71.9942 exploreP:0.0100
Episode:8188 meanR:11.5300 R:10.0 loss:79.6410 exploreP:0.0100
Episode:8189 meanR:11.5300 R:10.0 loss:75.6694 exploreP:0.0100
Episode:8190 meanR:11.5400 R:10.0 loss:70.2460 exploreP:0.0100
Episode:8191 meanR:11.5400 R:10.0 loss:70.8420 exploreP:0.0100
Episode:8192 meanR:11.5500 R:10.0 loss:70.1698 exploreP:0.0100
Episode:8193 meanR:10.7600 R:10.0 loss:69.2558 exploreP:0.0100
Episode:8194 meanR:10.7700 R:10.0 loss:61.6463 exploreP:0.0100
Episode:8195 meanR:10.7700 R:10.0 loss:59.0422 exploreP:0.0100
Episode:8196 meanR:10.7700 R:10.0 loss:57.1163 exploreP:0.0100
Episode:8197 meanR:10.7800 R:10.0 loss:53.7854 exploreP:

Episode:8313 meanR:12.3900 R:11.0 loss:37.9186 exploreP:0.0100
Episode:8314 meanR:12.4000 R:10.0 loss:39.3439 exploreP:0.0100
Episode:8315 meanR:12.4000 R:9.0 loss:41.8694 exploreP:0.0100
Episode:8316 meanR:12.4200 R:11.0 loss:41.8912 exploreP:0.0100
Episode:8317 meanR:12.4200 R:10.0 loss:45.5795 exploreP:0.0100
Episode:8318 meanR:12.4200 R:10.0 loss:48.4321 exploreP:0.0100
Episode:8319 meanR:12.4500 R:12.0 loss:45.7453 exploreP:0.0100
Episode:8320 meanR:12.4700 R:11.0 loss:43.5790 exploreP:0.0100
Episode:8321 meanR:13.3700 R:100.0 loss:41.5342 exploreP:0.0100
Episode:8322 meanR:13.3700 R:10.0 loss:33.7975 exploreP:0.0100
Episode:8323 meanR:13.3800 R:10.0 loss:48.1200 exploreP:0.0100
Episode:8324 meanR:13.3800 R:11.0 loss:54.6799 exploreP:0.0100
Episode:8325 meanR:12.7000 R:9.0 loss:32.6343 exploreP:0.0100
Episode:8326 meanR:12.6800 R:9.0 loss:34.6425 exploreP:0.0100
Episode:8327 meanR:12.6900 R:10.0 loss:35.1068 exploreP:0.0100
Episode:8328 meanR:13.7700 R:118.0 loss:34.4142 exploreP:

Episode:8444 meanR:22.4400 R:91.0 loss:15.5609 exploreP:0.0100
Episode:8445 meanR:23.3400 R:99.0 loss:27.2837 exploreP:0.0100
Episode:8446 meanR:23.3400 R:12.0 loss:23.4089 exploreP:0.0100
Episode:8447 meanR:23.3500 R:11.0 loss:23.7295 exploreP:0.0100
Episode:8448 meanR:23.6300 R:38.0 loss:37.0398 exploreP:0.0100
Episode:8449 meanR:23.6400 R:11.0 loss:30.1974 exploreP:0.0100
Episode:8450 meanR:23.6600 R:12.0 loss:23.5361 exploreP:0.0100
Episode:8451 meanR:24.5900 R:103.0 loss:24.9434 exploreP:0.0100
Episode:8452 meanR:25.8100 R:132.0 loss:27.6520 exploreP:0.0100
Episode:8453 meanR:25.8200 R:12.0 loss:19.2816 exploreP:0.0100
Episode:8454 meanR:24.9100 R:13.0 loss:28.5508 exploreP:0.0100
Episode:8455 meanR:25.9400 R:114.0 loss:24.6405 exploreP:0.0100
Episode:8456 meanR:26.9800 R:115.0 loss:23.8030 exploreP:0.0100
Episode:8457 meanR:28.2200 R:135.0 loss:27.6634 exploreP:0.0100
Episode:8458 meanR:28.6100 R:135.0 loss:26.4992 exploreP:0.0100
Episode:8459 meanR:29.5800 R:108.0 loss:15.9280 e

Episode:8572 meanR:237.9700 R:157.0 loss:18.8848 exploreP:0.0100
Episode:8573 meanR:237.0300 R:149.0 loss:11.5094 exploreP:0.0100
Episode:8574 meanR:237.4300 R:165.0 loss:9.2500 exploreP:0.0100
Episode:8575 meanR:237.1000 R:136.0 loss:9.9521 exploreP:0.0100
Episode:8576 meanR:236.9600 R:136.0 loss:10.3506 exploreP:0.0100
Episode:8577 meanR:236.3400 R:149.0 loss:8.3058 exploreP:0.0100
Episode:8578 meanR:236.6700 R:157.0 loss:10.5419 exploreP:0.0100
Episode:8579 meanR:236.7800 R:140.0 loss:9.9344 exploreP:0.0100
Episode:8580 meanR:237.3500 R:170.0 loss:6.8871 exploreP:0.0100
Episode:8581 meanR:238.9000 R:166.0 loss:5.8289 exploreP:0.0100
Episode:8582 meanR:238.8800 R:183.0 loss:5.6280 exploreP:0.0100
Episode:8583 meanR:238.5400 R:168.0 loss:6.0034 exploreP:0.0100
Episode:8584 meanR:238.5300 R:146.0 loss:5.7840 exploreP:0.0100
Episode:8585 meanR:238.9700 R:188.0 loss:4.2435 exploreP:0.0100
Episode:8586 meanR:238.9400 R:194.0 loss:4.4237 exploreP:0.0100
Episode:8587 meanR:239.4300 R:195.0 

Episode:8699 meanR:306.7000 R:500.0 loss:490.1024 exploreP:0.0100
Episode:8700 meanR:306.7000 R:500.0 loss:435.4170 exploreP:0.0100
Episode:8701 meanR:308.6200 R:500.0 loss:274.3095 exploreP:0.0100
Episode:8702 meanR:311.5800 R:500.0 loss:285.2953 exploreP:0.0100
Episode:8703 meanR:313.8500 R:500.0 loss:280.1034 exploreP:0.0100
Episode:8704 meanR:315.9000 R:500.0 loss:331.6079 exploreP:0.0100
Episode:8705 meanR:317.3200 R:500.0 loss:291.7228 exploreP:0.0100
Episode:8706 meanR:319.0100 R:500.0 loss:300.9991 exploreP:0.0100
Episode:8707 meanR:319.0100 R:500.0 loss:282.5147 exploreP:0.0100
Episode:8708 meanR:319.0100 R:500.0 loss:326.0841 exploreP:0.0100
Episode:8709 meanR:319.0100 R:500.0 loss:387.2720 exploreP:0.0100
Episode:8710 meanR:319.1800 R:500.0 loss:306.1843 exploreP:0.0100
Episode:8711 meanR:321.8800 R:500.0 loss:312.0161 exploreP:0.0100
Episode:8712 meanR:324.3000 R:500.0 loss:417.0329 exploreP:0.0100
Episode:8713 meanR:324.7600 R:500.0 loss:283.7989 exploreP:0.0100
Episode:87

Episode:8825 meanR:162.1700 R:126.0 loss:68.2411 exploreP:0.0100
Episode:8826 meanR:161.9400 R:144.0 loss:90.8940 exploreP:0.0100
Episode:8827 meanR:159.6900 R:111.0 loss:45.6361 exploreP:0.0100
Episode:8828 meanR:158.9400 R:119.0 loss:45.8171 exploreP:0.0100
Episode:8829 meanR:158.8200 R:148.0 loss:99.6753 exploreP:0.0100
Episode:8830 meanR:157.6400 R:128.0 loss:56.0719 exploreP:0.0100
Episode:8831 meanR:156.7600 R:133.0 loss:65.6874 exploreP:0.0100
Episode:8832 meanR:155.0900 R:124.0 loss:47.7965 exploreP:0.0100
Episode:8833 meanR:154.1400 R:110.0 loss:36.9897 exploreP:0.0100
Episode:8834 meanR:154.1800 R:139.0 loss:64.0425 exploreP:0.0100
Episode:8835 meanR:153.0300 R:129.0 loss:49.6926 exploreP:0.0100
Episode:8836 meanR:153.9200 R:119.0 loss:35.6395 exploreP:0.0100
Episode:8837 meanR:151.7100 R:144.0 loss:80.5666 exploreP:0.0100
Episode:8838 meanR:151.1500 R:123.0 loss:37.9460 exploreP:0.0100
Episode:8839 meanR:150.7500 R:118.0 loss:34.4267 exploreP:0.0100
Episode:8840 meanR:150.23

Episode:8952 meanR:113.6800 R:105.0 loss:33.1435 exploreP:0.0100
Episode:8953 meanR:113.5000 R:102.0 loss:35.6948 exploreP:0.0100
Episode:8954 meanR:113.4100 R:108.0 loss:33.3512 exploreP:0.0100
Episode:8955 meanR:112.4400 R:28.0 loss:44.6739 exploreP:0.0100
Episode:8956 meanR:112.3100 R:97.0 loss:65.7538 exploreP:0.0100
Episode:8957 meanR:111.4400 R:22.0 loss:70.0435 exploreP:0.0100
Episode:8958 meanR:110.4300 R:19.0 loss:89.1595 exploreP:0.0100
Episode:8959 meanR:109.4300 R:19.0 loss:103.3641 exploreP:0.0100
Episode:8960 meanR:108.9000 R:79.0 loss:128.4037 exploreP:0.0100
Episode:8961 meanR:108.5900 R:90.0 loss:64.2895 exploreP:0.0100
Episode:8962 meanR:107.8800 R:41.0 loss:55.5769 exploreP:0.0100
Episode:8963 meanR:107.5600 R:91.0 loss:92.0954 exploreP:0.0100
Episode:8964 meanR:107.3800 R:94.0 loss:57.4228 exploreP:0.0100
Episode:8965 meanR:106.4500 R:30.0 loss:54.3248 exploreP:0.0100
Episode:8966 meanR:106.2500 R:92.0 loss:75.6435 exploreP:0.0100
Episode:8967 meanR:106.1300 R:101.0

Episode:9081 meanR:27.6600 R:27.0 loss:134.3048 exploreP:0.0100
Episode:9082 meanR:27.6100 R:28.0 loss:126.4338 exploreP:0.0100
Episode:9083 meanR:27.5400 R:23.0 loss:129.5249 exploreP:0.0100
Episode:9084 meanR:27.3800 R:32.0 loss:111.6347 exploreP:0.0100
Episode:9085 meanR:27.1700 R:14.0 loss:128.0482 exploreP:0.0100
Episode:9086 meanR:27.2400 R:32.0 loss:132.3584 exploreP:0.0100
Episode:9087 meanR:27.2500 R:34.0 loss:125.1004 exploreP:0.0100
Episode:9088 meanR:26.6300 R:16.0 loss:129.3199 exploreP:0.0100
Episode:9089 meanR:26.3900 R:23.0 loss:114.4964 exploreP:0.0100
Episode:9090 meanR:26.3300 R:17.0 loss:117.2379 exploreP:0.0100
Episode:9091 meanR:26.3200 R:28.0 loss:116.2937 exploreP:0.0100
Episode:9092 meanR:26.0600 R:28.0 loss:122.7289 exploreP:0.0100
Episode:9093 meanR:26.0100 R:25.0 loss:138.3431 exploreP:0.0100
Episode:9094 meanR:25.6400 R:29.0 loss:125.9951 exploreP:0.0100
Episode:9095 meanR:25.4000 R:14.0 loss:116.9641 exploreP:0.0100
Episode:9096 meanR:25.4100 R:28.0 loss:1

Episode:9210 meanR:12.7900 R:11.0 loss:156.0608 exploreP:0.0100
Episode:9211 meanR:12.6900 R:10.0 loss:148.5357 exploreP:0.0100
Episode:9212 meanR:12.6000 R:11.0 loss:149.0518 exploreP:0.0100
Episode:9213 meanR:12.5100 R:10.0 loss:158.2444 exploreP:0.0100
Episode:9214 meanR:12.4300 R:9.0 loss:154.4196 exploreP:0.0100
Episode:9215 meanR:12.3800 R:9.0 loss:147.7027 exploreP:0.0100
Episode:9216 meanR:12.3400 R:9.0 loss:138.8182 exploreP:0.0100
Episode:9217 meanR:12.1900 R:9.0 loss:149.0730 exploreP:0.0100
Episode:9218 meanR:12.1200 R:12.0 loss:153.1144 exploreP:0.0100
Episode:9219 meanR:12.0000 R:9.0 loss:132.6797 exploreP:0.0100
Episode:9220 meanR:11.8900 R:9.0 loss:133.1938 exploreP:0.0100
Episode:9221 meanR:11.8600 R:10.0 loss:144.6580 exploreP:0.0100
Episode:9222 meanR:11.7800 R:9.0 loss:116.0115 exploreP:0.0100
Episode:9223 meanR:11.7400 R:10.0 loss:131.4564 exploreP:0.0100
Episode:9224 meanR:11.7100 R:11.0 loss:136.0298 exploreP:0.0100
Episode:9225 meanR:11.6900 R:11.0 loss:120.0479

Episode:9339 meanR:10.3800 R:10.0 loss:124.0934 exploreP:0.0100
Episode:9340 meanR:10.3600 R:10.0 loss:120.8852 exploreP:0.0100
Episode:9341 meanR:10.3700 R:10.0 loss:106.4479 exploreP:0.0100
Episode:9342 meanR:10.3700 R:11.0 loss:113.6374 exploreP:0.0100
Episode:9343 meanR:10.3900 R:12.0 loss:100.5196 exploreP:0.0100
Episode:9344 meanR:10.3800 R:9.0 loss:120.0467 exploreP:0.0100
Episode:9345 meanR:10.3900 R:11.0 loss:119.1568 exploreP:0.0100
Episode:9346 meanR:10.3900 R:10.0 loss:119.9795 exploreP:0.0100
Episode:9347 meanR:10.4000 R:10.0 loss:108.2780 exploreP:0.0100
Episode:9348 meanR:10.4100 R:10.0 loss:113.3219 exploreP:0.0100
Episode:9349 meanR:10.4100 R:11.0 loss:107.3142 exploreP:0.0100
Episode:9350 meanR:10.4200 R:13.0 loss:115.8530 exploreP:0.0100
Episode:9351 meanR:10.4200 R:10.0 loss:119.5906 exploreP:0.0100
Episode:9352 meanR:10.4200 R:10.0 loss:101.7422 exploreP:0.0100
Episode:9353 meanR:10.4200 R:11.0 loss:121.1852 exploreP:0.0100
Episode:9354 meanR:10.4400 R:11.0 loss:13

Episode:9468 meanR:10.3900 R:11.0 loss:114.7386 exploreP:0.0100
Episode:9469 meanR:10.3900 R:10.0 loss:108.9081 exploreP:0.0100
Episode:9470 meanR:10.3700 R:10.0 loss:88.8151 exploreP:0.0100
Episode:9471 meanR:10.3400 R:9.0 loss:86.7991 exploreP:0.0100
Episode:9472 meanR:10.3000 R:9.0 loss:81.7743 exploreP:0.0100
Episode:9473 meanR:10.3200 R:13.0 loss:88.9169 exploreP:0.0100
Episode:9474 meanR:10.3300 R:10.0 loss:87.8223 exploreP:0.0100
Episode:9475 meanR:10.3500 R:13.0 loss:92.6484 exploreP:0.0100
Episode:9476 meanR:10.3500 R:10.0 loss:84.0314 exploreP:0.0100
Episode:9477 meanR:10.3600 R:11.0 loss:72.7849 exploreP:0.0100
Episode:9478 meanR:10.3700 R:12.0 loss:83.1828 exploreP:0.0100
Episode:9479 meanR:10.3600 R:10.0 loss:80.2485 exploreP:0.0100
Episode:9480 meanR:10.3700 R:12.0 loss:84.3650 exploreP:0.0100
Episode:9481 meanR:10.3700 R:10.0 loss:93.1517 exploreP:0.0100
Episode:9482 meanR:10.3700 R:10.0 loss:93.3506 exploreP:0.0100
Episode:9483 meanR:10.3800 R:11.0 loss:95.7676 exploreP

Episode:9598 meanR:10.6400 R:10.0 loss:107.9233 exploreP:0.0100
Episode:9599 meanR:10.6400 R:12.0 loss:98.0715 exploreP:0.0100
Episode:9600 meanR:10.6300 R:10.0 loss:102.4915 exploreP:0.0100
Episode:9601 meanR:10.6700 R:13.0 loss:113.2382 exploreP:0.0100
Episode:9602 meanR:10.6700 R:10.0 loss:105.3633 exploreP:0.0100
Episode:9603 meanR:10.6800 R:10.0 loss:88.0075 exploreP:0.0100
Episode:9604 meanR:10.6700 R:10.0 loss:78.4708 exploreP:0.0100
Episode:9605 meanR:10.6600 R:10.0 loss:81.6054 exploreP:0.0100
Episode:9606 meanR:10.6400 R:9.0 loss:90.4547 exploreP:0.0100
Episode:9607 meanR:10.6600 R:12.0 loss:65.3631 exploreP:0.0100
Episode:9608 meanR:10.6600 R:10.0 loss:62.8840 exploreP:0.0100
Episode:9609 meanR:10.6600 R:12.0 loss:94.0353 exploreP:0.0100
Episode:9610 meanR:10.6300 R:10.0 loss:72.7323 exploreP:0.0100
Episode:9611 meanR:10.6400 R:11.0 loss:69.5312 exploreP:0.0100
Episode:9612 meanR:10.6100 R:9.0 loss:69.0617 exploreP:0.0100
Episode:9613 meanR:10.6200 R:11.0 loss:77.4883 explor

Episode:9729 meanR:10.9400 R:12.0 loss:83.8932 exploreP:0.0100
Episode:9730 meanR:10.9200 R:11.0 loss:87.7905 exploreP:0.0100
Episode:9731 meanR:10.9300 R:11.0 loss:85.2508 exploreP:0.0100
Episode:9732 meanR:10.9200 R:11.0 loss:79.6426 exploreP:0.0100
Episode:9733 meanR:10.9400 R:12.0 loss:69.9566 exploreP:0.0100
Episode:9734 meanR:10.9400 R:10.0 loss:73.0754 exploreP:0.0100
Episode:9735 meanR:10.9300 R:11.0 loss:82.6561 exploreP:0.0100
Episode:9736 meanR:10.9400 R:12.0 loss:79.7155 exploreP:0.0100
Episode:9737 meanR:10.9300 R:11.0 loss:71.0364 exploreP:0.0100
Episode:9738 meanR:10.9500 R:13.0 loss:64.1826 exploreP:0.0100
Episode:9739 meanR:10.9700 R:13.0 loss:68.7486 exploreP:0.0100
Episode:9740 meanR:10.9600 R:11.0 loss:62.4898 exploreP:0.0100
Episode:9741 meanR:10.9600 R:11.0 loss:60.9667 exploreP:0.0100
Episode:9742 meanR:10.9700 R:12.0 loss:68.3374 exploreP:0.0100
Episode:9743 meanR:10.9800 R:12.0 loss:64.8854 exploreP:0.0100
Episode:9744 meanR:10.9800 R:11.0 loss:49.2699 exploreP

Episode:9860 meanR:12.3000 R:11.0 loss:64.8461 exploreP:0.0100
Episode:9861 meanR:12.3300 R:14.0 loss:58.4964 exploreP:0.0100
Episode:9862 meanR:12.3700 R:14.0 loss:55.6682 exploreP:0.0100
Episode:9863 meanR:12.4400 R:18.0 loss:61.3307 exploreP:0.0100
Episode:9864 meanR:12.4700 R:14.0 loss:49.8547 exploreP:0.0100
Episode:9865 meanR:12.4700 R:11.0 loss:56.6813 exploreP:0.0100
Episode:9866 meanR:12.5100 R:17.0 loss:65.2144 exploreP:0.0100
Episode:9867 meanR:12.5800 R:19.0 loss:55.3027 exploreP:0.0100
Episode:9868 meanR:12.6800 R:20.0 loss:63.4294 exploreP:0.0100
Episode:9869 meanR:12.7300 R:16.0 loss:60.9921 exploreP:0.0100
Episode:9870 meanR:12.7900 R:18.0 loss:54.8390 exploreP:0.0100
Episode:9871 meanR:12.8300 R:14.0 loss:74.8411 exploreP:0.0100
Episode:9872 meanR:12.8500 R:15.0 loss:55.2719 exploreP:0.0100
Episode:9873 meanR:12.8600 R:12.0 loss:54.4803 exploreP:0.0100
Episode:9874 meanR:12.8900 R:14.0 loss:50.9166 exploreP:0.0100
Episode:9875 meanR:12.9000 R:13.0 loss:67.1877 exploreP

Episode:9990 meanR:86.7800 R:135.0 loss:17.8143 exploreP:0.0100
Episode:9991 meanR:88.0800 R:142.0 loss:26.0827 exploreP:0.0100
Episode:9992 meanR:89.2000 R:127.0 loss:22.2250 exploreP:0.0100
Episode:9993 meanR:90.3300 R:129.0 loss:22.4959 exploreP:0.0100
Episode:9994 meanR:91.4100 R:134.0 loss:19.7247 exploreP:0.0100
Episode:9995 meanR:92.5000 R:124.0 loss:24.6231 exploreP:0.0100
Episode:9996 meanR:93.5800 R:124.0 loss:24.2365 exploreP:0.0100
Episode:9997 meanR:94.4300 R:128.0 loss:18.1717 exploreP:0.0100
Episode:9998 meanR:95.4700 R:130.0 loss:26.2247 exploreP:0.0100
Episode:9999 meanR:96.5400 R:134.0 loss:23.8426 exploreP:0.0100
Episode:10000 meanR:97.6600 R:135.0 loss:28.3650 exploreP:0.0100
Episode:10001 meanR:98.8400 R:135.0 loss:25.5142 exploreP:0.0100
Episode:10002 meanR:99.9300 R:127.0 loss:20.3254 exploreP:0.0100
Episode:10003 meanR:101.0600 R:134.0 loss:21.4628 exploreP:0.0100
Episode:10004 meanR:102.2400 R:143.0 loss:21.1875 exploreP:0.0100
Episode:10005 meanR:103.1900 R:12

Episode:10115 meanR:117.6300 R:114.0 loss:17.6504 exploreP:0.0100
Episode:10116 meanR:117.3500 R:108.0 loss:21.3368 exploreP:0.0100
Episode:10117 meanR:117.2500 R:114.0 loss:20.1057 exploreP:0.0100
Episode:10118 meanR:117.0700 R:114.0 loss:18.6719 exploreP:0.0100
Episode:10119 meanR:116.9300 R:104.0 loss:18.8901 exploreP:0.0100
Episode:10120 meanR:116.7600 R:112.0 loss:19.6432 exploreP:0.0100
Episode:10121 meanR:116.5800 R:110.0 loss:18.2379 exploreP:0.0100
Episode:10122 meanR:116.3700 R:103.0 loss:16.2269 exploreP:0.0100
Episode:10123 meanR:116.2200 R:116.0 loss:20.6677 exploreP:0.0100
Episode:10124 meanR:116.2000 R:114.0 loss:21.2287 exploreP:0.0100
Episode:10125 meanR:115.9200 R:112.0 loss:19.5473 exploreP:0.0100
Episode:10126 meanR:115.7100 R:108.0 loss:19.3608 exploreP:0.0100
Episode:10127 meanR:115.5200 R:104.0 loss:25.2702 exploreP:0.0100
Episode:10128 meanR:115.4700 R:115.0 loss:18.5292 exploreP:0.0100
Episode:10129 meanR:115.4400 R:114.0 loss:18.3973 exploreP:0.0100
Episode:10

Episode:10240 meanR:111.4600 R:125.0 loss:2.2452 exploreP:0.0100
Episode:10241 meanR:111.6400 R:134.0 loss:2.5938 exploreP:0.0100
Episode:10242 meanR:111.8900 R:141.0 loss:2.6937 exploreP:0.0100
Episode:10243 meanR:112.1400 R:136.0 loss:2.8486 exploreP:0.0100
Episode:10244 meanR:112.3800 R:135.0 loss:3.3002 exploreP:0.0100
Episode:10245 meanR:112.5600 R:138.0 loss:3.3599 exploreP:0.0100
Episode:10246 meanR:112.7000 R:136.0 loss:2.5405 exploreP:0.0100
Episode:10247 meanR:112.9400 R:136.0 loss:2.2924 exploreP:0.0100
Episode:10248 meanR:113.1500 R:135.0 loss:2.5752 exploreP:0.0100
Episode:10249 meanR:113.4700 R:138.0 loss:3.6181 exploreP:0.0100
Episode:10250 meanR:113.6800 R:131.0 loss:2.9507 exploreP:0.0100
Episode:10251 meanR:113.7800 R:131.0 loss:3.2254 exploreP:0.0100
Episode:10252 meanR:113.8800 R:133.0 loss:2.9477 exploreP:0.0100
Episode:10253 meanR:114.0400 R:132.0 loss:2.7791 exploreP:0.0100
Episode:10254 meanR:114.2200 R:132.0 loss:4.2290 exploreP:0.0100
Episode:10255 meanR:114.4

Episode:10365 meanR:424.0200 R:500.0 loss:16.6848 exploreP:0.0100
Episode:10366 meanR:427.6100 R:500.0 loss:16.3552 exploreP:0.0100
Episode:10367 meanR:426.2200 R:14.0 loss:66.7291 exploreP:0.0100
Episode:10368 meanR:424.8400 R:14.0 loss:106.3387 exploreP:0.0100
Episode:10369 meanR:428.1400 R:500.0 loss:23.2488 exploreP:0.0100
Episode:10370 meanR:431.2500 R:500.0 loss:16.0173 exploreP:0.0100
Episode:10371 meanR:434.2900 R:500.0 loss:16.2078 exploreP:0.0100
Episode:10372 meanR:437.3400 R:500.0 loss:15.8864 exploreP:0.0100
Episode:10373 meanR:440.4100 R:500.0 loss:15.3867 exploreP:0.0100
Episode:10374 meanR:443.3800 R:500.0 loss:15.2790 exploreP:0.0100
Episode:10375 meanR:446.5300 R:500.0 loss:18.6939 exploreP:0.0100
Episode:10376 meanR:449.5200 R:500.0 loss:18.9041 exploreP:0.0100
Episode:10377 meanR:452.5100 R:500.0 loss:18.4217 exploreP:0.0100
Episode:10378 meanR:455.5400 R:500.0 loss:19.7035 exploreP:0.0100
Episode:10379 meanR:458.6300 R:500.0 loss:19.0144 exploreP:0.0100
Episode:103

Episode:10490 meanR:415.0900 R:226.0 loss:20.2663 exploreP:0.0100
Episode:10491 meanR:412.2800 R:219.0 loss:20.5402 exploreP:0.0100
Episode:10492 meanR:411.0200 R:235.0 loss:12.7857 exploreP:0.0100
Episode:10493 meanR:408.3400 R:232.0 loss:11.0535 exploreP:0.0100
Episode:10494 meanR:405.0600 R:172.0 loss:13.2213 exploreP:0.0100
Episode:10495 meanR:402.7900 R:273.0 loss:2.0391 exploreP:0.0100
Episode:10496 meanR:401.1900 R:340.0 loss:7.5003 exploreP:0.0100
Episode:10497 meanR:397.1100 R:92.0 loss:17.4754 exploreP:0.0100
Episode:10498 meanR:392.8400 R:73.0 loss:22.0913 exploreP:0.0100
Episode:10499 meanR:388.5000 R:66.0 loss:17.1237 exploreP:0.0100
Episode:10500 meanR:384.1700 R:67.0 loss:14.7439 exploreP:0.0100
Episode:10501 meanR:379.5700 R:40.0 loss:12.5183 exploreP:0.0100
Episode:10502 meanR:375.0100 R:44.0 loss:16.5382 exploreP:0.0100
Episode:10503 meanR:370.2600 R:25.0 loss:15.8969 exploreP:0.0100
Episode:10504 meanR:365.7300 R:23.0 loss:20.4482 exploreP:0.0100
Episode:10505 meanR:

Episode:10615 meanR:398.6400 R:500.0 loss:16.1922 exploreP:0.0100
Episode:10616 meanR:403.0300 R:500.0 loss:15.9135 exploreP:0.0100
Episode:10617 meanR:407.4100 R:500.0 loss:15.2745 exploreP:0.0100
Episode:10618 meanR:408.9100 R:500.0 loss:16.7946 exploreP:0.0100
Episode:10619 meanR:410.4700 R:500.0 loss:15.9341 exploreP:0.0100
Episode:10620 meanR:411.7300 R:500.0 loss:14.5995 exploreP:0.0100
Episode:10621 meanR:412.2900 R:500.0 loss:16.9443 exploreP:0.0100
Episode:10622 meanR:412.2900 R:500.0 loss:15.5287 exploreP:0.0100
Episode:10623 meanR:414.1200 R:500.0 loss:17.2984 exploreP:0.0100
Episode:10624 meanR:414.1200 R:500.0 loss:19.4409 exploreP:0.0100
Episode:10625 meanR:409.2500 R:13.0 loss:81.4461 exploreP:0.0100
Episode:10626 meanR:409.4600 R:500.0 loss:19.9318 exploreP:0.0100
Episode:10627 meanR:409.4600 R:500.0 loss:17.5709 exploreP:0.0100
Episode:10628 meanR:409.4600 R:500.0 loss:18.5715 exploreP:0.0100
Episode:10629 meanR:409.4600 R:500.0 loss:15.8833 exploreP:0.0100
Episode:106

Episode:10740 meanR:374.5600 R:11.0 loss:217.0260 exploreP:0.0100
Episode:10741 meanR:369.9400 R:38.0 loss:220.5530 exploreP:0.0100
Episode:10742 meanR:365.0700 R:13.0 loss:204.8211 exploreP:0.0100
Episode:10743 meanR:360.5100 R:44.0 loss:212.3909 exploreP:0.0100
Episode:10744 meanR:355.9300 R:42.0 loss:190.4874 exploreP:0.0100
Episode:10745 meanR:351.3000 R:37.0 loss:166.1978 exploreP:0.0100
Episode:10746 meanR:347.1200 R:82.0 loss:93.8482 exploreP:0.0100
Episode:10747 meanR:347.1200 R:500.0 loss:14.4215 exploreP:0.0100
Episode:10748 meanR:344.3400 R:222.0 loss:37.1195 exploreP:0.0100
Episode:10749 meanR:340.8700 R:153.0 loss:17.5152 exploreP:0.0100
Episode:10750 meanR:340.8700 R:500.0 loss:6.2131 exploreP:0.0100
Episode:10751 meanR:337.2500 R:138.0 loss:44.0727 exploreP:0.0100
Episode:10752 meanR:333.4400 R:119.0 loss:19.5536 exploreP:0.0100
Episode:10753 meanR:329.5200 R:108.0 loss:24.0570 exploreP:0.0100
Episode:10754 meanR:325.4900 R:97.0 loss:29.9708 exploreP:0.0100
Episode:10755

# Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')

## Testing

Let's checkout how our trained agent plays the game.

In [184]:
import gym

# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
env = gym.make('CartPole-v1')

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model-seq.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    initial_state = sess.run(model.initial_state) # Qs or current batch or states[:-1]
    
    # Episode/epoch
    for _ in range(1):
        state = env.reset()
        total_reward = 0
        
        # Steps/batches
        while True:
            env.render()
            action_logits, initial_state = sess.run([model.actions_logits, model.final_state],
                                                    feed_dict = {model.states: state.reshape([1, -1]), 
                                                                 model.initial_state: initial_state})
            action = np.argmax(action_logits)
            state, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break
        # At the end of each episode
        print('total_reward:{}'.format(total_reward))

# Close the env
env.close()

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m
INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt




total_reward:120.0


## Extending this

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.