
# Q learning (Q-Net)

In this notebook, we'll use Q-learning with a neural network (Q-Net) to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
import numpy as np

In [2]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.8.0
Default GPU Device: /device:GPU:0


>**Note:** Make sure you have OpenAI Gym cloned. Then run this command: `pip install -e gym/[all]`.

In [3]:
import gym
# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
# env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
# env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

WARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.




We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing an action as an integer to `env.step` advances the simulation by one step. You can see how many actions are possible from `env.action_space`, and you can get a random action with `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right, encoded as 0 and 1.

Run the code below to step through the simulation with random actions. Uncomment the `env.render()` line to watch it run.

In [4]:
env.reset()
rewards, states, actions, dones = [], [], [], []
for _ in range(10):
    #env.render()
    action = env.action_space.sample()
    state, reward, done, info = env.step(action) # take a random action
    states.append(state)
    rewards.append(reward)
    actions.append(action)
    dones.append(done)
    # Uncomment to inspect each transition
    # print(state, action, reward, done, info)
    if done:
        # The episode is over: reset the environment before taking another step
        env.reset()

To shut the window showing the simulation, use `env.close()`.

If you ran the simulation above, we can look at the rewards:

In [5]:
print(rewards[-20:])
print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print('actions:', np.max(np.array(actions)), np.min(np.array(actions)))
print('rewards min and max:', np.min(np.array(rewards)), np.max(np.array(rewards)))
print('state size:', np.array(states).shape, 
      'action size: {}'.format((np.max(np.array(actions)) - np.min(np.array(actions)))+1))

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
(10,) (10, 4) (10,) (10,)
float64 float64 int64 bool
actions: 1 0
rewards min and max: 1.0 1.0
state size: (10, 4) action size: 2


The episode ends once the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0, so the longer the game runs, the more reward we get. Our network's goal is therefore to maximize the reward by keeping the pole vertical, which it does by moving the cart to the left and the right.

In [6]:
# Data of the model
def model_input(state_size):
    # Current states given
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    
    # Current actions given
    prev_actions = tf.placeholder(tf.int32, [None], name='prev_actions')
    actions = tf.placeholder(tf.int32, [None], name='actions')

    # Next-state Q-values used in the target: targetQs = qs + (gamma * nextQs), zeroed where the episode ended
    nextQs = tf.placeholder(tf.float32, [None], name='nextQs')
    
    # returning the given data to the model
    return prev_actions, states, actions, nextQs

In [7]:
# Q: Qfunction/Encoder/Classifier
def qfunction(prev_actions, states, action_size, state_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('qfunction', reuse=reuse):
        # Fusing states and actions
        x_fused = tf.concat(axis=1, values=[prev_actions, states])
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=(action_size + state_size))
        actions_logits, next_states_logits = tf.split(axis=1, num_or_size_splits=[action_size, state_size], 
                                                      value=logits)
        #predictions = tf.nn.softmax(logits)

        # return actions logits
        return actions_logits, next_states_logits
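
The `tf.maximum(alpha * bn1, bn1)` lines in `qfunction` act as a leaky ReLU nonlinearity. As a minimal NumPy sketch of the same elementwise operation (the helper name `leaky_relu` and the sample inputs are illustrative, not part of the notebook):

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    # Same elementwise form as tf.maximum(alpha * x, x):
    # for 0 < alpha < 1, positive inputs pass through unchanged
    # and negative inputs are scaled down by alpha.
    return np.maximum(alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])   # illustrative inputs
y = leaky_relu(x)
print(y)
```

Negative inputs keep a small slope instead of being clipped to zero, so gradients can still flow through units that are currently inactive.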

In [8]:
# Q2: Qfunction2/Encoder/Classifier
def qfunction2(prev_actions, states, action_size, state_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('qfunction2', reuse=reuse):
        # Fusing states and actions
        x_fused = tf.concat(axis=1, values=[prev_actions, states])
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)
        #predictions = tf.nn.softmax(logits)

        # return reward logits/Qs
        return logits

In [9]:
def model_loss(prev_actions, states, actions, # model input data for Qs/qs/rs 
               nextQs, gamma, # model input data for targetQs
               state_size, action_size, hidden_size): # model init for Qs
    # Calculating Qs total rewards
    prev_actions_onehot = tf.one_hot(indices=prev_actions, depth=action_size)
    actions_logits, _ = qfunction(prev_actions=prev_actions_onehot, states=states, 
                                  hidden_size=hidden_size, state_size=state_size, action_size=action_size)
    actions_onehot = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    Qs_masked = tf.multiply(actions_logits, actions_onehot)
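    # Sum (rather than max) over the one-hot mask so a negative Q for the chosen action is not replaced by 0
    Qs = tf.reduce_sum(Qs_masked, axis=1)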
    
    # Bellman equation for calculating total rewards using current reward + total future rewards/nextQs
    qs = tf.sigmoid(Qs) # qt
    targetQs = qs + (gamma * nextQs)
    
    # Calculating the loss: logits/predictions vs labels
    loss = tf.reduce_mean(tf.square(Qs - targetQs))

    return actions_logits, loss
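
The masking step in `model_loss` multiplies the Q-values by a one-hot encoding of the chosen action. Here is a minimal NumPy sketch of how that picks one Q-value per row; the Q-values and actions are made up for illustration. Reducing the masked row with a sum recovers the chosen entry even when it is negative, whereas a max over the masked row returns 0 in that case.

```python
import numpy as np

action_size = 2
# Hypothetical Q-values for a batch of 3 states (rows) and 2 actions (columns)
q_values = np.array([[ 0.5,  1.2],
                     [-0.7, -0.1],
                     [ 2.0, -3.0]])
actions = np.array([1, 0, 1])                   # action taken in each row

actions_onehot = np.eye(action_size)[actions]   # shape (3, 2)
masked = q_values * actions_onehot              # non-selected entries become 0

selected_by_sum = masked.sum(axis=1)            # Q of the chosen action in every row
selected_by_max = masked.max(axis=1)            # 0 whenever the chosen Q is negative
print(selected_by_sum, selected_by_max)
```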

In [10]:
def model_opt(q_loss, learning_rate):
    """
    Get the optimization operation for the Q-function
    :param q_loss: Qfunction/Value loss Tensor for next action prediction
    :param learning_rate: Learning rate (float) for the Adam optimizer
    :return: The training operation for the qfunction variables
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    q_vars = [var for var in t_vars if var.name.startswith('qfunction')] # Q: action At/at

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
        q_opt = tf.train.AdamOptimizer(learning_rate).minimize(q_loss, var_list=q_vars)

    return q_opt

In [11]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate, gamma):

        # Data of the Model: make the data available inside the framework
        self.prev_actions, self.states, self.actions, self.nextQs = model_input(state_size=state_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.q_loss = model_loss(
            state_size=state_size, action_size=action_size, hidden_size=hidden_size, gamma=gamma, # model init parameters
            prev_actions=self.prev_actions, states=self.states, actions=self.actions, nextQs=self.nextQs) # model input data

        # Update the model: backward pass and backprop
        self.q_opt = model_opt(q_loss=self.q_loss, learning_rate=learning_rate)

## Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on. 

Here, we'll create a `Memory` object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maximum capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a `Memory` object. If you're unfamiliar with `deque`, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.

In [12]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
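
To make the "tube" behaviour of `deque` concrete, here is a small standalone sketch with a capacity of 3 and placeholder strings standing in for experiences:

```python
from collections import deque

# A bounded buffer with room for three items
buffer = deque(maxlen=3)

for experience in ["e1", "e2", "e3", "e4"]:
    # Appending to a full deque silently drops the oldest item from the other end
    buffer.append(experience)

print(list(buffer))  # → ['e2', 'e3', 'e4']
```

This is why `Memory.add` needs no explicit eviction logic: once `max_size` is reached, each new experience pushes out the oldest one.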

## Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon).  That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an **$\epsilon$-greedy policy**.


At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called _exploitation_. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.
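
As a minimal sketch of this idea, the snippet below decays the exploration probability exponentially with the number of steps and uses it to pick an action. The default values match the hyperparameters this notebook sets; the helper names `explore_probability` and `epsilon_greedy` are illustrative and do not appear in the notebook's own code.

```python
import math
import random

def explore_probability(step, explore_start=1.0, explore_stop=0.01, decay_rate=0.0001):
    # Exponential decay from explore_start towards explore_stop as training progresses
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * step)

def epsilon_greedy(q_values, epsilon, rng=random):
    # With probability epsilon take a random action, otherwise the action with the largest Q
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

for step in (0, 10000, 50000):
    print(step, round(explore_probability(step), 4))

# With epsilon = 0 the choice is always greedy
print(epsilon_greedy([0.1, 0.9], epsilon=0.0))
```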

## Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in _episodes_. One *episode* is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames, so we can start a new episode once that goal is met. The game also ends if the pole tilts over too far, or if the cart moves too far to the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

* Initialize the memory $D$
* Initialize the action-value network $Q$ with random weights
* **For** episode = 1, $M$ **do**
  * **For** $t = 1$, $T$ **do**
     * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s,a)$
     * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
     * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
     * Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
     * Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
     * Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
  * **endfor**
* **endfor**
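
The target step of the algorithm above can be sketched in NumPy for one mini-batch. This follows the listed algorithm with explicit rewards $r_j$; the batch values are made up for illustration, and the notebook's own TensorFlow code handles the reward term differently.

```python
import numpy as np

gamma = 0.99                                  # future reward discount
rewards = np.array([1.0, 1.0, 1.0])           # r_j for three sampled transitions
dones = np.array([False, True, False])        # True where the episode ended at j+1
# Hypothetical Q(s'_j, a') for each transition (rows) and action (columns)
next_q = np.array([[ 0.5,  2.0],
                   [ 3.0,  1.0],
                   [-1.0, -0.5]])

max_next_q = next_q.max(axis=1)               # max over a' of Q(s'_j, a')
not_done = 1.0 - dones.astype(np.float64)     # 0 for terminal transitions, 1 otherwise
targets = rewards + gamma * max_next_q * not_done
print(targets)
```

For the terminal transition the future term is dropped, so its target is just the reward.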

## Hyperparameters

One of the more difficult aspects of reinforcement learning is the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.

In [13]:
print('state size:', np.array(states).shape[1], 
      'action size: {}'.format((np.max(np.array(actions)) - np.min(np.array(actions)))+1))

state size: 4 action size: 2


In [14]:
train_episodes = 2000          # max number of episodes to learn from
max_steps = 2000000000000000   # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
state_size = 4                 # number of units for the input state/observation -- simulation
hidden_size = 64               # number of units in each Q-network hidden layer -- simulation
action_size = 2                # number of units for the output actions -- simulation

# Memory parameters
memory_size = 100000           # memory capacity
batch_size = 200               # experience mini-batch size
learning_rate = 0.001          # learning rate for adam

In [15]:
# Reset/init the graph/session
tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate, 
             gamma=gamma)

# Init the memory
memory = Memory(max_size=memory_size)

## Populate the experience memory

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.

In [16]:
# Initialize the simulation
env.reset()

# Take one random step to get the pole and cart moving
prev_action = env.action_space.sample() # At-1
state, reward, done, info = env.step(prev_action) # St, Rt/Et (Episode)

# Make a bunch of random actions and store the experiences
for _ in range(batch_size):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()# At
    next_state, reward, done, info = env.step(action) #St+1

    # End of the episodes which defines the goal of the episode/mission
    if done:
        
        # Print out reward and done and check if they are the same: They are NOT.
        #print('if done is true:', reward, done)
        
        # # the episode ends so no next state
        # next_state = np.zeros(state.shape)
                
        # Add experience to memory
        memory.add((prev_action, state, action, next_state, done))
        
        # Start new episode
        env.reset()
        
        # Take one random step to get the pole and cart moving
        prev_action = env.action_space.sample()
        state, reward, done, info = env.step(prev_action)
    else:
        # Print out reward and done and check if they are the same!
        #print('else done is false:', reward, done)
        
        # Add experience to memory
        memory.add((prev_action, state, action, next_state, done))
        
        # Prepare for the next round
        prev_action = action
        state = next_state

## Training

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This slows training down because rendering a frame takes longer than a training step, but it's cool to watch the agent get better at the game.

In [None]:
# Now train with experiences
saver = tf.train.Saver()

# Total rewards and losses list for plotting
rewards_list = []
q_loss_list = []

# TF session for training
with tf.Session() as sess:
    
    # Initialize variables
    sess.run(tf.global_variables_initializer())

    # Restore/load the trained model 
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Training episodes/epochs
    step = 0
    for ep in range(train_episodes):
        
        # Env/agent steps/batches/minibatches
        total_reward = 0
        q_loss = 0
        t = 0
        while t < max_steps:
            step += 1
            
            # Uncomment this next line to watch the training
            # env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from model
                feed_dict = {model.prev_actions: np.array([prev_action]), 
                             model.states: state.reshape((1, *state.shape))}
                actions_logits = sess.run(model.actions_logits, feed_dict)
                action = np.argmax(actions_logits) # arg with max value/Q is the class of action
            
            # Take action, get new state and reward
            next_state, reward, done, info = env.step(action)
    
            # Cumulative reward
            total_reward += reward
            
            # Episode/epoch training is done/failed!
            if done:
                # the episode ends so no next state
                #next_state = np.zeros(state.shape)
                t = max_steps
                
                print('-------------------------------------------------------------------------------')
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training q_loss: {:.4f}'.format(q_loss),
                      'Explore P: {:.4f}'.format(explore_p))
                print('-------------------------------------------------------------------------------')
                
                # total rewards and losses for plotting
                rewards_list.append((ep, total_reward))
                q_loss_list.append((ep, q_loss))
                
                # Add experience to memory
                memory.add((prev_action, state, action, next_state, done))
                
                # Start new episode
                env.reset()
                
                # Take one random step to get the pole and cart moving
                prev_action = env.action_space.sample()
                state, reward, done, info = env.step(prev_action)

            else:
                # Add experience to memory
                memory.add((prev_action, state, action, next_state, done))
                
                # One step forward: At-1=At and St=St+1
                prev_action = action
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            prev_actions = np.array([each[0] for each in batch])
            states = np.array([each[1] for each in batch])
            actions = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            
            # Calculating nextQs and setting them to 0 for states where episode ends/fails
            feed_dict={model.prev_actions: actions, 
                       model.states: next_states}
            next_actions_logits = sess.run(model.actions_logits, feed_dict)            
            next_actions_mask = (1 - dones.astype(next_actions_logits.dtype)).reshape(-1, 1) 
            nextQs_masked = np.multiply(next_actions_logits, next_actions_mask)
            nextQs = np.max(nextQs_masked, axis=1)
            
            # Updating the model: Calculating Qs using states and actions and Qt = rs/qs + (gamma * nextQs) 
            feed_dict = {model.prev_actions: prev_actions, 
                         model.states: states, 
                         model.actions: actions, 
                         model.nextQs: nextQs}
            q_loss, _ = sess.run([model.q_loss, model.q_opt], feed_dict)

#             ################################################################################
#             ################################################################################
#             # Calculating nextQs for Discriminator using D(At-1, St)= Qt: NOT this one
#             # Calculating nextQs for Discriminator using D(At, St+1)= Qt+1/nextQs_D/nextQs
#             # Calculating nextQs for Discriminator using D(~At, ~St+1)= ~Qt+1/nextQs_G/nextQs2
#             feed_dict={model.prev_actions: prev_actions, model.states: states,
#                        model.actions: actions, model.next_states: next_states}
#             # First to G and then to D
#             nextQs_G, nextQs_D = sess.run([model.nextQs_G, model.nextQs_D], feed_dict)
            
#             # Masking for the end of episodes/ goals
#             dones_mask = (1 - dones.astype(next_actions_logits.dtype)).reshape(-1, 1) 
#             nextQs_G_masked = np.multiply(nextQs_G, dones_mask)
#             nextQs_G = np.max(nextQs_G_masked, axis=1)
#             nextQs_D_masked = np.multiply(nextQs_D, dones_mask)
#             nextQs_D = np.max(nextQs_D_masked, axis=1)

#             # NEW: Updating the model: Calculating Qs using states and actions and Qt = rs/qs + (gamma * nextQs) 
#             # Calculating nextQs for Discriminator using D(At-1, St)= Qt: NOT this one
#             # D(At-1, St)= Qs and qs = tf.sigmoid(Qs)
#             # NextQs/Qt+1 are given both:
#             # targetQs = qs + gamma * nextQs_G
#             # targetQs = qs + gamma * nextQs_D
#             feed_dict = {model.prev_actions: prev_actions, 
#                          model.states: states, 
#                          model.actions: actions, 
#                          model.nextQs: nextQs,
#                          model.nextQs_G: nextQs_G,
#                          model.nextQs_D: nextQs_D}
#             q_loss, _ = sess.run([model.q_loss, model.q_opt], feed_dict)
#             g_loss, _ = sess.run([model.g_loss, model.g_opt], feed_dict)
#             d_loss, _ = sess.run([model.d_loss, model.d_opt], feed_dict)
#             ################################################################################
#             ################################################################################
                        
    # Save the trained model
    saver.save(sess, 'checkpoints/model.ckpt')

-------------------------------------------------------------------------------
Episode: 0 Total reward: 9.0 Training q_loss: 0.6097 Explore P: 0.9991
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1 Total reward: 13.0 Training q_loss: 1.0882 Explore P: 0.9978
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2 Total reward: 14.0 Training q_loss: 2.0889 Explore P: 0.9964
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3 Total reward: 22.0 Training q_loss: 5.1348 Explore P: 0.9943
-------------------------------------------------------------------------------
...

-------------------------------------------------------------------------------
Episode: 37 Total reward: 24.0 Training q_loss: 67.7152 Explore P: 0.9226
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 38 Total reward: 16.0 Training q_loss: 89.7203 Explore P: 0.9211
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 39 Total reward: 35.0 Training q_loss: 113.2068 Explore P: 0.9180
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 40 Total reward: 28.0 Training q_loss: 78.4460 Explore P: 0.9154
-------------------------------------------------------------------------------
...

-------------------------------------------------------------------------------
Episode: 74 Total reward: 20.0 Training q_loss: 3378.4084 Explore P: 0.8408
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 75 Total reward: 27.0 Training q_loss: 1425.2632 Explore P: 0.8386
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 76 Total reward: 28.0 Training q_loss: 2955.7920 Explore P: 0.8363
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 77 Total reward: 10.0 Training q_loss: 1621.9222 Explore P: 0.8354
-------------------------------------------------------------------------------
...

-------------------------------------------------------------------------------
Episode: 109 Total reward: 40.0 Training q_loss: 477530.1562 Explore P: 0.7434
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 110 Total reward: 103.0 Training q_loss: 362657.4375 Explore P: 0.7359
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 111 Total reward: 96.0 Training q_loss: 539974.2500 Explore P: 0.7290
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 112 Total reward: 48.0 Training q_loss: 965810.1875 Explore P: 0.7255
-------------------------------------------------------------------------------
...

-------------------------------------------------------------------------------
Episode: 144 Total reward: 199.0 Training q_loss: 222511264.0000 Explore P: 0.5823
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 145 Total reward: 199.0 Training q_loss: 137878832.0000 Explore P: 0.5710
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 146 Total reward: 192.0 Training q_loss: 45764648.0000 Explore P: 0.5603
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 147 Total reward: 21.0 Training q_loss: 93740736.0000 Explore P: 0.5592
-------------------------------------------------------------------------------
...

-------------------------------------------------------------------------------
Episode: 178 Total reward: 199.0 Training q_loss: 225961088.0000 Explore P: 0.3818
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 179 Total reward: 199.0 Training q_loss: 864532096.0000 Explore P: 0.3745
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 180 Total reward: 118.0 Training q_loss: 837895360.0000 Explore P: 0.3702
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 181 Total reward: 100.0 Training q_loss: 504592352.0000 Explore P: 0.3666
-------------------------------------------------------------------------------
...

-------------------------------------------------------------------------------
Episode: 212 Total reward: 199.0 Training q_loss: 941323648.0000 Explore P: 0.2030
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 213 Total reward: 199.0 Training q_loss: 168553232.0000 Explore P: 0.1992
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 214 Total reward: 199.0 Training q_loss: 321334816.0000 Explore P: 0.1954
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 215 Total reward: 199.0 Training q_loss: 408124992.0000 Explore P: 0.1918
-------------------------------------------------------------------------------
...

-------------------------------------------------------------------------------
Episode: 246 Total reward: 191.0 Training q_loss: 179463376.0000 Explore P: 0.1089
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 247 Total reward: 185.0 Training q_loss: 82746512.0000 Explore P: 0.1071
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 248 Total reward: 197.0 Training q_loss: 277829984.0000 Explore P: 0.1052
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 249 Total reward: 199.0 Training q_loss: 182382288.0000 Explore P: 0.1034
-------------------------------------------------------------------------------
...

-------------------------------------------------------------------------------
Episode: 280 Total reward: 199.0 Training q_loss: 1849163136.0000 Explore P: 0.0609
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 281 Total reward: 199.0 Training q_loss: 1862724608.0000 Explore P: 0.0599
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 282 Total reward: 199.0 Training q_loss: 2061269504.0000 Explore P: 0.0589
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 283 Total reward: 199.0 Training q_loss: 2169740544.0000 Explore P: 0.0579
-------------------------------------------------------------------------------
------------------------

-------------------------------------------------------------------------------
Episode: 314 Total reward: 199.0 Training q_loss: 101256272.0000 Explore P: 0.0359
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 315 Total reward: 199.0 Training q_loss: 75810904.0000 Explore P: 0.0353
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 316 Total reward: 199.0 Training q_loss: 91284048.0000 Explore P: 0.0348
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 317 Total reward: 199.0 Training q_loss: 46735576.0000 Explore P: 0.0344
-------------------------------------------------------------------------------
-------------------------------

-------------------------------------------------------------------------------
Episode: 348 Total reward: 199.0 Training q_loss: 669657408.0000 Explore P: 0.0234
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 349 Total reward: 199.0 Training q_loss: 990181696.0000 Explore P: 0.0231
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 350 Total reward: 199.0 Training q_loss: 831173120.0000 Explore P: 0.0229
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 351 Total reward: 199.0 Training q_loss: 502277216.0000 Explore P: 0.0226
-------------------------------------------------------------------------------
----------------------------

-------------------------------------------------------------------------------
Episode: 382 Total reward: 199.0 Training q_loss: 369017216.0000 Explore P: 0.0168
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 383 Total reward: 199.0 Training q_loss: 785667520.0000 Explore P: 0.0167
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 384 Total reward: 199.0 Training q_loss: 332974496.0000 Explore P: 0.0165
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 385 Total reward: 199.0 Training q_loss: 1101225984.0000 Explore P: 0.0164
-------------------------------------------------------------------------------
---------------------------

-------------------------------------------------------------------------------
Episode: 416 Total reward: 199.0 Training q_loss: 807992512.0000 Explore P: 0.0135
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 417 Total reward: 199.0 Training q_loss: 557894976.0000 Explore P: 0.0134
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 418 Total reward: 199.0 Training q_loss: 500397504.0000 Explore P: 0.0133
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 419 Total reward: 199.0 Training q_loss: 735489728.0000 Explore P: 0.0133
-------------------------------------------------------------------------------
----------------------------

-------------------------------------------------------------------------------
Episode: 450 Total reward: 199.0 Training q_loss: 358734944.0000 Explore P: 0.0118
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 451 Total reward: 199.0 Training q_loss: 197545776.0000 Explore P: 0.0117
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 452 Total reward: 199.0 Training q_loss: 331214560.0000 Explore P: 0.0117
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 453 Total reward: 199.0 Training q_loss: 761091584.0000 Explore P: 0.0117
-------------------------------------------------------------------------------
----------------------------

-------------------------------------------------------------------------------
Episode: 484 Total reward: 199.0 Training q_loss: 607173184.0000 Explore P: 0.0109
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 485 Total reward: 199.0 Training q_loss: 256251392.0000 Explore P: 0.0109
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 486 Total reward: 199.0 Training q_loss: 366351808.0000 Explore P: 0.0109
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 487 Total reward: 199.0 Training q_loss: 236695616.0000 Explore P: 0.0108
-------------------------------------------------------------------------------
----------------------------

-------------------------------------------------------------------------------
Episode: 518 Total reward: 199.0 Training q_loss: 113522848.0000 Explore P: 0.0105
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 519 Total reward: 199.0 Training q_loss: 653798976.0000 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 520 Total reward: 199.0 Training q_loss: 482255488.0000 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 521 Total reward: 199.0 Training q_loss: 258051808.0000 Explore P: 0.0104
-------------------------------------------------------------------------------
----------------------------

-------------------------------------------------------------------------------
Episode: 552 Total reward: 199.0 Training q_loss: 500854880.0000 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 553 Total reward: 199.0 Training q_loss: 633548992.0000 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 554 Total reward: 199.0 Training q_loss: 640916992.0000 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 555 Total reward: 199.0 Training q_loss: 235400768.0000 Explore P: 0.0102
-------------------------------------------------------------------------------
----------------------------

-------------------------------------------------------------------------------
Episode: 586 Total reward: 199.0 Training q_loss: 224588640.0000 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 587 Total reward: 199.0 Training q_loss: 677182720.0000 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 588 Total reward: 199.0 Training q_loss: 442785824.0000 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 589 Total reward: 199.0 Training q_loss: 257009392.0000 Explore P: 0.0101
-------------------------------------------------------------------------------
----------------------------

-------------------------------------------------------------------------------
Episode: 620 Total reward: 199.0 Training q_loss: 150419952.0000 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 621 Total reward: 199.0 Training q_loss: 356915840.0000 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 622 Total reward: 199.0 Training q_loss: 335414592.0000 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 623 Total reward: 199.0 Training q_loss: 649603712.0000 Explore P: 0.0101
-------------------------------------------------------------------------------
----------------------------

-------------------------------------------------------------------------------
Episode: 654 Total reward: 199.0 Training q_loss: 320865760.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 655 Total reward: 199.0 Training q_loss: 293954432.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 656 Total reward: 199.0 Training q_loss: 317559968.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 657 Total reward: 199.0 Training q_loss: 163212416.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
----------------------------

-------------------------------------------------------------------------------
Episode: 688 Total reward: 199.0 Training q_loss: 268602240.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 689 Total reward: 199.0 Training q_loss: 243338960.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 690 Total reward: 199.0 Training q_loss: 153915728.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 691 Total reward: 199.0 Training q_loss: 378414976.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
----------------------------

-------------------------------------------------------------------------------
Episode: 722 Total reward: 199.0 Training q_loss: 314103552.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 723 Total reward: 199.0 Training q_loss: 155345024.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 724 Total reward: 199.0 Training q_loss: 394684672.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 725 Total reward: 199.0 Training q_loss: 306439552.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
----------------------------

-------------------------------------------------------------------------------
Episode: 756 Total reward: 199.0 Training q_loss: 18326466.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 757 Total reward: 199.0 Training q_loss: 111589608.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 758 Total reward: 199.0 Training q_loss: 67908624.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 759 Total reward: 199.0 Training q_loss: 11971748.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------

-------------------------------------------------------------------------------
Episode: 790 Total reward: 199.0 Training q_loss: 11018821.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 791 Total reward: 199.0 Training q_loss: 12152950.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 792 Total reward: 199.0 Training q_loss: 11673321.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 793 Total reward: 199.0 Training q_loss: 13126260.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
--------------------------------

-------------------------------------------------------------------------------
Episode: 824 Total reward: 199.0 Training q_loss: 11011612.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 825 Total reward: 199.0 Training q_loss: 10931706.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 826 Total reward: 199.0 Training q_loss: 13056099.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 827 Total reward: 199.0 Training q_loss: 12108960.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
--------------------------------

-------------------------------------------------------------------------------
Episode: 858 Total reward: 199.0 Training q_loss: 7412197.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 859 Total reward: 199.0 Training q_loss: 6871753.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 860 Total reward: 199.0 Training q_loss: 7663188.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 861 Total reward: 199.0 Training q_loss: 7400462.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
------------------------------------

-------------------------------------------------------------------------------
Episode: 892 Total reward: 199.0 Training q_loss: 7199396.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 893 Total reward: 199.0 Training q_loss: 6941156.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 894 Total reward: 199.0 Training q_loss: 7237862.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 895 Total reward: 199.0 Training q_loss: 7216749.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
------------------------------------

-------------------------------------------------------------------------------
Episode: 926 Total reward: 199.0 Training q_loss: 6900842.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 927 Total reward: 199.0 Training q_loss: 6505760.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 928 Total reward: 199.0 Training q_loss: 7186282.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 929 Total reward: 199.0 Training q_loss: 7074809.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
------------------------------------

-------------------------------------------------------------------------------
Episode: 960 Total reward: 199.0 Training q_loss: 6997924.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 961 Total reward: 199.0 Training q_loss: 6949035.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 962 Total reward: 199.0 Training q_loss: 7440153.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 963 Total reward: 199.0 Training q_loss: 6994143.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
------------------------------------

-------------------------------------------------------------------------------
Episode: 994 Total reward: 199.0 Training q_loss: 7287811.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 995 Total reward: 199.0 Training q_loss: 6913057.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 996 Total reward: 199.0 Training q_loss: 7061602.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 997 Total reward: 199.0 Training q_loss: 6985465.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
------------------------------------

-------------------------------------------------------------------------------
Episode: 1028 Total reward: 199.0 Training q_loss: 7253992.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1029 Total reward: 199.0 Training q_loss: 6790162.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1030 Total reward: 199.0 Training q_loss: 6908396.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1031 Total reward: 199.0 Training q_loss: 6876802.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
--------------------------------

-------------------------------------------------------------------------------
Episode: 1062 Total reward: 199.0 Training q_loss: 7298969.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1063 Total reward: 199.0 Training q_loss: 7024520.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1064 Total reward: 199.0 Training q_loss: 7061323.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1065 Total reward: 199.0 Training q_loss: 7205694.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
--------------------------------

-------------------------------------------------------------------------------
Episode: 1096 Total reward: 199.0 Training q_loss: 6898757.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1097 Total reward: 199.0 Training q_loss: 7018960.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1098 Total reward: 199.0 Training q_loss: 6990125.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1099 Total reward: 199.0 Training q_loss: 6724233.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
--------------------------------

-------------------------------------------------------------------------------
Episode: 1130 Total reward: 199.0 Training q_loss: 6920658.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1131 Total reward: 199.0 Training q_loss: 6928109.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1132 Total reward: 199.0 Training q_loss: 6929199.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1133 Total reward: 199.0 Training q_loss: 7123762.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
--------------------------------

-------------------------------------------------------------------------------
Episode: 1164 Total reward: 199.0 Training q_loss: 6922949.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1165 Total reward: 199.0 Training q_loss: 7274002.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1166 Total reward: 199.0 Training q_loss: 7050486.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1167 Total reward: 199.0 Training q_loss: 7000990.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
--------------------------------

-------------------------------------------------------------------------------
Episode: 1198 Total reward: 199.0 Training q_loss: 7139518.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1199 Total reward: 199.0 Training q_loss: 7139198.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1200 Total reward: 199.0 Training q_loss: 7038891.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1201 Total reward: 199.0 Training q_loss: 7486397.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
--------------------------------

-------------------------------------------------------------------------------
Episode: 1232 Total reward: 199.0 Training q_loss: 6852738.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1233 Total reward: 199.0 Training q_loss: 7279975.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1234 Total reward: 199.0 Training q_loss: 7103238.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1235 Total reward: 199.0 Training q_loss: 6838217.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
--------------------------------

-------------------------------------------------------------------------------
Episode: 1266 Total reward: 199.0 Training q_loss: 7430334.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1267 Total reward: 199.0 Training q_loss: 7060919.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1268 Total reward: 199.0 Training q_loss: 6821579.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1269 Total reward: 199.0 Training q_loss: 7158697.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
--------------------------------

-------------------------------------------------------------------------------
Episode: 1300 Total reward: 199.0 Training q_loss: 7228813.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1301 Total reward: 199.0 Training q_loss: 7191662.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1302 Total reward: 199.0 Training q_loss: 7399291.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1303 Total reward: 199.0 Training q_loss: 7143638.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
--------------------------------

-------------------------------------------------------------------------------
Episode: 1334 Total reward: 199.0 Training q_loss: 7407369.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1335 Total reward: 199.0 Training q_loss: 7174132.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1336 Total reward: 199.0 Training q_loss: 7281052.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1337 Total reward: 199.0 Training q_loss: 7408808.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
--------------------------------

-------------------------------------------------------------------------------
Episode: 1368 Total reward: 199.0 Training q_loss: 7108798.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1369 Total reward: 199.0 Training q_loss: 7085401.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1370 Total reward: 199.0 Training q_loss: 7324627.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1371 Total reward: 199.0 Training q_loss: 7142009.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
--------------------------------

-------------------------------------------------------------------------------
Episode: 1402 Total reward: 199.0 Training q_loss: 7220771.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1403 Total reward: 199.0 Training q_loss: 6882170.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1404 Total reward: 199.0 Training q_loss: 6967773.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1405 Total reward: 199.0 Training q_loss: 7137422.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
--------------------------------

-------------------------------------------------------------------------------
Episode: 1436 Total reward: 199.0 Training q_loss: 7150624.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1437 Total reward: 199.0 Training q_loss: 7296050.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1438 Total reward: 199.0 Training q_loss: 7809014.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1439 Total reward: 199.0 Training q_loss: 7305716.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
--------------------------------

-------------------------------------------------------------------------------
Episode: 1470 Total reward: 199.0 Training q_loss: 7132912.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1471 Total reward: 199.0 Training q_loss: 7028283.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1472 Total reward: 199.0 Training q_loss: 6911225.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1473 Total reward: 199.0 Training q_loss: 7100312.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
--------------------------------

-------------------------------------------------------------------------------
Episode: 1504 Total reward: 199.0 Training q_loss: 7097125.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1505 Total reward: 199.0 Training q_loss: 7151146.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1506 Total reward: 199.0 Training q_loss: 7500174.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1507 Total reward: 199.0 Training q_loss: 7207588.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
--------------------------------

-------------------------------------------------------------------------------
Episode: 1538 Total reward: 199.0 Training q_loss: 6975346.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1539 Total reward: 199.0 Training q_loss: 6880658.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1540 Total reward: 199.0 Training q_loss: 7129594.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1541 Total reward: 199.0 Training q_loss: 7194444.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
--------------------------------

-------------------------------------------------------------------------------
Episode: 1572 Total reward: 199.0 Training q_loss: 7214410.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1573 Total reward: 199.0 Training q_loss: 7559156.5000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1574 Total reward: 199.0 Training q_loss: 7419498.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1575 Total reward: 199.0 Training q_loss: 7350212.0000 Explore P: 0.0100
-------------------------------------------------------------------------------
--------------------------------

## Visualizing training

Below I'll plot the total reward for each episode, followed by the Q loss. In each plot the raw per-episode values are drawn in grey and the 10-episode rolling average in blue.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

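def running_mean(x, N):
    # Rolling average over a window of N values, computed from a cumulative sum;
    # returns len(x) - N + 1 values
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N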
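To see what `running_mean` returns, here is a small worked example on made-up data (the array `x` is illustrative, not taken from the training run). It also compares the cumulative-sum trick against `np.convolve` with a uniform window, which should give the same values.

```python
import numpy as np

def running_mean(x, N):
    # Same definition as in the cell above
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

# Illustrative data only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

smoothed = running_mean(x, 3)

# Equivalent formulation: convolve with a uniform window, keeping only
# positions where the window fits entirely inside the data
via_convolve = np.convolve(x, np.ones(3) / 3, mode='valid')

print(smoothed)        # one value per full window
print(len(smoothed))   # len(x) - N + 1
```

Because the smoothed array is `N - 1` entries shorter than the input, the plotting cells align it with the last `len(smoothed_arr)` episode numbers.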

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(q_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Q losses')

## Testing

Let's check out how our trained agent plays the game.

In [26]:
test_episodes = 1
test_max_steps = 20000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

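# Recreate the env here if it has already been closed
# env = gym.make('CartPole-v0')
# env = gym.make('Acrobot-v1')

# Reset the env and take one random step so that `state` and `prev_action`
# are defined before the first model call (same warm-up as after each episode)
env.reset()
prev_action = env.action_space.sample()
state, reward, done, _ = env.step(prev_action)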

with tf.Session() as sess:
    
    # Restore/load the trained model 
    saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Run the test episodes
    for ep in range(test_episodes):
        
        # Number of environment steps taken in this episode
        t = 0
        while t < test_max_steps:
            
            # Rendering the env graphics
            env.render()
            
            # Get action from the model
            feed_dict = {model.prev_actions: np.array([prev_action]), 
                         model.states: state.reshape((1, *state.shape))}
            actions_logits = sess.run(model.actions_logits, feed_dict)
            action = np.argmax(actions_logits)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            
            # If the episode has ended, stop stepping and reset for the next one
            if done:
                t = test_max_steps
                env.reset()
                
                # Take one random step to get the pole and cart moving
                prev_action = env.action_space.sample()
                state, reward, done, _ = env.step(prev_action)
            else:
                state = next_state
                t += 1

INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt
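During testing the agent acts greedily: it always takes the action with the largest logit, with no exploration. Here is a minimal NumPy sketch of that selection step, using made-up logits rather than real model output (`greedy_action` is a hypothetical helper, not part of this notebook):

```python
import numpy as np

def greedy_action(actions_logits):
    # np.argmax flattens its input by default, so a (1, n_actions) batch
    # of logits yields the index of the best action directly
    return int(np.argmax(actions_logits))

# Made-up logits for a single state with two actions (0 = left, 1 = right)
example_logits = np.array([[0.2, 1.7]])
action = greedy_action(example_logits)
print(action)
```

This only works as written for a batch of one state; for larger batches you would take the argmax along the action axis instead.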


In [27]:
# Closing the env
# WARNING: after closing, the env cannot be restarted; recreate it with gym.make(...)
env.close()

## Extending this to Deep Convolutional QAN

Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated, like Pong or Space Invaders. Instead of the small state vector we're using here, you'd use convolutional layers to extract the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)
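As a rough sketch of what a convolutional front end involves, the snippet below works out the feature-map sizes for the layer configuration I recall from the DQN paper: an 84×84 input, then unpadded 8×8 stride-4, 4×4 stride-2 and 3×3 stride-1 convolutions with 32, 64 and 64 filters. Treat all of those numbers as assumptions to check against the paper. `conv_output_size` is a hypothetical helper, not part of this notebook.

```python
def conv_output_size(size, kernel, stride):
    """Spatial output size of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1

# (filters, kernel, stride) per layer: assumed values, verify against the paper
layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size = 84  # assumed height/width of the preprocessed screen image
sizes = []
for filters, kernel, stride in layers:
    size = conv_output_size(size, kernel, stride)
    sizes.append((size, size, filters))

# Number of features fed to the fully connected layers after flattening
flat_features = sizes[-1][0] * sizes[-1][1] * sizes[-1][2]

print(sizes)
print(flat_features)
```

The flattened feature vector then plays the role that the Cart-Pole state plays in this notebook.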

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. The original paper, "Human-level control through deep reinforcement learning" (Mnih et al., Nature, 2015), will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.