
# Q-GAN

More specifically, we'll use Q-GAN to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
import numpy as np

In [2]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.8.0
Default GPU Device: /device:GPU:0


>**Note:** Make sure you have OpenAI Gym cloned. Then run this command 'pip install -e gym/[all]'.

In [3]:
import gym

# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
# env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
# env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m




We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing in an action as an integer to `env.step` will generate the next step in the simulation.  You can see how many actions are possible from `env.action_space` and to get a random action you can use `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to watch the simulation run.

In [4]:
env.reset()
rewards, states, actions, dones = [], [], [], []
for _ in range(10):
    #env.render()
    action = env.action_space.sample()
    state, reward, done, info = env.step(action) # take a random action
    states.append(state)
    rewards.append(reward)
    actions.append(action)
    dones.append(done)
    if done:
        states.append(state)
        rewards.append(reward)
        actions.append(action)
        dones.append(done)

To shut the window showing the simulation, use `env.close()`.

If you ran the simulation above, we can look at the rewards:

In [5]:
print(rewards[-20:])
print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print('actions:', np.max(np.array(actions)), np.min(np.array(actions)))
print('rewards min and max:', np.max(np.array(rewards)), np.min(np.array(rewards)))
print('state size:', np.array(states).shape, 
      'action size: {}'.format((np.max(np.array(actions)) - np.min(np.array(actions)))+1))

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
(10,) (10, 4) (10,) (10,)
float64 float64 int64 bool
actions: 1 0
rewards min and max: 1.0 1.0
state size: (10, 4) action size: 2


The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

In [7]:
# Data of the model
def model_input(state_size):
    # Current and next states given
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    next_states = tf.placeholder(tf.float32, [None, state_size], name='next_states')
    
    # Previous and current actions given
    prev_actions = tf.placeholder(tf.int32, [None], name='prev_actions')
    actions = tf.placeholder(tf.int32, [None], name='actions')

    # Qs = qs+ (gamma * nextQs)
    nextQs = tf.placeholder(tf.float32, [None], name='nextQs') # masked
    
    # returning the given data to the model
    return prev_actions, states, actions, next_states, nextQs

In [8]:
# Generator: Generating/predicting action and next states
def generator(prev_actions, states, action_size, state_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # Fusing states and actions
        x_fused = tf.concat(axis=1, values=[prev_actions, states])
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=(action_size + state_size))
        actions_logits, next_states_logits = tf.split(axis=1, num_or_size_splits=[action_size, state_size], 
                                                      value=logits)
        #predictions = tf.nn.softmax(actions_logits)
        #predictions = tf.sigmoid(next_states_logits)

        # return actions and states logits
        return actions_logits, next_states_logits

In [9]:
def discriminator(prev_actions, states, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('discriminator', reuse=reuse):
        # Fusing states and actions
        x_fused = tf.concat(axis=1, values=[prev_actions, states])
        #print(x_fused.shape)
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        #print(h1.shape)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        #print(h2.shape)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)
        #predictions = tf.nn.softmax(logits)

        # return reward logits/Qs
        return logits

In [10]:
# The model loss for predicted/generated actions
def model_loss(prev_actions, states, actions, next_states, nextQs, # model data
               state_size, action_size, hidden_size): # model init
    # Calculating Qt
    prev_actions_onehot = tf.one_hot(indices=prev_actions, depth=action_size)
    actions_logits, next_states_logits = generator(prev_actions=prev_actions_onehot, states=states, 
                                                   hidden_size=hidden_size, state_size=state_size, 
                                                   action_size=action_size)
    # Calculating Qt+1
    actions_onehot = tf.one_hot(indices=actions, depth=action_size)
    next_actions_logits, nextnext_states_logits = generator(prev_actions=actions_onehot, states=next_states,
                                                            hidden_size=hidden_size, state_size=state_size, 
                                                            action_size=action_size, reuse=True)
    # Masking actions_logits unmasked to create Qs
    Qs_masked = tf.multiply(actions_logits, actions_onehot)
    Qs = tf.reduce_max(Qs_masked, axis=1)
    
    # nextQs is targetQs/labels
    nextQs = tf.reshape(nextQs, [-1, 1])
    Qs = tf.reshape(Qs, [-1, 1])

    # Discriminator for nextQs_real and nextQs_fake
    # ~Qt
    nextQs_fake = discriminator(prev_actions=actions_logits, states=next_states_logits, 
                                hidden_size=hidden_size)
    # ~Qt+1
    nextQs_real = discriminator(prev_actions=next_actions_logits, states=nextnext_states_logits, 
                                hidden_size=hidden_size, reuse=True)

    # Generator loss
    #g_loss = tf.reduce_mean(tf.square(Qs - nextQs))
    g_loss_fake1 = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=Qs, #Qt vs Qt+1
                                                                          labels=tf.nn.sigmoid(nextQs)))
    g_loss_fake2 = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=nextQs_fake,#~Qt vs Qt+1 
                                                                          labels=tf.nn.sigmoid(nextQs)))
    g_loss_real = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=nextQs_real, #~Qt+1 vs 0
                                                                         labels=tf.zeros_like(nextQs)))
    g_loss = g_loss_fake1 + g_loss_fake2 + g_loss_real

    # Discriminator loss
    d_loss_fake = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=nextQs_fake, #~Qt vs 0
                                                                         labels=tf.zeros_like(nextQs)))
    d_loss_real = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=nextQs_real, #~Qt+1 vs Qt+1
                                                                         labels=tf.nn.sigmoid(nextQs)))
    d_loss = d_loss_fake + d_loss_real

    return actions_logits, g_loss, d_loss

In [11]:
def model_opt(g_loss, d_loss, learning_rate):
    """
    Get optimization operations in order
    :param g_loss: Generator loss for state prediction
    :param d_loss: Discriminator loss for reward prediction
    :param learning_rate: Learning Rate Placeholder
    :return: A tuple of (qfunction training, generator training, discriminator training)
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Used for BN (batchnorm params)
        g_opt = tf.train.AdamOptimizer(learning_rate).minimize(g_loss, var_list=g_vars) # state prediction
        d_opt = tf.train.AdamOptimizer(learning_rate).minimize(d_loss, var_list=d_vars) # reward prediction

    return g_opt, d_opt

In [12]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        ####################################### Model data inputs/outputs #######################################
        # Input of the Model: make the data available inside the framework
        self.prev_actions, self.states, self.actions, self.next_states, self.nextQs = model_input(
            state_size=state_size)

        ######################################## Model losses #####################################################
        # Loss of the Model: action prediction/generation
        self.actions_logits, self.g_loss, self.d_loss = model_loss(
            state_size=state_size, action_size=action_size, hidden_size=hidden_size, # model init parameters
            prev_actions=self.prev_actions, states=self.states, 
            actions=self.actions, next_states=self.next_states,
            nextQs=self.nextQs) # model input data
        
        ######################################## Model updates #####################################################
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt = model_opt(g_loss=self.g_loss,
                                           d_loss=self.d_loss, learning_rate=learning_rate)

## Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on. 

Here, we'll create a `Memory` object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maxmium capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a `Memory` object. If you're unfamiliar with `deque`, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.

In [13]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]

## Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon).  That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an **$\epsilon$-greedy policy**.


At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called _exploitation_. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.

## Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in _episodes_. One *episode* is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames. So we can start a new episode once meeting that goal. The game ends if the pole tilts over too far, or if the cart moves too far the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

* Initialize the memory $D$
* Initialize the action-value network $Q$ with random weights
* **For** episode = 1, $M$ **do**
  * **For** $t$, $T$ **do**
     * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s,a)$
     * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
     * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
     * Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
     * Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
     * Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
  * **endfor**
* **endfor**

## Hyperparameters

One of the more difficult aspects of reinforcememt learning are the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.

In [14]:
print('state size:', np.array(states).shape[1], 
      'action size: {}'.format((np.max(np.array(actions)) - np.min(np.array(actions)))+1))

state size: 4 action size: 2


In [15]:
train_episodes = 2000          # max number of episodes to learn from
max_steps = 2000000000000000   # max steps in an episode

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
state_size = 4                 # number of units for the input state/observation -- simulation
hidden_size = 64               # number of units in each Q-network hidden layer -- simulation
action_size = 2                # number of units for the output actions -- simulation

# Memory parameters
memory_size = 100000           # memory capacity
batch_size = 2000              # experience mini-batch size
learning_rate = 0.001          # learning rate for adam

In [16]:
# Reset/init the graph/session
tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

## Populate the memory

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.

In [17]:
# Initialize the simulation
env.reset()

# Take one random step to get the pole and cart moving
prev_action = env.action_space.sample() # At-1
state, _, done, _ = env.step(prev_action) # St, Rt/Et (Epiosde)

# Make a bunch of random actions and store the experiences
for _ in range(batch_size):
    # Make a random action
    action = env.action_space.sample()# At
    next_state, _, done, _ = env.step(action) #St+1

    # End of the episodes which defines the goal of the episode/mission
    if done is True:
        # Add experience to memory
        memory.add((prev_action, state, action, next_state, done))
        
        # Start new episode
        env.reset()
        
        # Take one random step to get the pole and cart moving
        prev_action = env.action_space.sample()
        state, _, done, _ = env.step(prev_action)
    else:
        # Add experience to memory
        memory.add((prev_action, state, action, next_state, done))
        
        # Prepare for the next round
        prev_action = action
        state = next_state

## Training the model

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [None]:
# Now train with experiences
saver = tf.train.Saver()

# Total rewards and losses list for plotting
rewards_list = []
g_loss_list = []
d_loss_list = []

# TF session for training
with tf.Session() as sess:
    
    # Initialize variables
    sess.run(tf.global_variables_initializer())

    # Restore/load the trained model 
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Training episodes/epochs
    step = 0
    for ep in range(train_episodes):
        
        # Env/agent steps/batches/minibatches
        total_reward = 0
        g_loss = 0
        d_loss = 0
        t = 0
        while t < max_steps:
            step += 1
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from model
                feed_dict = {model.prev_actions: np.array([prev_action]), 
                             model.states: state.reshape((1, *state.shape))}
                actions_logits = sess.run(model.actions_logits, feed_dict)
                action = np.argmax(actions_logits) # arg with max value/Q is the class of action
            
            # Take action, get new state and reward
            next_state, _, done, _ = env.step(action)
    
            # Cumulative reward
            #total_reward += reward
            total_reward += 1 # done=False
            
            # Episode/epoch training is done/failed!
            if done is True:
                # the episode ends so no next state
                #next_state = np.zeros(state.shape)
                t = max_steps
                
                print('-------------------------------------------------------------------------------')
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training g_loss: {:.4f}'.format(g_loss),
                      'Training d_loss: {:.4f}'.format(d_loss),
                      'Explore P: {:.4f}'.format(explore_p))
                print('-------------------------------------------------------------------------------')
                
                # total rewards and losses for plotting
                rewards_list.append((ep, total_reward))
                g_loss_list.append((ep, g_loss))
                d_loss_list.append((ep, d_loss))
                
                # Add experience to memory
                memory.add((prev_action, state, action, next_state, done))
                
                # Start new episode
                env.reset()
                
                # Take one random step to get the pole and cart moving
                prev_action = env.action_space.sample()
                state, _, done, _ = env.step(prev_action)

            else:
                # Add experience to memory
                memory.add((prev_action, state, action, next_state, done))
                
                # One step forward: At-1=At and St=St+1
                prev_action = action
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            prev_actions = np.array([each[0] for each in batch])
            states = np.array([each[1] for each in batch])
            actions = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            
            # Calculating nextQs and setting them to 0 for states where episode ends/fails
            feed_dict={model.prev_actions: actions, model.states: next_states}
            next_actions_logits = sess.run(model.actions_logits, feed_dict)
            
            # Masking for the end of episodes/ goals
            next_actions_mask = (1 - dones.astype(next_actions_logits.dtype)).reshape(-1, 1) 
            nextQs_masked = np.multiply(next_actions_logits, next_actions_mask)
            nextQs = np.max(nextQs_masked, axis=1)
            
            # Calculating nextQs for Discriminator using D(At-1, St)= Qt: NOT this one
            # NextQs/Qt+1 are given both:
            feed_dict = {model.prev_actions: prev_actions, 
                         model.states: states, 
                         model.actions: actions, 
                         model.next_states: next_states, 
                         model.nextQs: nextQs}
            g_loss, _ = sess.run([model.g_loss, model.g_opt], feed_dict)
            d_loss, _ = sess.run([model.d_loss, model.d_opt], feed_dict)
                        
    # Save the trained model
    saver.save(sess, 'checkpoints/model.ckpt')

-------------------------------------------------------------------------------
Episode: 0 Total reward: 23 Training g_loss: 1.7791 Training d_loss: 1.0989 Explore P: 0.9977
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1 Total reward: 22 Training g_loss: 1.7740 Training d_loss: 1.0859 Explore P: 0.9956
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2 Total reward: 15 Training g_loss: 1.7620 Training d_loss: 1.0729 Explore P: 0.9941
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3 Total reward: 32 Training g_loss: 1.6831 Training d_loss: 0.9955 Explore P: 0.9909
----------------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 33 Total reward: 20 Training g_loss: 1.8046 Training d_loss: 1.3595 Explore P: 0.9189
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 34 Total reward: 44 Training g_loss: 1.8534 Training d_loss: 1.3433 Explore P: 0.9149
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 35 Total reward: 10 Training g_loss: 1.8623 Training d_loss: 1.3386 Explore P: 0.9140
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 36 Total reward: 12 Training g_loss: 1.8686 Training d_loss: 1.3350 Explore P: 0.9129
------------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 66 Total reward: 18 Training g_loss: 1.8855 Training d_loss: 1.2059 Explore P: 0.8522
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 67 Total reward: 16 Training g_loss: 1.8836 Training d_loss: 1.2030 Explore P: 0.8508
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 68 Total reward: 23 Training g_loss: 1.8794 Training d_loss: 1.1967 Explore P: 0.8489
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 69 Total reward: 15 Training g_loss: 1.8797 Training d_loss: 1.1972 Explore P: 0.8477
------------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 99 Total reward: 12 Training g_loss: 1.8722 Training d_loss: 1.1865 Explore P: 0.7960
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 100 Total reward: 24 Training g_loss: 1.8710 Training d_loss: 1.1849 Explore P: 0.7942
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 101 Total reward: 40 Training g_loss: 1.8734 Training d_loss: 1.1882 Explore P: 0.7910
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 102 Total reward: 16 Training g_loss: 1.8777 Training d_loss: 1.1942 Explore P: 0.7898
---------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 132 Total reward: 13 Training g_loss: 1.8085 Training d_loss: 1.3580 Explore P: 0.7451
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 133 Total reward: 49 Training g_loss: 1.8964 Training d_loss: 1.3011 Explore P: 0.7415
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 134 Total reward: 14 Training g_loss: 1.8991 Training d_loss: 1.2926 Explore P: 0.7405
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 135 Total reward: 21 Training g_loss: 1.8986 Training d_loss: 1.2766 Explore P: 0.7389
--------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 165 Total reward: 19 Training g_loss: 1.5900 Training d_loss: 1.3730 Explore P: 0.6774
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 166 Total reward: 86 Training g_loss: 1.5830 Training d_loss: 1.3707 Explore P: 0.6717
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 167 Total reward: 83 Training g_loss: 1.5771 Training d_loss: 1.3704 Explore P: 0.6662
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 168 Total reward: 54 Training g_loss: 1.5732 Training d_loss: 1.3727 Explore P: 0.6627
--------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 199 Total reward: 17 Training g_loss: 1.5027 Training d_loss: 1.3520 Explore P: 0.5995
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 200 Total reward: 62 Training g_loss: 1.5189 Training d_loss: 1.2446 Explore P: 0.5959
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 201 Total reward: 43 Training g_loss: 1.4808 Training d_loss: 1.2454 Explore P: 0.5934
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 202 Total reward: 20 Training g_loss: 1.4932 Training d_loss: 1.3672 Explore P: 0.5922
--------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 231 Total reward: 19 Training g_loss: 0.8112 Training d_loss: 0.0689 Explore P: 0.5046
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 232 Total reward: 24 Training g_loss: 0.8185 Training d_loss: 0.0867 Explore P: 0.5034
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 233 Total reward: 33 Training g_loss: 1.7186 Training d_loss: 1.5596 Explore P: 0.5018
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 234 Total reward: 8 Training g_loss: 1.5797 Training d_loss: 1.4374 Explore P: 0.5014
---------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 264 Total reward: 27 Training g_loss: 1.4599 Training d_loss: 1.3846 Explore P: 0.4499
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 265 Total reward: 61 Training g_loss: 1.4899 Training d_loss: 1.3810 Explore P: 0.4472
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 266 Total reward: 30 Training g_loss: 1.4528 Training d_loss: 1.3604 Explore P: 0.4459
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 267 Total reward: 15 Training g_loss: 1.4631 Training d_loss: 1.3817 Explore P: 0.4452
--------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 297 Total reward: 89 Training g_loss: 1.4524 Training d_loss: 1.3313 Explore P: 0.3983
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 298 Total reward: 91 Training g_loss: 1.4523 Training d_loss: 1.2965 Explore P: 0.3948
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 299 Total reward: 26 Training g_loss: 1.4477 Training d_loss: 1.3770 Explore P: 0.3938
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 300 Total reward: 24 Training g_loss: 1.4446 Training d_loss: 1.3074 Explore P: 0.3928
--------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 329 Total reward: 37 Training g_loss: 1.4402 Training d_loss: 1.3354 Explore P: 0.3495
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 330 Total reward: 26 Training g_loss: 1.4399 Training d_loss: 1.3278 Explore P: 0.3486
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 331 Total reward: 17 Training g_loss: 1.4385 Training d_loss: 1.3301 Explore P: 0.3480
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 332 Total reward: 43 Training g_loss: 1.4549 Training d_loss: 1.3362 Explore P: 0.3466
--------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 363 Total reward: 10 Training g_loss: 0.7694 Training d_loss: 0.0725 Explore P: 0.3316
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 364 Total reward: 14 Training g_loss: 0.7686 Training d_loss: 0.0718 Explore P: 0.3311
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 365 Total reward: 10 Training g_loss: 0.7782 Training d_loss: 0.0781 Explore P: 0.3308
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 366 Total reward: 11 Training g_loss: 0.7989 Training d_loss: 0.0956 Explore P: 0.3305
--------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 396 Total reward: 12 Training g_loss: 0.7625 Training d_loss: 0.0613 Explore P: 0.3171
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 397 Total reward: 13 Training g_loss: 0.7677 Training d_loss: 0.0647 Explore P: 0.3167
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 398 Total reward: 26 Training g_loss: 0.8134 Training d_loss: 0.1118 Explore P: 0.3159
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 399 Total reward: 29 Training g_loss: 0.9012 Training d_loss: 0.1899 Explore P: 0.3150
--------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 429 Total reward: 37 Training g_loss: 0.8795 Training d_loss: 0.0663 Explore P: 0.2679
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 430 Total reward: 9 Training g_loss: 0.8013 Training d_loss: 0.0637 Explore P: 0.2677
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 431 Total reward: 12 Training g_loss: 3.1643 Training d_loss: 2.7263 Explore P: 0.2673
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 432 Total reward: 9 Training g_loss: 1.0286 Training d_loss: 0.3285 Explore P: 0.2671
----------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 463 Total reward: 10 Training g_loss: 0.7643 Training d_loss: 0.0544 Explore P: 0.2585
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 464 Total reward: 12 Training g_loss: 0.7734 Training d_loss: 0.0604 Explore P: 0.2582
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 465 Total reward: 21 Training g_loss: 0.7853 Training d_loss: 0.0751 Explore P: 0.2576
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 466 Total reward: 8 Training g_loss: 0.8125 Training d_loss: 0.1005 Explore P: 0.2574
---------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 496 Total reward: 167 Training g_loss: 1.4631 Training d_loss: 1.2741 Explore P: 0.2415
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 497 Total reward: 199 Training g_loss: 1.4338 Training d_loss: 1.2008 Explore P: 0.2369
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 498 Total reward: 130 Training g_loss: 1.3954 Training d_loss: 1.1074 Explore P: 0.2340
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 499 Total reward: 34 Training g_loss: 1.4206 Training d_loss: 1.1595 Explore P: 0.2332
-----------------------------------------------------

-------------------------------------------------------------------------------
Episode: 528 Total reward: 83 Training g_loss: 2.0280 Training d_loss: 1.3973 Explore P: 0.1991
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 529 Total reward: 8 Training g_loss: 1.8869 Training d_loss: 1.4004 Explore P: 0.1989
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 530 Total reward: 11 Training g_loss: 1.8061 Training d_loss: 1.4050 Explore P: 0.1987
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 531 Total reward: 13 Training g_loss: 1.6939 Training d_loss: 1.4009 Explore P: 0.1985
---------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 561 Total reward: 10 Training g_loss: 0.7747 Training d_loss: 0.0665 Explore P: 0.1764
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 562 Total reward: 9 Training g_loss: 0.7665 Training d_loss: 0.0575 Explore P: 0.1762
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 563 Total reward: 9 Training g_loss: 0.7838 Training d_loss: 0.0564 Explore P: 0.1761
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 564 Total reward: 9 Training g_loss: 0.8069 Training d_loss: 0.0555 Explore P: 0.1759
-----------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 594 Total reward: 75 Training g_loss: 0.7888 Training d_loss: 0.0490 Explore P: 0.1573
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 595 Total reward: 13 Training g_loss: 1.0960 Training d_loss: 0.0585 Explore P: 0.1572
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 596 Total reward: 15 Training g_loss: 1.0774 Training d_loss: 0.1825 Explore P: 0.1569
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 597 Total reward: 63 Training g_loss: 1.2257 Training d_loss: 0.5373 Explore P: 0.1560
--------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 627 Total reward: 8 Training g_loss: 0.7606 Training d_loss: 0.0577 Explore P: 0.1514
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 628 Total reward: 9 Training g_loss: 0.7539 Training d_loss: 0.0527 Explore P: 0.1512
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 629 Total reward: 10 Training g_loss: 0.7684 Training d_loss: 0.0654 Explore P: 0.1511
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 630 Total reward: 9 Training g_loss: 0.7621 Training d_loss: 0.0595 Explore P: 0.1510
-----------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 661 Total reward: 9 Training g_loss: 0.7628 Training d_loss: 0.0578 Explore P: 0.1466
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 662 Total reward: 11 Training g_loss: 0.7653 Training d_loss: 0.0571 Explore P: 0.1465
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 663 Total reward: 9 Training g_loss: 0.7458 Training d_loss: 0.0432 Explore P: 0.1463
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 664 Total reward: 10 Training g_loss: 0.7592 Training d_loss: 0.0541 Explore P: 0.1462
----------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 694 Total reward: 12 Training g_loss: 0.7774 Training d_loss: 0.0683 Explore P: 0.1308
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 695 Total reward: 11 Training g_loss: 0.7657 Training d_loss: 0.0604 Explore P: 0.1307
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 696 Total reward: 9 Training g_loss: 0.7959 Training d_loss: 0.0917 Explore P: 0.1305
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 697 Total reward: 9 Training g_loss: 0.7914 Training d_loss: 0.0660 Explore P: 0.1304
----------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 726 Total reward: 122 Training g_loss: 1.4303 Training d_loss: 1.2891 Explore P: 0.1002
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 727 Total reward: 168 Training g_loss: 1.4097 Training d_loss: 1.3200 Explore P: 0.0987
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 728 Total reward: 178 Training g_loss: 1.3726 Training d_loss: 1.1530 Explore P: 0.0971
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 729 Total reward: 95 Training g_loss: 0.7586 Training d_loss: 0.0562 Explore P: 0.0963
-----------------------------------------------------

-------------------------------------------------------------------------------
Episode: 758 Total reward: 199 Training g_loss: 0.7583 Training d_loss: 0.0532 Explore P: 0.0771
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 759 Total reward: 65 Training g_loss: 0.7464 Training d_loss: 0.0422 Explore P: 0.0767
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 760 Total reward: 33 Training g_loss: 0.7713 Training d_loss: 0.0648 Explore P: 0.0765
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 761 Total reward: 13 Training g_loss: 0.7681 Training d_loss: 0.0623 Explore P: 0.0764
-------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 790 Total reward: 123 Training g_loss: 1.4213 Training d_loss: 1.3270 Explore P: 0.0598
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 791 Total reward: 133 Training g_loss: 1.4257 Training d_loss: 1.3439 Explore P: 0.0592
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 792 Total reward: 126 Training g_loss: 1.4410 Training d_loss: 1.3472 Explore P: 0.0586
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 793 Total reward: 127 Training g_loss: 1.3670 Training d_loss: 1.1188 Explore P: 0.0579
----------------------------------------------------

-------------------------------------------------------------------------------
Episode: 823 Total reward: 8 Training g_loss: 0.7481 Training d_loss: 0.0433 Explore P: 0.0562
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 824 Total reward: 8 Training g_loss: 0.7633 Training d_loss: 0.0516 Explore P: 0.0562
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 825 Total reward: 8 Training g_loss: 0.7508 Training d_loss: 0.0445 Explore P: 0.0561
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 826 Total reward: 10 Training g_loss: 0.7647 Training d_loss: 0.0556 Explore P: 0.0561
-----------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 856 Total reward: 11 Training g_loss: 0.7489 Training d_loss: 0.0458 Explore P: 0.0538
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 857 Total reward: 8 Training g_loss: 0.7504 Training d_loss: 0.0477 Explore P: 0.0538
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 858 Total reward: 13 Training g_loss: 0.7456 Training d_loss: 0.0410 Explore P: 0.0537
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 859 Total reward: 12 Training g_loss: 0.7509 Training d_loss: 0.0457 Explore P: 0.0537
---------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 890 Total reward: 128 Training g_loss: 1.4430 Training d_loss: 1.2498 Explore P: 0.0513
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 891 Total reward: 127 Training g_loss: 1.3697 Training d_loss: 1.0244 Explore P: 0.0507
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 892 Total reward: 153 Training g_loss: 1.3084 Training d_loss: 1.0264 Explore P: 0.0501
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 893 Total reward: 199 Training g_loss: 0.7574 Training d_loss: 0.0513 Explore P: 0.0493
----------------------------------------------------

-------------------------------------------------------------------------------
Episode: 922 Total reward: 33 Training g_loss: 1.4743 Training d_loss: 1.3569 Explore P: 0.0405
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 923 Total reward: 92 Training g_loss: 1.4812 Training d_loss: 1.3374 Explore P: 0.0403
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 924 Total reward: 111 Training g_loss: 1.4527 Training d_loss: 1.3795 Explore P: 0.0399
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 925 Total reward: 110 Training g_loss: 1.4840 Training d_loss: 1.3776 Explore P: 0.0396
------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 954 Total reward: 9 Training g_loss: 0.7543 Training d_loss: 0.0523 Explore P: 0.0343
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 955 Total reward: 22 Training g_loss: 0.7621 Training d_loss: 0.0589 Explore P: 0.0343
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 956 Total reward: 10 Training g_loss: 0.7497 Training d_loss: 0.0462 Explore P: 0.0342
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 957 Total reward: 9 Training g_loss: 0.7463 Training d_loss: 0.0435 Explore P: 0.0342
----------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 987 Total reward: 11 Training g_loss: 1.4806 Training d_loss: 1.3502 Explore P: 0.0294
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 988 Total reward: 37 Training g_loss: 1.4640 Training d_loss: 1.3570 Explore P: 0.0294
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 989 Total reward: 70 Training g_loss: 1.4434 Training d_loss: 1.3705 Explore P: 0.0292
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 990 Total reward: 87 Training g_loss: 1.4513 Training d_loss: 1.3511 Explore P: 0.0291
--------------------------------------------------------

-------------------------------------------------------------------------------
Episode: 1019 Total reward: 149 Training g_loss: 1.5335 Training d_loss: 1.4151 Explore P: 0.0256
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1020 Total reward: 24 Training g_loss: 1.4804 Training d_loss: 1.3698 Explore P: 0.0256
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1021 Total reward: 86 Training g_loss: 1.4462 Training d_loss: 1.3656 Explore P: 0.0254
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1022 Total reward: 85 Training g_loss: 1.4432 Training d_loss: 1.3490 Explore P: 0.0253
---------------------------------------------------

-------------------------------------------------------------------------------
Episode: 1051 Total reward: 96 Training g_loss: 1.4104 Training d_loss: 1.2770 Explore P: 0.0219
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1052 Total reward: 92 Training g_loss: 1.4360 Training d_loss: 1.3881 Explore P: 0.0218
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1053 Total reward: 61 Training g_loss: 1.5052 Training d_loss: 1.3886 Explore P: 0.0218
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1054 Total reward: 102 Training g_loss: 1.4310 Training d_loss: 1.3256 Explore P: 0.0216
---------------------------------------------------

-------------------------------------------------------------------------------
Episode: 1083 Total reward: 131 Training g_loss: 1.4272 Training d_loss: 1.3111 Explore P: 0.0183
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1084 Total reward: 118 Training g_loss: 1.3596 Training d_loss: 1.1550 Explore P: 0.0182
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1085 Total reward: 143 Training g_loss: 1.3225 Training d_loss: 1.0969 Explore P: 0.0181
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1086 Total reward: 82 Training g_loss: 1.4247 Training d_loss: 1.2875 Explore P: 0.0180
-------------------------------------------------

-------------------------------------------------------------------------------
Episode: 1115 Total reward: 199 Training g_loss: 1.4214 Training d_loss: 1.3403 Explore P: 0.0156
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1116 Total reward: 199 Training g_loss: 1.4632 Training d_loss: 1.3908 Explore P: 0.0155
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1117 Total reward: 199 Training g_loss: 1.2087 Training d_loss: 0.7918 Explore P: 0.0154
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1118 Total reward: 199 Training g_loss: 1.3021 Training d_loss: 1.0684 Explore P: 0.0153
------------------------------------------------

-------------------------------------------------------------------------------
Episode: 1147 Total reward: 21 Training g_loss: 1.4751 Training d_loss: 1.3728 Explore P: 0.0136
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1148 Total reward: 199 Training g_loss: 1.4403 Training d_loss: 1.3279 Explore P: 0.0136
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1149 Total reward: 62 Training g_loss: 1.4441 Training d_loss: 1.3416 Explore P: 0.0135
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1150 Total reward: 122 Training g_loss: 1.2721 Training d_loss: 0.8629 Explore P: 0.0135
--------------------------------------------------

-------------------------------------------------------------------------------
Episode: 1179 Total reward: 33 Training g_loss: 1.3762 Training d_loss: 1.2202 Explore P: 0.0126
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1180 Total reward: 107 Training g_loss: 1.2825 Training d_loss: 0.9539 Explore P: 0.0126
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1181 Total reward: 54 Training g_loss: 1.2476 Training d_loss: 0.9145 Explore P: 0.0126
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1182 Total reward: 20 Training g_loss: 1.2361 Training d_loss: 0.9556 Explore P: 0.0126
---------------------------------------------------

-------------------------------------------------------------------------------
Episode: 1211 Total reward: 31 Training g_loss: 1.5242 Training d_loss: 1.3634 Explore P: 0.0121
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1212 Total reward: 19 Training g_loss: 1.4753 Training d_loss: 1.3546 Explore P: 0.0121
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1213 Total reward: 9 Training g_loss: 1.4782 Training d_loss: 1.3538 Explore P: 0.0121
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1214 Total reward: 12 Training g_loss: 1.4535 Training d_loss: 1.3447 Explore P: 0.0121
-----------------------------------------------------

-------------------------------------------------------------------------------
Episode: 1243 Total reward: 89 Training g_loss: 1.0753 Training d_loss: 0.5646 Explore P: 0.0115
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1244 Total reward: 199 Training g_loss: 1.4563 Training d_loss: 1.3791 Explore P: 0.0115
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1245 Total reward: 199 Training g_loss: 1.5649 Training d_loss: 1.3940 Explore P: 0.0115
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1246 Total reward: 29 Training g_loss: 1.4925 Training d_loss: 1.3275 Explore P: 0.0114
--------------------------------------------------

-------------------------------------------------------------------------------
Episode: 1275 Total reward: 199 Training g_loss: 1.4234 Training d_loss: 1.3542 Explore P: 0.0111
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1276 Total reward: 199 Training g_loss: 1.4187 Training d_loss: 1.3017 Explore P: 0.0111
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1277 Total reward: 131 Training g_loss: 1.3953 Training d_loss: 1.1987 Explore P: 0.0110
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1278 Total reward: 110 Training g_loss: 1.2219 Training d_loss: 0.9067 Explore P: 0.0110
------------------------------------------------

-------------------------------------------------------------------------------
Episode: 1307 Total reward: 15 Training g_loss: 1.4424 Training d_loss: 1.3715 Explore P: 0.0108
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1308 Total reward: 140 Training g_loss: 1.4253 Training d_loss: 1.3555 Explore P: 0.0108
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1309 Total reward: 140 Training g_loss: 1.3477 Training d_loss: 1.1266 Explore P: 0.0108
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1310 Total reward: 101 Training g_loss: 1.3209 Training d_loss: 1.0424 Explore P: 0.0108
-------------------------------------------------

-------------------------------------------------------------------------------
Episode: 1339 Total reward: 199 Training g_loss: 1.4713 Training d_loss: 1.3501 Explore P: 0.0106
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1340 Total reward: 159 Training g_loss: 1.2255 Training d_loss: 0.6779 Explore P: 0.0106
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1341 Total reward: 55 Training g_loss: 1.2252 Training d_loss: 0.8348 Explore P: 0.0106
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1342 Total reward: 11 Training g_loss: 1.2557 Training d_loss: 0.9548 Explore P: 0.0106
--------------------------------------------------

-------------------------------------------------------------------------------
Episode: 1371 Total reward: 19 Training g_loss: 1.1767 Training d_loss: 0.7626 Explore P: 0.0105
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1372 Total reward: 14 Training g_loss: 1.5602 Training d_loss: 1.3613 Explore P: 0.0105
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1373 Total reward: 23 Training g_loss: 1.4846 Training d_loss: 1.3871 Explore P: 0.0105
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1374 Total reward: 152 Training g_loss: 1.4125 Training d_loss: 1.2092 Explore P: 0.0105
---------------------------------------------------

-------------------------------------------------------------------------------
Episode: 1403 Total reward: 55 Training g_loss: 1.2794 Training d_loss: 0.9929 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1404 Total reward: 133 Training g_loss: 1.1375 Training d_loss: 0.7492 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1405 Total reward: 104 Training g_loss: 1.5238 Training d_loss: 1.3918 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1406 Total reward: 17 Training g_loss: 1.4537 Training d_loss: 1.3825 Explore P: 0.0104
--------------------------------------------------

-------------------------------------------------------------------------------
Episode: 1435 Total reward: 64 Training g_loss: 1.2922 Training d_loss: 0.9815 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1436 Total reward: 175 Training g_loss: 1.1293 Training d_loss: 0.7556 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1437 Total reward: 47 Training g_loss: 1.4378 Training d_loss: 1.3859 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1438 Total reward: 41 Training g_loss: 1.1696 Training d_loss: 0.6769 Explore P: 0.0103
---------------------------------------------------

-------------------------------------------------------------------------------
Episode: 1467 Total reward: 97 Training g_loss: 1.1930 Training d_loss: 0.8507 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1468 Total reward: 199 Training g_loss: 0.7478 Training d_loss: 0.0540 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1469 Total reward: 160 Training g_loss: 1.4196 Training d_loss: 1.3020 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1470 Total reward: 84 Training g_loss: 1.3203 Training d_loss: 1.1104 Explore P: 0.0102
--------------------------------------------------

-------------------------------------------------------------------------------
Episode: 1499 Total reward: 80 Training g_loss: 1.0255 Training d_loss: 0.3161 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1500 Total reward: 199 Training g_loss: 0.7275 Training d_loss: 0.0332 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1501 Total reward: 142 Training g_loss: 1.4394 Training d_loss: 1.3732 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1502 Total reward: 192 Training g_loss: 0.9405 Training d_loss: 0.2490 Explore P: 0.0101
-------------------------------------------------

-------------------------------------------------------------------------------
Episode: 1531 Total reward: 199 Training g_loss: 0.7264 Training d_loss: 0.0327 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1532 Total reward: 136 Training g_loss: 1.4607 Training d_loss: 1.3820 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1533 Total reward: 14 Training g_loss: 1.4598 Training d_loss: 1.3530 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1534 Total reward: 113 Training g_loss: 1.4099 Training d_loss: 1.3091 Explore P: 0.0101
-------------------------------------------------

-------------------------------------------------------------------------------
Episode: 1563 Total reward: 95 Training g_loss: 1.3414 Training d_loss: 1.1347 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1564 Total reward: 113 Training g_loss: 1.3699 Training d_loss: 1.1732 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1565 Total reward: 199 Training g_loss: 1.4382 Training d_loss: 1.3763 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1566 Total reward: 199 Training g_loss: 1.1815 Training d_loss: 0.7171 Explore P: 0.0101
-------------------------------------------------

## Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(g_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('G losses')

In [None]:
eps, arr = np.array(d_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses')

## Testing

Let's checkout how our trained agent plays the game.

In [43]:
test_episodes = 1
test_max_steps = 20000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

# # # Create the env after closing it.
# # env = gym.make('CartPole-v0')
# # env = gym.make('Acrobot-v1')
# env.reset()

with tf.Session() as sess:
    
    # Restore/load the trained model 
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # iterations
    for ep in range(test_episodes):
        
        # number of env/rob steps
        t = 0
        while t < test_max_steps:
            
            # Rendering the env graphics
            env.render()
            
            # Get action from the model
            feed_dict = {model.prev_actions: np.array([prev_action]), 
                         model.states: state.reshape((1, *state.shape))}
            actions_logits = sess.run(model.actions_logits, feed_dict)
            action = np.argmax(actions_logits)
            
            # Take action, get new state and reward
            next_state, _, done, _ = env.step(action)
            
            # The task is done or not;
            if done:
                t = test_max_steps
                env.reset()
                
                # Take one random step to get the pole and cart moving
                prev_action = env.action_space.sample()
                state, reward, done, _ = env.step(prev_action)
            else:
                state = next_state
                t += 1

INFO:tensorflow:Restoring parameters from checkpoints/model.ckpt


In [45]:
# Closing the env
# WARNING: If you close, you can NOT restart again!!!!!!
env.close()

## Extending this to Deep Convolutional QAN

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.