
# Q learning (Q-Net)

More specifically, we'll use Q-GAN to train an agent to play a game called [Cart-Pole](https://gym.openai.com/envs/CartPole-v0). In this game, a freely swinging pole is attached to a cart. The cart can move to the left and right, and the goal is to keep the pole upright as long as possible.

![Cart-Pole](assets/cart-pole.jpg)

We can simulate this game using [OpenAI Gym](https://gym.openai.com/). First, let's check out how OpenAI Gym works. Then, we'll get into training an agent to play the Cart-Pole game.

In [1]:
import numpy as np

In [2]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.8.0
Default GPU Device: /device:GPU:0


>**Note:** Make sure you have OpenAI Gym cloned. Then run this command 'pip install -e gym/[all]'.

In [3]:
import gym
# Create the Cart-Pole game environment
env = gym.make('CartPole-v0')
# env = gym.make('CartPole-v1')
# env = gym.make('Acrobot-v1')
# env = gym.make('MountainCar-v0')
# env = gym.make('Pendulum-v0')
# env = gym.make('Blackjack-v0')
# env = gym.make('FrozenLake-v0')
# env = gym.make('AirRaid-ram-v0')
# env = gym.make('AirRaid-v0')
# env = gym.make('BipedalWalker-v2')
# env = gym.make('Copy-v0')
# env = gym.make('CarRacing-v0')
# env = gym.make('Ant-v2') #mujoco
# env = gym.make('FetchPickAndPlace-v1') # mujoco required!

[33mWARN: gym.spaces.Box autodetected dtype as <class 'numpy.float32'>. Please provide explicit dtype.[0m




We interact with the simulation through `env`. To show the simulation running, you can use `env.render()` to render one frame. Passing in an action as an integer to `env.step` will generate the next step in the simulation.  You can see how many actions are possible from `env.action_space` and to get a random action you can use `env.action_space.sample()`. This is general to all Gym games. In the Cart-Pole game, there are two possible actions, moving the cart left or right. So there are two actions we can take, encoded as 0 and 1.

Run the code below to watch the simulation run.

In [4]:
env.reset()
rewards, states, actions, dones = [], [], [], []
for _ in range(10):
    #env.render()
    action = env.action_space.sample()
    state, reward, done, info = env.step(action) # take a random action
    states.append(state)
    rewards.append(reward)
    actions.append(action)
    dones.append(done)
    #     print('state, action, reward, done, info')
    #     print(state, action, reward, done, info)
    if done:
    #         print('state, action, reward, done, info')
    #         print(state, action, reward, done, info)
        states.append(state)
        rewards.append(reward)
        actions.append(action)
        dones.append(done)

To shut the window showing the simulation, use `env.close()`.

If you ran the simulation above, we can look at the rewards:

In [5]:
print(rewards[-20:])
print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print('actions:', np.max(np.array(actions)), np.min(np.array(actions)))
print('rewards min and max:', np.max(np.array(rewards)), np.min(np.array(rewards)))
print('state size:', np.array(states).shape, 
      'action size: {}'.format((np.max(np.array(actions)) - np.min(np.array(actions)))+1))

[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
(10,) (10, 4) (10,) (10,)
float64 float64 int64 bool
actions: 1 0
rewards min and max: 1.0 1.0
state size: (10, 4) action size: 2


The game resets after the pole has fallen past a certain angle. For each frame while the simulation is running, it returns a reward of 1.0. The longer the game runs, the more reward we get. Then, our network's goal is to maximize the reward by keeping the pole vertical. It will do this by moving the cart to the left and the right.

In [6]:
# Data of the model
def model_input(state_size):
    # Current and next states given
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    next_states = tf.placeholder(tf.float32, [None, state_size], name='next_states')
    
    # Previous and current actions given
    prev_actions = tf.placeholder(tf.int32, [None], name='prev_actions')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    
    # End of episodes/goal/task where nextQs = 0 and Qs=rs
    dones = tf.placeholder(tf.bool, [None], name='dones') # masked

    # Qs = qs+ (gamma * nextQs)
    nextQs = tf.placeholder(tf.float32, [None], name='nextQs') # masked
    nextQs_D = tf.placeholder(tf.float32, [None], name='nextQs_D') # masked
    
    # returning the given data to the model
    return prev_actions, states, actions, next_states, dones, nextQs, nextQs_D

In [7]:
# Generator: Generating/predicting action and next states
def generator(prev_actions, states, action_size, state_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # Fusing states and actions
        x_fused = tf.concat(axis=1, values=[prev_actions, states])
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=(action_size + state_size))
        actions_logits, next_states_logits = tf.split(axis=1, num_or_size_splits=[action_size, state_size], 
                                                      value=logits)
        #predictions = tf.nn.softmax(actions_logits)
        #predictions = tf.sigmoid(next_states_logits)

        # return actions and states logits
        return actions_logits, next_states_logits

In [8]:
def discriminator(prev_actions, states, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('discriminator', reuse=reuse):
        # Fusing states and actions
        x_fused = tf.concat(axis=1, values=[prev_actions, states])
        #print(x_fused.shape)
        
        # First fully connected layer
        h1 = tf.layers.dense(inputs=x_fused, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        #print(h1.shape)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        #print(h2.shape)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)
        #predictions = tf.nn.softmax(logits)

        # return reward logits/Qs
        return logits

In [9]:
# The model loss for predicted/generated actions
def model_loss(prev_actions, states, actions, # model input data for Qs/qs/rs 
               nextQs, gamma, # model input data for targetQs
               state_size, action_size, hidden_size): # model init for Qs
    # Calculating Qs total rewards
    prev_actions_onehot = tf.one_hot(indices=prev_actions, depth=action_size)
    actions_logits, _ = generator(prev_actions=prev_actions_onehot, states=states, 
                                  hidden_size=hidden_size, state_size=state_size, action_size=action_size)
    
    # Masking actions_logits unmasked
    actions_mask = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    Qs_masked = tf.multiply(actions_logits, actions_mask)
    Qs = tf.reduce_max(Qs_masked, axis=1)
    
    # Bellman equaion for calculating total rewards using current reward + total future rewards/nextQs
    qs = tf.sigmoid(Qs) # qt
    targetQs = qs + (gamma * nextQs)
    
    # Calculating the loss: logits/predictions vs labels
    q_loss = tf.reduce_mean(tf.square(Qs - targetQs))

    return actions_logits, q_loss

In [10]:
# Outputing the unmasked nextQs for D to be used as the target/label
def model_output(actions, next_states,
                 action_size, hidden_size):
    # Discriminator for nextQs_D
    actions_onehot = tf.one_hot(indices=actions, depth=action_size)
    nextQs_D_unmasked = discriminator(prev_actions=actions_onehot, states=next_states, hidden_size=hidden_size)
    
    # Returning unmasked nextQs_D to masked using dones/ends of episodes
    return nextQs_D_unmasked

In [11]:
# The model loss for the NEW idea G & D
def model_loss2(nextQs_D, gamma, 
                prev_actions, states, 
                action_size, hidden_size):
    # Calculating Qs total rewards using Discriminator
    prev_actions_onehot = tf.one_hot(indices=prev_actions, depth=action_size)
    Qs = discriminator(prev_actions=prev_actions_onehot, states=states, hidden_size=hidden_size, reuse=True)
        
    # Bellman equaion: Qs = rt/qt + nextQs_G/D
    qs = tf.sigmoid(Qs) # qt
    targetQs_D = qs + (gamma * nextQs_D)
    
    # Calculating the loss: logits/predictions vs labels
    d_loss = tf.reduce_mean(tf.square(Qs - targetQs_D))
    return d_loss

In [12]:
# Calculating the loss of generator based on the generated/predicted states and actions
def model_loss3(nextQs_D, gamma,
                prev_actions, states, dones, 
                state_size, action_size, hidden_size):
    # Generator for nextQs_G
    prev_actions_onehot = tf.one_hot(indices=prev_actions, depth=action_size)
    actions_logits, next_states_logits = generator(prev_actions=prev_actions_onehot, states=states,
                                                   hidden_size=hidden_size, state_size=state_size, 
                                                   action_size=action_size, reuse=True)
    
    # Discriminator for nextQs_G
    nextQs_G_unmasked = discriminator(prev_actions=actions_logits, states=next_states_logits, 
                                      hidden_size=hidden_size, reuse=True)
    
    # Masking the unmasked nextQs_G using dones/end of episodes/goal
    dones_mask = tf.reshape(tensor=(1 - tf.cast(dtype=nextQs_G_unmasked.dtype, x=dones)), shape=[-1, 1])
    nextQs_G_masked = tf.multiply(nextQs_G_unmasked, dones_mask)
    nextQs_G = tf.reduce_max(axis=1, input_tensor=nextQs_G_masked)

    # Below is the idea behind this loss
    # # Bellman equaion: Qs = rt/qt + nextQs_G/D
    # qs = tf.sigmoid(Qs) # qt
    # targetQs_G = qs + (gamma * nextQs_G)
    # targetQs_D = qs + (gamma * nextQs_D)
    # targetQs_G = targetQs_D
    # nextQs_G = nextQs_D 
    # Calculating the loss: logits/predictions vs labels
    g_loss = tf.reduce_mean(tf.square(nextQs_G - nextQs_D))
    
    # Returning g_loss which should impact Generator
    return g_loss

In [13]:
def model_opt(q_loss, g_loss, d_loss, learning_rate):
    """
    Get optimization operations in order
    :param q_loss: Generator loss for action prediction
    :param g_loss: Generator loss for state prediction
    :param d_loss: Discriminator loss for reward prediction
    :param learning_rate: Learning Rate Placeholder
    :return: A tuple of (qfunction training, generator training, discriminator training)
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Used for BN (batchnorm params)
        q_opt = tf.train.AdamOptimizer(learning_rate).minimize(q_loss, var_list=g_vars) # action prediction
        g_opt = tf.train.AdamOptimizer(learning_rate).minimize(g_loss, var_list=g_vars) # state prediction
        d_opt = tf.train.AdamOptimizer(learning_rate).minimize(d_loss, var_list=d_vars) # reward prediction

    return q_opt, g_opt, d_opt

In [14]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate, gamma):

        ####################################### Model data inputs/outputs #######################################
        # Input of the Model: make the data available inside the framework
        self.prev_actions, self.states, self.actions, self.next_states, self.dones, self.nextQs, self.nextQs_D = model_input(
            state_size=state_size)

        # Output of the Model
        self.nextQs_D_unmasked = model_output(actions=self.actions, next_states=self.next_states,
                                              action_size=action_size, hidden_size=hidden_size)
        
        ######################################## Model losses #####################################################
        # Loss of the Model: action prediction/generation
        self.actions_logits, self.q_loss = model_loss(
            state_size=state_size, action_size=action_size, hidden_size=hidden_size, gamma=gamma, # model init parameters
            prev_actions=self.prev_actions, states=self.states, actions=self.actions, nextQs=self.nextQs) # model input data

        # Loss of the model: reward prob/logits prediction
        self.d_loss = model_loss2(nextQs_D=self.nextQs_D, gamma=gamma,
                                  action_size=action_size, hidden_size=hidden_size,
                                  prev_actions=self.prev_actions, states=self.states)
        
        # Loss of the model: states prediction/generation
        self.g_loss = model_loss3(nextQs_D=self.nextQs_D, gamma=gamma, dones=self.dones,
                                  state_size=state_size, action_size=action_size, hidden_size=hidden_size,
                                  prev_actions=self.prev_actions, states=self.states)
        
        ######################################## Model updates #####################################################
        # Update the model: backward pass and backprop
        self.q_opt, self.g_opt, self.d_opt = model_opt(q_loss=self.q_loss, 
                                                       g_loss=self.g_loss, 
                                                       d_loss=self.d_loss, learning_rate=learning_rate)

## Experience replay

Reinforcement learning algorithms can have stability issues due to correlations between states. To reduce correlations when training, we can store the agent's experiences and later draw a random mini-batch of those experiences to train on. 

Here, we'll create a `Memory` object that will store our experiences, our transitions $<s, a, r, s'>$. This memory will have a maxmium capacity, so we can keep newer experiences in memory while getting rid of older experiences. Then, we'll sample a random mini-batch of transitions $<s, a, r, s'>$ and train on those.

Below, I've implemented a `Memory` object. If you're unfamiliar with `deque`, this is a double-ended queue. You can think of it like a tube open on both sides. You can put objects in either side of the tube. But if it's full, adding anything more will push an object out the other side. This is a great data structure to use for the memory buffer.

In [15]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    
    def add(self, experience):
        self.buffer.append(experience)
            
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]

## Exploration - Exploitation

To learn about the environment and rules of the game, the agent needs to explore by taking random actions. We'll do this by choosing a random action with some probability $\epsilon$ (epsilon).  That is, with some probability $\epsilon$ the agent will make a random action and with probability $1 - \epsilon$, the agent will choose an action from $Q(s,a)$. This is called an **$\epsilon$-greedy policy**.


At first, the agent needs to do a lot of exploring. Later when it has learned more, the agent can favor choosing actions based on what it has learned. This is called _exploitation_. We'll set it up so the agent is more likely to explore early in training, then more likely to exploit later in training.

## Q-Learning training algorithm

Putting all this together, we can list out the algorithm we'll use to train the network. We'll train the network in _episodes_. One *episode* is one simulation of the game. For this game, the goal is to keep the pole upright for 195 frames. So we can start a new episode once meeting that goal. The game ends if the pole tilts over too far, or if the cart moves too far the left or right. When a game ends, we'll start a new episode. Now, to train the agent:

* Initialize the memory $D$
* Initialize the action-value network $Q$ with random weights
* **For** episode = 1, $M$ **do**
  * **For** $t$, $T$ **do**
     * With probability $\epsilon$ select a random action $a_t$, otherwise select $a_t = \mathrm{argmax}_a Q(s,a)$
     * Execute action $a_t$ in simulator and observe reward $r_{t+1}$ and new state $s_{t+1}$
     * Store transition $<s_t, a_t, r_{t+1}, s_{t+1}>$ in memory $D$
     * Sample random mini-batch from $D$: $<s_j, a_j, r_j, s'_j>$
     * Set $\hat{Q}_j = r_j$ if the episode ends at $j+1$, otherwise set $\hat{Q}_j = r_j + \gamma \max_{a'}{Q(s'_j, a')}$
     * Make a gradient descent step with loss $(\hat{Q}_j - Q(s_j, a_j))^2$
  * **endfor**
* **endfor**

## Hyperparameters

One of the more difficult aspects of reinforcememt learning are the large number of hyperparameters. Not only are we tuning the network, but we're tuning the simulation.

In [16]:
print('state size:', np.array(states).shape[1], 
      'action size: {}'.format((np.max(np.array(actions)) - np.min(np.array(actions)))+1))

state size: 4 action size: 2


In [17]:
train_episodes = 10000         # max number of episodes to learn from
max_steps = 2000000000000000   # max steps in an episode
gamma = 0.99                   # future reward discount

# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
state_size = 4                 # number of units for the input state/observation -- simulation
hidden_size = 64               # number of units in each Q-network hidden layer -- simulation
action_size = 2                # number of units for the output actions -- simulation

# Memory parameters
memory_size = 100000           # memory capacity
batch_size = 200               # experience mini-batch size
learning_rate = 0.001          # learning rate for adam

In [18]:
# Reset/init the graph/session
tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate, 
             gamma=gamma)

# Init the memory
memory = Memory(max_size=memory_size)

## Populate the memory

Here I'm re-initializing the simulation and pre-populating the memory. The agent is taking random actions and storing the transitions in memory. This will help the agent with exploring the game.

In [20]:
# Initialize the simulation
env.reset()

# Take one random step to get the pole and cart moving
prev_action = env.action_space.sample() # At-1
state, reward, done, info = env.step(prev_action) # St, Rt/Et (Epiosde)

# Make a bunch of random actions and store the experiences
for _ in range(batch_size):
    # Uncomment the line below to watch the simulation
    # env.render()

    # Make a random action
    action = env.action_space.sample()# At
    next_state, reward, done, info = env.step(action) #St+1

    # End of the episodes which defines the goal of the episode/mission
    if done is True:
        
        # Print out reward and done and check if they are the same: They are NOT.
        #print('if done is true:', reward, done)
        
        # # the episode ends so no next state
        # next_state = np.zeros(state.shape)
                
        # Add experience to memory
        memory.add((prev_action, state, action, next_state, done))
        
        # Start new episode
        env.reset()
        
        # Take one random step to get the pole and cart moving
        prev_action = env.action_space.sample()
        state, reward, done, info = env.step(prev_action)
    else:
        # Print out reward and done and check if they are the same!
        #print('else done is false:', reward, done)
        
        # Add experience to memory
        memory.add((prev_action, state, action, next_state, done))
        
        # Prepare for the next round
        prev_action = action
        state = next_state

## Training the model

Below we'll train our agent. If you want to watch it train, uncomment the `env.render()` line. This is slow because it's rendering the frames slower than the network can train. But, it's cool to watch the agent get better at the game.

In [None]:
# Now train with experiences
saver = tf.train.Saver()

# Total rewards and losses list for plotting
rewards_list = []
q_loss_list = []
g_loss_list = []
d_loss_list = []

# TF session for training
with tf.Session() as sess:
    
    # Initialize variables
    sess.run(tf.global_variables_initializer())

    # Restore/load the trained model 
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Training episodes/epochs
    step = 0
    for ep in range(train_episodes):
        
        # Env/agent steps/batches/minibatches
        total_reward = 0
        q_loss = 0
        g_loss = 0
        d_loss = 0
        t = 0
        while t < max_steps:
            step += 1
            
            # Uncomment this next line to watch the training
            # env.render() 
            
            # Explore or Exploit
            explore_p = explore_stop + (explore_start - explore_stop)*np.exp(-decay_rate*step) 
            if explore_p > np.random.rand():
                # Make a random action
                action = env.action_space.sample()
            else:
                # Get action from model
                feed_dict = {model.prev_actions: np.array([prev_action]), 
                             model.states: state.reshape((1, *state.shape))}
                actions_logits = sess.run(model.actions_logits, feed_dict)
                action = np.argmax(actions_logits) # arg with max value/Q is the class of action
            
            # Take action, get new state and reward
            next_state, reward, done, info = env.step(action)
    
            # Cumulative reward
            total_reward += reward
            
            # Episode/epoch training is done/failed!
            if done is True:
                # the episode ends so no next state
                #next_state = np.zeros(state.shape)
                t = max_steps
                
                print('-------------------------------------------------------------------------------')
                print('Episode: {}'.format(ep),
                      'Total reward: {}'.format(total_reward),
                      'Training q_loss: {:.4f}'.format(q_loss),
                      'Training g_loss: {:.4f}'.format(g_loss),
                      'Training d_loss: {:.4f}'.format(d_loss),
                      'Explore P: {:.4f}'.format(explore_p))
                print('-------------------------------------------------------------------------------')
                
                # total rewards and losses for plotting
                rewards_list.append((ep, total_reward))
                q_loss_list.append((ep, q_loss))
                g_loss_list.append((ep, g_loss))
                d_loss_list.append((ep, d_loss))
                
                # Add experience to memory
                memory.add((prev_action, state, action, next_state, done))
                
                # Start new episode
                env.reset()
                
                # Take one random step to get the pole and cart moving
                prev_action = env.action_space.sample()
                state, reward, done, info = env.step(prev_action)

            else:
                # Add experience to memory
                memory.add((prev_action, state, action, next_state, done))
                
                # One step forward: At-1=At and St=St+1
                prev_action = action
                state = next_state
                t += 1
            
            # Sample mini-batch from memory
            batch = memory.sample(batch_size)
            prev_actions = np.array([each[0] for each in batch])
            states = np.array([each[1] for each in batch])
            actions = np.array([each[2] for each in batch])
            next_states = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            #print(prev_actions.shape, states.shape, actions.shape, next_states.shape, dones.shape, dones.dtype)
            #print(dones[:3])
            
            # Calculating nextQs and setting them to 0 for states where episode ends/fails
            feed_dict={model.prev_actions: actions, 
                       model.states: next_states}
            next_actions_logits = sess.run(model.actions_logits, feed_dict)
            
            # Masking for the end of episodes/ goals
            next_actions_mask = (1 - dones.astype(next_actions_logits.dtype)).reshape(-1, 1) 
            nextQs_masked = np.multiply(next_actions_logits, next_actions_mask)
            nextQs = np.max(nextQs_masked, axis=1)
            
            # Calculating nextQs for Discriminator using D(At-1, St)= Qt: NOT this one
            # Calculating nextQs for Discriminator using D(At, St+1)= Qt+1/nextQs_D/nextQs
            # Calculating nextQs for Discriminator using D(~At, ~St+1)= ~Qt+1/nextQs_G/nextQs2
            feed_dict={model.prev_actions: prev_actions, model.states: states,
                       model.actions: actions, model.next_states: next_states}
            nextQs_D_unmasked = sess.run(model.nextQs_D_unmasked, feed_dict)
            
            # Masking for the end of episodes/ goals
            dones_mask = (1 - dones.astype(nextQs_D_unmasked[0].dtype)).reshape(-1, 1)
            nextQs_D_masked = np.multiply(nextQs_D_unmasked[0], dones_mask)
            nextQs_D = np.max(nextQs_D_masked, axis=1)
            
            # Calculating nextQs for Discriminator using D(At-1, St)= Qt: NOT this one
            # D(At-1, St)= Qs and qs = tf.sigmoid(Qs)
            # NextQs/Qt+1 are given both:
            # targetQs = qs + gamma * nextQs_G
            # targetQs = qs + gamma * nextQs_D
            feed_dict = {model.prev_actions: prev_actions, 
                         model.states: states, 
                         model.actions: actions, 
                         model.next_states: next_states, 
                         model.dones: dones,
                         model.nextQs: nextQs,
                         model.nextQs_D: nextQs_D}
            q_loss, _ = sess.run([model.q_loss, model.q_opt], feed_dict)
            d_loss, _ = sess.run([model.d_loss, model.d_opt], feed_dict)
            g_loss, _ = sess.run([model.g_loss, model.g_opt], feed_dict)
                        
    # Save the trained model
    saver.save(sess, 'checkpoints/model.ckpt')

-------------------------------------------------------------------------------
Episode: 0 Total reward: 3.0 Training q_loss: 0.3319 Training g_loss: 0.0153 Training d_loss: 0.1670 Explore P: 0.9997
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1 Total reward: 34.0 Training q_loss: 0.6930 Training g_loss: 0.7829 Training d_loss: 1.2712 Explore P: 0.9963
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2 Total reward: 15.0 Training q_loss: 0.4860 Training g_loss: 0.1517 Training d_loss: 4.9132 Explore P: 0.9949
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3 Total reward: 36.0 Training q_loss: 0.4252 Training g_loss: 0.0602 Tra

-------------------------------------------------------------------------------
Episode: 30 Total reward: 39.0 Training q_loss: 0.4647 Training g_loss: 0.2724 Training d_loss: 12.0865 Explore P: 0.9429
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 31 Total reward: 16.0 Training q_loss: 0.4657 Training g_loss: 0.3004 Training d_loss: 13.1454 Explore P: 0.9414
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 32 Total reward: 12.0 Training q_loss: 0.4454 Training g_loss: 0.2213 Training d_loss: 26.5687 Explore P: 0.9403
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 33 Total reward: 44.0 Training q_loss: 0.4885 Training g_loss: 0.

-------------------------------------------------------------------------------
Episode: 60 Total reward: 12.0 Training q_loss: 0.4613 Training g_loss: 0.1615 Training d_loss: 19.2593 Explore P: 0.8864
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 61 Total reward: 22.0 Training q_loss: 0.4465 Training g_loss: 0.3353 Training d_loss: 11.9927 Explore P: 0.8845
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 62 Total reward: 20.0 Training q_loss: 0.4654 Training g_loss: 0.1487 Training d_loss: 16.4736 Explore P: 0.8827
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 63 Total reward: 37.0 Training q_loss: 0.4533 Training g_loss: 0.

-------------------------------------------------------------------------------
Episode: 90 Total reward: 22.0 Training q_loss: 0.4412 Training g_loss: 0.4632 Training d_loss: 14.3897 Explore P: 0.8375
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 91 Total reward: 38.0 Training q_loss: 0.4286 Training g_loss: 0.3905 Training d_loss: 8.0241 Explore P: 0.8344
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 92 Total reward: 17.0 Training q_loss: 0.4166 Training g_loss: 0.0891 Training d_loss: 19.9363 Explore P: 0.8330
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 93 Total reward: 16.0 Training q_loss: 0.4324 Training g_loss: 0.1

-------------------------------------------------------------------------------
Episode: 120 Total reward: 15.0 Training q_loss: 0.4788 Training g_loss: 0.3495 Training d_loss: 15.1281 Explore P: 0.7924
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 121 Total reward: 15.0 Training q_loss: 0.4944 Training g_loss: 0.3782 Training d_loss: 13.9540 Explore P: 0.7912
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 122 Total reward: 12.0 Training q_loss: 0.5077 Training g_loss: 0.3963 Training d_loss: 16.9554 Explore P: 0.7902
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 123 Total reward: 9.0 Training q_loss: 0.5260 Training g_loss:

-------------------------------------------------------------------------------
Episode: 149 Total reward: 9.0 Training q_loss: 0.5245 Training g_loss: 1.2062 Training d_loss: 10.9954 Explore P: 0.7542
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 150 Total reward: 17.0 Training q_loss: 0.5227 Training g_loss: 0.3644 Training d_loss: 14.2420 Explore P: 0.7529
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 151 Total reward: 19.0 Training q_loss: 0.4910 Training g_loss: 0.2035 Training d_loss: 14.9683 Explore P: 0.7515
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 152 Total reward: 12.0 Training q_loss: 0.4818 Training g_loss:

-------------------------------------------------------------------------------
Episode: 178 Total reward: 18.0 Training q_loss: 0.5455 Training g_loss: 0.7009 Training d_loss: 13.3728 Explore P: 0.7208
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 179 Total reward: 15.0 Training q_loss: 0.5719 Training g_loss: 0.6407 Training d_loss: 9.5179 Explore P: 0.7197
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 180 Total reward: 14.0 Training q_loss: 0.5463 Training g_loss: 0.5597 Training d_loss: 11.4273 Explore P: 0.7188
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 181 Total reward: 10.0 Training q_loss: 0.5472 Training g_loss:

-------------------------------------------------------------------------------
Episode: 209 Total reward: 9.0 Training q_loss: 0.4816 Training g_loss: 0.2398 Training d_loss: 15.1457 Explore P: 0.6837
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 210 Total reward: 39.0 Training q_loss: 0.5649 Training g_loss: 0.5094 Training d_loss: 11.9616 Explore P: 0.6811
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 211 Total reward: 13.0 Training q_loss: 0.6695 Training g_loss: 0.9387 Training d_loss: 9.2622 Explore P: 0.6802
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 212 Total reward: 15.0 Training q_loss: 0.5894 Training g_loss: 

-------------------------------------------------------------------------------
Episode: 239 Total reward: 11.0 Training q_loss: 0.5904 Training g_loss: 0.5074 Training d_loss: 12.0255 Explore P: 0.6454
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 240 Total reward: 12.0 Training q_loss: 0.6316 Training g_loss: 0.7663 Training d_loss: 7.4827 Explore P: 0.6447
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 241 Total reward: 11.0 Training q_loss: 0.5958 Training g_loss: 0.5576 Training d_loss: 8.5386 Explore P: 0.6440
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 242 Total reward: 13.0 Training q_loss: 0.6328 Training g_loss: 

-------------------------------------------------------------------------------
Episode: 268 Total reward: 16.0 Training q_loss: 0.6838 Training g_loss: 0.3388 Training d_loss: 18.5972 Explore P: 0.6212
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 269 Total reward: 30.0 Training q_loss: 0.7085 Training g_loss: 0.6332 Training d_loss: 9.5787 Explore P: 0.6194
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 270 Total reward: 23.0 Training q_loss: 0.6690 Training g_loss: 0.5452 Training d_loss: 10.9310 Explore P: 0.6180
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 271 Total reward: 11.0 Training q_loss: 0.6149 Training g_loss:

-------------------------------------------------------------------------------
Episode: 297 Total reward: 57.0 Training q_loss: 34.8322 Training g_loss: 0.0829 Training d_loss: 16.4107 Explore P: 0.5794
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 298 Total reward: 66.0 Training q_loss: 31.9141 Training g_loss: 0.0468 Training d_loss: 22.6435 Explore P: 0.5757
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 299 Total reward: 41.0 Training q_loss: 32.8507 Training g_loss: 0.0237 Training d_loss: 8.5139 Explore P: 0.5734
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 300 Total reward: 33.0 Training q_loss: 32.9248 Training g_l

-------------------------------------------------------------------------------
Episode: 326 Total reward: 45.0 Training q_loss: 31.4252 Training g_loss: 0.1141 Training d_loss: 12.4144 Explore P: 0.5077
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 327 Total reward: 39.0 Training q_loss: 30.5110 Training g_loss: 0.0367 Training d_loss: 10.7700 Explore P: 0.5057
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 328 Total reward: 45.0 Training q_loss: 31.7668 Training g_loss: 0.4656 Training d_loss: 10.9768 Explore P: 0.5035
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 329 Total reward: 91.0 Training q_loss: 27.9091 Training g_

-------------------------------------------------------------------------------
Episode: 355 Total reward: 92.0 Training q_loss: 18.3459 Training g_loss: 0.5107 Training d_loss: 18.9314 Explore P: 0.4388
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 356 Total reward: 35.0 Training q_loss: 18.7576 Training g_loss: 0.0086 Training d_loss: 21.1130 Explore P: 0.4373
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 357 Total reward: 32.0 Training q_loss: 20.3984 Training g_loss: 0.0414 Training d_loss: 13.9924 Explore P: 0.4359
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 358 Total reward: 49.0 Training q_loss: 18.6370 Training g_

-------------------------------------------------------------------------------
Episode: 384 Total reward: 60.0 Training q_loss: 15.0980 Training g_loss: 0.0390 Training d_loss: 18.6706 Explore P: 0.3757
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 385 Total reward: 86.0 Training q_loss: 15.6518 Training g_loss: 0.1312 Training d_loss: 19.4277 Explore P: 0.3726
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 386 Total reward: 63.0 Training q_loss: 17.7502 Training g_loss: 0.0187 Training d_loss: 7.9069 Explore P: 0.3703
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 387 Total reward: 38.0 Training q_loss: 16.8461 Training g_l

-------------------------------------------------------------------------------
Episode: 413 Total reward: 31.0 Training q_loss: 7.7232 Training g_loss: 0.0203 Training d_loss: 8.8572 Explore P: 0.3239
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 414 Total reward: 32.0 Training q_loss: 6.6302 Training g_loss: 0.1054 Training d_loss: 17.7116 Explore P: 0.3229
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 415 Total reward: 42.0 Training q_loss: 7.2675 Training g_loss: 0.0408 Training d_loss: 11.9984 Explore P: 0.3216
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 416 Total reward: 72.0 Training q_loss: 7.3343 Training g_loss:

-------------------------------------------------------------------------------
Episode: 442 Total reward: 30.0 Training q_loss: 5.7241 Training g_loss: 0.0296 Training d_loss: 9.6054 Explore P: 0.2771
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 443 Total reward: 29.0 Training q_loss: 5.2633 Training g_loss: 0.0333 Training d_loss: 2.8098 Explore P: 0.2763
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 444 Total reward: 20.0 Training q_loss: 4.6378 Training g_loss: 0.0282 Training d_loss: 19.9319 Explore P: 0.2758
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 445 Total reward: 26.0 Training q_loss: 5.2331 Training g_loss: 

-------------------------------------------------------------------------------
Episode: 471 Total reward: 19.0 Training q_loss: 5.1216 Training g_loss: 0.0078 Training d_loss: 15.9945 Explore P: 0.2594
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 472 Total reward: 23.0 Training q_loss: 6.7481 Training g_loss: 0.0150 Training d_loss: 17.0886 Explore P: 0.2588
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 473 Total reward: 23.0 Training q_loss: 7.0354 Training g_loss: 0.0210 Training d_loss: 12.1064 Explore P: 0.2583
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 474 Total reward: 21.0 Training q_loss: 6.8825 Training g_loss

-------------------------------------------------------------------------------
Episode: 500 Total reward: 33.0 Training q_loss: 8.8803 Training g_loss: 0.0044 Training d_loss: 14.7662 Explore P: 0.2422
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 501 Total reward: 24.0 Training q_loss: 6.6288 Training g_loss: 0.0305 Training d_loss: 12.6069 Explore P: 0.2417
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 502 Total reward: 22.0 Training q_loss: 4.3013 Training g_loss: 0.0317 Training d_loss: 11.4848 Explore P: 0.2412
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 503 Total reward: 27.0 Training q_loss: 8.2322 Training g_loss

-------------------------------------------------------------------------------
Episode: 529 Total reward: 21.0 Training q_loss: 4.6888 Training g_loss: 0.0895 Training d_loss: 8.7024 Explore P: 0.2277
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 530 Total reward: 26.0 Training q_loss: 6.8876 Training g_loss: 0.0110 Training d_loss: 18.1651 Explore P: 0.2271
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 531 Total reward: 36.0 Training q_loss: 6.9398 Training g_loss: 0.0282 Training d_loss: 20.0247 Explore P: 0.2263
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 532 Total reward: 25.0 Training q_loss: 7.2096 Training g_loss:

-------------------------------------------------------------------------------
Episode: 558 Total reward: 14.0 Training q_loss: 7.6474 Training g_loss: 0.1178 Training d_loss: 15.6368 Explore P: 0.2164
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 559 Total reward: 10.0 Training q_loss: 7.6421 Training g_loss: 0.0135 Training d_loss: 17.9123 Explore P: 0.2162
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 560 Total reward: 11.0 Training q_loss: 7.9407 Training g_loss: 0.0427 Training d_loss: 18.7384 Explore P: 0.2160
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 561 Total reward: 15.0 Training q_loss: 7.8547 Training g_loss

-------------------------------------------------------------------------------
Episode: 588 Total reward: 12.0 Training q_loss: 7.0097 Training g_loss: 0.0986 Training d_loss: 13.4781 Explore P: 0.2077
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 589 Total reward: 17.0 Training q_loss: 8.5482 Training g_loss: 0.0736 Training d_loss: 10.1912 Explore P: 0.2074
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 590 Total reward: 10.0 Training q_loss: 8.2104 Training g_loss: 0.0304 Training d_loss: 15.1304 Explore P: 0.2072
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 591 Total reward: 13.0 Training q_loss: 7.6963 Training g_loss

-------------------------------------------------------------------------------
Episode: 617 Total reward: 19.0 Training q_loss: 7.9948 Training g_loss: 0.0806 Training d_loss: 14.2486 Explore P: 0.1990
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 618 Total reward: 14.0 Training q_loss: 8.0263 Training g_loss: 0.0101 Training d_loss: 25.2679 Explore P: 0.1987
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 619 Total reward: 13.0 Training q_loss: 8.0528 Training g_loss: 0.0668 Training d_loss: 18.2320 Explore P: 0.1984
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 620 Total reward: 17.0 Training q_loss: 8.2199 Training g_loss

-------------------------------------------------------------------------------
Episode: 646 Total reward: 16.0 Training q_loss: 7.4063 Training g_loss: 0.2276 Training d_loss: 15.0275 Explore P: 0.1891
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 647 Total reward: 13.0 Training q_loss: 7.5243 Training g_loss: 0.0203 Training d_loss: 9.1529 Explore P: 0.1889
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 648 Total reward: 22.0 Training q_loss: 7.6138 Training g_loss: 0.0233 Training d_loss: 20.7851 Explore P: 0.1885
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 649 Total reward: 28.0 Training q_loss: 7.7604 Training g_loss:

-------------------------------------------------------------------------------
Episode: 675 Total reward: 46.0 Training q_loss: 6.7914 Training g_loss: 0.0234 Training d_loss: 16.1567 Explore P: 0.1760
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 676 Total reward: 66.0 Training q_loss: 7.0389 Training g_loss: 0.0110 Training d_loss: 10.2577 Explore P: 0.1749
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 677 Total reward: 47.0 Training q_loss: 6.2227 Training g_loss: 0.0650 Training d_loss: 24.9087 Explore P: 0.1742
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 678 Total reward: 34.0 Training q_loss: 6.0119 Training g_loss

-------------------------------------------------------------------------------
Episode: 704 Total reward: 199.0 Training q_loss: 3.5902 Training g_loss: 0.0045 Training d_loss: 13.7140 Explore P: 0.1419
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 705 Total reward: 70.0 Training q_loss: 3.6765 Training g_loss: 0.0020 Training d_loss: 10.9912 Explore P: 0.1410
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 706 Total reward: 100.0 Training q_loss: 3.8557 Training g_loss: 0.0130 Training d_loss: 17.5533 Explore P: 0.1397
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 707 Total reward: 122.0 Training q_loss: 3.2091 Training g_l

-------------------------------------------------------------------------------
Episode: 733 Total reward: 25.0 Training q_loss: 2.7796 Training g_loss: 0.0344 Training d_loss: 14.2468 Explore P: 0.1232
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 734 Total reward: 22.0 Training q_loss: 2.9436 Training g_loss: 0.0746 Training d_loss: 29.8028 Explore P: 0.1229
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 735 Total reward: 18.0 Training q_loss: 3.3711 Training g_loss: 0.1231 Training d_loss: 22.7976 Explore P: 0.1227
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 736 Total reward: 28.0 Training q_loss: 4.2020 Training g_loss

-------------------------------------------------------------------------------
Episode: 762 Total reward: 33.0 Training q_loss: 5.0169 Training g_loss: 0.0114 Training d_loss: 15.6016 Explore P: 0.1078
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 763 Total reward: 44.0 Training q_loss: 4.6791 Training g_loss: 0.0149 Training d_loss: 15.7297 Explore P: 0.1074
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 764 Total reward: 35.0 Training q_loss: 4.5969 Training g_loss: 0.0222 Training d_loss: 23.5960 Explore P: 0.1071
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 765 Total reward: 51.0 Training q_loss: 4.7534 Training g_loss

-------------------------------------------------------------------------------
Episode: 791 Total reward: 28.0 Training q_loss: 2.0835 Training g_loss: 0.0899 Training d_loss: 20.5706 Explore P: 0.1008
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 792 Total reward: 22.0 Training q_loss: 2.3299 Training g_loss: 0.0602 Training d_loss: 35.3453 Explore P: 0.1006
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 793 Total reward: 18.0 Training q_loss: 2.2237 Training g_loss: 0.0775 Training d_loss: 28.8958 Explore P: 0.1005
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 794 Total reward: 24.0 Training q_loss: 2.0296 Training g_loss

-------------------------------------------------------------------------------
Episode: 820 Total reward: 88.0 Training q_loss: 4.0469 Training g_loss: 0.0184 Training d_loss: 16.8508 Explore P: 0.0877
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 821 Total reward: 54.0 Training q_loss: 3.5442 Training g_loss: 0.0210 Training d_loss: 25.4027 Explore P: 0.0873
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 822 Total reward: 26.0 Training q_loss: 3.1767 Training g_loss: 0.0432 Training d_loss: 10.4259 Explore P: 0.0871
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 823 Total reward: 52.0 Training q_loss: 4.5083 Training g_loss

-------------------------------------------------------------------------------
Episode: 849 Total reward: 50.0 Training q_loss: 5.3663 Training g_loss: 0.0129 Training d_loss: 19.5384 Explore P: 0.0753
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 850 Total reward: 101.0 Training q_loss: 5.9391 Training g_loss: 0.0084 Training d_loss: 9.3301 Explore P: 0.0746
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 851 Total reward: 98.0 Training q_loss: 6.3462 Training g_loss: 0.0406 Training d_loss: 10.6643 Explore P: 0.0740
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 852 Total reward: 87.0 Training q_loss: 6.5233 Training g_loss

-------------------------------------------------------------------------------
Episode: 878 Total reward: 89.0 Training q_loss: 5.1383 Training g_loss: 0.0159 Training d_loss: 24.3955 Explore P: 0.0643
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 879 Total reward: 68.0 Training q_loss: 6.7772 Training g_loss: 0.0303 Training d_loss: 9.6841 Explore P: 0.0640
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 880 Total reward: 70.0 Training q_loss: 6.4863 Training g_loss: 0.0529 Training d_loss: 19.6069 Explore P: 0.0636
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 881 Total reward: 71.0 Training q_loss: 4.7541 Training g_loss:

-------------------------------------------------------------------------------
Episode: 907 Total reward: 40.0 Training q_loss: 6.1697 Training g_loss: 0.0282 Training d_loss: 9.8841 Explore P: 0.0550
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 908 Total reward: 45.0 Training q_loss: 7.9167 Training g_loss: 0.0580 Training d_loss: 12.4587 Explore P: 0.0548
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 909 Total reward: 50.0 Training q_loss: 7.4659 Training g_loss: 0.0260 Training d_loss: 17.2572 Explore P: 0.0546
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 910 Total reward: 55.0 Training q_loss: 8.5352 Training g_loss:

-------------------------------------------------------------------------------
Episode: 936 Total reward: 81.0 Training q_loss: 11.7874 Training g_loss: 0.0281 Training d_loss: 20.4109 Explore P: 0.0495
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 937 Total reward: 53.0 Training q_loss: 11.9841 Training g_loss: 0.0104 Training d_loss: 11.3523 Explore P: 0.0493
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 938 Total reward: 51.0 Training q_loss: 9.9725 Training g_loss: 0.0132 Training d_loss: 25.8234 Explore P: 0.0491
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 939 Total reward: 46.0 Training q_loss: 11.1515 Training g_l

-------------------------------------------------------------------------------
Episode: 965 Total reward: 57.0 Training q_loss: 11.6626 Training g_loss: 0.0346 Training d_loss: 14.7352 Explore P: 0.0437
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 966 Total reward: 47.0 Training q_loss: 9.9406 Training g_loss: 0.0093 Training d_loss: 6.5246 Explore P: 0.0436
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 967 Total reward: 37.0 Training q_loss: 7.5301 Training g_loss: 0.0396 Training d_loss: 12.1411 Explore P: 0.0434
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 968 Total reward: 49.0 Training q_loss: 12.8164 Training g_los

-------------------------------------------------------------------------------
Episode: 994 Total reward: 129.0 Training q_loss: 12.4624 Training g_loss: 0.0589 Training d_loss: 13.6058 Explore P: 0.0371
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 995 Total reward: 57.0 Training q_loss: 14.7694 Training g_loss: 0.0171 Training d_loss: 37.7773 Explore P: 0.0369
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 996 Total reward: 70.0 Training q_loss: 11.9038 Training g_loss: 0.0209 Training d_loss: 15.4949 Explore P: 0.0368
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 997 Total reward: 52.0 Training q_loss: 15.4136 Training g

-------------------------------------------------------------------------------
Episode: 1023 Total reward: 49.0 Training q_loss: 20.6123 Training g_loss: 0.0097 Training d_loss: 19.4188 Explore P: 0.0332
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1024 Total reward: 22.0 Training q_loss: 19.6571 Training g_loss: 0.0080 Training d_loss: 15.5690 Explore P: 0.0331
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1025 Total reward: 32.0 Training q_loss: 19.2638 Training g_loss: 0.0008 Training d_loss: 6.8735 Explore P: 0.0330
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1026 Total reward: 64.0 Training q_loss: 18.0178 Training

-------------------------------------------------------------------------------
Episode: 1053 Total reward: 16.0 Training q_loss: 18.9626 Training g_loss: 0.0292 Training d_loss: 7.0215 Explore P: 0.0321
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1054 Total reward: 16.0 Training q_loss: 20.1845 Training g_loss: 0.0280 Training d_loss: 25.4174 Explore P: 0.0320
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1055 Total reward: 13.0 Training q_loss: 19.8554 Training g_loss: 0.0282 Training d_loss: 10.1957 Explore P: 0.0320
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1056 Total reward: 11.0 Training q_loss: 18.1267 Training

-------------------------------------------------------------------------------
Episode: 1082 Total reward: 21.0 Training q_loss: 20.1154 Training g_loss: 0.0024 Training d_loss: 6.7855 Explore P: 0.0313
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1083 Total reward: 12.0 Training q_loss: 20.4502 Training g_loss: 0.0034 Training d_loss: 12.1669 Explore P: 0.0312
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1084 Total reward: 8.0 Training q_loss: 20.7258 Training g_loss: 0.0026 Training d_loss: 23.7162 Explore P: 0.0312
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1085 Total reward: 15.0 Training q_loss: 20.5666 Training 

-------------------------------------------------------------------------------
Episode: 1112 Total reward: 27.0 Training q_loss: 19.9573 Training g_loss: 0.0123 Training d_loss: 9.4496 Explore P: 0.0305
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1113 Total reward: 12.0 Training q_loss: 18.9234 Training g_loss: 0.0103 Training d_loss: 15.7225 Explore P: 0.0304
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1114 Total reward: 24.0 Training q_loss: 19.2920 Training g_loss: 0.0138 Training d_loss: 10.5183 Explore P: 0.0304
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1115 Total reward: 8.0 Training q_loss: 18.9731 Training 

-------------------------------------------------------------------------------
Episode: 1141 Total reward: 9.0 Training q_loss: 19.2513 Training g_loss: 0.0467 Training d_loss: 3.4572 Explore P: 0.0298
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1142 Total reward: 9.0 Training q_loss: 18.7989 Training g_loss: 0.0062 Training d_loss: 24.5856 Explore P: 0.0297
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1143 Total reward: 7.0 Training q_loss: 18.9517 Training g_loss: 0.0174 Training d_loss: 19.5099 Explore P: 0.0297
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1144 Total reward: 8.0 Training q_loss: 18.0667 Training g_l

-------------------------------------------------------------------------------
Episode: 1170 Total reward: 7.0 Training q_loss: 18.7658 Training g_loss: 0.0034 Training d_loss: 15.5794 Explore P: 0.0290
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1171 Total reward: 8.0 Training q_loss: 18.4648 Training g_loss: 0.0068 Training d_loss: 26.9122 Explore P: 0.0290
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1172 Total reward: 9.0 Training q_loss: 17.7555 Training g_loss: 0.0021 Training d_loss: 33.6902 Explore P: 0.0290
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1173 Total reward: 18.0 Training q_loss: 18.3304 Training g

-------------------------------------------------------------------------------
Episode: 1200 Total reward: 24.0 Training q_loss: 16.0001 Training g_loss: 0.0512 Training d_loss: 17.9652 Explore P: 0.0282
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1201 Total reward: 21.0 Training q_loss: 14.8054 Training g_loss: 0.0671 Training d_loss: 17.8625 Explore P: 0.0282
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1202 Total reward: 21.0 Training q_loss: 16.4229 Training g_loss: 0.2007 Training d_loss: 17.9370 Explore P: 0.0282
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1203 Total reward: 8.0 Training q_loss: 16.9225 Training

-------------------------------------------------------------------------------
Episode: 1230 Total reward: 8.0 Training q_loss: 14.3287 Training g_loss: 0.0109 Training d_loss: 16.5179 Explore P: 0.0273
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1231 Total reward: 27.0 Training q_loss: 14.4254 Training g_loss: 0.0202 Training d_loss: 9.2246 Explore P: 0.0273
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1232 Total reward: 24.0 Training q_loss: 13.1750 Training g_loss: 0.1268 Training d_loss: 24.8729 Explore P: 0.0272
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1233 Total reward: 28.0 Training q_loss: 14.7102 Training 

-------------------------------------------------------------------------------
Episode: 1260 Total reward: 22.0 Training q_loss: 14.8287 Training g_loss: 0.0165 Training d_loss: 22.9281 Explore P: 0.0263
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1261 Total reward: 8.0 Training q_loss: 14.9188 Training g_loss: 0.0045 Training d_loss: 20.0996 Explore P: 0.0263
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1262 Total reward: 22.0 Training q_loss: 14.6612 Training g_loss: 0.0237 Training d_loss: 8.7268 Explore P: 0.0263
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1263 Total reward: 25.0 Training q_loss: 14.7764 Training 

-------------------------------------------------------------------------------
Episode: 1289 Total reward: 7.0 Training q_loss: 14.2870 Training g_loss: 0.0047 Training d_loss: 21.0510 Explore P: 0.0256
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1290 Total reward: 13.0 Training q_loss: 13.2008 Training g_loss: 0.0018 Training d_loss: 20.3183 Explore P: 0.0256
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1291 Total reward: 8.0 Training q_loss: 13.9072 Training g_loss: 0.0144 Training d_loss: 25.7218 Explore P: 0.0256
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1292 Total reward: 8.0 Training q_loss: 13.7059 Training g

-------------------------------------------------------------------------------
Episode: 1318 Total reward: 10.0 Training q_loss: 13.6035 Training g_loss: 0.0381 Training d_loss: 21.6232 Explore P: 0.0249
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1319 Total reward: 12.0 Training q_loss: 12.3865 Training g_loss: 0.0127 Training d_loss: 32.4111 Explore P: 0.0249
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1320 Total reward: 10.0 Training q_loss: 11.8419 Training g_loss: 0.0600 Training d_loss: 33.6296 Explore P: 0.0249
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1321 Total reward: 8.0 Training q_loss: 13.4488 Training

-------------------------------------------------------------------------------
Episode: 1347 Total reward: 12.0 Training q_loss: 11.9636 Training g_loss: 0.2157 Training d_loss: 26.8845 Explore P: 0.0244
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1348 Total reward: 12.0 Training q_loss: 11.2474 Training g_loss: 0.0391 Training d_loss: 36.3997 Explore P: 0.0244
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1349 Total reward: 12.0 Training q_loss: 13.5979 Training g_loss: 0.0015 Training d_loss: 27.8407 Explore P: 0.0244
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1350 Total reward: 9.0 Training q_loss: 14.2728 Training

-------------------------------------------------------------------------------
Episode: 1376 Total reward: 9.0 Training q_loss: 12.9535 Training g_loss: 0.0008 Training d_loss: 16.8644 Explore P: 0.0240
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1377 Total reward: 8.0 Training q_loss: 12.8300 Training g_loss: 0.0010 Training d_loss: 12.5839 Explore P: 0.0240
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1378 Total reward: 12.0 Training q_loss: 13.1654 Training g_loss: 0.0028 Training d_loss: 20.2488 Explore P: 0.0239
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1379 Total reward: 9.0 Training q_loss: 13.0821 Training g

-------------------------------------------------------------------------------
Episode: 1405 Total reward: 9.0 Training q_loss: 13.1688 Training g_loss: 0.0976 Training d_loss: 9.6419 Explore P: 0.0236
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1406 Total reward: 8.0 Training q_loss: 12.7326 Training g_loss: 0.0048 Training d_loss: 21.4376 Explore P: 0.0236
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1407 Total reward: 11.0 Training q_loss: 12.6574 Training g_loss: 0.0071 Training d_loss: 26.9981 Explore P: 0.0235
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1408 Total reward: 11.0 Training q_loss: 13.0670 Training g

-------------------------------------------------------------------------------
Episode: 1434 Total reward: 9.0 Training q_loss: 11.2392 Training g_loss: 0.0107 Training d_loss: 43.2122 Explore P: 0.0232
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1435 Total reward: 10.0 Training q_loss: 11.8636 Training g_loss: 0.0137 Training d_loss: 7.3019 Explore P: 0.0232
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1436 Total reward: 9.0 Training q_loss: 11.8209 Training g_loss: 0.0316 Training d_loss: 16.0437 Explore P: 0.0231
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1437 Total reward: 11.0 Training q_loss: 11.5212 Training g

-------------------------------------------------------------------------------
Episode: 1464 Total reward: 9.0 Training q_loss: 12.0751 Training g_loss: 0.0165 Training d_loss: 11.2018 Explore P: 0.0228
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1465 Total reward: 11.0 Training q_loss: 12.9985 Training g_loss: 0.0026 Training d_loss: 13.7910 Explore P: 0.0228
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1466 Total reward: 10.0 Training q_loss: 13.5747 Training g_loss: 0.0116 Training d_loss: 13.1721 Explore P: 0.0228
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1467 Total reward: 10.0 Training q_loss: 13.6035 Training

-------------------------------------------------------------------------------
Episode: 1494 Total reward: 8.0 Training q_loss: 12.0164 Training g_loss: 0.0022 Training d_loss: 17.9986 Explore P: 0.0224
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1495 Total reward: 8.0 Training q_loss: 12.4396 Training g_loss: 0.0599 Training d_loss: 20.9038 Explore P: 0.0224
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1496 Total reward: 9.0 Training q_loss: 12.3599 Training g_loss: 0.0039 Training d_loss: 21.7808 Explore P: 0.0224
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1497 Total reward: 10.0 Training q_loss: 12.4228 Training g

-------------------------------------------------------------------------------
Episode: 1523 Total reward: 10.0 Training q_loss: 12.7923 Training g_loss: 0.0032 Training d_loss: 9.1489 Explore P: 0.0221
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1524 Total reward: 7.0 Training q_loss: 12.6356 Training g_loss: 0.0083 Training d_loss: 12.1464 Explore P: 0.0221
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1525 Total reward: 11.0 Training q_loss: 12.2705 Training g_loss: 0.0051 Training d_loss: 19.5413 Explore P: 0.0221
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1526 Total reward: 8.0 Training q_loss: 12.1749 Training g

-------------------------------------------------------------------------------
Episode: 1552 Total reward: 9.0 Training q_loss: 11.5919 Training g_loss: 0.0002 Training d_loss: 32.0431 Explore P: 0.0218
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1553 Total reward: 9.0 Training q_loss: 13.3181 Training g_loss: 0.0698 Training d_loss: 14.6370 Explore P: 0.0218
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1554 Total reward: 7.0 Training q_loss: 13.5435 Training g_loss: 0.0006 Training d_loss: 19.1649 Explore P: 0.0218
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1555 Total reward: 8.0 Training q_loss: 13.7724 Training g_

-------------------------------------------------------------------------------
Episode: 1581 Total reward: 26.0 Training q_loss: 12.3855 Training g_loss: 0.0639 Training d_loss: 18.8988 Explore P: 0.0215
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1582 Total reward: 11.0 Training q_loss: 10.1812 Training g_loss: 0.0021 Training d_loss: 16.8407 Explore P: 0.0215
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1583 Total reward: 8.0 Training q_loss: 10.2560 Training g_loss: 0.0222 Training d_loss: 16.0717 Explore P: 0.0215
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1584 Total reward: 8.0 Training q_loss: 10.4855 Training 

-------------------------------------------------------------------------------
Episode: 1611 Total reward: 7.0 Training q_loss: 11.6390 Training g_loss: 0.0112 Training d_loss: 16.4411 Explore P: 0.0212
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1612 Total reward: 8.0 Training q_loss: 11.8618 Training g_loss: 0.0017 Training d_loss: 20.6530 Explore P: 0.0212
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1613 Total reward: 9.0 Training q_loss: 12.7202 Training g_loss: 0.0484 Training d_loss: 12.4709 Explore P: 0.0212
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1614 Total reward: 9.0 Training q_loss: 11.7706 Training g_

-------------------------------------------------------------------------------
Episode: 1641 Total reward: 10.0 Training q_loss: 11.7439 Training g_loss: 0.0082 Training d_loss: 8.8153 Explore P: 0.0209
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1642 Total reward: 9.0 Training q_loss: 11.3372 Training g_loss: 0.0029 Training d_loss: 21.8570 Explore P: 0.0209
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1643 Total reward: 9.0 Training q_loss: 12.0447 Training g_loss: 0.0695 Training d_loss: 4.7838 Explore P: 0.0209
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1644 Total reward: 8.0 Training q_loss: 11.0198 Training g_l

-------------------------------------------------------------------------------
Episode: 1671 Total reward: 10.0 Training q_loss: 10.9910 Training g_loss: 0.0055 Training d_loss: 20.7414 Explore P: 0.0206
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1672 Total reward: 11.0 Training q_loss: 10.9232 Training g_loss: 0.0040 Training d_loss: 18.5357 Explore P: 0.0206
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1673 Total reward: 11.0 Training q_loss: 11.0486 Training g_loss: 0.0117 Training d_loss: 13.4299 Explore P: 0.0205
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1674 Total reward: 11.0 Training q_loss: 10.3547 Trainin

-------------------------------------------------------------------------------
Episode: 1701 Total reward: 12.0 Training q_loss: 8.9641 Training g_loss: 0.0005 Training d_loss: 21.2815 Explore P: 0.0202
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1702 Total reward: 7.0 Training q_loss: 9.2840 Training g_loss: 0.0152 Training d_loss: 18.0714 Explore P: 0.0202
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1703 Total reward: 9.0 Training q_loss: 9.1319 Training g_loss: 0.0113 Training d_loss: 14.8667 Explore P: 0.0202
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1704 Total reward: 7.0 Training q_loss: 9.0009 Training g_los

-------------------------------------------------------------------------------
Episode: 1730 Total reward: 9.0 Training q_loss: 8.1995 Training g_loss: 0.0125 Training d_loss: 20.9179 Explore P: 0.0200
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1731 Total reward: 9.0 Training q_loss: 8.2338 Training g_loss: 0.0078 Training d_loss: 16.9791 Explore P: 0.0199
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1732 Total reward: 12.0 Training q_loss: 7.9836 Training g_loss: 0.0319 Training d_loss: 10.0242 Explore P: 0.0199
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1733 Total reward: 13.0 Training q_loss: 8.3499 Training g_lo

-------------------------------------------------------------------------------
Episode: 1759 Total reward: 14.0 Training q_loss: 7.0287 Training g_loss: 0.0158 Training d_loss: 20.4955 Explore P: 0.0192
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1760 Total reward: 10.0 Training q_loss: 6.3696 Training g_loss: 0.0059 Training d_loss: 11.8243 Explore P: 0.0191
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1761 Total reward: 13.0 Training q_loss: 6.3347 Training g_loss: 0.0045 Training d_loss: 15.8655 Explore P: 0.0191
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1762 Total reward: 11.0 Training q_loss: 6.7959 Training g_

-------------------------------------------------------------------------------
Episode: 1788 Total reward: 10.0 Training q_loss: 5.8421 Training g_loss: 0.0445 Training d_loss: 15.4322 Explore P: 0.0188
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1789 Total reward: 10.0 Training q_loss: 4.1677 Training g_loss: 0.0266 Training d_loss: 15.1418 Explore P: 0.0188
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1790 Total reward: 12.0 Training q_loss: 3.0957 Training g_loss: 0.0182 Training d_loss: 23.7690 Explore P: 0.0188
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1791 Total reward: 10.0 Training q_loss: 2.4332 Training g_

-------------------------------------------------------------------------------
Episode: 1818 Total reward: 12.0 Training q_loss: 4.9534 Training g_loss: 0.0226 Training d_loss: 7.7068 Explore P: 0.0185
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1819 Total reward: 10.0 Training q_loss: 4.6597 Training g_loss: 0.0037 Training d_loss: 18.5384 Explore P: 0.0185
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1820 Total reward: 10.0 Training q_loss: 5.2002 Training g_loss: 0.0027 Training d_loss: 21.7110 Explore P: 0.0185
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1821 Total reward: 10.0 Training q_loss: 4.7304 Training g_l

-------------------------------------------------------------------------------
Episode: 1848 Total reward: 8.0 Training q_loss: 3.2136 Training g_loss: 0.0251 Training d_loss: 15.3711 Explore P: 0.0183
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1849 Total reward: 8.0 Training q_loss: 2.8182 Training g_loss: 0.0094 Training d_loss: 17.8722 Explore P: 0.0183
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1850 Total reward: 10.0 Training q_loss: 2.5777 Training g_loss: 0.0055 Training d_loss: 16.6116 Explore P: 0.0182
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1851 Total reward: 14.0 Training q_loss: 1.9877 Training g_lo

-------------------------------------------------------------------------------
Episode: 1877 Total reward: 9.0 Training q_loss: 1.5529 Training g_loss: 0.0048 Training d_loss: 14.2272 Explore P: 0.0180
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1878 Total reward: 9.0 Training q_loss: 1.6541 Training g_loss: 0.0003 Training d_loss: 13.5125 Explore P: 0.0180
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1879 Total reward: 10.0 Training q_loss: 2.2404 Training g_loss: 0.0030 Training d_loss: 19.9257 Explore P: 0.0180
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1880 Total reward: 12.0 Training q_loss: 2.2710 Training g_lo

-------------------------------------------------------------------------------
Episode: 1906 Total reward: 9.0 Training q_loss: 120.2595 Training g_loss: 0.2135 Training d_loss: 7.8592 Explore P: 0.0176
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1907 Total reward: 10.0 Training q_loss: 146.2154 Training g_loss: 0.2758 Training d_loss: 13.9132 Explore P: 0.0176
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1908 Total reward: 10.0 Training q_loss: 82.6257 Training g_loss: 0.3111 Training d_loss: 18.0056 Explore P: 0.0176
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1909 Total reward: 8.0 Training q_loss: 47.2192 Training

-------------------------------------------------------------------------------
Episode: 1936 Total reward: 8.0 Training q_loss: 31.5393 Training g_loss: 0.0537 Training d_loss: 11.8267 Explore P: 0.0174
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1937 Total reward: 7.0 Training q_loss: 30.1780 Training g_loss: 0.0796 Training d_loss: 20.6988 Explore P: 0.0174
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1938 Total reward: 11.0 Training q_loss: 27.4508 Training g_loss: 0.0886 Training d_loss: 11.4278 Explore P: 0.0174
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1939 Total reward: 11.0 Training q_loss: 22.1102 Training 

-------------------------------------------------------------------------------
Episode: 1965 Total reward: 28.0 Training q_loss: 5.8297 Training g_loss: 0.0074 Training d_loss: 12.0937 Explore P: 0.0169
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1966 Total reward: 47.0 Training q_loss: 5.3037 Training g_loss: 0.0107 Training d_loss: 29.1423 Explore P: 0.0169
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1967 Total reward: 62.0 Training q_loss: 6.3008 Training g_loss: 0.0298 Training d_loss: 6.6223 Explore P: 0.0168
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1968 Total reward: 46.0 Training q_loss: 6.1283 Training g_l

-------------------------------------------------------------------------------
Episode: 1994 Total reward: 102.0 Training q_loss: 4.7074 Training g_loss: 0.1019 Training d_loss: 18.5515 Explore P: 0.0159
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1995 Total reward: 25.0 Training q_loss: 4.2984 Training g_loss: 0.0241 Training d_loss: 16.4890 Explore P: 0.0159
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1996 Total reward: 41.0 Training q_loss: 4.4877 Training g_loss: 0.0345 Training d_loss: 17.4894 Explore P: 0.0159
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 1997 Total reward: 32.0 Training q_loss: 4.5312 Training g

-------------------------------------------------------------------------------
Episode: 2023 Total reward: 29.0 Training q_loss: 4.3067 Training g_loss: 0.0348 Training d_loss: 12.7958 Explore P: 0.0154
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2024 Total reward: 27.0 Training q_loss: 4.4859 Training g_loss: 0.0175 Training d_loss: 15.5205 Explore P: 0.0154
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2025 Total reward: 43.0 Training q_loss: 4.0717 Training g_loss: 0.0548 Training d_loss: 12.6427 Explore P: 0.0154
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2026 Total reward: 55.0 Training q_loss: 3.8417 Training g_

-------------------------------------------------------------------------------
Episode: 2052 Total reward: 21.0 Training q_loss: 3.4938 Training g_loss: 0.0200 Training d_loss: 17.3438 Explore P: 0.0147
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2053 Total reward: 53.0 Training q_loss: 4.4425 Training g_loss: 0.0306 Training d_loss: 17.2475 Explore P: 0.0146
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2054 Total reward: 28.0 Training q_loss: 3.9439 Training g_loss: 0.0443 Training d_loss: 13.0069 Explore P: 0.0146
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2055 Total reward: 21.0 Training q_loss: 3.2605 Training g_

-------------------------------------------------------------------------------
Episode: 2082 Total reward: 9.0 Training q_loss: 2.7851 Training g_loss: 0.1295 Training d_loss: 9.0470 Explore P: 0.0142
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2083 Total reward: 11.0 Training q_loss: 30.6969 Training g_loss: 0.6066 Training d_loss: 20.2129 Explore P: 0.0142
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2084 Total reward: 7.0 Training q_loss: 82.3181 Training g_loss: 0.5509 Training d_loss: 19.7461 Explore P: 0.0142
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2085 Total reward: 10.0 Training q_loss: 46.6979 Training g_

-------------------------------------------------------------------------------
Episode: 2111 Total reward: 34.0 Training q_loss: 2.9756 Training g_loss: 0.0083 Training d_loss: 16.4656 Explore P: 0.0134
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2112 Total reward: 38.0 Training q_loss: 3.1241 Training g_loss: 0.0555 Training d_loss: 15.5563 Explore P: 0.0134
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2113 Total reward: 42.0 Training q_loss: 2.5474 Training g_loss: 0.0398 Training d_loss: 17.9105 Explore P: 0.0134
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2114 Total reward: 70.0 Training q_loss: 2.7692 Training g_

-------------------------------------------------------------------------------
Episode: 2141 Total reward: 28.0 Training q_loss: 6.0977 Training g_loss: 0.3789 Training d_loss: 16.9633 Explore P: 0.0128
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2142 Total reward: 11.0 Training q_loss: 3.7918 Training g_loss: 0.3626 Training d_loss: 6.8721 Explore P: 0.0128
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2143 Total reward: 10.0 Training q_loss: 4.5915 Training g_loss: 0.2798 Training d_loss: 18.4583 Explore P: 0.0128
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2144 Total reward: 9.0 Training q_loss: 124.8877 Training g_

-------------------------------------------------------------------------------
Episode: 2170 Total reward: 118.0 Training q_loss: 1.8797 Training g_loss: 0.0066 Training d_loss: 16.4533 Explore P: 0.0122
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2171 Total reward: 191.0 Training q_loss: 4.3913 Training g_loss: 0.0437 Training d_loss: 14.8665 Explore P: 0.0122
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2172 Total reward: 118.0 Training q_loss: 2.6905 Training g_loss: 0.0385 Training d_loss: 12.4850 Explore P: 0.0122
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2173 Total reward: 109.0 Training q_loss: 3.6427 Trainin

-------------------------------------------------------------------------------
Episode: 2199 Total reward: 125.0 Training q_loss: 5.1055 Training g_loss: 0.0039 Training d_loss: 17.3751 Explore P: 0.0116
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2200 Total reward: 102.0 Training q_loss: 5.1574 Training g_loss: 0.0065 Training d_loss: 11.1942 Explore P: 0.0115
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2201 Total reward: 100.0 Training q_loss: 6.1239 Training g_loss: 0.0843 Training d_loss: 14.9059 Explore P: 0.0115
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2202 Total reward: 135.0 Training q_loss: 6.3408 Trainin

-------------------------------------------------------------------------------
Episode: 2228 Total reward: 179.0 Training q_loss: 10.7526 Training g_loss: 0.0281 Training d_loss: 14.1550 Explore P: 0.0112
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2229 Total reward: 132.0 Training q_loss: 8.8361 Training g_loss: 0.0791 Training d_loss: 7.0040 Explore P: 0.0112
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2230 Total reward: 139.0 Training q_loss: 7.4636 Training g_loss: 0.0744 Training d_loss: 18.0153 Explore P: 0.0111
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2231 Total reward: 199.0 Training q_loss: 6.7720 Trainin

-------------------------------------------------------------------------------
Episode: 2257 Total reward: 73.0 Training q_loss: 5.0581 Training g_loss: 0.0224 Training d_loss: 20.9467 Explore P: 0.0109
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2258 Total reward: 81.0 Training q_loss: 7.0254 Training g_loss: 0.1664 Training d_loss: 11.8341 Explore P: 0.0108
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2259 Total reward: 119.0 Training q_loss: 6.0524 Training g_loss: 0.0931 Training d_loss: 41.3041 Explore P: 0.0108
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2260 Total reward: 69.0 Training q_loss: 6.1038 Training g

-------------------------------------------------------------------------------
Episode: 2286 Total reward: 16.0 Training q_loss: 3.1942 Training g_loss: 0.3231 Training d_loss: 21.1921 Explore P: 0.0107
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2287 Total reward: 29.0 Training q_loss: 4.2814 Training g_loss: 0.1377 Training d_loss: 16.7265 Explore P: 0.0107
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2288 Total reward: 63.0 Training q_loss: 11.7439 Training g_loss: 0.0907 Training d_loss: 7.6278 Explore P: 0.0107
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2289 Total reward: 67.0 Training q_loss: 11.4475 Training g

-------------------------------------------------------------------------------
Episode: 2315 Total reward: 75.0 Training q_loss: 9.8717 Training g_loss: 0.0121 Training d_loss: 12.7201 Explore P: 0.0106
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2316 Total reward: 44.0 Training q_loss: 8.9074 Training g_loss: 0.0380 Training d_loss: 8.6102 Explore P: 0.0106
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2317 Total reward: 28.0 Training q_loss: 9.0732 Training g_loss: 0.0272 Training d_loss: 11.2954 Explore P: 0.0106
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2318 Total reward: 28.0 Training q_loss: 8.1649 Training g_l

-------------------------------------------------------------------------------
Episode: 2345 Total reward: 13.0 Training q_loss: 340.2473 Training g_loss: 0.7170 Training d_loss: 22.1021 Explore P: 0.0105
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2346 Total reward: 10.0 Training q_loss: 766.3155 Training g_loss: 1.1494 Training d_loss: 14.6249 Explore P: 0.0105
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2347 Total reward: 7.0 Training q_loss: 350.5972 Training g_loss: 0.8461 Training d_loss: 17.6127 Explore P: 0.0105
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2348 Total reward: 11.0 Training q_loss: 27.5023 Train

-------------------------------------------------------------------------------
Episode: 2374 Total reward: 25.0 Training q_loss: 11.5884 Training g_loss: 0.0283 Training d_loss: 23.5532 Explore P: 0.0105
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2375 Total reward: 15.0 Training q_loss: 13.8635 Training g_loss: 0.0420 Training d_loss: 18.4657 Explore P: 0.0105
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2376 Total reward: 23.0 Training q_loss: 12.6588 Training g_loss: 0.2069 Training d_loss: 12.4688 Explore P: 0.0105
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2377 Total reward: 36.0 Training q_loss: 14.7373 Trainin

-------------------------------------------------------------------------------
Episode: 2403 Total reward: 16.0 Training q_loss: 7.9684 Training g_loss: 0.0446 Training d_loss: 7.9839 Explore P: 0.0105
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2404 Total reward: 15.0 Training q_loss: 7.3741 Training g_loss: 0.0154 Training d_loss: 25.9812 Explore P: 0.0105
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2405 Total reward: 13.0 Training q_loss: 7.3708 Training g_loss: 0.0116 Training d_loss: 10.2863 Explore P: 0.0105
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2406 Total reward: 17.0 Training q_loss: 6.2839 Training g_l

-------------------------------------------------------------------------------
Episode: 2432 Total reward: 11.0 Training q_loss: 4.3065 Training g_loss: 0.0131 Training d_loss: 13.6973 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2433 Total reward: 9.0 Training q_loss: 4.5221 Training g_loss: 0.0146 Training d_loss: 19.2860 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2434 Total reward: 13.0 Training q_loss: 3.9384 Training g_loss: 0.0303 Training d_loss: 21.7340 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2435 Total reward: 9.0 Training q_loss: 2.8297 Training g_lo

-------------------------------------------------------------------------------
Episode: 2462 Total reward: 10.0 Training q_loss: 1.8155 Training g_loss: 0.0111 Training d_loss: 16.5356 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2463 Total reward: 8.0 Training q_loss: 1.9641 Training g_loss: 0.0040 Training d_loss: 7.6819 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2464 Total reward: 12.0 Training q_loss: 1.5673 Training g_loss: 0.0034 Training d_loss: 10.9604 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2465 Total reward: 9.0 Training q_loss: 2.2591 Training g_los

-------------------------------------------------------------------------------
Episode: 2492 Total reward: 8.0 Training q_loss: 1.2502 Training g_loss: 0.0623 Training d_loss: 26.6298 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2493 Total reward: 10.0 Training q_loss: 1.1955 Training g_loss: 0.0115 Training d_loss: 17.0117 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2494 Total reward: 10.0 Training q_loss: 1.3979 Training g_loss: 0.0033 Training d_loss: 16.9268 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2495 Total reward: 10.0 Training q_loss: 1.5245 Training g_l

-------------------------------------------------------------------------------
Episode: 2521 Total reward: 15.0 Training q_loss: 14.1405 Training g_loss: 0.0621 Training d_loss: 10.8443 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2522 Total reward: 13.0 Training q_loss: 12.7173 Training g_loss: 0.0271 Training d_loss: 7.8512 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2523 Total reward: 12.0 Training q_loss: 12.0134 Training g_loss: 0.0734 Training d_loss: 7.4715 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2524 Total reward: 14.0 Training q_loss: 12.0853 Training 

-------------------------------------------------------------------------------
Episode: 2550 Total reward: 13.0 Training q_loss: 12.7637 Training g_loss: 0.0018 Training d_loss: 10.0242 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2551 Total reward: 13.0 Training q_loss: 10.9886 Training g_loss: 0.0027 Training d_loss: 18.9275 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2552 Total reward: 14.0 Training q_loss: 12.2442 Training g_loss: 0.0057 Training d_loss: 9.4967 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2553 Total reward: 12.0 Training q_loss: 11.0833 Training

-------------------------------------------------------------------------------
Episode: 2579 Total reward: 14.0 Training q_loss: 8.3236 Training g_loss: 0.0126 Training d_loss: 18.6852 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2580 Total reward: 14.0 Training q_loss: 9.5198 Training g_loss: 0.0183 Training d_loss: 18.1779 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2581 Total reward: 31.0 Training q_loss: 8.5321 Training g_loss: 0.0224 Training d_loss: 15.9969 Explore P: 0.0104
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2582 Total reward: 18.0 Training q_loss: 8.1830 Training g_

-------------------------------------------------------------------------------
Episode: 2608 Total reward: 14.0 Training q_loss: 7.6184 Training g_loss: 0.0247 Training d_loss: 11.4961 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2609 Total reward: 24.0 Training q_loss: 6.9623 Training g_loss: 0.0451 Training d_loss: 25.8919 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2610 Total reward: 40.0 Training q_loss: 9.5276 Training g_loss: 0.1057 Training d_loss: 26.5013 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2611 Total reward: 15.0 Training q_loss: 8.6759 Training g_

-------------------------------------------------------------------------------
Episode: 2637 Total reward: 33.0 Training q_loss: 4.9147 Training g_loss: 0.0529 Training d_loss: 15.3890 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2638 Total reward: 21.0 Training q_loss: 5.2120 Training g_loss: 0.0283 Training d_loss: 29.3788 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2639 Total reward: 19.0 Training q_loss: 4.7482 Training g_loss: 0.0183 Training d_loss: 17.1750 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2640 Total reward: 14.0 Training q_loss: 5.7290 Training g_

-------------------------------------------------------------------------------
Episode: 2666 Total reward: 15.0 Training q_loss: 2.8658 Training g_loss: 0.0007 Training d_loss: 20.5613 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2667 Total reward: 15.0 Training q_loss: 3.5317 Training g_loss: 0.0018 Training d_loss: 26.3729 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2668 Total reward: 13.0 Training q_loss: 2.9844 Training g_loss: 0.0111 Training d_loss: 18.6429 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2669 Total reward: 13.0 Training q_loss: 3.1326 Training g_

-------------------------------------------------------------------------------
Episode: 2695 Total reward: 10.0 Training q_loss: 3.6118 Training g_loss: 0.0010 Training d_loss: 15.3134 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2696 Total reward: 11.0 Training q_loss: 4.1278 Training g_loss: 0.0449 Training d_loss: 15.0437 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2697 Total reward: 14.0 Training q_loss: 4.2525 Training g_loss: 0.0039 Training d_loss: 18.7619 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2698 Total reward: 16.0 Training q_loss: 3.1870 Training g_

-------------------------------------------------------------------------------
Episode: 2725 Total reward: 12.0 Training q_loss: 3.0521 Training g_loss: 0.0336 Training d_loss: 6.3216 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2726 Total reward: 31.0 Training q_loss: 2.9828 Training g_loss: 0.0829 Training d_loss: 24.5414 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2727 Total reward: 18.0 Training q_loss: 1.7952 Training g_loss: 0.0052 Training d_loss: 12.8733 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2728 Total reward: 10.0 Training q_loss: 1.4513 Training g_l

-------------------------------------------------------------------------------
Episode: 2754 Total reward: 18.0 Training q_loss: 2.7876 Training g_loss: 0.0314 Training d_loss: 20.5632 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2755 Total reward: 22.0 Training q_loss: 2.6237 Training g_loss: 0.0166 Training d_loss: 6.7475 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2756 Total reward: 16.0 Training q_loss: 2.9943 Training g_loss: 0.0520 Training d_loss: 9.1196 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2757 Total reward: 15.0 Training q_loss: 2.7582 Training g_lo

-------------------------------------------------------------------------------
Episode: 2784 Total reward: 17.0 Training q_loss: 3.0994 Training g_loss: 0.0167 Training d_loss: 21.3738 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2785 Total reward: 16.0 Training q_loss: 3.4121 Training g_loss: 0.0391 Training d_loss: 8.0337 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2786 Total reward: 11.0 Training q_loss: 3.7079 Training g_loss: 0.0166 Training d_loss: 22.7916 Explore P: 0.0103
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2787 Total reward: 14.0 Training q_loss: 3.2826 Training g_l

-------------------------------------------------------------------------------
Episode: 2813 Total reward: 15.0 Training q_loss: 3.2976 Training g_loss: 0.0047 Training d_loss: 15.5796 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2814 Total reward: 10.0 Training q_loss: 2.8478 Training g_loss: 0.1024 Training d_loss: 20.2429 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2815 Total reward: 13.0 Training q_loss: 3.3727 Training g_loss: 0.0401 Training d_loss: 14.3697 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2816 Total reward: 15.0 Training q_loss: 2.8301 Training g_

-------------------------------------------------------------------------------
Episode: 2842 Total reward: 16.0 Training q_loss: 2.9382 Training g_loss: 0.0083 Training d_loss: 21.7974 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2843 Total reward: 13.0 Training q_loss: 2.9556 Training g_loss: 0.0123 Training d_loss: 25.7287 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2844 Total reward: 18.0 Training q_loss: 2.8185 Training g_loss: 0.0104 Training d_loss: 9.6591 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2845 Total reward: 15.0 Training q_loss: 2.9665 Training g_l

-------------------------------------------------------------------------------
Episode: 2871 Total reward: 17.0 Training q_loss: 2.6868 Training g_loss: 0.1576 Training d_loss: 19.2360 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2872 Total reward: 12.0 Training q_loss: 2.9852 Training g_loss: 0.0255 Training d_loss: 7.1613 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2873 Total reward: 24.0 Training q_loss: 3.0034 Training g_loss: 0.0436 Training d_loss: 20.6555 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2874 Total reward: 12.0 Training q_loss: 2.6523 Training g_l

-------------------------------------------------------------------------------
Episode: 2900 Total reward: 42.0 Training q_loss: 4.6977 Training g_loss: 0.0217 Training d_loss: 30.7461 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2901 Total reward: 35.0 Training q_loss: 5.9757 Training g_loss: 0.0218 Training d_loss: 14.7193 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2902 Total reward: 27.0 Training q_loss: 5.4495 Training g_loss: 0.0062 Training d_loss: 13.3324 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2903 Total reward: 29.0 Training q_loss: 5.6476 Training g_

-------------------------------------------------------------------------------
Episode: 2929 Total reward: 26.0 Training q_loss: 5.2739 Training g_loss: 0.0078 Training d_loss: 13.1059 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2930 Total reward: 28.0 Training q_loss: 4.8667 Training g_loss: 0.0008 Training d_loss: 20.6714 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2931 Total reward: 29.0 Training q_loss: 5.9058 Training g_loss: 0.0090 Training d_loss: 19.8136 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2932 Total reward: 49.0 Training q_loss: 5.2924 Training g_

-------------------------------------------------------------------------------
Episode: 2958 Total reward: 29.0 Training q_loss: 4.6100 Training g_loss: 0.0184 Training d_loss: 9.3217 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2959 Total reward: 41.0 Training q_loss: 4.5978 Training g_loss: 0.0065 Training d_loss: 9.6497 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2960 Total reward: 49.0 Training q_loss: 4.1054 Training g_loss: 0.0829 Training d_loss: 17.2696 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2961 Total reward: 22.0 Training q_loss: 3.5063 Training g_lo

-------------------------------------------------------------------------------
Episode: 2987 Total reward: 24.0 Training q_loss: 4.1653 Training g_loss: 0.0131 Training d_loss: 24.5168 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2988 Total reward: 26.0 Training q_loss: 3.8441 Training g_loss: 0.0009 Training d_loss: 9.0492 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2989 Total reward: 17.0 Training q_loss: 3.8132 Training g_loss: 0.0058 Training d_loss: 16.5858 Explore P: 0.0102
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 2990 Total reward: 17.0 Training q_loss: 3.9601 Training g_l

-------------------------------------------------------------------------------
Episode: 3016 Total reward: 12.0 Training q_loss: 4.6462 Training g_loss: 0.0008 Training d_loss: 18.1186 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3017 Total reward: 25.0 Training q_loss: 5.2726 Training g_loss: 0.1498 Training d_loss: 13.4731 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3018 Total reward: 27.0 Training q_loss: 4.8254 Training g_loss: 0.0028 Training d_loss: 13.5396 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3019 Total reward: 11.0 Training q_loss: 4.7436 Training g_

-------------------------------------------------------------------------------
Episode: 3045 Total reward: 17.0 Training q_loss: 3.4850 Training g_loss: 0.0903 Training d_loss: 25.3467 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3046 Total reward: 9.0 Training q_loss: 3.0002 Training g_loss: 0.0116 Training d_loss: 11.4498 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3047 Total reward: 12.0 Training q_loss: 2.5291 Training g_loss: 0.0767 Training d_loss: 19.4384 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3048 Total reward: 10.0 Training q_loss: 2.8068 Training g_l

-------------------------------------------------------------------------------
Episode: 3075 Total reward: 16.0 Training q_loss: 3.5785 Training g_loss: 0.0562 Training d_loss: 7.0946 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3076 Total reward: 11.0 Training q_loss: 2.8926 Training g_loss: 0.0033 Training d_loss: 10.9702 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3077 Total reward: 10.0 Training q_loss: 3.0556 Training g_loss: 0.0137 Training d_loss: 8.9582 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3078 Total reward: 10.0 Training q_loss: 3.3500 Training g_lo

-------------------------------------------------------------------------------
Episode: 3105 Total reward: 10.0 Training q_loss: 3.7447 Training g_loss: 0.0083 Training d_loss: 19.3012 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3106 Total reward: 12.0 Training q_loss: 3.7299 Training g_loss: 0.0030 Training d_loss: 19.8163 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3107 Total reward: 12.0 Training q_loss: 3.3492 Training g_loss: 0.0034 Training d_loss: 20.1058 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3108 Total reward: 12.0 Training q_loss: 3.6252 Training g_

-------------------------------------------------------------------------------
Episode: 3134 Total reward: 9.0 Training q_loss: 3.0026 Training g_loss: 0.1457 Training d_loss: 21.1967 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3135 Total reward: 12.0 Training q_loss: 2.5425 Training g_loss: 0.0992 Training d_loss: 18.3305 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3136 Total reward: 10.0 Training q_loss: 2.7030 Training g_loss: 0.0410 Training d_loss: 10.0458 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3137 Total reward: 9.0 Training q_loss: 3.3948 Training g_lo

-------------------------------------------------------------------------------
Episode: 3163 Total reward: 11.0 Training q_loss: 10.9769 Training g_loss: 0.1700 Training d_loss: 23.4611 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3164 Total reward: 12.0 Training q_loss: 16.2877 Training g_loss: 0.0595 Training d_loss: 20.6209 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3165 Total reward: 12.0 Training q_loss: 5.3307 Training g_loss: 0.0826 Training d_loss: 11.8854 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3166 Total reward: 53.0 Training q_loss: 2.0930 Training 

-------------------------------------------------------------------------------
Episode: 3192 Total reward: 103.0 Training q_loss: 6.0454 Training g_loss: 0.0638 Training d_loss: 12.1470 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3193 Total reward: 35.0 Training q_loss: 5.3747 Training g_loss: 0.0174 Training d_loss: 19.1078 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3194 Total reward: 20.0 Training q_loss: 4.4862 Training g_loss: 0.0152 Training d_loss: 21.7778 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3195 Total reward: 23.0 Training q_loss: 3.8130 Training g

-------------------------------------------------------------------------------
Episode: 3221 Total reward: 110.0 Training q_loss: 9.2909 Training g_loss: 0.0085 Training d_loss: 25.4785 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3222 Total reward: 177.0 Training q_loss: 9.3851 Training g_loss: 0.0033 Training d_loss: 22.3028 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3223 Total reward: 199.0 Training q_loss: 8.4075 Training g_loss: 0.0297 Training d_loss: 19.5964 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3224 Total reward: 114.0 Training q_loss: 8.2424 Trainin

-------------------------------------------------------------------------------
Episode: 3250 Total reward: 11.0 Training q_loss: 5.4615 Training g_loss: 0.1455 Training d_loss: 16.4352 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3251 Total reward: 8.0 Training q_loss: 5.2655 Training g_loss: 0.0917 Training d_loss: 11.7987 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3252 Total reward: 11.0 Training q_loss: 4.8174 Training g_loss: 0.0381 Training d_loss: 30.1356 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3253 Total reward: 11.0 Training q_loss: 5.2673 Training g_l

-------------------------------------------------------------------------------
Episode: 3280 Total reward: 11.0 Training q_loss: 3.3738 Training g_loss: 0.0148 Training d_loss: 14.3553 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3281 Total reward: 7.0 Training q_loss: 3.4281 Training g_loss: 0.0017 Training d_loss: 24.0232 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3282 Total reward: 11.0 Training q_loss: 3.6881 Training g_loss: 0.0034 Training d_loss: 12.2191 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3283 Total reward: 9.0 Training q_loss: 3.7327 Training g_lo

-------------------------------------------------------------------------------
Episode: 3309 Total reward: 20.0 Training q_loss: 31.5813 Training g_loss: 0.0229 Training d_loss: 14.5275 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3310 Total reward: 19.0 Training q_loss: 31.1181 Training g_loss: 0.0367 Training d_loss: 19.0344 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3311 Total reward: 15.0 Training q_loss: 34.4523 Training g_loss: 0.0157 Training d_loss: 14.5230 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3312 Total reward: 25.0 Training q_loss: 36.8495 Trainin

-------------------------------------------------------------------------------
Episode: 3338 Total reward: 19.0 Training q_loss: 14.4305 Training g_loss: 0.0163 Training d_loss: 12.5672 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3339 Total reward: 23.0 Training q_loss: 13.2134 Training g_loss: 0.0074 Training d_loss: 19.0492 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3340 Total reward: 17.0 Training q_loss: 12.7174 Training g_loss: 0.0176 Training d_loss: 16.6355 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3341 Total reward: 17.0 Training q_loss: 10.7827 Trainin

-------------------------------------------------------------------------------
Episode: 3367 Total reward: 15.0 Training q_loss: 7.3638 Training g_loss: 0.0054 Training d_loss: 21.8665 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3368 Total reward: 17.0 Training q_loss: 7.1547 Training g_loss: 0.0013 Training d_loss: 12.7526 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3369 Total reward: 16.0 Training q_loss: 6.2682 Training g_loss: 0.0820 Training d_loss: 15.2912 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3370 Total reward: 14.0 Training q_loss: 5.4093 Training g_

-------------------------------------------------------------------------------
Episode: 3396 Total reward: 15.0 Training q_loss: 5.0635 Training g_loss: 0.0010 Training d_loss: 7.0018 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3397 Total reward: 21.0 Training q_loss: 6.2378 Training g_loss: 0.0263 Training d_loss: 14.1625 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3398 Total reward: 20.0 Training q_loss: 5.2662 Training g_loss: 0.0233 Training d_loss: 23.8879 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3399 Total reward: 18.0 Training q_loss: 5.6808 Training g_l

-------------------------------------------------------------------------------
Episode: 3425 Total reward: 11.0 Training q_loss: 2.4544 Training g_loss: 0.0017 Training d_loss: 18.4495 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3426 Total reward: 12.0 Training q_loss: 2.4272 Training g_loss: 0.0113 Training d_loss: 16.3080 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3427 Total reward: 14.0 Training q_loss: 2.7112 Training g_loss: 0.0038 Training d_loss: 13.3810 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3428 Total reward: 16.0 Training q_loss: 4.0901 Training g_

-------------------------------------------------------------------------------
Episode: 3454 Total reward: 16.0 Training q_loss: 17.3002 Training g_loss: 0.0539 Training d_loss: 14.0285 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3455 Total reward: 16.0 Training q_loss: 16.5129 Training g_loss: 0.0371 Training d_loss: 14.8696 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3456 Total reward: 13.0 Training q_loss: 18.6373 Training g_loss: 0.1002 Training d_loss: 17.5542 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3457 Total reward: 19.0 Training q_loss: 17.6426 Trainin

-------------------------------------------------------------------------------
Episode: 3484 Total reward: 11.0 Training q_loss: 8.8222 Training g_loss: 0.0489 Training d_loss: 21.0334 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3485 Total reward: 9.0 Training q_loss: 10.0802 Training g_loss: 0.0025 Training d_loss: 15.5832 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3486 Total reward: 8.0 Training q_loss: 9.9658 Training g_loss: 0.0820 Training d_loss: 17.7371 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3487 Total reward: 13.0 Training q_loss: 10.9212 Training g_

-------------------------------------------------------------------------------
Episode: 3514 Total reward: 11.0 Training q_loss: 7.6653 Training g_loss: 0.0015 Training d_loss: 10.2326 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3515 Total reward: 10.0 Training q_loss: 6.5870 Training g_loss: 0.0018 Training d_loss: 20.4550 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3516 Total reward: 12.0 Training q_loss: 7.7576 Training g_loss: 0.0574 Training d_loss: 11.0301 Explore P: 0.0101
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3517 Total reward: 14.0 Training q_loss: 7.3347 Training g_

-------------------------------------------------------------------------------
Episode: 3543 Total reward: 19.0 Training q_loss: 10.4961 Training g_loss: 0.0462 Training d_loss: 22.0429 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3544 Total reward: 18.0 Training q_loss: 10.1855 Training g_loss: 0.0132 Training d_loss: 15.8743 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3545 Total reward: 13.0 Training q_loss: 9.1036 Training g_loss: 0.0393 Training d_loss: 18.8702 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3546 Total reward: 13.0 Training q_loss: 8.6305 Training 

-------------------------------------------------------------------------------
Episode: 3572 Total reward: 29.0 Training q_loss: 6.8968 Training g_loss: 0.0620 Training d_loss: 4.8284 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3573 Total reward: 89.0 Training q_loss: 10.4039 Training g_loss: 0.3019 Training d_loss: 12.5489 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3574 Total reward: 111.0 Training q_loss: 19.9702 Training g_loss: 0.0219 Training d_loss: 11.6283 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3575 Total reward: 176.0 Training q_loss: 17.2110 Trainin

-------------------------------------------------------------------------------
Episode: 3601 Total reward: 97.0 Training q_loss: 38.7622 Training g_loss: 0.0421 Training d_loss: 17.7630 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3602 Total reward: 92.0 Training q_loss: 27.6266 Training g_loss: 0.0615 Training d_loss: 17.9120 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3603 Total reward: 99.0 Training q_loss: 42.1820 Training g_loss: 0.3990 Training d_loss: 9.1979 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3604 Total reward: 150.0 Training q_loss: 27.9071 Trainin

-------------------------------------------------------------------------------
Episode: 3630 Total reward: 49.0 Training q_loss: 7.2020 Training g_loss: 0.0258 Training d_loss: 13.0994 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3631 Total reward: 97.0 Training q_loss: 13.8427 Training g_loss: 0.6963 Training d_loss: 12.9587 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3632 Total reward: 199.0 Training q_loss: 10.2416 Training g_loss: 0.1235 Training d_loss: 13.8822 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3633 Total reward: 16.0 Training q_loss: 5.5741 Training

-------------------------------------------------------------------------------
Episode: 3659 Total reward: 41.0 Training q_loss: 4.5854 Training g_loss: 0.0141 Training d_loss: 19.8848 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3660 Total reward: 37.0 Training q_loss: 3.4360 Training g_loss: 0.0649 Training d_loss: 9.9218 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3661 Total reward: 56.0 Training q_loss: 5.7299 Training g_loss: 0.3689 Training d_loss: 26.9917 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3662 Total reward: 67.0 Training q_loss: 5.0076 Training g_l

-------------------------------------------------------------------------------
Episode: 3688 Total reward: 51.0 Training q_loss: 7.0520 Training g_loss: 0.0064 Training d_loss: 16.6785 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3689 Total reward: 108.0 Training q_loss: 9.4805 Training g_loss: 0.0232 Training d_loss: 36.4822 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3690 Total reward: 110.0 Training q_loss: 5.0789 Training g_loss: 1.4393 Training d_loss: 18.5210 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3691 Total reward: 10.0 Training q_loss: 1.4239 Training 

-------------------------------------------------------------------------------
Episode: 3717 Total reward: 10.0 Training q_loss: 11.2791 Training g_loss: 0.0296 Training d_loss: 17.8779 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3718 Total reward: 9.0 Training q_loss: 8.3740 Training g_loss: 1.1163 Training d_loss: 21.0393 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3719 Total reward: 75.0 Training q_loss: 6.9121 Training g_loss: 1.0517 Training d_loss: 16.7756 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3720 Total reward: 199.0 Training q_loss: 10.5112 Training 

-------------------------------------------------------------------------------
Episode: 3746 Total reward: 64.0 Training q_loss: 5.0843 Training g_loss: 0.2234 Training d_loss: 18.4062 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3747 Total reward: 9.0 Training q_loss: 5.4141 Training g_loss: 0.0124 Training d_loss: 18.6414 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3748 Total reward: 9.0 Training q_loss: 4.9335 Training g_loss: 0.0153 Training d_loss: 13.7005 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3749 Total reward: 12.0 Training q_loss: 4.3251 Training g_lo

-------------------------------------------------------------------------------
Episode: 3776 Total reward: 10.0 Training q_loss: 2.8202 Training g_loss: 0.0153 Training d_loss: 21.5410 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3777 Total reward: 15.0 Training q_loss: 2.7877 Training g_loss: 0.0257 Training d_loss: 16.4372 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3778 Total reward: 12.0 Training q_loss: 2.8347 Training g_loss: 0.1207 Training d_loss: 9.0219 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3779 Total reward: 14.0 Training q_loss: 2.6239 Training g_l

-------------------------------------------------------------------------------
Episode: 3805 Total reward: 82.0 Training q_loss: 2.6250 Training g_loss: 0.0815 Training d_loss: 25.8428 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3806 Total reward: 83.0 Training q_loss: 3.1761 Training g_loss: 0.0137 Training d_loss: 17.9920 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3807 Total reward: 17.0 Training q_loss: 3.5878 Training g_loss: 0.1691 Training d_loss: 8.6038 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3808 Total reward: 12.0 Training q_loss: 2.3671 Training g_l

-------------------------------------------------------------------------------
Episode: 3834 Total reward: 116.0 Training q_loss: 4.3339 Training g_loss: 0.0041 Training d_loss: 8.6653 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3835 Total reward: 81.0 Training q_loss: 3.7837 Training g_loss: 0.2091 Training d_loss: 23.4179 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3836 Total reward: 10.0 Training q_loss: 15.6829 Training g_loss: 0.1294 Training d_loss: 21.7397 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3837 Total reward: 11.0 Training q_loss: 38.3319 Training 

-------------------------------------------------------------------------------
Episode: 3863 Total reward: 34.0 Training q_loss: 4.0931 Training g_loss: 0.0844 Training d_loss: 14.8634 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3864 Total reward: 12.0 Training q_loss: 3.2810 Training g_loss: 0.0652 Training d_loss: 10.3259 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3865 Total reward: 10.0 Training q_loss: 4.3252 Training g_loss: 0.0268 Training d_loss: 18.2403 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3866 Total reward: 11.0 Training q_loss: 3.6368 Training g_

-------------------------------------------------------------------------------
Episode: 3892 Total reward: 40.0 Training q_loss: 4.1158 Training g_loss: 0.2870 Training d_loss: 16.8032 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3893 Total reward: 13.0 Training q_loss: 8.1854 Training g_loss: 0.2779 Training d_loss: 5.2554 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3894 Total reward: 14.0 Training q_loss: 23.4548 Training g_loss: 0.1279 Training d_loss: 19.3402 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3895 Total reward: 174.0 Training q_loss: 4.4585 Training g

-------------------------------------------------------------------------------
Episode: 3922 Total reward: 9.0 Training q_loss: 5.6621 Training g_loss: 0.0844 Training d_loss: 26.0977 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3923 Total reward: 10.0 Training q_loss: 5.2349 Training g_loss: 0.0237 Training d_loss: 11.8744 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3924 Total reward: 100.0 Training q_loss: 4.7331 Training g_loss: 0.0434 Training d_loss: 11.1581 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3925 Total reward: 137.0 Training q_loss: 3.6943 Training g

-------------------------------------------------------------------------------
Episode: 3952 Total reward: 11.0 Training q_loss: 18.9500 Training g_loss: 0.0353 Training d_loss: 16.9209 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3953 Total reward: 11.0 Training q_loss: 13.2763 Training g_loss: 0.0183 Training d_loss: 11.6840 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3954 Total reward: 13.0 Training q_loss: 9.5380 Training g_loss: 0.1195 Training d_loss: 21.4404 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3955 Total reward: 101.0 Training q_loss: 4.9781 Training

-------------------------------------------------------------------------------
Episode: 3981 Total reward: 11.0 Training q_loss: 13.0486 Training g_loss: 0.0226 Training d_loss: 24.3953 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3982 Total reward: 10.0 Training q_loss: 11.8760 Training g_loss: 0.0587 Training d_loss: 26.9111 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3983 Total reward: 11.0 Training q_loss: 11.2124 Training g_loss: 0.0114 Training d_loss: 14.7963 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 3984 Total reward: 10.0 Training q_loss: 7.6654 Training

-------------------------------------------------------------------------------
Episode: 4010 Total reward: 136.0 Training q_loss: 8.6545 Training g_loss: 0.8921 Training d_loss: 11.3315 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 4011 Total reward: 148.0 Training q_loss: 5.1284 Training g_loss: 0.0848 Training d_loss: 10.4821 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 4012 Total reward: 189.0 Training q_loss: 15.4465 Training g_loss: 0.0952 Training d_loss: 11.9652 Explore P: 0.0100
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Episode: 4013 Total reward: 124.0 Training q_loss: 8.0712 Traini

## Visualizing training

Below I'll plot the total rewards for each episode. I'm plotting the rolling average too, in blue.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(q_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Q losses')

In [None]:
eps, arr = np.array(g_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('G losses')

In [None]:
eps, arr = np.array(d_loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses')

## Testing

Let's checkout how our trained agent plays the game.

In [None]:
test_episodes = 1
test_max_steps = 20000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

# # # Create the env after closing it.
# env = gym.make('CartPole-v0')
# # env = gym.make('Acrobot-v1')
env.reset()

with tf.Session() as sess:
    
    # Restore/load the trained model 
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # iterations
    for ep in range(test_episodes):
        
        # number of env/rob steps
        t = 0
        while t < test_max_steps:
            
            # Rendering the env graphics
            env.render()
            
            # Get action from the model
            feed_dict = {model.prev_actions: np.array([prev_action]), 
                         model.states: state.reshape((1, *state.shape))}
            actions_logits = sess.run(model.actions_logits, feed_dict)
            action = np.argmax(actions_logits)
            
            # Take action, get new state and reward
            next_state, reward, done, _ = env.step(action)
            
            # The task is done or not;
            if done:
                t = test_max_steps
                env.reset()
                
                # Take one random step to get the pole and cart moving
                prev_action = env.action_space.sample()
                state, reward, done, _ = env.step(prev_action)
            else:
                state = next_state
                t += 1

In [None]:
# # Closing the env
# # WARNING: If you close, you can NOT restart again!!!!!!
# env.close()

## Extending this to Deep Convolutional QAN

So, Cart-Pole is a pretty simple game. However, the same model can be used to train an agent to play something much more complicated like Pong or Space Invaders. Instead of a state like we're using here though, you'd want to use convolutional layers to get the state from the screen images.

![Deep Q-Learning Atari](assets/atari-network.png)

I'll leave it as a challenge for you to use deep Q-learning to train an agent to play Atari games. Here's the original paper which will get you started: http://www.davidqiu.com:8888/research/nature14236.pdf.