# Navigation

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the first project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).

### 1. Start the Environment

We begin by importing some necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Banana.app"`
- **Windows** (x86): `"path/to/Banana_Windows_x86/Banana.exe"`
- **Windows** (x86_64): `"path/to/Banana_Windows_x86_64/Banana.exe"`
- **Linux** (x86): `"path/to/Banana_Linux/Banana.x86"`
- **Linux** (x86_64): `"path/to/Banana_Linux/Banana.x86_64"`
- **Linux** (x86, headless): `"path/to/Banana_Linux_NoVis/Banana.x86"`
- **Linux** (x86_64, headless): `"path/to/Banana_Linux_NoVis/Banana.x86_64"`

For instance, if you are using a Mac, then you downloaded `Banana.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Banana.app")
```
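
If you switch between machines, the path list above can be wrapped in a small lookup. The sketch below is only an illustration: `banana_build_path` is a hypothetical helper (not part of `unityagents`), and `"path/to"` is the same placeholder used in the list, which you still have to replace with your own download location.

```python
def banana_build_path(system, is_64bit, headless=False, root="path/to"):
    """Return the Banana build path for a platform (hypothetical helper).

    `system` uses the names reported by platform.system():
    "Darwin" (Mac), "Windows", or "Linux".
    """
    if system == "Darwin":
        return f"{root}/Banana.app"
    if system == "Windows":
        folder = "Banana_Windows_x86_64" if is_64bit else "Banana_Windows_x86"
        return f"{root}/{folder}/Banana.exe"
    if system == "Linux":
        # Headless builds live in the NoVis folder.
        folder = "Banana_Linux_NoVis" if headless else "Banana_Linux"
        suffix = "x86_64" if is_64bit else "x86"
        return f"{root}/{folder}/Banana.{suffix}"
    raise ValueError(f"unsupported platform: {system}")


print(banana_build_path("Linux", True, headless=True))
```

The returned string is what you would pass as `file_name`; `platform.system()` from the standard library can supply the first argument on the current machine.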

In [2]:
# env = UnityEnvironment(file_name="/home/arasdar/VisualBanana_Linux/Banana.x86")
# env = UnityEnvironment(file_name="/home/arasdar/unity-envs/Banana_Linux/Banana.x86_64")
env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Banana_Linux_NoVis/Banana.x86_64')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

The simulation contains a single agent that navigates a large environment.  At each time step, it has four actions at its disposal:
- `0` - walk forward 
- `1` - walk backward
- `2` - turn left
- `3` - turn right

The state space has `37` dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction.  A reward of `+1` is provided for collecting a yellow banana, and a reward of `-1` is provided for collecting a blue banana.
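
As a quick reference, the action indices and reward rule described above can be written down as plain Python data. This is a minimal sketch for illustration only: `ACTION_NAMES`, `BANANA_REWARDS`, and `episode_score` are names made up here, and the environment itself only ever returns the numeric reward.

```python
# Action index -> meaning, as listed above.
ACTION_NAMES = {0: "walk forward", 1: "walk backward", 2: "turn left", 3: "turn right"}

# Reward per collected banana colour, as described above.
BANANA_REWARDS = {"yellow": +1, "blue": -1}


def episode_score(collected):
    """Sum the rewards for a sequence of collected banana colours."""
    return sum(BANANA_REWARDS[colour] for colour in collected)


print(ACTION_NAMES[2])
print(episode_score(["yellow", "yellow", "blue"]))
```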

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.vector_observations[0]
# print('States look like:', state)
state_size = len(state)
print('States have length:', state_size)
# print(state.shape, len(env_info.vector_observations), env_info.vector_observations.shape)

Number of agents: 1
Number of actions: 4
States have length: 37


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you can watch how the agent performs when it selects an action uniformly at random at each time step.  If you are running a build with visuals, a window should pop up that lets you observe the agent as it moves through the environment; the headless (`NoVis`) builds do not open a window.

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!
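
One common way to move from purely random actions toward learned ones is epsilon-greedy selection: act randomly with probability `epsilon`, otherwise take the action the model currently rates highest. The sketch below assumes a hypothetical list `action_values` holding one score per action; it is not taken from the project instructions.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-scored one."""
    if rng.random() < epsilon:
        # Explore: uniform choice over all action indices.
        return rng.randrange(len(action_values))
    # Exploit: index of the largest score.
    return max(range(len(action_values)), key=lambda a: action_values[a])


scores = [0.1, 0.7, 0.05, 0.15]  # made-up scores for the four actions
print(epsilon_greedy(scores, epsilon=0.0))  # always exploits
print(epsilon_greedy(scores, epsilon=1.0))  # always explores
```

Lowering `epsilon` over time shifts the agent from exploration to exploitation.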

In [5]:
env_info = env.reset(train_mode=False)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
for steps in range(1111111):
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        print(state.shape)
        break
    
print("Score and steps: {} and {}".format(score, steps))

(37,)
Score and steps: 0.0 and 299


When finished, you can close the environment.

In [6]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [7]:
# env_info = env.reset(train_mode=True)[brain_name] # reset the environment
# state = env_info.vector_observations[0]            # get the current state
# score = 0                                          # initialize the score
# while True:
#     action = np.random.randint(action_size)        # select an action
#     env_info = env.step(action)[brain_name]        # send the action to the environment
#     next_state = env_info.vector_observations[0]   # get the next state
#     reward = env_info.rewards[0]                   # get the reward
#     done = env_info.local_done[0]                  # see if episode has finished
#     score += reward                                # update the score
#     state = next_state                             # roll over the state to next time step
#     #print(state)
#     if done:                                       # exit loop if episode finished
#         break
    
# print("Score: {}".format(score))

In [8]:
import tensorflow as tf
print('TensorFlow Version: {}'.format(tf.__version__))
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


In [9]:
# env_info = env.reset(train_mode=True)[brain_name] # reset the environment
# state = env_info.vector_observations[0]            # get the current state
# score = 0                                          # initialize the score
# batch = []
# while True: # infinite number of steps
#     action = np.random.randint(action_size)        # select an action
#     env_info = env.step(action)[brain_name]        # send the action to the environment
#     next_state = env_info.vector_observations[0]   # get the next state
#     reward = env_info.rewards[0]                   # get the reward
#     done = env_info.local_done[0]                  # see if episode has finished
#     score += reward                                # update the score
#     #print(state, action, reward, done)
#     batch.append([action, state, reward, done])
#     state = next_state                             # roll over the state to next time step
#     if done:                                       # exit loop if episode finished
#         break
    
# print("Score: {}".format(score))

In [13]:
def model_input(state_size):
    #states = tf.placeholder(tf.float32, [None, *state_size], name='states')
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    next_states = tf.placeholder(tf.float32, [None, state_size], name='next_states')
    rewards = tf.placeholder(tf.float32, [None], name='rewards')
    dones = tf.placeholder(tf.float32, [None], name='dones')
    rates = tf.placeholder(tf.float32, [None], name='rates') # success rate
    return states, actions, next_states, rewards, dones, rates

In [14]:
def Act(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('Act', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        return logits
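
Both hidden layers in `Act` apply `tf.maximum(alpha * x, x)` after batch normalization. For `0 < alpha < 1` this is a leaky ReLU: positive inputs pass through unchanged and negative inputs are scaled by `alpha`. A plain-Python sketch of the same element-wise rule (the function name `leaky_relu` is chosen here for illustration):

```python
def leaky_relu(values, alpha=0.1):
    """Element-wise max(alpha * x, x), the activation used in the cell above."""
    return [max(alpha * v, v) for v in values]


print(leaky_relu([-2.0, 0.0, 3.0]))
```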

In [16]:
def Env(states, actions, state_size, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('Env', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=action_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        nl1_fused = tf.concat(axis=1, values=[nl1, actions])
        h2 = tf.layers.dense(inputs=nl1_fused, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
                
        # Output layer
        states_logits = tf.layers.dense(inputs=nl2, units=state_size, trainable=False)
        Qlogits = tf.layers.dense(inputs=nl2, units=1, trainable=False)
        return states_logits, Qlogits

In [17]:
def model_loss(state_size, action_size, hidden_size, gamma,
               states, actions, next_states, rewards, dones, rates):
    ################################################ a = act(s)
    actions_logits = Act(states=states, hidden_size=hidden_size, action_size=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    aloss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=actions_logits, 
                                                                      labels=actions_labels))
    ################################################ s', r = env(s, a)
    ################################################ s', Q = env(s, a)
    ################################################ ~s', ~Q = env(s, ~a)
    e_next_states_logits, eQs = Env(actions=actions_labels, states=states, hidden_size=hidden_size, 
                                    action_size=action_size, state_size=state_size)
    a_next_states_logits, aQs = Env(actions=actions_logits, states=states, hidden_size=hidden_size, 
                                    action_size=action_size, state_size=state_size, reuse=True)
    next_states_labels = tf.nn.sigmoid(next_states)
    eloss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=e_next_states_logits, 
                                                                   labels=next_states_labels))
    eloss += -tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=a_next_states_logits, 
                                                                     labels=next_states_labels)) # maximize loss
    aloss2 = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=a_next_states_logits, 
                                                                    labels=next_states_labels)) # minimize loss
    #eQs_logits = tf.reshape(eQs, shape=[-1])
    eQs_logits = tf.nn.tanh(tf.reshape(eQs, shape=[-1]))
    aQs_logits = tf.reshape(aQs, shape=[-1])
    #################################################### s'', Q' = ~env(s', ~a')
    next_actions_logits = Act(states=next_states, hidden_size=hidden_size, action_size=action_size, reuse=True)
    _, aQs2 = Env(actions=next_actions_logits, states=next_states, hidden_size=hidden_size, 
                  action_size=action_size, state_size=state_size, reuse=True)
    aQs2_logits = tf.reshape(aQs2, shape=[-1]) * (1-dones)
    #     eloss += tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=eQs_logits, # GAN
    #                                                                     labels=rates)) # 0-1 real
    eloss += tf.reduce_mean(tf.square(eQs_logits-rates)) # [-1, +1]
    eloss += tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=(aQs_logits+aQs2_logits)/2, # GAN
                                                                    labels=tf.zeros_like(rates))) # min
    aloss2 += tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=(aQs_logits+aQs2_logits)/2, # GAN
                                                                     labels=tf.ones_like(rates))) # max
    #     ###################################################### Q(s,a)= r + Q'(s',a') # max
    #     ###################################################### ~Q(s,~a)= r # min
    #     ###################################################### ~Q(s,~a)= r + Q'(s',a') # max
    #     targetQs = rewards + (gamma * aQs2_logits)
    #     eloss += tf.reduce_mean(tf.square(eQs_logits - targetQs)) # real
    #     eloss += tf.reduce_mean((aQs_logits+aQs2_logits)/2) # min
    #     aloss2 += -tf.reduce_mean((aQs_logits+aQs2_logits)/2) # max
    return actions_logits, aloss, eloss, aloss2
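
The first term of `model_loss`, `aloss`, is a softmax cross-entropy between the `Act` logits and the one-hot encoding of the actions that were taken. With one-hot labels this reduces, per sample, to the negative log-probability of the taken action. The following plain-Python sketch shows that computation for a single sample; `softmax` and `cross_entropy` are illustrative helpers written for this note, not TensorFlow functions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, action):
    """Cross-entropy against a one-hot label: -log p(action)."""
    return -math.log(softmax(logits)[action])


# Four actions with identical logits give a uniform policy.
print(cross_entropy([0.0, 0.0, 0.0, 0.0], action=2))
print(math.log(4))
```

An untrained network with four actions should therefore start with `aloss` near `ln 4`, roughly 1.386, which is a handy baseline when reading the training log.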

In [18]:
def model_opt(a_loss, e_loss, a_loss2, a_learning_rate, e_learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    a_vars = [var for var in t_vars if var.name.startswith('Act')]
    e_vars = [var for var in t_vars if var.name.startswith('Env')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        a_opt = tf.train.AdamOptimizer(a_learning_rate).minimize(a_loss, var_list=a_vars)
        e_opt = tf.train.AdamOptimizer(e_learning_rate).minimize(e_loss, var_list=e_vars)
        a_opt2 = tf.train.AdamOptimizer(a_learning_rate).minimize(a_loss2, var_list=a_vars)
    return a_opt, e_opt, a_opt2

In [19]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, a_learning_rate, e_learning_rate, gamma):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.next_states, self.rewards, self.dones, self.rates = model_input(
            state_size=state_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.a_loss, self.e_loss, self.a_loss2 = model_loss(
            state_size=state_size, action_size=action_size, hidden_size=hidden_size, gamma=gamma, # model init
            states=self.states, actions=self.actions, next_states=self.next_states, 
            rewards=self.rewards, dones=self.dones, rates=self.rates) # model input
        
        # Update the model: backward pass and backprop
        self.a_opt, self.e_opt, self.a_opt2 = model_opt(a_loss=self.a_loss, 
                                                        e_loss=self.e_loss,
                                                        a_loss2=self.a_loss2, 
                                                        a_learning_rate=a_learning_rate,
                                                        e_learning_rate=e_learning_rate)

In [20]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size) # data batch
#     def sample(self, batch_size):
#         idx = np.random.choice(np.arange(len(self.buffer)), size=batch_size, replace=False)
#         return [self.buffer[ii] for ii in idx]
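
The `Memory` class relies on `deque(maxlen=...)`, which silently drops the oldest entry once the buffer is full. The commented-out `sample` method suggests uniform sampling without replacement; a minimal standard-library sketch of that idea follows. The class name `ReplayMemory` is hypothetical, and note that the training cell later in this notebook slices the buffer into chunks instead of sampling this way.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer with uniform sampling (illustrative sketch)."""

    def __init__(self, max_size=1000):
        self.buffer = deque(maxlen=max_size)  # oldest items fall off the left

    def sample(self, batch_size):
        # Copy to a list so random.sample can index it efficiently.
        return random.sample(list(self.buffer), batch_size)


memory_demo = ReplayMemory(max_size=3)
for item in range(5):
    memory_demo.buffer.append(item)

print(list(memory_demo.buffer))  # only the three most recent items remain
print(memory_demo.sample(2))
```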

In [21]:
brain.vector_observation_space_size, brain.vector_observation_space_type, \
brain.vector_action_space_size, brain.vector_action_space_type

(37, 'continuous', 4, 'discrete')

In [23]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01           # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
state_size = 37
action_size = 4
hidden_size = 37*2             # number of units in each hidden layer
a_learning_rate = 1e-4         # learning rate for the Act network
e_learning_rate = 1e-4         # learning rate for the Env network

# Memory parameters
memory_size = int(1e5)            # memory capacity
batch_size = int(5e3)             # experience mini-batch size: 300*10 a successful episode size
gamma = 0.99                   # future reward discount
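
The three exploration parameters define an exponentially decaying exploration probability, using the same formula as the training loop further down. A small standard-library sketch that evaluates the schedule at a few step counts (the function name `explore_probability` is chosen here for illustration):

```python
import math

explore_start = 1.0    # exploration probability at start
explore_stop = 0.01    # minimum exploration probability
decay_rate = 0.0001    # exponential decay rate


def explore_probability(total_step):
    """Exploration probability after `total_step` environment steps."""
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * total_step)


for step in (0, 300, 3000, 30000):
    print(step, round(explore_probability(step), 4))
```

After 300 steps (one episode) the value is about 0.9707, which matches the `exploreP` printed for episode 0 in the training log.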

In [24]:
# Reset/init the graph/session
tf.reset_default_graph()          # returns None, so fetch the graph explicitly
graph = tf.get_default_graph()

# Init the model
model = Model(action_size=action_size, state_size=state_size, hidden_size=hidden_size, gamma=gamma,
              a_learning_rate=a_learning_rate, e_learning_rate=e_learning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

In [25]:
# state = env.reset()
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]   # get the state
total_reward = 0
num_step = 0
for each in range(memory_size):
    # action = env.action_space.sample()
    # next_state, reward, done, _ = env.step(action)
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    rate = -1
    memory.buffer.append([state, action, next_state, reward, float(done), rate])
    num_step += 1 # memory incremented
    total_reward += reward
    state = next_state
    if done:
        rate = np.clip(total_reward/13, a_min=-1, a_max=+1)
        for idx in range(num_step): # episode length
            if memory.buffer[-1-idx][-1] == -1:
                memory.buffer[-1-idx][-1] = rate
        # state = env.reset()
        env_info = env.reset(train_mode=True)[brain_name] # reset the environment
        state = env_info.vector_observations[0]   # get the state
        total_reward = 0 # reset
        print('number of steps per episode:', num_step, 
              '%gone:', each/memory_size, 'rate:', rate)
        num_step = 0 # reset

number of steps per episode: 300 %gone: 0.00299 rate: 0.15384615384615385
number of steps per episode: 300 %gone: 0.00599 rate: 0.0
number of steps per episode: 300 %gone: 0.00899 rate: 0.0
number of steps per episode: 300 %gone: 0.01199 rate: -0.23076923076923078
number of steps per episode: 300 %gone: 0.01499 rate: 0.0
number of steps per episode: 300 %gone: 0.01799 rate: 0.07692307692307693
number of steps per episode: 300 %gone: 0.02099 rate: 0.07692307692307693
number of steps per episode: 300 %gone: 0.02399 rate: 0.0
number of steps per episode: 300 %gone: 0.02699 rate: 0.15384615384615385
number of steps per episode: 300 %gone: 0.02999 rate: 0.15384615384615385
number of steps per episode: 300 %gone: 0.03299 rate: 0.0
number of steps per episode: 300 %gone: 0.03599 rate: 0.0
number of steps per episode: 300 %gone: 0.03899 rate: -0.07692307692307693
number of steps per episode: 300 %gone: 0.04199 rate: -0.07692307692307693
number of steps per episode: 300 %gone: 0.04499 rate: -0.

number of steps per episode: 300 %gone: 0.36899 rate: 0.0
number of steps per episode: 300 %gone: 0.37199 rate: 0.07692307692307693
number of steps per episode: 300 %gone: 0.37499 rate: -0.07692307692307693
number of steps per episode: 300 %gone: 0.37799 rate: 0.07692307692307693
number of steps per episode: 300 %gone: 0.38099 rate: 0.0
number of steps per episode: 300 %gone: 0.38399 rate: 0.0
number of steps per episode: 300 %gone: 0.38699 rate: 0.07692307692307693
number of steps per episode: 300 %gone: 0.38999 rate: 0.0
number of steps per episode: 300 %gone: 0.39299 rate: 0.0
number of steps per episode: 300 %gone: 0.39599 rate: 0.07692307692307693
number of steps per episode: 300 %gone: 0.39899 rate: 0.0
number of steps per episode: 300 %gone: 0.40199 rate: 0.0
number of steps per episode: 300 %gone: 0.40499 rate: 0.0
number of steps per episode: 300 %gone: 0.40799 rate: 0.0
number of steps per episode: 300 %gone: 0.41099 rate: 0.07692307692307693
number of steps per episode: 300 

number of steps per episode: 300 %gone: 0.73799 rate: 0.0
number of steps per episode: 300 %gone: 0.74099 rate: 0.15384615384615385
number of steps per episode: 300 %gone: 0.74399 rate: 0.07692307692307693
number of steps per episode: 300 %gone: 0.74699 rate: 0.0
number of steps per episode: 300 %gone: 0.74999 rate: 0.07692307692307693
number of steps per episode: 300 %gone: 0.75299 rate: -0.07692307692307693
number of steps per episode: 300 %gone: 0.75599 rate: 0.15384615384615385
number of steps per episode: 300 %gone: 0.75899 rate: -0.07692307692307693
number of steps per episode: 300 %gone: 0.76199 rate: -0.07692307692307693
number of steps per episode: 300 %gone: 0.76499 rate: 0.07692307692307693
number of steps per episode: 300 %gone: 0.76799 rate: 0.0
number of steps per episode: 300 %gone: 0.77099 rate: -0.07692307692307693
number of steps per episode: 300 %gone: 0.77399 rate: 0.0
number of steps per episode: 300 %gone: 0.77699 rate: -0.07692307692307693
number of steps per epi
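
The rating step in the cell above can be read as two small operations: scale the episode return by 13 and clip it to `[-1, +1]`, then write that value into every transition of the episode that still carries the placeholder `-1`. A plain-Python sketch under those assumptions (`episode_rate` and `rate_episode` are hypothetical helper names, and transitions are lists whose last element is the rate):

```python
def episode_rate(total_reward, target=13.0):
    """Scale the episode return by the target score and clip to [-1, +1]."""
    return max(-1.0, min(1.0, total_reward / target))


def rate_episode(buffer, num_step, rate, unrated=-1):
    """Back-fill the rate into the last `num_step` transitions still marked unrated."""
    for idx in range(1, num_step + 1):
        if buffer[-idx][-1] == unrated:
            buffer[-idx][-1] = rate


# Toy buffer: one already-rated transition followed by a 3-step episode.
demo_buffer = [["s", 0, "s2", 0.0, 0.0, 0.5]] + [["s", 0, "s2", 0.0, 0.0, -1] for _ in range(3)]
rate_episode(demo_buffer, num_step=3, rate=episode_rate(2.0))
print([transition[-1] for transition in demo_buffer])
```

One caveat of using `-1` as the placeholder: an episode whose return is `-13` or lower would also receive a rate of exactly `-1`, which cannot be told apart from "not yet rated".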

In [None]:
# Save/load the model and save for plotting
saver = tf.train.Saver()
episode_rewards_list, rewards_list = [], []
aloss_list, eloss_list, aloss2_list = [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    total_step = 0 # Explore or exploit parameter
    episode_reward = deque(maxlen=100) # 100 episodes for running average/running mean/window

    # Training episodes/epochs
    for ep in range(11111):
        aloss_batch, eloss_batch, aloss2_batch = [], [], []
        total_reward = 0
        #state = env.reset() # each episode
        env_info = env.reset(train_mode=True)[brain_name] # reset the environment
        state = env_info.vector_observations[0]   # get the current state
        num_step = 0 # each episode
        rate = -1

        # Training steps/batches
        while True:
            # Explore (env) or Exploit (model)
            total_step += 1
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            if explore_p > np.random.rand():
                #action = env.action_space.sample()
                action = np.random.randint(action_size)        # select an action
            else:
                action_logits = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
                action = np.argmax(action_logits)
            #next_state, reward, done, _ = env.step(action)
            env_info = env.step(action)[brain_name]        # send the action to the environment
            next_state = env_info.vector_observations[0]   # get the next state
            reward = env_info.rewards[0]                   # get the reward
            done = env_info.local_done[0]                  # see if episode has finished
            memory.buffer.append([state, action, next_state, reward, float(done), rate])
            num_step += 1 # memory added
            total_reward += reward
            state = next_state
            
            # Training with the maxrated minibatch
            batch = memory.buffer
            #for idx in range(memory_size// batch_size):
            while True:
                idx = np.random.choice(np.arange(memory_size// batch_size))
                states = np.array([each[0] for each in batch])[idx*batch_size:(idx+1)*batch_size]
                actions = np.array([each[1] for each in batch])[idx*batch_size:(idx+1)*batch_size]
                next_states = np.array([each[2] for each in batch])[idx*batch_size:(idx+1)*batch_size]
                rewards = np.array([each[3] for each in batch])[idx*batch_size:(idx+1)*batch_size]
                dones = np.array([each[4] for each in batch])[idx*batch_size:(idx+1)*batch_size]
                rates = np.array([each[5] for each in batch])[idx*batch_size:(idx+1)*batch_size]
                best = rates >= np.max(rates)  # mask of the highest-rated transitions
                states, actions, next_states = states[best], actions[best], next_states[best]
                rewards, dones, rates = rewards[best], dones[best], rates[best]
                if np.count_nonzero(dones) > 0 and len(dones) > 1 and np.max(rates) > 0:
                    break
            # All three updates use the same minibatch
            feed_dict = {model.states: states,
                         model.actions: actions,
                         model.next_states: next_states,
                         model.rewards: rewards,
                         model.dones: dones,
                         model.rates: rates}
            aloss, _ = sess.run([model.a_loss, model.a_opt], feed_dict=feed_dict)
            eloss, _ = sess.run([model.e_loss, model.e_opt], feed_dict=feed_dict)
            aloss2, _ = sess.run([model.a_loss2, model.a_opt2], feed_dict=feed_dict)
            # print(len(dones), np.count_nonzero(dones), np.max(rates))
            aloss_batch.append(aloss)
            eloss_batch.append(eloss)
            aloss2_batch.append(aloss2)
            if done:
                break
                
        # Rating the memory
        #if done is True:
        rate = np.clip(total_reward/13, a_min=-1, a_max=+1) # [-1, +1]
        for idx in range(num_step): # episode length
            if memory.buffer[-1-idx][-1] == -1: # double-check the landmark/marked indexes
                memory.buffer[-1-idx][-1] = rate # rate the trajectory/data
                    
        # Print out
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'rate:{:.4f}'.format(rate),
              'aloss:{:.4f}'.format(np.mean(aloss_batch)),
              'eloss:{:.4f}'.format(np.mean(eloss_batch)),
              'aloss2:{:.4f}'.format(np.mean(aloss2_batch)),
              'exploreP:{:.4f}'.format(explore_p))

        # Plotting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        aloss_list.append([ep, np.mean(aloss_batch)])
        eloss_list.append([ep, np.mean(eloss_batch)])
        aloss2_list.append([ep, np.mean(aloss2_batch)])
        
        # Break episode/epoch loop
        # The task is episodic; to solve the environment, the agent must get
        # an average score of +13 over 100 consecutive episodes.
        if np.mean(episode_reward) >= +13:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:0.0000 R:0.0000 rate:0.0000 aloss:1.4109 eloss:0.7335 aloss2:1.4176 exploreP:0.9707
Episode:1 meanR:0.0000 R:0.0000 rate:0.0000 aloss:1.4017 eloss:0.7055 aloss2:1.4410 exploreP:0.9423
Episode:2 meanR:0.0000 R:0.0000 rate:0.0000 aloss:1.4339 eloss:0.7029 aloss2:1.4461 exploreP:0.9148
Episode:3 meanR:0.5000 R:2.0000 rate:0.1538 aloss:1.4493 eloss:0.7015 aloss2:1.4515 exploreP:0.8881
Episode:4 meanR:0.8000 R:2.0000 rate:0.1538 aloss:1.4587 eloss:0.7007 aloss2:1.4507 exploreP:0.8621
Episode:5 meanR:0.8333 R:1.0000 rate:0.0769 aloss:1.4407 eloss:0.6968 aloss2:1.4539 exploreP:0.8369
Episode:6 meanR:1.0000 R:2.0000 rate:0.1538 aloss:1.4665 eloss:0.6960 aloss2:1.4523 exploreP:0.8125
Episode:7 meanR:0.8750 R:0.0000 rate:0.0000 aloss:1.4591 eloss:0.6941 aloss2:1.4595 exploreP:0.7888
Episode:8 meanR:1.0000 R:2.0000 rate:0.1538 aloss:1.4434 eloss:0.6817 aloss2:1.4672 exploreP:0.7657
Episode:9 meanR:0.8000 R:-1.0000 rate:-0.0769 aloss:1.4402 eloss:0.6812 aloss2:1.4687 exploreP:0.743

Episode:81 meanR:-0.2683 R:1.0000 rate:0.0769 aloss:1.3708 eloss:0.4654 aloss2:2.0620 exploreP:0.0946
Episode:82 meanR:-0.2651 R:0.0000 rate:0.0000 aloss:1.3917 eloss:0.4521 aloss2:2.0992 exploreP:0.0921
Episode:83 meanR:-0.2500 R:1.0000 rate:0.0769 aloss:1.3901 eloss:0.4446 aloss2:2.1339 exploreP:0.0897
Episode:84 meanR:-0.2588 R:-1.0000 rate:-0.0769 aloss:1.4014 eloss:0.4361 aloss2:2.1679 exploreP:0.0873
Episode:85 meanR:-0.2674 R:-1.0000 rate:-0.0769 aloss:1.3773 eloss:0.4343 aloss2:2.1984 exploreP:0.0850
Episode:86 meanR:-0.2759 R:-1.0000 rate:-0.0769 aloss:1.3767 eloss:0.4229 aloss2:2.2389 exploreP:0.0828
Episode:87 meanR:-0.2841 R:-1.0000 rate:-0.0769 aloss:1.3800 eloss:0.4137 aloss2:2.2740 exploreP:0.0806
Episode:88 meanR:-0.2472 R:3.0000 rate:0.2308 aloss:1.3784 eloss:0.4056 aloss2:2.3121 exploreP:0.0786
Episode:89 meanR:-0.2444 R:0.0000 rate:0.0000 aloss:1.3697 eloss:0.3950 aloss2:2.3541 exploreP:0.0765
Episode:90 meanR:-0.2418 R:0.0000 rate:0.0000 aloss:1.3834 eloss:0.3821 al

Episode:161 meanR:0.0600 R:0.0000 rate:0.0000 aloss:1.2419 eloss:-0.1786 aloss2:6.3321 exploreP:0.0177
Episode:162 meanR:0.0800 R:0.0000 rate:0.0000 aloss:1.2726 eloss:-0.1613 aloss2:6.3268 exploreP:0.0174
Episode:163 meanR:0.0800 R:0.0000 rate:0.0000 aloss:1.2597 eloss:-0.1760 aloss2:6.3956 exploreP:0.0172
Episode:164 meanR:0.1200 R:1.0000 rate:0.0769 aloss:1.2106 eloss:-0.2610 aloss2:6.5115 exploreP:0.0170
Episode:165 meanR:0.1200 R:0.0000 rate:0.0000 aloss:1.2212 eloss:-0.2531 aloss2:6.5288 exploreP:0.0168
Episode:166 meanR:0.1200 R:0.0000 rate:0.0000 aloss:1.2562 eloss:-0.2178 aloss2:6.5934 exploreP:0.0166
Episode:167 meanR:0.0900 R:-1.0000 rate:-0.0769 aloss:1.2238 eloss:-0.2098 aloss2:6.5347 exploreP:0.0164
Episode:168 meanR:0.0800 R:0.0000 rate:0.0000 aloss:1.2381 eloss:-0.2200 aloss2:6.5653 exploreP:0.0162
Episode:169 meanR:0.1000 R:1.0000 rate:0.0769 aloss:1.2413 eloss:-0.2259 aloss2:6.6094 exploreP:0.0160
Episode:170 meanR:0.1300 R:1.0000 rate:0.0769 aloss:1.2218 eloss:-0.292

Episode:240 meanR:-0.0100 R:1.0000 rate:0.0769 aloss:1.1260 eloss:-2.9921 aloss2:15.0705 exploreP:0.0107
Episode:241 meanR:-0.0200 R:-1.0000 rate:-0.0769 aloss:1.1177 eloss:-3.1270 aloss2:15.3709 exploreP:0.0107
Episode:242 meanR:-0.0100 R:1.0000 rate:0.0769 aloss:1.1438 eloss:-3.1083 aloss2:15.2617 exploreP:0.0107
Episode:243 meanR:-0.0100 R:0.0000 rate:0.0000 aloss:1.1019 eloss:-3.3342 aloss2:15.7494 exploreP:0.0107
Episode:244 meanR:-0.0100 R:0.0000 rate:0.0000 aloss:1.1005 eloss:-3.4073 aloss2:15.9435 exploreP:0.0106
Episode:245 meanR:-0.0100 R:0.0000 rate:0.0000 aloss:1.1333 eloss:-3.3650 aloss2:15.8622 exploreP:0.0106
Episode:246 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:1.0925 eloss:-3.6351 aloss2:16.5692 exploreP:0.0106
Episode:247 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:1.1336 eloss:-3.8317 aloss2:17.1545 exploreP:0.0106
Episode:248 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:1.0951 eloss:-4.1517 aloss2:17.6483 exploreP:0.0106
Episode:249 meanR:-0.0400 R:0.0000 rate:0.0000 aloss:

Episode:318 meanR:0.0300 R:1.0000 rate:0.0769 aloss:1.3416 eloss:-20.6850 aloss2:53.1775 exploreP:0.0101
Episode:319 meanR:0.0200 R:0.0000 rate:0.0000 aloss:1.3883 eloss:-21.3134 aloss2:54.6964 exploreP:0.0101
Episode:320 meanR:0.0300 R:1.0000 rate:0.0769 aloss:1.2558 eloss:-21.8857 aloss2:56.3979 exploreP:0.0101
Episode:321 meanR:0.0300 R:0.0000 rate:0.0000 aloss:1.4052 eloss:-22.4589 aloss2:56.9502 exploreP:0.0101
Episode:322 meanR:0.0300 R:0.0000 rate:0.0000 aloss:1.3718 eloss:-22.5391 aloss2:57.2692 exploreP:0.0101
Episode:323 meanR:0.0400 R:1.0000 rate:0.0769 aloss:1.3060 eloss:-22.6698 aloss2:57.5016 exploreP:0.0101
Episode:324 meanR:0.0400 R:0.0000 rate:0.0000 aloss:1.3169 eloss:-22.9846 aloss2:57.6650 exploreP:0.0101
Episode:325 meanR:0.0400 R:-1.0000 rate:-0.0769 aloss:1.2502 eloss:-22.2483 aloss2:56.3061 exploreP:0.0101
Episode:326 meanR:0.0500 R:0.0000 rate:0.0000 aloss:1.3605 eloss:-23.3351 aloss2:59.0607 exploreP:0.0101
Episode:327 meanR:0.0500 R:-1.0000 rate:-0.0769 aloss

Episode:396 meanR:0.0000 R:1.0000 rate:0.0769 aloss:1.6268 eloss:-95.9053 aloss2:213.9284 exploreP:0.0100
Episode:397 meanR:-0.0100 R:0.0000 rate:0.0000 aloss:1.7496 eloss:-97.6648 aloss2:220.0827 exploreP:0.0100
Episode:398 meanR:-0.0100 R:0.0000 rate:0.0000 aloss:1.6885 eloss:-98.7085 aloss2:223.5584 exploreP:0.0100
Episode:399 meanR:-0.0400 R:-2.0000 rate:-0.1538 aloss:1.6900 eloss:-102.0887 aloss2:228.2106 exploreP:0.0100
Episode:400 meanR:-0.0300 R:1.0000 rate:0.0769 aloss:1.5548 eloss:-98.8114 aloss2:222.2774 exploreP:0.0100
Episode:401 meanR:-0.0100 R:2.0000 rate:0.1538 aloss:1.6487 eloss:-99.9101 aloss2:223.6512 exploreP:0.0100
Episode:402 meanR:0.0200 R:0.0000 rate:0.0000 aloss:1.4650 eloss:-96.7580 aloss2:217.0466 exploreP:0.0100
Episode:403 meanR:0.0300 R:0.0000 rate:0.0000 aloss:1.5945 eloss:-102.5036 aloss2:233.1802 exploreP:0.0100
Episode:404 meanR:0.0200 R:0.0000 rate:0.0000 aloss:1.6182 eloss:-105.5741 aloss2:235.2527 exploreP:0.0100
Episode:405 meanR:0.0300 R:1.0000 ra

Episode:472 meanR:-0.0400 R:-2.0000 rate:-0.1538 aloss:1.6400 eloss:-223.5514 aloss2:445.2233 exploreP:0.0100
Episode:473 meanR:-0.0300 R:1.0000 rate:0.0769 aloss:1.6561 eloss:-218.2205 aloss2:441.7474 exploreP:0.0100
Episode:474 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:1.6670 eloss:-215.9429 aloss2:439.9060 exploreP:0.0100
Episode:475 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:1.8141 eloss:-219.3239 aloss2:450.2938 exploreP:0.0100
Episode:476 meanR:-0.0500 R:-2.0000 rate:-0.1538 aloss:1.7581 eloss:-214.0738 aloss2:442.5710 exploreP:0.0100
Episode:477 meanR:-0.0500 R:-1.0000 rate:-0.0769 aloss:1.7306 eloss:-214.7964 aloss2:436.8786 exploreP:0.0100
Episode:478 meanR:-0.0400 R:1.0000 rate:0.0769 aloss:1.6563 eloss:-237.4943 aloss2:469.8857 exploreP:0.0100
Episode:479 meanR:-0.0400 R:-1.0000 rate:-0.0769 aloss:1.8867 eloss:-229.0284 aloss2:470.4228 exploreP:0.0100
Episode:480 meanR:-0.0700 R:-2.0000 rate:-0.1538 aloss:1.8463 eloss:-231.7472 aloss2:470.1136 exploreP:0.0100
Episode:481 meanR:

Episode:548 meanR:-0.0500 R:-3.0000 rate:-0.2308 aloss:1.9540 eloss:-379.4906 aloss2:754.5338 exploreP:0.0100
Episode:549 meanR:-0.0400 R:1.0000 rate:0.0769 aloss:2.1339 eloss:-422.5831 aloss2:818.2294 exploreP:0.0100
Episode:550 meanR:-0.0500 R:0.0000 rate:0.0000 aloss:2.5167 eloss:-402.8400 aloss2:838.6260 exploreP:0.0100
Episode:551 meanR:-0.0600 R:-1.0000 rate:-0.0769 aloss:2.9894 eloss:-394.8690 aloss2:874.6498 exploreP:0.0100
Episode:552 meanR:-0.0700 R:-1.0000 rate:-0.0769 aloss:2.5763 eloss:-382.1058 aloss2:822.9249 exploreP:0.0100
Episode:553 meanR:-0.0700 R:0.0000 rate:0.0000 aloss:2.7254 eloss:-375.3995 aloss2:820.1628 exploreP:0.0100
Episode:554 meanR:-0.0700 R:0.0000 rate:0.0000 aloss:2.7286 eloss:-377.9636 aloss2:824.4070 exploreP:0.0100
Episode:555 meanR:-0.0600 R:1.0000 rate:0.0769 aloss:2.8055 eloss:-381.2773 aloss2:829.9075 exploreP:0.0100
Episode:556 meanR:-0.0700 R:0.0000 rate:0.0000 aloss:2.7826 eloss:-403.9128 aloss2:870.3042 exploreP:0.0100
Episode:557 meanR:-0.0

Episode:624 meanR:0.0300 R:0.0000 rate:0.0000 aloss:2.8074 eloss:-584.5901 aloss2:1248.9021 exploreP:0.0100
Episode:625 meanR:0.0300 R:0.0000 rate:0.0000 aloss:2.9110 eloss:-590.1543 aloss2:1266.6179 exploreP:0.0100
Episode:626 meanR:0.0200 R:0.0000 rate:0.0000 aloss:3.1841 eloss:-602.6115 aloss2:1285.4014 exploreP:0.0100
Episode:627 meanR:0.0200 R:0.0000 rate:0.0000 aloss:2.8762 eloss:-597.1653 aloss2:1278.4928 exploreP:0.0100
Episode:628 meanR:0.0200 R:0.0000 rate:0.0000 aloss:3.2233 eloss:-611.7264 aloss2:1311.4775 exploreP:0.0100
Episode:629 meanR:0.0100 R:-2.0000 rate:-0.1538 aloss:3.1839 eloss:-610.3557 aloss2:1305.7517 exploreP:0.0100
Episode:630 meanR:0.0200 R:0.0000 rate:0.0000 aloss:3.1087 eloss:-631.3844 aloss2:1351.2156 exploreP:0.0100
Episode:631 meanR:0.0200 R:0.0000 rate:0.0000 aloss:3.2455 eloss:-629.2338 aloss2:1342.9784 exploreP:0.0100
Episode:632 meanR:0.0200 R:0.0000 rate:0.0000 aloss:3.2583 eloss:-649.3957 aloss2:1385.9958 exploreP:0.0100
Episode:633 meanR:0.0100 R

Episode:700 meanR:0.1200 R:0.0000 rate:0.0000 aloss:5.2793 eloss:-1271.8898 aloss2:2521.1982 exploreP:0.0100
Episode:701 meanR:0.1200 R:0.0000 rate:0.0000 aloss:5.5459 eloss:-1324.1371 aloss2:2643.0818 exploreP:0.0100
Episode:702 meanR:0.1200 R:0.0000 rate:0.0000 aloss:5.6669 eloss:-1361.5835 aloss2:2713.0347 exploreP:0.0100
Episode:703 meanR:0.1200 R:0.0000 rate:0.0000 aloss:5.6825 eloss:-1413.9211 aloss2:2789.8809 exploreP:0.0100
Episode:704 meanR:0.1200 R:0.0000 rate:0.0000 aloss:5.2220 eloss:-1400.5978 aloss2:2733.4993 exploreP:0.0100
Episode:705 meanR:0.1300 R:0.0000 rate:0.0000 aloss:5.2468 eloss:-1381.9030 aloss2:2715.5723 exploreP:0.0100
Episode:706 meanR:0.1300 R:0.0000 rate:0.0000 aloss:5.4532 eloss:-1379.0983 aloss2:2722.0718 exploreP:0.0100
Episode:707 meanR:0.1300 R:0.0000 rate:0.0000 aloss:5.0283 eloss:-1374.4421 aloss2:2689.1772 exploreP:0.0100
Episode:708 meanR:0.1200 R:-1.0000 rate:-0.0769 aloss:5.2717 eloss:-1374.3970 aloss2:2695.3943 exploreP:0.0100
Episode:709 meanR

Episode:775 meanR:0.1300 R:-2.0000 rate:-0.1538 aloss:5.0882 eloss:-1724.3998 aloss2:3577.9441 exploreP:0.0100
Episode:776 meanR:0.1200 R:0.0000 rate:0.0000 aloss:5.6294 eloss:-1861.0110 aloss2:3897.9468 exploreP:0.0100
Episode:777 meanR:0.1200 R:0.0000 rate:0.0000 aloss:5.5776 eloss:-1782.0481 aloss2:3831.9492 exploreP:0.0100
Episode:778 meanR:0.1200 R:0.0000 rate:0.0000 aloss:6.5810 eloss:-1905.5900 aloss2:4118.6338 exploreP:0.0100
Episode:779 meanR:0.1100 R:-1.0000 rate:-0.0769 aloss:5.9455 eloss:-1888.0259 aloss2:4122.6201 exploreP:0.0100
Episode:780 meanR:0.1000 R:0.0000 rate:0.0000 aloss:5.4976 eloss:-1865.8287 aloss2:4130.2261 exploreP:0.0100
Episode:781 meanR:0.1100 R:0.0000 rate:0.0000 aloss:5.8582 eloss:-1853.6409 aloss2:4145.0635 exploreP:0.0100
Episode:782 meanR:0.1100 R:0.0000 rate:0.0000 aloss:6.3356 eloss:-1896.0773 aloss2:4300.2896 exploreP:0.0100
Episode:783 meanR:0.1100 R:0.0000 rate:0.0000 aloss:6.2919 eloss:-1873.2856 aloss2:4320.6177 exploreP:0.0100
Episode:784 mea

Episode:850 meanR:0.0300 R:0.0000 rate:0.0000 aloss:7.5369 eloss:-3426.4111 aloss2:11339.1885 exploreP:0.0100
Episode:851 meanR:0.0100 R:0.0000 rate:0.0000 aloss:7.1921 eloss:-3437.3015 aloss2:11253.0596 exploreP:0.0100
Episode:852 meanR:0.0200 R:0.0000 rate:0.0000 aloss:7.9351 eloss:-3657.9800 aloss2:11894.5801 exploreP:0.0100
Episode:853 meanR:0.0100 R:0.0000 rate:0.0000 aloss:8.1440 eloss:-3793.4150 aloss2:12178.3291 exploreP:0.0100
Episode:854 meanR:-0.0100 R:-1.0000 rate:-0.0769 aloss:8.1957 eloss:-3964.3071 aloss2:12528.7480 exploreP:0.0100
Episode:855 meanR:-0.0100 R:0.0000 rate:0.0000 aloss:8.1943 eloss:-4031.3416 aloss2:12636.8115 exploreP:0.0100
Episode:856 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:8.6021 eloss:-4029.7612 aloss2:12683.0068 exploreP:0.0100
Episode:857 meanR:-0.0300 R:-1.0000 rate:-0.0769 aloss:7.5402 eloss:-4027.7300 aloss2:12620.9033 exploreP:0.0100
Episode:858 meanR:-0.0200 R:1.0000 rate:0.0769 aloss:8.2278 eloss:-4121.2627 aloss2:13137.6621 exploreP:0.0100
E

Episode:924 meanR:-0.1200 R:0.0000 rate:0.0000 aloss:10.3093 eloss:-8214.7529 aloss2:29004.2871 exploreP:0.0100
Episode:925 meanR:-0.1200 R:0.0000 rate:0.0000 aloss:10.0544 eloss:-7996.1182 aloss2:28447.1309 exploreP:0.0100
Episode:926 meanR:-0.1200 R:0.0000 rate:0.0000 aloss:10.2202 eloss:-7880.8936 aloss2:28295.7598 exploreP:0.0100
Episode:927 meanR:-0.1100 R:0.0000 rate:0.0000 aloss:9.9777 eloss:-8117.4219 aloss2:28419.8008 exploreP:0.0100
Episode:928 meanR:-0.1000 R:1.0000 rate:0.0769 aloss:9.8843 eloss:-8488.4297 aloss2:29246.1367 exploreP:0.0100
Episode:929 meanR:-0.1000 R:0.0000 rate:0.0000 aloss:10.4404 eloss:-8081.3311 aloss2:28895.8242 exploreP:0.0100
Episode:930 meanR:-0.0900 R:1.0000 rate:0.0769 aloss:9.7832 eloss:-8604.8633 aloss2:29961.6230 exploreP:0.0100
Episode:931 meanR:-0.1000 R:0.0000 rate:0.0000 aloss:9.6627 eloss:-8432.4834 aloss2:29527.7207 exploreP:0.0100
Episode:932 meanR:-0.1000 R:0.0000 rate:0.0000 aloss:10.9214 eloss:-8654.2090 aloss2:30617.8457 exploreP:0.0

Episode:998 meanR:0.0100 R:-1.0000 rate:-0.0769 aloss:11.7638 eloss:-10645.3564 aloss2:40886.1836 exploreP:0.0100
Episode:999 meanR:0.0300 R:1.0000 rate:0.0769 aloss:11.0072 eloss:-10307.7354 aloss2:40114.5039 exploreP:0.0100
Episode:1000 meanR:0.0300 R:0.0000 rate:0.0000 aloss:11.9483 eloss:-10553.8848 aloss2:40840.9648 exploreP:0.0100
Episode:1001 meanR:0.0300 R:0.0000 rate:0.0000 aloss:11.6951 eloss:-10363.3398 aloss2:40066.5156 exploreP:0.0100
Episode:1002 meanR:0.0500 R:1.0000 rate:0.0769 aloss:11.6363 eloss:-10728.9482 aloss2:40723.5469 exploreP:0.0100
Episode:1003 meanR:0.0500 R:0.0000 rate:0.0000 aloss:11.6979 eloss:-10895.1846 aloss2:41018.8398 exploreP:0.0100
Episode:1004 meanR:0.0600 R:1.0000 rate:0.0769 aloss:10.4472 eloss:-10741.5156 aloss2:40462.8477 exploreP:0.0100
Episode:1005 meanR:0.0500 R:-1.0000 rate:-0.0769 aloss:11.0994 eloss:-10697.4287 aloss2:40330.3594 exploreP:0.0100
Episode:1006 meanR:0.0300 R:-2.0000 rate:-0.1538 aloss:11.0583 eloss:-10496.8633 aloss2:40204.

Episode:1071 meanR:0.0100 R:-1.0000 rate:-0.0769 aloss:14.9433 eloss:-22404.1855 aloss2:69658.9453 exploreP:0.0100
Episode:1072 meanR:0.0100 R:0.0000 rate:0.0000 aloss:15.2995 eloss:-22636.4707 aloss2:70323.1250 exploreP:0.0100
Episode:1073 meanR:0.0100 R:0.0000 rate:0.0000 aloss:15.3771 eloss:-23166.6660 aloss2:71425.6953 exploreP:0.0100
Episode:1074 meanR:0.0000 R:0.0000 rate:0.0000 aloss:15.9860 eloss:-23603.8223 aloss2:72595.9219 exploreP:0.0100
Episode:1075 meanR:-0.0200 R:-2.0000 rate:-0.1538 aloss:15.2969 eloss:-23603.3398 aloss2:73360.2266 exploreP:0.0100
Episode:1076 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:15.0608 eloss:-23959.4883 aloss2:73969.0000 exploreP:0.0100
Episode:1077 meanR:-0.0100 R:1.0000 rate:0.0769 aloss:15.4474 eloss:-24181.9004 aloss2:75507.6562 exploreP:0.0100
Episode:1078 meanR:-0.0200 R:-1.0000 rate:-0.0769 aloss:14.2368 eloss:-24462.0859 aloss2:75789.2109 exploreP:0.0100
Episode:1079 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:14.7368 eloss:-24548.6270 aloss2

Episode:1143 meanR:-0.0100 R:-1.0000 rate:-0.0769 aloss:13.9794 eloss:-27776.8398 aloss2:101025.8125 exploreP:0.0100
Episode:1144 meanR:0.0000 R:2.0000 rate:0.1538 aloss:13.7960 eloss:-28078.0137 aloss2:101726.7969 exploreP:0.0100
Episode:1145 meanR:-0.0100 R:1.0000 rate:0.0769 aloss:14.2851 eloss:-27970.1973 aloss2:102927.5391 exploreP:0.0100
Episode:1146 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:14.0218 eloss:-29286.6543 aloss2:103657.8125 exploreP:0.0100
Episode:1147 meanR:-0.0100 R:1.0000 rate:0.0769 aloss:14.0208 eloss:-29225.6465 aloss2:104256.3516 exploreP:0.0100
Episode:1148 meanR:0.0000 R:0.0000 rate:0.0000 aloss:14.0561 eloss:-29632.9004 aloss2:104786.0391 exploreP:0.0100
Episode:1149 meanR:-0.0400 R:-1.0000 rate:-0.0769 aloss:14.3236 eloss:-29214.1699 aloss2:105871.0391 exploreP:0.0100
Episode:1150 meanR:-0.0400 R:0.0000 rate:0.0000 aloss:13.8389 eloss:-29808.0840 aloss2:105551.0703 exploreP:0.0100
Episode:1151 meanR:-0.0700 R:-3.0000 rate:-0.2308 aloss:14.0389 eloss:-29683.1

Episode:1215 meanR:0.0000 R:0.0000 rate:0.0000 aloss:20.1356 eloss:-44190.9062 aloss2:172630.1406 exploreP:0.0100
Episode:1216 meanR:0.0300 R:3.0000 rate:0.2308 aloss:19.8386 eloss:-44130.2344 aloss2:173625.1719 exploreP:0.0100
Episode:1217 meanR:-0.0100 R:-2.0000 rate:-0.1538 aloss:19.3757 eloss:-43389.0586 aloss2:174349.7812 exploreP:0.0100
Episode:1218 meanR:-0.0400 R:-2.0000 rate:-0.1538 aloss:18.9090 eloss:-42649.1797 aloss2:174469.2656 exploreP:0.0100
Episode:1219 meanR:-0.0600 R:-2.0000 rate:-0.1538 aloss:19.3923 eloss:-41318.6250 aloss2:175976.1562 exploreP:0.0100
Episode:1220 meanR:-0.0600 R:-1.0000 rate:-0.0769 aloss:19.4357 eloss:-43131.1250 aloss2:177923.8125 exploreP:0.0100
Episode:1221 meanR:-0.0200 R:3.0000 rate:0.2308 aloss:18.6109 eloss:-42156.5039 aloss2:177402.3750 exploreP:0.0100
Episode:1222 meanR:-0.0600 R:-2.0000 rate:-0.1538 aloss:18.4777 eloss:-42311.7812 aloss2:177501.9375 exploreP:0.0100
Episode:1223 meanR:-0.0600 R:0.0000 rate:0.0000 aloss:18.5433 eloss:-423

Episode:1286 meanR:-0.1500 R:-2.0000 rate:-0.1538 aloss:19.8398 eloss:-65858.2812 aloss2:244083.5000 exploreP:0.0100
Episode:1287 meanR:-0.1600 R:1.0000 rate:0.0769 aloss:19.2215 eloss:-66165.4531 aloss2:245821.2500 exploreP:0.0100
Episode:1288 meanR:-0.1600 R:-1.0000 rate:-0.0769 aloss:19.2215 eloss:-66694.8516 aloss2:246792.4844 exploreP:0.0100
Episode:1289 meanR:-0.1400 R:2.0000 rate:0.1538 aloss:20.4563 eloss:-65551.1328 aloss2:247265.1250 exploreP:0.0100
Episode:1290 meanR:-0.1400 R:0.0000 rate:0.0000 aloss:20.2055 eloss:-65777.1172 aloss2:248880.6875 exploreP:0.0100
Episode:1291 meanR:-0.1400 R:0.0000 rate:0.0000 aloss:19.7525 eloss:-68548.0156 aloss2:252591.3125 exploreP:0.0100
Episode:1292 meanR:-0.1300 R:1.0000 rate:0.0769 aloss:20.2584 eloss:-66770.7578 aloss2:253140.2656 exploreP:0.0100
Episode:1293 meanR:-0.1000 R:2.0000 rate:0.1538 aloss:19.1153 eloss:-68141.4766 aloss2:254547.8438 exploreP:0.0100
Episode:1294 meanR:-0.1000 R:0.0000 rate:0.0000 aloss:20.8138 eloss:-68541.8

Episode:1357 meanR:-0.0900 R:-1.0000 rate:-0.0769 aloss:24.6115 eloss:-177221.0469 aloss2:434626.6875 exploreP:0.0100
Episode:1358 meanR:-0.1200 R:0.0000 rate:0.0000 aloss:23.9892 eloss:-183958.3125 aloss2:435136.0938 exploreP:0.0100
Episode:1359 meanR:-0.1000 R:1.0000 rate:0.0769 aloss:25.7160 eloss:-188288.4219 aloss2:437988.2812 exploreP:0.0100
Episode:1360 meanR:-0.1100 R:-1.0000 rate:-0.0769 aloss:25.9916 eloss:-206988.5625 aloss2:438517.4688 exploreP:0.0100
Episode:1361 meanR:-0.1000 R:0.0000 rate:0.0000 aloss:26.6737 eloss:-221078.4219 aloss2:447039.8438 exploreP:0.0100
Episode:1362 meanR:-0.1000 R:0.0000 rate:0.0000 aloss:26.6457 eloss:-220361.1406 aloss2:456458.2500 exploreP:0.0100
Episode:1363 meanR:-0.1000 R:1.0000 rate:0.0769 aloss:27.7837 eloss:-220548.6406 aloss2:463362.2500 exploreP:0.0100
Episode:1364 meanR:-0.1000 R:2.0000 rate:0.1538 aloss:27.9779 eloss:-225806.6719 aloss2:467568.3750 exploreP:0.0100
Episode:1365 meanR:-0.0600 R:1.0000 rate:0.0769 aloss:28.5732 eloss:

Episode:1428 meanR:-0.0300 R:1.0000 rate:0.0769 aloss:59.8920 eloss:-712571.8125 aloss2:1050939.2500 exploreP:0.0100
Episode:1429 meanR:-0.0100 R:0.0000 rate:0.0000 aloss:61.0195 eloss:-724974.5000 aloss2:1070396.5000 exploreP:0.0100
Episode:1430 meanR:0.0000 R:1.0000 rate:0.0769 aloss:61.9498 eloss:-743761.2500 aloss2:1101096.8750 exploreP:0.0100
Episode:1431 meanR:0.0000 R:0.0000 rate:0.0000 aloss:62.2858 eloss:-751568.1250 aloss2:1116295.7500 exploreP:0.0100
Episode:1432 meanR:0.0200 R:-1.0000 rate:-0.0769 aloss:62.2592 eloss:-761655.0000 aloss2:1133697.7500 exploreP:0.0100
Episode:1433 meanR:0.0300 R:0.0000 rate:0.0000 aloss:62.9911 eloss:-781103.0000 aloss2:1166671.2500 exploreP:0.0100
Episode:1434 meanR:0.0300 R:0.0000 rate:0.0000 aloss:60.8992 eloss:-791464.7500 aloss2:1183715.0000 exploreP:0.0100
Episode:1435 meanR:0.0600 R:2.0000 rate:0.1538 aloss:59.3927 eloss:-794954.9375 aloss2:1194982.8750 exploreP:0.0100
Episode:1436 meanR:0.0600 R:0.0000 rate:0.0000 aloss:61.2084 eloss:-

Episode:1498 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:92.8530 eloss:-1742967.3750 aloss2:3125059.0000 exploreP:0.0100
Episode:1499 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:86.7039 eloss:-1751511.0000 aloss2:3134481.5000 exploreP:0.0100
Episode:1500 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:90.0544 eloss:-1758995.6250 aloss2:3150250.7500 exploreP:0.0100
Episode:1501 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:97.3778 eloss:-1774411.7500 aloss2:3182176.2500 exploreP:0.0100
Episode:1502 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:103.7010 eloss:-1800046.5000 aloss2:3234884.7500 exploreP:0.0100
Episode:1503 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:101.7547 eloss:-1809490.5000 aloss2:3248029.7500 exploreP:0.0100
Episode:1504 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:102.2031 eloss:-1848860.1250 aloss2:3323942.5000 exploreP:0.0100
Episode:1505 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:102.8867 eloss:-1843556.2500 aloss2:3320392.5000 exploreP:0.0100
Episode:1506 meanR:-0.0500 R:0.0000 rate:0.0000 alos

Episode:1567 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:213.3169 eloss:-3183568.7500 aloss2:5665313.5000 exploreP:0.0100
Episode:1568 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:235.5832 eloss:-3172822.2500 aloss2:5643196.5000 exploreP:0.0100
Episode:1569 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:231.0702 eloss:-3220061.0000 aloss2:5714774.0000 exploreP:0.0100
Episode:1570 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:214.6927 eloss:-3233481.7500 aloss2:5732378.5000 exploreP:0.0100
Episode:1571 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:218.2508 eloss:-3242200.7500 aloss2:5748920.5000 exploreP:0.0100
Episode:1572 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:226.1159 eloss:-3291210.0000 aloss2:5825394.5000 exploreP:0.0100
Episode:1573 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:247.2943 eloss:-3309052.5000 aloss2:5863652.0000 exploreP:0.0100
Episode:1574 meanR:-0.0300 R:-1.0000 rate:-0.0769 aloss:258.4340 eloss:-3337507.7500 aloss2:5913917.5000 exploreP:0.0100
Episode:1575 meanR:-0.0500 R:-2.0000 rate:-0.1

Episode:1637 meanR:0.0300 R:0.0000 rate:0.0000 aloss:940.9977 eloss:-6570099.0000 aloss2:7801530.0000 exploreP:0.0100
Episode:1638 meanR:0.0300 R:0.0000 rate:0.0000 aloss:934.4044 eloss:-6526835.0000 aloss2:7775145.0000 exploreP:0.0100
Episode:1639 meanR:0.0300 R:0.0000 rate:0.0000 aloss:1047.0306 eloss:-6506348.5000 aloss2:7844854.0000 exploreP:0.0100
Episode:1640 meanR:0.0300 R:0.0000 rate:0.0000 aloss:693.5726 eloss:-6454056.0000 aloss2:7667342.5000 exploreP:0.0100
Episode:1641 meanR:0.0300 R:0.0000 rate:0.0000 aloss:848.2794 eloss:-6551873.5000 aloss2:7799809.5000 exploreP:0.0100
Episode:1642 meanR:0.0300 R:0.0000 rate:0.0000 aloss:904.8597 eloss:-6603218.0000 aloss2:7947935.5000 exploreP:0.0100
Episode:1643 meanR:0.0300 R:0.0000 rate:0.0000 aloss:906.1624 eloss:-6651642.0000 aloss2:7952626.5000 exploreP:0.0100
Episode:1644 meanR:0.0300 R:0.0000 rate:0.0000 aloss:915.6672 eloss:-6636148.0000 aloss2:7929583.0000 exploreP:0.0100
Episode:1645 meanR:0.0300 R:0.0000 rate:0.0000 aloss:86

Episode:1707 meanR:0.0500 R:0.0000 rate:0.0000 aloss:642.0901 eloss:-8784013.0000 aloss2:10230328.0000 exploreP:0.0100
Episode:1708 meanR:0.0400 R:0.0000 rate:0.0000 aloss:535.5349 eloss:-8914210.0000 aloss2:10368469.0000 exploreP:0.0100
Episode:1709 meanR:0.0400 R:0.0000 rate:0.0000 aloss:658.6013 eloss:-8999183.0000 aloss2:10501205.0000 exploreP:0.0100
Episode:1710 meanR:0.0400 R:0.0000 rate:0.0000 aloss:666.3801 eloss:-9037220.0000 aloss2:10552504.0000 exploreP:0.0100
Episode:1711 meanR:0.0400 R:0.0000 rate:0.0000 aloss:685.7247 eloss:-8994922.0000 aloss2:10529505.0000 exploreP:0.0100
Episode:1712 meanR:0.0200 R:0.0000 rate:0.0000 aloss:705.1938 eloss:-9164739.0000 aloss2:10703710.0000 exploreP:0.0100
Episode:1713 meanR:0.0100 R:-1.0000 rate:-0.0769 aloss:666.2019 eloss:-9127363.0000 aloss2:10689331.0000 exploreP:0.0100
Episode:1714 meanR:0.0000 R:-1.0000 rate:-0.0769 aloss:660.4756 eloss:-9096887.0000 aloss2:10637786.0000 exploreP:0.0100
Episode:1715 meanR:0.0000 R:0.0000 rate:0.00

Episode:1776 meanR:0.0200 R:0.0000 rate:0.0000 aloss:1604.8859 eloss:-12837530.0000 aloss2:14800942.0000 exploreP:0.0100
Episode:1777 meanR:0.0200 R:0.0000 rate:0.0000 aloss:1739.3292 eloss:-12893883.0000 aloss2:14888694.0000 exploreP:0.0100
Episode:1778 meanR:0.0200 R:0.0000 rate:0.0000 aloss:1680.4042 eloss:-12898846.0000 aloss2:14911577.0000 exploreP:0.0100
Episode:1779 meanR:0.0200 R:0.0000 rate:0.0000 aloss:1866.9269 eloss:-12969797.0000 aloss2:14980915.0000 exploreP:0.0100
Episode:1780 meanR:0.0200 R:0.0000 rate:0.0000 aloss:1700.3909 eloss:-12984083.0000 aloss2:14921759.0000 exploreP:0.0100
Episode:1781 meanR:0.0100 R:0.0000 rate:0.0000 aloss:1917.8765 eloss:-13097712.0000 aloss2:15025292.0000 exploreP:0.0100
Episode:1782 meanR:0.0100 R:0.0000 rate:0.0000 aloss:2105.9543 eloss:-13211659.0000 aloss2:15144361.0000 exploreP:0.0100
Episode:1783 meanR:0.0200 R:1.0000 rate:0.0769 aloss:2177.4314 eloss:-13349860.0000 aloss2:15378493.0000 exploreP:0.0100
Episode:1784 meanR:0.0300 R:0.00

Episode:1844 meanR:0.1000 R:0.0000 rate:0.0000 aloss:2232.4399 eloss:-18526488.0000 aloss2:21205886.0000 exploreP:0.0100
Episode:1845 meanR:0.1000 R:0.0000 rate:0.0000 aloss:1687.9556 eloss:-18741838.0000 aloss2:21372166.0000 exploreP:0.0100
Episode:1846 meanR:0.0900 R:0.0000 rate:0.0000 aloss:2227.8796 eloss:-18972178.0000 aloss2:21702958.0000 exploreP:0.0100
Episode:1847 meanR:0.0900 R:0.0000 rate:0.0000 aloss:2166.0200 eloss:-18682248.0000 aloss2:21407820.0000 exploreP:0.0100
Episode:1848 meanR:0.1000 R:1.0000 rate:0.0769 aloss:2152.8230 eloss:-18858936.0000 aloss2:21569632.0000 exploreP:0.0100
Episode:1849 meanR:0.1000 R:0.0000 rate:0.0000 aloss:2398.1917 eloss:-18925334.0000 aloss2:21726330.0000 exploreP:0.0100
Episode:1850 meanR:0.1000 R:0.0000 rate:0.0000 aloss:2318.0022 eloss:-18963012.0000 aloss2:21816006.0000 exploreP:0.0100
Episode:1851 meanR:0.0900 R:-1.0000 rate:-0.0769 aloss:2395.4431 eloss:-19080950.0000 aloss2:21948724.0000 exploreP:0.0100
Episode:1852 meanR:0.0900 R:0.

Episode:1912 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:1513.4435 eloss:-25365968.0000 aloss2:28740144.0000 exploreP:0.0100
Episode:1913 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:1171.8242 eloss:-25359796.0000 aloss2:28708208.0000 exploreP:0.0100
Episode:1914 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:1584.4297 eloss:-25291896.0000 aloss2:28643328.0000 exploreP:0.0100
Episode:1915 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:1307.0894 eloss:-25398220.0000 aloss2:28744458.0000 exploreP:0.0100
Episode:1916 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:987.2780 eloss:-25735162.0000 aloss2:29098606.0000 exploreP:0.0100
Episode:1917 meanR:-0.0400 R:-1.0000 rate:-0.0769 aloss:1408.8977 eloss:-25753784.0000 aloss2:29185062.0000 exploreP:0.0100
Episode:1918 meanR:-0.0500 R:-1.0000 rate:-0.0769 aloss:1627.6252 eloss:-25838814.0000 aloss2:29274280.0000 exploreP:0.0100
Episode:1919 meanR:-0.0500 R:0.0000 rate:0.0000 aloss:1399.8412 eloss:-25940238.0000 aloss2:29386138.0000 exploreP:0.0100
Episode:1920 meanR:-0

Episode:1980 meanR:0.0200 R:0.0000 rate:0.0000 aloss:614.9421 eloss:-33852828.0000 aloss2:38418744.0000 exploreP:0.0100
Episode:1981 meanR:0.0100 R:0.0000 rate:0.0000 aloss:355.3630 eloss:-34127012.0000 aloss2:38690468.0000 exploreP:0.0100
Episode:1982 meanR:0.0000 R:0.0000 rate:0.0000 aloss:424.1814 eloss:-34233372.0000 aloss2:38856552.0000 exploreP:0.0100
Episode:1983 meanR:0.0000 R:0.0000 rate:0.0000 aloss:510.1333 eloss:-34444312.0000 aloss2:39071380.0000 exploreP:0.0100
Episode:1984 meanR:-0.0100 R:-1.0000 rate:-0.0769 aloss:1103.3362 eloss:-34616164.0000 aloss2:39312960.0000 exploreP:0.0100
Episode:1985 meanR:0.0000 R:0.0000 rate:0.0000 aloss:1054.7383 eloss:-34736908.0000 aloss2:39400960.0000 exploreP:0.0100
Episode:1986 meanR:0.0000 R:0.0000 rate:0.0000 aloss:1007.3688 eloss:-34789792.0000 aloss2:39520920.0000 exploreP:0.0100
Episode:1987 meanR:0.0000 R:0.0000 rate:0.0000 aloss:1553.6620 eloss:-34871348.0000 aloss2:39596960.0000 exploreP:0.0100
Episode:1988 meanR:0.0000 R:0.000

Episode:2048 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:193.4157 eloss:-42469840.0000 aloss2:48577440.0000 exploreP:0.0100
Episode:2049 meanR:-0.0200 R:1.0000 rate:0.0769 aloss:212.2513 eloss:-42638132.0000 aloss2:48796888.0000 exploreP:0.0100
Episode:2050 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:220.4654 eloss:-42840632.0000 aloss2:49042560.0000 exploreP:0.0100
Episode:2051 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:217.5923 eloss:-43053724.0000 aloss2:49254508.0000 exploreP:0.0100
Episode:2052 meanR:-0.0300 R:-1.0000 rate:-0.0769 aloss:220.6485 eloss:-43213036.0000 aloss2:49468856.0000 exploreP:0.0100
Episode:2053 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:227.2363 eloss:-43357888.0000 aloss2:49634760.0000 exploreP:0.0100
Episode:2054 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:223.8311 eloss:-43142248.0000 aloss2:49448112.0000 exploreP:0.0100
Episode:2055 meanR:-0.0400 R:0.0000 rate:0.0000 aloss:230.2941 eloss:-43532936.0000 aloss2:49844156.0000 exploreP:0.0100
Episode:2056 meanR:-0.0400 R:0

Episode:2116 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:254.7516 eloss:-54135212.0000 aloss2:61556004.0000 exploreP:0.0100
Episode:2117 meanR:-0.0400 R:-1.0000 rate:-0.0769 aloss:244.7124 eloss:-54388700.0000 aloss2:61830908.0000 exploreP:0.0100
Episode:2118 meanR:-0.0400 R:0.0000 rate:0.0000 aloss:243.0913 eloss:-54671204.0000 aloss2:62139168.0000 exploreP:0.0100
Episode:2119 meanR:-0.0400 R:0.0000 rate:0.0000 aloss:251.0349 eloss:-54566016.0000 aloss2:62052480.0000 exploreP:0.0100
Episode:2120 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:266.1176 eloss:-54990324.0000 aloss2:62532956.0000 exploreP:0.0100
Episode:2121 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:267.4751 eloss:-55199464.0000 aloss2:62720752.0000 exploreP:0.0100
Episode:2122 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:258.0045 eloss:-55406472.0000 aloss2:62946892.0000 exploreP:0.0100
Episode:2123 meanR:-0.0400 R:-1.0000 rate:-0.0769 aloss:269.3906 eloss:-55373460.0000 aloss2:62932500.0000 exploreP:0.0100
Episode:2124 meanR:-0.0400 R

Episode:2184 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:347.9734 eloss:-67125224.0000 aloss2:75895160.0000 exploreP:0.0100
Episode:2185 meanR:-0.0100 R:0.0000 rate:0.0000 aloss:382.7847 eloss:-67241544.0000 aloss2:76060000.0000 exploreP:0.0100
Episode:2186 meanR:-0.0100 R:0.0000 rate:0.0000 aloss:396.7380 eloss:-67306688.0000 aloss2:76148776.0000 exploreP:0.0100
Episode:2187 meanR:-0.0200 R:-1.0000 rate:-0.0769 aloss:423.7170 eloss:-67296968.0000 aloss2:76185160.0000 exploreP:0.0100
Episode:2188 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:369.9958 eloss:-67638488.0000 aloss2:76536616.0000 exploreP:0.0100
Episode:2189 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:440.7395 eloss:-67893920.0000 aloss2:76884624.0000 exploreP:0.0100
Episode:2190 meanR:-0.0100 R:0.0000 rate:0.0000 aloss:323.4175 eloss:-68045680.0000 aloss2:76997984.0000 exploreP:0.0100
Episode:2191 meanR:-0.0100 R:0.0000 rate:0.0000 aloss:331.9424 eloss:-68405672.0000 aloss2:77386480.0000 exploreP:0.0100
Episode:2192 meanR:-0.0200 R:-

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

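def running_mean(x, N):
    # Moving average over a window of N values, computed from cumulative sums.
    # The result has len(x) - N + 1 entries, one per full window.
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N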
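
As a quick sanity check of the cumulative-sum trick in `running_mean`, here is a minimal sketch on a toy array. The function is repeated so the snippet runs on its own, and the input values are made up for illustration (they are not rewards from this notebook). It also shows why the plotting cells below slice the episode axis with `eps[-len(smoothed_arr):]`: the smoothed array is `N - 1` entries shorter than the input.

```python
import numpy as np

def running_mean(x, N):
    # Prepend a zero so that cumsum[i] is the sum of the first i values of x.
    cumsum = np.cumsum(np.insert(x, 0, 0))
    # The difference of two cumulative sums N apart is the sum of one window.
    return (cumsum[N:] - cumsum[:-N]) / N

# Toy input, chosen only to make the arithmetic easy to follow.
rewards = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
smoothed = running_mean(rewards, 3)

print(smoothed)                      # [2. 3. 4.]
print(len(rewards) - len(smoothed))  # 2, i.e. N - 1 entries are lost
```

Each output value is the mean of one full window: (1+2+3)/3, (2+3+4)/3, and (3+4+5)/3.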

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Episode rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(aloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('A losses')

In [None]:
eps, arr = np.array(dloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses')

In [None]:
eps, arr = np.array(aloss2_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('A losses 2')

In [37]:
# TF session for testing the trained model
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Testing episodes/epochs
    for _ in range(11):
        total_reward = 0
        #state = env.reset()
        env_info = env.reset(train_mode=False)[brain_name] # reset the environment
        state = env_info.vector_observations[0]   # get the current state

        # Testing steps/batches
        while True:
            action_logits = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
            action = np.argmax(action_logits)
            #state, reward, done, _ = env.step(action)
            env_info = env.step(action)[brain_name]        # send the action to the environment
            state = env_info.vector_observations[0]   # get the next state
            reward = env_info.rewards[0]                   # get the reward
            done = env_info.local_done[0]                  # see if episode has finished
            total_reward += reward
            if done:
                break
                
        print('total_reward: {:.2f}'.format(total_reward))

INFO:tensorflow:Restoring parameters from checkpoints/model-nav.ckpt


total_reward: 14.00
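
The test loop above prints one `total_reward` per episode. If you want a single summary across test episodes, one option is to collect the per-episode totals in a list and reduce them afterwards. The sketch below assumes such a list exists; `summarize_rewards` and `example_rewards` are hypothetical names, and the numbers are illustrative placeholders rather than results from this notebook.

```python
import numpy as np

def summarize_rewards(total_rewards):
    # Reduce a list of per-episode total rewards to a few summary statistics.
    rewards = np.asarray(total_rewards, dtype=float)
    return {
        'episodes': int(rewards.size),
        'mean': float(rewards.mean()),
        'std': float(rewards.std()),   # population standard deviation
        'min': float(rewards.min()),
        'max': float(rewards.max()),
    }

# Placeholder values for illustration only (not measured results).
example_rewards = [14.0, 12.0, 16.0]
summary = summarize_rewards(example_rewards)
print('episodes: {episodes}  mean: {mean:.2f}  std: {std:.2f}  '
      'min: {min:.2f}  max: {max:.2f}'.format(**summary))
```

In the test cell, this would amount to appending `total_reward` to a list at the end of each episode and calling the helper once after the loop.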


In [None]:
# Be careful: closing the environment shuts down the Unity process,
# so uncomment and run this only when you are finished with the notebook.
# env.close()