# Navigation

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the first project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).

### 1. Start the Environment

We begin by importing some necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Banana.app"`
- **Windows** (x86): `"path/to/Banana_Windows_x86/Banana.exe"`
- **Windows** (x86_64): `"path/to/Banana_Windows_x86_64/Banana.exe"`
- **Linux** (x86): `"path/to/Banana_Linux/Banana.x86"`
- **Linux** (x86_64): `"path/to/Banana_Linux/Banana.x86_64"`
- **Linux** (x86, headless): `"path/to/Banana_Linux_NoVis/Banana.x86"`
- **Linux** (x86_64, headless): `"path/to/Banana_Linux_NoVis/Banana.x86_64"`

For instance, if you are using a Mac, then you downloaded `Banana.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Banana.app")
```
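
If you switch between machines, the list above can be turned into a small lookup. This is only a sketch: `BANANA_BUILDS` and `banana_file_name` are hypothetical names that are not part of `unityagents`, and the returned paths assume the downloaded folder sits next to the notebook.

```python
import platform

# Hypothetical lookup table built from the list above. Keys are
# (platform.system(), is 64-bit, is headless); values are paths relative
# to the folder that holds the downloaded environment.
BANANA_BUILDS = {
    ("Darwin", True, False): "Banana.app",
    ("Windows", False, False): "Banana_Windows_x86/Banana.exe",
    ("Windows", True, False): "Banana_Windows_x86_64/Banana.exe",
    ("Linux", False, False): "Banana_Linux/Banana.x86",
    ("Linux", True, False): "Banana_Linux/Banana.x86_64",
    ("Linux", False, True): "Banana_Linux_NoVis/Banana.x86",
    ("Linux", True, True): "Banana_Linux_NoVis/Banana.x86_64",
}


def banana_file_name(system=None, is_64bit=True, headless=False):
    """Return the relative path of the Banana build for one platform."""
    system = system or platform.system()
    key = (system, is_64bit, headless)
    if key not in BANANA_BUILDS:
        raise ValueError("No Banana build listed for {}".format(key))
    return BANANA_BUILDS[key]


print(banana_file_name("Linux", is_64bit=True, headless=True))
```

The result can then be passed as the `file_name` argument of `UnityEnvironment`.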

In [2]:
# env = UnityEnvironment(file_name="/home/arasdar/VisualBanana_Linux/Banana.x86")
# env = UnityEnvironment(file_name="/home/arasdar/unity-envs/Banana_Linux/Banana.x86_64")
env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Banana_Linux_NoVis/Banana.x86_64')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

The simulation contains a single agent that navigates a large environment.  At each time step, it has four actions at its disposal:
- `0` - walk forward 
- `1` - walk backward
- `2` - turn left
- `3` - turn right

The state space has `37` dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction.  A reward of `+1` is provided for collecting a yellow banana, and a reward of `-1` is provided for collecting a blue banana.

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.vector_observations[0]
# print('States look like:', state)
state_size = len(state)
print('States have length:', state_size)
# print(state.shape, len(env_info.vector_observations), env_info.vector_observations.shape)

Number of agents: 1
Number of actions: 4
States have length: 37


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you can watch how the agent performs when it selects an action uniformly at random at each time step.  A window should pop up that lets you observe the agent as it moves through the environment.  (With a headless `NoVis` build, no window is shown.)

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [5]:
# env_info = env.reset(train_mode=False)[brain_name] # reset the environment
# state = env_info.vector_observations[0]            # get the current state
# score = 0                                          # initialize the score
# for steps in range(1111111):
#     action = np.random.randint(action_size)        # select an action
#     env_info = env.step(action)[brain_name]        # send the action to the environment
#     next_state = env_info.vector_observations[0]   # get the next state
#     reward = env_info.rewards[0]                   # get the reward
#     done = env_info.local_done[0]                  # see if episode has finished
#     score += reward                                # update the score
#     state = next_state                             # roll over the state to next time step
#     if done:                                       # exit loop if episode finished
#         print(state.shape)
#         break
    
# print("Score and steps: {} and {}".format(score, steps))

When finished, you can close the environment.

In [6]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [7]:
# env_info = env.reset(train_mode=True)[brain_name] # reset the environment
# state = env_info.vector_observations[0]            # get the current state
# score = 0                                          # initialize the score
# while True:
#     action = np.random.randint(action_size)        # select an action
#     env_info = env.step(action)[brain_name]        # send the action to the environment
#     next_state = env_info.vector_observations[0]   # get the next state
#     reward = env_info.rewards[0]                   # get the reward
#     done = env_info.local_done[0]                  # see if episode has finished
#     score += reward                                # update the score
#     state = next_state                             # roll over the state to next time step
#     #print(state)
#     if done:                                       # exit loop if episode finished
#         break
    
# print("Score: {}".format(score))

In [8]:
import tensorflow as tf
print('TensorFlow Version: {}'.format(tf.__version__))
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


In [9]:
# env_info = env.reset(train_mode=True)[brain_name] # reset the environment
# state = env_info.vector_observations[0]            # get the current state
# score = 0                                          # initialize the score
# batch = []
# while True: # infinite number of steps
#     action = np.random.randint(action_size)        # select an action
#     env_info = env.step(action)[brain_name]        # send the action to the environment
#     next_state = env_info.vector_observations[0]   # get the next state
#     reward = env_info.rewards[0]                   # get the reward
#     done = env_info.local_done[0]                  # see if episode has finished
#     score += reward                                # update the score
#     #print(state, action, reward, done)
#     batch.append([action, state, reward, done])
#     state = next_state                             # roll over the state to next time step
#     if done:                                       # exit loop if episode finished
#         break
    
# print("Score: {}".format(score))

In [15]:
def model_input(state_size):
    #states = tf.placeholder(tf.float32, [None, *state_size], name='states')
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    next_states = tf.placeholder(tf.float32, [None, state_size], name='next_states')
    # rewards = tf.placeholder(tf.float32, [None], name='rewards')
    dones = tf.placeholder(tf.float32, [None], name='dones')
    rates = tf.placeholder(tf.float32, [None], name='rates') # success rate
    return states, actions, next_states, dones, rates

In [16]:
def actor(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('actor', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        return logits
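
The expression `tf.maximum(alpha * bn, bn)` used in the networks acts as a leaky ReLU when `0 <= alpha <= 1`: positive inputs pass through unchanged and negative inputs are scaled by `alpha`. A minimal NumPy sketch of the same element-wise operation (the helper name `leaky_relu` is an assumption for illustration):

```python
import numpy as np


def leaky_relu(x, alpha=0.1):
    """Element-wise max(alpha * x, x); a leaky ReLU for 0 <= alpha <= 1."""
    return np.maximum(alpha * x, x)


x = np.array([-2.0, -0.5, 0.0, 1.5])
y = leaky_relu(x)
print(y)  # negative entries are scaled by alpha, the rest are unchanged
```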

In [17]:
def generator(states, actions, state_size, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=action_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        nl1_fused = tf.concat(axis=1, values=[nl1, actions])
        h2 = tf.layers.dense(inputs=nl1_fused, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
                
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=state_size)        
        return logits

In [18]:
def discriminator(states, actions, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('discriminator', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=action_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        nl1_fused = tf.concat(axis=1, values=[nl1, actions])
        h2 = tf.layers.dense(inputs=nl1_fused, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
                
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)        
        return logits

In [19]:
def model_loss(state_size, action_size, hidden_size,
               states, actions, next_states, dones, rates):
    actions_logits = actor(states=states, hidden_size=hidden_size, action_size=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    aloss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=actions_logits, 
                                                                      labels=actions_labels))
    ###############################################
    next_states_logits = generator(actions=actions_logits, states=states, hidden_size=hidden_size, 
                                   action_size=action_size, state_size=state_size)
    next_states_labels = tf.nn.sigmoid(next_states)
    aloss += tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=next_states_logits, 
                                                                    labels=next_states_labels))
    ####################################################
    dQs = discriminator(actions=actions_labels, hidden_size=hidden_size, states=states, action_size=action_size)
    #rates = tf.reshape(rates, shape=[-1, 1])
    # dloss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=dQs, # GAN
    #                                                                labels=rates)) # 0-1
    dQs = tf.nn.tanh(tf.reshape(dQs, shape=[-1]))
    dloss = tf.reduce_mean(tf.square(dQs-rates)) # [-1, +1]
    ####################################################
    gQs = discriminator(actions=actions_logits, hidden_size=hidden_size, states=states, action_size=action_size, 
                        reuse=True)
    dloss += tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=gQs, # GAN
                                                                    labels=tf.zeros_like(gQs))) # 0-1
    aloss2 = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=gQs, # GAN
                                                                    labels=tf.ones_like(gQs))) # 0-1
    #####################################################
    next_actions_logits = actor(states=next_states, hidden_size=hidden_size, action_size=action_size, reuse=True)
    gQs2 = discriminator(actions=next_actions_logits, hidden_size=hidden_size, states=next_states, 
                         action_size=action_size, reuse=True)
    gQs2 = tf.reshape(gQs2, shape=[-1]) * (1-dones)
    dloss += tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=gQs2, # GAN
                                                                    labels=tf.zeros_like(gQs2))) # 0-1
    aloss2 += tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=gQs2, # GAN
                                                                     labels=tf.ones_like(gQs2))) # 0-1
    return actions_logits, aloss, dloss, aloss2
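
In `model_loss`, the same discriminator logits `gQs` are scored twice: against all-zero labels for `dloss` and against all-one labels for `aloss2`. The sketch below reproduces that pairing in NumPy, assuming the numerically stable form `max(x, 0) - x * z + log(1 + exp(-|x|))` for sigmoid cross-entropy with logits. The function here is a stand-in for illustration, not the TensorFlow op itself.

```python
import numpy as np


def sigmoid_cross_entropy_with_logits(logits, labels):
    """Stable sigmoid cross-entropy: max(x, 0) - x * z + log(1 + exp(-|x|))."""
    x = np.asarray(logits, dtype=np.float64)
    z = np.asarray(labels, dtype=np.float64)
    return np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))


logits = np.array([-2.0, 0.0, 3.0])  # illustrative discriminator outputs

# Discriminator side: push these logits toward label 0.
d_term = sigmoid_cross_entropy_with_logits(logits, np.zeros_like(logits)).mean()
# Actor side: push the same logits toward label 1.
a_term = sigmoid_cross_entropy_with_logits(logits, np.ones_like(logits)).mean()

print(d_term, a_term)
```

Because the two terms pull the same logits in opposite directions, one of them can grow while the other shrinks.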

In [20]:
def model_opt(a_loss, d_loss, a_loss2, a_learning_rate, d_learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    a_vars = [var for var in t_vars if var.name.startswith('actor')]
    #g_vars = [var for var in t_vars if var.name.startswith('generator')]
    d_vars = [var for var in t_vars if var.name.startswith('discriminator')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        a_opt = tf.train.AdamOptimizer(a_learning_rate).minimize(a_loss, var_list=a_vars)
        d_opt = tf.train.AdamOptimizer(d_learning_rate).minimize(d_loss, var_list=d_vars)
        a_opt2 = tf.train.AdamOptimizer(a_learning_rate).minimize(a_loss2, var_list=a_vars)
    return a_opt, d_opt, a_opt2

In [21]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, a_learning_rate, d_learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.next_states, self.dones, self.rates = model_input(state_size=state_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.a_loss, self.d_loss, self.a_loss2 = model_loss(
            state_size=state_size, action_size=action_size, hidden_size=hidden_size, # model init
            states=self.states, actions=self.actions, next_states=self.next_states, 
            dones=self.dones, rates=self.rates) # model input
        
        # Update the model: backward pass and backprop
        self.a_opt, self.d_opt, self.a_opt2 = model_opt(a_loss=self.a_loss, 
                                                        d_loss=self.d_loss,
                                                        a_loss2=self.a_loss2, 
                                                        a_learning_rate=a_learning_rate,
                                                        d_learning_rate=d_learning_rate)

In [23]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size) # data batch
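
The `Memory` class relies on two properties of `collections.deque`: a bounded deque silently drops its oldest entries once `maxlen` is reached, and entries can still be indexed from the end and modified in place. A small sketch with made-up entries:

```python
from collections import deque

buffer = deque(maxlen=3)  # tiny capacity, for illustration only

# Append five placeholder transitions; only the newest three survive.
for step in range(5):
    buffer.append([step, "unrated"])

kept_steps = [item[0] for item in buffer]
print(kept_steps)

# Negative indexing reaches the most recent entry, which is a mutable list.
buffer[-1][-1] = "rated"
print(buffer[-1])
```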

In [35]:
brain.vector_observation_space_size, brain.vector_observation_space_type, \
brain.vector_action_space_size, brain.vector_action_space_type

(37, 'continuous', 4, 'discrete')

In [36]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01           # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
state_size = 37
action_size = 4
hidden_size = 37*2             # number of units in each hidden layer
a_learning_rate = 1e-4         # actor learning rate
d_learning_rate = 1e-4         # discriminator learning rate

# Memory parameters
memory_size = int(1e5)            # memory capacity
batch_size = int(1e3)             # experience mini-batch size: 200/500 a successful episode size
# gamma = 0.99                   # future reward discount
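
The exploration parameters above feed an exponentially decaying exploration probability in the training loop. The sketch below isolates that schedule and the explore-or-exploit choice. The helper names `explore_probability` and `select_action` are assumptions for illustration, and the logits passed in are made up.

```python
import numpy as np


def explore_probability(total_step, explore_start=1.0, explore_stop=0.01,
                        decay_rate=0.0001):
    """Decay from explore_start toward explore_stop as steps accumulate."""
    return explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step)


def select_action(action_logits, explore_p, action_size, rng):
    """Pick a random action with probability explore_p, else the argmax."""
    if explore_p > rng.rand():
        return int(rng.randint(action_size))
    return int(np.argmax(action_logits))


rng = np.random.RandomState(0)
for step in (0, 300, 600, 50000):
    print(step, round(float(explore_probability(step)), 4))

# Made-up logits for the four actions; with explore_p=0.0 the argmax is used.
print(select_action(np.array([0.1, 0.7, 0.05, 0.15]), 0.0, 4, rng))
```

At 300 steps the schedule gives roughly `0.9707`, which agrees with the `exploreP` value printed for `Episode:0` in the training log, so episodes appear to last 300 steps.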

In [38]:
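# Reset/init the graph/session
# tf.reset_default_graph() returns None, so fetch the fresh default graph explicitly
tf.reset_default_graph()
graph = tf.get_default_graph()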

# Init the model
model = Model(action_size=action_size, state_size=state_size, hidden_size=hidden_size,
              a_learning_rate=a_learning_rate, 
              d_learning_rate=d_learning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

In [40]:
# state = env.reset()
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]   # get the state
total_reward = 0
num_step = 0
for _ in range(memory_size):
    # action = env.action_space.sample()
    # next_state, reward, done, _ = env.step(action)
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    rate = -1
    memory.buffer.append([state, action, next_state, reward, float(done), rate])
    num_step += 1 # memory incremented
    total_reward += reward
    state = next_state
    if done:
        # state = env.reset()
        env_info = env.reset(train_mode=True)[brain_name] # reset the environment
        state = env_info.vector_observations[0]   # get the state
        rate = np.clip(total_reward/13, a_min=-1, a_max=+1)
        for idx in range(num_step): # episode length
            if memory.buffer[-1-idx][-1] == -1:
                memory.buffer[-1-idx][-1] = rate
        total_reward = 0 # reset
        num_step = 0 # reset
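
The cell above stores every transition with a placeholder rate of `-1` and overwrites it once the episode ends, using the episode's total reward divided by 13 and clipped to `[-1, +1]`. The sketch below isolates that step. `rate_episode` is a hypothetical helper and the transitions are placeholders, not real environment data.

```python
from collections import deque

import numpy as np


def rate_episode(buffer, num_step, total_reward, target_score=13.0):
    """Overwrite the -1 placeholder on the last num_step entries with the episode rate."""
    rate = float(np.clip(total_reward / target_score, -1.0, 1.0))
    for idx in range(num_step):
        if buffer[-1 - idx][-1] == -1:  # only touch entries still marked unrated
            buffer[-1 - idx][-1] = rate
    return rate


buffer = deque(maxlen=10)
rewards = [0.0, 1.0, 0.0, -1.0, 1.0]  # made-up rewards for one short episode
for reward in rewards:
    # [state, action, next_state, reward, done, rate], with placeholder contents
    buffer.append(["state", 0, "next_state", reward, 0.0, -1])

rate = rate_episode(buffer, num_step=len(rewards), total_reward=sum(rewards))
print(rate, [item[-1] for item in buffer])
```

One caveat of this scheme: `-1` is both the "unrated" marker and a legitimate clipped rate, so an episode scoring `-13` or lower is indistinguishable from an unrated one.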

In [None]:
# Save/load the model and save for plotting
saver = tf.train.Saver()
episode_rewards_list, rewards_list = [], []
aloss_list, dloss_list, aloss2_list = [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    total_step = 0 # Explore or exploit parameter
    episode_reward = deque(maxlen=100) # rewards of the last 100 episodes, used for the running mean

    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0 # each episode
        aloss_batch, dloss_batch, aloss2_batch = [], [], []
        #state = env.reset() # each episode
        env_info = env.reset(train_mode=True)[brain_name] # reset the environment
        state = env_info.vector_observations[0]   # get the current state
        num_step = 0 # each episode
        rate = -1

        # Training steps/batches
        while True:
            # Explore (Env) or Exploit (Model)
            total_step += 1
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            if explore_p > np.random.rand():
                #action = env.action_space.sample()
                action = np.random.randint(action_size)        # select an action
            else:
                action_logits = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
                action = np.argmax(action_logits)
            #next_state, reward, done, _ = env.step(action)
            env_info = env.step(action)[brain_name]        # send the action to the environment
            next_state = env_info.vector_observations[0]   # get the next state
            reward = env_info.rewards[0]                   # get the reward
            done = env_info.local_done[0]                  # see if episode has finished
            memory.buffer.append([state, action, next_state, reward, float(done), rate])
            num_step += 1 # memory added
            total_reward += reward
            state = next_state
            
            # Rating the memory
            if done:
                rate = np.clip(total_reward/13, a_min=-1, a_max=+1)
                for idx in range(num_step): # episode length
                    if memory.buffer[-1-idx][-1] == -1: # double-check the landmark/marked indexes
                        memory.buffer[-1-idx][-1] = rate # rate the trajectory/data
                        
            # Training with the maxrated minibatch
            batch = memory.buffer
            #for idx in range(memory_size// batch_size):
            while True:
                idx = np.random.choice(np.arange(memory_size// batch_size))
                states = np.array([each[0] for each in batch])[idx*batch_size:(idx+1)*batch_size]
                actions = np.array([each[1] for each in batch])[idx*batch_size:(idx+1)*batch_size]
                next_states = np.array([each[2] for each in batch])[idx*batch_size:(idx+1)*batch_size]
                #rewards = np.array([each[3] for each in batch])[idx*batch_size:(idx+1)*batch_size]
                dones = np.array([each[4] for each in batch])[idx*batch_size:(idx+1)*batch_size]
                rates = np.array([each[5] for each in batch])[idx*batch_size:(idx+1)*batch_size]
                states = states[rates >= np.max(rates)]
                actions = actions[rates >= np.max(rates)]
                next_states = next_states[rates >= np.max(rates)]
                #rewards = rewards[rates >= np.max(rates)]
                dones = dones[rates >= np.max(rates)]
                rates = rates[rates >= np.max(rates)]
                #if np.count_nonzero(dones)==1 and len(dones) >= 1 and np.max(rates) > 0:
                if len(dones) > 1:
                    # print('np.count_nonzero(dones)==1 and len(dones) >= 1 and np.max(rates) > 0: ', 
                    #       np.count_nonzero(dones), len(dones), np.max(rates))
                    break
            #             if np.count_nonzero(dones)!=1 and len(dones) < 1 and np.max(rates) <= 0:
            #                 print(np.count_nonzero(dones), len(dones), np.max(rates))
            #                 break
            aloss, _ = sess.run([model.a_loss, model.a_opt],
                                feed_dict = {model.states: states, 
                                            model.actions: actions,
                                            model.next_states: next_states,
                                            #model.rewards: rewards,
                                            model.dones: dones,
                                            model.rates: rates})
            dloss, _ = sess.run([model.d_loss, model.d_opt],
                                  feed_dict = {model.states: states, 
                                               model.actions: actions,
                                               model.next_states: next_states,
                                               #model.rewards: rewards,
                                               model.dones: dones,
                                               model.rates: rates})
            aloss2, _= sess.run([model.a_loss2, model.a_opt2], 
                                 feed_dict = {model.states: states, 
                                              model.actions: actions,
                                              model.next_states: next_states,
                                              #model.rewards: rewards,
                                              model.dones: dones,
                                              model.rates: rates})
            # print('dones:', 
            #       len(dones), np.count_nonzero(dones), 
            #       len(dones1), np.count_nonzero(dones1), 
            #       len(dones2), np.count_nonzero(dones2))
            aloss_batch.append(aloss)
            dloss_batch.append(dloss)
            aloss2_batch.append(aloss2)
            if done:
                break
                
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'rate:{:.4f}'.format(rate),
              'aloss:{:.4f}'.format(np.mean(aloss_batch)),
              'dloss:{:.4f}'.format(np.mean(dloss_batch)),
              'aloss2:{:.4f}'.format(np.mean(aloss2_batch)),
              'exploreP:{:.4f}'.format(explore_p))

        # Data for plotting
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        aloss_list.append([ep, np.mean(aloss_batch)])
        dloss_list.append([ep, np.mean(dloss_batch)])
        aloss2_list.append([ep, np.mean(aloss2_batch)])
        
        # Break episode/epoch loop
        # The task is episodic; the environment counts as solved once the agent's
        # average score over 100 consecutive episodes reaches +13.
        if np.mean(episode_reward) >= +13:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:-2.0000 R:-2.0000 rate:-0.1538 aloss:2.2151 dloss:1.4167 aloss2:1.5538 exploreP:0.9707
Episode:1 meanR:-0.5000 R:1.0000 rate:0.0769 aloss:2.1404 dloss:1.2774 aloss2:1.8209 exploreP:0.9423
Episode:2 meanR:-1.0000 R:-2.0000 rate:-0.1538 aloss:2.1091 dloss:1.2442 aloss2:1.8236 exploreP:0.9148
Episode:3 meanR:-1.7500 R:-4.0000 rate:-0.3077 aloss:2.1278 dloss:1.2428 aloss2:1.8251 exploreP:0.8881
Episode:4 meanR:-1.2000 R:1.0000 rate:0.0769 aloss:2.1899 dloss:1.2582 aloss2:1.8251 exploreP:0.8621
Episode:5 meanR:-1.3333 R:-2.0000 rate:-0.1538 aloss:2.1781 dloss:1.2575 aloss2:1.8259 exploreP:0.8369
Episode:6 meanR:-1.0000 R:1.0000 rate:0.0769 aloss:2.1630 dloss:1.2529 aloss2:1.8229 exploreP:0.8125
Episode:7 meanR:-0.8750 R:0.0000 rate:0.0000 aloss:2.1727 dloss:1.2443 aloss2:1.8380 exploreP:0.7888
Episode:8 meanR:-0.8889 R:-1.0000 rate:-0.0769 aloss:2.1818 dloss:1.2490 aloss2:1.8364 exploreP:0.7657
Episode:9 meanR:-0.7000 R:1.0000 rate:0.0769 aloss:2.1780 dloss:1.2486 aloss2:1.8

Episode:81 meanR:0.1829 R:0.0000 rate:0.0000 aloss:2.0954 dloss:1.1966 aloss2:17.5732 exploreP:0.0946
Episode:82 meanR:0.1687 R:-1.0000 rate:-0.0769 aloss:2.1020 dloss:1.2042 aloss2:18.0006 exploreP:0.0921
Episode:83 meanR:0.1667 R:0.0000 rate:0.0000 aloss:2.0957 dloss:1.1912 aloss2:18.3607 exploreP:0.0897
Episode:84 meanR:0.1529 R:-1.0000 rate:-0.0769 aloss:2.1035 dloss:1.1962 aloss2:18.6704 exploreP:0.0873
Episode:85 meanR:0.1512 R:0.0000 rate:0.0000 aloss:2.0986 dloss:1.1820 aloss2:19.1553 exploreP:0.0850
Episode:86 meanR:0.1494 R:0.0000 rate:0.0000 aloss:2.0924 dloss:1.1966 aloss2:19.4838 exploreP:0.0828
Episode:87 meanR:0.1477 R:0.0000 rate:0.0000 aloss:2.0838 dloss:1.1781 aloss2:19.8512 exploreP:0.0806
Episode:88 meanR:0.1348 R:-1.0000 rate:-0.0769 aloss:2.0863 dloss:1.1897 aloss2:20.2636 exploreP:0.0786
Episode:89 meanR:0.1333 R:0.0000 rate:0.0000 aloss:2.0860 dloss:1.2070 aloss2:20.6202 exploreP:0.0765
Episode:90 meanR:0.1319 R:0.0000 rate:0.0000 aloss:2.0845 dloss:1.2004 aloss

Episode:161 meanR:-0.0200 R:-1.0000 rate:-0.0769 aloss:2.0753 dloss:1.1754 aloss2:46.3478 exploreP:0.0177
Episode:162 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:2.0723 dloss:1.1547 aloss2:46.7277 exploreP:0.0174
Episode:163 meanR:-0.0400 R:0.0000 rate:0.0000 aloss:2.0739 dloss:1.1604 aloss2:46.9762 exploreP:0.0172
Episode:164 meanR:-0.0500 R:0.0000 rate:0.0000 aloss:2.0733 dloss:1.1644 aloss2:47.3104 exploreP:0.0170
Episode:165 meanR:-0.0500 R:0.0000 rate:0.0000 aloss:2.0682 dloss:1.1540 aloss2:47.5716 exploreP:0.0168
Episode:166 meanR:-0.0500 R:0.0000 rate:0.0000 aloss:2.0698 dloss:1.1635 aloss2:47.9356 exploreP:0.0166
Episode:167 meanR:-0.0400 R:0.0000 rate:0.0000 aloss:2.0679 dloss:1.1571 aloss2:48.0602 exploreP:0.0164
Episode:168 meanR:-0.0400 R:0.0000 rate:0.0000 aloss:2.0722 dloss:1.1538 aloss2:48.4956 exploreP:0.0162
Episode:169 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:2.0727 dloss:1.1563 aloss2:48.7045 exploreP:0.0160
Episode:170 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:2.0674 dl

Episode:241 meanR:0.1000 R:0.0000 rate:0.0000 aloss:2.0373 dloss:1.1233 aloss2:55.6546 exploreP:0.0107
Episode:242 meanR:0.1000 R:0.0000 rate:0.0000 aloss:2.0209 dloss:1.1392 aloss2:55.6685 exploreP:0.0107
Episode:243 meanR:0.1000 R:1.0000 rate:0.0769 aloss:2.0027 dloss:1.1522 aloss2:55.6148 exploreP:0.0107
Episode:244 meanR:0.0900 R:0.0000 rate:0.0000 aloss:1.9995 dloss:1.1327 aloss2:55.8490 exploreP:0.0106
Episode:245 meanR:0.0900 R:0.0000 rate:0.0000 aloss:1.9598 dloss:1.1207 aloss2:55.8097 exploreP:0.0106
Episode:246 meanR:0.0900 R:0.0000 rate:0.0000 aloss:1.9601 dloss:1.1445 aloss2:55.9311 exploreP:0.0106
Episode:247 meanR:0.1000 R:0.0000 rate:0.0000 aloss:1.9863 dloss:1.1266 aloss2:55.6936 exploreP:0.0106
Episode:248 meanR:0.1000 R:0.0000 rate:0.0000 aloss:2.0112 dloss:1.1450 aloss2:56.0562 exploreP:0.0106
Episode:249 meanR:0.0900 R:-1.0000 rate:-0.0769 aloss:1.9816 dloss:1.1252 aloss2:55.9433 exploreP:0.0105
Episode:250 meanR:0.0900 R:0.0000 rate:0.0000 aloss:2.0098 dloss:1.1354

Episode:321 meanR:0.0200 R:0.0000 rate:0.0000 aloss:2.0450 dloss:1.1025 aloss2:54.6052 exploreP:0.0101
Episode:322 meanR:0.0200 R:0.0000 rate:0.0000 aloss:2.1331 dloss:1.1105 aloss2:54.9899 exploreP:0.0101
Episode:323 meanR:0.0200 R:0.0000 rate:0.0000 aloss:2.1174 dloss:1.1075 aloss2:55.1251 exploreP:0.0101
Episode:324 meanR:0.0300 R:0.0000 rate:0.0000 aloss:2.0482 dloss:1.1113 aloss2:55.3285 exploreP:0.0101
Episode:325 meanR:0.0300 R:0.0000 rate:0.0000 aloss:2.1660 dloss:1.0936 aloss2:55.4355 exploreP:0.0101
Episode:326 meanR:0.0300 R:0.0000 rate:0.0000 aloss:2.1490 dloss:1.1130 aloss2:55.6246 exploreP:0.0101
Episode:327 meanR:0.0200 R:-1.0000 rate:-0.0769 aloss:2.1733 dloss:1.1147 aloss2:55.9986 exploreP:0.0101
Episode:328 meanR:0.0200 R:0.0000 rate:0.0000 aloss:2.1575 dloss:1.1055 aloss2:56.0981 exploreP:0.0101
Episode:329 meanR:0.0200 R:0.0000 rate:0.0000 aloss:2.2160 dloss:1.1137 aloss2:56.2795 exploreP:0.0100
Episode:330 meanR:0.0200 R:0.0000 rate:0.0000 aloss:2.1162 dloss:1.1147

Episode:400 meanR:-0.0500 R:0.0000 rate:0.0000 aloss:1.6100 dloss:1.0552 aloss2:59.7440 exploreP:0.0100
Episode:401 meanR:-0.0500 R:0.0000 rate:0.0000 aloss:1.5949 dloss:1.0680 aloss2:59.7586 exploreP:0.0100
Episode:402 meanR:-0.0500 R:0.0000 rate:0.0000 aloss:1.6176 dloss:1.0710 aloss2:59.8494 exploreP:0.0100
Episode:403 meanR:-0.0600 R:-1.0000 rate:-0.0769 aloss:1.5625 dloss:1.0516 aloss2:59.9224 exploreP:0.0100
Episode:404 meanR:-0.0600 R:0.0000 rate:0.0000 aloss:1.6008 dloss:1.0597 aloss2:59.8645 exploreP:0.0100
Episode:405 meanR:-0.0600 R:0.0000 rate:0.0000 aloss:1.5514 dloss:1.0615 aloss2:59.8793 exploreP:0.0100
Episode:406 meanR:-0.0600 R:0.0000 rate:0.0000 aloss:1.5520 dloss:1.0546 aloss2:59.9476 exploreP:0.0100
Episode:407 meanR:-0.0600 R:0.0000 rate:0.0000 aloss:1.6958 dloss:1.0546 aloss2:60.0337 exploreP:0.0100
Episode:408 meanR:-0.0700 R:-1.0000 rate:-0.0769 aloss:1.5739 dloss:1.0557 aloss2:59.9546 exploreP:0.0100
Episode:409 meanR:-0.0700 R:0.0000 rate:0.0000 aloss:1.5794 

Episode:479 meanR:-0.0100 R:0.0000 rate:0.0000 aloss:1.0678 dloss:1.0645 aloss2:61.1130 exploreP:0.0100
Episode:480 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:1.0720 dloss:1.0627 aloss2:61.1392 exploreP:0.0100
Episode:481 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:1.0669 dloss:1.0663 aloss2:61.0202 exploreP:0.0100
Episode:482 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:1.0519 dloss:1.0640 aloss2:61.1364 exploreP:0.0100
Episode:483 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:1.0732 dloss:1.0684 aloss2:61.2026 exploreP:0.0100
Episode:484 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:1.0664 dloss:1.0625 aloss2:61.1652 exploreP:0.0100
Episode:485 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:1.0580 dloss:1.0581 aloss2:61.1581 exploreP:0.0100
Episode:486 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:1.0737 dloss:1.0649 aloss2:61.3138 exploreP:0.0100
Episode:487 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:1.0686 dloss:1.0527 aloss2:61.3594 exploreP:0.0100
Episode:488 meanR:-0.0100 R:1.0000 rate:0.0769 aloss:1.0627 dlos

Episode:559 meanR:0.0100 R:0.0000 rate:0.0000 aloss:0.9135 dloss:1.0486 aloss2:65.1091 exploreP:0.0100
Episode:560 meanR:0.0100 R:0.0000 rate:0.0000 aloss:0.9123 dloss:1.0588 aloss2:63.8780 exploreP:0.0100
Episode:561 meanR:0.0200 R:1.0000 rate:0.0769 aloss:0.9189 dloss:1.0626 aloss2:63.4992 exploreP:0.0100
Episode:562 meanR:0.0200 R:0.0000 rate:0.0000 aloss:0.9045 dloss:1.0629 aloss2:63.5026 exploreP:0.0100
Episode:563 meanR:0.0200 R:0.0000 rate:0.0000 aloss:0.9172 dloss:1.0641 aloss2:63.3914 exploreP:0.0100
Episode:564 meanR:0.0300 R:1.0000 rate:0.0769 aloss:0.8981 dloss:1.0597 aloss2:63.3177 exploreP:0.0100
Episode:565 meanR:0.0300 R:0.0000 rate:0.0000 aloss:0.9184 dloss:1.0654 aloss2:63.3338 exploreP:0.0100
Episode:566 meanR:0.0400 R:1.0000 rate:0.0769 aloss:0.9186 dloss:1.0644 aloss2:63.3464 exploreP:0.0100
Episode:567 meanR:0.0400 R:0.0000 rate:0.0000 aloss:0.9196 dloss:1.0681 aloss2:63.3860 exploreP:0.0100
Episode:568 meanR:0.0400 R:-1.0000 rate:-0.0769 aloss:0.9086 dloss:1.0711

Episode:639 meanR:0.0000 R:-1.0000 rate:-0.0769 aloss:0.9911 dloss:1.0590 aloss2:63.9151 exploreP:0.0100
Episode:640 meanR:0.0000 R:0.0000 rate:0.0000 aloss:0.9841 dloss:1.0704 aloss2:63.8550 exploreP:0.0100
Episode:641 meanR:0.0000 R:0.0000 rate:0.0000 aloss:0.9866 dloss:1.0707 aloss2:63.9254 exploreP:0.0100
Episode:642 meanR:0.0100 R:1.0000 rate:0.0769 aloss:1.0060 dloss:1.0548 aloss2:64.1898 exploreP:0.0100
Episode:643 meanR:0.0200 R:0.0000 rate:0.0000 aloss:0.9935 dloss:1.0614 aloss2:64.0102 exploreP:0.0100
Episode:644 meanR:0.0200 R:0.0000 rate:0.0000 aloss:0.9895 dloss:1.0655 aloss2:63.9451 exploreP:0.0100
Episode:645 meanR:0.0200 R:0.0000 rate:0.0000 aloss:0.9895 dloss:1.0610 aloss2:64.1118 exploreP:0.0100
Episode:646 meanR:0.0200 R:0.0000 rate:0.0000 aloss:0.9964 dloss:1.0596 aloss2:64.0078 exploreP:0.0100
Episode:647 meanR:0.0100 R:0.0000 rate:0.0000 aloss:0.9999 dloss:1.0572 aloss2:64.0024 exploreP:0.0100
Episode:648 meanR:0.0100 R:0.0000 rate:0.0000 aloss:0.9958 dloss:1.0680

Episode:718 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:1.1910 dloss:1.0711 aloss2:64.8348 exploreP:0.0100
Episode:719 meanR:-0.0400 R:-1.0000 rate:-0.0769 aloss:1.1918 dloss:1.0710 aloss2:65.1293 exploreP:0.0100
Episode:720 meanR:-0.0300 R:1.0000 rate:0.0769 aloss:1.1902 dloss:1.0620 aloss2:65.3323 exploreP:0.0100
Episode:721 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:1.2052 dloss:1.0689 aloss2:65.5043 exploreP:0.0100
Episode:722 meanR:-0.0300 R:0.0000 rate:0.0000 aloss:1.1933 dloss:1.0694 aloss2:65.3590 exploreP:0.0100
Episode:723 meanR:-0.0400 R:-1.0000 rate:-0.0769 aloss:1.2043 dloss:1.0680 aloss2:65.6868 exploreP:0.0100
Episode:724 meanR:-0.0400 R:0.0000 rate:0.0000 aloss:1.1823 dloss:1.0755 aloss2:65.9763 exploreP:0.0100
Episode:725 meanR:-0.0400 R:0.0000 rate:0.0000 aloss:1.2055 dloss:1.0717 aloss2:65.8231 exploreP:0.0100
Episode:726 meanR:-0.0400 R:0.0000 rate:0.0000 aloss:1.2093 dloss:1.0629 aloss2:65.9816 exploreP:0.0100
Episode:727 meanR:-0.0400 R:0.0000 rate:0.0000 aloss:1.2004 

Episode:798 meanR:0.0200 R:0.0000 rate:0.0000 aloss:1.4384 dloss:1.0783 aloss2:68.0078 exploreP:0.0100
Episode:799 meanR:0.0200 R:0.0000 rate:0.0000 aloss:1.2353 dloss:1.0681 aloss2:66.9538 exploreP:0.0100
Episode:800 meanR:0.0200 R:0.0000 rate:0.0000 aloss:1.3313 dloss:1.0633 aloss2:67.1199 exploreP:0.0100
Episode:801 meanR:0.0200 R:0.0000 rate:0.0000 aloss:1.3303 dloss:1.0788 aloss2:67.4743 exploreP:0.0100
Episode:802 meanR:0.0100 R:-1.0000 rate:-0.0769 aloss:1.4802 dloss:1.0791 aloss2:68.8208 exploreP:0.0100
Episode:803 meanR:0.0100 R:0.0000 rate:0.0000 aloss:1.3218 dloss:1.0706 aloss2:67.2977 exploreP:0.0100
Episode:804 meanR:0.0000 R:-1.0000 rate:-0.0769 aloss:1.5374 dloss:1.0775 aloss2:68.9099 exploreP:0.0100
Episode:805 meanR:0.0000 R:0.0000 rate:0.0000 aloss:1.3231 dloss:1.0780 aloss2:67.6232 exploreP:0.0100
Episode:806 meanR:0.0000 R:0.0000 rate:0.0000 aloss:1.6917 dloss:1.0575 aloss2:69.2265 exploreP:0.0100
Episode:807 meanR:-0.0100 R:-1.0000 rate:-0.0769 aloss:1.2294 dloss:1

Episode:878 meanR:0.0200 R:0.0000 rate:0.0000 aloss:1.6773 dloss:1.0892 aloss2:76.0138 exploreP:0.0100
Episode:879 meanR:0.0200 R:0.0000 rate:0.0000 aloss:1.8487 dloss:1.0719 aloss2:77.7818 exploreP:0.0100
Episode:880 meanR:0.0100 R:0.0000 rate:0.0000 aloss:1.8125 dloss:1.0713 aloss2:76.7541 exploreP:0.0100
Episode:881 meanR:0.0200 R:0.0000 rate:0.0000 aloss:1.8853 dloss:1.0785 aloss2:78.9565 exploreP:0.0100
Episode:882 meanR:0.0200 R:0.0000 rate:0.0000 aloss:1.9497 dloss:1.0810 aloss2:79.9746 exploreP:0.0100
Episode:883 meanR:0.0200 R:0.0000 rate:0.0000 aloss:1.9888 dloss:1.0696 aloss2:79.5374 exploreP:0.0100
Episode:884 meanR:0.0300 R:1.0000 rate:0.0769 aloss:1.9574 dloss:1.0820 aloss2:77.8883 exploreP:0.0100
Episode:885 meanR:0.0300 R:0.0000 rate:0.0000 aloss:2.0188 dloss:1.0761 aloss2:79.9669 exploreP:0.0100
Episode:886 meanR:0.0200 R:0.0000 rate:0.0000 aloss:2.1037 dloss:1.0704 aloss2:80.6584 exploreP:0.0100
Episode:887 meanR:0.0200 R:0.0000 rate:0.0000 aloss:2.1542 dloss:1.0728 a

Episode:958 meanR:-0.0500 R:0.0000 rate:0.0000 aloss:4.6522 dloss:1.0816 aloss2:86.0625 exploreP:0.0100
Episode:959 meanR:-0.0400 R:1.0000 rate:0.0769 aloss:4.6783 dloss:1.0706 aloss2:83.0466 exploreP:0.0100
Episode:960 meanR:-0.0400 R:0.0000 rate:0.0000 aloss:4.4038 dloss:1.0683 aloss2:82.5909 exploreP:0.0100
Episode:961 meanR:-0.0500 R:0.0000 rate:0.0000 aloss:4.6074 dloss:1.0679 aloss2:82.4024 exploreP:0.0100
Episode:962 meanR:-0.0500 R:0.0000 rate:0.0000 aloss:4.6221 dloss:1.0813 aloss2:87.9488 exploreP:0.0100
Episode:963 meanR:-0.0500 R:0.0000 rate:0.0000 aloss:4.7215 dloss:1.0717 aloss2:85.9879 exploreP:0.0100
Episode:964 meanR:-0.0500 R:0.0000 rate:0.0000 aloss:4.8574 dloss:1.0789 aloss2:81.7749 exploreP:0.0100
Episode:965 meanR:-0.0500 R:0.0000 rate:0.0000 aloss:4.5310 dloss:1.0800 aloss2:83.7759 exploreP:0.0100
Episode:966 meanR:-0.0500 R:0.0000 rate:0.0000 aloss:4.5427 dloss:1.0769 aloss2:84.2047 exploreP:0.0100
Episode:967 meanR:-0.0600 R:-1.0000 rate:-0.0769 aloss:4.6934 dl

Episode:1037 meanR:-0.0800 R:1.0000 rate:0.0769 aloss:4.4904 dloss:1.0923 aloss2:74.5698 exploreP:0.0100
Episode:1038 meanR:-0.0800 R:0.0000 rate:0.0000 aloss:3.4590 dloss:1.0892 aloss2:71.4995 exploreP:0.0100
Episode:1039 meanR:-0.0800 R:0.0000 rate:0.0000 aloss:3.8885 dloss:1.0824 aloss2:72.9927 exploreP:0.0100
Episode:1040 meanR:-0.1000 R:-2.0000 rate:-0.1538 aloss:3.6646 dloss:1.0852 aloss2:73.0514 exploreP:0.0100
Episode:1041 meanR:-0.0700 R:3.0000 rate:0.2308 aloss:6.1268 dloss:1.0867 aloss2:78.5154 exploreP:0.0100
Episode:1042 meanR:-0.1000 R:-4.0000 rate:-0.3077 aloss:3.4525 dloss:1.0917 aloss2:72.5629 exploreP:0.0100
Episode:1043 meanR:-0.1000 R:0.0000 rate:0.0000 aloss:3.5961 dloss:1.1048 aloss2:72.8273 exploreP:0.0100
Episode:1044 meanR:-0.0800 R:2.0000 rate:0.1538 aloss:3.5957 dloss:1.1035 aloss2:74.8439 exploreP:0.0100
Episode:1045 meanR:-0.1000 R:-2.0000 rate:-0.1538 aloss:4.5125 dloss:1.0912 aloss2:75.6121 exploreP:0.0100
Episode:1046 meanR:-0.0900 R:1.0000 rate:0.0769 a

Episode:1115 meanR:-0.2600 R:0.0000 rate:0.0000 aloss:3.9529 dloss:1.1169 aloss2:74.7586 exploreP:0.0100
Episode:1116 meanR:-0.2400 R:2.0000 rate:0.1538 aloss:4.6171 dloss:1.1005 aloss2:77.0371 exploreP:0.0100
Episode:1117 meanR:-0.2100 R:1.0000 rate:0.0769 aloss:5.0488 dloss:1.0995 aloss2:78.0471 exploreP:0.0100
Episode:1118 meanR:-0.2300 R:-3.0000 rate:-0.2308 aloss:5.0988 dloss:1.1056 aloss2:76.8521 exploreP:0.0100
Episode:1119 meanR:-0.2300 R:0.0000 rate:0.0000 aloss:5.1169 dloss:1.1128 aloss2:75.5577 exploreP:0.0100
Episode:1120 meanR:-0.2500 R:-1.0000 rate:-0.0769 aloss:5.0144 dloss:1.1112 aloss2:73.8597 exploreP:0.0100
Episode:1121 meanR:-0.2500 R:0.0000 rate:0.0000 aloss:5.2937 dloss:1.1053 aloss2:71.7727 exploreP:0.0100
Episode:1122 meanR:-0.2500 R:-1.0000 rate:-0.0769 aloss:4.4687 dloss:1.1134 aloss2:73.2275 exploreP:0.0100
Episode:1123 meanR:-0.2600 R:0.0000 rate:0.0000 aloss:3.6971 dloss:1.0907 aloss2:72.7211 exploreP:0.0100
Episode:1124 meanR:-0.2400 R:1.0000 rate:0.0769 a

Episode:1193 meanR:0.0300 R:0.0000 rate:0.0000 aloss:4.3947 dloss:1.1252 aloss2:79.6097 exploreP:0.0100
Episode:1194 meanR:0.0300 R:-1.0000 rate:-0.0769 aloss:3.8186 dloss:1.1337 aloss2:78.0697 exploreP:0.0100
Episode:1195 meanR:0.0200 R:-2.0000 rate:-0.1538 aloss:3.8817 dloss:1.1324 aloss2:80.0111 exploreP:0.0100
Episode:1196 meanR:0.0200 R:0.0000 rate:0.0000 aloss:4.2426 dloss:1.1253 aloss2:78.5235 exploreP:0.0100
Episode:1197 meanR:0.0500 R:2.0000 rate:0.1538 aloss:4.0851 dloss:1.1347 aloss2:81.3125 exploreP:0.0100
Episode:1198 meanR:0.0500 R:0.0000 rate:0.0000 aloss:4.1339 dloss:1.1338 aloss2:79.3762 exploreP:0.0100
Episode:1199 meanR:0.0600 R:-1.0000 rate:-0.0769 aloss:4.7006 dloss:1.1406 aloss2:82.4949 exploreP:0.0100
Episode:1200 meanR:0.0600 R:-1.0000 rate:-0.0769 aloss:4.2390 dloss:1.1330 aloss2:80.0752 exploreP:0.0100
Episode:1201 meanR:0.0600 R:0.0000 rate:0.0000 aloss:4.0469 dloss:1.1386 aloss2:78.1863 exploreP:0.0100
Episode:1202 meanR:0.0800 R:-1.0000 rate:-0.0769 aloss:3

Episode:1272 meanR:-0.0100 R:1.0000 rate:0.0769 aloss:3.7954 dloss:1.1560 aloss2:83.8494 exploreP:0.0100
Episode:1273 meanR:-0.0100 R:0.0000 rate:0.0000 aloss:4.8853 dloss:1.1461 aloss2:93.5108 exploreP:0.0100
Episode:1274 meanR:-0.0100 R:0.0000 rate:0.0000 aloss:4.0218 dloss:1.1464 aloss2:86.2232 exploreP:0.0100
Episode:1275 meanR:0.0000 R:0.0000 rate:0.0000 aloss:4.7373 dloss:1.1581 aloss2:88.8350 exploreP:0.0100
Episode:1276 meanR:-0.0200 R:0.0000 rate:0.0000 aloss:4.6735 dloss:1.1659 aloss2:90.5585 exploreP:0.0100
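
The log lines above are space-separated `key:value` pairs, so they can be turned back into numbers for later analysis. Below is a minimal sketch using only the standard library. The helper name `parse_log_line` and the regular expression are assumptions based on the output shown, not part of the training code. Lines that were cut off in the captured output yield missing or truncated fields, so it is safer to keep only lines where all expected keys are present.

```python
import re

# Matches "key:value" pairs such as "meanR:-0.0500" or "Episode:322".
PAIR = re.compile(r"(\w+):(-?\d+(?:\.\d+)?)")

def parse_log_line(line):
    """Return a dict of the numeric fields found in one log line (hypothetical helper)."""
    return {key: float(value) for key, value in PAIR.findall(line)}

line = ("Episode:327 meanR:0.0200 R:-1.0000 rate:-0.0769 aloss:2.1733 "
        "dloss:1.1147 aloss2:55.9986 exploreP:0.0101")
fields = parse_log_line(line)
print(fields["Episode"], fields["R"], fields["rate"])  # 327.0 -1.0 -0.0769
```

In the lines shown, `rate` agrees with `R / 13` to four decimals (for example, `1 / 13 ≈ 0.0769`). This is inferred from the output alone and is not confirmed by the training code.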


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

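def running_mean(x, N):
    # Moving average over windows of N samples, computed from a cumulative sum.
    # Returns len(x) - N + 1 values (one per full window).
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N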
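
To see how the cumulative-sum trick in `running_mean` works, here is a minimal pure-Python sketch of the same computation. The name `running_mean_py` is hypothetical and not part of the notebook. It uses `itertools.accumulate` in place of `np.cumsum`.

```python
from itertools import accumulate

def running_mean_py(x, n):
    """Mean of every length-n window of x, via prefix sums (illustrative only)."""
    # prefix[i] holds the sum of the first i elements, so prefix[0] == 0.
    prefix = [0.0] + list(accumulate(x))
    # The sum of the window x[i:i+n] is prefix[i+n] - prefix[i].
    return [(prefix[i + n] - prefix[i]) / n for i in range(len(x) - n + 1)]

rewards = [1, 2, 3, 4, 5]
smoothed = running_mean_py(rewards, 2)
print(smoothed)  # [1.5, 2.5, 3.5, 4.5]
```

The result has `len(x) - n + 1` entries, which is why the plotting cells slice the episode axis with `eps[-len(smoothed_arr):]`: each smoothed value is drawn at the last episode of its window.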

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Episode rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(aloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('A losses')

In [None]:
eps, arr = np.array(dloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('D losses')

In [None]:
eps, arr = np.array(aloss2_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('A losses 2')

In [37]:
# TF session for testing: restore the trained weights and run the learned policy
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    saver.restore(sess, 'checkpoints/model.ckpt')
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Testing episodes/epochs
    for _ in range(1):
        total_reward = 0
        #state = env.reset()
        env_info = env.reset(train_mode=False)[brain_name] # reset the environment
        state = env_info.vector_observations[0]   # get the current state

        # Testing steps/batches
        while True:
            # Greedy policy: take the action with the largest logit
            action_logits = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
            action = np.argmax(action_logits)
            #state, reward, done, _ = env.step(action)
            env_info = env.step(action)[brain_name]  # send the action to the environment
            state = env_info.vector_observations[0]  # get the next state
            reward = env_info.rewards[0]             # get the reward
            done = env_info.local_done[0]            # see if episode has finished
            total_reward += reward
            if done:
                break
                
        print('total_reward: {:.2f}'.format(total_reward))

INFO:tensorflow:Restoring parameters from checkpoints/model-nav.ckpt


total_reward: 14.00


In [None]:
# Be careful: the line below closes the environment.
# Uncomment it only when you are finished.
# env.close()