# Navigation

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the first project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).

### 1. Start the Environment

We begin by importing some necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Banana.app"`
- **Windows** (x86): `"path/to/Banana_Windows_x86/Banana.exe"`
- **Windows** (x86_64): `"path/to/Banana_Windows_x86_64/Banana.exe"`
- **Linux** (x86): `"path/to/Banana_Linux/Banana.x86"`
- **Linux** (x86_64): `"path/to/Banana_Linux/Banana.x86_64"`
- **Linux** (x86, headless): `"path/to/Banana_Linux_NoVis/Banana.x86"`
- **Linux** (x86_64, headless): `"path/to/Banana_Linux_NoVis/Banana.x86_64"`

For instance, if you are using a Mac, then you downloaded `Banana.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Banana.app")
```

In [2]:
# env = UnityEnvironment(file_name="/home/arasdar/VisualBanana_Linux/Banana.x86")
env = UnityEnvironment(file_name="/home/arasdar/unity-envs/Banana_Linux/Banana.x86_64")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

The simulation contains a single agent that navigates a large environment.  At each time step, it has four actions at its disposal:
- `0` - walk forward 
- `1` - walk backward
- `2` - turn left
- `3` - turn right

The state space has `37` dimensions and contains the agent's velocity, along with ray-based perception of objects around agent's forward direction.  A reward of `+1` is provided for collecting a yellow banana, and a reward of `-1` is provided for collecting a blue banana. 

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.vector_observations[0]
# print('States look like:', state)
state_size = len(state)
print('States have length:', state_size)
# print(state.shape, len(env_info.vector_observations), env_info.vector_observations.shape)

Number of agents: 1
Number of actions: 4
States have length: 37


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you will watch the agent's performance, if it selects an action (uniformly) at random with each time step.  A window should pop up that allows you to observe the agent, as it moves through the environment.  

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [5]:
env_info = env.reset(train_mode=False)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
while True:
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        print(state.shape)
        break
    
print("Score: {}".format(score))

(37,)
Score: 0.0


When finished, you can close the environment.

In [6]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the environment, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [7]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
while True:
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    #print(state)
    if done:                                       # exit loop if episode finished
        break
    
print("Score: {}".format(score))

Score: 2.0


In [8]:
# In this one we should define and detect GPUs for tensorflow
# GPUs or CPU
import tensorflow as tf

# Check TensorFlow Version
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


In [9]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score
batch = []
while True: # infinite number of steps
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    #print(state, action, reward, done)
    batch.append([state, action, next_state, reward, float(done)])
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        break
    
print("Score: {}".format(score))

Score: 0.0


In [10]:
batch[0], batch[0][1]

([array([0.        , 0.        , 1.        , 0.        , 0.42380726,
         1.        , 0.        , 0.        , 0.        , 0.07645891,
         0.        , 1.        , 0.        , 0.        , 0.59556293,
         1.        , 0.        , 0.        , 0.        , 0.76522499,
         0.        , 1.        , 0.        , 0.        , 0.81929988,
         1.        , 0.        , 0.        , 0.        , 0.06445977,
         1.        , 0.        , 0.        , 0.        , 0.48016095,
         0.        , 0.        ]),
  2,
  array([0.        , 0.        , 0.        , 1.        , 0.        ,
         0.        , 0.        , 0.        , 1.        , 0.        ,
         0.        , 0.        , 1.        , 0.        , 0.44709426,
         1.        , 0.        , 0.        , 0.        , 0.76708418,
         0.        , 1.        , 0.        , 0.        , 0.57643121,
         1.        , 0.        , 0.        , 0.        , 0.47142395,
         0.        , 1.        , 0.        , 0.        , 0.7495

In [11]:
batch[0]

[array([0.        , 0.        , 1.        , 0.        , 0.42380726,
        1.        , 0.        , 0.        , 0.        , 0.07645891,
        0.        , 1.        , 0.        , 0.        , 0.59556293,
        1.        , 0.        , 0.        , 0.        , 0.76522499,
        0.        , 1.        , 0.        , 0.        , 0.81929988,
        1.        , 0.        , 0.        , 0.        , 0.06445977,
        1.        , 0.        , 0.        , 0.        , 0.48016095,
        0.        , 0.        ]),
 2,
 array([0.        , 0.        , 0.        , 1.        , 0.        ,
        0.        , 0.        , 0.        , 1.        , 0.        ,
        0.        , 0.        , 1.        , 0.        , 0.44709426,
        1.        , 0.        , 0.        , 0.        , 0.76708418,
        0.        , 1.        , 0.        , 0.        , 0.57643121,
        1.        , 0.        , 0.        , 0.        , 0.47142395,
        0.        , 1.        , 0.        , 0.        , 0.74959463,
        0.

In [12]:
states = np.array([each[1] for each in batch])
actions = np.array([each[0] for each in batch])
next_states = np.array([each[1] for each in batch])
rewards = np.array([each[2] for each in batch])
dones = np.array([each[3] for each in batch])
# infos = np.array([each[4] for each in batch])

In [13]:
# print(rewards[:])
print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
print(np.max(np.array(actions)), np.min(np.array(actions)), 
      (np.max(np.array(actions)) - np.min(np.array(actions)))+1)
print(np.max(np.array(rewards)), np.min(np.array(rewards)))
print(np.max(np.array(states)), np.min(np.array(states)))

(300, 37) (300,) (300, 37) (300,)
float64 int64 float64 float64
11.040274620056152 -10.833015441894531 22.873290061950684
11.040274620056152 -10.833015441894531
3 0


In [14]:
def model_input(state_size, lstm_size, batch_size=1):
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    targetQs = tf.placeholder(tf.float32, [None], name='targetQs')
    # RNN
    gru = tf.nn.rnn_cell.GRUCell(lstm_size)
    cell = tf.nn.rnn_cell.MultiRNNCell([gru], state_is_tuple=False)
    initial_state = cell.zero_state(batch_size, tf.float32)
    return states, actions, targetQs, cell, initial_state

In [15]:
# RNN generator or sequence generator
def generator(states, initial_state, cell, lstm_size, num_classes, reuse=False): 
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        inputs = tf.layers.dense(inputs=states, units=lstm_size)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        # dynamic means adapt to the batch_size
        inputs_rnn = tf.reshape(inputs, [1, -1, lstm_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cell, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state.shape)
        outputs = tf.reshape(outputs_rnn, [-1, lstm_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=num_classes)
        print(logits.shape)
        #predictions = tf.nn.softmax(logits)
        
        # logits are the action logits
        return logits, final_state

In [16]:
def model_loss(action_size, hidden_size, states, cell, initial_state, actions, targetQs):
    actions_logits, final_state = generator(states=states, cell=cell, initial_state=initial_state, 
                                            lstm_size=hidden_size, num_classes=action_size)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    Qs = tf.reduce_max(actions_logits*actions_labels, axis=1)
    loss = tf.reduce_mean(tf.square(Qs - targetQs))
    return actions_logits, final_state, loss

In [17]:
def model_opt(loss, learning_rate):
    """
    Get optimization operations in order
    :param loss: Generator loss Tensor for action prediction
    :param learning_rate: Learning Rate Placeholder
    :return: A tuple of (qfunction training, generator training, discriminator training)
    """
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]

    # # Optimize
    # with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
    # #opt = tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=g_vars)

    #grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(loss, g_vars), clip_norm=5) # usually around 1-5
    grads = tf.gradients(loss, g_vars)
    opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(grads, g_vars))

    return opt

In [18]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, cell, self.initial_state = model_input(
            state_size=state_size, lstm_size=hidden_size)
        
        # Create the Model: calculating the loss and forwad pass
        self.actions_logits, self.final_state, self.loss = model_loss(
            action_size=action_size, hidden_size=hidden_size, 
            states=self.states, actions=self.actions, 
            targetQs=self.targetQs, cell=cell, initial_state=self.initial_state)

        # Update the model: backward pass and backprop
        self.opt = model_opt(loss=self.loss, learning_rate=learning_rate)

In [19]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
        self.states = deque(maxlen=max_size)
#     def sample(self, batch_size):
#         idx = np.random.choice(np.arange(len(self.buffer)), 
#                                size=batch_size, 
#                                replace=False)
#         return [self.buffer[ii] for ii in idx], [self.states[ii] for ii in idx]

In [20]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01            # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
action_size = 4
state_size = 37
hidden_size = 37*4             # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 128            # memory capacity
batch_size = 128             # experience mini-batch size
gamma = 0.99                 # future reward discount

In [21]:
# Reset/init the graph/session
graph = tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

(?, 37) (?, 148)
(1, ?, 148) (1, 148)
(1, ?, 148) (1, 148)
(?, 148)
(?, 4)


In [22]:
# state = env.reset()
# for _ in range(batch_size):
#     action = env.action_space.sample()
#     next_state, reward, done, _ = env.step(action)
#     memory.buffer.append([state, action, next_state, reward, float(done)])
#     state = next_state
#     if done is True:
#         state = env.reset()

In [23]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]   # get the state
for _ in range(memory_size):
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    memory.buffer.append([state, action, next_state, reward, float(done)])
    state = next_state
    if done:                                       # exit loop if episode finished
        env_info = env.reset(train_mode=True)[brain_name] # reset the environment
        state = env_info.vector_observations[0]   # get the state
        break

In [None]:
# Save/load the model and save for plotting
saver = tf.train.Saver()
episode_rewards_list, rewards_list, loss_list = [], [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    total_step = 0 # Explore or exploit parameter
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        loss_batch = []
        #state = env.reset()
        env_info = env.reset(train_mode=True)[brain_name] # reset the environment
        state = env_info.vector_observations[0]   # get the current state
        initial_state = sess.run(model.initial_state)

        # Training steps/batches
        while True:
            action_logits, final_state = sess.run([model.actions_logits, model.final_state],
                                                  feed_dict = {model.states: state.reshape([1, -1]), 
                                                               model.initial_state: initial_state})
            # Explore (Env) or Exploit (Model)
            total_step += 1
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            if explore_p > np.random.rand():
                #action = env.action_space.sample()
                action = np.random.randint(action_size)        # select an action
            else:
                action = np.argmax(action_logits)
            #state, reward, done, _ = env.step(action)
            env_info = env.step(action)[brain_name]        # send the action to the environment
            next_state = env_info.vector_observations[0]   # get the next state
            reward = env_info.rewards[0]                   # get the reward
            done = env_info.local_done[0]                  # see if episode has finished
            memory.buffer.append([state, action, next_state, reward, float(done)])
            memory.states.append([initial_state, final_state])
            total_reward += reward
            initial_state = final_state
            state = next_state
            
            # Training
            #batch, rnn_states = memory.sample(batch_size)
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            rnn_states = memory.states
            initial_states = np.array([each[0] for each in rnn_states])
            final_states = np.array([each[1] for each in rnn_states])
            next_actions_logits = sess.run(model.actions_logits, 
                                           feed_dict = {model.states: next_states, 
                                                        model.initial_state: final_states[0].reshape([1, -1])})
            nextQs = np.max(next_actions_logits, axis=1) * (1-dones)
            targetQs = rewards + (gamma * nextQs)
            loss, _ = sess.run([model.loss, model.opt], feed_dict = {model.states: states, 
                                                                     model.actions: actions,
                                                                     model.targetQs: targetQs,
                                                        model.initial_state: initial_states[0].reshape([1, -1])})
            loss_batch.append(loss)
            if done is True:
                break
                
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{}'.format(total_reward),
              'loss:{:.4f}'.format(np.mean(loss_batch)),
              'exploreP:{:.4f}'.format(explore_p))
        # Ploting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        loss_list.append([ep, np.mean(loss_batch)])
        # Break episode/epoch loop
        if np.mean(episode_reward) >= +13:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model-nav-seq.ckpt')

Episode:0 meanR:2.0000 R:2.0 loss:0.3006 exploreP:0.9707
Episode:1 meanR:0.0000 R:-2.0 loss:0.1314 exploreP:0.9423
Episode:2 meanR:-0.6667 R:-2.0 loss:0.1372 exploreP:0.9148
Episode:3 meanR:-0.7500 R:-1.0 loss:0.1735 exploreP:0.8881
Episode:4 meanR:-0.6000 R:0.0 loss:0.1825 exploreP:0.8621
Episode:5 meanR:-0.1667 R:2.0 loss:0.2299 exploreP:0.8369
Episode:6 meanR:-0.2857 R:-1.0 loss:0.2194 exploreP:0.8125
Episode:7 meanR:-0.2500 R:0.0 loss:0.2632 exploreP:0.7888
Episode:8 meanR:0.0000 R:2.0 loss:0.1097 exploreP:0.7657
Episode:9 meanR:-0.1000 R:-1.0 loss:0.1322 exploreP:0.7434
Episode:10 meanR:-0.0909 R:0.0 loss:0.1182 exploreP:0.7217
Episode:11 meanR:-0.1667 R:-1.0 loss:0.0977 exploreP:0.7007
Episode:12 meanR:-0.1538 R:0.0 loss:0.1603 exploreP:0.6803
Episode:13 meanR:0.0000 R:2.0 loss:0.1571 exploreP:0.6605
Episode:14 meanR:-0.0667 R:-1.0 loss:0.1785 exploreP:0.6413
Episode:15 meanR:-0.1875 R:-2.0 loss:0.1386 exploreP:0.6226
Episode:16 meanR:-0.1765 R:0.0 loss:0.1860 exploreP:0.6045
Epi

Episode:140 meanR:0.8000 R:0.0 loss:0.0622 exploreP:0.0244
Episode:141 meanR:0.8300 R:2.0 loss:0.0351 exploreP:0.0240
Episode:142 meanR:0.8600 R:5.0 loss:0.0508 exploreP:0.0236
Episode:143 meanR:0.8700 R:1.0 loss:0.0585 exploreP:0.0232
Episode:144 meanR:0.8700 R:2.0 loss:0.0450 exploreP:0.0228
Episode:145 meanR:0.9200 R:5.0 loss:0.1082 exploreP:0.0224
Episode:146 meanR:0.9800 R:3.0 loss:0.0893 exploreP:0.0220
Episode:147 meanR:1.0100 R:3.0 loss:0.0449 exploreP:0.0217
Episode:148 meanR:1.0500 R:3.0 loss:0.0566 exploreP:0.0213
Episode:149 meanR:1.0700 R:3.0 loss:0.0565 exploreP:0.0210
Episode:150 meanR:1.0700 R:0.0 loss:0.0284 exploreP:0.0207
Episode:151 meanR:1.1500 R:5.0 loss:0.0446 exploreP:0.0204
Episode:152 meanR:1.1500 R:0.0 loss:0.0560 exploreP:0.0201
Episode:153 meanR:1.1400 R:1.0 loss:0.0556 exploreP:0.0198
Episode:154 meanR:1.1500 R:1.0 loss:0.0675 exploreP:0.0195
Episode:155 meanR:1.1500 R:1.0 loss:0.0858 exploreP:0.0192
Episode:156 meanR:1.1300 R:0.0 loss:0.0627 exploreP:0.01

Episode:279 meanR:2.8000 R:2.0 loss:0.0887 exploreP:0.0102
Episode:280 meanR:2.8200 R:0.0 loss:0.0842 exploreP:0.0102
Episode:281 meanR:2.7900 R:0.0 loss:0.0665 exploreP:0.0102
Episode:282 meanR:2.7900 R:1.0 loss:0.0450 exploreP:0.0102
Episode:283 meanR:2.7900 R:4.0 loss:0.0649 exploreP:0.0102
Episode:284 meanR:2.8000 R:5.0 loss:0.1016 exploreP:0.0102
Episode:285 meanR:2.8400 R:7.0 loss:0.1168 exploreP:0.0102
Episode:286 meanR:2.8900 R:6.0 loss:0.1060 exploreP:0.0102
Episode:287 meanR:2.9300 R:4.0 loss:0.0896 exploreP:0.0102
Episode:288 meanR:2.9700 R:3.0 loss:0.0703 exploreP:0.0102
Episode:289 meanR:2.9800 R:3.0 loss:0.0944 exploreP:0.0102
Episode:290 meanR:3.0400 R:7.0 loss:0.0671 exploreP:0.0102
Episode:291 meanR:3.0700 R:7.0 loss:0.0608 exploreP:0.0102
Episode:292 meanR:3.1100 R:6.0 loss:0.0463 exploreP:0.0102
Episode:293 meanR:3.1300 R:1.0 loss:0.0771 exploreP:0.0101
Episode:294 meanR:3.1500 R:4.0 loss:0.0415 exploreP:0.0101
Episode:295 meanR:3.1600 R:6.0 loss:0.0515 exploreP:0.01

Episode:418 meanR:2.7700 R:5.0 loss:0.0575 exploreP:0.0100
Episode:419 meanR:2.7400 R:3.0 loss:0.0620 exploreP:0.0100
Episode:420 meanR:2.6900 R:2.0 loss:0.0996 exploreP:0.0100
Episode:421 meanR:2.6800 R:2.0 loss:0.0498 exploreP:0.0100
Episode:422 meanR:2.6700 R:1.0 loss:0.0424 exploreP:0.0100
Episode:423 meanR:2.6800 R:4.0 loss:0.0598 exploreP:0.0100
Episode:424 meanR:2.6900 R:5.0 loss:0.0483 exploreP:0.0100
Episode:425 meanR:2.7300 R:5.0 loss:0.0892 exploreP:0.0100
Episode:426 meanR:2.6900 R:0.0 loss:0.0738 exploreP:0.0100
Episode:427 meanR:2.6300 R:2.0 loss:0.0893 exploreP:0.0100
Episode:428 meanR:2.6500 R:3.0 loss:0.1183 exploreP:0.0100
Episode:429 meanR:2.6400 R:4.0 loss:0.0580 exploreP:0.0100
Episode:430 meanR:2.6400 R:1.0 loss:0.0434 exploreP:0.0100
Episode:431 meanR:2.6800 R:4.0 loss:0.0648 exploreP:0.0100
Episode:432 meanR:2.6800 R:1.0 loss:0.0722 exploreP:0.0100
Episode:433 meanR:2.6700 R:2.0 loss:0.0510 exploreP:0.0100
Episode:434 meanR:2.6700 R:3.0 loss:0.0739 exploreP:0.01

Episode:557 meanR:2.7900 R:-1.0 loss:0.0525 exploreP:0.0100
Episode:558 meanR:2.7900 R:6.0 loss:0.0464 exploreP:0.0100
Episode:559 meanR:2.7300 R:1.0 loss:0.0596 exploreP:0.0100
Episode:560 meanR:2.8100 R:8.0 loss:0.0576 exploreP:0.0100
Episode:561 meanR:2.8400 R:2.0 loss:0.0461 exploreP:0.0100
Episode:562 meanR:2.8300 R:3.0 loss:0.0594 exploreP:0.0100
Episode:563 meanR:2.8100 R:1.0 loss:0.0726 exploreP:0.0100
Episode:564 meanR:2.8000 R:2.0 loss:0.0400 exploreP:0.0100
Episode:565 meanR:2.7500 R:-2.0 loss:0.0315 exploreP:0.0100
Episode:566 meanR:2.7900 R:6.0 loss:0.0542 exploreP:0.0100
Episode:567 meanR:2.7500 R:0.0 loss:0.0336 exploreP:0.0100
Episode:568 meanR:2.7300 R:1.0 loss:0.0331 exploreP:0.0100
Episode:569 meanR:2.7200 R:0.0 loss:0.0320 exploreP:0.0100
Episode:570 meanR:2.6100 R:-1.0 loss:0.0150 exploreP:0.0100
Episode:571 meanR:2.5500 R:2.0 loss:0.0326 exploreP:0.0100
Episode:572 meanR:2.5300 R:0.0 loss:0.0332 exploreP:0.0100
Episode:573 meanR:2.5000 R:2.0 loss:0.0331 exploreP:0

Episode:696 meanR:1.6000 R:2.0 loss:0.0311 exploreP:0.0100
Episode:697 meanR:1.6000 R:4.0 loss:0.0630 exploreP:0.0100
Episode:698 meanR:1.5200 R:0.0 loss:0.0263 exploreP:0.0100
Episode:699 meanR:1.5200 R:4.0 loss:0.0327 exploreP:0.0100
Episode:700 meanR:1.5100 R:0.0 loss:0.0190 exploreP:0.0100
Episode:701 meanR:1.5100 R:3.0 loss:0.0382 exploreP:0.0100
Episode:702 meanR:1.4600 R:-3.0 loss:0.0499 exploreP:0.0100
Episode:703 meanR:1.4900 R:5.0 loss:0.0283 exploreP:0.0100
Episode:704 meanR:1.4600 R:-1.0 loss:0.0405 exploreP:0.0100
Episode:705 meanR:1.4500 R:4.0 loss:0.0521 exploreP:0.0100
Episode:706 meanR:1.4500 R:1.0 loss:0.0813 exploreP:0.0100
Episode:707 meanR:1.4500 R:2.0 loss:0.0805 exploreP:0.0100
Episode:708 meanR:1.4700 R:2.0 loss:0.0624 exploreP:0.0100
Episode:709 meanR:1.4800 R:3.0 loss:0.0770 exploreP:0.0100
Episode:710 meanR:1.5400 R:6.0 loss:0.1142 exploreP:0.0100
Episode:711 meanR:1.6300 R:9.0 loss:0.1242 exploreP:0.0100
Episode:712 meanR:1.6600 R:5.0 loss:0.1170 exploreP:0.

Episode:835 meanR:2.5000 R:4.0 loss:0.0676 exploreP:0.0100
Episode:836 meanR:2.5100 R:0.0 loss:0.0460 exploreP:0.0100
Episode:837 meanR:2.5300 R:5.0 loss:0.0520 exploreP:0.0100
Episode:838 meanR:2.6100 R:9.0 loss:0.0708 exploreP:0.0100
Episode:839 meanR:2.7100 R:7.0 loss:0.1382 exploreP:0.0100
Episode:840 meanR:2.7800 R:5.0 loss:0.1435 exploreP:0.0100
Episode:841 meanR:2.7600 R:1.0 loss:0.0723 exploreP:0.0100
Episode:842 meanR:2.8100 R:5.0 loss:0.0619 exploreP:0.0100
Episode:843 meanR:2.8500 R:2.0 loss:0.0333 exploreP:0.0100
Episode:844 meanR:2.9100 R:1.0 loss:0.0319 exploreP:0.0100
Episode:845 meanR:2.9400 R:6.0 loss:0.0295 exploreP:0.0100
Episode:846 meanR:2.8900 R:1.0 loss:0.0321 exploreP:0.0100
Episode:847 meanR:2.8200 R:0.0 loss:0.0401 exploreP:0.0100
Episode:848 meanR:2.8400 R:7.0 loss:0.0270 exploreP:0.0100
Episode:849 meanR:2.8800 R:5.0 loss:0.0625 exploreP:0.0100
Episode:850 meanR:2.8900 R:8.0 loss:0.1182 exploreP:0.0100
Episode:851 meanR:2.8200 R:0.0 loss:0.0980 exploreP:0.01

Episode:974 meanR:3.1200 R:3.0 loss:0.0356 exploreP:0.0100
Episode:975 meanR:3.1900 R:7.0 loss:0.0602 exploreP:0.0100
Episode:976 meanR:3.2600 R:6.0 loss:0.0949 exploreP:0.0100
Episode:977 meanR:3.2800 R:3.0 loss:0.1084 exploreP:0.0100
Episode:978 meanR:3.3700 R:7.0 loss:0.1591 exploreP:0.0100
Episode:979 meanR:3.3700 R:2.0 loss:0.1353 exploreP:0.0100
Episode:980 meanR:3.3800 R:2.0 loss:0.0566 exploreP:0.0100
Episode:981 meanR:3.4200 R:5.0 loss:0.0776 exploreP:0.0100
Episode:982 meanR:3.4800 R:7.0 loss:0.0765 exploreP:0.0100
Episode:983 meanR:3.5100 R:2.0 loss:0.0618 exploreP:0.0100
Episode:984 meanR:3.4900 R:1.0 loss:0.0259 exploreP:0.0100
Episode:985 meanR:3.5000 R:3.0 loss:0.0423 exploreP:0.0100
Episode:986 meanR:3.4900 R:2.0 loss:0.0699 exploreP:0.0100
Episode:987 meanR:3.5000 R:1.0 loss:0.0314 exploreP:0.0100
Episode:988 meanR:3.4800 R:1.0 loss:0.0235 exploreP:0.0100
Episode:989 meanR:3.4800 R:-1.0 loss:0.0270 exploreP:0.0100
Episode:990 meanR:3.4500 R:1.0 loss:0.0251 exploreP:0.0

Episode:1111 meanR:3.1800 R:5.0 loss:0.0979 exploreP:0.0100
Episode:1112 meanR:3.1900 R:7.0 loss:0.1016 exploreP:0.0100
Episode:1113 meanR:3.2200 R:6.0 loss:0.1114 exploreP:0.0100
Episode:1114 meanR:3.2600 R:7.0 loss:0.0752 exploreP:0.0100
Episode:1115 meanR:3.2700 R:4.0 loss:0.0704 exploreP:0.0100
Episode:1116 meanR:3.2700 R:6.0 loss:0.0477 exploreP:0.0100
Episode:1117 meanR:3.3400 R:8.0 loss:0.0734 exploreP:0.0100
Episode:1118 meanR:3.3600 R:5.0 loss:0.0786 exploreP:0.0100
Episode:1119 meanR:3.2400 R:1.0 loss:0.0406 exploreP:0.0100
Episode:1120 meanR:3.2000 R:3.0 loss:0.0546 exploreP:0.0100
Episode:1121 meanR:3.1800 R:1.0 loss:0.0257 exploreP:0.0100
Episode:1122 meanR:3.1700 R:3.0 loss:0.0506 exploreP:0.0100
Episode:1123 meanR:3.2300 R:6.0 loss:0.0492 exploreP:0.0100
Episode:1124 meanR:3.3100 R:11.0 loss:0.0570 exploreP:0.0100
Episode:1125 meanR:3.3300 R:2.0 loss:0.0746 exploreP:0.0100
Episode:1126 meanR:3.3700 R:5.0 loss:0.0835 exploreP:0.0100
Episode:1127 meanR:3.4000 R:5.0 loss:0.

Episode:1248 meanR:3.5400 R:8.0 loss:0.0395 exploreP:0.0100
Episode:1249 meanR:3.6000 R:9.0 loss:0.0721 exploreP:0.0100
Episode:1250 meanR:3.6800 R:15.0 loss:0.0888 exploreP:0.0100
Episode:1251 meanR:3.6800 R:9.0 loss:0.1118 exploreP:0.0100
Episode:1252 meanR:3.6700 R:2.0 loss:0.1030 exploreP:0.0100
Episode:1253 meanR:3.6000 R:0.0 loss:0.0397 exploreP:0.0100
Episode:1254 meanR:3.5600 R:2.0 loss:0.0403 exploreP:0.0100
Episode:1255 meanR:3.5300 R:1.0 loss:0.0513 exploreP:0.0100
Episode:1256 meanR:3.5000 R:2.0 loss:0.0365 exploreP:0.0100
Episode:1257 meanR:3.4500 R:2.0 loss:0.0459 exploreP:0.0100
Episode:1258 meanR:3.4500 R:1.0 loss:0.0325 exploreP:0.0100
Episode:1259 meanR:3.4800 R:3.0 loss:0.0226 exploreP:0.0100
Episode:1260 meanR:3.4900 R:2.0 loss:0.0253 exploreP:0.0100
Episode:1261 meanR:3.4700 R:2.0 loss:0.0245 exploreP:0.0100
Episode:1262 meanR:3.4500 R:1.0 loss:0.0183 exploreP:0.0100
Episode:1263 meanR:3.4200 R:3.0 loss:0.0510 exploreP:0.0100
Episode:1264 meanR:3.3900 R:5.0 loss:0.

Episode:1385 meanR:3.2400 R:0.0 loss:0.0113 exploreP:0.0100
Episode:1386 meanR:3.2100 R:1.0 loss:0.0182 exploreP:0.0100
Episode:1387 meanR:3.2100 R:4.0 loss:0.0300 exploreP:0.0100
Episode:1388 meanR:3.2000 R:3.0 loss:0.0625 exploreP:0.0100
Episode:1389 meanR:3.1500 R:3.0 loss:0.0953 exploreP:0.0100
Episode:1390 meanR:3.2100 R:6.0 loss:0.0703 exploreP:0.0100
Episode:1391 meanR:3.2100 R:3.0 loss:0.0888 exploreP:0.0100
Episode:1392 meanR:3.2000 R:3.0 loss:0.0502 exploreP:0.0100
Episode:1393 meanR:3.1500 R:-1.0 loss:0.0515 exploreP:0.0100
Episode:1394 meanR:3.1300 R:0.0 loss:0.0596 exploreP:0.0100
Episode:1395 meanR:3.1000 R:0.0 loss:0.0320 exploreP:0.0100
Episode:1396 meanR:3.0600 R:0.0 loss:0.0197 exploreP:0.0100
Episode:1397 meanR:3.0300 R:0.0 loss:0.0253 exploreP:0.0100
Episode:1398 meanR:3.0000 R:1.0 loss:0.0146 exploreP:0.0100
Episode:1399 meanR:2.9800 R:0.0 loss:0.0050 exploreP:0.0100
Episode:1400 meanR:2.9800 R:3.0 loss:0.0180 exploreP:0.0100
Episode:1401 meanR:2.9100 R:-1.0 loss:0

Episode:1522 meanR:3.0300 R:2.0 loss:0.0298 exploreP:0.0100
Episode:1523 meanR:3.0400 R:1.0 loss:0.1855 exploreP:0.0100
Episode:1524 meanR:3.0300 R:0.0 loss:0.2652 exploreP:0.0100
Episode:1525 meanR:2.9700 R:0.0 loss:0.4620 exploreP:0.0100
Episode:1526 meanR:2.9600 R:-1.0 loss:0.5227 exploreP:0.0100
Episode:1527 meanR:2.9000 R:0.0 loss:0.2315 exploreP:0.0100
Episode:1528 meanR:2.9100 R:0.0 loss:0.2416 exploreP:0.0100
Episode:1529 meanR:2.8900 R:1.0 loss:0.2956 exploreP:0.0100
Episode:1530 meanR:2.8800 R:1.0 loss:0.4353 exploreP:0.0100
Episode:1531 meanR:2.8100 R:-1.0 loss:0.6211 exploreP:0.0100
Episode:1532 meanR:2.7800 R:-1.0 loss:0.7440 exploreP:0.0100
Episode:1533 meanR:2.7800 R:1.0 loss:0.5380 exploreP:0.0100
Episode:1534 meanR:2.7700 R:0.0 loss:0.3772 exploreP:0.0100
Episode:1535 meanR:2.6800 R:0.0 loss:0.5756 exploreP:0.0100
Episode:1536 meanR:2.6300 R:-1.0 loss:0.7468 exploreP:0.0100
Episode:1537 meanR:2.5800 R:0.0 loss:0.6362 exploreP:0.0100
Episode:1538 meanR:2.5700 R:0.0 loss

Episode:1659 meanR:0.0800 R:0.0 loss:1.9532 exploreP:0.0100
Episode:1660 meanR:0.0800 R:0.0 loss:1.6676 exploreP:0.0100
Episode:1661 meanR:0.0900 R:1.0 loss:1.4550 exploreP:0.0100
Episode:1662 meanR:0.0900 R:1.0 loss:1.7424 exploreP:0.0100
Episode:1663 meanR:0.1100 R:1.0 loss:1.9016 exploreP:0.0100
Episode:1664 meanR:0.1000 R:-1.0 loss:1.8681 exploreP:0.0100
Episode:1665 meanR:0.1000 R:0.0 loss:1.7777 exploreP:0.0100
Episode:1666 meanR:0.1100 R:1.0 loss:1.7234 exploreP:0.0100
Episode:1667 meanR:0.1200 R:1.0 loss:2.0769 exploreP:0.0100
Episode:1668 meanR:0.1100 R:0.0 loss:1.8078 exploreP:0.0100
Episode:1669 meanR:0.1100 R:0.0 loss:1.8279 exploreP:0.0100
Episode:1670 meanR:0.1100 R:0.0 loss:2.2876 exploreP:0.0100
Episode:1671 meanR:0.1200 R:1.0 loss:1.9945 exploreP:0.0100
Episode:1672 meanR:0.1100 R:-1.0 loss:1.9339 exploreP:0.0100
Episode:1673 meanR:0.1100 R:0.0 loss:1.8242 exploreP:0.0100
Episode:1674 meanR:0.1200 R:2.0 loss:1.8621 exploreP:0.0100
Episode:1675 meanR:0.1200 R:0.0 loss:1

Episode:1795 meanR:-0.0400 R:0.0 loss:1.6479 exploreP:0.0100
Episode:1796 meanR:-0.0400 R:0.0 loss:1.8446 exploreP:0.0100
Episode:1797 meanR:-0.0300 R:0.0 loss:1.8929 exploreP:0.0100
Episode:1798 meanR:-0.0200 R:1.0 loss:2.1121 exploreP:0.0100
Episode:1799 meanR:-0.0200 R:0.0 loss:2.1385 exploreP:0.0100
Episode:1800 meanR:-0.0200 R:0.0 loss:1.3831 exploreP:0.0100
Episode:1801 meanR:0.0000 R:2.0 loss:1.6224 exploreP:0.0100
Episode:1802 meanR:0.0100 R:0.0 loss:1.7583 exploreP:0.0100
Episode:1803 meanR:0.0100 R:0.0 loss:2.1670 exploreP:0.0100
Episode:1804 meanR:0.0100 R:0.0 loss:1.8681 exploreP:0.0100
Episode:1805 meanR:0.0100 R:0.0 loss:2.0401 exploreP:0.0100
Episode:1806 meanR:-0.0100 R:-2.0 loss:2.1598 exploreP:0.0100
Episode:1807 meanR:-0.0100 R:0.0 loss:1.8055 exploreP:0.0100
Episode:1808 meanR:0.0000 R:0.0 loss:1.7614 exploreP:0.0100
Episode:1809 meanR:-0.0100 R:-1.0 loss:2.1143 exploreP:0.0100
Episode:1810 meanR:-0.0100 R:0.0 loss:2.0077 exploreP:0.0100
Episode:1811 meanR:-0.0400 R

Episode:1931 meanR:-0.0100 R:0.0 loss:1.7744 exploreP:0.0100
Episode:1932 meanR:-0.0100 R:0.0 loss:1.6940 exploreP:0.0100
Episode:1933 meanR:-0.0100 R:-1.0 loss:1.4248 exploreP:0.0100
Episode:1934 meanR:-0.0200 R:0.0 loss:1.7604 exploreP:0.0100
Episode:1935 meanR:0.0000 R:2.0 loss:1.8475 exploreP:0.0100
Episode:1936 meanR:-0.0200 R:0.0 loss:1.8518 exploreP:0.0100
Episode:1937 meanR:-0.0400 R:-2.0 loss:1.9624 exploreP:0.0100
Episode:1938 meanR:-0.0400 R:0.0 loss:1.8162 exploreP:0.0100
Episode:1939 meanR:-0.0600 R:-1.0 loss:1.3926 exploreP:0.0100
Episode:1940 meanR:-0.0700 R:-1.0 loss:1.8157 exploreP:0.0100
Episode:1941 meanR:-0.0700 R:0.0 loss:1.8001 exploreP:0.0100
Episode:1942 meanR:-0.0800 R:-1.0 loss:1.9049 exploreP:0.0100
Episode:1943 meanR:-0.0700 R:1.0 loss:1.8417 exploreP:0.0100
Episode:1944 meanR:-0.0800 R:0.0 loss:1.8413 exploreP:0.0100
Episode:1945 meanR:-0.0800 R:1.0 loss:1.3323 exploreP:0.0100
Episode:1946 meanR:-0.0800 R:0.0 loss:1.9379 exploreP:0.0100
Episode:1947 meanR:-

Episode:2065 meanR:0.0000 R:0.0 loss:2.2217 exploreP:0.0100
Episode:2066 meanR:0.0000 R:1.0 loss:2.0053 exploreP:0.0100
Episode:2067 meanR:-0.0100 R:0.0 loss:1.7809 exploreP:0.0100
Episode:2068 meanR:0.0100 R:1.0 loss:1.8351 exploreP:0.0100
Episode:2069 meanR:0.0300 R:2.0 loss:1.7192 exploreP:0.0100
Episode:2070 meanR:0.0300 R:0.0 loss:2.2946 exploreP:0.0100
Episode:2071 meanR:0.0400 R:1.0 loss:1.8679 exploreP:0.0100
Episode:2072 meanR:0.0400 R:0.0 loss:1.7854 exploreP:0.0100
Episode:2073 meanR:0.0400 R:1.0 loss:1.8096 exploreP:0.0100
Episode:2074 meanR:0.0500 R:0.0 loss:2.0625 exploreP:0.0100
Episode:2075 meanR:0.0500 R:0.0 loss:2.0929 exploreP:0.0100
Episode:2076 meanR:0.0500 R:0.0 loss:1.9020 exploreP:0.0100
Episode:2077 meanR:0.0500 R:0.0 loss:2.0081 exploreP:0.0100
Episode:2078 meanR:0.0600 R:0.0 loss:1.6817 exploreP:0.0100
Episode:2079 meanR:0.0500 R:0.0 loss:1.8530 exploreP:0.0100
Episode:2080 meanR:0.0600 R:1.0 loss:1.9948 exploreP:0.0100
Episode:2081 meanR:0.0600 R:0.0 loss:1.

Episode:2201 meanR:-0.0700 R:0.0 loss:1.8632 exploreP:0.0100
Episode:2202 meanR:-0.0900 R:0.0 loss:1.9921 exploreP:0.0100
Episode:2203 meanR:-0.0900 R:0.0 loss:1.9160 exploreP:0.0100
Episode:2204 meanR:-0.1100 R:-1.0 loss:1.7094 exploreP:0.0100
Episode:2205 meanR:-0.0800 R:0.0 loss:2.0058 exploreP:0.0100
Episode:2206 meanR:-0.0700 R:0.0 loss:1.8576 exploreP:0.0100
Episode:2207 meanR:-0.0600 R:0.0 loss:1.8283 exploreP:0.0100
Episode:2208 meanR:-0.0500 R:0.0 loss:1.7886 exploreP:0.0100
Episode:2209 meanR:-0.0300 R:1.0 loss:1.9200 exploreP:0.0100
Episode:2210 meanR:-0.0300 R:0.0 loss:1.7522 exploreP:0.0100
Episode:2211 meanR:-0.0500 R:-1.0 loss:1.7907 exploreP:0.0100
Episode:2212 meanR:-0.0400 R:1.0 loss:1.8203 exploreP:0.0100
Episode:2213 meanR:-0.0300 R:0.0 loss:2.2734 exploreP:0.0100
Episode:2214 meanR:-0.0400 R:-1.0 loss:1.9416 exploreP:0.0100
Episode:2215 meanR:-0.0300 R:1.0 loss:1.8093 exploreP:0.0100
Episode:2216 meanR:-0.0300 R:-1.0 loss:1.6293 exploreP:0.0100
Episode:2217 meanR:-

Episode:2336 meanR:0.0300 R:-1.0 loss:1.7864 exploreP:0.0100
Episode:2337 meanR:0.0400 R:0.0 loss:1.8491 exploreP:0.0100
Episode:2338 meanR:0.0400 R:0.0 loss:1.9557 exploreP:0.0100
Episode:2339 meanR:0.0500 R:0.0 loss:2.2960 exploreP:0.0100
Episode:2340 meanR:0.0500 R:0.0 loss:1.9366 exploreP:0.0100
Episode:2341 meanR:0.0300 R:-1.0 loss:1.8702 exploreP:0.0100
Episode:2342 meanR:0.0300 R:0.0 loss:1.7944 exploreP:0.0100
Episode:2343 meanR:0.0500 R:1.0 loss:1.7193 exploreP:0.0100
Episode:2344 meanR:0.0600 R:0.0 loss:1.7836 exploreP:0.0100
Episode:2345 meanR:0.0600 R:0.0 loss:1.7988 exploreP:0.0100
Episode:2346 meanR:0.0600 R:-1.0 loss:1.9117 exploreP:0.0100
Episode:2347 meanR:0.0600 R:-1.0 loss:1.8982 exploreP:0.0100
Episode:2348 meanR:0.0700 R:0.0 loss:1.8113 exploreP:0.0100
Episode:2349 meanR:0.0700 R:0.0 loss:1.7892 exploreP:0.0100
Episode:2350 meanR:0.0800 R:0.0 loss:1.6144 exploreP:0.0100
Episode:2351 meanR:0.0800 R:0.0 loss:1.9677 exploreP:0.0100
Episode:2352 meanR:0.0700 R:0.0 loss

Episode:2472 meanR:0.1400 R:0.0 loss:2.0053 exploreP:0.0100
Episode:2473 meanR:0.1400 R:0.0 loss:2.2246 exploreP:0.0100
Episode:2474 meanR:0.1400 R:0.0 loss:1.8066 exploreP:0.0100
Episode:2475 meanR:0.1400 R:1.0 loss:1.8534 exploreP:0.0100
Episode:2476 meanR:0.1400 R:1.0 loss:1.6716 exploreP:0.0100
Episode:2477 meanR:0.1400 R:0.0 loss:1.7945 exploreP:0.0100
Episode:2478 meanR:0.1200 R:0.0 loss:1.9872 exploreP:0.0100
Episode:2479 meanR:0.1200 R:0.0 loss:1.9332 exploreP:0.0100
Episode:2480 meanR:0.1300 R:0.0 loss:1.9141 exploreP:0.0100
Episode:2481 meanR:0.1300 R:0.0 loss:1.6415 exploreP:0.0100
Episode:2482 meanR:0.1200 R:0.0 loss:1.7747 exploreP:0.0100
Episode:2483 meanR:0.1300 R:1.0 loss:1.6379 exploreP:0.0100
Episode:2484 meanR:0.1300 R:0.0 loss:1.8249 exploreP:0.0100
Episode:2485 meanR:0.1300 R:0.0 loss:1.8341 exploreP:0.0100
Episode:2486 meanR:0.1400 R:1.0 loss:1.3531 exploreP:0.0100
Episode:2487 meanR:0.1400 R:2.0 loss:1.8930 exploreP:0.0100
Episode:2488 meanR:0.1400 R:0.0 loss:2.0

Episode:2609 meanR:0.1200 R:0.0 loss:2.0363 exploreP:0.0100
Episode:2610 meanR:0.1300 R:1.0 loss:1.7624 exploreP:0.0100
Episode:2611 meanR:0.1500 R:2.0 loss:1.9283 exploreP:0.0100
Episode:2612 meanR:0.1400 R:0.0 loss:1.0947 exploreP:0.0100
Episode:2613 meanR:0.1400 R:0.0 loss:1.6462 exploreP:0.0100
Episode:2614 meanR:0.1400 R:0.0 loss:1.6206 exploreP:0.0100
Episode:2615 meanR:0.1200 R:-2.0 loss:1.8519 exploreP:0.0100
Episode:2616 meanR:0.1300 R:0.0 loss:1.7139 exploreP:0.0100
Episode:2617 meanR:0.1400 R:1.0 loss:2.0306 exploreP:0.0100
Episode:2618 meanR:0.1500 R:0.0 loss:1.7264 exploreP:0.0100
Episode:2619 meanR:0.1500 R:0.0 loss:1.7303 exploreP:0.0100
Episode:2620 meanR:0.1600 R:1.0 loss:1.8804 exploreP:0.0100
Episode:2621 meanR:0.1400 R:0.0 loss:1.9588 exploreP:0.0100
Episode:2622 meanR:0.1200 R:-1.0 loss:1.9020 exploreP:0.0100
Episode:2623 meanR:0.1200 R:0.0 loss:2.0338 exploreP:0.0100
Episode:2624 meanR:0.1400 R:1.0 loss:1.7522 exploreP:0.0100
Episode:2625 meanR:0.1500 R:0.0 loss:1

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

def running_mean(x, N):
    cumsum = np.cumsum(np.insert(x, 0, 0)) 
    return (cumsum[N:] - cumsum[:-N]) / N 

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Episode rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')

In [37]:
# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
    # Testing episodes/epochs
    for _ in range(1):
        total_reward = 0
        #state = env.reset()
        env_info = env.reset(train_mode=False)[brain_name] # reset the environment
        state = env_info.vector_observations[0]   # get the current state

        # Testing steps/batches
        while True:
            action_logits = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
            action = np.argmax(action_logits)
            #state, reward, done, _ = env.step(action)
            env_info = env.step(action)[brain_name]        # send the action to the environment
            state = env_info.vector_observations[0]   # get the next state
            reward = env_info.rewards[0]                   # get the reward
            done = env_info.local_done[0]                  # see if episode has finished
            total_reward += reward
            if done:
                break
                
        print('total_reward: {:.2f}'.format(total_reward))

INFO:tensorflow:Restoring parameters from checkpoints/model-nav.ckpt


total_reward: 14.00


In [None]:
# Be careful!!!!!!!!!!!!!!!!
# Closing the env
env.close()