# Navigation

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the first project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893).

### 1. Start the Environment

We begin by importing some necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Banana.app"`
- **Windows** (x86): `"path/to/Banana_Windows_x86/Banana.exe"`
- **Windows** (x86_64): `"path/to/Banana_Windows_x86_64/Banana.exe"`
- **Linux** (x86): `"path/to/Banana_Linux/Banana.x86"`
- **Linux** (x86_64): `"path/to/Banana_Linux/Banana.x86_64"`
- **Linux** (x86, headless): `"path/to/Banana_Linux_NoVis/Banana.x86"`
- **Linux** (x86_64, headless): `"path/to/Banana_Linux_NoVis/Banana.x86_64"`

For instance, if you are using a Mac, then you downloaded `Banana.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Banana.app")
```

In [2]:
# env = UnityEnvironment(file_name="/home/arasdar/VisualBanana_Linux/Banana.x86")
# env = UnityEnvironment(file_name="/home/aras/unity-envs/Banana_Linux/Banana.x86_64")
env = UnityEnvironment(file_name="/home/arasdar/unity-envs/Banana_Linux_NoVis/Banana.x86_64")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		
Unity brain name: BananaBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 37
        Number of stacked Vector Observation: 1
        Vector Action space type: discrete
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

The simulation contains a single agent that navigates a large environment.  At each time step, it has four actions at its disposal:
- `0` - walk forward 
- `1` - walk backward
- `2` - turn left
- `3` - turn right

The state space has `37` dimensions and contains the agent's velocity, along with ray-based perception of objects around the agent's forward direction.  A reward of `+1` is provided for collecting a yellow banana, and a reward of `-1` is provided for collecting a blue banana.
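
As a small illustration of the action encoding and reward scheme described above, the sketch below maps each action index to its meaning and sums per-step rewards into an episode score. The names `ACTION_NAMES` and `episode_score` are hypothetical helpers introduced here for illustration; they are not part of the Unity ML-Agents API.

```python
# Hypothetical lookup table for the four discrete actions listed above.
ACTION_NAMES = {
    0: "walk forward",
    1: "walk backward",
    2: "turn left",
    3: "turn right",
}


def episode_score(rewards):
    """Sum per-step rewards: +1 per yellow banana, -1 per blue banana."""
    return sum(rewards)


# Made-up episode: two yellow bananas, one blue banana, and empty steps.
rewards = [0.0, 1.0, 0.0, -1.0, 1.0]
print(ACTION_NAMES[2], episode_score(rewards))
```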

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents in the environment
print('Number of agents:', len(env_info.agents))

# number of actions
action_size = brain.vector_action_space_size
print('Number of actions:', action_size)

# examine the state space 
state = env_info.vector_observations[0]
# print('States look like:', state)
state_size = len(state)
print('States have length:', state_size)
# print(state.shape, len(env_info.vector_observations), env_info.vector_observations.shape)

Number of agents: 1
Number of actions: 4
States have length: 37


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you can watch how the agent performs when it selects an action uniformly at random at each time step.  A window should pop up that lets you observe the agent as it moves through the environment.

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!
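
One common way to move from purely random actions toward better ones is epsilon-greedy selection: act randomly with probability `epsilon` and otherwise pick the action with the highest estimated value. The sketch below is only an illustration under that assumption; `epsilon_greedy` and the value estimates are made up, and the training loop later in this notebook uses a plain `argmax` instead.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest estimated action value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, -0.3, 0.7, 0.2]   # made-up value estimates for the 4 actions
greedy_action = epsilon_greedy(q_values, epsilon=0.0)   # never explores
random_action = epsilon_greedy(q_values, epsilon=1.0)   # always explores
print(greedy_action, random_action)
```

Lowering `epsilon` over the course of training shifts the agent from exploration toward exploiting what it has learned.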

In [5]:
# env_info = env.reset(train_mode=False)[brain_name] # reset the environment
# state = env_info.vector_observations[0]            # get the current state
# score = 0                                          # initialize the score
# num_steps = 0
# while True:
#     num_steps += 1
#     action = np.random.randint(action_size)        # select an action
#     env_info = env.step(action)[brain_name]        # send the action to the environment
#     next_state = env_info.vector_observations[0]   # get the next state
#     reward = env_info.rewards[0]                   # get the reward
#     done = env_info.local_done[0]                  # see if episode has finished
#     score += reward                                # update the score
#     state = next_state                             # roll over the state to next time step
#     if done:                                       # exit loop if episode finished
#         print(state.shape)
#         break
    
# print("Score: {}".format(score))
# num_steps

When finished, you can close the environment.

In [6]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [7]:
# env_info = env.reset(train_mode=True)[brain_name] # reset the environment
# state = env_info.vector_observations[0]            # get the current state
# score = 0                                          # initialize the score
# num_steps = 0
# while True:
#     num_steps += 1
#     action = np.random.randint(action_size)        # select an action
#     env_info = env.step(action)[brain_name]        # send the action to the environment
#     next_state = env_info.vector_observations[0]   # get the next state
#     reward = env_info.rewards[0]                   # get the reward
#     done = env_info.local_done[0]                  # see if episode has finished
#     score += reward                                # update the score
#     state = next_state                             # roll over the state to next time step
#     #print(state)
#     if done:                                       # exit loop if episode finished
#         break
    
# print("Score: {}".format(score))
# num_steps

In [8]:
import tensorflow as tf
print('TensorFlow Version: {}'.format(tf.__version__))
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


In [9]:
# env_info = env.reset(train_mode=True)[brain_name] # reset the environment
# state = env_info.vector_observations[0]            # get the current state
# score = 0                                          # initialize the score
# batch = []
# num_steps = 0
# while True: # infinite number of steps
#     num_steps += 1
#     action = np.random.randint(action_size)        # select an action
#     env_info = env.step(action)[brain_name]        # send the action to the environment
#     next_state = env_info.vector_observations[0]   # get the next state
#     reward = env_info.rewards[0]                   # get the reward
#     done = env_info.local_done[0]                  # see if episode has finished
#     score += reward                                # update the score
#     #print(state, action, reward, done)
#     batch.append([state, action, next_state, reward, float(done)])
#     state = next_state                             # roll over the state to next time step
#     if done:                                       # exit loop if episode finished
#         break
    
# print("Score: {}".format(score))
# num_steps

In [10]:
# batch[0], batch[0][1]

In [11]:
# batch[0]

In [12]:
# # each row of batch is [state, action, next_state, reward, done]
# states = np.array([each[0] for each in batch])
# actions = np.array([each[1] for each in batch])
# next_states = np.array([each[2] for each in batch])
# rewards = np.array([each[3] for each in batch])
# dones = np.array([each[4] for each in batch])
# # infos = np.array([each[4] for each in batch])
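
The transitions are stored in the order used by the `batch.append([state, action, next_state, reward, float(done)])` call above, so each field has a fixed column position. As a minimal sketch with made-up two-dimensional states, `zip(*batch)` splits the rows into those five columns in one step.

```python
# Each transition is stored as [state, action, next_state, reward, done].
batch = [
    [[0.0, 0.1], 2, [0.1, 0.2], 0.0, 0.0],
    [[0.1, 0.2], 0, [0.2, 0.3], 1.0, 0.0],
    [[0.2, 0.3], 3, [0.3, 0.4], -1.0, 1.0],
]

# zip(*batch) transposes the list of rows into five columns.
states, actions, next_states, rewards, dones = (list(col) for col in zip(*batch))

print(actions)
print(rewards)
```

Each resulting list can then be passed to `np.array(...)` as in the cell above.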

In [13]:
# # print(rewards[:])
# print(np.array(rewards).shape, np.array(states).shape, np.array(actions).shape, np.array(dones).shape)
# print(np.array(rewards).dtype, np.array(states).dtype, np.array(actions).dtype, np.array(dones).dtype)
# print(np.max(np.array(actions)), np.min(np.array(actions)), 
#       (np.max(np.array(actions)) - np.min(np.array(actions)))+1)
# print(np.max(np.array(rewards)), np.min(np.array(rewards)))
# print(np.max(np.array(states)), np.min(np.array(states)))

In [18]:
def model_input(state_size, hidden_size, batch_size=1):
    states = tf.placeholder(tf.float32, [None, state_size], name='states')
    actions = tf.placeholder(tf.int32, [None], name='actions')
    next_states = tf.placeholder(tf.float32, [None, state_size], name='next_states')
    rewards = tf.placeholder(tf.float32, [None], name='rewards')
    dones = tf.placeholder(tf.float32, [None], name='dones')
    rate = tf.placeholder(tf.float32, [], name='rate') # success rate
    # RNN
    cell = tf.nn.rnn_cell.GRUCell(hidden_size)
    #cell = tf.nn.rnn_cell.LSTMCell(hidden_size)
    cells = tf.nn.rnn_cell.MultiRNNCell([cell], state_is_tuple=False)
    a_initial_state = cells.zero_state(batch_size, tf.float32)
    g_initial_state = cells.zero_state(batch_size, tf.float32)
    d_initial_state = cells.zero_state(batch_size, tf.float32)
    return states, actions, next_states, rewards, dones, rate, cells, a_initial_state, g_initial_state, d_initial_state

In [19]:
def actor(states, action_size, initial_state, cells, hidden_size, reuse=False): 
    with tf.variable_scope('actor', reuse=reuse):
        # First fully connected layer
        inputs = tf.layers.dense(inputs=states, units=hidden_size)
        print(states.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        # dynamic means adapt to the batch_size and
        # static means can NOT adapt to the batch_size
        inputs_rnn = tf.reshape(inputs, [1, -1, hidden_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cells, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state)
        outputs = tf.reshape(outputs_rnn, [-1, hidden_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=action_size)
        print(logits.shape)
        return logits, final_state

In [20]:
def generator(actions, state_size, initial_state, cells, hidden_size, reuse=False): 
    with tf.variable_scope('generator', reuse=reuse):
        # First fully connected layer
        inputs = tf.layers.dense(inputs=actions, units=hidden_size)
        print(actions.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        # dynamic means adapt to the batch_size and
        # static means can NOT adapt to the batch_size
        inputs_rnn = tf.reshape(inputs, [1, -1, hidden_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cells, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state)
        outputs = tf.reshape(outputs_rnn, [-1, hidden_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=state_size)
        print(logits.shape)
        return logits, final_state

In [21]:
def discriminator(states, actions, action_size, initial_state, cells, hidden_size, reuse=False): 
    with tf.variable_scope('discriminator', reuse=reuse):
        # First fully connected layer
        h = tf.layers.dense(inputs=states, units=action_size)
        h_fused = tf.concat(axis=1, values=[h, actions])
        inputs = tf.layers.dense(inputs=h_fused, units=hidden_size)
        print(h_fused.shape, inputs.shape)
        
        # with tf.variable_scope('dynamic_rnn_', reuse=tf.AUTO_REUSE):
        # dynamic means adapt to the batch_size and
        # static means can NOT adapt to the batch_size
        inputs_rnn = tf.reshape(inputs, [1, -1, hidden_size]) # NxH -> 1xNxH
        print(inputs_rnn.shape, initial_state.shape)
        outputs_rnn, final_state = tf.nn.dynamic_rnn(cell=cells, inputs=inputs_rnn, initial_state=initial_state)
        print(outputs_rnn.shape, final_state)
        outputs = tf.reshape(outputs_rnn, [-1, hidden_size]) # 1xNxH -> NxH
        print(outputs.shape)

        # Last fully connected layer
        logits = tf.layers.dense(inputs=outputs, units=1)
        print(logits.shape)
        return logits, final_state

In [16]:
def model_loss(state_size, action_size, hidden_size, 
               states, actions, next_states, rewards, dones, rate, 
               cells, a_initial_state, g_initial_state, d_initial_state):
    actions_logits, a_final_state = actor(states=states, hidden_size=hidden_size, action_size=action_size, 
                                          cells=cells, initial_state=a_initial_state)
    actions_labels = tf.one_hot(indices=actions, depth=action_size, dtype=actions_logits.dtype)
    a_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=actions_logits, 
                                                                       labels=actions_labels))
    next_states_logits, g_final_state = generator(actions=actions_logits, hidden_size=hidden_size, 
                                                  state_size=state_size, 
                                                  cells=cells, initial_state=g_initial_state)
    next_states_labels = tf.nn.sigmoid(next_states)
    a_loss += tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=next_states_logits, 
                                                                     labels=next_states_labels))
    Qlogits_real, d_final_state = discriminator(states=states, actions=actions_labels, hidden_size=hidden_size,
                                           action_size=action_size, cells=cells, initial_state=d_initial_state)
    d_loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=Qlogits_real[-1], 
                                                                    labels=rate*tf.ones_like(Qlogits_real[-1])))
    Qlogits_fake, d_final_state = discriminator(states=states, actions=actions_logits, hidden_size=hidden_size,
                                                action_size=action_size, cells=cells, 
                                                initial_state=d_initial_state, reuse=True)
    d_loss += tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=Qlogits_fake[-1], 
                                                                     labels=tf.zeros_like(Qlogits_real[-1])))
    a_loss2 = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=Qlogits_fake[-1], 
                                                                     labels=tf.ones_like(Qlogits_real[-1])))
    # Reuse the actor variables created above and keep a_final_state from the first call
    next_actions_logits, _ = actor(states=next_states, hidden_size=hidden_size, 
                                   action_size=action_size, cells=cells, 
                                   initial_state=a_initial_state, reuse=True)
    nextQlogits, _ = discriminator(states=next_states, actions=next_actions_logits, 
                                   hidden_size=hidden_size, action_size=action_size, cells=cells, 
                                   initial_state=d_initial_state, reuse=True)
    # Mask with (1 - dones) so that terminal transitions do not bootstrap
    nextQs = tf.reshape(nextQlogits, shape=[-1]) * (1 - dones)
    targetQs = rewards + (0.99*nextQs)
    Qs = tf.reshape(Qlogits_real, shape=[-1])
    loss = tf.reduce_mean(tf.square(Qs - targetQs))
    return actions_logits, a_final_state, loss
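
The actor term in the cell above combines `tf.one_hot` with `tf.nn.softmax_cross_entropy_with_logits_v2`. The sketch below reproduces that computation for a single example in plain Python, so the arithmetic can be checked by hand. It is an illustration only: the helper names are made up, and it is not the TensorFlow implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(index, depth):
    """Vector of zeros with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(depth)]


def softmax_cross_entropy(logits, labels):
    """Cross-entropy between the label distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(l * math.log(p) for l, p in zip(labels, probs) if l > 0)


logits = [0.0, 0.0, 0.0, 0.0]                 # uniform prediction over 4 actions
loss = softmax_cross_entropy(logits, one_hot(2, depth=4))
print(round(loss, 4))                         # equals log(4) for uniform logits
```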

In [17]:
def model_opt(loss, learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('generator')]

    # # Optimize MLP/CNN
    # with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
    # #opt = tf.train.AdamOptimizer(learning_rate).minimize(loss, var_list=g_vars)

    # # Optimize RNN
    #grads, _ = tf.clip_by_global_norm(t_list=tf.gradients(loss, g_vars), clip_norm=5) # usually around 1-5
    grads = tf.gradients(loss, g_vars)
    opt = tf.train.AdamOptimizer(learning_rate).apply_gradients(grads_and_vars=zip(grads, g_vars))

    return opt
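
The commented-out `tf.clip_by_global_norm` line in the cell above rescales all gradients jointly rather than clipping each one separately. As a rough sketch of that idea in plain Python, following the formula documented for the TensorFlow function (`clip_norm / max(global_norm, clip_norm)`), with gradients represented as hypothetical nested lists:

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients together so their joint L2 norm is <= clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], global_norm


grads = [[3.0, 0.0], [0.0, 4.0]]          # joint L2 norm is 5
clipped, norm = clip_by_global_norm(grads, clip_norm=1.0)
print(norm, clipped)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length shrinks.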

In [18]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, learning_rate):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, cells, self.initial_state = model_input(
                state_size=state_size, hidden_size=hidden_size)
        
        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.final_state, self.loss = model_loss(
            action_size=action_size, hidden_size=hidden_size, 
            states=self.states, actions=self.actions, 
            targetQs=self.targetQs, cells=cells, initial_state=self.initial_state)

        # Update the model: backward pass and backprop
        self.opt = model_opt(loss=self.loss, learning_rate=learning_rate)

In [19]:
from collections import deque

class Memory():    
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
        self.states = deque(maxlen=max_size)
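
The `Memory` class relies on `deque(maxlen=...)` to bound its size. The short sketch below, using a made-up capacity of 3, shows the eviction behavior: once the deque is full, each `append` drops the oldest entry.

```python
from collections import deque

buffer = deque(maxlen=3)
for step in range(5):
    buffer.append(step)      # once full, each append evicts the oldest item

print(list(buffer))  # [2, 3, 4]
```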

In [38]:
# Network parameters
action_size = 4
state_size = 37
hidden_size = 37*2             # number of units in each Q-network hidden layer
learning_rate = 0.0001         # Q-network learning rate

# Memory parameters
memory_size = 50            # memory capacity - 1000 DQN
batch_size = 50             # experience mini-batch size - 20 DQN
gamma = 0.99                 # future reward discount
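
The discount `gamma` enters the one-step target that the training loop builds as `rewards + gamma * nextQs`, where `nextQs` is masked by `(1 - dones)`. The sketch below shows that arithmetic in plain Python with made-up numbers; `td_targets` is a hypothetical helper, not a function defined in this notebook.

```python
def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """One-step targets: r + gamma * max_a Q(s', a), zeroed at episode end."""
    targets = []
    for r, q_next, done in zip(rewards, next_q_values, dones):
        bootstrap = max(q_next) * (1.0 - done)   # no bootstrapping when done == 1
        targets.append(r + gamma * bootstrap)
    return targets


rewards = [0.0, 1.0, -1.0]
next_q = [[0.5, 1.0, 0.0, 0.2], [0.0, 0.0, 2.0, 1.0], [3.0, 1.0, 0.0, 0.0]]
dones = [0.0, 0.0, 1.0]
targets = td_targets(rewards, next_q, dones)
print(targets)
```

The last transition is terminal, so its target is just the reward, regardless of the value estimates for the next state.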

In [39]:
# Reset/init the graph/session
tf.reset_default_graph()

# Init the model
model = Model(action_size=action_size, hidden_size=hidden_size, state_size=state_size, learning_rate=learning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

(?, 37) (?, 74)
(1, ?, 74) (<tf.Tensor 'MultiRNNCellZeroState/GRUCellZeroState/zeros:0' shape=(1, 74) dtype=float32>,)
(1, ?, 74) (<tf.Tensor 'generator/rnn/while/Exit_3:0' shape=(1, 74) dtype=float32>,)
(?, 74)
(?, 4)


In [40]:
model.initial_state[0]

<tf.Tensor 'MultiRNNCellZeroState/GRUCellZeroState/zeros:0' shape=(1, 74) dtype=float32>

In [41]:
# state = env.reset()
# for _ in range(batch_size):
#     action = env.action_space.sample()
#     next_state, reward, done, _ = env.step(action)
#     memory.buffer.append([state, action, next_state, reward, float(done)])
#     state = next_state
#     if done is True:
#         state = env.reset()

In [42]:
env_info = env.reset(train_mode=True)[brain_name] # reset the environment
state = env_info.vector_observations[0]   # get the state
for _ in range(memory_size):
    action = np.random.randint(action_size)        # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    memory.buffer.append([state, action, next_state, reward, float(done)])
    memory.states.append(np.zeros([1, hidden_size])) # initial_states for rnn/mem
    state = next_state
    if done:                                       # exit loop if episode finished
        env_info = env.reset(train_mode=True)[brain_name] # reset the environment
        state = env_info.vector_observations[0]   # get the state
        break

In [43]:
memory.states[0].shape, model.initial_state[0].shape # gru
# memory.states[0][1].shape, model.initial_state[0][1].shape #lstm

((1, 74), TensorShape([Dimension(1), Dimension(74)]))

In [None]:
saver = tf.train.Saver()
episode_rewards_list, rewards_list, loss_list = [], [], []

# TF session for training
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    episode_reward = deque(maxlen=100) # 100 episodes average/running average/running mean/window
    
    # Training episodes/epochs
    for ep in range(11111):
        total_reward = 0
        loss_batch = []
        #state = env.reset()
        env_info = env.reset(train_mode=True)[brain_name] # reset the environment
        state = env_info.vector_observations[0]   # get the current state
        initial_state = sess.run(model.initial_state)

        # Training steps/batches
        while True:
            action_logits, final_state = sess.run([model.actions_logits, model.final_state],
                                                  feed_dict = {model.states: state.reshape([1, -1]), 
                                                               model.initial_state: initial_state})
            action = np.argmax(action_logits)
            #state, reward, done, _ = env.step(action)
            env_info = env.step(action)[brain_name]        # send the action to the environment
            next_state = env_info.vector_observations[0]   # get the next state
            reward = env_info.rewards[0]                   # get the reward
            done = env_info.local_done[0]                  # see if episode has finished
            memory.buffer.append([state, action, next_state, reward, float(done)])
            memory.states.append(initial_state)
            total_reward += reward
            state = next_state
            initial_state = final_state
            
            # Training
            batch = memory.buffer
            states = np.array([each[0] for each in batch])
            actions = np.array([each[1] for each in batch])
            next_states = np.array([each[2] for each in batch])
            rewards = np.array([each[3] for each in batch])
            dones = np.array([each[4] for each in batch])
            initial_states = memory.states
            next_actions_logits = sess.run(model.actions_logits, 
                                           feed_dict = {model.states: next_states,
                                                        model.initial_state: initial_states[1]})
            nextQs = np.max(next_actions_logits, axis=1) * (1-dones)
            targetQs = rewards + (gamma * nextQs)
            loss, _ = sess.run([model.loss, model.opt], feed_dict = {model.states: states, 
                                                                     model.actions: actions,
                                                                     model.targetQs: targetQs,
                                                                     model.initial_state: initial_states[0]})
            # End of training
            loss_batch.append(loss)
            if done:
                break
                
        # Output: print progress for this episode
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'loss:{:.4f}'.format(np.mean(loss_batch)))
        # Record values for plotting
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        loss_list.append([ep, np.mean(loss_batch)])
        # Break episode/epoch loop
        if np.mean(episode_reward) >= +13:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:-1.0000 R:-1.0000 loss:0.0422
Episode:1 meanR:0.0000 R:1.0000 loss:0.0084
Episode:2 meanR:0.6667 R:2.0000 loss:0.0116
Episode:3 meanR:0.5000 R:0.0000 loss:0.0094
Episode:4 meanR:0.4000 R:0.0000 loss:0.0080
Episode:5 meanR:0.3333 R:0.0000 loss:0.0091
Episode:6 meanR:-0.1429 R:-3.0000 loss:0.0174
Episode:7 meanR:-0.2500 R:-1.0000 loss:0.0147
Episode:8 meanR:-0.4444 R:-2.0000 loss:0.0175
Episode:9 meanR:-0.6000 R:-2.0000 loss:0.0345
Episode:10 meanR:-0.6364 R:-1.0000 loss:0.0240
Episode:11 meanR:-0.5000 R:1.0000 loss:0.0127
Episode:12 meanR:-0.4615 R:0.0000 loss:0.0255
Episode:13 meanR:-0.4286 R:0.0000 loss:0.0313
Episode:14 meanR:-0.3333 R:1.0000 loss:0.0082
Episode:15 meanR:-0.2500 R:1.0000 loss:0.0160
Episode:16 meanR:-0.1765 R:1.0000 loss:0.0117
Episode:17 meanR:0.0000 R:3.0000 loss:0.0227
Episode:18 meanR:0.1053 R:2.0000 loss:0.1274
Episode:19 meanR:0.1000 R:0.0000 loss:0.0409
Episode:20 meanR:0.3810 R:6.0000 loss:0.0396
Episode:21 meanR:0.3182 R:-1.0000 loss:0.0478

Episode:180 meanR:7.5100 R:15.0000 loss:0.0701
Episode:181 meanR:7.6200 R:12.0000 loss:0.0703
Episode:182 meanR:7.6700 R:5.0000 loss:0.0617
Episode:183 meanR:7.7300 R:7.0000 loss:0.0609
Episode:184 meanR:7.7900 R:6.0000 loss:0.0489
Episode:185 meanR:7.8500 R:9.0000 loss:0.0490
Episode:186 meanR:7.8500 R:7.0000 loss:0.0358
Episode:187 meanR:7.8300 R:6.0000 loss:0.0412
Episode:188 meanR:7.8000 R:4.0000 loss:0.0640
Episode:189 meanR:7.8200 R:3.0000 loss:0.0564
Episode:190 meanR:7.8200 R:7.0000 loss:0.0347
Episode:191 meanR:7.7600 R:-2.0000 loss:0.0378
Episode:192 meanR:7.7000 R:2.0000 loss:0.0276
Episode:193 meanR:7.5700 R:0.0000 loss:0.0200
Episode:194 meanR:7.5200 R:5.0000 loss:0.0192
Episode:195 meanR:7.4600 R:1.0000 loss:0.0176
Episode:196 meanR:7.4000 R:4.0000 loss:0.0164
Episode:197 meanR:7.2800 R:4.0000 loss:0.0167
Episode:198 meanR:7.2000 R:6.0000 loss:0.0384
Episode:199 meanR:7.0900 R:0.0000 loss:0.0332
Episode:200 meanR:7.1200 R:14.0000 loss:0.0448
Episode:201 meanR:7.1600 R:11.

Episode:357 meanR:8.5200 R:9.0000 loss:0.1034
Episode:358 meanR:8.5400 R:10.0000 loss:0.0655
Episode:359 meanR:8.6300 R:13.0000 loss:0.0665
Episode:360 meanR:8.5900 R:8.0000 loss:0.0678
Episode:361 meanR:8.6900 R:16.0000 loss:0.0734
Episode:362 meanR:8.6000 R:6.0000 loss:0.1148
Episode:363 meanR:8.6300 R:15.0000 loss:0.0573
Episode:364 meanR:8.6000 R:6.0000 loss:0.0673
Episode:365 meanR:8.5600 R:7.0000 loss:0.0628
Episode:366 meanR:8.4700 R:7.0000 loss:0.0483
Episode:367 meanR:8.4000 R:4.0000 loss:0.0361
Episode:368 meanR:8.3600 R:13.0000 loss:0.0859
Episode:369 meanR:8.2900 R:6.0000 loss:0.0566
Episode:370 meanR:8.2900 R:9.0000 loss:0.0590
Episode:371 meanR:8.3700 R:14.0000 loss:0.0810
Episode:372 meanR:8.4100 R:9.0000 loss:0.0702
Episode:373 meanR:8.4100 R:11.0000 loss:0.0547
Episode:374 meanR:8.4000 R:5.0000 loss:0.0383
Episode:375 meanR:8.4700 R:14.0000 loss:0.0447
Episode:376 meanR:8.5000 R:9.0000 loss:0.0958
Episode:377 meanR:8.4700 R:2.0000 loss:0.0675
Episode:378 meanR:8.5600 R

Episode:534 meanR:8.2200 R:11.0000 loss:0.0378
Episode:535 meanR:8.3200 R:16.0000 loss:0.0586
Episode:536 meanR:8.2500 R:4.0000 loss:0.0575
Episode:537 meanR:8.1800 R:8.0000 loss:0.0449
Episode:538 meanR:8.2100 R:11.0000 loss:0.0420
Episode:539 meanR:8.1500 R:4.0000 loss:0.0616
Episode:540 meanR:8.1400 R:10.0000 loss:0.0485
Episode:541 meanR:8.0700 R:6.0000 loss:0.0840
Episode:542 meanR:8.0600 R:9.0000 loss:0.0343
Episode:543 meanR:8.0200 R:10.0000 loss:0.0626
Episode:544 meanR:8.0400 R:9.0000 loss:0.0275
Episode:545 meanR:8.0100 R:10.0000 loss:0.0409
Episode:546 meanR:8.0200 R:10.0000 loss:0.0859
Episode:547 meanR:8.0200 R:6.0000 loss:0.0490
Episode:548 meanR:8.1000 R:15.0000 loss:0.0662
Episode:549 meanR:8.1600 R:15.0000 loss:0.0833
Episode:550 meanR:8.2000 R:9.0000 loss:0.0716
Episode:551 meanR:8.3500 R:15.0000 loss:0.0671
Episode:552 meanR:8.3800 R:12.0000 loss:0.0705
Episode:553 meanR:8.4700 R:8.0000 loss:0.0460
Episode:554 meanR:8.6300 R:17.0000 loss:0.0566
Episode:555 meanR:8.65

Episode:710 meanR:9.1800 R:4.0000 loss:0.0426
Episode:711 meanR:9.1600 R:8.0000 loss:0.0558
Episode:712 meanR:9.1300 R:11.0000 loss:0.0551
Episode:713 meanR:9.1600 R:10.0000 loss:0.0788
Episode:714 meanR:9.1100 R:12.0000 loss:0.0619
Episode:715 meanR:9.0500 R:5.0000 loss:0.0544
Episode:716 meanR:9.0600 R:12.0000 loss:0.0470
Episode:717 meanR:9.0300 R:7.0000 loss:0.0775
Episode:718 meanR:9.0400 R:10.0000 loss:0.0459
Episode:719 meanR:9.0400 R:12.0000 loss:0.0583
Episode:720 meanR:9.1200 R:14.0000 loss:0.0553
Episode:721 meanR:9.1900 R:12.0000 loss:0.0693
Episode:722 meanR:9.2700 R:12.0000 loss:0.0681
Episode:723 meanR:9.2900 R:8.0000 loss:0.0569
Episode:724 meanR:9.3100 R:8.0000 loss:0.0507
Episode:725 meanR:9.4000 R:15.0000 loss:0.0490
Episode:726 meanR:9.3000 R:4.0000 loss:0.0490
Episode:727 meanR:9.2600 R:9.0000 loss:0.0487
Episode:728 meanR:9.3400 R:12.0000 loss:0.0623
Episode:729 meanR:9.3500 R:10.0000 loss:0.0577
Episode:730 meanR:9.3000 R:5.0000 loss:0.0548
Episode:731 meanR:9.34

Episode:885 meanR:10.1700 R:13.0000 loss:0.0577
Episode:886 meanR:10.0500 R:4.0000 loss:0.0747
Episode:887 meanR:10.0900 R:8.0000 loss:0.0624
Episode:888 meanR:10.0800 R:4.0000 loss:0.0338
Episode:889 meanR:10.0700 R:12.0000 loss:0.0448
Episode:890 meanR:10.0000 R:10.0000 loss:0.0518
Episode:891 meanR:10.0000 R:7.0000 loss:0.0495
Episode:892 meanR:10.1600 R:19.0000 loss:0.0916
Episode:893 meanR:10.2300 R:10.0000 loss:0.0726
Episode:894 meanR:10.3400 R:16.0000 loss:0.0786
Episode:895 meanR:10.2900 R:9.0000 loss:0.0615
Episode:896 meanR:10.3600 R:12.0000 loss:0.0528
Episode:897 meanR:10.3600 R:12.0000 loss:0.0659
Episode:898 meanR:10.3600 R:11.0000 loss:0.0704
Episode:899 meanR:10.4000 R:14.0000 loss:0.0502
Episode:900 meanR:10.3500 R:7.0000 loss:0.0661
Episode:901 meanR:10.3700 R:5.0000 loss:0.0763
Episode:902 meanR:10.3800 R:12.0000 loss:0.0607
Episode:903 meanR:10.3600 R:11.0000 loss:0.0679
Episode:904 meanR:10.3200 R:10.0000 loss:0.0508
Episode:905 meanR:10.1400 R:1.0000 loss:0.0410


Episode:1058 meanR:10.7800 R:9.0000 loss:0.0567
Episode:1059 meanR:10.8200 R:13.0000 loss:0.0404
Episode:1060 meanR:10.7700 R:5.0000 loss:0.0498
Episode:1061 meanR:10.8300 R:13.0000 loss:0.0671
Episode:1062 meanR:10.9200 R:13.0000 loss:0.0768
Episode:1063 meanR:10.8100 R:3.0000 loss:0.0387
Episode:1064 meanR:10.7700 R:10.0000 loss:0.0407
Episode:1065 meanR:10.8200 R:13.0000 loss:0.0560
Episode:1066 meanR:10.8100 R:13.0000 loss:0.0800
Episode:1067 meanR:10.8300 R:14.0000 loss:0.0688
Episode:1068 meanR:10.7900 R:13.0000 loss:0.0711
Episode:1069 meanR:10.8700 R:13.0000 loss:0.0810
Episode:1070 meanR:10.8500 R:11.0000 loss:0.0641
Episode:1071 meanR:10.7600 R:6.0000 loss:0.0411
Episode:1072 meanR:10.7300 R:6.0000 loss:0.0374
Episode:1073 meanR:10.7100 R:12.0000 loss:0.0546
Episode:1074 meanR:10.7100 R:9.0000 loss:0.0666
Episode:1075 meanR:10.6700 R:6.0000 loss:0.0343
Episode:1076 meanR:10.6100 R:9.0000 loss:0.0408
Episode:1077 meanR:10.6100 R:12.0000 loss:0.0398
Episode:1078 meanR:10.5300 R

Episode:1229 meanR:8.8800 R:7.0000 loss:0.0513
Episode:1230 meanR:8.9500 R:13.0000 loss:0.0746
Episode:1231 meanR:9.0000 R:12.0000 loss:0.0542
Episode:1232 meanR:9.0400 R:8.0000 loss:0.0442
Episode:1233 meanR:8.9800 R:6.0000 loss:0.0498
Episode:1234 meanR:9.0000 R:10.0000 loss:0.0502
Episode:1235 meanR:8.9500 R:5.0000 loss:0.0429
Episode:1236 meanR:8.9600 R:9.0000 loss:0.0537
Episode:1237 meanR:8.9900 R:13.0000 loss:0.1058
Episode:1238 meanR:8.8700 R:4.0000 loss:0.0666
Episode:1239 meanR:8.8400 R:12.0000 loss:0.0505
Episode:1240 meanR:8.8500 R:11.0000 loss:0.0772
Episode:1241 meanR:8.8800 R:13.0000 loss:0.0512
Episode:1242 meanR:8.8300 R:6.0000 loss:0.0535
Episode:1243 meanR:8.8400 R:10.0000 loss:0.0519
Episode:1244 meanR:8.8600 R:7.0000 loss:0.0570
Episode:1245 meanR:8.9600 R:11.0000 loss:0.0552
Episode:1246 meanR:8.9400 R:3.0000 loss:0.0638
Episode:1247 meanR:9.0000 R:12.0000 loss:0.0344
Episode:1248 meanR:8.9300 R:13.0000 loss:0.0706
Episode:1249 meanR:8.9300 R:14.0000 loss:0.0444

Episode:1400 meanR:9.9600 R:8.0000 loss:0.0492
Episode:1401 meanR:9.9500 R:5.0000 loss:0.0513
Episode:1402 meanR:9.9100 R:5.0000 loss:0.0404
Episode:1403 meanR:9.8600 R:6.0000 loss:0.0695
Episode:1404 meanR:9.9300 R:14.0000 loss:0.0674
Episode:1405 meanR:9.9500 R:12.0000 loss:0.0633
Episode:1406 meanR:9.9700 R:9.0000 loss:0.0694
Episode:1407 meanR:10.1500 R:18.0000 loss:0.0780
Episode:1408 meanR:10.3100 R:20.0000 loss:0.1106
Episode:1409 meanR:10.3300 R:10.0000 loss:0.0914
Episode:1410 meanR:10.3300 R:5.0000 loss:0.0781
Episode:1411 meanR:10.4000 R:17.0000 loss:0.0575
Episode:1412 meanR:10.3900 R:10.0000 loss:0.0688
Episode:1413 meanR:10.3400 R:8.0000 loss:0.0624
Episode:1414 meanR:10.2500 R:4.0000 loss:0.0543
Episode:1415 meanR:10.2900 R:10.0000 loss:0.0289
Episode:1416 meanR:10.3200 R:14.0000 loss:0.0332
Episode:1417 meanR:10.4400 R:16.0000 loss:0.0696
Episode:1418 meanR:10.4000 R:7.0000 loss:0.0718
Episode:1419 meanR:10.3300 R:6.0000 loss:0.0512
Episode:1420 meanR:10.2700 R:8.0000 l

Episode:1571 meanR:10.3900 R:11.0000 loss:0.0856
Episode:1572 meanR:10.3300 R:7.0000 loss:0.0591
Episode:1573 meanR:10.3000 R:10.0000 loss:0.0417
Episode:1574 meanR:10.2400 R:9.0000 loss:0.0411
Episode:1575 meanR:10.1700 R:6.0000 loss:0.0436
Episode:1576 meanR:10.0400 R:-1.0000 loss:0.0431
Episode:1577 meanR:9.9700 R:3.0000 loss:0.0273
Episode:1578 meanR:10.0800 R:16.0000 loss:0.0582
Episode:1579 meanR:10.1100 R:14.0000 loss:0.0821
Episode:1580 meanR:10.1500 R:10.0000 loss:0.0765
Episode:1581 meanR:10.1800 R:8.0000 loss:0.0695
Episode:1582 meanR:10.1600 R:16.0000 loss:0.1008
Episode:1583 meanR:10.2500 R:19.0000 loss:0.0671
Episode:1584 meanR:10.3900 R:25.0000 loss:0.1188
Episode:1585 meanR:10.4500 R:17.0000 loss:0.1192
Episode:1586 meanR:10.3500 R:4.0000 loss:0.1123
Episode:1587 meanR:10.3800 R:7.0000 loss:0.0495
Episode:1588 meanR:10.3000 R:0.0000 loss:0.0378
Episode:1589 meanR:10.3500 R:3.0000 loss:0.0239
Episode:1590 meanR:10.2800 R:5.0000 loss:0.0190
Episode:1591 meanR:10.2200 R:2.

Episode:1744 meanR:8.5300 R:13.0000 loss:0.0737
Episode:1745 meanR:8.6100 R:16.0000 loss:0.0780
Episode:1746 meanR:8.6800 R:21.0000 loss:0.0930
Episode:1747 meanR:8.6400 R:11.0000 loss:0.0846
Episode:1748 meanR:8.5500 R:7.0000 loss:0.0680
Episode:1749 meanR:8.5300 R:10.0000 loss:0.0580
Episode:1750 meanR:8.4300 R:3.0000 loss:0.0400
Episode:1751 meanR:8.3600 R:3.0000 loss:0.0472
Episode:1752 meanR:8.3300 R:9.0000 loss:0.0681
Episode:1753 meanR:8.2900 R:5.0000 loss:0.0632
Episode:1754 meanR:8.2400 R:2.0000 loss:0.0340
Episode:1755 meanR:8.2300 R:7.0000 loss:0.0610
Episode:1756 meanR:8.2500 R:15.0000 loss:0.0998
Episode:1757 meanR:8.2700 R:9.0000 loss:0.0624
Episode:1758 meanR:8.2000 R:9.0000 loss:0.0668
Episode:1759 meanR:8.2100 R:11.0000 loss:0.0728
Episode:1760 meanR:8.2900 R:11.0000 loss:0.0399
Episode:1761 meanR:8.3800 R:15.0000 loss:0.0680
Episode:1762 meanR:8.4200 R:6.0000 loss:0.0876
Episode:1763 meanR:8.3400 R:4.0000 loss:0.0417
Episode:1764 meanR:8.4100 R:17.0000 loss:0.0580

Episode:1916 meanR:9.3800 R:12.0000 loss:0.0459
Episode:1917 meanR:9.3200 R:7.0000 loss:0.0437
Episode:1918 meanR:9.3600 R:14.0000 loss:0.0664
Episode:1919 meanR:9.4600 R:13.0000 loss:0.0332
Episode:1920 meanR:9.4800 R:3.0000 loss:0.0365
Episode:1921 meanR:9.4900 R:12.0000 loss:0.0597
Episode:1922 meanR:9.3800 R:5.0000 loss:0.0571
Episode:1923 meanR:9.3700 R:14.0000 loss:0.0587
Episode:1924 meanR:9.2500 R:0.0000 loss:0.0661
Episode:1925 meanR:9.1900 R:9.0000 loss:0.0323
Episode:1926 meanR:9.2100 R:12.0000 loss:0.0479
Episode:1927 meanR:9.2800 R:10.0000 loss:0.0478
Episode:1928 meanR:9.2200 R:11.0000 loss:0.0735
Episode:1929 meanR:9.1700 R:11.0000 loss:0.0502
Episode:1930 meanR:9.1700 R:8.0000 loss:0.0695
Episode:1931 meanR:9.2100 R:12.0000 loss:0.0514
Episode:1932 meanR:9.2300 R:9.0000 loss:0.0387
Episode:1933 meanR:9.3000 R:11.0000 loss:0.0612
Episode:1934 meanR:9.3400 R:11.0000 loss:0.0638
Episode:1935 meanR:9.4300 R:15.0000 loss:0.0721
Episode:1936 meanR:9.4300 R:15.0000 loss:0.0941

Episode:2088 meanR:9.7100 R:5.0000 loss:0.0572
Episode:2089 meanR:9.8000 R:15.0000 loss:0.0695
Episode:2090 meanR:9.7800 R:12.0000 loss:0.0782
Episode:2091 meanR:9.7800 R:12.0000 loss:0.0647
Episode:2092 meanR:9.7300 R:9.0000 loss:0.0551
Episode:2093 meanR:9.7200 R:14.0000 loss:0.0480
Episode:2094 meanR:9.7400 R:11.0000 loss:0.0841
Episode:2095 meanR:9.7800 R:11.0000 loss:0.0744
Episode:2096 meanR:9.8700 R:11.0000 loss:0.0972
Episode:2097 meanR:9.8500 R:13.0000 loss:0.0635
Episode:2098 meanR:9.8800 R:13.0000 loss:0.0702
Episode:2099 meanR:9.9200 R:16.0000 loss:0.0662
Episode:2100 meanR:10.0000 R:16.0000 loss:0.0765
Episode:2101 meanR:9.9500 R:7.0000 loss:0.0801
Episode:2102 meanR:9.9200 R:3.0000 loss:0.0821
Episode:2103 meanR:9.9100 R:15.0000 loss:0.0813
Episode:2104 meanR:9.8700 R:11.0000 loss:0.0741
Episode:2105 meanR:9.8600 R:7.0000 loss:0.0729
Episode:2106 meanR:9.9400 R:18.0000 loss:0.0673
Episode:2107 meanR:10.0300 R:11.0000 loss:0.1096
Episode:2108 meanR:10.0500 R:11.0000 loss:0

Episode:2258 meanR:9.7700 R:2.0000 loss:0.0404
Episode:2259 meanR:9.7400 R:7.0000 loss:0.0432
Episode:2260 meanR:9.6800 R:9.0000 loss:0.0577
Episode:2261 meanR:9.6600 R:11.0000 loss:0.0552
Episode:2262 meanR:9.6000 R:3.0000 loss:0.0548
Episode:2263 meanR:9.5700 R:12.0000 loss:0.0565
Episode:2264 meanR:9.7000 R:13.0000 loss:0.0639
Episode:2265 meanR:9.7200 R:7.0000 loss:0.0743
Episode:2266 meanR:9.7100 R:11.0000 loss:0.0536
Episode:2267 meanR:9.6800 R:11.0000 loss:0.0746
Episode:2268 meanR:9.8100 R:14.0000 loss:0.0814
Episode:2269 meanR:9.7600 R:4.0000 loss:0.0544
Episode:2270 meanR:9.6900 R:8.0000 loss:0.0472
Episode:2271 meanR:9.7100 R:8.0000 loss:0.0527
Episode:2272 meanR:9.7300 R:11.0000 loss:0.0707
Episode:2273 meanR:9.8800 R:17.0000 loss:0.0695
Episode:2274 meanR:9.7900 R:5.0000 loss:0.0723
Episode:2275 meanR:9.6400 R:2.0000 loss:0.0679
Episode:2276 meanR:9.6400 R:2.0000 loss:0.0440
Episode:2277 meanR:9.6200 R:3.0000 loss:0.0356
Episode:2278 meanR:9.6300 R:10.0000 loss:0.0790

Episode:2431 meanR:9.9400 R:1.0000 loss:0.0608
Episode:2432 meanR:10.0000 R:12.0000 loss:0.0556
Episode:2433 meanR:10.0300 R:9.0000 loss:0.0474
Episode:2434 meanR:9.9900 R:7.0000 loss:0.0499
Episode:2435 meanR:10.0100 R:12.0000 loss:0.0431
Episode:2436 meanR:10.0300 R:13.0000 loss:0.0641
Episode:2437 meanR:9.8800 R:0.0000 loss:0.0612
Episode:2438 meanR:9.9000 R:10.0000 loss:0.0523
Episode:2439 meanR:9.8800 R:9.0000 loss:0.0691
Episode:2440 meanR:9.8400 R:7.0000 loss:0.0374
Episode:2441 meanR:9.9200 R:8.0000 loss:0.0414
Episode:2442 meanR:9.8700 R:8.0000 loss:0.0447
Episode:2443 meanR:9.8400 R:8.0000 loss:0.0674
Episode:2444 meanR:9.7900 R:12.0000 loss:0.0849
Episode:2445 meanR:9.8500 R:11.0000 loss:0.0456
Episode:2446 meanR:9.9500 R:12.0000 loss:0.0670
Episode:2447 meanR:9.9000 R:8.0000 loss:0.0563
Episode:2448 meanR:9.9100 R:11.0000 loss:0.0635
Episode:2449 meanR:9.9100 R:11.0000 loss:0.0699
Episode:2450 meanR:9.9000 R:10.0000 loss:0.0708
Episode:2451 meanR:9.9200 R:8.0000 loss:0.0391

Episode:2604 meanR:9.2800 R:9.0000 loss:0.0706
Episode:2605 meanR:9.3400 R:11.0000 loss:0.0802
Episode:2606 meanR:9.3800 R:13.0000 loss:0.0616
Episode:2607 meanR:9.4000 R:9.0000 loss:0.0648
Episode:2608 meanR:9.3600 R:7.0000 loss:0.0738
Episode:2609 meanR:9.3000 R:6.0000 loss:0.0419
Episode:2610 meanR:9.2100 R:8.0000 loss:0.0663
Episode:2611 meanR:9.2000 R:11.0000 loss:0.0612
Episode:2612 meanR:9.2400 R:9.0000 loss:0.0634
Episode:2613 meanR:9.3400 R:17.0000 loss:0.0710
Episode:2614 meanR:9.4100 R:13.0000 loss:0.0758
Episode:2615 meanR:9.4200 R:14.0000 loss:0.0909
Episode:2616 meanR:9.3100 R:2.0000 loss:0.0894
Episode:2617 meanR:9.2400 R:0.0000 loss:0.0290
Episode:2618 meanR:9.2000 R:3.0000 loss:0.0267
Episode:2619 meanR:9.3000 R:16.0000 loss:0.0491
Episode:2620 meanR:9.3700 R:13.0000 loss:0.0542
Episode:2621 meanR:9.3500 R:7.0000 loss:0.0715
Episode:2622 meanR:9.3200 R:9.0000 loss:0.0720
Episode:2623 meanR:9.2600 R:5.0000 loss:0.0568
Episode:2624 meanR:9.2800 R:11.0000 loss:0.0413

Episode:2777 meanR:8.8100 R:7.0000 loss:0.0303
Episode:2778 meanR:8.8000 R:8.0000 loss:0.0588
Episode:2779 meanR:8.8500 R:8.0000 loss:0.0454
Episode:2780 meanR:8.8300 R:5.0000 loss:0.0422
Episode:2781 meanR:8.8300 R:6.0000 loss:0.0340
Episode:2782 meanR:8.8600 R:12.0000 loss:0.0613
Episode:2783 meanR:8.9300 R:15.0000 loss:0.0535
Episode:2784 meanR:8.9200 R:5.0000 loss:0.0652
Episode:2785 meanR:8.9600 R:9.0000 loss:0.0596
Episode:2786 meanR:8.9600 R:8.0000 loss:0.0468
Episode:2787 meanR:8.9200 R:12.0000 loss:0.0514
Episode:2788 meanR:8.9000 R:6.0000 loss:0.0537
Episode:2789 meanR:9.0000 R:19.0000 loss:0.0480
Episode:2790 meanR:8.8900 R:6.0000 loss:0.0630
Episode:2791 meanR:8.9200 R:7.0000 loss:0.0781
Episode:2792 meanR:8.9600 R:9.0000 loss:0.0626
Episode:2793 meanR:8.9600 R:7.0000 loss:0.0877
Episode:2794 meanR:8.9700 R:14.0000 loss:0.0778
Episode:2795 meanR:8.9400 R:9.0000 loss:0.0875
Episode:2796 meanR:8.9700 R:14.0000 loss:0.0573
Episode:2797 meanR:8.8900 R:5.0000 loss:0.0570

Episode:2950 meanR:9.2200 R:7.0000 loss:0.0409
Episode:2951 meanR:9.2200 R:11.0000 loss:0.0423
Episode:2952 meanR:9.2900 R:18.0000 loss:0.0752
Episode:2953 meanR:9.3100 R:6.0000 loss:0.0521
Episode:2954 meanR:9.3500 R:16.0000 loss:0.0612
Episode:2955 meanR:9.2000 R:4.0000 loss:0.0642
Episode:2956 meanR:9.2700 R:12.0000 loss:0.0569
Episode:2957 meanR:9.2400 R:10.0000 loss:0.0628
Episode:2958 meanR:9.3100 R:7.0000 loss:0.0574
Episode:2959 meanR:9.3500 R:15.0000 loss:0.0547
Episode:2960 meanR:9.3400 R:12.0000 loss:0.0642
Episode:2961 meanR:9.3600 R:13.0000 loss:0.0602
Episode:2962 meanR:9.3600 R:7.0000 loss:0.0468
Episode:2963 meanR:9.3500 R:6.0000 loss:0.0536
Episode:2964 meanR:9.1200 R:-3.0000 loss:0.0373
Episode:2965 meanR:9.1300 R:9.0000 loss:0.0471
Episode:2966 meanR:9.1300 R:13.0000 loss:0.0556
Episode:2967 meanR:9.0500 R:5.0000 loss:0.0411
Episode:2968 meanR:9.0300 R:6.0000 loss:0.0541
Episode:2969 meanR:8.9900 R:12.0000 loss:0.0486
Episode:2970 meanR:8.9600 R:2.0000 loss:0.0477

Episode:3121 meanR:10.8800 R:16.0000 loss:0.0629
Episode:3122 meanR:10.8800 R:2.0000 loss:0.0450
Episode:3123 meanR:10.8800 R:11.0000 loss:0.0380
Episode:3124 meanR:10.7700 R:3.0000 loss:0.0416
Episode:3125 meanR:10.9000 R:21.0000 loss:0.0695
Episode:3126 meanR:10.7500 R:1.0000 loss:0.0803
Episode:3127 meanR:10.6400 R:3.0000 loss:0.0295
Episode:3128 meanR:10.5300 R:4.0000 loss:0.0412
Episode:3129 meanR:10.4400 R:5.0000 loss:0.0590
Episode:3130 meanR:10.4100 R:8.0000 loss:0.0395
Episode:3131 meanR:10.4100 R:15.0000 loss:0.0561
Episode:3132 meanR:10.3100 R:2.0000 loss:0.0551
Episode:3133 meanR:10.2500 R:7.0000 loss:0.0409
Episode:3134 meanR:10.2000 R:5.0000 loss:0.0505
Episode:3135 meanR:10.1500 R:8.0000 loss:0.0362
Episode:3136 meanR:9.9700 R:1.0000 loss:0.0409
Episode:3137 meanR:9.8000 R:1.0000 loss:0.0203
Episode:3138 meanR:9.8200 R:9.0000 loss:0.0348
Episode:3139 meanR:9.7700 R:8.0000 loss:0.0406
Episode:3140 meanR:9.6000 R:4.0000 loss:0.0335
Episode:3141 meanR:9.5100 R:6.0000 loss:0

Episode:3294 meanR:9.1500 R:14.0000 loss:0.0472
Episode:3295 meanR:9.1700 R:12.0000 loss:0.0456
Episode:3296 meanR:9.0500 R:6.0000 loss:0.0705
Episode:3297 meanR:9.0800 R:16.0000 loss:0.0838
Episode:3298 meanR:9.0800 R:7.0000 loss:0.0863
Episode:3299 meanR:9.2400 R:15.0000 loss:0.0717
Episode:3300 meanR:9.2500 R:19.0000 loss:0.0696
Episode:3301 meanR:9.2500 R:13.0000 loss:0.0813
Episode:3302 meanR:9.4100 R:15.0000 loss:0.1017
Episode:3303 meanR:9.4200 R:14.0000 loss:0.0851
Episode:3304 meanR:9.4500 R:11.0000 loss:0.0584
Episode:3305 meanR:9.5500 R:16.0000 loss:0.0573
Episode:3306 meanR:9.5700 R:10.0000 loss:0.0597
Episode:3307 meanR:9.7000 R:14.0000 loss:0.0874
Episode:3308 meanR:9.7500 R:9.0000 loss:0.0769
Episode:3309 meanR:9.9100 R:15.0000 loss:0.0864
Episode:3310 meanR:9.9100 R:5.0000 loss:0.0385
Episode:3311 meanR:9.9800 R:14.0000 loss:0.0542
Episode:3312 meanR:10.1000 R:14.0000 loss:0.0759
Episode:3313 meanR:10.1500 R:8.0000 loss:0.0525
Episode:3314 meanR:10.1900 R:10.0000 loss:0

Episode:3463 meanR:11.1000 R:13.0000 loss:0.0834
Episode:3464 meanR:11.0600 R:7.0000 loss:0.0560
Episode:3465 meanR:11.0800 R:13.0000 loss:0.0795
Episode:3466 meanR:11.1600 R:14.0000 loss:0.0672
Episode:3467 meanR:11.2200 R:15.0000 loss:0.0668
Episode:3468 meanR:11.3300 R:18.0000 loss:0.0796
Episode:3469 meanR:11.2900 R:12.0000 loss:0.0815
Episode:3470 meanR:11.2400 R:8.0000 loss:0.0829
Episode:3471 meanR:11.2000 R:11.0000 loss:0.0594
Episode:3472 meanR:11.1300 R:9.0000 loss:0.0847
Episode:3473 meanR:11.0700 R:10.0000 loss:0.0705
Episode:3474 meanR:11.0200 R:9.0000 loss:0.0585
Episode:3475 meanR:11.0800 R:15.0000 loss:0.0600
Episode:3476 meanR:11.1200 R:17.0000 loss:0.0794
Episode:3477 meanR:11.0700 R:7.0000 loss:0.0871
Episode:3478 meanR:11.0800 R:16.0000 loss:0.1003
Episode:3479 meanR:11.1300 R:12.0000 loss:0.0860
Episode:3480 meanR:11.1000 R:12.0000 loss:0.0744
Episode:3481 meanR:11.1700 R:6.0000 loss:0.0713
Episode:3482 meanR:11.1300 R:7.0000 loss:0.0393
Episode:3483 meanR:11.2300 

Episode:3632 meanR:10.4200 R:7.0000 loss:0.0551
Episode:3633 meanR:10.5200 R:18.0000 loss:0.0741
Episode:3634 meanR:10.5300 R:13.0000 loss:0.0659
Episode:3635 meanR:10.6200 R:16.0000 loss:0.0557
Episode:3636 meanR:10.6100 R:8.0000 loss:0.0913
Episode:3637 meanR:10.6200 R:4.0000 loss:0.0550
Episode:3638 meanR:10.7000 R:15.0000 loss:0.0616
Episode:3639 meanR:10.7000 R:10.0000 loss:0.0680
Episode:3640 meanR:10.6800 R:9.0000 loss:0.0736
Episode:3641 meanR:10.6700 R:13.0000 loss:0.0706
Episode:3642 meanR:10.6200 R:10.0000 loss:0.0507
Episode:3643 meanR:10.5900 R:8.0000 loss:0.0579
Episode:3644 meanR:10.6200 R:9.0000 loss:0.0577
Episode:3645 meanR:10.6200 R:15.0000 loss:0.0547
Episode:3646 meanR:10.5500 R:9.0000 loss:0.0664
Episode:3647 meanR:10.5100 R:6.0000 loss:0.0521
Episode:3648 meanR:10.4200 R:7.0000 loss:0.0570
Episode:3649 meanR:10.4000 R:15.0000 loss:0.0599
Episode:3650 meanR:10.3800 R:10.0000 loss:0.0370
Episode:3651 meanR:10.2900 R:8.0000 loss:0.0405
Episode:3652 meanR:10.4700 R:2

Episode:3801 meanR:10.4600 R:15.0000 loss:0.0707
Episode:3802 meanR:10.4600 R:10.0000 loss:0.0984
Episode:3803 meanR:10.4400 R:8.0000 loss:0.0674
Episode:3804 meanR:10.4400 R:10.0000 loss:0.0497
Episode:3805 meanR:10.5900 R:21.0000 loss:0.0765
Episode:3806 meanR:10.6400 R:11.0000 loss:0.0562
Episode:3807 meanR:10.7000 R:13.0000 loss:0.0868
Episode:3808 meanR:10.6900 R:4.0000 loss:0.0615
Episode:3809 meanR:10.6800 R:7.0000 loss:0.0575
Episode:3810 meanR:10.6000 R:4.0000 loss:0.0462
Episode:3811 meanR:10.6800 R:16.0000 loss:0.0550
Episode:3812 meanR:10.6200 R:5.0000 loss:0.0513
Episode:3813 meanR:10.6800 R:17.0000 loss:0.0534
Episode:3814 meanR:10.6500 R:3.0000 loss:0.0671
Episode:3815 meanR:10.5800 R:7.0000 loss:0.0478
Episode:3816 meanR:10.5300 R:8.0000 loss:0.0533
Episode:3817 meanR:10.3800 R:1.0000 loss:0.0659
Episode:3818 meanR:10.3200 R:8.0000 loss:0.0539
Episode:3819 meanR:10.3500 R:12.0000 loss:0.0555
Episode:3820 meanR:10.3200 R:10.0000 loss:0.0656
Episode:3821 meanR:10.3500 R:7

Episode:3970 meanR:10.3400 R:3.0000 loss:0.0480
Episode:3971 meanR:10.3200 R:5.0000 loss:0.0618
Episode:3972 meanR:10.3900 R:12.0000 loss:0.0535
Episode:3973 meanR:10.3200 R:8.0000 loss:0.0627
Episode:3974 meanR:10.3600 R:17.0000 loss:0.0620
Episode:3975 meanR:10.3500 R:9.0000 loss:0.0866
Episode:3976 meanR:10.4400 R:14.0000 loss:0.0767
Episode:3977 meanR:10.3400 R:6.0000 loss:0.0531
Episode:3978 meanR:10.3500 R:14.0000 loss:0.0550
Episode:3979 meanR:10.4200 R:15.0000 loss:0.0502
Episode:3980 meanR:10.5100 R:15.0000 loss:0.0860
Episode:3981 meanR:10.4400 R:11.0000 loss:0.0566
Episode:3982 meanR:10.4600 R:16.0000 loss:0.0896
Episode:3983 meanR:10.4100 R:10.0000 loss:0.0692
Episode:3984 meanR:10.3100 R:7.0000 loss:0.0459
Episode:3985 meanR:10.4000 R:15.0000 loss:0.0597
Episode:3986 meanR:10.4500 R:20.0000 loss:0.0790
Episode:3987 meanR:10.5600 R:14.0000 loss:0.1155
Episode:3988 meanR:10.5000 R:5.0000 loss:0.0705
Episode:3989 meanR:10.4600 R:7.0000 loss:0.0618
Episode:3990 meanR:10.4500 R

Episode:4140 meanR:9.8900 R:13.0000 loss:0.0985
Episode:4141 meanR:9.9200 R:13.0000 loss:0.0612
Episode:4142 meanR:9.9600 R:17.0000 loss:0.0626
Episode:4143 meanR:10.0500 R:14.0000 loss:0.0845
Episode:4144 meanR:9.9500 R:3.0000 loss:0.0847
Episode:4145 meanR:10.0400 R:16.0000 loss:0.0591
Episode:4146 meanR:10.0300 R:18.0000 loss:0.0689
Episode:4147 meanR:10.0000 R:5.0000 loss:0.0738
Episode:4148 meanR:10.0100 R:10.0000 loss:0.0605
Episode:4149 meanR:9.9800 R:8.0000 loss:0.0760
Episode:4150 meanR:9.9100 R:9.0000 loss:0.0674
Episode:4151 meanR:9.8600 R:7.0000 loss:0.0473
Episode:4152 meanR:9.8400 R:10.0000 loss:0.0451
Episode:4153 meanR:9.7500 R:0.0000 loss:0.0287
Episode:4154 meanR:9.7600 R:14.0000 loss:0.0412
Episode:4155 meanR:9.6100 R:1.0000 loss:0.0356
Episode:4156 meanR:9.5900 R:11.0000 loss:0.0448
Episode:4157 meanR:9.6400 R:10.0000 loss:0.0597
Episode:4158 meanR:9.6200 R:7.0000 loss:0.0608
Episode:4159 meanR:9.5500 R:6.0000 loss:0.0444
Episode:4160 meanR:9.5300 R:12.0000 loss:0.0

Episode:4311 meanR:9.3500 R:9.0000 loss:0.0543
Episode:4312 meanR:9.2500 R:8.0000 loss:0.0522
Episode:4313 meanR:9.2300 R:8.0000 loss:0.0545
Episode:4314 meanR:9.1500 R:9.0000 loss:0.0697
Episode:4315 meanR:9.2200 R:9.0000 loss:0.0472
Episode:4316 meanR:9.3000 R:18.0000 loss:0.0882
Episode:4317 meanR:9.2800 R:6.0000 loss:0.0805
Episode:4318 meanR:9.3800 R:14.0000 loss:0.0732
Episode:4319 meanR:9.3200 R:7.0000 loss:0.0568
Episode:4320 meanR:9.3600 R:13.0000 loss:0.0618
Episode:4321 meanR:9.3400 R:10.0000 loss:0.0475
Episode:4322 meanR:9.3100 R:6.0000 loss:0.0526
Episode:4323 meanR:9.4000 R:13.0000 loss:0.0806
Episode:4324 meanR:9.5600 R:18.0000 loss:0.0873
Episode:4325 meanR:9.6100 R:12.0000 loss:0.0943
Episode:4326 meanR:9.6200 R:10.0000 loss:0.0849
Episode:4327 meanR:9.5100 R:5.0000 loss:0.0746
Episode:4328 meanR:9.5700 R:12.0000 loss:0.0535
Episode:4329 meanR:9.5500 R:10.0000 loss:0.0703
Episode:4330 meanR:9.5100 R:6.0000 loss:0.0657
Episode:4331 meanR:9.4900 R:7.0000 loss:0.0460
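
The `meanR` column in the log above appears to be a moving average of the per-episode reward `R`. The following is a minimal sketch of how such a value could be tracked, not the notebook's actual training code. The helper name `moving_mean_logger` is hypothetical, and the window size of 100 episodes is an assumption: it is not shown in this excerpt and is only inferred from how slowly `meanR` moves between consecutive lines.

```python
from collections import deque

import numpy as np


def moving_mean_logger(window=100):
    """Return a function that records a reward and returns the mean
    over the most recent `window` rewards (hypothetical helper)."""
    recent = deque(maxlen=window)  # oldest rewards are dropped automatically

    def record(reward):
        recent.append(reward)
        return float(np.mean(recent))

    return record


# Small window so the effect is easy to follow by hand.
record = moving_mean_logger(window=3)
means = [record(r) for r in [11.0, 14.0, 7.0, 5.0]]
print(means)
```

With a window of 3, the last value averages only `14.0`, `7.0`, and `5.0`, because the first reward has already left the window.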

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

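def running_mean(x, N):
    # Moving average over a window of N samples, computed from cumulative sums.
    # Returns len(x) - N + 1 values, one per full window.
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N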
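
As a quick sanity check of `running_mean`, the sketch below applies it to a short made-up array (the input values are illustrative only). The result has `len(x) - N + 1` entries, which is why the plotting cells slice the episode axis with `eps[-len(smoothed_arr):]`.

```python
import numpy as np


def running_mean(x, N):
    # Prepend a zero so that differences of cumulative sums give window sums.
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
smoothed = running_mean(x, 2)
print(smoothed)  # [1.5 2.5 3.5 4.5]
```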

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Episode rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(loss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')

In [32]:
# # TF session for testing
# with tf.Session(graph=graph) as sess:
#     sess.run(tf.global_variables_initializer())
#     #saver.restore(sess, 'checkpoints/model.ckpt')    
#     saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    
#     # Testing episodes/epochs
#     for _ in range(1):
#         total_reward = 0
#         #state = env.reset()
#         env_info = env.reset(train_mode=False)[brain_name] # reset the environment
#         state = env_info.vector_observations[0]   # get the current state

#         # Testing steps/batches
#         while True:
#             action_logits = sess.run(model.actions_logits, feed_dict={model.states: state.reshape([1, -1])})
#             action = np.argmax(action_logits)
#             #state, reward, done, _ = env.step(action)
#             env_info = env.step(action)[brain_name]        # send the action to the environment
#             state = env_info.vector_observations[0]   # get the next state
#             reward = env_info.rewards[0]                   # get the reward
#             done = env_info.local_done[0]                  # see if episode has finished
#             total_reward += reward
#             if done:
#                 break
                
#         print('total_reward: {:.2f}'.format(total_reward))
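
The commented-out test loop above selects the greedy action by taking `np.argmax` over the network's action logits. A self-contained sketch of that single step follows. The logits are made-up numbers standing in for the output of `sess.run(model.actions_logits, ...)`, and the four-action shape is an assumption for illustration.

```python
import numpy as np

# Hypothetical logits for one state, shape (1, num_actions).
action_logits = np.array([[0.1, 1.7, -0.3, 0.9]])

# np.argmax flattens the array by default, so with a batch of one
# the result is directly the index of the highest-scoring action.
action = int(np.argmax(action_logits))
print(action)  # 1
```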

In [33]:
# # Careful: closing the environment shuts down the Unity process; run this only when finished.
# # Closing the env
# env.close()