# Continuous Control

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`

For instance, if you are using a Mac, then you downloaded `Reacher.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```
env = UnityEnvironment(file_name="Reacher.app")
```

In [2]:
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux/Reacher.x86_64')
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_v1/Reacher.x86_64')
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_OneAgent/Reacher_Linux/Reacher.x86_64')
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_NoVis_OneAgent/Reacher_Linux_NoVis/Reacher.x86_64')
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_NoVis/Reacher.x86_64')
env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_NoVis_MultiAgents/Reacher_Linux_NoVis/Reacher.x86_64')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, a double-jointed arm can move to target locations. A reward of `+0.1` is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of `33` variables corresponding to the position, rotation, velocity, and angular velocities of the arm.  Each action is a vector of four numbers, corresponding to the torque applied to the two joints.  Every entry in the action vector must be a number between `-1` and `1`.
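
As a minimal sketch of that action constraint, the snippet below draws Gaussian actions and clips them into `[-1, 1]`. The sizes (20 agents, 4 action entries) are taken from the environment printout in this notebook; the fixed seed is an assumption added only to make the sketch reproducible.

```python
import numpy as np

# Sizes as reported by the environment in this notebook
num_agents, action_size = 20, 4

rng = np.random.RandomState(0)              # fixed seed: an assumption for reproducibility
raw = rng.randn(num_agents, action_size)    # unbounded Gaussian samples
actions = np.clip(raw, -1, 1)               # force every entry into [-1, 1]

print(actions.shape)
print(actions.min() >= -1, actions.max() <= 1)
```

Note that clipping piles probability mass onto exactly `-1` and `1`, so the resulting actions are not uniformly distributed over the interval.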

Run the code cell below to print some information about the environment.

In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 20
Size of each action: 4
There are 20 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726624e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you can watch how the agent performs when it selects a random action at each time step.  A window should pop up that lets you observe the agent as it moves through the environment.  

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [5]:
# env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
# states = env_info.vector_observations                  # get the current state (for each agent)
# scores = np.zeros(num_agents)                          # initialize the score (for each agent)
# while True:
#     actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
#     actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
#     env_info = env.step(actions)[brain_name]           # send all actions to the environment
#     next_states = env_info.vector_observations         # get next state (for each agent)
#     rewards = env_info.rewards                         # get reward (for each agent)
#     dones = env_info.local_done                        # see if episode finished
#     scores += env_info.rewards                         # update the score (for each agent)
#     states = next_states                               # roll over states to next time step
#     if np.any(dones):                                  # exit loop if episode finished
#         break
# print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

When finished, you can close the environment.

In [6]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the environment, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [7]:
# # Testing the train mode
# env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
# state = env_info.vector_observations[0]                  # get the current state (for each agent)
# #scores = np.zeros(num_agents)                          # initialize the score (for each agent)
# num_steps = 0
# while True:
#     num_steps += 1
#     action = np.random.randn(num_agents, action_size) # select an action (for each agent)
#     #print(action)
#     action = np.clip(action, -1, 1)                  # all actions between -1 and 1
#     #print(action)
#     env_info = env.step(action)[brain_name]           # send all actions to the environment
#     next_state = env_info.vector_observations[0]         # get next state (for each agent)
#     reward = env_info.rewards[0]                         # get reward (for each agent)
#     done = env_info.local_done[0]                        # see if episode finished
#     #scores += env_info.rewards                         # update the score (for each agent)
#     state = next_state                               # roll over states to next time step
#     if done is True:                                  # exit loop if episode finished
#         #print(action.shape, reward)
#         #print(done)
#         break
# print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))
# num_steps

## Option 1: Solve the First Version
The task is episodic, and in order to solve the environment, your agent must get an average score of +30 over 100 consecutive episodes.
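
One way to track this criterion is a fixed-length window of episode scores. The sketch below is an illustration under assumptions: `is_solved` is a hypothetical helper, not part of the project code, and it deliberately waits until the window holds a full 100 episodes before declaring success.

```python
from collections import deque
import numpy as np

def is_solved(window, goal_score=30.0, window_size=100):
    """Return True once `window` holds `window_size` scores whose mean reaches `goal_score`."""
    return len(window) == window_size and bool(np.mean(window) >= goal_score)

scores_window = deque(maxlen=100)       # keeps only the 100 most recent episode scores
for episode_score in [31.0] * 99:       # made-up scores for illustration
    scores_window.append(episode_score)
print(is_solved(scores_window))         # window not yet full

scores_window.append(29.5)
print(is_solved(scores_window))         # 100 episodes, mean above the goal
```

A plain running mean over a partially filled window can cross the threshold early, which is why this sketch also checks the window length.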

In [8]:
import tensorflow as tf
print('TensorFlow Version: {}'.format(tf.__version__))
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


In [9]:
def model_input(state_size, action_size):
    #states = tf.placeholder(tf.float32, [None, *state_size], name='states')
    states = tf.placeholder(tf.float64, [None, state_size], name='states')
    actions = tf.placeholder(tf.float64, [None, action_size], name='actions')
    targetQs = tf.placeholder(tf.float64, [None], name='targetQs')
    return states, actions, targetQs

In [10]:
def actor(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('actor', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        return logits
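
The expression `tf.maximum(alpha * bn1, bn1)` used in the actor acts as a leaky ReLU whenever `0 < alpha < 1`. The NumPy sketch below shows the same arithmetic on a few made-up inputs; `leaky_relu` is a hypothetical helper name used only for this illustration.

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    # For 0 < alpha < 1: returns x where x >= 0 and alpha * x where x < 0
    return np.maximum(alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])    # made-up pre-activations
print(leaky_relu(x))
```

Negative inputs are scaled by `alpha` instead of being zeroed, so a gradient still flows through units with negative pre-activations.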

In [11]:
def critic(states, actions, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('critic', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=action_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        nl1_fused = tf.concat(axis=1, values=[nl1, actions])
        h2 = tf.layers.dense(inputs=nl1_fused, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
                
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)
        return logits

In [12]:
def model_loss(actions, states, targetQs, action_size, hidden_size):
    # Actor forward pass: actions proposed by the policy for the given states
    actions_logits = actor(states=states, hidden_size=hidden_size, action_size=action_size)
    # Critic's value of the actor's own actions (this drives the actor loss)
    gQlogits = critic(states=states, actions=actions_logits, hidden_size=hidden_size, action_size=action_size)
    # Critic's value of the actions actually taken (same weights, via reuse=True)
    Qlogits = critic(states=states, actions=actions, hidden_size=hidden_size, action_size=action_size, reuse=True)
    # Flatten the [batch, 1] critic outputs to [batch] before computing the losses
    Qs = tf.reshape(Qlogits, shape=[-1])
    gQs = tf.reshape(gQlogits, shape=[-1])
    dloss = tf.reduce_mean(tf.square(Qs - targetQs))
    gloss = -tf.reduce_mean(gQs)
    return actions_logits, gQlogits, gloss, dloss
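
To make the two losses concrete, here is a NumPy sketch of the same arithmetic on a made-up mini-batch. The target formula follows the one the training loop in this notebook uses, `rewards + gamma * nextQs * (1 - dones)`; all numeric values are invented for illustration and stand in for network outputs.

```python
import numpy as np

gamma = 0.99                                  # discount factor, as in the hyperparameter cell

# Made-up mini-batch of three transitions (illustrative values only)
rewards = np.array([0.1, 0.0, 0.1])
dones = np.array([False, False, True])
next_qs = np.array([2.0, 1.0, 3.0])           # stand-in for critic(next_state, actor(next_state))
qs = np.array([2.0, 1.0, 0.5])                # stand-in for critic(state, action_taken)
g_qs = np.array([1.5, 0.5, 1.0])              # stand-in for critic(state, actor(state))

# One-step TD target: bootstrapping is switched off where the episode ended
target_qs = rewards + gamma * next_qs * (1 - dones)

dloss = np.mean(np.square(qs - target_qs))    # critic loss: mean squared TD error
gloss = -np.mean(g_qs)                        # actor loss: negative mean critic value

print(target_qs, dloss, gloss)
```

Because the target is computed from the critic's own output, large critic values feed back into the next target, which is worth keeping in mind when reading the loss values in the training log.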

In [13]:
def model_opt(gloss, dloss, g_learning_rate, d_learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('actor')]
    d_vars = [var for var in t_vars if var.name.startswith('critic')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        g_opt = tf.train.AdamOptimizer(g_learning_rate).minimize(gloss, var_list=g_vars)
        d_opt = tf.train.AdamOptimizer(d_learning_rate).minimize(dloss, var_list=d_vars)
    return g_opt, d_opt

In [14]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, g_learning_rate, d_learning_rate, gamma):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs = model_input(state_size=state_size, action_size=action_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.gQlogits, self.gloss, self.dloss = model_loss(
            action_size=action_size, hidden_size=hidden_size, # model init
            states=self.states, actions=self.actions, targetQs=self.targetQs) # model input
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt = model_opt(gloss=self.gloss, dloss=self.dloss,
                                           g_learning_rate=g_learning_rate,
                                           d_learning_rate=d_learning_rate)

In [15]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
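
The sketch below illustrates the two behaviours this buffer relies on: a `deque` with `maxlen` silently evicts the oldest entries, and `np.random.choice(..., replace=False)` draws distinct indices. The tiny capacity and the two-element entries are assumptions chosen to keep the output readable; the real buffer stores `[state, action, next_state, reward, done]`.

```python
from collections import deque
import numpy as np

buffer = deque(maxlen=5)                  # tiny capacity so eviction is visible
for i in range(8):
    buffer.append([i, i * 10])            # stand-in for a full transition

print(list(buffer))                       # the three oldest entries are gone

batch_size = 3
idx = np.random.choice(np.arange(len(buffer)), size=batch_size, replace=False)
batch = [buffer[ii] for ii in idx]        # same indexing as Memory.sample
print(batch)
```

With `replace=False`, NumPy raises a `ValueError` if `batch_size` exceeds the number of stored entries, so the buffer needs at least `batch_size` transitions before sampling.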

In [16]:
# inspect observation/action shapes and brain metadata
env_info.vector_observations.shape, env_info.previous_vector_actions.shape, \
brain.vector_action_space_size, brain.number_visual_observations, \
brain.vector_action_space_size, brain.vector_observation_space_size

((20, 33), (20, 4), 4, 0, 4, 33)

In [17]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01           # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
state_size = 33
action_size = 4
hidden_size = 33*2             # number of units in each hidden layer
g_learning_rate = 1e-3         # actor learning rate
d_learning_rate = 1e-3         # critic learning rate

# Memory parameters
memory_size = int(1e6)            # memory capacity
batch_size = 1024             # experience mini-batch size
gamma=0.99
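
The exploration parameters above feed an exponential decay of the form `explore_stop + (explore_start - explore_stop) * exp(-decay_rate * total_step)`, which is the expression the training loop in this notebook evaluates at every step. The sketch below tabulates it at a few step counts; `explore_probability` is a hypothetical helper name used only here.

```python
import numpy as np

explore_start, explore_stop, decay_rate = 1.0, 0.01, 0.0001

def explore_probability(total_step):
    # Decays from explore_start toward explore_stop as total_step grows
    return explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step)

for step in (0, 1000, 10000, 100000):
    print(step, round(float(explore_probability(step)), 4))
```

After 1000 steps the probability is about `0.9058`, which agrees with the `exploreP` value printed for Episode 0 in the training log.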

In [18]:
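# Reset the default graph, then keep a handle to the fresh one
# (tf.reset_default_graph() itself returns None)
tf.reset_default_graph()
graph = tf.get_default_graph()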

# Init the model
model = Model(action_size=action_size, state_size=state_size, hidden_size=hidden_size, gamma=gamma,
              g_learning_rate=g_learning_rate, d_learning_rate=d_learning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

In [19]:
# env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
# states = env_info.vector_observations                  # get the current state (for each agent)
# scores = np.zeros(num_agents)                          # initialize the score (for each agent)

# for _ in range(memory_size):
#     actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
#     actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
#     env_info = env.step(actions)[brain_name]           # send all actions to the environment
#     next_states = env_info.vector_observations         # get next state (for each agent)
#     rewards = env_info.rewards                         # get reward (for each agent)
#     dones = env_info.local_done                        # see if episode finished

#     for state, action, next_state, reward, done in zip(states, actions, next_states, rewards, dones):
#         #agent.step(state, action, reward, next_state, done) # send actions to the agent
#         memory.buffer.append([state, action, next_state, reward, done])
        
#     scores += env_info.rewards                         # update the score (for each agent)
#     states = next_states                               # roll over states to next time step
    
#     if np.any(dones):                                  # exit loop if episode finished
#         print('Average scores: {}'.format(np.mean(scores)))
#         env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
#         states = env_info.vector_observations                  # get the current state (for each agent)
#         scores = np.zeros(num_agents)                          # initialize the score (for each agent)

In [20]:
# env_info = env.reset(train_mode=True)[brain_name] # reset the environment
# states = env_info.vector_observations   # get the state
# for _ in range(memory_size):
#     action = np.random.randint(action_size)        # select an action
#     env_info = env.step(action)[brain_name]        # send the action to the environment
#     next_state = env_info.vector_observations[0]   # get the next state
#     reward = env_info.rewards[0]                   # get the reward
#     done = env_info.local_done[0]                  # see if episode has finished
#     memory.buffer.append([state, action, next_state, reward, float(done)])
#     state = next_state
#     if done:                                       # exit loop if episode finished
#         env_info = env.reset(train_mode=True)[brain_name] # reset the environment
#         state = env_info.vector_observations[0]   # get the state
#         break

In [21]:
# Save/load the model and save for plotting
saver = tf.train.Saver()
episode_rewards_list, rewards_list = [], []
gloss_list, dloss_list = [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    total_step = 0 # Explore or exploit parameter
    episode_reward = deque(maxlen=100) # 100 episodes for running average/running mean/window
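    n_episodes = 2000   # maximum number of training episodes
    max_t = 1000        # maximum number of steps per episode
    learn_every = 20    # run learning updates every `learn_every` steps
    num_learn = 10      # number of mini-batch updates per learning phase
    goal_score = 30     # average score that counts as solved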

    # Training episodes/epochs
    for ep in range(n_episodes):
        gloss_batch, dloss_batch = [], []
        
        #state = env.reset()
        env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
        states = env_info.vector_observations                 # get the current state (for each agent)
        #print(states.shape)
        scores = np.zeros(num_agents)                          # initialize the score (for each agent)
        
        # Training steps/batches
        for t in range(max_t):
            # Explore (env) or Exploit (model)
            total_step += 1
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            if explore_p > np.random.rand():
                #action = env.action_space.sample()
                actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
                #print('explore', actions.shape)
            else:
                #print(states.shape)
                actions = sess.run(model.actions_logits, feed_dict={model.states: states})
                #print('model', actions.shape)

            actions = np.clip(actions, -1, 1) # [-1, +1]
            #print(actions.shape)
            #next_state, reward, done, _ = env.step(action)
            env_info = env.step(actions)[brain_name]           # send all actions to the environment
            next_states = env_info.vector_observations         # get next state (for each agent)
            rewards = env_info.rewards                        # get reward (for each agent)
            dones = env_info.local_done                        # see if episode finished
            
            for state, action, next_state, reward, done in zip(states, actions, next_states, rewards, dones):
                #agent.step(state, action, reward, next_state, done) # send actions to the agent
                memory.buffer.append([state, action, next_state, reward, done])

            scores += env_info.rewards                         # update the score (for each agent)
            states = next_states                               # roll over states to next time step
            
            # total reward
            total_reward = np.mean(scores)
            
            # Training
            if t%learn_every == 0:
                for _ in range(num_learn):
                    #agent.start_learn()
                    if len(memory.buffer) >= batch_size:
                        #experiences = self.memory.sample()
                        batch = memory.sample(batch_size)
                        states_ = np.array([each[0] for each in batch])
                        actions_ = np.array([each[1] for each in batch])
                        next_states_ = np.array([each[2] for each in batch])
                        rewards_ = np.array([each[3] for each in batch])
                        dones_ = np.array([each[4] for each in batch])

                        #self.learn(experiences, GAMMA)
                        # TargetQs
                        nextQlogits = sess.run(model.gQlogits, feed_dict = {model.states: next_states_})
                        nextQs = nextQlogits.reshape(-1)
                        targetQs = rewards_ + (gamma * nextQs * (1-dones_))
                        
                        feed_dict = {model.states: states_, model.actions: actions_, model.targetQs: targetQs}
                        dloss, _= sess.run([model.dloss, model.d_opt], feed_dict)
                        gloss, _= sess.run([model.gloss, model.g_opt], feed_dict)

                        gloss_batch.append(gloss)
                        dloss_batch.append(dloss)
            
            # End of episode
            if np.any(dones):  # stop when any agent reports that its episode is over
                break
                
        # Print out
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'gloss:{:.4f}'.format(np.mean(gloss_batch)),
              'dloss:{:.4f}'.format(np.mean(dloss_batch)),
              'exploreP:{:.4f}'.format(explore_p))

        # Record values for plotting
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        gloss_list.append([ep, np.mean(gloss_batch)])
        dloss_list.append([ep, np.mean(dloss_batch)])
        
        # Break episode/epoch loop
        ## Option 1: Solve the First Version
        #The task is episodic, and in order to solve the environment, 
        #your agent must get an average score of +30 over 100 consecutive episodes.
        if np.mean(episode_reward) >= goal_score:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:0.1175 R:0.1175 gloss:-15808020.8409 dloss:1409050604624048.0000 exploreP:0.9058
Episode:1 meanR:0.1857 R:0.2540 gloss:-14436963616.5646 dloss:599330947005348184064.0000 exploreP:0.8205
Episode:2 meanR:0.2477 R:0.3715 gloss:-556652484195.1956 dloss:589901054991295313870848.0000 exploreP:0.7434
Episode:3 meanR:0.3264 R:0.5625 gloss:-5909603296514.8115 dloss:54671209385723100919758848.0000 exploreP:0.6736
Episode:4 meanR:0.3899 R:0.6440 gloss:-32591894945923.9922 dloss:1542204836161002914388639744.0000 exploreP:0.6105
Episode:5 meanR:0.4117 R:0.5205 gloss:-122032690973255.4531 dloss:21556524247497347098682064896.0000 exploreP:0.5533
Episode:6 meanR:0.3759 R:0.1615 gloss:-364289784455833.0625 dloss:202066139579476792127610945536.0000 exploreP:0.5016
Episode:7 meanR:0.3510 R:0.1765 gloss:-928699497237855.7500 dloss:1388194024887035490776188452864.0000 exploreP:0.4548
Episode:8 meanR:0.3179 R:0.0530 gloss:-2098536257745308.2500 dloss:7401075665614362535027915358208.0000 expl

Episode:64 meanR:0.0643 R:0.0105 gloss:-567377083994179829760.0000 dloss:566067930690331451423158196585964853264384.0000 exploreP:0.0115
Episode:65 meanR:0.0634 R:0.0050 gloss:-626767909621616869376.0000 dloss:688571680838644381045414034488112476848128.0000 exploreP:0.0113
Episode:66 meanR:0.0625 R:0.0030 gloss:-689600083001443090432.0000 dloss:833452077901901752070038953921580401426432.0000 exploreP:0.0112
Episode:67 meanR:0.0617 R:0.0125 gloss:-759589500890148634624.0000 dloss:1009370939229229130427550991994764419661824.0000 exploreP:0.0111
Episode:68 meanR:0.0611 R:0.0155 gloss:-836252412960558940160.0000 dloss:1221203248475396179800984599178009821315072.0000 exploreP:0.0110
Episode:69 meanR:0.0604 R:0.0140 gloss:-917684587034636189696.0000 dloss:1469101828632014077673163052428193532215296.0000 exploreP:0.0109
Episode:70 meanR:0.0597 R:0.0105 gloss:-1004571345454625456128.0000 dloss:1758038796716864543626348046477127751565312.0000 exploreP:0.0108
Episode:71 meanR:0.0590 R:0.0095 glo

Episode:123 meanR:0.0112 R:0.0075 gloss:-38462610178202132283392.0000 dloss:2521799192060080107122260373395141970421088256.0000 exploreP:0.0100
Episode:124 meanR:0.0113 R:0.0235 gloss:-40616477067517523656704.0000 dloss:2807724990653705776832730447217176271764062208.0000 exploreP:0.0100
Episode:125 meanR:0.0109 R:0.0000 gloss:-42863824153510632488960.0000 dloss:3124462215871861421158182396145957190657638400.0000 exploreP:0.0100
Episode:126 meanR:0.0109 R:0.0160 gloss:-45095107413656467406848.0000 dloss:3464405264443692437550067973548108060808970240.0000 exploreP:0.0100
Episode:127 meanR:0.0106 R:0.0055 gloss:-47543077845641208004608.0000 dloss:3850017421168067263874660002980359674299154432.0000 exploreP:0.0100
Episode:128 meanR:0.0105 R:0.0000 gloss:-49991109941080581210112.0000 dloss:4256648362074490762722028326457280526674821120.0000 exploreP:0.0100
Episode:129 meanR:0.0105 R:0.0075 gloss:-52849739894989703348224.0000 dloss:4747835476297658342677788454425593019516321792.0000 exploreP

Episode:180 meanR:0.0094 R:0.0090 gloss:-491712081014600912338944.0000 dloss:409025340914796748541205756187936848520696299520.0000 exploreP:0.0100
Episode:181 meanR:0.0092 R:0.0100 gloss:-510723034115296173490176.0000 dloss:441408897213590553967209503409157564689652121600.0000 exploreP:0.0100
Episode:182 meanR:0.0090 R:0.0150 gloss:-530252000802730500161536.0000 dloss:475575983006481191237737973666964926595627745280.0000 exploreP:0.0100
Episode:183 meanR:0.0089 R:0.0025 gloss:-551046989197960726708224.0000 dloss:512805255134437593580728731660103744978107236352.0000 exploreP:0.0100
Episode:184 meanR:0.0091 R:0.0190 gloss:-570930071357141553250304.0000 dloss:551540127544309404779562672790192065923696820224.0000 exploreP:0.0100
Episode:185 meanR:0.0091 R:0.0095 gloss:-591829946758781810507776.0000 dloss:592627152981981687866049804106953614565698437120.0000 exploreP:0.0100
Episode:186 meanR:0.0090 R:0.0075 gloss:-613272488252902006063104.0000 dloss:63702511375295031196513495790828485700789

Episode:236 meanR:0.0097 R:0.0165 gloss:-3160975137731090489278464.0000 dloss:16869706313461194559890511516322486514903125327872.0000 exploreP:0.0100
Episode:237 meanR:0.0098 R:0.0140 gloss:-3257834780949809968709632.0000 dloss:17890448743612315328959357132097704416607696060416.0000 exploreP:0.0100
Episode:238 meanR:0.0097 R:0.0015 gloss:-3357292797465870737604608.0000 dloss:18985320925068081321674363686838330879045504335872.0000 exploreP:0.0100
Episode:239 meanR:0.0098 R:0.0105 gloss:-3454883960319584338182144.0000 dloss:20129623026641325358396987345324255941280252559360.0000 exploreP:0.0100
Episode:240 meanR:0.0097 R:0.0060 gloss:-3553576534355777882685440.0000 dloss:21250473541954524878677388944299830832000808255488.0000 exploreP:0.0100
Episode:241 meanR:0.0096 R:0.0040 gloss:-3654691488964877605666816.0000 dloss:22543456307698609864834295363766620239028052557824.0000 exploreP:0.0100
Episode:242 meanR:0.0096 R:0.0020 gloss:-3766950456404583429177344.0000 dloss:2388684516764649105777

Episode:291 meanR:0.0105 R:0.0155 gloss:-13685150355073843591118848.0000 dloss:315795395822273865901122084948614170837802322755584.0000 exploreP:0.0100
Episode:292 meanR:0.0104 R:0.0155 gloss:-14041207592106182957858816.0000 dloss:332010159035141852928569421518440814756732304621568.0000 exploreP:0.0100
Episode:293 meanR:0.0103 R:0.0090 gloss:-14409015627051533758103552.0000 dloss:348807303694854825000616598186235280916507497857024.0000 exploreP:0.0100
Episode:294 meanR:0.0104 R:0.0145 gloss:-14731660820431118708768768.0000 dloss:365797816783016054696127853008488682416149826109440.0000 exploreP:0.0100
Episode:295 meanR:0.0103 R:0.0025 gloss:-15117183972228531044220928.0000 dloss:384081248650655618050849618364894926486758529957888.0000 exploreP:0.0100
Episode:296 meanR:0.0102 R:0.0020 gloss:-15428395566161511294435328.0000 dloss:401423591002232048634400733432784746542920512831488.0000 exploreP:0.0100
Episode:297 meanR:0.0103 R:0.0160 gloss:-15861123099828375359848448.0000 dloss:422576106

Episode:345 meanR:0.0105 R:0.0110 gloss:-45916454000070949691457536.0000 dloss:3543447534200994629394773182162897289059657486172160.0000 exploreP:0.0100
Episode:346 meanR:0.0104 R:0.0065 gloss:-46911576819566033984552960.0000 dloss:3697014157455896271675361858913255416266648853676032.0000 exploreP:0.0100
Episode:347 meanR:0.0103 R:0.0065 gloss:-47927282165265960271872000.0000 dloss:3854403687591754718578330467798885770439420192751616.0000 exploreP:0.0100
Episode:348 meanR:0.0105 R:0.0210 gloss:-48981653170918540210864128.0000 dloss:4025566341316727636790294854501100316756942935556096.0000 exploreP:0.0100
Episode:349 meanR:0.0105 R:0.0045 gloss:-49836847088898162292686848.0000 dloss:4175180912368307628190665957077482896720030393696256.0000 exploreP:0.0100
Episode:350 meanR:0.0106 R:0.0125 gloss:-50917700387770316567347200.0000 dloss:4348709197055805288619427767050808909022668926222336.0000 exploreP:0.0100
Episode:351 meanR:0.0106 R:0.0105 gloss:-51923398381075281267916800.0000 dloss:452

Episode:399 meanR:0.0105 R:0.0090 gloss:-130994903603650637137444864.0000 dloss:28718249930992908924675393711945576969753190429884416.0000 exploreP:0.0100
Episode:400 meanR:0.0105 R:0.0040 gloss:-133248056014690175730843648.0000 dloss:29745345592525268325665330663818737296487882269130752.0000 exploreP:0.0100
Episode:401 meanR:0.0105 R:0.0000 gloss:-135755773831727890485477376.0000 dloss:30864789276367795253082537181547054233393814331981824.0000 exploreP:0.0100
Episode:402 meanR:0.0106 R:0.0170 gloss:-137937815792843777764229120.0000 dloss:31883037206014273305007390923254145791777000407433216.0000 exploreP:0.0100
Episode:403 meanR:0.0106 R:0.0110 gloss:-140487562261075960686182400.0000 dloss:33119371262039624851830251769399857204380988289843200.0000 exploreP:0.0100
Episode:404 meanR:0.0105 R:0.0165 gloss:-143145303463866745170690048.0000 dloss:34257275141449724402450730333220233472581577385967616.0000 exploreP:0.0100
Episode:405 meanR:0.0105 R:0.0015 gloss:-145707733575548362498768896.0

Episode:452 meanR:0.0103 R:0.0080 gloss:-323136110885547685787467776.0000 dloss:174703575117329983409568713832002383570910518204956672.0000 exploreP:0.0100
Episode:453 meanR:0.0103 R:0.0130 gloss:-328004930291047033491947520.0000 dloss:180044614130860210364381751231158232159296139879776256.0000 exploreP:0.0100
Episode:454 meanR:0.0103 R:0.0085 gloss:-332811217673595878889226240.0000 dloss:185509608537831246310261630341684782862816797010165760.0000 exploreP:0.0100
Episode:455 meanR:0.0103 R:0.0255 gloss:-338582824966894473099345920.0000 dloss:191914940762854106119816742392851787427825026888368128.0000 exploreP:0.0100
Episode:456 meanR:0.0103 R:0.0075 gloss:-344291056674520520636497920.0000 dloss:198165384395309071370623391708975946312584383436947456.0000 exploreP:0.0100
Episode:457 meanR:0.0101 R:0.0040 gloss:-349961184442895388974776320.0000 dloss:204739270572548269089989593880198967153038989493010432.0000 exploreP:0.0100
Episode:458 meanR:0.0101 R:0.0155 gloss:-35462392024285393082698

Episode:505 meanR:0.0104 R:0.0100 gloss:-724380259820289670051266560.0000 dloss:877051729771245534924717224832680550670149912723718144.0000 exploreP:0.0100
Episode:506 meanR:0.0105 R:0.0150 gloss:-733590577688057087906545664.0000 dloss:901450070532067523713699187549318881034594157298974720.0000 exploreP:0.0100
Episode:507 meanR:0.0103 R:0.0000 gloss:-744775080959958291203489792.0000 dloss:926384807350256294591173207512041338756373665982447616.0000 exploreP:0.0100
Episode:508 meanR:0.0103 R:0.0095 gloss:-754951609342440339165675520.0000 dloss:954425116150687878421260762384868450899716989872242688.0000 exploreP:0.0100
Episode:509 meanR:0.0103 R:0.0040 gloss:-765374034329449926393593856.0000 dloss:982552530110623269747709652236649070077278961268686848.0000 exploreP:0.0100
Episode:510 meanR:0.0103 R:0.0125 gloss:-778043082595578840153587712.0000 dloss:1013436647310847648734825245612619079221175744154566656.0000 exploreP:0.0100
Episode:511 meanR:0.0103 R:0.0045 gloss:-7891158920784226691443

Episode:558 meanR:0.0110 R:0.0145 gloss:-1505335833312761573623726080.0000 dloss:3792817082406798716749646169396723421761116928145883136.0000 exploreP:0.0100
Episode:559 meanR:0.0110 R:0.0040 gloss:-1527604255055824162419376128.0000 dloss:3898741343523870188248095789880515223156827209410281472.0000 exploreP:0.0100
Episode:560 meanR:0.0112 R:0.0215 gloss:-1544610634470929505760313344.0000 dloss:3992102046207326778616403682073375643394129227528798208.0000 exploreP:0.0100
Episode:561 meanR:0.0113 R:0.0145 gloss:-1566386069992386327302111232.0000 dloss:4106252433776494390755146605825376180274343392655179776.0000 exploreP:0.0100
Episode:562 meanR:0.0112 R:0.0040 gloss:-1588636312999451586585427968.0000 dloss:4221045895314480660600530604432044134616961918332567552.0000 exploreP:0.0100
Episode:563 meanR:0.0112 R:0.0020 gloss:-1605567452199260068385914880.0000 dloss:4319587572416693197738256315835765112973241732011393024.0000 exploreP:0.0100
Episode:564 meanR:0.0112 R:0.0090 gloss:-16291356815

Episode:610 meanR:0.0110 R:0.0055 gloss:-2905677500971477138078171136.0000 dloss:14102155572408465627569210738296232871403968295499464704.0000 exploreP:0.0100
Episode:611 meanR:0.0111 R:0.0105 gloss:-2938340202939474173524705280.0000 dloss:14424321625405390095244478657888702850997199272023162880.0000 exploreP:0.0100
Episode:612 meanR:0.0111 R:0.0055 gloss:-2976974744960088179843530752.0000 dloss:14835820018441184450068435889044361493593439220283736064.0000 exploreP:0.0100
Episode:613 meanR:0.0112 R:0.0145 gloss:-3008918819737993920022315008.0000 dloss:15170353635379697838751130889387439332798951103196561408.0000 exploreP:0.0100
Episode:614 meanR:0.0112 R:0.0050 gloss:-3048268455735525207657414656.0000 dloss:15559728833861033319756165477462397503552185596179906560.0000 exploreP:0.0100
Episode:615 meanR:0.0112 R:0.0105 gloss:-3084610743132208478537383936.0000 dloss:15924296616321962091276432184314739413827810171568521216.0000 exploreP:0.0100
Episode:616 meanR:0.0110 R:0.0000 gloss:-31329

Episode:662 meanR:0.0103 R:0.0085 gloss:-5318821880466090121999941632.0000 dloss:47342782670088321919971446015309663427694102474106339328.0000 exploreP:0.0100
Episode:663 meanR:0.0104 R:0.0120 gloss:-5386083224085572967339130880.0000 dloss:48493010743274316071371839697588420421531887260758179840.0000 exploreP:0.0100
Episode:664 meanR:0.0104 R:0.0065 gloss:-5447306892484590388184612864.0000 dloss:49603510480922924197926416766512291809136190680368939008.0000 exploreP:0.0100
Episode:665 meanR:0.0102 R:0.0040 gloss:-5500359397151645310421827584.0000 dloss:50548753698885071195806548545064991322742177355968544768.0000 exploreP:0.0100
Episode:666 meanR:0.0103 R:0.0175 gloss:-5578376257762152116882767872.0000 dloss:51916400108849230225382643627857365689958935033739739136.0000 exploreP:0.0100
Episode:667 meanR:0.0104 R:0.0180 gloss:-5632286003575327157209006080.0000 dloss:53051122150480440992282437560320617690905887326336974848.0000 exploreP:0.0100
Episode:668 meanR:0.0105 R:0.0265 gloss:-56996

Episode:714 meanR:0.0099 R:0.0050 gloss:-9332440750172729170768953344.0000 dloss:145755409418722802722204081052198823424645281333240659968.0000 exploreP:0.0100
Episode:715 meanR:0.0099 R:0.0150 gloss:-9448283629212969744560291840.0000 dloss:149040281256055930629524654184396582313884168894984224768.0000 exploreP:0.0100
Episode:716 meanR:0.0099 R:0.0000 gloss:-9526872312549691667328794624.0000 dloss:151833985447143974517365031681628484614629865268002684928.0000 exploreP:0.0100
Episode:717 meanR:0.0099 R:0.0035 gloss:-9653775807641111342520205312.0000 dloss:155646327113430913516011458454870787278690533725360357376.0000 exploreP:0.0100
Episode:718 meanR:0.0099 R:0.0030 gloss:-9750392367724731150725283840.0000 dloss:158890884737790697122461144537456114880468726634548035584.0000 exploreP:0.0100
Episode:719 meanR:0.0098 R:0.0055 gloss:-9848709379970578204833349632.0000 dloss:161826939395940430068340514728036673761230909855380799488.0000 exploreP:0.0100
Episode:720 meanR:0.0098 R:0.0110 gloss:

Episode:765 meanR:0.0099 R:0.0140 gloss:-15629471789113781665119862784.0000 dloss:408153090069489021956609663194855026936840278161070489600.0000 exploreP:0.0100
Episode:766 meanR:0.0098 R:0.0100 gloss:-15775189769215427643162230784.0000 dloss:416187152161011550417844585121025529631133357543742504960.0000 exploreP:0.0100
Episode:767 meanR:0.0097 R:0.0000 gloss:-15970591752559165991483867136.0000 dloss:426321965962656836773596053051628456097089931601241964544.0000 exploreP:0.0100
Episode:768 meanR:0.0094 R:0.0000 gloss:-16112132143836011302951583744.0000 dloss:433868416087868020929887179052351476822970529615039168512.0000 exploreP:0.0100
Episode:769 meanR:0.0097 R:0.0315 gloss:-16239111520954049128432664576.0000 dloss:440614813575602572799102212305984854834036048346677772288.0000 exploreP:0.0100
Episode:770 meanR:0.0096 R:0.0060 gloss:-16443600997852439885465518080.0000 dloss:452112318936972289776935502436672752672754935444193411072.0000 exploreP:0.0100
Episode:771 meanR:0.0095 R:0.0105 

Episode:816 meanR:0.0099 R:0.0050 gloss:-25332306064469758039046488064.0000 dloss:1073332693590367403608707529386572479082592077403859714048.0000 exploreP:0.0100
Episode:817 meanR:0.0100 R:0.0100 gloss:-25567276381595281134119288832.0000 dloss:1091976407766293202597870317128147382505143159797934194688.0000 exploreP:0.0100
Episode:818 meanR:0.0100 R:0.0030 gloss:-25826575727126153241714229248.0000 dloss:1114234319531161174274874713633050161270806751680730759168.0000 exploreP:0.0100
Episode:819 meanR:0.0100 R:0.0105 gloss:-26035558403322512288266584064.0000 dloss:1133442360016090383042815588231496619421709216602883358720.0000 exploreP:0.0100
Episode:820 meanR:0.0100 R:0.0125 gloss:-26289948171971975771368456192.0000 dloss:1155583019220790953588119952846302784011431759178360684544.0000 exploreP:0.0100
Episode:821 meanR:0.0102 R:0.0200 gloss:-26524829622576932615712931840.0000 dloss:1176077808704248236585102491833507983249737807309783957504.0000 exploreP:0.0100
Episode:822 meanR:0.0101 R:0

Episode:867 meanR:0.0099 R:0.0040 gloss:-39933804259001741056703201280.0000 dloss:2665011557065774689969800218701069314878450616727320395776.0000 exploreP:0.0100
Episode:868 meanR:0.0099 R:0.0010 gloss:-40235966155275776124388376576.0000 dloss:2709038572318341764022231918629817359182451699577859866624.0000 exploreP:0.0100
Episode:869 meanR:0.0097 R:0.0075 gloss:-40642302197275100977031020544.0000 dloss:2758287781685204809480758902954745744451753676143094398976.0000 exploreP:0.0100
Episode:870 meanR:0.0097 R:0.0095 gloss:-41018506794312572714556063744.0000 dloss:2810117776252437901608574529907787221123680661002052960256.0000 exploreP:0.0100
Episode:871 meanR:0.0097 R:0.0090 gloss:-41326858497109137179310817280.0000 dloss:2852797347281698427018309318219021817989031218642642010112.0000 exploreP:0.0100
Episode:872 meanR:0.0096 R:0.0095 gloss:-41679279442911054390653616128.0000 dloss:2905277544478019270780535363206748202527997448244963573760.0000 exploreP:0.0100
Episode:873 meanR:0.0098 R:0

Episode:918 meanR:0.0104 R:0.0150 gloss:-61304170873805554178448162816.0000 dloss:6278921080428562341658239969879597106462981356933284364288.0000 exploreP:0.0100
Episode:919 meanR:0.0105 R:0.0210 gloss:-61937781311957136768608763904.0000 dloss:6414063955450168077218562661171573070238907411982406647808.0000 exploreP:0.0100
Episode:920 meanR:0.0104 R:0.0025 gloss:-62296544216889674709872934912.0000 dloss:6495786101505408712423979909674653157727361664027746893824.0000 exploreP:0.0100
Episode:921 meanR:0.0104 R:0.0150 gloss:-62849523204509213271093411840.0000 dloss:6603895496363345231059104987146436906960482338361365757952.0000 exploreP:0.0100
Episode:922 meanR:0.0104 R:0.0115 gloss:-63561694721438482905536921600.0000 dloss:6741661971103239135954942852988893389958251656793965985792.0000 exploreP:0.0100
Episode:923 meanR:0.0104 R:0.0080 gloss:-64041762262583783367647428608.0000 dloss:6844386902727593836399356318982721543657826243974166216704.0000 exploreP:0.0100
Episode:924 meanR:0.0105 R:0

Episode:969 meanR:0.0109 R:0.0050 gloss:-92173954653465772801721892864.0000 dloss:14173202378112387835267259470875607673698402563084708216832.0000 exploreP:0.0100
Episode:970 meanR:0.0109 R:0.0150 gloss:-92996975969468087348968292352.0000 dloss:14454005292306451091340761455997650348976543828033694334976.0000 exploreP:0.0100
Episode:971 meanR:0.0109 R:0.0080 gloss:-93765906759567465486334558208.0000 dloss:14692562171033285840553219138538393194525979961104806182912.0000 exploreP:0.0100
Episode:972 meanR:0.0109 R:0.0055 gloss:-94518514695165902881185333248.0000 dloss:14920850136637052250544408186473418544484118470395378532352.0000 exploreP:0.0100
Episode:973 meanR:0.0107 R:0.0075 gloss:-95174594614154749528995004416.0000 dloss:15132516286756935835605097284975613362733185300530538217472.0000 exploreP:0.0100
Episode:974 meanR:0.0107 R:0.0135 gloss:-95854297251719234512224256000.0000 dloss:15358933574096988776082528339766933063879973459447596974080.0000 exploreP:0.0100
Episode:975 meanR:0.01

Episode:1019 meanR:0.0110 R:0.0050 gloss:-134531434169091190387816529920.0000 dloss:30334389600340895584723930897077737033133661076331650088960.0000 exploreP:0.0100
Episode:1020 meanR:0.0111 R:0.0080 gloss:-135904555506637362351854059520.0000 dloss:30877528724731980527432616102448817916087324448388886298624.0000 exploreP:0.0100
Episode:1021 meanR:0.0110 R:0.0075 gloss:-137145327548664400233300492288.0000 dloss:31405522914049575004551167921553881405158475777403170848768.0000 exploreP:0.0100
Episode:1022 meanR:0.0110 R:0.0095 gloss:-138289085898036717108865269760.0000 dloss:31891048294145754906007213740200576263235899419938208612352.0000 exploreP:0.0100
Episode:1023 meanR:0.0110 R:0.0070 gloss:-139052297029881423221898608640.0000 dloss:32266697943368156675332248114166616736497021955931384053760.0000 exploreP:0.0100
Episode:1024 meanR:0.0110 R:0.0110 gloss:-140152261625361337187276161024.0000 dloss:32881276342742158010113089494941632061589413524579765190656.0000 exploreP:0.0100
Episode:10

Episode:1069 meanR:0.0101 R:0.0065 gloss:-194139543102704508269582352384.0000 dloss:62905993306522423520352654521533646467494740322766848786432.0000 exploreP:0.0100
Episode:1070 meanR:0.0100 R:0.0050 gloss:-195597949714694341454142111744.0000 dloss:63800110794727442794832900655006859010767304310426049708032.0000 exploreP:0.0100
Episode:1071 meanR:0.0101 R:0.0135 gloss:-196844414112489833909010300928.0000 dloss:64680494823799899098751787383585932244523970640003580559360.0000 exploreP:0.0100
Episode:1072 meanR:0.0101 R:0.0100 gloss:-198191378005037194195387809792.0000 dloss:65612507119395169583598221216416712127322026281442742370304.0000 exploreP:0.0100
Episode:1073 meanR:0.0102 R:0.0160 gloss:-199599587787907705822383177728.0000 dloss:66592387945012181824445690121551263919919699272953738821632.0000 exploreP:0.0100
Episode:1074 meanR:0.0103 R:0.0225 gloss:-201043820626209737509703778304.0000 dloss:67609874462179196117119191061248202565470210710121880748032.0000 exploreP:0.0100
Episode:10

Episode:1119 meanR:0.0101 R:0.0105 gloss:-274353849654373938371563094016.0000 dloss:125776621483247599184643703854067612346763377713030881083392.0000 exploreP:0.0100
Episode:1120 meanR:0.0100 R:0.0050 gloss:-276544721317912399631496511488.0000 dloss:127475231428059412482316960168126921364903187332693481226240.0000 exploreP:0.0100
Episode:1121 meanR:0.0100 R:0.0005 gloss:-278286167348201945197946339328.0000 dloss:129423185636285745212950974319560124841350469830877227515904.0000 exploreP:0.0100
Episode:1122 meanR:0.0100 R:0.0125 gloss:-280848894885499501723611299840.0000 dloss:131556502867174775071840124996156307332591977090170893631488.0000 exploreP:0.0100
Episode:1123 meanR:0.0100 R:0.0050 gloss:-281925389020136430504782069760.0000 dloss:132850838697993707502099152504685499229767532183718547423232.0000 exploreP:0.0100
Episode:1124 meanR:0.0100 R:0.0125 gloss:-283588608679825986584087363584.0000 dloss:134495650325295581041438529295983605877617036739321257787392.0000 exploreP:0.0100
Epis

Episode:1169 meanR:0.0105 R:0.0030 gloss:-382968114175811769058832941056.0000 dloss:244782223903400036137840978565966453662097641121636549656576.0000 exploreP:0.0100
Episode:1170 meanR:0.0106 R:0.0150 gloss:-385609207038924857990556155904.0000 dloss:248051368436472899798912506588315994618616928870301563879424.0000 exploreP:0.0100
Episode:1171 meanR:0.0105 R:0.0095 gloss:-387375430173953173306136854528.0000 dloss:250855322055641073666557453224654083970928255350467830218752.0000 exploreP:0.0100
Episode:1172 meanR:0.0104 R:0.0000 gloss:-391623597205283551274007003136.0000 dloss:255443226422195970812278901276054827129186809755284770127872.0000 exploreP:0.0100
Episode:1173 meanR:0.0104 R:0.0165 gloss:-393274848151625130361105678336.0000 dloss:258029197012741524715978952676003028532649602051080854700032.0000 exploreP:0.0100
Episode:1174 meanR:0.0103 R:0.0050 gloss:-395943508685299928446722375680.0000 dloss:261469184620091823147025079392475211350724787399507140673536.0000 exploreP:0.0100
Epis

Episode:1219 meanR:0.0099 R:0.0055 gloss:-525657977014656090976035012608.0000 dloss:462080557154643821042697993457926948869271477322562481946624.0000 exploreP:0.0100
Episode:1220 meanR:0.0101 R:0.0270 gloss:-529836019215165423249786929152.0000 dloss:468773742811284517281883516041456295661328509049950722588672.0000 exploreP:0.0100
Episode:1221 meanR:0.0102 R:0.0035 gloss:-532554203718618487833852116992.0000 dloss:473594447883688905287117525164940965511132050318735349645312.0000 exploreP:0.0100
Episode:1222 meanR:0.0102 R:0.0120 gloss:-536522377427976012157773938688.0000 dloss:480139591988852946530929879059728425119529573938136900173824.0000 exploreP:0.0100
Episode:1223 meanR:0.0103 R:0.0160 gloss:-539712421115479688062913478656.0000 dloss:486532818318427814993695156583784046435503501222338279505920.0000 exploreP:0.0100
Episode:1224 meanR:0.0102 R:0.0090 gloss:-542060661852571809311983927296.0000 dloss:491685487541700134997792325286398604990329561858331207794688.0000 exploreP:0.0100
Epis

Episode:1269 meanR:0.0103 R:0.0105 gloss:-714018745582268346980496310272.0000 dloss:852225867175009077908172265673736730552014486300663358685184.0000 exploreP:0.0100
Episode:1270 meanR:0.0103 R:0.0160 gloss:-719128233255917107135018696704.0000 dloss:863627903402297359578682944353507612016068648954521802768384.0000 exploreP:0.0100
Episode:1271 meanR:0.0104 R:0.0215 gloss:-723849700156386383105265500160.0000 dloss:874473152336357975239073570700658579905821133926553740115968.0000 exploreP:0.0100
Episode:1272 meanR:0.0105 R:0.0125 gloss:-727014828422182930046051680256.0000 dloss:882613678085811246794421104267336612439131044004699186397184.0000 exploreP:0.0100
Episode:1273 meanR:0.0104 R:0.0050 gloss:-730950851202434744672145375232.0000 dloss:893718354197991379485756831121680381196331306976881938530304.0000 exploreP:0.0100
Episode:1274 meanR:0.0104 R:0.0055 gloss:-737356811473797913851623112704.0000 dloss:906203329206894205905362625569554856638990066858601364848640.0000 exploreP:0.0100
Epis

Episode:1319 meanR:0.0105 R:0.0210 gloss:-960527071325028458072404656128.0000 dloss:1539363772129219730915856716344195929450677352306190187495424.0000 exploreP:0.0100
Episode:1320 meanR:0.0104 R:0.0185 gloss:-964924727902771724165836701696.0000 dloss:1556036661448373448997372399294473749222861670206259385597952.0000 exploreP:0.0100
Episode:1321 meanR:0.0105 R:0.0130 gloss:-968542730678856199109368348672.0000 dloss:1569053283387022936440588718250248764335361437435198798036992.0000 exploreP:0.0100
Episode:1322 meanR:0.0106 R:0.0180 gloss:-976937615342363379443710296064.0000 dloss:1596059119263363976583768331055056378574281953371344877912064.0000 exploreP:0.0100
Episode:1323 meanR:0.0106 R:0.0220 gloss:-982519932536394896299941429248.0000 dloss:1613076734773089810391593459218365519715105425306606215626752.0000 exploreP:0.0100
Episode:1324 meanR:0.0106 R:0.0095 gloss:-989847480331485545645574455296.0000 dloss:1634825150566829482459517812782610197739216870516495254290432.0000 exploreP:0.010

Episode:1368 meanR:0.0110 R:0.0120 gloss:-1267147938020681496073938665472.0000 dloss:2680368045319437531704721706836259775531764321517873999642624.0000 exploreP:0.0100
Episode:1369 meanR:0.0109 R:0.0000 gloss:-1278834866754373584695825793024.0000 dloss:2726958518596052868762095023299749254338373458725691010842624.0000 exploreP:0.0100
Episode:1370 meanR:0.0108 R:0.0105 gloss:-1281039428278097435557594398720.0000 dloss:2740053223321412944776919955894998234897598160206089034924032.0000 exploreP:0.0100
Episode:1371 meanR:0.0107 R:0.0120 gloss:-1290631447348805553516854640640.0000 dloss:2780791353561824426352910701353449029983588169405670430343168.0000 exploreP:0.0100
Episode:1372 meanR:0.0107 R:0.0130 gloss:-1298345600777480706384367452160.0000 dloss:2815050015632689796067877889532659864830488777404617731342336.0000 exploreP:0.0100
Episode:1373 meanR:0.0108 R:0.0155 gloss:-1302930560555414360467229900800.0000 dloss:2834446933515472153808595804312303066978349853879322244284416.0000 exploreP

Episode:1417 meanR:0.0107 R:0.0100 gloss:-1659228576406803808458484219904.0000 dloss:4597144256445591927807533846882021102752133984337588893253632.0000 exploreP:0.0100
Episode:1418 meanR:0.0108 R:0.0275 gloss:-1669472331198139645674647453696.0000 dloss:4657149226356672456496828332746672740702189509483623748730880.0000 exploreP:0.0100
Episode:1419 meanR:0.0107 R:0.0085 gloss:-1681179254132123575159057547264.0000 dloss:4711700122718605482655810105627789943004684567726117835767808.0000 exploreP:0.0100
Episode:1420 meanR:0.0106 R:0.0130 gloss:-1686855366420535032908397674496.0000 dloss:4754055852334333610628254470690284350649513625460815201042432.0000 exploreP:0.0100
Episode:1421 meanR:0.0106 R:0.0050 gloss:-1694580851107765925014122004480.0000 dloss:4798899838662766428857800947583171448463824612262367476056064.0000 exploreP:0.0100
Episode:1422 meanR:0.0104 R:0.0025 gloss:-1709927833204491403539658047488.0000 dloss:4879288768669368721986813276019172725932833641504760311840768.0000 exploreP

Episode:1466 meanR:0.0098 R:0.0120 gloss:-2159732490178953603379072335872.0000 dloss:7771174300022489660153463478463197822795221182914509144064000.0000 exploreP:0.0100
Episode:1467 meanR:0.0098 R:0.0055 gloss:-2171665633101452439211349114880.0000 dloss:7867730222973585528853914522537007460799389528211316992376832.0000 exploreP:0.0100
Episode:1468 meanR:0.0098 R:0.0130 gloss:-2181994721532147403623235584000.0000 dloss:7950116920726538602919945304006850906104186512239318955720704.0000 exploreP:0.0100
Episode:1469 meanR:0.0099 R:0.0090 gloss:-2190737826840529402673592532992.0000 dloss:8014188938447493098218461712716684264610526752226261698871296.0000 exploreP:0.0100
Episode:1470 meanR:0.0100 R:0.0150 gloss:-2203347010131091028691992444928.0000 dloss:8094942514774408889279284129490031352932810929491309676724224.0000 exploreP:0.0100
Episode:1471 meanR:0.0099 R:0.0085 gloss:-2214467928110924272890135183360.0000 dloss:8186095808529290115009000756086408671363934400825127224737792.0000 exploreP

Episode:1515 meanR:0.0101 R:0.0100 gloss:-2772395740991976058925837451264.0000 dloss:12829480770590018740379235165893834904122704040349156107091968.0000 exploreP:0.0100
Episode:1516 meanR:0.0100 R:0.0050 gloss:-2787213373206667712904176861184.0000 dloss:12959919998728388491694785461785788236328137802597770001907712.0000 exploreP:0.0100
Episode:1517 meanR:0.0099 R:0.0060 gloss:-2803896001838742785596237807616.0000 dloss:13118035669288188185460541620643820344294559074718041080070144.0000 exploreP:0.0100
Episode:1518 meanR:0.0099 R:0.0275 gloss:-2819827161329471408287671910400.0000 dloss:13266527059148979085241349702919920963285295236069832840445952.0000 exploreP:0.0100
Episode:1519 meanR:0.0101 R:0.0260 gloss:-2832811163974190106560476217344.0000 dloss:13397493340373399178620478979150256779516498103735963820752896.0000 exploreP:0.0100
Episode:1520 meanR:0.0101 R:0.0080 gloss:-2847493184698186381490690981888.0000 dloss:13515636705439744543721854571265718691864671230706566538723328.0000 ex

Episode:1564 meanR:0.0107 R:0.0100 gloss:-3541238286774805346519416307712.0000 dloss:20924473246491655249150860276671196242725763736480801475264512.0000 exploreP:0.0100
Episode:1565 meanR:0.0107 R:0.0175 gloss:-3551831047483725184558801354752.0000 dloss:21089336678506659258254272076578665049227524437390407017431040.0000 exploreP:0.0100
Episode:1566 meanR:0.0107 R:0.0055 gloss:-3580770331090961623548164046848.0000 dloss:21379792405285075980756874915010453172241945863426249030893568.0000 exploreP:0.0100
Episode:1567 meanR:0.0108 R:0.0165 gloss:-3595964445524440914269453680640.0000 dloss:21597052917665032515656319312774753305893082375247487660195840.0000 exploreP:0.0100
Episode:1568 meanR:0.0109 R:0.0285 gloss:-3611776973634670476245514321920.0000 dloss:21765645308048212378740580540563741666727375961159974193725440.0000 exploreP:0.0100
Episode:1569 meanR:0.0110 R:0.0135 gloss:-3632008791900327630808868192256.0000 dloss:22015413222280673221349684234607922975524586821918518316892160.0000 ex

Episode:1613 meanR:0.0113 R:0.0085 gloss:-4493893686185807492416477855744.0000 dloss:33707158486583520233720892157274290273281981067786387292422144.0000 exploreP:0.0100
Episode:1614 meanR:0.0115 R:0.0170 gloss:-4508811486827906488698738311168.0000 dloss:33951805860310428391280139075507391124117361373588277570830336.0000 exploreP:0.0100
Episode:1615 meanR:0.0115 R:0.0050 gloss:-4522256062913765103829298184192.0000 dloss:34207819452107838949917347244379559461631547255535278321827840.0000 exploreP:0.0100
Episode:1616 meanR:0.0114 R:0.0020 gloss:-4550175638767968335115277303808.0000 dloss:34548895803378459181339312210950407228693303068090296197709824.0000 exploreP:0.0100
Episode:1617 meanR:0.0117 R:0.0370 gloss:-4565969742749901101801949626368.0000 dloss:34823555840603289630599569189646713126579749395238910110466048.0000 exploreP:0.0100
Episode:1618 meanR:0.0117 R:0.0280 gloss:-4588663594308856242215715864576.0000 dloss:35197456673383380815142841169318360624958617983474145314209792.0000 ex

Episode:1662 meanR:0.0121 R:0.0125 gloss:-5644429938541708901686129459200.0000 dloss:53169255150979586229434694696627509274600560399828082090835968.0000 exploreP:0.0100
Episode:1663 meanR:0.0121 R:0.0165 gloss:-5669614350696612608695395680256.0000 dloss:53755031337796308448279088483761281759502841449583927841783808.0000 exploreP:0.0100
Episode:1664 meanR:0.0121 R:0.0090 gloss:-5707272949132793779880001011712.0000 dloss:54387498553608002830829099671969061262565967085794231161192448.0000 exploreP:0.0100
Episode:1665 meanR:0.0121 R:0.0160 gloss:-5732690156076972832357970083840.0000 dloss:54793471849698293088287556373380589313915480663151432217657344.0000 exploreP:0.0100
Episode:1666 meanR:0.0121 R:0.0020 gloss:-5762247290946274699601783029760.0000 dloss:55387003717940548377777163379297695758638175400319921304895488.0000 exploreP:0.0100
Episode:1667 meanR:0.0119 R:0.0000 gloss:-5782064607852230051796491436032.0000 dloss:55803343640850925367765528891428594260199756396185971668811776.0000 ex

Episode:1711 meanR:0.0110 R:0.0055 gloss:-7072255927194736506891623464960.0000 dloss:83394853562525683276355216520519122710891867130563258094714880.0000 exploreP:0.0100
Episode:1712 meanR:0.0110 R:0.0105 gloss:-7094595328136560811367651606528.0000 dloss:84064435173202276769342614701109640514225254550537526895968256.0000 exploreP:0.0100
Episode:1713 meanR:0.0110 R:0.0050 gloss:-7129371306660665729926811353088.0000 dloss:84779343339003261778918039472280934988925165367625947189084160.0000 exploreP:0.0100
Episode:1714 meanR:0.0110 R:0.0200 gloss:-7164422682723501194059169398784.0000 dloss:85580190415461833561358025505853824261863881857838500861181952.0000 exploreP:0.0100
Episode:1715 meanR:0.0111 R:0.0110 gloss:-7198180929087191550648735760384.0000 dloss:86460462292117050295691303078285631744164091445219058161549312.0000 exploreP:0.0100
Episode:1716 meanR:0.0111 R:0.0055 gloss:-7231322620934334557278594662400.0000 dloss:87217511345151794465686818883127522153761876847185814083338240.0000 ex

Episode:1760 meanR:0.0095 R:0.0130 gloss:-8793489659317800885680338894848.0000 dloss:129022701414548722465351494443628382566444556612563264639860736.0000 exploreP:0.0100
Episode:1761 meanR:0.0093 R:0.0090 gloss:-8829799999900673361850142818304.0000 dloss:130080258471474033855998105760939877387938180667155423512494080.0000 exploreP:0.0100
Episode:1762 meanR:0.0093 R:0.0105 gloss:-8864286412990906521436364472320.0000 dloss:131240042563189828620777736173805324076985610110819875246047232.0000 exploreP:0.0100
Episode:1763 meanR:0.0094 R:0.0220 gloss:-8902363840322669952115354370048.0000 dloss:132221548243395919391073802641815761249915327242357305656737792.0000 exploreP:0.0100
Episode:1764 meanR:0.0094 R:0.0090 gloss:-8937166551710614341702568116224.0000 dloss:133377172557729155893593821638508006396922062530356323549184000.0000 exploreP:0.0100
Episode:1765 meanR:0.0094 R:0.0150 gloss:-8989187016588919564226443345920.0000 dloss:134791478098530480440571782531129461798602385675384852171456512.0

Episode:1809 meanR:0.0094 R:0.0065 gloss:-10860543985724344782428047409152.0000 dloss:196966221529488222468983731964251318427407271635631745286537216.0000 exploreP:0.0100
Episode:1810 meanR:0.0095 R:0.0150 gloss:-10893286761107978548446305976320.0000 dloss:198115852345256410084565475372991440668179276399409350932692992.0000 exploreP:0.0100
Episode:1811 meanR:0.0096 R:0.0120 gloss:-10925666334525537845885165109248.0000 dloss:199629854723595431929162572335005186970892744427894053793693696.0000 exploreP:0.0100
Episode:1812 meanR:0.0096 R:0.0090 gloss:-10981338414074998923671883481088.0000 dloss:201263852804258905438503330816365746036150237299885034631593984.0000 exploreP:0.0100
Episode:1813 meanR:0.0096 R:0.0040 gloss:-11049137194040640448029008068608.0000 dloss:203565715382608623617907187313671282146666198144683920035151872.0000 exploreP:0.0100
Episode:1814 meanR:0.0095 R:0.0095 gloss:-11090316917027407858864102572032.0000 dloss:20504653634205663174288407379066304710565034578801420593948

Episode:1857 meanR:0.0106 R:0.0205 gloss:-13290931062753104116673666875392.0000 dloss:294781038653664764497067129045610813513441438798629893688000512.0000 exploreP:0.0100
Episode:1858 meanR:0.0108 R:0.0180 gloss:-13332519096048289561307500249088.0000 dloss:296605587136343123806321378086994249537841838132027315382648832.0000 exploreP:0.0100
Episode:1859 meanR:0.0108 R:0.0035 gloss:-13407816862869345515730332811264.0000 dloss:299791548889952143130214456510150788016855094050787731934543872.0000 exploreP:0.0100
Episode:1860 meanR:0.0109 R:0.0230 gloss:-13454660294806267787798745448448.0000 dloss:301993956684639853821339532212709850942861135242522960460775424.0000 exploreP:0.0100
Episode:1861 meanR:0.0109 R:0.0160 gloss:-13503311416051554566641989713920.0000 dloss:304149353422409661542301288002666440649497866538602385687707648.0000 exploreP:0.0100
Episode:1862 meanR:0.0109 R:0.0105 gloss:-13558344458248820540231607386112.0000 dloss:30726300602474757467501314323056944546633443904173334205982

Episode:1905 meanR:0.0115 R:0.0285 gloss:-16185382008265825316083198853120.0000 dloss:437075650817560887106886216716402561600616374979823943838334976.0000 exploreP:0.0100
Episode:1906 meanR:0.0116 R:0.0145 gloss:-16226434049204062107783640645632.0000 dloss:439479881833276249172430700768303477359644980659600067877928960.0000 exploreP:0.0100
Episode:1907 meanR:0.0116 R:0.0080 gloss:-16283382914058978902110766104576.0000 dloss:443056390872091164201039406803430647806325363207693628788965376.0000 exploreP:0.0100
Episode:1908 meanR:0.0116 R:0.0160 gloss:-16367888269668968867730133352448.0000 dloss:447002740373348168586195415037179752162729045107100122613809152.0000 exploreP:0.0100
Episode:1909 meanR:0.0115 R:0.0050 gloss:-16428274424305946294058179624960.0000 dloss:450444125299922305920173906081351030576227526827819162660765696.0000 exploreP:0.0100
Episode:1910 meanR:0.0114 R:0.0000 gloss:-16493908460351778156658380767232.0000 dloss:45398805922551664508783691447716784587658189978212692112126

Episode:1953 meanR:0.0106 R:0.0000 gloss:-19592016088206179646273388281856.0000 dloss:640465351680535175723272751528642987202806590912107160618926080.0000 exploreP:0.0100
Episode:1954 meanR:0.0105 R:0.0015 gloss:-19668500685412396163915042521088.0000 dloss:645169575785332735613633733670728883179501529858860043024203776.0000 exploreP:0.0100
Episode:1955 meanR:0.0104 R:0.0000 gloss:-19750836773321553555097640763392.0000 dloss:652227305351726260865419245561082604578486957270062901676212224.0000 exploreP:0.0100
Episode:1956 meanR:0.0103 R:0.0045 gloss:-19871632673462279381884912795648.0000 dloss:658325664070753836266688846432261298032802150655538363182874624.0000 exploreP:0.0100
Episode:1957 meanR:0.0103 R:0.0195 gloss:-19902978560350533028655283568640.0000 dloss:660834775373421969224869062871941786027276724555085228514738176.0000 exploreP:0.0100
Episode:1958 meanR:0.0103 R:0.0140 gloss:-19992653830561711263338719084544.0000 dloss:66704604920238787339942047127938810040040619938244981513466

In [22]:
import matplotlib.pyplot as plt  # imported here so this cell runs on its own

fig = plt.figure()
ax = fig.add_subplot(1,1,1)

plt.plot(np.arange(1, len(scores)+1), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

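def running_mean(x, N):
    # Moving average over a window of N samples, computed from cumulative sums.
    # A leading 0 is inserted so every window sum is a difference of two cumsum entries.
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N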
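
As a quick sanity check on `running_mean`, the sketch below applies it to a tiny made-up series. The name `toy_rewards` and its values are illustrative only and do not come from the training run. The output has `len(x) - N + 1` entries, which is why the plotting cells slice `eps` to the last `len(smoothed_arr)` entries.

```python
import numpy as np

def running_mean(x, N):
    # Same cumulative-sum moving average as in the notebook cell above.
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

# Made-up values, used only to show the window arithmetic.
toy_rewards = [1.0, 2.0, 3.0, 4.0, 5.0]
smoothed = running_mean(toy_rewards, 2)

print(smoothed)       # averages of each adjacent pair
print(len(smoothed))  # len(toy_rewards) - 2 + 1
```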

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')
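
The plotting cells appear to assume that each `*_list` holds `(episode, value)` pairs, so that transposing the array splits it into an episode axis and a value axis. The sketch below illustrates that assumption with a hypothetical `toy_rewards_list`; the data and the `window` variable are made up for illustration. Because the array mixes integers and floats, the episode numbers come back as floats.

```python
import numpy as np

# Hypothetical (episode, total_reward) pairs; not taken from the training log.
toy_rewards_list = [(0, 0.5), (1, 1.5), (2, 2.5), (3, 3.5)]

# Transposing an (n, 2) array yields two length-n rows: episodes and values.
eps, arr = np.array(toy_rewards_list).T

# Moving average with the same cumulative-sum approach as running_mean.
window = 2
cumsum = np.cumsum(np.insert(arr, 0, 0))
smoothed_arr = (cumsum[window:] - cumsum[:-window]) / window

# Align the x-axis with the shorter smoothed series by keeping the last entries.
aligned_eps = eps[-len(smoothed_arr):]

print(aligned_eps)
print(smoothed_arr)
```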

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(gloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')

In [None]:
eps, arr = np.array(dloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses')

In [None]:
# env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
# states = env_info.vector_observations                  # get the current state (for each agent)
# scores = np.zeros(num_agents)                          # initialize the score (for each agent)

# while True:
#     actions = agent.act(states)                        # select actions from loaded model agent
#     env_info = env.step(actions)[brain_name]           # send all actions to the environment
#     next_states = env_info.vector_observations         # get next state (for each agent)
#     rewards = env_info.rewards                         # get reward (for each agent)
#     dones = env_info.local_done                        # see if episode finished
#     scores += env_info.rewards                         # update the score (for each agent)
#     states = next_states                               # roll over states to next time step
#     if np.any(dones):                                  # exit loop if episode finished
#         break
# print('Total score: {}'.format(np.mean(scores)))