# Continuous Control

---

In this notebook, you will learn how to use the Unity ML-Agents environment for the second project of the [Deep Reinforcement Learning Nanodegree](https://www.udacity.com/course/deep-reinforcement-learning-nanodegree--nd893) program.

### 1. Start the Environment

We begin by importing the necessary packages.  If the code cell below returns an error, please revisit the project instructions to double-check that you have installed [Unity ML-Agents](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Installation.md) and [NumPy](http://www.numpy.org/).

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we will start the environment!  **_Before running the code cell below_**, change the `file_name` parameter to match the location of the Unity environment that you downloaded.

- **Mac**: `"path/to/Reacher.app"`
- **Windows** (x86): `"path/to/Reacher_Windows_x86/Reacher.exe"`
- **Windows** (x86_64): `"path/to/Reacher_Windows_x86_64/Reacher.exe"`
- **Linux** (x86): `"path/to/Reacher_Linux/Reacher.x86"`
- **Linux** (x86_64): `"path/to/Reacher_Linux/Reacher.x86_64"`
- **Linux** (x86, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86"`
- **Linux** (x86_64, headless): `"path/to/Reacher_Linux_NoVis/Reacher.x86_64"`

For instance, if you are using a Mac, then you downloaded `Reacher.app`.  If this file is in the same folder as the notebook, then the line below should appear as follows:
```python
env = UnityEnvironment(file_name="Reacher.app")
```

In [2]:
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux/Reacher.x86_64')
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_v1/Reacher.x86_64')
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_OneAgent/Reacher_Linux/Reacher.x86_64')
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_NoVis_OneAgent/Reacher_Linux_NoVis/Reacher.x86_64')
# env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_NoVis/Reacher.x86_64')
env = UnityEnvironment(file_name='/home/arasdar/unity-envs/Reacher_Linux_NoVis_MultiAgents/Reacher_Linux_NoVis/Reacher.x86_64')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]

### 2. Examine the State and Action Spaces

In this environment, a double-jointed arm can move to target locations. A reward of `+0.1` is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of `33` variables corresponding to position, rotation, velocity, and angular velocities of the arm.  Each action is a vector with four numbers, corresponding to torque applicable to two joints.  Every entry in the action vector must be a number between `-1` and `1`.
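
As a minimal sketch of the action constraint described above, the snippet below draws one Gaussian action vector per agent and clips every entry to `[-1, 1]`. It uses only the standard library so it runs without the Unity environment. The helper name `random_clipped_actions` and the `seed` parameter are illustrative assumptions, not part of the ML-Agents API.

```python
import random

def random_clipped_actions(num_agents, action_size, low=-1.0, high=1.0, seed=None):
    """Draw one Gaussian action vector per agent and clip each entry to [low, high]."""
    rng = random.Random(seed)  # local generator so the sketch is reproducible
    return [
        [min(high, max(low, rng.gauss(0.0, 1.0))) for _ in range(action_size)]
        for _ in range(num_agents)
    ]

# 20 agents with 4 action entries each, matching the sizes reported by the environment
actions = random_clipped_actions(num_agents=20, action_size=4, seed=0)
print(len(actions), len(actions[0]))
```

With NumPy, the same effect is usually obtained with `np.random.randn` followed by `np.clip`.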

Run the code cell below to print some information about the environment.

In [7]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
action_size = brain.vector_action_space_size
print('Size of each action:', action_size)

# examine the state space 
states = env_info.vector_observations
state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 20
Size of each action: 4
There are 20 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  7.90150833e+00 -1.00000000e+00
  1.25147629e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -5.22214413e-01]


### 3. Take Random Actions in the Environment

In the next code cell, you will learn how to use the Python API to control the agent and receive feedback from the environment.

Once this cell is executed, you can watch the agent's performance as it selects a random action at each time step.  A window should pop up that allows you to observe the agent as it moves through the environment.  

Of course, as part of the project, you'll have to change the code so that the agent is able to use its experience to gradually choose better actions when interacting with the environment!

In [8]:
# env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
# states = env_info.vector_observations                  # get the current state (for each agent)
# scores = np.zeros(num_agents)                          # initialize the score (for each agent)
# while True:
#     actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
#     actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
#     env_info = env.step(actions)[brain_name]           # send all actions to the environment
#     next_states = env_info.vector_observations         # get next state (for each agent)
#     rewards = env_info.rewards                         # get reward (for each agent)
#     dones = env_info.local_done                        # see if episode finished
#     scores += env_info.rewards                         # update the score (for each agent)
#     states = next_states                               # roll over states to next time step
#     if np.any(dones):                                  # exit loop if episode finished
#         break
# print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

When finished, you can close the environment.

In [9]:
# env.close()

### 4. It's Your Turn!

Now it's your turn to train your own agent to solve the environment!  When training the agent, set `train_mode=True`, so that the line for resetting the environment looks like the following:
```python
env_info = env.reset(train_mode=True)[brain_name]
```

In [10]:
# # Testing the train mode
# env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
# state = env_info.vector_observations[0]                  # get the current state (for each agent)
# #scores = np.zeros(num_agents)                          # initialize the score (for each agent)
# num_steps = 0
# while True:
#     num_steps += 1
#     action = np.random.randn(num_agents, action_size) # select an action (for each agent)
#     #print(action)
#     action = np.clip(action, -1, 1)                  # all actions between -1 and 1
#     #print(action)
#     env_info = env.step(action)[brain_name]           # send all actions to the environment
#     next_state = env_info.vector_observations[0]         # get next state (for each agent)
#     reward = env_info.rewards[0]                         # get reward (for each agent)
#     done = env_info.local_done[0]                        # see if episode finished
#     #scores += env_info.rewards                         # update the score (for each agent)
#     state = next_state                               # roll over states to next time step
#     if done is True:                                  # exit loop if episode finished
#         #print(action.shape, reward)
#         #print(done)
#         break
# print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))
# num_steps

## Option 1: Solve the First Version
The task is episodic, and in order to solve the environment, your agent must get an average score of +30 over 100 consecutive episodes.
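
The solve criterion above can be sketched as a check over a sliding window of episode scores. This is only an illustration: the helper name `is_solved` and its parameters are assumptions, and the check deliberately requires a full window of 100 episodes before it can succeed.

```python
from collections import deque

def is_solved(recent_scores, goal_score=30.0, window=100):
    """True once at least `window` scores exist and the mean of the last `window` reaches goal_score."""
    scores = list(recent_scores)[-window:]
    return len(scores) >= window and sum(scores) / len(scores) >= goal_score

recent_scores = deque(maxlen=100)   # keeps only the 100 most recent episode scores
for episode_score in [31.0] * 99:
    recent_scores.append(episode_score)
print(is_solved(recent_scores))     # window not yet full

recent_scores.append(31.0)
print(is_solved(recent_scores))     # full window with mean above the goal
```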

In [11]:
import tensorflow as tf
print('TensorFlow Version: {}'.format(tf.__version__))
print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.7.1
Default GPU Device: 


In [12]:
def model_input(state_size, action_size):
    #states = tf.placeholder(tf.float32, [None, *state_size], name='states')
    states = tf.placeholder(tf.float64, [None, state_size], name='states')
    actions = tf.placeholder(tf.float64, [None, action_size], name='actions')
    targetQs = tf.placeholder(tf.float64, [None], name='targetQs')
    isTraining = tf.placeholder(tf.bool, [], name='isTraining')
    return states, actions, targetQs, isTraining

In [13]:
def actor(states, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('actor', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=hidden_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        h2 = tf.layers.dense(inputs=nl1, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
        
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=action_size)        
        return logits
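
The `tf.maximum(alpha * x, x)` pattern used in the hidden layers acts as a leaky ReLU when `alpha` lies between 0 and 1: positive inputs pass through unchanged and negative inputs are scaled by `alpha`. A scalar sketch in plain Python (the function name `leaky_relu` is illustrative):

```python
def leaky_relu(x, alpha=0.1):
    """Leaky ReLU written as max(alpha * x, x); assumes 0 <= alpha <= 1."""
    return max(alpha * x, x)

for x in (-2.0, 0.0, 3.0):
    # negative inputs are scaled by alpha, non-negative inputs are unchanged
    print(x, leaky_relu(x))
```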

In [14]:
def critic(states, actions, action_size, hidden_size, reuse=False, alpha=0.1, training=False):
    with tf.variable_scope('critic', reuse=reuse):
        # First fully connected layer
        h1 = tf.layers.dense(inputs=states, units=action_size)
        bn1 = tf.layers.batch_normalization(h1, training=training)        
        nl1 = tf.maximum(alpha * bn1, bn1)
        
        # Second fully connected layer
        nl1_fused = tf.concat(axis=1, values=[nl1, actions])
        h2 = tf.layers.dense(inputs=nl1_fused, units=hidden_size)
        bn2 = tf.layers.batch_normalization(h2, training=training)        
        nl2 = tf.maximum(alpha * bn2, bn2)
                
        # Output layer
        logits = tf.layers.dense(inputs=nl2, units=1)
        return logits

In [15]:
def model_loss(actions, states, targetQs, action_size, hidden_size, isTraining):
    ###################################################################
    actions_logits = actor(states=states, hidden_size=hidden_size, action_size=action_size, 
                           training=isTraining)
    gQlogits = critic(states=states, actions=actions_logits, hidden_size=hidden_size, action_size=action_size,
                      training=isTraining)
    ###########################################################################
    Qlogits = critic(states=states, actions=actions, hidden_size=hidden_size, action_size=action_size, 
                     training=isTraining, reuse=True)
    ###########################################################################
    Qs = tf.reshape(Qlogits, shape=[-1])
    gQs = tf.reshape(gQlogits, shape=[-1])
    dloss = tf.reduce_mean(tf.square(Qs - targetQs))
    gloss = -tf.reduce_mean(gQs)
    return actions_logits, gQlogits, gloss, dloss
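
To make the two losses concrete, here is a plain-Python sketch of the quantities involved, assuming the usual one-step bootstrapped target for `targetQs`. The function names are illustrative, and lists of floats stand in for tensors.

```python
def td_targets(rewards, next_qs, dones, gamma=0.99):
    """One-step targets: r + gamma * Q(s', actor(s')) * (1 - done)."""
    return [r + gamma * q * (1.0 - float(d)) for r, q, d in zip(rewards, next_qs, dones)]

def critic_loss(qs, targets):
    """Mean squared error between Q(s, a) and the targets (the role of dloss)."""
    return sum((q - y) ** 2 for q, y in zip(qs, targets)) / len(qs)

def actor_loss(policy_qs):
    """Negative mean of Q(s, actor(s)) (the role of gloss)."""
    return -sum(policy_qs) / len(policy_qs)

# Second transition is terminal, so its target is just the reward
targets = td_targets(rewards=[0.1, 0.0], next_qs=[2.0, 5.0], dones=[False, True])
print(targets)
print(critic_loss([1.0, 1.0], [2.0, 1.0]))
print(actor_loss([1.0, 3.0]))
```

Minimizing the actor loss pushes the actor toward actions that the critic scores highly, while the critic loss pulls the critic's estimates toward the targets.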

In [16]:
def model_opt(gloss, dloss, g_learning_rate, d_learning_rate):
    # Get weights and bias to update
    t_vars = tf.trainable_variables()
    g_vars = [var for var in t_vars if var.name.startswith('actor')]
    d_vars = [var for var in t_vars if var.name.startswith('critic')]

    # Optimize
    with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)): # Required for batchnorm (BN)
        g_opt = tf.train.AdamOptimizer(g_learning_rate).minimize(gloss, var_list=g_vars)
        d_opt = tf.train.AdamOptimizer(d_learning_rate).minimize(dloss, var_list=d_vars)
    return g_opt, d_opt

In [17]:
class Model:
    def __init__(self, state_size, action_size, hidden_size, g_learning_rate, d_learning_rate, gamma):

        # Data of the Model: make the data available inside the framework
        self.states, self.actions, self.targetQs, self.isTraining = model_input(state_size=state_size, 
                                                                                action_size=action_size)

        # Create the Model: calculating the loss and forward pass
        self.actions_logits, self.gQlogits, self.gloss, self.dloss = model_loss(
            action_size=action_size, hidden_size=hidden_size, # model init
            states=self.states, actions=self.actions, targetQs=self.targetQs, isTraining=self.isTraining) # model input
        
        # Update the model: backward pass and backprop
        self.g_opt, self.d_opt = model_opt(gloss=self.gloss, dloss=self.dloss,
                                           g_learning_rate=g_learning_rate, d_learning_rate=d_learning_rate)

In [18]:
from collections import deque
class Memory():
    def __init__(self, max_size = 1000):
        self.buffer = deque(maxlen=max_size)
    def sample(self, batch_size):
        idx = np.random.choice(np.arange(len(self.buffer)), 
                               size=batch_size, 
                               replace=False)
        return [self.buffer[ii] for ii in idx]
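
A standard-library sketch of the same idea shows how such a buffer behaves once it is full. The class name `ReplayBuffer` and the `add` method are illustrative additions, not part of the notebook's `Memory` class.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform sampling without replacement."""

    def __init__(self, max_size=1000):
        self.buffer = deque(maxlen=max_size)  # oldest entries are dropped when full

    def add(self, state, action, next_state, reward, done):
        self.buffer.append((state, action, next_state, reward, done))

    def sample(self, batch_size):
        # random.sample raises ValueError if batch_size exceeds the buffer length
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(max_size=3)
for i in range(5):
    buf.add(state=i, action=0.0, next_state=i + 1, reward=0.1, done=False)

print(len(buf))                           # capacity caps the size at 3
print(sorted(t[0] for t in buf.buffer))   # only the three newest states remain
batch = buf.sample(2)
```

Sampling without replacement fails when the batch is larger than the buffer, so a training loop should wait until enough transitions have been stored.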

In [19]:
# inspect observation/action shapes and brain metadata
env_info.vector_observations.shape, env_info.previous_vector_actions.shape, \
brain.vector_action_space_size, brain.number_visual_observations, \
brain.vector_action_space_size, brain.vector_observation_space_size

((20, 33), (20, 4), 4, 0, 4, 33)

In [20]:
# Exploration parameters
explore_start = 1.0            # exploration probability at start
explore_stop = 0.01           # minimum exploration probability 
decay_rate = 0.0001            # exponential decay rate for exploration prob

# Network parameters
state_size = 33
action_size = 4
hidden_size = 33*2             # number of units in each hidden layer
g_learning_rate = 1e-3         # actor learning rate
d_learning_rate = 1e-3         # critic learning rate

# Memory parameters
memory_size = int(1e6)            # memory capacity
batch_size = 1024             # experience mini-batch size
gamma=0.99
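
The exploration parameters define an exponentially decaying exploration probability. A minimal sketch of that schedule, using the values above as defaults (the function name `explore_probability` is illustrative):

```python
import math

def explore_probability(step, explore_start=1.0, explore_stop=0.01, decay_rate=0.0001):
    """Decay from explore_start toward explore_stop as the global step count grows."""
    return explore_stop + (explore_start - explore_stop) * math.exp(-decay_rate * step)

for step in (0, 1000, 100000):
    print(step, round(explore_probability(step), 4))
# 0 1.0
# 1000 0.9058
# 100000 0.01
```

With `decay_rate = 0.0001`, the probability drops to roughly 0.9 after 1,000 steps and is effectively at its floor of 0.01 after about 100,000 steps.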

In [21]:
# Reset/init the graph/session
tf.reset_default_graph()            # returns None, so fetch the graph explicitly
graph = tf.get_default_graph()

# Init the model
model = Model(action_size=action_size, state_size=state_size, hidden_size=hidden_size, gamma=gamma,
              g_learning_rate=g_learning_rate, d_learning_rate=d_learning_rate)

# Init the memory
memory = Memory(max_size=memory_size)

In [22]:
# env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
# states = env_info.vector_observations                  # get the current state (for each agent)
# scores = np.zeros(num_agents)                          # initialize the score (for each agent)

# for _ in range(memory_size):
#     actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
#     actions = np.clip(actions, -1, 1)                  # all actions between -1 and 1
#     env_info = env.step(actions)[brain_name]           # send all actions to the environment
#     next_states = env_info.vector_observations         # get next state (for each agent)
#     rewards = env_info.rewards                         # get reward (for each agent)
#     dones = env_info.local_done                        # see if episode finished

#     for state, action, next_state, reward, done in zip(states, actions, next_states, rewards, dones):
#         #agent.step(state, action, reward, next_state, done) # send actions to the agent
#         memory.buffer.append([state, action, next_state, reward, done])
        
#     scores += env_info.rewards                         # update the score (for each agent)
#     states = next_states                               # roll over states to next time step
    
#     if np.any(dones):                                  # exit loop if episode finished
#         print('Average scores: {}'.format(np.mean(scores)))
#         env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
#         states = env_info.vector_observations                  # get the current state (for each agent)
#         scores = np.zeros(num_agents)                          # initialize the score (for each agent)

In [23]:
# env_info = env.reset(train_mode=True)[brain_name] # reset the environment
# states = env_info.vector_observations   # get the state
# for _ in range(memory_size):
#     action = np.random.randint(action_size)        # select an action
#     env_info = env.step(action)[brain_name]        # send the action to the environment
#     next_state = env_info.vector_observations[0]   # get the next state
#     reward = env_info.rewards[0]                   # get the reward
#     done = env_info.local_done[0]                  # see if episode has finished
#     memory.buffer.append([state, action, next_state, reward, float(done)])
#     state = next_state
#     if done:                                       # exit loop if episode finished
#         env_info = env.reset(train_mode=True)[brain_name] # reset the environment
#         state = env_info.vector_observations[0]   # get the state
#         break

In [None]:
# Save/load the model and save for plotting
saver = tf.train.Saver()
episode_rewards_list, rewards_list = [], []
gloss_list, dloss_list = [], []

# TF session for training
with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    #saver.restore(sess, 'checkpoints/model.ckpt')    
    #saver.restore(sess, tf.train.latest_checkpoint('checkpoints'))
    total_step = 0 # Explore or exploit parameter
    episode_reward = deque(maxlen=100) # 100 episodes for running average/running mean/window
    n_episodes=2000 
    max_t=1000 
    #     print_every=10, 
    learn_every=20 
    num_learn=10
    goal_score=30

    # Training episodes/epochs
    for ep in range(n_episodes):
        gloss_batch, dloss_batch = [], []
        
        #state = env.reset()
        env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    
        states = env_info.vector_observations                 # get the current state (for each agent)
        #print(states.shape)
        scores = np.zeros(num_agents)                          # initialize the score (for each agent)
        
        # Training steps/batches
        for t in range(max_t):
            # Explore (env) or Exploit (model)
            total_step += 1
            explore_p = explore_stop + (explore_start - explore_stop) * np.exp(-decay_rate * total_step) 
            if explore_p > np.random.rand():
                #action = env.action_space.sample()
                actions = np.random.randn(num_agents, action_size) # select an action (for each agent)
                #print('explore', actions.shape)
            else:
                #print(states.shape)
                actions = sess.run(model.actions_logits, feed_dict={model.states: states, 
                                                                    model.isTraining: False})
                #print('model', actions.shape)

            actions = np.clip(actions, -1, 1) # [-1, +1]
            #print(actions.shape)
            #next_state, reward, done, _ = env.step(action)
            env_info = env.step(actions)[brain_name]           # send all actions to the environment
            next_states = env_info.vector_observations         # get next state (for each agent)
            rewards = env_info.rewards                        # get reward (for each agent)
            dones = env_info.local_done                        # see if episode finished
            
            for state, action, next_state, reward, done in zip(states, actions, next_states, rewards, dones):
                #agent.step(state, action, reward, next_state, done) # send actions to the agent
                memory.buffer.append([state, action, next_state, reward, done])

            scores += env_info.rewards                         # update the score (for each agent)
            states = next_states                               # roll over states to next time step
            
            # total reward
            total_reward = np.mean(scores)
            
            # Training
            if len(memory.buffer) >= batch_size:
                if t%learn_every == 0:
                    for each in range(num_learn):
                        #agent.start_learn()
                        #experiences = self.memory.sample()
                        batch = memory.sample(batch_size)
                        # unpack the sampled transitions into per-field arrays
                        states_ = np.array([exp[0] for exp in batch])
                        actions_ = np.array([exp[1] for exp in batch])
                        next_states_ = np.array([exp[2] for exp in batch])
                        rewards_ = np.array([exp[3] for exp in batch])
                        dones_ = np.array([exp[4] for exp in batch])

                        #self.learn(experiences, GAMMA)
                        # TargetQs
                        nextQlogits = sess.run(model.gQlogits, feed_dict = {model.states: next_states_, 
                                                                            model.isTraining: False})
                        nextQs = nextQlogits.reshape(-1)
                        targetQs = rewards_ + (gamma * nextQs * (1-dones_))

                        feed_dict = {model.states: states_, model.actions: actions_, model.targetQs: targetQs,
                                     model.isTraining: True}
                        dloss, _= sess.run([model.dloss, model.d_opt], feed_dict)
                        dloss_batch.append(dloss)

                        # Learn actor only once compared to critic
                        if each+1 == num_learn:
                            gloss, _= sess.run([model.gloss, model.g_opt], feed_dict)
                            gloss_batch.append(gloss)
            
            # End of episode
            if np.any(dones):   # check every agent, not only the last one from the loop above
                break
                
        # Print out
        episode_reward.append(total_reward)
        print('Episode:{}'.format(ep),
              'meanR:{:.4f}'.format(np.mean(episode_reward)),
              'R:{:.4f}'.format(total_reward),
              'gloss:{:.4f}'.format(np.mean(gloss_batch)),
              'dloss:{:.4f}'.format(np.mean(dloss_batch)),
              'exploreP:{:.4f}'.format(explore_p))

        # Plotting out
        episode_rewards_list.append([ep, np.mean(episode_reward)])
        rewards_list.append([ep, total_reward])
        gloss_list.append([ep, np.mean(gloss_batch)])
        dloss_list.append([ep, np.mean(dloss_batch)])
        
        # Break episode/epoch loop
        ## Option 1: Solve the First Version
        #The task is episodic, and in order to solve the environment, 
        #your agent must get an average score of +30 over 100 consecutive episodes.
        if np.mean(episode_reward) >= goal_score:
            break
            
    # At the end of all training episodes/epochs
    saver.save(sess, 'checkpoints/model.ckpt')

Episode:0 meanR:0.1980 R:0.1980 gloss:-3.9094 dloss:70.9249 exploreP:0.9058
Episode:1 meanR:0.1690 R:0.1400 gloss:-35.1160 dloss:148.9027 exploreP:0.8205
Episode:2 meanR:0.1323 R:0.0590 gloss:-53.2826 dloss:37.7698 exploreP:0.7434
Episode:3 meanR:0.1245 R:0.1010 gloss:-113.9594 dloss:2857.2634 exploreP:0.6736
Episode:4 meanR:0.1273 R:0.1385 gloss:-223.6206 dloss:1229.3695 exploreP:0.6105
Episode:5 meanR:0.1487 R:0.2555 gloss:-330.2719 dloss:1397.7375 exploreP:0.5533
Episode:6 meanR:0.1540 R:0.1860 gloss:-518.6990 dloss:1702.5973 exploreP:0.5016
Episode:7 meanR:0.1659 R:0.2495 gloss:-761.5189 dloss:3698.9479 exploreP:0.4548
Episode:8 meanR:0.1903 R:0.3855 gloss:-1029.6779 dloss:34510.5311 exploreP:0.4125
Episode:9 meanR:0.2345 R:0.6320 gloss:-1163.9317 dloss:16449.2597 exploreP:0.3742
Episode:10 meanR:0.2774 R:0.7065 gloss:-1346.3811 dloss:12830.7884 exploreP:0.3395
Episode:11 meanR:0.3383 R:1.0085 gloss:-1408.8666 dloss:11176.2516 exploreP:0.3082
Episode:12 meanR:0.4301 R:1.5310 gloss:

Episode:97 meanR:0.4855 R:0.4265 gloss:-83177.3785 dloss:20915977.8933 exploreP:0.0101
Episode:98 meanR:0.4838 R:0.3165 gloss:-84731.4485 dloss:25871013.6040 exploreP:0.0100
Episode:99 meanR:0.4809 R:0.1995 gloss:-85947.2341 dloss:35912204.2247 exploreP:0.0100
Episode:100 meanR:0.4837 R:0.4715 gloss:-87337.5501 dloss:10557582.2537 exploreP:0.0100
Episode:101 meanR:0.4865 R:0.4200 gloss:-88583.5641 dloss:16744735.1187 exploreP:0.0100
Episode:102 meanR:0.4896 R:0.3745 gloss:-90787.3892 dloss:24569384.2632 exploreP:0.0100
Episode:103 meanR:0.4922 R:0.3555 gloss:-91339.6390 dloss:17868248.5470 exploreP:0.0100
Episode:104 meanR:0.4974 R:0.6630 gloss:-91187.5514 dloss:27756878.3392 exploreP:0.0100
Episode:105 meanR:0.5014 R:0.6565 gloss:-93995.1693 dloss:17893127.5745 exploreP:0.0100
Episode:106 meanR:0.5071 R:0.7490 gloss:-95990.0137 dloss:8365997.4112 exploreP:0.0100
Episode:107 meanR:0.5090 R:0.4380 gloss:-98349.3069 dloss:14891072.7482 exploreP:0.0100
Episode:108 meanR:0.5141 R:0.8965 gl

Episode:189 meanR:0.4934 R:0.5795 gloss:-319327.3869 dloss:340669259.5086 exploreP:0.0100
Episode:190 meanR:0.4989 R:0.7370 gloss:-324283.4034 dloss:306638317.9220 exploreP:0.0100
Episode:191 meanR:0.5036 R:0.6265 gloss:-329364.3074 dloss:338646465.6450 exploreP:0.0100
Episode:192 meanR:0.5107 R:0.8270 gloss:-333521.1792 dloss:627757543.1406 exploreP:0.0100
Episode:193 meanR:0.5102 R:0.0025 gloss:-337378.2938 dloss:412147406.4175 exploreP:0.0100
Episode:194 meanR:0.5078 R:0.0300 gloss:-341967.0606 dloss:311665632.9567 exploreP:0.0100
Episode:195 meanR:0.5044 R:0.0625 gloss:-346475.4691 dloss:284799681.7153 exploreP:0.0100
Episode:196 meanR:0.5008 R:0.2670 gloss:-351838.4327 dloss:392888746.2624 exploreP:0.0100
Episode:197 meanR:0.4988 R:0.2305 gloss:-357311.5198 dloss:328395480.8080 exploreP:0.0100
Episode:198 meanR:0.5002 R:0.4575 gloss:-361867.5928 dloss:289337875.6333 exploreP:0.0100
Episode:199 meanR:0.4999 R:0.1670 gloss:-367680.6648 dloss:416682480.5720 exploreP:0.0100
Episode:20

Episode:280 meanR:0.5899 R:0.7685 gloss:-821391.9729 dloss:3967475146.8976 exploreP:0.0100
Episode:281 meanR:0.5914 R:0.6900 gloss:-828508.3870 dloss:14702679108.6426 exploreP:0.0100
Episode:282 meanR:0.5919 R:0.5445 gloss:-838630.3023 dloss:11516211999.7733 exploreP:0.0100
Episode:283 meanR:0.5896 R:0.5680 gloss:-850592.3162 dloss:8305679899.3622 exploreP:0.0100
Episode:284 meanR:0.5880 R:0.5655 gloss:-856588.7137 dloss:4740198529.7051 exploreP:0.0100
Episode:285 meanR:0.5872 R:0.6260 gloss:-860579.5778 dloss:3458903552.0471 exploreP:0.0100
Episode:286 meanR:0.5872 R:0.6405 gloss:-864991.6078 dloss:3308780543.6877 exploreP:0.0100
Episode:287 meanR:0.5881 R:0.6140 gloss:-871792.5292 dloss:5535894492.4381 exploreP:0.0100
Episode:288 meanR:0.5869 R:0.6300 gloss:-879683.1675 dloss:5537805928.2806 exploreP:0.0100
Episode:289 meanR:0.5856 R:0.4540 gloss:-887294.6776 dloss:5244962344.3842 exploreP:0.0100
Episode:290 meanR:0.5834 R:0.5120 gloss:-894245.4271 dloss:7014926296.4185 exploreP:0.01

Episode:369 meanR:0.6232 R:0.6465 gloss:-1525649.9679 dloss:16230826574.8062 exploreP:0.0100
Episode:370 meanR:0.6233 R:0.7945 gloss:-1532006.0006 dloss:7803451116.8195 exploreP:0.0100
Episode:371 meanR:0.6215 R:0.6035 gloss:-1544339.8702 dloss:26405680879.7779 exploreP:0.0100
Episode:372 meanR:0.6235 R:0.9020 gloss:-1546512.8597 dloss:25466724190.7760 exploreP:0.0100
Episode:373 meanR:0.6254 R:0.9435 gloss:-1548416.3019 dloss:49966825502.0476 exploreP:0.0100
Episode:374 meanR:0.6250 R:0.7410 gloss:-1567777.8128 dloss:29946006620.8842 exploreP:0.0100
Episode:375 meanR:0.6282 R:0.9865 gloss:-1571163.8853 dloss:13194457748.7920 exploreP:0.0100
Episode:376 meanR:0.6283 R:0.7220 gloss:-1583237.8199 dloss:38739446166.2399 exploreP:0.0100
Episode:377 meanR:0.6319 R:1.0025 gloss:-1586270.4758 dloss:30523554685.7764 exploreP:0.0100
Episode:378 meanR:0.6326 R:0.8055 gloss:-1602385.9448 dloss:29979030983.3085 exploreP:0.0100
Episode:379 meanR:0.6336 R:0.8420 gloss:-1608621.8181 dloss:40688453970

Episode:458 meanR:0.7377 R:0.6400 gloss:-2536047.3049 dloss:50365419514.4309 exploreP:0.0100
Episode:459 meanR:0.7359 R:0.5820 gloss:-2552347.7616 dloss:54088634484.2701 exploreP:0.0100
Episode:460 meanR:0.7378 R:0.7560 gloss:-2565044.9501 dloss:56591880646.5340 exploreP:0.0100
Episode:461 meanR:0.7415 R:0.8070 gloss:-2579897.0709 dloss:53350951052.2300 exploreP:0.0100
Episode:462 meanR:0.7450 R:0.9450 gloss:-2592882.6392 dloss:59438227123.8391 exploreP:0.0100
Episode:463 meanR:0.7480 R:0.7960 gloss:-2604817.9438 dloss:63225206248.6238 exploreP:0.0100
Episode:464 meanR:0.7480 R:0.7920 gloss:-2619030.9401 dloss:63447941482.1953 exploreP:0.0100
Episode:465 meanR:0.7477 R:0.8285 gloss:-2631186.0258 dloss:78066401154.7236 exploreP:0.0100
Episode:466 meanR:0.7492 R:0.8775 gloss:-2646100.9929 dloss:79212376312.1178 exploreP:0.0100
Episode:467 meanR:0.7502 R:0.8225 gloss:-2659634.6557 dloss:59590699961.8780 exploreP:0.0100
Episode:468 meanR:0.7513 R:0.7505 gloss:-2669670.6678 dloss:7004705994

Episode:546 meanR:0.8237 R:0.9870 gloss:-3665216.0138 dloss:132065496293.0195 exploreP:0.0100
Episode:547 meanR:0.8233 R:0.7190 gloss:-3675120.1177 dloss:95964190915.3275 exploreP:0.0100
Episode:548 meanR:0.8217 R:0.6125 gloss:-3684590.1991 dloss:92292256838.3514 exploreP:0.0100
Episode:549 meanR:0.8221 R:0.6985 gloss:-3696675.0110 dloss:151739891173.7192 exploreP:0.0100
Episode:550 meanR:0.8218 R:0.5580 gloss:-3708748.1632 dloss:92981716067.6109 exploreP:0.0100
Episode:551 meanR:0.8218 R:0.7000 gloss:-3719000.1225 dloss:207031315733.9788 exploreP:0.0100
Episode:552 meanR:0.8211 R:0.4325 gloss:-3735143.0177 dloss:188593901667.8655 exploreP:0.0100
Episode:553 meanR:0.8203 R:0.6775 gloss:-3744724.8470 dloss:84978296900.1844 exploreP:0.0100
Episode:554 meanR:0.8180 R:0.6810 gloss:-3756180.6299 dloss:142860882391.3658 exploreP:0.0100
Episode:555 meanR:0.8174 R:0.7765 gloss:-3763012.4982 dloss:261605152419.6177 exploreP:0.0100
Episode:556 meanR:0.8198 R:0.8630 gloss:-3781631.4629 dloss:9178

Episode:634 meanR:0.6676 R:0.5255 gloss:-4888180.3312 dloss:241205230021.6314 exploreP:0.0100
Episode:635 meanR:0.6614 R:0.5830 gloss:-4900554.6984 dloss:218197779234.8790 exploreP:0.0100
Episode:636 meanR:0.6549 R:0.4285 gloss:-4921515.8736 dloss:211873453981.6480 exploreP:0.0100
Episode:637 meanR:0.6512 R:0.6030 gloss:-4934112.3269 dloss:212433914594.1696 exploreP:0.0100
Episode:638 meanR:0.6483 R:0.6750 gloss:-4949948.0827 dloss:252224758543.0456 exploreP:0.0100
Episode:639 meanR:0.6437 R:0.6200 gloss:-4971019.7862 dloss:262881065094.2520 exploreP:0.0100
Episode:640 meanR:0.6466 R:0.9770 gloss:-4990376.9770 dloss:228246591363.8145 exploreP:0.0100
Episode:641 meanR:0.6449 R:0.7630 gloss:-4993554.9653 dloss:167000213637.3342 exploreP:0.0100
Episode:642 meanR:0.6407 R:0.7325 gloss:-5013053.4834 dloss:174211771482.3965 exploreP:0.0100
Episode:643 meanR:0.6382 R:0.5220 gloss:-5029421.6941 dloss:198477788056.2473 exploreP:0.0100
Episode:644 meanR:0.6361 R:0.4760 gloss:-5048412.8786 dloss:

Episode:722 meanR:0.6761 R:0.6945 gloss:-6335467.4426 dloss:354603268621.0072 exploreP:0.0100
Episode:723 meanR:0.6776 R:0.7805 gloss:-6316613.1144 dloss:757264188818.2740 exploreP:0.0100
Episode:724 meanR:0.6786 R:0.8245 gloss:-6366993.2342 dloss:293125897191.6288 exploreP:0.0100
Episode:725 meanR:0.6780 R:0.6015 gloss:-6398311.4304 dloss:321870275163.1422 exploreP:0.0100
Episode:726 meanR:0.6795 R:0.6170 gloss:-6430896.0860 dloss:389258851648.2724 exploreP:0.0100
Episode:727 meanR:0.6814 R:0.6110 gloss:-6457122.1161 dloss:438733113016.1514 exploreP:0.0100
Episode:728 meanR:0.6853 R:0.8840 gloss:-6482514.6093 dloss:442599100877.3058 exploreP:0.0100
Episode:729 meanR:0.6888 R:0.8395 gloss:-6505039.5783 dloss:420534882408.1239 exploreP:0.0100
Episode:730 meanR:0.6922 R:0.8985 gloss:-6524796.0909 dloss:456095811160.9547 exploreP:0.0100
Episode:731 meanR:0.6941 R:0.8015 gloss:-6544605.8996 dloss:428916391921.7208 exploreP:0.0100
Episode:732 meanR:0.6935 R:0.4960 gloss:-6562612.6144 dloss:

Episode:809 meanR:0.6138 R:0.5375 gloss:-8236288.8603 dloss:714617515168.5876 exploreP:0.0100
Episode:810 meanR:0.6126 R:0.5325 gloss:-8261478.6913 dloss:628237358944.8132 exploreP:0.0100
Episode:811 meanR:0.6130 R:0.5505 gloss:-8277699.9328 dloss:804551983451.5613 exploreP:0.0100
Episode:812 meanR:0.6104 R:0.4125 gloss:-8297981.2477 dloss:636898457158.6202 exploreP:0.0100
Episode:813 meanR:0.6056 R:0.5765 gloss:-8323156.0241 dloss:678593400779.5126 exploreP:0.0100
Episode:814 meanR:0.6010 R:0.4115 gloss:-8344637.1796 dloss:537430313916.8365 exploreP:0.0100
Episode:815 meanR:0.5930 R:0.3175 gloss:-8368431.7256 dloss:610866941438.2528 exploreP:0.0100
Episode:816 meanR:0.5943 R:0.5645 gloss:-8389783.9076 dloss:531336222872.3942 exploreP:0.0100
Episode:817 meanR:0.5962 R:0.5900 gloss:-8414801.2677 dloss:635944152314.6639 exploreP:0.0100
Episode:818 meanR:0.5974 R:0.5140 gloss:-8441748.2654 dloss:769817877947.2634 exploreP:0.0100
Episode:819 meanR:0.5988 R:0.4535 gloss:-8463922.1098 dloss:

Episode:896 meanR:0.5994 R:0.7875 gloss:-10223025.7610 dloss:2751918887264.7334 exploreP:0.0100
Episode:897 meanR:0.5988 R:0.5760 gloss:-10241734.9879 dloss:1980697010577.9136 exploreP:0.0100
Episode:898 meanR:0.5982 R:0.5765 gloss:-10279804.6395 dloss:2405106907687.1851 exploreP:0.0100
Episode:899 meanR:0.5962 R:0.4740 gloss:-10311499.6331 dloss:2677742583634.8960 exploreP:0.0100
Episode:900 meanR:0.5949 R:0.6260 gloss:-10337160.9370 dloss:2733671722395.6250 exploreP:0.0100
Episode:901 meanR:0.5928 R:0.5585 gloss:-10362399.0647 dloss:2524432255808.8940 exploreP:0.0100
Episode:902 meanR:0.5914 R:0.5945 gloss:-10391441.0159 dloss:2613031205053.9951 exploreP:0.0100
Episode:903 meanR:0.5923 R:0.7660 gloss:-10416887.2159 dloss:2445116516638.3950 exploreP:0.0100
Episode:904 meanR:0.5972 R:0.9845 gloss:-10444553.8589 dloss:2468544527297.8691 exploreP:0.0100
Episode:905 meanR:0.5969 R:0.6380 gloss:-10469570.1753 dloss:2523421712809.7285 exploreP:0.0100
Episode:906 meanR:0.5993 R:0.8910 gloss:

Episode:982 meanR:0.7182 R:0.4635 gloss:-12690332.6520 dloss:2946505765373.1392 exploreP:0.0100
Episode:983 meanR:0.7160 R:0.5835 gloss:-12716764.2982 dloss:2927123998004.2871 exploreP:0.0100
Episode:984 meanR:0.7126 R:0.4325 gloss:-12739335.2865 dloss:2856049776741.0566 exploreP:0.0100
Episode:985 meanR:0.7101 R:0.4860 gloss:-12764922.3918 dloss:2718348132754.7368 exploreP:0.0100
Episode:986 meanR:0.7083 R:0.5105 gloss:-12794055.4119 dloss:2537774903018.6323 exploreP:0.0100
Episode:987 meanR:0.7054 R:0.5475 gloss:-12816650.4797 dloss:2154069855598.8660 exploreP:0.0100
Episode:988 meanR:0.7037 R:0.6130 gloss:-12839534.4254 dloss:1847842490379.3867 exploreP:0.0100
Episode:989 meanR:0.7020 R:0.6920 gloss:-12870207.4608 dloss:2231459266754.0562 exploreP:0.0100
Episode:990 meanR:0.7032 R:0.6295 gloss:-12896618.2485 dloss:2201424344523.3555 exploreP:0.0100
Episode:991 meanR:0.7035 R:0.5045 gloss:-12923600.7066 dloss:2040164548339.7356 exploreP:0.0100
Episode:992 meanR:0.7014 R:0.4320 gloss:

Episode:1067 meanR:0.7096 R:0.5390 gloss:-14912562.0791 dloss:1589144458769.0120 exploreP:0.0100
Episode:1068 meanR:0.7024 R:0.5340 gloss:-14948025.1227 dloss:1675142802276.4336 exploreP:0.0100
Episode:1069 meanR:0.7000 R:0.7790 gloss:-14952143.6869 dloss:1814592666961.0215 exploreP:0.0100
Episode:1070 meanR:0.6976 R:0.6900 gloss:-14919477.7932 dloss:2460420916245.3081 exploreP:0.0100


In [29]:
# fig = plt.figure()
# ax = fig.add_subplot(1,1,1)

# plt.plot(np.arange(1, len(scores)+1), scores)
# plt.ylabel('Score')
# plt.xlabel('Episode #')
# plt.show()

In [27]:
%matplotlib inline
import matplotlib.pyplot as plt

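def running_mean(x, N):
    """Return the moving average of x over a window of N samples (len(x) - N + 1 values)."""
    # Prepend 0 so the difference of two cumulative sums equals the sum over each window
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N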
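
The `running_mean` helper smooths a series with a cumulative-sum trick. As a minimal sketch of what it returns, the snippet below repeats the function so it runs on its own and applies it to a small toy array (the `toy` values are hypothetical, not taken from the training run), then cross-checks the result against a uniform-kernel convolution.

```python
import numpy as np

def running_mean(x, N):
    # Same cumulative-sum trick as in the notebook cell, repeated so this sketch is self-contained.
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[N:] - cumsum[:-N]) / N

# Hypothetical toy data for illustration only.
toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
smoothed = running_mean(toy, 2)

# Each output is the mean of N consecutive inputs, so the result has len(x) - N + 1 entries.
assert len(smoothed) == len(toy) - 2 + 1

# Cross-check against a "valid" convolution with a uniform kernel of width N.
assert np.allclose(smoothed, np.convolve(toy, np.ones(2) / 2, mode="valid"))

print(smoothed)  # [1.5 2.5 3.5 4.5]
```

Because the smoothed series is `N - 1` entries shorter than its input, the plotting cells index the x-axis with `eps[-len(smoothed_arr):]`, which lines each average up with the last episode of its window.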

In [None]:
eps, arr = np.array(episode_rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(rewards_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Total rewards')

In [None]:
eps, arr = np.array(gloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses (gloss)')

In [None]:
eps, arr = np.array(dloss_list).T
smoothed_arr = running_mean(arr, 10)
plt.plot(eps[-len(smoothed_arr):], smoothed_arr)
plt.plot(eps, arr, color='grey', alpha=0.3)
plt.xlabel('Episode')
plt.ylabel('Average losses (dloss)')

In [None]:
# env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
# states = env_info.vector_observations                  # get the current state (for each agent)
# scores = np.zeros(num_agents)                          # initialize the score (for each agent)

# while True:
#     actions = agent.act(states)                        # select actions from loaded model agent
#     env_info = env.step(actions)[brain_name]           # send all actions to the environment
#     next_states = env_info.vector_observations         # get next state (for each agent)
#     rewards = env_info.rewards                         # get reward (for each agent)
#     dones = env_info.local_done                        # see if episode finished
#     scores += env_info.rewards                         # update the score (for each agent)
#     states = next_states                               # roll over states to next time step
#     if np.any(dones):                                  # exit loop if episode finished
#         break
# print('Total score: {}'.format(np.mean(scores)))