# Reacher solution

---

This notebook presents a proposed solution for the Udacity Reacher environment and serves as a guide describing how it solves the problem.

In [1]:
!pip -q install ./python
from unityagents import UnityEnvironment

import numpy as np
import matplotlib.pyplot as plt
import torch

tensorflow 1.7.1 has requirement numpy>=1.13.3, but you'll have numpy 1.12.1 which is incompatible.
ipython 6.5.0 has requirement prompt-toolkit<2.0.0,>=1.0.15, but you'll have prompt-toolkit 3.0.20 which is incompatible.


## Configuration

The cell below abstracts the choice between the environment variants available in the workspace. If **'MULTI_AGENT'** is set to `True`, the notebook uses the Unity simulator with 20 reacher arms; otherwise it uses the single-arm variant.

**'VIS_ENABLED'** turns the simulator GUI on or off. Only the builds without visualization are wired up in the cell below; the paths for the visual builds are commented out.


**NOTE**: The file locations for the Reacher environment are hardcoded below. Adapt them as needed to run this notebook with the desired variant of the problem.


In [2]:
MULTI_AGENT = True
VIS_ENABLED = False

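if 'MULTI_AGENT' in globals() and MULTI_AGENT:
    if 'VIS_ENABLED' in globals() and VIS_ENABLED:
        # No visual build is configured for the 20-arm variant.
        # The path below is a placeholder kept from the original; adjust it before use.
        #env = UnityEnvironment(file_name='one_agent_reacher_novis/Reacher.x86_64')
        raise NotImplementedError('Set the path of a visual build before enabling VIS_ENABLED.')
    else:
        # 20 reacher arms, no visualization
        env = UnityEnvironment(file_name='/data/Reacher_Linux_NoVis/Reacher.x86_64')
else:
    if 'VIS_ENABLED' in globals() and VIS_ENABLED:
        # No visual build is configured for the single-arm variant.
        # The path below is a placeholder kept from the original; adjust it before use.
        #env = UnityEnvironment(file_name='many_agents_reacher_novis/Reacher.x86_64')
        raise NotImplementedError('Set the path of a visual build before enabling VIS_ENABLED.')
    else:
        # single reacher arm, no visualization
        env = UnityEnvironment(file_name='/data/Reacher_One_Linux_NoVis/Reacher_One_Linux_NoVis.x86_64')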
    
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]
print( env.brain_names )

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


['ReacherBrain']


## Model hyperparameters and configuration

The cell below creates the DDPG agent and passes to its constructor all the hyperparameters that we exposed and tuned to reach the required performance. For clarity, the DDPG agent imports and uses PyTorch models for the "actor" and the "critic". Both networks are printed further below; they are the same models presented in the Udacity DDPG Pendulum example, with a batch normalization layer added at the first layer of each network.

This batch normalization appears in other implementations of this problem on GitHub, and adding it finally made training stable enough to reach the score needed for submission.

For the agent itself, we implemented the mechanism suggested in the problem statement, exposed here as **learn_prescaler**, which controls how often the agent runs its learning phase. After the agent has performed `learn_prescaler` steps, it runs **learning_cycles** consecutive learning passes.
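The gating described above can be sketched as a small step counter. This is only an illustration, not the code in `ddpg_agent.py`: the class name `LearnScheduler` and its method `on_step` are hypothetical, and two details are assumptions, namely that a prescaler of 0 still runs `learning_cycles` passes on every step and that one "step" means one call to the scheduler.

```python
class LearnScheduler:
    """Hypothetical helper: decides how many learning passes follow each step."""

    def __init__(self, learn_prescaler=20, learning_cycles=10):
        self.learn_prescaler = learn_prescaler
        self.learning_cycles = learning_cycles
        self.step_count = 0

    def on_step(self):
        """Return the number of learning passes to run after this step."""
        if self.learn_prescaler <= 0:
            # Prescaler disabled (assumption): learn after every step.
            return self.learning_cycles
        self.step_count += 1
        if self.step_count % self.learn_prescaler == 0:
            return self.learning_cycles
        return 0


# With the values used in this notebook, 40 steps trigger two learning phases.
scheduler = LearnScheduler(learn_prescaler=20, learning_cycles=10)
passes = [scheduler.on_step() for _ in range(40)]
```

With `learn_prescaler=20` and `learning_cycles=10`, the agent performs on average one learning pass for every two recorded steps, but in bursts rather than continuously.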

The **sample_every_cycle** option came from a small experiment in which this author tried to understand whether running all **learning_cycles** passes on the same sample from the replay buffer would improve performance. It did not. The default behavior, `sample_every_cycle=True`, performed best, and the option was kept for historical purposes.
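To make the two sampling strategies concrete, here is a minimal sketch of a replay buffer and of the difference between drawing a fresh minibatch on every pass and reusing one. The names `ReplayBuffer`, `Experience` and `learning_batches` are illustrative and may not match what `ddpg_agent.py` actually defines.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer; the oldest experiences are dropped once it is full."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Draw `batch_size` experiences uniformly, without replacement."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


def learning_batches(buffer, learning_cycles, sample_every_cycle=True):
    """Yield one minibatch per learning pass.

    With sample_every_cycle=False the first minibatch is reused for every pass.
    """
    batch = None
    for _ in range(learning_cycles):
        if sample_every_cycle or batch is None:
            batch = buffer.sample()
        yield batch
```

Reusing one minibatch repeats the same gradient signal several times, which is a plausible reason why it did not help here, although the notebook does not investigate the cause.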

Another attempt at making training more stable was to steadily decrease the noise applied to the actions during training. This noise, as far as I understand it, lets the agent explore during training by applying actions that its "actor" network did not predict, which may lead the networks to explore regions of their parameter space that would otherwise be hard to reach. **noise_initial_gain** and **noise_gain_decay** were introduced to make the noise decrease steadily as training progresses, so that the agent explores less in later stages of training. This showed no improvement over the default mechanism and, like `sample_every_cycle`, was kept to document what was and was not tested during development.
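A minimal sketch of noise with a decaying gain follows. It assumes an Ornstein-Uhlenbeck process, as commonly used with DDPG, and a multiplicative decay, which is consistent with `noise_gain_decay = 1.0` meaning "no decay". The class name `DecayingOUNoise` and the values of `theta` and `sigma` are illustrative assumptions, not taken from `ddpg_agent.py`.

```python
import random


class DecayingOUNoise:
    """Ornstein-Uhlenbeck noise scaled by a gain that shrinks on every sample."""

    def __init__(self, size, initial_gain=1.0, gain_decay=1.0,
                 mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = [mu] * size
        self.theta = theta
        self.sigma = sigma
        self.gain = initial_gain
        self.gain_decay = gain_decay
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean; the gain is deliberately not reset."""
        self.state = list(self.mu)

    def sample(self):
        """Advance the process one step and return the gain-scaled noise."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        scaled = [self.gain * x for x in self.state]
        # Multiplicative decay (assumption): a decay of 1.0 keeps the gain constant.
        self.gain *= self.gain_decay
        return scaled
```

The scaled noise would be added to the actor output, and the result clipped to the action range, before the action is sent to the environment.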



Finally, **batch_normalize** enables or disables batch normalization at the first layer of the "actor" and "critic" networks and was, in fact, the parameter that contributed most to reaching the required performance. Exposing it as a flag made it easy to switch the layer on and off and confirm that it was what was mainly responsible for the improvement in the training scores.
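The sketch below shows networks whose layer sizes match the architecture printed later in this notebook, with batch normalization toggled by a flag. It is a reconstruction from that printout, not the contents of the model file: the position of the normalization relative to the ReLU, the `tanh` output and the point where the action is concatenated in the critic are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Policy network: maps a state to an action in [-1, 1]."""

    def __init__(self, state_size=33, action_size=4, batch_normalize=True):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, action_size)
        self.bn = nn.BatchNorm1d(256) if batch_normalize else None

    def forward(self, state):
        x = self.fc1(state)
        if self.bn is not None:
            x = self.bn(x)  # normalize the first layer's output (assumed position)
        x = F.relu(x)
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))


class Critic(nn.Module):
    """Value network: maps a (state, action) pair to a scalar Q-value."""

    def __init__(self, state_size=33, action_size=4, batch_normalize=True):
        super().__init__()
        self.fcs1 = nn.Linear(state_size, 256)
        self.fc2 = nn.Linear(256 + action_size, 128)  # 260 inputs, as printed
        self.fc3 = nn.Linear(128, 1)
        self.bn = nn.BatchNorm1d(256) if batch_normalize else None

    def forward(self, state, action):
        xs = self.fcs1(state)
        if self.bn is not None:
            xs = self.bn(xs)
        xs = F.relu(xs)
        x = torch.cat((xs, action), dim=1)  # the action joins after the first layer
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```

In training mode `BatchNorm1d` needs more than one sample per batch, so a network like this should be switched to `eval()` before acting on a single state.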

## TL;DR

**buffer_size**: Total number of steps saved which can be sampled for the learning.

**batch_size**: How many steps are actually sampled for each learning phase.

**gamma**: Discount factor applied to the rewards.

**tau**: Soft update factor used to blend the "Local" network weights into the "Target" network.

**lr_actor**: Learning rate for the "Actor" network.

**lr_critic**: Learning rate for the "Critic" network.

**actor_weight_decay**: L2 weight decay used by the Adam optimizer of the Actor network.

**critic_weight_decay**: L2 weight decay used by the Adam optimizer of the Critic network.

**learn_prescaler**: How often the agent runs the learning phase: once every `learn_prescaler` steps. A value of 0 disables the prescaler, and the agent learns on every step.

**learning_cycles**: How many consecutive learning passes the agent should perform.

**noise_initial_gain**: Initial value of the "gain" that multiplies the noise added to the actor's actions.

**noise_gain_decay**: How much the noise gain, starting at 'noise_initial_gain', should decrease at each step.

**sample_every_cycle**: Whether the agent should draw a new sample from its replay buffer for each consecutive learning pass.

**gradient_limiter**: Whether the critic gradient norm should be clipped before the optimizer step.

**batch_normalize**: Enables or disables batch normalization at the first layer of both networks, actor and critic.
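For reference, this is how `gamma` and `tau` enter the standard DDPG update. It is the textbook form, not an excerpt from `ddpg_agent.py`. The symbols are introduced here: $r$ is the reward, $d$ is 1 when the episode ended and 0 otherwise, $s'$ is the next state, $Q_{\theta'}$ and $\mu_{\phi'}$ are the target critic and target actor, and $\theta$, $\theta'$ are the local and target weights of either network.

```latex
% Critic regression target, discounted by gamma
y = r + \gamma \,(1 - d)\, Q_{\theta'}\bigl(s', \mu_{\phi'}(s')\bigr)

% Soft update of the target weights, blended by tau
\theta' \leftarrow \tau\,\theta + (1 - \tau)\,\theta'
```

With `tau = 1e-3`, each soft update moves the target weights 0.1% of the way toward the local weights.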


In [3]:
from ddpg_agent import Agent

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

env_info = env.reset(train_mode=True)[brain_name]     # reset the environment    

# number of agents
num_agents = len(env_info.agents)

agent = Agent(state_size          = 33, 
              action_size         = 4, 
              random_seed         = 2, 
              buffer_size         = 100000,
              batch_size          = 128,
              gamma               = 0.99,
              tau                 = 1e-3,
              lr_actor            = 2e-4,
              lr_critic           = 2e-4,
              actor_weight_decay  = 0,
              critic_weight_decay = 0,
              learn_prescaler     = 20, 
              learning_cycles     = 10, 
              noise_initial_gain  = 1.0,
              noise_gain_decay    = 1.0,
              sample_every_cycle  = True, 
              gradient_limiter    = True,
              batch_normalize     = True)

In [4]:
print( "Actor PyTorch Network\n")
print( agent.actor_local )


print( "\n\nCritic PyTorch Network\n")
print( agent.critic_local )

Actor PyTorch Network

Actor(
  (fc1): Linear(in_features=33, out_features=256, bias=True)
  (fc2): Linear(in_features=256, out_features=128, bias=True)
  (fc3): Linear(in_features=128, out_features=4, bias=True)
  (bn): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)


Critic PyTorch Network

Critic(
  (fcs1): Linear(in_features=33, out_features=256, bias=True)
  (fc2): Linear(in_features=260, out_features=128, bias=True)
  (fc3): Linear(in_features=128, out_features=1, bias=True)
  (bn): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)


In [None]:
from collections import deque

def ddpg(n_episodes=150, max_t=10000, print_every=100):   
    scores_deque = deque(maxlen=print_every)
    scores = []

    
    for i_episode in range(1, n_episodes+1):
        
        states = env.reset(train_mode=True)[brain_name].vector_observations     # reset the environment    
        agent.reset()
        
        scores_episode = np.zeros(num_agents)               # rewards per episode for each agent
        
        for t in range( max_t ):
            actions     = agent.act(states)
            env_info    = env.step(actions)[brain_name]     # send all actions to the environment
            next_states = env_info.vector_observations      # get next state (for each agent)
            rewards     = env_info.rewards                  # get reward (for each agent)
            dones       = env_info.local_done               # see if episode finished
            
            for (state, action, reward, next_state, done) in zip(states, actions, rewards, next_states, dones):
                agent.step(state, action, reward, next_state, done)
            
            
            states = next_states
            scores_episode += rewards
            
            if any(dones):
                break 
                
        
        # Average the score across all agents for this episode
        mean_score = np.mean( scores_episode )

        scores.append( mean_score )
        scores_deque.append( mean_score )
        
        
        print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean( scores_deque )), end="")
        torch.save(agent.actor_local.state_dict(), 'checkpoint_actor.pth')
        torch.save(agent.critic_local.state_dict(), 'checkpoint_critic.pth')
        if i_episode % print_every == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean( scores_deque )))
            
    return scores

scores = ddpg()

fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, len(scores)+1), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()

  torch.nn.utils.clip_grad_norm(self.critic_local.parameters(), 1)


Episode 19	Average Score: 26.51
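To check when training reaches the score needed for submission, a helper like the one below could be run on the returned `scores` list. The threshold of +30 averaged over 100 consecutive episodes is the criterion the Udacity project uses for this environment; the notebook itself does not state it, so treat both defaults as assumptions. The function name `first_solved_episode` is hypothetical.

```python
from collections import deque


def first_solved_episode(scores, target=30.0, window=100):
    """Return the 1-based episode at which the mean of the last `window`
    scores first reaches `target`, or None if that never happens."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None
```

Because the window spans 100 episodes, a run needs at least 100 episodes before this helper can report anything other than `None`.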

In [None]:
env_info = env.reset(train_mode=False)[brain_name]     # reset the environment    
states = env_info.vector_observations                  # get the current state (for each agent)
scores = np.zeros(num_agents)                          # initialize the score (for each agent)
while True:
    actions     = agent.act(states)                    # select an action (for each agent)
    env_info = env.step(actions)[brain_name]           # send all actions to the environment
    next_states = env_info.vector_observations         # get next state (for each agent)
    rewards = env_info.rewards                         # get reward (for each agent)
    dones = env_info.local_done                        # see if episode finished
    scores += env_info.rewards                         # update the score (for each agent)
    states = next_states                               # roll over states to next time step
    if np.any(dones):                                  # exit loop if episode finished
        break
print('Total score (averaged over agents) this episode: {}'.format(np.mean(scores)))

In [None]:
env.close()