# Report

---

This notebook describes my implementation of the continuous control project.

### 1. Start the Environment

We begin by importing some necessary packages.

In [1]:
from unityagents import UnityEnvironment
import numpy as np

Next, we start the environment. The 20-agent version (version 2) is loaded here.

In [2]:
#env = UnityEnvironment(file_name="env/Reacher.app")

env = UnityEnvironment(file_name="env/Reacher-2.app")
# select this option to load version 1 (with a single agent) of the environment
#env = UnityEnvironment(file_name='/data/Reacher_One_Linux_NoVis/Reacher_One_Linux_NoVis.x86_64')

# select this option to load version 2 (with 20 agents) of the environment
#env = UnityEnvironment(file_name='/data/Reacher_Linux_NoVis/Reacher.x86_64')

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

### 2. Examine the State and Action Spaces

In this environment, a double-jointed arm can move to target locations. A reward of `+0.1` is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of `33` variables corresponding to position, rotation, velocity, and angular velocities of the arm.  Each action is a vector with four numbers, corresponding to torque applicable to two joints.  Every entry in the action vector must be a number between `-1` and `1`.
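
To make the action format concrete, here is a minimal sketch that builds one random action per agent and clips every entry to `[-1, 1]`. The sizes are the ones reported for this environment; the random sampling is only an illustration and is not part of the trained agent.

```python
import numpy as np

num_agents = 20    # the 20-agent version of the environment
action_size = 4    # size of each action vector

np.random.seed(0)
# Sample from a standard normal, then clip so every entry lies in [-1, 1].
actions = np.clip(np.random.randn(num_agents, action_size), -1.0, 1.0)
print(actions.shape)
```

An array of this shape (one row per agent) is what `env.step` expects at every time step.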

Run the code cell below to print some information about the environment.

In [3]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]
print(brain_name)
print(brain)

ReacherBrain
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


In [4]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
cc_action_size = brain.vector_action_space_size
print('Size of each action:', cc_action_size)

# examine the state space 
states = env_info.vector_observations
cc_state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], cc_state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 20
Size of each action: 4
There are 20 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726624e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]


### 3. Define the DDPG Agent

The agent that learns to control the double-jointed arm is implemented in [ddpg_agent.py](./ddpg_agent.py).

The agent uses the DDPG learning algorithm. It can be initialized either for training or for play only, based on a saved checkpoint (see below).

The DDPG (Deep Deterministic Policy Gradient) actor-critic approach is recommended for this problem, mainly because the agent must learn actions in a continuous space.

The actor and critic networks are defined in [model.py](./model.py); each has two hidden layers. The critic takes the state vector as input to its first layer. The action is then concatenated with the first layer's output and fed into the second layer, and the network outputs the estimated Q-value of the given state-action pair. The actor takes the state vector as input and outputs the action vector.
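
The data flow through the two networks can be sketched in plain NumPy. This is not the code in `model.py`: the layer sizes follow the models printed later in this report, while the random weights, the ReLU activations, and the `tanh` output are assumptions made for illustration.

```python
import numpy as np

STATE_SIZE, ACTION_SIZE = 33, 4   # sizes reported by the environment
H1, H2 = 256, 128                 # hidden sizes of the printed models

rng = np.random.RandomState(0)

def linear(n_in, n_out):
    """Random weights and zero bias for one dense layer (illustrative only)."""
    return rng.uniform(-0.1, 0.1, size=(n_in, n_out)), np.zeros(n_out)

def relu(x):
    return np.maximum(x, 0.0)

# Actor: state -> 256 -> 128 -> action
a_fc1, a_fc2, a_fc3 = linear(STATE_SIZE, H1), linear(H1, H2), linear(H2, ACTION_SIZE)

def actor_forward(state):
    x = relu(state @ a_fc1[0] + a_fc1[1])
    x = relu(x @ a_fc2[0] + a_fc2[1])
    return np.tanh(x @ a_fc3[0] + a_fc3[1])    # tanh (assumed) keeps actions in [-1, 1]

# Critic: state -> 256, then [256 features + action] -> 128 -> 1
c_fcs1, c_fc2, c_fc3 = linear(STATE_SIZE, H1), linear(H1 + ACTION_SIZE, H2), linear(H2, 1)

def critic_forward(state, action):
    xs = relu(state @ c_fcs1[0] + c_fcs1[1])
    x = np.concatenate([xs, action], axis=1)   # 256 + 4 = 260 features
    x = relu(x @ c_fc2[0] + c_fc2[1])
    return x @ c_fc3[0] + c_fc3[1]             # one Q-value per (state, action) pair

states = rng.randn(20, STATE_SIZE)             # a batch of 20 stand-in states
actions = actor_forward(states)
q_values = critic_forward(states, actions)
print(actions.shape, q_values.shape)
```

The concatenation step explains why the critic's second layer has 260 input features rather than 256.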

DDPG is an off-policy algorithm. Combining off-policy learning with function approximation and bootstrapping can make training unstable. As in the DQN algorithm, a replay buffer is used to stabilize learning.
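
A replay buffer can be sketched with the standard library alone. The class below is a hypothetical minimal version, not the one in `ddpg_agent.py`: it keeps the most recent experiences and returns uniformly sampled minibatches, which breaks the correlation between consecutive time steps.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size buffer; the oldest experiences are dropped first."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Draw a uniformly random minibatch without replacement."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)

# Tiny sizes so the eviction of old entries is visible.
buffer = ReplayBuffer(buffer_size=5, batch_size=3)
for t in range(8):
    buffer.add(state=t, action=0.0, reward=0.1, next_state=t + 1, done=False)
batch = buffer.sample()
print(len(buffer), [e.state for e in batch])
```

Learning would normally start only once the buffer holds at least one full minibatch.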

Exploration is achieved by adding noise to the action returned by the actor network.
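
The text above does not say which noise process is used. A common choice with DDPG is an Ornstein-Uhlenbeck process, so the sketch below assumes that; the class name, its parameters (`theta`, `sigma`), and the clipping step are illustrative and may differ from `ddpg_agent.py`.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise that drifts back to mu."""

    def __init__(self, size, seed=0, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.rng = np.random.RandomState(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean, e.g. at the start of an episode."""
        self.state = self.mu.copy()

    def sample(self):
        dx = (self.theta * (self.mu - self.state)
              + self.sigma * self.rng.standard_normal(self.state.shape))
        self.state = self.state + dx
        return self.state

noise = OUNoise(size=4)
raw_action = np.array([0.9, -0.9, 0.0, 0.5])    # stand-in for the actor output
# Add noise, then clip back into the valid action range.
noisy_action = np.clip(raw_action + noise.sample(), -1.0, 1.0)
print(noisy_action)
```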

### 4. Instantiate the learning Agent defined above

Instantiate the agent and, if checkpoint files exist, load the network weights from them.

In [5]:
from ddpg_agent import Agent
import torch
from pathlib import Path

cc_agent = Agent(state_size=cc_state_size, 
                   action_size=cc_action_size, 
                   random_seed=0)

actor_chk_file = Path('./checkpoint_actor.pth')
if actor_chk_file.is_file():
    cc_agent.actor_local.load_state_dict(torch.load('./checkpoint_actor.pth',  map_location='cpu'))

    
critic_chk_file = Path('./checkpoint_critic.pth')
if critic_chk_file.is_file():
    cc_agent.critic_local.load_state_dict(torch.load('./checkpoint_critic.pth',  map_location='cpu'))

Initialising ReplayBuffer


Check the NNs.

In [6]:
print(cc_agent.actor_local)

Actor(
  (fc1): Linear(in_features=33, out_features=256, bias=True)
  (fc2): Linear(in_features=256, out_features=128, bias=True)
  (fc3): Linear(in_features=128, out_features=4, bias=True)
)


In [7]:
print(cc_agent.critic_local)

Critic(
  (fcs1): Linear(in_features=33, out_features=256, bias=True)
  (fc2): Linear(in_features=260, out_features=128, bias=True)
  (fc3): Linear(in_features=128, out_features=1, bias=True)
)


### 5. Train the Agent with DDPG Algorithm

Set the hyperparameters. The learning rates, discount factor, and soft-update rate match the original DDPG paper; the buffer size, batch size, and weight decay differ from it.

In [8]:
from collections import deque

BUFFER_SIZE = int(1e5)  # replay buffer size
BATCH_SIZE = 128        # minibatch size
GAMMA = 0.99            # discount factor
TAU = 1e-3              # for soft update of target parameters
LR_ACTOR = 1e-4         # learning rate of the actor 
LR_CRITIC = 1e-3        # learning rate of the critic
WEIGHT_DECAY = 0        # L2 weight decay
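
As a rough illustration of where `GAMMA` and `TAU` enter the algorithm, the sketch below computes the bootstrapped critic target and a soft update of target parameters in NumPy. The function names are hypothetical, and parameters are plain arrays rather than PyTorch tensors.

```python
import numpy as np

GAMMA = 0.99   # discount factor (as above)
TAU = 1e-3     # soft-update rate (as above)

def td_target(rewards, next_q, dones, gamma=GAMMA):
    """y = r + gamma * Q_target(s', mu_target(s')), with no bootstrap on terminal steps."""
    return rewards + gamma * next_q * (1.0 - dones)

def soft_update(local_params, target_params, tau=TAU):
    """theta_target <- tau * theta_local + (1 - tau) * theta_target, per parameter array."""
    return [tau * lp + (1.0 - tau) * tp for lp, tp in zip(local_params, target_params)]

rewards = np.array([0.1, 0.0])
next_q = np.array([2.0, 3.0])    # stand-in target-critic estimates for the next states
dones = np.array([0.0, 1.0])     # the second transition is terminal
targets = td_target(rewards, next_q, dones)
print(targets)

local = [np.ones(3)]
target = [np.zeros(3)]
target = soft_update(local, target)
print(target[0])
```

With `TAU = 1e-3`, the target networks move only 0.1% of the way towards the local networks per update, which keeps the bootstrapped targets slowly changing.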

In [9]:
import random
import torch
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Run the code cell below to train the agent, either from scratch or continuing from the models loaded from the checkpoint files. The network parameters are saved every 10 episodes so that training can be resumed later. Training stops when the average score over the past 100 episodes reaches 30.
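
The stopping rule can be isolated in a few lines. The helper below is a hypothetical sketch that mirrors the `>= 30` check in the training code: like that code, it averages over fewer than `window` episodes until the window fills up. The scores are synthetic and are not results from this project.

```python
from collections import deque

def episodes_until_solved(episode_scores, target=30.0, window=100):
    """Return the 1-based episode at which the moving average first reaches target, else None."""
    recent = deque(maxlen=window)
    for i_episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        if sum(recent) / len(recent) >= target:
            return i_episode
    return None

# Synthetic scores for illustration only; a short window keeps the example small.
fake_scores = [10.0] * 3 + [40.0] * 10
solved_at = episodes_until_solved(fake_scores, window=4)
print(solved_at)
```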

In [None]:
def ddpg(n_episodes=400, max_t=1000):
    scores_deque = deque(maxlen=100)
    scores_list = []
    max_score = -np.inf
    scores_episode = []
    cc_agents =[]    # list of agents
    
    for i in range(num_agents):
        cc_agents.append(Agent(cc_state_size, cc_action_size, random_seed=0))
    
    
    for i_episode in range(1, n_episodes+1):
        env_info = env.reset(train_mode=True)[brain_name]      # reset the environment    
        states = env_info.vector_observations                  # get the current state (for each agent)
        scores = np.zeros(num_agents)                          # initialize the score (for each agent)
        
        for cc_agent in cc_agents:
            cc_agent.reset()                                   # reset the agent for each episode
        
        for t in range(max_t):
            actions = np.array([cc_agents[i].act(states[i]) for i in range(num_agents)]) # get the action from each agent
            env_info = env.step(actions)[brain_name]           # send all actions to the environment
            next_states = env_info.vector_observations         # get next state (for each agent)
            rewards = env_info.rewards                         # get reward (for each agent)
            dones = env_info.local_done                        # see if episode finished

            for i in range(num_agents):
                cc_agents[i].step(t, states[i], actions[i], rewards[i], next_states[i], dones[i]) 
                
            states = next_states
            scores += rewards
            
            if np.any(dones):
                break
                
        score = np.mean(scores)
        scores_deque.append(score)
        scores_list.append(score)
        print('\rEpisode {}\tAverage Score: {:.2f}\tScore: {:.2f}'.format(i_episode, np.mean(scores_deque), score), end="")
        if i_episode % 10 == 0:
            # checkpoint the last agent in cc_agents (the one the leaked loop variable pointed to)
            torch.save(cc_agents[-1].actor_local.state_dict(), 'checkpoint_actor.pth')
            torch.save(cc_agents[-1].critic_local.state_dict(), 'checkpoint_critic.pth')

        if i_episode % 100 == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))
        
        if np.mean(scores_deque)>=30: # stop learning once the average score over the last 100 episodes reaches 30
            print('Environment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode-100, np.mean(scores_deque)))
            break

    return scores_list   # per-episode mean scores, as plotted below

scores = ddpg()

fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, len(scores)+1), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()



### 6. Play with the trained agent in the environment
In the cell below the trained agent is used to play in the environment...

In [None]:
# reset the environment in inference mode (train_mode=False)
env_info = env.reset(train_mode=False)[brain_name]
states = env_info.vector_observations              # get the current state (for each agent)
scores = np.zeros(num_agents)                      # initialize the score (for each agent)

# cc_agent is the instance created in step 4; re-run that cell after training
# so that it loads the latest checkpoints
while True:
    actions = np.array([cc_agent.act(state) for state in states])  # one action per agent
    env_info = env.step(actions)[brain_name]       # send all actions to the environment
    states = env_info.vector_observations          # get the next state (for each agent)
    scores += env_info.rewards                     # update the score (for each agent)
    if np.any(env_info.local_done):                # exit loop if episode finished
        break

print("Score (averaged over agents): {:.2f}".format(np.mean(scores)))

When finished, you can close the environment.

In [None]:
env.close()

### Ideas for future improvements

There are several improvements that can be made:
   - tune the hyper-parameters to train even faster