# Report

---

In this notebook, I describe my implementation for the continuous control project.

### 1. Start the Environment

We begin by importing some necessary packages.

In [2]:
from unityagents import UnityEnvironment
import numpy as np

Next, we start the first version of the environment, which contains a single agent.

In [3]:
env = UnityEnvironment(file_name="env/Reacher.app")

INFO:unityagents:
'Academy' started successfully!
Unity Academy name: Academy
        Number of Brains: 1
        Number of External Brains : 1
        Lesson number : 0
        Reset Parameters :
		goal_speed -> 1.0
		goal_size -> 5.0
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


Environments contain **_brains_** which are responsible for deciding the actions of their associated agents. Here we check for the first brain available, and set it as the default brain we will be controlling from Python.

### 2. Examine the State and Action Spaces

In this environment, a double-jointed arm can move to target locations. A reward of `+0.1` is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of `33` variables corresponding to position, rotation, velocity, and angular velocities of the arm.  Each action is a vector with four numbers, corresponding to torque applicable to two joints.  Every entry in the action vector must be a number between `-1` and `1`.
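As a minimal sketch of the action constraint described above, the snippet below draws random actions and clips them to `[-1, 1]`. The sizes are taken from the description in this section; the random policy itself is only an illustration and is not part of the agent.

```python
import numpy as np

# Sizes taken from the environment description (one agent, 4 action values).
num_agents = 1
action_size = 4

rng = np.random.default_rng(0)
# Samples from a standard normal can fall outside [-1, 1] ...
raw_actions = rng.standard_normal((num_agents, action_size))
# ... so clip every entry to make each action vector valid.
actions = np.clip(raw_actions, -1.0, 1.0)
print(actions.shape)
```

An array of this shape is what `env.step(...)` receives in the cells further down.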

Run the code cell below to print some information about the environment.

In [4]:
# get the default brain
brain_name = env.brain_names[0]
brain = env.brains[brain_name]
print(brain_name)
print(brain)

ReacherBrain
Unity brain name: ReacherBrain
        Number of Visual Observations (per agent): 0
        Vector Observation space type: continuous
        Vector Observation space size (per agent): 33
        Number of stacked Vector Observation: 1
        Vector Action space type: continuous
        Vector Action space size (per agent): 4
        Vector Action descriptions: , , , 


In [5]:
# reset the environment
env_info = env.reset(train_mode=True)[brain_name]

# number of agents
num_agents = len(env_info.agents)
print('Number of agents:', num_agents)

# size of each action
cc_action_size = brain.vector_action_space_size
print('Size of each action:', cc_action_size)

# examine the state space 
states = env_info.vector_observations
cc_state_size = states.shape[1]
print('There are {} agents. Each observes a state with length: {}'.format(states.shape[0], cc_state_size))
print('The state for the first agent looks like:', states[0])

Number of agents: 1
Size of each action: 4
There are 1 agents. Each observes a state with length: 33
The state for the first agent looks like: [ 0.00000000e+00 -4.00000000e+00  0.00000000e+00  1.00000000e+00
 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -1.00000000e+01  0.00000000e+00
  1.00000000e+00 -0.00000000e+00 -0.00000000e+00 -4.37113883e-08
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  5.75471878e+00 -1.00000000e+00
  5.55726671e+00  0.00000000e+00  1.00000000e+00  0.00000000e+00
 -1.68164849e-01]


### 3. Define the DDPG Agent

The agent that learns to control the environment is coded in [cc_agent.py](./cc_agent.py).

The agent is built on the DDPG (Deep Deterministic Policy Gradient) learning algorithm. It can be initialized either for learning or for play only, based on a saved checkpoint.

A pure tabular Q-learning algorithm is not feasible for this problem because of the huge state-action space exposed by the environment. The state space is multidimensional and continuous: each state is a vector of 33 floating-point elements. Even if the state space were discretized along all dimensions, the resulting state-action table would be far too large for a tabular solution. The Q-value function therefore has to be approximated, and a deep NN (neural network) is a natural choice for this approximation. DQN (Deep Q-Networks) follows this idea but assumes a discrete set of actions. Here each action is a vector of 4 continuous values, which is why DDPG, an actor-critic method, is used instead.

The NN models are coded in [cc_model.py](./cc_model.py). The model is built to support a variable number of hidden dense layers, with an optional dropout layer between hidden layers (default rate 0.2). The actor receives a tensor with as many elements as the state (33) and outputs the 4 action values. The critic receives the state together with the action and outputs a single Q-value estimate for that state-action pair.

Value estimates that are bootstrapped from a network changing at every step are noisy, especially in the early stages of learning when they rest on limited experience. Pure DQN additionally tends to overestimate Q-values, because it takes the maximum over these noisy estimates for the next state. To make learning more robust in practice, the agent keeps two nearly identical networks for each function approximator: a local network that is updated after each learning step, and a target network that is used to evaluate the next state. The target network receives its weights from the local network through a soft update controlled by `TAU`, so it changes only slowly.
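The idea of a target network that slowly follows the learning network can be sketched with a soft update. This is a minimal NumPy illustration, not the code from the agent file: `soft_update` and the toy parameter lists are hypothetical names, and `TAU` reuses the value set in the hyperparameter cell later in this notebook.

```python
import numpy as np

TAU = 0.001  # same value as the TAU hyperparameter in this notebook

def soft_update(local_params, target_params, tau):
    """Blend local weights into target weights in place:
    target = tau * local + (1 - tau) * target."""
    for local, target in zip(local_params, target_params):
        target *= (1.0 - tau)
        target += tau * local

# Toy parameters: the local network holds ones, the target holds zeros.
local_params = [np.ones((2, 2)), np.ones(2)]
target_params = [np.zeros((2, 2)), np.zeros(2)]

soft_update(local_params, target_params, TAU)
print(target_params[0])
```

With a small `tau`, each call moves the target only a small fraction of the way toward the local weights, which keeps the learning target stable between updates.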

### 4. Instantiate the learning Agent defined above

Instantiate the agent and, if checkpoint files from an earlier run exist, load the saved actor and critic weights. Without checkpoints the agent starts untrained.

In [6]:
from ddpg_agent import Agent
import torch
from pathlib import Path

cc_agent = Agent(state_size=cc_state_size,
                 action_size=cc_action_size,
                 random_seed=0)

# resume from saved weights if checkpoints from an earlier run exist
actor_chk_file = Path('./checkpoint_actor.pth')
if actor_chk_file.is_file():
    cc_agent.actor_local.load_state_dict(torch.load(actor_chk_file))

critic_chk_file = Path('./checkpoint_critic.pth')
if critic_chk_file.is_file():
    cc_agent.critic_local.load_state_dict(torch.load(critic_chk_file))

In [7]:
print(cc_agent.actor_local)

Actor(
  (fc1): Linear(in_features=33, out_features=400, bias=True)
  (fc2): Linear(in_features=400, out_features=300, bias=True)
  (fc3): Linear(in_features=300, out_features=4, bias=True)
)


In [8]:
print(cc_agent.critic_local)

Critic(
  (fcs1): Linear(in_features=33, out_features=400, bias=True)
  (fc2): Linear(in_features=404, out_features=300, bias=True)
  (fc3): Linear(in_features=300, out_features=1, bias=True)
)
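The `in_features=404` of `fc2` in the critic printed above suggests that the 4-element action is concatenated with the 400 outputs of the first layer. The sketch below reproduces only these shapes in NumPy. The ReLU activations, the missing biases, and the name `critic_forward` are assumptions for illustration and are not taken from the model file.

```python
import numpy as np

rng = np.random.default_rng(0)
state_size, action_size = 33, 4

# Random weights with the layer shapes printed above (illustrative only).
W1 = rng.standard_normal((state_size, 400)) * 0.01
W2 = rng.standard_normal((400 + action_size, 300)) * 0.01
W3 = rng.standard_normal((300, 1)) * 0.01

def critic_forward(state, action):
    """Hypothetical forward pass: state -> 400, concat action -> 404 -> 300 -> 1."""
    xs = np.maximum(state @ W1, 0.0)           # ReLU is an assumption
    x = np.concatenate([xs, action], axis=1)   # shape (batch, 404)
    x = np.maximum(x @ W2, 0.0)
    return x @ W3                              # one Q-value per state-action pair

q = critic_forward(np.zeros((5, state_size)), np.zeros((5, action_size)))
print(q.shape)
```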


### 5. Train the Agent with DDPG Algorithm

Set the hyperparameters. Most values follow the original DDPG paper; the replay buffer is smaller (`1e4` here versus `1e6` in the paper).

In [9]:
from collections import deque

BUFFER_SIZE = int(1e4)  # replay buffer size
BATCH_SIZE = 64        # minibatch size
GAMMA = 0.99            # discount factor
TAU = 0.001             # for soft update of target parameters
LR_ACTOR = 1e-4         # learning rate of the actor 
LR_CRITIC = 1e-3        # learning rate of the critic
WEIGHT_DECAY = 1e-2     # L2 weight decay
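To show how `BUFFER_SIZE` and `BATCH_SIZE` are typically used, here is a minimal replay buffer sketch. The `ReplayBuffer` class and the `Experience` tuple are hypothetical; the buffer inside the agent file may be implemented differently.

```python
import random
from collections import deque, namedtuple

BUFFER_SIZE = int(1e4)  # replay buffer size, as in the cell above
BATCH_SIZE = 64         # minibatch size, as in the cell above

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Fixed-size buffer; the oldest experiences are dropped once it is full."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement breaks temporal correlation.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(BUFFER_SIZE, BATCH_SIZE)
for t in range(BUFFER_SIZE + 100):       # overfill to trigger eviction
    buffer.add(t, 0.0, 0.1, t + 1, False)
batch = buffer.sample()
print(len(buffer), len(batch))
```

A smaller buffer forgets old experience sooner, which is one of the trade-offs behind the `BUFFER_SIZE` value.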

In [10]:
import random
import torch
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


scores_deque = deque(maxlen=100)
scores = []

Run the code cell below to train the agent (from scratch or continue training the models loaded from the files). 

In [None]:
def ddpg(n_episodes=2500, max_t=700):
    for i_episode in range(1, n_episodes+1):
        env_info = env.reset(train_mode=True)[brain_name]
        state = env_info.vector_observations[0]
        cc_agent.reset()
        score = 0
        for t in range(max_t):
            action = cc_agent.act(state)
            env_info = env.step(action)[brain_name]
            next_state = env_info.vector_observations[0]
            reward = env_info.rewards[0]
            done = env_info.local_done[0]

            # store the transition and let the agent learn from it
            cc_agent.step(state, action, reward, next_state, done)
            state = next_state
            score += reward
            if done:
                break
        scores_deque.append(score)
        scores.append(score)
        print('\rEpisode {}\tAverage Score: {:.2f}\tScore: {:.2f}'.format(i_episode, np.mean(scores_deque), score), end="")
        # save a checkpoint every 10 episodes
        if i_episode % 10 == 0:
            torch.save(cc_agent.actor_local.state_dict(), 'checkpoint_actor.pth')
            torch.save(cc_agent.critic_local.state_dict(), 'checkpoint_critic.pth')

        if i_episode % 100 == 0:
            print('\rEpisode {}\tAverage Score: {:.2f}'.format(i_episode, np.mean(scores_deque)))

        # stop once the average over a full window of 100 episodes reaches 30
        if len(scores_deque) == scores_deque.maxlen and np.mean(scores_deque) >= 30.0:
            print('\nEnvironment solved in {:d} episodes!\tAverage Score: {:.2f}'.format(i_episode-100, np.mean(scores_deque)))
            break

    return scores

scores = ddpg()

fig = plt.figure()
ax = fig.add_subplot(111)
plt.plot(np.arange(1, len(scores)+1), scores)
plt.ylabel('Score')
plt.xlabel('Episode #')
plt.show()



Episode 68	Average Score: 1.27	Score: 0.33
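The stopping rule used in the training loop, an average score of at least 30 over the last 100 episodes, can be isolated as a small helper. This is a sketch: `episodes_to_solve` is a hypothetical name, and the scores in the example are made up to exercise the function, not results from this project.

```python
from collections import deque

def episodes_to_solve(scores, target=30.0, window=100):
    """Return the episode count at which the moving average over `window`
    episodes first reaches `target` (reported as episode - window, the
    convention used in the training loop), or None if it never does."""
    recent = deque(maxlen=window)
    for i_episode, score in enumerate(scores, start=1):
        recent.append(score)
        # Only judge the average once the window is completely filled.
        if len(recent) == window and sum(recent) / window >= target:
            return i_episode - window
    return None

# Made-up score curve: 50 episodes at 0, then 150 episodes at 35.
print(episodes_to_solve([0.0] * 50 + [35.0] * 150))
```

Requiring a full window avoids declaring the environment solved after a few lucky early episodes.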

### 6. Play with the trained agent in the environment
In the cell below, the trained agent plays one episode in the environment.

In [None]:
# reset the environment in inference mode (train_mode=False) to watch the trained agent
env_info = env.reset(train_mode=False)[brain_name]
state = env_info.vector_observations[0]            # get the current state
score = 0                                          # initialize the score

while True:
    action = cc_agent.act(state)               # select an action
    env_info = env.step(action)[brain_name]        # send the action to the environment
    next_state = env_info.vector_observations[0]   # get the next state
    reward = env_info.rewards[0]                   # get the reward
    done = env_info.local_done[0]                  # see if episode has finished
    score += reward                                # update the score
    state = next_state                             # roll over the state to next time step
    if done:                                       # exit loop if episode finished
        break
    
print("Score: {}".format(score))

When finished, you can close the environment.

In [None]:
env.close()

### Ideas for future improvements

There are several improvements that can be made:
   - tune the hyperparameters so that training converges faster