
# Project 2: Continuous Control

[image1]: https://user-images.githubusercontent.com/10624937/43851024-320ba930-9aff-11e8-8493-ee547c6af349.gif "Trained Agent"

### Introduction

For this project, you will work with the [Reacher](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#reacher) environment.

![Trained Agent][image1]
  
In this environment, a double-jointed arm can move to target locations. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.

The observation space consists of 33 variables corresponding to position, rotation, velocity, and angular velocities of the arm. Each action is a vector with four numbers, corresponding to torque applicable to two joints. Every entry in the action vector should be a number between -1 and 1.
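
As a minimal sketch of the action bound described above, the snippet below clamps each of the four entries to the interval [-1, 1]. The helper name `clip_action` and the sample values are hypothetical and are not taken from this repository.

```python
def clip_action(action, low=-1.0, high=1.0):
    """Clamp every entry of an action vector to the range [low, high]."""
    return [max(low, min(high, a)) for a in action]


# Hypothetical raw actor output (for example, after adding exploration noise).
raw_action = [1.7, -0.3, -2.5, 0.9]
print(clip_action(raw_action))  # [1.0, -0.3, -1.0, 0.9]
```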

The environment runs 20 agents in parallel. To solve it, your agents must get an average score of +30 (over 100 consecutive episodes, and over all agents). Specifically,

- After each episode, we add up the rewards that each agent received (without discounting), to get a score for each agent. This yields 20 (potentially different) scores. We then take the average of these 20 scores.

- This yields an average score for each episode (where the average is over all 20 agents).

The environment is considered solved when the average (over 100 episodes) of those average scores is at least +30.
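
The scoring rule above can be sketched in a few lines of standard-library Python. This is an illustration only: the function names `episode_score` and `first_solved_episode` are hypothetical, and the training loop in this repository may track scores differently.

```python
from collections import deque
from statistics import mean

WINDOW = 100   # consecutive episodes in the moving average
TARGET = 30.0  # required average score


def episode_score(agent_scores):
    """Average the undiscounted per-agent returns of a single episode."""
    return mean(agent_scores)


def first_solved_episode(episode_scores, window=WINDOW, target=TARGET):
    """Return the 1-based episode at which the average of the last `window`
    episode scores first reaches `target`, or None if that never happens."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        if len(recent) == window and mean(recent) >= target:
            return episode
    return None
```

With this sketch, `episode_score` would be called once per episode on the 20 per-agent scores, and the resulting list of episode scores passed to `first_solved_episode`.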


#### DDPG Agent
The DDPG agent, [ddpg_agent.py](ddpg_agent.py), implements the algorithm from the [DDPG paper](https://arxiv.org/pdf/1509.02971), a policy gradient method that uses an actor-critic model.
The Actor network has the following layers and parameters:

    from model import Actor
    from torchsummary import summary
    state_size = 33
    action_size = 4
    actor_model = Actor(state_size, action_size, 2)
    summary(actor_model, (state_size,))

    ----------------------------------------------------------------
            Layer (type)               Output Shape         Param #
           BatchNorm1d-1                   [-1, 33]              66
                Linear-2                  [-1, 512]          17,408
                Linear-3                  [-1, 512]         262,656
                Linear-4                  [-1, 512]         262,656
                Linear-5                  [-1, 256]         131,328
                Linear-6                  [-1, 256]          65,792
                Linear-7                    [-1, 4]           1,028
    Total params: 740,934
    Trainable params: 740,934
    Non-trainable params: 0
    ----------------------------------------------------------------
    Input size (MB): 0.00
    Forward/backward pass size (MB): 0.02
    Params size (MB): 2.83
    Estimated Total Size (MB): 2.84
    ----------------------------------------------------------------


Critic network parameters:

    from model import Critic

    critic_model = Critic(state_size, action_size, 2)
    summary(critic_model, [(state_size,), (action_size,)])

    ----------------------------------------------------------------
            Layer (type)               Output Shape         Param #
           BatchNorm1d-1                   [-1, 33]              66
                Linear-2                  [-1, 512]          17,408
                Linear-3                  [-1, 512]         264,704
                Linear-4                  [-1, 512]         262,656
                Linear-5                  [-1, 256]         131,328
                Linear-6                  [-1, 256]          65,792
                Linear-7                    [-1, 1]             257
    Total params: 742,211
    Trainable params: 742,211
    Non-trainable params: 0
    ----------------------------------------------------------------
    Input size (MB): 0.00
    Forward/backward pass size (MB): 0.02
    Params size (MB): 2.83
    Estimated Total Size (MB): 2.85
    ----------------------------------------------------------------
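
The parameter counts in the two summaries can be checked by hand. The sketch below recomputes them from the layer sizes, assuming each `Linear` layer has a bias and the `BatchNorm1d` layer has one learnable scale and one shift per feature. The critic's `Linear-3` count of 264,704 is consistent with an input of 512 + 4 features, which suggests (but does not prove) that the action is concatenated with the first hidden layer's output; the authoritative definition is in `model.py`.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def batchnorm_params(num_features):
    """Learnable scale and shift of a batch-norm layer."""
    return 2 * num_features


state_size, action_size = 33, 4

actor_total = (
    batchnorm_params(state_size)
    + linear_params(state_size, 512)
    + linear_params(512, 512)
    + linear_params(512, 512)
    + linear_params(512, 256)
    + linear_params(256, 256)
    + linear_params(256, action_size)
)

critic_total = (
    batchnorm_params(state_size)
    + linear_params(state_size, 512)
    + linear_params(512 + action_size, 512)  # assumption: action joined here
    + linear_params(512, 512)
    + linear_params(512, 256)
    + linear_params(256, 256)
    + linear_params(256, 1)
)

print(actor_total, critic_total)  # 740934 742211
```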


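- I employ soft updates for both networks (actor and critic) to stabilize learning. The parameter $\tau$ controls how far the target weights move toward the local weights at each update.
    $$
    \theta_{target} \leftarrow \tau \cdot \theta_{local} + (1-\tau) \cdot \theta_{target}
    $$

-  The actor network is trained by gradient ascent on the critic's estimate $Q(s_t, \mu(s_t))$, which shifts the policy $\mu$ toward actions with higher Q-values and away from actions with lower Q-values.
-  The critic network is trained with temporal difference (TD) learning. Here $\gamma$ is the discount factor, $Q'$ is the target critic with parameters $\theta'$, $a_{t+1}$ is the action the target actor selects for $s_{t+1}$, and $N$ is the minibatch size.
    $$
    y_t = r_t + \gamma \cdot Q'(s_{t+1}, a_{t+1}, \theta')
    $$
    $$
    L^{critic} = \frac{1}{N}\sum_t (y_t - Q(s_t, a_t, \theta))^2
    $$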
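
The update rules above can be illustrated with plain Python numbers standing in for network weights and Q-values. This is a sketch under stated assumptions, not the code in `ddpg_agent.py`: the function names are hypothetical, parameters are modelled as a dict of floats rather than tensors, and the `done` flag (which drops the bootstrap term at the end of an episode) is a common convention that the equations above do not mention.

```python
def soft_update(local_params, target_params, tau):
    """Move each target parameter a fraction `tau` toward its local value (in place)."""
    for name, local_value in local_params.items():
        target_params[name] = tau * local_value + (1.0 - tau) * target_params[name]


def td_target(reward, next_q, gamma=0.99, done=False):
    """One-step TD target y_t. `next_q` is the target critic's value for the
    next state and the target actor's action. Assumption: no bootstrap if done."""
    return reward + gamma * next_q * (0.0 if done else 1.0)


def critic_loss(targets, predictions):
    """Mean squared TD error over a minibatch."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)


# Toy usage with made-up numbers.
local = {"w": 1.0}
target = {"w": 0.0}
soft_update(local, target, tau=0.1)      # target["w"] moves 10% of the way to 1.0
y = td_target(reward=0.1, next_q=2.0)    # 0.1 + 0.99 * 2.0
loss = critic_loss([1.0, 2.0], [0.0, 4.0])
```

A small $\tau$ keeps the target networks close to their previous values, so the TD target $y_t$ changes slowly between updates.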



#### Hyper-Parameters
- Replay buffer size: **1e5**
- Minibatch size: **128**
- Discount factor: **0.99**
- Actor learning rate: **1e-4**
- Critic learning rate: **1e-3**
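
To show how the buffer size and minibatch size interact, here is a minimal replay buffer sketch using only the standard library. The class and field names are assumptions for illustration; the buffer actually used by the agent lives in the project code and may differ.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer that stores experiences and samples them uniformly."""

    def __init__(self, buffer_size=int(1e5), batch_size=128, seed=0):
        # A deque with maxlen discards the oldest experience once it is full.
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        """Append one transition to the buffer."""
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        """Draw `batch_size` distinct experiences uniformly at random."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)
```

Learning would typically start only once `len(buffer)` is at least the minibatch size, since `sample` cannot draw more experiences than the buffer holds.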

#### Result
The plot below shows the training result: the environment was solved in 10 episodes, with an average score of 30.11 over the last 100 episodes.

![Result](scores_episodes.png)