## Continuous Control Project

![](https://user-images.githubusercontent.com/10624937/43851024-320ba930-9aff-11e8-8493-ee547c6af349.gif)

- More information about the purpose of the project is available here: [Udacity project link](https://github.com/udacity/deep-reinforcement-learning/tree/master/p2_continuous-control)


## Prerequisites

- Create a Python environment as described here: https://github.com/udacity/deep-reinforcement-learning/tree/master/python

- Download the project structure here: [github](https://github.com/antoniopenta/deep_reinforcement_learning/tree/master/drl_continous_control)

- The version of the Reacher environment with 20 agents is used.

- Download the Reacher Unity environment (Mac version) from the following link, unzip it, and save it in the `env` folder of the main project: https://s3-us-west-1.amazonaws.com/udacity-drlnd/P2/Reacher/Reacher.app.zip


- **AWS configuration**: the code has been run on an AWS instance with a GPU.
    - The AWS instance (p2.xlarge) uses the Deep Learning AMI with Source Code (CUDA 8, Ubuntu), which you can find on the AWS Marketplace. I used the credit from the Udacity course.
    - The configuration is explained in the extracurricular activities of the course (notes with more details are in the `aws.txt` file in this repository).

- For AWS, the Unity environment can be downloaded [here](https://s3-us-west-1.amazonaws.com/udacity-drlnd/P2/Reacher/Reacher_Linux_NoVis.zip). You will not be able to watch the agent without enabling a virtual screen, but you will be able to train the agent.
    
 


## Project Structure

- The project has these folders:
    - **data**: contains the data created during execution
    - **framework**: contains the code for the agents and the network for learning the Q function
    - **env**: where the Unity environment is stored
    - **model**: where the checkpoint for the network is saved
    - **jupyter**: where the notebook with the explanation is stored
    
     

## Problem Description
This project uses the **Reacher** Unity environment.

- In this environment, a double-jointed arm can move to target locations. 
- A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the goal of your agent is to maintain its position at the target location for as many time steps as possible.
- The observation space consists of 33 variables corresponding to position, rotation, velocity, and angular velocities of the arm. Each action is a vector with four numbers, corresponding to torque applicable to two joints. Every entry in the action vector should be a number between -1 and 1.
- The task is episodic. 

- According to Udacity, the environment is solved when the agent gets an average score of +30 over 100 consecutive episodes.
- In the best case, collecting +30 in one episode requires the agent to keep its hand in the goal location for at least 300 time steps (30 / 0.1 = 300).
- The average length of an episode in this environment is 1000 time steps.
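
The solve criterion and the 300-step estimate above can be checked with a short sketch. The function name `is_solved` and the constant names are made up for illustration and are not part of the repository.

```python
from collections import deque

REWARD_PER_STEP = 0.1   # reward for each step the hand is in the goal location
TARGET_SCORE = 30.0     # average score required to solve the environment
WINDOW = 100            # number of consecutive episodes to average over

# Minimum number of in-goal steps needed to score +30 in a single episode.
min_steps_in_goal = round(TARGET_SCORE / REWARD_PER_STEP)


def is_solved(episode_scores, target=TARGET_SCORE, window=WINDOW):
    """Return True if the mean of the last `window` scores reaches `target`."""
    if len(episode_scores) < window:
        return False
    recent = deque(episode_scores, maxlen=window)  # keeps only the last `window` scores
    return sum(recent) / window >= target
```

With 1000 steps per episode, 300 in-goal steps means the hand has to stay on the target for roughly 30% of the episode.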


## Scripts to Run 

- There are two main scripts in the main folder:
    - `main_script_ddpg_test.py`, which loads the network weights learned during training and shows the agent in action
    - `main_script_ddpg_train.py`, which is used to train the agent
  
    
    

## Algorithm Explanation
- The RL algorithm is Deep Deterministic Policy Gradient (DDPG), which is based on the Policy Gradient approach.
- The core idea of the algorithm is described in the Udacity videos [video1](https://youtu.be/0NVOPIyrr98) [video2](https://youtu.be/RT-HDnAVe9o)
- The algorithm uses the Actor-Critic approach.
- The Actor is a neural network that approximates a deterministic policy. It takes the state as input and outputs an array of values, one for each dimension of the action.
- The Actor is used to approximate the policy ($\pi(a\mid s;\theta_\pi)$).
- In DDPG, this output is deterministic: it is the action itself rather than a probability distribution over the action space. Each output value is in the range [-1, 1].
- The Critic gives feedback on the value of the state observed by the Actor together with the action chosen by the Actor.
- Both the Actor and the Critic have a regular network and a target network. The target network defines the learning target, so that the same network is not used for both the prediction and the target. This is the same idea used in the DQN algorithm.
- DDPG updates the target networks with a soft update: the target network is updated more often, but each update changes it only slightly.
- DDPG belongs to the policy gradient family. Policy gradient algorithms that compute the cumulative reward with a Monte Carlo approach suffer from high variance.
- To reduce the variance, a Temporal Difference approach with bootstrapping is used. This lowers the variance but introduces bias.
- The action space is explored by injecting noise into the last layer of the Actor network.
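
Two of the ideas above, the bootstrapped Temporal Difference target and the soft update, can be sketched in plain Python. This is only an illustration with scalars and lists, not the repository's PyTorch code. The function names `td_target` and `soft_update` are hypothetical, the `GAMMA` and `TAU` values are the ones listed in the hyperparameters of this README, and skipping the bootstrap on terminal steps is the usual convention, assumed here.

```python
GAMMA = 0.99  # discount factor
TAU = 1e-3    # soft-update interpolation factor


def td_target(reward, next_q_target, done, gamma=GAMMA):
    """Bootstrapped critic target: y = r + gamma * Q_target(s', mu_target(s')).

    `next_q_target` stands for the target critic's estimate at the next state,
    evaluated at the action chosen by the target actor.
    Terminal steps (done=True) do not bootstrap.
    """
    return reward + gamma * next_q_target * (1.0 - float(done))


def soft_update(local_params, target_params, tau=TAU):
    """Move the target parameters a small step toward the local ones:
    theta_target <- tau * theta_local + (1 - tau) * theta_target
    """
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


# Example with made-up numbers.
y = td_target(reward=0.1, next_q_target=2.0, done=False)  # 0.1 + 0.99 * 2.0
new_target = soft_update(local_params=[1.0, -1.0], target_params=[0.0, 0.0])
```

Because `TAU` is small, each soft update changes the target parameters only slightly, which keeps the learning target stable while still tracking the regular network.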

## DDPG code
- The code is based on the [Udacity version](https://github.com/udacity/deep-reinforcement-learning/tree/master/ddpg-bipedal), with the following modifications:
     - **repeating the learning steps** multiple times after sampling the buffer
     - **gradient clipping** for the critic, as suggested in the course (`torch.nn.utils.clip_grad_norm_(self.critic_local.parameters(), 1)`)
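
To illustrate what clipping by norm does, here is a minimal sketch in plain Python. It is a stand-in for illustration only, not PyTorch's implementation: the function `clip_grad_norm` and its `eps` parameter are hypothetical, and the gradients are a flat list of floats instead of tensors.

```python
import math


def clip_grad_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale `grads` so that their combined L2 norm is at most `max_norm`.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Shrink only when the norm exceeds max_norm; never scale gradients up.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


# Example with made-up gradients whose norm is 5.
clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=1.0)
```

The direction of the gradient is preserved and only its length is capped, which limits the size of a single critic update.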


## Hyperparameters

    # Hyperparameter configuration for the agent
    # fc1_units=256, fc2_units=128 for the actor
    # fc1_units=256, fc2_units=128 for the critic
    BUFFER_SIZE = int(1e6)  # replay buffer size
    BATCH_SIZE = 128        # minibatch size
    GAMMA = 0.99            # discount factor
    TAU = 1e-3              # for soft update of target parameters
    LR_ACTOR = 1e-4         # learning rate of the actor
    LR_CRITIC = 1e-4        # learning rate of the critic
    WEIGHT_DECAY = 0.0      # L2 weight decay

    N_LEARN_UPDATES = 10    # number of learning updates per learning phase
    N_TIME_STEPS = 20       # run the learning updates every N_TIME_STEPS time steps
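
The interaction between `N_LEARN_UPDATES` and `N_TIME_STEPS` can be sketched as a simple counting loop. This is a sketch under assumptions: the function `count_learning_updates` is hypothetical, the actual learning step is replaced by a counter, and any warm-up period while the replay buffer fills is ignored.

```python
N_LEARN_UPDATES = 10  # learning updates per learning phase
N_TIME_STEPS = 20     # run a learning phase every N_TIME_STEPS environment steps


def count_learning_updates(total_steps, every=N_TIME_STEPS, updates=N_LEARN_UPDATES):
    """Count how many learning updates occur over `total_steps` environment steps."""
    n_updates = 0
    for t in range(1, total_steps + 1):
        if t % every == 0:           # time for a learning phase
            for _ in range(updates):
                n_updates += 1       # placeholder for one actor/critic learning step
    return n_updates


# Assuming an episode of 1000 time steps, as stated in the problem description.
updates_per_episode = count_learning_updates(1000)
```

Under these assumptions, a 1000-step episode triggers 50 learning phases of 10 updates each.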


## Future Work
- Try a different algorithm, such as [Rainbow](https://arxiv.org/abs/1710.02298)
- Change the approach to action exploration, using the [OpenAI approach](https://blog.openai.com/better-exploration-with-parameter-noise/) or noisy networks, as explained [here](https://youtu.be/L6xaQ501jEs?t=3046)