### Project 2: Continuous Control
#### Author : Md. Masud Rana

This project demonstrates how policy-based methods can be used to learn a policy in a model-free reinforcement learning setting, using the Unity Reacher environment.

In this environment, a double-jointed arm can move to target locations. A reward of +0.1 is provided for each step that the agent's hand is in the goal location. Thus, the agent's goal is to keep its hand at the target location for as many time steps as possible.

The observation space consists of 33 variables corresponding to the position, rotation, velocity, and angular velocities of the arm. Each action is a vector of four numbers, corresponding to the torques applied to the two joints. Every entry in the action vector must be a number between -1 and 1.
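
As a rough illustration of the action constraint, the sketch below samples a random four-element action and clamps every entry into [-1, 1]. The helper names `clip_action` and `random_action` are hypothetical and are not taken from this repository's code.

```python
import random

ACTION_SIZE = 4                      # four torque values, as described above
ACTION_LOW, ACTION_HIGH = -1.0, 1.0  # valid range for every entry


def clip_action(action):
    """Clamp every entry of the action vector into [-1, 1]."""
    return [max(ACTION_LOW, min(ACTION_HIGH, a)) for a in action]


def random_action(rng=random):
    """Sample an action roughly the way an untrained agent might."""
    # A Gaussian sample can fall outside the valid range, so it is clipped.
    return clip_action([rng.gauss(0.0, 1.0) for _ in range(ACTION_SIZE)])


action = random_action()
```

Clipping after sampling (or after adding exploration noise) keeps the action valid no matter how large the raw values are.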

| Random Agent | Trained Agent |
| :--: | :--: |
|<img src="images/random_agent.gif">|<img src="images/reacher.gif">|
|Unity ML-Agents [Reacher](https://github.com/Unity-Technologies/ml-agents/blob/master/docs/Learning-Environment-Examples.md#reacher) Environment|

### Implementation

The underlying algorithm is an actor-critic method. Policy-based methods like REINFORCE use Monte-Carlo estimates of the return, which are unbiased but have high variance. TD estimates, as used in value-based methods, have lower variance but introduce bias because they bootstrap from a learned value estimate. Actor-critic methods combine these two ideas: the actor is a neural network that represents and updates the policy, and the critic is another neural network that evaluates the policy being learned. The critic's evaluation is, in turn, used to train the actor.

[Deep Deterministic Policy Gradient (DDPG)](https://arxiv.org/pdf/1509.02971.pdf) belongs to the class of actor-critic methods but differs somewhat from the vanilla actor-critic algorithm. The actor produces a deterministic policy instead of the usual stochastic policy, and the critic evaluates that deterministic policy. The critic is updated using the TD error, and the actor is trained using the deterministic policy gradient algorithm.
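
The two updates can be sketched in plain Python as follows. This is a minimal illustration and not the project's implementation: the actor and critic are passed in as ordinary callables, and `toy_actor` and `toy_critic` are hypothetical stand-ins for the neural networks so that the sketch runs without a deep learning library.

```python
GAMMA = 0.99  # discount factor, as listed in the Hyperparameters section


def td_target(reward, next_state, done, target_actor, target_critic, gamma=GAMMA):
    """y = r + gamma * Q'(s', mu'(s')), with no bootstrapping on terminal steps."""
    next_action = target_actor(next_state)
    next_q = target_critic(next_state, next_action)
    return reward + gamma * (1.0 - float(done)) * next_q


def critic_loss(batch, critic, target_actor, target_critic):
    """Mean squared TD error over a batch of (s, a, r, s', done) tuples."""
    errors = []
    for state, action, reward, next_state, done in batch:
        y = td_target(reward, next_state, done, target_actor, target_critic)
        errors.append((y - critic(state, action)) ** 2)
    return sum(errors) / len(errors)


def actor_objective(states, actor, critic):
    """The actor is trained to maximize the mean of Q(s, mu(s))."""
    return sum(critic(s, actor(s)) for s in states) / len(states)


# Hypothetical scalar stand-ins for the actor and critic networks.
def toy_actor(state):
    return 0.5 * state


def toy_critic(state, action):
    return state + action


batch = [(1.0, 0.5, 0.1, 2.0, False), (2.0, 1.0, 0.1, 0.0, True)]
loss = critic_loss(batch, toy_critic, toy_actor, toy_critic)
objective = actor_objective([1.0, 2.0], toy_actor, toy_critic)
```

In a real implementation the critic would be updated by gradient descent on `critic_loss`, and the actor by gradient ascent on `actor_objective`, which is what the deterministic policy gradient amounts to.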


<img src="images/dpg.png">



#### [Deep Deterministic Policy Gradient (DDPG)](https://arxiv.org/pdf/1509.02971.pdf) Algorithm

<img src="images/dpg_algo.png">
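
DDPG learns from transitions stored in a replay buffer rather than from the most recent step alone. The sketch below shows one way such a buffer and the update schedule could look, using the buffer size, batch size, update interval, and updates per interval from the Hyperparameters section. The class and function names are hypothetical and are not taken from this repository.

```python
import random
from collections import deque

BUFFER_SIZE = int(1e6)      # replay buffer size
BATCH_SIZE = 1024           # minibatch size
UPDATE_EVERY = 20           # update interval, in timesteps
UPDATES_PER_INTERVAL = 10   # gradient updates per interval


class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=BUFFER_SIZE):
        # A bounded deque discards the oldest transition once it is full.
        self.memory = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=BATCH_SIZE):
        """Draw a uniformly random minibatch without replacement."""
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)


def learning_updates_due(timestep, buffer_len, batch_size=BATCH_SIZE):
    """Return how many gradient updates to run at this timestep."""
    if timestep % UPDATE_EVERY == 0 and buffer_len >= batch_size:
        return UPDATES_PER_INTERVAL
    return 0
```

Running several updates only every few timesteps, instead of one update per step, is a common way to stabilize training; whether this exact schedule matches the project's code is an assumption based on the table values.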

### Hyperparameters
The hyperparameters used in the experiment and their values are listed below:

| Hyperparameter | Value |
| :-- | :--: |
| Replay buffer size | 1e6 |
| Batch size | 1024 |
| GAMMA (discount factor) | 0.99 |
| TAU | 1e-3 |
| Actor learning rate | 1e-4 |
| Critic learning rate | 3e-4 |
| Update interval | 20 |
| Update times per interval | 10 |
| Number of episodes | 500 |
| Max number of timesteps per episode | 1000 |
| Leak for LeakyReLU | 0.01 |
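
Two of the less self-explanatory entries are TAU and the LeakyReLU leak. In DDPG, TAU is the rate at which the target networks track the learned networks (a soft update), and the leak is the slope LeakyReLU applies to negative inputs. The sketch below illustrates both on plain Python numbers; the function names are hypothetical, and real code would apply them to network weight tensors.

```python
TAU = 1e-3    # soft update rate from the table
LEAK = 0.01   # LeakyReLU negative slope from the table


def soft_update(local_params, target_params, tau=TAU):
    """theta_target <- tau * theta_local + (1 - tau) * theta_target."""
    return [tau * local + (1.0 - tau) * target
            for local, target in zip(local_params, target_params)]


def leaky_relu(x, leak=LEAK):
    """Identity for positive inputs, a small slope for negative inputs."""
    return x if x > 0 else leak * x


# With a small TAU the target moves only slightly toward the local weights.
new_target = soft_update([1.0, -2.0], [0.0, 0.0])
```

A small TAU means the targets used in the TD error change slowly, which tends to make critic training more stable.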

### Results

The best performance was achieved with DDPG, which reached a reward of +30 in 56 episodes. Finding a good set of hyperparameters was difficult and took a considerable amount of time.

<img src="images/reward.png">



### Ideas for Future Work

- Other algorithms discussed in the course, such as TRPO, PPO, A3C, and A2C, could potentially lead to better results.

- The Q-Prop algorithm, which combines off-policy and on-policy learning, could be a good one to try.

- General optimization techniques like cyclical learning rates and warm restarts could be useful as well.
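
As an illustration of the last idea, the sketch below computes a cosine-annealed learning rate that restarts at the start of every cycle. The maximum matches the actor learning rate in the table; the minimum and the cycle length are hypothetical values chosen only for the example, and the function is not part of this project.

```python
import math


def cosine_warm_restart_lr(step, lr_max=1e-4, lr_min=1e-6, cycle_length=1000):
    """Learning rate at `step` for cosine annealing with fixed-length restarts."""
    position = step % cycle_length  # steps elapsed since the last restart
    cosine = 0.5 * (1.0 + math.cos(math.pi * position / cycle_length))
    return lr_min + (lr_max - lr_min) * cosine


# The rate starts at lr_max, decays toward lr_min, then jumps back at a restart.
schedule = [cosine_warm_restart_lr(s) for s in (0, 500, 999, 1000)]
```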