## Project 2: Continuous Control

### 1. Problem Statement.

The aim of this project is to learn a policy for a continuous control task using a model-free method.  We use the Unity `Reacher` environment, in which a double-jointed robot arm can move to target locations. A reward of `+0.1` is obtained for each timestep that the agent's hand is in the goal location.  The agent's goal is therefore to keep its hand at the target location for as many timesteps as possible.

### 2. The State and Action Spaces

- The observation space consists of `33` variables corresponding to position, rotation, velocity, and angular velocities of the arm.

- Each action is a vector of four numbers, corresponding to the torques applied to the two joints.

- Every entry in the action vector must be a number between `-1` and `1`.

- This task utilises a single agent.
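Since every action entry must lie between `-1` and `1`, any action that has been perturbed (for example by exploration noise) needs to be clamped before it is sent to the environment. A minimal sketch, in which the helper name `clip_action` and the sample values are purely illustrative:

```python
def clip_action(action, low=-1.0, high=1.0):
    """Clamp every entry of an action vector to the interval [low, high]."""
    return [max(low, min(high, a)) for a in action]


# Hypothetical four-dimensional action, e.g. an actor output plus noise.
raw_action = [1.7, -0.3, -2.4, 0.9]
safe_action = clip_action(raw_action)
print(safe_action)
```

Entries already inside the valid range pass through unchanged; only out-of-range values are moved to the nearest bound.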

### 3. The Learning Algorithm.

Purely policy-based methods have high variance, since they rely on Monte Carlo estimates of the return.  Value-based temporal-difference methods are biased, but have lower variance.  Actor-critic methods are an attempt to reduce the variance of policy-based methods, at the cost of introducing some bias.

Deep Deterministic Policy Gradient (DDPG) is an algorithm applicable to continuous action spaces.  It is regarded as an actor-critic method, since it utilizes two neural networks: the actor represents the policy, and the critic evaluates the policy.  That is, rather than using the returns sampled from the environment directly to compute a policy gradient step, the actor is updated using the action-value estimate provided by the critic network.  The critic itself is still trained from the observed rewards.
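For reference, the standard DDPG updates can be written as follows. The notation is an assumption and is not taken from this project's code: $\mu_\theta$ is the actor, $Q_\phi$ is the critic, primed parameters belong to target networks, $d_i$ is a terminal flag, and $N$ is the minibatch size.

```latex
% Critic target, built from the observed reward and the target networks
y_i = r_i + \gamma \,(1 - d_i)\, Q_{\phi'}\bigl(s'_i,\ \mu_{\theta'}(s'_i)\bigr)

% Critic loss: mean squared TD error over the minibatch
L(\phi) = \frac{1}{N} \sum_{i=1}^{N} \bigl(y_i - Q_\phi(s_i, a_i)\bigr)^2

% Actor update: deterministic policy gradient through the critic
\nabla_\theta J \approx \frac{1}{N} \sum_{i=1}^{N}
  \nabla_a Q_\phi(s_i, a)\big|_{a = \mu_\theta(s_i)} \,
  \nabla_\theta \mu_\theta(s_i)
```

The reward enters only through the critic target $y_i$; the actor's gradient depends on the environment solely via the critic's estimate.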

### 4. Model Architecture.

The actor network contains three fully connected hidden layers, of `600`, `400` and `200` nodes respectively.  The hidden layers use `ReLU` activation functions.  The final layer maps to the action space and uses a `tanh` activation function, which keeps every output between `-1` and `1`.

The critic network contains two fully connected hidden layers, of `400` and `300` nodes respectively.  The hidden layers use `ReLU` activation functions, while the output layer uses the identity function.
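To make the actor's size concrete, the layer shapes and parameter count implied by the description above can be worked out directly. This is a rough sketch: the helper names are hypothetical, and it assumes each fully connected layer has one bias per output unit, which the text does not state. The critic is left out because the text does not say at which layer the action is fed in.

```python
STATE_SIZE = 33                  # observation size given in the text
ACTION_SIZE = 4                  # action size given in the text
ACTOR_HIDDEN = [600, 400, 200]   # hidden layer widths given in the text


def dense_layer_shapes(sizes):
    """Return (inputs, outputs) for each consecutive fully connected layer."""
    return list(zip(sizes[:-1], sizes[1:]))


def count_parameters(shapes):
    """Weights plus one bias per output unit (the biases are an assumption)."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


actor_shapes = dense_layer_shapes([STATE_SIZE] + ACTOR_HIDDEN + [ACTION_SIZE])
actor_params = count_parameters(actor_shapes)
print(actor_shapes)
print(actor_params)
```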

### 5. Implementation Details.

- The implementation makes use of a replay buffer to decorrelate the observed experiences.


- We use target networks with soft updates to help stabilise the training.


- Ornstein-Uhlenbeck noise is added to the actions, to aid exploration.
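The first two items can be sketched with the standard library alone. This is an illustrative sketch rather than the project's code: the names `Experience`, `ReplayBuffer` and `soft_update` are assumptions, and network parameters are stood in for by plain lists of floats.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer that returns uniformly sampled minibatches."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries drop out first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement breaks up temporal correlation.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


def soft_update(local_params, target_params, tau):
    """Return tau * local + (1 - tau) * target for each parameter."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


# Usage with toy values: integer "states" and two scalar parameters.
buffer = ReplayBuffer(buffer_size=5, batch_size=2)
for t in range(8):
    buffer.add(state=t, action=0.0, reward=0.1, next_state=t + 1, done=False)
batch = buffer.sample()
new_target = soft_update([1.0, 2.0], [0.0, 0.0], tau=1e-3)
```

With a small `tau`, the target parameters move only a fraction of the way towards the local parameters at each step, so the critic's training targets change slowly.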

### 6. Hyperparameters.

The DDPG agent is trained with the following hyperparameters:

- `BUFFER_SIZE = int(1e5)`: replay buffer size.

- `BATCH_SIZE = 128`: minibatch size for training.

- `GAMMA = 0.99`: discount factor.

- `TAU = 1e-3`: interpolation factor for the soft update of target parameters.

- `LR_ACTOR = 1.5e-4`: learning rate of the actor network.

- `LR_CRITIC = 1.5e-4`: learning rate of the critic network.

- `WEIGHT_DECAY = 0.0001`: L2 weight decay.

- `mu = 0`: long-run mean of the Ornstein-Uhlenbeck noise process.

- `sigma = 0.2`: scale (volatility) of the random term in the noise process.

- `theta = 0.15`: mean reversion rate of the noise process.

- `n_episodes = 400`: maximum number of episodes to run.

- `max_t = 1000`: maximum number of timesteps per episode.
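The roles of `mu`, `theta` and `sigma` are easiest to see in a discretised Ornstein-Uhlenbeck update. The sketch below is an assumption about the general form, not the project's implementation: it uses a unit time step and Gaussian increments, and the class name `OUNoise` is illustrative.

```python
import random


class OUNoise:
    """Discretised Ornstein-Uhlenbeck process with a unit time step."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = [mu] * size        # level the process reverts to
        self.theta = theta           # strength of the pull towards mu
        self.sigma = sigma           # scale of the random increments
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its long-run mean."""
        self.state = list(self.mu)

    def sample(self):
        # x <- x + theta * (mu - x) + sigma * N(0, 1)
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


noise = OUNoise(size=4)
perturbation = noise.sample()   # one noise value per action dimension
```

Because each sample depends on the previous state, consecutive perturbations are correlated in time, unlike independent noise drawn afresh at every step.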

### 7.  Results.

The DDPG algorithm succeeded in solving the environment in 269 episodes, with an average score of 30.05.
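A solve point of this kind is typically found by tracking a rolling mean of episode scores. The sketch below assumes a threshold of `30.0` and a window of `100` episodes; neither value is stated in this report, and the scores used are synthetic rather than the project's results.

```python
from collections import deque


def first_solved_episode(scores, target=30.0, window=100):
    """Return the 1-based episode at which the rolling mean first reaches
    target, or None if it never does."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# Synthetic scores for illustration only.
synthetic_scores = [10.0] * 50 + [35.0] * 150
solved_at = first_solved_episode(synthetic_scores)
print(solved_at)
```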


### 8.  Suggestions For Improvement.

- The DDPG algorithm may be improved by using prioritized experience replay.


- While this particular implementation solved the task, it is computationally expensive.  It is possible that optimizing the hyperparameters in a systematic way could improve convergence.


- Noise can be added to the network parameters rather than to the action space.


- The Q-Prop algorithm is claimed to improve stability over DDPG.  It is a policy gradient method that uses a Taylor expansion of the off-policy critic as a control variate, combining the benefits of on-policy and off-policy methods.
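As an illustration of the first suggestion, proportional prioritized experience replay samples experience `i` with probability `p_i^alpha / sum_k p_k^alpha`, where `p_i` is usually derived from the absolute TD error. The sketch below is a simplified assumption of how that could look: the function names, `alpha = 0.6`, the small offset and the TD errors are all hypothetical, and the importance-sampling correction used in the full method is omitted.

```python
import random


def sampling_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 gives uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_indices(priorities, batch_size, alpha=0.6, seed=0):
    """Draw buffer indices with replacement, weighted by scaled priority."""
    rng = random.Random(seed)
    probs = sampling_probabilities(priorities, alpha)
    return rng.choices(range(len(priorities)), weights=probs, k=batch_size)


td_errors = [0.1, 2.0, 0.5, 0.05]                  # hypothetical TD errors
priorities = [abs(e) + 1e-5 for e in td_errors]    # offset keeps p_i > 0
probs = sampling_probabilities(priorities)
indices = sample_indices(priorities, batch_size=8)
```

Experiences with larger TD errors are replayed more often, which is the intended effect; the exponent `alpha` controls how strongly the sampling departs from uniform.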