# Project 2: Continuous Control

## Introduction

The goal of this project is to design, train, and evaluate a deep reinforcement learning algorithm that enables an agent to move a double-jointed arm to given target locations. The agent's objective is to keep the arm at the target location for as many time steps as possible. The environment is provided by Unity Machine Learning Agents (ML-Agents), an open-source Unity plugin that enables games and simulations to serve as environments for training intelligent agents.

The task of maintaining the arm at the target location is episodic. The environment is considered solved once the agent achieves an average score of +30 over 100 consecutive episodes.

## Implementation

The state space of this environment has 33 dimensions, comprising the arm's position, rotation, velocity, and angular velocities. The agent acts on the environment with a vector of four numbers, corresponding to the torques applied to the two joints. Every entry in the action vector should be a number between -1 and 1.
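The action range can be enforced with a simple clipping step before an action is sent to the environment. The following is a minimal sketch; the helper name `clip_action` and the constants are illustrative and not taken from the project code.

```python
# Dimensions as stated in the text.
STATE_SIZE = 33
ACTION_SIZE = 4


def clip_action(action, low=-1.0, high=1.0):
    """Clamp every entry of an action vector to the interval [low, high]."""
    return [min(high, max(low, a)) for a in action]


# Example: a raw 4-dimensional action with two out-of-range entries.
raw_action = [-1.5, 0.25, 1.0, 3.0]
safe_action = clip_action(raw_action)
```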

A suitable agent must map any given 33-dimensional state to a 4-dimensional action vector. Through learning over time, the agent is expected to act in a way that maximizes the sum of discounted rewards per episode.

In the chosen Reacher environment, 20 arms operate in parallel. The experience collected by all of them is used to train a single agent.

### Learning Algorithm

The solution described here is based on the deep deterministic policy gradient (DDPG) algorithm. It uses an actor-critic architecture with experience replay, soft target network updates, and gradient clipping. Both the actor and the critic are fully connected neural networks with two hidden layers of 400 and 300 neurons, respectively, ReLU activations, and batch normalization before the activation of the first hidden layer. The 4-node output layer of the actor network (one node per action dimension) uses tanh activations to keep the actions within the range of -1 to 1.
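For reference, the standard DDPG updates can be summarized as follows. The notation is an assumption made here for illustration: $Q$ and $\mu$ denote the local critic and actor with parameters $\theta^Q$ and $\theta^\mu$, primed symbols denote the target networks, $d_i$ is the episode-termination flag, and $N$ is the batch size. The exact loss used in the project code may differ in detail.

$$
y_i = r_i + \gamma \, (1 - d_i) \, Q'\big(s_{i+1}, \mu'(s_{i+1})\big)
$$

$$
L(\theta^Q) = \frac{1}{N} \sum_{i=1}^{N} \big(y_i - Q(s_i, a_i)\big)^2
$$

$$
\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_a Q(s_i, a)\big|_{a = \mu(s_i)} \, \nabla_{\theta^\mu} \mu(s_i)
$$

$$
\theta' \leftarrow \tau \, \theta + (1 - \tau) \, \theta'
$$

The critic is trained to minimize $L(\theta^Q)$, the actor follows the gradient $\nabla_{\theta^\mu} J$, and the last line is the soft target update applied to both target networks.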

Experience replay is implemented with a buffer that stores up to 100,000 experience tuples. Every 20 time steps, 10 learning cycles are run. During each cycle, 1024 experience tuples are randomly sampled from the buffer (provided it already holds at least 1024 tuples) and used to train the actor and the critic networks, both with a learning rate of 0.001 and no weight decay. At the end of every learning cycle, the target actor and critic networks are soft-updated toward their respective local networks with an interpolation factor tau of 0.001.
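The replay and update schedule described above can be sketched in plain Python. This is a simplified illustration, not the project's actual code: the names `ReplayBuffer`, `should_learn`, and `soft_update` are assumptions, network parameters are represented as plain lists of floats, and the zero-valued experience tuples are placeholders.

```python
import random
from collections import deque, namedtuple

# Hyperparameters as stated in the text.
BUFFER_SIZE = 100_000
BATCH_SIZE = 1024
UPDATE_EVERY = 20
LEARN_CYCLES = 10
TAU = 0.001

Experience = namedtuple(
    "Experience", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-size buffer; the oldest tuples are dropped once it is full."""

    def __init__(self, capacity=BUFFER_SIZE, seed=0):
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self, batch_size=BATCH_SIZE):
        # Uniform random sampling without replacement.
        return self.rng.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)


def should_learn(time_step, buffer, update_every=UPDATE_EVERY,
                 batch_size=BATCH_SIZE):
    """Learn only every `update_every` steps and once enough data exists."""
    return time_step % update_every == 0 and len(buffer) >= batch_size


def soft_update(local_params, target_params, tau=TAU):
    """Return tau * local + (1 - tau) * target for every parameter."""
    return [tau * l + (1.0 - tau) * t
            for l, t in zip(local_params, target_params)]


# One time step: each of the 20 arms adds one tuple to the shared buffer.
buffer = ReplayBuffer()
for arm in range(20):
    buffer.add(state=[0.0] * 33, action=[0.0] * 4, reward=0.0,
               next_state=[0.0] * 33, done=False)
```

In a full training loop, each time step for which `should_learn` returns true would run `LEARN_CYCLES` iterations of sampling a batch, training both networks, and soft-updating the targets.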

Future rewards are discounted by a gamma of 0.99.
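To illustrate the effect of discounting, the following sketch computes the discounted sum of a reward sequence. The function name `discounted_return` is illustrative; the value of gamma is the one stated above.

```python
GAMMA = 0.99  # discount factor as stated in the text


def discounted_return(rewards, gamma=GAMMA):
    """Compute r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one episode."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

With a gamma of 0.99, a reward received 100 time steps in the future is weighted by roughly 0.37, so rewards within an episode of up to 1,000 steps still contribute noticeably over a horizon of a few hundred steps.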

### Training Results

The training is scheduled to run for up to 1,000 episodes with up to 1,000 time steps per episode. However, training stops as soon as the score, averaged across all 20 arms and over the last 100 consecutive episodes, reaches +30, at which point the environment is considered solved. With the learning algorithm described above, the per-episode scores plotted in the following graph were achieved during training.
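The stopping criterion can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and requiring a full window of 100 episodes before declaring the environment solved is an assumption about how the check is applied.

```python
from collections import deque
from statistics import mean

TARGET_SCORE = 30.0  # required average score
WINDOW = 100         # number of consecutive episodes to average over


def episode_score(per_arm_scores):
    """Score of one episode: the mean over the scores of all arms."""
    return mean(per_arm_scores)


def episodes_to_solve(episode_scores, target=TARGET_SCORE, window=WINDOW):
    """Return the 1-based episode at which the running average over the
    last `window` episodes first reaches `target`, or None if it never does."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(episode_scores, start=1):
        recent.append(score)
        # Assumption: the window must be full before the check applies.
        if len(recent) == window and mean(recent) >= target:
            return episode
    return None
```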

![image.png](attachment:image.png)

After 100 episodes of training, the agent achieved an average score of 30.31. Hence, it took the agent 100 episodes to solve this environment.

## Future Work

There are many ways to further improve the agent's performance. For example, because of long training times, only a relatively small set of hyperparameter settings has been evaluated manually so far. This hyperparameter optimization could be automated using techniques such as grid search. The architecture of both networks (number of layers, layer type, neurons per hidden layer, batch normalization) and the remaining hyperparameters of the networks should be included in this optimization as well.
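A grid search over a few hyperparameters could be organized as in the following sketch. The `evaluate` callable is hypothetical: in practice it would train an agent with the given settings and return a performance measure, whereas here a placeholder objective stands in for it. The candidate values are examples only and have not been evaluated.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Evaluate every combination in `param_grid` and return the best one.

    `param_grid` maps a hyperparameter name to a list of candidate values;
    `evaluate` takes a dict of settings and returns a score (higher is better).
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def placeholder_evaluate(params):
    """Stand-in objective; a real version would train and score an agent."""
    return -abs(params["lr"] - 1e-3) - abs(params["tau"] - 1e-3)


# Example candidate values (illustrative only).
grid = {"lr": [1e-4, 1e-3], "tau": [1e-3, 1e-2]}
best_params, best_score = grid_search(grid, placeholder_evaluate)
```

Because every combination requires a full training run, the size of the grid grows quickly with the number of hyperparameters, which is worth keeping in mind given the long training times mentioned above.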

Further improvements are expected to be achieved by implementing prioritized experience replay.
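As a rough illustration of the idea, proportional prioritized replay samples each tuple with a probability that grows with its priority (typically the magnitude of its TD error) and corrects the resulting bias with importance-sampling weights. The sketch below makes several assumptions: the function name, the exponents `alpha` and `beta`, sampling with replacement, and normalizing the weights by the maximum within the batch are all illustrative choices, not part of the current implementation.

```python
import random


def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4, rng=random):
    """Sample indices with probability proportional to priority ** alpha.

    Returns the sampled indices and their importance-sampling weights,
    normalized so that the largest weight in the batch equals 1.
    """
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    probs = [s / total for s in scaled]
    n = len(priorities)
    # random.choices samples with replacement according to the weights.
    indices = rng.choices(range(n), weights=probs, k=batch_size)
    weights = [(n * probs[i]) ** -beta for i in indices]
    max_weight = max(weights)
    return indices, [w / max_weight for w in weights]
```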

The performance of alternative algorithms as compared in "Benchmarking Deep Reinforcement Learning for Continuous Control" (Duan et al., 2016) may also be investigated.