## Report: Continuous Control project

### Learning algorithm

This project was solved using deep deterministic policy gradient (DDPG), an actor-critic reinforcement learning method that is well suited to continuous action spaces. The agent consists of four neural networks: two for the actor (local and target) and two for the critic (local and target).

#### The actor

The actor networks consist of 3 fully connected layers. The first layer receives the state as input, and the final layer outputs the actions. In the implementation, the hidden layers have 256 and 128 units. 

#### The critic 

Like the actor, the critic networks have 3 fully connected layers. The critic networks take the state and the action as input and output the Q-value. In the implementation, fc1_units = 256 and fc2_units = 128.
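
As a rough illustration of the two architectures described above, here is a minimal PyTorch sketch. The layer sizes (256 and 128) come from this report; everything else is an assumption made for the example: the ReLU activations, the `tanh` on the actor output, and feeding the action into the critic after its first layer. The class and argument names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Maps a state to an action through three fully connected layers."""

    def __init__(self, state_size, action_size, fc1_units=256, fc2_units=128):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        self.fc2 = nn.Linear(fc1_units, fc2_units)
        self.fc3 = nn.Linear(fc2_units, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))  # assumption: ReLU activations
        x = F.relu(self.fc2(x))
        # Assumption: tanh keeps every action component in [-1, 1].
        return torch.tanh(self.fc3(x))


class Critic(nn.Module):
    """Maps a (state, action) pair to a single Q-value."""

    def __init__(self, state_size, action_size, fc1_units=256, fc2_units=128):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_units)
        # Assumption: the action is concatenated after the first layer.
        self.fc2 = nn.Linear(fc1_units + action_size, fc2_units)
        self.fc3 = nn.Linear(fc2_units, 1)

    def forward(self, state, action):
        x = F.relu(self.fc1(state))
        x = torch.cat((x, action), dim=1)
        x = F.relu(self.fc2(x))
        return self.fc3(x)
```

For example, with a hypothetical state size of 33 and action size of 4, `Actor(33, 4)` maps a batch of shape `(N, 33)` to actions of shape `(N, 4)`, and `Critic(33, 4)` maps that batch plus the actions to Q-values of shape `(N, 1)`.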

#### Agent instantiation 

The agent has local actor, target actor, local critic, and target critic networks. Initially, for each critic and actor, the local and target networks are identically initialised. 

In addition, a noise process is set up so that the agent keeps exploring the action space. The noise process is reset at the beginning of each episode.

Finally, the agent is given a "memory": a buffer of previous {state, action, next_state, reward, done} tuples from which it learns.
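
The memory described above can be sketched as a fixed-size buffer with uniform random sampling. This is a minimal sketch using only the standard library, not necessarily how the project stores experiences; the names `ReplayBuffer` and `Experience` and the `seed` argument are assumptions. The default sizes mirror the BUFFER_SIZE and BATCH_SIZE values listed in the hyperparameters section.

```python
import random
from collections import deque, namedtuple

# Field order follows the tuple described in the text.
Experience = namedtuple(
    "Experience", ["state", "action", "next_state", "reward", "done"]
)


class ReplayBuffer:
    """Fixed-size store of experience tuples with uniform sampling."""

    def __init__(self, buffer_size=int(1e4), batch_size=128, seed=0):
        # A deque with maxlen silently drops the oldest entry when full.
        self.memory = deque(maxlen=buffer_size)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, next_state, reward, done):
        self.memory.append(Experience(state, action, next_state, reward, done))

    def sample(self):
        # Uniform sampling without replacement within one batch.
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)
```

A learning update would typically be skipped until `len(buffer) >= batch_size`, since `sample` cannot draw more tuples than the buffer holds.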

#### Act, step, learn

At the beginning of a new episode, the environment is reset and the agent observes the state. 

Next, the agent uses the state to 'decide' which action to take. After acting, it observes the new state of the environment and takes a step (see below).

The agent acts by feeding the state to the local actor network and then adding noise to the output.
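
A minimal sketch of this acting step, in plain Python. The report does not say which noise process is used; an Ornstein-Uhlenbeck process (a common choice for DDPG) is assumed here, with hypothetical `theta` and `sigma` values. Clipping the noisy action to [-1, 1] is also an assumption, and `policy` stands in for the local actor network.

```python
import random


class OUNoise:
    """Ornstein-Uhlenbeck process (assumed; the noise type is not stated)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = [mu] * size
        self.theta = theta
        self.sigma = sigma
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        # Called at the start of each episode: return to the mean.
        self.state = list(self.mu)

    def sample(self):
        # Drift back towards the mean, plus a Gaussian perturbation.
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


def act(policy, state, noise, low=-1.0, high=1.0):
    """Return the policy's action plus exploration noise, clipped to [low, high]."""
    action = policy(state)
    noisy = [a + n for a, n in zip(action, noise.sample())]
    return [min(high, max(low, a)) for a in noisy]
```

Because the process carries state between calls, successive noise samples are correlated, which is why it is reset at the start of each episode.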

Next, the agent 'steps': it stores the experience tuple and performs a learning update. The update trains the local networks and then soft-updates the target networks according to the hyperparameter TAU, so the target networks change more slowly than the local ones.
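
The soft update can be illustrated on plain lists of numbers. In this sketch each list stands in for the parameters of one network, and the function name is hypothetical; the real update would apply the same formula to every weight tensor.

```python
def soft_update(local_params, target_params, tau=1e-3):
    """Blend local parameters into target parameters.

    new_target = tau * local + (1 - tau) * target
    """
    return [
        tau * local + (1.0 - tau) * target
        for local, target in zip(local_params, target_params)
    ]
```

With TAU = 1e-3, each update moves a target parameter only 0.1% of the way towards its local counterpart.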

#### Hyperparameters

___BUFFER_SIZE = int(1e4)___ The buffer size is the maximum number of experience tuples stored in the agent's memory, from which it samples to learn.

___BATCH_SIZE = 128___ Batch size determines how many experience tuples are sampled from the memory buffer during a learning update.

___GAMMA = 0.99___ Gamma is the 'discount factor': it determines how much weight future rewards receive relative to immediate rewards when estimating the Q-value.

___TAU = 1e-3___ Tau determines the weight given to the local network parameters when updating the target networks. 

___LR_ACTOR = 1e-4, LR_CRITIC = 4e-4___ The learning rates of the actor and critic. I tried a range of values and found that model performance was very sensitive to both.
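
To make the role of GAMMA concrete, here is a sketch of the standard one-step target used to train a DDPG critic. The function name is hypothetical, and `next_q` stands in for the target critic's estimate of the next state-action pair.

```python
GAMMA = 0.99


def td_target(reward, next_q, done, gamma=GAMMA):
    """One-step target: reward plus the discounted next Q-value.

    The bootstrap term is dropped when the episode has ended.
    """
    return reward + gamma * next_q * (1.0 - float(done))
```

A gamma close to 1 makes the agent value long-term reward almost as much as immediate reward; a smaller gamma makes it more short-sighted.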


### Performance 

This learning algorithm solved the 20-agent environment in 122 episodes (mean score across all 20 agents, averaged over the last 100 episodes, of >= 30).

![](Rewards.png)
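
The solve criterion can be checked with a rolling window over per-episode scores. This is a sketch under assumptions: the function name is hypothetical, and it requires a full window of 100 episodes before declaring the environment solved, which may differ from how the episode count above was measured.

```python
from collections import deque


def first_solved_episode(episode_scores, target=30.0, window=100):
    """Return the 1-based episode at which the rolling mean reaches target.

    episode_scores: one list of per-agent scores for each episode.
    Returns None if the criterion is never met.
    """
    recent = deque(maxlen=window)
    for episode, agent_scores in enumerate(episode_scores, start=1):
        # Average across agents first, then across the window.
        recent.append(sum(agent_scores) / len(agent_scores))
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None
```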

### Future directions

___Single agent:___ Try to solve the task with a single agent. In theory this should be implementable with the same learning algorithm and model architectures used here, but may take longer.

___Prioritized replay:___ Try sampling memories with non-uniform probabilities to prioritize learning from rare state-action tuples.
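
One possible starting point for this idea is sketched below. It assumes each stored tuple carries a non-negative priority (commonly the magnitude of its TD error, though how to define priority is left open here) and samples indices in proportion to `priority ** alpha`. The function name and the `alpha` default are assumptions, and the importance-sampling correction that prioritized replay normally needs is omitted.

```python
import random


def sample_prioritized(priorities, batch_size, alpha=0.6, rng=random):
    """Sample buffer indices with probability proportional to priority**alpha.

    alpha = 0 gives uniform sampling; larger alpha favours high priorities.
    Sampling is with replacement, so an index may appear more than once.
    """
    weights = [p ** alpha for p in priorities]
    return rng.choices(range(len(priorities)), weights=weights, k=batch_size)
```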