# Udacity's Benchmark Implementation

## An Amended DDPG Agent
In this part of the Nanodegree program, you learned about many potential ways to solve this project. We chose to solve it by making some amendments to the Deep Deterministic Policy Gradient (DDPG) algorithm.

## Attempt 1
The first thing that we did was amend the DDPG code to work for multiple agents, to solve version 2 of the environment. The DDPG code in the DRLND GitHub repository utilizes only a single agent, and with each step:
- the agent adds its experience to the replay buffer, and
- the (local) actor and critic networks are updated, using a sample from the replay buffer.

So, in order to make the code work with 20 agents, we modified the code so that after each step:
- each agent adds its experience to a replay buffer that is shared by all agents, and
- the (local) actor and critic networks are updated 20 times in a row (once for each agent), using 20 different samples from the replay buffer.
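
The per-step logic described above can be sketched roughly as follows. This is a minimal illustration, not the DRLND code: `SharedReplayBuffer`, `step`, the tiny `BATCH_SIZE`, and the placeholder `learn` callback are all hypothetical names and values chosen for the example, and the actual network update is left out.

```python
import random
from collections import deque

NUM_AGENTS = 20   # version 2 of the environment
BATCH_SIZE = 4    # hypothetical value, kept small for the demo


class SharedReplayBuffer:
    """A single buffer that every agent writes to (hypothetical helper)."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def add(self, experience):
        self.memory.append(experience)

    def sample(self, batch_size):
        # Draw a random minibatch without replacement.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def step(buffer, experiences, learn):
    """Store one experience per agent, then run one update per agent."""
    for experience in experiences:
        buffer.add(experience)

    updates = 0
    if len(buffer) >= BATCH_SIZE:
        for _ in range(len(experiences)):
            # Each update draws its own sample from the shared buffer.
            learn(buffer.sample(BATCH_SIZE))
            updates += 1
    return updates


# Stand-in for the actor/critic update: just record the batches it receives.
batches_seen = []
buffer = SharedReplayBuffer(capacity=1000)
fake_experiences = [
    (agent_id, "state", "action", 0.0, "next_state", False)
    for agent_id in range(NUM_AGENTS)
]
updates = step(buffer, fake_experiences, batches_seen.append)
print(len(buffer), updates)  # 20 20
```

With 20 agents, every environment step adds 20 experiences and triggers 20 updates, which is where the very large update count in this attempt comes from.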

In hindsight, this wasn't a great plan, but it was a start! The scores are shown below.

<img src="img/attempt1.png" style="width:450px; height:300px;">

You'll notice that we made rapid improvement early in training, because of the extremely large number of updates. Unfortunately, also due to the large number of updates, the agent was highly unstable. Around episode 100, performance crashed and did not recover.

So, we focused on determining ways to stabilize this first attempt.

## Attempt 2

For this second attempt, we reduced the number of agents from 20 to 1 (by switching to version 1 of the environment). We wanted to know how much stability we could expect from a single agent. The idea was that the code would likely train more reliably if we made fewer updates. And it did train much better.

<img src="img/attempt2.png" style="width:450px; height:300px;">

At one point, we even hit the target score of 30. However, this score wasn't maintained for very long, and we saw strong indications that the algorithm was going to crash again. This showed us that we needed to spend more time figuring out how to stabilize the algorithm, if we wanted to have a chance of training all 20 agents simultaneously.
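
One way to tell a brief peak from a maintained score is a rolling average. The sketch below assumes the criterion is an average score of at least 30 over 100 consecutive episodes (the window mentioned under Attempt 3). The function name `first_solved_episode` and both score histories are made up for illustration; they are not results from our runs.

```python
from collections import deque

TARGET_SCORE = 30.0
WINDOW = 100


def first_solved_episode(scores, target=TARGET_SCORE, window=WINDOW):
    """Return the 1-based episode at which the mean of the last `window`
    scores first reaches `target`, or None if that never happens."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# Made-up histories: a short-lived peak versus a sustained high score.
brief_peak = [10.0] * 150 + [31.0] * 20 + [5.0] * 100
sustained = [10.0] * 50 + [35.0] * 150

print(first_solved_episode(brief_peak))  # None
print(first_solved_episode(sustained))   # 130
```

Hitting 30 in a handful of episodes barely moves a 100-episode average, which is why a short peak like the one in this attempt doesn't count as solving the environment.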

## Attempt 3

This time, we switched back to version 2 of the environment, and began with the code from **Attempt 1** as a starting point. Then, the only change we made was to use gradient clipping when training the critic network. The corresponding snippet of code was as follows:

> self.critic_optimizer.zero_grad()  
critic_loss.backward()  
torch.nn.utils.clip_grad_norm_(self.critic_local.parameters(), 1)  
self.critic_optimizer.step()  

The corresponding scores are plotted below.

<img src="img/attempt3.png" style="width:450px; height:300px;">

This is when we really started to feel hopeful. We still didn't maintain an average score of 30 over 100 episodes, but we maintained the score for longer than before. And the agent didn't crash as suddenly as in the previous attempts!
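
To build some intuition for what the clipping step does, here is a minimal pure-Python sketch of clipping by total norm. It only mirrors the idea behind PyTorch's gradient-norm clipping on plain lists; `clip_by_total_norm` is a hypothetical helper, and the small `eps` in the denominator is an assumption added to avoid division by zero.

```python
import math


def clip_by_total_norm(grads, max_norm, eps=1e-6):
    """Rescale every gradient by one shared factor so that the combined
    L2 norm is at most `max_norm`. Returns (clipped_grads, original_norm)."""
    total_norm = math.sqrt(sum(g * g for group in grads for g in group))
    # The factor is capped at 1.0, so small gradients are left untouched.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in group] for group in grads]
    return clipped, total_norm


# Two hypothetical parameter groups whose combined gradient norm is 5.
large = [[3.0, 4.0], [0.0]]
clipped, norm = clip_by_total_norm(large, max_norm=1.0)
new_norm = math.sqrt(sum(g * g for group in clipped for g in group))
print(norm, new_norm)  # original norm is 5.0, clipped norm is just under 1.0

# A gradient that is already small passes through unchanged.
small = [[0.3, 0.4]]
unchanged, _ = clip_by_total_norm(small, max_norm=1.0)
```

The direction of the gradient is preserved and only its length is capped, so a single unusually large critic loss can't throw the weights far off in one update.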

## Attempt 4

At this point, we decided to be less aggressive with the number of updates per timestep. In particular, instead of updating the actor and critic networks **20 times** at every **timestep**, we amended the code to update the networks **10 times** after every **20 timesteps**. The corresponding scores are plotted below.

<img src="img/attempt4.png" style="width:450px; height:300px;">

And this was enough to solve the environment! In hindsight, we probably should have found this fix much earlier, but the long path to the solution was definitely a nice way to build intuition! :)
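
To make the difference between the two update schedules concrete, the sketch below counts updates under each one. The names `UPDATE_EVERY`, `NUM_UPDATES`, and `count_updates` are hypothetical, the `learn` callback stands in for the actual actor/critic update, and 1000 timesteps is an arbitrary figure picked for the comparison.

```python
UPDATE_EVERY = 20   # run updates only on every 20th timestep
NUM_UPDATES = 10    # number of updates in each such burst


def count_updates(total_timesteps, update_every=UPDATE_EVERY,
                  num_updates=NUM_UPDATES, learn=None):
    """Walk through the timesteps and count how many updates are run."""
    updates = 0
    for t in range(1, total_timesteps + 1):
        if t % update_every == 0:
            for _ in range(num_updates):
                if learn is not None:
                    learn()  # placeholder for one actor/critic update
                updates += 1
    return updates


# Attempt 4 schedule: 10 updates after every 20 timesteps.
print(count_updates(1000))  # 500

# Attempt 1 schedule: 20 updates at every timestep.
print(count_updates(1000, update_every=1, num_updates=20))  # 20000
```

Over the same number of timesteps, the new schedule runs 40 times fewer updates than the original one.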

<br>