## Report
---
This page describes my design choices and details the results. It includes:
- Description and justification of the **model architectures**
- Description and justification of the **hyperparameters**
- Plot of **rewards**
- Ideas for **future work**

>Note 1: I spent around **40 hours** working on the project (not counting training time, lectures and exercises)

>Note 2: My main take-away concerns the interface with the environment. In particular, it is essential to investigate and understand the structure of the observations and actions exchanged with the environment. I present some findings about it in my [`README.md`](https://github.com/chauvinSimon/deep-reinforcement-learning/blob/master/p3_collab-compet/README.md)


### Description of the model architectures 
My repository is structured as follows:
- [`main_collab_compet.ipynb`](https://github.com/chauvinSimon/deep-reinforcement-learning/blob/master/p3_collab-compet/src_submission/main_collab_compet.ipynb) is **the central file you want to use**. It contains
    - all the import statements and instructions to start the environment
    - calls to `train`
    - calls to `test`
- [`ddpg_agent.py`](https://github.com/chauvinSimon/deep-reinforcement-learning/blob/master/p3_collab-compet/src_submission/ddpg_agent.py) defines three classes
    - `Agent` with methods such as `step`, `act`, `learn` 
    - `ReplayBuffer` to store experience tuples 
    - `Ornstein-Uhlenbeck Noise` process, used when calling `agent.act()` to help convergence of the Actor
- [`model.py`](https://github.com/chauvinSimon/deep-reinforcement-learning/blob/master/p3_collab-compet/src_submission/model.py) defines the Actor and Critic Networks used by the Agent
- [`checkpoint_critic12success.pth`](https://github.com/chauvinSimon/deep-reinforcement-learning/blob/master/p3_collab-compet/src_submission/checkpoint_critic12success.pth) and [`checkpoint_actor12success.pth`](https://github.com/chauvinSimon/deep-reinforcement-learning/blob/master/p3_collab-compet/src_submission/checkpoint_actor12success.pth) are the saved model weights of one of my successful agents.

I did not start from scratch. Instead, I started from the **DDPG** example of the lectures and adapted it to work with **2 agents**. In particular, I modified the structure so that:
- after each step, **each agent adds its experience to a replay buffer** that is **shared** by all agents
- the (local) actor and critic networks are **updated 3 times, every 2 steps**.


This is done with:

    if len(self.memory) > BATCH_SIZE and self.step_counter % (2*NUM_AGENTS) == 0:
        for _ in range(3):
            experiences = self.memory.sample()
            self.learn(experiences, GAMMA)
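
To make this schedule concrete, here is a minimal, self-contained sketch. It assumes that `step_counter` is incremented once per stored experience (i.e. twice per environment step with 2 agents) and that `NUM_AGENTS = 2`. Both points are assumptions made for illustration, and the helper `learn_calls` is hypothetical, not part of `ddpg_agent.py`.

```python
NUM_AGENTS = 2            # assumed: the two players of the environment
UPDATES_PER_TRIGGER = 3   # "updated 3 times" each time the condition holds


def learn_calls(env_steps, buffer_ready=True):
    """Count how many learning passes the modulo schedule would trigger."""
    step_counter = 0
    calls = 0
    for _ in range(env_steps):
        for _ in range(NUM_AGENTS):      # each agent stores one experience
            step_counter += 1            # assumed: one increment per stored experience
            if buffer_ready and step_counter % (2 * NUM_AGENTS) == 0:
                calls += UPDATES_PER_TRIGGER
    return calls


# Under these assumptions, learning is triggered once every 2 environment steps.
print(learn_calls(env_steps=10))
```

If the counter were instead incremented once per environment step, the same condition would trigger every `2 * NUM_AGENTS` environment steps, so the exact frequency depends on where the increment happens.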

As detailed in this lecture on [Policy Gradients](https://www.youtube.com/watch?v=XGmd3wcyDg8&index=11), policy gradient methods differ from supervised learning in many respects. In particular, I prefer a **large batch size** (256, instead of 64 in the original DDPG paper), i.e. working with **more samples** per update, in order to mitigate the **noisy-gradient issue**.


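Furthermore, I copy the local weights into the target networks right after their **instantiation**, using an almost **hard copy** (tau=0.99; an exact hard copy would correspond to tau=1). Initializing the target networks from the local ones follows the [original DDPG paper](https://arxiv.org/pdf/1509.02971.pdf):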

    self.soft_update(self.critic_local, self.critic_target, tau=0.99)
    self.soft_update(self.actor_local, self.actor_target, tau=0.99)
 
In addition, I follow the advice in the project instructions and use **gradient clipping** when training the Critic network. The corresponding snippet of code is as follows:

    self.critic_optimizer.zero_grad()
    critic_loss.backward()
    # clip the global gradient norm to 1 (in-place variant; `clip_grad_norm` is deprecated)
    torch.nn.utils.clip_grad_norm_(self.critic_local.parameters(), 1)
    self.critic_optimizer.step()
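
For intuition, clipping by norm rescales all gradients by a common factor whenever their global L2 norm exceeds the threshold. The plain-Python sketch below mimics that rule on a flat list of numbers. It only illustrates the idea and is not PyTorch's implementation; `clip_by_global_norm` is a hypothetical helper name.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale `grads` so that their L2 norm does not exceed `max_norm`.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm      # small gradients are left untouched
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


# A gradient of norm 5 is shrunk to norm 1; its direction is preserved.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

The direction of the update is unchanged, only its magnitude is bounded, which limits the effect of occasional very large critic errors.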

Since the state space is relatively small (compared, for example, to 84 × 84 RGB images), there is no need for convolutional layers. A few **fully-connected layers** give good results.

To **improve the stability** of the DDPG algorithm, I use **target networks** for both the critic and the policy. Their parameters slowly track those of the local networks through a **soft update**:

- `θ_target = τ*θ_local + (1 - τ)*θ_target`

This is done in the static method `soft_update(local_model, target_model, tau)` where the models to update are passed by reference.
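
As a small numeric illustration of this rule, the sketch below applies it to plain lists of floats instead of network parameters. `soft_update_lists` is a hypothetical stand-in that returns a new list, whereas the `soft_update` method mentioned above modifies the target model in place.

```python
def soft_update_lists(local_params, target_params, tau):
    """Blend target parameters toward the local ones:
    θ_target = τ*θ_local + (1 - τ)*θ_target
    """
    return [tau * loc + (1.0 - tau) * tgt
            for loc, tgt in zip(local_params, target_params)]


local = [1.0, -2.0]
target = [0.0, 0.0]

# tau close to 1: the target almost copies the local parameters.
near_hard = soft_update_lists(local, target, tau=0.99)

# small tau: the target moves only a tiny step toward the local parameters.
slow = soft_update_lists(local, target, tau=1e-3)

print(near_hard, slow)
```

With a small `tau`, the targets used in the critic loss change slowly, which is what stabilizes learning.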

To increase stability, and as recommended in the [original DDPG paper](https://arxiv.org/pdf/1509.02971.pdf), I also added **batch normalization** for the Critic network. This is meant to **minimize covariate shift** during training by ensuring that each layer receives whitened inputs. It also turns out to be useful when trying to **generalize over different environments**.

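Finally, I decided to decay the noise that is added to each action to provide exploration.
As for the exploration rate in DQN, I opted for an exponential schedule. I chose `noise_reduction = 0.9977 ≈ exp(ln(0.01)/2000)` so that the noise scale would reach 0.01 at episode 2000. In practice, the floor `mini_noise = 0.02` used below is hit first, at around episode 1700.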

    noise = 1.0               # initial value
    mini_noise = 0.02         # end value
    noise_reduction = 0.9977
    ...
    noise *= noise_reduction  # decay at each episode
    ...
    actions = agent.act(states, noise=max(noise, mini_noise))
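
The arithmetic behind this schedule can be checked with a few lines of plain Python. The variable names `target_noise`, `target_episode` and `floor_episode`, as well as the helper `noise_scale`, are introduced here for illustration only.

```python
import math

initial_noise = 1.0
mini_noise = 0.02                          # floor applied with max(noise, mini_noise)
target_noise, target_episode = 0.01, 2000  # design goal used to derive the decay rate

# Per-episode factor such that initial_noise * factor**2000 == 0.01
noise_reduction = math.exp(math.log(target_noise / initial_noise) / target_episode)


def noise_scale(episode):
    """Noise scale after `episode` decays, with the floor applied."""
    return max(initial_noise * noise_reduction ** episode, mini_noise)


# First episode at which the decayed value drops below the floor
floor_episode = math.ceil(math.log(mini_noise / initial_noise) / math.log(noise_reduction))

print(round(noise_reduction, 4), floor_episode, noise_scale(500))
```

Because of the floor, the last few hundred episodes of a 2000-episode run all use the same constant noise scale.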

The Critic network is structured as follows:

- State (space size = 26 = 24 + 2)
- Batch Normalization
- Fully-connected layer with *128* outputs
- *ReLU* activation function
- Batch Normalization
- Concatenation with the two actions
- Fully-connected layer with *64* outputs
- *ReLU* activation function
- Batch Normalization
- Fully-connected layer with *1* output (= value of taking these actions in that state)

The Actor network is structured as follows:

- State (space size = 24)
- Fully-connected layer with *128* outputs
- *ReLU* activation function
- Batch Normalization
- Fully-connected layer with *64* outputs
- *ReLU* activation function
- Batch Normalization
- Fully-connected layer with *2* outputs
- Batch Normalization
- *tanh* activation function, to output the two (= action space size) action values in `[-1, 1]`
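
The two layer lists above can be transcribed into PyTorch roughly as follows. This is a sketch reconstructed from the lists only, not the content of `model.py`: the attribute names, the default sizes and the absence of any custom weight initialization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """State -> FC128 -> ReLU -> BN -> FC64 -> ReLU -> BN -> FC2 -> BN -> tanh."""

    def __init__(self, state_size=24, action_size=2):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 128)
        self.bn1 = nn.BatchNorm1d(128)
        self.fc2 = nn.Linear(128, 64)
        self.bn2 = nn.BatchNorm1d(64)
        self.fc3 = nn.Linear(64, action_size)
        self.bn3 = nn.BatchNorm1d(action_size)

    def forward(self, state):
        x = self.bn1(F.relu(self.fc1(state)))
        x = self.bn2(F.relu(self.fc2(x)))
        return torch.tanh(self.bn3(self.fc3(x)))   # values in [-1, 1]


class Critic(nn.Module):
    """State -> BN -> FC128 -> ReLU -> BN -> concat(actions) -> FC64 -> ReLU -> BN -> FC1."""

    def __init__(self, state_size=26, action_size=2):
        super().__init__()
        self.bn0 = nn.BatchNorm1d(state_size)
        self.fc1 = nn.Linear(state_size, 128)
        self.bn1 = nn.BatchNorm1d(128)
        self.fc2 = nn.Linear(128 + action_size, 64)  # actions enter at the second layer
        self.bn2 = nn.BatchNorm1d(64)
        self.fc3 = nn.Linear(64, 1)

    def forward(self, state, action):
        x = self.bn1(F.relu(self.fc1(self.bn0(state))))
        x = torch.cat((x, action), dim=1)
        x = self.bn2(F.relu(self.fc2(x)))
        return self.fc3(x)                           # one Q-value per sample


# Shape check on a random batch (batch norm in training mode needs more than one sample)
actor, critic = Actor(), Critic()
actions = actor(torch.randn(8, 24))
q_values = critic(torch.randn(8, 26), actions)
print(actions.shape, q_values.shape)
```

Injecting the actions after the first hidden layer, rather than at the input, follows the layer order given in the Critic list.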

The **Replay Memory** is based on *uniform sampling*. It lets the Critic network be trained off-policy on samples drawn from a replay buffer, which reduces the correlations between samples. Together with the **target networks**, the **replay buffer** is one of the ideas **taken from the successful DQN method**, as detailed [here](https://arxiv.org/pdf/1509.02971.pdf) by the DDPG authors.
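
A minimal sketch of such a shared, uniformly sampled buffer is given below, using only the standard library. The class name `SharedReplayBuffer` and its fields are hypothetical and do not reproduce the `ReplayBuffer` class of `ddpg_agent.py`.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])


class SharedReplayBuffer:
    """Fixed-size buffer that all agents write to and sample from uniformly."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)   # oldest experiences are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.memory.append(Experience(state, action, reward, next_state, done))

    def sample(self):
        # Uniform sampling without replacement within one batch
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


buffer = SharedReplayBuffer(buffer_size=100, batch_size=4)
for step in range(10):
    for agent_id in range(2):                     # both agents feed the same buffer
        buffer.add(state=(step, agent_id), action=0.0, reward=0.0,
                   next_state=(step + 1, agent_id), done=False)

batch = buffer.sample()
print(len(buffer), [exp.state for exp in batch])
```

Because both agents push into the same memory, a sampled batch usually mixes experiences from the two players and from different time steps.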

Another improvement over the [original DPG](http://proceedings.mlr.press/v32/silver14.pdf) is the **OU-Noise**, which aims at constructing an **exploration policy µ** by adding noise sampled from a noise process N to the **actor policy**. In other words, it generates temporally correlated noise, which improves **exploration efficiency** in environments with inertia. I discuss the Ornstein-Uhlenbeck process further below.
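
A discretized Ornstein-Uhlenbeck process can be sketched in a few lines. The defaults `theta=0.15` and `sigma=0.2` are the values reported in the DDPG paper and are used here as assumptions; they are not necessarily the ones in `ddpg_agent.py`, and the class name `OUNoiseSketch` is hypothetical.

```python
import random


class OUNoiseSketch:
    """Discretized Ornstein-Uhlenbeck process: x <- x + theta*(mu - x) + sigma*N(0, 1)."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = [mu] * size
        self.theta = theta          # strength of the pull back toward the mean
        self.sigma = sigma          # scale of the random perturbation
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean (typically at the start of an episode)."""
        self.state = list(self.mu)

    def sample(self):
        """Advance the process by one step and return the new noise vector."""
        self.state = [
            x + self.theta * (m - x) + self.sigma * self.rng.gauss(0.0, 1.0)
            for x, m in zip(self.state, self.mu)
        ]
        return list(self.state)


noise_process = OUNoiseSketch(size=2)
samples = [noise_process.sample() for _ in range(5)]
print(samples)
```

Each sample depends on the previous one, so consecutive perturbations tend to push the action in the same direction for a while instead of cancelling out, unlike independent Gaussian noise.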

### Description of the hyperparameters


    BUFFER_SIZE = int(5e5)  # replay buffer size (int(1e6) in paper)
    BATCH_SIZE = 512        # mini-batch size (64 in paper)
    GAMMA = 0.99            # discount factor (0.99 in paper)
    TAU = 1e-3              # for soft update of target parameters (1e-3 in paper)
    LR_ACTOR = 1e-3         # learning rate of the actor (1e-4 in paper)
    LR_CRITIC = 3e-3        # learning rate of the critic (1e-3 in paper)
    WEIGHT_DECAY = 0        # L2 weight decay (1e-2 in paper)
    NUM_AGENTS = 20
    # OPTIMIZER = Adam      # (as in paper)
    # ACTIVATION = ReLU     # for hidden layers (as in paper)

#### Impact of seed
The `seed` is a parameter used to **initialize the pseudorandom number generators**.

This is one **important hyperparameter** I particularly played with.

To better understand how this can impact the performance in training, I tried **7 different seeds**. 

> Note: Between each trial, I ensured that my agent is properly reset (so that learning starts from scratch each time)

The **discrepancy in outcomes** is **significant**, as shown in the figure below.

![Returns for 7 different seeds - Figure](report_submission/impact_of_seed.png)

Among the 7 seeds, 4 (57%) solved the environment in fewer than 5000 episodes.

`successful_seeds = [13, 45, 55, 67]`

![Average Returns for 7 different seeds - Figure](report_submission/impact_of_seed_avg.png)

Conclusions
> Conclusion1: the **seed** has a large impact on performance. Hence it is important to **document the one used in each report**. Moreover, I found it essential to try different seeds on the same code.

> Conclusion2: even with a fixed seed, model and hyperparameters, results are **not reproducible**. The only "non-controllable element" I can think of for now is the *environment* itself.
> Is there any way to *fix a seed for it* as well?

Decisions
> Decision1: as in the assessment and hyperparameter tuning of [this benchmark](https://arxiv.org/abs/1604.06778), I decided to run the algorithm each time under **five random seeds**:

    import random

    seeds = [random.randint(1, 100) for _ in range(5)]
    for seed in seeds:
        # start from a freshly initialized agent for every seed
        agent = Agent(state_size=state_size, action_size=action_size, random_seed=seed)
        scores = ddpg(name=seed)

After looking at the parameter used to initialize the pseudorandom number generators, I investigated the **impact** of **additional design choices and hyperparameters**:

#### Monitoring tools
Since some of my very first trials performed poorly (they never reached a score of 0.1), I decided to implement a couple of **monitoring tools**:

- -1- Obviously, **looking at the environment** helps to detect unwanted behaviours. While this is not possible when running the code on a server, it makes it very easy to **spot dysfunctions early**. One example of failure was that the agents rushed together to the net, and then waited.

- -2- Furthermore, I realized that the **distribution of actions** is worth monitoring.
In particular, it is important for the **2 actions to avoid saturation**. In other words, their values should not converge to -1 or 1. I noticed that the actions of agents that failed usually stayed at these extrema:

    Episode 1280    Average Score: 0.10
    actions in batch at 52000-th learning:
         shape = (512, 2),
         mean = [0.11419048 0.11018715],
          std = [0.75708777 0.8016415 ]

![Returns during training - Figure](report_submission/action-distribution.png)

Observations:

- It can be seen that the **successful agents** keep, on average, their actions **far from the extreme values**.

- In addition, the **sampled batch** keeps **diversity in actions**, as depicted by the pretty large **standard deviation**.
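
The kind of monitoring described above can be sketched with the standard library alone. The helper `action_stats` and the `saturation_threshold` of 0.95 are assumptions made for illustration; the original logs only report the shape, mean and standard deviation.

```python
import statistics


def action_stats(batch_actions, saturation_threshold=0.95):
    """Per-dimension mean, std and fraction of near-saturated actions.

    `batch_actions` is a list of action tuples, one tuple per sampled experience.
    """
    stats = []
    for dim_values in zip(*batch_actions):           # iterate over action dimensions
        saturated = sum(abs(a) >= saturation_threshold for a in dim_values)
        stats.append({
            "mean": statistics.fmean(dim_values),
            "std": statistics.pstdev(dim_values),    # population std, like numpy's default
            "saturated_fraction": saturated / len(dim_values),
        })
    return stats


# Made-up batch: the first action dimension is mostly stuck near -1 or 1.
batch_actions = [(-1.0, 0.2), (1.0, -0.4), (0.98, 0.1), (0.1, 0.5)]
monitor = action_stats(batch_actions)
for dim, values in enumerate(monitor):
    print(dim, values)
```

A high saturated fraction combined with a mean close to -1 or 1 would be the warning sign discussed above.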

Based on these observations, I better understand the role of the **Ornstein-Uhlenbeck Noise**:

- Among other things, it allows for **escaping from situations** where the Actor gets its **actions saturated** (i.e. stuck at `1` or `-1`), by **constructing an exploration policy µ** that adds noise sampled from the OU-noise process to the actor policy.

- It uses **action space noise** to **change the likelihoods** associated with each action the agent might take from one moment to the next.

### Plot of Rewards

Details about **score calculation**:

- After each episode, we add up the rewards that each agent received (without discounting) to get a score for each agent.

- This yields 2 (potentially different) scores.

- We then take the **maximum of these 2 scores**.

- This yields a **single score** for **each episode**.

- The environment is considered solved when the **average (over 100 episodes)** of those scores is at least **+0.5**.
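
The score computation listed above can be written down as a short sketch. The helper names and the reward values are made up for illustration, and requiring a full 100-episode window before declaring success is an assumption.

```python
from collections import deque


def episode_score(rewards_per_step):
    """Sum each agent's undiscounted rewards, then keep the maximum over agents."""
    totals = [sum(agent_rewards) for agent_rewards in zip(*rewards_per_step)]
    return max(totals)


def first_solved_episode(scores, window=100, target=0.5):
    """Return the first episode whose trailing average reaches `target`, else None."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        # assumed: the window must be full before the criterion is checked
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# One short episode: each tuple holds the rewards of (agent 0, agent 1) at one step.
rewards = [(0.0, 0.0), (0.1, 0.0), (0.0, -0.01), (0.1, 0.0)]
print(episode_score(rewards))

# Made-up score history: 50 failed episodes followed by 100 episodes scoring 1.0.
history = [0.0] * 50 + [1.0] * 100
print(first_solved_episode(history))
```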

At the end, a plot shows the evolution of this average score:

![Returns during training  - Figure](report_submission/success-raw-scores.png)

![Returns during training  - Figure](report_submission/success-avg.png)

### Ideas for Future Work - MADDPG

Before going for this DDPG-based approach, I first tried to solve this environment with the MADDPG algorithm.

[MADDPG](http://arxiv.org/abs/1706.02275) is an **extension of DDPG** with an actor-critic architecture where the **critic** is **augmented with other agents’ actions**, while the actor only has **local information**.

In other words, this structure turns the method into a **centralized training with decentralized execution**.
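
The difference between the two kinds of inputs can be illustrated on plain lists. The sizes (24 observations and 2 actions per agent) come from this report; the helper names are hypothetical, and feeding the critic with the observations of all agents is the simplest variant described in the MADDPG paper, assumed here.

```python
def actor_input(observations, agent_id):
    """Decentralized execution: an actor only sees its own observation."""
    return list(observations[agent_id])


def centralized_critic_input(observations, actions):
    """Centralized training: the critic sees every agent's observation and action."""
    flat = []
    for obs in observations:
        flat.extend(obs)
    for act in actions:
        flat.extend(act)
    return flat


observations = [[0.0] * 24, [1.0] * 24]    # one observation vector per agent
actions = [[0.5, -0.5], [0.1, 0.9]]        # one action vector per agent

print(len(actor_input(observations, agent_id=0)))
print(len(centralized_critic_input(observations, actions)))
```

At test time only the actors are needed, so each agent can act from its local observation alone.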

I have also uploaded my [MADDPG code](https://github.com/chauvinSimon/deep-reinforcement-learning/blob/master/p3_collab-compet/src_draft_maddpg). It has been adapted from the exercise in the multi-agent lecture.

For some reason, I cannot get any positive score and the losses of my actors keep diverging, as shown below:

![maddpg_avg-score - Figure](report_submission/maddpg_avg-score.PNG)

![maddpg_losses_and_rewards - Figure](report_submission/maddpg_losses_and_rewards.PNG)

In short, my to-do list contains four topics:

- I definitely plan to **investigate the issue illustrated above**. It would be nice to **compare** the present DDPG solution with a working MADDPG, in terms of *performance*, *time* and *stability*.

- As mentioned, the **stability** of my current DDPG approach is relatively poor. The list of options I elaborated in [my report for project-2](https://github.com/chauvinSimon/deep-reinforcement-learning/blob/master/p2_continuous-control/report.ipynb) still holds.

- Another thing I would like to test is to **pre-train the model**, based on **Imitation Learning**. I have recently learnt about it in the [CS294-DRL lecture](http://rail.eecs.berkeley.edu/deeprlcourse/). In the same vein, applying **Model-based RL** to this environment is something I would like to explore in the future.

- Last but not least, I relied during this project on this very recent paper:
[Is multiagent deep reinforcement learning the answer or the question? A brief survey](http://arxiv.org/abs/1810.05587). I would like to investigate some of the techniques it mentions, especially those related to *Experience replay buffer*, *Parameter sharing* and *Ensemble policies*.