## Collaboration & Competition (Tennis) Report

### Background

One difficult but interesting problem in artificial intelligence is designing agents that achieve their goals as efficiently as possible. In a previous exercise, where we trained robotic arms to follow a target, we used Deep Deterministic Policy Gradient (DDPG), an actor-critic algorithm that learns a Q-function and a policy concurrently. Depending on the environment, agents may be able to interact with each other as well as with the environment, which gives them the opportunity to learn from one another. As we have seen with AlphaZero, this approach not only made training more efficient, but also raised the algo's performance to the point of beating masters at the game of Go.

In this exercise, I have been provided with a tennis environment, where agents control rackets that hit a ball back and forth. Agents are rewarded and penalized in a way that maximizes play time per episode. My goal is to build on the DDPG algo that I developed in the robotic arm exercise, and to train agents that reach an average score of more than 0.5 over 100 consecutive episodes.

### Learning Algorithm

#### Implement Learning Algorithm

Some key factors to consider before algo selection:

- Multiple agents: The Tennis environment has 2 different agents.
- Continuous action space: The action space is now continuous, which allows each agent to execute more complex and precise movements. Even though each tennis agent can only move forward, move backward, or jump, the action values that control these movements are continuous, so there are infinitely many possible values.

We need an algorithm that allows our agents to utilize their full range and power of movement. Policy-based methods seem to be most suitable.

 
To understand Multi-Agent Deep Deterministic Policy Gradient (MADDPG), we can walk through the following components:

1. Actor-Critic Method

Actor-critic methods leverage the strengths of both policy-based and value-based methods.

With a policy-based approach, the agent (actor) learns how to act by directly estimating the optimal policy and maximizing reward through gradient ascent. With a value-based approach, the agent (critic) learns how to estimate the value of different state-action tuples. Actor-critic methods combine these two approaches to accelerate the learning process. Actor-critic agents are more stable than value-based agents, while requiring fewer training samples than policy-based agents.

What makes this implementation unique is that the actors leverage a centralized critic. Whereas traditional actor-critic methods have a critic for each agent, this approach uses a single critic that receives observations from all agents. This extra information makes training easier and allows for centralized ("server") training with decentralized ("client") execution. Each agent still takes actions based on its own unique observations of the environment.



2. Exploration vs Exploitation

The idea of exploration is to encourage the agent to try actions whose outcomes are not yet known. Because the action space is continuous, I chose Ornstein-Uhlenbeck (OU) noise. Since each OU noise sample is correlated with the previous one, the noise tends to stay in the same direction for longer durations without canceling itself out. This property allows the agent to maintain velocity and explore the action space with more continuity. The OU process adds a certain amount of noise to the action values at each timestep.

The OU process has several hyperparameters that determine the noise characteristics:

- mu: the long-running mean
- theta: the speed of mean reversion
- sigma: the volatility parameter
- eps_start: initial value for epsilon in noise decay process in Agent.act()
- eps_ep_end: episode to end the noise decay process
- eps_final: final value for epsilon after decay

Notice also there's an epsilon parameter used to decay the noise level over time. This decay mechanism ensures that more noise is introduced earlier in the training process and the noise decreases over time as the agent gains more experience. 

Note: By boosting the noise output from the OU process early on, the algo encouraged aggressive exploration of the action space and improved the chances that some signals would be detected. This extra signal seemed to improve learning later in training once the noise decayed to zero.
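
The noise process and its decay can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the class and function names, the default values for theta, sigma, eps_start, and eps_ep_end, and the linear shape of the decay are all assumptions made for the example.

```python
import numpy as np


class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated noise."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = mu * np.ones(size)      # long-running mean
        self.theta = theta                # speed of mean reversion
        self.sigma = sigma                # volatility
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Restart the process at the long-running mean."""
        self.state = self.mu.copy()

    def sample(self):
        """Pull the state back toward mu, then add a random shock."""
        shock = self.sigma * self.rng.standard_normal(self.state.shape)
        self.state = self.state + self.theta * (self.mu - self.state) + shock
        return self.state


def epsilon_at(episode, eps_start=5.0, eps_ep_end=300, eps_final=0.0):
    """Assumed linear decay from eps_start to eps_final, reached at eps_ep_end."""
    frac = min(episode / eps_ep_end, 1.0)
    return eps_start + frac * (eps_final - eps_start)


noise = OUNoise(size=2)
action = np.array([0.3, -0.8])            # hypothetical actor output
eps = epsilon_at(episode=150)             # halfway through the decay
noisy_action = np.clip(action + eps * noise.sample(), -1.0, 1.0)
```

Because each sample starts from the previous state, consecutive noise values drift rather than jump, which is the correlation property described above. Scaling by epsilon before clipping keeps the final action inside the valid range.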

3. Learning Interval

Performing multiple learning passes per episode yielded faster convergence and higher scores. At each learning step, the algorithm samples experiences from the buffer and runs the Agent.learn() method several times.

- learn_every: learning is performed once every learn_every episodes. A value of 1 means learning every episode
- learn_num: number of NN passes per learning step
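
How the two parameters interact can be shown with a small scheduling sketch. The function run_schedule and the learn_fn callback are hypothetical names used only for illustration; in the real training loop the callback would be Agent.learn() applied to a sampled batch.

```python
def run_schedule(n_episodes, learn_every, learn_num, learn_fn):
    """Call learn_fn learn_num times on every learn_every-th episode."""
    for episode in range(1, n_episodes + 1):
        # ... environment interaction and buffer filling would happen here ...
        if episode % learn_every == 0:
            for _ in range(learn_num):
                learn_fn(episode)


calls = []
run_schedule(n_episodes=6, learn_every=2, learn_num=3, learn_fn=calls.append)
print(calls)  # [2, 2, 2, 4, 4, 4, 6, 6, 6]
```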

4. Gradient Clipping

It seems that my agent suffered from "catastrophic chronic amnesia". The algo would pick up some learning, and just when the agent seemed to be on the right track, it would completely fail to score. As a result, the moving average of the score took a nosedive and never recovered.

I suspect that one of the causes was outsized gradients. I implemented gradient clipping using the torch.nn.utils.clip_grad_norm_ function. I set the function to "clip" the norm of the gradients at 1, thereby placing an upper limit on the size of the parameter updates, and preventing them from growing exponentially. After implementation, the scoring trend became more stable and my agent seemed to be able to learn continuously.
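
A minimal sketch of where the clipping call sits in an update step is shown below. The tiny linear network, the random data, and the learning rate are stand-ins chosen for the example, not the project's actual model; only torch.nn.utils.clip_grad_norm_ and the clipping threshold of 1 come from the description above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(4, 1)                          # stand-in for one of the agent's networks
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

states = torch.randn(8, 4) * 100.0             # large inputs to provoke large gradients
targets = torch.randn(8, 1)

loss = nn.functional.mse_loss(net(states), targets)
optimizer.zero_grad()
loss.backward()
# Rescale all gradients together so their combined L2 norm is at most 1.
# The call returns the norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=1.0)
optimizer.step()

# Recompute the combined norm of the (now clipped) gradients.
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in net.parameters()))
```

The clip has to happen after loss.backward() and before optimizer.step(), since it modifies the stored gradients in place.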


5. Experience Replay

Experience replay allows the agent to learn from past experience. The difference from the previous DDPG implementation is that experiences from both agents are stored in a single replay buffer as each agent interacts with the environment. These experiences are then used by the central critic, thereby allowing both agents to learn from each other's experiences.

The replay buffer contains a collection of experience tuples with the state, action, reward, and next state (s, a, r, s'). The critic samples from this buffer as part of the learning step. Experiences are sampled randomly, so that the data is uncorrelated. This prevents action values from oscillating or diverging catastrophically.

Also, experience replay improves learning through repetition. By doing multiple passes over the data, our agents have multiple opportunities to learn from a single experience tuple. This is particularly useful for state-action pairs that occur infrequently within the environment.
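
A shared buffer of this kind might look like the sketch below. The class name, method names, and the toy integer states are assumptions for illustration; the tuple fields follow the (s, a, r, s') layout described above.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])


class ReplayBuffer:
    """Fixed-size buffer shared by both agents."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)   # oldest experiences are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.memory.append(Experience(state, action, reward, next_state))

    def sample(self):
        """Uniform random minibatch, which breaks temporal correlation."""
        return self.rng.sample(list(self.memory), k=self.batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(buffer_size=5, batch_size=3)
for t in range(8):                                # toy data: states are just integers
    buffer.add(state=t, action=0.0, reward=0.1, next_state=t + 1)
batch = buffer.sample()
```

With a capacity of 5 and 8 additions, only the 5 most recent experiences remain, and each sampled batch draws 3 of them without replacement.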

#### Algorithm Architecture
Here's the framework of the code:

1. Prepare the Unity environment and import the necessary packages
2. Check the Unity environment
3. Define functions to instantiate and train multiple DDPG agents
4. Train agents using MADDPG framework
5. Present results

The algo uses a pair of neural networks, an actor and a critic. The actor NN uses the following flow:

- Input nodes = 24 * 2 = 48 (length of available states * 2)
- Fully Connected Layer (256 nodes, ReLU activation)
- Fully Connected Layer (128 nodes, ReLU activation)
- Output nodes (2 nodes, which is the length of available actions, tanh activation)

The critic NN uses the following flow:

- Input nodes = 24 * 2 = 48 (length of available states * 2)
- Fully Connected Layer (256 + 2 * 2 = 260 nodes: 256 units plus the 2 action values of each of the 2 agents, ReLU activation)
- Fully Connected Layer (128 nodes, ReLU activation)
- Output node (1 node, no activation)
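
One way to read the two layer lists in PyTorch is sketched below. The layer sizes come from the lists above, but several details are assumptions: that the factor of 2 on the input is the number of agents, that the critic concatenates the 4 action values with the 256 outputs of its first layer (which gives the 260 figure), and all class and attribute names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_SIZE, ACTION_SIZE, N_AGENTS = 24, 2, 2


class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(STATE_SIZE * N_AGENTS, 256)   # 48 inputs
        self.fc2 = nn.Linear(256, 128)
        self.out = nn.Linear(128, ACTION_SIZE)

    def forward(self, states):
        x = F.relu(self.fc1(states))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.out(x))      # actions bounded to [-1, 1]


class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(STATE_SIZE * N_AGENTS, 256)   # 48 inputs
        # Assumption: the 2 * 2 = 4 action values join here, giving 260 inputs.
        self.fc2 = nn.Linear(256 + ACTION_SIZE * N_AGENTS, 128)
        self.out = nn.Linear(128, 1)

    def forward(self, states, actions):
        x = F.relu(self.fc1(states))
        x = torch.cat([x, actions], dim=1)
        x = F.relu(self.fc2(x))
        return self.out(x)                  # Q-value, no activation


states = torch.randn(5, 48)                 # batch of 5 random joint observations
actions = torch.randn(5, 4).clamp(-1, 1)    # batch of 5 random joint actions
a = Actor()(states)
q = Critic()(states, actions)
```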

Environment and DDPG Parameters:

- state_size. The size of the environment state.
- action_size. The length of the action vector each agent can take.
- buffer_size. The maximum number of past experiences the replay buffer can store.
- batch_size. The number of experiences sampled from the buffer at one time.

Hyperparameters
Model Related: 
- lr_actor. Learning rate of the actor NN
- lr_critic. Learning rate of the critic NN
- learn_every. Learn every n episodes
- learn_num. Number of learning passes
- gamma. Reward discount factor
- tau. Soft update rate: weight_target = tau * weight_local + (1 - tau) * weight_target
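
The soft update rule for tau can be illustrated with plain NumPy arrays standing in for network weights. The function name and the value tau=0.1 are assumptions picked to make the arithmetic easy to follow.

```python
import numpy as np


def soft_update(local_weights, target_weights, tau):
    """Blend each target array toward its local counterpart, in place."""
    for w_local, w_target in zip(local_weights, target_weights):
        w_target[...] = tau * w_local + (1.0 - tau) * w_target


local = [np.ones((2, 2)), np.ones(2)]       # stand-in for local network weights
target = [np.zeros((2, 2)), np.zeros(2)]    # stand-in for target network weights
soft_update(local, target, tau=0.1)
print(target[0])   # every entry is 0.1 after one update
```

A small tau means the target network trails the local network slowly, which keeps the learning targets from shifting abruptly between updates.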

Noise Related: 
- add_ounoise. Add Ornstein-Uhlenbeck noise or not?
- sigma. Ornstein-Uhlenbeck noise parameter, volatility
- theta. Ornstein-Uhlenbeck noise parameter, speed of mean reversion
- eps_start. initial value for epsilon in noise decay process
- eps_ep_end. episode at which the noise decay process ends (by this point the agent should no longer be taking random actions)
- eps_final. final value for epsilon after decay


Dependencies
The libraries that are required to run the code are the following:

- unityagents
- torch
- numpy
- random
- copy
- time
- collections
- matplotlib

### Results

With this setup, the agents first exceeded the goal of a 0.5 average score over 100 episodes at episode 1610. The moving average score over 100 episodes at that point was 0.522.

The best moving average of 1.985 was achieved between episodes 1860 and 1870.
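
The 100-episode moving average check behind these numbers can be sketched as follows. The function name is hypothetical and the score list is made-up data for illustration, not the scores from this training run.

```python
from collections import deque


def first_solved_episode(scores, window=100, target=0.5):
    """Return the 1-indexed episode where the moving average first exceeds target, else None."""
    recent = deque(maxlen=window)           # keeps only the last `window` scores
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window > target:
            return episode
    return None


scores = [0.0] * 100 + [1.0] * 100          # made-up per-episode scores
print(first_solved_episode(scores))         # 151
```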

<img src = 'maddpg_performance_20191108.png'>

<img src = 'maddpg_performance_20191108_z.png'>

Notice that the score peaked near episode 1860 and deteriorated afterwards. I suspect that even if I ran for more than 2000 episodes, the score would not recover.

### Future Improvements

One of the difficulties I faced throughout the MADDPG and DDPG exercises is that I don't yet have any intuition for solving DRL problems, and I have not built up an experience bank of which algos and hyperparameter values suit which types of problems. To make my DRL work as efficient and effective as possible, I feel that I need hands-on experience with all combinations of the following:

- problem type
- algorithm type
- hyperparameter ranges

Otherwise, I feel like I'm always navigating blindly and trying different things without a good reason. Similar to using GridSearch in ML tuning, I want to spend more time tackling problems with different algos and different setups, to gain a better idea of how to approach future problems.