## Project 2 Collaboration and Competition

### Training 
```python
maddpg = AgentHandler()
dir_chkpoints = ''
scores_total, scores_global = train(maddpg, env, dir_chkpoints, n_episodes=10000)
```

### Parameters
There are 8 networks in total, 4 per agent: each agent has an Actor and a Critic, and each of those has a local and a target network.
For the Actor networks, the hidden layer sizes are:
 * _n\_fc1_ = 32,
 * _n\_fc2_ = 32.

For the Critic networks, the hidden layer sizes are:
 * _n\_fcs1_ = 64,
 * _n\_fc2_ = 64.
 
The other hyperparameters used to solve the environment:
 * GAMMA = 0.99 (discount factor applied to expected future rewards)
 * TAU = 5e-2, i.e. 0.05 (interpolation factor for the soft update of the target model from the local model)
 * BATCH_SIZE = 64 (number of _(state, action, reward, next state, done)_ tuples fed to the network per learning step)
 * BUFFER_SIZE = 1e6, i.e. 1000000 (capacity of the replay buffer)
 * LR_ACTOR = 5e-4 (learning rate for updating the Actor network weights during back-propagation)
 * LR_CRITIC = 5e-4 (learning rate for updating the Critic network weights during back-propagation)
 * WEIGHT_DECAY = 0 (weight decay for the optimizer)
 * NOISE_AMPLIFICATION = 1 (exploration noise amplification)
 * NOISE_AMPLIFICATION_DECAY = 1 (noise amplification decay)
 * LEARNING_PERIOD = 2 (weight update frequency)
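
How the two noise settings interact can be sketched as follows. This is only an illustration: it assumes the amplification is multiplied by the decay factor once per update, which the list above does not confirm, and `noise_scale` is a hypothetical helper.

```python
NOISE_AMPLIFICATION = 1.0        # initial multiplier on the exploration noise
NOISE_AMPLIFICATION_DECAY = 1.0  # factor applied after each update (assumed schedule)


def noise_scale(n_updates, start=NOISE_AMPLIFICATION, decay=NOISE_AMPLIFICATION_DECAY):
    """Return the noise multiplier after n_updates multiplicative decays."""
    return start * decay ** n_updates


# With a decay of 1 the noise multiplier never shrinks.
constant = noise_scale(1000)

# A hypothetical decay of 0.999 would cut it to roughly 37% after 1000 updates.
decayed = noise_scale(1000, decay=0.999)
```

Under this assumption, the values listed above keep the exploration noise at full strength for the whole run.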


### Model Networks (Policy and Value)

For Actor (layer sizes given as inputs x outputs):<br>
Fully-Connected Layer 1: _state\_size_ x _n\_fc1_<br>
Fully-Connected Layer 2: _n\_fc1_ x _n\_fc2_<br>
Fully-Connected Layer 3: _n\_fc2_ x _action\_size_<br>
 
For Critic (layer sizes given as inputs x outputs):<br>
Fully-Connected Layer 1: _(state\_size + action\_size) · n\_agents_ x _n\_fcs1_<br>
Fully-Connected Layer 2: _n\_fcs1_ x _n\_fc2_<br>
Fully-Connected Layer 3: _n\_fc2_ x 1<br>
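
To get a feel for these layer sizes, the sketch below counts the trainable parameters of one Actor and one Critic. The environment dimensions (`state_size = 24`, `action_size = 2`, `n_agents = 2`) are assumptions not stated in this document, as is the presence of a bias term on every layer; the helper names are hypothetical.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out


def count_params(layer_sizes):
    """Total parameters of an MLP given its [input, hidden..., output] sizes."""
    return sum(linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


# Assumed environment dimensions (not taken from this document).
state_size, action_size, n_agents = 24, 2, 2

actor_sizes = [state_size, 32, 32, action_size]
critic_sizes = [(state_size + action_size) * n_agents, 64, 64, 1]

actor_params = count_params(actor_sizes)
critic_params = count_params(critic_sizes)
print(actor_params, critic_params)  # prints: 1922 7617
```

The Critic is larger mainly because its first layer takes the observations and actions of all agents as input.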

### Learning Algorithm

#### Deep Deterministic Policy Gradient algorithm

DDPG is an algorithm which concurrently learns a Q-function and a policy. It uses off-policy data and the Bellman equation to learn the Q-function, and uses the Q-function to learn the policy. This dual mechanism is the actor-critic method. The DDPG algorithm uses two additional mechanisms: Replay Buffer and Soft Updates.
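
The soft update can be illustrated with plain Python lists standing in for network weights. This is a minimal sketch of the usual interpolation rule, not the project's own code; `soft_update` and the example values are hypothetical.

```python
TAU = 5e-2  # interpolation factor, as in the hyperparameter list


def soft_update(local_params, target_params, tau=TAU):
    """Blend local weights into target weights: tau * local + (1 - tau) * target."""
    return [tau * l + (1.0 - tau) * t for l, t in zip(local_params, target_params)]


local = [1.0, -2.0]   # weights of the local network (toy values)
target = [0.0, 0.0]   # weights of the target network (toy values)

# One soft update moves the target 5% of the way toward the local weights.
target = soft_update(local, target)
```

Because only a small fraction of the local weights is mixed in at each step, the target network changes slowly, which keeps the learning targets stable.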

#### Multi Agent Deep Deterministic Policy Gradient algorithm

In this project, we use the **DDPG** algorithm (Deep Deterministic Policy Gradient), described above, and the **MADDPG** algorithm,
a wrapper for DDPG. MADDPG stands for **Multi-Agent DDPG**.

In MADDPG, we train two separate agents, and the agents need to **collaborate** (for example, keep the ball from hitting the ground)
and **compete** (for example, gather as many points as possible). Simply extending single-agent RL
by training the two agents independently does not work very well: each agent updates
its policy as learning progresses, which makes the environment appear non-stationary from the viewpoint
of any one agent.

In MADDPG, _each agent’s critic is trained using the observations and actions_ from **both agents**, whereas
each _agent’s actor is trained using just_ its **own observations**.
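
The difference between the two inputs can be sketched with plain lists. The concatenation order (all observations first, then all actions) and the toy dimensions are assumptions made for illustration; `actor_input` and `critic_input` are hypothetical helpers.

```python
def actor_input(observations, agent_index):
    """The actor sees only its own observation."""
    return list(observations[agent_index])


def critic_input(observations, actions):
    """The critic sees every agent's observation and action, concatenated."""
    flat = []
    for obs in observations:
        flat.extend(obs)
    for act in actions:
        flat.extend(act)
    return flat


# Toy sizes for illustration: 3-dimensional observations, 2-dimensional actions.
observations = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
actions = [[1.0, -1.0], [0.5, 0.0]]

own = actor_input(observations, agent_index=0)
joint = critic_input(observations, actions)
```

Here `own` has 3 entries while `joint` has (3 + 2) * 2 = 10, matching the critic's first-layer input size of _(state\_size + action\_size) · n\_agents_.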

In the function _step()_ of the _class madppg_\__agent_, we collect all current information
for **both agents** into the **common** variable
_memory_ of type _ReplayBuffer_. We then draw a random _sample_ from _memory_ into the variable _experiance_.
This _experiance_, together with the index of the current agent (0 or 1), goes to the function _learn()_. There we get the corresponding
agent (of type _ddpg_\__agent_):

      agent = self.agents[agent_number]

and _experiance_ is passed to the function _learn()_ of the _class ddpg_\__agent_. There, the actor and the critic
are handled in different ways.
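
The flow described above (one common memory, one random sample, then one pass per agent) can be sketched as follows. This is a simplified stand-in under stated assumptions, not the project's `ReplayBuffer`: the class name `SharedReplayBuffer`, its field names, and the toy data are all hypothetical.

```python
import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["states", "actions", "rewards", "next_states", "dones"])


class SharedReplayBuffer:
    """Fixed-size buffer holding the joint experiences of all agents."""

    def __init__(self, buffer_size, batch_size, seed=0):
        self.memory = deque(maxlen=buffer_size)  # oldest entries are dropped first
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def add(self, states, actions, rewards, next_states, dones):
        self.memory.append(Experience(states, actions, rewards, next_states, dones))

    def sample(self):
        """Draw a uniform random batch without replacement."""
        return self.rng.sample(list(self.memory), self.batch_size)

    def __len__(self):
        return len(self.memory)


buffer = SharedReplayBuffer(buffer_size=1000, batch_size=4)
for t in range(10):
    # One entry stores the data of both agents for the same timestep.
    buffer.add(states=[[t], [t]], actions=[[0.0], [0.0]], rewards=[0.0, 0.1],
               next_states=[[t + 1], [t + 1]], dones=[False, False])

batch = buffer.sample()
for agent_number in (0, 1):
    # Each agent would learn from the same batch, picking out its own reward.
    agent_rewards = [e.rewards[agent_number] for e in batch]
```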

#### Actor-Critic dual mechanism

For each timestep _t,_ we do the following operations:

Let ***S*** be the current state. It is the input for the _Actor NN_. The output is the action distribution

![](images/policy_pi.png)

where π is the policy function, i.e., the distribution over actions. The _Critic NN_ gets the state ***S*** as input and outputs
the state-value function ***v(S, w)***, that is, the _expected total reward_ for the agent starting from state ***S***. Here, _θ_ is
the parameter vector of the _Actor NN_ and _w_ is the parameter vector of the _Critic NN_. The task is to train both networks, i.e.,
to find the optimal values for _θ_ and _w_. From the policy _π_ we get the action _A_; from the environment we get the reward _R_
and the next state ***S'***. Then we get the _TD-estimate_:
 
![](images/TD_estimate.png)
		 
Next, we use the _Critic_ to calculate the _advantage function_ _A(s, a)_:

![](images/calc_advantage.png)
				 
Here, _γ_ is the _discount factor_. The parameter _θ_ is updated by gradient ascent as follows:

![](images/update_theta.png)

The parameter _w_ is updated as follows:

![](images/update_w.png)
		
Here, α (resp. β) is the learning rate for the _Actor NN_ (resp. _Critic NN_). Before moving to the next timestep, we update the state _S_ and multiply the operator _I_ by the _discount factor_ γ:

![](images/next_state.png)

At the start of the algorithm, the operator _I_ should be initialized to the identity operator.
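
The per-timestep loop can be sketched with tables in place of the two networks. This follows the textbook one-step actor-critic update and is assumed, not confirmed, to match the formulas in the images above: the TD error serves as the advantage estimate, the actor uses a softmax over action preferences, and _I_ scales only the actor step. All names and values are hypothetical.

```python
import math


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, w, I, s, a, r, s_next, done, alpha, beta, gamma):
    """Apply one update in place and return (td_error, new_I)."""
    v_next = 0.0 if done else w[s_next]
    delta = r + gamma * v_next - w[s]   # TD error, used as the advantage estimate
    w[s] += beta * delta                # critic: move v(S) toward the TD target
    pi = softmax(theta[s])
    for b in range(len(theta[s])):
        # Gradient of log pi(a|s) with respect to the preference of action b.
        grad_log_pi = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += alpha * I * delta * grad_log_pi  # actor: gradient ascent
    return delta, gamma * I


# Two states, two actions, everything initialised to zero.
theta = [[0.0, 0.0], [0.0, 0.0]]  # actor parameters (action preferences)
w = [0.0, 0.0]                    # critic parameters (state values)
I = 1.0                           # starts at the identity

delta, I = actor_critic_step(theta, w, I, s=0, a=1, r=1.0, s_next=1, done=False,
                             alpha=0.1, beta=0.5, gamma=0.99)
```

After this single step, the preference for the rewarded action rises, the other falls by the same amount, and _I_ becomes γ.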

#### Mechanism of each Agent
* Four networks: the Actor and the Critic each have two networks (local and target), as follows.
    * Actor
```python
self.actor_local = Actor(state_size, action_size, random_seed).to(device)
self.actor_target = Actor(state_size, action_size, random_seed).to(device)
```
    * Critic
```python
self.critic_local = Critic(state_size, action_size, random_seed).to(device)
self.critic_target = Critic(state_size, action_size, random_seed).to(device)
```
* Replay memory (using the class Experience)
```python
self.experience = namedtuple("Experience", field_names=["state", "action", "reward", "next_state", "done"])
e = self.experience(state, action, reward, next_state, done)
self.memory.append(e)
```
* Update Critic Network
```python
# Get predicted next-state actions and Q values from target models
actions_next = self.actor_target(next_states)
Q_targets_next = self.critic_target(next_states, actions_next)
# Compute Q targets for current states (y_i)
Q_targets = rewards + (gamma * Q_targets_next * (1 - dones))
# Compute critic loss
Q_expected = self.critic_local(states, actions)
critic_loss = F.mse_loss(Q_expected, Q_targets)
# Minimize the loss
self.critic_optimizer.zero_grad()
critic_loss.backward()
torch.nn.utils.clip_grad_norm_(self.critic_local.parameters(), 1)
self.critic_optimizer.step()
```
* Update Actor
```python
# Compute actor loss
actions_pred = self.actor_local(states)
actor_loss = -self.critic_local(states, actions_pred).mean()
# Minimize the loss
self.actor_optimizer.zero_grad()
actor_loss.backward()
self.actor_optimizer.step()
```
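
The critic target used in the update above, `rewards + gamma * Q_targets_next * (1 - dones)`, can be checked by hand on a toy batch. The sketch below uses plain floats instead of tensors; the numbers are made up for illustration and `td_target` and `mse` are hypothetical helpers.

```python
def td_target(reward, q_next, done, gamma=0.99):
    """Q target: reward plus discounted next-state value, cut off at episode end."""
    return reward + gamma * q_next * (1 - done)


def mse(predictions, targets):
    """Mean squared error, the critic loss."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


rewards = [0.1, 0.0]
q_next = [2.0, 3.0]   # target critic's estimate for the next state
dones = [0, 1]        # the second transition ends its episode

targets = [td_target(r, q, d) for r, q, d in zip(rewards, q_next, dones)]
loss = mse([2.0, 0.5], targets)  # local critic predicted 2.0 and 0.5
```

The first target is 0.1 + 0.99 * 2.0 = 2.08. The second is just the reward, 0.0, because `(1 - done)` removes the bootstrap term at the end of an episode.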

### Result Graph
```python
maddpg = AgentHandler()
dir_chkpoints = ''
scores_total, scores_global = train(maddpg, env, dir_chkpoints, n_episodes=10000)
```
<p>
*** Environment solved in 6424 episodes!	Average Score: 0.52 ***

Episode: 6450, Score: 1.8450, 	Average Score: 0.6923, Time: 00:31:38<br> 
\*** Episode 6450	Average Score: 0.69, Time: 00:31:38 ***
 
Episode: 6500, Score: 2.6000, 	Average Score: 1.2494, Time: 00:37:05 <br>
\*** Episode 6500	Average Score: 1.25, Time: 00:37:05 ***
 
Episode: 6550, Score: 2.6000, 	Average Score: 1.6173, Time: 00:42:26 <br>
\*** Episode 6550	Average Score: 1.62, Time: 00:42:26 ***
</p>

![Result Graph](plot.png)

With the parameters given above, the environment was solved in 6424 episodes, with an average score of 0.52.<br>
The ideas listed under Future Ideas could improve on this.
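
A "solved" check of this kind is typically a rolling average compared against a threshold. The sketch below assumes a window of 100 episodes and a target of 0.5, neither of which is stated in this document, and `first_solved_episode` is a hypothetical helper rather than part of `train()`.

```python
from collections import deque


def first_solved_episode(scores, window=100, target=0.5):
    """Return the 1-based episode at which the rolling mean first reaches target, else None."""
    recent = deque(maxlen=window)
    for episode, score in enumerate(scores, start=1):
        recent.append(score)
        if len(recent) == window and sum(recent) / window >= target:
            return episode
    return None


# Synthetic scores for illustration only, not the run reported above.
scores = [0.0] * 150 + [1.0] * 100
solved_at = first_solved_episode(scores)
```

With these synthetic scores the rolling mean reaches 0.5 at episode 200, once half of the last 100 episodes scored 1.0.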

### Future Ideas
1. Adding one or more layers to the network could improve performance.
2. Changing the number of neurons in each layer could also improve how well the networks learn the mapping between states and actions.
3. Tuning exploration, for example starting from a different noise level or decaying it at a different rate, could be an improvement.
4. NAF (Normalized Advantage Functions) could be used as an alternative to DDPG, because:
   1. NAF is reported to learn a smooth, stable policy, whereas DDPG can learn an unstable one.
   2. NAF is therefore more suitable for domains where precision is required (e.g. robot arm manipulation).
   3. NAF reportedly performs better than DDPG on roughly 80% of tested tasks.
5. MAPPO, i.e. multi-agent PPO, could also be used here.