In [None]:
import numpy as np
import torch
import sys
sys.path.append('../')
from voting_games.werewolf_env_v0 import pare, plurality_env, Roles, Phase
from notebooks.learning_agents.models import ActorCriticAgent
from notebooks.learning_agents.utils import play_static_game, play_recurrent_game
from notebooks.learning_agents.static_agents import (
    random_plurality_wolf, 
    random_approval_wolf,
    )
import notebooks.learning_agents.stats as indicators
import random
import copy
from matplotlib import pyplot as plt
from tqdm import tqdm
from tabulate import tabulate

# Discussion

## Training 
### Inconsistent Training

Reinforcement learning is a hard problem, and MARL even more so given its dynamic nature. To train our agents, we used a policy gradient method (PPO) that consistently improved them, but training almost always ended in some type of divergence or weight collapse. Sutton and Barto[^Sutton-Barto-Book] describe a combination of three elements, the **deadly triad**, that risks instability and divergence when all three are used together. These elements are:
- Function approximation. _Our neural network, a non-linear approximator_
- Bootstrapping our estimated values. _Our policy is changing constantly, along with our value approximation_
- Off-Policy Training. _The one thing we do not do in this case!_

The difficulty in calculating good estimates of the policy gradient is compounded by the stochasticity of our MARL environment {cite}`Kakade2002ApproximatelyOA`. If we get a few unlucky episodes with poor estimates, our parameters may move in a bad direction, leading to policy collapse and a recovery that takes a long time or never happens. Multiple such events can be seen during our training. Empirically, when we train our agents for approximately 500 update steps, only roughly $10\%$ of runs reach completion. A collapse of our weights is almost always what terminates training. Despite this, we do get decent results in the meantime, and we would like to make training more consistent.

Some ideas we believe may help and are worth trying:
- Use a decaying learning rate
- Use running norms for observations and rewards
- Further explore gradient clipping
- Add an $\epsilon$ of $10^{-5}$ to the denominator of any division, as a starting value
- Clamp logarithmic values to the range `(np.log(1e-5), -np.log(1e-5))`, as a starting value
- Split critic and actor networks
    - Have a higher learning rate for the critic
- Vary replay buffer sizes
- Vary batch sizes
- Change optimizer
- Simplify model structure
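
Several of the ideas above are small enough to sketch directly. The following is a minimal NumPy illustration, not the code used in our training runs: the helper names (`linear_lr`, `RunningNorm`, `clip_by_global_norm`, `clamped_log`) and the default values for the learning rate and clipping threshold are assumptions chosen for the example. Only the $\epsilon$ of `1e-5` comes from the list above.

```python
import numpy as np

EPS = 1e-5  # the small constant suggested in the list above


def linear_lr(step, total_steps, lr_start=3e-4, lr_end=0.0):
    """Linearly decay the learning rate (lr_start and lr_end are assumed values)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lr_start + frac * (lr_end - lr_start)


class RunningNorm:
    """Track a running mean and variance, and normalize incoming values."""

    def __init__(self, shape):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = EPS  # avoids a division by zero on the first update

    def update(self, batch):
        batch = np.asarray(batch, dtype=np.float64)
        b_mean, b_var, b_count = batch.mean(axis=0), batch.var(axis=0), batch.shape[0]
        delta = b_mean - self.mean
        total = self.count + b_count
        # Merge the stored statistics with the batch statistics
        m2 = self.var * self.count + b_var * b_count + delta**2 * self.count * b_count / total
        self.mean = self.mean + delta * b_count / total
        self.var = m2 / total
        self.count = total

    def normalize(self, x):
        # EPS keeps the division safe when the variance is close to zero
        return (np.asarray(x) - self.mean) / np.sqrt(self.var + EPS)


def clip_by_global_norm(grads, max_norm=0.5):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g**2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + EPS))
    return [g * scale for g in grads], total_norm


def clamped_log(p):
    """Log of probabilities, clamped to the range (log(EPS), -log(EPS))."""
    log_p = np.log(np.maximum(p, EPS))  # never take log(0)
    return np.clip(log_p, np.log(EPS), -np.log(EPS))
```

In a PyTorch training loop the same ideas would more likely use the built-in learning rate schedulers and gradient clipping utilities; the NumPy version is only meant to make the arithmetic explicit.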

These have been collected from many blog posts, Reddit posts, and work done in {cite}`Andrychowicz2020-fs, Henderson2017DeepRL`. We leave this exploration to future work, as it is outside the direct scope of the project, but it would help with consistency.

## Agents learning Approval and Plurality Mechanics

We were able to train both approval and plurality based agents to perform better than random-policy villagers, even random villagers that coordinated in what would be a game-breaking way. Despite the challenges with the PPO training itself, it was clear that agents trained in the approval environment consistently reached higher average win-rates.

When it came down to what behaviors they seemed to have learned, and how they went about executing them in-game, the expressiveness of our enhanced approval mechanism allowed agents to openly share their beliefs throughout accusations, while plurality agents had to repurpose targeting somewhat during accusation phases. This repurposing is probably why higher win-rates were more challenging for plurality agents to learn: they had to figure out how to superimpose intent on their votes and also infer it from the votes of others. On the other hand, trained approval agents consistently learned the ordinality of disapprovals, neutrals and approvals, along with using approvals to indicate trust and, to some extent, to trust others with whom they shared a trusted agent.

Both plurality and approval indicators, when viewed holistically, provided strong evidence for the behaviors we identified. While analyzing them, we realized that being able to view the dynamics of changing votes would have presented even more compelling proof. We leave creating indicators for changes between approval, neutral and disapproval to future work.
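
As a rough idea of what such an indicator could look like, the sketch below counts how often a voter's opinion of a target moves between disapproval, neutral and approval from one voting round to the next. It is not part of our `indicators` module. The function name `vote_transition_counts`, the `-1 / 0 / 1` encoding and the `(rounds, voters, targets)` array layout are assumptions made for illustration.

```python
import numpy as np

# Assumed encoding (hypothetical): -1 = disapprove, 0 = neutral, 1 = approve
LABELS = (-1, 0, 1)


def vote_transition_counts(votes):
    """Count opinion changes between consecutive voting rounds.

    votes: integer array of shape (rounds, voters, targets) with entries in LABELS.
    Returns a 3x3 matrix whose entry [i, j] counts the voter-target pairs that
    moved from LABELS[i] in one round to LABELS[j] in the next round.
    """
    votes = np.asarray(votes)
    counts = np.zeros((3, 3), dtype=int)
    # Shift labels from {-1, 0, 1} to the matrix indices {0, 1, 2}
    before, after = votes[:-1] + 1, votes[1:] + 1
    np.add.at(counts, (before.ravel(), after.ravel()), 1)
    return counts


# Two rounds, two voters, two targets
example = np.array([
    [[0, -1], [1, 0]],   # round 0
    [[1, -1], [1, -1]],  # round 1
])
transitions = vote_transition_counts(example)
```

Here `transitions[1, 2]` counts neutral-to-approval changes and `transitions[1, 0]` counts neutral-to-disapproval changes. A real indicator would probably also need to mask out dead players and self-votes, which this sketch ignores.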

Experimental studies of approval voting have found that most voters approve only a small number of candidates {cite}`Laslier2004UneED, Laslier2010HandbookOA` relative to the full candidate list[^approval-voting-avg-targets]. We observed the same pattern in our werewolf game, although we did not test this claim rigorously. Showing that the number of approvals changes in proportion to the number of players would make it a stronger claim.
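
One way to test this more rigorously would be to measure, for games of different sizes, the average number of approvals per voter and its share of the available targets. The sketch below shows the measurement for a single voting round. The function name `approval_rate` and the `-1 / 0 / 1` vote encoding are assumptions made for illustration, and the example votes are made up rather than taken from our games.

```python
import numpy as np


def approval_rate(votes):
    """Mean number of approvals per voter, and that number as a share of targets.

    votes: array of shape (voters, targets) with entries in {-1, 0, 1}
    (hypothetical encoding: 1 marks an approval).
    """
    votes = np.asarray(votes)
    approvals_per_voter = (votes == 1).sum(axis=1)
    mean_approvals = float(approvals_per_voter.mean())
    return mean_approvals, mean_approvals / votes.shape[1]


# Made-up round with two voters and four possible targets
mean_approvals, share = approval_rate([
    [1, 0, -1, -1],
    [0, 0, 1, 1],
])
```

If the share stayed roughly constant while the number of players grew, that would support the proportionality claim. If instead the mean count stayed constant, it would suggest voters approve a fixed handful of targets regardless of game size.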

### Reward shaping

We based our reward function and values on prior work in werewolf; however, shaping behavior through rewards is challenging and not something prior works examined. For approval agents, we gave no direct rewards for liking or feeling neutral about other agents, and there was no mechanical incentive either, yet they learned to use these votes to communicate implicitly in a way that was interpretable to us. Two ideas branch out from this finding:
1. Deriving a reward function from human replays using Inverse Reinforcement Learning (IRL).
2. Adding more complex reward logic to see whether we can force certain behaviors to be learned.

In plurality voting, the few trained agents that achieved higher win-rates did so by discovering how to superimpose intent in their targeting. This is more complex behavior, and they were possibly able to learn it because accusation phases neither penalized nor rewarded them in any way. Had we been heavy-handed with penalties, this behavior might never have been discovered.


[^approval-voting-avg-targets]: https://electionscience.org/commentary-analysis/super-tuesday-deep-voting-methods-dive/

[^Sutton-Barto-Book]: http://incompleteideas.net/book/the-book-2nd.html
[^Warp-Drive]: https://github.com/salesforce/warp-drive