# CPR appropriation with policy gradient

This notebook trains Harvest agents with each of the implemented policy gradient methods (VPG, TRPO, and PPO). The environment is a custom implementation of Harvest.

## Pre-requisites

The cells below install and import the libraries needed to run the notebook examples.

In [1]:
import sys
sys.path.append('../')

In [41]:
%%capture
!pip install -r ../init/requirements.txt
!pip install ../src/gym_cpr_grid

In [42]:
import numpy as np
import gym

from src import memory, models, policies

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Utilities

The cells below define the environment and the common variables used throughout the notebook.

In [43]:
env = gym.make(
    'gym_cpr_grid:CPRGridEnv-v0', 
    n_agents=11, 
    grid_width=39, 
    grid_height=19,
    tagging_ability=True,
    gifting_mechanism=None
)

In [44]:
from pathlib import Path

# Sizes of the observation and action spaces, used to shape the networks
observation_space_size = env.observation_space_size()
action_space_size = env.action_space_size()
epochs = 4000
steps_per_epoch = 4000
save_every = 500
hidden_sizes = [32, 32]
checkpoints_path = "../checkpoints"
wandb_config = {
    # read_text() opens and closes the key file, so no handle is left open
    "api_key": Path("../wandb_api_key_file").read_text().strip(),
    "project": "cpr-appropriation",
    "entity": "wadaboa",
}

## VPG

This section trains a set of Harvest agents with our custom Vanilla Policy Gradient (VPG) implementation.
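
As a reminder of what a vanilla policy gradient update typically consumes, the sketch below computes discounted rewards-to-go and baseline-subtracted advantages for one trajectory. It is a minimal illustration in plain Python; the function names, the discount factor `gamma`, and the choice of reward-to-go returns are assumptions, not details taken from `policies.VPGPolicy`.

```python
def rewards_to_go(rewards, gamma=0.99):
    """Discounted sum of future rewards at every timestep (hypothetical helper)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk the trajectory backwards so each return reuses the one after it
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def advantages(returns, baseline_values):
    """Advantage estimate: return minus the baseline's value prediction."""
    return [g - v for g, v in zip(returns, baseline_values)]


returns = rewards_to_go([1.0, 1.0, 1.0], gamma=0.5)
print(returns)                               # [1.75, 1.5, 1.0]
print(advantages(returns, [1.0, 1.0, 1.0]))  # [0.75, 0.5, 0.0]
```

In a setup like the one in this section, the baseline values would come from a value network such as `vpg_baseline_nn`, which is why that network has a single output.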

In [45]:
vpg_policy_nn = models.MLP(observation_space_size, hidden_sizes, action_space_size)
vpg_baseline_nn = models.MLP(observation_space_size, hidden_sizes, 1, log_softmax=False)
vpg_policy = policies.VPGPolicy(env, vpg_policy_nn, baseline_nn=vpg_baseline_nn)
vpg_policy.train(
    epochs,
    steps_per_epoch,
    enable_wandb=True,
    wandb_config={**wandb_config, "group": "VPG"}
)

2021-08-25 13:04:12.486 | INFO     | src.policies:train:103 - Epoch 1 / 4000
2021-08-25 13:04:12.487 | INFO     | src.policies:train:110 - Episode 1
2021-08-25 13:04:32.495 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 13:04:32.497 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 281.1818181818182, 'equality': 0.9623783910889343, 'sustainability': 510.8224459964027, 'peace': 592.2727272727273}
2021-08-25 13:04:32.498 | INFO     | src.policies:train:122 - Mean episode return: 281.1818181818182
2021-08-25 13:04:32.498 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 281.1818181818182


KeyboardInterrupt: 
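
The `efficiency` and `equality` values in the episode infos above can be read with the following sketch in mind. It assumes the usual social-outcome definitions for Harvest-style games: efficiency as the mean per-agent return (consistent with the logged efficiency matching the mean episode return) and equality as one minus the Gini coefficient of the returns. The environment's actual implementation is not shown in this notebook, so the function names and formulas here are assumptions.

```python
def efficiency(agent_returns):
    """Mean return per agent (assumed definition)."""
    return sum(agent_returns) / len(agent_returns)


def equality(agent_returns):
    """One minus the Gini coefficient of per-agent returns (assumed definition)."""
    n = len(agent_returns)
    total = sum(agent_returns)
    if total == 0:
        # Convention chosen for this sketch: with no rewards at all, agents are equal
        return 1.0
    # Sum of absolute differences over all ordered pairs of agents
    pairwise = sum(abs(a - b) for a in agent_returns for b in agent_returns)
    return 1.0 - pairwise / (2 * n * total)


print(efficiency([2.0, 4.0, 6.0]))     # 4.0
print(equality([3.0, 3.0, 3.0]))       # 1.0
print(equality([0.0, 0.0, 0.0, 4.0]))  # 0.25
```

Under these definitions, an equality close to 1 (such as the 0.96 logged above) would mean the agents collected similar amounts of reward.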

## TRPO

This section trains a set of Harvest agents with our custom Trust Region Policy Optimization (TRPO) implementation.

In [None]:
beta = 1.0
kl_target = 0.01
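
A coefficient `beta` paired with a `kl_target` suggests a KL-penalized surrogate objective with an adaptive penalty. The sketch below shows one common adaptation rule, which doubles or halves `beta` when the measured KL divergence drifts far from the target. Whether `policies.TRPOPolicy` uses this exact rule is an assumption; the function names and the `tolerance` factor of 1.5 are illustrative.

```python
def adapt_beta(beta, measured_kl, kl_target, tolerance=1.5):
    """Adjust the KL penalty coefficient after a policy update (hypothetical helper)."""
    if measured_kl > kl_target * tolerance:
        return beta * 2.0  # policy moved too far: penalize KL more strongly
    if measured_kl < kl_target / tolerance:
        return beta / 2.0  # policy barely moved: relax the penalty
    return beta


def penalized_objective(surrogate, measured_kl, beta):
    """Surrogate objective minus the KL penalty (to be maximized)."""
    return surrogate - beta * measured_kl


print(adapt_beta(1.0, 0.05, 0.01))   # 2.0
print(adapt_beta(1.0, 0.001, 0.01))  # 0.5
print(adapt_beta(1.0, 0.01, 0.01))   # 1.0
```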

In [None]:
trpo_policy_nn = models.MLP(observation_space_size, hidden_sizes, action_space_size)
trpo_baseline_nn = models.MLP(observation_space_size, hidden_sizes, 1, log_softmax=False)
trpo_policy = policies.TRPOPolicy(env, trpo_policy_nn, trpo_baseline_nn, beta=beta, kl_target=kl_target)
trpo_policy.train(
    epochs,
    steps_per_epoch,
    enable_wandb=True,
    wandb_config={**wandb_config, "group": "TRPO"}
)

## PPO

This section trains a set of Harvest agents with our custom Proximal Policy Optimization (PPO) implementation.

In [None]:
c1 = 1.0
c2 = 0.01
eps = 0.2
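
The names `c1`, `c2`, and `eps` match the notation commonly used for PPO's clipped objective: `eps` as the clipping range, `c1` as the value-loss weight, and `c2` as the entropy-bonus weight. The sketch below evaluates that loss for a single sample in plain Python. It is an illustration under that assumption, not the code in `policies.PPOPolicy`, and the function names are hypothetical.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate term for one (state, action) sample."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(a|s) / pi_old(a|s)
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum gives a pessimistic bound on the improvement
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_loss(surrogate, value_error, entropy, c1=1.0, c2=0.01):
    """Loss to minimize: negated surrogate, weighted value loss, entropy bonus."""
    return -surrogate + c1 * value_error ** 2 - c2 * entropy


# The ratio is about 1.5, so with a positive advantage the clip at 1 + eps applies
print(clipped_surrogate(math.log(1.5), 0.0, 1.0))  # 1.2
```

With a negative advantage the same ratio is left unclipped, because the unclipped term is then the smaller of the two.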

In [None]:
ppo_policy_nn = models.MLP(observation_space_size, hidden_sizes, action_space_size)
ppo_baseline_nn = models.MLP(observation_space_size, hidden_sizes, 1, log_softmax=False)
ppo_policy = policies.PPOPolicy(env, ppo_policy_nn, ppo_baseline_nn, c1=c1, c2=c2, eps=eps)
ppo_policy.train(
    epochs,
    steps_per_epoch,
    enable_wandb=True,
    wandb_config={**wandb_config, "group": "PPO"}
)