# CPR appropriation with policy gradient

This notebook trains Harvest agents with each implemented policy gradient method (VPG, TRPO, and PPO). The environment is a custom implementation of Harvest.

## Pre-requisites

The cells below install and import the libraries needed to run the notebook examples.

In [None]:
import sys
sys.path.append('../')

In [None]:
%%capture
!pip install -r init/requirements.txt
!pip install src/gym_cpr_grid

In [19]:
import numpy as np
import gym

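# models and policies are used by the training cells below
# (assumed to live in src alongside memory)
from src import memory, models, policies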

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Utilities

The cells below create the environment and define the variables shared by the rest of the notebook.

In [56]:
env = gym.make(
    'gym_cpr_grid:CPRGridEnv-v0',
    n_agents=11,
    grid_width=39,
    grid_height=19,
    tagging_ability=True,
    gifting_mechanism=None
)

In [57]:
observation_space_size = env.observation_space_size()
action_space_size = env.action_space_size()
epochs = 4000
steps_per_epoch = 4000
save_every = 500
hidden_sizes = [32, 32]
checkpoints_path = "./checkpoints"
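# Read the W&B API key with a context manager so the file handle gets closed
with open("./wandb_api_key_file", "r") as api_key_file:
    wandb_api_key = api_key_file.read().strip()
wandb_config = {
    "api_key": wandb_api_key,
    "project": "cpr-appropriation",
    "entity": "wadaboa",
}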

## VPG

This section trains a set of Harvest agents with our custom Vanilla Policy Gradient (VPG) implementation.
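
As a rough illustration of what a vanilla policy gradient update optimizes, the sketch below computes discounted rewards-to-go and the baseline-corrected surrogate loss with NumPy. It is not taken from `policies.VPGPolicy`: the function names and the discount factor `gamma` are hypothetical and chosen only for this example.

```python
import numpy as np


def discounted_returns(rewards, gamma=0.99):
    """Reward-to-go: G_t = r_t + gamma * G_{t+1} (gamma is a hypothetical default)."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    # Walk the episode backwards so each step reuses the return of the next one
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def vpg_surrogate_loss(log_probs, returns, baseline_values):
    """Loss whose gradient is the negative policy gradient estimate.

    log_probs: log pi(a_t | s_t) for the actions that were taken
    returns: rewards-to-go G_t
    baseline_values: baseline predictions b(s_t), subtracted to reduce variance
    """
    advantages = np.asarray(returns) - np.asarray(baseline_values)
    return -np.mean(np.asarray(log_probs) * advantages)
```

In an automatic differentiation framework the same expression would be built from tensors, with the advantages treated as constants during backpropagation.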

In [None]:
vpg_policy_nn = models.MLP(observation_space_size, hidden_sizes, action_space_size)
vpg_baseline_nn = models.MLP(observation_space_size, hidden_sizes, action_space_size, log_softmax=False)
vpg_policy = policies.VPGPolicy(env, vpg_policy_nn, baseline_nn=vpg_baseline_nn)
vpg_policy.train(
    epochs,
    steps_per_epoch,
    enable_wandb=True,
    wandb_config={**wandb_config, "group": "VPG"}
)

## TRPO

This section trains a set of Harvest agents with our custom Trust Region Policy Optimization (TRPO) implementation.
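
The `beta` and `kl_target` hyperparameters used in this section are consistent with a KL-penalized surrogate objective, where `beta` weighs the KL divergence between the old and new policies and is adapted to keep that divergence near `kl_target`. Whether `policies.TRPOPolicy` works exactly this way is an assumption. The NumPy sketch below uses hypothetical function names and the halving/doubling rule described for the adaptive KL penalty in the PPO paper.

```python
import numpy as np


def categorical_kl(old_log_probs, new_log_probs):
    """Mean KL(old || new) for a batch of categorical policies.

    Both inputs have shape (batch, n_actions) and hold log-probabilities.
    """
    old_probs = np.exp(old_log_probs)
    return np.mean(np.sum(old_probs * (old_log_probs - new_log_probs), axis=1))


def kl_penalized_objective(ratio, advantages, kl, beta):
    """Surrogate to maximize: E[ratio * advantage] - beta * KL."""
    return np.mean(ratio * advantages) - beta * kl


def adapt_beta(beta, kl, kl_target):
    """Loosen the penalty when KL is well below target, tighten it when well above."""
    if kl < kl_target / 1.5:
        return beta / 2.0
    if kl > kl_target * 1.5:
        return beta * 2.0
    return beta
```

With the values set in this section (`beta = 1.0`, `kl_target = 0.01`), a measured KL of 0.05 would double `beta` under this rule, while a KL of 0.001 would halve it.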

In [None]:
beta = 1.0
kl_target = 0.01

In [None]:
trpo_policy_nn = models.MLP(observation_space_size, hidden_sizes, action_space_size)
trpo_baseline_nn = models.MLP(observation_space_size, hidden_sizes, action_space_size, log_softmax=False)
trpo_policy = policies.TRPOPolicy(env, trpo_policy_nn, trpo_baseline_nn, beta=beta, kl_target=kl_target)
trpo_policy.train(
    epochs,
    steps_per_epoch,
    enable_wandb=True,
    wandb_config={**wandb_config, "group": "TRPO"}
)

## PPO

This section trains a set of Harvest agents with our custom Proximal Policy Optimization (PPO) implementation.
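
For orientation, the names `c1`, `c2`, and `eps` used in this section match the notation of the PPO paper, where `eps` is the clipping range of the probability ratio, `c1` weighs the value-function loss, and `c2` weighs the entropy bonus. Assuming `policies.PPOPolicy` follows that convention (not verified here), the combined loss could be sketched in NumPy as follows. The function name and argument layout are hypothetical.

```python
import numpy as np


def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             entropy, c1=1.0, c2=0.01, eps=0.2):
    """Negative of the PPO objective L_CLIP - c1 * L_VF + c2 * S."""
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space
    ratio = np.exp(new_log_probs - old_log_probs)
    clipped_ratio = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Pessimistic (elementwise minimum) clipped surrogate
    policy_term = np.mean(np.minimum(ratio * advantages, clipped_ratio * advantages))
    # Squared error between baseline predictions and empirical returns
    value_term = np.mean((values - returns) ** 2)
    entropy_term = np.mean(entropy)
    # The objective is maximized, so the loss to minimize is its negative
    return -(policy_term - c1 * value_term + c2 * entropy_term)
```

With `eps = 0.2`, a ratio of 2 combined with a positive advantage is clipped to 1.2, so a single update cannot gain from moving the policy further than the clipping range allows.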

In [None]:
c1 = 1.0
c2 = 0.01
eps = 0.2

In [None]:
ppo_policy_nn = models.MLP(observation_space_size, hidden_sizes, action_space_size)
ppo_baseline_nn = models.MLP(observation_space_size, hidden_sizes, action_space_size, log_softmax=False)
ppo_policy = policies.PPOPolicy(env, ppo_policy_nn, ppo_baseline_nn, c1=c1, c2=c2, eps=eps)
ppo_policy.train(
    epochs,
    steps_per_epoch,
    enable_wandb=True,
    wandb_config={**wandb_config, "group": "PPO"}
)