# CPR appropriation with policy gradient

This notebook trains Harvest agents with each of the implemented policy gradient methods (VPG, TRPO, and PPO). The environment is a custom implementation of Harvest.

## Pre-requisites

The cells below install and import the libraries required to run the notebook examples.

In [1]:
import sys
sys.path.append('../')

In [41]:
%%capture
!pip install -r ../init/requirements.txt
!pip install ../src/gym_cpr_grid

In [42]:
import numpy as np
import gym

from src import memory, models, policies

%load_ext autoreload
%autoreload 2

[autoreload of gym_cpr_grid failed: Traceback (most recent call last):
  File "/Users/jobs/Github/cpr-appropriation/venv/lib/python3.9/site-packages/IPython/extensions/autoreload.py", line 245, in check
    superreload(m, reload, self.old_objects)
  File "/Users/jobs/Github/cpr-appropriation/venv/lib/python3.9/site-packages/IPython/extensions/autoreload.py", line 394, in superreload
    module = reload(module)
  File "/usr/local/Cellar/python@3.9/3.9.5/Frameworks/Python.framework/Versions/3.9/lib/python3.9/imp.py", line 314, in reload
    return importlib.reload(module)
  File "/usr/local/Cellar/python@3.9/3.9.5/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py", line 169, in reload
    _bootstrap._exec(spec, module)
  File "<frozen importlib._bootstrap>", line 613, in _exec
  File "<frozen importlib._bootstrap_external>", line 855, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/Users/jobs/Github/cpr-app

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Utilities

The cells below define the environment and the common variables used throughout the notebook.

In [43]:
env = gym.make(
    'gym_cpr_grid:CPRGridEnv-v0', 
    n_agents=11, 
    grid_width=39, 
    grid_height=19,
    tagging_ability=True,
    gifting_mechanism=None
)

In [44]:
observation_space_size = env.observation_space_size()
action_space_size = env.action_space_size()
epochs = 4000
steps_per_epoch = 4000
save_every = 500
hidden_sizes = [32, 32]
checkpoints_path = "../checkpoints"
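# Read the W&B API key from a local file and close the handle afterwards
with open("../wandb_api_key_file", "r") as api_key_file:
    wandb_api_key = api_key_file.read().strip()
wandb_config = {
    "api_key": wandb_api_key,
    "project": "cpr-appropriation",
    "entity": "wadaboa",
}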

## VPG

This section trains a set of Harvest agents with our custom Vanilla Policy Gradient (VPG) implementation.
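As a rough illustration of what a vanilla policy gradient update with a baseline optimizes, here is a minimal NumPy sketch. It is not the code in `src.policies.VPGPolicy`: the function names `rewards_to_go` and `vpg_loss` and the discount factor `gamma` are assumptions made for this example only.

```python
import numpy as np


def rewards_to_go(rewards, gamma=0.99):
    """Discounted sum of future rewards at each time step (gamma is assumed)."""
    out = np.zeros(len(rewards), dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


def vpg_loss(log_probs, returns, baseline_values):
    """Surrogate loss whose gradient is the policy gradient estimate.

    The advantage is the return minus the baseline's value estimate.
    """
    advantages = returns - baseline_values
    return -np.mean(log_probs * advantages)


# Toy trajectory of three steps (hypothetical numbers)
rewards = np.array([1.0, 0.0, 2.0])
returns = rewards_to_go(rewards, gamma=0.5)
log_probs = np.log(np.array([0.5, 0.25, 0.5]))
baseline_values = np.array([1.0, 1.0, 1.0])
loss = vpg_loss(log_probs, returns, baseline_values)
```

Subtracting a learned baseline does not change the expected gradient, but it usually reduces its variance, which is presumably why a separate `baseline_nn` is passed to the policy in the cell below.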

In [None]:
vpg_policy_nn = models.MLP(observation_space_size, hidden_sizes, action_space_size)
vpg_baseline_nn = models.MLP(observation_space_size, hidden_sizes, 1, log_softmax=False)
vpg_policy = policies.VPGPolicy(env, vpg_policy_nn, baseline_nn=vpg_baseline_nn)
vpg_policy.train(
    epochs,
    steps_per_epoch,
    enable_wandb=True,
    wandb_config={**wandb_config, "group": "VPG"}
)

[34m[1mwandb[0m: Currently logged in as: [33mwadaboa[0m (use `wandb login --relogin` to force relogin)


2021-08-25 13:05:12.502 | INFO     | src.policies:train:103 - Epoch 1 / 4000
2021-08-25 13:05:12.503 | INFO     | src.policies:train:110 - Episode 1
2021-08-25 13:05:30.874 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 13:05:30.876 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 210.0909090909091, 'equality': 0.9288776995411495, 'sustainability': 474.1942691329546, 'peace': 756.9090909090909}
2021-08-25 13:05:30.877 | INFO     | src.policies:train:122 - Mean episode return: 210.0909090909091
2021-08-25 13:05:30.877 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 210.0909090909091
2021-08-25 13:05:38.303 | INFO     | src.policies:train:159 - Total loss: 0.9996551871299744
2021-08-25 13:05:38.304 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 210.0909090909091, 'equality': 0.9288776995411495, 'sustainability': 474.1942691329546, 'peace': 756.9090909090909}
2021-08-25 13:05:38.348 | IN

2021-08-25 13:08:40.733 | INFO     | src.policies:train:159 - Total loss: 0.9971657991409302
2021-08-25 13:08:40.734 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 192.0909090909091, 'equality': 0.8630985673135332, 'sustainability': 481.78660196007417, 'peace': 653.1818181818181}
2021-08-25 13:08:40.781 | INFO     | src.policies:train:103 - Epoch 9 / 4000
2021-08-25 13:08:40.782 | INFO     | src.policies:train:110 - Episode 9
2021-08-25 13:08:59.288 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 13:08:59.309 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 215.0, 'equality': 0.9331923890076266, 'sustainability': 505.6465004460492, 'peace': 812.8181818181819}
2021-08-25 13:08:59.310 | INFO     | src.policies:train:122 - Mean episode return: 215.0
2021-08-25 13:08:59.310 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 207.2828282828283
2021-08-25 13:09:06.978 | INFO     | src.policies:t

2021-08-25 13:12:05.012 | INFO     | src.policies:train:122 - Mean episode return: 194.27272727272728
2021-08-25 13:12:05.012 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 208.84090909090907
2021-08-25 13:12:13.399 | INFO     | src.policies:train:159 - Total loss: 0.9996477365493774
2021-08-25 13:12:13.400 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 194.27272727272728, 'equality': 0.8995192921278871, 'sustainability': 478.79792458808674, 'peace': 725.3636363636364}
2021-08-25 13:12:13.447 | INFO     | src.policies:train:103 - Epoch 17 / 4000
2021-08-25 13:12:13.448 | INFO     | src.policies:train:110 - Episode 17
2021-08-25 13:12:32.802 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 13:12:32.824 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 207.9090909090909, 'equality': 0.9270978256563362, 'sustainability': 487.7497772200393, 'peace': 669.0}
2021-08-25 13:12:32.825 | INFO    

2021-08-25 13:15:30.666 | INFO     | src.policies:train:103 - Epoch 24 / 4000
2021-08-25 13:15:30.667 | INFO     | src.policies:train:110 - Episode 24
2021-08-25 13:15:51.046 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 13:15:51.068 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 197.36363636363637, 'equality': 0.9317449018062112, 'sustainability': 493.5729859192556, 'peace': 706.1818181818181}
2021-08-25 13:15:51.069 | INFO     | src.policies:train:122 - Mean episode return: 197.36363636363637
2021-08-25 13:15:51.069 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 209.82196969696966
2021-08-25 13:15:59.356 | INFO     | src.policies:train:159 - Total loss: 0.9995957016944885
2021-08-25 13:15:59.356 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 197.36363636363637, 'equality': 0.9317449018062112, 'sustainability': 493.5729859192556, 'peace': 706.1818181818181}
2021-08-25 13:15:59.40

2021-08-25 13:19:23.961 | INFO     | src.policies:train:159 - Total loss: 1.0068039894104004
2021-08-25 13:19:23.961 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 201.27272727272728, 'equality': 0.9343844953614522, 'sustainability': 481.00459614865974, 'peace': 689.1818181818181}
2021-08-25 13:19:24.008 | INFO     | src.policies:train:103 - Epoch 32 / 4000
2021-08-25 13:19:24.008 | INFO     | src.policies:train:110 - Episode 32
2021-08-25 13:19:45.644 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 13:19:45.667 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 213.0, 'equality': 0.9480076048588055, 'sustainability': 497.9819523633874, 'peace': 797.7272727272727}
2021-08-25 13:19:45.668 | INFO     | src.policies:train:122 - Mean episode return: 213.0
2021-08-25 13:19:45.668 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 209.78977272727272
2021-08-25 13:19:54.482 | INFO     | src.polici

2021-08-25 13:23:22.596 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 212.27272727272728, 'equality': 0.9247031341264415, 'sustainability': 497.69277664037327, 'peace': 649.4545454545455}
2021-08-25 13:23:22.597 | INFO     | src.policies:train:122 - Mean episode return: 212.27272727272728
2021-08-25 13:23:22.597 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 209.96270396270396
2021-08-25 13:23:31.632 | INFO     | src.policies:train:159 - Total loss: 1.001400351524353
2021-08-25 13:23:31.633 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 212.27272727272728, 'equality': 0.9247031341264415, 'sustainability': 497.69277664037327, 'peace': 649.4545454545455}
2021-08-25 13:23:31.680 | INFO     | src.policies:train:103 - Epoch 40 / 4000
2021-08-25 13:23:31.681 | INFO     | src.policies:train:110 - Episode 40
2021-08-25 13:23:54.392 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 13:23:54.4

2021-08-25 13:27:13.340 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 212.27272727272728, 'equality': 0.937161767569843, 'sustainability': 497.7415083601114, 'peace': 640.7272727272727}
2021-08-25 13:27:13.391 | INFO     | src.policies:train:103 - Epoch 47 / 4000
2021-08-25 13:27:13.391 | INFO     | src.policies:train:110 - Episode 47
2021-08-25 13:27:35.457 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 13:27:35.484 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 219.9090909090909, 'equality': 0.9539253635996482, 'sustainability': 487.9896388131372, 'peace': 768.8181818181819}
2021-08-25 13:27:35.485 | INFO     | src.policies:train:122 - Mean episode return: 219.9090909090909
2021-08-25 13:27:35.485 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 210.20309477756285
2021-08-25 13:27:44.547 | INFO     | src.policies:train:159 - Total loss: 1.0042535066604614
2021-08-25 13:27:44.548 |

2021-08-25 13:31:21.857 | INFO     | src.policies:train:122 - Mean episode return: 217.8181818181818
2021-08-25 13:31:21.857 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 210.35185185185185
2021-08-25 13:31:31.444 | INFO     | src.policies:train:159 - Total loss: 1.000842809677124
2021-08-25 13:31:31.445 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 217.8181818181818, 'equality': 0.9244194870252614, 'sustainability': 489.18477372369665, 'peace': 717.8181818181819}
2021-08-25 13:31:31.495 | INFO     | src.policies:train:103 - Epoch 55 / 4000
2021-08-25 13:31:31.495 | INFO     | src.policies:train:110 - Episode 55
2021-08-25 13:31:54.230 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 13:31:54.255 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 196.36363636363637, 'equality': 0.9030303030323437, 'sustainability': 479.37800590856585, 'peace': 659.5454545454545}
2021-08-25 13:31:54.256

2021-08-25 13:35:33.107 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 13:35:33.132 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 202.8181818181818, 'equality': 0.8948698097083478, 'sustainability': 483.87186448110066, 'peace': 703.5454545454545}
2021-08-25 13:35:33.132 | INFO     | src.policies:train:122 - Mean episode return: 202.8181818181818
2021-08-25 13:35:33.133 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 209.32697947214078
2021-08-25 13:35:42.191 | INFO     | src.policies:train:159 - Total loss: 1.004791021347046
2021-08-25 13:35:42.192 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 202.8181818181818, 'equality': 0.8948698097083478, 'sustainability': 483.87186448110066, 'peace': 703.5454545454545}
2021-08-25 13:35:42.240 | INFO     | src.policies:train:103 - Epoch 63 / 4000
2021-08-25 13:35:42.241 | INFO     | src.policies:train:110 - Episode 63
2021-08-25 13:36:04.774 

2021-08-25 13:39:23.815 | INFO     | src.policies:train:159 - Total loss: 1.0009381771087646
2021-08-25 13:39:23.815 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 199.8181818181818, 'equality': 0.9262966333045269, 'sustainability': 466.8555455640329, 'peace': 701.5454545454545}
2021-08-25 13:39:23.865 | INFO     | src.policies:train:103 - Epoch 70 / 4000
2021-08-25 13:39:23.866 | INFO     | src.policies:train:110 - Episode 70
2021-08-25 13:39:46.689 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 13:39:46.717 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 213.45454545454547, 'equality': 0.9468019203974988, 'sustainability': 494.0652500967135, 'peace': 763.9090909090909}
2021-08-25 13:39:46.718 | INFO     | src.policies:train:122 - Mean episode return: 213.45454545454547
2021-08-25 13:39:46.718 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 208.5194805194805
2021-08-25 13:39:55.714 

2021-08-25 13:43:28.370 | INFO     | src.policies:train:122 - Mean episode return: 198.27272727272728
2021-08-25 13:43:28.370 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 208.03896103896105
2021-08-25 13:43:37.445 | INFO     | src.policies:train:159 - Total loss: 0.9987347722053528
2021-08-25 13:43:37.446 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 198.27272727272728, 'equality': 0.9363928139732318, 'sustainability': 470.1142073718709, 'peace': 700.6363636363636}
2021-08-25 13:43:37.495 | INFO     | src.policies:train:103 - Epoch 78 / 4000
2021-08-25 13:43:37.496 | INFO     | src.policies:train:110 - Episode 78
2021-08-25 13:43:59.854 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 13:43:59.877 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 201.45454545454547, 'equality': 0.9405973088295743, 'sustainability': 476.0885174774863, 'peace': 720.2727272727273}
2021-08-25 13:43:59.87

2021-08-25 13:47:15.990 | INFO     | src.policies:train:110 - Episode 85
2021-08-25 13:47:37.701 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 13:47:37.725 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 194.0909090909091, 'equality': 0.9129657228036414, 'sustainability': 502.9879926180079, 'peace': 766.0}
2021-08-25 13:47:37.725 | INFO     | src.policies:train:122 - Mean episode return: 194.0909090909091
2021-08-25 13:47:37.726 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 207.97112299465238
2021-08-25 13:47:46.830 | INFO     | src.policies:train:159 - Total loss: 1.0055216550827026
2021-08-25 13:47:46.831 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 194.0909090909091, 'equality': 0.9129657228036414, 'sustainability': 502.9879926180079, 'peace': 766.0}
2021-08-25 13:47:46.882 | INFO     | src.policies:train:103 - Epoch 86 / 4000
2021-08-25 13:47:46.882 | INFO     | src.policies

2021-08-25 13:51:32.719 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 218.63636363636363, 'equality': 0.9092799092816239, 'sustainability': 510.7701598743622, 'peace': 743.8181818181819}
2021-08-25 13:51:32.775 | INFO     | src.policies:train:103 - Epoch 93 / 4000
2021-08-25 13:51:32.776 | INFO     | src.policies:train:110 - Episode 93
2021-08-25 13:51:58.501 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 13:51:58.528 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 200.36363636363637, 'equality': 0.9674146180504988, 'sustainability': 470.52826071218317, 'peace': 734.0909090909091}
2021-08-25 13:51:58.529 | INFO     | src.policies:train:122 - Mean episode return: 200.36363636363637
2021-08-25 13:51:58.530 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 207.80938416422288
2021-08-25 13:52:09.545 | INFO     | src.policies:train:159 - Total loss: 1.0045057535171509
2021-08-25 13:52:09.5

2021-08-25 13:55:44.680 | INFO     | src.policies:train:122 - Mean episode return: 184.63636363636363
2021-08-25 13:55:44.681 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 207.06909090909093
2021-08-25 13:55:54.212 | INFO     | src.policies:train:159 - Total loss: 1.0026956796646118
2021-08-25 13:55:54.213 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 184.63636363636363, 'equality': 0.9287408799980139, 'sustainability': 443.5497396882569, 'peace': 595.9090909090909}
2021-08-25 13:55:54.264 | INFO     | src.policies:train:103 - Epoch 101 / 4000
2021-08-25 13:55:54.265 | INFO     | src.policies:train:110 - Episode 101
2021-08-25 13:56:19.664 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 13:56:19.690 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 213.54545454545453, 'equality': 0.9232942451348098, 'sustainability': 503.95696799951696, 'peace': 785.1818181818181}
2021-08-25 13:56:19

2021-08-25 13:59:54.972 | INFO     | src.policies:train:110 - Episode 108
2021-08-25 14:00:18.262 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 14:00:18.286 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 199.54545454545453, 'equality': 0.9006833713004622, 'sustainability': 465.8076161078238, 'peace': 721.6363636363636}
2021-08-25 14:00:18.286 | INFO     | src.policies:train:122 - Mean episode return: 199.54545454545453
2021-08-25 14:00:18.287 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 206.89818181818183
2021-08-25 14:00:27.957 | INFO     | src.policies:train:159 - Total loss: 0.9994522929191589
2021-08-25 14:00:27.957 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 199.54545454545453, 'equality': 0.9006833713004622, 'sustainability': 465.8076161078238, 'peace': 721.6363636363636}
2021-08-25 14:00:28.012 | INFO     | src.policies:train:103 - Epoch 109 / 4000
2021-08-25 14:00:28.

2021-08-25 14:04:40.656 | INFO     | src.policies:train:159 - Total loss: 0.9985252618789673
2021-08-25 14:04:40.657 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 208.9090909090909, 'equality': 0.9355170504008324, 'sustainability': 509.1051252571682, 'peace': 742.2727272727273}
2021-08-25 14:04:40.711 | INFO     | src.policies:train:103 - Epoch 116 / 4000
2021-08-25 14:04:40.712 | INFO     | src.policies:train:110 - Episode 116
2021-08-25 14:05:07.004 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 14:05:07.032 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 199.27272727272728, 'equality': 0.9255142667566872, 'sustainability': 490.7914187613911, 'peace': 764.9090909090909}
2021-08-25 14:05:07.032 | INFO     | src.policies:train:122 - Mean episode return: 199.27272727272728
2021-08-25 14:05:07.033 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 206.5127272727273
2021-08-25 14:05:17.33

2021-08-25 14:09:21.001 | INFO     | src.policies:train:122 - Mean episode return: 207.9090909090909
2021-08-25 14:09:21.002 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 205.7563636363637
2021-08-25 14:09:30.323 | INFO     | src.policies:train:159 - Total loss: 0.9974158406257629
2021-08-25 14:09:30.324 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 207.9090909090909, 'equality': 0.9100846682849687, 'sustainability': 477.1056953834955, 'peace': 675.6363636363636}
2021-08-25 14:09:30.374 | INFO     | src.policies:train:103 - Epoch 124 / 4000
2021-08-25 14:09:30.374 | INFO     | src.policies:train:110 - Episode 124
2021-08-25 14:09:53.591 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 14:09:53.616 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 211.27272727272728, 'equality': 0.9343608199042724, 'sustainability': 487.49962352414934, 'peace': 774.2727272727273}
2021-08-25 14:09:53.61

2021-08-25 14:13:38.547 | INFO     | src.policies:train:103 - Epoch 131 / 4000
2021-08-25 14:13:38.548 | INFO     | src.policies:train:110 - Episode 131
2021-08-25 14:14:02.012 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 14:14:02.038 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 221.36363636363637, 'equality': 0.9255553481440069, 'sustainability': 471.8954606985578, 'peace': 731.4545454545455}
2021-08-25 14:14:02.038 | INFO     | src.policies:train:122 - Mean episode return: 221.36363636363637
2021-08-25 14:14:02.039 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 206.03363636363636
2021-08-25 14:14:12.011 | INFO     | src.policies:train:159 - Total loss: 1.002609133720398
2021-08-25 14:14:12.012 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 221.36363636363637, 'equality': 0.9255553481440069, 'sustainability': 471.8954606985578, 'peace': 731.4545454545455}
2021-08-25 14:14:12.0

2021-08-25 14:18:30.135 | INFO     | src.policies:train:159 - Total loss: 1.0050582885742188
2021-08-25 14:18:30.136 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 217.0909090909091, 'equality': 0.9615501751187462, 'sustainability': 493.8885265814053, 'peace': 794.4545454545455}
2021-08-25 14:18:30.190 | INFO     | src.policies:train:103 - Epoch 139 / 4000
2021-08-25 14:18:30.191 | INFO     | src.policies:train:110 - Episode 139
2021-08-25 14:18:55.485 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 14:18:55.512 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 207.36363636363637, 'equality': 0.9320074927278306, 'sustainability': 471.05786822758404, 'peace': 729.4545454545455}
2021-08-25 14:18:55.513 | INFO     | src.policies:train:122 - Mean episode return: 207.36363636363637
2021-08-25 14:18:55.514 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 205.16090909090912
2021-08-25 14:19:05.

2021-08-25 14:22:57.836 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 199.36363636363637, 'equality': 0.916594121794209, 'sustainability': 476.41264874360405, 'peace': 659.4545454545455}
2021-08-25 14:22:57.837 | INFO     | src.policies:train:122 - Mean episode return: 199.36363636363637
2021-08-25 14:22:57.838 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 205.1127272727273
2021-08-25 14:23:08.737 | INFO     | src.policies:train:159 - Total loss: 1.003564476966858
2021-08-25 14:23:08.738 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 199.36363636363637, 'equality': 0.916594121794209, 'sustainability': 476.41264874360405, 'peace': 659.4545454545455}
2021-08-25 14:23:08.801 | INFO     | src.policies:train:103 - Epoch 147 / 4000
2021-08-25 14:23:08.802 | INFO     | src.policies:train:110 - Episode 147
2021-08-25 14:23:34.859 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 14:23:34.88

2021-08-25 14:27:19.743 | INFO     | src.policies:train:103 - Epoch 154 / 4000
2021-08-25 14:27:19.743 | INFO     | src.policies:train:110 - Episode 154
2021-08-25 14:27:45.590 | DEBUG    | src.policies:execute_episode:270 - Early stopping, all agents done
2021-08-25 14:27:45.617 | INFO     | src.policies:train:117 - Episode infos: {'efficiency': 206.72727272727272, 'equality': 0.927240745184152, 'sustainability': 498.0924375206096, 'peace': 742.4545454545455}
2021-08-25 14:27:45.618 | INFO     | src.policies:train:122 - Mean episode return: 206.72727272727272
2021-08-25 14:27:45.618 | INFO     | src.policies:train:123 - Last 100 episodes mean return: 204.85636363636365
2021-08-25 14:27:55.839 | INFO     | src.policies:train:159 - Total loss: 0.9998180270195007
2021-08-25 14:27:55.840 | INFO     | src.policies:train:164 - Epoch infos: {'efficiency': 206.72727272727272, 'equality': 0.927240745184152, 'sustainability': 498.0924375206096, 'peace': 742.4545454545455}
2021-08-25 14:27:55.89

## TRPO

This section trains a set of Harvest agents with our custom Trust Region Policy Optimization (TRPO) implementation.
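The hyperparameters `beta` and `kl_target` set in the next cell suggest a KL-penalized surrogate with an adaptive penalty coefficient rather than the hard KL constraint of the original TRPO algorithm. That reading is an assumption; the notebook does not confirm it. Under that assumption, a minimal NumPy sketch of the objective and the penalty update could look like this (the function names are hypothetical, and the halve/double rule is the heuristic described in the PPO paper):

```python
import numpy as np


def kl_categorical(p_old, p_new):
    """Mean KL(p_old || p_new) over a batch of categorical distributions."""
    return np.mean(np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=-1))


def kl_penalized_objective(ratio, advantages, kl, beta):
    """Surrogate to maximize: E[ratio * advantage] - beta * KL."""
    return np.mean(ratio * advantages) - beta * kl


def adapt_beta(beta, kl, kl_target):
    """Halve beta when KL is well below the target, double it when well above."""
    if kl < kl_target / 1.5:
        return beta / 2.0
    if kl > kl_target * 1.5:
        return beta * 2.0
    return beta


# Toy batch with a single state and two actions (hypothetical numbers)
p_old = np.array([[0.5, 0.5]])
p_new = np.array([[0.9, 0.1]])
kl = kl_categorical(p_old, p_new)
new_beta = adapt_beta(beta=1.0, kl=kl, kl_target=0.01)
```

In this toy case the new policy has moved far from the old one, so the KL exceeds the target and the penalty coefficient is doubled.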

In [None]:
beta = 1.0
kl_target = 0.01

In [None]:
trpo_policy_nn = models.MLP(observation_space_size, hidden_sizes, action_space_size)
trpo_baseline_nn = models.MLP(observation_space_size, hidden_sizes, 1, log_softmax=False)
trpo_policy = policies.TRPOPolicy(env, trpo_policy_nn, trpo_baseline_nn, beta=beta, kl_target=kl_target)
trpo_policy.train(
    epochs,
    steps_per_epoch,
    enable_wandb=True,
    wandb_config={**wandb_config, "group": "TRPO"}
)

## PPO

This section trains a set of Harvest agents with our custom Proximal Policy Optimization (PPO) implementation.
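The hyperparameter names `c1`, `c2`, and `eps` set in the next cell match those of the combined objective in the PPO paper: `eps` is the clipping range, `c1` weights the value-function loss, and `c2` weights the entropy bonus. Assuming the implementation follows that convention (the notebook does not say so explicitly), the loss can be sketched in NumPy as follows; `ppo_loss` is a hypothetical name, not a function from `src.policies`.

```python
import numpy as np


def ppo_loss(ratio, advantages, values, returns, entropy, c1=1.0, c2=0.01, eps=0.2):
    """Loss to minimize: -L_clip + c1 * value_loss - c2 * entropy."""
    # Clip the probability ratio pi_new / pi_old to [1 - eps, 1 + eps]
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Pessimistic (element-wise minimum) surrogate, averaged over the batch
    policy_term = np.mean(np.minimum(ratio * advantages, clipped * advantages))
    # Squared error between the baseline's estimates and the observed returns
    value_term = np.mean((values - returns) ** 2)
    entropy_term = np.mean(entropy)
    return -policy_term + c1 * value_term - c2 * entropy_term


# Toy batch of two samples (hypothetical numbers)
ratio = np.array([1.5, 0.5])
advantages = np.array([1.0, -1.0])
values = np.array([0.0, 0.0])
returns = np.array([0.0, 0.0])
entropy = np.array([0.0, 0.0])
loss = ppo_loss(ratio, advantages, values, returns, entropy)
```

Here the first sample is clipped to a ratio of 1.2 because its advantage is positive, while the second keeps the more pessimistic clipped value of 0.8 times a negative advantage.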

In [None]:
c1 = 1.0
c2 = 0.01
eps = 0.2

In [None]:
ppo_policy_nn = models.MLP(observation_space_size, hidden_sizes, action_space_size)
ppo_baseline_nn = models.MLP(observation_space_size, hidden_sizes, 1, log_softmax=False)
ppo_policy = policies.PPOPolicy(env, ppo_policy_nn, ppo_baseline_nn, c1=c1, c2=c2, eps=eps)
ppo_policy.train(
    epochs,
    steps_per_epoch,
    enable_wandb=True,
    wandb_config={**wandb_config, "group": "PPO"}
)