# Part 2: other environments and RL training
---

In this notebook we will go over some of the variations of `greenCrabEnv` available in this package, and over the syntax for training RL algorithms on instances of these environments.

## 0. Setup
---
As in Part 1 of this series, run the following cell to install our package if you haven't done so already (uncomment it first if it is commented out). Then restart the Jupyter kernel.

In [49]:
%pip install -e ..

Obtaining file:///home/rstudio/rl4greencrab
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Installing backend dependencies ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: rl4greencrab
  Building editable for rl4greencrab (pyproject.toml) ... done
  Created wheel for rl4greencrab: filename=rl4greencrab-1.0.0-py2.py3-none-any.whl size=1067 sha256=a8b1e2f585a58a3eb81cb889e2b33bc31b60bf519ff32c90d914324158f7408e
  Stored in directory: /tmp/pip-ephem-wheel-cache-eh74nj24/wheels/e9/7e/e6/00c4b11a2574abd59d64425d537139e25fadbde37f002c4dba
Successfully built rl4greencrab
Installing collected packages: rl4greencrab
  Attempting uninstall: rl4greencrab
    Found existing installation: rl4greencrab 1.0.0
    Uninstalling rl4greencrab-1.0.0:
      Successfully uninstalled

In [50]:
import numpy as np
import pandas as pd
from plotnine import ggplot, aes, geom_density, geom_line, geom_point, geom_violin, facet_grid, labs, theme, facet_wrap
import stable_baselines3

## 1. Other envs
---

We will go over two other envs provided by our package: `greenCrabSimplifiedEnv` and `timeSeriesEnv`.
Let's focus on the first one of these envs.

### greenCrabSimplifiedEnv

`greenCrabSimplifiedEnv` is closely related to `greenCrabEnv` and differs from it only in a few small aspects.
Let's examine these aspects one by one.
The first is its action space:

In [51]:
from rl4greencrab import greenCrabSimplifiedEnv
gcse = greenCrabSimplifiedEnv()
gcse.action_space

Box(-1.0, 1.0, (1,), float32)

Actions in `greenCrabSimplifiedEnv` lie between -1 and +1 (in contrast to `greenCrabEnv`, where they lie in [0, 2000]).
This difference in action space is purely conceptual: we map the segment [-1, 1] linearly onto the segment [0, 2000] so that, e.g., an action of -1 corresponds to 0 traps laid, an action of 0 corresponds to 1000 traps laid, and an action of +1 corresponds to 2000 traps laid.
Mathematically, this transformation is
$$a = A / 1000 - 1, \tag{1}$$
where $A\in[0,2000]$ is the number of traps laid and $a\in[-1,1]$ is the corresponding action.
This transformation of the action space is performed for purely computational reasons related to hyperparameter tuning of RL algorithms.
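
The transformation above and its inverse can be sketched in a few lines of plain Python. The function names `traps_to_action` and `action_to_traps` are illustrative assumptions and are not part of `rl4greencrab`; this is only meant to make the mapping concrete.

```python
MAX_TRAPS = 2000  # upper end of the greenCrabEnv action range quoted above


def traps_to_action(traps: float) -> float:
    """Map a trap count A in [0, 2000] to an action a in [-1, 1]: a = A / 1000 - 1."""
    return traps / (MAX_TRAPS / 2) - 1


def action_to_traps(action: float) -> float:
    """Inverse map: A = 1000 * (a + 1)."""
    return (MAX_TRAPS / 2) * (action + 1)


# The three reference points mentioned in the text: 0, 1000 and 2000 traps.
for traps in (0, 1000, 2000):
    print(traps, "->", traps_to_action(traps))
```

For instance, `action_to_traps(-0.1)` is approximately 900, so an action of -0.1 corresponds to laying about 900 traps.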

A second difference of `greenCrabSimplifiedEnv` with respect to `greenCrabEnv` is in its observation space.

In [42]:
gcse.observation_space

Box(-1.0, 1.0, (3,), float32)

In [43]:
gcse.reset()

(array([-1., -1., -1.], dtype=float32), {})

Here, observations are vectors with *three* components instead of nine, and each component takes values in [-1, 1].
For example, consider the following observation after a second time-step:

In [6]:
gcse.step(np.float32([-0.1]))[0]

array([-1. , -1. , -0.1], dtype=float32)

These three numbers correspond to: 1. the catch per 100 traps in the first five months of the year, 2. the catch per 100 traps in the later four months of the year, 3. the number of traps.
These three numbers are transformed to [-1, 1] in a similar fashion to eq. (1).

This simplifies the observations, making it easier for RL algorithms to exploit the information they provide.
Because of this, we will train our algorithms on `greenCrabSimplifiedEnv` rather than `greenCrabEnv`.
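
The rescaling of observation components to [-1, 1] described above can be illustrated with a generic affine map. This is a sketch under stated assumptions, not the package's implementation: the helper names `rescale` and `unscale` are hypothetical, and the upper bound used for the catch components (`MAX_CPUE`) is an invented placeholder, since the actual bounds are not given here. Only the trap bounds [0, 2000] come from the text.

```python
def rescale(x: float, low: float, high: float) -> float:
    """Affinely map x in [low, high] to [-1, 1]."""
    return 2 * (x - low) / (high - low) - 1


def unscale(y: float, low: float, high: float) -> float:
    """Inverse of rescale: map y in [-1, 1] back to [low, high]."""
    return low + (y + 1) * (high - low) / 2


# Third observation component: trap count with bounds [0, 2000].
# The observed value -0.1 corresponds to roughly 900 traps.
print(unscale(-0.1, 0, 2000))

# Catch per 100 traps: HYPOTHETICAL upper bound, chosen only for illustration.
MAX_CPUE = 50.0
print(rescale(0.0, 0.0, MAX_CPUE))  # the lower bound (zero catch) maps to -1
```

With `low=0` and `high=2000`, `rescale` reduces to the action transformation $a = A/1000 - 1$, which is why the third component of the observation shown earlier matches the action of -0.1 that was taken.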

### timeSeriesEnv

TBD.

## 2. Training RL algos
---

Here we cover some basic syntax for training RL algorithms on our envs.
We use short training runs here for the sake of brevity.
Convergence typically requires upwards of 1 million time-steps, and possibly up to 10 million.
The exact number will, however, depend on the particular algorithm used.

**Note:** This package also provides a more ergonomic syntax for training through the `train.py` script.
To train models this way, run the following command in the terminal:

`python scripts/train.py -f hyperpars/ppo-gcse.yml`

There, we encode the input to the training algorithm as a YAML file.

In [52]:
from stable_baselines3 import PPO, TD3
from sb3_contrib import TQC
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.noise import NormalActionNoise

gcse = greenCrabSimplifiedEnv()  # single environment instance (used for TD3 below)
vec_env = make_vec_env(greenCrabSimplifiedEnv, n_envs=12)  # 12 copies wrapped as one vectorized environment

### PPO

In [54]:
model = PPO("MlpPolicy", vec_env, verbose=0, 
    n_steps=2048,
    batch_size=128,
    n_epochs=8,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    tensorboard_log="/home/rstudio/logs")

model.learn(
    total_timesteps=1_000_000,
    progress_bar=True,
)
model.save("ppo_gcse")

Output()
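
To get a feel for what these PPO settings imply, the following back-of-the-envelope calculation counts rollouts and gradient steps. It assumes the usual Stable-Baselines3 on-policy loop, in which each rollout collects `n_steps` transitions from each of the `n_envs` environments and training continues until at least `total_timesteps` transitions have been gathered. The variable names are ours.

```python
import math

n_envs = 12                  # from make_vec_env(..., n_envs=12)
n_steps = 2048               # steps collected per environment per rollout
batch_size = 128
n_epochs = 8
total_timesteps = 1_000_000

rollout_size = n_envs * n_steps                          # transitions per rollout
n_rollouts = math.ceil(total_timesteps / rollout_size)   # rollouts needed to reach the budget
minibatches_per_epoch = rollout_size // batch_size
grad_steps_per_rollout = minibatches_per_epoch * n_epochs

print(f"transitions per rollout:    {rollout_size}")
print(f"number of rollouts:         {n_rollouts}")
print(f"gradient steps per rollout: {grad_steps_per_rollout}")
print(f"total gradient steps:       {n_rollouts * grad_steps_per_rollout}")
```

Under these assumptions, a budget of 1 million time-steps amounts to only a few dozen policy updates, which is one reason longer runs may be needed for convergence.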

### TD3

In [34]:

# Add some action noise for exploration
n_actions = gcse.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = TD3(
    "MlpPolicy",
    gcse,
    batch_size=128,
    action_noise=action_noise,  # Gaussian exploration noise defined above
    gamma=0.999,
    policy_delay=5,
    verbose=0,
    tensorboard_log="/home/rstudio/logs",
)
model.learn(
    total_timesteps=1_000_000,
    progress_bar=True,
)
model.save("td3_gcse")

Output()
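
The role of the action noise can be illustrated without any RL library. The sketch below assumes, as we understand Stable-Baselines3's behavior for off-policy algorithms, that the noise is added to the policy's output during data collection and the result is clipped to the action bounds. The function `noisy_action` is a hypothetical helper, not an SB3 API.

```python
import random

SIGMA = 0.1  # same standard deviation as the NormalActionNoise above


def noisy_action(policy_action: float, rng: random.Random) -> float:
    """Add zero-mean Gaussian noise, then clip to the action bounds [-1, 1]."""
    perturbed = policy_action + rng.gauss(0.0, SIGMA)
    return max(-1.0, min(1.0, perturbed))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [noisy_action(0.0, rng) for _ in range(5)]
print(samples)

# Since 1 unit of action equals 1000 traps, one standard deviation of
# exploration noise corresponds to about SIGMA * 1000 traps.
print(SIGMA * 1000)
```

In trap units, `sigma=0.1` therefore means the agent explores around its chosen trap count with a spread of roughly 100 traps.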

### TQC

In [32]:
model = TQC("MlpPolicy", vec_env, verbose=0, tensorboard_log="/home/rstudio/logs")
model.learn(
    total_timesteps=50_000,
    progress_bar=True,
)
model.save("tqc_gcse")

Output()

## 3. Loading and evaluating RL algos
---

In [55]:
ppoAgent = PPO.load("ppo_gcse")
td3Agent = TD3.load("td3_gcse")
tqcAgent = TQC.load("tqc_gcse")
evalEnv = greenCrabSimplifiedEnv()

In [56]:
from stable_baselines3.common.evaluation import evaluate_policy

### PPO

In [57]:
mean_rew, std_rew = evaluate_policy(ppoAgent, evalEnv)
print(f"PPO reward = {mean_rew:.5f} +/- {std_rew:.5f}")

PPO reward = -0.01739 +/- 0.00006


### TD3

In [28]:
mean_rew, std_rew = evaluate_policy(td3Agent, evalEnv)
print(f"TD3 reward = {mean_rew:.5f} +/- {std_rew:.5f}")

TD3 reward = -0.01740 +/- 0.00005


### TQC

In [11]:
mean_rew, std_rew = evaluate_policy(tqcAgent, evalEnv)
print(f"TQC reward = {mean_rew:.5f} +/- {std_rew:.5f}")

TQC reward = -0.02030 +/- 0.00010
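
For readers curious about what the two numbers returned by `evaluate_policy` mean, here is a minimal sketch of the same idea: run several episodes, sum the rewards in each, and report the mean and standard deviation of those sums. `ToyEnv`, its reward, and the `evaluate` helper are hypothetical stand-ins written for this illustration. They are not part of `rl4greencrab` or Stable-Baselines3, and the toy reward has nothing to do with the green crab model.

```python
import statistics


class ToyEnv:
    """HYPOTHETICAL stand-in with a Gymnasium-style reset/step interface."""

    def __init__(self, horizon: int = 5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0, {}

    def step(self, action: float):
        self.t += 1
        reward = -abs(action)                 # toy reward: penalize large actions
        terminated = self.t >= self.horizon   # episode ends after `horizon` steps
        return float(self.t), reward, terminated, False, {}


def evaluate(policy, env, n_episodes: int = 10):
    """Return the mean and population std of the undiscounted episode returns."""
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            total += reward
            done = terminated or truncated
        returns.append(total)
    return statistics.fmean(returns), statistics.pstdev(returns)


# A constant policy on a deterministic toy env gives identical episode returns.
mean_rew, std_rew = evaluate(lambda obs: 0.5, ToyEnv())
print(f"toy reward = {mean_rew:.5f} +/- {std_rew:.5f}")
```

A small standard deviation, as in the results above, indicates that the episode returns barely vary from one evaluation episode to the next.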
