# Management strategy evaluation

Rather than comparing a handful of hand-picked strategies, we can optimize a strategy directly. Here we introduce a technique for multi-dimensional optimization of nonlinear, stochastic systems based on Gaussian processes (often called Bayesian optimization). While we have so far considered agents whose behavior is set by a single parameter, this approach is also suitable for more complex policies involving multiple parameters.
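As a rough illustration of what a multi-parameter policy could look like, the sketch below defines a two-parameter agent and an objective that accepts a list of parameters, which is the form a multi-dimensional optimizer passes in. Everything here is an assumption made for illustration: `ThresholdAgent`, `toy_simulate`, and the logistic stock model are hypothetical and are not the `fish` environment, and a crude grid search stands in for the Gaussian-process optimizer so that the example needs only the standard library.

```python
import random


class ThresholdAgent:
    """Hypothetical two-parameter policy: no harvest at or below `threshold`,
    constant `effort` above it."""

    def __init__(self, threshold, effort):
        self.threshold = threshold
        self.effort = effort

    def predict(self, obs, **kwargs):
        return self.effort if obs > self.threshold else 0.0


def toy_simulate(agent, seed, steps=100, r=0.3, K=1.0, sigma=0.05):
    """Toy stochastic logistic stock (an assumption, not the `fish` env)."""
    rng = random.Random(seed)
    stock, total = 0.5, 0.0
    for _ in range(steps):
        harvest = agent.predict(stock) * stock
        total += harvest
        stock -= harvest
        # logistic growth plus multiplicative noise
        stock += r * stock * (1 - stock / K) + rng.gauss(0, sigma) * stock
        stock = max(stock, 0.0)
    return total


def objective(x):
    """Negative mean reward; x is a list of parameters [threshold, effort]."""
    threshold, effort = x
    agent = ThresholdAgent(threshold, effort)
    rewards = [toy_simulate(agent, seed) for seed in range(20)]
    return -sum(rewards) / len(rewards)


# Crude grid search standing in for the Gaussian-process optimizer
grid = [(t / 10, e / 10) for t in range(0, 10) for e in range(0, 11)]
best = min(grid, key=objective)
print(best, -objective(best))
```

With a real optimizer, the grid would be replaced by one search interval per parameter; the objective itself would not need to change.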

In [None]:
from fish import fish
import numpy as np
import matplotlib.pyplot as plt
from skopt import gp_minimize, gbrt_minimize
from skopt.plots import plot_objective, plot_convergence, plot_gaussian_process
from utils import simulate
from utils import plot_sim


In [None]:
env = fish()

In [None]:
# A simple agent
class some_agent:
    def __init__(self, effort):
        self.effort = effort

    def predict(self, obs, **kwargs):
        return self.effort



To improve convergence, we define an objective function that computes the average reward over 100 simulations. Because the optimizer seeks to minimize its objective, we return the negative of the average episode reward.
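Before wiring this to the simulator, the effect of averaging can be seen on a toy problem. The sketch below is only an illustration under stated assumptions: `noisy_reward` is a hypothetical stand-in for one simulated episode (a smooth reward curve plus seeded Gaussian noise), not the `fish` environment. It compares the spread of the negated objective when evaluated from a single replicate versus the mean of 100 replicates.

```python
import random
import statistics


def noisy_reward(effort, seed):
    """Hypothetical stand-in for one episode: a reward curve peaking at
    effort = 0.1, plus seeded Gaussian noise."""
    rng = random.Random(seed)
    return 1.0 - 100 * (effort - 0.1) ** 2 + rng.gauss(0, 0.5)


def objective(effort, n_reps, offset=0):
    """Negative mean reward over n_reps seeded replicates."""
    rewards = [noisy_reward(effort, offset + i) for i in range(n_reps)]
    return -statistics.mean(rewards)


# Spread of the objective at a fixed effort, using disjoint blocks of seeds
single = [objective(0.1, 1, offset=k) for k in range(200)]
averaged = [objective(0.1, 100, offset=1000 + 100 * k) for k in range(200)]
print(statistics.stdev(single), statistics.stdev(averaged))
```

For independent replicates, the standard deviation of the mean shrinks roughly as one over the square root of the number of replicates, so the optimizer sees a much smoother surface at the cost of more simulation time per evaluation.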

In [None]:


def g(x):
    agent = some_agent(x)
    def my_function(i):
        np.random.seed(i)
        df, mu = simulate(agent, env, timeseries=False)
        return mu
    # do 100 simulations at each value to reduce noise    
    results = [my_function(i) for i in range(100)]
    return -np.mean(results)



In [None]:
%%time
# search for effort values in the interval [0, 0.2]
res = gp_minimize(g, [(0, .2)], n_calls = 20, verbose=True, n_jobs=-1)
res.fun, res.x

In [None]:

agent = some_agent(*res.x)
df, mu = simulate(agent, env)
print(mu)
plot_sim(df)

In [None]:

ax2 = plot_convergence(res)

plt.show()

In [None]:

ax2 = plot_gaussian_process(res)

plt.show()

This approach still requires that we specify the general form of the agent's behavior in advance. In principle, an agent could respond differently to every possible observation; that is, a policy can be any map `action_t = agent(observation_t)`. Neural networks have proven to be highly flexible function approximators given ample data. In the next section we use neural networks as general-purpose maps from observation space to action space.
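To make the idea of a policy as a map from observation to action concrete, here is a minimal sketch of a one-hidden-layer network with a `predict` method. It is a hypothetical illustration only: `TinyMLPAgent` is not part of any library, its weights are random rather than trained, and it assumes a scalar observation and an action in (0, 1), which may not match the actual observation and action spaces of the environment.

```python
import math
import random


class TinyMLPAgent:
    """Hypothetical one-hidden-layer network mapping a scalar observation
    to an action in (0, 1). Weights are random, not trained."""

    def __init__(self, hidden=8, seed=0):
        rng = random.Random(seed)
        self.w1 = [rng.gauss(0, 1) for _ in range(hidden)]
        self.b1 = [rng.gauss(0, 1) for _ in range(hidden)]
        self.w2 = [rng.gauss(0, 1) for _ in range(hidden)]
        self.b2 = rng.gauss(0, 1)

    def predict(self, obs, **kwargs):
        # hidden layer with tanh activation
        hidden = [math.tanh(w * obs + b) for w, b in zip(self.w1, self.b1)]
        z = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes into (0, 1)


agent = TinyMLPAgent()
for obs in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(obs, round(agent.predict(obs), 3))
```

Training amounts to adjusting the weights so that the resulting actions earn high reward, which is what the reinforcement learning algorithms do at much larger scale.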

# RL Approaches

In [None]:
from sb3_contrib import TQC, ARS
from stable_baselines3 import PPO, A2C, DQN, SAC, TD3
from stable_baselines3.common.env_util import make_vec_env
# train on 12 parallel copies of the environment
vec_env = make_vec_env(fish, n_envs=12)


In [None]:
model = ARS("MlpPolicy", vec_env, verbose=0, tensorboard_log="/home/jovyan/logs")
model.learn(total_timesteps=800_000, tb_log_name="ars-fish", progress_bar=True)
model.save("ars_fish")

In [None]:
from utils import policy_fn
from utils import simulate_rl

# load() is a class method that returns a new model, so no placeholder model is needed
agent = ARS.load("ars_fish", device="cpu")


In [None]:

df, mu = simulate_rl(agent, env)
print(mu)
plot_sim(df)

In [None]:
model = PPO("MlpPolicy", vec_env, verbose=0, tensorboard_log="/home/jovyan/logs", use_sde=True, device = "cpu")
model.learn(total_timesteps=800_000, tb_log_name="ppo-fish", progress_bar=True)
model.save("ppo_fish")

In [None]:
# results
# load() is a class method that returns a new model, so no placeholder model is needed
agent = PPO.load("ppo_fish", device="cpu")

df, mu = simulate_rl(agent, env)
print(mu)
plot_sim(df)

In [None]:
policy_fn(agent, env)

In [None]:
model = TQC("MlpPolicy", vec_env, verbose=0, tensorboard_log="/home/jovyan/logs", use_sde=True, device = "cuda")
model.learn(total_timesteps=200_000, tb_log_name="tqc-fish", progress_bar=False)
model.save("tqc_fish")

In [None]:
# load() is a class method that returns a new model, so no placeholder model is needed
agent = TQC.load("tqc_fish", device="cpu")

df, mu = simulate_rl(agent, env)
print(mu)
plot_sim(df)

In [None]:
policy_fn(agent, env)