# Tutorial 05 - Settings Optimizing

As in other learning-based approaches, model hyperparameters play an important role in RL. Environment configurations such as observation items or reward weights also affect the outcome substantially. It is therefore critical to support a fine-tuning process for these settings. We currently offer three separate options:
* optimize observation configurations
* optimize reward configurations
* optimize model hyperparameters

In CommonRoad-RL, these are implemented with the [Optuna](https://optuna.org) package. With Optuna's interfaces, an optimization process consists of several rounds of learning, called "trials" in the package. Each trial is started with a set of sampled configurations/hyperparameters and otherwise behaves exactly like a vanilla learning process. After all trials have been conducted, the best-performing set of configurations/hyperparameters is reported. Note that different criteria are used for optimizing configurations and for optimizing hyperparameters. A more detailed description can be found in [commonroad_rl/README.md](https://gitlab.lrz.de/ss20-mpfav-rl/commonroad-rl/-/blob/development/commonroad_rl/README.md).

In this tutorial, we show an example of optimizing reward configurations. The interfaces for optimizing observation configurations and model hyperparameters are similar.
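
To make the idea of "trials" concrete, the following is a minimal sketch of the loop that an optimizer runs conceptually: sample a configuration, run one trial, and keep the best result. It does not use Optuna; `run_trial`, `random_search`, and the `reward_weight` key are hypothetical names chosen for illustration, and the toy cost stands in for a full learning process.

```python
import random


def run_trial(config):
    """Hypothetical stand-in for one learning trial; returns a cost (lower is better)."""
    # Toy cost: squared distance of the sampled weight from an arbitrary optimum of 2.0
    return (config["reward_weight"] - 2.0) ** 2


def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best_config, best_cost = None, float("inf")
    for _ in range(n_trials):
        # Sample one candidate configuration for this trial
        config = {"reward_weight": rng.uniform(0.0, 5.0)}
        cost = run_trial(config)
        # Keep the best-performing configuration seen so far
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost


best_config, best_cost = random_search(n_trials=20)
print(best_config, best_cost)
```

In the rest of this tutorial, Optuna takes over the sampling and bookkeeping, and each trial is a real learning run.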

## 0. Preparation

Please make sure the training and testing data are prepared; otherwise, see **Tutorial 01 - Data Preprocessing**. It is highly recommended to complete both **Tutorial 02 - Vanilla Learning** and **Tutorial 03 - Continual Learning** first. Also, check the following:
* the current path is the project root `commonroad-rl`, i.e. one level above the `tutorials` folder
* the interactive Python kernel is started from the correct environment

In [None]:
# Check current path
%cd ..
%pwd

# Check interactive python kernel
import sys
sys.executable

## 1. Load RL environment and model settings

As in a vanilla learning process, we have to specify the environment configurations and model hyperparameters beforehand. However, a major difference for configuration optimization is that, besides assigning direct values, we also prepare a set of sampling settings so that the optimizer later knows which sampling method and which sampling range/candidates to use for each item. Please see `commonroad_rl/gym_commonroad/configs.yaml` for details.

In [None]:
import os
import yaml
import copy
 
# Read in environment configurations 
env_configs = {}
with open("commonroad_rl/gym_commonroad/configs.yaml", "r") as config_file:
    env_configs = yaml.safe_load(config_file)["env_configs"]

# Save settings for later use
log_path = "tutorials/logs/"
os.makedirs(log_path, exist_ok=True)

with open(os.path.join(log_path, "environment_configurations.yml"), "w") as config_file:
    yaml.dump(env_configs, config_file)

# Read in model hyperparameters
hyperparams = {}
with open("commonroad_rl/hyperparams/ppo2.yml", "r") as hyperparam_file:
    hyperparams = yaml.safe_load(hyperparam_file)["commonroad-v0"]
    
# Save settings for later use
with open(os.path.join(log_path, "model_hyperparameters.yml"), "w") as hyperparam_file:
    yaml.dump(hyperparams, hyperparam_file)

# Remove `normalize` as it will be handled explicitly later
if "normalize" in hyperparams:
    del hyperparams["normalize"]
    
# Read in sampling settings for reward configurations
sampling_settings_reward_configs = {}
with open("commonroad_rl/gym_commonroad/configs.yaml", "r") as config_file:
    sampling_settings_reward_configs = yaml.safe_load(config_file)["sampling_setting_reward_configs"]
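
The objective function in Section 5 reads each entry of the sampling settings as a single `method: range/candidates` pair. The sketch below shows that structure as a plain dictionary and checks it for common mistakes. The key names and value ranges are purely illustrative and are not taken from `configs.yaml`; `check_sampling_settings` is a hypothetical helper, not part of CommonRoad-RL.

```python
# Hypothetical sampling settings; key names and ranges are illustrative only
example_sampling_settings = {
    "reward_goal_reached": {"uniform": [10.0, 100.0]},
    "reward_collision": {"categorical": [-100.0, -50.0, -10.0]},
    "reward_time_step": {"loguniform": [0.001, 1.0]},
}

# The three methods handled by the objective function in this tutorial
SUPPORTED_METHODS = {"categorical", "uniform", "loguniform"}


def check_sampling_settings(settings):
    """Return a list of human-readable problems; the list is empty if all entries look valid."""
    problems = []
    for key, value in settings.items():
        if not isinstance(value, dict) or len(value) != 1:
            problems.append(f"{key}: expected exactly one 'method: candidates' pair")
            continue
        method, interval = next(iter(value.items()))
        if method not in SUPPORTED_METHODS:
            problems.append(f"{key}: unsupported method '{method}'")
        elif method == "categorical":
            if len(interval) == 0:
                problems.append(f"{key}: no candidates given")
        elif len(interval) != 2 or not interval[0] < interval[1]:
            problems.append(f"{key}: expected [low, high] with low < high")
        elif method == "loguniform" and interval[0] <= 0:
            problems.append(f"{key}: loguniform requires low > 0")
    return problems


print(check_sampling_settings(example_sampling_settings))
```

Running such a check before starting a study can catch a malformed entry early instead of during the first trial.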

## 2. Create training and testing environments

Training and testing environments also have to be prepared for settings optimization. However, since the reward configurations are sampled and optimized later, we do not create the environments directly at this point. Instead, we define a callable function and pass it to the optimizer interface from the [Optuna](https://optuna.org) package.

In [None]:
import gym
from stable_baselines.bench import Monitor
from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize

import commonroad_rl.gym_commonroad

# Prepare a low-level environment maker
meta_scenario_path = "tutorials/data/highD/pickles/meta_scenario"
training_data_path = "tutorials/data/highD/pickles/problem_train"
testing_data_path = "tutorials/data/highD/pickles/problem_test"

def make_env(**env_kwargs):
    def _func():
        # Create the environment
        env = gym.make("commonroad-v0", 
                       meta_scenario_path=meta_scenario_path,
                       train_reset_config_path= training_data_path,
                       test_reset_config_path= testing_data_path,
                       **env_kwargs)

        # Wrap the environment with a monitor to keep a record of the learning process
        info_keywords = ("is_collision",
                         "is_time_out",
                         "is_off_road",
                         "is_friction_violation",
                         "is_goal_reached")
        env = Monitor(env, log_path + "infos", info_keywords=info_keywords)
        return env
    return _func

# Prepare a callable function for the optimizer
def create_env(**env_kwargs):
    # Vectorize the environment
    env = DummyVecEnv([make_env(**env_kwargs)])

    # Normalize observations and rewards as required
    if "test_env" not in env_kwargs or env_kwargs["test_env"] is False:
        # Normalize observations and rewards during training
        env = VecNormalize(env, norm_obs=True, norm_reward=True)
    else:
        # Normalize only observations during testing
        env = VecNormalize(env, norm_obs=True, norm_reward=False)

    return env
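
`DummyVecEnv` expects a list of callables rather than ready-made environments, which is why `make_env` returns the inner function `_func` instead of an environment. The sketch below illustrates this deferred-construction pattern in isolation. `FakeEnv`, `make_fake_env`, and the `reward_goal` keyword are hypothetical names for illustration only.

```python
class FakeEnv:
    """Hypothetical stand-in for a gym environment; it only stores its keyword arguments."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


def make_fake_env(**env_kwargs):
    def _func():
        # Nothing is constructed until the returned function is called
        return FakeEnv(**env_kwargs)
    return _func


# Build two factories with different (illustrative) reward settings
factories = [make_fake_env(reward_goal=50.0), make_fake_env(reward_goal=100.0)]

# The environments are created only here, one per factory call
envs = [factory() for factory in factories]
print([env.kwargs for env in envs])
```

Because every call to a factory builds a fresh environment, each optimization trial can get its own environment with its own sampled configurations.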


## 3. Create a model

Similarly, we do not create a model explicitly but prepare a callable function and pass it to the optimizer later.

In [None]:
from stable_baselines import PPO2

def create_model(hyperparams, env_configs):
    return PPO2(env=create_env(**env_configs), **hyperparams)

## 4. Assemble an evaluation callback with optimization criteria

During an optimization process, it is important to assess how well a set of sampled configurations has performed. This is done with an evaluation callback, which is attached to every learning trial. In `commonroad_rl/utils_run/callbacks.py`, specific callback functions are defined for reward configurations, observation configurations, and model hyperparameters. In the following, we show how the callback for reward configurations is set up for later use.

In [None]:
import numpy as np
from stable_baselines.common.callbacks import EvalCallback
from stable_baselines.common.vec_env import sync_envs_normalization, VecEnv

class RewardConfigsTrialEvalCallback(EvalCallback):
    def __init__(
        self,
        eval_env,
        trial,
        n_eval_episodes=5,
        eval_freq=10000,
        log_path=None,
        best_model_save_path=None,
        deterministic=True,
        verbose=1,
    ):
        super(RewardConfigsTrialEvalCallback, self).__init__(
            eval_env=eval_env,
            n_eval_episodes=n_eval_episodes,
            eval_freq=eval_freq,
            deterministic=deterministic,
            verbose=verbose,
        )
        self.trial = trial
        self.eval_idx = 0
        self.is_pruned = False
        self.lowest_mean_cost = np.inf
        self.last_mean_cost = np.inf
        self.cost = 0.0

        # Save best model into `($best_model_save_path)/trial_($trial_number)/best_model.zip`
        if best_model_save_path is not None:
            self.best_model_save_path = os.path.join(
                best_model_save_path, "trial_" + str(trial.number), "best_model"
            )
            # Create only the parent folder; `model.save` writes `best_model.zip` into it
            os.makedirs(os.path.dirname(self.best_model_save_path), exist_ok=True)
        else:
            self.best_model_save_path = best_model_save_path

        # Log evaluation information into `($log_path)/trial_($trial_number)/evaluations.npz`
        self.evaluation_timesteps = []
        self.evaluation_costs = []
        self.evaluation_lengths = []
        if log_path is not None:
            self.log_path = os.path.join(
                log_path, "trial_" + str(trial.number), "evaluations"
            )
            os.makedirs(os.path.dirname(self.log_path), exist_ok=True)
        else:
            self.log_path = log_path

    def _on_step(self):
        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:
            def evaluate_policy_configs(
                model,
                env,
                n_eval_episodes=10,
                render=False,
                deterministic=True,
                callback=None,
            ):
                """
                Runs policy for `n_eval_episodes` episodes and returns cost for optimization.
                This is made to work only with one env.

                :param model: (BaseRLModel) The RL agent you want to evaluate.
                :param env: (gym.Env or VecEnv) The gym environment. In the case of a `VecEnv`, this must contain only one environment.
                :param n_eval_episodes: (int) Number of episodes to evaluate the agent
                :param deterministic: (bool) Whether to use deterministic or stochastic actions
                :param render: (bool) Whether to render the environment or not
                :param callback: (callable) callback function to do additional checks, called after each step.
                :return: ([float], [int]) list of episode costs and lengths
                """
                if isinstance(env, VecEnv):
                    assert (
                        env.num_envs == 1
                    ), "You must pass only one environment when using this function"

                episode_costs = []
                episode_lengths = []
                for _ in range(n_eval_episodes):
                    obs = env.reset()
                    done, info, state = False, None, None

                    # Record required information
                    # Since vectorized environments get reset automatically after each episode,
                    # we have to keep a copy of the relevant states here.
                    # See https://stable-baselines.readthedocs.io/en/master/guide/vec_envs.html for more details.
                    episode_length = 0
                    episode_cost = 0.0
                    episode_is_time_out = []
                    episode_is_collision = []
                    episode_is_off_road = []
                    episode_is_goal_reached = []
                    episode_is_friction_violation = []
                    while not done:
                        action, state = model.predict(
                            obs, state=state, deterministic=deterministic
                        )
                        obs, reward, done, info = env.step(action)

                        episode_length += 1
                        episode_is_time_out.append(info[-1]["is_time_out"])
                        episode_is_collision.append(info[-1]["is_collision"])
                        episode_is_off_road.append(info[-1]["is_off_road"])
                        episode_is_goal_reached.append(info[-1]["is_goal_reached"])
                        episode_is_friction_violation.append(info[-1]["is_friction_violation"])

                        if callback is not None:
                            callback(locals(), globals())
                        if render:
                            env.render()

                    # Calculate cost for optimization from state information
                    normalized_episode_length = (
                        episode_length / info[-1]["max_episode_time_steps"]
                    )
                    if episode_is_time_out[-1]:
                        episode_cost += 10.0 * (1 / normalized_episode_length)
                    if episode_is_collision[-1]:
                        episode_cost += 10.0 * (1 / normalized_episode_length)
                    if episode_is_off_road[-1]:
                        episode_cost += 10.0 * (1 / normalized_episode_length)
                    if episode_is_friction_violation[-1]:
                        episode_cost += 10.0 * (1 / normalized_episode_length)
                    if episode_is_goal_reached[-1]:
                        episode_cost -= 10.0 * normalized_episode_length

                    episode_costs.append(episode_cost)
                    episode_lengths.append(episode_length)

                return episode_costs, episode_lengths

            sync_envs_normalization(self.training_env, self.eval_env)
            episode_costs, episode_lengths = evaluate_policy_configs(
                self.model,
                self.eval_env,
                n_eval_episodes=self.n_eval_episodes,
                render=self.render,
                deterministic=self.deterministic,
            )

            mean_cost, std_cost = np.mean(episode_costs), np.std(episode_costs)
            mean_length, std_length = np.mean(episode_lengths), np.std(episode_lengths)
            self.last_mean_cost = mean_cost

            if self.verbose > 0:
                print("Evaluating at learning time step: {}".format(self.num_timesteps))
                print("Cost mean: {:.2f}, std: {:.2f}".format(mean_cost, std_cost))
                print("Length mean: {:.2f}, std: {:.2f}".format(mean_length, std_length))

            if self.log_path is not None:
                self.evaluation_timesteps.append(self.num_timesteps)
                self.evaluation_costs.append(episode_costs)
                self.evaluation_lengths.append(episode_lengths)
                np.savez(
                    self.log_path,
                    timesteps=self.evaluation_timesteps,
                    episode_costs=self.evaluation_costs,
                    episode_lengths=self.evaluation_lengths,
                )

            if mean_cost < self.lowest_mean_cost:
                self.lowest_mean_cost = mean_cost
                if self.best_model_save_path is not None:
                    self.model.save(self.best_model_save_path)
                # Trigger callback if needed
                if self.callback is not None:
                    return self._on_event()

            # Report trial results
            self.eval_idx += 1
            self.cost = self.lowest_mean_cost
            self.trial.report(self.cost, self.eval_idx)
            # Prune trial if needed
            if self.trial.should_prune():
                self.is_pruned = True
                return False
        return True
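
The cost rule inside `evaluate_policy_configs` can be read in isolation as follows. This is a sketch that restates the arithmetic from the callback above as a standalone function; `episode_cost` is a hypothetical helper name, and the episode lengths in the examples are made up.

```python
def episode_cost(episode_length, max_episode_time_steps, is_time_out=False,
                 is_collision=False, is_off_road=False,
                 is_friction_violation=False, is_goal_reached=False):
    """Cost of one episode, following the rule used in the evaluation callback."""
    normalized_length = episode_length / max_episode_time_steps
    cost = 0.0
    # Each failure adds 10 / normalized_length, so early failures cost more
    for failed in (is_time_out, is_collision, is_off_road, is_friction_violation):
        if failed:
            cost += 10.0 * (1 / normalized_length)
    # Reaching the goal subtracts 10 * normalized_length from the cost
    if is_goal_reached:
        cost -= 10.0 * normalized_length
    return cost


# Collision after 20 of 100 steps: 10 / 0.2
print(episode_cost(20, 100, is_collision=True))
# Goal reached after 80 of 100 steps: -10 * 0.8
print(episode_cost(80, 100, is_goal_reached=True))
# Time-out after all 100 steps: 10 / 1.0
print(episode_cost(100, 100, is_time_out=True))
```

The optimizer minimizes the mean of these per-episode costs, so configurations that lead to early failures are penalized most strongly.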

## 5. Formalize the optimization objective and process

Unlike in a regular learning process, we also have to define an objective function. For each trial, it samples the set of configurations to be used, calls the functions that create the model, the environments, and the callback, and then runs the trial.

In [None]:
def objective_reward_configs(trial):
    # Sample reward configurations according to settings
    sampled_reward_configs = {}
    for key, value in sampling_settings_reward_configs.items():
        method, interval = next(iter(value.items()))
        if method == "categorical":
            sampled_reward_configs[key] = trial.suggest_categorical(key, interval)
        elif method == "uniform":
            sampled_reward_configs[key] = trial.suggest_uniform(key, interval[0], interval[1])
        elif method == "loguniform":
            sampled_reward_configs[key] = trial.suggest_loguniform(key, interval[0], interval[1])
        else:
            print("Sampling method " + method + " not supported for " + key)
    
    # Update environment configurations
    env_configs.update(sampled_reward_configs)
    
    # Save data for later inspection
    tmp_path = os.path.join(log_path, "trial_" + str(trial.number))
    os.makedirs(tmp_path, exist_ok=True)
    with open(os.path.join(tmp_path, "environment_configurations.yml"), "w") as f:
        yaml.dump(env_configs, f)
    
    model = create_model(hyperparams, env_configs)
    testing_env = create_env(test_env=True, **env_configs)

    reward_configs_eval_callback = RewardConfigsTrialEvalCallback(testing_env,
                                                                  trial,
                                                                  n_eval_episodes=3,
                                                                  eval_freq=500,
                                                                  log_path=log_path,
                                                                  best_model_save_path=log_path,
                                                                  deterministic=True,
                                                                  verbose=1)
    # Conduct a learning trial
    try:
        n_timesteps = 3000
        model.learn(n_timesteps, callback=reward_configs_eval_callback)
        # Free memory
        model.env.close()
        testing_env.close()
    # Catch NaN from bad random configurations
    except AssertionError:
        # Free memory
        model.env.close()
        testing_env.close()
        raise optuna.exceptions.TrialPruned()
    
    # Record trial results
    is_pruned = reward_configs_eval_callback.is_pruned
    cost = reward_configs_eval_callback.cost
    del model.env, testing_env
    del model
    if is_pruned:
        raise optuna.exceptions.TrialPruned()
    return cost
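
The sampling step at the top of the objective function can be tried without Optuna or a learning run. The sketch below uses a hypothetical `FakeTrial` class that mimics the three `trial.suggest_*` calls with Python's `random` module; the setting names and ranges are illustrative only. Unlike the objective above, which prints a message for an unsupported method, this sketch raises an error, which is one possible design choice. Note also that recent Optuna releases deprecate `suggest_uniform` and `suggest_loguniform` in favor of `suggest_float`; the sketch keeps the names used in this tutorial.

```python
import math
import random


class FakeTrial:
    """Hypothetical stand-in that mimics the three `trial.suggest_*` calls used above."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def suggest_categorical(self, name, choices):
        return self._rng.choice(choices)

    def suggest_uniform(self, name, low, high):
        return self._rng.uniform(low, high)

    def suggest_loguniform(self, name, low, high):
        # Sample uniformly in log space, then map back
        return math.exp(self._rng.uniform(math.log(low), math.log(high)))


def sample_configs(trial, sampling_settings):
    sampled = {}
    for key, value in sampling_settings.items():
        # Each entry holds a single `method: range/candidates` pair
        method, interval = next(iter(value.items()))
        if method == "categorical":
            sampled[key] = trial.suggest_categorical(key, interval)
        elif method == "uniform":
            sampled[key] = trial.suggest_uniform(key, interval[0], interval[1])
        elif method == "loguniform":
            sampled[key] = trial.suggest_loguniform(key, interval[0], interval[1])
        else:
            raise ValueError("Sampling method " + method + " not supported for " + key)
    return sampled


# Illustrative settings; these keys are not taken from the project configuration
settings = {
    "weight_a": {"uniform": [0.0, 1.0]},
    "weight_b": {"loguniform": [0.001, 1.0]},
    "weight_c": {"categorical": [1, 2, 3]},
}
sampled = sample_configs(FakeTrial(seed=0), settings)
print(sampled)
```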

## 6. Trigger the optimization process and save the best

Finally, we are ready to trigger the optimization process. To do so, we first create a `study` object from the Optuna package with the designated sampler and pruner, and then call its `optimize` member function to start the overall process, passing in the objective function defined above. Please see the [Optuna examples](https://optuna.org/#code_examples) for details.

In [None]:
import optuna
from optuna.samplers import RandomSampler
from optuna.pruners import MedianPruner

# Create a study object on reward configurations from Optuna
reward_configs_study = optuna.create_study(sampler=RandomSampler(), pruner=MedianPruner())

# Start optimizing
reward_configs_study.optimize(objective_reward_configs, n_trials=5, n_jobs=1)

# Access and record the best performing set of configurations after all trials
with open(os.path.join(log_path, "report_reward_configs_study.yaml"), "w") as f:
    yaml.dump(reward_configs_study.best_trial.params, f)

Now in `tutorials/logs`, there should be a `.yaml` file with the best result and several `trial_*` folders recording the information for each optimization trial. These are useful for inspection and for reuse in subsequent learning runs. For a detailed directory description, please refer to the [commonroad_rl/README.md](https://gitlab.lrz.de/ss20-mpfav-rl/commonroad-rl/-/tree/development/commonroad_rl) file.
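
To give an intuition for what the `MedianPruner` in the study above decides, the following is a simplified sketch of a median pruning rule for a minimization problem: a running trial is stopped if its best reported cost so far is worse than the median of the costs that finished trials reported at the same step. This is not Optuna's implementation; `should_prune` here is a hypothetical standalone function, and warm-up steps and other options are ignored. As far as we are aware, Optuna's default is `n_startup_trials=5`, in which case a study with only five trials would never prune; please check the Optuna documentation for the installed version.

```python
from statistics import median


def should_prune(current_values, finished_trials, step, n_startup_trials=5):
    """
    Simplified median pruning rule for minimization.

    current_values: dict mapping step -> cost reported so far by the running trial
    finished_trials: list of such dicts, one per finished trial
    """
    # Do not prune until enough trials have finished
    if len(finished_trials) < n_startup_trials:
        return False
    # Costs that finished trials reported at the same step
    others = [values[step] for values in finished_trials if step in values]
    if not others:
        return False
    # Compare the running trial's best cost so far against the median
    best_so_far = min(current_values.values())
    return best_so_far > median(others)


finished = [{1: 10.0}, {1: 20.0}, {1: 30.0}]
print(should_prune({1: 25.0}, finished, step=1, n_startup_trials=3))
print(should_prune({1: 15.0}, finished, step=1, n_startup_trials=3))
```

In this tutorial, the cost reported at each evaluation is the lowest mean cost seen so far in the trial, so the comparison is always made on the trial's best result.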