# RL Exercise Demo
This exercise demonstrates how to quickly train an RL agent on a popular environment using an RL framework.
We will
1. Select and instantiate an environment (gym's Hopper-v3).
2. Select and set up our RL algorithm / agent (Stable-Baselines3 TQC [TODO add ref]).
3. Train the agent on the environment and visualize training progress.
4. Evaluate our agent and observe the distribution of test performances.
5. Record and replay the agent, before and after training.
6. Optimize the agent's hyperparameters with SMAC.

Notes
- https://github.com/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/monitor_training.ipynb

## The Environment: Bipedal Walker
![Bipedal Walker](bipedal_walker.gif) [credits](https://www.gymlibrary.dev/_images/bipedal_walker.gif)


In [4]:
import gym
env = gym.make("BipedalWalker-v3")

## The Agent: TQC
[todo add paper]

In [None]:
from stable_baselines3 import SAC
model_fn = "trained_agent.zip"
model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=500_000)
model.save(model_fn)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 119      |
|    ep_rew_mean     | -117     |
| time/              |          |
|    episodes        | 4        |
|    fps             | 151      |
|    time_elapsed    | 3        |
|    total_timesteps | 476      |
| train/             |          |
|    actor_loss      | -6.48    |
|    critic_loss     | 25.9     |
|    ent_coef        | 0.895    |
|    ent_coef_loss   | -0.691   |
|    learning_rate   | 0.0003   |
|    n_updates       | 375      |
---------------------------------
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 110      |
|    ep_rew_mean     | -117     |
| time/              |          |
|    episodes        | 8        |
|    fps             | 136      |
|    time_elapsed    | 6        |
|    total_timesteps | 883      |
| train/             |

## Rollout

In [None]:
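from gym.wrappers import RecordVideo
from stable_baselines3 import SAC

# Attach a wrapper that records video of the rollout.
# gym's RecordVideo is swapped in here because TrainMonitor is not defined
# anywhere in this notebook; the "videos" folder name is an assumption.
env = RecordVideo(env, video_folder="videos")

model = SAC.load(model_fn)  # trained agent ("after")
random_model = SAC("MlpPolicy", env, verbose=1)  # untrained agent ("before")

n_steps = 1000  # run simulation for this number of steps

obs = env.reset()
for i in range(n_steps):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()  # start a new episode
env.close()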
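
The overview mentions evaluating the agent and looking at the distribution of test performances. A minimal sketch of that step follows, assuming the same 4-tuple `step` API as the rollout loop above. `ToyEnv` and `constant_policy` are hypothetical stand-ins so the snippet runs without gym or a trained model; in the notebook, `env` and a wrapper around `model.predict` would take their place.

```python
import random
import statistics


class ToyEnv:
    """Hypothetical stand-in for a gym env with the 4-tuple step API."""

    def __init__(self, seed=0, horizon=20):
        self.rng = random.Random(seed)
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # dummy observation

    def step(self, action):
        self.t += 1
        reward = action + self.rng.gauss(0.0, 1.0)  # noisy reward
        done = self.t >= self.horizon
        return 0.0, reward, done, {}


def collect_returns(env, policy, n_episodes):
    """Run full episodes and return the undiscounted return of each one."""
    returns = []
    for _ in range(n_episodes):
        obs, done, episode_return = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _info = env.step(policy(obs))
            episode_return += reward
        returns.append(episode_return)
    return returns


def constant_policy(obs):
    """Hypothetical policy that ignores the observation."""
    return 1.0


returns = collect_returns(ToyEnv(seed=0), constant_policy, n_episodes=30)
summary = {
    "mean": statistics.mean(returns),
    "std": statistics.stdev(returns),
    "min": min(returns),
    "max": max(returns),
}
print(summary)
```

Reporting the spread (standard deviation, minimum, maximum) next to the mean shows how much the return varies between episodes, which a single number hides.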

## Optimize Hyperparameters
RL algorithms often fail to converge when their hyperparameters are poorly configured, and even when they do converge we may want to find out whether performance can be increased further. In both cases we optimize the hyperparameters (HPs) of the RL algorithm / agent.

### What Hyperparameters to Optimize?
We do not optimize every possible hyperparameter but focus on those that often have a high impact on the learning dynamics, such as the learning rate (step size) or the discount factor gamma.

### How to Set Up the Search Space?
Next we need to define our search space, that is, which values we allow for these HPs. A broad range is fine, but if we already have an intuition about good values, we can search a smaller interval. The smaller search space may lead to a better solution faster.
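
As an illustration, here is a minimal sketch of such a search space with log-uniform sampling, using only the standard library. The bounds are the ones used in the SMAC cell further down; `sample_log_uniform` and `search_space` are hypothetical names, not part of any library.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


# Search space: one (low, high) interval per hyperparameter.
search_space = {
    "learning_rate": (5e-5, 5e-3),
    "gamma": (0.9, 0.999),
}

rng = random.Random(0)
configuration = {
    name: sample_log_uniform(rng, low, high)
    for name, (low, high) in search_space.items()
}
print(configuration)

# On a log scale, about half of the learning rates fall below 5e-4,
# the geometric midpoint of the interval.
samples = [sample_log_uniform(rng, 5e-5, 5e-3) for _ in range(10_000)]
fraction_small = sum(s < 5e-4 for s in samples) / len(samples)
print(f"fraction below 5e-4: {fraction_small:.2f}")
```

Sampling uniformly on a linear scale would place only about 9% of the learning rates below 5e-4, so the small step sizes would rarely be tried.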

### What Method to Use for Hyperparameter Optimization?
People often rely on **manual search** for various reasons. However, principled and automated search methods are usually more time-efficient. Another common approach is **grid search**, which seems appealing at first because we humans like structure. For an HP search space, though, it is better to sample randomly: with the same number of evaluations, random sampling tries more distinct values of each HP and therefore covers the search space better, see the image below. **So use random search instead of grid search**.
[todo add image grid search vs manual search]
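
The coverage argument can be checked with a small sketch. It assumes two HPs rescaled to the unit interval and a budget of nine evaluations; `distinct_values_per_axis` is a hypothetical helper.

```python
import itertools
import random

n_per_axis = 3
budget = n_per_axis ** 2  # 9 evaluations for both strategies

# Grid search: all combinations of 3 values per hyperparameter.
grid_axis = [i / (n_per_axis - 1) for i in range(n_per_axis)]
grid = list(itertools.product(grid_axis, grid_axis))

# Random search: 9 points drawn independently in the unit square.
rng = random.Random(0)
random_points = [(rng.random(), rng.random()) for _ in range(budget)]


def distinct_values_per_axis(points):
    """Count how many different values were tried for each hyperparameter."""
    return [len({point[d] for point in points}) for d in range(2)]


print("grid:  ", distinct_values_per_axis(grid))
print("random:", distinct_values_per_axis(random_points))
```

The grid only ever tries three values of each HP, while random search tries a new value of each HP with every evaluation. This matters most when only one of the HPs has a strong effect on performance.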
None of these approaches uses previously observed data points to choose the next HP configuration. **Bayesian Optimization (BO)** fits a model to the data points seen so far to approximate the HP optimization landscape. From this model we can smartly sample new HP configurations to try. BO is therefore sample-efficient: it needs comparatively few evaluations to find a good configuration, although how many also depends on the number of HPs we would like to optimize.
For this optimization we use **SMAC** [todo cite], and in particular a multi-fidelity approach, namely BOHB [todo cite]: instead of always letting a run finish, we check early on how its progress and performance look.
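
To make the multi-fidelity idea concrete, here is a minimal sketch of successive halving, the building block of Hyperband and therefore of BOHB. It is not SMAC's implementation. `successive_halving` and `toy_cost` are hypothetical, and the toy cost merely assumes that low budgets give a less accurate estimate of the true cost.

```python
import math


def successive_halving(configs, evaluate, min_budget, max_budget, eta=3):
    """Keep the best 1/eta of the configs per rung and multiply the budget by eta."""
    budget = min_budget
    survivors = list(configs)
    history = []  # (budget, number of configs evaluated) per rung
    while True:
        scored = sorted(survivors, key=lambda c: evaluate(c, budget))  # lower is better
        history.append((budget, len(survivors)))
        if budget >= max_budget or len(scored) == 1:
            return scored[0], history
        survivors = scored[: max(1, len(scored) // eta)]
        budget = min(budget * eta, max_budget)


def toy_cost(config, budget):
    """True cost (config - 0.3)^2 plus an error that shrinks as the budget grows."""
    return (config - 0.3) ** 2 + math.sin(37 * config) / budget


configs = [i / 8 for i in range(9)]
best, history = successive_halving(configs, toy_cost, min_budget=1, max_budget=9, eta=3)
total_budget = sum(budget * n for budget, n in history)
print(best, history, total_budget)
```

With nine configurations and `eta=3`, the rungs evaluate 9, 3 and 1 configurations at budgets 1, 3 and 9, for a total budget of 27 instead of the 81 needed to run every configuration to the end.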

### HPO Naming Conventions
- configuration = hyperparameters to use
- configuration space = search space, how we can set the hyperparameters
- incumbent = best configuration / hyperparameters found so far
- target algorithm = the function or algorithm we want to optimize (find the best hyperparameters for)
- trajectory = progress of incumbent value
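
These terms can be tied together in a small sketch. It uses plain random search on a hypothetical one-dimensional target algorithm, not SMAC; all function names are made up for illustration.

```python
import random


def target_algorithm(configuration):
    """Hypothetical target algorithm: the cost is lowest at x = 0.5."""
    return (configuration["x"] - 0.5) ** 2


def sample_configuration(rng):
    """Draw one configuration from the configuration space x in [0, 1]."""
    return {"x": rng.random()}


def random_search(n_evaluations, seed=0):
    rng = random.Random(seed)
    incumbent, incumbent_cost = None, float("inf")
    trajectory = []  # (evaluation index, incumbent cost) at each improvement
    for i in range(n_evaluations):
        configuration = sample_configuration(rng)
        cost = target_algorithm(configuration)
        if cost < incumbent_cost:  # found a new best configuration
            incumbent, incumbent_cost = configuration, cost
            trajectory.append((i, cost))
    return incumbent, trajectory


incumbent, trajectory = random_search(n_evaluations=50)
print(incumbent)
print(trajectory)
```

Plotting the trajectory's cost against the evaluation index gives the usual "incumbent over time" curve.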


In [1]:
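import gym
from ConfigSpace import Configuration, ConfigurationSpace
from ConfigSpace.hyperparameters import UniformFloatHyperparameter
from sb3_contrib import TQC
from smac.facade.smac_mf_facade import SMAC4MF
from smac.scenario.scenario import Scenario
from stable_baselines3.common.evaluation import evaluate_policy

# NOTE: this cell assumes the SMAC v1.x API (SMAC4MF); SMAC v2 renamed these classes.

number_of_function_evaluations = 20
max_env_steps = 500_000  # largest training budget per configuration
min_env_steps = 50_000  # smallest training budget (assumed value, tune as needed)

# Build configuration space
configuration_space = ConfigurationSpace()
gamma = UniformFloatHyperparameter("gamma", lower=0.9, upper=0.999, log=True)
learning_rate = UniformFloatHyperparameter("learning_rate", lower=5e-5, upper=5e-3, log=True)
configuration_space.add_hyperparameters([gamma, learning_rate])

# In order to use SMAC we need to package the model we want to optimize in a function.
# SMAC minimizes, therefore we return the negative performance.
def target_algorithm(configuration: Configuration, budget: float) -> float:
    env = gym.make("Hopper-v3")
    model = TQC(
        "MlpPolicy", env, verbose=1,
        gamma=configuration["gamma"], learning_rate=configuration["learning_rate"])
    model.learn(total_timesteps=int(budget))  # budget = number of environment steps
    mean_reward, _std_reward = evaluate_policy(model, env)
    return -mean_reward

# Set up SMAC
scenario = Scenario({
    "run_obj": "quality",  # optimize solution quality rather than runtime
    "runcount-limit": number_of_function_evaluations,
    "cs": configuration_space,
})
optimizer = SMAC4MF(
    scenario=scenario,
    tae_runner=target_algorithm,
    intensifier_kwargs={"initial_budget": min_env_steps, "max_budget": max_env_steps, "eta": 3},
)

# Let's go, optimize!
incumbent = optimizer.optimize()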




In [None]:
# Visualize trajectory
smac_outdir = ...
trajectory = 