# Exercise #5 - Stable Baselines3

In this final exercise we will use a library with implementations of various RL algorithms instead of implementing them ourselves. We will be using the [Stable Baselines3](https://stable-baselines3.readthedocs.io/en/master/) library. It provides PyTorch implementations of many algorithms, and these have been tested against the original papers to check that they reproduce the published results.

In this exercise we will use the [PPO](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html) algorithm. It is essentially an upgraded Actor-Critic that collects experience from multiple workers (as in A3C) and limits how far each gradient update can move the policy (an idea borrowed from TRPO). We will see that it converges a lot quicker on the `LunarLander-v2` environment.
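To make the "limited update" idea concrete, here is a minimal sketch of the clipped surrogate objective from the PPO paper for a single sample. The function name `clipped_surrogate` is ours, not part of Stable Baselines3, and the library's actual loss works on batches of tensors and adds value and entropy terms. The clip range of `0.2` matches the default `clip_range` of the `PPO` class.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_range: float = 0.2) -> float:
    """Per-sample PPO objective (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s); advantage estimates how much
    better the action was than average.
    """
    # Keep the probability ratio inside [1 - clip_range, 1 + clip_range]
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum removes the incentive to move the policy too far
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action whose probability already rose by 50% gets no extra credit
print(clipped_surrogate(1.5, 1.0))
# A bad action whose probability was halved is still penalized
print(clipped_surrogate(0.5, -1.0))
# An unchanged policy (ratio 1) is not affected by clipping
print(clipped_surrogate(1.0, 2.0))
```

Once the ratio leaves the clip interval in the direction that would increase the objective, the gradient for that sample becomes zero, which keeps each update close to the policy that collected the data.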

Let's start by importing the necessary classes and functions.

In [None]:
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

## Training

One of the ways to improve the efficiency of Reinforcement Learning is by using multiple parallel environments. With Stable Baselines3 that is really simple to achieve. Take a look at the [Examples page](https://stable-baselines3.readthedocs.io/en/master/guide/examples.html) for more details.

Let's create a new environment that runs 4 instances simultaneously using the `make_vec_env` function. A vectorized environment coordinates the execution of multiple environments and returns their results as one batch. By default `make_vec_env` wraps them in a `DummyVecEnv`, which steps the environments one after another in a single process; pass `vec_env_cls=SubprocVecEnv` if you want each environment to run in its own process. Make sure the `seed` parameter is set to a value (e.g. `0`), otherwise the initialization could break. The `make_vec_env` utility function will create the vectorized environment using the `gymnasium` library we have used before.

In [None]:
### START CODE ###
env = ...
### END CODE ###

# Show observation shape
obs = env.reset()
obs.shape

As you can see, the output is now a batch of 4 observations, one for each environment.

With this environment we can create a model. The model will check the observation and action spaces of the environment and configure its neural networks accordingly. We want to create a new `PPO` instance with the `'MlpPolicy'` policy network configuration. This is a simple 'multi-layer perceptron' network, perfectly suitable for this task. Make sure `verbose` is set to `1` to show some output.

In [None]:
### START CODE ###
model = ...
### END CODE ###

We are now ready to start training, which is done with the `learn` function of the model. Train for `200000` time steps and save the results. It should take about 5 minutes on this server.

In [None]:
### START CODE ###
# Learn the model for 200_000 steps
...
### END CODE ###

model.save('lunar-ppo')
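The training log may report slightly more than the requested number of time steps. PPO collects experience in complete rollouts of `n_steps` steps per environment, so the total is rounded up to a whole number of rollouts. The arithmetic below is a sketch that assumes the default `n_steps=2048` and the 4 environments created earlier.

```python
import math

n_envs = 4                 # parallel environments
n_steps = 2048             # steps per environment per rollout (assumed PPO default)
total_timesteps = 200_000  # value passed to learn()

# Each rollout gathers n_steps transitions from every environment
steps_per_rollout = n_envs * n_steps

# Training continues until the requested total has been reached
n_rollouts = math.ceil(total_timesteps / steps_per_rollout)
actual_timesteps = n_rollouts * steps_per_rollout

print(steps_per_rollout, n_rollouts, actual_timesteps)
```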

## Evaluation

Let's take a look at the result.

First we need to import the same utility functions to show the images as we did in previous exercises.

In [None]:
from utils import init_display, create_frame, update_frame
init_display()

We want to see how well it performs with a single environment, not with a vectorized environment. So, we have to create a new environment. Let's use `gymnasium` directly.

In [None]:
import gymnasium
env = gymnasium.make('LunarLander-v2', render_mode='rgb_array')

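Now we can write the evaluation function. To get an action from the model we use its `predict` function. It returns two values: the action and the hidden state (only used for recurrent neural networks), which we can ignore here.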

In [None]:
def evaluate(model_file):
    # Load model from disk
    model = PPO.load(model_file)

    ### START CODE ###

    # Reset environment
    obs, _ = ...

    ### END CODE ###

    # Show initial frame
    frame = create_frame(env)

    # Start loop
    done, score, length = False, 0, 0
    while not done:
        # Update frame with current state
        update_frame(frame)

        ### START CODE ###
        
        # Get action from model based on the current state
        action, _ = ...

        # Perform action in environment and collect next state and reward
        obs, reward, done, _, _ = ...

        ### END CODE ###

        # End early
        if length >= 500:
            reward += -100
            done = True

        # Next iteration
        score += reward
        length += 1

    return score

In [None]:
evaluate('lunar-ppo')

It should already perform quite well, after only about 5 minutes of training.

If you let it train for 1.5 million time steps with 8 environments in parallel you can achieve the following result. This took less than 20 minutes to train.

In [None]:
evaluate('trained/lunar-ppo')

## Conclusion

Well, that was easy, right?

We have seen how the basic algorithms work in the previous exercises. The state-of-the-art algorithms are a bit more work to implement, but luckily libraries like this one allow us to use them with minimal effort.