### Ensemble PPO models
In this notebook, we share a simple example of how to ensemble multiple Proximal Policy Optimization (PPO) models at inference time to improve the performance of a reinforcement learning agent.

For illustration, we use a simple Lunar Lander example with a discrete action space.

First, let's install the required packages. We use the open-source [gymnasium](https://gymnasium.farama.org/) and [stable_baselines3](https://stable-baselines3.readthedocs.io/en/master/index.html) packages.

In [None]:
!pip install stable_baselines3==2.0.0a5
# swig is needed to build the Box2D bindings, so install it before gymnasium[box2d]
!pip install swig
!pip install "gymnasium[box2d]"

Next, we import `gymnasium` and required `stable_baselines3` classes and functions.

In [2]:
import time
import gymnasium as gym
from stable_baselines3 import PPO  
from stable_baselines3.common.evaluation import evaluate_policy  
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.monitor import Monitor

In this example, we train multiple models with different learning rates. The next cell can be modified to train as many models as needed while tuning other hyperparameters of the [`PPO` model](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html).
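
As a minimal sketch of how more than one hyperparameter could be varied, the snippet below builds a list of keyword-argument dictionaries with `itertools.product`. The specific values and the `configs` name are illustrative assumptions, not settings used in this notebook.

```python
from itertools import product

# Illustrative values only; these are not the settings used in this notebook
candidate_learning_rates = [0.0003, 0.001]
candidate_gae_lambdas = [0.95, 0.99]

# One dictionary of PPO keyword arguments per combination
configs = [
    {"learning_rate": lr, "gae_lambda": lam}
    for lr, lam in product(candidate_learning_rates, candidate_gae_lambdas)
]

for config in configs:
    print(config)
```

Each dictionary could then be unpacked into the constructor inside the training loop, for example `PPO('MlpPolicy', env, **config)`.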

In [None]:
# Define different learning rates to use for training  
learning_rates = [0.0003, 0.00025, 0.00035, 0.0004, 0.001, 0.0002, 0.0001]
num_models = len(learning_rates) 
  
# Create a list to store the PPO models  
models = []  
  
# Create a vectorized Lunar Lander environment (stacked 16 environments)
env = make_vec_env('LunarLander-v2', n_envs = 16)  

# Train and save the PPO models
# This is not the most efficient way to train; training could be parallelized, e.g. with PySpark or multiprocessing
for i in range(num_models):  
    start_time = time.time()
    print(f"Learning rate: {learning_rates[i]}")
    model = PPO('MlpPolicy',
                env,
                verbose = 0,
                n_steps = 2024,
                n_epochs = 20,
                gamma = 0.999,
                gae_lambda = 0.99,
                max_grad_norm = 9,
                batch_size = 64,
                learning_rate = learning_rates[i])  
    model.learn(total_timesteps=1000000)  
    model.save(f'ppo_model{i + 1}.zip')  # Save each model with a unique name  
    models.append(model)  
    print(f"Finished training a model in {round((time.time() - start_time)/60.0, 3)} minutes...")

Next, we evaluate each trained model by computing the mean and standard deviation of its episode rewards.

In [None]:
# Create an evaluation environment
eval_env = Monitor(gym.make('LunarLander-v2'))

# Calculate mean and std reward for each model
for idx, m in enumerate(models):
    mean_reward, std_reward = evaluate_policy(m, eval_env, n_eval_episodes=10, deterministic = True)
    print(f"Learning rate: {learning_rates[idx]} | Mean reward: {mean_reward} +/- {std_reward}") 
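
To make the two reported numbers concrete, here is a small sketch of how a mean and standard deviation are obtained from per-episode rewards. The reward values are made up for illustration. In practice the per-episode rewards can be obtained by passing `return_episode_rewards=True` to `evaluate_policy`.

```python
import numpy as np

# Hypothetical per-episode rewards, made up for illustration only
episode_rewards = [250.0, 270.0, 230.0, 260.0, 240.0]

# Mean and (population) standard deviation over the evaluation episodes
mean_reward = np.mean(episode_rewards)
std_reward = np.std(episode_rewards)

print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
```

With only 10 evaluation episodes, as in the cell above, both numbers are fairly noisy estimates, so small differences between models should be read with caution.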

Lastly, we **ensemble** the trained models into a single model and evaluate the performance of the **ensemble model**.

Optionally, we can restrict the ensemble to models that reach a minimum mean reward.

Because the action space in this example is discrete, the ensemble selects its action by majority voting. For continuous action spaces, the `predict` function can be modified accordingly.
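
To illustrate the voting step in isolation, here is a minimal sketch that counts votes with NumPy. The `majority_vote` helper and the example votes are hypothetical; the notebook itself uses `scipy.stats.mode` for the same purpose. In this sketch, ties go to the lowest action index.

```python
import numpy as np

def majority_vote(actions, n_actions):
    """Return the most frequently proposed discrete action."""
    # Count how many models voted for each action index
    counts = np.bincount(actions, minlength=n_actions)
    # argmax returns the first maximum, so ties go to the lowest index
    return int(np.argmax(counts))

# Hypothetical votes from six models; Lunar Lander has 4 discrete actions
votes = [2, 2, 1, 3, 2, 0]
print(majority_vote(votes, n_actions=4))
```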

In [8]:
import numpy as np
from scipy import stats as st

MINIMUM_REWARD = 240

# Create a custom ensemble model class  
class EnsembleModel:  
    def __init__(self, models):  
        self.models = models  
  
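    # evaluate_policy only needs a `predict` method that returns an
    # (actions, states) tuple, as the Stable-Baselines3 models do
    def predict(self, obs, **kwargs):
        # Collect one action per model (assumes a single evaluation environment)
        actions = [model.predict(obs, **kwargs)[0].item() for model in self.models]
        # st.mode returns (mode, count); keep only the majority action
        majority_action = st.mode(actions, keepdims=True).mode
        # MLP policies are not recurrent, so there is no state to carry over
        return majority_action, None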
  

# We would like to only consider models with a mean reward greater than a given threshold in our ensemble
update_models = [m for m in models if 
                 evaluate_policy(m, eval_env, n_eval_episodes=10, deterministic = True)[0] >= MINIMUM_REWARD]
print("Number of models in the ensemble: ", len(update_models))
ensemble_model = EnsembleModel(update_models)  
  
# Calculate mean and std reward for the ensemble model
mean_reward, std_reward = evaluate_policy(ensemble_model, eval_env, n_eval_episodes=10, deterministic = True)  # Replace n_eval_episodes with your desired number of evaluation episodes  
  
print(f"Ensemble model: Mean reward = {mean_reward} +/- {std_reward}") 

Number of models in the ensemble:  6
Ensemble model: Mean reward = 252.70587512668658 +/- 28.829527834468042
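
For a continuous action space, one possible modification of `predict` is to average the actions proposed by the models instead of voting. The sketch below is an assumption about how that could look, not part of the notebook's results. `MeanEnsembleModel` and `ConstantModel` are hypothetical names, and `ConstantModel` only stands in for trained models so that the example runs on its own.

```python
import numpy as np

class MeanEnsembleModel:
    """Hypothetical continuous-action counterpart of EnsembleModel."""

    def __init__(self, models):
        self.models = models

    def predict(self, obs, **kwargs):
        # Stack the proposed actions: shape (n_models, *action_shape)
        actions = np.stack(
            [np.asarray(m.predict(obs, **kwargs)[0], dtype=np.float64) for m in self.models]
        )
        # Average over the model axis; no recurrent state is returned
        return actions.mean(axis=0), None

class ConstantModel:
    """Stand-in for a trained model: always proposes the same action."""

    def __init__(self, action):
        self.action = np.asarray(action, dtype=np.float64)

    def predict(self, obs, **kwargs):
        return self.action, None

# Two stand-in models proposing different 2-dimensional actions for one environment
mean_ensemble = MeanEnsembleModel([ConstantModel([[1.0, 0.0]]), ConstantModel([[0.0, 1.0]])])
action, _ = mean_ensemble.predict(obs=np.zeros((1, 8)))
print(action)
```

Averaging keeps the result inside box-shaped action bounds whenever every individual action is inside them. Other aggregation rules, such as a median, are equally possible.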
