# Model-Based Reinforcement Learning


Model-Based Reinforcement Learning (MBRL) is a paradigm within RL where an agent first learns a
model of the environment's dynamics. This learned model, often denoted as $p(s_{t+1}, r_t \mid s_t, a_t)$,
approximates the real world by predicting the next state $s_{t+1}$ and reward $r_t$ given the
current state $s_t$ and an action $a_t$. The agent can then use this internal model to simulate
experiences and plan its actions, often without needing to interact with the actual environment.
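
As a minimal illustration of what "learning a model" means, the sketch below estimates a tabular $p(s_{t+1}, r_t \mid s_t, a_t)$ from logged transitions by counting outcomes, then samples an imagined transition from it. The two-state toy environment (`toy_env_step`) and the `TabularModel` class are assumptions made up for this example; they are not part of the notebook's utilities, and the notebook itself uses a neural network rather than counts.

```python
import random
from collections import Counter, defaultdict


def toy_env_step(state: int, action: int, rng: random.Random) -> tuple[int, float]:
    """Hypothetical 2-state environment: action 1 usually flips the state."""
    flip = rng.random() < (0.9 if action == 1 else 0.1)
    next_state = 1 - state if flip else state
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward


class TabularModel:
    """Empirical estimate of p(next_state, reward | state, action) from counts."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, state, action, next_state, reward):
        self.counts[(state, action)][(next_state, reward)] += 1

    def prob(self, state, action, next_state, reward) -> float:
        outcomes = self.counts[(state, action)]
        total = sum(outcomes.values())
        return outcomes[(next_state, reward)] / total if total else 0.0

    def sample(self, state, action, rng: random.Random):
        """Draws an imagined (next_state, reward) pair without touching the environment."""
        outcomes = self.counts[(state, action)]
        pairs, weights = zip(*outcomes.items())
        return rng.choices(pairs, weights=weights, k=1)[0]


rng = random.Random(0)
model = TabularModel()
state = 0
for _ in range(5_000):  # Collect real experience with a random policy.
    action = rng.randint(0, 1)
    next_state, reward = toy_env_step(state, action, rng)
    model.update(state, action, next_state, reward)
    state = next_state

p_flip = model.prob(0, 1, 1, 1.0)
imagined_next_state, imagined_reward = model.sample(0, 1, rng)
print(f"Estimated p(s'=1, r=1 | s=0, a=1) = {p_flip:.2f}")
```

The estimate should land close to the true value of 0.9, and `sample` can then generate as many simulated transitions as needed.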

Even when direct interaction with an environment is possible, MBRL offers several key advantages
that make it a powerful approach. By learning a model, an agent can overcome limitations of
real-world interaction and unlock more sophisticated decision-making capabilities.

- **Sample Efficiency**: Interacting with the real world can be slow, expensive, or dangerous. MBRL
  agents can use their learned model to generate vast amounts of simulated data, drastically
  reducing the number of real-world samples needed to learn an effective policy.

- **Planning and Deliberation**: The learned model enables the use of powerful planning algorithms
  (e.g., Monte Carlo Tree Search). The agent can "look ahead" and simulate the outcomes of different
  action sequences to find an optimal plan before executing a single action in reality.

- **Safety and Risk Management**: Before trying actions in the real world, an agent can use its
  model to predict and avoid potentially catastrophic outcomes. This is critical for applications
  like robotics or autonomous vehicles where mistakes are costly.

- **Transfer Learning**: A model captures the underlying physics of an environment. This knowledge
  can often be transferred or fine-tuned for new tasks within the same environment, allowing the
  agent to adapt more quickly than learning a new policy from scratch.


In [1]:
from collections import deque
import numpy as np
import random
import pandas as pd
import matplotlib.pyplot as plt
import pickle
import seaborn as sns

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import gymnasium as gym
from gymnasium.wrappers.vector import NormalizeObservation

from util.rl_algos import SAC, AgentSAC, ReplayBuffer
from util.gymnastics import DEVICE, gym_simulation, init_random

## The Environment

Let's experiment with model-based reinforcement learning using the MuJoCo
[Half-Cheetah](https://gymnasium.farama.org/environments/mujoco/half_cheetah/) environment. Let's
start with some utility functions.

We use the [vectorized version](https://gymnasium.farama.org/api/vector/) of the environment to
speed up and stabilize training. Moreover, we
[normalize the observations](https://gymnasium.farama.org/api/vector/wrappers/#gymnasium.wrappers.vector.NormalizeObservation),
which is critical for learning an effective dynamics model of the environment as required by MBPO.
Because of that, we need to make sure to store and reuse the running mean and standard deviation
statistics.
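
To make the normalization round trip concrete, here is a minimal sketch in plain Python. It assumes the wrapper standardizes each component as $(x - \mu) / \sqrt{\sigma^2 + \epsilon}$ with a small $\epsilon$, which is worth double-checking against the Gymnasium source. The helper names `normalize` and `unnormalize` and all numbers are hypothetical.

```python
import math

EPSILON = 1e-8  # Assumed small constant that guards against division by zero.


def normalize(obs, mean, var):
    """Standardizes each observation component with the stored statistics."""
    return [(x - m) / math.sqrt(v + EPSILON) for x, m, v in zip(obs, mean, var)]


def unnormalize(norm_obs, mean, var):
    """Inverts `normalize`, recovering the raw observation."""
    return [z * math.sqrt(v + EPSILON) + m for z, m, v in zip(norm_obs, mean, var)]


# Made-up running statistics for a 3-dimensional observation.
mean = [0.5, -1.0, 2.0]
var = [4.0, 0.25, 1.0]
raw_obs = [2.5, -0.5, 2.0]

norm_obs = normalize(raw_obs, mean, var)
recovered = unnormalize(norm_obs, mean, var)
```

The same statistics must be used in both directions: a model trained on normalized observations only produces meaningful raw states if it is un-normalized with the mean and variance it was trained under.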

**NOTE:** This notebook is computationally intensive: expect $O(hours)$ of GPU training for the
model-free (SAC) run, and especially for MBPO. Consider lowering the `SOLVED_SCORE` below to get
faster experimentation loops.


In [2]:
ENV_NAME = "HalfCheetah-v5"
SOLVED_SCORE = 5999.0  # State-of-the-art algorithms often go beyond this value.

In [3]:
def make_vectorized_env(num_envs=15):
    """Utility method to create the vectorized environment with observation normalization."""
    env = gym.make_vec(ENV_NAME, num_envs=num_envs)
    env = NormalizeObservation(env)
    return init_random(env)

In [4]:
def run_simulation(agent: AgentSAC = None, obs_rms: np.ndarray = None):
    """Utility to run an agent simulation restoring normalization statistics."""
    sym_env = gym.wrappers.NormalizeObservation(gym.make(ENV_NAME, render_mode="rgb_array_list"))
    if obs_rms is not None:
        sym_env.obs_rms = obs_rms
    sym_env.update_running_mean = False
    return gym_simulation(sym_env, agent, max_t=150 if agent is not None else 75)

In [5]:
def write(text: str):
    """Extremely naive utility to overwrite the same console line for logging in this notebook."""
    padding = 100 - len(text)
    print(f"\r{text}{' ' * padding}", end="")

Let's see the state / action dimensions, and run a simulation of a random agent.


In [None]:
TEST_ENV = gym.make(ENV_NAME)
STATE_SIZE = TEST_ENV.observation_space.shape[0]
ACTION_SIZE = TEST_ENV.action_space.shape[0]

write(f"State size: {STATE_SIZE}\n")
write(f"Action size: {ACTION_SIZE}\n")

In [None]:
# Random agent simulation.
run_simulation()

## Training with SAC

Let's first train a _model-free_ SAC agent. The SAC algorithm and agent implementations are provided
as part of the utility code, but you have already built them as part of the actor-critic notebook :)

We will hopefully be able to train the same agent in far fewer episodes (i.e., with better sample
efficiency) using MBPO at the end of this notebook.


In [7]:
def make_sac_agent():
    """Creates a SAC agent, used in both SAC and MBPO training."""
    return AgentSAC(
        STATE_SIZE,
        ACTION_SIZE,
        sample_size=512,
        max_norm=1.0,
        lr_actor=1e-3,
        lr_critic=1e-3,
    )

In [None]:
# If you don't want to run the full SAC training, you can load scores from a prerun, and comment out
# the SAC training (and simulation) below. We will need the scores for plotting later on.
with open("solution/mbrl_sac_scores.pkl", "rb") as file:
    sac_scores = pickle.load(file)

In [None]:
# Solved in ~1h30m on GPU
sac_env = make_vectorized_env()
sac_agent = make_sac_agent()
sac_scores = SAC(sac_env, sac_agent, solved_score=SOLVED_SCORE).train()

In [None]:
run_simulation(sac_agent, sac_env.obs_rms)

## Model-Based Policy Optimization (MBPO)

Modern MBRL can be roughly divided into two major approaches, which differ in how they use the
learned model.

- **Dyna-Style Algorithms (Background Planning)**: The core idea is to use the learned model to generate
  more data to train a model-free RL algorithm. The model acts as a data augmentation engine.
- **Planning-Based Algorithms (Decision-Time Planning)**: Use the learned model to plan or search
  for the best possible actions at each and every timestep. The model acts as an internal simulator
  for decision-making.
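
To contrast with the Dyna-style approach used in the rest of this notebook, here is a minimal sketch of decision-time planning using random shooting: sample many action sequences, simulate each one with the model, and execute only the first action of the best sequence. The 1-D `model_step` dynamics, the goal at 5.0, and the function `plan_first_action` are all hypothetical and chosen only to keep the example tiny.

```python
import random


def model_step(state: float, action: float) -> tuple[float, float]:
    """Hypothetical 1-D model: the action moves the state, reward is -distance to 5.0."""
    next_state = state + action
    return next_state, -abs(next_state - 5.0)


def plan_first_action(state, rng, n_candidates=200, horizon=5):
    """Random shooting: simulate random action sequences, keep the best first action."""
    best_return, best_action = float("-inf"), 0.0
    for _ in range(n_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        sim_state, total = state, 0.0
        for action in actions:
            sim_state, reward = model_step(sim_state, action)
            total += reward
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action


rng = random.Random(0)
state = 0.0
for _ in range(10):
    action = plan_first_action(state, rng)  # Re-plan at every step.
    state, _ = model_step(state, action)  # Here the "real" environment equals the model.
```

Note that no policy is trained here: all the computation happens at decision time, which is the opposite trade-off from Dyna-style methods, where the model is only used to produce extra training data.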

In this lesson, we will put these concepts into practice by implementing a simplified version of
Model-Based Policy Optimization (MBPO), a popular and effective Dyna-style MBRL algorithm
([paper here](https://arxiv.org/abs/1906.08253), and a
[reference implementation](https://github.com/Xingyu-Lin/mbpo_pytorch)).

### Algorithm

The vanilla model-based RL algorithm repeats three steps for $N$ epochs: (1) collect data under the
current policy, (2) learn a dynamics model from past data, and (3) improve the policy using the
learned dynamics model (either via backpropagation through time or by using the model as a
simulator). At a very high level, the MBPO algorithm reflects those steps:

<div style="width: 50%">
  <img src="assets/13_MBRL_MBPO_algorithm.png">
  <br>
  <small></small>
</div>

In particular, MBPO works by creating a cycle between real interaction, model learning, and policy
improvement. The agent maintains **two separate replay buffers**: one for real experiences collected
from the environment and another for imaginary experiences generated by a model. First, an
**ensemble of dynamics models** is trained on the real data to predict next states and rewards while
capturing uncertainty. Then, the algorithm performs **short, imaginary rollouts** by sampling states
from the real buffer, using the current policy to choose actions, and using a randomly selected
model from the ensemble to predict the outcomes. These simulated transitions populate the model
buffer. Finally, a powerful model-free agent (SAC in our case) is trained on mixed batches of data
drawn from both the real and model buffers, allowing it to learn with far greater sample efficiency
than from real interactions alone.
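
The mixed-batch idea can be sketched with plain Python lists standing in for the two replay buffers. The function `sample_mixed_batch` and the tagged tuples are hypothetical; the notebook's `ReplayBuffer` returns tensors and has its own sampling API, so treat this only as an illustration of how a `real_ratio` splits a batch.

```python
import random


def sample_mixed_batch(env_buffer, model_buffer, batch_size, real_ratio, rng):
    """Draws `real_ratio` of the batch from real data and the rest from model data."""
    env_batch_size = int(batch_size * real_ratio)
    model_batch_size = batch_size - env_batch_size
    env_batch = rng.sample(env_buffer, env_batch_size)
    model_batch = rng.sample(model_buffer, model_batch_size)
    return env_batch + model_batch


# Stand-in transitions, tagged by origin so the split is easy to inspect.
env_buffer = [("real", i) for i in range(1_000)]
model_buffer = [("model", i) for i in range(10_000)]

batch = sample_mixed_batch(env_buffer, model_buffer, 256, 0.25, random.Random(0))
n_real = sum(1 for source, _ in batch if source == "real")
print(f"{n_real} real + {len(batch) - n_real} model transitions")
# prints "64 real + 192 model transitions"
```

Keeping a fraction of real transitions in every batch anchors the policy update to data that is not affected by model errors.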


### Reward Function

The reward function _should be_ part of the dynamics model. But in this notebook, we **manually
define the reward function**, a common and practical approach when the reward can be calculated
directly from the state. This simplifies the agent's task, allowing the dynamics model to focus
solely on predicting the next state.

However, in many real-world problems, this isn't possible. An agent may only have partial
observations and must learn the reward function alongside the dynamics. This introduces the critical
challenge of **reward hacking**, where the agent exploits flaws in its learned reward model to
achieve high scores that don't reflect true task success, a risk that is especially high in
environments with sparse or complex rewards.
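
For HalfCheetah, the manually defined reward used in this notebook has a simple closed form. Writing $v_x$ for the forward velocity read from the next state (index 8) and $a_{t,i}$ for the components of the action, and taking the control cost weight of 0.1 used in the exercise below:

$$
r_t = v_x - 0.1 \sum_{i} a_{t,i}^2
$$

As a worked example, with $v_x = -2$ and $a_t = (1, -2, 0.5)$:

$$
r_t = -2 - 0.1 \, (1^2 + (-2)^2 + 0.5^2) = -2 - 0.1 \cdot 5.25 = -2.525
$$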


In [None]:
def reward_fn(action: np.ndarray, next_state: np.ndarray) -> float:
    """Reward function for HalfCheetah-v5.

    from: https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/mujoco/half_cheetah_v5.py
    """
    # TODO: The x_velocity is the 9th component (at index 8) of next_state
    x_velocity = None
    # TODO: action_penalty is 0.1 (ctrl_cost_weight) times the sum of the squared action components.
    action_penalty = None
    # TODO: The reward is the x_velocity minus the action_penalty
    return None

In [None]:
print("Running tests for reward_fn...")

action1 = np.array([0.0, 0.0, 0.0])
state1 = np.array([0, 0, 0, 0, 0, 0, 0, 0, 5.0, 0, 0, 0, 0, 0, 0, 0, 0])
expected_reward1 = 5.0
assert np.isclose(reward_fn(action1, state1), expected_reward1), "Test Case 1 Failed"
print("✅ Test Case 1 Passed")

action2 = np.array([1.0, -2.0, 0.5])
state2 = np.array([0, 0, 0, 0, 0, 0, 0, 0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])
expected_reward2 = -2.525
assert np.isclose(reward_fn(action2, state2), expected_reward2), "Test Case 2 Failed"
print("✅ Test Case 2 Passed")

action3 = np.array([-1.0, -1.0, -1.0, -1.0])
state3 = np.zeros(17)
expected_reward3 = -0.4
assert np.isclose(reward_fn(action3, state3), expected_reward3), "Test Case 3 Failed"
print("✅ Test Case 3 Passed")

print("🎉 All tests passed!")

### Initial Exploration


In [None]:
def init_env_dataset(env: gym.vector.VectorEnv, env_dataset: ReplayBuffer, num_samples=150_000):
    """Initializes the environment dataset by interacting randomly with the environment."""
    # TODO: Clear the env_dataset.
    # ...
    # TODO: reset the gym vectorized environment.
    states, _ = None
    for _ in range(num_samples // env.num_envs):
        # TODO: Sample a random action from the vectorized environment.
        actions = None
        # TODO: Perform a step using that action.
        next_states, rewards, terms, truncs, _ = None
        # TODO: Compute dones as logical or between terminations and truncations
        dones = None
        for i in range(env.num_envs):
            # TODO: Add the tuple state, action, reward, next_state, done to the env_dataset.
            pass
        # TODO: Crucial! Set state to the next_state for the next iteration!
        states = None

In [None]:
test_env_dataset = ReplayBuffer()
init_env_dataset(make_vectorized_env(), test_env_dataset, num_samples=1020)

assert len(test_env_dataset) == 1020, f"Incorrect number of samples: {len(test_env_dataset)}"

env_dataset_sample = test_env_dataset.sample(3)
assert env_dataset_sample[0].shape == (3, 17), "Incorrect state shape"
assert env_dataset_sample[1].shape == (3, 6), "Incorrect action shape"
assert env_dataset_sample[2].shape == (3, 1), "Incorrect reward shape"
assert env_dataset_sample[3].shape == (3, 17), "Incorrect next_state shape"
assert env_dataset_sample[4].shape == (3, 1), "Incorrect done shape"

print("🎉 All tests passed!")

### Model Network

The learnt dynamics model of the environment $f_{\theta}$ is implemented as a neural-network which
predicts the change in the _normalized_ state given the current (normalized) state and action:
$\hat{\Delta}_{t+1} = f_{\theta}(s_t, a_t)$. See [this paper](https://arxiv.org/abs/1708.02596) for
an intuition of why it is better to predict deltas instead of the next states directly.
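
One way to build intuition for delta prediction: consecutive states usually differ only slightly, so a network that outputs roughly zero (as an untrained one tends to) is already a reasonable next-state predictor when its output is interpreted as a delta, and a poor one otherwise. The sketch below illustrates this with made-up numbers; the helper names are hypothetical.

```python
def delta_target(state, next_state):
    """Training target: the per-component change between consecutive states."""
    return [n - s for s, n in zip(state, next_state)]


def apply_delta(state, delta):
    """Reconstructs the next state from the current state and a predicted delta."""
    return [s + d for s, d in zip(state, delta)]


def squared_error(prediction, target):
    return sum((p - t) ** 2 for p, t in zip(prediction, target))


# Two consecutive (made-up) normalized states: they differ only slightly.
state = [1.00, -0.50, 3.20]
next_state = [1.02, -0.47, 3.15]

delta = delta_target(state, next_state)

# A network that outputs zeros: compare what that means in each setup.
zero = [0.0, 0.0, 0.0]
error_predicting_delta = squared_error(apply_delta(state, zero), next_state)
error_predicting_state = squared_error(zero, next_state)
```

Here `error_predicting_delta` is orders of magnitude smaller than `error_predicting_state`, so the network only has to learn a small correction instead of reproducing the whole state.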


In [None]:
class ModelNetwork(nn.Module):

    def __init__(self, state_size=STATE_SIZE, action_size=ACTION_SIZE):
        super().__init__()
        # TODO: The network input is state + action and outputs the next state delta. Create three
        #       linear layers with hidden dimension 256.

    def forward(self, state, action):
        # TODO: Concat state and action. Hint: use torch.cat and pay attention to the dimension!
        #       First dimension (index 0) is always the batch dimension.
        # ...
        # TODO: Forward through the network to compute the `delta`
        # ...
        delta = None
        # TODO: Also compute the next_state as: state + delta
        new_state = None
        return new_state, delta

### Ensemble Model

MBPO utilizes an ensemble of probabilistic dynamics models rather than relying on a single one. This
approach is crucial for managing model uncertainty; a single learned model might be confidently
incorrect about the environment's physics, leading the agent to exploit these inaccuracies for
illusory gains. To ensure the models in the ensemble are diverse, they are trained using
**bootstrapping**.

While this technique traditionally involves creating separate, fixed datasets for each model by
_sampling with replacement_ from the main experience buffer, we implement a simplified version for
efficiency: each model in the ensemble is trained on the full dataset, and diversity is achieved by
having each model sample different, random mini-batches during each training step. This ensures each
model sees a unique sequence of data, which effectively approximates the bootstrapping process.
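
The difference between the two sampling schemes can be sketched with index lists alone. The function names below are hypothetical, and the dataset size and batch size are arbitrary; the point is only to show what each ensemble member gets to see.

```python
import random


def bootstrap_indices(n_samples: int, rng: random.Random) -> list[int]:
    """Classic bootstrap: a fixed resample of the dataset, drawn with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def minibatch_indices(n_samples: int, batch_size: int, rng: random.Random) -> list[int]:
    """Simplified variant: a fresh random mini-batch over the full dataset."""
    return [rng.randrange(n_samples) for _ in range(batch_size)]


rng = random.Random(0)
n_samples, n_models, batch_size = 10_000, 7, 256

# Classic: each model gets its own fixed dataset covering ~63% of the unique samples.
fixed_datasets = [bootstrap_indices(n_samples, rng) for _ in range(n_models)]
unique_fraction = len(set(fixed_datasets[0])) / n_samples

# Simplified: at every gradient step, each model draws a different mini-batch.
step_batches = [minibatch_indices(n_samples, batch_size, rng) for _ in range(n_models)]
```

With the classic scheme each model never sees roughly a third of the data (the expected unique fraction is $1 - 1/e \approx 0.632$), whereas in the simplified scheme every model can eventually see every sample, just in a different order.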

Let's create the `EnsembleModel` first.


In [None]:
class EnsembleModel(nn.Module):
    def __init__(self, n_models=7, state_size=STATE_SIZE, action_size=ACTION_SIZE, lr=3e-4):
        super().__init__()
        self.n_models = n_models
        # TODO: Create an ensemble of n_models ModelNetwork. Hint: use nn.ModuleList.
        self.models = None
        # TODO: Create n_models separate optimizers in a list.
        self.optimizers = [None]

    def forward(self, state, action):
        # TODO: Compute the deltas from all the models.
        models_deltas = None
        return models_deltas

    @torch.no_grad()
    def step(self, state, action):
        # TODO: Select a random ModelNetwork from the ensemble.
        selected_model_in_ensemble = None
        # TODO: Use the selected model to perform the prediction.
        next_state, delta_pred = None
        return next_state, delta_pred

Let's then write the training loop, using holdouts for cross-validation and a _patience_ threshold
to interrupt the training when no more improvements are made.
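
The patience mechanism on its own can be sketched as follows, operating on a precomputed list of validation losses instead of a real training loop. The function `early_stopping_epoch` and the loss values are made up for illustration; as in the exercise below, training stops once the number of epochs without improvement exceeds `patience`.

```python
def early_stopping_epoch(val_losses, patience):
    """Returns (epoch at which training stops, best validation loss seen)."""
    best_val_loss = float("inf")
    epochs_since_improvement = 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
            if epochs_since_improvement > patience:
                return epoch, best_val_loss
    return len(val_losses), best_val_loss


losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.69, 0.70, 0.70, 0.71, 0.70]
stop_epoch, best = early_stopping_epoch(losses, patience=2)
print(stop_epoch, best)  # prints "9 0.69"
```

The temporary plateau at epochs 4 and 5 does not trigger a stop because the loss improves again at epoch 6, which resets the counter.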


In [None]:
def train_predictive_model(
    model: EnsembleModel,
    env_dataset: ReplayBuffer,
    n_epochs=100,
    mini_batch_size=256,
    gradient_steps=390,
    holdout_ratio=0.1,
    patience=5,
):
    # TODO: Make sure to set the model in training mode.
    # ...
    # TODO: Get all samples from the env_dataset
    all_samples = None
    # TODO: Get the total number of samples. Hint: use the shape.
    total_samples = None
    # TODO: Compute the number of holdouts as the product between total_samples and holdout_ratio.
    n_holdout = None

    best_val_loss = float("inf")
    epochs_since_improvement = 0

    # TODO: Split data into training and validation tensors for easier indexing
    train_tensors = None
    val_samples = None
    train_states, train_actions, _, train_next_states, _ = None
    n_train_samples = None

    write("Performing model rollout... training model...")

    for epoch in range(1, n_epochs + 1):
        epoch_loss = 0.0
        for _ in range(gradient_steps):
            # Instead of one sequential batch, we train each model on its own random batch
            for i in range(model.n_models):
                # TODO: Create a bootstrap sample (mini_batch_size random indices). Hint: use
                #       np.random.randint
                indices = None

                # TODO: select the various training batches using the indices.
                state_batch = None
                action_batch = None
                next_state_batch = None

                # TODO: Predict with the specific model
                _, delta_pred = None
                delta_target = None

                # TODO: Calculate loss and update the specific model's optimizer. Use mse_loss.
                loss = None
                # TODO: Perform an optimization step.
                # ...

                epoch_loss += loss.item()

        # Validation loss
        with torch.no_grad():
            # TODO: Unpack the cross-validation samples.
            val_state, val_action, _, val_next_state, _ = None
            # TODO: Get the delta target as val_next_state - val_state
            val_target = None
            # TODO: Run the prediction using the ensemble.
            val_preds = None
            # TODO: Adjust val_target dimensionality to match val_preds
            val_target = None
            # TODO: Compute MSE loss.
            val_loss = None

        # Average over every individual update (one per model per gradient step).
        epoch_loss /= gradient_steps * model.n_models
        write(f"Model Epoch {epoch} | Train Loss: {epoch_loss:.4f} | Val Loss: {val_loss:.4f}")

        # Early stopping
        if val_loss < best_val_loss:
            # TODO: save the best_val_loss, and reset the epochs_since_improvement to zero.
            pass
        else:
            # TODO: Increment epochs_since_improvement, and break if it is greater than patience.
            pass

    write(f"Model training complete (up to {epoch} epochs). Final Val Loss: {val_loss:.4f}")
    # TODO: Put model back into eval() mode.
    # ...

In [None]:
# Let's test whether some training at least happens :)
train_predictive_model(EnsembleModel().to(DEVICE), test_env_dataset, n_epochs=2)
print("\n✅ Training ran with no runtime errors.")

### Predictive Model Rollout

Let's implement the _k-step model rollout_ using the predictive model to populate the simulated
experiences buffer. Let's make sure to use normalized observations when feeding the model, and
unnormalize when calculating rewards and storing in the buffer.

Note that in MBPO short model rollouts are crucial to mitigate the accumulation of prediction errors
from the imperfect learned dynamics model, which prevents the policy from exploiting these
inaccuracies over long horizons. In our training, we keep $k=1$, though you can experiment with
adaptive approaches that increase it as training proceeds and the model is more robust and accurate.
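
The compounding effect is easy to see on a toy example. Below, a hypothetical 1-D system follows $x_{t+1} = 0.9 x_t + 1$, while the "learned" model is slightly off and uses a coefficient of 0.92. Both functions and all numbers are invented for illustration.

```python
def true_step(x: float) -> float:
    """Hypothetical ground-truth dynamics."""
    return 0.9 * x + 1.0


def model_step(x: float) -> float:
    """Hypothetical learned model with a small error in the coefficient."""
    return 0.92 * x + 1.0


def rollout_error(start: float, k_steps: int) -> float:
    """Absolute gap between the model rollout and the true rollout after k steps."""
    x_true = x_model = start
    for _ in range(k_steps):
        x_true = true_step(x_true)
        x_model = model_step(x_model)  # The model is fed its own predictions.
    return abs(x_model - x_true)


errors = {k: rollout_error(5.0, k) for k in (1, 5, 20)}
for k, error in errors.items():
    print(f"k={k:2d}  error={error:.3f}")
```

A one-step error of about 0.1 grows to roughly 0.5 after 5 steps and 1.7 after 20, because each prediction is fed back as the input to the next one.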


In [None]:
def k_step_model_rollout(
    predictive_model: EnsembleModel,
    agent: AgentSAC,
    model_dataset: ReplayBuffer,
    start_state: torch.Tensor,  # Dimension (1, STATE_SIZE)
    obs_rms_mean: np.ndarray,
    obs_rms_var: np.ndarray,
    k_steps=1,
):
    # TODO: Set the current state to the start_state.
    current_state = None
    # TODO: Convert running mean / variance into tensors.
    obs_rms_mean_t = None
    obs_rms_var_t = None

    for _ in range(k_steps):
        # TODO: Convert the current state into numpy() and get a single action from the agent.
        state_np = None
        action_np = None

        # TODO: Convert the action into a tensor. Make sure to add the batch dimension (unsqueeze).
        action = torch.from_numpy(action_np).float().unsqueeze(0).to(DEVICE)
        # TODO: Call the predictive model step(...)
        next_state, _ = None

        # TODO: Compute the unnormalized next state using obs_rms_var_t and obs_rms_mean_t.
        unnormalized_next_state = None
        # TODO: Compute the reward invoking the manually written reward function.
        reward = None
        done = False  # We never terminate with HalfCheetah.

        # Convert state and next_state back to numpy, making sure to remove the batch dimension!
        state_to_add = None
        next_state_to_add = None
        # TODO: Add the tuple (state, action, reward, next_state, done) to the model_dataset
        # ...

        current_state = next_state

### MBPO Algorithm

Finally, let's code the MBPO algorithm! In this implementation, we make a few slight modifications
to the algorithm described in the paper:

1.  We use global timesteps, not epochs directly; both the predictive model training _and_ the
    k-step model rollout happen once every `train_model_freq` global steps.
2.  We clear the model dataset after predictive model training, to ensure the policy is always
    learning from rollouts generated by the most up-to-date dynamics model.
3.  The start state for the k-step model rollout is sampled from the real environment, rather than
    uniformly from the env dataset (so that start states are always "true" states).
4.  We update the policy parameters using a mixture of model (synthetic) and real data.

And now we are ready to go!


In [None]:
def MBPO(
    env: gym.vector.VectorEnv,
    agent: AgentSAC,
    n_episodes=900,
    n_env_steps=1_000,
    n_model_rollouts=400,
    n_grad_updates=25,
    train_model_freq=250,
    train_model_epochs=10,
    policy_train_batch_size=256,
    real_ratio=0.25,
    k_steps=1,
    solved_score=SOLVED_SCORE,
):
    assert real_ratio < 1.0, "real_ratio=1.0 would degenerate into SAC. Implement tweaks as exercise :)"

    scores = []
    scores_window = deque(maxlen=100)
    num_envs = env.num_envs
    total_timesteps = n_episodes * n_env_steps
    episode_scores = np.zeros(num_envs, dtype=np.float32)

    states, _ = env.reset()
    env_dataset = ReplayBuffer()
    model_dataset = ReplayBuffer(int(1e6))
    predictive_model = EnsembleModel().to(DEVICE)

    # TODO: ALGO: Initialization: call init_env_dataset. Initial exploration matters a lot!
    # ...

    for global_step in range(total_timesteps // num_envs):
        if global_step % train_model_freq == 0:
            # TODO: ALGO: Train predictive model on env_dataset via maximum likelihood.
            #       Hint: call train_predictive_model :)
            # ...

            # TODO: Clear the model dataset.
            # ...
            # TODO: Get obs_rms.mean and obs_rms.var from the environment (thanks to the wrapper).
            obs_rms_mean = None
            obs_rms_var = None

            for model_rollout in range(n_model_rollouts):
                # TODO: Pick the start state randomly from `states`.
                start_state_np = None
                # TODO: Convert it to a tensor, adding the batch dimension (unsqueeze).
                rollout_state = None

                # TODO: ALGO: Perform k-step model rollout!
                # ...
                write(f"Rollout {model_rollout} completed...")

        # TODO: ALGO: Take action in environment, according to policy; add to the env_dataset.
        actions = None
        next_states, rewards, terminateds, truncateds, _ = None
        dones = None
        for i in range(env.num_envs):
            # TODO: Add to env_dataset here.
            pass

        # ALGO: Update policy parameters on model data (and percentage of real data).
        write(f"Step {global_step} in progress + gradient updates...")
        for _ in range(n_grad_updates):
            # TODO: Compute model_batch_size using the real_ratio.
            model_batch_size = None
            # TODO: Compute env_batch_size as: policy_train_batch_size - model_batch_size
            env_batch_size = None
            # TODO: Sample the batch part from the env_dataset.
            env_batch = None
            # TODO: Sample the batch part from the model_dataset.
            model_batch = None
            # TODO: Compute the full batch concatenating env + model batches.
            batch = None
            # TODO: Call agent.learn!
            # ...

        # TODO: CRITICAL! Set states to next_states!
        states = None
        episode_scores += rewards

        # No need to check for termination; HalfCheetah-v5 only truncates at the time limit, and
        # vectorized environments handle the automatic reset. We check it only for logging.

        for i, done in enumerate(dones):
            if done:
                finished_score = episode_scores[i]
                scores.append(finished_score)
                scores_window.append(finished_score)
                episode_scores[i] = 0.0
                avg_score = np.mean(scores_window)
                num_episodes_done = len(scores)
                write(
                    (
                        f"Episode {num_episodes_done}\tAverage Score: {avg_score:.2f}"
                        + ("\n" if num_episodes_done % env.num_envs == 0 else "")
                    ),
                )
                if avg_score >= solved_score:
                    write(
                        f"\nEnvironment solved in {num_episodes_done} episodes!"
                        + f"\tAverage Score: {avg_score:.2f}"
                    )
                    return scores

### Let's Train Our MBPO Agent!


In [None]:
mbpo_env = make_vectorized_env()
mbpo_agent = make_sac_agent()
mbpo_scores = MBPO(mbpo_env, mbpo_agent)

In [None]:
run_simulation(mbpo_agent, mbpo_env.obs_rms)

### Performance Comparison

We can see the better sample efficiency of MBPO vs. SAC!


In [19]:
def plot_scores(sac_scores, mbpo_scores, window_size=20):
    """Utility method to plot."""
    sac_series = pd.Series(sac_scores, name="SAC")
    mbpo_series = pd.Series(mbpo_scores, name="MBPO")
    sac_running_avg = sac_series.rolling(window=window_size).mean()
    sac_running_std = sac_series.rolling(window=window_size).std()
    mbpo_running_avg = mbpo_series.rolling(window=window_size).mean()
    mbpo_running_std = mbpo_series.rolling(window=window_size).std()
    plt.figure(figsize=(12, 7))
    sns.set_theme(style="darkgrid")
    plt.plot(sac_running_avg.index, sac_running_avg, label="SAC", color="purple")
    plt.fill_between(
        sac_running_avg.index,
        sac_running_avg - sac_running_std,
        sac_running_avg + sac_running_std,
        alpha=0.2,
        color="purple",
    )
    plt.plot(mbpo_running_avg.index, mbpo_running_avg, label="MBPO", color="orange")
    plt.fill_between(
        mbpo_running_avg.index,
        mbpo_running_avg - mbpo_running_std,
        mbpo_running_avg + mbpo_running_std,
        alpha=0.2,
        color="orange",
    )
    plt.title("Smoothed Training Curves with Standard Deviation")
    plt.xlabel("Episode")
    plt.ylabel("Return/Score")
    plt.legend()
    plt.show()

In [None]:
plot_scores(sac_scores, mbpo_scores)

## Conclusion

While MBPO is a powerful and foundational algorithm, the field of Model-Based Reinforcement Learning
is vast. The key differences often lie in how the learned model is used—either for data augmentation
(like MBPO) or for active planning—and how model errors are handled.

Here are several influential algorithms and how their core ideas differ from MBPO:

- **PETS (Probabilistic Ensembles with Trajectory Sampling)**: Instead of augmenting a replay
  buffer, PETS uses the model for active decision-time planning by simulating thousands of future
  trajectories to pick the best action at every step, a method known as Model Predictive Control
  (MPC).

- **ME-TRPO (Model-Ensemble Trust Region Policy Optimization)**: Like MBPO, it uses a model ensemble
  to train a policy, but it replaces the standard policy update (like SAC) with the more stable and
  conservative TRPO algorithm to prevent performance collapse from model errors.

- **MB-MPO (Model-Based Meta-Policy Optimization)**: This approach treats each learned model in an
  ensemble as a separate task and uses meta-learning to find a policy that can quickly adapt to any
  of their different dynamics, making it highly robust to model uncertainty.

- **The Dreamer Family (DreamerV2, DreamerV3)**: These state-of-the-art methods learn a world model
  in a compact latent space (not the raw state space) and learn the policy entirely within this
  imagined "dream," making them incredibly effective for complex, image-based environments like
  video games.
