[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/alucantonio/data_enhanced_simulation/blob/master/12_MBRL.ipynb)

# Model-Based Reinforcement Learning (MBRL)

Model-based reinforcement learning uses a model of the dynamics of the environment to
guide the agent's learning and decision-making. This approach differs from model-free
RL, which learns a policy or value function directly from interaction with the environment.

The model of the environment consists of the transition dynamics, $s_{t+1} = f(s_t, a_t)$
(written here for a deterministic environment), and the reward function $R(s,a)$. The
agent either uses a predefined model or learns one through interactions with the environment.

The agent uses the model to simulate trajectories, evaluate potential actions, and
**plan** future behavior. Then, the agent optimizes its policy or value function based
on the predicted outcomes and periodically interacts with the actual environment to
improve the model and correct inaccuracies in its predictions.
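
To make the planning step concrete, here is a minimal sketch that uses a known model to simulate short trajectories and pick the first action of the best one. The 1D dynamics `f`, the reward `R`, the discrete action set and the function names are all hypothetical choices made for illustration; exhaustive search over action sequences is only one of many possible planners.

```python
from itertools import product

ACTIONS = (-1.0, 0.0, 1.0)  # hypothetical discrete action set


def f(s, a):
    """Hypothetical deterministic transition model: s_{t+1} = s_t + a_t."""
    return s + a


def R(s, a):
    """Hypothetical reward: stay close to the origin with small actions."""
    return -s**2 - 0.1 * a**2


def rollout_return(s, actions):
    """Simulate a trajectory with the model and sum the predicted rewards."""
    total = 0.0
    for a in actions:
        total += R(s, a)
        s = f(s, a)
    return total


def plan(s, horizon=3):
    """Evaluate every action sequence of the given horizon, return the best first action."""
    best = max(product(ACTIONS, repeat=horizon), key=lambda seq: rollout_return(s, seq))
    return best[0]


print(plan(2.0))  # first action of the best simulated trajectory starting from s = 2
```

No interaction with a real environment is needed to choose the action: every candidate trajectory is evaluated inside the model, which is where the sample efficiency of MBRL comes from.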

Advantages of MBRL:

- Sample efficiency: by learning and simulating within the model, the agent can reduce the need for extensive interactions with the real environment, making it particularly useful when interactions are costly or limited.
- Generalization: a well-learned model can generalize to unseen scenarios and adapt more easily to changes.
- Interpretability: the explicit model provides insights into the environment’s dynamics.

Challenges:

- Model accuracy: errors in the learned model can lead to suboptimal or unsafe decisions (a problem known as “model bias”).
- Complexity: learning and maintaining a reliable model can be computationally expensive, especially in high-dimensional or stochastic environments.
- Trade-off with exploration: balancing exploration of the real environment and reliance on the model requires careful design.

## Dyna-Q

The Dyna-Q algorithm combines ideas from both model-free RL and model-based planning to
increase learning efficiency. Specifically, it employs both real-world experience and simulated experience derived
from a learned model of the environment.

Key components:

1. Value function (Q-function)
2. Model of the environment: it consists of a transition model and a reward model.
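
As an illustration of how these two components interact, the sketch below runs tabular Dyna-Q on a hypothetical 5-state chain (action 0 moves left, action 1 moves right, reward 1 on reaching the last state). The environment, the hyperparameters and all function names are assumptions chosen to keep the example small; the model is simply a dictionary that memorizes observed transitions, which is only adequate for deterministic environments.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 2      # toy chain: action 0 = left, 1 = right
GOAL = N_STATES - 1             # reaching the last state ends the episode
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1


def chain_step(s, a):
    """Hypothetical deterministic environment: reward 1 on reaching GOAL."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return (1.0 if s_next == GOAL else 0.0), s_next


def q_update(Q, s, a, r, s_next):
    """Component 1: Q-learning update of the tabular value function."""
    bootstrap = 0.0 if s_next == GOAL else GAMMA * Q[s_next].max()
    Q[s, a] += ALPHA * (r + bootstrap - Q[s, a])


def dyna_q(n_episodes=100, n_planning=10, max_steps=100, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))
    model = {}  # Component 2: (s, a) -> (r, s_next), filled from real experience
    for _ in range(n_episodes):
        s = 0
        for _ in range(max_steps):
            # Epsilon-greedy action with random tie-breaking
            if rng.random() < EPSILON:
                a = int(rng.integers(N_ACTIONS))
            else:
                a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))
            r, s_next = chain_step(s, a)       # real experience
            q_update(Q, s, a, r, s_next)       # direct RL update
            model[(s, a)] = (r, s_next)        # model update
            # Planning: replay transitions predicted by the model
            seen = list(model)
            for _ in range(n_planning):
                s_sim, a_sim = seen[rng.integers(len(seen))]
                r_sim, s_sim_next = model[(s_sim, a_sim)]
                q_update(Q, s_sim, a_sim, r_sim, s_sim_next)
            s = s_next
            if s == GOAL:
                break
    return Q, model


Q, model = dyna_q()
greedy = Q[:GOAL].argmax(axis=1)  # greedy action in each non-terminal state
print(greedy)
```

Setting `n_planning=0` turns the loop into plain Q-learning, which makes it easy to compare how many real episodes each variant needs.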

Pseudo-code of the algorithm:

```pseudo
Initialize Q(s, a) arbitrarily
Initialize the model (empty initially)

Repeat (for each episode or until convergence):
    1. Observe the current state s
    2. Select action a using an epsilon-greedy policy
    3. Execute a, observe reward r and next state s'
    4. Update Q(s, a) using the Q-learning update rule
    5. Update the model with (s, a, r, s')
    6. For n planning steps:
        a. Randomly sample (s_sim, a_sim) from the model
        b. Predict r_sim, s'_sim using the model
        c. Update Q(s_sim, a_sim) using the Q-learning update rule
```

## SAC-Dyna-Q-like algorithm

We can combine an actor-critic method (SAC, Soft Actor-Critic) with a planning component
based on the model to create a Dyna-Q-like algorithm.

Here is the pseudo-code:

1. Initialize environment $ \text{env} $.
2. Initialize SAC agent and replay buffer $ \mathcal{D} $.
3. Set $ N_\text{real\_episodes}, N_\text{synthetic\_samples} $.

For $ \text{episode} $ in $ N_\text{real\_episodes} $:
1. **Collect real data:**
   - Reset environment.
   - For each step in the environment:
     1. Select action $a_t$ according to the agent's policy.
     2. Execute $ a_t $, observe $ s_{t+1}, r_t, \text{done} $.
     3. Store $ (s_t, a_t, r_t, s_{t+1}, \text{done}) $ in $ \mathcal{D} $.

2. **Model-free SAC update:**
   - Sample $ (s, a, r, s', \text{done}) $ from $ \mathcal{D} $.
   - Update value and policy networks using SAC objectives.

3. **Generate synthetic data:**
   - For $ i $ in $ N_\text{synthetic\_samples} $:
     1. Sample $ s $ from $ \mathcal{D} $ or the observation space.
     2. Predict action $a$ according to the agent's policy.
     3. Simulate $ s', r $ using the model.
     4. Store $ (s, a, r, s', \text{done}) $ in $ \mathcal{D} $.

4. **SAC update with synthetic data:**
   - Repeat step 2 with synthetic data added to $ \mathcal{D} $.

## Solving the `Pendulum` environment with MBRL

In this exercise, you will implement the SAC-Dyna-Q-like algorithm described above to
solve the `Pendulum` environment of `gymnasium`.

1. Use Symbolic Regression (see notebook n. 11) to discover the equation for the
   evolution of the angular velocity $\dot{\theta}$:

   $\dot{\theta}_{t+1} = f(\theta_t, \dot{\theta}_t, a_t)$

   where $\theta_t$ is the angle and $a_t$ is the action (torque). Compare it with the
   true dynamics:

   $\dot{\theta}_{t+1} = \dot{\theta}_t + \Delta t\frac{3g}{2l}\sin \theta_t + \Delta t\frac{3}{ml^2}a_t$
   
   where $g$ is the gravity acceleration, $l$ is the length of the pendulum, $m$ is the
   mass of the pendulum and $\Delta t$ is the time-step of the simulation.
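
To check a formula returned by symbolic regression, it can help to code the true angular velocity update once and measure the discrepancy on random inputs. The sketch below does this with NumPy only; the function names and the sampling ranges are assumptions, and the default parameters mirror those used in `ModelEnv` below ($g = 10$, $m = l = 1$, $\Delta t = 0.05$), for which the two coefficients work out to $0.75$ (multiplying $\sin\theta_t$) and $0.15$ (multiplying $a_t$). Velocity clipping is not included.

```python
import numpy as np


def true_thdot_update(th, thdot, a, g=10.0, m=1.0, l=1.0, dt=0.05):
    """One explicit Euler step of the angular velocity, before any clipping."""
    return thdot + dt * (3 * g / (2 * l)) * np.sin(th) + dt * (3.0 / (m * l**2)) * a


# Coefficients implied by the default parameters
c_sin = 0.05 * 3 * 10.0 / (2 * 1.0)  # multiplies sin(theta)
c_act = 0.05 * 3.0 / (1.0 * 1.0**2)  # multiplies the torque


def max_abs_error(candidate, n=1000, seed=0):
    """Largest deviation of a candidate formula from the true update on random samples."""
    rng = np.random.default_rng(seed)
    th = rng.uniform(-np.pi, np.pi, n)
    thdot = rng.uniform(-8.0, 8.0, n)   # assumed range: the speed limit of the pendulum
    a = rng.uniform(-2.0, 2.0, n)       # assumed range: the torque limit of the pendulum
    return np.max(np.abs(candidate(th, thdot, a) - true_thdot_update(th, thdot, a)))


# Stand-in for a discovered expression (hypothetical, replace with your own result)
candidate = lambda th, thdot, a: thdot + 0.75 * np.sin(th) + 0.15 * a
print(max_abs_error(candidate))
```

A discovered expression whose coefficients are slightly off will show up here as a nonzero error, which gives a rough idea of the model bias that planning will inherit.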

2. Implement a `gymnasium` environment based on the learned model. Complete the `step`
   function below.

In [None]:
from typing import Optional

import numpy as np

import gymnasium as gym
from gymnasium import spaces

class ModelEnv(gym.Env):

    def __init__(self, render_mode: Optional[str] = None, g=10.0):
        self.max_speed = 8
        self.max_torque = 2.0
        self.dt = 0.05
        self.g = g
        self.m = 1.0
        self.l = 1.0

        self.render_mode = render_mode

        self.screen_dim = 500
        self.screen = None
        self.clock = None
        self.isopen = True

        high = np.array([1.0, 1.0, self.max_speed], dtype=np.float32)
        self.action_space = spaces.Box(
            low=-self.max_torque, high=self.max_torque, shape=(1,), dtype=np.float32
        )
        self.observation_space = spaces.Box(low=-high, high=high, dtype=np.float32)

    def step(self, action):
        th, thdot = self.state  # th := theta

        dt = self.dt

        action = np.clip(action, -self.max_torque, self.max_torque)[0]
        costs = angle_normalize(th) ** 2 + 0.1 * thdot**2 + 0.001 * (action**2)

        # UPDATE ANGULAR VELOCITY HERE based on the equation found via symbolic regression
        # newthdot = ...

        newthdot = np.clip(newthdot, -self.max_speed, self.max_speed)
        newth = th + newthdot * dt

        self.state = np.array([newth, newthdot])

        return self._get_obs(), -costs, False, False, {}

    def reset(self, *, seed: Optional[int] = None, options: Optional[dict] = None):
        super().reset(seed=seed)
        high = np.array([np.pi, 1.])
        low = -high
        self.state = self.np_random.uniform(low=low, high=high)

        return self._get_obs(), {}

    def _get_obs(self):
        theta, thetadot = self.state
        return np.array([np.cos(theta), np.sin(theta), thetadot], dtype=np.float32)

    def render(self):
        pass


def angle_normalize(x):
    return ((x + np.pi) % (2 * np.pi)) - np.pi

3. Complete the function `generate_synthetic_data`, which uses an instance of `ModelEnv`
   to generate samples according to the policy and adds them to
   the SAC replay buffer.

In [None]:
from stable_baselines3 import SAC
import numpy as np
import gymnasium as gym

def generate_synthetic_data(env, policy, replay_buffer, num_samples):
    """
    Generate synthetic transitions using the ModelEnv and add them to the SAC replay buffer.

    Parameters:
    - env: The ModelEnv instance used to simulate transitions.
    - policy: The SAC policy used to predict actions.
    - replay_buffer: The SAC replay buffer where synthetic data will be stored.
    - num_samples: Number of synthetic transitions to generate.
    """
    for _ in range(num_samples):
        # Reset the environment to a random initial state
        # ...

        # Predict action using the current policy
        # ...

        # Step through the environment
        # ...

        if terminated or truncated:
            done = True
        else:
            done = False

        # Add synthetic transition to the replay buffer
        replay_buffer.add(state, next_state, action, reward, done, [{}])


In [3]:
#@title Solution:

from stable_baselines3 import SAC
import numpy as np
import gymnasium as gym

def generate_synthetic_data(env, policy, replay_buffer, num_samples):
    """
    Generate synthetic transitions using the ModelEnv and add them to the SAC replay buffer.

    Parameters:
    - env: The ModelEnv instance used to simulate transitions.
    - policy: The SAC policy used to predict actions.
    - replay_buffer: The SAC replay buffer where synthetic data will be stored.
    - num_samples: Number of synthetic transitions to generate.
    """
    for _ in range(num_samples):
        # Reset the environment to a random initial state
        state, _ = env.reset()

        # Predict action using the current policy
        action = policy.predict(state, deterministic=False)[0]

        # Step through the environment
        next_state, reward, terminated, truncated, _ = env.step(action)

        if terminated or truncated:
            done = True
        else:
            done = False

        # Add synthetic transition to the replay buffer
        replay_buffer.add(state, next_state, action, reward, done, [{}])


4. Complete the function `train` that implements the training loop of the
   SAC-Dyna-Q-like algorithm. Check the [docs](https://stable-baselines3.readthedocs.io/en/master/modules/sac.html) of the SAC class for methods needed for
   training the policy. Run the training on the `Pendulum-v1` environment.

In [None]:
from stable_baselines3.common.logger import configure

# Create the real environment
real_env = gym.make("Pendulum-v1")

# Create the planning environment
model_env = ModelEnv()

# Configure logger
logger = configure("./logs", ["stdout", "csv"])

# Initialize the SAC agent
# model = ...

model.set_logger(logger)

# Training parameters
num_real_episodes = 20
steps_per_episode = 200
# ...
# ...

def train():
    for episode in range(num_real_episodes):
        print(f"Episode {episode + 1}/{num_real_episodes}")

        # Real environment interaction
        state, _ = real_env.reset()

        for _ in range(steps_per_episode):
            # Predict action using the current policy
            # ...

            # Take action in the real environment
            # ...

            # Store real transition in the replay buffer
            # ...

            # Update the agent using real data
            # ...

            state = next_state
            
            if done or truncated:
                break

        # Generate synthetic data for planning
        generate_synthetic_data(model_env, model.policy, model.replay_buffer, num_synthetic_samples)

        # Update the agent using synthetic data
        # ...


In [4]:
#@title Solution:

from stable_baselines3.common.logger import configure

# Create the real environment
real_env = gym.make("Pendulum-v1")

# Create the planning environment
model_env = ModelEnv()

# Configure logger
logger = configure("./logs", ["stdout", "csv"])

# Initialize the SAC agent
model = SAC(
    "MlpPolicy",
    real_env,
    verbose=2,
)

model.set_logger(logger)

# Training parameters
num_real_episodes = 20
num_synthetic_samples = 100 
steps_per_episode = 200
gradient_steps = 2

def train():
    for episode in range(num_real_episodes):
        print(f"Episode {episode + 1}/{num_real_episodes}")

        total_reward = 0.
        # Real environment interaction
        state, _ = real_env.reset()

        for _ in range(steps_per_episode):
            # Predict action using the current policy
            action, _ = model.predict(state, deterministic=False)

            # Take action in the real environment
            next_state, reward, done, truncated, _ = real_env.step(action)

            # Store real transition in the replay buffer
            model.replay_buffer.add(state, next_state, action, reward, done, [{}])

            # Update the agent using real data
            model.train(gradient_steps=gradient_steps)

            state = next_state
            total_reward += reward
            if done or truncated:
                break

        print(total_reward)
        # Generate synthetic data for planning
        generate_synthetic_data(model_env, model.policy, model.replay_buffer, num_synthetic_samples)

        # Update the agent using synthetic data
        for _ in range(num_synthetic_samples):
            model.train(gradient_steps=gradient_steps)


Logging to ./logs
Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


In [5]:
train()

Episode 1/20
-1165.5467805297974
Episode 2/20
-1200.632108076978
Episode 3/20
-1176.6643912765708
Episode 4/20
-857.7098791862674
Episode 5/20
-1013.8722827144887
Episode 6/20
-1290.3436113539876
Episode 7/20
-120.22670171238829
Episode 8/20
-125.76394965933342
Episode 9/20
-1.166182503453593
Episode 10/20
-122.67847458727958
Episode 11/20
-247.70754221348838
Episode 12/20
-122.14709014875864
Episode 13/20
-245.48805022938694
Episode 14/20
-230.60787759365053
Episode 15/20
-122.49819153718009
Episode 16/20
-115.8081584167936
Episode 17/20
-243.3814160629186
Episode 18/20
-121.02087896534711
Episode 19/20
-116.26152193718487
Episode 20/20
-122.29341828630434


5. Play 100 episodes using the trained SAC policy (set `deterministic=True` when using
   the `predict` method) and evaluate the average reward. 

In [None]:
#@title Solution:

# Evaluate the trained agent
# real_env = gym.make("Pendulum-v1", render_mode="human")
real_env = gym.make("Pendulum-v1")
total_rewards = []
for _ in range(100):  # Evaluate for 100 episodes
    state, _ = real_env.reset()
    total_reward = 0
    done = False
    while not done:
        action, _ = model.predict(state, deterministic=True)
        action = np.array(action, dtype=np.float32).reshape(real_env.action_space.shape)
        state, reward, terminated, truncated, _ = real_env.step(action)
        total_reward += reward
        # real_env.render()
        if terminated or truncated:
            done = True
    total_rewards.append(total_reward)

print(f"Average reward over 100 evaluation episodes: {np.mean(total_rewards)}")

Average reward over 100 evaluation episodes: -154.78459222948953
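
When comparing agents, as the next step asks, the mean alone can hide a large spread between episodes. A minimal sketch of a helper that also reports the standard deviation and the standard error of the mean is shown here; the function name is hypothetical and the numbers in the usage line are placeholders, not results from this notebook.

```python
import numpy as np


def summarize_returns(returns):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    returns = np.asarray(returns, dtype=float)
    mean = returns.mean()
    std = returns.std(ddof=1)            # sample standard deviation
    sem = std / np.sqrt(len(returns))    # standard error of the mean
    return mean, std, sem


# Placeholder episode returns, for illustration only
mean, std, sem = summarize_returns([-100.0, -200.0, -300.0])
print(f"{mean:.1f} +/- {sem:.1f} (std {std:.1f})")
```

In the evaluation cell above, the same helper could be called on `total_rewards`.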


6. Train a PPO agent and compare the average reward over 100 episodes.

In [None]:
#@title Solution:

from stable_baselines3 import PPO

# Create the environment
env = gym.make("Pendulum-v1")

# Initialize the PPO model
model = PPO(
    "MlpPolicy",  # Use a Multi-Layer Perceptron policy
    env,
    learning_rate=7e-4,
    verbose=1,
)

# Train the model
model.learn(total_timesteps=200000)

# Evaluate the model
episodes = 100
total_rewards = []
for ep in range(episodes):
    obs, _ = env.reset()
    done = False
    total_reward = 0
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
        done = terminated or truncated
    total_rewards.append(total_reward)

print(f"Average reward over 100 evaluation episodes: {np.mean(total_rewards)}")

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
----------------------------------
| rollout/           |           |
|    ep_len_mean     | 200       |
|    ep_rew_mean     | -1.13e+03 |
| time/              |           |
|    fps             | 7276      |
|    iterations      | 1         |
|    time_elapsed    | 0         |
|    total_timesteps | 2048      |
----------------------------------
----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 200        |
|    ep_rew_mean          | -1.18e+03  |
| time/                   |            |
|    fps                  | 4931       |
|    iterations           | 2          |
|    time_elapsed         | 0          |
|    total_timesteps      | 4096       |
| train/                  |            |
|    approx_kl            | 0.00272544 |
|    clip_fraction        | 0.0151     |
|    clip_range           | 0.2        |
|    entropy_loss      