# Exploration and Curiosity

In Reinforcement Learning (RL), agents learn by maximizing a **reward signal**. This core idea,
often called the _reward hypothesis_, means the design and frequency of rewards heavily influence
how well an agent learns.

However, we often face two major hurdles:

- **Difficult Reward Design:** Creating informative reward functions (extrinsic rewards from the
  environment) can be complex, especially for intricate tasks.
- **Sparse Rewards:** Many environments offer rewards only rarely. Imagine a maze where the agent
  only gets a reward (+1) at the exit and zero everywhere else. Without frequent feedback, the agent
  might wander aimlessly or get stuck, failing to explore effectively and find the goal.

When extrinsic rewards are scarce, standard exploration methods (like adding noise to actions) might
not be enough. The agent needs an internal drive to explore novel situations, a form of _curiosity_.
The overall reward at step $t$ is then given by:

$
r_t = e_t + i_t
$

That is, the sum of the _extrinsic_ reward $e_t$ coming from the environment and an _intrinsic_
component $i_t$ generated by the agent itself.
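
To make the sum concrete, here is a minimal sketch on a toy corridor. It assumes a simple count-based
novelty bonus as the intrinsic component, which is only a stand-in for illustration and is not RND;
`combined_reward` and `beta` are hypothetical names.

```python
from collections import Counter


def combined_reward(extrinsic: float, visits: int, beta: float = 0.1) -> float:
    """Return r_t = e_t + i_t with a count-based bonus i_t = beta / sqrt(visits)."""
    intrinsic = beta / visits ** 0.5
    return extrinsic + intrinsic


# A toy corridor: the extrinsic reward is 0 everywhere except at the goal state 3.
trajectory = [0, 1, 1, 2, 3]
visit_counts = Counter()
rewards = []
for state in trajectory:
    visit_counts[state] += 1
    extrinsic = 1.0 if state == 3 else 0.0
    rewards.append(combined_reward(extrinsic, visit_counts[state]))

# The second visit to state 1 earns a smaller bonus than the first one.
print([round(r, 3) for r in rewards])
```

Even when the extrinsic reward is zero, the agent still receives a signal that shrinks as a state
becomes familiar.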

This notebook explores [Random Network Distillation (RND)](https://arxiv.org/abs/1810.12894), a
powerful technique that provides the agent with such an intrinsic reward. At the time of
publication, RND achieved state-of-the-art results on Montezuma's Revenge, a game famously difficult
for deep reinforcement learning methods.

**NOTE:** It is highly recommended to read and understand the paper before proceeding with the
notebook.


In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import gymnasium as gym
from minigrid.wrappers import ImgObsWrapper
from gymnasium.wrappers.vector import RecordEpisodeStatistics

from util.gymnastics import DEVICE, gym_simulation, init_random
from util.rl_algos import (
    BaseAgent,
    TrajectorySegment,
    collect_trajectory_segment,
    compute_advantages_and_returns,
    flatten_and_shuffle,
)

## MiniGrid DoorKey Environment

From the [documentation](https://minigrid.farama.org/environments/minigrid/DoorKeyEnv/): "This
environment is difficult, because of the sparse reward, to solve using classical RL algorithms. It
is useful to experiment with curiosity or curriculum learning."
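
The same documentation page states that success is rewarded with `1 - 0.9 * (step_count / max_steps)`
and failure with `0`, so the episode return also reflects how quickly the goal is reached. The
sketch below works through that arithmetic for the "solved" threshold of 0.91 used later in this
notebook. It assumes the default episode limit of `10 * size**2` steps, which is not stated above;
`success_reward` is a hypothetical helper.

```python
def success_reward(step_count: int, max_steps: int) -> float:
    """Reward MiniGrid grants on reaching the goal: 1 - 0.9 * (step_count / max_steps)."""
    return 1.0 - 0.9 * (step_count / max_steps)


# Assumption: DoorKey's default episode limit is 10 * size**2 steps.
for size in (5, 8, 16):
    max_steps = 10 * size**2
    # Largest number of steps that still yields a return above 0.91.
    budget = max(n for n in range(1, max_steps + 1) if success_reward(n, max_steps) > 0.91)
    print(f"{size}x{size}: max_steps={max_steps}, steps allowed for a return > 0.91: {budget}")
```

Under these assumptions, an average score above 0.91 means the agent reliably finishes in roughly a
tenth of the step budget, not merely that it finishes.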

Note that this environment can be solved without
[memory](https://github.com/lcswillems/rl-starter-files/tree/317da04a9a6fb26506bbd7f6c7c7e10fc0de86e0?tab=readme-ov-file#add-memory),
though I encourage you to explore Lucas Willems’ starter-files repo and memory-based solutions.

**Something to think about:** Consider the Markov property and the partial observability of the
state. In particular, how does the agent know whether it is carrying the key? It doesn't! It learns
a policy over states in which it is highly likely to hold the key, and if it does not have it, it
goes on exploring. It may even be learning that staying close to a wall means it has the key...


In [None]:
MINIGRID_ENV = "MiniGrid-DoorKey-5x5-v0"
MINIGRID_ENV_8x8 = "MiniGrid-DoorKey-8x8-v0"
MINIGRID_ENV_16x16 = "MiniGrid-DoorKey-16x16-v0"

In [None]:
# Simulation of the environment, to verify it loaded.
# https://minigrid.farama.org/environments/minigrid/DoorKeyEnv/#
gym_simulation(MINIGRID_ENV)

### Environment Wrappers

To simplify training, we keep only the image part of the observation and drop the `direction` and
`mission` fields.


In [None]:
sample_state, _ = init_random(gym.make(MINIGRID_ENV)).reset()

print(sample_state["image"].shape)
print(sample_state["direction"])
print(sample_state["mission"])

In [5]:
def minigrid_wrapper(env: gym.Env) -> gym.Env:
    # Converts to image observations
    env = ImgObsWrapper(env)
    return env

In [None]:
sample_state, _ = minigrid_wrapper(gym.make(MINIGRID_ENV)).reset()
print(sample_state.shape)

In [6]:
def make_minigrid_env(minigrid_env=MINIGRID_ENV, num_envs=32) -> gym.vector.VectorEnv:
    """Utility function to create a MiniGrid vectorized environment with the appropriate wrappers"""
    env = gym.make_vec(minigrid_env, num_envs=num_envs, wrappers=[minigrid_wrapper])
    env = RecordEpisodeStatistics(env)
    return env

## A2C Algorithm

We already implemented a simplified A2C algorithm in the actor-critic notebook. We will now power
up that implementation using a vectorized environment and the utilities introduced in the PPO
notebook, and then extend it with RND. Let's start!
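
As a reminder, the loss the A2C agent in this section minimizes combines a policy term, a value
term, and an entropy bonus. The following is a minimal pure-Python sketch on a made-up batch of two
transitions, with plain lists standing in for tensors. All numbers are invented for illustration;
the coefficients match the `vloss_coeff` and `ent_coeff` defaults used by the agent.

```python
import math

# Toy batch of two transitions; every number here is made up for illustration.
action_probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]  # policy output per state
actions = [0, 2]                                    # actions actually taken
advantages = [1.5, -0.5]
values, returns = [0.4, 0.9], [1.0, 0.5]
vloss_coeff, ent_coeff = 0.5, 0.01

logprobs = [math.log(p[a]) for p, a in zip(action_probs, actions)]
policy_loss = -sum(lp * adv for lp, adv in zip(logprobs, advantages)) / len(actions)
value_loss = sum((v - g) ** 2 for v, g in zip(values, returns)) / len(values)
entropy = sum(-sum(p * math.log(p) for p in probs) for probs in action_probs) / len(action_probs)

# Entropy is subtracted: a more random policy lowers the loss, which encourages exploration.
loss = policy_loss + vloss_coeff * value_loss - ent_coeff * entropy
print(f"policy={policy_loss:.4f} value={value_loss:.4f} entropy={entropy:.4f} loss={loss:.4f}")
```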


### Training Loop


In [None]:
def train(
    env: gym.vector.VectorEnv, agent: BaseAgent, rollout_size=16, log_every=100, solved_score=0.95
):
    """The A2C training loop. We use the utility BaseAgent class for on-policy policy-gradient
    algorithms just for convenience in this notebook."""
    n_step = 0
    avg_score = 0.0
    obs, _ = env.reset()
    obs = torch.Tensor(obs).to(DEVICE)
    while True:
        # Collects a trajectory segment across multiple vectorized environments, making sure that
        # we take intrinsic rewards into account. Feel free to check out the library implementation
        # of `collect_trajectory_segment` which is very similar to the one you implemented already
        # in the PPO notebook!
        segment = collect_trajectory_segment(
            env, agent, obs, rollout_size, with_intrinsic_values=True
        )
        # Make the agent learn via the just-collected trajectory.
        stats = agent.learn(segment)
        # Update the state... Easy to overlook!
        obs = segment.next_start_obs
        print(
            f'Loss: {stats["loss"]: .7f}, Intrinsic reward (avg): '
            + f'{stats["rnd_reward_avg"]: .7f}, RND Loss: {stats["rnd_loss"]: .7f}\r',
            end="",
        )

        n_step += 1
        if n_step % log_every == 0:
            avg_score = np.mean(env.return_queue)
            print(f"Global step: {n_step*rollout_size}, Average Score: {avg_score:.5f}".ljust(100))

        if avg_score > solved_score:
            print(f"Environment solved with avg_score: {avg_score:.5f}".ljust(100))
            break

### Neural Network Architectures

The actor-critic network has a shared feature extractor, the policy head (outputting a probability
distribution over actions), and two value heads: for extrinsic and intrinsic rewards respectively.
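
The first linear layer of each head needs to know the size of the flattened feature vector. Here is
a minimal sketch of that arithmetic, assuming the default 7x7x3 MiniGrid image observation and
convolutions with stride 1 and no padding (neither choice is stated above); `conv_output_size` is a
hypothetical helper.

```python
def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial size after one convolution (standard formula, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumptions: 7x7 MiniGrid view, three 2x2 convolutions, stride 1, no padding.
side = 7
for _ in range(3):
    side = conv_output_size(side, kernel=2)

out_channels = 64  # channels of the last convolutional layer
flat_features = out_channels * side * side
print(side, flat_features)
```

With a different stride or padding the number changes, so passing a dummy observation through the
embedding and reading the output shape is a safer check.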


In [None]:
def convolutional_embedding(input_channels=3) -> nn.Sequential:
    """Shared layers for common feature extraction from the MiniGrid image observation."""
    return nn.Sequential(
        # TODO: Create three 2x2 2D convolutional layers with 16, 32, 64 output channels. Use the
        #       ReLU non-linearity, and remember to flatten to prepare feeding the linear layers!
    )

In [None]:
class ActorCritic(nn.Module):
    """Neural network for both policy (actor) and value function (critic)."""

    def __init__(self, action_dim, hidden_dim=64):
        super(ActorCritic, self).__init__()
        # TODO: Create the shared feature extraction convolutional embedding layer.
        self.features = None
        # Actor head
        self.policy_head = nn.Sequential(
            # TODO: Add a linear layer, tanh non-linearity, and a second linear layer to output the
            #       logits of the action probability distribution.
        )
        # Critic head (extrinsic)
        self.value_extr_head = nn.Sequential(
            # TODO: Add a linear layer, tanh non-linearity, and a second linear layer to output the
            #       state value using extrinsic rewards.
        )
        # Critic head (intrinsic)
        self.value_intr_head = nn.Sequential(
            # TODO: Add a linear layer, tanh non-linearity, and a second linear layer to output the
            #       state value using intrinsic rewards.
        )

    def forward(self, x):
        # TODO: Get the shared feature embedding.
        x = None
        # TODO: Get the policy logits using the policy head.
        policy_logits = None
        # TODO: Get the critic value (extrinsic)
        value_extr = None
        # TODO: Get the critic value (intrinsic)
        value_intr = None
        # TODO: Stack the extrinsic and intrinsic value in a single tensor.
        values = None
        # TODO: Get the Categorical distribution over the policy logits.
        dist = None
        return dist, values

### Agent


In [None]:
class AgentA2C(BaseAgent):
    """A2C agent using vectorized environment and trajectory segments."""

    def __init__(
        self, env, gamma=0.99, lr=0.0001, vloss_coeff=0.5, ent_coeff=0.01, max_grad_norm=0.5
    ):
        self.env = env
        self.gamma = gamma
        self.vloss_coeff = vloss_coeff
        self.ent_coeff = ent_coeff
        self.max_grad_norm = max_grad_norm
        self.model = ActorCritic(env.action_space[0].n).to(DEVICE)
        self.optimizer = optim.RMSprop(self.model.parameters(), lr=lr)

    @torch.no_grad()
    def eval(self, obs: torch.Tensor) -> tuple[torch.Tensor, ...]:
        """Evaluates the current observation, returning the action to take, action logprob, and
        state value."""
        # TODO: Permute the observation. The input `obs` dimension is (batch, x, y, channels), but
        #       PyTorch convolutional layers expect (batch, channels, x, y).
        obs = None
        # TODO: Get action distribution and state values calling the model.
        dist, values = None
        # TODO: Sample the action.
        action = None
        # TODO: Get the log probability.
        logprob = None
        return action, logprob, values

    def learn(self, segment: TrajectorySegment) -> dict:
        """Single step training of the agent using a trajectory segment."""
        with torch.no_grad():
            # TODO: Get the next_values. Hint: use eval.
            _, _, next_values = None
            # TODO: Compute the advantages and returns using the provided library function
            #       `compute_advantages_and_returns`. Note that for basic A2C we need to keep only
            #       the extrinsic component of the value tensors!
            advantages, returns = None

        # TODO: Flatten and shuffle observations from all environments. Hint: remember to permute
        #       the observation; also, the `flatten_and_shuffle` function might help :)
        obs, actions, advantages, returns = None

        # Forward pass
        # TODO: Call the model.
        dist, values = None
        # TODO: Get the logprobs.
        logprobs = None
        # TODO: Also, get the entropy. Hint: you may want to reduce using mean()
        entropy = None

        # Loss
        # TODO: Compute the policy loss as the mean of negative product of logprobs and advantages.
        policy_loss = None
        # TODO: Compute the value loss as MSE between extrinsic values and returns.
        value_loss = None
        # TODO: Compute the total loss as the sum of policy loss plus value loss minus entropy,
        #       scaled by the respective coefficients.
        loss = None

        # TODO: Run backprop optimizer step. Remember to clip the gradient!
        # ...

        return {
            "loss": loss.item(),
            "rnd_reward_avg": 0.0,
            "rnd_loss": 0.0,
        }

### Let's train A2C on a trivial grid


In [None]:
env = init_random(make_minigrid_env())
agent_a2c = AgentA2C(env)
train(env, agent_a2c, rollout_size=16, solved_score=0.91)

In [None]:
gym_simulation(MINIGRID_ENV, agent_a2c, wrappers=[minigrid_wrapper])

## Random Network Distillation

RND encourages the agent to visit unfamiliar states. It measures novelty by seeing how accurately a
_predictor_ network can guess the output of a _fixed, randomly initialized target network_ given the
current state.

- **Familiar states:** The predictor network learns to accurately predict the target's output.
  Prediction error is low -> low intrinsic reward (boredom).
- **Novel states:** The predictor network struggles to predict the target's output. Prediction error
  is high -> high intrinsic reward (curiosity).

This intrinsic reward motivates the agent to explore parts of the environment it hasn't seen often,
helping it overcome sparse extrinsic rewards and learn more effectively. RND cleverly avoids issues
found in other intrinsic motivation methods (like getting distracted by unpredictable but irrelevant
things, e.g., a noisy TV screen).
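
The familiar-versus-novel behaviour can be reproduced with a toy example in plain Python. This
sketch assumes linear "networks" (random matrices) and one-hot states instead of convolutional
networks and images, so it only illustrates the mechanism; all names in it are hypothetical.

```python
import random

random.seed(0)
IN_DIM, OUT_DIM = 4, 4


def random_matrix(rows: int, cols: int) -> list[list[float]]:
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def apply(weights: list[list[float]], x: list[float]) -> list[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


target = random_matrix(OUT_DIM, IN_DIM)     # fixed, never updated
predictor = random_matrix(OUT_DIM, IN_DIM)  # trained to imitate the target


def rnd_error(x):
    """Mean squared error between predictor and target outputs (the intrinsic reward)."""
    diff = [p - t for p, t in zip(apply(predictor, x), apply(target, x))]
    return sum(d * d for d in diff) / OUT_DIM


def train_step(x, lr=0.5):
    """One SGD step on the predictor for a single state."""
    diff = [p - t for p, t in zip(apply(predictor, x), apply(target, x))]
    for i in range(OUT_DIM):
        for j in range(IN_DIM):
            predictor[i][j] -= lr * 2.0 * diff[i] * x[j] / OUT_DIM


familiar = [1.0, 0.0, 0.0, 0.0]
novel = [0.0, 1.0, 0.0, 0.0]
for _ in range(200):
    train_step(familiar)

familiar_error = rnd_error(familiar)
novel_error = rnd_error(novel)
print(f"familiar: {familiar_error:.6f}, novel: {novel_error:.6f}")
```

After training only on the familiar state, its prediction error collapses towards zero, while the
error on the never-visited state stays at its initial level and would yield a larger intrinsic
reward.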


### RND Module

Let's implement the RND module, i.e., the neural network used to output the intrinsic reward using
a trainable predictor and a fixed target network.


In [None]:
class RNDModule(nn.Module):
    """The Random Network Distillation module"""

    def __init__(self, hidden_dim=256):
        super(RNDModule, self).__init__()
        # RND predictor network.
        self.predictor = nn.Sequential(
            # TODO: Shared convolutional embedding layer to start
            # TODO: Linear layer of hidden_dim
            # TODO: Extra layers in predictor, regularized via dropouts. We use dropouts instead of
            #       other regularization techniques.
            #       Hint: dropout(0.25), ReLU, linear, dropout(0.25), ReLU, linear.
        )
        # RND target network (non-trainable)
        self.target = nn.Sequential(
            # TODO: Shared convolutional embedding layer to start
            # TODO: Linear layer of hidden_dim
        )
        # TODO: Freeze the target network. Hint: set `requires_grad` to False in the parameters.
        # ...

    def forward(self, obs):
        # TODO: Get the target output (make sure to detach())
        target_output = None
        # TODO: Get the predictor output.
        predictor_output = None
        # TODO: Compute the intrinsic reward for each step of the trajectory using MSE. Keep an eye
        #       on dimensionality, reduction, and which dimension to compute the mean() of!
        rnd_error = None
        return rnd_error

### A2C w/ RND

Let's now re-implement A2C integrating RND and intrinsic rewards.
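
Before filling in the agent, it may help to see how the two reward streams end up in a single
advantage. The sketch below assumes plain bootstrapped n-step returns and ignores episode
terminations; the library's `compute_advantages_and_returns` may compute them differently.
`n_step_returns` and all toy numbers are hypothetical, while the `eta` values are the defaults of
the agent defined in this section.

```python
def n_step_returns(rewards, next_value, gamma):
    """Discounted returns for one rollout, bootstrapped from the value of the last state."""
    returns, running = [], next_value
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


# Toy rollout of 3 steps: sparse extrinsic reward, dense intrinsic reward.
extr_rewards, intr_rewards = [0.0, 0.0, 1.0], [0.3, 0.2, 0.1]
extr_values, intr_values = [0.5, 0.6, 0.8], [0.4, 0.3, 0.2]

returns_extr = n_step_returns(extr_rewards, next_value=0.0, gamma=0.99)
returns_intr = n_step_returns(intr_rewards, next_value=0.1, gamma=0.99)

advantages_extr = [g - v for g, v in zip(returns_extr, extr_values)]
advantages_intr = [g - v for g, v in zip(returns_intr, intr_values)]

eta_extr, eta_intr = 100.0, 10.0  # default coefficients of the RND agent
advantages = [eta_extr * ae + eta_intr * ai for ae, ai in zip(advantages_extr, advantages_intr)]
print(advantages)
```

Each stream keeps its own value head and its own returns; only the advantages are mixed, weighted
by the `eta` coefficients, before they enter the policy loss.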


In [None]:
class AgentA2CWithRND(BaseAgent):
    """A2C agent with RND using vectorized environment and trajectory segments."""

    def __init__(
        self,
        env,
        gamma=0.99,
        lr=0.0001,
        vloss_coeff=0.5,
        ent_coeff=0.01,
        max_grad_norm_ac=0.5,
        # The following hyperparameters have been tuned for minigrid in these examples.
        gamma_rnd=0.99,
        eta_extr=100.0,
        eta_intr=10.0,
        max_grad_norm_rnd=0.01,
    ):
        self.env = env
        self.gamma = gamma
        self.vloss_coeff = vloss_coeff
        self.ent_coeff = ent_coeff
        self.max_grad_norm_ac = max_grad_norm_ac
        self.gamma_rnd = gamma_rnd
        self.eta_extr = eta_extr
        self.eta_intr = eta_intr
        self.max_grad_norm_rnd = max_grad_norm_rnd

        self.model = ActorCritic(env.action_space[0].n).to(DEVICE)
        self.rnd_module = RNDModule().to(DEVICE)
        self.optimizer = optim.RMSprop(
            list(self.model.parameters()) + list(self.rnd_module.predictor.parameters()), lr=lr
        )

    @torch.no_grad()
    def eval(self, obs: torch.Tensor) -> tuple[torch.Tensor, ...]:
        # TODO: Same implementation as in AgentA2C!
        pass

    def learn(self, segment: TrajectorySegment) -> dict:
        batch_size = segment.obs.shape[0]
        n_envs = segment.obs.shape[1]

        with torch.no_grad():
            # TODO: Get the next_values.
            _, _, next_values = None
            # TODO: Compute extrinsic advantages and returns.
            advantages_extr, returns_extr = None
            next_obs = None  # Hint: flatten and permute!
            # TODO: Call the RNDModule to compute intrinsic rewards on the next observations. Note:
            #       we use dropouts as regularization, so we need to call eval() before the forward
            #       pass, and then train() again.
            # ...
            # In the original implementation, the intrinsic rewards are scaled by a rolling std.
            # Here we don't do that, and use dropouts as regularizers for simplicity.
            intr_rewards = None
            # ...
            # Compute intrinsic advantages and returns.
            advantages_intr, returns_intr = None

        # TODO: Flatten and shuffle observations from all environments, advantages, and returns.
        obs, actions, advantages_extr, returns_extr = None
        advantages_intr, returns_intr = None

        # TODO: Compute advantages as combination of extrinsic and intrinsic. Remember to use the
        #       appropriate `eta` coefficients.
        advantages = None

        # TODO: Forward pass, as in AgentA2C.
        dist, values = None
        logprobs = None
        entropy = None

        # TODO: Compute policy and value loss (as sum of extrinsic and intrinsic losses)
        policy_loss = None
        value_extr_loss = None
        value_intr_loss = None
        value_loss = None

        # RND. In the original implementation, observations are normalized here.
        # Moreover, the predictor is updated only on a randomly masked subset of the experience
        # as a form of regularization (we skip the mask and use dropouts instead).
        # TODO: Get the rnd_error and rnd_loss (as the mean of the error)
        rnd_error = None
        rnd_loss = None

        # TODO: Compute the total loss.
        loss = None

        # TODO: Backprop (and clip grad norm!)
        # ...

        return {
            "loss": loss.item(),
            "rnd_reward_avg": intr_rewards.mean().item(),
            "rnd_loss": rnd_loss.item(),
        }

### Let's train A2C w/ RND on a trivial grid


In [None]:
env = init_random(make_minigrid_env())
agent_rnd = AgentA2CWithRND(env)
train(env, agent_rnd, solved_score=0.91)

In [None]:
gym_simulation(MINIGRID_ENV, agent_rnd, wrappers=[minigrid_wrapper])

## Time to shine: larger grids!

Let's train our A2C w/ RND on a larger grid.


### MiniGrid-DoorKey-8x8


In [None]:
env = init_random(make_minigrid_env(MINIGRID_ENV_8x8))
agent_rnd_8 = AgentA2CWithRND(env)
train(env, agent_rnd_8)

In [None]:
gym_simulation(MINIGRID_ENV_8x8, agent_rnd_8, wrappers=[minigrid_wrapper])

Now let's train plain A2C, giving it a head start with a lower solved-score threshold :)


In [None]:
env = init_random(make_minigrid_env(MINIGRID_ENV_8x8))
agent_a2c_8 = AgentA2C(env)
# Give A2C a head start by lowering the solved-score threshold :)
train(env, agent_a2c_8, solved_score=0.91)

In [None]:
gym_simulation(MINIGRID_ENV_8x8, agent_a2c_8, wrappers=[minigrid_wrapper])

### MiniGrid-DoorKey-16x16

Train A2C w/ RND on `MiniGrid-DoorKey-16x16`. It is going to take time, because a lot of
exploration is required before the reward signal becomes stable... but it will be solved!

**NOTE:** This is computationally very intensive (128 parallel environments)! Consider just running
the pretrained simulation below.


In [None]:
def run_pretrained_simulation():
    loaded_env = gym.make_vec(MINIGRID_ENV_16x16, wrappers=[minigrid_wrapper])
    loaded_agent = AgentA2CWithRND(loaded_env)
    loaded_params = torch.load("solution/a2c_rnd_16x16_weights.pth", map_location=DEVICE)
    loaded_agent.model.load_state_dict(loaded_params["a2c_params"])
    loaded_agent.rnd_module.load_state_dict(loaded_params["rnd_params"])
    return gym_simulation(MINIGRID_ENV_16x16, loaded_agent, wrappers=[minigrid_wrapper], seed=110)


# Uncomment the following line to run the pretrained agent on the 16x16 grid.
# run_pretrained_simulation()

In [None]:
env = init_random(make_minigrid_env(MINIGRID_ENV_16x16, num_envs=128))
agent_rnd_16 = AgentA2CWithRND(env, gamma=0.995)
train(env, agent_rnd_16, rollout_size=48, log_every=35)

In [None]:
gym_simulation(MINIGRID_ENV_16x16, agent_rnd_16, wrappers=[minigrid_wrapper])

Feel free to try plain A2C on the same environment and wait as long as you like... but I would not
recommend waiting too long :)


In [None]:
# Attempt bare A2C... it's not gonna work...
env = init_random(make_minigrid_env(MINIGRID_ENV_16x16, num_envs=128))
agent_a2c_16 = AgentA2C(env, gamma=0.995)
train(env, agent_a2c_16, rollout_size=48, log_every=35, solved_score=0.91)