# Intrinsic Curiosity Module (ICM) with A2C on FrozenLake-v1

Implementation of the Intrinsic Curiosity Module (ICM) from Pathak et al. (2017), "Curiosity-driven Exploration by Self-supervised Prediction".

This notebook demonstrates:
- Baseline A2C implementation on FrozenLake 8x8
- A2C with ICM for curiosity-driven exploration
- Comparison between both approaches

**FrozenLake-v1** is a challenging discrete environment where the agent must navigate a slippery frozen lake to reach a goal while avoiding holes.

## Setup and Imports

In [None]:
import gymnasium as gym
from gymnasium.wrappers import FlattenObservation
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam

from stable_baselines3 import A2C
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize
from stable_baselines3.common.callbacks import EvalCallback, CheckpointCallback, BaseCallback
from stable_baselines3.common.monitor import Monitor

from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for plots
sns.set_style('darkgrid')
plt.rcParams['figure.figsize'] = (12, 6)

## 1. Environment Setup

FrozenLake-v1 is a grid-world environment where:
- **S**: Starting position
- **F**: Frozen surface (safe to walk on)
- **H**: Hole (fall to your doom)
- **G**: Goal (the cell the agent must reach)

The 8x8 map is more challenging than the default 4x4, and `is_slippery=True` means actions have stochastic outcomes.
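
To make "stochastic outcomes" concrete: under Gymnasium's default slippery dynamics, the agent moves in the intended direction with probability 1/3 and slips to each of the two perpendicular directions with probability 1/3. The sketch below assumes those default probabilities; the helper `slippery_outcomes` is illustrative and not part of Gymnasium.

```python
# Action ids follow FrozenLake-v1: 0=Left, 1=Down, 2=Right, 3=Up.
ACTION_NAMES = {0: "Left", 1: "Down", 2: "Right", 3: "Up"}


def slippery_outcomes(action):
    """Return {direction_name: probability} for one intended action.

    Assumes the default slippery dynamics: the intended direction and the
    two perpendicular directions are each taken with probability 1/3.
    """
    candidates = [(action - 1) % 4, action, (action + 1) % 4]
    return {ACTION_NAMES[a]: 1 / 3 for a in candidates}


outcomes = slippery_outcomes(2)  # the agent intends to move Right
print(outcomes)
```

Under this assumption the agent never moves opposite to the intended direction, but two thirds of the time it does not move where it intended either.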

In [None]:
def make_frozenlake_env(map_name="8x8", is_slippery=True, render_mode=None):
    """Create a FrozenLake environment wrapped for neural-network policies."""
    env = gym.make(
        "FrozenLake-v1",
        map_name=map_name,
        is_slippery=is_slippery,
        render_mode=render_mode,
    )
    # FlattenObservation converts the discrete state to a one-hot vector
    env = FlattenObservation(env)
    env = Monitor(env)
    return env


def make_env():
    """Create the default FrozenLake environment."""
    return make_frozenlake_env(map_name="8x8", is_slippery=True)


# Test environment
test_env = make_env()
print(f"Observation space: {test_env.observation_space}")
print(f"Action space: {test_env.action_space}")
print(f"Observation shape: {test_env.observation_space.shape}")
print(f"Number of actions: {test_env.action_space.n}")
print(f"\nActions: 0=Left, 1=Down, 2=Right, 3=Up")
test_env.close()

## 2. Intrinsic Curiosity Module (ICM)

**Paper Reference: Pathak et al. (2017), Section 2.2 and Figure 2**

ICM consists of three components that work together to generate curiosity-driven intrinsic rewards:

### Feature Encoder: φ(s)
Encodes raw observations into a learned feature space that filters out task-irrelevant information (e.g., moving leaves, TV static).

### Inverse Dynamics Model (Equations 2-3)
Predicts action from state transitions:
- **Equation 2**: `â_t = g(s_t, s_{t+1}; θ_I)`
- **Equation 3**: `min_θI L_I(â_t, a_t)`

The inverse model learns features φ(s) that encode **only** information relevant for predicting actions, filtering out uncontrollable aspects of the environment.
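
For discrete actions such as FrozenLake's, L_I is commonly a cross-entropy between the predicted action distribution and the action actually taken (the implementation later in this notebook uses `F.cross_entropy`). The following is a minimal pure-Python sketch of that loss; the logits are made-up values, and `cross_entropy` is an illustrative helper rather than the notebook's code.

```python
import math


def cross_entropy(logits, action):
    """Cross-entropy between softmax(logits) and the action actually taken."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[action]


# Hypothetical inverse-model logits over the four FrozenLake actions.
confident = cross_entropy([4.0, 0.0, 0.0, 0.0], action=0)
uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], action=0)
print(f"confident and correct: {confident:.4f}")
print(f"uniform guess:         {uniform:.4f}")  # equals ln(4)
```

Because FrozenLake is slippery, the same transition can follow from several different actions, so the inverse loss on this environment should not be expected to reach zero.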

### Forward Dynamics Model (Equations 4-5)
Predicts next state features from current state and action:
- **Equation 4**: `φ̂(s_{t+1}) = f(φ(s_t), a_t; θ_F)`
- **Equation 5**: `L_F = (1/2) ||φ̂(s_{t+1}) - φ(s_{t+1})||²_2`

### Intrinsic Reward (Equation 6) - **THE CURIOSITY SIGNAL**
The prediction error in feature space serves as the curiosity reward:
- **Equation 6**: `r^i_t = (η/2) ||φ̂(s_{t+1}) - φ(s_{t+1})||²_2`

where η = 0.01 (default). High prediction error → novel/surprising state → high intrinsic reward → encourages exploration.
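
As a worked example of Equation 6, the sketch below computes the intrinsic reward for two hand-picked feature vectors. The 3-dimensional features and the helper name `intrinsic_reward` are assumptions for illustration only.

```python
def intrinsic_reward(pred_next_features, next_features, eta=0.01):
    """Equation 6: (eta / 2) * squared L2 distance in feature space."""
    squared_error = sum(
        (p - t) ** 2 for p, t in zip(pred_next_features, next_features)
    )
    return 0.5 * eta * squared_error


# Hypothetical features: a well-predicted transition and a surprising one.
familiar = intrinsic_reward([0.9, 0.1, 0.0], [1.0, 0.0, 0.0])
novel = intrinsic_reward([0.0, 1.0, 2.0], [1.0, 0.0, 0.0])
print(f"familiar transition: {familiar:.4f}")
print(f"novel transition:    {novel:.4f}")
```

The familiar transition has squared error 0.02 and the novel one 6.0, so with η = 0.01 the rewards are 0.0001 and 0.03 respectively.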

### Overall Optimization (Equation 7)
The complete system is trained end-to-end:

**`min_{θP,θI,θF} [-λ E_π[Σ_t r_t] + (1-β)L_I + βL_F]`**

where:
- **β = 0.2** (paper default): weights forward vs inverse loss
- **λ = 0.1** (paper default): weights policy gradient vs ICM learning  
- **r_t = r^e_t + r^i_t**: total reward = extrinsic + intrinsic
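
The ICM part of Equation 7 is a convex combination of the two losses. A small sketch with made-up loss values shows how β shifts the weight between them; `icm_loss` is an illustrative helper, not the notebook's implementation.

```python
def icm_loss(inverse_loss, forward_loss, beta=0.2):
    """ICM terms of Equation 7: (1 - beta) * L_I + beta * L_F."""
    return (1.0 - beta) * inverse_loss + beta * forward_loss


# Hypothetical loss values for one batch.
loss_default = icm_loss(inverse_loss=1.2, forward_loss=0.5)            # beta = 0.2
loss_forward_only = icm_loss(inverse_loss=1.2, forward_loss=0.5, beta=1.0)
print(f"beta=0.2: {loss_default:.2f}")
print(f"beta=1.0: {loss_forward_only:.2f}")
```

With β = 0.2 the inverse loss contributes 0.96 and the forward loss 0.10, for a total of 1.06; with β = 1.0 only the forward loss remains.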

In [None]:
class ICMModule(nn.Module):
    """
    Intrinsic Curiosity Module with forward and inverse dynamics models.
    
    Supports both:
    - Visual observations (Conv2D encoder)
    - Vector observations (MLP encoder) - used for FrozenLake
    """
    
    def __init__(self, observation_space, action_space, feature_dim=288, beta=0.2, eta=0.01):
        """
        Initialize ICM module.
        
        Args:
            observation_space: Gym observation space
            action_space: Gym action space
            feature_dim: Dimension of learned feature representation
            beta: Weight for forward loss vs inverse loss (paper uses 0.2)
            eta: Scaling factor for intrinsic reward (paper uses 0.01)
        """
        super(ICMModule, self).__init__()
        
        self.feature_dim = feature_dim
        self.beta = beta
        self.eta = eta
        
        # Determine if observations are images or vectors
        if len(observation_space.shape) == 3:
            # Image observations (C, H, W)
            self.is_image = True
            n_input_channels = observation_space.shape[0]
        elif len(observation_space.shape) == 1:
            # Vector observations (e.g., FrozenLake)
            self.is_image = False
            self.obs_dim = observation_space.shape[0]
        else:
            raise ValueError(f"Unsupported observation space shape: {observation_space.shape}")
        
        # Determine action space
        if hasattr(action_space, 'n'):
            self.action_dim = action_space.n
            self.discrete = True
        else:
            self.action_dim = action_space.shape[0]
            self.discrete = False
        
        # Create appropriate feature encoder
        if self.is_image:
            # Convolutional encoder for images (as in paper)
            self.feature_encoder = nn.Sequential(
                nn.Conv2d(n_input_channels, 32, kernel_size=3, stride=2, padding=1),
                nn.ELU(),
                nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1),
                nn.ELU(),
                nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1),
                nn.ELU(),
                nn.Conv2d(32, 32, kernel_size=3, stride=2, padding=1),
                nn.ELU(),
                nn.Flatten(),
            )
            
            # Calculate feature dimension after convolutions
            with torch.no_grad():
                sample = torch.zeros(1, *observation_space.shape)
                n_flatten = self.feature_encoder(sample).shape[1]
            
            # Project to desired feature dimension
            self.feature_projection = nn.Linear(n_flatten, feature_dim)
        else:
            # MLP encoder for vector observations (FrozenLake uses this)
            self.feature_encoder = nn.Sequential(
                nn.Linear(self.obs_dim, 128),
                nn.ReLU(),
                nn.Linear(128, 128),
                nn.ReLU(),
            )
            self.feature_projection = nn.Linear(128, feature_dim)
        
        # Inverse model: φ(st), φ(st+1) → at
        self.inverse_model = nn.Sequential(
            nn.Linear(feature_dim * 2, 256),
            nn.ReLU(),
            nn.Linear(256, self.action_dim)
        )
        
        # Forward model: φ(st), at → φ(st+1)
        self.forward_model = nn.Sequential(
            nn.Linear(feature_dim + self.action_dim, 256),
            nn.ReLU(),
            nn.Linear(256, feature_dim)
        )
    
    def encode(self, obs):
        """
        Encode observation to feature space.
        
        Args:
            obs: Observation tensor
            
        Returns:
            Feature representation φ(obs)
        """
        features = self.feature_encoder(obs)
        features = self.feature_projection(features)
        return features
    
    def forward(self, obs, next_obs, action):
        """
        Compute ICM losses and intrinsic reward.
        
        Args:
            obs: Current observation
            next_obs: Next observation
            action: Action taken
        
        Returns:
            Tuple of (forward_loss, inverse_loss, intrinsic_reward)
        """
        # Encode observations to feature space
        phi_obs = self.encode(obs)
        phi_next_obs = self.encode(next_obs)
        
        # Inverse model loss: predict action from state transition
        phi_concat = torch.cat([phi_obs, phi_next_obs], dim=1)
        pred_action = self.inverse_model(phi_concat)
        
        if self.discrete:
            inverse_loss = F.cross_entropy(pred_action, action.long())
        else:
            inverse_loss = F.mse_loss(pred_action, action)
        
        # Forward model loss: predict next state features
        if self.discrete:
            action_one_hot = F.one_hot(action.long(), num_classes=self.action_dim).float()
        else:
            action_one_hot = action
        
        phi_action = torch.cat([phi_obs, action_one_hot], dim=1)
        pred_phi_next = self.forward_model(phi_action)
        
        forward_loss = F.mse_loss(pred_phi_next, phi_next_obs.detach())
        
        # Intrinsic reward = prediction error in feature space
        intrinsic_reward = self.eta / 2 * torch.norm(
            pred_phi_next - phi_next_obs.detach(), 
            dim=1, 
            p=2
        ) ** 2
        
        return forward_loss, inverse_loss, intrinsic_reward

## 3. ICM Callback for Training

**Paper Reference: Equation 7 - Joint optimization of policy and ICM**

This callback integrates ICM training with A2C rollouts:

1. **Collects transitions** (s_t, a_t, s_{t+1}) from rollout buffer
2. **Computes ICM losses**:
   - Inverse loss L_I (Eq 3)
   - Forward loss L_F (Eq 5)  
   - Combined: `ICM_loss = (1-β)L_I + βL_F` with β=0.2
3. **Computes intrinsic rewards** r^i_t (Eq 6) using forward model prediction error
4. **Adds intrinsic rewards** to extrinsic rewards: `r_t = r^e_t + r^i_t`
5. **Updates ICM parameters** θ_I and θ_F via backpropagation

The A2C policy then trains on the augmented rewards, so it maximizes the sum of the extrinsic reward and the curiosity-driven intrinsic reward.
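
Step 4 only influences the policy if the augmented rewards are the ones used to compute returns and advantages. The sketch below shows the effect on bootstrapped discounted returns for one short rollout; the reward values and the helper `discounted_returns` are illustrative assumptions, not part of Stable-Baselines3.

```python
def discounted_returns(rewards, last_value, gamma=0.99):
    """Bootstrapped discounted returns, computed backwards through the rollout."""
    returns = []
    running = last_value
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


extrinsic = [0.0, 0.0, 0.0, 0.0, 0.0]       # goal not reached in this rollout
intrinsic = [0.02, 0.01, 0.03, 0.00, 0.05]  # hypothetical curiosity rewards
augmented = [e + i for e, i in zip(extrinsic, intrinsic)]

extrinsic_returns = discounted_returns(extrinsic, last_value=0.0)
augmented_returns = discounted_returns(augmented, last_value=0.0)
print(extrinsic_returns)  # all zeros: no learning signal
print(augmented_returns)  # positive at every step
```

With `gae_lambda=1.0`, as used in this notebook, the advantage is the return minus the value estimate, so the intrinsic term propagates backwards through the whole rollout.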

In [None]:
class ICMCallback(BaseCallback):
    """
    Callback to train ICM module during A2C rollouts and add intrinsic rewards.
    """
    
    def __init__(self, icm_module, icm_optimizer, lambda_weight=0.1, verbose=0):
        """
        Initialize ICM callback.
        
        Args:
            icm_module: ICMModule instance
            icm_optimizer: Optimizer for ICM
            lambda_weight: Weight for ICM loss (not currently used)
            verbose: Verbosity level
        """
        super(ICMCallback, self).__init__(verbose)
        self.icm_module = icm_module
        self.icm_optimizer = icm_optimizer
        self.lambda_weight = lambda_weight
        self.intrinsic_rewards = []
        self.forward_losses = []
        self.inverse_losses = []
        self.icm_losses = []
        
    def _on_step(self) -> bool:
        return True
    
    def _on_rollout_end(self) -> None:
        """Train ICM on collected rollout data and add intrinsic rewards."""
        rollout_buffer = self.model.rollout_buffer
        
        # Collect all transitions
        obs_list = []
        next_obs_list = []
        actions_list = []
        
        buffer_size = rollout_buffer.observations.shape[0]
        n_envs = rollout_buffer.observations.shape[1]
        
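        # Extract transitions step by step.
        # Note: a pair whose next observation follows an episode reset is not a true
        # environment transition; such pairs are not filtered out here.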
        for step in range(buffer_size - 1):
            for env in range(n_envs):
                obs_list.append(rollout_buffer.observations[step, env])
                next_obs_list.append(rollout_buffer.observations[step + 1, env])
                actions_list.append(rollout_buffer.actions[step, env])
        
        # Convert to tensors
        obs = torch.FloatTensor(np.array(obs_list)).to(self.model.device)
        next_obs = torch.FloatTensor(np.array(next_obs_list)).to(self.model.device)
        actions = torch.FloatTensor(np.array(actions_list)).to(self.model.device)
        
        # Handle action shape for discrete actions (squeeze if needed)
        if len(actions.shape) > 1 and actions.shape[-1] == 1:
            actions = actions.squeeze(-1)
        
        # Train ICM
        self.icm_optimizer.zero_grad()
        forward_loss, inverse_loss, intrinsic_reward = self.icm_module(obs, next_obs, actions)
        
        # Combined ICM loss (Eq. 7 in paper)
        icm_loss = (1 - self.icm_module.beta) * inverse_loss + self.icm_module.beta * forward_loss
        icm_loss.backward()
        self.icm_optimizer.step()
        
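        # Add intrinsic rewards to rollout buffer
        intrinsic_reward_np = intrinsic_reward.detach().cpu().numpy()
        intrinsic_reward_reshaped = intrinsic_reward_np.reshape(buffer_size - 1, n_envs)
        rollout_buffer.rewards[:-1] += intrinsic_reward_reshaped
        
        # SB3 computes returns and advantages before calling on_rollout_end(), so
        # recompute them here; otherwise the intrinsic rewards never reach the policy update.
        # Assumes a recent SB3 version that refreshes self.locals just before this hook,
        # so "values" and "dones" hold the bootstrap values and dones of the last step.
        rollout_buffer.compute_returns_and_advantage(
            last_values=self.locals["values"], dones=self.locals["dones"]
        )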
        
        # Track statistics
        self.intrinsic_rewards.extend(intrinsic_reward_np.tolist())
        self.forward_losses.append(forward_loss.item())
        self.inverse_losses.append(inverse_loss.item())
        self.icm_losses.append(icm_loss.item())
        
        # LOG TO TENSORBOARD
        if self.logger is not None:
            # Log ICM-specific metrics
            self.logger.record("icm/forward_loss", forward_loss.item())
            self.logger.record("icm/inverse_loss", inverse_loss.item())
            self.logger.record("icm/total_loss", icm_loss.item())
            self.logger.record("icm/mean_intrinsic_reward", intrinsic_reward_np.mean())
            self.logger.record("icm/std_intrinsic_reward", intrinsic_reward_np.std())
            self.logger.record("icm/max_intrinsic_reward", intrinsic_reward_np.max())
            self.logger.record("icm/min_intrinsic_reward", intrinsic_reward_np.min())
            
            # Log cumulative statistics
            if len(self.forward_losses) > 0:
                self.logger.record("icm/avg_forward_loss", np.mean(self.forward_losses[-100:]))
                self.logger.record("icm/avg_inverse_loss", np.mean(self.inverse_losses[-100:]))
            
            # Log reward composition
            extrinsic_rewards = rollout_buffer.rewards[:-1] - intrinsic_reward_reshaped
            self.logger.record("icm/mean_extrinsic_reward", extrinsic_rewards.mean())
            self.logger.record("icm/intrinsic_to_extrinsic_ratio", 
                             intrinsic_reward_np.mean() / (abs(extrinsic_rewards.mean()) + 1e-8))
        
        if self.verbose > 0:
            print(f"ICM - Forward: {forward_loss.item():.4f}, "
                  f"Inverse: {inverse_loss.item():.4f}, "
                  f"Intrinsic Reward: {intrinsic_reward_np.mean():.4f}")

## 4. Baseline A2C Training

**Paper Reference: Equation 1 - Policy optimization**

Trains a standard A2C agent that maximizes: `max_θP E_π[Σ_t r_t]`

This baseline uses **only extrinsic rewards** r^e_t from the environment (no curiosity). In sparse reward environments like FrozenLake, the agent receives r^e_t = 0 until it reaches the goal, making exploration difficult.

In [None]:
def train_baseline_a2c(
    total_timesteps=200_000,
    n_envs=4,
    learning_rate=7e-4,
    n_steps=5,
    gamma=0.99,
    gae_lambda=1.0,
    ent_coef=0.01,
    vf_coef=0.5,
    save_path="models/baseline/a2c_frozenlake_baseline"
):
    """
    Train baseline A2C on FrozenLake-v1 (8x8)
    
    Args:
        total_timesteps: Total training steps
        n_envs: Number of parallel environments
        learning_rate: Learning rate for optimizer
        n_steps: Number of steps to run for each environment per update
        gamma: Discount factor
        gae_lambda: Factor for GAE
        ent_coef: Entropy coefficient for exploration
        vf_coef: Value function coefficient
        save_path: Path to save the model
    """
    
    # Create vectorized environment
    env = DummyVecEnv([make_env for _ in range(n_envs)])
    
    # Normalize observations and rewards for better training
    env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)
    
    # Create evaluation environment
    eval_env = DummyVecEnv([make_env])
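    # training=False keeps evaluation from updating the running normalization statistics
    eval_env = VecNormalize(eval_env, training=False, norm_obs=True, norm_reward=False, clip_obs=10.)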
    
    # Callbacks for evaluation and checkpointing
    eval_callback = EvalCallback(
        eval_env,
        best_model_save_path=f"./logs/{save_path}/",
        log_path=f"./logs/{save_path}/",
        eval_freq=5000,
        deterministic=True,
        render=False,
        n_eval_episodes=10
    )
    
    checkpoint_callback = CheckpointCallback(
        save_freq=10000,
        save_path=f"./logs/{save_path}/checkpoints/",
        name_prefix="a2c_model"
    )
    
    # Create A2C model
    model = A2C(
        "MlpPolicy",
        env,
        learning_rate=learning_rate,
        n_steps=n_steps,
        gamma=gamma,
        gae_lambda=gae_lambda,
        ent_coef=ent_coef,
        vf_coef=vf_coef,
        max_grad_norm=0.5,
        use_rms_prop=True,
        normalize_advantage=True,
        verbose=1,
        tensorboard_log=f"./logs/{save_path}/tensorboard/"
    )
    
    print(f"\n{'='*50}")
    print(f"Training Baseline A2C on FrozenLake-v1 (8x8)")
    print(f"{'='*50}")
    print(f"Total timesteps: {total_timesteps:,}")
    print(f"Number of environments: {n_envs}")
    print(f"Learning rate: {learning_rate}")
    print(f"Entropy coefficient: {ent_coef}")
    print(f"{'='*50}\n")
    
    # Train the model
    model.learn(
        total_timesteps=total_timesteps,
        callback=[eval_callback, checkpoint_callback],
        progress_bar=True
    )
    
    # Save the final model and normalization stats
    model.save(f"{save_path}_final")
    env.save(f"{save_path}_vecnormalize.pkl")
    
    print(f"\n{'='*50}")
    print(f"Training completed!")
    print(f"Model saved to: {save_path}_final.zip")
    print(f"Normalization stats saved to: {save_path}_vecnormalize.pkl")
    print(f"{'='*50}\n")
    
    return model, env

## 5. A2C with ICM Training

**Paper Reference: Equation 7 - Full ICM+A2C optimization**

Trains A2C with curiosity-driven intrinsic rewards. The agent maximizes:

`max_θP E_π[Σ_t (r^e_t + r^i_t)]`

where r^i_t comes from the ICM forward model prediction error (Eq 6).

**Key difference from baseline**: The intrinsic reward r^i_t provides a dense learning signal even when extrinsic rewards r^e_t are sparse or absent, enabling effective exploration of novel states.

In [None]:
def train_a2c_with_icm(
    total_timesteps=200_000,
    n_envs=4,
    learning_rate=7e-4,
    n_steps=5,
    gamma=0.99,
    gae_lambda=1.0,
    ent_coef=0.01,
    vf_coef=0.5,
    icm_lr=1e-3,
    icm_beta=0.2,
    icm_eta=0.01,
    lambda_weight=0.1,
    save_path="models/icm/a2c_frozenlake_icm"
):
    """
    Train A2C with ICM on FrozenLake-v1 (8x8)
    
    Args:
        total_timesteps: Total training steps
        n_envs: Number of parallel environments
        learning_rate: Learning rate for A2C optimizer
        n_steps: Number of steps per environment per update
        gamma: Discount factor
        gae_lambda: GAE lambda
        ent_coef: Entropy coefficient
        vf_coef: Value function coefficient
        icm_lr: Learning rate for ICM
        icm_beta: Weight for forward vs inverse loss (paper uses 0.2)
        icm_eta: Scaling factor for intrinsic reward (paper uses 0.01)
        lambda_weight: Weight for ICM loss in overall optimization
            (passed to ICMCallback, which does not currently use it)
        save_path: Path to save models
    
    Returns:
        Tuple of (model, icm_module, env)
    """
    
    # Create vectorized environment
    env = DummyVecEnv([make_env for _ in range(n_envs)])
    env = VecNormalize(env, norm_obs=True, norm_reward=True, clip_obs=10.)
    
    # Create evaluation environment
    eval_env = DummyVecEnv([make_env])
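    # training=False keeps evaluation from updating the running normalization statistics
    eval_env = VecNormalize(eval_env, training=False, norm_obs=True, norm_reward=False, clip_obs=10.)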
    
    # Create A2C model
    model = A2C(
        "MlpPolicy",
        env,
        learning_rate=learning_rate,
        n_steps=n_steps,
        gamma=gamma,
        gae_lambda=gae_lambda,
        ent_coef=ent_coef,
        vf_coef=vf_coef,
        max_grad_norm=0.5,
        use_rms_prop=True,
        normalize_advantage=True,
        verbose=1,
        tensorboard_log=f"./logs/{save_path}/tensorboard/"
    )
    
    # Create ICM module
    icm_module = ICMModule(
        env.observation_space,
        env.action_space,
        beta=icm_beta,
        eta=icm_eta
    ).to(model.device)
    
    # Create ICM optimizer
    icm_optimizer = Adam(icm_module.parameters(), lr=icm_lr)
    
    # Setup callbacks
    icm_callback = ICMCallback(
        icm_module, 
        icm_optimizer, 
        lambda_weight=lambda_weight, 
        verbose=0
    )
    
    eval_callback = EvalCallback(
        eval_env,
        best_model_save_path=f"./logs/{save_path}/",
        log_path=f"./logs/{save_path}/",
        eval_freq=5000,
        deterministic=True,
        render=False,
        n_eval_episodes=10
    )
    
    checkpoint_callback = CheckpointCallback(
        save_freq=10000,
        save_path=f"./logs/{save_path}/checkpoints/",
        name_prefix="a2c_icm_model"
    )
    
    print(f"\n{'='*50}")
    print(f"Training A2C with ICM on FrozenLake-v1 (8x8)")
    print(f"{'='*50}")
    print(f"Total timesteps: {total_timesteps:,}")
    print(f"Number of environments: {n_envs}")
    print(f"A2C Learning rate: {learning_rate}")
    print(f"ICM Learning rate: {icm_lr}")
    print(f"ICM Beta (forward weight): {icm_beta}")
    print(f"ICM Eta (reward scale): {icm_eta}")
    print(f"Entropy coefficient: {ent_coef}")
    print(f"{'='*50}\n")
    
    # Train the model
    model.learn(
        total_timesteps=total_timesteps,
        callback=[icm_callback, eval_callback, checkpoint_callback],
        progress_bar=True
    )
    
    # Save models
    model.save(f"{save_path}_final")
    torch.save(icm_module.state_dict(), f"{save_path}_icm.pth")
    env.save(f"{save_path}_vecnormalize.pkl")
    
    print(f"\n{'='*50}")
    print(f"Training completed!")
    print(f"A2C Model saved to: {save_path}_final.zip")
    print(f"ICM Module saved to: {save_path}_icm.pth")
    print(f"{'='*50}\n")
    
    return model, icm_module, env

## 6. Testing and Comparison Utilities

In [None]:
def test_model(model_path, n_episodes=10, model_type='baseline'):
    """
    Test a trained A2C model (works for both baseline and ICM versions)
    
    Args:
        model_path: Path to the saved model (without .zip extension)
        n_episodes: Number of episodes to test
        model_type: 'baseline' or 'icm' for proper identification
    """
    # Load the trained model
    model = A2C.load(model_path)
    
    # Create test environment
    env = DummyVecEnv([make_env])
    
    # Load normalization stats if available
    try:
        vec_normalize_path = model_path.replace('_final', '') + '_vecnormalize.pkl'
        env = VecNormalize.load(vec_normalize_path, env)
        env.training = False
        env.norm_reward = False
    except FileNotFoundError:
        print("Warning: Normalization stats not found, continuing without normalization")
    
    episode_rewards = []
    episode_lengths = []
    
    print(f"\n{'='*50}")
    print(f"Testing {model_type.upper()} A2C Model")
    print(f"{'='*50}\n")
    
    for episode in range(n_episodes):
        obs = env.reset()
        done = False
        episode_reward = 0.0
        steps = 0
        
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            # VecEnv.step returns arrays with one entry per environment
            obs, reward, dones, info = env.step(action)
            episode_reward += float(reward[0])
            done = bool(dones[0])
            steps += 1
        
        episode_rewards.append(episode_reward)
        episode_lengths.append(steps)
        
        print(f"Episode {episode + 1}: Reward = {episode_reward:.2f}, Steps = {steps}")
    
    env.close()
    
    print(f"\n{'='*50}")
    print(f"Test Results ({n_episodes} episodes)")
    print(f"{'='*50}")
    print(f"Average Reward: {np.mean(episode_rewards):.2f} (+/- {np.std(episode_rewards):.2f})")
    print(f"Average Steps: {np.mean(episode_lengths):.2f} (+/- {np.std(episode_lengths):.2f})")
    print(f"Success Rate: {sum(1 for r in episode_rewards if r > 0.0) / n_episodes * 100:.1f}%")
    print(f"{'='*50}\n")
    
    return episode_rewards, episode_lengths


def compare_models(baseline_path, icm_path, n_episodes=20):
    """
    Compare baseline A2C and ICM-enhanced A2C
    
    Args:
        baseline_path: Path to baseline model (without .zip)
        icm_path: Path to ICM model (without .zip)
        n_episodes: Number of episodes for comparison
    
    Returns:
        Dictionary with comparison results
    """
    print("\n" + "="*70)
    print("COMPARING BASELINE A2C vs A2C + ICM")
    print("="*70)
    
    print("\n[1/2] Testing Baseline A2C...")
    baseline_rewards, baseline_lengths = test_model(
        baseline_path, 
        n_episodes=n_episodes, 
        model_type='baseline'
    )
    
    print("\n[2/2] Testing A2C + ICM...")
    icm_rewards, icm_lengths = test_model(
        icm_path, 
        n_episodes=n_episodes, 
        model_type='icm'
    )
    
    # Comparison statistics
    print("\n" + "="*70)
    print("COMPARISON SUMMARY")
    print("="*70)
    print(f"\n{'Metric':<30} {'Baseline A2C':<20} {'A2C + ICM':<20}")
    print("-"*70)
    print(f"{'Mean Reward':<30} {np.mean(baseline_rewards):<20.2f} {np.mean(icm_rewards):<20.2f}")
    print(f"{'Std Reward':<30} {np.std(baseline_rewards):<20.2f} {np.std(icm_rewards):<20.2f}")
    print(f"{'Mean Steps':<30} {np.mean(baseline_lengths):<20.2f} {np.mean(icm_lengths):<20.2f}")
    print(f"{'Std Steps':<30} {np.std(baseline_lengths):<20.2f} {np.std(icm_lengths):<20.2f}")
    
    baseline_success = sum(1 for r in baseline_rewards if r > 0.0) / n_episodes * 100
    icm_success = sum(1 for r in icm_rewards if r > 0.0) / n_episodes * 100
    print(f"{'Success Rate (%)':<30} {baseline_success:<20.1f} {icm_success:<20.1f}")
    
    # Statistical comparison
    t_stat, p_value = stats.ttest_ind(baseline_rewards, icm_rewards)
    print(f"\n{'Statistical Test (t-test)':<30}")
    print(f"  t-statistic: {t_stat:.4f}")
    print(f"  p-value: {p_value:.4f}")
    
    if p_value < 0.05:
        winner = "A2C + ICM" if np.mean(icm_rewards) > np.mean(baseline_rewards) else "Baseline A2C"
        improvement = abs(np.mean(icm_rewards) - np.mean(baseline_rewards))
        print(f"  Result: {winner} is significantly better (p < 0.05)")
        print(f"  Mean improvement: {improvement:.2f} reward points")
    else:
        print(f"  Result: No significant difference (p >= 0.05)")
    
    print("="*70 + "\n")
    
    return {
        'baseline': {
            'rewards': baseline_rewards, 
            'lengths': baseline_lengths,
            'mean_reward': np.mean(baseline_rewards),
            'success_rate': baseline_success
        },
        'icm': {
            'rewards': icm_rewards, 
            'lengths': icm_lengths,
            'mean_reward': np.mean(icm_rewards),
            'success_rate': icm_success
        },
        'statistics': {
            't_statistic': t_stat,
            'p_value': p_value,
            'significant': p_value < 0.05
        }
    }


def plot_comparison(results):
    """Plot comparison between baseline and ICM models"""
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Rewards comparison
    axes[0].boxplot([results['baseline']['rewards'], results['icm']['rewards']], 
                    labels=['Baseline A2C', 'A2C + ICM'])
    axes[0].set_ylabel('Episode Reward')
    axes[0].set_title('Reward Distribution Comparison')
    axes[0].grid(True, alpha=0.3)
    
    # Episode lengths comparison
    axes[1].boxplot([results['baseline']['lengths'], results['icm']['lengths']], 
                    labels=['Baseline A2C', 'A2C + ICM'])
    axes[1].set_ylabel('Episode Length (steps)')
    axes[1].set_title('Episode Length Comparison')
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

## 7. Train Models

Run these cells to train both models. Set `total_timesteps` according to your needs (200,000 is recommended for FrozenLake 8x8).

In [None]:
# Train Baseline A2C
baseline_model, baseline_env = train_baseline_a2c(
    total_timesteps=200_000,
    n_envs=4,
    learning_rate=7e-4,
    ent_coef=0.01
)

In [None]:
# Train A2C with ICM
icm_model, icm_module, icm_env = train_a2c_with_icm(
    total_timesteps=200_000,
    n_envs=4,
    learning_rate=7e-4,
    ent_coef=0.01,
    icm_lr=1e-3,
    icm_beta=0.2,
    icm_eta=0.01
)

## 8. Compare Results

In [None]:
# Compare both models
comparison_results = compare_models(
    "models/baseline/a2c_frozenlake_baseline_final",
    "models/icm/a2c_frozenlake_icm_final",
    n_episodes=50
)

In [None]:
# Visualize comparison
plot_comparison(comparison_results)

## 9. Test Individual Models

In [None]:
# Test baseline model
baseline_rewards, baseline_lengths = test_model(
    "models/baseline/a2c_frozenlake_baseline_final",
    n_episodes=10,
    model_type='baseline'
)

In [None]:
# Test ICM model
icm_rewards, icm_lengths = test_model(
    "models/icm/a2c_frozenlake_icm_final",
    n_episodes=10,
    model_type='icm'
)

## 10. Visualize Training with TensorBoard

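To view training metrics in your browser, start TensorBoard on the log directory, for example with `tensorboard --logdir ./logs`, the directory under which both training functions write their logs. TensorBoard is a long-running process, so you may prefer to run it in a separate terminal.

Then navigate to http://localhost:6006 in your browser.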

## 11. Key Hyperparameters

### A2C Parameters
- `learning_rate`: 7e-4 (default)
- `n_steps`: 5 (rollout length)
- `gamma`: 0.99 (discount factor)
- `ent_coef`: 0.01 (entropy coefficient for exploration)

### ICM Parameters
- `icm_lr`: 1e-3 (ICM learning rate)
- `icm_beta`: 0.2 (forward vs inverse loss weight, from paper)
- `icm_eta`: 0.01 (intrinsic reward scaling, from paper)

### Experiment with different values:
- Higher `icm_beta` → More focus on forward model (feature prediction)
- Higher `icm_eta` → Larger intrinsic rewards
- Higher `ent_coef` → More exploration
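
One way to organize such experiments is a small grid over the ICM parameters. The sketch below only builds the configurations; the sweep values and save paths are hypothetical, and the keys match keyword arguments of `train_a2c_with_icm` defined earlier in this notebook.

```python
from itertools import product

# Hypothetical sweep values; 0.2 and 0.01 are the defaults used in this notebook.
icm_beta_values = [0.1, 0.2, 0.5]
icm_eta_values = [0.01, 0.1]

sweep = [
    {
        "icm_beta": beta,
        "icm_eta": eta,
        "save_path": f"models/icm/sweep_beta{beta}_eta{eta}",
    }
    for beta, eta in product(icm_beta_values, icm_eta_values)
]

for config in sweep:
    print(config)
    # train_a2c_with_icm(**config)  # uncomment inside the notebook to run each one
```

Giving each configuration its own `save_path` keeps models, checkpoints, and TensorBoard logs from different runs separate.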

### FrozenLake-specific considerations:
- Success rate is the most important metric (reaching the goal)
- The environment has sparse rewards (0 everywhere, 1 at the goal)
- ICM helps by providing intrinsic rewards for exploring novel states
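
Since the only reward is 1 at the goal, the success rate equals the mean episode reward. The sketch below also attaches a binomial standard error to indicate how noisy an estimate from a few dozen episodes can be; the episode outcomes are made up, and `success_rate_with_stderr` is an illustrative helper.

```python
import math


def success_rate_with_stderr(episode_rewards):
    """Fraction of episodes with positive reward and its binomial standard error."""
    n = len(episode_rewards)
    p = sum(1 for r in episode_rewards if r > 0.0) / n
    return p, math.sqrt(p * (1.0 - p) / n)


# Hypothetical evaluation: 10 successes out of 50 episodes.
rewards = [1.0] * 10 + [0.0] * 40
rate, stderr = success_rate_with_stderr(rewards)
print(f"success rate = {rate:.2f} +/- {stderr:.3f}")
```

In this made-up case the standard error is roughly 0.06, so differences of a few percentage points between two models evaluated on 50 episodes each may not be meaningful.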

## Summary

This notebook implements:

1. **Baseline A2C**: Standard advantage actor-critic algorithm on FrozenLake 8x8
2. **ICM Module**: Curiosity-driven exploration using:
   - Feature encoder φ(s) - converts one-hot states to learned features
   - Inverse model (predicts action from state transitions)
   - Forward model (predicts next state features)
   - Intrinsic reward = forward model prediction error

3. **Comparison**: Statistical comparison showing whether ICM improves performance

The ICM helps the agent explore the frozen lake more effectively by providing intrinsic rewards for novel/surprising states, which is particularly valuable in sparse reward environments like FrozenLake where the agent only receives reward upon reaching the goal.