## CS260R Assignment 4 Imitation Learning: Behavior Cloning & Preference Learning

In this assignment, we focus on imitation learning and preference learning, so we need some static datasets in hand before starting to work on it.

Please install the following dependencies:

```
pip install minari[all]
pip install gymnasium==1.0.0
pip install mujoco==3.2.3
pip install torchrl==0.7.0
pip install matplotlib
pip install tqdm
```

We tested with python==3.9.21 on Ubuntu 22.04, and any Python version >= 3.9 should work.

If you have problems installing the dependencies, try switching the Python version. If you have problems installing $\textbf{MuJoCo}$ on macOS or Windows, please refer to its GitHub page https://github.com/google-deepmind/mujoco, which provides step-by-step instructions for building it from source.

In [5]:
!pip install minari[all]
!pip install gymnasium==1.0.0
!pip install mujoco==3.2.3
!pip install torchrl==0.7.0
!pip install matplotlib
!pip install tqdm
!pip install imageio







## Section 1: Behavior Cloning

### 1.1 Prepare Dataset (1pt)

In this section, we will need to construct a dataset using Minari and TorchRL. 

In [1]:
import minari
import gymnasium as gym
from torchrl.data.datasets.minari_data import MinariExperienceReplay
from torchrl.data.replay_buffers import RandomSampler

from typing import List, Tuple, Union

# Load the Minari dataset to recover its properties
# (download=True fetches it on first use if it is not cached locally)
dataset_name = "mujoco/hopper/expert-v0"
dataset = minari.load_dataset(dataset_name, download=True)

# recover the dimension of state and action space
state_dim = dataset.observation_space.shape[0]
action_dim = dataset.action_space.shape[0]

# recover the gymnasium environment
env = dataset.recover_environment()

# setup the parameters for the replay buffer
batch_size = 64
split_trajs = False

"""
Create a torchRL replay buffer from the minari dataset
You can use whatever sampler you want, but make sure it will work for the later sections
Please refer to the documentation: https://pytorch.org/rl/0.7/reference/generated/torchrl.data.datasets.MinariExperienceReplay.html?highlight=minari#torchrl.data.datasets.MinariExperienceReplay
"""
######### Your code here #########
replay_buffer = MinariExperienceReplay(
    dataset_name,
    batch_size=batch_size,
    split_trajs=split_trajs,
    download=True,  # fetch the dataset on first use instead of failing when it is not cached
)
##################################

### 1.2 Implement Policies (2pt)

In [6]:
"""
Print out the state and action space below. 
Notice that the action space is actually bounded, which means our policy will also need to output actions within these bounds
"""
print(dataset.observation_space)
print(dataset.action_space)

Box(-inf, inf, (11,), float64)
Box(-1.0, 1.0, (3,), float32)
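
As a small aside on the bounded action space printed above, the sketch below (plain Python, standard library only; the input values are made up) illustrates how `tanh` squashes arbitrary real-valued outputs into the range covered by `Box(-1.0, 1.0, ...)`. It only illustrates the squashing function and is not the policy itself.

```python
import math

# Hypothetical unbounded outputs of a policy head (one value per action dimension).
raw_outputs = [-5.0, 0.0, 0.3, 12.0]

# Mathematically, tanh maps every real number into the open interval (-1, 1),
# which lies inside the Box(-1.0, 1.0) action space.
squashed = [math.tanh(x) for x in raw_outputs]

for raw, act in zip(raw_outputs, squashed):
    print(f"raw={raw:6.2f} -> action={act:+.6f}")
```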


In [21]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import numpy as np
import random

from tqdm import tqdm

# set the seed for reproducibility
seed = 42
torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)


# initialization function
def weights_init_(m):
    if isinstance(m, nn.Linear):
        torch.nn.init.orthogonal_(m.weight, gain=1)
        torch.nn.init.constant_(m.bias, 0)


"""
Create a policy class for continuous actions as you did in Assignment 3
The policy should take in a state, and output a vector as the action.
Additionally, as we observed above, the action space is bounded between -1 and 1, so make sure to use Tanh to restrict the output to this range.
"""
class Policy(nn.Module):
    def __init__(
        self,
        state_dim: int,
        action_dim: int,
        activation: nn.Module=nn.ReLU,
        hidden_dim: int=64,
    ):
        super().__init__()
        ###### Your code here #######
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.activation = activation()
        self.hidden_dim = hidden_dim
        
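        # nn.Sequential registers the layers as submodules, so their parameters
        # are visible to the optimizer (layers stored in a plain set are not).
        self.process = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            self.activation,
            nn.Linear(hidden_dim, hidden_dim),
            self.activation,
            nn.Linear(hidden_dim, action_dim),
            nn.Tanh(),
        )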
        
        #############################
        self.apply(weights_init_)
    
    def forward(self, state: torch.Tensor):
        a = state

        for func in self.process:
            a = func(a)
            
        return a

### 1.3 Implement Trainer (6pt)

In this subsection, we need to build our behavior cloning trainer to train the policy. The behavior cloning loss is simply defined as the squared error between the predicted action and the expert action given the state:
$$
L_{BC}(D^{expert}, \theta) = \mathbb{E}_{(s, a) \sim D^{expert}} \left[ \Vert \pi_{\theta}(s) - a \Vert^2 \right]
$$
And we will train the policy by minimizing this loss using gradient descent.
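
To make the squared-error loss concrete, here is a minimal arithmetic sketch in plain Python (standard library only). The toy actions and the helper name `squared_error` are made up for illustration. It also shows that averaging over every element, as an element-wise mean such as `torch.nn.functional.mse_loss` with its default reduction does, differs from the batch mean of the squared norm only by the constant factor `action_dim`.

```python
# Toy batch of 2 transitions with 3-dimensional actions (values are made up).
predicted = [[0.5, -0.2, 0.1], [0.0, 0.9, -1.0]]  # pi_theta(s)
expert = [[0.4, -0.2, 0.3], [0.1, 1.0, -0.8]]     # a

def squared_error(pred, target):
    """Squared L2 distance between one predicted action and one expert action."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))

per_sample = [squared_error(p, a) for p, a in zip(predicted, expert)]

# Batch mean of the squared norm (the expectation in the BC loss).
bc_loss = sum(per_sample) / len(per_sample)

# Mean over every element of the batch: same quantity divided by action_dim.
action_dim = len(predicted[0])
elementwise_mean = bc_loss / action_dim

print(bc_loss, elementwise_mean)
```

Because the two versions differ only by a constant, they share the same minimizer; the choice mainly affects the effective learning rate.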

In [22]:
# The following evaluation function will be helpful for you to monitor the training process and debug the code
@torch.no_grad()
def evaluate_policy(policy, env, num_episodes=5):
    policy.eval()
    rewards = []
    for episode in range(num_episodes):
        state, _ = env.reset(seed=episode+1234)
        done = False
        total_reward = 0
        while not done:
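            # Run on whichever device the policy lives on instead of hard-coding 'cuda'
            state = torch.tensor(state, dtype=torch.float32, device=next(policy.parameters()).device)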
            action = policy(state)
            action = action.cpu().numpy()
            next_state, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            done = terminated or truncated
            # env.render()
            state = next_state
        rewards.append(total_reward)
    return rewards

In [None]:
class BCTrainer:
    def __init__(
        self,
        model: nn.Module,
        replay_buffer: MinariExperienceReplay,
        env: gym.Env,
        lr: float = 1e-4,
        device: str = 'cpu',
    ):
        # Feel free to modify this __init__ function as you needed
        self.model = model
        self.replay_buffer = replay_buffer
        self.env = env
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)
        self.device = device

        self.model.to(self.device)
    
    def train_step(
        self,
        states: torch.Tensor,
        actions: torch.Tensor
    ) -> float:
        """
        Complete this function to train self.model for one step
        * The loss should be a behavior cloning loss, i.e. MSE loss between the predicted action and the actual expert action
        * Apply one step of gradient descent to minimize the loss
        * The loss.item() should be returned after the gradient descent is done. 
        """
        
        ###### Your code here 
        predictions = self.model(states.to(torch.float32))
        loss = F.mse_loss(predictions, actions)
        
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        return loss.item()
        ######
        
    def train(self, num_steps: int, batch_size: int, eval_freq: int) -> Tuple[List[float], List[float]]:
        """
        Finish this function for training the model for num_steps steps
        * You should sample [batch_size] data from the replay buffer
        * You should print out the loss every [eval_freq] steps
        """
        loss_log: List[float] = []
        eval_log: List[float] = []
        for step in tqdm(range(num_steps)):
            
            # Sample some data from the replay buffer
            ##### Your code here #####
            batch = self.replay_buffer.sample(batch_size)
            # TorchRL stores Minari data under the singular keys "observation" and "action"
            states = batch['observation'].to(self.device)
            actions = batch['action'].to(self.device)
            ############################
            
            loss = self.train_step(states, actions)
            
            if eval_freq > 0 and step % eval_freq == 0:
                # Do the evaluation, and append the loss as well as the evaluation to the logging list
                ##### Your code here #####
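                eval_rewards = evaluate_policy(self.model, self.env)
                self.model.train()  # evaluate_policy() switches the model to eval mode
                mean_reward = float(np.mean(eval_rewards))
                loss_log.append(loss)
                eval_log.append(mean_reward)
                print(f"Step: {step}, Loss: {loss:.6f}, Eval: {mean_reward:.2f}")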
                ############################
        
        return loss_log, eval_log

### Train your model

In [24]:
# NOTE: Declare the device here
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Feel free to change the hyperparameters
hidden_dim = 64
activation = nn.ReLU
lr = 1e-3

bc_model =  Policy(
    state_dim=state_dim, 
    action_dim=action_dim, 
    hidden_dim=hidden_dim, 
    activation=activation
)

bc_trainer = BCTrainer(
    model=bc_model, 
    replay_buffer=replay_buffer,
    env=env,
    lr=lr, 
    device=device
)
bc_loss_log, bc_eval_log = bc_trainer.train(
    num_steps=100000, 
    batch_size=64, 
    eval_freq=10000
)


### Plot your results (1pt)

In [None]:
# Plot the loss and evaluation


## Section 2: Preference Learning
In this section, we will implement a preference learning algorithm. A preference learning algorithm learns not only to maximize the probability of outputting expert actions but also to minimize the probability of outputting bad actions. We will leverage a "negative dataset", which contains poorly performing trajectories. The algorithm that you will implement below is largely based on https://arxiv.org/pdf/2310.13639

### 2.1 Construct Segment Dataset (1pt)
In this section, we will use segment datasets. The difference between a segment replay buffer and a regular replay buffer is that the segment replay buffer samples contiguous segments, whereas a regular one samples individual [state, action] pairs. Please refer to https://pytorch.org/rl/0.7/reference/generated/torchrl.data.replay_buffers.SliceSampler.html?highlight=slicesampler#torchrl.data.replay_buffers.SliceSampler for details.

We need two datasets to perform preference learning: one positive dataset (pos_segment_replay_buffer) and one negative dataset (neg_segment_replay_buffer). The former contains expert trajectories (you will have to use a proper sampler to slice them into segments), and the latter contains suboptimal trajectories.
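
To illustrate what sampling segments means, here is a conceptual sketch in plain Python (standard library only). It is not TorchRL's `SliceSampler` implementation; the function name `sample_segments` and the toy episodes are hypothetical. It draws contiguous, fixed-length slices from variable-length episodes and skips episodes that are too short to hold a full segment.

```python
import random

def sample_segments(episodes, num_segments, segment_length, rng):
    """Sample contiguous fixed-length segments from variable-length episodes."""
    # Only episodes that can hold a full segment are eligible (a strict-length policy).
    eligible = [ep for ep in episodes if len(ep) >= segment_length]
    if not eligible:
        raise ValueError("no episode is long enough for one segment")
    segments = []
    for _ in range(num_segments):
        episode = rng.choice(eligible)
        start = rng.randrange(len(episode) - segment_length + 1)
        segments.append(episode[start:start + segment_length])
    return segments

# Toy data: each episode is a list of (state, action) placeholders.
episodes = [
    [(f"s{e}_{t}", f"a{e}_{t}") for t in range(length)]
    for e, length in enumerate([5, 20, 13])
]
rng = random.Random(0)
segments = sample_segments(episodes, num_segments=8, segment_length=8, rng=rng)
print(len(segments), len(segments[0]))
```

Each returned segment stays within a single episode, which is the property the preference loss relies on when it sums errors over a segment.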

In [None]:
from torchrl.data.replay_buffers import SliceSampler

pos_dataset_name = "mujoco/hopper/expert-v0" # Minari dataset containing expert trajectories
neg_dataset_name = "mujoco/hopper/simple-v0" # Minari dataset containing low-reward trajectories

"""
Construct two segment datasets below: pos_segment_replay_buffer, neg_segment_replay_buffer
"""
batch_size = 64
num_segments = 8
segment_length: int = batch_size // num_segments  # integer division keeps this an int
traj_key = "episode"
strict_length = True

##### Your code here #####
pos_segment_replay_buffer = None
neg_segment_replay_buffer = None
####################

### Sampling function
The following function is a helper for sampling segments in your trainer. It does two things:
1. Reshapes the sampled batch data into segments using split_trajectories.
2. Checks the shape of the sampled batch. The segment sampler (SliceSampler) in TorchRL sometimes gives you longer segments than you would expect, because the lengths of the trajectories are not always multiples of the segment length. Hence, we need to make sure the sampled batch has the right shape.

In [None]:
from torchrl.collectors.utils import split_trajectories

def sample_segment(slice_replay_buffer, num_segments, segment_length):
    batch = None
    while True:
        batch_size = num_segments * segment_length
        batch = slice_replay_buffer.sample(batch_size)
        batch = split_trajectories(batch, trajectory_key="episode")
        
        # make sure there is the correct number of segments
        if batch.shape[0] != num_segments:
            continue
        
        # make sure the length of segments is correct
        if batch.shape[1] != segment_length:
            continue
        
        # make sure there is no padding in the segments
        # NOTE: You can use masks to deal with the padding, but you will also need to adjust your learning rate
        if batch['mask'].sum() == batch_size:
            break
        
    return batch

### 2.2 Implement Preference Learning Trainer (13pt)

In this subsection, your task is to implement the preference learning trainer.

Recall our behavior cloning loss is given by:
$$
L_{BC}(D^{expert}, \theta) = \mathbb{E}_{(s, a) \sim D^{expert}} \left[ \Vert \pi_{\theta}(s) - a \Vert^2 \right]
$$

In preference learning, we use $\sigma^+$ to denote preferable segments and $\sigma^-$ to denote undesired ones. Each segment can be written as a sequence of states and actions, i.e. $\sigma = [s_k, a_k, s_{k+1}, a_{k+1}, ...]$.
Our preference loss is defined as:
$$
L_{Pref}(D^{expert}, D^{neg}, \theta) = \mathbb{E}_{\sigma^+ \sim D^{expert}, \sigma^- \sim D^{neg}} \left[
    - \log \frac{
        e^{\alpha \mathbb{P}[\sigma^+ | \pi_{\theta}]}
    }{
        e^{\alpha \mathbb{P}[\sigma^+ | \pi_{\theta}]} + e^{\alpha \lambda \mathbb{P}[\sigma^- | \pi_{\theta}]}
    }
  \right]
$$
where $\mathbb{P}[\sigma | \pi_{\theta}]$ measures how likely the policy $\pi_{\theta}$ is to produce the segment $\sigma$, and $\alpha, \lambda$ are hyperparameters that control the shape of the loss.

Here, we use the negative sum of squared errors between the predicted actions and the real actions as a surrogate for this measure:
$$
\mathbb{P}[\sigma | \pi_{\theta}] = - \sum_{(s_i, a_i) \in \sigma} \Vert \pi_{\theta}(s_i) - a_i \Vert^2
$$

Note that the real actions are not necessarily expert actions: the real actions in $\sigma^-$ are sub-optimal.

As you may notice, this loss contains multiple exponential and logarithmic operations, which are not numerically stable. To prevent overflow, you may need to use the properties of the logarithm to transform the loss before implementing it. Otherwise, you are likely to encounter NaN during training.
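
One possible transformation, given as a sketch rather than the required solution: write $x^+ = \alpha \mathbb{P}[\sigma^+ | \pi_{\theta}]$ and $x^- = \alpha \lambda \mathbb{P}[\sigma^- | \pi_{\theta}]$ (shorthand introduced here only for the derivation). The term inside the expectation then reduces to a softplus of the difference:
$$
- \log \frac{e^{x^+}}{e^{x^+} + e^{x^-}}
= - x^+ + \log \left( e^{x^+} + e^{x^-} \right)
= \log \left( 1 + e^{x^- - x^+} \right)
= \mathrm{softplus}(x^- - x^+)
$$
The softplus can be evaluated without overflow for any $z$ through the identity
$$
\mathrm{softplus}(z) = \max(z, 0) + \log \left( 1 + e^{-|z|} \right)
$$
because the exponent $-|z|$ is never positive. In PyTorch, `torch.nn.functional.softplus` and `torch.nn.functional.logsigmoid` offer numerically stable building blocks for this form.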

To wrap up, the final loss function that you will use will be:

$$
L_{CPL}(\theta) = L_{Pref}(D^{expert}, D^{neg}, \theta) + \beta L_{BC}(D^{expert}, \theta)
$$

where the behavior cloning loss will serve as a regularizer, and $\beta$ is the hyperparameter for controlling it.
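
The combined loss can be sketched for a single pair of segments in plain Python (standard library only). This is an illustration under assumptions, not the reference solution: the helper names `softplus`, `segment_score`, and `cpl_loss`, the toy actions, and the hyperparameter values are all made up, and the BC term here is averaged over the steps of the expert segment.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def segment_score(pred_actions, real_actions):
    """Surrogate P[sigma | pi]: negative sum of squared action errors over a segment."""
    return -sum(
        (p - a) ** 2
        for pred, real in zip(pred_actions, real_actions)
        for p, a in zip(pred, real)
    )

def cpl_loss(pos_pred, pos_real, neg_pred, neg_real, *, alpha, lam, beta):
    """Preference loss plus beta times the BC loss for one (sigma+, sigma-) pair."""
    score_pos = segment_score(pos_pred, pos_real)
    score_neg = segment_score(neg_pred, neg_real)
    # -log(e^{alpha P+} / (e^{alpha P+} + e^{alpha lam P-})) = softplus(alpha lam P- - alpha P+)
    pref = softplus(alpha * lam * score_neg - alpha * score_pos)
    # BC term: mean squared action error over the steps of the expert segment.
    bc = -score_pos / len(pos_pred)
    return pref + beta * bc

# Toy segments of length 2 with 2-dimensional actions (values are made up).
pos_pred, pos_real = [[0.1, 0.0], [0.2, 0.1]], [[0.0, 0.0], [0.2, 0.3]]
neg_pred, neg_real = [[0.5, 0.5], [0.0, 0.0]], [[0.0, 0.0], [1.0, 0.0]]
loss = cpl_loss(pos_pred, pos_real, neg_pred, neg_real, alpha=1.0, lam=1.0, beta=0.5)
print(loss)
```

In the trainer, the same structure would be applied to batched tensors, with the sum taken over the segment and action dimensions and the mean over segments.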

In [None]:
class ContrastivePreferenceTrainer:
    def __init__(
        self,
        model: nn.Module,
        pos_segment_buffer: MinariExperienceReplay,
        neg_segment_buffer: MinariExperienceReplay,
        env: gym.Env,
        lr: float = 1e-4,
        entropy_coeff: float = 0.01,
        lam: float = 1e-3,
        alpha: float = 1.0,
        bc_loss_coeff: float = 0.1, # the coefficient for the behavior cloning loss
        device: str = 'cpu',
    ):
        self.model = model
        self.pos_segment_buffer = pos_segment_buffer
        self.neg_segment_buffer = neg_segment_buffer
        self.env = env
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)
        self.entropy_coeff = entropy_coeff
        self.lam = lam
        self.alpha = alpha
        self.bc_loss_coeff = bc_loss_coeff
        self.device = device
        
        self.model.to(self.device)
        
    def train_step(
        self, 
        states: torch.Tensor, 
        actions: torch.Tensor, 
        neg_states: torch.Tensor,
        neg_actions: torch.Tensor,
    ) -> float:
        ##### Your code here #####
        pass
        ##########################
    
    def train(
        self,
        num_segments: int,
        segment_length: int,
        num_steps: int = 100000,
        eval_freq: int = 10000,
    ):
        loss_log = []
        eval_log = []
        for step in tqdm(range(num_steps)):
            """
            Implement the sampling of the positive and negative segments
            The states, actions, neg_states, neg_actions should be of shape (num_segments, segment_length, state_dim) or (num_segments, segment_length, action_dim)
            You can use the helper function sample_segment to sample the segments
            You are also welcome to implement your own sampling function and use masks, as long as the resulting states / actions are in the correct shape
            """
            ##### Your code here #####
            pass
            ##########################
            
            # Do not modify the following assertion statements
            assert states.shape == (num_segments, segment_length, state_dim)
            assert actions.shape == (num_segments, segment_length, action_dim)
            assert neg_states.shape == (num_segments, segment_length, state_dim)
            assert neg_actions.shape == (num_segments, segment_length, action_dim)
            
            loss = self.train_step(states, actions, neg_states, neg_actions)
            
            if eval_freq > 0 and step % eval_freq == 0:
                # Do the evaluation, and append the loss as well as the evaluation to the logging list
                ##### Your code here #####
                pass
                ############################

        return loss_log, eval_log

### Train a CPL model

In [None]:
cpl_model = Policy(state_dim=state_dim, action_dim=action_dim, hidden_dim=64, activation=nn.ReLU)
cpl_trainer = ContrastivePreferenceTrainer(
    model=cpl_model,
    pos_segment_buffer=pos_segment_replay_buffer,
    neg_segment_buffer=neg_segment_replay_buffer,
    env=env, 
    lr=1e-4,
    lam=1.0,
    alpha=1.0,
    bc_loss_coeff=1,
    device=device,  # reuse the device declared earlier instead of hard-coding 'cuda'
)
cpl_loss_log, cpl_eval_log = cpl_trainer.train(
    num_steps=100000, 
    num_segments=8,
    segment_length=8,
    eval_freq=10000,
)

### Plot the results (1pt)

In [None]:
# Plot the loss and evaluation

### 2.3 Investigate the effect of the hyperparameters $\alpha, \lambda, \beta$ (Bonus 5pt)
Try different choices of $\alpha, \lambda, \beta$, observe how the learning curve changes, and explain why it changes in those ways based on your intuition. Use your plots to explain your reasoning.
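
One way to organize such an investigation is a small grid over the three hyperparameters. The sketch below (plain Python, standard library only) only enumerates configurations. The candidate values are illustrative assumptions rather than recommended settings, the helper `make_configs` is hypothetical, and the dictionary keys mirror the `alpha`, `lam`, and `bc_loss_coeff` arguments of `ContrastivePreferenceTrainer`.

```python
from itertools import product

# Candidate values are illustrative assumptions, not recommended settings.
alphas = [0.1, 1.0]
lams = [0.5, 1.0]
betas = [0.0, 1.0]

def make_configs(alphas, lams, betas):
    """Enumerate every (alpha, lam, beta) combination as trainer keyword arguments."""
    return [
        {"alpha": a, "lam": l, "bc_loss_coeff": b}
        for a, l, b in product(alphas, lams, betas)
    ]

configs = make_configs(alphas, lams, betas)
for cfg in configs:
    # Each cfg could be unpacked into the trainer constructor, and the returned
    # loss / eval logs stored under a label like this one for plotting.
    label = ", ".join(f"{k}={v}" for k, v in cfg.items())
    print(label)
```

Changing one hyperparameter at a time while holding the others fixed makes it easier to attribute a change in the learning curve to a single cause.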

In [None]:
# Your code here 