In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("hw7.ipynb")

In [None]:
import numpy as np

# FILL IN YOUR NAME AND THE NAME OF YOUR PEER (IF ANY) BELOW

**Name**: \<replace this with your name\>

**Peer**: \<replace this with your peer's name\>

## Collaboration policy
Students are responsible for writing their own quizzes, assignments, and exams. For homework assignments, students are welcome (and encouraged) to discuss problems with one peer, **but each student must write their own assignment write-up and code individually**. The peer must be listed at the top of the write-up for each assignment. *Note: I will treat AI assistants as peers. That is, students are welcome to discuss problems with an AI assistant, but it is considered cheating to directly obtain an answer by querying the assistant. Please credit any AI assistant that you use.*

# Homework 7 -- Behavior cloning (100 pts)

**Due:** Tuesday, April 8th, 2025 at 11:59 pm

This homework builds on the material in the slides and MIT Lecture notes Chapter 8 (on Brightspace).

We will use Jupyter/Colab notebooks throughout the semester for writing code and generating assignment outputs.

**This homework will be unlike prior homeworks. It will be _entirely_ implementation-based. Some questions will be assessed by running your code, while others will require you to upload trained neural nets, which the grader will evaluate.**

## 1) Neural net implementation

Our first step will be to create a neural net implementation, including training code. This part will be agnostic to the robotics problem setting: we will simply train a neural net on some arbitrary X, Y dataset.

In this homework, we'll use PyTorch, a Python framework for implementing and training neural networks. PyTorch is currently one of the most popular frameworks within the machine learning community, so it's worth becoming familiar with it.

In particular, PyTorch uses the **Tensor** data type to encode all the objects it works on. In our usage, a tensor generalizes the idea of a matrix (a 2D array) to a structure that can have arbitrarily many dimensions. We will use tensors to represent datasets, batches of data, weights, activations, etc. They are very versatile but can be a little difficult to conceptualize, especially when you're starting out. You'll get the hang of it! I recommend looking at this very good [primer on PyTorch tensors](https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html).
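To make the idea concrete, here is a minimal sketch of tensors with different numbers of dimensions; the sizes are arbitrary and chosen only for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])     # 1D tensor (a vector), shape (3,)
batch = torch.zeros(32, 5)            # 2D tensor: a batch of 32 vectors with 5 features each
images = torch.zeros(8, 3, 64, 64)    # 4D tensor: e.g., 8 RGB images of 64x64 pixels
print(x.shape, batch.shape, images.shape, images.ndim)

# Ordinary indexing pulls out lower-dimensional slices
first = batch[0]                      # the first vector in the batch, shape (5,)
```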

To run locally, you may simply run `pip install torch`; PyTorch is already installed in Google Colab. 

PyTorch creates networks in a modular fashion, by chaining different [torch.nn.Modules](https://pytorch.org/docs/stable/nn.html) together. Some helpful modules for this assignment include:
- [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) -- implements (the pre-activations for) a fully-connected layer, what we call $W^\top x + b$.
- [nn.ReLU](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html) -- implements ReLU activation.
- [nn.MSELoss](https://pytorch.org/docs/stable/generated/torch.nn.MSELoss.html) -- implements the Mean Squared Error loss (by default, the element-wise squared errors averaged over all elements).
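A minimal sketch of these three modules used together on random data (the sizes are arbitrary). Note that PyTorch stores an `nn.Linear` weight with shape `(out_features, in_features)` and applies it to a batch of row vectors as $x W^\top + b$, so the stored `weight` corresponds to $W^\top$ in the notation above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 3)     # weight shape (3, 4), bias shape (3,)
relu = nn.ReLU()
loss_fn = nn.MSELoss()      # default reduction='mean'

x = torch.randn(10, 4)      # a batch of 10 inputs with 4 features each
h = relu(layer(x))          # shape (10, 3); every entry is >= 0 after the ReLU
target = torch.zeros(10, 3)
loss = loss_fn(h, target)   # a scalar: the mean of all 30 squared errors
print(h.shape, loss.item())
```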



For an example of how all these pieces fit together to create a network, take a look at this [quickstart tutorial](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html) from the PyTorch website.

Throughout this question, you can use the dataset in `dummy_data.pt`, included with this assignment. You may load this file using `data = torch.load("dummy_data.pt")`, which will return a dictionary of the form `{'inputs_train': X_train, 'outputs_train': Y_train, 'inputs_test': X_test, 'outputs_test': Y_test}`. Inspect X and Y to determine what the input and output dimensions of your network should be.
- You should interpret the `*_train` tensors as the inputs/outputs that you will use to run gradient descent on your network, and the `*_test` tensors as the inputs/outputs on which you will evaluate your model to see whether it is learning something meaningful.
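Since the actual sizes in `dummy_data.pt` are for you to discover, the sketch below uses a synthetic stand-in dictionary with made-up sizes. `describe_dataset` is a hypothetical helper, and it assumes every tensor is 2D with shape `(n, features)`.

```python
import torch

def describe_dataset(data):
    """Print the shape of every tensor and return (input_dim, output_dim).

    Hypothetical helper; assumes each tensor has shape (n, features).
    """
    for key, tensor in data.items():
        print(f"{key}: shape {tuple(tensor.shape)}, dtype {tensor.dtype}")
    input_dim = data['inputs_train'].shape[1]
    output_dim = data['outputs_train'].shape[1]
    return input_dim, output_dim

# Synthetic stand-in with made-up sizes; with the real file, use data = torch.load("dummy_data.pt")
fake = {
    'inputs_train': torch.randn(100, 6), 'outputs_train': torch.randn(100, 2),
    'inputs_test': torch.randn(20, 6), 'outputs_test': torch.randn(20, 2),
}
print(describe_dataset(fake))
```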

### 1.1) Fully-connected network

Your first task is to create a class that constructs a fully-connected network. The class constructor will take as input the following arguments:
- `input_dim`: the number of features in the input
- `output_dim`: the number of outputs that the network should produce
- `hid_size`: the number of neurons in the hidden layers
- `num_layers`: the number of hidden layers in your network. (*Note: in NN-speak, the "input" layer is the actual features, the "output" layer is the final set of nodes, and the "hidden" layers are all the ones in between.*)

Your network should be able to consume a batch of inputs of shape `(n, input_dim)` (where `n` is an arbitrary integer) and produce a batch of outputs of shape `(n, output_dim)` after going through `num_layers` hidden layers with **ReLU** activation. The output layer should not have any activation.

*Note: if you choose to store your hidden layers in a list, you should use [`nn.ModuleList`](https://pytorch.org/docs/stable/generated/torch.nn.ModuleList.html#torch.nn.ModuleList), which operates just like a standard Python list but registers its elements as submodules of your network. This way, their parameters appear in `net.parameters()` (and are therefore updated by the optimizer) and are moved along with the network by `net.to(...)`.*
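The difference is easy to check by counting parameters. In this sketch, the two toy classes are hypothetical and are not the `FCNN_11` you need to write. The layers held in a plain Python list are invisible to `.parameters()`, so an optimizer built from `net.parameters()` would never update them.

```python
import torch.nn as nn

class PlainList(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(4, 4), nn.Linear(4, 4)]  # plain list: NOT registered

class WithModuleList(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 4), nn.Linear(4, 4)])  # registered submodules

# Each nn.Linear(4, 4) has 16 weights + 4 biases = 20 parameters
n_plain = sum(p.numel() for p in PlainList().parameters())
n_registered = sum(p.numel() for p in WithModuleList().parameters())
print(n_plain, n_registered)  # 0 40
```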

_Points:_ 15

In [None]:
import torch
import torch.nn as nn

class FCNN_11(nn.Module):
    def __init__(self, input_dim, output_dim, hid_size, num_layers):
        super().__init__()
        ...
    
    def forward(self, x):
        ...

### 1.2) Data normalization

Neural net training via backpropagation tends to work best on data that is normalized to lie within a relatively constrained range. One common data normalization technique is to *standardize* the data to have zero mean and unit standard deviation. This is achieved by computing:
$$X_\text{norm} = \frac{X - \mu}{\sigma}\enspace,$$
where $\mu$ and $\sigma$ are the mean and standard deviation of the data.


There are three key points to consider when normalizing data:
- Both inputs and outputs should be normalized
- Normalization should be done for each dimension in the input/output separately (e.g., if one feature dimension is the robot's joint 0 position, then joint 0 should be normalized so that the mean of joint 0 positions is 0)
- Normalization should be based on *training* data statistics, and never on *test* data statistics. This is because we want to use the same constants to scale training and test data (and test data is not available during training)
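As a worked example of the three points above (the numbers are made up for illustration, and the small $10^{-5}$ constant is ignored), suppose the training set contains three values of joint 0's position: $0$, $2$, and $4$. The statistics and standardized values are:

$$
\begin{aligned}
\mu &= \tfrac{0 + 2 + 4}{3} = 2, \qquad
\sigma = \sqrt{\tfrac{(0-2)^2 + (2-2)^2 + (4-2)^2}{3 - 1}} = \sqrt{4} = 2,\\
X_\text{norm} &= \left(\tfrac{0-2}{2},\ \tfrac{2-2}{2},\ \tfrac{4-2}{2}\right) = (-1,\ 0,\ 1).
\end{aligned}
$$

A test value of $6$ is mapped with the same training constants to $(6 - 2)/2 = 2$, and denormalizing inverts this: $2 \cdot 2 + 2 = 6$. Here $\sigma$ uses the $n - 1$ denominator, which is what `torch.std` computes by default; dividing by $n$ instead would give $\sigma = \sqrt{8/3} \approx 1.63$. For datasets of realistic size, the two choices differ negligibly.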

Write a Python class to normalize data via standardization. Your class constructor should receive a torch tensor `X` as input and store the relevant statistics from `X` to use for standardization. 

_Points:_ 10

In [None]:
class Normalizer_12:
    def __init__(self, X):
        ''' 
        Add a small constant 1e-5 to the standard deviation to avoid division by zero
        '''
        self.mean = ...
        self.std = ...
    
    def normalize(self, X):
        '''
        Given a tensor X, return the normalized tensor using the mean and std
        '''
        ...
    
    def denormalize(self, X_normalized):
        '''
        Given a normalized tensor X_normalized, return the denormalized tensor using the mean and std
        '''
        ...

### 1.3) Training

Now you will write a function to train the network to minimize the MSE loss on the given training data.

It will be helpful to leverage the following two utilities:
- [`torch.utils.data.TensorDataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset) -- takes multiple tensors as input and constructs a Dataset object to use as input to a DataLoader (see below)
- [`torch.utils.data.DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) -- takes a Dataset as input and produces an object that can be iterated over. Using `shuffle=True` makes each loop over the DataLoader produce a different ordering over the data.

Read the documentation and examples to understand how to use these two utilities.
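A minimal sketch of the two utilities on a toy dataset of 10 made-up examples. Each iteration over the `DataLoader` yields an aligned pair `(batch_X, batch_Y)`, and the last minibatch is smaller when the dataset size is not a multiple of `batch_size` (the default is `drop_last=False`).

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

X = torch.arange(10, dtype=torch.float32).unsqueeze(1)  # shape (10, 1)
Y = 2 * X                                                # shape (10, 1)

dataset = TensorDataset(X, Y)   # dataset[i] is the pair (X[i], Y[i])
loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_sizes = []
for batch_X, batch_Y in loader:  # one minibatch per iteration, in a new order each epoch
    assert torch.equal(batch_Y, 2 * batch_X)  # inputs and outputs stay paired after shuffling
    batch_sizes.append(len(batch_X))
print(batch_sizes)  # [4, 4, 2]
```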

Write the train function, which takes the following arguments:
- `X`: the *unnormalized* training set inputs
- `Y`: the *unnormalized* training set outputs
- `net`: the neural net to train
- `num_epochs`: the number of epochs (loops over the whole dataset) to execute
- `batchsize`: the size of minibatches to train on (passed as input to `DataLoader`)

You should normalize `X` and `Y`, then create your Dataset and DataLoader, and loop for `num_epochs` many rounds of training, taking gradient steps to minimize the MSE loss.

The return value of your function should be the `X_normalizer` and `Y_normalizer` (the network itself is trained in-place, and so you do not need to return it).
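To see why the training loop calls `optimizer.zero_grad()`, the sketch below uses a single scalar parameter and a toy loss $(w - 3)^2$ (both made up for illustration, not part of the assignment). Calling `backward()` twice without zeroing sums the two gradients, while zeroing first gives the gradient of a single pass. `torch.optim.SGD` is used here only because its update is easy to check by hand; the provided training code uses Adam.

```python
import torch

# A single scalar parameter w, initialized to 1.0
w = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

def loss_fn():
    # Toy loss (w - 3)^2, whose gradient is 2 * (w - 3) = -4 at w = 1
    return (w - 3.0) ** 2

# Two backward passes WITHOUT zeroing: the gradients are summed
loss_fn().backward()
loss_fn().backward()
accumulated = w.grad.item()  # 2 * (-4) = -8.0

# Resetting the gradients first gives the gradient of a single pass
optimizer.zero_grad()
loss_fn().backward()
single = w.grad.item()       # -4.0

# One SGD step: w <- w - lr * grad = 1.0 - 0.1 * (-4.0) = 1.4 (up to float32 rounding)
optimizer.step()
print(accumulated, single, w.item())
```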

_Points:_ 20

In [None]:
def train_13(X, Y, net, num_epochs, batchsize):
    X_normalizer = ...
    X = ...
    Y_normalizer = ...
    Y = ...
    
    # Create a DataLoader for the training data
    dataset = ...
    dataloader = ...

    # Define the loss function and optimizer
    criterion = ...
    optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

    # Training loop
    for epoch in range(num_epochs):
        for batch_X, batch_Y in dataloader:
            # Zero the gradients: PyTorch accumulates (sums) gradients across backward() calls by default, so they must be reset before each step
            optimizer.zero_grad()

            # Forward pass
            Yhat = ...
            loss = ...

            # Backward pass and optimization
            ...
            ...

    ...

## 2) Behavior cloning

In this second half of the homework, we will train behavior policies for controlling a robot. For this, we will fit together two APIs:
- [`minari`](https://github.com/Farama-Foundation/Minari) -- A library that standardizes behavior cloning (and, more generally, offline RL) data formats and contains a collection of simulated demonstration datasets in MuJoCo environments
- [`gymnasium`](https://github.com/Farama-Foundation/Gymnasium) -- A library that standardizes interaction with RL-like environments

You can pip install everything you need for this part of the assignment with:
```
pip install mujoco==3.2.3
pip install "minari[hf,hdf5]"
pip install "gymnasium[mujoco]"
```

### 2.1) Creating dataset and environment

In this assignment, we will be working with the [`mujoco/pusher/expert-v0`](https://minari.farama.org/main/datasets/mujoco/pusher/expert-v0/) dataset.

In this question, you will load the dataset, construct X and Y tensors to pass into your training code, and construct a simulation environment (which we will later use for running our policies). 
- `dataset = minari.load_dataset(<name>, download=True)` gives a dataset object. Dataset objects can be iterated via `for episode in dataset`.
    - Episode objects have a `.observations` array and a `.actions` array
- `env = dataset.recover_environment()` gives the desired simulation environment. During debugging, you may wish to pass the argument `render_mode="human"` to observe the behavior of your policies. This will likely cause issues on Gradescope, so be sure to turn rendering off (by removing the argument) before submitting.
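The episode structure can be flattened by concatenating the per-episode arrays. The sketch below uses hypothetical stand-in episodes built with `SimpleNamespace` rather than real Minari data. It assumes, as Minari's documentation describes, that an episode stores one more observation than actions (the final observation has no action paired with it); please confirm this by printing the shapes of a real episode.

```python
from types import SimpleNamespace

import numpy as np
import torch

def make_fake_episode(num_steps, obs_dim=3, act_dim=2, seed=0):
    """Hypothetical stand-in for a Minari episode (NOT the real episode class)."""
    rng = np.random.default_rng(seed)
    return SimpleNamespace(
        observations=rng.standard_normal((num_steps + 1, obs_dim)),  # one extra: final observation
        actions=rng.standard_normal((num_steps, act_dim)),
    )

episodes = [make_fake_episode(5, seed=0), make_fake_episode(8, seed=1)]

obs_chunks, act_chunks = [], []
for episode in episodes:
    num_actions = len(episode.actions)
    # Pair observation t with action t; drop the trailing observation that has no action
    obs_chunks.append(episode.observations[:num_actions])
    act_chunks.append(episode.actions)

X = torch.as_tensor(np.concatenate(obs_chunks), dtype=torch.float32)
Y = torch.as_tensor(np.concatenate(act_chunks), dtype=torch.float32)
print(X.shape, Y.shape)  # torch.Size([13, 3]) torch.Size([13, 2])
```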

_Points:_ 10

In [None]:
import minari
import gymnasium

def minari_data_21(dataset_name):
    '''
    Given the name of a Minari dataset, return:
    - X: an n x d tensor of observations (to use as input to a dataloader)
    - Y: an n x m tensor of actions (to use as input to a dataloader) 
    - the simulation environment for interaction
    Note: X and Y should "get rid" of the episode structure and just return
    the observations and actions in a single array.
    '''
    dataset = ...
    env = ...

    # *** Loop over the dataset and extract all observations and actions to construct the X,Y tensors ***
    ...
    return X, Y, env

### 2.2) Running a neural net policy

You will now write a function that takes as input a network and a simulation environment and runs the policy on the simulator. 

The `env` API has the following methods:
- `env.reset()`: Sets the environment to a (possibly random) initial state. Returns:
    - `obs`: an observation of the initial state
    - `info`: a dictionary containing information about the environment (we will not use this)
- `env.step(action)`: Executes the action on the simulator. Returns:
    - `obs`: an observation of the resulting state after running the action
    - `rew`: the reward obtained by the agent for executing the action in that state
    - `terminated`: whether the episode terminated (e.g., the agent died)
    - `truncated`: whether the episode was cut off by a fixed time limit
    - `info`: unused
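The loop structure implied by this API can be sketched with a hypothetical toy environment, `ToyEnv`. It returns the same tuples as `reset` and `step` above, but it is not a Gymnasium or MuJoCo environment. The hypothetical `rollout` function accepts any function that maps an observation to an action; handling the network and the normalizers is left to you.

```python
import numpy as np

class ToyEnv:
    """Hypothetical stand-in mimicking the reset/step return values described above.

    The state is a 1D position; the episode is truncated after `max_steps` steps
    and terminated early if the position leaves [-5, 5].
    """
    def __init__(self, max_steps=10):
        self.max_steps = max_steps

    def reset(self, seed=None):
        # A real environment would use `seed` to randomize the initial state
        self.pos, self.t = 0.0, 0
        return np.array([self.pos]), {}

    def step(self, action):
        self.pos += float(action[0])
        self.t += 1
        rew = -abs(self.pos)                   # reward for staying near the origin
        terminated = abs(self.pos) > 5.0       # "failure" ends the episode early
        truncated = self.t >= self.max_steps   # time limit
        return np.array([self.pos]), rew, terminated, truncated, {}

def rollout(policy, env, seed=None):
    """Run one episode with `policy` (obs -> action) and return the total reward."""
    obs, _ = env.reset(seed=seed)
    total_reward, done = 0.0, False
    while not done:
        action = policy(obs)
        obs, rew, terminated, truncated, _ = env.step(action)
        total_reward += rew
        done = terminated or truncated         # stop on either signal
    return total_reward

# Always moving right: positions 1..6, terminated at step 6, total reward -(1+2+...+6) = -21
print(rollout(lambda obs: np.array([1.0]), ToyEnv()))
```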

Your function should execute one full episode: at each step, pass the current observation through the network and execute the action that the network predicts. Use the `terminated` and `truncated` signals to determine when the episode has ended.

Return the sum of rewards accumulated throughout the episode.

*Hint: think carefully about how to use your normalizers in this function.*

_Points:_ 20

In [None]:
def run_policy_22(net, env, X_normalizer, Y_normalizer, seed=None):
    obs, _ = env.reset(seed=seed)
    ...

### 2.3) Training a good behavior cloning policy

You will now (on your own) try various combinations of number of hidden layers, layer sizes, batchsize, and number of epochs. Because this takes a considerable amount of time, the autograder will not run this part of the code. Instead, once you find a good combination of values, you will train a network and save it using the following code cell. You can either:
- run the training on this same notebook. In that case, you'll need to remove it (including the saving code) from the notebook before submitting (otherwise, the autograder will very likely time out); or
- run the training on another notebook. In that case, you'll need to copy the code below to your other notebook to save your network.

*Hint 1: I recommend that you run this on Google Colab, to leverage their free GPUs. For that, you'll need to click on the little triangle in the upper right, hit "Change runtime type", and choose "T4 GPU". Then, you'll need to add `net = net.to('cuda')`, `X = X.to('cuda')`, and `Y = Y.to('cuda')` so that your tensors and network are all on the GPU.*
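A device-agnostic variant of these lines, as a sketch (the tiny `nn.Linear` network and the zero observation are placeholders). It falls back to the CPU when no GPU is available, and it shows how an observation arriving from the simulator as a NumPy array can be moved to the network's device, and how the predicted action can be brought back to the CPU for `env.step`.

```python
import numpy as np
import torch
import torch.nn as nn

# Use the GPU if one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

net = nn.Linear(3, 2).to(device)   # placeholder network (3 inputs, 2 outputs)
X = torch.randn(5, 3).to(device)   # training tensors must live on the same device as the net
out = net(X)                       # shape (5, 2), computed on `device`

# Observations from the simulator are NumPy arrays on the CPU
obs = np.zeros(3)
obs_t = torch.as_tensor(obs, dtype=torch.float32, device=device).unsqueeze(0)  # shape (1, 3)
with torch.no_grad():              # no gradients are needed when only running the policy
    action = net(obs_t).squeeze(0).cpu().numpy()  # shape (2,), back on the CPU for env.step
```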

*Hint 2: consider the following when choosing your hyperparameters:*
- Hidden layers: more than a handful is probably too many
- Layer sizes: more than a few thousand is probably too big
- Batchsize: I'll let you figure this one out, but for speed, it's best to pick powers of 2 (actually, powers of 8 is even better)
- Epochs: more than a few hundred is probably too many

*Hint 3: create the environment with the `render_mode="human"` argument to visualize your policy with your `run_policy_22` function. A good policy should (almost always) succeed in pushing the block.*

Your score will be based on how high a reward your policy obtains on my tests.
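The save cell below stores `net.state_dict()`, which contains only the weights, together with the hyperparameters needed to rebuild a network of the same shape. As a sketch of why both are needed, the hypothetical `build_net` below (a one-hidden-layer stand-in, not `FCNN_11`) is rebuilt from the saved hyperparameter and then loaded with the saved weights. It writes to an in-memory buffer, so nothing touches the disk.

```python
import io

import torch
import torch.nn as nn

def build_net(input_dim, output_dim, hid_size):
    # Hypothetical stand-in architecture (NOT FCNN_11): one hidden layer with ReLU
    return nn.Sequential(nn.Linear(input_dim, hid_size), nn.ReLU(), nn.Linear(hid_size, output_dim))

net = build_net(4, 2, 16)
buffer = io.BytesIO()  # in-memory "file"
torch.save({'net': net.state_dict(), 'hid_size': 16}, buffer)

buffer.seek(0)
saved = torch.load(buffer)
restored = build_net(4, 2, saved['hid_size'])  # rebuild the same architecture first...
restored.load_state_dict(saved['net'])         # ...then copy the saved weights into it

x = torch.randn(3, 4)
print(torch.equal(net(x), restored(x)))  # True: same weights, same outputs
```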

_Points:_ 25

In [None]:
'''
# Save trained net
save_dict = {
    'net': net.state_dict(),
    'hid_size': ..., # The size of your hidden layers
    'num_layers': ..., # The number of hidden layers
    'X_mean': ..., # The mean of the observations
    'X_std': ..., # The std of the observations
    'Y_mean': ..., # The mean of the actions
    'Y_std': ..., # The std of the actions
}
torch.save(save_dict, 'trained_pusher_policy.pt')
'''

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

Fill out the answers to all questions. Submit a zip file containing hw7.ipynb with your answers and the `trained_pusher_policy.pt` file you saved to the HW7 assignment on Gradescope. You are free to resubmit as many times as you wish.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True, files=['trained_pusher_policy.pt'])