# Explanation

This paper was an early indication of the current paradigm of robotics training: an end-to-end imitation learning system that maps visual input directly to control actions, trained on teleoperated demonstrations, with zero-shot and few-shot generalization to unseen tasks. It showed that this approach can train robot policies that succeed on real manipulation tasks.

# Notes

> End-to-end learning from pixels is a flexible choice for modeling the behavior of such generalist robots, as it has minimal assumptions about the state representation of the world.

End-to-end learning from vision data is a general approach to robotics: because it assumes little about how the state of the world is represented, the same method can in principle be applied across many environments.

> With sufficient real-world data, these methods should in principle enable robots to generalize across new tasks, objects, and scenes without requiring hand-coded, task-specific representations.

These systems could generalize given sufficient data, but data at that scale does not yet exist.

> In this paper, we study the problem of enabling a robot to generalize zero-shot or few-shot to new vision-based manipulation tasks.

> Achieving such generalization depends on solving challenges relating to scaling up data collection and learning algorithms for diverse data.

> First, our system incorporates shared autonomy into teleoperation to allow us to collect both raw demonstration data and human interventions to correct the robot’s current policy.

> Second, our system flexibly conditions the policy on different forms of task specification, including a language instruction or a video of a person performing the task.

The system lets the policy be conditioned on language instructions or human videos, and lets human operators correct and improve the robot's policy over time.

> Our main contribution is an empirical study of a large-scale interactive imitation learning system that solves a breadth of tasks, including zero-shot and few-shot generalization to tasks not seen during training.

### Related Work

Imitation learning using deep learning has allowed robots to learn grasping and pick-place tasks from raw image observations.

Prior imitation learning work has achieved different forms of generalization.

Demonstrations are often collected through teleoperation, and methods like DAgger are used to address distribution shift.

### Method Overview

The goal is to train a conditional policy that can take RGB images $s \in \mathcal{S}$ with task command $w \in \mathcal{W}$ in the form of a language instruction or video and accomplish the intended objective.

The policy can be written as $\mu : \mathcal{S} \times \mathcal{W} \rightarrow \mathcal{A}$, where $\mathcal{A}$ is the action space: the 6-DoF pose of the end effector plus a 7th degree of freedom for continuous control of the parallel-jaw gripper.

The policy is trained on data collected through VR-based teleoperation, combining direct demonstrations with human-in-the-loop shared autonomy in a procedure resembling HG-DAgger.
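The collection procedure can be sketched as a loop in which a human takes over whenever the policy goes wrong. This is a minimal illustration of HG-DAgger-style collection, not the paper's code: `collect_episode`, `should_intervene`, the scalar state, and the toy dynamics are all hypothetical.

```python
from typing import Callable, List, Tuple

State = float   # hypothetical 1-D stand-in for the robot state
Action = float  # hypothetical 1-D displacement


def collect_episode(
    policy: Callable[[State], Action],
    expert: Callable[[State], Action],
    should_intervene: Callable[[State], bool],
    start: State,
    steps: int,
) -> List[Tuple[State, Action, bool]]:
    """Roll out the policy, letting the expert take over when needed.

    Returns (state, executed_action, was_intervention) tuples.
    """
    data = []
    state = start
    for _ in range(steps):
        intervening = should_intervene(state)
        # The human's action replaces the policy's while intervening.
        action = expert(state) if intervening else policy(state)
        data.append((state, action, intervening))
        state = state + action  # toy dynamics (assumption)
    return data


# Toy setup: the goal is state 0, but the learned policy drifts away from it.
policy = lambda s: 0.5
expert = lambda s: -0.5 * s
should_intervene = lambda s: abs(s) > 2.0

episode = collect_episode(policy, expert, should_intervene, start=1.0, steps=6)
# Corrective (state, action) pairs labelled by the human.
corrections = [(s, a) for s, a, intervened in episode if intervened]
```

In HG-DAgger, the corrective segments are what get aggregated into the training set alongside the original demonstrations, so the policy sees labels for the states its own mistakes lead to.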

The model architecture has an encoder $q(z|w)$ that predicts an embedding $z$ from the task command $w$, and a control layer $\pi : \mathcal{S} \times \mathcal{Z} \rightarrow \mathcal{A}$ that predicts an action $a$ from the embedding $z$ and the image $s$.
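The two-stage structure amounts to the composition $\mu(s, w) = \pi(s, q(w))$. A minimal sketch of that wiring follows; the toy encoder and control functions are placeholders invented for illustration, and the encoder here is deterministic rather than a sampled $z \sim q(\cdot | w)$.

```python
from typing import Callable, Sequence, Tuple

Embedding = Tuple[float, ...]
Image = Sequence[float]     # stand-in for an RGB image
Action = Tuple[float, ...]  # 6 pose values + 1 gripper value


def make_policy(
    encoder: Callable[[str], Embedding],
    control: Callable[[Image, Embedding], Action],
) -> Callable[[Image, str], Action]:
    """Compose mu(s, w) = pi(s, q(w))."""
    def mu(s: Image, w: str) -> Action:
        z = encoder(w)        # task command -> embedding
        return control(s, z)  # image + embedding -> action
    return mu


def toy_encoder(w: str) -> Embedding:
    # Hypothetical embedding: character count and space count.
    return (float(len(w)), float(w.count(" ")))


def toy_control(s: Image, z: Embedding) -> Action:
    # Hypothetical control layer: scale the pose by mean pixel value.
    scale = sum(s) / len(s)
    pose = tuple(scale * z[0] for _ in range(6))
    gripper = 1.0 if z[1] > 0 else 0.0
    return pose + (gripper,)


mu = make_policy(toy_encoder, toy_control)
action = mu([0.0, 0.5, 1.0], "pick up the sponge")
```

Keeping the encoder separate from the control layer is what allows a language command and a human video to be swapped in for each other, as long as both map into the same embedding space.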

### Learning Algorithm

**1. Language and Video Encoders**

The encoder takes a language command or a video of a human as input $w$ and produces a task embedding $z$. A pretrained multilingual sentence encoder is used for language, and a ResNet-18 is used for video.

Then, given a human video $w_h^i$ and a corresponding robot demonstration $\{ (s, a) \}^i$, the video is encoded as $z^i \sim q(\cdot | w_h^i)$ and the embedding is passed to the control layer $\pi(a|s, z^i)$. The gradient of the behavior cloning loss is then backpropagated through both the policy and the encoder.

![Screenshot 2024-11-01 at 6.15.04 PM.png](../../images/Screenshot_2024-11-01_at_6.15.04_PM.png)

The training objective also includes a **language regression loss**, which smooths the embedding space by pulling each video embedding toward the embedding of the corresponding language command.
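One way to express such an alignment term is a cosine distance between the video embedding and the language embedding of the same task. The sketch below assumes that form; the function name and the 2-D vectors are illustrative only.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cos(u, v); 0 when the vectors point the same way.

    Assumes both vectors are non-zero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# A video embedding pointing the same way as its language embedding: no penalty.
aligned = cosine_distance([2.0, 0.0], [1.0, 0.0])
# Orthogonal embeddings: penalty of 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

Because the distance ignores vector length, only the direction of the video embedding is pushed toward that of the language embedding.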

**2. Policy Training**

They train $\pi(a|s, z)$ with a Huber loss on the predicted actions: the 6-DoF end-effector pose and the gripper angle.
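The Huber loss is quadratic for small errors and linear for large ones, which makes it less sensitive to outlier actions than a squared error. A minimal sketch follows; the threshold `delta=1.0` and the averaging over action dimensions are assumptions, not values taken from the paper.

```python
from typing import Sequence


def huber(residual: float, delta: float = 1.0) -> float:
    """Quadratic inside [-delta, delta], linear outside."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)


def action_loss(
    pred: Sequence[float], target: Sequence[float], delta: float = 1.0
) -> float:
    """Mean Huber loss over the action dimensions."""
    return sum(huber(p - t, delta) for p, t in zip(pred, target)) / len(pred)


small = huber(0.5)   # quadratic branch: 0.5 * 0.5**2
large = huber(3.0)   # linear branch: 1.0 * (3.0 - 0.5)
mean_loss = action_loss([0.5, 3.0], [0.0, 0.0])
```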

Images are cropped/downsampled to improve generalization during training.

At inference time, the robot predicts only the next action and re-plans at every step (closed loop). During training, the model also predicts the next 10 steps open-loop, which provides another auxiliary training objective.
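The open-loop targets can be built by sliding a window over each demonstration. The sketch below uses a horizon of 3 for readability (the notes above describe 10) and scalar actions; padding the end of the trajectory by repeating the final action is an assumption.

```python
from typing import List, Sequence


def future_targets(actions: Sequence[float], horizon: int) -> List[List[float]]:
    """For each timestep t, return actions[t : t + horizon].

    Windows that run past the end are padded with the final action.
    """
    targets = []
    for t in range(len(actions)):
        window = list(actions[t:t + horizon])
        window += [actions[-1]] * (horizon - len(window))
        targets.append(window)
    return targets


demo = [0.1, 0.2, 0.3, 0.4]
targets = future_targets(demo, horizon=3)

# Closed-loop execution uses only the first element of each window.
executed = [window[0] for window in targets]
```

At training time every element of each window contributes to the loss, while at inference only the first predicted action is executed before the policy is queried again.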

**3. Network Architecture**

The architecture uses a ResNet-18 body to process the images, followed by multiple action heads (each an MLP with 2 hidden layers). The policy is conditioned on the task embedding $z$ through FiLM layers.
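A FiLM layer applies a per-channel scale and shift to the image features, with both computed from the task embedding. The following is a minimal sketch of that operation in plain Python; the weights, shapes, and helper names are made up for illustration and say nothing about where the layers sit in the actual network.

```python
from typing import List, Sequence


def linear(
    z: Sequence[float], weights: Sequence[Sequence[float]], bias: Sequence[float]
) -> List[float]:
    """Affine map from the embedding to one value per channel."""
    return [sum(w * zi for w, zi in zip(row, z)) + b for row, b in zip(weights, bias)]


def film(
    features: Sequence[Sequence[float]],
    gamma: Sequence[float],
    beta: Sequence[float],
) -> List[List[float]]:
    """Scale and shift each channel: gamma_c * x + beta_c."""
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


z = [1.0, 2.0]  # toy task embedding
gamma = linear(z, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])   # per-channel scale
beta = linear(z, [[0.0, 0.0], [0.0, 0.0]], [0.5, -1.0])   # per-channel shift

features = [[1.0, 2.0], [3.0, 4.0]]  # 2 channels, 2 activations each
modulated = film(features, gamma, beta)
```

Because the same image features are modulated differently for each embedding, a single visual backbone can serve many tasks.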

### Experimental Results

![Screenshot 2024-11-01 at 6.25.28 PM.png](../../images/Screenshot_2024-11-01_at_6.25.28_PM.png)

![Screenshot 2024-11-01 at 6.25.35 PM.png](../../images/Screenshot_2024-11-01_at_6.25.35_PM.png)

BC-Z generalizes to held-out tasks that pair a skill with objects it was never applied to during training, although every individual skill was still seen during training.

![Screenshot 2024-11-01 at 6.27.28 PM.png](../../images/Screenshot_2024-11-01_at_6.27.28_PM.png)

Ablation studies show that multi-task training was essential for generalization, as was HG-DAgger-style data collection with human interventions rather than expert demonstrations alone.

### Discussion

> We presented a multi-task imitation learning system that combines flexible task embeddings with large-scale training on a 100-task demonstration dataset, enabling it to generalize to entirely new tasks that were not seen in training based on user-provided language or video commands.

> The performance on novel tasks varies significantly. However, even for tasks that are less successful, the robot often exhibits behavior suggesting that it understands at least part of the task, reaching for the right object or performing a semantically related motion.
