# Notes

Modern frontier models use transfer learning from large internet-scale task-agnostic datasets to solve a variety of downstream tasks with zero-shot or smaller task datasets.

End-to-end learning for robotics, using imitation or reinforcement learning, usually depends on collecting narrow, task-specific data.

> We therefore ask: can we train a single, capable, large multi-task backbone model on data consisting of a wide variety of robotic tasks? And does such a model enjoy the benefits observed in other domains, exhibiting zero-shot generalization to new tasks, environments, and objects?

The idea is to apply to robotics the lessons learned from training LLMs and other successful frontier generative models: internet-scale data, generalization, and transfer learning.

Building such a generalized robot is hard. The two challenges are creating a good dataset and designing a good model.

Good data is especially critical and rare in robotics. Datasets have to be robot-specific and gathered manually.

Good generalization requires sufficient breadth and scale in the datasets, with many tasks and settings that are connected enough for the robot to generalize.

Designing the model for the robot is also a challenge. Multi-task learning for robots requires a high-capacity model, a role the transformer is well suited to.

However, robot controllers have to run in real time, which limits how large and slow the model can be.

Robotics Transformer 1 (RT-1) encodes inputs/outputs (images, instructions, motor commands) into token representations that can be used by the transformer, allowing real-time inference.

> Our results show that RT-1 can perform over 700 training instructions at 97% success rate, and can generalize to new tasks, distractors, and backgrounds 25%, 36% and 18% better than the next best baseline, respectively.

> We further show that RT-1 can incorporate data from simulation or even other robot types, retaining performance on the original tasks and improving generalization to new scenarios.

### **Related Work**

> Our work takes the application of Transformers a step further and treats the mapping of language and vision observations to robot actions as a sequence modeling problem, using a Transformer to learn this mapping.

Prior attempts at using transformers for robotics use transformers just for the language modeling and planning side of the problem. RT-1 uses transformers for the entire language + vision → action sequence.

> On the technical side, our work examines how Transformer-based policies can be built so as to combine high capacity and generalization with the computational efficiency necessary for real-time control.

The design of RT-1 is meant to make real time use of transformers for robotics feasible.

> Our work adds further evidence in support of the power of multi-task, language-conditioned robotic learning.

### **Preliminaries**

The robot needs to learn a policy $\pi(\cdot \mid i, \{x_j\}_{j=0}^t)$ that, given the instruction $i$ and the images observed so far, gives a distribution over the action at step $t$. Sampling from it step by step yields a sequence of actions $\{a_0, \dots, a_T\}$ that ends at a terminating step $T$. A binary reward $r \in \{0, 1\}$ then indicates whether the instruction was carried out, and the goal is a policy whose episodes end with $r = 1$.

RT-1 uses a transformer to parameterize $\pi$: the transformer takes tokens from the instruction $i$ and the images $\{x_j\}_{j=0}^t$ as input and outputs action tokens $a_t$.
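
The closed-loop use of such a policy can be sketched as below. This is only an illustration of the interface, not RT-1's code: `run_episode`, `dummy_policy`, `dummy_camera`, the `"terminate"` action string, and the step cap are all hypothetical stand-ins, and the default history length of 6 is borrowed from the image history described in the Model section.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

Image = List[float]   # placeholder for a camera frame
Action = str          # placeholder for a tokenized action

def run_episode(
    policy: Callable[[str, Sequence[Image]], Action],
    camera: Callable[[int], Image],
    instruction: str,
    history_len: int = 6,
    max_steps: int = 100,
) -> List[Action]:
    """Roll out a language-conditioned policy until it terminates."""
    history: Deque[Image] = deque(maxlen=history_len)  # keeps only recent frames
    actions: List[Action] = []
    for t in range(max_steps):
        history.append(camera(t))                    # observe x_t
        action = policy(instruction, list(history))  # choose a_t from pi(. | i, images)
        actions.append(action)
        if action == "terminate":                    # the policy ends the episode
            break
    return actions

# Hypothetical stand-ins so the sketch runs end to end.
def dummy_camera(t: int) -> Image:
    return [float(t)]

def dummy_policy(instruction: str, images: Sequence[Image]) -> Action:
    # Terminates once the newest frame index reaches 3.
    return "terminate" if images[-1][0] >= 3 else "move"

print(run_episode(dummy_policy, dummy_camera, "pick coke can"))
# ['move', 'move', 'move', 'terminate']
```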

RT-1 uses imitation learning with behavior cloning to learn from a dataset with many demonstrations (episodes) concluding in $r= 1$:

$$
\mathcal{D} = \left\{ \left( i^{(n)}, \left\{ \left( x_t^{(n)}, a_t^{(n)} \right) \right\}_{t=0}^{T^{(n)}} \right) \right\}_{n=0}^{N}
$$
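
Behavior cloning here means maximizing the likelihood of the demonstrated actions. Written as a loss, in the same notation as the policy above and with the inner sum running over every step $t$ of episode $n$, a sketch of the objective is:

$$
\mathcal{L}(\pi) = - \sum_{n=0}^{N} \sum_{t} \log \pi\left( a_t^{(n)} \mid i^{(n)}, \{x_j^{(n)}\}_{j=0}^{t} \right)
$$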

### System Overview

> The goal of this work is to build and demonstrate a general robot learning system that can absorb large amounts of data and generalize effectively.

> Our training data consists of human-provided demonstrations, and we annotate each episode with a textual description of the instruction that the robot just performed.

> The instructions usually contain a verb and one or more nouns describing the target objects. To group these instructions together, we split them into a number of skills (e.g., verbs such as “pick”, “open” or “place upright”) and objects (e.g., nouns such as “coke can”, “apple”, or “drawer”).

The data collection strategy uses manual task labeling. They group tasks into equivalent categories for generalization by using similar verbs and nouns.
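
A rough sketch of this grouping is below. The assumption that the skill is simply a known verb at the start of the instruction is mine, as are the names `SKILLS`, `split_instruction`, and `group_by_skill`. A real labeling scheme would need more than prefix matching to handle multi-word skills such as “place upright”.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical skill vocabulary taken from the verbs quoted above.
SKILLS = ["pick", "open"]

def split_instruction(instruction: str) -> Tuple[str, str]:
    """Split an instruction into (skill, object) by matching a known verb prefix."""
    text = instruction.lower().strip()
    for skill in SKILLS:
        if text.startswith(skill + " "):
            return skill, text[len(skill):].strip()
    raise ValueError(f"no known skill in: {instruction!r}")

def group_by_skill(instructions: List[str]) -> Dict[str, List[str]]:
    """Collect the objects seen with each skill."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for instruction in instructions:
        skill, obj = split_instruction(instruction)
        groups[skill].append(obj)
    return dict(groups)

demo = ["pick coke can", "pick apple", "open drawer"]
print(group_by_skill(demo))
# {'pick': ['coke can', 'apple'], 'open': ['drawer']}
```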

![Screenshot 2024-10-31 at 2.55.00 PM.png](../images/Screenshot_2024-10-31_at_2.55.00_PM.png)

### Robotics Transformer

**1. Model**

> The RT-1 architecture relies on a data-efficient and compact tokenization of images and language instruction.

RT-1 has to be very token efficient to enable real-time inference.

It takes in a history of 6 images at $300 \times 300$ resolution and passes each one through an **EfficientNet-B3** pre-trained on ImageNet, taking the $9 \times 9 \times 512$ spatial feature map output by the final convolutional layer.

Instead of patchifying the images before feeding them to the transformer, it flattens this feature map into 81 tokens of dimension 512 and passes them to later layers.

They add a [FiLM layer](https://arxiv.org/abs/1709.07871) to the EfficientNet, conditioned on an embedding of the instruction from the Universal Sentence Encoder, so that the image tokens emphasize features relevant to the instruction. The **FiLM layer** is initialized to the identity so that it does not disrupt the EfficientNet’s pre-training.
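
The identity initialization can be illustrated with a small sketch. FiLM applies a per-channel scale and shift computed from the conditioning embedding. One way to get an identity at initialization, assumed here, is to write the scale as $1 + \gamma$ and zero-initialize the projections that produce $\gamma$ and $\beta$. The `FiLM` class, the `dense` helper, and the tiny dimensions are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import List

Vector = List[float]

def dense(weights: List[Vector], bias: Vector, x: Vector) -> Vector:
    """Plain affine map: weights @ x + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

class FiLM:
    def __init__(self, embed_dim: int, channels: int) -> None:
        # Zero-initialized projections, so gamma = beta = 0 before any training.
        self.w_gamma = [[0.0] * embed_dim for _ in range(channels)]
        self.b_gamma = [0.0] * channels
        self.w_beta = [[0.0] * embed_dim for _ in range(channels)]
        self.b_beta = [0.0] * channels

    def __call__(self, features: List[Vector], embedding: Vector) -> List[Vector]:
        gamma = dense(self.w_gamma, self.b_gamma, embedding)
        beta = dense(self.w_beta, self.b_beta, embedding)
        # Same scale and shift at every spatial position, one pair per channel.
        return [
            [(1.0 + g) * f + b for f, g, b in zip(position, gamma, beta)]
            for position in features
        ]

film = FiLM(embed_dim=4, channels=3)
features = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]   # 2 positions, 3 channels
instruction_embedding = [0.1, 0.2, 0.3, 0.4]      # stand-in for a sentence embedding
assert film(features, instruction_embedding) == features  # identity at init
```

Once training moves the projection weights away from zero, the instruction starts to modulate which channels are amplified or suppressed.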

> To further compress the number of tokens that RT-1 needs to attend over and thus speed up inference, RT-1 uses **TokenLearner**

TokenLearner uses attention to soft-select image tokens based on their information content and passes only the important token combinations on to later layers.

The TokenLearner reduces the 81 image tokens to just 8 final tokens per image passed to the transformer layer.
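
The reduction from 81 tokens to 8 can be sketched as attention-weighted pooling. This is a simplification, not the actual TokenLearner module: each output token is given a hypothetical linear scorer (`scorers`) in place of the small learned network that computes attention weights, and the token dimension is shrunk from 512 to 16 to keep the example light.

```python
import math
import random
from typing import List

Vector = List[float]

def softmax(scores: Vector) -> Vector:
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def token_learner(tokens: List[Vector], scorers: List[Vector]) -> List[Vector]:
    """Reduce len(tokens) input tokens to len(scorers) output tokens."""
    dim = len(tokens[0])
    outputs = []
    for scorer in scorers:
        # One attention weight per input token, normalized over all positions.
        scores = [sum(w * v for w, v in zip(scorer, token)) for token in tokens]
        weights = softmax(scores)
        # Each output token is a weighted combination of all input tokens.
        pooled = [sum(a * token[d] for a, token in zip(weights, tokens)) for d in range(dim)]
        outputs.append(pooled)
    return outputs

random.seed(0)
image_tokens = [[random.gauss(0, 1) for _ in range(16)] for _ in range(81)]
scorers = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
compressed = token_learner(image_tokens, scorers)
print(len(compressed), len(compressed[0]))  # 8 16
```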

The transformer therefore receives 8 tokens per image, for a total of 48 image tokens across the 6-image history. The transformer itself has only 19M parameters.
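
The token budget is simple arithmetic over the numbers above. The comparison of attention cost assumes the usual quadratic scaling of self-attention with sequence length, and the constant names are mine.

```python
HISTORY_IMAGES = 6             # images in the input history
FEATURE_MAP_SIDE = 9           # EfficientNet output is 9 x 9 x 512
TOKENS_AFTER_TOKENLEARNER = 8  # tokens kept per image

tokens_per_image = FEATURE_MAP_SIDE * FEATURE_MAP_SIDE                # 81
tokens_without_compression = HISTORY_IMAGES * tokens_per_image        # 486
tokens_with_compression = HISTORY_IMAGES * TOKENS_AFTER_TOKENLEARNER  # 48

# Self-attention compares every token with every other token,
# so its cost grows with the square of the sequence length.
attention_cost_ratio = tokens_without_compression**2 / tokens_with_compression**2

print(tokens_without_compression, tokens_with_compression)  # 486 48
print(round(attention_cost_ratio, 1))                       # 102.5
```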

> To tokenize actions, each action dimension in RT-1 is discretized into 256 bins.

> Action tokenization uses 7 variables for arm movement ($x$, $y$, $z$, roll, pitch, yaw, opening of gripper), three variables for base movement ($x$, $y$, yaw), and a discrete variable to switch between three modes: controlling arm, base, or terminating the episode.
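
A sketch of this tokenization with uniform bins is below. The bounds of `-1.0` to `1.0` shared by every dimension, the dimension names, and encoding the mode as its index in `MODES` are assumptions made for illustration, not values from the paper.

```python
from typing import Dict, List

NUM_BINS = 256
ARM_DIMS = ["arm_x", "arm_y", "arm_z", "arm_roll", "arm_pitch", "arm_yaw", "gripper"]
BASE_DIMS = ["base_x", "base_y", "base_yaw"]
MODES = ["arm", "base", "terminate"]
LOW, HIGH = -1.0, 1.0  # hypothetical bounds, shared by all dimensions

def discretize(value: float, low: float = LOW, high: float = HIGH) -> int:
    """Map a continuous value to one of NUM_BINS uniform bins."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    return min(int(fraction * NUM_BINS), NUM_BINS - 1)  # keep the upper bound in range

def undiscretize(token: int, low: float = LOW, high: float = HIGH) -> float:
    """Map a bin index back to the center of its bin."""
    return low + (token + 0.5) * (high - low) / NUM_BINS

def tokenize_action(action: Dict[str, float], mode: str) -> List[int]:
    """Produce 11 tokens: 7 arm, 3 base, 1 mode."""
    tokens = [discretize(action[name]) for name in ARM_DIMS + BASE_DIMS]
    tokens.append(MODES.index(mode))
    return tokens

action = {name: 0.0 for name in ARM_DIMS + BASE_DIMS}
action["arm_x"] = 1.0
action["gripper"] = -1.0
print(tokenize_action(action, "arm"))
# [255, 128, 128, 128, 128, 128, 0, 128, 128, 128, 0]
```

Decoding with `undiscretize` recovers each value only up to half a bin width, which is the precision given up by treating control as token prediction.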

The robot has to roughly match a human's speed of task execution. Humans executed these tasks in 2-4 s each. Matching that requires a control frequency of at least 3 Hz, which, given the other latencies in the system, leaves an inference budget of under 100 ms.
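
As a back-of-the-envelope check on these numbers, the sketch below assumes the controller runs at exactly 3 Hz. The variable names are mine, and `remaining_ms` is just the time left in each control period after a worst-case inference, not a measured latency.

```python
CONTROL_HZ = 3               # minimum control frequency
INFERENCE_BUDGET_MS = 100    # upper bound on model inference time
HUMAN_TASK_SECONDS = (2, 4)  # human execution time per task

control_period_ms = 1000 / CONTROL_HZ
remaining_ms = control_period_ms - INFERENCE_BUDGET_MS
steps_per_task = tuple(seconds * CONTROL_HZ for seconds in HUMAN_TASK_SECONDS)

print(round(control_period_ms))  # 333
print(round(remaining_ms))       # 233
print(steps_per_task)            # (6, 12)
```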

This is why using the EfficientNet + TokenLearner to reduce the total number of input tokens is important.

**2. Data**

![Screenshot 2024-10-31 at 3.18.51 PM.png](../images/Screenshot_2024-10-31_at_3.18.51_PM.png)

> Our primary dataset consists of ∼130k robot demonstrations, collected with a fleet of 13 robots over the course of 17 months.

> RT-1 is able to perform over 700 language instructions in multiple realistic office kitchen environments that we evaluate and describe in detail in the experiments.

### Experiments

**1. Experimental Setup**

RT-1 is compared to Gato and BC-Z, which both use different architectures. Both Gato and BC-Z are retrained on the RT-1 dataset, using much smaller models than in their original papers so that they can run in real time.

> The policies are evaluated for performance on training tasks as well as generalization to new tasks, robustness to unseen environments, and performance when chained together for long-horizon tasks.

They evaluate **seen task performance** by using tasks from the dataset while altering the configuration and placement of objects.

They evaluate **unseen task generalization** by testing new instructions in which known skills and objects are combined in ways that were unseen in the dataset.

They evaluate **robustness** by changing the environment (for background robustness) and adding unknown objects (for distractor robustness).

To evaluate **long-horizon scenarios**, they test 15 long-horizon instructions, each of which requires executing many distinct steps (using SayCan to come up with these tasks).
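
The unseen-task criterion above can be made concrete with a small check: an instruction counts as "unseen" when its skill and its object each appear in training, but never together. The `(skill, object)` representation, the function name, and the toy training set are hypothetical.

```python
from typing import Set, Tuple

Task = Tuple[str, str]  # (skill, object)

def is_unseen_but_composable(candidate: Task, training_tasks: Set[Task]) -> bool:
    """True when the skill and the object were each seen in training,
    but never together in one instruction."""
    skill, obj = candidate
    seen_skills = {s for s, _ in training_tasks}
    seen_objects = {o for _, o in training_tasks}
    return (
        candidate not in training_tasks
        and skill in seen_skills
        and obj in seen_objects
    )

# Toy training set, not the real RT-1 task list.
training = {("pick", "coke can"), ("open", "drawer"), ("place upright", "apple")}

print(is_unseen_but_composable(("pick", "apple"), training))     # True
print(is_unseen_but_composable(("pick", "coke can"), training))  # False
print(is_unseen_but_composable(("pick", "sponge"), training))    # False
```

The second case fails because the combination was already seen, and the third because "sponge" is a new object, which would test a different kind of generalization.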

**2. Generalization**

> Can an RT-1 learn to perform a large number of instructions, as well as to generalize in zero shot to new tasks, objects and environments?

![Screenshot 2024-10-31 at 3.32.37 PM.png](../images/Screenshot_2024-10-31_at_3.32.37_PM.png)

RT-1 outperforms the baselines on every generalization axis: unseen tasks, distractor robustness, and background robustness.

**3. Simulation & External Data**

> Can we push the resulting model even further by incorporating heterogeneous data sources, such as simulated data or data from different robots?

> We demonstrate how RT1 can incorporate and learn from vastly different data sources and improve from such data.

RT-1 can learn from a mix of real and simulated data, as well as from data collected on other robots.

![Screenshot 2024-10-31 at 3.38.45 PM.png](../images/Screenshot_2024-10-31_at_3.38.45_PM.png)

Training RT-1 on the original dataset plus a new simulation dataset barely degrades performance on the real tasks, while adding the ability to transfer what was learned in simulation.

![Screenshot 2024-10-31 at 3.41.23 PM.png](../images/Screenshot_2024-10-31_at_3.41.23_PM.png)

Combining the RT-1 data with a separate Kuka bin-picking dataset mostly maintains RT-1's ability to perform the original tasks while also improving its bin-picking ability, indicating that it can generalize across and combine different datasets.

This is impressive given how different the RT-1 and Kuka datasets are.

**4. Long-Horizon Generalization**

![Screenshot 2024-10-31 at 3.44.23 PM.png](../images/Screenshot_2024-10-31_at_3.44.23_PM.png)

Using SayCan, RT-1 can perform long-horizon tasks and generalizes to a new kitchen far better than previous models.

**5. Data Quantity**

> In many robotics works the model size is often not the primary bottleneck, and the maximum size is limited by the latency requirement for running such models on real robots.

> Since data collection is particularly expensive for real robots, it is important to quantify what kind of data our models need to achieve a certain performance and generalization.

![Screenshot 2024-10-31 at 3.48.11 PM.png](../images/Screenshot_2024-10-31_at_3.48.11_PM.png)

> Our key takeaway is thus that data diversity is more essential than data quantity.

### Conclusion

RT-1 comes with a few challenges:

It uses imitation learning, so its performance may be capped by that of the demonstrators.

Generalization to new instructions is limited to combinations of previously seen concepts. RT-1 cannot yet generalize to entirely new motions.

The task set demands very little dexterous manipulation.