# Notes

The semantic reasoning, problem solving, and visual interpretation capabilities of large vision-language models trained on web-scale internet data would be highly valuable to general-purpose robots.

However, it’s challenging for robotics to independently acquire these capabilities.

> While a brute force approach might entail collecting millions of robotic interaction trials, the most capable language and vision-language models are trained on billions of tokens and images from the web - an amount unlikely to be matched with robot data in the near future.

Very important point. Collecting robot data directly through trials is unlikely to ever reach the scale of useful data the internet provides for LLMs and VLMs.

This makes it attractive to integrate vision-language models (VLMs) with robotics.

Current approaches have attempted to use VLMs for high-level robotic planning, taking control of a state-machine that selects individual task primitives executed by lower-level controllers.

This feels like the same mistake as having deep learning operate on symbols/features hand-designed by humans. The learned system never gets to generalize over the low-level parts of the task.

> Therefore, in this paper we ask: can large pre-trained vision-language models be integrated directly into low-level robotic control to boost generalization and enable emergent semantic reasoning?

RT-2 takes existing vision-language models, which already have large compute costs sunk into them, and trains them to output actions, producing **vision-language-action models**.

> We instantiate VLA models by building on the protocol proposed for RT-1, using a similar dataset, but expanding the model to use a large vision-language backbone.

> Besides the expected benefit of dramatically improving generalization to novel objects and semantically varied instructions, we observe a number of emergent capabilities.

VLA models inherit the generalization of VLMs and so generalize better than RT-1, though their abilities are still limited to the same set of tasks/actions.

> We show that RT-2 enable significant improvements to generalization over objects, scenes, and instructions, and exhibit a breadth of emergent capabilities inherited from web-scale vision-language pre-training.

### Related Work

Most vision-language models either do representation learning (e.g. CLIP) or follow the vision + text → text pattern.

Building generalizing robot controllers is challenging.

Prior methods have achieved generalization across object instances, new skills, new goals/language instructions, new tasks, and new environments.

> We aim to develop and study a single model that can generalize to unseen conditions along all of these axes.

The RT-2 architecture differs from prior attempts to integrate VLMs with robotics in that it doesn’t use separate action-only model layers. Instead, action and language are handled by a single model.

### Vision-Language-Action Models

**1. Pre-Trained Vision-Language Models**

RT-2 uses PaLI-X and PaLM-E to make RT-2-PaLI-X and RT-2-PaLM-E.

**2. Robot-Action Fine-Tuning**

The VLM has to be fine-tuned to convert it into a useful VLA model. RT-2 uses the same action space as RT-1.

RT-2 reserves 256 tokens as action tokens, one per bin of the discretized action space. These are either tokens the VLM already has or, failing that, the 256 least-used tokens in the VLM vocabulary, which get overwritten.
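
A minimal sketch of what mapping continuous actions onto 256 action tokens could look like. The value ranges, the three-dimensional action, and all function names are assumptions made up for illustration, not the paper's code.

```python
NUM_BINS = 256  # one action token per bin


def discretize(value, low, high, num_bins=NUM_BINS):
    """Map a continuous value in [low, high] to an integer bin in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # The upper bound itself would land on num_bins, so clamp it into the last bin.
    return min(int(fraction * num_bins), num_bins - 1)


def undiscretize(bin_index, low, high, num_bins=NUM_BINS):
    """Map a bin index back to the center of its interval."""
    width = (high - low) / num_bins
    return low + (bin_index + 0.5) * width


def action_to_text(action, bounds):
    """Render each action dimension as its bin index, separated by spaces."""
    return " ".join(str(discretize(v, lo, hi)) for v, (lo, hi) in zip(action, bounds))


# Hypothetical 3-dimensional action with made-up bounds.
bounds = [(-1.0, 1.0), (-1.0, 1.0), (0.0, 1.0)]
action = [0.0, -1.0, 1.0]
print(action_to_text(action, bounds))  # prints "128 0 255"
```

Decoding with `undiscretize` only recovers the bin center, so the round trip loses at most half a bin width per dimension.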

RT-2 uses co-fine-tuning to train the model with both vision-language-action pairs and standard web text-image pairs so the model retains its original abilities.
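
A rough sketch of the co-fine-tuning idea: every batch mixes robot examples with web examples so that the original abilities keep getting gradient signal. The `robot_fraction` knob, the toy datasets, and the function name are hypothetical. The notes don't record the actual mixing ratio.

```python
import random


def mixed_batches(robot_data, web_data, batch_size, robot_fraction, num_batches, seed=0):
    """Yield batches that mix robot and web examples.

    robot_fraction is a hypothetical knob: the probability that any one
    example in a batch is drawn from the robot dataset.
    """
    rng = random.Random(seed)
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            source = robot_data if rng.random() < robot_fraction else web_data
            batch.append(rng.choice(source))
        yield batch


# Toy stand-ins for real training examples.
robot_data = [("robot", i) for i in range(10)]
web_data = [("web", i) for i in range(10)]

batches = list(
    mixed_batches(robot_data, web_data, batch_size=8, robot_fraction=0.5, num_batches=4)
)
```

Setting `robot_fraction=1.0` would reduce this to plain robot-only fine-tuning, which is the case co-fine-tuning is meant to avoid.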

The model has output constraints in robot-task inference to only sample action tokens, whereas it can sample any token in normal inference.
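
One way to picture the output constraint is as a mask over the logits before a token is chosen. The sketch below uses greedy selection instead of sampling to keep it short, and the 10-token vocabulary and the choice of ids 6-9 as action tokens are made up for illustration.

```python
import math


def constrained_argmax(logits, allowed_token_ids):
    """Pick the highest-scoring token among allowed_token_ids only."""
    allowed = set(allowed_token_ids)
    if not allowed:
        raise ValueError("allowed_token_ids must not be empty")
    # Disallowed tokens get -inf so they can never win the argmax.
    masked = [score if i in allowed else -math.inf for i, score in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)


# Hypothetical vocabulary of 10 tokens where ids 6-9 are the action tokens.
logits = [2.0, 5.0, 0.1, 0.3, 1.0, 0.2, 0.7, 1.5, 0.9, 0.4]
action_token_ids = range(6, 10)

unconstrained = max(range(len(logits)), key=logits.__getitem__)  # token 1
constrained = constrained_argmax(logits, action_token_ids)       # token 7
```

Token 1 has the highest raw score, but because it is not an action token the constrained decoder falls back to token 7, the best-scoring action token.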

**3. Real-Time Inference**

RT-2 runs inference on a multi-TPU cloud setup that the robot queries over the network, instead of on on-robot GPUs. This is how it meets the 1-3 Hz control frequency requirement.

### Experiments

The model is trained on the web-scale datasets originally used for the underlying VLMs, plus the robot action dataset used in RT-1.

**1. Generalization**

![Screenshot 2024-10-31 at 4.19.27 PM.png](../images/Screenshot_2024-10-31_at_4.19.27_PM.png)

RT-2 performs similarly to RT-1 on seen tasks but far better on unseen tasks, suggesting that the VLM mainly improves generalization to new image-text concepts.

**2. Emergent Capabilities**

RT-2 is able to accomplish tasks like “put strawberry in the correct bowl” or “pick up the bag about to fall off the table.”

![Screenshot 2024-10-31 at 4.23.17 PM.png](../images/Screenshot_2024-10-31_at_4.23.17_PM.png)

RT-2 improves performance across symbol understanding, reasoning, and human recognition tasks.

**4. Chain-of-Thought**

![Screenshot 2024-10-31 at 4.24.59 PM.png](../images/Screenshot_2024-10-31_at_4.24.59_PM.png)

They augment the data to include a plan step and fine-tune on this data. RT-2 can then carry out more complex tasks.
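
A small sketch of what adding a plan step to a training example might look like. The template with `Instruction:`, `Plan:`, and `Action:` fields, the example text, and the action token values are all assumptions for illustration.

```python
def with_plan_step(instruction, plan, action_tokens):
    """Build a training string with a natural-language plan before the action tokens."""
    action_text = " ".join(str(t) for t in action_tokens)
    return f"Instruction: {instruction} Plan: {plan} Action: {action_text}"


# Made-up instruction, plan, and action token values.
example = with_plan_step(
    instruction="I'm hungry.",
    plan="pick up the snack bar",
    action_tokens=[1, 128, 124, 136, 121, 158, 111, 255],
)
print(example)
```

Because the plan comes before the action tokens, the model has to generate its intermediate reasoning in language first and then condition the action on it.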

### Limitations

Generalization from web-scale data doesn’t give RT-2 the ability to perform new motions/skills. It learns to deploy its skills in new ways but is still limited to the skills seen in the robot data.

Running VLA models in real time is computationally expensive. Quantization and distillation may be needed to make the models cheaper.

![Screenshot 2024-10-31 at 4.27.51 PM.png](../images/Screenshot_2024-10-31_at_4.27.51_PM.png)
