# Explanation

$\pi_0$ is the most capable published robotics model at the time of writing. It combines the most important recent innovations in robot learning into a single architecture with the strongest generalization of any published model so far: it generalizes to new tasks, objects, and environments, adapts to new hardware form factors with fine-tuning, and can be specialized to harder tasks with post-training.

The architecture combines internet-scale pre-training on language and image data from multi-modal LLMs, the action grounding offered by VLAs and SayCan, hardware and task generalization from high-quality data across multiple robot form factors, and the architectural strengths of ACT (action chunking).

I'm honestly impressed that they released this paper publicly. On the software side, these are the most important published results of any recent robot.

# Notes

![Screenshot 2024-11-08 at 11.15.46 AM.png](../../images/notes/Screenshot_2024-11-08_at_11.15.46_AM.png)

> We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge.

> Similarly, we may find that for effective specialized robot systems, it is more effective to first pre-train on highly diverse robot data, and then fine-tune or prompt for the desired task.

This is one of their key insights. Training at scale on diverse robot data may help to generalize to the level necessary for learning new tasks.

> This can resolve the data scarcity challenge, because many more sources of data are available to a generalist model — including data from other tasks, other robots, or even non-robot sources — and it may resolve robustness and generalization challenges.

More data allows generalization to more scenarios, robot types, etc. This is the purpose of training a foundation model for robotics. They suggest that a generalist model can actually draw on a much broader range of data than a specialist one.

> Thus, adopting a large-scale pre-training approach to robot learning has the potential to address many of the field’s challenges.

Developing generalist robots requires large scale, the right architecture, and the right training recipe.

They use a VLA model with action chunking (as in ACT) and flow matching to represent complex continuous actions.

They combine high- and low-quality data so the model can both perform tasks well and learn to recover from mistakes. They use large-scale pre-training to learn general knowledge and post-training to acquire dexterity, robustness, and efficiency on specific tasks.

They train on >10,000 hours of data.

### Related Work

> Incorporating these concepts into a VLA model, we introduce what to our knowledge is the first flow matching VLA that produces high-frequency action chunks for dexterous control.

VLA + action chunking + flow matching (a variant of diffusion).

> Since one of our aims is to study complex and dexterous behaviors, we utilize a much larger dataset, with about 10,000 hours of demonstrations, complemented by the open-source OXE dataset. To our knowledge, this represents by far the largest robot learning experiment in terms of the amount of robot data.

> The complexity of the tasks we illustrate goes significantly beyond prior work. […] We show that our framework can learn very long tasks, sometimes tens of minutes in length, for behaviors that combine both physical dexterity and combinatorial complexity.

### Overview

![Screenshot 2024-11-08 at 11.52.18 AM.png](../../images/notes/Screenshot_2024-11-08_at_11.52.18_AM.png)

> We first assemble a pre-training mixture consisting of a weighted combination of our own dexterous manipulation datasets, collected on 7 different robot configurations for 68 different tasks, and the entire OXE dataset, which contains data from 22 robots.

> The purpose of the pre-training phase is to train a base model that exhibits broad capabilities and generalization, but is not necessarily specialized for high performance on any one task.

The base model can perform basic tasks and follow language commands. It’s then fine-tuned on more complex tasks.
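
The weighted pre-training mixture quoted above can be pictured as sampling from each dataset with a probability that grows sub-linearly with its size, so the largest dataset does not drown out the rest. This is only a sketch of that general idea: the dataset names, the sizes, and the exponent `alpha` are made up for illustration and are not the paper's values.

```python
import random


def mixture_weights(dataset_sizes, alpha=0.5):
    """Return sampling probabilities proportional to size ** alpha.

    alpha=1 samples in proportion to size; alpha<1 up-weights small datasets.
    The exponent is an illustrative assumption, not the paper's setting.
    """
    scaled = {name: size ** alpha for name, size in dataset_sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


def sample_dataset(weights, rng):
    """Pick the dataset that the next training example is drawn from."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical dataset sizes (arbitrary units), for illustration only.
sizes = {"in_house_dexterous": 900, "oxe": 100}
weights = mixture_weights(sizes, alpha=0.5)
rng = random.Random(0)
batch_sources = [sample_dataset(weights, rng) for _ in range(8)]
```

With `alpha=0.5` the 9:1 size ratio becomes a 3:1 sampling ratio, which is the kind of rebalancing a weighted mixture is meant to achieve.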

### The $\pi_0$ Model

They build the VLA on top of a pre-trained VLM. As in standard VLMs, ViT (vision transformer) image encoders embed the robot's camera images into the same token space as the language tokens.

They use flow matching to model the continuous distribution of actions. This gives the model higher precision and a more expressive model of the action distribution, which is very useful for high-frequency dexterous tasks.

They follow Transfusion, where a single transformer is trained with multiple objectives: a flow matching loss on the continuous outputs (here, the actions) and a supervised cross-entropy loss on the discrete tokens.

They also use an MoE-style design with two experts: the first (the VLM backbone) takes in vision and language data, and the second (the action expert) takes in the robot's configuration state and outputs the actions.
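
As I read it, the two-expert split differs from a typical MoE in that routing is fixed by token type rather than learned. A minimal sketch of that routing rule is below. The `Token` class, the expert names, and the string payloads are hypothetical stand-ins, and the real model operates on embeddings inside a transformer rather than on Python objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str     # "image", "text", "state", or "action"
    payload: str  # placeholder for an embedding vector


# Fixed routing table (assumed names): no learned gating network.
EXPERT_FOR_KIND = {
    "image": "vlm_expert",
    "text": "vlm_expert",
    "state": "action_expert",
    "action": "action_expert",
}


def route(tokens):
    """Group tokens by the expert whose weights would process them."""
    groups = {"vlm_expert": [], "action_expert": []}
    for token in tokens:
        groups[EXPERT_FOR_KIND[token.kind]].append(token)
    return groups


sequence = [
    Token("image", "camera_0"),
    Token("text", "fold the shirt"),
    Token("state", "joint_angles"),
    Token("action", "noisy_action_0"),
]
groups = route(sequence)
```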

They pre-train the full architecture on additional robotics data from their own dataset as well as OXE.

They want to model the distribution $p(A_t|o_t)$, where $A_t = [a_t, a_{t+1}, \ldots, a_{t+H-1}]$ is an action chunk of length $H = 50$. The observation $o_t$ consists of the images from multiple RGB cameras, the language command, and the robot's configuration state.
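
To make the notation concrete, here is a small sketch of an observation and of slicing an action chunk out of a recorded trajectory. The `Observation` fields and the `action_chunk` helper are illustrative names of my own. Raising an error near the end of an episode is also an assumption, since these notes don't cover how short chunks are handled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    images: List[str]   # placeholders for the RGB camera frames
    command: str        # the language command
    state: List[float]  # the robot's configuration (e.g. joint angles)


def action_chunk(actions, t, horizon=50):
    """Return A_t = [a_t, ..., a_{t+H-1}] from a list of per-step actions."""
    chunk = actions[t:t + horizon]
    if len(chunk) < horizon:
        # Assumption: refuse short chunks instead of padding them.
        raise ValueError("not enough future actions for a full chunk")
    return chunk


obs = Observation(images=["cam_0", "cam_1"], command="fold the shirt", state=[0.0] * 7)
trajectory = [[float(i)] for i in range(200)]  # toy 1-D actions
chunk = action_chunk(trajectory, t=10, horizon=50)
```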

They feed the action tokens (one per action in the noisy chunk) through the action expert and supervise its outputs with a conditional flow matching loss.
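
The sketch below shows a generic conditional flow matching setup with a straight-line path between noise and the clean actions. It is not the paper's exact formulation, and sign and time conventions in particular may differ. A noisy sample is built by interpolation, the network regresses onto the velocity of that path, and sampling integrates the predicted velocity with Euler steps starting from noise. All function names and the step count are illustrative assumptions.

```python
import random


def interpolate(action, noise, tau):
    """Noisy sample on the straight path: tau=0 is pure noise, tau=1 is the action."""
    return [tau * a + (1.0 - tau) * e for a, e in zip(action, noise)]


def target_velocity(action, noise):
    """Velocity of the straight path, d/dtau of interpolate(), i.e. action - noise."""
    return [a - e for a, e in zip(action, noise)]


def flow_matching_loss(predict_velocity, action, noise, tau):
    """Mean squared error between predicted and target velocity at one tau."""
    noisy = interpolate(action, noise, tau)
    predicted = predict_velocity(noisy, tau)
    target = target_velocity(action, noise)
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def euler_sample(predict_velocity, noise, num_steps=10):
    """Integrate the learned velocity field from tau=0 (noise) to tau=1."""
    x = list(noise)
    delta = 1.0 / num_steps
    for k in range(num_steps):
        v = predict_velocity(x, k * delta)
        x = [xi + delta * vi for xi, vi in zip(x, v)]
    return x


# Toy check with an "oracle" that already knows the true velocity.
rng = random.Random(0)
action = [0.5, -1.0, 2.0]  # a flattened toy action, not real robot data
noise = [rng.gauss(0.0, 1.0) for _ in action]


def oracle(x, tau):
    return target_velocity(action, noise)


loss = flow_matching_loss(oracle, action, noise, tau=0.3)
recovered = euler_sample(oracle, noise, num_steps=10)
```

In a real model `predict_velocity` would be the action expert conditioned on the observation, and the loss would be averaged over sampled actions, noise draws, and values of `tau`.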

### Data Collection and Training Recipe

**1. Pre-training and Post-training**

> The pre-training dataset should cover as many tasks as possible, and within each of those tasks should cover a diversity of behaviors. The post-training dataset should instead cover behaviors that are conducive to effective task execution, which should exhibit a consistent and fluent strategy.

> To learn dexterous and more complex tasks, we also use 903M time steps of data from our own datasets, where 106M steps are from single-arm robots and 797M are from dual-arm robots. This data has 68 tasks, where each task is composed of complex behaviors.

**2. Language and High-Level Policies**

They use a SayCan-like high-level planning framework that breaks complex tasks into simpler language subtasks.
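
A rough sketch of that control loop is below: a high-level policy proposes the next language subtask, and a low-level policy (standing in for $\pi_0$) executes it. Everything here is hypothetical, including the function names, the scripted plan, and the stopping rule. The real high-level policy would be a learned model, not a lookup table.

```python
def run_task(task, high_level_policy, low_level_policy, max_steps=10):
    """Alternate between choosing a subtask and executing it."""
    completed = []
    for _ in range(max_steps):
        subtask = high_level_policy(task, completed)
        if subtask is None:  # the planner signals that the task is done
            break
        low_level_policy(subtask)  # stand-in for the language-conditioned policy
        completed.append(subtask)
    return completed


# Toy scripted planner, for illustration only.
PLAN = {
    "bus the table": [
        "pick up the napkin",
        "throw the napkin into the trash",
    ],
}


def scripted_planner(task, completed):
    steps = PLAN[task]
    return steps[len(completed)] if len(completed) < len(steps) else None


executed = []
history = run_task("bus the table", scripted_planner, executed.append)
```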

**3. Robot System Details**

> Our dexterous manipulation datasets include 7 different robot configurations and 68 tasks.

### Experimental Evaluation

**1. Evaluating the base model**

> In our first set of experiments, we evaluate the model after pre-training on our full mixture, without any post-training, to evaluate how well our base model can perform a variety of tasks.

They test on a variety of in-distribution tasks such as shirt folding, table bussing, grocery bagging, and taking toast out of a toaster.

![Screenshot 2024-11-08 at 12.18.12 PM.png](../../images/notes/Screenshot_2024-11-08_at_12.18.12_PM.png)

> $\pi_0$ attains by far the best results across the board on all the zero-shot tasks, with near perfect success rates on shirt folding and the easier bussing tasks, and large improvements over all baselines.

**2. Following Language Commands**

![Screenshot 2024-11-08 at 12.22.23 PM.png](../../images/notes/Screenshot_2024-11-08_at_12.22.23_PM.png)

> The language accuracy of $\pi_0$ is significantly better than that of $\pi_0$ small. This suggests a significant improvement from the larger pre-trained VLM initialization.

**3. Learning New Dexterous Tasks**

![Screenshot 2024-11-08 at 12.24.32 PM.png](../../images/notes/Screenshot_2024-11-08_at_12.24.32_PM.png)

> In the next set of experiments, we evaluate our model on new tasks that differ significantly from the pre-training data, requiring entirely new behaviors.

**4. Mastering Complex Multi-Stage Tasks**

![Screenshot 2024-11-08 at 12.27.45 PM.png](../../images/notes/Screenshot_2024-11-08_at_12.27.45_PM.png)

> In our final set of experiments, we tackle a range of challenging multi-stage tasks via a combination of fine-tuning and language. For some of these tasks, data is present in pre-training, but fine-tuning is required to attain mastery.

### Discussion

> We presented a framework for training a robot foundation model, which we refer to as $\pi_0$, that consists of pre-training on highly diverse data, followed by either zero-shot evaluation or fine-tuning to complex downstream tasks.

> To our knowledge, this represents the largest pre-training mixture ever used for a robot manipulation model.

Large-scale pre-training supplies most of the knowledge, so subsequent fine-tuning on a specific task yields more robust behavior.

> We hope that our results will serve as a stepping stone toward general and broadly applicable robot foundation models

Open questions remain: it’s unclear what data composition works best, some of their tasks don’t work reliably, and it’s unclear how much positive transfer comes from training on a large number of tasks.
