Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., ... & Zitkovich, B. (2022). [RT-1: Robotics transformer for real-world control at scale](https://robotics-transformer1.github.io/assets/rt1.pdf). arXiv preprint arXiv:2212.06817.

# INTRODUCTION
End-to-end robotic learning, with either imitation or reinforcement, typically involves collecting task-specific data in either single-task (Kalashnikov et al., 2018; Zhang et al., 2018) or multi-task (Kalashnikov et al., 2021b; Jang et al., 2021) settings that are narrowly tailored to the tasks that the robot should perform. This workflow mirrors the classic approach to supervised learning in other domains, such as computer vision and NLP, where task-specific datasets would be collected, labeled, and deployed to solve individual tasks, with little interplay between the tasks themselves. Recent years have seen a transformation in vision, NLP, and other domains, away from siloed, small-scale datasets and models and towards large, general models pre-trained on broad, large datasets. The keys to the success of such models lie with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the knowledge present in large-scale datasets. If a model can "sponge up" experience to learn general patterns in language or perception, then it can bring them to bear on individual tasks more efficiently. While removing the need for large task-specific datasets is appealing generally in supervised learning, it is even more critical in robotics, where datasets might require engineering-heavy autonomous operation or expensive human demonstrations. We therefore ask: can we train a single, capable, large multi-task backbone model on data consisting of a wide variety of robotic tasks? And does such a model enjoy the benefits observed in other domains, exhibiting zero-shot generalization to new tasks, environments, and objects?

Building such models in robotics is not easy. Although recent years have seen several large multi-task robot policies proposed in the literature (Reed et al., 2022; Jang et al., 2021), such models often have limited breadth of real-world tasks, as with Gato (Reed et al., 2022), or focus on training tasks rather than generalization to new tasks, as with recent instruction following methods (Shridhar et al., 2021; 2022), or attain comparatively lower performance on new tasks (Jang et al., 2021).

todo: Figure 1
(a) RT-1 takes images and natural language instructions and outputs discretized base and arm actions. Despite its size (35M parameters), it does this at 3 Hz, due to its efficient yet high-capacity architecture: a FiLM (Perez et al., 2018) conditioned EfficientNet (Tan & Le, 2019), a TokenLearner (Ryoo et al., 2021), and a Transformer (Vaswani et al., 2017).

(b) RT-1's large-scale, real-world training (130k demonstrations) and evaluation (3000 real-world trials) show impressive generalization, robustness, and ability to learn from diverse data.

Figure 1: A high-level overview of RT-1's architecture, dataset, and evaluation.
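
Panel (a) notes that RT-1 outputs discretized base and arm actions. As a rough illustration of what uniform discretization of a continuous action dimension could look like, here is a minimal sketch. The bin count of 256, the value range `[-1, 1]`, and the helper names `discretize`, `undiscretize`, and `tokenize_action` are assumptions made for this example and are not stated in the excerpt above.

```python
from typing import List

NUM_BINS = 256  # assumed bin count; not stated in the excerpt above


def discretize(value: float, low: float, high: float, num_bins: int = NUM_BINS) -> int:
    """Map a continuous value in [low, high] to an integer bin index."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # The upper edge maps to the last bin instead of overflowing.
    return min(int(fraction * num_bins), num_bins - 1)


def undiscretize(index: int, low: float, high: float, num_bins: int = NUM_BINS) -> float:
    """Return the center of a bin, i.e. the value a controller would execute."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


def tokenize_action(action: List[float], low: float, high: float) -> List[int]:
    """Discretize every dimension of an action vector independently."""
    return [discretize(a, low, high) for a in action]


if __name__ == "__main__":
    arm_delta = [-1.0, 0.0, 0.25, 1.0]  # hypothetical normalized arm command
    print(tokenize_action(arm_delta, low=-1.0, high=1.0))  # [0, 128, 160, 255]
```

Treating each action dimension as a small vocabulary of bin indices lets a Transformer predict actions the same way it predicts any other token, at the cost of a quantization error of at most half a bin width.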


The two main challenges lie in assembling the right dataset and designing the right model. While data collection and curation is often the "unsung hero" of many large-scale machine learning projects (Radford et al., 2021; Ramesh et al., 2021), this is especially true in robotics, where datasets are often robot-specific and gathered manually (Dasari et al., 2019; Ebert et al., 2021). As we will show in our evaluations, good generalization requires datasets that combine both scale and breadth, covering a variety of tasks and settings. At the same time, the tasks in the dataset should be sufficiently well-connected to enable generalization, such that the model can discover the patterns between structurally similar tasks and perform new tasks that combine those patterns in novel ways. We utilize a dataset that we gathered over the course of 17 months with a fleet of 13 robots, containing $\sim$130k episodes and over 700 tasks, and we ablate various aspects of this dataset in our evaluation.


The second challenge lies in the design of the model itself. Effective robotic multi-task learning requires a high-capacity model, and Transformer (Vaswani et al., 2017) models excel in this regard, particularly when it is necessary to learn many tasks conditioned, as in our case, on language instructions. However, robotic controllers must also be efficient enough to run in real time, which presents a major challenge for Transformers in particular. We propose a novel architecture that we call RT-1 (Robotics Transformer 1). By encoding high-dimensional inputs and outputs, including camera images, instructions, and motor commands, into compact token representations to be used by the Transformer, RT-1 allows for efficient inference at runtime, making real-time control feasible.
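
To make the idea of "compact token representations" more concrete, the following sketch pools many image tokens into a few using softmax-weighted averaging. It is loosely in the spirit of the TokenLearner module named in Figure 1, but it is a simplified stand-in and not RT-1's implementation. The scoring weights are random placeholders where a real model would use learned parameters, and the sizes (81 tokens in, 8 out, dimension 16) and the name `compress_tokens` are illustrative assumptions.

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_tokens(tokens: List[Vector], score_weights: List[Vector]) -> List[Vector]:
    """Pool N input tokens into K output tokens (K = len(score_weights)).

    Each row of score_weights scores every input token with a dot product;
    a softmax over the N inputs turns those scores into pooling weights.
    """
    dim = len(tokens[0])
    outputs = []
    for w in score_weights:
        scores = [sum(wi * ti for wi, ti in zip(w, tok)) for tok in tokens]
        attn = softmax(scores)
        pooled = [sum(a * tok[d] for a, tok in zip(attn, tokens)) for d in range(dim)]
        outputs.append(pooled)
    return outputs


if __name__ == "__main__":
    rng = random.Random(0)
    n_in, n_out, dim = 81, 8, 16  # hypothetical sizes
    image_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_in)]
    # Random stand-in for parameters that would be learned in a real model.
    weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_out)]
    compact = compress_tokens(image_tokens, weights)
    print(len(image_tokens), "->", len(compact), "tokens of dim", len(compact[0]))
```

Because the cost of self-attention grows quadratically with sequence length, shrinking the number of tokens before they reach the Transformer is one plausible way to keep per-step inference cheap enough for real-time control.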


Our contribution is the RT-1 model and experiments with this model on a large and broad dataset of real-world robotic tasks. Our experiments not only demonstrate that RT-1 can exhibit significantly improved generalization and robustness compared to prior techniques, but also evaluate and ablate many design choices in both the model and the composition of the training set. Our results show that RT-1 can perform over 700 training instructions at a 97% success rate, and can generalize to new tasks, distractors, and backgrounds 25%, 36% and 18% better than the next best baseline, respectively. This level of performance allows us to execute very long-horizon tasks in the SayCan (Ahn et al., 2022) framework, with as many as 50 stages. We further show that RT-1 can incorporate data from simulation or even other robot types, retaining performance on the original tasks and improving generalization to new scenarios. A short overview of RT-1 capabilities is presented in Fig. 1b.

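```python
from pypdf import PdfReader

# Local path to the RT-1 paper PDF; adjust for your machine.
pdf_path = "/mnt/d/github/paper/具身机器人/RT_1_Robotics_transformer_for_real_world_control_at_scale.pdf"

reader = PdfReader(pdf_path)
number_of_pages = len(reader.pages)

# reader.pages is zero-indexed, so index 2 is the third page of the PDF.
page = reader.pages[2]
text = page.extract_text()
print(text)
```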

Preprint
2 RELATED WORK
A number of recent works have proposed Transformer-based policies for robotic control. As in
RT-1, several works use language commands processed with Transformers as a robust framework
for specifying and generalizing to new tasks (Zhang & Chai, 2021; Pashevich et al., 2021; Silva
et al., 2021; Jang et al., 2021; Ahn et al., 2022; Nair et al., 2022). Our work takes the application
of Transformers a step further and treats the mapping of language and vision observations to robot
actions as a sequence modelling problem, using a Transformer to learn this mapping. This idea
is directly inspired by successes in game-playing (Chen et al., 2021; Lee et al., 2022a) as well
as simulated robot navigation (Fang et al., 2019), locomotion (Janner et al., 2021; Gupta et al.,
2022), and manipulation (Jiang et al., 2022) environments. We note that several of these works go
beyond only text conditioning and use Transformers to also generalize across robot morphologies
(e.g., Gup