**Abstract**

Autonomous driving promises transformative improvements to transportation, but building systems capable of safely navigating the unstructured complexity of real-world scenarios remains challenging. 
A critical problem lies in effectively predicting the various potential outcomes that may emerge in response to the vehicle's actions as the world evolves.

To address this challenge, we introduce GAIA-1 ('Generative AI for Autonomy'), a generative world model that leverages video, text, and action inputs to generate realistic driving scenarios while offering fine-grained control over ego-vehicle behavior and scene features. 
Our approach casts world modeling as an unsupervised sequence modeling problem by mapping the inputs to discrete tokens, and predicting the next token in the sequence. 
Emerging properties from our model include learning high-level structures and scene dynamics, contextual awareness, generalization, and understanding of geometry. 
GAIA-1's learned representation captures expectations of future events; combined with its ability to generate realistic samples, this opens new possibilities for innovation in autonomy and enables enhanced and accelerated training of autonomous driving technology.

# Introduction
Predicting future events is a fundamental and critical aspect of autonomous systems. 
Accurate future prediction enables autonomous vehicles to anticipate and plan their actions, enhancing safety and efficiency on the road. 
To achieve this, a robust model of the world is imperative [1], and considerable effort has gone into building such predictive world models for autonomous driving [2, 3, 4, 5, 6].
A world model [7, 8] learns a structured representation and understanding of the environment that can be leveraged for making informed decisions when driving.

However, current approaches have had significant limitations. 
World models have been successfully applied to control tasks both in simulation [9, 10, 11, 12, 13] and in real-world robotics [14, 15].
These methods often rely on labeled data, which is challenging to obtain at scale, and models that work on simulated data may not fully capture the complexities of real-world scenarios. 
Furthermore, due to their low-dimensional representations, these models may struggle to generate highly realistic samples of future events, posing challenges in achieving a high level of fidelity in predictions for complex real-world applications such as autonomous driving.

Meanwhile, progress in image and video generation has harnessed the power of self-supervised learning to learn from large quantities of real-world data and generate remarkably realistic video samples [16, 17, 18].
Yet, a significant challenge persists in this domain: the difficulty of learning a representation that captures the expected future events. 
While such generative models excel at generating visually convincing content, they may fall short in learning representations of the evolving world dynamics that are crucial for precise future predictions and robust decision-making in complex scenarios.

In this work we introduce GAIA-1, a method designed to retain the benefits of both world models and generative video models.
It combines the scalability and realism of generative video models with the ability of world models to learn meaningful representations of how a scene evolves into the future. GAIA-1 works as follows.
First, we partition the model into two components: the world model and the video diffusion decoder. 
The world model reasons about the scene's high-level components and dynamics, while the diffusion model takes on the responsibility of translating latent representations back into high-quality videos with realistic detail.

For the world model, we use vector-quantized representations of video frames to discretize each frame, transforming them into a sequence of tokens. 
Subsequently, we reframe the challenge of predicting the future into predicting the next token in the sequence [10, 19]. 
This approach has been widely employed in recent years to train large language models [20, 21, 22, 23], and it is recognized for its effectiveness in enhancing model performance through the scaling of model size and data. 
It is possible to generate samples within the latent space of the world model through autoregressive generation.

The second component is a multi-task video diffusion decoder that is able to perform high-resolution video rendering as well as temporal upsampling to generate smooth videos from the information autoregressively generated by the world model. 
Similarly to large language models, video diffusion models have demonstrated a clear correlation between scale of training and overall performance, making both components of GAIA-1 suitable for effective compound scaling.

GAIA-1 is designed to be multimodal, allowing video, text and action to be used as prompts to generate diverse and realistic driving scenarios, as demonstrated in Figure 1. 
By training it on a large corpus of real-world UK urban driving data, GAIA-1 learns to understand and disentangle important concepts such as static and dynamic elements, including cars, buses, pedestrians, cyclists, road layouts, buildings, and even traffic lights. 
Further, it provides fine-grained control over both ego-vehicle behavior and other scene features through action and language conditioning.

GAIA-1 demonstrates the ability to manifest the generative rules of the real world. 
Emerging properties such as learning high-level structures, generalization, creativity, and contextual awareness indicate that the model can comprehend and reproduce the rules and behaviors of the world. 
Moreover, GAIA-1 exhibits understanding of 3D geometry, for example, by effectively capturing the intricate interplay of pitch and roll induced by road irregularities such as speed bumps. 
It also produces reactive behaviors in other agents, demonstrating an ability to capture causality in the decision making of road users.
Surprisingly, it shows the capability to successfully extrapolate beyond the training data, for example to driving outside of the boundaries of the road. 
See Section 7 for a comprehensive list of examples.

The power of GAIA-1's learned representations to predict future events, paired with control over both ego-vehicle dynamics and scene elements, is an exciting advance that paves the way for improving embodied intelligence and providing synthetic data to accelerate training and validation. 
World models, such as GAIA-1, are the basis for the ability to predict what might happen next, which is fundamentally important for decision-making in autonomous driving.

# Model
In this section we describe the model architecture of the trainable components of GAIA-1. The general architecture is presented in Figure 2.

## Encoding Video, Text and Action
GAIA-1 can leverage three different input modalities (video, text, action), which are encoded into a shared $d$-dimensional space.

**Image tokens**. Each image frame of a video is represented as discrete tokens.
To achieve this, we use a pre-trained image tokenizer for discretization (for details about the pre-training see Section 2.2).
Formally, let us consider a sequence of $T$ images $(x_1, \dots, x_T)$, where each image $x_t$ in this sequence is discretized into $n = 576$ discrete tokens using the pre-trained image tokenizer.
We obtain a sequence denoted as $(z_1, \dots, z_T)$, where each $z_t = (z_{t,1}, \dots, z_{t,n}) \in \mathbb{R}^n$ corresponds to $n = \frac{H}{D} \times \frac{W}{D}$ discrete tokens.
Here, $H$ and $W$ represent the height and width of the input image, while $D$ denotes the downsampling factor of the image tokenizer.
These discrete tokens are then mapped to a $d$-dimensional space via an embedding layer that is trained alongside the world model.
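
As a quick sanity check on the token count, the relation $n = \frac{H}{D} \times \frac{W}{D}$ can be evaluated directly. The sketch below is illustrative only: the function name `num_image_tokens` is hypothetical, and the resolution $288 \times 512$ with $D = 16$ is an assumed example chosen because it yields $n = 576$; this passage does not itself state $H$, $W$, or $D$.

```python
def num_image_tokens(height: int, width: int, downsample: int) -> int:
    """Number of discrete tokens per frame: (H / D) * (W / D)."""
    if height % downsample or width % downsample:
        raise ValueError("height and width must be divisible by the downsampling factor")
    return (height // downsample) * (width // downsample)


# Assumed example values (not given in this passage).
n = num_image_tokens(height=288, width=512, downsample=16)
print(n)  # 18 * 32 = 576
```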

**Text tokens**. At each time step $t$, we incorporate information from both text and action.
Textual input is encoded using the pre-trained T5-large model [24], resulting in $m = 32$ text tokens per time step.
These tokens are mapped to a $d$-dimensional space through a linear layer that is trained in conjunction with the world model.
This process yields a text representation denoted as $c_t = (c_{t,1}, \dots, c_{t,m}) \in \mathbb{R}^{m \times d}$.

**Action tokens**. For actions, we consider $l = 2$ scalar values (representing speed and curvature).
Each scalar is independently mapped to the $d$-dimensional space via a linear layer that is trained with the world model.
Consequently, the action at time step $t$ is represented as $a_t = (a_{t,1}, \dots, a_{t,l}) \in \mathbb{R}^{l \times d}$.
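
A minimal sketch of the action embedding in plain Python follows. It assumes that each of the $l = 2$ scalars has its own weight vector and bias (the text says only that the scalars are mapped independently), uses a toy $d = 4$, and draws random parameters in place of trained ones. The names `make_linear` and `embed_action` and the input values are hypothetical.

```python
import random


def make_linear(d, rng):
    """Parameters of a 1 -> d linear layer: one weight and one bias per output dim."""
    weights = [rng.gauss(0.0, 1.0) for _ in range(d)]
    biases = [0.0] * d
    return weights, biases


def embed_action(action, layers):
    """Map l scalars to l vectors of size d, using one linear layer per scalar."""
    embedded = []
    for scalar, (weights, biases) in zip(action, layers):
        embedded.append([w * scalar + b for w, b in zip(weights, biases)])
    return embedded


rng = random.Random(0)
d = 4  # toy dimension for illustration
layers = [make_linear(d, rng) for _ in range(2)]  # assumed: separate layers for speed and curvature
a_t = embed_action((8.5, 0.02), layers)  # (speed, curvature), made-up values
print(len(a_t), len(a_t[0]))  # shape l x d
```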

For each time step, the input tokens are interleaved in the following order: text - image - action. 
The final input of the world model is therefore $(c_1, z_1, a_1, \dots, c_T, z_T, a_T)$. 
To encode the position of the input tokens, we use a factorized spatio-temporal positional embedding. 
1) A learnable temporal embedding is shared across all the tokens of a given time step, i.e. there are $T$ temporal embeddings.
2) A learnable spatial embedding indicates the position of a token within a time step, i.e. there are $m + n + l = 610$ spatial embeddings ($m$ text tokens, $n$ image tokens, and $l$ action tokens) of dimension $d = 4096$.
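
The interleaving and the two positional indices can be sketched as follows. This is an illustrative layout only: `token_layout` is a hypothetical helper, $T = 3$ is an arbitrary choice, and the sketch records just the temporal and spatial index of each token, since how the two embeddings are combined is not specified in this passage.

```python
M_TEXT, N_IMAGE, L_ACTION = 32, 576, 2         # m, n, l from the text
TOKENS_PER_STEP = M_TEXT + N_IMAGE + L_ACTION  # 610 spatial positions


def token_layout(num_steps):
    """Return (temporal_idx, spatial_idx, modality) for every input token."""
    step_modalities = (
        ["text"] * M_TEXT + ["image"] * N_IMAGE + ["action"] * L_ACTION
    )
    layout = []
    for t in range(num_steps):
        # All tokens of step t share temporal index t; s is the spatial index.
        for s, modality in enumerate(step_modalities):
            layout.append((t, s, modality))
    return layout


layout = token_layout(num_steps=3)  # T = 3 is an arbitrary illustrative value
print(len(layout))                  # 3 * 610 = 1830
print(layout[0], layout[32], layout[608])
```

Within one time step, spatial indices 0 to 31 address text tokens, 32 to 607 image tokens, and 608 to 609 action tokens, matching the text - image - action order.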

## Image Tokenizer
When modeling discrete input data with a sequence model, there is a trade-off between the sequence length and the vocabulary size. 
The sequence length refers to the number of discrete tokens that are needed to describe the data. 
The vocabulary size corresponds to the number of possible values a single token can take. 
For language, there are two obvious choices for tokens: characters and words. 
When using character-level tokens, the input data has a longer sequence length, and each individual token belongs to a smaller vocabulary, but conveys little meaning. 
When using word-level tokens, the input data has a shorter sequence length, and each token contains a lot of semantics but the vocabulary is extremely large. 
Most language models [25, 26, 24, 21, 27, 22] use byte-pair encoding (or equivalent) as a trade-off between character-level and word-level tokenization.
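
The trade-off can be made concrete with a toy byte-pair-style merge loop: starting from characters, each merge fuses the most frequent adjacent pair, which shortens the sequence while adding a symbol to the vocabulary. This is a simplified sketch for illustration, not the tokenizer of any of the cited models; the helper name `merge_most_frequent_pair` and the example sentence are made up.

```python
from collections import Counter


def merge_most_frequent_pair(tokens):
    """Apply one BPE-style merge: fuse the most frequent adjacent pair."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return tokens, None
    (left, right), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged, left + right


text = "the car slows down because the car ahead slows down"
tokens = list(text)  # character-level starting point
vocab = set(tokens)
history = [(len(tokens), len(vocab))]
for _ in range(10):
    tokens, new_symbol = merge_most_frequent_pair(tokens)
    if new_symbol is None:
        break
    vocab.add(new_symbol)
    history.append((len(tokens), len(vocab)))

# Each row: (sequence length, vocabulary size) after 0, 1, 2, ... merges.
for seq_len, vocab_size in history:
    print(seq_len, vocab_size)
```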

Likewise for video, we would like to reduce the sequence length of the input, while possibly making the vocabulary larger, but with tokens that are more semantically meaningful than raw pixels. 
We do this with a discrete image autoencoder [28]. 
There are two objectives we would like to achieve in this first stage:

1. Compress the information from raw pixels to make the sequence modeling problem tractable. Images contain a lot of redundant and noisy information. We would like to reduce the sequence length needed to describe the input data.
2. Guide the compression towards meaningful representations, such as semantics, instead of high-frequency signals. The resulting input space for the world model will be simpler to compose with, and less dominated by high-frequency signals that can considerably slow down the learning process.
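
To illustrate how a discrete image autoencoder turns continuous features into tokens, the sketch below assumes a standard vector-quantization step in which each encoder feature is replaced by the index of its nearest codebook entry. The codebook, the feature values, and the function name `quantize` are hypothetical; the encoder itself, the real codebook size, and the training losses are outside the scope of this sketch.

```python
def quantize(vectors, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))
        for vec in vectors
    ]


# Hypothetical 2-D codebook with 4 entries (vocabulary size 4).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Hypothetical encoder outputs for a 2 x 2 grid of image patches, flattened.
features = [(0.1, -0.2), (0.9, 0.1), (0.2, 0.8), (1.1, 0.9)]

tokens = quantize(features, codebook)
print(tokens)  # [0, 1, 2, 3]
```

Here the sequence length equals the number of spatial positions produced by the encoder, and the vocabulary size equals the number of codebook entries, which are the two quantities traded off above.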