**Abstract**

Motion forecasting for autonomous driving is a challenging task because complex driving scenarios result in a heterogeneous mix of static and dynamic inputs. 
It is an open problem how best to represent and fuse information about road geometry, lane connectivity, time-varying traffic light state, and history of a dynamic set of agents and their interactions into an effective encoding.
To model this diverse set of input features, many approaches design an equally complex system with a diverse set of modality-specific modules.
This results in systems that are difficult to scale, extend, or tune in rigorous ways to trade off quality and efficiency.

In this paper, we present Wayformer, a family of attention-based architectures for motion forecasting that are simple and homogeneous.
Wayformer offers a compact model description consisting of an attention-based scene encoder and a decoder.
In the scene encoder we study the choice of early, late and hierarchical fusion of input modalities. 
For each fusion type we explore strategies to trade off efficiency and quality via factorized attention or latent query attention. 
We show that early fusion, despite its simplicity of construction, is not only modality agnostic but also achieves state-of-the-art results on both the Waymo Open Motion Dataset (WOMD) and Argoverse leaderboards, demonstrating the effectiveness of our design philosophy.

# Introduction
In this work, we focus on the general task of future behavior prediction of agents (pedestrians, vehicles, cyclists) in real-world driving environments. 
This is an essential task for safe and comfortable human-robot interactions, enabling high-impact robotics applications like autonomous driving.

The modeling needed for such scene understanding is challenging for many reasons. 
For one, the output is highly unstructured and multimodal—e.g., a person driving a vehicle could carry out one of many underlying intents unknown to an observer, and representing a distribution over diverse and disjoint possible futures is required. 
A second challenge is that the input consists of a heterogeneous mix of modalities, including agents' past physical state, static road information (e.g. location of lanes and their connectivity), and time-varying traffic light information.

Many previous efforts address how to model the multimodal output [1, 2, 3, 4, 5, 6], and develop hand-engineered architectures to fuse different input types, each requiring their own preprocessing (e.g., image rasterization [7, 2, 8]). 
Here, we focus on the multimodality of the input space, and develop a simple yet effective modality-agnostic framework that avoids complex and heterogeneous architectures, and leads to a simpler architecture parameterization. 
This compact description of a family of architectures results in a simpler design space and allows us to more directly and effectively control for trade-offs in model quality and latency by tuning model computation and capacity.

To keep complexity under control without sacrificing quality or efficiency, we need to find general modeling primitives, which can handle multimodal features that exist in temporal and spatial dimensions concurrently. 
Recently, several approaches have proposed Transformer networks as the networks of choice for motion forecasting problems [9, 10, 11, 12, 13].
While these approaches offer simplified model architectures, they still require domain expertise and excessive modality-specific tuning. [14] proposed a stack of cross-attention layers sequentially processing one modality at a time.
The order in which to process each modality is left to the designer and enumerating all possibilities is combinatorially prohibitive. 
[3] proposed using separate encoders for each modality, where the type of network and its capacity is open for tuning on a per-modality basis. 
Then modalities' embeddings are flattened and one single vector is fed to the predictor. 
While these approaches allow for many degrees of freedom, they increase the search space significantly. 
Without efficient network architecture search or significant human input and hand engineering, the chosen models will likely be sub-optimal, given that only a limited number of the modeling options have been explored.

Our experiments suggest the domain of motion forecasting conforms to Occam's Razor. 
We show state-of-the-art results with the simplest design choices and minimal domain-specific assumptions, which is in stark contrast to previous work.
When tested in simulation and on real AVs, these Wayformer models showed good understanding of the scene.

Our contributions can be summarized as follows:
• We design a family of models with two basic primitives: a self-attention encoder, where we fuse one or more modalities across temporal and spatial dimensions, and a cross-attention decoder, where we attend to driving scene elements to produce a diverse set of trajectories.
• We study three variations of the scene encoder that differ in how and when different input modalities are fused.
• To keep our proposed models within the practical real-time constraints of motion forecasting, we study two common techniques to speed up self-attention: factorized attention and latent query attention.
• We achieve state-of-the-art results on both WOMD and Argoverse challenges.

# Multimodal Scene Understanding
Driving scenarios consist of multimodal data, such as road information, traffic light state, agent history, and agent interactions. 
In this section we detail the representation of these modalities in our setup. 
For readability, we define the following symbols: $A$ denotes the number of modeled ego-agents, $T$ denotes the number of past and current timesteps considered in the history, and $D_m$ denotes the feature size of modality $m$.
For a modality $m$, we might also have a fourth dimension ($S_m$) representing a "set of contextual objects" (e.g., representations of other road users) for each modeled agent.

**Agent History** The agent history tensor $[A, T, 1, D_h]$ contains a sequence of past agent states along with the current state.
For each timestep $t \in T$, we consider features that define the state of the agent, e.g., x, y, velocity, acceleration, and bounding box.
We include a context dimension $S_h = 1$ for homogeneity.

**Agent Interactions** The interaction tensor $[A, T, S_i, D_i]$ represents the relationship between agents. 
For each modeled agent $a \in A$, a fixed number of the closest context agents $c_i \in S_i$ around the modeled agent are considered. 
These context agents represent the agents which influence the behavior of our modeled agent. 
The features in $D_i$ represent the physical state of each context agent (as in $D_h$ above), but transformed into the frame of reference of our ego-agent.
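
As an illustration of selecting a fixed number of closest context agents, the following is a minimal NumPy sketch. It assumes Euclidean distance on current (x, y) positions; the function name `closest_context_agents` and the example coordinates are hypothetical and not part of our implementation.

```python
import numpy as np

def closest_context_agents(ego_xy, other_xy, num_context):
    """Return indices of the `num_context` agents nearest to the ego-agent.

    Assumption: "closest" means smallest Euclidean distance in the x-y plane.
    """
    dists = np.linalg.norm(other_xy - ego_xy, axis=-1)  # [N]
    return np.argsort(dists)[:num_context]

# Illustrative positions only.
ego_xy = np.array([0.0, 0.0])
other_xy = np.array([[5.0, 0.0], [1.0, 1.0], [-2.0, 0.0], [10.0, 10.0]])
idx = closest_context_agents(ego_xy, other_xy, num_context=2)
```

With these positions, the two nearest agents are those at indices 1 and 2.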

**Roadgraph** The roadgraph $[A, 1, S_r, D_r]$ contains road features around the agent. 
Following [2], we represent roadgraph segments as polylines, approximating the road shape with collections of line segments specified by their endpoints and annotated with type information. 
We use $S_r$ roadgraph segments closest to the modeled agent. 
Note that there is no time dimension for the road features, but we include a time dimension of 1 for homogeneity with the other modalities.
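
To make the polyline representation concrete, the sketch below converts a list of points into line segments given by their endpoints plus a type annotation. The feature layout `[x0, y0, x1, y1, type]` and the scalar type id are assumptions made for illustration; the actual feature set $D_r$ may differ.

```python
import numpy as np

def polyline_to_segments(points_xy, type_id):
    """Turn an N-point polyline into N-1 segments [x0, y0, x1, y1, type].

    Assumption: the type annotation is a single scalar id per segment.
    """
    points_xy = np.asarray(points_xy, dtype=float)
    starts, ends = points_xy[:-1], points_xy[1:]
    types = np.full((len(starts), 1), float(type_id))
    return np.concatenate([starts, ends, types], axis=-1)

# A hypothetical lane centerline sampled at four points.
lane = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.5), (3.0, 1.5)]
segments = polyline_to_segments(lane, type_id=1)  # shape [3, 5]
```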

**Traffic Light State** For each agent $a \in A$, traffic light information $[A, T, S_{tls}, D_{tls}]$ contains the states of the traffic signals that are closest to that agent. 
Each traffic signal point $tls \in S_{tls}$ has features $D_{tls}$ describing the position and confidence of the signal.
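
The four input tensors described in this section can be summarized with placeholder arrays. All sizes below are made-up values chosen for illustration, not the settings used in our experiments.

```python
import numpy as np

# Hypothetical sizes, for illustration only.
A, T = 2, 11                        # modeled agents, history timesteps
S_i, S_r, S_tls = 8, 64, 4          # context agents, road segments, signals
D_h, D_i, D_r, D_tls = 7, 7, 6, 5   # per-modality feature sizes

inputs = {
    "agent_history": np.zeros((A, T, 1, D_h)),          # S_h = 1
    "agent_interactions": np.zeros((A, T, S_i, D_i)),
    "roadgraph": np.zeros((A, 1, S_r, D_r)),            # time dimension of 1
    "traffic_light_state": np.zeros((A, T, S_tls, D_tls)),
}

for name, x in inputs.items():
    print(name, x.shape)
```

Every modality shares the same four-axis layout $[A, T, S_m, D_m]$, which is what allows a single projection and encoder design to be applied uniformly.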

# Wayformer
We design the family of Wayformer models to consist of two main components: a Scene Encoder and a Decoder. 
The scene encoder is mainly composed of one or more attention encoders that summarize the driving scene. 
The decoder is a stack of one or more standard Transformer cross-attention blocks, in which learned initial queries are fed in and then cross-attended with the scene encoding to produce trajectories.
Figure 1 shows the Wayformer model processing multimodal inputs to produce a scene encoding.
This scene encoding serves as the context for the decoder to generate $k$ possible trajectories covering the multimodality of the output space.
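
The decoder idea, learned queries cross-attending to the scene encoding, can be sketched as follows. This is a simplified single-head, single-block illustration without feed-forward layers or normalization. The linear output head that maps each query to (x, y) waypoints is a hypothetical stand-in, and all sizes and random weights are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, Wq, Wk, Wv):
    """Single-head cross-attention with a residual connection.

    queries: [A, k, D], context: [A, L, D] -> [A, k, D]
    """
    q, key, val = queries @ Wq, context @ Wk, context @ Wv
    scores = q @ key.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # [A, k, L]
    return queries + softmax(scores) @ val

rng = np.random.default_rng(0)
A, L, D = 2, 32, 16      # agents, scene tokens, hidden size (illustrative)
k, horizon = 6, 80       # trajectories per agent, future steps (illustrative)

scene_encoding = rng.normal(size=(A, L, D))   # stand-in for the encoder output
learned_queries = rng.normal(size=(k, D))     # would be trained parameters
Wq, Wk, Wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

queries = np.broadcast_to(learned_queries, (A, k, D))
decoded = cross_attention(queries, scene_encoding, Wq, Wk, Wv)  # [A, k, D]

# Hypothetical output head: a linear map to (x, y) per future step.
W_out = rng.normal(size=(D, horizon * 2)) * 0.1
trajectories = (decoded @ W_out).reshape(A, k, horizon, 2)
```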

**Frame of Reference** As our model is trained to produce futures for a single agent, we transform the scene into an ego-centric frame of reference by centering and rotating the scene's spatial features around the ego-agent's position and heading at the current time step.
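
A minimal sketch of such an ego-centric transform for 2D points is given below. It assumes the ego-agent's heading is an angle in radians measured from the global x-axis and that the ego frame has the heading along +x; the helper name `to_ego_frame` is hypothetical.

```python
import numpy as np

def to_ego_frame(points_xy, ego_xy, ego_heading):
    """Center points on the ego-agent, then rotate so its heading is +x.

    Assumption: `ego_heading` is in radians, counter-clockwise from global +x.
    """
    c, s = np.cos(ego_heading), np.sin(ego_heading)
    rot = np.array([[c, s], [-s, c]])        # rotation by -ego_heading
    return (points_xy - ego_xy) @ rot.T

# Ego at (1, 1) facing global +y; one point 2 m ahead, one 1 m to the left.
points = np.array([[1.0, 3.0], [0.0, 1.0]])
ego_points = to_ego_frame(points, np.array([1.0, 1.0]), np.pi / 2)
```

In the ego frame, the point ahead lands on the positive x-axis and the point to the left lands on the positive y-axis.
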

**Projection Layers** Different input modalities may not share the same number of features, so we project them to a common dimension $D$ before concatenating all modalities along the temporal and spatial dimensions $[S, T]$. We found the simple transformation $\text{Projection}(x_i) = \text{relu}(W x_i + b)$, where $x_i \in \mathbb{R}^{D_m}$, $b \in \mathbb{R}^{D}$, and $W \in \mathbb{R}^{D \times D_m}$, to be sufficient. Concretely, given an input of shape $[A, T, S_m, D_m]$, we project its last dimension, producing a tensor of size $[A, T, S_m, D]$.
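
The projection can be written in a few lines of NumPy. The sizes and random weights below are placeholders for illustration; in practice $W$ and $b$ are learned.

```python
import numpy as np

def project(x, W, b):
    """relu(W x + b) on the last axis: [A, T, S_m, D_m] -> [A, T, S_m, D]."""
    return np.maximum(x @ W.T + b, 0.0)

rng = np.random.default_rng(0)
A, T, S_m, D_m, D = 2, 11, 8, 7, 16       # illustrative sizes
x = rng.normal(size=(A, T, S_m, D_m))     # one modality's raw features
W = rng.normal(size=(D, D_m)) * 0.1       # placeholder for learned weights
b = np.zeros(D)                           # placeholder for learned bias

y = project(x, W, b)                      # shape [A, T, S_m, D]
```
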

**Positional Embeddings** Self-attention is naturally permutation equivariant; therefore, we may think of self-attention encoders as set encoders rather than sequence encoders. However, for modalities where the data does follow a specific ordering, for example agent state across different time steps, it is beneficial to break permutation equivariance and utilize the sequence information. This is commonly done through positional embeddings. For simplicity, we add learned positional embeddings for all modalities. As not all modalities are ordered, the learned positional embeddings are initially set to zero, letting the model learn whether it is necessary to utilize the ordering within a modality.
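
The effect of zero initialization can be seen in a short sketch: at the start of training the embeddings leave the features unchanged. The embedding shape `[T, S_m, D]` (one vector per timestep and element, shared across agents) is an assumption made for illustration.

```python
import numpy as np

A, T, S_m, D = 2, 11, 8, 16               # illustrative sizes
rng = np.random.default_rng(0)
x = rng.normal(size=(A, T, S_m, D))       # projected modality features

# Assumed layout: one learned vector per (timestep, element) position.
# Zero initialization keeps the encoder permutation equivariant at first.
pos_emb = np.zeros((T, S_m, D))

x_pos = x + pos_emb                       # broadcasts over the agent axis
tokens = x_pos.reshape(A, T * S_m, D)     # flatten to [A, T * S_m, D]
```
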

## Fusion
Once projections and positional embeddings are applied to different modalities, the scene encoder combines the information from all modalities to generate a representation of the environment. Concretely, we aim to learn a scene representation $Z = \text{Encoder}(\{m_0, m_1, \ldots, m_k\})$, where $m_i \in \mathbb{R}^{A \times (T \times S_m) \times D}$, $Z \in \mathbb{R}^{A \times L \times D}$, and $L$ is a hyperparameter.

However, the diversity of input sources makes this integration a non-trivial task. Modalities might not be represented at the same abstraction level or scale: {pixels vs. objects}. Therefore, some modalities might require more computation than the others. Splitting compute and parameter count among modalities is application specific and non-trivial to hand-engineer. We attempt to simplify the process by proposing three levels of fusion: {Late, Early, Hierarchical}.

**Late Fusion** This is the most common approach used by motion forecasting models, where each modality has its own dedicated encoder (see Figure 2). We set the width of these encoders to be equal to avoid introducing extra projection layers to their outputs. Moreover, we share the same depth across all encoders to narrow down the exploration space to a manageable scope. Transfer of information across modalities is allowed only in the cross-attention layers of the trajectory decoder.

**Early Fusion** Instead of dedicating a self-attention encoder to each modality, early fusion reduces modality-specific parameters to only the projection layers (see Figure 2). In this paradigm, the scene encoder consists of a single self-attention encoder ("Cross-Modal Encoder"), giving the network maximum flexibility in assigning importance across modalities with minimal inductive bias.

**Hierarchical Fusion** As a compromise between the two previous extremes, capacity is split between modality-specific self-attention encoders and the cross-modal encoder in a hierarchical fashion. As done in late fusion, width and depth are common across the attention encoders and the cross-modal encoder. This effectively splits the depth of the scene encoder between modality-specific encoders and the cross-modal encoder (Figure 2).
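
The three fusion levels differ only in how encoder depth is divided between per-modality encoders and the cross-modal encoder. The sketch below illustrates this with a simplified single-head self-attention layer (no feed-forward sublayer, normalization, or latent queries) and random placeholder weights. The even depth split for hierarchical fusion is an assumption, and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def make_layer(rng, D):
    """Random (Wq, Wk, Wv) for one single-head self-attention layer."""
    return [rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3)]

def encoder(x, layers):
    """Stack of residual self-attention layers over tokens x: [A, N, D]."""
    for Wq, Wk, Wv in layers:
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1]))
        x = x + attn @ v
    return x

def scene_encoder(modalities, fusion, rng, depth=2):
    """modalities: list of [A, T * S_m, D] token arrays with a shared D."""
    D = modalities[0].shape[-1]
    if fusion == "late":            # all depth in per-modality encoders
        per_modality, cross_modal = depth, 0
    elif fusion == "early":         # all depth in the cross-modal encoder
        per_modality, cross_modal = 0, depth
    elif fusion == "hierarchical":  # assumed even split of the depth
        per_modality, cross_modal = depth // 2, depth - depth // 2
    else:
        raise ValueError(f"unknown fusion type: {fusion}")
    encoded = [
        encoder(m, [make_layer(rng, D) for _ in range(per_modality)])
        for m in modalities
    ]
    z = np.concatenate(encoded, axis=1)   # join along the token axis
    return encoder(z, [make_layer(rng, D) for _ in range(cross_modal)])

rng = np.random.default_rng(0)
modalities = [rng.normal(size=(2, n, 16)) for n in (11, 64, 8)]
outputs = {
    f: scene_encoder(modalities, f, np.random.default_rng(1))
    for f in ("late", "early", "hierarchical")
}
```

In the late-fusion branch no token ever attends to another modality inside the scene encoder, so cross-modal information transfer is left entirely to the decoder, matching the description above.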