[DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models](https://arxiv.org/abs/2402.12289)


**Abstract**  
<!-- A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy the DriveVLM-Dual on a production vehicle, verifying it is effective in real-world autonomous driving environments. -->
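A primary hurdle for autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system that leverages Vision-Language Models (VLMs) for enhanced scene understanding and planning. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and their heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that combines the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy DriveVLM-Dual on a production vehicle, verifying that it is effective in real-world autonomous driving environments.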

# Introduction

Autonomous driving, with its great promise to revolutionize transportation, has been an active research area over the past two decades. A primary hurdle to a fully autonomous driving system is scene understanding [1], which involves navigating complex, unpredictable scenarios such as adverse weather, intricate road layouts, and unforeseen human behaviors.
    
Existing autonomous driving systems, typically comprising 3D perception, motion prediction, and planning, struggle with these scene understanding challenges. Specifically, 3D perception [2, 3, 4, 5] is limited to detecting and tracking familiar objects, omitting rare objects and their unique attributes; motion prediction [6, 7, 8, 9, 10] and planning [11, 12, 13] focus on trajectory-level actions, often neglecting the decision-level interactions between objects and vehicles.

We introduce DriveVLM, a novel autonomous driving system that targets these scene understanding challenges by capitalizing on recent Vision-Language Models (VLMs) [14, 15, 16, 17], which have demonstrated exceptional prowess in visual comprehension and reasoning. Specifically, DriveVLM contains a Chain-of-Thought (CoT) process with three key modules: scene description, scene analysis, and hierarchical planning. The scene description module linguistically depicts the driving environment and identifies critical objects in the scene; the scene analysis module delves into the characteristics of the critical objects and their influence on the ego vehicle; the hierarchical planning module formulates plans step by step, from meta-actions and decision descriptions to waypoints. These modules correspond respectively to the components of the traditional perception-prediction-planning pipeline; the difference is that they tackle object perception, intention-level prediction, and task-level planning, which were extremely challenging to cope with in the past.

While VLMs excel in visual understanding, they have limitations in spatial grounding and reasoning, and their computational intensity poses challenges for onboard inference speed. Therefore, we further propose DriveVLM-Dual, a hybrid system that combines the strengths of both DriveVLM and traditional systems. DriveVLM-Dual optionally integrates DriveVLM with traditional 3D perception and planning modules, such as 3D object detectors, occupancy networks, and motion planners, enabling the system to achieve 3D grounding and high-frequency planning abilities. This dual system design, akin to the human brain's slow and fast thinking processes, adapts efficiently to varying complexity in driving scenarios.

Meanwhile, we formally define the scene understanding and planning (SUP) task, and propose new evaluation metrics to assess the scene analysis and meta-action planning capabilities of DriveVLM and DriveVLM-Dual. We carry out a comprehensive data mining and annotation pipeline to construct an in-house SUP-AD dataset for the SUP task. Extensive experiments on both the nuScenes dataset and our own dataset demonstrate the superior performance of DriveVLM, particularly in few-shot scenarios. Furthermore, DriveVLM-Dual exceeds state-of-the-art end-to-end motion planning methods. We have also deployed the model on a production vehicle, confirming that DriveVLM-Dual is effective in real-world autonomous driving environments. Additionally, we have included a demo in the supplementary materials.

In summary, the contributions of this paper are three-fold:
1. We introduce DriveVLM, a novel autonomous driving system that leverages VLMs for effective scene understanding and planning. We further introduce DriveVLM-Dual, a hybrid system that incorporates DriveVLM and a traditional autonomous driving pipeline, which achieves improved spatial reasoning and real-time planning capabilities.
2. We present a comprehensive data mining and annotation pipeline to construct a scene understanding and planning dataset (SUP-AD), together with metrics for evaluation.
3. We have successfully deployed the DriveVLM-Dual system on a production vehicle and tested various effective strategies for accelerating VLM deployment in real driving scenarios.

# DriveVLM
## Overview
<!-- The overall pipeline of DriveVLM is illustrated in Figure 1. A sequence of images is processed by a Vision Language Model (VLM) to perform a special chain-of-thought (CoT) [49] reasoning to derive the driving planning results. The architecture of DriveVLM involves a vision transformer encoder [50] and a Large Language Model (LLM). The vision encoder produces image tokens; then an attention-based extractor aligns these tokens with the LLM. The reasoning process can be divided into three modules: scene description (Section 3.2), scene analysis (Section 3.3), and hierarchical planning (Section 3.4). -->
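The overall pipeline of DriveVLM is illustrated in Figure 1. A sequence of images is processed by a Vision-Language Model (VLM), which performs a special chain-of-thought (CoT) [49] reasoning process to derive the driving plan. The architecture of DriveVLM consists of a vision transformer encoder [50] and a Large Language Model (LLM). The vision encoder produces image tokens; an attention-based extractor then aligns these tokens with the LLM. The reasoning process can be divided into three modules: scene description (Section [3.2](#Scene-Description)), scene analysis (Section [3.3](#Scene-Analysis)), and hierarchical planning (Section [3.4](#Hierarchical-Planning)).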

<!-- For real-world deployment, we propose a hybrid system, DriveVLM-Dual, in Section 3.5, which combines DriveVLM and the traditional autonomous driving pipeline, leveraging the strengths of both approaches. -->
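For real-world deployment, we propose a hybrid system, DriveVLM-Dual, in Section [3.5](#DriveVLM-Dual). It combines DriveVLM with the traditional autonomous driving pipeline, leveraging the strengths of both approaches.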

## Scene Description
<!-- The scene description module identifies driving environment description and critical objects.
**Environment Description**. Driving environments, such as weather and road conditions, have a non-negligible impact on driving difficulty. Therefore, the model is first prompted to output a linguistic description Eof the driving environment, including several conditions: $E = { E_\text{weather}, E_\text{time}, E_\text{road}, E_\text{lane} }$,each representing a crucial aspect of the driving environment. The weather component, $E_\text{weather}$ , spans conditions from sunny to snowy, affecting visibility and traction. The time component, $E_\text{time}$, distinguishes between daytime and nighttime, impacting driving strategies due to visibility changes. Road types, $E_\text{road}$, such as urban or highway, introduce different challenges, while lane conditions, $E_\text{lane}$, focus on current lane positioning and possible maneuvers, crucial for safe driving decisions.

**Critical Object Identification**. In addition to environmental conditions, various objects in driving scenarios significantly influence driving behaviors. Unlike traditional autonomous driving perception modules, which detect all objects within a specific range, we solely focus on identifying critical objects that are most likely to influence the current scenario, inspired by human cognitive processes during driving. Each critical object , denoted as $O_c$, contains two attributes: the object category $c$ and its approximate bounding box coordinates $b(x_1, y_1, x_2, y_2)$ on the image. The category and coordinates are mapped to their corresponding language token idin the language modality, enabling seamless integration into the following modules. Moreover, taking advantage of the pre-trained vision encoder, DriveVLM can identify long-tail critical objects that may elude typical 3D object detectors, such as road debris or unusual animals. -->

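The scene description module produces a description of the driving environment and identifies critical objects.

**Environment Description**. Driving environments, such as weather and road conditions, have a non-negligible impact on driving difficulty. Therefore, the model is first prompted to output a linguistic description $E$ of the driving environment, comprising several conditions: $E = \{ E_\text{weather}, E_\text{time}, E_\text{road}, E_\text{lane} \}$, each representing a crucial aspect of the driving environment. The weather component, $E_\text{weather}$, spans conditions from sunny to snowy, affecting visibility and traction. The time component, $E_\text{time}$, distinguishes between daytime and nighttime, which affects driving strategy because visibility changes. Road types, $E_\text{road}$, such as urban or highway, introduce different challenges, while lane conditions, $E_\text{lane}$, focus on the current lane position and possible maneuvers, which are crucial for safe driving decisions.

**Critical Object Identification**. In addition to environmental conditions, various objects in driving scenarios significantly influence driving behavior. Unlike traditional autonomous driving perception modules, which detect all objects within a specific range, we focus solely on identifying the critical objects that are most likely to influence the current scenario, inspired by human cognitive processes during driving. Each critical object, denoted as $O_c$, has two attributes: the object category $c$ and its approximate bounding box coordinates $b(x_1, y_1, x_2, y_2)$ on the image. The category and coordinates are mapped to their corresponding language token ids in the language modality, enabling seamless integration into the following modules. Moreover, by taking advantage of the pre-trained vision encoder, DriveVLM can identify long-tail critical objects that may elude typical 3D object detectors, such as road debris or unusual animals.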
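
The output of this module can be pictured as a small structured record that is then rendered to text. The following is a minimal sketch only: the class names, field names, and the text layout produced by `describe` are illustrative assumptions, not the prompt or token format used by DriveVLM.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class EnvironmentDescription:
    weather: str  # E_weather, e.g. "sunny" or "snowy"
    time: str     # E_time, e.g. "daytime" or "nighttime"
    road: str     # E_road, e.g. "urban" or "highway"
    lane: str     # E_lane, current lane position and possible maneuvers


@dataclass
class CriticalObject:
    category: str                    # object category c
    bbox: Tuple[int, int, int, int]  # approximate image box b(x1, y1, x2, y2)


@dataclass
class SceneDescription:
    environment: EnvironmentDescription
    critical_objects: List[CriticalObject] = field(default_factory=list)


def describe(scene: SceneDescription) -> str:
    """Render the scene description as plain text (hypothetical layout)."""
    env = scene.environment
    lines = [
        f"Weather: {env.weather}",
        f"Time: {env.time}",
        f"Road: {env.road}",
        f"Lane: {env.lane}",
    ]
    for obj in scene.critical_objects:
        x1, y1, x2, y2 = obj.bbox
        lines.append(f"Critical object: {obj.category} at ({x1}, {y1}, {x2}, {y2})")
    return "\n".join(lines)


scene = SceneDescription(
    environment=EnvironmentDescription("rainy", "nighttime", "urban", "rightmost lane"),
    critical_objects=[CriticalObject("road debris", (410, 520, 470, 600))],
)
print(describe(scene))
```

Only the critical objects are listed, which mirrors the module's choice to skip objects that are unlikely to affect the current scenario.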

## Scene Analysis
<!-- In the traditional autonomous driving pipeline, the prediction module typically concentrates on forecasting the future trajectories of objects. The emergence of advanced vision-language models has provided us with the ability to perform a more comprehensive analysis of the current scene. The scene-level analysis summarizes all the critical objects together with the environmental description. This summary gives a comprehensive understanding of the scene, and is fed into the following planning module.

**Critical Object Analysis**. DriveVLM characterizes critical objects in three aspects: static attributes Cs,motion states Cm, and particular behaviors Cb. Static attributes Csdescribe inherent properties of objects, such as a roadside billboard’s visual cues or a truck’s oversized cargo, which are critical in preempting and navigating potential hazards. Motion states Cmdescribe an object’s dynamics over a period, including position, direction, and action—characteristics that are vital in predicting the object’s future trajectory and potential interactions with the ego vehicle. Particular behaviors Cbrefer to special actions or gestures of an object that could directly influence the ego vehicle’s next driving decisions. We do not require the model to analyze all three characteristics for all objects. In practice, only one or two characteristics apply to a critical object. Upon analyzing these characteristics, DriveVLM then predicts the potential influence $I$ of each critical object on the ego vehicle. -->

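In the traditional autonomous driving pipeline, the prediction module typically concentrates on forecasting the future trajectories of objects. The emergence of advanced vision-language models makes a more comprehensive analysis of the current scene possible. The scene-level analysis summarizes all the critical objects together with the environment description. This summary gives a comprehensive understanding of the scene and is fed into the following planning module.

**Critical Object Analysis**. DriveVLM characterizes critical objects in three aspects: static attributes $C_s$, motion states $C_m$, and particular behaviors $C_b$. Static attributes $C_s$ describe inherent properties of objects, such as a roadside billboard's visual cues or a truck's oversized cargo, which are critical for anticipating and avoiding potential hazards. Motion states $C_m$ describe an object's dynamics over a period of time, including position, direction, and action; these characteristics are vital for predicting the object's future trajectory and its potential interactions with the ego vehicle. Particular behaviors $C_b$ refer to special actions or gestures of an object that could directly influence the ego vehicle's next driving decisions. We do not require the model to analyze all three characteristics for every object. In practice, only one or two characteristics apply to a given critical object. After analyzing these characteristics, DriveVLM predicts the potential influence $I$ of each critical object on the ego vehicle.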
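
Because only one or two of the three characteristics usually apply to an object, the analysis is naturally represented with optional fields. The sketch below is illustrative: `CriticalObjectAnalysis`, `summarize`, and the summary layout are hypothetical names and formats chosen for this example, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CriticalObjectAnalysis:
    category: str
    static_attributes: Optional[str] = None    # C_s
    motion_state: Optional[str] = None         # C_m
    particular_behavior: Optional[str] = None  # C_b
    influence: str = ""                        # I, influence on the ego vehicle


def summarize(environment: str, analyses: List[CriticalObjectAnalysis]) -> str:
    """Build a scene-level summary, skipping characteristics that do not apply."""
    lines = [f"Environment: {environment}"]
    for a in analyses:
        parts = []
        if a.static_attributes:
            parts.append(f"static: {a.static_attributes}")
        if a.motion_state:
            parts.append(f"motion: {a.motion_state}")
        if a.particular_behavior:
            parts.append(f"behavior: {a.particular_behavior}")
        lines.append(f"- {a.category} ({'; '.join(parts)}) -> influence: {a.influence}")
    return "\n".join(lines)


summary = summarize(
    "rainy night, urban road",
    [
        CriticalObjectAnalysis(
            category="traffic police",
            particular_behavior="raising a hand to signal stop",
            influence="ego vehicle should stop and wait",
        ),
        CriticalObjectAnalysis(
            category="truck",
            static_attributes="oversized cargo",
            motion_state="moving slowly in the adjacent lane",
            influence="keep a larger lateral distance",
        ),
    ],
)
print(summary)
```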

## Hierarchical Planning
<!-- The scene-level summary is then combined with the route, ego pose and velocity to form a prompt for planning. Finally, DriveVLM progressively generates driving plans, in three stages: meta-actions, decision description, and trajectory waypoints.

**Meta-actions A**. A meta-action, denoted as ai, represents a short-term decision of the driving strategy. These actions fall into 17 categories, including but not limited to acceleration, deceleration, turning left, changing lanes, minor positional adjustments, and waiting. To plan the ego vehicle’s future maneuver over a certain period, we generate a sequence of meta-actions.

**Decision Description D**. Decision description Darticulates the more fine-grained driving strategy the ego vehicle should adopt. It contains three elements: Action A, Subject S, and Duration D. Action pertains to meta actions such as ‘turn’, ‘wait’, or ‘accelerate’. Subject refers to the interacting object, such as a pedestrian, a traffic signal, or a specific lane. Duration indicates the temporal aspect of the action, specifying how long it should be carried out or when it should start.

**Trajectory Waypoints W**. Upon establishing the decision description D, our next phase involves the generation of corresponding trajectory waypoints. These waypoints, denoted by $W = {w_1, w_2, ..., w_n}, \ \ wi = (x_i, y_i)$, depict the vehicle’s path over a certain future period with predetermined intervals ∆t. We map these numerical waypoints into language tokens for auto-regressive generation. -->

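The scene-level summary is then combined with the route, ego pose, and velocity to form a prompt for planning. Finally, DriveVLM progressively generates driving plans in three stages: meta-actions, decision description, and trajectory waypoints.

**Meta-actions $A$**. A meta-action, denoted as $a_i$, represents a short-term decision of the driving strategy. These actions fall into 17 categories, including but not limited to acceleration, deceleration, turning left, changing lanes, minor positional adjustments, and waiting. To plan the ego vehicle's future maneuver over a certain period, we generate a sequence of meta-actions.

**Decision Description $D$**. The decision description $D$ articulates the more fine-grained driving strategy that the ego vehicle should adopt. It contains three elements: Action $\mathcal{A}$, Subject $\mathcal{S}$, and Duration $\mathcal{D}$. Action pertains to meta-actions such as "turn", "wait", or "accelerate". Subject refers to the interacting object, such as a pedestrian, a traffic signal, or a specific lane. Duration indicates the temporal aspect of the action, specifying how long it should be carried out or when it should start.

**Trajectory Waypoints $W$**. Once the decision description $D$ is established, the next phase generates the corresponding trajectory waypoints. These waypoints, denoted by $W = \{w_1, w_2, \dots, w_n\}, \ \ w_i = (x_i, y_i)$, depict the vehicle's path over a certain future period at predetermined intervals $\Delta t$. We map these numerical waypoints into language tokens for auto-regressive generation.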
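
To make the last step concrete, the sketch below serializes waypoints to a text string that a language model could emit token by token, and parses such a string back into numbers. The bracketed text format, the two-decimal precision, and the function names are assumptions for illustration; the text above does not specify the actual tokenization.

```python
import re
from typing import List, Tuple

Waypoint = Tuple[float, float]  # w_i = (x_i, y_i)


def waypoints_to_text(waypoints: List[Waypoint], precision: int = 2) -> str:
    """Serialize W = {w_1, ..., w_n} as text with fixed decimal precision."""
    items = [f"({x:.{precision}f}, {y:.{precision}f})" for x, y in waypoints]
    return "[" + ", ".join(items) + "]"


def text_to_waypoints(text: str) -> List[Waypoint]:
    """Parse generated text back into numeric waypoints."""
    number = r"-?\d+(?:\.\d+)?"
    pairs = re.findall(rf"\(\s*({number})\s*,\s*({number})\s*\)", text)
    return [(float(x), float(y)) for x, y in pairs]


waypoints = [(0.0, 1.5), (0.1, 3.0), (-0.25, 4.5)]
encoded = waypoints_to_text(waypoints)
decoded = text_to_waypoints(encoded)
print(encoded)
```

Fixing the precision keeps the output vocabulary small and makes the round trip lossless for values that are already rounded to that precision.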

## DriveVLM-Dual
<!-- To mitigate the challenges of high latency and imprecise spatial and motion understanding in VLMs, we propose DriveVLM-Dual, a collaboration between DriveVLM and the traditional autonomous driving system. This novel approach involves two key strategies: incorporating 3D perception for critical object analysis, and high-frequency trajectory refinement.
                                                                                            
**Integrating 3D Perception**. We represent objects detected by a 3D detector as $O_{3D} = {c^i_{3D}, b^i_{3D}}$, where $b^i_{3D}$ denotes the i-th bounding box and $c^i_{3D}$ denotes its category. These 3D bounding boxes are then back-projected onto 2D images to derive corresponding 2D bounding boxes $b^i_{2D}$. We conduct IoU matching between these 2D bounding boxes $b^i_{2D}$ and $b^j_c$. $b^j_c$ are the bounding boxes of previously identified critical objects $O_\text{critical} ={c^j_c, b^j_c}$. We classify critical objects that meet a certain approximate IoU threshold and belong to the same category as matched critical objects $O^\text{matched}_c$, defined as
$$ O^\text{matched}_c =\left\{c^j_c, b^j_c\right\}, \ \ \text{if} \ \ c^j_c = c^i_{2D} \ \ \text{and  aIoU}(b^j_c, b^i_{2D}) > \tau, \ \ \text{where aIoU}(b^j_c, b^i_{2D}) = \frac{S_{b^j_c \cap b^i_{2D}}}{S_{b^i_{2D}}}$$
Those critical objects without a corresponding match in the 3D data are noted as $O^\text{unmatched}_c$.  -->

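To mitigate the challenges of high latency and imprecise spatial and motion understanding in VLMs, we propose DriveVLM-Dual, a collaboration between DriveVLM and the traditional autonomous driving system. This approach involves two key strategies: incorporating 3D perception for critical object analysis, and high-frequency trajectory refinement.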

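**Integrating 3D Perception**. We represent objects detected by a 3D detector as $O_{3D} = \{c^i_{3D}, b^i_{3D}\}$, where $b^i_{3D}$ denotes the $i$-th bounding box and $c^i_{3D}$ denotes its category. These 3D bounding boxes are then back-projected onto the 2D images to derive the corresponding 2D bounding boxes $b^i_{2D}$. We conduct IoU matching between these 2D bounding boxes $b^i_{2D}$ and $b^j_c$, the bounding boxes of the previously identified critical objects $O_\text{critical} = \{c^j_c, b^j_c\}$. Critical objects that exceed a certain approximate IoU threshold with a projected box and belong to the same category are classified as matched critical objects $O^\text{matched}_c$, defined as
$$ O^\text{matched}_c =\left\{c^j_c, b^j_c\right\}, \ \ \text{if} \ \ c^j_c = c^i_{2D} \ \ \text{and aIoU}(b^j_c, b^i_{2D}) > \tau, \ \ \text{where aIoU}(b^j_c, b^i_{2D}) = \frac{S_{b^j_c \cap b^i_{2D}}}{S_{b^i_{2D}}}$$
Critical objects without a corresponding match in the 3D data are denoted as $O^\text{unmatched}_c$.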
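
The matching rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: boxes are axis-aligned `(x1, y1, x2, y2)` tuples in image coordinates, the 3D boxes have already been projected to 2D, and the threshold `tau=0.5` is a placeholder value, since no threshold is given in the text above. The function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Labeled = Tuple[str, Box]                # (category, box)


def area(b: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have area 0."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def approx_iou(b_c: Box, b_2d: Box) -> float:
    """aIoU: intersection area divided by the area of the projected box b_2d."""
    inter = (max(b_c[0], b_2d[0]), max(b_c[1], b_2d[1]),
             min(b_c[2], b_2d[2]), min(b_c[3], b_2d[3]))
    denom = area(b_2d)
    return area(inter) / denom if denom > 0 else 0.0


def match_critical_objects(
    critical: List[Labeled], projected: List[Labeled], tau: float = 0.5
) -> Tuple[List[Labeled], List[Labeled]]:
    """Split critical objects into (matched, unmatched) against projected 3D detections."""
    matched, unmatched = [], []
    for c_cat, c_box in critical:
        hit = any(
            c_cat == p_cat and approx_iou(c_box, p_box) > tau
            for p_cat, p_box in projected
        )
        (matched if hit else unmatched).append((c_cat, c_box))
    return matched, unmatched


critical = [("car", (0, 0, 10, 10)), ("cone", (50, 50, 60, 60))]
projected = [("car", (2, 2, 8, 8))]
matched, unmatched = match_critical_objects(critical, projected)
print(matched, unmatched)
```

Note that the denominator is the area of the projected box rather than the union, so a loose VLM box that fully contains a tight projected box still scores 1.0.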

<!-- In the scene analysis module, for $O^\text{matched}_c$, the center coordinates, orientations, and historical trajectories of the corresponding 3D objects are used as language prompts for the model, assisting in object analysis. Conversely, for $O^\text{unmatched}_c$, analysis relies solely on the language tokens derived from the image. This design enables DriveVLM-Dual to understand the locations and motions of critical objects more accurately, enhancing the overall performance. -->

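In the scene analysis module, for $O^\text{matched}_c$, the center coordinates, orientations, and historical trajectories of the corresponding 3D objects are used as language prompts for the model, assisting in object analysis. Conversely, for $O^\text{unmatched}_c$, the analysis relies solely on the language tokens derived from the image. This design enables DriveVLM-Dual to understand the locations and motions of critical objects more accurately, enhancing overall performance.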

<!-- **High-frequency Trajectory Refinement**. To achieve real-time, high-frequency inference capabilities, we integrate it with a conventional planner to form a slow-fast dual system, combining the advanced capabilities of DriveVLM with the efficiency of traditional planning methods. After obtaining a trajectory from DriveVLM at low frequency, denoted as $W_\text{slow}$, we take it as a reference trajectory for a classical planner for high-frequency trajectory refinement. In the case of an optimization-based planner, $W_\text{slow}$ serves as the initial solution for the optimization solver. For a neural network-based planner, $W_\text{slow}$ is used as an input query, combined with additional input features $f$, and then decoded into a new planning trajectory denoted as $W_\text{fast}$. The formulation of this process can be described as:
$$W_\text{fast} = \text{Planner}([W_\text{slow}, f]). \tag{(1)}$$
This refinement step ensures that the trajectory produced by DriveVLM-Dual (1) achieves higher trajectory quality, and (2) meets real-time requirements. In practice, the two branches operate asynchronously in a slow-fast manner, where the planner module in the traditional autonomous driving branch can selectively receive trajectory from the VLM branch as additional input. -->

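**High-frequency Trajectory Refinement**. To achieve real-time, high-frequency inference, we integrate DriveVLM with a conventional planner to form a slow-fast dual system, combining the advanced capabilities of DriveVLM with the efficiency of traditional planning methods. After obtaining a trajectory from DriveVLM at low frequency, denoted as $W_\text{slow}$, we take it as a reference trajectory for a classical planner that performs high-frequency trajectory refinement. In the case of an optimization-based planner, $W_\text{slow}$ serves as the initial solution for the optimization solver. For a neural network-based planner, $W_\text{slow}$ is used as an input query, combined with additional input features $f$, and then decoded into a new planning trajectory denoted as $W_\text{fast}$. This process can be formulated as:
$$W_\text{fast} = \text{Planner}([W_\text{slow}, f]). \tag{1}$$
This refinement step ensures that the trajectory produced by DriveVLM-Dual (1) achieves higher trajectory quality, and (2) meets real-time requirements. In practice, the two branches operate asynchronously in a slow-fast manner, where the planner module in the traditional autonomous driving branch can selectively receive the trajectory from the VLM branch as additional input.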
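
For the optimization-based case, the idea of using $W_\text{slow}$ as the initial solution can be illustrated with a toy refinement. The cost function below (a tracking term plus a first-difference smoothness term), the plain gradient-descent solver, and all parameter values are assumptions chosen to keep the example short; they stand in for whatever planner is used in practice and are not the paper's method.

```python
from typing import List, Tuple

Waypoint = Tuple[float, float]


def refine_trajectory(
    w_slow: List[Waypoint],
    smooth_weight: float = 1.0,
    step: float = 0.05,
    iters: int = 50,
) -> List[Waypoint]:
    """Toy optimization-based refinement that starts from W_slow.

    Minimizes 0.5 * sum_i |w_i - w_slow_i|^2
            + 0.5 * smooth_weight * sum_i |w_{i+1} - w_i|^2
    by gradient descent, keeping the first and last waypoints fixed.
    """
    w = [list(p) for p in w_slow]  # initial solution = W_slow
    n = len(w)
    for _ in range(iters):
        new_w = [p[:] for p in w]
        for i in range(1, n - 1):
            for d in range(2):  # d = 0 is x, d = 1 is y
                track_grad = w[i][d] - w_slow[i][d]
                smooth_grad = smooth_weight * (2 * w[i][d] - w[i - 1][d] - w[i + 1][d])
                new_w[i][d] = w[i][d] - step * (track_grad + smooth_grad)
        w = new_w
    return [tuple(p) for p in w]


# A zig-zag reference trajectory from the slow branch.
w_slow = [(0, 0), (1, 1), (2, 0), (3, 1), (4, 0)]
w_fast = refine_trajectory(w_slow)
print(w_fast)
```

Starting from $W_\text{slow}$ means the solver only has to make local corrections, which is what allows the fast branch to run at a higher frequency than the VLM.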

# Task and Dataset
To fully exploit the potential of DriveVLM and DriveVLM-Dual in handling complex and long-tail driving scenarios, we formally define a task called Scene Understanding for Planning (Section 4.1), together with a set of evaluation metrics (Section 4.2). Furthermore, we propose a data mining and annotation protocol to curate a scene understanding and planning dataset (Section 4.

## Task Definition
The Scene Understanding for Planning task is defined as follows. The input comprises multi-view videos $\mathcal{V}$ from surrounding cameras and optionally 3D perception results $\mathcal{P}$ from a perception module. The output includes the following components:
1. Scene Description $E$: Composed of weather condition $E_\text{weather}$, time $E_\text{time}$, road condition $E_\text{road}$, and lane conditions $E_\text{lane}$.

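```python
from pypdf import PdfReader

# Local copy of the DriveVLM paper (arXiv 2402.12289v5).
pdf_path = "/mnt/d/2402.12289v5.pdf"

reader = PdfReader(pdf_path)
number_of_pages = len(reader.pages)

# Pages are zero-indexed, so index 4 is the fifth page of the PDF.
page = reader.pages[4]
text = page.extract_text()
print(text)
```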
