A comprehensive framework for evaluating Vision-Language Models on sequential driving scenarios
Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities.
VENUSS systematically assesses VLM performance on sequential driving scenes and establishes baselines for future research. Building on existing datasets, VENUSS extracts temporal sequences from driving videos and generates structured evaluation questions across custom task categories.
Paper: How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study
By comparing 25+ VLMs across 2,600+ scenarios, our analysis reveals:
- Top models achieve only 57% accuracy on driving scenarios
- Human performance reaches 65% (GIF mode), exposing significant capability gaps
- VLMs excel at static object detection but struggle to understand vehicle dynamics and temporal relations
- 48.2% improvement potential through systematic configuration optimization
- 25+ supported models: OpenAI, Anthropic, Google, Qwen, HuggingFace
- 2,600+ evaluation scenarios across multiple driving contexts
- Sequential scene analysis with temporal understanding assessment
- Configuration impact analysis (grid layouts, resolutions, time intervals, presentation modes)
- Human baseline comparison: Establishes performance benchmarks
- Fine-grained evaluation: Custom categories for driving-specific tasks
- Statistical analysis: Performance metrics and capability gap identification
- Configuration optimization: Systematic testing of input arrangements
- Multiple dataset support: Framework validated across 4 driving datasets (CoVLA, Honda Scenes, Waymo, NuScenes)
- Consistent methodology: Same evaluation protocol across datasets
- Cross-dataset analysis: Performance comparison capabilities
```bash
# Clone the repository
git clone https://github.com/V3NU55/VENUSS.git
cd VENUSS

# Install VLM evaluation dependencies
pip install -r evaluation/requirements.txt

# Install human evaluation dependencies
pip install -r human_evaluation/requirements.txt

# Configure dataset paths
nano config/datasets.py

# Add API keys for VLM evaluation
nano evaluation/config/.env
```
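The `.env` file holds provider API keys. A minimal example might look like the following; the variable names are assumptions for illustration, so check `evaluation/config/` for the exact keys the framework reads:

```shell
# evaluation/config/.env -- illustrative variable names, values are placeholders
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
```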
```bash
# Check available models (25+ supported)
python evaluation/main.py --available-models

# Quick validation test
python evaluation/main.py --model gpt-4o-mini --dataset covla --phase baseline-cross-dataset

# Full evaluation run
python evaluation/main.py --model gpt-4o --dataset covla --phase comprehensive-evaluation

# Evaluate on Waymo or NuScenes
python evaluation/main.py --model gpt-4o --dataset waymo --phase baseline-cross-dataset
python evaluation/main.py --model gpt-4o --dataset nuscenes --phase baseline-cross-dataset

# Generate analysis report
python evaluation/analysis.py --results-dir evaluation/results --output-dir analysis_output
```
```bash
# Navigate to the human evaluation directory
cd human_evaluation

# Run the web application
python app.py

# Access the interface at http://localhost:5000
```

After collecting data, run the analysis scripts:

```bash
# Generate time-based statistics for a dataset (e.g., covla)
python analysis/analyze_evaluation_times.py --dataset covla

# Generate a summary report from the time analysis
python analysis/generate_summary_report.py --dataset covla

# Generate accuracy analysis
python analysis/analyze_human_accuracy.py --dataset covla
```
```bash
# Model comparison study
python evaluation/main.py --model gpt-4o --dataset covla --phase baseline-cross-dataset
python evaluation/main.py --model claude-3-5-sonnet --dataset covla --phase baseline-cross-dataset
python evaluation/main.py --model gemini-2.0-flash-exp --dataset covla --phase baseline-cross-dataset

# Configuration optimization
python evaluation/main.py --model gpt-4o --dataset covla --phase resolution-optimization
python evaluation/main.py --model gpt-4o --dataset covla --phase grid-configurations
python evaluation/main.py --model gpt-4o --dataset covla --phase temporal-intervals
```

The VENUSS pipeline is designed for extensibility and currently supports four datasets (CoVLA, Honda Scenes, Waymo, NuScenes). To integrate a new dataset, you only need to modify three files:
| File | Purpose |
|---|---|
| `config/datasets.py` | Define paths and settings for the new dataset |
| `dataset_creation_scripts/scenario_analysis.py` | Add a custom parser for the dataset's annotation format |
| `scripts/generate_annotations.py` | Define evaluation questions and answer key generation logic |
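As a sketch of the first step, a new entry in `config/datasets.py` might look like the following. The `DatasetConfig` fields and the `DATASETS` registry are illustrative assumptions, not the framework's actual schema:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class DatasetConfig:
    """Illustrative dataset entry; field names are assumptions."""
    name: str
    video_dir: Path
    annotation_file: Path
    frame_rate: int = 10  # frames per second in the source videos

# Hypothetical registry keyed by the --dataset CLI value
DATASETS = {
    "covla": DatasetConfig(
        name="covla",
        video_dir=Path("datasets/covla/videos"),
        annotation_file=Path("datasets/covla/annotations.json"),
    ),
}

def get_dataset(key: str) -> DatasetConfig:
    """Look up a dataset by its CLI name, failing loudly if unknown."""
    try:
        return DATASETS[key]
    except KeyError:
        raise ValueError(f"Unknown dataset {key!r}; known: {sorted(DATASETS)}")
```

Registering a new dataset would then mean adding one more `DatasetConfig` entry keyed by the name passed to `--dataset`.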
- Scenario Extraction: Extract driving scenarios from video datasets
- Temporal Analysis: Analyze sequential segments for optimal evaluation intervals
- Ground Truth Generation: Create structured evaluation questions and answer keys
- Visual Material Creation: Generate frame sequences and collages for VLM input
- Human Baseline Collection: Establish human performance benchmarks
- VLM Evaluation: Systematic assessment of 25+ models with statistical analysis
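The six stages above can be sketched as a linear pipeline. Every function below is a placeholder standing in for the corresponding VENUSS stage, not its real implementation:

```python
# Skeleton of the six-stage pipeline; each function is a placeholder
# for the corresponding VENUSS stage, not the actual implementation.

def extract_scenarios(videos):
    """1. Scenario Extraction: one scenario record per source clip."""
    return [{"video": v} for v in videos]

def pick_intervals(scenarios, interval_ms=1000):
    """2. Temporal Analysis: attach a sampling interval to each scenario."""
    return [{**s, "interval_ms": interval_ms} for s in scenarios]

def make_ground_truth(scenarios):
    """3. Ground Truth Generation: attach question/answer placeholders."""
    return [{**s, "question": "What happens next?", "answer": None} for s in scenarios]

def render_visuals(scenarios):
    """4. Visual Material Creation: note which presentation modes to render."""
    return [{**s, "modes": ["collage", "sequential"]} for s in scenarios]

def run_pipeline(videos):
    """Stages 5-6 (human baseline, VLM evaluation) consume this output."""
    return render_visuals(make_ground_truth(pick_intervals(extract_scenarios(videos))))
```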
| Phase | Purpose | Scenarios | Focus |
|---|---|---|---|
| baseline-cross-dataset | Quick validation | 4 | Framework verification |
| resolution-optimization | Optimal image quality | 90 | Visual input optimization |
| grid-configurations | Layout impact | 96 | Spatial arrangement effects |
| temporal-intervals | Time spacing effects | 60 | Temporal understanding |
| comprehensive-evaluation | Complete assessment | 100 | Full capability analysis |
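To sweep all five phases for one model, a simple shell loop over the phase names in the table works. It is shown here as a dry run with `echo`; drop the `echo` to actually execute each command:

```shell
# Dry-run sweep over the five evaluation phases for one model.
# Remove the leading "echo" to execute the commands for real.
MODEL=gpt-4o
DATASET=covla
for PHASE in baseline-cross-dataset resolution-optimization \
    grid-configurations temporal-intervals comprehensive-evaluation
do
  echo python evaluation/main.py --model "$MODEL" --dataset "$DATASET" --phase "$PHASE"
done
```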
- OpenAI: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-4-vision-preview
- Anthropic: claude-3-5-sonnet, claude-3-opus, claude-3-haiku
- Google: gemini-2.0-flash-exp, gemini-1.5-pro, gemini-1.5-flash
- Qwen: qwen-vl-plus, qwen-vl-max
- HuggingFace: llava-1.5-7b, llava-1.5-13b, and more
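New backends plug in under `evaluation/llm_interfaces/`. The minimal contract an adapter might satisfy can be sketched like this; the `ask` signature and the `EchoVLM` stub are assumptions for illustration, not the repository's real base class:

```python
from typing import List, Protocol

class VLMInterface(Protocol):
    """Assumed adapter contract; the repo's actual interface may differ."""
    def ask(self, images: List[bytes], question: str) -> str: ...

class EchoVLM:
    """Offline stub: useful for testing the pipeline without API keys."""
    def ask(self, images: List[bytes], question: str) -> str:
        return f"[{len(images)} frames] {question}"

def evaluate_one(model: VLMInterface, images: List[bytes], question: str) -> str:
    """Run a single question through any adapter satisfying the protocol."""
    return model.ask(images, question)
```

Because the protocol is structural, any provider wrapper that exposes the same `ask` method slots in without changes to the evaluation loop.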
- Capability Gaps: Up to 48.2% performance difference based on input configuration
- Temporal Understanding: VLMs struggle with sequential scene relationships
- Static vs Dynamic: Strong object detection, weak motion understanding
- Human Comparison: VLMs approaching but not exceeding human performance
- Grid Layout: Horizontal layouts outperform square grids
- Resolution: 720p provides optimal balance
- Temporal Spacing: 1000ms intervals achieve best performance
- Presentation Mode: Collage outperforms sequential by 6%
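Taken together, these findings imply a default input configuration. The dictionary below restates them as an illustrative settings mapping (key names are assumptions, not the framework's schema), plus a small helper showing how the frame count follows from the sampling interval:

```python
# Best-performing input configuration reported above, expressed as an
# illustrative settings mapping (key names are assumptions).
BEST_CONFIG = {
    "grid_layout": "horizontal",  # horizontal strips beat square grids
    "resolution": "720p",         # best quality/cost balance
    "interval_ms": 1000,          # 1000 ms frame spacing performed best
    "presentation": "collage",    # collage beat sequential by ~6%
}

def frames_needed(duration_s: float, interval_ms: int) -> int:
    """Frames sampled from a clip at the given spacing, endpoints included."""
    return int(duration_s * 1000 // interval_ms) + 1

# A 5-second clip at 1000 ms spacing yields 6 frames (t = 0s .. 5s inclusive).
```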
```
VENUSS/
├── evaluation/               # Core VLM evaluation system
│   ├── main.py               # Primary evaluation interface
│   ├── evaluator.py          # Evaluation engine
│   ├── analysis.py           # Results analysis
│   ├── config/               # Model configs and API keys
│   ├── llm_interfaces/       # VLM API implementations
│   └── results/              # Generated evaluation data
├── human_evaluation/         # Web interface for human baseline collection
│   ├── app.py                # Main Flask application for the interface
│   ├── analysis/             # Scripts for analyzing human performance data
│   └── results/              # Raw CSV data and analysis outputs
├── config/                   # Dataset configuration
├── dataset_creation_scripts/ # Data preparation pipeline
├── scripts/                  # High-level utilities
├── utils/                    # Core framework utilities
└── datasets/                 # Generated evaluation datasets
    ├── covla/                # CoVLA dataset outputs
    ├── hsd/                  # Honda Scenes dataset outputs
    ├── waymo/                # Waymo Open Dataset outputs
    └── nuscenes/             # NuScenes dataset outputs
```
- config/README.md: Configuration system guide
- evaluation/README.md: Detailed VLM evaluation documentation