VENUSS: VLM Evaluation oN Understanding Sequential Scenes

A comprehensive framework for evaluating Vision-Language Models on sequential driving scenarios

Overview

Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities.

VENUSS systematically assesses VLMs' performance on sequential driving scenes and establishes baselines for future research. Building upon existing datasets, VENUSS extracts temporal sequences from driving videos and generates structured evaluations across custom categories.

Paper: How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

Key Findings

By comparing 25+ VLMs across 2,600+ scenarios, our analysis reveals:

  • Top models achieve only 57% accuracy on driving scenarios
  • Human performance reaches 65% (GIF mode), exposing a significant capability gap
  • VLMs excel at static object detection but struggle with vehicle dynamics and temporal relations
  • Up to 48.2% improvement potential through systematic configuration optimization

Framework Capabilities

Systematic VLM Evaluation

  • 25+ supported models: OpenAI, Anthropic, Google, Qwen, HuggingFace
  • 2,600+ evaluation scenarios across multiple driving contexts
  • Sequential scene analysis with temporal understanding assessment
  • Configuration impact analysis (grid layouts, resolutions, time intervals, presentation modes)

Comprehensive Analysis

  • Human baseline comparison: Establishes performance benchmarks
  • Fine-grained evaluation: Custom categories for driving-specific tasks
  • Statistical analysis: Performance metrics and capability gap identification
  • Configuration optimization: Systematic testing of input arrangements

Proven Generalizability

  • Multiple dataset support: Framework validated across 4 driving datasets (CoVLA, Honda Scenes, Waymo, NuScenes)
  • Consistent methodology: Same evaluation protocol across datasets
  • Cross-dataset analysis: Performance comparison capabilities

Quick Start

1. Installation

# Clone the repository
git clone https://github.com/V3NU55/VENUSS.git
cd VENUSS

# Install VLM evaluation dependencies
pip install -r evaluation/requirements.txt

# Install Human evaluation dependencies
pip install -r human_evaluation/requirements.txt

# Configure dataset paths
nano config/datasets.py

# Add API keys for VLM evaluation
nano evaluation/config/.env
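
# The .env file holds provider API keys in dotenv format. A minimal sketch of
# what evaluation/config/.env might contain -- the exact variable names the
# framework reads are an assumption here; check evaluation/config/ to confirm:

```shell
# Hypothetical key names -- verify against the configs in evaluation/config/
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
GOOGLE_API_KEY=your-google-key
```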

2. Run VLM Evaluation

# Check available models (25+ supported)
python evaluation/main.py --available-models

# Quick validation test
python evaluation/main.py --model gpt-4o-mini --dataset covla --phase baseline-cross-dataset

# Full evaluation run
python evaluation/main.py --model gpt-4o --dataset covla --phase comprehensive-evaluation

# Evaluate on Waymo or NuScenes
python evaluation/main.py --model gpt-4o --dataset waymo --phase baseline-cross-dataset
python evaluation/main.py --model gpt-4o --dataset nuscenes --phase baseline-cross-dataset

# Generate analysis report
python evaluation/analysis.py --results-dir evaluation/results --output-dir analysis_output

3. Run Human Evaluation and Analysis

# Navigate to the human evaluation directory
cd human_evaluation

# Run the web application
python app.py
# Access the interface at http://localhost:5000

# ---
# After collecting data, run the analysis scripts:

# Generate time-based statistics for a dataset (e.g., covla)
python analysis/analyze_evaluation_times.py --dataset covla

# Generate a summary report from the time analysis
python analysis/generate_summary_report.py --dataset covla

# Generate accuracy analysis
python analysis/analyze_human_accuracy.py --dataset covla

4. Research Workflows

# Model comparison study
python evaluation/main.py --model gpt-4o --dataset covla --phase baseline-cross-dataset
python evaluation/main.py --model claude-3-5-sonnet --dataset covla --phase baseline-cross-dataset
python evaluation/main.py --model gemini-2.0-flash-exp --dataset covla --phase baseline-cross-dataset

# Configuration optimization
python evaluation/main.py --model gpt-4o --dataset covla --phase resolution-optimization
python evaluation/main.py --model gpt-4o --dataset covla --phase grid-configurations
python evaluation/main.py --model gpt-4o --dataset covla --phase temporal-intervals

Framework Extensibility - Adding a New Dataset

The VENUSS pipeline is designed for extensibility. We currently support 4 datasets (CoVLA, Honda Scenes, Waymo, NuScenes). To integrate a new dataset, you only need to modify the following three files:

File                                           Purpose
config/datasets.py                             Define paths and settings for the new dataset
dataset_creation_scripts/scenario_analysis.py  Add a custom parser for the dataset's annotation format
scripts/generate_annotations.py                Define evaluation questions and answer-key generation logic
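
As a rough sketch of the second step, a custom parser maps the dataset's native annotation records onto the normalized scenario fields the rest of the pipeline consumes. The function name, the input record layout, and the output schema below are illustrative assumptions, not the framework's actual interface:

```python
def parse_mydataset_annotation(record):
    """Map one raw annotation record from a hypothetical new dataset
    onto normalized scenario fields. The input and output schemas here
    are assumptions for illustration, not VENUSS's actual interface."""
    return {
        "scenario_id": record["clip_id"],
        # Convert seconds to integer milliseconds for downstream use
        "start_time_ms": int(record["start_s"] * 1000),
        "end_time_ms": int(record["end_s"] * 1000),
        # Normalize free-form tags to trimmed lowercase labels
        "labels": [label.strip().lower() for label in record["tags"]],
    }


record = {"clip_id": "clip_0042", "start_s": 3.0, "end_s": 7.0, "tags": ["Lane Change "]}
print(parse_mydataset_annotation(record))
# → {'scenario_id': 'clip_0042', 'start_time_ms': 3000, 'end_time_ms': 7000, 'labels': ['lane change']}
```

The config entry in `config/datasets.py` and the question logic in `scripts/generate_annotations.py` would then refer to scenarios in this normalized form.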

Evaluation Framework

Core Pipeline

  1. Scenario Extraction: Extract driving scenarios from video datasets
  2. Temporal Analysis: Analyze sequential segments for optimal evaluation intervals
  3. Ground Truth Generation: Create structured evaluation questions and answer keys
  4. Visual Material Creation: Generate frame sequences and collages for VLM input
  5. Human Baseline Collection: Establish human performance benchmarks
  6. VLM Evaluation: Systematic assessment of 25+ models with statistical analysis
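
Step 2's interval selection can be illustrated with a small helper that picks frame timestamps at a fixed spacing, capped at a frame budget. The function and its defaults are an illustrative sketch, not the framework's actual code:

```python
def sample_frame_times(duration_ms, interval_ms=1000, max_frames=8):
    """Return evenly spaced frame timestamps (in ms) covering a clip,
    capped at max_frames. Illustrative helper, not VENUSS code."""
    times = list(range(0, duration_ms + 1, interval_ms))
    return times[:max_frames]


print(sample_frame_times(5000))
# → [0, 1000, 2000, 3000, 4000, 5000]
```

Varying `interval_ms` over a fixed clip is essentially what the temporal-intervals phase probes.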

Experimental Phases

Phase                     Purpose                Scenarios  Focus
baseline-cross-dataset    Quick validation       4          Framework verification
resolution-optimization   Optimal image quality  90         Visual input optimization
grid-configurations       Layout impact          96         Spatial arrangement effects
temporal-intervals        Time spacing effects   60         Temporal understanding
comprehensive-evaluation  Complete assessment    100        Full capability analysis
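
Each phase produces per-scenario results that the analysis step aggregates into per-category accuracy. A minimal sketch of such aggregation, assuming a simple `(category, correct)` record schema that is not necessarily what `evaluation/analysis.py` uses:

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Aggregate per-category accuracy from records with a 'category'
    and a boolean 'correct' field. The schema is assumed for illustration."""
    totals = defaultdict(lambda: [0, 0])  # category -> [num_correct, num_total]
    for rec in records:
        totals[rec["category"]][0] += int(rec["correct"])
        totals[rec["category"]][1] += 1
    return {cat: correct / total for cat, (correct, total) in totals.items()}


records = [
    {"category": "object_detection", "correct": True},
    {"category": "object_detection", "correct": True},
    {"category": "temporal_relation", "correct": False},
    {"category": "temporal_relation", "correct": True},
]
print(accuracy_by_category(records))
# → {'object_detection': 1.0, 'temporal_relation': 0.5}
```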

Supported Models

  • OpenAI: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-4-vision-preview
  • Anthropic: claude-3-5-sonnet, claude-3-opus, claude-3-haiku
  • Google: gemini-2.0-flash-exp, gemini-1.5-pro, gemini-1.5-flash
  • Qwen: qwen-vl-plus, qwen-vl-max
  • HuggingFace: llava-1.5-7b, llava-1.5-13b, and more

Results and Analysis

Performance Insights

  • Capability Gaps: Up to 48.2% performance difference based on input configuration
  • Temporal Understanding: VLMs struggle with sequential scene relationships
  • Static vs Dynamic: Strong object detection, weak motion understanding
  • Human Comparison: VLMs approaching but not exceeding human performance

Configuration Impact

  • Grid Layout: Horizontal layouts outperform square grids
  • Resolution: 720p provides optimal balance
  • Temporal Spacing: 1000ms intervals achieve best performance
  • Presentation Mode: Collage outperforms sequential by 6%
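
The grid-layout finding (horizontal beats square) can be made concrete with a tiny helper that computes collage dimensions for a given frame count; the function name and layout options are illustrative assumptions:

```python
import math


def grid_shape(n_frames, layout="horizontal"):
    """Return (rows, cols) for arranging n_frames in a collage.
    'horizontal' puts all frames in one row; 'square' packs them into
    the smallest near-square grid. Illustrative helper, not VENUSS code."""
    if layout == "horizontal":
        return (1, n_frames)
    cols = math.ceil(math.sqrt(n_frames))
    rows = math.ceil(n_frames / cols)
    return (rows, cols)


print(grid_shape(6))             # → (1, 6)
print(grid_shape(6, "square"))   # → (2, 3)
```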

Directory Structure

VENUSS/
├── evaluation/                # Core VLM evaluation system
│   ├── main.py               # Primary evaluation interface
│   ├── evaluator.py          # Evaluation engine
│   ├── analysis.py           # Results analysis
│   ├── config/               # Model configs and API keys
│   ├── llm_interfaces/       # VLM API implementations
│   └── results/              # Generated evaluation data
├── human_evaluation/          # Web interface for human baseline collection
│   ├── app.py                # Main Flask application for the interface
│   ├── analysis/             # Scripts for analyzing human performance data
│   └── results/              # Raw CSV data and analysis outputs
├── config/                   # Dataset configuration
├── dataset_creation_scripts/ # Data preparation pipeline
├── scripts/                  # High-level utilities
├── utils/                    # Core framework utilities
└── datasets/                 # Generated evaluation datasets
    ├── covla/                # CoVLA dataset outputs
    ├── hsd/                  # Honda Scenes dataset outputs
    ├── waymo/                # Waymo Open Dataset outputs
    └── nuscenes/             # NuScenes dataset outputs
