VeQRA is an AI-powered platform for multimodal inference on Earth observation and general imagery. It integrates image captioning, visual grounding, translation, and query-based analysis into a single web interface.
- Overview
- Core Features
- Tech Stack
- Project Structure
- Research & Reproducibility
- Installation & Setup
- API Documentation
- Unified Evaluation Mode
VeQRA bridges the gap between complex computer vision tasks and user interaction. It utilizes a Node.js and Python backend for heavy inference orchestration and a React-based frontend for the user experience.
All interactions, including chat history and image analysis results, are persisted via PostgreSQL, ensuring a continuous and stateful workflow.
- Multimodal Inference: Supports Captioning, Visual Question Answering (VQA), and YOLO-based Visual Grounding.
- Data Persistence: Robust PostgreSQL storage for chat history and query logs.
- Microservices Architecture: Python-based inference engines decoupled from the Node.js core.
- Static File Management: Efficient serving of uploaded images and processed grounding results.
- Responsive UI: Responsive layout with animated transitions (GSAP, Framer Motion).
- Theme-Aware: Supports Dark and Light modes.
- Visual Query Interface: Drag-and-drop image uploads with immediate preview and result visualization.
- Session Management: Organized history grouping by date.
| Component | Technologies |
|---|---|
| Frontend | React 18, Tailwind CSS, GSAP, Framer Motion |
| Backend | Node.js, Express, Python (Inference), PostgreSQL |
| Storage | Local filesystem (Uploads/Results), PostgreSQL (Metadata) |

    VeQRA/
    │
    ├── backend/  # API & Inference Logic
    │   ├── routes/  # Endpoint definitions
    │   ├── uploads/  # Raw image storage
    │   ├── results/  # Processed image storage (Grounding)
    │   └── server.js  # Entry point
    │
    ├── frontend/  # Client Application
    │   ├── public/  # Static assets
    │   ├── src/
    │   │   ├── components/  # Reusable UI elements
    │   │   └── pages/  # Route views
    │   └── index.html
    │
    └── scripts/  # Research & Reproducibility
        ├── benchmarking/  # Baseline model evaluation
        │   ├── evaluate.py  # Metric calculation (BERT-BLEU)
        │   ├── llavanext.py  # LLaVA-NeXT inference
        │   ├── qwenvl.py  # Qwen-VL inference
        │   ├── rsllava.py  # RS-LLaVA inference
        │   └── run.sh  # Execution entry point
        ├── dataset/  # Data ingestion & formatting
        │   ├── convert_detection.py  # Format MMRS-1M detection
        │   ├── convert_grounding.py  # Format MMRS-1M grounding
        │   ├── download_mmrs1m.sh  # Download MMRS-1M parts
        │   ├── download_vrsbench.sh  # Download VRSBench
        │   └── mmrs1m_structure.py  # Structure mapping utility
        └── finetuning/  # Training pipelines (LoRA)
            ├── evaluate.py  # Evaluation logic
            ├── evaluate.sh  # Evaluation execution
            ├── finetune_caption.sh  # Captioning pipeline entry
            ├── finetune_caption.py  # Captioning training script
            ├── inference_caption.py  # Captioning validation script
            ├── finetune_vqa.sh  # VQA pipeline entry
            ├── finetune_vqa.py  # VQA training script
            ├── inference_vqa.py  # VQA validation script
            ├── finetune_hybrid.sh  # Hybrid pipeline entry
            ├── finetune_hybrid.py  # Hybrid training script
            ├── inference_hybrid.py  # Hybrid validation script
            ├── multimodality_pipeline.py  # SAR/IR adapter training
            ├── skysense_test.sh  # SkySense benchmark entry
            └── skysense_test.py  # SkySense inference

The scripts/ directory contains the complete pipeline used to train, validate, and benchmark the models underlying VeQRA. Follow the steps below to reproduce the results.
Before training, datasets must be downloaded and formatted into the expected JSON structure.

    cd scripts/dataset
    # Download VRSBench and MMRS-1M datasets
    bash download_vrsbench.sh
    bash download_mmrs1m.sh
    # Convert raw annotations to training format
    python3 convert_detection.py
    python3 convert_grounding.py

We provide specific shell scripts that handle environment setup, dependency installation, and the execution of training scripts. These scripts use QLoRA for memory-efficient fine-tuning.
Navigate to the fine-tuning directory:

    cd scripts/finetuning

Task-Specific Pipelines:

- Captioning Task: Trains Qwen2.5-VL on VRSBench captioning data.

      bash finetune_caption.sh

- Visual Question Answering (VQA): Trains on VRSBench VQA pairs.

      bash finetune_vqa.sh

- Hybrid Training: Trains on a combined dataset of VRSBench and SkySense to improve generalization.

      bash finetune_hybrid.sh

- Modality-Specific Adapters: Trains adapters specifically for SAR or IR modalities using the MMRS-1M dataset.

      # For SAR Modality
      python3 multimodality_pipeline.py --modality sar
      # For IR Modality
      python3 multimodality_pipeline.py --modality ir

Validation: To evaluate the fine-tuned models and generate metrics:

    bash evaluate.sh

External Benchmarking: To run the SkySense benchmark test:

    bash skysense_test.sh

To compare the fine-tuned models against baselines (e.g., LLaVA-NeXT, Qwen-VL, RS-LLaVA), use the benchmarking suite.

    cd scripts/benchmarking
    # Run inference on all baseline models and generate metrics
    bash run.sh

Prerequisites:

- Node.js (v18+)
- Python (v3.9+)
- PostgreSQL Database

Backend setup:

- Navigate to the backend directory:

      cd backend

- Install dependencies:

      npm install

- Initialize storage directories:

      mkdir uploads results

- Configuration: Update config.js with your PostgreSQL credentials and Python environment paths.

- Start the server:

      node server.js

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/auth/signup | Register a new user account. |
| POST | /api/auth/login | Authenticate and retrieve session token. |

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/upload | Upload a single image via multipart/form-data. |
| GET | /api/uploads/:filename | Retrieve raw uploaded image. |
| GET | /api/results/:filename | Retrieve processed grounding output. |
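
As a rough illustration of the upload endpoint, the sketch below builds a multipart/form-data request using only the Python standard library. The form field name `image`, the default content type, and the base URL used in the usage note are assumptions for illustration; the field name the backend actually expects is not stated here and should be checked against the route definition.

```python
import uuid
import urllib.request


def build_upload_request(base_url, filename, image_bytes,
                         field_name="image", content_type="image/jpeg"):
    """Build (but do not send) a multipart/form-data POST to /api/upload.

    field_name and content_type are hypothetical defaults, not confirmed
    by the VeQRA backend.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        (f'Content-Disposition: form-data; name="{field_name}"; '
         f'filename="{filename}"\r\n').encode(),
        f"Content-Type: {content_type}\r\n\r\n".encode(),
        image_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/upload",
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

With a running server, the request could be sent with `urllib.request.urlopen(build_upload_request("http://localhost:3000", "scene.jpg", data))`, where the host and port are placeholders for wherever the backend is listening.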

| Method | Endpoint | Description |
|---|---|---|
| POST | /api/query/captioning | Generates a textual description of the image. |
| POST | /api/query/grounding | Performs YOLO object detection and returns bounding boxes. |
| POST | /api/query/vqa | Processes a natural language query regarding the image. |
| POST | /api/translate | Translates text between supported languages. |
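
A small client helper can map task names onto the endpoints listed above. The sketch below is one possible shape, assuming the endpoints accept a JSON body; the payload keys shown in the usage note (`filename`, `question`) are hypothetical, since the request schema is not documented here.

```python
import json
import urllib.request

# Paths taken from the endpoint table; the task names used as keys are illustrative.
TASK_ENDPOINTS = {
    "captioning": "/api/query/captioning",
    "grounding": "/api/query/grounding",
    "vqa": "/api/query/vqa",
    "translate": "/api/translate",
}


def build_query_request(base_url, task, payload):
    """Build (but do not send) a JSON POST request for one inference task."""
    if task not in TASK_ENDPOINTS:
        raise ValueError(f"unknown task: {task!r}")
    return urllib.request.Request(
        url=base_url.rstrip("/") + TASK_ENDPOINTS[task],
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

For example, `build_query_request("http://localhost:3000", "vqa", {"filename": "scene.jpg", "question": "How many ships are visible?"})` yields a request object that can be passed to `urllib.request.urlopen` once the real payload fields are confirmed.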
For automated benchmarking and structured reasoning, VeQRA offers a Unified Multimodal Evaluation API. This endpoint acts as an orchestrator, chaining multiple inference tasks (Captioning, Grounding, VQA) into a single request-response cycle.
POST /api/evaluate
- Captioning: Generates a scene description.
- Grounding: Identifies objects with Oriented Bounding Boxes (OBB).
- Attribute Reasoning: Executes specific VQA sub-queries (Binary, Numeric, Semantic).
- Aggregation: Returns a monolithic JSON response containing all insights.
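
The four steps above can be sketched as a single function that chains the tasks and merges their outputs. This is not the actual server implementation: the three inference callables are placeholders supplied by the caller, and the keys of the returned dictionary are assumptions made for illustration.

```python
from typing import Any, Callable, Dict


def run_unified_evaluation(
    image_ref: str,
    caption_instruction: str,
    grounding_instruction: str,
    attribute_instructions: Dict[str, str],
    caption_fn: Callable[[str, str], Any],
    grounding_fn: Callable[[str, str], Any],
    vqa_fn: Callable[[str, str], Any],
) -> Dict[str, Any]:
    """Chain captioning, grounding, and attribute VQA, then aggregate.

    The *_fn arguments are hypothetical stand-ins for the real inference
    services; each receives the image reference and an instruction string.
    """
    # Step 1: scene description.
    caption = caption_fn(image_ref, caption_instruction)
    # Step 2: object localization (for example, oriented bounding boxes).
    boxes = grounding_fn(image_ref, grounding_instruction)
    # Step 3: one VQA sub-query per attribute kind (binary, numeric, semantic).
    attributes = {
        kind: vqa_fn(image_ref, instruction)
        for kind, instruction in attribute_instructions.items()
    }
    # Step 4: aggregate everything into one response object.
    return {
        "image": image_ref,
        "caption": caption,
        "grounding": boxes,
        "attributes": attributes,
    }
```

Passing the inference steps in as callables keeps the orchestration logic testable with stubs, independent of which models are loaded.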
The endpoint accepts a structured JSON payload defining the image metadata and the specific queries to run.

    {
      "input_image": {
        "image_id": "img_001",
        "image_url": "http://source/path/to/image.jpg",
        "metadata": {
          "width": 1024,
          "height": 1024,
          "spatial_resolution_m": 0.5
        }
      },
      "queries": {
        "caption_query": {
          "instruction": "Detailed description of visible elements..."
        },
        "grounding_query": {
          "instruction": "Locate and return oriented bounding boxes..."
        },
        "attribute_query": {
          "binary": {
            "instruction": "Is there a vehicle present? (Answer: Yes/No)"
          },
          "numeric": {
            "instruction": "Count the number of buildings. (Answer: Float)"
          },
          "semantic": {
            "instruction": "What is the terrain type? (Answer: Text)"
          }
        }
      }
    }

- Standardization: Ensures consistent output formats for benchmarking.
- Efficiency: Reduces network overhead by consolidating multiple API calls.
- Automation: Designed for batch processing large datasets.
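
For batch processing, a request body with the shape shown above could be assembled programmatically. The helper below is a minimal sketch: the function name is hypothetical, and rejecting attribute kinds other than `binary`, `numeric`, and `semantic` is an assumption based on the three kinds that appear in the sample payload.

```python
import json

# The three attribute kinds that appear in the sample payload.
ALLOWED_ATTRIBUTE_KINDS = ("binary", "numeric", "semantic")


def build_evaluation_payload(image_id, image_url, width, height,
                             spatial_resolution_m, caption_instruction,
                             grounding_instruction, attribute_instructions):
    """Return a dict matching the /api/evaluate request structure."""
    unknown = sorted(set(attribute_instructions) - set(ALLOWED_ATTRIBUTE_KINDS))
    if unknown:
        raise ValueError(f"unsupported attribute query kinds: {unknown}")
    return {
        "input_image": {
            "image_id": image_id,
            "image_url": image_url,
            "metadata": {
                "width": width,
                "height": height,
                "spatial_resolution_m": spatial_resolution_m,
            },
        },
        "queries": {
            "caption_query": {"instruction": caption_instruction},
            "grounding_query": {"instruction": grounding_instruction},
            "attribute_query": {
                kind: {"instruction": text}
                for kind, text in attribute_instructions.items()
            },
        },
    }


if __name__ == "__main__":
    payload = build_evaluation_payload(
        "img_001", "http://source/path/to/image.jpg", 1024, 1024, 0.5,
        "Detailed description of visible elements...",
        "Locate and return oriented bounding boxes...",
        {"binary": "Is there a vehicle present? (Answer: Yes/No)"},
    )
    print(json.dumps(payload, indent=2))
```

The resulting dictionary can be serialized with `json.dumps` and sent as the body of a `POST /api/evaluate` request, once per image in the dataset.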