Generative AI for Natural Language Understanding of Satellite Imagery

VeQRA — Visual Earth Query and Retrieval Assistant

VeQRA is an advanced AI-powered platform designed for multimodal inference on earth observation and general imagery. It seamlessly integrates image captioning, visual grounding, translation, and query-based analysis into a high-performance web interface.

Overview

VeQRA bridges the gap between complex computer vision tasks and user interaction. It utilizes a Node.js and Python backend for heavy inference orchestration and a React-based frontend for the user experience.

All interactions, including chat history and image analysis results, are persisted via PostgreSQL, ensuring a continuous and stateful workflow.

Solution Architecture

Core Features

Backend (Intelligence & Logic)

Multimodal Inference: Supports Captioning, Visual Question Answering (VQA), and YOLO-based Visual Grounding.
Data Persistence: Robust PostgreSQL storage for chat history and query logs.
Microservices Architecture: Python-based inference engines decoupled from the Node.js core.
Static File Management: Efficient serving of uploaded images and processed grounding results.

Frontend (User Experience)

Responsive UI: Features modern transitions and layout.
Theme-Aware: Supports Dark and Light modes.
Visual Query Interface: Drag-and-drop image uploads with immediate preview and result visualization.
Session Management: Organized history grouping by date.

Tech Stack

Component	Technologies
Frontend	React 18, Tailwind CSS, GSAP, Framer Motion
Backend	Node.js, Express, Python (Inference), PostgreSQL
Storage	Local filesystem (Uploads/Results), PostgreSQL (Metadata)

Project Structure

VeQRA/
│
├── backend/                                # API & Inference Logic
│   ├── routes/                             # Endpoint definitions
│   ├── uploads/                            # Raw image storage
│   ├── results/                            # Processed image storage (Grounding)
│   └── server.js                           # Entry point
│
├── frontend/                               # Client Application
│   ├── public/                             # Static assets
│   ├── src/
│   │   ├── components/                     # Reusable UI elements
│   │   └── pages/                          # Route views
│   └── index.html
│
└── scripts/                                # Research & Reproducibility
    ├── benchmarking/                       # Baseline model evaluation
    │   ├── evaluate.py                     # Metric calculation (BERT-BLEU)
    │   ├── llavanext.py                    # LLaVA-NeXT inference
    │   ├── qwenvl.py                       # Qwen-VL inference
    │   ├── rsllava.py                      # RS-LLaVA inference
    │   └── run.sh                          # Execution entry point
    ├── dataset/                            # Data ingestion & formatting
    │   ├── convert_detection.py            # Format MMRS-1M detection
    │   ├── convert_grounding.py            # Format MMRS-1M grounding
    │   ├── download_mmrs1m.sh              # Download MMRS-1M parts
    │   ├── download_vrsbench.sh            # Download VRSBench
    │   └── mmrs1m_structure.py             # Structure mapping utility
    └── finetuning/                         # Training pipelines (LoRA)
        ├── evaluate.py                     # Evaluation logic
        ├── evaluate.sh                     # Evaluation execution
        ├── finetune_caption.sh             # Captioning pipeline entry
        ├── finetune_caption.py             # Captioning training script
        ├── inference_caption.py            # Captioning validation script
        ├── finetune_vqa.sh                 # VQA pipeline entry
        ├── finetune_vqa.py                 # VQA training script
        ├── inference_vqa.py                # VQA validation script
        ├── finetune_hybrid.sh              # Hybrid pipeline entry
        ├── finetune_hybrid.py              # Hybrid training script
        ├── inference_hybrid.py             # Hybrid validation script
        ├── multimodality_pipeline.py       # SAR/IR adapter training
        ├── skysense_test.sh                # SkySense benchmark entry
        └── skysense_test.py                # SkySense inference

Research & Reproducibility

The scripts/ directory contains the complete pipeline used to train, validate, and benchmark the models underlying VeQRA. Follow the steps below to reproduce the results.

1. Dataset Preparation

Before training, datasets must be downloaded and formatted into the expected JSON structure.

cd scripts/dataset

# Download VRSBench and MMRS-1M datasets
bash download_vrsbench.sh
bash download_mmrs1m.sh

# Convert raw annotations to training format
python3 convert_detection.py
python3 convert_grounding.py

2. Fine-Tuning Pipelines

We provide specific shell scripts that handle environment setup, dependency installation, and the execution of training scripts. These scripts use QLoRA for memory-efficient fine-tuning.

Navigate to the fine-tuning directory:

cd scripts/finetuning

Task-Specific Pipelines:

Captioning Task: Trains Qwen2.5-VL on VRSBench captioning data.
```
bash finetune_caption.sh
```
Visual Question Answering (VQA): Trains on VRSBench VQA pairs.
```
bash finetune_vqa.sh
```
Hybrid Training: Trains on a combined dataset of VRSBench and SkySense to improve generalization.
```
bash finetune_hybrid.sh
```

Modality-Specific Adapters: Trains adapters specifically for SAR or IR modalities using the MMRS-1M dataset.

# For SAR Modality
python3 multimodality_pipeline.py --modality sar

# For IR Modality
python3 multimodality_pipeline.py --modality ir

Validation: To evaluate the fine-tuned models and generate metrics:

bash evaluate.sh

External Benchmarking: To run the SkySense benchmark test:

bash skysense_test.sh

3. Benchmarking & Evaluation

To compare the fine-tuned models against baselines (e.g., LLaVA-NeXT, Qwen-VL, RS-LLaVA), utilize the benchmarking suite.

cd scripts/benchmarking

# Run inference on all baseline models and generate metrics
bash run.sh

Installation & Setup (Web Application)

Prerequisites

Node.js (v18+)
Python (v3.9+)
PostgreSQL Database

Backend Setup

Navigate to the backend directory:
```
cd backend
```
Install dependencies:
```
npm install
```
Initialize storage directories:
```
mkdir uploads results
```
Configuration: Update config.js with your PostgreSQL credentials and Python environment paths.
Start the server:
```
node server.js
```

API Documentation

Authentication

Method	Endpoint	Description
`POST`	`/api/auth/signup`	Register a new user account.
`POST`	`/api/auth/login`	Authenticate and retrieve session token.

Media Management

Method	Endpoint	Description
`POST`	`/api/upload`	Upload a single image via `multipart/form-data`.
`GET`	`/api/uploads/:filename`	Retrieve raw uploaded image.
`GET`	`/api/results/:filename`	Retrieve processed grounding output.

Inference & Query

Method	Endpoint	Description
`POST`	`/api/query/captioning`	Generates a textual description of the image.
`POST`	`/api/query/grounding`	Performs YOLO object detection and returns bounding boxes.
`POST`	`/api/query/vqa`	Processes a natural language query regarding the image.
`POST`	`/api/translate`	Translates text between supported languages.

Unified Evaluation Mode

For automated benchmarking and structured reasoning, VeQRA offers a Unified Multimodal Evaluation API. This endpoint acts as an orchestrator, chaining multiple inference tasks (Captioning, Grounding, VQA) into a single request-response cycle.

Endpoint

POST /api/evaluate

Workflow

Captioning: Generates a scene description.
Grounding: Identifies objects with Oriented Bounding Boxes (OBB).
Attribute Reasoning: Executes specific VQA sub-queries (Binary, Numeric, Semantic).
Aggregation: Returns a monolithic JSON response containing all insights.

Request Format

The endpoint accepts a structured JSON payload defining the image metadata and the specific queries to run.

{
  "input_image": {
    "image_id": "img_001",
    "image_url": "http://source/path/to/image.jpg",
    "metadata": {
      "width": 1024,
      "height": 1024,
      "spatial_resolution_m": 0.5
    }
  },
  "queries": {
    "caption_query": {
      "instruction": "Detailed description of visible elements..."
    },
    "grounding_query": {
      "instruction": "Locate and return oriented bounding boxes..."
    },
    "attribute_query": {
      "binary": {
        "instruction": "Is there a vehicle present? (Answer: Yes/No)"
      },
      "numeric": {
        "instruction": "Count the number of buildings. (Answer: Float)"
      },
      "semantic": {
        "instruction": "What is the terrain type? (Answer: Text)"
      }
    }
  }
}

Purpose

Standardization: Ensures consistent output formats for benchmarking.
Efficiency: Reduces network overhead by consolidating multiple API calls.
Automation: Designed for batch processing large datasets.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
backend		backend
frontend		frontend
scripts		scripts
README.md		README.md
User Guide.pdf		User Guide.pdf
config.js		config.js

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Generative AI for Natural Language Understanding of Satellite Imagery

VeQRA — Visual Earth Query and Retrieval Assistant

Table of Contents

Overview

Solution Architecture

Core Features

Backend (Intelligence & Logic)

Frontend (User Experience)

Tech Stack

Project Structure

Research & Reproducibility

1. Dataset Preparation

2. Fine-Tuning Pipelines

3. Benchmarking & Evaluation

Installation & Setup (Web Application)

Prerequisites

Backend Setup

API Documentation

Authentication

Media Management

Inference & Query

Unified Evaluation Mode

Endpoint

Workflow

Request Format

Purpose

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Generative AI for Natural Language Understanding of Satellite Imagery

VeQRA — Visual Earth Query and Retrieval Assistant

Table of Contents

Overview

Solution Architecture

Core Features

Backend (Intelligence & Logic)

Frontend (User Experience)

Tech Stack

Project Structure

Research & Reproducibility

1. Dataset Preparation

2. Fine-Tuning Pipelines

3. Benchmarking & Evaluation

Installation & Setup (Web Application)

Prerequisites

Backend Setup

API Documentation

Authentication

Media Management

Inference & Query

Unified Evaluation Mode

Endpoint

Workflow

Request Format

Purpose

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages