Visual Question Answering (VQA) Service

Overview

The VQA Service is a machine-learning-based application that lets users ask natural-language questions about an image and receive text answers. It runs pretrained vision-language models to process the image and generate a response. It supports multiple backend models, so you can pick one that suits your hardware and use case.

Installation

To install and set up the VQA Service, follow these steps:

Clone the repository:

```
git clone https://github.com/cansik/vqa-service.git
cd vqa-service
```

Create a virtual environment:

```
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
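Before you install dependencies, you can confirm that the virtual environment is actually active. Inside a `venv`, `sys.prefix` differs from `sys.base_prefix`. This is a generic Python check and is not part of the VQA Service. The helper name `in_virtualenv` is hypothetical.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when this interpreter runs inside a virtual environment."""
    # The standard-library `venv` module leaves sys.base_prefix pointing at the
    # base installation while sys.prefix points at the environment directory.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    # Legacy `virtualenv` releases (before 20) set sys.real_prefix instead.
    return sys.prefix != base_prefix or hasattr(sys, "real_prefix")


if __name__ == "__main__":
    if in_virtualenv():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment active; run the activate script first.")
```

Run it with the same `python` you will use for `pip install`. If it reports that no environment is active, packages would be installed into the base interpreter.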

Install the required dependencies:
```
pip install -r requirements.txt
```

Usage

To start the VQA Service, use the following command:

```
python -m vqa
```

Command-line Arguments

- `--host`: Specify the service host (default: `127.0.0.1`).
- `--port`: Specify the service port (default: `7840`).
- `--backend`: Choose the VQA backend model (default: `blip`).
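The three flags map naturally onto Python's `argparse`. The sketch below mirrors the documented names and defaults. It is an illustration only, not the project's actual argument parser. The `BACKENDS` tuple is copied from the Supported Models table rather than read from the code, and restricting `--backend` with `choices` is an assumption.

```python
import argparse

# Backend IDs as listed in the Supported Models table (assumed complete;
# the real project may validate the backend differently or not at all).
BACKENDS = (
    "blip", "blip2", "blip2-flan", "vilt",
    "vlmmlx", "vlmmlx-phi35", "vlmmlx-smolvlm2",
    "namo", "moondream", "moondream-cpu",
    "smolvlm", "smolvlm2",
)


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented --host/--port/--backend flags."""
    parser = argparse.ArgumentParser(prog="python -m vqa", description="VQA Service")
    parser.add_argument("--host", default="127.0.0.1", help="service host")
    parser.add_argument("--port", type=int, default=7840, help="service port")
    parser.add_argument("--backend", default="blip", choices=BACKENDS,
                        help="VQA backend model")
    return parser
```

For example, `build_parser().parse_args(["--port", "8000"])` yields `port=8000`, with `host` and `backend` left at `127.0.0.1` and `blip`.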

Example

To run the service on a specific host and port with a chosen backend, use:

```
python -m vqa --host 0.0.0.0 --port 8000 --backend blip2
```
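To launch the service from a Python script, for example in a test harness, you can build the same command line and start it with `subprocess`. The helpers `build_vqa_command` and `start_service` are hypothetical. They only assemble the documented flags, and they assume the `vqa` package is importable by the current interpreter.

```python
from __future__ import annotations

import subprocess
import sys


def build_vqa_command(host: str = "127.0.0.1", port: int = 7840,
                      backend: str = "blip") -> list[str]:
    """Assemble `python -m vqa` with the documented flags (hypothetical helper)."""
    return [
        sys.executable, "-m", "vqa",
        "--host", host,
        "--port", str(port),
        "--backend", backend,
    ]


def start_service(host: str, port: int, backend: str) -> subprocess.Popen:
    """Start the service as a child process; the caller must terminate it."""
    return subprocess.Popen(build_vqa_command(host, port, backend))
```

Here is a usage sketch for the example above: `proc = start_service("0.0.0.0", 8000, "blip2")`. When you are done, stop it with `proc.terminate()`.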

Accessing the Service

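Once the service is running, you can use it through a web interface provided by Gradio. Open `http://<host>:<port>` in your web browser. With the default settings, this is `http://127.0.0.1:7840`. If you started the service with `--host 0.0.0.0`, connect via `localhost` or the machine's network address instead.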
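To check from a script that the web interface is responding, send a plain HTTP GET to the root URL. This sketch uses only the Python standard library. It assumes the Gradio page answers `GET /` with a 2xx status. `service_url` and `is_service_up` are hypothetical helper names. The sketch also maps a `0.0.0.0` bind address to `127.0.0.1`, since `0.0.0.0` means "all interfaces" and is not an address to connect to.

```python
import urllib.request


def service_url(host: str = "127.0.0.1", port: int = 7840) -> str:
    """Build the URL a browser or script should open for the service."""
    # 0.0.0.0 means "listen on all interfaces"; clients connect via loopback.
    client_host = "127.0.0.1" if host == "0.0.0.0" else host
    return f"http://{client_host}:{port}"


def is_service_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` completes with a 2xx status."""
    # Bypass any configured HTTP proxy: the service is usually local.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError/HTTPError subclass OSError, as do connection-refused
        # errors and socket timeouts, so this covers "not reachable".
        return False
```

After starting the service with the example command, `is_service_up(service_url("0.0.0.0", 8000))` should return `True` once the backend model has finished loading.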

Supported Models

The following vision-language model (VLM) backends are supported. Pass the backend ID to `--backend`:

| Backend ID | Model | Description |
| --- | --- | --- |
| `blip` | Salesforce/blip-vqa-base | BLIP base model for visual question answering |
| `blip2` | Salesforce/blip2-opt-2.7b | BLIP2 with OPT 2.7B language model |
| `blip2-flan` | Salesforce/blip2-flan-t5-xl | BLIP2 with Flan-T5-XL language model |
| `vilt` | dandelin/vilt-b32-finetuned-vqa | ViLT model fine-tuned for VQA tasks |
| `vlmmlx` | mlx-community/Qwen2-VL-2B-Instruct-4bit | Default MLX-based VLM for Apple Silicon |
| `vlmmlx-phi35` | mlx-community/Phi-3.5-vision-instruct-4bit | Phi-3.5 Vision model optimized for MLX |
| `vlmmlx-smolvlm2` | mlx-community/SmolVLM2-500M-Video-Instruct-mlx-8bit-skip-vision | SmolVLM2 optimized for MLX |
| `namo` | - | Namo VLM |
| `moondream` | vikhyatk/moondream2 | Moondream2 model with GPU support |
| `moondream-cpu` | vikhyatk/moondream2 | Moondream2 model optimized for CPU inference |
| `smolvlm` | HuggingFaceTB/SmolVLM-256M-Instruct | Lightweight VLM |
| `smolvlm2` | HuggingFaceTB/SmolVLM2-256M-Video-Instruct | SmolVLM2 with video instruction capabilities |

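Each backend offers different capabilities and performance characteristics. The `vlmmlx` variants target Apple Silicon via MLX, `moondream-cpu` targets CPU inference, and the SmolVLM backends are the lightweight options.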
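When scripting against several backends, it helps to keep the backend-to-model mapping from the table above in one place. The dictionary below is transcribed from that table and is not the project's internal registry. The helper names `model_for_backend` and `mlx_backends` are hypothetical. `namo` has no model ID listed, so it maps to `None`.

```python
from __future__ import annotations

# Backend ID -> model ID, transcribed from the Supported Models table.
# `namo` lists no model ID, so it maps to None.
BACKEND_MODELS: dict[str, str | None] = {
    "blip": "Salesforce/blip-vqa-base",
    "blip2": "Salesforce/blip2-opt-2.7b",
    "blip2-flan": "Salesforce/blip2-flan-t5-xl",
    "vilt": "dandelin/vilt-b32-finetuned-vqa",
    "vlmmlx": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "vlmmlx-phi35": "mlx-community/Phi-3.5-vision-instruct-4bit",
    "vlmmlx-smolvlm2": "mlx-community/SmolVLM2-500M-Video-Instruct-mlx-8bit-skip-vision",
    "namo": None,
    "moondream": "vikhyatk/moondream2",
    "moondream-cpu": "vikhyatk/moondream2",
    "smolvlm": "HuggingFaceTB/SmolVLM-256M-Instruct",
    "smolvlm2": "HuggingFaceTB/SmolVLM2-256M-Video-Instruct",
}


def model_for_backend(backend: str) -> str | None:
    """Return the model ID for a backend ID, raising ValueError for unknown IDs."""
    try:
        return BACKEND_MODELS[backend]
    except KeyError:
        known = ", ".join(sorted(BACKEND_MODELS))
        raise ValueError(f"unknown backend {backend!r}; expected one of: {known}") from None


def mlx_backends() -> list[str]:
    """Backend IDs starting with 'vlmmlx', i.e. the MLX-based ones in the table."""
    return [name for name in BACKEND_MODELS if name.startswith("vlmmlx")]
```

Note that `moondream` and `moondream-cpu` share a model ID. According to the table, they differ in the target device, not in the weights.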
