Image directory captioning system using Vision Language Models (VLMs). Run it from the command line or as a FastAPI service that includes a web interface. Useful for creating image datasets for training a LoRA. This project is used by the Dataset Dedupe project for AI captioning.
Caption all the images in a directory with downloaded VLMs using a command line tool or a local web service. A simple web UI is included. Caption text files are saved in the same directory as the images, with the same name but a .txt extension.
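As a minimal sketch of that naming convention (the image path here is hypothetical), each caption file simply replaces the image extension with .txt:

```python
from pathlib import Path

image_path = Path("photos/cat_01.jpg")          # hypothetical image in the target directory
caption_path = image_path.with_suffix(".txt")   # caption is written alongside the image
print(caption_path)                              # photos/cat_01.txt
```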
- Model Selection: Choose from available VLM models
- No Uploading: Images are processed from a local directory.
- Caption Generation: Receive captions (short, detailed, or tags) for images
- User-Friendly Interface: Clean, responsive design with visual feedback
- Qwen3-VL-8B, available through Ollama.
- MiniCPM-V 2.6, available through Ollama.
- Florence-2-base, fixed for the current transformers library using David Littlefield's models, which have identical weights converted for native support. Note that although Florence-2 does a good job of describing images in detail, it does poorly at "tagging" with the `<OD>` task prompt. `local_florence2.py` includes a prompt translation for the coded task prompts; see the original examples here.
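As a rough sketch of what such a prompt translation can look like (the prompt names on the left are hypothetical; the task tokens on the right are standard Florence-2 task prompts):

```python
# Hypothetical mapping from this project's prompt names to Florence-2 task tokens.
# Consult local_florence2.py for the actual translation used by the project.
FLORENCE2_TASK_PROMPTS = {
    "short": "<CAPTION>",
    "detailed": "<MORE_DETAILED_CAPTION>",
    "tags": "<OD>",  # object detection; works poorly as a tagger, as noted above
}

def translate_prompt(prompt_name: str) -> str:
    """Map a human-readable prompt name to a Florence-2 coded task prompt."""
    return FLORENCE2_TASK_PROMPTS[prompt_name]
```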
- Clone the repository
- Create a virtual environment:
python -m venv .venv
- Activate the virtual environment:
- On Windows:
.venv\Scripts\activate
- On macOS/Linux:
source .venv/bin/activate
- Install dependencies:
# if you intend to run a model on ollama:
pip install ollama
# if you intend to use the web service to connect to it from other programs:
pip install "fastapi[standard]" uvicorn jinja2  # jinja2 is needed for the simple web UI template
# if you are going to use any Huggingface models running locally:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
pip install torch transformers Pillow einops timm
- If using Ollama to serve Qwen3-VL-8B:
ollama pull qwen3-vl:8b
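To sanity-check the pulled model, you can caption a single image with the `ollama` Python client; this is just a quick test outside the project, and the image path below is hypothetical:

```python
import ollama

# Ask the pulled model to caption one local image (path is hypothetical).
response = ollama.chat(
    model="qwen3-vl:8b",
    messages=[{
        "role": "user",
        "content": "Describe this image in one sentence.",
        "images": ["photos/cat_01.jpg"],
    }],
)
print(response["message"]["content"])
```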
- Start the FastAPI server:
uvicorn api:app --reload
Read the Swagger UI docs or use the web UI. First, send a POST request to /load_model_service, then POST to /caption_directory. A Python sketch of these two calls follows this list.
- Or, run the command line tool:
python program.py --model [model_name] --directory [image_directory] --prompt [prompt_name]
# To view the available models and prompts, simply run:
python program.py
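As a rough sketch, the two POST calls mentioned above can be made from Python with `requests`. The JSON field names below (`model`, `directory`, `prompt`) are assumptions; check the Swagger UI (/docs) for the exact request schemas:

```python
import requests

BASE_URL = "http://127.0.0.1:8000"  # default uvicorn address

# 1. Load a model service first; the payload shape is an assumption,
#    verify it against the Swagger UI.
requests.post(f"{BASE_URL}/load_model_service", json={"model": "qwen3-vl:8b"}).raise_for_status()

# 2. Then caption every image in a local directory.
resp = requests.post(
    f"{BASE_URL}/caption_directory",
    json={"directory": "photos", "prompt": "detailed"},
)
resp.raise_for_status()
print(resp.json())
```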
For more in-depth usage details, read the documentation.md page.
| FastAPI Swagger UI Screenshot | Web UI Screenshot |
|---|---|
| ![]() | ![]() |
- Edit `service_selection.py` to modify available models. Prompts can also be modified.
- Adjust the `available_service_models` dictionary to include your preferred vision language models.
- In the `services` directory, you can add a new service class according to the `model_service_abstract` class in `classes.py`; a sketch follows below.
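The actual interface is defined by `model_service_abstract` in `classes.py`; the class and method names below are purely illustrative assumptions of what such a subclass might look like, not the project's real API:

```python
# services/my_model_service.py -- a hypothetical new service.
# The method names here are assumptions; implement the abstract methods
# actually declared in model_service_abstract (classes.py).
from classes import model_service_abstract


class MyModelService(model_service_abstract):
    def load_model(self):
        # Load weights or start a client for your chosen VLM.
        ...

    def caption_image(self, image_path: str, prompt: str) -> str:
        # Return the caption text for a single image.
        ...
```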
Microsoft's Florence-2-base-ft vision language model is currently broken or incompatible with the latest transformers library, so this project uses david-littlefield's fix/fork.
I am open to suggestions and feedback. If there is a locally-run VLM that is great at captioning images or videos, I may want to include it in this project in the future. Contributions are welcome! If you encounter any issues or have suggestions for improvements, please open an issue or submit a pull request.
This project is licensed under the Apache 2.0 License.
You are free to use, modify, and distribute with minimal restriction.
- Qwen team at Alibaba Cloud for providing the Qwen-VL series of vision language models.
- MiniCPM-V Team, OpenBMB for providing the MiniCPM-V-2.6 vision language model used in this project.
- Microsoft for providing the Florence-2-base-ft vision language model used in this project.
- David Littlefield for providing the fix/fork for Florence-2-base.
- Ollama for providing the local runtime used to serve vision language models in this project.
- Hugging Face for providing the transformers library used in this project.
- AI code assistance was used in VS Code, presumably through Copilot; it did an amazing job of offering suggestions and completing code.

