Image directory captioning system using Vision Language Models (VLMs). Run it from the command line or as a FastAPI service that includes a web interface. Useful for creating image datasets for training a LoRA. This project is used by the Dataset Dedupe project for AI captioning.
Caption all the images in a directory with downloaded VLMs using a command line tool or a local web service. A simple web UI is included. Caption text files are saved in the same directory as the images, with the same name but a .txt extension.
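As a minimal sketch of that naming convention (the image path here is hypothetical), each caption file simply replaces the image extension with .txt:

```python
from pathlib import Path

image_path = Path("photos/cat_01.jpg")          # hypothetical image in the target directory
caption_path = image_path.with_suffix(".txt")   # caption is written alongside the image
print(caption_path)                              # photos/cat_01.txt
```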
- Model Selection: Choose from available VLM models
- No Uploading: Images are processed from a local directory.
- Caption Generation: Receive captions (short, detailed, or tags) for images
- User-Friendly Interface: Clean, responsive design with visual feedback
- Qwen3-VL-8B, available through Ollama.
- MiniCPM-V 2.6, available through Ollama.
- Florence-2-base, fixed for the current transformers library using David Littlefield's models, which have identical weights converted for native support. Note that although Florence-2 does a good job of describing images in detail, it does poorly at "tagging" with the `<OD>` task prompt. `local_florence2.py` includes a prompt translation for the coded task prompts; see the original examples here.
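As a rough sketch of what such a prompt translation can look like (the prompt names on the left are hypothetical; the task tokens on the right are standard Florence-2 task prompts):

```python
# Hypothetical mapping from this project's prompt names to Florence-2 task tokens.
# Consult local_florence2.py for the actual translation used by the project.
FLORENCE2_TASK_PROMPTS = {
    "short": "<CAPTION>",
    "detailed": "<MORE_DETAILED_CAPTION>",
    "tags": "<OD>",  # object detection; works poorly as a tagger, as noted above
}

def translate_prompt(prompt_name: str) -> str:
    """Map a human-readable prompt name to a Florence-2 coded task prompt."""
    return FLORENCE2_TASK_PROMPTS[prompt_name]
```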
- Clone the repository
- Create a virtual environment:
python -m venv .venv
- Activate the virtual environment:
- On Windows:
.venv\Scripts\activate
- On macOS/Linux:
source .venv/bin/activate
- Install dependencies:
# if you intend to run a model on ollama:
pip install ollama
# if you intend to use the web service to connect to it from other programs:
pip install "fastapi[standard]" uvicorn jinja2  # jinja2 is needed for the simple web UI template
# if you are going to use any Huggingface models running locally:
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
pip install torch transformers Pillow einops timm
- If using Ollama to serve Qwen3-VL-8B:
ollama pull qwen3-vl:8b
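To sanity-check the pulled model, you can caption a single image with the `ollama` Python client; this is just a quick test outside the project, and the image path below is hypothetical:

```python
import ollama

# Ask the pulled model to caption one local image (path is hypothetical).
response = ollama.chat(
    model="qwen3-vl:8b",
    messages=[{
        "role": "user",
        "content": "Describe this image in one sentence.",
        "images": ["photos/cat_01.jpg"],
    }],
)
print(response["message"]["content"])
```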
- Start the FastAPI server:
uvicorn api:app --reload
Read the Swagger UI docs or use the web UI. First, send a POST request to /load_model_service, then POST to /caption_directory. A Python sketch of these two calls follows this list.
- Or, run the command line tool:
python program.py --model [model_name] --directory [image_directory] --prompt [prompt_name]
# To view the available models and prompts, simply run:
python program.py
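As a rough sketch, the two POST calls mentioned above can be made from Python with `requests`. The JSON field names below (`model`, `directory`, `prompt`) are assumptions; check the Swagger UI (/docs) for the exact request schemas:

```python
import requests

BASE_URL = "http://127.0.0.1:8000"  # default uvicorn address

# 1. Load a model service first; the payload shape is an assumption,
#    verify it against the Swagger UI.
requests.post(f"{BASE_URL}/load_model_service", json={"model": "qwen3-vl:8b"}).raise_for_status()

# 2. Then caption every image in a local directory.
resp = requests.post(
    f"{BASE_URL}/caption_directory",
    json={"directory": "photos", "prompt": "detailed"},
)
resp.raise_for_status()
print(resp.json())
```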
For more in-depth usage details, read the documentation.md page.
| FastAPI Swagger UI Screenshot | Web UI Screenshot |
|---|---|
| ![]() | ![]() |
- Edit `service_selection.py` to modify available models. Prompts can also be modified.
- Adjust the `available_service_models` dictionary to include your preferred vision language models.
- In the `services` directory, you can add a new service class according to the `model_service_abstract` class in `classes.py`; a sketch follows below.
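The actual interface is defined by `model_service_abstract` in `classes.py`; the class and method names below are purely illustrative assumptions of what such a subclass might look like, not the project's real API:

```python
# services/my_model_service.py -- a hypothetical new service.
# The method names here are assumptions; implement the abstract methods
# actually declared in model_service_abstract (classes.py).
from classes import model_service_abstract


class MyModelService(model_service_abstract):
    def load_model(self):
        # Load weights or start a client for your chosen VLM.
        ...

    def caption_image(self, image_path: str, prompt: str) -> str:
        # Return the caption text for a single image.
        ...
```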
Microsoft's Florence-2-base-ft vision language model is currently broken or incompatible with the latest transformers library, so this project uses david-littlefield's fix/fork.
I am open to suggestions and feedback. If there is a locally-run VLM that is great at captioning images or videos, I may want to include it in this project in the future. Contributions are welcome! If you encounter any issues or have suggestions for improvements, please open an issue or submit a pull request.
This project is licensed under the Apache 2.0 License.
You are free to use, modify, and distribute with minimal restriction.
- Qwen team at Alibaba Cloud for providing the Qwen-VL series of vision language models.
- MiniCPM-V Team, OpenBMB for providing the MiniCPM-V-2.6 vision language model used in this project.
- Microsoft for providing the Florence-2-base-ft vision language model used in this project.
- David Littlefield for providing the fix/fork for Florence-2-base.
- Ollama for providing the local runtime used to serve vision language models in this project.
- Hugging Face for providing the transformers library used in this project.
- AI code assistance was used in VS Code, presumably through Copilot; it did an amazing job of offering suggestions and completing code.

