VLM Image Caption Server

An image-directory captioning system built on Vision Language Models (VLMs). Run it from the command line, or as a FastAPI service that includes a web interface. Useful for creating image datasets used to train a LoRA. This project is used by the Dataset Dedupe project for AI captioning.

Overview

Caption all the images in a directory with downloaded Vision Language Models (VLMs) using either a command-line tool or a local web service. A simple web UI is included. Caption text files are saved in the same directory as the images, with the same filename but a .txt extension.
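
For illustration, this is how a caption path follows from an image path under that convention; caption_path is a hypothetical helper for this sketch, not a function from this repository:

    from pathlib import Path

    # Hypothetical helper, not part of this repository: the caption file
    # sits next to the image, same base name, but with a .txt extension.
    def caption_path(image_path: str) -> Path:
        return Path(image_path).with_suffix(".txt")

    # caption_path("datasets/cats/cat_001.jpg") -> datasets/cats/cat_001.txt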

Features

  • Model Selection: choose from the available VLM models
  • No Uploading: images are processed from a local directory
  • Caption Generation: receive captions (short, detailed, or tags) for images
  • User-Friendly Interface: clean, responsive design with visual feedback

Models Currently Included

  • Qwen3-VL-8b (served locally through Ollama)
  • MiniCPM-V-2.6 (OpenBMB, via Hugging Face)
  • Florence-2-base-ft (Microsoft, via david-littlefield's fix/fork)

Installation

  1. Clone the repository
  2. Create a virtual environment:
    python -m venv .venv
  3. Activate the virtual environment:
    • On Windows:
      .venv\Scripts\activate
    • On macOS/Linux:
      source .venv/bin/activate
  4. Install dependencies:
    # if you intend to run a model on ollama:
    pip install ollama
    
    # if you intend to run the web service and connect to it from other programs:
    pip install "fastapi[standard]" uvicorn jinja2
    # jinja2 is needed for the simple web UI template
    
    # if you are going to run any Hugging Face models locally:
    # (the cu128 index URL serves CUDA 12.8 builds; match it to your CUDA version)
    pip install torch torchvision --index-url https://download.pytorch.org/whl/cu128
    pip install transformers Pillow einops timm
  5. If using Ollama to serve Qwen3-VL-8b (see the sketch after this list):
    ollama pull qwen3-vl:8b
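
As a quick sanity check (not part of this project's code), the ollama Python client installed in step 4 can caption a single image directly; the image path below is a placeholder:

    import ollama  # the client installed in step 4

    # Placeholder image path; point it at any local image.
    response = ollama.chat(
        model="qwen3-vl:8b",
        messages=[{
            "role": "user",
            "content": "Describe this image in one sentence.",
            "images": ["./example.jpg"],
        }],
    )
    print(response["message"]["content"])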

Usage

  • Start the FastAPI server:

    uvicorn api:app --reload

    Read the Swagger UI docs or open the web UI. First, send a POST request to /load_model_service, then POST to /caption_directory (see the sketch after this list).

  • Or, you can run the command line tool with:

    python program.py --model [model_name] --directory [image_directory] --prompt [prompt_name]
    
    # To view the available models and prompts, simply run:
    python program.py
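
A minimal sketch of driving the API from Python. It assumes the requests package and JSON bodies with model, directory, and prompt fields; those field names are assumptions, so check the Swagger UI (/docs) for the server's actual request schemas:

    import requests

    BASE = "http://127.0.0.1:8000"  # default uvicorn address

    # Field names are assumptions; verify them against the Swagger UI.
    requests.post(f"{BASE}/load_model_service",
                  json={"model": "qwen3-vl:8b"}).raise_for_status()

    resp = requests.post(f"{BASE}/caption_directory",
                         json={"directory": "/path/to/images",
                               "prompt": "detailed"})
    resp.raise_for_status()
    print(resp.json())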

For more in-depth usage details, read the documentation.md page.

Screenshots: FastAPI Swagger UI and the web UI.

To customize the server for other models, you can add a new service class or modify existing ones (see the sketch after this list):

  • Edit service_selection.py to modify the available models; prompts can also be modified there.
  • Adjust the available_service_models dictionary to include your preferred vision language models.
  • In the services directory, add a new service class that follows the model_service_abstract class in classes.py.
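
A rough sketch of the shape such a service class might take. The abstract base's real method names are not documented here, so load_model and caption_image are assumptions made for illustration:

    # Hypothetical skeleton; the actual abstract methods are defined by
    # model_service_abstract in classes.py and may differ from these names.
    from classes import model_service_abstract

    class MyNewModelService(model_service_abstract):
        def load_model(self):
            # Download / initialize the VLM here (e.g. via transformers).
            ...

        def caption_image(self, image_path: str, prompt: str) -> str:
            # Run inference and return the caption text that gets written
            # next to the image as a .txt file.
            ...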

Note on Microsoft's Florence-2-base-ft vision language model: it is currently broken or incompatible with the latest transformers library, so this project uses david-littlefield's fix/fork.
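
For reference, a Hugging Face VLM that ships custom modeling code is typically loaded as below. This is the general transformers pattern, not this project's exact loading code, and the repo id is a placeholder:

    from transformers import AutoModelForCausalLM, AutoProcessor

    # Placeholder repo id, not the real fork; trust_remote_code=True is
    # needed because Florence-2 bundles custom modeling code with its weights.
    repo = "some-user/Florence-2-base-ft-fixed"
    model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)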


Contributing

I am open to suggestions and feedback. If there is a locally run VLM that is great at captioning images or videos, I may include it in this project in the future. Contributions are welcome! If you encounter any issues or have suggestions for improvements, please open an issue or submit a pull request.

License

This project is licensed under the Apache 2.0 License.
You are free to use, modify, and distribute it with minimal restrictions.

Acknowledgements

  • Qwen team at Alibaba Cloud for providing the Qwen-VL series of vision language models.
  • MiniCPM-V Team, OpenBMB for providing the MiniCPM-V-2.6 vision language model used in this project.
  • Microsoft for providing the Florence-2-base-ft vision language model used in this project.
  • David Littlefield for providing the fix/fork for Florence-2-base.
  • Ollama for providing the local model-serving runtime used in this project.
  • Hugging Face for providing the transformers library used in this project.
  • An AI code assistant was used in VS Code, presumably through Copilot. It did an amazing job offering suggestions and completing code.
