LightVLM 🔦

The Problem

Modern vision-language models are slow at inference because they treat every image patch token equally. A 336px image at 14px patch size yields (336 / 14)² = 576 visual tokens, all fed into the LLM at full cost. Most of those tokens carry little semantic information (uniform backgrounds, empty regions), so you're paying the attention cost for tokens that don't matter.
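As a quick sanity check of the token count, here is a minimal sketch. The helper name `num_patch_tokens` is illustrative and not part of LightVLM; it assumes a square image split into non-overlapping square patches.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Count non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


print(num_patch_tokens(336, 14))  # 24 * 24 = 576
print(num_patch_tokens(224, 14))  # 16 * 16 = 256
```

Self-attention compares every pair of tokens, so trimming the visual sequence reduces prefill work faster than linearly in the number of tokens removed.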

Our Approach

LightVLM adds a lightweight token pruner between the vision encoder and the LLM. After the vision encoder processes the image, we score each patch token by how much attention the [CLS] token pays to it, then keep only the top-K. The pruner is a small MLP (< 1M params) that learns what "useful" means for your task.

Image
  │
  ▼
┌─────────────────────────────┐
│  Vision Encoder (CLIP/SigLIP)│  frozen
│  → 576 patch tokens          │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│       Token Pruner           │  trainable (< 1M params)
│  Scores tokens by relevance  │
│  → Top-K tokens retained     │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│    LLM Backbone (Phi-2)      │  frozen or LoRA fine-tuned
│  Processes text + K tokens   │
└─────────────────────────────┘

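This gives ~2-3× faster LLM prefill than full-token baselines with < 3% accuracy drop on standard VQA benchmarks. For example, K=144 reaches a 2.8× speedup at a 0.6-point accuracy drop on VQA-v2 val (see Results).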
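To give a feel for how small a scorer under 1M parameters can be, here is a hedged PyTorch sketch. The class name `TokenScorer`, the hidden size of 512, and the GELU activation are assumptions made for illustration; this README does not specify the actual pruner architecture. The input width of 1024 matches the hidden size of CLIP ViT-L/14.

```python
import torch
from torch import nn


class TokenScorer(nn.Module):
    """Hypothetical MLP mapping each patch token to one relevance score."""

    def __init__(self, dim: int = 1024, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # (B, N, dim) -> (B, N): one scalar score per token
        return self.net(patch_tokens).squeeze(-1)


scorer = TokenScorer()
n_params = sum(p.numel() for p in scorer.parameters())
scores = scorer(torch.randn(2, 576, 1024))
print(n_params)      # 525313, comfortably below 1M
print(scores.shape)  # torch.Size([2, 576])
```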

Show Me

from PIL import Image

from lightvlm import LightVLM

# Load the pretrained checkpoint that keeps K=144 visual tokens
model = LightVLM.from_pretrained("falenai/lightvlm-phi2-clip-k144")
model.eval()  # switch to inference mode

image = Image.open("photo.jpg")
answer = model.chat(image, "What is in this image?")
print(answer)

Getting Started

Install

git clone https://github.com/falenai/LightVLM.git
cd LightVLM
pip install -r requirements.txt
pip install -e .

Quick inference

python scripts/infer.py \
    --model-path checkpoints/lightvlm-phi2-k144 \
    --image path/to/image.jpg \
    --prompt "Describe this image in detail."

Training

python scripts/train.py \
    --config configs/phi2_clip_k144.yaml \
    --data-path /path/to/llava_instruct_150k.json \
    --output-dir checkpoints/my_run/

How It Works

Token Pruning

The pruner scores each visual token using a lightweight attention-based mechanism:

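# vision_attn_weights: (B, H, N, N) vision-encoder attention, [CLS] at index 0
# patch_tokens: (B, N-1, D) patch embeddings, [CLS] excluded
# Attention from [CLS] to all patch tokens
cls_attn = vision_attn_weights[:, :, 0, 1:]  # (B, H, N-1)
scores = cls_attn.mean(dim=1)                 # average over heads -> (B, N-1)
top_k_idx = scores.topk(k, dim=-1).indices   # select top-K -> (B, k)
# Broadcast indices over the feature dim so gather picks whole token vectors
gather_idx = top_k_idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1))
kept_tokens = patch_tokens.gather(1, gather_idx)  # (B, k, D)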
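
The scoring step above can be exercised end to end on random tensors. The following is a minimal sketch, not the library's implementation: the function name `prune_by_cls_attention` is made up for this example, the attention tensor is random rather than taken from a real encoder, and re-sorting the kept indices to preserve the original patch order is an assumption about what a downstream LLM might prefer.

```python
import torch


def prune_by_cls_attention(attn, patch_tokens, k):
    """Keep the k patch tokens that receive the most [CLS] attention.

    attn:         (B, H, N, N) attention weights, [CLS] at index 0
    patch_tokens: (B, N-1, D) patch embeddings without [CLS]
    """
    cls_attn = attn[:, :, 0, 1:]                # (B, H, N-1)
    scores = cls_attn.mean(dim=1)               # (B, N-1), averaged over heads
    top_k_idx = scores.topk(k, dim=-1).indices  # (B, k), ordered by score
    # Assumption: restore spatial order so kept tokens stay in raster sequence
    top_k_idx = top_k_idx.sort(dim=-1).values
    gather_idx = top_k_idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1))
    return patch_tokens.gather(1, gather_idx), top_k_idx


# Toy shapes: 576 patches + 1 [CLS] token, 4 heads, 1024-dim features
B, H, N, D, K = 2, 4, 577, 1024, 144
attn = torch.softmax(torch.randn(B, H, N, N), dim=-1)
patch_tokens = torch.randn(B, N - 1, D)

kept, idx = prune_by_cls_attention(attn, patch_tokens, K)
print(kept.shape)  # torch.Size([2, 144, 1024])
```

The returned indices refer to positions in `patch_tokens`, so they can also be used to select matching positional information if the downstream model needs it.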

Dynamic K

Instead of a fixed K, LightVLM supports dynamic token budgets based on image complexity. Simple images use fewer tokens; complex scenes use more. A small complexity estimator (linear layer on pooled features) predicts the budget.
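
One way such a budget predictor could work is sketched below in plain Python. Everything here is an assumption for illustration: the function name `predict_token_budget`, the sigmoid squashing, the `k_min`/`k_max` bounds, and the rounding to a multiple of 16 are not taken from the LightVLM code. Only the "linear layer on pooled features" idea comes from the description above.

```python
import math


def predict_token_budget(pooled, weights, bias, k_min=64, k_max=256, multiple=16):
    """Map pooled image features to a token budget K (hypothetical sketch)."""
    # Linear layer on pooled features -> scalar logit
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    # Clamp to avoid overflow in exp, then squash to a complexity in (0, 1)
    logit = max(-60.0, min(60.0, logit))
    complexity = 1.0 / (1.0 + math.exp(-logit))
    # Interpolate between the smallest and largest allowed budget
    k = k_min + complexity * (k_max - k_min)
    # Snap to a multiple so only a few distinct sequence lengths occur
    k = int(round(k / multiple)) * multiple
    return max(k_min, min(k_max, k))


print(predict_token_budget([0.0, 0.0], [1.0, 1.0], bias=0.0))     # 160
print(predict_token_budget([0.0, 0.0], [1.0, 1.0], bias=-100.0))  # 64
print(predict_token_budget([0.0, 0.0], [1.0, 1.0], bias=100.0))   # 256
```

Snapping K to a small set of values is a design choice that tends to make batching and kernel caching easier; a real implementation may handle this differently.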

Results

On VQA-v2 val split (Phi-2 backbone, CLIP ViT-L/14 encoder):

Method            | VQA Accuracy | Tokens   | Prefill speedup
------------------|--------------|----------|----------------
Full tokens       | 72.4%        | 576      | -
LightVLM-K144     | 71.8%        | 144      | 2.8×
LightVLM-K64      | 69.3%        | 64       | 5.1×
LightVLM-Dynamic  | 71.5%        | ~160 avg | 2.6×

Measured on a single A100 with batch size 1. Speedup is for the LLM prefill stage only.
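
To read the trade-off more directly, the accuracy drop and the fraction of tokens kept can be derived from the numbers above. This is a small illustrative script; the variable and function names are made up for this example, and the dynamic variant uses its reported average of roughly 160 tokens.

```python
BASELINE_ACC = 72.4    # full-token VQA-v2 val accuracy (%)
BASELINE_TOKENS = 576  # visual tokens without pruning

# (accuracy %, visual tokens) as reported in the results above
results = {
    "LightVLM-K144": (71.8, 144),
    "LightVLM-K64": (69.3, 64),
    "LightVLM-Dynamic": (71.5, 160),  # ~160 tokens on average
}


def summarize(acc, tokens):
    """Return (accuracy drop in points, fraction of tokens kept)."""
    drop = round(BASELINE_ACC - acc, 1)
    kept = round(tokens / BASELINE_TOKENS, 3)
    return drop, kept


summary = {name: summarize(acc, tok) for name, (acc, tok) in results.items()}
for name, (drop, kept) in summary.items():
    print(f"{name}: -{drop} pts, {kept:.1%} of tokens kept")
```

K=144 keeps a quarter of the tokens for a 0.6-point drop, while K=64 keeps about 11% and costs 3.1 points.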

Roadmap

  • Attention-based static token pruning
  • Training pipeline with LoRA support
  • Dynamic token budget prediction
  • Support for SigLIP encoder
  • Int8 quantization integration
  • Gradio demo
  • Pre-trained weights release

Citation

@misc{lightvlm2024,
  author = {方先生},
  title  = {LightVLM: Efficient Vision-Language Inference via Dynamic Token Pruning},
  year   = {2024},
  url    = {https://github.com/falenai/LightVLM}
}

License

MIT

What's New

  • v0.2.0 (2025-04): Dynamic K predictor, PositionalTokenPruner, LLaVA data loader
  • v0.1.0 (2024-07): Initial release with static K pruning and Phi-2 backend

Developed at ZJU CCNT Lab. If you use this in your work, please star the repo ⭐

Performance Notes

LightVLM-K144 runs at ~35 ms/image on an A100 vs. ~98 ms for the full-token baseline (about 2.8× faster). The gap widens on longer contexts.

About

Efficient vision-language inference via dynamic visual token pruning — 2-5x faster prefill with minimal accuracy drop
