Modern vision-language models are slow at inference because they treat every image patch token equally: a 336px image at a 14px patch size produces (336/14)² = 576 visual tokens, all fed into the LLM at full cost. Most of those tokens carry little semantic information (uniform backgrounds, empty regions), so you pay the attention cost for tokens that don't matter.
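The token count follows directly from the patch grid. The helper below is a minimal illustration of that arithmetic; `num_patch_tokens` is a hypothetical name and not part of the LightVLM API.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches a ViT extracts from a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


full = num_patch_tokens(336, 14)  # 24 patches per side
print(full)                       # 576
print(144 / full)                 # 0.25, the fraction kept when K = 144
```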
LightVLM adds a lightweight token pruner between the vision encoder and the LLM. After the vision encoder processes the image, we score each patch token by the attention it receives from the [CLS] token, then keep only the top-K. The pruner is a small MLP (< 1M params) that learns what "useful" means for your task.
```
             Image
               │
               ▼
┌─────────────────────────────┐
│ Vision Encoder (CLIP/SigLIP)│  frozen
│ → 576 patch tokens          │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ Token Pruner                │  trainable (< 1M params)
│ Scores tokens by relevance  │
│ → Top-K tokens retained     │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│ LLM Backbone (Phi-2)        │  frozen or LoRA fine-tuned
│ Processes text + K tokens   │
└─────────────────────────────┘
```
This gives ~2-3× faster prefill vs. full-token baselines with < 3% accuracy drop on standard VQA benchmarks.
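To build intuition for where the prefill savings come from, here is a rough back-of-the-envelope FLOP model. It is only a sketch under stated assumptions: the standard per-layer approximation of `24·n·d²` for the linear projections plus `4·n²·d` for the attention matmuls, a hidden size of 2560 (Phi-2 scale), and a 32-token text prompt. None of these numbers come from the LightVLM code, and the model ignores kernel efficiency, memory traffic, and the vision encoder, so real speedups will generally be lower than this estimate.

```python
def prefill_flops_per_layer(seq_len: int, hidden: int) -> int:
    """Approximate forward FLOPs of one decoder layer over seq_len tokens.

    Linear projections (QKV, output, 4x-wide MLP) scale as 24 * n * d^2;
    the attention score and value matmuls scale as 4 * n^2 * d.
    """
    linear = 24 * seq_len * hidden * hidden
    attention = 4 * seq_len * seq_len * hidden
    return linear + attention


def estimated_speedup(full_visual: int, kept_visual: int,
                      text_tokens: int, hidden: int) -> float:
    """Ratio of estimated prefill FLOPs before and after pruning.

    The number of layers cancels out because every layer sees the same sequence.
    """
    before = prefill_flops_per_layer(full_visual + text_tokens, hidden)
    after = prefill_flops_per_layer(kept_visual + text_tokens, hidden)
    return before / after


HIDDEN = 2560     # assumed hidden size (Phi-2 scale)
TEXT_TOKENS = 32  # assumed prompt length

for k in (144, 64):
    print(k, round(estimated_speedup(576, k, TEXT_TOKENS, HIDDEN), 2))
```

Because the text tokens are never pruned, the estimated speedup stays below the raw token ratio (576 / K), and the shortfall grows as the prompt gets longer relative to K.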
Quick start:

```python
from PIL import Image

from lightvlm import LightVLM

model = LightVLM.from_pretrained("falenai/lightvlm-phi2-clip-k144")
model.eval()

image = Image.open("photo.jpg")
answer = model.chat(image, "What is in this image?")
print(answer)
```

Installation from source:

```bash
git clone https://github.com/falenai/LightVLM.git
cd LightVLM
pip install -r requirements.txt
pip install -e .
```

Command-line inference:

```bash
python scripts/infer.py \
    --model-path checkpoints/lightvlm-phi2-k144 \
    --image path/to/image.jpg \
    --prompt "Describe this image in detail."
```

Training:

```bash
python scripts/train.py \
    --config configs/phi2_clip_k144.yaml \
    --data-path /path/to/llava_instruct_150k.json \
    --output-dir checkpoints/my_run/
```

The pruner scores each visual token using a lightweight attention-based mechanism:
```python
# Attention from [CLS] (index 0) to all patch tokens
cls_attn = vision_attn_weights[:, :, 0, 1:]   # (B, H, N-1)
scores = cls_attn.mean(dim=1)                 # (B, N-1), averaged over heads
top_k_idx = scores.topk(k, dim=-1).indices    # (B, K), select top-K
# Expand the indices over the feature dimension D so gather returns (B, K, D)
gather_idx = top_k_idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1))
kept_tokens = patch_tokens.gather(1, gather_idx)
```

Instead of a fixed K, LightVLM supports dynamic token budgets based on image complexity. Simple images use fewer tokens; complex scenes use more. A small complexity estimator (a linear layer on pooled features) predicts the budget.
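The dynamic budget idea can be sketched without any framework. The example below is an illustration only, not LightVLM's implementation: the function names, the sigmoid squashing, the `k_min`/`k_max` range, and returning the kept indices in their original order are all assumptions made for the sketch.

```python
import math


def predict_token_budget(pooled, weights, bias, k_min=64, k_max=256):
    """Map pooled image features to a token budget in [k_min, k_max].

    A single linear layer produces a scalar logit; a sigmoid squashes it to
    (0, 1), which is then scaled to the allowed budget range (assumed design).
    """
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    logit = max(-60.0, min(60.0, logit))  # clamp to avoid overflow in exp
    complexity = 1.0 / (1.0 + math.exp(-logit))
    return k_min + round(complexity * (k_max - k_min))


def select_top_k(scores, k):
    """Indices of the k highest-scoring tokens, returned in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy usage with made-up numbers: 3 pooled features, 8 token scores.
budget = predict_token_budget([0.2, -0.1, 0.4], [1.0, 0.5, -2.0], 0.0, k_min=2, k_max=6)
token_scores = [0.05, 0.30, 0.10, 0.02, 0.25, 0.08, 0.15, 0.05]
print(budget, select_top_k(token_scores, budget))
```

Sorting the kept indices preserves the spatial order of the surviving patches, which is one plausible choice when the LLM relies on token order; whether LightVLM does this is not stated here.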
On the VQA-v2 val split (Phi-2 backbone, CLIP ViT-L/14 encoder):
| Method | VQA Accuracy | Tokens | Prefill speedup |
|---|---|---|---|
| Full tokens | 72.4% | 576 | 1× |
| LightVLM-K144 | 71.8% | 144 | 2.8× |
| LightVLM-K64 | 69.3% | 64 | 5.1× |
| LightVLM-Dynamic | 71.5% | ~160 avg | 2.6× |
Measured on a single A100 with batch size 1. Speedup is for the LLM prefill stage only.
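The trade-off in the table can be made explicit by relating each configuration to the full-token baseline. The short script below only re-derives ratios from the numbers reported above; `summarize` is a hypothetical helper, and the dynamic row uses the approximate average of 160 tokens.

```python
# Rows copied from the table above:
# (method, VQA accuracy %, visual tokens, prefill speedup)
RESULTS = [
    ("Full tokens", 72.4, 576, 1.0),
    ("LightVLM-K144", 71.8, 144, 2.8),
    ("LightVLM-K64", 69.3, 64, 5.1),
    ("LightVLM-Dynamic", 71.5, 160, 2.6),  # ~160 is an average
]


def summarize(results):
    """Compare every row against the first row, which is the baseline."""
    _, base_acc, base_tokens, _ = results[0]
    rows = []
    for name, acc, tokens, speedup in results[1:]:
        rows.append({
            "method": name,
            "tokens_kept": tokens / base_tokens,
            "accuracy_drop_points": round(base_acc - acc, 1),
            "speedup": speedup,
        })
    return rows


for row in summarize(RESULTS):
    print(row)
```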
- Attention-based static token pruning
- Training pipeline with LoRA support
- Dynamic token budget prediction
- Support for SigLIP encoder
- Int8 quantization integration
- Gradio demo
- Pre-trained weights release
```bibtex
@misc{lightvlm2024,
  author = {方先生},
  title = {LightVLM: Efficient Vision-Language Inference via Dynamic Token Pruning},
  year = {2024},
  url = {https://github.com/falenai/LightVLM}
}
```

License: MIT
- v0.2.0 (2025-04): Dynamic K predictor, PositionalTokenPruner, LLaVA data loader
- v0.1.0 (2024-07): Initial release with static K pruning and Phi-2 backend
Developed at ZJU CCNT Lab. If you use this in your work, please star the repo ⭐
On an A100, LightVLM-K144 runs at ~35 ms/image versus ~98 ms for the full-token baseline (about 2.8×). The gap widens on longer contexts.