LightVLM 🔦

The Problem

Modern vision-language models are slow at inference because they treat every image patch token equally. A 336px image at a 14px patch size yields a 24 × 24 grid, or 576 visual tokens, all fed into the LLM at full cost. Most of those tokens carry little semantic information (uniform backgrounds, empty regions), so you pay the attention cost for tokens that don't matter.
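
The token count follows directly from the patch grid. A minimal sketch of the arithmetic, assuming a square image whose side is divisible by the patch size; `num_patch_tokens` is a hypothetical helper for illustration, not part of the LightVLM package:

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size  # patches per row (and per column)
    return side * side


full = num_patch_tokens(336, 14)  # 24 x 24 grid
print(full)                       # 576
print(144 / full)                 # 0.25: K=144 keeps a quarter of the tokens
```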

Our Approach

LightVLM adds a lightweight token pruner between the vision encoder and the LLM. After the vision encoder processes the image, we score each patch token by the attention it receives from the [CLS] token, then keep only the top-K. The pruner is a small MLP (< 1M params) that learns what "useful" means for your task.

Image
  │
  ▼
┌─────────────────────────────┐
│  Vision Encoder (CLIP/SigLIP)│  frozen
│  → 576 patch tokens          │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│       Token Pruner           │  trainable (< 1M params)
│  Scores tokens by relevance  │
│  → Top-K tokens retained     │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│    LLM Backbone (Phi-2)      │  frozen or LoRA fine-tuned
│  Processes text + K tokens   │
└─────────────────────────────┘

This gives ~2-3× faster prefill than full-token baselines with a < 3% accuracy drop on standard VQA benchmarks. On VQA-v2, for example, K = 144 is 2.8× faster at a cost of 0.6 accuracy points (see Results).
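
To give a feel for the "< 1M params" budget of the pruner, here is a rough parameter count for one possible scorer MLP. The layer sizes are assumptions chosen for illustration (1024 is the CLIP ViT-L/14 hidden width; the 256-unit hidden layer is hypothetical), not LightVLM's actual configuration:

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for a fully connected MLP with the given layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Assumed shape: 1024-d token feature -> 256 hidden units -> 1 relevance score
params = mlp_param_count([1024, 256, 1])
print(params)              # 262657
print(params < 1_000_000)  # True
```

Even a noticeably wider hidden layer would stay inside the stated budget, which is why the pruner adds little on top of the frozen encoder and LLM.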

Show Me

from PIL import Image

from lightvlm import LightVLM

model = LightVLM.from_pretrained("falenai/lightvlm-phi2-clip-k144")
model.eval()

image = Image.open("photo.jpg")
answer = model.chat(image, "What is in this image?")
print(answer)

Getting Started

Install

git clone https://github.com/falenai/LightVLM.git
cd LightVLM
pip install -r requirements.txt
pip install -e .

Quick inference

python scripts/infer.py \
    --model-path checkpoints/lightvlm-phi2-k144 \
    --image path/to/image.jpg \
    --prompt "Describe this image in detail."

Training

python scripts/train.py \
    --config configs/phi2_clip_k144.yaml \
    --data-path /path/to/llava_instruct_150k.json \
    --output-dir checkpoints/my_run/

How It Works

Token Pruning

The pruner scores each visual token using a lightweight attention-based mechanism:

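# vision_attn_weights: (B, H, N, N) attention from the encoder, token 0 is [CLS]
# patch_tokens: (B, N-1, D) patch embeddings, [CLS] excluded
cls_attn = vision_attn_weights[:, :, 0, 1:]   # (B, H, N-1): [CLS] -> each patch
scores = cls_attn.mean(dim=1)                 # (B, N-1): average over heads
top_k_idx = scores.topk(k, dim=-1).indices    # (B, K): indices of the top-K patches
gather_idx = top_k_idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1))  # (B, K, D)
kept_tokens = patch_tokens.gather(1, gather_idx)                           # (B, K, D)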

Dynamic K

Instead of a fixed K, LightVLM supports dynamic token budgets based on image complexity. Simple images use fewer tokens; complex scenes use more. A small complexity estimator (linear layer on pooled features) predicts the budget.
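
One way such a budget predictor could look is sketched below, in plain Python so it runs without extra dependencies. Everything here is an assumption for illustration: the function name `predict_budget`, the sigmoid squashing, the `k_min`/`k_max` bounds, and the toy weights are hypothetical, and the actual LightVLM estimator may differ.

```python
import math


def predict_budget(tokens, weights, bias, k_min=64, k_max=256):
    """Map mean-pooled token features to a token budget in [k_min, k_max]."""
    dim = len(weights)
    # Mean-pool the per-token features into one vector of length `dim`
    pooled = [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]
    # Single linear layer producing a scalar complexity logit
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    complexity = 1.0 / (1.0 + math.exp(-logit))  # squash to (0, 1)
    # Interpolate between the smallest and largest allowed budget
    return round(k_min + complexity * (k_max - k_min))


tokens = [[0.0, 0.0], [0.0, 0.0]]  # toy features: 2 tokens, 2 dims
print(predict_budget(tokens, weights=[1.0, 1.0], bias=0.0))  # 160
```

Bounding the output keeps the worst-case prefill cost predictable: no image can request more than `k_max` tokens, however complex the estimator judges it to be.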

Results

On VQA-v2 val split (Phi-2 backbone, CLIP ViT-L/14 encoder):

Method             VQA Accuracy   Tokens     Prefill speedup
Full tokens        72.4%          576        1×
LightVLM-K144      71.8%          144        2.8×
LightVLM-K64       69.3%          64         5.1×
LightVLM-Dynamic   71.5%          ~160 avg   2.6×

Measured on a single A100 with batch size 1. Speedup is for the LLM prefill stage only.
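
The trade-off in the table can be read off with a little arithmetic. The sketch below uses the fixed-K rows only (the dynamic row reports an approximate average token count, so it is left out); `tradeoff` is a hypothetical helper written for this example:

```python
# Rows copied from the results table:
# (method, VQA accuracy %, visual tokens, prefill speedup)
rows = [
    ("Full tokens", 72.4, 576, 1.0),
    ("LightVLM-K144", 71.8, 144, 2.8),
    ("LightVLM-K64", 69.3, 64, 5.1),
]
base_acc, base_tokens = rows[0][1], rows[0][2]


def tradeoff(acc, tokens):
    """Accuracy drop in percentage points and fraction of tokens kept."""
    return round(base_acc - acc, 1), round(tokens / base_tokens, 3)


for name, acc, tokens, speedup in rows[1:]:
    drop, kept = tradeoff(acc, tokens)
    print(f"{name}: -{drop} pts, {kept:.1%} of tokens, {speedup}x prefill")
```

K = 144 keeps 25% of the tokens for a 0.6-point drop, while K = 64 keeps about 11% for a 3.1-point drop. In both cases the speedup is smaller than the token reduction.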

Roadmap

Attention-based static token pruning
Training pipeline with LoRA support
Dynamic token budget prediction
Support for SigLIP encoder
Int8 quantization integration
Gradio demo
Pre-trained weights release

Citation

@misc{lightvlm2024,
  author = {方先生},
  title  = {LightVLM: Efficient Vision-Language Inference via Dynamic Token Pruning},
  year   = {2024},
  url    = {https://github.com/falenai/LightVLM}
}

License

MIT

What's New

v0.2.0 (2025-04): Dynamic K predictor, PositionalTokenPruner, LLaVA data loader
v0.1.0 (2024-07): Initial release with static K pruning and Phi-2 backend

Developed at ZJU CCNT Lab. If you use this in your work, please star the repo ⭐

Performance Notes

LightVLM-K144 runs at ~35 ms/image on an A100 vs. ~98 ms for the full-token baseline, a ratio of about 2.8×. The gap widens with longer contexts.
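
If you want to take comparable measurements on your own hardware, a generic timing harness along these lines may help. This is a sketch, not the script used for the numbers above; `median_latency_ms` and the stand-in workload are hypothetical.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, runs=20):
    """Median wall-clock latency of fn() in milliseconds, after warmup calls."""
    for _ in range(warmup):
        fn()  # warmup calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; replace with a call that runs one image through the model.
# On a GPU, fn should synchronize the device before returning, otherwise the
# timer only captures the kernel launch.
latency = median_latency_ms(lambda: sum(range(10_000)))
print(f"{latency:.3f} ms")
```

The median is used rather than the mean so that a single slow run (for example, a one-off allocation) does not skew the result.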
