M.Eng. student at Zhejiang University, somewhere between finishing my thesis and convincing myself I understand attention mechanisms. I work on things at the intersection of vision, language, and speech — specifically: why are these models so slow, and can we make them not slow.
My day-to-day is a mix of reading papers I won't finish, writing PyTorch code that almost works, and wondering if 576 visual tokens is really necessary when most of the image is background.
Currently a member of the CCNT Lab, where we build and break multimodal systems.
Now:
- Finishing up LightVLM — dynamic visual token pruning for faster VLM inference
- Evaluating speech LLMs systematically with SpeechLLM-Bench
- Cleaning messy web-scraped data with vl-data-engine
- Reading every paper on KV-cache compression that shows up on arXiv
Multimodal LLMs · Visual Token Compression · Speech Understanding · Audio-Visual Learning · Efficient Inference · Vision-Language Alignment
Lately I've been obsessed with making large VLMs practical — not just technically impressive. A model that takes 3 seconds to process one image is useless on a laptop. There's a lot of room between "full attention over all tokens" and "something smarter."
On the speech side: as speech-integrated LLMs become mainstream, I think we need better evaluation protocols. Ad-hoc demos are not benchmarks.
| Repo | Description |
|---|---|
| LightVLM | Efficient VLM inference via dynamic visual token pruning — 2-5× faster prefill with minimal accuracy drop |
| speechllm-bench | Unified evaluation benchmark for speech LLMs: ASR, emotion recognition, speech translation, TTS quality |
| vl-data-engine | Scalable pipeline for cleaning VL pretraining data — filtering, deduplication, bilingual augmentation |
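
For a feel of what "visual token pruning" means in practice, here is a minimal sketch in plain Python. It is not LightVLM's actual method: the function name `prune_visual_tokens`, the `keep_ratio` parameter, and the per-token importance scores are all assumptions made for illustration (in a real model the scores might come from something like attention weights). The idea is just to keep the top-scoring tokens and drop the rest before the expensive part of the forward pass.

```python
def prune_visual_tokens(tokens, scores, keep_ratio=0.25):
    """Keep the highest-scoring visual tokens, preserving their original order.

    Hypothetical helper for illustration only; `scores` is assumed to hold
    one importance value per token, however it was computed.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Number of tokens to keep; never drop everything.
    k = max(1, int(len(tokens) * keep_ratio))

    # Indices of the k largest scores.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

    # Sort the surviving indices so the kept tokens stay in spatial order.
    kept = sorted(top)
    return [tokens[i] for i in kept], kept


# Example: 576 placeholder tokens with made-up scores, keeping a quarter.
tokens = list(range(576))
scores = [(i * 37) % 101 for i in range(576)]
pruned, kept_indices = prune_visual_tokens(tokens, scores, keep_ratio=0.25)
print(len(pruned))  # 144
```

A fixed `keep_ratio` is the simplest possible policy; a dynamic scheme would presumably pick the budget per image, which this sketch does not attempt.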
⚡ misc
- I have very strong opinions about which papers should have released their code and did not
- Terminal > IDE (sorry)
- If your batch size is 1 your experiments don't count
- Favorite quote: "The purpose of computing is insight, not numbers." — R. Hamming