"The interesting thing about a model is not what it gets right — it's what it confidently gets wrong."
MSc student at Harbin Institute of Technology. I spend most of my time trying to figure out why multimodal models hallucinate, why speech tokenizers throw away the wrong bits, and why my vision-language data pipeline is always somehow the bottleneck.
Mostly PyTorch. Occasionally CUDA when I have to. Lots of YAML.
- 📐 Probing compositional reasoning in MLLMs — turns out "to the left of" is harder than it looks
- 🎙️ Comparing discrete speech tokenizers on downstream tasks (and discovering bitrate isn't everything)
- 🧹 Web-scale image-text data curation — 80% of the work, 20% of the credit
- 📝 Slowly writing a thesis. Slowly.
- mm-reason-bench
- speech-tokenizer-arena
- vl-data-engine: Scalable preprocessing & filtering pipeline for VL pretraining data. CLIP filtering, perceptual dedup, language filters, webdataset shards.
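
For a rough idea of what the CLIP filtering and perceptual dedup stages in vl-data-engine do, here is a minimal sketch. It is not the repo's actual code: the `Sample` record, the `filter_samples` helper, and both thresholds are made up for illustration, and it assumes the CLIP similarity score and a 64-bit perceptual hash were already computed upstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    # Hypothetical record layout, for illustration only.
    url: str
    caption: str
    clip_score: float  # assumed precomputed image-text cosine similarity
    phash: int         # assumed precomputed 64-bit perceptual hash


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def filter_samples(samples, min_clip_score=0.28, max_hamming=4):
    """Keep well-aligned pairs and drop near-duplicate images.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    kept = []
    for s in samples:
        # Stage 1: drop pairs whose caption does not match the image well.
        if s.clip_score < min_clip_score:
            continue
        # Stage 2: drop images whose hash is within a few bits of one
        # already kept. This scan is quadratic in the number of kept items.
        if any(hamming(s.phash, k.phash) <= max_hamming for k in kept):
            continue
        kept.append(s)
    return kept


samples = [
    Sample("a.jpg", "a dog on a beach", 0.31, 0b1111_0000),
    Sample("b.jpg", "a dog on a beach (resized)", 0.30, 0b1111_0001),
    Sample("c.jpg", "click here to subscribe", 0.12, 0b0000_1111 << 32),
    Sample("d.jpg", "a red bicycle", 0.35, 0xFFFF_0000_FFFF_0000),
]
kept = filter_samples(samples)
print([s.url for s in kept])  # ['a.jpg', 'd.jpg']
```

Here `b.jpg` is dropped as a near-duplicate of `a.jpg` and `c.jpg` fails the score threshold. The linear scan over kept hashes is fine for a toy example; at web scale it would need some kind of hash index or bucketing instead. Language filtering and shard writing are left out of this sketch.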
🧰 Things I reach for
Also: torchaudio, open_clip, webdataset, vLLM, DeepSpeed, slurm, way too many wandb tabs.
📍 Harbin · ☕ probably awake · 📬 reach me via issues on any of the repos above
