Jiaming An

"The interesting thing about a model is not what it gets right — it's what it confidently gets wrong."

MSc student at Harbin Institute of Technology. I spend most of my time trying to figure out why multimodal models hallucinate, why speech tokenizers throw away the wrong bits, and why my vision-language data pipeline is always somehow the bottleneck.

Mostly PyTorch. Occasionally CUDA when I have to. Lots of YAML.

What I'm poking at right now

📐 Probing compositional reasoning in MLLMs — turns out "to the left of" is harder than it looks
🎙️ Comparing discrete speech tokenizers on downstream tasks (and discovering bitrate isn't everything)
🧹 Web-scale image-text data curation — 80% of the work, 20% of the credit
📝 Slowly writing a thesis. Slowly.

Open-source things I maintain

mm-reason-bench: A lightweight benchmark suite for multimodal reasoning. VQA, charts, spatial, compositional.
speech-tokenizer-arena: Drop in a tokenizer, get a leaderboard. EnCodec, HuBERT-units, DAC, SpeechTokenizer side-by-side.
vl-data-engine: Scalable preprocessing & filtering pipeline for VL pretraining data. CLIP filtering, perceptual dedup, language filters, webdataset shards.