"The interesting thing about a model is not what it gets right; it's what it confidently gets wrong."
MSc student at Harbin Institute of Technology. I spend most of my time trying to figure out why multimodal models hallucinate, why speech tokenizers throw away the wrong bits, and why my vision-language data pipeline is always somehow the bottleneck.
Mostly PyTorch. Occasionally CUDA when I have to. Lots of YAML.
- Probing compositional reasoning in MLLMs (turns out "to the left of" is harder than it looks)
- Comparing discrete speech tokenizers on downstream tasks (and discovering bitrate isn't everything)
- Web-scale image-text data curation: 80% of the work, 20% of the credit
- Slowly writing a thesis. Slowly.
- mm-reason-bench
- speech-tokenizer-arena
- vl-data-engine: Scalable preprocessing & filtering pipeline for VL pretraining data. CLIP filtering, perceptual dedup, language filters, webdataset shards.
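
For a rough idea of what a perceptual dedup step can look like, here is a minimal sketch. It is an illustration, not the code in vl-data-engine. It assumes each image has already been decoded and downscaled to a small grayscale grid (rows of ints), hashes it with a difference hash (dHash), and greedily drops anything within a small Hamming distance of an image already kept. The names `dhash`, `hamming`, `dedup`, and the `max_distance` threshold are hypothetical.

```python
from typing import List, Sequence

# Assumption: an image is a small, already-downscaled grayscale grid.
Gray = Sequence[Sequence[int]]


def dhash(pixels: Gray) -> int:
    """Difference hash: one bit per horizontally adjacent pixel pair."""
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            # Bit is 1 when brightness drops from left to right.
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions where two hashes differ."""
    return bin(a ^ b).count("1")


def dedup(images: List[Gray], max_distance: int = 2) -> List[int]:
    """Return indices of images kept after greedy near-duplicate removal."""
    kept: List[int] = []
    kept_hashes: List[int] = []
    for idx, img in enumerate(images):
        h = dhash(img)
        # Keep the image only if it is far from every hash kept so far.
        if all(hamming(h, seen) > max_distance for seen in kept_hashes):
            kept.append(idx)
            kept_hashes.append(h)
    return kept


if __name__ == "__main__":
    a = [[10, 20, 30], [30, 20, 10]]
    b = [[11, 21, 29], [31, 19, 12]]  # slightly noisy copy of a
    c = [[30, 20, 10], [10, 20, 30]]  # different gradient pattern
    print(dedup([a, b, c]))
```

The pairwise comparison is quadratic in the number of kept images, so at web scale this would need some form of hash bucketing or approximate nearest-neighbor lookup instead of the linear scan shown here.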
Things I reach for
Also: torchaudio, open_clip, webdataset, vLLM, DeepSpeed, slurm, way too many wandb tabs.
Harbin · probably awake · reach me via issues on any of the repos above