  • Hangzhou, China
  • Joined May 7, 2026

falenai/README.md

方先生

M.Eng. student at Zhejiang University, somewhere between finishing my thesis and convincing myself I understand attention mechanisms. I work on things at the intersection of vision, language, and speech — specifically: why are these models so slow, and can we make them not slow.

My day-to-day is a mix of reading papers I won't finish, writing PyTorch code that almost works, and wondering if 576 visual tokens is really necessary when most of the image is background.

Currently a member of the CCNT Lab, where we build and break multimodal systems.


Now:

  • Finishing up LightVLM — dynamic visual token pruning for faster VLM inference
  • Evaluating speech LLMs systematically with SpeechLLM-Bench
  • Cleaning messy web-scraped data with vl-data-engine
  • Reading every paper on KV-cache compression that shows up on arXiv

🔬 Research

Multimodal LLMs · Visual Token Compression · Speech Understanding · Audio-Visual Learning · Efficient Inference · Vision-Language Alignment

Lately I've been obsessed with making large VLMs practical — not just technically impressive. A model that takes 3 seconds to process one image is useless on a laptop. There's a lot of room between "full attention over all tokens" and "something smarter."
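
As a rough illustration of what "something smarter" can look like, here is a minimal sketch of score-based top-k token pruning in plain Python. It is not the LightVLM method; the function name `prune_visual_tokens`, the `keep_ratio` parameter, the 24 × 24 grid layout for 576 tokens, and the toy scores are all assumptions made for this example.

```python
def prune_visual_tokens(scores, keep_ratio=0.25):
    """Return the indices of the highest-scoring tokens, in original order.

    `scores` is a hypothetical per-token importance list (for example,
    attention mass received from the text prompt). Illustrative only.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(scores) * keep_ratio))
    # Rank token indices by score, highest first; ties go to the lower index.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Re-sort the survivors so their original spatial order is preserved.
    return sorted(ranked[:n_keep])


# Assumption: 576 visual tokens laid out as a 24 x 24 patch grid.
GRID = 24
# Toy scores: an 8 x 8 centre block stands in for "foreground",
# everything else is treated as low-importance background.
scores = [
    1.0 if 8 <= r < 16 and 8 <= c < 16 else 0.01
    for r in range(GRID)
    for c in range(GRID)
]

kept = prune_visual_tokens(scores, keep_ratio=0.25)
print(len(scores), "->", len(kept))  # 576 -> 144
```

With `keep_ratio=0.25`, the language model would see 144 visual tokens instead of 576. A fixed ratio is the simplest possible policy; a dynamic scheme would pick the budget per image.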

On the speech side: as speech-integrated LLMs become mainstream, I think we need better evaluation protocols. Ad-hoc demos are not benchmarks.


🛠️ Stack

Python · PyTorch · HuggingFace · CUDA · Linux · Git · LaTeX · Docker


📌 Projects

| Repo | Description |
| --- | --- |
| LightVLM | Efficient VLM inference via dynamic visual token pruning: 2-5× faster prefill with minimal accuracy drop |
| speechllm-bench | Unified evaluation benchmark for speech LLMs: ASR, emotion recognition, speech translation, TTS quality |
| vl-data-engine | Scalable pipeline for cleaning VL pretraining data: filtering, deduplication, bilingual augmentation |

⚡ misc
  • I have very strong opinions about which papers should have released their code and did not
  • Terminal > IDE (sorry)
  • If your batch size is 1 your experiments don't count
  • Favorite quote: "The purpose of computing is insight, not numbers." — R. Hamming

Pinned

  1. LightVLM (Public)

    Efficient vision-language inference via dynamic visual token pruning — 2-5x faster prefill with minimal accuracy drop

    Python · 1 star

  2. speechllm-bench (Public)

    Comprehensive benchmark and evaluation harness for speech large language models: ASR, SER, speech translation, and TTS quality

    Python · 1 star

  3. vl-data-engine (Public)

    Scalable data processing pipeline for vision-language pretraining: quality filtering, deduplication, and bilingual caption augmentation

    Python · 1 star

  4. zhaozijie2022/LocoLeggedWheel (Public)

    RL-based Legged-Wheeled Robot locomotion sim-to-real based on NVIDIA Isaac Lab

    Python · 126 stars · 10 forks

  5. ZhangJinHaHaHa/AgentLens (Public)

    Verify Before You Hire — a decentralized, TEE-attested audit & marketplace protocol for AI agents. On-chain audit registry (MDDRM reputation), SGX/DCAP-backed sandbox, and a buyer-facing trust mark…

    TypeScript · 486 stars · 34 forks