Efficient Inference for LLMs & MLLMs
An open-source research project from Alibaba Cloud dedicated to efficient inference for large language models (LLMs) and multimodal LLMs (MLLMs).
- ✨ Key Features
- 🔥 Latest Updates
- 📦 Installation
- ⚡ Quick Start
- 🧪 Benchmarks
- 📚 Publications
- 🤝 Contributing
- 📄 License
- ✉️ Contact
EfficientAI focuses on inference-time optimizations for LLMs and MLLMs:
| Feature | Description | Status |
|---|---|---|
| 🔹 Activation Sparsity | Dynamic sparsity methods for faster inference | ✅ LaRoSa (ICML 2025) |
| 🔹 Quantization | Post-training & quantization-aware techniques for MLLMs | ✅ MASQuant (CVPR 2026) |
| 🔹 Agentic Reasoning | Efficient tool-use and reasoning frameworks | ✅ D-CORE |
| 🔹 Reproducible Benchmarks | Standardized eval pipelines for research & production | 🔄 In Progress |
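To make the activation-sparsity row above concrete, here is a generic top-k magnitude sketch of dynamic activation sparsity: per token, only the largest activations are kept and the rest are zeroed. This is an illustrative assumption, not the LaRoSa algorithm; `sparsify_activations` and the keep ratio are made-up names for the example.

```python
import numpy as np

def sparsify_activations(x, keep_ratio=0.25):
    """Zero all but the top-`keep_ratio` fraction of activations
    (by magnitude) in each row. Generic illustration of dynamic
    activation sparsity; not the LaRoSa method itself."""
    k = max(1, int(x.shape[-1] * keep_ratio))
    # indices of the k largest-magnitude entries per row
    idx = np.argpartition(np.abs(x), -k, axis=-1)[..., -k:]
    mask = np.zeros(x.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return np.where(mask, x, 0.0)

x = np.random.randn(2, 8)
sparse = sparsify_activations(x, keep_ratio=0.25)  # 2 of 8 values kept per row
```

In practice the speedup comes from skipping the matmul columns whose activations are zero; the mask itself is only the selection step.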
<details>
<summary>📰 Changelog (Click to expand)</summary>

- **[2026-03]** 🎉 **MASQuant accepted to CVPR 2026**
  → Multimodal LLM PTQ algorithm with a SOTA accuracy-efficiency tradeoff
  📄 Paper | 💻 Code
- **[2026-02]** 🚀 **D-CORE open-sourced**
  → Efficient tool-use reasoning via dynamic computation routing
  📄 Paper | 💻 Code | 🎮 Demo
- **[2026-01]** 🏆 **LaRoSa accepted to ICML 2025**
  → Training-free activation sparsity for LLM acceleration
  📄 Paper | 💻 Code

</details>
## 📦 Installation

```bash
# Clone the repository
git clone https://github.com/alibaba/EfficientAI.git
cd EfficientAI

# Install dependencies (recommended: use conda)
pip install -r requirements.txt

# Optional: install with specific module support
# pip install -e ".[larosa]"    # for LaRoSa
# pip install -e ".[masquant]"  # for MASQuant
```