AIConfigurator Release v0.5.0
AIConfigurator 0.5.0 brings significant performance optimizations, expands backend support for vLLM and SGLang, and introduces new modeling capabilities including Power Estimation and Power Law workload distribution. This release also adds comprehensive support matrix testing.
Release Highlights
This version focuses on performance, with optimizations to the generation engine and database lookups. New hardware data covers L40S for SGLang, and MoE (Mixture of Experts) support now extends to the vLLM backend. Users can also target End-to-End (E2E) latency and estimate power consumption.
Features and Improvements
1. Performance Optimizations
- Engine Optimization: Optimized the implementation of run_generation and num_gpu lookups for faster execution (by @anish-shanbhag in #113, #114).
- Efficient Data Handling: Replaced dataframes with dictionaries for batch operations in InferenceSummary generation and added caching for repeated queries to improve speed (by @anish-shanbhag in #115, #128).
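As an illustration of the data-handling change, the sketch below shows the general pattern of building summary rows as plain dictionaries and caching repeated lookups with functools.lru_cache. It is not AIConfigurator's code: _LATENCY_TABLE, query_latency, and build_summary_rows are hypothetical names, and the latency values are made up.

```python
from functools import lru_cache
from typing import Dict, List, Tuple

# Hypothetical lookup table keyed by (model, tp_size); values are made-up latencies in ms.
_LATENCY_TABLE: Dict[Tuple[str, int], float] = {
    ("model-a", 1): 12.0,
    ("model-a", 2): 7.5,
}


@lru_cache(maxsize=None)
def query_latency(model: str, tp_size: int) -> float:
    """Return the stored latency; repeated calls with the same key hit the cache."""
    return _LATENCY_TABLE[(model, tp_size)]


def build_summary_rows(configs: List[Tuple[str, int]]) -> List[dict]:
    """Build one plain dict per configuration instead of appending to a dataframe."""
    rows = []
    for model, tp_size in configs:
        rows.append(
            {
                "model": model,
                "tp_size": tp_size,
                "latency_ms": query_latency(model, tp_size),
            }
        )
    return rows
```

A list of dictionaries like this can still be converted to a dataframe once at the end if tabular output is needed, which avoids paying per-row dataframe overhead inside the loop.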
2. New Modeling Capabilities
- Power Estimation: Added support for estimating power consumption of configurations (by @kaim-eng in #153).
- Workload Distribution: Introduced a 'power_law' option for workload distribution in the CLI and prefill modeling (by @xutizhou in #147, #134).
- Hybrid Modeling: Added support for hybrid modeling scenarios (by @tianhaox in #125).
- Latency Targets: Users can now set E2E latency as a target metric (by @tianhaox in #145).
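These notes do not spell out how the 'power_law' option distributes work, so the following is only a generic sketch of a power-law split: a fixed amount of work is divided across buckets with weights proportional to rank^(-alpha), so a few buckets receive most of the load. The function power_law_split and its parameters are hypothetical and are not taken from AIConfigurator.

```python
from typing import List


def power_law_split(total: int, num_buckets: int, alpha: float) -> List[int]:
    """Split `total` units across buckets with weights proportional to rank**-alpha.

    alpha = 0 gives a uniform split; larger alpha concentrates work in the
    first buckets. Counts are integers and always sum to `total`.
    """
    weights = [(rank + 1) ** -alpha for rank in range(num_buckets)]
    norm = sum(weights)
    exact = [total * w / norm for w in weights]
    counts = [int(x) for x in exact]

    # Largest-remainder rounding: give leftover units to the buckets
    # whose exact share had the largest fractional part.
    leftover = total - sum(counts)
    order = sorted(
        range(num_buckets), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in order[:leftover]:
        counts[i] += 1
    return counts
```

For example, power_law_split(1000, 8, 1.2) gives the first bucket a far larger share than the last, while alpha = 0.0 falls back to an even split.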
3. Framework and Hardware Support
- vLLM Support: Added MoE support for vLLM (by @ilyasher in #139) and generator support (by @Ethan-ES in #144).
- SGLang Support: Added support for WideEP TP attention modeling (by @AichenF in #143), L40S data (non-WideEP) (by @venkywonka in #165), and generator support (by @Ethan-ES in #144).
- DeepSeek: Replaced DeepSeek MLP with GEMM for better performance (by @AichenF in #155).
4. User Interface
- Profiler UI: Introduced a new Profiler UI for better visualization and analysis (by @Harrilee in #117).
- UI Updates: Relocated GPU cost references and updated profiling components (by @Harrilee in #167).
5. Build, CI and Test
- Testing Framework: Added a comprehensive support matrix testing framework (by @Harrilee in #126).
- Maintenance: Added a CODEOWNERS file for better repository management (by @Arsene12358 in #109).
Bug Fixes
- SGLang Fixes: Addressed vulnerabilities in the collector (#108), aligned GEMM quantization methods (#122), and fixed attention collection for the regular path (#123).
- MoE & Model Fixes: Fixed MoE memory issues and NVFP4 GEMM for TRT-LLM 1.x (#131), removed generation repeat attention (#148), and updated workload distribution logic for MoE/DeepSeek models (#146).
- CLI & Compatibility: Fixed CLI for GB200 with TP > 4 (#137), improved compatibility with Python versions before 3.10 by using Union instead of the | operator in type annotations (#158), and relaxed Pydantic requirements (#161, #162).
- General Fixes: Fixed team name parsing (#130), updated custom_allreduce file locations (#156, #160), and removed PII from error stack traces (#166).
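For context on the Union change above: the X | Y annotation syntax (PEP 604) is only supported at runtime from Python 3.10 onward, while typing.Union and typing.Optional also work on earlier versions. The snippet below is a minimal sketch of the portable spelling; parse_tp_size is a hypothetical helper, not a function from AIConfigurator.

```python
from typing import Optional, Union


def parse_tp_size(value: Union[int, str], default: Optional[int] = None) -> int:
    """Accept an int or a numeric string and return an int.

    On Python 3.10+ the signature could instead be written as
    `value: int | str, default: int | None = None`.
    """
    if value == "" and default is not None:
        return default
    return int(value)
```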
Documentation
- Added design documentation for Power Law distribution (by @YijiaZhao in #119, #129).
- Updated documentation to mention vLLM and SGLang support (by @jasonqinzhou in #159).
New Contributors
- @xueh-nv made their first contribution in #133
- @Harrilee made their first contribution in #117
- @gangmuk made their first contribution in #158
- @dmitry-tokarev-nv made their first contribution in #161
- @venkywonka made their first contribution in #165
- @kaim-eng made their first contribution in #153
- @bcfre made their first contribution in #175
Full Changelog: v0.4.0...v0.5.0