AIConfigurator - Release 0.9.0
Summary
AIConfigurator 0.9.0 broadens model, hardware, and deployment-target coverage and lays foundational SDK plumbing for the next wave of large MoE models. The release adds first-class DeepSeek-V4 support across attention, MHC, and MoE collectors and queries, brings DeepSeek-R1 to the vLLM backend, adds MiniMax-M2.7 (including the NVFP4 variant), and lands SILICON-mode profiles for QWEN 3.5 and Kimi K-2.5. Hardware coverage expands with B300 support on vLLM 0.19.0, estimate-only PCIe systems, RTX PRO 6000 Blackwell Server perf data across SGLang, vLLM 0.19.0, and TRT-LLM 1.3.0rc10, and MXFP4 MoE/attention data for gpt-oss on H200. A new llm-d deployment target joins the existing TRT-LLM/SGLang/vLLM targets, and a new AIC Rust core forward-pass estimator lands inside the SDK. The collector gains --resume-retry-failed, the SDK becomes framework-agnostic in hybrid mode by sharing op data across frameworks, and a new per-op silicon-vs-empirical attribution view exposes where each prediction comes from. The CLI adds --strict-sla for opt-in TTFT+TPOT constraint filtering. The release also introduces a support-matrix regression view alongside a prediction-accuracy regression testing workflow, makes the container OpenShift-compatible under random UIDs, and unifies 16-bit float naming on bfloat16. Numerous fixes harden DeepSeek-V4 attention extrapolation, vLLM ≥ 0.19 MLA/DSA collector compat, FP8 block config, balanced-EP routing, MLA KV cache sizing for Kimi K2.5, NaN handling in Pareto selection, and webapp defaults — plus a cluster of RC cherry-picks closing out release/0.9.0 NVBugs.
Key Highlights
Models and Architectures
- DeepSeek-V4 end-to-end: Module-level attention collect/query (#941), MHC collect/query (#942), and MoE modeling (#986) on top of the initial DeepSeek-V4 SDK support (#904); dsv4-flash collectors extended to Blackwell platforms (#1034).
- DeepSeek-R1 on vLLM: Added DSR1 support for the vLLM backend (#852) and added DeepSeek R1 to the support matrix (#851).
- MiniMax-M2.7 (FP8 and NVFP4): Added MiniMax-M2.7 and `nvidia/MiniMax-M2.7-NVFP4` support (#964).
- SILICON-mode coverage: QWEN 3.5 SILICON mode with TRT-LLM / SGLang / vLLM data (#738) and Kimi K-2.5 SILICON mode (#757).
- GLM-5: Added FP8 / NVFP4 support-matrix generation support for GLM-5 (#991).
Hardware and Backend Coverage
- B300 on vLLM 0.19.0: Added B300 system support for vLLM 0.19.0 (#829).
- RTX PRO 6000 Blackwell Server perf data: Perf tables across SGLang (#998), vLLM 0.19.0 (#999), and TRT-LLM 1.3.0rc10 (#1001).
- gpt-oss on H200: MXFP4 MoE and attention data for gpt-oss models on H200 (#894).
- Estimate-only PCIe systems: Added estimate-only PCIe system definitions so PCIe topologies can be evaluated even without a full empirical sweep (#980).
- Intel XPU: Enabled oneCCL benchmarking support for XPU (#694).
- SGLang 0.5.10: Made the SGLang collector compatible with SGLang 0.5.10 (#761).
Deployment Targets
- `llm-d` deployment target: Added a new `llm-d` deployment target alongside TRT-LLM / SGLang / vLLM (#671), with follow-up output-quality improvements (#954).
- OpenShift random UID: Made the container compatible with OpenShift's arbitrary-UID security context so AIConfigurator runs cleanly on OpenShift clusters (#670).
SDK and Modeling
- AIC Rust core forward-pass estimator: A new Rust-implemented core forward-pass estimator lands in the SDK (#981).
- Hybrid-mode op-data sharing: Op data is now shareable across frameworks in hybrid mode, so TRT-LLM / SGLang / vLLM can reuse the same underlying op measurements when projecting hybrid configurations (#997).
- Per-op silicon vs. empirical attribution: Each operator's prediction can be attributed to its silicon-model vs. empirical-data source, surfacing exactly where projections come from (#956).
- SDK package layout: Extracted `interpolation.py` and `system_spec.py` out of `perf_database` (#650) and converted `models.py` into a proper `models/` package (#651); removed the hard-coded `target_version` (#1052).
CLI and Collector Ergonomics
- `--strict-sla`: New opt-in CLI flag for TTFT + TPOT constraint filtering so users can ask for configs that strictly satisfy both SLA targets simultaneously (#727).
- `--resume-retry-failed`: New collector flag that retries only the previously failed entries on resume instead of re-running everything (#914).
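As a rough illustration of what strict TTFT + TPOT filtering means, the sketch below keeps only candidates that meet both targets at once. The `Candidate` type, its field names, and `filter_strict_sla` are hypothetical names chosen for this example; they are not AIConfigurator's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # Hypothetical record; field names are illustrative only.
    name: str
    ttft_ms: float  # time to first token
    tpot_ms: float  # time per output token


def filter_strict_sla(candidates, ttft_target_ms, tpot_target_ms):
    """Keep candidates that satisfy BOTH SLA targets simultaneously."""
    return [
        c for c in candidates
        if c.ttft_ms <= ttft_target_ms and c.tpot_ms <= tpot_target_ms
    ]


candidates = [
    Candidate("tp4", ttft_ms=180.0, tpot_ms=12.0),  # meets both targets
    Candidate("tp8", ttft_ms=120.0, tpot_ms=22.0),  # misses TPOT
    Candidate("tp2", ttft_ms=320.0, tpot_ms=9.0),   # misses TTFT
]
kept = filter_strict_sla(candidates, ttft_target_ms=200.0, tpot_target_ms=15.0)
print([c.name for c in kept])
```

A candidate that meets only one of the two targets is dropped, which is the difference from a filter that constrains a single metric.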
Support Matrix and Testing
- Regression view + cleanup: Added a regression view to the support matrix and dropped the unused static generator (#976).
- Combined cron + autofix: Combined the support-matrix tests into a single cron job that triggers an autofix pipeline (#965).
- Prediction-accuracy regression workflow: Added a workflow that regression-tests prediction accuracy against measured data (#978).
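One way to picture a prediction-accuracy regression check is a comparison of an error metric against a stored baseline. The sketch below uses mean absolute percentage error (MAPE) with a tolerance; the metric, the function names, and the threshold are assumptions made for illustration, since the release notes do not say how the workflow scores accuracy.

```python
def mean_abs_pct_error(predicted, measured):
    """MAPE between predictions and measurements (measured values must be non-zero)."""
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / abs(m) for p, m in zip(predicted, measured))
    return total / len(predicted)


def check_regression(predicted, measured, baseline_mape, tolerance=0.02):
    """Return (passed, mape); passes if the error has not grown beyond baseline + tolerance."""
    mape = mean_abs_pct_error(predicted, measured)
    return mape <= baseline_mape + tolerance, mape


# Hypothetical latencies in ms: predicted vs. measured.
passed, mape = check_regression([110.0, 95.0], [100.0, 100.0], baseline_mape=0.06)
print(passed, round(mape, 3))
```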
Features & Enhancements
CLI and APIs
- `--strict-sla` flag for opt-in TTFT+TPOT constraint filtering (#727).
- Collector `--resume-retry-failed` to retry only previously failed entries on resume (#914).
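The resume behaviour can be sketched as a selection over a checkpoint of per-entry statuses. Everything here is hypothetical: the checkpoint format, the status strings, and the choice to still run never-attempted entries are assumptions for illustration, not the collector's actual implementation.

```python
def select_entries_to_run(all_entries, checkpoint, retry_failed=False):
    """Pick which entries to run when resuming a collection.

    checkpoint maps entry id -> "done" or "failed" (assumed format).
    Completed entries are always skipped; failed entries are retried
    only when retry_failed is set.
    """
    to_run = []
    for entry in all_entries:
        status = checkpoint.get(entry)
        if status is None:
            to_run.append(entry)  # never attempted (assumed to still run)
        elif status == "failed" and retry_failed:
            to_run.append(entry)  # retried only on request
    return to_run


entries = ["a", "b", "c", "d"]
checkpoint = {"a": "done", "b": "failed", "c": "done"}
print(select_entries_to_run(entries, checkpoint, retry_failed=True))
print(select_entries_to_run(entries, checkpoint))
```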
SDK and Modeling
- AIC Rust core forward-pass estimator (#981).
- Op data sharable across frameworks in hybrid mode (#997).
- Per-op silicon vs. empirical attribution (#956).
- Refactor: Extract `interpolation.py` and `system_spec.py` from `perf_database` (#650).
- Refactor: Convert `models.py` to a `models/` package (#651).
- Refactor: Remove hard-coded `target_version` (#1052).
Models and Architectures
- DeepSeek-V4 SDK support (#904); attention collect/query (#941); MHC collect/query (#942); MoE modeling (#986); dsv4-flash collectors extended to Blackwell (#1034).
- DeepSeek-R1 for vLLM (#852); DeepSeek R1 added to the support matrix (#851).
- MiniMax-M2.7 and `nvidia/MiniMax-M2.7-NVFP4` (#964).
- QWEN 3.5 SILICON mode with TRT-LLM / SGLang / vLLM data (#738).
- Kimi K-2.5 SILICON mode (#757).
- GLM-5 FP8/NVFP4 support-matrix generation (#991).
Hardware and Backend Coverage
- B300 support for vLLM 0.19.0 (#829).
- Estimate-only PCIe systems (#980).
- RTX PRO 6000 SGLang perf data (#998), TRT-LLM 1.3.0rc10 perf data (#1001), vLLM 0.19.0 perf data (#999).
- gpt-oss on H200: MXFP4 MoE and attention data (#894).
- Intel XPU: oneCCL benchmarking support (#694).
- SGLang 0.5.10 collector compatibility (#761).
Deployment Targets
- `llm-d` deployment target (#671).
- OpenShift random-UID container compatibility (#670).
Support Matrix and Testing
- Regression view + drop unused static generator (#976).
- Combined cron + autofix pipeline for support-matrix testing (#965).
- Prediction-accuracy regression testing workflow (#978).
Bug Fixes
Collectors and Models
- Custom all_reduce indefinite hang on B200 systems (#754).
- vLLM ≥ 0.19 MLA/DSA module collector compatibility (#864).
- FA3 `scheduler_metadata` mismatch in the vLLM attention collector (#1004).
- `int4_wo` MoE collector for vLLM 0.19.0 (#957).
- Remove reference to deleted `wideep_mlp` ops (#881).
- Attention data for QWEN35 (#913) and additional attention data for vLLM + SGLang / QWEN35 (#926).
- DeepSeek-V4 attention extrapolation: Fall back to smaller-batch extrapolation when DeepSeek-V4 attention cubic interpolation fails (#996).
- DSv4 MoE workspace sized by hidden width (#1055).
- DSA context interpolation aligned across `topk` boundary (#903).
- MLA KV cache size derived from config, with Kimi K2.5 coverage (#912).
- Balanced EP routing aligned with rank-0 workload projection (#837).
- FP8 block config restored and power-law routing refined (#876).
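For context on the MLA KV cache item above: multi-head latent attention caches one compressed KV latent plus a decoupled RoPE key per token per layer, rather than per-head keys and values, so the cache size follows directly from the model config. The sketch below assumes Hugging Face style field names (`kv_lora_rank`, `qk_rope_head_dim`, `num_hidden_layers`) and uses illustrative values that are not read from any shipped config; the function name is hypothetical.

```python
def mla_kv_cache_bytes_per_token(config, bytes_per_element=2):
    """Estimate MLA KV cache bytes for one token across all layers.

    Per layer, MLA stores the compressed latent (kv_lora_rank elements)
    and the shared RoPE key (qk_rope_head_dim elements).
    """
    per_layer_elements = config["kv_lora_rank"] + config["qk_rope_head_dim"]
    return per_layer_elements * config["num_hidden_layers"] * bytes_per_element


# Illustrative values only (assumed, not taken from a specific model release).
example_config = {
    "kv_lora_rank": 512,
    "qk_rope_head_dim": 64,
    "num_hidden_layers": 61,
}
print(mla_kv_cache_bytes_per_token(example_config))  # 16-bit cache elements
```

Deriving the size from these fields, instead of hard-coding it, lets models with different latent or RoPE dimensions be sized without a code change.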
Generator, Validator, and Templates
- TRT-LLM templates for release 0.8.0: Added `cache_transceiver_config.backend` to the release/0.8.0 template baseline (#879).
- Naive config generation robustness for large models with memory-fit awareness (#925).
- `enumerate`: Accept `total_gpus` to cap candidates and size the min-fit floor (#887).
- Reuse matching perf-database mode instead of re-loading (#972).
- Preserve experiment prefix through the generator pipeline (#988).
- NaN handling in Pareto selection (#979).
- Select support-matrix versions semantically (#992).
- Filter invalid benchmark concurrencies (#902).
- Skip missing keys in final summary (#910).
- `llm-d` output improvements (#954).
- `safe_mkdir`: Resolve symlinks in allowed-path prefixes (#898).
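To show why NaN values need explicit handling in Pareto selection: every ordering comparison against NaN is false, so a row with a NaN metric can never be dominated and would otherwise slip onto the frontier. The sketch below drops such rows first. The tuple layout and function name are hypothetical and do not reflect AIConfigurator's internal data model.

```python
import math


def pareto_front(points):
    """Return names on the Pareto front.

    points: list of (name, throughput, latency); higher throughput and
    lower latency are better. Rows containing NaN are discarded up front.
    """
    valid = [p for p in points if not (math.isnan(p[1]) or math.isnan(p[2]))]
    front = []
    for name, thr, lat in valid:
        # A point is dominated if another is no worse on both metrics
        # and strictly better on at least one.
        dominated = any(
            o_thr >= thr and o_lat <= lat and (o_thr > thr or o_lat < lat)
            for _, o_thr, o_lat in valid
        )
        if not dominated:
            front.append(name)
    return front


points = [
    ("a", 100.0, 10.0),
    ("b", 80.0, 12.0),          # dominated by "a"
    ("c", float("nan"), 5.0),   # invalid measurement, dropped
    ("d", 120.0, 15.0),
]
print(pareto_front(points))
```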
Errors and Edge Cases
- Warn on implicit SLA defaults and document undocumented CLI flags (#730).
- Unify 16-bit float naming and usage on `bfloat16` (#895).
- Wheel build: Fail wheel build on unmaterialized Git instead of producing an inconsistent artifact (#880).
RC Cherry-picks to release/0.9.0
- 0.9.0 RC0 generator/validator NVBugs cherry-pick of #1079 (#1089).
- Webapp param name correction cherry-pick (#1105).
- 0.9.0 RC0 NVBugs cherry-pick of #1076 (#1127).
- Kimi-K2.5 default quant config in webapp (cherry-pick of #1108) (#1130).
Documentation
- Webapp UI text: Fix typos in webapp UI text (#900).
- Docs cleanup: Fix typos, broken link, and missing `cd` in docs (#915).
- DeepSeek-V4 SGLang image requirement: Documented the required SGLang image for DeepSeek-V4 (#1000).
- AIC auto-collect agent skill: Added the AIC auto-collect agent skill (#1006); removed the corresponding setup script in favor of the skill (#1057).
CI/CD and Testing
- Build/test split: Split the build-test workflow into unit / e2e matrix jobs (#928).
- Faster CI: Reverted the test-timeout increase and sped up unit tests ~5× (#935); slightly faster e2e tests (#937).
- Temporary timeout bump: Increased CI test timeout 30 → 60 min while diagnosing slow runs (#931).
- Workflows at top level: Moved all workflow files to top level (#967).
- Sanity-check chart workflow: Split the sanity-check chart workflow to support fork PRs (#987).
- E2E resilience: `deep-ep` failure no longer blocks the e2e workflow (#993); workflow no longer fails when no Pareto exists (#1003).
Other Changes
New Contributors
Thanks to everyone who contributed to this release:
- @Jont828 made their first contribution in #851
- @natoscott made their first contribution in #670
- @milesial made their first contribution in #912
- @yangeer made their first contribution in #902
- @littlefatfat made their first contribution in #942
- @devivasudevan made their first contribution in #925
Full Changelog
Full Changelog: v0.8.0...v0.9.0