AIConfigurator - Release 0.9.0

Summary

AIConfigurator 0.9.0 broadens model, hardware, and deployment-target coverage and lays SDK groundwork for the next wave of large MoE models. The release adds first-class DeepSeek-V4 support across attention, MHC, and MoE collectors and queries. It also brings DeepSeek-R1 to the vLLM backend, adds MiniMax-M2.7 (including the NVFP4 variant), and lands SILICON-mode profiles for QWEN 3.5 and Kimi K-2.5. Hardware coverage expands with:
- B300 support on vLLM 0.19.0
- estimate-only PCIe systems
- RTX PRO 6000 Blackwell Server perf data across SGLang, vLLM 0.19.0, and TRT-LLM 1.3.0rc10
- MXFP4 MoE/attention data for gpt-oss on H200

A new llm-d deployment target joins the existing TRT-LLM/SGLang/vLLM targets, and a new AIC Rust core forward-pass estimator lands in the SDK. The collector gains --resume-retry-failed. In hybrid mode, the SDK can now share op data across frameworks. A new per-op silicon-vs-empirical attribution view shows where each prediction comes from. The CLI adds --strict-sla for opt-in TTFT+TPOT constraint filtering. The release also introduces a support-matrix regression view and a prediction-accuracy regression testing workflow, makes the container OpenShift-compatible under random UIDs, and unifies 16-bit float naming on bfloat16.

Numerous fixes harden DeepSeek-V4 attention extrapolation, vLLM ≥ 0.19 MLA/DSA collector compatibility, FP8 block config, balanced-EP routing, MLA KV cache sizing for Kimi K2.5, NaN handling in Pareto selection, and webapp defaults. A set of RC cherry-picks closes out the release/0.9.0 NVBugs.

Key Highlights

Models and Architectures

DeepSeek-V4 end-to-end: Module-level attention collect/query (#941), MHC collect/query (#942), and MoE modeling (#986) on top of the initial DeepSeek-V4 SDK support (#904); dsv4-flash collectors extended to Blackwell platforms (#1034).
DeepSeek-R1 on vLLM: Added DeepSeek-R1 (DSR1) support for the vLLM backend (#852) and added DeepSeek R1 to the support matrix (#851).
MiniMax-M2.7 (FP8 and NVFP4): Added MiniMax-M2.7 and nvidia/MiniMax-M2.7-NVFP4 support (#964).
SILICON-mode coverage: QWEN 3.5 SILICON mode with TRT-LLM / SGLang / vLLM data (#738) and Kimi K-2.5 SILICON mode (#757).
GLM-5: Added FP8 / NVFP4 support-matrix generation for GLM-5 (#991).

Hardware and Backend Coverage

B300 on vLLM 0.19.0: Added B300 system support for vLLM 0.19.0 (#829).
RTX PRO 6000 Blackwell Server perf data: Perf tables across SGLang (#998), vLLM 0.19.0 (#999), and TRT-LLM 1.3.0rc10 (#1001).
gpt-oss on H200: MXFP4 MoE and attention data for gpt-oss models on H200 (#894).
Estimate-only PCIe systems: Added estimate-only PCIe system definitions so PCIe topologies can be evaluated even without a full empirical sweep (#980).
Intel XPU: Enabled oneCCL benchmarking support for XPU (#694).
SGLang 0.5.10: Made the SGLang collector compatible with SGLang 0.5.10 (#761).

Deployment Targets

llm-d deployment target: Added a new llm-d deployment target alongside TRT-LLM / SGLang / vLLM (#671), with follow-up output-quality improvements (#954).
OpenShift random UID: Made the container compatible with OpenShift's arbitrary-UID security context so AIConfigurator runs cleanly on OpenShift clusters (#670).

SDK and Modeling

AIC Rust core forward-pass estimator: A new Rust-implemented core forward-pass estimator lands in the SDK (#981).
Hybrid-mode op-data sharing: Op data is now shareable across frameworks in hybrid mode, so TRT-LLM / SGLang / vLLM can reuse the same underlying op measurements when projecting hybrid configurations (#997).
Per-op silicon vs. empirical attribution: Each operator's prediction can be attributed to its silicon-model vs. empirical-data source, surfacing exactly where projections come from (#956).
SDK package layout: Extracted interpolation.py and system_spec.py out of perf_database (#650) and converted models.py into a proper models/ package (#651); removed the hard-coded target_version (#1052).
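To make the per-op attribution idea (#956) concrete, here is a minimal sketch. It assumes each operator's prediction carries a latency and a source label, and it rolls those up into a per-source share of the total. `OpEstimate`, `attribute_by_source`, `per_op_report`, and the labels "silicon" and "empirical" are hypothetical illustrations, not AIConfigurator's actual API or data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OpEstimate:
    """Hypothetical record: one operator's predicted latency and where it came from."""
    name: str
    latency_ms: float
    source: str  # assumed labels, e.g. "silicon" or "empirical"


def attribute_by_source(ops):
    """Return the fraction of total predicted latency contributed by each source."""
    totals = defaultdict(float)
    for op in ops:
        totals[op.source] += op.latency_ms
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {source: t / grand_total for source, t in totals.items()}


def per_op_report(ops):
    """Return one 'name: latency (source)' line per operator, largest latency first."""
    ordered = sorted(ops, key=lambda op: op.latency_ms, reverse=True)
    return [f"{op.name}: {op.latency_ms:.2f} ms ({op.source})" for op in ordered]


ops = [OpEstimate("attention", 3.0, "silicon"), OpEstimate("moe", 1.0, "empirical")]
print(attribute_by_source(ops))  # {'silicon': 0.75, 'empirical': 0.25}
print(per_op_report(ops))        # ['attention: 3.00 ms (silicon)', 'moe: 1.00 ms (empirical)']
```

Keeping the source on each operator, rather than only on the final number, lets a report answer both "which ops dominate latency" and "how much of this projection rests on each kind of data."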

CLI and Collector Ergonomics

--strict-sla: New opt-in CLI flag for TTFT + TPOT constraint filtering so users can ask for configs that strictly satisfy both SLA targets simultaneously (#727).
--resume-retry-failed: New collector flag that retries only the previously failed entries on resume instead of re-running everything (#914).
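The following minimal sketch shows what "strictly satisfy both SLA targets" means for a list of candidate configs. It assumes each candidate carries a projected TTFT and TPOT. `Candidate` and `strict_sla_filter` are hypothetical names, not the CLI's internals. Treating a value exactly equal to the target as passing is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical projected result for one deployment config."""
    name: str
    ttft_ms: float  # projected time to first token
    tpot_ms: float  # projected time per output token


def strict_sla_filter(candidates, ttft_target_ms, tpot_target_ms):
    """Keep only candidates that meet BOTH targets; missing either one rejects the config."""
    return [
        c for c in candidates
        if c.ttft_ms <= ttft_target_ms and c.tpot_ms <= tpot_target_ms
    ]


configs = [
    Candidate("tp4", ttft_ms=180.0, tpot_ms=18.0),  # meets both targets
    Candidate("tp2", ttft_ms=450.0, tpot_ms=15.0),  # misses the TTFT target
    Candidate("tp8", ttft_ms=120.0, tpot_ms=35.0),  # misses the TPOT target
]
print([c.name for c in strict_sla_filter(configs, 200.0, 20.0)])  # ['tp4']
```

Under this strict conjunction, a config that is excellent on one metric but misses the other is dropped. That is why the filter is opt-in.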

Support Matrix and Testing

Regression view + cleanup: Added a regression view to the support matrix and dropped the unused static generator (#976).
Combined cron + autofix: Combined the support-matrix tests into a single cron job that triggers an autofix pipeline (#965).
Prediction-accuracy regression workflow: Added a workflow that regression-tests prediction accuracy against measured data (#978).

Features & Enhancements

CLI and APIs

--strict-sla flag for opt-in TTFT+TPOT constraint filtering (#727).
Collector --resume-retry-failed to retry only previously failed entries on resume (#914).
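The resume behavior of --resume-retry-failed can be sketched as selecting, from the planned sweep, only the entries an earlier run recorded as failed. The JSON status record, the entry names, and `select_for_retry` are all hypothetical. These notes do not describe how the collector actually persists its progress.

```python
import json


def select_for_retry(planned, status_record_json):
    """Return planned entries whose earlier status was "failed", in sweep order.

    Entries that succeeded, or that never appear in the record, are not selected.
    """
    status = json.loads(status_record_json)
    return [entry for entry in planned if status.get(entry) == "failed"]


planned = ["gemm_m1024", "gemm_m2048", "attn_bs8", "attn_bs16"]
record = json.dumps({"gemm_m1024": "ok", "gemm_m2048": "failed", "attn_bs16": "failed"})
print(select_for_retry(planned, record))  # ['gemm_m2048', 'attn_bs16']
```

Iterating over the planned list rather than the record keeps retries in the original sweep order. Skipping never-attempted entries (attn_bs8 here) is this sketch's assumption. The real flag may handle them differently.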

SDK and Modeling

AIC Rust core forward-pass estimator (#981).
Op data shareable across frameworks in hybrid mode (#997).
Per-op silicon vs. empirical attribution (#956).
Refactor: Extract interpolation.py and system_spec.py from perf_database (#650).
Refactor: Convert models.py to a models/ package (#651).
Refactor: Remove hard-coded target_version (#1052).

Models and Architectures

DeepSeek-V4 SDK support (#904); attention collect/query (#941); MHC collect/query (#942); MoE modeling (#986); dsv4-flash collectors extended to Blackwell (#1034).
DeepSeek-R1 for vLLM (#852); DeepSeek R1 added to the support matrix (#851).
MiniMax-M2.7 and nvidia/MiniMax-M2.7-NVFP4 (#964).
QWEN 3.5 SILICON mode with TRT-LLM / SGLang / vLLM data (#738).
Kimi K-2.5 SILICON mode (#757).
GLM-5 FP8/NVFP4 support-matrix generation (#991).

Hardware and Backend Coverage

B300 support for vLLM 0.19.0 (#829).
Estimate-only PCIe systems (#980).
RTX PRO 6000 SGLang perf data (#998), TRT-LLM 1.3.0rc10 perf data (#1001), vLLM 0.19.0 perf data (#999).
gpt-oss on H200: MXFP4 MoE and attention data (#894).
Intel XPU: oneCCL benchmarking support (#694).
SGLang 0.5.10 collector compatibility (#761).

Deployment Targets

llm-d deployment target (#671).
OpenShift random-UID container compatibility (#670).

Support Matrix and Testing

Regression view + drop unused static generator (#976).
Combined cron + autofix pipeline for support-matrix testing (#965).
Prediction-accuracy regression testing workflow (#978).

Bug Fixes

Collectors and Models

Custom all_reduce indefinite hang on B200 systems (#754).
vLLM ≥ 0.19 MLA/DSA module collector compatibility (#864).
FA3 scheduler_metadata mismatch in the vLLM attention collector (#1004).
int4_wo MoE collector for vLLM 0.19.0 (#957).
Remove reference to deleted wideep_mlp ops (#881).
Attention data for QWEN35 (#913) and additional attention data for vLLM + SGLang / QWEN35 (#926).
DeepSeek-V4 attention extrapolation: Fall back to smaller-batch extrapolation when DeepSeek-V4 attention cubic interpolation fails (#996).
DSv4 MoE workspace sized by hidden width (#1055).
DSA context interpolation aligned across topk boundary (#903).
MLA KV cache size derived from config, with Kimi K2.5 coverage (#912).
Balanced EP routing aligned with rank-0 workload projection (#837).
FP8 block config restored and power-law routing refined (#876).
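The fix that derives MLA KV cache size from the model config (#912) can be illustrated with the standard MLA cache layout. Per layer, each token stores one compressed KV latent plus one decoupled RoPE key shared across heads, rather than full per-head K and V. This is a sketch only: it assumes config field names in the style of HuggingFace DeepSeek-V3 configs (num_hidden_layers, kv_lora_rank, qk_rope_head_dim). It ignores extra layers such as MTP modules and any scale metadata for quantized caches. The toy config values are illustrative and are not Kimi K2.5's.

```python
def mla_kv_cache_bytes_per_token(config, bytes_per_elem=2):
    """Per-token KV cache footprint of an MLA model, read from its config dict.

    Per layer, MLA caches one compressed KV latent (kv_lora_rank values) plus one
    decoupled RoPE key (qk_rope_head_dim values), shared across all heads.
    """
    per_layer_elems = config["kv_lora_rank"] + config["qk_rope_head_dim"]
    return config["num_hidden_layers"] * per_layer_elems * bytes_per_elem


def mla_kv_cache_bytes(config, num_tokens, bytes_per_elem=2):
    """Total KV cache bytes for num_tokens cached tokens (all sequences combined)."""
    return num_tokens * mla_kv_cache_bytes_per_token(config, bytes_per_elem)


toy_config = {"num_hidden_layers": 2, "kv_lora_rank": 512, "qk_rope_head_dim": 64}
print(mla_kv_cache_bytes_per_token(toy_config))                 # 2304 (2 layers * 576 * 2 bytes)
print(mla_kv_cache_bytes(toy_config, 1000, bytes_per_elem=1))   # 1152000
```

Reading these dimensions from the config, instead of hard-coding them per model, is what lets the same sizing logic cover new MLA models.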

Generator, Validator, and Templates

TRT-LLM templates for release 0.8.0: Added cache_transceiver_config.backend to the release/0.8.0 template baseline (#879).
Naive config generation robustness for large models with memory-fit awareness (#925).
enumerate: Accept total_gpus to cap candidates and size the min-fit floor (#887).
Reuse matching perf-database mode instead of re-loading (#972).
Preserve experiment prefix through the generator pipeline (#988).
NaN handling in Pareto selection (#979).
Select support-matrix versions semantically (#992).
Filter invalid benchmark concurrencies (#902).
Skip missing keys in final summary (#910).
llm-d output improvements (#954).
safe_mkdir: Resolve symlinks in allowed-path prefixes (#898).
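Two of the generator fixes above address classic pitfalls: NaN values in Pareto selection (#979) and semantic version ordering (#992). The notes do not say how either fix is implemented, so the following is a generic sketch of both problems.

With NaN, every comparison is False. A NaN point therefore can never be dominated and would always slip onto the frontier. Dropping such points is one reasonable policy; the actual fix might report or rank them instead.

Lexicographic string sorting puts "0.9.0" after "0.10.0". The version grammar handled here (dotted numbers plus an optional a/b/rc suffix) is a simplified subset of PEP 440. `packaging.version.Version` covers the full grammar.

```python
import math
import re


def pareto_front(points):
    """Points not dominated by any other point, maximizing both coordinates.

    Points containing NaN are dropped first: every comparison with NaN is False,
    so a NaN point could never be dominated and would otherwise always survive.
    """
    clean = [p for p in points if not any(math.isnan(v) for v in p)]

    def dominates(a, b):
        return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

    return [p for p in clean if not any(dominates(q, p) for q in clean)]


_VERSION = re.compile(r"^(\d+(?:\.\d+)*)(?:(a|b|rc)(\d+))?$")
_PRE_RANK = {"a": 0, "b": 1, "rc": 2}


def version_key(version):
    """Sort key for versions like '0.19.0' or '1.3.0rc10' (numeric, pre-release aware)."""
    m = _VERSION.match(version)
    if m is None:
        raise ValueError(f"unsupported version string: {version!r}")
    release = [int(part) for part in m.group(1).split(".")]
    while len(release) > 1 and release[-1] == 0:
        release.pop()  # treat '0.9' and '0.9.0' as equal
    if m.group(2) is None:
        pre = (1, 0, 0)  # a final release sorts after all of its pre-releases
    else:
        pre = (0, _PRE_RANK[m.group(2)], int(m.group(3)))
    return (tuple(release), pre)


print(pareto_front([(10.0, 5.0), (8.0, 8.0), (7.0, 4.0), (float("nan"), 9.0)]))
# [(10.0, 5.0), (8.0, 8.0)]
print(max(["0.9.0", "0.10.0"]), max(["0.9.0", "0.10.0"], key=version_key))
# 0.9.0 0.10.0
print(sorted(["1.3.0", "1.3.0rc10", "1.3.0rc9"], key=version_key))
# ['1.3.0rc9', '1.3.0rc10', '1.3.0']
```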

Errors and Edge Cases

Warn on implicit SLA defaults and document previously undocumented CLI flags (#730).
Unify 16-bit float naming and usage on bfloat16 (#895).
Wheel build: Fail wheel build on unmaterialized Git instead of producing an inconsistent artifact (#880).

RC Cherry-picks to `release/0.9.0`

0.9.0 RC0 generator/validator NVBugs cherry-pick of #1079 (#1089).
Webapp param name correction cherry-pick (#1105).
0.9.0 RC0 NVBugs cherry-pick of #1076 (#1127).
Kimi-K2.5 default quant config in webapp (cherry-pick of #1108) (#1130).

Documentation

Webapp UI text: Fix typos in webapp UI text (#900).
Docs cleanup: Fix typos, broken link, and missing cd in docs (#915).
DeepSeek-V4 SGLang image requirement: Documented the required SGLang image for DeepSeek-V4 (#1000).
AIC auto-collect agent skill: Added the AIC auto-collect agent skill (#1006); removed the corresponding setup script in favor of the skill (#1057).

CI/CD and Testing

Build/test split: Split the build-test workflow into unit / e2e matrix jobs (#928).
Faster CI: Reverted the test-timeout increase and sped up unit tests ~5× (#935); slightly faster e2e tests (#937).
Temporary timeout bump: Increased CI test timeout 30 → 60 min while diagnosing slow runs (#931).
Workflows at top level: Moved all workflow files to top level (#967).
Sanity-check chart workflow: Split the sanity-check chart workflow to support fork PRs (#987).
E2E resilience: deep-ep failure no longer blocks the e2e workflow (#993); workflow no longer fails when no Pareto exists (#1003).

Other Changes

Automated support-matrix updates (#686, #929, #953, #970, #989, #1005).

New Contributors

Thanks to everyone who contributed to this release:

@Jont828 made their first contribution in #851
@natoscott made their first contribution in #670
@milesial made their first contribution in #912
@yangeer made their first contribution in #902
@littlefatfat made their first contribution in #942
@devivasudevan made their first contribution in #925

Full Changelog

Full Changelog: v0.8.0...v0.9.0
