
AIConfigurator Release v0.9.0

saturley-hall released this 27 May 14:09 (commit 2103aa2)

AIConfigurator - Release 0.9.0

Summary

AIConfigurator 0.9.0 broadens model, hardware, and deployment-target coverage and lays foundational SDK plumbing for the next wave of large MoE models. The release adds first-class DeepSeek-V4 support across attention, MHC, and MoE collectors and queries, brings DeepSeek-R1 to the vLLM backend, adds MiniMax-M2.7 (including the NVFP4 variant), and lands SILICON-mode profiles for QWEN 3.5 and Kimi K2.5.

Hardware coverage expands with B300 support on vLLM 0.19.0, estimate-only PCIe systems, RTX PRO 6000 Blackwell Server perf data across SGLang, vLLM 0.19.0, and TRT-LLM 1.3.0rc10, and MXFP4 MoE/attention data for gpt-oss on H200. A new llm-d deployment target joins the existing TRT-LLM/SGLang/vLLM targets, and a new AIC Rust core forward-pass estimator lands inside the SDK.

The collector gains --resume-retry-failed, the SDK becomes framework-agnostic in hybrid mode by sharing op data across frameworks, and a new per-op silicon-vs-empirical attribution view exposes where each prediction comes from. The CLI adds --strict-sla for opt-in TTFT+TPOT constraint filtering. The release also introduces a support-matrix regression view alongside a prediction-accuracy regression testing workflow, makes the container OpenShift-compatible under random UIDs, and unifies 16-bit float naming on bfloat16.

Numerous fixes harden DeepSeek-V4 attention extrapolation, vLLM ≥ 0.19 MLA/DSA collector compatibility, FP8 block config, balanced-EP routing, MLA KV cache sizing for Kimi K2.5, NaN handling in Pareto selection, and webapp defaults. A cluster of RC cherry-picks also closes out release/0.9.0 NVBugs.


Key Highlights

Models and Architectures

  • DeepSeek-V4 end-to-end: Module-level attention collect/query (#941), MHC collect/query (#942), and MoE modeling (#986) on top of the initial DeepSeek-V4 SDK support (#904); dsv4-flash collectors extended to Blackwell platforms (#1034).
  • DeepSeek-R1 on vLLM: Added DSR1 support for the vLLM backend (#852) and added DeepSeek R1 to the support matrix (#851).
  • MiniMax-M2.7 (FP8 and NVFP4): Added MiniMax-M2.7 and nvidia/MiniMax-M2.7-NVFP4 support (#964).
  • SILICON-mode coverage: QWEN 3.5 SILICON mode with TRT-LLM / SGLang / vLLM data (#738) and Kimi K2.5 SILICON mode (#757).
  • GLM-5: Added FP8 / NVFP4 support-matrix generation support for GLM-5 (#991).

Hardware and Backend Coverage

  • B300 on vLLM 0.19.0: Added B300 system support for vLLM 0.19.0 (#829).
  • RTX PRO 6000 Blackwell Server perf data: Perf tables across SGLang (#998), vLLM 0.19.0 (#999), and TRT-LLM 1.3.0rc10 (#1001).
  • gpt-oss on H200: MXFP4 MoE and attention data for gpt-oss models on H200 (#894).
  • Estimate-only PCIe systems: Added estimate-only PCIe system definitions so PCIe topologies can be evaluated even without a full empirical sweep (#980).
  • Intel XPU: Enabled oneCCL benchmarking support for XPU (#694).
  • SGLang 0.5.10: Made the SGLang collector compatible with SGLang 0.5.10 (#761).

Deployment Targets

  • llm-d deployment target: Added a new llm-d deployment target alongside TRT-LLM / SGLang / vLLM (#671), with follow-up output-quality improvements (#954).
  • OpenShift random UID: Made the container compatible with OpenShift's arbitrary-UID security context so AIConfigurator runs cleanly on OpenShift clusters (#670).

SDK and Modeling

  • AIC Rust core forward-pass estimator: A new Rust-implemented core forward-pass estimator lands in the SDK (#981).
  • Hybrid-mode op-data sharing: Op data is now shareable across frameworks in hybrid mode, so TRT-LLM / SGLang / vLLM can reuse the same underlying op measurements when projecting hybrid configurations (#997).
  • Per-op silicon vs. empirical attribution: Each operator's prediction can be attributed to its silicon-model vs. empirical-data source, surfacing exactly where projections come from (#956).
  • SDK package layout: Extracted interpolation.py and system_spec.py out of perf_database (#650) and converted models.py into a proper models/ package (#651); removed the hard-coded target_version (#1052).
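To make the per-op attribution idea concrete, here is a minimal sketch that tags each operator estimate with the source it came from and then totals latency by source. It assumes that "silicon" means a measured data point and "empirical" means a modeled fallback used when no measurement exists. The `OpEstimate` type, the `estimate_op` helper, and all numbers are hypothetical and are not AIConfigurator's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpEstimate:
    """Hypothetical per-operator prediction with its data source."""
    op: str
    latency_ms: float
    source: str  # "silicon" (measured) or "empirical" (modeled fallback)


def estimate_op(op, measured_ms, modeled_ms):
    """Prefer a measured value; otherwise fall back to the modeled one."""
    if op in measured_ms:
        return OpEstimate(op, measured_ms[op], "silicon")
    return OpEstimate(op, modeled_ms[op], "empirical")


# Illustrative numbers only: "moe" has no measurement, so it falls back.
measured_ms = {"attention": 1.5, "gemm": 0.5}
modeled_ms = {"attention": 1.0, "gemm": 0.25, "moe": 2.0}

estimates = [estimate_op(op, measured_ms, modeled_ms)
             for op in ("attention", "gemm", "moe")]

# Attribute total latency to each source.
totals = {}
for e in estimates:
    totals[e.source] = totals.get(e.source, 0.0) + e.latency_ms

for e in estimates:
    print(f"{e.op:10s} {e.latency_ms:5.2f} ms  ({e.source})")
print(totals)
```

A view like this makes it easy to spot projections that lean heavily on modeled values rather than measured ones.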

CLI and Collector Ergonomics

  • --strict-sla: New opt-in CLI flag for TTFT + TPOT constraint filtering so users can ask for configs that strictly satisfy both SLA targets simultaneously (#727).
  • --resume-retry-failed: New collector flag that retries only the previously failed entries on resume instead of re-running everything (#914).
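As an illustration of what filtering on both SLA targets at once means, the sketch below keeps only candidates whose TTFT and TPOT are both within their targets. The `Candidate` type, its field names, and the numbers are hypothetical; this is not how AIConfigurator represents results internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical projected result for one deployment configuration."""
    name: str
    ttft_ms: float  # time to first token
    tpot_ms: float  # time per output token
    tokens_per_s_per_gpu: float


def strict_sla_filter(candidates, ttft_target_ms, tpot_target_ms):
    """Keep only candidates that meet BOTH targets."""
    return [
        c for c in candidates
        if c.ttft_ms <= ttft_target_ms and c.tpot_ms <= tpot_target_ms
    ]


candidates = [
    Candidate("tp4", ttft_ms=280.0, tpot_ms=9.5, tokens_per_s_per_gpu=410.0),
    Candidate("tp2", ttft_ms=450.0, tpot_ms=8.0, tokens_per_s_per_gpu=520.0),
    Candidate("tp8", ttft_ms=190.0, tpot_ms=12.0, tokens_per_s_per_gpu=300.0),
]

kept = strict_sla_filter(candidates, ttft_target_ms=300.0, tpot_target_ms=10.0)
print([c.name for c in kept])
```

Here "tp2" has the highest throughput but misses the TTFT target, and "tp8" misses the TPOT target, so a strict filter leaves only "tp4".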

Support Matrix and Testing

  • Regression view + cleanup: Added a regression view to the support matrix and dropped the unused static generator (#976).
  • Combined cron + autofix: Combined the support-matrix tests into a single cron job and added a trigger for an autofix pipeline (#965).
  • Prediction-accuracy regression workflow: Added a workflow that regression-tests prediction accuracy against measured data (#978).

Features & Enhancements

CLI and APIs

  • --strict-sla flag for opt-in TTFT+TPOT constraint filtering (#727).
  • Collector --resume-retry-failed to retry only previously failed entries on resume (#914).
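The resume behavior behind --resume-retry-failed can be pictured as a selection step over a record of earlier outcomes. The sketch below is a guess at the general shape, not the collector's code: `select_entries`, the status strings, and the entry names are all hypothetical.

```python
def select_entries(all_entries, status_by_entry, retry_failed_only):
    """Choose which collection entries to run on resume.

    all_entries: ordered list of entry identifiers.
    status_by_entry: outcome recorded by an earlier run ("done" or "failed").
    retry_failed_only: stand-in for the --resume-retry-failed flag.
    """
    if not retry_failed_only:
        # Without the flag, this sketch simply re-runs everything.
        return list(all_entries)
    # With the flag, only entries recorded as failed are retried.
    return [e for e in all_entries if status_by_entry.get(e) == "failed"]


all_entries = ["gemm_bs1", "gemm_bs8", "attn_bs1", "attn_bs8"]
status_by_entry = {
    "gemm_bs1": "done",
    "gemm_bs8": "failed",
    "attn_bs1": "done",
    "attn_bs8": "failed",
}

to_run = select_entries(all_entries, status_by_entry, retry_failed_only=True)
print(to_run)
```

Keeping the original ordering in the returned list means a retried sweep visits entries in the same sequence as the first run.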

SDK and Modeling

  • AIC Rust core forward-pass estimator (#981).
  • Op data shareable across frameworks in hybrid mode (#997).
  • Per-op silicon vs. empirical attribution (#956).
  • Refactor: Extract interpolation.py and system_spec.py from perf_database (#650).
  • Refactor: Convert models.py to a models/ package (#651).
  • Refactor: Remove hard-coded target_version (#1052).

Models and Architectures

  • DeepSeek-V4 SDK support (#904); attention collect/query (#941); MHC collect/query (#942); MoE modeling (#986); dsv4-flash collectors extended to Blackwell (#1034).
  • DeepSeek-R1 for vLLM (#852); DeepSeek R1 added to the support matrix (#851).
  • MiniMax-M2.7 and nvidia/MiniMax-M2.7-NVFP4 (#964).
  • QWEN 3.5 SILICON mode with TRT-LLM / SGLang / vLLM data (#738).
  • Kimi K2.5 SILICON mode (#757).
  • GLM-5 FP8/NVFP4 support-matrix generation (#991).

Hardware and Backend Coverage

  • B300 support for vLLM 0.19.0 (#829).
  • Estimate-only PCIe systems (#980).
  • RTX PRO 6000 SGLang perf data (#998), TRT-LLM 1.3.0rc10 perf data (#1001), vLLM 0.19.0 perf data (#999).
  • gpt-oss on H200: MXFP4 MoE and attention data (#894).
  • Intel XPU: oneCCL benchmarking support (#694).
  • SGLang 0.5.10 collector compatibility (#761).

Deployment Targets

  • llm-d deployment target (#671).
  • OpenShift random-UID container compatibility (#670).

Support Matrix and Testing

  • Regression view + drop unused static generator (#976).
  • Combined cron + autofix pipeline for support-matrix testing (#965).
  • Prediction-accuracy regression testing workflow (#978).

Bug Fixes

Collectors and Models

  • Custom all_reduce indefinite hang on B200 systems (#754).
  • vLLM ≥ 0.19 MLA/DSA module collector compatibility (#864).
  • FA3 scheduler_metadata mismatch in the vLLM attention collector (#1004).
  • int4_wo MoE collector for vLLM 0.19.0 (#957).
  • Remove reference to deleted wideep_mlp ops (#881).
  • Attention data for QWEN35 (#913) and additional attention data for vLLM + SGLang / QWEN35 (#926).
  • DeepSeek-V4 attention extrapolation: Fall back to smaller-batch extrapolation when DeepSeek-V4 attention cubic interpolation fails (#996).
  • DSv4 MoE workspace sized by hidden width (#1055).
  • DSA context interpolation aligned across topk boundary (#903).
  • MLA KV cache size derived from config, with Kimi K2.5 coverage (#912).
  • Balanced EP routing aligned with rank-0 workload projection (#837).
  • FP8 block config restored and power-law routing refined (#876).

Generator, Validator, and Templates

  • TRT-LLM templates for release 0.8.0: Added cache_transceiver_config.backend to the release/0.8.0 template baseline (#879).
  • Naive config generation robustness for large models with memory-fit awareness (#925).
  • enumerate: Accept total_gpus to cap candidates and size the min-fit floor (#887).
  • Reuse matching perf-database mode instead of re-loading (#972).
  • Preserve experiment prefix through the generator pipeline (#988).
  • NaN handling in Pareto selection (#979).
  • Select support-matrix versions semantically (#992).
  • Filter invalid benchmark concurrencies (#902).
  • Skip missing keys in final summary (#910).
  • llm-d output improvements (#954).
  • safe_mkdir: Resolve symlinks in allowed-path prefixes (#898).
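The NaN item above deserves a short illustration, because NaN interacts badly with dominance checks: every comparison against NaN is false, so a point with a NaN metric can never be dominated and would always survive. A minimal sketch, assuming a two-objective front of higher throughput and lower latency, drops non-finite points before the dominance test. The function and data are hypothetical, not the project's code.

```python
import math


def pareto_front(points):
    """Return non-dominated (name, throughput, latency) tuples.

    Higher throughput and lower latency are better. Points with a NaN or
    infinite metric are dropped first, since comparisons against NaN are
    always False and would keep such points in the front.
    """
    finite = [p for p in points
              if math.isfinite(p[1]) and math.isfinite(p[2])]
    front = []
    for p in finite:
        dominated = any(
            q[1] >= p[1] and q[2] <= p[2] and (q[1] > p[1] or q[2] < p[2])
            for q in finite
        )
        if not dominated:
            front.append(p)
    return front


points = [
    ("a", 100.0, 10.0),
    ("b", 120.0, 12.0),
    ("c", 90.0, 11.0),          # dominated by "a"
    ("d", float("nan"), 5.0),   # invalid projection
]

front = pareto_front(points)
print([name for name, _, _ in front])
```

Without the finite-value filter, "d" would appear in the front despite having no usable throughput value.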

Errors and Edge Cases

  • Warn on implicit SLA defaults and document undocumented CLI flags (#730).
  • Unify 16-bit float naming and usage on bfloat16 (#895).
  • Wheel build: Fail wheel build on unmaterialized Git instead of producing an inconsistent artifact (#880).

RC Cherry-picks to release/0.9.0

  • 0.9.0 RC0 generator/validator NVBugs cherry-pick of #1079 (#1089).
  • Webapp param name correction cherry-pick (#1105).
  • 0.9.0 RC0 NVBugs cherry-pick of #1076 (#1127).
  • Kimi-K2.5 default quant config in webapp (cherry-pick of #1108) (#1130).

Documentation

  • Webapp UI text: Fixed typos in the webapp UI text (#900).
  • Docs cleanup: Fixed typos, a broken link, and a missing cd step in the docs (#915).
  • DeepSeek-V4 SGLang image requirement: Documented the required SGLang image for DeepSeek-V4 (#1000).
  • AIC auto-collect agent skill: Added the AIC auto-collect agent skill (#1006); removed the corresponding setup script in favor of the skill (#1057).

CI/CD and Testing

  • Build/test split: Split the build-test workflow into unit / e2e matrix jobs (#928).
  • Faster CI: Reverted the test-timeout increase and sped up unit tests ~5× (#935); slightly faster e2e tests (#937).
  • Temporary timeout bump: Increased CI test timeout 30 → 60 min while diagnosing slow runs (#931).
  • Workflows at top level: Moved all workflow files to top level (#967).
  • Sanity-check chart workflow: Split the sanity-check chart workflow to support fork PRs (#987).
  • E2E resilience: deep-ep failure no longer blocks the e2e workflow (#993); workflow no longer fails when no Pareto exists (#1003).


New Contributors

Thanks to everyone who contributed to this release.


Full Changelog

Full Changelog: v0.8.0...v0.9.0