AIConfigurator v0.8.0
AIConfigurator - Release 0.8.0
Summary
AIConfigurator 0.8.0 expands offline-optimization coverage to every actively supported Dynamo backend and hardware combination, opens up programmatic access for the Dynamo Planner and Profiler, and adds a new cli estimate command that predicts TTFT, TPOT, and power for a single config without running a full sweep. The release lands first-class Intel XPU support contributed by Intel, DeepSeek-V3.2 (DSA) modeling across all three backends, and a unified HYBRIDMOE family covering Llama 4 Scout/Maverick and MiMo-V2-Flash. Perf-database tables now cover vLLM 0.14/0.16/0.17/0.19, SGLang 0.5.8 → 0.5.10, and TensorRT-LLM 1.2.0rc5 → 1.3.0+ across B200 SXM, GB200, GB300, H100 SXM, A100 SXM, L40S, RTX Pro 6000 Blackwell Server, and Intel B60. The collector itself got faster and more reliable, and the generator was rewired to the new --trtllm.<group>.<key> dynamic CLI-flag syntax introduced in Dynamo. Numerous fixes harden TRT-LLM template correctness, OOM handling, generator-validator usage, and SGLang/vLLM compatibility.
Breaking Changes & Deprecations
- Minimum Python is now 3.10 (was 3.9), driven by Gradio's Python ≥ 3.10 requirement (#678).
- aiconfigurator eval subcommand removed. The eval module and its automation scripts have been deleted; users who scripted aiconfigurator eval must migrate (#504).
- Llama 2, Qwen 2.5, and Mixtral are no longer supported. Their model_configs/ JSONs and support-matrix entries have been removed; pin to v0.7.0 if you require these families (#553).
Key Highlights
Backend and Hardware Coverage
- vLLM 0.14.0 → 0.19.0: Refreshed perf data for vLLM 0.14.0, 0.14.1, 0.16.0, 0.17.0, and 0.19.0 across B200 SXM, GB200, GB300, H100 SXM, L40S, RTX Pro 6000 Blackwell Server, and A100 SXM (#517, #560, #707, #712, #883, #892).
- SGLang 0.5.8 → 0.5.10: Refreshed SGLang perf data through 0.5.10 with new AllReduce collector support (#508), 0.5.9 collector compatibility (#531), custom_allreduce_perf.txt files (#607), and the 0.5.10 cherry-pick of collector fixes plus refreshed perf tables (#952).
- TensorRT-LLM 1.2.0rc5 → 1.3.0+: Added FP8 block in TRT-LLM on Blackwell sm100 (#480), GPT-OSS simulation on Blackwell (#575), w4a16_mxfp4/w4a8_mxfp4_mxfp8 MoE data for B200 / GB200 (#590), NVLink OneSided alltoall (#536), and refreshed data through 1.3.0+ (#502, #611, #617, #618, #635, #638).
- GB300 and Intel B60: Added a GB300 system entry (#352), renamed gb200_sxm → gb200 for consistency (#362), and added first-class Intel XPU support with a new --device flag and B60 system configuration (#246).
Dynamo Planner and Profiler Integration
- Standalone picking API: Decoupled pick_default, pick_min_gpus, and pick_max_throughput from TaskConfig so external pipelines can call them directly (#386).
- Optimization-type-aware parallelization: Added pick_optimization_type so the Dynamo profiler can request configs optimized for throughput or latency without explicit SLA targets: TP for dense, TEP for MoE, DEP for MLA+MoE+throughput (#647).
- --trtllm.* dynamic flags: Translated extra engine args YAML into the new --trtllm.<group>.<key> flag syntax introduced in Dynamo#7335, replacing the old --override-engine-args JSON path (#693).
- Dynamo 1.1.0 backend templates: Bundled vLLM 0.19.0, SGLang 0.5.10.post1, and TRT-LLM 1.3.0rc11 templates (#759).
- Hetero disagg GPU budget: Added max_prefill_gpus and max_decode_gpus so callers can express asymmetric resource allocation (e.g. 4 prefill GPUs + 12 decode GPUs) that total_gpus alone cannot (#709).
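To illustrate why separate prefill and decode caps express something total_gpus alone cannot, here is a minimal sketch. The helper enumerate_splits and its step parameter are hypothetical and are not part of the AIConfigurator API; only the names max_prefill_gpus, max_decode_gpus, and total_gpus come from the notes above.

```python
from itertools import product


def enumerate_splits(total_gpus, max_prefill_gpus=None, max_decode_gpus=None, step=4):
    """Yield (prefill_gpus, decode_gpus) pairs that fit every budget.

    Hypothetical helper for illustration only; not AIConfigurator code.
    """
    candidates = range(step, total_gpus + 1, step)
    for prefill, decode in product(candidates, candidates):
        if prefill + decode > total_gpus:
            continue  # exceeds the overall budget
        if max_prefill_gpus is not None and prefill > max_prefill_gpus:
            continue  # exceeds the prefill-side cap
        if max_decode_gpus is not None and decode > max_decode_gpus:
            continue  # exceeds the decode-side cap
        yield prefill, decode


# 16 GPUs overall, at most 4 for prefill and 12 for decode.
splits = list(enumerate_splits(16, max_prefill_gpus=4, max_decode_gpus=12))
```

With only total_gpus=16, a symmetric split such as (8, 8) stays admissible; the two per-side caps are what rule it out.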
cli estimate Single-Point Prediction
- New command and Python API: aiconfigurator cli estimate and cli_estimate() predict TTFT, TPOT, and power for a single model/system/config combination without running a full sweep or SLA optimization, in both aggregated and disaggregated modes, with separate prefill/decode parallelism, decode systems, and quantization overrides (#387).
- Result parity with cli default: Single-point estimates now align with the corresponding row in best_config_topn.csv from cli default, with a uniform 1.8× TTFT correction factor applied across the estimate, autoscale, and pareto disagg paths (#539). Chunked-prefill is exposed as a CLI parameter (#614), and end-to-end tests assert column and value parity (#557).
Intel XPU Support
- First-class XPU collector and SDK: Added Intel XPU support across the collector and SDK with a --device flag and B60 system configuration (#246), then enabled FP8 GEMM collection (#591), MoE projection (#625), FP8 projection (#645), FP8 MoE collection for vLLM (#675), and MXFP4 MoE / GPT-OSS coverage (#699). Contributed by @Spycsh (Intel).
DeepSeek-V3.2 (DSA) and New Models
- DSA support across backends: Added module-level MLA/DSA attention collectors and perf-database support for DeepSeek-V3 / V3.2 across TensorRT-LLM (#470), vLLM (#643), and SGLang (#657), enabled DSv32 + GLM-5 modeling (#662), and projected global MoE routing onto rank-1 workload so SGLang EP MoE benchmarking preserves the intended 128-top-8 routing distribution (#753).
- MiniMax-M2.5 and HYBRIDMOE family: Added MiniMax-M2.5 with Multi-Token Prediction (MTP) scaling for MoE generation throughput (#405) and the NVFP4-quantized variant nvidia/MiniMax-M2.5-NVFP4 for vLLM and TRT-LLM (#685); introduced a unified HYBRIDMOE family covering Llama 4 Scout/Maverick and MiMo-V2-Flash with shared interleaved SWA/global + dense/MoE structure (#408).
- Other model coverage: Added MTP simulation for Qwen-series models (#472), QWEN3.5 hybrid-mode support (#667), Kimi-K2.5 quant detection through nested text_config (#454), and updated the HuggingFace ID for Nemotron-3-Super-120B (#577).
Features & Enhancements
CLI and APIs
- cli estimate: New single-point estimate command and cli_estimate() Python API for TTFT/TPOT/power prediction (#387, #539, #557, #614).
- --system all and --backend all: Run cli support against the full system × backend matrix in a single command (#435).
- HYBRID recommendation on SILICON failure: When cli fails in SILICON mode, AIConfigurator recommends HYBRID mode instead of leaving the user stuck (#411).
- Unify CLI args: Unified CLI argument naming across subcommands and updated docs and examples to match (#345).
Generator and Validator
- Optimization-type-aware picking and standalone API: Standalone pick_default/pick_min_gpus/pick_max_throughput (#386), exposed parallelization metadata on enumerate_profiling_configs (#401), and pick_optimization_type for SLA-free optimization (#647).
- Generator improvements: Emit the matching benchmark command alongside each config (#334), strip empty lines from generated config files (#347), hook into the Dynamo planner profiler's config-gen path (#361), align the generator's default backend version with the latest Dynamo version (#366), expose real-GPU enumeration logic for external sweep callers (#373), support K8S_ETCD_ENDPOINTS substitution in K8s deployment templates (#554), split PD wideep/eplb config into separate emission paths (#565), add an sflow rendering target for nv-sflow integration (#584), accept generator-override params via cli_default() (#592), bump default Dynamo image and version references to 1.0.0 (#594), translate extra engine args YAML into the --trtllm.<group>.<key> syntax (#693), and bundle Dynamo 1.1.0 backend templates (#759).
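The translation from nested engine args to --trtllm.<group>.<key> flags can be pictured as a dictionary flattening. The sketch below is an assumption about the general shape only: the function to_trtllm_flags is hypothetical, the input is a plain dict standing in for parsed YAML, and the way values are serialized (separate argv entry, str() formatting) is a guess rather than the generator's actual behavior.

```python
def to_trtllm_flags(engine_args, prefix="--trtllm"):
    """Flatten a nested mapping into dotted CLI flags.

    Hypothetical helper; value formatting is an assumption, not Dynamo's spec.
    """
    flags = []
    for key, value in engine_args.items():
        name = f"{prefix}.{key}"
        if isinstance(value, dict):
            # Recurse so nested groups become additional dotted segments.
            flags.extend(to_trtllm_flags(value, prefix=name))
        else:
            flags.extend([name, str(value)])
    return flags


# Example input standing in for a parsed extra-engine-args YAML file.
example = {"kv_cache_config": {"free_gpu_memory_fraction": 0.8}}
flags = to_trtllm_flags(example)
```

Compared with a single JSON blob passed through --override-engine-args, one flag per key keeps each override visible and individually editable on the command line.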
SDK and Modeling
- Hetero disagg GPU budget: max_prefill_gpus/max_decode_gpus in find_best_disagg_result_under_constraints (#709).
- Static latency-only fast path: run_static_latency_only() skips summary/dataframe materialization for replay-style callers (#665).
- GEMM lookup optimization: Exact-hit shortcut and 1D interpolation for partial hits, reducing per-query latency (#721).
- Full database mode for TRT-LLM WideEP: SOL / EMPIRICAL modes for TRT-LLM WideEP query functions (#485).
- Configurable rate-matching factors: rate_matching degradation factors are now configurable instead of hardcoded (#615).
- Better missing perf-data handling: Structured error at query time instead of an internal exception (#473).
- Attention imbalance correction: Added an attention imbalance correction scale interface and Qwen3 qk-norm I/O (#600).
- KV cache capacity check: Prevent batch-size oversubscription before launching a sweep (#652).
- MoE extrapolation by MFU: Switched MoE extrapolation to scale according to MFU instead of a fixed factor, improving accuracy when projecting compute for unmeasured shapes (#537).
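The GEMM lookup item above (exact-hit shortcut, then 1D interpolation) can be sketched as follows. This is an illustration under stated assumptions, not the SDK's implementation: the table layout, the function lookup_latency, and the reading of "partial hit" as "n and k are measured but m is not" are all hypothetical.

```python
from bisect import bisect_left


def lookup_latency(table, m, n, k):
    """Return a latency for GEMM shape (m, n, k).

    table maps (n, k) -> {m: latency_ms}. Hypothetical layout for illustration.
    """
    row = table[(n, k)]
    if m in row:
        return row[m]  # exact hit: no interpolation work at all
    measured = sorted(row)
    i = bisect_left(measured, m)
    if i == 0 or i == len(measured):
        raise KeyError(f"m={m} lies outside the measured range for (n, k)=({n}, {k})")
    lo, hi = measured[i - 1], measured[i]
    weight = (m - lo) / (hi - lo)
    # Linear interpolation along m only; n and k already matched exactly.
    return row[lo] + weight * (row[hi] - row[lo])


table = {(4096, 4096): {128: 0.10, 256: 0.20}}
exact = lookup_latency(table, 128, 4096, 4096)
interpolated = lookup_latency(table, 192, 4096, 4096)
```

The shortcut matters because exact hits skip the sort and bisect steps entirely, and the partial-hit path touches only one axis instead of interpolating over the full (m, n, k) grid.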
Models and Architectures
- DeepSeek-V3.2 (DSA): Module-level MLA/DSA attention collectors and perf-database support across TRT-LLM (#470), vLLM (#643), and SGLang (#657); DSv32 + GLM-5 (#662); SGLang EP MoE rank-1 projection (#753).
- MiniMax-M2.5 / HYBRIDMOE / Qwen / Kimi: MiniMax-M2.5 with MTP (#405), nvidia/MiniMax-M2.5-NVFP4 (#685), unified HYBRIDMOE family for Llama 4 Scout/Maverick + MiMo-V2-Flash (#408), Qwen MTP simulation (#472), QWEN3.5 hybrid mode (#667), Kimi-K2.5 nested text_config quant detection (#454), Nemotron-3-Super-120B HF ID update (#577).
Collectors and Data
- Hardware/backend coverage: Refreshed perf-database tables across vLLM 0.14.0/0.14.1/0.16.0/0.17.0/0.19.0, SGLang 0.5.8.post1/0.5.9/0.5.10, and TensorRT-LLM 1.2.0rc5/1.2.0rc6/1.3.0+ on B200 SXM, GB200, GB300, H100 SXM, A100 SXM, L40S, RTX Pro 6000 Blackwell Server, and Intel B60 (#388, #390, #466, #502, #517, #520, #523, #529, #531, #535, #538, #547, #559, #560, #567, #576, #595, #607, #611, #617, #618, #623, #624, #628, #629, #630, #631, #632, #635, #638, #668, #680, #682, #707, #712, #883, #892, #952).
- Collector performance and reliability: Sped up the slow TensorRT-LLM MoE collector (#351), improved the MNNVL alltoall communication collector (#353), cut Blackwell GEMM collection time via weight caching and adaptive L2 bypass (#588), sped up the MoE collector (#524), added resume capability so interrupted runs no longer restart from scratch (#469), added --profile (#523) and --model-path filters (#627), fixed log-perf stalling on NFS (#427), explicitly managed collector versions (#466), disabled core dumps for GPU crashes in worker and main processes (#587), adapted the GEMM collector for the zig-zag pattern (#503), and unified TRT-LLM alltoall MoE dispatch on sm100 (#566).
- TensorRT-LLM additions: FP8 block on Blackwell sm100 (#480), GPT-OSS simulation on Blackwell (#575), w4a16_mxfp4/w4a8_mxfp4_mxfp8 MoE data (#590), NVLink OneSided alltoall (#536), FP4 GEMM data (#672).
- vLLM additions: NVFP4 vLLM data (#546), NVFP4 vLLM MoE (#660), w4a16_mxfp4 MoE for GPT-OSS (#673), DSA support (#643), and 0.17.0 collector compatibility for DSA, MLA, and MoE MXFP4 on B200 (#718).
- SGLang improvements: AllReduce collector (#508), 0.5.9 collector support (#531), custom_allreduce_perf.txt for 0.5.9 (#607), stop passing invalid moe-dense-tp-size (#613), Qwen3-235B WideEP and EP > 1 in non-WideEP mode (#684).
- Intel XPU additions: First-class XPU support (#246), FP8 GEMM (#591), MoE projection (#625), FP8 projection (#645), FP8 MoE for vLLM (#675), MXFP4 MoE / GPT-OSS coverage (#699).
Hardware Definitions
- GB300 and rename: Added the GB300 system entry (#352), renamed gb200_sxm → gb200 for consistency (#362), corrected inter-node bandwidth in the system YAML (#477), and honored per-node GPU topology by disabling WideEP for small MoE models that don't span nodes (DYN-2544, #687).
Web App
- Support Matrix tab: New Support Matrix tab visualizing support_matrix.csv with PASS/FAIL per system × backend and a top-10 errors view per system, with lazy perf-database loading for faster startup (#568).
- Webapp upkeep: Updated Gradio (#475), removed "gradio not installed" warnings on import (#426), bumped the webapp Docker image (#364), and fixed webapp visibility under the new end-to-end workflow (#436).
Logging and UX
- CLI ergonomics: Hid and deduplicated spammy logs (#441), added colored logging (#452), cleaned up ANSI codes when stdout is redirected and added a --no-color flag (#710), switched aiconfigurator cli exp --help to a YAML task example (#715), and replaced Unicode box-drawing with ASCII in piped output so cat -v no longer renders garbage (#736).
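The usual pattern behind a --no-color flag and clean redirected output looks roughly like the sketch below. It is a generic illustration, not AIConfigurator's logging code; use_color and render are hypothetical names, and the regex covers only SGR color sequences.

```python
import re
import sys

# Matches SGR sequences such as "\x1b[31m" (red) and "\x1b[0m" (reset).
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def use_color(no_color_flag, stream=sys.stdout):
    """Color only when the user did not opt out and the stream is a terminal."""
    return not no_color_flag and stream.isatty()


def render(text, color_enabled):
    """Return text unchanged when color is on, otherwise strip ANSI codes."""
    return text if color_enabled else ANSI_SGR.sub("", text)


message = "\x1b[31merror\x1b[0m: model does not fit"
plain = render(message, color_enabled=False)
```

Checking isatty() is what keeps escape codes out of files and pipes even when the flag is not passed.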
Bug Fixes
TensorRT-LLM Templates
- Engine template correctness: Fixed build_config nesting and missing backend field in version-specific engine templates (#434), made older TRT-LLM templates support build_config for the generator (#464), restored top-level max_batch_size/max_num_tokens/max_seq_len after a flat-key migration regression that crashed TRTLLMWorker pods on startup with 'NoneType' * 'int' (NVBug 5956640, #540), removed the cache_transceiver_config block from TRT-LLM 1.3.0+ templates that triggered Pydantic extra_forbidden validation errors and crashed pods at startup (NVBug 5953595, #541), added the backend field to cache_transceiver_config in cli_args.j2 (NVBug 5974038, #585), emitted cache_transceiver_config for disagg-serving configs (#609), added cache_transceiver_config.backend to the release/0.8.0 template baseline (#879), and fixed a TRT-LLM engine CLI arg passthrough (#512).
- Block alignment: Aligned max_num_tokens and cache_transceiver_max_tokens_in_buffer to tokens_per_block in the benchmark TRT-LLM rule (#453), and ensured cache_transceiver_max_tokens_in_buffer % block_size == 0 for TRT-LLM (#375).
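The block-alignment constraint above amounts to making a token count a multiple of the block size. A minimal sketch, assuming the value is rounded up (the notes do not say whether the generator rounds up or down) and using a hypothetical helper name align_up:

```python
def align_up(value, block_size):
    """Round value up to the nearest multiple of block_size.

    Rounding direction is an assumption made for this illustration.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    # Ceiling division in integer arithmetic, then scale back up.
    return -(-value // block_size) * block_size


tokens_per_block = 32
aligned = align_up(8200, tokens_per_block)
```

After alignment, aligned % tokens_per_block == 0 holds, which is the invariant stated for cache_transceiver_max_tokens_in_buffer.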
vLLM Compatibility
- vLLM disagg --kv-transfer-config: Added --kv-transfer-config to vLLM disagg templates (NVBug 5952846, #519).
- --cudagraph-capture-sizes startup failure: Fixed the vLLM --cudagraph-capture-sizes value in k8s_deploy.yaml that caused vLLM startup failure (#377).
- vLLM collector compatibility (≥ 0.14.0): Fixed vLLM collector compatibility for set_current_vllm_config on vLLM 0.14.0+ (#688).
- vLLM GEMM dtype inconsistency: Fixed a vLLM GEMM dtype inconsistency that produced wrong-precision data points (#376).
OOM and Error Handling
- OOM hardening: Replaced raw tracebacks from collect_config_paths() with user-readable errors (#382), resolved OOM during support-matrix testing (#418), improved error messages when a model doesn't fit in GPU memory (#437), explicitly set OOM status (#442), raised OOM in disagg get_worker_candidates() instead of hitting an IndexError downstream (#456), improved OOM error handling in estimation functions (#509), gracefully skipped unsupported backends when --backend auto is set (#511), raised an actionable error when a model name is invalid (#528), used a single copy for activations and past KV cache to halve memory pressure (#621), and added a KV-cache-capacity check that prevents batch-size oversubscription (#652).
- Sparse data: Handled sparse data gracefully in _extrapolate_data_grid instead of returning NaNs (#713).
- custom_allreduce empty bucket: Replaced raw AssertionError from query_custom_allreduce on empty (quant_mode, tp_size) buckets with a structured PerfDataNotAvailableError so successful Pareto sweeps no longer spam tracebacks (#884, #890).
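The benefit of a structured error over a bare AssertionError is that a sweep can catch it and skip the candidate. The sketch below uses a local stand-in class that borrows the name PerfDataNotAvailableError from the note above; its base class, the query_bucket and sweep helpers, and the bucket layout are assumptions for illustration, not the SDK's definitions.

```python
class PerfDataNotAvailableError(LookupError):
    """Local stand-in for the SDK exception; the real class may differ."""


def query_bucket(buckets, quant_mode, tp_size):
    """Return measured rows for a bucket, or raise a structured error."""
    rows = buckets.get((quant_mode, tp_size))
    if not rows:
        raise PerfDataNotAvailableError(
            f"no custom_allreduce data for quant_mode={quant_mode!r}, tp_size={tp_size}"
        )
    return rows


def sweep(buckets, candidates):
    """Collect results, recording skipped candidates instead of crashing."""
    results, skipped = [], []
    for quant_mode, tp_size in candidates:
        try:
            results.append(((quant_mode, tp_size), query_bucket(buckets, quant_mode, tp_size)))
        except PerfDataNotAvailableError as exc:
            skipped.append(((quant_mode, tp_size), str(exc)))
    return results, skipped


buckets = {("fp8", 2): [1.0]}
results, skipped = sweep(buckets, [("fp8", 2), ("fp8", 4)])
```

Because the exception type is specific, the caller can skip missing data quietly while still letting genuine bugs surface as other exception types.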
Generator Validator and Benchmark Templates
- Validator usability: Made --backend a required argument in the generator validator (#429), corrected the validator-invocation syntax in the generator docs (#430), added a shebang and error handling to bench_run.sh (#432), added --artifact-dir to benchmark templates to prevent permission-denied writes (#433), used the correct artifact-dir for AIC-generated AIPerf commands (#462), and deduplicated and sorted the concurrency list in the AIPerf command (#450).
- Naive generator: Fixed the naive config generator producing the RFC 1123-invalid DGD name None-agg (#490), fixed broken --save-dir arg parsing (#497), fixed sflow configs generation when running through the generator (#640), and generated a safe_model_name before saving results so HuggingFace IDs containing / no longer break the output path (#656).
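Both naive-generator fixes above concern turning a model identifier into a name that is safe in another context: a filesystem path and a Kubernetes resource name. The sketch below shows one way to do each. The function bodies, the "_" replacement character, and the fallback value are assumptions; only the name safe_model_name and the RFC 1123 constraint come from the notes.

```python
import re


def safe_model_name(model_id):
    """Replace path separators so the ID maps to a single directory name.

    The replacement character is an assumption for this illustration.
    """
    return model_id.replace("/", "_")


def rfc1123_label(name, fallback="model"):
    """Coerce a string into an RFC 1123 label.

    Labels use lowercase alphanumerics and '-', start and end with an
    alphanumeric character, and are at most 63 characters long.
    """
    label = re.sub(r"[^a-z0-9-]+", "-", name.lower()).strip("-")
    return (label or fallback)[:63].rstrip("-")


model_id = "nvidia/MiniMax-M2.5-NVFP4"
directory = safe_model_name(model_id)
resource_name = rfc1123_label(model_id + "-agg")
```

A name such as None-agg fails RFC 1123 validation because of the uppercase letter, which is why lowercasing is part of the coercion here.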
TRT-LLM and SGLang Constraints
- MOEModel and SGLang: Sharded the vocab embedding by tp_size to match VocabParallelEmbedding, added embedding allreduce for tp > 1, and removed the num_experts >= 128 guard on router GEMM (#711); modeled qkv_a_proj as a standalone GEMM in the WideEP pipeline so latency accounting is correct (#476); added the missing AllReduce op on the Llama model (#604); added a MoE TP constraint to the SGLang rule plugin so generated configs respect SGLang's MoE TP requirements (#579); fixed Deepseek MoE compute/dispatch overlap accounting and applied the PDL discount correctly (#564); applied latency-correction scales in run_disagg so disagg latency math matches the agg path (#487).
- Per-node GPU topology: Used num_gpus_per_node instead of a hardcoded value when querying all-reduce data (#681), and honored per-node GPU topology by disabling WideEP for small MoE models that fit on a single node (DYN-2544, #687).
- WideEP and shared expert dimensions: Fixed model WideEP generation overlap and shared expert dimensions (#367).
Collectors and Models
- MLA module path resolution: Resolved MLA module collector model paths locally instead of relying on remote download (#690).
- vLLM 0.17.0 collector compat: Fixed vLLM 0.17.0 collector compatibility for DSA, MLA module, and MoE MXFP4 on B200 via version-routed v2 collector files (#718).
- Duplicate test entry: Removed a duplicate MiniMax-M2.5-NVFP4 MoE entry from the test set (#714).
- Skip missing keys in summary: Skipped missing keys when assembling the final summary, fixing crashes seen on partial sweeps (#910).
- GB200 fp8_block: Fixed the GB200 fp8_block error path that incorrectly raised on valid configs (#669).
- Validator service key mismatch: Fixed validator service key mismatch (#355).
- Duplicate agg_decode max_batch_size: Removed duplicate agg_decode max_batch_size (#356).
- FP16 KV cache dtype: Fixed fp16 KV cache dtype (#358).
- TRT-LLM MoE collector: Fixed a TRT-LLM MoE collector error path (#513).
Errors and Edge Cases
- System path / invalid backend: Fixed system path error handling (#359), caught invalid backend selection for the TRT-LLM backend before launching a sweep (#372), and caught unsupported system × backend × version combos at the CLI side instead of crashing mid-sweep (#360).
- --num-gpus 1 for single-GPU agg: Allowed --num-gpus 1 to test single-GPU agg only (#363).
- agg_pareto() IndexError: Fixed IndexError in agg_pareto() when all parallel configurations are skipped without exceptions (#378).
- Deployment guide examples: Fixed the artifact-directory subdirectory level in dynamo_deployment_guide.md (#380) and corrected the example --head_node_ip parameter name in the deployment guide (#381).
- HF_TOKEN honored: Fixed environment variable HF_TOKEN so it is recognized when fetching models from HuggingFace (#620).
- Inter-node bandwidth: Fixed inter-node bandwidth in the system YAML (#477).
- Conflict markers / wheel build: Removed leftover Git conflict markers (#354), made the wheel build fail on unmaterialized Git instead of producing an inconsistent artifact (#880), and added templates for release/0.8.0 so generation works against the new branch (#879).
- Support check / nvfp4 memory: Made the support check case-insensitive and rephrased the model-not-found message (#431); corrected the nvfp4 memory requirement in the SDK so memory math is accurate (#391).
- Infinite loop after abort: Fixed an infinite loop after abort (#658).
- Intermittent CI hang: Mitigated an intermittent CI hang on test_validate_database (#666).
Documentation
- DeepWiki badge: Added a DeepWiki badge to the README pointing at the AI-generated docs (#389).
- End-to-end workflow guide: Documented the end-to-end workflow, benchmark artifacts, and webapp visibility (#436).
- Estimate mode doc section: Added a documentation section describing the new cli estimate command and its inputs/outputs (#642).
CI/CD and Testing
- Support matrix CI and publication: Added a GitHub Actions workflow that auto-creates sanity-check charts when new perf data lands (#400), enhanced PR-description generation for support-matrix PRs (#455), parallelized the support-matrix run via multi-process on a large runner after a multi-thread attempt was reverted as unsafe (#481, #488, #562), relaxed support-matrix constraints based on model sizes (#573), extended sanity-check tests (#605), added smoke tests gating new perf data (#608), introduced a per-PR test subset for fast feedback (#622), automated the support-matrix run on release branches (#571), DCO-signed commits generated by the daily support-matrix workflow (#679), and published the support matrix as a GitHub Pages site (#636, #644).
- Bot noise reduction: Removed the noisy bot comment left by failed runs (#457) and added CodeRabbit for code review (#468).
Other Changes
- eval module removed: Deleted src/aiconfigurator/eval/ along with its automation scripts and the aiconfigurator eval subcommand registration (#504).
- Model list cleanup: Removed Llama 2, Qwen 2.5, and Mixtral from the supported-model list to keep the matrix focused on actively supported families (#553).
- Python ≥ 3.10: Bumped minimum Python version from 3.9 to 3.10 (#678).
New Contributors
Thanks to everyone who contributed to this release:
- @Spycsh (Intel) made their first contribution in #246
- @nqzhou24 made their first contribution in #454
- @harryjing made their first contribution in #472
- @davilu-nvidia made their first contribution in #517
- @joshuayao made their first contribution in #620
- @ashnamehrotra made their first contribution in #647
- @diw-zw made their first contribution in #656
- @kyang-bd made their first contribution in #657
- @PeaBrane made their first contribution in #665
- @changhuaixin made their first contribution in #684
Full Changelog
Full Changelog: v0.7.0...v0.8.0