AIConfigurator v0.8.0

@dagil-nvidia dagil-nvidia released this 01 May 10:15
5c0c8cc

AIConfigurator - Release 0.8.0

Summary

AIConfigurator 0.8.0 expands offline-optimization coverage to every actively supported Dynamo backend and hardware combination, opens up programmatic access for the Dynamo Planner and Profiler, and adds a new cli estimate command that predicts TTFT, TPOT, and power for a single config without running a full sweep. The release lands first-class Intel XPU support contributed by Intel, DeepSeek-V3.2 (DSA) modeling across all three backends, and a unified HYBRIDMOE family covering Llama 4 Scout/Maverick and MiMo-V2-Flash. Perf-database tables now cover vLLM 0.14/0.16/0.17/0.19, SGLang 0.5.8 → 0.5.10, and TensorRT-LLM 1.2.0rc5 → 1.3.0+ across B200 SXM, GB200, GB300, H100 SXM, A100 SXM, L40S, RTX Pro 6000 Blackwell Server, and Intel B60. The collector itself got faster and more reliable, and the generator was rewired to the new --trtllm.<group>.<key> dynamic CLI-flag syntax introduced in Dynamo. Numerous fixes harden TRT-LLM template correctness, OOM handling, generator-validator usage, and SGLang/vLLM compatibility.


Breaking Changes & Deprecations

  • Minimum Python is now 3.10 (was 3.9), driven by Gradio's Python ≥ 3.10 requirement (#678).
  • aiconfigurator eval subcommand removed. The eval module and its automation scripts have been deleted; users who scripted aiconfigurator eval must migrate (#504).
  • Llama 2, Qwen 2.5, and Mixtral are no longer supported. Their model_configs/ JSONs and support-matrix entries have been removed; pin to v0.7.0 if you require these families (#553).

Key Highlights

Backend and Hardware Coverage

  • vLLM 0.14.0 → 0.19.0: Refreshed perf data for vLLM 0.14.0, 0.14.1, 0.16.0, 0.17.0, and 0.19.0 across B200 SXM, GB200, GB300, H100 SXM, L40S, RTX Pro 6000 Blackwell Server, and A100 SXM (#517, #560, #707, #712, #883, #892).
  • SGLang 0.5.8 → 0.5.10: Refreshed SGLang perf data through 0.5.10 with new AllReduce collector support (#508), 0.5.9 collector compatibility (#531), custom_allreduce_perf.txt files (#607), and the 0.5.10 cherry-pick of collector fixes plus refreshed perf tables (#952).
  • TensorRT-LLM 1.2.0rc5 → 1.3.0+: Added FP8 block in TRT-LLM on Blackwell sm100 (#480), GPT-OSS simulation on Blackwell (#575), w4a16_mxfp4 / w4a8_mxfp4_mxfp8 MoE data for B200 / GB200 (#590), NVLink OneSided alltoall (#536), and refreshed data through 1.3.0+ (#502, #611, #617, #618, #635, #638).
  • GB300 and Intel B60: Added a GB300 system entry (#352), renamed gb200_sxm to gb200 for consistency (#362), and added first-class Intel XPU support with a new --device flag and B60 system configuration (#246).

Dynamo Planner and Profiler Integration

  • Standalone picking API: Decoupled pick_default, pick_min_gpus, and pick_max_throughput from TaskConfig so external pipelines can call them directly (#386).
  • Optimization-type-aware parallelization: Added pick_optimization_type so the Dynamo profiler can request configs optimized for throughput or latency without explicit SLA targets — TP for dense, TEP for MoE, DEP for MLA+MoE+throughput (#647).
  • --trtllm.* dynamic flags: Translated extra engine args YAML into the new --trtllm.<group>.<key> flag syntax introduced in Dynamo #7335, replacing the old --override-engine-args JSON path (#693).
  • Dynamo 1.1.0 backend templates: Bundled vLLM 0.19.0, SGLang 0.5.10.post1, and TRT-LLM 1.3.0rc11 templates (#759).
  • Hetero disagg GPU budget: Added max_prefill_gpus and max_decode_gpus so callers can express asymmetric resource allocation (e.g. 4 prefill GPUs + 12 decode GPUs) that total_gpus alone cannot (#709).
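
To make the asymmetric budget concrete, here is a minimal sketch of how per-pool caps differ from a single total. It is not AIConfigurator's implementation: only total_gpus, max_prefill_gpus, and max_decode_gpus come from the notes above, while the Candidate class and within_budget helper are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical disagg candidate: GPUs used by each worker pool."""
    prefill_gpus: int
    decode_gpus: int


def within_budget(
    c: Candidate,
    total_gpus: Optional[int] = None,
    max_prefill_gpus: Optional[int] = None,
    max_decode_gpus: Optional[int] = None,
) -> bool:
    """Return True if the candidate respects every limit that is set."""
    if total_gpus is not None and c.prefill_gpus + c.decode_gpus > total_gpus:
        return False
    if max_prefill_gpus is not None and c.prefill_gpus > max_prefill_gpus:
        return False
    if max_decode_gpus is not None and c.decode_gpus > max_decode_gpus:
        return False
    return True


candidates = [Candidate(4, 12), Candidate(8, 8), Candidate(2, 14)]
# total_gpus=16 alone admits all three splits.
by_total = [c for c in candidates if within_budget(c, total_gpus=16)]
# Per-pool caps express "at most 4 prefill GPUs and 12 decode GPUs".
by_pool = [c for c in candidates if within_budget(c, 16, 4, 12)]
```

With only total_gpus=16, the 8 + 8 and 2 + 14 splits pass; the per-pool caps leave just the 4 + 12 split.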

cli estimate Single-Point Prediction

  • New command and Python API: aiconfigurator cli estimate and cli_estimate() predict TTFT, TPOT, and power for a single model/system/config combination without running a full sweep or SLA optimization, in both aggregated and disaggregated modes, with separate prefill/decode parallelism, decode systems, and quantization overrides (#387).
  • Result parity with cli default: Single-point estimates now align with the corresponding row in best_config_topn.csv from cli default, with a uniform 1.8× TTFT correction factor applied across the estimate, autoscale, and pareto disagg paths (#539). Chunked-prefill is exposed as a CLI parameter (#614), and end-to-end tests assert column and value parity (#557).

Intel XPU Support

  • First-class XPU collector and SDK: Added Intel XPU support across the collector and SDK with a --device flag and B60 system configuration (#246), then enabled FP8 GEMM collection (#591), MoE projection (#625), FP8 projection (#645), FP8 MoE collection for vLLM (#675), and MXFP4 MoE / GPT-OSS coverage (#699). Contributed by @Spycsh (Intel).

DeepSeek-V3.2 (DSA) and New Models

  • DSA support across backends: Added module-level MLA/DSA attention collectors and perf-database support for DeepSeek-V3 / V3.2 across TensorRT-LLM (#470), vLLM (#643), and SGLang (#657), enabled DSv32 + GLM-5 modeling (#662), and projected global MoE routing onto rank-1 workload so SGLang EP MoE benchmarking preserves the intended 128-top-8 routing distribution (#753).
  • MiniMax-M2.5 and HYBRIDMOE family: Added MiniMax-M2.5 with Multi-Token Prediction (MTP) scaling for MoE generation throughput (#405) and the NVFP4-quantized variant nvidia/MiniMax-M2.5-NVFP4 for vLLM and TRT-LLM (#685); introduced a unified HYBRIDMOE family covering Llama 4 Scout/Maverick and MiMo-V2-Flash with shared interleaved SWA/global + dense/MoE structure (#408).
  • Other model coverage: Added MTP simulation for Qwen-series models (#472), QWEN3.5 hybrid-mode support (#667), Kimi-K2.5 quant detection through nested text_config (#454), and updated the HuggingFace ID for Nemotron-3-Super-120B (#577).

Features & Enhancements

CLI and APIs

  • cli estimate: New single-point estimate command and cli_estimate() Python API for TTFT/TPOT/power prediction (#387, #539, #557, #614).
  • --system all and --backend all: Run cli support against the full system × backend matrix in a single command (#435).
  • HYBRID recommendation on SILICON failure: When cli fails in SILICON mode, AIConfigurator recommends HYBRID mode instead of leaving the user stuck (#411).
  • Unify CLI args: Unified CLI argument naming across subcommands and updated docs and examples to match (#345).

Generator and Validator

  • Optimization-type-aware picking and standalone API: Standalone pick_default / pick_min_gpus / pick_max_throughput (#386), exposed parallelization metadata on enumerate_profiling_configs (#401), and pick_optimization_type for SLA-free optimization (#647).
  • Generator improvements: Emit the matching benchmark command alongside each config (#334), strip empty lines from generated config files (#347), hook into the Dynamo planner profiler's config-gen path (#361), align the generator's default backend version with the latest Dynamo version (#366), expose real-GPU enumeration logic for external sweep callers (#373), support K8S_ETCD_ENDPOINTS substitution in K8s deployment templates (#554), split PD wideep / eplb config into separate emission paths (#565), add an sflow rendering target for nv-sflow integration (#584), accept generator-override params via cli_default() (#592), bump default Dynamo image and version references to 1.0.0 (#594), translate extra engine args YAML into the --trtllm.<group>.<key> syntax (#693), and bundle Dynamo 1.1.0 backend templates (#759).
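
The YAML-to-flag translation (#693) can be pictured as flattening a nested mapping into dotted flags. The sketch below is illustrative only: the function name, the sample keys, and the choice to join deeper nesting with dots and format values with str() are assumptions, and the real generator may treat booleans, lists, and quoting differently.

```python
from typing import Any, Dict, List


def to_trtllm_flags(engine_args: Dict[str, Any]) -> List[str]:
    """Flatten {group: {key: value}} into --trtllm.<group>.<key> value pairs.

    Hypothetical helper; nesting deeper than two levels is joined with dots,
    which is an assumption rather than documented Dynamo behavior.
    """
    flags: List[str] = []

    def walk(prefix: str, node: Any) -> None:
        if isinstance(node, dict):
            for key, value in node.items():
                walk(f"{prefix}.{key}", value)
        else:
            flags.extend([f"--trtllm{prefix}", str(node)])

    walk("", engine_args)
    return flags


# Sample input as it might look after loading the YAML; keys are illustrative.
sample = {"kv_cache_config": {"free_gpu_memory_fraction": 0.8}}
flags = to_trtllm_flags(sample)
```

Here flags holds a flag/value pair that could be appended to a worker command line in place of a JSON override string.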

SDK and Modeling

  • Hetero disagg GPU budget: max_prefill_gpus / max_decode_gpus in find_best_disagg_result_under_constraints (#709).
  • Static latency-only fast path: run_static_latency_only() skips summary/dataframe materialization for replay-style callers (#665).
  • GEMM lookup optimization: Exact-hit shortcut and 1D interpolation for partial hits, reducing per-query latency (#721).
  • Full database mode for TRT-LLM WideEP: SOL / EMPIRICAL modes for TRT-LLM WideEP query functions (#485).
  • Configurable rate-matching factors: rate_matching degradation factors are now configurable instead of hardcoded (#615).
  • Better missing perf-data handling: Structured error at query time instead of an internal exception (#473).
  • Attention imbalance correction: Added an attention imbalance correction scale interface and Qwen3 qk-norm I/O (#600).
  • KV cache capacity check: Prevent batch-size oversubscription before launching a sweep (#652).
  • MoE extrapolation by MFU: Switched MoE extrapolation to scale according to MFU instead of a fixed factor, improving accuracy when projecting compute for unmeasured shapes (#537).
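
The GEMM lookup optimization (#721) combines an exact-hit shortcut with 1D interpolation for partial hits. A rough sketch of that idea follows, assuming a table keyed by (m, n, k) where a partial hit means (n, k) was measured and only m must be interpolated; the table layout and function name are hypothetical, not the SDK's actual data structures.

```python
from bisect import bisect_left
from typing import Dict, Tuple

# Hypothetical table: (m, n, k) -> measured latency in ms.
Table = Dict[Tuple[int, int, int], float]


def query_gemm(table: Table, m: int, n: int, k: int) -> float:
    # Exact-hit shortcut: no interpolation work when the shape was measured.
    hit = table.get((m, n, k))
    if hit is not None:
        return hit

    # Partial hit: (n, k) was measured, so interpolate linearly along m only.
    ms = sorted(mm for (mm, nn, kk) in table if (nn, kk) == (n, k))
    if len(ms) < 2:
        raise KeyError(f"no perf data to interpolate for n={n}, k={k}")
    # Clamp the bracket so out-of-range m extends the nearest segment.
    i = min(max(bisect_left(ms, m), 1), len(ms) - 1)
    lo, hi = ms[i - 1], ms[i]
    t_lo, t_hi = table[(lo, n, k)], table[(hi, n, k)]
    return t_lo + (t_hi - t_lo) * (m - lo) / (hi - lo)


table = {(64, 4096, 4096): 0.10, (128, 4096, 4096): 0.18}
exact = query_gemm(table, 64, 4096, 4096)
between = query_gemm(table, 96, 4096, 4096)
```

The exact query returns the stored value directly, while m = 96 lands halfway between the two measured points.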

Models and Architectures

  • DeepSeek-V3.2 (DSA): Module-level MLA/DSA attention collectors and perf-database support across TRT-LLM (#470), vLLM (#643), and SGLang (#657); DSv32 + GLM-5 (#662); SGLang EP MoE rank-1 projection (#753).
  • MiniMax-M2.5 / HYBRIDMOE / Qwen / Kimi: MiniMax-M2.5 with MTP (#405), nvidia/MiniMax-M2.5-NVFP4 (#685), unified HYBRIDMOE family for Llama 4 Scout/Maverick + MiMo-V2-Flash (#408), Qwen MTP simulation (#472), QWEN3.5 hybrid mode (#667), Kimi-K2.5 nested text_config quant detection (#454), Nemotron-3-Super-120B HF ID update (#577).

Collectors and Data

  • Hardware/backend coverage: Refreshed perf-database tables across vLLM 0.14.0/0.14.1/0.16.0/0.17.0/0.19.0, SGLang 0.5.8.post1/0.5.9/0.5.10, and TensorRT-LLM 1.2.0rc5/1.2.0rc6/1.3.0+ on B200 SXM, GB200, GB300, H100 SXM, A100 SXM, L40S, RTX Pro 6000 Blackwell Server, and Intel B60 (#388, #390, #466, #502, #517, #520, #523, #529, #531, #535, #538, #547, #559, #560, #567, #576, #595, #607, #611, #617, #618, #623, #624, #628, #629, #630, #631, #632, #635, #638, #668, #680, #682, #707, #712, #883, #892, #952).
  • Collector performance and reliability: Sped up the slow TensorRT-LLM MoE collector (#351), improved the MNNVL alltoall communication collector (#353), cut Blackwell GEMM collection time via weight caching and adaptive L2 bypass (#588), sped up the MoE collector (#524), added resume capability so interrupted runs no longer restart from scratch (#469), added --profile (#523) and --model-path filters (#627), fixed log-perf stalling on NFS (#427), explicitly managed collector versions (#466), disabled core dumps for GPU crashes in worker and main processes (#587), adapted the GEMM collector for the zig-zag pattern (#503), and unified TRT-LLM alltoall MoE dispatch on sm100 (#566).
  • TensorRT-LLM additions: FP8 block on Blackwell sm100 (#480), GPT-OSS simulation on Blackwell (#575), w4a16_mxfp4 / w4a8_mxfp4_mxfp8 MoE data (#590), NVLink OneSided alltoall (#536), FP4 GEMM data (#672).
  • vLLM additions: NVFP4 vLLM data (#546), NVFP4 vLLM MoE (#660), w4a16_mxfp4 MoE for GPT-OSS (#673), DSA support (#643), and 0.17.0 collector compatibility for DSA, MLA, and MoE MXFP4 on B200 (#718).
  • SGLang improvements: AllReduce collector (#508), 0.5.9 collector support (#531), custom_allreduce_perf.txt for 0.5.9 (#607), stop passing invalid moe-dense-tp-size (#613), Qwen3-235B WideEP and EP > 1 in non-WideEP mode (#684).
  • Intel XPU additions: First-class XPU support (#246), FP8 GEMM (#591), MoE projection (#625), FP8 projection (#645), FP8 MoE for vLLM (#675), MXFP4 MoE / GPT-OSS coverage (#699).

Hardware Definitions

  • GB300 and rename: Added the GB300 system entry (#352), renamed gb200_sxm to gb200 for consistency (#362), corrected inter-node bandwidth in the system YAML (#477), and honored per-node GPU topology, disabling WideEP for small MoE models that don't span nodes (DYN-2544, #687).

Web App

  • Support Matrix tab: New Support Matrix tab visualizing support_matrix.csv with PASS/FAIL per system × backend and a top-10 errors view per system, with lazy perf-database loading for faster startup (#568).
  • Webapp upkeep: Updated Gradio (#475), removed "gradio not installed" warnings on import (#426), bumped the webapp Docker image (#364), and fixed webapp visibility under the new end-to-end workflow (#436).

Logging and UX

  • CLI ergonomics: Hid and deduplicated spammy logs (#441), added colored logging (#452), cleaned up ANSI codes when stdout is redirected and added a --no-color flag (#710), switched aiconfigurator cli exp --help to a YAML task example (#715), and replaced Unicode box-drawing with ASCII in piped output so cat -v no longer renders garbage (#736).
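
A common way to get clean redirected output is to enable color only when the stream is a terminal and the user has not opted out. The sketch below shows that pattern under stated assumptions: use_color and render are hypothetical helpers, not the project's logging code, and only the --no-color behavior is taken from the notes above.

```python
import io
import re
import sys
from typing import TextIO

# Matches CSI escape sequences such as "\x1b[31m" (color) and "\x1b[0m" (reset).
ANSI_CSI = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")


def use_color(stream: TextIO, no_color: bool = False) -> bool:
    """Color only when writing to a terminal and the user did not opt out."""
    return not no_color and stream.isatty()


def render(text: str, stream: TextIO = sys.stdout, no_color: bool = False) -> str:
    """Return text unchanged for terminals, stripped of ANSI codes otherwise."""
    return text if use_color(stream, no_color) else ANSI_CSI.sub("", text)


# io.StringIO is not a TTY, so it stands in for a pipe or redirected file.
piped = render("\x1b[31merror\x1b[0m", io.StringIO())
```

Passing no_color=True forces the stripped form even on a terminal, which mirrors what a --no-color flag would request.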

Bug Fixes

TensorRT-LLM Templates

  • Engine template correctness: Fixed build_config nesting and a missing backend field in version-specific engine templates (#434), made older TRT-LLM templates support build_config for the generator (#464), restored top-level max_batch_size / max_num_tokens / max_seq_len after a flat-key migration regression that crashed TRTLLMWorker pods on startup with 'NoneType' * 'int' (NVBug 5956640, #540), removed the cache_transceiver_config block from TRT-LLM 1.3.0+ templates that triggered Pydantic extra_forbidden validation errors and crashed pods at startup (NVBug 5953595, #541), added the backend field to cache_transceiver_config in cli_args.j2 (NVBug 5974038, #585), emitted cache_transceiver_config for disagg-serving configs (#609), added cache_transceiver_config.backend to the release/0.8.0 template baseline (#879), and fixed a TRT-LLM engine CLI arg passthrough (#512).
  • Block alignment: Aligned max_num_tokens and cache_transceiver_max_tokens_in_buffer to tokens_per_block in the benchmark TRT-LLM rule (#453), and ensured cache_transceiver_max_tokens_in_buffer % block_size == 0 for TRT-LLM (#375).
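
The alignment constraint amounts to snapping a token count to a multiple of the block size. A minimal sketch follows, assuming values are rounded up; whether the generator rounds up or down is not stated in these notes, and align_up and the sample numbers are hypothetical.

```python
def align_up(value: int, block: int) -> int:
    """Round value up to the nearest multiple of block."""
    if block <= 0:
        raise ValueError("block must be positive")
    return -(-value // block) * block  # ceiling division, then scale back


tokens_per_block = 32  # illustrative block size
max_tokens_in_buffer = align_up(8200, tokens_per_block)
```

After alignment, max_tokens_in_buffer % tokens_per_block == 0 holds, which is the invariant the fixes enforce.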

vLLM Compatibility

  • vLLM disagg --kv-transfer-config: Added --kv-transfer-config to vLLM disagg templates (NVBug 5952846, #519).
  • --cudagraph-capture-sizes startup failure: Fixed the vLLM --cudagraph-capture-sizes value in k8s_deploy.yaml that caused vLLM startup failure (#377).
  • vLLM collector compatibility (≥ 0.14.0): Fixed vLLM collector compatibility for set_current_vllm_config on vLLM 0.14.0+ (#688).
  • vLLM GEMM dtype inconsistency: Fixed a vLLM GEMM dtype inconsistency that produced wrong-precision data points (#376).

OOM and Error Handling

  • OOM hardening: Replaced raw tracebacks from collect_config_paths() with user-readable errors (#382), resolved OOM during support-matrix testing (#418), improved error messages when a model doesn't fit in GPU memory (#437), explicitly set OOM status (#442), raised OOM in disagg get_worker_candidates() instead of hitting an IndexError downstream (#456), improved OOM error handling in estimation functions (#509), gracefully skipped unsupported backends when --backend auto is used (#511), raised an actionable error when a model name is invalid (#528), used a single copy for activations and past KV cache to halve memory pressure (#621), and added a KV-cache-capacity check that prevents batch-size oversubscription (#652).
  • Sparse data: Handled sparse data gracefully in _extrapolate_data_grid instead of returning NaNs (#713).
  • custom_allreduce empty bucket: Replaced raw AssertionError from query_custom_allreduce on empty (quant_mode, tp_size) buckets with a structured PerfDataNotAvailableError so successful Pareto sweeps no longer spam tracebacks (#884, #890).
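
The benefit of a structured error over a bare AssertionError is that a sweep can catch it by type and skip the combination quietly. The sketch below illustrates this; PerfDataNotAvailableError is named in the notes, but its base class, message, and the query_bucket and sweep helpers here are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Bucket = Tuple[str, int]  # (quant_mode, tp_size)


class PerfDataNotAvailableError(LookupError):
    """Stand-in for the project's error type; base class is assumed."""


def query_bucket(data: Dict[Bucket, List[float]], quant_mode: str, tp_size: int) -> List[float]:
    """Return latency samples, raising a typed error for empty or missing buckets."""
    rows = data.get((quant_mode, tp_size))
    if not rows:
        raise PerfDataNotAvailableError(
            f"no custom allreduce data for quant_mode={quant_mode}, tp_size={tp_size}"
        )
    return rows


def sweep(data: Dict[Bucket, List[float]], combos: List[Bucket]) -> List[float]:
    """Collect the best latency per combo, skipping combos without data."""
    results = []
    for quant_mode, tp_size in combos:
        try:
            results.append(min(query_bucket(data, quant_mode, tp_size)))
        except PerfDataNotAvailableError:
            continue  # expected gap in coverage, not a crash
    return results


data = {("fp8", 2): [0.5, 0.4], ("fp8", 4): []}
best = sweep(data, [("fp8", 2), ("fp8", 4), ("bf16", 2)])
```

Only the bucket with data contributes a result; the empty and missing buckets are skipped without a traceback.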

Generator Validator and Benchmark Templates

  • Validator usability: Made --backend a required argument in the generator validator (#429), corrected the validator-invocation syntax in the generator docs (#430), added a shebang and error handling to bench_run.sh (#432), added --artifact-dir to benchmark templates to prevent permission-denied writes (#433), used the correct artifact-dir for AIC-generated AIPerf commands (#462), and deduplicated and sorted the concurrency list in the AIPerf command (#450).
  • Naive generator: Fixed the naive config generator producing the RFC 1123-invalid DGD name None-agg (#490), fixed broken --save-dir arg parsing (#497), fixed sflow configs generation when running through the generator (#640), and generated a safe_model_name before saving results so HuggingFace IDs containing / no longer break the output path (#656).
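
Both naming fixes come down to simple string checks. The sketch below is an illustration, not the generator's code: safe_model_name is named in the notes, but the underscore replacement is an assumption, and is_rfc1123_label is a hypothetical helper implementing the usual RFC 1123 label rule (lowercase alphanumerics and hyphens, alphanumeric at both ends, at most 63 characters).

```python
import re

# RFC 1123 label: 1-63 chars, lowercase alphanumerics or "-", alphanumeric ends.
RFC1123_LABEL = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def safe_model_name(model_id: str) -> str:
    """Replace path separators so a HuggingFace ID maps to one path component.

    The "_" replacement character is an assumption for this sketch.
    """
    return model_id.replace("/", "_")


def is_rfc1123_label(name: str) -> bool:
    """Check a resource name against the RFC 1123 label rule."""
    return RFC1123_LABEL.fullmatch(name) is not None


output_dir_name = safe_model_name("nvidia/MiniMax-M2.5-NVFP4")
bad_name_ok = is_rfc1123_label("None-agg")   # uppercase "N" violates the rule
good_name_ok = is_rfc1123_label("none-agg")
```

The name None-agg fails because of its uppercase letter, which is why a missing model name leaking into the deployment name produced an invalid resource.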

TRT-LLM and SGLang Constraints

  • MOEModel and SGLang: Sharded the vocab embedding by tp_size to match VocabParallelEmbedding, added embedding allreduce for tp > 1, and removed the num_experts >= 128 guard on router GEMM (#711); modeled qkv_a_proj as a standalone GEMM in the WideEP pipeline so latency accounting is correct (#476); added the missing AllReduce op on the Llama model (#604); added a MoE TP constraint to the SGLang rule plugin so generated configs respect SGLang's MoE TP requirements (#579); fixed Deepseek MoE compute/dispatch overlap accounting and applied the PDL discount correctly (#564); applied latency-correction scales in run_disagg so disagg latency math matches the agg path (#487).
  • Per-node GPU topology: Used num_gpus_per_node instead of a hardcoded value when querying all-reduce data (#681), and honored per-node GPU topology — disabling WideEP for small MoE models that fit on a single node (DYN-2544, #687).
  • WideEP and shared expert dimensions: Fixed model WideEP generation overlap and shared expert dimensions (#367).

Collectors and Models

  • MLA module path resolution: Resolved MLA module collector model paths locally instead of relying on remote download (#690).
  • vLLM 0.17.0 collector compat: Fixed vLLM 0.17.0 collector compatibility for DSA, MLA module, and MoE MXFP4 on B200 via version-routed v2 collector files (#718).
  • Duplicate test entry: Removed a duplicate MiniMax-M2.5-NVFP4 MoE entry from the test set (#714).
  • Skip missing keys in summary: Skipped missing keys when assembling the final summary, fixing crashes seen on partial sweeps (#910).
  • GB200 fp8_block: Fixed the GB200 fp8_block error path that incorrectly raised on valid configs (#669).
  • Validator service key mismatch: Fixed validator service key mismatch (#355).
  • Duplicate agg_decode max_batch_size: Removed duplicate agg_decode max_batch_size (#356).
  • FP16 KV cache dtype: Fixed fp16 KV cache dtype (#358).
  • TRT-LLM MoE collector: Fixed a TRT-LLM MoE collector error path (#513).

Errors and Edge Cases

  • System path / invalid backend: Fixed system path error handling (#359), caught invalid backend selection for the TRT-LLM backend before launching a sweep (#372), and caught unsupported system × backend × version combos at the CLI side instead of crashing mid-sweep (#360).
  • --num-gpus 1 for single-GPU agg: Allowed --num-gpus 1 to test single-GPU agg only (#363).
  • agg_pareto() IndexError: Fixed IndexError in agg_pareto() when all parallel configurations are skipped without exceptions (#378).
  • Deployment guide examples: Fixed the artifact-directory subdirectory level in dynamo_deployment_guide.md (#380) and corrected the example --head_node_ip parameter name in the deployment guide (#381).
  • HF_TOKEN honored: Fixed handling of the HF_TOKEN environment variable so it is honored when fetching models from HuggingFace (#620).
  • Inter-node bandwidth: Fixed inter-node bandwidth in the system YAML (#477).
  • Conflict markers / wheel build: Removed leftover Git conflict markers (#354), made the wheel build fail on unmaterialized Git instead of producing an inconsistent artifact (#880), and added templates for release/0.8.0 so generation works against the new branch (#879).
  • Support check / nvfp4 memory: Made the support check case-insensitive and rephrased the model-not-found message (#431); corrected the nvfp4 memory requirement in the SDK so memory math is accurate (#391).
  • Infinite loop after abort: Fixed an infinite loop after abort (#658).
  • Intermittent CI hang: Mitigated an intermittent CI hang on test_validate_database (#666).

Documentation

  • DeepWiki badge: Added a DeepWiki badge to the README pointing at the AI-generated docs (#389).
  • End-to-end workflow guide: Documented the end-to-end workflow, benchmark artifacts, and webapp visibility (#436).
  • Estimate mode doc section: Added a documentation section describing the new cli estimate command and its inputs/outputs (#642).

CI/CD and Testing

  • Support matrix CI and publication: Added a GitHub Actions workflow that auto-creates sanity-check charts when new perf data lands (#400), enhanced PR-description generation for support-matrix PRs (#455), parallelized the support-matrix run via multi-process on a large runner after a multi-thread attempt was reverted as unsafe (#481, #488, #562), relaxed support-matrix constraints based on model sizes (#573), extended sanity-check tests (#605), added smoke tests gating new perf data (#608), introduced a per-PR test subset for fast feedback (#622), automated the support-matrix run on release branches (#571), DCO-signed commits generated by the daily support-matrix workflow (#679), and published the support matrix as a GitHub Pages site (#636, #644).
  • Bot noise reduction: Removed the noisy bot comment left by failed runs (#457) and added CodeRabbit for code review (#468).

Other Changes

  • eval module removed: Deleted src/aiconfigurator/eval/ along with its automation scripts and the aiconfigurator eval subcommand registration (#504).
  • Model list cleanup: Removed Llama 2, Qwen 2.5, and Mixtral from the supported-model list to keep the matrix focused on actively supported families (#553).
  • Python ≥ 3.10: Bumped minimum Python version from 3.9 to 3.10 (#678).

New Contributors

Thanks to everyone who contributed to this release.


Full Changelog: v0.7.0...v0.8.0