AIConfigurator Release v0.7.0
Summary
AIConfigurator 0.7.0 builds on the multi-backend foundation of 0.6.0 with a stronger focus on CLI and API ergonomics, new models and hardware (Nemotron, Mamba2, GB200 NVL72), generator and validator tooling, and operational robustness. This release unifies model input around model_path, adds naive config generation and a generator benchmark mode, introduces a generator validator that compares generated configs with engine APIs, and adds the auto backend (homogeneous), AIPerf benchmark command generation, and Dynamo planner profiler integration. The support matrix now uses huggingface_id and the Hugging Face architecture, and the CLI gains a support command with matching APIs plus --system all and --backend all. New collectors and data cover TensorRT-LLM WideEP, Mamba2, vLLM 0.14.0, and SGLang 0.5.8; Nemotron and EPLB support is expanded. Numerous fixes improve generator output, Kubernetes templates, TRT-LLM/SGLang constraints, validator usage, and error handling. Documentation and CI are updated to cover the end-to-end workflow and notebook validation.
Key Highlights
Unified model input and CLI
- model_path as unified input: Model input is unified on model_path (#289); support matrix uses huggingface_id and huggingface architecture (#275).
- Naive config generation: CLI generate command can produce naive configs (#271); CLI APIs enable Python callers (#293).
- Support command and APIs: New support command and cli_support API (#294), with --system all and --backend all (#439); support for --backend any (homogeneous) (#331), later renamed to auto (#346).
Generator and validator
- Benchmark mode and validator: Generator gains benchmark mode via rule plugins (#290) and a generator validator to compare generator configs with engine APIs (#329).
- AIPerf and Dynamo integration: Generator can emit AIPerf benchmark commands (#357); hook to Dynamo planner profiler’s config gen (#419); picking modes and standalone picking API (#421); real-GPU enumeration exposed (#420).
- Backend and version flexibility: generator supports both Dynamo version and backend version (#333).
New models and hardware
- Nemotron: Nemotron support (#273), Nemotron v3 super model (#325), and Mamba2 ops in Nemotron v3 simulation (#342).
- Mamba2 and WideEP: Mamba2 performance data collectors (#297); TensorRT-LLM WideEP All-to-All collector (#313); TensorRT-LLM WideEP pipeline (#320); TensorRT-LLM WideEP MoE collector (#335).
- GB200 NVL72 and SGLang EPLB: GB200 NVL72 all2all data (#337); EPLB support in SGLang (#343).
Backend and collector updates
- vLLM 0.14.0: vLLM collector updated to 0.14.0 with H100 data (#310) and B200 data (#525).
- SGLang 0.5.8: SGLang performance collectors updated for v0.5.8 (#323).
- FP8 and quantization: FP8 static_quant_mode / lowbit_input with compute_scale & scale_matrix modeling (#261); infer quantization from model info (#338).
Features & Enhancements
CLI and APIs
- Support matrix: Use huggingface_id and huggingface architecture in support_matrix.csv (#275).
- model_path: Use model_path as the unified model input argument (#289).
- Generate command: CLI generate command to generate naive configs (#271).
- CLI APIs: Create CLI APIs to support Python calls (#293).
- Support command: CLI support command and cli_support API (#294).
- Support all: Add --system all and --backend all for cli-support (#439).
- Top-N: Add configurable top_n parameter for result limiting (#315).
Generator and validator
- Benchmark mode: Generator adds a benchmark mode, enabled by setting rule plugins (#290).
- Generator validator: Add generator validator to compare generator configs with engine APIs (#329).
- Backend any/auto: Add support for --backend any (homogeneous) (#331); rename to 'auto' backend (#346).
- Dynamo and backend version: Generator supports both Dynamo version and backend version (#333).
- AIPerf command: Enhance generator to generate AIPerf benchmark command (#357).
- Dynamo planner profiler: Hook to Dynamo planner profiler's config gen (#419).
- Picking API: Add picking modes and expose standalone picking API (#421).
- Real-GPU enumeration: Expose real-gpu enumeration logic (#420).
- Common templates: Extract common generator templates for reuse (#314).
Kubernetes and deployment
- k8s_hf_home: Add k8s_hf_home option (#303).
- Customized system path: Customized system path support (#321).
Models and architectures
- Nemotron: Nemotron support (#273); support Nemotron v3 super model (#325); add Mamba2 ops to Nemotron v3 simulation (#342).
- Mamba2: Add performance data collectors for Mamba2 (#297).
- Quantization: Infer quantization from model info (#338).
- FP8 modeling: Add fp8 static_quant_mode/lowbit_input with compute_scale & scale_matrix modeling (#261).
Collectors and data
- vLLM: Update vLLM collector to 0.14.0 and add H100 data (#310).
- SGLang: Update SGLang performance collectors for v0.5.8 (#323); support EPLB in SGLang (#343).
- TensorRT-LLM WideEP: TensorRT-LLM WideEP All-to-All collector (#313); TensorRT-LLM WideEP pipeline (#320); TensorRT-LLM WideEP MoE collector (#335).
- GB200 NVL72: Add GB200 NVL72 all2all data (#337).
- Qwen3-32B NVFP4 with vLLM: You can configure and deploy Qwen3-32B with NVFP4 quantization using the vLLM backend (e.g., --model-path nvidia/Qwen3-32B-NVFP4 with --backend vllm). The same CLI workflow applies across backends; only the generated deployment artifacts (config files, CLI args, K8s manifests) differ by backend (#546).
Support matrix and UX
- PR description: Enhance PR description generation in support matrix (#460).
- Logging: Hide/deduplicate spammy logs (#494).
Bug Fixes
Generator and config
- Graceful exit and doc: Update generator doc and allow graceful exit of CLI when lacking database data (#277).
- Dynamo 0.8.0: Align generator run script with Dynamo 0.8.0 (#278).
- NIXL default: Use NIXL as default disagg transfer backend for SGLang 0.5.6.post2; allow user to set disagg transfer backend in CLI (#279).
- NIXL KV backend: Add NIXL as default generator SGLang KV backend (#281).
- model_name mapping: Map internal model_name to huggingface_architecture (#274).
- MODEL_PATH in templates: Use MODEL_PATH to replace MODEL in vLLM template to align with TRT-LLM/SGLang (#282).
- Output path: Add hardware and framework into config output path (#284).
- Revert template refactor: Revert "extract common generator templates for reuse" to fix regressions (#340).
- Validator: Fix validator service key mismatch (#370); make --backend required in generator validator (#443); correct validator invocation syntax in generator docs (#444).
- bench_run.sh: Add shebang and error handling to bench_run.sh template (#446).
- TRT-LLM templates: Correct TRT-LLM version-specific engine templates (build_config nesting + missing backend) (#447).
- Artifact dir: Add --artifact-dir to benchmark templates to prevent Permission denied (#448); use correct artifact-dir for AIC-generated AIPerf commands (#474).
- AIPerf concurrencies: Avoid duplicated concurrencies and sort the list in AIPerf command (#451).
- TRT-LLM alignment: Align max_num_tokens and cache_transceiver_max_tokens_in_buffer to tokens_per_block in benchmark TRT-LLM rule (#459).
- build_config (old TRT-LLM): Fix build_config for old TRT-LLM version (#465).
- Naive config DGD name: Fix the naive config generator producing the RFC 1123-invalid DGD name 'None-agg' (#496).
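Two of the fixes above are easy to illustrate in isolation. The following is a minimal sketch, not AIConfigurator's actual code: the helper names is_rfc1123_label and normalize_concurrencies are hypothetical. It shows why a name such as 'None-agg' is rejected as an RFC 1123 DNS label (it contains an uppercase letter) and what "deduplicated and sorted" concurrencies look like.

```python
import re

# RFC 1123 DNS label: lowercase alphanumerics or '-', must start and end
# with an alphanumeric character, and be at most 63 characters long.
_RFC1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_rfc1123_label(name: str) -> bool:
    """Return True if `name` is a valid RFC 1123 DNS label (hypothetical helper)."""
    return len(name) <= 63 and _RFC1123_LABEL.fullmatch(name) is not None


def normalize_concurrencies(values):
    """Drop duplicate concurrency levels and return them in ascending order
    (hypothetical helper)."""
    return sorted(set(values))


if __name__ == "__main__":
    print(is_rfc1123_label("None-agg"))          # uppercase 'N' is not allowed
    print(is_rfc1123_label("qwen3-32b-agg"))     # illustrative valid name
    print(normalize_concurrencies([8, 1, 4, 8, 1]))
```

The example name qwen3-32b-agg is illustrative only; the notes do not state what name the generator emits after the fix.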
Kubernetes and backend templates
- k8s_model_cache: Add the missing k8s_model_cache param to the vLLM/SGLang K8s templates (#280).
- PVC: Move PVC support from frontend to workers for SGLang backend (#291).
- vLLM cudagraph: Fix startup failure caused by vLLM --cudagraph-capture-sizes in k8s_deploy.yaml (#395).
TRT-LLM and SGLang constraints
- max_num_tokens % tokens_per_block: Ensure max_num_tokens % tokens_per_block == 0 in TRT-LLM (#307).
- SGLang moe_dense_tp_size: Restrict SGLang moe_dense_tp_size to 1 or None, the only supported values (#308).
- Quantization block sizes: Ensure quantization block sizes can be divided by MoE intermediate size per GPU (#311).
- cache_transceiver_max_tokens_in_buffer: Ensure cache_transceiver_max_tokens_in_buffer % block_size == 0 (#424).
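The divisibility constraints above (max_num_tokens % tokens_per_block == 0 and cache_transceiver_max_tokens_in_buffer % block_size == 0) can be satisfied by snapping a requested value to a multiple of the block size. The sketch below assumes rounding down; the notes do not say whether the generator rounds down, rounds up, or rejects the value, and align_down is a hypothetical helper rather than part of AIConfigurator.

```python
def align_down(value: int, block: int) -> int:
    """Round `value` down to the nearest multiple of `block` (hypothetical helper).

    Assumption: rounding down is acceptable; the real rule may differ.
    """
    if block <= 0:
        raise ValueError("block must be positive")
    return (value // block) * block


if __name__ == "__main__":
    tokens_per_block = 32          # illustrative block size
    requested = 8200               # illustrative, not a multiple of 32
    max_num_tokens = align_down(requested, tokens_per_block)
    # After alignment the constraint from the notes holds.
    assert max_num_tokens % tokens_per_block == 0
    print(max_num_tokens)
```

Rounding down keeps the value within whatever budget the caller requested, which is why it is used here; rounding up would be the natural alternative when the value is a lower bound.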
Collectors and models
- MoE collector TRT-LLM 1.3.0: Update MoE collector to support TRT-LLM 1.3.0 (#326).
- GPT-OSS inter_size: Correctly obtain inter_size for GPT-OSS; use w4a16_mxfp4 as default MoE quant mode (#319).
- nemotron_nas block config: Enable correct parsing of block configs for nemotron_nas (#318).
- Mamba2 collector: Add missing columns to Mamba2 collector (#341).
- SGLang disaggregated: Filter out different TP sizes for SGLang non-wideep disaggregated serving (#344).
- agg_decode: Remove duplicate agg_decode max_batch_size (#368).
- fp16 KV cache: Fix fp16 KV cache dtype (#369).
Errors and edge cases
- dynamoNamespace: Remove dynamoNamespace field (nvbugs/5830661) (#296).
- Deployment guide: Update guide on Dynamo deployment (nvbugs/5833205) (#295).
- SGLang L40S: Handle SGLang L40S missing data gracefully (#301).
- System path: Fix system path error handling (#371).
- Invalid backend: Catch invalid backend for TRT-LLM backend (#374).
- Artifacts directory: Fix the generated artifacts directory structure in dynamo_deployment_guide.md, which showed an incorrect extra subdirectory (#396).
- head_node_ip: Fix the example command in dynamo_deployment_guide.md, which failed due to an invalid --head_node_ip (#397).
- agg_pareto IndexError: Fix IndexError when all parallel configurations are skipped without exceptions (#398).
- collect_config_paths: Uncaught exception no longer leaks raw traceback to user (#399).
- Support matrix OOM: Resolve OOM issue during support matrix testing (#425).
- Gradio warnings: Remove "gradio not installed" warnings (#428).
- Support check: Make support check case-insensitive; rephrase model not found (#438).
- GPU memory: Improve error messages when model doesn't fit in GPU memory (#445).
- Disagg OOM: Fix OOM not being raised in disagg get_worker_candidates(), which caused an IndexError (#461).
Documentation
- End-to-end workflow: Add end-to-end workflow, document benchmark artifacts, and fix webapp visibility (#440).
CI/CD and testing
- validate_database.ipynb: Add test for validate_database.ipynb (#268).
Other changes
- Support matrix: Automated support matrix updates (#254, #298); update support matrix in README (#324).
New contributors
Thanks to everyone who contributed to this release:
- @github-actions[bot] made their first contribution in #254
- @yingxuanl-dot made their first contribution in #261
Full changelog
Full Changelog: v0.6.0...v0.7.0