AIConfigurator Release v0.7.0

@saturley-hall saturley-hall released this 12 Mar 20:13
5d419f9

AIConfigurator - Release 0.7.0

Summary

AIConfigurator 0.7.0 builds on the multi-backend foundation of 0.6.0 with a stronger focus on CLI and API ergonomics, new models and hardware (Nemotron, Mamba2, GB200 NVL72), generator and validator tooling, and operational robustness. This release unifies model input around model_path, adds naive config generation and a generator benchmark mode, introduces a generator validator that compares generated configs with engine APIs, and adds an auto backend (homogeneous), AIPerf benchmark command generation, and Dynamo planner profiler integration. The support matrix now uses huggingface_id and the Hugging Face architecture, and the CLI gains a support command with matching APIs plus --system all and --backend all. New collectors and data cover TensorRT-LLM WideEP, Mamba2, vLLM 0.14.0, and SGLang 0.5.8; Nemotron and EPLB support is expanded. Numerous fixes improve generator output, Kubernetes templates, TRT-LLM/SGLang constraints, validator usage, and error handling. Documentation now covers the end-to-end workflow, and CI adds notebook validation.


Key Highlights

Unified model input and CLI

  • model_path as unified input: Model input is unified on model_path (#289); support matrix uses huggingface_id and huggingface architecture (#275).
  • Naive config generation: CLI generate command can produce naive configs (#271); CLI APIs enable Python callers (#293).
  • Support command and APIs: New support command and cli_support API (#294), with --system all and --backend all (#439); support for --backend any (homogeneous) (#331), later renamed to auto (#346).

Generator and validator

  • Benchmark mode and validator: Generator gains benchmark mode via rule plugins (#290) and a generator validator to compare generator configs with engine APIs (#329).
  • AIPerf and Dynamo integration: Generator can emit AIPerf benchmark commands (#357); hook to Dynamo planner profiler’s config gen (#419); picking modes and standalone picking API (#421); real-GPU enumeration exposed (#420).
  • Backend and version flexibility: generator supports both Dynamo version and backend version (#333).

New models and hardware

  • Nemotron: Nemotron support (#273), Nemotron v3 super model (#325), and Mamba2 ops in Nemotron v3 simulation (#342).
  • Mamba2 and WideEP: Mamba2 performance data collectors (#297); TensorRT-LLM WideEP All-to-All collector (#313); TRT-LLM WideEP pipeline (#320); TRT-LLM WideEP MoE collector (#335).
  • GB200 NVL72 and SGLang EPLB: GB200 NVL72 all2all data (#337); EPLB support in SGLang (#343).

Backend and collector updates

  • vLLM 0.14.0: vLLM collector updated to 0.14.0 with H100 data (#310) and B200 data (#525).
  • SGLang 0.5.8: SGLang performance collectors updated for v0.5.8 (#323).
  • FP8 and quantization: FP8 static_quant_mode / lowbit_input with compute_scale & scale_matrix modeling (#261); infer quantization from model info (#338).

Features & Enhancements

CLI and APIs

  • Support matrix: Use huggingface_id and huggingface architecture in support_matrix.csv (#275).
  • model_path: Use model_path as the unified model input argument (#289).
  • Generate command: CLI generate command to generate naive configs (#271).
  • CLI APIs: Create CLI APIs to support Python calls (#293).
  • Support command: CLI support command and cli_support API (#294).
  • Support all: Add --system all and --backend all for cli-support (#439).
  • Top-N: Add configurable top_n parameter for result limiting (#315).

Generator and validator

  • Benchmark mode: Generator adds a benchmark mode, enabled by setting rule plugins (#290).
  • Generator validator: Add generator validator to compare generator configs with engine APIs (#329).
  • Backend any/auto: Add support for --backend any (homogeneous) (#331); rename to 'auto' backend (#346).
  • Dynamo and backend version: Generator supports both Dynamo version and backend version (#333).
  • AIPerf command: Enhance generator to generate AIPerf benchmark command (#357).
  • Dynamo planner profiler: Hook to Dynamo planner profiler's config gen (#419).
  • Picking API: Add picking modes and expose standalone picking API (#421).
  • Real-GPU enumeration: Expose real-gpu enumeration logic (#420).
  • Common templates: Extract common generator templates for reuse (#314).

Kubernetes and deployment

  • k8s_hf_home: Add k8s_hf_home option (#303).
  • Customized system path: Customized system path support (#321).

Models and architectures

  • Nemotron: Nemotron support (#273); support Nemotron v3 super model (#325); add Mamba2 ops to Nemotron v3 simulation (#342).
  • Mamba2: Add performance data collectors for Mamba2 (#297).
  • Quantization: Infer quantization from model info (#338).
  • FP8 modeling: Add fp8 static_quant_mode/lowbit_input with compute_scale & scale_matrix modeling (#261).

Collectors and data

  • vLLM: Update vLLM collector to 0.14.0 and add H100 data (#310).
  • SGLang: Update SGLang performance collectors for v0.5.8 (#323); support EPLB in SGLang (#343).
  • TensorRT-LLM WideEP: TensorRT-LLM WideEP All-to-All collector (#313); TRT-LLM WideEP pipeline (#320); TRT-LLM WideEP MoE collector (#335).
  • GB200 NVL72: Add GB200 NVL72 all2all data (#337).
  • Qwen3-32B NVFP4 with vLLM: You can configure and deploy Qwen3-32B with NVFP4 quantization using the vLLM backend (e.g., --model-path nvidia/Qwen3-32B-NVFP4 with --backend vllm). The same CLI workflow applies across backends; only the generated deployment artifacts (config files, CLI args, K8s manifests) differ by backend (#546).

Support matrix and UX

  • PR description: Enhance PR description generation in support matrix (#460).
  • Logging: Hide/deduplicate spammy logs (#494).

Bug Fixes

Generator and config

  • Graceful exit and doc: Update generator doc and allow graceful exit of CLI when lacking database data (#277).
  • Dynamo 0.8.0: Align generator run script with Dynamo 0.8.0 (#278).
  • NIXL default: Use NIXL as default disagg transfer backend for SGLang 0.5.6.post2; allow user to set disagg transfer backend in CLI (#279).
  • NIXL KV backend: Add NIXL as default generator SGLang KV backend (#281).
  • model_name mapping: Map internal model_name to huggingface_architecture (#274).
  • MODEL_PATH in templates: Use MODEL_PATH to replace MODEL in vLLM template to align with TRT-LLM/SGLang (#282).
  • Output path: Add hardware and framework into config output path (#284).
  • Revert template refactor: Revert "extract common generator templates for reuse" to fix regressions (#340).
  • Validator: Fix validator service key mismatch (#370); make --backend required in generator validator (#443); correct validator invocation syntax in generator docs (#444).
  • bench_run.sh: Add shebang and error handling to bench_run.sh template (#446).
  • TRT-LLM templates: Correct TRT-LLM version-specific engine templates (build_config nesting + missing backend) (#447).
  • Artifact dir: Add --artifact-dir to benchmark templates to prevent Permission denied (#448); use correct artifact-dir for AIC-generated AIPerf commands (#474).
  • AIPerf concurrencies: Avoid duplicated concurrencies and sort the list in AIPerf command (#451).
  • TRT-LLM alignment: Align max_num_tokens and cache_transceiver_max_tokens_in_buffer to tokens_per_block in benchmark TRT-LLM rule (#459).
  • build_config (old TRT-LLM): Fix build_config for old TRT-LLM version (#465).
  • Naive config DGD name: Fix the naive config generator producing the RFC 1123-invalid DGD name 'None-agg' (#496).
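
Two of the fixes above are easy to illustrate in isolation: the sorted, de-duplicated concurrency list for AIPerf commands (#451) and the RFC 1123 name check behind the 'None-agg' bug (#496). The following is a minimal sketch, not AIConfigurator's actual code. The helper names (normalize_concurrencies, is_dns_label, sanitize_name) and the "model" fallback are hypothetical, and the regular expression is the DNS label form that Kubernetes derives from RFC 1123.

```python
import re

# Kubernetes DNS label: lowercase alphanumerics or '-', starting and ending
# with an alphanumeric character, at most 63 characters.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def normalize_concurrencies(values):
    """Return the concurrency levels without duplicates, in ascending order."""
    return sorted(set(values))


def is_dns_label(name: str) -> bool:
    """Check a name against the RFC 1123 label rules used by Kubernetes."""
    return len(name) <= 63 and _DNS_LABEL.fullmatch(name) is not None


def sanitize_name(raw, fallback: str = "model") -> str:
    """Turn an arbitrary value into a valid label (hypothetical helper).

    A missing value (None) maps to the fallback instead of the string 'None',
    which is how a name such as 'None-agg' could otherwise appear.
    """
    text = fallback if raw is None else str(raw)
    text = re.sub(r"[^a-z0-9-]+", "-", text.lower()).strip("-")
    text = text[:63].strip("-")
    return text or fallback


if __name__ == "__main__":
    print(normalize_concurrencies([8, 1, 4, 8, 1]))
    print(is_dns_label("None-agg"), is_dns_label("none-agg"))
    print(sanitize_name(None) + "-agg")
```

'None-agg' fails the check only because of the uppercase 'N'; mapping a missing name to an explicit fallback avoids stringifying None in the first place.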

Kubernetes and backend templates

  • k8s_model_cache: Fix vLLM/SGLang K8s templates missing the k8s_model_cache param (#280).
  • PVC: Move PVC support from frontend to workers for SGLang backend (#291).
  • vLLM cudagraph: Fix startup failure caused by vLLM --cudagraph-capture-sizes in k8s_deploy.yaml (#395).

TRT-LLM and SGLang constraints

  • max_num_tokens % tokens_per_block: Ensure max_num_tokens % tokens_per_block == 0 in TRT-LLM (#307).
  • SGLang moe_dense_tp_size: Restrict SGLang moe_dense_tp_size to 1 or None, the only supported values (#308).
  • Quantization block sizes: Ensure the MoE intermediate size per GPU is divisible by the quantization block size (#311).
  • cache_transceiver_max_tokens_in_buffer: Ensure cache_transceiver_max_tokens_in_buffer % block_size == 0 (#424).
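
The divisibility constraints above (#307, #424, #459) come down to rounding a token budget to a multiple of the block size. The sketch below shows one way to do that under stated assumptions: align_up and align_config are hypothetical helpers, the config is a plain dict, and rounding up rather than down is an assumption, since these notes do not say which direction AIConfigurator uses.

```python
def align_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple`."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    # Ceiling division with integers only, then scale back up.
    return -(-value // multiple) * multiple


def align_config(cfg: dict, tokens_per_block: int) -> dict:
    """Return a copy of cfg with token budgets aligned to tokens_per_block.

    Only the two keys named in the release notes are touched; any other
    entries are copied unchanged.
    """
    aligned = dict(cfg)
    for key in ("max_num_tokens", "cache_transceiver_max_tokens_in_buffer"):
        if key in aligned:
            aligned[key] = align_up(aligned[key], tokens_per_block)
    return aligned


if __name__ == "__main__":
    cfg = {"max_num_tokens": 8190, "cache_transceiver_max_tokens_in_buffer": 8200}
    out = align_config(cfg, tokens_per_block=32)
    # After alignment both values satisfy value % tokens_per_block == 0.
    assert all(v % 32 == 0 for v in out.values())
    print(out)
```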

Collectors and models

  • MoE collector TRT-LLM 1.3.0: Update MoE collector to support TRT-LLM 1.3.0 (#326).
  • GPT-OSS inter_size: Correctly obtain inter_size for GPT-OSS; use w4a16_mxfp4 as default MoE quant mode (#319).
  • nemotron_nas block config: Enable correct parsing of block configs for nemotron_nas (#318).
  • Mamba2 collector: Add missing columns to Mamba2 collector (#341).
  • SGLang disaggregated: Filter out different TP sizes for SGLang non-wideep disaggregated serving (#344).
  • agg_decode: Remove duplicate agg_decode max_batch_size (#368).
  • fp16 KV cache: Fix fp16 KV cache dtype (#369).

Errors and edge cases

  • dynamoNamespace: Remove dynamoNamespace field (nvbugs/5830661) (#296).
  • Deployment guide: Update guide on Dynamo deployment (nvbugs/5833205) (#295).
  • SGLang L40S: Handle SGLang L40S missing data gracefully (#301).
  • System path: Fix system path error handling (#371).
  • Invalid backend: Catch invalid backend for TRT-LLM backend (#374).
  • Artifacts directory: Fix the generated artifacts directory structure in dynamo_deployment_guide.md, which showed an incorrect extra subdirectory (#396).
  • head_node_ip: Fix the example command in dynamo_deployment_guide.md, which failed due to an invalid --head_node_ip (#397).
  • agg_pareto IndexError: Fix IndexError when all parallel configurations are skipped without exceptions (#398).
  • collect_config_paths: Uncaught exception no longer leaks raw traceback to user (#399).
  • Support matrix OOM: Resolve OOM issue during support matrix testing (#425).
  • Gradio warnings: Remove "gradio not installed" warnings (#428).
  • Support check: Make support check case-insensitive; rephrase model not found (#438).
  • GPU memory: Improve error messages when model doesn't fit in GPU memory (#445).
  • Disagg OOM: Raise OOM in disagg get_worker_candidates() instead of failing later with an IndexError (#461).
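
Several fixes in this list (#398, #445, #461) share one pattern: when no parallel configuration fits in GPU memory, report that explicitly instead of letting an empty candidate list surface as an IndexError. The sketch below is a simplified illustration under stated assumptions, not the project's implementation. ParallelConfig, InsufficientMemoryError, filter_worker_candidates, and pick_lowest_tp are all hypothetical names, and memory is modeled as a single per-GPU number.

```python
from dataclasses import dataclass


class InsufficientMemoryError(RuntimeError):
    """Raised when no candidate configuration fits in GPU memory."""


@dataclass(frozen=True)
class ParallelConfig:
    tp: int                  # tensor-parallel size
    mem_per_gpu_gib: float   # estimated memory needed on each GPU


def filter_worker_candidates(configs, gpu_mem_gib: float):
    """Keep the configs that fit; raise a descriptive error if none do."""
    fitting = [c for c in configs if c.mem_per_gpu_gib <= gpu_mem_gib]
    if not fitting:
        needed = min((c.mem_per_gpu_gib for c in configs), default=None)
        detail = "no configurations given" if needed is None else (
            f"smallest candidate needs {needed:.1f} GiB per GPU"
        )
        raise InsufficientMemoryError(
            f"model does not fit in {gpu_mem_gib:.1f} GiB per GPU ({detail})"
        )
    return fitting


def pick_lowest_tp(configs, gpu_mem_gib: float) -> ParallelConfig:
    """Pick the fitting config with the smallest tensor-parallel size."""
    return min(filter_worker_candidates(configs, gpu_mem_gib), key=lambda c: c.tp)


if __name__ == "__main__":
    configs = [ParallelConfig(1, 120.0), ParallelConfig(2, 70.0), ParallelConfig(4, 40.0)]
    print(pick_lowest_tp(configs, gpu_mem_gib=80.0))
    try:
        pick_lowest_tp(configs, gpu_mem_gib=24.0)
    except InsufficientMemoryError as err:
        print(f"error: {err}")
```

Raising at the filtering step keeps the failure close to its cause, so the caller sees a memory message rather than an index error from a later lookup.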

Documentation

  • End-to-end workflow: Add end-to-end workflow, document benchmark artifacts, and fix webapp visibility (#440).

CI/CD and testing

  • validate_database.ipynb: Add test for validate_database.ipynb (#268).

Other changes

  • Support matrix: Automated support matrix updates (#254, #298); update support matrix in README (#324).

New contributors

Thanks to everyone who contributed to this release:

  • @github-actions[bot] made their first contribution in #254
  • @yingxuanl-dot made their first contribution in #261

Full changelog

Full Changelog: v0.6.0...v0.7.0