AITune Release v0.3.0
Summary
AITune is an open-source (Apache 2.0) inference toolkit, hosted under the ai-dynamo GitHub organization and distributed via PyPI. It tunes and deploys deep learning models on NVIDIA GPUs, with the goal of improving inference speed and efficiency across AI workloads.
Major Features & Improvements
Tuning Modes
- Just-in-Time (JIT) Tuning: Zero-code model tuning and inspection controlled through a single import or environment flag. Tunes on the very first model call using only one sample, with automatic fallback to Torch Inductor when a backend cannot compile a module.
- Ahead-of-Time (AOT) Tuning: Low-code API for explicit model inspection, backend selection, and module-level tuning. Supports forward hooks for custom pre/post-processing logic around tuned modules.
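The JIT behavior described above (tune on the first call with a single sample, fall back when a backend cannot compile) can be pictured with a small pure-Python sketch. This is not AITune's API: `TuneOnFirstCall`, `failing_backend`, and `passthrough_backend` are hypothetical names used only to illustrate the control flow, and the "backends" here are plain functions standing in for real compilers.

```python
from typing import Any, Callable, Optional, Sequence

# Hypothetical backend signature: takes the callable and one sample of
# arguments, returns a tuned callable or raises if it cannot compile.
Backend = Callable[[Callable[..., Any], tuple], Callable[..., Any]]


class TuneOnFirstCall:
    """Illustrative wrapper: tunes lazily on the first call, then reuses the result."""

    def __init__(self, fn: Callable[..., Any], backends: Sequence[Backend], fallback: Backend):
        self._fn = fn
        self._backends = list(backends)
        self._fallback = fallback
        self._tuned: Optional[Callable[..., Any]] = None
        self.chosen: Optional[str] = None  # name of the backend that was used

    def __call__(self, *args: Any) -> Any:
        if self._tuned is None:
            # The arguments of the very first call serve as the only sample.
            self._tuned = self._tune(args)
        return self._tuned(*args)

    def _tune(self, sample_args: tuple) -> Callable[..., Any]:
        for backend in self._backends:
            try:
                tuned = backend(self._fn, sample_args)
            except Exception:
                continue  # this backend cannot compile the module; try the next
            self.chosen = backend.__name__
            return tuned
        # No preferred backend succeeded: use the fallback.
        self.chosen = self._fallback.__name__
        return self._fallback(self._fn, sample_args)


def failing_backend(fn, sample_args):
    """Stand-in for a backend that hits an unsupported operation."""
    raise RuntimeError("unsupported operation")


def passthrough_backend(fn, sample_args):
    """Stand-in for the fallback compiler; returns the callable unchanged."""
    return fn


def double(x):
    return 2 * x


wrapped = TuneOnFirstCall(double, [failing_backend], passthrough_backend)
```

Calling `wrapped(3)` triggers tuning once; later calls skip straight to the tuned callable, and `wrapped.chosen` records which backend ended up being used.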
Backend Support
- TensorRT: Multi-profile engines with auto-generated and user-provided profiles, CUDA graph capture, FP16/FP8/INT8 mixed precision via TensorRT Model Optimizer, and Dynamo-based ONNX export (torch.onnx.export(dynamo=True)) for improved graph fidelity.
- TorchInductor: Added support for static and dynamic HuggingFace models, broadening model compatibility beyond TensorRT workflows.
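To make the multi-profile idea concrete: a TensorRT optimization profile gives a minimum, optimal, and maximum shape for a dynamic input, and an engine built with several profiles can serve several shape ranges. The sketch below only models the selection step in plain Python. `ShapeProfile` and `select_profile` are hypothetical names, and the first-match policy is an assumption, not a description of how AITune or TensorRT chooses a profile.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Shape = Tuple[int, ...]


@dataclass(frozen=True)
class ShapeProfile:
    """Hypothetical stand-in for one optimization profile of a single input."""
    name: str
    min_shape: Shape
    opt_shape: Shape
    max_shape: Shape

    def covers(self, shape: Shape) -> bool:
        # A shape fits when it has the same rank and every dimension
        # lies within the [min, max] range of the profile.
        if len(shape) != len(self.min_shape):
            return False
        return all(
            lo <= dim <= hi
            for lo, dim, hi in zip(self.min_shape, shape, self.max_shape)
        )


def select_profile(profiles: Sequence[ShapeProfile], shape: Shape) -> Optional[ShapeProfile]:
    """Return the first profile whose range contains `shape`, or None."""
    for profile in profiles:
        if profile.covers(shape):
            return profile
    return None


profiles = [
    ShapeProfile("small_batch", (1, 3, 224, 224), (4, 3, 224, 224), (8, 3, 224, 224)),
    ShapeProfile("large_batch", (9, 3, 224, 224), (16, 3, 224, 224), (32, 3, 224, 224)),
]
```

With these two example profiles, a batch of 2 images falls into `small_batch`, a batch of 16 into `large_batch`, and a batch of 64 matches nothing, which is the case where a caller would need another profile or a different backend.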
Model Compatibility
- Complex Inputs: Support for dataclasses, user-defined objects in module.forward() arguments, and lists/dicts within Torch module containers for more complete model analysis.
- LLM Support: Added KV cache support to enable tuning of autoregressive large language models.
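One way to see why dataclass, list, and dict inputs need explicit support is that a tuner has to reach every leaf value inside the arguments. The following is a minimal sketch of such a traversal in plain Python. `flatten_inputs` and `Batch` are hypothetical, the leaves are simple placeholders rather than tensors, and nothing here is taken from AITune's implementation.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, Dict


def flatten_inputs(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Map each leaf of a nested input to a dotted path such as 'masks.0'."""
    if is_dataclass(value) and not isinstance(value, type):
        items = [(f.name, getattr(value, f.name)) for f in fields(value)]
    elif isinstance(value, dict):
        items = [(str(key), child) for key, child in value.items()]
    elif isinstance(value, (list, tuple)):
        items = [(str(index), child) for index, child in enumerate(value)]
    else:
        # Anything else is treated as a leaf (a tensor in a real model).
        return {prefix or "<root>": value}

    flat: Dict[str, Any] = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else key
        flat.update(flatten_inputs(child, path))
    return flat


@dataclass
class Batch:
    """Hypothetical structured argument passed to a module's forward()."""
    image: str
    masks: list
    extras: dict


batch = Batch(image="img", masks=["m0", "m1"], extras={"scale": 0.5})
flat = flatten_inputs(batch)
```

Here `flat` holds one entry per leaf, keyed by its path (`image`, `masks.0`, `masks.1`, `extras.scale`), which is the kind of view needed to record shapes and dtypes per input.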
Performance & Observability
- Memory Optimization: Reduced CPU/GPU memory usage during tuning by offloading inactive modules to the meta device, with optimized input/output metadata handling.
- Profiling: Extended metrics collection through NVTX annotations for Nsight Systems integration. Added configurable console output suppression with automatic log-to-file.
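The console suppression with log-to-file mentioned above can be approximated with the Python standard library. This is a sketch of the general technique, not AITune's implementation: `quiet_console`, its `enabled` switch, and the `tuning.log` file name are assumptions made for illustration.

```python
import contextlib
import os
import sys
import tempfile


@contextlib.contextmanager
def quiet_console(log_path: str, enabled: bool = True):
    """Redirect Python-level stdout/stderr to a log file while active."""
    if not enabled:
        # Suppression is configurable: when disabled, output passes through.
        yield
        return
    with open(log_path, "a", encoding="utf-8") as log_file:
        with contextlib.redirect_stdout(log_file), contextlib.redirect_stderr(log_file):
            yield


log_path = os.path.join(tempfile.mkdtemp(), "tuning.log")

with quiet_console(log_path):
    print("building engine ...")                      # goes to the log file
    print("warning: falling back", file=sys.stderr)   # also goes to the log file

with open(log_path, encoding="utf-8") as handle:
    captured = handle.read()
```

Note that `contextlib.redirect_stdout` only affects writes made through `sys.stdout` in Python; output written directly to the file descriptors by native libraries would need descriptor-level redirection instead.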
Documentation & Examples
- Documentation & Examples: Added comprehensive documentation and extended end-to-end examples across Computer Vision, Generative AI, Speech Recognition, and NLP workloads.
Bug Fixes
- Fixed dynamic shape handling in the TorchTensorRT AOT and TensorRT ONNX Dynamo export paths, calibration data creation for ModelOpt PTQ, bfloat16 precision in TensorRT, JIT cache directory collisions, and profiling for models without batching support.
Known Issues
- AITune currently supports only single-GPU configurations.
- Just-in-Time tuning does not support transformers>=5 due to the @capture_outputs decorator.
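Until the transformers limitation is lifted, a project could guard against it before enabling JIT tuning. The check below is a sketch using only the standard library. The helper names are hypothetical, and treating a missing transformers installation as "supported" is an assumption.

```python
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Extract the leading major number from a version string like '4.57.1'."""
    return int(version.split(".")[0])


def jit_tuning_supported(transformers_version: Optional[str]) -> bool:
    """True when transformers is absent or its major version is below 5."""
    if transformers_version is None:
        return True  # assumption: the limitation only matters if transformers is installed
    return major_version(transformers_version) < 5


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


if not jit_tuning_supported(installed_transformers_version()):
    print("transformers>=5 detected: use Ahead-of-Time tuning or pin transformers<5")
```

Pinning `transformers<5` in the environment achieves the same effect without a runtime check.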