Dynamo Release v0.5.0

@saturley-hall saturley-hall released this 18 Sep 22:47
65f12d7

Dynamo 0.5.0 Release Notes

Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models across any framework, architecture, or deployment scale. It is an open-source project under the Apache 2.0 license. Dynamo is available for installation via pip wheels and containers from NVIDIA NGC.

Dynamo supports multiple large language model (LLM) inference engines (see Support Matrix for details):

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Release Highlights

This release introduces TensorRT-LLM integration for KV cache management, KServe gRPC support, and tool calling capabilities. It also delivers major improvements to system reliability, with end-to-end request cancellation and improved observability.


Major Features and Improvements

1. Fault Tolerance & Observability

  • Implemented end-to-end request cancellation (#2158, #2500) with Python context propagation
  • Implemented DRT shutdown on vLLM engine failures (#2698)
  • Added fast-fail validation for NATS JetStream requirements to prevent silent failures (#2590)
  • Unified metrics across all components with model labels for vLLM (#2474), TensorRT-LLM (#2666), and SGLang (#2679)
  • Standardized Prometheus metrics naming and sanitization with KvStats integration (#2733, #2704)
  • Added automatic uptime tracking and auto-start of metrics collection upon NATS service creation (#2587, #2664), improving observability readiness
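
To illustrate what metric-name sanitization involves, the sketch below maps arbitrary strings onto the character set Prometheus accepts for metric names (`[a-zA-Z_:][a-zA-Z0-9_:]*`). The helper name `sanitize_metric_name` and the example input are assumptions made for this illustration; they are not taken from Dynamo's implementation.

```python
import re

# Anything outside the Prometheus metric-name alphabet.
_INVALID_CHARS = re.compile(r"[^a-zA-Z0-9_:]")


def sanitize_metric_name(raw: str) -> str:
    """Return a valid Prometheus metric name (hypothetical helper)."""
    # Replace every disallowed character with an underscore.
    name = _INVALID_CHARS.sub("_", raw)
    # A metric name may not be empty or start with a digit.
    if not name or name[0].isdigit():
        name = "_" + name
    return name


print(sanitize_metric_name("dynamo.frontend-requests/total"))
# → dynamo_frontend_requests_total
```

Replacing rather than dropping invalid characters keeps the word boundaries of the original name visible in dashboards.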

2. Kubernetes Deployments

  • Integrated Grove and KAI scheduler into Dynamo Cloud Helm chart for multi-node deployments (#2755)
  • Implemented auto-injection of kai-scheduler annotations and labels with parent DGD Kubernetes name support (#2748, #2774)
  • Deployed Dynamo EPP-aware gateway with prevention of double-tokenization for optimized routing (#2633, #2559)
  • Integrated Model Express client for optimized model downloads with URL injection support (#2574, #2769)

3. KV Cache Management & Transfer

  • Integrated Dynamo KVBM connector API with TensorRT-LLM for G2-G3 offloading and onboarding (#2544)
  • Added support for user selection among multiple KV transfer connectors (nixl, kvbm, lmcache) (#2517)
  • Added detailed KV Block Manager metrics for match, offload, and onboard operations (#2626, #2673)

4. Planning & Routing

Router

  • Separated frontend and Router, through Python bindings for KvPushRouter, so the Router and frontend can be scaled independently (#2658, #2548)
  • Implemented warm restarts via durable KV event consumers and radix snapshotting for router persistence (#2756, #2740, #2800)
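
The idea behind snapshot-based warm restarts can be sketched with a toy prefix index: the router's view of which worker holds which KV blocks is serialized, then rebuilt after a restart instead of starting empty. This is a plain (uncompressed) trie rather than a true radix tree, and the class `PrefixIndex`, its methods, and the JSON format are all hypothetical; the durable event consumer that would replay events received after the snapshot is not modelled.

```python
import json


class PrefixIndex:
    """Toy prefix tree mapping sequences of KV block hashes to worker IDs."""

    def __init__(self):
        # Node layout: {"workers": set of worker IDs, "children": {hash: node}}
        self.root = {"workers": set(), "children": {}}

    def insert(self, block_hashes, worker_id):
        """Record that worker_id caches this sequence of blocks."""
        node = self.root
        for h in block_hashes:
            node = node["children"].setdefault(
                h, {"workers": set(), "children": {}}
            )
            node["workers"].add(worker_id)

    def match_length(self, block_hashes, worker_id):
        """Count the leading blocks already cached on worker_id."""
        node, depth = self.root, 0
        for h in block_hashes:
            node = node["children"].get(h)
            if node is None or worker_id not in node["workers"]:
                break
            depth += 1
        return depth

    def snapshot(self):
        """Serialize the whole tree to a JSON string."""
        def encode(node):
            return {
                "workers": sorted(node["workers"]),
                "children": {
                    str(h): encode(c) for h, c in node["children"].items()
                },
            }
        return json.dumps(encode(self.root))

    @classmethod
    def restore(cls, payload):
        """Rebuild an index from a snapshot() string."""
        def decode(node):
            return {
                "workers": set(node["workers"]),
                # JSON object keys are strings; block hashes are ints here.
                "children": {
                    int(h): decode(c) for h, c in node["children"].items()
                },
            }
        index = cls()
        index.root = decode(json.loads(payload))
        return index


index = PrefixIndex()
index.insert([11, 22, 33], "worker-a")
restored = PrefixIndex.restore(index.snapshot())
print(restored.match_length([11, 22, 33, 55], "worker-a"))  # → 3
```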

Planner

  • Added comprehensive tests for replica calculation and planner scaling with automated Kubernetes deployment validation (#2525)
  • Added SLA planner dry-run mode with a CLI to simulate workloads, generate plots, and expose optional Prometheus metrics (#2557)
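
A dry run ultimately evaluates some replica-count rule against simulated load. These notes do not state the planner's formula, so the sketch below assumes the simplest plausible rule: divide the predicted load by the load one replica can sustain within its SLA, round up, and clamp to configured bounds. The function `required_replicas` and all of its parameters are hypothetical.

```python
import math


def required_replicas(
    predicted_load: float,
    per_replica_capacity: float,
    min_replicas: int = 1,
    max_replicas: int = 8,
) -> int:
    """Replicas needed to serve predicted_load (assumed rule, not Dynamo's)."""
    if per_replica_capacity <= 0:
        raise ValueError("per_replica_capacity must be positive")
    # Round up so the fleet is never under-provisioned.
    needed = math.ceil(predicted_load / per_replica_capacity)
    # Clamp to the allowed range.
    return max(min_replicas, min(max_replicas, needed))


print(required_replicas(950, 200))  # → 5
```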

5. Others

Tool Calling

  • Introduced parsers library (#2542) supporting multiple reasoning and tool-calling formats
  • Implemented multiple tool-calling parsers, including Pythonic (#2788), Harmony (#2796), and JSON-based parsers with normal text parsing alongside tool calls (#2709)
  • Added support for separating reasoning from visible text (#2555) along with GPT-OSS reasoning parser integration (#2656)
  • Added support for custom logits processors in the TensorRT-LLM backend, enabling in-place logits modification during generation (#2613, #2702)
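
To make "normal text parsing alongside tool calls" concrete, here is a minimal sketch of a JSON-based parser. It assumes each call arrives as a JSON object wrapped in `<tool_call>` tags with `name` and `arguments` fields; that wire format, the function `parse_tool_calls`, and the example tool `get_weather` are assumptions for illustration and do not describe the Dynamo parsers library API.

```python
import json
import re

# Assumed wire format: a JSON object between <tool_call> tags.
_TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(output: str):
    """Split model output into (normal_text, tool_calls)."""
    calls = []
    for match in _TOOL_CALL.finditer(output):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # Malformed payloads are dropped, not returned.
        if isinstance(payload, dict) and "name" in payload:
            calls.append(
                {
                    "name": payload["name"],
                    "arguments": payload.get("arguments", {}),
                }
            )
    # Everything outside the tags is ordinary user-visible text.
    normal_text = _TOOL_CALL.sub("", output).strip()
    return normal_text, calls


text, calls = parse_tool_calls(
    "Let me check the weather. "
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
)
print(text)   # → Let me check the weather.
print(calls)  # → [{'name': 'get_weather', 'arguments': {'city': 'Paris'}}]
```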

Multimodal Support Expansion

  • Added complete multimodal deployment examples for Llava and Qwen, with video support using vLLM v1 (#2628, #2694, #2738)
  • Added Encode Worker and NIXL support for TensorRT-LLM multimodal disaggregated flows (#2452)

Infrastructure & Performance

  • Added comprehensive KServe gRPC support for industry-standard model inference protocol (#2638)
  • Enhanced Hugging Face integration with HF_HOME and HF_ENDPOINT environment variable support (#2642, #2637)
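
`HF_HOME` and `HF_ENDPOINT` are the standard Hugging Face Hub variables for the cache directory and the Hub base URL. The sketch below shows one way a process might resolve them with the usual defaults; the helper `resolve_hf_settings` and the mirror URL are hypothetical, and the sketch deliberately ignores `XDG_CACHE_HOME`, which the real client also consults.

```python
import os
from pathlib import Path


def resolve_hf_settings(env=None):
    """Resolve cache dir and Hub endpoint from the environment (sketch)."""
    env = os.environ if env is None else env
    # Default cache root when HF_HOME is unset (XDG_CACHE_HOME ignored here).
    default_home = Path.home() / ".cache" / "huggingface"
    return {
        "home": Path(env.get("HF_HOME", default_home)).expanduser(),
        # Strip a trailing slash so URLs can be joined consistently.
        "endpoint": env.get("HF_ENDPOINT", "https://huggingface.co").rstrip("/"),
    }


settings = resolve_hf_settings(
    {"HF_HOME": "/models/hf", "HF_ENDPOINT": "https://mirror.example.com/"}
)
print(settings["endpoint"])  # → https://mirror.example.com
```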

Developer Experience

  • Added Devcontainer improvements with enhanced documentation and SGLang-specific setup (#2255, #2578, #2741)
  • Added logging setup for Kubernetes with Loki integration and Grafana dashboards (#2699)
  • Added benchmarking guide with GenAI-Perf integration and automated performance comparison (#2620)
  • Updated TensorRT-LLM to 1.0.0rc6 and simplified Eagle model configuration (#2606, #2661)

Bug Fixes

  • Improved Hugging Face download speeds with better API client configuration (#2566)
  • Added missing Prometheus to runtime images for SGLang and general runtime (#2565, #2689)
  • Fixed kv-event-config so that the command-line value and environment variable overrides are respected (#2627, #2640)
  • Improved pytest robustness and parsing-error handling, with proper timeout handling (#2676, #2572)
  • Resolved metrics registration timing issues and prevented early returns from affecting measurements (#2664, #2576)

Documentation

  • Created SNS aggregated Kubernetes example and simplified the Sphinx build process (#2773, #2519)
  • Streamlined cloud installation documentation and deployment guides (#2818)
  • Updated benchmarking framework documentation with comprehensive deployment guides (#2620)
  • Updated supported models documentation for multimodal and SGLang container build instructions (#2651, #2707)

Build, CI, and Test

  • Added replica calculation and planner scaling tests with automated Kubernetes deployment validation (#2525)
  • Added vLLM sanity testing support on GitHub Actions with build optimizations (#2526)
  • Optimized CI job execution for docs-only changes and Rust-specific changes (#2775)
  • Enabled KVBM in vLLM container with improved virtual environment handling (#2763)
  • Enhanced test reliability with proper KVBM test exclusions and determinism testing (#2611)
  • Fixed concurrency settings to prevent main branch run cancellations (#2780)
  • Improved container build process with default dev builds for vLLM (#2837)

Migration Notes

  • Parser Integration: The new reasoning and tool-calling parsers are enabled through new CLI flags; update launch commands to use these features
  • Container Updates: Runtime images now include Prometheus by default; review monitoring configurations

Looking Forward

This release sets the stage for more features in our H2 roadmap, including KVBM performance benchmarking, end-to-end performance work, and improved fault tolerance and request rejection at every level. We will also focus on significantly updating documentation and examples, and on adding Kubernetes benchmark scripts for the most popular models.

Release Assets

Python Wheels:

Rust Crates:

Containers:

Helm Charts:


Contributors

We welcome new contributors in this release:
@jasonqinzhou, @michaelfeil, @ahinsutime, @bhuvan002, @WaelBKZ, @hhk7734, @Michaelgathara, @KavinKrishnan, @michaelshin

Full Changelog: v0.4.1...v0.5.0