Dynamo v1.3.0-dev.1

Pre-release

@dagil-nvidia released this 09 Jun 17:29
30f92be

Release Notes

Dynamo v1.3.0-dev.1 is an experimental dev build giving an early look at v1.3.0. It is not recommended for production — features may be incomplete and APIs, behaviors, and defaults may change before the stable release. Use it for evaluation, testing, and early feedback only.

Summary

The biggest change in this preview is tool-calling and parser parity. The parser and reasoning stack has been re-baselined across the model fleet (GLM-4.7, Qwen3-Coder, DeepSeek V3/V4, Gemma4, Kimi K2, GPT-OSS Harmony, MiniMax, Llama 3.x, and Nemotron), and the release adds structural tag generation, a cross-engine parser parity table, and recovery for EOF-truncated, bare DeepSeek, and Top-N-damaged tool calls. v1.3.0 also lands a unified backend abstraction: SGLang, TensorRT-LLM, and vLLM now share one path for KV-aware routing, Prometheus metrics parity, OTel tracing, and a health-check canary. Embeddings serving arrives through aggregated text-embedding workers that support the OpenAI dimensions parameter and base64 encoding. The release also adds the /v1/realtime protocol surface, topology-aware routing with an experimental KV-transfer policy, expanded performance modeling (the Mocker engine, trace Replay, and AIConfigurator perf shims), and continued standalone KV Router and RL / LoRA scheduling work.

Release Branch: release/1.3.0-dev.1, cut from main commit f0192033 after the TensorRT-LLM v1.3.0rc17 upgrade (#10251); release tip 30f92be.

Container Images

| Component | Arch / Variant | Image |
| --- | --- | --- |
| SGLang runtime | CUDA 13 / CUDA 12 / EFA | nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.3.0-dev.1-{cuda13,cuda12,efa} |
| TensorRT-LLM runtime | CUDA 13 / EFA | nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.3.0-dev.1-{cuda13,efa} |
| vLLM runtime | CUDA 13 / CUDA 12 / EFA | nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.3.0-dev.1-{cuda13,cuda12,efa} |
| Frontend | multi-arch (amd64 + arm64) | nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.3.0-dev.1 |
| Kubernetes Operator | multi-arch | nvcr.io/nvidia/ai-dynamo/kubernetes-operator:1.3.0-dev.1 |
| Planner | multi-arch | nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.3.0-dev.1 |
| Snapshot Agent | amd64 only | nvcr.io/nvidia/ai-dynamo/snapshot-agent:1.3.0-dev.1 |

Backend Versions

| Backend | Version | CUDA | Python | Notes |
| --- | --- | --- | --- | --- |
| SGLang | 0.5.12.post1 | 13.0 / 12.x | 3.12 | |
| TensorRT-LLM | 1.3.0rc17 | 13.0 | 3.12 | Built on the upstream TRT-LLM base image |
| vLLM | 0.22.0 | 13.0 / 12.x | 3.12 | XPU / CPU prebuilt images on vLLM 0.21.0 |

Release Artifacts

Same artifact set as v1.2.0 — all runtime and platform container images (above), the ai-dynamo / ai-dynamo-runtime / kvbm wheels, Rust crates (dynamo-runtime, dynamo-llm, …), and Helm charts (dynamo-platform, snapshot). For the complete pinned list and per-artifact links, see Release Artifacts and the Support Matrix.

Prerelease wheels are on pypi.nvidia.com, not public PyPI. The ai-dynamo / ai-dynamo-runtime / kvbm wheels for this prerelease are published to NVIDIA's package index at version 1.3.0.dev1, which is the PEP 440 spelling of 1.3.0-dev.1. Install on a supported Linux host:

pip install --pre --extra-index-url https://pypi.nvidia.com ai-dynamo==1.3.0.dev1
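
After installing, it can help to confirm which wheel versions actually landed, since the release tag (1.3.0-dev.1) and the wheel version (1.3.0.dev1) are spelled differently. The following is a minimal sketch using only the Python standard library; the helper name installed_version is illustrative and not part of Dynamo.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string for a distribution, or None."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in this environment.
        return None


if __name__ == "__main__":
    # Distribution names are the ones listed in these release notes.
    for name in ("ai-dynamo", "ai-dynamo-runtime", "kvbm"):
        print(name, installed_version(name) or "not installed")
```

If the prerelease installed correctly, each distribution that was requested should report 1.3.0.dev1.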

Major Features

Tool-Calling & Parser Parity

  • Structural Tag Generation: Added structural tag generation for tool calls (#9711) and a warning when skip_special_tokens strips parser markers (#10225).
  • Parser Stack Refactor: Decoupled dynamo-parsers from dynamo-protocols (#9922) and renamed the parity suite from parser to tool-calling (#9948) — the refactor the rest of the parser work builds on.
  • Cross-Model Parser & Reasoning Parity: Re-baselined tool-call and reasoning parsing across GLM-4.7 (#9438, #9355, #9629, #10101), Qwen3-Coder / MiniMax (#9462, #9807, #9866), DeepSeek V3/V4 DSML (#9524, #9813, #9985, #10192), Gemma4 (#9411, #9970, #9981), Kimi K2 (#9594, #9971), GPT-OSS Harmony (#9897, #10054, #10111), Llama 3.x (#9536), and Nemotron (#10115), with aligned SGLang reasoning mapping (#10096, #10114) and harmony behavior (#9729, #9844).
  • Robust Recovery: Recovered EOF-truncated (#9864), bare DeepSeek (#10133), and Top-N-damaged (#10144) tool calls, switched non-streaming chat to batch vLLM tool parsing (#10051), preserved tool_calls for tool_choice=required with reasoning (#9804), kept verbatim tool_call arguments for string templates (#9301), and deduplicated reasoning-template validation (#10253).
  • Parser Parity Table: Added a combined cross-engine parser parity table UI (#10113) with an overview mode (#10143) and parity-matrix auto-generation (#9473).

Unified Backend Abstraction

  • KV-Aware Routing: Added KV-aware routing on the unified backend abstraction so all backends share one routing path (#9493).
  • Prometheus Parity: Added Prometheus metrics parity on the unified backend (#9586).
  • OTel Tracing: Added OpenTelemetry tracing for the unified backend (#9543).
  • Health-Check Canary: Added a health-check canary on the unified backend (#9642).
  • Custom Logits Processors: Attached custom logits processors in the TensorRT-LLM unified backend (#10080).

Standalone KV Router & Routing

  • Predict-On-Route: Added predict-on-route to close a sibling-request routing race (#8276).
  • Overlap Score Exposure: Exposed router overlap scores (#9538) and refreshed them at dequeue time (#9663).
  • Queue-Depth Backpressure: Added a KV-router queue-depth option with a 503 backpressure signal when clients exceed the queue depth (#8144).
  • DP-Rank Sticky Affinity: Added DP-rank-aware sticky session affinity (#9920).
  • Recovery Event Buffering: Buffered live events during recovery (#9881).
  • Routing Constraints: Introduced RoutingConstraints and worker taints (#9558).
  • Global-Router Retries: Retried failed requests on faster pools (#9460) and allowed passing target TTFT/ITL (#9845).
  • Per-Request Timing: Propagated per-request timing from a standalone KV-router to the frontend (#10182).
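
The queue-depth backpressure item above means a client can now receive a 503 when the router queue is full. One common way to handle that signal is to retry with exponential backoff and jitter. The sketch below is client-side and generic: send is a hypothetical callable standing in for whatever HTTP call the client makes, and the retry limits are arbitrary defaults rather than values recommended by Dynamo.

```python
import random
import time


def call_with_backoff(send, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call send() until it returns a non-503 status or attempts run out.

    send is a hypothetical zero-argument callable returning (status, body).
    """
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            break
        if attempt < max_attempts - 1:
            # Exponential backoff with full jitter: wait a random time in
            # [0, base_delay * 2**attempt] before the next attempt.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))
    return status, body
```

The sleep parameter is injectable so the retry logic can be exercised in tests without real delays.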

Topology-Aware Routing & KV Transfer

  • KV Transfer Policy API: Added an experimental KV transfer policy API on the operator (#9768).
  • Typed Topology Metadata: Added typed topology metadata and KV transfer enforcement (#9767), with propagation to decode (#9893).
  • Worker Topology Volume: Injected a worker topology Downward API volume (#9792) and added a node-label topology controller (#9879).
  • WorkerType Scaffolding: Added WorkerType and topology-readiness scaffolding (#8626) and extended register_model to expose WorkerType (#8700).

KV Block Manager (KVBM)

  • v2 Consolidator: Added the consolidator for KVBM v2 (#9480).
  • kvbm-logical: Simplified the kvbm-logical backend (#8793) and improved its performance (#9551).
  • Inactive-Block Cache Toggles: Added pool-level and per-block toggles to disable caching of inactive blocks (#9504).

Embeddings

  • Text-Embedding Workers: Added an aggregated text-embedding worker on vLLM (#9713) and OpenAI embeddings dimensions on SGLang (#9722) and vLLM (#9751).
  • Base64 Encoding: Honored OpenAI encoding_format=base64 end-to-end (#9887) and always used base64 on the worker↔frontend wire, decoding at the HTTP boundary (#10139).
  • Embedding Metrics: Added a dynamo_embedding_latency_seconds frontend histogram (#9758) and gated chat-shaped collectors on embedding workers (#9886, #9830).
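
For clients consuming encoding_format=base64 responses, the embedding arrives as a base64 string rather than a JSON array of floats. The sketch below assumes the payload is a packed array of little-endian float32 values, which is the layout OpenAI-compatible clients commonly decode; whether Dynamo's internal worker-to-frontend wire uses the same layout is an assumption here. The function names are illustrative.

```python
import base64
import struct


def encode_embedding(values):
    """Pack floats as little-endian float32 and base64-encode them."""
    packed = struct.pack(f"<{len(values)}f", *values)
    return base64.b64encode(packed).decode("ascii")


def decode_embedding(payload):
    """Decode a base64 string of packed float32 values into a list."""
    raw = base64.b64decode(payload)
    if len(raw) % 4 != 0:
        # float32 values are 4 bytes each; anything else is malformed.
        raise ValueError("payload length is not a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))
```

Values that are not exactly representable in float32 will come back rounded to float32 precision, so comparisons against float64 originals should use a tolerance.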

Realtime, RL & LoRA

  • /v1/realtime Protocol: Added dedicated realtime-API protocol types for /v1/realtime (#9205) and wired them through ModelManager with a bidirectional PushRouter (#9308).
  • RL Response Protocol: Added the nvext Tokens-in-Tokens-Out RL response protocol and frontend support (#9649).
  • LoRA Placement: Added LoRA load estimation (#8178) and a Min-Cost Flow placement solver (#8179).

Planner & Profiler

  • Load Optimization Target: Added a load optimization target to the planner (#9590).
  • Prometheus Auth/TLS: Added configurable Prometheus auth and TLS for the planner — static bearer token (#9512), bearer-token file for rotation (#9513), custom CA bundle (#9511), SSL-verify toggle (#9510), and fixed extra query params (#9557).

Performance Modeling & Tooling

  • AIConfigurator Perf Shim: Added an AIConfigurator (AIC) forward-pass engine perf shim to the mocker (#10150) and adopted the Rust engine perf shim in the planner (#10229), aligning planner cost estimates with the AIC model.
  • Mooncake Replay: Added Mooncake delta replay (#9653) and agentic Mooncake trace replay (#9728).
  • KVBM Offload Simulation: Added KVBM G3 offload simulation (#9337) and G4 object-store offload simulation (#9939) in replay.
  • TRT-LLM Scheduler Simulation: Added TensorRT-LLM scheduler simulation to the mocker (#10193).
  • Replay Metrics: Added native Prometheus metrics (#10056) and a --report-jsonl per-request metrics option (#9720).

GPU Memory Service (GMS)

  • ModelExpress P2P: Integrated ModelExpress P2P weight transfer into the GMS loader and worker (#8218).
  • Load Overhead: Reduced GMS load overhead (#9635) and supported user-declared GMS checkpoint clients (#9641).

Frontend & Agents

  • Agent Traces: Added agent traces to /v1/completions (#9125) and autodetected agent behavior in agent traces (#9817).
  • Disaggregated Processors: Supported disaggregation with the vLLM (#9503) and SGLang (#9577) processors, plus migration on those paths (#9617).
  • Admission Control: Added an admission-control escape hatch (#9547) and replaced --no-admission-control with --admission-control {token-capacity,none} (#9694).
  • Context Propagation: Added first-class context metadata propagation (#9662) including the HTTP side (#9726).

Tokenizers, Recipes & Hardware

  • L1 Prefix Cache: Added an opt-in L1 prefix cache for tokenization (#9742) and a multi-turn extension on a moka W-TinyLFU backend (#10201).
  • Recipes: Added a Kimi K2.5 + agentic-coding recipe (#9621) and a GLM-5-NVFP4 EFA disagg variant for GB200 on AWS (#9712).
  • EFA / libfabric: Added a configurable libfabric repo with a v2.5.1 overlay for EFA (#9727, #10047).
  • XPU / CPU Images: Added prebuilt vLLM images for XPU and CPU (#9661) and upgraded them to vLLM 0.21.0 (#9837).
  • GB10 SKU: Added the GB10 GPU SKU (#9976).
  • MooncakeConnector: Supported MooncakeConnector for vLLM PD disaggregation (#9414).

Observability

  • Resource Observability: Added standalone resource observability (#9780).
  • Per-Model Dashboard: Added an engine-agnostic per-model Grafana dashboard (#9811).

Minor Features & Improvements

  • Engine Management Routes: Added engine management routes to the backend (#10094).
  • Workers Endpoint: Enriched the /workers response and added filter query params (#9983).
  • Positional Indexer: Added an optional binary-search mode for the positional indexer (#10181) and improved BSI by removing a scheduler hop (#10200).
  • Cache Hit Weights: Exposed host/disk cache hit weights via CLI and env (#10157).
  • SGLang Tracing: Respected SGLANG_TRACE_LEVEL when tracing is enabled (#9327).
  • Operator Pod Metadata: Exposed podLabels and podAnnotations on the controller-manager (#9195) and passed DGD priority class to Grove (#10217).
  • Custom Init Container: Supported a custom init-container image in the Helm chart (#8432).
  • NIXL XPU Support: Added NIXL canonical memtype and XPU support (#9073).
  • Persistent Worker IDs: Published a stable routing id for workers / persistent id discovery (#9665) and advertised non-typed metadata siblings for self-host (#9707).
  • Planner Reports: Showed recommended replicas in planner reports (#9644) and wrote gzip diagnostics logs with reports (#9623).
  • Worker-Type Registration: Populated worker_type and needs at vLLM (#9395), TensorRT-LLM (#9396), and SGLang (#9397) registration sites.
  • Automated TRT-LLM Upgrade: Added an automated TensorRT-LLM dependency upgrade pipeline (#9274).

Bug Fixes

Operator & Kubernetes

  • DCD/DGDR Reconciliation: Gated DCD readiness on observed generation (#9499), surfaced specific DGDR failure reasons (#8227), and persisted discovered DGDR hardware metadata (#9890).
  • Grove & LWS: Preserved Grove PCS replica fields during sync (#9773), avoided LWS service-name collisions (#9612), and preserved legacy worker pod labels (#9738).
  • GMS Claims: Normalized GMS ResourceClaimTemplate names (#9829) and required Kubernetes 1.34+ DRA v1 (#9454).
  • Secrets & RBAC: Avoided imagePullSecrets drift during operator startup (#9826), scoped docker-secret indexing in restricted mode (#9863), and fixed DGDR profiling RBAC escalation (#9969).
  • Pod Template Metadata: Preserved embedded pod-template metadata in CRDs (#9553).

Router, Runtime & Scheduling

  • Tie-Breaks & Candidates: Used reservoir sampling for tie-breaks (#9516) and excluded over-threshold workers from load-balancer candidate sets (#9688).
  • State & Capacity: Preserved busy state during reconciliation (#9631), normalized DP capacity accounting (#9932), and routed admissions through the actor (#9796).
  • Event Plane: Let the runtime pick an explicit event plane (#10021) and aborted the TCP writer when the reader join fails to avoid a monitor hang (#9716).
  • MoE Prefill: Plumbed MoE AIC prefill-load config (#9479) and disabled eagle hash mode for the prefill router (#9857).
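
The reservoir-sampling tie-break mentioned above can be illustrated with a size-one reservoir: scan the candidates once, track the best cost, and when the k-th worker ties for best, replace the current choice with probability 1/k. Every tied worker then ends up equally likely without building a list of ties. This is a generic sketch of the technique, not the router's code; the (worker, cost) pairs and the lower-is-better cost are assumptions.

```python
import random


def pick_worker(costs, rng=random):
    """Pick a lowest-cost worker, breaking ties uniformly in one pass.

    costs is an iterable of (worker, cost) pairs; lower cost is better.
    """
    best_cost = None
    chosen = None
    ties = 0
    for worker, cost in costs:
        if best_cost is None or cost < best_cost:
            # New strict minimum: restart the reservoir with this worker.
            best_cost, chosen, ties = cost, worker, 1
        elif cost == best_cost:
            ties += 1
            # Replace the current choice with probability 1 / ties.
            if rng.randrange(ties) == 0:
                chosen = worker
    return chosen
```

Passing a seeded random.Random instance as rng makes the choice reproducible in tests.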

Backends (SGLang / TensorRT-LLM / vLLM)

  • SGLang: Stopped re-encoding routed_experts from SGLang 0.5.11+ (#9657), forwarded reasoning_effort to apply_chat_template (#9824), and aligned prefill CUDA-graph batch size with DP (#9962).
  • TensorRT-LLM: Skipped NIXL init in aggregated mode (#9501), removed the diffusion image-count restriction (#9822), and restored deps/CMD after the upstream base switch (#9889).
  • vLLM: Applied max_thinking_tokens to SamplingParams (#9571), preserved the user --runner flag (#9710), and added libnixl.so to the CUDA runtime LD_LIBRARY_PATH (#9911).
  • Multimodal: Added MM-aware KV routing for Phi-3 / Qwen2-VL / Qwen2.5-VL (#9441) and forced eager mode for Wan2.2 video launchers (#9563).

Snapshot & GMS

  • Restore Reliability: Used a fast startup-probe cadence on restore (#9627), added per-target ready gating with an OCI container-ID fallback (#9534), and polled restore containers before pod running (#9984).
  • Checkpoint Integrity: Failed checkpoint jobs on helper-container errors (#9850) and preserved the vLLM torch-compile cache (#9943).
  • GMS Lifecycle: Coordinated sidecar lifecycle via a lock state machine instead of probes (#9514) and pruned unreferenced torch allocations (#10022).

Frontend, Metrics & Streaming

  • Stream Hygiene: Filtered empty stream chunks from multi-byte token assembly (#8036) and empty Text/Parts chat chunks (#9894).
  • Metrics Cardinality: Preserved original model casing on lifecycle guards (#9775) and bounded unknown-model label cardinality (#9836).
  • Anthropic / Responses: Accepted message-level system role (#10108) and a multimodal EasyMessage path (#9470).
  • Registration: Added metadata-only register_model with cache-first from_hf (#10102).
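
The stream-hygiene fix above concerns empty chunks that appear when a multi-byte character is split across token boundaries: the first fragment decodes to nothing until the rest of the bytes arrive. A minimal sketch of that situation using Python's incremental UTF-8 decoder follows. It illustrates the general pattern and is an assumption about the shape of the problem, not Dynamo's implementation.

```python
import codecs


def assemble_chunks(byte_chunks):
    """Yield decoded text for each byte chunk, skipping empty results."""
    # The incremental decoder buffers incomplete multi-byte sequences.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in byte_chunks:
        text = decoder.decode(chunk)
        if text:
            # Only emit chunks that carry at least one complete character.
            yield text
    # Flush any bytes still buffered when the stream ends.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

For example, feeding the two bytes of "é" as separate chunks produces one text chunk rather than an empty chunk followed by the character.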

Build & Packaging

  • kvbm Import: Updated the kv_cache_connector import for TRT-LLM 1.3.0rc14 (#9622).
  • etcd: Bumped etcd to v3.5.30 for runtime containers (#9791).
  • Cargo: Removed broken README references blocking crate publication (#9809).
  • AMD/ROCm: Added import-time compatibility for AMD ROCm / Python 3.10 hosts (#9929).

Documentation

  • Backend Guides: Added Rust and Python unified-backend guides (#9492) and refreshed unified-backend feature gaps (#9098, #10098).
  • Parsers & Parity: Reorganized tool-calling and reasoning into top-level sections (#9400), added a logprobs troubleshooting guide (#9658), and maintained the parser parity matrix (#9614, #10138).
  • Kubernetes: Added a DGDR PCIe profiler callout and Known Issues (#9428), a Cold Start / Resiliency support matrix (#8612), and an API reference in the sidebar (#10210).
  • Router & Planner: Added topology-aware KV-transfer docs (#10123), refreshed the router benchmark guide (#10198), and improved the global planner guide (#9418).
  • Fern / Theme: Enabled the NVIDIA global theme and upgraded the Fern CLI (#9967).
  • Recipes: Updated the GLM-5 NVFP4 recipe to use the stable SGLang image (#9697).
  • Localization: Added Chinese-translated docs (#9816).

Full Changelog: v1.2.0...v1.3.0-dev.1