Release Notes

Dynamo v1.3.0-dev.1 is an experimental dev build giving an early look at v1.3.0. It is not recommended for production: features may be incomplete, and APIs, behaviors, and defaults may change before the stable release. Use it for evaluation, testing, and early feedback only.

Summary

Dynamo v1.3.0-dev.1 is an early preview of v1.3.0. The biggest change is tool-calling and parser parity: Dynamo re-baselined the parser and reasoning stack across the model fleet (GLM-4.7, Qwen3-Coder, DeepSeek V3/V4, Gemma4, Kimi K2, GPT-OSS Harmony, MiniMax, Llama 3.x, and Nemotron) and added structural tag generation, a cross-engine parser parity table, and recovery for EOF-truncated, bare DeepSeek, and Top-N-damaged tool calls. This release also lands a unified backend abstraction: SGLang, TensorRT-LLM, and vLLM now share one path for KV-aware routing, Prometheus metrics parity, OTel tracing, and a health-check canary. Embeddings serving arrives through aggregated text-embedding workers that support the OpenAI dimensions parameter and encoding_format=base64. It further adds the /v1/realtime protocol surface, topology-aware routing with an experimental KV-transfer policy, expanded performance modeling (the Mocker engine, trace Replay, and AIConfigurator perf shims), and continued standalone KV Router and RL / LoRA scheduling work.

Release Branch: release/1.3.0-dev.1, cut from main commit f0192033 after the TensorRT-LLM v1.3.0rc17 upgrade (#10251); release tip 30f92be.

Container Images

| Component | Arch / Variant | Image |
| --- | --- | --- |
| SGLang runtime | CUDA 13 / CUDA 12 / EFA | `nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.3.0-dev.1-{cuda13,cuda12,efa}` |
| TensorRT-LLM runtime | CUDA 13 / EFA | `nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.3.0-dev.1-{cuda13,efa}` |
| vLLM runtime | CUDA 13 / CUDA 12 / EFA | `nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.3.0-dev.1-{cuda13,cuda12,efa}` |
| Frontend | multi-arch (`amd64` + `arm64`) | `nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.3.0-dev.1` |
| Kubernetes Operator | multi-arch | `nvcr.io/nvidia/ai-dynamo/kubernetes-operator:1.3.0-dev.1` |
| Planner | multi-arch | `nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.3.0-dev.1` |
| Snapshot Agent | `amd64` only | `nvcr.io/nvidia/ai-dynamo/snapshot-agent:1.3.0-dev.1` |
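
The `{cuda13,cuda12,efa}` suffix on the runtime images is brace shorthand for several tags, not a literal tag. A minimal sketch that expands it into concrete image references follows; the helper and dictionary names are illustrative only, and the variant lists are copied from the Container Images table.

```python
REGISTRY = "nvcr.io/nvidia/ai-dynamo"
TAG = "1.3.0-dev.1"

# Variants per runtime image, as listed in the Container Images table.
RUNTIME_VARIANTS = {
    "sglang-runtime": ["cuda13", "cuda12", "efa"],
    "tensorrtllm-runtime": ["cuda13", "efa"],
    "vllm-runtime": ["cuda13", "cuda12", "efa"],
}


def runtime_images(component: str) -> list[str]:
    """Expand one runtime component into fully qualified image references."""
    return [
        f"{REGISTRY}/{component}:{TAG}-{variant}"
        for variant in RUNTIME_VARIANTS[component]
    ]


for image in runtime_images("tensorrtllm-runtime"):
    print(image)
```

The Frontend, Kubernetes Operator, Planner, and Snapshot Agent images carry the plain `1.3.0-dev.1` tag with no variant suffix, so they need no expansion.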

Backend Versions

| Backend | Version | CUDA | Python | Notes |
| --- | --- | --- | --- | --- |
| SGLang | 0.5.12.post1 | 13.0 / 12.x | 3.12 | - |
| TensorRT-LLM | 1.3.0rc17 | 13.0 | 3.12 | Built on the upstream TRT-LLM base image |
| vLLM | 0.22.0 | 13.0 / 12.x | 3.12 | XPU / CPU prebuilt images on vLLM 0.21.0 |

Release Artifacts

Same artifact set as v1.2.0 — all runtime and platform container images (above), the ai-dynamo / ai-dynamo-runtime / kvbm wheels, Rust crates (dynamo-runtime, dynamo-llm, …), and Helm charts (dynamo-platform, snapshot). For the complete pinned list and per-artifact links, see Release Artifacts and the Support Matrix.

Prerelease wheels are on pypi.nvidia.com, not public PyPI. The ai-dynamo / ai-dynamo-runtime / kvbm wheels are published to NVIDIA's package index at version 1.3.0.dev1. To install on a supported Linux host:
pip install --pre --extra-index-url https://pypi.nvidia.com ai-dynamo==1.3.0.dev1
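
The container tag is spelled 1.3.0-dev.1 while the wheel version is 1.3.0.dev1 because Python package versions are normalized under PEP 440. The sketch below maps one spelling to the other. It handles only the simple release-plus-dev shape used here, and `wheel_version` is a hypothetical helper, not part of any Dynamo tooling.

```python
import re

# Matches "<release>[sep]dev[sep]<n>", where sep is "-", "_" or ".".
_DEV_TAG = re.compile(r"^(?P<release>\d+(?:\.\d+)*)[-_.]?dev[-_.]?(?P<n>\d+)$")


def wheel_version(tag: str) -> str:
    """Convert a tag such as 'v1.3.0-dev.1' to the PEP 440 form '1.3.0.dev1'."""
    match = _DEV_TAG.match(tag.lstrip("v"))
    if match is None:
        raise ValueError(f"not a simple dev-release tag: {tag!r}")
    return f"{match['release']}.dev{int(match['n'])}"


print(wheel_version("1.3.0-dev.1"))
```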

Major Features

Tool-Calling & Parser Parity

Structural Tag Generation: Added structural tag generation for tool calls (#9711) and a warning when skip_special_tokens strips parser markers (#10225).
Parser Stack Refactor: Decoupled dynamo-parsers from dynamo-protocols (#9922) and renamed the parity suite from parser to tool-calling (#9948) — the refactor the rest of the parser work builds on.
Cross-Model Parser & Reasoning Parity: Re-baselined tool-call and reasoning parsing across GLM-4.7 (#9438, #9355, #9629, #10101), Qwen3-Coder / MiniMax (#9462, #9807, #9866), DeepSeek V3/V4 DSML (#9524, #9813, #9985, #10192), Gemma4 (#9411, #9970, #9981), Kimi K2 (#9594, #9971), GPT-OSS Harmony (#9897, #10054, #10111), Llama 3.x (#9536), and Nemotron (#10115), with aligned SGLang reasoning mapping (#10096, #10114) and Harmony behavior (#9729, #9844).
Robust Recovery: Recovered EOF-truncated (#9864), bare DeepSeek (#10133), and Top-N-damaged (#10144) tool calls, switched non-streaming chat to batch vLLM tool parsing (#10051), preserved tool_calls for tool_choice=required with reasoning (#9804), kept verbatim tool_call arguments for string templates (#9301), and deduplicated reasoning-template validation (#10253).
Parser Parity Table: Added a combined cross-engine parser parity table UI (#10113) with an overview mode (#10143) and parity-matrix auto-generation (#9473).
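
To make the idea of recovering a truncated tool call concrete, here is a generic sketch that closes a JSON payload cut off mid-stream. It is not Dynamo's parser: `close_truncated_json` is a hypothetical helper, and it only handles truncation inside a string value or directly after a complete value (a dangling key or trailing comma still fails to parse).

```python
import json


def close_truncated_json(fragment: str) -> str:
    """Append the quotes and brackets needed to close a cut-off JSON fragment."""
    closers = []          # closing characters still owed, innermost last
    in_string = False
    escaped = False
    for ch in fragment:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()
    repaired = fragment
    if in_string:
        if escaped:
            repaired = repaired[:-1]  # drop a dangling backslash
        repaired += '"'
    return repaired + "".join(reversed(closers))


truncated = '{"name": "get_weather", "arguments": {"city": "Par'
print(json.loads(close_truncated_json(truncated)))
```

A repaired call can still carry incomplete argument values, so a caller would normally validate the result against the tool schema before executing it.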

Unified Backend Abstraction

KV-Aware Routing: Added KV-aware routing on the unified backend abstraction so all backends share one routing path (#9493).
Prometheus Parity: Added Prometheus metrics parity on the unified backend (#9586).
OTel Tracing: Added OpenTelemetry tracing for the unified backend (#9543).
Health-Check Canary: Added a health-check canary on the unified backend (#9642).
Custom Logits Processors: Attached custom logits processors in the TensorRT-LLM unified backend (#10080).

Standalone KV Router & Routing

Predict-On-Route: Added predict-on-route to close a sibling-request routing race (#8276).
Overlap Score Exposure: Exposed router overlap scores (#9538) and refreshed them at dequeue time (#9663).
Queue-Depth Backpressure: Added a KV-router queue-depth option with a 503 backpressure signal when clients exceed the queue depth (#8144).
DP-Rank Sticky Affinity: Added DP-rank-aware sticky session affinity (#9920).
Recovery Event Buffering: Buffered live events during recovery (#9881).
Routing Constraints: Introduced RoutingConstraints and worker taints (#9558).
Global-Router Retries: Retried failed requests on faster pools (#9460) and allowed passing target TTFT/ITL (#9845).
Per-Request Timing: Propagated per-request timing from a standalone KV-router to the frontend (#10182).
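
The queue-depth option above amounts to a bounded queue that rejects work once it is full. The following sketch illustrates that behavior only: `BoundedRequestQueue` and its methods are hypothetical names, and the 200 returned for accepted requests is a placeholder rather than anything the router is documented to return.

```python
from collections import deque


class BoundedRequestQueue:
    """Toy model of a router queue that signals backpressure when full."""

    def __init__(self, max_depth: int) -> None:
        self.max_depth = max_depth
        self._pending = deque()

    def submit(self, request_id: str) -> int:
        """Return 503 if the queue is at capacity, otherwise enqueue and return 200."""
        if len(self._pending) >= self.max_depth:
            return 503
        self._pending.append(request_id)
        return 200

    def dequeue(self):
        """Pop the oldest pending request, or None when the queue is empty."""
        return self._pending.popleft() if self._pending else None


queue = BoundedRequestQueue(max_depth=2)
print([queue.submit(f"req-{i}") for i in range(3)])
```

Clients that receive the 503 are expected to back off and retry, which keeps latency bounded for requests already queued.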

Topology-Aware Routing & KV Transfer

KV Transfer Policy API: Added an experimental KV transfer policy API on the operator (#9768).
Typed Topology Metadata: Added typed topology metadata and KV transfer enforcement (#9767), with propagation to decode (#9893).
Worker Topology Volume: Injected a worker topology Downward API volume (#9792) and added a node-label topology controller (#9879).
WorkerType Scaffolding: Added WorkerType and topology-readiness scaffolding (#8626) and extended register_model to expose WorkerType (#8700).

KV Block Manager (KVBM)

v2 Consolidator: Added the consolidator for KVBM v2 (#9480).
kvbm-logical: Simplified the kvbm-logical backend (#8793) and improved its performance (#9551).
Inactive-Block Cache Toggles: Added pool-level and per-block toggles to disable caching of inactive blocks (#9504).

Embeddings

Text-Embedding Workers: Added an aggregated text-embedding worker on vLLM (#9713) and OpenAI embeddings dimensions on SGLang (#9722) and vLLM (#9751).
Base64 Encoding: Honored OpenAI encoding_format=base64 end-to-end (#9887) and always used base64 on the worker↔frontend wire, decoding at the HTTP boundary (#10139).
Embedding Metrics: Added a dynamo_embedding_latency_seconds frontend histogram (#9758) and gated chat-shaped collectors on embedding workers (#9886, #9830).
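
For readers unfamiliar with encoding_format=base64, the sketch below round-trips an embedding through that representation. It assumes the payload is a packed array of little-endian float32 values, which is the layout OpenAI client libraries decode; the function names are illustrative and not taken from Dynamo.

```python
import base64
import struct


def encode_embedding(values) -> str:
    """Pack floats as little-endian float32 and return base64 text."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return base64.b64encode(raw).decode("ascii")


def decode_embedding(payload: str) -> list:
    """Inverse of encode_embedding: base64 text back to a list of floats."""
    raw = base64.b64decode(payload)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


vector = [0.5, -1.25, 2.0]
encoded = encode_embedding(vector)
print(encoded, decode_embedding(encoded))
```

Each value costs 4 bytes before base64 expansion, which is typically more compact than a JSON array of decimal strings.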

Realtime, RL & LoRA

/v1/realtime Protocol: Added dedicated realtime-API protocol types for /v1/realtime (#9205) and wired them through ModelManager with a bidirectional PushRouter (#9308).
RL Response Protocol: Added the nvext Tokens-in-Tokens-Out RL response protocol and frontend support (#9649).
LoRA Placement: Added LoRA load estimation (#8178) and a Min-Cost Flow placement solver (#8179).

Planner & Profiler

Load Optimization Target: Added a load optimization target to the planner (#9590).
Prometheus Auth/TLS: Added configurable Prometheus auth and TLS for the planner — static bearer token (#9512), bearer-token file for rotation (#9513), custom CA bundle (#9511), SSL-verify toggle (#9510), and fixed extra query params (#9557).
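
As a rough illustration of how these options fit together on the client side, the sketch below builds an authenticated Prometheus query with the standard library. It is not the planner's code: the function and parameter names are hypothetical, and only the `/api/v1/query` endpoint is taken from the Prometheus HTTP API. Reading the token file on every call is what lets a rotated token take effect without a restart.

```python
import ssl
import urllib.parse
import urllib.request
from pathlib import Path


def build_prometheus_request(base_url, query, token_file=None, extra_params=None):
    """Build (but do not send) an instant-query request with optional bearer auth."""
    params = {"query": query, **(extra_params or {})}
    url = f"{base_url.rstrip('/')}/api/v1/query?{urllib.parse.urlencode(params)}"
    request = urllib.request.Request(url)
    if token_file is not None:
        # Re-read on each call so a rotated token is picked up.
        token = Path(token_file).read_text().strip()
        request.add_header("Authorization", f"Bearer {token}")
    return request


def build_ssl_context(ca_bundle=None, verify=True):
    """Create a TLS context with an optional custom CA bundle or verification off."""
    context = ssl.create_default_context(cafile=ca_bundle)
    if not verify:
        context.check_hostname = False  # must be cleared before CERT_NONE
        context.verify_mode = ssl.CERT_NONE
    return context
```

The request and context would then be passed to `urllib.request.urlopen(request, context=context)`. Disabling verification is only appropriate for test clusters.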

Performance Modeling & Tooling

AIConfigurator Perf Shim: Added an AIConfigurator (AIC) forward-pass engine perf shim to the mocker (#10150) and adopted the Rust engine perf shim in the planner (#10229), aligning planner cost estimates with the AIC model.
Mooncake Replay: Added Mooncake delta replay (#9653) and agentic Mooncake trace replay (#9728).
KVBM Offload Simulation: Added KVBM G3 offload simulation (#9337) and G4 object-store offload simulation (#9939) in replay.
TRT-LLM Scheduler Simulation: Added TensorRT-LLM scheduler simulation to the mocker (#10193).
Replay Metrics: Added native Prometheus metrics (#10056) and a --report-jsonl per-request metrics option (#9720).

GPU Memory Service (GMS)

ModelExpress P2P: Integrated ModelExpress P2P weight transfer into the GMS loader and worker (#8218).
Load Overhead: Reduced GMS load overhead (#9635) and supported user-declared GMS checkpoint clients (#9641).

Frontend & Agents

Agent Traces: Added agent traces to /v1/completions (#9125) and autodetected agent behavior in agent traces (#9817).
Disaggregated Processors: Supported disaggregation with the vLLM (#9503) and SGLang (#9577) processors, plus migration on those paths (#9617).
Admission Control: Added an admission-control escape hatch (#9547) and replaced --no-admission-control with --admission-control {token-capacity,none} (#9694).
Context Propagation: Added first-class context metadata propagation (#9662) including the HTTP side (#9726).

Tokenizers, Recipes & Hardware

L1 Prefix Cache: Added an opt-in L1 prefix cache for tokenization (#9742) and a multi-turn extension on a moka W-TinyLFU backend (#10201).
Recipes: Added a Kimi K2.5 + agentic-coding recipe (#9621) and a GLM-5-NVFP4 EFA disagg variant for GB200 on AWS (#9712).
EFA / libfabric: Added a configurable libfabric repo with a v2.5.1 overlay for EFA (#9727, #10047).
XPU / CPU Images: Added prebuilt vLLM images for XPU and CPU (#9661) and upgraded them to vLLM 0.21.0 (#9837).
GB10 SKU: Added the GB10 GPU SKU (#9976).
MooncakeConnector: Supported MooncakeConnector for vLLM PD disaggregation (#9414).

Observability

Resource Observability: Added standalone resource observability (#9780).
Per-Model Dashboard: Added an engine-agnostic per-model Grafana dashboard (#9811).

Minor Features & Improvements

Engine Management Routes: Added engine management routes to the backend (#10094).
Workers Endpoint: Enriched the /workers response and added filter query params (#9983).
Positional Indexer: Added an optional binary-search mode for the positional indexer (#10181) and improved BSI by removing a scheduler hop (#10200).
Cache Hit Weights: Exposed host/disk cache hit weights via CLI and env (#10157).
SGLang Tracing: Respected SGLANG_TRACE_LEVEL when tracing is enabled (#9327).
Operator Pod Metadata: Exposed podLabels and podAnnotations on the controller-manager (#9195) and passed DGD priority class to Grove (#10217).
Custom Init Container: Supported a custom init-container image in the Helm chart (#8432).
NIXL XPU Support: Added NIXL canonical memtype and XPU support (#9073).
Persistent Worker IDs: Published a stable routing id for workers / persistent id discovery (#9665) and advertised non-typed metadata siblings for self-host (#9707).
Planner Reports: Showed recommended replicas in planner reports (#9644) and wrote gzip diagnostics logs with reports (#9623).
Worker-Type Registration: Populated worker_type and needs at vLLM (#9395), TensorRT-LLM (#9396), and SGLang (#9397) registration sites.
Automated TRT-LLM Upgrade: Added an automated TensorRT-LLM dependency upgrade pipeline (#9274).

Bug Fixes

Operator & Kubernetes

DCD/DGDR Reconciliation: Gated DCD readiness on observed generation (#9499), surfaced specific DGDR failure reasons (#8227), and persisted discovered DGDR hardware metadata (#9890).
Grove & LWS: Preserved Grove PCS replica fields during sync (#9773), avoided LWS service-name collisions (#9612), and preserved legacy worker pod labels (#9738).
GMS Claims: Normalized GMS ResourceClaimTemplate names (#9829) and required Kubernetes 1.34+ DRA v1 (#9454).
Secrets & RBAC: Avoided imagePullSecrets drift during operator startup (#9826), scoped docker-secret indexing in restricted mode (#9863), and fixed DGDR profiling RBAC escalation (#9969).
Pod Template Metadata: Preserved embedded pod-template metadata in CRDs (#9553).

Router, Runtime & Scheduling

Tie-Breaks & Candidates: Used reservoir sampling for tie-breaks (#9516) and excluded over-threshold workers from load-balancer candidate sets (#9688).
State & Capacity: Preserved busy state during reconciliation (#9631), normalized DP capacity accounting (#9932), and routed admissions through the actor (#9796).
Event Plane: Let the runtime pick an explicit event plane (#10021) and aborted the TCP writer when the reader join fails to avoid a monitor hang (#9716).
MoE Prefill: Plumbed MoE AIC prefill-load config (#9479) and disabled eagle hash mode for the prefill router (#9857).
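
Reservoir sampling lets a router pick uniformly among equally good workers in a single pass, without first collecting the tied set. The sketch below shows the size-one variant under simple assumptions: `pick_lowest_cost` and the cost dictionary are hypothetical and do not reflect the router's actual data structures.

```python
import random


def pick_lowest_cost(costs: dict, rng: random.Random) -> str:
    """Return a worker with minimal cost, choosing uniformly among ties."""
    best_cost = None
    chosen = None
    ties = 0
    for worker, cost in costs.items():
        if best_cost is None or cost < best_cost:
            # Strictly better: restart the reservoir with this worker.
            best_cost, chosen, ties = cost, worker, 1
        elif cost == best_cost:
            # k-th tied worker replaces the current pick with probability 1/k.
            ties += 1
            if rng.randrange(ties) == 0:
                chosen = worker
    if chosen is None:
        raise ValueError("no workers to choose from")
    return chosen


rng = random.Random(0)
print(pick_lowest_cost({"a": 3, "b": 1, "c": 1, "d": 2}, rng))
```

Always taking the first minimum would send every tied request to the same worker; the random replacement spreads them evenly.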

Backends (SGLang / TensorRT-LLM / vLLM)

SGLang: Stopped re-encoding routed_experts from SGLang 0.5.11+ (#9657), forwarded reasoning_effort to apply_chat_template (#9824), and aligned prefill CUDA-graph batch size with DP (#9962).
TensorRT-LLM: Skipped NIXL init in aggregated mode (#9501), removed the diffusion image-count restriction (#9822), and restored deps/CMD after the upstream base switch (#9889).
vLLM: Applied max_thinking_tokens to SamplingParams (#9571), preserved the user --runner flag (#9710), and added libnixl.so to the CUDA runtime LD_LIBRARY_PATH (#9911).
Multimodal: Added MM-aware KV routing for Phi-3 / Qwen2-VL / Qwen2.5-VL (#9441) and forced eager mode for Wan2.2 video launchers (#9563).

Snapshot & GMS

Restore Reliability: Used a fast startup-probe cadence on restore (#9627), added per-target ready gating with an OCI container-ID fallback (#9534), and polled restore containers before pod running (#9984).
Checkpoint Integrity: Failed checkpoint jobs on helper-container errors (#9850) and preserved the vLLM torch-compile cache (#9943).
GMS Lifecycle: Coordinated sidecar lifecycle via a lock state machine instead of probes (#9514) and pruned unreferenced torch allocations (#10022).

Frontend, Metrics & Streaming

Stream Hygiene: Filtered empty stream chunks from multi-byte token assembly (#8036) and empty Text/Parts chat chunks (#9894).
Metrics Cardinality: Preserved original model casing on lifecycle guards (#9775) and bounded unknown-model label cardinality (#9836).
Anthropic / Responses: Accepted message-level system role (#10108) and a multimodal EasyMessage path (#9470).
Registration: Added metadata-only register_model with cache-first from_hf (#10102).
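
Empty stream chunks arise naturally when a multi-byte UTF-8 character is split across tokens: the first token decodes to nothing until the remaining bytes arrive. A minimal sketch of assembling such tokens while suppressing the empty chunks follows; it uses Python's incremental decoder and is not the frontend's implementation.

```python
import codecs


def assemble_text_chunks(token_byte_chunks):
    """Yield decoded text per token, skipping chunks that decode to nothing."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in token_byte_chunks:
        text = decoder.decode(chunk)
        if text:  # incomplete multi-byte sequences produce "" until completed
            yield text
    tail = decoder.decode(b"", final=True)
    if tail:  # a truncated trailing sequence becomes U+FFFD
        yield tail


# "é" is two bytes (0xC3 0xA9) split across two tokens here.
print(list(assemble_text_chunks([b"h", b"\xc3", b"\xa9", b"", b"!"])))
```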

Build & Packaging

kvbm Import: Updated the kv_cache_connector import for TRT-LLM 1.3.0rc14 (#9622).
etcd: Bumped etcd to v3.5.30 for runtime containers (#9791).
Cargo: Removed broken README references blocking crate publication (#9809).
AMD/ROCm: Added import-time compatibility for AMD ROCm / Python 3.10 hosts (#9929).

Documentation

Backend Guides: Added Rust and Python unified-backend guides (#9492) and refreshed unified-backend feature gaps (#9098, #10098).
Parsers & Parity: Reorganized tool-calling and reasoning into top-level sections (#9400), added a logprobs troubleshooting guide (#9658), and maintained the parser parity matrix (#9614, #10138).
Kubernetes: Added a DGDR PCIe profiler callout and Known Issues (#9428), a Cold Start / Resiliency support matrix (#8612), and an API reference in the sidebar (#10210).
Router & Planner: Added topology-aware KV-transfer docs (#10123), refreshed the router benchmark guide (#10198), and improved the global planner guide (#9418).
Fern / Theme: Enabled the NVIDIA global theme and upgraded the Fern CLI (#9967).
Recipes: Updated the GLM-5 NVFP4 recipe to use the stable SGLang image (#9697).
Localization: Added Chinese-translated docs (#9816).

New Contributors

@sytianhe made their first contribution in #9327
@Dao007forever made their first contribution in #9195
@zhewenl made their first contribution in #9414
@nv-rinig made their first contribution in #9323
@bewestphal made their first contribution in #9512
@my-git9 made their first contribution in #8432
@jooe0824 made their first contribution in #9537
@kirillemilio made their first contribution in #9824
@Shaoting-Feng made their first contribution in #9982
@dynamo-ops made their first contribution in #10036
@Harrilee made their first contribution in #10076
@andyluo7 made their first contribution in #9929
@Change72 made their first contribution in #10157
@mvillmow made their first contribution in #9095

Full Changelog: v1.2.0...v1.3.0-dev.1
