Dynamo v1.3.0-dev.1
Pre-release
Release Notes
Dynamo v1.3.0-dev.1 is an experimental dev build giving an early look at v1.3.0. It is not recommended for production — features may be incomplete and APIs, behaviors, and defaults may change before the stable release. Use it for evaluation, testing, and early feedback only.
Summary
Dynamo v1.3.0-dev.1 is an early preview of v1.3.0. The biggest change is tool-calling and parser parity: Dynamo re-baselined the parser and reasoning stack across the model fleet (GLM-4.7, Qwen3-Coder, DeepSeek V3/V4, Gemma4, Kimi K2, GPT-OSS Harmony, MiniMax, Llama 3.x, and Nemotron) and added structural tag generation, a cross-engine parser parity table, and recovery for EOF-truncated, bare, and Top-N-damaged tool calls. The release also lands a unified backend abstraction: SGLang, TensorRT-LLM, and vLLM now share one path for KV-aware routing, Prometheus metrics parity, OTel tracing, and a health-check canary. Embeddings serving arrives through aggregated text-embedding workers with support for the OpenAI `dimensions` parameter and base64 encoding. Other additions are the /v1/realtime protocol surface, topology-aware routing with an experimental KV-transfer policy, expanded performance modeling (the Mocker engine, trace Replay, and AIConfigurator perf shims), and continued standalone KV Router and RL / LoRA scheduling work.
Release Branch: release/1.3.0-dev.1, cut from main commit f0192033 after the TensorRT-LLM v1.3.0rc17 upgrade (#10251); release tip 30f92be.
Container Images
| Component | Arch / Variant | Image |
|---|---|---|
| SGLang runtime | CUDA 13 / CUDA 12 / EFA | nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.3.0-dev.1-{cuda13,cuda12,efa} |
| TensorRT-LLM runtime | CUDA 13 / EFA | nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.3.0-dev.1-{cuda13,efa} |
| vLLM runtime | CUDA 13 / CUDA 12 / EFA | nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.3.0-dev.1-{cuda13,cuda12,efa} |
| Frontend | multi-arch (amd64 + arm64) | nvcr.io/nvidia/ai-dynamo/dynamo-frontend:1.3.0-dev.1 |
| Kubernetes Operator | multi-arch | nvcr.io/nvidia/ai-dynamo/kubernetes-operator:1.3.0-dev.1 |
| Planner | multi-arch | nvcr.io/nvidia/ai-dynamo/dynamo-planner:1.3.0-dev.1 |
| Snapshot Agent | amd64 only | nvcr.io/nvidia/ai-dynamo/snapshot-agent:1.3.0-dev.1 |
Backend Versions
| Backend | Version | CUDA | Python | Notes |
|---|---|---|---|---|
| SGLang | 0.5.12.post1 | 13.0 / 12.x | 3.12 | — |
| TensorRT-LLM | 1.3.0rc17 | 13.0 | 3.12 | Built on the upstream TRT-LLM base image |
| vLLM | 0.22.0 | 13.0 / 12.x | 3.12 | XPU / CPU prebuilt images on vLLM 0.21.0 |
Release Artifacts
This release ships the same artifact set as v1.2.0: all runtime and platform container images (above), the ai-dynamo / ai-dynamo-runtime / kvbm wheels, Rust crates (dynamo-runtime, dynamo-llm, …), and Helm charts (dynamo-platform, snapshot). For the complete pinned list and per-artifact links, see Release Artifacts and the Support Matrix.
Prerelease wheels are on pypi.nvidia.com, not public PyPI. Because `1.3.0-dev.1` is a prerelease, the `ai-dynamo` / `ai-dynamo-runtime` / `kvbm` wheels are published to NVIDIA's package index `pypi.nvidia.com` at `1.3.0.dev1`. Install on a supported Linux host: `pip install --pre --extra-index-url https://pypi.nvidia.com ai-dynamo==1.3.0.dev1`
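The release tag (1.3.0-dev.1) and the wheel version (1.3.0.dev1) name the same build: Python packaging tools normalize development-release tags under PEP 440. The sketch below illustrates that normalization using only the standard library. The helper name `normalize_dev_version` is an assumption made for this example, and it covers only the simple `X.Y.Z-dev.N` shape rather than the full PEP 440 grammar.

```python
import re

# Illustrative helper, not part of Dynamo or pip.
# Accepts only "X.Y.Z" followed by an optional separator, "dev",
# another optional separator, and a number.
_DEV_TAG = re.compile(r"^(\d+(?:\.\d+)*)[-_.]?dev[-_.]?(\d+)$")


def normalize_dev_version(tag: str) -> str:
    """Return the PEP 440 normalized form of a simple dev-release tag."""
    match = _DEV_TAG.match(tag.strip().lower().lstrip("v"))
    if match is None:
        raise ValueError(f"not a simple dev-release tag: {tag!r}")
    release, dev_number = match.groups()
    return f"{release}.dev{int(dev_number)}"


print(normalize_dev_version("v1.3.0-dev.1"))  # prints 1.3.0.dev1
```

This is why the `pip install` pin uses `ai-dynamo==1.3.0.dev1` while container tags and the release branch use `1.3.0-dev.1`.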
Major Features
Tool-Calling & Parser Parity
- Structural Tag Generation: Added structural tag generation for tool calls (#9711) and a warning when `skip_special_tokens` strips parser markers (#10225).
- Parser Stack Refactor: Decoupled `dynamo-parsers` from `dynamo-protocols` (#9922) and renamed the parity suite from parser to tool-calling (#9948); the rest of the parser work builds on this refactor.
- Cross-Model Parser & Reasoning Parity: Re-baselined tool-call and reasoning parsing across GLM-4.7 (#9438, #9355, #9629, #10101), Qwen3-Coder / MiniMax (#9462, #9807, #9866), DeepSeek V3/V4 DSML (#9524, #9813, #9985, #10192), Gemma4 (#9411, #9970, #9981), Kimi K2 (#9594, #9971), GPT-OSS Harmony (#9897, #10054, #10111), Llama 3.x (#9536), and Nemotron (#10115), with aligned SGLang reasoning mapping (#10096, #10114) and harmony behavior (#9729, #9844).
- Robust Recovery: Recovered EOF-truncated (#9864), bare DeepSeek (#10133), and Top-N-damaged (#10144) tool calls, switched non-streaming chat to batch vLLM tool parsing (#10051), preserved `tool_calls` for `tool_choice=required` with reasoning (#9804), kept verbatim `tool_call` arguments for string templates (#9301), and deduplicated reasoning-template validation (#10253).
- Parser Parity Table: Added a combined cross-engine parser parity table UI (#10113) with an overview mode (#10143) and parity-matrix auto-generation (#9473).
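To give a feel for what recovering an EOF-truncated tool call involves, here is a minimal sketch of one generic strategy: close any string and brackets left open when generation stopped, then try to parse the result. It is not Dynamo's parser logic. The function name `recover_truncated_json` and the JSON-only payload shape are assumptions for this example, and real tool-call formats differ by model.

```python
import json
from typing import Any, Optional


def recover_truncated_json(text: str) -> Optional[Any]:
    """Best-effort repair of a JSON payload cut off at end of stream."""
    closers = []        # closing brackets still owed, innermost last
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            closers.append("}" if ch == "{" else "]")
        elif ch in "}]" and closers:
            closers.pop()
    # A dangling backslash cannot be completed safely, so drop it.
    repaired = text[:-1] if escaped else text
    if in_string:
        repaired += '"'
    repaired += "".join(reversed(closers))
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        return None  # too damaged for this simple strategy


cut_off = '{"name": "get_weather", "arguments": {"city": "Par'
print(recover_truncated_json(cut_off))
```

A repaired payload can still carry a truncated argument value (here the city name), so a caller would normally validate the result against the tool schema before executing anything.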
Unified Backend Abstraction
- KV-Aware Routing: Added KV-aware routing on the unified backend abstraction so all backends share one routing path (#9493).
- Prometheus Parity: Added Prometheus metrics parity on the unified backend (#9586).
- OTel Tracing: Added OpenTelemetry tracing for the unified backend (#9543).
- Health-Check Canary: Added a health-check canary on the unified backend (#9642).
- Custom Logits Processors: Attached custom logits processors in the TensorRT-LLM unified backend (#10080).
Standalone KV Router & Routing
- Predict-On-Route: Added predict-on-route to close a sibling-request routing race (#8276).
- Overlap Score Exposure: Exposed router overlap scores (#9538) and refreshed them at dequeue time (#9663).
- Queue-Depth Backpressure: Added a KV-router queue-depth option with a 503 backpressure signal when clients exceed the queue depth (#8144).
- DP-Rank Sticky Affinity: Added DP-rank-aware sticky session affinity (#9920).
- Recovery Event Buffering: Buffered live events during recovery (#9881).
- Routing Constraints: Introduced RoutingConstraints and worker taints (#9558).
- Global-Router Retries: Retried failed requests on faster pools (#9460) and allowed passing target TTFT/ITL (#9845).
- Per-Request Timing: Propagated per-request timing from a standalone KV-router to the frontend (#10182).
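With queue-depth backpressure, a client can see HTTP 503 when the router queue is full. One common client-side response is to retry with exponential backoff, sketched below under stated assumptions: `send` is a hypothetical callable standing in for whatever HTTP client is in use, and the retry limits are arbitrary illustration values rather than Dynamo defaults.

```python
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay_s: float = 0.25,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call `send`, retrying while it reports HTTP 503.

    `send` returns (status_code, body). The wait doubles after each
    503, and at most `max_attempts` calls are made in total.
    """
    status, body = send()
    for attempt in range(1, max_attempts):
        if status != 503:
            break
        sleep(base_delay_s * 2 ** (attempt - 1))
        status, body = send()
    return status, body
```

The `sleep` parameter is injectable so the behavior can be exercised in tests without real delays. Adding random jitter to each wait is a common refinement when many clients back off at once.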
Topology-Aware Routing & KV Transfer
- KV Transfer Policy API: Added an experimental KV transfer policy API on the operator (#9768).
- Typed Topology Metadata: Added typed topology metadata and KV transfer enforcement (#9767), with propagation to decode (#9893).
- Worker Topology Volume: Injected a worker topology Downward API volume (#9792) and added a node-label topology controller (#9879).
- WorkerType Scaffolding: Added WorkerType and topology-readiness scaffolding (#8626) and extended `register_model` to expose WorkerType (#8700).
KV Block Manager (KVBM)
- v2 Consolidator: Added the consolidator for KVBM v2 (#9480).
- kvbm-logical: Simplified the kvbm-logical backend (#8793) and improved its performance (#9551).
- Inactive-Block Cache Toggles: Added pool-level and per-block toggles to disable caching of inactive blocks (#9504).
Embeddings
- Text-Embedding Workers: Added an aggregated text-embedding worker on vLLM (#9713) and OpenAI embeddings `dimensions` on SGLang (#9722) and vLLM (#9751).
- Base64 Encoding: Honored OpenAI `encoding_format=base64` end-to-end (#9887) and always used base64 on the worker↔frontend wire, decoding at the HTTP boundary (#10139).
- Embedding Metrics: Added a `dynamo_embedding_latency_seconds` frontend histogram (#9758) and gated chat-shaped collectors on embedding workers (#9886, #9830).
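For readers consuming base64-encoded embeddings, the sketch below shows a round trip under the assumption that the vector is packed as little-endian float32 values before base64 encoding, which is the convention the OpenAI API uses. The helper names are illustrative and not part of any Dynamo API.

```python
import base64
import struct
from typing import List


def encode_embedding(values: List[float]) -> str:
    """Pack floats as little-endian float32, then base64-encode."""
    packed = struct.pack(f"<{len(values)}f", *values)
    return base64.b64encode(packed).decode("ascii")


def decode_embedding(payload: str) -> List[float]:
    """Reverse of encode_embedding: base64 text back to a float list."""
    raw = base64.b64decode(payload)
    if len(raw) % 4:
        raise ValueError("payload length is not a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


vector = [0.5, -1.0, 2.25]  # values exactly representable in float32
assert decode_embedding(encode_embedding(vector)) == vector
```

Each dimension costs 4 bytes before base64 expansion, which is typically more compact than a JSON array of decimal strings. Values that are not exactly representable in float32 come back rounded to float32 precision.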
Realtime, RL & LoRA
- /v1/realtime Protocol: Added dedicated realtime-API protocol types for `/v1/realtime` (#9205) and wired them through ModelManager with a bidirectional PushRouter (#9308).
- RL Response Protocol: Added the `nvext` Tokens-in-Tokens-Out RL response protocol and frontend support (#9649).
- LoRA Placement: Added LoRA load estimation (#8178) and a Min-Cost Flow placement solver (#8179).
Planner & Profiler
- Load Optimization Target: Added a load optimization target to the planner (#9590).
- Prometheus Auth/TLS: Added configurable Prometheus auth and TLS for the planner — static bearer token (#9512), bearer-token file for rotation (#9513), custom CA bundle (#9511), SSL-verify toggle (#9510), and fixed extra query params (#9557).
Performance Modeling & Tooling
- AIConfigurator Perf Shim: Added an AIConfigurator (AIC) forward-pass engine perf shim to the mocker (#10150) and adopted the Rust engine perf shim in the planner (#10229), aligning planner cost estimates with the AIC model.
- Mooncake Replay: Added Mooncake delta replay (#9653) and agentic Mooncake trace replay (#9728).
- KVBM Offload Simulation: Added KVBM G3 offload simulation (#9337) and G4 object-store offload simulation (#9939) in replay.
- TRT-LLM Scheduler Simulation: Added TensorRT-LLM scheduler simulation to the mocker (#10193).
- Replay Metrics: Added native Prometheus metrics (#10056) and a `--report-jsonl` per-request metrics option (#9720).
GPU Memory Service (GMS)
- ModelExpress P2P: Integrated ModelExpress P2P weight transfer into the GMS loader and worker (#8218).
- Load Overhead: Reduced GMS load overhead (#9635) and supported user-declared GMS checkpoint clients (#9641).
Frontend & Agents
- Agent Traces: Added agent traces to `/v1/completions` (#9125) and autodetected agent behavior in agent traces (#9817).
- Disaggregated Processors: Supported disaggregation with the vLLM (#9503) and SGLang (#9577) processors, plus migration on those paths (#9617).
- Admission Control: Added an admission-control escape hatch (#9547) and replaced `--no-admission-control` with `--admission-control {token-capacity,none}` (#9694).
- Context Propagation: Added first-class context metadata propagation (#9662) including the HTTP side (#9726).
Tokenizers, Recipes & Hardware
- L1 Prefix Cache: Added an opt-in L1 prefix cache for tokenization (#9742) and a multi-turn extension on a moka W-TinyLFU backend (#10201).
- Recipes: Added a Kimi K2.5 + agentic-coding recipe (#9621) and a GLM-5-NVFP4 EFA disagg variant for GB200 on AWS (#9712).
- EFA / libfabric: Added a configurable libfabric repo with a v2.5.1 overlay for EFA (#9727, #10047).
- XPU / CPU Images: Added prebuilt vLLM images for XPU and CPU (#9661) and upgraded them to vLLM 0.21.0 (#9837).
- GB10 SKU: Added the GB10 GPU SKU (#9976).
- MooncakeConnector: Supported MooncakeConnector for vLLM PD disaggregation (#9414).
Observability
- Resource Observability: Added standalone resource observability (#9780).
- Per-Model Dashboard: Added an engine-agnostic per-model Grafana dashboard (#9811).
Minor Features & Improvements
- Engine Management Routes: Added engine management routes to the backend (#10094).
- Workers Endpoint: Enriched the `/workers` response and added filter query params (#9983).
- Positional Indexer: Added an optional binary-search mode for the positional indexer (#10181) and improved BSI by removing a scheduler hop (#10200).
- Cache Hit Weights: Exposed host/disk cache hit weights via CLI and env (#10157).
- SGLang Tracing: Respected `SGLANG_TRACE_LEVEL` when tracing is enabled (#9327).
- Operator Pod Metadata: Exposed `podLabels` and `podAnnotations` on the controller-manager (#9195) and passed DGD priority class to Grove (#10217).
- Custom Init Container: Supported a custom init-container image in the Helm chart (#8432).
- NIXL XPU Support: Added NIXL canonical memtype and XPU support (#9073).
- Persistent Worker IDs: Published a stable routing id for workers / persistent id discovery (#9665) and advertised non-typed metadata siblings for self-host (#9707).
- Planner Reports: Showed recommended replicas in planner reports (#9644) and wrote gzip diagnostics logs with reports (#9623).
- Worker-Type Registration: Populated `worker_type` and `needs` at vLLM (#9395), TensorRT-LLM (#9396), and SGLang (#9397) registration sites.
- Automated TRT-LLM Upgrade: Added an automated TensorRT-LLM dependency upgrade pipeline (#9274).
Bug Fixes
Operator & Kubernetes
- DCD/DGDR Reconciliation: Gated DCD readiness on observed generation (#9499), surfaced specific DGDR failure reasons (#8227), and persisted discovered DGDR hardware metadata (#9890).
- Grove & LWS: Preserved Grove PCS replica fields during sync (#9773), avoided LWS service-name collisions (#9612), and preserved legacy worker pod labels (#9738).
- GMS Claims: Normalized GMS ResourceClaimTemplate names (#9829) and required Kubernetes 1.34+ DRA v1 (#9454).
- Secrets & RBAC: Avoided imagePullSecrets drift during operator startup (#9826), scoped docker-secret indexing in restricted mode (#9863), and fixed DGDR profiling RBAC escalation (#9969).
- Pod Template Metadata: Preserved embedded pod-template metadata in CRDs (#9553).
Router, Runtime & Scheduling
- Tie-Breaks & Candidates: Used reservoir sampling for tie-breaks (#9516) and excluded over-threshold workers from load-balancer candidate sets (#9688).
- State & Capacity: Preserved busy state during reconciliation (#9631), normalized DP capacity accounting (#9932), and routed admissions through the actor (#9796).
- Event Plane: Let the runtime pick an explicit event plane (#10021) and aborted the TCP writer when the reader join fails to avoid a monitor hang (#9716).
- MoE Prefill: Plumbed MoE AIC prefill-load config (#9479) and disabled eagle hash mode for the prefill router (#9857).
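Reservoir sampling lets a selector break ties uniformly at random in a single pass, without first collecting every tied candidate. The sketch below illustrates the general technique for a least-loaded pick. The function name and the `loads` mapping are assumptions for this example and do not mirror the router's actual data structures.

```python
import random
from typing import Dict, Optional


def pick_least_loaded(
    loads: Dict[str, float],
    rng: Optional[random.Random] = None,
) -> str:
    """Pick a worker with minimal load; ties are broken uniformly."""
    rng = rng or random.Random()
    best_worker = None
    best_load = float("inf")
    ties = 0
    for worker, load in loads.items():
        if load < best_load:
            # Strictly better: restart the reservoir with this worker.
            best_worker, best_load, ties = worker, load, 1
        elif load == best_load:
            ties += 1
            # Keep the newcomer with probability 1/ties (reservoir of size 1).
            if rng.randrange(ties) == 0:
                best_worker = worker
    if best_worker is None:
        raise ValueError("no workers to choose from")
    return best_worker
```

After seeing k tied candidates, each one is the current pick with probability 1/k, so no worker is favored by its position in the iteration order.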
Backends (SGLang / TensorRT-LLM / vLLM)
- SGLang: Stopped re-encoding routed_experts from SGLang 0.5.11+ (#9657), forwarded `reasoning_effort` to `apply_chat_template` (#9824), and aligned prefill CUDA-graph batch size with DP (#9962).
- TensorRT-LLM: Skipped NIXL init in aggregated mode (#9501), removed the diffusion image-count restriction (#9822), and restored deps/CMD after the upstream base switch (#9889).
- vLLM: Applied `max_thinking_tokens` to SamplingParams (#9571), preserved the user `--runner` flag (#9710), and added `libnixl.so` to the CUDA runtime `LD_LIBRARY_PATH` (#9911).
- Multimodal: Added MM-aware KV routing for Phi-3 / Qwen2-VL / Qwen2.5-VL (#9441) and forced eager mode for Wan2.2 video launchers (#9563).
Snapshot & GMS
- Restore Reliability: Used a fast startup-probe cadence on restore (#9627), added per-target ready gating with an OCI container-ID fallback (#9534), and polled restore containers before pod running (#9984).
- Checkpoint Integrity: Failed checkpoint jobs on helper-container errors (#9850) and preserved the vLLM torch-compile cache (#9943).
- GMS Lifecycle: Coordinated sidecar lifecycle via a lock state machine instead of probes (#9514) and pruned unreferenced torch allocations (#10022).
Frontend, Metrics & Streaming
- Stream Hygiene: Filtered empty stream chunks from multi-byte token assembly (#8036) and empty Text/Parts chat chunks (#9894).
- Metrics Cardinality: Preserved original model casing on lifecycle guards (#9775) and bounded unknown-model label cardinality (#9836).
- Anthropic / Responses: Accepted message-level system role (#10108) and a multimodal `EasyMessage` path (#9470).
- Registration: Added metadata-only `register_model` with cache-first `from_hf` (#10102).
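Bounding label cardinality matters because every distinct label value creates a new Prometheus time series, and model names come from client requests. A minimal sketch of the idea follows, assuming a single placeholder label for unregistered names and a case-insensitive match that emits the registered casing. The function name, the `"unknown"` value, and the matching rule are all hypothetical and may differ from what the frontend does.

```python
from typing import Iterable

UNKNOWN_LABEL = "unknown"  # hypothetical placeholder value


def model_label(requested: str, registered: Iterable[str]) -> str:
    """Return a metrics label drawn from a bounded set of values."""
    by_lower = {name.lower(): name for name in registered}
    # Match case-insensitively, but emit the registered (original) casing.
    return by_lower.get(requested.lower(), UNKNOWN_LABEL)


registered_models = ["Qwen3-Coder", "GLM-4.7"]
print(model_label("qwen3-coder", registered_models))      # prints Qwen3-Coder
print(model_label("made-up-model-123", registered_models))  # prints unknown
```

With this shape, the number of label values is at most the number of registered models plus one, no matter what strings clients send.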
Build & Packaging
- kvbm Import: Updated the `kv_cache_connector` import for TRT-LLM 1.3.0rc14 (#9622).
- etcd: Bumped etcd to v3.5.30 for runtime containers (#9791).
- Cargo: Removed broken README references blocking crate publication (#9809).
- AMD/ROCm: Added import-time compatibility for AMD ROCm / Python 3.10 hosts (#9929).
Documentation
- Backend Guides: Added Rust and Python unified-backend guides (#9492) and refreshed unified-backend feature gaps (#9098, #10098).
- Parsers & Parity: Reorganized tool-calling and reasoning into top-level sections (#9400), added a logprobs troubleshooting guide (#9658), and maintained the parser parity matrix (#9614, #10138).
- Kubernetes: Added a DGDR PCIe profiler callout and Known Issues (#9428), a Cold Start / Resiliency support matrix (#8612), and an API reference in the sidebar (#10210).
- Router & Planner: Added topology-aware KV-transfer docs (#10123), refreshed the router benchmark guide (#10198), and improved the global planner guide (#9418).
- Fern / Theme: Enabled the NVIDIA global theme and upgraded the Fern CLI (#9967).
- Recipes: Updated the GLM-5 NVFP4 recipe to use the stable SGLang image (#9697).
- Localization: Added Chinese-translated docs (#9816).
New Contributors
- @sytianhe made their first contribution in #9327
- @Dao007forever made their first contribution in #9195
- @zhewenl made their first contribution in #9414
- @nv-rinig made their first contribution in #9323
- @bewestphal made their first contribution in #9512
- @my-git9 made their first contribution in #8432
- @jooe0824 made their first contribution in #9537
- @kirillemilio made their first contribution in #9824
- @Shaoting-Feng made their first contribution in #9982
- @dynamo-ops made their first contribution in #10036
- @Harrilee made their first contribution in #10076
- @andyluo7 made their first contribution in #9929
- @Change72 made their first contribution in #10157
- @mvillmow made their first contribution in #9095
Full Changelog: v1.2.0...v1.3.0-dev.1