From 6e7d0e74b3577f664b959bda46fc854b7f48a07d Mon Sep 17 00:00:00 2001 From: Jiaxin Shan Date: Tue, 5 Aug 2025 13:21:23 -0700 Subject: [PATCH] Update v0.4.0 blog post Signed-off-by: Jiaxin Shan --- content/posts/2025-08-04-v0.4.0-release.md | 27 +++++++++++----------- 1 file changed, 13 insertions(+), 14 deletions(-) diff --git a/content/posts/2025-08-04-v0.4.0-release.md b/content/posts/2025-08-04-v0.4.0-release.md index 6b71bd0..1184307 100644 --- a/content/posts/2025-08-04-v0.4.0-release.md +++ b/content/posts/2025-08-04-v0.4.0-release.md @@ -17,7 +17,7 @@ ShowToc: true tocopen: true --- -AIBrix is a composable, cloud‑native LLM inference infrastructure designed to deliver high performance and low cost at scale. We now present a major update in a new release \- v0.4.0. This release tackles key bottlenecks in orchestration and routing for **Prefill/Decode(P/D) Disaggregation** and **Large‑scale Expert Parallelism(EP)**, optimizations in the **AIBrix KVCache V1 Connector**, **KV Event synchronization** from engine and **Multi‑Engine** support. +AIBrix is a composable, cloud‑native LLM inference infrastructure designed to deliver high performance and low cost at scale. We now present a major update in a new release - v0.4.0. This release tackles key bottlenecks in orchestration and routing for **Prefill/Decode (P/D) Disaggregation** and **Large‑scale Expert Parallelism (EP)**, and brings optimizations in the **AIBrix KVCache V1 Connector**, **KV Event synchronization** from the engine, and **Multi‑Engine** support. ## v0.4.0 Highlight Features @@ -41,11 +41,7 @@ The handling of the prefill request depends on the underlying inference engine:
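+
+To make the end-to-end flow easier to follow, the sketch below outlines the two-step selection at a conceptual level before we walk through the details. It is a simplified, hypothetical illustration in Go; the `Pod`, `selectPrefill`, and `selectDecode` names are ours, not the actual AIBrix gateway plugin APIs.
+
+```go
+// Conceptual sketch of P/D-disaggregated routing: route the prefill request
+// first, then pick a decode worker and hand its connection details back to
+// the Envoy proxy. Hypothetical types, not the AIBrix implementation.
+package main
+
+import (
+	"fmt"
+	"math/rand"
+)
+
+// Pod is a simplified view of an inference worker.
+type Pod struct {
+	Name string
+	Addr string
+}
+
+// selectPrefill picks a prefill worker; a prefix-affinity or load-aware
+// policy can be plugged in here.
+func selectPrefill(prefill []Pod, promptHash uint64) Pod {
+	return prefill[promptHash%uint64(len(prefill))]
+}
+
+// selectDecode mirrors the behavior described below: the decode worker is
+// currently chosen at random, with smarter policies planned.
+func selectDecode(decode []Pod) Pod {
+	return decode[rand.Intn(len(decode))]
+}
+
+func main() {
+	prefill := []Pod{{Name: "prefill-0", Addr: "10.0.0.1:8000"}}
+	decode := []Pod{{Name: "decode-0", Addr: "10.0.1.1:8000"}, {Name: "decode-1", Addr: "10.0.1.2:8000"}}
+
+	p := selectPrefill(prefill, 42)
+	d := selectDecode(decode)
+	fmt.Printf("prefill -> %s, decode -> %s\n", p.Addr, d.Addr)
+}
+```
+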

-After the prefill step is complete, a decode worker is selected. In the current implementation, the decode worker is chosen randomly. However, future enhancements aim to optimize this selection by considering factors such as KV cache transfer latency and worker load to improve efficiency. - -The connection details of the selected decode worker are then returned to the Envoy proxy, which forwards the decode request accordingly. The subsequent propagation and response-handling mechanism from the Envoy proxy to the decode worker remains unchanged. - -The key distinction in this workflow lies in the special handling of the prefill request, which introduces a dedicated step to route and process prefill separately before proceeding to decode. +After the prefill step is complete, a decode worker is selected. In the current implementation, the decode worker is chosen randomly. However, future enhancements aim to optimize this selection by considering factors such as KV cache transfer latency and worker load to improve efficiency. The connection details of the selected decode worker are then returned to the Envoy proxy, which forwards the decode request accordingly. The subsequent propagation and response-handling mechanism from the Envoy proxy to the decode worker remains unchanged. The key distinction in this workflow lies in the special handling of the prefill request, which introduces a dedicated step to route and process prefill separately before proceeding to decode. The following figures illustrate the benefits of prefix-aware routing enabled by AIBrix's PD-aware routing support. To evaluate the impact of this feature, we design two workloads inspired by real-world scenarios. The **prefix-sharing workload** simulates requests that share a few long common prefixes, mimicking scenarios with significant prefix overlap (as described in our [benchmark setting](https://github.com/vllm-project/aibrix/blob/41289350823fc924acaf72ba648ed2116d4cfc44/benchmarks/config.yaml#L23)). The exact sharing patterns used are specified below. The **multiturn workload** simulates a multi-turn conversation, with a mean request length of 2,000 tokens (standard deviation: 500) and an average of [3.55 turns per conversation](https://github.com/vllm-project/aibrix/blob/41289350823fc924acaf72ba648ed2116d4cfc44/benchmarks/config.yaml#L18). @@ -111,8 +107,7 @@ The latest version brings several key features and enhancements: * Offered network auto-configuration functionality for RDMA-capable environments. * Introduced new AIBrix KVCache L2 connectors for PrisDB and EIC, ByteDance's key-value stores engineered for low-latency, scalable multi-tier caching architectures optimized for LLM inference workloads. -Benchmarks by the EIC team demonstrate 89.27% reduction in average TTFT and 3.97x throughput improvement under high-concurrency scenarios (70B model). -We also conducted the same benchmarks as v0.3.0 using PrisDB as the L2 cache backend. These benchmarks are carried out with two simulated production workloads. Both workloads maintain identical sharing characteristics but different scaling. All unique requests in **Workload-1** can be fit in the GPU KV cache, while **Workload-2** scales the unique request memory footprint to 8 times, simulating capacity-constrained use cases where cache contention is severe. Compared to vLLM Baseline (w/o prefix caching) and vLLM Prefix Caching, AIBrix + PrisDB shows superior TTFT performance, particularly under increasing QPS. 
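+
+At a high level, the offloading connector serves reads through a tiered lookup: check the local L1 DRAM cache first, fall back to the remote L2 backend (e.g. PrisDB or EIC) on a miss, and promote the block on the way back so later reads stay local. The sketch below illustrates only that lookup path, using hypothetical interfaces rather than the actual AIBrix KVCache API.
+
+```go
+// Illustrative two-tier KV block lookup: a local L1 cache backed by a remote
+// L2 store. Interfaces and names are hypothetical.
+package main
+
+import "fmt"
+
+type Store interface {
+	Get(key string) ([]byte, bool)
+	Put(key string, val []byte)
+}
+
+// mapStore is an in-memory stand-in for a real cache backend.
+type mapStore map[string][]byte
+
+func (m mapStore) Get(k string) ([]byte, bool) { v, ok := m[k]; return v, ok }
+func (m mapStore) Put(k string, v []byte)      { m[k] = v }
+
+// TieredCache reads from L1 first and promotes L2 hits into L1.
+type TieredCache struct {
+	L1, L2 Store
+}
+
+func (c *TieredCache) Get(key string) ([]byte, bool) {
+	if v, ok := c.L1.Get(key); ok {
+		return v, true // L1 hit: no network round trip
+	}
+	if v, ok := c.L2.Get(key); ok {
+		c.L1.Put(key, v) // promote so subsequent reads stay local
+		return v, true
+	}
+	return nil, false // full miss: the engine recomputes the KV block
+}
+
+func main() {
+	c := &TieredCache{L1: mapStore{}, L2: mapStore{"blk:abc": []byte("kv-tensor-bytes")}}
+	if v, ok := c.Get("blk:abc"); ok {
+		fmt.Println("hit:", len(v), "bytes")
+	}
+}
+```
+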
The following figure shows that AIBrix + PrisDB delivers sub-second TTFT and orders of magnitude TTFT advantages across all load levels and benchmarks. +Benchmarks by the [Elastic Instant Cache](https://www.volcengine.com/product/eic) (EIC) team demonstrate an 89.27% reduction in average TTFT and a 3.97x throughput improvement under high-concurrency scenarios (70B model). We also conducted the same benchmarks as v0.3.0 using PrisDB as the L2 cache backend. These benchmarks are carried out with two simulated production workloads. Both workloads maintain identical sharing characteristics but different scaling. All unique requests in **Workload-1** can fit in the GPU KV cache, while **Workload-2** scales the unique request memory footprint by 8x, simulating capacity-constrained use cases where cache contention is severe. Compared to vLLM Baseline (w/o prefix caching) and vLLM Prefix Caching, AIBrix + PrisDB shows superior TTFT performance, particularly under increasing QPS. The following figure shows that AIBrix + PrisDB delivers sub-second TTFT and orders-of-magnitude TTFT advantages across all load levels and benchmarks. (Notation: 8B-1U = DeepSeek-R1-Distill-Llama-8B + Workload-1; 8B-8U = Workload-2 variant; 70B-1U = DeepSeek-R1-Distill-Llama-70B model + Workload-1) @@ -122,17 +117,18 @@ We also conducted the same benchmarks as v0.3.0 using PrisDB as the L2 cac ### KV Event Subscription System -AIBrix v0.4.0's new KV Event Subscription System improves prefix cache hit rates by synchronizing KV cache states in real-time across distributed nodes. Here, we will cover its design, trade-offs, and implementation details. The introduction of this new system offers a choice with different trade-offs, allowing users to decide between **system simplicity** and **prefix cache state accuracy** based on their needs. +AIBrix v0.4.0's new KV Event Subscription System improves prefix cache hit rates by synchronizing KV cache states in real time across distributed nodes. This new system introduces a clear trade-off, allowing users to choose between **system simplicity** and **prefix cache state accuracy** based on their needs. The core idea of this feature is to broadcast KV cache state change events across all routers via messaging middleware. This provides the routing layer with a near real-time, global view of the cache, enabling **more precise routing decisions**. (See PR [\#1349](https://github.com/vllm-project/aibrix/pull/1349) for details) -In theory, global state synchronization can significantly improve the cluster's potential prefix cache hit rate. However, this advantage comes at a cost. The approach introduces **additional overhead** from message queue management, increasing system complexity. In the current version, performance gains are not guaranteed, as the routing algorithms have not yet been fully adapted. Furthermore, the indexer may face **scalability challenges** in large-scale deployments. +In theory, global state synchronization can significantly improve the cluster's potential prefix cache hit rate. However, this advantage comes at a cost. The approach introduces **additional overhead** from message queue management, increasing system complexity. In the current version, performance gains are not guaranteed, as the routing algorithms have not yet been fully adapted. Furthermore, the indexer may face **scalability challenges** in large-scale deployments.
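+
+To make the event-driven approach concrete, the sketch below shows how a router-side index could fold engine KV cache events (block-stored and block-removed notifications) into a global view used for prefix-aware routing. The event and index types here are illustrative only, not the actual AIBrix indexer.
+
+```go
+// Minimal sketch of a router-side index that applies KV cache events from
+// engine pods to keep a near real-time view of which pods hold which blocks.
+// Types and names are illustrative only.
+package main
+
+import "fmt"
+
+type EventKind int
+
+const (
+	BlockStored EventKind = iota
+	BlockRemoved
+)
+
+// KVEvent describes a cache state change reported by an engine pod.
+type KVEvent struct {
+	Kind      EventKind
+	Pod       string
+	BlockHash uint64
+}
+
+// PrefixIndex maps a KV block hash to the set of pods that currently cache it.
+type PrefixIndex struct {
+	blocks map[uint64]map[string]struct{}
+}
+
+func NewPrefixIndex() *PrefixIndex {
+	return &PrefixIndex{blocks: map[uint64]map[string]struct{}{}}
+}
+
+// Apply folds one event into the index.
+func (idx *PrefixIndex) Apply(ev KVEvent) {
+	switch ev.Kind {
+	case BlockStored:
+		if idx.blocks[ev.BlockHash] == nil {
+			idx.blocks[ev.BlockHash] = map[string]struct{}{}
+		}
+		idx.blocks[ev.BlockHash][ev.Pod] = struct{}{}
+	case BlockRemoved:
+		delete(idx.blocks[ev.BlockHash], ev.Pod)
+	}
+}
+
+// PodsWith returns the pods believed to hold a block, for prefix-aware routing.
+func (idx *PrefixIndex) PodsWith(hash uint64) []string {
+	var pods []string
+	for p := range idx.blocks[hash] {
+		pods = append(pods, p)
+	}
+	return pods
+}
+
+func main() {
+	idx := NewPrefixIndex()
+	idx.Apply(KVEvent{Kind: BlockStored, Pod: "vllm-0", BlockHash: 0xabc})
+	idx.Apply(KVEvent{Kind: BlockStored, Pod: "vllm-1", BlockHash: 0xabc})
+	idx.Apply(KVEvent{Kind: BlockRemoved, Pod: "vllm-0", BlockHash: 0xabc})
+	fmt.Println(idx.PodsWith(0xabc)) // [vllm-1]
+}
+```
+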
-In contrast, the traditional unsynchronized approach is simpler and more lightweight, requiring **no extra synchronization components**. Its main drawback is the potential for **inconsistencies**, as each node runs its eviction policies independently, which can lower the overall cluster's prefix cache hit rate. -To enable the KV event subscription system, the remote tokenizer mode must be active, and the following environment variables must be set in gateway plugin component: +In contrast, the traditional unsynchronized approach is simpler and more lightweight, requiring **no extra synchronization components**. Its main drawback is the potential for **inconsistencies**, as each node runs its eviction policies independently, which can lower the overall cluster's prefix cache hit rate. + +To enable the KV event subscription system, the remote tokenizer mode must be active, and the following environment variables must be set in the gateway plugin component: ``` // Enable KV event synchronization -AIBRIX_KV_EVENT_SYNC_ENABLED: true +AIBRIX_KV_EVENT_SYNC_ENABLED: true // Depends on and enables remote tokenizer mode AIBRIX_USE_REMOTE_TOKENIZE: true ``` @@ -147,7 +143,7 @@ The KV Event Subscription System is a step for AIBrix towards high-performance d ### Multi‑Engine Support -Previously, AIBrix primarily supported the vLLM engine, limiting flexibility for comparing different inference backends. However, growing community demand—as seen in [#137](https://github.com/vllm-project/aibrix/issues/137), [[#843](https://github.com/vllm-project/aibrix/issues/843), and [#1245](https://github.com/vllm-project/aibrix/issues/1245) —highlighted the need for broader engine support. With the latest update, AIBrix now supports **multi-engine deployment**, allowing developers to run **vLLM, SGLang, and xLLM** side-by-side within a single AIBrix cluster. This unlocks new possibilities for benchmarking and production deployment while leveraging AIBrix’s unified serving infrastructure. +Previously, AIBrix primarily supported the vLLM engine, limiting flexibility for comparing different inference backends. However, growing community demand—as seen in [#137](https://github.com/vllm-project/aibrix/issues/137), [#843](https://github.com/vllm-project/aibrix/issues/843), and [#1245](https://github.com/vllm-project/aibrix/issues/1245)—highlighted the need for broader engine support. With the latest update, AIBrix now supports **multi-engine deployment**, allowing developers to run **vLLM, SGLang, and [xLLM](https://www.volcengine.com/docs/6459/72358)** side-by-side within a single AIBrix cluster. This unlocks new possibilities for benchmarking and production deployment while leveraging AIBrix’s unified serving infrastructure. Key points include: @@ -159,6 +155,8 @@ Multi‑engine support makes it easy to run vLLM and SGLang side‑by‑side and ## Other Improvements +While the highlights focus on new architectural and orchestration capabilities, we've also delivered several foundational improvements that strengthen AIBrix’s robustness and observability in real-world deployments. + AIBrix Gateway now supports SLO-aware routing with request profiling and deadline-based traffic control, enabling more intelligent and responsive load handling under dynamic traffic patterns ([#1192](https://github.com/vllm-project/aibrix/pull/1192), [#1305](https://github.com/vllm-project/aibrix/pull/1305), [#1368](https://github.com/vllm-project/aibrix/pull/1368)).
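+
+As a rough illustration of deadline-based traffic control, a router can compare a profiled latency estimate for a pod against a request's deadline before admitting it. The heuristics, names, and numbers below are hypothetical, not the gateway's actual profiling model.
+
+```go
+// Hypothetical sketch of deadline-based admission: compare a profiled
+// latency estimate against the request's deadline.
+package main
+
+import (
+	"fmt"
+	"time"
+)
+
+// profile is a per-pod latency estimate derived from request profiling,
+// e.g. expected TTFT plus per-token decode time.
+type profile struct {
+	TTFT       time.Duration
+	PerToken   time.Duration
+	QueueDepth int
+}
+
+// estimate returns the expected completion time for n output tokens.
+func (p profile) estimate(n int) time.Duration {
+	queued := time.Duration(p.QueueDepth) * p.TTFT // crude queueing penalty
+	return queued + p.TTFT + time.Duration(n)*p.PerToken
+}
+
+func main() {
+	deadline := 2 * time.Second
+	pod := profile{TTFT: 300 * time.Millisecond, PerToken: 20 * time.Millisecond, QueueDepth: 2}
+
+	if est := pod.estimate(50); est <= deadline {
+		fmt.Println("admit and route; estimated latency:", est)
+	} else {
+		fmt.Println("reject or defer; estimated latency exceeds deadline:", est)
+	}
+}
+```
+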
Additional enhancements include configurable timeouts, custom metrics ports, and a ready-to-use Grafana dashboard for observability ([#1211](https://github.com/vllm-project/aibrix/pull/1211), [#1212](https://github.com/vllm-project/aibrix/pull/1212)). On the control plane side, we've strengthened webhook validation, CRD existence checks, and added mechanisms to safely resync cache state during component restarts ([#1170](https://github.com/vllm-project/aibrix/pull/1170), [#1187](https://github.com/vllm-project/aibrix/pull/1187), [#1219](https://github.com/vllm-project/aibrix/pull/1219)). @@ -180,6 +178,7 @@ We deeply appreciate your contributions and feedback. Keep them coming! ## Next Steps We're continuing to push the boundaries of LLM system infrastructure, and **AIBrix v0.5.0** will focus on unlocking powerful capabilities for **agent-based use cases**, **multi-modality**, and **cost-efficient multi-tenant serving**. Here's a glimpse of what's coming: +- **P/D Disaggregation Improvements**: Introduce additional production-ready deployment patterns and examples, with improved PodGroup integration for better scheduling alignment and enhanced autoscaling support. - **Batch API**: Introduce a new batch inference API to improve GPU utilization under latency-insensitive scenarios. - **Multi-Tenancy**: Add tenant-aware isolation, request segregation, and per-tenant SLO controls for safer shared deployments. - **Context Cache for Agents**: Enable efficient reuse of session history across multi-turn conversations and agentic programs via a new context caching interface.