
AIBrix is a composable, cloud‑native LLM inference infrastructure designed to deliver high performance and low cost at scale. We now present a major update in a new release, v0.4.0. This release tackles key bottlenecks in orchestration and routing for **Prefill/Decode (P/D) Disaggregation** and **Large‑scale Expert Parallelism (EP)**, adds optimizations in the **AIBrix KVCache V1 Connector**, introduces **KV Event synchronization** from the engine, and brings **Multi‑Engine** support.

## v0.4.0 Highlight Features

The handling of the prefill request depends on the underlying inference engine:
<img src="/images/v0.4.0-release/aibrix-pd-router.png" width="100%" style="display:inline-block; margin-right:1%" />
</p>

After the prefill step is complete, a decode worker is selected. In the current implementation, the decode worker is chosen randomly. However, future enhancements aim to optimize this selection by considering factors such as KV cache transfer latency and worker load to improve efficiency. The connection details of the selected decode worker are then returned to the Envoy proxy, which forwards the decode request accordingly. The subsequent propagation and response-handling mechanism from the Envoy proxy to the decode worker remains unchanged. The key distinction in this workflow lies in the special handling of the prefill request, which introduces a dedicated step to route and process prefill separately before proceeding to decode.
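
To make this two-step flow concrete, here is a minimal sketch of the routing logic described above, assuming a simple in-memory view of the worker pools. The `Worker` type, the least-pending prefill policy, and the helper names are illustrative assumptions only; the actual AIBrix gateway plugin is engine-specific and implemented in Go.

```python
import random
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    address: str
    pending_requests: int = 0

def run_prefill(worker: Worker, prompt: str) -> None:
    # Placeholder: issue the prefill request to the selected worker.
    # How this is done depends on the underlying inference engine.
    worker.pending_requests += 1

def route_pd_request(prompt: str,
                     prefill_workers: list[Worker],
                     decode_workers: list[Worker]) -> str:
    # Step 1: choose a prefill worker and run the prefill phase
    # (least-pending selection here is purely illustrative).
    prefill = min(prefill_workers, key=lambda w: w.pending_requests)
    run_prefill(prefill, prompt)

    # Step 2: choose a decode worker. The current release picks one at
    # random; KV-transfer-latency and load-aware policies are future work.
    decode = random.choice(decode_workers)

    # Step 3: return the decode worker's connection details so the Envoy
    # proxy can forward the decode request to it as usual.
    return decode.address

prefill_pool = [Worker("prefill-0", "10.0.0.1:8000"), Worker("prefill-1", "10.0.0.2:8000")]
decode_pool = [Worker("decode-0", "10.0.1.1:8000"), Worker("decode-1", "10.0.1.2:8000")]
print(route_pd_request("Hello, world", prefill_pool, decode_pool))
```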

The following figures illustrate the benefits of prefix-aware routing enabled by AIBrix's PD-aware routing support. To evaluate the impact of this feature, we design two workloads inspired by real-world scenarios. The **prefix-sharing workload** simulates requests that share a few long common prefixes, mimicking scenarios with significant prefix overlap (as described in our [benchmark setting](https://github.com/vllm-project/aibrix/blob/41289350823fc924acaf72ba648ed2116d4cfc44/benchmarks/config.yaml#L23)). The exact sharing patterns used are specified below. The **multiturn workload** simulates a multi-turn conversation, with a mean request length of 2,000 tokens (standard deviation: 500) and an average of [3.55 turns per conversation](https://github.com/vllm-project/aibrix/blob/41289350823fc924acaf72ba648ed2116d4cfc44/benchmarks/config.yaml#L18).

The latest version brings several key features and enhancements:
* Offered network auto-configuration functionality for RDMA-capable environments.
* Introduced new AIBrix KVCache L2 connectors for PrisDB and EIC, ByteDance's key-value stores engineered for low-latency, scalable multi-tier caching architectures optimized for LLM inference workloads.

Benchmarks by the [Elastic Instant Cache](https://www.volcengine.com/product/eic) (EIC) team demonstrate an 89.27% reduction in average TTFT and a 3.97x throughput improvement under high-concurrency scenarios (70B model). We also conducted the same benchmarks as in v0.3.0, using PrisDB as the L2 cache backend. These benchmarks are carried out with two simulated production workloads that share identical sharing characteristics but differ in scale: all unique requests in **Workload-1** fit in the GPU KV cache, while **Workload-2** scales the unique-request memory footprint by 8x, simulating capacity-constrained use cases where cache contention is severe. Compared to vLLM Baseline (w/o prefix caching) and vLLM Prefix Caching, AIBrix + PrisDB shows superior TTFT performance, particularly under increasing QPS. The following figure shows that AIBrix + PrisDB delivers sub-second TTFT and orders-of-magnitude TTFT advantages across all load levels and benchmarks.

(Notation: 8B-1U = DeepSeek-R1-Distill-Llama-8B + Workload-1; 8B-8U = DeepSeek-R1-Distill-Llama-8B + Workload-2; 70B-1U = DeepSeek-R1-Distill-Llama-70B + Workload-1)


### KV Event Subscription System

AIBrix v0.4.0's new KV Event Subscription System improves prefix cache hit rates by synchronizing KV cache states in real time across distributed nodes. The new system offers a choice with different trade-offs, allowing users to decide between **system simplicity** and **prefix cache state accuracy** based on their needs.

The core idea of this feature is to broadcast KV cache state change events across all routers via messaging middleware. This provides the routing layer with a near real-time, global view of the cache, enabling **more precise routing decisions** (see PR [#1349](https://github.com/vllm-project/aibrix/pull/1349) for details).
In theory, global state synchronization can significantly improve the cluster's potential prefix cache hit rate. However, this advantage comes at a cost. The approach introduces **additional overhead** from message queue management, increasing system complexity. In the current version, performance gains are not guaranteed, as the routing algorithms have not yet been fully adapted. Furthermore, the indexer may face **scalability challenges** in large-scale deployments.

In contrast, the traditional unsynchronized approach is simpler and more lightweight, requiring **no extra synchronization components**. Its main drawback is the potential for **inconsistencies**, as each node runs its eviction policies independently, which can lower the overall cluster's prefix cache hit rate.

To enable the KV event subscription system, the remote tokenizer mode must be active, and the following environment variables must be set in the gateway plugin component:

```
# Enable KV event synchronization
AIBRIX_KV_EVENT_SYNC_ENABLED: true
# Depends on and enables remote tokenizer mode
AIBRIX_USE_REMOTE_TOKENIZE: true
```
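
As a rough illustration of what the routing layer does with these events, the sketch below subscribes to engine-published KV cache events and maintains a per-pod set of cached block hashes. The endpoint, message framing, JSON payload, and event field names are simplified assumptions for illustration; the actual wire format and indexer design are described in PR [#1349](https://github.com/vllm-project/aibrix/pull/1349).

```python
# Conceptual router-side consumer of engine KV cache events. Endpoint,
# framing, payload encoding, and field names are assumptions for
# illustration, not AIBrix's actual implementation.
import json
import zmq

def consume_kv_events(endpoint: str, prefix_index: dict) -> None:
    ctx = zmq.Context.instance()
    sub = ctx.socket(zmq.SUB)
    sub.connect(endpoint)                      # e.g. "tcp://<engine-pod>:5557"
    sub.setsockopt_string(zmq.SUBSCRIBE, "")   # subscribe to all events

    while True:
        pod_id, payload = sub.recv_multipart() # assumed framing: [pod id, event]
        event = json.loads(payload)            # real engines may use msgpack
        blocks = prefix_index.setdefault(pod_id.decode(), set())
        if event["type"] == "block_stored":
            blocks.update(event["block_hashes"])             # pod gained prefixes
        elif event["type"] == "block_removed":
            blocks.difference_update(event["block_hashes"])  # pod evicted prefixes
        # Routers consult prefix_index to score pods by prefix overlap
        # when making prefix-aware routing decisions.
```
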
The KV Event Subscription System is a step for AIBrix towards high-performance distributed inference.

### Multi‑Engine Support

Previously, AIBrix primarily supported the vLLM engine, limiting flexibility for comparing different inference backends. However, growing community demand—as seen in [#137](https://github.com/vllm-project/aibrix/issues/137), [#843](https://github.com/vllm-project/aibrix/issues/843), and [#1245](https://github.com/vllm-project/aibrix/issues/1245)—highlighted the need for broader engine support. With the latest update, AIBrix now supports **multi-engine deployment**, allowing developers to run **vLLM, SGLang, and [xLLM](https://www.volcengine.com/docs/6459/72358)** side-by-side within a single AIBrix cluster. This unlocks new possibilities for benchmarking and production deployment while leveraging AIBrix’s unified serving infrastructure.

Key points include:

Multi‑engine support makes it easy to run vLLM and SGLang side‑by‑side and compare them within a single AIBrix cluster.
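
As an illustration, running two engines side-by-side is largely a matter of labeling each Deployment so AIBrix can identify the model and the engine serving it. The snippet below is a hypothetical sketch: the `model.aibrix.ai/engine` label key, images, and arguments are assumptions for illustration, so refer to the AIBrix documentation for the exact multi-engine deployment spec.

```yaml
# Hypothetical sketch: two Deployments serving the same model with different
# engines in one AIBrix cluster. Label keys, images, and args are
# illustrative assumptions, not the authoritative spec.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-8b-vllm
spec:
  replicas: 1
  selector:
    matchLabels: {app: llama-8b-vllm}
  template:
    metadata:
      labels:
        app: llama-8b-vllm
        model.aibrix.ai/name: deepseek-r1-distill-llama-8b
        model.aibrix.ai/engine: vllm        # assumed engine label
    spec:
      containers:
        - name: engine
          image: vllm/vllm-openai:latest
          args: ["--model", "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "--port", "8000"]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-8b-sglang
spec:
  replicas: 1
  selector:
    matchLabels: {app: llama-8b-sglang}
  template:
    metadata:
      labels:
        app: llama-8b-sglang
        model.aibrix.ai/name: deepseek-r1-distill-llama-8b
        model.aibrix.ai/engine: sglang      # assumed engine label
    spec:
      containers:
        - name: engine
          image: lmsysorg/sglang:latest
          args: ["python3", "-m", "sglang.launch_server",
                 "--model-path", "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
                 "--port", "8000"]
```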

## Other Improvements

While the highlights focus on new architectural and orchestration capabilities, we've also delivered several foundational improvements that strengthen AIBrix’s robustness and observability in real-world deployments.

AIBrix Gateway now supports SLO-aware routing with request profiling and deadline-based traffic control, enabling more intelligent and responsive load handling under dynamic traffic patterns ([#1192](https://github.com/vllm-project/aibrix/pull/1192), [#1305](https://github.com/vllm-project/aibrix/pull/1305), [#1368](https://github.com/vllm-project/aibrix/pull/1368)). Additional enhancements include configurable timeouts, custom metrics ports, and a ready-to-use Grafana dashboard for observability ([#1211](https://github.com/vllm-project/aibrix/pull/1211), [#1212](https://github.com/vllm-project/aibrix/pull/1212)).
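
As a quick illustration of how a client might opt into a specific routing behavior, the request below sets the gateway's `routing-strategy` header. The strategy value and the deadline header shown are assumptions for illustration only; see the linked PRs and the AIBrix documentation for the exact header names and supported values.

```python
# Hypothetical client-side example of selecting a routing strategy via the
# AIBrix gateway. The strategy value "slo" and the deadline header are
# assumptions for illustration; consult the docs/PRs for the real interface.
import requests

resp = requests.post(
    "http://<aibrix-gateway>/v1/chat/completions",  # placeholder gateway address
    headers={
        "routing-strategy": "slo",          # assumed strategy name
        "x-request-deadline-ms": "2000",    # hypothetical per-request deadline hint
    },
    json={
        "model": "deepseek-r1-distill-llama-8b",
        "messages": [{"role": "user", "content": "Summarize AIBrix v0.4.0."}],
    },
    timeout=30,
)
print(resp.status_code, resp.json())
```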

On the control plane side, we've strengthened webhook validation, CRD existence checks, and added mechanisms to safely resync cache state during component restarts ([#1170](https://github.com/vllm-project/aibrix/pull/1170), [#1187](https://github.com/vllm-project/aibrix/pull/1187), [#1219](https://github.com/vllm-project/aibrix/pull/1219)).
We deeply appreciate your contributions and feedback. Keep them coming!
## Next Steps

We're continuing to push the boundaries of LLM system infrastructure, and **AIBrix v0.5.0** will focus on unlocking powerful capabilities for **agent-based use cases**, **multi-modality**, and **cost-efficient multi-tenant serving**. Here's a glimpse of what's coming:
- **P/D Disaggregation Improvements**: Introduce additional production-ready deployment patterns and examples, with improved PodGroup integration for better scheduling alignment and enhanced autoscaling support.
- **Batch API**: Introduce a new batch inference API to improve GPU utilization under latency-insensitive scenarios.
- **Multi-Tenancy**: Add tenant-aware isolation, request segregation, and per-tenant SLO controls for safer shared deployments.
- **Context Cache for Agents**: Enable efficient reuse of session history across multi-turn conversations and agentic programs via a new context caching interface.