# Performance Optimization in Ray Serve

This guide covers three performance optimization techniques in Ray Serve that can improve throughput, latency, and resource utilization, and then shows how to combine them:

<div class="alert alert-info">
<b> Here is the roadmap of this notebook:</b>

<ol>
    <li> Dynamic Request Batching</li>
    <li> Multiplexing for Multi-Model Serving</li>
    <li> Request Pipelining with Streaming</li>
    <li> Combining Optimization Techniques</li>
</ol>

</div>

**Imports**


In [None]:
import asyncio
from typing import List

import ray
import requests
from fastapi import FastAPI
from ray import serve
from ray.serve.handle import DeploymentHandle

## 1. Dynamic Request Batching

### What is Dynamic Batching?

Dynamic batching groups incoming requests that arrive within a short window into batches, up to a maximum size.

Here is a diagram illustrating dynamic batching:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-serve-deep-dive/dynamic_request_batching.png" width="700">
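To make the mechanism concrete, here is a minimal pure-`asyncio` sketch of the idea. It is not Ray Serve's implementation; `TinyBatcher` and `demo` are hypothetical names used only for illustration. Callers submit one item each, and a background task flushes a batch when it reaches the maximum size or when the wait window closes.

```python
import asyncio


class TinyBatcher:
    """Toy dynamic batcher: flush at max_batch_size or when the wait window closes."""

    def __init__(self, fn, max_batch_size: int, batch_wait_timeout_s: float):
        self.fn = fn  # takes a list of inputs, returns a list of outputs
        self.max_batch_size = max_batch_size
        self.batch_wait_timeout_s = batch_wait_timeout_s
        self.queue: asyncio.Queue = asyncio.Queue()
        self.batch_sizes: list = []  # recorded for inspection only
        self._worker = None

    async def submit(self, item):
        # Each caller sends ONE item and awaits ONE result.
        if self._worker is None:
            self._worker = asyncio.create_task(self._run())
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def _run(self):
        loop = asyncio.get_running_loop()
        while True:
            # The first request opens the batching window.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.batch_wait_timeout_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            self.batch_sizes.append(len(batch))
            # One vectorized call for the whole batch, then fan results back out.
            outputs = self.fn([item for item, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)


async def demo():
    batcher = TinyBatcher(
        lambda xs: [x * 2 for x in xs], max_batch_size=4, batch_wait_timeout_s=0.1
    )
    results = await asyncio.gather(*(batcher.submit(i) for i in range(8)))
    return results, batcher.batch_sizes
```

Running `asyncio.run(demo())` returns each caller's own doubled value in submission order, along with the sizes of the batches that were formed; no batch exceeds the configured maximum.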

### When to Use Dynamic Batching

Use dynamic batching:
- To enable efficient GPU utilization and vectorized CPU operations.
- For high-throughput request patterns.
- When request sizes and processing times are relatively uniform.


### Example

Without batching, each request is processed individually:

In [None]:
@serve.deployment
class SimpleModel:
    def run(self, single_sample: int) -> int:
        return single_sample * 2

handle = serve.run(SimpleModel.bind())
await handle.run.remote(1)  # Returns 2

With batching enabled using `@serve.batch`:


In [None]:
@serve.deployment
class SimpleBatchedModel:
    @serve.batch(batch_wait_timeout_s=0.1, max_batch_size=4)
    async def run(self, multiple_samples: list[int]) -> list[int]:
        print(f"{multiple_samples=}")
        return [sample * 2 for sample in multiple_samples]

Now batching occurs on the replica:
- A batch forms when the maximum size is reached (4) or the timeout elapses (0.1 s).
- Clients still send individual requests with single samples.

In [None]:
handle = serve.run(SimpleBatchedModel.bind())
responses = [handle.run.remote(i) for i in range(8)]

Responses arrive individually and in order:

In [None]:
for i, resp in enumerate(responses):
    resp_val = await resp
    assert resp_val == i * 2

### How to Tune Batching Configurations

Here are some guidelines for tuning the `max_batch_size` parameter:

- Start small and double until throughput gains flatten (2 → 4 → 8 → 16 ...), watching latency and memory.
- Keep within memory limits; larger batches increase peak memory usage.
- Match downstream expectations in pipelines (prefer consistent multiples to avoid partial batches: 8→8→4 over 8→6→4).
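The doubling strategy can be sketched as a small search loop. This is an illustration under stated assumptions: `sweep_max_batch_size` is a hypothetical helper, `measure_throughput` stands in for whatever load test you run, the 10% gain threshold is arbitrary, and the numbers in `fake_results` are made up. Latency and memory still need to be checked separately.

```python
from typing import Callable


def sweep_max_batch_size(
    measure_throughput: Callable[[int], float],
    start: int = 2,
    limit: int = 64,
    min_gain: float = 0.10,
) -> int:
    """Double the batch size until throughput improves by less than min_gain."""
    best_size = start
    best_tput = measure_throughput(start)
    size = start * 2
    while size <= limit:
        tput = measure_throughput(size)
        if tput < best_tput * (1 + min_gain):
            break  # gains have flattened; keep the smaller batch
        best_size, best_tput = size, tput
        size *= 2
    return best_size


# Made-up measurements (requests/s) standing in for a real load test.
fake_results = {2: 100.0, 4: 190.0, 8: 340.0, 16: 360.0, 32: 365.0}
chosen = sweep_max_batch_size(fake_results.__getitem__, limit=32)
```

With these made-up numbers the sweep stops at 8, because going from 8 to 16 improves throughput by less than 10%.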

Here are some guidelines for tuning the `batch_wait_timeout_s` parameter:
- Set it from the latency budget: SLO minus average compute time. Example: if SLO=150ms and compute≈100ms, the slack is 50ms; use 10–20ms of it so that queueing and network overhead still fit.
- Lower timeout = lower latency, smaller batches; higher timeout = bigger batches, higher tail latency.
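The SLO arithmetic can be written down as a tiny helper. This is a sketch, not a Ray Serve API: `batch_wait_budget_s` is a hypothetical name, and spending only a fraction of the slack (0.3 by default here) is an assumption inferred from the 10–20ms example.

```python
def batch_wait_budget_s(slo_ms: float, compute_ms: float, fraction: float = 0.3) -> float:
    """Return a batch_wait_timeout_s candidate: a fraction of the slack under the SLO."""
    slack_ms = slo_ms - compute_ms
    if slack_ms <= 0:
        raise ValueError("compute time already exceeds the latency SLO")
    # Spend only part of the slack on waiting; the rest covers queueing and network.
    return (slack_ms * fraction) / 1000.0


# SLO=150ms, compute=100ms: 50ms of slack, 30% of it spent on waiting.
timeout_s = batch_wait_budget_s(slo_ms=150, compute_ms=100)
```

The result (15ms here) is a starting point to pass as `batch_wait_timeout_s`, to be adjusted against measured tail latency.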



## 2. Multiplexing for Multi-Model Serving


### What is Multiplexing?

Multiplexing enables a **single deployment** to serve **multiple models** by **dynamically loading and unloading** them as needed. This is ideal when serving **many models** with **sparse or unpredictable traffic patterns**.

#### The Problem Multiplexing Solves

Without multiplexing, serving **100 models** with sparse traffic patterns requires **100 separate deployments**:

❌ Without multiplexing:
- 100 deployments × 1 replica each = 100 replicas
- Either slow start-up: replicas must scale up from 0 on demand
- Or inefficient resource usage: all 100 replicas must stay up and running

With multiplexing, a **single deployment** handles all models:

✅ With multiplexing:
- 1 deployment with N replicas
  - e.g. 20 replicas each caching up to 5 models can handle all 100 models
- Each replica keeps only **K most-used** models **cached in memory**
- Automatic **loading/eviction of cache** based on usage
- Intelligent routing to replicas with requested models already loaded
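The sizing in the example above is simple slot arithmetic, sketched below. `min_replicas` is a hypothetical helper, and `copies_per_model` is an assumed knob for keeping a model cached on more than one replica.

```python
import math


def min_replicas(num_models: int, models_per_replica: int, copies_per_model: int = 1) -> int:
    """Fewest replicas whose combined cache slots can hold every model."""
    return math.ceil(num_models * copies_per_model / models_per_replica)


needed = min_replicas(num_models=100, models_per_replica=5)
needed_with_redundancy = min_replicas(100, 5, copies_per_model=2)
```

Note that 20 replicas × 5 models leaves no spare slots: every model fits in exactly one cache, so any imbalance in routing causes evictions. Extra replicas or larger caches add headroom.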


### When to Use Multiplexing

Use multiplexing when you have:

1. **Per-user personalized models**: Serve a unique model for each user
2. **Multi-tenancy**: Isolate models per customer/tenant
3. **Sparse traffic patterns**: Many models with infrequent requests


#### Common Use Cases

1. Serving base models (e.g. LLMs) with fine-tuned variants (e.g. LoRA adapters).
   1. Fine-tuned adapters can be loaded/unloaded dynamically on top of a shared base model.
2. Personalized recommendation models per user


### How It Works

Here is a diagram illustrating how model multiplexing works:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-serve-deep-dive/model_multiplexing.png" width="800">

**Request Execution Flow:**

1. **Client specifies model**: Requests include a model ID via HTTP header (`serve_multiplexed_model_id`) or handle options
2. **Intelligent routing**: The router finds candidate replicas using a three-tier fallback strategy:
   - **First priority**: Replicas that already have the requested model loaded (cache hit)
   - **Second priority**: Replicas with the fewest models loaded (most capacity)
   - **Fallback**: After timeout (~1 second), choose available replicas
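The three tiers can be sketched in plain Python. This is a simplified illustration of the strategy described above, not Ray Serve's router code; `ReplicaInfo`, `has_capacity`, and `candidate_replicas` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaInfo:
    name: str
    loaded_models: set = field(default_factory=set)
    has_capacity: bool = True  # assumed flag: replica can accept another request


def candidate_replicas(model_id, replicas, waited_s, timeout_s=1.0):
    """Pick candidate replicas following the three tiers described above."""
    available = [r for r in replicas if r.has_capacity]
    if not available:
        return []
    # Tier 1: replicas that already hold the model (cache hit).
    hits = [r for r in available if model_id in r.loaded_models]
    if hits:
        return hits
    # Fallback: past the matching timeout, any available replica will do.
    if waited_s >= timeout_s:
        return available
    # Tier 2: replicas holding the fewest models (most free cache capacity).
    fewest = min(len(r.loaded_models) for r in available)
    return [r for r in available if len(r.loaded_models) == fewest]


replicas = [
    ReplicaInfo("a", {"m1", "m2"}),
    ReplicaInfo("b", {"m3"}),
    ReplicaInfo("c"),
]
```

For example, a request for `m3` goes to replica `b`, a request for an unloaded model goes to the emptiest replica `c`, and once the timeout has passed any available replica is a candidate.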


#### Performance Tradeoffs

**Pros:**
- Reduced memory footprint (only active models loaded)
- Efficient resource utilization
- Fewer deployments to manage

**Cons:**
- Model loading latency on first request (cold start)
- Complexity in routing and model ID management
- Potential thrashing if too many models accessed frequently


### Example

Let's take a look at a basic example of Serve multiplexing.


In [None]:
@serve.deployment
class ModelServer:
    # specify the size of the cache
    @serve.multiplexed(max_num_models_per_replica=2)
    async def load_model(self, model_id: str):
        return await self.load_model_from_storage(model_id)

    async def __call__(self, request):
        # fetch the model id associated with this request
        model_id = serve.get_multiplexed_model_id()
        model = await self.load_model(model_id)
        return await self.predict(model, request)

    async def load_model_from_storage(self, model_id):
        print(f"loading {model_id=} from blob store")
        await asyncio.sleep(0.1)  # simulate downloading weights from S3/GCS
        return model_id

    async def predict(self, model, request):
        await asyncio.sleep(0.05)  # simulate loading data onto device and running inference
        return f"prediction from {model}"


handle = serve.run(ModelServer.bind())

We can now send requests specifying which model to use. 

We can either specify a `multiplexed_model_id` via `handle.options` when making a direct RPC call:


In [None]:
await handle.options(multiplexed_model_id="model_v1").remote({"simple": "request"})

Or we can use the `serve_multiplexed_model_id` **HTTP Header** if sending HTTP requests:


In [None]:
response = requests.post(
    "http://localhost:8000/",
    json={"data": [1, 2, 3]},
    headers={"serve_multiplexed_model_id": "model_v2"}
)

By now, the replica has loaded both `model_v1` and `model_v2`, which fills the configured limit of 2 models per replica.

If we send a request for `model_v3`, then `model_v1` will be evicted (unloaded) to make room:


In [None]:
response = requests.post(
    "http://localhost:8000/",
    json={"data": [1, 2, 3]},
    headers={"serve_multiplexed_model_id": "model_v3"}
)

Note that `model_v1` was evicted because the cache uses a Least Recently Used (LRU) policy, and `model_v1` was the least recently requested model.
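The eviction order can be reproduced with a small LRU cache built on `OrderedDict`. This is an illustrative sketch of the LRU policy, not the cache Ray Serve uses internally; `LRUModelCache` is a hypothetical class.

```python
from collections import OrderedDict


class LRUModelCache:
    """Keep at most max_models entries, evicting the least recently used one."""

    def __init__(self, max_models: int):
        self.max_models = max_models
        self._models: OrderedDict = OrderedDict()
        self.evicted: list = []

    def get(self, model_id: str, loader):
        if model_id in self._models:
            self._models.move_to_end(model_id)  # mark as most recently used
            return self._models[model_id]
        if len(self._models) >= self.max_models:
            oldest_id, _ = self._models.popitem(last=False)  # drop least recently used
            self.evicted.append(oldest_id)
        self._models[model_id] = loader(model_id)
        return self._models[model_id]

    def loaded(self):
        return list(self._models)


cache = LRUModelCache(max_models=2)
for model_id in ["model_v1", "model_v2", "model_v3"]:
    cache.get(model_id, loader=lambda mid: f"weights for {mid}")
```

After the loop, `cache.evicted` is `['model_v1']` and `cache.loaded()` is `['model_v2', 'model_v3']`. Had `model_v1` been requested again before `model_v3`, `model_v2` would have been evicted instead.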

### Key Knobs and Best Practices

Here are the key configuration options for multiplexing:

- **`max_num_models_per_replica`**: Cache size per replica (default 3).
- **Router timeout**: `RAY_SERVE_MULTIPLEXED_MODEL_ID_MATCHING_TIMEOUT_S` (~1s) before falling back.


Key best practices for multiplexing:

- **Tune cache size** to balance memory vs. cold starts.
- **Monitor cache hits** to spot thrashing; increase cache or shard if needed.
- **Pre-warm critical models** if you must avoid cold-start latency.
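One possible way to pre-warm is to send a throwaway request per critical model before real traffic arrives. The sketch below assumes the deployment tolerates a dummy payload; `prewarm` is a hypothetical helper, and `FakeHandle` is a stand-in that mimics only the `options(multiplexed_model_id=...).remote(...)` call shape so the example runs without a cluster.

```python
import asyncio


async def prewarm(handle, model_ids, payload):
    """Send one throwaway request per model so each gets loaded ahead of traffic."""
    calls = [
        handle.options(multiplexed_model_id=model_id).remote(payload)
        for model_id in model_ids
    ]
    return [await call for call in calls]


class FakeHandle:
    """Stand-in for a Serve deployment handle (hypothetical, for local runs only)."""

    def __init__(self, model_id=None, seen=None):
        self.model_id = model_id
        self.seen = [] if seen is None else seen

    def options(self, multiplexed_model_id):
        return FakeHandle(multiplexed_model_id, self.seen)

    async def remote(self, payload):
        self.seen.append(self.model_id)  # record which model was requested
        return f"warmed {self.model_id}"


fake = FakeHandle()
warmed = asyncio.run(prewarm(fake, ["model_v1", "model_v2"], {"warmup": True}))
```

With several replicas, a single warm-up request loads the model on only one of them, so this reduces cold starts rather than eliminating them.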


### Observability

Ray Serve exposes metrics for multiplexed deployments, and the Ray Serve deployment dashboard includes default panels that show their performance.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-serve-deep-dive/multiplexing-dashboard.png" width="900">


## 3. Request Pipelining with Streaming

Pipeline parallelism with streaming allows you to process data as soon as it becomes available, rather than waiting for an entire stage to complete. This is particularly effective for large inputs like videos or long sequences.

### Key Concept

Instead of waiting for each stage to complete before starting the next:

* Decode entire video → Predict on all frames (Perform object detection + add bounding boxes) → Encode entire video

```
❌ Naive approach (66s total):
Decode (4s) → Predict (38s) → Encode (24s)
```

Process chunks in parallel across stages:

```
✅ Streaming approach (7s total - roughly 9x faster):
Chunk 1: Decode (0.1s) → Predict (0.5s) → Concat (0.1s)
Chunk 2: Decode (0.1s) → Predict (0.5s) → Concat (0.1s)
...
Chunk N: Decode (0.1s) → Predict (0.5s) → Concat (0.1s)
```

Here is a diagram illustrating request pipelining with streaming:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-serve-deep-dive/video_processing_sequence.png" width="1000">
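The overlap can be demonstrated without Ray using an async generator. This is a toy sketch: the stage functions and sleep durations are made up, and it assumes the predict stage can work on several chunks concurrently (as multiple replicas would).

```python
import asyncio
import time


async def decode(num_chunks, decode_s):
    """Yield chunks one at a time, as a streaming decoder would."""
    for i in range(num_chunks):
        await asyncio.sleep(decode_s)
        yield i


async def predict(chunk, predict_s):
    await asyncio.sleep(predict_s)  # simulated inference
    return f"chunk-{chunk}-done"


async def naive(num_chunks, decode_s, predict_s):
    # Decode everything first, then predict chunk by chunk.
    chunks = [c async for c in decode(num_chunks, decode_s)]
    return [await predict(c, predict_s) for c in chunks]


async def streaming(num_chunks, decode_s, predict_s):
    # Start predicting each chunk as soon as it is decoded.
    tasks = []
    async for c in decode(num_chunks, decode_s):
        tasks.append(asyncio.create_task(predict(c, predict_s)))
    return await asyncio.gather(*tasks)


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start
```

Calling `timed(naive(5, 0.01, 0.05))` and `timed(streaming(5, 0.01, 0.05))` produces the same results in the same order, but the streaming version finishes sooner because prediction on early chunks overlaps with decoding of later ones.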


### Implementation Pattern

Let's look at examples of both naive and streaming orchestration patterns for a video processing pipeline that performs face detection.

#### NaiveOrchestration (no streaming)

In [None]:
fastapi_app = FastAPI()

@serve.deployment
@serve.ingress(fastapi_app)
class NaiveOrchestration:
    def __init__(
        self,
        decode_video: DeploymentHandle,
        fused_detect_encode: DeploymentHandle,
    ) -> None:
        self.decode_video = decode_video
        self.fused_detect_encode = fused_detect_encode

    @fastapi_app.post("/detect_faces_naive")
    async def run(
        self,
        video_url: str,
        output_path: str,
        decode_batch_size: int = 20,
    ) -> str:
        # Decode entire video first (no streaming)
        frames: List[ray.ObjectRef] = await self.decode_video.decode.remote(
            video_url, decode_batch_size
        )

        # Process frames (run detection + encoding) and store video on blob store
        return await self.fused_detect_encode.run.remote(frames)

#### StreamingOrchestration (streaming)

In [None]:
fastapi_app = FastAPI()

@serve.deployment
@serve.ingress(fastapi_app)
class StreamingOrchestration:
    def __init__(
        self,
        decode_video: DeploymentHandle,
        fused_detect_encode: DeploymentHandle,
        concat_video: DeploymentHandle,
    ) -> None:
        self.decode_video = decode_video.options(stream=True) # Enable streaming
        self.fused_detect_encode = fused_detect_encode
        self.concat_video = concat_video

    @fastapi_app.post("/detect_faces_streaming")
    async def run(
        self,
        video_url: str,
        output_path: str,
        decode_batch_size: int = 20,
    ) -> str:
        # Decode yields batches of frames incrementally (streaming)
        frames_iter = self.decode_video.decode.remote(video_url, decode_batch_size)
        encoded_video_refs: List[ray.ObjectRef] = []
        obj_ref_gen = await frames_iter._to_object_ref_gen()
        
        # Process frames as they arrive (detection + encoding)
        async for frame_ref in obj_ref_gen:
            encoded_video_refs.append(
                self.fused_detect_encode.run.remote(frame_ref)
            )

        # Concatenate encoded video chunks and store video on blob store
        return await self.concat_video.concat.remote(
            output_path,
            *encoded_video_refs
        )


### Key Implementation Details

1. **Enable streaming**: Use `.options(stream=True)` on the deployment handle
2. **Convert to ObjectRef generator**: Use `await frames_iter._to_object_ref_gen()` to avoid unnecessary data transfer
3. **Async iteration**: Use `async for` to process chunks as they arrive
4. **Prefer local routing**: Set `_prefer_local_routing=True` for better performance

### When to Use Streaming

- Large input data (videos, long documents, audio files)
- Multi-stage processing pipelines
- Need to minimize time-to-first-output
- Want to overlap computation across pipeline stages


## 4. Combining Optimization Techniques

These techniques can be combined for maximum performance:


In [None]:
@serve.deployment
class OptimizedPipeline:
    @serve.multiplexed(max_num_models_per_replica=5)
    async def load_model(self, model_id: str):
        # load_model_from_storage is a placeholder for your own loading logic
        return load_model_from_storage(model_id)

    @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1)
    async def __call__(self, requests: list) -> list:
        # Assumes all requests in the batch target the same model ID
        model_id = serve.get_multiplexed_model_id()
        model = await self.load_model(model_id)
        # predict_batch must return one result per request, in order
        return model.predict_batch(requests)

This combines:
- **Multiplexing** for serving multiple models
- **Dynamic batching** for efficient GPU utilization
- Could also add **streaming** if processing large inputs