# Architecture 

This notebook provides an overview of the architecture of Ray Serve.

<div class="alert alert-block alert-info">
<b> Here is the roadmap for this notebook: </b>
<ol>
    <li>Components of Ray Serve</li>
    <li>Lifetime of a Request</li>
    <li>Request Routing Process</li>
    <li>Fault Tolerance</li>
    <li>Load Shedding and Backpressure</li>
</ol>
</div>


**Imports**

In [None]:
import asyncio
import backoff
import time
import requests
import threading

import ray
import ray.util.state
from ray import serve
from starlette.requests import Request

## 1. Components of Ray Serve

### Sample Serve Instance

Below is a sample Serve instance that we will use to illustrate the architecture.

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-serve/sample_sercve_instance.png" width="800">

We can break down the above diagram into the following steps:
1. HTTP or gRPC requests come in
2. The load balancer routes the request to one of the cluster nodes
3. The request is handled by a proxy
4. The proxy routes the request to the relevant deployment replica
5. The replica processes the request and returns the response
6. The proxy returns the response to the client



### Architecture

Serve runs on Ray and utilizes Ray actors.

There are three kinds of actors that are created to make up a Serve instance:

 * **Controller Actor**
    * Global actor unique to each Serve instance
    * Manages the control plane
    * Handles creating, updating, and destroying other actors
    * Runs the Serve Autoscaler
    * Processes Serve API calls for deployment management

  * **Proxy Actor**
    * One HTTP proxy actor by default on head node
    * Runs a Uvicorn HTTP server
    * Accepts incoming requests
    * Forwards requests to replicas
    * Returns responses when completed
    * Can be scaled across cluster nodes using `proxy_location` setting

  * **Replica Actors**
    * Execute the actual request processing code
    * Can host ML models or other business logic
    * Process individual requests from the proxy
    * Support dynamic request batching via `@serve.batch`

Here is a diagram of the Serve architecture:

<img src="https://docs.ray.io/en/latest/_images/architecture-2.0.svg" width="800">


The system tab of the Serve dashboard shows the controller and proxy actors.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-serve-deep-dive/serve-dashboard-system-tab.png" width="600">

The applications tab shows the applications and their deployments.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-serve-deep-dive/serve-dashboard-applications-tab.png" width="600">

And clicking on a deployment shows the replicas and their status.

## 2. Lifetime of a Request

When an HTTP or gRPC request is sent to the corresponding proxy, it flows through the system as follows:
1. Request is received and parsed
2. Deployment is looked up based on HTTP URL path
3. Request enters deployment queue
4. Available replica is identified for processing
5. Request is either immediately processed by replica or stored in its queue
6. Response is returned through proxy
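
Step 2 can be pictured as a longest-prefix lookup over the registered route prefixes. The sketch below is a toy model under that assumption, not Serve's proxy code; `match_route` and the route table are hypothetical names used only for illustration.

```python
from typing import Dict, Optional


def match_route(routes: Dict[str, str], path: str) -> Optional[str]:
    """Return the app whose route prefix is the longest match for `path`."""
    # Try longer prefixes first so "/api/v2" wins over "/api" and "/".
    for prefix in sorted(routes, key=len, reverse=True):
        stripped = prefix.rstrip("/")
        # A prefix matches the path itself or any sub-path below it.
        if path == stripped or path.startswith(stripped + "/"):
            return routes[prefix]
    return None


# Hypothetical route table: prefix -> application name.
routes = {"/": "default_app", "/api": "api_app", "/api/v2": "api_v2_app"}
print(match_route(routes, "/api/v2/predict"))
print(match_route(routes, "/apix"))
```

Matching on whole path segments keeps `/apix` from being captured by the `/api` prefix; it falls through to the root prefix instead.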

Queueing in Ray Serve occurs in two places:
- On the caller side (DeploymentHandle or Proxy)
- On the receiver side (Replica)

Each replica maintains a queue of requests and executes requests one at a time, possibly using asyncio to process them concurrently.

When making a request via a `DeploymentHandle` (i.e., instead of HTTP), the request is placed on a queue in the `DeploymentHandle`, and we skip to step 3 above.
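
To make the receiver-side behavior concrete, here is a minimal sketch of a replica that caps concurrent work and lets the remaining requests wait. It uses only `asyncio`; `ToyReplica` and `run_toy_replica` are hypothetical names, and the semaphore is an assumption standing in for whatever mechanism a real replica uses.

```python
import asyncio
from typing import List, Tuple


class ToyReplica:
    """Toy replica: at most `max_ongoing_requests` run at once; the rest wait."""

    def __init__(self, max_ongoing_requests: int):
        self._slots = asyncio.Semaphore(max_ongoing_requests)
        self.ongoing = 0
        self.peak_ongoing = 0

    async def handle_request(self, request_id: int) -> int:
        # Requests beyond the cap wait here, which models the replica-side queue.
        async with self._slots:
            self.ongoing += 1
            self.peak_ongoing = max(self.peak_ongoing, self.ongoing)
            await asyncio.sleep(0.01)  # stand-in for real work
            self.ongoing -= 1
            return request_id


async def run_toy_replica(num_requests: int, max_ongoing_requests: int) -> Tuple[List[int], int]:
    replica = ToyReplica(max_ongoing_requests)
    results = await asyncio.gather(
        *(replica.handle_request(i) for i in range(num_requests))
    )
    return list(results), replica.peak_ongoing


results, peak = asyncio.run(run_toy_replica(num_requests=6, max_ongoing_requests=2))
print(results, peak)
```

Inside a notebook, where an event loop is already running, `await run_toy_replica(...)` replaces the `asyncio.run` call.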

Here is a diagram of the request lifecycle:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/geotab/request_lifecycle.jpg" width="800">

<div class="alert alert-block alert-secondary">

<details>

<summary>Click to view implementation details</summary>

- [`HTTPProxy.__call__` will delegate to `HTTPProxy.proxy_request`](https://github.com/ray-project/ray/blob/78bb1f0fe5dbc3d15f94953593c3751c17e2097c/python/ray/serve/_private/proxy.py#L854)
- [`HTTPProxy.proxy_request` will invoke a response handler method](https://github.com/ray-project/ray/blob/78bb1f0fe5dbc3d15f94953593c3751c17e2097c/python/ray/serve/_private/proxy.py#L429)
- [The response handler will delegate to `HTTPProxy.match_route` if an HTTP request is received](https://github.com/ray-project/ray/blob/78bb1f0fe5dbc3d15f94953593c3751c17e2097c/python/ray/serve/_private/proxy.py#L351)
- [The response handler will then call `HTTPProxy.send_request_to_replica`](https://github.com/ray-project/ray/blob/78bb1f0fe5dbc3d15f94953593c3751c17e2097c/python/ray/serve/_private/proxy.py#L401C1-L402C1)
- [`HTTPProxy.send_request_to_replica` will call the `DeploymentHandle.remote` method](https://github.com/ray-project/ray/blob/78bb1f0fe5dbc3d15f94953593c3751c17e2097c/python/ray/serve/_private/proxy.py#L966)
- [`DeploymentHandle.remote` will call the `AsyncioRouter.assign_request` method](https://github.com/ray-project/ray/blob/master/python/ray/serve/handle.py#L206)
- [`AsyncioRouter.assign_request` will call `AsyncioRouter.schedule_and_send_request` method](https://github.com/ray-project/ray/blob/master/python/ray/serve/_private/router.py#L613)
- [`AsyncioRouter.schedule_and_send_request` will first call `ReplicaScheduler.choose_replica_for_request` to choose a replica for the request](https://github.com/ray-project/ray/blob/master/python/ray/serve/_private/router.py#L574)
    - [`ReplicaScheduler.choose_replica_for_request` will call `ReplicaScheduler.choose_replicas` to try to submit a scheduling task to find a replica for the request](https://github.com/ray-project/ray/blob/master/python/ray/serve/_private/replica_scheduler/pow_2_scheduler.py#L811)
- [`AsyncioRouter.schedule_and_send_request` will then call `ReplicaWrapper.send_request`](https://github.com/ray-project/ray/blob/master/python/ray/serve/_private/router.py#L540C46-L540C58)
- [`ReplicaWrapper.send_request` will call `ReplicaWrapper.send_request_python`](https://github.com/ray-project/ray/blob/master/python/ray/serve/_private/replica_scheduler/replica_wrapper.py#L188)
- [`ReplicaWrapper.send_request_python` will then execute `Replica.handle_request` Actor tasks and get back the replica result](https://github.com/ray-project/ray/blob/master/python/ray/serve/_private/replica_scheduler/replica_wrapper.py#L92)
- The `ReplicaResult` is sent back to the response handler
- The response handler then sends the replica result back to the client

</details>

</div>

If you have tracing enabled, here are the three main spans as seen in the traces:

- `proxy_http_request` (`HTTPProxy.send_request_to_replica` scope)
    - `proxy_route_to_replica` (`AsyncioRouter.assign_request` scope)
        - `replica_handle_request` (`Replica.handle_request` or `Replica.handle_request_streaming` scope)

## 3. Request Routing Process

### Power of Two Choices Replica Routing Algorithm
* Randomly samples two replicas for each request
* Selects replica with shorter queue length if below `max_ongoing_requests`
* Maintains strict FIFO ordering of requests

See the diagram below, which visualizes the routing/scheduling process:

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-serve/power_of_two_choices.png" width="800">
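
The selection step of power of two choices can be sketched in a few lines. This is a simplified illustration, not Serve's router: `choose_replica` and the `queue_lens` dictionary are hypothetical, and FIFO ordering, probing, and backoff are left out.

```python
import random
from typing import Dict, Optional


def choose_replica(
    queue_lens: Dict[str, int],
    max_ongoing_requests: int,
    rng: random.Random,
) -> Optional[str]:
    """Sample up to two replicas and return the less loaded one, or None if both are full."""
    candidates = rng.sample(sorted(queue_lens), k=min(2, len(queue_lens)))
    # Only replicas below their concurrency cap can accept the request.
    available = [r for r in candidates if queue_lens[r] < max_ongoing_requests]
    if not available:
        return None  # a caller would back off and sample again
    return min(available, key=lambda r: queue_lens[r])


rng = random.Random(0)
queue_lens = {f"replica-{i}": 0 for i in range(4)}
for _ in range(12):
    chosen = choose_replica(queue_lens, max_ongoing_requests=16, rng=rng)
    if chosen is not None:
        queue_lens[chosen] += 1
print(queue_lens)
```

Comparing only two sampled replicas keeps the per-request cost constant while still steering load away from the busiest replicas.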


### Routing Priority
1. Local node replicas (if enabled - default True for proxy actors and False for replicas)
2. Same availability zone replicas (if enabled - default True)
3. Any available replica
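
One way to picture this priority order is as a list of candidate groups that are tried in turn. The sketch below assumes that reading; `ReplicaInfo`, `candidate_tiers`, and the two `prefer_*` flags are hypothetical names, not Serve configuration options.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ReplicaInfo:
    replica_id: str
    node_id: str
    availability_zone: str


def candidate_tiers(
    replicas: Sequence[ReplicaInfo],
    caller_node: str,
    caller_az: str,
    prefer_local_node: bool = True,
    prefer_local_az: bool = True,
) -> List[List[ReplicaInfo]]:
    """Return candidate groups in the order they would be tried."""
    tiers: List[List[ReplicaInfo]] = []
    if prefer_local_node:
        tiers.append([r for r in replicas if r.node_id == caller_node])
    if prefer_local_az:
        tiers.append([r for r in replicas if r.availability_zone == caller_az])
    tiers.append(list(replicas))  # final fallback: any replica
    return [tier for tier in tiers if tier]  # skip empty tiers


replicas = [
    ReplicaInfo("r1", "node-a", "az-1"),
    ReplicaInfo("r2", "node-b", "az-1"),
    ReplicaInfo("r3", "node-c", "az-2"),
]
tiers = candidate_tiers(replicas, caller_node="node-a", caller_az="az-1")
print([[r.replica_id for r in tier] for tier in tiers])
```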

#### Model Multiplexing (if enabled)
* First attempts to route to replicas with requested model loaded
* Falls back to replicas with fewest loaded models
* Finally considers all replicas after configurable timeout

### Fault Handling
* Implements exponential backoff on failures
* Removes dead replicas from selection pool

### Implementation Details
* Caches replica queue lengths with configurable TTL
* Uses active probing with deadlines for queue length updates
* Limits concurrent routing tasks to `(2 * num_replicas)`
* Single routing task handles multiple requests to maintain FIFO

This scheduler balances efficient request distribution with predictable ordering, while handling replica failures and model multiplexing requirements.
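
The queue-length cache with a TTL can be illustrated with a small standalone class. This is a sketch of the general technique only; `QueueLenCache` and its injectable `clock` are hypothetical and do not mirror Serve's internal cache.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class QueueLenCache:
    """Remember each replica's last reported queue length for `ttl_s` seconds."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[int, float]] = {}

    def update(self, replica_id: str, queue_len: int) -> None:
        self._entries[replica_id] = (queue_len, self._clock())

    def get(self, replica_id: str) -> Optional[int]:
        entry = self._entries.get(replica_id)
        if entry is None:
            return None
        queue_len, recorded_at = entry
        if self._clock() - recorded_at > self._ttl_s:
            # Stale entry: drop it so the caller knows to probe the replica again.
            del self._entries[replica_id]
            return None
        return queue_len


# A fake clock makes the expiry easy to demonstrate without sleeping.
now = [0.0]
cache = QueueLenCache(ttl_s=10.0, clock=lambda: now[0])
cache.update("replica-1", 3)
fresh = cache.get("replica-1")
now[0] = 11.0
stale = cache.get("replica-1")
print(fresh, stale)
```

A cache miss is the signal to fall back to actively probing the replica for its current queue length.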

<div class="alert alert-block alert-info">

To view implementation details, check out the `PowerOfTwoChoicesRequestRouter` [class](https://github.com/ray-project/ray/blob/ray-2.50.0/python/ray/serve/_private/request_router/pow_2_router.py)

</div>

### New (alpha): Custom request routing

Ray Serve now lets you plug in a custom <code>RequestRouter</code> to decide which replica handles a request (beyond the default power-of-two-choices).

- <b>What</b>: Implement your own policy (e.g., random, throughput-aware, KV-cache-aware), optionally using <code>FIFOMixin</code>, <code>LocalityMixin</code>, and <code>MultiplexMixin</code>.
- <b>How</b>: Set in <code>@serve.deployment</code> via <code>request_router_config=RequestRouterConfig(request_router_class="{module}:{Class}")</code>. Routers can read per-replica stats you expose via <code>record_routing_stats</code>.
- <b>Note</b>: Alpha API; configure at deploy time (not swappable on existing handles).

Read more in the [custom request routing docs](https://docs.ray.io/en/latest/serve/advanced-guides/custom-request-router.html).


## 4. Fault Tolerance

### Application errors
- Application-level errors like exceptions in your model evaluation code are caught and wrapped.
- A 500 status code will be returned with the traceback information. 
- The replica will be able to continue to handle requests.
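
The behavior described above resembles the following pattern: user code runs inside a wrapper that turns an exception into a 500 response and leaves the process ready for the next request. This is an illustrative sketch; `call_user_code` and `flaky_model` are hypothetical names.

```python
import traceback
from typing import Any, Callable, Tuple


def call_user_code(handler: Callable[[Any], Any], request: Any) -> Tuple[int, str]:
    """Run user code and map any exception to a 500 response with the traceback."""
    try:
        return 200, str(handler(request))
    except Exception:
        # The error is reported to the caller; the wrapper itself keeps running.
        return 500, traceback.format_exc()


def flaky_model(x: int) -> int:
    if x < 0:
        raise ValueError("negative input")
    return x * 2


failed_status, failed_body = call_user_code(flaky_model, -1)
ok_status, ok_body = call_user_code(flaky_model, 2)
print(failed_status, ok_status, ok_body)
```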

### Machine errors and faults
Machine errors and faults are handled by Ray Serve as follows:

- When **Replica** Actors fail, the **Controller** Actor replaces them with new ones.
- When the **Proxy** Actor fails, the **Controller** Actor restarts it.
- When the **Controller** Actor fails, Ray restarts it.
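
Replica replacement can be thought of as a reconciliation step: compare the target replica count with the replicas that are still alive and start replacements for the difference. The sketch below is a toy model of that idea; `reconcile` and the `start_replica` callback are hypothetical and say nothing about how the controller is actually implemented.

```python
import itertools
from typing import Callable, Dict, List


def reconcile(
    target_replicas: int,
    replicas: Dict[str, bool],
    start_replica: Callable[[], str],
) -> List[str]:
    """Drop dead replicas, then start new ones until the target count is met."""
    for replica_id in [r for r, alive in replicas.items() if not alive]:
        del replicas[replica_id]
    started: List[str] = []
    while len(replicas) < target_replicas:
        new_id = start_replica()
        replicas[new_id] = True
        started.append(new_id)
    return started


# replica-2 has died; one reconciliation pass replaces it.
ids = itertools.count(3)
replicas = {"replica-1": True, "replica-2": False}
started = reconcile(2, replicas, lambda: f"replica-{next(ids)}")
print(started, replicas)
```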

#### Transient Data Loss
- When a machine hosting any of the actors crashes, those actors are automatically restarted on another available machine.
- Transient data in the router and the replica (like network connections and internal request queues) will be lost for this kind of failure. 

<div class="alert alert-block alert-info">

**Best practice:** Implement client-side retries with backoff to handle transient data loss.

</div>

Let's create a deployment that simulates a spot instance interruption.



In [None]:
@serve.deployment
class SpotInstanceReplica:
    def __init__(self):
        self.state = "my_model"
    async def __call__(self, request: Request):
        await asyncio.sleep(60) # Simulate long computation
        return self.state

We run the application:



In [None]:
app = SpotInstanceReplica.bind()
app_handle = serve.run(app, name="spot-instance", blocking=False)

We define helper functions and a class to:
- make a request from a thread
- simulate an instance preemption by killing the replica actor



In [None]:
def make_request():
    response = requests.get("http://localhost:8000/")
    return response


class RequestThread(threading.Thread):
    def run(self):
        self.result = make_request()


def simulate_instance_preemption():
    replica_actors = ray.util.state.list_actors(
        filters=[
            ("state", "=", "ALIVE"),
            ("class_name", "=", "ServeReplica:spot-instance:SpotInstanceReplica"),
        ]
    )
    actor_name = replica_actors[0]["name"]
    actor_handle = ray.get_actor(name=actor_name, namespace="serve")
    ray.kill(actor_handle)

We now launch two threads:
- one to make a request
- one to simulate an instance preemption

Given we don't have client-side retries, the request will fail immediately after the replica is killed.



In [None]:
# Create threads
t1 = RequestThread()
t1.start()
time.sleep(20)
t2 = threading.Thread(target=simulate_instance_preemption)
t2.start()

# Wait for threads to complete
t1.join()
t2.join()

# Get the response object
t1_result = t1.result

Serve will replace the replica, but any requests that were in flight on it are lost.



In [None]:
t1_result.status_code, t1_result.text

To resolve this, it is best to implement client-side retries.



In [None]:
@backoff.on_exception(backoff.expo, requests.exceptions.RequestException, max_tries=5)
def make_request_robust():
    response = requests.get("http://localhost:8000/")
    response.raise_for_status()
    return response


class RequestThreadWithRetries(threading.Thread):
    def run(self):
        self.result = make_request_robust()

We now launch the same two threads:
- one to make a request
- one to simulate an instance preemption

Given we have client-side retries, the request will be retried and eventually succeed.



In [None]:
# Create threads
t1 = RequestThreadWithRetries()
t1.start()
time.sleep(20)
t2 = threading.Thread(target=simulate_instance_preemption)
t2.start()

# Wait for threads to complete
t1.join()
t2.join()

# Get the response object
t1_result = t1.result

We verify that the status code is 200 and the text is "my_model".



In [None]:
t1_result.status_code, t1_result.text

<div class="alert alert-block alert-info">

**Best practice:** If client retries are not an option, you will need to introduce request persistence (e.g., using a queueing system such as Kafka or SQS).

</div>




In [None]:
serve.shutdown()

### Ray Head Node Failure

If Ray's Global Control Service (GCS) fails and GCS fault tolerance is configured, Ray Serve will continue serving traffic from the existing replicas. However, no cluster autoscaling will be triggered until the GCS is restored.

For more details on fault-tolerance, check out the [End-to-End Fault Tolerance](https://docs.ray.io/en/latest/serve/production-guide/fault-tolerance.html) documentation.

## 5. Load Shedding and Backpressure

Here is a reminder of **request handling**:
- When a request is sent to a cluster, it is first received by the **Serve proxy**.
- The proxy forwards the request to a **replica** for handling using a `DeploymentHandle`.
- Replicas can handle a configurable number of requests at a time, set using the **`max_ongoing_requests`** option.
- If all replicas are busy, the request is queued in the `DeploymentHandle` until a replica becomes available.

### Implementing Load Shedding

Under heavy load, `DeploymentHandle` queues can grow, causing high tail latency and excessive load on the system.

To avoid instability, it is often preferable to intentionally reject some requests to prevent queues from growing indefinitely. This technique is called **"load shedding,"** allowing the system to handle excessive load gracefully without spiking tail latencies or overloading components.

#### Configuration

Configure load shedding for your Serve deployments using the **`max_queued_requests`** parameter in the `@serve.deployment` decorator.

- This parameter controls the maximum number of requests that each `DeploymentHandle`, including the Serve proxy, will queue.
- Once the limit is reached, any new requests will immediately raise a **`BackPressureError`**.
- HTTP requests will return a **503 status code** (service unavailable).
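
The admission logic behind these limits can be modeled without Ray. In the toy sketch below, a caller sends requests to replicas while slots are free, queues up to `max_queued_requests` more, and rejects the rest. `ToyHandle`, `ToyBackPressureError`, and `submit_or_shed` are hypothetical stand-ins, and `replica_slots` is an assumed shorthand for the total `max_ongoing_requests` across replicas.

```python
from typing import List


class ToyBackPressureError(Exception):
    """Hypothetical stand-in for Serve's BackPressureError."""


class ToyHandle:
    def __init__(self, replica_slots: int, max_queued_requests: int):
        self.replica_slots = replica_slots
        self.max_queued_requests = max_queued_requests
        self.ongoing = 0
        self.queued = 0

    def submit(self) -> str:
        if self.ongoing < self.replica_slots:
            self.ongoing += 1
            return "sent_to_replica"
        if self.queued < self.max_queued_requests:
            self.queued += 1
            return "queued"
        # Both the replicas and the caller-side queue are full: shed the request.
        raise ToyBackPressureError("queue is full")


def submit_or_shed(handle: ToyHandle) -> str:
    """Map a rejected request to the outcome an HTTP client would see."""
    try:
        return handle.submit()
    except ToyBackPressureError:
        return "rejected_503"


handle = ToyHandle(replica_slots=2, max_queued_requests=2)
outcomes: List[str] = [submit_or_shed(handle) for _ in range(6)]
print(outcomes)
```

Rejecting early keeps the queue bounded, so accepted requests see predictable latency instead of waiting behind an ever-growing backlog.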


#### Example
The following example defines a deployment that emulates slow request handling and has `max_ongoing_requests` and `max_queued_requests` configured.



In [None]:
@serve.deployment(
    # Each replica will be allowed to handle 2 requests at a time.
    max_ongoing_requests=2,
    # Each caller will be allowed to queue up to 2 requests at a time.
    # (beyond those that are sent to replicas).
    max_queued_requests=2,
)
class SlowDeployment:
    def __call__(self, request: Request) -> str:
        # Emulate a long-running request, such as ML inference.
        time.sleep(2)
        return "Hello!"

We define a requester actor that calls the deployment through a `DeploymentHandle`.



In [None]:
@ray.remote
class Requester:
    async def do_request(self) -> str:
        handle = serve.get_app_handle("slow-deployment")
        return await handle.remote({"sample": "input"})

Let's schedule the actor



In [None]:
r = Requester.remote()

Let's run the Serve application.



In [None]:
serve.run(SlowDeployment.bind(), name="slow-deployment", blocking=False)

Observe this sequence of requests



In [None]:
# Send 2 requests.
# These will be sent to the replica. Requests take two seconds to execute.
first_refs = [r.do_request.remote() for _ in range(2)]
available, pending = ray.wait(first_refs, timeout=1)
assert len(pending) == 2
assert len(available) == 0

# Send another 2 requests.
# These will get queued in the caller's DeploymentHandle because we have exceeded max_ongoing_requests.
queued_refs = [r.do_request.remote() for _ in range(2)]
available, pending = ray.wait(queued_refs, timeout=0.1)
assert len(pending) == 2

# Send another 2 requests.
# These should be **rejected** immediately because the replica and the handle queue are already full.
# The replica has 2 ongoing requests, and the handle has 2 queued, which equals max_queued_requests.
try:
    ray.get([r.do_request.remote() for _ in range(2)], timeout=5)
except ray.serve.exceptions.BackPressureError as e:
    print("Received BackPressureError as expected.")

# The initial requests will finish successfully.
for ref in first_refs:
    print(f"Request finished with response: {ray.get(ref)!r}")

Clean up by shutting down Serve:


In [None]:
!serve shutdown -y