## Autoscaling

This notebook is an overview of how to configure autoscaling in Ray Serve.

<div class="alert alert-info">
<b> Here is the roadmap of this notebook:</b>
<ol>
  <li> Scaling across the stack</li>
  <li> Manual Scaling in Ray Serve</li>
  <li> Autoscaling in Ray Serve</li>
  <li> Tuning autoscaling in Ray Serve</li>
  <li> Finetuning of autoscaling in Ray Serve</li>
  <li> Custom Autoscaling (Coming Soon)</li>
</ol>

</div>

**Imports**



In [None]:
import time

from ray import serve
from ray.serve.config import AutoscalingConfig

## 1. Scaling Across the Stack

Below is a diagram that illustrates the scaling across the stack.

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-serve/scaling_across_the_stack.png" width="800">

Here are the steps in the scaling process:

1. **Ray Serve Controller Monitoring**  
   The Ray Serve controller monitors the ongoing requests reported for each deployment and compares the load per replica with the configured target.

2. **Scaling Decision**  
   Based on this comparison, it decides whether to scale up or down the number of Serve replicas (actors).

3. **Replica Creation Requests**  
   If scaling up, the controller submits pending requests to create new actors to handle increased traffic.

4. **Ray Cluster Autoscaler Response**  
   These pending actor requests create unmet resource demands. The Ray Cluster Autoscaler detects this and attempts to provision additional Ray nodes to fulfill the need.

5. **Kubernetes Layer (if applicable)**  
   When running on Kubernetes, each Ray node is scheduled as a Kubernetes Pod. To launch more Pods, the Kubernetes scheduler must find available capacity.

6. **Kubernetes Cluster Autoscaler**  
   If there isn’t enough capacity (e.g., insufficient nodes), the Kubernetes Cluster Autoscaler attempts to scale the underlying infrastructure by adding nodes to the appropriate node group.


## 2. Manual Scaling in Ray Serve

Before turning to autoscaling, which is more complex, consider manual scaling. You can increase the number of replicas by setting a higher value for `num_replicas` in **the deployment options** and applying it as an **in-place update**.

By default, `num_replicas` is 1. Increasing the number of replicas will horizontally scale out your deployment and improve latency and throughput for increased levels of traffic.

```yaml
# Deploy with a single replica
deployments:
- name: Model
  num_replicas: 1

# Scale up to 10 replicas
deployments:
- name: Model
  num_replicas: 10
```

## 3. Autoscaling in Ray Serve

Ray Serve automatically changes the number of replicas based on traffic.

Here is a diagram that illustrates how replicas communicate with the controller to report metrics for autoscaling.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-serve-deep-dive/serve-autoscaling-replica-reporting.png" width="800">

### 1. Target Ongoing Requests
- **Meaning**: The average number of ongoing requests per replica that Serve tries to maintain.
- **Config**: `target_ongoing_requests` (default = 2)
- **How it works**:
  - Compare total ongoing requests to `target_ongoing_requests * num_replicas`.
  - Ratio < 1 → scale **down**
  - Ratio > 1 → scale **up**
- **Example**: Ratio = 2 → doubles the replicas.
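
The ratio rule above can be sketched in a few lines of plain Python. This is a simplified model rather than Serve's internal implementation: the function name `desired_replicas` and its arguments are illustrative, and the sketch ignores scaling delays and scaling factors.

```python
import math


def desired_replicas(total_ongoing_requests: int,
                     current_replicas: int,
                     target_ongoing_requests: float,
                     min_replicas: int,
                     max_replicas: int) -> int:
    """Simplified model of the ratio rule (illustrative, not Serve's code)."""
    # Ratio of observed load to the load the current replicas are meant to carry.
    ratio = total_ongoing_requests / (target_ongoing_requests * current_replicas)
    # Scale the replica count by that ratio, rounding up so capacity is not short.
    proposed = math.ceil(current_replicas * ratio)
    # Respect the configured bounds.
    return max(min_replicas, min(max_replicas, proposed))


# Ratio = 8 / (2 * 2) = 2, so the replica count doubles from 2 to 4.
print(desired_replicas(8, 2, 2, min_replicas=1, max_replicas=10))
# Ratio = 2 / (2 * 4) = 0.25, so the deployment shrinks from 4 replicas to 1.
print(desired_replicas(2, 4, 2, min_replicas=1, max_replicas=10))
```

The clamp at the end is what keeps a large ratio from pushing the deployment past `max_replicas`.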

### 2. Maximum Ongoing Requests
- **Meaning**: Max number of requests a replica can handle at once.
- **Config**: `max_ongoing_requests` (default = 5)
- **Purpose**:
  - Handles spikes safely.
  - Keeps replicas stable if requests vary in length.
- **Tip**: Set ~20–50% higher than `target_ongoing_requests`.
  - Too low → slows throughput.
  - Too high → overloads replicas.
  - Just right → balances performance.

Here is the same diagram as above, now showing how the default autoscaling policy works.

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-serve-deep-dive/serve_replica_queue_length_autoscaling_policy.png" width="800">



### Example

Define a service that takes 60 seconds to process each request. With `target_ongoing_requests=1`, Ray Serve will scale up when more than 1 request per replica is processing.



In [None]:
@serve.deployment(
    autoscaling_config=AutoscalingConfig(
        target_ongoing_requests=1,
        min_replicas=1,
        max_replicas=2,
    )
)
class SlowService:
    def run(self, id: int):
        time.sleep(60)  # purposefully sync and blocking
        print(f"got request {id}")
        return f"Done {id}"

Deploy the service. Initially, there will be 1 replica.



In [None]:
handle = serve.run(SlowService.bind())

Send 2 requests at the same time. Because both are in flight together, Ray Serve sees 2 ongoing requests against a target of 1, which triggers a scale-up to 2 replicas.



In [None]:
# send 2 ongoing requests to replica
result1 = handle.run.remote(1)
result2 = handle.run.remote(2)

## 4. Tuning Autoscaling in Ray Serve

#### Step 1: Baseline Testing with a Single Replica

**Goal**: Determine optimal `target_ongoing_requests` for your workload

1. **Setup**: Deploy with a single replica and autoscaling disabled
   ```python
   @serve.deployment(num_replicas=1)
   class MyDeployment:
       ...
   ```
   Note: set `max_ongoing_requests` to a large number to avoid queueing at the caller.

2. **Benchmark Process**:
   - Start with a low queries-per-second (QPS) rate
   - Gradually increase load until you hit your latency SLA (e.g., P99 < 500ms)
   - Monitor replica queue length using Ray Dashboard or metrics

3. **Key Metrics to Track**:
   - **Ongoing requests per replica** when latency is acceptable
   - **Request processing time** (average and P99)
   - **Throughput** (requests/second at SLA limit)

4. **Calculate Target Settings**:

Example: if a replica handles 20 concurrent requests before hitting the latency SLA, set:
- `target_ongoing_requests` ~= 16 (about 80% of the measured capacity)
- `max_ongoing_requests` ~= 24 (1.2-1.5x the target; here 1.5x)
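
The arithmetic above can be wrapped in a small helper. This is a rule-of-thumb sketch, not part of Ray Serve: the function name `suggest_request_limits` and the `headroom` and `burst_multiplier` parameters are assumptions introduced here for illustration.

```python
import math


def suggest_request_limits(capacity_at_sla: int,
                           headroom: float = 0.8,
                           burst_multiplier: float = 1.5) -> dict:
    """Derive starting values from a single-replica benchmark (rule of thumb only)."""
    # Target a fraction of the measured capacity so replicas keep some headroom.
    target = max(1, math.floor(capacity_at_sla * headroom))
    # Allow short bursts above the target before requests queue at the caller.
    max_ongoing = max(target + 1, round(target * burst_multiplier))
    return {
        "target_ongoing_requests": target,
        "max_ongoing_requests": max_ongoing,
    }


# A replica that sustains 20 concurrent requests within the latency SLA.
print(suggest_request_limits(20))
```

Treat the result as a starting point and confirm it with a load test.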

#### Example

Take a look at `examples/autoscaling`:

1. `resnet50_model.py` contains the model deployment that we want to autoscale.
2. `locustfile.py` configures Locust to send concurrent requests by spawning additional users.
3. `benchmark.yaml` configures the deployment for benchmarking.

**Key Benchmarking Configuration:**
- `max_ongoing_requests: 10000` - Set artificially high to avoid queueing during benchmarking
- `min_replicas: 1` and `max_replicas: 1` - Benchmark against a single replica to measure its capacity

**Observation from Load Testing:**

After running the benchmark and inspecting the Serve deployment dashboard, we observe a clear relationship between ongoing requests and latency:

<img src="https://anyscale-materials.s3.us-west-2.amazonaws.com/ray-serve-deep-dive/load_test_ongoing_requests.png" width="1000">

**Key Finding:** P90 latency starts to exceed 500ms when there are approximately **6 ongoing requests** per replica.

**Recommended Configuration:**

Based on this observation, we should configure autoscaling as follows:

- **`target_ongoing_requests: 4`** - Set at ~67% of the capacity threshold (6 requests) to maintain headroom
- **`max_ongoing_requests: 6`** - Set at 1.5x the target (equal to the measured threshold of 6 requests) to allow brief bursts while triggering backpressure if the load is sustained

This configuration ensures:
- Replicas operate below their latency threshold under normal load
- The autoscaler adds replicas before latency degrades
- Brief traffic spikes are handled without immediately rejecting requests 

## 5. Finetuning of autoscaling in Ray Serve

You can adjust how quickly and smoothly Ray Serve reacts to changes in traffic using these settings:

#### 1. Controller Responsiveness

* **Upscale Delay** (`upscale_delay_s`, default = 30s)
  How long Serve waits before adding replicas.

  * If traffic stays **above** the target for this duration, Serve adds replicas.
  * Use a **smaller** delay for fast reaction to traffic spikes.
  * Example: For bursty workloads, set `upscale_delay_s` to a lower value (like 5–10s).

* **Downscale Delay** (`downscale_delay_s`, default = 600s)
  How long Serve waits before removing replicas.

  * If traffic stays **below** the target for this duration, Serve removes replicas.
  * Use a **larger** delay for apps that start slowly or have unpredictable traffic, to avoid scaling down too soon.
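
To make the delay behavior concrete, here is a simplified model of the upscale delay in plain Python. It assumes the load must stay above the target for the whole delay and that any dip resets the timer. The function name and the sample format are illustrative, not Serve's internal logic.

```python
from typing import List, Optional, Tuple


def first_upscale_time(samples: List[Tuple[float, float]],
                       target_ongoing_requests: float,
                       upscale_delay_s: float) -> Optional[float]:
    """Return the timestamp at which a scale-up would fire, or None.

    `samples` is a list of (timestamp_s, ongoing_requests_per_replica) pairs.
    """
    above_since = None  # start of the current streak above target
    for timestamp, load in samples:
        if load > target_ongoing_requests:
            if above_since is None:
                above_since = timestamp
            if timestamp - above_since >= upscale_delay_s:
                return timestamp
        else:
            # A dip back to or below target resets the streak.
            above_since = None
    return None


# Load sampled every 10s, with one dip at t=10 that resets the timer.
samples = [(0, 3), (10, 1), (20, 3), (30, 3), (40, 3), (50, 3)]
print(first_upscale_time(samples, target_ongoing_requests=2, upscale_delay_s=30))
print(first_upscale_time(samples, target_ongoing_requests=2, upscale_delay_s=10))
```

With a 30s delay the scale-up fires at t=50, because the streak only starts at t=20. With a 10s delay it fires at t=30. The downscale delay works the same way in the opposite direction.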



#### 2. Metrics Window and Update Frequency

* **Look Back Period** (`look_back_period_s`, default = 30s)
  The time window over which Serve averages ongoing requests per replica.

* **Metrics Interval** (`metrics_interval_s`, default = 10s)
  How often each replica reports metrics to the autoscaler.

  * The autoscaler only makes decisions when it gets new data.
  * Keep `metrics_interval_s` **≤** `upscale_delay_s` and `downscale_delay_s`.
  * Example: If `upscale_delay_s = 3` but `metrics_interval_s = 10`, scaling up can only happen every ~10 seconds.
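
A quick sanity check for these timing settings can be written as a plain function over a dictionary of values. The helper name `check_timing` is hypothetical; the defaults it falls back to are the ones listed in this section.

```python
from typing import Dict, List


def check_timing(config: Dict[str, float]) -> List[str]:
    """Return messages for timing settings that work against each other."""
    problems = []
    interval = config.get("metrics_interval_s", 10)
    for delay_key, default in (("upscale_delay_s", 30), ("downscale_delay_s", 600)):
        delay = config.get(delay_key, default)
        if interval > delay:
            problems.append(
                f"metrics_interval_s={interval} exceeds {delay_key}={delay}; "
                f"decisions can only happen about every {interval}s"
            )
    return problems


# The mismatch from the example above: one message about upscale_delay_s.
print(check_timing({"upscale_delay_s": 3, "metrics_interval_s": 10}))
# A consistent configuration produces no messages.
print(check_timing({"upscale_delay_s": 5, "downscale_delay_s": 60, "metrics_interval_s": 2}))
```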



#### 3. Scale Adjustment Sensitivity

* **Upscaling Factor** (`upscaling_factor`, default = 1.0)
  Controls how aggressively to scale up.

  * Increase it (>1) for faster scale-ups when traffic surges.
  * Acts like a “gain” that amplifies the scaling response.

* **Downscaling Factor** (`downscaling_factor`, default = 1.0)
  Controls how aggressively to scale down.

  * Decrease it (<1) to make downscaling slower and more conservative.
  * Useful if you want to avoid frequent scale-down/scale-up cycles.
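
One way to picture the "gain" is as a multiplier on how far the load ratio sits from 1.0. The sketch below assumes that form. The function name and the exact formula are illustrative and may not match Serve's implementation in every version.

```python
import math


def replicas_with_gain(current_replicas: int,
                       load_ratio: float,
                       upscaling_factor: float = 1.0,
                       downscaling_factor: float = 1.0) -> int:
    """Apply a gain to the distance between the load ratio and 1.0."""
    factor = upscaling_factor if load_ratio >= 1 else downscaling_factor
    # A factor of 1.0 leaves the ratio unchanged; >1 amplifies it, <1 dampens it.
    adjusted_ratio = 1 + (load_ratio - 1) * factor
    return max(0, math.ceil(current_replicas * adjusted_ratio))


# Load is twice the target on 2 replicas.
print(replicas_with_gain(2, 2.0))                        # factor 1.0: 2 -> 4
print(replicas_with_gain(2, 2.0, upscaling_factor=1.5))  # factor 1.5: 2 -> 5
# Load is half the target on 4 replicas.
print(replicas_with_gain(4, 0.5))                          # factor 1.0: 4 -> 2
print(replicas_with_gain(4, 0.5, downscaling_factor=0.5))  # factor 0.5: 4 -> 3
```

In this model a downscaling factor of 0.5 removes one replica per decision instead of two, which is the more conservative behavior described above.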


### Example

Configure fast upscaling for bursty traffic. Note: `look_back_period_s` and `metrics_interval_s` should be ≤ `upscale_delay_s` to ensure timely metric collection and decision-making.


In [None]:
@serve.deployment(
    autoscaling_config=AutoscalingConfig(
        target_ongoing_requests=2,
        min_replicas=1,
        max_replicas=5,
        upscale_delay_s=5,
        downscale_delay_s=60,
        look_back_period_s=5,
        metrics_interval_s=2,
        upscaling_factor=1.5,
    )
)
class BurstyService:
    def run(self, id: int):
        time.sleep(60)
        return f"Done {id}"

In [None]:
handle = serve.run(BurstyService.bind())

Send a burst of 10 requests. With `upscale_delay_s=5`, replicas are added after about 5 seconds instead of the default 30.


In [None]:
results = [handle.run.remote(i) for i in range(10)]

#### Step 2: Load Testing with Realistic Traffic

**Goal**: Validate and finetune autoscaling behavior under production-like conditions

1. **Create test scenarios** that mimic your production traffic
   - **Steady ramp-up**: Gradual increase from min to max load
   - **Traffic spikes**: Sudden 2-5x increase in QPS
   - **Sustained high load**: Run at peak for 10-15 minutes
   - **Scale-down**: Drop traffic to trigger downscaling

2. **Monitor During Tests**:
   - **Latency degradation** during scale-up transitions
   - **Time to scale**: How long until new replicas handle traffic (typically 30-60s)
   - **Request rejections**: Check for 503 errors or BackPressureErrors
   - **Replica utilization**: Ensure load distributes evenly

3. **Common Issues & Fixes**:

| Symptom | Likely Cause |
|---------|--------------|
| Frequent request rejections | `max_ongoing_requests` too low |
| Slow scale-up | `upscale_delay_s` too high |
| Replica thrashing (up/down) | Load hovering near the scaling threshold while `upscale_delay_s` and `downscale_delay_s` are set too short |
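
The test scenarios listed in this step can be written down as a staged schedule before wiring them into a load-testing tool. The sketch below is plain Python and independent of any tool: the function name, the stage durations, and the default spike multiplier are illustrative assumptions to adapt to your own traffic.

```python
from typing import List, Tuple


def build_load_profile(base_qps: float,
                       peak_qps: float,
                       spike_multiplier: float = 3.0,
                       ramp_steps: int = 4) -> List[Tuple[str, int, float]]:
    """Return (stage name, duration in seconds, target QPS) tuples."""
    stages = []
    # Steady ramp-up: equal steps from base to peak load.
    step = (peak_qps - base_qps) / ramp_steps
    for i in range(1, ramp_steps + 1):
        stages.append(("ramp-up", 60, base_qps + i * step))
    # Sustained high load: hold the peak for 10 minutes.
    stages.append(("sustained", 600, peak_qps))
    # Scale-down: stay at base load longer than the default downscale delay (600s).
    stages.append(("scale-down", 900, base_qps))
    # Traffic spike: sudden jump from base load.
    stages.append(("spike", 120, base_qps * spike_multiplier))
    return stages


for name, duration_s, qps in build_load_profile(base_qps=10, peak_qps=50):
    print(f"{name:<10} {duration_s:>4}s  {qps:.0f} QPS")
```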


## 6. Custom Autoscaling (Coming Soon)

Ray Serve is introducing custom autoscaling capabilities that go beyond the default queue-depth policy, allowing you to implement domain-specific scaling logic tailored to your application's needs.

### Key Capabilities

**1. Custom Metrics Collection**
- **Prometheus Integration**: Scrape external metrics (CPU, GPU memory, latency) from Prometheus
- **Code-Based Metrics**: Export custom metrics from your deployments
- Metrics are aggregated over configurable time windows and made available to your scaling policy

**2. Deployment-Level Custom Policies**
- Write Python functions that receive an `AutoscalingContext` with current replicas, metrics, and constraints
- Implement custom logic (e.g., scale based on GPU memory, latency percentiles, or business KPIs)
- Return target replica count and optional state to persist across policy invocations
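
As a rough illustration of the shape such a policy could take, here is a sketch in plain Python. Because the feature is still in development, it does not use Ray's API: `SketchContext` is a hypothetical stand-in for the context object, and its field names, the metric key, and the thresholds are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class SketchContext:
    """Hypothetical stand-in for the policy context (not Ray's class)."""
    current_replicas: int
    min_replicas: int
    max_replicas: int
    metrics: Dict[str, float] = field(default_factory=dict)


def gpu_memory_policy(ctx: SketchContext) -> Tuple[int, Dict[str, float]]:
    """Scale on GPU memory utilization; return (target replicas, state to persist)."""
    utilization = ctx.metrics.get("gpu_memory_utilization", 0.0)
    target = ctx.current_replicas
    if utilization > 0.85:    # illustrative high-water mark
        target += 1
    elif utilization < 0.40:  # illustrative low-water mark
        target -= 1
    # Stay within the deployment's replica bounds.
    target = max(ctx.min_replicas, min(ctx.max_replicas, target))
    return target, {"last_utilization": utilization}


ctx = SketchContext(current_replicas=2, min_replicas=1, max_replicas=4,
                    metrics={"gpu_memory_utilization": 0.9})
print(gpu_memory_policy(ctx))
```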

**3. Application-Level Policies (Joint Scaling)**
- Coordinate scaling decisions across multiple deployments in an application
- Useful for maintaining ratios between services (e.g., app1 scales 2x faster than app2)
- Policy receives context for all deployments and returns scaling decisions for each

**4. External Scaler Integration**
- REST API endpoints allow third-party systems (Kubernetes HPA, custom monitors) to control scaling
- `POST /api/v1/applications/{app}/deployments/{dep}/scale` to set target replicas
- `GET /api/v1/applications/{app}/deployments/{dep}/status` to query current state
- Optional webhook registration for scale events

### Example Use Cases

- **GPU Memory-Aware**: Scale based on GPU memory utilization instead of request count
- **LLM KV-Cache Aware**: Scale when KV-cache utilization exceeds thresholds
- **Multi-Deployment Coordination**: Scale app1 and app2 services proportionally

### Backward Compatibility

All existing autoscaling configurations continue to work. If you don't specify a custom `policy`, Serve uses the default replica queue-length based algorithm with `target_ongoing_requests`.

<div class="alert alert-block alert-info">

**Note:** Custom autoscaling is currently in development. Check this [Ray Serve issue](https://github.com/ray-project/ray/issues/41135) for availability and updates.

</div>
