Description
Problem
Time To First Token (TTFT) and Inter-Token Latency (ITL) directly reflect user experience:
- TTFT: time until the first token appears (responsiveness)
- ITL: time between subsequent tokens (generation speed)
Scaling based on these metrics ensures the system meets SLA targets and maintains good user experience under varying load.
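Concretely, both metrics reduce to simple arithmetic over per-token timestamps. A minimal sketch (function and variable names are illustrative only, not an existing API):

```python
def compute_ttft(request_start: float, token_times: list[float]) -> float:
    # Time from request submission until the first generated token.
    return token_times[0] - request_start

def compute_mean_itl(token_times: list[float]) -> float:
    # Mean gap between consecutive tokens after the first.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return sum(gaps) / len(gaps)
```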
Solution
First Iteration: Aggregated Scaling
In the first iteration, we scale aggregated inference replicas (combined prefill and decode).
Second Iteration: Disaggregated Scaling
In the second iteration, we will implement disaggregated inference, similar to NVIDIA Dynamo:
- Use TTFT to scale prefill workers independently
- Use ITL to scale decode workers independently
Key Changes:
1. Service Configuration
Declare multiple autoscalers in the service config, as below.
```yaml
type: service
name: sample-service

python: 3.12
env:
  - HF_TOKEN
commands:
  - ..
  - ..
  - python -m sglang.launch_server --model-path $MODEL_ID --host 0.0.0.0 --port 8000 --enable-metrics
port: 8000

model: meta-llama/Llama-3.2-3B-Instruct

resources:
  gpu: 24GB

replicas: 0..10
scaling:
  - metric: rps
    target: 10.0
  - metric: ttft
    target: 0.3
  - metric: itl
    target: 1.5
```
2. StatsCollector
Gateway's StatsCollector fetches Prometheus metrics from each replica (SGLang worker) at http://worker_url/metrics and includes TTFT, ITL, ISL (Input Sequence Length), and OSL (Output Sequence Length) in frames.
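A rough sketch of the fetch-and-aggregate step. The exact SGLang metric names and how the values are folded into frames are assumptions to verify against the worker's /metrics output:

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

def fetch_mean_latencies(worker_url: str) -> dict[str, float]:
    # Scrape the replica's Prometheus endpoint.
    text = requests.get(f"{worker_url}/metrics", timeout=5).text
    sums: dict[str, float] = {}
    counts: dict[str, float] = {}
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            # Histograms expose <name>_sum and <name>_count;
            # their ratio is the mean over the metric's lifetime.
            if sample.name.endswith("_sum"):
                sums[sample.name[:-4]] = sample.value
            elif sample.name.endswith("_count"):
                counts[sample.name[:-6]] = sample.value
    return {name: sums[name] / counts[name] for name in sums if counts.get(name)}
```

Note that means over a recent window (rather than over the metric's lifetime) would require diffing successive scrapes.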
3. Statistics Retrieval
Server collects ServiceStats that include TTFT, ITL, ISL, and OSL (in addition to requests and request_time).
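A possible shape for the extended stats, shown only as a sketch (field names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class ServiceStats:
    requests: int        # request count over the window
    request_time: float  # mean request latency, seconds
    ttft: float | None   # mean time to first token, seconds
    itl: float | None    # mean inter-token latency, seconds
    isl: float | None    # mean input sequence length, tokens
    osl: float | None    # mean output sequence length, tokens
```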
4. Scaling Decision
When update_service_desired_replica_count() is called:
```python
desired_counts = []
# Multiple scalers: rps-, ttft-, and itl-based
for scaler in scalers:
    desired = scaler.get_desired_count(
        current_desired_count=run_model.desired_replica_count,
        stats=stats,
        last_scaled_at=last_scaled_at,
    )
    desired_counts.append(desired)

# Take the maximum of all desired counts
run_model.desired_replica_count = max(desired_counts)
```
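Taking the maximum means the most demanding scaler wins: if RPS alone would settle at 3 replicas but meeting the TTFT target requires 5, the service scales to 5.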
5. Implement TTFT and ITL Autoscalers
```python
class TTFTAutoscaler(BaseServiceScaler):
    ...

    def get_desired_count(
        ...,
    ) -> int:
        ...


class ITLAutoscaler(BaseServiceScaler):
    ...

    def get_desired_count(
        ...,
    ) -> int:
        ...
```
Taking Dynamo's autoscaling approach (summarized below) as a reference, we plan to test manually whether the desired replica count satisfies the target TTFT and ITL requirements. Once the test is successful, we will implement the get_desired_count methods in the TTFT and ITL autoscalers.
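As a starting point, a simple target-tracking rule could be what gets tested first. The sketch below is one possible shape, not a committed design; stats.ttft and self.target are assumed names:

```python
import math

class TTFTAutoscaler(BaseServiceScaler):
    def get_desired_count(
        self,
        current_desired_count: int,
        stats,  # assumed to carry a mean `ttft` over the window
        last_scaled_at,
    ) -> int:
        if current_desired_count == 0 or stats.ttft is None:
            return current_desired_count
        # Target tracking: grow replicas in proportion to how far the
        # observed TTFT is above the configured target (self.target).
        ratio = stats.ttft / self.target
        return math.ceil(current_desired_count * ratio)
```

An ITLAutoscaler would follow the same pattern with stats.itl.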
NVIDIA Dynamo Autoscaling Process (Disaggregated Scaling)
Dynamo scales prefill and decode separately. Both prefill and decode use throughput for the replica calculation; the user-configured target TTFT/ITL values are used to select the appropriate throughput from profiled data.
Core concept
Prefill

```
desired_replica_count = ceil(predicted_throughput / replica_throughput_capacity)
replica_throughput_capacity = throughput_per_gpu × gpus_per_replica
```

where throughput_per_gpu is found from profiled data (selected based on the target TTFT) using the predicted ISL.

Decode

```
desired_replica_count = ceil(predicted_throughput / replica_throughput_capacity)
replica_throughput_capacity = throughput_per_gpu × gpus_per_replica
```

where throughput_per_gpu is found from profiled data based on the target ITL and context length.
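For example, plugging illustrative (made-up) numbers into the prefill formula:

```python
import math

predicted_throughput = 12_000.0  # tokens/s the predictors expect
throughput_per_gpu = 900.0       # tokens/s per GPU at the target TTFT (from profiling)
gpus_per_replica = 4

replica_throughput_capacity = throughput_per_gpu * gpus_per_replica  # 3600.0
desired_replica_count = math.ceil(predicted_throughput / replica_throughput_capacity)
print(desired_replica_count)  # ceil(12000 / 3600) = 4
```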
Dynamo allows configuring three different predictors for forecasting the load components (number of requests, ISL, OSL) that are then used to calculate predicted_throughput:
- Constant Predictor: the simplest option; it returns the values (number of requests, ISL, OSL) collected in the most recent interval.
- ARIMA: forecasts the next value by analyzing patterns in recent historical data, automatically detecting trends and using them for prediction.
- Prophet: a forecasting model from Meta that identifies trends and patterns in historical data; more advanced than ARIMA and better at handling complex or seasonal patterns.
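For intuition, the constant predictor reduces to echoing the latest observation. A minimal sketch (all names hypothetical, not Dynamo's API):

```python
from dataclasses import dataclass

@dataclass
class LoadSnapshot:
    num_requests: float
    isl: float  # mean input sequence length
    osl: float  # mean output sequence length

class ConstantPredictor:
    def __init__(self) -> None:
        self._last: LoadSnapshot | None = None

    def observe(self, snapshot: LoadSnapshot) -> None:
        self._last = snapshot

    def predict(self) -> LoadSnapshot | None:
        # Forecast the next interval as the most recent observation.
        return self._last
```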
Workaround
No response
Would you like to help us implement this feature by sending a PR?
Yes