[Feature]: TTFT/ITL Based Autoscaling #3293

@Bihan

Description

Problem

Time To First Token (TTFT) and Inter-Token Latency (ITL) directly reflect user experience:

- TTFT: Time until the first token appears (responsiveness)
- ITL: Time between subsequent tokens (generation speed)

Scaling based on these metrics ensures the system meets SLA targets and maintains good user experience under varying load.
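
For concreteness, here is a minimal sketch (illustrative, not part of the proposal) of how the two metrics can be computed from per-token timestamps; request_start and token_times are hypothetical inputs:

# Minimal sketch: computing TTFT and ITL from token arrival timestamps.
# request_start and token_times are hypothetical inputs for illustration.
def compute_ttft_itl(request_start: float, token_times: list[float]) -> tuple[float, float]:
    # TTFT: delay until the first token arrives.
    ttft = token_times[0] - request_start
    # ITL: average gap between consecutive tokens.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

ttft, itl = compute_ttft_itl(10.0, [10.25, 10.30, 10.36, 10.41])
print(f"TTFT={ttft:.2f}s ITL={itl:.3f}s")  # TTFT=0.25s ITL=0.053s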

Solution

First Iteration: Aggregated Scaling
In the first iteration, we scale aggregated inference replicas (combined prefill and decode).

Second Iteration: Disaggregated Scaling
In the second iteration, we will implement disaggregated inference, similar to NVIDIA Dynamo:
- Use TTFT to scale prefill workers independently
- Use ITL to scale decode workers independently

Key Changes:

1. Service Configuration

Declare multiple autoscalers in the service config, as shown below:

type: service
name: sample-service

python: 3.12

env:
  - HF_TOKEN
  
commands:
  - ..
  - ..
  - python -m sglang.launch_server --model-path $MODEL_ID --host 0.0.0.0 --port 8000 --enable-metrics

port: 8000
model: meta-llama/Llama-3.2-3B-Instruct

resources:
  gpu: 24GB

replicas: 0..10
scaling:
  - metric: rps
    target: 10.0
  - metric: ttft
    target: 0.3
  - metric: itl
    target: 1.5

2. StatsCollector
The gateway’s StatsCollector fetches Prometheus metrics from each replica’s (SGLang worker’s) metrics endpoint (http://worker_url/metrics) and includes TTFT, ITL, ISL (Input Sequence Length), and OSL (Output Sequence Length) in the frames.
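
A rough sketch of what this fetch could look like, using the standard prometheus_client text parser. The metric names below are assumptions and must be verified against the actual /metrics output of the SGLang version in use:

# Sketch of fetching TTFT/ITL averages from a replica's Prometheus endpoint.
# The metric names are assumptions; verify against the worker's real /metrics output.
import urllib.request
from prometheus_client.parser import text_string_to_metric_families

ASSUMED_HISTOGRAMS = {
    "sglang:time_to_first_token_seconds",  # hypothetical TTFT histogram name
    "sglang:inter_token_latency_seconds",  # hypothetical ITL histogram name
}

def fetch_latency_metrics(worker_url: str) -> dict[str, float]:
    text = urllib.request.urlopen(f"{worker_url}/metrics", timeout=5).read().decode()
    means: dict[str, float] = {}
    for family in text_string_to_metric_families(text):
        if family.name in ASSUMED_HISTOGRAMS:
            # A Prometheus histogram exposes _sum and _count samples;
            # their ratio gives the mean latency over the process lifetime.
            samples = {s.name: s.value for s in family.samples}
            count = samples.get(f"{family.name}_count", 0.0)
            if count:
                means[family.name] = samples[f"{family.name}_sum"] / count
    return means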

3. Statistics Retrieval
The server collects ServiceStats that include TTFT, ITL, ISL, and OSL (in addition to requests and request_time).
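
A hypothetical shape for the extended stats record (field names are illustrative, not the actual dstack types):

# Hypothetical extended stats record; field names are illustrative only.
from dataclasses import dataclass

@dataclass
class ServiceStats:
    requests: int        # requests observed in the window
    request_time: float  # average request latency, seconds
    ttft: float          # average time to first token, seconds
    itl: float           # average inter-token latency, seconds
    isl: float           # average input sequence length, tokens
    osl: float           # average output sequence length, tokens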

4. Scaling Decision
When update_service_desired_replica_count() is called:

desired_counts = []

# Multiple scalers: RPS-, TTFT-, and ITL-based.
for scaler in scalers:
    desired = scaler.get_desired_count(
        current_desired_count=run_model.desired_replica_count,
        stats=stats,
        last_scaled_at=last_scaled_at,
    )
    desired_counts.append(desired)

# Take the maximum of all desired counts.
run_model.desired_replica_count = max(desired_counts)

5. Implement TTFT and ITL Autoscalers

class TTFTAutoscaler(BaseServiceScaler):
    …

    def get_desired_count(
        …,
    ) -> int:
        …


class ITLAutoscaler(BaseServiceScaler):
    …

    def get_desired_count(
        …,
    ) -> int:
        …

Taking Dynamo’s autoscaling approach (summarized below) as a reference, we plan to test manually whether the desired replica count satisfies the target TTFT and ITL requirements. Once the test is successful, we will implement the get_desired_count methods in the TTFT and ITL autoscalers; a naive sketch of what such a method could look like follows.
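
For illustration only, a simple proportional rule (not Dynamo’s method, and not necessarily what will be shipped) could look like this; the constructor argument and the stats.ttft field are assumptions carried over from the snippets above:

import math

class TTFTAutoscaler(BaseServiceScaler):
    # Naive proportional sketch: grow replicas with the ratio of
    # observed to target TTFT. Illustrative only.
    def __init__(self, target_ttft: float):
        self.target_ttft = target_ttft

    def get_desired_count(self, current_desired_count, stats, last_scaled_at) -> int:
        if not stats.ttft:
            # No traffic observed; keep the current count.
            return current_desired_count
        ratio = stats.ttft / self.target_ttft
        return max(1, math.ceil(current_desired_count * ratio))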

Nvidia Dynamo Autoscaling Process (Disaggregated Scaling)
Dynamo scales prefill and decode separately. Both use throughput to calculate the replica count; the target TTFT/ITL (configured by the user) is used to select the appropriate per-GPU throughput from profiled data.

Core concept

Prefill

desired_replica_count = ceil(predicted_throughput / replica_throughput_capacity)
replica_throughput_capacity = throughput_per_gpu × gpus_per_replica

where throughput_per_gpu is looked up in the profiled data (selected based on the target TTFT) using the predicted ISL (Input Sequence Length).

Decode

desired_replica_count = ceil(predicted_throughput / replica_throughput_capacity)
replica_throughput_capacity = throughput_per_gpu × gpus_per_replica

where throughput_per_gpu is looked up in the profiled data based on the target ITL and context length.
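
As a worked example of the formula (all numbers made up): suppose the predictor forecasts 120,000 prefill tokens/s, profiling gives 15,000 tokens/s per GPU at the target TTFT, and each replica has 2 GPUs:

import math

predicted_throughput = 120_000  # tokens/s, from the load predictor (made-up)
throughput_per_gpu = 15_000     # tokens/s, from profiled data at the target TTFT (made-up)
gpus_per_replica = 2

replica_throughput_capacity = throughput_per_gpu * gpus_per_replica  # 30,000 tokens/s
desired_replica_count = math.ceil(predicted_throughput / replica_throughput_capacity)
print(desired_replica_count)  # 4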

Dynamo allows configuring three different predictors for forecasting the load components (number of requests, ISL, OSL), which are then used to calculate predicted_throughput:

  1. Constant Predictor: The simplest option; it returns the values (number of requests, ISL, OSL) collected in the most recent interval.

  2. ARIMA model: Forecasts the next value by analyzing patterns in recent historical data. Automatically detects trends and uses them for prediction.

  3. Prophet: A forecasting model from Meta that identifies trends and patterns in historical data. More advanced than ARIMA and better at handling complex or seasonal patterns.
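
A minimal sketch of the first two predictors behind a shared function signature (illustrative; Dynamo’s actual implementations differ). The ARIMA variant uses statsmodels:

# Illustrative predictors for a single load component (e.g., requests per interval).
# Dynamo's actual implementations differ.
from statsmodels.tsa.arima.model import ARIMA

def constant_predict(history: list[float]) -> float:
    # Constant predictor: reuse the most recent observation.
    return history[-1]

def arima_predict(history: list[float], order=(1, 1, 1)) -> float:
    # ARIMA: fit on recent history and forecast the next interval.
    fitted = ARIMA(history, order=order).fit()
    return float(fitted.forecast(steps=1)[0])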

Workaround

No response

Would you like to help us implement this feature by sending a PR?

Yes
