Description
Problem
Time To First Token (TTFT) and Inter-Token Latency (ITL) directly reflect user experience:
- TTFT: time until the first token appears (responsiveness)
- ITL: time between subsequent tokens (generation speed)
Scaling based on these metrics ensures the system meets SLA targets and maintains good user experience under varying load.
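Concretely, both metrics reduce to simple arithmetic over per-token timestamps. A minimal sketch (function and variable names are illustrative only, not an existing API):

```python
def compute_ttft(request_start: float, token_times: list[float]) -> float:
    # Time from request submission until the first generated token.
    return token_times[0] - request_start

def compute_mean_itl(token_times: list[float]) -> float:
    # Mean gap between consecutive tokens after the first.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return sum(gaps) / len(gaps)
```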
Solution
First Iteration: Aggregated Scaling
In the first iteration, we scale aggregated inference replicas (combined prefill and decode).
Second Iteration: Disaggregated Scaling
In the second iteration, we will implement disaggregated inference, similar to NVIDIA Dynamo:
- Use TTFT to scale prefill workers independently
- Use ITL to scale decode workers independently
Key Changes:
1. Service Configuration
Declare multiple autoscalers in the service config, as below.
```yaml
type: service
name: sample-service

python: 3.12
env:
  - HF_TOKEN
commands:
  - ..
  - ..
  - python -m sglang.launch_server --model-path $MODEL_ID --host 0.0.0.0 --port 8000 --enable-metrics
port: 8000

model: meta-llama/Llama-3.2-3B-Instruct

resources:
  gpu: 24GB

replicas: 0..10
scaling:
  - metric: rps
    target: 10.0
  - metric: ttft
    target: 0.3
  - metric: itl
    target: 1.5
```
2. StatsCollector
Gateway's StatsCollector fetches Prometheus metrics from each replica (SGLang worker) at http://worker_url/metrics and includes TTFT, ITL, ISL (Input Sequence Length), and OSL (Output Sequence Length) in frames.
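A rough sketch of the fetch-and-aggregate step. The exact SGLang metric names and how the values are folded into frames are assumptions to verify against the worker's /metrics output:

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

def fetch_mean_latencies(worker_url: str) -> dict[str, float]:
    # Scrape the replica's Prometheus endpoint.
    text = requests.get(f"{worker_url}/metrics", timeout=5).text
    sums: dict[str, float] = {}
    counts: dict[str, float] = {}
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            # Histograms expose <name>_sum and <name>_count;
            # their ratio is the mean over the metric's lifetime.
            if sample.name.endswith("_sum"):
                sums[sample.name[:-4]] = sample.value
            elif sample.name.endswith("_count"):
                counts[sample.name[:-6]] = sample.value
    return {name: sums[name] / counts[name] for name in sums if counts.get(name)}
```

Note that means over a recent window (rather than over the metric's lifetime) would require diffing successive scrapes.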
3. Statistics Retrieval
Server collects ServiceStats that include TTFT, ITL, ISL, and OSL (in addition to requests and request_time).
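A possible shape for the extended stats, shown only as a sketch (field names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class ServiceStats:
    requests: int        # request count over the window
    request_time: float  # mean request latency, seconds
    ttft: float | None   # mean time to first token, seconds
    itl: float | None    # mean inter-token latency, seconds
    isl: float | None    # mean input sequence length, tokens
    osl: float | None    # mean output sequence length, tokens
```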
4. Scaling Decision
When update_service_desired_replica_count() is called:
```python
desired_counts = []
# Multiple scalers: rps-, ttft-, and itl-based
for scaler in scalers:
    desired = scaler.get_desired_count(
        current_desired_count=run_model.desired_replica_count,
        stats=stats,
        last_scaled_at=last_scaled_at,
    )
    desired_counts.append(desired)

# Take the maximum of all desired counts
run_model.desired_replica_count = max(desired_counts)
```
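Taking the maximum means the most demanding scaler wins: if RPS alone would settle at 3 replicas but meeting the TTFT target requires 5, the service scales to 5.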
5. Implement TTFT and ITL Autoscalers
```python
class TTFTAutoscaler(BaseServiceScaler):
    ...

    def get_desired_count(
        ...,
    ) -> int:
        ...


class ITLAutoscaler(BaseServiceScaler):
    ...

    def get_desired_count(
        ...,
    ) -> int:
        ...
```
Taking Dynamo's autoscaling approach (summarized below) as a reference, we plan to test manually whether the desired replica count satisfies the target TTFT and ITL requirements. Once the test is successful, we will implement the get_desired_count methods in the TTFT and ITL autoscalers.
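As a starting point, a simple target-tracking rule could be what gets tested first. The sketch below is one possible shape, not a committed design; stats.ttft and self.target are assumed names:

```python
import math

class TTFTAutoscaler(BaseServiceScaler):
    def get_desired_count(
        self,
        current_desired_count: int,
        stats,  # assumed to carry a mean `ttft` over the window
        last_scaled_at,
    ) -> int:
        if current_desired_count == 0 or stats.ttft is None:
            return current_desired_count
        # Target tracking: grow replicas in proportion to how far the
        # observed TTFT is above the configured target (self.target).
        ratio = stats.ttft / self.target
        return math.ceil(current_desired_count * ratio)
```

An ITLAutoscaler would follow the same pattern with stats.itl.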
NVIDIA Dynamo Autoscaling Process (Disaggregated Scaling)
Dynamo scales prefill and decode separately. Both prefill and decode use throughput for the replica calculation; the user-configured target TTFT/ITL values are used to select the appropriate throughput from profiled data.
Core concept
Prefill

```
desired_replica_count = ceil(predicted_throughput / replica_throughput_capacity)
replica_throughput_capacity = throughput_per_gpu × gpus_per_replica
```

where throughput_per_gpu is found from profiled data (selected based on the target TTFT) using the predicted ISL.

Decode

```
desired_replica_count = ceil(predicted_throughput / replica_throughput_capacity)
replica_throughput_capacity = throughput_per_gpu × gpus_per_replica
```

where throughput_per_gpu is found from profiled data based on the target ITL and context length.
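For example, plugging illustrative (made-up) numbers into the prefill formula:

```python
import math

predicted_throughput = 12_000.0  # tokens/s the predictors expect
throughput_per_gpu = 900.0       # tokens/s per GPU at the target TTFT (from profiling)
gpus_per_replica = 4

replica_throughput_capacity = throughput_per_gpu * gpus_per_replica  # 3600.0
desired_replica_count = math.ceil(predicted_throughput / replica_throughput_capacity)
print(desired_replica_count)  # ceil(12000 / 3600) = 4
```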
Dynamo allows configuring three different predictors for forecasting the load components (number of requests, ISL, OSL) that are then used to calculate predicted_throughput:
- Constant Predictor: the simplest option; it returns the values (number of requests, ISL, OSL) collected in the most recent interval.
- ARIMA: forecasts the next value by analyzing patterns in recent historical data, automatically detecting trends and using them for prediction.
- Prophet: a forecasting model from Meta that identifies trends and patterns in historical data; more advanced than ARIMA and better at handling complex or seasonal patterns.
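For intuition, the constant predictor reduces to echoing the latest observation. A minimal sketch (all names hypothetical, not Dynamo's API):

```python
from dataclasses import dataclass

@dataclass
class LoadSnapshot:
    num_requests: float
    isl: float  # mean input sequence length
    osl: float  # mean output sequence length

class ConstantPredictor:
    def __init__(self) -> None:
        self._last: LoadSnapshot | None = None

    def observe(self, snapshot: LoadSnapshot) -> None:
        self._last = snapshot

    def predict(self) -> LoadSnapshot | None:
        # Forecast the next interval as the most recent observation.
        return self._last
```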
Workaround
No response
Would you like to help us implement this feature by sending a PR?
Yes