# Benchmark Metric Analysis Tutorial
This interactive notebook guides you through analyzing benchmark metrics.

- In this guide, we will introduce some benchmarking concepts, including metrics such as TTFT (Time to First Token) and ITL (Inter-Token Latency).
- Understanding these metrics helps you interpret AIPerf results correctly.
- It also lets you:
    - Identify performance bottlenecks
    - Tune parameters to improve inference performance

---


## Part 1: Time to First Token (TTFT) Analysis
TTFT shows how long a user needs to wait before seeing the model’s output. It is the time from submitting the query to receiving the first token (if the response is not empty). TTFT generally includes request queuing time, prefill time, and network latency.

![TTFT Def](images/ttft_def.png)
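
As a minimal sketch of the definition above, TTFT can be computed from client-side timestamps of a streamed response. The `RequestTrace` record and its field names are hypothetical and only serve this illustration; they are not part of AIPerf.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RequestTrace:
    """Client-side timestamps for one streamed request, in seconds (hypothetical record)."""
    sent_at: float            # when the query was submitted
    token_times: List[float]  # arrival time of each streamed token


def ttft_ms(trace: RequestTrace) -> Optional[float]:
    """Return time to first token in milliseconds, or None for an empty response."""
    if not trace.token_times:
        return None
    return (trace.token_times[0] - trace.sent_at) * 1000.0


# Made-up timestamps: the first token arrives 0.25 s after the request is sent.
trace = RequestTrace(sent_at=10.0, token_times=[10.25, 10.27, 10.29])
print(f"TTFT: {ttft_ms(trace):.1f} ms")
```

Because the timestamps are taken on the client, the value includes queuing, prefill, and network latency together, as described above.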

In practice, there are several ways to reduce TTFT. Here we look at one example: controlling TTFT through the `max_num_tokens` parameter, which specifies the maximum number of batched input tokens.



![TTFT](images/ttft_illustration.png)

### Now let's test it
The AIPerf test command is as follows, with an input sequence length of 4096 and an output sequence length of 200.

```bash
aiperf profile \
  --model Qwen/Qwen3-32B-FP8 \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --num-requests 100 \
  --concurrency 8 \
  --isl 4096 \
  --osl 200 \
  --streaming
```

The first test scenario sets a smaller `max_num_tokens` so that only one prefill is processed per iteration (with 4096-token inputs, a limit of 5000 fits a single prompt).

```yaml
build_config:
  max_num_tokens: 5000
  max_batch_size: 8

kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
```

![TTFT with small max_num_tokens](images/ttft_seq_prefill.png)

The second test scenario sets a larger `max_num_tokens` so that all prefills are processed in a single iteration (8 concurrent requests × 4096 tokens = 32768 tokens, well under 65536).

```yaml
build_config:
  max_num_tokens: 65536
  max_batch_size: 8

kv_cache_config:
  dtype: fp8
  enable_block_reuse: false
```

![TTFT with large max_num_tokens](images/ttft_batch_prefill.png)

The results show that, compared with a limited `max_num_tokens`, an excessively large value causes a sharp increase in TTFT.
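
One possible intuition for this result is a toy model in which all requests arrive at once and an iteration's prefill time grows linearly with the number of batched tokens. The sketch below is only that, a toy model: the `ms_per_token` cost is a made-up number, and decode steps, chunked prefill, and scheduling overhead are ignored.

```python
def mean_ttft(num_requests: int, isl: int, max_num_tokens: int, ms_per_token: float) -> float:
    """Mean TTFT (ms) in a toy model where prefill cost is linear in batched tokens."""
    # Number of whole prompts that fit into one iteration (at least one).
    per_iter = max(1, max_num_tokens // isl)
    clock = 0.0
    ttfts = []
    remaining = num_requests
    while remaining > 0:
        batch = min(per_iter, remaining)
        # The iteration takes longer the more prompt tokens it batches.
        clock += batch * isl * ms_per_token
        # Every request in the batch receives its first token when the iteration ends.
        ttfts.extend([clock] * batch)
        remaining -= batch
    return sum(ttfts) / len(ttfts)


# Hypothetical cost of 0.05 ms per prefill token, 8 concurrent requests, ISL 4096.
sequential = mean_ttft(8, 4096, max_num_tokens=5000, ms_per_token=0.05)
batched = mean_ttft(8, 4096, max_num_tokens=65536, ms_per_token=0.05)
print(f"one prefill per iteration: {sequential:.1f} ms")
print(f"all prefills in one iteration: {batched:.1f} ms")
```

In this model, batching every prefill into one iteration makes each request wait for the whole batch, while sequential prefill lets early requests receive their first token sooner, which lowers the mean.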

For disaggregated serving, another important factor to consider is at which stage the first token is returned. Typically, there are two implementations in the industry: one returns immediately after the prefill stage, while the other returns only after the KV cache has been transferred to the decode worker.

![TTFT disagg](images/ttft_disagg.png)

In general, the return stage can be roughly identified by checking the time to second token: a higher value typically indicates the first approach, because the KV cache transfer then falls between the first and second tokens, while a lower value suggests the second.
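
As a rough sketch of this check, the gap between the first and second tokens can be compared with the typical gap between later tokens. The function name and the timestamps are hypothetical, and what counts as a "high" ratio is a judgment call that depends on the deployment.

```python
from statistics import median
from typing import List


def second_token_gap_ratio(token_times: List[float]) -> float:
    """Ratio of the first inter-token gap to the median of the later gaps."""
    if len(token_times) < 3:
        raise ValueError("need at least three token timestamps")
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return gaps[0] / median(gaps[1:])


# Made-up traces (seconds): in the first, a long pause follows the first token.
after_prefill = [0.30, 0.50, 0.52, 0.54, 0.56]
after_transfer = [0.45, 0.47, 0.49, 0.51, 0.53]
print(f"ratio, long pause after first token: {second_token_gap_ratio(after_prefill):.1f}")
print(f"ratio, uniform gaps: {second_token_gap_ratio(after_transfer):.1f}")
```

A ratio well above 1 would be consistent with the first token being returned right after prefill; a ratio close to 1 would be consistent with it being returned after the KV cache transfer.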

## Part 2: Inter-Token Latency (ITL) Analysis

ITL is defined as the average time between consecutive tokens. Different benchmark tools may calculate ITL differently; here we use AIPerf's definition:
$$ITL = \frac{\text{Request latency} - \text{TTFT}}{\text{Total output tokens} - 1}$$
![ITL def](images/itl_def.png)
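
The formula can be written as a small helper. This is a direct transcription of the equation above, not AIPerf's actual code, and the sample numbers are made up.

```python
def itl_ms(request_latency_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Average inter-token latency (ms) for one request, per the formula above."""
    if output_tokens < 2:
        raise ValueError("ITL needs at least two output tokens")
    return (request_latency_ms - ttft_ms) / (output_tokens - 1)


# Hypothetical request: 4230 ms total latency, 250 ms TTFT, 200 output tokens.
print(f"ITL: {itl_ms(4230.0, 250.0, 200):.1f} ms")
```

The denominator is one less than the token count because 200 tokens are separated by only 199 gaps.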

Output tokens per second per user corresponds to $\frac{1000}{ITL}$ (the factor of 1000 converts ITL from milliseconds to seconds), which reflects the per-user output token throughput (denoted here as TPS per user).
![ITL p50](images/itl_p50.png)

The p50 ITL multiplied by the p50 TPS per user is roughly 1000, as expected. Now let’s analyze the average ITL and the average TPS per user.

![ITL avg](images/itl_avg.png)

The average ITL multiplied by the average TPS per user does not equal 1000. Apart from statistical factors, this is because the mean of $\frac{1000}{ITL}$ is not the same as 1000 divided by the mean ITL, so the gap partially reveals how widely the per-request values are spread.
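
The difference between the median and the average can be reproduced with a handful of made-up per-request ITL values. The numbers below are purely illustrative and are not taken from the benchmark runs shown above.

```python
from statistics import mean, median

# Hypothetical per-request ITLs in milliseconds, deliberately skewed.
itl = [10.0, 10.0, 20.0, 40.0, 100.0]
# Per-user throughput for each request, in tokens per second.
tps = [1000.0 / x for x in itl]

median_product = median(itl) * median(tps)
mean_product = mean(itl) * mean(tps)
print(f"p50 ITL x p50 TPS per user: {median_product:.0f}")
print(f"avg ITL x avg TPS per user: {mean_product:.0f}")
```

Taking the reciprocal reverses the ordering of the values, so the middle request stays the middle request and the median product stays at 1000 (with an even number of samples it is only approximately 1000). The mean has no such property: for positive values that are not all equal, the mean of the reciprocals always exceeds the reciprocal of the mean, and the excess grows with the spread.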

## Part 3: Others

This section briefly lists some configurations that may affect performance.

### CUDA Graph Batch Size

Currently, three frameworks support CUDA Graph. When configuring it, note that enabling padding may reduce performance. The following are two different configurations.
The first one is:
```yaml
cuda_graph_config:
  batch_sizes: [1,4,8,16,32,64]
  enable_padding: true
```
And its performance is:
![CUDA Graph with sparse batch sizes and padding](images/cuda_graph_skip.png)
The second one is:
```yaml
cuda_graph_config:
  batch_sizes: [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,
        26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,
        51,52,53,54,55,56,57,58,59,60,61,62,63,64]
```
And its performance is:
![CUDA Graph with sequential batch sizes](images/cuda_graph_seq.png)
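
One way to reason about the difference between the two configurations is to assume that, with padding enabled, a batch is rounded up to the next captured batch size. That rounding rule is an assumption made for this sketch; the exact behavior depends on the framework. The helper `padded_batch_size` is hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional


def padded_batch_size(actual: int, captured: List[int]) -> Optional[int]:
    """Smallest captured batch size >= actual, or None if no captured size is large enough."""
    sizes = sorted(captured)
    i = bisect_left(sizes, actual)
    return sizes[i] if i < len(sizes) else None


sparse = [1, 4, 8, 16, 32, 64]
sequential = list(range(1, 65))

# Under the assumed rule, a batch of 9 requests runs as 16 with the sparse list
# and as 9 with the sequential list.
for name, sizes in (("sparse", sparse), ("sequential", sequential)):
    padded = padded_batch_size(9, sizes)
    print(f"{name}: batch 9 -> graph size {padded}, padded slots {padded - 9}")
```

Under this assumption, the sparse list spends compute on padded slots whenever the live batch size falls between two captured sizes, while the sequential list always has an exact match up to 64.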
