In [1]:
from IPython.display import Image

- `gpu_memory_utilization`
    - weights
    - activations
    - KV cache
- advanced parameters
    - dynamic split-fuse: enabled by default (unless explicitly disabled)
    - when enabled, it can significantly improve TTFT (time to first token)

## prefill & decode

In [2]:
Image(url='https://pica.zhimg.com/v2-12657ad5334825239b580b86a7c7e0c4_1440w.jpg', width=500)

https://zhuanlan.zhihu.com/p/14668411986

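- While using vLLM, we noticed that during token generation the model occasionally pauses for a long time before emitting the next token.
    - This happens because a single vLLM service instance performs both prefill and decode, but it can only run one of the two at any given time, and prefill has higher priority than decode.
    - Under concurrent load, prefill is therefore handled first and decode stalls, so some tokens wait a long time before being emitted.
- As the figure above shows, suppose there are initially two sequences, A and B, both in the decode phase. After A and B complete one decode step, two new requests, C and D, arrive. Because vLLM is prefill-first, it handles the prefill of C and D first, which pauses decoding. Once the prefill of C and D completes, A, B, C, and D decode together.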
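The A/B/C/D scenario above can be replayed with a toy scheduler. This is only a minimal sketch of the prefill-first idea, not vLLM's scheduler: the function name `prefill_first_schedule` and its arguments are hypothetical, and each step is assumed to run exactly one phase.

```python
from collections import deque


def prefill_first_schedule(decoding, arrivals, steps):
    """Toy prefill-first scheduler: a step runs prefill OR decode, never both.

    decoding: names of sequences already in the decode phase
    arrivals: dict mapping step index -> names of requests arriving at that step
    Returns one (phase, sequences) tuple per step.
    """
    decoding = list(decoding)
    waiting = deque()
    timeline = []
    for step in range(steps):
        waiting.extend(arrivals.get(step, []))
        if waiting:
            # Prefill has priority: new prompts are prefilled and decode stalls.
            batch = list(waiting)
            waiting.clear()
            timeline.append(("prefill", batch))
            decoding.extend(batch)
        else:
            timeline.append(("decode", list(decoding)))
    return timeline


# A and B decode once, then C and D arrive at step 1.
timeline = prefill_first_schedule(["A", "B"], {1: ["C", "D"]}, steps=3)
for step, (phase, seqs) in enumerate(timeline):
    print(step, phase, seqs)
```

In step 1 neither A nor B produces a token, which is the stall that shows up as a long gap between output tokens.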

### chunked prefill

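- Chunked prefill changes the traditional inference-serving approach in which a batch is either all prefill or all decode: prefill and decode can now be placed in the same batch. For a long request whose prefill cannot be completed in a single batch, the prompt is cut into chunks that are processed across multiple batches, hence the name chunked prefill. Compared with prefilling an unsplit sequence in one pass, each chunk has to read the KV cache of the preceding chunks, which adds some overhead. TTFT is therefore slightly worse, but given the larger improvement in TPOT/TBT (time per output token / time between tokens), the overhead is acceptable.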
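A toy version of this batching rule, as a sketch only: it assumes each decoding sequence costs one token of a per-step budget and that the leftover budget is filled with prefill chunks. The names `chunked_prefill_schedule` and `token_budget` are hypothetical and do not mirror vLLM's internal scheduler.

```python
def chunked_prefill_schedule(decoding, prompt_lens, token_budget):
    """Toy chunked-prefill scheduler mixing decode and prefill in one batch.

    decoding: names of sequences already decoding (1 token each per step)
    prompt_lens: dict mapping new request name -> prompt length in tokens
    token_budget: maximum number of tokens processed per step (assumed knob)
    Returns a list of steps: {"decode": [...], "prefill": [(name, n_tokens), ...]}.
    """
    decoding = list(decoding)
    if token_budget <= len(decoding):
        raise ValueError("token_budget must leave room for prefill chunks")
    remaining = dict(prompt_lens)  # prompt tokens not yet prefilled
    steps = []
    while remaining:
        budget = token_budget - len(decoding)  # decode tokens are scheduled first
        chunks, finished = [], []
        for name in list(remaining):
            if budget <= 0:
                break
            take = min(budget, remaining[name])
            chunks.append((name, take))
            remaining[name] -= take
            budget -= take
            if remaining[name] == 0:
                del remaining[name]
                finished.append(name)
        steps.append({"decode": list(decoding), "prefill": chunks})
        # Sequences whose prompt is fully prefilled start decoding next step.
        decoding.extend(finished)
    return steps


# A and B keep decoding while C's 10-token prompt is prefilled in chunks.
steps = chunked_prefill_schedule(["A", "B"], {"C": 10}, token_budget=6)
for i, s in enumerate(steps):
    print(i, s)
```

Unlike the prefill-first case, A and B emit a token in every step; the price is that C's prefill is spread over three steps instead of one.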

### gpu-memory-utilization

- default: 0.9
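A rough way to think about what this fraction buys, as a sketch: the budget is the GPU's total memory times `gpu_memory_utilization`, and whatever is left after weights and activations goes to the KV cache. The 24 GiB card and the 1.5 GiB activation peak below are illustrative assumptions; only the 0.9228 GB weight size comes from the log in this notebook, and the helper name `kv_cache_budget_gib` is hypothetical.

```python
def kv_cache_budget_gib(total_gib, gpu_memory_utilization, weights_gib, activation_gib):
    """Estimate the memory left for the KV cache (all values in GiB)."""
    budget = total_gib * gpu_memory_utilization
    kv = budget - weights_gib - activation_gib
    if kv <= 0:
        raise ValueError("no memory left for the KV cache; raise the utilization")
    return kv


# Assumed: 24 GiB GPU, 1.5 GiB peak activations. Weights size taken from the log.
kv = kv_cache_budget_gib(
    total_gib=24.0,
    gpu_memory_utilization=0.9,
    weights_gib=0.9228,
    activation_gib=1.5,
)
print(f"KV cache budget: {kv:.2f} GiB")
```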

### automatic prefix caching

- The core idea of PagedAttention is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
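The block idea can be illustrated with a toy allocator. This is a minimal sketch of the bookkeeping only (a per-sequence block table over a shared pool of physical blocks); the class `BlockAllocator` and its methods are hypothetical and are not vLLM's block manager.

```python
class BlockAllocator:
    """Toy KV block table: logical token positions -> non-contiguous physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # free physical block ids
        self.tables = {}    # sequence id -> list of physical block ids
        self.lengths = {}   # sequence id -> number of tokens stored

    def append_token(self, seq):
        n = self.lengths.get(seq, 0)
        table = self.tables.setdefault(seq, [])
        if n % self.block_size == 0:
            # All current blocks are full (or none exist): allocate on demand.
            if not self.free:
                raise MemoryError("out of KV blocks")
            table.append(self.free.pop(0))
        self.lengths[seq] = n + 1

    def locate(self, seq, pos):
        """Map a logical token position to (physical block id, offset in block)."""
        return self.tables[seq][pos // self.block_size], pos % self.block_size


alloc = BlockAllocator(num_blocks=8, block_size=4)
for _ in range(6):
    # Two sequences grow in lockstep, so their blocks interleave in the pool.
    alloc.append_token("A")
    alloc.append_token("B")

print(alloc.tables)
print(alloc.locate("A", 5))
```

Each sequence only ever holds the blocks it has actually filled, and its blocks need not be adjacent, which is what removes the fragmentation of one large contiguous reservation per request.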

## generate vs. chat

In [3]:
from vllm import LLM
llm = LLM(model='Qwen/Qwen2.5-0.5B-Instruct')

INFO 02-13 21:48:04 llm_engine.py:237] Initializing an LLM engine (vdev) with config: model='Qwen/Qwen2.5-0.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-0.5B-Instruct, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_st

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 02-13 21:48:07 model_runner.py:1071] Loading model weights took 0.9228 GB
INFO 02-13 21:48:08 gpu_executor.py:122] # GPU blocks: 100627, # CPU blocks: 21845
INFO 02-13 21:48:08 gpu_executor.py:126] Maximum concurrency for 32768 tokens per request: 49.13x
INFO 02-13 21:48:12 model_runner.py:1402] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 02-13 21:48:12 model_runner.py:1406] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 02-13 21:48:29 model_runner.py:1530] Graph capturing finished in 18 secs.
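The "Maximum concurrency" figure in the log can be checked by hand. The sketch below assumes a block size of 16 tokens per KV block, which is not printed in the log; the other two numbers are copied from it.

```python
num_gpu_blocks = 100627  # "# GPU blocks" from the log
block_size = 16          # tokens per block (assumption, not shown in the log)
max_seq_len = 32768      # max_seq_len from the log

# Total tokens the KV cache can hold, and how many full-length requests fit.
cache_tokens = num_gpu_blocks * block_size
max_concurrency = cache_tokens / max_seq_len
print(f"{cache_tokens} tokens, {max_concurrency:.2f}x")
# prints: 1610032 tokens, 49.13x
```

Under that assumption the result matches the 49.13x reported by `gpu_executor.py`.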


In [4]:
gen_output = llm.generate('hello?')

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  9.04it/s, est. speed input: 18.35 toks/s, output: 146.75 toks/s]


In [8]:
gen_output[0].prompt, gen_output[0].outputs[0].text

('hello?', ' Help please?\\n\\nTopic: SOy dhy so thy sis? D')

### chat

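- What is actually fed to the model's forward pass for inference is the prompt shown here
    - It corresponds to the ChatML template used when the model was trained;
    - see `chat_output[0].prompt` in the code
- This suggests that an instruct model is better suited to chat than to raw generate?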

In [10]:
msgs = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Hello"
    },
    {
        "role": "assistant",
        "content": "Hello! How can I assist you today?"
    },
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]

In [11]:
chat_output = llm.chat(msgs)

Processed prompts: 100%|██████████| 1/1 [00:00<00:00, 10.18it/s, est. speed input: 489.60 toks/s, output: 163.16 toks/s]


In [13]:
chat_output[0].prompt

'<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHello! How can I assist you today?<|im_end|>\n<|im_start|>user\nWrite an essay about the importance of higher education.<|im_end|>\n<|im_start|>assistant\n'
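The prompt above can be reproduced with a few lines of string formatting. This is a simplified sketch of the ChatML layout only; the helper `render_chatml` is hypothetical, and in practice the rendering comes from the chat template shipped with the model's tokenizer, which handles more cases than this.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in ChatML form."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


msgs = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How can I assist you today?"},
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]

prompt = render_chatml(msgs)
print(repr(prompt))
```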

In [15]:
chat_output[0].outputs[0].text

"Higher education is crucial in today's world for many reasons. It provides a foundation"

## quant

- vllm vs. ollama
    - vllm: awq, gptq
    - ollama: gguf

## benchmark_serving.py

- benchmark metrics
    - ttft: time to first token
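
As a sketch of how such metrics can be derived from per-token arrival timestamps: the helper `latency_metrics` below is hypothetical and is not taken from `benchmark_serving.py`; it assumes TPOT is the mean gap between consecutive output tokens.

```python
def latency_metrics(request_start, token_times):
    """Compute TTFT, mean TPOT, and worst inter-token gap, all in seconds.

    request_start: time the request was sent
    token_times: arrival time of each output token, in order
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps) if gaps else 0.0
    max_tbt = max(gaps) if gaps else 0.0
    return {"ttft": ttft, "tpot": tpot, "max_tbt": max_tbt}


# The last gap (1.0 s) mimics a decode stall caused by an incoming prefill.
metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 2.0])
print(metrics)
```

A large `max_tbt` next to a small `tpot` is the signature of the occasional long pause described in the prefill & decode section.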