- https://blog.vllm.ai/2023/06/20/vllm.html
- https://docs.vllm.ai/en/latest/design/kernel/paged_attention.html

### parallel sampling

- Use cases: multiple sampling, rejection sampling
- CoW (copy-on-write) mechanism
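    - To ensure safe sharing, PagedAttention keeps track of the **reference counts** of the physical blocks and implements the Copy-on-Write mechanism.
        - When several requests (or sequences) need exactly the same data, for example because they share a prompt prefix, PagedAttention does not copy the physical block for each request. Instead, the logical blocks of all these requests point to the same physical block.
        - To know how many logical blocks share a physical block, the system maintains a reference count. Each time a new logical block points to the physical block, its reference count increases by 1.
        - When the ref count is > 1, any attempt to modify the block triggers CoW: the writer gets its own copy, and the shared block's ref count drops by 1.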
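    - Imagine you have an important reference (say, a thick book) that several friends all want to read.
        - Copy up front (inefficient): photocopying the whole book for every friend costs a lot of time and paper (resources), especially if they mostly just read and never write in it.
        - Share (initial state): put the original on the table and tell everyone, "Read this one, just don't write in it." Everyone shares a single copy, which is very efficient.
        - Copy-on-write (CoW): when one friend (say, Xiao Ming) wants to take notes or highlight in the book (that is, to "write"), he cannot write in the original, because that would affect everyone else. Only at that moment do you photocopy the book for him. Xiao Ming then writes freely in his own copy, while the others keep reading the original, unaffected.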
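
The reference-counting and copy-on-write behaviour described above can be sketched in plain Python. This is only an illustrative toy, not vLLM's actual block manager: `PhysicalBlock`, `BlockAllocator`, `share`, and `append_token` are hypothetical names, and a real implementation manages fixed-size GPU KV blocks rather than Python lists.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalBlock:
    block_id: int
    tokens: list = field(default_factory=list)
    ref_count: int = 0


class BlockAllocator:
    """Hypothetical allocator that tracks a reference count per physical block."""

    def __init__(self):
        self.blocks = {}
        self._next_id = 0

    def allocate(self, tokens=None):
        # A freshly allocated block has exactly one owner.
        block = PhysicalBlock(self._next_id, list(tokens or []), ref_count=1)
        self.blocks[block.block_id] = block
        self._next_id += 1
        return block

    def share(self, block):
        # One more logical block now maps to this physical block.
        block.ref_count += 1
        return block

    def append_token(self, block_table, logical_idx, token):
        """Write a token into a logical block, copying first if it is shared."""
        block = block_table[logical_idx]
        if block.ref_count > 1:
            # Copy-on-write: detach from the shared block, write to a private copy.
            block.ref_count -= 1
            block = self.allocate(block.tokens)
            block_table[logical_idx] = block
        block.tokens.append(token)
        return block


allocator = BlockAllocator()
prompt_block = allocator.allocate(["The", "future", "of"])

# Two sampled sequences share the prompt block instead of copying it.
seq_a = [prompt_block]
seq_b = [allocator.share(prompt_block)]
shared_ref_count = prompt_block.ref_count

# Sequence A writes first: the block is shared, so A gets a private copy.
block_a = allocator.append_token(seq_a, 0, "AI")
# Sequence B is now the only owner, so it writes in place.
block_b = allocator.append_token(seq_b, 0, "work")
```

In this sketch only the first writer pays for a copy; once the reference count falls back to 1, the remaining owner writes to the original block directly.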

### APC (Automatic Prefix Caching)

- https://docs.vllm.ai/en/stable/features/automatic_prefix_caching.html

In [8]:
import time
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

In [2]:
LONG_PROMPT = "You are a helpful assistant in recognizes the content of tables in markdown format. Here is a table as follows.\n# Table\n" + """
| ID  | Name          | Age | Occupation    | Country       | Email                  | Phone Number   | Address                       |
|-----|---------------|-----|---------------|---------------|------------------------|----------------|------------------------------|
| 1   | John Doe      | 29  | Engineer      | USA           | john.doe@example.com   | 555-1234       | 123 Elm St, Springfield, IL  |
| 2   | Jane Smith    | 34  | Doctor        | Canada        | jane.smith@example.com | 555-5678       | 456 Oak St, Toronto, ON      |
| 3   | Alice Johnson | 27  | Teacher       | UK            | alice.j@example.com    | 555-8765       | 789 Pine St, London, UK      |
| 4   | Bob Brown     | 45  | Artist        | Australia     | bob.b@example.com      | 555-4321       | 321 Maple St, Sydney, NSW    |
| 5   | Carol White   | 31  | Scientist     | New Zealand   | carol.w@example.com    | 555-6789       | 654 Birch St, Wellington, NZ |
| 6   | Dave Green    | 28  | Lawyer        | Ireland       | dave.g@example.com     | 555-3456       | 987 Cedar St, Dublin, IE     |
| 7   | Emma Black    | 40  | Musician      | USA           | emma.b@example.com     | 555-1111       | 246 Ash St, New York, NY     |
| 8   | Frank Blue    | 37  | Chef          | Canada        | frank.b@example.com    | 555-2222       | 135 Spruce St, Vancouver, BC |
| 9   | Grace Yellow  | 50  | Engineer      | UK            | grace.y@example.com    | 555-3333       | 864 Fir St, Manchester, UK   |
| 10  | Henry Violet  | 32  | Artist        | Australia     | henry.v@example.com    | 555-4444       | 753 Willow St, Melbourne, VIC|
| 11  | Irene Orange  | 26  | Scientist     | New Zealand   | irene.o@example.com    | 555-5555       | 912 Poplar St, Auckland, NZ  |
| 12  | Jack Indigo   | 38  | Teacher       | Ireland       | jack.i@example.com     | 555-6666       | 159 Elm St, Cork, IE         |
| 13  | Karen Red     | 41  | Lawyer        | USA           | karen.r@example.com    | 555-7777       | 357 Cedar St, Boston, MA     |
| 14  | Leo Brown     | 30  | Chef          | Canada        | leo.b@example.com      | 555-8888       | 246 Oak St, Calgary, AB      |
| 15  | Mia Green     | 33  | Musician      | UK            | mia.g@example.com      | 555-9999       | 975 Pine St, Edinburgh, UK   |
| 16  | Noah Yellow   | 29  | Doctor        | Australia     | noah.y@example.com     | 555-0000       | 864 Birch St, Brisbane, QLD  |
| 17  | Olivia Blue   | 35  | Engineer      | New Zealand   | olivia.b@example.com   | 555-1212       | 753 Maple St, Hamilton, NZ   |
| 18  | Peter Black   | 42  | Artist        | Ireland       | peter.b@example.com    | 555-3434       | 912 Fir St, Limerick, IE     |
| 19  | Quinn White   | 28  | Scientist     | USA           | quinn.w@example.com    | 555-5656       | 159 Willow St, Seattle, WA   |
| 20  | Rachel Red    | 31  | Teacher       | Canada        | rachel.r@example.com   | 555-7878       | 357 Poplar St, Ottawa, ON    |
| 21  | Steve Green   | 44  | Lawyer        | UK            | steve.g@example.com    | 555-9090       | 753 Elm St, Birmingham, UK   |
| 22  | Tina Blue     | 36  | Musician      | Australia     | tina.b@example.com     | 555-1213       | 864 Cedar St, Perth, WA      |
| 23  | Umar Black    | 39  | Chef          | New Zealand   | umar.b@example.com     | 555-3435       | 975 Spruce St, Christchurch, NZ|
| 24  | Victor Yellow | 43  | Engineer      | Ireland       | victor.y@example.com   | 555-5657       | 246 Willow St, Galway, IE    |
| 25  | Wendy Orange  | 27  | Artist        | USA           | wendy.o@example.com    | 555-7879       | 135 Elm St, Denver, CO       |
| 26  | Xavier Green  | 34  | Scientist     | Canada        | xavier.g@example.com   | 555-9091       | 357 Oak St, Montreal, QC     |
| 27  | Yara Red      | 41  | Teacher       | UK            | yara.r@example.com     | 555-1214       | 975 Pine St, Leeds, UK       |
| 28  | Zack Blue     | 30  | Lawyer        | Australia     | zack.b@example.com     | 555-3436       | 135 Birch St, Adelaide, SA   |
| 29  | Amy White     | 33  | Musician      | New Zealand   | amy.w@example.com      | 555-5658       | 159 Maple St, Wellington, NZ |
| 30  | Ben Black     | 38  | Chef          | Ireland       | ben.b@example.com      | 555-7870       | 246 Fir St, Waterford, IE    |
"""

In [4]:
def get_generation_time(llm, sampling_params, prompts):
    """Run llm.generate once, then print the first completion and the wall-clock time."""
    # perf_counter is a monotonic clock, better suited to timing than time.time
    start_time = time.perf_counter()
    output = llm.generate(prompts, sampling_params=sampling_params)
    end_time = time.perf_counter()
    # print the output and generation time
    print(f"Output: {output[0].outputs[0].text}")
    print(f"Generation time: {end_time - start_time} seconds.")

In [5]:
# set enable_prefix_caching=True to enable APC
llm = LLM(
    model='Qwen/Qwen2.5-3B-Instruct',
    enable_prefix_caching=True
)

INFO 04-18 20:04:23 [config.py:585] This model supports multiple tasks: {'score', 'reward', 'generate', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 04-18 20:04:23 [config.py:1697] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 04-18 20:04:25 [core.py:54] Initializing a V1 LLM engine (v0.8.2) with config: model='Qwen/Qwen2.5-3B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=Non

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.38it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.82it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.74it/s]



INFO 04-18 20:04:29 [loader.py:447] Loading weights took 1.19 seconds
INFO 04-18 20:04:30 [gpu_model_runner.py:1186] Model loading took 5.7916 GB and 3.318596 seconds
INFO 04-18 20:04:40 [backends.py:415] Using cache directory: /home/whaow/.cache/vllm/torch_compile_cache/14f004164d/rank_0_0 for vLLM's torch.compile
INFO 04-18 20:04:40 [backends.py:425] Dynamo bytecode transform time: 10.76 s
INFO 04-18 20:04:44 [backends.py:132] Cache the graph of shape None for later use
INFO 04-18 20:05:13 [backends.py:144] Compiling a graph for general shape takes 31.70 s
INFO 04-18 20:05:25 [monitor.py:33] torch.compile takes 42.47 s in total
INFO 04-18 20:05:26 [kv_cache_utils.py:566] GPU KV cache size: 273,184 tokens
INFO 04-18 20:05:26 [kv_cache_utils.py:569] Maximum concurrency for 32,768 tokens per request: 8.34x
INFO 04-18 20:05:49 [gpu_model_runner.py:1534] Graph capturing finished in 23 secs, took 1.82 GiB
INFO 04-18 20:05:49 [core.py:151] init engine (profile, create kv cache, warmup model

In [6]:
sampling_params = SamplingParams(temperature=0, max_tokens=100)

In [9]:
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-3B-Instruct')

In [10]:
msg = [{'role': 'system', 'content': LONG_PROMPT}, 
      {'role': 'user', 'content': 'What is the age of John Doe?'}]
prompt = tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)

In [11]:
get_generation_time(
    llm,
    sampling_params,
    prompt)

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  4.33it/s, est. speed input: 7072.37 toks/s, output: 65.67 toks/s]

Output: According to the table, John Doe is 29 years old.
Generation time: 0.2613101005554199 seconds.





In [12]:
msg = [{'role': 'system', 'content': LONG_PROMPT}, 
      {'role': 'user', 'content': 'What is the age of Zack Blue?'}]
prompt = tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)

In [13]:
get_generation_time(
    llm,
    sampling_params,
    prompt)

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  6.49it/s, est. speed input: 10622.64 toks/s, output: 98.62 toks/s]

Output: According to the table, Zack Blue is 30 years old.
Generation time: 0.1722562313079834 seconds.
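
The second request finishes faster (about 0.17 s versus 0.26 s), which is consistent with the KV cache of the shared table prefix being reused. One way to picture the lookup is block-level hashing, where each block's key covers its own tokens plus everything before it. The sketch below is a toy model under that assumption: `block_hashes`, `count_cached_blocks`, and `BLOCK_SIZE = 4` are hypothetical, and the integer lists merely stand in for tokenized prompts.

```python
import hashlib

BLOCK_SIZE = 4  # illustrative value, not taken from the run above


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block together with the hash of the prefix before it."""
    hashes = []
    parent = b""
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


def count_cached_blocks(cache, token_ids):
    """Count leading blocks already cached, then register all blocks."""
    hashes = block_hashes(token_ids)
    hits = 0
    for h in hashes:
        if h not in cache:
            break
        hits += 1
    cache.update(hashes)
    return hits, len(hashes)


cache = set()
shared_prefix = list(range(12))  # stands in for the tokenized table prompt
first = shared_prefix + [100, 101, 102, 103]   # first question
second = shared_prefix + [200, 201, 202, 203]  # second question

first_hits = count_cached_blocks(cache, first)    # cold cache: no block reused
second_hits = count_cached_blocks(cache, second)  # shared prefix blocks reused
```

Because each hash chains in the previous one, a block only counts as a hit when the entire prefix before it also matches, so the two prompts share blocks up to the point where the questions diverge.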



