In [1]:
from IPython.display import Image

https://www.youtube.com/watch?v=hMs8VNRy5Ys
- nvidia
    - https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
    - https://res.cloudinary.com/dyd911kmh/image/upload/v1713882586/Marketing/webinars/Slides/dataacamp-llm-inference-webinar.pdf

- decoder-only inference
    - GPT-like models
    - No encoder, no encoder-decoder multi-head attention
    - input processing (aka **prefill**): highly parallel
        - the input (the tokenized prompt) is embedded and encoded
        - multi-head attention (MHA) computes the keys and values (KV) for every prompt token
        - large matrix multiplication, high usage of the hardware accelerator
    - output generation: sequential
        - the answer is generated **one token** at a time.
        - each generated token is **appended** to the previous input
        - the process repeats until a stopping criterion is met
            - max length or EOS
        - low usage of the hardware accelerator
- initial prompt processing (Prefill):
    - The phase where the model gets to understand the input prompt
- Decode Phase – Generating one token at a time
    - Decoding is sped up by the KV cache: the keys and values of earlier tokens are reused instead of being recomputed at every step
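
The two phases can be sketched with a toy loop. This is a minimal illustration only: `toy_forward` is a hypothetical stand-in for a real model forward pass, and the "cache" is a plain list rather than real key/value tensors. It shows that prefill handles the whole prompt in one call, while decode handles one token per call and reuses the cache.

```python
def toy_forward(new_tokens, kv_cache):
    """Hypothetical stand-in for one forward pass (not a real model).

    Stores one cache entry per new token and returns a deterministic
    "next token" computed from everything seen so far.
    """
    kv_cache.extend(new_tokens)          # pretend we store K/V for each new token
    return (sum(kv_cache) + 1) % 10      # toy replacement for argmax over logits


def generate(prompt_ids, max_new_tokens, eos_id):
    kv_cache = []
    tokens_per_call = []                 # how many tokens each forward call processed

    # Prefill: the whole prompt goes through the model in a single call.
    next_id = toy_forward(prompt_ids, kv_cache)
    tokens_per_call.append(len(prompt_ids))
    output = [next_id]

    # Decode: one token per call, until max length or EOS.
    while len(output) < max_new_tokens and next_id != eos_id:
        next_id = toy_forward([next_id], kv_cache)
        tokens_per_call.append(1)
        output.append(next_id)
    return output, tokens_per_call


output, tokens_per_call = generate([1, 2, 3], max_new_tokens=4, eos_id=0)
print(output)           # [7, 4, 8, 6]
print(tokens_per_call)  # [3, 1, 1, 1]
```

The `tokens_per_call` list makes the asymmetry visible: one large call for the prompt, then many small sequential calls, which is why decode tends to underuse the accelerator.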

In [2]:
Image(url='./imgs/decoder-only.png', width=400)

### measurement

- TTFT: time to first token
- Inter-token latency: time between consecutive output tokens
- Total generation time
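
As a rough sketch, all three metrics can be derived from the arrival time of each output token. The function name `latency_metrics` and its arguments are assumptions made for this example, and inter-token latency is taken here as the mean gap between consecutive tokens.

```python
def latency_metrics(request_start, token_times):
    """Compute latency metrics from per-token arrival timestamps (seconds).

    request_start: time the request was sent.
    token_times:   time at which each generated token arrived, in order.
    """
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - request_start
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    inter_token = sum(gaps) / len(gaps) if gaps else 0.0
    total = token_times[-1] - request_start
    return {"ttft": ttft, "inter_token_latency": inter_token, "total": total}


metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(metrics)  # {'ttft': 0.5, 'inter_token_latency': 0.25, 'total': 1.25}
```

TTFT is dominated by prefill, while inter-token latency reflects the decode phase, so the two usually respond to different optimizations.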

### kv cache

- https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
- https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/transformers-neuronx/generative-llm-inference-with-neuron.html
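- cache size in bytes (fp16)
    - `2*2*bs*seq_length*num_layers*d_model`
    - the first `2` counts keys and values, the second `2` is the number of bytes per fp16 value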
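
A quick worked example of the formula. The helper name `kv_cache_bytes` and the dimensions below (32 layers, `d_model` of 4096, 4096 tokens, batch size 1) are illustrative assumptions, roughly the shape of a 7B-parameter model. The formula assumes standard multi-head attention; variants that share key/value heads need less memory.

```python
def kv_cache_bytes(batch_size, seq_length, num_layers, d_model, bytes_per_value=2):
    """KV cache size in bytes: 2 (K and V) * bytes per value * tokens * layers * width."""
    return 2 * bytes_per_value * batch_size * seq_length * num_layers * d_model


size = kv_cache_bytes(batch_size=1, seq_length=4096, num_layers=32, d_model=4096)
print(size)               # 2147483648
print(size / 2**30)       # 2.0 (GiB)
```

The size grows linearly with both batch size and sequence length, which is why the cache, rather than the weights, often limits how many requests fit on one device.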

In [2]:
Image(url='https://developer-blogs.nvidia.com/wp-content/uploads/2023/11/key-value-caching_.png', width=400)

In [3]:
Image(url='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/_images/masked-self-attention-operator.png', width=400)

In [4]:
Image(url='https://awsdocs-neuron.readthedocs-hosted.com/en/latest/_images/kv-cache-optimization.png', width=400)

### continuous batching

- https://www.anyscale.com/blog/continuous-batching-llm-inference
- Decoder-only inference requests are harder to batch than requests to traditional Transformers
- Input and output lengths can vary greatly, leading to very different generation times
- traditional (static) batching waits for every request in the batch to complete before starting new ones
- continuous batching evicts completed requests after each iteration and admits waiting requests into the freed slots
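
The difference can be illustrated with a small simulation that counts decode iterations. This is a simplified sketch under stated assumptions: each request is reduced to the number of tokens it needs to generate, prefill cost is ignored, and the function names are made up for this example.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Finished requests leave after each step; waiting requests fill free slots."""
    queue = deque(lengths)
    active = []                           # remaining tokens for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1                        # one decode iteration for all active requests
        active = [r - 1 for r in active if r - 1 > 0]
    return steps


lengths = [2, 8, 3, 5, 1, 4]              # tokens to generate per request
print(static_batching_steps(lengths, batch_size=2))      # 17
print(continuous_batching_steps(lengths, batch_size=2))  # 13
```

In the static case short requests sit idle while the longest one in their batch finishes; in the continuous case those idle slots are handed to waiting requests.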

In [5]:
Image(url='https://images.ctfassets.net/xjan103pcp94/1LJioEsEdQQpDCxYNWirU6/82b9fbfc5b78b10c1d4508b60e72fdcf/cb_02_diagram-static-batching.png', width=400)

In [6]:
Image(url='https://images.ctfassets.net/xjan103pcp94/744TAv4dJIQqeHcEaz5lko/b823cc2d92bbb0d82eb252901e1dce6d/cb_03_diagram-continuous-batching.png', width=400)

### speculative decoding

- https://huggingface.co/blog/assisted-generation
- https://github.com/huggingface/transformers/blob/849367ccf741d8c58aa88ccfe1d52d8636eaf2b7/src/transformers/generation/utils.py#L4064
- the two models must share the same tokenizer
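
The idea can be sketched for the greedy case with toy models. This is an illustration under assumptions, not the transformers implementation: `draft_next` and `target_next` are hypothetical functions that map a token list to the next token, and the verification loop stands in for what a real model would do in a single batched forward pass.

```python
def speculative_step(tokens, draft_next, target_next, num_candidates):
    """One round of greedy speculative decoding; returns the newly accepted tokens."""
    # 1. The small draft model proposes candidates autoregressively.
    draft, ctx = [], list(tokens)
    for _ in range(num_candidates):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. The large target model checks each candidate position.
    accepted, ctx = [], list(tokens)
    for t in draft:
        expected = target_next(ctx)
        if expected != t:
            accepted.append(expected)     # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_next(ctx))     # all candidates matched: one extra token
    return accepted


def generate(tokens, draft_next, target_next, max_new_tokens, num_candidates=4):
    out = list(tokens)
    while len(out) - len(tokens) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, num_candidates))
    return out[:len(tokens) + max_new_tokens]


# Toy models: the target counts modulo 5; the draft agrees except after token 3.
target_next = lambda ctx: (ctx[-1] + 1) % 5
draft_next = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 5

print(speculative_step([0], draft_next, target_next, 4))  # [1, 2, 3, 4]
print(generate([0], draft_next, target_next, 7))          # [0, 1, 2, 3, 4, 0, 1, 2]
```

Because every accepted token is one the target model would have chosen itself, the greedy output is identical to decoding with the target model alone; the gain comes from emitting several tokens per target pass when the draft guesses well.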

In [7]:
import os
os.environ['http_proxy'] = 'http://127.0.0.1:7890'
os.environ['https_proxy'] = 'http://127.0.0.1:7890'

In [8]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

prompt = "Alice and Bob"
checkpoint = "EleutherAI/pythia-1.4b-deduped"
assistant_checkpoint = "EleutherAI/pythia-160m-deduped"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(prompt, return_tensors="pt").to(device)

model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint).to(device)
outputs = model.generate(**inputs, assistant_model=assistant_model)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
# ['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']


Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
From v4.47 onwards, when a model cache is to be returned, `generate` will return a `Cache` instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set `return_legacy_cache=True`.
['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']


### speculative decoding: n-grams

- https://github.com/apoorvumang/prompt-lookup-decoding
    - prompt lookup algorithm
- https://twitter.com/joao_gante/status/1747322418425741550
- Input-grounded tasks (summarization, document QA, multi-turn chat, code editing):
    - **high n-gram overlap** between the input (prompt) and the generated output
- We can use strings present in the prompt to generate candidate token sequences
- Significant speedups (2x-4x), without model modification and with no effect on output quality
- Implemented in the transformers library
```
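# `inputs` is the tokenizer output (a dict with `input_ids` and `attention_mask`);
# `streamer` is assumed to be defined already, e.g. `TextStreamer(tokenizer)`.
generation_output = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=512,
    streamer=streamer,
    prompt_lookup_num_tokens=10,  # number of candidate tokens taken from the prompt
)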
```
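
The lookup step itself can be sketched in a few lines. This is a simplified illustration, not the code from the linked repository or from transformers: the function name `find_candidate_tokens` and its defaults are assumptions, and token IDs are plain Python lists. It matches the last few tokens against earlier positions in the sequence and proposes whatever followed the match as candidates.

```python
def find_candidate_tokens(input_ids, max_ngram_size=3, num_pred_tokens=10):
    """Propose candidate tokens by matching the trailing n-gram earlier in the sequence."""
    n = len(input_ids)
    # Prefer longer n-grams: a longer match is a stronger hint.
    for ngram_size in range(max_ngram_size, 0, -1):
        if ngram_size >= n:
            continue
        ngram = input_ids[-ngram_size:]
        # Scan earlier positions, most recent first, skipping the trailing n-gram itself.
        for start in range(n - ngram_size - 1, -1, -1):
            if input_ids[start:start + ngram_size] == ngram:
                end = start + ngram_size
                candidates = input_ids[end:end + num_pred_tokens]
                if candidates:
                    return candidates
    return []                             # no match: fall back to normal decoding


# The sequence ends with [5, 6, 7], which also appears at the start followed by 8, 9.
print(find_candidate_tokens([5, 6, 7, 8, 9, 1, 2, 5, 6, 7], num_pred_tokens=2))  # [8, 9]
print(find_candidate_tokens([1, 2, 3, 4]))                                       # []
```

The candidates are then verified by the model in the same way as drafts from an assistant model, so a wrong guess costs little and never changes the output.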