## Faster inference - vLLM

vLLM is a library optimized for inference: it provides better GPU utilization and much faster inference than the base transformers library. vLLM offers two usage options:
- [vLLM server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server/): creates an inference server (compatible with the OpenAI client library) that can be used for API access;
- **Local inference**: creates an `LLM` object that can be used inside the running script. We will use this option today.
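
For comparison with the local route used in this notebook, a request to the server option could look roughly like the sketch below. It is not part of the original notebook and rests on assumptions: a server was started separately (for example with `vllm serve Qwen/Qwen2-7B-Instruct`) and listens on the default address `http://localhost:8000`. The helper name `build_chat_request` is hypothetical. The block only builds the request, so it runs without a server.

```python
import json
import urllib.request

# Assumed address of a separately started vLLM server (default port 8000).
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt, model="Qwen/Qwen2-7B-Instruct", base_url=BASE_URL):
    """Build an OpenAI-style chat completion request (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.6,
        "max_tokens": 256,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("Translate to English: Guten Morgen!")
print(request.full_url)

# With a server running, the reply could be read like this:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```

The same payload shape is what the official OpenAI client would send, which is why that client can be pointed at a vLLM server by changing only its base URL.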

Let us first load the model. 4-bit quantization can be added to vLLM easily by setting `quantization="bitsandbytes"`. If we omit this parameter, the model is loaded in standard 16-bit precision. Another important vLLM parameter is `gpu_memory_utilization`. It tells the vLLM engine what fraction of GPU memory it may reserve for model weights, activations and the KV cache. By default it is set to `0.9`. Since we are using 4-bit quantization, the model weights take about 5.5 GiB of vRAM in the run logged below, so together with the KV cache roughly 10 GB of vRAM should be enough. Set this parameter according to your GPU vRAM and the memory requirements of your model.
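
To make the effect of `gpu_memory_utilization` concrete, here is a small arithmetic sketch. It uses the figures reported in the load log further down in this notebook (14.74 GiB GPU, 5.46 GiB weights, 1.40 GiB activation peak, 0.05 GiB non-torch memory); the function name `kv_cache_budget_gib` is hypothetical and the calculation is a simplified reading of that log, not vLLM's internal code.

```python
def kv_cache_budget_gib(total_gib, gpu_memory_utilization, weights_gib,
                        activation_peak_gib, non_torch_gib):
    """Memory left for the KV cache once the other consumers are subtracted."""
    usable_gib = total_gib * gpu_memory_utilization
    return usable_gib - weights_gib - activation_peak_gib - non_torch_gib


# Figures as reported in this notebook's load log.
budget = kv_cache_budget_gib(
    total_gib=14.74,
    gpu_memory_utilization=0.9,
    weights_gib=5.46,
    activation_peak_gib=1.40,
    non_torch_gib=0.05,
)
print(f"KV cache budget: {budget:.2f} GiB")  # prints "KV cache budget: 6.36 GiB"
```

Lowering `gpu_memory_utilization` shrinks only the KV cache share, because the weights and activation peak stay the same. If the result turns negative, the setting is too low for the model to fit.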

In [None]:
!pip install vllm==0.10.2 bitsandbytes==0.46.1 transformers==4.57.6


In [1]:
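import os

# Select the V0 engine; set the variable before importing vLLM so it is in place when the engine starts.
os.environ["VLLM_USE_V1"] = "0"

import torch
from vllm import LLM, SamplingParams

torch.cuda.empty_cache()  # release cached GPU memory left over from earlier cells
model = LLM(
    "Qwen/Qwen2-7B-Instruct",
    dtype=torch.float16,
    trust_remote_code=True,
    quantization="bitsandbytes",  # load the weights in 4-bit precision
    gpu_memory_utilization=0.9,   # fraction of GPU memory vLLM may reserve
    max_model_len=1024,           # maximum context length (prompt + generated tokens)
)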

INFO 01-13 11:20:10 [__init__.py:216] Automatically detected platform cuda.
INFO 01-13 11:20:12 [utils.py:328] non-default args: {'trust_remote_code': True, 'dtype': torch.float16, 'max_model_len': 1024, 'disable_log_stats': True, 'quantization': 'bitsandbytes', 'model': 'Qwen/Qwen2-7B-Instruct'}


The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.


config.json:   0%|          | 0.00/663 [00:00<?, ?B/s]

INFO 01-13 11:20:36 [__init__.py:742] Resolved architecture: Qwen2ForCausalLM


`torch_dtype` is deprecated! Use `dtype` instead!


INFO 01-13 11:20:36 [__init__.py:1815] Using max model len 1024
INFO 01-13 11:20:41 [llm_engine.py:221] Initializing a V0 LLM engine (v0.10.2) with config: model='Qwen/Qwen2-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=bitsandbytes, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=Qwen/Qwen2-7B-Instruct, enable_pre

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

generation_config.json:   0%|          | 0.00/243 [00:00<?, ?B/s]

INFO 01-13 11:20:43 [cuda.py:408] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 01-13 11:20:43 [cuda.py:453] Using XFormers backend.
INFO 01-13 11:20:44 [parallel_state.py:1165] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 01-13 11:20:44 [model_runner.py:1051] Starting to load model Qwen/Qwen2-7B-Instruct...
INFO 01-13 11:20:45 [bitsandbytes_loader.py:758] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 01-13 11:20:45 [weight_utils.py:348] Using model weights format ['*.safetensors']


model-00001-of-00004.safetensors:   0%|          | 0.00/3.95G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/3.86G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/3.86G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/3.56G [00:00<?, ?B/s]

INFO 01-13 11:24:08 [weight_utils.py:369] Time spent downloading weights for Qwen/Qwen2-7B-Instruct: 202.890220 seconds


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


INFO 01-13 11:25:12 [model_runner.py:1083] Model loading took 5.4588 GiB and 266.431915 seconds
INFO 01-13 11:25:24 [worker.py:290] Memory profiling takes 11.39 seconds
INFO 01-13 11:25:24 [worker.py:290] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.90) = 13.27GiB
INFO 01-13 11:25:24 [worker.py:290] model weights take 5.46GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.40GiB; the rest of the memory reserved for KV Cache is 6.36GiB.
INFO 01-13 11:25:24 [executor_base.py:114] # cuda blocks: 7442, # CPU blocks: 4681
INFO 01-13 11:25:24 [executor_base.py:119] Maximum concurrency for 1024 tokens per request: 116.28x
INFO 01-13 11:25:29 [model_runner.py:1355] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider de

Capturing CUDA graph shapes:   0%|          | 0/35 [00:00<?, ?it/s]

INFO 01-13 11:27:16 [model_runner.py:1507] Graph capturing finished in 107 secs, took 0.67 GiB
INFO 01-13 11:27:16 [worker.py:467] Free memory on device (14.64/14.74 GiB) on startup. Desired GPU memory utilization is (0.9, 13.27 GiB). Actual usage is 5.46 GiB for weight, 1.4 GiB for peak activation, 0.05 GiB for non-torch memory, and 0.67 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=5949911142` to fit into requested memory, or `--kv-cache-memory=7424805376` to fully utilize gpu memory. Current kv cache memory in use is 6828617830 bytes.
INFO 01-13 11:27:16 [llm_engine.py:420] init engine (profile, create kv cache, warmup model) took 124.15 seconds
INFO 01-13 11:27:16 [llm.py:295] Supported_tasks: ['generate']
INFO 01-13 11:27:16 [__init__.py:36] No IOProcessor plugins requested by the model
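
The block count and the concurrency figure in the log above can be cross-checked by hand. The sketch below is a back-of-the-envelope calculation, not vLLM code. The attention shape of Qwen2-7B (28 layers, 4 KV heads, head dimension 128) and the block size of 16 tokens are assumptions that do not appear in this notebook; only the byte count, the context length and the expected results come from the log.

```python
# Assumed Qwen2-7B attention shape (not stated in this notebook).
NUM_LAYERS = 28
NUM_KV_HEADS = 4
HEAD_DIM = 128
BYTES_PER_VALUE = 2      # float16 KV cache
BLOCK_SIZE_TOKENS = 16   # assumed number of tokens per KV cache block


def kv_bytes_per_token():
    # Keys and values (factor 2) for every layer and every KV head.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def num_kv_blocks(kv_cache_bytes):
    # Only whole blocks can be allocated, hence the floor division.
    return kv_cache_bytes // (kv_bytes_per_token() * BLOCK_SIZE_TOKENS)


def max_concurrency(blocks, max_model_len):
    # How many full-length requests fit into the cache at the same time.
    return blocks * BLOCK_SIZE_TOKENS / max_model_len


blocks = num_kv_blocks(6828617830)  # "Current kv cache memory in use" from the log
print(blocks)                                   # 7442
print(round(max_concurrency(blocks, 1024), 2))  # 116.28
```

Both values agree with the `# cuda blocks: 7442` and `Maximum concurrency ... 116.28x` lines, which suggests the assumed shape is the one in use here.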


### Inference function

vLLM uses the same chat format for inference as transformers. There are two ways to generate responses with vLLM:
- `generate`: used for standard text completion (pretrained models). To use a chat model this way, we would first have to format the prompt with the chat template ourselves;
- `chat`: better suited for chat models. With this method we send the conversation directly to the model, and its tokenizer applies the chat template automatically (the same way transformers handles it when using `pipeline`). Since we are using a chat model, we will use this option today.
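
To illustrate what "applying the chat template" means, the sketch below renders a conversation into a single prompt string. It assumes a ChatML-style template, which is what Qwen2 chat models are understood to use. The real template lives in the tokenizer configuration and is applied by the tokenizer itself, so treat the function `render_chatml` as a hypothetical approximation rather than the exact output.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a conversation in ChatML style (approximation of a chat template)."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model continues with its answer.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Translate to English: Guten Morgen!"},
]
print(render_chatml(conversation))
```

With `generate` we would have to produce such a string ourselves before calling the model; with `chat` this step happens behind the scenes.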

**Sampling params**: vLLM uses a dedicated class, `SamplingParams`, for generation parameters such as temperature, top_p, top_k, etc. It also supports guided decoding (enforcing a JSON schema or answers that match a regular expression; see the last cell in this notebook for a description of structured outputs).
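
The following toy example sketches what temperature, top_k and top_p do to a next-token distribution. It is a simplified illustration of the general idea on a four-token vocabulary, not vLLM's implementation; the function name `filter_distribution` and the logits are made up for the example.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_logit = max(scaled.values())
    weights = {tok: math.exp(value - max_logit) for tok, value in scaled.items()}
    ranked = sorted(((w, tok) for tok, w in weights.items()), reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    total = sum(w for w, _ in ranked)
    ranked = [(w / total, tok) for w, tok in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, tok in ranked:
        kept.append((prob, tok))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for prob, _ in kept)
    return {tok: prob / norm for prob, tok in kept}


toy_logits = {"cat": 2.0, "dog": 1.0, "fish": 0.0, "bird": -1.0}
filtered = filter_distribution(toy_logits, temperature=0.6, top_k=3, top_p=0.9)
print(sorted(filtered))  # ['cat', 'dog']
```

The next token is then drawn at random from the surviving tokens according to their renormalized probabilities.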

In [2]:
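def prompt_to_conversation(prompt):
    """Wrap a plain prompt into a system + user conversation."""
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]

    return messages

def vllm_generate(model, conversations):
    """Run chat inference and return one generated text per conversation."""
    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.9,
        top_k=64,
        max_tokens=1024,
    )

    # `chat` applies the model's chat template before generation.
    responses = model.chat(conversations, sampling_params)
    predicted_texts = []
    for response in responses:
        # Each response holds one output here, since a single sample is requested.
        prediction = response.outputs[0].text
        predicted_texts.append(prediction)

    return predicted_texts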

### Single example inference

With vLLM we can run both single-example and batch inference. For a single example, we simply pass our conversation as the input.

In [3]:
prompt = "Translate the following text to English & French. Put translations in separate lines. \n\nWenn es darum geht, das eigene Geld zu vermehren oder was fürs Alter anzusparen, ist immer häufiger von ETFs die Rede – kurz für Exchange Traded Funds, also börsengehandelte Indexfonds. Klingt sperrig? Mag sein. Aber einen ersten Sparplan anzulegen, ist unkompliziert. Ein Smartphone reicht. Und um in ETFs zu investieren, benötigen wir weder Startkapital noch großes Vorwissen. Selbst, wer die Altersvorsorge seit Jahrzehnten vor sich herschiebt, kann noch starten."
conversation = prompt_to_conversation(prompt)

predicted_text = vllm_generate(model, conversation)

print(50*"-")
print("Input text:")
print(prompt)
print()
print()
print("Model's response:")
print(predicted_text[0])
print(50*"-")

INFO 01-13 11:28:47 [chat_utils.py:538] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.



--------------------------------------------------
Input text:
Translate the following text to English & French. Put translations in separate lines. 

Wenn es darum geht, das eigene Geld zu vermehren oder was fürs Alter anzusparen, ist immer häufiger von ETFs die Rede – kurz für Exchange Traded Funds, also börsengehandelte Indexfonds. Klingt sperrig? Mag sein. Aber einen ersten Sparplan anzulegen, ist unkompliziert. Ein Smartphone reicht. Und um in ETFs zu investieren, benötigen wir weder Startkapital noch großes Vorwissen. Selbst, wer die Altersvorsorge seit Jahrzehnten vor sich herschiebt, kann noch starten.


Model's response:
English:
When it comes to multiplying one's own capital or saving for retirement, it is increasingly being discussed - short for Exchange Traded Funds, or stock exchange traded index funds. Sounds daunting? Perhaps it does. But setting up a first savings plan is straightforward. All you need is a smartphone. And to invest in ETFs, we neither need start capital

### Batch inference

Batch inference is very simple with vLLM. All we need to do is provide a list of conversations. Be careful: each conversation is itself a list, so we need to provide a list of lists.
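
The difference between the two input shapes can be shown without running the model. The sketch below redefines `prompt_to_conversation` so that it is self-contained and adds a hypothetical helper, `as_batch`, which normalizes either shape into a list of conversations; the helper is an illustration and is not used elsewhere in this notebook.

```python
def prompt_to_conversation(prompt):
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]


def as_batch(conversations):
    """Return a list of conversations, wrapping a single conversation if needed."""
    # A single conversation is a list of message dicts; a batch is a list of such lists.
    if conversations and isinstance(conversations[0], dict):
        return [conversations]
    return list(conversations)


single = prompt_to_conversation("Translate to English: Guten Morgen!")
batch = [prompt_to_conversation(p) for p in ["Hallo!", "Danke!"]]

print(len(as_batch(single)))  # 1
print(len(as_batch(batch)))   # 2
```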

In [4]:
prompts = [
    "Translate the following text to English & French. Put translations in separate lines. \n\nWenn es darum geht, das eigene Geld zu vermehren oder was fürs Alter anzusparen, ist immer häufiger von ETFs die Rede – kurz für Exchange Traded Funds, also börsengehandelte Indexfonds. Klingt sperrig? Mag sein. Aber einen ersten Sparplan anzulegen, ist unkompliziert. Ein Smartphone reicht. Und um in ETFs zu investieren, benötigen wir weder Startkapital noch großes Vorwissen. Selbst, wer die Altersvorsorge seit Jahrzehnten vor sich herschiebt, kann noch starten.",
    "Translate to English:\n\nÄußerlich kann Levemir InnoLet durch Abwischen mit einem medizinischen Tupfer gereinigt werden."
]
conversations = [prompt_to_conversation(prompt) for prompt in prompts]

predicted_texts = vllm_generate(model, conversations)

for prompt, predicted_text in zip(prompts, predicted_texts):
    print(50*"-")
    print("Input text:")
    print(prompt)
    print()
    print()
    print("Model's response:")
    print(predicted_text)
    print(50*"-")


--------------------------------------------------
Input text:
Translate the following text to English & French. Put translations in separate lines. 

Wenn es darum geht, das eigene Geld zu vermehren oder was fürs Alter anzusparen, ist immer häufiger von ETFs die Rede – kurz für Exchange Traded Funds, also börsengehandelte Indexfonds. Klingt sperrig? Mag sein. Aber einen ersten Sparplan anzulegen, ist unkompliziert. Ein Smartphone reicht. Und um in ETFs zu investieren, benötigen wir weder Startkapital noch großes Vorwissen. Selbst, wer die Altersvorsorge seit Jahrzehnten vor sich herschiebt, kann noch starten.


Model's response:
English:
When it comes to multiplying your own capital or saving for retirement, ETFs are increasingly being discussed - short for Exchange Traded Funds, or stock exchange traded index funds. Sounds daunting? Maybe. But setting up a first savings plan is straightforward. All you need is a smartphone. And to invest in ETFs, we don't need any start-up capital or