## Faster inference - vLLM

vLLM is a library optimized for inference: it provides better GPU utilization and much faster inference than the base transformers library. vLLM offers two usage options:
- [vLLM server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server/): starts an inference server with an OpenAI-compatible API that clients can access over HTTP;
- **Local inference**: creates an `LLM` object that can be used inside the running script. We will use this option today.
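
For completeness, the server option can be sketched as follows. This is a minimal illustration, assuming a server was started separately (for example with `vllm serve Qwen/Qwen2-7B-Instruct`) and listens on the hypothetical address `http://localhost:8000`. The helper names `build_chat_request` and `send_chat_request` are made up for this example and are not part of vLLM. Only the standard library is used, and nothing is sent over the network until `send_chat_request` is called.

```python
import json
import urllib.request

# Assumed address of a locally running vLLM server (adjust to your setup).
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(model_name, prompt, temperature=0.6, max_tokens=1024):
    """Build an OpenAI-style chat completion payload (no network access)."""
    return {
        "model": model_name,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def send_chat_request(payload, base_url=BASE_URL):
    """POST the payload to the chat completions endpoint and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_request("Qwen/Qwen2-7B-Instruct", "Translate to English: Guten Morgen!")
print(json.dumps(payload, indent=2, ensure_ascii=False))
```

With a server running, `send_chat_request(payload)` would return the generated text. The rest of this notebook uses local inference instead.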

Let us first load the model. 4-bit quantization can easily be added to vLLM by setting `quantization="bitsandbytes"`. If we omit this parameter, the model is loaded in standard 16-bit precision. Another important vLLM parameter is `gpu_memory_utilization`. It tells the vLLM engine what fraction of GPU memory it may reserve for the model weights and the KV cache. By default it is set to `0.9`. Since we are using 4-bit quantization, the model weights take only about 5-6 GB of VRAM (the loading log below reports 5.46 GiB), so with the KV cache added we should be fine with about 10 GB of VRAM. Set this parameter according to your GPU's VRAM and the model's memory requirements.
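
The memory reasoning above can be made concrete with a little arithmetic. The sketch below is a back-of-the-envelope estimate, not something vLLM computes for you: the helper names are hypothetical, the parameter count of 7.6 billion is an assumed approximation for a 7B-class model, and the 24 GiB GPU is an example value.

```python
GIB = 1024 ** 3


def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


def utilization_for_budget(budget_gib, total_gib):
    """Fraction of total GPU memory that corresponds to a given budget."""
    if not 0 < budget_gib <= total_gib:
        raise ValueError("budget must be positive and fit on the GPU")
    return budget_gib / total_gib


num_params = 7.6e9  # assumed approximate parameter count of a 7B-class model

print(f"16-bit weights: {weight_memory_gib(num_params, 16):.1f} GiB")
print(f" 4-bit weights: {weight_memory_gib(num_params, 4):.1f} GiB")

# Hypothetical GPU with 24 GiB, of which we want vLLM to use about 10 GiB.
print(f"gpu_memory_utilization ~ {utilization_for_budget(10, 24):.2f}")
```

The measured footprint is typically higher than the pure 4-bit estimate, because some layers (such as embeddings) usually stay in 16-bit and the quantization metadata takes space as well.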

In [None]:
!pip install vllm==0.10.2 bitsandbytes==0.46.1


In [2]:
from vllm import LLM, SamplingParams
import torch

model = LLM(
    "Qwen/Qwen2-7B-Instruct",
    dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization="bitsandbytes",
    gpu_memory_utilization=0.5
)



INFO 12-17 12:11:46 [__init__.py:241] Automatically detected platform cuda.
INFO 12-17 12:11:46 [utils.py:326] non-default args: {'model': '/models/Qwen2-7B-Instruct', 'trust_remote_code': True, 'dtype': torch.bfloat16, 'gpu_memory_utilization': 0.5, 'disable_log_stats': True, 'quantization': 'bitsandbytes'}


The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.


INFO 12-17 12:11:49 [__init__.py:711] Resolved architecture: Qwen2ForCausalLM
INFO 12-17 12:11:49 [__init__.py:1750] Using max model len 32768




INFO 12-17 12:11:51 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_0 pid=280) INFO 12-17 12:11:52 [core.py:636] Waiting for init message from front-end.
(EngineCore_0 pid=280) INFO 12-17 12:11:52 [core.py:74] Initializing a V1 LLM engine (v0.10.1.1+381074ae.nv25.09) with config: model='/models/Qwen2-7B-Instruct', speculative_config=None, tokenizer='/models/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=bitsandbytes, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backe



[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_0 pid=280) INFO 12-17 12:11:53 [cuda.py:328] Using Flash Attention backend on V1 engine.
(EngineCore_0 pid=280) INFO 12-17 12:11:53 [bitsandbytes_loader.py:742] Loading weights with BitsAndBytes quantization. May take a while ...


Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:18<00:55, 18.37s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:38<00:38, 19.29s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:58<00:19, 19.79s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:19<00:00, 20.32s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:19<00:00, 19.96s/it]


(EngineCore_0 pid=280) INFO 12-17 12:13:14 [gpu_model_runner.py:2007] Model loading took 5.4588 GiB and 80.600050 seconds
(EngineCore_0 pid=280) INFO 12-17 12:13:16 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/3792b0e028/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_0 pid=280) INFO 12-17 12:13:16 [backends.py:559] Dynamo bytecode transform time: 2.13 s
(EngineCore_0 pid=280) INFO 12-17 12:13:18 [backends.py:194] Cache the graph for dynamic shape for later use
(EngineCore_0 pid=280) INFO 12-17 12:13:22 [backends.py:215] Compiling a graph for dynamic shape takes 5.49 s
(EngineCore_0 pid=280) INFO 12-17 12:13:24 [monitor.py:34] torch.compile takes 7.62 s in total
(EngineCore_0 pid=280) INFO 12-17 12:13:51 [gpu_worker.py:276] Available KV cache memory: 50.93 GiB
(EngineCore_0 pid=280) INFO 12-17 12:13:51 [kv_cache_utils.py:849] GPU KV cache size

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████| 67/67 [00:13<00:00,  5.13it/s]


(EngineCore_0 pid=280) INFO 12-17 12:14:06 [gpu_model_runner.py:2708] Graph capturing finished in 13 secs, took 1.01 GiB
(EngineCore_0 pid=280) INFO 12-17 12:14:06 [core.py:214] init engine (profile, create kv cache, warmup model) took 52.58 seconds
INFO 12-17 12:14:07 [llm.py:298] Supported_tasks: ['generate']


### Inference function

vLLM uses the same chat format for inference as transformers. There are two ways of using vLLM for response generation:
- `generate`: used for the standard text-completion task (pretrained models). If we wanted to use a chat model this way, we would first have to format the prompt ourselves using the chat template;
- `chat`: better suited for chat models. With this method, we can send the conversation directly to the model, and its tokenizer automatically applies the chat template (the same way transformers handles this when using `pipeline`). Since we are using a chat model, we will use this option today.
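
To illustrate what "applying the chat template" means, the sketch below renders a conversation in ChatML-style markup, which is the style Qwen2 chat models use. It is a simplified, hand-written stand-in: the function name `render_chatml` is hypothetical, and the authoritative template ships with the model's tokenizer. In practice, one would rely on `model.chat` or on the tokenizer's `apply_chat_template` method rather than on a function like this.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a conversation in ChatML-style markup (illustrative only)."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Translate to English: Guten Morgen!"},
]
print(render_chatml(messages))
```

A string like this is what `generate` would need as its prompt for a chat model, whereas `chat` builds it internally from the list of messages.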

**Sampling params**: vLLM uses a dedicated class, `SamplingParams`, for generation parameters such as `temperature`, `top_p`, `top_k`, etc. It also supports guided decoding, which constrains the output to a JSON schema or to answers matching a regular expression (see the last cell in this notebook for a description of structured outputs).
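
As a rough intuition for what `temperature`, `top_k`, and `top_p` do, here is a toy, pure-Python sketch that filters a made-up next-token distribution. It is not vLLM's implementation: the function name, the toy logits, and the convention that `top_k=0` disables the filter are all assumptions made for this example.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return renormalized token probabilities after temperature, top-k and top-p."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [logit / temperature for logit in logits.values()]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    probs = {token: weight / total for token, weight in zip(logits, weights)}

    # Rank tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    # top-k: keep only the k most likely tokens (0 disables the filter here).
    if top_k > 0:
        ranked = ranked[:top_k]
        norm = sum(prob for _, prob in ranked)
        ranked = [(token, prob / norm) for token, prob in ranked]

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for _, prob in kept)
    return {token: prob / norm for token, prob in kept}


toy_logits = {"Haus": 2.0, "Heim": 1.0, "Gebäude": 0.5, "Auto": -1.0}
print(filter_distribution(toy_logits, temperature=0.6, top_k=64, top_p=0.9))
```

Lower temperatures sharpen the distribution, so fewer tokens survive the `top_p` cut. The next token is then sampled from the surviving, renormalized probabilities.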

In [5]:
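def prompt_to_conversation(prompt):
    """Wrap a user prompt in a two-message conversation (system + user)."""
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]

    return messages

def vllm_generate(model, conversations):
    """Generate one response per conversation and return the texts as a list."""
    sampling_params = SamplingParams(
        temperature=0.6,
        top_p=0.9,
        top_k=64,
        max_tokens=1024  # upper bound on generated tokens per response
    )

    # `chat` applies the model's chat template before generation.
    responses = model.chat(conversations, sampling_params)
    predicted_texts = []
    for response in responses:
        # Each response holds a list of candidates; we request one, so take the first.
        prediction = response.outputs[0].text
        predicted_texts.append(prediction)

    return predicted_texts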

### Single example inference

With vLLM we can run both single-example and batch inference. For a single example, we simply pass our conversation as the input.

In [6]:
prompt = "Translate the following text to English & French. Put translations in separate lines. \n\nWenn es darum geht, das eigene Geld zu vermehren oder was fürs Alter anzusparen, ist immer häufiger von ETFs die Rede – kurz für Exchange Traded Funds, also börsengehandelte Indexfonds. Klingt sperrig? Mag sein. Aber einen ersten Sparplan anzulegen, ist unkompliziert. Ein Smartphone reicht. Und um in ETFs zu investieren, benötigen wir weder Startkapital noch großes Vorwissen. Selbst, wer die Altersvorsorge seit Jahrzehnten vor sich herschiebt, kann noch starten."
conversation = prompt_to_conversation(prompt)

predicted_text = vllm_generate(model, conversation)

print(50*"-")
print("Input text:")
print(prompt)
print()
print()
print("Model's response:")
print(predicted_text[0])
print(50*"-")

Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3187.16it/s]
Processed prompts: 100%|█████████████████████████████████████████████████████| 1/1 [00:07<00:00,  7.55s/it, est. speed input: 21.87 toks/s, output: 33.66 toks/s]

--------------------------------------------------
Input text:
Translate the following text to English & French. Put translations in separate lines. 

Wenn es darum geht, das eigene Geld zu vermehren oder was fürs Alter anzusparen, ist immer häufiger von ETFs die Rede – kurz für Exchange Traded Funds, also börsengehandelte Indexfonds. Klingt sperrig? Mag sein. Aber einen ersten Sparplan anzulegen, ist unkompliziert. Ein Smartphone reicht. Und um in ETFs zu investieren, benötigen wir weder Startkapital noch großes Vorwissen. Selbst, wer die Altersvorsorge seit Jahrzehnten vor sich herschiebt, kann noch starten.


Model's response:
English Translation:
When it comes to multiplying one's own money or saving for retirement, ETFs are increasingly being talked about - short for Exchange-Traded Funds, or exchange-traded index funds. Sounds daunting? Maybe it does. But setting up a first savings plan is straightforward. All you need is a smartphone. And to invest in ETFs, we neither need a sta




### Batch inference

Batch inference is very simple with vLLM. All we need to do is pass a list of conversations. Be careful: each conversation is itself a list of messages, so the input must be a list of lists.
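
The difference between the two input shapes can be checked with a small helper. This is only a sketch: `as_batch` is a hypothetical name, and it assumes that messages are plain dictionaries as in `prompt_to_conversation`.

```python
def as_batch(conversations):
    """Wrap a single conversation so the result is always a list of conversations."""
    if not conversations:
        raise ValueError("expected at least one message")
    # A single conversation is a list of message dicts; a batch is a list of such lists.
    if isinstance(conversations[0], dict):
        return [conversations]
    return list(conversations)


single = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Translate to English: Guten Morgen!"},
]
batch = [single, single]

print(len(as_batch(single)))  # one conversation
print(len(as_batch(batch)))   # two conversations
```

Normalizing the input this way lets downstream code always iterate over conversations, regardless of whether one or many were passed in.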

In [9]:
prompts = [
    "Translate the following text to English & French. Put translations in separate lines. \n\nWenn es darum geht, das eigene Geld zu vermehren oder was fürs Alter anzusparen, ist immer häufiger von ETFs die Rede – kurz für Exchange Traded Funds, also börsengehandelte Indexfonds. Klingt sperrig? Mag sein. Aber einen ersten Sparplan anzulegen, ist unkompliziert. Ein Smartphone reicht. Und um in ETFs zu investieren, benötigen wir weder Startkapital noch großes Vorwissen. Selbst, wer die Altersvorsorge seit Jahrzehnten vor sich herschiebt, kann noch starten.",
    "Translate to English:\n\nÄußerlich kann Levemir InnoLet durch Abwischen mit einem medizinischen Tupfer gereinigt werden."
]
conversations = [prompt_to_conversation(prompt) for prompt in prompts]

predicted_texts = vllm_generate(model, conversations)

for prompt, predicted_text in zip(prompts, predicted_texts):
    print(50*"-")
    print("Input text:")
    print(prompt)
    print()
    print()
    print("Model's response:")
    print(predicted_text)
    print(50*"-")

Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 4826.59it/s]
Processed prompts: 100%|█████████████████████████████████████████████████████| 2/2 [00:09<00:00,  4.57s/it, est. speed input: 23.42 toks/s, output: 28.78 toks/s]

--------------------------------------------------
Input text:
Translate the following text to English & French. Put translations in separate lines. 

Wenn es darum geht, das eigene Geld zu vermehren oder was fürs Alter anzusparen, ist immer häufiger von ETFs die Rede – kurz für Exchange Traded Funds, also börsengehandelte Indexfonds. Klingt sperrig? Mag sein. Aber einen ersten Sparplan anzulegen, ist unkompliziert. Ein Smartphone reicht. Und um in ETFs zu investieren, benötigen wir weder Startkapital noch großes Vorwissen. Selbst, wer die Altersvorsorge seit Jahrzehnten vor sich herschiebt, kann noch starten.


Model's response:
English:
When it comes to multiplying one's own capital or saving for retirement, ETFs are increasingly being talked about - short for Exchange Traded Funds, or stock-traded index funds. Sounds daunting? Maybe. But setting up a first savings plan is straightforward. All you need is a smartphone. And to invest in ETFs, we neither need a starting capital nor ext


