In [4]:
# install necessary libraries
!pip install vllm==0.10.2 bitsandbytes==0.46.1 transformers==4.57.6



## Exercise 4: Advanced Inference — Comparing Hugging Face and vLLM Performance

This exercise is intended for **advanced users** who want to understand performance trade-offs between different LLM inference frameworks. You will compare standard Hugging Face–based inference with **vLLM**, a high-throughput inference engine designed for efficient serving of large language models.

The focus is on **speed, throughput, and resource utilization**, rather than model quality.

---

### Setup

- Use the **same model** for both frameworks (e.g. GaMS or another comparable open model).
- Ensure identical hardware and similar settings where possible.
- Disable unnecessary logging and debugging outputs to avoid skewing timings.
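
One way to keep the timing method identical for both frameworks is a small reusable timer. The sketch below is an illustration only: the helper name `timed` and the `timings` dictionary are assumptions, not part of the exercise. It uses `time.perf_counter()`, a monotonic clock suited to measuring intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("example", timings):
    sum(range(100_000))  # stand-in for a generation call

print(f"example: {timings['example']:.4f}s")
```

Wrapping only the generation call, and not the printing of results, keeps console output from affecting the measurement.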

---

### Tasks

1. **Select an appropriate dataset (at least 1000 records) from https://huggingface.co/datasets and download it.**
   - You can also use a dataset provided in one of the other notebooks.

2. **Baseline: Hugging Face inference**
   - Load an appropriate model using the Hugging Face `transformers` library.
   - Run inference for a fixed set of prompts (max. 10) from the selected dataset.
   - Measure total generation time.

3. **vLLM inference**
   - Load the same model using vLLM. Consult the notebook on faster inference with vLLM.
   - Use the same prompts and generation parameters.
   - Measure total generation time.

4. **Results comparison**
   - Create a small table comparing both libraries.

---


#### **Task 1**

Selected dataset: **Slovenian LLM evaluation dataset** https://huggingface.co/datasets/cjvt/slovenian-llm-eval

**NOTE:** We use only the first 5 examples from the selected dataset to keep the runtime short.


In [5]:
from datasets import load_dataset

# load the dataset
dataset = load_dataset("cjvt/slovenian-llm-eval", "arc_challenge", split="test")

# get first 5 examples
examples = [dataset[i] for i in range(5)]

# each example has: id, query, choices, gold index (index of the correct choice)
for ex in examples:
    print(f"Query: {ex['query']}")
    print(f"Choices: {ex['choices']}")
    print(f"Gold index: {ex['gold']}")
    print()

Query: Vprašanje: Astronom opazi, da planet po trku z meteoritom rotira hitreje. Kateri je najverjetnejši učinek te povečane rotacije?
odgovor:
Choices: ['Planetarna gostota se bo zmanjšala.', 'Planetarna leta bodo postala daljša.', 'Planetarni dnevi bodo postali krajši.', 'Planetarna gravitacija bo postala močnejša.']
Gold index: 2

Query: Vprašanje: Skupina inženirjev je želela izvedeti, kako bi se različne zasnove zgradb odzvale med potresom. Izdelali so več modelov zgradb in vsako testirali za njeno sposobnost prenašanja potresnih razmer. Kateri rezultat je najbolj verjeten po testiranju različnih zasnov zgradb?
odgovor:
Choices: ['stavbe bodo zgrajene hitreje', 'stavbe bodo varnejše', 'zasnove zgradb bodo videti lepše', 'gradbeni materiali bodo cenejši']
Gold index: 1

Query: Vprašanje: Končni rezultat v procesu fotosinteze je proizvodnja sladkorja in kisika. Kateri korak označuje začetek fotosinteze?
odgovor:
Choices: ['Kemična energija se absorbira skozi korenine.', 'Svetlobna e

#### **Task 2**

Baseline inference with the GaMS-2B-Instruct model.

First we load the model with selected parameters and create a function for answer generation.





In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" # the device to move the tokenized inputs onto

# initialize the model in half precision; device_map="auto" places the weights on the GPU
model = AutoModelForCausalLM.from_pretrained(
    "cjvt/GaMS-2B-Instruct",
    dtype=torch.float16,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("cjvt/GaMS-2B-Instruct")

# create function that wraps a prompt in a single-turn chat message list
def prompt_to_message(prompt):
    messages = [
        {"role": "user", "content": prompt}
    ]

    return messages


# create a function that returns the answer from the baseline model
def baseline_generate(messages, temp=0.6):

    # render the chat template into a prompt string
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    # NOTE: do_sample is not set, so generate() decodes greedily and ignores
    # temperature, top_p and top_k (see the warning printed by the next cell)
    generated_ids = model.generate(
        **model_inputs,  # passes input_ids together with the attention mask
        temperature=temp,
        top_p=0.9,
        top_k=64,
        max_new_tokens=512
    )

    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]

    # get response
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

    return response


We run inference on the first 5 examples from the dataset.

There are two options:
1.   Use only the query from each example and the model generates the answer by itself,
2.   Use existing answers from the dataset and the model chooses the correct answer.
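
Both options call `baseline_generate`, once with the default temperature of 0.6 and once with temperature 0. In `transformers`, `generate()` applies `temperature`, `top_p` and `top_k` only when `do_sample=True`; otherwise it decodes greedily, which is what the warning printed by the open-ended cell indicates. The following is a minimal sketch of how the arguments could be chosen explicitly. The helper name `build_generation_kwargs` is hypothetical, and using it would make the 0.6 call sample, unlike the recorded run.

```python
def build_generation_kwargs(temp, top_p=0.9, top_k=64, max_new_tokens=512):
    """Return keyword arguments for generate(): sampling if temp > 0, else greedy."""
    kwargs = {"max_new_tokens": max_new_tokens}
    if temp > 0:
        # sampling parameters are only meaningful together with do_sample=True
        kwargs.update(do_sample=True, temperature=temp, top_p=top_p, top_k=top_k)
    else:
        # temperature 0 is treated as a request for deterministic greedy decoding
        kwargs["do_sample"] = False
    return kwargs


print(build_generation_kwargs(0.6))
print(build_generation_kwargs(0))
```

Under these assumptions the call inside `baseline_generate` would become `model.generate(**model_inputs, **build_generation_kwargs(temp))`.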






In [7]:
# 1. use query from each example, the model generates the answer by itself

# start timer
import time
tb_openq_start = time.time()

# generate all answers
for i, example in enumerate(examples):
    prompt = example["query"]
    message = prompt_to_message(prompt)
    response = baseline_generate(message)
    expected = example["choices"][example["gold"]]

    print(f"\n[{i+1}] {prompt}")
    print(f"Model: {response}")
    print(f"Expected: {expected}")

# measure total time
base_time_openq = time.time() - tb_openq_start
print(f"Total time for open question: {base_time_openq:.2f}s")

The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



[1] Vprašanje: Astronom opazi, da planet po trku z meteoritom rotira hitreje. Kateri je najverjetnejši učinek te povečane rotacije?
odgovor:
Model: Najverjetnejši učinek povečane rotacije je, da se bo planet začel vrteti hitreje.

Expected: Planetarni dnevi bodo postali krajši.

[2] Vprašanje: Skupina inženirjev je želela izvedeti, kako bi se različne zasnove zgradb odzvale med potresom. Izdelali so več modelov zgradb in vsako testirali za njeno sposobnost prenašanja potresnih razmer. Kateri rezultat je najbolj verjeten po testiranju različnih zasnov zgradb?
odgovor:
Model: Najbolj verjeten rezultat po testiranju različnih zasnov zgradb je, da bo imela zgradba, ki je bila zasnovana na najmočnejši način, največjo sposobnost prenašanja potresnih razmer.

Expected: stavbe bodo varnejše

[3] Vprašanje: Končni rezultat v procesu fotosinteze je proizvodnja sladkorja in kisika. Kateri korak označuje začetek fotosinteze?
odgovor:
Model: Začetek fotosinteze označuje absorpcija svetlobe s stran

In [8]:
# 2. use existing answers from the dataset and the model chooses the correct answer

# start timer
import time
tb_choice_start = time.time()

# select each answer
for i, example in enumerate(examples):
    # build prompt with query and possible choices
    choices_text = ", ".join(example["choices"])
    prompt = f"{example['query']}\n\nMožni odgovori: {choices_text}\n\nOdgovori samo z izbranim odgovorom."

    # create message from prompt
    message = prompt_to_message(prompt)

    # temperature 0 requests deterministic decoding so the answer is more constrained
    # (max_new_tokens stays at 512)
    response = baseline_generate(message, temp=0)
    expected = example["choices"][example["gold"]]

    print(f"\n[{i+1}] {example['query']}")
    print(f"Choices: {choices_text}")
    print(f"Model: {response}")
    print(f"Expected: {expected}")


# measure total time
base_time_choice = time.time() - tb_choice_start
print(f"Total time for multiple choice question: {base_time_choice:.2f}s")


[1] Vprašanje: Astronom opazi, da planet po trku z meteoritom rotira hitreje. Kateri je najverjetnejši učinek te povečane rotacije?
odgovor:
Choices: Planetarna gostota se bo zmanjšala., Planetarna leta bodo postala daljša., Planetarni dnevi bodo postali krajši., Planetarna gravitacija bo postala močnejša.
Model: Planetarna gostota se bo zmanjšala.

Expected: Planetarni dnevi bodo postali krajši.

[2] Vprašanje: Skupina inženirjev je želela izvedeti, kako bi se različne zasnove zgradb odzvale med potresom. Izdelali so več modelov zgradb in vsako testirali za njeno sposobnost prenašanja potresnih razmer. Kateri rezultat je najbolj verjeten po testiranju različnih zasnov zgradb?
odgovor:
Choices: stavbe bodo zgrajene hitreje, stavbe bodo varnejše, zasnove zgradb bodo videti lepše, gradbeni materiali bodo cenejši
Model: Odgovor: Stavbe bodo varnejše.

Expected: stavbe bodo varnejše

[3] Vprašanje: Končni rezultat v procesu fotosinteze je proizvodnja sladkorja in kisika. Kateri korak ozna

In [9]:
del model
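
`del model` only removes the Python name; the GPU memory is returned once no other references remain and PyTorch's cached blocks are released. The vLLM startup log further down reports 34.18 of 39.56 GiB free, which suggests that some memory was still in use at that point. A minimal sketch of a more thorough cleanup follows; the helper name `release_gpu_memory` is an assumption, and it should be called after `del model`.

```python
import gc


def release_gpu_memory():
    """Collect unreachable objects, then release PyTorch's cached GPU blocks.

    Returns True if a CUDA cache was emptied, False otherwise.
    """
    gc.collect()
    try:
        import torch
    except ImportError:
        return False  # nothing to release without PyTorch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False


released = release_gpu_memory()
print(f"CUDA cache emptied: {released}")
```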

#### **Task 3**

vLLM inference with GaMS-2B-Instruct.

We use the same prompts and generation parameters as for the baseline model.

In [10]:
import os

# request the V0 engine; to be safe, set the variable before vllm is imported
os.environ["VLLM_USE_V1"] = "0"

import torch
from vllm import LLM, SamplingParams

# initialize the model
torch.cuda.empty_cache()
model = LLM(
    "cjvt/GaMS-2B-Instruct",
    dtype=torch.float32,  # NOTE: the baseline above was loaded in float16
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
    max_model_len=512,  # prompt and generated tokens together must fit in 512 tokens
)


# create function that generates a message from prompts
def prompt_to_message(prompt):
    messages = [
        {"role": "user", "content": prompt}
    ]

    return messages

# create a function that returns the answer from the vllm model
def vllm_generate(messages, temp=0.6):
    # set sampling parameters
    sampling_params = SamplingParams(
        temperature=temp,
        top_p=0.9,
        top_k=64,
        max_tokens=512
    )

    # generate response
    outputs = model.chat(messages, sampling_params)

    # get response
    response = []
    for output in outputs:
        response.append(output.outputs[0].text)

    return response


INFO 02-03 14:57:14 [__init__.py:216] Automatically detected platform cuda.
INFO 02-03 14:57:15 [utils.py:328] non-default args: {'trust_remote_code': True, 'dtype': torch.float32, 'max_model_len': 512, 'disable_log_stats': True, 'model': 'cjvt/GaMS-2B-Instruct'}


The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.


INFO 02-03 14:57:34 [__init__.py:742] Resolved architecture: Gemma2ForCausalLM
INFO 02-03 14:57:34 [__init__.py:2761] Upcasting torch.bfloat16 to torch.float32.
INFO 02-03 14:57:34 [__init__.py:1815] Using max model len 512
INFO 02-03 14:57:36 [llm_engine.py:221] Initializing a V0 LLM engine (v0.10.2) with config: model='cjvt/GaMS-2B-Instruct', speculative_config=None, tokenizer='cjvt/GaMS-2B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float32, max_seq_len=512, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metric

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 02-03 14:57:45 [default_loader.py:268] Loading weights took 3.24 seconds
INFO 02-03 14:57:46 [model_runner.py:1083] Model loading took 9.7637 GiB and 3.711658 seconds
INFO 02-03 14:57:51 [worker.py:290] Memory profiling takes 4.14 seconds
INFO 02-03 14:57:51 [worker.py:290] the current vLLM instance can use total_gpu_memory (39.56GiB) x gpu_memory_utilization (0.90) = 35.60GiB
INFO 02-03 14:57:51 [worker.py:290] model weights take 9.76GiB; non_torch_memory takes 0.02GiB; PyTorch activation peak memory takes 2.21GiB; the rest of the memory reserved for KV Cache is 23.60GiB.
INFO 02-03 14:57:52 [executor_base.py:114] # cuda blocks: 7436, # CPU blocks: 1260
INFO 02-03 14:57:52 [executor_base.py:119] Maximum concurrency for 512 tokens per request: 232.38x
INFO 02-03 14:57:55 [model_runner.py:1355] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in t

Capturing CUDA graph shapes:   0%|          | 0/35 [00:00<?, ?it/s]

INFO 02-03 14:58:36 [model_runner.py:1507] Graph capturing finished in 41 secs, took 0.25 GiB
INFO 02-03 14:58:36 [worker.py:467] Free memory on device (34.18/39.56 GiB) on startup. Desired GPU memory utilization is (0.9, 35.6 GiB). Actual usage is 9.76 GiB for weight, 2.21 GiB for peak activation, 0.02 GiB for non-torch memory, and 0.25 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=24918418329` to fit into requested memory, or `--kv-cache-memory=23394568704` to fully utilize gpu memory. Current kv cache memory in use is 25344140185 bytes.
INFO 02-03 14:58:36 [llm_engine.py:420] init engine (profile, create kv cache, warmup model) took 49.25 seconds
INFO 02-03 14:58:36 [llm.py:295] Supported_tasks: ['generate']
INFO 02-03 14:58:36 [__init__.py:36] No IOProcessor plugins requested by the model


We run inference with the vLLM model for both options:
1. Open-ended questions
2. Multiple-choice questions
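
For a fair comparison both frameworks should receive exactly the same multiple-choice prompt. The prompt construction used in the cells of this notebook can be factored into one shared function, sketched below. The function name `build_mc_prompt` and the sample question are hypothetical; the prompt wording is copied from the notebook.

```python
def build_mc_prompt(query, choices):
    """Combine a query and its answer choices into the multiple-choice prompt."""
    choices_text = ", ".join(choices)
    return (
        f"{query}\n\n"
        f"Možni odgovori: {choices_text}\n\n"
        "Odgovori samo z izbranim odgovorom."
    )


# made-up example, only to show the resulting prompt layout
sample_prompt = build_mc_prompt("Vprašanje: Koliko je 1 + 1?\nodgovor:", ["1", "2", "3"])
print(sample_prompt)
```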

In [11]:
# 1. use query from each example, the model generates the answer by itself

# start timer
import time
tv_openq_start = time.time()

messages = [prompt_to_message(example['query']) for example in examples]

responses = vllm_generate(messages)

for i, (example, response) in enumerate(zip(examples, responses)):
    print(f"\n[{i+1}] {example['query']}")
    print(f"Model: {response}")
    print(f"Expected: {example['choices'][example['gold']]}")

# measure total time
vllm_time_openq = time.time() - tv_openq_start

print(f"Total time for open question: {vllm_time_openq:.2f}s")

INFO 02-03 14:58:38 [chat_utils.py:538] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.


Adding requests:   0%|          | 0/5 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/5 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]


[1] Vprašanje: Astronom opazi, da planet po trku z meteoritom rotira hitreje. Kateri je najverjetnejši učinek te povečane rotacije?
odgovor:
Model: Planet se bo zaradi povečane rotacije začel hitreje vrteti okoli svoje osi, kar bo imelo za posledico večje trenje med površino planeta in atmosfero, kar bo zmanjšalo hitrost vetra in s tem tudi hitrost, s katero se planet giblje okoli Sonca.

Expected: Planetarni dnevi bodo postali krajši.

[2] Vprašanje: Skupina inženirjev je želela izvedeti, kako bi se različne zasnove zgradb odzvale med potresom. Izdelali so več modelov zgradb in vsako testirali za njeno sposobnost prenašanja potresnih razmer. Kateri rezultat je najbolj verjeten po testiranju različnih zasnov zgradb?
odgovor:
Model: Najbolj verjeten rezultat po testiranju različnih zasnov zgradb je, da bo večina zgradb delovala brez večjih poškodb ali celo presegla pričakovane mejne vrednosti.

Expected: stavbe bodo varnejše

[3] Vprašanje: Končni rezultat v procesu fotosinteze je proi

In [None]:
# 2. use existing answers from the dataset and the model chooses the correct answer

# start timer
import time
tv_choice_start = time.time()

# prepare prompts for multiple-choice
prompts_mc = []
for example in examples:
    choices_text = ", ".join(example["choices"])
    prompt = f"{example['query']}\n\nMožni odgovori: {choices_text}\n\nOdgovori samo z izbranim odgovorom."
    prompts_mc.append(prompt)

messages = [prompt_to_message(prompt) for prompt in prompts_mc]

# generate response
responses = vllm_generate(messages, temp=0)

for i, (example, response) in enumerate(zip(examples, responses)):
    print(f"\n[{i+1}] {example['query']}")
    print(f"Choices: {example['choices']}")
    print(f"Model: {response}")
    print(f"Expected: {example['choices'][example['gold']]}")


# measure total time
vllm_time_choice = time.time() - tv_choice_start
print(f"Total time for multiple choice question: {vllm_time_choice:.2f}s")

Adding requests:   0%|          | 0/5 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/5 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]


[1] Vprašanje: Astronom opazi, da planet po trku z meteoritom rotira hitreje. Kateri je najverjetnejši učinek te povečane rotacije?
odgovor:
Choices: ['Planetarna gostota se bo zmanjšala.', 'Planetarna leta bodo postala daljša.', 'Planetarni dnevi bodo postali krajši.', 'Planetarna gravitacija bo postala močnejša.']
Model: - Planetarna gostota se bo zmanjšala.

Expected: Planetarni dnevi bodo postali krajši.

[2] Vprašanje: Skupina inženirjev je želela izvedeti, kako bi se različne zasnove zgradb odzvale med potresom. Izdelali so več modelov zgradb in vsako testirali za njeno sposobnost prenašanja potresnih razmer. Kateri rezultat je najbolj verjeten po testiranju različnih zasnov zgradb?
odgovor:
Choices: ['stavbe bodo zgrajene hitreje', 'stavbe bodo varnejše', 'zasnove zgradb bodo videti lepše', 'gradbeni materiali bodo cenejši']
Model: 1. stavbe bodo zgrajene hitreje

Expected: stavbe bodo varnejše

[3] Vprašanje: Končni rezultat v procesu fotosinteze je proizvodnja sladkorja in ki

#### **Task 4**

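Table comparison of both frameworks.

The table shows the total generation time each framework needed for the 5 examples, separately for open-ended and multiple-choice prompts. Answer accuracy is not included, since the exercise focuses on speed rather than model quality.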

In [13]:
import pandas as pd

# Main comparison table
results = {
    "Model": ["Transformers", "Transformers", "vLLM", "vLLM"],
    "Type": ["Open-ended", "Multiple-choice", "Open-ended", "Multiple-choice"],
    "Time (s)": [round(base_time_openq, 2), round(base_time_choice, 2), round(vllm_time_openq, 2), round(vllm_time_choice, 2)]
}

df_results = pd.DataFrame(results)
display(df_results)


|   | Model | Type | Time (s) |
|---|---|---|---|
| 0 | Transformers | Open-ended | 7.40 |
| 1 | Transformers | Multiple-choice | 3.08 |
| 2 | vLLM | Open-ended | 4.03 |
| 3 | vLLM | Multiple-choice | 0.51 |
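
To make the table easier to read, the speedup of vLLM over Transformers can be computed per prompt type. The sketch below hard-codes the timings recorded above; the dictionary layout and the `speedup` helper are assumptions. The numbers come from only 5 prompts, the Transformers loop handles prompts one at a time while vLLM receives them as one batch, and the two models were loaded in different precisions (float16 and float32), so the ratios are indicative rather than a rigorous benchmark.

```python
# timings in seconds, copied from the results table above
timings = {
    ("Transformers", "Open-ended"): 7.40,
    ("Transformers", "Multiple-choice"): 3.08,
    ("vLLM", "Open-ended"): 4.03,
    ("vLLM", "Multiple-choice"): 0.51,
}


def speedup(prompt_type):
    """Ratio of Transformers time to vLLM time for one prompt type."""
    return timings[("Transformers", prompt_type)] / timings[("vLLM", prompt_type)]


for prompt_type in ("Open-ended", "Multiple-choice"):
    print(f"{prompt_type}: {speedup(prompt_type):.2f}x")
# prints:
# Open-ended: 1.84x
# Multiple-choice: 6.04x
```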
