<a href="https://colab.research.google.com/github/anushadudi/inference_latency_optimization/blob/main/vllm_bench.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install vllm huggingface-hub

Collecting vllm
  Downloading vllm-0.7.2-cp38-abi3-manylinux1_x86_64.whl.metadata (12 kB)
Collecting blake3 (from vllm)
  Downloading blake3-1.0.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)
Collecting fastapi!=0.113.*,!=0.114.0,>=0.107.0 (from vllm)
  Downloading fastapi-0.115.8-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn[standard] (from vllm)
  Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm)
  Downloading prometheus_fastapi_instrumentator-7.0.2-py3-none-any.whl.metadata (13 kB)
Collecting tiktoken>=0.6.0 (from vllm)
  Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting lm-format-enforcer<0.11,>=0.10.9 (from vllm)
  Downloading lm_format_enforcer-0.10.10-py3-none-any.whl.metadata (17 kB)
Collecting outlines==0.1.11 (from vllm)
  Downloading outlines-0.1.11-py3-none-any.whl.metadata (17 kB)
Collecting lark

In [None]:
from vllm import LLM, SamplingParams
from huggingface_hub import login
import time
import pandas as pd
from google.colab import userdata

questions = [
    # Coding questions
    "Implement a Python function to compute the Fibonacci numbers.",
    "Write a Rust function that performs binary exponentiation.",
    "What are the differences between Javascript and Python?",
    # Literature
    "Write a story in the style of James Joyce about a trip to the Australian outback in 2083, to see robots in the beautiful desert.",
    "Who does Harry turn into a balloon?",
    "Write a tale about a time-traveling historian who's determined to witness the most significant events in human history.",
    # Math
    "What is the product of 9 and 8?",
    "If a train travels 120 kilometers in 2 hours, what is its average speed?",
    "Think through this step by step. If the sequence a_n is defined by a_1 = 3, a_2 = 5, and a_n = a_(n-1) + a_(n-2) for n > 2, find a_6.",
]


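def initiateModel():
  # Llama 3.1 is a gated model, so authenticate with the HF_TOKEN stored in Colab secrets.
  model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
  login(token=userdata.get('HF_TOKEN'))
  # Load the model with vLLM's default engine settings.
  return LLM(model=model_id)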




In [None]:
def generate(question, llm):
  # Near-greedy sampling keeps answers repeatable; max_tokens caps each answer at 200 tokens.
  sampling_params = SamplingParams(temperature=0.01, top_p=0.01, max_tokens=200)

  # Wall-clock time for the whole request (prefill plus decode).
  start = time.perf_counter()
  result = llm.generate(question, sampling_params)
  request_time = time.perf_counter() - start

  # A single prompt yields a single RequestOutput.
  output = result[0]
  tok_count = len(output.outputs[0].token_ids)
  return {
      'question': question,
      'tok_count': tok_count,
      'time': request_time,  # seconds
      'answer': output.outputs[0].text,
      'tokens_per_second': tok_count / request_time,
      'ms_per_seq_output_token': request_time * 1000 / tok_count,
      # Seconds between the request arriving and the first generated token.
      'time_to_first_token': output.metrics.first_token_time - output.metrics.arrival_time,
      'metrics': output.metrics,
  }

In [None]:
def run_benchmark(llm):
    responses = []

    for i, q in enumerate(questions):
        response = generate(question=q, llm=llm)
        # The first question serves as a warm-up run, so its timing is discarded.
        if i > 0:
            responses.append(response)

    df = pd.DataFrame(responses)
    df.to_csv('bench-vllm.csv', index=False)
    return df



In [None]:
llm = initiateModel()


INFO 02-18 23:07:17 config.py:542] This model supports multiple tasks: {'reward', 'embed', 'generate', 'classify', 'score'}. Defaulting to 'generate'.
INFO 02-18 23:07:17 config.py:1556] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 02-18 23:07:17 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='meta-llama/Meta-Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collec

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]


INFO 02-18 23:07:27 model_runner.py:1115] Loading model weights took 14.9576 GB
INFO 02-18 23:07:28 worker.py:267] Memory profiling takes 0.59 seconds
INFO 02-18 23:07:28 worker.py:267] the current vLLM instance can use total_gpu_memory (39.56GiB) x gpu_memory_utilization (0.90) = 35.60GiB
INFO 02-18 23:07:28 worker.py:267] model weights take 14.96GiB; non_torch_memory takes 0.00GiB; PyTorch activation peak memory takes 1.18GiB; the rest of the memory reserved for KV Cache is 19.46GiB.
INFO 02-18 23:07:28 executor_base.py:110] # CUDA blocks: 9965, # CPU blocks: 2048
INFO 02-18 23:07:28 executor_base.py:115] Maximum concurrency for 131072 tokens per request: 1.22x
INFO 02-18 23:07:28 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_

Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:28<00:00,  1.24it/s]

INFO 02-18 23:07:56 model_runner.py:1562] Graph capturing finished in 28 secs, took 0.08 GiB
INFO 02-18 23:07:56 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 29.81 seconds
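
The memory figures in the startup log above can be checked by hand. The sketch below redoes the arithmetic with the logged values. Two inputs do not appear in the log and are assumptions: a KV-cache block size of 16 tokens (vLLM's default on CUDA) and the Llama 3.1 8B attention shape (32 layers, 8 KV heads, head dimension 128) stored in bfloat16.

```python
# Values copied from the vLLM startup log above.
total_gpu_gib = 39.56
gpu_memory_utilization = 0.90
weights_gib = 14.96
non_torch_gib = 0.00
activation_peak_gib = 1.18
num_gpu_blocks = 9965
max_seq_len = 131072

# Assumption: default KV-cache block size of 16 tokens.
block_size_tokens = 16
# Assumption: Llama 3.1 8B attention shape, bfloat16 (2 bytes per value).
num_layers, num_kv_heads, head_dim, bytes_per_value = 32, 8, 128, 2

# Memory vLLM may use, and what is left for the KV cache after weights and activations.
budget_gib = total_gpu_gib * gpu_memory_utilization
kv_cache_gib = budget_gib - weights_gib - non_torch_gib - activation_peak_gib

# Each cached token stores one key and one value vector per layer and KV head.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
kv_bytes_per_block = kv_bytes_per_token * block_size_tokens

# Estimated block count from the memory budget, and capacity from the logged block count.
estimated_blocks = kv_cache_gib * 1024**3 / kv_bytes_per_block
cache_capacity_tokens = num_gpu_blocks * block_size_tokens
max_concurrency = cache_capacity_tokens / max_seq_len

print(f"budget: {budget_gib:.2f} GiB, KV cache: {kv_cache_gib:.2f} GiB")
print(f"estimated blocks: {int(estimated_blocks)}, capacity: {cache_capacity_tokens} tokens")
print(f"max concurrency at {max_seq_len} tokens per request: {max_concurrency:.2f}x")
```

Under these assumptions the estimate matches the logged 9965 CUDA blocks and the 1.22x concurrency figure, which suggests the KV cache here holds roughly 159k tokens in total.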





In [None]:
result = run_benchmark(llm)
result


Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.90s/it, est. speed input: 3.79 toks/s, output: 68.87 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.88s/it, est. speed input: 3.82 toks/s, output: 69.37 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.89s/it, est. speed input: 3.46 toks/s, output: 69.22 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.89s/it, est. speed input: 10.74 toks/s, output: 69.31 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.89s/it, est. speed input: 3.12 toks/s, output: 69.28 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.89s/it, est. speed input: 8.31 toks/s, output: 69.27 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.89s/it, est. speed input: 4.15 toks/s, output: 69.19 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.89s/it, est. speed input: 6.59 toks/s, output: 69.34 toks/s]
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  

index,question,tok_count,time,answer,tokens_per_second,ms_per_seq_output_token,time_to_first_token,metrics
0,Write a Rust function that performs binary exp...,200,2.886388,Binary exponentiation is a method for efficie...,69.290763,14.431938,0.024341,"RequestMetrics(arrival_time=1739921408.891137,..."
1,What are the differences between Javascript an...,200,2.892429,- Stack Overflow\nWhat are the differences be...,69.146038,14.462145,0.024481,RequestMetrics(arrival_time=1739921411.7775493...
2,Write a story in the style of James Joyce abou...,200,2.889185,"\nThe sun beat down upon the dusty horizon, a...",69.22367,14.445926,0.024553,"RequestMetrics(arrival_time=1739921414.670004,..."
3,Who does Harry turn into a balloon?,200,2.890007,"In the Harry Potter series, Harry is turned i...",69.203974,14.450037,0.02483,"RequestMetrics(arrival_time=1739921417.559226,..."
4,Write a tale about a time-traveling historian ...,200,2.890262,"But, as she navigates through the ages, she f...",69.197875,14.451311,0.024359,RequestMetrics(arrival_time=1739921420.4492574...
5,What is the product of 9 and 8?,200,2.893935,"[ #permalink ] New post 03 Mar 2018, 06:45\...",69.110061,14.469673,0.024902,RequestMetrics(arrival_time=1739921423.3395412...
6,"If a train travels 120 kilometers in 2 hours, ...",200,2.887658,Average speed is calculated by dividing the t...,69.260275,14.438291,0.024789,RequestMetrics(arrival_time=1739921426.2334993...
7,Think through this step by step. If the sequen...,200,2.892458,1) Find a_3. a_3 = a_2 + a_1 = 5 + 3 = 8 2) F...,69.145341,14.46229,0.024809,RequestMetrics(arrival_time=1739921429.1211786...
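
A quick way to read the table above is to collapse it into a few summary numbers. The sketch below uses only the standard library and hardcodes the `tokens_per_second` and `time_to_first_token` columns from the eight rows shown; in the notebook the same figures could presumably be taken straight from the `result` DataFrame.

```python
from statistics import mean, stdev

# (tokens_per_second, time_to_first_token in seconds), copied from the table above.
rows = [
    (69.290763, 0.024341),
    (69.146038, 0.024481),
    (69.22367, 0.024553),
    (69.203974, 0.02483),
    (69.197875, 0.024359),
    (69.110061, 0.024902),
    (69.260275, 0.024789),
    (69.145341, 0.024809),
]

throughputs = [tps for tps, _ in rows]
ttfts = [ttft for _, ttft in rows]

mean_tps = mean(throughputs)
spread_tps = stdev(throughputs)      # sample standard deviation
mean_ttft_ms = mean(ttfts) * 1000    # convert seconds to milliseconds

print(f"mean throughput: {mean_tps:.2f} tok/s (stdev {spread_tps:.3f})")
print(f"mean time to first token: {mean_ttft_ms:.2f} ms")
```

The spread is very small because every request hit the 200-token `max_tokens` cap, so each row measures essentially the same amount of decode work.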


In [None]:
# The DataFrame stores each request's RequestMetrics in the 'metrics' column.
res = result['metrics'][0]
res

RequestMetrics(arrival_time=1739920101.3490393, last_token_time=1739920104.223131, first_scheduled_time=1739920101.351214, first_token_time=1739920101.3737535, time_in_queue=0.0021746158599853516, finished_time=1739920104.2232606, scheduler_time=0.016016362000527806, model_forward_time=None, model_execute_time=None)

In [None]:
# Read the timestamps (Unix seconds) from the first request's RequestMetrics.
metrics = result['metrics'][0]
response_time = metrics.finished_time - metrics.arrival_time
time_to_first_token = metrics.first_token_time - metrics.arrival_time
print(f"res time: {response_time}, time_to_first_token: {time_to_first_token}")

res time: 2.8742213249206543, time_to_first_token: 0.024714231491088867
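
The same timestamps also allow splitting a request into its prefill and decode phases. The sketch below uses the values from the `RequestMetrics` output above and assumes the request produced 200 tokens (the `max_tokens` cap); that count is not part of the metrics object, so it is an assumption here.

```python
# Timestamps (Unix seconds) copied from the RequestMetrics output above.
arrival_time = 1739920101.3490393
first_token_time = 1739920101.3737535
last_token_time = 1739920104.223131
finished_time = 1739920104.2232606

# Assumption: the request generated 200 tokens, the max_tokens cap.
tok_count = 200

response_time = finished_time - arrival_time
time_to_first_token = first_token_time - arrival_time   # queueing plus prefill
decode_time = last_token_time - first_token_time        # remaining tokens

# The first token is already out when decoding starts, so there are tok_count - 1 gaps.
inter_token_latency_ms = decode_time * 1000 / (tok_count - 1)
ttft_share = time_to_first_token / response_time

print(f"time to first token: {time_to_first_token * 1000:.1f} ms")
print(f"inter-token latency: {inter_token_latency_ms:.2f} ms")
print(f"share of response time spent before the first token: {ttft_share:.1%}")
```

Unlike `ms_per_seq_output_token`, which divides the whole request time by the token count, this figure excludes the time before the first token, so it isolates steady-state decode speed.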


In [None]:
!nvidia-smi

Tue Feb 18 23:34:00 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:00:04.0 Off |                    0 |
| N/A   35C    P0             50W /  400W |   36157MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                

Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.90s/it, est. speed input: 3.79 toks/s, output: 68.92 toks/s]
