# Hello World vLLM

## TL;DR
The [notebook](./Hello_World_vLLM.ipynb) is a quick-start guide that sets up vLLM, loads the 125M-parameter facebook/opt-125m model, generates replies from prompts, and shows how to read the tokens and log probabilities of each reply. It serves as a minimal template for experiments with larger models, streaming, and benchmarking.

## High-level flow:
1. Environment bootstrap
2. LLM instantiation: LLM(model="facebook/opt-125m") loads a small model that fits on almost any GPU.
3. Prompt-to-Text demo: builds a list of prompts, sets sampling parameters, and generates a set of replies for each prompt, in three different use cases.
4. Find the top three generated texts based on their tokens' average log probability.

## Prompt-to-Text use cases:
This notebook shows three basic use cases of generating text replies from prompts:
1. Single prompt to a single reply
2. Single prompt to many replies
3. Many prompts to one reply per prompt

## Top three replies:
The last cell of this notebook generates 10 different replies for a single prompt. It then calculates the average log probability of each reply from the log probabilities of its selected tokens, sorts the replies by that average, and displays the top three.
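
To make the scoring step concrete, here is a minimal sketch of ranking replies by average log probability. It does not call vLLM: the replies and their per-token log probabilities are made-up values, and the names `replies` and `average_logprob` are hypothetical. It also shows why the average is used rather than the sum: a plain sum tends to favor short replies.

```python
from statistics import mean

# Hypothetical per-token log probabilities for three sampled replies.
# In the notebook these come from the generation output; here they are made up.
replies = {
    " Alice and I am a teacher.": [-1.2, -0.8, -2.5, -0.4, -1.1, -3.0],
    " Bob.": [-2.0, -1.5],
    " Carol, nice to meet you all today!": [-1.0, -0.9, -1.3, -0.7, -1.6, -0.5, -2.2, -0.6],
}


def average_logprob(token_logprobs):
    """Mean log probability per token; an empty reply scores -inf (a choice made for this sketch)."""
    return mean(token_logprobs) if token_logprobs else float("-inf")


# Rank by average log probability, highest (closest to zero) first.
by_average = sorted(
    ((average_logprob(lp), text) for text, lp in replies.items()),
    reverse=True,
)

# Rank by the raw sum for comparison: shorter replies accumulate less penalty.
by_sum = sorted(
    ((sum(lp), text) for text, lp in replies.items()),
    reverse=True,
)

for rank, (score, text) in enumerate(by_average[:3], start=1):
    print(f"#{rank} avg_logprob={score:.3f} text={text!r}")
```

With these made-up numbers the two-token reply ranks first by sum but last by average, which is the length bias that averaging removes.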

## Possible simple extensions for this notebook:
1. Load a larger model, such as meta-llama/Llama-3-8b, if hardware allows.
2. Wrap llm.generate in a FastAPI endpoint and load-test it.
3. Compare performance with Hugging Face Text Generation Inference (TGI).
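
For the load-testing and comparison ideas above, a small backend-agnostic timing helper may be a useful starting point. This is only a sketch: `measure_throughput` and `fake_backend` are hypothetical names, and the stand-in backend simply sleeps and returns 20 token ids per prompt so that the example runs without a GPU.

```python
import time


def measure_throughput(generate_fn, prompts, repeats=3):
    """Call generate_fn(prompts) several times and report the fastest run.

    generate_fn must return one list of generated token ids per prompt.
    """
    best_rate = 0.0
    tokens_in_best_run = 0
    for _ in range(repeats):
        start = time.perf_counter()
        token_lists = generate_fn(prompts)
        elapsed = time.perf_counter() - start
        n_tokens = sum(len(tokens) for tokens in token_lists)
        rate = n_tokens / elapsed if elapsed > 0 else float("inf")
        if rate > best_rate:
            best_rate, tokens_in_best_run = rate, n_tokens
    return {
        "prompts": len(prompts),
        "generated_tokens": tokens_in_best_run,
        "tokens_per_second": best_rate,
    }


def fake_backend(prompts):
    """Stand-in for a real engine: pretends to generate 20 tokens per prompt."""
    time.sleep(0.01)
    return [list(range(20)) for _ in prompts]


report = measure_throughput(fake_backend, ["Hello, my name is"] * 4)
print(report)
```

To time the notebook's model instead, one option would be an adapter such as `lambda ps: [o.outputs[0].token_ids for o in llm.generate(ps, sampling_params)]`, which relies on the same `outputs` and `token_ids` fields the notebook reads. Any other server being compared would need its own adapter with the same return shape.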

## Key references:
Official docs at https://vllm.ai/

# Initializing environment

In [None]:
import os, sys
print(f'{os.getcwd()=}')
print(f'{sys.executable=}')

sys_path_vllm = '/notebooks/vllm'
if os.path.isdir(sys_path_vllm) and sys_path_vllm not in sys.path:
    sys.path.insert(0, sys_path_vllm)

print('='*50)
for i,p in enumerate(sys.path):
    print(f'sys.path[{i}] = {p}')
print('='*50)

os.getcwd()='/notebooks'
sys.executable='/notebooks/vllm_venv/bin/python'
sys.path[0] = /notebooks/vllm
sys.path[1] = /usr/lib/python311.zip
sys.path[2] = /usr/lib/python3.11
sys.path[3] = /usr/lib/python3.11/lib-dynload
sys.path[4] = 
sys.path[5] = /notebooks/vllm_venv/lib/python3.11/site-packages
sys.path[6] = __editable__.vllm-0.9.2+cu118.finder.__path_hook__


In [None]:
from vllm import LLM, SamplingParams

INFO 07-22 13:58:56 [__init__.py:244] Automatically detected platform cuda.


In [None]:
llm = LLM(model="facebook/opt-125m")

config.json:   0%|          | 0.00/651 [00:00<?, ?B/s]

INFO 07-22 13:59:12 [config.py:841] This model supports multiple tasks: {'reward', 'embed', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 07-22 13:59:12 [config.py:1472] Using max model len 2048
INFO 07-22 13:59:14 [config.py:2285] Chunked prefill is enabled with max_num_batched_tokens=8192.


tokenizer_config.json:   0%|          | 0.00/685 [00:00<?, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/441 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

INFO 07-22 13:59:14 [core.py:526] Waiting for init message from front-end.
INFO 07-22 13:59:14 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=facebook/opt-125m, num_scheduler_steps=1, 

pytorch_model.bin:   0%|          | 0.00/251M [00:00<?, ?B/s]

INFO 07-22 13:59:20 [weight_utils.py:308] Time spent downloading weights for facebook/opt-125m: 1.244602 seconds


Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 07-22 13:59:20 [default_loader.py:272] Loading weights took 0.22 seconds
INFO 07-22 13:59:20 [gpu_model_runner.py:1801] Model loading took 0.2389 GiB and 1.732135 seconds
INFO 07-22 13:59:29 [backends.py:508] Using cache directory: /root/.cache/vllm/torch_compile_cache/7b4b1a726a/rank_0_0/backbone for vLLM's torch.compile
INFO 07-22 13:59:29 [backends.py:519] Dynamo bytecode transform time: 8.33 s
INFO 07-22 13:59:31 [backends.py:181] Cache the graph of shape None for later use


[rank0]:W0722 13:59:31.641000 129 torch/_inductor/utils.py:1250] [0/0] Not enough SMs to use max_autotune_gemm mode


INFO 07-22 13:59:36 [backends.py:193] Compiling a graph for general shape takes 6.93 s
INFO 07-22 13:59:38 [monitor.py:34] torch.compile takes 15.26 s in total
INFO 07-22 13:59:41 [gpu_worker.py:232] Available KV cache memory: 20.54 GiB
INFO 07-22 13:59:41 [kv_cache_utils.py:716] GPU KV cache size: 598,240 tokens
INFO 07-22 13:59:41 [kv_cache_utils.py:720] Maximum concurrency for 2,048 tokens per request: 292.11x


Capturing CUDA graph shapes: 100%|██████████| 67/67 [00:22<00:00,  2.94it/s]


INFO 07-22 14:00:04 [gpu_model_runner.py:2326] Graph capturing finished in 23 secs, took 0.21 GiB
INFO 07-22 14:00:04 [core.py:172] init engine (profile, create kv cache, warmup model) took 44.00 seconds
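
The "Maximum concurrency" figure in the log above appears to be the KV cache capacity divided by the maximum model length. The short check below reproduces the logged value from the two numbers printed in the log; the variable names are illustrative, and treating the figure as a simple ratio is an assumption that happens to match this run.

```python
# Values copied from the engine log above.
kv_cache_tokens = 598_240  # "GPU KV cache size: 598,240 tokens"
max_model_len = 2_048      # "Using max model len 2048"

# How many full-length (2,048-token) requests fit in the KV cache at once.
max_concurrency = kv_cache_tokens / max_model_len
print(f"{max_concurrency:.2f}x")  # 292.11x
```

Requests shorter than 2,048 tokens use less of the cache, so more of them can be in flight at once than this worst-case figure suggests.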


# Single prompt --> Single output

In [None]:
prompt = "Hello, my name is"
sampling_params = SamplingParams(temperature=0.7, max_tokens=20)

outputs = llm.generate(prompt, sampling_params)

for output in outputs:
    print("Prompt:", output.prompt)
    print("Generated:", output.outputs[0].text)

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Prompt: Hello, my name is
Generated:  Joel, I'm a software developer and I'm trying to make a mobile app. The problem is
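
The `temperature=0.7` setting controls how sharply sampling favors the most likely next token. As a rough illustration of the standard technique (dividing logits by the temperature before the softmax), the sketch below uses made-up logits for three candidate tokens; it does not touch vLLM, and `softmax_with_temperature` is a hypothetical helper name.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into sampling probabilities after scaling by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

p_neutral = softmax_with_temperature(logits, 1.0)
p_notebook = softmax_with_temperature(logits, 0.7)

print("T=1.0:", [round(p, 3) for p in p_neutral])
print("T=0.7:", [round(p, 3) for p in p_notebook])
```

A temperature below 1 shifts probability mass toward the top-scoring token, so replies become more predictable while still varying between runs.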


# Single prompt --> Many outputs

In [None]:
prompt = "Hello, my name is"
sampling_params = SamplingParams(temperature=0.7, max_tokens=20, n=3)

# Generate multiple completions
outputs = llm.generate(prompt, sampling_params)

# Print all completions
for output in outputs:
    print("Prompt:", output.prompt)
    for i, result in enumerate(output.outputs):
        print(f"Generated #{i+1}:", result.text)
    print("-" * 40)

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/3 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Prompt: Hello, my name is
Generated #1:  Yuvraj and I’m 25 years old. I am a happy, conservative Indian
Generated #2:  James.
I'm a Psychologist.
You're better off being a broke man.

Generated #3:  Liz, I'm going to be a waitress at a coffee shop.
Liz, please don
----------------------------------------


# Many prompts --> Single output

In [None]:
# Define a batch of prompts
prompts = [
    "What is the capital of France?",
    "Write a short poem about the ocean.",
    "Explain what a black hole is.",
    "Give me a Python one-liner to reverse a string."
]

# Set generation configuration
sampling_params = SamplingParams(temperature=0.7, max_tokens=30)

# Run batched generation
outputs = llm.generate(prompts, sampling_params)

# Print each prompt with its response
for output in outputs:
    print("Prompt:", output.prompt)
    print("Response:", output.outputs[0].text)
    print("-" * 50)


Adding requests:   0%|          | 0/4 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

Prompt: What is the capital of France?
Response: 
France is a capital of France, because of its size and population. It’s the capital of Germany and it’s the capital
--------------------------------------------------
Prompt: Write a short poem about the ocean.
Response: 

I am going to leave this link here for now, so I can stop posting. What I will do is try to write a poem about
--------------------------------------------------
Prompt: Explain what a black hole is.
Response: 
It seems to be a black hole with a red hole. I can't see where it is, but I've seen it happen.
--------------------------------------------------
Prompt: Give me a Python one-liner to reverse a string.
Response: 

John

#3: "But the world is full of contradictions, and what a contradiction!"
#4: "If it's
--------------------------------------------------


# Top 3 completions for a single prompt based on likelihood

In [None]:
prompt = "Hello, my name is"

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=20,
    n=10,
    logprobs=1  # Request logprobs for each token
)

outputs = llm.generate(prompt, sampling_params)

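# Collect (average log probability, text) pairs for every completion.
# The list is created once, before the loop, so results from all prompts are kept.
scored_completions = []
for output in outputs:
    pr = output.prompt
    print_results = len(output.outputs) <= 10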
    for ind_res,res in enumerate(output.outputs):
        logprobs = []
        for token_id, token_logprobs_dict in zip(res.token_ids, res.logprobs):
            logprob_entry = token_logprobs_dict.get(token_id, None)
            if logprob_entry is not None:
                logprobs.append(logprob_entry.logprob)

        avg_logprob = sum(logprobs) / max(len(logprobs),1)
        completion = res.text
        scored_completions.append((avg_logprob,completion))
        if print_results:
            print(f'prompt={pr} --> [{ind_res+1}]: {completion=}  {avg_logprob=}')

# Sort by highest average log-probability
scored_completions.sort(key=lambda x: x[0], reverse=True)

# Print top 3 completions
print("\nTop 3 completions by average log-likelihood:")
for i, (avg_logprob, text) in enumerate(scored_completions[:3]):
    print(f"\n#{i+1} {avg_logprob=} :\n{text}")


Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/10 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

prompt=Hello, my name is --> [1]: completion=' Melissa and I am a thirty-six year old English teacher. I love teaching and I love spending'  avg_logprob=-2.1378440018743277
prompt=Hello, my name is --> [2]: completion=' Mika, I used to be a graphic artist, I would love to find more of your work'  avg_logprob=-2.429113742336631
prompt=Hello, my name is --> [3]: completion=' Minnie and i am a 6\'2" country singer. I\'m a singer in the St'  avg_logprob=-2.8559442430734636
prompt=Hello, my name is --> [4]: completion=' Ramona and I am a 15 year old girl. My boyfriend is a guy and we have been'  avg_logprob=-2.052470563072711
prompt=Hello, my name is --> [5]: completion=' Tom, and I\'m the owner of the award-winning and award-winning book "The Book'  avg_logprob=-2.0749319683993237
prompt=Hello, my name is --> [6]: completion=" Subramaniam.\nHello, it's Subramaniam."  avg_logprob=-2.02913129364606
prompt=Hello, my name is --> [7]: completion=" Jack and I'm a newbie to Reddit and I'm looking