<a href="https://colab.research.google.com/github/dongukmoon/misc/blob/main/vllm_quickstart_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# vLLM Tutorial: Quick Start on Google Colab
This guide walks you through setting up and running offline batched inference and OpenAI-compatible inference with vLLM on Google Colab.


1.   Offline batched inference
2.   OpenAI-compatible inference


## Running Offline Batched Inference on Colab

**Step 1: Install Required Packages**

Start by installing vLLM and other necessary packages using pip:

In [1]:
!pip install vllm torch triton

Collecting vllm
  Downloading vllm-0.6.4.post1-cp38-abi3-manylinux1_x86_64.whl.metadata (10 kB)
Collecting triton
  Downloading triton-3.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.3 kB)
Collecting uvicorn[standard] (from vllm)
  Downloading uvicorn-0.32.1-py3-none-any.whl.metadata (6.6 kB)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm)
  Downloading prometheus_fastapi_instrumentator-7.0.0-py3-none-any.whl.metadata (13 kB)
Collecting tiktoken>=0.6.0 (from vllm)
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting lm-format-enforcer<0.11,>=0.10.9 (from vllm)
  Downloading lm_format_enforcer-0.10.9-py3-none-any.whl.metadata (17 kB)
Collecting outlines<0.1,>=0.0.43 (from vllm)
  Downloading outlines-0.0.46-py3-none-any.whl.metadata (15 kB)
Collecting partial-json-parser (from vllm)
  Downloading partial_json_parser-0.2.1.1.post4-py3-none-any.whl.metadata (6.2 kB)
Collecting msgs
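
After the install finishes, it can help to confirm which versions were actually picked up, since the rest of this tutorial was run against vLLM 0.6.4.post1 (per the pip log above). The helper name `installed_version` below is made up for illustration; it only uses the standard-library `importlib.metadata` module:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the versions of the packages installed in Step 1.
for name in ("vllm", "torch", "triton"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```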

**Step 2: Set Up GPU Environment**

Google Colab offers free access to a Tesla T4 GPU (15GB of memory), which is sufficient for running vLLM with a small model. In this tutorial, we'll use the GPT-2 model and set the `gpu_memory_utilization` parameter to 0.5. This caps the memory that vLLM reserves for the model weights, activations, and key-value (KV) cache at 50% of the total GPU memory. Without this setting, vLLM falls back to a higher default and reserves a larger KV cache, so you might encounter out-of-memory errors.
With `gpu_memory_utilization=0.5` on a T4, the model weights use about 0.24GB and the KV cache takes approximately 6.52GB of GPU memory. Note that the logs and `nvidia-smi` output shown in this notebook were captured on an L4 instance (22.17GiB total), where the same setting yields a 10.12GiB KV cache.
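
To make the memory budget concrete, here is a small arithmetic sketch using the numbers from the "Memory profiling results" log line that appears later in this notebook (captured on the L4 instance). The subtraction formula is an assumption about how this vLLM version sizes the cache, not something the log states, and `kv_cache_budget_gib` is a hypothetical helper name:

```python
def kv_cache_budget_gib(total_gib, gpu_memory_utilization, peak_torch_gib, non_torch_gib):
    """Estimate the memory left for the KV cache (assumed formula, see lead-in)."""
    # Total budget vLLM is allowed to use on this GPU.
    budget = total_gib * gpu_memory_utilization
    # Subtract the peak PyTorch usage (weights + activations) and non-PyTorch overhead.
    return budget - peak_torch_gib - non_torch_gib


# Values copied from the memory profiling log line (L4 run).
kv_gib = kv_cache_budget_gib(
    total_gib=22.17,
    gpu_memory_utilization=0.5,
    peak_torch_gib=0.71,
    non_torch_gib=0.26,
)
print(f"Estimated KV cache: {kv_gib:.1f} GiB")
```

The estimate lands within rounding distance of the `kv_cache_size=10.12GiB` reported in the log, which suggests the budget is split roughly this way.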



**Step 3: Load the Model and Generate Outputs**

Below is the Python script to load the GPT-2 model, define prompts, and generate text outputs:

In [1]:
from vllm import LLM, SamplingParams

# Load a pre-trained Hugging Face model
model_name = "gpt2"  # Replace with your desired model

# Create an LLM.
# gpu_memory_utilization: the fraction (between 0 and 1) of GPU memory to
# reserve for the model weights, activations, and KV cache. Higher values
# increase the KV cache size and thus improve the model's throughput.
# However, if the value is too high, it may cause out-of-memory (OOM) errors.
llm = LLM(model_name, gpu_memory_utilization=0.5)

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

INFO 12-01 11:54:08 config.py:1861] Downcasting torch.float32 to torch.float16.
INFO 12-01 11:54:17 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='gpt2', speculative_config=None, tokenizer='gpt2', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=gpt2, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=Fa

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

INFO 12-01 11:54:24 selector.py:135] Using Flash Attention backend.
INFO 12-01 11:54:24 model_runner.py:1072] Starting to load model gpt2...
INFO 12-01 11:54:25 weight_utils.py:243] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

INFO 12-01 11:54:29 weight_utils.py:288] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 12-01 11:54:30 model_runner.py:1077] Loading model weights took 0.2378 GB
INFO 12-01 11:54:31 worker.py:232] Memory profiling results: total_gpu_memory=22.17GiB initial_memory_usage=0.47GiB peak_torch_memory=0.71GiB memory_usage_post_profile=0.50GiB non_torch_memory=0.26GiB kv_cache_size=10.12GiB gpu_memory_utilization=0.50
INFO 12-01 11:54:31 gpu_executor.py:113] # GPU blocks: 18426, # CPU blocks: 7281
INFO 12-01 11:54:31 gpu_executor.py:117] Maximum concurrency for 1024 tokens per request: 287.91x
INFO 12-01 11:54:34 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-01 11:54:34 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO

Processed prompts: 100%|██████████| 4/4 [00:00<00:00, 23.24it/s, est. speed input: 128.28 toks/s, output: 373.16 toks/s]

Prompt: 'Hello, my name is', Generated text: " Scott. I'm a social media manager at Amazon, working as a Data Scientist"
Prompt: 'The president of the United States is', Generated text: ' not sitting right now.\n\nDemocrats made a concerted effort to make his name'
Prompt: 'The capital of France is', Generated text: ' still staging elections this year, and not many want to make the hard decision on'
Prompt: 'The future of AI is', Generated text: ' shaping up to be a hotly debated issue in a world increasingly beset by artificial'
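
The block counts and the "Maximum concurrency" figure in the engine log above can be reproduced by hand. The sketch below assumes a block size of 16 tokens (believed to be vLLM's default, but not shown in the log) and uses the public GPT-2 small configuration (12 layers, hidden size 768); the other numbers are copied from the log:

```python
# GPT-2 (small) architecture constants, taken from the public model config.
NUM_LAYERS = 12
HIDDEN_SIZE = 768        # 12 attention heads x 64 dimensions per head
BYTES_PER_VALUE = 2      # float16, as reported in the engine log
BLOCK_SIZE_TOKENS = 16   # ASSUMPTION: vLLM's default block size

NUM_GPU_BLOCKS = 18426   # from the "# GPU blocks" log line
MAX_SEQ_LEN = 1024       # from max_seq_len in the engine config log line

# Each token stores one key vector and one value vector per layer.
bytes_per_token = 2 * NUM_LAYERS * HIDDEN_SIZE * BYTES_PER_VALUE

# Total KV cache size implied by the number of GPU blocks.
kv_cache_gib = NUM_GPU_BLOCKS * BLOCK_SIZE_TOKENS * bytes_per_token / 2**30

# How many full-length (1024-token) requests fit in the cache at once.
max_concurrency = NUM_GPU_BLOCKS * BLOCK_SIZE_TOKENS / MAX_SEQ_LEN

print(f"KV cache per token: {bytes_per_token} bytes")
print(f"KV cache total: {kv_cache_gib:.2f} GiB")
print(f"Max concurrency: {max_concurrency:.2f}x")
```

Under these assumptions the totals come out to 10.12 GiB and 287.91x, matching the `kv_cache_size` and "Maximum concurrency" values in the log.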





**Step 4: Monitor GPU Memory Usage**

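Use the `nvidia-smi` command to check GPU memory usage. In the output below, captured on an L4 instance, 11295MiB out of 23034MiB is in use, which is roughly half of the GPU memory. On a T4, expect approximately 7.5GB out of 15GB.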

In [2]:
!nvidia-smi

Sun Dec  1 11:54:57 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   51C    P0              30W /  72W |  11295MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
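
If you want the same numbers programmatically rather than reading the table, `nvidia-smi` also has a CSV query mode. The following is a minimal sketch; the helper names `parse_memory_csv` and `query_gpu_memory` are made up for illustration, and the sample string reuses the values from the table above so the parsing step runs even without a GPU:

```python
import subprocess


def parse_memory_csv(line):
    """Parse one 'used, total' CSV line (values in MiB) into two integers."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def query_gpu_memory():
    """Return a list of (used_mib, total_mib) tuples, one per visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_memory_csv(line) for line in out.splitlines() if line.strip()]


# Parse the values from the table above (no GPU needed for this step).
used, total = parse_memory_csv("11295, 23034")
print(f"{used} MiB of {total} MiB used ({used / total:.0%})")
```

On a GPU runtime, calling `query_gpu_memory()` returns the live values instead of the hard-coded sample.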
                                                                    

**Sample Output**

With `temperature=0.8`, generation is sampled, so the output varies from run to run. Another run of the script may look like this:

```
Prompt: 'Hello, my name is', Generated text: ' Scott. If you want to talk to me, you can. I am a'
Prompt: 'The president of the United States is', Generated text: ' a trained lawyer with experience in working with the vast majority of lawyers who are part'
Prompt: 'The capital of France is', Generated text: ' on the basis of the basis of the shared law. This is called the Common'
Prompt: 'The future of AI is', Generated text: ' becoming a bit more difficult. The big, big picture lies with AI. We'
```
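
As a toy illustration of the two sampling parameters from Step 3, the sketch below shows what `temperature` and `top_p` do to a next-token distribution. It is not vLLM's implementation: the logits are invented, the function names are hypothetical, and only the general technique (temperature-scaled softmax followed by nucleus filtering) is shown:

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (lower = sharper)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / mass for i in kept}


def sample(logits, temperature=0.8, top_p=0.95, rng=random):
    """Draw one token index using temperature scaling, then top-p filtering."""
    nucleus = top_p_filter(apply_temperature(logits, temperature), top_p)
    tokens = list(nucleus)
    weights = [nucleus[i] for i in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Invented logits for a four-token vocabulary.
toy_logits = [2.0, 1.0, 0.0, -3.0]
nucleus = top_p_filter(apply_temperature(toy_logits, 0.8), 0.95)
print("Tokens kept by top-p:", sorted(nucleus))
print("Sampled token index:", sample(toy_logits))
```

With these toy logits the least likely token falls outside the 0.95 nucleus and can never be sampled, while the remaining three are drawn at random, which is why repeated runs produce different continuations.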

**Step 5: Handle GPU Memory for Repeated Runs**

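If you run the script again without restarting the Colab runtime, you might encounter out-of-memory errors because the previous engine still holds GPU memory. To free it, you can restart the runtime by terminating the kernel process with the following command:

In [None]:
import os
os._exit(0)  # terminate the kernel process so that Colab restarts the runtime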

## Running OpenAI-Compatible Inference

vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using the OpenAI API. By default, it starts the server at http://localhost:8000. You can specify the address with the `--host` and `--port` arguments. The server currently hosts one model at a time and implements endpoints such as list models, create chat completion, and create completion.

Run the following command to start the vLLM server with the Qwen2.5-1.5B-Instruct model on port 8001:


In [1]:
# The FlashAttention-2 backend cannot be used on Volta and Turing GPUs,
# and the T4 does not support BF16, which causes an error.
# So, I used an L4 GPU instance.

# Wait until the server bootup is complete. Check the log in 'nohup.out'.
!nohup vllm serve Qwen/Qwen2.5-1.5B-Instruct --port 8001 --gpu-memory-utilization 0.3 &


nohup: appending output to 'nohup.out'
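
Instead of watching `nohup.out` by hand, you can poll the server until it answers. This is a minimal sketch using only the standard library; `wait_for_server` is a hypothetical helper, and the default timeout and polling interval are arbitrary choices:

```python
import time
import urllib.request


def wait_for_server(url, timeout_s=300.0, interval_s=2.0):
    """Poll `url` until it returns HTTP 200; return False if the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Covers refused connections, timeouts, and HTTP errors
            # while the server is still starting up.
            pass
        time.sleep(interval_s)
    return False
```

For the server started above, `wait_for_server("http://localhost:8001/v1/models")` should return `True` once the model has finished loading.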


This server can be queried in the same format as the OpenAI API. For example, to list the models:

In [10]:
!curl http://localhost:8001/v1/models

{"object":"list","data":[{"id":"Qwen/Qwen2.5-1.5B-Instruct","object":"model","created":1733054408,"owned_by":"vllm","root":"Qwen/Qwen2.5-1.5B-Instruct","parent":null,"max_model_len":32768,"permission":[{"id":"modelperm-eb143e34881d4f02a02d5948ac28a750","object":"model_permission","created":1733054408,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

Once your server is started, you can query the model with input prompts:

In [11]:
!curl http://localhost:8001/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen2.5-1.5B-Instruct","prompt": "San Francisco is a", \
"max_tokens": 7, \
"temperature": 0 \
}'

{"id":"cmpl-3d8885498cec4bdc9cbe2327debd1afb","object":"text_completion","created":1733054423,"model":"Qwen/Qwen2.5-1.5B-Instruct","choices":[{"index":0,"text":" city in the state of California,","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":4,"total_tokens":11,"completion_tokens":7,"prompt_tokens_details":null}}
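
The same request can be built from Python without any extra packages. The sketch below mirrors the `curl` call with `urllib`; the helper names are made up for illustration, and the sample response is an abridged copy of the one shown above so that the parsing step runs without a live server:

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=7, temperature=0):
    """Build a POST request for the /completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_completion_text(response_body):
    """Extract the generated text of the first choice from a JSON response."""
    return json.loads(response_body)["choices"][0]["text"]


request = build_completion_request(
    "http://localhost:8001/v1",
    "Qwen/Qwen2.5-1.5B-Instruct",
    "San Francisco is a",
)

# Abridged copy of the response shown above (no server needed to parse it).
sample_body = (
    '{"choices":[{"index":0,"text":" city in the state of California,",'
    '"finish_reason":"length"}]}'
)
print(repr(first_completion_text(sample_body)))
```

With the server running, the request can be sent with `urllib.request.urlopen(request)` and the returned body passed to `first_completion_text`.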

Since this server is compatible with the OpenAI API, you can use it as a drop-in replacement for any application that uses the OpenAI API. For example, another way to query the server is via the `openai` Python package:

In [12]:
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8001/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    prompt="San Francisco is a",
)
print("Completion result:", completion)

Completion result: Completion(id='cmpl-e9f64f81de1a4ebaafdfb42cb13f7e4e', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text=' boomtown. The city is known for its cable cars, rock music, and', stop_reason=None, prompt_logprobs=None)], created=1733054438, model='Qwen/Qwen2.5-1.5B-Instruct', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=16, prompt_tokens=4, total_tokens=20, completion_tokens_details=None, prompt_tokens_details=None))


vLLM also supports the OpenAI Chat Completions API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.

In [13]:
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8001/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ]
)
print("Chat response:", chat_response)

Chat response: ChatCompletion(id='chatcmpl-5498c249066845f0a587c32544c15e58', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content="Sure, here's a joke for you:\n\nWhy don't scientists trust atoms?\n\nBecause they make up everything.", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[]), stop_reason=None)], created=1733054449, model='Qwen/Qwen2.5-1.5B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=23, prompt_tokens=24, total_tokens=47, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)
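
To illustrate how back-and-forth exchanges are stored in the chat history, here is a minimal sketch. The helpers `make_sender` and `chat_turn` are hypothetical names, and a stand-in sender is used so the example runs without a server; the key idea is that every user message and every assistant reply is appended to the same `messages` list, which is sent in full on each turn:

```python
def make_sender(client, model):
    """Wrap an OpenAI-style client into a function that maps messages to reply text."""
    def send(messages):
        response = client.chat.completions.create(model=model, messages=messages)
        return response.choices[0].message.content
    return send


def chat_turn(history, user_text, send):
    """Append the user message, get a reply, and record it in the history."""
    history.append({"role": "user", "content": user_text})
    reply = send(history)
    history.append({"role": "assistant", "content": reply})
    return reply


history = [{"role": "system", "content": "You are a helpful assistant."}]


# Stand-in sender so the sketch runs offline; it only reports the history length.
def echo_sender(messages):
    return f"(reply to message {len(messages)})"


chat_turn(history, "Tell me a joke.", echo_sender)
chat_turn(history, "Explain why it is funny.", echo_sender)
print([message["role"] for message in history])
```

Against the running server, replace `echo_sender` with `make_sender(client, "Qwen/Qwen2.5-1.5B-Instruct")`, where `client` is the `OpenAI` client created in the cell above.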


For the full list of vLLM engine arguments, see https://docs.vllm.ai/en/stable/models/engine_args.html
