# vLLM on Azure Machine Learning

Reference: [Quickstart doc](https://docs.vllm.ai/en/stable/getting_started/quickstart.html)

## Installation

Spin up an H100 compute instance in Azure Machine Learning (e.g. `Standard_NC80adis_H100_v5`, which has 2 H100 GPUs with 94 GiB of GPU memory each), open the `Terminal` app, and install vLLM:

```bash
$ conda create -n myenv python=3.10 -y
$ conda activate myenv
$ pip install vllm
```

## Run an OpenAI-compatible LLM server

vLLM can run as a server that implements the OpenAI API protocol. Start it with a model available on Hugging Face, e.g. [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct), as follows:

```bash
$ vllm serve Qwen/Qwen2.5-1.5B-Instruct
```

By default, the server starts at `localhost` with port `8000`.

vLLM supports many models from [HuggingFace Transformers](https://huggingface.co/models). A list of supported models can be found [here](https://docs.vllm.ai/en/latest/models/supported_models.html#supported-models).
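Loading a model can take a while, so a client started right after `vllm serve` may hit a connection error. The following is a minimal sketch of a readiness check that polls the server's `/v1/models` endpoint on the default `localhost:8000` address until it answers. The helper names `server_is_up` and `wait_for_server` and the timeout values are illustrative assumptions, not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def server_is_up(url="http://localhost:8000/v1/models"):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error: not ready yet.
        return False


def wait_for_server(probe=server_is_up, timeout_s=300.0, interval_s=2.0):
    """Call `probe` repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    print("server ready:", wait_for_server(timeout_s=10.0))
```

The `probe` argument is injectable so the polling loop can be exercised without a running server.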

### List models

With the server running, you can simply call the list models API with a cURL command like this:

```bash
$ curl http://localhost:8000/v1/models
```

Or with Python `requests`:

```python
import requests

response = requests.get("http://localhost:8000/v1/models")

print(response.text)
```

```json
{"object":"list","data":[{"id":"Qwen/Qwen2.5-1.5B-Instruct","object":"model","created":1733570012,"owned_by":"vllm","root":"Qwen/Qwen2.5-1.5B-Instruct","parent":null,"max_model_len":32768,"permission":[{"id":"modelperm-43718e5e43ae4f749f8a59f8233dbe2b","object":"model_permission","created":1733570012,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
```
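The raw response is hard to read at a glance. As a small sketch, the snippet below pulls out just the model ID and `max_model_len` from a list-models payload. The helper name `list_model_ids` is hypothetical, and `sample` is an abbreviated copy of the response above so the snippet runs without a server.

```python
import json


def list_model_ids(raw):
    """Return (model id, max_model_len) pairs from a /v1/models response body."""
    body = json.loads(raw)
    return [(model["id"], model.get("max_model_len")) for model in body["data"]]


# Abbreviated copy of the response shown above (most fields omitted).
sample = (
    '{"object":"list","data":[{"id":"Qwen/Qwen2.5-1.5B-Instruct",'
    '"object":"model","owned_by":"vllm","max_model_len":32768}]}'
)

for model_id, max_len in list_model_ids(sample):
    print(f"{model_id}: max_model_len={max_len}")
```

With a live server, `list_model_ids(response.text)` would take the place of the sample string.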


### OpenAI Completion

Use the OpenAI client to call the Completions API:

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0.7,
)
print("Completion result:", completion.choices[0].text)
```

### OpenAI Chat Completion

Now stop the first server (e.g. with `Ctrl+C`) and have vLLM serve Phi-3.5 mini instead:

```bash
$ vllm serve microsoft/Phi-3.5-mini-instruct
```

#### With cURL

Call the list models API again:

```bash
$ curl http://localhost:8000/v1/models
```

The response now lists `"microsoft/Phi-3.5-mini-instruct"`:

```json
{
   "object":"list",
   "data":[
      {
         "id":"microsoft/Phi-3.5-mini-instruct",
         "object":"model",
         "created":1733574665,
         "owned_by":"vllm",
         "root":"microsoft/Phi-3.5-mini-instruct",
         "parent":null,
         "max_model_len":131072,
         "permission":[
            {
               "id":"modelperm-bfdf3099c73443f1a7b222ca7b344a90",
               "object":"model_permission",
               "created":1733574665,
               "allow_create_engine":false,
               "allow_sampling":true,
               "allow_logprobs":true,
               "allow_search_indices":false,
               "allow_view":true,
               "allow_fine_tuning":false,
               "organization":"*",
               "group":null,
               "is_blocking":false
            }
         ]
      }
   ]
}
```

Now call the chat completion API. A cURL request would look like this:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "microsoft/Phi-3.5-mini-instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
```

And the response:

```json
{
   "id":"chatcmpl-3022a77606db41ac997645b339ea6a78",
   "object":"chat.completion",
   "created":1733574718,
   "model":"microsoft/Phi-3.5-mini-instruct",
   "choices":[
      {
         "index":0,
         "message":{
            "role":"assistant",
            "content":" The world series in Major League Baseball (MLB) for the 2020 season was won by the Los Angeles Dodgers. The Dodgers defeated the Tampa Bay Rays in seven games, with the series concluding on October 30, 2020. This was a significant year because it was shortened due to the COVID-19 pandemic, resulting in a regular season of just 60 games. The Dodgers' victory was a culmination of a strong regular season and postseason performance.",
            "tool_calls":[
               
            ]
         },
         "logprobs":null,
         "finish_reason":"stop",
         "stop_reason":32007
      }
   ],
   "usage":{
      "prompt_tokens":23,
      "total_tokens":138,
      "completion_tokens":115,
      "prompt_tokens_details":null
   },
   "prompt_logprobs":null
}
```
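When scripting against the endpoint, usually only a few fields of this response matter. The sketch below extracts the assistant text, the finish reason, and the token counts, and checks that `prompt_tokens + completion_tokens` equals `total_tokens` (23 + 115 = 138 in the response above). The function name `summarize_chat_response` is hypothetical, and `sample` keeps the real token counts but uses a shortened stand-in for the message content.

```python
import json


def summarize_chat_response(raw):
    """Pick the commonly needed fields out of a chat completion response body."""
    body = json.loads(raw)
    choice = body["choices"][0]
    usage = body["usage"]
    return {
        "text": choice["message"]["content"].strip(),
        "finish_reason": choice["finish_reason"],
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        # The two parts should add up to the reported total.
        "usage_consistent": usage["prompt_tokens"] + usage["completion_tokens"]
        == usage["total_tokens"],
    }


# Shortened stand-in for the response above; the token counts are the real ones.
sample = (
    '{"choices":[{"index":0,"message":{"role":"assistant",'
    '"content":" (answer text shortened)"},"finish_reason":"stop"}],'
    '"usage":{"prompt_tokens":23,"total_tokens":138,"completion_tokens":115}}'
)

print(summarize_chat_response(sample))
```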

#### With OpenAI Python client

Alternatively, use the OpenAI Python client:

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
chat_response = client.chat.completions.create(
    model="microsoft/Phi-3.5-mini-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a geeky joke."},
    ],
)
print("Chat response:", chat_response.choices[0].message.content)
```

```text
Chat response:  Sure, here's a tech-themed joke for you:

Why do programmers prefer dark mode?

Because light attracts bugs!

This joke plays on the double meaning of "light." In the context of a computer screen, light mode refers to the default, bright display. However, "light" is also commonly used to refer to bugs or glitches in programming. So, when programmers say they prefer "dark mode," they're humorously suggesting that a darker screen might help them avoid attracting bugs in their code.
```


### (Optional) Free up GPU memory

If the first model is still in GPU memory, loading a second model and starting a server can fail with a `CUDA out of memory` error:

```bash
ERROR 12-07 12:14:35 engine.py:366] CUDA out of memory. Tried to allocate 48.00 MiB. GPU 0 has a total capacity of 93.02 GiB of which 40.19 MiB is free. Process 24272 has 82.55 GiB memory in use. Process 72349 has 5.78 GiB memory in use. Including non-PyTorch memory, this process has 4.60 GiB memory in use. Of the allocated memory 3.98 GiB is allocated by PyTorch, and 26.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
```

Kill the process that is using the most GPU memory (PID 24272 in the error message above):

```bash
$ kill -9 24272
```
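Picking the PID out of a long error line by eye is error-prone. As a rough sketch, the regular expression below extracts every `Process <PID> has <N> GiB memory in use` entry from the message and returns the largest one. The pattern is an assumption based only on the wording of the error shown above, and `largest_gpu_process` is a hypothetical helper name.

```python
import re

# Matches entries such as "Process 24272 has 82.55 GiB memory in use".
_PROCESS_USAGE = re.compile(r"Process (\d+) has ([\d.]+) (GiB|MiB) memory in use")
_TO_GIB = {"GiB": 1.0, "MiB": 1.0 / 1024}


def largest_gpu_process(error_message):
    """Return (pid, GiB in use) for the biggest process named in the message."""
    usage = {
        int(pid): float(amount) * _TO_GIB[unit]
        for pid, amount, unit in _PROCESS_USAGE.findall(error_message)
    }
    if not usage:
        return None
    pid = max(usage, key=usage.get)
    return pid, usage[pid]


# Excerpt of the error message shown above.
message = (
    "GPU 0 has a total capacity of 93.02 GiB of which 40.19 MiB is free. "
    "Process 24272 has 82.55 GiB memory in use. "
    "Process 72349 has 5.78 GiB memory in use. "
    "Including non-PyTorch memory, this process has 4.60 GiB memory in use."
)

print(largest_gpu_process(message))
```

Before killing anything, it is worth confirming with `nvidia-smi` that the PID really belongs to the old vLLM server, and trying a plain `kill` (SIGTERM) before resorting to `kill -9`.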

## Offline Batch Inference

Use the `LLM` class for offline inference with the vLLM engine. `SamplingParams` specifies the parameters for the sampling process.

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
```
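To make the two `SamplingParams` arguments above more concrete, here is a toy, pure-Python illustration of what temperature scaling and top-p (nucleus) filtering do to a next-token distribution. It is a conceptual sketch only, not vLLM's implementation, and the function names `apply_temperature` and `top_p_filter` are made up for this example.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (lower = sharper)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_probs = [0.5, 0.3, 0.15, 0.05]
# With top_p=0.9 the least likely token (index 3) is dropped.
print(top_p_filter(toy_probs, top_p=0.9))
print(apply_temperature([2.0, 1.0, 0.0], temperature=0.8))
```

A lower temperature concentrates probability on the most likely tokens, while a smaller `top_p` cuts off the low-probability tail before sampling.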


```python
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.95)
```

```text
INFO 12-07 12:06:10 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=facebook/opt-125m, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cach
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 12.83it/s]
INFO 12-07 12:06:11 model_runner.py:1077] Loading model weights took 0.2389 GB
INFO 12-07 12:06:11 worker.py:232] Memory profiling results: total_gpu_memory=93.02GiB initial_memory_usage=83.77GiB peak_torch_memory=0.97GiB memory_usage_post_profile=83.77GiB non_torch_memory=83.26GiB kv_cache_size=4.14GiB gpu_memory_utilization=0.95
INFO 12-07 12:06:11 gpu_executor.py:113] # GPU blocks: 7529, # CPU blocks: 7281
INFO 12-07 12:06:11 gpu_executor.py:117] Maximum concurrency for 2048 tokens per request: 58.82x
INFO 12-07 12:06:14 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 12-07 12:06:14 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
```

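(I first ran into a `No available memory for the cache blocks` error. The memory profiling line in the log shows that 83.77 GiB of the GPU was already in use before the model loaded, so I raised the limit with `gpu_memory_utilization=0.95`.)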
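The numbers in the log are consistent with a simple budget: the KV cache gets whatever is left of `total_gpu_memory * gpu_memory_utilization` after subtracting `peak_torch_memory` and `non_torch_memory`. The sketch below reproduces the logged `kv_cache_size=4.14GiB` and the `58.82x` concurrency figure. The formula is inferred from the logged values rather than taken from vLLM's source, the 16-token block size is assumed to be the default, and the comparison at 0.9 (vLLM's default utilization) assumes the same profiling figures applied to the failing run.

```python
def kv_cache_budget_gib(total_gib, utilization, peak_torch_gib, non_torch_gib):
    """Memory left for the KV cache under the given utilization cap (inferred formula)."""
    return total_gib * utilization - peak_torch_gib - non_torch_gib


def max_concurrency(num_gpu_blocks, max_seq_len, block_size=16):
    """How many full-length requests fit in the cache; block_size=16 is an assumption."""
    return num_gpu_blocks * block_size / max_seq_len


# Values copied from the "Memory profiling results" log line above.
total, peak_torch, non_torch = 93.02, 0.97, 83.26

print(round(kv_cache_budget_gib(total, 0.95, peak_torch, non_torch), 2))  # 4.14
print(round(kv_cache_budget_gib(total, 0.90, peak_torch, non_torch), 2))  # negative: no room
print(round(max_concurrency(7529, 2048), 2))  # 58.82
```

Under these assumptions the budget at 0.9 comes out negative, which would explain the original error. Freeing the memory held by the other process is the cleaner fix when that is an option.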

Now run a batch LLM generation with `llm.generate`:

```python
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

```text
Processed prompts: 100%|██████████| 4/4 [00:00<00:00, 72.71it/s, est. speed input: 473.54 toks/s, output: 1165.49 toks/s]

Prompt: 'Hello, my name is', Generated text: ' Joel, my dad is my friend and we are in a relationship. I am'
Prompt: 'The president of the United States is', Generated text: ' speaking out against the release of some State Department documents which show the Russians were involved'
Prompt: 'The capital of France is', Generated text: ' known as the “Proud French capital”. What is this city'
Prompt: 'The future of AI is', Generated text: ' literally in danger of being taken by any other company.\nAgreed. '
```
