# Running `vllm` as a server

In the previous notebooks, you have seen how `vllm` can work as a
(powerful) library. Most of the time, however, you will run `vllm`
as a separate server process.

This has several advantages:
* `vllm` exposes an OpenAI-compatible API.
* Existing software that already uses the OpenAI API can be migrated by pointing it at the `vllm` server.
* Many different programs can access the same server.
* You can put a load balancer in front of several servers if it becomes necessary.

We will use the OpenAI client:

In [None]:
from openai import OpenAI

The OpenAI client expects an API key, but our local server does not check it, so a placeholder value is enough:

In [None]:
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

Before sending any requests, start the server in a separate terminal: `vllm serve Qwen/Qwen3-4B-Instruct-2507 --max-model-len 16384`
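Loading the model takes a while, and requests sent before the server is up fail with a connection error. The following is a minimal sketch of a readiness check, assuming the server listens on `http://localhost:8000/v1` as configured above and answers `GET /v1/models` once it is ready. The helper name `wait_for_server` and its `probe` parameter are hypothetical and not part of `vllm` or `openai`.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url="http://localhost:8000/v1", timeout=300.0,
                    interval=2.0, probe=None):
    """Poll `<base_url>/models` until it answers or `timeout` seconds pass.

    `probe` is a hypothetical hook: a callable that takes a URL and returns
    True when the server responded successfully. It defaults to a plain
    HTTP GET and exists mainly so the loop can be tried without a server.
    """
    def http_probe(url):
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                return response.status == 200
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or HTTP error: not ready yet.
            return False

    probe = probe or http_probe
    url = base_url.rstrip("/") + "/models"
    deadline = time.monotonic() + timeout
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_server()` in a notebook cell blocks until the server answers and returns `False` if it did not come up within the timeout.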

Unfortunately, the completions endpoint passes the prompt to the model as plain text and does not apply the chat template. The template therefore has to be applied on the client side:

In [None]:
%%time
# The prompt ends with "<|im_start|>assistant\n" so the model starts its reply directly.
prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nTell me about O'Reilly online learning!<|im_end|>\n"
    "<|im_start|>assistant\n"
)
completion = client.completions.create(
    model="Qwen/Qwen3-4B-Instruct-2507", prompt=prompt, max_tokens=1024
)
print("Completion result:", completion)

In [None]:
print(completion.choices[0].text)
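Writing the template markers by hand is error-prone. As a sketch, the same ChatML layout can be generated from a list of messages. The helper name `build_chatml_prompt` is hypothetical, and the trailing newline after `assistant` is an assumption based on the ChatML format Qwen models use.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render chat messages as a ChatML string.

    `messages` is a list of {"role": ..., "content": ...} dicts, the same
    shape the OpenAI chat API uses.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues with its answer.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me about O'Reilly online learning!"},
]
prompt = build_chatml_prompt(messages)
```

The resulting `prompt` can be passed to `client.completions.create(...)` in place of the hand-written string. If `transformers` is installed, the tokenizer's `apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` produces the prompt from the model's own template and is the safer choice for models with a different format.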

In [None]:
!nvidia-smi