# Text generation with `vllm` as a library

[`vllm`](https://github.com/vllm-project/vllm) can be used both as a library and
as a server. In this notebook, we focus on the library. Compared to `transformers`,
much of the complexity is abstracted away.

In this notebook, we can now (finally!) use the AWQ version of Qwen3-8B.

In [None]:
from vllm import LLM, SamplingParams

This can take a while, as the model is loaded and optimized:

In [None]:
model_name = "Qwen/Qwen3-8B-AWQ"
llm = LLM(model=model_name, max_model_len=16384, trust_remote_code=True)

The `messages` have exactly the same format as with `transformers`:

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain O'Reilly online learning!"}
]
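To make clearer what happens to this list before generation, here is a minimal sketch of a ChatML-style rendering, the layout that Qwen's chat template is based on. The `render_chatml` helper is hypothetical and only for illustration; the real template ships with the tokenizer and covers more cases (tools, thinking blocks).

```python
def render_chatml(messages, add_generation_prompt=True):
    """Hypothetical helper: render chat messages in a ChatML-style layout."""
    parts = []
    for message in messages:
        # Each turn is wrapped in start/end markers and labeled with its role.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # An open assistant turn tells the model that it should answer now.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example_messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain O'Reilly online learning!"},
]
prompt = render_chatml(example_messages)
print(prompt)
```

With `llm.chat`, this step happens internally, so the rendered string never has to be built by hand.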

Sampling parameters are passed separately in a `SamplingParams` object. A `temperature` of 0.0 selects greedy decoding:

In [None]:
sampling_params = SamplingParams(
    max_tokens=1024,
    temperature=0.0,
)
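As a rough illustration of what the temperature does, the following sketch picks a token index from a list of toy logits. The `sample_token` function and the logit values are made up for this example; this is not how `vllm` implements sampling internally, only the general idea.

```python
import math
import random


def sample_token(logits, temperature, rng=random):
    """Pick a token index; temperature 0.0 means greedy decoding (argmax)."""
    if temperature == 0.0:
        # Greedy: always return the index of the largest logit.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Otherwise scale the logits and sample from the softmax distribution.
    scaled = [x / temperature for x in logits]
    largest = max(scaled)
    weights = [math.exp(x - largest) for x in scaled]  # numerically stable
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


toy_logits = [1.0, 3.5, 0.2]  # illustrative values only
greedy_choice = sample_token(toy_logits, temperature=0.0)
sampled_choice = sample_token(toy_logits, temperature=1.0, rng=random.Random(0))
print(greedy_choice, sampled_choice)
```

With greedy decoding the same prompt tends to produce the same answer on every run, which is convenient for a notebook.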

Qwen3 starts with a thinking phase by default. Passing `enable_thinking=False` to the chat template skips it:

In [None]:
output = llm.chat(messages=messages, sampling_params=sampling_params,
                  chat_template_kwargs={"enable_thinking": False})

In [None]:
print(output[0].outputs[0].text)
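The double indexing reflects the shape of the result: one entry per request, each holding a list of completions. The sketch below walks that structure with a hypothetical `summarize` helper. To keep it runnable without a GPU, it uses stand-in objects built with `SimpleNamespace` that only mimic the attributes accessed here (`text`, `token_ids`, `finish_reason`); the values are invented.

```python
from types import SimpleNamespace


def summarize(request_outputs):
    """Hypothetical helper: summarize the first completion of each request."""
    rows = []
    for request_output in request_outputs:
        completion = request_output.outputs[0]
        rows.append({
            "text": completion.text,
            "n_tokens": len(completion.token_ids),
            # "length" would indicate that max_tokens cut the answer short.
            "finish_reason": completion.finish_reason,
        })
    return rows


# Stand-in for the result of llm.chat, with invented values.
fake_output = [
    SimpleNamespace(outputs=[
        SimpleNamespace(text="Hello!", token_ids=[1, 2], finish_reason="stop")
    ])
]
summary = summarize(fake_output)
print(summary)
```

In the notebook itself, `summarize(output)` should work the same way on the real result.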

In [None]:
!nvidia-smi
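If only the memory numbers are of interest, `nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits` prints one `used, total` pair (in MiB) per GPU. The sketch below parses such output with a hypothetical `parse_gpu_memory` helper; the sample string is made up and not a measurement from this notebook.

```python
def parse_gpu_memory(csv_text):
    """Hypothetical helper: parse 'used, total' MiB pairs, one GPU per line."""
    gpus = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        gpus.append({
            "used_mib": used,
            "total_mib": total,
            "fraction": used / total,
        })
    return gpus


sample = "14000, 16000\n"  # invented numbers, for illustration only
usage = parse_gpu_memory(sample)
print(usage)
```

A high fraction is expected here: by default `vllm` reserves most of the GPU memory up front (controlled by the `gpu_memory_utilization` argument of `LLM`), not only what the model weights need.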