# Running a quantized Qwen3-8B model

As you saw in the last notebook, a model can be sped up considerably
by reducing the number of (active) parameters. During inference, generation speed is dominated by
the available memory bandwidth. This is the main reason why GPUs are so much faster than CPUs at
text generation (in addition to their advantage in prompt processing).
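
To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. It assumes that every generated token requires reading all weights from memory once and that nothing else costs time. The bandwidth figures are hypothetical round numbers chosen for illustration, not measurements of any particular device, and `max_tokens_per_second` is a helper made up for this example.

```python
def max_tokens_per_second(n_params: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s if reading the weights were the only cost."""
    model_bytes = n_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes


# Hypothetical hardware figures, for illustration only.
GPU_BANDWIDTH_GB_S = 1000.0  # assumed GPU memory bandwidth
CPU_BANDWIDTH_GB_S = 50.0    # assumed CPU memory bandwidth

N_PARAMS = 8e9  # roughly 8 billion parameters

for name, bandwidth in [("GPU", GPU_BANDWIDTH_GB_S), ("CPU", CPU_BANDWIDTH_GB_S)]:
    bound = max_tokens_per_second(N_PARAMS, 2.0, bandwidth)  # 2 bytes = 16-bit weights
    print(f"{name}: at most {bound:.1f} tokens/s with 16-bit weights")
```

Real throughput will be lower than this bound, but the ratio between the two devices shows why bandwidth matters so much.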

However, if we reduce the size of each parameter (not just the number of parameters), we also get
a speed increase. This can be achieved by quantization. Unfortunately, `transformers` is not
optimized for fast inference with quantized models. Therefore, we use other software that is
optimized for that.
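
The basic idea of quantization can be sketched in a few lines. The example below uses plain symmetric round-to-nearest quantization with one scale per group of weights. This is a deliberately simplified illustration and not the AWQ method used by the model in this notebook. The weight values and the helper names are made up.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1  # largest integer code, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.07, 0.91, -0.30, 0.44, -0.88, 0.02]  # made-up values
codes, scale = quantize_group(weights)
restored = dequantize_group(codes, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("integer codes:", codes)
print(f"scale: {scale:.4f}, max error: {max_error:.4f}")

# Weight storage for about 8 billion parameters, ignoring the overhead of the scales.
n_params = 8e9
print(f"16-bit: {n_params * 2 / 1e9:.0f} GB, 4-bit: {n_params * 0.5 / 1e9:.0f} GB")
```

Each weight is stored as a small integer plus a shared scale, so far fewer bytes have to be read from memory per generated token, at the cost of a small rounding error.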

`vllm` can be used both as a library and as an OpenAI-compatible REST server. In this
notebook, we take the library approach, but it is quite easy to switch later.
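
For comparison, here is a minimal sketch of what the server approach could look like from the client side. It assumes a server was started separately with `vllm serve Qwen/Qwen3-8B-AWQ` and is reachable at `http://localhost:8000`; the base URL and the two helper functions are assumptions for this illustration. Only the request is built here, so the cell runs without a server.

```python
import json
import urllib.request

# Assumed address of a locally running server (adjust to your setup).
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(messages, model="Qwen/Qwen3-8B-AWQ", max_tokens=1024, temperature=0.0):
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request([{"role": "user", "content": "Hello"}])
print(request.full_url)
```

Because the endpoint follows the OpenAI chat completion format, any OpenAI-compatible client library should work in place of the raw HTTP call.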

In [None]:
from vllm import LLM, SamplingParams

https://github.com/vllm-project/vllm/issues/13127

`vllm` does not yet support `transformers` 5.0, so you need to install an older version of `transformers`.
For simplicity, I used a different kernel here with an older version of the library installed.
As this notebook only demonstrates how you can save RAM (and increase speed),
you do not necessarily have to run it!

Unfortunately, the *monkey patch* did not work for me.

In [None]:
model_name = "Qwen/Qwen3-8B-AWQ"
llm = LLM(model=model_name, max_model_len=16384, trust_remote_code=True)

In [None]:
!nvidia-smi

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "How many 'r's are in 'strawberry'?"}
]

In [None]:
sampling_params = SamplingParams(
    max_tokens=1024,
    temperature=0.0,
)
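
A temperature of `0.0` means greedy decoding: the most likely token is always chosen. The following sketch shows the usual way temperature reshapes a token distribution (dividing the logits by the temperature before the softmax). The logits are made-up numbers and `token_probabilities` is a hypothetical helper, not part of `vllm`.

```python
import math


def token_probabilities(logits, temperature):
    """Softmax over logits scaled by temperature; 0.0 is treated as greedy."""
    if temperature == 0.0:
        # Greedy decoding: all probability mass goes to the best token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [value / temperature for value in logits]
    highest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - highest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temperature in (0.0, 0.5, 1.0, 2.0):
    probs = token_probabilities(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures concentrate the probability on the top token, which makes the output more deterministic and easier to compare between runs.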

Note that this is (hopefully) much faster than the generation based on `transformers`.

In [None]:
%%time
output = llm.chat(messages=messages, sampling_params=sampling_params)
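
The wall time alone is hard to compare between runs that produce answers of different length, so a tokens-per-second figure can be more useful. The sketch below assumes that each completion exposes its generated tokens as a `token_ids` sequence. The helper names are hypothetical, and the stand-in object only mimics the shape of the real result so that the cell runs without a GPU.

```python
from types import SimpleNamespace


def generated_token_count(outputs):
    """Total number of generated tokens over all requests and completions."""
    return sum(
        len(completion.token_ids)
        for request_output in outputs
        for completion in request_output.outputs
    )


def tokens_per_second(outputs, elapsed_seconds):
    """Average generation throughput for one call."""
    return generated_token_count(outputs) / elapsed_seconds


# Stand-in with the same shape as the real result (made-up numbers).
fake_output = [SimpleNamespace(outputs=[SimpleNamespace(token_ids=list(range(300)))])]
print(f"{tokens_per_second(fake_output, 6.0):.1f} tokens/s")
```

With the real result, you would measure the elapsed time around `llm.chat(...)`, for example with `time.perf_counter()`, and pass `output` together with that duration.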

In [None]:
for o in output:
    prompt = o.prompt
    generated_text = o.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    print(generated_text)

In [None]:
print(output[0].outputs[0].text)

In [None]:
from IPython.display import display, Markdown
display(Markdown(output[0].outputs[0].text))