# Generate text using the model Qwen3-8B

In this notebook, we use the generative large language model
[Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B). It is a hybrid model:
it can reason in a "thinking" mode before answering, but this mode can
also be switched off.

Qwen3-8B has about 8 billion parameters, so with 2 bytes per parameter
(`bfloat16`) we expect the weights alone to use about 16 GB of GPU memory.
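
The estimate above is simple arithmetic and can be sketched as follows. This is a rough calculation, not a measurement: the parameter count is the rounded figure from the text, and `weight_memory_gb` is a hypothetical helper name introduced only for this illustration.

```python
# Back-of-the-envelope estimate of the memory needed for the model weights.
# Assumption: 8e9 is the rounded parameter count, not the exact value.
num_params = 8e9

# bytes needed to store one parameter in each data type
bytes_per_param = {
    "float32": 4,
    "bfloat16": 2,
}

def weight_memory_gb(num_params, dtype):
    """Return the approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param[dtype] / 1e9

for dtype in bytes_per_param:
    print(f"{dtype:>8}: {weight_memory_gb(num_params, dtype):.0f} GB")
```

Activations and the key/value cache used during generation come on top of this, so the value reported by `nvidia-smi` will usually be somewhat higher.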

We will use the chat template for Qwen3-8B with different prompts. You
will observe that LLMs can produce different answers for the same
prompt. You will also see how you can avoid that and get reproducible
answers.

In [None]:
import torch
torch.cuda.get_device_properties(0) 

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"

# model and tokenizer must match
model = AutoModelForCausalLM.from_pretrained(model_name, 
                                             dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
# transfer the model to the GPU
model.cuda()

Check the memory usage

In [None]:
!nvidia-smi

Prompting the LLM works via a special list which contains `dict`s. Each `dict`
has two keys, one for the `role`, the other for the `content`. This structure
is the same for most chat LLMs.

In [None]:
prompt = "Tell me about O'Reilly online learning"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

The template that turns these messages into the actual prompt text, however,
differs considerably between models. Fortunately, it is included in the tokenizer.

In [None]:
print(tokenizer.chat_template)

The tokenizer also knows how to apply the template, which makes this much easier:

In [None]:
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    enable_thinking=False,
    add_generation_prompt=True
)
print(text)
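
To make clearer what applying the template does, here is a simplified, hand-written approximation. It assumes ChatML-style markers (`<|im_start|>` and `<|im_end|>`), and `render_chatml` is a hypothetical function for illustration only. The real template in the tokenizer also handles tool calls and the thinking switch, which this sketch ignores.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Simplified approximation of a ChatML-style chat template (illustration only)."""
    parts = []
    for message in messages:
        # each turn is wrapped in start/end markers together with its role
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # open the assistant turn so that the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

example_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me about O'Reilly online learning"},
]
print(render_chatml(example_messages))
```

In practice, always use `tokenizer.apply_chat_template`, because the template shipped with the model is the one it was trained with.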

This text can now be passed to `generate`, which invokes
the text completion mechanism.

In [None]:
import time

start = time.time()
print("\n\n*** Generate:")

tokens = tokenizer(text, return_tensors='pt')
output = model.generate(inputs=tokens.input_ids.cuda(), attention_mask=tokens.attention_mask.cuda(),
                        temperature=0.7, 
                        do_sample=True, top_p=0.95, top_k=40, 
                        max_new_tokens=512)
used = time.time() - start
# count only the newly generated tokens, not the prompt tokens
tps = (len(output[0]) - tokens.input_ids.shape[1]) / used
print(tokenizer.decode(output[0]))
print(f"{used} seconds, {tps} tokens/s")

Due to the positive temperature, we get different responses (sampling):

In [None]:
start = time.time()
print("\n\n*** Generate:")

output = model.generate(inputs=tokens.input_ids.cuda(), attention_mask=tokens.attention_mask.cuda(),
                        temperature=0.7, 
                        do_sample=True, top_p=0.95, top_k=40, 
                        max_new_tokens=512)
used = time.time() - start
# count only the newly generated tokens, not the prompt tokens
tps = (len(output[0]) - tokens.input_ids.shape[1]) / used
print(tokenizer.decode(output[0]))
print(f"{used} seconds, {tps} tokens/s")

If we set `do_sample=False`, we get reproducible results:

In [None]:
start = time.time()
print("\n\n*** Generate:")

output = model.generate(inputs=tokens.input_ids.cuda(), attention_mask=tokens.attention_mask.cuda(),
                        do_sample=False, max_new_tokens=512)
used = time.time() - start
# count only the newly generated tokens, not the prompt tokens
tps = (len(output[0]) - tokens.input_ids.shape[1]) / used
print(tokenizer.decode(output[0]))
print(f"{used} seconds, {tps} tokens/s")

In [None]:
start = time.time()
print("\n\n*** Generate:")

output = model.generate(inputs=tokens.input_ids.cuda(), attention_mask=tokens.attention_mask.cuda(),
                        do_sample=False, max_new_tokens=512)
used = time.time() - start
# count only the newly generated tokens, not the prompt tokens
tps = (len(output[0]) - tokens.input_ids.shape[1]) / used
print(tokenizer.decode(output[0]))
print(f"{used} seconds, {tps} tokens/s")
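
The difference between sampling and `do_sample=False` can be illustrated on a handful of toy logits. The following is a minimal sketch of how `temperature`, `top_k` and `top_p` shape the choice of the next token. It is not the implementation used by `transformers`, which may differ in detail. `sample_token`, `greedy_token` and `toy_logits` are hypothetical names invented for this example.

```python
import math
import random

def sample_token(logits, temperature=0.7, top_k=40, top_p=0.95, rng=random):
    """Sample one token index from toy logits (illustration only)."""
    # temperature scaling followed by a numerically stable softmax
    scaled = [logit / temperature for logit in logits]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # top-k: keep only the k most probable tokens and renormalise
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    top_k_total = sum(probs[i] for i in ranked)

    # top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / top_k_total
        if cumulative >= top_p:
            break

    # draw one token; the weights are normalised by random.choices
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

def greedy_token(logits):
    """Greedy decoding: always pick the token with the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

toy_logits = [2.0, 1.5, 1.0, 0.2, -1.0]
rng = random.Random(0)  # a fixed seed makes the sampled sequence repeatable
print("sampled:", [sample_token(toy_logits, rng=rng) for _ in range(10)])
print("greedy: ", [greedy_token(toy_logits) for _ in range(10)])
```

Greedy decoding always returns the same token for the same logits, which is why the two runs with `do_sample=False` agree. Sampling becomes repeatable only when the random number generator is seeded. For the real model, `transformers.set_seed` can be called before `generate` for that purpose.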

Take a look at what happens when we ask for knowledge that may not be in the training data:

In [None]:
prompt = "Explain the GRPO training method for LLMs!"
messages = [
    {"role": "system", "content": "You are a helpful assistant. Only answer if you are absolutely sure. Otherwise tell me that you don't know the answer"},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

Slight change: take a look at the system prompt!

In [None]:
prompt = "Write an abstract about 'Frontiers in GRPO training LLMs'!"
messages = [
    {"role": "system", "content": "You are a creative researcher."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
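
The last two cells only render the prompt text. To compare the answers for both system prompts, the rendering, generation and decoding steps from earlier in the notebook can be wrapped in a small function. This is a sketch under assumptions: `chat` is a hypothetical helper, not part of `transformers`. It decodes greedily, and it leaves thinking mode on by default, as in the two cells above.

```python
def chat(model, tokenizer, messages, enable_thinking=True, max_new_tokens=512):
    """Hypothetical helper: render the messages, generate greedily, return only the answer."""
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=enable_thinking,
    )
    # tokenize the rendered prompt and move it to the device of the model
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
    # strip the prompt tokens so that only the newly generated part is decoded
    prompt_length = len(inputs["input_ids"][0])
    return tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)

# Usage in the notebook, with `model`, `tokenizer` and `messages` from the cells above:
# print(chat(model, tokenizer, messages))
```

Calling it once with each of the two `messages` lists shows how much the system prompt changes the willingness of the model to answer.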