# Using Qwen3-8B with EXL2 quantization and kernels

ExLlamaV2 is an inference library that brings its own
quantization format (EXL2) and dedicated GPU kernels.
Unfortunately, it is not integrated into the Hugging Face
ecosystem.

However, it is really fast! Therefore, we will walk through
this notebook and see what is different. The real *revolution*
happens behind the scenes. ExLlamaV2 takes a shortcut: it skips
`transformers` and runs its own hand-written CUDA kernels
on the quantized weights. These kernels are highly optimized
and account for much of the excellent performance.

In [None]:
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer, Timer
from exllamav2.generator import ExLlamaV2DynamicGenerator, ExLlamaV2Sampler

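ExLlamaV2 keeps a key/value cache so that earlier tokens do not have to be recomputed at every generation step. Its size is specified in tokens: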

In [None]:
total_cache_tokens = 16384
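
To get a feeling for what a cache of this size costs, here is a rough back-of-the-envelope sketch. The layer count, number of key/value heads and head dimension below are assumptions about Qwen3-8B (check the `config.json` in your model directory), `kv_cache_bytes` is a hypothetical helper, and two bytes per value assumes a 16-bit cache.

```python
def kv_cache_bytes(tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` positions."""
    # Factor 2: every layer stores one tensor for keys and one for values.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * tokens


total_cache_tokens = 16384

# Assumed model shape, not read from the real config.
size = kv_cache_bytes(total_cache_tokens, num_layers=36, num_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.2f} GiB")  # prints 2.25 GiB for the assumed shape
```

The estimate grows linearly with the number of cached tokens, so halving `total_cache_tokens` halves the memory reserved for the cache.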

You have to download EXL2 models manually (from Hugging Face) and work with local directories:

In [None]:
model_dir = "/home/cwinkler/oreilly/models/Qwen3-8B-exl2"
config = ExLlamaV2Config(model_dir)
config.arch_compat_overrides()
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len = total_cache_tokens, lazy = True)
model.load_autosplit(cache, progress = True)
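
Because everything depends on a local directory, it can save time to check it before loading. The following is a minimal sketch using only the standard library. `missing_model_files` is a hypothetical helper, the relative path is a placeholder, and the assumption is that a usable directory contains a `config.json` and at least one `*.safetensors` file.

```python
from pathlib import Path


def missing_model_files(model_dir):
    """Return a list of problems found in a local model directory (empty if none)."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"{root} is not a directory"]
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json not found")
    if not any(root.glob("*.safetensors")):
        problems.append("no *.safetensors weight files found")
    return problems


# Placeholder path, replace it with your own model directory.
print(missing_model_files("models/Qwen3-8B-exl2"))
```

An empty list means the basic files are present. It does not prove that the weights are valid EXL2 tensors.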

Of course, these models also have a tokenizer:

In [None]:
tokenizer = ExLlamaV2Tokenizer(config)

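However, it is not so easy to apply the chat template, so we write the prompt out by hand. The empty `<think>` block tells Qwen3 to skip its reasoning phase: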

In [None]:
prompt = """<|im_start|>system\nYou are a helpful assistant.<|im_end|>
<|im_start|>user\nTell me about O'Reilly online learning!<|im_end|>
<|im_start|>assistant
<think>

</think>

"""
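
If you need more than one prompt, writing the template by hand gets tedious. Here is a small sketch of a helper that assembles the same ChatML-style string from a list of messages. `build_chatml_prompt` and its `thinking` flag are hypothetical names and not part of ExLlamaV2.

```python
def build_chatml_prompt(messages, thinking=False):
    """Assemble a ChatML-style prompt that ends with an open assistant turn."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    parts.append("<|im_start|>assistant\n")
    if not thinking:
        # An empty think block asks the model to answer without reasoning.
        parts.append("<think>\n\n</think>\n\n")
    return "".join(parts)


prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Tell me about O'Reilly online learning!"},
])
print(prompt)
```

With `thinking=False` the result is character for character the same as the handwritten prompt above.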

Some parameters for generating the text:

In [None]:
max_new_tokens = 1024
gen_settings = ExLlamaV2Sampler.Settings.greedy()
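
Greedy settings mean that the sampler always takes the most likely next token instead of drawing one at random, which makes runs easier to compare. This toy sketch shows the idea in plain Python. It is not the library's implementation, and the scores are made up.

```python
def greedy_pick(logits):
    """Return the index of the largest logit (ties resolve to the first one)."""
    return max(range(len(logits)), key=logits.__getitem__)


toy_logits = [1.2, -0.3, 4.7, 0.9]  # hypothetical scores for a 4-token vocabulary
print(greedy_pick(toy_logits))  # prints 2
```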

Instantiate the generator and warm it up:

In [None]:
generator = ExLlamaV2DynamicGenerator(
    model = model,
    cache = cache,
    tokenizer = tokenizer,
)
generator.warmup()

This is the actual text generation:

In [None]:
with Timer() as used:
    output = generator.generate(
        prompt = prompt,
        max_new_tokens = max_new_tokens,
        encode_special_tokens = True,
        gen_settings = gen_settings
    )
print(output)

It is fast! Tokens per second, assuming all `max_new_tokens` were actually generated:

In [None]:
max_new_tokens / used.interval
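
The division above is only exact if the model really produced `max_new_tokens` tokens, and the measured time also includes processing the prompt. This sketch shows the same measurement with the standard library. `StopWatch` and `tokens_per_second` are hypothetical helpers, and the `sleep` call only stands in for the real generation.

```python
import time


class StopWatch:
    """Minimal timing context manager that exposes the elapsed seconds as `interval`."""

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        self.interval = time.perf_counter() - self._start
        return False  # never swallow exceptions


def tokens_per_second(num_tokens, seconds):
    """Throughput for `num_tokens` generated in `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return num_tokens / seconds


with StopWatch() as used:
    time.sleep(0.01)  # placeholder for the call to generator.generate(...)

print(f"{tokens_per_second(1024, used.interval):.0f} tokens/s (toy timing)")
```

If generation can stop early, pass the number of tokens that were actually generated instead of the limit.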

And it uses less GPU memory than the full model:

In [None]:
!nvidia-smi
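
To put the number from `nvidia-smi` into perspective, you can estimate the size of the weights alone. This is a rough sketch. The parameter count of about 8 billion and the 4.0 bits per weight are assumptions (the actual bitrate depends on the EXL2 model you downloaded), and `weight_bytes` is a hypothetical helper.

```python
def weight_bytes(num_params, bits_per_weight):
    """Approximate storage for the weights, ignoring cache and activations."""
    return num_params * bits_per_weight / 8


NUM_PARAMS = 8e9  # assumption: roughly 8 billion parameters

fp16 = weight_bytes(NUM_PARAMS, 16)   # unquantized 16-bit weights
exl2 = weight_bytes(NUM_PARAMS, 4.0)  # assumed 4.0 bits per weight

print(f"FP16: {fp16 / 2**30:.1f} GiB, 4.0 bpw: {exl2 / 2**30:.1f} GiB")
```

The value reported by `nvidia-smi` will be higher than the quantized estimate, because it also includes the cache and other runtime buffers.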