## AQLM inference example

<a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/streaming_example.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

**Install the `aqlm` library**

It is the only extra dependency needed to run AQLM models.

In [1]:
%%capture
!pip install aqlm[gpu]==1.0.0

**Load the model as usual**

Just don't forget to add:
 - `trust_remote_code=True` to pull the inference code
 - `torch_dtype="auto"` to load the model in its native dtype.

The tokenizer is just a normal `Llama 2` tokenizer.

**Check that the output is what one would expect from Llama-2-7b**

In [2]:
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
import transformers
import torch

In [3]:
quantized_model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf", trust_remote_code=True, torch_dtype=torch.float16,
).cuda()
tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf")

Run a short generation first to initialize CUDA and automatically compile the kernels.

In [4]:
output = quantized_model.generate(tokenizer("", return_tensors="pt")["input_ids"].cuda(), max_new_tokens=10)
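One way to see what the warm-up buys is to time successive calls: the first call includes the one-off initialization and kernel compilation, while later calls do not. The sketch below is only an illustration; `time_calls` is a hypothetical helper (not part of `aqlm` or `transformers`), and the commented usage assumes the `quantized_model` and `tokenizer` objects from the cells above.

```python
import time
from typing import Callable, List


def time_calls(fn: Callable[[], object], repeats: int = 3) -> List[float]:
    """Call `fn` `repeats` times and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


# Usage with the notebook's objects (assumed to exist from the cells above):
# prompt = tokenizer("", return_tensors="pt")["input_ids"].cuda()
# print(time_calls(lambda: quantized_model.generate(prompt, max_new_tokens=10)))
```

If the first duration is much larger than the rest, the difference is roughly the one-off setup cost that the warm-up cell absorbs.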

Generate output using GPU streaming.

In [5]:
inputs = tokenizer(["An increasing sequence: one,"], return_tensors="pt")["input_ids"].cuda()

streamer = TextStreamer(tokenizer)
_ = quantized_model.generate(inputs, streamer=streamer, max_new_tokens=120)

<s> An increasing sequence: one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, twenty-one, twenty-two, twenty-three, twenty-four, twenty-five, twenty-six, twenty-seven, twenty-eight, twenty-nine, thirty, thirty-one, thirty-two, thirty-three, thirty-four, thirty-five, thirty-six, thirty-seven, thirty-eight


## On CPU

In [6]:
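# Load the 2x8 checkpoint in float32 and keep it on the CPU.
quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf", trust_remote_code=True, torch_dtype=torch.float32,
).cpu()
# The tokenizer is loaded from the 1x16 repository; it is the same Llama 2 tokenizer.
tokenizer = AutoTokenizer.from_pretrained("BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf")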
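The GPU and CPU cells differ only in the checkpoint, the dtype, and the target device. A small sketch that makes this choice explicit is shown here; `pick_load_config` and `load_quantized` are hypothetical helpers, and the checkpoint/dtype pairing simply mirrors the cells in this notebook rather than any requirement documented by `aqlm`.

```python
from typing import Dict


def pick_load_config(use_cuda: bool) -> Dict[str, str]:
    """Return the checkpoint, dtype name, and device used in this notebook's cells."""
    if use_cuda:
        return {
            "model_id": "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf",
            "dtype": "float16",
            "device": "cuda",
        }
    return {
        "model_id": "BlackSamorez/Llama-2-7b-AQLM-2Bit-2x8-hf",
        "dtype": "float32",
        "device": "cpu",
    }


def load_quantized(use_cuda: bool):
    """Load the quantized model for the chosen device (imports are deferred until needed)."""
    import torch
    from transformers import AutoModelForCausalLM

    cfg = pick_load_config(use_cuda)
    model = AutoModelForCausalLM.from_pretrained(
        cfg["model_id"],
        trust_remote_code=True,
        torch_dtype=getattr(torch, cfg["dtype"]),  # e.g. torch.float16
    )
    return model.to(cfg["device"])
```

A call such as `load_quantized(torch.cuda.is_available())` would then pick the GPU path when a CUDA device is present and the CPU path otherwise.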

Run a short generation to compile the AQLM numba kernels.

In [7]:
output = quantized_model.generate(tokenizer("", return_tensors="pt")["input_ids"].cpu(), max_new_tokens=10)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Compiling AQLM numba kernel with parameters: kernel_key=(8, 4096, 4096, 2)
Compiling AQLM numba kernel with parameters: kernel_key=(8, 11008, 4096, 2)
Compiling AQLM numba kernel with parameters: kernel_key=(8, 4096, 11008, 2)
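The `1x16` and `2x8` suffixes in the checkpoint names can be read as (number of codebooks) x (bits per code). Assuming each group of codes covers 8 consecutive weights (an assumption here, suggested by the leading `8` in the `kernel_key` lines above but not stated in this notebook), both layouts spend 2 bits per weight on codes. The sketch ignores the storage for the codebooks and scales themselves; `code_bits_per_weight` is a hypothetical helper.

```python
def code_bits_per_weight(num_codebooks: int, bits_per_code: int, group_size: int = 8) -> float:
    """Bits spent on codes per weight: each group of `group_size` weights stores
    one `bits_per_code`-bit index per codebook."""
    return num_codebooks * bits_per_code / group_size


print(code_bits_per_weight(1, 16))  # 1x16 layout -> 2.0
print(code_bits_per_weight(2, 8))   # 2x8 layout  -> 2.0
```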


Generate output using CPU streaming.

**Warning:** Colab's CPU is slow; use a more powerful CPU for comfortable generation.

In [8]:
inputs = tokenizer(["An increasing sequence: one,"], return_tensors="pt")["input_ids"].cpu()

streamer = TextStreamer(tokenizer)
_ = quantized_model.generate(inputs, streamer=streamer, max_new_tokens=120)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> An increasing sequence: one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, twenty-one, twenty-two, twenty-three, twenty-four, twenty-five, twenty-six, twenty-seven, twenty-eight, twenty-nine, twenty-ten, twenty-eleven, twenty-twelve, twenty-twenty, twenty-twenty, twenty-twenty-three, twenty-twenty
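The attention-mask warning in the CPU logs above appears because only `input_ids` is passed to `generate`. A minimal sketch of passing the mask and the pad token explicitly follows; `generate_with_mask` is a hypothetical helper, and it assumes `model` and `tokenizer` behave like the Hugging Face objects loaded earlier.

```python
def generate_with_mask(model, tokenizer, prompt, device="cpu", max_new_tokens=120, streamer=None):
    """Tokenize `prompt` and call `generate` with an explicit attention mask and pad token."""
    # The tokenizer output holds both `input_ids` and `attention_mask`; move both to the device.
    encoded = tokenizer([prompt], return_tensors="pt").to(device)
    return model.generate(
        input_ids=encoded["input_ids"],
        attention_mask=encoded["attention_mask"],
        pad_token_id=tokenizer.eos_token_id,  # same fallback the warning reports
        max_new_tokens=max_new_tokens,
        streamer=streamer,
    )
```

With the notebook's objects, a call could look like `_ = generate_with_mask(quantized_model, tokenizer, "An increasing sequence: one,", streamer=TextStreamer(tokenizer))`.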
