## AQLM transformers integration example

**Install the `aqlm` library**
- It is the only extra dependency needed to run AQLM models.
- Add `[gpu]` to install the required CUDA-specific dependencies.

In [None]:
%%capture
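# Quote the requirements: unquoted, the shell treats ">" as an output redirect and drops the version pin.
!pip install "aqlm[gpu]>=1.0.1"
!pip install "accelerate>=0.27.0"
!pip install "transformers>=4.38.0"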
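
To confirm that the minimum versions were actually installed, the environment can be inspected from Python. The following is a minimal sketch using only the standard library; `MINIMUM_VERSIONS`, `parse_version` and `check_versions` are hypothetical helper names, and the simple parser treats pre-releases such as `4.38.0.dev0` as equal to the release.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions taken from the install cell above.
MINIMUM_VERSIONS = {"aqlm": "1.0.1", "accelerate": "0.27.0", "transformers": "4.38.0"}


def parse_version(text):
    """Turn the leading numeric part of a version string into a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(part) for part in match.group().split(".")) if match else ()


def check_versions(minimums=MINIMUM_VERSIONS):
    """Return {package: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            # Package is missing entirely.
            report[name] = (None, False)
            continue
        report[name] = (installed, parse_version(installed) >= parse_version(minimum))
    return report


for name, (installed, ok) in check_versions().items():
    status = "ok" if ok else "needs install/upgrade"
    print(f"{name}: {installed or 'not installed'} ({status})")
```

If any package is reported as needing an upgrade, rerunning the install cell and restarting the runtime is usually enough.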

**Load the model as usual**

The tokenizer is just a normal `Mixtral` tokenizer.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
    torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf")

Run a few forward passes to initialize CUDA and trigger automatic kernel compilation. This warm-up is kept separate so that it does not skew the generation speed benchmark below.

In [None]:
%%capture
output = quantized_model.generate(tokenizer("", return_tensors="pt")["input_ids"].cuda(), max_new_tokens=10)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
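
The warning printed above appears because only `input_ids` is passed to `generate`. One way to avoid it, sketched here under the assumption that the model and tokenizer are loaded as in the earlier cell, is to forward the tokenizer's full output (which includes `attention_mask`) and to set `pad_token_id` explicitly to the same value `transformers` falls back to. `generate_with_mask` is a hypothetical helper name, not part of either library.

```python
def generate_with_mask(model, tokenizer, prompt, **generate_kwargs):
    """Tokenize `prompt` and call `model.generate` with an explicit attention mask."""
    # The tokenizer output holds both `input_ids` and `attention_mask`;
    # `.to(model.device)` moves them to the device of the model's first parameters.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    return model.generate(
        **inputs,
        # Same fallback the warning reports: pad with the end-of-sequence token.
        pad_token_id=tokenizer.eos_token_id,
        **generate_kwargs,
    )


# Example call with the objects from this notebook (requires the loaded model):
# output = generate_with_mask(quantized_model, tokenizer, "I'm AQLM, ",
#                             min_new_tokens=128, max_new_tokens=128)
```

For a single unpadded prompt the generated tokens should not change; the explicit arguments mainly matter once prompts are batched and padded.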


**Measure generation speed**

In [None]:
%%time
output = quantized_model.generate(tokenizer("I'm AQLM, ", return_tensors="pt")["input_ids"].cuda(), min_new_tokens=128, max_new_tokens=128)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


CPU times: user 21.5 s, sys: 0 ns, total: 21.5 s
Wall time: 21.6 s


Note that `transformers` generation is not the fastest implementation, and its speed here is heavily influenced by the CPU capabilities of _Google Colab_.
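
The wall time above can be converted into a throughput figure. This is a small sketch of the arithmetic, using the 128 generated tokens and the 21.6 s wall time from this run; `tokens_per_second` is a hypothetical helper, and the result also includes the short prompt-processing step, so it is only an approximate decoding rate.

```python
def tokens_per_second(new_tokens, wall_seconds):
    """Average generation throughput over one `generate` call."""
    return new_tokens / wall_seconds


NEW_TOKENS = 128     # min_new_tokens == max_new_tokens == 128 in the benchmark cell
WALL_SECONDS = 21.6  # "Wall time" reported by %%time in this run

rate = tokens_per_second(NEW_TOKENS, WALL_SECONDS)
print(f"{rate:.2f} tokens/s, {1000 / rate:.0f} ms per token")
```

For this run that comes to roughly 5.9 tokens per second.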

**Check that the output is what one would expect from Mixtral**

In [None]:
print(tokenizer.decode(output[0]))

<s> I'm AQLM, 20 years old, and I'm a student at the University of California, Berkeley. I'm a member of the Berkeley Student Union, and I'm a member of the Berkeley Student Union. I'm a member of the Berkeley Student Union. I'm a member of the Berkeley Student Union. I'm a member of the Berkeley Student Union. I'm a member of the Berkeley Student Union. I'm a member of the Berkeley Student Union. I'm a member of the Berkeley Student Union. I'm a member of the Berkeley Student Union. I'm a member of the Berkeley Student


**Check peak memory usage**

In [None]:
import torch

# max_memory_allocated() returns bytes; 1e-9 converts to gigabytes (GB).
print(f"Peak memory usage: {torch.cuda.max_memory_allocated()*1e-9:.2f} GB")

Peak memory usage: 13.68 GB


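Indeed, this peak is consistent with roughly 2 bits per model weight. The measurement also covers tensors that are not quantized and the memory allocated during generation, so it sits slightly above an exact 2-bit footprint.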
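
The bits-per-weight estimate can be reproduced from the peak reported above. The sketch below assumes a total of about 46.7 billion parameters for Mixtral-8x7B; that figure is not stated in this notebook and is an assumption here, as are the constant names.

```python
PEAK_BYTES = 13.68e9    # peak memory reported above
ASSUMED_PARAMS = 46.7e9  # assumed total parameter count of Mixtral-8x7B

# Effective storage per parameter, counting everything that was resident on the GPU.
bits_per_weight = PEAK_BYTES * 8 / ASSUMED_PARAMS

# For comparison: the same parameter count stored in 16-bit floats (2 bytes each).
fp16_gigabytes = ASSUMED_PARAMS * 2 / 1e9

print(f"Effective bits per weight: {bits_per_weight:.2f}")
print(f"Approximate fp16 footprint: {fp16_gigabytes:.1f} GB")
```

Under that assumption the peak works out to a little over 2 bits per parameter, compared with more than 90 GB for 16-bit weights.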