## AQLM inference example

<a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/colab_example.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

**Install the requirements**
- `aqlm` is the only extra dependency to run AQLM models.
- Install `accelerate` from its `main` branch to pick up the latest bug fixes.

In [None]:
%%capture
!pip install aqlm[gpu]==1.0.0
!pip install git+https://github.com/huggingface/accelerate.git@main

**Load the model as usual**

Just don't forget to add:
 - `trust_remote_code=True` to pull the inference code.
 - `torch_dtype="auto"` to load the model in its native dtype.
 - `device_map="cuda"` to load the model on GPU straight away, saving RAM.

The tokenizer is just a normal `Mixtral` tokenizer.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

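quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
    trust_remote_code=True,  # pull the AQLM inference code
    torch_dtype="auto",      # keep the checkpoint's native dtype
    device_map="cuda",       # load straight onto the GPU, so no extra .cuda() call is needed
)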
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

Run a short generation first to initialize CUDA and trigger the automatic kernel compilation. This is done separately so that the one-time startup cost does not skew the generation speed benchmark below.

In [None]:
%%capture
output = quantized_model.generate(tokenizer("", return_tensors="pt")["input_ids"].cuda(), max_new_tokens=10)

**Measure generation speed**

In [None]:
%%time
output = quantized_model.generate(tokenizer("I'm AQLM, ", return_tensors="pt")["input_ids"].cuda(), min_new_tokens=128, max_new_tokens=128)

CPU times: user 7.21 s, sys: 104 ms, total: 7.31 s
Wall time: 7.38 s


Note that `transformers` generation is not the fastest implementation, and its speed is heavily influenced by the CPU capabilities of the _Google Colab_ runtime.
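To turn the wall time above into a throughput figure, divide the number of generated tokens by the elapsed seconds. The sketch below does that arithmetic for the run shown here (128 new tokens, since `min_new_tokens` and `max_new_tokens` are both 128, in 7.38 s). The helper names `tokens_per_second` and `timed` are hypothetical and not part of `transformers` or `aqlm`.

```python
import time


def tokens_per_second(n_new_tokens: int, wall_seconds: float) -> float:
    """Return generation throughput in tokens per second."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return n_new_tokens / wall_seconds


def timed(fn):
    """Call fn() and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Numbers taken from the %%time output above: 128 new tokens in 7.38 s.
print(f"{tokens_per_second(128, 7.38):.1f} tokens/s")  # 17.3 tokens/s
```

Outside of a notebook, `timed` could wrap the `generate` call in place of the `%%time` magic, for example `output, elapsed = timed(lambda: quantized_model.generate(...))`.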

**Check that the output is what one would expect from Mixtral**

In [None]:
print(tokenizer.decode(output[0]))

<s> I'm AQLM, 20 years old, and I'm from the Netherlands. I'm a student and I'm currently studying at the University of Amsterdam. I'm a very active person and I love to meet new people. I'm a very open person and I'm always looking for new things to do. I'm a very active person and I love to meet new people. I'm a very open person and I'm always looking for new things to do. I'm a very active person and I love to meet new people. I'm a very open person and I'm always looking for new


**Check peak memory usage**

In [None]:
import torch

print(f"Peak memory usage: {torch.cuda.max_memory_allocated()*1e-9:.2f} GB")

Peak memory usage: 2.62 GB


Indeed, it's ~2 bits per model weight.
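A rough way to sanity-check a bits-per-weight figure is to convert peak memory to bits and divide by the parameter count of the original, unquantized model. The sketch below shows only the arithmetic: the helper name `bits_per_weight` is hypothetical, and the example numbers are illustrative placeholders, not measurements from this notebook. Peak memory also includes activations, the KV cache, and any layers left unquantized, so the result is an upper bound on the quantized weight size.

```python
def bits_per_weight(peak_bytes: float, n_params: float) -> float:
    """Upper-bound estimate of storage bits used per model weight."""
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return peak_bytes * 8 / n_params


# Illustrative placeholder values (assumptions, not measured here):
# 2.0e9 bytes of peak memory for a model with 8.0e9 parameters.
example = bits_per_weight(peak_bytes=2.0e9, n_params=8.0e9)
print(f"{example:.2f} bits per weight")  # 2.00 bits per weight
```

In the notebook, `peak_bytes` would come from `torch.cuda.max_memory_allocated()`, while `n_params` has to be the parameter count of the unquantized model, since the quantized checkpoint stores codes rather than the original weights.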