## AQLM inference example

<a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/colab_example.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

**Install the `aqlm` library**

It is the only extra dependency needed to run AQLM models.

In [1]:
%%capture
!pip install aqlm[gpu]==1.0.0dev10

**Load the model as usual**

Just remember to pass two extra arguments:
 - `trust_remote_code=True` to pull the inference code
 - `torch_dtype="auto"` to load the model in its native dtype.

The tokenizer is just a normal `Llama 2` tokenizer.

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf", trust_remote_code=True, torch_dtype="auto",
).cuda()
tokenizer = AutoTokenizer.from_pretrained("BlackSamorez/Llama-2-7b-AQLM-2Bit-1x16-hf")



Run a short generation first to initialize CUDA and trigger the automatic kernel compilation. This happens in a separate cell so that the one-time setup cost does not affect the generation speed benchmark below.

In [3]:
%%capture
output = quantized_model.generate(tokenizer("", return_tensors="pt")["input_ids"].cuda(), max_new_tokens=10)
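The warm-up-then-measure pattern used here can be sketched as a small helper. This is only an illustration: `timed_after_warmup`, its `warmup` and `sync` parameters, and the stand-in `fake_generate` are hypothetical names, not part of `aqlm` or `transformers`.

```python
import time
from typing import Callable, Optional


def timed_after_warmup(
    fn: Callable[[], object],
    warmup: int = 1,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Call fn `warmup` times untimed, then return the wall time of one more call."""
    for _ in range(warmup):
        fn()  # one-time costs (CUDA init, kernel compilation) are paid here
    if sync is not None:
        sync()  # wait for pending GPU work so it is not counted below
    start = time.perf_counter()
    fn()
    if sync is not None:
        sync()  # make sure the timed call has actually finished
    return time.perf_counter() - start


# Stand-in workload (hypothetical) so the sketch runs without a GPU.
calls = []


def fake_generate():
    calls.append(len(calls))


elapsed = timed_after_warmup(fake_generate, warmup=2)
print(f"{len(calls)} calls in total, last one took {elapsed:.6f} s")
```

In this notebook, one would presumably pass a lambda wrapping `quantized_model.generate(...)` as `fn` and `torch.cuda.synchronize` as `sync`.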

**Measure generation speed**

In [4]:
%%time
output = quantized_model.generate(tokenizer("I'm AQLM, ", return_tensors="pt")["input_ids"].cuda(), min_new_tokens=128, max_new_tokens=128)

CPU times: user 9.87 s, sys: 165 ms, total: 10 s
Wall time: 13.5 s
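To turn the timing above into a throughput figure, divide the number of generated tokens by the wall time. A minimal sketch, assuming exactly 128 new tokens were produced (which `min_new_tokens=128, max_new_tokens=128` enforces); `tokens_per_second` is a hypothetical helper name:

```python
def tokens_per_second(new_tokens: int, wall_seconds: float) -> float:
    """Average decoding throughput over a single generate() call."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return new_tokens / wall_seconds


# Values taken from the run above: 128 new tokens, 13.5 s wall time.
rate = tokens_per_second(128, 13.5)
print(f"{rate:.2f} tokens/s, {1000 / rate:.0f} ms per token")
# prints "9.48 tokens/s, 105 ms per token"
```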


Note that `transformers` generation is not the fastest implementation, and its speed here is heavily influenced by the CPU capabilities of the _Google Colab_ instance.

**Check that the output is what one would expect from Llama-2-7b**

In [5]:
print(tokenizer.decode(output[0]))

<s> I'm AQLM, 20 years old, and I'm from the Netherlands. I'm a student and I'm currently studying at the University of Amsterdam. I'm a very active person and I love to meet new people. I'm a very open person and I'm always looking for new things to do. I'm a very active person and I love to meet new people. I'm a very open person and I'm always looking for new things to do. I'm a very active person and I love to meet new people. I'm a very open person and I'm always looking for new
