# Efficiently serving Large Language Models in 2-bit with `aqlm` and `vLLM`

<a target="_blank" href="https://colab.research.google.com/github/Vahe1994/AQLM/blob/main/notebooks/aqlm_vllm.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This notebook walks through the recent `aqlm` integration with the [`vLLM`](https://github.com/vllm-project/vllm) serving framework.

To the best of our knowledge, this is the most efficient way to run AQLM in a high-performance production setting.


In [None]:
!pip install "vllm>=0.4.2"

### Loading the model

The workflow doesn't differ in any major way from the usual [quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html) workflow taught by `vLLM`.

The only addition is that we recommend setting `enforce_eager=True`, which disables CUDA graph capture. Capturing CUDA graphs introduces a large memory overhead that undermines the memory savings from quantization.
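
To put the memory argument in perspective, here is a back-of-the-envelope estimate of weight storage. It is only a sketch: it assumes roughly 8 billion parameters, and it ignores AQLM codebooks, any layers left unquantized, activations and the KV cache, so the real footprint will be larger. `weight_memory_gib` is a hypothetical helper, not part of `aqlm` or `vLLM`.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a given bit width."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 8e9  # assumption: rough parameter count of an 8B model

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
two_bit_gib = weight_memory_gib(NUM_PARAMS, 2)

print(f"fp16 weights : {fp16_gib:.1f} GiB")
print(f"2-bit weights: {two_bit_gib:.1f} GiB")
print(f"saved        : {fp16_gib - two_bit_gib:.1f} GiB")
```

Under these assumptions the quantized weights are about eight times smaller, so any fixed overhead weighs proportionally more against them than it would against an fp16 model.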

In [2]:
from vllm import LLM, SamplingParams

In [None]:
llm = LLM(
    model="ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16", # An AQLM model checkpoint
    enforce_eager=True,  # Skip CUDA graph capture to save memory
    gpu_memory_utilization=0.99,
    max_model_len=1024,
)
tokenizer = llm.get_tokenizer()

In [4]:
conversations = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': 'Generate a poem about the sun in Spanish'}],
    tokenize=False,
)
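
To make the string built above less opaque, the following sketch mimics the chat format by hand. It assumes the checkpoint ships the standard Llama 3 Instruct chat template; `llama3_prompt` is a hypothetical illustration, and the tokenizer's own template remains authoritative. It also shows the effect of the `add_generation_prompt` argument of `apply_chat_template`, which the cell above leaves at its default. That would explain why the completion printed further down begins with the assistant header.

```python
def llama3_prompt(messages, add_generation_prompt=False, bos_token="<|begin_of_text|>"):
    """Hand-rolled approximation of the Llama 3 chat format (illustration only)."""
    parts = [bos_token]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model starts directly with its reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [{"role": "user", "content": "Generate a poem about the sun in Spanish"}]

# Without the generation prompt, the text ends right after the user turn.
print(repr(llama3_prompt(messages)))
# With it, the assistant header is already part of the prompt.
print(repr(llama3_prompt(messages, add_generation_prompt=True)))
```

With the real tokenizer, the equivalent would be to pass `add_generation_prompt=True` to `tokenizer.apply_chat_template`.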

In [6]:
outputs = llm.generate(
    [conversations],
    SamplingParams(
        temperature=0.8,
        top_p=0.9,
        max_tokens=1024,
        stop_token_ids=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")],
    ),
    use_tqdm=False,
)

In [7]:
print(outputs[0].outputs[0].text)

<|start_header_id|>assistant<|end_header_id|>

¡Con gusto! Here's a poem about the sun in Spanish:

Sol de oro, de luz me envuelve,
En cada momento, me ilumina con calidez,
Con tus rayos, me acaliento, me cálculo,
Y me guía en el camino, en cada momento.

Tu faz es el sol, que me hace sentir,
Que todo es posible, todo es real,
Que no hay nada que no pueda hacer,
Y que mi vida es una eternidad, en tu luz.

Sol de mi vida, sin ti no soy,
Un ser sin luz, sin ti no soy,
Y sin ti, el mundo es una oscuridad,
Y sin ti, no hay vida, no hay vida.

¡Sol de mi vida, te amo tanto!
¡Sol de mi vida, que no te deje ir!
¡Sol de mi vida, que siempre estés conmigo,
Y que ilumines mi camino, y me guíes!

Translation:

Golden sun, you envelop me with light,
In every moment, you illuminate me with warmth,
With your rays, you calm me, you measure,
And guide me on the path, in every moment.

Your face is the sun, that makes me feel,
That everything is possible, everything is real,
That there's nothing that I

### Benchmarking

Let us measure the generation speed.

_Spoiler_: <details>
  It's fast!
</details>

In [8]:
from time import perf_counter

start = perf_counter()
llm.generate(
    [conversations],
    SamplingParams(
        min_tokens=128,
        max_tokens=128,
    ),
    use_tqdm=False,
)
end = perf_counter()

print(f"Tok/s: {128/(end - start):.2f}")

Tok/s: 54.03
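
A single timed call can be noisy, and it also includes prompt processing. A slightly more careful measurement, sketched below, discards a warm-up run and averages several timed runs. `measure_throughput` and `fake_generate` are hypothetical names introduced for illustration; the stand-in generator only sleeps, so the sketch runs without a GPU.

```python
from statistics import mean
from time import perf_counter, sleep


def measure_throughput(generate_fn, n_runs=3, warmup=1):
    """Return the mean tokens/second over n_runs calls of generate_fn.

    generate_fn is a zero-argument callable that performs one generation
    and returns the number of tokens it produced.
    """
    for _ in range(warmup):
        generate_fn()  # warm-up runs are not timed
    rates = []
    for _ in range(n_runs):
        start = perf_counter()
        num_tokens = generate_fn()
        elapsed = perf_counter() - start
        rates.append(num_tokens / elapsed)
    return mean(rates)


def fake_generate():
    """Stand-in for a real generation call: pretends to emit 128 tokens."""
    sleep(0.01)
    return 128


print(f"Tok/s: {measure_throughput(fake_generate):.2f}")
```

To benchmark the actual model, `generate_fn` could wrap the `llm.generate` call above and return `len(outputs[0].outputs[0].token_ids)`, so the rate reflects the tokens really generated rather than an assumed 128.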
