# Quantize and speed up any LLM

In [None]:
# If you are not running the latest version of this tutorial, make sure to install the matching version of pruna.
# The following command installs the pruna version pinned for this tutorial (0.2.5).
!pip install pruna==0.2.5

Collecting pruna==0.2.5
  Using cached pruna-0.2.5-py3-none-any.whl.metadata (2.9 kB)
Collecting ConfigSpace>=1.2.1 (from pruna==0.2.5)
  Using cached configspace-1.2.1.tar.gz (130 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting DeepCache (from pruna==0.2.5)
  Using cached DeepCache-0.1.1-py3-none-any.whl.metadata (16 kB)
Collecting bitsandbytes (from pruna==0.2.5)
  Using cached bitsandbytes-0.46.0-py3-none-manylinux_2_24_x86_64.whl.metadata (10 kB)
Collecting blobfile (from pruna==0.2.5)
  Using cached blobfile-3.0.0-py3-none-any.whl.metadata (15 kB)
Collecting codecarbon (from pruna==0.2.5)
  Using cached codecarbon-3.0.2-py3-none-any.whl.metadata (9.1 kB)
Collecting colorama (from pruna==0.2.5)
  Using cached colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Collecting ctranslate2==4.5.0 (from pruna==0.2.5)
  Using cached ctranslate2-4.5.0-cp3

In [None]:
import os
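# Set your Hugging Face access token (needed to download gated models such as Llama).
# Fill it in locally and avoid committing a real token with the notebook.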
os.environ["HF_TOKEN"] = ""
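A token typed directly into a cell is easy to leak when the notebook is shared. One possible alternative is sketched below, assuming the token is either already exported as `HF_TOKEN` or entered at a hidden prompt. `resolve_hf_token` is a hypothetical helper written for this illustration; it is not part of pruna or transformers.

```python
import getpass
import os


def resolve_hf_token(env=os.environ, prompt=getpass.getpass):
    """Return the token from `env`, or ask for it if it is missing.

    Hypothetical helper for illustration only.
    """
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        # Nothing exported: ask for the token without echoing it to the screen.
        token = prompt("Hugging Face token: ").strip()
    return token


# Usage (left commented out so the block does not wait for input when run unattended):
# os.environ["HF_TOKEN"] = resolve_hf_token()
```

The `env` and `prompt` parameters exist only to make the helper easy to exercise without touching the real environment.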

### 1. Loading the LLM

First, load your LLM and its associated tokenizer.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1b-Instruct"

# We observed better performance with bfloat16 precision.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, device_map="cuda",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)


### 2. Testing the Original Model Speed

In [None]:
import time

# Warm up the model so that one-time setup costs are excluded from the measurement
for _ in range(3):
    with torch.no_grad():
        inp = tokenizer(["This is a test of this large language model"], return_tensors="pt")
        input_ids = inp['input_ids'].cuda()
        # max_length == min_length forces exactly 56 new tokens
        generated_ids = model.generate(input_ids, max_length=input_ids.shape[1] + 56, min_length=input_ids.shape[1] + 56)
        text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

# Wait for pending GPU work to finish before starting the clock
torch.cuda.synchronize()
t = time.time()
with torch.no_grad():
    inp = tokenizer(["This is a test of this large language model"], return_tensors="pt")
    input_ids = inp['input_ids'].cuda()
    generated_ids = model.generate(input_ids, max_length=input_ids.shape[1] + 56, min_length=input_ids.shape[1] + 56)
    text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(text)
# Wait for generation to finish on the GPU before reading the clock
torch.cuda.synchronize()
print(time.time() - t)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

['This is a test of this large language model\'s ability to generate coherent and grammatically correct text.\n\nI\'m ready to begin. What is the prompt?\n\n(Note: you can also use the "go" command to start the test, but I\'ll assume you want to start with a prompt.)\n\n\nExample prompt: "Write']
0.7246272563934326
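The warm-up-then-time pattern used in this cell can be wrapped in a small helper so that the same procedure is applied to every model being compared. The following is a minimal sketch, not part of pruna; `time_call` and its parameters are names chosen for this illustration.

```python
import time


def time_call(fn, warmup=3, runs=1, sync=None):
    """Return the mean wall-clock seconds per call of `fn`, measured after warmup.

    `sync` is an optional zero-argument callable (for example
    `torch.cuda.synchronize`) invoked before starting and before stopping
    the clock, so that queued GPU work is included in the measurement.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / runs
```

With the objects defined in this notebook, a call might look like `time_call(lambda: model.generate(input_ids, max_new_tokens=56), sync=torch.cuda.synchronize)`. Averaging over several `runs` gives a steadier number than a single measurement.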


### 3. Initializing the Smash Config

In [None]:
from pruna import SmashConfig

smash_config = SmashConfig()
# Select the quantizer
smash_config['quantizer'] = 'hqq'
smash_config['hqq_weight_bits'] = 4  # 2 and 8 bits also work, but 4 gives the best performance
smash_config['hqq_compute_dtype'] = 'torch.bfloat16'  # float16 also works, but bfloat16 performs better

# Select torch_compile as the compiler
smash_config['compiler'] = 'torch_compile'
# smash_config['torch_compile_max_kv_cache_size'] = 400 # uncomment to use a custom KV cache size
smash_config['torch_compile_fullgraph'] = True
smash_config['torch_compile_mode'] = 'max-autotune'
# If the model is not compatible with CUDA graphs, comment out the line above and uncomment the line below
# smash_config['torch_compile_mode'] = 'max-autotune-no-cudagraphs'

INFO - No device specified. Using best available device: 'cuda'
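To get a feel for what `hqq_weight_bits` changes, here is a back-of-the-envelope estimate of weight storage at different bit widths. It is only a rough sketch: the parameter count of about 1.24 billion is an assumption for this model, and the estimate ignores the scales and zero-points that a quantizer stores alongside the weights, as well as any layers that are left unquantized. Actual memory use will therefore be higher.

```python
def weight_storage_gb(n_params, bits):
    """Approximate weight storage in gigabytes (10**9 bytes) at a given bit width."""
    return n_params * bits / 8 / 1e9


N_PARAMS = 1.24e9  # assumed parameter count, for illustration only

for bits in (16, 8, 4, 2):
    print(f"{bits:>2}-bit: ~{weight_storage_gb(N_PARAMS, bits):.2f} GB")
```

The estimate says nothing about speed or output quality, which is why the bit width is best chosen by measuring, as the rest of this tutorial does.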


### 4. Smashing the Model

Now, smash the model. This can take up to 30 seconds.

In [None]:
from pruna import smash

# Apply the quantizer and compiler selected in smash_config
smashed_model = smash(
    model=model,
    smash_config=smash_config,
)

INFO - Starting quantizer hqq...
100%|██████████| 51/51 [00:00<00:00, 664.61it/s]
100%|██████████| 113/113 [00:24<00:00,  4.69it/s]
INFO - quantizer hqq was applied successfully.
INFO - Starting compiler torch_compile...
INFO - compiler torch_compile was applied successfully.


### 5. Running the Model


Finally, run the model to generate the text you want.
Note that the first run needs a short warmup (under 1 minute).

NB: Currently, the quantized and compiled LLM only supports the default sampling strategy, and you need to generate tokens with `model.generate(input_ids, max_new_tokens=X)`, where X is the number of tokens you want to produce. We plan to support other sampling schemes (DoLa, contrastive, etc.) in the near future.

In [None]:
import time

# Warm up the model: the first calls trigger compilation and are much slower
for _ in range(3):
    with torch.no_grad():
        inp = tokenizer(["This is a test of this large language model"], return_tensors="pt")
        input_ids = inp['input_ids'].cuda()
        generated_ids = smashed_model.generate(input_ids, max_new_tokens=56)
        text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

# Wait for pending GPU work to finish before starting the clock
torch.cuda.synchronize()
t = time.time()
with torch.no_grad():
    inp = tokenizer(["This is a test of this large language model"], return_tensors="pt")
    input_ids = inp['input_ids'].cuda()
    generated_ids = smashed_model.generate(input_ids, max_new_tokens=56)
    text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

print(text)
# Wait for generation to finish on the GPU before reading the clock
torch.cuda.synchronize()
print(time.time() - t)

INFO - Cache size changed from 1x400 to 1x1000. Re-initializing StaticCache.
Online softmax is disabled on the fly since Inductor decides to
split the reduction. Cut an issue to PyTorch if this is an
important use case and you want to speed it up with online
softmax.



["This is a test of this large language model. Please wait for a bit to see how it responds.\n\nOnce I am ready, please ask me anything you'd like to know. I'm ready to help. What would you like to do?\n\n(Note: I can simulate conversations in various domains, such as science, history,"]
0.22874045372009277
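The two timings printed in this notebook can be compared directly. The sketch below assumes both runs generated the full 56 new tokens, and each timing comes from a single run, so the result is indicative rather than a rigorous benchmark. On these numbers the smashed model is a bit over 3x faster.

```python
NEW_TOKENS = 56                     # new tokens requested in both timed runs
baseline_s = 0.7246272563934326     # original model, printed in section 2
smashed_s = 0.22874045372009277     # smashed model, printed above

speedup = baseline_s / smashed_s
baseline_tps = NEW_TOKENS / baseline_s
smashed_tps = NEW_TOKENS / smashed_s

print(f"speedup: {speedup:.2f}x")
print(f"throughput: {baseline_tps:.0f} -> {smashed_tps:.0f} tokens/s")
```

Results will vary with the GPU, the prompt length, and the number of generated tokens, so it is worth rerunning both measurements on your own hardware.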
