# Hands-On Quantization

In [4]:
!pip install transformers
!pip install accelerate
!pip install bitsandbytes
!pip install -U "huggingface_hub[cli]"



- Line 1: We install the transformers library, a Hugging Face library that provides APIs and tools to download and train state-of-the-art pretrained models.
- Line 2: We install the accelerate library, which is designed to facilitate training deep learning models across different hardware. It makes training and inference simple, efficient, and adaptable.
- Line 3: We install the bitsandbytes library, which transformers uses to quantize the model.
- Line 4: We install the Hugging Face CLI to log in and access models and datasets from Hugging Face.

In [None]:
from huggingface_hub import login
import os
login(token=os.getenv("HF_TOKEN"))

Token is valid (permission: fineGrained).
The token `vllm-docker` has been saved to C:\Users\felip\.cache\huggingface\stored_tokens
Your token has been saved in your configured git credential helpers (manager).
Your token has been saved to C:\Users\felip\.cache\huggingface\token
Login successful.
The current active token is: `vllm-docker`


In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

- Line 1: We import the following classes from transformers:
  - AutoModelForCausalLM: The auto model class to load a pretrained causal language model.
  - AutoTokenizer: The auto tokenizer class to load the tokenizer of the selected model.
  - BitsAndBytesConfig: The configuration class for bitsandbytes quantization.
- Line 2: We import the PyTorch library for GPU acceleration.

## Unquantized Llama 3.1
Before quantizing the Llama 3.1 model, let’s first look at the memory footprint, the parameter data types, and the inference output of the unquantized meta-llama/Meta-Llama-3.1-8B-Instruct model, so that we can compare it against the quantized versions.
### Load the Model
Let’s load the pretrained model. The code below reads it from a local copy of the Hugging Face checkpoint.

In [4]:
import torch 
print(torch.__version__)
print(torch.cuda.is_available())

2.4.1
True


In [7]:
model_dir = "C:/Users/felip/Desktop/modelli/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", offload_folder="offload")

Some parameters are on the meta device because they were offloaded to the cpu.


In [13]:
param_dtypes = [param.dtype for param in model.parameters()]
print("Parameter dtypes:", param_dtypes)

Parameter dtypes: [torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.float32, torch.
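Printing one dtype per parameter tensor produces a very long list. A more compact view, sketched below, counts how many parameter tensors use each dtype. `summarize_dtypes` is a hypothetical helper, and the demo uses plain strings in place of real `torch.dtype` objects so that it runs without loading the model.

```python
from collections import Counter
from typing import Iterable


def summarize_dtypes(dtypes: Iterable[object]) -> dict[str, int]:
    """Count how many entries share each dtype, keyed by its string form."""
    return dict(Counter(str(dtype) for dtype in dtypes))


# Stand-in data (an assumption for illustration); with a loaded model one
# would pass (param.dtype for param in model.parameters()) instead.
demo_dtypes = ["torch.float16", "torch.int8", "torch.int8", "torch.float16"]
summary = summarize_dtypes(demo_dtypes)
print(summary)
```

The same call works for the quantized models later in the notebook, where a mix of dtypes is expected.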

In [21]:
print(model.get_memory_footprint())

32121045248
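The footprint is reported in bytes. As a rough sanity check, and assuming every parameter is stored as `torch.float32` (4 bytes), dividing by 4 should give the parameter count. The helper names below are illustrative and not part of transformers.

```python
# Back-of-the-envelope check of the reported memory footprint.
# Assumption: all parameters are float32 (4 bytes each).

FOOTPRINT_BYTES = 32_121_045_248  # value printed by model.get_memory_footprint()
BYTES_PER_FLOAT32 = 4


def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to gibibytes (1 GiB = 2**30 bytes)."""
    return n_bytes / 2**30


def implied_param_count(n_bytes: int, bytes_per_param: int) -> int:
    """Number of parameters implied by a footprint at a fixed width."""
    return n_bytes // bytes_per_param


footprint_gib = bytes_to_gib(FOOTPRINT_BYTES)
n_params = implied_param_count(FOOTPRINT_BYTES, BYTES_PER_FLOAT32)

print(f"{footprint_gib:.2f} GiB")
print(f"{n_params:,} parameters")
```

The result is a little over 8 billion parameters, which is consistent with the "8B" in the model name.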


In [25]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
inputs = tokenizer("Portugal is", return_tensors="pt").to("cuda")

response = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.batch_decode(response, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
  attn_output = torch.nn.functional.scaled_dot_product_attention(


['Portugal is a country located in southwestern Europe on the Iberian Peninsula. It is bordered by Spain to the east and north, and the Atlantic Ocean to the west and south. Portugal has a rich history and culture, with a mix of Moorish, Gothic']

## 8-bit Quantized Llama 3.1


In [31]:
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True  # CPU offload keeps some layers in FP32 on the CPU; the rest stays in 8-bit on the GPU
)

In [33]:
model_dir = "C:/Users/felip/Desktop/modelli/Llama-3.1-8B-Instruct"
quantized_model = AutoModelForCausalLM.from_pretrained(model_dir, quantization_config=bnb_config, device_map="auto")

Some parameters are on the meta device because they were offloaded to the cpu and disk.
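The message above says part of the model was offloaded. When a model is loaded with `device_map`, transformers records the resulting placement in the `hf_device_map` attribute, a dictionary from module name to device. The sketch below groups such a dictionary by device. `group_by_device` and the sample mapping are illustrative assumptions, not output from this notebook.

```python
from collections import defaultdict
from typing import Mapping, Union

Device = Union[int, str]  # GPU index, "cpu", or "disk"


def group_by_device(device_map: Mapping[str, Device]) -> dict[str, list[str]]:
    """Group module names by the device they were placed on."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for module_name, device in device_map.items():
        grouped[str(device)].append(module_name)
    return dict(grouped)


# Hypothetical placement; a real one would come from quantized_model.hf_device_map.
sample_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": "cpu",
    "model.norm": "disk",
    "lm_head": "disk",
}
placement = group_by_device(sample_map)
for device, modules in placement.items():
    print(f"{device}: {len(modules)} module(s)")
```

Any modules listed under `cpu` or `disk` are the ones that did not fit on the GPU.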


In [35]:
param_dtypes = [param.dtype for param in quantized_model.parameters()]
print("Parameter dtypes:", param_dtypes)

Parameter dtypes: [torch.float16, torch.int8, torch.int8, torch.int8, torch.int8, torch.int8, torch.int8, torch.int8, torch.float16, torch.float16, torch.int8, torch.int8, torch.int8, torch.int8, torch.int8, torch.int8, torch.int8, torch.float16, torch.float16, torch.int8, torch.int8, torch.int8, torch.int8, torch.int8, torch.int8, torch.int8, torch.float16, torch.float16, torch.int8, torch.int8, torch.int8, torch.int8, torch.int8, torch.int8, torch.int8, torch.float16, torch.float16, torch.int8, torch.int8, torch.int8, torch.int8, torch.int8, torch.int8, torch.int8, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.float16, torch.

In [37]:
print(quantized_model.get_memory_footprint())

14970003712


In [39]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
inputs = tokenizer("Portugal is", return_tensors="pt").to("cuda")

response = quantized_model.generate(**inputs, max_new_tokens=50)
print(tokenizer.batch_decode(response, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
  attn_output = torch.nn.functional.scaled_dot_product_attention(


['Portugal is a"nilIDGET Záp�数 graceful/WebAPIAllowAnonymousικαronic翼fcnutasionalména مطisseurванняColonotherapyalnızンデ.gsillis poil투kiyeunma젠ounters927ỆicodeArgbرفةandenycastleemiz.GunaksamstownštiRefPtrsdaleserteropleftoggleriosperumptdds']

## 4-bit Quantized Llama 3.1


In [12]:
# Let's try 4-bit quantization.
# Because some modules of the original model sit in RAM or on disk in FP32, we must set llm_int8_enable_fp32_cpu_offload=True
bnb_config_4bit = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_enable_fp32_cpu_offload=True
)

In [14]:
model_dir = "C:/Users/felip/Desktop/modelli/Llama-3.1-8B-Instruct"
# Load the 4-bit model; device_map="auto" lets accelerate place modules on the GPU, CPU, or disk
quantized_model_4bit = AutoModelForCausalLM.from_pretrained(
    model_dir,
    quantization_config=bnb_config_4bit,
    device_map="auto"  
)

Some parameters are on the meta device because they were offloaded to the disk and cpu.


In [16]:
print(quantized_model_4bit.get_memory_footprint())

11807498496
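Putting the three reported footprints side by side shows how much each configuration saved. The byte counts are the ones printed in this notebook. The "ideal" size assumes every parameter is stored at exactly 32, 8, or 4 bits, which ignores layers kept in 16-bit, so it is only a lower bound. The gap between observed and ideal is plausibly due to the modules that were offloaded to CPU or disk and left unquantized.

```python
# Footprints (bytes) printed by get_memory_footprint() in this notebook.
footprints = {
    "fp32": 32_121_045_248,
    "int8": 14_970_003_712,
    "4bit": 11_807_498_496,
}
bits_per_param = {"fp32": 32, "int8": 8, "4bit": 4}

# Assumption: the unquantized model is entirely float32 (4 bytes per parameter).
n_params = footprints["fp32"] // 4

rows = []
for name, observed in footprints.items():
    # Bytes needed if every weight used this bit width.
    ideal = n_params * bits_per_param[name] // 8
    rows.append((name, observed / footprints["fp32"], observed / ideal))

for name, fraction_of_fp32, observed_over_ideal in rows:
    print(f"{name}: {fraction_of_fp32:.0%} of fp32, {observed_over_ideal:.2f}x the ideal size")
```

A ratio well above 1 in the last column indicates that a large share of the weights is not stored at the nominal bit width.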


In [18]:
param_dtypes = [param.dtype for param in quantized_model_4bit.parameters()]
print("Parameter dtypes:", param_dtypes)

Parameter dtypes: [torch.float16, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.float16, torch.float16, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.float16, torch.float16, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.float16, torch.float16, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.float16, torch.float16, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.float16, torch.float16, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.float16, torch.float16, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.float16, torch.float16, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.uint8, torch.float16, torch.float16

In [22]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
inputs = tokenizer("Portugal is", return_tensors="pt").to("cpu")

response = quantized_model_4bit.generate(**inputs, max_new_tokens=50)
print(tokenizer.batch_decode(response, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


ValueError: Blockwise quantization only supports 16/32-bit floats, but got torch.uint8
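
This error is raised inside bitsandbytes. In this run the inputs were on the CPU while part of the 4-bit model had been offloaded to CPU and disk, which suggests the offloaded setup is the problem. Keeping the whole quantized model on one GPU avoids offloading altogether, so it is worth estimating whether it fits first. The sketch below is a rough estimate only: `estimate_weight_bytes`, `fits_on_gpu`, the 20% headroom, and the 8 GiB GPU size are assumptions for illustration, and the parameter count is the one implied by the float32 footprint reported earlier.

```python
def estimate_weight_bytes(n_params: int, bits_per_param: int) -> int:
    """Approximate weight storage, ignoring quantization constants and unquantized layers."""
    return n_params * bits_per_param // 8


def fits_on_gpu(required_bytes: int, gpu_bytes: int, headroom: float = 0.2) -> bool:
    """True if the weights fit while leaving a fraction of memory for activations."""
    return required_bytes <= gpu_bytes * (1 - headroom)


N_PARAMS = 32_121_045_248 // 4  # float32 footprint / 4 bytes per parameter
GPU_BYTES = 8 * 2**30           # hypothetical 8 GiB card; substitute your own

needed_4bit = estimate_weight_bytes(N_PARAMS, 4)
needed_8bit = estimate_weight_bytes(N_PARAMS, 8)
print(f"4-bit: {needed_4bit / 2**30:.1f} GiB, fits: {fits_on_gpu(needed_4bit, GPU_BYTES)}")
print(f"8-bit: {needed_8bit / 2**30:.1f} GiB, fits: {fits_on_gpu(needed_8bit, GPU_BYTES)}")
```

If the estimate fits, passing `device_map={"": 0}` to `from_pretrained` places every module on GPU 0, and moving the tokenized inputs to `quantized_model_4bit.device` keeps inputs and weights on the same device. Whether this resolves the error on a given machine is untested here.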