# Quantizing Open-Source LLMs
* Notebook by Adam Lang
* Date: 1/17/2025

# Overview
* In this notebook I demo how to quantize open-source LLMs from Hugging Face.

# Install Dependencies

In [1]:
%%capture
!pip install transformers torch bitsandbytes accelerate

# Load Open-Source Model and Tokenizer
* Let's work with the Falcon3-7B-Base model from Hugging Face.
* Model card: https://huggingface.co/tiiuae/Falcon3-7B-Base
* This is the raw pre-trained model `Falcon3-7B-Base`.
  * The model achieves state-of-the-art results (at the time of release) on:
    * reasoning
    * language understanding
    * instruction following
    * code
    * mathematics tasks
  * Falcon3-7B-Base supports 4 languages:
    * English
    * French
    * Spanish
    * Portuguese

* The model context length is up to 32K tokens!

## Load Tokenizer
* First we will load the tokenizer before we start messing around with quantizing the actual model weights.

In [3]:
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

## model checkpoint
model_ckpt = "tiiuae/Falcon3-7B-Base"

## load tokenizer and check its vocabulary size
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
tokenizer.vocab_size

131072

# Load Model - Using 16-bit Precision
* At 16-bit precision each parameter takes 2 bytes, so a model with about 7B parameters needs roughly 14 GB of GPU RAM for the weights alone. That would crash this notebook and my computer, triple bam, yikes!!

In [4]:
## standard model load
## the main thing to note here is the `torch_dtype` parameter
#model = AutoModelForCausalLM.from_pretrained(model_ckpt, torch_dtype=torch.bfloat16)
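
The 16-bit memory estimate above is simple arithmetic: weights-only memory is the parameter count times the bytes per parameter. A minimal sketch follows. The helper name `weight_memory_gb` is made up for illustration, the parameter count is the one `model.num_parameters()` reports later in this notebook, and activations, KV cache, and quantization metadata are deliberately ignored.

```python
def weight_memory_gb(num_params: int, bits_per_param: float) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes, as used elsewhere in this notebook)."""
    return num_params * bits_per_param / 8 / 1e9


# Parameter count reported by model.num_parameters() later in this notebook.
NUM_PARAMS = 7_455_550_464

# Idealized footprints: real loads come out somewhat larger because
# not every tensor is stored at the target precision.
for name, bits in [("FP32", 32), ("BF16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{name:>5}: {weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

These are lower bounds for the weights only, so the measured footprints in the following sections should land a little above the 8-bit and 4-bit figures.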

# Load Model - Using 8-bit Precision
* This is more like it: 8-bit loading with bitsandbytes is a common way to fit a big LLM like this on a single GPU.

In [5]:
import torch
import transformers
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
)

## setup bitsandbytes config
config = BitsAndBytesConfig(
    load_in_8bit=True,
)

## now load the model
## difference here is we are using a `quantization_config`
model = AutoModelForCausalLM.from_pretrained("tiiuae/Falcon3-7B-Base",
                                             quantization_config=config)
## how many GB of memory
gbs = model.get_memory_footprint() / 1e9

## print parameters
print(f"Number of model parameters: {model.num_parameters()}")
print(f"Memory footprint if FP32: {(model.num_parameters()*4/1e9)} GB")
print(f"Memory footprint of model with 8-bit quantization: {gbs:.2f} GB")

config.json:   0%|          | 0.00/659 [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now default to True since model is quantized.

model.safetensors.index.json:   0%|          | 0.00/21.0k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.22G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/805M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/113 [00:00<?, ?B/s]

Number of model parameters: 7455550464
Memory footprint if FP32: 29.822201856 GB
Memory footprint of model with 8-bit quantization: 8.26 GB


Summary
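* 8-bit quantization shrank the footprint from about 29.8 GB (FP32) to about 8.26 GB, roughly 3.6x smaller. That is a bit short of the ideal 4x, which suggests some tensors are kept at higher precision.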
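
To make concrete what storing weights in 8 bits means, here is a simplified absmax (symmetric) quantization round trip in plain Python. This is an illustration only, not what bitsandbytes actually runs: its 8-bit path is more involved (for example, it treats outlier values separately), and the function names below are made up for this sketch.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        # All zeros: any scale works, so pick 1.0 to avoid dividing by zero.
        return [0] * len(values), 1.0
    scale = max_abs / 127.0
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 0.01]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

# The round-trip error per value is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now needs one byte plus a shared scale, which is where the memory saving comes from; the price is the small rounding error measured by `max_error`.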

In [11]:
## test prompt
test_prompt = "German Shepherd dogs are intelligent, loyal, and energetic dogs. They are known for being courageous, confident, and willing to protect their loved ones."
test_prompt

'German Shepherd dogs are intelligent, loyal, and energetic dogs. They are known for being courageous, confident, and willing to protect their loved ones.'

In [12]:
## setup device agnostic code
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [15]:
## generate output
def generate(prompt):
  ## note: uses the global `model`, `tokenizer`, and `device`
  tokenized_text = tokenizer(prompt, return_tensors="pt").to(device)
  # remove token_type_ids if present
  if "token_type_ids" in tokenized_text:
    del tokenized_text["token_type_ids"]
  output = model.generate(
      **tokenized_text,
      eos_token_id=tokenizer.eos_token_id,
      do_sample=True,  # random sampling: the output differs on every run
      max_new_tokens=100,
  )
  result = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
  return result

In [16]:
## generate result
result = generate(test_prompt)
print(result)

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


German Shepherd dogs are intelligent, loyal, and energetic dogs. They are known for being courageous, confident, and willing to protect their loved ones. They make great family dogs and companion animals.
|Affectionate and loving
|Grooming and bathing needs
|Regular grooming and attention
|Trainability and intelligence
|Intelligent and trainable
|Size and weight:
|Medium to large - Males 23-26” and 90-105 pounds; females 21-24” and 80-95 pounds
|Playful, energetic, and often prone to


# Load Model - Using 4bit Precision

In [17]:
import torch
from transformers import BitsAndBytesConfig

## set 4-bit config
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4", ## normal float 4
    bnb_4bit_use_double_quant=True,
)
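
The `bnb_4bit_use_double_quant=True` flag targets the per-block scaling constants that 4-bit quantization has to store next to the weights. The sketch below works through the storage overhead using the figures from the QLoRA paper (block size 64, 32-bit constants, second-level quantization to 8 bits with block size 256). Whether the installed bitsandbytes version uses exactly these defaults is an assumption, and `bits_per_param` is a made-up helper name.

```python
def bits_per_param(weight_bits=4, block_size=64, const_bits=32,
                   double_quant=False, dq_const_bits=8, dq_block_size=256):
    """Average storage per parameter, including the quantization constants.

    Default values follow the QLoRA paper and are assumptions here.
    """
    if not double_quant:
        # One const_bits-wide scaling constant per block of weights.
        return weight_bits + const_bits / block_size
    # Constants are themselves quantized to dq_const_bits, plus one
    # const_bits-wide second-level constant per dq_block_size first-level constants.
    first_level = dq_const_bits / block_size
    second_level = const_bits / (block_size * dq_block_size)
    return weight_bits + first_level + second_level


plain = bits_per_param(double_quant=False)
double = bits_per_param(double_quant=True)
print(f"without double quantization: {plain:.3f} bits/param")
print(f"with double quantization:    {double:.3f} bits/param")
print(f"saving:                      {plain - double:.3f} bits/param")
```

Under these assumptions the saving is a fraction of a bit per parameter, which adds up to a few hundred megabytes for a model of this size.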

In [19]:
## load 4-bit model quantization
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers

## 4-bit model
four_bit_model = AutoModelForCausalLM.from_pretrained(model_ckpt,
                                                      quantization_config=config)

`low_cpu_mem_usage` was None, now default to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [21]:
## check memory footprint
gbs = four_bit_model.get_memory_footprint() / 1e9

print(f"Number of parameters: {four_bit_model.num_parameters()}")
print(f"Memory footprint if FP32: {(four_bit_model.num_parameters()*4/1e9)} GB")
print(f"Memory footprint of model with 4-bit quantization: {gbs:.2f} GB")

Number of parameters: 7455550464
Memory footprint if FP32: 29.822201856 GB
Memory footprint of model with 4-bit quantization: 4.94 GB


Summary
* The footprint is now about 6x smaller than FP32 (29.8 GB down to 4.94 GB) thanks to the 4-bit precision.
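
The measured footprints can be turned into compression ratios and an effective number of bits per parameter. The sketch below uses only numbers printed in this notebook (rounded to two decimals); the helper names are made up for illustration.

```python
NUM_PARAMS = 7_455_550_464          # from model.num_parameters()
FP32_GB = NUM_PARAMS * 4 / 1e9      # 4 bytes per parameter


def compression_ratio(measured_gb: float) -> float:
    """How many times smaller the measured footprint is than FP32."""
    return FP32_GB / measured_gb


def effective_bits(measured_gb: float) -> float:
    """Average bits stored per parameter, metadata and unquantized tensors included."""
    return measured_gb * 1e9 * 8 / NUM_PARAMS


# Footprints reported by get_memory_footprint() in this notebook.
for label, gb in [("8-bit", 8.26), ("4-bit", 4.94)]:
    print(f"{label}: {compression_ratio(gb):.2f}x smaller than FP32, "
          f"{effective_bits(gb):.2f} effective bits per parameter")
```

Both effective bit widths come out above the nominal 8 and 4 bits, consistent with part of the model being kept at higher precision.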

In [25]:
## test prompt
test_prompt = "German Shepherd dogs are intelligent, loyal, and energetic dogs. They are known for being courageous, confident, and willing to protect their loved ones."
test_prompt

'German Shepherd dogs are intelligent, loyal, and energetic dogs. They are known for being courageous, confident, and willing to protect their loved ones.'

In [26]:
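## generate again
## caution: generate() reads the global `model`, which is still the 8-bit model;
## set `model = four_bit_model` before this call to sample from the 4-bit model
result = generate(test_prompt)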

Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


In [27]:
print(result)

German Shepherd dogs are intelligent, loyal, and energetic dogs. They are known for being courageous, confident, and willing to protect their loved ones.
GSDs are considered one of the most versatile breeds, performing tasks from herding sheep to being used by law enforcement agencies as police dogs.
These dogs are great companions for active families, as they require regular exercise and mental stimulation to prevent boredom or destructive behavior.
When it comes to grooming, GSDs have a dense double coat that requires regular brushing and bathing. It is essential to maintain good dental hygiene for your GSD by regularly brushing their teeth and providing appropriate chew toys.



Summary
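* The output here differs from the earlier one, but this does not isolate the effect of 4-bit quantization: `generate` samples randomly (`do_sample=True`) and calls the global `model`, which is still the 8-bit model. A fair comparison would generate from `four_bit_model` with sampling disabled or a fixed seed.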
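
Because `generate` samples (`do_sample=True`) and reads whichever object is bound to the global `model`, a like-for-like comparison is easier with a helper that takes the model explicitly and decodes greedily. A minimal sketch, assuming the `model`, `four_bit_model`, `tokenizer`, and `test_prompt` objects from the cells above; `generate_with` is a hypothetical name, not part of transformers.

```python
def generate_with(model, tokenizer, prompt, device="cuda", max_new_tokens=100):
    """Greedy generation from an explicitly passed model (no random sampling)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # Drop token_type_ids if the tokenizer produced them, as generate() does.
    if "token_type_ids" in inputs:
        del inputs["token_type_ids"]
    output = model.generate(
        **inputs,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,  # avoids the pad_token_id notice
        do_sample=False,                      # greedy decoding: repeatable output
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(output, skip_special_tokens=True)[0]


# Usage in the notebook (commented out because it needs the loaded models):
# out_8bit = generate_with(model, tokenizer, test_prompt)
# out_4bit = generate_with(four_bit_model, tokenizer, test_prompt)
# print(out_8bit == out_4bit)
```

With sampling disabled, any difference between the two outputs comes from the models themselves rather than from random draws. A single prompt is still only an anecdote, so a proper quality comparison would need an evaluation set.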