<a href="https://colab.research.google.com/github/bharathkreddy/brks_agents/blob/main/QuantizingOpensourceModels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is quantization anyway?
Quantization replaces real numbers (like `-0.83`, `0.12`, `1.25`) with values from a small, **fixed set** of representative numbers.

1. Split weights into blocks (e.g. 256 weights)
2. For each block:
  1. Find the min and max (or standard deviation range) of the block. From these you derive a scale and zero point (these constants can themselves be quantized, which is called double quantization)
  2. Normalize weights to fit into the range of your 4-bit codebook
  3. Quantize to nearest code i.e. map values in that block to the codebook (like NF4)
  4. Save the scale and zero-point

At inference:
1. Load the 4-bit codes
2. Look up each code in the codebook
3. Apply the saved scale and zero-point to dequantize (`original ≈ scale × codebook_value + offset`), which gives an approximation of the original weight
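
The block-wise steps above can be sketched in plain Python. This is a minimal illustration rather than the `bitsandbytes` implementation: it assumes a uniform 16-level codebook on `[-1, 1]`, a tiny block size of 8, and absmax scaling with no zero-point. The names `CODEBOOK`, `quantize_block`, and `dequantize_block` are hypothetical.

```python
import random

# Hypothetical 16-entry codebook: equally spaced levels on [-1, 1].
CODEBOOK = [-1.0 + 2.0 * i / 15 for i in range(16)]

def quantize_block(block):
    """Return (codes, scale) for one block of weights using absmax scaling."""
    scale = max(abs(w) for w in block) or 1.0  # fall back to 1.0 for an all-zero block
    codes = []
    for w in block:
        normalized = w / scale  # now inside [-1, 1]
        # Index of the nearest codebook entry (0..15, so it fits in 4 bits).
        code = min(range(16), key=lambda i: abs(CODEBOOK[i] - normalized))
        codes.append(code)
    return codes, scale

def dequantize_block(codes, scale):
    """Approximate the original weights as scale * codebook_value."""
    return [scale * CODEBOOK[c] for c in codes]

random.seed(0)
weights = [random.gauss(0.0, 0.05) for _ in range(32)]
BLOCK_SIZE = 8  # toy value; real implementations use larger blocks

all_codes = []
restored = []
for start in range(0, len(weights), BLOCK_SIZE):
    block = weights[start:start + BLOCK_SIZE]
    codes, scale = quantize_block(block)
    all_codes.extend(codes)
    restored.extend(dequantize_block(codes, scale))

max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max absolute error: {max_error:.5f}")
```

Because each block has its own scale, one unusually large weight only degrades the precision of its own block, not of the whole tensor.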

In 4-bit quantization you have only 16 possible values (because 4 bits = 2⁴ = 16).
First, all weights are scaled.
Then, instead of storing a number like `0.123455`, you map it to the closest value in your 16-entry lookup table (called a "codebook"). The codebook's values depend on:
1. Quantization method (int4, fp4, nf4, etc.)
2. Implementation (uniform vs. non-uniform)
3. Whether scaling is applied per layer, per group, or per tensor

Examples:
## Uniform Quantization (e.g. int4)
These use a fixed range and equal spacing. (fp4, by contrast, is a 4-bit floating-point format, so its levels are not equally spaced.)

For int4 (signed):
4-bit → 16 values: `[-8, -7, ..., 7]`

Codebook: uniformly spaced integers

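If normalized to `[-1, 1]`, you scale all weights in a block to fit in that range and map them to equally spaced levels. Using the symmetric codes `-7..7` divided by 7 gives `[-1.0, -0.857, ..., 0.857, 1.0]` (15 levels, step `1/7`); using all 16 codes gives a step of `2/15 ≈ 0.133`. ➡️ Either way the levels are equally spaced. The range is `[-1, +1]` only because each block is divided by its scale before quantization; multiplying by that scale at dequantization restores the original range.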
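
To make the equal spacing concrete, here is a small sketch that builds two uniform grids on `[-1, 1]`. Both constructions are illustrative assumptions rather than the layout of any particular library: one uses the symmetric integer codes `-7..7` divided by 7 (which is where a value like `-0.857` comes from), the other spreads all 16 codes evenly.

```python
# Signed int4 codes: 16 integers from -8 to 7.
int4_codes = list(range(-8, 8))

# Symmetric variant: use only -7..7 and divide by 7 (15 levels, step 1/7).
symmetric_levels = [code / 7 for code in range(-7, 8)]

# Full variant: 16 equally spaced levels on [-1, 1] (step 2/15).
full_levels = [-1.0 + 2.0 * i / 15 for i in range(16)]

print(int4_codes)
print([round(v, 3) for v in symmetric_levels])
print([round(v, 3) for v in full_levels])

# Every gap inside a grid is the same size: that is what "uniform" means.
symmetric_gaps = [b - a for a, b in zip(symmetric_levels, symmetric_levels[1:])]
full_gaps = [b - a for a, b in zip(full_levels, full_levels[1:])]
print(f"symmetric step = {symmetric_gaps[0]:.4f}, full step = {full_gaps[0]:.4f}")
```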

## NF4 (Normal Float 4) — Not Uniform [BitsAndBytes NF4 paper](https://arxiv.org/pdf/2305.14314.pdf)
NF4 was designed for normally distributed weights, which is what LLM weights typically look like, and is not uniformly spaced. It uses non-linear spacing to approximate Gaussian-distributed weights. Since most weights are close to 0, more of the 16 levels sit near 0 and fewer near -1 and 1.
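
The uneven spacing is easy to see by printing the gaps between neighbouring levels. The 16 values below are the NF4 levels as I recall them from the QLoRA paper linked above; treat them as a reference copy to check against the paper, and `nearest_nf4` as a hypothetical helper for illustration.

```python
# NF4 levels as listed in the QLoRA paper (arXiv:2305.14314); verify before relying on them.
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

# Gap between each level and the next one.
gaps = [b - a for a, b in zip(NF4_LEVELS, NF4_LEVELS[1:])]
for left, gap in zip(NF4_LEVELS, gaps):
    print(f"{left:+.4f} -> gap to next level: {gap:.4f}")

def nearest_nf4(x):
    """Return the NF4 level closest to a normalized weight x in [-1, 1]."""
    return min(NF4_LEVELS, key=lambda level: abs(level - x))

print(nearest_nf4(0.05), nearest_nf4(0.9))
```

The gaps shrink towards zero and grow towards the ends, so small weights (the common case) are represented more precisely than large ones.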

In [None]:
!pip install -q --upgrade torch==2.5.1+cu124 torchvision==0.20.1+cu124 torchaudio==2.5.1+cu124 --index-url https://download.pytorch.org/whl/cu124
!pip install -q requests bitsandbytes==0.46.0 transformers==4.48.3 accelerate==1.3.0


In [None]:
from google.colab import userdata
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig
import torch
import gc

In [None]:
# hugging face login to pull the models

hf_token = userdata.get('HUGGINGFACE_API_KEY')
login(hf_token, add_to_git_credential=True)

In [None]:
# instruct models

LLAMA = "meta-llama/Meta-Llama-3.1-8B-Instruct"
PHI3 = "microsoft/Phi-3-mini-4k-instruct"
GEMMA2 = "google/gemma-2-2b-it"
QWEN2 = "Qwen/Qwen2-7B-Instruct" # exercise for you
MIXTRAL = "mistralai/Mixtral-8x7B-Instruct-v0.1" # If this doesn't fit in your GPU memory, try others from the hub

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Tell a light-hearted joke for a room of Data Scientists"}
  ]

# 🧮 Define Quantization Config

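This config tells Hugging Face to load the model in 4-bit precision, which cuts memory use substantially and makes larger models practical on consumer GPUs. Speed does not necessarily improve, because the weights are dequantized on the fly for each matrix multiplication.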

1. `load_in_4bit=True`: Instructs Hugging Face to load the model with 4-bit quantized weights, instead of full 16-bit or 32-bit floating-point values.
  - Hugely reduces memory (up to 75% less than fp16)
  - Allows running large models (e.g., LLaMA-7B) on GPUs with 8–12GB VRAM
  - Works by mapping full-precision weights to 4-bit buckets, storing a quantized value and a scaling factor to reconstruct the original
  - During model loading, `bitsandbytes` replaces regular `Linear` layers with 4-bit `Linear4bit` modules.
2. `bnb_4bit_use_double_quant=True`: Enables a second level of quantization on the scaling factors that are used to dequantize the 4-bit weights.
  - 4-bit weights still need scaling factors (1 per block of weights).
  - Those scaling factors are usually 16- or 32-bit floats.
  - With double quantization, these scales themselves are quantized again, saving even more memory.
  - Original weight → quantized into 4-bit with a scale
  - That scale is also quantized (e.g., with 8-bit or 4-bit) using a second lightweight quantizer
3. `bnb_4bit_compute_dtype=torch.bfloat16`: Internally, computation is still done in bfloat16, which balances speed and numerical range.
  - Controls the internal math precision used during inference.
  - Even though weights are in 4-bit, the actual matrix multiplications (like `W·x`) use a higher-precision format for numerical stability and output quality.
  - bfloat16 (brain float) is the preferred format on newer NVIDIA GPUs such as the H100 or recent RTX cards.
4. `bnb_4bit_quant_type="nf4"`: Specifies the quantization scheme used for mapping full-precision values into 4-bit.
  - NF4 is a non-uniform quantization scheme that better preserves the distribution of original floating-point values.
  - Designed specifically for LLMs, where small weight differences matter a lot.
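
To get a feel for what double quantization saves, the arithmetic below estimates the storage overhead of the scaling factors in bits per weight. The block sizes and bit widths are assumptions taken from the QLoRA paper (64 weights per 32-bit scale, with the scales re-quantized to 8 bits in groups of 256); the notebook itself does not state them.

```python
BLOCK_SIZE = 64            # weights sharing one scale (assumed, from the QLoRA paper)
SCALE_BITS = 32            # bits per scale without double quantization (assumed)
SECOND_BLOCK_SIZE = 256    # scales sharing one second-level constant (assumed)
QUANTIZED_SCALE_BITS = 8   # bits per scale after double quantization (assumed)

# Without double quantization: one 32-bit scale per 64 weights.
plain_overhead = SCALE_BITS / BLOCK_SIZE

# With double quantization: one 8-bit scale per 64 weights,
# plus one 32-bit constant per 256 scales (i.e. per 64 * 256 weights).
double_overhead = (
    QUANTIZED_SCALE_BITS / BLOCK_SIZE
    + SCALE_BITS / (BLOCK_SIZE * SECOND_BLOCK_SIZE)
)

print(f"scale overhead without double quantization: {plain_overhead:.3f} bits/weight")
print(f"scale overhead with double quantization:    {double_overhead:.3f} bits/weight")
print(f"effective size: {4 + plain_overhead:.3f} vs {4 + double_overhead:.3f} bits/weight")
```

Under these assumptions the saving is a bit under 0.4 bits per weight, which adds up to a few hundred megabytes on a model with several billion quantized weights.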



In [None]:
# Quantization Config - this allows us to load the model into memory and use less memory

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)

🧠 Tokenizer converts your structured message text into numeric IDs the model understands.

1. Loads the tokenizer that matches your LLAMA model (here, `meta-llama/Meta-Llama-3.1-8B-Instruct`).
2. Chat-style models often don't have a padding token defined, so we use `eos_token` as padding, which is a common fallback.
3. `apply_chat_template` formats the messages into the model’s expected prompt (for Llama 3.1, `<|start_header_id|>system<|end_header_id|> ... <|eot_id|>`).
4. `return_tensors="pt"` converts the output into PyTorch tensors.
5. `.to("cuda")` sends those tensors to your GPU.

🧠 At this point, you have your input IDs on the GPU, ready for inference.

In [None]:
# Tokenizer

tokenizer = AutoTokenizer.from_pretrained(LLAMA)
tokenizer.pad_token = tokenizer.eos_token
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



# Loads the LLaMA model with quantization settings applied.

`device_map="auto"` lets Hugging Face dispatch the model layers to the appropriate GPU(s) automatically.

`quantization_config=quant_config` applies the 4-bit config defined earlier.

💡 This loads the model's linear layers in 4-bit (the embeddings, norms, and LM head stay in higher precision), saving memory and enabling you to run large models on mid-range GPUs (like a 2070 or 3060 Ti).

In [None]:
# The model

model = AutoModelForCausalLM.from_pretrained(LLAMA, device_map="auto", quantization_config=quant_config)


In [9]:
memory = model.get_memory_footprint() / 1e6
print(f"Memory footprint: {memory:,.1f} MB")

Memory footprint: 5,591.5 MB


# **Under the hood of Llama3.1 transformer model**

Llama 3.1 is an advanced Transformer-based neural network, designed for tasks like language modeling—predicting the next word in a sentence. Transformers like Llama are famous for their effectiveness in handling sequences (like words in sentences), primarily thanks to their unique ability to weigh the importance of each word contextually using self-attention.

Llama specifically has a decoder-only architecture:

1. `Embedding layer`: Encodes tokens (words or subwords).
2. `32 decoder layers`: Repeatedly refine understanding.
3. `LM Head`: Predicts the next word from embeddings.

# **Embeddings - Token Encoding**

The first layer is:
```
(embed_tokens): Embedding(128256, 4096)
```

`Embedding`: Converts discrete tokens (like words) into dense vectors (numbers), allowing the model to learn meaning and relationships.

The vocabulary size is `128,256` tokens, and each token is represented by a `4096`-dimensional vector.

`Rotary Embedding (LlamaRotaryEmbedding)` is a technique that helps the model capture the positions of words more effectively, improving attention computations, especially for longer contexts.

# **The 32 Decoder Layers - A Quick Overview**

Next, there are 32 identical decoder layers:
```
(0-31): 32 x LlamaDecoderLayer(...)
```

Each decoder layer refines the embeddings by:
  - Performing self-attention (to relate tokens to each other).
  - Passing the results through an MLP (multi-layer perceptron) to add non-linear transformations.
  - Using normalization layers to stabilize training and improve generalization.

Think of each layer as progressively adding clarity and context to the embeddings, refining predictions step-by-step.

# **Self-Attention – Core Concept and PyTorch Layers**

Within each decoder layer is a self-attention mechanism:
```
(self_attn): LlamaAttention(
  (q_proj): Linear4bit(4096→4096)
  (k_proj): Linear4bit(4096→1024)
  (v_proj): Linear4bit(4096→1024)
  (o_proj): Linear4bit(4096→4096)
)
```
Self-attention allows each token in a sequence to attend to other tokens (in a decoder-only model like Llama, to itself and the tokens before it) to better understand context. This happens through:
  - Queries (q_proj): Determines what information the token seeks.
  - Keys (k_proj): Identifies what each token contains.
  - Values (v_proj): Holds the content or meaning.
  - Output (o_proj): Projects attention results back to embeddings.

`Linear4bit` is the `bitsandbytes` replacement for PyTorch's `Linear` layer. It indicates that the parameters have been quantized to 4 bits, greatly reducing model size and memory needs while retaining good performance.
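
The key and value projections are narrower (1024) than the query projection (4096) because several query heads share one key/value head, a scheme known as grouped-query attention. The arithmetic below recovers the head layout from the printed shapes. The number of query heads (32) is an assumption taken from the published Llama 3.1 8B configuration; it is not visible in the module printout.

```python
Q_OUT = 4096           # q_proj out_features, from the printed model
KV_OUT = 1024          # k_proj / v_proj out_features, from the printed model
NUM_QUERY_HEADS = 32   # assumption: taken from the published model config

head_dim = Q_OUT // NUM_QUERY_HEADS            # width of a single attention head
num_kv_heads = KV_OUT // head_dim              # key/value heads that fit in 1024
queries_per_kv_head = NUM_QUERY_HEADS // num_kv_heads

print(f"head_dim            = {head_dim}")
print(f"key/value heads     = {num_kv_heads}")
print(f"query heads per K/V = {queries_per_kv_head}")
```

Sharing keys and values this way shrinks the key/value cache that has to be kept in memory during generation.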

# **The MLP Layers – Adding Non-Linearity**

Each decoder layer also has an MLP:
```
(mlp): LlamaMLP(
  (gate_proj): Linear4bit(4096→14336)
  (up_proj): Linear4bit(4096→14336)
  (down_proj): Linear4bit(14336→4096)
  (act_fn): SiLU()
)
```
The MLP layers (gate, up, down projections) transform the data into a higher-dimensional space (14336 dimensions) for richer interactions and then back down to 4096 dimensions.

SiLU (Sigmoid Linear Unit) activation function introduces non-linearities that help the network learn complex relationships.
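
How the three projections combine can be sketched with toy-sized matrices in plain Python. The computation shown, `down_proj(SiLU(gate_proj(x)) * up_proj(x))`, is the gated form used by Llama-style MLPs; the sizes (hidden 2, intermediate 3) and weight values are made up for illustration, and the helper names are hypothetical.

```python
import math

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def linear(weight, vector):
    """Matrix-vector product; weight has shape (out_features, in_features)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]

def gated_mlp(x, gate_w, up_w, down_w):
    """Compute down_proj(silu(gate_proj(x)) * up_proj(x))."""
    gate = [silu(g) for g in linear(gate_w, x)]   # expand, then apply the non-linearity
    up = linear(up_w, x)                          # expand (no activation)
    hidden = [g * u for g, u in zip(gate, up)]    # element-wise gating
    return linear(down_w, hidden)                 # project back to the hidden size

# Toy weights: hidden size 2, intermediate size 3 (4096 and 14336 in the real model).
x = [1.0, -2.0]
gate_w = [[0.5, 0.1], [-0.3, 0.2], [0.0, 0.4]]
up_w = [[0.2, -0.1], [0.1, 0.3], [-0.5, 0.2]]
down_w = [[0.1, 0.2, 0.3], [-0.2, 0.0, 0.1]]

y = gated_mlp(x, gate_w, up_w, down_w)
print(len(y), y)
```

The output has the same width as the input, so it can be added back onto the embedding that entered the layer.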

# **Norm Layers – Stability and Generalization**

Normalization layers stabilize training by keeping values within a consistent range:
```
(input_layernorm): LlamaRMSNorm(4096)
(post_attention_layernorm): LlamaRMSNorm(4096)
(norm): LlamaRMSNorm(4096)
```
RMSNorm (Root Mean Square Normalization) ensures smoother training and improves generalization.

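It's applied before the self-attention block (`input_layernorm`) and before the MLP block (`post_attention_layernorm`) in every decoder layer, with a final `norm` after the last layer, to keep embeddings stable.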
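
RMSNorm itself is a short computation: divide the vector by its root mean square, then multiply by a learned per-dimension weight. Here is a minimal plain-Python sketch using the `eps=1e-05` shown in the model printout; the function name and the toy input are illustrative.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)  # eps guards against division by zero
    return [w * v * inv_rms for v, w in zip(x, weight)]

x = [3.0, -4.0, 0.0, 5.0]
weight = [1.0] * len(x)  # learned in a real model; all ones here

normed = rms_norm(x, weight)
rms_after = math.sqrt(sum(v * v for v in normed) / len(normed))
print(normed)
print(f"root mean square after normalization: {rms_after:.6f}")
```

Unlike LayerNorm, there is no mean subtraction and no bias term, which makes the operation slightly cheaper.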

# **The LM Head – Generating Predictions**

Finally, the LM Head generates predictions:
```
(lm_head): Linear(in_features=4096, out_features=128256)
```
Takes refined embeddings (4096-dimensional) and converts them into a score (logit) for each token in the vocabulary (128,256 tokens); a softmax turns these scores into probabilities.

With greedy decoding, the token with the highest probability is the model's prediction for the next token.
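
The last step, from scores to a predicted token, can be sketched as a softmax followed by picking the largest probability. The four-word vocabulary and the scores below are made up for illustration; the real model produces 128,256 scores per position.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["fit", "model", "data", "joke"]  # hypothetical toy vocabulary
logits = [2.0, 1.0, 0.5, -1.0]            # hypothetical LM-head outputs

probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # greedy choice: highest probability

for token, p in zip(vocab, probs):
    print(f"{token:>6}: {p:.3f}")
print("next token:", next_token)
```

Picking the maximum is only one decoding strategy; sampling from the probabilities instead gives more varied output.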

# **Dimensions and 4-bit Quantization**

Notice the dimensions:
```
Embeddings: 128256 → 4096

Decoder Layers: 4096 ↔ 14336 ↔ 4096 (in MLP), 4096 ↔ 1024 (attention)

LM Head: 4096 → 128256
```
The hidden size is the same (4096) at the input and output of every decoder layer. This matters because each layer adds its result back onto the same stream of embeddings (a residual connection), so every layer must read and write vectors of the same width.

The 4-bit quantization significantly reduces memory requirements, enabling powerful models like Llama to run on smaller hardware without drastically losing performance.
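
These dimensions are enough to estimate the parameter count and the memory footprint by hand. The sketch below assumes that `Linear4bit` weights take half a byte each, that everything else (embeddings, norms, LM head) is stored in 16 bits, and that the quantization scales are not counted. Under those assumptions the result lines up with the `5,591.5 MB` footprint printed earlier.

```python
VOCAB, HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 128256, 4096, 14336, 1024, 32

# Per decoder layer: q_proj and o_proj are 4096x4096, k_proj and v_proj are 4096x1024.
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM
# gate_proj, up_proj and down_proj each hold 4096 * 14336 weights.
mlp = 3 * HIDDEN * INTERMEDIATE
quantized_params = LAYERS * (attention + mlp)  # all Linear4bit weights

embeddings = VOCAB * HIDDEN
lm_head = HIDDEN * VOCAB
norms = LAYERS * 2 * HIDDEN + HIDDEN           # two norms per layer plus the final norm
unquantized_params = embeddings + lm_head + norms

total_params = quantized_params + unquantized_params

# Assumption: 0.5 bytes per 4-bit weight, 2 bytes per 16-bit weight, scales ignored.
footprint_bytes = quantized_params * 0.5 + unquantized_params * 2

print(f"total parameters:   {total_params:,}")
print(f"quantized to 4-bit: {quantized_params:,}")
print(f"estimated footprint: {footprint_bytes / 1e6:,.1f} MB")
```

Notice that the embeddings and LM head, which are not quantized, account for well over a third of the estimated footprint even though they hold only about an eighth of the parameters.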



In [10]:
# Inspect the model's layers

model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((409

In [11]:
# Run the quantized model
outputs = model.generate(inputs, max_new_tokens=80)
print(tokenizer.decode(outputs[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Tell a light-hearted joke for a room of Data Scientists<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Why did the linear regression model go to therapy?

Because it was struggling to find the right fit.<|eot_id|>
