# Model Quantization with Transformers
This notebook walks through the following steps:

1.  Load a full-precision pre-trained model
2.  Clear the model from memory
3.  Reload the same model with 4-bit quantization enabled
4.  Calculate and print its new, smaller size
5.  Run a sample inference to show the quantized model is still functional

---
## 1. Setup Environment

In [18]:
!pip install transformers torch accelerate bitsandbytes -q

In [19]:
!pip install --upgrade transformers accelerate -q

---
## 2. Import Libraries

In [None]:
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

---
## 3. Define Model and Device

In [22]:
model_id = "microsoft/Phi-3-mini-4k-instruct"

In [23]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [24]:
print(f"Using model: {model_id}")
print(f"Using device: {device}")

Using model: microsoft/Phi-3-mini-4k-instruct
Using device: cuda


---
## 4. Load Full-Precision Model (Before Quantization)
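First, we load the model in `bfloat16` to establish a baseline for its memory usage. `bfloat16` is a 16-bit format (2 bytes per parameter), so "full precision" here means unquantized rather than 32-bit. The variable name `model_fp16` likewise refers to a 16-bit model, not to the `float16` dtype.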

In [25]:
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
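
To record the baseline before the model is deleted, one option is `get_memory_footprint()`, which `transformers` models expose and which returns the size of the parameters and buffers in bytes. The sketch below is not part of the original notebook, and the helper name `footprint_gib` is hypothetical.

```python
def footprint_gib(model) -> float:
    """Return the model's parameter and buffer memory in GiB.

    Assumes `model` is a transformers PreTrainedModel, which provides
    get_memory_footprint() returning a byte count.
    """
    return model.get_memory_footprint() / 1024**3
```

For example, `baseline_gib = footprint_gib(model_fp16)` stores the value for a later comparison. It has to run before the model is deleted in the next section.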

---
## 5. Clear Memory
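To free GPU memory for the quantized model and keep memory readings from mixing the two models, we first remove the full-precision model. `del` drops the Python reference, and `torch.cuda.empty_cache()` releases the cached blocks that PyTorch would otherwise keep reserved.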

In [27]:
del model_fp16

In [28]:
torch.cuda.empty_cache()

In [29]:
print("Full-precision model cleared from memory.")

Full-precision model cleared from memory.


---
## 6. Load Quantized Model
Now we load the same model again, this time passing a `BitsAndBytesConfig` so that `transformers` quantizes the weights to 4-bit precision as they are loaded.
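
To give a feel for what storing a weight in 4 bits means, here is a deliberately simplified sketch in plain Python. It uses a uniform absmax scheme with a single scale for the whole list. This is an illustrative assumption, not the block-wise FP4/NF4 formats that `bitsandbytes` actually implements. The function names are hypothetical.

```python
def quantize_4bit(values):
    """Map floats to signed integer codes in [-7, 7] using one shared absmax scale."""
    absmax = max(abs(v) for v in values)
    # Fall back to a scale of 1.0 when every value is zero to avoid dividing by zero.
    scale = absmax / 7 if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_4bit(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.7, -0.3, 0.1, 0.0]
codes, scale = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale)
print(codes)     # small integers that fit in 4 bits
print(restored)  # approximations of the original weights
```

Each restored value differs from the original by at most half the scale. This is the accuracy traded for the smaller storage.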

### Step 6.1: Define Quantization Configuration

In [30]:
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

### Step 6.2: Load the Model with Quantization

In [31]:
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [32]:
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device,
    quantization_config=quantization_config,
    trust_remote_code=True
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
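
The size comparison promised in the introduction can be sketched as follows. The block estimates weight storage from the parameter count and formats a before/after report. The figure of roughly 3.8 billion parameters for Phi-3-mini is an assumption taken from the model's public description, not from this notebook, and both helper names are hypothetical.

```python
def estimate_weight_bytes(num_params: int, bits_per_param: int) -> float:
    """Rough weight storage: parameter count times bits per parameter, in bytes."""
    return num_params * bits_per_param / 8


def size_report(before_bytes: float, after_bytes: float) -> str:
    """Format two byte counts in GiB together with the reduction factor."""
    gib = 1024**3
    return (
        f"before: {before_bytes / gib:.2f} GiB, "
        f"after: {after_bytes / gib:.2f} GiB, "
        f"reduction: {before_bytes / after_bytes:.1f}x"
    )


# Assumption: roughly 3.8 billion parameters (published figure for Phi-3-mini).
NUM_PARAMS = 3_800_000_000
print(size_report(estimate_weight_bytes(NUM_PARAMS, 16),
                  estimate_weight_bytes(NUM_PARAMS, 4)))
```

To report the measured size instead, pass `model_4bit.get_memory_footprint()` as `after_bytes`. The measured value will typically sit somewhat above the 4-bit estimate, since some layers usually stay in higher precision and the quantization constants also take space.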

---
## 7. Test Inference on Quantized Model
Let's run a quick generation to confirm that the quantized model is still working correctly.

In [33]:
prompt = "What is model quantization? Explain it like I'm five."

In [34]:
inputs = tokenizer(prompt, return_tensors="pt").to(device)

In [35]:
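# Generate text. use_cache=False works around the error this cell originally raised,
# "AttributeError: 'DynamicCache' object has no attribute 'seen_tokens'", which appears to
# come from the model's remote code on newer transformers releases (assumption; generation is slower).
outputs = model_4bit.generate(**inputs, max_new_tokens=100, use_cache=False)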

In [None]:
response_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

In [None]:
print(response_text)