<a href="https://colab.research.google.com/github/bharathbolla/The-LLM-Cookbook-Practical-Recipes-for-Fine-Tuning-Optimization-and-Deployment/blob/main/Chapter_10.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Recipe: Effortless Shrinking with `bitsandbytes`

In [None]:

from huggingface_hub import HfApi
from huggingface_hub import login

api = HfApi()
whoami = api.whoami(token="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxx")
print(whoami)
login("hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx")

{'type': 'user', 'id': '65feba1b57cc48d9d30d11cf', 'name': 'kalpasubbaiah', 'fullname': 'Kalpa Subbaiah', 'email': 'kalpa.subbaiah@gmail.com', 'emailVerified': True, 'canPay': False, 'periodEnd': None, 'isPro': False, 'avatarUrl': '/avatars/319094e0eb55ce89334d7bd3685ceeb0.svg', 'orgs': [{'type': 'org', 'id': '681b0cb0dba891d54be0773d', 'name': 'mcp-course', 'fullname': 'Hugging Face MCP Course', 'email': None, 'canPay': False, 'periodEnd': None, 'avatarUrl': 'https://cdn-avatars.huggingface.co/v1/production/uploads/62d648291fa3e4e7ae3fa6e8/itgTDqMrnvgNfJZJ4YmCt.png', 'roleInOrg': 'read', 'isEnterprise': False}], 'auth': {'type': 'access_token', 'accessToken': {'displayName': 'hugging_face_token_read', 'role': 'read', 'createdAt': '2025-08-31T13:41:46.429Z'}}}
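
The login cell above passes the access token as a string literal. A minimal sketch of an alternative, assuming the token has been exported in an environment variable named `HF_TOKEN` (the helper `resolve_hf_token` is a hypothetical name introduced here, not part of `huggingface_hub`):

```python
import os


def resolve_hf_token(env=os.environ, var_name="HF_TOKEN"):
    """Return the token stored under `var_name`, or raise if it is missing."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable before logging in.")
    return token


if __name__ == "__main__":
    # Stand-in mapping so the sketch runs without a real token.
    demo_env = {"HF_TOKEN": "hf_example_placeholder"}
    token = resolve_hf_token(demo_env)
    print(f"Token found ({len(token)} characters); value not printed.")
    # In the notebook, the resolved token would then be passed to the same
    # call used above (requires huggingface_hub):
    #   from huggingface_hub import login
    #   login(token=resolve_hf_token())
```

Keeping the token out of the notebook source means it does not end up in saved cells or shared copies of the file.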


## Recipe-1: Effortless Quantization with bitsandbytes

In [None]:
pip install -U bitsandbytes

Note: you may need to restart the kernel to use updated packages.


In [None]:
# --- Recipe: Effortless Shrinking with `bitsandbytes` ---
# Goal: Load a pre-trained model using 8-bit and 4-bit quantization via transformers + bitsandbytes.
# Libraries: transformers, torch, accelerate, bitsandbytes, sentencepiece
# Note: Requires `bitsandbytes` installation. `accelerate` needed for `device_map`.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import time # For basic timing

# --- Configuration ---
MODEL_ID = "google/gemma-2b" # Choose a model
# MODEL_ID = "mistralai/Mistral-7B-v0.1" # Larger model to see more significant memory savings

# --- 1. Load Tokenizer ---
print(f"Loading tokenizer for: {MODEL_ID}")
try:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
except Exception as e:
    print(f"Error loading tokenizer: {e}")
    exit()

# --- 2. Load Model in Native Precision (Reference) ---
print("\n--- Loading Model in Native Precision (BF16/FP16) ---")
# Determine compute dtype
compute_dtype = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float16
print(f"Using compute dtype: {compute_dtype}")
try:
    model_native = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=compute_dtype,
        device_map="auto" # Use GPU if available
    )
    print("Native model loaded.")
    mem_footprint_native = model_native.get_memory_footprint()
    print(f"Native Model Memory Footprint: {mem_footprint_native / 1024**3:.2f} GB")
except Exception as e:
    print(f"Error loading native model: {e}")
    model_native = None # Ensure variable exists

# --- 3. Load Model in 8-bit ---
print("\n--- Loading Model in 8-bit ---")
try:
    bnb_config_8bit = BitsAndBytesConfig(load_in_8bit=True)
    model_8bit = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=bnb_config_8bit,
        device_map="auto" # device_map handles quantized models too
    )
    print("8-bit model loaded.")
    mem_footprint_8bit = model_8bit.get_memory_footprint()
    print(f"8-bit Model Memory Footprint: {mem_footprint_8bit / 1024**3:.2f} GB")
    if model_native:
        print(f"Reduction vs Native: {(1 - mem_footprint_8bit / mem_footprint_native) * 100:.1f}%")
except Exception as e:
    print(f"Error loading 8-bit model: {e}")
    print("Ensure 'bitsandbytes' is installed correctly.")
    model_8bit = None

# --- 4. Load Model in 4-bit (NF4) ---
print("\n--- Loading Model in 4-bit (NF4) ---")
try:
    bnb_config_4bit = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4", # NormalFloat4 data type
        bnb_4bit_compute_dtype=compute_dtype, # Compute in bf16/fp16
        bnb_4bit_use_double_quant=True, # Enable double quantization
    )
    model_4bit = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=bnb_config_4bit,
        device_map="auto"
    )
    print("4-bit model loaded.")
    mem_footprint_4bit = model_4bit.get_memory_footprint()
    print(f"4-bit Model Memory Footprint: {mem_footprint_4bit / 1024**3:.2f} GB")
    if model_native:
        print(f"Reduction vs Native: {(1 - mem_footprint_4bit / mem_footprint_native) * 100:.1f}%")
except Exception as e:
    print(f"Error loading 4-bit model: {e}")
    print("Ensure 'bitsandbytes' is installed correctly.")
    model_4bit = None

# --- 5. Test Inference (Optional) ---
# Run generation to see if models work after loading
prompt = "Instruction: Write a short description of quantization.\nResponse:"
max_new_tokens_inf = 50

def run_inference(model, model_name):
    if model is None:
        print(f"\nSkipping inference for {model_name} (not loaded).")
        return
    print(f"\n--- Running Inference ({model_name}) ---")
    print(f"Prompt: {prompt}")
    try:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        start_time = time.time()
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens_inf,
            pad_token_id=tokenizer.eos_token_id # Use EOS token ID for padding in generation
        )
        end_time = time.time()
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        print(f"Response:\n{response}")
        print(f"Inference Time: {end_time - start_time:.2f} seconds")
    except Exception as e:
        print(f"Error during {model_name} inference: {e}")

# Run inference on loaded models
# run_inference(model_native, "Native Precision") # Can be slow
run_inference(model_8bit, "8-bit")
run_inference(model_4bit, "4-bit")

print("\nNote: Memory footprint is approximate. Inference time depends heavily on hardware.")
# --- End of Recipe ---

Loading tokenizer for: google/gemma-2b


--- Loading Model in Native Precision (BF16/FP16) ---
Using compute dtype: torch.bfloat16


Native model loaded.
Native Model Memory Footprint: 4.67 GB

--- Loading Model in 8-bit ---


8-bit model loaded.
8-bit Model Memory Footprint: 2.82 GB
Reduction vs Native: 39.5%

--- Loading Model in 4-bit (NF4) ---


4-bit model loaded.
4-bit Model Memory Footprint: 1.90 GB
Reduction vs Native: 59.3%

--- Running Inference (8-bit) ---
Prompt: Instruction: Write a short description of quantization.
Response:
Response:
Instruction: Write a short description of quantization.
Response:
Quantization is the process of converting a continuous signal into a discrete signal. The process of quantization is done by dividing the continuous signal into a number of discrete values. The process of quantization is done by dividing the continuous signal into a number of discrete
Inference Time: 6.58 seconds

--- Running Inference (4-bit) ---
Prompt: Instruction: Write a short description of quantization.
Response:
Response:
Instruction: Write a short description of quantization.
Response:
Quantization is the process of converting a continuous-time signal into a discrete-time signal. The quantization process is a linear operation. The quantization process is a linear operation. The quantization process is a linear o
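
The reported reductions (39.5% for 8-bit, 59.3% for 4-bit) are smaller than the 50% and 75% that a naive bits-per-weight calculation would suggest. One plausible reading is that only the linear layers are quantized while the embedding table stays in 16-bit. The sketch below checks that reading with back-of-the-envelope arithmetic. The parameter counts are assumptions (an approximate total for `google/gemma-2b` and a 256,000 x 2,048 embedding table), quantization constants are ignored, and `estimate_footprint_gib` is a hypothetical helper. Note also that the recipe divides by `1024**3`, so the values it labels "GB" are GiB.

```python
GIB = 1024 ** 3


def estimate_footprint_gib(total_params, unquantized_params, quant_bits, native_bits=16):
    """Estimate weight memory when only part of the model is quantized."""
    quantized_params = total_params - unquantized_params
    total_bits = quantized_params * quant_bits + unquantized_params * native_bits
    return total_bits / 8 / GIB


# Assumed, approximate figures for google/gemma-2b (not read from the model).
TOTAL_PARAMS = 2.506e9
EMBEDDING_PARAMS = 256_000 * 2048  # vocabulary size x hidden size

for label, bits in [("native", 16), ("8-bit", 8), ("4-bit", 4)]:
    size = estimate_footprint_gib(TOTAL_PARAMS, EMBEDDING_PARAMS, bits)
    print(f"{label}: {size:.2f} GiB")
```

Under these assumptions the estimates land close to the 4.67, 2.82 and 1.90 figures printed above, which suggests the unquantized embedding table accounts for most of the gap.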


## Recipe-2: Quantizing with AutoGPTQ

In [None]:
pip install auto-gptq optimum


In [None]:
# --- Recipe: Precise Compression with GPTQ ---
# Goal: Quantize a pre-trained model using the AutoGPTQ library.
# Libraries: transformers, torch, optimum, datasets, auto-gptq
# Note: Requires installing AutoGPTQ: pip install auto-gptq optimum
#       Requires a GPU compatible with AutoGPTQ kernels (usually NVIDIA).
#       Uses gpt2-medium as an example and C4 dataset for calibration.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextGenerationPipeline
from datasets import load_dataset
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig # AutoGPTQ specific imports
import time
import logging

# --- Configuration ---
# Choose a base model supported by AutoGPTQ (check their GitHub)
# Smaller models quantize faster. gpt2 variants are common examples.
MODEL_CHECKPOINT = "gpt2-medium" # ~355M parameters
# Calibration dataset - needs to be representative of text the model will see
CALIBRATION_DATASET = "allenai/c4"
CALIBRATION_SPLIT = "train" # Use train split
NUM_CALIBRATION_SAMPLES = 128 # Number of samples for calibration (e.g., 128)
CALIBRATION_SEQ_LEN = 512 # Sequence length for calibration data
# GPTQ Quantization Config
QUANTIZE_BITS = 4 # Target bits (e.g., 4, 3, 8)
QUANTIZE_GROUP_SIZE = 128 # Group size for quantization (e.g., 32, 64, 128, -1 for per-channel)
QUANTIZE_DESC_ACT = False # Or True - Whether to quantize using act_order=True, can improve accuracy but slower
# Output directory for quantized model
QUANTIZED_MODEL_DIR = f"./{MODEL_CHECKPOINT.split('/')[-1]}-gptq-{QUANTIZE_BITS}bit"

# Setup logging for AutoGPTQ
logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

# --- 1. Load Tokenizer and Calibration Data ---
print(f"Loading tokenizer: {MODEL_CHECKPOINT}")
try:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
    if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token
except Exception as e: print(f"Error loading tokenizer: {e}"); exit()

print(f"\nLoading calibration data: {CALIBRATION_DATASET} (subset)")
try:
    # Load calibration data (streaming recommended for large datasets like C4)
    calibration_dataset = load_dataset(CALIBRATION_DATASET, name="en", split=CALIBRATION_SPLIT, streaming=True)
    # Take a sample and tokenize
    samples = []
    for data in calibration_dataset.take(NUM_CALIBRATION_SAMPLES):
        # Tokenize, padding/truncating to a fixed length for calibration
        tokenized_sample = tokenizer(data['text'], return_tensors='pt', max_length=CALIBRATION_SEQ_LEN, padding='max_length', truncation=True)
        samples.append({
            "input_ids": tokenized_sample["input_ids"].squeeze(0),           # LongTensor [SEQ]
            "attention_mask": tokenized_sample["attention_mask"].squeeze(0), # LongTensor [SEQ]
        })
    if not samples: raise ValueError("No calibration samples loaded.")
    print(f"Loaded {len(samples)} calibration samples.")

    # AutoGPTQ's quantize() takes a list of dicts with "input_ids" and "attention_mask"
    calibration_data_final = samples


except Exception as e:
    print(f"Error loading or processing calibration data: {e}")
    exit()

# --- 2. Load Base Model ---
print(f"\nLoading base model: {MODEL_CHECKPOINT}")
try:
    # Optional FP16 reference copy; step 4 reloads the checkpoint through AutoGPTQ, so this object is not what gets quantized
    model = AutoModelForCausalLM.from_pretrained(MODEL_CHECKPOINT, torch_dtype=torch.float16, low_cpu_mem_usage=True)
    # model.to('cuda:0') # Move to GPU if not done automatically
    print("Base model loaded.")
except Exception as e:
    print(f"Error loading base model: {e}")
    exit()

# --- 3. Define Quantization Config ---
print("\nDefining GPTQ quantization config...")
quantize_config = BaseQuantizeConfig(
    bits=QUANTIZE_BITS, # Number of bits for quantization
    group_size=QUANTIZE_GROUP_SIZE, # Group size
    desc_act=QUANTIZE_DESC_ACT, # Activation order; True might improve accuracy, False is faster
    damp_percent=0.01, # Dampening percentage for Hessian computation
    sym=True # Use symmetric quantization
)

# --- 4. Quantize Model ---
print("\nStarting GPTQ quantization process...")
print(f"Bits: {QUANTIZE_BITS}, Group Size: {QUANTIZE_GROUP_SIZE}, Desc Act: {QUANTIZE_DESC_ACT}")
print("This can take a while...")
start_time = time.time()
try:
    # Wrap model with AutoGPTQ wrapper
    quantized_model_gptq = AutoGPTQForCausalLM.from_pretrained(
        MODEL_CHECKPOINT,
        quantize_config=quantize_config, # Pass the config
        # Optional: Pass model directly if already loaded
        # model=model, # Pass the pre-loaded model object
        torch_dtype=torch.float16, # Ensure consistency
        trust_remote_code=True, # Often needed
        device_map="auto" # Let AutoGPTQ handle device placement
    )

    # Run the quantization process
    quantized_model_gptq.quantize(
        calibration_data_final, # Pass the prepared calibration data
        batch_size=1, # Calibration batch size
        use_triton=torch.cuda.is_available(), # Request Triton kernels when a CUDA GPU is present (requires the triton package)
        # cache_examples_on_gpu=True # If VRAM allows
    )
    end_time = time.time()
    print(f"Quantization finished in {end_time - start_time:.2f} seconds.")

    # --- 5. Save Quantized Model ---
    print(f"\nSaving quantized model to: {QUANTIZED_MODEL_DIR}")
    # Save the quantized weights in safetensors format.
    # In AutoGPTQ, save_pretrained is an alias that redirects to save_quantized, so call it directly.
    quantized_model_gptq.save_quantized(QUANTIZED_MODEL_DIR, use_safetensors=True)

    tokenizer.save_pretrained(QUANTIZED_MODEL_DIR) # Save tokenizer too
    print("Quantized model and tokenizer saved.")

except Exception as e:
    print(f"Error during GPTQ quantization or saving: {e}")
    print("Ensure AutoGPTQ and its dependencies (like optimum) are installed.")
    exit()

# --- 6. Load and Test Quantized Model (Optional) ---
print("\nLoading and testing quantized model...")
try:
    # Load the quantized model using the AutoGPTQ class
    # Important: Ensure the environment loading the model has AutoGPTQ installed
    model_loaded_gptq = AutoGPTQForCausalLM.from_quantized(
        QUANTIZED_MODEL_DIR,
        device_map="auto", # Load onto GPU
        use_triton=torch.cuda.is_available(),
        trust_remote_code=True,
        # inject_fused_attention=True, # Optional: Speed up inference
        # inject_fused_mlp=True # Optional: Speed up inference
    )
    print("Quantized model loaded successfully.")

    # Test inference
    prompt = "Quantization in deep learning is"
    print(f"Prompt: {prompt}")
    # Use pipeline for easy generation
    #pipeline_gptq = TextGenerationPipeline(model=model_loaded_gptq, tokenizer=tokenizer, device=model_loaded_gptq.device)
    pipeline_gptq = TextGenerationPipeline(model=model_loaded_gptq, tokenizer=tokenizer)
    start_time = time.time()
    outputs = pipeline_gptq(prompt, max_new_tokens=50, do_sample=True, temperature=0.7)
    end_time = time.time()
    print(f"Generated Text:\n{outputs[0]['generated_text']}")
    print(f"Inference Time: {end_time - start_time:.2f} seconds")

except Exception as e:
    print(f"Error loading or testing quantized model: {e}")

# --- End of Recipe ---


ModuleNotFoundError: No module named 'auto_gptq'
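
The `ModuleNotFoundError` above only surfaces once the whole recipe cell runs. A small pre-flight check, sketched here with the standard library only, can report missing modules before any model or dataset is downloaded. The helper name `missing_packages` and the list of module names are assumptions based on the recipe's import statements.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Top-level modules imported by the GPTQ recipe (assumed from its import lines).
required = ["torch", "transformers", "datasets", "optimum", "auto_gptq"]
missing = missing_packages(required)
if missing:
    print(f"Missing modules: {', '.join(missing)}. Install them before running the recipe.")
else:
    print("All required modules are importable.")
```

`find_spec` only locates a module without importing it, so the check is cheap, but it cannot tell whether a package that is present was built correctly for the current GPU.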

In [None]:
pip install autoawq

## Recipe-3: Activation-Aware Quantization (AWQ)

In [None]:
# --- Recipe: Activation-Aware Quantization (AWQ) ---
# Goal: Quantize a pre-trained model using the Activation-aware Weight Quantization (AWQ) method.
# Libraries: transformers, torch, optimum, datasets, autoawq
# Note: Requires installing AutoAWQ: pip install autoawq
#       Requires a GPU and is often faster than GPTQ for the same model size.
#       Uses gpt2-medium as an example and C4 dataset for calibration.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from awq import AutoAWQForCausalLM
from datasets import load_dataset
import time
import logging

# --- Configuration ---
# Choose a base model. AWQ works well with many modern architectures.
MODEL_CHECKPOINT = "gpt2-medium" # ~355M parameters
# Calibration dataset - a small, representative sample of text.
CALIBRATION_DATASET = "allenai/c4"
CALIBRATION_SPLIT = "train"
NUM_CALIBRATION_SAMPLES = 128
# AWQ Quantization Config
QUANTIZE_BITS = 4
QUANTIZE_GROUP_SIZE = 128
# Output directory for the quantized model
QUANTIZED_MODEL_DIR = f"./{MODEL_CHECKPOINT.split('/')[-1]}-awq-{QUANTIZE_BITS}bit"

# Setup logging
logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

# --- 1. Load Tokenizer and Calibration Data ---
print(f"Loading tokenizer: {MODEL_CHECKPOINT}")
try:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
    if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token
except Exception as e:
    print(f"Error loading tokenizer: {e}"); exit()

print(f"\nLoading calibration data: {CALIBRATION_DATASET} (subset)")
try:
    # Stream the dataset so only the first records are fetched rather than the full C4 split
    calibration_dataset = load_dataset(CALIBRATION_DATASET, name="en", split=CALIBRATION_SPLIT, streaming=True)
    # AWQ expects a list of strings
    calibration_data_final = [example['text'] for example in calibration_dataset.take(NUM_CALIBRATION_SAMPLES)]
    print(f"Loaded {len(calibration_data_final)} calibration samples.")
except Exception as e:
    print(f"Error loading or processing calibration data: {e}")
    exit()

# --- 2. Load Base Model ---
# AWQ handles loading the model internally, so we just need the path.
print(f"\nPreparing to load base model for AWQ: {MODEL_CHECKPOINT}")

# --- 3. Define Quantization Config ---
# For AWQ, the configuration is passed directly to the quantize method.
# The core parameters are the number of bits and the group size.
awq_config = {
    "w_bit": QUANTIZE_BITS,
    "q_group_size": QUANTIZE_GROUP_SIZE,
    "zero_point": True # Use a zero point for better accuracy
}
print(f"\nDefined AWQ config: {awq_config}")

# --- 4. Quantize Model ---
print("\nStarting AWQ quantization process...")
print("This involves loading the model, analyzing activations, and quantizing weights.")
start_time = time.time()
try:
    # Load the base model through the AutoAWQ wrapper
    model = AutoAWQForCausalLM.from_pretrained(MODEL_CHECKPOINT, low_cpu_mem_usage=True, device_map="auto")

    # Run the quantization process
    model.quantize(
        tokenizer,
        quant_config=awq_config,
        calib_data=calibration_data_final # The keyword is `calib_data`
    )
    end_time = time.time()
    print(f"Quantization finished in {end_time - start_time:.2f} seconds.")

    # --- 5. Save Quantized Model ---
    print(f"\nSaving quantized model to: {QUANTIZED_MODEL_DIR}")
    # The `save_quantized` method saves the model in a format that can be loaded for fast inference.
    model.save_quantized(QUANTIZED_MODEL_DIR)
    tokenizer.save_pretrained(QUANTIZED_MODEL_DIR)
    print("Quantized model and tokenizer saved.")

except Exception as e:
    print(f"Error during AWQ quantization or saving: {e}")
    print("Ensure AutoAWQ is installed correctly.")
    exit()

# --- 6. Load and Test Quantized Model (Optional) ---
print("\nLoading and testing quantized model...")
try:
    # Load the quantized model using the AutoAWQ class again
    model_quantized = AutoAWQForCausalLM.from_quantized(QUANTIZED_MODEL_DIR, device_map="auto")
    print("Quantized model loaded successfully.")

    # Test inference
    prompt = "Activation-aware Weight Quantization is a technique that"
    print(f"Prompt: {prompt}")

    # Use the model directly for generation
    tokens = tokenizer(prompt, return_tensors="pt").to("cuda")
    start_time = time.time()
    outputs = model_quantized.generate(**tokens, max_new_tokens=50, do_sample=True, temperature=0.7)
    end_time = time.time()

    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Generated Text:\n{generated_text}")
    print(f"Inference Time: {end_time - start_time:.2f} seconds")

except Exception as e:
    print(f"Error loading or testing quantized model: {e}")

# --- End of Recipe ---
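
To make the three config keys used in this recipe (`w_bit`, `q_group_size`, `zero_point`) more concrete, here is a plain-Python sketch of group-wise round-to-nearest quantization with a zero point. It is an illustration of the storage format only: it does not implement AWQ's activation-aware scaling, and the function names are hypothetical. The value range of each group is widened to include zero so the zero point always fits in the code range, which is an assumption of this sketch.

```python
import random


def quantize_group(values, w_bit=4):
    """Round-to-nearest quantization of one group, with a per-group scale and zero point."""
    q_max = 2 ** w_bit - 1
    lo = min(min(values), 0.0)  # widen the range so that 0.0 is representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / q_max if hi > lo else 1.0
    zero_point = round(-lo / scale)
    codes = [max(0, min(q_max, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate float values."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row, w_bit=4, q_group_size=128):
    """Split a weight row into groups and quantize each group independently."""
    groups = [row[i:i + q_group_size] for i in range(0, len(row), q_group_size)]
    return [quantize_group(g, w_bit) for g in groups]


# Synthetic weights for illustration; these are not taken from any model.
rng = random.Random(0)
row = [rng.gauss(0.0, 0.02) for _ in range(256)]
packed = quantize_row(row, w_bit=4, q_group_size=128)
restored = [x for codes, scale, zp in packed for x in dequantize_group(codes, scale, zp)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(packed)}, max abs error: {max_err:.6f}")
```

A smaller `q_group_size` gives each scale fewer values to cover, which tends to lower the error at the cost of storing more scales and zero points.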

## Recipe-4: Measuring Quantization Performance

In [None]:
# --- Recipe: Measuring the Gains (Quantization Performance) ---
# Goal: Compare memory usage and inference speed before and after quantization.
# Method: Uses BitsAndBytes 4-bit quantization for easy comparison within one script.
# Libraries: transformers, torch, accelerate, bitsandbytes, sentencepiece, time

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import time
import numpy as np

# --- Configuration ---
MODEL_ID = "google/gemma-2b" # Choose a model to test
# MODEL_ID = "gpt2-large" # Another option
PROMPT = "Explain the concept of transfer learning in machine learning in about 50 words."
NUM_TOKENS_TO_GENERATE = 100
NUM_INFERENCE_RUNS = 5 # Number of times to run inference for averaging speed

# --- 1. Load Tokenizer ---
print(f"Loading tokenizer for: {MODEL_ID}")
try:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token
except Exception as e: print(f"Error loading tokenizer: {e}"); exit()

# --- 2. Load Native Model & Measure ---
print("\n--- Loading and Measuring Native Model (BF16/FP16) ---")
native_results = {"memory_gb": "N/A", "avg_latency_s": "N/A"}
model_native = None # Define variable outside try block
try:
    compute_dtype = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float16
    model_native = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=compute_dtype,
        device_map="auto"
    )
    if model_native.config.pad_token_id is None: model_native.config.pad_token_id = tokenizer.pad_token_id
    print(f"Native model loaded in {compute_dtype}.")

    # Measure Memory
    mem_footprint_native = model_native.get_memory_footprint()
    native_results["memory_gb"] = mem_footprint_native / 1024**3
    print(f"Native Memory Footprint: {native_results['memory_gb']:.2f} GB")

    # Measure Inference Speed
    print(f"Running inference ({NUM_INFERENCE_RUNS} runs)...")
    latencies = []
    inputs = tokenizer(PROMPT, return_tensors="pt").to(model_native.device)
    for _ in range(NUM_INFERENCE_RUNS + 1): # +1 for warmup run
        torch.cuda.synchronize() # Ensure sync before timing
        start_time = time.time()
        _ = model_native.generate(
            **inputs,
            max_new_tokens=NUM_TOKENS_TO_GENERATE,
            pad_token_id=tokenizer.eos_token_id,
            do_sample=False # Use greedy for consistent timing
        )
        torch.cuda.synchronize() # Ensure sync after generation
        end_time = time.time()
        latencies.append(end_time - start_time)

    avg_latency = np.mean(latencies[1:]) # Exclude warmup run
    native_results["avg_latency_s"] = avg_latency
    print(f"Native Avg. Latency ({NUM_TOKENS_TO_GENERATE} tokens): {avg_latency:.3f} seconds")

    # Clean up memory
    del model_native
    torch.cuda.empty_cache()
    print("Native model unloaded.")

except Exception as e:
    print(f"Error during native model loading or inference: {e}")
    if 'model_native' in locals() and model_native is not None: del model_native
    torch.cuda.empty_cache()


# --- 3. Load 4-bit Quantized Model & Measure ---
print("\n--- Loading and Measuring 4-bit Model (NF4) ---")
quantized_results = {"memory_gb": "N/A", "avg_latency_s": "N/A"}
model_4bit = None # Define variable outside try block
try:
    compute_dtype = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else torch.float16
    bnb_config_4bit = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
    )
    model_4bit = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        quantization_config=bnb_config_4bit,
        device_map="auto"
    )
    if model_4bit.config.pad_token_id is None: model_4bit.config.pad_token_id = tokenizer.pad_token_id
    print("4-bit model loaded.")

    # Measure Memory
    mem_footprint_4bit = model_4bit.get_memory_footprint()
    quantized_results["memory_gb"] = mem_footprint_4bit / 1024**3
    print(f"4-bit Memory Footprint: {quantized_results['memory_gb']:.2f} GB")

    # Measure Inference Speed
    print(f"Running inference ({NUM_INFERENCE_RUNS} runs)...")
    latencies_4bit = []
    inputs = tokenizer(PROMPT, return_tensors="pt").to(model_4bit.device)
    for _ in range(NUM_INFERENCE_RUNS + 1): # +1 for warmup run
        torch.cuda.synchronize()
        start_time = time.time()
        _ = model_4bit.generate(
            **inputs,
            max_new_tokens=NUM_TOKENS_TO_GENERATE,
            pad_token_id=tokenizer.eos_token_id,
            do_sample=False
        )
        torch.cuda.synchronize()
        end_time = time.time()
        latencies_4bit.append(end_time - start_time)

    avg_latency_4bit = np.mean(latencies_4bit[1:]) # Exclude warmup
    quantized_results["avg_latency_s"] = avg_latency_4bit
    print(f"4-bit Avg. Latency ({NUM_TOKENS_TO_GENERATE} tokens): {avg_latency_4bit:.3f} seconds")

    # Clean up memory
    del model_4bit
    torch.cuda.empty_cache()
    print("4-bit model unloaded.")

except Exception as e:
    print(f"Error during 4-bit model loading or inference: {e}")
    if 'model_4bit' in locals() and model_4bit is not None: del model_4bit
    torch.cuda.empty_cache()

# --- 4. Comparison Summary ---
print("\n--- Performance Comparison Summary ---")
dtype_str = str(compute_dtype).replace("torch.", "") if isinstance(compute_dtype, torch.dtype) else "N/A"
print(f"Metric                 | Native ({dtype_str}) | 4-bit (NF4)")
print(f"-----------------------|-------------------|-----------------")
mem_native_str = f"{native_results['memory_gb']:.2f} GB" if isinstance(native_results['memory_gb'], float) else native_results['memory_gb']
mem_4bit_str = f"{quantized_results['memory_gb']:.2f} GB" if isinstance(quantized_results['memory_gb'], float) else quantized_results['memory_gb']
print(f"Memory Footprint       | {mem_native_str:<17} | {mem_4bit_str:<15}")

lat_native_str = f"{native_results['avg_latency_s']:.3f} s" if isinstance(native_results['avg_latency_s'], float) else native_results['avg_latency_s']
lat_4bit_str = f"{quantized_results['avg_latency_s']:.3f} s" if isinstance(quantized_results['avg_latency_s'], float) else quantized_results['avg_latency_s']
print(f"Avg. Latency ({NUM_TOKENS_TO_GENERATE} toks) | {lat_native_str:<17} | {lat_4bit_str:<15}")

# Calculate relative changes if possible
if isinstance(native_results['memory_gb'], float) and isinstance(quantized_results['memory_gb'], float):
    mem_reduction = (1 - quantized_results['memory_gb'] / native_results['memory_gb']) * 100
    print(f"\nMemory Reduction (4-bit vs Native): {mem_reduction:.1f}%")
if isinstance(native_results['avg_latency_s'], float) and isinstance(quantized_results['avg_latency_s'], float):
    speedup = native_results['avg_latency_s'] / quantized_results['avg_latency_s']
    print(f"Inference Speedup (4-bit vs Native): {speedup:.2f}x")

print("\nNote: Results are indicative and highly dependent on hardware, model, batch size, and specific generation parameters.")

# --- End of Recipe ---


Loading tokenizer for: google/gemma-2b



--- Loading and Measuring Native Model (BF16/FP16) ---


Native model loaded in torch.bfloat16.
Native Memory Footprint: 4.67 GB
Running inference (5 runs)...
Native Avg. Latency (100 tokens): 1.431 seconds
Native model unloaded.

--- Loading and Measuring 4-bit Model (NF4) ---


4-bit model loaded.
4-bit Memory Footprint: 1.90 GB
Running inference (5 runs)...
4-bit Avg. Latency (100 tokens): 2.481 seconds
4-bit model unloaded.

--- Performance Comparison Summary ---
Metric                 | Native (bfloat16) | 4-bit (NF4)
-----------------------|-------------------|-----------------
Memory Footprint       | 4.67 GB           | 1.90 GB        
Avg. Latency (100 toks) | 1.431 s           | 2.481 s        

Memory Reduction (4-bit vs Native): 59.3%
Inference Speedup (4-bit vs Native): 0.58x

Note: Results are indicative and highly dependent on hardware, model, batch size, and specific generation parameters.
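
The summary reports a speedup of 0.58x, which means the 4-bit model was slower than the native one in this run. Converting the two average latencies to throughput makes that easier to read. The sketch below uses only the averages printed above and assumes each run produced the full 100 new tokens; generation can stop early at an end-of-sequence token, so the throughput values are upper-bound estimates. Both helper names are hypothetical.

```python
def tokens_per_second(avg_latency_s, new_tokens):
    """Throughput implied by an average latency, assuming `new_tokens` were generated."""
    return new_tokens / avg_latency_s


def speedup(baseline_latency_s, candidate_latency_s):
    """Values above 1.0 mean the candidate is faster than the baseline."""
    return baseline_latency_s / candidate_latency_s


# Average latencies copied from the run above.
NATIVE_LATENCY_S = 1.431
FOUR_BIT_LATENCY_S = 2.481
NEW_TOKENS = 100  # max_new_tokens used in the recipe

print(f"Native: {tokens_per_second(NATIVE_LATENCY_S, NEW_TOKENS):.1f} tokens/s")
print(f"4-bit:  {tokens_per_second(FOUR_BIT_LATENCY_S, NEW_TOKENS):.1f} tokens/s")
print(f"Speedup (4-bit vs native): {speedup(NATIVE_LATENCY_S, FOUR_BIT_LATENCY_S):.2f}x")
```

In this run the 4-bit model trades roughly 59% less weight memory for lower throughput, which is the trade-off the recipe sets out to measure.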
