# Day 31: Advanced Quantization with GPTQ - Part 4

In this notebook, we'll explore GPTQ (Generative Pre-trained Transformer Quantization), an advanced technique for quantizing large language models with minimal quality degradation.

## Overview

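1. Understanding GPTQ
2. Setup and dependencies
3. Helper functions
4. Implementing GPTQ quantization
5. Evaluating GPTQ model performance
6. Comparing with an FP16 model
7. Comparing results
8. Text quality comparison
9. GPTQ vs. other quantization methods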

## 1. Understanding GPTQ

GPTQ (Generative Pre-trained Transformer Quantization) is a one-shot, post-training weight quantization method designed for large language models. It works layer by layer: as each weight is quantized, the error it introduces is redistributed onto the weights that have not been quantized yet, so the layer's output stays close to the original.

Key features of GPTQ:
1. Quantizes one layer at a time, minimizing each layer's output error on calibration inputs
2. Uses second-order information (the Hessian of that layer-wise error, computed from calibration activations) to minimize quantization error
3. Redistributes each weight's quantization error onto the weights that are still unquantized
4. Achieves high compression rates (INT4) with minimal quality degradation
5. Needs no retraining or fine-tuning, only a small calibration dataset
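
The error-redistribution step can be made concrete on a single weight matrix. What follows is a minimal NumPy sketch, not the optimized GPTQ implementation: it skips the blocked updates and Cholesky reformulation that the real algorithm uses for speed and stability. The function names (`quantize_rtn`, `gptq_like`), the symmetric per-row grid, the damping value, and the synthetic data are all illustrative assumptions.

```python
import numpy as np

def quantize_rtn(w, scale, qmax):
    """Round-to-nearest onto a symmetric integer grid, returned as floats."""
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def gptq_like(W, X, bits=4, damp=0.01):
    """Quantize W (n_out x n_in) column by column with error compensation.

    X holds calibration inputs with shape (n_in, n_samples).
    """
    W = W.astype(np.float64).copy()
    n_in = W.shape[1]
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1) / qmax            # one scale per output row

    H = 2.0 * X @ X.T                               # Hessian of ||WX - QX||^2
    H += damp * np.mean(np.diag(H)) * np.eye(n_in)  # damping for invertibility
    Hinv = np.linalg.inv(H)

    Q = np.zeros_like(W)
    for i in range(n_in):
        Q[:, i] = quantize_rtn(W[:, i], scale, qmax)
        err = (W[:, i] - Q[:, i]) / Hinv[i, i]
        # Push the error onto the columns that are not quantized yet
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
        # Eliminate column i from the inverse Hessian
        Hinv -= np.outer(Hinv[:, i], Hinv[i, :]) / Hinv[i, i]
    return Q, scale

rng = np.random.default_rng(0)
n_out, n_in, n_samples = 16, 32, 256
W = rng.normal(size=(n_out, n_in))
# Correlated synthetic inputs: a few shared factors plus small noise
factors = rng.normal(size=(n_in, 4)) @ rng.normal(size=(4, n_samples))
X = factors + 0.1 * rng.normal(size=(n_in, n_samples))

Q_gptq, scale = gptq_like(W, X)
Q_rtn = quantize_rtn(W, scale[:, None], 7)          # plain rounding, same grid

err_rtn = np.linalg.norm(W @ X - Q_rtn @ X) ** 2
err_gptq = np.linalg.norm(W @ X - Q_gptq @ X) ** 2
print(f"Round-to-nearest output error: {err_rtn:.2f}")
print(f"GPTQ-like output error:        {err_gptq:.2f}")
```

Both variants use the same 4-bit grid, so any difference in output error comes from the compensation step alone. The benefit is typically largest when the input features are correlated, because the Hessian then has significant off-diagonal entries to exploit.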

## 2. Setup and Dependencies

We'll use the `optimum` library with the `auto-gptq` backend, which provides an implementation of GPTQ for Hugging Face models.

In [None]:
# Install required packages
!pip install -q torch transformers accelerate optimum auto-gptq

In [None]:
import os
import time
import torch
import numpy as np
import gc
from transformers import AutoTokenizer

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Print PyTorch version
print(f"PyTorch version: {torch.__version__}")

## 3. Helper Functions

Let's define some helper functions to measure inference time and memory usage.

In [None]:
# Function to measure inference time
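def measure_inference_time(model, tokenizer, prompt, num_runs=5):
    """Measure average generation time in seconds over multiple runs"""
    # Tokenize the prompt and move it to the device the model lives on
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Warm-up run (excluded from the timing)
    with torch.no_grad():
        _ = model.generate(**inputs, max_length=50)

    # CUDA work is asynchronous, so synchronize before reading the clock
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start_time = time.perf_counter()
    with torch.no_grad():
        for _ in range(num_runs):
            _ = model.generate(**inputs, max_length=50)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    end_time = time.perf_counter()

    avg_time = (end_time - start_time) / num_runs
    return avg_time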

# Function to get GPU memory usage
def get_gpu_memory():
    """Get current GPU memory usage in MB"""
    if torch.cuda.is_available():
        return torch.cuda.memory_allocated() / (1024 * 1024)
    else:
        return 0
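
The warm-up-then-average pattern used for timing does not depend on the model. A small sketch of the same pattern as a generic helper follows. `time_callable` is a hypothetical name, and returning per-run durations so that a median can be reported is a suggestion rather than something the rest of this notebook relies on.

```python
import statistics
import time

def time_callable(fn, num_runs=5, warmup=1):
    """Call fn() a few times untimed, then return per-run durations in seconds."""
    for _ in range(warmup):
        fn()  # warm-up: caches, lazy initialization, kernel selection
    durations = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations

# Stand-in workload; in this notebook fn would wrap a model.generate(...) call
durations = time_callable(lambda: sum(range(100_000)), num_runs=5)
print(f"mean:   {statistics.mean(durations):.6f} s")
print(f"median: {statistics.median(durations):.6f} s")
```

Keeping the individual durations makes outliers visible, since a single slow run can dominate a mean over five runs. For GPU workloads, `fn` itself would need to synchronize the device before returning for the timings to be meaningful.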

## 4. Implementing GPTQ Quantization

We'll use the `optimum` library with `auto-gptq` to quantize a pre-trained model.

In [None]:
# Import necessary libraries
try:
    from optimum.gptq import GPTQQuantizer, load_quantized_model
    from transformers import AutoModelForCausalLM
    print("GPTQ libraries imported successfully")
except ImportError as e:
    print(f"Error importing GPTQ libraries: {e}")
    print("Please make sure optimum and auto-gptq are installed correctly.")

In [None]:
# Define model name
model_name = "facebook/opt-350m"  # Using a smaller model for demonstration

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Set padding token

In [None]:
# Record initial GPU memory
initial_memory = get_gpu_memory()
print(f"Initial GPU memory usage: {initial_memory:.2f} MB")

# Load model for GPTQ quantization
try:
    print("Loading model for GPTQ quantization...")
    model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
    print("Model loaded successfully")
except Exception as e:
    print(f"Error loading model: {e}")

### 4.1 Creating a Calibration Dataset

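GPTQ requires a calibration dataset: it runs sample inputs through the model and uses the resulting activations to estimate each layer's Hessian. The handful of sentences below is enough to demonstrate the API but far too small for good results; the GPTQ paper calibrates on 128 segments of 2048 tokens each.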

In [None]:
# Create a small calibration dataset
calibration_data = [
    "Artificial intelligence has revolutionized many industries.",
    "The future of technology depends on sustainable innovation.",
    "Machine learning models require large amounts of data for training.",
    "Climate change presents significant challenges for our planet.",
    "Quantum computing may solve problems that are currently intractable."
]

# Tokenize the calibration data
tokenized_data = [tokenizer(text, return_tensors="pt").input_ids.to(device) for text in calibration_data]
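
For a more realistic calibration set, one common recipe is to draw fixed-length random windows from one long token stream. The plain-Python sketch below illustrates the sampling only. The function name `sample_calibration_windows` and the toy token IDs are hypothetical; in practice the stream would come from tokenizing a text corpus.

```python
import random

def sample_calibration_windows(token_ids, n_samples, seqlen, seed=0):
    """Draw n_samples random contiguous windows of seqlen tokens each."""
    if len(token_ids) < seqlen:
        raise ValueError("token stream is shorter than seqlen")
    rng = random.Random(seed)  # fixed seed keeps the calibration set reproducible
    windows = []
    for _ in range(n_samples):
        start = rng.randint(0, len(token_ids) - seqlen)  # both bounds inclusive
        windows.append(token_ids[start:start + seqlen])
    return windows

# Toy stream standing in for a tokenized corpus
stream = list(range(10_000))
windows = sample_calibration_windows(stream, n_samples=8, seqlen=128)
print(len(windows), len(windows[0]))  # prints: 8 128
```

Sampling full-length windows means every calibration example exercises the model at the sequence length it will be quantized for, which short standalone sentences do not.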

### 4.2 Quantizing the Model with GPTQ

Now we'll quantize the model using the GPTQ algorithm.

In [None]:
# Create output directory
output_dir = "./gptq-model"
os.makedirs(output_dir, exist_ok=True)

# Initialize the GPTQ quantizer
try:
    print("Initializing GPTQ quantizer...")
    quantizer = GPTQQuantizer(
        bits=4,  # 4-bit quantization
        dataset=calibration_data,  # Calibration texts as raw strings; the quantizer tokenizes them itself
        block_name_to_quantize="model.decoder.layers",  # Target blocks to quantize
        model_seqlen=2048  # Maximum sequence length
    )
    print("GPTQ quantizer initialized successfully")
except Exception as e:
    print(f"Error initializing quantizer: {e}")

In [None]:
# Quantize the model with GPTQ
try:
    print("Quantizing model with GPTQ...")
    quantized_model = quantizer.quantize_model(model, tokenizer)
    
    # Save the quantized weights together with the quantization config
    quantizer.save(quantized_model, output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"Model quantized and saved to {output_dir}")
except Exception as e:
    print(f"Error during quantization: {e}")
    print("Skipping quantization step due to error")

### 4.3 Loading the Quantized Model

Now let's load the quantized model for inference.

In [None]:
# Clear memory before loading the quantized model
del model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Reset memory baseline
initial_memory = get_gpu_memory()
print(f"GPU memory after cleanup: {initial_memory:.2f} MB")

In [None]:
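# Load the quantized model
from accelerate import init_empty_weights

try:
    print("Loading GPTQ quantized model...")
    # load_quantized_model expects a model skeleton as its first argument,
    # so build one without allocating real weights
    with init_empty_weights():
        empty_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
    empty_model.tie_weights()
    gptq_model = load_quantized_model(empty_model, save_folder=output_dir, device_map="auto")
    print("GPTQ quantized model loaded successfully")
    
    # Calculate memory usage
    gptq_memory = get_gpu_memory() - initial_memory
    print(f"GPTQ model GPU memory usage: {gptq_memory:.2f} MB")
except Exception as e:
    print(f"Error loading quantized model: {e}")
    print("Falling back to the FP16 model; the results below will NOT reflect GPTQ")
    gptq_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)
    gptq_memory = get_gpu_memory() - initial_memory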

## 5. Evaluating GPTQ Model Performance

Let's evaluate the performance of our GPTQ quantized model.

In [None]:
# Define a test prompt
prompt = "Artificial intelligence will transform the future by"

# Measure inference time
gptq_time = measure_inference_time(gptq_model, tokenizer, prompt)
print(f"GPTQ model average inference time: {gptq_time:.4f} seconds")

In [None]:
# Generate text with the GPTQ model
inputs = tokenizer(prompt, return_tensors="pt").to(gptq_model.device)

with torch.no_grad():
    outputs = gptq_model.generate(
        **inputs,
        max_length=100,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        num_return_sequences=1
    )

# Decode the generated text
gptq_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated text with GPTQ model:")
print(gptq_text)

## 6. Comparing with FP16 Model

Let's load an FP16 model for comparison.

In [None]:
# Clear memory before loading the FP16 model
del gptq_model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Reset memory baseline
initial_memory = get_gpu_memory()
print(f"GPU memory after cleanup: {initial_memory:.2f} MB")

In [None]:
# Load FP16 model
print("Loading model in FP16...")
fp16_model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    torch_dtype=torch.float16
).to(device)

# Calculate memory usage
fp16_memory = get_gpu_memory() - initial_memory
print(f"FP16 model GPU memory usage: {fp16_memory:.2f} MB")

In [None]:
# Measure FP16 inference time
fp16_time = measure_inference_time(fp16_model, tokenizer, prompt)
print(f"FP16 average inference time: {fp16_time:.4f} seconds")

In [None]:
# Generate text with the FP16 model
inputs = tokenizer(prompt, return_tensors="pt").to(fp16_model.device)

with torch.no_grad():
    outputs = fp16_model.generate(
        **inputs,
        max_length=100,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        num_return_sequences=1
    )

# Decode the generated text
fp16_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated text with FP16 model:")
print(fp16_text)

## 7. Comparing Results

Let's compile and visualize our results.

In [None]:
# Compile results
# Guard the memory ratio: on CPU both measurements are 0 MB
memory_reduction = f"{fp16_memory/gptq_memory:.2f}x" if gptq_memory > 0 else "n/a"
results = {
    "Model": ["FP16", "GPTQ (4-bit)"],
    "Memory Usage (MB)": [fp16_memory, gptq_memory],
    "Inference Time (s)": [fp16_time, gptq_time],
    "Memory Reduction": ["1.0x", memory_reduction],
    "Speed Improvement": ["1.0x", f"{fp16_time/gptq_time:.2f}x"]
}

# Display results as a table
import pandas as pd
results_df = pd.DataFrame(results)
results_df

In [None]:
# Visualize the results
import matplotlib.pyplot as plt

# Set up the figure
plt.figure(figsize=(12, 5))

# Plot memory usage comparison
plt.subplot(1, 2, 1)
plt.bar(results["Model"], results["Memory Usage (MB)"], color=["blue", "green"])
plt.title("Memory Usage Comparison")
plt.ylabel("Memory (MB)")
plt.grid(axis="y", alpha=0.3)

# Plot inference time comparison
plt.subplot(1, 2, 2)
plt.bar(results["Model"], results["Inference Time (s)"], color=["blue", "green"])
plt.title("Inference Time Comparison")
plt.ylabel("Time (seconds)")
plt.grid(axis="y", alpha=0.3)

plt.tight_layout()
plt.show()

## 8. Text Quality Comparison

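Let's compare the text generated by both models. Because both outputs were sampled (`do_sample=True`), they would differ even without quantization, so treat this as a sanity check rather than a quality measurement.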

In [None]:
# Print both generated texts for comparison
print("FP16 model output:")
print(fp16_text)
print("\n" + "-"*50 + "\n")
print("GPTQ model output:")
print(gptq_text)
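
A more reliable quality signal than comparing two continuations by eye is perplexity on held-out text, which is the exponential of the mean negative log-likelihood per token. The sketch below shows only the arithmetic in plain Python. `perplexity` is a hypothetical helper, and obtaining the per-token log-probabilities from each model is assumed rather than shown.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean(log p(token))), using natural logarithms."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Toy values: one model assigns probability 0.5 to every token, another 0.25
ppl_a = perplexity([math.log(0.5)] * 4)
ppl_b = perplexity([math.log(0.25)] * 4)
print(f"model A perplexity: {ppl_a:.3f}")
print(f"model B perplexity: {ppl_b:.3f}")
```

Lower is better: a model that is less surprised by the held-out text gets a lower perplexity. Evaluating the FP16 and GPTQ models on the same text gives a single number per model that does not depend on sampling randomness.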

## 9. GPTQ vs. Other Quantization Methods

Let's compare GPTQ with other quantization methods we've explored.

In [None]:
# Comparison table (qualitative and approximate, based on the literature;
# memory ratios are nominal bits per weight and ignore scale/zero-point overhead)
comparison = {
    "Method": ["FP16", "INT8", "INT4 (Naive)", "GPTQ (INT4)", "AWQ (INT4)"],
    "Bits per Weight": [16, 8, 4, 4, 4],
    "Memory Reduction": ["1.0x", "2.0x", "4.0x", "4.0x", "4.0x"],
    "Quality Preservation": ["Excellent", "Very Good", "Fair", "Good", "Very Good"],
    "Complexity": ["Low", "Low", "Low", "Medium", "Medium"],
    "Key Advantage": [
        "Full precision", 
        "Good balance", 
        "Maximum compression", 
        "Error redistribution", 
        "Activation-aware"
    ]
}

# Display comparison as a table
comparison_df = pd.DataFrame(comparison)
comparison_df
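
The 4.0x figure in the table counts only the weight bits. Grouped quantization also stores a scale and a zero-point for each group of weights, which adds a small overhead. The arithmetic can be sketched as follows. The 16-bit scale, the zero-point stored at the weight bit width, and the group size of 128 are assumptions chosen for illustration, and real checkpoint layouts differ.

```python
def effective_bits_per_weight(bits, group_size, scale_bits=16, zero_bits=None):
    """Average storage per weight, including per-group scale and zero-point."""
    if zero_bits is None:
        zero_bits = bits  # assumption: zero-point stored at weight precision
    return bits + (scale_bits + zero_bits) / group_size

def reduction_vs_fp16(bits, group_size):
    """Nominal memory reduction relative to 16-bit weights."""
    return 16 / effective_bits_per_weight(bits, group_size)

eff = effective_bits_per_weight(bits=4, group_size=128)
print(f"effective bits per weight: {eff}")
print(f"reduction vs FP16: {reduction_vs_fp16(4, 128):.2f}x")
```

Under these assumptions the overhead is a fraction of a bit per weight, so the real reduction lands slightly below 4x. Smaller groups usually improve accuracy but raise the overhead.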

## Conclusion

In this notebook, we've explored GPTQ (Generative Pre-trained Transformer Quantization), an advanced technique for quantizing large language models to 4-bit precision while maintaining quality. We've seen that:

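1. GPTQ can substantially reduce model memory usage compared to FP16 (nominally about 4x for the weights at 4 bits)
2. Inference speed depends on the available kernels: GPTQ quantizes weights only, so they are dequantized on the fly, and any speedup comes mainly from reduced memory traffic rather than INT4 arithmetic
3. The quality of text generation can be largely preserved even at 4-bit precision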

Key advantages of GPTQ:
- Layer-by-layer quantization with error redistribution
- Uses second-order information to minimize quantization error
- Better quality preservation compared to naive INT4 quantization
- No need for retraining or fine-tuning

GPTQ is particularly useful for deploying large language models on resource-constrained hardware while maintaining acceptable quality. Compared with activation-aware methods such as AWQ, neither approach is uniformly better: results vary by model and task, so it is worth evaluating both on your own workload.