# Day 31: Advanced Quantization with AWQ - Part 3

In this notebook, we'll explore Activation-aware Weight Quantization (AWQ), an advanced technique for quantizing large language models with minimal quality degradation.

## Overview

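1. Understanding AWQ
2. Setup and dependencies
3. Helper functions
4. Implementing AWQ quantization
5. Evaluating AWQ model performance
6. Comparing with an FP16 model
7. Comparing results
8. Text quality comparison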

## 1. Understanding AWQ

Activation-aware Weight Quantization (AWQ) is a post-training, weight-only quantization technique that protects the weights that matter most to the model's outputs. Its key insight is that weights are not equally important: a small fraction of weight channels, those multiplied by large-magnitude activations, has a disproportionate effect on the output, so quantization error in those channels does the most damage.

AWQ works by:
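1. Running a small calibration set through the model to identify the weight channels that see the largest activations
2. Scaling those channels up before quantization, and scaling the matching activations down by the same factor, so the important weights lose less precision
3. Quantizing the weights to INT4 precision while activations stay in FP16

This approach allows 4-bit weight compression, close to a 4x reduction in weight storage relative to FP16, while largely maintaining model quality.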
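
The scaling idea can be illustrated on a single linear layer with NumPy. This is a toy sketch, not AutoAWQ's implementation: the helper names (`quantize_dequantize`, `awq_style_scales`, `output_error`) are hypothetical, the exponent `alpha` is fixed at 0.5 as an assumption (the real method searches for it per layer), and the quantizer works per output column rather than in groups of 128.

```python
import numpy as np

def quantize_dequantize(w, n_bits=4):
    """Round-to-nearest asymmetric quantization per output column, returned as floats."""
    q_max = 2 ** n_bits - 1
    w_min = w.min(axis=0, keepdims=True)
    w_max = w.max(axis=0, keepdims=True)
    step = np.maximum(w_max - w_min, 1e-8) / q_max    # size of one quantization level
    zero = np.round(-w_min / step)                    # integer zero-point
    q = np.clip(np.round(w / step) + zero, 0, q_max)  # integers in [0, 15] for 4 bits
    return (q - zero) * step

def awq_style_scales(x, alpha=0.5):
    """Toy per-input-channel scales: larger mean |activation| gives a larger scale."""
    return np.maximum(np.abs(x).mean(axis=0), 1e-4) ** alpha

def output_error(x, w, scales=None):
    """Mean squared error of the layer output after quantizing the weights."""
    if scales is None:
        scales = np.ones(w.shape[0])                  # no scaling: plain round-to-nearest
    w_q = quantize_dequantize(w * scales[:, None])    # scale weight rows up, then quantize
    y_q = (x / scales) @ w_q                          # scale activations down to compensate
    return float(np.mean((x @ w - y_q) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 32))   # calibration activations: (samples, input channels)
x[:, :2] *= 20.0                # two input channels carry much larger activations
w = rng.normal(size=(32, 16))   # weights: (input channels, output channels)

print("round-to-nearest error:", output_error(x, w))
print("with AWQ-style scaling:", output_error(x, w, awq_style_scales(x)))
```

Before quantization the scaling is mathematically a no-op, since `(x / s) @ (s[:, None] * w)` equals `x @ w`; it only changes how the rounding error is distributed across channels.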

## 2. Setup and Dependencies

We'll use the AutoAWQ library, which provides an implementation of AWQ for Hugging Face models.

In [None]:
# Install required packages
!pip install -q torch transformers accelerate autoawq pandas matplotlib

In [None]:
import os
import time
import torch
import numpy as np
import gc
from transformers import AutoTokenizer

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Print PyTorch version
print(f"PyTorch version: {torch.__version__}")

## 3. Helper Functions

Let's define some helper functions to measure inference time and memory usage.
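
As a complement to the helpers in the next cell, peak memory is often more informative than the amount allocated at a single instant, because generation allocates temporary buffers. The sketch below is one possible way to capture wall-clock time and peak GPU memory together; the `track` context manager and the `stats` dictionary are hypothetical names introduced here, and on a CPU-only machine the memory figure is simply reported as 0.

```python
import time
from contextlib import contextmanager

import torch

@contextmanager
def track(label, stats):
    """Record wall-clock seconds and peak GPU memory (MB) for the enclosed block."""
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.synchronize()              # finish pending GPU work first
        torch.cuda.reset_peak_memory_stats()  # start peak tracking from now
    start = time.perf_counter()
    try:
        yield
    finally:
        if use_cuda:
            torch.cuda.synchronize()          # wait for kernels launched in the block
        stats[label] = {
            "seconds": time.perf_counter() - start,
            "peak_mb": torch.cuda.max_memory_allocated() / 1024**2 if use_cuda else 0.0,
        }

stats = {}
with track("matmul", stats):
    a = torch.randn(256, 256)
    _ = a @ a
print(stats)
```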

In [None]:
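# Function to measure inference time
def measure_inference_time(model, tokenizer, prompt, num_runs=5):
    """Measure average generation time in seconds over multiple runs"""
    # Put the inputs on the same device as the model's weights
    model_device = next(model.parameters()).device
    inputs = tokenizer(prompt, return_tensors="pt").to(model_device)

    def sync():
        # CUDA kernels run asynchronously; wait for them before reading the clock
        if torch.cuda.is_available():
            torch.cuda.synchronize()

    # Warm-up run
    with torch.no_grad():
        _ = model.generate(**inputs, max_length=50)

    # Measure inference time
    sync()
    start_time = time.perf_counter()
    with torch.no_grad():
        for _ in range(num_runs):
            _ = model.generate(**inputs, max_length=50)
    sync()
    end_time = time.perf_counter()

    avg_time = (end_time - start_time) / num_runs
    return avg_time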

# Function to get GPU memory usage
def get_gpu_memory():
    """Get current GPU memory usage in MB"""
    if torch.cuda.is_available():
        return torch.cuda.memory_allocated() / (1024 * 1024)
    else:
        return 0

## 4. Implementing AWQ Quantization

We'll use the AutoAWQ library to quantize a pre-trained model.

In [None]:
# Import AutoAWQ
try:
    from awq import AutoAWQForCausalLM
    print("AutoAWQ imported successfully")
except ImportError:
    print("Failed to import AutoAWQ. Please make sure it's installed correctly.")

In [None]:
# Define model name
model_name = "facebook/opt-350m"  # Using a smaller model for demonstration

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Set padding token

In [None]:
# Record initial GPU memory
initial_memory = get_gpu_memory()
print(f"Initial GPU memory usage: {initial_memory:.2f} MB")

# Load model for AWQ quantization
try:
    # Load model with AutoAWQ
    print("Loading model for AWQ quantization...")
    awq_model = AutoAWQForCausalLM.from_pretrained(
        model_name,
        device_map="auto"
    )
    print("Model loaded successfully")
except Exception as e:
    print(f"Error loading model: {e}")
    print("Falling back to standard Hugging Face model loading...")
    from transformers import AutoModelForCausalLM
    awq_model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

### 4.1 Quantizing the Model with AWQ

Now we'll quantize the model using the AWQ algorithm. This involves two steps:
1. Calibrating the model on a small dataset to identify important weights
2. Quantizing the model to 4 bits
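
"4 bits" is slightly optimistic once per-group metadata is counted. A rough back-of-the-envelope sketch, assuming each group of weights stores one FP16 scale and one zero-point of the same width as the weights (these storage assumptions and the function name are illustrative, and unquantized layers such as embeddings are ignored):

```python
def effective_bits_per_weight(w_bit=4, group_size=128, scale_bits=16, zero_bits=None):
    """Average storage per weight, including per-group scale and zero-point."""
    if zero_bits is None:
        zero_bits = w_bit  # assumption: zero-point stored at the weight bit width
    return w_bit + (scale_bits + zero_bits) / group_size

bits = effective_bits_per_weight(w_bit=4, group_size=128)
print(f"{bits:.3f} bits per weight, {16 / bits:.2f}x smaller than FP16")
```

Smaller groups track the weight distribution more closely but raise this overhead, which is the trade-off behind the group size of 128 used in the configuration below.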

In [None]:
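# Create a small calibration dataset (illustrative only: real calibration
# uses far more text, and a set this short may be too small to calibrate on)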
calibration_data = [
    "Artificial intelligence has revolutionized many industries.",
    "The future of technology depends on sustainable innovation.",
    "Machine learning models require large amounts of data for training.",
    "Climate change presents significant challenges for our planet.",
    "Quantum computing may solve problems that are currently intractable."
]

# Define quantization parameters
quant_config = {
    "zero_point": True,  # Use zero-point quantization
    "q_group_size": 128,  # Group size for quantization
    "w_bit": 4,  # 4-bit quantization
    "version": "GEMM"  # Use GEMM version for inference
}

In [None]:
# Quantize the model with AWQ
try:
    # Create output directory
    output_dir = "./awq-model"
    os.makedirs(output_dir, exist_ok=True)
    
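    # Quantize the model
    print("Quantizing model with AWQ...")
    awq_model.quantize(
        tokenizer,
        quant_config=quant_config,
        calib_data=calibration_data  # AutoAWQ's keyword for custom calibration text
    )
    # quantize() does not write to disk; save the weights and tokenizer explicitly
    awq_model.save_quantized(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"Model quantized and saved to {output_dir}")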
except Exception as e:
    print(f"Error during quantization: {e}")
    print("Skipping quantization step due to error")

### 4.2 Loading the Quantized Model

Now let's load the quantized model for inference.
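
Before loading, a quick sanity check is to look at the size of the saved checkpoint on disk. The helper below is a small sketch (the name `dir_size_mb` is hypothetical); it sums file sizes under a directory and returns 0 if the directory does not exist, which would indicate that the quantization step above did not save anything.

```python
import os

def dir_size_mb(path):
    """Total size in MB of all files under path (0.0 if path does not exist)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / (1024 * 1024)

print(f"Quantized checkpoint size: {dir_size_mb('./awq-model'):.1f} MB")
```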

In [None]:
# Clear memory before loading the quantized model
del awq_model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Reset memory baseline
initial_memory = get_gpu_memory()
print(f"GPU memory after cleanup: {initial_memory:.2f} MB")

In [None]:
# Load the quantized model
try:
    print("Loading AWQ quantized model...")
    quantized_model = AutoAWQForCausalLM.from_quantized(
        "./awq-model",
        device_map="auto"
    )
    print("AWQ quantized model loaded successfully")
    
    # Calculate memory usage
    awq_memory = get_gpu_memory() - initial_memory
    print(f"AWQ model GPU memory usage: {awq_memory:.2f} MB")
except Exception as e:
    print(f"Error loading quantized model: {e}")
    print("Falling back to standard model for comparison...")
    from transformers import AutoModelForCausalLM
    quantized_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)
    awq_memory = get_gpu_memory() - initial_memory

## 5. Evaluating AWQ Model Performance

Let's evaluate the performance of our AWQ quantized model.
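
Average seconds per call can mislead when generation stops early at an end-of-sequence token, since a model that emits fewer tokens looks faster. Normalizing by the number of newly generated tokens gives a fairer figure. A minimal sketch with a hypothetical helper name:

```python
def tokens_per_second(prompt_tokens, output_tokens, seconds):
    """Throughput over newly generated tokens only (the prompt is excluded)."""
    new_tokens = output_tokens - prompt_tokens
    if new_tokens <= 0 or seconds <= 0:
        return 0.0  # nothing generated, or no measurable time
    return new_tokens / seconds

# Example: a 10-token prompt extended to 50 tokens in 2 seconds
print(tokens_per_second(prompt_tokens=10, output_tokens=50, seconds=2.0))
```

With a Hugging Face style `generate` call, the two counts would typically come from `inputs["input_ids"].shape[1]` and `outputs.shape[1]`.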

In [None]:
# Define a test prompt
prompt = "Artificial intelligence will transform the future by"

# Measure inference time
awq_time = measure_inference_time(quantized_model, tokenizer, prompt)
print(f"AWQ model average inference time: {awq_time:.4f} seconds")

In [None]:
# Generate text with the AWQ model
# Use the device of the model's weights; the wrapper may not expose `.device`
inputs = tokenizer(prompt, return_tensors="pt").to(next(quantized_model.parameters()).device)

with torch.no_grad():
    outputs = quantized_model.generate(
        **inputs,
        max_length=100,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        num_return_sequences=1
    )

# Decode the generated text
awq_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated text with AWQ model:")
print(awq_text)

## 6. Comparing with FP16 Model

Let's load an FP16 model for comparison.

In [None]:
# Clear memory before loading the FP16 model
del quantized_model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Reset memory baseline
initial_memory = get_gpu_memory()
print(f"GPU memory after cleanup: {initial_memory:.2f} MB")

In [None]:
# Load FP16 model
from transformers import AutoModelForCausalLM

print("Loading model in FP16...")
fp16_model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    torch_dtype=torch.float16
).to(device)

# Calculate memory usage
fp16_memory = get_gpu_memory() - initial_memory
print(f"FP16 model GPU memory usage: {fp16_memory:.2f} MB")

In [None]:
# Measure FP16 inference time
fp16_time = measure_inference_time(fp16_model, tokenizer, prompt)
print(f"FP16 average inference time: {fp16_time:.4f} seconds")

In [None]:
# Generate text with the FP16 model
inputs = tokenizer(prompt, return_tensors="pt").to(fp16_model.device)

with torch.no_grad():
    outputs = fp16_model.generate(
        **inputs,
        max_length=100,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        num_return_sequences=1
    )

# Decode the generated text
fp16_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated text with FP16 model:")
print(fp16_text)

## 7. Comparing Results

Let's compile and visualize our results.

In [None]:
# Compile results
results = {
    "Model": ["FP16", "AWQ (4-bit)"],
    "Memory Usage (MB)": [fp16_memory, awq_memory],
    "Inference Time (s)": [fp16_time, awq_time],
    # Memory readings are 0 on a CPU-only machine, so guard the divisions
    "Memory Reduction": ["1.0x", f"{fp16_memory/awq_memory:.2f}x" if awq_memory > 0 else "n/a"],
    "Speed Improvement": ["1.0x", f"{fp16_time/awq_time:.2f}x" if awq_time > 0 else "n/a"]
}

# Display results as a table
import pandas as pd
results_df = pd.DataFrame(results)
results_df

In [None]:
# Visualize the results
import matplotlib.pyplot as plt

# Set up the figure
plt.figure(figsize=(12, 5))

# Plot memory usage comparison
plt.subplot(1, 2, 1)
plt.bar(results["Model"], results["Memory Usage (MB)"], color=["blue", "red"])
plt.title("Memory Usage Comparison")
plt.ylabel("Memory (MB)")
plt.grid(axis="y", alpha=0.3)

# Plot inference time comparison
plt.subplot(1, 2, 2)
plt.bar(results["Model"], results["Inference Time (s)"], color=["blue", "red"])
plt.title("Inference Time Comparison")
plt.ylabel("Time (seconds)")
plt.grid(axis="y", alpha=0.3)

plt.tight_layout()
plt.show()

## 8. Text Quality Comparison

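Let's compare the quality of text generated by both models. Both generations use sampling (`do_sample=True`), so the outputs will differ between runs as well as between models; compare fluency and coherence rather than exact wording.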
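
A side-by-side read is subjective. A more quantitative check, not part of the measurements in this notebook, is perplexity on a fixed piece of text: lower is better, and a well-quantized model should stay close to the FP16 value. The sketch below computes perplexity from a model's logits for one sequence; the function name is hypothetical and the demo uses dummy logits rather than a real model.

```python
import math

import torch
import torch.nn.functional as F

def perplexity_from_logits(logits, token_ids):
    """Perplexity of one sequence given logits (seq_len, vocab) and token ids (seq_len,)."""
    # Position t predicts token t+1, so drop the last logit and the first token
    loss = F.cross_entropy(logits[:-1].float(), token_ids[1:])
    return math.exp(loss.item())

# Dummy check: uniform logits over a 10-token vocabulary
logits = torch.zeros(6, 10)
token_ids = torch.tensor([1, 2, 3, 4, 5, 6])
print(perplexity_from_logits(logits, token_ids))
```

With a Hugging Face causal language model, the inputs would typically be `model(**inputs).logits[0]` and `inputs["input_ids"][0]`, evaluated on the same text for both models.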

In [None]:
# Print both generated texts for comparison
print("FP16 model output:")
print(fp16_text)
print("\n" + "-"*50 + "\n")
print("AWQ model output:")
print(awq_text)

## Conclusion

In this notebook, we've explored Activation-aware Weight Quantization (AWQ), an advanced technique for quantizing large language models to 4-bit precision while maintaining quality. We've seen that:

1. AWQ can significantly reduce model memory usage compared to FP16 models
2. Inference speed may improve, depending on kernel support and on how memory-bound generation is; AWQ quantizes only the weights, so the arithmetic itself still runs in FP16
3. The quality of text generation can be preserved even with extreme quantization

Key advantages of AWQ:
- More sophisticated than simple quantization, preserving important weights
- Better quality preservation compared to naive INT4 quantization
- No need for retraining or fine-tuning

AWQ is particularly useful for deploying large language models on resource-constrained hardware while maintaining acceptable quality.