# Day 31: INT8 Quantization - Part 1

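In this notebook, we'll explore INT8 quantization for large language models. We'll load a model in 8-bit precision through the Hugging Face Transformers integration with bitsandbytes, compare its size and memory usage against FP32 and FP16 versions, and run a quick generation test. Latency and quality measurements are left for the next part.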

## Overview

1. Setup and dependencies
2. Loading a pre-trained model
3. Basic INT8 quantization with PyTorch
4. Measuring model size and memory usage
5. Basic inference test

## 1. Setup and Dependencies

In [None]:
!pip install -q torch transformers datasets evaluate accelerate psutil bitsandbytes pandas matplotlib

In [None]:
import os
import time
import torch
import numpy as np
import psutil
import gc
from transformers import AutoModelForCausalLM, AutoTokenizer

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Print PyTorch version
print(f"PyTorch version: {torch.__version__}")

## 2. Loading a Pre-trained Model

We'll use a small language model for demonstration purposes. In practice, quantization is most beneficial for larger models.
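To make that concrete, here is a back-of-the-envelope sketch of how weight storage scales with parameter count at each precision. The parameter counts are hypothetical round numbers chosen for illustration, the helper name `weight_footprint_gib` is not part of the notebook, and the estimate covers weights only (no activations, KV cache, or framework overhead).

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}

def weight_footprint_gib(num_params: int) -> dict:
    """Return the approximate weight storage in GiB for each precision."""
    return {
        name: num_params * nbytes / 1024**3
        for name, nbytes in BYTES_PER_PARAM.items()
    }

# Hypothetical parameter counts, chosen only to illustrate the scaling.
for num_params in (350_000_000, 7_000_000_000, 70_000_000_000):
    sizes = weight_footprint_gib(num_params)
    row = ", ".join(f"{name}: {gib:7.2f} GiB" for name, gib in sizes.items())
    print(f"{num_params:>14,} params -> {row}")
```

The ratio between precisions is the same at every scale, but the absolute saving grows with the model, which is why quantization matters most when the higher-precision weights no longer fit on the available hardware.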

In [None]:
# Define model name
model_name = "facebook/opt-350m"  # Using a smaller model for demonstration

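# Function to measure memory usage
def get_memory_usage():
    """Get current memory usage in MB.

    When CUDA is available the models live on the GPU, so we read the CUDA
    allocator; process RSS only reflects host (CPU) memory.
    """
    if torch.cuda.is_available():
        return torch.cuda.memory_allocated() / (1024 * 1024)  # Convert to MB
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / (1024 * 1024)  # Convert to MB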

# Function to count parameters
def count_parameters(model):
    """Count the number of parameters in a model"""
    return sum(p.numel() for p in model.parameters())

# Function to calculate model size in MB
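def calculate_model_size(model, dtype=torch.float32):
    """Estimate model size in MB, assuming every parameter is stored in `dtype`.

    This is a nominal figure: quantized models usually keep some layers
    (for example embeddings and layer norms) in higher precision.
    """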
    bytes_per_element = {
        torch.float32: 4,
        torch.float16: 2,
        torch.int8: 1
    }
    num_params = count_parameters(model)
    size_bytes = num_params * bytes_per_element.get(dtype, 4)
    size_mb = size_bytes / (1024 * 1024)  # Convert to MB
    return size_mb

In [None]:
# Record initial memory usage
initial_memory = get_memory_usage()
print(f"Initial memory usage: {initial_memory:.2f} MB")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Set padding token

# Load model in FP32 (default precision)
print("Loading model in FP32...")
fp32_model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Calculate model size and memory usage
fp32_params = count_parameters(fp32_model)
fp32_size = calculate_model_size(fp32_model, torch.float32)
fp32_memory = get_memory_usage() - initial_memory

print(f"FP32 model parameters: {fp32_params:,}")
print(f"FP32 model size: {fp32_size:.2f} MB")
print(f"FP32 model memory usage: {fp32_memory:.2f} MB")

## 3. Basic INT8 Quantization with PyTorch

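PyTorch ships its own quantization tooling in `torch.quantization`, but the code below takes a different route: it loads the model in 8-bit through the Transformers `load_in_8bit` option, which relies on the bitsandbytes library and requires a CUDA GPU. We first load an FP16 copy of the model as a baseline for comparison.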
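Before running the library code, it may help to see the core idea behind INT8 weight quantization in isolation. The following is a minimal sketch of symmetric "absmax" quantization with a single scale per tensor, written in NumPy. It is not how bitsandbytes implements 8-bit loading (that scheme uses finer-grained scaling and special handling for outlier values), and the function names `quantize_absmax` and `dequantize` are illustrative only.

```python
import numpy as np

def quantize_absmax(weights: np.ndarray):
    """Map float weights to int8 using one symmetric scale per tensor."""
    max_abs = float(np.abs(weights).max())
    # The largest magnitude maps to 127; guard against an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 values."""
    return q.astype(np.float32) * scale

# Synthetic weights standing in for one layer of a model.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4, 8)).astype(np.float32)

q, scale = quantize_absmax(w)
w_hat = dequantize(q, scale)

print("dtype:", q.dtype, "| bytes:", w.nbytes, "->", q.nbytes)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale. A single large outlier inflates the scale for the whole tensor, which is one reason practical schemes use per-row or per-block scales.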

In [None]:
# First, let's load an FP16 version for comparison
print("Loading model in FP16...")
fp16_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(device)

# Calculate FP16 model size and memory usage
fp16_params = count_parameters(fp16_model)
fp16_size = calculate_model_size(fp16_model, torch.float16)
fp16_memory = get_memory_usage() - initial_memory - fp32_memory

print(f"FP16 model parameters: {fp16_params:,}")
print(f"FP16 model size: {fp16_size:.2f} MB")
print(f"FP16 model memory usage: {fp16_memory:.2f} MB")

In [None]:
# Clear memory before loading the INT8 model
del fp32_model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

# Reset memory baseline
initial_memory = get_memory_usage()
print(f"Memory after cleanup: {initial_memory:.2f} MB")

In [None]:
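# Load model with INT8 quantization (requires a CUDA GPU and the bitsandbytes package)
from transformers import BitsAndBytesConfig

print("Loading model with INT8 quantization...")
int8_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True)  # Enable INT8 quantization
)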

# Calculate INT8 model size and memory usage
int8_params = count_parameters(int8_model)
int8_size = calculate_model_size(int8_model, torch.int8)
int8_memory = get_memory_usage() - initial_memory

print(f"INT8 model parameters: {int8_params:,}")
print(f"INT8 model size: {int8_size:.2f} MB")
print(f"INT8 model memory usage: {int8_memory:.2f} MB")
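The INT8 size printed above assumes one byte for every parameter. To check which precisions a model actually stores, one option is to group parameter storage by dtype. The sketch below does this on a tiny stand-in module; the helper `size_by_dtype` and the `toy` model are hypothetical and not part of the notebook. It counts parameters only, not buffers.

```python
from collections import defaultdict

import torch
from torch import nn

def size_by_dtype(model: nn.Module) -> dict:
    """Sum parameter storage in bytes, grouped by the dtype actually in use."""
    totals = defaultdict(int)
    for param in model.parameters():
        totals[param.dtype] += param.numel() * param.element_size()
    return dict(totals)

# Tiny stand-in model with mixed precision (hypothetical, for illustration).
toy = nn.Sequential(nn.Linear(16, 32), nn.LayerNorm(32))
toy[0].half()  # Linear layer in FP16; LayerNorm stays in FP32

for dtype, nbytes in size_by_dtype(toy).items():
    print(f"{dtype}: {nbytes} bytes")
```

Calling `size_by_dtype(int8_model)` should show the quantized linear weights under `torch.int8` and the remaining parameters under a floating-point dtype, which gives a more realistic total than the nominal estimate.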

## 4. Measuring Model Size and Memory Usage

Let's compare the size and memory usage of the different precision models.

In [None]:
# Compile results
results = {
    "Precision": ["FP32", "FP16", "INT8"],
    "Parameters": [fp32_params, fp16_params, int8_params],
    "Model Size (MB)": [fp32_size, fp16_size, int8_size],
    "Memory Usage (MB)": [fp32_memory, fp16_memory, int8_memory],
    "Size Reduction": ["1.0x", f"{fp32_size/fp16_size:.2f}x", f"{fp32_size/int8_size:.2f}x"],
    "Memory Reduction": ["1.0x", f"{fp32_memory/fp16_memory:.2f}x", f"{fp32_memory/int8_memory:.2f}x"]
}

# Display results as a table
import pandas as pd
results_df = pd.DataFrame(results)
results_df

In [None]:
# Visualize the results
import matplotlib.pyplot as plt

# Set up the figure
plt.figure(figsize=(12, 5))

# Plot model size comparison
plt.subplot(1, 2, 1)
plt.bar(results["Precision"], results["Model Size (MB)"], color=["blue", "green", "red"])
plt.title("Model Size Comparison")
plt.ylabel("Size (MB)")
plt.grid(axis="y", alpha=0.3)

# Plot memory usage comparison
plt.subplot(1, 2, 2)
plt.bar(results["Precision"], results["Memory Usage (MB)"], color=["blue", "green", "red"])
plt.title("Memory Usage Comparison")
plt.ylabel("Memory (MB)")
plt.grid(axis="y", alpha=0.3)

plt.tight_layout()
plt.show()

## 5. Basic Inference Test

Let's perform a basic inference test to ensure our quantized model is working correctly.

In [None]:
# Define a test prompt
prompt = "Artificial intelligence will transform the future by"

# Tokenize the prompt
inputs = tokenizer(prompt, return_tensors="pt").to(int8_model.device)

# Generate text with the INT8 model
with torch.no_grad():
    outputs = int8_model.generate(
        **inputs,
        max_length=100,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        num_return_sequences=1
    )

# Decode the generated text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated text with INT8 model:")
print(generated_text)

In [None]:
# Generate text with the FP16 model for comparison
inputs = tokenizer(prompt, return_tensors="pt").to(fp16_model.device)

with torch.no_grad():
    outputs = fp16_model.generate(
        **inputs,
        max_length=100,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        num_return_sequences=1
    )

# Decode the generated text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated text with FP16 model:")
print(generated_text)
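Because both generations use sampling, the two outputs will differ even when the models behave almost identically, so reading the text gives only a rough sanity check. A more direct comparison looks at the next-token logits of the two models for the same prompt. The sketch below shows one way to summarize that difference, using synthetic arrays in place of real logits; `compare_logits` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def compare_logits(ref: np.ndarray, test: np.ndarray) -> dict:
    """Summarize how far `test` logits drift from `ref` logits."""
    ref = ref.astype(np.float64).ravel()
    test = test.astype(np.float64).ravel()
    cosine = float(ref @ test / (np.linalg.norm(ref) * np.linalg.norm(test)))
    return {
        "max_abs_diff": float(np.abs(ref - test).max()),
        "cosine_similarity": cosine,
        "same_top_token": bool(ref.argmax() == test.argmax()),
    }

# Synthetic next-token logits standing in for two models' outputs.
rng = np.random.default_rng(0)
ref_logits = rng.normal(size=1000)
noisy_logits = ref_logits + rng.normal(scale=0.01, size=1000)

print(compare_logits(ref_logits, noisy_logits))
```

With the notebook's models, the real inputs could be obtained along the lines of `fp16_model(**inputs).logits[0, -1].float().cpu().numpy()` and the same expression for `int8_model`, with `inputs` moved to each model's device.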

## Conclusion

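In this notebook, we've explored basic INT8 quantization by loading a model in 8-bit through Transformers and bitsandbytes. We've seen that:

1. INT8 weights nominally take a quarter of the space of FP32 weights and half that of FP16 weights (1 byte per parameter versus 4 and 2), although layers kept in higher precision make the real saving somewhat smaller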
2. The quantized model can still generate coherent text

In the next part, we'll explore more advanced quantization techniques and measure the impact on inference latency and quality metrics.