# TorchAO Demo

This notebook demonstrates using TorchAO to optimize a BERT model for inference. The goal is mainly to show how easy the library makes architecture optimization (AO), not to present definitive results: I am on a CPU-only system, and almost all of the training and inference memory and performance gains are seen on GPU.

In [60]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torchao.quantization import quantize_, int8_weight_only
import time
import psutil
import numpy as np

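def get_memory_usage():
    """Return the resident set size (RSS) of the current process in MB."""
    return psutil.Process().memory_info().rss / 1024 / 1024

def run_inference(model, inputs, num_runs=10):
    """Return the mean wall-clock time in seconds of one forward pass over `inputs`."""
    # perf_counter is a monotonic, high-resolution clock intended for timing
    start_time = time.perf_counter()
    with torch.no_grad():  # no gradients are needed for inference
        for _ in range(num_runs):
            model(**inputs)
    end_time = time.perf_counter()
    return (end_time - start_time) / num_runs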
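
The timing helper above averages every run, including the first one, which is often slower because of one-time setup costs. A minimal sketch of a variant that discards a few warm-up calls and also reports the median is shown below. The name `benchmark` and its parameters are hypothetical and not part of TorchAO or PyTorch; it uses only the standard library.

```python
import statistics
import time

def benchmark(fn, num_warmup=2, num_runs=10):
    """Time fn() after a few untimed warm-up calls; return mean and median seconds."""
    for _ in range(num_warmup):
        fn()  # warm-up calls are executed but not timed
    durations = []
    for _ in range(num_runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
    }
```

In this notebook it could be called as `benchmark(lambda: model(**inputs))` inside a `torch.no_grad()` block. The median is less sensitive than the mean to an occasional slow run caused by other processes on the machine.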

## Load Model and Prepare Data
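We'll use BERT-large and prepare a batch of 300 samples for inference. The classification head is newly initialized (see the warning in the output below), which is fine here because we only measure speed and memory, not accuracy.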

In [61]:
# Load model and tokenizer
model_name = "bert-large-uncased"
print(f"Loading {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Prepare input (larger batch size)
print("Preparing input data...")
texts = [
    "TorchAO is an amazing library for optimizing PyTorch models.",
    "The weather is beautiful today.",
    "Machine learning is transforming various industries."
] * 100  # 300 samples
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

Loading bert-large-uncased...


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Preparing input data...


## Baseline Model

Let's first measure the performance of the baseline model.

In [None]:
print("\nRunning baseline model...")
baseline_memory = get_memory_usage()
baseline_time = run_inference(model, inputs)
print(f"Baseline - Time: {baseline_time:.4f}s, Memory: {baseline_memory:.2f}MB")

## Int8 Weight-Only Quantization

Now, let's apply TorchAO's int8 weight-only quantization and measure its performance.

In [None]:
print("\nRunning int8 weight-only quantized model...")
model_int8 = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
quantize_(model_int8, int8_weight_only())
int8_memory = get_memory_usage()
int8_time = run_inference(model_int8, inputs)
print(f"Int8 Weight-Only - Time: {int8_time:.4f}s, Memory: {int8_memory:.2f}MB")
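
To give an intuition for what weight-only int8 quantization does, here is a toy sketch in plain Python. It assumes a symmetric scheme with one scale per output row of the weight matrix, which is my understanding of the general approach; it is an illustration of the idea, not TorchAO's implementation. The function names are hypothetical.

```python
def quantize_rows_int8(weight):
    """Quantize each row of a 2D list of floats to int8 values with its own scale."""
    quantized, scales = [], []
    for row in weight:
        max_abs = max(abs(w) for w in row)
        # Map the largest magnitude in the row to 127; avoid dividing by zero
        scale = max_abs / 127 if max_abs > 0 else 1.0
        quantized.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return quantized, scales

def dequantize_rows(quantized, scales):
    """Recover approximate float weights from int8 values and per-row scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]

weight = [[0.50, -1.27, 0.03], [0.002, 0.010, -0.004]]
q, scales = quantize_rows_int8(weight)
restored = dequantize_rows(q, scales)
```

Because each row has its own scale, the second row keeps its resolution even though its values are much smaller than those in the first row. The rounding error per weight is at most half of that row's scale.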

## Dynamic Quantization

For comparison, let's also try PyTorch's dynamic quantization.

In [None]:
print("\nRunning dynamically quantized model...")
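# Quantize only the nn.Linear layers; this returns a quantized copy and leaves `model` unchanged
model_dynamic = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)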
dynamic_memory = get_memory_usage()
dynamic_time = run_inference(model_dynamic, inputs)
print(f"Dynamic Quantization - Time: {dynamic_time:.4f}s, Memory: {dynamic_memory:.2f}MB")
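
The difference from weight-only quantization is that dynamic quantization also quantizes the activations, with a scale computed from the values seen at inference time. The toy sketch below shows this for a single dot product in plain Python. It uses a symmetric scheme for simplicity, which is an assumption made for illustration and not a description of PyTorch's actual kernels; the function names are hypothetical.

```python
def quantize_symmetric(values):
    """Quantize a list of floats to int8 values with one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) for v in values], scale

def dynamic_int8_dot(weight_q, weight_scale, activations):
    """Dot product with pre-quantized weights and activations quantized on the fly."""
    # The activation scale is derived from this particular input at inference time
    act_q, act_scale = quantize_symmetric(activations)
    acc = sum(w * a for w, a in zip(weight_q, act_q))  # integer accumulation
    return acc * weight_scale * act_scale  # rescale back to floating point

weights = [0.2, -0.5, 0.1]
activations = [1.0, 2.0, -3.0]
weight_q, weight_scale = quantize_symmetric(weights)  # done once, ahead of time
approx = dynamic_int8_dot(weight_q, weight_scale, activations)
exact = sum(w * a for w, a in zip(weights, activations))
```

Here `approx` and `exact` agree closely. Doing the multiply-accumulate on integers is what lets int8 kernels run faster than float32 ones on hardware that supports them.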

## Performance Comparison

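Let's compare both quantization methods against the baseline. Each ratio is the baseline value divided by the quantized value, so numbers above 1 are improvements and numbers below 1 are regressions. Note that `get_memory_usage` reports the resident memory of the whole process, which still holds the baseline model (and, by the last measurement, the int8 model as well), so the memory ratios here do not isolate the footprint of each quantized model.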

In [65]:
print("\nPerformance Improvements:")
print(f"Int8 weight-only speedup: {baseline_time / int8_time:.2f}x")
print(f"Int8 weight-only memory reduction: {baseline_memory / int8_memory:.2f}x")
print(f"Dynamic quantization speedup: {baseline_time / dynamic_time:.2f}x")
print(f"Dynamic quantization memory reduction: {baseline_memory / dynamic_memory:.2f}x")


Performance Improvements:
Int8 weight-only speedup: 0.87x
Int8 weight-only memory reduction: 0.57x
Dynamic quantization speedup: 2.79x
Dynamic quantization memory reduction: 0.43x
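
As a rough sanity check on what weight memory savings could look like when measured in isolation, the arithmetic can be sketched as follows. The parameter count is an assumption (roughly 340 million, the approximate figure reported for BERT-large), and `weight_memory_mb` is a hypothetical helper.

```python
def weight_memory_mb(num_params, bytes_per_param):
    """Memory needed to store num_params weights at the given width, in MB."""
    return num_params * bytes_per_param / 1024 / 1024

approx_params = 340_000_000  # assumption: approximate BERT-large parameter count
fp32_mb = weight_memory_mb(approx_params, 4)  # float32 uses 4 bytes per weight
int8_mb = weight_memory_mb(approx_params, 1)  # int8 uses 1 byte per weight
print(f"fp32: {fp32_mb:.0f}MB, int8: {int8_mb:.0f}MB, ratio: {fp32_mb / int8_mb:.1f}x")
```

This is an upper bound on the saving: only the linear layers are quantized in this notebook, so embeddings and normalization parameters stay in floating point, and the quantization scales add a small overhead.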
