<h1 align="center" style="color:green;font-size: 3em;">
Implementing Quantization Techniques</h1>

In this notebook, we will explore quantization techniques to optimize memory requirements.

### Install dependencies

In [2]:
%pip install datasets -q

Note: you may need to restart the kernel to use updated packages.


### Import Libraries

In [1]:
import torch

from transformers import BertModel, BertTokenizer, DistilBertForSequenceClassification, DistilBertTokenizer
from datasets import load_dataset
from torch.utils.data import DataLoader
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, classification_report
from tqdm import tqdm
from torch.optim import AdamW
import torch.quantization

In [3]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

### Introduction to Quantization

Quantization is a model compression technique used to reduce the size and computational requirements of LLMs. The central idea is to represent the model’s weights and activations using lower-precision data types, such as `int8` or `float16`, instead of the standard `float32`. This significantly reduces the memory footprint and can speed up computation, since lower-precision arithmetic is generally cheaper on hardware that supports it.

There are various types of quantization techniques, including post-training quantization (PTQ), where the model is quantized after training, and quantization-aware training (QAT), where the model is trained with quantization in mind. Quantization often costs some model accuracy, but techniques like QAT help mitigate the loss by adjusting weights during training to account for the reduced precision. By balancing computational efficiency against model performance, quantization enables LLMs to run effectively in real-world applications without extensive hardware resources.

First, we will explore the memory usage of different tensor data types in PyTorch. Understanding how the choice of data type affects memory consumption is crucial when working with large datasets or models in deep learning.

In [4]:
# Create a tensor of type float32
tensor = torch.randn(100, 100, dtype=torch.float32)
print(f"Memory (float32): {tensor.element_size() * tensor.nelement()} bytes")

# Create a tensor of the same shape of type float16
tensor_fp16 = tensor.to(dtype=torch.float16)
print(f"Memory (float16): {tensor_fp16.element_size() * tensor_fp16.nelement()} bytes")

# Create a tensor of the same shape of type int8 (quantized with a fixed scale of 0.1)
tensor_int8 = torch.quantize_per_tensor(tensor, scale=0.1, zero_point=0, dtype=torch.qint8)
print(f"Memory (int8): {tensor_int8.int_repr().element_size() * tensor_int8.numel()} bytes")

Memory (float32): 40000 bytes
Memory (float16): 20000 bytes
Memory (int8): 10000 bytes
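
The memory savings come at the cost of rounding error. The sketch below is an illustration rather than part of the original notebook: it derives a scale from the tensor's own range (one common choice, assumed here) and measures how far the values move after a quantize/dequantize round trip.

```python
import torch

torch.manual_seed(0)
x = torch.randn(100, 100)

# Assumption: symmetric quantization, with the largest magnitude mapped to 127
scale = x.abs().max().item() / 127
q = torch.quantize_per_tensor(x, scale=scale, zero_point=0, dtype=torch.qint8)

# Dequantize back to float32 and measure the worst-case rounding error
max_error = (x - q.dequantize()).abs().max().item()

print(f"scale: {scale:.5f}")
print(f"max round-trip error: {max_error:.5f}")
```

With a scale chosen this way no value is clipped, so the error per element stays at roughly half a quantization step (`scale / 2`) or less.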


### Quantize a Small NN Model

Next, we will look at how converting a BERT model's data type affects its output in PyTorch. Specifically, we will compare the memory used by the model's output hidden states when the model runs in float32 and in float16.

In [5]:
# Load the model and tokenizer
model = BertModel.from_pretrained("prajjwal1/bert-small")
tokenizer = BertTokenizer.from_pretrained("prajjwal1/bert-small")
model.eval()

# Tokenize a sample sentence and run it through the float32 model
input_text = "Quantization is useful!"
inputs = tokenizer(input_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Cast the model weights to float16 (half precision) and run the same sentence again
model.half()
with torch.no_grad():
    quantized_outputs = model(**inputs)

# Print the bytes used by the output hidden states in each precision
print(f"Memory (float32): {outputs.last_hidden_state.element_size() * outputs.last_hidden_state.nelement()} bytes")
print(f"Memory (float16): {quantized_outputs.last_hidden_state.element_size() * quantized_outputs.last_hidden_state.nelement()} bytes")



Memory (float32): 16384 bytes
Memory (float16): 8192 bytes
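
The numbers above measure only the output tensor. To see the effect on the model itself, one option is to sum the bytes of its parameters. The sketch below uses a small hypothetical `toy_model` as a stand-in so that it runs without downloading anything; the `parameter_bytes` helper is an illustrative name, not part of any library.

```python
import torch
from torch import nn

def parameter_bytes(model: nn.Module) -> int:
    """Total bytes occupied by the model's parameters."""
    return sum(p.numel() * p.element_size() for p in model.parameters())

# Hypothetical stand-in model; any nn.Module (including BertModel) works the same way
toy_model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 2))

fp32_bytes = parameter_bytes(toy_model)
fp16_bytes = parameter_bytes(toy_model.half())  # .half() casts the parameters in place

print(f"Parameters (float32): {fp32_bytes} bytes")
print(f"Parameters (float16): {fp16_bytes} bytes")
```

Calling `parameter_bytes(model)` before and after `model.half()` on the BERT model should show the same halving.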



- What is quantization and why is it important for large language models?

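  Quantization is a compression technique that reduces the precision of a model's weights (and often its activations) from a higher-precision type (float32 in our case) to a lower-precision type (float16 or int8). This reduces both the memory requirement and the computational load, which matters for large language models because their size often exceeds what a single device can hold.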

- How does reducing precision from float32 to int8 impact memory usage?

  Reducing precision from float32 (32 bits) to int8 (8 bits) means that each weight needs only one quarter of the original storage, so the memory used for weights drops by about 75% (ignoring the small overhead of storing scales and zero points).

- Explain the difference between per-layer and per-channel quantization. Why might per-channel quantization be more effective for certain tasks?

  Per-layer (also called per-tensor) quantization uses a single scale and zero point for every value in a tensor, while per-channel quantization uses a separate scale and zero point for each channel of the tensor. Per-channel quantization can therefore adapt to channels whose value ranges differ widely, which reduces quantization error and improves accuracy.
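
The difference between the two schemes can be sketched on a made-up weight matrix whose rows have very different magnitudes. The matrix and the choice of symmetric scales are assumptions for illustration; `torch.quantize_per_tensor` and `torch.quantize_per_channel` are the PyTorch functions doing the work.

```python
import torch

torch.manual_seed(0)
# Hypothetical weights: 4 output channels (rows) with magnitudes from 0.01 to 10
weight = torch.randn(4, 64) * torch.tensor([[0.01], [0.1], [1.0], [10.0]])

# Per-tensor: a single scale derived from the global maximum
tensor_scale = weight.abs().max().item() / 127
q_tensor = torch.quantize_per_tensor(weight, tensor_scale, 0, torch.qint8)

# Per-channel: one scale per row (axis 0), zero points all 0
channel_scales = weight.abs().amax(dim=1) / 127
zero_points = torch.zeros(4, dtype=torch.int64)
q_channel = torch.quantize_per_channel(weight, channel_scales, zero_points, 0, torch.qint8)

# Mean absolute round-trip error for each row
err_tensor = (weight - q_tensor.dequantize()).abs().mean(dim=1)
err_channel = (weight - q_channel.dequantize()).abs().mean(dim=1)

print("per-tensor error by row: ", err_tensor)
print("per-channel error by row:", err_channel)
```

With a single global scale, the small-magnitude rows collapse to nearly all zeros, whereas the per-row scales keep their relative error comparable to that of the largest row.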

### Post-Training Quantization

Now that we have seen a small example of quantization, let us explore some quantization techniques, starting with Post-Training Quantization (PTQ).

PTQ optimizes pretrained neural network models by reducing the precision of weights and activations after training, which decreases memory usage and can speed up inference while largely preserving accuracy. There are two main types of PTQ: static and dynamic. Static quantization computes scaling factors for both weights and activations during a calibration phase using a representative dataset, so the quantization parameters are fixed before inference. Dynamic quantization, by contrast, quantizes the weights ahead of time and computes the quantization parameters for activations on the fly at runtime, so it needs no calibration dataset and is easier to apply. Both strategies make models more practical to deploy in resource-constrained environments.
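
As a rough illustration of where the two approaches get their activation scales, the sketch below computes affine `quint8` parameters by hand. The random "calibration batches" are hypothetical stand-ins for real activations, and the min/max rule is only one of several possible calibration strategies.

```python
import torch

torch.manual_seed(0)
# Hypothetical calibration data standing in for recorded activations
calibration_batches = [torch.randn(32, 128) for _ in range(10)]

# Static: observe the range once, ahead of time, and freeze the parameters
observed_min = min(batch.min().item() for batch in calibration_batches)
observed_max = max(batch.max().item() for batch in calibration_batches)
static_scale = (observed_max - observed_min) / 255
static_zero_point = int(round(-observed_min / static_scale))

# Dynamic: compute the range from each input as it arrives at inference time
new_input = torch.randn(32, 128)
dynamic_scale = (new_input.max().item() - new_input.min().item()) / 255

# Quantize the new input with the frozen static parameters;
# values outside the calibrated range are clipped
q_static = torch.quantize_per_tensor(new_input, static_scale, static_zero_point, torch.quint8)

print(f"static scale: {static_scale:.5f}, zero point: {static_zero_point}")
print(f"dynamic scale for this input: {dynamic_scale:.5f}")
```

The static parameters cost nothing at inference time but depend on the calibration data being representative, while the dynamic scale adapts to every input at the price of extra work per call.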

### Implementing Dynamic Quantization

For dynamic quantization, first, we will load a pre-trained DistilBERT model and its corresponding tokenizer, which will be used for sequence classification tasks.

In [6]:
# Load a pre-trained DistilBERT model and tokenizer
model_name = 'distilbert-base-uncased'
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Next, we will apply dynamic quantization to the pre-trained DistilBERT model to reduce its size and improve inference speed without significant loss in accuracy.

In [7]:
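def apply_dynamic_quantization(model):
    # PyTorch dynamic quantization runs on CPU only, so move the model there first
    model = model.to('cpu')
    model.eval()
    # Replace every nn.Linear layer with a dynamically quantized int8 version
    return torch.quantization.quantize_dynamic(
        model,
        {torch.nn.Linear},
        dtype=torch.qint8
    )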
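
To check how much smaller a dynamically quantized model is, one option is to compare the serialized size of its `state_dict` before and after. The sketch below uses a small hypothetical `toy_model` instead of DistilBERT so that it runs quickly, and `serialized_size_bytes` is an illustrative helper name rather than a library function. It assumes a CPU build of PyTorch with a quantized backend available.

```python
import io

import torch
from torch import nn

def serialized_size_bytes(model: nn.Module) -> int:
    """Size of the model's state_dict when saved with torch.save."""
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes

# Hypothetical stand-in model made of the layer type that gets quantized
toy_model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 2)).eval()
toy_quantized = torch.quantization.quantize_dynamic(
    toy_model, {nn.Linear}, dtype=torch.qint8
)

fp32_size = serialized_size_bytes(toy_model)
int8_size = serialized_size_bytes(toy_quantized)

print(f"float32 model: {fp32_size} bytes")
print(f"int8 model:    {int8_size} bytes")
```

The same helper can be applied to the DistilBERT model before and after `apply_dynamic_quantization`. Only the `nn.Linear` weights shrink, so the overall reduction there will be less than 4x because the embeddings stay in float32.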

Next, we will evaluate the performance of the quantized DistilBERT model using various metrics to gain a comprehensive understanding of its effectiveness.

In [9]:
def evaluate_model(model, data_loader):
  model.eval()
  all_predictions = []
  all_labels = []
  with torch.no_grad():
    for batch in data_loader:
      # Tokenize the raw text; inputs stay on CPU to match the quantized model
      inputs = tokenizer(batch['text'], return_tensors='pt', padding=True, truncation=True)
      labels = batch['label']
      outputs = model(**inputs)
      # The predicted class is the index of the largest logit
      predictions = torch.argmax(outputs.logits, dim=1)
      all_predictions.extend(predictions.cpu().numpy())
      all_labels.extend(labels.cpu().numpy())

  accuracy = accuracy_score(all_labels, all_predictions)
  f1 = f1_score(all_labels, all_predictions, average='weighted')
  report = classification_report(all_labels, all_predictions)
  return accuracy, f1, report

Finally, we want to combine all the steps together. We will load a dataset, apply dynamic quantization to the pre-trained DistilBERT model, and evaluate its performance.

In [10]:
# Load the IMDB dataset
dataset = load_dataset('imdb')
training_dataset = dataset['train'].shuffle(seed=42).select(range(2000))
evaluation_dataset = dataset['test'].shuffle(seed=42).select(range(500))

# Create DataLoader for training and testing
training_dataloader = DataLoader(training_dataset, batch_size=16, shuffle=True)
evaluation_dataloader = DataLoader(evaluation_dataset, batch_size=16)

# Apply dynamic quantization
quantized_model = apply_dynamic_quantization(model)

# Evaluate the quantized model
accuracy, f1, report = evaluate_model(quantized_model, evaluation_dataloader)
print(f"Accuracy: {accuracy}")
print(f"F1 Score: {f1}")
print(f"Classification Report:\n{report}")
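
Accuracy is only half of the picture, since dynamic quantization is also meant to speed up inference. A minimal way to compare latency is sketched below on a hypothetical toy model; `mean_latency_ms` is an illustrative helper, and the timings depend heavily on the CPU and the PyTorch build, so treat the numbers as indicative only.

```python
import time

import torch
from torch import nn

def mean_latency_ms(model: nn.Module, example: torch.Tensor, runs: int = 20) -> float:
    """Average forward-pass time in milliseconds over `runs` calls."""
    model.eval()
    with torch.no_grad():
        model(example)  # warm-up call, not timed
        start = time.perf_counter()
        for _ in range(runs):
            model(example)
    return (time.perf_counter() - start) * 1000 / runs

# Hypothetical stand-in model; swap in the DistilBERT models to measure them instead
toy_model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 2))
toy_int8 = torch.quantization.quantize_dynamic(toy_model, {nn.Linear}, dtype=torch.qint8)

example = torch.randn(16, 1024)
fp32_ms = mean_latency_ms(toy_model, example)
int8_ms = mean_latency_ms(toy_int8, example)

print(f"float32: {fp32_ms:.3f} ms per batch")
print(f"int8:    {int8_ms:.3f} ms per batch")
```

Running the same comparison, together with `evaluate_model`, on the original float32 DistilBERT model would show what the quantized model gains in speed and gives up in accuracy.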