# Model Compression Techniques

[Reference Article: Compressing Large Language Models (LLMs) by Shaw Talebi](https://towardsdatascience.com/compressing-large-language-models-llms-9f406eea5b5e)

# Purpose
The purpose of compressing LLMs is to reduce model size without sacrificing performance.
# Strategies
There are three main compression strategies. They are independent of each other, meaning that they can be used together to potentially yield greater compression while maintaining comparable performance.
1. Quantization  
This means reducing the precision of the model's parameters.
- Post-Training Quantization (PTQ) means training the model and then quantizing it, which improves end-user inference speed.
- Quantization-Aware Training (QAT) means quantizing and then training, which may yield better performance on specific downstream tasks.
2. Pruning  
This means removing parameters, or even entire layers, from the model.
- Unstructured pruning removes individual weights. It leads to a greater reduction but requires specialized hardware.
- Structured pruning removes entire structures and yields less reduction.
3. Knowledge Distillation  
This means transferring knowledge from a larger (teacher) model to a smaller (student) model.
- When the teacher is another model, it produces "soft targets", meaning the values it produces are probabilistic. The student "learns" by comparing its outputs to those of the teacher model.
- When we use ground-truth values to train the student model, those are called "hard targets", e.g., 0 or 1 in the case of binary classification.
- Another method is to use an existing LLM to generate synthetic data, which can be fed into the student model.
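
To make the difference between soft and hard targets concrete, here is a minimal sketch using made-up teacher logits for a two-class problem. The logit values and the temperatures are illustrative assumptions, not taken from any model in this notebook. Dividing the logits by a temperature above 1 before the softmax flattens the distribution, so the soft target says more about how the teacher ranks the class it did not predict.

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits for one example with two classes
teacher_logits = torch.tensor([[3.0, 1.0]])

# Hard target: only the index of the predicted class
hard_target = teacher_logits.argmax(dim=1)

# Soft targets: a probability distribution over the classes
soft_t1 = F.softmax(teacher_logits / 1.0, dim=1)  # temperature 1 (plain softmax)
soft_t4 = F.softmax(teacher_logits / 4.0, dim=1)  # temperature 4 (flatter)

print("hard target:", hard_target.tolist())
print("soft target (T=1):", soft_t1.tolist())
print("soft target (T=4):", soft_t4.tolist())
```

The hard target keeps only the winning class, while both soft targets still sum to 1 and keep some probability mass on the other class.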

To reemphasize, these methods are independent and can be used in conjunction with each other.
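
As a rough illustration of combining two strategies, the sketch below prunes a toy linear layer and then quantizes the surviving weights to int8. The layer size, the 50% pruning amount, and the simple symmetric per-tensor quantization scheme are all assumptions chosen for illustration. They are not the methods used later in this notebook.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
layer = nn.Linear(16, 16)  # toy layer standing in for one weight matrix of a model

# 1) Unstructured magnitude pruning: zero out the 50% smallest-magnitude weights
prune.l1_unstructured(layer, name="weight", amount=0.5)
prune.remove(layer, "weight")  # bake the pruning mask into the weight tensor
sparsity = (layer.weight == 0).float().mean().item()

# 2) Symmetric per-tensor int8 quantization of the pruned weights
weights = layer.weight.detach()
scale = weights.abs().max() / 127.0
q_weights = torch.clamp(torch.round(weights / scale), -127, 127).to(torch.int8)

# Dequantize to measure the rounding error introduced by quantization
deq_weights = q_weights.float() * scale
max_error = (weights - deq_weights).abs().max().item()

print(f"sparsity after pruning: {sparsity:.2f}")
print(f"int8 storage: {q_weights.numel()} bytes vs float32: {weights.numel() * 4} bytes")
print(f"max rounding error: {max_error:.6f} (scale = {scale.item():.6f})")
```

Pruned weights stay exactly zero after quantization, which is one reason the two techniques compose cleanly in this simple setting.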

# Imports

In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import DistilBertForSequenceClassification, DistilBertConfig
import tqdm

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Dataset

In [3]:
data = load_dataset("shawhin/phishing-site-classification")

# Teacher Model

In [4]:
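# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")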
model_path = "andrewcyeow/phishing_url_model"

tokenizer = AutoTokenizer.from_pretrained(model_path)
teacher_model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)

# Student Model

In [5]:
# Use 8 attention heads per layer (down from 12) and 4 layers (down from 6)
my_config = DistilBertConfig(n_heads=8, n_layers=4)

student_model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", config=my_config).to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
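
To see where the size reduction comes from, the following sketch builds randomly initialized models directly from configs (so nothing is downloaded) and counts their parameters. The helper `count_parameters` is a hypothetical name introduced here. One caveat worth checking: in the Hugging Face DistilBERT implementation the attention projections are `dim x dim` regardless of the number of heads, so lowering `n_heads` changes the size of each head rather than the parameter count. The savings should come from removing layers.

```python
from transformers import DistilBertConfig, DistilBertForSequenceClassification

def count_parameters(config):
    """Build a randomly initialized model from a config and count its parameters."""
    model = DistilBertForSequenceClassification(config)
    return sum(p.numel() for p in model.parameters())

default_params = count_parameters(DistilBertConfig())                       # 12 heads, 6 layers
fewer_heads_params = count_parameters(DistilBertConfig(n_heads=8))          # 8 heads, 6 layers
student_params = count_parameters(DistilBertConfig(n_heads=8, n_layers=4))  # 8 heads, 4 layers

print(f"default config:      {default_params:,}")
print(f"8 heads, 6 layers:   {fewer_heads_params:,}")
print(f"8 heads, 4 layers:   {student_params:,}")
```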


# Tokenize the dataset

In [6]:
def preprocess_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_data = data.map(preprocess_function, batched=True)
tokenized_data.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

Map: 100%|██████████| 450/450 [00:00<00:00, 3160.75 examples/s]


# Evaluation Strategy

In [7]:
def evaluate_model(model, dataloader, device):
    model.eval()
    all_preds = []
    all_labels = []
    
    with torch.no_grad():
        for batch in dataloader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)
            
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            
            preds = torch.argmax(logits, dim=1).cpu().numpy()
            all_preds.extend(preds)
            all_labels.extend(labels.cpu().numpy())
            
    accuracy = accuracy_score(all_labels, all_preds)
    precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average="binary")
    
    return accuracy, precision, recall, f1
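
For intuition about what `evaluate_model` returns, here is a small sketch that applies the same scikit-learn calls to made-up labels and predictions. The six values below are illustrative assumptions, not results from this dataset. With `average="binary"`, precision, recall, and F1 are reported for the class labeled 1 only.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical ground-truth labels and predictions (1 = positive class)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# 3 true positives, 1 false positive, 1 false negative, 1 true negative
accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")

print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}")
```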

# Custom Loss Function

This loss function combines the distillation loss, computed from the soft targets produced by the teacher model, with the hard loss, computed by comparing the student's outputs with the ground-truth labels. Alpha controls the relative weight of the distillation loss versus the hard loss. The temperature softens both the teacher and student distributions before they are compared.
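
Written out, the combined loss implemented in this section can be expressed as follows. The symbols are notation introduced here for illustration: $z_s$ and $z_t$ are the student and teacher logits, $y$ is the ground-truth label, $T$ is the temperature, and $\alpha$ is the mixing weight.

```latex
\mathcal{L}
  = \alpha \, T^{2} \,
    \mathrm{KL}\!\left(
      \mathrm{softmax}\!\left(\tfrac{z_t}{T}\right)
      \,\Big\|\,
      \mathrm{softmax}\!\left(\tfrac{z_s}{T}\right)
    \right)
  + (1 - \alpha) \, \mathrm{CE}\!\left(z_s,\, y\right)
```

The $T^{2}$ factor is the usual correction that keeps the gradient magnitude of the distillation term roughly comparable to that of the hard-label term when the temperature changes.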

In [14]:
def distillation_loss(student_logits, teacher_logits, 
                      true_labels, temperature, alpha):
    # Compute soft targets from teacher logits
    soft_targets = nn.functional.softmax(teacher_logits / temperature, dim=1)
    student_soft = nn.functional.log_softmax(student_logits / temperature, dim=1)

    # KL Divergence loss for distillation
    distill_loss = nn.functional.kl_div(student_soft, 
                                    soft_targets, 
                                    reduction='batchmean') * (temperature ** 2)

    # Cross-entropy loss for hard labels
    hard_loss = nn.CrossEntropyLoss()(student_logits, true_labels)

    # Combine losses
    loss = alpha * distill_loss + (1.0 - alpha) * hard_loss

    return loss
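
A quick way to build confidence in the loss is to check its limiting cases on made-up logits. The sketch below repeats the same computation as the function above so that it runs on its own. The tensors are illustrative assumptions. With `alpha=0.0` only the cross-entropy term should remain, and with `alpha=1.0` and identical student and teacher logits the KL term should be numerically zero.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, true_labels, temperature, alpha):
    # Same computation as the function above, repeated so this sketch is self-contained
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    student_soft = F.log_softmax(student_logits / temperature, dim=1)
    distill = F.kl_div(student_soft, soft_targets, reduction="batchmean") * temperature**2
    hard = F.cross_entropy(student_logits, true_labels)
    return alpha * distill + (1.0 - alpha) * hard

# Made-up logits for a batch of 2 examples and 2 classes
teacher_logits = torch.tensor([[2.0, -1.0], [-0.5, 1.5]])
student_logits = torch.tensor([[1.0, 0.0], [0.2, 0.3]])
labels = torch.tensor([0, 1])

# alpha = 0: only the hard-label cross-entropy term remains
hard_only = distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.0)

# alpha = 1 with student == teacher: the KL term vanishes
matched = distillation_loss(teacher_logits, teacher_logits, labels, temperature=2.0, alpha=1.0)

# alpha = 0.5: an equal blend of the two terms
blended = distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5)

print(f"hard only: {hard_only.item():.4f}, matched: {matched.item():.6f}, blended: {blended.item():.4f}")
```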

# Hyperparameters, Optimizers, and Train/Test Datasets

In [15]:
# hyperparameters
batch_size = 32
lr = 1e-4
num_epochs = 5
temperature = 2.0
alpha = 0.5

# define optimizer
optimizer = optim.Adam(student_model.parameters(), lr=lr)

# create training data loader
dataloader = DataLoader(tokenized_data['train'], batch_size=batch_size)
# create testing data loader
test_dataloader = DataLoader(tokenized_data['test'], batch_size=batch_size)

# Train the Model

In [16]:
# put student model in train mode
student_model.train()

# train model
for epoch in range(num_epochs):
    for batch in dataloader:
        # Prepare inputs
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Disable gradient calculation for teacher model
        with torch.no_grad():
            teacher_outputs = teacher_model(input_ids, attention_mask=attention_mask)
            teacher_logits = teacher_outputs.logits

        # Forward pass through the student model
        student_outputs = student_model(input_ids, attention_mask=attention_mask)
        student_logits = student_outputs.logits

        # Compute the distillation loss
        loss = distillation_loss(student_logits, teacher_logits, labels, temperature, alpha)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Report the loss of the final batch in this epoch (not an epoch average)
    print(f"Epoch {epoch + 1} completed with loss: {loss.item()}")

    # Evaluate the teacher model
    teacher_accuracy, teacher_precision, teacher_recall, teacher_f1 = evaluate_model(teacher_model, test_dataloader, device)

    print(f"Teacher (test) - Accuracy: {teacher_accuracy:.4f}, Precision: {teacher_precision:.4f}, Recall: {teacher_recall:.4f}, F1 Score: {teacher_f1:.4f}")

    # Evaluate the student model
    student_accuracy, student_precision, student_recall, student_f1 = evaluate_model(student_model, test_dataloader, device)
    
    print(f"Student (test) - Accuracy: {student_accuracy:.4f}, Precision: {student_precision:.4f}, Recall: {student_recall:.4f}, F1 Score: {student_f1:.4f}")
    print("\n")

    # put student model back into train mode
    student_model.train()

Epoch 1 completed with loss: 0.08567795157432556
Teacher (test) - Accuracy: 0.8644, Precision: 0.8962, Recall: 0.8297, F1 Score: 0.8617
Student (test) - Accuracy: 0.9156, Precision: 0.9099, Recall: 0.9258, F1 Score: 0.9177


Epoch 2 completed with loss: 0.055420562624931335
Teacher (test) - Accuracy: 0.8644, Precision: 0.8962, Recall: 0.8297, F1 Score: 0.8617
Student (test) - Accuracy: 0.9156, Precision: 0.9361, Recall: 0.8952, F1 Score: 0.9152


Epoch 3 completed with loss: 0.07385220378637314
Teacher (test) - Accuracy: 0.8644, Precision: 0.8962, Recall: 0.8297, F1 Score: 0.8617
Student (test) - Accuracy: 0.9067, Precision: 0.8912, Recall: 0.9301, F1 Score: 0.9103


Epoch 4 completed with loss: 0.16781014204025269
Teacher (test) - Accuracy: 0.8644, Precision: 0.8962, Recall: 0.8297, F1 Score: 0.8617
Student (test) - Accuracy: 0.9000, Precision: 0.9381, Recall: 0.8603, F1 Score: 0.8975


Epoch 5 completed with loss: 0.071306973695755
Teacher (test) - Accuracy: 0.8644, Precision: 0.8962

# Evaluate the Model on an Independent Validation Set

In [17]:
validation_dataloader = DataLoader(tokenized_data['validation'], batch_size=8)

teacher_accuracy, teacher_precision, teacher_recall, teacher_f1 = evaluate_model(teacher_model, validation_dataloader, device)
print(f"Teacher (test) - Accuracy: {teacher_accuracy:.4f}, Precision: {teacher_precision:.4f}, Recall: {teacher_recall:.4f}, F1 Score: {teacher_f1:.4f}")

student_accuracy, student_precision, student_recall, student_f1 = evaluate_model(student_model, validation_dataloader, device)
print(f"Student (test) - Accuracy: {student_accuracy:.4f}, Precision: {student_precision:.4f}, Recall: {student_recall:.4f}, F1 Score: {student_f1:.4f}")

Teacher (test) - Accuracy: 0.8800, Precision: 0.9091, Recall: 0.8444, F1 Score: 0.8756
Student (test) - Accuracy: 0.9222, Precision: 0.9481, Recall: 0.8933, F1 Score: 0.9199


# Push the Student Model to HuggingFace

In [18]:
student_model.push_to_hub("andrewcyeow/phishing_url_student_model")

model.safetensors: 100%|██████████| 211M/211M [00:10<00:00, 20.4MB/s] 


CommitInfo(commit_url='https://huggingface.co/andrewcyeow/phishing_url_student_model/commit/dd8b098c4b71e643159e9253749b2179c8a69009', commit_message='Upload DistilBertForSequenceClassification', commit_description='', oid='dd8b098c4b71e643159e9253749b2179c8a69009', pr_url=None, repo_url=RepoUrl('https://huggingface.co/andrewcyeow/phishing_url_student_model', endpoint='https://huggingface.co', repo_type='model', repo_id='andrewcyeow/phishing_url_student_model'), pr_revision=None, pr_num=None)

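# Load the Model Using 4-bit Quantization (QLoRA-style)
- store model parameters using the 4-bit NormalFloat (NF4) data type
- use bfloat16 for computation
- apply double quantization, which also quantizes the quantization constants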
In [19]:
from transformers import BitsAndBytesConfig

model_id = "andrewcyeow/phishing_url_student_model"

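# load the model in 4-bit precision
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # use the 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # run computations in bfloat16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)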

model_nf4 = AutoModelForSequenceClassification.from_pretrained(model_id, device_map=device, quantization_config=nf4_config)
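
For a back-of-the-envelope sense of what 4-bit storage could save, the sketch below estimates model size from a parameter count and a bit width. The helper `estimated_size_mb` is hypothetical, and the parameter count is an assumption derived from the 211 MB upload above, treated as float32 weights. The real footprint will be larger than the 4-bit estimate, since bitsandbytes typically quantizes only the linear layers and also stores quantization constants. The loaded model's `get_memory_footprint()` method can be used to check the actual figure.

```python
def estimated_size_mb(num_params, bits_per_param):
    """Rough storage estimate: parameters * bits / 8, in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6

# Assumption: the 211 MB safetensors file holds float32 weights (4 bytes each)
approx_num_params = int(211e6 / 4)

# Compare idealized sizes if every parameter were stored at the given precision
for name, bits in [("float32", 32), ("bfloat16", 16), ("nf4", 4)]:
    print(f"{name:>8}: ~{estimated_size_mb(approx_num_params, bits):.0f} MB")
```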

# Evaluate the Quantized Model

In [21]:
quantized_accuracy, quantized_precision, quantized_recall, quantized_f1 = evaluate_model(model_nf4, validation_dataloader, device)

print("Post-quantization Performance")
print(f"Accuracy: {quantized_accuracy:.4f}, Precision: {quantized_precision:.4f}, Recall: {quantized_recall:.4f}, F1 Score: {quantized_f1:.4f}")

Post-quantization Performance
Accuracy: 0.9289, Precision: 0.9573, Recall: 0.8978, F1 Score: 0.9266
