# **Model Compression Presentation** 
Model compression refers to techniques used to reduce the size and computational requirements of machine learning models—particularly large language models (LLMs)—without significantly sacrificing performance.

📚 Credit: [Compressing Large Language Models (LLMs) | w/ Python Code (Shaw Talebi)](https://www.youtube.com/watch?v=FLkUOkeMd5M)

🔧 3 Main Techniques to Compress LLMs


1.   Quantization
2.   Pruning
3.   Knowledge Distillation


# 1. Quantization
Quantization involves reducing the precision of the numerical values (parameters) used in a model, such as weights and biases.

🤔 What does it mean?
Most models are trained using 32-bit floating-point numbers (FP32). Quantization replaces these high-precision values with lower-precision formats like 8-bit integers (INT8), significantly reducing the model's memory footprint and computational cost.

🔄 Two Main Approaches:

*   Post-Training Quantization (PTQ):
    *   Quantize a pre-trained model without re-training it.
    *   Useful when using open-source models (e.g., on Hugging Face).
    *   Simple and quick, but may slightly reduce accuracy.
*   Quantization-Aware Training (QAT):
    *   Simulate quantization during training so the model learns to compensate for reduced precision.
    *   Often results in better accuracy compared to PTQ.
    *   Requires access to the training pipeline and data.

📝 Note: INT8 models can be up to 4x smaller than FP32 ones and can also run faster on compatible hardware.
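
To make the FP32-to-INT8 idea concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. The helper names (`quantize_int8`, `dequantize`) and the sample weights are illustrative assumptions; they are not part of the original notebook or of any library API, and real toolkits use more refined schemes.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale for the whole tensor; guard against an all-zero tensor
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integers and the scale."""
    return [v * scale for v in q]


weights = [0.8, -1.27, 0.05, 0.0, 0.4]  # made-up FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", q)
print("scale:", scale)
print("largest round-trip error:", max_error)
```

Each weight is now stored as a 1-byte integer instead of a 4-byte float, which is where the "up to 4x smaller" figure comes from. The price is a rounding error of at most half the scale per weight.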

# 2. Pruning
Pruning removes unnecessary parameters from a model, directly reducing its size and potentially improving inference speed.

🧠 Two Types of Pruning:

*   Unstructured Pruning:
    *   Removes individual weights (usually those close to zero).
    *   Results in sparse matrices (many zeroes).
    *   Requires specialized hardware or sparse kernels to efficiently skip the zero computations.
*   Structured Pruning:
    *   Removes entire components (e.g., attention heads, neurons, layers).
    *   Leads to a smaller and faster model compatible with standard hardware.
    *   Easier to deploy in production, though with less aggressive compression.
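
The difference between the two pruning styles can be sketched in a few lines of plain Python. Both helpers below (`magnitude_prune`, `prune_rows`) and the sample values are hypothetical, written only for illustration; the selection criteria (smallest magnitude, largest L1 norm) are common choices, not the only ones.

```python
def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero out the `sparsity` fraction of weights with the smallest magnitude."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by absolute value, smallest first
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


def prune_rows(matrix, keep):
    """Structured pruning: keep only the `keep` rows (e.g., neurons) with the largest L1 norm."""
    norms = [sum(abs(v) for v in row) for row in matrix]
    ranked = sorted(range(len(matrix)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original row order
    return [matrix[i] for i in kept]


weights = [0.9, -0.01, 0.3, 0.002, -0.7, 0.05]  # made-up values
sparse = magnitude_prune(weights, sparsity=0.5)
print(sparse)  # same length as before, half of the entries are now zero

matrix = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4]]  # 3 neurons, 2 inputs each
smaller = prune_rows(matrix, keep=2)
print(smaller)  # an actually smaller matrix with 2 rows
```

Unstructured pruning keeps the tensor shape and only adds zeros, so it needs sparse-aware execution to pay off. Structured pruning returns a genuinely smaller dense matrix that any hardware can run.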

# 3. Knowledge Distillation
Knowledge Distillation is a technique where a smaller model (the student) learns to mimic a larger, well-trained model (the teacher).

🧪 Key Concepts:

*   Soft Targets:
    *   Instead of training the student with hard labels (e.g., 0 or 1), we use the teacher’s output probabilities (soft labels).
    *   This provides richer information and improves generalization.
    *   Especially useful in tasks like classification or language modeling.
*   Synthetic Data Generation:
    *   Use the teacher model to generate data (e.g., synthetic text prompts and responses).
    *   Helps train the student model even in low-data scenarios.

🧠 Result: The student model can approach the teacher’s performance while being significantly smaller and faster.
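
As a small illustration of what "soft targets" look like, the sketch below turns a pair of teacher logits into probabilities with a temperature-scaled softmax. The logits are made up and the `softmax` helper is written here for illustration; it is not taken from the notebook.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperatures give a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [4.0, 1.0]  # made-up logits for a two-class problem
hard_label = [1, 0]          # what a plain label would tell the student

soft_t1 = softmax(teacher_logits, temperature=1.0)
soft_t2 = softmax(teacher_logits, temperature=2.0)

print([round(p, 3) for p in soft_t1])  # [0.953, 0.047]
print([round(p, 3) for p in soft_t2])  # [0.818, 0.182]
```

The hard label only says "class 0". The soft targets also say how confident the teacher is, and raising the temperature exposes more of the probability mass the teacher assigns to the other class.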

# **Example**: Compressing a Text Classifier with Knowledge Distillation - Soft Targets

In [None]:
%pip install datasets transformers torch scikit-learn

# Imports

In [10]:
from datasets import load_dataset

from transformers import AutoTokenizer, AutoModelForSequenceClassification, DistilBertConfig, DistilBertForSequenceClassification

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

import pandas as pd

# Load Data

In [5]:
ds = load_dataset("shawhin/phishing-site-classification")

In [6]:
df = pd.DataFrame(ds["train"])
df.head(10)

Unnamed: 0,text,labels
0,http://bazurashop.com/idex.html?sfm_from_ifram...,1
1,hollywoodland.org/?p=29,0
2,tunnekylmyysmiddletonii.02leds.com/me4xcdste0....,1
3,usa-people-search.com/Find-Carla-Brown-IA.aspx,0
4,inspire-consultants.com.my/487ygfh,1
5,taiwanteastore.com/,0
6,citizendia.org/Morocco_national_football_team,0
7,osscamp.pl/poeosias/xskkswee/oeidppda/doeiidas/,1
8,www.luckybell.com/index/,0
9,lquuqkf.org/information.cgi,1


Features

*   text = website URL
*   labels = phishing site indicator (1=phishing, 0=not phishing)

# Load Teacher Model

In [16]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU when no GPU is available
model_path = "shawhin/bert-phishing-classifier_teacher"
tokenizer = AutoTokenizer.from_pretrained(model_path)
teacher_model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)

# Load Student Model

Here we create a student model configuration with `DistilBertConfig`. We reduce the number of attention heads from the default 12 to 8 and the number of transformer layers from the default 6 to 4. Dropping layers is what makes the model lighter; the head count changes how the attention projections are split, not the number of parameters.

In [11]:
student_architecture = DistilBertConfig(n_heads=8, n_layers=4)
student_model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", config=student_architecture).to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Before going any further, let's compare the two models.

In [17]:
def count_parameters(model):
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

In [19]:
teacher_total, teacher_trainable = count_parameters(teacher_model)
student_total, student_trainable = count_parameters(student_model)

print(f"🧑‍🏫 Teacher model - Total: {teacher_total:,}, Trainable: {teacher_trainable:,}")
print(f"🧑‍🎓 Student model - Total: {student_total:,}, Trainable: {student_trainable:,}")

🧑‍🏫 Teacher model - Total: 109,483,778, Trainable: 109,483,778
🧑‍🎓 Student model - Total: 52,779,266, Trainable: 52,779,266


The student model has about half as many parameters as the teacher (about 52.8M vs. 109.5M), making it significantly lighter while still retaining good learning capacity.
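
To put those parameter counts into storage terms, the arithmetic below estimates the weight size at a few precisions. This is a rough sketch: `model_size_mb` and the bytes-per-parameter table are assumptions made for illustration, and the estimate ignores file-format overhead and non-parameter buffers.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def model_size_mb(num_params, dtype="fp32"):
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


teacher_params = 109_483_778  # totals printed by count_parameters above
student_params = 52_779_266

ratio = student_params / teacher_params
print(f"Student/teacher parameter ratio: {ratio:.3f}")

for dtype in BYTES_PER_PARAM:
    print(
        f"{dtype}: teacher {model_size_mb(teacher_params, dtype):.1f} MB, "
        f"student {model_size_mb(student_params, dtype):.1f} MB"
    )
```

At FP32 this works out to roughly 438 MB for the teacher and 211 MB for the student, so the student needs a little under half the storage before any quantization is applied.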

# Tokenization

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length")
# We apply truncation and padding so that all input sequences have the same length.
# This is important because the text inputs (in this case, website URLs) can vary a lot in length.
# Padding ensures uniform size, which is required for efficient batching and conversion into PyTorch tensors.

# Now we tokenize the dataset
tokenized_data = ds.map(preprocess_function, batched=True)

Let's look at the contents of `tokenized_data`:

1.  text:
This is the original input, in our case a website URL like "http://bazurashop.com/idex.html?sfm_from_iframe=1".

2.  labels:
This is the indicator of whether the site is phishing or not (i.e., the target label for classification).

3.  input_ids:
These are the numerical tokens representing the URL, based on the tokenizer's vocabulary.
Each token corresponds to a subword or symbol in the vocabulary.
We see a lot of padding here — as mentioned earlier, we use padding to ensure all inputs have the same length for efficient batching.

4.  attention_mask:
This tells the model which tokens are actual content (1) and which are just padding (0).
Padding tokens carry no meaningful information and could introduce noise, so the model uses this mask to ignore them during the attention computation.
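
The relationship between `input_ids` and `attention_mask` can be sketched without a real tokenizer. The `pad_batch` helper, the token IDs, and the pad ID of 0 below are all assumptions made for illustration; the Hugging Face tokenizer does this work internally and handles special tokens more carefully when truncating.

```python
def pad_batch(sequences, max_length, pad_id=0):
    """Truncate or pad each sequence to `max_length` and build the matching attention mask."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]               # naive truncation (illustrative only)
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Two made-up token ID sequences of different lengths
batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 2146, 102]]
ids, mask = pad_batch(batch, max_length=5)

for row_ids, row_mask in zip(ids, mask):
    print(row_ids, row_mask)
```

After this step every row has the same length, so the batch can be stacked into a single tensor, and the mask tells the model which positions to ignore.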

In [22]:
tokenized_data.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

# Evaluation Function

In [30]:
def evaluate_model(model, dataloader, device):
  model.eval()
  all_labels = []
  all_preds = []
  # Disable gradient tracking: evaluation never calls loss.backward(), so this saves memory and time
  with torch.no_grad():
    for batch in dataloader:
      input_ids = batch['input_ids'].to(device)
      attention_mask = batch['attention_mask'].to(device)
      labels = batch['labels'].to(device)

      # Forward pass to get the logits (one raw score per class)
      outputs = model(input_ids, attention_mask=attention_mask)
      logits = outputs.logits

      # Pick the class with the highest logit as the predicted label (0 or 1)
      preds = torch.argmax(logits, dim=1).cpu().numpy()
      labels = labels.cpu().numpy()

      all_preds.extend(preds)
      all_labels.extend(labels)
  accuracy = accuracy_score(all_labels, all_preds)
  precision, recall, f1, _ = precision_recall_fscore_support(all_labels, all_preds, average='binary')
  return accuracy, precision, recall, f1
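
For readers who want to see what the four returned metrics measure, here is a hand-rolled version on a tiny made-up example. The `binary_metrics` helper is hypothetical and only mirrors the definitions behind the scikit-learn calls used above for the binary case; it is not a replacement for them.

```python
def binary_metrics(labels, preds):
    """Accuracy, precision, recall and F1 for binary labels, with class 1 as the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)

    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of the predicted phishing sites, how many are real
    recall = tp / (tp + fn) if tp + fn else 0.0     # of the real phishing sites, how many were caught
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


labels = [1, 1, 1, 0, 0, 0, 1, 0]  # made-up ground truth
preds = [1, 0, 1, 0, 1, 0, 1, 0]   # made-up predictions
accuracy, precision, recall, f1 = binary_metrics(labels, preds)
print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75
```

In this example there are 3 true positives, 1 false positive, 1 false negative and 3 true negatives, which is why all four metrics come out to 0.75.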

# Custom Loss Function

In [None]:
def distillation_loss(student_logits, teacher_logits, true_labels, temperature, alpha):
  # Combine a soft loss (match the teacher) and a hard loss (match the true labels), weighted by alpha
  soft_targets = nn.functional.softmax(teacher_logits / temperature, dim=1)
  student_soft = nn.functional.log_softmax(student_logits / temperature, dim=1)

  # KL divergence between the softened teacher and student distributions.
  # It compares the full output distributions, not just the predicted label.
  # Multiplying by temperature ** 2 keeps the gradient scale comparable across temperatures.
  soft_loss = nn.functional.kl_div(student_soft, soft_targets, reduction="batchmean") * (temperature ** 2)

  # Cross-entropy loss against the hard labels
  hard_loss = nn.functional.cross_entropy(student_logits, true_labels)

  # Weighted sum: alpha controls how much the student follows the teacher
  return alpha * soft_loss + (1 - alpha) * hard_loss
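
To see how the two terms behave, the sketch below recomputes the same formula for a single example in plain Python. The helper names and the logits are made up for illustration; for a batch of size one this should agree with the PyTorch function above, but treat it as a sanity-check sketch rather than a reference implementation.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_sum for z in scaled]


def distillation_loss_single(student_logits, teacher_logits, true_label, temperature, alpha):
    teacher_log_probs = log_softmax(teacher_logits, temperature)
    student_log_probs = log_softmax(student_logits, temperature)
    # KL(teacher || student) over the softened distributions, scaled by temperature ** 2
    kl = sum(math.exp(t) * (t - s) for t, s in zip(teacher_log_probs, student_log_probs))
    soft_loss = kl * temperature ** 2
    # Cross-entropy of the unscaled student distribution against the true label
    hard_loss = -log_softmax(student_logits)[true_label]
    return alpha * soft_loss + (1 - alpha) * hard_loss


teacher = [4.0, 1.0]  # made-up teacher logits, confident in class 0
agreeing = distillation_loss_single([4.0, 1.0], teacher, true_label=0, temperature=2, alpha=0.5)
disagreeing = distillation_loss_single([1.0, 4.0], teacher, true_label=0, temperature=2, alpha=0.5)

print(f"student agrees with teacher:    {agreeing:.4f}")
print(f"student disagrees with teacher: {disagreeing:.4f}")
```

When the student's logits match the teacher's, the KL term is zero and only the cross-entropy part remains. When the student predicts the opposite class, both terms grow, so the gradient pushes the student toward the teacher and the true label at the same time.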


# Hyperparameters definition


In [32]:
batch_size = 32
lr = 1e-4
num_epochs = 5
temperature = 2
alpha = 0.5

optimizer = optim.Adam(student_model.parameters(), lr=lr)

dataloader = DataLoader(tokenized_data["train"], batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(tokenized_data["test"], batch_size=batch_size, shuffle=False)

# Train Model

In [34]:
student_model.train()

for epoch in range(num_epochs):
    for batch in dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Disable Gradient for teacher model
        with torch.no_grad():
            teacher_outputs = teacher_model(input_ids, attention_mask=attention_mask)
            teacher_logits = teacher_outputs.logits

        # Forward pass through the student model
        student_outputs = student_model(input_ids, attention_mask=attention_mask)
        student_logits = student_outputs.logits

        # Compute distillation loss
        loss = distillation_loss(student_logits, teacher_logits, labels, temperature, alpha)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # End of epoch: report the loss of the last batch only (not an epoch average)
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {loss.item()}")

    # Evaluate teacher model
    teacher_accuracy, teacher_precision, teacher_recall, teacher_f1 = evaluate_model(teacher_model, test_dataloader, device)
    print(f"Teacher Model - Accuracy: {teacher_accuracy:.4f}, Precision: {teacher_precision:.4f}, Recall: {teacher_recall:.4f}, F1: {teacher_f1:.4f}")

    # Evaluate the student model
    student_accuracy, student_precision, student_recall, student_f1 = evaluate_model(student_model, test_dataloader, device)
    print(f"Student Model - Accuracy: {student_accuracy:.4f}, Precision: {student_precision:.4f}, Recall: {student_recall:.4f}, F1: {student_f1:.4f}")

    print("\n")

    student_model.train()


Epoch 1/5, Loss: 0.2836276888847351
Teacher Model - Accuracy: 0.8644, Precision: 0.8925, Recall: 0.8341, F1: 0.8623
Student Model - Accuracy: 0.8933, Precision: 0.8755, Recall: 0.9214, F1: 0.8979


Epoch 2/5, Loss: 0.2187705785036087
Teacher Model - Accuracy: 0.8644, Precision: 0.8925, Recall: 0.8341, F1: 0.8623
Student Model - Accuracy: 0.9111, Precision: 0.9565, Recall: 0.8646, F1: 0.9083


Epoch 3/5, Loss: 0.07151704281568527
Teacher Model - Accuracy: 0.8644, Precision: 0.8925, Recall: 0.8341, F1: 0.8623
Student Model - Accuracy: 0.9178, Precision: 0.9364, Recall: 0.8996, F1: 0.9176


Epoch 4/5, Loss: 0.1525426059961319
Teacher Model - Accuracy: 0.8644, Precision: 0.8925, Recall: 0.8341, F1: 0.8623
Student Model - Accuracy: 0.9156, Precision: 0.9526, Recall: 0.8777, F1: 0.9136


Epoch 5/5, Loss: 0.09237013757228851
Teacher Model - Accuracy: 0.8644, Precision: 0.8925, Recall: 0.8341, F1: 0.8623
Student Model - Accuracy: 0.9222, Precision: 0.9450, Recall: 0.8996, F1: 0.9217




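As you can see, the student model's accuracy rises from 0.8933 after the first epoch to 0.9222 after the fifth, with a small dip at epoch 4, while the teacher model's metrics stay the same. This is expected: we are not training the teacher model, and backpropagation is only applied to the student model. On this test set the student also ends up scoring higher than the teacher.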

# Validation Set Evaluation

In [35]:
validation_dataloader = DataLoader(tokenized_data["validation"], batch_size=8)

# First, evaluate the teacher model
teacher_accuracy, teacher_precision, teacher_recall, teacher_f1 = evaluate_model(teacher_model, validation_dataloader, device)

print(f"Teacher Model Validation - Accuracy: {teacher_accuracy:.4f}, Precision: {teacher_precision:.4f}, Recall: {teacher_recall:.4f}, F1: {teacher_f1:.4f}")

# Then evaluate the student model
student_accuracy, student_precision, student_recall, student_f1 = evaluate_model(student_model, validation_dataloader, device)

print(f"Student Model Validation - Accuracy: {student_accuracy:.4f}, Precision: {student_precision:.4f}, Recall: {student_recall:.4f}, F1: {student_f1:.4f}")


Teacher Model Validation - Accuracy: 0.8933, Precision: 0.9155, Recall: 0.8667, F1: 0.8904
Student Model Validation - Accuracy: 0.9356, Precision: 0.9579, Recall: 0.9111, F1: 0.9339
