# Model Pruning with a Pre-Trained Transformer
This notebook walks through the following steps:
1.  Load a pre-trained sentiment analysis model (`DistilBERT`).
2.  Evaluate its performance on a sample sentence **before** pruning.
3.  Inspect the model's layers to choose a target for pruning.
4.  Apply **unstructured magnitude pruning** to the target layer.
     - Unstructured magnitude pruning compresses a neural network by zeroing individual weights (connections) with the smallest absolute values, which are treated as the least important.
5.  Verify the effect of pruning by measuring the layer's sparsity.
     - Sparsity is the percentage of parameters whose value is zero.
6.  Evaluate the model's performance **after** pruning to observe the impact.
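
Before working with the real model, the core idea behind steps 4 and 5 can be sketched on a tiny tensor. This is a minimal illustration, not the routine PyTorch uses internally; the helper name `magnitude_prune` and the example values are assumptions made for this sketch.

```python
import torch

def magnitude_prune(weights: torch.Tensor, amount: float) -> torch.Tensor:
    """Return a copy of `weights` with the smallest-magnitude fraction set to zero."""
    k = round(amount * weights.numel())  # number of entries to remove
    pruned = weights.clone()
    if k == 0:
        return pruned
    # Indices (in the flattened tensor) of the k entries with the smallest absolute value.
    _, idx = torch.topk(weights.abs().flatten(), k, largest=False)
    pruned.view(-1)[idx] = 0.0
    return pruned

# Hypothetical 2x3 weight matrix used only for illustration.
w = torch.tensor([[0.9, -0.05, 0.4], [-0.7, 0.01, 0.2]])
pruned = magnitude_prune(w, amount=0.5)
sparsity = 100.0 * (pruned == 0).sum().item() / pruned.numel()
print(pruned)
print(f"Sparsity: {sparsity:.2f}%")
```

With `amount=0.5`, the three entries closest to zero (0.01, -0.05 and 0.2) are removed, so half of the tensor ends up as zeros.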

---
## 1. Setup Environment

In [1]:
!pip install datasets transformers torch numpy huggingface_hub -q

In [2]:
!pip install 'accelerate>=0.26.0' --upgrade -q

---
## 2. Import Libraries
We'll import all the necessary components for our task.

In [3]:
import torch
import numpy as np
import torch.nn.utils.prune as prune
from transformers import AutoTokenizer, AutoModelForSequenceClassification

---
## 3. Define Model and Device
Let's specify the model we'll use from the Hugging Face Hub and set up our compute device (GPU if available).

In [4]:
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

In [6]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [7]:
print(f"Using device: {device}")

Using device: cuda


---
## 4. Load Pre-Trained Model and Tokenizer

In [8]:
tokenizer = AutoTokenizer.from_pretrained(model_name)



In [9]:
model = AutoModelForSequenceClassification.from_pretrained(model_name)

In [10]:
model.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


---
## 5. Evaluate the Original Model's Performance
Before we change anything, let's establish a baseline for the model's performance on a sample sentence.

### Create a helper function for evaluation

In [14]:
def evaluate_model(model, tokenizer, sentence, device):
    inputs = tokenizer(sentence, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**inputs).logits

    probabilities = torch.softmax(logits, dim=1).cpu().numpy()[0]
    predicted_class_id = torch.argmax(logits, dim=1).item()
    prediction = model.config.id2label[predicted_class_id]

    print(f'Sentence: "{sentence}"')
    print(f'Prediction: {prediction} (Confidence: {probabilities[predicted_class_id]:.4f})')
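
To make the confidence value easier to interpret, here is a small sketch of the softmax and argmax steps that `evaluate_model` performs, run on hand-written logits instead of real model output. The logit values and the `id2label` mapping are assumptions chosen for illustration.

```python
import torch

# Hypothetical logits for one sentence: [NEGATIVE score, POSITIVE score].
logits = torch.tensor([[-1.0, 3.0]])
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # assumed label mapping for this sketch

# Softmax turns the raw scores into probabilities that sum to 1.
probabilities = torch.softmax(logits, dim=1)[0]
predicted_class_id = int(torch.argmax(logits, dim=1))

print(id2label[predicted_class_id], f"{probabilities[predicted_class_id].item():.4f}")
```

The reported confidence is simply the softmax probability of the predicted class, so it depends on the gap between the two logits rather than on their absolute size.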

In [None]:
test_sentences = [
    'The performances in this movie are absolutely stellar.',
    'The movie was not bad.',
    "I'm not sure if I liked the new design.",
    'The food was, well, edible.',
    "It's an interesting idea, but I'm not sold.",
    'The book had its moments.',
    'I have mixed feelings about the new policy.',
    'The performance was...memorable.',
    'That\'s one way to look at it.',
    'The weather is certainly something.',
    "I wouldn't say I loved it, but I didn't hate it either.",
    'The acting was surprisingly good in an otherwise mediocre film.',
    'The service was incredibly slow, but the food was worth the wait.',
    "It's a bold strategy, Cotton. Let's see if it pays off for 'em.",
    'This is a very unique piece of art.',
    'The ending of the show was definitely a choice.',
    'I could see why some people would like this.',
    'The special effects were amazing, but the plot was a bit weak.',
    'Well, that was an experience.',
    'The new update is... different.',
    "I'm on the fence about this one."
]

In [None]:
print("--- Evaluating Original Model ---")

for test_sentence in test_sentences:
    evaluate_model(model, tokenizer, test_sentence, device)

--- Evaluating Original Model ---
Sentence: "The performances in this movie are absolutely stellar."
Prediction: POSITIVE (Confidence: 0.9999)


---
## 6. The Pruning Process
Now we'll perform the actual pruning. We'll target the **query projection layer** (`q_lin`) in the first attention block of the transformer.

### Step 6.1: Select the Target Layer

In [17]:
module_to_prune = model.distilbert.transformer.layer[0].attention.q_lin

In [18]:
print("Selected module for pruning:")
print(module_to_prune)

Selected module for pruning:
Linear(in_features=768, out_features=768, bias=True)
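
When choosing a target, it can help to list the candidate layers programmatically rather than reading the printed model by eye. The sketch below does this on a toy stand-in model; `toy_model` and its layer names are hypothetical, though the same `named_modules()` loop should work on any `nn.Module`.

```python
import torch.nn as nn

# Toy stand-in for a transformer block; the layer names are hypothetical.
toy_model = nn.Sequential()
toy_model.add_module("q_lin", nn.Linear(8, 8))
toy_model.add_module("act", nn.ReLU())
toy_model.add_module("out_lin", nn.Linear(8, 2))

# Collect every nn.Linear submodule together with its dotted name.
linear_layers = {
    name: module
    for name, module in toy_model.named_modules()
    if isinstance(module, nn.Linear)
}
for name, module in linear_layers.items():
    print(f"{name}: {tuple(module.weight.shape)}")
```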


### Step 6.2: Check Sparsity Before Pruning
Sparsity is the percentage of weights that are zero. For an unpruned model, this should be 0%.

In [19]:
def calculate_sparsity(module):
    return 100. * float(torch.sum(module.weight == 0)) / float(module.weight.nelement())

In [20]:
initial_sparsity = calculate_sparsity(module_to_prune)
print(f"Sparsity before pruning: {initial_sparsity:.2f}%")

Sparsity before pruning: 0.00%
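
`calculate_sparsity` looks at a single module. As a possible extension, the same ratio can be computed across every linear layer of a model. The helper `model_sparsity` and the toy model below are assumptions introduced for this sketch; they are not part of the original notebook.

```python
import torch
import torch.nn as nn

def model_sparsity(model: nn.Module) -> float:
    """Percentage of zero entries across the weights of all nn.Linear layers."""
    zeros, total = 0, 0
    for module in model.modules():
        if isinstance(module, nn.Linear):
            zeros += int(torch.sum(module.weight == 0))
            total += module.weight.nelement()
    return 100.0 * zeros / total if total else 0.0

toy_model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
with torch.no_grad():
    toy_model[0].weight.zero_()      # 16 weights set to zero by hand
    toy_model[2].weight.fill_(1.0)   # 8 weights left non-zero
overall = model_sparsity(toy_model)
print(f"Overall sparsity: {overall:.2f}%")
```

Here 16 of the 24 weights are zero, so the model-wide figure is about 66.67% even though one layer is fully dense.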


### Step 6.3: Apply Pruning
We will prune the 30% of the layer's weights with the smallest L1 magnitude (absolute value), i.e. the ones closest to zero.

In [21]:
prune.l1_unstructured(module_to_prune, name="weight", amount=0.3)

Linear(in_features=768, out_features=768, bias=True)
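
To see concretely which weights `prune.l1_unstructured` selects, the following sketch applies it to a small randomly initialized layer and checks that the zeroed weights were the smallest in magnitude. The layer size and seed are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
layer = nn.Linear(5, 4)  # 20 weights in total
original = layer.weight.detach().clone()

# amount=0.3 asks for 30% of the 20 weights, i.e. 6 of them.
prune.l1_unstructured(layer, name="weight", amount=0.3)

mask = layer.weight == 0
num_pruned = int(mask.sum())
# Every pruned weight should be no larger in magnitude than every kept one.
largest_pruned = original[mask].abs().max().item()
smallest_kept = original[~mask].abs().min().item()
print(num_pruned, largest_pruned <= smallest_kept)
```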

### Step 6.4: Check Sparsity After Pruning
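PyTorch implements pruning as a reparametrization: the original weights are kept in `weight_orig`, a binary mask is stored in the `weight_mask` buffer, and a forward pre-hook recomputes `weight` as their product before each forward pass. Because the sparsity calculation reads `module.weight`, it should already reflect the pruned weights.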

In [None]:
sparsity_after_pruning = calculate_sparsity(module_to_prune)
print(f"Sparsity after applying pruning mask: {sparsity_after_pruning:.2f}%")

Sparsity after applying pruning mask: 30.00%
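
The reparametrization can be inspected directly. The sketch below uses a small stand-in layer (an assumption, to keep it quick to run) and lists the parameters and buffers that exist while the pruning mask is still attached.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(5, 4)
prune.l1_unstructured(layer, name="weight", amount=0.3)

param_names = [name for name, _ in layer.named_parameters()]
buffer_names = [name for name, _ in layer.named_buffers()]
print(param_names)   # the dense weights now live under "weight_orig"
print(buffer_names)  # the binary mask is stored as "weight_mask"

# `weight` is the element-wise product of the original weights and the mask.
recomputed = layer.weight_orig * layer.weight_mask
print(torch.equal(layer.weight, recomputed))

# The unmasked weights are untouched, so they contain no pruned zeros.
zeros_in_orig = int((layer.weight_orig == 0).sum())
print(zeros_in_orig, prune.is_pruned(layer))
```

At this stage the pruning is reversible in principle, since `weight_orig` still holds the full dense tensor.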


### Step 6.5: Make the Pruning Permanent
The `prune.remove` function removes the hook and the mask, then stores the masked tensor back as the regular `weight` parameter, so the pruned entries are permanently zero.

In [23]:
prune.remove(module_to_prune, 'weight')

Linear(in_features=768, out_features=768, bias=True)

In [24]:
final_sparsity = calculate_sparsity(module_to_prune)
print(f"Sparsity after making pruning permanent: {final_sparsity:.2f}%")

Sparsity after making pruning permanent: 30.00%
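
As a quick check of what `prune.remove` leaves behind, this sketch repeats the procedure on a small stand-in layer (an assumption for illustration) and confirms that the mask and `weight_orig` are gone while the zeros remain.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(5, 4)
prune.l1_unstructured(layer, name="weight", amount=0.3)
masked_weight = layer.weight.detach().clone()  # masked values before removal

prune.remove(layer, "weight")

param_names = sorted(name for name, _ in layer.named_parameters())
buffer_names = [name for name, _ in layer.named_buffers()]
print(param_names, buffer_names)

# The stored weight matches the masked tensor, including its zeros.
unchanged = torch.equal(layer.weight.detach(), masked_weight)
print(unchanged, prune.is_pruned(layer))
```

Note that the weight is still stored as a dense tensor of the same shape, so making the pruning permanent does not by itself reduce memory use or speed up inference.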


---
## 7. Evaluate the Pruned Model (Before Fine-Tuning)
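Now let's see how the model performs on the same test sentences after we've zeroed 30% of the weights in one attention projection. Pruning can lower accuracy or confidence, but pruning a single layer at 30% may change the predictions very little.

In [26]:
print("--- Evaluating Pruned Model ---")

# Loop over the full list again: after the earlier loop, `test_sentence` on its own only holds the last sentence.
for test_sentence in test_sentences:
    evaluate_model(model, tokenizer, test_sentence, device)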

--- Evaluating Pruned Model ---
Sentence: "The performances in this movie are absolutely stellar."
Prediction: POSITIVE (Confidence: 0.9999)
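
A printed confidence rounded to four decimals can hide small changes. One way to quantify the effect more directly is to compare raw logits from an unpruned copy and a pruned copy on the same inputs. The sketch below does this on a toy linear classifier with random inputs; both are stand-ins assumed for illustration, not the DistilBERT model or real sentences.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
original = nn.Linear(16, 2)        # toy stand-in for the classifier
pruned = copy.deepcopy(original)   # prune a copy so the baseline stays intact
prune.l1_unstructured(pruned, name="weight", amount=0.3)
prune.remove(pruned, "weight")

x = torch.randn(8, 16)             # random inputs, not real sentences
with torch.no_grad():
    logits_before = original(x)
    logits_after = pruned(x)

# Largest absolute change in any logit, and share of unchanged predictions.
max_shift = (logits_before - logits_after).abs().max().item()
agreement = (logits_before.argmax(dim=1) == logits_after.argmax(dim=1)).float().mean().item()
print(f"Largest logit change: {max_shift:.4f}")
print(f"Predictions unchanged: {agreement:.0%}")
```

Applying the same comparison to the real model would require keeping a copy of it (for example with `copy.deepcopy`) before the pruning step.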
