<a href="https://colab.research.google.com/github/peremartra/optipfair/blob/main/examples/layer_importance_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OptiPFair Notebook Series – Example: Analyze Layer Importance

![optiPfair Logo](https://github.com/peremartra/optipfair/blob/main/images/optiPfair.png?raw=true)


This notebook demonstrates how to use the `analyze_layer_importance` function from OptiPFair to identify which transformer layers in a model are the most redundant or have the least impact on the model's representations.

This analysis is a crucial first step for developing a data-driven **Depth Pruning** strategy.

## Recommended Environment

- **Platform**: [Google Colab](https://colab.research.google.com)  
- **Hardware**: GPU runtime (recommended: T4 or better for 1B–3B models)  
- **Dependencies**: Installed automatically in the first cell (optipfair, transformers, torch)

## by Pere Martra

- [LinkedIn](https://www.linkedin.com/in/pere-martra)  
- [GitHub](https://github.com/peremartra)  
- [X / Twitter](https://x.com/peremartra)

---

> If you find this useful, please ⭐ the [repository](https://github.com/peremartra/optipfair) and share it!
---
If you want your favorite LLM to write code with OptiPFair, give it the file [**optipfair_llm_reference_manual.txt**](https://github.com/peremartra/optipfair/blob/main/optipfair_llm_reference_manual.txt), which contains the information the LLM needs to use the library.

## 1. Installation and Setup
First, we install the necessary libraries and import the modules used throughout the notebook.

In [None]:
!pip install -q transformers optipfair torch datasets

In [22]:
import torch
import os
import gc
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from torch.utils.data import DataLoader
from optipfair import prune_model_depth, analyze_layer_importance


# Check device availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

Using device: cuda
GPU: Tesla T4
GPU Memory: 14.7 GB


## Load Model

For this example we'll use Qwen3-0.6B, a model with 28 transformer blocks. We'll identify the three blocks that contribute least on the dataset used and remove them, creating a new 25-block model.


In [6]:
MODEL_NAME = 'Qwen/Qwen3-0.6B'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    dtype=torch.float16,
    device_map="auto"
)
model.eval()

Qwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 1024)
    (layers): ModuleList(
      (0-27): 28 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
          (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (up_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (down_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
        (post_attention_layernorm): Qwe

## Load & Prepare Dataset

We'll use a standard dataset to measure the importance of the different transformer blocks. In practice, use the dataset you're preparing the model for; that way, the blocks are ranked by how they behave on your data, and the lighter model is more likely to keep working correctly with it.

In [4]:
RECOVERY_SAMPLES = 100
BATCH_SIZE = 8
MAX_LENGTH = 512

In [9]:
dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split=f'train[:{RECOVERY_SAMPLES}]')

In [10]:
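def prepare_dataset(dataset, text_field='text'):
    """Tokenize a text dataset and wrap it in a DataLoader of fixed-length batches."""
    def tokenize_function(examples):
        # Pick the text column: the requested field first, then known fallbacks
        if text_field in examples:
            texts = examples[text_field]
        elif 'sms' in examples:  # SMS dataset
            texts = examples['sms']
        elif 'text' in examples:
            texts = examples['text']
        else:
            texts = examples[list(examples.keys())[0]]  # First available field

        # Pad or truncate every sample to exactly MAX_LENGTH tokens
        return tokenizer(
            texts,
            truncation=True,
            padding='max_length',
            max_length=MAX_LENGTH,
            return_tensors='pt'
        )

    # Tokenize in batches and drop the original (raw text) columns
    tokenized = dataset.map(tokenize_function, batched=True, remove_columns=dataset.column_names)
    # Expose only the columns the model needs, as PyTorch tensors
    tokenized.set_format(type='torch', columns=['input_ids', 'attention_mask'])
    return DataLoader(tokenized, batch_size=BATCH_SIZE, shuffle=False)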
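
To make the effect of `truncation=True` and `padding='max_length'` concrete, here is a minimal pure-Python sketch of what happens to each sequence. The helper `pad_or_truncate` and the pad id `0` are hypothetical, and right-side padding is an assumption; the real tokenizer decides the padding side and the pad token.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]            # truncation: drop tokens past the limit
    n_pad = max_length - len(kept)           # padding needed to reach the limit
    input_ids = kept + [pad_id] * n_pad      # assumption: pad on the right
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A sequence shorter than the limit is padded; a longer one is cut.
short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Because every row ends up with the same length, the `DataLoader` can stack rows into fixed-shape batches.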

In [None]:
dataloader = prepare_dataset(dataset)
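
As a quick sanity check on the sizes chosen above, the number of batches can be worked out by hand. This sketch assumes no rows are filtered out and relies on the `DataLoader` default `drop_last=False`, which keeps the final, smaller batch.

```python
import math

RECOVERY_SAMPLES = 100
BATCH_SIZE = 8

# With drop_last=False the loader yields ceil(samples / batch_size) batches.
n_batches = math.ceil(RECOVERY_SAMPLES / BATCH_SIZE)
# The last batch holds whatever is left after the full batches.
last_batch_size = RECOVERY_SAMPLES - (n_batches - 1) * BATCH_SIZE
print(n_batches, last_batch_size)
```

The progress bar of the analysis step should report the same number of batches.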

## Call to `analyze_layer_importance`

The `analyze_layer_importance` function needs only the model and the dataloader to work with.

It analyzes how the different blocks transform the information and returns a dictionary that maps each layer index to its cosine distance score.

A value close to 1 indicates that the block modifies the information substantially, so it is an active block that contributes to the final result. A value of 0 or close to 0 indicates that the block is passive: it barely alters the information passing through it.

Blocks with values close to 0 are the candidates for removal.
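
To build intuition for the score, here is a minimal pure-Python sketch of a cosine distance between a vector entering a block and the vector leaving it. It illustrates the metric only: the vectors are made-up toy values, and OptiPFair's internal computation (which tensors it compares and how it averages over batches) may differ.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity: 0 means same direction, larger means more change."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy hidden state entering a block (hypothetical values).
block_input = [1.0, 2.0, 3.0]
# A "passive" block returns almost the same direction...
passive_output = [1.1, 2.1, 2.9]
# ...while an "active" block rotates the representation noticeably.
active_output = [3.0, -1.0, 0.5]

passive_score = cosine_distance(block_input, passive_output)
active_score = cosine_distance(block_input, active_output)
print(f"passive block: {passive_score:.4f}")
print(f"active block:  {active_score:.4f}")
```

Note that the distance depends only on direction, not on magnitude: scaling the output by a constant leaves the score unchanged.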

In [15]:
importance_scores = analyze_layer_importance(model, dataloader, show_progress=True)


Processing batches: 100%|██████████| 13/13 [00:06<00:00,  1.96it/s]


In [18]:
print(importance_scores)

{0: np.float64(0.8903949444110577), 1: np.float64(0.30757962740384615), 2: np.float64(0.7715407151442307), 3: np.float64(0.044283353365384616), 4: np.float64(0.051382211538461536), 5: np.float64(0.047475961538461536), 6: np.float64(0.04713792067307692), 7: np.float64(0.03339092548076923), 8: np.float64(0.040151742788461536), 9: np.float64(0.049692007211538464), 10: np.float64(0.04417067307692308), 11: np.float64(0.051194411057692304), 12: np.float64(0.04424579326923077), 13: np.float64(0.04852764423076923), 14: np.float64(0.045335036057692304), 15: np.float64(0.045335036057692304), 16: np.float64(0.051645132211538464), 17: np.float64(0.060884915865384616), 18: np.float64(0.03786057692307692), 19: np.float64(0.054762620192307696), 20: np.float64(0.042405348557692304), 21: np.float64(0.06276292067307693), 22: np.float64(0.0692608173076923), 23: np.float64(0.0829326923076923), 24: np.float64(0.07241586538461539), 25: np.float64(0.07466947115384616), 26: np.float64(0.06366436298076923), 27

In [25]:
# Sort the layer indices by ascending score and keep the three lowest
layers_2_remove = sorted(importance_scores.keys(), key=lambda x: importance_scores[x])[:3]
print(layers_2_remove)

[7, 18, 8]
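
The same selection can be written with `heapq.nsmallest`, which avoids sorting the whole dictionary. This is a sketch on a small subset of the scores printed above, rounded to four decimals; the helper name `least_important_layers` is hypothetical.

```python
import heapq


def least_important_layers(scores, k):
    """Return the k layer indices with the lowest scores, lowest first."""
    return heapq.nsmallest(k, scores, key=scores.get)


# Subset of the scores printed above, rounded for readability.
sample_scores = {
    0: 0.8904, 1: 0.3076, 2: 0.7715,
    7: 0.0334, 8: 0.0402, 18: 0.0379, 20: 0.0424,
}

candidates = least_important_layers(sample_scores, k=3)
for layer in candidates:
    print(f"layer {layer}: {sample_scores[layer]:.4f}")
```

Changing `k` is then the only thing needed to try a more or less aggressive pruning.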


### Prune Model

With the least important blocks identified, we call `prune_model_depth` to remove them.

It returns the new model without the selected layers.

In [26]:
pruned_model = prune_model_depth(
    model=model,
    layer_indices=layers_2_remove,
    show_progress=True
)

Removing layers: 100%|██████████| 28/28 [00:00<00:00, 306633.19it/s]
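
Conceptually, depth pruning keeps every block whose index is not in the removal list, and the surviving blocks are renumbered. The sketch below shows that idea with plain Python lists; it is not OptiPFair's implementation, which works on the model's module list, and the helper `remove_layers` is hypothetical.

```python
def remove_layers(layers, indices_to_remove):
    """Keep every layer whose position is not in indices_to_remove."""
    drop = set(indices_to_remove)
    return [layer for i, layer in enumerate(layers) if i not in drop]


# Stand-in for the 28 blocks: each "layer" is just its original index.
original = list(range(28))
remaining = remove_layers(original, [7, 18, 8])

# Map each surviving original index to its new position.
old_to_new = {old: new for new, old in enumerate(remaining)}
print(len(remaining))
print(old_to_new[9], old_to_new[27])
```

Keep the renumbering in mind when comparing scores before and after pruning: an index in the pruned model no longer refers to the same block as in the original.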


In [27]:
pruned_model

Qwen3ForCausalLM(
  (model): Qwen3Model(
    (embed_tokens): Embedding(151936, 1024)
    (layers): ModuleList(
      (0-24): 25 x Qwen3DecoderLayer(
        (self_attn): Qwen3Attention(
          (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
          (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
          (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
          (q_norm): Qwen3RMSNorm((128,), eps=1e-06)
          (k_norm): Qwen3RMSNorm((128,), eps=1e-06)
        )
        (mlp): Qwen3MLP(
          (gate_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (up_proj): Linear(in_features=1024, out_features=3072, bias=False)
          (down_proj): Linear(in_features=3072, out_features=1024, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen3RMSNorm((1024,), eps=1e-06)
        (post_attention_layernorm): Qwe

The new model has only 25 layers.
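
A rough estimate of the saving can be worked out from the `Linear` shapes printed in the architecture above. This sketch counts only the bias-free projection weights of one block and leaves out the small RMSNorm weights, so it slightly underestimates the real figure.

```python
# (in_features, out_features) of each bias-free Linear layer in one block,
# copied from the printed architecture.
linear_shapes = {
    "q_proj": (1024, 2048),
    "k_proj": (1024, 1024),
    "v_proj": (1024, 1024),
    "o_proj": (2048, 1024),
    "gate_proj": (1024, 3072),
    "up_proj": (1024, 3072),
    "down_proj": (3072, 1024),
}

# A bias-free Linear layer holds in_features * out_features weights.
params_per_block = sum(i * o for i, o in linear_shapes.values())
removed_blocks = 3
removed_params = removed_blocks * params_per_block
print(f"{params_per_block:,} weights per block")
print(f"{removed_params:,} weights removed")
```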

From here, you could run a Knowledge Distillation process to recover the general capabilities lost during pruning. The result would be a lighter model suited to the specific task you're preparing it for.
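
For readers new to distillation, the core of a common formulation is a loss that pulls the student's output distribution toward the teacher's. The sketch below shows a temperature-scaled KL divergence in pure Python on made-up logits; it is one standard choice, not something OptiPFair provides or prescribes, and the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * temperature ** 2


# Toy logits over a 3-token vocabulary (hypothetical values).
teacher = [2.0, 1.0, 0.1]   # original 28-block model
student = [1.5, 1.2, 0.3]   # pruned 25-block model
loss = distillation_loss(teacher, student)
print(f"{loss:.6f}")
```

In this setup the original model would act as the teacher and the pruned model as the student.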
