<a href="https://colab.research.google.com/github/edenlum/DynamicTransformer/blob/main/Dynamic_Transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Hypothesis 1:

All layers operate on roughly the same representation space: the inputs and outputs of each layer are tensors drawn from approximately the same distribution.

If that is true, reordering the layers might still produce sensible outputs, and so might skipping some of them.

## Hypothesis 2:

Not all layers are used on all inputs. In other words, for some inputs we can skip some of the layers and the output will change very little. This is supported by the "circuits" theory: for some tasks you can find a circuit inside the transformer that is made up of only a subset of the transformer layers.
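
One way to make "the output will not change by much" measurable is to compare the full model's logits with the logits obtained after skipping a layer. The experiment later in this notebook only checks whether the predicted label stays the same; the sketch below is an assumed, finer-grained alternative that computes the KL divergence between the two softmax distributions. The helper names and the example logits are hypothetical and are not taken from any run.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a plain list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def kl_divergence(p_logits, q_logits):
    """KL(p || q) between the softmax distributions of two logit vectors."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Hypothetical logits: full model vs. the same model with one layer skipped.
full_logits = [2.0, -1.0]
skipped_logits = [1.5, -0.5]

# Label agreement is the coarse check; KL divergence shows how far the
# predicted distribution moved even when the label did not change.
same_label = full_logits.index(max(full_logits)) == skipped_logits.index(max(skipped_logits))
shift = kl_divergence(full_logits, skipped_logits)
print(f"same label: {same_label}, KL divergence: {shift:.4f}")
```

A value of zero means the two distributions are identical; larger values mean skipping the layer moved the prediction further, even if the argmax is unchanged.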

In [None]:
!pip install -q datasets

In [None]:
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

model_name = "distilbert-base-uncased"
model = DistilBertForSequenceClassification.from_pretrained(model_name)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from datasets import load_dataset

dataset = load_dataset("imdb", split="test[:1%]")
texts = dataset["text"]
labels = dataset["label"]

In [None]:
import torch
import random
from torch.utils.data import DataLoader, TensorDataset
import copy

def encode_texts(texts, tokenizer, max_length=512):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt", max_length=max_length)
    return inputs.input_ids, inputs.attention_mask

def prepare_dataloader(texts, labels, tokenizer, batch_size=32):
    input_ids, attention_mask = encode_texts(texts, tokenizer)
    dataset = TensorDataset(input_ids, attention_mask, torch.tensor(labels))
    return DataLoader(dataset, batch_size=batch_size)

def evaluate_sample(model, input_id, attention_mask, device):
    model.to(device)
    model.eval()
    with torch.no_grad():
        output = model(input_id.unsqueeze(0).to(device), attention_mask=attention_mask.unsqueeze(0).to(device))
    return torch.argmax(output.logits, dim=-1).item()

In [None]:
import torch.nn as nn

def remove_layer(model, layer_to_remove):
    modified_model = copy.deepcopy(model)
    modified_model.distilbert.transformer.layer = nn.ModuleList(
        [layer for i, layer in enumerate(modified_model.distilbert.transformer.layer) if i != layer_to_remove]
    )
    return modified_model
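
Hypothesis 1 also mentions changing the order of the layers, which this notebook does not test. A minimal sketch of how that could be done with the same `nn.ModuleList` pattern is shown here. The `permute_layers` helper and the toy stack of linear layers are assumptions for illustration; a real experiment would pass `model.distilbert.transformer.layer` instead.

```python
import copy

import torch
import torch.nn as nn

def permute_layers(layers, order):
    """Return a new ModuleList holding deep copies of `layers` in the given order."""
    if sorted(order) != list(range(len(layers))):
        raise ValueError("order must be a permutation of the layer indices")
    return nn.ModuleList([copy.deepcopy(layers[i]) for i in order])

def run_stack(stack, x):
    """Apply the layers one after another, like a transformer encoder loop."""
    for layer in stack:
        x = layer(x)
    return x

# Toy stand-in for a transformer stack: three layers with equal input/output width.
torch.manual_seed(0)
toy_layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])
swapped = permute_layers(toy_layers, [1, 0, 2])

x = torch.randn(2, 4)
with torch.no_grad():
    original_out = run_stack(toy_layers, x)
    swapped_out = run_stack(swapped, x)
print(original_out.shape, swapped_out.shape)
```

Because every layer maps tensors of one shape to tensors of the same shape, any permutation still runs; whether the output remains meaningful is what Hypothesis 1 asks.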

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataloader = prepare_dataloader(texts, labels, tokenizer)

def test_hypothesis(model, dataloader, device):
    num_layers = len(model.distilbert.transformer.layer)
    # Build each ablated model once instead of deep-copying the model for
    # every sample; this keeps num_layers extra copies in memory.
    ablated_models = [remove_layer(model, i) for i in range(num_layers)]
    stable_count = {i: 0 for i in range(num_layers)}
    total_count = 0

    for batch in dataloader:
        # Labels are unused: we compare predictions with and without a layer, not accuracy.
        input_ids, attention_mask, _ = [x.to(device) for x in batch]

        for i in range(input_ids.size(0)):
            original_output = evaluate_sample(model, input_ids[i], attention_mask[i], device)

            for layer_to_skip, ablated_model in enumerate(ablated_models):
                modified_output = evaluate_sample(ablated_model, input_ids[i], attention_mask[i], device)

                if original_output == modified_output:
                    stable_count[layer_to_skip] += 1

            total_count += 1

    print(f"Stable samples: {stable_count}")
    print(f"Total samples: {total_count}")
    print("Stability rate:")
    for i in range(num_layers):
        print(f"Layer {i}: {stable_count[i] / total_count}")

test_hypothesis(model, dataloader, device)

Stable samples: {0: 228, 1: 231, 2: 230, 3: 231, 4: 229, 5: 232}
Total samples: 250
Stability rate:
Layer 0: 0.912
Layer 1: 0.924
Layer 2: 0.92
Layer 3: 0.924
Layer 4: 0.916
Layer 5: 0.928
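
The stability rates above differ by less than two percentage points across layers, so it is worth checking whether those differences exceed sampling noise. The sketch below computes a 95% Wilson score interval for each layer from the counts printed above. The choice of the Wilson interval and the `wilson_interval` helper are assumptions for illustration and are not part of the original experiment.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# Counts copied from the output above.
stable_count = {0: 228, 1: 231, 2: 230, 3: 231, 4: 229, 5: 232}
total_count = 250

intervals = {layer: wilson_interval(k, total_count) for layer, k in stable_count.items()}
for layer, (low, high) in intervals.items():
    rate = stable_count[layer] / total_count
    print(f"Layer {layer}: {rate:.3f} (95% CI {low:.3f} to {high:.3f})")
```

The intervals for the lowest rate (layer 0) and the highest rate (layer 5) overlap, so with 250 samples this run does not clearly separate the layers. Note also that the model-loading message above reports the classifier head as newly initialized, so these rates describe the stability of an untrained head's predictions.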
