# dPrune: Forgetting Score Example

This notebook demonstrates pruning with the **Forgetting Score** in `dPrune`. The forgetting score comes from the paper [An Empirical Study of Example Forgetting during Deep Neural Network Learning](https://arxiv.org/abs/1812.05159) and counts how many times an example is "forgotten" during training. The paper finds that frequently forgotten examples tend to be more *informative* than examples that are never forgotten.

An example is "forgotten" when it goes from being classified correctly to being classified incorrectly between epochs. Because the definition relies on per-example correctness, the score is well suited to classification tasks.
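
The definition above can be sketched in a few lines of plain Python. This is only an illustration of the counting rule, not `dPrune`'s implementation: the function name `count_forgetting_events` and the list-of-booleans input are assumptions made for the example.

```python
from typing import Sequence


def count_forgetting_events(correct_per_epoch: Sequence[bool]) -> int:
    """Count correct -> incorrect transitions between consecutive epochs.

    Hypothetical helper for illustration; not part of dPrune.
    """
    return sum(
        1
        for prev, curr in zip(correct_per_epoch, correct_per_epoch[1:])
        if prev and not curr
    )


# Correctness of one example after each of six epochs
history = [False, True, False, True, True, False]
print(count_forgetting_events(history))  # prints 2
```

Note that a plain transition count assigns 0 to an example that is never classified correctly, so it cannot by itself separate "never learned" from "learned and never forgotten". How `dPrune` treats that case is not shown in this notebook.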


## 1. Setup and Installation


In [None]:
!pip install transformers torch scikit-learn tqdm accelerate

!pip install -U datasets huggingface_hub fsspec

In [None]:
!pip install -i https://test.pypi.org/simple/ dprune

In [None]:
import torch
import numpy as np
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)

from dprune import ForgettingCallback, ForgettingScorer, TopKPruner, BottomKPruner, PruningPipeline


## 2. Load the IMDB Dataset

For the forgetting score to be meaningful, the dataset must be large enough, and training must run long enough, for forgetting events to occur. We will use the IMDB dataset from the Hugging Face Hub.


In [None]:
from datasets import load_dataset
raw_dataset = load_dataset("stanfordnlp/imdb", split="train")

# To train faster on a 10% sample of the dataset, uncomment the snippet below.
# raw_dataset = raw_dataset.shuffle(seed=42)
# raw_dataset = raw_dataset.select(range(int(0.1 * len(raw_dataset))))

print(f"Positive examples: {sum(raw_dataset['label'])}")
print(f"Negative examples: {len(raw_dataset) - sum(raw_dataset['label'])}")
print("\nFirst few examples:")
for i in range(3):
    print(f"  {i}: '{raw_dataset['text'][i]}' -> {raw_dataset['label'][i]}")


Positive examples: 12500
Negative examples: 12500

First few examples:
  0: 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered

## 3. Setup Model and Tokenizer


In [None]:
model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)

tokenized_dataset = raw_dataset.map(tokenize_function, batched=True)

print("Model and tokenizer loaded successfully!")
print(f"Tokenized dataset: {tokenized_dataset}")


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model and tokenizer loaded successfully!
Tokenized dataset: Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 25000
})


## 4. Initialize the Forgetting Callback

This is the key step! We create a `ForgettingCallback` that will monitor the training process.


In [None]:
forgetting_callback = ForgettingCallback()

## 5. Train the Model with the Callback

We'll train for several epochs to give the model a chance to "forget" some examples.


In [None]:
# Training arguments - we want multiple epochs to observe forgetting
training_args = TrainingArguments(
    output_dir='./forgetting_results',
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    logging_steps=100,
    save_strategy="no",
    report_to="none"
)

# Create trainer with our callback
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    callbacks=[forgetting_callback],  # This is the key addition!
)

forgetting_callback.trainer = trainer

print("Trainer created with ForgettingCallback.")
print("Starting training...")

# Train the model
trainer.train()

print("Training completed!")


Trainer created with ForgettingCallback.
Starting training...


Step,Training Loss
100,0.5074
200,0.3775
300,0.3457
400,0.3427
500,0.3236
600,0.3293
700,0.3103
800,0.301
900,0.2332
1000,0.2207


Training completed!


## 6. Examine the Forgetting Events

Let's look at what the callback recorded during training.


In [None]:
print(f"Number of examples tracked: {len(forgetting_callback.learning_events)}")

# Calculate forgetting scores
forgetting_scores = forgetting_callback.calculate_forgetting_scores()
print(f"\nForgetting scores calculated for {len(forgetting_scores)} examples")
print(f"Score distribution: min={min(forgetting_scores)}, max={max(forgetting_scores)}, mean={np.mean(forgetting_scores):.2f}")

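# Count the examples that were forgotten at least once
# (counting examples rather than summing their scores)
total_examples_forgotten = sum(1 for score in forgetting_scores if score > 0)
total_examples_forgotten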

Number of examples tracked: 25000

Forgetting scores calculated for 25000 examples
Score distribution: min=0, max=1, mean=0.01


174
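
The maximum score of 1 reported above is what one would expect after three training epochs, if correctness is checked once per epoch (an assumption about how `ForgettingCallback` samples, which this notebook does not confirm). A forgetting event needs a correct check followed by an incorrect one, so a history of `n` checks can contain at most `n // 2` events. The brute-force sketch below verifies that bound for small `n`; both helper names are hypothetical.

```python
from itertools import product


def count_forgetting_events(correct_per_check):
    """Count correct -> incorrect transitions (illustrative helper)."""
    return sum(
        1
        for prev, curr in zip(correct_per_check, correct_per_check[1:])
        if prev and not curr
    )


def max_forgetting_events(num_checks):
    """Largest forgetting count over every possible correctness history."""
    histories = product([False, True], repeat=num_checks)
    return max(count_forgetting_events(h) for h in histories)


for n in (2, 3, 4, 5, 6):
    # The brute-force maximum matches n // 2 for each n
    print(n, max_forgetting_events(n), n // 2)
```

Under that assumption, training for more epochs is the main way to obtain a wider spread of scores than the 0/1 split seen here.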

## 7. Use the Forgetting Scorer in a Pipeline



In [None]:
# Create the forgetting scorer using our callback
forgetting_scorer = ForgettingScorer(forgetting_callback)

# Score the dataset
scored_dataset = forgetting_scorer.score(raw_dataset)

print("Dataset scored with forgetting scores!")
print(f"Scored dataset columns: {scored_dataset.column_names}")
print("\nFirst few examples with scores:")
for i in range(5):
    print(f"  Score: {scored_dataset['score'][i]}, Text: '{scored_dataset['text'][i][:50]}...', Label: {scored_dataset['label'][i]}")

Dataset scored with forgetting scores!
Scored dataset columns: ['text', 'label', 'score']

First few examples with scores:
  Score: 1, Text: 'I rented I AM CURIOUS-YELLOW from my video store b...', Label: 0
  Score: 0, Text: '"I Am Curious: Yellow" is a risible and pretentiou...', Label: 0
  Score: 0, Text: 'If only to avoid making this type of film in the f...', Label: 0
  Score: 0, Text: 'This film was probably inspired by Godard's Mascul...', Label: 0
  Score: 0, Text: 'Oh, brother...after hearing about this ridiculous ...', Label: 0


## 8. Pruning with Forgetting Scores



In [None]:
# Strategy: Keep examples that are forgotten the most (hardest examples)
top_pruner = TopKPruner(k=0.7)  # Keep top 70%
pipeline_hard = PruningPipeline(scorer=forgetting_scorer, pruner=top_pruner)
hard_examples = pipeline_hard.run(raw_dataset)

print("\nHardest examples (most forgotten):")
for i in range(min(3, len(hard_examples))):
    print(f" Text: '{hard_examples['text'][i][:60]}...', Label: {hard_examples['label'][i]}")

print("Length of pruned dataset", len(hard_examples))


Hardest examples (most forgotten):
 Text: 'I rented I AM CURIOUS-YELLOW from my video store because of ...', Label: 0
 Text: 'Ned Kelly (Ledger), the infamous Australian outlaw and legen...', Label: 0
 Text: 'Maybe you shouldn't compare, but Wild Style and Style Wars a...', Label: 0
Length of pruned dataset 17500
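
Since only 174 examples have a nonzero score, at least 17,326 of the 17,500 kept examples must have a score of 0, so most of the selection is decided by how ties are broken. The sketch below shows one plausible way to keep the top fraction by score, using a stable sort so that tied examples stay in their original order. It is an illustration only: `keep_top_fraction` is a hypothetical helper, and `TopKPruner`'s actual tie-breaking is not documented in this notebook.

```python
def keep_top_fraction(scores, fraction):
    """Return the indices of the highest-scoring `fraction` of examples.

    Hypothetical stand-in for a top-k pruner; ties keep original order
    because Python's sort is stable, even with reverse=True.
    """
    k = round(len(scores) * fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Ten examples, two of which were forgotten once
scores = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
kept = keep_top_fraction(scores, 0.7)
print(kept)  # prints [1, 4, 0, 2, 3, 5, 6]
print(sum(scores[i] > 0 for i in kept))  # prints 2
```

In the toy run, both forgotten examples are kept, and the remaining five slots are filled by zero-score examples in dataset order.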


In [None]:
!pip install evaluate -q

In [None]:
import evaluate


## 9. Training with Pruned Dataset


In [None]:
# Now let's train a fresh model on the pruned dataset and compare performance
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import Dataset
import evaluate

# Load the F1 metric
f1_metric = evaluate.load("f1")

# Load original test dataset for evaluation
test_dataset = load_dataset("stanfordnlp/imdb", split="test")
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)

# Tokenize the pruned dataset
tokenized_hard_examples = hard_examples.map(tokenize_function, batched=True)

print(f"Original dataset size: {len(tokenized_dataset)}")
print(f"Pruned dataset size: {len(tokenized_hard_examples)}")
print(f"Test dataset size: {len(tokenized_test_dataset)}")


Original dataset size: 25000
Pruned dataset size: 17500
Test dataset size: 25000


In [None]:
# Create a fresh model for training on pruned dataset
model_pruned = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Training arguments for pruned dataset
training_args_pruned = TrainingArguments(
    output_dir='./pruned_results',
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    logging_steps=50,
    save_strategy="no",
    report_to="none"
)

# Create trainer for pruned dataset
trainer_pruned = Trainer(
    model=model_pruned,
    args=training_args_pruned,
    train_dataset=tokenized_hard_examples,
    eval_dataset=tokenized_test_dataset,
)

print("Training model on pruned dataset...")
trainer_pruned.train()
print("Training on pruned dataset completed!")




Training model on pruned dataset...


Step,Training Loss
50,0.5565
100,0.3591
150,0.3413
200,0.3353
250,0.3037
300,0.296
350,0.3013
400,0.2756
450,0.244
500,0.2781


Training on pruned dataset completed!


In [None]:
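def evaluate_model_f1(trainer_obj, eval_dataset):
    """Return the macro-averaged F1 score of a trained model on eval_dataset."""
    trainer_obj.model.eval()
    # predictions.predictions holds the logits, one row per example
    predictions = trainer_obj.predict(eval_dataset)
    y_pred = predictions.predictions.argmax(axis=1)
    y_true = predictions.label_ids

    # Macro averaging gives both sentiment classes equal weight
    f1_score = f1_metric.compute(predictions=y_pred, references=y_true, average='macro')

    return f1_score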

print("Evaluating models...")

f1_full = evaluate_model_f1(trainer, tokenized_test_dataset)
print(f"Model trained on full dataset - F1 Score: {f1_full['f1']:.4f}")

f1_pruned = evaluate_model_f1(trainer_pruned, tokenized_test_dataset)
print(f"Model trained on pruned dataset - F1 Score: {f1_pruned['f1']:.4f}")

pruning_ratio = len(tokenized_hard_examples) / len(tokenized_dataset)
print(f"\nPruning Results:")
print(f"Dataset size reduction: {(1 - pruning_ratio) * 100:.1f}%")
print(f"F1 Score difference: {f1_full['f1'] - f1_pruned['f1']:.4f}")
print(f"Relative performance: {(f1_pruned['f1'] / f1_full['f1']) * 100:.1f}%")


Evaluating models...


Model trained on full dataset - F1 Score: 0.8790


Model trained on pruned dataset - F1 Score: 0.8533

Pruning Results:
Dataset size reduction: 30.0%
F1 Score difference: 0.0256
Relative performance: 97.1%
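
The evaluation cell requests `average='macro'`, which computes F1 separately for each class and then takes the unweighted mean. The dependency-free sketch below shows that calculation on a toy example. The helper names are hypothetical, and returning 0.0 for a class with no true or predicted members is a convention chosen for this sketch.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating `cls` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Convention for this sketch: a class with no support scores 0.0
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


y_true = [1, 1, 1, 0]
y_pred = [1, 1, 0, 0]
# Class 1: F1 = 0.8; class 0: F1 = 2/3; macro average is their mean
print(f"{macro_f1(y_true, y_pred):.4f}")  # prints 0.7333
```

Because the IMDB test split is balanced (12,500 reviews per class), macro F1 here should stay close to a support-weighted average.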


## 10. Conclusion

This notebook demonstrated how to:

1. **Set up the `ForgettingCallback`** to monitor training
2. **Train a model** while recording learning events
3. **Calculate forgetting scores** from the recorded events
4. **Use the `ForgettingScorer`** in a `dPrune` pipeline
5. **Train models on both full and pruned datasets**
6. **Compare performance using F1-score evaluation**
