# Unsupervised Data Pruning with Perplexity Scoring

This notebook demonstrates how to use the `PerplexityScorer` to improve dataset quality for text summarization tasks using the CNN/DailyMail dataset. We'll train summarization models on both original and pruned datasets and compare their ROUGE-L scores.

## Overview

The `PerplexityScorer` calculates a perplexity score for each text using a KenLM language model. Perplexity measures how surprising a text is to that model: higher perplexity indicates harder (and potentially noisier) instances, while lower perplexity indicates easier, more prototypical ones.

For summarization, we can use these scores to filter out noisy articles that are too difficult or unusual.
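To make the score concrete, here is a minimal sketch of how perplexity follows from per-token log probabilities. It assumes base-10 log probabilities, which is the convention KenLM uses; the function name and the numbers are illustrative and are not part of dPrune.

```python
def perplexity_from_log10_probs(log10_probs):
    """Perplexity of a token sequence given per-token log10 probabilities."""
    if not log10_probs:
        raise ValueError("need at least one token")
    # Average log10 probability per token, then invert the logarithm.
    avg_log10 = sum(log10_probs) / len(log10_probs)
    return 10.0 ** (-avg_log10)

# Hypothetical per-token log10 probabilities for two short texts.
fluent = [-1.0, -1.2, -0.8, -1.0]    # each token is fairly predictable
unusual = [-3.0, -2.5, -3.5, -3.0]   # each token surprises the model

print(perplexity_from_log10_probs(fluent))
print(perplexity_from_log10_probs(unusual))
```

The second text averages two orders of magnitude less probability per token, so its perplexity is roughly a hundred times higher. That gap is what the pruning step exploits.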

In [2]:
%pip install transformers rouge-score nltk -q

In [3]:
%pip install -U datasets huggingface_hub fsspec

In [4]:
%pip install dprune[kenlm] -q

In [5]:
import os
import numpy as np
import pandas as pd
from datasets import Dataset, load_dataset
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments, Seq2SeqTrainer,
    DataCollatorForSeq2Seq, pipeline
)
from rouge_score import rouge_scorer
import torch
from dprune import PerplexityScorer, TopKPruner, BottomKPruner
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm

torch.manual_seed(42)
np.random.seed(42)


## Step 1: Download KenLM Model and Load Dataset

First, we'll download a pre-trained KenLM model and load the CNN/DailyMail dataset.


In [6]:
from dprune.utils import download_kenlm_model, get_supported_languages

# The model is 4.44 GB, so it may take a while to download (~5 minutes on Colab)
KENLM_MODEL_PATH = download_kenlm_model(
    output_dir_path="./models",  # Local models directory
    lang_id="en",
    source="wikipedia",
    verbose=True
)

# Load CNN/DailyMail dataset
print("\nLoading CNN/DailyMail dataset...")
dataset = load_dataset("abisee/cnn_dailymail", "3.0.0", split="train")

# Optional: uncomment the two lines below to use a subset for faster experimentation
# SUBSET_SIZE = 1000  # Adjust based on your computational resources
# dataset = dataset.select(range(SUBSET_SIZE))

print(f"Dataset loaded with {len(dataset)} examples")
print(f"Dataset columns: {dataset.column_names}")

# Show a sample
sample = dataset[0]
print(f"\nSample article (first 300 chars): {sample['article'][:300]}...")
print(f"Sample highlights (first 200 chars): {sample['highlights'][:200]}...")


Downloading KenLM model for language: en
Successfully downloaded KenLM model for en to ./models/en.arpa.bin

Loading CNN/DailyMail dataset...


Dataset loaded with 287113 examples
Dataset columns: ['article', 'highlights', 'id']

Sample article (first 300 chars): LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappoi...
Sample highlights (first 200 chars): Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been hel...


## Steps 2 and 3: Calculate Perplexity Scores and Prune

We'll use the `PerplexityScorer` to score each article, then use a `BottomKPruner` to keep only the 20% of articles with the lowest scores. A `PruningPipeline` chains the two steps.


In [7]:
from dprune import BottomKPruner, PruningPipeline

scorer = PerplexityScorer(
    model_path=KENLM_MODEL_PATH,
    text_column='article',
    batch_size=50
)
bottom_pruner = BottomKPruner(k=0.2)  # keep the 20% of examples with the lowest perplexity
pipeline_easy = PruningPipeline(scorer=scorer, pruner=bottom_pruner)
pruned_dataset = pipeline_easy.run(dataset)

Calculating perplexity:   0%|          | 0/5743 [00:00<?, ?it/s]
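Conceptually, the pruning step ranks examples by score and keeps the lowest-scoring fraction. The sketch below illustrates that idea in plain Python; it is not dPrune's implementation. The helper name, the rounding rule, and the example scores are all assumptions made for illustration.

```python
def keep_lowest_fraction(scores, fraction):
    """Return the indices of the `fraction` of items with the lowest scores."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Rounding to the nearest integer is an assumption of this sketch.
    n_keep = round(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    # Return the kept indices in their original dataset order.
    return sorted(ranked[:n_keep])

# Hypothetical perplexity scores for ten articles.
perplexities = [310.0, 95.5, 1200.0, 88.1, 450.0, 102.3, 76.4, 980.0, 133.7, 2050.0]
kept = keep_lowest_fraction(perplexities, 0.2)

print(kept)
print([perplexities[i] for i in kept])
```

With a Hugging Face `Dataset`, indices like these could be passed to `dataset.select(kept)` to materialize the pruned subset.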

## Step 4: Train Summarization Models

We'll train lightweight summarization models on both the original and pruned datasets, then compare their performance.


In [15]:
MODEL_NAME = "facebook/bart-base"
MAX_INPUT_LENGTH = 1024
MAX_TARGET_LENGTH = 128


# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
print(f"Loaded tokenizer for {MODEL_NAME}")

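def preprocess_function(examples):
    """Tokenize articles and reference summaries for seq2seq training."""
    inputs = [f"summarize: {article}" for article in examples["article"]]
    targets = examples["highlights"]

    # Tokenize inputs. No padding here: DataCollatorForSeq2Seq pads each
    # batch dynamically at training time, which avoids storing padded rows.
    model_inputs = tokenizer(
        inputs,
        max_length=MAX_INPUT_LENGTH,
        truncation=True
    )

    # Tokenize targets. `text_target` replaces the deprecated
    # `as_target_tokenizer()` context manager. Leaving labels unpadded lets
    # the collator mask padded label positions with -100, so they are
    # excluded from the loss.
    labels = tokenizer(
        text_target=targets,
        max_length=MAX_TARGET_LENGTH,
        truncation=True
    )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs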

print("Preprocessing datasets...")

# Preprocess both datasets
tokenized_original = dataset.map(preprocess_function, batched=True)
tokenized_pruned = pruned_dataset.map(preprocess_function, batched=True)

# Split datasets into train/test (80/20 split)
def split_dataset(dataset, test_size=0.2):
    dataset = dataset.train_test_split(test_size=test_size, seed=42)
    return dataset["train"], dataset["test"]

original_train, original_test = split_dataset(tokenized_original)
pruned_train, pruned_test = split_dataset(tokenized_pruned)

print(f"\nDataset splits:")
print(f"Original - Train: {len(original_train)}, Test: {len(original_test)}")
print(f"Pruned - Train: {len(pruned_train)}, Test: {len(pruned_test)}")


Loaded tokenizer for facebook/bart-base
Preprocessing datasets...


Dataset splits:
Original - Train: 229690, Test: 57423
Pruned - Train: 45938, Test: 11485
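When label sequences of different lengths are batched, the padded positions are conventionally set to -100. PyTorch's cross-entropy loss ignores that value by default, and it is the default `label_pad_token_id` of `DataCollatorForSeq2Seq`. The following plain-Python sketch shows what that padding looks like; the helper name and the token ids are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def pad_labels(label_batch, ignore_index=IGNORE_INDEX):
    """Right-pad lists of label ids to a common length with `ignore_index`."""
    max_len = max(len(labels) for labels in label_batch)
    return [
        labels + [ignore_index] * (max_len - len(labels))
        for labels in label_batch
    ]

# Hypothetical token ids for three reference summaries of different lengths.
batch = [[0, 713, 2], [0, 713, 16, 10, 2], [0, 2]]
padded = pad_labels(batch)

for row in padded:
    print(row)
```

Padding labels with the tokenizer's own pad token id instead would make the model spend part of its loss on predicting padding.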


In [22]:
import gc
gc.collect()

4277

In [23]:
# Training hyperparameters, defined before the function that reads them
BATCH_SIZE = 32
LEARNING_RATE = 2e-5
NUM_EPOCHS = 1

def train_model(train_dataset, output_dir, model_name="Model"):
    """Fine-tune a fresh copy of MODEL_NAME on train_dataset and save it."""
    print(f"\n=== Training {model_name} ===")

    # Start each run from the same pre-trained checkpoint
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
    training_args = Seq2SeqTrainingArguments(
        output_dir=output_dir,
        learning_rate=LEARNING_RATE,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        num_train_epochs=NUM_EPOCHS,
        weight_decay=0.01,
        save_strategy="epoch",
        save_total_limit=1,
        predict_with_generate=True,
        fp16=torch.cuda.is_available(),  # mixed precision only on GPU
        logging_steps=500,
        report_to="none",
    )
    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        # `processing_class` replaces the `tokenizer=` argument, which is
        # deprecated in recent transformers releases
        processing_class=tokenizer,
        data_collator=data_collator,
    )

    print(f"Starting training with {len(train_dataset)} examples...")
    trainer.train()

    trainer.save_model()
    print(f"Model saved to {output_dir}")

    return model


# Train both models
print("Starting model training...")
print("Note: This may take some time depending on your hardware.")

# Train on original dataset
original_model = train_model(
    train_dataset=original_train,
    output_dir="./model_original",
    model_name="Original Dataset Model"
)

# Train on pruned dataset
pruned_model = train_model(
    train_dataset=pruned_train,
    output_dir="./model_pruned",
    model_name="Pruned Dataset Model"
)

print("\n✅ Both models trained successfully!")


Starting model training...
Note: This may take some time depending on your hardware.

=== Training Original Dataset Model ===


Starting training with 229690 examples...


| Step | Training Loss |
|------|---------------|
| 500  | 1.9868 |
| 1000 | 1.1784 |
| 1500 | 1.1517 |
| 2000 | 1.1225 |
| 2500 | 1.116 |
| 3000 | 1.112 |
| 3500 | 1.1027 |
| 4000 | 1.0941 |
| 4500 | 1.0879 |
| 5000 | 1.078 |




Model saved to ./model_original

=== Training Pruned Dataset Model ===


Starting training with 45938 examples...


| Step | Training Loss |
|------|---------------|
| 500  | 1.9679 |
| 1000 | 1.1118 |




Model saved to ./model_pruned

✅ Both models trained successfully!
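To compare the two models, each one's generated summaries can be scored against the reference highlights with ROUGE-L. The sketch below shows what ROUGE-L F1 measures, using the longest common subsequence over lowercased whitespace tokens. It is a simplified stand-in (no stemming or special tokenization), not the `rouge_score` implementation, which is the better choice for reported numbers. The function names and example texts are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, prediction):
    """Simplified ROUGE-L F1 between a reference and a predicted summary."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    if not ref_tokens or not pred_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, pred_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference and two candidate summaries.
reference = "the actor says he will not waste his money"
candidate_a = "the actor says he will save his money"
candidate_b = "a film premiere took place in london"

print(rouge_l_f1(reference, candidate_a))
print(rouge_l_f1(reference, candidate_b))
```

For a fair comparison, both models would be scored on the same held-out examples, with the per-example scores averaged.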


## Summary

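This notebook demonstrated how to use **PerplexityScorer** from dPrune to prune a summarization dataset with perplexity-based scoring, and how to train summarization models on the original and pruned data so that their ROUGE-L scores can be compared.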