# dPrune: Unsupervised Dataset Pruning Example with K-Means Clustering

This notebook demonstrates how to use the **KMeansCentroidDistanceScorer** in `dPrune`. This is a unsupervised pruning method that scores examples based on their distance to cluster centroids in embedding space. Variants of this method appear in the following papers:
1.  [Beyond neural scaling laws:
beating power law scaling via data pruning](https://arxiv.org/pdf/2206.14486)
2. [Self-Supervised Dataset Pruning for Efficient Training in Audio Anti-spoofing](https://www.isca-archive.org/interspeech_2023/azeemi23_interspeech.pdf)

## Key Concepts:
- **Unsupervised**: No external labels required for scoring. Some papers refer to this as a *self-supervised* pruning method as well.
- **Embedding-based**: Uses transformer model embeddings
- **Clustering**: Groups similar examples and measures distances to centroids
- **Distance scoring**: Examples closer to centroids get lower scores (more representative)


## 1. Setup and Installation


In [None]:
# Install required packages if needed
# !pip install -e .[test]
!pip install transformers torch scikit-learn tqdm accelerate

!pip install -U datasets huggingface_hub fsspec

In [3]:
import torch
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datasets import Dataset, load_dataset
from transformers import AutoTokenizer, AutoModel
from sklearn.decomposition import PCA
import random

from dprune.scorers.unsupervised import KMeansCentroidDistanceScorer, _get_embeddings
from dprune.pruners.selection import TopKPruner, BottomKPruner, StratifiedPruner
from dprune.pipeline import PruningPipeline


## 2. Load the Dataset

We'll load the emotion dataset for classification


In [11]:
# Load the TREC dataset from Hugging Face
from datasets import load_dataset
import random

dataset = load_dataset("dair-ai/emotion", split="train")

label_names = dataset.features['label'].names
print(f"Fine label categories: {label_names}")


def add_category_name(example):
    example['category'] = label_names[example['label']]
    return example

raw_dataset = dataset.map(add_category_name)

print(f"Dataset loaded with {len(raw_dataset)} examples (sampled from {len(dataset)} total)")

# Count examples per category
category_counts = {}
for cat in raw_dataset['category']:
    category_counts[cat] = category_counts.get(cat, 0) + 1

print(f"\nExamples per category:")
for category, count in category_counts.items():
    print(f"  {category}: {count}")

print("\nSample texts from each category:")
seen_categories = set()
for i, example in enumerate(raw_dataset):
    if example['category'] not in seen_categories:
        print(f"{example['category']}: '{example['text']}'")
        seen_categories.add(example['category'])
        if len(seen_categories) >= 6:  # Show all 6 categories
            break


Fine label categories: ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']


Map:   0%|          | 0/16000 [00:00<?, ? examples/s]

Dataset loaded with 16000 examples (sampled from 16000 total)

Examples per category:
  sadness: 4666
  anger: 2159
  love: 1304
  surprise: 572
  fear: 1937
  joy: 5362

Sample texts from each category:
sadness: 'i didnt feel humiliated'
anger: 'im grabbing a minute to post i feel greedy wrong'
love: 'i am ever feeling nostalgic about the fireplace i will know that it is still on the property'
surprise: 'ive been taking or milligrams or times recommended amount and ive fallen asleep a lot faster but i also feel like so funny'
fear: 'i feel as confused about life as a teenager or as jaded as a year old man'
joy: 'i have been with petronas for years i feel that petronas has performed well and made a huge profit'


## 3. Setup Model and Extract Embeddings

We'll use a pre-trained transformer model for getting the embeddings.


In [16]:
# Load a pre-trained model for embeddings
model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

## 4. Apply K-Means Clustering Scorer

Now we'll use the `KMeansCentroidDistanceScorer` to score our examples based on their distance to cluster centroids.


In [19]:
# Create the K-means scorer
# We'll use 6 clusters since we have 6 emotion categories
kmeans_scorer = KMeansCentroidDistanceScorer(
    model=model,
    tokenizer=tokenizer,
    text_column="text",
    num_clusters=6)

# Score the dataset
scored_dataset = kmeans_scorer.score(dataset)

print("Dataset scored with K-means centroid distances!")
print(f"Scored dataset columns: {scored_dataset.column_names}")

# Examine the scores
scores = scored_dataset['score']
print(f"\nScore statistics:")
print(f"  Min score: {min(scores):.3f}")
print(f"  Max score: {max(scores):.3f}")
print(f"  Mean score: {np.mean(scores):.3f}")
print(f"  Std score: {np.std(scores):.3f}")

print("\nFirst few examples with scores:")
for i in range(5):
    print(f"  Score: {scores[i]:.3f}, Category: {scored_dataset['label'][i]}, Text: '{scored_dataset['text'][i][:60]}...'")


Extracting embeddings:   0%|          | 0/2000 [00:00<?, ?it/s]

Dataset scored with K-means centroid distances!
Scored dataset columns: ['text', 'label', 'score']

Score statistics:
  Min score: 1.402
  Max score: 7.795
  Mean score: 2.837
  Std score: 0.639

First few examples with scores:
  Score: 2.449, Category: 0, Text: 'i didnt feel humiliated...'
  Score: 2.470, Category: 0, Text: 'i can go from feeling so hopeless to so damned hopeful just ...'
  Score: 2.199, Category: 3, Text: 'im grabbing a minute to post i feel greedy wrong...'
  Score: 2.502, Category: 2, Text: 'i am ever feeling nostalgic about the fireplace i will know ...'
  Score: 1.792, Category: 3, Text: 'i am feeling grouchy...'


## 5. Different Pruning Strategies

Let's explore different pruning strategies using the clustering-based scores.


In [48]:
# Strategy 1: Keep examples closest to centroids (most representative)
bottom_pruner = BottomKPruner(k=0.3)  # Keep bottom 0% (lowest distances)
pipeline_representative = PruningPipeline(scorer=kmeans_scorer, pruner=bottom_pruner)
representative_examples = pipeline_representative.run(raw_dataset)

# Strategy 2: Keep examples farthest from centroids (most diverse/outliers)
top_pruner = TopKPruner(k=0.3)  # Keep top 30% (highest distances)
pipeline_diverse = PruningPipeline(scorer=kmeans_scorer, pruner=top_pruner)
diverse_examples = pipeline_diverse.run(raw_dataset)

# Strategy 3: Stratified sampling across score ranges
stratified_pruner = StratifiedPruner(k=0.3, num_strata=4)
pipeline_stratified = PruningPipeline(scorer=kmeans_scorer, pruner=stratified_pruner)
stratified_examples = pipeline_stratified.run(raw_dataset)

print("Pruning Results:")
print(f"Original dataset: {len(scored_dataset)} examples")
print(f"Representative examples (closest to centroids): {len(representative_examples)} examples")
print(f"Diverse examples (farthest from centroids): {len(diverse_examples)} examples")
print(f"Stratified examples (balanced across score ranges): {len(stratified_examples)} examples")


Extracting embeddings:   0%|          | 0/2000 [00:00<?, ?it/s]

Extracting embeddings:   0%|          | 0/2000 [00:00<?, ?it/s]

Extracting embeddings:   0%|          | 0/2000 [00:00<?, ?it/s]

Pruning Results:
Original dataset: 16000 examples
Representative examples (closest to centroids): 4800 examples
Diverse examples (farthest from centroids): 4800 examples
Stratified examples (balanced across score ranges): 4800 examples


## 6. Analysis of Selected Examples

Let's examine specific examples from each pruning strategy to understand what types of content they select.


In [50]:
print("=== REPRESENTATIVE EXAMPLES (Closest to Centroids) ===")
print("These are the most 'typical' examples:")
rep_scores = representative_examples['score']
rep_text = [text for text in representative_examples['text']]

for entry in rep_text[:5]:
    print(f"  Text: '{entry}'")

print("=== DIVERSE EXAMPLES (Farthest from Centroids) ===")
print("These are the most 'unusual' or outlier examples:")
div_text = [text for text in diverse_examples['text']]

for entry in div_text[:5]:
    print(f"  Text: '{entry}'")

=== REPRESENTATIVE EXAMPLES (Closest to Centroids) ===
These are the most 'typical' examples:
  Text: 'i feel complacent about it all'
  Text: 'i actually feel sorrowful'
  Text: 'i feel helpless about it'
  Text: 'i feel absolutely defeated socially'
  Text: 'i am feeling a little rejected by my sister'
=== DIVERSE EXAMPLES (Farthest from Centroids) ===
These are the most 'unusual' or outlier examples:
  Text: 'i waited to hold my precious boy in my arms no i did not get to feel his sweet skin against mine after his birth no i could not rub his soft hair or look into his beautiful eyes but god had a plan'
  Text: 'for the loss of a close friend or relative'
  Text: 'when in a car accident where car was total wipe off wipe out'
  Text: 'im excited to get home and spend time with everyone please feel free to email call or text and let me know if youre available for dinner or coffee or anything'
  Text: 'i took away all the disappointed feeling all the paining i gave my heart to be heal 

## 7. Training based on pruned examples

In [51]:
from transformers import TrainingArguments, Trainer, AutoModelForSequenceClassification

model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)

def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)

tokenized_dataset = representative_examples.map(tokenize_function, batched=True)

training_args = TrainingArguments(
    output_dir='./training_results',
    num_train_epochs=20,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    logging_steps=100,
    save_strategy="no",
    report_to="none"
)

eval_dataset = load_dataset("dair-ai/emotion", split="validation")
tokenized_eval_dataset = eval_dataset.map(tokenize_function, batched=True)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_eval_dataset
)

print("Starting training...")

trainer.train()


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/4800 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Starting training...


Step,Training Loss
100,1.2373
200,0.5374
300,0.2541
400,0.145
500,0.1203
600,0.0934
700,0.0722
800,0.0638
900,0.0568
1000,0.0394


TrainOutput(global_step=3000, training_loss=0.09826534907023111, metrics={'train_runtime': 252.5091, 'train_samples_per_second': 380.184, 'train_steps_per_second': 11.881, 'total_flos': 3179444355072000.0, 'train_loss': 0.09826534907023111, 'epoch': 20.0})

In [None]:
!pip install evaluate

In [52]:
import evaluate
f1_metric = evaluate.load("f1")
predictions = trainer.predict(tokenized_eval_dataset)
predicted_labels = np.argmax(predictions.predictions, axis=1)
results = f1_metric.compute(references=eval_dataset['label'], predictions=predicted_labels, average="macro")
results

{'f1': 0.8837758063388028}

## Comparison with training on the complete dataset

In [46]:
tokenized_dataset = raw_dataset.map(tokenize_function, batched=True)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    eval_dataset=tokenized_eval_dataset
)

print("Starting training...")

trainer.train()

Map:   0%|          | 0/16000 [00:00<?, ? examples/s]

Starting training...


Step,Training Loss
100,0.3898
200,0.2627
300,0.2502
400,0.2136
500,0.2035
600,0.1416
700,0.1365
800,0.1305
900,0.1378
1000,0.1449


TrainOutput(global_step=10000, training_loss=0.044204889214038846, metrics={'train_runtime': 837.4017, 'train_samples_per_second': 382.134, 'train_steps_per_second': 11.942, 'total_flos': 1.059814785024e+16, 'train_loss': 0.044204889214038846, 'epoch': 20.0})

In [47]:
import evaluate
f1_metric = evaluate.load("f1")
predictions = trainer.predict(tokenized_eval_dataset)
predicted_labels = np.argmax(predictions.predictions, axis=1)
results = f1_metric.compute(references=eval_dataset['label'], predictions=predicted_labels, average="macro")
results

{'f1': 0.9107237376230696}

## 8. Conclusion

We were able to achieve 88% F1-score with 30% of the dataset in 4m 12s, compared to 91% F1-score on the complete dataset in 13m 57s.