# DistilBERT for Text Classification

Objective: Fine-tune a pre-trained DistilBERT transformer for text classification on the AG News dataset and evaluate its performance against the traditional and CNN-based approaches (from 'traditional_vs_cnn_text_classification.ipynb'). The goal is to assess the benefit of large pre-trained language models for text classification.

## 1. Importing Libraries

This cell imports all libraries necessary for data pre-processing, model setup, training, and evaluation.

In [3]:
# Checking for the libraries that don't come pre-installed with Python or Anaconda, and installing them if needed
try:
    import torch
except ImportError:
    !pip install torch

try:
    import transformers
except ImportError:
    !pip install transformers

try:
    import datasets
except ImportError:
    !pip install datasets

# Checking that the rest of the necessary libraries import properly 
import pandas as pd
import re
import torch
import numpy as np

from transformers import AutoTokenizer
from datasets import Dataset
from sklearn.metrics import accuracy_score
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments, EarlyStoppingCallback

## 2. Data Preprocessing and Cleaning

The AG News dataset is loaded from the provided train.csv and test.csv files. The target labels are encoded numerically, and each article's title and description are concatenated to form a single input sequence. The text is lowercased and punctuation/special characters are removed to reduce noise and standardise inputs for DistilBERT tokenisation.

In [5]:
# Importing training and testing datasets
import pandas as pd

train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

# Data pre-processing

# Label encoding the target 'Class Index' variable
from sklearn import preprocessing
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_data['Class Index'])
test_y = encoder.transform(test_data['Class Index'])

# Combining the Title and Description as the text input for each instance
train_x = train_data['Title'] + ' ' + train_data['Description']
test_x = test_data['Title'] + ' ' + test_data['Description']
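
To make the label-encoding step concrete, here is a dependency-free sketch of what `LabelEncoder` does with integer class indices: it maps the sorted unique values to 0..n-1. The helper name `encode_labels` and the sample labels are illustrative assumptions, not part of the notebook, and the AG News class indices are assumed to run from 1 to 4.

```python
def encode_labels(labels):
    """Map each distinct label to its rank among the sorted unique labels."""
    classes = sorted(set(labels))
    index_of = {label: i for i, label in enumerate(classes)}
    return [index_of[label] for label in labels], classes

# Hypothetical 'Class Index' values, assumed to lie in 1..4
sample_class_index = [3, 1, 4, 2, 3]
encoded, classes = encode_labels(sample_class_index)

print(encoded)  # [2, 0, 3, 1, 2]
print(classes)  # [1, 2, 3, 4]
```

Under that assumption the encoded labels run from 0 to 3, which is the zero-based range a 4-label classification head expects.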

In [6]:
import re # used for removing certain characters from the dataset

# Data preprocessing / cleaning

# Splitting the training set into a smaller training set and a validation set
from sklearn.model_selection import train_test_split
train_x, val_x, train_y, val_y = train_test_split(train_x, train_y, test_size=0.25, random_state=42, shuffle=True) # test_size=0.25 is default setting

# Cleaning up data by making it all lowercase and removing certain characters
def text_cleaner(og_text):
  clean_text = og_text.lower() # Converting to lowercase for consistency
  clean_text = re.sub(r'[^a-z0-9\s]', '', clean_text) # eliminates punctuation and special characters
  return clean_text

clean_train_x = train_x.apply(text_cleaner)
clean_val_x = val_x.apply(text_cleaner)
clean_test_x = test_x.apply(text_cleaner) # Preprocessing also applies to test data
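
As a quick check of what the cleaning step does to a single string, the sketch below reuses the same `text_cleaner` logic on two made-up headlines. The sample strings are assumptions for illustration and are not taken from the dataset files.

```python
import re

def text_cleaner(og_text):
    clean_text = og_text.lower()  # lowercase for consistency
    clean_text = re.sub(r'[^a-z0-9\s]', '', clean_text)  # drop punctuation and special characters
    return clean_text

# Hypothetical inputs, used only to show the effect of the regex
headline = "Wall St. Bears Claw Back Into the Black (Reuters)"
figures = "AT&T's Q3: $5.2bn"

cleaned_headline = text_cleaner(headline)
cleaned_figures = text_cleaner(figures)

print(cleaned_headline)  # wall st bears claw back into the black reuters
print(cleaned_figures)   # atts q3 52bn
```

The second example shows a side effect worth keeping in mind: because characters are deleted rather than replaced with spaces, strings such as "5.2bn" collapse into "52bn".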

## 3. Tokenisation with pre-trained DistilBERT

The pre-trained DistilBERT tokenizer is loaded to convert textual input into token IDs suitable for the model.
The cleaned training, validation, and test sets are first stored in Pandas DataFrames, then converted to HuggingFace 'Dataset' objects.
Finally, each dataset is tokenised with truncation and padding to a maximum sequence length of 128 tokens, ensuring consistent input size for the model.

In [8]:
# Loading the pre-trained DistilBERT tokenizer and tokenizing the data

import torch
import numpy as np
np.random.seed(42) # for consistency

from transformers import AutoTokenizer
from datasets import Dataset
from sklearn.metrics import accuracy_score

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased', clean_up_tokenization_spaces=True)

# Storing the cleaned data in Pandas DataFrames
train_df = pd.DataFrame({'clean_text': clean_train_x, 'label': train_y})
val_df = pd.DataFrame({'clean_text': clean_val_x, 'label': val_y})
test_df = pd.DataFrame({'clean_text': clean_test_x, 'label': test_y})

# Converting the DataFrames to a HuggingFace Dataset
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)

def data_tokenizer(data):
  return tokenizer(data['clean_text'], truncation=True, padding="max_length", max_length=128)

tokenized_train = train_dataset.map(data_tokenizer, batched=True)
tokenized_val = val_dataset.map(data_tokenizer, batched=True)
tokenized_test = test_dataset.map(data_tokenizer, batched=True)

Output (progress bars omitted): the three `map` calls process 90,000 training, 30,000 validation, and 7,600 test examples.
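
The effect of `truncation=True` with `padding="max_length"` can be sketched without loading the tokenizer. The helper `pad_or_truncate`, the pad id of 0, and the token ids below are hypothetical placeholders; the real tokenizer also preserves its special tokens when truncating, which this simplified version does not attempt.

```python
PAD_ID = 0  # assumed pad token id, for illustration only

def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]          # cut sequences that are too long
    attention_mask = [1] * len(ids)       # 1 marks real tokens
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad  # 0 marks padding

# Placeholder ids; a max_length of 6 stands in for the notebook's 128
short_ids, short_mask = pad_or_truncate([11, 12, 13, 14], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6)

print(short_ids, short_mask)  # [11, 12, 13, 14, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4, 5, 6] [1, 1, 1, 1, 1, 1]
```

Every example ends up the same length, and the attention mask tells the model which positions are padding and should be ignored.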

## 4. Setting up and Training the DistilBERT Model
The pre-trained DistilBERT model is loaded for sequence classification with 4 output labels, one per AG News class. TrainingArguments are configured for up to 10 epochs, a learning rate of 2e-5, and a batch size of 32, with early stopping if the validation loss does not improve for 3 consecutive epochs.
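
For a rough sense of scale, the number of optimiser steps implied by these settings can be worked out directly. This is a back-of-the-envelope sketch that assumes the 90,000 training examples reported by the tokenisation step, a single device, no gradient accumulation, and that the final partial batch is kept.

```python
import math

n_train = 90_000   # training examples after the train/validation split
batch_size = 32    # per-device batch size; one device assumed
max_epochs = 10    # upper bound; early stopping may end training sooner

steps_per_epoch = math.ceil(n_train / batch_size)
max_steps = steps_per_epoch * max_epochs

print(steps_per_epoch)  # 2813
print(max_steps)        # 28130
```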

In [10]:
# Setting up a Trainer with the pretrained DistilBERT model

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments, EarlyStoppingCallback

n_classes = 4 # AG dataset has 4 class labels
lr = 2e-5 # standard learning rate for fine-tuning DistilBERT
batch_size = 32 # fits comfortably on GPU memory
train_epochs = 10 # sufficient to see convergence on validation set

# Loading the pre-trained DistilBERT model to use for sequence classification
distilbert_model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=n_classes)

# Defining the arguments for the Trainer
training_args = TrainingArguments(
    output_dir='./results', # saves the model's checkpoints, logs, and final results
    learning_rate=lr,
    per_device_train_batch_size=batch_size,
    num_train_epochs=train_epochs,
    weight_decay=0.01,
    eval_strategy='epoch', # evaluates the model on the validation set every epoch
    per_device_eval_batch_size=batch_size,
    save_strategy='epoch', # saves the model's checkpoints every epoch
    load_best_model_at_end=True, # reloads the model checkpoint with the best validation metrics
    report_to='none'
)

def compute_accuracy(eval_pred):
  logits, labels = eval_pred
  predictions = np.argmax(logits, axis=-1)
  accuracy = accuracy_score(labels, predictions)
  return {'accuracy': accuracy}

# Trainer automatically uses a GPU (e.g. via CUDA) if one is available; otherwise it runs on the CPU
trainer = Trainer(
    model=distilbert_model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    processing_class=tokenizer,
    compute_metrics=compute_accuracy,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)] # Stops training if validation loss doesn't improve for 3 epochs; prevents overfitting
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
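
The metric computed by `compute_accuracy` is an argmax over the four logits per example followed by a comparison with the true labels. The dependency-free sketch below mirrors that logic on made-up values; the helper name `accuracy_from_logits` and the toy logits and labels are assumptions for illustration.

```python
def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare with the labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return predictions, correct / len(labels)

# Toy logits for four examples over four classes (hypothetical values)
toy_logits = [
    [2.1, -0.3, 0.4, -1.0],
    [-0.5, 3.2, 0.1, 0.0],
    [0.2, 0.1, -0.4, 1.7],
    [1.5, 0.3, 1.2, -0.8],
]
toy_labels = [0, 1, 3, 2]

predictions, accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(predictions)  # [0, 1, 3, 0]
print(accuracy)     # 0.75
```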


In [11]:
# Training and evaluating on the AG dataset
# Note: For the full training dataset, this step takes about 15min per epoch

# Fine-tuning on the training set; the validation set is evaluated after every epoch
trainer.train()

# Final evaluation on the unseen test data
test_metrics = trainer.evaluate(eval_dataset=tokenized_test)
print("Test loss:", test_metrics['eval_loss'])
print("Test accuracy:", test_metrics['eval_accuracy'])

| Epoch | Training Loss | Validation Loss | Validation Accuracy |
|---|---|---|---|
| 1 | 0.2215 | 0.199448 | 0.931933 |
| 2 | 0.1608 | 0.192555 | 0.938067 |
| 3 | 0.113 | 0.208787 | 0.9393 |
| 4 | 0.0813 | 0.253623 | 0.940267 |
| 5 | 0.0507 | 0.290175 | 0.938767 |


Test loss: 0.2036733478307724
Test accuracy: 0.9353947368421053
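
The logged validation losses show why training ended after five of the ten configured epochs. The sketch below is a simplified model of patience-based early stopping, not the library's implementation; the function name `simulate_early_stopping` is hypothetical, and it assumes the monitored metric is validation loss.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (epoch at which training stops, epoch with the lowest loss)."""
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Validation losses copied from the training log, epochs 1 to 5
val_losses = [0.199448, 0.192555, 0.208787, 0.253623, 0.290175]
stopped_epoch, best_epoch = simulate_early_stopping(val_losses, patience=3)

print(stopped_epoch, best_epoch)  # 5 2
```

Validation accuracy peaks at epoch 4, while validation loss is lowest at epoch 2, so the choice of monitored metric determines which checkpoint counts as best.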


## 5. Final Results
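The fine-tuned DistilBERT model achieves 93.5% test accuracy on the AG News dataset. Early stopping ended training after epoch 5, three epochs after the validation loss reached its minimum at epoch 2, so with `load_best_model_at_end=True` the test metrics come from the reloaded epoch-2 checkpoint. Comparison to traditional and CNN-based approaches is presented in final_results.md.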