# Newsgroup Document Classification with DistilBERT

Newsgroup documents play an important role in various aspects of information management and decision-making processes.

In this mini-project, I will explore how transformers (DistilBERT) can be used to predict newsgroups (the classes) from the text content of documents.

Transformer models can also be used in other predictive modeling applications related to information retrieval and content recommendation.

In addition, I will explore the impact of stopwords & metadata removal on the performance of the classifier.

**Summary of Contents:**

1. Setup & Data Preparation
2. Data Pre-processing: Tokenization and Data Loader Construction
3. Initialise Pre-trained DistilBERT Model & Transfer Learning
4. Model Training & Evaluation
5. Findings & Conclusion


Note: This project was originally run on Google Colab.

# 1. Setup & Data Preparation

In [None]:
# install PyTorch Lightning and Hugging Face Transformers
! pip install --quiet lightning
! pip install --quiet transformers

Update the path to where the data file is uploaded on Google Drive.

In [None]:
# For Google Colaboratory
import sys, os
if 'google.colab' in sys.modules:
    # mount google drive
    from google.colab import drive
    drive.mount('/content/gdrive')
    path_to_file = '/content/gdrive/My Drive/DistilBertClassification'

    print(path_to_file)
    # move to Google Drive directory
    os.chdir(path_to_file)
    !pwd

In [None]:
# importing packages
from os import listdir
from os.path import join
from sklearn.model_selection import train_test_split
import string
from torch.utils.data import Dataset, DataLoader
import torch
import torchmetrics
import torch.nn.functional as F
import pytorch_lightning as pl
from transformers import AutoModelForSequenceClassification
from pytorch_lightning.loggers import TensorBoardLogger, WandbLogger
from pytorch_lightning.callbacks import ModelCheckpoint

## Data preparation

In this mini-project, a subset of the newsgroup dataset is utilised.

Newsgroups utilised: 'rec.sport.baseball', 'comp.graphics', 'sci.space', 'talk.religion.misc'

In [None]:
# each text document belongs to a group which denotes the class label
# all documents belonging to the same class are in one folder
folders = ['rec.sport.baseball', 'comp.graphics', 'sci.space', 'talk.religion.misc']

# Define a helper function to gather the pathnames of documents in the dataset, and the label is the name of the folder
def get_data(folders):
    my_path = '20_newsgroups'
    files = []
    # Extract names of files in each folder
    for folder_name in folders:
        folder_path = join(my_path, folder_name)
        files.append([f for f in listdir(folder_path) if not f.startswith('.')])

    pathname_list, Y = [], []
    for fo in range(len(folders)):     # each folder
        for fi in files[fo]:           # each file in folder
            pathname_list.append(join(my_path, join(folders[fo], fi)))
            Y.append(folders[fo])

    return pathname_list, Y

# Get list of pathnames & labels
pathname_list, Y = get_data(folders)

Divide the dataset into training, validation, and testing sets using a split ratio of 60-20-20.

In [None]:
# Divide the dataset into training, validation, and testing sets using a split ratio of 60-20-20.
X_train_path, X_temp_path, y_train, y_temp = train_test_split(pathname_list, Y, train_size=0.6, random_state=88)
X_val_path, X_test_path, y_val, y_test = train_test_split(X_temp_path, y_temp, test_size=0.5, random_state=88)
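
The two-stage split works because the second call halves the 40% left over from the first call. The arithmetic can be sketched with the standard library alone; `split_60_20_20` is a hypothetical helper written for this illustration, not part of scikit-learn, and it will not reproduce the exact shuffling of `train_test_split`.

```python
import random

def split_60_20_20(items, seed=88):
    """Hypothetical helper: shuffle, then cut into 60% / 20% / 20%."""
    items = list(items)
    random.Random(seed).shuffle(items)

    # Stage 1: keep 60% for training, hold back the remaining 40%
    n_train = len(items) * 6 // 10
    train, held_back = items[:n_train], items[n_train:]

    # Stage 2: cut the held-back 40% in half, giving 20% + 20% of the total
    n_val = len(held_back) // 2
    val, test = held_back[:n_val], held_back[n_val:]
    return train, val, test

train, val, test = split_60_20_20(range(100))
print(len(train), len(val), len(test))
```
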

# 2. Data Pre-processing: Tokenization and Data Loader Construction

## 2.1 Removing Stopwords & Pre-processing the Text Data (X)

The documents contain many uninformative words and symbols. For each document, the `preprocess` function defined below strips the metadata header (via `remove_metadata`) and filters out punctuation, quotation marks, tabs, short words, and stop words (via `preprocess_sub`).

In [None]:
# Define list of stopwords
stopwords = ['a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and', 'any', 'are', "aren't", 'as', 'at',
 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by',
 'can', "can't", 'cannot', 'could', "couldn't", 'did', "didn't", 'do', 'does', "doesn't", 'doing', "don't", 'down', 'during',
 'each', 'few', 'for', 'from', 'further',
 'had', "hadn't", 'has', "hasn't", 'have', "haven't", 'having', 'he', "he'd", "he'll", "he's", 'her', 'here', "here's",
 'hers', 'herself', 'him', 'himself', 'his', 'how', "how's",
 'i', "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is', "isn't", 'it', "it's", 'its', 'itself',
 "let's", 'me', 'more', 'most', "mustn't", 'my', 'myself',
 'no', 'nor', 'not', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'ought', 'our', 'ours', 'ourselves', 'out', 'over', 'own',
 'same', "shan't", 'she', "she'd", "she'll", "she's", 'should', "shouldn't", 'so', 'some', 'such',
 'than', 'that',"that's", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', "there's", 'these', 'they', "they'd",
 "they'll", "they're", "they've", 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 'very',
 'was', "wasn't", 'we', "we'd", "we'll", "we're", "we've", 'were', "weren't", 'what', "what's", 'when', "when's", 'where',
 "where's", 'which', 'while', 'who', "who's", 'whom', 'why', "why's",'will', 'with', "won't", 'would', "wouldn't",
 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves',
 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'ten', 'hundred', 'thousand', '1st', '2nd', '3rd',
 '4th', '5th', '6th', '7th', '8th', '9th', '10th']

# Helper function to remove stopwords
def preprocess_sub(words):
    #filter out some unnecessary data like tabs
    table = str.maketrans('', '', '\t')
    words = [word.translate(table) for word in words]

    # the character: ' is removed from the list of symbols that are to be discarded from the documents
    punctuations = (string.punctuation).replace("'", "")
    trans_table = str.maketrans('', '', punctuations)
    stripped_words = [word.translate(trans_table) for word in words]

    # remove empty strings left over after stripping punctuation
    words = [word for word in stripped_words if word]

    # unquote quoted words
    p_words = []
    for word in words:
        if word[0] == "'" and word[-1] == "'":
            # strip both the leading and the trailing quote
            word = word[1:-1]
        elif word[0] == "'":
            # strip a leading quote only
            word = word[1:]
        p_words.append(word)

    words = p_words.copy()

    # remove empty strings left over after unquoting
    words = [word for word in words if word]

    # remove just-numeric strings as they do not have any significant meaning in text classification
    words = [word for word in words if not word.isdigit()]

    # remove words with 2 or fewer characters
    words = [word for word in words if len(word) > 2]

    # normalize the cases of our words
    words = [word.lower() for word in words]

    # remove stop words, which do not have any significant meaning in text classification
    words = [word for word in words if word not in stopwords]

    return words

# Helper function to remove metadata
def remove_metadata(lines):
    # The header ends at the first blank line; if there is none, keep every line
    start = 0
    for i in range(len(lines)):
        if lines[i] == '\n':
            start = i + 1
            break
    return lines[start:]

# Consolidated function to preprocess a document, including removing stopwords and metadata
def preprocess(doc):
    list_of_words = []
    for path in doc:
        # read all lines, closing the file as soon as they are loaded
        with open(path, 'r', encoding='utf-8') as f:
            text_lines = f.readlines()

        # remove the meta-data at the top of each document
        text_lines = remove_metadata(text_lines)

        doc_words = []
        for line in text_lines:
            # strip the trailing newline and surrounding whitespace, then split on spaces
            words = line.strip().split(" ")
            words = preprocess_sub(words)
            doc_words.extend(words)
        list_of_words.append(' '.join(doc_words))

    return list_of_words


In [None]:
# Convert from file pathname to one-line strings and remove stopwords / metadata using the "preprocess" function
X_train_cleaned = preprocess(X_train_path)
X_val_cleaned = preprocess(X_val_path)
X_test_cleaned = preprocess(X_test_path)
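
To make the effect of the cleaning steps in Section 2.1 concrete, here is a simplified, self-contained sketch applied to a single line. `clean_line` and `DEMO_STOPWORDS` are hypothetical names introduced for this illustration; the stopword set is deliberately tiny, and the logic condenses the notebook's steps rather than reproducing them exactly.

```python
import string

# Tiny illustrative stopword set (an assumption; the notebook uses a much longer list)
DEMO_STOPWORDS = {"the", "is", "a", "of", "and", "to", "in"}

def clean_line(line, stopwords=DEMO_STOPWORDS):
    """Hypothetical, simplified version of the per-line cleaning steps."""
    # Drop all punctuation except apostrophes
    punct = string.punctuation.replace("'", "")
    line = line.translate(str.maketrans("", "", punct))

    words = []
    for word in line.split():
        word = word.strip("'").lower()         # unquote and normalise case
        if word.isdigit() or len(word) <= 2:   # drop pure numbers and very short words
            continue
        if word in stopwords:                  # drop stopwords
            continue
        words.append(word)
    return words

sample = "The shuttle is scheduled to launch in 1993, weather permitting!"
cleaned = clean_line(sample)
print(cleaned)
# ['shuttle', 'scheduled', 'launch', 'weather', 'permitting']
```
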

## 2.2 Encode the Labels (Y)

Subsequently, the document labels (y) are encoded to numerical values.

In [None]:
# Encode Labels (Y) to numerical values
from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Fit label encoder on y_train
encoded_labels = label_encoder.fit(y_train)

# Transform labels to numerical values
y_train_encoded = encoded_labels.transform(y_train)
y_val_encoded = encoded_labels.transform(y_val)
y_test_encoded = encoded_labels.transform(y_test)

The labels are encoded as such:
* 0: comp.graphics
* 1: rec.sport.baseball
* 2: sci.space
* 3: talk.religion.misc
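
The mapping above follows from `LabelEncoder` assigning integer codes to the class names in sorted order. The plain-Python sketch below mirrors that behaviour for illustration only; `label_to_id` is a hypothetical name and the short `labels` list is made-up sample input.

```python
# Made-up sample of labels, in arbitrary order
labels = ['rec.sport.baseball', 'comp.graphics', 'sci.space',
          'talk.religion.misc', 'comp.graphics']

# Sorted unique class names determine the integer codes
classes = sorted(set(labels))
label_to_id = {name: idx for idx, name in enumerate(classes)}

encoded = [label_to_id[name] for name in labels]

print(label_to_id)
print(encoded)
```
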



## 2.3 Tokenise using Pre-trained DistilBERT & Construct DataLoader

Use the pre-trained "distilbert-base-uncased" tokenizer to tokenize the dataset.

Note: DistilBERT (`distilbert-base-uncased`) accepts a maximum input length of 512 tokens. Hence, longer sequences are truncated and shorter sequences are padded to a length of 512.
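
As a rough illustration of what padding and truncation to a fixed length mean, the sketch below operates on a plain list of token ids. `pad_or_truncate` is a hypothetical helper, not part of the Transformers library, and it is simplified: the real tokenizer also adds special tokens and keeps the closing one when it truncates. The padding id of 0 is taken from `padding_idx=0` in the model printout.

```python
def pad_or_truncate(token_ids, max_length=512, pad_id=0):
    """Hypothetical helper: force a list of token ids to exactly max_length."""
    # Truncate anything beyond max_length
    ids = list(token_ids[:max_length])

    # Real tokens get attention 1, padding gets attention 0
    attention_mask = [1] * len(ids)
    n_pad = max_length - len(ids)

    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad

# Short input is padded, long input is truncated (max_length=6 for readability)
short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```
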

In [None]:
from transformers import AutoTokenizer

class TextDataset(Dataset):
    def __init__(self, X, y, tokenizer):
        self.X = X
        self.y = y
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        # Tokenize text
        text = self.X[idx]

        # Pad or truncate to 512 tokens, the maximum input length of "distilbert-base-uncased"
        # (tokenizer vocabulary size: 30522)
        tokenized_text = self.tokenizer(text, truncation=True, padding='max_length', max_length=512)
        # truncation=True: sequences longer than max_length (512 tokens) are truncated.
        # padding='max_length': shorter sequences are padded up to max_length with padding tokens,
        # because the model expects every input in a batch to have the same length.

        # Convert to torch tensors
        input_ids = torch.tensor(tokenized_text['input_ids'])
        attention_mask = torch.tensor(tokenized_text['attention_mask'])
        label = torch.tensor(self.y[idx])

        # Create dictionary
        data_dict = {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "label": label
        }

        return data_dict

# Creating tokenizer
# Use pre-trained "distilbert-base-uncased" tokenizer to tokenize the dataset
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")



Create a corresponding DataLoader for each set to facilitate the training and evaluation processes.

In [None]:
# Creating datasets
train_dataset_cleaned = TextDataset(X_train_cleaned, y_train_encoded, tokenizer)
val_dataset_cleaned = TextDataset(X_val_cleaned, y_val_encoded, tokenizer)
test_dataset_cleaned = TextDataset(X_test_cleaned, y_test_encoded, tokenizer)

# Create corresponding dataloaders for each set to facilitate training and evaluation processes.
# Set a batch size of 10

# Create DataLoaders for each set
train_loader_cleaned = DataLoader(
    dataset=train_dataset_cleaned,
    batch_size=10,
    shuffle=True,
    num_workers=2
)

val_loader_cleaned = DataLoader(
    dataset=val_dataset_cleaned,
    batch_size=10,
    num_workers=2
)

test_loader_cleaned = DataLoader(
    dataset=test_dataset_cleaned,
    batch_size=10,
    num_workers=2
)

# 3. Initialise Pre-trained DistilBERT Model & Transfer Learning


Initialise the pretrained DistilBERT model and set the number of labels.

In [None]:
# Load the pretrained DistilBERT model via AutoModelForSequenceClassification and set the number of labels
pretrained_distilbert = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=4)   # 4 possible labels

pretrained_distilbert

# Note from the warning below: the weights of ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight'] are newly initialised rather than loaded from the checkpoint --> these 2 layers (pre_classifier & classifier) need to be trained

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']

You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

For transfer learning using DistilBERT, only the Linear classifier layers are unfrozen. The pretrained Transformer layers are kept frozen & applied for feature extraction.

Note: as the DistilBERT model is quite large, finetuning of the entire model will require a large amount of computational resources.

In [None]:
# Freeze all layers
for param in pretrained_distilbert.parameters():
    param.requires_grad = False

# Unfreeze only the pre_classifier & classifier layers
for param in pretrained_distilbert.pre_classifier.parameters():
    param.requires_grad = True

for param in pretrained_distilbert.classifier.parameters():
    param.requires_grad = True
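
As a sanity check on the freezing step, the number of trainable parameters can be worked out by hand and compared with the "593 K Trainable params" line in the Lightning model summary printed during training. The sketch assumes that `pre_classifier` is a 768-to-768 linear layer and `classifier` a 768-to-4 linear layer, each with a bias; the hidden size of 768 appears in the model printout, while the layer shapes are an assumption here because the printout is cut off before those layers.

```python
hidden_size = 768   # from the model printout
num_labels = 4      # set when loading the model

def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

# Assumed shapes of the two unfrozen layers
pre_classifier_params = linear_params(hidden_size, hidden_size)
classifier_params = linear_params(hidden_size, num_labels)

trainable_params = pre_classifier_params + classifier_params
print(pre_classifier_params, classifier_params, trainable_params)
# 590592 3076 593668
```

On the real model, the same count could be obtained with `sum(p.numel() for p in pretrained_distilbert.parameters() if p.requires_grad)`.
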

Next, define the PyTorch Lightning model, including:

- Implement the training, validation and testing procedures
- Define the loss function for the multi-class classification problem (CrossEntropyLoss), and use Accuracy for evaluation
- Log the loss and accuracy so that the training process can be monitored in TensorBoard

A low learning rate is used to prevent overfitting of the pre-trained DistilBERT model.

In [None]:
# Define Classification Lightning Model

class TextClassificationModel(pl.LightningModule):
    def __init__(self, model, learning_rate=5e-5):   # set a low learning rate to prevent overfitting of the pre-trained model / disrupting its pretrained-weights
        super().__init__()

        self.learning_rate = learning_rate
        self.model = model ## to use pretrained DistilBERT model

        # Classification Metrics
        self.acc_score = torchmetrics.Accuracy(task="multiclass", num_classes=4)    # 4 label classes

        # Define loss function
        self.ce_loss = torch.nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask, labels):
        # Forward pass through the pre-trained model
        return self.model(input_ids, attention_mask=attention_mask, labels=labels)

    def training_step(self, batch, batch_idx):
        # Gather outputs
        outputs = self.forward(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"], labels=batch["label"])
        logits = outputs["logits"]  #outputs.logits
        labels = batch["label"]  # actual label

        # Calculate cross-entropy loss
        loss = self.ce_loss(logits, labels)
        self.log("train_loss", loss, prog_bar=True)

        # Predict class
        predicted_labels = torch.argmax(logits, 1)
        # Calculate Accuracy
        accuracy = self.acc_score(predicted_labels, labels)
        self.log("train_acc", accuracy, prog_bar=True)

        return loss #outputs["loss"]  # this is passed to the optimizer for training

    def validation_step(self, batch, batch_idx):
        # Gather outputs
        outputs = self.forward(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"], labels=batch["label"])
        logits = outputs["logits"]  #outputs.logits
        labels = batch["label"]  # actual label

        # Calculate cross-entropy loss
        loss = self.ce_loss(logits, labels)
        self.log("val_loss", loss, prog_bar=True)

        # Predict class
        predicted_labels = torch.argmax(logits, 1)
        # Calculate Accuracy
        accuracy = self.acc_score(predicted_labels, labels)
        self.log("val_acc", accuracy, prog_bar=True)

    def test_step(self, batch, batch_idx):
        # Gather outputs
        outputs = self.forward(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"], labels=batch["label"])
        logits = outputs["logits"]  #outputs.logits
        labels = batch["label"]  # actual label

        # Predict class
        predicted_labels = torch.argmax(logits, 1)
        # Calculate Accuracy
        accuracy = self.acc_score(predicted_labels, labels)
        self.log("test_acc", accuracy, prog_bar=True)

        print(labels)
        print(predicted_labels)
        print(accuracy)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        return optimizer
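
For intuition about the two quantities logged in each step, the sketch below computes cross-entropy loss and argmax accuracy by hand for a made-up batch of two examples with four classes. It uses only the standard library; `cross_entropy` and `argmax` are hypothetical helpers written for this illustration and stand in for `torch.nn.CrossEntropyLoss` (with its default mean reduction) and `torch.argmax`.

```python
import math

def cross_entropy(logits, label):
    """Negative log-softmax probability of the true class for one example."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Made-up logits for a batch of 2 examples and 4 classes
batch_logits = [[2.0, 0.0, 0.0, 0.0],
                [0.1, 0.2, 3.0, 0.0]]
batch_labels = [0, 3]

# Mean loss over the batch
losses = [cross_entropy(l, y) for l, y in zip(batch_logits, batch_labels)]
mean_loss = sum(losses) / len(losses)

# Predicted class is the index of the largest logit
predictions = [argmax(l) for l in batch_logits]
accuracy = sum(p == y for p, y in zip(predictions, batch_labels)) / len(batch_labels)

print(predictions, accuracy, mean_loss)
```

The first example is classified correctly and contributes a small loss; the second is predicted as class 2 instead of 3, so it lowers the accuracy to 0.5 and dominates the mean loss.
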

# 4. Model Training & Evaluation

Initialise the classification model, define best model checkpoint & Trainer. The model will be trained over 10 epochs.

In [None]:
# Create the model using the "TextClassificationModel" class defined in Section 3
text_classification_model_cleaned = TextClassificationModel(pretrained_distilbert)

#specify logger
logger = TensorBoardLogger("distilbert/", name="finetuning_cleaned", version="v1")

# define model checkpoint callback
callbacks_cleaned = [
    ModelCheckpoint(
        save_top_k=1, mode="max", monitor="val_acc"
    )  # save top 1 model
]

#define trainer -- add callback
trainer_cleaned = pl.Trainer(
    max_epochs=10,
    callbacks=callbacks_cleaned,
    accelerator="gpu",
    devices=1,
    logger=logger,
    log_every_n_steps=1,
)

# train the model
trainer_cleaned.fit(model=text_classification_model_cleaned,
            train_dataloaders=train_loader_cleaned,
            val_dataloaders=val_loader_cleaned)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True

INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores

INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs

INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

INFO:pytorch_lightning.callbacks.model_summary:

  | Name      | Type                                | Params

------------------------------------------------------------------

0 | model     | DistilBertForSequenceClassification | 67.0 M

1 | acc_score | MulticlassAccuracy                  | 0     

2 | ce_loss   | CrossEntropyLoss                    | 0     

------------------------------------------------------------------

593 K     Trainable params

66.4 M    Non-trainable params

67.0 M    Total params

267.826   Total estimated model params size (MB)


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]


INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=10` reached.


After preprocessing the data to remove metadata and stopwords, the test accuracy improved from 91.25% to 92.5%.

In [None]:
# Run the best model using the test data

best_model_path_cleaned = callbacks_cleaned[0].best_model_path
print(best_model_path_cleaned)

trainer_cleaned.test(ckpt_path=best_model_path_cleaned, dataloaders=test_loader_cleaned)

INFO:pytorch_lightning.utilities.rank_zero:Restoring states from the checkpoint path at distilbert/finetuning_cleaned/v1/checkpoints/epoch=8-step=216.ckpt


distilbert/finetuning_cleaned/v1/checkpoints/epoch=8-step=216.ckpt


INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

INFO:pytorch_lightning.utilities.rank_zero:Loaded model weights from the checkpoint at distilbert/finetuning_cleaned/v1/checkpoints/epoch=8-step=216.ckpt


Testing: |          | 0/? [00:00<?, ?it/s]

tensor([2, 0, 3, 0, 2, 2, 0, 2, 1, 1], device='cuda:0')

tensor([2, 0, 3, 0, 3, 2, 0, 2, 3, 1], device='cuda:0')

tensor(0.8000, device='cuda:0')

tensor([2, 2, 1, 0, 2, 0, 1, 0, 0, 3], device='cuda:0')

tensor([2, 2, 1, 0, 2, 0, 1, 0, 0, 3], device='cuda:0')

tensor(1., device='cuda:0')

tensor([3, 2, 1, 1, 3, 2, 1, 0, 3, 1], device='cuda:0')

tensor([3, 2, 1, 1, 3, 2, 1, 0, 3, 1], device='cuda:0')

tensor(1., device='cuda:0')

tensor([1, 3, 2, 1, 3, 1, 3, 0, 1, 2], device='cuda:0')

tensor([3, 3, 2, 1, 3, 1, 3, 0, 1, 2], device='cuda:0')

tensor(0.9000, device='cuda:0')

tensor([3, 1, 1, 2, 3, 1, 0, 2, 1, 1], device='cuda:0')

tensor([3, 1, 1, 2, 3, 1, 0, 2, 1, 1], device='cuda:0')

tensor(1., device='cuda:0')

tensor([2, 2, 2, 3, 3, 2, 2, 0, 3, 1], device='cuda:0')

tensor([2, 2, 3, 3, 3, 2, 2, 0, 3, 1], device='cuda:0')

tensor(0.9000, device='cuda:0')

tensor([3, 0, 1, 0, 0, 0, 0, 2, 2, 3], device='cuda:0')

tensor([3, 0, 1, 2, 0, 0, 0, 2, 2, 3], device='cuda:0')

tensor(0.9000, d

[{'test_acc': 0.925000011920929}]
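
The checkpoint name `epoch=8-step=216` can be read back into a rough picture of the training run. The arithmetic below is a sketch that assumes one optimizer step per batch (no gradient accumulation) and that the last, possibly smaller, batch of each epoch is kept; both are assumptions about the trainer's default settings rather than values stated in the notebook.

```python
# Values read from the notebook output
global_step = 216   # from the checkpoint name "epoch=8-step=216"
epoch_index = 8     # zero-based, so 9 epochs had completed
batch_size = 10     # set in the DataLoaders

# Optimizer steps taken in each epoch
steps_per_epoch = global_step // (epoch_index + 1)

# Training set sizes n for which ceil(n / batch_size) == steps_per_epoch
min_train_docs = (steps_per_epoch - 1) * batch_size + 1
max_train_docs = steps_per_epoch * batch_size

print(steps_per_epoch)
print(min_train_docs, max_train_docs)
```

Under these assumptions, the best checkpoint was saved after the ninth of ten epochs, and the training set holds between 231 and 240 documents.
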

# 5. Findings & Conclusion

After training both models for 10 epochs, the model trained on the data with additional preprocessing (metadata and stopword removal) performed slightly better, with a test accuracy of 92.50% compared with 91.25% for the initial model without additional preprocessing.
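
To put the size of this gap in perspective, the accuracies can be converted into document counts. The sketch assumes a test set of 80 documents; this figure is an inference, not a number stated in the notebook, chosen because it is consistent with the batches of 10 in the test output and with both accuracies being exact multiples of 1/80.

```python
n_test = 80            # assumption: inferred, not stated in the notebook
acc_cleaned = 0.925    # with metadata and stopword removal
acc_initial = 0.9125   # without additional preprocessing

# Number of correctly classified test documents for each model
correct_cleaned = round(acc_cleaned * n_test)
correct_initial = round(acc_initial * n_test)

print(correct_cleaned, correct_initial, correct_cleaned - correct_initial)
```

If that assumption holds, the two models differ by a single test document, so the comparison is best read as indicative rather than conclusive.
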

*(Note: there was a previous model that was trained without performing stopword removal. The code for this was not uploaded)*

However, the TensorBoard curves of training and validation accuracy/loss for both models suggested that the initial model was performing better, with a higher validation accuracy after 10 epochs.

A likely explanation for this discrepancy is that removing stopwords, which are commonly used words/tokens without much semantic meaning, helped the classification model focus on more meaningful words or tokens. The model trained after stopword removal may therefore have learned more robust and meaningful representations of the text, leading to better generalisation and classification accuracy on the test data.

In contrast, the initial model without additional text preprocessing may have overfitted to noise or common stopwords in the training data, so it did not generalise as well to the test data.
