

# NLP Baseline on LLM using Transformers

This notebook implements an end-to-end NLP workflow using transformer-based models, with a primary focus on fine-tuning a BERT-based architecture for text classification tasks. It provides a structured pipeline for training, validation, and evaluation using modern NLP libraries.

## Key Features
- **Model Architecture**: Utilizes `distilbert-base-uncased` as the base transformer model for text classification. The architecture can be expanded or adapted to other models.
- **Custom Dataset Handling**: Implements a `CustomDataset` class that tokenizes texts either once at construction or lazily per item, pads or truncates them to a fixed `max_length`, and is batched through PyTorch's `Dataset` and `DataLoader`.
- **ModelTrainer Class**: Includes a comprehensive `ModelTrainer` class that:
  - Loads and prepares datasets (e.g., IMDB dataset).
  - Configures model training components like the optimizer (`AdamW`), learning rate scheduler (`get_linear_schedule_with_warmup`), and gradient scaling (`GradScaler`).
  - Supports multi-GPU training through `nn.DataParallel`.
  - Provides functionality for model training, validation, metrics calculation (`f1_score`, `precision_score`, `recall_score`), and saving.
- **Logging and Experiment Tracking**: Uses `wandb` (Weights & Biases) for experiment tracking, including automatic logging of metrics, losses, and configurations.
- **Implementation Details**:
  - Includes both the training (`train`) and validation (`validate`) methods to evaluate model performance.
  - Flexible parameters like learning rate, batch size, and maximum sequence length, which make it easier to adapt the pipeline to other datasets or models.
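
To make the scheduler bullet concrete, here is a minimal pure-Python sketch of the learning-rate multiplier that a linear warmup/decay schedule applies at each optimizer step. The function name `linear_lr_factor` is hypothetical, and the formula is written from the documented behaviour of `get_linear_schedule_with_warmup` (a linear ramp from 0 to 1 over the warmup steps, then a linear decay to 0), so treat it as an illustration rather than the library's code.

```python
def linear_lr_factor(step: int, num_warmup_steps: int, num_training_steps: int) -> float:
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 to 0 at num_training_steps, never below 0.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 3e-5  # illustrative value
for step in (0, 50, 100):
    factor = linear_lr_factor(step, num_warmup_steps=0, num_training_steps=100)
    print(step, base_lr * factor)
```

Because the multiplier is indexed by optimizer step, the schedule only follows this shape if the scheduler is advanced once per batch and `num_training_steps` equals the total number of batches.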

In [None]:
!pip install -q bertopic==0.15.0 datasets==2.14.4 transformers==4.24.0
!pip install -q sentencepiece==0.1.99 sentence-transformers==2.2.2
!pip install -q wandb

In [None]:
import logging
import os
import uuid

import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
import wandb
from datasets import load_dataset
from sklearn.metrics import (
    classification_report,
    f1_score,
    precision_score,
    recall_score,
)
from torch.cuda.amp import GradScaler, autocast
from torch.utils.data import DataLoader, Dataset
from tqdm.auto import tqdm
from transformers import get_linear_schedule_with_warmup

wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mexsandebest[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

In [None]:
def get_logger(name: str = __name__) -> logging.Logger:
    logging.basicConfig(
        format="%(asctime)s:%(module)s:%(funcName)s:%(levelname)s: %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    return logger


logger = get_logger(__name__)

In [None]:
class CustomDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length: int, tokenize_at_init: bool = False):
        """
        Initialize the custom dataset
        :param texts: List of texts (documents or sentences)
        :param labels: Corresponding labels for the texts
        :param tokenizer: Tokenizer instance for text processing
        :param max_length: Maximum length of the tokenized output
        :param tokenize_at_init: Flag to perform tokenization at initialization
        """

        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.tokenize_at_init = tokenize_at_init

        if tokenize_at_init:
            self.encoded_texts = self.tokenizer.batch_encode_plus(
                texts,
                add_special_tokens=True,
                max_length=max_length,
                truncation=True,
                padding="max_length",
                return_attention_mask=True,
                return_tensors="pt",
            )

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        if self.tokenize_at_init:
            return {
                "input_ids": self.encoded_texts["input_ids"][idx].flatten(),
                "attention_mask": self.encoded_texts["attention_mask"][idx].flatten(),
                "label": torch.tensor(self.labels[idx], dtype=torch.long),
            }
        else:
            text = self.texts[idx]
            label = self.labels[idx]

            encoding = self.tokenizer.encode_plus(
                text,
                add_special_tokens=True,
                max_length=self.max_length,
                truncation=True,
                padding="max_length",
                return_attention_mask=True,
                return_tensors="pt",
            )

            return {
                "input_ids": encoding["input_ids"].flatten(),
                "attention_mask": encoding["attention_mask"].flatten(),
                "label": torch.tensor(label, dtype=torch.long),
            }


class ModelTrainer:
    """
    This class implements the logic to run an experiment with a provided transformers classification model.
    It includes the following components:
    - load data
    - load and configure a model and its artifacts
    - train model
    - validate model
    - save model
    - compute metrics
    - run_experiment (the main entry point that executes the whole flow)
    """

    def __init__(
        self,
        model_name: str = "distilbert-base-uncased",
        num_labels: int = 2,
        learning_rate: float = 2e-5,
        max_length: int = 128,
        num_workers: int = 4,
        weight_decay: float = 0.01,
        datasets_tokenize_at_init: bool = False,
    ):
        """
        Initialize the ModelTrainer class with a specified transformer model
        :param model_name: Name of the transformer model to be used
        :param num_labels: Number of labels for the classification task
        :param learning_rate: Learning rate for the optimizer
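        :param max_length: Maximum length of the tokenized sequences
        :param num_workers: Number of worker processes used by the DataLoaders
        :param weight_decay: Weight decay factor passed to the optimizer
        :param datasets_tokenize_at_init: Flag to perform tokenization at initialization of datasets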
        """

        logger.info("Initializing ModelTrainer")

        if model_name == "distilbert-base-uncased":
            logger.info(f"Using model {model_name}")
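            from transformers import AutoModelForSequenceClassification, AutoTokenizer

            # A sequence-classification head is required: train() reads
            # outputs.loss and validate() reads outputs.logits.
            self.model = AutoModelForSequenceClassification.from_pretrained(
                model_name, num_labels=num_labels
            )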
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        else:
            raise ValueError(f"Model {model_name} is not supported in ModelTrainer")

        self.model_name = model_name
        self.learning_rate = learning_rate
        self.max_length = max_length
        self.num_workers = num_workers
        self.datasets_tokenize_at_init = datasets_tokenize_at_init
        self.weight_decay = weight_decay

        self.optimizer = None
        self.scaler = None
        self.scheduler = None

        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)

        logger.info("ModelTrainer successfully initialized!")

    def configure_optimizer(self, weight_decay: float = 0.01):
        """
        Configure the optimizer for the model
        :param weight_decay: Weight decay factor for regularization
        """

        no_decay = ["bias", "LayerNorm.weight"]
        optimizer_grouped_parameters = [
            {
                "params": [
                    p
                    for n, p in self.model.named_parameters()
                    if not any(nd in n for nd in no_decay)
                ],
                "weight_decay": weight_decay,
            },
            {
                "params": [
                    p
                    for n, p in self.model.named_parameters()
                    if any(nd in n for nd in no_decay)
                ],
                "weight_decay": 0.0,
            },
        ]
        self.optimizer = optim.AdamW(
            optimizer_grouped_parameters, lr=self.learning_rate
        )

        self.scaler = GradScaler()

    def configure_scheduler(self, num_warmup_steps: int, num_training_steps: int) -> None:
        """
        Set up the learning rate scheduler
        :param num_warmup_steps: Number of warm-up steps for the scheduler
        :param num_training_steps: Total number of training steps
        """

        if not self.optimizer:
            raise ValueError(
                "Optimizer not configured. Please configure optimizer before setting up the scheduler."
            )

        self.scheduler = get_linear_schedule_with_warmup(
            self.optimizer,
            num_warmup_steps=num_warmup_steps,
            num_training_steps=num_training_steps,
        )

    def apply_data_parallel(self) -> None:
        """
        Apply data parallelism to the model if multiple GPUs are available
        """

        if torch.cuda.device_count() > 1:
            self.model = nn.DataParallel(self.model)

    def load_data(
        self, filename: str, split: str, max_samples: int = None
    ) -> pd.DataFrame:
        """
        Load the dataset from a given source.
        :param filename: Name of the file or dataset.
        :param split: Specific split of the dataset (e.g., 'train', 'test').
        :param max_samples: Maximum number of samples to load.
        :return: The dataset as a pandas DataFrame.
        """

        dataset = load_dataset(filename, split=split)
        df = dataset.to_pandas()
        if max_samples is not None:
            df = df[:max_samples]
        return df

    def train(self, dataset: CustomDataset, batch_size: int = 32, epochs: int = 3):
        """
        Train the model on the provided dataset
        :param dataset: The dataset to train the model on
        :param batch_size: The size of each batch during training
        :param epochs: The number of epochs to train the model
        """

        dataloader = DataLoader(
            dataset,
            batch_size=batch_size,
            shuffle=True,
            num_workers=self.num_workers,
            pin_memory=True,
        )

        self.model.train()

        for epoch in range(epochs):
            total_loss = 0

            for batch_index, batch in enumerate(tqdm(dataloader)):
                input_ids = batch["input_ids"].to(self.device)
                attention_mask = batch["attention_mask"].to(self.device)
                labels = batch["label"].to(self.device)

                self.optimizer.zero_grad()

                with autocast():
                    outputs = self.model(
                        input_ids, attention_mask=attention_mask, labels=labels
                    )
                    loss = outputs.loss

                self.scaler.scale(loss).backward()
                self.scaler.step(self.optimizer)
                self.scaler.update()

                # The scheduler is sized in optimizer steps (batches), so it is
                # advanced once per batch rather than once per epoch.
                if self.scheduler:
                    self.scheduler.step()

                loss_value = loss.item()
                total_loss += loss_value
                wandb.log(
                    {"loss": loss_value, "epoch": epoch, "batch_index": batch_index}
                )

            avg_loss = total_loss / len(dataloader)
            logger.info(f"Epoch {epoch + 1}/{epochs} - Loss: {avg_loss:.4f}")

    def validate(self, dataset: CustomDataset, batch_size: int = 32) -> dict:
        """
        Validate the model on the provided dataset
        :param dataset: The dataset to validate the model on
        :param batch_size: The size of each batch during validation
        :return: A dictionary containing validation labels and predictions
        """

        dataloader = DataLoader(
            dataset,
            batch_size=batch_size,
            shuffle=False,  # sample order does not affect the evaluation metrics
            num_workers=self.num_workers,
            pin_memory=True,
        )

        self.model.eval()

        valid_labels = []
        valid_preds = []

        with torch.no_grad():
            for batch in tqdm(dataloader):
                input_ids = batch["input_ids"].to(self.device)
                attention_mask = batch["attention_mask"].to(self.device)
                labels = batch["label"].to(self.device)

                outputs = self.model(input_ids, attention_mask=attention_mask)

                logits = outputs.logits
                predictions = torch.argmax(logits, dim=1)
                valid_preds.extend(predictions.cpu().numpy())
                valid_labels.extend(labels.cpu().numpy())

        return {"valid_labels": valid_labels, "valid_preds": valid_preds}

    def compute_metrics_report(self, labels, predictions):
        """
        Compute and return metrics for the validation
        :param labels: True labels of the validation set
        :param predictions: Model's predictions on the validation set
        :return: A dictionary with computed metrics
        """

        f1 = f1_score(labels, predictions, average="weighted")
        precision = precision_score(labels, predictions, average="weighted")
        recall = recall_score(labels, predictions, average="weighted")
        report = classification_report(labels, predictions)

        metrics = {
            "f1_score": f1,
            "precision_score": precision,
            "recall_score": recall,
            "classification_report": report,
        }

        wandb.run.summary["f1_score"] = f1
        wandb.run.summary["precision"] = precision
        wandb.run.summary["recall"] = recall

        return metrics

    def save_model(self, dst_path: str) -> None:
        """
        Save the trained model to a specified path
        :param dst_path: Destination path for saving the model
        """

        model_to_save = (
            self.model.module if hasattr(self.model, "module") else self.model
        )

        dst_dir = os.path.dirname(dst_path)
        if dst_dir:
            # dirname is "" for a bare filename, so only create a directory when one is given
            os.makedirs(dst_dir, exist_ok=True)

        torch.save(model_to_save.state_dict(), dst_path)

    def run_experiment(
        self,
        train_file: str,
        test_file: str,
        batch_size: int = 32,
        epochs: int = 2,
        max_samples: int = None,
        save_path: str = None,
    ):
        """
        Run the experiment including training and validation
        :param train_file: Path or name of the training data file
        :param test_file: Path or name of the testing/validation data file
        :param batch_size: Batch size for both training and validation
        :param epochs: Number of epochs to train the model
        :param max_samples: Maximum number of samples to be used in training/testing
        :param save_path: Path to save the trained model. If None, the model will not be saved
        """

        wandb.init(
            project="bdt-nlp-hw1",
            name=f"experiment_{uuid.uuid4()}",
            config={
                "model": self.model_name,
                "learning_rate": self.learning_rate,
                "tokenizer_max_length": self.max_length,
                "num_workers": self.num_workers,
                "weight_decay": self.weight_decay,
                "architecture": "DistilBERT",
                "dataset": train_file,
                "epochs": epochs,
                "optimizer": "AdamW",
                "scheduler": "linear_schedule_with_warmup",
            },
        )

        logger.info("Running experiment!")
        logger.info(
            f"Loading training data (tokenize_at_init = {self.datasets_tokenize_at_init})"
        )
        train_data = self.load_data(train_file, split="train", max_samples=max_samples)
        train_dataset = CustomDataset(
            texts=train_data["text"].tolist(),
            labels=train_data["label"].tolist(),
            tokenizer=self.tokenizer,
            max_length=self.max_length,
            tokenize_at_init=self.datasets_tokenize_at_init,
        )

        logger.info(
            f"Loading validation data (tokenize_at_init = {self.datasets_tokenize_at_init})"
        )
        test_data = self.load_data(test_file, split="test", max_samples=max_samples)
        test_dataset = CustomDataset(
            texts=test_data["text"].tolist(),
            labels=test_data["label"].tolist(),
            tokenizer=self.tokenizer,
            max_length=self.max_length,
            tokenize_at_init=self.datasets_tokenize_at_init,
        )

        logger.info("Trying apply Data Parallel")
        self.apply_data_parallel()

        logger.info("Configuring optimizer")
        self.configure_optimizer(self.weight_decay)

        logger.info("Configuring scheduler")
        # The DataLoader keeps the last partial batch, so round batches per epoch up
        total_steps = (len(train_dataset) + batch_size - 1) // batch_size * epochs
        self.configure_scheduler(num_warmup_steps=0, num_training_steps=total_steps)

        logger.info("Running model training")
        self.train(train_dataset, batch_size=batch_size, epochs=epochs)

        logger.info("Running model validation")
        validation_results = self.validate(test_dataset, batch_size=batch_size)

        logger.info("Computing metrics")
        metrics = self.compute_metrics_report(
            validation_results["valid_labels"], validation_results["valid_preds"]
        )
        for key, value in metrics.items():
            logger.info(f"\n{key}:\n{value}\n")

        if save_path is not None:
            logger.info("Saving model")
            self.save_model(save_path)

        logger.info("Experiment successfully completed!")

        wandb.finish()
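
The total step count passed to the scheduler depends on how many batches the `DataLoader` yields per epoch. With the default `drop_last=False` the final partial batch is kept, so the number of batches per epoch is a ceiling division rather than a floor division. A small sketch of the arithmetic follows; `count_training_steps` is a hypothetical helper, and the figures are the IMDB training split size (25,000 reviews) with the batch size and epoch count used for the run in this notebook.

```python
import math


def count_training_steps(num_samples: int, batch_size: int, epochs: int,
                         drop_last: bool = False) -> int:
    """Number of optimizer steps when one step is taken per batch."""
    if drop_last:
        # The final partial batch is discarded.
        batches_per_epoch = num_samples // batch_size
    else:
        # The final partial batch is kept.
        batches_per_epoch = math.ceil(num_samples / batch_size)
    return batches_per_epoch * epochs


steps = count_training_steps(num_samples=25_000, batch_size=64, epochs=3)
print(steps)  # 391 batches per epoch * 3 epochs = 1173
```

Floor division would give 390 batches per epoch here, three steps short of what the loader produces over three epochs.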


In [None]:
model_trainer = ModelTrainer(
    model_name='distilbert-base-uncased',
    num_labels=2,
    learning_rate=3e-5,
    max_length=256,
    num_workers=2,
    weight_decay=5e-2,
    datasets_tokenize_at_init=False
)

INFO:__main__:Initializing ModelTrainer
INFO:__main__:Using model distilbert-base-uncased
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initi
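
The parameter names in the message above also matter for `configure_optimizer`, which decides by substring match which parameters are exempt from weight decay. Below is a small sketch of that name-based split in plain Python. The helper `split_by_decay` and the sample names are illustrative assumptions written in DistilBERT's naming style; the real list comes from `model.named_parameters()`. If the names do look like this, the pattern `LayerNorm.weight` would not match names such as `sa_layer_norm.weight`, so it is worth checking the split on the actual model.

```python
def split_by_decay(names, no_decay=("bias", "LayerNorm.weight")):
    """Split parameter names into (decayed, exempt) using substring matching."""
    decayed, exempt = [], []
    for name in names:
        # A parameter is exempt if any no_decay pattern occurs in its name.
        if any(nd in name for nd in no_decay):
            exempt.append(name)
        else:
            decayed.append(name)
    return decayed, exempt


# Assumed sample names; inspect model.named_parameters() for the real ones.
sample_names = [
    "distilbert.embeddings.LayerNorm.weight",
    "distilbert.transformer.layer.0.attention.q_lin.weight",
    "distilbert.transformer.layer.0.attention.q_lin.bias",
    "distilbert.transformer.layer.0.sa_layer_norm.weight",
    "classifier.weight",
]

decayed, exempt = split_by_decay(sample_names)
print("weight decay applied:", decayed)
print("weight decay skipped:", exempt)
```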

In [None]:
model_trainer.run_experiment(
    train_file='imdb',
    test_file='imdb',
    batch_size=64,
    epochs=3
)

INFO:__main__:Running experiment!
INFO:__main__:Loading training data (tokenize_at_init = False)
INFO:__main__:Loading validation data (tokenize_at_init = False)
INFO:__main__:Trying apply Data Parallel
INFO:__main__:Configuring optimizer
INFO:__main__:Configuring scheduler
INFO:__main__:Running model training


  0%|          | 0/391 [00:00<?, ?it/s]

INFO:__main__:Epoch 1/3 - Loss: 0.2920


  0%|          | 0/391 [00:00<?, ?it/s]

INFO:__main__:Epoch 2/3 - Loss: 0.1765


  0%|          | 0/391 [00:00<?, ?it/s]

INFO:__main__:Epoch 3/3 - Loss: 0.1033
INFO:__main__:Running model validation


  0%|          | 0/391 [00:00<?, ?it/s]

INFO:__main__:Computing metrics
INFO:__main__:
f1_score:
0.9057922482678108

INFO:__main__:
precision_score:
0.9073979164005953

INFO:__main__:
recall_score:
0.90588

INFO:__main__:
classification_report:
              precision    recall  f1-score   support

           0       0.88      0.94      0.91     12500
           1       0.93      0.88      0.90     12500

    accuracy                           0.91     25000
   macro avg       0.91      0.91      0.91     25000
weighted avg       0.91      0.91      0.91     25000


INFO:__main__:Experiment successfully completed!



Run history:
batch_index  ▁▂▂▃▃▄▄▅▅▆▇▇█▁▁▂▃▃▄▄▅▅▆▆▇█▁▁▂▂▃▄▄▅▅▆▆▇▇█
epoch        ▁▁▁▁▁▁▁▁▁▁▁▁▁▅▅▅▅▅▅▅▅▅▅▅▅▅██████████████
loss         █▄▆▄▄▅▄▃▅▄▄▅▄▃▂▁▂▂▂▂▄▄▃▂▄▂▂▂▃▃▂▂▂▃▂▂▄▃▂▁

Run summary:
batch_index  390.0
epoch        2.0
f1_score     0.90579
loss         0.04176
precision    0.9074
recall       0.90588
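
For reference, the `average="weighted"` setting used for the metrics above averages per-class scores with weights proportional to each class's support. The sketch below recomputes a weighted F1 in plain Python on a tiny made-up label set. The data and the helper name `weighted_f1` are illustrative only and are not taken from the run; on real data `sklearn.metrics.f1_score` remains the better choice.

```python
from collections import Counter


def weighted_f1(labels, predictions) -> float:
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(labels)
    pairs = list(zip(labels, predictions))
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for y, p in pairs if y == cls and p == cls)
        fp = sum(1 for y, p in pairs if y != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(labels)


# Made-up toy data: 3 samples of class 0 and 5 samples of class 1.
toy_labels = [0, 0, 0, 1, 1, 1, 1, 1]
toy_preds = [0, 0, 1, 1, 1, 1, 0, 1]
print(round(weighted_f1(toy_labels, toy_preds), 4))
```

With balanced classes, as in the IMDB test split, the weighted and macro averages coincide, which is consistent with the matching `macro avg` and `weighted avg` rows in the report.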
