## Notebook Description

- This notebook outlines the development, training, and evaluation of the model for the Toxic Comment Classification project. The approach was informed by insights gained during the Exploratory Data Analysis (EDA) phase.

- Key Characteristics:

    - Model Used: DistilBERT-cased (a change from the initial uncased version, as capitalization was identified as an important factor during EDA)
    - Early Stopping: Implemented with patience set to 3
    - Class Weights: Applied to handle class imbalance, though results were not as expected, indicating potential issues in implementation
    - Labels: Added a non-toxic label to the existing six toxicity labels
    - Tracking: Used MLflow to monitor validation and training losses, which will be presented during the project demonstration
    - Challenges:

        - Determining the optimal number of epochs to keep the backbone frozen
        - Addressing sudden spikes in validation loss after initial stability
        - Improving various elements of the model architecture for better performance


This notebook captures the journey and lessons from the model development process and highlights areas for potential improvement.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install transformers
!pip install torch
!pip install pandas
!pip install scikit-learn
!pip install pytorch-lightning
!pip install mlflow

In [None]:
import logging
import os

import mlflow
import numpy as np
import pandas as pd
import pytorch_lightning as pl
import torch
from pytorch_lightning import _logger as log
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint
from pytorch_lightning.loggers import MLFlowLogger
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
# torch's AdamW replaces the deprecated transformers.AdamW (note: its default weight_decay is 0.01)
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from tqdm import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer, get_linear_schedule_with_warmup

In [None]:
path = '/content/drive/My Drive/NLP comments/train.csv'
train_df = pd.read_csv(path)

In [None]:
labels_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
train_df['non_toxic'] = (train_df[labels_columns].sum(axis=1) == 0).astype('int8')
train_df.drop('id', axis=1, inplace=True)
label_columns = labels_columns + ['non_toxic']
train_df[label_columns] = train_df[label_columns].astype('int8')
label_counts = train_df[label_columns].sum().values

In [None]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased')
labels = train_df[labels_columns + ['non_toxic']].values
train_texts, val_texts, train_labels, val_labels = train_test_split(train_df['comment_text'], labels, test_size=0.3, random_state=33)
max_length = 80
train_encodings = tokenizer(train_texts.tolist(), truncation=True, padding='max_length', max_length=max_length, return_tensors='pt')
val_encodings = tokenizer(val_texts.tolist(), truncation=True, padding='max_length', max_length=max_length, return_tensors='pt')
train_dataset = TensorDataset(train_encodings['input_ids'], train_encodings['attention_mask'], torch.tensor(train_labels))
val_dataset = TensorDataset(val_encodings['input_ids'], val_encodings['attention_mask'], torch.tensor(val_labels))
batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
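
The `max_length = 80` cap truncates longer comments. A rough way to judge that choice is to measure what share of texts fit within a given length. The sketch below is illustrative only: `coverage` is a hypothetical helper, and whitespace splitting stands in for the DistilBERT tokenizer, which usually produces more tokens per comment.

```python
def coverage(texts, max_length, tokenize=str.split):
    """Return the fraction of texts whose token count fits within max_length.

    `tokenize` is any callable mapping a string to a list of tokens. Whitespace
    splitting is only a stand-in; pass the real tokenizer's `tokenize` method
    to get counts that are closer to the model input.
    """
    if not texts:
        return 0.0
    fits = sum(1 for text in texts if len(tokenize(text)) <= max_length)
    return fits / len(texts)


# Made-up examples, not taken from the dataset.
sample_texts = ["short comment", "a b c d e f", "one two three four"]
print(coverage(sample_texts, max_length=4))
```

If the share of fully covered comments turns out to be low, a larger `max_length` may be worth the extra memory and compute.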

In [None]:
class ToxicCommentClassifier(pl.LightningModule):
    def __init__(self, model_name, num_labels, class_weights):
        super().__init__()
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
        self.criterion = torch.nn.BCEWithLogitsLoss(pos_weight=class_weights)
        self.save_hyperparameters()
        self.validation_predictions = []
        self.validation_labels = []
        self.best_threshold = 0.5
        self.validation_step_outputs = []


        for param in self.model.distilbert.transformer.layer.parameters():
            param.requires_grad = False

    def forward(self, input_ids, attention_mask=None, labels=None):
        output = self.model(input_ids, attention_mask=attention_mask)
        return output.logits

    def training_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch
        logits = self(input_ids, attention_mask)
        loss = self.criterion(logits, labels.float())

        self.log('train_loss', loss, prog_bar=True, logger=True, on_step=True, on_epoch=True)
        return loss


    def validation_step(self, batch, batch_idx):
        input_ids, attention_mask, labels = batch
        logits = self(input_ids, attention_mask)
        loss = self.criterion(logits, labels.float())
        self.validation_step_outputs.append(loss)
        return {'val_loss': loss}

    def on_validation_epoch_end(self):
        avg_loss = torch.stack(self.validation_step_outputs).mean()
        self.log('val_loss', avg_loss, prog_bar=True, logger=True, on_step=False, on_epoch=True)
        self.validation_step_outputs.clear()

    def configure_optimizers(self):
        self.fine_tune_lr = 5e-5
        self.initial_lr = 2e-5
        optimizer = AdamW(self.model.parameters(), lr=self.initial_lr)
        num_training_steps = self.trainer.estimated_stepping_batches
        scheduler = get_linear_schedule_with_warmup(optimizer,
                                                    num_warmup_steps=int(num_training_steps * 0.1),
                                                    num_training_steps=num_training_steps)
        return [optimizer], [{"scheduler": scheduler, "interval": "step"}]


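    def on_train_epoch_start(self):
        # `on_epoch_start` is not a hook in Lightning 2.x (which the '16-mixed' precision
        # flag implies), so a method with that name would never be called.
        if self.current_epoch == 3:
            log.info("Unfreezing the last two layers and switching to the fine-tuning learning rate.")
            for layer in self.model.distilbert.transformer.layer[-2:]:
                for param in layer.parameters():
                    param.requires_grad = True

            # The optimizer already holds every parameter, so only the learning rate changes.
            for param_group in self.optimizers().param_groups:
                param_group['lr'] = self.fine_tune_lr
            # The warmup scheduler recomputes lr from base_lrs on every step, so update those too.
            scheduler = self.lr_schedulers()
            scheduler.base_lrs = [self.fine_tune_lr] * len(scheduler.base_lrs)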
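
The schedule requested in `configure_optimizers` ramps the learning rate up linearly during the first 10% of steps and then decays it linearly to zero. The sketch below re-implements that multiplier in plain Python, to the best of my understanding of `get_linear_schedule_with_warmup`; the function name and `total_steps` value are hypothetical. It also shows why a scheduler of this kind matters when changing the learning rate mid-training: the rate at every step is derived from the base rate, so a manual change to `param_group['lr']` lasts only until the next scheduler step.

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5        # matches initial_lr in the notebook
total_steps = 1000    # hypothetical; the notebook reads this from the trainer
warmup_steps = int(total_steps * 0.1)

for step in (0, 50, 100, 550, 1000):
    lr = base_lr * linear_warmup_multiplier(step, warmup_steps, total_steps)
    print(step, lr)
```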

In [None]:
num_labels_updated = len(labels_columns) + 1
label_freq = train_labels.mean(axis=0)
class_weights = torch.tensor((1 / (label_freq + 1e-9)) * (len(train_labels) / 2.0)).float()
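
`BCEWithLogitsLoss` multiplies the loss of positive examples by `pos_weight`, and the PyTorch documentation suggests the ratio of negative to positive examples for each label. The sketch below compares that ratio with the formula used in the cell above on made-up frequencies. Both helper names are hypothetical and the numbers are not from the dataset; this is one possible explanation for the class-weight issue, not a confirmed diagnosis.

```python
def ratio_pos_weight(label_freq):
    """Negatives-to-positives ratio per label: (1 - f) / f."""
    return [(1.0 - f) / f for f in label_freq]


def notebook_pos_weight(label_freq, n_samples):
    """The formula from the cell above: (1 / f) * (n_samples / 2)."""
    return [(1.0 / (f + 1e-9)) * (n_samples / 2.0) for f in label_freq]


# Made-up frequencies: one rare label and one very common label.
label_freq = [0.01, 0.9]
n_samples = 100_000

print(ratio_pos_weight(label_freq))
print(notebook_pos_weight(label_freq, n_samples))
```

Because the second formula scales with the dataset size, even the most common label receives a weight in the tens of thousands here, which would be expected to push every logit toward the positive side.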

In [None]:
base_mlflow_dir = '/content/drive/My Drive/mlflow'
os.makedirs(base_mlflow_dir, exist_ok=True)


mlflow_tracking_uri = base_mlflow_dir  # local file store on Drive
mlflow.set_tracking_uri(mlflow_tracking_uri)

mlflow.set_experiment('ToxicCommentClassification_2')


mlflow_logger = MLFlowLogger(
    experiment_name='ToxicCommentClassification_2',
    tracking_uri=mlflow_tracking_uri
)


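# Caution: this folder sits inside the MLflow file store, which scans its subdirectories as
# experiments; a folder without meta.yaml is the likely source of the MissingConfigException below.
checkpoint_dir = os.path.join(base_mlflow_dir, 'checkpoints')
os.makedirs(checkpoint_dir, exist_ok=True)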


checkpoint_callback = ModelCheckpoint(
    dirpath=checkpoint_dir,
    monitor='val_loss',
    mode='min',
    save_top_k=2
)


early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=3
)



trainer = pl.Trainer(
    max_epochs=10,
    callbacks=[checkpoint_callback, early_stopping],
    logger=mlflow_logger,
    accelerator='auto',
    precision='16-mixed',
)


Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/file_store.py", line 302, in search_experiments
    exp = self._get_experiment(exp_id, view_type)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/file_store.py", line 395, in _get_experiment
    meta = FileStore._read_yaml(experiment_dir, FileStore.META_DATA_FILE_NAME)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/file_store.py", line 1320, in _read_yaml
    return _read_helper(root, file_name, attempts_remaining=retries)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/store/tracking/file_store.py", line 1313, in _read_helper
    result = read_yaml(root, file_name)
  File "/usr/local/lib/python3.10/dist-packages/mlflow/utils/file_utils.py", line 310, in read_yaml
    raise MissingConfigException(f"Yaml file '{file_path}' does not exist.")
mlflow.exceptions.MissingConfigException: Yaml file '/content/drive/My Drive/mlflow/checkpoin
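
The `EarlyStopping` callback configured above stops training once `val_loss` has failed to improve for `patience` consecutive validation checks. The sketch below is a simplified model of that rule, not Lightning's implementation; `early_stop_epoch` is a hypothetical helper and the loss values are invented.

```python
def early_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the index of the epoch at which training would stop, or None.

    Simplified patience-based early stopping in 'min' mode: a check counts as
    an improvement only if the loss drops below the best value so far by more
    than min_delta.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation losses, not taken from the notebook's run.
losses = [0.80, 0.62, 0.60, 0.75, 0.71, 0.66, 0.58]
print(early_stop_epoch(losses, patience=3))
```

In this example the later recovery to 0.58 is never reached, which illustrates how a short patience interacts with temporary spikes in validation loss.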

In [None]:
model = ToxicCommentClassifier(
    model_name='distilbert-base-cased',
    num_labels=num_labels_updated,
    class_weights=class_weights
)

trainer.fit(model, train_loader, val_loader)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model_path = "/content/drive/My Drive/model_checkpoints/toxic_comment_classifier.pt"
os.makedirs(os.path.dirname(model_path), exist_ok=True)

In [None]:
torch.save(model.state_dict(), model_path)

### Evaluating on test data

In [None]:
model_path = os.path.join(base_mlflow_dir, 'ToxicCommentClassifier')
tokenizer_path = os.path.join(base_mlflow_dir, 'Tokenizer')
model.model.save_pretrained(model_path)
tokenizer.save_pretrained(tokenizer_path)

In [None]:
path = '/content/drive/My Drive/NLP comments/merged_data.csv'
test_df = pd.read_csv(path)

- The merged data combines the test comments and the test labels into one dataset. Rows with "-1" label values were removed; that decision was based on the Kaggle dataset description.
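
The merge itself is not shown in this notebook. A minimal sketch of how such a file could be built is given below, assuming the comments and labels share an `id` column and that unused rows carry `-1` in every label column. The toy frames and their values are made up for illustration.

```python
import pandas as pd

label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Toy stand-ins for the test comments and test labels (values are made up).
comments = pd.DataFrame({'id': ['a', 'b', 'c'],
                         'comment_text': ['first', 'second', 'third']})
labels = pd.DataFrame({'id': ['a', 'b', 'c'],
                       'toxic': [0, -1, 1],
                       'severe_toxic': [0, -1, 0],
                       'obscene': [0, -1, 1],
                       'threat': [0, -1, 0],
                       'insult': [0, -1, 0],
                       'identity_hate': [0, -1, 0]})

# Join comments with their labels, then keep only rows without any -1 label.
merged = comments.merge(labels, on='id')
merged = merged[(merged[label_cols] != -1).all(axis=1)].reset_index(drop=True)

# Add the same derived non_toxic label that was added to the training data.
merged['non_toxic'] = (merged[label_cols].sum(axis=1) == 0).astype('int8')
print(merged[['id', 'toxic', 'non_toxic']])
```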

In [None]:
test_encodings = tokenizer(test_df['comment_text'].tolist(), truncation=True, padding='max_length', max_length=max_length, return_tensors='pt')

In [None]:
labels_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate', 'non_toxic']
test_dataset = TensorDataset(test_encodings['input_ids'], test_encodings['attention_mask'], torch.tensor(test_df[labels_columns].values))
test_loader = DataLoader(test_dataset, batch_size=batch_size)
model.eval()
predictions = []
true_labels = []

In [None]:
with torch.no_grad():
    for batch in tqdm(test_loader, desc="Evaluating", leave=True):
        input_ids, attention_mask, labels = batch
        logits = model(input_ids, attention_mask)
        preds = torch.sigmoid(logits)
        predictions.extend(preds.cpu().numpy())
        true_labels.extend(labels.cpu().numpy())

predictions = np.array(predictions)
true_labels = np.array(true_labels)
threshold = 0.5
binary_predictions = np.where(predictions > threshold, 1, 0)

print(classification_report(true_labels, binary_predictions, target_names=labels_columns))

In [None]:
roc_auc_scores = {label: roc_auc_score(true_labels[:, idx], predictions[:, idx])
                  for idx, label in enumerate(labels_columns)}

for label, score in roc_auc_scores.items():
    print(f"{label}: {score}")

toxic: 0.9484846311001097
severe_toxic: 0.9748050962172712
obscene: 0.9623207746049303
threat: 0.9525613353770097
insult: 0.9533770621593085
identity_hate: 0.9309706861913867
non_toxic: 0.7772625735731786


In [None]:
thresholds = [0.01, 0.5, 0.99]

for thresh in thresholds:
    binary_predictions = np.where(predictions > thresh, 1, 0)
    print(f"Classification report for threshold {thresh}:\n")
    print(classification_report(true_labels, binary_predictions, target_names=labels_columns))

Classification report for threshold 0.01:

               precision    recall  f1-score   support

        toxic       0.10      1.00      0.17      6090
 severe_toxic       0.01      1.00      0.01       367
      obscene       0.06      1.00      0.11      3691
       threat       0.00      1.00      0.01       211
       insult       0.05      1.00      0.10      3427
identity_hate       0.01      1.00      0.02       712
    non_toxic       0.90      1.00      0.95     57735

    micro avg       0.16      1.00      0.28     72233
    macro avg       0.16      1.00      0.20     72233
 weighted avg       0.73      1.00      0.78     72233
  samples avg       0.16      1.00      0.27     72233

Classification report for threshold 0.5:

               precision    recall  f1-score   support

        toxic       0.10      1.00      0.17      6090
 severe_toxic       0.02      1.00      0.03       367
      obscene       0.06      1.00      0.11      3691
       threat       0.01      1
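
A fixed grid such as 0.01, 0.5, and 0.99 can miss the useful cut-off entirely when the scores are concentrated in a narrow range. One alternative, sketched below under the assumption that thresholds are tuned per label on validation data rather than on the test set, is to try every distinct score as a candidate and keep the one with the highest F1. The helper names and the scores are hypothetical.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 for one label when predicting positive wherever score > threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s > threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s > threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s <= threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, scores):
    """Try each distinct score as a cut-off and return (threshold, f1) with the highest F1."""
    candidates = sorted(set(scores))
    return max(((t, f1_at_threshold(y_true, scores, t)) for t in candidates),
               key=lambda pair: pair[1])


# Made-up scores for one label, all squeezed into the top of the range.
y_true = [1, 1, 0, 0, 0, 0]
scores = [0.9999, 0.9995, 0.9993, 0.9991, 0.9990, 0.9990]

print(f1_at_threshold(y_true, scores, 0.5))
print(best_threshold(y_true, scores))
```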

### Concluding remarks

- Changing the decision threshold, even to extreme values, did not fix the poor classification metrics.
- Class imbalance appears to be the primary cause of poor performance.
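- Minority classes show poor precision in the classification reports but high ROC AUC, which is puzzling at first. ROC AUC depends only on how the scores are ranked, so this pattern is consistent with scores that are well ordered but sit above the tested thresholds, possibly because of the very large `pos_weight` values.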
- Concerns arose about the learning rate update when unfreezing the backbone, as validation loss spiked afterward.
- Experiments with consistent learning rates still resulted in validation loss spikes after unfreezing.
- Due to exhausted compute resources, this model is the final submission.
- Future improvements should focus on tweaking the learning rate and determining the optimal number of epochs for fine-tuning, considering whether the learning rate should change between frozen and unfrozen stages.
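
The gap between ROC AUC and the classification reports can be reproduced on a toy example. The sketch below uses invented scores that rank the single positive example first but all lie above 0.99: the ranking-based AUC is perfect, while precision at each tested threshold equals the label's prevalence. Both helpers are hypothetical, and the pairwise AUC computation is only meant for small inputs.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting positive wherever score > threshold."""
    predicted = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    precision = tp / sum(predicted) if sum(predicted) else 0.0
    recall = tp / sum(y_true)
    return precision, recall


# Made-up scores: correctly ordered, but every one of them is above 0.99.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9999] + [0.995 + 0.0001 * i for i in range(9)]

print(roc_auc(y_true, scores))
for threshold in (0.01, 0.5, 0.99):
    print(threshold, precision_recall(y_true, scores, threshold))
```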