<a href="https://colab.research.google.com/github/celelunar/Twitter-Hate-Speech-Multilabel-Classification/blob/main/Multilabel%20Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Name: Diva Nabila Henryka

ID: 2501975620

## Import Library

In [None]:
import numpy as np
import pandas as pd

import re

from sklearn.utils import resample
from sklearn.model_selection import train_test_split

import torch
import torch.nn as nn

from tqdm import tqdm
from torch.optim import Adam
from sklearn.metrics import classification_report
from transformers import DistilBertModel, DistilBertTokenizer, get_linear_schedule_with_warmup

import shutil
import sys

## Data Preprocessing

### Read the Data
The data is first uploaded to the Google Colaboratory file system so that it can be read with the pandas function `pd.read_csv()`.

In [None]:
df = pd.read_csv('data_1D.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,index,Tweet,HS,Abusive,HS_Individual,HS_Group,HS_Religion,HS_Race,HS_Physical,HS_Gender,HS_Other
0,0,9320,USER wkwkwkw akhirnya antek Amerika ini pasang...,0,0,0,0,0,0,0,0,0
1,1,3964,Terlalu suuzon nih rezim URL,1,0,1,0,0,0,0,0,1
2,2,8551,"Mau tanya sama guntur romli, Bener gak neh :;",0,0,0,0,0,0,0,0,0
3,3,12366,Genosida Muslim Rohingya oleh Teroris Budha My...,1,0,0,1,1,0,0,0,0
4,4,6271,USER USER SubhanAllah sekolah di sekolah krist...,0,0,0,0,0,0,0,0,0


### Text Cleaning
Remove irrelevant placeholder words, lowercase the text, and strip non-alphanumeric characters.

In [None]:
def cleaning(text):
  # to remove "USER"
  text_clean = [re.sub(r"USER", " ", i) for i in text]

  # to remove "RT"
  text_clean = [re.sub(r"RT", "", i) for i in text_clean]

  # to remove "URL"
  text_clean = [re.sub(r"URL", "", i) for i in text_clean]

  # to remove "\n"
  text_clean = [re.sub(r'\\n', '', i) for i in text_clean]

  # to lower case
  text_clean = [i.lower() for i in text_clean]

  # to remove emoji patterns
  text_clean = [re.sub(r'\\x[\da-fA-F]{2}', '', i) for i in text_clean]

  # to remove URLs
  text_clean = [re.sub(r'http\S+', '', i) for i in text_clean]

  # to remove numbers
  text_clean = [re.sub(r"\d+", "",i ) for i in text_clean]

  # to remove symbols
  text_clean = [re.sub(r'[^\w\s]', ' ', i) for i in text_clean]

  # to remove non-ascii characters
  text_clean = [re.sub(r'[^\x00-\x7F]+', '', i) for i in text_clean]

  # to remove extra spaces
  text_clean = [re.sub(r'\s+', ' ', i) for i in text_clean]

  # to remove leading or trailing whitespaces
  text_clean = [i.strip() for i in text_clean]

  return text_clean
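One thing to watch with the placeholder patterns above is that a bare `RT` or `URL` pattern also matches inside longer uppercase words. A minimal sketch of a word-boundary variant, assuming the placeholders always appear as standalone uppercase tokens; `strip_placeholders` is a hypothetical helper, not part of the notebook:

```python
import re

# Match USER, RT, or URL only when they stand alone as whole words.
PLACEHOLDER = re.compile(r"\b(?:USER|RT|URL)\b")

def strip_placeholders(text):
    """Remove standalone placeholder tokens and collapse leftover spaces."""
    without = PLACEHOLDER.sub(" ", text)
    return re.sub(r"\s+", " ", without).strip()

# Naive substring removal clips letters out of ordinary words.
print(re.sub(r"RT", "", "PARTAI"))               # PAAI
print(strip_placeholders("RT USER PARTAI URL"))  # PARTAI
```

The word boundary `\b` keeps the match from firing inside a longer token, so only the standalone placeholders are dropped.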

In [None]:
cleaned = cleaning(df['Tweet'])
cleaned

['wkwkwkw akhirnya antek amerika ini pasang badan juga krn saham freeport dikuasai indonesia',
 'terlalu suuzon nih rezim',
 'mau tanya sama guntur romli bener gak neh',
 'genosida muslim rohingya oleh teroris budha myanmar desa muslim rohingya tlh dibakar teroris budha',
 'subhanallah sekolah di sekolah kristen namun iman dan kepercayaan terhadap alloh dan agama islam semakin besar',
 'cebong lo benerin cd aja kon bisa',
 'negeri perak hari ini menyatakan sokongan penuh dan tidak berbelah bahagi terhadap kepimpinan perdana menteri dato seri hebatkannegaraku bersamabn datuk seri r s presiden dahulukanrakyat',
 'komunis kau bangsat keluar kalian segera kapan pun di mana pun',
 'cabean bisa nya hanya bbm n facrbook n masalah politik mana tau doi tau nya berita sampah pencitraan',
 'pink titit itu gmana aku kira titit warnanya coklat benyek',
 'aset indonesia katanya di ambil orang asing freeport di ambil amrik',
 'cuba la tgk anime ni karakai jozu no takagi san kalau rasa nak feeling fee

In [None]:
df.insert(3, 'Cleaned Tweet', cleaned)
df.head()

Unnamed: 0.1,Unnamed: 0,index,Tweet,Cleaned Tweet,HS,Abusive,HS_Individual,HS_Group,HS_Religion,HS_Race,HS_Physical,HS_Gender,HS_Other
0,0,9320,USER wkwkwkw akhirnya antek Amerika ini pasang...,wkwkwkw akhirnya antek amerika ini pasang bada...,0,0,0,0,0,0,0,0,0
1,1,3964,Terlalu suuzon nih rezim URL,terlalu suuzon nih rezim,1,0,1,0,0,0,0,0,1
2,2,8551,"Mau tanya sama guntur romli, Bener gak neh :;",mau tanya sama guntur romli bener gak neh,0,0,0,0,0,0,0,0,0
3,3,12366,Genosida Muslim Rohingya oleh Teroris Budha My...,genosida muslim rohingya oleh teroris budha my...,1,0,0,1,1,0,0,0,0
4,4,6271,USER USER SubhanAllah sekolah di sekolah krist...,subhanallah sekolah di sekolah kristen namun i...,0,0,0,0,0,0,0,0,0


### Drop the Empty Cleaned Data Cell
The cleaning process can produce empty cells, because some original tweets ("Tweet") contain only irrelevant words or non-alphanumeric characters. Row 94 is one example:

In [None]:
df['Cleaned Tweet'].loc[94]

''

In [None]:
final = df.drop(df.columns[:3], axis = 1)
final = final.query("`Cleaned Tweet`.str.strip() != ''")
final.reset_index(drop = True, inplace = True)
final.head()

Unnamed: 0,Cleaned Tweet,HS,Abusive,HS_Individual,HS_Group,HS_Religion,HS_Race,HS_Physical,HS_Gender,HS_Other
0,wkwkwkw akhirnya antek amerika ini pasang bada...,0,0,0,0,0,0,0,0,0
1,terlalu suuzon nih rezim,1,0,1,0,0,0,0,0,1
2,mau tanya sama guntur romli bener gak neh,0,0,0,0,0,0,0,0,0
3,genosida muslim rohingya oleh teroris budha my...,1,0,0,1,1,0,0,0,0
4,subhanallah sekolah di sekolah kristen namun i...,0,0,0,0,0,0,0,0,0


In [None]:
print("Original Length: ", len(df))
print("Cleaned Length: ", len(final))

Original Length:  3293
Cleaned Length:  3276


### Data Splitting
The label counts below are imbalanced, so the model may fail to predict labels that are rare or missing in the training dataset. To compensate, the minority classes are oversampled so that the model also sees enough examples of the labels with few data points.

The oversampled data is then split into 80% training, 10% validation, and 10% testing.

In [None]:
label_columns = final.columns[1:]

for c in label_columns:
  total = final[c].sum()  # named "total" to avoid shadowing the built-in sum()
  print(f"{c}: {total}")

HS: 1384
Abusive: 1295
HS_Individual: 908
HS_Group: 476
HS_Religion: 212
HS_Race: 140
HS_Physical: 75
HS_Gender: 69
HS_Other: 932


In [None]:
# Define function to oversample minority classes
def oversample(df):
    max_size = df.iloc[:, 1:].sum(axis=0).max()
    df_list = [df]
    for class_index in range(1, df.shape[1]):
        class_df = df[df.iloc[:, class_index] == 1]
        oversampled_class_df = resample(class_df, replace=True, n_samples=max_size, random_state=42)
        df_list.append(oversampled_class_df)
    return pd.concat(df_list)

oversampled_final = oversample(final)

print("Original Length:", len(final))
print("Oversampled Length:", len(oversampled_final))

# Perform stratified split after oversampling
train_df, temp_df = train_test_split(oversampled_final, test_size=0.2, random_state=42, stratify=oversampled_final.iloc[:, 1:])
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42, stratify=temp_df.iloc[:, 1:])

print("Train:", len(train_df))
print("Validation:", len(val_df))
print("Test:", len(test_df))

Original Length: 3276
Oversampled Length: 15732
Train: 12585
Validation: 1573
Test: 1574
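Because the cell above oversamples before splitting, copies of the same tweet can land in more than one split. A minimal standard-library sketch of the alternative order (split first, then oversample only the training part); the function names and toy rows are hypothetical, and the notebook's own pipeline uses `resample` and `train_test_split` instead:

```python
import random

def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle once and cut the rows into disjoint train and held-out parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def oversample_train(train_rows, num_labels, seed=42):
    """Resample each label's rows, with replacement, up to the largest label count."""
    rng = random.Random(seed)
    per_label = [[r for r in train_rows if r[1][k] == 1] for k in range(num_labels)]
    target = max(len(group) for group in per_label)
    result = list(train_rows)
    for group in per_label:
        if group:  # a label with no training rows cannot be resampled
            result.extend(rng.choices(group, k=target))
    return result

# Toy rows: (text, [label_a, label_b]); label_b is the rare one.
rows = [(f"tweet {i}", [1, 1 if i % 5 == 0 else 0]) for i in range(20)]
train_rows, heldout_rows = split_rows(rows)
train_rows = oversample_train(train_rows, num_labels=2)

train_texts = {text for text, _ in train_rows}
heldout_texts = {text for text, _ in heldout_rows}
print(len(train_texts & heldout_texts))  # 0: no tweet is shared between the splits
```

With this order the held-out rows keep their original label distribution, so metrics computed on them are not inflated by duplicated training examples.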


## Modelling
The model used for this task is the pretrained DistilBERT together with its tokenizer.

DistilBERT is a smaller and faster distilled version of the popular pre-trained language model BERT that retains most of its performance, which makes it more suitable for deployment on devices with limited resources. Its reduced size also gives faster training and inference.

### Hyperparameters initialization

In [None]:
# HYPERPARAMETERS INITIALIZATION
# Load DistilBERT model and tokenizer
TOKENIZER = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
MODEL = DistilBertModel.from_pretrained('distilbert-base-uncased')

MAX_LEN = final['Cleaned Tweet'].str.split().str.len().max()
BATCH_SIZE = 8
EPOCHS = 5
LR = 1e-05

### Build dataset
Define a custom dataset that loads the cleaned text with its corresponding multilabel annotations and tokenizes each example on access.

In [None]:
NUM_LABELS = 9

class Build_Dataset(torch.utils.data.Dataset):
    def __init__(self, df):
        self.labels = df.iloc[:, 1:].values.astype(np.float32)  # Convert to float32 for torch
        self.texts = df['Cleaned Tweet'].tolist()

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        text = self.texts[idx]
        labels = self.labels[idx]
        encoding = TOKENIZER(text,
                             padding='max_length', max_length=MAX_LEN, truncation=True, return_tensors="pt")

        return {key: val.squeeze(0) for key, val in encoding.items()}, torch.tensor(labels, dtype=torch.float)

### Classifier
The model architecture is defined below: the DistilBERT encoder, followed by dropout, a linear layer, and a sigmoid over the nine labels.

In [None]:
class BertClassifier(nn.Module):
  def __init__(self, dropout=0.5):
    super(BertClassifier, self).__init__()
    self.bert = MODEL
    self.dropout = nn.Dropout(dropout)
    self.linear = nn.Linear(768, NUM_LABELS)  # one output per label (9)
    self.sigmoid = nn.Sigmoid()

  def forward(self, input_id, mask):
    output = self.bert(input_ids = input_id,
                                attention_mask = mask,
                                return_dict = False)
    pooled_output = output[0][:, 0, :]
    dropout_output = self.dropout(pooled_output)
    linear_output = self.linear(dropout_output)
    sigmoid_output = self.sigmoid(linear_output)

    return sigmoid_output

model = BertClassifier()
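Since `forward` ends in a sigmoid, the loss paired with this model has to expect probabilities rather than logits (in PyTorch, `nn.BCELoss` expects probabilities and `nn.BCEWithLogitsLoss` expects logits). A small standard-library sketch, with hypothetical helper names, of what happens when a logits-based binary cross-entropy is fed probabilities instead: the loss cannot fall below roughly 0.31 for a positive label or 0.69 for a negative one, however confident the prediction is.

```python
import math

def bce_on_probability(p, y):
    """Binary cross-entropy when p is already a probability in (0, 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_on_logit(x, y):
    """Binary cross-entropy when x is a raw logit (sigmoid applied inside the loss)."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

# A confident, correct prediction for a positive label.
print(round(bce_on_probability(0.999, 1), 3))  # 0.001
# A probability fed to a logits-based loss: even the extreme value 1.0 stays high.
print(round(bce_on_logit(1.0, 1), 3))          # 0.313
# For a negative label the best reachable input is 0.0, giving log(2).
print(round(bce_on_logit(0.0, 0), 3))          # 0.693
```

A loss that plateaus between these two floors while accuracy keeps rising is a typical symptom of this mismatch.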

### Accuracy Formula

In [None]:
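def calculate_accuracy(predict, labels, threshold = 0.5):
    # Binarize the predicted probabilities with the given threshold
    rounded_predict = (predict > threshold).float()
    correct = (rounded_predict == labels).float()
    # Element-wise accuracy: fraction of correct (sample, label) pairs
    accuracy = correct.sum() / (len(correct) * labels.shape[1])
    return accuracy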
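The accuracy above is element-wise: every (sample, label) pair counts separately, so the many correct negatives for rare labels raise the score. A minimal pure-Python sketch contrasting it with exact-match accuracy, where a sample counts only if all of its labels are right; the helper names and toy values are illustrative, not from the notebook:

```python
def elementwise_accuracy(probs, labels, threshold=0.5):
    """Fraction of individual (sample, label) decisions that are correct."""
    correct = total = 0
    for p_row, y_row in zip(probs, labels):
        for p, y in zip(p_row, y_row):
            correct += int((p > threshold) == bool(y))
            total += 1
    return correct / total

def exact_match_accuracy(probs, labels, threshold=0.5):
    """Fraction of samples whose whole label vector is predicted correctly."""
    hits = sum(
        all((p > threshold) == bool(y) for p, y in zip(p_row, y_row))
        for p_row, y_row in zip(probs, labels)
    )
    return hits / len(labels)

probs = [[0.9, 0.1, 0.2], [0.8, 0.7, 0.4], [0.2, 0.1, 0.6]]
labels = [[1, 0, 0], [1, 1, 1], [0, 0, 1]]

print(round(elementwise_accuracy(probs, labels), 3))  # 0.889
print(round(exact_match_accuracy(probs, labels), 3))  # 0.667
```

The single missed label in the second sample costs one ninth of the element-wise score but a full third of the exact-match score.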

### Function for Training the Model
For each epoch, the function runs a forward pass through the model to get the predictions, accumulates the total loss, backpropagates the loss, and updates the model weights with the optimizer. It then repeats the forward pass on the validation dataset with gradient calculation disabled.

In [None]:
def train(model, train_data, val_data, lr, dropout, epoch):
  train, val = Build_Dataset(train_data), Build_Dataset(val_data)

  train_loader = torch.utils.data.DataLoader(train,
                                             batch_size = BATCH_SIZE,
                                             shuffle = True)
  val_loader = torch.utils.data.DataLoader(val,
                                            batch_size = BATCH_SIZE)

  use_cuda = torch.cuda.is_available()
  device = torch.device("cuda" if use_cuda else "cpu")

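  # The model already applies a sigmoid, so use BCELoss (expects probabilities)
  criterion = nn.BCELoss()
  # Use the learning rate and dropout passed to this function
  optimizer = Adam(model.parameters(), lr = lr)
  model.dropout.p = dropout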

  if use_cuda:
    model = model.cuda()
    criterion = criterion.cuda()

  for i in range(epoch):
    total_loss_train = 0
    total_acc_train = 0
    model.train()
    train_predict, train_labels = [], []

    for train_input, train_label in tqdm(train_loader):
      train_label = train_label.to(device)
      mask = train_input['attention_mask'].to(device)
      input_id = train_input['input_ids'].to(device)

      output = model(input_id, mask)

      loss = criterion(output, train_label)
      total_loss_train += loss.item()
      acc = calculate_accuracy(output, train_label)
      total_acc_train += acc.item()

      model.zero_grad()
      loss.backward()
      optimizer.step()

      train_predict.extend(torch.round(output).detach().cpu().numpy())
      train_labels.extend(train_label.detach().cpu().numpy())

    total_loss_val = 0
    total_acc_val = 0
    model.eval()
    val_predict, val_labels = [], []

    with torch.no_grad():
      for val_input, val_label in val_loader:
          val_label = val_label.to(device)
          mask = val_input['attention_mask'].to(device)
          input_id = val_input['input_ids'].to(device)

          output = model(input_id, mask)

          loss = criterion(output, val_label)
          total_loss_val += loss.item()

          acc = calculate_accuracy(output, val_label)
          total_acc_val += acc.item()

          val_predict.extend(torch.round(output).detach().cpu().numpy())
          val_labels.extend(val_label.detach().cpu().numpy())

    print(f'Epochs: {i + 1} | Train Loss: {total_loss_train / len(train_loader): .3f} \ | Train Accuracy: {total_acc_train / len(train_loader): .3f} \ | Val Loss: {total_loss_val / len(val_loader): .3f} \ | Val Accuracy: {total_acc_val / len(val_loader): .3f}')

  print(f'Train Accuracy: {total_acc_train / len(train_loader): .3f}')
  print("Training Classification Report")
  print(classification_report(train_labels, train_predict,
                              target_names = train_data.columns[1:]))

  print(f'Val Accuracy: {total_acc_val / len(val_loader): .3f}')
  print("Validation Classification Report")
  print(classification_report(val_labels, val_predict,
                              target_names = val_data.columns[1:]))

  val_accuracy = total_acc_val/len(val_loader)
  return val_accuracy

### Evaluate the Model
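The function computes predictions with the trained model without gradient calculation, applies a dynamic threshold to each label (the mean predicted probability for that label within the batch), and converts the predictions to binary labels. The reported test accuracy is computed separately by `calculate_accuracy` with its fixed 0.5 threshold.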

In [None]:
def evaluation(model, test_data):
    test = Build_Dataset(test_data)
    test_loader = torch.utils.data.DataLoader(test, batch_size=BATCH_SIZE)

    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")

    if use_cuda:
        model = model.cuda()

    total_acc_test = 0
    test_predict, test_labels = [], []

    model.eval()
    with torch.no_grad():
        for test_input, test_label in test_loader:
            test_label = test_label.to(device)
            mask = test_input['attention_mask'].to(device)
            input_id = test_input['input_ids'].to(device)

            output = model(input_id, mask)

            # Calculate dynamic threshold for each label
            thresholds = torch.mean(output, dim=0)  # mean probability for each label
            thresholds = thresholds.cpu().numpy()

            # Round predictions using dynamic thresholds
            rounded_predictions = (output.cpu().numpy() > thresholds).astype(int)

            acc = calculate_accuracy(output, test_label)
            total_acc_test += acc.item()

            test_predict.extend(rounded_predictions)
            test_labels.extend(test_label.cpu().numpy())

    print(f'Test Accuracy: {total_acc_test / len(test_loader): .3f}')

    # Convert to numpy arrays for classification_report
    test_predict = np.array(test_predict)
    test_labels = np.array(test_labels)

    print("Test Classification Report")
    print(classification_report(test_labels, test_predict, target_names=test_data.columns[1:]))
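A per-batch mean threshold always marks the above-average scores in a batch as positive, even when every score for that label is low. A minimal pure-Python sketch of that effect with hypothetical helper names and toy probabilities; it mirrors the thresholding idea only, not the tensor code above:

```python
def batch_mean_thresholds(probs):
    """Mean predicted probability per label (column) within one batch."""
    n = len(probs)
    return [sum(column) / n for column in zip(*probs)]

def binarize(probs, thresholds):
    """Mark a label positive when its probability exceeds that label's threshold."""
    return [[int(p > t) for p, t in zip(row, thresholds)] for row in probs]

# One batch of three samples and two labels.
# Label 0 is clearly absent everywhere; label 1 is clearly present everywhere.
probs = [[0.01, 0.90],
         [0.02, 0.95],
         [0.06, 0.70]]

print(binarize(probs, batch_mean_thresholds(probs)))  # [[0, 1], [0, 1], [1, 0]]
print(binarize(probs, [0.5, 0.5]))                    # [[0, 1], [0, 1], [0, 1]]
```

With the batch-mean threshold, the third sample gains a false positive for label 0 and loses a true positive for label 1, which may contribute to low precision on rare labels.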

### Base Model

#### Training

In [None]:
train(model, train_df, val_df, LR, 0.5, EPOCHS)

100%|██████████| 1574/1574 [01:33<00:00, 16.76it/s]


Epochs: 1 | Train Loss:  0.638 \ | Train Accuracy:  0.819 \ | Val Loss:  0.615 \ | Val Accuracy:  0.871


100%|██████████| 1574/1574 [01:33<00:00, 16.75it/s]


Epochs: 2 | Train Loss:  0.602 \ | Train Accuracy:  0.908 \ | Val Loss:  0.583 \ | Val Accuracy:  0.948


100%|██████████| 1574/1574 [01:33<00:00, 16.81it/s]


Epochs: 3 | Train Loss:  0.583 \ | Train Accuracy:  0.948 \ | Val Loss:  0.574 \ | Val Accuracy:  0.966


100%|██████████| 1574/1574 [01:33<00:00, 16.88it/s]


Epochs: 4 | Train Loss:  0.574 \ | Train Accuracy:  0.966 \ | Val Loss:  0.569 \ | Val Accuracy:  0.973


100%|██████████| 1574/1574 [01:33<00:00, 16.86it/s]


Epochs: 5 | Train Loss:  0.570 \ | Train Accuracy:  0.972 \ | Val Loss:  0.567 \ | Val Accuracy:  0.977
Train Accuracy:  0.972
Training Classification Report
               precision    recall  f1-score   support

           HS       0.99      0.99      0.99     10685
      Abusive       0.99      0.97      0.98      7901
HS_Individual       0.98      0.98      0.98      6468
     HS_Group       0.99      0.98      0.98      4217
  HS_Religion       0.99      0.96      0.97      2408
      HS_Race       0.98      0.86      0.92      1772
  HS_Physical       1.00      0.19      0.31      1551
    HS_Gender       1.00      0.85      0.92      1544
     HS_Other       0.98      0.95      0.97      4462

    micro avg       0.99      0.94      0.96     41008
    macro avg       0.99      0.86      0.89     41008
 weighted avg       0.99      0.94      0.95     41008
  samples avg       0.89      0.85      0.86     41008

Val Accuracy:  0.977
Validation Classification Report
               



0.9769783237383357

#### Testing

In [None]:
evaluation(model, test_df)

Test Accuracy:  0.968
Test Classification Report
               precision    recall  f1-score   support

           HS       0.97      0.83      0.89      1336
      Abusive       0.98      0.93      0.96       989
HS_Individual       0.96      0.98      0.97       810
     HS_Group       0.91      0.98      0.94       526
  HS_Religion       0.74      0.99      0.85       301
      HS_Race       0.56      0.92      0.70       221
  HS_Physical       0.07      0.19      0.11       194
    HS_Gender       0.45      0.91      0.60       194
     HS_Other       0.92      0.98      0.95       558

    micro avg       0.81      0.90      0.85      5129
    macro avg       0.73      0.86      0.77      5129
 weighted avg       0.88      0.90      0.88      5129
  samples avg       0.76      0.81      0.77      5129





#### Analysis
**Accuracy:**
- Training accuracy is high (0.972), indicating the model fits the training data well.
- Validation accuracy is slightly higher (0.977). Because oversampling was applied before the split, duplicated tweets can appear in several splits, so this figure may overstate how well the model generalizes.
- Test accuracy (0.968) is close to the training and validation accuracies. Accuracy is computed per label, so the many correct negatives for rare labels keep it high even when those labels are predicted poorly.

**Precision and Recall:**
- On the training set, most labels have high precision and recall. The exception is HS_Physical, whose recall is only 0.19, so the model misses most tweets with this label.
- On the test set, precision drops for the rarer labels (HS_Religion 0.74, HS_Race 0.56, HS_Gender 0.45, HS_Physical 0.07), while recall stays high for all of them except HS_Physical (0.19).

**F1-Score:**
- Most labels have a high F1-score, which reflects a good balance between precision and recall and strong overall performance.
- The low F1-scores for HS_Physical in both reported sets (0.31 training, 0.11 test) and for HS_Gender in the test set (0.60) point to performance issues with these specific classes, where precision and recall are poorly balanced.

**Support:**
- Classes with higher support (e.g., HS, Abusive) tend to have more reliable performance metrics due to the larger number of samples.
- Conversely, classes with lower support (e.g., HS_Physical, HS_Gender) may have less stable performance metrics.

### Tuned Hyperparameters Model

#### Training

In [None]:
train(model, train_df, val_df, 5e-4, 0.2, 5)

100%|██████████| 1574/1574 [01:41<00:00, 15.46it/s]


Epochs: 1 | Train Loss:  0.568 \ | Train Accuracy:  0.976 \ | Val Loss:  0.565 \ | Val Accuracy:  0.981


100%|██████████| 1574/1574 [01:34<00:00, 16.68it/s]


Epochs: 2 | Train Loss:  0.566 \ | Train Accuracy:  0.978 \ | Val Loss:  0.564 \ | Val Accuracy:  0.981


100%|██████████| 1574/1574 [01:34<00:00, 16.74it/s]


Epochs: 3 | Train Loss:  0.565 \ | Train Accuracy:  0.979 \ | Val Loss:  0.564 \ | Val Accuracy:  0.982


100%|██████████| 1574/1574 [01:33<00:00, 16.83it/s]


Epochs: 4 | Train Loss:  0.565 \ | Train Accuracy:  0.980 \ | Val Loss:  0.563 \ | Val Accuracy:  0.983


100%|██████████| 1574/1574 [01:33<00:00, 16.85it/s]


Epochs: 5 | Train Loss:  0.565 \ | Train Accuracy:  0.980 \ | Val Loss:  0.562 \ | Val Accuracy:  0.984
Train Accuracy:  0.980
Training Classification Report
               precision    recall  f1-score   support

           HS       0.99      0.99      0.99     10685
      Abusive       0.99      0.98      0.99      7901
HS_Individual       0.98      0.98      0.98      6468
     HS_Group       0.99      0.99      0.99      4217
  HS_Religion       0.99      0.99      0.99      2408
      HS_Race       0.99      0.95      0.97      1772
  HS_Physical       1.00      0.19      0.32      1551
    HS_Gender       1.00      0.98      0.99      1544
     HS_Other       0.99      0.97      0.98      4462

    micro avg       0.99      0.95      0.97     41008
    macro avg       0.99      0.89      0.91     41008
 weighted avg       0.99      0.95      0.96     41008
  samples avg       0.90      0.87      0.88     41008

Val Accuracy:  0.984
Validation Classification Report
               



0.9837815844497656

#### Testing

In [None]:
evaluation(model, test_df)

Test Accuracy:  0.978
Test Classification Report
               precision    recall  f1-score   support

           HS       0.99      0.85      0.92      1336
      Abusive       0.98      0.96      0.97       989
HS_Individual       0.98      0.98      0.98       810
     HS_Group       0.94      0.99      0.97       526
  HS_Religion       0.75      0.99      0.85       301
      HS_Race       0.54      0.93      0.69       221
  HS_Physical       0.09      0.19      0.12       194
    HS_Gender       0.50      0.99      0.66       194
     HS_Other       0.95      0.99      0.97       558

    micro avg       0.83      0.91      0.87      5129
    macro avg       0.75      0.88      0.79      5129
 weighted avg       0.89      0.91      0.89      5129
  samples avg       0.79      0.83      0.80      5129





#### Analysis
The hyperparameter-tuned model produces results similar to those of the base model.

**Accuracy**
- Training accuracy is high (0.980), indicating the model fits the training data well.
- Validation accuracy is slightly higher (0.984).
- Test accuracy (0.978) is close to the training and validation accuracies, indicating consistent performance across the three splits.

**Precision and Recall**
- Most labels have high precision and recall, especially HS, Abusive, HS_Individual, HS_Group, and HS_Other in both the training and test reports. This suggests that the model is effective at minimizing false positives and identifying true positives for those labels.
- For HS_Physical, precision is very high in the training set (1.00) but drops sharply in the test set (0.09), indicating potential overfitting or a small number of true positives.
- HS_Physical also has low recall in both sets (0.19), which makes it the hardest label for the model to predict. HS_Gender keeps a high recall (0.99 on the test set), but its test precision falls to 0.50, which means many false positives. HS_Race (0.54) and HS_Religion (0.75) show the same pattern.

**F1-score**
- High F1-scores across many labels show that the model maintains a good balance between precision and recall.
- Lower F1-scores for HS_Physical (0.12) and HS_Gender (0.66) in the test set highlight difficulties in consistently identifying these labels correctly.

**Support**
- Labels with higher support (e.g., HS, Abusive) have more reliable performance metrics due to the larger number of samples.
- Labels with lower support (e.g., HS_Physical, HS_Gender) show more variability in performance, potentially due to fewer samples.


### Different Model

#### Adding 1 hidden layer

In [None]:
class BertClassifier2(nn.Module):
    def __init__(self, dropout=0.5):
        super(BertClassifier2, self).__init__()
        self.bert = MODEL
        self.dropout = nn.Dropout(dropout)
        self.hidden = nn.Linear(768, 256)
        self.activation = nn.ReLU()
        self.output = nn.Linear(256, 9)
        self.sigmoid = nn.Sigmoid()

    def forward(self, input_id, mask):
        last_hidden_state = self.bert(input_ids=input_id, attention_mask=mask, return_dict=False)[0]
        pooled_output = last_hidden_state[:, 0, :]
        hidden_output = self.hidden(pooled_output)
        activated_output = self.activation(hidden_output)
        dropout_output = self.dropout(activated_output)
        linear_output = self.output(dropout_output)
        sigmoid_output = self.sigmoid(linear_output)

        return sigmoid_output

model2 = BertClassifier2()

#### Training

In [None]:
train(model2, train_df, val_df, 5e-4, 0.2, 5)

100%|██████████| 1574/1574 [01:37<00:00, 16.14it/s]


Epochs: 1 | Train Loss:  0.662 \ | Train Accuracy:  0.800 \ | Val Loss:  0.623 \ | Val Accuracy:  0.857


100%|██████████| 1574/1574 [01:34<00:00, 16.61it/s]


Epochs: 2 | Train Loss:  0.617 \ | Train Accuracy:  0.871 \ | Val Loss:  0.605 \ | Val Accuracy:  0.894


100%|██████████| 1574/1574 [01:34<00:00, 16.61it/s]


Epochs: 3 | Train Loss:  0.602 \ | Train Accuracy:  0.899 \ | Val Loss:  0.590 \ | Val Accuracy:  0.919


100%|██████████| 1574/1574 [01:35<00:00, 16.51it/s]


Epochs: 4 | Train Loss:  0.591 \ | Train Accuracy:  0.919 \ | Val Loss:  0.583 \ | Val Accuracy:  0.934


100%|██████████| 1574/1574 [01:34<00:00, 16.68it/s]


Epochs: 5 | Train Loss:  0.585 \ | Train Accuracy:  0.931 \ | Val Loss:  0.579 \ | Val Accuracy:  0.941
Train Accuracy:  0.931
Training Classification Report
               precision    recall  f1-score   support

           HS       0.98      0.98      0.98     10685
      Abusive       0.98      0.96      0.97      7901
HS_Individual       0.97      0.97      0.97      6468
     HS_Group       0.98      0.96      0.97      4217
  HS_Religion       0.99      0.62      0.76      2408
      HS_Race       0.99      0.77      0.87      1772
  HS_Physical       0.00      0.00      0.00      1551
    HS_Gender       0.00      0.00      0.00      1544
     HS_Other       0.95      0.58      0.72      4462

    micro avg       0.98      0.83      0.90     41008
    macro avg       0.76      0.65      0.69     41008
 weighted avg       0.90      0.83      0.86     41008
  samples avg       0.88      0.76      0.81     41008

Val Accuracy:  0.941
Validation Classification Report
               



0.9412325377915685

#### Testing

In [None]:
evaluation(model2, test_df)

Test Accuracy:  0.932
Test Classification Report
               precision    recall  f1-score   support

           HS       0.97      0.82      0.89      1336
      Abusive       0.97      0.96      0.97       989
HS_Individual       0.95      0.98      0.97       810
     HS_Group       0.92      0.97      0.95       526
  HS_Religion       0.46      0.63      0.53       301
      HS_Race       0.45      0.81      0.58       221
  HS_Physical       0.02      0.04      0.02       194
    HS_Gender       0.06      0.19      0.09       194
     HS_Other       0.86      0.66      0.74       558

    micro avg       0.71      0.81      0.75      5129
    macro avg       0.63      0.67      0.64      5129
 weighted avg       0.83      0.81      0.81      5129
  samples avg       0.72      0.74      0.70      5129





#### Analysis
The model with one additional hidden layer performs worse than both the base model and the tuned model.

**Accuracy**
- Training accuracy (0.931) is lower than for the two previous models (0.972 and 0.980).
- Validation accuracy is slightly higher (0.941).
- Test accuracy (0.932) is close to the training and validation accuracies, but below the base (0.968) and tuned (0.978) models.

**Precision and Recall**
- The model performs well for the major labels (HS, Abusive, HS_Individual, HS_Group), with high precision, recall, and F1-scores in both the training and test reports.
- For HS_Physical and HS_Gender, precision, recall, and F1-score are 0.00 on the training set and at most 0.19 on the test set, so the model essentially fails to learn these labels. Test recall also drops for HS_Religion (0.63) and HS_Other (0.66).

**F1-score**
- High F1-scores for the major labels show that the model maintains a good balance between precision and recall for them.
- Very low F1-scores for HS_Physical (0.02) and HS_Gender (0.09) in the test set highlight difficulties in identifying these labels correctly.

**Support**
- Labels with higher support (e.g., HS, Abusive) have more reliable performance metrics due to the larger number of samples.
- Labels with lower support (e.g., HS_Physical, HS_Gender) show more variability in performance, potentially due to fewer samples.


## Summary
Of the three models that were tried, **the second model (DistilBERT with tuned hyperparameters) emerges as the best performer overall**, with the highest test accuracy (0.978) and consistent performance across the major labels.

However, **significant work is needed to address the low performance in the HS_Physical and HS_Gender** categories across all models. Enhancing the training data and using more advanced techniques, such as data augmentation or feature engineering, could improve overall model performance.

We can also employ other techniques such as SMOTE or class-weight adjustments to improve recall and precision for underrepresented classes.
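As a rough illustration of the class-weight idea, the sketch below derives one weight per label as the ratio of negatives to positives, using the label counts and the cleaned row count printed earlier in the notebook. Weights of this kind could be passed as `pos_weight` to `torch.nn.BCEWithLogitsLoss`, which would require the classifier to return logits rather than sigmoid outputs. The helper name is hypothetical, and in practice the counts would be taken from the training split only.

```python
# Positive counts per label and the cleaned row count, as printed in the notebook.
NUM_ROWS = 3276
positives = {
    "HS": 1384, "Abusive": 1295, "HS_Individual": 908, "HS_Group": 476,
    "HS_Religion": 212, "HS_Race": 140, "HS_Physical": 75, "HS_Gender": 69,
    "HS_Other": 932,
}

def positive_weights(positives, num_rows):
    """Weight each label by its negative-to-positive ratio."""
    return {label: (num_rows - count) / count for label, count in positives.items()}

weights = positive_weights(positives, NUM_ROWS)

# Rarest labels first: they receive the largest weights.
for label, weight in sorted(weights.items(), key=lambda item: item[1], reverse=True):
    print(f"{label}: {weight:.2f}")
```

The rarest labels (HS_Gender and HS_Physical) receive weights above 40, while HS and Abusive stay below 2, so errors on the rare labels contribute far more to the loss.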

