# Hate Speech Detection
This notebook performs hate speech detection on a dataset of tweets. The pipeline is as follows:
- Load & preprocess data
- Fine-tune a pretrained RoBERTa model (`roberta-large`) for hate speech classification
- Evaluate the model on the test set
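
The preprocessing itself lives in `src.utils` and is not shown in this notebook. As a rough illustration of what a tweet-normalization step could look like, here is a minimal sketch; the function name `normalize_tweet` and the placeholder tokens `<url>` and `<user>` are assumptions made for this example, not the project's actual rules.

```python
import re

# Hypothetical normalization rules; the real preprocessing in src.utils may differ.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")

def normalize_tweet(text: str) -> str:
    """Replace URLs and user mentions with placeholders and collapse whitespace."""
    text = URL_RE.sub("<url>", text)
    text = MENTION_RE.sub("<user>", text)
    return WHITESPACE_RE.sub(" ", text).strip()

example = "RT @someone:   check this out https://t.co/abc123 "
print(normalize_tweet(example))  # RT <user>: check this out <url>
```

Replacing mentions and links with fixed tokens keeps the tokenizer from spending sequence length on identifiers that carry little signal for classification.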

## Setup

### Imports

In [1]:
# imports
# standard
import os
import numpy as np
import pandas as pd
from tqdm import tqdm
from dotenv import load_dotenv
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'    # for debugging
load_dotenv()
import torch
# from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer, AutoModel
# custom
from src.utils import *
from src.models import HateSpeechClassifier
from src.train import train_pl
%load_ext autoreload
%autoreload 2



In [2]:
# set device
# device = 'cuda' if torch.cuda.is_available() else 'cpu'
# print(f'using device: {device}')

## Train model

In [2]:
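def get_model_memory_usage(model):
    """Return the memory taken by the model's parameters and buffers, in MB."""
    # element_size() gives the bytes per element for each tensor's actual dtype
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_bytes = sum(b.numel() * b.element_size() for b in model.buffers())
    return (param_bytes + buffer_bytes) / (1024 ** 2)  # convert bytes to MB

def get_batch_memory_usage(dataloader):
    """Return the memory taken by the first batch's input tensors, in MB."""
    batch = next(iter(dataloader))  # only the first batch is inspected
    input_ids = batch['input_ids']
    attention_mask = batch['attention_mask']

    # token IDs and masks are usually int64, not float32, so use the real element size
    batch_bytes = (input_ids.numel() * input_ids.element_size()
                   + attention_mask.numel() * attention_mask.element_size())
    return batch_bytes / (1024 ** 2)  # convert bytes to MB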
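
The bytes-per-element factor matters for these estimates: token IDs and attention masks returned by Hugging Face tokenizers as PyTorch tensors are typically `int64` (8 bytes per element), while model weights are usually `float32` (4 bytes). The following sketch works through the arithmetic in plain Python for the batch size and maximum length used in the training cell (64 and 256); the helper name `batch_memory_mb` and the dtype table are illustrative assumptions.

```python
# Bytes per element for a few common tensor dtypes.
BYTES_PER_ELEMENT = {"int64": 8, "int32": 4, "float32": 4, "float16": 2}

def batch_memory_mb(batch_size: int, seq_len: int,
                    dtype: str = "int64", num_tensors: int = 2) -> float:
    """Estimate the size in MB of `num_tensors` tensors of shape (batch_size, seq_len)."""
    num_elements = batch_size * seq_len * num_tensors
    return num_elements * BYTES_PER_ELEMENT[dtype] / (1024 ** 2)

# input_ids and attention_mask for 64 sequences padded to 256 tokens
print(batch_memory_mb(64, 256, "int64"))    # 0.25
print(batch_memory_mb(64, 256, "float32"))  # 0.125
```

Either way the input tensors are tiny next to the model weights; activations and optimizer state, which this estimate ignores, dominate GPU memory during training.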

In [16]:
del model
del train_loader, val_loader, test_loader
torch.cuda.empty_cache()

In [3]:
# clear GPU memory
torch.cuda.empty_cache()

# create data loaders
data_path = 'data/hs_davidson2017.csv'
bsz = 64
max_len = 256
num_workers = 12
tokenizer = AutoTokenizer.from_pretrained('roberta-large')
train_loader, val_loader, test_loader = create_data_loaders(data_path, tokenizer, max_len, bsz, num_workers)

# create the model
model = HateSpeechClassifier('roberta-large', num_labels=2)
# model = model.to(device)

# check gpu memory usage
# model_memory = get_model_memory_usage(model)
# print(f"Model Memory Usage: {model_memory:.2f} MB")
# train_batch_memory = get_batch_memory_usage(train_loader)
# val_batch_memory = get_batch_memory_usage(val_loader)
# print(f"Train Batch Memory Usage: {train_batch_memory:.2f} MB")
# print(f"Validation Batch Memory Usage: {val_batch_memory:.2f} MB")

# training args
args = {'num_epochs': 10, 'patience': 3, 'ckpt_name': 'ckpt_best_new', 'precision': '16-mixed'}

# train
trainer, _ = train_pl(model, train_loader, val_loader, args)

# test (Lightning 2.x takes `dataloaders`; the older `test_dataloaders` keyword was removed)
trainer.test(dataloaders=test_loader)

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Using 16bit Automatic Mixed Pr


trainer args:
state: TrainerState(status=<TrainerStatus.INITIALIZING: 'initializing'>, fn=None, stage=None)
barebones: False
_data_connector: <pytorch_lightning.trainer.connectors.data_connector._DataConnector object at 0x7fb9b0708b90>
_accelerator_connector: <pytorch_lightning.trainer.connectors.accelerator_connector._AcceleratorConnector object at 0x7fb9b296ef50>
_logger_connector: <pytorch_lightning.trainer.connectors.logger_connector.logger_connector._LoggerConnector object at 0x7fb9b09e0210>
_callback_connector: <pytorch_lightning.trainer.connectors.callback_connector._CallbackConnector object at 0x7fb9aff45890>
_checkpoint_connector: <pytorch_lightning.trainer.connectors.checkpoint_connector._CheckpointConnector object at 0x7fb9b04ab5d0>
_signal_connector: <pytorch_lightning.trainer.connectors.signal_connector._SignalConnector object at 0x7fb9b04abed0>
fit_loop: <pytorch_lightning.loops.fit_loop._FitLoop object at 0x7fb9b04ab750>
validate_loop: <pytorch_lightning.loops.evaluatio

You are using a CUDA device ('NVIDIA A100-SXM4-40GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name     | Type               | Params
------------------------------------------------
0 | model    | RobertaModel       | 355 M 
1 | fc       | Linear             | 2.0 K 
2 | loss     | CrossEntropyLoss   | 0     
3 | accuracy | MulticlassAccuracy | 0     
------------------------------------------------
355 M     Trainable params
0         Non-trainable params
355 M     Total params
1,421.447 Total estimated model params size (MB)
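
The figures in the model summary above can be cross-checked with a little arithmetic. This sketch assumes the reported size uses decimal megabytes and 4 bytes per parameter, and that the `fc` head maps the 1024-dimensional hidden state of `roberta-large` to the 2 labels; both assumptions are consistent with the numbers shown but are not confirmed by the notebook.

```python
# Values read from the model summary.
reported_size_mb = 1421.447   # "Total estimated model params size (MB)"
bytes_per_param = 4           # float32 weights

# Assumption: 1 MB = 1e6 bytes in the summary.
total_params = reported_size_mb * 1e6 / bytes_per_param
print(f"{total_params / 1e6:.1f} M parameters")  # 355.4 M parameters

# Linear head from a 1024-dim hidden state to 2 labels: weights plus biases.
hidden_size, num_labels = 1024, 2
fc_params = hidden_size * num_labels + num_labels
print(fc_params)  # 2050
```

The result agrees with the "355 M" total and the "2.0 K" reported for `fc`.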


Epoch 0: 100%|██████████| 310/310 [02:46<00:00,  1.87it/s, v_num=2]        

Metric val_loss improved. New best score: 0.210


Epoch 4:  66%|██████▋   | 206/310 [01:46<00:53,  1.94it/s, v_num=2]
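
The `patience` entry in `args`, together with the "Metric val_loss improved" message, suggests that `train_pl` applies early stopping on validation loss. Since `train_pl` is not shown here, the following is only a plain-Python sketch of how patience-based stopping generally works; the function name `early_stop_epoch` and the loss values are made up for illustration.

```python
def early_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation losses, for illustration only.
losses = [0.30, 0.25, 0.26, 0.27, 0.28, 0.20]
print(early_stop_epoch(losses, patience=3))  # 4
```

With `patience=3`, training stops after three consecutive epochs without improvement, so the later drop to 0.20 in this toy sequence is never reached.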

In [9]:
torch.cuda.empty_cache()

## Evaluate model

In [14]:
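# load the best model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = HateSpeechClassifier('roberta-large', num_labels=2)
checkpoint = torch.load('checkpoints/ckpt_best.pt', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model = model.to(device)
# evaluate on the test set; loss_fn is assumed to match the CrossEntropyLoss used in training
loss_fn = torch.nn.CrossEntropyLoss()
test_loss, test_accuracy = evaluate(model, test_loader, loss_fn, device)
print(f'test loss: {test_loss:.4f}, test acc: {test_accuracy:.4f}')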

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


test loss: 0.2208, test acc: 0.9423
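
Accuracy alone can hide poor performance on the minority class if the labels are imbalanced, so it may be worth reporting precision, recall, and F1 as well. The sketch below computes them in plain Python from lists of true and predicted labels; the helper `binary_metrics` and the toy labels are assumptions for illustration and do not reflect this model's predictions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example: a classifier that always predicts the majority class.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
y_pred = [0] * 10
metrics = binary_metrics(y_true, y_pred)
print(metrics["accuracy"], metrics["recall"])  # 0.8 0.0
```

In the toy example the classifier scores 80% accuracy while finding none of the positive examples, which is the failure mode that recall and F1 expose.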
