# Hate Speech Detection
This notebook trains and evaluates a binary hate speech classifier on a dataset of tweets (`data/hs_davidson2017.csv`). The pipeline is as follows:
- Load & preprocess data
- Fine-tune a pretrained RoBERTa model for hate speech classification
- Evaluate the model on the test set

## 1. Setup

### 1.1 Imports

In [1]:
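# imports
# standard library
import os
# third-party
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.model_selection import train_test_split
import torch
from torch import nn  # imported explicitly because the training cell uses nn.CrossEntropyLoss
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModel
# custom helpers; the cells below rely on this module for
# create_data_loaders, HateSpeechClassifier, train and evaluate
from utils import *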



In [2]:
# set device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'using device: {device}')
# autoreload modules
%load_ext autoreload
%autoreload 2

using device: cuda


### 1.2 Load data

In [3]:
# Load data
df = pd.read_csv('data/hs_davidson2017.csv')

# Preprocessing
# Binary target: 1 if the tweet's class is 0 (hate speech), 0 for every other class
df['label'] = (df['class'] == 0).astype(int)
df = df[['tweet', 'label']]  # Keep only the text and the binary label

# Split the dataset into train, validation and test
train_val_data, test_data = train_test_split(df, test_size=0.2, random_state=42, stratify=df.label)
train_data, val_data = train_test_split(train_val_data, test_size=0.25, random_state=42, stratify=train_val_data.label)  # 0.25 x 0.8 = 0.2
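
The two-stage split above carves out the test set first and then takes a quarter of the remainder for validation. As a quick check of the arithmetic in the inline comment, the resulting proportions can be worked out with exact fractions (a small sketch; the variable names are only illustrative):

```python
from fractions import Fraction

test_frac = Fraction(1, 5)                    # test_size=0.2 of the full dataset
train_val_frac = 1 - test_frac                # 0.8 remains for train + validation
val_frac = Fraction(1, 4) * train_val_frac    # test_size=0.25 of that remainder
train_frac = train_val_frac - val_frac        # whatever is left is the training set

print(f'train={float(train_frac)}, val={float(val_frac)}, test={float(test_frac)}')
# prints: train=0.6, val=0.2, test=0.2
```

Because both calls pass `stratify`, each of the three parts should keep roughly the same share of positive labels as the full dataset.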

In [8]:
# Tokenizer and data loader settings
BATCH_SIZE = 64
MAX_LEN = 128  # maximum sequence length in tokens
tokenizer = AutoTokenizer.from_pretrained('roberta-large')
# Create data loaders
train_data_loader, val_data_loader, test_data_loader = create_data_loaders(train_data, val_data, test_data, tokenizer, MAX_LEN, BATCH_SIZE)
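
`create_data_loaders` comes from the local `utils` module, which is not shown in this notebook. The sketch below is one plausible way such a helper could be built, not the actual implementation: `TweetDataset`, `make_loader` and the batch keys `input_ids`, `attention_mask` and `labels` are all assumptions made for illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class TweetDataset(Dataset):
    """Hypothetical dataset that tokenizes one tweet per item."""

    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts, self.labels = list(texts), list(labels)
        self.tokenizer, self.max_len = tokenizer, max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        enc = self.tokenizer(
            self.texts[idx],
            max_length=self.max_len,
            padding='max_length',  # pad every tweet to max_len tokens
            truncation=True,       # cut longer tweets at max_len tokens
            return_tensors='pt',
        )
        return {
            # squeeze(0) drops the batch dimension the tokenizer adds
            'input_ids': enc['input_ids'].squeeze(0),
            'attention_mask': enc['attention_mask'].squeeze(0),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long),
        }


def make_loader(frame, tokenizer, max_len, batch_size, shuffle):
    """Hypothetical helper: wrap the 'tweet' and 'label' columns in a DataLoader."""
    dataset = TweetDataset(frame['tweet'], frame['label'], tokenizer, max_len)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)
```

Under these assumptions, `make_loader(train_data, tokenizer, MAX_LEN, BATCH_SIZE, shuffle=True)` would give a shuffled training loader, while the validation and test loaders would normally use `shuffle=False`.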

## 2. Train model

In [11]:
# clear GPU memory
torch.cuda.empty_cache()

# create the model
model = HateSpeechClassifier('roberta-large', num_labels=2)
model = model.to(device)

# define the loss function and optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=1e-4)

# training loop
NUM_EPOCHS = 10
best_val_acc = train(
    model, train_data_loader, val_data_loader, loss_fn, optimizer, device, NUM_EPOCHS,
    patience=5, accumulation_steps=10, resume_ckpt='checkpoints/ckpt_best.pt',
)
print(f'best val acc: {best_val_acc:.4f}')

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


 26%|██▌       | 61/233 [00:56<02:40,  1.07it/s]
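
The `train` helper lives in `utils` and is not shown, but its `accumulation_steps` and `patience` arguments suggest gradient accumulation and patience-based early stopping. With `BATCH_SIZE = 64` and `accumulation_steps=10`, the effective batch size would be 640, so the 233 batches per epoch shown in the progress bar would translate into roughly 24 optimizer updates. The loop below is a minimal sketch of both techniques under those assumptions; `train_sketch`, `val_acc_fn` and the `(inputs, labels)` batch format are hypothetical and are not the notebook's actual `train`.

```python
import torch
from torch import nn


def train_sketch(model, train_batches, val_acc_fn, loss_fn, optimizer,
                 num_epochs, patience, accumulation_steps):
    """Hypothetical loop: gradient accumulation plus early stopping on validation accuracy."""
    best_val_acc, epochs_without_gain = 0.0, 0
    for epoch in range(num_epochs):
        model.train()
        optimizer.zero_grad()
        for step, (inputs, labels) in enumerate(train_batches, start=1):
            loss = loss_fn(model(inputs), labels)
            # Scale the loss so the summed gradients average over the micro-batches
            # (a shorter final group is slightly under-weighted by this fixed scaling)
            (loss / accumulation_steps).backward()
            # Update weights every accumulation_steps batches, and once more at epoch end
            if step % accumulation_steps == 0 or step == len(train_batches):
                optimizer.step()
                optimizer.zero_grad()
        val_acc = val_acc_fn(model)
        if val_acc > best_val_acc:
            best_val_acc, epochs_without_gain = val_acc, 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_val_acc


# Toy usage on random data, only to show the call shape
torch.manual_seed(0)
toy_model = nn.Linear(4, 2)
toy_batches = [(torch.randn(8, 4), torch.randint(0, 2, (8,))) for _ in range(5)]


def toy_val_acc(model):
    model.eval()
    with torch.no_grad():
        inputs, labels = toy_batches[0]
        return (model(inputs).argmax(dim=1) == labels).float().mean().item()


best = train_sketch(toy_model, toy_batches, toy_val_acc, nn.CrossEntropyLoss(),
                    torch.optim.AdamW(toy_model.parameters(), lr=1e-2),
                    num_epochs=3, patience=2, accumulation_steps=2)
print(f'best val acc: {best:.4f}')
```

Accumulation trades wall-clock time for memory: each forward and backward pass only holds one 64-example batch on the GPU, while the weights are updated as if the batch were ten times larger.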

## 3. Evaluate model

In [None]:
# Evaluation on test set
test_loss, test_accuracy = evaluate(model, test_data_loader, loss_fn, device)
print(f'Test Loss: {test_loss:.4f}, Test Acc: {test_accuracy:.4f}')
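
`evaluate` is also imported from `utils`. Going by its call signature and its two return values, it plausibly averages the loss and the accuracy over the loader with gradients disabled. The function below sketches that idea; `evaluate_sketch`, `ToyClassifier` and the batch keys `input_ids`, `attention_mask` and `labels` are assumptions for illustration and may differ from the real helper.

```python
import torch
from torch import nn


def evaluate_sketch(model, data_loader, loss_fn, device):
    """Hypothetical stand-in for utils.evaluate: mean loss and accuracy over a loader."""
    model.eval()
    total_loss, correct, seen = 0.0, 0, 0
    with torch.no_grad():  # no gradients are needed for evaluation
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            logits = model(input_ids, attention_mask)
            # Weight each batch's mean loss by its size so a short last batch counts fairly
            total_loss += loss_fn(logits, labels).item() * labels.size(0)
            correct += (logits.argmax(dim=1) == labels).sum().item()
            seen += labels.size(0)
    return total_loss / seen, correct / seen


class ToyClassifier(nn.Module):
    """Tiny stand-in model so the sketch runs without downloading RoBERTa."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(10, 4)
        self.head = nn.Linear(4, 2)

    def forward(self, input_ids, attention_mask):
        mask = attention_mask.unsqueeze(-1).float()
        # Mean-pool token embeddings over the unmasked positions
        pooled = (self.embed(input_ids) * mask).sum(dim=1) / mask.sum(dim=1)
        return self.head(pooled)


toy_batch = {
    'input_ids': torch.randint(0, 10, (6, 5)),
    'attention_mask': torch.ones(6, 5, dtype=torch.long),
    'labels': torch.randint(0, 2, (6,)),
}
toy_loss, toy_acc = evaluate_sketch(ToyClassifier(), [toy_batch], nn.CrossEntropyLoss(), 'cpu')
print(f'Toy Loss: {toy_loss:.4f}, Toy Acc: {toy_acc:.4f}')
```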