# Project 2 NLP: Hatespeech Classifier

## Authors:

Adrian Obermühlner & Freja Rasmussen

## Resarch Question:

How do different preprocessing methods (nothing, stop word removal, lemming, stemming,…) affect the result of a hate speech classifier?

## Imports

In [1]:
# Imports
import pandas as pd
import numpy as np
from datasets import Dataset
import torch

# Preprocessing imports
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Tokenizing
import gensim.downloader as api


from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset
import gensim.downloader as api


from tqdm import tqdm

from torch.utils.data import TensorDataset, DataLoader
import torch.nn as nn
import torch.optim as optim
from gensim.models import KeyedVectors
from gensim.models.keyedvectors import Vocab


In [2]:
if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 1 GPU(s) available.
Device name: NVIDIA GeForce GTX 1650 Ti


## Data Import


In [3]:
RANDOM_SEED = 42
BINARY_LABEL = "is_hate"
CATEGORIES = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

np.random.seed(RANDOM_SEED)  # set random seed for reproducibility
# Make the labels into hate and no hate as 1 and 0

def binarize_labels(df):
    return (df[CATEGORIES].sum(axis=1) > 0).astype(int)

data_train = pd.read_csv("./data/train/train.csv", index_col=0)
data_train[BINARY_LABEL] = binarize_labels(data_train)

data_test = pd.read_csv("./data/test/test.csv", index_col=0).join(
    pd.read_csv("./data/test_labels/test_labels.csv", index_col=0)
)
data_test.drop(data_test[data_test["toxic"] == -1].index, inplace=True)
data_test[BINARY_LABEL] = binarize_labels(data_test)

In [4]:
data_train['comment_text'].head(10)

id
0000997932d777bf    Explanation\nWhy the edits made under my usern...
000103f0d9cfb60f    D'aww! He matches this background colour I'm s...
000113f07ec002fd    Hey man, I'm really not trying to edit war. It...
0001b41b1c6bb37e    "\nMore\nI can't make any real suggestions on ...
0001d958c54c6e35    You, sir, are my hero. Any chance you remember...
00025465d4725e87    "\n\nCongratulations from me as well, use the ...
0002bcb3da6cb337         COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK
00031b1e95af7921    Your vandalism to the Matt Shirvington article...
00037261f536c51d    Sorry if the word 'nonsense' was offensive to ...
00040093b2687caa    alignment on this subject and which are contra...
Name: comment_text, dtype: object

In [5]:
data_test.head(10)

Unnamed: 0_level_0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,is_hate
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0001ea8717f6de06,Thank you for understanding. I think very high...,0,0,0,0,0,0,0
000247e83dcc1211,:Dear god this site is horrible.,0,0,0,0,0,0,0
0002f87b16116a7f,"""::: Somebody will invariably try to add Relig...",0,0,0,0,0,0,0
0003e1cccfd5a40a,""" \n\n It says it right there that it IS a typ...",0,0,0,0,0,0,0
00059ace3e3e9a53,""" \n\n == Before adding a new product to the l...",0,0,0,0,0,0,0
000663aff0fffc80,this other one from 1897,0,0,0,0,0,0,0
000689dd34e20979,== Reason for banning throwing == \n\n This ar...,0,0,0,0,0,0,0
000844b52dee5f3f,|blocked]] from editing Wikipedia. |,0,0,0,0,0,0,0
00091c35fa9d0465,"== Arabs are committing genocide in Iraq, but ...",1,0,0,0,0,0,1
000968ce11f5ee34,Please stop. If you continue to vandalize Wiki...,0,0,0,0,0,0,0


In [6]:
# get the distribution of the labels to see if roughly similar for both

is_hate_count_train = data_train['is_hate'].value_counts()
ratio_train = is_hate_count_train/ len(data_train)

is_hate_count_test = data_test['is_hate'].value_counts()
ratio_test = is_hate_count_test/ len(data_test)

print('Ratio of no/is hate for train set: ', ratio_train)
print('Ratio of no/is hate for test set: ', ratio_test)

Ratio of no/is hate for train set:  0    0.898321
1    0.101679
Name: is_hate, dtype: float64
Ratio of no/is hate for test set:  0    0.90242
1    0.09758
Name: is_hate, dtype: float64


## Representation

## Data Preprocessing

**Note**: We would need to make a loop for the different combinations of 
preprocessing (none, only stemming, only lemming, only stop word removal and every combination of this)
Either as coloumns that can be used to iterate over for the model training and validation, or make the preprocessing
and then go further and repeat from beginning.


In [7]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    
    # Lowercase all text    
    text = text.lower()
    
    # Tokenize the text into words
    tokens = word_tokenize(text)
    
    # Remove stopwords and apply stemming
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    
    # Lemming of words
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens if word.isalpha()]
    
    # Join the stemmed words back into a sentence
    return ' '.join(lemmatized_tokens)

data_train['comment_text_clean'] = data_train['comment_text'].apply(preprocess_text)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\flras\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\flras\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\flras\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [8]:
# Train and Test set
X_train = data_train["comment_text_clean"]
y_train = data_train["is_hate"]

X_test = data_test["comment_text"]
y_test = data_test["is_hate"]

In [9]:
maxLen = 0

for comment in data_train['comment_text_clean']:
    length = len(comment.split())
    if(length > maxLen):
        maxLen = length

print(maxLen)

1250


## Word Embedding

**Note**: How is word embedding implemented? Or is it better to use TF-IDF?

**Additional Notes and Answers**: Tokenizing with TF-IDF or alternatively
Look at example of word embedding with cnn for pytorch or tensorflow
fast text für word embedding from facebook, maybe shorten down to only words in data set, don't forget to add padding token
Tokenize the comment, hand the vector with padding(for longest comment in batch so all have same size) to CNN
in CNN first layer is embedding, this layer genrates matrix with size vocab dimensions x length of longest comment
Then use CNN as desired for output with softmax at the end for the classification

In [10]:
fasttext_model = api.load('fasttext-wiki-news-subwords-300')


In [11]:
x_train_short = data_train["comment_text_clean"][:1000]

In [12]:

# Step 1: Extract unique words from your dataset
unique_words = set()
for sentence in x_train_short:
    unique_words.update(sentence.split())

# Step 2: Filter FastText model
filtered_embeddings = {}
for word in unique_words:
    if word in fasttext_model:
        filtered_embeddings[word] = fasttext_model[word]

# Step 3: Create a new FastText model using only filtered embeddings
filtered_fasttext_model = KeyedVectors(vector_size=300)
word_to_index = {word: i for i, word in enumerate(filtered_embeddings.keys())}
index_to_word = list(filtered_embeddings.keys())
vectors = np.array(list(filtered_embeddings.values()))

# Assign filtered embeddings to the new model
filtered_fasttext_model.add_vectors(keys=index_to_word, weights=vectors, replace=False)


In [13]:
import gc
del fasttext_model
gc.collect()

0

In [15]:
# Tokenize and convert text data to numerical sequences
def wordVectors(sentence):
    # Tokenize the sentence and filter out words not in the FastText model
    sentence_tokens = [word for word in sentence.split() if word in filtered_fasttext_model]
    
    # Convert each token to its FastText embedding
    sentence_embeddings = [filtered_fasttext_model[word] for word in sentence_tokens]
    
    # Pad the sentence embeddings to the maximum length
    padded_embeddings = sentence_embeddings + [np.zeros(300)] * (maxLen - len(sentence_tokens))
    
    # Convert the padded embeddings to a single numpy array
    padded_embeddings_np = np.array(padded_embeddings, dtype=np.float32)
    
    # Convert the numpy array to a PyTorch tensor
    X_tensor = torch.tensor(padded_embeddings_np)
    
    return X_tensor

# Apply wordVectors to each item in X_train with progress bar
X_train_tensors = []
for sentence in tqdm(x_train_short, desc="Processing sentences"):
    X_train_tensors.append(wordVectors(sentence))

Processing sentences: 100%|██████████| 1000/1000 [00:02<00:00, 475.97it/s]


In [16]:
X_train_tensors[0].shape

torch.Size([1250, 300])

In [17]:
X_test_tensors = []
for sentence in tqdm(X_test[:1000], desc="Processing sentences"):
    X_test_tensors.append(wordVectors(sentence))

Processing sentences: 100%|██████████| 1000/1000 [00:01<00:00, 615.39it/s]


In [18]:
filtered_fasttext_model.vectors.shape

(7195, 300)

In [28]:
# Define batch size
batch_size = 5

# Initialize an empty list to store batches of tensors
X_train_batches = []

# Process the data in batches
for i in range(0, len(X_train_tensors), batch_size):
    batch = X_train_tensors[i:i+batch_size]  # Extract a batch of tensors
    stacked_batch = torch.stack(batch)  # Stack the batch of tensors
    X_train_batches.append(stacked_batch)  # Add the batch to the list

# Initialize an empty list to store batches of tensors
X_test_batches = []

# Process the data in batches
for i in range(0, len(X_test_tensors), batch_size):
    batch = X_test_tensors[i:i+batch_size]  # Extract a batch of tensors
    stacked_batch = torch.stack(batch)  # Stack the batch of tensors
    X_test_batches.append(stacked_batch)  # Add the batch to the list

# Convert y_train to a tensor
y_train_tensor = torch.tensor(y_train[:1000], dtype=torch.float32)
y_test_tensor = torch.tensor(y_train[:1000], dtype=torch.float32)

# Combine the list of batches into a single tensor
X_train_tensor = torch.cat(X_train_batches, dim=0)
X_test_tensor = torch.cat(X_train_batches, dim=0)

# Move tensors to the GPU if availableX_train_tensors = [tensor.to(device) for tensor in X_train_tensors]

# X_train_tensors = [tensor.to(device) for tensor in X_train_tensors]

# # Combine the list of tensors into a single tensor
# X_train_tensor = torch.cat(X_train_tensors, dim=0)
# X_test_tensors = [tensor.to(device) for tensor in X_test_tensors]

# # Combine the list of tensors into a single tensor
# X_test_tensor = torch.cat(X_test_tensors, dim=0)

# Define data loader
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# Define data loader for test data
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Define CNN model
class CNN(nn.Module):
    def __init__(self, embedding_dim, num_filters, filter_sizes, output_dim):
        super(CNN, self).__init__()
        self.embedding = nn.Embedding.from_pretrained(torch.FloatTensor(filtered_fasttext_model.vectors))
        self.convs = nn.ModuleList([nn.Conv1d(in_channels=embedding_dim, out_channels=num_filters, kernel_size=fs) for fs in filter_sizes])
        self.fc = nn.Linear(num_filters * len(filter_sizes), output_dim)

    def forward(self, x):
        embedded = self.embedding(x.permute(0, 2, 1).long())  # Permute dimensions for Conv1d
        conved = [torch.relu(conv(embedded)).squeeze(3) for conv in self.convs]
        pooled = [torch.max(conv, dim=2)[0] for conv in conved]
        cat = torch.cat(pooled, dim=1)
        output = self.fc(cat)
        return output

# Initialize model and move it to the GPU if available
model = CNN(300, 60, [3, 4, 5], 2)  # Assuming 300 is the embedding dimension
model.to(device)

# Define optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=0.1)
criterion = nn.BCEWithLogitsLoss()

# Training loop
model.train()
for epoch in range(10):
    for batch in train_loader:
        optimizer.zero_grad()
        inputs, targets = batch
        inputs, targets = inputs.to(device), targets.to(device)  # Move inputs and targets to GPU
        outputs = model(inputs)
        loss = criterion(outputs.squeeze(1), targets)
        loss.backward()
        optimizer.step()
    print(f'Epoch [{epoch+1}/{10}], Loss: {loss.item()}')

# Evaluation (Assuming you have defined test_loader similarly)
model.eval()
with torch.no_grad():
    total_correct = 0
    total_samples = 0
    for batch in test_loader:
        inputs, targets = batch
        inputs, targets = inputs.to(device), targets.to(device)  # Move inputs and targets to GPU
        outputs = model(inputs)
        predicted = torch.round(torch.sigmoid(outputs))
        total_correct += (predicted == targets).sum().item()
        total_samples += targets.size(0)
    accuracy = total_correct / total_samples
    print(f'Accuracy: {accuracy}')


RuntimeError: CUDA out of memory. Tried to allocate 1.40 GiB (GPU 0; 4.00 GiB total capacity; 9.81 GiB already allocated; 0 bytes free; 10.21 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

In [25]:
# Using .size() method
print("X_train_tensor size:", X_train_tensor.size())
print("y_train_tensor size:", y_train_tensor.size())

X_train_tensor size: torch.Size([1250000, 300])
y_train_tensor size: torch.Size([1000])


## Model Implementation

**Note**: Does a CNN makes sense for sentiment analysis? or a simpler model?

**Answers and additional Notes**:
Make a CNN with PyTorch using skorch as wrapper to make it possible to use sklearn.pipeline with the model
This way gridsearch for hyper parameters is possible and tfidfVectorizer can be used for tf-idf
CNN: vector size 300, conv. layer of some size, flatten, relu, end with softmax or something
Example: https://www.kaggle.com/code/raviusz/jigsaw-toxic-comment
example look very good to get basics and then change some of architecture
hyperparameter tuning for each model? only if time permits, alt. tune on best model and use for rest

## Test with Testset

**Note**: We will use the given test set to compare the different approaches. Make a dataframe with all the results
in accuracy, f1, recall, etc. 

**Note**: Setfit as example of what to use if I do not take CNN