### Hate Speech Detection using PyTorch and Hugging Face Transformers
This notebook tackles a hate speech detection task: determining whether a piece of text contains hateful content.

Data: A Twitter corpus study of the 2020 US elections on the basis of offensive speech and stance detection.

Data URL: https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/stance-hof/

#### GPU Setup

In [4]:
import torch

# If GPU is available
if torch.cuda.is_available():
    # PyTorch will use GPU
    device = torch.device("cuda")
    print("The GPU that is used: ", torch.cuda.get_device_name(0))

# If GPU is not available
else:
    # PyTorch will use CPU
    device = torch.device("cpu")
    print("No GPU available, CPU is used.")

The GPU that is used:  NVIDIA GeForce RTX 3060 Ti


In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
from tqdm import tqdm
from collections import Counter
import re


#### Read Data
The data is provided in .tsv format. For convenience, we first convert it to .csv.

In [6]:
train_tsv = './data/train.tsv'
test_tsv = './data/test.tsv'

# Reading given tsv file
train_csv = pd.read_table(train_tsv, sep = '\t')
test_csv = pd.read_table(test_tsv, sep = '\t')

# Converting the tsv files into csv
train_csv.to_csv('./data/train.csv', index = False)
test_csv.to_csv('./data/test.csv', index = False)

print("Conversion completed successfully.")


Conversion completed successfully.
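As a side note, the conversion is a convenience rather than a requirement, since pandas can parse tab-separated data directly. The following is a minimal sketch, not part of the original notebook; `sample_tsv` is a hypothetical two-row stand-in for the real `./data/train.tsv` file.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for ./data/train.tsv
sample_tsv = "text\tHOF\nfirst tweet\tNon-Hateful\nsecond tweet\tHateful\n"

# read_csv parses tab-separated input when sep is set to a tab character
direct_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(direct_df.shape)
```

With the real files, the same call would take the file path in place of the `io.StringIO` buffer.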


Now that the data is converted into .csv files, let's take a look at it.

In [7]:
train_csv

Unnamed: 0,text,Trump,Biden,West,HOF
0,@SukiRavan @ProgressPotato @MarkZuckerb0rg @JS...,Neither,Favor,Neither,Non-Hateful
1,@Newsweek Are you freaking crazy????[NEWLINE]I...,Neither,Favor,Neither,Non-Hateful
2,Undecided voters (and MAGATs alike);[NEWLINE]I...,Against,Neutral mentions,Neither,Non-Hateful
3,@cheaterwins @Hungry_For_More @DAYSORSHAY So a...,Favor,Neutral mentions,Neither,Non-Hateful
4,@CNN Nancy Pelosi and the Dems wont do a deal ...,Neutral mentions,Neither,Neither,Non-Hateful
...,...,...,...,...,...
2395,Just lost a ton of followers again. Looks like...,Favor,Neither,Neither,Non-Hateful
2396,@NovumQuid @OpenMothersMale @MikeSington I hav...,Neither,Mixed,Neither,Non-Hateful
2397,@TheLeoTerrell @SenatorLoeffler @realDonaldTru...,Favor,Neither,Neither,Non-Hateful
2398,It’s too bad that at a time when we’re unemplo...,Neither,Favor,Neither,Non-Hateful


The dataset guidelines confirm that the data contains five columns. We will not use the three stance columns (Trump, Biden, and West); we only use 'text' and 'HOF', where 'HOF' is the Hateful/Non-Hateful label. Initially, I considered using all the columns, but a tweet can be in favor of all three stance targets and still be labeled hateful, so the stance columns do not determine the label.

In [10]:
# Data load function
def load_data(filename, sample_size = 10):
    df = pd.read_csv(filename)
    print('Sample of the dataset: ')
    display(df.sample(sample_size))

    return df

In [11]:
# Loading data into pandas dataframe
train_df = load_data('./data/train.csv')
test_df = load_data('./data/test.csv')


Sample of the dataset: 


Unnamed: 0,text,Trump,Biden,West,HOF
646,"HUGE NEWS!! Hillary, Obama, &amp; Biden ALL Kn...",Favor,Against,Neither,Non-Hateful
1975,are you ready to wear this shirt? 👇[NEWLINE][N...,Neither,Neither,Neither,Non-Hateful
1878,@seanhannity She is evil!! SHE SHOULD BE PROSE...,Against,Neither,Neither,Non-Hateful
1652,@patinsideout @KING5Seattle Nobody wants the D...,Favor,Against,Neither,Hateful
691,My Beardie is smarter than Sleepy Joe and Lyin...,Neither,Favor,Neither,Hateful
1302,"@johncusack ""If there were any such thing as t...",Neither,Favor,Neither,Non-Hateful
1402,@arleneprieto @ProudSocialist This sounds like...,Neither,Favor,Neither,Non-Hateful
1201,.@marklevinshow: [NEWLINE][NEWLINE]Tara Reade ...,Neither,Against,Neither,Non-Hateful
1736,Baffled by Cuban-Am voters favoring Trump. Isn...,Against,Favor,Neither,Non-Hateful
1464,It's way past time for @SpeakerPelosi to stop ...,Favor,Neither,Neither,Non-Hateful


Sample of the dataset: 


Unnamed: 0,text,Trump,Biden,West,HOF
7,@Rett94291220 @STLemmon @LizRNC Kamala Harris ...,Mixed,Mixed,Neither,Non-Hateful
51,@EcWitIt Why is always Trump with you guys. I ...,Against,Neutral mentions,Neither,Non-Hateful
94,@realDonaldTrump DEMOCRATS TRIED TO STOP TRUMP...,Favor,Against,Neither,Non-Hateful
174,"Help Twitter Fam, I need 50 more FOLLOWERS to ...",Favor,Neither,Neither,Non-Hateful
96,IM ALMOST AT 1000 followers! YaY! [NEWLINE][NE...,Favor,Neither,Neither,Non-Hateful
500,@PamelaM41583211 @FrankAmari2 @realDonaldTrump...,Neutral mentions,Mixed,Neither,Non-Hateful
217,Did you ever think you would be so happy to se...,Neither,Favor,Neither,Non-Hateful
116,Crazy fake news tweets all over Twitter about ...,Favor,Neither,Neither,Hateful
247,I have added few crucial historical events to ...,Neither,Favor,Neither,Non-Hateful
413,A united #Resistance is the only thing that ca...,Against,Favor,Neither,Hateful


Now we drop the three middle columns (Trump, Biden, and West) and keep only 'text' and 'HOF'.

In [16]:
# Keep only the tweet text and the Hateful/Non-Hateful label, selected by name
train_df = train_df[['text', 'HOF']]
test_df = test_df[['text', 'HOF']]
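With only 'text' and 'HOF' left, a classifier will eventually need numeric labels. The following is a minimal sketch, not part of the original notebook, of one way to inspect the class balance and encode the labels. `toy_df`, `label_map`, and the 0/1 assignment are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy frame with the same two columns as train_df
toy_df = pd.DataFrame({
    "text": ["tweet a", "tweet b", "tweet c", "tweet d"],
    "HOF": ["Non-Hateful", "Hateful", "Non-Hateful", "Non-Hateful"],
})

# Count how many rows carry each label to see the class balance
class_counts = toy_df["HOF"].value_counts()

# Assumed encoding: 0 for Non-Hateful, 1 for Hateful
label_map = {"Non-Hateful": 0, "Hateful": 1}
toy_df["label"] = toy_df["HOF"].map(label_map)

print(class_counts)
print(toy_df[["HOF", "label"]])
```

Any label string missing from `label_map` would become `NaN` after `map`, which makes unexpected label values easy to spot.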

The dataset guidelines state that the data may contain duplicates, so we check whether there are any.

In [19]:
print(len(train_df['text'].drop_duplicates()) == len(train_df))
print(len(test_df['text'].drop_duplicates()) == len(test_df))

True
True


Since both checks return True, we can confirm that there are no duplicate texts in either dataset.
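The length comparison above only says whether duplicates exist. As a minimal sketch, not part of the original notebook, pandas' `duplicated` and `drop_duplicates` can also count and remove them; `toy_df` below is a hypothetical frame that deliberately contains one repeated text.

```python
import pandas as pd

# Hypothetical toy frame in which the first and third texts are identical
toy_df = pd.DataFrame({
    "text": ["same tweet", "other tweet", "same tweet"],
    "HOF": ["Hateful", "Non-Hateful", "Hateful"],
})

# duplicated() marks each repeat of an earlier 'text' value as True
n_duplicates = int(toy_df["text"].duplicated().sum())

# drop_duplicates keeps the first occurrence of each text by default
deduped_df = toy_df.drop_duplicates(subset="text").reset_index(drop=True)

print(n_duplicates, len(deduped_df))
```

On the actual `train_df` and `test_df`, the duplicate count would be expected to be zero given the checks above.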