#### Import the libraries

In [1]:
# For data handling and EDA
import pandas as pd
# For Visualisation
import matplotlib.pyplot as plt
from wordcloud import WordCloud
# For NLP and ML
import torch
import sklearn
import nltk
import spacy

#### Load the dataset into pandas

In [2]:
# Load the annotations metadata csv to get label info
annot = pd.read_csv("hate-speech-dataset-master/annotations_metadata.csv")
# Keep only file id and label
annot.drop(columns=['user_id','subforum_id','num_contexts'],inplace=True)
# Add new empty column to insert the message
annot['message']=''
# Loop through all rows
for i in range(len(annot)):
    # Get current filename
    filename = annot.loc[i, 'file_id']
    # Open the file and read the contents (the with block closes the file)
    with open('hate-speech-dataset-master/all_files/' + filename + '.txt') as f:
        contents = f.read()
    # Add contents to the message column; .loc avoids chained assignment
    annot.loc[i, 'message'] = contents
# Drop file id
annot.drop(columns=['file_id'],inplace=True)
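
The loading step above could also be written without an explicit index loop. The following is only a sketch under stated assumptions: the helper name `load_messages`, the UTF-8 encoding, and the throwaway demo files are hypothetical and not part of the original notebook; the real files live under `hate-speech-dataset-master/all_files/`.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_messages(annot, base_dir, encoding="utf-8"):
    """Return a copy of `annot` with a 'message' column read from <base_dir>/<file_id>.txt.

    Hypothetical helper; the UTF-8 default is an assumption about the dataset files.
    """
    base = Path(base_dir)
    out = annot.copy()
    # One file read per row, collected into a list and assigned in one step
    out["message"] = [
        (base / f"{file_id}.txt").read_text(encoding=encoding)
        for file_id in out["file_id"]
    ]
    return out


# Tiny demo on made-up files, so the sketch runs without the real dataset
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("100_1", "first sentence"), ("100_2", "second sentence")]:
        (Path(tmp) / f"{name}.txt").write_text(text, encoding="utf-8")
    demo = pd.DataFrame({"file_id": ["100_1", "100_2"], "label": ["noHate", "hate"]})
    loaded = load_messages(demo, tmp)

print(loaded)
```

Assigning the whole column at once sidesteps per-cell writes, and returning a copy leaves the input frame unchanged.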

#### EDA and Pre-Processing

In [3]:
# DataFrame information
annot.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10944 entries, 0 to 10943
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   label    10944 non-null  object
 1   message  10944 non-null  object
dtypes: object(2)
memory usage: 171.1+ KB


In [4]:
# Check the classes and their value counts
annot['label'].value_counts()

noHate      9507
hate        1196
relation     168
idk/skip      73
Name: label, dtype: int64

According to the original paper, the dataset contains 10944 sentences in total: 9507 are labelled as not containing hate speech (noHate), 1196 as hate speech (hate), 168 as related to hate speech depending on context (relation), and 73 are non-English sentences labelled idk/skip. For this study, the relation and idk/skip classes are dropped entirely so that the task becomes a binary classification problem.

In [5]:
# Keep sentences classified only as noHate and hate
cleanPosts = annot[(annot['label'] != 'relation') & (annot['label'] != 'idk/skip')]
# Reset index
cleanPosts=cleanPosts.reset_index(drop=True)
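
As a quick sanity check on this filtering step, here is a minimal sketch on a small made-up frame rather than the real data. The helper name `keep_binary` and the toy rows are hypothetical; `isin` is simply an equivalent way to express the two `!=` conditions, and `value_counts(normalize=True)` reports the class shares. With the counts shown earlier (9507 noHate, 1196 hate), the real split would be roughly 89% to 11%.

```python
import pandas as pd


def keep_binary(df, label_col="label", keep=("noHate", "hate")):
    """Return only the rows whose label is in `keep`, with a fresh 0..n-1 index."""
    return df[df[label_col].isin(keep)].reset_index(drop=True)


# Toy data standing in for the annotated posts (not the real dataset)
toy = pd.DataFrame({
    "label": ["noHate", "hate", "relation", "noHate", "idk/skip"],
    "message": ["a", "b", "c", "d", "e"],
})

binary = keep_binary(toy)

# Share of each remaining class, useful for spotting class imbalance
shares = binary["label"].value_counts(normalize=True)

print(binary)
print(shares)
```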

In [6]:
cleanPosts

Unnamed: 0,label,message
0,noHate,"As of March 13th , 2014 , the booklet had been..."
1,noHate,In order to help increase the booklets downloa...
2,noHate,( Simply copy and paste the following text int...
3,hate,Click below for a FREE download of a colorfull...
4,noHate,Click on the `` DOWNLOAD ( 7.42 MB ) '' green ...
...,...,...
10698,noHate,"Billy - `` That guy would n't leave me alone ,..."
10699,noHate,Wish we at least had a Marine Le Pen to vote f...
10700,noHate,Its like the choices are white genocide candid...
10701,hate,Why White people used to say that sex was a si...
