Import necessary packages.

In [36]:
import numpy as np
import pandas as pd
import re
import string
import nltk

Read in the hatespeech dataset that we created when we queried the Twitter API for the tweets and their associated metadata. Then read in the original dataset with the target variable labels, created by Zeerak Waseem. To get a sense of either dataset, uncomment the print statements to print the first 5 rows. Finally, do an inner merge on the two dataframes to match the target variable, Class, from the original dataset with all of the data we've collected in the larger dataframe. We'll call the resulting dataframe "hs".

In [37]:
hs = pd.read_csv('hatespeech.csv', encoding="ISO-8859-1",index_col=6, keep_default_na=False)
#print(hs.head())

orig = pd.read_csv('NAACL_SRW_2016.csv', index_col=0, header=None)
orig.index.name = 'ID'
orig = orig.rename(columns={1: 'Class'})
orig.index = orig.index.astype(str)
#print(orig.head())

#merging the two dataframes
hs = pd.merge(hs, orig, how='inner', left_index=True, right_index=True)
#print(hs.head())
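
As a rough illustration of what the inner merge does, here is a minimal sketch on two toy frames. The IDs, tweet text, and labels are invented for the example; only the merge call mirrors the one used in the cell. Note that both indexes hold string IDs, which is presumably why the label index is cast to str before merging.

```python
import pandas as pd

# Toy stand-ins for the two tables; all IDs and values here are invented.
tweets = pd.DataFrame(
    {"Tweets": ["first tweet", "second tweet", "third tweet"]},
    index=pd.Index(["101", "102", "103"], name="ID"),
)
labels = pd.DataFrame(
    {"Class": ["sexism", "none", "racism"]},
    index=pd.Index(["102", "103", "104"], name="ID"),
)

# An inner merge on the index keeps only the IDs present in both frames.
merged = pd.merge(tweets, labels, how="inner", left_index=True, right_index=True)
print(merged)
```

IDs 101 and 104 appear in only one of the frames, so they are dropped from the result.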

Split the dataframe into three dataframes: one with only the sexist tweets, one with only the racist tweets, and one with the "none" class. Create a list from the Tweets variable of each of these mini-dataframes. We will treat these lists as our "documents" when we run TFIDF in the next Notebook.

In [38]:
sexism =  hs.loc[hs['Class'] == 'sexism']
racism = hs.loc[hs['Class'] == 'racism']
none = hs.loc[hs['Class'] == 'none']

s_tweets = list(sexism.Tweets)
r_tweets = list(racism.Tweets)
n_tweets = list(none.Tweets)

class_list = [s_tweets, r_tweets, n_tweets]
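
An equivalent way to build one list of tweets per class is a groupby, sketched here on an invented toy frame. The column names match the ones used in the cell; the rows and the `docs` dictionary are assumptions made for illustration.

```python
import pandas as pd

# Invented rows; only the column names follow the notebook.
toy = pd.DataFrame({
    "Tweets": ["a", "b", "c", "d"],
    "Class": ["sexism", "none", "racism", "none"],
})

# One list of tweets per class label, keyed by that label.
docs = {label: list(group["Tweets"]) for label, group in toy.groupby("Class")}
print(docs)
```

Either approach gives the same per-class lists; the dictionary just keeps the label attached to each list.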

Python's string module provides a standard punctuation list (string.punctuation), but in our case we're going to use our own, shorter list because @ and # are meaningful on social media and we'll want to access that information in our other preprocessing steps.

In [39]:
punctuation = [':',';','!',',','.']

Here we use regular expressions to strip parts of the tweet text that we handle separately or do not need. For example, we remove words that have # in front of them, because those are hashtags and we'll deal with them in a different way later. We likewise remove @mentions, the HTML entity &amp;, the retweet marker RT, digits, and shortened t.co links, because those have their own meanings on social media rather than ordinary word content. We have to do this before tokenizing, because the tokenizer splits words on punctuation. For example, #hashtag would be tokenized as ["#", "hashtag"]. It is easier to remove these patterns first and then get rid of punctuation afterwards, as we've done here.

In [40]:
term_vec = []

for i in class_list:
    doc = []
    doc2 = []
    for d in i:
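        d = re.sub(r'@\w+', '', d)          # remove @mentions
        d = re.sub(r'#\w+', '', d)          # remove hashtags
        d = re.sub(r'#', '', d)             # remove any stray '#'
        d = re.sub(r'RT', '', d)            # remove the retweet marker (matches 'RT' anywhere)
        d = re.sub(r'&amp;', '', d)         # remove HTML-escaped ampersands
        d = re.sub(r'[0-9]+', '', d)        # remove digits
        d = re.sub(r'//t\.co/\w+', '', d)   # remove shortened-link remnants
        d = re.sub(r'w//', '', d)           # remove leftover 'w//' fragments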
        d = d.lower()
        doc.append( nltk.word_tokenize( d ) )
    for j in doc:
        for s in j:
            if s not in punctuation:
                doc2.append(s)
    term_vec.append(doc2)

#print(term_vec[1])
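
To see the effect of this kind of cleaning on a single string, here is a small sketch that needs only the re module. The helper name `clean_tweet` and the sample tweet are invented. It also differs from the cell in two labeled ways: it uses `\bRT\b` so that only a standalone RT is removed, and it collapses leftover whitespace instead of tokenizing.

```python
import re

def clean_tweet(text):
    """Illustrative helper (not from the notebook): strip social-media markup."""
    text = re.sub(r"@\w+", "", text)       # drop @mentions
    text = re.sub(r"#\w+", "", text)       # drop hashtags
    text = re.sub(r"\bRT\b", "", text)     # drop a standalone retweet marker
    text = re.sub(r"&amp;", "", text)      # drop HTML-escaped ampersands
    text = re.sub(r"[0-9]+", "", text)     # drop digits
    return " ".join(text.lower().split())  # lowercase, collapse whitespace

sample = "RT @user: Women &amp; sports #notsexist 2015"
print(clean_tweet(sample))
```

The colon left behind by the mention is the kind of token that the punctuation filter removes after tokenizing.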

Next we remove standard English stopwords from the list of terms for each of the classes, along with contraction fragments that begin with an apostrophe (such as 's). To see a sample of a list without the stopwords, print a slice such as term_vec[0][0:30].

In [41]:
stop_words = set(nltk.corpus.stopwords.words( 'english' ))

for j in term_vec:
    # Rebuild each list in place; calling remove() while iterating skips tokens.
    j[:] = [i for i in j if i not in stop_words and not i.startswith("'")]

#print(term_vec[0][0:30])
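
A general caution when filtering a token list in Python: calling remove() on a list inside a for loop over that same list can skip elements, because the items shift left after each removal. The sketch below shows the effect on an invented token list with a tiny invented stopword set, and contrasts it with building a filtered list in one pass.

```python
tokens = ["i", "am", "the", "captain", "of", "the", "team"]
stop_words = {"i", "am", "the", "of"}  # tiny invented stopword set

# Pitfall: removing while iterating skips the element that slides
# into the slot just vacated.
buggy = list(tokens)
for tok in buggy:
    if tok in stop_words:
        buggy.remove(tok)

# Safer: build the filtered list in a single pass.
filtered = [tok for tok in tokens if tok not in stop_words]

print(buggy)     # some stopwords survive
print(filtered)  # only the content words remain
```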

We will now take these tokens and run them through a Porter stemmer. This reduces words like "run" and "runs" to the common stem "run". By reducing words to their stems we cut down on redundancy and get a better sense of their root meaning.

In [42]:
porter = nltk.stem.porter.PorterStemmer()

for i, tokens in enumerate( term_vec ):
    term_vec[ i ] = [ porter.stem( token ) for token in tokens ]

print(term_vec[0][0:30])

['oh', 'yeah', 'colin', 'smash', 'girl', 'swear', 'sexist', 'honestli', 'stand', 'woman', 'colleg', 'footbal', 'announc', 'espn', 'call', 'sexist', 'think', 'women', 'serious', 'lack', 'knowledg', 'l', 'come', 'femin', 'call', 'sexist', 'femal', 'realli', 'need', 'stop']
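
For a quick feel of what the stemmer does to individual words, here is a minimal sketch using the same NLTK PorterStemmer class. The word list is chosen for illustration. Note that stems are not always dictionary words, which is why tokens such as 'honestli' and 'colleg' appear in the output above.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

# Illustrative words; the first three share the stem "run".
words = ["run", "runs", "running", "honestly", "college"]
stems = [porter.stem(w) for w in words]

for w, s in zip(words, stems):
    print(w, "->", s)
```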


Our data is now relatively clean! To keep this notebook short, we save the list of tokens from each class to a separate CSV file to load into the next Notebook, where we will do our TFIDF analysis. We also save the merged dataset as 'hs_merged.csv'.

In [43]:
sexist = pd.DataFrame({'Sexist tokens': term_vec[0]})
racist = pd.DataFrame({'Racist tokens': term_vec[1]})
other = pd.DataFrame({'Other':term_vec[2]})

racist.to_csv('racist_tokens.csv')
sexist.to_csv('sexist_tokens.csv')
other.to_csv('other_tokens.csv')

hs.to_csv('hs_merged.csv')
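
When these token files are read back, two details are worth keeping in mind: to_csv writes the row index as the first column by default, and read_csv by default turns strings such as "null" or "nan" into missing values. The sketch below round-trips a toy token frame through an in-memory buffer instead of a file. The tokens are invented, and the reading options are one reasonable choice rather than what the next Notebook necessarily does.

```python
import io
import pandas as pd

# Toy token column; "null" is included to show the missing-value pitfall.
tokens = pd.DataFrame({"Sexist tokens": ["oh", "null", "colin"]})

buffer = io.StringIO()
tokens.to_csv(buffer)  # the default also writes the row index
buffer.seek(0)

# index_col=0 restores the index; keep_default_na=False keeps "null" as text.
restored = pd.read_csv(buffer, index_col=0, keep_default_na=False)
print(restored["Sexist tokens"].tolist())
```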