# NLP Processing of Offensive Tweets
Twitter is a social media platform that lets individuals and businesses communicate directly with their fans and customers. This direct access has produced plenty of great moments for the platform's users, but it also exposes them to a fair amount of bile and harsh criticism in their mentions. To narrow the experience down to the useful interactions, I'm hoping to build a classifier that filters insults out of a user's mentions.

This notebook imports a dataset of offensive tweets developed for the paper ["Automated Hate Speech Detection and the Problem of Offensive Language", ICWSM 2017, by Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber](https://github.com/t-davidson/hate-speech-and-offensive-language) and attempts to replicate a classifier that can then be used to predict direct insults and blatantly offensive tweets.

In [1]:
# general data containers
import pandas as pd
import numpy as np

# Text processing
import tweet_processing as tp
import re
import string
import nltk
from gensim import corpora, models, similarities, matutils

# Distance metrics
from sklearn.metrics import pairwise_distances

# text data structures
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import text
from sklearn.model_selection import train_test_split

In [2]:
# Bring in tweet data
path = './hate-speech-and-offensive-language/data/labeled_data.csv'
hate_speech = pd.read_csv(path, header=0, index_col=0)
hate_speech.head()

Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...


In [3]:
# 24783 tweets in database
hate_speech.shape

(24783, 6)
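
As a quick sanity check on the label columns, here is a small sketch built from the five rows shown above. It assumes that `class` is the majority-vote label, with 0 = hate speech, 1 = offensive language, and 2 = neither; that is my reading of the dataset's documentation and is not verified elsewhere in this notebook. The names `sample`, `votes`, and `majority` are made up for the example.

```python
import pandas as pd

# The five rows from hate_speech.head(), vote columns only
sample = pd.DataFrame({
    "count":              [3, 3, 3, 3, 6],
    "hate_speech":        [0, 0, 0, 0, 0],
    "offensive_language": [0, 3, 3, 2, 6],
    "neither":            [3, 0, 0, 1, 0],
    "class":              [2, 1, 1, 1, 1],
})

# Assumption: column order maps to class ids 0, 1, 2
votes = sample[["hate_speech", "offensive_language", "neither"]]

# The position of the largest vote count should equal the class label
majority = votes.to_numpy().argmax(axis=1)
labels_agree = bool((majority == sample["class"].to_numpy()).all())

# The three vote columns should add up to the number of annotators
votes_add_up = bool((votes.sum(axis=1) == sample["count"]).all())
print(labels_agree, votes_add_up)
```

The same two checks could be run on the full `hate_speech` frame, though ties between vote counts might need separate handling there.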

The research group includes a set of custom n-grams, 1 to 4 words long, that are found in nearly all tweets labeled as hate speech. I've imported them to look around, but I think that using them as the only vocabulary in my word vectorizer would discard most of the text I'd like to retain.

In [4]:
# Bring in custom n-grams by Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber
full_hategram = 'hatebase_dict.csv'
smart_hategram = 'refined_ngram_dict.csv'
path = './hate-speech-and-offensive-language/lexicons/'
hate_ngrams = pd.read_csv(path+smart_hategram, header=0)
hate_ngrams.head()

Unnamed: 0,ngram,prophate
0,allah akbar,0.87
1,blacks,0.583
2,chink,0.467
3,chinks,0.542
4,dykes,0.602


Before vectorizing the tweets, I want to process the text to make sure we get actual words in the word vectors. In doing so I also extract additional features from "entities" in the tweet text, such as mentions, retweets, links, and emojis.

The first block below shows the output of `tp.prcs_rw_twt` for a single tweet. The next block performs the scrubbing and feature engineering for every tweet in the data. This adds:
* scrubbed text, a string
* retweet, binary, 1 if tweet is a retweet
* n_retweet, integer count, number of retweets
* mention, binary, 1 if tweet has a mention
* n_mention, integer count, number of mentions
* link, binary, 1 if tweet has a link
* emoji, binary, 1 if tweet contains an HTML encoded emoji of style `&#12___;`
* n_emoji, integer count, number of emojis in tweet
* n_char, integer count, number of characters in tweet
* avg_wrd, float, average word length in characters of words in the tweet
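
The `tweet_processing` module itself isn't shown in this notebook, so the following is only a rough sketch of how features like the ones listed above could be extracted with regular expressions. The function `sketch_process_tweet` and all four patterns are my own assumptions for illustration; they are not the code behind `tp.prcs_rw_twt` and may not reproduce its numbers.

```python
import html
import re

# Hypothetical patterns for Twitter "entities"
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"\bRT\b")
LINK_RE = re.compile(r"https?://\S+")
EMOJI_RE = re.compile(r"&#\d+;")  # HTML-encoded emoji such as &#128514;


def sketch_process_tweet(tweet):
    """Return (scrubbed_text, [n_retweet, n_mention, n_link, n_emoji, n_char, avg_wrd])."""
    n_retweet = len(RETWEET_RE.findall(tweet))
    n_mention = len(MENTION_RE.findall(tweet))
    n_link = len(LINK_RE.findall(tweet))
    n_emoji = len(EMOJI_RE.findall(tweet))

    # Remove the entities, then decode what is left (e.g. &amp; becomes &)
    text = tweet
    for pattern in (EMOJI_RE, LINK_RE, MENTION_RE, RETWEET_RE):
        text = pattern.sub(" ", text)
    text = html.unescape(text).lower()

    # Keep letters only and collapse runs of whitespace
    words = re.sub(r"[^a-z\s]", " ", text).split()
    text = " ".join(words)

    n_char = len(text)
    avg_wrd = round(sum(len(w) for w in words) / len(words), 2) if words else 0.0
    return text, [n_retweet, n_mention, n_link, n_emoji, n_char, avg_wrd]


example = "!!! RT @mayasolovely: As a woman you shouldn't complain &amp; http://t.co/x &#128514;"
scrubbed, feats = sketch_process_tweet(example)
print(scrubbed, feats)
```

Counting the entities before stripping them is what lets the same pass produce both the scrubbed text and the count features.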

In [5]:
print(hate_speech.tweet[0])
tp.prcs_rw_twt(hate_speech.tweet[0])

!!! RT @mayasolovely: As a woman you shouldn't complain about cleaning up your house. &amp; as a man you should always take the trash out...


('as a woman you shouldn t complain about cleaning up your house as a man you should always take the trash out',
 [1, 1, 0, 0, 110, 3.95])

In [6]:
# Define new features to add
new_cols = ['text','retweet', 'mention','link', 'emoji', 'n_char', 'avg_wrd']

# For each tweet, process text and add new features.
# Create a nested list representing feature values for each tweet.
new_vals = []
for tweet in hate_speech.tweet:
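    # Use `clean_text` rather than `text` so the sklearn `text` module imported above is not shadowed
    clean_text, new_feats = tp.prcs_rw_twt(tweet)
    new_vals.append([clean_text] + new_feats)


# Add in values corresponding to the column names above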
new_vals = np.array(new_vals)
for idx, new_col in enumerate(new_cols):
    hate_speech[new_col] = new_vals[:,idx]

# The above array conversion turns new feature values into objects for type uniformity in array.
# Undo this with type conversion in dataframe.
col_to_int = ['retweet', 'mention','link', 'emoji', 'n_char']
col_to_float = ['avg_wrd']
for col in col_to_int:
    hate_speech[col] = hate_speech[col].astype(int)
for col in col_to_float:
    hate_speech[col] = hate_speech[col].astype(float)
    
hate_speech.head()


Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet,text,retweet,mention,link,emoji,n_char,avg_wrd
0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...,as a woman you shouldn t complain about cleani...,1,1,0,0,110,3.95
1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...,boy dats cold tyga dwn bad for cuffin dat hoe ...,1,1,0,0,60,3.54
2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...,dawg you ever fuck a bitch and she start to cr...,2,2,0,0,72,3.5
3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...,she look like a tranny,1,2,0,0,23,3.6
4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...,the shit you hear about me might be true or it...,1,1,0,0,96,3.32
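
The type conversion above is needed because a NumPy array holds a single dtype, so mixing strings and numbers turns everything into strings or objects. One possible alternative, sketched here under the assumption that the processing function returns a string plus a list of numbers, is to build a DataFrame straight from the nested list so pandas infers a dtype per column. `toy_process` is a hypothetical stand-in for `tp.prcs_rw_twt`, and the column names are only illustrative.

```python
import pandas as pd


def toy_process(tweet):
    # Hypothetical stand-in: returns (text, [n_retweet, n_mention, n_char])
    words = tweet.lower().split()
    return " ".join(words), [tweet.count("RT"), tweet.count("@"), len(tweet)]


tweets = pd.Series(["RT @a hello", "plain tweet"])

# One row per tweet: the scrubbed text followed by its numeric features
rows = [[clean_text] + feats for clean_text, feats in map(toy_process, tweets)]

# Building the frame from a list of lists keeps the integer columns as integers
features = pd.DataFrame(
    rows,
    columns=["text", "retweet", "mention", "n_char"],
    index=tweets.index,
)
print(features.dtypes)
```

Because `features` shares the index of `tweets`, it could be attached to the original frame with `pd.concat([...], axis=1)` or `DataFrame.join` without any `astype` calls.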


In [7]:
# Check for missing values
hate_speech.isnull().sum()

count                 0
hate_speech           0
offensive_language    0
neither               0
class                 0
tweet                 0
text                  0
retweet               0
mention               0
link                  0
emoji                 0
n_char                0
avg_wrd               0
dtype: int64

In [9]:
# Make counts and binary value features for twitter entities
hate_speech['n_retweet'] = hate_speech['retweet']
hate_speech['retweet'] = [1 if rt >0 else 0 for rt in hate_speech.retweet]
hate_speech['mention'] = [1 if mt >0 else 0 for mt in hate_speech.mention]
hate_speech['link'] = [1 if lk >0 else 0 for lk in hate_speech.link]
hate_speech.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24783 entries, 0 to 25296
Data columns (total 14 columns):
count                 24783 non-null int64
hate_speech           24783 non-null int64
offensive_language    24783 non-null int64
neither               24783 non-null int64
class                 24783 non-null int64
tweet                 24783 non-null object
text                  24783 non-null object
retweet               24783 non-null int64
mention               24783 non-null int64
link                  24783 non-null int64
emoji                 24783 non-null int64
n_char                24783 non-null int64
avg_wrd               24783 non-null float64
n_retweet             24783 non-null int64
dtypes: float64(1), int64(11), object(2)
memory usage: 3.5+ MB


More primitive text scrubbing methodology, shown here for comparison's sake:
* make lowercase
* remove punctuation and twitter usernames

In [11]:
tp.basic_tweet_scrub(hate_speech.tweet).head()

0        rt    as a woman you shouldn t complain ab...
1          rt     boy dats cold   tyga dwn bad for ...
2            rt   dawg     rt     you ever fuck a b...
3                       rt      she look like a tranny
4                  rt    the shit you hear about me...
Name: tweet, dtype: object

We also need to add stop words that are specific to Twitter.

In [123]:
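# stop_words_twitter (the Twitter-specific stop words) is not defined in the cells shown here
full_stop_words = text.ENGLISH_STOP_WORDS.union(stop_words_twitter)
type(full_stop_words)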

frozenset
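
The contents of `stop_words_twitter` aren't shown in this notebook. As a rough sketch of the idea, the example below combines scikit-learn's English stop words with a small, made-up set of Twitter tokens (`example_twitter_stop_words` is hypothetical, not the list used above) and checks that they are dropped during vectorization.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer

# Hypothetical Twitter-specific tokens; the notebook's real list may differ
example_twitter_stop_words = {"rt", "amp", "http", "https"}

# union() on a frozenset returns a new frozenset
example_stop_words = ENGLISH_STOP_WORDS.union(example_twitter_stop_words)

# Recent scikit-learn releases expect a list here, so convert explicitly
vec = CountVectorizer(stop_words=list(example_stop_words))
vec.fit(["rt the dog barked amp ran"])

vocab = sorted(vec.vocabulary_)
print(vocab)
```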

In [133]:
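# CountVectorizer expects a list of stop words; the scrubbed tweets live in the 'text' column
cv1 = CountVectorizer(stop_words=list(full_stop_words), max_df=0.9, min_df=0.001)
X_vec = cv1.fit_transform(hate_speech['text'])
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
pd.DataFrame(X_vec.toarray(), columns=cv1.get_feature_names_out()).head()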

Unnamed: 0,act,actin,acting,actually,af,ago,ah,ain,aint,al,...,yeah,year,years,yellow,yes,yesterday,yo,young,youre,yu
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
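
The `max_df` and `min_df` arguments used above already prune the vocabulary by document frequency. Here is a toy sketch of that behaviour; the corpus and thresholds are made up for illustration and are not the values used on the tweets.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good dog", "good cat", "good bird", "bad dog"]

# No pruning: every token becomes a feature
full = CountVectorizer().fit(docs)

# min_df=2 drops terms seen in fewer than 2 documents;
# max_df=0.7 drops terms seen in more than 70% of documents
pruned = CountVectorizer(min_df=2, max_df=0.7).fit(docs)

full_vocab = sorted(full.vocabulary_)
pruned_vocab = sorted(pruned.vocabulary_)
print(full_vocab, pruned_vocab)
```

Only "dog" survives: "good" appears in 3 of 4 documents (75%), and the remaining words appear once each.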


From here, before I model it's probably best that I find a way to reduce the number of features. Let's try using the vocabulary that the researchers developed.

Here I create a count vectorizer that uses only the n-grams found in hateful tweets. Bear in mind that using only these terms leaves a lot of information "on the table": it excludes all the text you'd find in regular speech.

In [157]:
# Map each n-gram to a column index, the format CountVectorizer expects for `vocabulary`
hate_dict = hate_ngrams.to_dict()['ngram']
hate_vocab = {term: idx for idx, term in hate_dict.items()}

In [159]:
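# max_df/min_df are ignored when a fixed vocabulary is supplied, so they are dropped here.
# ngram_range=(1, 4) is needed for the multi-word lexicon entries to be counted at all.
cv2 = CountVectorizer(vocabulary=hate_vocab, ngram_range=(1, 4))
X_vec_2 = cv2.fit_transform(hate_speech['text'])
pd.DataFrame(X_vec_2.toarray(), columns=cv2.get_feature_names_out()).head()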

Unnamed: 0,allah akbar,blacks,chink,chinks,dykes,faggot,faggots,fags,homo,inbred,...,full of white trash,how many niggers are,is full of white,lame nigga you a,many niggers are in,nigga you a lame,niggers are in my,wit a lame nigga,you a lame bitch,you fuck wit a
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
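
One detail worth checking with a fixed vocabulary: `CountVectorizer` only generates unigrams by default, so multi-word vocabulary entries can never match unless `ngram_range` covers their length. A minimal sketch with a made-up two-term vocabulary (`toy_vocab` is hypothetical, not the lexicon above):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vocab = {"harry potter": 0, "wookie": 1}
docs = ["harry potter", "wookie wookie"]

# Default ngram_range=(1, 1): the two-word entry is never produced by the analyzer
unigram_only = CountVectorizer(vocabulary=toy_vocab).fit_transform(docs).toarray()

# ngram_range=(1, 2): bigrams are generated, so "harry potter" is counted
with_bigrams = CountVectorizer(
    vocabulary=toy_vocab, ngram_range=(1, 2)
).fit_transform(docs).toarray()

print(unigram_only)
print(with_bigrams)
```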


Pair programming exercise

In [109]:
# Toy corpus; named `docs` so it doesn't shadow sklearn's `text` module imported above
docs = ["wookie stormtrooper",
        "wookie wookie wookie stormtrooper stormtrooper stormtrooper",
        "harry potter"]

counter = CountVectorizer(stop_words='english')
cv_text = counter.fit_transform(docs)
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names()
text_df = pd.DataFrame(cv_text.toarray(), columns=counter.get_feature_names_out())
text_df.head()

Unnamed: 0,harry,potter,stormtrooper,wookie
0,0,0,1,1
1,0,0,3,3
2,1,1,0,0


In [110]:
# result_i,j is the distance between row i and row j
pairwise_distances(text_df, metric='euclidean')

array([[0.        , 2.82842712, 2.        ],
       [2.82842712, 0.        , 4.47213595],
       [2.        , 4.47213595, 0.        ]])
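
To see where these numbers come from, the distances can be recomputed by hand from the count vectors in the table above. This is just a worked check with NumPy; the variable names are mine.

```python
import numpy as np

# Rows of the count matrix: columns are harry, potter, stormtrooper, wookie
row0 = np.array([0, 0, 1, 1])
row1 = np.array([0, 0, 3, 3])
row2 = np.array([1, 1, 0, 0])

# Euclidean distance is the length of the difference vector
d01 = np.linalg.norm(row0 - row1)  # sqrt(2**2 + 2**2) = sqrt(8)
d02 = np.linalg.norm(row0 - row2)  # sqrt(1 + 1 + 1 + 1) = 2
d12 = np.linalg.norm(row1 - row2)  # sqrt(1 + 1 + 9 + 9) = sqrt(20)
print(d01, d02, d12)
```

Rows 0 and 1 point in the same direction but differ in length, which is why their Euclidean distance is non-zero.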

In [111]:
# Cosine distance
pairwise_distances(text_df, metric='cosine')

array([[0., 0., 1.],
       [0., 0., 1.],
       [1., 1., 0.]])

In [112]:
# Cosine similarity, not de-meaned
1-pairwise_distances(text_df, metric='cosine')

array([[1., 1., 0.],
       [1., 1., 0.],
       [0., 0., 1.]])

In [113]:
# Pearson correlation between rows: not the same as cosine similarity of the raw counts
np.corrcoef(text_df)

array([[ 1.,  1., -1.],
       [ 1.,  1., -1.],
       [-1., -1.,  1.]])

In [114]:
# Count-vectorizer output, de-meaned row by row (each document's mean count is subtracted)
text_df_demean = text_df.sub(text_df.mean(axis=1), axis=0)
text_df_demean

Unnamed: 0,harry,potter,stormtrooper,wookie
0,-0.5,-0.5,0.5,0.5
1,-1.5,-1.5,1.5,1.5
2,0.5,0.5,-0.5,-0.5


In [115]:
# Cosine similarity, de_meaned data
1-pairwise_distances(text_df_demean, metric='cosine')

array([[ 1.,  1., -1.],
       [ 1.,  1., -1.],
       [-1., -1.,  1.]])

In [116]:
# Pearson correlation of the de-meaned rows: matches the cosine similarity above
np.corrcoef(text_df_demean)

array([[ 1.,  1., -1.],
       [ 1.,  1., -1.],
       [-1., -1.,  1.]])
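
The takeaway from this exercise can be written as one short check: the Pearson correlation between two rows equals the cosine similarity of those rows after each has had its own mean subtracted. A sketch in plain NumPy, where `cosine_similarity_rows` is a small helper written for this example:

```python
import numpy as np

# Same count matrix as above: harry, potter, stormtrooper, wookie
X = np.array([[0, 0, 1, 1],
              [0, 0, 3, 3],
              [1, 1, 0, 0]], dtype=float)


def cosine_similarity_rows(M):
    # Scale each row to unit length, then take all pairwise dot products
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit @ unit.T


cos_raw = cosine_similarity_rows(X)

# Subtract each row's own mean before comparing
centered = X - X.mean(axis=1, keepdims=True)
cos_centered = cosine_similarity_rows(centered)

# np.corrcoef treats each row as one variable
pearson = np.corrcoef(X)
print(np.allclose(cos_centered, pearson))
```

On the raw counts the unrelated documents get a similarity of 0, while after de-meaning they get -1, which is why the two measures disagree until the data is centered.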

Part 2, babeeee!

In [78]:
# 3 observations
A = [[25, 30],
     [29, 40],
     [39, 78]]

data = pd.DataFrame(A)

In [85]:
data.head()

Unnamed: 0,0,1
0,25,30
1,29,40
2,39,78


In [86]:
de_mean = data - data.mean()
de_mean.head()

Unnamed: 0,0,1
0,-6.0,-19.333333
1,-2.0,-9.333333
2,8.0,28.666667


In [87]:
# Scatter matrix (sums of products of deviations); divide by n - 1 to get the sample covariance
np.matmul(de_mean.T, de_mean)

array([[ 104.        ,  364.        ],
       [ 364.        , 1282.66666667]])

In [92]:
# Entry [1, 1] of the matrix above: the sum of squared deviations of column 1
np.dot(de_mean[1],de_mean[1])

1282.6666666666665
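
To connect this back to covariance: the matrix product of the de-meaned data with itself is a sum of products of deviations, and dividing it by n - 1 gives the sample covariance matrix. A short sketch that checks this against `np.cov`, using the same three observations (the variable names are mine):

```python
import numpy as np

# 3 observations, 2 variables
A = np.array([[25, 30],
              [29, 40],
              [39, 78]], dtype=float)

# Subtract each column's mean
centered = A - A.mean(axis=0)

# Sums of products of deviations (what the matmul above computes)
scatter = centered.T @ centered

# Sample covariance: divide by n - 1
sample_cov = scatter / (len(A) - 1)

# np.cov expects variables in rows unless rowvar=False
matches_numpy = np.allclose(sample_cov, np.cov(A, rowvar=False))
print(sample_cov, matches_numpy)
```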