# Detecting Insults in Social Commentary

Because this dataset is far too small for a neural network (my last approach to NLP), I'm going to experiment with some machine learning algorithms from scikit-learn to identify insults in social commentary.

When this competition ran on Kaggle 5 years ago, the best ROC AUC score was 0.84249 on the private test set (which I also have access to, and will be using as my test set). Let's see whether, using the newer tools now available (e.g. GloVe vectors), I can better this score.

## Making GloVe vectors

In this notebook, I make an array of GloVe vectors, which I can then use to test a variety of different machine learning algorithms. 

In [1]:
%matplotlib inline

import sys
import unicodedata
import re

import numpy as np

import spacy 

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from tqdm import tqdm

First, I need to load the data.

In [2]:
data = pd.read_csv('train.csv', encoding= 'utf8')
len(data)

3947

In [3]:
data.head()

Unnamed: 0,Insult,Date,Comment
0,1,20120618192155Z,"""You fuck your dad."""
1,0,20120528192215Z,"""i really don't understand your point.\xa0 It ..."
2,0,,"""A\\xc2\\xa0majority of Canadians can and has ..."
3,0,,"""listen if you dont wanna get married to a man..."
4,0,20120619094753Z,"""C\xe1c b\u1ea1n xu\u1ed1ng \u0111\u01b0\u1edd..."


So I have 3947 comments to train a classifier with (for comparison, my Quora neural network was trained on over 150,000 question pairs).

Let's first check the distribution of insults: 

In [4]:
num_insults = len(data[data.Insult ==1])
num_non_insults = len(data[data.Insult == 0])

print ("There are " + str(num_insults) + " insults \nThere are " + str(num_non_insults) + " non insults")

There are 1049 insults 
There are 2898 non insults


So the classes are skewed: roughly 27% of the comments are insults, which is worth bearing in mind when evaluating a classifier.
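
To put a number on the skew, the class proportions can be worked out directly from the two counts printed above. This is only a small sketch, and the variable names are my own.

```python
# Counts copied from the output of the cell above.
num_insults = 1049
num_non_insults = 2898
total = num_insults + num_non_insults

# Fraction of the training comments that are labelled as insults.
insult_fraction = num_insults / total
print("Insults make up {:.1%} of the training comments".format(insult_fraction))
```

A classifier that always predicts "not an insult" would be right about 73% of the time, so accuracy alone would be misleading here; ROC AUC, the score quoted for the competition, is less sensitive to this imbalance.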

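There's some easy cleaning I can do to begin with: removing the quotation marks around each comment, replacing literal `\n` sequences with spaces, and replacing every run of non-letter characters with a single space. Note that this leaves letter fragments behind from escape sequences such as `\xa0`, which becomes `xa`.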

In [179]:
data["stripped_comment"] = data.Comment.apply(lambda x: re.sub('[^A-Za-z ]+', ' ',x[1:-1].replace('\\n', ' ')))

In [180]:
data.head()

Unnamed: 0,Insult,Date,Comment,stripped_comment
0,1,20120618192155Z,"""You fuck your dad.""",You fuck your dad
1,0,20120528192215Z,"""i really don't understand your point.\xa0 It ...",i really don t understand your point xa It se...
2,0,,"""A\\xc2\\xa0majority of Canadians can and has ...",A xc xa majority of Canadians can and has been...
3,0,,"""listen if you dont wanna get married to a man...",listen if you dont wanna get married to a man ...
4,0,20120619094753Z,"""C\xe1c b\u1ea1n xu\u1ed1ng \u0111\u01b0\u1edd...",C xe c b u ea n xu u ed ng u u b u eddng bi u...
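
The residue visible above (`xa`, `xc`, `u ea`) comes from escape sequences that are stored as literal text in the CSV. One possible variation, which is not what this notebook ran, is to drop those sequences before stripping the non-letters. `strip_comment`, `ESCAPE_RE` and `NON_LETTER_RE` are hypothetical names of my own, and the pattern assumes the only escapes present are `\xNN`, `\uNNNN`, `\n`, `\r` and `\t`.

```python
import re

# One or more literal backslashes followed by a hex, unicode or whitespace escape.
ESCAPE_RE = re.compile(r'\\+(?:x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|[nrt])')
# Same character class as the cleaning cell above: anything that is not a letter or a space.
NON_LETTER_RE = re.compile(r'[^A-Za-z ]+')

def strip_comment(raw):
    """Clean one raw comment string (sketch; assumes it is wrapped in quotes)."""
    text = raw[1:-1]                  # drop the surrounding quotation marks
    text = ESCAPE_RE.sub(' ', text)   # remove literal escape sequences
    text = NON_LETTER_RE.sub(' ', text)
    return ' '.join(text.split())     # collapse repeated whitespace

example = r'''"i really don't understand your point.\xa0 It"'''
print(strip_comment(example))  # -> i really don t understand your point It
```

Non-English comments (like row 4 above) would still be reduced to scattered letters, so this only tidies the English text.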


To extract more features, I'm going to use spaCy, which will allow me to tokenize and generally augment the text.

In [173]:
nlp = spacy.load('en')

In [181]:
spacy_comments = []

for doc in nlp.pipe(data.stripped_comment, n_threads = -1):
    spacy_comments.append(doc)

I've already briefly explored what spaCy is capable of in [this notebook](https://github.com/GabrielTseng/LearningDataScience/blob/master/Natural_Language_Processing/Spacy_NLP.ipynb). Even though *apparently* little has changed (each processed comment still prints like the original text), the spaCy objects now hold plenty of new information under the hood.

The most powerful of these are the [GloVe](https://nlp.stanford.edu/projects/glove/) word vectors, which are now attached to every token ('word') in each spaCy comment.

In [166]:
print ("The token '" + str(spacy_comments[8][0]) + "' has a vector of shape " + str(spacy_comments[8][0].vector.shape))

The token 'Either' has a vector of shape (300,)


(So does every other token). 

One challenge is that every comment has a different length. My solution here differs from the one I used for the Quora dataset.

Given that each word vector has length 300, even truncating every comment to some fixed length x would yield an x*300 matrix per comment. With x = 20, for example, that is 6,000 features per comment, more than the ~4000 training datapoints I have, so almost any value of x will very quickly lend itself to overfitting.

Therefore, I am going to take the *average* of the word vectors instead, so that each comment is represented by a single 300-dimensional vector.


In [12]:
comment_array = []
for x in spacy_comments[1]:
    comment_array.append(x.vector.flatten())
comment_array = np.mean(comment_array, axis = 0)

In [13]:
np.asarray(comment_array).shape

(300,)

Perfect. I just need to do this for every single comment. 

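Since I'm averaging, it is tempting to remove stop words, which are unlikely to contribute to the 'sentiment' of the sentence but may dilute the average. In the cell below that filter is left commented out, so the only tokens dropped are those with an all-zero vector.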

In [176]:
spacy_comments[3365]

How is defending your child vigilante justice?  Why are you defending a child rapist?\xa0 Are you a closet pedo?

In [187]:
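glove_comments = []
for doc in tqdm(spacy_comments):
    # Seed with a zero vector so a comment with no usable tokens still averages
    # to a 300-length vector. Note that this seed is counted in the mean.
    comment_array = [np.zeros(300)]
    for token in doc:
        # Keep only tokens whose vector is not all zeros.
        # Stop-word filter currently disabled: and (token.is_stop == False)
        if token.vector.any():
            comment_array.append(token.vector)
    glove_comments.append(np.mean(comment_array, axis=0))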

100%|██████████| 3947/3947 [00:07<00:00, 525.06it/s]
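
The loop above seeds each comment's list with a zero vector, so that seed is counted in the mean and shrinks every average slightly (by a factor of n/(n+1) for n kept tokens). Below is a rough sketch of an alternative that averages only the kept vectors and can optionally skip stop words. `mean_vector` and its `stop_flags` and `dim` parameters are hypothetical names of my own, not part of spaCy or of the run above, and the toy vectors are 3-dimensional stand-ins for the real 300-dimensional ones.

```python
import numpy as np

def mean_vector(vectors, stop_flags=None, dim=300):
    """Average the non-zero vectors, optionally skipping flagged stop words."""
    kept = []
    for i, vec in enumerate(vectors):
        vec = np.asarray(vec, dtype=float)
        if not vec.any():
            continue  # all-zero vector: no word vector for this token
        if stop_flags is not None and stop_flags[i]:
            continue  # token flagged as a stop word
        kept.append(vec)
    if not kept:
        return np.zeros(dim)  # nothing usable: fall back to a zero vector
    return np.mean(kept, axis=0)

# Toy "tokens": two real vectors and one all-zero (out-of-vocabulary) vector.
toy = [np.ones(3), np.zeros(3), 3 * np.ones(3)]
print(mean_vector(toy, dim=3))                                    # averages the two non-zero vectors
print(mean_vector(toy, stop_flags=[False, False, True], dim=3))  # third token skipped as a stop word
print(mean_vector([], dim=3))                                     # empty comment gives zeros
```

With spaCy tokens this could be called as `mean_vector([t.vector for t in doc], [t.is_stop for t in doc])`. Whether dropping stop words actually helps would need to be checked against the classifier scores.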


In [188]:
glove_comments = np.asarray(glove_comments)
glove_comments.shape

(3947, 300)

Awesome. Now, I'm going to save this array for use with other algorithms. 

In [189]:
np.save('glove_array.npy', glove_comments)
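
For reference, here is a minimal sketch of the save and reload round trip that another notebook would rely on. It uses a small random array and a temporary directory as stand-ins, so it doesn't touch the real `glove_array.npy`; `features` and `reloaded` are names of my own.

```python
import os
import tempfile

import numpy as np

# Stand-in for glove_comments: 5 comments with 300 features each.
features = np.random.rand(5, 300)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'glove_array.npy')
    np.save(path, features)    # write the array in NumPy's .npy format
    reloaded = np.load(path)   # read it back fully into memory

print(reloaded.shape)
```

The rows of the saved array are in the same order as `train.csv`, so the labels for a classifier can be taken straight from the `Insult` column.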