# Toxicity

<p><a name="sections"></a></p>


## Sections

- <a href="#Initialization">Initialization</a><br>
- <a href="#Data-preparation">Data preparation</a><br>
    - <a href="#Import-the-data">Import the data</a><br>
    - <a href="#Formatting">Formatting</a><br>
    - <a href="#Detour---Shouting">Detour - Shouting</a><br>
- <a href="#Classification">Classification</a><br>
    - <a href="#Feature-sets">Feature sets</a><br>
    - <a href="#Training">Training</a><br>

## Initialization

Import the required modules.

In [104]:
# import the required modules
import re

import numpy as np
import pandas as pd
import nltk

## Data preparation

### Import the data

Load the training data from the CSV file, print its dimensions, and show the first 5 records.

In [105]:
# import the data from csv, print the dimensions and view the first 5 records
df = pd.read_csv(filepath_or_buffer='./data/train.csv')
print df.shape
df.head()

(159571, 8)


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [106]:
# Calculate the frequency of each class label
df.iloc[:, 2:].sum() / df.shape[0]

toxic            0.095844
severe_toxic     0.009996
obscene          0.052948
threat           0.002996
insult           0.049364
identity_hate    0.008805
dtype: float64


The dataset has 6 classification labels, and all of them are rare: the most common, toxic, is set on about 9.6% of the comments, and threat on only about 0.3%. For this project I focus on predicting 2 of the 6 labels, **obscene** and **threat**. The other team members cover the remaining 4 labels, so I drop the columns I do not need.
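Because the shares are small, it helps to turn them back into rough comment counts. The sketch below multiplies the shares printed above by the 159,571 rows. The variable names are made up for illustration, and the results are approximate because the shares are rounded to six decimals.

```python
n_rows = 159571  # number of comments, taken from df.shape

# shares printed above for the two labels of interest
label_share = {'obscene': 0.052948, 'threat': 0.002996}

# approximate number of comments carrying each label
label_count = {label: int(round(share * n_rows))
               for label, share in label_share.items()}
print(label_count)
```

With only a few hundred threat comments, any small subset of the data will contain very few positive examples for that label.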

### Formatting

Let's first prepare the data.

+ Drop redundant columns
+ Remove non-ASCII characters
+ Some comments contain a timestamp and/or an IP address. Replace those with a single space
+ Remove the automated signature text ("The preceding unsigned comment was added by")
+ Remove unwanted characters (punctuation and digits)
+ Replace multiple spaces with a single space
+ Tokenize
+ Remove words that are shorter than 3 characters

In [107]:
# drop redundant columns
df.drop(['id', 'toxic', 'severe_toxic', 'insult', 'identity_hate'], axis=1, inplace=True)

In [108]:
# helper functions used to clean each comment
# drop non-ASCII characters (Python 2: byte str -> unicode -> ASCII str)
def handle_unknown(s):
    return s.strip().decode("ascii", "ignore").encode("ascii")

# replace IPv4-style addresses with a space
def handle_ip(s):
    return re.sub(r'[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+', ' ', s).strip()

# replace timestamps such as "04:28:57, August 19, 2007 (UTC)" with a space
def handle_ts(s):
    return re.sub(r'[0-9]{1,2}:[0-9]{1,2}.+,\s.+,\s[0-9]{4}\s+\([A-Z]+\)', ' ', s).strip()

# remove the automated signature text
def handle_autocomment(s):
    return re.sub('The preceding unsigned comment was added by', ' ', s).strip()

# collapse runs of whitespace into a single space
def handle_spaces(s):
    return re.sub(r'\s+', ' ', s).strip()

# replace punctuation and digits with a space
# note: the range A-z also spans [ \ ] ^ _ and `, so those characters survive;
# A-Za-z would keep letters only
def handle_punc(s):
    return re.sub(r"[^A-z\s]", ' ', s).strip()

# split each comment into a list of words
def handle_split(s):
    return s.split()

# keep only words where length >= 3
def handle_remove(s):
    return [e for e in s if len(e) >= 3]

# chain all steps in one function so the column is traversed only once
def handler(s):
    return handle_remove(handle_split(handle_spaces(handle_punc(handle_autocomment(handle_ts(handle_ip(handle_unknown(s))))))))


In [109]:
# create before and after variables to display the effect of the changes
before = df['comment_text'][37]
df['comment_text'] = df['comment_text'].apply(handler)
after = df['comment_text'][37]

In [110]:
# print before
print before

pretty much everyone from warren county/surrounding regions was born at glens falls hospital. myself included. however, i'm not sure this qualifies anyone as being a glens falls native. rachel ray is, i believe, actually from the town of lake luzerne.  —The preceding unsigned comment was added by 70.100.229.154  04:28:57, August 19, 2007 (UTC)


In [111]:
# print after
print after

['pretty', 'much', 'everyone', 'from', 'warren', 'county', 'surrounding', 'regions', 'was', 'born', 'glens', 'falls', 'hospital', 'myself', 'included', 'however', 'not', 'sure', 'this', 'qualifies', 'anyone', 'being', 'glens', 'falls', 'native', 'rachel', 'ray', 'believe', 'actually', 'from', 'the', 'town', 'lake', 'luzerne']


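The printouts of *before* and *after* show that the changes were applied as intended: the signature text, the IP address, and the timestamp are gone, punctuation has been removed, and words shorter than 3 characters have been dropped.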
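The helpers above rely on Python 2 string behaviour (`str.decode`). For readers on Python 3, the same steps can be sketched as a single function. This is an illustrative rewrite, not the notebook's code: the name `clean_comment` and the sample string are made up, and the character class uses `A-Za-z`, so brackets, carets, underscores, and backticks are removed as well.

```python
import re

def clean_comment(text):
    """Return the cleaned tokens of one comment (hypothetical helper)."""
    # drop non-ASCII characters
    text = text.encode("ascii", "ignore").decode("ascii")
    # replace IPv4-style addresses with a space
    text = re.sub(r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+", " ", text)
    # replace timestamps such as "04:28:57, August 19, 2007 (UTC)"
    text = re.sub(r"[0-9]{1,2}:[0-9]{1,2}.+,\s.+,\s[0-9]{4}\s+\([A-Z]+\)", " ", text)
    # remove the automated signature text
    text = text.replace("The preceding unsigned comment was added by", " ")
    # keep letters and whitespace only
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # tokenize on whitespace and drop words shorter than 3 characters
    return [word for word in text.split() if len(word) >= 3]

sample = ("rachel ray is, i believe, actually from lake luzerne.  "
          "\u2014The preceding unsigned comment was added by "
          "70.100.229.154  04:28:57, August 19, 2007 (UTC)")
tokens = clean_comment(sample)
print(tokens)
```

Splitting on whitespace already discards repeated spaces, so the sketch needs no separate step for collapsing them.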

### Detour - Shouting
Measure shouting as the share of words in a comment that are written entirely in UPPERCASE.

In [112]:
# return the share of fully uppercase words in a list, rounded to 3 decimals (0 for an empty list)
def ratio_uppers(l):
    return round(float(sum(i.isupper() for i in l))/len(l), ndigits=3) if len(l) > 0 else 0

# add a new column that measures the degree of shouting
df['shouting'] = df['comment_text'].apply(ratio_uppers)
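As a quick sanity check, the ratio can be tried on a few made-up token lists. The sketch below restates the function so that it runs on its own; the example lists are hypothetical.

```python
def ratio_uppers(words):
    """Share of fully uppercase words, rounded to 3 decimals (0 for an empty list)."""
    if not words:
        return 0
    return round(sum(w.isupper() for w in words) / float(len(words)), 3)

shouted = ratio_uppers(['STOP', 'this', 'NOW'])   # 2 of 3 words are uppercase
polite = ratio_uppers(['Thank', 'you'])           # capitalised words do not count
empty = ratio_uppers([])                          # empty comments are scored 0

print(shouted)
print(polite)
print(empty)
```

Only words whose letters are all uppercase count, so a capitalised first letter does not raise the score.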

In [113]:
# comments in which more than a third of the words are uppercase
df[df['shouting'] > (1.0/3)]

Unnamed: 0,comment_text,obscene,threat,shouting
6,"[COCKSUCKER, BEFORE, YOU, PISS, AROUND, WORK]",1,0,1.000
43,"[FUCK, YOUR, FILTHY, MOTHER, THE, ASS, DRY]",1,0,1.000
51,"[GET, FUCKED, GET, FUCKEEED, GOT, DRINK, THAT,...",1,0,1.000
159,"[UNBLOCK, GET, LAWYERS, YOU, FOR, BLOCKING, CO...",0,0,1.000
183,"[new, userbox, TABTAB, TABTAB, White, TABTAB, ...",0,0,0.375
281,"[UTC, December]",0,0,0.500
324,"[MATT, HARDY, FUCKY, Italic, text[[Media, Exam...",1,0,0.600
338,"[THIS, WIIL, LAST, USE, THIS, ACOUNT, PLEASE, ...",0,0,1.000
369,"[PAGE, TRIED, CREATE, NOW, ACTUAL, PAGE, LOL, ...",0,0,1.000
415,"[Thank, you, for, your, RACIST, experimenting,...",1,0,0.561


Ouch! There are some really nasty comments in this dataset.

### Formatting continued

To complete the formatting step we need to change all words to lowercase. This is done only now because the shouting measure relies on the original case.

In [114]:
# function to convert all values in a list to lowercase
def conv_lower(l):
    return [i.lower() for i in l]

# convert all uppercase to lower case
df['comment_text'] = df['comment_text'].apply(conv_lower)

## Classification
We are going to classify the comments using the Naive Bayes classifier that is part of the NLTK library. Its trainer takes a list of (feature set, label) tuples as input, so some further formatting is required.

### Dictionary
First we need to create a dictionary that maps every word in the corpus to a default flag of False. Then we need to create a feature set for each **comment** based on this dictionary.

In [115]:
# build a dict that maps every word found in any comment to False
def get_all_words(lists):
    d = dict()
    for l in lists:
        d.update(dict.fromkeys(l, False))
    return d

In [116]:
# words as dict
all_words = get_all_words(df['comment_text'])
len(all_words)

177303

There are 177,303 unique words. Let's look at the first 5 entries.

In [117]:
[(all_words.keys()[:5], all_words.values()[:5])]

[(['tsukino', 'sowell', 'woods', 'spiders', 'gavan'],
  [False, False, False, False, False])]

### Feature sets
Next, we create the feature set for each **comment** using the dictionary.

In [119]:
# # convert obscene & threat to string labels
# df['obscene'] = df['obscene'].apply(lambda i: 'obscene' if i == 1 else 'neutral')
# df['threat'] = df['threat'].apply(lambda i: 'threat' if i == 1 else 'neutral')

# # create two separate lists in the correct format for nltk.classify.apply_features to consume
# dat_obscene = [tuple(i) for i in zip(df['comment_text'].tolist(), df['obscene'].tolist())]
# dat_threat = [tuple(i) for i in zip(df['comment_text'].tolist(), df['threat'].tolist())]

# # del df
# print len(dat_obscene), len(dat_threat)

In [120]:
# function to get features for a comment
def extract_features(words):
    d = all_words.copy()
    d.update(dict.fromkeys(words, True))
    return d
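To make the shape of a feature set concrete, the sketch below runs both helpers on a made-up corpus of three comments. The comments and the name `toy_comments` are hypothetical; the two functions are restated so that the block runs on its own.

```python
def get_all_words(lists):
    """Map every word seen in any comment to a default flag of False."""
    d = dict()
    for l in lists:
        d.update(dict.fromkeys(l, False))
    return d

# hypothetical corpus of three tokenized comments
toy_comments = [['nice', 'edit'], ['stop', 'this', 'edit'], ['nice', 'work']]
all_words = get_all_words(toy_comments)

def extract_features(words):
    """Copy the vocabulary and flag the words of one comment as True."""
    d = all_words.copy()
    d.update(dict.fromkeys(words, True))
    return d

features = extract_features(['nice', 'work'])
print(sorted(features.items()))
```

Every feature set has one entry per vocabulary word, so with 177,303 words each call copies a large dictionary. The argument must be a list of words: a bare string would be iterated character by character, and its letters would be flagged instead of the word.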

In [122]:
# keep a full copy of the formatted data, then work on the first 301 rows only
df_orig = df.copy()
df = df_orig.loc[:300].copy()

In [124]:
len(df)

301

In [125]:
df['obscene'].sum()

18

In [126]:
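# pair each comment's word list with a string label, the format nltk.classify.apply_features consumes
labels = df['obscene'].apply(lambda i: 'obscene' if i == 1 else 'neutral').tolist()
dat = list(zip(df['comment_text'].tolist(), labels))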
len(dat)

301

### Training
Let's train a Naive Bayes classifier.

In [127]:
# build the feature sets lazily from the labelled comments and train the model
training_set = nltk.classify.apply_features(extract_features, dat)
classifier = nltk.NaiveBayesClassifier.train(training_set)

In [138]:
# drop the reference to the training set and run the garbage collector
import gc
del training_set
gc.collect()

0

In [129]:
dat[6][0]

['cocksucker', 'before', 'you', 'piss', 'around', 'work']

In [130]:
classifier.classify(extract_features(dat[6][0]))

'neutral'

In [132]:
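# classify every comment; instance[0] is the full word list (instance[0][0] would be only its first word)
test = [classifier.classify(extract_features(instance[0])) for instance in dat]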

In [137]:
# number of comments that were NOT predicted as obscene
sum(i != 'obscene' for i in test)

301
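A count of non-obscene predictions does not show how the predictions line up with the true labels. One way to check this is to tally (true label, predicted label) pairs. The sketch below uses made-up labels and predictions; `gold` and `predicted` are hypothetical names, not variables from the notebook.

```python
from collections import Counter

# hypothetical true labels and classifier outputs for five comments
gold = ['neutral', 'obscene', 'neutral', 'obscene', 'neutral']
predicted = ['neutral', 'neutral', 'neutral', 'obscene', 'neutral']

# tally each (true, predicted) combination
confusion = Counter(zip(gold, predicted))

# recall for the obscene class: share of truly obscene comments that were caught
n_obscene = sum(1 for g in gold if g == 'obscene')
recall = confusion[('obscene', 'obscene')] / float(n_obscene)

print(confusion)
print(recall)
```

In the notebook, `gold` would be the second element of each tuple in `dat`, and `predicted` would be the list `test`.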

In [142]:
obscene = df[df['obscene'] == 1]

In [143]:
obscene

Unnamed: 0,comment_text,obscene,threat,shouting
6,"[cocksucker, before, you, piss, around, work]",1,0,1.0
42,"[you, are, gay, antisemmitian, archangel, whit...",1,0,0.0
43,"[fuck, your, filthy, mother, the, ass, dry]",1,0,1.0
51,"[get, fucked, get, fuckeeed, got, drink, that,...",1,0,1.0
55,"[stupid, peace, shit, stop, deleting, stuff, a...",1,0,0.0
56,"[tony, sidaway, obviously, fistfuckee, loves, ...",1,0,0.0
58,"[band, page, deletion, you, thought, was, gone...",1,0,0.0
65,"[all, edits, are, good, cunts, like, you, who,...",1,0,0.0
105,"[pair, jew, hating, weiner, nazi, schmucks]",1,0,0.0
176,"[think, that, your, fagget, get, oife, and, bu...",1,1,0.0


In [146]:
# the 100 most informative (feature name, value) pairs; 100 is NLTK's default n
most_informative_features = classifier.most_informative_features()

In [149]:
# check whether the word 'fuck' is among them
[i for (i, j) in most_informative_features if i == 'fuck']

[]