# Toxicity

<p><a name="sections"></a></p>


## Sections

- <a href="#Initialization">Initialization</a><br>
- <a href="#Data-preparation">Data preparation</a><br>
    - <a href="#Import-the-data">Import the data</a><br>
    - <a href="#Formatting">Formatting</a><br>
    - <a href="#Detour---Shouting">Detour - Shouting</a><br>
- <a href="#Classification">Classification</a><br>
    - <a href="#Feature-sets">Feature sets</a><br>
    - <a href="#Training">Training</a><br>

# Initialization

Import the required modules.

In [39]:
# import the required modules
import numpy as np
import pandas as pd
import re
from nltk.tokenize import word_tokenize

# Data preparation 

### Import the data

Import the data from csv file and print the first 5 records. 

In [40]:
# import the data from csv, print the dimensions and view the first 5 records
df = pd.read_csv(filepath_or_buffer='./data/train.csv')
print df.shape
df.head()

(159571, 8)


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [41]:
# Calculate the frequency of each class label
df.iloc[:, 2:].sum() / df.shape[0]

toxic            0.095844
severe_toxic     0.009996
obscene          0.052948
threat           0.002996
insult           0.049364
identity_hate    0.008805
dtype: float64


Notice that there are 6 different classification labels in the dataset. For this project I am going to focus on predicting only 2 of the 6 different class labels. The other team members will focus on the remaining 4 class labels. I will focus only on **toxic** and **severe_toxic**. In order to do this I will remove unwanted columns.

### Formatting

Let's first prepare the data before we create a corpus.

+ Drop redundant columns
+ Remove unknown ascii characters
+ Some comments contains a timestamp and/or IP address. Replace those with a single space
+ Remove automated comment
+ Remove unwanted characters
+ Replace multiple spaces with a single space
+ Remove words that are shorter than 3 characters long
+ Tokenize

In [42]:
# drop redundant columns
df.drop(['id', 'obscene', 'threat', 'insult', 'identity_hate'], axis=1, inplace=True)

In [43]:
# create functions to be used to format the comments
# handle non-asci characters 
def handle_unknown(s):
    return s.strip().decode("ascii","ignore").encode("ascii")

# handle ip addresses
def handle_ip(s):
    return re.sub('[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+', ' ', s ).strip()

# handle timestamps
def handle_ts(s):
    return re.sub('[0-9]{1,2}:[0-9]{1,2}.+,\s.+,\s[0-9]{4}\s+\([A-Z]+\)', ' ', s ).strip()

# handle automated comments
def handle_autocomment(s):
    return re.sub('The preceding unsigned comment was added by', ' ', s ).strip()

# handle multiple spaces
def handle_spaces(s):
    return re.sub('\s+', ' ', s ).strip()

# remove punctuation
def handle_punc(s):
    return re.sub("[^A-z\s]", ' ', s).strip()

# split each comment into a list of words
def handle_split(s):
    return s.split()

# keep only words where length >= 3
def handle_remove(s):
    return [e for e in s if len(e) >= 3]

# wrapping the above into one func for preformance sake
def handler(s):
    return handle_remove(handle_split(handle_spaces(handle_punc(handle_autocomment(handle_ts(handle_ip(handle_unknown(s))))))))


In [44]:
# create before and after variables to display the effect of the changes
before = df['comment_text'][37]
df['comment_text'] = df['comment_text'].apply(handler)
after = df['comment_text'][37]

In [45]:
# print before
print before

pretty much everyone from warren county/surrounding regions was born at glens falls hospital. myself included. however, i'm not sure this qualifies anyone as being a glens falls native. rachel ray is, i believe, actually from the town of lake luzerne.  —The preceding unsigned comment was added by 70.100.229.154  04:28:57, August 19, 2007 (UTC)


In [46]:
# print after
print after

['pretty', 'much', 'everyone', 'from', 'warren', 'county', 'surrounding', 'regions', 'was', 'born', 'glens', 'falls', 'hospital', 'myself', 'included', 'however', 'not', 'sure', 'this', 'qualifies', 'anyone', 'being', 'glens', 'falls', 'native', 'rachel', 'ray', 'believe', 'actually', 'from', 'the', 'town', 'lake', 'luzerne']


From the prinouts of *before* and *after* we can see how the changes have been applied correctly. 

### Detour - Shouting
Measure shouting using the number of UPPERCASE words in a comment. 

In [48]:
# function that return the number of words uppercase that are uppercase as a ratio
def ratio_uppers(l):
    return round(sum(1 for i in l if i.isupper())/len(l), ndigits=3) if len(l) > 0 else 0

# add new column which measure the degree of shouting
df['shouting'] = df['comment_text'].apply(ratio_uppers)

In [49]:
df[df['shouting']>0][:5]

Unnamed: 0,comment_text,toxic,severe_toxic,shouting
6,"[COCKSUCKER, BEFORE, YOU, PISS, AROUND, WORK]",1,1,1.0
43,"[FUCK, YOUR, FILTHY, MOTHER, THE, ASS, DRY]",1,0,1.0
51,"[GET, FUCKED, GET, FUCKEEED, GOT, DRINK, THAT,...",1,0,1.0
159,"[UNBLOCK, GET, LAWYERS, YOU, FOR, BLOCKING, CO...",1,0,1.0
338,"[THIS, WIIL, LAST, USE, THIS, ACOUNT, PLEASE, ...",0,0,1.0


Ouch! There are some really nasty comments in this dataset.

### Formatting continued

To complete the formatting step we need to change all words to lowercase. 

In [61]:
# function to convert all values in a list to lowercase
def conv_lower(l):
    return [i.lower() for i in l]

# convert all uppercase to lower case
df['comment_text'] = df['comment_text'].apply(conv_lower)

## Classification
We are going to classify the comments using the NaiveBayes algorithm that is part of the NLTK library. This library takes a list of tuples as input so some further formatting will be required.

### Feature sets
We need to build two lists, one for training and one for testing by converting the dataframe **df** to two seperate lists, one for **toxic** and the other for **severe_toxic** comments. These lists should be lists of tuples in this format (list of words, class label)

In [67]:
# convert toxic & severe_toxic to strings
df['tox'] = df['toxic'].apply(lambda i: 'toxic' if i == 1 else 'neutral')
df['sev_tox'] = df['severe_toxic'].apply(lambda i: 'severe_toxic' if i == 1 else 'neutral')
# drop redundant cols
df.drop(['toxic', 'severe_toxic'], axis=1, inplace=True)

# create two seperate lists in the correct format for nltk.classify.apply_features to consume
dat_tox = [tuple(i) for i in zip(df['comment_text'].tolist(), df['tox'].tolist())]
dat_sev_tox = [tuple(i) for i in zip(df['comment_text'].tolist(), df['sev_tox'].tolist())]

del df
print len(dat_tox), len(dat_sev_tox)

### Format

In [12]:
# function to get ALL the words returned in a dict with default value of false
def get_all_words(lists):
    d = dict()
    for l in lists:
        d.update(dict.fromkeys(l, False))
    return d

# function to get features for a comment
def extract_features(words):
    d = all_words.copy()
    d.update(dict.fromkeys(words, True))
    return d

Create a default dictionary, that we will use later when creating features

In [13]:
# words as dict
all_words = get_all_words(df['comment_text'])
len(all_words)

177303

Create a 'features' column in the dataframe. This is basically a dictionary for each observation stating which of the the words in **all_words** is present in the comment.

In [None]:
len(all_words)

### Training

In [None]:
w = df['comment_text'][37]

In [None]:
x = extract_features(w)

In [None]:
x['myself']

In [None]:
x = df.iloc[:5].copy()
x['features'] = x['comment_text'].apply(extract_features)
x['toxic'][2] = 1

In [None]:
x

In [None]:
features = x['features'].tolist()
response = x['toxic'].tolist()
dat = zip(features, response)

In [None]:
T = [tuple(i) for i in dat]

In [None]:
len(T[0][0])

In [None]:
T[2][0]