# Toxicity

<p><a name="sections"></a></p>


## Sections

- <a href="#Initialization">Initialization</a><br>
- <a href="#Data-preparation">Data preparation</a><br>
    - <a href="#Import-the-data">Import the data</a><br>
    - <a href="#Formatting">Formatting</a><br>
- <a href="#pos">POS Tag</a><br>
    - <a href="#ex2">Exercise: Lemmatization with POS Tag</a><br>
- <a href="#chunk">Chunk</a><br>
    - <a href="#ex3">Exercise: Syntax Tree/ Chunking</a><br>
- <a href="#classify">Text Classification</a><br>
    - <a href="#ex4">Exercise: Classify the Testing Set</a><br>
- <a href="#lda">Brief Introduction to LDA</a><br>

# Initialization

Import the required modules.

In [1]:
# import the required modules
import numpy as np
import pandas as pd
import re
from nltk.tokenize import word_tokenize

# Data preparation 

##### Import the data

Import the data from the CSV file, print its dimensions, and view the first 5 records.

In [46]:
# import the data from csv, print the dimensions and view the first 5 records
df = pd.read_csv(filepath_or_buffer='./data/train.csv')
print(df.shape)
df.head()

(159571, 8)


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [48]:
# Calculate the frequency of each class label
df.iloc[:, 2:].sum() / df.shape[0]

toxic            0.095844
severe_toxic     0.009996
obscene          0.052948
threat           0.002996
insult           0.049364
identity_hate    0.008805
dtype: float64


The dataset has 6 classification labels, and all of them are rare: **toxic** is the most common at about 9.6% of comments, and **threat** is the rarest at about 0.3%. For this project I will focus on predicting only 2 of the 6, **toxic** and **severe_toxic**; the other team members will cover the remaining 4. To do this, I will drop the columns I do not need.

##### Formatting

Let's prepare the data before we create a corpus.

+ Drop redundant columns
+ Remove non-ASCII characters
+ Some comments contain a timestamp and/or an IP address; replace those with a single space
+ Remove automated comments
+ Remove unwanted characters
+ Replace multiple spaces with a single space

In [49]:
# drop redundant columns
df.drop(['id', 'obscene', 'threat', 'insult', 'identity_hate'], axis=1, inplace=True)

In [50]:
# create functions to be used to format the comments
# drop non-ASCII characters
# (a regex is used so this works on Python 2 byte strings and on Python 3 str)
def handle_unknown(s):
    return re.sub(r'[^\x00-\x7F]', '', s.strip())

# handle ip addresses
def handle_ip(s):
    return re.sub(r'[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+', ' ', s).strip()

# handle timestamps
def handle_ts(s):
    return re.sub(r'[0-9]{1,2}:[0-9]{1,2}.+,\s.+,\s[0-9]{4}\s+\([A-Z]+\)', ' ', s).strip()

# handle automated comments
def handle_autocomment(s):
    return re.sub('The preceding unsigned comment was added by', ' ', s).strip()

# handle multiple spaces
def handle_spaces(s):
    return re.sub(r'\s+', ' ', s).strip()

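# remove punctuation: replace everything except letters, whitespace and apostrophes
# (A-Za-z rather than A-z, because A-z also covers [ \ ] ^ _ and `)
def handle_punc(s):
    return re.sub(r"[^A-Za-z\s']", ' ', s).strip()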

# split each comment into a list of words
def handle_split(s):
    return s.split()

# wrap the steps above in one function so the column is processed in a single pass
def handler(s):
    return handle_split(handle_spaces(handle_punc(handle_autocomment(handle_ts(handle_ip(handle_unknown(s)))))))


In [51]:
# create before and after variables to display the effect of the changes
before = df['comment_text'][37]
df['comment_text'] = df['comment_text'].apply(handler)
after = df['comment_text'][37]

In [52]:
# print before
print(before)

pretty much everyone from warren county/surrounding regions was born at glens falls hospital. myself included. however, i'm not sure this qualifies anyone as being a glens falls native. rachel ray is, i believe, actually from the town of lake luzerne.  —The preceding unsigned comment was added by 70.100.229.154  04:28:57, August 19, 2007 (UTC)


In [53]:
# print after
print(after)

['pretty', 'much', 'everyone', 'from', 'warren', 'county', 'surrounding', 'regions', 'was', 'born', 'at', 'glens', 'falls', 'hospital', 'myself', 'included', 'however', "i'm", 'not', 'sure', 'this', 'qualifies', 'anyone', 'as', 'being', 'a', 'glens', 'falls', 'native', 'rachel', 'ray', 'is', 'i', 'believe', 'actually', 'from', 'the', 'town', 'of', 'lake', 'luzerne']


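The printouts of *before* and *after* show that the changes have been applied correctly: the non-ASCII dash, the automated-comment text, the IP address, the timestamp, and the punctuation are gone, and the comment is now a list of words.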
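One detail worth checking in a letters-only filter such as `handle_punc` is the character range. In ASCII, the range `A-z` runs from code 65 to 122, so besides the letters it also covers `[`, `\`, `]`, `^`, `_` and the backtick. The sketch below compares that range with the explicit `A-Za-z` on a made-up sample string; the variable names are only illustrative.

```python
import re

# Made-up sample containing characters that sit between 'Z' and 'a' in ASCII
sample = "some_text [with] odd^chars and it's fine"

# 'A-z' spans ASCII 65-122, so [ ] ^ and _ survive the substitution
loose = re.sub(r"[^A-z\s']", " ", sample)

# Listing both letter ranges keeps only letters, whitespace and apostrophes
strict = re.sub(r"[^A-Za-z\s']", " ", sample)

print(loose)
print(strict.split())
```

With the explicit ranges, tokens such as `some_text` are split into separate words rather than passed through unchanged.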

##### Detour - Shouting
Measure shouting using the number of UPPERCASE words + number of multiple exclamation/question marks in a comment. 

In [56]:
# return the share of words in a tokenized comment that are fully uppercase
def ratio_uppers(l):
    # float() forces true division; under Python 2, int / int would truncate to 0
    return round(sum(1 for i in l if i.isupper()) / float(len(l)), 3) if len(l) > 0 else 0

In [67]:
# add a new column that measures the degree of shouting
df['shouting'] = df['comment_text'].apply(ratio_uppers)
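As a quick sanity check, both parts of the shouting measure can be tried on a hand-made example. This is a minimal sketch, not the notebook's implementation: `upper_ratio` and `count_mark_runs` are hypothetical helper names, the sample strings are invented, and the pattern `[!?]{2,}` is an assumed reading of "multiple exclamation/question marks". The punctuation count would have to run on the raw comment text, because the cleaning step strips punctuation.

```python
import re

def upper_ratio(tokens):
    """Share of tokens that are fully uppercase; 0.0 for an empty list."""
    if not tokens:
        return 0.0
    n_upper = sum(1 for t in tokens if t.isupper())
    # float() keeps the division exact on Python 2 as well as Python 3
    return round(n_upper / float(len(tokens)), 3)

def count_mark_runs(raw_text):
    """Number of runs of two or more '!' or '?' characters (assumed definition)."""
    return len(re.findall(r"[!?]{2,}", raw_text))

# Invented example: 2 of 5 tokens are uppercase; the raw text has two runs of marks
tokens = ["STOP", "editing", "MY", "page", "now"]
raw = "STOP editing MY page now!!! Why?? ok?"

print(upper_ratio(tokens))   # 0.4
print(count_mark_runs(raw))  # 2
```

Note that single-letter words such as "I" also count as uppercase under `str.isupper()`, which may slightly inflate the ratio for short comments.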