# Data Exploration: Toxic Comments

## Set Up: Load modules and training data

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

train = pd.read_csv("train.csv")

RuntimeError: module compiled against API version 0xc but this version of numpy is 0xb


## Label Exploration

First we need to learn about the class labels. The pandas Categorical class makes it easy to 
cross-tabulate the different labels on comments, so we start by converting the integer labels 
to categorical. 

In [2]:
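# Columns 2 through 7 hold the six label columns (toxic ... identity_hate).
# Assigning by column name replaces the column, so the dtype really changes.
for col in train.columns[2:8]:
    train[col] = pd.Categorical(train[col])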
    
pd.crosstab(train.toxic, train.severe_toxic)

| toxic | severe_toxic = 0 | severe_toxic = 1 |
|---|---|---|
| 0 | 86614 | 0 |
| 1 | 8272 | 965 |


As we might expect, all severely toxic comments are also toxic comments, but not all toxic 
comments are severe.

In [3]:
pd.crosstab(train.toxic, [train.obscene, train.threat, train.insult, train.identity_hate])

| obscene | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| threat | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| insult | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| identity_hate | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| toxic = 0 | 86061 | 38 | 175 | 14 | 12 | 0 | 2 | 0 | 181 | 1 | 118 | 10 | 1 | 0 | 1 | 0 |
| toxic = 1 | 3449 | 74 | 754 | 70 | 79 | 3 | 8 | 3 | 1154 | 23 | 2904 | 520 | 10 | 0 | 128 | 58 |


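Interestingly, over a third of toxic comments (3,449 of 9,237) are "civilly" toxic, meaning they 
are labeled neither obscene, threat, insult, nor identity hate, yet they are still disruptive. 
However, the presence of any of these labels greatly increases the prevalence of toxicity: only 
about 4% of comments with none of the four labels are toxic (3,449 of 89,510), versus about 86% 
of comments labeled only obscene (1,154 of 1,335). The worst cases, the 58 comments that carry 
all four labels, are 100% toxic. 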
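
The prevalence figures can be read off the cross-tab by hand, but a grouped mean gives the same information directly. The following is a minimal sketch on a small made-up frame (`demo` and its values are illustrative assumptions, not the real training data); it relies only on the column names used in this notebook.

```python
import pandas as pd

# Hypothetical toy data with the same label column names as train.csv
demo = pd.DataFrame({
    "toxic":         [0, 0, 1, 1, 1, 0, 1, 1],
    "obscene":       [0, 0, 0, 1, 1, 1, 1, 0],
    "threat":        [0, 0, 0, 0, 0, 0, 1, 0],
    "insult":        [0, 0, 0, 0, 1, 0, 1, 0],
    "identity_hate": [0, 0, 0, 0, 0, 0, 1, 0],
})

other_labels = ["obscene", "threat", "insult", "identity_hate"]

# The mean of a 0/1 column within a group is the share of toxic comments;
# "size" shows how many comments each label combination contains.
toxic_rate = demo.groupby(other_labels)["toxic"].agg(["mean", "size"])
print(toxic_rate)
```

On the real data this would need the labels as integers rather than categoricals, since a mean is not defined for a categorical column.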

In [4]:
pd.crosstab(train.severe_toxic, [train.obscene, train.threat, train.insult, train.identity_hate])

| obscene | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| threat | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| insult | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| identity_hate | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| severe_toxic = 0 | 89488 | 109 | 921 | 81 | 83 | 3 | 10 | 3 | 1242 | 20 | 2413 | 382 | 9 | 0 | 84 | 38 |
| severe_toxic = 1 | 22 | 3 | 8 | 3 | 8 | 0 | 0 | 0 | 93 | 4 | 609 | 148 | 2 | 0 | 45 | 20 |


"Civil" severely toxic comments are much rarer: there are only 22. Generally, a smaller 
portion of comments is severely toxic, no matter what other labels we condition on. 
Even in the worst case, only 20 of the 58 obscene + threat + insult + identity hate comments 
are severely toxic.

In [5]:
# Convert the categorical data back to integers before correlations
for col in train.columns[2:8]:
    # Assign by column name so the dtype actually changes back to integer
    train[col] = train[col].astype(int)

train.iloc[:, 2:8].corr()

| | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|
| toxic | 1.0 | 0.30881 | 0.677491 | 0.162967 | 0.64833 | 0.259124 |
| severe_toxic | 0.30881 | 1.0 | 0.40454 | 0.133469 | 0.37745 | 0.193385 |
| obscene | 0.677491 | 0.40454 | 1.0 | 0.149874 | 0.744685 | 0.287794 |
| threat | 0.162967 | 0.133469 | 0.149874 | 1.0 | 0.157534 | 0.123971 |
| insult | 0.64833 | 0.37745 | 0.744685 | 0.157534 | 1.0 | 0.331922 |
| identity_hate | 0.259124 | 0.193385 | 0.287794 | 0.123971 | 0.331922 | 1.0 |


The correlation matrix is another way of summarizing the relationships between labels,
although here we only see pairwise correlations rather than the full cross-tabulation.
The story stays the same: all the correlations are positive, and the correlations of 
severe_toxic with the other four labels are always smaller than those of toxic. 

The correlation matrix will make for a good sanity check later when making multi-label 
predictions. We should expect the correlation matrix of the predicted probabilities to look 
very similar to this one; otherwise something is likely awry. 
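
One way to turn that sanity check into a number is to take the largest absolute difference between the two correlation matrices. The sketch below is only an illustration: `max_corr_gap` is a hypothetical helper, and both `labels` and `preds` are randomly generated stand-ins rather than real labels or model output.

```python
import numpy as np
import pandas as pd

label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def max_corr_gap(labels: pd.DataFrame, preds: pd.DataFrame) -> float:
    """Largest absolute difference between the two correlation matrices."""
    gap = (labels.corr() - preds[labels.columns].corr()).abs()
    return float(gap.to_numpy().max())

# Hypothetical stand-ins: positively correlated 0/1 labels that share a common
# random component, plus noisy "predicted probabilities" clipped to [0, 1].
rng = np.random.default_rng(0)
shared = rng.random((500, 1))
raw = 0.5 * rng.random((500, len(label_cols))) + 0.5 * shared
labels = pd.DataFrame((raw > 0.6).astype(int), columns=label_cols)
preds = (labels + rng.normal(0, 0.1, labels.shape)).clip(0, 1)

print(max_corr_gap(labels, preds))
```

What counts as "too large" a gap is a judgment call; the point is only to flag predictions whose label relationships look nothing like those in the training data.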

## What makes a comment toxic?

Let's start out with an overview of the comments' structures.

In [6]:
train[train.comment_text.isnull()]

| id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|


There are no missing values for the comment texts, so let's check for empty strings.

In [7]:
train[train.comment_text == '']

| id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|


Looks okay. If we find disguised missing values later (whitespace-only comments, for example), 
we can deal with them then.
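
One kind of hidden missing value that the two checks above would not catch is a comment made up only of whitespace. A minimal sketch of that check, using a made-up `comments` series as a stand-in for `train.comment_text`:

```python
import pandas as pd

# Hypothetical examples; on the real data this would be train.comment_text
comments = pd.Series(["Nice edit!", "   ", "\n\t", "ok"])

# True where nothing remains once leading and trailing whitespace is stripped
is_blank = comments.str.strip().eq("")
print(is_blank.sum())
```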

In [12]:
train['comment_length'] = train.comment_text.str.len()
train.comment_length.describe()

count    95851.000000
mean       395.341864
std        595.102072
min          6.000000
25%         96.000000
50%        206.000000
75%        435.000000
max       5000.000000
Name: comment_length, dtype: float64

The mean is about double the median, so there are some huge comments skewing the distribution 
to the right. The largest comment is 5000 characters, while the interquartile range runs only 
from 96 to 435 characters. Let's look at the longest comments. 
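
To put a number on the long tail, the comments at or above a length cutoff can be counted with a boolean mask. This is a sketch on made-up strings; the 2000-character `threshold` is the cutoff mentioned in the text, and `comments` is a hypothetical stand-in for `train.comment_text`.

```python
import pandas as pd

# Hypothetical comments of known length
comments = pd.Series(["short", "x" * 2000, "y" * 2500, "z" * 1999])
lengths = comments.str.len()

threshold = 2000
n_long = int((lengths >= threshold).sum())   # number of long comments
share_long = n_long / len(lengths)           # fraction of all comments
print(n_long, share_long)
```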

In [38]:
train = train.sort_values(by="comment_length", ascending=False)
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated
train.comment_text.head(1)

94815    JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!!  JIM WALES MUST DIE!!!!!!!!!!!! 

Only one comment is displayed here, but change this to head(10) or so and you'll see for
yourself that the longest comments are very vulgar and spammy. You could probably flag these 
basic spam posts by looking for a low ratio of unique words to comment length. For the record, 
you and I both surely do not want Jimmy Wales to die!
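
A rough version of that spam signal could look like the sketch below. It makes two assumptions that the notebook does not: words are split on whitespace after lowercasing, and length is measured in words rather than characters. `unique_word_ratio` is a hypothetical helper and the example strings are made up.

```python
import pandas as pd

def unique_word_ratio(text: str) -> float:
    """Share of distinct words among all whitespace-separated words."""
    words = text.lower().split()
    if not words:
        return 0.0  # avoid dividing by zero on empty or blank comments
    return len(set(words)) / len(words)

# Hypothetical examples: one repetitive spam-like comment, one ordinary one
comments = pd.Series([
    "BUY NOW BUY NOW BUY NOW BUY NOW",
    "Thanks for fixing the citation on that page.",
])
ratios = comments.apply(unique_word_ratio)
print(ratios)
```

Repetitive comments score close to zero, while ordinary short comments score close to one. Any cutoff for "spammy" would have to be tuned on the real data.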

In [45]:
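# train is already sorted by comment_length (descending) from the previous cell,
# so the first 1% of rows are the longest comments
one_percent = int(np.ceil(train.shape[0] / 100))
train_sub = train.iloc[0:one_percent, :]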
train_sub.toxic.value_counts()

0    818
1    141
Name: toxic, dtype: int64

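Long comments in general aren't especially toxic. Among the 1% longest comments above, about 
85% (818 of 959) are not toxic, although their toxic share of about 15% is somewhat higher 
than the overall rate of roughly 10% (9,237 of 95,851).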
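
The same comparison can be made without sorting the whole frame first. The sketch below uses a made-up `demo` frame and a hypothetical helper, `toxic_rate_in_longest`, to compare the toxic share among the longest comments with the overall rate.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the two columns the comparison needs
demo = pd.DataFrame({
    "comment_length": [10, 20, 30, 40, 50, 60, 70, 80, 90, 5000],
    "toxic":          [0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
})

def toxic_rate_in_longest(df: pd.DataFrame, fraction: float) -> float:
    """Share of toxic comments among the longest `fraction` of comments."""
    n = int(np.ceil(len(df) * fraction))
    longest = df.nlargest(n, "comment_length")
    return float(longest["toxic"].mean())

overall_rate = float(demo["toxic"].mean())
top_rate = toxic_rate_in_longest(demo, 0.10)
print(overall_rate, top_rate)
```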