## Machine Learning Engineer Nanodegree
### Capstone Project
### Project: Toxic Comment Classification with CNN

### Exploratory Data Analysis

This step analyzes the dataset provided by Kaggle: the types of the data, basic statistics, and abnormalities. Where possible, features of the data are visualized.

File descriptions

    - train.csv - the training set, contains comments with their binary labels
    - test.csv - the test set, you must predict the toxicity probabilities for these comments. To deter hand labeling, the test set contains some comments which are not included in scoring.
    - sample_submission.csv - a sample submission file in the correct format (will not be used for capstone project)
    - test_labels.csv - labels for the test data; value of -1 indicates it was not used for scoring; (Note: file added after competition close!)

In [2]:
# import required libraries
import pandas as pd 
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
from os import chdir, getcwd

chdir('/opt/data/share01/jl2408/')
getcwd() 

In [4]:
# Read in train and test datasets
train = pd.read_csv('train.csv')
test_cm = pd.read_csv('test.csv')
test_lb = pd.read_csv('test_labels.csv')

In [5]:
# Merge test comments with test labels
test_all = pd.merge(test_cm, test_lb, on='id')
#test_all = test_all.reset_index(drop=True)

In [6]:
# use only a subset of test data since value of -1 indicates it was not used for scoring
test = test_all[test_all['toxic'] != -1]

In [7]:
# list label names
label_names = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']

In [8]:
# Example data in train
train.loc[train[label_names].sum(axis=1) != 0].head()

In [9]:
# Example data in test
test_all.loc[test_all[label_names].sum(axis=1) > 0].head()

In [10]:
# Data information in train
train.info()

In [11]:
# Data information in all test data
test_all.info()

In [12]:
# Data information in test data excluding labels with -1 value
test.info()

In [13]:
# unique values in integer columns in train
pd.unique(train[label_names].values.ravel())

In [14]:
# unique values in integer columns in test
pd.unique(test[label_names].values.ravel())

In summary, the train dataset has 159571 comments and test_all has 153164 comments, of which only 63978 (the test dataset) are used for scoring. The train dataset has 8 columns: id, comment_text, toxic, severe_toxic, obscene, threat, insult and identity_hate. The id column is a unique identifier for each entry, comment_text is the text to be classified, and each of the 6 categories is labeled either 0 or 1. The dataset from test.csv has 2 columns: id and comment_text. test_lb has 7 columns: id, toxic, severe_toxic, obscene, threat, insult and identity_hate. The test labels were published after the competition ended, so we use the train dataset to train our model and the test dataset (excluding rows labeled -1) to test and score it.
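The merge-and-filter step described above can be illustrated on toy data. This is a minimal sketch: the two small frames are hypothetical stand-ins for test.csv and test_labels.csv, and only two label columns are shown.

```python
import pandas as pd

# Hypothetical stand-ins for test.csv and test_labels.csv
comments = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "comment_text": ["first comment", "second comment", "third comment"],
})
labels = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "toxic": [0, -1, 1],
    "insult": [0, -1, 0],
})

# Inner join on the shared id column
merged = pd.merge(comments, labels, on="id")

# Keep only the rows that were scored (label value other than -1)
scored = merged[merged["toxic"] != -1].reset_index(drop=True)
print(scored)
```

Resetting the index after filtering is optional; it only matters if later code relies on positional labels such as `scored.loc[0]`.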

In [16]:
# Check for missing values in train
print(train.isnull().sum())

In [17]:
# Check for missing values in test
print(test.isnull().sum())

So, there are no missing values.

In [19]:
# Mean values for each label
train[label_names].mean(axis=0)

In [20]:
# Occurrences of each label relative to the number of samples
train_lb = train[label_names]
train_lb_counts = (train_lb.sum()/len(train_lb)).sort_values(ascending=False)
print(train_lb_counts)

Here we see that about 9.6% of the comments in train are 'toxic'; about 5% are 'obscene' or 'insult'; about 1% are 'severe_toxic' or 'identity_hate'; and about 0.3% are 'threat'.
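For 0/1 labels, the column mean and the sum divided by the number of rows give the same share of positive comments. A small sketch on made-up labels (the values are hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical 0/1 labels for ten comments
toy = pd.DataFrame({
    "toxic":   [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "obscene": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "threat":  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
})

# Share of positive comments per label, largest first
share_by_mean = toy.mean().sort_values(ascending=False)
share_by_sum = (toy.sum() / len(toy)).sort_values(ascending=False)
print(share_by_mean)
```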

In [22]:
# Plot relative count for each toxicity category

from collections import OrderedDict
colors = OrderedDict({1: 'red', 2: 'orange', 3: 'blue', 4: 'brown', 5: 'black', 6: 'grey'})
toxicity_mapping = OrderedDict({1: 'toxic', 4: 'severe_toxic', 2: 'obscene', 6: 'threat', 3: 'insult', 5: 'identity_hate'})
train_lb_counts.plot.bar(figsize = (8, 6), color = list(colors.values()), edgecolor = 'k', linewidth = 2)

# Formatting
plt.xlabel('Toxicity Level'); 
plt.ylabel('Count Percentage'); 
plt.xticks([x - 1 for x in toxicity_mapping.keys()], list(toxicity_mapping.values()), rotation = 60)
plt.title('Fig. 1 Toxicity Level Breakdown');

In [23]:
# Plot count for multi-label comments
import matplotlib.patches as mpatches
import seaborn as sns
#sns.set(style='darkgrid')
plt.figure(figsize=(8,6))
x = train[label_names].sum(axis=1).value_counts()
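# Keyword arguments are used because recent seaborn versions no longer accept x and y positionally
ax = sns.barplot(x=x.index, y=x.values, color='blue')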

plt.xlabel('# of positive labels')
plt.ylabel('count')
plt.title("Fig. 2 Multi-label Comment Distribution")

patch = mpatches.Patch(color='red', label='Count Percentage on Top of Bar')
plt.legend(handles=[patch])

# Add the count fraction on top of each bar
rects = ax.patches
labels = x.values / len(train.index)
for rect, label in zip(rects, labels):
    height = rect.get_height()
    label = format(label, '.4f')
    ax.text(rect.get_x() + rect.get_width()/2, height, label, ha='center', va='bottom', color='red')

plt.show()

Here we see that most comments have no positive label; these are clean comments. About 4% of comments have exactly 1 positive toxicity label and about 6% have 2 or more.
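The distribution of positive labels per comment can also be read off directly as fractions. A minimal sketch on made-up labels (hypothetical values, three label columns only):

```python
import pandas as pd

# Hypothetical 0/1 labels for five comments
toy = pd.DataFrame({
    "toxic":   [0, 1, 1, 0, 1],
    "obscene": [0, 0, 1, 0, 1],
    "insult":  [0, 0, 0, 0, 1],
})

# Number of positive labels per comment
labels_per_comment = toy.sum(axis=1)

# Fraction of comments for each label count, ordered by count
dist = labels_per_comment.value_counts(normalize=True).sort_index()

# Share of comments carrying two or more positive labels
multi_label_share = (labels_per_comment >= 2).mean()
print(dist)
print(multi_label_share)
```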

In [25]:
# Does severe_toxic always mean toxic?
train.loc[(train['severe_toxic'] == 1) & (train['toxic'] != 1), label_names]

In [26]:
# What do the category values look like when severe_toxic is True?
train[train['severe_toxic'] == 1]

This is a multi-label classification problem. The label means in the train data show that most comments are labeled 0 in every category, and the per-label shares confirm it: severe_toxic, identity_hate and threat each account for less than 5% of the comments. Under such imbalance, roc_auc might not be a good metric, since we care more about capturing toxic comments than about clean ones. However, we'll stick to roc_auc since it is the required scoring metric for this challenge. In practice, class imbalance should be taken into account, and another metric such as the precision-recall curve might be more appropriate. Also, severe_toxic is a sub-category of toxic, and we see example comments with multiple positive labels.
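To make the concern about imbalance concrete, the sketch below compares ROC AUC with average precision (a summary of the precision-recall curve) on synthetic data. The labels and scores are invented for illustration and have nothing to do with the model trained later; the only library calls used are scikit-learn's `roc_auc_score` and `average_precision_score`.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Synthetic, heavily imbalanced labels: 990 negatives and 10 positives (1%)
y_true = np.array([0] * 990 + [1] * 10)

# Hypothetical scores: negatives spread over 0..989, every positive scored 950.
# 40 negatives score at or above the positives.
y_score = np.concatenate([np.arange(990, dtype=float), np.full(10, 950.0)])

roc_auc = roc_auc_score(y_true, y_score)
avg_precision = average_precision_score(y_true, y_score)

print("ROC AUC:", round(roc_auc, 3))
print("Average precision:", round(avg_precision, 3))
```

In this setup the positives outrank most negatives, so ROC AUC is high, yet recovering all 10 positives also pulls in 40 negatives, so average precision stays low. That gap is what the precision-recall view exposes when positives are rare.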

In [28]:
import nltk
nltk.data.path

In [29]:
import nltk
nltk.data.path.append("/opt/data/share01/jl2408/")

In [30]:


# import stopwords
from nltk.corpus import stopwords
#from nltk.tokenize import RegexpTokenizer

# Function to filter out stop words
# Note: the comparison is case-sensitive, so capitalized stop words such as 'I' are kept
def filter_stop_words(sentences, stop_words):
    filtered = []
    for sentence in sentences:
        words = sentence.split()
        words_filtered = [word for word in words if word not in stop_words]
        filtered.append(" ".join(words_filtered))
    return filtered

stop_words = set(stopwords.words("english"))

# Comments in train
train_cm = train['comment_text']
train_cm_filtered = filter_stop_words(train_cm, stop_words)

# Comments in test_all
test_all_cm = test_all['comment_text']
test_all_cm_filtered = filter_stop_words(test_all_cm, stop_words)

# Comments in test (excluding labels with -1 values)
test_cm = test['comment_text']
test_cm_filtered = filter_stop_words(test_cm, stop_words)



In [31]:
# before filtering out stop words
train_cm[0]

In [32]:
# convert to series
train_cm_filtered = pd.Series(train_cm_filtered)

In [33]:
# after filtering out stop words
train_cm_filtered[0]
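Whether a capitalized word such as 'I' survives filtering depends on whether words are lowercased before the stop-word lookup. The sketch below contrasts the two behaviours; `STOP_WORDS` is a small hypothetical subset used only for illustration, not the NLTK list, and both helper functions are illustrative names.

```python
# Hypothetical, tiny stop-word set (the NLTK English list is all lowercase too)
STOP_WORDS = {"i", "the", "is", "a"}

def filter_case_sensitive(sentence, stop_words):
    """Drop words that match a stop word exactly (capitalized forms are kept)."""
    return " ".join(w for w in sentence.split() if w not in stop_words)

def filter_case_insensitive(sentence, stop_words):
    """Drop words whose lowercase form is a stop word."""
    return " ".join(w for w in sentence.split() if w.lower() not in stop_words)

sentence = "I think the page is fine"
kept_sensitive = filter_case_sensitive(sentence, STOP_WORDS)
kept_insensitive = filter_case_insensitive(sentence, STOP_WORDS)
print(kept_sensitive)
print(kept_insensitive)
```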

In [34]:
# Make a copy of the train datasets
c_train = train.copy(deep=True)

In [35]:
# Add a 'clean' column to the dataframe
c_train['clean'] = c_train[label_names].sum(axis=1) == 0

In [36]:
# Number of sentences in a comment (approximated as the number of newlines + 1)
import re
sentence_count = train_cm.apply(lambda x: len(re.findall("\n",str(x)))+1)

In [37]:
# Add 'stopwords_count' column in a comment to dataframe
c_train['stopwords_count'] = c_train['comment_text'].apply(lambda x: len([w for w in str(x).lower().split() if w in stop_words]))

In [38]:
c_train.head()

In [39]:
# Plot number of stopwords in each comment by toxicity or not
plt.figure(figsize=(12,8))

sns.violinplot(y='stopwords_count',x='clean', data=c_train, split = True)
plt.xlabel('Clean?', fontsize=12)
plt.ylabel('# of stop words', fontsize=12)
plt.title("Number of stop words in each comment by toxicity or not", fontsize=15)

plt.show()

In [40]:
# Add 'ip' column in a comment to dataframe
c_train['ip'] = c_train['comment_text'].apply(lambda x: re.findall(r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}",str(x)))

In [41]:
# Sample IPs
c_train['ip'][c_train['ip'].str.len() != 0].head(5)

In [42]:
# Add 'ip_count' column in a comment to dataframe
c_train['ip_count'] = c_train['ip'].apply(lambda x: len(x))

In [43]:
# Distribution of ip numbers in a clean comment
c_train[c_train['clean']]['ip_count'].value_counts()

In [44]:
# Distribution of ip numbers in a non-clean comment
c_train[c_train['clean'] == False]['ip_count'].value_counts()

In [45]:
# Plot and compare number of IPs in clean and non-clean comments
plt.figure(figsize=(12,8))

sns.violinplot(y='ip_count',x='clean', data=c_train, split = True)
plt.xlabel('Clean comments?', fontsize=10)
plt.ylabel('# of IPs', fontsize=10)
plt.title("Compare number of IPs in each comment", fontsize=15)

plt.show()

There is no significant indication that toxic comments contain more IPs, so we can remove IPs from comments.
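One way the removal could be done is with `re.sub` and the same dotted-quad pattern. This is a sketch under assumptions: `strip_ips` and the sample string are illustrative, and the pattern does not validate octet ranges, so it would also match strings like 999.999.999.999.

```python
import re

# Four groups of 1-3 digits separated by dots (no range validation)
IP_PATTERN = re.compile(r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}")

def strip_ips(text):
    """Replace every IP-like token with a space, then collapse whitespace."""
    return " ".join(IP_PATTERN.sub(" ", text).split())

sample = "Edit by 192.168.1.1 reverted, see version 1.2.3"
found = IP_PATTERN.findall(sample)
cleaned = strip_ips(sample)
print(found)
print(cleaned)
```

The version string `1.2.3` is left alone because it has only three numeric groups.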

In [47]:
# Extract links from comments (http or https, up to the next whitespace)
link = c_train['comment_text'].apply(lambda x: re.findall(r"https?://\S+",str(x)))

In [48]:
# Sample links in documents
link[link.str.len() != 0].head(5)

In [49]:
# Add 'link_count' column in a comment to dataframe
c_train['link_count'] = link.apply(lambda x: len(x))

In [50]:
# Distribution of link counts
c_train['link_count'].value_counts()

In [51]:
# Plot and compare number of links in each comment
plt.figure(figsize=(12,8))

sns.violinplot(y='link_count',x='clean', data=c_train, split = True)
plt.xlabel('Clean comments?', fontsize=10)
plt.ylabel('# of links', fontsize=10)
plt.title("Compare number of links in each comment", fontsize=15)

plt.show()
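When counting links, the choice of regular expression matters. A greedy pattern of the form `http://.*com` runs from the first link to the last occurrence of "com" in the comment and so reports several links as one, while a whitespace-delimited pattern counts them separately. The sample text below is made up for illustration.

```python
import re

sample = "see http://a.com and https://b.org/page for details, then http://c.com"

# Greedy: '.*' extends to the last 'com' in the string, giving a single match
greedy = re.findall(r"http://.*com", sample)

# Whitespace-delimited: one match per link, http or https
per_link = re.findall(r"https?://\S+", sample)

print(len(greedy), greedy)
print(len(per_link), per_link)
```

The whitespace-delimited pattern is itself only an approximation: punctuation directly attached to a link, such as a trailing comma, would be included in the match.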

In [52]:
# Extract username from comments
username = train_cm.apply(lambda x: re.findall(r"\[\[.*\]",str(x)))

In [53]:
# Add 'username_count' column in a comment to dataframe
c_train['username_count'] = username.apply(lambda x: len(x))

In [54]:
# Distribution of username counts
c_train['username_count'].value_counts()

In [55]:
# Sample of usernames
username[username.str.len() != 0].head(5)

In [56]:
# Example comment containing a username
c_train['comment_text'][140]

In [57]:
# Plot and compare number of usernames in each comment
plt.figure(figsize=(12,8))

sns.violinplot(y='username_count',x='clean', data=c_train, split = True)
plt.xlabel('Clean comments?', fontsize=10)
plt.ylabel('# of username', fontsize=10)
plt.title("Compare number of usernames in each comment", fontsize=15)

plt.show()

In [58]:
all_cm = pd.concat([train_cm, test_all_cm])

In [59]:
# We want to generate features using sklearn's TfidfVectorizer, fitted on the train comments
# We limit maximum features to 100000 due to memory constraints
from sklearn.feature_extraction.text import TfidfVectorizer
word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    strip_accents='unicode',
    analyzer='word',
    stop_words='english',
    ngram_range=(1, 1),
    max_features=100000)
word_vectorizer.fit(train_cm)

In [60]:
# Get the feature names (get_feature_names_out() replaces get_feature_names() in scikit-learn 1.0 and later)
features = np.array(word_vectorizer.get_feature_names_out())

In [61]:
# Generate both train set and test set
train_word_features = word_vectorizer.transform(train_cm)
test_all_word_features = word_vectorizer.transform(test_all_cm)
test_word_features = word_vectorizer.transform(test_cm)

In [62]:
train_word_features

In [63]:
test_all_word_features

In [64]:
test_word_features

In [65]:
# https://buhrmann.github.io/tfidf-analysis.html
# take a single row of the tf-idf matrix (corresponding to a particular document), 
# and return the n highest scoring words (or more generally tokens or features)
def top_tfidf_feats(row, features, top_n=25):
    ''' Get top n tfidf values in row and return them with their corresponding feature names.'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

# https://buhrmann.github.io/tfidf-analysis.html
# convert a single row (row_id) from a sparse matrix (Xtr) into dense format
def top_feats_in_doc(Xtr, features, row_id, top_n=25):
    ''' Top tfidf features in specific document (matrix row) '''
    row = np.squeeze(Xtr[row_id].toarray())
    return top_tfidf_feats(row, features, top_n)
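The core of the helper functions above is the `np.argsort` idiom for picking the n largest entries of a dense row. A tiny sketch with made-up feature names and scores shows the mechanics:

```python
import numpy as np

# Hypothetical feature names and tf-idf scores for one document
features = np.array(["apple", "banana", "cherry", "date"])
row = np.array([0.0, 0.7, 0.1, 0.5])
top_n = 2

# argsort is ascending, so reverse it and keep the first top_n indices
top_ids = np.argsort(row)[::-1][:top_n]
top = [(str(features[i]), float(row[i])) for i in top_ids]
print(top)
```

For very wide rows, `np.argpartition` could avoid a full sort, but with 100000 features a full `argsort` per document is usually acceptable for exploration.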

Sample top 10 features on index 1 in train_cm

In [67]:
train_cm[1]

In [68]:
x = top_feats_in_doc(train_word_features, features, 1, 10)

In [69]:
print(x)

In [70]:
# What is the vocabulary size for the comments?

from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences

# for train data
T = Tokenizer()
T.fit_on_texts(list(train_cm))
vocab = len(T.word_index) + 1
print('Vocabulary for train comments is {}'.format(vocab))

# for all data
T = Tokenizer()
T.fit_on_texts(list(train_cm) + list(test_all_cm))
vocab = len(T.word_index) + 1
print('Vocabulary for all comments is {}'.format(vocab))

# after filter out stopwords
T = Tokenizer()
T.fit_on_texts(list(train_cm_filtered) + list(test_all_cm_filtered))
vocab = len(T.word_index) + 1
print('Vocabulary for all comments after filtering out stopwords is {}'.format(vocab))

# for all data exclude labels with -1 values
T = Tokenizer()
T.fit_on_texts(list(train_cm) + list(test_cm))
vocab = len(T.word_index) + 1
print('Vocabulary for all comments excluding non-scoring ones is {}'.format(vocab))

# Excluding labels with -1 values and filtered out stopwords
T = Tokenizer()
T.fit_on_texts(list(train_cm_filtered) + list(test_cm_filtered))
vocab = len(T.word_index) + 1
print('Vocabulary for all comments after filtering out stopwords and excluding non-scoring ones is {}'.format(vocab))

In [71]:
# Check number of comments in train and test dataset
t_train = T.texts_to_sequences(train_cm_filtered)
t_test = T.texts_to_sequences(test_cm_filtered)
print(len(t_train))
print(len(t_test))

Let's look at an example of a text converted to a sequence. Here we see a total of 33 words after removing punctuation and stop words. The word 'I', for example, is mapped to index 2 and appears twice.
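The idea behind converting text to a sequence can be sketched in plain Python: words are ranked by frequency, the most frequent word gets index 1, and each text becomes the list of its word indices. This is a simplified stand-in for illustration, not the Keras `Tokenizer` implementation (it skips punctuation filtering, for instance); `fit_word_index` and `texts_to_sequences` here are hypothetical helper names.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to an integer index, most frequent word first (indices start at 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

texts = ["the cat sat", "the cat ran", "the dog"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index)

# +1 because index 0 is left unused (typically reserved for padding)
vocab_size = len(word_index) + 1
print(word_index)
print(sequences)
```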

In [73]:
# A sample comment
train_cm[0]

In [74]:
# Words after removing punctuations and stopwords
word_t = text_to_word_sequence(train_cm_filtered[0])
print(len(word_t))
print(word_t)

In [75]:
# Word sequenced
print(len(t_train[0]))
t_train[0]

In [76]:
# Let's find max number of words in a sentence
n_words_in_a_sentence = [len(x) for x in t_train + t_test]
print(max(n_words_in_a_sentence))

In [77]:
# Plot histogram of the number of words in a sentence
plt.figure(figsize = (8, 6))
plt.hist(n_words_in_a_sentence,bins = np.arange(0,250,8))
# Formatting
plt.xlabel('Number of Words in a Comment Text'); plt.ylabel('Count'); 
plt.title('Fig. 3 Distribution of Number of Words in a Comment Text');

plt.show()

In [78]:
# Estimate the maximum number of words to use in word embedding to cover 97% of the sentences
n_words_in_a_sentence.sort()
percent_sentences = 97.3/100.0
a = n_words_in_a_sentence[:int(round(len(n_words_in_a_sentence)*percent_sentences))]
print(a[-1])

So, with the maximum number of words in a sentence set to 200, we cover over 97 percent of the comments without truncation.
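The coverage check can be run in both directions: from a target fraction to a length, or from a chosen length to the fraction of comments that fit. A minimal sketch with made-up comment lengths; `coverage` and `length_for_coverage` are illustrative helper names, and the second mirrors the sort-and-slice approach used above.

```python
def coverage(lengths, max_len):
    """Fraction of comments whose word count fits within max_len."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

def length_for_coverage(lengths, fraction):
    """Smallest observed length that covers the given fraction of comments.

    Assumes the fraction is large enough that at least one comment is kept.
    """
    ordered = sorted(lengths)
    cutoff = int(round(len(ordered) * fraction))
    return ordered[:cutoff][-1]

# Hypothetical word counts for eight comments
lengths = [5, 12, 30, 47, 80, 150, 210, 400]

covered = coverage(lengths, 200)
needed = length_for_coverage(lengths, 0.75)
print(covered)
print(needed)
```

Comments longer than the chosen length would be truncated when sequences are padded to a fixed size, so the trade-off is between input size and the share of comments seen in full.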