# Machine Learning Engineer Nanodegree
## Capstone Project
Anvesh Tummala    
May 20th, 2018


## Toxic Comment Classification  
###### Identify and classify toxic online comments

# 1. Definition
-----------
### Project Overview:

For a community to benefit from diverse opinions and feedback, every individual voice matters. But with the growing number of online threats, hateful conversations, and sexually abusive comments, many people stop expressing themselves in online communities. Many platforms struggle to facilitate conversations effectively, leading many communities to limit or completely shut down user comments. This is a serious threat to freedom of expression.

The goal of this project is to create a model that can estimate the probability of different types of toxicity, such as threats, obscenity, insults, and identity-based hate, in any textual comment or post. Such a model will help online communities monitor discussions better, which in turn creates a better place for productive and respectful conversations.

This project is part of Kaggle's [*Toxic Comment Classification Challenge*](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge). I will build a multi-label classifier that identifies the nature of toxicity (threats, obscenity, insults, identity-based hate) in text data. The input is a training dataset of Wikipedia comments that were hand-labeled with toxicity classes. The model will be trained, validated, and tested using the ML and deep learning techniques learned throughout this course.


### Problem Statement

The goal of this project is to create a multi-headed model that can estimate the probability of different types of toxicity, such as threats, obscenity, insults, and identity-based hate, in any textual data (comments/posts). Such a model will help online communities monitor discussions better, which in turn creates a better place for productive and respectful conversations. For model creation we will explore several deep learning models, which will also give us a good understanding of how they compare on text classification.

Traditionally, identifying toxic comments has been treated as a binary classification problem with just two labels (toxic, non-toxic), much like a sentiment analysis model. In this project we will instead build a multi-label classifier that identifies the nature of the toxicity (threats, obscenity, insults, identity-based hate). The model is trained on a dataset of Wikipedia comments that were hand-labeled with toxicity classes, and is then tested by labeling the toxicity in held-out comments. I will also try training an ensemble of binary classification models, one per class label (true or false), and compare it with the single multi-label model.

### Datasets and Inputs

As part of Kaggle's Toxic Comment Classification Challenge, Jigsaw and Google provided a dataset of comments from Wikipedia’s talk page edits. Human raters labeled these comments with the following toxicity types: `toxic`, `severe_toxic`, `obscene`, `threat`, `insult`, `identity_hate`.

The dataset includes:
* train.csv - the training set, contains 159571 wiki comments with their binary labels
* test.csv - the test set; toxicity probabilities must be predicted for 153164 comments. To deter hand labeling, the test set contains some comments which are not included in scoring.

From the training dataset, we will engineer features such as the number of words, parts of speech, number of punctuation marks, number of upper-case words, etc.

Below is the distribution of toxic labels across the 159571 training samples. As the diagram shows, the classes are not balanced, so we need sampling techniques to compensate.
![Class Label distribution](images/output_17_0.png)

We will use a 60/15/25 % split of the training data to get train/validation/test sets (since no labeled test data is provided). We will also try a k-fold split to make more data available for training without compromising the correctness of the model. In each split I will verify the distribution of output labels, as that is crucial for our test score.
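
The 60/15/25 split described above could be sketched as follows. This is a minimal illustration, assuming scikit-learn's `train_test_split`; the helper name `split_60_15_25`, the random seed, and the synthetic stand-in data are assumptions, not part of the project code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def split_60_15_25(X, y, seed=42):
    """Split (X, y) into 60% train, 15% validation, and 25% test."""
    # First carve off the 25% test portion.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    # 15% of the whole is 0.15 / 0.75 = 20% of the remaining 75%.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.20, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Synthetic stand-in for the real comments and labels (assumption).
rng = np.random.default_rng(0)
X = np.arange(1000).reshape(-1, 1)
y = (rng.random((1000, len(LABELS))) < 0.1).astype(int)

train, val, test = split_60_15_25(X, y)

# Verify that the label distribution is similar in every split.
for name, (_, labels) in [("train", train), ("val", val), ("test", test)]:
    print(name, len(labels), labels.mean(axis=0).round(3))
```

Comparing the per-label positive rates printed for each split is one simple way to check that rare labels such as `threat` are represented in all three sets.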

-----------
1. [Kaggle, Jigsaw-toxic-comment-classification-challenge-data](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data)

### Evaluation Metrics

The evaluation metric is the mean column-wise ROC AUC, i.e. the average of the individual AUCs of each predicted column.

The possible outcomes of a classification are true positive (TP), false positive (FP), true negative (TN), false negative (FN).

* true positive rate: TPR = positives correctly classified / total positives = TP / P
* false positive rate: FPR = negatives incorrectly classified / total negatives = FP / N

The ROC curve plots the FPR (false positive rate) on the X-axis against the TPR (true positive rate) on the Y-axis as the decision threshold varies. The AUC is the area under this curve.
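
To make the definitions concrete, here is a small sketch that computes one ROC point (FPR, TPR) for a given threshold. The function names and the toy labels and scores are illustrative assumptions.

```python
def confusion_counts(y_true, y_score, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, y_score):
        predicted_positive = score >= threshold
        if predicted_positive and truth == 1:
            tp += 1
        elif predicted_positive and truth == 0:
            fp += 1
        elif not predicted_positive and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_point(y_true, y_score, threshold):
    """Return (FPR, TPR) for one threshold; assumes both classes are present."""
    tp, fp, tn, fn = confusion_counts(y_true, y_score, threshold)
    tpr = tp / (tp + fn)  # TP / P
    fpr = fp / (fp + tn)  # FP / N
    return fpr, tpr


# Toy example (assumption): three positives and three negatives.
y_true = [1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.6, 0.7, 0.2, 0.4, 0.1]

points = [roc_point(y_true, y_score, t) for t in (0.0, 0.5, 1.1)]
print(points)
```

Sweeping the threshold from above the highest score down to zero traces the curve from (0, 0) to (1, 1).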

This metric is a good measure of probabilistic classification across the different labels. We will compute the AUC of each column separately and average them for the final score.

Metric expectations: an AUC of 0.5 corresponds to random guessing, and an AUC of 1 corresponds to a perfect classifier. Any model whose average AUC across all labels is at least 0.9 is considered a good model.
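
As an illustration of the metric, the sketch below computes the AUC of each label column and averages them. It uses the pairwise-ranking formulation of AUC (the probability that a random positive is scored above a random negative, with ties counted as one half) in plain Python. The function names and toy data are assumptions; in practice a library routine such as scikit-learn's `roc_auc_score` should give the same value per column and is far faster on large data.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    # Ties between a positive and a negative count as half a correct pair.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_columnwise_auc(Y_true, Y_score):
    """Average the per-label AUCs; rows are comments, columns are labels."""
    n_labels = len(Y_true[0])
    aucs = [
        roc_auc([row[j] for row in Y_true], [row[j] for row in Y_score])
        for j in range(n_labels)
    ]
    return sum(aucs) / n_labels, aucs


# Toy example with two label columns (assumption).
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.1, 0.5]]

mean_auc, per_label = mean_columnwise_auc(Y_true, Y_score)
print(per_label, mean_auc)
```

The pairwise loop is quadratic in the number of comments, so it is only meant to show what the score measures.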

# 2. Analysis
-----------

### Data Exploration

The files considered in this project are train.csv, test.csv, and sample_submission.csv.

Here is a peek into the data sets.

<img src="images/train_data.png" width="85%" alt="train data" align="left">


<img src="images/test_data.png" width="40%" alt="test data" align="left">


<img src="images/submission_data.png" width="45%" alt="submission data" align="left">

### Exploratory Visualization	

* The total count of training and testing data:

| # train_data  | #test_data |     
| :- | :- |
| 159571 | 153164 |

* The count of toxic type comments in the training data:

| #Training_total | #toxic | #severe_toxic | #obscene | #threat | #insult | #identity_hate |        
| :- | :- | :- | :- | :- | :- | :- |
| 159571 | 15294 | 1595 | 8449 | 478 | 7877 | 1405 |
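
To quantify the imbalance, the counts in the table above can be turned into per-label positive rates. This is simple arithmetic on the reported numbers; the function name is an assumption.

```python
TOTAL = 159571
LABEL_COUNTS = {
    "toxic": 15294,
    "severe_toxic": 1595,
    "obscene": 8449,
    "threat": 478,
    "insult": 7877,
    "identity_hate": 1405,
}


def prevalence(counts, total):
    """Fraction of comments carrying each label."""
    return {label: count / total for label, count in counts.items()}


rates = prevalence(LABEL_COUNTS, TOTAL)
for label, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:>14}: {rate:6.2%}")
```

Even the most frequent label, `toxic`, covers under a tenth of the comments, and `threat` covers well under one percent.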

* The number of comments with at least one toxic type and with all toxic types:

| #atleast_one_toxic_type  | #all_toxic_types - really bad|     
| :- | :- |
| 16225 | 31 |
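
Counts like these can be derived by summing the label columns per comment. The sketch below assumes pandas and a DataFrame with the six label columns; the toy rows and the function name are assumptions used in place of train.csv.

```python
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def label_overlap_summary(df, labels=LABELS):
    """Count comments with at least one label, all labels, and no labels."""
    labels_per_comment = df[labels].sum(axis=1)
    return {
        "at_least_one": int((labels_per_comment > 0).sum()),
        "all_labels": int((labels_per_comment == len(labels)).sum()),
        "clean": int((labels_per_comment == 0).sum()),
    }


# Toy rows standing in for the real training data (assumption).
toy = pd.DataFrame(
    [
        [1, 1, 1, 1, 1, 1],
        [1, 0, 1, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
    ],
    columns=LABELS,
)

summary = label_overlap_summary(toy)
print(summary)
```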

* Correlation among categories:


<img src="images/Correlation_columns.png" width="65%" alt="correlation among label columns" align="left">


The heatmap shows a high correlation among the obscene, insult, and toxic labels.
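
A correlation matrix of this kind can be computed directly from the binary label columns. The sketch below assumes pandas and uses Pearson correlation (the default of `DataFrame.corr`); the toy values and the reduced set of four labels are assumptions for illustration.

```python
import pandas as pd

# Toy label columns standing in for the real training labels (assumption).
toy = pd.DataFrame(
    {
        "toxic":   [1, 1, 1, 0, 0, 0, 1, 0],
        "obscene": [1, 1, 0, 0, 0, 0, 1, 0],
        "insult":  [1, 0, 1, 0, 0, 0, 1, 0],
        "threat":  [0, 0, 0, 0, 1, 0, 0, 0],
    }
)

# Pairwise Pearson correlation between the label columns.
corr = toy.corr()
print(corr.round(2))
```

The resulting matrix can then be passed to a plotting library to draw the heatmap.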

* length of comment_text:

| test/train | mean of comment length | std of comment length|  max of comment length |
| :- | :- | :- | :- |
| train | 394.1 | 590.7 | 5000 |
| test | 364.9 | 592.5 | 5000 |

* number of words in comment_text:

| test/train | mean of word count | std of word count|  max of word count |
| :- | :- | :- | :- |
| train | 69.4 | 104.1 | 2319 |
| test | 66.9 | 106.8| 2833 |
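
Length and word-count statistics like those in the two tables above can be computed as in this sketch. It splits words on whitespace and uses the sample standard deviation; both choices, the function name, and the toy comments are assumptions and may differ from how the reported numbers were produced.

```python
from statistics import mean, stdev


def text_stats(comments):
    """Mean, standard deviation, and max of character and word counts."""
    char_lengths = [len(c) for c in comments]
    word_counts = [len(c.split()) for c in comments]  # whitespace tokenization

    def summarize(values):
        return {"mean": mean(values), "std": stdev(values), "max": max(values)}

    return {"chars": summarize(char_lengths), "words": summarize(word_counts)}


# Toy comments standing in for the comment_text column (assumption).
comments = ["You are great", "no", "this is a longer comment here"]

stats = text_stats(comments)
print(stats)
```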

Since this is text data, we need to consider the following pre-processing steps:

1. Handle empty text. Luckily, there are no empty comments in the given train or test data.
2. Ignore stop words.
3. Lower-case all text.
4. Extract features such as parts of speech, number of hashtags, number of URLs, etc.
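
The steps above can be sketched in plain Python as follows. The stop-word list is a tiny illustrative subset, and the regular expressions and function name are assumptions. Part-of-speech tagging is left out because it needs an NLP library.

```python
import re

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "is", "are", "of", "to", "in", "it", "you", "this", "that"}
)
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
TOKEN_RE = re.compile(r"[a-z0-9']+")


def preprocess(comment):
    """Return cleaned tokens plus simple count features for one comment."""
    text = comment or ""  # step 1: treat missing comments as empty text
    n_urls = len(URL_RE.findall(text))  # step 4: count features
    n_hashtags = len(HASHTAG_RE.findall(text))
    # Drop URLs before tokenizing, then lower-case (step 3).
    lowered = URL_RE.sub(" ", text).lower()
    # Step 2: remove stop words.
    tokens = [t for t in TOKEN_RE.findall(lowered) if t not in STOP_WORDS]
    return {"tokens": tokens, "n_urls": n_urls, "n_hashtags": n_hashtags}


result = preprocess("This is the WORST page, see http://example.com #fail")
print(result)
```

Counting URLs and hashtags before lower-casing and stripping keeps those features available even though the raw strings are removed from the token list.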


> [Link to full data analysis notebook](https://www.google.com)


### Algorithms and Techniques

The primary strategy of this project is to compare different model implementations ranging from Machine Learning Models like
* Logistic Regression
* SVM
* LightGBM
* XGBoost

and deep learning models like

* GRU
* LSTM
* CNN
* RNN
* Capsule net
* other models

over different representations of comments like

* word2vec (skip-gram)
* word2vec (continuous bag of words, CBOW)
* Window-based co-occurrence matrix
* Low-dimensional vectors (SVD)
* GloVe


### Benchmark Model

The benchmark model will be a Logistic Regression model trained on a TF-IDF (Term Frequency, Inverse Document Frequency) representation of the text, i.e. a term-document matrix. It will be interesting to see how much better the deep learning models perform than this basic benchmark.
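
A benchmark of this kind could look like the sketch below. It assumes scikit-learn and fits one logistic regression per label via `OneVsRestClassifier`; the hyperparameters, the function name `build_benchmark`, and the toy comments with two labels are illustrative assumptions, not the tuned project setup.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline


def build_benchmark():
    """TF-IDF term-document matrix followed by one logistic regression per label."""
    return make_pipeline(
        TfidfVectorizer(lowercase=True, stop_words="english"),
        OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    )


# Toy data with two labels (insult, threat) standing in for train.csv (assumption).
comments = [
    "you are an idiot",
    "thanks for the helpful edit",
    "I will hurt you",
    "great article, well sourced",
    "shut up you idiot",
    "I will find you and hurt you",
    "please add a citation",
    "nice work on this page",
]
labels = np.array(
    [[1, 0], [0, 0], [0, 1], [0, 0], [1, 0], [0, 1], [0, 0], [0, 0]]
)

model = build_benchmark()
model.fit(comments, labels)

# One probability per label for each input comment.
probs = model.predict_proba(["you are an idiot", "thanks for the helpful edit"])
print(probs.round(3))
```

The per-label probabilities returned by `predict_proba` are what the mean column-wise ROC AUC is computed on.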

### Solution Statement

The plan of action to solve this problem involves:

* Downloading and analyzing the input data.
* Data preprocessing: fill nulls, clean the comments, reduce dimensionality, etc.
* Using bag-of-words/GloVe/word2vec to encode each comment as a vector representation.
* Data analysis and feature engineering to add or remove features related to the problem domain.
* Splitting the training data into train and validation sets.
* Creating different deep learning models (LSTM, RNN, CNN) and comparing their accuracy.
* Tuning hyperparameters to yield better scores.

### Project Design

As outlined under Algorithms and Techniques, the primary strategy of this project is to compare classical machine learning models (Logistic Regression, SVM, LightGBM, XGBoost) and deep learning models (GRU, LSTM, CNN, RNN, Capsule net, and others) over different representations of comments (word2vec skip-gram and CBOW, window-based co-occurrence matrix, low-dimensional SVD vectors, GloVe).

In this process the input representation matters a lot, so the following pre-processing techniques will be considered.

Pre-processing:
* Capitalization - lower-case the text for case insensitivity
* Removing stop words - the least useful words like 'the' and 'and' will be removed
* Tokenization - splitting the text into separate tokens
* Part-of-speech tagging - to better capture the meaning of a word/sentence
* Stemming - to reduce the vocabulary of the input corpus; lemmatization is preferred over stemming
* Lemmatization - to reduce the vocabulary of the input corpus; it uses dictionary lookup, sentence context, and part of speech

Feature engineering techniques:
* Part of speech - this could be a useful feature for classification, as identity-based hate might involve more nouns. We can also use the counts of different parts of speech as features.
* We can also try the proportion of capitals, number of unique words, number of exclamation marks, number of punctuation marks, number of emojis, etc.
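
Several of the surface features listed above can be computed with the standard library alone, as in this sketch. The feature names and exact definitions (for example, what counts as an upper-case word) are assumptions. Emoji and part-of-speech counts are omitted because they need extra tooling.

```python
import string


def comment_features(comment):
    """Simple surface features for one comment."""
    words = comment.split()
    letters = [ch for ch in comment if ch.isalpha()]
    # Strip surrounding punctuation so "it," and "it" count as the same word.
    normalized = {w.strip(string.punctuation).lower() for w in words}
    normalized.discard("")
    return {
        "n_words": len(words),
        "n_unique_words": len(normalized),
        "n_upper_words": sum(1 for w in words if w.isupper() and len(w) > 1),
        "capital_ratio": (
            sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
        ),
        "n_exclamations": comment.count("!"),
        "n_punctuation": sum(ch in string.punctuation for ch in comment),
    }


feats = comment_features("STOP it!!! Just stop it, NOW.")
print(feats)
```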

We will use PCA (Principal Component Analysis) to extract the most valuable components for our problem. This will help training by using more relevant feature combinations.
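
As a rough sketch of this step, the example below standardizes a dense matrix of engineered features and keeps enough principal components to explain 95% of the variance. It assumes scikit-learn; the 95% target, the function name, and the synthetic feature matrix are assumptions. For large sparse matrices such as TF-IDF, `TruncatedSVD` is often used instead.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def reduce_features(features, variance_to_keep=0.95):
    """Standardize features, then keep components explaining the target variance."""
    reducer = make_pipeline(
        StandardScaler(),
        # A float n_components selects the number of components by explained variance.
        PCA(n_components=variance_to_keep, svd_solver="full"),
    )
    reduced = reducer.fit_transform(features)
    return reduced, reducer


# Synthetic features (assumption): 2 independent signals plus 4 noisy combinations.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
combos = base @ rng.normal(size=(2, 4)) + 0.01 * rng.normal(size=(200, 4))
features = np.hstack([base, combos])

reduced, reducer = reduce_features(features)
explained = reducer.named_steps["pca"].explained_variance_ratio_.sum()
print(features.shape, "->", reduced.shape, round(float(explained), 4))
```

Standardizing first matters because PCA is sensitive to feature scale, and count features such as word counts and ratios such as the proportion of capitals live on very different scales.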

We will use GPU-enabled instances on Google Compute Service to run these DNN models faster.

This project will be a good exploratory exercise in implementing different NLP models and comparing their accuracy, execution time, etc.


-----------
1. [GPU setup instruction](https://github.com/atmc9/GPU-cloud-setup)
2. [NLP with deep learning by Stanford](https://www.youtube.com/watch?v=OQQ-W_63UgQ&list=PL3FW7Lu3i5Jsnh1rnUwq_TcylNr7EkRe6)
3. [Public kernels from Kaggle Competition](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/kernels)
