# Toxic Comment Classification - Kaggle Competition

## Motivation
- The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments.
- The task is a multi-label toxic comment classification problem from a Kaggle competition by Jigsaw.
- We experimented and compared multiple models such as Logistic Regression, LSTM and Text CNN.

## Dataset
- The dataset is provided by Kaggle, containing a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:
    
    - toxic
    - severe_toxic
    - obscene
    - threat
    - insult
    - identity_hate

- We are provided with train, test and sample submission files.
### Here's what the data looks like:

In [9]:
import pandas as pd
# loading the data
df_train = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')
df_test_labels = pd.read_csv('data/test_labels.csv')

In [10]:
df_train.head()

|   | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |


In [7]:
df_test.head()

|   | id | comment_text |
|---|---|---|
| 0 | 00001cee341fdb12 | Yo bitch Ja Rule is more succesful then you'll... |
| 1 | 0000247867823ef7 | == From RfC == \n\n The title is fine as it is... |
| 2 | 00013b17ad220c46 | " \n\n == Sources == \n\n * Zawe Ashton on Lap... |
| 3 | 00017563c3f7919a | :If you have a look back at the source, the in... |
| 4 | 00017695ad8997eb | I don't anonymously edit articles at all. |


In [8]:
df_test_labels.head()

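|   | id | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| 0 | 00001cee341fdb12 | -1 | -1 | -1 | -1 | -1 | -1 |
| 1 | 0000247867823ef7 | -1 | -1 | -1 | -1 | -1 | -1 |
| 2 | 00013b17ad220c46 | -1 | -1 | -1 | -1 | -1 | -1 |
| 3 | 00017563c3f7919a | -1 | -1 | -1 | -1 | -1 | -1 |
| 4 | 00017695ad8997eb | -1 | -1 | -1 | -1 | -1 | -1 |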


## Approach

Our approach to the problem is structured as follows:
1. EDA
2. Data Preprocessing
3. Models
4. Evaluation

### 1. EDA
- EDA notebook can be found here - [EDA](EDA.ipynb)

- Through EDA we found that the dataset is highly skewed: only 9.6% of the roughly 160K comments are labeled as toxic, and the rest are non-toxic.

- A `severe_toxic` comment is always `toxic`, and the other classes appear to be subsets of `toxic`, barring a few exceptions.
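
Both observations can be checked with a few lines of pandas. The sketch below uses a tiny made-up frame in place of `data/train.csv`, so its numbers are illustrative only and do not reproduce the 9.6% figure. The column names match the label columns shown earlier; everything else is an assumption.

```python
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Toy stand-in for df_train (hypothetical rows, not real data).
df = pd.DataFrame(
    [
        [0, 0, 0, 0, 0, 0],
        [1, 1, 1, 0, 1, 0],
        [1, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0],  # obscene but not toxic: one of the "exceptions"
    ],
    columns=LABELS,
)

# Share of comments carrying each label.
label_share = df[LABELS].mean()

# Share of comments with no label at all.
clean_share = (df[LABELS].sum(axis=1) == 0).mean()

# For each other label, the fraction of its comments that are also `toxic`.
# A value of 1.0 means the label is fully contained in `toxic`.
subset_of_toxic = {
    label: df.loc[df[label] == 1, "toxic"].mean()
    for label in LABELS
    if label != "toxic" and df[label].sum() > 0
}

print(label_share.round(2).to_dict())
print("clean:", clean_share)
print(subset_of_toxic)
```

Running the same three computations on the real `df_train` would show the skew and the near-subset structure directly.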

#### 1.1. Word Length Distribution
- The plots below show the distribution of per-comment word counts in the train and test data after preprocessing.
<img src="Images/word_len_dist.png" alt="Word Length Distribution" style="width: 700px;"/>

#### 1.2. Category Distribution
- The category distribution Venn diagram and bar plot below show that toxicity is not evenly spread across classes, i.e. the classes are imbalanced.
- The three most frequent labels are:
    - toxic
    - obscene
    - insult

<img src="Images/label_dist.png" alt="Category Distribution 1" style="width: 600px;"/>
<img src="Images/label_dist_2.png" alt="Category Distribution 2" style="width: 300px;"/>

#### 1.3. Category Correlation
- The category correlation plot below shows the most frequent label combinations.
- `toxic` appears in every combination shown.
- The number of comments per combination drops roughly exponentially.
<img src="Images/label_correlation.png" alt="Category Correlation" style="width: 500px;"/>
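
A minimal sketch of how such combination counts could be computed with pandas is shown here. The frame is made up, and `combo_name` is a hypothetical helper introduced only for this example.

```python
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Toy stand-in for the training labels (hypothetical rows, not real data).
rows = (
    [[1, 0, 0, 0, 0, 0]] * 3
    + [[1, 0, 1, 0, 1, 0]] * 2
    + [[1, 1, 1, 0, 1, 0]] * 1
    + [[0, 0, 0, 0, 0, 0]] * 2
)
df = pd.DataFrame(rows, columns=LABELS)

# Keep only comments that carry at least one label.
labelled = df[df[LABELS].sum(axis=1) > 0]

# Count each distinct combination of labels, most frequent first.
combo_counts = labelled.groupby(LABELS).size().sort_values(ascending=False)


def combo_name(flags):
    """Turn a 0/1 tuple into a readable name such as 'toxic+obscene'."""
    return "+".join(label for label, flag in zip(LABELS, flags) if flag)


readable = {combo_name(flags): int(n) for flags, n in combo_counts.items()}
print(readable)
```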

#### 1.4. Category Correlation Venn diagrams
- The Venn diagrams below show how each label overlaps with `toxic`.
##### 1.4.1. Venn diagram for Toxic and severe_toxic comments
<img src="Images/venn_toxic_severe.png" alt="Toxic and severe_toxic comments" style="width: 300px;"/>

##### 1.4.2. Venn diagram for Toxic and obscene comments
<img src="Images/venn_toxic_obscene.png" alt="Toxic and obscene comments" style="width: 300px;"/>

##### 1.4.3. Venn diagram for Toxic and insult comments
<img src="Images/venn_toxic_insult.png" alt="Toxic and insult comments" style="width: 300px;"/>

##### 1.4.4. Venn diagram for Toxic and threat comments
<img src="Images/venn_toxic_threat.png" alt="Toxic and threat comments" style="width: 300px;"/>

##### 1.4.5. Venn diagram for Toxic and identity_hate comments
<img src="Images/venn_toxic_identity_hate.png" alt="Toxic and identity_hate comments" style="width: 300px;"/>

##### 1.4.6. Venn diagram for Toxic, insult and obscene comments
- The plot below shows that comments labelled `toxic` overlap heavily with `obscene` and `insult`.
<img src="Images/venn_toxic_insult_obsene.png" alt="Toxic, insult and obscene comments" style="width: 300px;"/>

### 2. Data Preprocessing

- Notebook with an example can be found here - [Data Cleaning](prepocessing.ipynb)
- We built a dictionary of contractions that expands words like `you're` into their full form, `you are`.
- We use the Tweet Tokenizer and WordNet Lemmatizer from NLTK.
- We also use the stopword list provided by NLTK.
- The diagram below shows how we preprocess the text:
<img src="Images/data_cleaning.png" alt="Preprocessing" style="width: 400px;"/>
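
The contraction-expansion step can be sketched with the standard library alone. The dictionary below is a hypothetical, heavily shortened stand-in for the project's own mapping, and lowercasing before the lookup is an assumption. The NLTK tokenization, lemmatization and stopword removal steps are not shown.

```python
import re

# Hypothetical, heavily shortened contraction dictionary.
CONTRACTIONS = {
    "you're": "you are",
    "don't": "do not",
    "can't": "cannot",
    "i'm": "i am",
    "it's": "it is",
}

# One alternation pattern that matches any dictionary key as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)


def expand_contractions(text: str) -> str:
    """Lowercase the text and replace known contractions with their full form."""
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(1)], text.lower())


print(expand_contractions("You're right, I don't anonymously edit articles."))
```

This sketch only handles the straight apostrophe `'`. Comments that use typographic apostrophes would need to be normalized first.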

### 3. Models
#### 3.1. Logistic Regression

- We have used logistic regression as the baseline model for this task.
- We vectorize the comments with TF-IDF and feed the resulting matrix into logistic regression. TF-IDF weights each word by how important it is to a document relative to the whole corpus.
- The TF-IDF features are extracted from the whole dataset (train and test comments together). Fitting on the training data alone would lose information, for example words that occur only in the test comments.
- The classifier converges at 200 iterations.
- The inverse of the regularization strength is set to 1. Smaller values would apply stronger regularization.
- Class weights are balanced, i.e. adjusted to be inversely proportional to the label frequencies in the input data.

<img src="Images/logistic_regression.png" alt="Logistic Regression" style="width: 300px;"/>
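
A minimal sketch of this baseline with scikit-learn follows. Several things here are assumptions rather than the project's actual code: the toy comments and the two-label subset are made up, one binary classifier is fitted per label, and "200 iterations" is interpreted as `max_iter=200`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

LABELS = ["toxic", "obscene"]  # shortened label list for the toy example

# Toy stand-ins for the cleaned train and test comments (hypothetical data).
train_text = [
    "you are a stupid idiot",
    "thank you for the helpful edit",
    "what a stupid damn article",
    "please see the talk page",
    "damn idiot go away",
    "the source looks fine to me",
]
y_train = np.array([[1, 0], [0, 0], [1, 1], [0, 0], [1, 1], [0, 0]])
test_text = ["stupid idiot", "thank you for the source"]

# Fit the TF-IDF vocabulary on train + test text (no labels involved),
# then transform each split separately.
vectorizer = TfidfVectorizer()
vectorizer.fit(train_text + test_text)
X_train = vectorizer.transform(train_text)
X_test = vectorizer.transform(test_text)

# One binary classifier per label (an assumption about the setup).
probs = np.zeros((len(test_text), len(LABELS)))
for j, label in enumerate(LABELS):
    clf = LogisticRegression(C=1.0, max_iter=200, class_weight="balanced")
    clf.fit(X_train, y_train[:, j])
    probs[:, j] = clf.predict_proba(X_test)[:, 1]

print(probs.round(3))
```

With `class_weight="balanced"`, scikit-learn weights each class by `n_samples / (n_classes * count_of_that_class)`, so the rare positive class contributes as much to the loss as the frequent negative class.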

#### 3.2. Two Stacked Bidirectional LSTM

- We trained a network with two stacked bidirectional LSTM layers for this task.
- The comment text is cleaned and preprocessed before it is fed to the embedding layer.
- We combined two pre-trained word embeddings trained on the Common Crawl dataset. They can be downloaded here - [crawl-300d-2M.vec.zip](https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip) and [glove.840B.300d.txt](http://nlp.stanford.edu/data/glove.840B.300d.zip)
- We used the Keras `Tokenizer` class to create text sequences and padded them to equal-length inputs with a maximum length of 100.
- The model is trained for only 4 epochs.
- During training, we also used learning rate scheduling.

<img src="Images/lstm.png" alt="Two Stacked Bidirectional LSTM" style="width: 400px;"/>
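
To make the sequence preparation concrete, here is a dependency-light sketch of what indexing and padding to length 100 produce. It does not call Keras: `pad_to_length` is a hypothetical helper that imitates the default behaviour of Keras `pad_sequences` (pad and truncate at the front), and whether the project kept those defaults is an assumption. The whitespace split is also a simplification of the real tokenizer.

```python
from collections import Counter

import numpy as np

MAX_LEN = 100  # sequence length stated above


def pad_to_length(sequences, max_len=MAX_LEN, pad_value=0):
    """Left-pad short sequences and keep the last `max_len` ids of long ones."""
    out = np.full((len(sequences), max_len), pad_value, dtype=np.int64)
    for i, seq in enumerate(sequences):
        if len(seq) == 0:
            continue
        trimmed = seq[-max_len:]
        out[i, -len(trimmed):] = trimmed
    return out


# Toy comments (hypothetical). More frequent words get smaller ids;
# id 0 is reserved for padding.
texts = ["thank you for the edit", "you you are welcome"]
counts = Counter(word for text in texts for word in text.split())
word_index = {
    word: i for i, (word, _) in enumerate(counts.most_common(), start=1)
}

sequences = [[word_index[word] for word in text.split()] for text in texts]
padded = pad_to_length(sequences)

print(padded.shape)
print(padded[0, -5:])
```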

#### 3.3. Text CNN

- Our CNN model is inspired by Yoon Kim's CNN model - [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/abs/1408.5882)
- We use pre-trained word embeddings from FastText. They can be downloaded here - [crawl-300d-2M.vec.zip](https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip)
- The word vectors are fine-tuned during training.
- We use convolutional layers with 32 filters each and kernel sizes of 1, 2, 3 and 5 to extract features from the comment, each followed by a max pooling layer.
- Finally, we concatenate the features and pass them through fully connected layers to get predictions.
- We use binary cross-entropy (BCE) loss for training.
- The model is trained using the Adam optimizer with the Keras default learning rate of 0.001. The model converges after 3-4 epochs.

<img src="Images/textCNN.png" alt="Text CNN" style="width: 600px;"/>
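
The NumPy sketch below illustrates only the shape logic of the convolution, max pooling and concatenation steps. It is not the Keras model. Random values stand in for the word vectors and the learned weights, the sequence length of 100 and the ReLU activation are assumptions, and a single sigmoid output layer replaces the fully connected layers.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)

SEQ_LEN, EMB_DIM = 100, 300   # sequence length is an assumption; 300-d vectors
N_FILTERS = 32                # filters per kernel size, as described above
KERNEL_SIZES = (1, 2, 3, 5)
N_LABELS = 6

# One embedded comment with shape (SEQ_LEN, EMB_DIM).
x = rng.normal(size=(SEQ_LEN, EMB_DIM))


def conv_maxpool(x, kernel_size, n_filters):
    """Valid 1D convolution over time with ReLU, then max over all positions."""
    # Random weights stand in for learned filters.
    w = rng.normal(scale=0.01, size=(kernel_size, x.shape[1], n_filters))
    # windows has shape (n_positions, kernel_size, emb_dim).
    windows = sliding_window_view(x, kernel_size, axis=0).transpose(0, 2, 1)
    feature_maps = np.maximum(np.einsum("pke,kef->pf", windows, w), 0.0)
    return feature_maps.max(axis=0)  # shape (n_filters,)


# One pooled feature vector per kernel size, concatenated: 4 * 32 = 128 values.
features = np.concatenate([conv_maxpool(x, k, N_FILTERS) for k in KERNEL_SIZES])

# A single sigmoid output layer gives one independent probability per label.
w_out = rng.normal(scale=0.01, size=(features.size, N_LABELS))
probs = 1.0 / (1.0 + np.exp(-(features @ w_out)))

print(features.shape, probs.shape)
```

Because each label gets its own sigmoid output, binary cross-entropy can be applied per label, which suits the multi-label setting.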

### 4. Evaluation
The predictions are evaluated on the mean column-wise ROC AUC. In other words, the score is the average of the individual AUCs of each predicted column.
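
A minimal sketch of this metric using scikit-learn's `roc_auc_score` is shown below. The helper name `mean_columnwise_auc` and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def mean_columnwise_auc(y_true, y_score):
    """Average the ROC AUC computed separately for each label column."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    aucs = [
        roc_auc_score(y_true[:, j], y_score[:, j])
        for j in range(y_true.shape[1])
    ]
    return float(np.mean(aucs))


# Toy example with two labels (made-up values).
y_true = np.array([[1, 0], [0, 0], [1, 1], [0, 1]])
y_score = np.array([[0.9, 0.2], [0.1, 0.6], [0.8, 0.7], [0.3, 0.4]])

score = mean_columnwise_auc(y_true, y_score)
print(score)  # 0.875: column AUCs are 1.0 and 0.75
```

For multi-label input, `roc_auc_score(y_true, y_score, average="macro")` computes the same quantity in one call.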

## Experimental Setup

Each model is trained on a train/validation split of the data. After training, the model predicts on the test set provided by Kaggle, and the generated submission file is submitted to Kaggle to obtain the public and private leaderboard AUC scores.

Code for each model can be found here:
- For Logistic Regression - `/Log_reg`
- For Two Stacked Bidirectional LSTM - `/LSTM`
- For Text CNN - `/TextCNN`

Each model can be run from its script in the corresponding folder:
- To run Logistic Regression - `python3 Log_reg/log_regression.py`
- To run Two Stacked Bidirectional LSTM - `python3 LSTM/LSTM.py`
- To run TextCNN - `python3 TextCNN/textCNN.py`

The submission file for each model can be found in its folder:
- For Logistic Regression - `Log_reg/logReg_submission.csv`
- For Two Stacked Bidirectional LSTM - `LSTM/LSTM_submission.csv`
- For Text CNN - `TextCNN/textCNN_submission.csv`

## Results
Loss plot for Logistic Regression 
<img src="Images/log_reg_loss.png" alt="Results" style="width: 500px;"/>

Loss plot for LSTM
<img src="Images/lstm_loss.png" alt="Results" style="width: 500px;"/>

Loss plot for TextCNN
<img src="Images/textCNN_loss.png" alt="Results" style="width: 500px;"/>

The following table shows the results for all three methods:
<img src="Images/results.png" alt="Results" style="width: 500px;"/>


## Analysis of the Results

- Based on our results, TextCNN seems to have predicted better than Logistic Regression and LSTM.
- Using pre-trained embeddings leads to faster convergence, and because of those embeddings, minor modifications to the LSTM and CNN do not change the predictions by a large margin.
- Improving the data preprocessing may improve the performance of all the models. With our approach, a text like `Thanks for uploading Image:Wonju.jpg.` looks like `thank upload image wonju jpg` after cleaning. Here `wonju` and `jpg` do not add any information to the model.

## Conclusions
- Using pre-trained word embeddings leads to faster convergence.
- Toxicity is not evenly spread across classes, which leads to class imbalance problems.
- Ensembling methods generally give higher leaderboard scores.

## Future Work

- Exploring other models such as BERT, a two-layer bidirectional GRU, etc.
- We found that one of the top 10 Kagglers used train- and test-time augmentation (TTA), translating comments to German, French and Spanish and back to English, and got improved performance.
- Stacking and Blending models might improve the performance.
