# iLykei Lecture Series
# Advanced Machine Learning and Artificial Intelligence (MScA 32017)

# Project: Detection of Toxic Comments

The goal of the project is to identify and classify toxic online comments. The project is based on the dataset from [Jigsaw's Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge), hosted on Kaggle. For more information about the contest and a brief analysis of the dataset, see the notebook [MScA_32017_AMLAI_TC2_DataOverview.ipynb](https://ilykei.com/api/fileProxy/documents%2FAdvanced%20Machine%20Learning%2FToxicComments%2FMScA_32017_AMLAI_TC2_DataOverview.ipynb).

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
import pickle
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Data

Download data from [kaggle.com](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data):

`train.csv`, `test.csv`, `test_labels.csv`.

The format of the train and test data files is shown in the notebook [MScA_32017_AMLAI_TC2_DataOverview.ipynb](https://ilykei.com/api/fileProxy/documents%2FAdvanced%20Machine%20Learning%2FToxicComments%2FMScA_32017_AMLAI_TC2_DataOverview.ipynb). Note that part of the test data was moved to the train set after the test labels were disclosed. The modified files are created in that notebook as `tc_train.csv` and `tc_test.csv`.

# Project steps

## Step 1.

Download FastText embeddings [crawl-300d-2M.vec](https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M.vec.zip) (see [MScA_32017_AMLAI_TC1_NLP_Basics.ipynb](https://ilykei.com/api/fileProxy/documents%2FAdvanced%20Machine%20Learning%2FToxicComments%2FMScA_32017_AMLAI_TC1_NLP_Basics.ipynb)).


## Step 2.

Using the function `get_embeddings()` from [MScA_32017_AMLAI_TC3_WineReviewsExample.ipynb](https://ilykei.com/api/fileProxy/documents%2FAdvanced%20Machine%20Learning%2FToxicComments%2FMScA_32017_AMLAI_TC3_WineReviewsExample.ipynb), create the embeddings index: a dictionary with words as keys and embedding vectors as values.

In [None]:
# Create the embeddings index from a text file (such as a .vec file).
# The first line contains the dictionary size and the embedding dimension.
# Fields are space-separated.
def get_embeddings(file_name):
    embeddings_index = {}
    with open(file_name, encoding="utf8") as f:
        for line in f:
            values = line.rstrip().split(' ')
            # The two-field header line is skipped by this length check
            if len(values) > 2:
                embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")
    return embeddings_index

embeddings_path = './Embeddings/'
embeddings_index = get_embeddings(embeddings_path + 'crawl-300d-2M.vec')
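
Once the index exists, a common next step is to turn it into a lookup matrix whose rows line up with vocabulary ids. The sketch below is only an illustration under assumptions: `build_embedding_matrix`, the toy index, and the toy vocabulary are hypothetical names, ids are assumed to start at 1 with row 0 reserved for padding, and words missing from the index are left as zero vectors.

```python
import numpy as np

def build_embedding_matrix(vocabulary, embeddings_index, dim=300):
    """Stack embedding vectors into a matrix indexed by vocabulary id.

    `vocabulary` maps word -> integer id starting at 1 (assumption);
    row 0 stays all-zero for padding, as do rows of unknown words.
    """
    matrix = np.zeros((len(vocabulary) + 1, dim), dtype="float32")
    for word, idx in vocabulary.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix

# Toy stand-in for the real FastText index (3-dimensional vectors).
toy_index = {
    "good": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "bad": np.array([-0.1, -0.2, -0.3], dtype="float32"),
}
toy_vocabulary = {"good": 1, "bad": 2, "unseenword": 3}

embedding_matrix = build_embedding_matrix(toy_vocabulary, toy_index, dim=3)
print(embedding_matrix.shape)
```

With the real index, `dim` would be 300 to match `crawl-300d-2M.vec`.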

## Step 3.

Preprocess the comment texts with the function `preprocess()` below: convert them to lowercase and replace digits, line breaks, and punctuation (except apostrophes) with spaces.

In [None]:
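import string

# Map digits, line breaks, and punctuation (except the apostrophe) to spaces
trans_table = str.maketrans({key: ' ' for key in string.digits + '\r\n' +
                             string.punctuation.replace("'", '')})

def preprocess(text):
    # Lowercase, apply the translation table, and collapse repeated whitespace
    return ' '.join(text.lower().translate(trans_table).split())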

## Step 4.

Using `CountVectorizer` from the `sklearn` package, create a vocabulary of all words from the comments except rare ones. (See the notebook [MScA_32017_AMLAI_TC3_WineReviewsExample.ipynb](https://ilykei.com/api/fileProxy/documents%2FAdvanced%20Machine%20Learning%2FToxicComments%2FMScA_32017_AMLAI_TC3_WineReviewsExample.ipynb).)
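
As a rough illustration of this step, the sketch below fits `CountVectorizer` on three toy comments and drops rare words through `min_df`. The threshold of 2, the toy comments, and the shift of indices by 1 (to keep 0 free for padding) are assumptions made for the example, not values prescribed by the project.

```python
from sklearn.feature_extraction.text import CountVectorizer

comments = [
    "you are a nice person",
    "you are not nice",
    "what a rare xylophone",
]

# min_df=2 keeps only words that occur in at least two comments.
# The default token pattern also drops one-character tokens such as "a".
vectorizer = CountVectorizer(min_df=2)
vectorizer.fit(comments)

# Shift indices by 1 so that 0 stays available as a padding value.
vocabulary = {word: int(idx) + 1 for word, idx in vectorizer.vocabulary_.items()}
print(sorted(vocabulary))
```

On the real data, a suitable `min_df` depends on the corpus size and is worth tuning.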

## Step 5.

Prepare input data for the neural network as shown in [MScA_32017_AMLAI_TC3_WineReviewsExample.ipynb](https://ilykei.com/api/fileProxy/documents%2FAdvanced%20Machine%20Learning%2FToxicComments%2FMScA_32017_AMLAI_TC3_WineReviewsExample.ipynb).

Preparation includes the following actions:
- Transform each text into a sequence of integers, replacing each word with its vocabulary index.
- Make all sequences the same length by truncating those longer than a defined length and padding shorter ones.
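
The two actions above can be sketched in plain NumPy as follows. This is a minimal illustration, not the referenced notebook's code: `texts_to_padded_sequences` is a hypothetical helper, out-of-vocabulary words are simply skipped, and padding and truncation are applied at the end of each sequence (Keras `pad_sequences` defaults to the beginning instead).

```python
import numpy as np

def texts_to_padded_sequences(texts, vocabulary, max_len):
    """Map words to vocabulary ids, then truncate or zero-pad to max_len."""
    result = np.zeros((len(texts), max_len), dtype="int32")
    for row, text in enumerate(texts):
        # Out-of-vocabulary words are skipped in this sketch.
        ids = [vocabulary[w] for w in text.split() if w in vocabulary]
        ids = ids[:max_len]            # truncate long sequences
        result[row, :len(ids)] = ids   # remaining positions stay 0 (padding)
    return result

toy_vocabulary = {"you": 1, "are": 2, "nice": 3}
padded = texts_to_padded_sequences(
    ["you are nice", "you are very very nice you are"],
    toy_vocabulary,
    max_len=4,
)
print(padded)
```

The first toy text is padded with one zero, and the second is cut to its first four known words.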

## Step 6.

Split the train data into train and validation sets using the multilabel dataset splitting method from [MScA_32017_AMLAI_TC1_NLP_Basics.ipynb](https://ilykei.com/api/fileProxy/documents%2FAdvanced%20Machine%20Learning%2FToxicComments%2FMScA_32017_AMLAI_TC1_NLP_Basics.ipynb).
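
The referenced notebook defines the actual splitting method. One plausible approach, sketched here purely as an assumption, is to encode each row's label combination with `LabelEncoder` and pass the result to `train_test_split` as the stratification key. The helper `multilabel_split`, the merging of rare combinations into a single bucket, and the toy label matrix are all hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

def multilabel_split(labels, test_size=0.25, min_count=2, seed=0):
    """Return train/validation row indices stratified by label combination."""
    # Encode every row's label combination, e.g. [1, 0, 1] -> "101".
    combos = np.array(["".join(map(str, row)) for row in labels])
    # Merge combinations too rare to stratify on into one bucket.
    # (The bucket itself still needs at least two rows.)
    values, counts = np.unique(combos, return_counts=True)
    rare = set(values[counts < min_count])
    combos = np.array([c if c not in rare else "rare" for c in combos])
    strata = LabelEncoder().fit_transform(combos)
    indices = np.arange(len(labels))
    return train_test_split(indices, test_size=test_size,
                            stratify=strata, random_state=seed)

# Toy label matrix: 40 rows, 3 labels, two combinations occurring once.
labels = np.array([[0, 0, 0]] * 20 + [[1, 0, 0]] * 10 + [[1, 1, 0]] * 8
                  + [[0, 0, 1]] + [[1, 1, 1]])
train_idx, val_idx = multilabel_split(labels)
print(len(train_idx), len(val_idx))
```

Stratifying on combinations keeps the share of each label pattern roughly equal in both sets, which matters for rare labels such as `threat`.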

## Step 7.

Create a neural network, tune it on the train and validation sets, and make a submission. Start with the network architecture from [MScA_32017_AMLAI_TC3_WineReviewsExample.ipynb](https://ilykei.com/api/fileProxy/documents%2FAdvanced%20Machine%20Learning%2FToxicComments%2FMScA_32017_AMLAI_TC3_WineReviewsExample.ipynb) and try to improve it.

**Suggestion**: learn and try using `GlobalAveragePooling1D` or `GlobalMaxPooling1D` layers instead of a simple `Flatten` layer.
You may also try to use both of them and concatenate their outputs.

A MaxPooling layer was already used in the Satellite Image Detection project. AveragePooling differs from MaxPooling only in taking the average instead of the maximum. The `Global` variants pool over the entire sequence rather than over a sliding window.
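
To make the suggestion concrete, the NumPy sketch below mimics what global max and global average pooling compute on a tensor of shape `(batch, steps, features)`, and how concatenating the two compares with flattening. It illustrates the arithmetic only, with toy random activations; it is not a Keras model.

```python
import numpy as np

# Toy activations: batch of 2 sequences, 4 time steps, 3 features each.
rng = np.random.default_rng(0)
activations = rng.normal(size=(2, 4, 3))

# Global pooling collapses the time axis (axis=1): one value per feature.
global_max = activations.max(axis=1)    # shape (2, 3)
global_avg = activations.mean(axis=1)   # shape (2, 3)

# Using both: concatenate along the feature axis.
pooled = np.concatenate([global_avg, global_max], axis=-1)   # shape (2, 6)

# Flatten, by comparison, keeps every time step.
flattened = activations.reshape(len(activations), -1)        # shape (2, 12)
print(pooled.shape, flattened.shape)
```

The pooled output size does not depend on the sequence length, so the dense layers that follow need far fewer weights than after `Flatten`.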

# Submission File

For each id in the `tc_test.csv` file, predict the probability of each of the six possible types of comment toxicity ("toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate").
The file should contain a header row, with the `id` column first, followed by the six label columns in the order listed above.
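
A minimal sketch of assembling such a file with pandas is shown below. The ids and the random predictions are hypothetical stand-ins for the real test ids and model output, and the leading `id` column is assumed from the Kaggle competition's sample submission.

```python
import io
import numpy as np
import pandas as pd

label_columns = ["toxic", "severe_toxic", "obscene",
                 "threat", "insult", "identity_hate"]

# Hypothetical stand-ins for the test ids and the model's predictions.
test_ids = ["id_a", "id_b", "id_c"]
predictions = np.random.default_rng(0).uniform(size=(3, 6))

submission = pd.DataFrame(predictions, columns=label_columns)
submission.insert(0, "id", test_ids)

# Write to an in-memory buffer here; pass a file path to save to disk.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
header = buffer.getvalue().splitlines()[0]
print(header)
```

Passing `index=False` keeps the pandas row index out of the file, so the header contains exactly the seven expected columns.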

# Evaluation
Submissions are evaluated on the mean column-wise ROC AUC. In other words, the score is the average of the individual AUCs of each predicted column.  
ROC AUC of your submission should be greater than 0.98 to get 100 points.
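
To check a model locally before uploading, the metric can be approximated on the validation set. The sketch below averages `roc_auc_score` over the label columns. `mean_columnwise_auc` is a hypothetical helper name, and the toy arrays use two columns instead of six.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_columnwise_auc(y_true, y_pred):
    """Average the ROC AUC computed separately for each label column."""
    aucs = [roc_auc_score(y_true[:, j], y_pred[:, j])
            for j in range(y_true.shape[1])]
    return float(np.mean(aucs))

# Toy example with two label columns and four comments.
y_true = np.array([[1, 0], [0, 0], [1, 1], [0, 1]])
y_pred = np.array([[0.9, 0.2], [0.1, 0.6], [0.8, 0.7], [0.3, 0.4]])

score = mean_columnwise_auc(y_true, y_pred)
print(score)
```

Here the first column is ranked perfectly (AUC 1.0) and the second has one misordered pair out of four (AUC 0.75), giving a mean of 0.875.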

Upload the saved file using the [shiny test application](https://shiny.ilykei.com/courses/AdvancedML/Toxic_Comments).
