# Toxic Comment Classification
---

## General Outline
---
0. Import the necessary libraries
1. Download or otherwise retrieve the data.
2. Process / Prepare the data.
3. Upload the processed data to S3.
4. Train a chosen model.
5. Test the trained model (typically using a batch transform job).
6. Deploy the trained model.
7. Use the deployed model.


## Step 0: Importing Libraries
---

In [1]:
!pip install --upgrade pip
!pip install googletrans
!pip install tqdm
!pip install torch
!pip install emoji
!pip install nltk

Requirement already up-to-date: pip in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (20.0.2)


In [2]:
import pandas as pd
from shutil import unpack_archive
import torch
import helper as hp
import os

In [3]:
# Set device to use GPU if available
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

## Step 1: Downloading the data
---

Two challenges you might face while downloading this data are:
- Pulling the data from Kaggle directly onto the notebook instance you are working with
- Running out of disk space if you are using Amazon SageMaker

I have tried to address both issues below.

### Install the Kaggle library and download the zip file
To pull the data directly into your notebook when you are working on a remote machine, take these steps:
1. **pip install the Kaggle library**. You can do this by opening the notebook's terminal and running `pip install kaggle`.
2. **Create a directory** to put the API token in, so the Kaggle library knows where to look for your sign-in credentials when you access the API from your notebook. This directory must be named `.kaggle` and it must be in your home directory. Note that this directory will be hidden. Do this on the terminal: `mkdir ~/.kaggle`
3. **Go to your Kaggle account** (or create one if you haven’t yet), click the profile icon in the top right corner of the screen, and select “my profile” from the dropdown list. Scroll down to about the middle of the page and click on “Create a New API Token”. A file named `kaggle.json` will automatically download to your default download folder. This file contains the sign-in credentials that allow you to access the API.
4. **Navigate to the directory** where the file was downloaded and move `kaggle.json` into the new `.kaggle` directory. Do this on the terminal: `mv kaggle.json ~/.kaggle/`
5. **Now you can view the list of Kaggle competitions** from your terminal: `kaggle competitions list`
6. **You can download the competition zip file** through your terminal directly to the location you are working in: `kaggle competitions download -c jigsaw-multilingual-toxic-comment-classification`. If you want to do the download from the notebook, use `!kaggle competitions download -c jigsaw-multilingual-toxic-comment-classification`

> I also recommend going through this GitHub link: `https://github.com/floydwch/kaggle-cli`
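
Before calling the Kaggle API from the notebook, it can help to confirm that the token ended up where the client expects it. The following is a minimal sketch, assuming the default location `~/.kaggle/kaggle.json`; the function name `check_kaggle_credentials` is hypothetical and not part of the Kaggle library.

```python
import os
import stat
from pathlib import Path


def check_kaggle_credentials(home=None):
    """Return a list of problems found with the Kaggle API token setup."""
    home = Path(home) if home is not None else Path.home()
    token = home / ".kaggle" / "kaggle.json"
    problems = []
    if not token.parent.is_dir():
        problems.append("missing directory: {}".format(token.parent))
    elif not token.is_file():
        problems.append("missing token file: {}".format(token))
    elif os.name == "posix" and stat.S_IMODE(token.stat().st_mode) & 0o077:
        # The Kaggle client warns when other users can read the token.
        problems.append("token is readable by others; run: chmod 600 {}".format(token))
    return problems


print(check_kaggle_credentials() or "Kaggle credentials look fine")
```

An empty list means the directory and token file exist and, on POSIX systems, that the file is readable only by its owner.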

### Unzip the file
One major challenge I faced while trying to unzip the file on Amazon SageMaker was running out of disk space. This is how I resolved it:
1. **I went to Amazon SageMaker**, then to the notebook instance I was working on. I clicked on `edit`, scrolled down to the `volume` setting, increased the volume size from the default 5 GB to about 160 GB, and then updated the instance. That solved the issue for me.
2. **Then I proceeded** to the next cell below.

In [None]:
unpack_archive('../data/jigsaw-multilingual-toxic-comment-classification.zip', '../data/')
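
To see in advance whether the volume is large enough, one option is to compare the uncompressed size recorded in the archive with the free space at the destination. This is a rough sketch using only the standard library; the function name `unpacked_size_and_free_space` is an assumption made for illustration.

```python
import shutil
import zipfile


def unpacked_size_and_free_space(zip_path, dest_dir):
    """Return (bytes needed after extraction, bytes free at dest_dir)."""
    with zipfile.ZipFile(zip_path) as archive:
        # file_size is the uncompressed size of each archive member
        needed = sum(info.file_size for info in archive.infolist())
    free = shutil.disk_usage(dest_dir).free
    return needed, free


# Possible use before unpacking (paths as in the cell above):
# needed, free = unpacked_size_and_free_space(
#     '../data/jigsaw-multilingual-toxic-comment-classification.zip', '../data/')
# print("need {:.1f} GB, have {:.1f} GB".format(needed / 1e9, free / 1e9))
```

If `needed` is close to or above `free`, increasing the volume size first avoids a failed extraction.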

## Step 2: Data Preparation
---

### Translate the data
Since the `validation and test sets` are in many languages, we need to translate them to English for consistency using the Google Translate API.
The code for this can be found in the helper function.
> **Note** that translating can be difficult. In my case, I used the Google API, which kept blocking my requests. I devised a workaround: when you send in the file for translation, the function checks the column titled `outcome`. If the value for a row is `success`, it leaves the row alone; otherwise (if it is `failure`, the default value), it attempts to translate it. If the API has blocked the request, the row is returned unchanged; otherwise, the row is returned with its content translated.
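
The row-level logic of this workaround can be sketched as follows. This is not the notebook's `translateDf` implementation, only an illustration under stated assumptions: rows are plain dicts (for example from `df.to_dict("records")`), and `translate_fn` is any callable that returns translated text or raises when the request is blocked. The names `translate_pending` and `fake_translate` are hypothetical.

```python
def translate_pending(rows, translate_fn, text_col="comment_text"):
    """Translate every row not yet marked 'success'; blocked rows stay unchanged."""
    updated = []
    for row in rows:
        row = dict(row)  # copy so the caller's data is not modified
        if row.get("outcome") != "success":
            try:
                row[text_col] = translate_fn(row[text_col])
                row["outcome"] = "success"
            except Exception:
                # Request blocked or failed: keep the row as it is for a later pass.
                row["outcome"] = "failure"
        updated.append(row)
    return updated


def fake_translate(text):
    """Stand-in translator that 'blocks' one particular input."""
    if text == "hola":
        raise RuntimeError("blocked")
    return text.upper()


rows = [
    {"comment_text": "ciao", "outcome": "failure"},
    {"comment_text": "hola", "outcome": "failure"},
    {"comment_text": "hello", "outcome": "success"},
]
result = translate_pending(rows, fake_translate)
print([(r["comment_text"], r["outcome"]) for r in result])
# [('CIAO', 'success'), ('hola', 'failure'), ('hello', 'success')]
```

Because rows already marked `success` are skipped, repeated passes only spend requests on the rows that still need translating.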

### Validation Set

In [4]:
# Load the validation set
CSV_PATH = '../data/validation.csv'
df = pd.read_csv(CSV_PATH)
df.head()

Unnamed: 0,id,comment_text,lang,toxic
0,0,Este usuario ni siquiera llega al rango de ...,es,0
1,1,Il testo di questa voce pare esser scopiazzato...,it,0
2,2,Vale. Sólo expongo mi pasado. Todo tiempo pasa...,es,1
3,3,Bu maddenin alt başlığı olarak uluslararası i...,tr,0
4,4,Belçika nın şehirlerinin yanında ilçe ve belde...,tr,0


In [5]:
# Add a column named outcome and populate its rows with failure.
# Save it as validation_ammended.csv
df['outcome'] = 'failure'
df.to_csv('../data/validation_ammended.csv',index=False)

In [6]:
# Load the validation_ammended.csv to see its contents
CSV_PATH  = '../data/validation_ammended.csv'
df = pd.read_csv(CSV_PATH)
df.head()

Unnamed: 0,id,comment_text,lang,toxic,outcome
0,0,Este usuario ni siquiera llega al rango de ...,es,0,failure
1,1,Il testo di questa voce pare esser scopiazzato...,it,0,failure
2,2,Vale. Sólo expongo mi pasado. Todo tiempo pasa...,es,1,failure
3,3,Bu maddenin alt başlığı olarak uluslararası i...,tr,0,failure
4,4,Belçika nın şehirlerinin yanında ilçe ve belde...,tr,0,failure


The trick here is to make the first call with `CSV_PATH` as input. After the translation is done, check whether any rows still have `failure` in the **outcome** column. If so, make the next call on `OUT_PUT_PATH`. Keep making subsequent calls on `OUT_PUT_PATH` until no row has `failure` in the **outcome** column.
> You do this check using `df.groupby('outcome').count()`, as can be seen below.
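
The retry procedure described above can also be automated instead of re-running cells by hand. The sketch below is one possible driver, not part of the notebook's helper code: `run_pass` stands in for a single translation call, `count_failures` stands in for the `failure` check, and both names (as well as `translate_until_done`) are hypothetical.

```python
import time


def translate_until_done(run_pass, count_failures, first_input, output_path,
                         max_rounds=10, pause_seconds=0.0):
    """Run translation passes until no row is marked 'failure'.

    run_pass(input_path, output_path) performs one translation pass.
    count_failures(output_path) returns how many rows are still 'failure'.
    Returns the number of passes that were needed.
    """
    source = first_input
    remaining = None
    for round_number in range(1, max_rounds + 1):
        run_pass(source, output_path)
        remaining = count_failures(output_path)
        if remaining == 0:
            return round_number
        source = output_path  # later passes re-read the partially translated file
        time.sleep(pause_seconds)  # give a rate-limited API time to recover
    raise RuntimeError(
        "{} rows still untranslated after {} rounds".format(remaining, max_rounds))
```

With pandas, `count_failures` could be something like `lambda path: int((pd.read_csv(path)['outcome'] == 'failure').sum())`. The `max_rounds` cap keeps the loop from running forever if the API never unblocks.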

In [7]:
# Call the translateDf function from preprocess.py file
OUT_PUT_PATH = '../data/jigsaw-toxic-comment-validation-translated.csv'
# First call with CSV_PATH
#pr.translateDf(CSV_PATH, OUT_PUT_PATH, ['id','comment_text','lang','toxic','outcome']);

In [8]:
# Subsequent calls with OUT_PUT_PATH
#pr.translateDf(OUT_PUT_PATH, OUT_PUT_PATH, ['id','comment_text','lang','toxic','outcome']);

After each call above, run the two cells below to check whether any row still has `failure` in the **outcome** column.

In [9]:
# Load and print the output file
df = pd.read_csv(OUT_PUT_PATH)
df.tail()

Unnamed: 0,id,comment_text,lang,toxic,outcome
7995,7987,"Hello, you said very frequently encountered si...",tr,0,success
7996,7947,"1953 Midyat, Mardin, Turkey was born. As a chi...",tr,0,success
7997,7994,Hello Santiago! I have answered your user page...,es,0,success
7998,7976,Picture: Ozga-the-07.jpg Ozpirincc source and ...,tr,0,success
7999,7996,The imbesil ete moon dela not aware or so osti...,es,1,success


In [10]:
# Check if the outcome column still has any failure
df.groupby('outcome').count()

Unnamed: 0_level_0,id,comment_text,lang,toxic
outcome,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
success,8000,8000,8000,8000


When you are done, run the two cells below to **remove** the added `outcome` column, **sort** the data in ascending order by `id`, and **save** the final result as `validation_translated_ordered.csv`.

In [11]:
# Drop the outcome column
df.drop('outcome', axis=1, inplace=True)
# Sort by id
df.sort_values('id', inplace=True)
df.head()

Unnamed: 0,id,comment_text,lang,toxic
143,0,This user does not even rank heretic. Therefor...,es,0
158,1,The text of this item seems to be plagiarized ...,it,0
61,2,OK. I'm just stating my past. All past time wa...,es,1
133,3,I var.ön My hesitation about continuing issues...,tr,0
67,4,Belgium's towns and villages while the city ne...,tr,0


In [12]:
# Save the final validation translated file
df.to_csv('../data/validation_translated_ordered.csv',index=False)

### Test Set

In [13]:
# Load the test set
CSV_PATH = '../data/test.csv'
df = pd.read_csv(CSV_PATH)
df.head()

Unnamed: 0,id,content,lang
0,0,Doctor Who adlı viki başlığına 12. doctor olar...,tr
1,1,"Вполне возможно, но я пока не вижу необходимо...",ru
2,2,"Quindi tu sei uno di quelli conservativi , ...",it
3,3,Malesef gerçekleştirilmedi ancak şöyle bir şey...,tr
4,4,:Resim:Seldabagcan.jpg resminde kaynak sorunu ...,tr


In [14]:
# Add a column named outcome and populate its rows with failure.
# Save it as test_ammended.csv
df['outcome'] = 'failure'
df.to_csv('../data/test_ammended.csv',index=False)

In [15]:
# Load the test_ammended.csv to see its contents
CSV_PATH  = '../data/test_ammended.csv'
df = pd.read_csv(CSV_PATH)
df.head()

Unnamed: 0,id,content,lang,outcome
0,0,Doctor Who adlı viki başlığına 12. doctor olar...,tr,failure
1,1,"Вполне возможно, но я пока не вижу необходимо...",ru,failure
2,2,"Quindi tu sei uno di quelli conservativi , ...",it,failure
3,3,Malesef gerçekleştirilmedi ancak şöyle bir şey...,tr,failure
4,4,:Resim:Seldabagcan.jpg resminde kaynak sorunu ...,tr,failure


The trick here is to make the first call with `CSV_PATH` as input. After the translation is done, check whether any rows still have `failure` in the **outcome** column. If so, make the next call on `OUT_PUT_PATH`. Keep making subsequent calls on `OUT_PUT_PATH` until no row has `failure` in the **outcome** column.
> You do this check using `df.groupby('outcome').count()`, as can be seen below.

In [16]:
# Call the translateDf function from preprocess.py file
OUT_PUT_PATH = '../data/jigsaw-toxic-comment-test-translated.csv'
# First call with CSV_PATH
#pr.translateDf(CSV_PATH, OUT_PUT_PATH, ['id','content','lang','outcome']);

In [17]:
# Subsequent calls with OUT_PUT_PATH
#pr.translateDf(OUT_PUT_PATH, OUT_PUT_PATH, ['id','content','lang','outcome']);

After each call above, run the two cells below to check whether any row still has `failure` in the **outcome** column.

In [18]:
# Load and print the output file
df = pd.read_csv(OUT_PUT_PATH)
df.tail()

Unnamed: 0,id,content,lang,outcome
63807,63654,This is how absurd and authenticity of any suc...,tr,success
63808,63719,"a a, let lush ... Turkish government due to an...",tr,success
63809,63791,It does not matter. I've added posty Full info...,tr,success
63810,63788,Bush takes psychotropic Truth or dirty electio...,ru,success
63811,63806,"Yes, you're right, I've put the label, but of ...",tr,success


In [19]:
# Check if the outcome column still has any failure
df.groupby('outcome').count()

Unnamed: 0_level_0,id,content,lang
outcome,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
success,63812,63812,63812


When you are done, run the two cells below to **remove** the added `outcome` column, **sort** the data in ascending order by `id`, and **save** the final result as `test_translated_ordered.csv`.

In [20]:
# Drop the outcome column
df.drop('outcome', axis=1, inplace=True)
# Sort by id
df.sort_values('id', inplace=True)
df.head()

Unnamed: 0,id,content,lang
4763,0,Doctor Who has a wiki-wiki title in the 12th d...,tr
3801,1,"Quite possibly, but I do not see the need to a...",ru
2460,2,"So you're one of those conservative, preferrin...",it
5283,3,"Unfortunately, however, he had not done someth...",tr
2458,4,Picture: Seldabagcan.jpg the official source o...,tr


In [21]:
# Save the final test translated file
df.to_csv('../data/test_translated_ordered.csv',index=False)

### Extract necessary columns and Shuffle

The code to extract the necessary columns and shuffle the data is included in the helper file. We will just make a call to it.

In [18]:
data_train = '../data/jigsaw-toxic-comment-train.csv'
data_valid = '../data/validation_translated_ordered.csv'
data_test  = '../data/test_translated_ordered.csv'

train_X, train_y, valid_X, valid_y, valid_lan, test_X, test_lan, test_id = hp.prepare_imdb_data(
    data_train, data_valid, data_test)

print("Toxic comments (combined): train = {}, validation = {}, test = {}".format(
    len(train_X), len(valid_X), len(test_X)))

Toxic comments (combined): train = 223549, validation = 8000, test = 63812
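
The helper's internals are not shown here, but the key requirement of any shuffle at this stage is that comments and labels are permuted together so that each comment keeps its own label. A minimal sketch of that idea, using only the standard library and a hypothetical function name `shuffle_together`:

```python
import random


def shuffle_together(texts, labels, seed=0):
    """Shuffle texts and labels with the same permutation so pairs stay aligned."""
    order = list(range(len(texts)))
    random.Random(seed).shuffle(order)  # seeded for reproducible runs
    return [texts[i] for i in order], [labels[i] for i in order]


texts = ["comment a", "comment b", "comment c", "comment d"]
labels = [0, 1, 0, 1]
shuffled_texts, shuffled_labels = shuffle_together(texts, labels, seed=42)
print(list(zip(shuffled_texts, shuffled_labels)))
```

Fixing the seed makes the split reproducible between sessions, which matters when cached results are reused.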


Now that we have our `training`, `validation` and `testing` sets prepared, we should do a quick check and look at an example of the data our model will be trained on. This is generally a good idea: it lets you see how each of the further processing steps affects the comments, and it confirms that the data has been loaded correctly.

In [5]:
print(train_X[100],'\n-----------------------------\n')
print(valid_X[100],'\n-----------------------------\n')
print(test_X[100])

However, the Moonlite edit noted by golden daph was me (on optus ...)  Wake up wikkis.  So funny 
-----------------------------

-----------------------------

Ebraim beginner Dear editor, please do not delete information, do not enter information that you know is wrong or create articles with texts meaningless, what can be considered vandalism. There are problems in Article Ebraim, edited by you. If you want to experience the Wikipedia software can do it in the sandbox at will. Nevertheless, whether Leandro Martinez Speaks Tchê!


### Remove HTML tags and Tokenize

Now we want to make sure that any HTML tags that appear are removed. We also tokenize the input, so that each comment becomes a list of normalized words. Note that treating words such as *entertained* and *entertaining* as the same word is the job of stemming rather than tokenization.
Here is a breakdown of what the function below does:
1. Converts all words to lowercase
2. Removes stopwords
3. Removes punctuation
4. Splits the string into a list of words
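
The steps above, together with HTML tag removal, can be sketched as follows. This is an illustration rather than the helper's actual code: the function name `comment_to_words` and the tiny stopword set are assumptions, and a fuller stopword list (for example the one shipped with `nltk`) would normally be used.

```python
import re
import string

# Tiny illustrative stopword list (assumption); a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "is", "it", "of", "the", "this", "to"}

# Translation table that turns every punctuation character into a space.
PUNCTUATION_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def comment_to_words(comment):
    """Strip HTML tags, lowercase, drop punctuation and stopwords, return a word list."""
    text = re.sub(r"<[^>]+>", " ", comment)  # remove HTML tags
    text = text.lower()
    text = text.translate(PUNCTUATION_TO_SPACE)
    return [word for word in text.split() if word not in STOPWORDS]


print(comment_to_words("<p>This is the BEST edit, ever!</p>"))
# ['best', 'edit', 'ever']
```

Tags are removed before punctuation so that `<` and `>` are still available to delimit them.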

+ The helper file defines a method named `review_to_words` just before the `preprocess` method; `preprocess` applies it to each user comment in the training, validation and testing datasets.
+ In addition, it caches the results, because this processing step can take a long time. If you are unable to complete the notebook in the current session, you can come back without needing to process the data a second time.
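
The caching idea can be sketched with `pickle`: compute once, save to disk, and load from disk on later runs. The function name `cached` and the cache path in the usage comment are hypothetical, and the helper's real caching code may differ.

```python
import os
import pickle


def cached(cache_path, compute):
    """Return the pickled result at cache_path, computing and saving it on first use."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "wb") as handle:
        pickle.dump(result, handle)
    return result


# Possible use (the path and the processing function are placeholders):
# words_train = cached('../cache/words_train.pkl',
#                      lambda: [process(comment) for comment in train_X])
```

Delete the cache file whenever the processing steps change, otherwise stale results will be loaded.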

**We now get the list of all the words in the training and validation sets**