# Toxic Comment Classification
---

## General Outline
---
0. Import the necessary libraries.
1. Download or otherwise retrieve the data.
2. Process / Prepare the data.
3. Upload the processed data to S3.
4. Train a chosen model.
5. Test the trained model (typically using a batch transform job).
6. Deploy the trained model.
7. Use the deployed model.


## Step 0: Importing Libraries
---

In [4]:
#!pip install googletrans
#!pip install tqdm
#!pip install torch
#!pip install --upgrade pip

In [14]:
import pandas as pd
from shutil import unpack_archive
import torch
import helper as hp
import os

In [6]:
# Set device to use GPU if available
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

## Step 1: Downloading the data
---

Two challenges you might face while trying to download this data are:
- Pulling the data from Kaggle directly into the notebook instance you are working with
- Running out of storage space if you are using Amazon SageMaker

I have tried to address both issues below.

### Install the Kaggle library and download the zip file
To pull the data directly into your notebook when you are working on a remote machine, you can take these steps:
1. **Install the Kaggle library.** Open the notebook's terminal and run `pip install kaggle`.
2. **Create a directory** for the API token, so the Kaggle library knows where to look for your sign-in credentials when you access the API from your notebook. This directory must be named `.kaggle` and must sit in your home directory. Note that the directory is hidden. Create it from the terminal: `mkdir ~/.kaggle`
3. **Go to your Kaggle account** (or create one if you haven't yet), click the profile icon in the top right corner of the screen, and select "My Profile" from the dropdown list. Scroll down to about the middle of the page and click "Create a New API Token". A file named `kaggle.json` will then download automatically to your default downloads folder. This file contains the sign-in credentials that let you access the API.
4. **Navigate to the directory** where the file was downloaded and move `kaggle.json` into the new `.kaggle` directory. Do this from the terminal: `mv kaggle.json ~/.kaggle/`
5. **View the list of Kaggle competitions** from your terminal: `kaggle competitions list`
6. **Download the competition zip file** from your terminal directly to your working location: `kaggle competitions download -c jigsaw-multilingual-toxic-comment-classification`. To download from within the notebook instead, run `!kaggle competitions download -c jigsaw-multilingual-toxic-comment-classification`
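
If the `kaggle` command complains about missing credentials, a quick check from the notebook can help. The sketch below assumes the token lives at the default location, `.kaggle/kaggle.json` under the home directory; the function name `check_kaggle_credentials` is made up for this example and is not part of the Kaggle library.

```python
import os
import stat
from pathlib import Path


def check_kaggle_credentials(home=None):
    """Return the path to kaggle.json and restrict it to the current user.

    `home` defaults to the current user's home directory; passing it
    explicitly is only meant to make the function easy to try out.
    """
    home = Path(home) if home is not None else Path.home()
    token = home / ".kaggle" / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"Expected an API token at {token}")
    # Equivalent to `chmod 600`: the token holds your API key,
    # so only the owner should be able to read or write it.
    os.chmod(token, stat.S_IRUSR | stat.S_IWUSR)
    return token
```

Calling `check_kaggle_credentials()` with no arguments checks the real home directory and raises `FileNotFoundError` if the token has not been moved into place yet.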

> I also recommend reading through this GitHub repository: `https://github.com/floydwch/kaggle-cli`

### Unzip the file
One major challenge I faced while trying to unzip the file on Amazon SageMaker was running out of storage space. This is how I resolved it:
1. **I went to Amazon SageMaker**, then to the notebook instance I was working on. I clicked `edit`, scrolled down to the `volume` setting, increased the volume size from the default 5 GB to about 160 GB, and saved the update. That solved the issue for me.
2. **Then I proceeded** to the next cell below.

In [None]:
unpack_archive('../data/jigsaw-multilingual-toxic-comment-classification.zip', '../data/')
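
Before running the unpack step, it can help to confirm that the volume actually has room for the extracted files. The following is a minimal sketch using only the standard library; the helper names `uncompressed_size` and `has_room_for` are made up for this example.

```python
import shutil
import zipfile


def uncompressed_size(zip_path):
    """Total size in bytes of all archive members once extracted."""
    with zipfile.ZipFile(zip_path) as zf:
        return sum(info.file_size for info in zf.infolist())


def has_room_for(zip_path, target_dir):
    """True if the volume holding target_dir has enough free space
    for the extracted contents of the archive."""
    free_bytes = shutil.disk_usage(target_dir).free
    return uncompressed_size(zip_path) <= free_bytes
```

For this notebook the call would look like `has_room_for('../data/jigsaw-multilingual-toxic-comment-classification.zip', '../data/')`. If it returns `False`, increase the volume size before unpacking.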

## Step 2: Data Preparation
---

### Translate the data
Since the validation and test sets are in many languages, we translate them to English for consistency using the Google Translate API.
The code for this can be found in the helper module (`helper`, imported as `hp`).
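
The helper's implementation is not shown in this notebook, so the following is only a hypothetical sketch of the general idea: leave English rows alone, translate the rest, and avoid sending the same comment twice. The function `translate_texts` and its `translate_fn` parameter are assumptions for illustration. `translate_fn` stands in for whatever wrapper around the translation service you use.

```python
def translate_texts(texts, langs, translate_fn, target="en"):
    """Translate each text whose language code differs from `target`.

    translate_fn is any callable taking a string and returning a string.
    It is a placeholder for a real translation client, not a library API.
    """
    cache = {}  # remember translations so duplicate comments cost one call
    result = []
    for text, lang in zip(texts, langs):
        if lang == target:
            result.append(text)  # already in the target language
            continue
        if text not in cache:
            cache[text] = translate_fn(text)
        result.append(cache[text])
    return result
```

With a DataFrame, this could be applied as `df['comment_text'] = translate_texts(df['comment_text'], df['lang'], my_translate)`, where `my_translate` is your own translation wrapper.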

In [None]:
# Set the input and output paths
CSV_PATH = '../data/validation.csv'
out = '../data/jigsaw-toxic-comment-validation-translated.csv'
df = pd.read_csv(CSV_PATH)
df.head()

In [None]:
# Make a call to the helper function
"""Uncomment below to translate the data"""

#out = '../data/jigsaw-toxic-comment-validation-translated.csv'
#hp.translateDf(CSV_PATH,out,['id','comment_text','lang', 'toxic']);
#CSV_PATH = '../data/test.csv'  # switch the input to the test set before translating it
#out = '../data/jigsaw-toxic-comment-test-translated.csv'
#hp.translateDf(CSV_PATH,out,['id','content','lang']);

In [None]:
# Inspect the raw test set; point `out` at a translated CSV to check the translation output
out = '../data/test.csv'
df = pd.read_csv(out)
df.head()

### Extract necessary columns and shuffle

The code to extract the necessary columns and shuffle the data is included in the helper file. We will just make a call to it.

In [7]:
data_train = '../data/jigsaw-toxic-comment-train.csv'
data_valid = '../data/jigsaw-toxic-comment-validation-translated.csv'
data_test  = '../data/jigsaw-toxic-comment-test-translated.csv'

train_X,train_y, valid_X,valid_y,valid_lan, test_X,test_lan,test_id = hp.prepare_imdb_data(data_train, data_valid, data_test)

print("Toxic comments (combined): train = {}, validation = {}, test = {}".format(len(train_X), len(valid_X), len(test_X)))

Toxic comments (combined): train = 223549, validation = 8000, test = 63812
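
The helper that produces these splits is not reproduced here. As a rough illustration of what "extract columns and shuffle" can look like with pandas, here is a minimal sketch; the function name `select_and_shuffle` and its parameters are assumptions, not the helper's actual interface.

```python
import pandas as pd


def select_and_shuffle(df, text_col, label_col, seed=0):
    """Keep only the text and label columns, shuffle the rows,
    and return them as two parallel lists."""
    subset = df[[text_col, label_col]]
    # sample(frac=1) returns every row in random order; a fixed seed
    # makes the shuffle reproducible between runs.
    shuffled = subset.sample(frac=1, random_state=seed).reset_index(drop=True)
    return shuffled[text_col].tolist(), shuffled[label_col].tolist()
```

Shuffling text and labels together in one DataFrame keeps each comment paired with its own label, which is easy to get wrong when shuffling two separate lists.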


Now that we have our `training`, `validation` and `testing` sets prepared, we should do a quick check and look at an example of the data our model will be trained on. This is generally a good idea because it lets you see how each further processing step affects the comments, and it confirms that the data has been loaded correctly.

In [8]:
print(train_X[100],'\n-----------------------------')
print(valid_X[100],'\n-----------------------------')
print(test_X[100],'\n-----------------------------')

However, the Moonlite edit noted by golden daph was me (on optus ...)  Wake up wikkis.  So funny 
-----------------------------
concerning spam was that rewrote thinking that the first time he had not done well, it must conclude that I must change the focus of the article ?, is not promotional but descriptive 
-----------------------------
@ Lucretia Dashnak propaganda: are you talking about here? ENOUGH, fed up with your tone and your délires.Cesar Borgia 
-----------------------------


### Remove HTML tags and Tokenize

Now we want to remove any HTML tags that appear in the comments. In addition, we normalize and tokenize the input so that differences in case and punctuation do not produce distinct tokens. Note that treating words such as *entertained* and *entertaining* as the same token requires stemming, which is a separate step from the four listed below.
This is the breakdown of what the function below does:
1. Converts all words to lowercase
2. Removes stopwords
3. Removes punctuation
4. Splits the string into a list of words
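
The steps above can be sketched with the standard library alone. This is an illustrative approximation, not the notebook's actual function. The name `comment_to_words` and the tiny stopword list are assumptions, HTML tags are stripped with a simple regular expression rather than a real parser, and non-ASCII letters are treated like punctuation for simplicity.

```python
import re

# A tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "the", "to", "was"}


def comment_to_words(text, stopwords=STOPWORDS):
    """Strip HTML tags, lowercase, drop punctuation and stopwords,
    and return the remaining words as a list."""
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags
    text = text.lower()                          # step 1: lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)     # step 3: remove punctuation
    words = text.split()                         # step 4: split into words
    return [w for w in words if w not in stopwords]  # step 2: drop stopwords


print(comment_to_words("<b>Wake up</b>, wikkis. It was SO funny!"))
# prints ['wake', 'up', 'wikkis', 'so', 'funny']
```

Stopwords are filtered last here so that the comparison happens on clean, lowercase tokens; the result is the same set of operations as the numbered list, applied in a slightly different order.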

+ The `preprocess` method is preceded by a method named `review_to_words`, which `preprocess` applies to each of the user comments in the training, validation, and testing datasets.
+ In addition, it caches the results, because this processing step can take a long time. If you are unable to complete the notebook in the current session, you can come back without needing to process the data a second time.
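
The caching idea can be illustrated with a small pickle-based sketch. The function name `cached` is made up for this example, and the notebook's own preprocessing code may store its cache differently.

```python
import os
import pickle


def cached(cache_path, compute):
    """Load a pickled result from cache_path if it exists;
    otherwise call compute(), store its result, and return it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result
```

For example, `cached(os.path.join(cache_dir, "train.pkl"), lambda: [review_to_words(c) for c in train_X])` would process the training comments once and reload them on later runs. The file name `train.pkl` is hypothetical, and `cache_dir` is the directory created in the next cell. Delete the cache file whenever the preprocessing logic changes, since a stale cache would otherwise be loaded silently.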

In [15]:
cache_dir = os.path.join("../data/cache", "toxic_comments")  # where to store cache files
os.makedirs(cache_dir, exist_ok=True)  # ensure cache directory exists

In [None]:
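# Preprocess data
# Note: `preprocess_data` must be defined (or imported from the helper) before this cell runs,
# and `test_y` is not created by the earlier cells, since the test set has no labels.
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)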