# Toxic Comment Classification
---

## General Outline
---
0. Import the necessary libraries.
1. Download or otherwise retrieve the data.
2. Process / Prepare the data.
3. Upload the processed data to S3.
4. Train a chosen model.
5. Test the trained model (typically using a batch transform job).
6. Deploy the trained model.
7. Use the deployed model.


## Step 0: Importing Libraries
---

In [None]:
!pip install --upgrade pip
!pip install googletrans
!pip install tqdm
!pip install torch
!pip install emoji
!pip install nltk

In [20]:
import pandas as pd
from shutil import unpack_archive
import torch
import helper as hp
import os
import ast
import pickle
import numpy as np

In [None]:
# Set device to use GPU if available
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

## Step 1: Downloading the data
---

Two challenges you might face while downloading this data are:
- Pulling the data from Kaggle directly onto the notebook instance you are working on
- Running out of disk space if you are using Amazon SageMaker

I have tried to address both issues below.

### Install the Kaggle library and download the zip file
To pull the data directly into your notebook when you are working on a remote machine, take these steps:
1. **Install the Kaggle library.** Open the notebook's terminal and run `pip install kaggle`.
2. **Create a directory** for the API token, so the Kaggle library knows where to look for your sign-in credentials when you access the API from your notebook. The directory must be named `.kaggle` and must sit in your home directory. Note that it is hidden because its name starts with a dot. Create it from the terminal: `mkdir ~/.kaggle`
3. **Go to your Kaggle account** (or create one if you haven’t yet), click the profile icon in the top right corner of the screen, and select “My Profile” from the dropdown list. Scroll down to about the middle of the page and click “Create a New API Token”. A file named `kaggle.json` will then download to your default downloads folder. This file contains the sign-in credentials that allow you to access the API.
4. **Move the token** into the new `.kaggle` directory. From the directory where `kaggle.json` was downloaded, run this in the terminal: `mv kaggle.json ~/.kaggle/`
5. **List the Kaggle competitions** from your terminal to confirm that the setup works: `kaggle competitions list`
6. **Download the competition zip file** from your terminal straight into your working directory: `kaggle competitions download -c jigsaw-multilingual-toxic-comment-classification`. To run the download from the notebook instead, use `!kaggle competitions download -c jigsaw-multilingual-toxic-comment-classification`

> I also recommend you go through this GitHub repository: `https://github.com/floydwch/kaggle-cli`
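Steps 2 and 4 above can also be done from Python. The sketch below is only an illustration: `install_kaggle_token` is a hypothetical helper name, and it assumes you have already uploaded `kaggle.json` somewhere on the notebook instance. It also restricts the file to its owner (the equivalent of `chmod 600`), since the Kaggle client may warn when the token is readable by other users.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(token_path, home_dir=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner.

    Hypothetical helper for illustration; home_dir defaults to the user's home.
    """
    home = Path(home_dir) if home_dir is not None else Path.home()
    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(parents=True, exist_ok=True)

    target = kaggle_dir / "kaggle.json"
    shutil.copyfile(token_path, target)
    # Owner read/write only, the equivalent of `chmod 600`.
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target
```

For example, `install_kaggle_token("kaggle.json")` would place the token in `~/.kaggle/kaggle.json`.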

### Unzip the file
One major challenge I faced while unzipping the file on Amazon SageMaker was running out of disk space. This is how I resolved it:
1. **I went to Amazon SageMaker**, then to the notebook instance I was working on. I clicked on `edit`, scrolled down to `volume`, increased the volume size from the default 5 GB to about 160 GB, and then updated the instance. That solved the issue for me.
2. **Then I proceeded** to the next cell below.

In [None]:
unpack_archive('../data/jigsaw-multilingual-toxic-comment-classification.zip', '../data/')
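Before unpacking, it can help to check whether the volume actually has room for the archive. The sketch below is one way to do that with the standard library; `archive_fits` and its `safety_factor` parameter are hypothetical names of my own, not part of the helper file.

```python
import shutil
import zipfile


def archive_fits(zip_path, dest_dir, safety_factor=1.1):
    """Return (fits, needed_bytes, free_bytes) for unpacking zip_path into dest_dir.

    needed_bytes is the total uncompressed size recorded in the zip file;
    safety_factor leaves some headroom on top of that.
    """
    with zipfile.ZipFile(zip_path) as archive:
        needed = sum(info.file_size for info in archive.infolist())
    free = shutil.disk_usage(dest_dir).free
    return needed * safety_factor <= free, needed, free
```

If `fits` comes back `False`, that is the point at which to increase the volume size as described above.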

## Step 2: Data Preparation
---

### Translate the data
Since the `validation` and `test` sets are in many languages, we need to translate them to English for consistency using the Google Translate API.
The code for this can be found in the helper code.  
> **Note** that translating can be difficult. In my case I used the Google API, which kept blocking my requests, so I devised a workaround: when you send in the file for translation, the function checks the column titled `outcome`. If the value for a row is `success`, it leaves that row alone; otherwise (if it is `failure`, which is the default value), it attempts to translate the row. If the API has blocked the request, it returns the row values as they are; otherwise it returns the row values with the content translated.
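The workaround in the note can be sketched as follows. This is not the code from the helper file: it works on plain dictionaries instead of a DataFrame, `translate_pending` is a hypothetical name, and `translate` stands in for whatever callable wraps the translation API. The sketch assumes that callable raises an exception when a request is blocked, and that a translated row is marked `success`.

```python
def translate_pending(rows, translate, text_key="comment_text"):
    """Translate rows still marked 'failure'; leave the rest untouched.

    `translate` is a hypothetical callable (text -> English text) that is
    assumed to raise an exception when the API refuses the request.
    """
    result = []
    for row in rows:
        if row.get("outcome") == "success":
            result.append(row)  # already translated on an earlier pass
            continue
        try:
            translated = translate(row[text_key])
        except Exception:
            result.append(row)  # blocked: return the row as it is
        else:
            result.append({**row, text_key: translated, "outcome": "success"})
    return result
```

Because rows that succeed are marked `success`, running the function again on its own output only retries the rows that were blocked.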

### Validation Set

In [None]:
# Load the validation set
CSV_PATH = '../data/validation.csv'
df = pd.read_csv(CSV_PATH)
df.head()

In [None]:
# Add a column named outcome and populate its rows with 'failure'.
# Save it as validation_ammended.csv
df['outcome'] = 'failure'
df.to_csv('../data/validation_ammended.csv', index=False)

In [None]:
# Load validation_ammended.csv to see its contents
CSV_PATH  = '../data/validation_ammended.csv'
df = pd.read_csv(CSV_PATH)
df.head()

The trick here is to make the first call with `CSV_PATH` as input. After the translation pass finishes, check whether any rows still have `failure` in the **outcome** column. If so, make the next call on `OUT_PUT_PATH`, and keep calling on `OUT_PUT_PATH` until no row has `failure` in the **outcome** column.  
> You do this check using `df.groupby('outcome').count()` as can be seen below
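The repeat-until-done procedure described above can be written as a small loop. This is a sketch on plain dictionaries rather than the notebook's DataFrame; `run_until_translated`, `outcome_counts` and `run_pass` are hypothetical names, with `run_pass` standing in for one translation pass over the data. A cap on the number of passes keeps the loop from running forever if the API stays blocked.

```python
from collections import Counter


def outcome_counts(rows):
    """Count rows per outcome value, similar to df.groupby('outcome').count()."""
    return dict(Counter(row["outcome"] for row in rows))


def run_until_translated(rows, run_pass, max_passes=10):
    """Call run_pass(rows) until no row is marked 'failure' or max_passes is hit.

    Returns the final rows and the number of passes that were made.
    """
    passes = 0
    while passes < max_passes and any(row["outcome"] == "failure" for row in rows):
        rows = run_pass(rows)
        passes += 1
    return rows, passes
```

In the notebook itself, each pass corresponds to one call on `OUT_PUT_PATH`, followed by the `groupby` check.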

In [None]:
# Call the translateDf function from preprocess.py file
OUT_PUT_PATH = '../data/jigsaw-toxic-comment-validation-translated.csv'
# First call with CSV_PATH
#pr.translateDf(CSV_PATH, OUT_PUT_PATH, ['id','comment_text','lang','toxic','outcome']);

In [None]:
# Subsequent calls with OUT_PUT_PATH
#pr.translateDf(OUT_PUT_PATH, OUT_PUT_PATH, ['id','comment_text','lang','toxic','outcome']);

After each call above, run the two cells below to check whether any row still has `failure` in the **outcome** column.

In [None]:
# Load and print the output file
df = pd.read_csv(OUT_PUT_PATH)
df.tail()

In [None]:
# Check whether the outcome column still contains failure
df.groupby('outcome').count()

When you are done, run the two cells below to **remove** the added `outcome` column, **sort** the data in ascending order by `id`, and **save** the final result as `validation_translated_ordered.csv`.

In [None]:
# Drop the outcome column
df.drop('outcome', axis=1, inplace=True)
# Sort by id
df.sort_values('id', inplace=True)
df.head()

In [None]:
# Save the final validation translated file
df.to_csv('../data/validation_translated_ordered.csv',index=False)

### Test Set

In [23]:
# Load the test set
CSV_PATH = '../data/test.csv'
df = pd.read_csv(CSV_PATH)
df.head()

'Doctor Who adlı viki başlığına 12. doctor olarak bir viki yazarı kendi adını eklemiştir. Şahsen düzelttim. Onaylarsanız sevinirim. Occipital '

In [None]:
# Add a column named outcome and populate its rows with 'failure'.
# Save it as test_ammended.csv
df['outcome'] = 'failure'
df.to_csv('../data/test_ammended.csv', index=False)

In [None]:
# Load test_ammended.csv to see its contents
CSV_PATH  = '../data/test_ammended.csv'
df = pd.read_csv(CSV_PATH)
df.head()

The trick here is to make the first call with `CSV_PATH` as input. After the translation pass finishes, check whether any rows still have `failure` in the **outcome** column. If so, make the next call on `OUT_PUT_PATH`, and keep calling on `OUT_PUT_PATH` until no row has `failure` in the **outcome** column.  
> You do this check using `df.groupby('outcome').count()` as can be seen below

In [None]:
# Call the translateDf function from preprocess.py file
OUT_PUT_PATH = '../data/jigsaw-toxic-comment-test-translated.csv'
# First call with CSV_PATH
#pr.translateDf(CSV_PATH, OUT_PUT_PATH, ['id','content','lang','outcome']);

In [None]:
# Subsequent calls with OUT_PUT_PATH
#pr.translateDf(OUT_PUT_PATH, OUT_PUT_PATH, ['id','content','lang','outcome']);

After each call above, run the two cells below to check whether any row still has `failure` in the **outcome** column.

In [None]:
# Load and print the output file
df = pd.read_csv(OUT_PUT_PATH)
df.tail()

In [None]:
# Check whether the outcome column still contains failure
df.groupby('outcome').count()

When you are done, run the two cells below to **remove** the added `outcome` column, **sort** the data in ascending order by `id`, and **save** the final result as `test_translated_ordered.csv`.

In [None]:
# Drop the outcome column
df.drop('outcome', axis=1, inplace=True)
# Sort by id
df.sort_values('id', inplace=True)
df.head()

In [None]:
# Save the final translated test file
df.to_csv('../data/test_translated_ordered.csv',index=False)

### Extract necessary columns and shuffle

The code to extract the necessary columns and shuffle the data is included in the helper file. We will just make a call to it.

In [None]:
data_train = '../data/jigsaw-toxic-comment-train.csv'
data_valid = '../data/validation_translated_ordered.csv'
data_test  = '../data/test_translated_ordered.csv'

train_X, train_y, valid_X, valid_y, valid_lan, test_X, test_lan, test_id = hp.prepare_imdb_data(
    data_train, data_valid, data_test)

print("Toxic comments (combined): train = {}, validation = {}, test = {}".format(
    len(train_X), len(valid_X), len(test_X)))

Now that we have our `training`, `validation` and `testing` sets prepared, we should do a quick check and look at an example of the data our model will be trained on. This is generally a good idea because it lets you see how each further processing step affects the comments, and it also confirms that the data has been loaded correctly.

In [None]:
print(train_X[100],'\n-----------------------------\n')
print(valid_X[100],'\n-----------------------------\n')
print(test_X[100])

### Remove HTML tags and Tokenize

Now we want to remove any HTML tags that appear in the comments. In addition, we wish to tokenize and stem our input, so that words such as *entertained* and *entertaining* are treated as the same word by the classifier.
This is the breakdown of what the preprocessing function does:
1. It converts all words to lowercase
2. Removes stopwords
3. Removes punctuation
4. Splits the string into a list of words

+ The helper file defines a method named `review_to_words` just before the `preprocess` method; `preprocess` applies it to each of the user comments in the training, validation and testing datasets.
+ In addition it caches the results. This is because performing this processing step can take a long time. This way if you are unable to complete the notebook in the current session, you can come back without needing to process the data a second time.
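To make the steps listed above concrete, here is a minimal sketch of a cleaning function and a pickle-based cache. It is not the code from the helper file: `comment_to_words` and `load_or_compute` are hypothetical names, the stopword set is a tiny illustrative list of my own, and stemming is left out to keep the sketch free of third-party dependencies.

```python
import os
import pickle
import re
import string

# A tiny illustrative stopword list; an assumption, not the list used by the helper.
STOPWORDS = {"a", "an", "and", "the", "is", "are", "was", "to", "of", "in", "it", "this"}

# Translation table that turns every punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def comment_to_words(comment):
    """Strip HTML tags, lowercase, drop punctuation and stopwords, split into words."""
    text = re.sub(r"<[^>]+>", " ", comment)  # remove HTML tags
    text = text.lower().translate(_PUNCT_TO_SPACE)
    return [word for word in text.split() if word not in STOPWORDS]


def load_or_compute(cache_path, compute):
    """Return the cached result if cache_path exists, else compute and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result
```

With the cache in place, a second session that finds the cache file skips the slow processing step entirely.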

**We now get the list of all the words in training and validation sets**

In [4]:
data_train = '../data/jigsaw-toxic-comment-train.csv'
data_valid = '../data/validation_translated_ordered.csv'
data_test  = '../data/test_translated_ordered.csv'
OUT_PUT_PATH_train = '../data/words_train.csv'
OUT_PUT_PATH_valid = '../data/words_validation.csv'
OUT_PUT_PATH_test = '../data/words_test.csv'
column_train = ['id','comment_text','toxic']
column_valid = ['id','comment_text','lang','toxic']
column_test = ['id','content','lang']

In [None]:
#hp.word_list_Df(OUT_PUT_PATH_train+'_1', OUT_PUT_PATH_train+'_1', column_train)

In [None]:
hp.word_list_Df(data_train, OUT_PUT_PATH_train, column_train)

In [None]:
df = pd.read_csv(OUT_PUT_PATH_train)
df.head()

In [None]:
df['comment_text'][0][0]

In [None]:
df.groupby('toxic').count()

#### Build word_dictionary

> - A careful examination of the contents of the dataset shows that the word lists were saved as strings. We first extract the lists from the strings and build `train_X`, `train_y`, `valid_X`, `valid_y`, `test_X` and `test_id`.
> - Next we build the dictionary of words.
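The extraction step relies on `ast.literal_eval`, which safely evaluates a string containing a Python literal. A minimal sketch of the idea follows; `parse_token_list` is a hypothetical name, and unlike the cells in this notebook it returns `None` (rather than the string `'None'`) for cells that cannot be parsed.

```python
import ast


def parse_token_list(cell):
    """Turn a CSV cell such as "['hello', 'world']" back into a list of words.

    Returns None when the cell is not a parseable list (for example NaN).
    """
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return None
    return value if isinstance(value, list) else None
```

Checking the type of the result guards against cells that parse to something other than a list, such as a bare number.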

In [6]:
train = pd.read_csv(OUT_PUT_PATH_train)
train_X = [0]*train.shape[0]
j = 0
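for i, elem in enumerate(train['comment_text']):
    try:
        train_X[i] = ast.literal_eval(elem)
    except (ValueError, SyntaxError, TypeError):
        # Count rows that could not be parsed and store a placeholder
        j += 1
        train_X[i] = 'None'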
train_y = list(train['toxic'])

In [9]:
valid = pd.read_csv(OUT_PUT_PATH_valid)
valid_X = [0]*valid.shape[0]
k = 0
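for i, elem in enumerate(valid['comment_text']):
    try:
        valid_X[i] = ast.literal_eval(elem)
    except (ValueError, SyntaxError, TypeError):
        # Count rows that could not be parsed and store a placeholder
        k += 1
        valid_X[i] = 'None'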
valid_y = list(valid['toxic'])

In [12]:
test = pd.read_csv(OUT_PUT_PATH_test)
test_X = [0]*test.shape[0]
m = 0
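for i, elem in enumerate(test['content']):
    try:
        test_X[i] = ast.literal_eval(elem)
    except (ValueError, SyntaxError, TypeError):
        # Count rows that could not be parsed and store a placeholder
        m += 1
        test_X[i] = 'None'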
test_id = list(test['id'])

We build the dictionary of words from the training set.

In [15]:
# Dictionary of words
word_dict = hp.build_dict(train_X)

In [16]:
# Default vocab_size = 5000
len(word_dict.values())

4998

In [17]:
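# List the five words with the largest values in word_dict.
# Note: if build_dict maps each word to its frequency rank (most frequent = smallest
# integer), these are the least frequent words kept; use reverse=False for the most frequent.
print(sorted(word_dict, key=word_dict.get, reverse=True)[:5])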

['promo', 'cabl', 'lip', 'spade', 'vegetarian']
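The implementation of `hp.build_dict` is not shown in this notebook. The sketch below assumes a common convention in which the most frequent word gets the smallest integer and the indices `0` and `1` are reserved (here for "no word" and "infrequent word"). Under that assumption a `vocab_size` of 5000 leaves room for 4998 words, which would match the length printed above. `build_vocab` is a hypothetical name.

```python
from collections import Counter


def build_vocab(tokenized_comments, vocab_size=5000):
    """Map the most frequent words to integers, most frequent first.

    Indices 0 and 1 are reserved (assumed: 'no word' and 'infrequent word'),
    so at most vocab_size - 2 words are kept.
    """
    counts = Counter(word for comment in tokenized_comments for word in comment)
    most_common = [word for word, _ in counts.most_common(vocab_size - 2)]
    return {word: index + 2 for index, word in enumerate(most_common)}
```

With a dictionary built this way, `sorted(vocab, key=vocab.get)[:5]` lists the five most frequent words, because smaller integers mean higher frequency.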


### Save `word_dict`

Later on when we construct an endpoint which processes a submitted review we will need to make use of the `word_dict` which we have created. As such, we will save it to a file now for future use.

In [25]:
data_dir = '../data/pytorch' # The folder we will use for storing data
os.makedirs(data_dir, exist_ok=True) # Make sure that the folder exists

In [None]:
with open(os.path.join(data_dir, 'word_dict.pkl'), "wb") as f:
    pickle.dump(word_dict, f)
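When the dictionary is needed again, for instance in inference code, it can be read back with `pickle.load`. The sketch below shows the round trip; `save_word_dict` and `load_word_dict` are hypothetical helper names, not functions from the helper file.

```python
import os
import pickle


def save_word_dict(word_dict, directory, filename="word_dict.pkl"):
    """Pickle word_dict into directory (created if missing) and return the path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "wb") as f:
        pickle.dump(word_dict, f)
    return path


def load_word_dict(directory, filename="word_dict.pkl"):
    """Load a previously pickled word_dict from directory."""
    with open(os.path.join(directory, filename), "rb") as f:
        return pickle.load(f)
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from an untrusted file.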

### Transform the reviews

Now that we have our word dictionary which allows us to transform the words appearing in the reviews into integers, it is time to make use of it and convert our reviews to their integer sequence representation, making sure to pad or truncate to a fixed length, which in our case is `500`.

In [18]:
train_X, train_X_len = hp.convert_and_pad_data(word_dict, train_X)
valid_X, valid_X_len = hp.convert_and_pad_data(word_dict, valid_X)
test_X, test_X_len = hp.convert_and_pad_data(word_dict, test_X)
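For readers without the helper file at hand, the conversion can be sketched as below. This is an assumption about what `hp.convert_and_pad_data` does for a single comment, not its actual code: `convert_and_pad` is a hypothetical name, `0` is taken to mean padding and `1` an unknown word, and plain lists are used instead of NumPy arrays.

```python
NO_WORD = 0      # padding value (assumed)
INFREQUENT = 1   # value for words missing from the dictionary (assumed)


def convert_and_pad(word_dict, tokens, pad=500):
    """Return (fixed-length list of ints, number of real tokens kept)."""
    ids = [word_dict.get(word, INFREQUENT) for word in tokens[:pad]]
    length = len(ids)
    return ids + [NO_WORD] * (pad - length), length
```

Because only padding positions hold `0`, the stored length of a comment equals the number of non-zero entries in its padded sequence.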

As a quick check to make sure that things are working as intended, look at what one of the reviews in the training set looks like after having been processed. Does this look reasonable? What is the length of a review in the training set?

In [21]:
# Use this cell to examine one of the processed reviews to make sure everything is working as intended.
print("Shape of train data is {}, while shape of train lengths is {} \n\nOne training data is {} == {}" \
      .format(np.shape(train_X),np.shape(train_X_len), train_X_len[230], np.count_nonzero(train_X[230])))

Shape of train data is (223549, 500), while shape of train lengths is (223549,) 

One training data is 12 == 12


## Step 3: Upload the data to S3

---

We will need to upload the training dataset to S3 in order for our training code to access it. For now we will save it locally and we will upload to S3 later on.

### Save the processed training dataset locally

It is important to note the format of the data that we are saving as we will need to know it when we write the training code. In our case, each row of the dataset has the form `label`, `length`, `review[500]` where `review[500]` is a sequence of `500` integers representing the words in the review.

In [26]:
pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)

In [29]:
pd.concat([pd.DataFrame(valid_y), pd.DataFrame(valid_X_len), pd.DataFrame(valid_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'valid.csv'), header=False, index=False)

In [30]:
pd.concat([pd.DataFrame(test_id), pd.DataFrame(test_X_len), pd.DataFrame(test_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)
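Since the training code has to read this format back, here is a minimal sketch of how one saved row can be split into its parts. `parse_row` is a hypothetical name; the first column is the label for the training and validation files and the `id` for the test file, and the sketch assumes it is an integer.

```python
import csv
import io


def parse_row(line, pad=500):
    """Split one CSV line into (first_column, length, tokens).

    Expects pad + 2 comma-separated integers: label (or id), length, then
    the padded sequence of word indices.
    """
    fields = next(csv.reader(io.StringIO(line)))
    if len(fields) != pad + 2:
        raise ValueError(f"expected {pad + 2} fields, got {len(fields)}")
    first, length = int(fields[0]), int(fields[1])
    tokens = [int(value) for value in fields[2:]]
    return first, length, tokens
```

The field-count check catches files written with a different padding length before they reach the model.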

### Uploading the training data


Next, we need to upload the training data to the SageMaker default S3 bucket so that we can provide access to it while training our model.

In [31]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/sentiment_rnn'

role = sagemaker.get_execution_role()

In [32]:
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

**NOTE:** The cell above uploads the entire contents of our data directory. This includes the `word_dict.pkl` file. This is fortunate as we will need this later on when we create an endpoint that accepts an arbitrary review. For now, we will just take note of the fact that it resides in the data directory (and so also in the S3 training bucket) and that we will need to make sure it gets saved in the model directory.

## Step 4: Build and Train the PyTorch Model

In the XGBoost notebook we discussed what a model is in the SageMaker framework. In particular, a model comprises three objects

 - Model Artifacts,
 - Training Code, and
 - Inference Code,
 
each of which interacts with the others. In the XGBoost example we used training and inference code that was provided by Amazon. Here we will still be using containers provided by Amazon, with the added benefit of being able to include our own custom code.

I have implemented the neural network in PyTorch along with a training script. The model object is in the `model.py` file, inside of the `train` folder. You can see the implementation by running the cell below.