# Toxic Comment Classification
---

## General Outline
---
0. Import the necessary libraries
1. Download or otherwise retrieve the data.
2. Process / Prepare the data.
3. Upload the processed data to S3.
4. Train a chosen model.
5. Test the trained model (typically using a batch transform job).
6. Deploy the trained model.
7. Use the deployed model.


## Step 0: Importing Libraries
---

In [3]:
#!pip install --upgrade pip
#!pip install googletrans
#!pip install tqdm
#!pip install torch
#!pip install emoji
#!pip install nltk

In [4]:
import pandas as pd
from shutil import unpack_archive
import torch
import helper as hp
import os
import ast
import pickle
import numpy as np

In [5]:
# Set device to use GPU if available
use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

## Step 1: Downloading the data
---

Two challenges you might face while trying to download this data are:
- Pulling the data from Kaggle directly to the notebook instance you are working on
- Running out of disk space if you are using Amazon SageMaker

I have tried to address both issues below.

### Install the Kaggle library and download the zip file
To pull the data directly to your notebook when you are working on a remote machine, you can take these steps:
1. **pip install the Kaggle library**. You can do this by opening the notebook's terminal and running `pip install kaggle`
2. **Create a directory** for the API token, so the Kaggle library knows where to look for your sign-in credentials when you access the API from your notebook. This directory must be named `.kaggle` and must sit in your home directory. Note that this directory is hidden. Do this in the terminal: `mkdir ~/.kaggle`
3. **Go to your Kaggle account** (or create one if you haven’t yet), click the profile icon in the top right corner of the screen, and select “My Profile” from the dropdown list. Scroll down to about the middle of the page and click “Create a New API Token”. A file named `kaggle.json` will then download automatically to your default download folder. This file contains the sign-in credentials that let you access the API.
4. **Navigate to the directory** where the file was downloaded and move `kaggle.json` into the new `.kaggle` directory. Do this in the terminal: `mv kaggle.json ~/.kaggle/`
5. **Now you can view the list of Kaggle competitions** from your terminal: `kaggle competitions list`
6. **You can download the competition zip file** from your terminal directly to the location you are working in: `kaggle competitions download -c jigsaw-multilingual-toxic-comment-classification`. If you want to run the download from the notebook, use `!kaggle competitions download -c jigsaw-multilingual-toxic-comment-classification`
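
Steps 2 and 4 can also be scripted. The following is a minimal sketch rather than part of the original workflow: `install_kaggle_token` is a hypothetical helper, and it additionally restricts the file to its owner, because the Kaggle client warns when the token is readable by other users.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(downloaded_json, home=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner.

    Hypothetical helper for illustration; `home` defaults to the user's
    home directory, which is where the Kaggle client looks by default.
    """
    home = Path(home) if home is not None else Path.home()
    kaggle_dir = home / ".kaggle"
    kaggle_dir.mkdir(exist_ok=True)            # same effect as `mkdir .kaggle`
    target = kaggle_dir / "kaggle.json"
    shutil.copy(downloaded_json, target)       # copy rather than move, keeps the original
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)  # owner read/write only (chmod 600)
    return target
```

Calling `install_kaggle_token('kaggle.json')` from the download folder would place the token where `kaggle competitions list` expects it.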

> I also recommend reading through this GitHub repository: `https://github.com/floydwch/kaggle-cli`

### Unzip the file
One major challenge I faced while trying to unzip the file on Amazon SageMaker was running out of disk space. This is how I resolved it.
1. **I went to Amazon SageMaker**, then to the notebook instance I was working on. I clicked `edit`, scrolled down to the `volume` setting, increased it from the default 5 GB to about 160 GB, and saved the update. That solved the issue for me.
2. **Then I proceeded** to the next cell below.

In [6]:
#unpack_archive('../data/jigsaw-multilingual-toxic-comment-classification.zip', '../data/')
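
Before unpacking, it can help to check whether the volume has enough room for the extracted files. This is a small sketch using only the standard library; `uncompressed_size` and `has_room_for` are hypothetical helper names, not functions from the helper file.

```python
import shutil
import zipfile


def uncompressed_size(zip_path):
    """Total size in bytes of all archive members once extracted."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(info.file_size for info in archive.infolist())


def has_room_for(zip_path, dest_dir):
    """True if the volume holding dest_dir has enough free space for the extracted files."""
    free_bytes = shutil.disk_usage(dest_dir).free
    return uncompressed_size(zip_path) <= free_bytes
```

If `has_room_for('../data/jigsaw-multilingual-toxic-comment-classification.zip', '../data/')` returns `False`, the volume needs to be enlarged before calling `unpack_archive`.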

## Step 2: Data Preparation
---

### Translate the data
Since the `validation` and `test` sets are in several languages, we need to translate them to English for consistency, using the Google Translate API.
The translation code itself lives in a helper function.  
> **Note** that translating can be a bit difficult. In my case, I used the Google API, which kept blocking my requests. I devised a workaround: when you send in the file for translation, the code checks the column titled `outcome`. If the value for a row is `success`, it leaves the row alone; otherwise (if it is `failure`, the default value), it attempts to translate it. If the API has blocked the request, the row is returned unchanged; otherwise it is returned with the content translated.
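
The retry idea in the note can be sketched in plain Python. This is an illustration under stated assumptions, not the helper's actual code: rows are dictionaries, `translate_fn` is a stand-in for the real translation call, and a blocked request is assumed to surface as an exception.

```python
def translate_pending(rows, translate_fn, text_key="comment_text"):
    """Translate rows whose outcome is not 'success'; return how many are still pending.

    `translate_fn` is a hypothetical stand-in for the translation service and
    is assumed to raise an exception when the request is blocked.
    """
    remaining = 0
    for row in rows:
        if row["outcome"] == "success":
            continue                       # translated in an earlier pass, skip it
        try:
            row[text_key] = translate_fn(row[text_key])
            row["outcome"] = "success"
        except Exception:
            remaining += 1                 # leave the row unchanged for the next pass
    return remaining
```

Calling `translate_pending` repeatedly on the same rows until it returns `0` mirrors the workflow used for the validation and test sets.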

### Validation Set

In [7]:
# Load the validation set
CSV_PATH = '../data/validation.csv'
df = pd.read_csv(CSV_PATH)#, index_col=0)
df.head()

Unnamed: 0,id,comment_text,lang,toxic
0,0,Este usuario ni siquiera llega al rango de ...,es,0
1,1,Il testo di questa voce pare esser scopiazzato...,it,0
2,2,Vale. Sólo expongo mi pasado. Todo tiempo pasa...,es,1
3,3,Bu maddenin alt başlığı olarak uluslararası i...,tr,0
4,4,Belçika nın şehirlerinin yanında ilçe ve belde...,tr,0


In [8]:
# Add a column named outcome and populate its rows with failure.
# Save it as validation_ammended.csv
df['outcome'] = 'failure'  # a scalar is broadcast to every row
df.to_csv('../data/validation_ammended.csv',index=False)

In [9]:
# Load the validation_ammended.csv to see its contents
CSV_PATH  = '../data/validation_ammended.csv'
df = pd.read_csv(CSV_PATH)
df.head()

Unnamed: 0,id,comment_text,lang,toxic,outcome
0,0,Este usuario ni siquiera llega al rango de ...,es,0,failure
1,1,Il testo di questa voce pare esser scopiazzato...,it,0,failure
2,2,Vale. Sólo expongo mi pasado. Todo tiempo pasa...,es,1,failure
3,3,Bu maddenin alt başlığı olarak uluslararası i...,tr,0,failure
4,4,Belçika nın şehirlerinin yanında ilçe ve belde...,tr,0,failure


The trick here is to make the first call with `CSV_PATH` as input. After the translation is done, check whether any rows still have `failure` in the **outcome** column; if so, make the next call on `OUT_PUT_PATH`. Keep making subsequent calls on `OUT_PUT_PATH` until no row has `failure` in the **outcome** column.  
> You do this check using `df.groupby('outcome').count()`, as can be seen below.

In [10]:
# Call the translateDf function from preprocess.py file
OUT_PUT_PATH = '../data/jigsaw-toxic-comment-validation-translated.csv'
# First call with CSV_PATH
#pr.translateDf(CSV_PATH, OUT_PUT_PATH, ['id','comment_text','lang','toxic','outcome']);

In [11]:
# Subsequent calls with OUT_PUT_PATH
#pr.translateDf(OUT_PUT_PATH, OUT_PUT_PATH, ['id','comment_text','lang','toxic','outcome']);

After each call above, run the two cells below to check whether any row still has `failure` in the **outcome** column.

In [12]:
# Load and print the output file
df = pd.read_csv(OUT_PUT_PATH)
df.tail()

Unnamed: 0,id,comment_text,lang,toxic,outcome
7995,7987,"Hello, you said very frequently encountered si...",tr,0,success
7996,7947,"1953 Midyat, Mardin, Turkey was born. As a chi...",tr,0,success
7997,7994,Hello Santiago! I have answered your user page...,es,0,success
7998,7976,Picture: Ozga-the-07.jpg Ozpirincc source and ...,tr,0,success
7999,7996,The imbesil ete moon dela not aware or so osti...,es,1,success


In [13]:
# Check whether the outcome column still has any failure
df.groupby('outcome').count()

Unnamed: 0_level_0,id,comment_text,lang,toxic
outcome,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
success,8000,8000,8000,8000


When you are done, run the two cells below to **remove** the added `outcome` column, **sort** the data in ascending order by `id`, and **save** the final result as `validation_translated_ordered.csv`.

In [14]:
# Drop the outcome column
df.drop('outcome', axis=1, inplace=True)
# Sort by id
df.sort_values('id', inplace=True)
df.head()

Unnamed: 0,id,comment_text,lang,toxic
143,0,This user does not even rank heretic. Therefor...,es,0
158,1,The text of this item seems to be plagiarized ...,it,0
61,2,OK. I'm just stating my past. All past time wa...,es,1
133,3,I var.ön My hesitation about continuing issues...,tr,0
67,4,Belgium's towns and villages while the city ne...,tr,0


In [15]:
# Save the final validation translated file
df.to_csv('../data/validation_translated_ordered.csv',index=False)

### Test Set

In [16]:
# Load the test set
CSV_PATH = '../data/test.csv'
df = pd.read_csv(CSV_PATH)
df.head()

Unnamed: 0,id,content,lang
0,0,Doctor Who adlı viki başlığına 12. doctor olar...,tr
1,1,"Вполне возможно, но я пока не вижу необходимо...",ru
2,2,"Quindi tu sei uno di quelli conservativi , ...",it
3,3,Malesef gerçekleştirilmedi ancak şöyle bir şey...,tr
4,4,:Resim:Seldabagcan.jpg resminde kaynak sorunu ...,tr


In [17]:
# Add a column named outcome and populate its rows with failure.
# Save it as test_ammended.csv
df['outcome'] = 'failure'  # a scalar is broadcast to every row
df.to_csv('../data/test_ammended.csv',index=False)

In [18]:
# Load the test_ammended.csv to see its contents
CSV_PATH  = '../data/test_ammended.csv'
df = pd.read_csv(CSV_PATH)
df.head()

Unnamed: 0,id,content,lang,outcome
0,0,Doctor Who adlı viki başlığına 12. doctor olar...,tr,failure
1,1,"Вполне возможно, но я пока не вижу необходимо...",ru,failure
2,2,"Quindi tu sei uno di quelli conservativi , ...",it,failure
3,3,Malesef gerçekleştirilmedi ancak şöyle bir şey...,tr,failure
4,4,:Resim:Seldabagcan.jpg resminde kaynak sorunu ...,tr,failure


The trick here is to make the first call with `CSV_PATH` as input. After the translation is done, check whether any rows still have `failure` in the **outcome** column; if so, make the next call on `OUT_PUT_PATH`. Keep making subsequent calls on `OUT_PUT_PATH` until no row has `failure` in the **outcome** column.  
> You do this check using `df.groupby('outcome').count()`, as can be seen below.

In [19]:
# Call the translateDf function from preprocess.py file
OUT_PUT_PATH = '../data/jigsaw-toxic-comment-test-translated.csv'
# First call with CSV_PATH (the test set has a `content` column and no `toxic` column)
#pr.translateDf(CSV_PATH, OUT_PUT_PATH, ['id','content','lang','outcome']);

In [None]:
# Subsequent calls with OUT_PUT_PATH (same test-set columns as the first call)
#pr.translateDf(OUT_PUT_PATH, OUT_PUT_PATH, ['id','content','lang','outcome']);

After each call above, run the two cells below to check whether any row still has `failure` in the **outcome** column.

In [20]:
# Load and print the output file
df = pd.read_csv(OUT_PUT_PATH)
df.tail()

Unnamed: 0,id,content,lang,outcome
63807,63654,This is how absurd and authenticity of any suc...,tr,success
63808,63719,"a a, let lush ... Turkish government due to an...",tr,success
63809,63791,It does not matter. I've added posty Full info...,tr,success
63810,63788,Bush takes psychotropic Truth or dirty electio...,ru,success
63811,63806,"Yes, you're right, I've put the label, but of ...",tr,success


In [21]:
# Check whether the outcome column still has any failure
df.groupby('outcome').count()

Unnamed: 0_level_0,id,content,lang
outcome,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
success,63812,63812,63812


When you are done, run the two cells below to **remove** the added `outcome` column, **sort** the data in ascending order by `id`, and **save** the final result as `test_translated_ordered.csv`.

In [22]:
# Drop the outcome column
df.drop('outcome', axis=1, inplace=True)
# Sort by id
df.sort_values('id', inplace=True)
df.head()

Unnamed: 0,id,content,lang
4763,0,Doctor Who has a wiki-wiki title in the 12th d...,tr
3801,1,"Quite possibly, but I do not see the need to a...",ru
2460,2,"So you're one of those conservative, preferrin...",it
5283,3,"Unfortunately, however, he had not done someth...",tr
2458,4,Picture: Seldabagcan.jpg the official source o...,tr


In [23]:
# Save the final test translated file
df.to_csv('../data/test_translated_ordered.csv',index=False)

### Extract necessary columns and shuffle

The code to extract the necessary columns and shuffle the data is included in the helper file. We will just make a call to it.

In [24]:
data_train = '../data/jigsaw-toxic-comment-train.csv'
data_valid = '../data/validation_translated_ordered.csv'
data_test  = '../data/test_translated_ordered.csv'

train_X,train_y, valid_X,valid_y,valid_lan, test_X,test_lan,test_id = hp.prepare_imdb_data(data_train, \
                                                                                           data_valid, data_test)

print("Toxic comments (combined): train = {}, validation = {}, test = {}".format(len(train_X), len(valid_X),\
                                                                                 len(test_X)))

Toxic comments (combined): train = 223549, validation = 8000, test = 63812


Now that we have our `training`, `validation` and `testing` sets prepared, we should do a quick check and look at an example of the data our model will be trained on. This is generally a good idea: it lets you see how each further processing step affects the comments, and it confirms that the data has been loaded correctly.

In [25]:
print(train_X[100],'\n-----------------------------\n')
print(valid_X[100],'\n-----------------------------\n')
print(test_X[100])

However, the Moonlite edit noted by golden daph was me (on optus ...)  Wake up wikkis.  So funny 
-----------------------------

-----------------------------

Ebraim beginner Dear editor, please do not delete information, do not enter information that you know is wrong or create articles with texts meaningless, what can be considered vandalism. There are problems in Article Ebraim, edited by you. If you want to experience the Wikipedia software can do it in the sandbox at will. Nevertheless, whether Leandro Martinez Speaks Tchê!


### Remove HTML tags and Tokenize

Now we want to make sure that any HTML tags that appear are removed. In addition, we wish to tokenize and stem our input, so that words such as *entertained* and *entertaining* are treated as the same word by the classifier.
This is a breakdown of what the function below does:
1. Converts all words to lowercase
2. Removes stopwords
3. Removes punctuation
4. Splits the string into a list of words

+ The `preprocess` method applies an earlier method, `review_to_words`, to each of the user comments in the training, validation and testing datasets.
+ In addition, the results are cached, because this processing step can take a long time. If you are unable to complete the notebook in the current session, you can come back without needing to process the data a second time.

**We now get the list of all the words in training and validation sets**

In [26]:
data_train = '../data/jigsaw-toxic-comment-train.csv'
data_valid = '../data/validation_translated_ordered.csv'
data_test  = '../data/test_translated_ordered.csv'
OUT_PUT_PATH_train = '../data/words_train.csv'
OUT_PUT_PATH_valid = '../data/words_validation.csv'
OUT_PUT_PATH_test = '../data/words_test.csv'
column_train = ['id','comment_text','toxic']
column_valid = ['id','comment_text','lang','toxic']
column_test = ['id','content','lang']

In [27]:
#hp.word_list_Df(OUT_PUT_PATH_train+'_1', OUT_PUT_PATH_train+'_1', column_train)

In [None]:
hp.word_list_Df(data_train, OUT_PUT_PATH_train, column_train)

In [28]:
df = pd.read_csv(OUT_PUT_PATH_train)
df.head()

Unnamed: 0,id,comment_text,toxic
0,000113f07ec002fd,"['EN', 'hey', 'man', 'i', 'm', 'realli', 'not'...",0
1,001cadfd324f8087,"['EN', 'as', 'for', 'your', 'claim', 'of', 'st...",0
2,0001d958c54c6e35,"['EN', 'you', 'sir', 'are', 'my', 'hero', 'ani...",0
3,001d874a4d3e8813,"['EN', 'jmabel', 'in', 'regard', 'to', 'predom...",0
4,001dc38a83d420cf,"['EN', 'get', 'fuck', 'up', 'get', 'fuckee', '...",1


In [30]:
df.groupby('toxic').count()

Unnamed: 0_level_0,id,comment_text
toxic,Unnamed: 1_level_1,Unnamed: 2_level_1
0,202165,202164
1,21384,21384


#### Build word_dictionary

> - A careful examination of the dataset shows that each word list is stored as a string. We first parse the lists out of the strings and build `train_X`, `train_y`, `valid_X`, `valid_y`, `test_X` and `test_id`.
> - Next we build the dictionary of words.
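
As a small illustration of the first point, `ast.literal_eval` turns the string form of a list back into a real list. The sketch below also shows why a fallback is needed: the `groupby('toxic')` output above has one `comment_text` fewer than `id`, so at least one cell is read as missing. `parse_token_list` is a hypothetical helper name.

```python
import ast


def parse_token_list(cell):
    """Turn a CSV cell such as "['EN', 'hey']" back into a list, or None if it is not one."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return None              # e.g. an empty cell that pandas read as NaN
    return value if isinstance(value, list) else None


print(parse_token_list("['EN', 'hey', 'man']"))   # ['EN', 'hey', 'man']
print(parse_token_list(float("nan")))             # None
```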

In [31]:
train = pd.read_csv(OUT_PUT_PATH_train)
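train_X = [0]*train.shape[0]
j = 0  # number of rows that could not be parsed
for i,elem in enumerate(train['comment_text']):
    try:
        # Each cell holds the string form of a list, e.g. "['EN', 'hey']"
        train_X[i] = ast.literal_eval(elem)
    except Exception:
        # Unparseable cell (e.g. a missing comment): keep a placeholder
        j += 1
        train_X[i] = 'None'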
train_y = list(train['toxic'])

In [32]:
valid = pd.read_csv(OUT_PUT_PATH_valid)
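valid_X = [0]*valid.shape[0]
k = 0  # number of rows that could not be parsed
for i,elem in enumerate(valid['comment_text']):
    try:
        # Each cell holds the string form of a list
        valid_X[i] = ast.literal_eval(elem)
    except Exception:
        # Unparseable cell (e.g. a missing comment): keep a placeholder
        k += 1
        valid_X[i] = 'None'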
valid_y = list(valid['toxic'])

In [33]:
test = pd.read_csv(OUT_PUT_PATH_test)
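test_X = [0]*test.shape[0]
m = 0  # number of rows that could not be parsed
for i,elem in enumerate(test['content']):
    try:
        # Each cell holds the string form of a list
        test_X[i] = ast.literal_eval(elem)
    except Exception:
        # Unparseable cell (e.g. a missing comment): keep a placeholder
        m += 1
        test_X[i] = 'None'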
test_id = list(test['id'])

We build the dictionary of words from the training set.

In [34]:
# Dictionary of words
word_dict = hp.build_dict(train_X)

In [35]:
# Default vocab_size = 5000
len(word_dict.values())

4998
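
A length of 4998 with a vocabulary size of 5000 is what you would expect if two indices are reserved, for example `0` for padding and `1` for words outside the vocabulary. The sketch below shows one way such a dictionary could be built; it is an assumption about how `hp.build_dict` behaves, not its actual code, and `build_dict_sketch` is a hypothetical name.

```python
from collections import Counter


def build_dict_sketch(data, vocab_size=5000):
    """Map the most frequent words to integers, starting at 2.

    Index 0 is kept for padding and 1 for out-of-vocabulary words, so at
    most vocab_size - 2 words are stored.
    """
    counts = Counter(word for sentence in data for word in sentence)
    most_common = [word for word, _ in counts.most_common(vocab_size - 2)]
    return {word: idx + 2 for idx, word in enumerate(most_common)}


sample = [["a", "b", "a"], ["a", "c", "b"]]
print(build_dict_sketch(sample, vocab_size=4))   # {'a': 2, 'b': 3}
```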

In [36]:
# Words with the largest indices. If build_dict numbers words from most to
# least frequent, these are the rarest words kept in the vocabulary.
print(sorted(word_dict, key=word_dict.get, reverse=True)[:5])

['promo', 'cabl', 'lip', 'spade', 'vegetarian']


### Save `word_dict`

Later on, when we construct an endpoint that processes a submitted comment, we will need the `word_dict` we have created. As such, we will save it to a file now for future use.

In [37]:
data_dir = '../data/pytorch' # The folder we will use for storing data
os.makedirs(data_dir, exist_ok=True) # Create the folder if it does not exist yet

In [38]:
with open(os.path.join(data_dir, 'word_dict.pkl'), "wb") as f:
    pickle.dump(word_dict, f)

### Transform the reviews

Now that we have our word dictionary which allows us to transform the words appearing in the reviews into integers, it is time to make use of it and convert our reviews to their integer sequence representation, making sure to pad or truncate to a fixed length, which in our case is `500`.

In [39]:
train_X, train_X_len = hp.convert_and_pad_data(word_dict, train_X)
valid_X, valid_X_len = hp.convert_and_pad_data(word_dict, valid_X)
test_X, test_X_len = hp.convert_and_pad_data(word_dict, test_X)
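
For a single comment, the conversion might look like the sketch below. It assumes that `0` marks padding and `1` marks words missing from `word_dict`; the helper's real implementation is not shown in this notebook, and `convert_and_pad_sketch` is a hypothetical name.

```python
def convert_and_pad_sketch(word_dict, sentence, pad=500):
    """Encode one tokenized comment as `pad` integers plus its true length."""
    NOWORD, INFREQ = 0, 1                        # padding and out-of-vocabulary markers
    encoded = [word_dict.get(word, INFREQ) for word in sentence[:pad]]  # truncate if too long
    length = len(encoded)
    return encoded + [NOWORD] * (pad - length), length


vocab = {"hello": 2, "world": 3}
print(convert_and_pad_sketch(vocab, ["hello", "there", "world"], pad=5))
# ([2, 1, 3, 0, 0], 3)
```

Returning the length alongside the padded sequence lets later steps tell real tokens apart from padding.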

As a quick check to make sure that things are working as intended, look at one of the comments in the training set after it has been processed. Does it look reasonable? What is the length of a comment in the training set?

In [40]:
# Use this cell to examine one of the processed reviews to make sure everything is working as intended.
print("Shape of train data is {}, while shape of train label is {} \n\nOne training data is {} == {}" \
      .format(np.shape(train_X),np.shape(train_X_len), train_X_len[230], np.count_nonzero(train_X[230])))

Shape of train data is (223549, 500), while shape of train label is (223549,) 

One training data is 12 == 12


## Step 3: Upload the data to S3

---

We will need to upload the training dataset to S3 in order for our training code to access it. For now we will save it locally and we will upload to S3 later on.

### Save the processed training dataset locally

It is important to note the format of the data that we are saving as we will need to know it when we write the training code. In our case, each row of the dataset has the form `label`, `length`, `review[500]` where `review[500]` is a sequence of `500` integers representing the words in the review.

In [41]:
pd.concat([pd.DataFrame(train_y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)

In [42]:
pd.concat([pd.DataFrame(valid_y), pd.DataFrame(valid_X_len), pd.DataFrame(valid_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'valid.csv'), header=False, index=False)

In [43]:
pd.concat([pd.DataFrame(test_id), pd.DataFrame(test_X_len), pd.DataFrame(test_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)
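
The row layout can be illustrated with a short round trip through CSV text. This sketch uses a 5-integer sequence instead of 500 to keep the output readable; `make_row` and `split_row` are hypothetical helper names.

```python
import csv
import io


def make_row(label, length, sequence):
    """Lay out one record as [label, length, w_1, ..., w_n]."""
    return [label, length] + list(sequence)


def split_row(row):
    """Inverse of make_row: return (label, length, sequence) as integers."""
    values = [int(v) for v in row]
    return values[0], values[1], values[2:]


# Round trip through CSV text with no header and no index column
buffer = io.StringIO()
csv.writer(buffer).writerow(make_row(1, 3, [7, 4, 9, 0, 0]))
parsed = next(csv.reader(io.StringIO(buffer.getvalue())))
print(split_row(parsed))   # (1, 3, [7, 4, 9, 0, 0])
```

The training code has to split each row the same way: the first column is the label, the second is the length, and the rest is the encoded comment. In `test.csv` the first column holds the `id` instead of a label.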

### Uploading the training data


Next, we need to upload the training data to the SageMaker default S3 bucket so that we can provide access to it while training our model.

In [44]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'sagemaker/sentiment_rnn'

role = sagemaker.get_execution_role()

In [45]:
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)

**NOTE:** The cell above uploads the entire contents of our data directory. This includes the `word_dict.pkl` file. This is fortunate as we will need this later on when we create an endpoint that accepts an arbitrary review. For now, we will just take note of the fact that it resides in the data directory (and so also in the S3 training bucket) and that we will need to make sure it gets saved in the model directory.

## Step 4: Build and Train the PyTorch Model

In the XGBoost notebook we discussed what a model is in the SageMaker framework. In particular, a model comprises three objects

 - Model Artifacts,
 - Training Code, and
 - Inference Code,
 
each of which interacts with the others. In the XGBoost example we used training and inference code that was provided by Amazon. Here we will still be using containers provided by Amazon with the added benefit of being able to include our own custom code.

I have implemented the neural network in PyTorch along with a training script. The model object is in the `model.py` file, inside of the `train` folder. You can see the implementation by running the cell below.

In [46]:
!pygmentize train/model.py

import torch.nn as nn

class LSTMClassifier(nn.Module):
    """
    This is the simple RNN model we will be using to perform Sentiment Analysis.
    """

    def __init__(self, embedding_dim, hidden_dim, vocab_size):
        """
        Initialize the model by setting up the various layers.
        """
        super(LSTMClassifier, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.dense = nn.Linear(in_features=hidden_dim, out_features=1)
        self.sig = nn.Sigm

The important takeaway from the implementation provided is that there are three parameters that we may wish to tweak to improve the performance of our model. These are the embedding dimension, the hidden dimension and the size of the vocabulary. We will likely want to make these parameters configurable in the training script so that if we wish to modify them we do not need to modify the script itself. We will see how to do this later on. To start we will write some of the training code in the notebook so that we can more easily diagnose any issues that arise.

First we will load a small portion of the training data set to use as a sample. It would be very time consuming to train the model completely in the notebook, as we do not have access to a GPU and the compute instance that we are using is not particularly powerful. However, we can work on a small part of the data to get a feel for how our training script is behaving.

In [47]:
import torch
import torch.utils.data

# Read in only the first 250 rows
train_sample = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None, names=None, nrows=250)

# Turn the input pandas dataframe into tensors
train_sample_y = torch.from_numpy(train_sample[[0]].values).float().squeeze()
train_sample_X = torch.from_numpy(train_sample.drop([0], axis=1).values).long()

# Build the dataset
train_sample_ds = torch.utils.data.TensorDataset(train_sample_X, train_sample_y)
# Build the dataloader
train_sample_dl = torch.utils.data.DataLoader(train_sample_ds, batch_size=50)

### Writing the training method

Next we need to write the training code itself. We will leave any difficult aspects such as model saving / loading and parameter loading until a little later.

In [48]:
def train(model, train_loader, epochs, optimizer, loss_fn, device):
    for epoch in range(1, epochs + 1):
        model.train()  # put the model in training mode
        total_loss = 0
        for batch in train_loader:
            batch_X, batch_y = batch

            # Move the batch to the same device as the model
            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)

            optimizer.zero_grad()          # clear gradients from the previous step
            out = model(batch_X)           # calling the module runs its forward pass
            loss = loss_fn(out, batch_y)
            loss.backward()                # backpropagate
            optimizer.step()               # update the weights

            total_loss += loss.item()
        print("Epoch: {}, BCELoss: {}".format(epoch, total_loss / len(train_loader)))

Supposing we have the training method above, we will test that it is working by writing a bit of code in the notebook that executes our training method on the small sample training set that we loaded earlier. The reason for doing this in the notebook is so that we have an opportunity to fix any errors that arise early when they are easier to diagnose.

In [49]:
import torch.optim as optim
from train.model import LSTMClassifier

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LSTMClassifier(32, 100, 5000).to(device)
optimizer = optim.Adam(model.parameters())
loss_fn = torch.nn.BCELoss()

train(model, train_sample_dl, 5, optimizer, loss_fn, device)

Epoch: 1, BCELoss: 0.6348970651626586
Epoch: 2, BCELoss: 0.5790800094604492
Epoch: 3, BCELoss: 0.5077767610549927
Epoch: 4, BCELoss: 0.3965527772903442
Epoch: 5, BCELoss: 0.3329249083995819
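
To put these numbers in context, binary cross-entropy for a model that always outputs `0.5` is ln 2 ≈ 0.693, so the first-epoch loss is only slightly better than an uninformed guess. The following is a plain-Python sketch of the quantity that `BCELoss` averages; `bce_loss` is a hypothetical name and the sketch skips the numerical safeguards a library implementation has.

```python
import math


def bce_loss(probs, targets):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(probs, targets):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(probs)


# A model that always outputs 0.5 scores ln(2), whatever the labels are
print(round(bce_loss([0.5, 0.5], [1, 0]), 4))   # 0.6931
```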


In order to construct a PyTorch model using SageMaker we must provide SageMaker with a training script. We may optionally include a directory which will be copied to the container and from which our training code will be run. When the training container is executed it will check the uploaded directory (if there is one) for a `requirements.txt` file and install any required Python libraries, after which the training script will be run.

### Training the model

When a PyTorch model is constructed in SageMaker, an entry point must be specified. This is the Python file which will be executed when the model is trained. Inside the `train` directory is a file called `train.py`, which contains the code to train the model.

The way that SageMaker passes hyperparameters to the training script is by way of arguments. These arguments can then be parsed and used in the training script. To see how this is done take a look at the provided `train/train.py` file.
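
As a rough idea of what that parsing can look like, here is a minimal `argparse` sketch. It is not the contents of `train/train.py`: the default values are assumptions, while `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING` are environment variables that the SageMaker training container sets for the model directory and the `training` channel.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters the way a SageMaker entry point receives them."""
    parser = argparse.ArgumentParser()
    # Hyperparameters passed through the estimator's `hyperparameters` dict
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--embedding_dim", type=int, default=32)
    parser.add_argument("--hidden_dim", type=int, default=100)
    parser.add_argument("--vocab_size", type=int, default=5000)
    # Locations that the training container exposes as environment variables
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--data-dir", default=os.environ.get("SM_CHANNEL_TRAINING", "data"))
    return parser.parse_args(argv)


args = parse_args(["--epochs", "10", "--hidden_dim", "200"])
print(args.epochs, args.hidden_dim, args.embedding_dim)   # 10 200 32
```

Because every hyperparameter has a default, the script can be changed from the estimator's `hyperparameters` dictionary without editing the file.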

In [50]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point="train.py",
                    source_dir="train",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge',
                    hyperparameters={
                        'epochs': 10,
                        'hidden_dim': 200,
                    })

In [51]:
estimator.fit({'training': input_data})

2020-04-21 07:05:16 Starting - Starting the training job...
2020-04-21 07:05:18 Starting - Launching requested ML instances...
2020-04-21 07:06:16 Starting - Preparing the instances for training.........
2020-04-21 07:07:34 Downloading - Downloading input data...
2020-04-21 07:08:13 Training - Downloading the training image...
2020-04-21 07:08:41 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-04-21 07:08:41,717 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-04-21 07:08:41,741 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-04-21 07:08:44,796 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-04-21 07:08:45,032 sagemaker-containers INFO     Module train does not provide a setup.py.

Model loaded with embedding_dim 32, hidden_dim 200, vocab_size 5000.
Epoch: 1, BCELoss: 0.23916259155666256
Epoch: 2, BCELoss: 0.14631323526598497
Epoch: 3, BCELoss: 0.12196389555044523
Epoch: 4, BCELoss: 0.10965935456261755
Epoch: 5, BCELoss: 0.10464758840251023
Epoch: 6, BCELoss: 0.10066388203875423
Epoch: 7, BCELoss: 0.09555708391920108
Epoch: 8, BCELoss: 0.09210389316320693
Epoch: 9, BCELoss: 0.08865382146405683
Epoch: 10, BCELoss: 0.0852362287296856
2020-04-21 07:34:34,434 sagemaker-containers INFO     Reporting training SUCCESS

2020-04-21 07:34:45 Uploading - Uploading generated training model
2020-04-21 07:34:45 Completed - Training job completed
Training seconds: 1631
Billable seconds: 1631


## Step 5: Testing the model

As mentioned at the top of this notebook, we will be testing this model by first deploying it and then sending the testing data to the deployed endpoint. We will do this so that we can make sure that the deployed model is working correctly.

## Step 6: Deploy the model for testing

Now that we have trained our model, we would like to test it to see how it performs. Currently our model takes input of the form `review_length, review[500]` where `review[500]` is a sequence of `500` integers which describe the words present in the review, encoded using `word_dict`. Fortunately for us, SageMaker provides built-in inference code for models with simple inputs such as this.

There is one thing that we need to provide, however, and that is a function which loads the saved model. This function must be called `model_fn()` and takes as its only parameter a path to the directory where the model artifacts are stored. This function must also be present in the python file which we specified as the entry point. In our case the model loading function has been provided and so no changes need to be made.

**NOTE**: When the built-in inference code is run it must import the `model_fn()` method from the `train.py` file. This is why the training code is wrapped in a main guard ( ie, `if __name__ == '__main__':` )

Since we don't need to change anything in the code that was uploaded during training, we can simply deploy the current model as-is.

**NOTE:** When deploying a model, you are asking SageMaker to launch a compute instance that will wait for data to be sent to it. As a result, this compute instance will continue to run until *you* shut it down. This is important to know, since the cost of a deployed endpoint depends on how long it has been running.

In other words **If you are no longer using a deployed endpoint, shut it down!**

In [52]:
# Deploy the trained model
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

---------------!

## Step 7: Use the model for testing

Once the model is deployed, we can send the validation data to it and collect the results. With all of the results in hand, we can determine how accurate our model is.

In [53]:
valid_X = pd.concat([pd.DataFrame(valid_X_len), pd.DataFrame(valid_X)], axis=1)

In [54]:
# We split the data into chunks and send each chunk separately, accumulating the results.

def predict(data, rows=512):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = np.array([])
    for array in split_array:
        predictions = np.append(predictions, predictor.predict(array))
    
    return predictions
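
To see how large the chunks end up, the split can be reproduced without NumPy. `chunk_sizes` below is a hypothetical helper that mirrors how `np.array_split` distributes rows: the first `n_rows % n_chunks` chunks get one extra row.

```python
def chunk_sizes(n_rows, rows=512):
    """Sizes of the chunks when n_rows are split into int(n_rows / rows + 1) parts."""
    n_chunks = int(n_rows / float(rows) + 1)
    base, extra = divmod(n_rows, n_chunks)
    # The first `extra` chunks get one additional row
    return [base + 1] * extra + [base] * (n_chunks - extra)


sizes = chunk_sizes(8000)
print(len(sizes), max(sizes))   # 16 500
```

For the 8000-row validation set this gives 16 requests of 500 rows each, so `rows` acts as an upper bound on the chunk size rather than an exact size.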

In [55]:
predictions = predict(valid_X.values)
predictions = [round(num) for num in predictions]

In [56]:
from sklearn.metrics import accuracy_score
accuracy_score(valid_y, predictions)

0.872375
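
For reference, the accuracy reported above is simply the fraction of predictions that match the labels. The sketch below reproduces that calculation in plain Python on made-up values (the probabilities and labels are illustrative, not notebook data). It uses an explicit 0.5 threshold, since Python's built-in `round` rounds exact halves to the nearest even integer.

```python
def to_labels(probabilities, threshold=0.5):
    """Map probabilities to 0/1 labels using an explicit threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Illustrative values only
probs = [0.91, 0.12, 0.50, 0.48]
labels = [1, 0, 1, 1]

print(to_labels(probs))                     # [1, 0, 1, 0]
print(accuracy(labels, to_labels(probs)))   # 0.75
print(round(0.5), round(1.5))               # 0 2
```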

## Making a first submission to the Kaggle competition

In [57]:
test_X = pd.concat([pd.DataFrame(test_X_len), pd.DataFrame(test_X)], axis=1)

In [58]:
test_predictions = predict(test_X.values)
test_predictions = [round(num) for num in test_predictions]

In [59]:
test_result = pd.concat([pd.DataFrame(test_id), pd.DataFrame(test_predictions)], axis=1)
test_result.columns = ['id', 'toxic']
test_result.sort_values('id', inplace=True)

In [60]:
test_result.to_csv(os.path.join(data_dir, 'test_y_hat.csv'), header=True, index=False)

### More testing

We now have a trained, deployed model to which we can send processed comments and which returns the predicted toxicity. Ultimately, however, we would like to send the model an unprocessed comment, that is, the comment itself as a string. For example, suppose we wish to send the following comment to our model.

In [61]:
test_review = 'You must be sick and mad. What an idiot.'

The question we now need to answer is, how do we send this comment to our model?

Recall that in the first section of this notebook we processed the comment dataset. In particular, we did two specific things to the provided comments.
 - Removed any HTML tags and stemmed the input
 - Encoded the comment as a sequence of integers using `word_dict`

In [62]:
# Convert test_review into a form usable by the model and save the results in test_data
test_data = hp.convert_and_pad(word_dict, hp.review_to_words(test_review), pad=500)
test_data = [[test_data[1]]+test_data[0]]
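
The helper functions themselves live in `helper.py` and are not shown here. As a rough sketch of what the encoding step might look like, assume that index 0 marks padding, index 1 marks words missing from `word_dict`, and that the function returns the padded sequence together with its true length (which matches how `test_data[0]` and `test_data[1]` are used above). The name `convert_and_pad_sketch` and the toy vocabulary are hypothetical.

```python
def convert_and_pad_sketch(word_dict, words, pad=500):
    """Hypothetical stand-in for hp.convert_and_pad; the real helper may differ."""
    NOWORD = 0   # assumed padding index
    INFREQ = 1   # assumed index for words missing from word_dict

    padded = [NOWORD] * pad
    for i, word in enumerate(words[:pad]):
        padded[i] = word_dict.get(word, INFREQ)
    return padded, min(len(words), pad)


toy_dict = {'sick': 2, 'mad': 3, 'idiot': 4}   # toy vocabulary
data, length = convert_and_pad_sketch(toy_dict, ['sick', 'mad', 'unknown'], pad=5)
print(data, length)          # [2, 3, 1, 0, 0] 3
print([[length] + data])     # [[3, 2, 3, 1, 0, 0]]
```

Placing the length in front of the padded sequence gives a single row in the same layout as the rows built earlier with `pd.concat`.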

Now that we have processed the comment, we can send the resulting array to our model to predict the toxicity of the comment.

In [63]:
predictor.predict(test_data)

array(0.94564706, dtype=float32)

Since the return value of our model is close to `1`, the model predicts with high confidence that the comment we submitted is toxic.

### Delete the endpoint

Of course, once we've deployed an endpoint it continues to run until we tell it to shut down. Since we are done using our endpoint for now, we can delete it.

In [64]:
estimator.delete_endpoint()

## Step 6 (again) - Deploy the model for the web app

Now that we know that our model is working, it's time to create some custom inference code so that we can send the model an unprocessed comment and have it determine whether the comment is toxic.

As we saw above, by default the estimator which we created, when deployed, will use the entry script and directory which we provided when creating the model. However, since we now wish to accept a string as input and our model expects a processed comment, we need to write some custom inference code.

We will store the code that we write in the `serve` directory. Provided in this directory is the `model.py` file that we used to construct our model, a `utils.py` file which contains the `review_to_words` and `convert_and_pad` pre-processing functions which we used during the initial data processing, and `predict.py`, the file which will contain our custom inference code. Note also that `requirements.txt` is present which will tell SageMaker what Python libraries are required by our custom inference code.

When deploying a PyTorch model in SageMaker, you are expected to provide four functions which the SageMaker inference container will use.
 - `model_fn`: This function is the same function that we used in the training script and it tells SageMaker how to load our model.
 - `input_fn`: This function receives the raw serialized input that has been sent to the model's endpoint and its job is to de-serialize and make the input available for the inference code.
 - `output_fn`: This function takes the output of the inference code and its job is to serialize this output and return it to the caller of the model's endpoint.
 - `predict_fn`: The heart of the inference script, this is where the actual prediction is done and is the function which you will need to complete.

For the simple website that we are constructing during this project, the `input_fn` and `output_fn` methods are relatively straightforward. We only require being able to accept a string as input and we expect to return a single value as output.
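
As a rough illustration of how these pieces fit together, the sketch below chains simplified versions of `input_fn`, `predict_fn` and `output_fn`. It is not the contents of `predict.py`: the `model` here is assumed to be a plain dictionary with a toy `word_dict` and a `score` callable standing in for the loaded PyTorch network, and whitespace splitting stands in for `review_to_words`.

```python
def input_fn(serialized_input_data, content_type):
    """Deserialize the request body; only plain text is accepted here."""
    if content_type != 'text/plain':
        raise ValueError('Unsupported content type: ' + content_type)
    if isinstance(serialized_input_data, bytes):
        return serialized_input_data.decode('utf-8')
    return serialized_input_data


def predict_fn(input_data, model):
    """Encode the raw comment and return a 0/1 label as a float."""
    words = input_data.lower().split()                        # stand-in tokenizer
    encoded = [model['word_dict'].get(w, 1) for w in words]   # 1 = unknown word (assumed)
    probability = model['score'](encoded)
    return 1.0 if probability >= 0.5 else 0.0


def output_fn(prediction_output, accept):
    """Serialize the prediction as plain text for the caller."""
    return str(prediction_output)


toy_model = {
    'word_dict': {'idiot': 2, 'thanks': 3},
    # Toy scorer: high score if the encoded "idiot" token (2) is present
    'score': lambda encoded: 0.9 if 2 in encoded else 0.1,
}

text = input_fn(b'What an idiot', 'text/plain')
print(output_fn(predict_fn(text, toy_model), 'text/plain'))   # 1.0
```

The three-stage split keeps serialization concerns out of the prediction logic, so the same `predict_fn` could be reused if the endpoint later accepted another content type.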

### Deploying the model

Now that the custom inference code has been written, we will create and deploy our model. To begin with, we need to construct a new PyTorchModel object which points to the model artifacts created during training and also points to the inference code that we wish to use. Then we can call the deploy method to launch the deployment container.

**NOTE**: The default behaviour for a deployed PyTorch model is to assume that any input passed to the predictor is a `numpy` array. In our case we want to send a string, so we need to construct a simple wrapper around the `RealTimePredictor` class to accommodate simple strings. In a more complicated situation you may want to provide a serialization object, for example if you wanted to send image data.

In [65]:
from sagemaker.predictor import RealTimePredictor
from sagemaker.pytorch import PyTorchModel

class StringPredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(StringPredictor, self).__init__(endpoint_name, sagemaker_session, content_type='text/plain')

model = PyTorchModel(model_data=estimator.model_data,
                     role = role,
                     framework_version='0.4.0',
                     entry_point='predict.py',
                     source_dir='serve',
                     predictor_cls=StringPredictor)
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

-------------!

### Testing the model

Now that we have deployed our model with the custom inference code, we should test to see if everything is working. Here we test our model by loading the first `250` comments, sending them to the endpoint, and collecting the results. We send only some of the data because the model takes quite a long time to process each input and perform inference, so testing the entire data set would be prohibitively slow.

In [66]:
def test_reviews(data_dir='../data/validation.csv', stop=250):
    # Send the first `stop` validation comments to the endpoint and
    # collect the true labels alongside the predictions.
    results = []
    ground = []

    data = pd.read_csv(data_dir)

    files_read = 0

    print('Starting..... ')

    for i in range(data.shape[0]):
        ground.append(data.iloc[i]['toxic'])

        # Prepend the language code to the comment text before sending
        review_input = str(data.iloc[i]['lang'] + " " + data.iloc[i]['comment_text']).encode('utf-8')

        results.append(float(predictor.predict(review_input)))

        files_read += 1
        if files_read == stop:
            break

    print('Done..... ')

    return ground, results

In [67]:
ground, results = test_reviews()

Starting..... 
Done..... 


In [68]:
from sklearn.metrics import accuracy_score
accuracy_score(ground, results)

0.876

As an additional test, we can send individual comments directly, including the `test_review` that we looked at earlier.

In [69]:
test_review = 'You are honest and very person'
predictor.predict(test_review)

b'0.0'

In [70]:
test_review = 'You must be sick and mad. What an idiot.'
predictor.predict(test_review)

b'1.0'

Now that we know our endpoint is working as expected, we can set up the web page that will interact with it. If you don't have time to finish the project now, make sure to skip down to the end of this notebook and shut down your endpoint. You can deploy it again when you come back.

## Step 7 (again): Use the model for the web app

> **TODO:** This entire section and the next contain tasks for you to complete, mostly using the AWS console.

So far we have been accessing our model endpoint by constructing a predictor object which uses the endpoint and then using that predictor object to perform inference. What if we wanted to create a web app which accessed our model? The current setup does not allow this directly, because to access a SageMaker endpoint the app would first have to authenticate with AWS using an IAM role that includes access to SageMaker endpoints. However, there is an easier way: we just need to use some additional AWS services.

<img src="Web App Diagram.svg">

The diagram above gives an overview of how the various services will work together. On the far right is the model which we trained above and which is deployed using SageMaker. On the far left is our web app that collects a user's comment, sends it off and expects a toxic or non-toxic prediction in return.

In the middle is where some of the magic happens. We will construct a Lambda function, which you can think of as a straightforward Python function that can be executed whenever a specified event occurs. We will give this function permission to send and receive data from a SageMaker endpoint.

Lastly, the method we will use to execute the Lambda function is a new endpoint that we will create using API Gateway. This endpoint will be a URL that listens for data to be sent to it. Once it gets some data it will pass that data on to the Lambda function and then return whatever the Lambda function returns. Essentially it will act as an interface that lets our web app communicate with the Lambda function.

### Setting up a Lambda function

The first thing we are going to do is set up a Lambda function. This Lambda function will be executed whenever our public API has data sent to it. When it is executed it will receive the data, perform any processing that is required, send the data (the comment) to the SageMaker endpoint we've created and then return the result.

#### Part A: Create an IAM Role for the Lambda function

Since we want the Lambda function to call a SageMaker endpoint, we need to make sure that it has permission to do so. To do this, we will construct a role that we can later give the Lambda function.

Using the AWS Console, navigate to the **IAM** page and click on **Roles**. Then, click on **Create role**. Make sure that the **AWS service** is the type of trusted entity selected and choose **Lambda** as the service that will use this role, then click **Next: Permissions**.

In the search box type `sagemaker` and select the check box next to the **AmazonSageMakerFullAccess** policy. Then, click on **Next: Review**.

Lastly, give this role a name. Make sure you use a name that you will remember later on, for example `LambdaSageMakerRole`. Then, click on **Create role**.

#### Part B: Create a Lambda function

Now it is time to actually create the Lambda function.

Using the AWS Console, navigate to the AWS Lambda page and click on **Create a function**. When you get to the next page, make sure that **Author from scratch** is selected. Now, name your Lambda function, using a name that you will remember later on, for example `sentiment_analysis_func`. Make sure that the **Python 3.6** runtime is selected and then choose the role that you created in the previous part. Then, click on **Create Function**.

On the next page you will see some information about the Lambda function you've just created. If you scroll down you should see an editor in which you can write the code that will be executed when your Lambda function is triggered. In our example, we will use the code below. 

```python
# We need to use the low-level library to interact with SageMaker since the SageMaker API
# is not available natively through Lambda.
import boto3

def lambda_handler(event, context):

    # The SageMaker runtime is what allows us to invoke the endpoint that we've created.
    runtime = boto3.Session().client('sagemaker-runtime')

    # Now we use the SageMaker runtime to invoke our endpoint, sending the comment we were given
    response = runtime.invoke_endpoint(EndpointName = '**ENDPOINT NAME HERE**',    # The name of the endpoint we created
                                       ContentType = 'text/plain',                 # The data format that is expected
                                       Body = event['body'])                       # The actual comment

    # The response is an HTTP response whose body contains the result of our inference
    result = response['Body'].read().decode('utf-8')

    return {
        'statusCode' : 200,
        'headers' : { 'Content-Type' : 'text/plain', 'Access-Control-Allow-Origin' : '*' },
        'body' : result
    }
```

Once you have copied and pasted the code above into the Lambda code editor, replace the `**ENDPOINT NAME HERE**` portion with the name of the endpoint that we deployed earlier. You can determine the name of the endpoint using the code cell below.

In [71]:
predictor.endpoint

'sagemaker-pytorch-2020-04-21-07-55-31-772'

Once you have added the endpoint name to the Lambda function, click on **Save**. Your Lambda function is now up and running. Next we need to create a way for our web app to execute the Lambda function.

### Setting up API Gateway

Now that our Lambda function is set up, it is time to create a new API using API Gateway that will trigger the Lambda function we have just created.

Using the AWS Console, navigate to **Amazon API Gateway** and then click on **Get started**.

On the next page, make sure that **New API** is selected and give the new API a name, for example, `sentiment_analysis_api`. Then, click on **Create API**.

We have now created an API, but it doesn't currently do anything. What we want it to do is to trigger the Lambda function that we created earlier.

Select the **Actions** dropdown menu and click **Create Method**. A new blank method will be created; select its dropdown menu, choose **POST**, and then click on the check mark beside it.

For the integration point, make sure that **Lambda Function** is selected and check the **Use Lambda Proxy integration** option. This option makes sure that the data that is sent to the API is then sent directly to the Lambda function with no processing. It also means that the return value must be a proper response object, as it will also not be processed by API Gateway.
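
To make the proxy contract concrete, the sketch below separates the handler logic from the SageMaker call so it can be exercised locally. The `handle` function and the `fake_invoke` stub are hypothetical names introduced for illustration; the event shown contains only the `body` field that the Lambda code above reads, whereas a real proxy event carries many more fields.

```python
def handle(event, invoke):
    """Build a Lambda proxy response from an API Gateway proxy event.

    `invoke` is any callable that takes the request body and returns the
    model's answer as a string (in the real function, the SageMaker call).
    """
    result = invoke(event['body'])
    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'text/plain',
                    'Access-Control-Allow-Origin': '*'},
        'body': result,
    }


def fake_invoke(body):
    """Stand-in for the endpoint: flags any comment containing 'idiot'."""
    return '1.0' if 'idiot' in body.lower() else '0.0'


response = handle({'body': 'What an idiot.'}, fake_invoke)
print(response['statusCode'], response['body'])   # 200 1.0
```

Because API Gateway passes the response through untouched, the `statusCode`, `headers` and string `body` keys all have to be supplied by the function itself.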

Type the name of the Lambda function you created earlier into the **Lambda Function** text entry box and then click on **Save**. Click on **OK** in the pop-up box that then appears, giving permission to API Gateway to invoke the Lambda function you created.

The last step in creating the API Gateway is to select the **Actions** dropdown and click on **Deploy API**. You will need to create a new Deployment stage and name it anything you like, for example `prod`.

You have now successfully set up a public API to access your SageMaker model. Make sure to copy or write down the URL provided to invoke your newly created public API as this will be needed in the next step. This URL can be found at the top of the page, highlighted in blue next to the text **Invoke URL**.
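
Before wiring up the web page, it can be handy to call the API from Python. The following is a minimal sketch using only the standard library; `API_URL` is a placeholder that must be replaced with your own Invoke URL, and `build_request` is a hypothetical helper name. The network call itself is left commented out so the snippet runs without a deployed API.

```python
import urllib.request

# Placeholder only: substitute the Invoke URL shown in the API Gateway console.
API_URL = 'https://example.com/prod'


def build_request(comment, url=API_URL):
    """Create a POST request carrying the raw comment as plain text."""
    return urllib.request.Request(
        url,
        data=comment.encode('utf-8'),
        headers={'Content-Type': 'text/plain'},
        method='POST',
    )


request = build_request('You must be sick and mad. What an idiot.')
print(request.get_method(), request.full_url)

# Uncomment once API_URL points at a deployed API:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode('utf-8'))
```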

## Step 8: Deploying our web app

Now that we have a publicly available API, we can start using it in a web app. For our purposes, we have provided a simple static HTML file which can make use of the public API you created earlier.

In the `website` folder there should be a file called `index.html`. Download the file to your computer and open it in a text editor of your choice. There should be a line which contains **\*\*REPLACE WITH PUBLIC API URL\*\***. Replace this string with the URL that you wrote down in the last step and then save the file.

Now, if you open `index.html` on your local computer, your browser will load the page directly from disk and you can use the provided site to interact with your SageMaker model.


In [None]:
predictor.delete_endpoint()