# Text Classification Using Keras & TensorFlow on Amazon SageMaker

The full lab guide can be found here: https://github.com/dbinoy/amazon-sagemaker-keras-text-classification

# Part 1: Dataset Exploration

In [None]:
import pandas as pd
import tensorflow as tf
import re
import numpy as np
import os

from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences
from tensorflow.python.keras.utils import to_categorical

- Switch into the ‘data’ directory
- Download and unzip the dataset from the UCI repository
- Download and unzip the pre-trained GloVe embedding files
- Since we'll be using 100-dimensional GloVe embeddings, remove the unnecessary files

In [None]:
%cd data
!rm -rf *
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip && unzip NewsAggregatorDataset.zip
!wget http://nlp.stanford.edu/data/glove.6B.zip && unzip glove.6B.zip
!rm 2pageSessions.csv glove.6B.200d.txt glove.6B.50d.txt glove.6B.300d.txt glove.6B.zip readme.txt NewsAggregatorDataset.zip && rm -rf __MACOSX/    

At this point, you should see only two files in the data directory: ‘glove.6B.100d.txt’ (word embeddings) and ‘newsCorpora.csv’ (dataset).
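
To make the role of `glove.6B.100d.txt` concrete: each line of a GloVe text file holds a token followed by its vector components, separated by single spaces. The sketch below parses such lines into a dictionary. `parse_glove_line` and `load_embeddings` are hypothetical helper names, not part of the lab code, and the two sample lines are made-up three-dimensional vectors rather than real GloVe values.

```python
import io


def parse_glove_line(line):
    """Split one GloVe text line into (word, vector as a list of floats)."""
    parts = line.rstrip().split(" ")
    return parts[0], [float(value) for value in parts[1:]]


def load_embeddings(lines):
    """Build a {word: vector} dictionary from an iterable of GloVe lines."""
    embeddings = {}
    for line in lines:
        if line.strip():  # skip blank lines
            word, vector = parse_glove_line(line)
            embeddings[word] = vector
    return embeddings


# Made-up 3-dimensional sample; the real file has 100 values per line.
sample = io.StringIO(u"the 0.1 -0.2 0.3\nnews 0.5 0.0 -1.5\n")
embeddings_index = load_embeddings(sample)
print(len(embeddings_index), embeddings_index["news"])
```

With the real file, you would pass an open file handle for `glove.6B.100d.txt` instead of the in-memory sample.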


In [None]:
column_names = ["TITLE", "URL", "PUBLISHER", "CATEGORY", "STORY", "HOSTNAME", "TIMESTAMP"]
news_dataset = pd.read_csv(os.path.join('.', 'newsCorpora.csv'), names=column_names, header=None, delimiter='\t')
news_dataset.head()

We'll be using only the `TITLE` and `CATEGORY` columns of the dataframe. Run the following snippet to draw a small random sample of the dataset and take a quick peek at the records.

In [None]:
# Map the single-letter category codes to readable names.
category_names = {
    "b": "Business",
    "t": "Science & Technology",
    "e": "Entertainment",
    "m": "Health & Medicine",
}

# Draw a small random sample (0.005% of the rows) and print each title with its category.
news_dataset_sampled = news_dataset.sample(frac=0.00005)
for n, row in enumerate(news_dataset_sampled.itertuples(index=False), start=1):
    category = category_names.get(row.CATEGORY, "unknown")
    print("{}. {} - {}".format(n, row.TITLE, category))

# Part 2: Building the SageMaker TensorFlow Container

Since we are going to use a custom-built container for this workshop, we will need to create it. The Amazon SageMaker notebook instance already comes loaded with Docker. The SageMaker team has also created the [`sagemaker-tensorflow-container`](https://github.com/aws/sagemaker-tensorflow-container) project, which makes it easy to build custom TensorFlow containers that are optimized to run on Amazon SageMaker. Similar containers are available for other widely used ML/DL frameworks.

We will first create a `base` TensorFlow container and then add our custom code to create a `final` container. We will use this `final` container for local testing. Once satisfied with local testing, we will push it to Amazon Elastic Container Registry (ECR), from which Amazon SageMaker can pull it for training and deployment.

Start by creating the base TensorFlow container. Switch to the home directory and clone the `sagemaker-tensorflow-container` repo:


In [None]:
%cd ~
!git clone https://github.com/aws/sagemaker-tensorflow-container.git

We will be using TensorFlow 1.8.0, so let's switch to the appropriate directory.
There are two Dockerfiles: one for CPU-based nodes and another for GPU-based nodes. Since we will be using CPU machines, let's build the CPU Docker image.

In [None]:
%cd sagemaker-tensorflow-container/docker/1.8.0/base
!docker build -t tensorflow-base:1.8.0-cpu-py2 -f Dockerfile.cpu .

Building the Docker image should take about 5 minutes. Once finished, list the images. You should see the new base image named `tensorflow-base:1.8.0-cpu-py2`.

In [None]:
!docker images

Next we create our `final` image by adding our code to the `base` image.
Switch to the container directory.

In [None]:
%cd ~/SageMaker/amazon-sagemaker-keras-text-classification/container/

Create a new Dockerfile with the content below.

We start from the `base` image, add the code directory to our path, copy the code into that directory and finally set the WORKDIR to the same path so any subsequent RUN/ENTRYPOINT commands run by Amazon SageMaker will use this directory.

In [None]:
%%writefile Dockerfile
# Build an image that can do training and inference in SageMaker

FROM tensorflow-base:1.8.0-cpu-py2

ENV PATH="/opt/program:${PATH}"

# Set up the program in the image
COPY sagemaker_keras_text_classification /opt/program
WORKDIR /opt/program

Build the `final` image

In [None]:
!docker build -t sagemaker-keras-text-class:latest .

In [None]:
!docker images

# Part 3: Local Testing of Training & Inference Code

Once we are finished developing the training portion (in `container/train`), we can start testing locally so we can debug our code quickly. 

Local test scripts are found in the `container/local_test` subfolder. Here we can run `train_local.sh`, which will, in turn, run a Docker container within which our training code will execute.

The local testing framework expects the training data to be in the `/container/local_test/test_dir/input/data/training` folder so let’s copy over the contents of our `data` folder there.

In [None]:
%cd ~/SageMaker/amazon-sagemaker-keras-text-classification/data
!cp -a . ../container/local_test/test_dir/input/data/training/

Then we run the training in local mode by switching into the ‘local_test’ directory and running the `train_local.sh` script.

In [None]:
%cd ~/SageMaker/amazon-sagemaker-keras-text-classification/container/local_test
!./train_local.sh sagemaker-keras-text-class:latest

We now have a saved model called `news_breaker.h5` and the `tokenizer.pickle` file within `amazon-sagemaker-keras-text-classification/container/local_test/test_dir/model`, under the local directory that we mapped to the `/opt/ml` directory within the container.
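
The directory mapping can be sketched as follows. SageMaker training containers read input channels from `/opt/ml/input/data/<channel>` and write model artifacts to `/opt/ml/model`. The helper `container_path` below is a hypothetical illustration of how a path under the local `test_dir` would appear inside the container, assuming `test_dir` is the directory mounted at `/opt/ml`.

```python
import posixpath


def container_path(local_path, local_root="test_dir", container_root="/opt/ml"):
    """Translate a path under the local test directory to its in-container path.

    Assumes `local_root` is mounted at `container_root`.
    """
    relative = posixpath.relpath(local_path, local_root)
    if relative.startswith(".."):
        raise ValueError("{} is not inside {}".format(local_path, local_root))
    return posixpath.join(container_root, relative)


print(container_path("test_dir/input/data/training/newsCorpora.csv"))
print(container_path("test_dir/model/news_breaker.h5"))
```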

## Testing Inference Code

To avoid wasting time debugging after deployment, it is also advisable to test and debug the inference Flask app locally before deploying it as a SageMaker endpoint.

Open a new terminal and run the following commands:
```
cd ~/SageMaker/amazon-sagemaker-keras-text-classification/container/local_test/
./serve_local.sh sagemaker-keras-text-class:latest
```

In [None]:
%cd ~/SageMaker/amazon-sagemaker-keras-text-classification/container/local_test/
!./predict.sh input.json application/json

Great! Our model inference implementation responds and is correctly able to categorize this test headline.
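
For reference, the request and response shapes used in this lab can be sketched in a few lines. The sketch assumes, based on the prediction cells later in this notebook, that the request body is a JSON object with an `input` key and the response is a JSON object with a `result` key. `build_request` and `parse_response` are hypothetical helper names, and the sample response is made up for illustration rather than captured from the model.

```python
import json


def build_request(headline):
    """Serialize a headline into the JSON body sent to the inference app."""
    return json.dumps({"input": headline})


def parse_response(body):
    """Extract the predicted category from a JSON response body (bytes or str)."""
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    return json.loads(body)["result"]


payload = build_request("Deadpool 2 Has More Swearing, Slicing and Dicing from Ryan Reynolds")
print(payload)

# Made-up response body, only to show the expected shape.
print(parse_response(b'{"result": "Entertainment"}'))
```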

# Part 4: Training and Hosting your Algorithm in Amazon SageMaker


We should modify our training code to take advantage of the more powerful hardware. Let’s update the number of epochs in the `train` script from `2` to `20` to see how that impacts the validation accuracy of our model while training on Amazon SageMaker. This file is located in the `container/sagemaker_keras_text_classification` directory. Navigate there and edit the file named `train` (line 167):

```python
history = model.fit(x_train, y_train,
                    epochs=20,
                    batch_size=32,
                    validation_data=(x_test, y_test))
```

Remember to save the file and close the tab before proceeding further.

## Building and registering the container

The following shell code shows how to build the container image using `docker build` and push the container image to ECR using `docker push`. 

This code looks for an ECR repository in the account you're using and the current default region (if you're using a SageMaker notebook instance, this will be the region where the notebook instance was created). If the repository doesn't exist, the script will create it.

In [None]:
%%sh

# The name of our algorithm
algorithm_name=sagemaker-keras-text-classification

cd ~/SageMaker/amazon-sagemaker-keras-text-classification/container

chmod +x sagemaker_keras_text_classification/train
chmod +x sagemaker_keras_text_classification/serve

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-east-1 if none defined)
region=$(aws configure get region)
region=${region:-us-east-1}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
$(aws ecr get-login --region ${region} --no-include-email)

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

# On a SageMaker Notebook Instance, the docker daemon may need to be restarted in order
# to detect your network configuration correctly.  (This is a known issue.)
if [ -d "/home/ec2-user/SageMaker" ]; then
  sudo service docker restart
fi

docker build  -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}
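
The describe-then-create step in the script above can also be expressed in Python. This is a minimal sketch, not the lab's method: it assumes a boto3 ECR client is passed in (for example `boto3.client('ecr')`), and `ensure_repository` is a hypothetical helper name.

```python
def ensure_repository(ecr_client, name):
    """Create the ECR repository `name` if it does not exist yet.

    `ecr_client` is expected to behave like boto3.client('ecr').
    Returns True if the repository was created, False if it already existed.
    """
    try:
        ecr_client.describe_repositories(repositoryNames=[name])
        return False
    except ecr_client.exceptions.RepositoryNotFoundException:
        ecr_client.create_repository(repositoryName=name)
        return True


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# ensure_repository(boto3.client("ecr"), "sagemaker-keras-text-classification")
```

Passing the client in as a parameter keeps the helper easy to exercise with a stand-in object before pointing it at a real account.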

Once you have your container packaged, you can use it to train and serve models. Let's do that with the algorithm we made above.

## Set up the environment

Here we specify the S3 prefix to use and the IAM role that will be used for working with SageMaker.

In [None]:
import boto3
import re

import os
import numpy as np
import pandas as pd
from sagemaker import get_execution_role

# S3 prefix under which the training data will be uploaded
prefix = 'sagemaker-keras-text-classification'

# IAM role that SageMaker will use for training and hosting
role = get_execution_role()

## Create the session

The session remembers our connection parameters to SageMaker. We'll use it to perform all of our SageMaker operations.

In [None]:
import sagemaker as sage
from time import gmtime, strftime

sess = sage.Session()

## Upload the data for training

When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to create your data in S3.  

We can use the tools provided by the SageMaker Python SDK to upload the data to a default bucket.

In [None]:
WORK_DIRECTORY = '/home/ec2-user/SageMaker/amazon-sagemaker-keras-text-classification/data'

data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=prefix)
print(data_location)
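
The printed `data_location` is an S3 URI. As a rough sketch of what to expect, assuming the SDK's default bucket naming of `sagemaker-<region>-<account>` and that a directory upload returns the bucket plus the key prefix, the hypothetical helper below composes the same kind of string from placeholder values.

```python
def expected_data_location(region, account, key_prefix):
    """Compose the S3 URI expected from a directory upload to the default bucket.

    Assumes the default bucket is named sagemaker-<region>-<account>.
    """
    bucket = "sagemaker-{}-{}".format(region, account)
    return "s3://{}/{}".format(bucket, key_prefix)


# Placeholder region and account ID, for illustration only.
print(expected_data_location("us-east-1", "123456789012",
                             "sagemaker-keras-text-classification"))
```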

## Create an estimator and fit the model

In order to use SageMaker to fit our algorithm, we'll create an `Estimator` that defines how to use the container to train. This includes the configuration we need to invoke SageMaker training:

* The __container name__. This is constructed as in the shell commands above.
* The __role__. As defined above.
* The __instance count__ which is the number of machines to use for training.
* The __instance type__ which is the type of machine to use for training.
* The __output path__ determines where the model artifact will be written.
* The __session__ is the SageMaker session object that we defined above.

Then we use `fit()` on the estimator to train against the data that we uploaded above.

In [None]:
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/sagemaker-keras-text-classification'.format(account, region)

tree = sage.estimator.Estimator(image,
                       role, 1, 'ml.c5.2xlarge',
                       output_path="s3://{}/output".format(sess.default_bucket()),
                       sagemaker_session=sess)

tree.fit(data_location)
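
The image name used above follows the ECR URI pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`; when the tag is omitted, Docker falls back to `latest`. The sketch below wraps that pattern in a hypothetical helper, `ecr_image_uri`. The account ID shown is a placeholder, not a real account.

```python
def ecr_image_uri(account, region, repository, tag="latest"):
    """Return the full ECR image URI for a repository in the given account and region."""
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(account, region, repository, tag)


# Placeholder account ID and region, for illustration only.
uri = ecr_image_uri("123456789012", "us-east-1", "sagemaker-keras-text-classification")
print(uri)
```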

## Deploy the model

Deploying the model to SageMaker hosting just requires a `deploy` call on the fitted model. This call takes an instance count, instance type, and optionally serializer and deserializer functions. These are used when the resulting predictor is created on the endpoint.

In [None]:
from sagemaker.predictor import json_serializer
predictor = tree.deploy(1, 'ml.t2.medium', serializer=json_serializer)

In [None]:
request = { "input": "Deadpool 2 Has More Swearing, Slicing and Dicing from Ryan Reynolds"}

print(predictor.predict(request).decode('utf-8'))

In [None]:
import json

# Map the single-letter category codes to readable names.
category_names = {
    "b": "Business",
    "t": "Science & Technology",
    "e": "Entertainment",
    "m": "Health & Medicine",
}

# Send a small random sample (0.01% of the rows) to the endpoint and compare labels.
news_dataset_sampled = news_dataset.sample(frac=0.0001)
for n, row in enumerate(news_dataset_sampled.itertuples(index=False), start=1):
    category = category_names.get(row.CATEGORY, "unknown")
    request = {"input": row.TITLE}
    result = json.loads(predictor.predict(request).decode('utf-8'))["result"]
    print("{}. {} - Expected: {}, Predicted: {}".format(n, row.TITLE, category, result))
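
To summarize a run like the one above with a single number, the expected and predicted labels can be collected and compared. This is a minimal sketch, assuming the endpoint returns the same label strings used for the expected categories (the notebook does not confirm this). `accuracy` is a hypothetical helper and the label pairs are made up.

```python
def accuracy(pairs):
    """Return the fraction of (expected, predicted) pairs that match.

    Returns 0.0 for an empty input to avoid dividing by zero.
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(1 for expected, predicted in pairs if expected == predicted)
    return correct / float(len(pairs))


# Made-up label pairs, only to show the calculation.
sample_pairs = [
    ("Business", "Business"),
    ("Entertainment", "Entertainment"),
    ("Health & Medicine", "Science & Technology"),
    ("Science & Technology", "Science & Technology"),
]
print("Accuracy: {:.2f}".format(accuracy(sample_pairs)))
```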
    

## Optional cleanup

When you're done with the endpoint, you'll want to clean it up.

In [None]:
sess.delete_endpoint(predictor.endpoint)