# Text Classification Using Keras & TensorFlow on Amazon SageMaker

A modified version of this AWS SageMaker lab guide: https://github.com/aws-samples/amazon-sagemaker-keras-text-classification

# Data Exploration

In [1]:
import pandas as pd
import tensorflow as tf
import re
import numpy as np
import os

# Import from the public tf.keras namespace rather than the private tensorflow.python package
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical




In [2]:
column_names = ["TITLE", "URL", "PUBLISHER", "CATEGORY", "STORY", "HOSTNAME", "TIMESTAMP"]
news_dataset = pd.read_csv(os.path.join('./data', 'newsCorpora.csv'), names=column_names, header=None, delimiter='\t')
news_dataset.head()

| | TITLE | URL | PUBLISHER | CATEGORY | STORY | HOSTNAME | TIMESTAMP |
|---|---|---|---|---|---|---|---|
| 1 | Fed official says weak data caused by weather,... | http://www.latimes.com/business/money/la-fi-mo... | Los Angeles Times | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.latimes.com | 1394470370698 |
| 2 | Fed's Charles Plosser sees high bar for change... | http://www.livemint.com/Politics/H2EvwJSK2VE6O... | Livemint | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.livemint.com | 1394470371207 |
| 3 | US open: Stocks fall after Fed official hints ... | http://www.ifamagazine.com/news/us-open-stocks... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371550 |
| 4 | Fed risks falling 'behind the curve', Charles ... | http://www.ifamagazine.com/news/fed-risks-fall... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371793 |
| 5 | Fed's Plosser: Nasty Weather Has Curbed Job Gr... | http://www.moneynews.com/Economy/federal-reser... | Moneynews | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.moneynews.com | 1394470372027 |


In [3]:
news_dataset.groupby(['CATEGORY']).size()

CATEGORY
b    115967
e    152469
m     45639
t    108344
dtype: int64
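
The counts above are uneven across the four categories, which is worth knowing before training. The following is a minimal sketch that turns them into class shares. The label names are an assumption taken from the public description of the UCI News Aggregator dataset, not from this notebook, and `CATEGORY_LABELS` and `class_shares` are hypothetical names used only for illustration.

```python
# Counts copied from the groupby output above.
category_counts = {"b": 115967, "e": 152469, "m": 45639, "t": 108344}

# Assumed mapping, taken from the public description of the
# UCI News Aggregator dataset rather than from this notebook.
CATEGORY_LABELS = {
    "b": "Business",
    "e": "Entertainment",
    "m": "Health",
    "t": "Science and Technology",
}


def class_shares(counts):
    """Return each category's fraction of the total row count."""
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


shares = class_shares(category_counts)

# Print the categories from most to least frequent.
for code, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{CATEGORY_LABELS[code]:<24}{share:6.1%}")
```

Under that mapping, the smallest class (`m`) holds roughly 11% of the rows and the largest (`e`) roughly 36%, so overall accuracy alone may hide weak performance on the smallest class.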

# Training and Hosting your Algorithm in Amazon SageMaker
![image](https://miro.medium.com/max/792/1*41reGFhdysmXNVHgmPMExA.png)

## Building and registering the container

The following shell code shows how to build the container image using `docker build` and push the container image to ECR using `docker push`. 

This code looks for an ECR repository in the account you're using and the current default region (if you're using a SageMaker notebook instance, this will be the region where the notebook instance was created). If the repository doesn't exist, the script will create it.

In [5]:
%%sh
cd container
sh build_docker.sh

Login Succeeded
Stopping docker: [  OK  ]
Starting docker:	.[  OK  ]
Sending build context to Docker daemon  456.3MB
Step 1/9 : FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:1.14.0-cpu-py36-ubuntu16.04
 ---> e6a210ff54e4
Step 2/9 : RUN apt-get update &&     apt-get install -y nginx imagemagick graphviz
 ---> Using cache
 ---> 39d868b5172a
Step 3/9 : RUN pip install --upgrade pip
 ---> Using cache
 ---> 83610cd980fc
Step 4/9 : RUN pip install gevent gunicorn flask tensorflow_hub seqeval graphviz nltk spacy tqdm
 ---> Using cache
 ---> 090ab4fa935c
Step 5/9 : RUN python -m spacy download en_core_web_sm
 ---> Using cache
 ---> cf5c0166e852
Step 6/9 : RUN python -m spacy download en
 ---> Using cache
 ---> d39488c451a6
Step 7/9 : ENV PATH="/opt/program:${PATH}"
 ---> Using cache
 ---> 1b3759031fe0
Step 8/9 : COPY sagemaker_keras_text_classification /opt/program
 ---> Using cache
 ---> 092b378f8446
Step 9/9 : WORKDIR /opt/program
 ---> Using cache
 ---> b468053ed126
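
The base image in Step 1/9 is itself pulled from ECR, and its reference shows the general shape of an ECR image URI: account ID, region, repository, and an optional tag. The sketch below composes such a URI. It assumes a standard commercial AWS region (other partitions use a different domain suffix), and `ecr_image_uri` and the placeholder account ID are hypothetical.

```python
def ecr_image_uri(account, region, repository, tag=None):
    """Compose an ECR image URI; the tag is optional."""
    uri = f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}"
    return f"{uri}:{tag}" if tag else uri


# Reproduces the base image reference from Step 1/9 of the build output.
base_image = ecr_image_uri(
    "763104351884", "us-west-2",
    "tensorflow-training", "1.14.0-cpu-py36-ubuntu16.04",
)
print(base_image)

# Placeholder account ID, for illustration only.
print(ecr_image_uri("123456789012", "us-west-2",
                    "sagemaker-keras-text-classification"))
```

Without a tag, Docker resolves the reference to `latest`, which is why the untagged form is enough when only one version of the image is pushed.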



Once you have your container packaged, you can use it to train and serve models. Let's do that with the algorithm we made above.

## Set up the environment

Here we specify the S3 key prefix under which the data will be stored and the IAM role that SageMaker will use for training and hosting.

In [6]:
# S3 key prefix under which the training data will be uploaded
prefix = 'sagemaker-keras-text-classification'

import boto3
from sagemaker import get_execution_role

# IAM role that SageMaker assumes for training and hosting
role = get_execution_role()

## Create the session

The session remembers our connection parameters to SageMaker. We'll use it to perform all of our SageMaker operations.

In [7]:
import sagemaker as sage
from time import gmtime, strftime

sess = sage.Session()

## Upload the data for training

When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to create your data in S3.  

We can use the tools provided by the SageMaker Python SDK to upload the data to the session's default bucket.

In [8]:
WORK_DIRECTORY = 'data'

data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=prefix)
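
For orientation, here is a rough sketch of the S3 URI that `data_location` is expected to hold after a directory upload. It assumes the SDK names its default bucket `sagemaker-<region>-<account>` and returns `s3://<bucket>/<key_prefix>` for a directory; both are my understanding of the SDK's behaviour and worth checking against the installed version. The helper names and the account ID are hypothetical.

```python
def default_bucket_name(region, account):
    """Name pattern assumed for the SDK's default bucket."""
    return f"sagemaker-{region}-{account}"


def expected_upload_uri(bucket, key_prefix):
    """S3 URI that a directory upload under key_prefix would resolve to."""
    return f"s3://{bucket}/{key_prefix}"


# Placeholder region and account ID, for illustration only.
bucket = default_bucket_name("us-west-2", "123456789012")
print(expected_upload_uri(bucket, "sagemaker-keras-text-classification"))
```

Printing `data_location` in the notebook is the reliable way to confirm the actual value.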

## Create an estimator and fit the model

To use SageMaker to fit our algorithm, we'll create an `Estimator` that defines how to use the container to train. This includes the configuration we need to invoke SageMaker training:

* The __container name__, constructed from the account ID, region, and repository name used by the shell commands above.
* The __role__, as defined above.
* The __instance count__, which is the number of machines to use for training.
* The __instance type__, which is the type of machine to use for training.
* The __output path__, which determines where the model artifact will be written.
* The __session__, which is the SageMaker session object that we defined above.

Then we call `fit()` on the estimator to train against the data that we uploaded above.

In [9]:
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/sagemaker-keras-text-classification'.format(account, region)

tree = sage.estimator.Estimator(image,
                       role, 1, 'ml.c5.2xlarge',
                       output_path="s3://{}/output".format(sess.default_bucket()),
                       sagemaker_session=sess)
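
The call above passes the image, role, instance count, and instance type positionally. As far as I know, the keyword names for those four arguments differ between major versions of the SageMaker Python SDK, so the sketch below maps them explicitly. `estimator_kwargs` is a hypothetical helper and the placeholder values are not real resources; check the names against the documentation for the installed SDK version.

```python
def estimator_kwargs(sdk_major, image, role, count, instance_type):
    """Map the four positional arguments to keyword names per SDK version."""
    if sdk_major >= 2:
        names = ("image_uri", "role", "instance_count", "instance_type")
    else:
        names = ("image_name", "role",
                 "train_instance_count", "train_instance_type")
    return dict(zip(names, (image, role, count, instance_type)))


# Placeholder image and role values, for illustration only.
kwargs = estimator_kwargs(1, "<image>", "<role-arn>", 1, "ml.c5.2xlarge")
print(kwargs)

# With the SDK installed, these could be passed as
# sage.estimator.Estimator(**kwargs, output_path=..., sagemaker_session=sess)
```

Using keywords makes the cell easier to read and avoids silently swapping the count and the instance type.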

In [10]:
tree.fit(data_location)

2020-04-22 01:18:26 Starting - Starting the training job...
2020-04-22 01:18:27 Starting - Launching requested ML instances......
2020-04-22 01:19:56 Starting - Preparing the instances for training......
2020-04-22 01:20:56 Downloading - Downloading input data
2020-04-22 01:20:56 Training - Downloading the training image......

## Deploy the model

Deploying the model to SageMaker hosting just requires a `deploy` call on the fitted model. This call takes an instance count, instance type, and optionally serializer and deserializer functions. These are used when the resulting predictor is created on the endpoint.

__This step may take 10 to 20 minutes.__

In [11]:
from sagemaker.predictor import json_serializer
predictor = tree.deploy(1, 'ml.m4.xlarge', serializer=json_serializer)

-------------!

## Prediction

In [13]:
request = { "input": "‘Deadpool 2’ Has More Swearing, Slicing and Dicing from Ryan Reynolds"}
print(predictor.predict(request).decode('utf-8'))

{"result": "Entertainment"}
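
The same request can also be sent without the SDK's predictor object, for example from an application that only has boto3. The sketch below assumes the request and response shapes shown in the cell above (`{"input": ...}` in, `{"result": ...}` out) and calls `invoke_endpoint` on a client object passed in by the caller. `classify`, `FakeRuntimeClient`, and the endpoint name are hypothetical; the fake client only exists so the sketch runs without AWS access.

```python
import io
import json


def classify(runtime_client, endpoint_name, text):
    """Send one headline to the endpoint and return the predicted label."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"input": text}),
    )
    # The response body is a byte stream holding a JSON document.
    payload = json.loads(response["Body"].read().decode("utf-8"))
    return payload["result"]


class FakeRuntimeClient:
    """Stand-in for the boto3 client so the sketch runs without AWS."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        assert json.loads(Body)["input"]  # the request must carry a headline
        return {"Body": io.BytesIO(b'{"result": "Entertainment"}')}


label = classify(FakeRuntimeClient(), "my-endpoint", "Deadpool 2 review")
print(label)
```

Against a live endpoint, the fake would be replaced by `boto3.client("sagemaker-runtime")` and the endpoint name by the one created in the deploy step.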


## Optional cleanup

When you're done with the endpoint, delete it; a running endpoint continues to incur charges. Uncomment the line below to do so.

In [16]:
#sess.delete_endpoint(predictor.endpoint)
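
Deleting the endpoint leaves its endpoint configuration behind. A minimal sketch of removing both through the boto3 SageMaker client follows, written against a client object passed in by the caller. `delete_endpoint_resources`, `FakeSageMakerClient`, and the endpoint name are hypothetical, and the fake client stands in for `boto3.client("sagemaker")` so the sketch runs without AWS access.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint and the endpoint configuration behind it."""
    description = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = description["EndpointConfigName"]
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    return config_name


class FakeSageMakerClient:
    """Records calls instead of talking to AWS."""

    def __init__(self):
        self.calls = []

    def describe_endpoint(self, EndpointName):
        self.calls.append(("describe_endpoint", EndpointName))
        return {"EndpointConfigName": EndpointName + "-config"}

    def delete_endpoint(self, EndpointName):
        self.calls.append(("delete_endpoint", EndpointName))

    def delete_endpoint_config(self, EndpointConfigName):
        self.calls.append(("delete_endpoint_config", EndpointConfigName))


client = FakeSageMakerClient()
deleted_config = delete_endpoint_resources(client, "my-endpoint")
print(client.calls)
```

The model resource and the artifacts in S3 are not touched by this sketch and would need to be removed separately if they are no longer needed.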