<h1>Spam Classifier for SMS messages</h1>
<br />
This notebook shows how to implement a basic spam classifier for SMS messages using Keras and TensorFlow.
The idea is to train a neural network on the SMS Spam Collection dataset, available at https://archive.ics.uci.edu/ml/datasets/sms+spam+collection, using the built-in TensorFlow container in Amazon SageMaker.

Let's start by creating an Amazon SageMaker session and retrieving the current execution role with the Amazon SageMaker high-level SDK for Python.

In [16]:
import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()
print(role)

arn:aws:iam::825935527263:role/service-role/AmazonSageMaker-ExecutionRole-20180311T170786


We now download the SMS Spam Collection dataset, unzip it, and print its first 10 lines.

In [117]:
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip -o smsspamcollection.zip
!unzip -o smsspamcollection.zip
!head -10 SMSSpamCollection

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  198k  100  198k    0     0  64290      0  0:00:03  0:00:03 --:--:-- 64290
Archive:  smsspamcollection.zip
  inflating: SMSSpamCollection       
  inflating: readme                  
ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham	Ok lar... Joking wif u oni...
spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham	U dun say so early hor... U c already then say...
ham	Nah I don't think he goes to usf, he lives around here though
spam	FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
ham	Even my brother is not like to speak with me. They treat me like ai

We now load the dataset into a pandas DataFrame and prepare the data.
Specifically, we have to:
<ul>
    <li>replace the target column values (ham/spam) with numeric values (0/1)</li>
    <li>tokenize the SMS messages and encode each one as a fixed-length vector marking which hashed word indices it contains</li>
    <li>split the data into train and test sets</li>
    <li>upload both sets to an S3 bucket for training</li>
</ul>
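
The tokenize-and-encode step can be illustrated on toy data without Keras. The sketch below is not the notebook's encoder: it assumes a simple regex tokenizer and an MD5-based word hash, and the helper names `word_index` and `encode_message` are hypothetical. MD5 is used because it yields the same index in every process, which keeps training-time and inference-time encodings aligned. By default, Keras' `one_hot` relies on Python's built-in `hash`, which the Keras documentation describes as not stable across runs.

```python
import hashlib
import re

import numpy as np


def word_index(word, dimension):
    """Map a word to a stable index in [1, dimension) using MD5."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (dimension - 1) + 1


def encode_message(message, dimension):
    """Return a multi-hot vector with a 1 at the index of every word present."""
    words = re.findall(r"[a-z0-9']+", message.lower())
    vector = np.zeros(dimension, dtype=np.float32)
    for word in words:
        vector[word_index(word, dimension)] = 1.0
    return vector


toy_dimension = 50  # tiny vocabulary size, for illustration only
encoded = np.stack([encode_message(m, toy_dimension)
                    for m in ["Free entry to win", "Ok lar... Joking wif u oni..."]])
print(encoded.shape)          # (2, 50)
print(int(encoded[0].sum()))  # at most 4: distinct words can collide on one index
```

Because distinct words can hash to the same index, the vocabulary size trades memory against collisions.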

In [3]:
import pandas
import pickle
import numpy
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence

df = pandas.read_csv('SMSSpamCollection', sep='\t', header=None)
df[df.columns[0]] = df[df.columns[0]].map({'ham': 0, 'spam': 1})

targets = df[df.columns[0]].values
messages = df[df.columns[1]].values

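def vectorize_sequences(sequences, dimension):
    # Multi-hot encode: row i gets a 1 at every word index present in message i.
    results = numpy.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

def one_hot_encode(messages, dimension):
    # Hash each word of each message to an integer index below `dimension`.
    data = []
    for msg in messages:
        # Use the `dimension` argument rather than the global vocabulary size.
        temp = one_hot(msg, dimension)
        data.append(temp)
    return data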

# one hot encoding for each SMS message
vocabulary_lenght = 9013
one_hot_data = one_hot_encode(messages, vocabulary_lenght)
encoded_messages = vectorize_sequences(one_hot_data, vocabulary_lenght)

# This code is an alternative way to encode text messages using Keras Tokenizer.
# The difference is that we will need to save the Tokenizer in order to re-use vocabulary when doing inferences.
# t = Tokenizer()
# t.fit_on_texts(messages)
# dump the tokenizer for later usage when doing inference
# pickle.dump(t, open("tokenizer.p", "wb"))
# messages = df[df.columns[1]]
# encoded_messages = t.texts_to_matrix(messages, mode='count')

df2 = pandas.DataFrame(encoded_messages)
df2.insert(0, 'spam', targets)

# split into training and test sets
train_set = df2[:4000] 
test_set = df2[4000:]

train_set.to_csv('sms_train_set.gz', header=False, index=False, compression='gzip')
test_set.to_csv('sms_test_set.gz', header=False, index=False, compression='gzip')

df2

Unnamed: 0,spam,0,1,2,3,4,5,6,7,8,...,9003,9004,9005,9006,9007,9008,9009,9010,9011,9012
0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
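
The split in the cell above takes the first 4000 rows in file order. If row order might carry any pattern, shuffling before splitting is a common precaution. Here is a minimal sketch with NumPy; `shuffled_split`, the seed, and the toy array are illustrative assumptions rather than part of the notebook:

```python
import numpy as np


def shuffled_split(rows, train_size, seed=42):
    """Shuffle row indices with a fixed seed, then cut into train and test parts."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(rows))
    return rows[order[:train_size]], rows[order[train_size:]]


# Toy stand-in for the encoded matrix: 10 rows, label in column 0.
toy = np.arange(30).reshape(10, 3)
train_part, test_part = shuffled_split(toy, train_size=8)
print(train_part.shape, test_part.shape)  # (8, 3) (2, 3)
```

With a pandas DataFrame, the same index arrays can be passed to `.iloc`.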


The two files must now be uploaded to Amazon S3 so that the Amazon SageMaker training cluster can access them.

In [17]:
import boto3

bucket_name = 'immersionday-sagemaker-test'

s3 = boto3.resource('s3')
target_bucket = s3.Bucket(bucket_name)

with open('sms_train_set.gz', 'rb') as data:
    target_bucket.upload_fileobj(data, 'messageantispam/sms_train_set.gz')
    
with open('sms_test_set.gz', 'rb') as data:
    target_bucket.upload_fileobj(data, 'messageantispam/sms_test_set.gz')
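
Before uploading, it can help to confirm that a gzipped CSV reads back with the expected layout (label first, then one column per vocabulary index). The sketch below uses only the standard library on a tiny throwaway file; the file name `demo_set.gz` and the helper `csv_gz_shape` are hypothetical.

```python
import csv
import gzip
import os
import tempfile


def csv_gz_shape(path):
    """Return (row_count, column_count) of a gzipped CSV, checking rows are equally wide."""
    rows, width = 0, None
    with gzip.open(path, "rt", newline="") as handle:
        for record in csv.reader(handle):
            if width is None:
                width = len(record)
            elif len(record) != width:
                raise ValueError("row %d has %d columns, expected %d"
                                 % (rows, len(record), width))
            rows += 1
    return rows, width


# Write a tiny example file: label followed by three feature columns.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_set.gz")
with gzip.open(demo_path, "wt", newline="") as handle:
    csv.writer(handle).writerows([[0, 0.0, 1.0, 0.0], [1, 1.0, 0.0, 0.0]])

print(csv_gz_shape(demo_path))  # (2, 4)
```

For `sms_train_set.gz` as written above, one would expect 4000 rows and 9014 columns (the label plus 9013 features).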

<h2>Training the model with TensorFlow</h2>

We are now ready to run the training using the Amazon SageMaker TensorFlow built-in container. First, let's look at the script that defines our neural network.

In [120]:
!cat 'SMS_keras_script.py'

import numpy as np
import os
import tensorflow as tf
import pandas
import json
from tensorflow.python.estimator.export.export import build_raw_serving_input_receiver_fn
from tensorflow.python.estimator.export.export_output import PredictOutput

INPUT_TENSOR_NAME = 'inputs'
SIGNATURE_NAME = "serving_default"
LEARNING_RATE = 0.01

def model_fn(features, labels, mode, params):

    first_hidden_layer = tf.keras.layers.Dense(16, activation='relu', name='first-layer')(features[INPUT_TENSOR_NAME])
    second_hidden_layer = tf.keras.layers.Dense(16, activation='relu')(first_hidden_layer)
    output_layer = tf.keras.layers.Dense(1, activation='sigmoid')(second_hidden_layer)
    predictions = output_layer

    # Provide an estimator spec for `ModeKeys.PREDICT`.
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(
            mode=mode,
            predictions={"spam": predictions},
            export_outputs={SIGNATURE_NAME: PredictOut
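
The script listing is cut off in this export. As a rough illustration of the layers that are visible (two 16-unit ReLU layers followed by one sigmoid unit), here is a NumPy-only forward pass. The weights are random placeholders, not trained values, and `init_params` and `forward` are hypothetical names. The sketch only shows how a batch of 9013-dimensional vectors becomes one spam probability per message.

```python
import numpy as np


def init_params(input_dim, hidden=16, seed=0):
    """Random placeholder weights for an input_dim -> 16 -> 16 -> 1 network."""
    rng = np.random.RandomState(seed)
    sizes = [input_dim, hidden, hidden, 1]
    return [(rng.normal(0.0, 0.05, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]


def forward(x, params):
    """ReLU on the hidden layers, sigmoid on the output layer."""
    activation = x
    for weights, bias in params[:-1]:
        activation = np.maximum(activation @ weights + bias, 0.0)
    weights, bias = params[-1]
    return 1.0 / (1.0 + np.exp(-(activation @ weights + bias)))


batch = np.zeros((3, 9013))
batch[0, [5, 42, 900]] = 1.0  # a message containing three (arbitrary) word indices
probabilities = forward(batch, init_params(9013))
print(probabilities.shape)  # (3, 1)
```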

We are now ready to run the training using the TensorFlow estimator object of the SageMaker Python SDK.

In [122]:
from sagemaker.tensorflow import TensorFlow

output_bucket_path = 's3://immersionday-sagemaker-test/messageantispam/output/'

estimator = TensorFlow(entry_point='SMS_keras_script.py',
                               role=role,
                               training_steps=5000,                                  
                               evaluation_steps=100,
                               hyperparameters={'learning_rate': 0.01},
                               train_instance_count=1,
                               train_instance_type='ml.c4.xlarge',
                               output_path=output_bucket_path,
                               base_job_name='message-anti-spam-tf')

estimator.fit('s3://'+ bucket_name +'/messageantispam/')

INFO:sagemaker:Creating training-job with name: message-anti-spam-tf-2018-06-19-08-30-30-896


.................
2018-06-19 08:33:16,590 INFO - root - running container entrypoint
2018-06-19 08:33:16,590 INFO - root - starting train task
2018-06-19 08:33:16,596 INFO - container_support.training - Training starting
  from ._conv import register_converters as _register_converters
2018-06-19 08:33:18,555 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTP connection (1): 169.254.170.2
2018-06-19 08:33:18,791 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): sagemaker-eu-west-1-825935527263.s3.amazonaws.com
2018-06-19 08:33:18,838 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (2): sagemaker-eu-west-1-825935527263.s3.amazonaws.com
2018-06-19 08:33:18,854 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): sagemaker-eu-west-1-8259355272

2018-06-19 08:33:34,034 INFO - tensorflow - Calling model_fn.
2018-06-19 08:33:34,191 INFO - tensorflow - Done calling model_fn.
2018-06-19 08:33:34,211 INFO - tensorflow - Starting evaluation at 2018-06-19-08:33:34
2018-06-19 08:33:34,270 INFO - tensorflow - Graph was finalized.
2018-06-19 08:33:34,270 INFO - tensorflow - Restoring parameters from s3://immersionday-sagemaker-test/messageantispam/output/message-anti-spam-tf-2018-06-19-08-30-30-896/checkpoints/model.ckpt-1
2018-06-19 08:33:34.293037: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-06-19 08:33:34.307348: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-06-19 08:33:34.316734: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-06-19 08:33:34.327142: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection h

2018-06-19 08:33:46,508 INFO - tensorflow - global_step/sec: 16.57
2018-06-19 08:33:46,509 INFO - tensorflow - loss = 0.5651593, step = 101 (6.036 sec)
2018-06-19 08:33:51,680 INFO - tensorflow - global_step/sec: 19.3356
2018-06-19 08:33:51,681 INFO - tensorflow - loss = 0.47762156, step = 201 (5.172 sec)
2018-06-19 08:33:56,385 INFO - tensorflow - global_step/sec: 21.2533
2018-06-19 08:33:56,386 INFO - tensorflow - loss = 0.43916783, step = 301 (4.705 sec)
2018-06-19 08:34:02,303 INFO - tensorflow - global_step/sec: 16.8987
2018-06-19 08:34:02,304 INFO - tensorflow - loss = 0.4276031, step = 401 (5.918 sec)
2018-06-19 08:34:07,248 INFO - tensorflow - global_step/sec: 20.2209
2018-06-19 08:34:07,250 INFO - tensorflow - loss = 0.38467485, step = 501 (4.946 sec)
2018-06-19 08:34:12,629 INFO - tensorflow - global_step/sec: 18.5843
2018-06-19 08:34:12,630 INFO - tensorflow - loss = 0.399

2018-06-19 08:35:33,994 INFO - tensorflow - Calling model_fn.
2018-06-19 08:35:34,145 INFO - tensorflow - Done calling model_fn.
2018-06-19 08:35:34,165 INFO - tensorflow - Starting evaluation at 2018-06-19-08:35:34
2018-06-19 08:35:34,222 INFO - tensorflow - Graph was finalized.
2018-06-19 08:35:34,222 INFO - tensorflow - Restoring parameters from s3://immersionday-sagemaker-test/messageantispam/output/message-anti-spam-tf-2018-06-19-08-30-30-896/checkpoints/model.ckpt-2000
2018-06-19 08:35:34.246654: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-06-19 08:35:34.262708: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-06-19 08:35:34.270963: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-06-19 08:35:34.280114: I tensorflow/core/platform/s3/aws_logging.cc:54] Connectio

2018-06-19 08:35:38.504406: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-06-19 08:35:38.540655: I tensorflow/core/platform/s3/aws_logging.cc:54] Deleting file: /tmp/s3_filesystem_XXXXXX20180619T0835381529397338504
2018-06-19 08:35:38,540 INFO - tensorflow - SavedModel written to: s3://immersionday-sagemaker-test/messageantispam/output/message-anti-spam-tf-2018-06-19-08-30-30-896/checkpoints/export/Servo/temp-1529397337/saved_model.pb
2018-06-19 08:35:38.541236: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-06-19 08:35:38.545904: E tensorflow/core/platform/s3/aws_logging.cc:60] No response body. Response code: 404
2018-06-19 08:35:38.545937: W tensorflow/core/platform/s3/aws_logging.cc:57] If the signature check failed. This could be because of a time skew. Attempting to adjust the signer.
2018-06-19 08:35:38.546111: I tensorflow
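
To follow the loss trend in a training log like this one, the `loss = ..., step = ...` lines can be pulled out with a small parser. This is a sketch that assumes the line format shown in the log above; `parse_losses` is a hypothetical helper, not part of the SageMaker SDK.

```python
import re

LOSS_PATTERN = re.compile(r"loss = ([0-9.]+), step = (\d+)")


def parse_losses(log_lines):
    """Return (step, loss) pairs from lines containing 'loss = X, step = N'."""
    pairs = []
    for line in log_lines:
        match = LOSS_PATTERN.search(line)
        if match:
            pairs.append((int(match.group(2)), float(match.group(1))))
    return pairs


sample = [
    "2018-06-19 08:33:46,508 INFO - tensorflow - global_step/sec: 16.57",
    "2018-06-19 08:33:46,509 INFO - tensorflow - loss = 0.5651593, step = 101 (6.036 sec)",
    "2018-06-19 08:33:51,681 INFO - tensorflow - loss = 0.47762156, step = 201 (5.172 sec)",
]
print(parse_losses(sample))  # [(101, 0.5651593), (201, 0.47762156)]
```

Lines that are cut off before the step number simply do not match and are skipped.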

<h2>Deploying the model</h2>

Let's deploy the trained model to an endpoint and run inference against it.

In [123]:
tf_predictor = estimator.deploy(initial_instance_count=1,
                                instance_type='ml.m5.xlarge',
                                endpoint_name='message-anti-spam-tf-endpoint')

INFO:sagemaker:Creating model with name: message-anti-spam-tf-2018-06-19-08-30-30-896
INFO:sagemaker:Creating endpoint with name message-anti-spam-tf-endpoint


--------------------------------------------------------------!

In [124]:
import tensorflow as tf
import numpy
from sagemaker.tensorflow.model import TensorFlowPredictor

# Uncomment the following line to connect to an existing endpoint.
# tf_predictor = TensorFlowPredictor('message-anti-spam-tf-endpoint')

test_messages = ['Free prizes for you! Download ringtones on http://sws44.to Answer YES']
one_hot_test_messages = one_hot_encode(test_messages, vocabulary_lenght)
encoded_test_messages = vectorize_sequences(one_hot_test_messages, vocabulary_lenght)

tensor_proto = tf.make_tensor_proto(values=encoded_test_messages, shape=[1, vocabulary_lenght], dtype=tf.float32)
result = tf_predictor.predict(tensor_proto)

print(result)

{u'outputs': {u'spam': {u'dtype': u'DT_FLOAT', u'floatVal': [0.5087814331054688], u'tensorShape': {u'dim': [{u'size': u'1'}, {u'size': u'1'}]}}}}
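
The response is a nested dictionary, so turning it into a decision takes one lookup and a threshold. The sketch below assumes the response shape printed above; the helper names and the 0.5 threshold are assumptions for illustration, not values taken from the training script.

```python
def spam_probability(response):
    """Extract the single float from a response shaped like the one printed above."""
    return response["outputs"]["spam"]["floatVal"][0]


def is_spam(response, threshold=0.5):
    """Apply a decision threshold; 0.5 is an assumed default, not a tuned value."""
    return spam_probability(response) >= threshold


example = {"outputs": {"spam": {"dtype": "DT_FLOAT",
                                "floatVal": [0.5087814331054688],
                                "tensorShape": {"dim": [{"size": "1"}, {"size": "1"}]}}}}
print(spam_probability(example), is_spam(example))  # 0.5087814331054688 True
```

At roughly 0.51, this message sits just above the assumed threshold, so the outcome is sensitive to where the threshold is set.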


Let's also load the model locally from the saved model artifacts and run a prediction.

In [101]:
import tensorflow as tf

path_to_model = './Model/'

test_messages = ['Free prizes for you! Download ringtones on http://sws44.to Answer YES']
one_hot_test_messages = one_hot_encode(test_messages, vocabulary_lenght)
encoded_test_messages = vectorize_sequences(one_hot_test_messages, vocabulary_lenght)

with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(
        sess,
        [tf.saved_model.tag_constants.SERVING],
        path_to_model)

    sigmoid_tensor = sess.graph.get_tensor_by_name('dense_1/Sigmoid:0')
    predictions = sess.run(sigmoid_tensor, {'Placeholder_1:0': encoded_test_messages})

    print(predictions[0][0])

0.6947067


<h2>Training the model with MXNet</h2>

We are now ready to run the training using the Amazon SageMaker MXNet built-in container. First, let's look at the script that defines our neural network.

In [54]:
!cat 'SMS_MXNet_script.py'

from __future__ import print_function

import logging
import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon import nn
import numpy as np
import json
import time

import pip

try:
    from pip import main as pipmain
except ImportError:
    # Newer pip releases moved main() into pip._internal.
    from pip._internal import main as pipmain

pipmain(['install', 'pandas'])
import pandas

logging.basicConfig(level=logging.DEBUG)

# ------------------------------------------------------------ #
# Training methods                                             #
# ------------------------------------------------------------ #


def train(hyperparameters, input_data_config, channel_input_dirs, output_data_dir,
          num_gpus, num_cpus, hosts, current_host, **kwargs):
    # SageMaker passes num_cpus, num_gpus and other args we can use to tailor training to
    # the current container environment, but here we just use simple cpu context.
    ctx = mx.cpu()

    # retrieve the hyperparameters we set in notebook (with

We are now ready to run the training using the MXNet estimator object of the SageMaker Python SDK.

In [43]:
from sagemaker.mxnet import MXNet

m = MXNet('SMS_MXNet_script.py', 
          role=role,
          train_instance_count=1, 
          train_instance_type='ml.c4.xlarge',
          base_job_name='message-anti-spam-mxnet',
          hyperparameters={'batch_size': 200, 
                         'epochs': 20, 
                         'learning_rate': 0.01, 
                         'log_interval': 200})

inputs = {'train': 's3://'+ bucket_name +'/messageantispam/sms_train_set.gz',
 'val': 's3://' + bucket_name + '/messageantispam/sms_test_set.gz'}

m.fit(inputs)

INFO:sagemaker:Creating training-job with name: message-anti-spam-mxnet-2018-06-19-20-47-20-349


................
2018-06-19 20:49:52,326 INFO - root - running container entrypoint
2018-06-19 20:49:52,326 INFO - root - starting train task
2018-06-19 20:49:52,332 INFO - container_support.training - Training starting
2018-06-19 20:49:54,445 INFO - mxnet_container.train - MXNetTrainingEnvironment: {'enable_cloudwatch_metrics': False, 'available_gpus': 0, 'channels': {u'train': {u'TrainingInputMode': u'File', u'RecordWrapperType': u'None', u'S3DistributionType': u'FullyReplicated'}, u'val': {u'TrainingInputMode': u'File', u'RecordWrapperType': u'None', u'S3DistributionType': u'FullyReplicated'}}, '_ps_verbose': 0, 'resource_config': {u'current_host': u'algo-1', u'network_interface_name': u'ethwe', u'hosts': [u'algo-1']}, 'user_script_name': u'SMS_MXNet_script.py', 'input_config_dir': '/opt/ml/input/config', 'channel_dirs': {u'train': u'/opt/ml/input/data/train', u'val': u'/opt/ml/input/data/val'}, 'code_dir': '/opt/ml/code', 'output_data_dir': '/opt/ml/

<h2>Deploying the model</h2>

Let's deploy the trained model to an endpoint and run inference against it.

In [44]:
mxnet_pred = m.deploy(initial_instance_count=1,
                      instance_type='ml.m5.xlarge',
                      endpoint_name='message-anti-spam-mxnet-endpoint')

INFO:sagemaker:Creating model with name: message-anti-spam-mxnet-2018-06-19-20-47-20-349
INFO:sagemaker:Creating endpoint with name message-anti-spam-mxnet-endpoint-3


---------------------------------------------------!

In [53]:
from sagemaker.mxnet.model import MXNetPredictor

# Uncomment the following line to connect to an existing endpoint.
# mxnet_pred = MXNetPredictor('message-anti-spam-mxnet-endpoint')

test_messages = ["Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"]
one_hot_test_messages = one_hot_encode(test_messages, vocabulary_lenght)
encoded_test_messages = vectorize_sequences(one_hot_test_messages, vocabulary_lenght)

result = mxnet_pred.predict(encoded_test_messages)
value = numpy.argmax(result, axis=1)

print(result)
print(value)


[[0.2672363221645355, 0.7327636480331421]]
[1]
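
The two numbers in each row appear to be per-class scores that sum to roughly 1, and `argmax` picks the larger one. The sketch below turns such rows into readable labels. It assumes that output index 0 corresponds to ham and index 1 to spam, matching the 0/1 label mapping used during data preparation. The MXNet script is truncated in this export, so that ordering is an assumption, and `decode` is a hypothetical helper.

```python
import numpy as np

LABELS = ["ham", "spam"]  # assumed order: index 0 -> ham, index 1 -> spam


def decode(scores):
    """Return (label, score) for the highest-scoring class of each row."""
    scores = np.asarray(scores)
    winners = scores.argmax(axis=1)
    return [(LABELS[i], float(row[i])) for i, row in zip(winners, scores)]


result = [[0.2672363221645355, 0.7327636480331421]]
print(decode(result))  # [('spam', 0.7327636480331421)]
```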
