<h1>Spam Classifier for SMS messages</h1>
<br />
This notebook shows how to implement a basic spam classifier for SMS messages using Keras and TensorFlow.
The idea is to use the SMS Spam Collection dataset, available at https://archive.ics.uci.edu/ml/datasets/sms+spam+collection, to train a neural network using the built-in TensorFlow container in Amazon SageMaker.

Let's start by obtaining the Amazon SageMaker session and the current execution role, using the Amazon SageMaker high-level SDK for Python.

In [1]:
import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()
print(role)

arn:aws:iam::825935527263:role/service-role/AmazonSageMaker-ExecutionRole-20180311T170786


We now download the spam collection dataset, unzip it and read the first 10 rows.

In [2]:
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip -o smsspamcollection.zip
!unzip -o smsspamcollection.zip
!head -10 SMSSpamCollection

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  198k  100  198k    0     0  65448      0  0:00:03  0:00:03 --:--:-- 65448
Archive:  smsspamcollection.zip
  inflating: SMSSpamCollection       
  inflating: readme                  
ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham	Ok lar... Joking wif u oni...
spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham	U dun say so early hor... U c already then say...
ham	Nah I don't think he goes to usf, he lives around here though
spam	FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
ham	Even my brother is not like to speak with me. They treat me like ai

We now load the dataset into a Pandas dataframe and prepare the data.
More specifically, we have to:
<ul>
    <li>replace the target column values (ham/spam) with numeric values (0/1)</li>
    <li>tokenize the SMS messages and encode each one as a fixed-length vector marking which (hashed) words it contains</li>
    <li>split into train and test sets</li>
    <li>upload to an S3 bucket for training</li>
</ul>
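
The first two steps in the list can be sketched without Keras as follows. This is a minimal illustration, not the notebook's implementation: `hashed_indices` is a hypothetical helper that uses an MD5-based hash (an assumption, chosen so that results are stable across runs) in place of Keras's `one_hot`, and the tiny dimension of 50 is only there for readability.

```python
import hashlib
import numpy as np

LABELS = {'ham': 0, 'spam': 1}  # same mapping as the first step in the list

def hashed_indices(text, dimension):
    """Hypothetical stand-in for Keras one_hot: map each word to an index in [1, dimension)."""
    words = text.lower().split()
    return [int(hashlib.md5(w.encode('utf-8')).hexdigest(), 16) % (dimension - 1) + 1
            for w in words]

def multi_hot(index_lists, dimension):
    """Return a (num_messages, dimension) matrix with 1.0 where a word index is present."""
    out = np.zeros((len(index_lists), dimension))
    for row, indices in enumerate(index_lists):
        out[row, indices] = 1.0
    return out

# Two messages taken from the dataset preview.
rows = [('ham', 'Ok lar... Joking wif u oni...'),
        ('spam', 'Free entry in 2 a wkly comp to win FA Cup final tkts')]
targets = np.array([LABELS[label] for label, _ in rows])
encoded = multi_hot([hashed_indices(msg, 50) for _, msg in rows], 50)
print(targets, encoded.shape)
```

Note that with this kind of encoding a repeated word still yields a single 1.0, and two different words may hash to the same index, so the vector records presence rather than counts.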

In [3]:
import pandas
import pickle
import numpy
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence

df = pandas.read_csv('SMSSpamCollection', sep='\t', header=None)
df[df.columns[0]] = df[df.columns[0]].map({'ham': 0, 'spam': 1})

targets = df[df.columns[0]].values
messages = df[df.columns[1]].values

def vectorize_sequences(sequences, dimension):
    # Build a (num_messages, dimension) matrix with 1.0 at every word index present in a message.
    results = numpy.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

def one_hot_encode(messages, dimension):
    # Hash every word of each message to an integer index in [1, dimension).
    data = []
    for msg in messages:
        temp = one_hot(msg, dimension)
        data.append(temp)
    return data

# one hot encoding for each SMS message
vocabulary_lenght = 9013
one_hot_data = one_hot_encode(messages, vocabulary_lenght)
encoded_messages = vectorize_sequences(one_hot_data, vocabulary_lenght)

# This code is an alternative way to encode text messages using Keras Tokenizer.
# The difference is that we will need to save the Tokenizer in order to re-use vocabulary when doing inferences.
# t = Tokenizer()
# t.fit_on_texts(messages)
# dump the tokenizer for later usage when doing inference
# pickle.dump(t, open("tokenizer.p", "wb"))
# messages = df[df.columns[1]]
# encoded_messages = t.texts_to_matrix(messages, mode='count')

df2 = pandas.DataFrame(encoded_messages)
df2.insert(0, 'spam', targets)

# split into training and test sets
train_set = df2[:4000] 
test_set = df2[4000:]

train_set.to_csv('sms_train_set.gz', header=False, index=False, compression='gzip')
test_set.to_csv('sms_test_set.gz', header=False, index=False, compression='gzip')

df2

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


Unnamed: 0,spam,0,1,2,3,4,5,6,7,8,...,9003,9004,9005,9006,9007,9008,9009,9010,9011,9012
0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We now upload the two files to Amazon S3 so that the Amazon SageMaker training cluster can access them.

In [4]:
import boto3

bucket_name = 'immersionday-sagemaker-test'

s3 = boto3.resource('s3')
target_bucket = s3.Bucket(bucket_name)

with open('sms_train_set.gz', 'rb') as data:
    target_bucket.upload_fileobj(data, 'smsspamdata/sms_train_set.gz')
    
with open('sms_test_set.gz', 'rb') as data:
    target_bucket.upload_fileobj(data, 'smsspamdata/sms_test_set.gz')
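
As a quick sanity check on the file format, the sketch below writes and reads back a tiny headerless gzip-compressed CSV in the same layout as `sms_train_set.gz` (label in the first column, features after it). The file name `sample_set.gz` and the three-feature rows are made up for illustration, and it assumes, as the `to_csv` calls suggest, that the training script expects no header row.

```python
import csv
import gzip
import os
import tempfile

rows = [[0, 0.0, 1.0, 0.0],   # ham message with one active feature
        [1, 1.0, 1.0, 0.0]]   # spam message with two active features

path = os.path.join(tempfile.mkdtemp(), 'sample_set.gz')  # hypothetical file name

# Write without a header, mirroring to_csv(header=False, index=False, compression='gzip').
with gzip.open(path, 'wt', newline='') as f:
    csv.writer(f).writerows(rows)

# Read back: the first column is the label, the remaining columns are the features.
with gzip.open(path, 'rt', newline='') as f:
    parsed = [[float(v) for v in line] for line in csv.reader(f)]

labels = [int(r[0]) for r in parsed]
features = [r[1:] for r in parsed]
print(labels, features)
```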

We are now ready to run the training using the Amazon SageMaker TensorFlow built-in container. First let's have a look at the script defining our neural network.

In [102]:
!cat 'SMS_keras_script.py'

import numpy as np
import os
import tensorflow as tf
import pandas
import json
from tensorflow.python.estimator.export.export import build_raw_serving_input_receiver_fn
from tensorflow.python.estimator.export.export_output import PredictOutput

INPUT_TENSOR_NAME = 'inputs'
SIGNATURE_NAME = "serving_default"
LEARNING_RATE = 0.01

def model_fn(features, labels, mode, params):

    first_hidden_layer = tf.keras.layers.Dense(16, activation='relu', name='first-layer')(features[INPUT_TENSOR_NAME])
    second_hidden_layer = tf.keras.layers.Dense(16, activation='relu')(first_hidden_layer)
    output_layer = tf.keras.layers.Dense(1, activation='sigmoid')(second_hidden_layer)
    predictions = output_layer

    # Provide an estimator spec for `ModeKeys.PREDICT`.
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(
            mode=mode,
            predictions={"spam": predictions},
            export_outputs={SIGNATURE_NAME: PredictOut
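
To make the architecture in `model_fn` concrete, here is a NumPy sketch of the same forward pass: two ReLU layers of 16 units followed by a single sigmoid unit. The weights are random placeholders (an assumption, since the trained values live in the SageMaker checkpoint) and the input width of 9013 matches `vocabulary_lenght` from the preparation step. It illustrates the shapes involved and is not the training script itself.

```python
import numpy as np

rng = np.random.default_rng(0)
INPUT_DIM = 9013  # matches vocabulary_lenght in the data preparation cell

def dense(x, w, b, activation):
    """One fully connected layer: activation(x @ w + b)."""
    return activation(x @ w + b)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder weights; a trained model would restore these from its checkpoint.
w1, b1 = rng.normal(scale=0.01, size=(INPUT_DIM, 16)), np.zeros(16)
w2, b2 = rng.normal(scale=0.01, size=(16, 16)), np.zeros(16)
w3, b3 = rng.normal(scale=0.01, size=(16, 1)), np.zeros(1)

# Two fake encoded messages with a few active word indices each.
batch = np.zeros((2, INPUT_DIM))
batch[0, [5, 42]] = 1.0
batch[1, [7]] = 1.0

hidden1 = dense(batch, w1, b1, relu)
hidden2 = dense(hidden1, w2, b2, relu)
spam_probability = dense(hidden2, w3, b3, sigmoid)
print(spam_probability.shape)
```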

With the script in place, we run the training using the TensorFlow estimator object of the SageMaker Python SDK.

In [52]:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='SMS_keras_script.py',
                               role=role,
                               training_steps=2000,                                  
                               evaluation_steps=100,
                               hyperparameters={'learning_rate': 0.01},
                               train_instance_count=1,
                               train_instance_type='ml.c4.xlarge')

estimator.fit('s3://'+ bucket_name +'/smsspamdata/')

INFO:sagemaker:Creating training-job with name: sagemaker-tensorflow-2018-06-18-09-33-41-093


...................
2018-06-18 09:36:33,963 INFO - root - running container entrypoint
2018-06-18 09:36:33,963 INFO - root - starting train task
2018-06-18 09:36:33,968 INFO - container_support.training - Training starting
  from ._conv import register_converters as _register_converters
2018-06-18 09:36:35,733 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTP connection (1): 169.254.170.2
2018-06-18 09:36:35,939 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): sagemaker-eu-west-1-825935527263.s3.amazonaws.com
2018-06-18 09:36:35,987 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (2): sagemaker-eu-west-1-825935527263.s3.amazonaws.com
2018-06-18 09:36:36,004 INFO - botocore.vendored.requests.packages.urllib3.connectionpool - Starting new HTTPS connection (1): sagemaker-eu-west-1-82593552

2018-06-18 09:36:49,922 INFO - tensorflow - Calling model_fn.
2018-06-18 09:36:50,068 INFO - tensorflow - Done calling model_fn.
2018-06-18 09:36:50,086 INFO - tensorflow - Starting evaluation at 2018-06-18-09:36:50
2018-06-18 09:36:50,141 INFO - tensorflow - Graph was finalized.
2018-06-18 09:36:50,142 INFO - tensorflow - Restoring parameters from s3://sagemaker-eu-west-1-825935527263/sagemaker-tensorflow-2018-06-18-09-33-41-093/checkpoints/model.ckpt-1
2018-06-18 09:36:50.163764: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-06-18 09:36:50.177108: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-06-18 09:36:50.184921: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-06-18 09:36:50.193345: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. 

2018-06-18 09:37:05,494 INFO - tensorflow - global_step/sec: 17.7576
2018-06-18 09:37:05,495 INFO - tensorflow - loss = 0.4381073, step = 201 (5.632 sec)
2018-06-18 09:37:10,310 INFO - tensorflow - global_step/sec: 20.7606
2018-06-18 09:37:10,311 INFO - tensorflow - loss = 0.41425902, step = 301 (4.817 sec)
2018-06-18 09:37:16,393 INFO - tensorflow - global_step/sec: 16.4393
2018-06-18 09:37:16,394 INFO - tensorflow - loss = 0.36624393, step = 401 (6.083 sec)
2018-06-18 09:37:21,902 INFO - tensorflow - global_step/sec: 18.1543
2018-06-18 09:37:21,903 INFO - tensorflow - loss = 0.37035236, step = 501 (5.509 sec)
2018-06-18 09:37:27,435 INFO - tensorflow - global_step/sec: 18.0723
2018-06-18 09:37:27,436 INFO - tensorflow - loss = 0.37583932, step = 601 (5.533 sec)
2018-06-18 09:37:32,509 INFO - tensorflow - global_step/sec: 19.71
2018-06-18 09:37:32,510 INFO - tensorflow - loss = 0.23

2018-06-18 09:41:04,135 INFO - tensorflow - global_step/sec: 17.268
2018-06-18 09:41:04,136 INFO - tensorflow - loss = 0.014958876, step = 4601 (5.791 sec)
2018-06-18 09:41:09,031 INFO - tensorflow - global_step/sec: 20.4254
2018-06-18 09:41:09,032 INFO - tensorflow - loss = 0.020323139, step = 4701 (4.896 sec)
2018-06-18 09:41:14,512 INFO - tensorflow - global_step/sec: 18.2456
2018-06-18 09:41:14,513 INFO - tensorflow - loss = 0.01208845, step = 4801 (5.481 sec)
2018-06-18 09:41:19,448 INFO - tensorflow - global_step/sec: 20.26
2018-06-18 09:41:19,449 INFO - tensorflow - loss = 0.0076505914, step = 4901 (4.936 sec)
2018-06-18 09:41:24,654 INFO - tensorflow - Saving checkpoints for 5000 into s3://sagemaker-eu-west-1-825935527263/sagemaker-tensorflow-2018-06-18-09-33-41-093/checkpoints/model.ckpt.
2018-06-18 09:41:24.654944: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released

2018-06-18 09:41:29,018 INFO - tensorflow - Done calling model_fn.
2018-06-18 09:41:29,036 INFO - tensorflow - Starting evaluation at 2018-06-18-09:41:29
2018-06-18 09:41:29,090 INFO - tensorflow - Graph was finalized.
2018-06-18 09:41:29,090 INFO - tensorflow - Restoring parameters from s3://sagemaker-eu-west-1-825935527263/sagemaker-tensorflow-2018-06-18-09-33-41-093/checkpoints/model.ckpt-5000
2018-06-18 09:41:29.112665: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-06-18 09:41:29.134630: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-06-18 09:41:29.142565: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-06-18 09:41:29.152571: I tensorflow/core/platform/s3/aws_logging.cc:54] Connection has been released. Continuing.
2018-06-18 09:41:29.161317: I tensorflow/core/p

===== Job Complete =====
Billable seconds: 383
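
One way to check convergence from the training log is to pull out the `loss = ..., step = ...` pairs. The sketch below does this with a regular expression on a few lines copied from the output (timestamps trimmed). The pattern is an assumption based on the format seen here rather than a documented log format.

```python
import re

# Lines copied from the training output, with the timestamp prefix removed.
log_lines = [
    "INFO - tensorflow - loss = 0.4381073, step = 201 (5.632 sec)",
    "INFO - tensorflow - global_step/sec: 20.7606",
    "INFO - tensorflow - loss = 0.014958876, step = 4601 (5.791 sec)",
    "INFO - tensorflow - loss = 0.0076505914, step = 4901 (4.936 sec)",
]

# Assumed format: "loss = <float>, step = <int>".
LOSS_PATTERN = re.compile(r"loss = ([0-9.eE+-]+), step = (\d+)")

history = []
for line in log_lines:
    match = LOSS_PATTERN.search(line)
    if match:
        history.append((int(match.group(2)), float(match.group(1))))

for step, loss in history:
    print(step, loss)
```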


Let's deploy the trained model to an endpoint.

In [53]:
tf_predictor = estimator.deploy(initial_instance_count=1,
                                instance_type='ml.m5.xlarge')

INFO:sagemaker:Creating model with name: sagemaker-tensorflow-2018-06-18-09-33-41-093
INFO:sagemaker:Creating endpoint with name sagemaker-tensorflow-2018-06-18-09-33-41-093


--------------------------------------------------------------!

In [59]:
import tensorflow as tf
import numpy
from sagemaker.tensorflow.model import TensorFlowPredictor

# Uncomment the following line to connect to an existing endpoint.
# tf_predictor = TensorFlowPredictor('sagemaker-tensorflow-2018-06-18-09-33-41-093')

test_messages = ['Free prizes for you! Download ringtones on http://sws44.to Answer YES']
one_hot_test_messages = one_hot_encode(test_messages, vocabulary_lenght)
encoded_test_messages = vectorize_sequences(one_hot_test_messages, vocabulary_lenght)

tensor_proto = tf.make_tensor_proto(values=encoded_test_messages, shape=[1, vocabulary_lenght], dtype=tf.float32)
result = tf_predictor.predict(tensor_proto)

print(result)

{u'outputs': {u'spam': {u'dtype': u'DT_FLOAT', u'floatVal': [0.6947066783905029], u'tensorShape': {u'dim': [{u'size': u'1'}, {u'size': u'1'}]}}}}
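
The response nests the score under `result['outputs']['spam']['floatVal']`. A minimal sketch of turning it into a label follows. The 0.5 cut-off is an assumption (a common default for a sigmoid output), not something the notebook prescribes, and `to_label` is a hypothetical helper.

```python
# Response copied from the endpoint output above.
result = {'outputs': {'spam': {'dtype': 'DT_FLOAT',
                               'floatVal': [0.6947066783905029],
                               'tensorShape': {'dim': [{'size': '1'}, {'size': '1'}]}}}}

def to_label(response, threshold=0.5):
    """Hypothetical helper: return ('spam' or 'ham', score) for a single-message response."""
    score = response['outputs']['spam']['floatVal'][0]
    return ('spam' if score >= threshold else 'ham'), score

label, score = to_label(result)
print(label, round(score, 4))
```

With a stricter threshold such as 0.9 the same message would be labelled ham, so the cut-off is worth tuning against the test set.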


In [101]:
import tensorflow as tf

path_to_model = './Model/'

test_messages = ['Free prizes for you! Download ringtones on http://sws44.to Answer YES']
one_hot_test_messages = one_hot_encode(test_messages, vocabulary_lenght)
encoded_test_messages = vectorize_sequences(one_hot_test_messages, vocabulary_lenght)

with tf.Session(graph=tf.Graph()) as sess:
   tf.saved_model.loader.load(
       sess,
       [tf.saved_model.tag_constants.SERVING],
       path_to_model)

   sigmoid_tensor = sess.graph.get_tensor_by_name('dense_1/Sigmoid:0')
   predictions = sess.run(sigmoid_tensor, {'Placeholder_1:0': encoded_test_messages})

   print(predictions[0][0])

0.6947067
