For Project 3 in my **Intro to Cloud Computing** course, we were tasked with finding a problem that could be solved using a distributed platform (Spark, Hadoop / MapReduce) and then using that platform to tackle it. I chose to develop a model that identifies spam text messages using Spark through Amazon SageMaker on the Amazon Web Services (AWS) platform. This post walks through the process of developing this model and setting up all the necessary resources. Specifically, it covers the following actions:

* Establish a **LOCAL** Spark Session. (Future posts may connect to a Spark Cluster on Amazon EMR.)
* Load text message data from Amazon S3 into Spark Session
* Process data with a Spark ML Pipeline
 * Develop custom transformers for the ML pipeline
* Train the resulting data on a SageMaker instance using SageMaker's XGBoost classification algorithm
* Deploy the trained SageMaker model
* Evaluate the model performance by performing inference on the deployed endpoint

The complete project files are available in the associated GitHub repo:

[https://github.com/canfielder/DSBA-6190_Proj3_Spark_w_Sagemaker](https://github.com/canfielder/DSBA-6190_Proj3_Spark_w_Sagemaker)

# Why SageMaker and Spark?
For this project we will be performing some standard Natural Language Processing (NLP) actions on SMS messages, and doing so using a distributed platform. Now, there are many different ways to develop a distributed platform process. If we want to use a Spark cluster we can go straight to the source and use Amazon EMR or Google Cloud Dataproc. Or we could set up a MapReduce job on Azure. 

So why bother tapping into SageMaker?

Keep in mind that many Data Science / Machine Learning processes generally follow these overarching steps:

<p align ="middle">
  <img src="../imgs/data_science_process_flow.png" width="35%" />
</p>

We want to take special note of steps two and three: **Process** and **Train**. Processing large quantities of text, and then training on the resulting output, might have wildly different sizing needs. If you do both on the same system, you have to size that system for the more demanding of the two steps. This means the other step is going to run on needlessly oversized hardware (**$$$**).

Instead, we can decouple the **Process** step from the **Train** step by using Spark and SageMaker together. Rather than one oversized system, we can:

* **Process**: Run your NLP with Spark, either locally or on Amazon EMR clusters
* **Train**: Train your model on Amazon SageMaker instances 

Each process is dynamically sized and spec-ed (GPUs vs no GPUs, etc.) for the actual needs of just that process, not both. In addition, even though you are tapping into multiple Amazon resources, it is simple to develop this architecture all within a SageMaker notebook.

**Note**: I want to recognize the following Medium post, [Mixing Spark with Sagemaker ?](https://medium.com/@julsimon/mixing-spark-with-sagemaker-d30d34ffaee7) for providing a good brief discussion on this very subject.

# Notes on Operating Environment
I am performing all of these actions in Amazon SageMaker on a SageMaker instance. I initially tried to perform these actions locally, connecting to Amazon resources with Boto3, but I kept getting errors when executing certain PySpark actions, which I couldn't resolve. Working directly in SageMaker did not result in any errors.

# Spark Cluster
First up we will create a local SparkSession. We'll import the necessary libraries and modules, and then create the session.

In [1]:
import os
import boto3

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

import sagemaker_pyspark

# Configure Spark to use the SageMaker Spark dependency jars
jars = sagemaker_pyspark.classpath_jars()

# Join the jar paths into a single classpath string
classpath = ":".join(jars)

spark = SparkSession.builder.config("spark.driver.extraClassPath", classpath)\
    .master("local[*]")\
    .appName("sms_spam_filter")\
    .getOrCreate()
    
spark

# AWS Settings and Connections
## Boto3 Connection

First things first, we're going to set up a connection to Amazon. If you are doing this work outside of Amazon SageMaker, make sure you have correctly set up an AWS configuration file on whatever machine you are working on. See the following documentation for how to do this: [Configuring the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).

I established a unique user and role for this exercise, which were granted the following permissions:

* AmazonS3FullAccess
* AmazonSageMakerFullAccess

**Note**: *These permissions are very broad. Usually, when granting permissions, follow the principle of [Grant Least Privilege](https://docs.aws.amazon.com/IAM/latest/UserGuide/best-practices.html#grant-least-privilege).*


In [4]:
region = "us-east-1"
aws_credential_profile = 'blog_spam_sagemaker_spark'
role = 'blog_spam_sagemaker_spark'

With our region and AWS Credentials file profile defined, we can establish a connection to AWS via the SDK library Boto3. When working with Amazon SageMaker, I prefer establishing a *Session* at the start of my notebook, and then using the session to define Boto3 *Client* and *Resource* objects. This helps maintain consistency when creating those objects.

In [5]:
import boto3

boto3_session = boto3.session.Session(region_name=region)

## Role ARN
With a general session established, we are going to connect to the AWS IAM service and extract the ARN associated with the role being used for this exercise. This ARN will be needed to access the necessary AWS resources. The ARN could also be manually copied into this notebook, but I prefer using the Boto3 tools when available and not overly onerous.

In [6]:
# Establish IAM Client
client_iam = boto3_session.client('iam')

# Extract Available Roles
role_list = client_iam.list_roles()['Roles']

# Initialize Role ARN Variable and Establish Key Value
role_arn = ''
key = 'RoleName'

# Extract Role ARN for Exercise Role
for item in role_list:
    if key in item and role == item[key]:
        role_arn = item['Arn']

# Verify ARN
print(role_arn)

arn:aws:iam::726963482731:role/blog_spam_sagemaker_spark
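
One caveat with the loop above: `list_roles` returns its results in pages, so in an account with many roles a single call may not include the role you are looking for. As a hedged alternative, the IAM client also offers `get_role`, which fetches one role by name. The sketch below assumes your credentials are allowed to call `get_role`. The helper name `lookup_role_arn` and the `StubIamClient` class are hypothetical, and the stub (with a made-up account ID) exists only so the example runs without AWS access.

```python
def lookup_role_arn(iam_client, role_name):
    """Return the ARN of a single IAM role, looked up by name."""
    # get_role fetches one role directly, so there is no role list to page through
    response = iam_client.get_role(RoleName=role_name)
    return response["Role"]["Arn"]


class StubIamClient:
    """Hypothetical stand-in for boto3_session.client('iam'), for this demo only."""

    def get_role(self, RoleName):
        # Mimics only the part of the get_role response that the helper reads
        return {
            "Role": {
                "RoleName": RoleName,
                "Arn": "arn:aws:iam::123456789012:role/{}".format(RoleName),
            }
        }


print(lookup_role_arn(StubIamClient(), "blog_spam_sagemaker_spark"))
```

In the notebook, you would pass `client_iam` in place of the stub.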


# Import
The data for this project was the **SMS Spam Collection Dataset**, originally hosted on the UCI Machine Learning repository. The same dataset is also hosted on Kaggle, [here](https://www.kaggle.com/uciml/sms-spam-collection-dataset). I used the Kaggle API to load the data directly to our S3 bucket. More information on how to do that can be found on the [Kaggle GitHub page](https://github.com/Kaggle/kaggle-api) and in the main README file of my personal [GitHub page](https://github.com/canfielder/DSBA-6190_Proj3_Spark_w_Sagemaker) for this project. The result of following the necessary steps is the file *spam.csv* in an S3 bucket, in this case named *dsba-6190-project3-spark*.

With our data successfully loaded to our S3 bucket, we will now load this data, as a **Spark DataFrame**, into the local *SparkSession* running on our notebook instance.

## S3A Endpoint Check
The code in the following block comes directly from one of the [SageMaker Spark Examples on GitHub](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-spark/pyspark_mnist/pyspark_mnist_pca_kmeans.ipynb). I have added some additional commenting for context. Essentially, this code checks whether your region is in China and, if so, applies the correct domain suffix to the S3A endpoint. S3A is the connector between Hadoop and AWS S3.

For most regions this step is unnecessary, as the default settings will be sufficient.

In [7]:
# List of Chinese Regions
cn_regions = ['cn-north-1', 'cn-northwest-1']

# Current Region
region = boto3_session.region_name

# Defined Endpoint URL Domain Suffix (i.e.: .com)
endpoint_domain = 'com.cn' if region in cn_regions else 'com'

# Set S3A Endpoint
spark._jsc.hadoopConfiguration().set('fs.s3a.endpoint', 's3.{}.amazonaws.{}'.format(region, endpoint_domain))
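
To make the endpoint logic easier to check in isolation, here is a small sketch that wraps the same rule in a function. The function name `s3a_endpoint` is hypothetical and not part of the original example. It only builds the string, and setting it on the Hadoop configuration still happens as shown above.

```python
# Regions that use the .com.cn domain suffix (same list as in the cell above)
CN_REGIONS = ("cn-north-1", "cn-northwest-1")


def s3a_endpoint(region_name):
    """Build the S3A endpoint host name for a given AWS region."""
    suffix = "com.cn" if region_name in CN_REGIONS else "com"
    return "s3.{}.amazonaws.{}".format(region_name, suffix)


print(s3a_endpoint("us-east-1"))
print(s3a_endpoint("cn-north-1"))
```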

## Load Data
The data we have is in CSV format. We're going to import this CSV into a PySpark DataFrame. But before we load the data, we need to define a schema. Without a predefined schema, this particular dataset would have been imported with empty columns.

In [8]:
from pyspark.sql.types import StructType, StructField, StringType

# Define S3 Labels
bucket = "dsba-6190-project3-spark"
file_name = "spam.csv"

# Define Known Schema
schema = StructType([
    StructField("class", StringType()),
    StructField("sms", StringType())
])

# Import CSV
df = spark.read\
          .schema(schema)\
          .option("header", "true")\
          .csv('s3a://{}/{}'.format(bucket, file_name))

# Inspect Data
After loading the data we're going to inspect the data to see what we're dealing with, and if there are any problems we need to sort through.

First, let's take a quick look at what our data looks like by inspecting the top 10 rows.

In [9]:
df.show(10, truncate = False)

+-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|class|sms                                                                                                                                                             |
+-----+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|ham  |Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...                                                 |
|ham  |Ok lar... Joking wif u oni...                                                                                                                                   |
|spam |Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075o

So, pretty straightforward. Two variables: our target variable, **class**, and the raw text, which we will need to process before training our model.

As this project is going to boil down to a binary classification exercise, let's look at the frequency counts for our target variable.

In [10]:
df.groupBy("class").count().show()

+------+-----+
| class|count|
+------+-----+
|ham"""|    2|
|   ham| 4825|
|  spam|  747|
+------+-----+



Uh oh! It appears two observations have errors in the target variable field, **ham"""**. Let's take a closer look at these specific observations.

In [11]:
import pyspark.sql.functions as f

df.where(f.col("class") == 'ham"""').show(truncate = False)

+------+---------------------------------------------------------+
|class |sms                                                      |
+------+---------------------------------------------------------+
|ham"""|null                                                     |
|ham"""|Well there's still a bit left if you guys want to tonight|
+------+---------------------------------------------------------+



So, one row is null, while the other appears to be a standard **ham** message. Before doing anything, let's see how many null values are in the complete dataset.

In [12]:
from functools import reduce

df.where(reduce(lambda x, y: x | y, (f.col(x).isNull() \
                                     for x in df.columns))).show()

+------+----+
| class| sms|
+------+----+
|ham"""|null|
+------+----+
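
The `reduce` call above builds one `isNull()` condition per column and ORs them together, so a row is kept if any of its columns is null. As a rough illustration of that folding step only, here is a plain-Python analogue (not Spark code) that applies the same idea to a couple of made-up rows stored as dictionaries.

```python
from functools import reduce

# Made-up rows standing in for the DataFrame; None plays the role of null
rows = [
    {"class": "ham", "sms": "Ok lar... Joking wif u oni..."},
    {"class": 'ham"""', "sms": None},
]
columns = ["class", "sms"]


def has_null(row):
    """OR together one 'is this column null?' check per column."""
    return reduce(lambda x, y: x | y, (row[c] is None for c in columns))


rows_with_nulls = [row for row in rows if has_null(row)]
print(rows_with_nulls)
```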



It seems like there is only the one. So, we'll drop the null observation and correct the typo in the target variable for the other observation. Then we'll check the target variable frequency again to make sure our corrections were implemented.

In [16]:
# Drop Null
df = df.dropna()

# Correct ham""" to ham
df = df.withColumn("class", f.when(f.col("class") == 'ham"""' , 'ham').
                     otherwise(f.col("class")))

# Generate Target Frequency Count
df.groupBy("class").count().show()

+-----+-----+
|class|count|
+-----+-----+
|  ham| 4826|
| spam|  747|
+-----+-----+



# Creating a Machine Learning Pipeline

In order to fully take advantage of using Spark and SageMaker, we're going to want to create a Spark ML Pipeline that will perform all of the actions for both the **Process** and **Train** steps. We can see the documentation on pipelines here: [ML Pipelines](https://spark.apache.org/docs/latest/ml-pipeline.html).

There are three main components for our ML Pipeline:

* DataFrame
* Transformer
* Estimator

Our DataFrame has already been loaded in the previous steps.

Our Transformers are going to handle processing of the DataFrame. These are Spark components.

Our Estimator is going to model the final, processed data. This is a SageMaker component. 

## Process
Before our data can reach the modeling portion of our Pipeline, it needs to pass through the processing portion. To process our data we are going to either use pre-existing Transformers or define our own custom Transformers. These Transformers will need to cover the following major steps:

1. Convert Target Variable to binary double-type 
2. Text Normalization
3. Tokenization 
4. TF-IDF Transformation

For custom Transformers, I recommend creating a separate Python script that defines them, and then loading this script into the notebook when needed. The custom Transformers are defined in the file *custom_transformers.py*, located in a directory named *scripts*, found in the working directory of this project.

### Final Data Format
At the end of the Process portion of the Pipeline, we need our DataFrame to have two columns: a column of Doubles named **label** (our target variable), and a column of Vectors of Doubles named **features** (our processed text). This is the format our SageMaker XGBoost algorithm will understand.

### Target Variable 
Currently our target variable is a string-type value labeled either spam or ham. We need to convert this to a double-type value of 1 or 0. For this we defined a custom Transformer, shown below.

#### Custom Transformer: Binary Target Variable

In [17]:
# Insert Custom Transformer Text

### Text Normalization
With Text Normalization, we will process the raw text to provide a quality input for our model. I used the blog post [**Spam classification using Spark’s DataFrames, ML and Zeppelin (Part 1)**](https://blog.codecentric.de/en/2016/06/spam-classification-using-sparks-dataframes-ml-zeppelin-part-1/) by Daniel Pape, accessed on 4/16/2020, as guidance. It provided a good framework, particularly for handling the types of text you find in an SMS message, such as emoticons.

To normalize the text, there are several steps we plan on taking:

1. Convert all text to lowercase
2. Convert all numbers to the text **_" normalized_number "_**
3. Convert all emoticons to the text **_" normalized_emoticon "_**
4. Convert all currency symbols to the text **_" normalized_currency_symbol "_**
5. Convert all links to the text **_" normalized_url "_**
6. Convert all email addresses to the text **_" normalized_emailaddress "_**
7. Convert all diamond/question mark symbols to the text **_" normalized_doamond_symbol "_**
8. Remove HTML characters
9. Remove punctuation

These 9 steps can be executed using two custom Transformers. The first will convert all text to lowercase.

#### Custom Transformer: Convert Text to Lower Case

In [18]:
# Insert Custom Transformer Text

The second custom Transformer will execute steps two through nine. In addition to taking in an input and output column, it also takes a regular expression and a replacement string. This Transformer searches for the regular expression and replaces anything that matches with the string. So, we will call this custom Transformer 8 times, redefining the regular expression and replacement string for the appropriate text normalization.

#### Custom Transformer: Replace Text

In [19]:
# Insert Custom Transformer Text

To better organize the regex strings and normalized text associated with each point above, I saved each regular expression as a variable and created a dictionary cataloging them.

In [69]:
# List of HTML text
html_list = ["&lt;", "&gt;", "&amp;", "&cent;", "&pound;", "&yen;", "&euro;", "&copy;", "&reg;"]

# Regex Expressions for normalizing text
regex_email = "\\w+(\\.|-)*\\w+@.*\\.(com|de|uk)"
# Raw string avoids invalid-escape warnings; ";\)" and ";-\)" are now separate alternatives
regex_emoticon = r":\)|:-\)|:\(|:-\(|;\)|;-\)|:-O|8-|:P|:D|:\||:S|:\$|:@|8o\||\+o\(|\(H\)|\(C\)|\(\?\)"
regex_number = "\\d+"
regex_punctuation ="[\\.\\,\\:\\-\\!\\?\\n\\t,\\%\\#\\*\\|\\=\\(\\)\\\"\\>\\<\\/]"
regex_currency = "[\\$\\€\\£]"
regex_url =  "(http://|https://)?www\\.\\w+?\\.(de|com|co.uk)"
regex_diamond_question = "�"
regex_html = "|".join(html_list)

# Dictionary of Normalized Text and Regex Expressions
dict_norm = {
    regex_emoticon : " normalized_emoticon ",
    regex_email : " normalized_emailaddress ",
    regex_number : " normalized_number ",
    regex_punctuation : " ",
    regex_currency : " normalized_currency_symbol ",
    regex_url: " normalized_url ",
    regex_diamond_question : " normalized_doamond_symbol ",
    regex_html : " "
}
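
To get a feel for what these replacements do to a message, here is a plain-Python sketch using the standard `re` module. It is only an illustration under stated assumptions, not the custom Transformer from *custom_transformers.py*. The rule list is a trimmed-down, hypothetical subset of the dictionary above, the sample message is made up, and the final whitespace collapse is added just to make the output easier to read. Note that the order of the rules matters: the URL pattern runs before punctuation removal, which would otherwise strip the dots it relies on.

```python
import re

# Hypothetical, trimmed-down rule list: (pattern, replacement), applied in order
RULES = [
    (r"(http://|https://)?www\.\w+?\.(de|com|co\.uk)", " normalized_url "),
    (r"\d+", " normalized_number "),
    (r"[$€£]", " normalized_currency_symbol "),
    (r"[.,:\-!?]", " "),
]


def normalize(text):
    """Lowercase the text, then apply each regex replacement in order."""
    text = text.lower()
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    # Collapse the extra spaces introduced by the padded replacement strings
    return " ".join(text.split())


print(normalize("WIN £100 now! Visit www.prize.com"))
```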

Finally, we create one more custom Transformer that selects the **label** and **features** columns from our processed data.

#### Custom Transformer: Select Columns

In [48]:
# Insert Custom Transformer Text

### Define Pipeline Stages
Now that we have gone over each Transformer, we need to initialize them before incorporating them into the pipeline. But first, we need to run our script of custom Transformers, so that our notebook recognizes them.

In [123]:
%run '../scripts/custom_transformers.py'

With our custom Transformers loaded, we can initialize all of our custom processing Transformers.

In [124]:
output_col = 'features'

stage_binary_target=BinaryTransform(inputCol="class")

stage_lowercase_text=LowerCase(inputCol= "sms", outputCol=output_col)

stage_norm_emot=NormalizeText(inputCol=stage_lowercase_text.getOutputCol(), 
                               outputCol=output_col, 
                               normal_text=dict_norm[regex_emoticon],
                               regex_replace_string=regex_emoticon)

stage_norm_email=NormalizeText(inputCol=stage_norm_emot.getOutputCol(), 
                               outputCol=output_col, 
                               normal_text=dict_norm[regex_email],
                               regex_replace_string=regex_email)

stage_norm_num=NormalizeText(inputCol=stage_norm_email.getOutputCol(), 
                               outputCol=output_col, 
                               normal_text=dict_norm[regex_number],
                               regex_replace_string=regex_number)

stage_norm_punct=NormalizeText(inputCol=stage_norm_num.getOutputCol(), 
                               outputCol=output_col, 
                               normal_text=dict_norm[regex_punctuation],
                               regex_replace_string=regex_punctuation)

stage_norm_cur=NormalizeText(inputCol=stage_norm_punct.getOutputCol(), 
                               outputCol=output_col, 
                               normal_text=dict_norm[regex_currency],
                               regex_replace_string=regex_currency)

stage_norm_url=NormalizeText(inputCol=stage_norm_cur.getOutputCol(), 
                               outputCol=output_col, 
                               normal_text=dict_norm[regex_url],
                               regex_replace_string=regex_url)

stage_norm_diamond=NormalizeText(inputCol=stage_norm_url.getOutputCol(), 
                               outputCol=output_col, 
                               normal_text=dict_norm[regex_diamond_question],
                               regex_replace_string=regex_diamond_question)

We then initialize our predefined Transformers.

In [125]:
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF

# Bag of Words Tokens
tokens = Tokenizer(inputCol=output_col, outputCol='tokens')

# Remove Stopwords
stop_words = StopWordsRemover(inputCol=tokens.getOutputCol(), 
                           outputCol="tokens_filtered")

# Term Frequency Hash
hashingTF = HashingTF(inputCol=stop_words.getOutputCol(),
                      outputCol="tokens_hashed", numFeatures=1000)

# IDF 
idf = IDF(minDocFreq=2, inputCol=hashingTF.getOutputCol(), outputCol="features_tfidf")

# Final Processing Step - Custom
stage_column_select=ColumnSelect(inputCol=idf.getOutputCol())
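
For intuition about what the `HashingTF` and `IDF` stages compute, here is a rough plain-Python sketch. It is an approximation under stated assumptions, not Spark's implementation: Spark's `HashingTF` uses MurmurHash3, while this sketch substitutes SHA-256 so it runs with only the standard library, so the bucket indices will not match Spark's. The IDF formula follows the one given in the Spark documentation, log((|D| + 1) / (DF + 1)). Both function names are hypothetical.

```python
import hashlib
import math
from collections import Counter


def hashed_term_counts(tokens, num_features=1000):
    """Map each token to a bucket index and count occurrences per bucket."""
    counts = Counter()
    for token in tokens:
        # SHA-256 is a stand-in hash here; Spark itself uses MurmurHash3
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        counts[int(digest, 16) % num_features] += 1
    return dict(counts)


def inverse_document_frequency(num_docs, doc_freq):
    """IDF as documented for Spark: log((|D| + 1) / (DF + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))


print(hashed_term_counts(["free", "entry", "free"], num_features=1000))
print(inverse_document_frequency(num_docs=3, doc_freq=1))
```

A smaller `num_features` keeps the vectors compact but increases the chance that two different tokens land in the same bucket.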

In [126]:
from pyspark.ml import Pipeline

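# Each stage instance appears once. The URL and email stages run before
# punctuation removal, which would otherwise strip the dots their patterns need.
pipeline=Pipeline(stages=[stage_binary_target, stage_lowercase_text, 
                               stage_norm_emot, stage_norm_url,
                               stage_norm_email, stage_norm_num,
                               stage_norm_punct, stage_norm_cur,
                               stage_norm_diamond, tokens, stop_words,
                               hashingTF, idf, stage_column_select]
                              )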

In [127]:
pipeline_fit = pipeline.fit(df)
df_pipeline = pipeline_fit.transform(df)
df_pipeline.show(10)
df_pipeline.printSchema()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(1000,[7,77,150,1...|
|  0.0|(1000,[20,372,484...|
|  1.0|(1000,[35,73,128,...|
|  0.0|(1000,[57,372,526...|
|  0.0|(1000,[210,340,37...|
|  1.0|(1000,[91,99,120,...|
|  0.0|(1000,[47,57,330,...|
|  0.0|(1000,[71,92,192,...|
|  1.0|(1000,[39,43,74,1...|
|  1.0|(1000,[73,146,224...|
+-----+--------------------+
only showing top 10 rows

root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

