# Build Machine Learning Pipeline using SageMaker, EMR, S3, ECS, API Gateway and Lambda



## Module-3: Preprocess Data, Build and Train a Spark Pipeline

Begin by running the following cell to confirm that this notebook is connected to EMR:

In [None]:
%%info

### Step 1: Define Project Name and S3 Bucket

You're going to need two pieces of information going forward.  First, you need a name for the model you're working with -- you can name it anything you prefer, within the naming limitations.  You also need to pick an S3 bucket that you'll be uploading your model artifacts to.

#### Substep a. Select a Model name, S3 bucket name, and an AWS region
You can decide what to name the model and the S3 bucket. This example uses 'sm-emr-e2e-model' as the model name, but you must specify your own bucket name because S3 bucket names are globally unique. You will also pick an AWS region. The S3 bucket is created in your chosen AWS region in Step 3.

In [None]:
%%local
model_name = 'sm-emr-e2e-model' # change the model name to something unique
s3_bucket = '<user defined S3 bucket>' # change the s3 bucket to something unique
aws_region = 'us-east-2' # region where the S3 bucket will be created (this example uses us-east-2)
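
The naming limitations mentioned in this step can be checked before any AWS call is made. The following is a minimal sketch, assuming the general-purpose S3 bucket naming rules (3 to 63 characters; lowercase letters, digits, hyphens and dots; starting and ending with a letter or digit; no consecutive dots; not formatted like an IP address). The helper name `is_valid_bucket_name` is hypothetical and not part of boto3.

```python
import re

# Hypothetical helper (not part of boto3): checks a name against the
# general-purpose S3 bucket naming rules assumed in the lead-in.
_BUCKET_RE = re.compile(r'[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]')
_IP_RE = re.compile(r'\d{1,3}(\.\d{1,3}){3}')

def is_valid_bucket_name(name):
    """Return True if `name` looks like a valid S3 bucket name."""
    if not _BUCKET_RE.fullmatch(name):
        return False  # wrong length, characters, or first/last character
    if '..' in name:
        return False  # adjacent dots are not allowed
    if _IP_RE.fullmatch(name):
        return False  # must not look like an IP address
    return True

print(is_valid_bucket_name('sm-emr-e2e-bucket-1234'))    # a well-formed name
print(is_valid_bucket_name('<user defined S3 bucket>'))  # the placeholder is rejected
```

A check like this only covers the syntax of the name; whether the name is still available globally is known only once the bucket is actually created.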

#### Substep b. Define the model name and S3 bucket on EMR

This notebook runs code both on EMR and on the notebook instance (the cells marked `%%local`), so you need to define the same model name and the same S3 bucket on EMR as well.

In [None]:
# EMR Cell
model_name = 'sm-emr-e2e-model' # same as above
s3_bucket = '<user defined S3 bucket>' # same as above

### Step 2: Set up Environment Variables for Model Name and S3 Bucket on the SageMaker Notebook Instance

The `os` module provides a portable way of using operating-system-dependent functionality. Storing the model name and the S3 bucket as environment variables lets you reference them from bash commands in subsequent steps.


In [None]:
%%local
import os

os.environ['SAGEMAKER_MODEL_NAME'] = model_name
os.environ['SAGEMAKER_S3_BUCKET'] = s3_bucket
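
The reason this works is that child processes inherit the environment of the process that starts them. The sketch below illustrates that inheritance with a placeholder value; it launches a second Python interpreter rather than bash so that it does not depend on a particular shell being installed.

```python
import os
import subprocess
import sys

# Placeholder value for illustration; the notebook uses `model_name` from Step 1.
os.environ['SAGEMAKER_MODEL_NAME'] = 'sm-emr-e2e-model'

# A child process receives a copy of os.environ, so it can read the variable
# that was set in the parent process above.
child = subprocess.run(
    [sys.executable, '-c', "import os; print(os.environ['SAGEMAKER_MODEL_NAME'])"],
    capture_output=True, text=True, check=True,
)
seen_by_child = child.stdout.strip()
print(seen_by_child)
```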

### Step 3: Create the S3 bucket from the Notebook instance

The Python SDK 'boto3' lets you connect to S3 from the SageMaker notebook instance. The code below creates the S3 bucket defined in Step 1 to store the model and its artifacts. The bucket is private and can be found in the S3 console of your AWS account.


In [None]:
%%local
import boto3

s3 = boto3.resource('s3')
# Create the bucket in the region set in aws_region (Step 1)
s3.create_bucket(Bucket=s3_bucket, CreateBucketConfiguration={'LocationConstraint': aws_region})
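
One caveat with `create_bucket`: S3 does not accept `us-east-1` as a `LocationConstraint`, so for that region the `CreateBucketConfiguration` argument has to be omitted. A minimal sketch of handling both cases follows. The helper `bucket_kwargs` and the bucket name are hypothetical, introduced only for illustration, and the actual AWS call is left commented out so the block runs without credentials.

```python
def bucket_kwargs(bucket_name, region):
    """Build keyword arguments for boto3's create_bucket (hypothetical helper)."""
    kwargs = {'Bucket': bucket_name}
    if region != 'us-east-1':
        # Regions other than us-east-1 are passed as an explicit LocationConstraint.
        kwargs['CreateBucketConfiguration'] = {'LocationConstraint': region}
    return kwargs

# Placeholder bucket name, for illustration only.
print(bucket_kwargs('my-example-bucket', 'us-east-2'))
print(bucket_kwargs('my-example-bucket', 'us-east-1'))

# In the notebook this could be used as:
# s3.create_bucket(**bucket_kwargs(s3_bucket, aws_region))
```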

#### Verify your work:
You can verify that the S3 bucket was created successfully by running the code block below. If the bucket exists, its name is printed. Otherwise, go back to the previous steps and make sure you used the same S3 bucket name throughout.

In [None]:
%%local
import boto3

for bucket in s3.buckets.all():
    if bucket.name == s3_bucket:
        print(bucket.name + " was successfully created in your S3 account!")

### Step 4: Load data into S3 from the notebook instance

You will use the Enron email data set for this example. The data set consists of 1,227,255 emails with 493,384 attachments covering 151 custodians. It is already provided in the “data” folder of the Git repository that you cloned earlier.

Let's upload the data to an S3 Bucket.


In [None]:
%%local
import sagemaker as sage

sess = sage.Session()

os.chdir("/home/ec2-user/SageMaker/sm-emr-e2e")

prefix = model_name
WORK_DIRECTORY = "data"
# No bucket argument is given, so the data goes to the SageMaker session's default bucket.
data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=prefix)


#### Verify your work:

Run the following code to list the uploaded data files from your S3 bucket. The data files must be in *.parquet* format.

In [None]:
%%local
import boto3

for bucket in s3.buckets.all():
    for obj in bucket.objects.filter(Prefix=model_name):
        print('{0}:{1}'.format(bucket.name, obj.key))

### Step 5: Read the data in EMR Spark Cluster

Now that the data is uploaded to S3, you need its S3 path so that you can read the data directly from the EMR cluster.

#### Substep a. Copy the S3 bucket file path
The S3 bucket file path is required to read the data on EMR Spark. Run the cell below to print it, then copy and paste the printed string into Substep b.

In [None]:
%%local
print("Copy the following S3 location to the EMR cell below: "+"'"+data_location+'/enron.parquet'+"'")


#### Substep b. Read the data in the EMR Spark cluster
Paste the S3 bucket file path printed above into the cell below to read the data on EMR Spark.


In [None]:
# EMR cell
df = spark.read.parquet("<copy and paste the S3 location here>")

#### Verify your work:
Let's take a look at the DataFrame to make sure everything loaded correctly.

In [None]:
# EMR cell
df.show(5)

### Step 6: Split the data into bag of words

You will split the content of each email into a bag of words to prepare the data for model training.

This example uses only about 10% of the total data as a sample so that it can run on a small EMR cluster. If you want to run this exercise on the whole data set, consider scaling up your EMR cluster and then changing the fraction value in the code block to *1.0*.

In [None]:
# EMR Cell
from pyspark.sql.functions import split, col
sample = df.sample(withReplacement=False, fraction=0.1).withColumn('bow', split(col('content'), ' '))
sample.count()
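
To see what the `bow` column holds, the split can be imitated in plain Python. This is a sketch only: it uses `re.split` on a made-up sentence as a stand-in for Spark's `split`, which also treats its second argument as a regular expression. In this sketch, splitting on a single space keeps punctuation attached to the words and produces an empty token wherever two spaces are adjacent.

```python
import re

# Made-up email text for illustration; not taken from the Enron data set.
content = "Please review the  attached report."

# Split on a single space, mirroring split(col('content'), ' ').
bow = re.split(' ', content)
print(bow)
```

If such empty or punctuated tokens matter for your use case, a different split pattern or an extra cleaning step may be worth considering.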

### Step 7: Build a Spark pipeline and fit it to the sample data set on EMR

The pipeline in this step consists of a single Word2Vec stage, which converts each bag of words into a fixed-length feature vector. Once the model is fit, you transform the sample with it, write the resulting vectors (referred to here as hashes) to a Parquet file, and save that file in S3.

#### Substep a: Build and fit a Spark pipeline
Build a basic Spark pipeline and fit it to the sample DataFrame.

In [None]:
# EMR Cell
from pyspark.ml import Pipeline
from pyspark.ml.feature import Word2Vec

word2vec = Word2Vec(inputCol='bow', outputCol='features', vectorSize=10)

pipeline = Pipeline(stages=[word2vec])

# Fit the pipeline to the sample data.
model = pipeline.fit(sample)
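
According to the Spark ML documentation, a fitted `Word2VecModel` turns each document into a single vector by averaging the vectors of its words. The following sketch shows that averaging in plain Python. It assumes a made-up two-dimensional vocabulary (the pipeline above uses `vectorSize=10`) and that every word is in the vocabulary; `document_vector` is a hypothetical helper, not a Spark API.

```python
# Toy word vectors, invented for illustration (real ones are learned by fit()).
word_vectors = {
    'meeting': [1.0, 2.0],
    'agenda':  [3.0, 4.0],
}

def document_vector(words, vectors):
    """Average the vectors of `words`; assumes every word is in `vectors`."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in words:
        for i, value in enumerate(vectors[word]):
            total[i] += value
    return [component / len(words) for component in total]

features = document_vector(['meeting', 'agenda'], word_vectors)
print(features)
```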

#### Substep b: Convert it to a parquet file
Apply the fitted model to the sample to produce the hashes, and save them to S3 as a Parquet file.


In [None]:
# EMR Cell
hashes = model.transform(sample)
hashes.write.parquet("s3://"+s3_bucket+"/models/hashes.parquet", mode='overwrite')


#### Verify your work:

Run the following code block to list the .parquet files that are saved in your S3 bucket.

In [None]:
%%local
import boto3

for bucket in s3.buckets.all():
    if bucket.name == s3_bucket:
        # Filter by key prefix; S3 object keys do not start with a slash.
        for obj in bucket.objects.filter(Prefix='models/hashes.parquet/'):
            print('{0}:{1}'.format(bucket.name, obj.key))

### Step 8: Save the trained model to S3 on EMR

Note that the fit function above ran on EMR in Spark, not on the SageMaker notebook instance. Once EMR has finished training, run the cell below to save your trained model to S3.

In [None]:
# EMR Cell
model.write().overwrite().save("s3://" + s3_bucket + "/models/" + model_name +  ".model")
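
The URI passed to `save` combines the bucket name with an object key prefix. The short sketch below, using placeholder names, shows how such a URI breaks down. Note that the key part carries no leading slash, which matters when filtering objects by prefix with boto3. The helper `split_s3_uri` is hypothetical.

```python
from urllib.parse import urlparse

def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3':
        raise ValueError('not an S3 URI: ' + uri)
    return parsed.netloc, parsed.path.lstrip('/')

# Placeholder values standing in for s3_bucket and model_name.
example_bucket = 'my-example-bucket'
example_model = 'sm-emr-e2e-model'

uri = "s3://" + example_bucket + "/models/" + example_model + ".model"
bucket_part, key_part = split_s3_uri(uri)
print(bucket_part)
print(key_part)
```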

#### Verify your work:

Let's check that the trained model was saved to S3 successfully. Run the following code cell to list the trained model (stored under a *.model* prefix) from your S3 bucket along with the *.parquet* files.

In [None]:
%%local
import boto3

for bucket in s3.buckets.all():
    if bucket.name == s3_bucket:
        for obj in bucket.objects.filter(Prefix='models/'):
            print('{0}:{1}'.format(bucket.name, obj.key))


### Step 9: Save model artifacts in a tar.gz format using Bash commands


#### Substep a. Convert the model artifacts to tar.gz format

SageMaker requires model artifacts to be in tar.gz format. Run the cell below to copy the model down from S3, then archive and compress it before sending it back to the S3 working directory.

In [None]:
%%bash

mkdir -p artifacts/$SAGEMAKER_MODEL_NAME.model
mkdir -p artifacts/hashes

cd artifacts/hashes
aws s3 cp --recursive s3://$SAGEMAKER_S3_BUCKET/models/hashes.parquet ./
cd ../..

cd artifacts/$SAGEMAKER_MODEL_NAME.model
aws s3 cp --recursive s3://$SAGEMAKER_S3_BUCKET/models/$SAGEMAKER_MODEL_NAME.model ./

cd ../..
tar -cvvf $SAGEMAKER_MODEL_NAME.model.tar ./artifacts
gzip -f $SAGEMAKER_MODEL_NAME.model.tar
aws s3 cp $SAGEMAKER_MODEL_NAME.model.tar.gz s3://$SAGEMAKER_S3_BUCKET/models/$SAGEMAKER_MODEL_NAME.model.tar.gz
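
The archive-and-compress part of the cell above can also be done from Python with the standard-library `tarfile` module. The sketch below is an alternative to the `tar` and `gzip` commands, not what the notebook runs. It builds a throwaway `artifacts` directory with a dummy file in a temporary location so that it is runnable on its own; all file names in it are placeholders.

```python
import os
import tarfile
import tempfile

# Work in a temporary directory with a dummy artifact (illustration only).
workdir = tempfile.mkdtemp()
model_dir = os.path.join(workdir, 'artifacts', 'example.model')
os.makedirs(model_dir)
with open(os.path.join(model_dir, 'part-00000'), 'w') as f:
    f.write('dummy model data')

# Mode 'w:gz' writes a gzip-compressed tar archive in a single step.
archive_path = os.path.join(workdir, 'example.model.tar.gz')
with tarfile.open(archive_path, 'w:gz') as tar:
    tar.add(os.path.join(workdir, 'artifacts'), arcname='artifacts')

# List the archive members to confirm the layout.
with tarfile.open(archive_path, 'r:gz') as tar:
    members = tar.getnames()
print(members)
```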

#### Substep b. Record the S3 location of the artifacts

Finally, store the S3 location of the tar.gz artifacts in an environment variable so that it can be referenced from bash commands.

In [None]:
%%local
os.environ['SAGEMAKER_ARTIFACTS'] = "s3://" + s3_bucket + "/models/" + model_name + ".model.tar.gz"

#### Verify your work:

Run the code below to print the tar.gz file from your S3 bucket. If the file is not present in your S3 bucket, go back and re-run the previous steps to make sure everything ran correctly.

In [None]:
%%local
import boto3

for bucket in s3.buckets.all():
    if bucket.name == s3_bucket:
        for obj in bucket.objects.filter(Prefix='models/'):
            # Match the exact key of the archive uploaded in Substep a.
            if obj.key == "models/" + model_name + ".model.tar.gz":
                print('{0}:{1}'.format(bucket.name, obj.key))

# Return to the Project Module 4

Congratulations! You have now trained a Spark model in EMR and saved it to S3. You can now return to the project website to begin Module 4 and deploy an endpoint.