# Feature transformation with Amazon SageMaker Processing and SparkML

Typically, a machine learning (ML) process consists of a few steps: gathering data with various ETL jobs, pre-processing the data, featurizing the dataset by incorporating standard techniques or prior knowledge, and finally training an ML model using an algorithm.

Often, distributed data processing frameworks such as Spark are used to pre-process data sets in order to prepare them for training. In this notebook we'll use Amazon SageMaker Processing, and leverage the power of Spark in a managed SageMaker environment to run our preprocessing workload. Then, we'll take our preprocessed dataset and train a regression model using XGBoost.

![](images/processing.jpg)


## Contents

1. [Objective](#Objective:-predict-the-age-of-an-Abalone-from-its-physical-measurement)
1. [Setup](#Setup)
1. [Using Amazon SageMaker Processing to execute a SparkML Job](#Using-Amazon-SageMaker-Processing-to-execute-a-SparkML-Job)
  1. [Downloading dataset and uploading to S3](#Downloading-dataset-and-uploading-to-S3)
  1. [Build a Spark container for running the preprocessing job](#Build-a-Spark-container-for-running-the-preprocessing-job)
  1. [Run the preprocessing job using Amazon SageMaker Processing](#Run-the-preprocessing-job-using-Amazon-SageMaker-Processing)
    1. [Inspect the preprocessed dataset](#Inspect-the-preprocessed-dataset)
1. [Train a regression model using the Amazon SageMaker XGBoost algorithm](#Train-a-regression-model-using-the-SageMaker-XGBoost-algorithm)
  1. [Retrieve the XGBoost algorithm image](#Retrieve-the-XGBoost-algorithm-image)
  1. [Set XGBoost model parameters and dataset details](#Set-XGBoost-model-parameters-and-dataset-details)
  1. [Train the XGBoost model](#Train-the-XGBoost-model)

## Setup

Add the following policies to your SageMaker Execution Role:  
* `EC2ContainerRegistry`
* Permissions: `List`, `Read`, `Write` 
* Repository:  `amazon-reviews-spark-processor`

Let's start by specifying:
* The S3 bucket and prefixes that you use for training and model data. Use the default bucket specified by the Amazon SageMaker session.
* The IAM role ARN used to give processing and training access to the dataset.

In [145]:
import sagemaker
import boto3

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()
region = boto3.Session().region_name



## Using Amazon SageMaker Processing to execute a SparkML job

### Downloading dataset and uploading to Amazon Simple Storage Service (Amazon S3)

Show the dataset

#### Get the name of the previous processing job to inspect its output

In [214]:
previous_processing_job_name = 'sagemaker-scikit-learn-2020-03-09-21-39-23-619'

In [215]:
!aws s3 ls s3://$bucket/$previous_processing_job_name/output/

                           PRE raw-labeled-split-balanced-header-test/
                           PRE raw-labeled-split-balanced-header-train/
                           PRE raw-labeled-split-balanced-header-validation/
                           PRE raw-labeled-split-unbalanced-header-test/
                           PRE raw-labeled-split-unbalanced-header-train/
                           PRE raw-labeled-split-unbalanced-header-validation/


In [148]:
balanced_train_data_input = 's3://{}/{}/output/raw-labeled-split-balanced-header-train/'.format(bucket, previous_processing_job_name)
balanced_validation_data_input = 's3://{}/{}/output/raw-labeled-split-balanced-header-validation/'.format(bucket, previous_processing_job_name)
balanced_test_data_input = 's3://{}/{}/output/raw-labeled-split-balanced-header-test/'.format(bucket, previous_processing_job_name)
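
The three input prefixes follow one pattern, so they can also be built in a loop. The sketch below is one way to do that. `split_input_uris` is a hypothetical helper (not part of the SageMaker SDK), and `my-bucket` in the usage line is a placeholder.

```python
def split_input_uris(bucket, job_name, balance="balanced",
                     splits=("train", "validation", "test")):
    """Map each split name to the S3 prefix written by the previous job."""
    template = "s3://{}/{}/output/raw-labeled-split-{}-header-{}/"
    return {split: template.format(bucket, job_name, balance, split)
            for split in splits}


# Placeholder bucket; the job name is the one used in this notebook.
uris = split_input_uris("my-bucket",
                        "sagemaker-scikit-learn-2020-03-09-21-39-23-619")
print(uris["train"])
```

Passing `balance="unbalanced"` yields the unbalanced prefixes shown in the listing above.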

In [149]:
!aws s3 ls $balanced_train_data_input

2020-03-09 21:53:40  898897521 part-algo-1-amazon_reviews_us_Apparel_v1_00.csv
2020-03-09 21:53:40  101620078 part-algo-1-amazon_reviews_us_Digital_Music_Purchase_v1_00.csv
2020-03-09 21:53:40  493981090 part-algo-1-amazon_reviews_us_Home_Improvement_v1_00.csv
2020-03-09 21:53:40  172364050 part-algo-1-amazon_reviews_us_Musical_Instruments_v1_00.csv
2020-03-09 21:53:40  789101028 part-algo-1-amazon_reviews_us_Toys_v1_00.csv
2020-03-09 21:53:42 1144801434 part-algo-10-amazon_reviews_us_Digital_Ebook_Purchase_v1_01.csv
2020-03-09 21:53:42  265389749 part-algo-10-amazon_reviews_us_Home_Entertainment_v1_00.csv
2020-03-09 21:53:42  968972518 part-algo-10-amazon_reviews_us_Music_v1_00.csv
2020-03-09 21:53:42  287669682 part-algo-10-amazon_reviews_us_Tools_v1_00.csv
2020-03-09 21:55:55  513054699 part-algo-2-amazon_reviews_us_Automotive_v1_00.csv
2020-03-09 21:55:55   39397920 part-algo-2-amazon_reviews_us_Digital_Software_v1_00.csv
2020-03-09 21:55:55 1056545394 part-algo-2-amazon

### Build a Spark container for running the preprocessing job

An example Spark container is included in the `./container` directory of this example. The container handles the bootstrapping of all Spark configuration, and serves as a wrapper around the `spark-submit` CLI. At a high level the container provides:
* A set of default Spark/YARN/Hadoop configurations
* A bootstrapping script for configuring and starting up Spark master/worker nodes
* A wrapper around the `spark-submit` CLI to submit a Spark application
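
The container's own wrapper script is not reproduced in this notebook. As a rough illustration of what a wrapper around the `spark-submit` CLI can look like, the hypothetical function below assembles a `spark-submit` command line for a PySpark script running on YARN. The function name, the `client` deploy mode, and the example S3 locations are assumptions made for illustration only.

```python
import shlex


def build_spark_submit_command(script_path, app_args=(), master="yarn",
                               deploy_mode="client"):
    """Assemble a spark-submit invocation as an argument list (hypothetical)."""
    command = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    command.append(script_path)
    # Everything after the script path is passed to the application itself.
    command.extend(app_args)
    return command


cmd = build_spark_submit_command(
    "/opt/ml/processing/input/code/preprocess-spark.py",
    ["s3_input_data", "s3://my-bucket/in/",
     "s3_output_data", "s3://my-bucket/out/"],
)
print(" ".join(shlex.quote(part) for part in cmd))
```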


After the container build and push process is complete, use the Amazon SageMaker Python SDK to submit a managed, distributed Spark application that performs our dataset preprocessing.

Build the example Spark container.

In [150]:
docker_repo = 'amazon-reviews-spark-processor'
docker_tag = 'latest'

In [151]:
!docker build -t $docker_repo:$docker_tag -f container/Dockerfile ./container

Sending build context to Docker daemon  3.022MB
Step 1/33 : FROM openjdk:8-jre-slim
 ---> 9c82c74fbc96
Step 2/33 : RUN apt-get update
 ---> Using cache
 ---> ca0a6099c443
Step 3/33 : RUN apt-get install -y curl unzip python3 python3-setuptools python3-pip python-dev python3-dev python-psutil
 ---> Using cache
 ---> cc1fd71bd88c
Step 4/33 : RUN pip3 install py4j psutil==5.6.5 numpy==1.17.4
 ---> Using cache
 ---> 376699e73ced
Step 5/33 : RUN apt-get clean
 ---> Using cache
 ---> 1ea11fb14632
Step 6/33 : RUN rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> ea880623fd34
Step 7/33 : ENV PYTHONHASHSEED 0
 ---> Using cache
 ---> 7d1b53453a5e
Step 8/33 : ENV PYTHONIOENCODING UTF-8
 ---> Using cache
 ---> 12cfee88f392
Step 9/33 : ENV PIP_DISABLE_PIP_VERSION_CHECK 1
 ---> Using cache
 ---> 8010c768a0e5
Step 10/33 : ENV HADOOP_VERSION 3.0.0
 ---> Using cache
 ---> 0e10f67d8992
Step 11/33 : ENV HADOOP_HOME /usr/hadoop-$HADOOP_VERSION
 ---> Using cache
 ---> 19f1a57f792c
Step 12/33 : ENV HADOOP

Create an Amazon Elastic Container Registry (Amazon ECR) repository for the Spark container and push the image.

In [152]:
import boto3
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name

image_uri = '{}.dkr.ecr.{}.amazonaws.com/{}:{}'.format(account_id, region, docker_repo, docker_tag)
print(image_uri)

835319576252.dkr.ecr.us-east-1.amazonaws.com/amazon-reviews-spark-processor:latest


Log in to Amazon ECR, create the repository if it does not already exist, and push the Docker image.

In [153]:
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded


In [154]:
!aws ecr describe-repositories --repository-names $docker_repo || aws ecr create-repository --repository-name $docker_repo

{
    "repositories": [
        {
            "repositoryArn": "arn:aws:ecr:us-east-1:835319576252:repository/amazon-reviews-spark-processor",
            "registryId": "835319576252",
            "repositoryName": "amazon-reviews-spark-processor",
            "repositoryUri": "835319576252.dkr.ecr.us-east-1.amazonaws.com/amazon-reviews-spark-processor",
            "createdAt": 1581702437.0,
            "imageTagMutability": "MUTABLE",
            "imageScanningConfiguration": {
                "scanOnPush": false
            }
        }
    ]
}
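
The shell one-liner above describes the repository and creates it only when the lookup fails. A minimal Python sketch of the same idea follows, assuming a boto3 ECR client is passed in. `ensure_ecr_repository` is a hypothetical helper name.

```python
def ensure_ecr_repository(ecr_client, repository_name):
    """Return the repository URI, creating the repository if it is missing."""
    try:
        response = ecr_client.describe_repositories(
            repositoryNames=[repository_name])
        return response["repositories"][0]["repositoryUri"]
    except ecr_client.exceptions.RepositoryNotFoundException:
        # The lookup failed because the repository does not exist yet.
        response = ecr_client.create_repository(repositoryName=repository_name)
        return response["repository"]["repositoryUri"]


# Example usage (requires AWS credentials and ECR permissions):
# import boto3
# print(ensure_ecr_repository(boto3.client("ecr"),
#                             "amazon-reviews-spark-processor"))
```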


In [155]:
!docker tag $docker_repo:$docker_tag $image_uri

In [156]:
!docker push $image_uri

The push refers to repository [835319576252.dkr.ecr.us-east-1.amazonaws.com/amazon-reviews-spark-processor]

4377c451: Preparing
268138d8: Preparing
35559781: Preparing
6b36b7a4: Preparing
782557b4: Preparing
8eb2663a: Preparing
5ae17e8e: Preparing
89b8c28a: Preparing
d604f04b: Preparing
a936c4d8: Preparing
f855c32d: Preparing
964f7673: Preparing
0d7e7b4a: Preparing
a6e6c92c: Preparing
fecc21b1: Layer already exists
latest: digest: sha256:f6fe6ddad0942f0ba99e9298a5e5f39235b3de24ee169a250ac4d51a880f3669 size: 3472


### Run the preprocessing job using Amazon SageMaker Processing

Next, use the Amazon SageMaker Python SDK to submit a processing job. Use the Spark container that was just built, and a SparkML script for preprocessing in the job configuration.

Review the Spark preprocessing script.

In [157]:
cat preprocess-spark.py

from __future__ import print_function
from __future__ import unicode_literals

import time
import sys
import os
import shutil
import csv

import pyspark
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, DateType
from pyspark.sql.functions import *
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml.linalg import DenseVector
from pyspark.sql.functions import split
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, DoubleType
from pyspark.ml.feature import PCA, StandardScaler

def to_array(col):
    def to_array_internal(v):
        if v:
            return v.toArray().tolist()
        else:
            print('EmptyV: {}'.format(v))
            return []
    return udf(to_array_internal, ArrayType(DoubleType())).asNondeterministic()(col)

def main():
    spark = SparkSession.builder.appName('AmazonReviewsSp

In [224]:
from sagemaker.processing import ScriptProcessor

processor = ScriptProcessor(base_job_name='spark-amazon-reviews-processor',
                            image_uri=image_uri,
                            command=['/opt/program/submit'],
                            role=role,
                            instance_count=20, # instance_count needs to be > 1 or you will see the following error:  "INFO yarn.Client: Application report for application_ (state: ACCEPTED)"
                            instance_type='ml.r5.24xlarge',
#                            instance_type='ml.r5.8xlarge',
#                            max_runtime_in_seconds=600,
                            env={'mode': 'python'})

#### Set up input data

In [225]:
!aws s3 ls $balanced_train_data_input


#### Set up output data

In [226]:
from time import gmtime, strftime
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

# TODO:  Clean these up
#input_raw_prefix = 'sagemaker/spark-preprocess-reviews-demo/input/raw/reviews'
output_prefix = 'amazon-reviews-spark-processor-{}'.format(timestamp_prefix)
#model_prefix = 'sagemaker/spark-preprocess-reviews-demo/model'
processing_job_name = 'amazon-reviews-spark-processor-{}'.format(timestamp_prefix)

print('Processing job name:  {}'.format(processing_job_name))

Processing job name:  amazon-reviews-spark-processor-2020-03-10-07-08-01


In [234]:
balanced_train_data_tfidf_output = 's3://{}/{}/tfidf-labeled-split-balanced-noheader-train'.format(bucket, output_prefix)
balanced_validation_data_tfidf_output = 's3://{}/{}/tfidf-labeled-split-balanced-noheader-validation'.format(bucket, output_prefix)
balanced_test_data_tfidf_output = 's3://{}/{}/tfidf-labeled-split-balanced-noheader-test'.format(bucket, output_prefix)

print(balanced_train_data_tfidf_output)
print(balanced_validation_data_tfidf_output)
print(balanced_test_data_tfidf_output)

s3://sagemaker-us-east-1-835319576252/amazon-reviews-spark-processor-2020-03-10-07-08-01/tfidf-labeled-split-balanced-noheader-train
s3://sagemaker-us-east-1-835319576252/amazon-reviews-spark-processor-2020-03-10-07-08-01/tfidf-labeled-split-balanced-noheader-validation
s3://sagemaker-us-east-1-835319576252/amazon-reviews-spark-processor-2020-03-10-07-08-01/tfidf-labeled-split-balanced-noheader-test


#### Start the Spark processing job

_Notes on invoking from Lambda:_
* `processor.run()` uploads the script to S3 for us. If we call the boto3 SDK directly instead (for example, from a Lambda function), we need to copy the `preprocess.py` file to S3 ourselves and specify everything explicitly, including `--py-files`.
* We would need to do the following before invoking the Lambda function:
     !aws s3 cp preprocess.py s3://<location>/sagemaker/spark-preprocess-reviews-demo/code/preprocess.py
     !aws s3 cp preprocess.py s3://<location>/sagemaker/spark-preprocess-reviews-demo/py_files/preprocess.py
* Then reference the `s3://<location>` paths above in `--py-files` and the related options.
* See the Lambda example code in this same project for more details.
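
To make the Lambda notes more concrete, the sketch below assembles the request that boto3's `create_processing_job` expects, mirroring the fields that `describe()` returns in this notebook. It only builds the dictionary; the actual call is left as a comment. The helper name, the `code_s3_uri` parameter, and every example value are assumptions, and the entry point assumes that the container's `/opt/program/submit` wrapper takes the script path as its first argument.

```python
def build_processing_job_request(job_name, image_uri, role_arn, code_s3_uri,
                                 script_name, arguments,
                                 instance_count=2,
                                 instance_type="ml.r5.8xlarge"):
    """Build keyword arguments for boto3's create_processing_job (sketch)."""
    code_dir = "/opt/ml/processing/input/code"
    return {
        "ProcessingJobName": job_name,
        "RoleArn": role_arn,
        "AppSpecification": {
            "ImageUri": image_uri,
            # Wrapper command followed by the script's path in the container.
            "ContainerEntrypoint": ["/opt/program/submit",
                                    "{}/{}".format(code_dir, script_name)],
            "ContainerArguments": list(arguments),
        },
        # The script must already be in S3; it is downloaded to code_dir.
        "ProcessingInputs": [{
            "InputName": "code",
            "S3Input": {
                "S3Uri": code_s3_uri,
                "LocalPath": code_dir,
                "S3DataType": "S3Prefix",
                "S3InputMode": "File",
                "S3DataDistributionType": "FullyReplicated",
            },
        }],
        "ProcessingResources": {
            "ClusterConfig": {
                "InstanceCount": instance_count,
                "InstanceType": instance_type,
                "VolumeSizeInGB": 30,
            }
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 86400},
        "Environment": {"mode": "python"},
    }


# All values below are placeholders.
request = build_processing_job_request(
    job_name="spark-amazon-reviews-processor-example",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/amazon-reviews-spark-processor:latest",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    code_s3_uri="s3://my-bucket/code/preprocess.py",
    script_name="preprocess.py",
    arguments=["s3_input_data", "s3://my-bucket/in/",
               "s3_output_data", "s3://my-bucket/out/"],
)

# Inside a Lambda handler (requires AWS credentials and permissions):
# import boto3
# boto3.client("sagemaker").create_processing_job(**request)
```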

_Notes on not using ProcessingInput and ProcessingOutput:_
* Since Spark natively reads from and writes to S3 using `s3a://`, we can avoid the copy required by `ProcessingInput` and `ProcessingOutput` (`FullyReplicated` or `ShardedByS3Key`) and just pass the S3 input and output prefixes as arguments.
* See https://github.com/awslabs/amazon-sagemaker-examples/issues/994 for issues related to using `/opt/ml/processing/input/` and `output/`.
* If we use `ProcessingInput`, the data is copied to each node, which we don't want in this case since Spark already handles reading the data.
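
The URIs built in this notebook use the `s3://` scheme, while the note above refers to Spark's `s3a://` connector. The truncated script listing does not show whether it converts between the two, so the helper below is only an illustration of one way to do so. `to_s3a` is a hypothetical name.

```python
def to_s3a(uri):
    """Rewrite an s3:// URI to the s3a:// scheme; leave other URIs unchanged."""
    prefix = "s3://"
    if uri.startswith(prefix):
        return "s3a://" + uri[len(prefix):]
    return uri


print(to_s3a("s3://my-bucket/some/prefix/"))
```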

In [235]:
from sagemaker.processing import ProcessingOutput

# NOTE: The arguments below point at the balanced training split.
#       Swap in the validation input and output for a much smaller test run.

processor.run(code='preprocess-spark.py',
              arguments=['s3_input_data', balanced_train_data_input,
                         's3_output_data', balanced_train_data_tfidf_output,
              ],
              # We need this dummy output to allow us to call 
              #    ProcessingJob.from_processing_name() later 
              #    to describe the job and poll for Completed status
              outputs=[
                       ProcessingOutput(s3_upload_mode='EndOfJob',
                                        output_name='dummy-output',
                                        source='/opt/ml/processing/output')
              ],          
              logs=True,
              wait=False
)


Job Name:  spark-amazon-reviews-processor-2020-03-10-07-41-34-026
Inputs:  [{'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/spark-amazon-reviews-processor-2020-03-10-07-41-34-026/input/code/preprocess-spark.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'dummy-output', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/spark-amazon-reviews-processor-2020-03-10-07-41-34-026/output/dummy-output', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]
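
The job passes its arguments as alternating names and values without leading dashes (`s3_input_data <uri> s3_output_data <uri>`). The part of `preprocess-spark.py` that reads them is cut off in the listing above, so the following is a sketch of how a script could parse that shape, not the script's actual code. `parse_key_value_args` is a hypothetical helper.

```python
def parse_key_value_args(argv):
    """Turn ['name1', 'value1', 'name2', 'value2'] into a dict."""
    if len(argv) % 2 != 0:
        raise ValueError(
            "expected name/value pairs, got {} arguments".format(len(argv)))
    # Even positions hold names, odd positions hold the matching values.
    return dict(zip(argv[0::2], argv[1::2]))


example = parse_key_value_args(["s3_input_data", "s3://my-bucket/in/",
                                "s3_output_data", "s3://my-bucket/out/"])
print(example["s3_input_data"])
```

Inside a script, the same function could be called with `sys.argv[1:]`.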


In [236]:
preprocessing_job_description = processor.jobs[-1].describe()
print(preprocessing_job_description)

{'ProcessingInputs': [{'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/spark-amazon-reviews-processor-2020-03-10-07-41-34-026/input/code/preprocess-spark.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}], 'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'dummy-output', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/spark-amazon-reviews-processor-2020-03-10-07-41-34-026/output/dummy-output', 'LocalPath': '/opt/ml/processing/output', 'S3UploadMode': 'EndOfJob'}}]}, 'ProcessingJobName': 'spark-amazon-reviews-processor-2020-03-10-07-41-34-026', 'ProcessingResources': {'ClusterConfig': {'InstanceCount': 20, 'InstanceType': 'ml.r5.24xlarge', 'VolumeSizeInGB': 30}}, 'StoppingCondition': {'MaxRuntimeInSeconds': 86400}, 'AppSpecification': {'ImageUri': '835319576252.dkr.ecr.us-east-1.amazonaws.com/amazon-reviews-spark-p

In [242]:
processing_job_name = preprocessing_job_description['ProcessingJobName']

from IPython.display import display, HTML
display(HTML('<b>Review <a href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, processing_job_name)))


In [241]:
processing_job_status = preprocessing_job_description['ProcessingJobStatus']

if (processing_job_status in ['Completed', 'Stopped']):
    # TODO:  Do something interesting
    print('Complete')
else:
    print(processing_job_status)

InProgress
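
The check above reads the status from a description that was captured once, so re-running it does not refresh the value. A minimal polling sketch follows. `describe` is any zero-argument callable that returns a fresh job description (for example, `processor.jobs[-1].describe`); the function name, polling interval, and timeout are assumptions.

```python
import time

# Statuses after which the job will not change state again.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_processing_job(describe, poll_seconds=30, timeout_seconds=3600,
                            sleep=time.sleep):
    """Poll describe() until the job reaches a terminal status, then return it."""
    waited = 0
    while True:
        status = describe()["ProcessingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(
                "job still {} after {} seconds".format(status, waited))
        sleep(poll_seconds)
        waited += poll_seconds


# Example usage (requires a running job):
# final_status = wait_for_processing_job(processor.jobs[-1].describe)
```

Unlike the cell above, this sketch also treats `Failed` as a terminal status, so a failed job does not keep the loop waiting until the timeout.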


#### Wait until the processing job above completes

In [239]:
running_processor = sagemaker.processing.ProcessingJob.from_processing_name(processing_job_name=processing_job_name,
                                                                            sagemaker_session=sagemaker_session)
running_processor.describe()

{'ProcessingInputs': [{'InputName': 'code',
   'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/spark-amazon-reviews-processor-2020-03-10-07-41-34-026/input/code/preprocess-spark.py',
    'LocalPath': '/opt/ml/processing/input/code',
    'S3DataType': 'S3Prefix',
    'S3InputMode': 'File',
    'S3DataDistributionType': 'FullyReplicated',
    'S3CompressionType': 'None'}}],
 'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'dummy-output',
    'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/spark-amazon-reviews-processor-2020-03-10-07-41-34-026/output/dummy-output',
     'LocalPath': '/opt/ml/processing/output',
     'S3UploadMode': 'EndOfJob'}}]},
 'ProcessingJobName': 'spark-amazon-reviews-processor-2020-03-10-07-41-34-026',
 'ProcessingResources': {'ClusterConfig': {'InstanceCount': 20,
   'InstanceType': 'ml.r5.24xlarge',
   'VolumeSizeInGB': 30}},
 'StoppingCondition': {'MaxRuntimeInSeconds': 86400},
 'AppSpecification': {'ImageUri': '835319576252.dkr.ec

In [240]:
from IPython.display import display, HTML
display(HTML('<b>Review <a href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Spark Job Has Completed</b>'.format(bucket, processing_job_name, region)))

# TODO:  Run the transformation for both validation and test

#### Inspect the preprocessed dataset
Take a look at a few rows of the transformed dataset to make sure the preprocessing was successful.

In [206]:
!aws s3 ls --recursive $balanced_train_data_tfidf_output/

In [207]:
!aws s3 ls --recursive $balanced_validation_data_tfidf_output/

In [208]:
!aws s3 ls --recursive $balanced_test_data_tfidf_output/
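
The listings above show which files were written, not their contents. To look at a few rows, one option is to stream an output object and parse only its first lines. The sketch below assumes the output is comma-separated text, which the notebook does not confirm. `head_rows` is a hypothetical helper, and `key` in the usage comment stands for one of the object keys from the listing.

```python
import csv
import itertools


def head_rows(lines, n=5):
    """Return the first n parsed CSV rows from an iterable of text lines."""
    return list(itertools.islice(csv.reader(lines), n))


# Example usage (requires AWS credentials; `key` is a placeholder):
# import boto3
# body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
# rows = head_rows(line.decode("utf-8") for line in body.iter_lines())

sample = ["1,0.5,0.25", "0,0.1,0.2", "1,0.0,0.0"]
print(head_rows(sample, n=2))
```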