# Feature transformation with Amazon SageMaker Processing and SparkML

Typically, a machine learning (ML) process consists of a few steps: gathering data with various ETL jobs, preprocessing the data, featurizing the dataset by incorporating standard techniques or prior knowledge, and finally training an ML model using an algorithm.

Distributed data processing frameworks such as Spark are often used to preprocess datasets before training. In this notebook, we use Amazon SageMaker Processing to run a Spark preprocessing workload in a managed SageMaker environment. We then take the preprocessed dataset and train a regression model using XGBoost.

![](images/processing.jpg)


## Contents

1. [Objective](#Objective:-predict-the-age-of-an-Abalone-from-its-physical-measurement)
1. [Setup](#Setup)
1. [Using Amazon SageMaker Processing to execute a SparkML Job](#Using-Amazon-SageMaker-Processing-to-execute-a-SparkML-Job)
  1. [Downloading dataset and uploading to S3](#Downloading-dataset-and-uploading-to-S3)
  1. [Build a Spark container for running the preprocessing job](#Build-a-Spark-container-for-running-the-preprocessing-job)
  1. [Run the preprocessing job using Amazon SageMaker Processing](#Run-the-preprocessing-job-using-Amazon-SageMaker-Processing)
    1. [Inspect the preprocessed dataset](#Inspect-the-preprocessed-dataset)
1. [Train a regression model using the Amazon SageMaker XGBoost algorithm](#Train-a-regression-model-using-the-SageMaker-XGBoost-algorithm)
  1. [Retrieve the XGBoost algorithm image](#Retrieve-the-XGBoost-algorithm-image)
  1. [Set XGBoost model parameters and dataset details](#Set-XGBoost-model-parameters-and-dataset-details)
  1. [Train the XGBoost model](#Train-the-XGBoost-model)

## Setup

Add an Amazon ECR policy with the following settings to your SageMaker execution role:
* `EC2ContainerRegistry`
* Permissions: `List`, `Read`, `Write`
* Repository: `amazon-reviews-spark-processor`
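
As a rough illustration of the permissions listed above, the following sketch builds an IAM policy document as a Python dictionary. The helper name `build_ecr_policy`, the specific list of `ecr:` actions, and the placeholder account ID are assumptions made for this example, not an official AWS policy; adjust them to your own security requirements. The sketch also assumes the standard `aws` partition.

```python
import json


def build_ecr_policy(region, account_id, repository):
    """Return an IAM policy document (as a dict) scoped to one ECR repository.

    Hypothetical helper: the action list below is an assumption chosen to
    cover listing, reading, and writing images for a single repository.
    """
    repo_arn = "arn:aws:ecr:{}:{}:repository/{}".format(region, account_id, repository)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Getting a login token cannot be scoped to a single repository.
                "Effect": "Allow",
                "Action": ["ecr:GetAuthorizationToken"],
                "Resource": "*",
            },
            {
                # List, read, and write access for the one repository.
                "Effect": "Allow",
                "Action": [
                    "ecr:DescribeRepositories",
                    "ecr:CreateRepository",
                    "ecr:BatchCheckLayerAvailability",
                    "ecr:GetDownloadUrlForLayer",
                    "ecr:BatchGetImage",
                    "ecr:InitiateLayerUpload",
                    "ecr:UploadLayerPart",
                    "ecr:CompleteLayerUpload",
                    "ecr:PutImage",
                ],
                "Resource": repo_arn,
            },
        ],
    }


# Placeholder region and account ID, for illustration only.
policy = build_ecr_policy("us-east-1", "123456789012", "amazon-reviews-spark-processor")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console as an inline policy on the execution role.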

Let's start by specifying:
* The S3 bucket and prefixes that you use for training and model data. Use the default bucket specified by the Amazon SageMaker session.
* The IAM role ARN used to give processing and training access to the dataset.

In [6]:
import sagemaker
from time import gmtime, strftime
import boto3

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()
region = boto3.Session().region_name

timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

output_prefix = 'amazon-reviews-spark-processor-{}'.format(timestamp_prefix)
processing_job_name = 'amazon-reviews-spark-processor-{}'.format(timestamp_prefix)

print('Processing job name:  {}'.format(processing_job_name))

Processing job name:  amazon-reviews-spark-processor-2020-03-08-21-58-00


## Using Amazon SageMaker Processing to execute a SparkML Job

### Downloading dataset and uploading to Amazon Simple Storage Service (Amazon S3)

List the raw dataset files (gzipped TSV) in the S3 bucket.

In [4]:
#!aws s3 ls s3://$bucket/$input_raw_prefix/
#!aws s3 ls s3://amazon-reviews-pds/parquet/
!aws s3 ls s3://{bucket}/amazon-reviews-pds/tsv/

2020-03-02 06:29:52  648641286 amazon_reviews_us_Apparel_v1_00.tsv.gz
2020-03-02 06:29:52  582145299 amazon_reviews_us_Automotive_v1_00.tsv.gz
2020-03-02 06:29:56  357392893 amazon_reviews_us_Baby_v1_00.tsv.gz
2020-03-02 06:29:58  914070021 amazon_reviews_us_Beauty_v1_00.tsv.gz
2020-03-02 06:30:07 2740337188 amazon_reviews_us_Books_v1_00.tsv.gz
2020-03-02 06:30:09 2692708591 amazon_reviews_us_Books_v1_01.tsv.gz
2020-03-02 06:30:10 1329539135 amazon_reviews_us_Books_v1_02.tsv.gz
2020-03-02 06:30:23  442653086 amazon_reviews_us_Camera_v1_00.tsv.gz
2020-03-02 06:30:27 2689739299 amazon_reviews_us_Digital_Ebook_Purchase_v1_00.tsv.gz
2020-03-02 06:30:34 1294879074 amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv.gz
2020-03-02 06:30:43  253570168 amazon_reviews_us_Digital_Music_Purchase_v1_00.tsv.gz
2020-03-02 06:30:50   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2020-03-02 06:30:51  506979922 amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz
2020-03-02 06:31:05   2744264

### Build a Spark container for running the preprocessing job

An example Spark container is included in the `./container` directory of this example. The container handles the bootstrapping of all Spark configuration, and serves as a wrapper around the `spark-submit` CLI. At a high level the container provides:
* A set of default Spark/YARN/Hadoop configurations
* A bootstrapping script for configuring and starting up Spark master/worker nodes
* A wrapper around the `spark-submit` CLI to submit a Spark application


After the container build and push process is complete, use the Amazon SageMaker Python SDK to submit a managed, distributed Spark application that performs our dataset preprocessing.

Build the example Spark container.

In [None]:
docker_repo = 'amazon-reviews-spark-processor'
docker_tag = 'latest'

In [None]:
!docker build -t $docker_repo:$docker_tag -f container/Dockerfile ./container

Create an Amazon Elastic Container Registry (Amazon ECR) repository for the Spark container and push the image.

In [None]:
import boto3
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.session.Session().region_name

image_uri = '{}.dkr.ecr.{}.amazonaws.com/{}:{}'.format(account_id, region, docker_repo, docker_tag)
print(image_uri)

Log in to Amazon ECR, create the repository if it does not already exist, then tag and push the Docker image.

In [None]:
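# `aws ecr get-login` is AWS CLI v1 only; with v2 use: !aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.amazonaws.com
!$(aws ecr get-login --region $region --registry-ids $account_id --no-include-email)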

In [None]:
!aws ecr describe-repositories --repository-names $docker_repo || aws ecr create-repository --repository-name $docker_repo

In [None]:
!docker tag $docker_repo:$docker_tag $image_uri

In [None]:
!docker push $image_uri

### Run the preprocessing job using Amazon SageMaker Processing

Next, use the Amazon SageMaker Python SDK to submit a processing job. Use the Spark container that was just built, and a SparkML script for preprocessing in the job configuration.

Review the Spark preprocessing script.

In [None]:
!cat preprocess-spark.py

Run this script as a processing job. You specify the command (`/opt/program/submit` for this Spark processor) when you create the `ScriptProcessor`.

In general, a processing job can take a `ProcessingInput`, whose `source` is an Amazon S3 location and whose `destination` is the local path the script reads from (for example `/opt/ml/processing/input` inside the Docker container), and a `ProcessingOutput`, whose `source` is the local path the script writes output data to. All local paths inside the processing container must begin with `/opt/ml/processing/`. For outputs, the `destination` defaults to an S3 bucket that the Amazon SageMaker Python SDK creates for you, following the format `s3://sagemaker-<region>-<account_id>/<processing_job_name>/output/<output_name>/`, and the `output_name` makes it easier to retrieve the artifacts after the job is run.

This notebook does not use `ProcessingInput` or `ProcessingOutput`. With `ProcessingInput`, the data would be copied to each node, which is unnecessary here because Spark already handles distributing the data. Instead, the S3 input and output locations are passed to the script directly.

The `arguments` parameter of the `run()` method supplies the command-line arguments for our `preprocess-spark.py` script.
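
To make the default output location format mentioned above concrete, here is a small sketch that fills in the placeholders. The helper name `default_output_uri` and all of the example values are hypothetical; the SDK builds this location itself, so this code is only meant to show how the pieces fit together.

```python
def default_output_uri(region, account_id, processing_job_name, output_name):
    """Fill in the default S3 destination pattern for a processing output.

    Hypothetical helper for illustration only; it mirrors the format
    s3://sagemaker-<region>-<account_id>/<processing_job_name>/output/<output_name>/
    """
    return "s3://sagemaker-{}-{}/{}/output/{}/".format(
        region, account_id, processing_job_name, output_name
    )


# Placeholder values, not taken from a real job.
example_uri = default_output_uri(
    "us-east-1", "123456789012", "example-processing-job", "train"
)
print(example_uri)
```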

In [None]:
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

# instance_count needs to be > 1 or you will see the following error:
#   "INFO yarn.Client: Application report for application_ (state: ACCEPTED)"
processor = ScriptProcessor(base_job_name='spark-amazon-reviews-processor',
                            image_uri=image_uri,
                            command=['/opt/program/submit'],
                            role=role,
                            instance_count=10,
                            instance_type='ml.r5.8xlarge',
                            max_runtime_in_seconds=600,
                            env={'mode': 'python'})

In [None]:
# Inputs
s3_input_data = 's3://amazon-reviews-pds/parquet/'
print(s3_input_data)

In [None]:
!aws s3 ls $s3_input_data

In [None]:
# Outputs
s3_output_train_data = 's3://{}/{}/train'.format(bucket, output_prefix)
s3_output_validation_data = 's3://{}/{}/validation'.format(bucket, output_prefix)
s3_output_test_data = 's3://{}/{}/test'.format(bucket, output_prefix)

print(s3_output_train_data)
print(s3_output_validation_data)
print(s3_output_test_data)

In [None]:
# Note:  We can specify the local `preprocess-spark.py` because we're using the SageMaker SDK.
#
#    Notes on Invoking from Lambda:
#      * However, if we use the boto3 SDK (i.e. with a Lambda), we need to copy the script to S3 and specify everything explicitly, including --py-files, etc.
#      * We would need to do the following before invoking the Lambda:
#          !aws s3 cp preprocess.py s3://<location>/sagemaker/spark-preprocess-reviews-demo/code/preprocess.py
#          !aws s3 cp preprocess.py s3://<location>/sagemaker/spark-preprocess-reviews-demo/py_files/preprocess.py
#      * Then reference the s3://<location> above in the --py-files, etc.
#      * See Lambda example code in this same project for more details.
#
# Note:  See https://github.com/awslabs/amazon-sagemaker-examples/issues/994 for issues related to using /opt/ml/processing/input/ and output/
#        If we use ProcessingInput, the data will be copied to each node (which we don't want in this case since Spark already handles this)
processor.run(code='preprocess-spark.py',
              arguments=['s3_input_data', s3_input_data,
                         's3_output_train_data', s3_output_train_data,
                         's3_output_validation_data', s3_output_validation_data,
                         's3_output_test_data', s3_output_test_data
                        ],
              logs=True,
              wait=False
)
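
The `arguments` list passed to `run()` above alternates names and values without any `--` prefixes. The contents of `preprocess-spark.py` are not shown here, so the following is only a sketch of one way a script could turn such a list into a dictionary. The function name `parse_name_value_args` and the example bucket paths are hypothetical.

```python
def parse_name_value_args(argv):
    """Convert ['name1', 'value1', 'name2', 'value2', ...] into a dict.

    Hypothetical helper: assumes the arguments arrive as alternating
    name/value pairs, as in the `arguments` list given to `run()`.
    """
    if len(argv) % 2 != 0:
        raise ValueError("Expected an even number of arguments (name/value pairs)")
    # Even positions hold names, odd positions hold the matching values.
    return dict(zip(argv[0::2], argv[1::2]))


# Example argument list; inside a real script this would be sys.argv[1:].
example_argv = [
    "s3_input_data", "s3://amazon-reviews-pds/parquet/",
    "s3_output_train_data", "s3://example-bucket/example-prefix/train",
    "s3_output_validation_data", "s3://example-bucket/example-prefix/validation",
    "s3_output_test_data", "s3://example-bucket/example-prefix/test",
]
args = parse_name_value_args(example_argv)
print(args["s3_input_data"])
```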

In [None]:
# Use the name of the job submitted by processor.run() above.
# To inspect an earlier job instead, set processing_job_name to that job's name,
# for example 'spark-amazon-reviews-processor-2020-02-26-02-38-56-772'.

region = sagemaker_session.boto_region_name
processing_job_name = processor.latest_job.job_name

from IPython.display import display, HTML
display(HTML('<b>Review <a href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, processing_job_name)))

In [None]:
#!aws sagemaker list-processing-jobs

In [None]:
#running_processor = sagemaker.processing.ProcessingJob.from_processing_name(processing_job_name=processing_job_name,
#                                                                            sagemaker_session=sagemaker_session)
#running_processor.describe()
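
Because the job is started with `wait=False`, it can be handy to poll its status before inspecting the output. The sketch below shows one possible polling loop. The function name `wait_for_processing_job` and its parameters are hypothetical, and the status lookup is passed in as a callable so that the example runs here with a stub instead of a real AWS call.

```python
import time

# Statuses after which a processing job no longer changes.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_processing_job(describe_job, job_name, poll_seconds=30, max_polls=60):
    """Poll a job until it reaches a terminal status and return that status.

    `describe_job` is any callable that takes the job name and returns a dict
    containing a 'ProcessingJobStatus' key.
    """
    for _ in range(max_polls):
        status = describe_job(job_name)["ProcessingJobStatus"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("Job {} did not finish after {} polls".format(job_name, max_polls))


# Stub that simulates a job finishing on the third poll.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_processing_job(
    lambda name: {"ProcessingJobStatus": next(fake_statuses)},
    "example-job",
    poll_seconds=0,
)
print(final_status)
```

Against a real job, the callable could wrap the boto3 SageMaker client, for example `lambda name: boto3.client('sagemaker').describe_processing_job(ProcessingJobName=name)`.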

#### Inspect the preprocessed dataset
List the output files to confirm that the preprocessing job wrote the train, validation, and test datasets.

In [None]:
from IPython.display import display, HTML
display(HTML('<b>Review <a href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Spark Job Has Completed</b>'.format(bucket, output_prefix, region)))

In [None]:
!aws s3 ls --recursive $s3_output_train_data/

In [None]:
!aws s3 ls --recursive $s3_output_validation_data/

In [None]:
!aws s3 ls --recursive $s3_output_test_data/