# Feature transformation with Amazon SageMaker Processing and SparkML

Typically, a machine learning (ML) process consists of a few steps: gathering data with various ETL jobs, pre-processing the data, featurizing the dataset by incorporating standard techniques or prior knowledge, and finally training an ML model using an algorithm.

Often, distributed data processing frameworks such as Spark are used to pre-process data sets in order to prepare them for training. In this notebook we'll use Amazon SageMaker Processing, and leverage the power of Spark in a managed SageMaker environment to run our preprocessing workload. Then, we'll take our preprocessed dataset and train a regression model using XGBoost.

![](images/processing.jpg)


## Contents

1. [Objective](#Objective:-predict-the-age-of-an-Abalone-from-its-physical-measurement)
1. [Setup](#Setup)
1. [Using Amazon SageMaker Processing to execute a SparkML Job](#Using-Amazon-SageMaker-Processing-to-execute-a-SparkML-Job)
  1. [Downloading dataset and uploading to S3](#Downloading-dataset-and-uploading-to-S3)
  1. [Build a Spark container for running the preprocessing job](#Build-a-Spark-container-for-running-the-preprocessing-job)
  1. [Run the preprocessing job using Amazon SageMaker Processing](#Run-the-preprocessing-job-using-Amazon-SageMaker-Processing)
    1. [Inspect the preprocessed dataset](#Inspect-the-preprocessed-dataset)
1. [Train a regression model using the Amazon SageMaker XGBoost algorithm](#Train-a-regression-model-using-the-SageMaker-XGBoost-algorithm)
  1. [Retrieve the XGBoost algorithm image](#Retrieve-the-XGBoost-algorithm-image)
  1. [Set XGBoost model parameters and dataset details](#Set-XGBoost-model-parameters-and-dataset-details)
  1. [Train the XGBoost model](#Train-the-XGBoost-model)

## Setup

Let's start by specifying:
* The S3 bucket and prefixes that you use for training and model data. Use the default bucket specified by the Amazon SageMaker session.
* The IAM role ARN used to give processing and training access to the dataset.

In [None]:
!pip install boto3

In [1]:
import sagemaker
from time import gmtime, strftime
import boto3

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()
region = boto3.Session().region_name

## Using Amazon SageMaker Processing to execute a SparkML job

### Run the preprocessing job using Amazon SageMaker Processing

Next, use the Amazon SageMaker Python SDK to submit a processing job. The job configuration in this section uses the SDK's `SKLearnProcessor` together with a scikit-learn preprocessing script, `preprocess-scikit.py`.

Review the preprocessing script.

In [2]:
cat preprocess-scikit.py

import argparse
import json
import os
import pandas as pd
import csv
import glob
from pathlib import Path


def list_arg(raw_value):
    """argparse type for a list of strings"""
    return str(raw_value).split(',')


def parse_args():
    # Unlike SageMaker training jobs (which have `SM_HOSTS` and `SM_CURRENT_HOST` env vars), processing jobs need to parse the resource config file directly
    resconfig = {}
    try:
        with open('/opt/ml/config/resourceconfig.json', 'r') as cfgfile:
            resconfig = json.load(cfgfile)
    except FileNotFoundError:
        print('/opt/ml/config/resourceconfig.json not found.  current_host is unknown.')
        pass # Ignore

    # Local testing with CLI args
    parser = argparse.ArgumentParser(description='Process')

    parser.add_argument('--hosts', type=list_arg,
        default=resconfig.get('hosts', ['unknown']),
        help='Comma-separated list of host names running the job'
    )
    parser.add_ar
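The listing above is cut off, but the visible part shows the pattern: read `/opt/ml/config/resourceconfig.json` if it exists, then use its values as defaults for command-line arguments so the script can also be run locally. The sketch below reproduces that pattern in isolation. The `--current-host` argument and the helper names `load_resource_config` and `build_parser` are assumptions made for illustration and are not taken from the truncated script.

```python
import argparse
import json


def list_arg(raw_value):
    """argparse type for a comma-separated list of strings."""
    return str(raw_value).split(',')


def load_resource_config(path='/opt/ml/config/resourceconfig.json'):
    """Return the parsed resource config, or {} when running outside SageMaker."""
    try:
        with open(path, 'r') as cfgfile:
            return json.load(cfgfile)
    except FileNotFoundError:
        return {}


def build_parser(resconfig):
    """Build a parser whose defaults come from the resource config."""
    parser = argparse.ArgumentParser(description='Process')
    parser.add_argument('--hosts', type=list_arg,
                        default=resconfig.get('hosts', ['unknown']),
                        help='Comma-separated list of host names running the job')
    # Assumption: a matching argument for the current host (hypothetical name).
    parser.add_argument('--current-host', type=str,
                        default=resconfig.get('current_host', 'unknown'),
                        help='Name of the host running this copy of the script')
    return parser


if __name__ == '__main__':
    # Parse an empty argument list so the config (or fallback) defaults are used.
    args = build_parser(load_resource_config()).parse_args([])
    print(args.hosts, args.current_host)
```

Outside a processing container the file is missing, so both values fall back to `unknown`; passing `--hosts algo-1,algo-2` on the command line overrides the default.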

Run this script as a processing job. With a custom container you would also specify the command (for example, `/opt/program/submit` for a Spark processor); with the `SKLearnProcessor` used here, you pass the script as `code`. You also need to specify one `ProcessingInput` whose `source` is the Amazon S3 location of the input data and whose `destination` is the local path the script reads this data from, under `/opt/ml/processing/input` inside the Docker container. All local paths inside the processing container must begin with `/opt/ml/processing/`.

Also give the `run()` method one or more `ProcessingOutput` objects, where the `source` is the local path the script writes output data to. For outputs, the `destination` defaults to an S3 bucket that the Amazon SageMaker Python SDK creates for you, following the format `s3://sagemaker-<region>-<account_id>/<processing_job_name>/output/<output_name>/`. You also give each `ProcessingOutput` an `output_name`, which makes it easier to retrieve these output artifacts after the job has run.

The `arguments` parameter of the `run()` method holds the command-line arguments passed to the `preprocess-scikit.py` script.
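The two path conventions described above (local paths under `/opt/ml/processing/`, and the default S3 destination format for outputs) can be sketched with plain string handling. The helper names `check_local_path` and `default_output_uri` are hypothetical and exist only for this illustration; the SDK computes the real destination itself, and the account ID below is a placeholder.

```python
PROCESSING_ROOT = '/opt/ml/processing/'


def check_local_path(path):
    """Raise if a container-local path is outside /opt/ml/processing/."""
    if not path.startswith(PROCESSING_ROOT):
        raise ValueError('{} must begin with {}'.format(path, PROCESSING_ROOT))
    return path


def default_output_uri(region, account_id, job_name, output_name):
    """Build the default output destination in the format quoted above."""
    return 's3://sagemaker-{}-{}/{}/output/{}/'.format(
        region, account_id, job_name, output_name)


if __name__ == '__main__':
    # Placeholder values, not taken from a real account.
    print(check_local_path('/opt/ml/processing/input/data/'))
    print(default_output_uri('us-east-1', '123456789012', 'my-job', 'train'))
```

Naming each output pays off later: the `output_name` becomes the last component of the default URI, so the artifacts can be looked up by name once the job has finished.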

In [3]:
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

output_prefix = 'amazon-reviews-scikit-processor-{}'.format(timestamp_prefix)
processing_job_name = 'amazon-reviews-scikit-processor-{}'.format(timestamp_prefix)

print('Processing job name:  {}'.format(processing_job_name))

Processing job name:  amazon-reviews-scikit-processor-2020-03-18-05-40-55


In [4]:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor = SKLearnProcessor(
    # base_job_name='amazon-reviews-processor-scikit',
    framework_version='0.20.0',
    role=role,
    instance_type='ml.m5.4xlarge',
    instance_count=10,
)

In [5]:
# Inputs
#s3_input_data = 's3://amazon-reviews-pds/tsv/'
s3_input_data = 's3://{}/amazon-reviews-pds/tsv/'.format(bucket)
print(s3_input_data)

s3://sagemaker-us-east-1-835319576252/amazon-reviews-pds/tsv/


In [6]:
!aws s3 ls $s3_input_data

2020-03-02 06:29:52  648641286 amazon_reviews_us_Apparel_v1_00.tsv.gz
2020-03-02 06:29:52  582145299 amazon_reviews_us_Automotive_v1_00.tsv.gz
2020-03-02 06:29:56  357392893 amazon_reviews_us_Baby_v1_00.tsv.gz
2020-03-02 06:29:58  914070021 amazon_reviews_us_Beauty_v1_00.tsv.gz
2020-03-02 06:30:07 2740337188 amazon_reviews_us_Books_v1_00.tsv.gz
2020-03-02 06:30:09 2692708591 amazon_reviews_us_Books_v1_01.tsv.gz
2020-03-02 06:30:10 1329539135 amazon_reviews_us_Books_v1_02.tsv.gz
2020-03-02 06:30:23  442653086 amazon_reviews_us_Camera_v1_00.tsv.gz
2020-03-02 06:30:27 2689739299 amazon_reviews_us_Digital_Ebook_Purchase_v1_00.tsv.gz
2020-03-02 06:30:34 1294879074 amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv.gz
2020-03-02 06:30:43  253570168 amazon_reviews_us_Digital_Music_Purchase_v1_00.tsv.gz
2020-03-02 06:30:50   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2020-03-02 06:30:51  506979922 amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz
2020-03-02 06:31:05   2744264

In [7]:
# # Outputs
# s3_output_train_data = 's3://{}/{}/train'.format(bucket, output_prefix)
# s3_output_validation_data = 's3://{}/{}/validation'.format(bucket, output_prefix)
# s3_output_test_data = 's3://{}/{}/test'.format(bucket, output_prefix)

# print(s3_output_train_data)
# print(s3_output_validation_data)
# print(s3_output_test_data)

In [8]:
# ShardedByS3Key splits the input files across the instances, so the transformations are spread across all nodes
processor.run(code='preprocess-scikit.py',
                      inputs=[ProcessingInput(source=s3_input_data,
                                              destination='/opt/ml/processing/input/data/',
                                              s3_data_distribution_type='ShardedByS3Key')],
                      outputs=[
                               ProcessingOutput(s3_upload_mode='EndOfJob',
                                                output_name='raw-labeled-split-unbalanced-header-train',
                                                source='/opt/ml/processing/output/raw/labeled/split/unbalanced/header/train'),
                               ProcessingOutput(s3_upload_mode='EndOfJob',
                                                output_name='raw-labeled-split-unbalanced-header-validation',
                                                source='/opt/ml/processing/output/raw/labeled/split/unbalanced/header/validation'),
                               ProcessingOutput(s3_upload_mode='EndOfJob',
                                                output_name='raw-labeled-split-unbalanced-header-test',
                                                source='/opt/ml/processing/output/raw/labeled/split/unbalanced/header/test'),
                               ProcessingOutput(s3_upload_mode='EndOfJob',
                                                output_name='raw-labeled-split-balanced-header-train',
                                                source='/opt/ml/processing/output/raw/labeled/split/balanced/header/train'),
                               ProcessingOutput(s3_upload_mode='EndOfJob',
                                                output_name='raw-labeled-split-balanced-header-validation',
                                                source='/opt/ml/processing/output/raw/labeled/split/balanced/header/validation'),
                               ProcessingOutput(s3_upload_mode='EndOfJob',
                                                output_name='raw-labeled-split-balanced-header-test',
                                                source='/opt/ml/processing/output/raw/labeled/split/balanced/header/test'),
                      ],
                      logs=True,
                      wait=False)



Job Name:  sagemaker-scikit-learn-2020-03-18-05-40-57-007
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/amazon-reviews-pds/tsv/', 'LocalPath': '/opt/ml/processing/input/data/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'ShardedByS3Key', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-18-05-40-57-007/input/code/preprocess-scikit.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'raw-labeled-split-unbalanced-header-train', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-18-05-40-57-007/output/raw-labeled-split-unbalanced-header-train', 'LocalPath': '/opt/ml/processing/output/raw/labeled/split/unbalanced/header/train', '
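To build some intuition for `ShardedByS3Key`: instead of copying every input file to every instance (as `FullyReplicated` does for the code input), each instance downloads only its own shard of the objects under the S3 prefix. The simulation below is a rough sketch of that idea. The round-robin assignment over sorted keys is an assumption made for illustration, not a description of how SageMaker actually assigns objects, and `shard_keys` is a hypothetical helper.

```python
def shard_keys(keys, instance_count):
    """Distribute S3 keys across instances round-robin (illustrative only)."""
    shards = [[] for _ in range(instance_count)]
    for index, key in enumerate(sorted(keys)):
        shards[index % instance_count].append(key)
    return shards


if __name__ == '__main__':
    # A few of the file names from the listing above.
    keys = [
        'amazon_reviews_us_Apparel_v1_00.tsv.gz',
        'amazon_reviews_us_Automotive_v1_00.tsv.gz',
        'amazon_reviews_us_Baby_v1_00.tsv.gz',
        'amazon_reviews_us_Beauty_v1_00.tsv.gz',
        'amazon_reviews_us_Books_v1_00.tsv.gz',
    ]
    for instance, shard in enumerate(shard_keys(keys, 2)):
        print(instance, shard)
```

Because sharding works per object, the number of input files should be at least as large as `instance_count`; otherwise some instances would have nothing to process.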

In [9]:
preprocessing_job_description = processor.jobs[-1].describe()
print(preprocessing_job_description)

{'ProcessingInputs': [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/amazon-reviews-pds/tsv/', 'LocalPath': '/opt/ml/processing/input/data/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'ShardedByS3Key', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-18-05-40-57-007/input/code/preprocess-scikit.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}], 'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'raw-labeled-split-unbalanced-header-train', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-18-05-40-57-007/output/raw-labeled-split-unbalanced-header-train', 'LocalPath': '/opt/ml/processing/output/raw/labeled/split/unbalanced/header/train', 'S3UploadMode': 'En

In [10]:
processing_job_name = preprocessing_job_description['ProcessingJobName']

from IPython.display import display, HTML
display(HTML('<b>Review <a href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, processing_job_name)))


In [11]:
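# Re-fetch the description; the copy stored earlier may be out of date
preprocessing_job_description = processor.jobs[-1].describe()
processing_job_status = preprocessing_job_description['ProcessingJobStatus']
if processing_job_status in ['Completed', 'Stopped']:
    # TODO:  Do something interesting...
    print('Complete')
else:
    print(processing_job_status)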

InProgress


## Please wait until the Processing Job Completes above
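Because the job was started with `wait=False`, the notebook does not block while it runs. One way to wait programmatically is to poll the job description until the status reaches a terminal value. The sketch below is a minimal version of that loop, not part of the SDK: `wait_for_job` is a hypothetical helper, and `describe_fn` stands in for any callable that returns a description dictionary, such as `processor.jobs[-1].describe`.

```python
import time

# Statuses after which a processing job will not change any further.
TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}


def wait_for_job(describe_fn, poll_seconds=30, max_polls=120, sleep_fn=time.sleep):
    """Poll describe_fn until the job reaches a terminal status, then return it."""
    status = None
    for _ in range(max_polls):
        status = describe_fn()['ProcessingJobStatus']
        if status in TERMINAL_STATUSES:
            return status
        sleep_fn(poll_seconds)
    raise TimeoutError('Job still {} after {} polls'.format(status, max_polls))


if __name__ == '__main__':
    # Fake descriptions so the sketch runs without an AWS account.
    fake_statuses = iter(['InProgress', 'InProgress', 'Completed'])
    final = wait_for_job(lambda: {'ProcessingJobStatus': next(fake_statuses)},
                         sleep_fn=lambda seconds: None)
    print(final)
```

In the notebook this would be called as `wait_for_job(processor.jobs[-1].describe)`. Check the returned status before reading the outputs, since `Failed` and `Stopped` are terminal too.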

In [12]:
running_processor = sagemaker.processing.ProcessingJob.from_processing_name(processing_job_name=processing_job_name,
                                                                            sagemaker_session=sagemaker_session)
running_processor.describe()

{'ProcessingInputs': [{'InputName': 'input-1',
   'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/amazon-reviews-pds/tsv/',
    'LocalPath': '/opt/ml/processing/input/data/',
    'S3DataType': 'S3Prefix',
    'S3InputMode': 'File',
    'S3DataDistributionType': 'ShardedByS3Key',
    'S3CompressionType': 'None'}},
  {'InputName': 'code',
   'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-18-05-40-57-007/input/code/preprocess-scikit.py',
    'LocalPath': '/opt/ml/processing/input/code',
    'S3DataType': 'S3Prefix',
    'S3InputMode': 'File',
    'S3DataDistributionType': 'FullyReplicated',
    'S3CompressionType': 'None'}}],
 'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'raw-labeled-split-unbalanced-header-train',
    'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-18-05-40-57-007/output/raw-labeled-split-unbalanced-header-train',
     'LocalPath': '/opt/ml/processing/output/raw/l

In [13]:
from IPython.display import display, HTML
display(HTML('<b>Review <a href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Processing Job Has Completed</b>'.format(bucket, processing_job_name, region)))

In [14]:
output_config = preprocessing_job_description['ProcessingOutputConfig']
for output in output_config['Outputs']:
    if output['OutputName'] == 'raw-labeled-split-unbalanced-header-train':
        preprocessed_unbalanced_train_data = output['S3Output']['S3Uri']
    if output['OutputName'] == 'raw-labeled-split-unbalanced-header-validation':
        preprocessed_unbalanced_validation_data = output['S3Output']['S3Uri']        
    if output['OutputName'] == 'raw-labeled-split-unbalanced-header-test':
        preprocessed_unbalanced_test_data = output['S3Output']['S3Uri']
    if output['OutputName'] == 'raw-labeled-split-balanced-header-train':
        preprocessed_balanced_train_data = output['S3Output']['S3Uri']
    if output['OutputName'] == 'raw-labeled-split-balanced-header-validation':
        preprocessed_balanced_validation_data = output['S3Output']['S3Uri']        
    if output['OutputName'] == 'raw-labeled-split-balanced-header-test':
        preprocessed_balanced_test_data = output['S3Output']['S3Uri']
        
print(preprocessed_unbalanced_train_data)
print(preprocessed_unbalanced_validation_data)
print(preprocessed_unbalanced_test_data)
print(preprocessed_balanced_train_data)
print(preprocessed_balanced_validation_data)
print(preprocessed_balanced_test_data)

s3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-18-05-40-57-007/output/raw-labeled-split-unbalanced-header-train
s3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-18-05-40-57-007/output/raw-labeled-split-unbalanced-header-validation
s3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-18-05-40-57-007/output/raw-labeled-split-unbalanced-header-test
s3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-18-05-40-57-007/output/raw-labeled-split-balanced-header-train
s3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-18-05-40-57-007/output/raw-labeled-split-balanced-header-validation
s3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-18-05-40-57-007/output/raw-labeled-split-balanced-header-test
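The chain of `if` statements above can also be written as a single lookup table keyed by `output_name`. The sketch below assumes only the description layout shown in the job output (`ProcessingOutputConfig` → `Outputs` → `OutputName` / `S3Output` / `S3Uri`). The helper name `output_uris_by_name` and the example bucket are hypothetical.

```python
def output_uris_by_name(job_description):
    """Map each output name to its S3 URI from a processing job description."""
    outputs = job_description['ProcessingOutputConfig']['Outputs']
    return {output['OutputName']: output['S3Output']['S3Uri'] for output in outputs}


if __name__ == '__main__':
    # Hand-written description with the same shape as the real one (placeholder URIs).
    example_description = {
        'ProcessingOutputConfig': {
            'Outputs': [
                {'OutputName': 'raw-labeled-split-balanced-header-train',
                 'S3Output': {'S3Uri': 's3://example-bucket/job/output/raw-labeled-split-balanced-header-train'}},
                {'OutputName': 'raw-labeled-split-balanced-header-test',
                 'S3Output': {'S3Uri': 's3://example-bucket/job/output/raw-labeled-split-balanced-header-test'}},
            ]
        }
    }
    uris = output_uris_by_name(example_description)
    print(uris['raw-labeled-split-balanced-header-train'])
```

With the real description, `output_uris_by_name(preprocessing_job_description)['raw-labeled-split-balanced-header-train']` would give the same value as `preprocessed_balanced_train_data`, and a misspelled output name raises a `KeyError` instead of leaving a variable undefined.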


#### Inspect the preprocessed dataset
List the files in each output location to make sure the preprocessing job wrote the expected train, validation, and test splits.

In [15]:
!aws s3 ls $preprocessed_unbalanced_train_data/

In [16]:
!aws s3 ls $preprocessed_unbalanced_validation_data/

In [17]:
!aws s3 ls $preprocessed_unbalanced_test_data/

In [18]:
!aws s3 ls $preprocessed_balanced_train_data/

In [19]:
!aws s3 ls $preprocessed_balanced_validation_data/

In [20]:
!aws s3 ls $preprocessed_balanced_test_data/