# Feature transformation with Amazon SageMaker Processing and SparkML

Typically, a machine learning (ML) process consists of a few steps: gathering data with various ETL jobs, pre-processing the data, featurizing the dataset by incorporating standard techniques or prior knowledge, and finally training an ML model using an algorithm.

Often, distributed data processing frameworks such as Spark are used to pre-process data sets in order to prepare them for training. In this notebook we'll use Amazon SageMaker Processing, and leverage the power of Spark in a managed SageMaker environment to run our preprocessing workload. Then, we'll take our preprocessed dataset and train a regression model using XGBoost.

![](images/processing.jpg)


## Contents

1. [Objective](#Objective:-predict-the-age-of-an-Abalone-from-its-physical-measurement)
1. [Setup](#Setup)
1. [Using Amazon SageMaker Processing to execute a SparkML Job](#Using-Amazon-SageMaker-Processing-to-execute-a-SparkML-Job)
  1. [Downloading dataset and uploading to S3](#Downloading-dataset-and-uploading-to-S3)
  1. [Build a Spark container for running the preprocessing job](#Build-a-Spark-container-for-running-the-preprocessing-job)
  1. [Run the preprocessing job using Amazon SageMaker Processing](#Run-the-preprocessing-job-using-Amazon-SageMaker-Processing)
    1. [Inspect the preprocessed dataset](#Inspect-the-preprocessed-dataset)
1. [Train a regression model using the Amazon SageMaker XGBoost algorithm](#Train-a-regression-model-using-the-SageMaker-XGBoost-algorithm)
  1. [Retrieve the XGBoost algorithm image](#Retrieve-the-XGBoost-algorithm-image)
  1. [Set XGBoost model parameters and dataset details](#Set-XGBoost-model-parameters-and-dataset-details)
  1. [Train the XGBoost model](#Train-the-XGBoost-model)

## Setup

Let's start by specifying:
* The S3 bucket and prefixes that you use for training and model data. Use the default bucket specified by the Amazon SageMaker session.
* The IAM role ARN used to give processing and training access to the dataset.

In [2]:
!pip install boto3

tensorflow 2.0.0 requires opt-einsum>=2.3.2, which is not installed.
tensorflow 2.0.0 has requirement gast==0.2.2, but you'll have gast 0.3.3 which is incompatible.
awscli 1.18.11 has requirement botocore==1.15.11, but you'll have botocore 1.15.15 which is incompatible.
awscli 1.18.11 has requirement PyYAML<5.3,>=3.10, but you'll have pyyaml 5.3 which is incompatible.
You are using pip version 10.0.1, however version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [3]:
import sagemaker
from time import gmtime, strftime
import boto3

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()
region = boto3.Session().region_name

## Using Amazon SageMaker Processing to execute a SparkML Job

### Downloading dataset and uploading to Amazon Simple Storage Service (Amazon S3)

Show the dataset

### Run the preprocessing job using Amazon SageMaker Processing

Next, use the Amazon SageMaker Python SDK to submit a processing job. Use the Spark container that was just built, and a SparkML script for preprocessing in the job configuration.

Review the Spark preprocessing script.

In [4]:
cat preprocess-scikit.py

import argparse
import json
import os

def list_arg(raw_value):
    """argparse type for a list of strings"""
    return str(raw_value).split(",")


def parse_args():
    # Unlike SageMaker training jobs (which have `SM_HOSTS` and `SM_CURRENT_HOST` env vars), processing jobs need to parse the resource config file directly
    resconfig = {}
    try:
        with open("/opt/ml/config/resourceconfig.json", "r") as cfgfile:
            resconfig = json.load(cfgfile)
    except FileNotFoundError:
        print("/opt/ml/config/resourceconfig.json not found.  current_host is unknown.")
        pass # Ignore

    # Local testing with CLI args
    parser = argparse.ArgumentParser(description="Process")

    parser.add_argument("--hosts", type=list_arg,
        default=resconfig.get("hosts", ["unknown"]),
        help="Comma-separated list of host names running the job"
    )
    parser.add_argument("--current-host", type=str,
        default=resconfig.get("current

Run this script as a processing job. Specify the command (`/opt/program/submit` for this Spark processor). Also specify one `ProcessingInput` whose `source` is the Amazon S3 location of the input data and whose `destination` is the local path the script reads from, under `/opt/ml/processing/input` inside the Docker container. All local paths inside the processing container must begin with `/opt/ml/processing/`.

Also give the `run()` method a `ProcessingOutput`, where the `source` is the path the script writes output data to. For outputs, the `destination` defaults to an S3 bucket that the Amazon SageMaker Python SDK creates for you, following the format `s3://sagemaker-<region>-<account_id>/<processing_job_name>/output/<output_name>/`. Give each `ProcessingOutput` an `output_name` as well, to make it easier to retrieve these output artifacts after the job runs.
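To make the default destination format concrete, the following sketch assembles the URI from its parts. The helper name `default_output_uri` and the sample values are hypothetical; in practice the SDK builds this path for you.

```python
def default_output_uri(region, account_id, job_name, output_name):
    """Assemble the default S3 destination (illustrative helper, not an SDK function)."""
    bucket = "sagemaker-{}-{}".format(region, account_id)
    return "s3://{}/{}/output/{}".format(bucket, job_name, output_name)


# Sample values are placeholders for illustration only.
print(default_output_uri("us-east-1", "123456789012", "my-processing-job", "train_data"))
```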

The `arguments` parameter of the `run()` method passes command-line arguments to the `preprocess-scikit.py` script.
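As a sketch of how those values arrive in the script: each entry in the `arguments` list reaches the script as a command-line token, so a standard `argparse` parser can read them. The flag names `--train-split-percentage` and `--balance-dataset` below are hypothetical and are not taken from the script shown above.

```python
import argparse


def parse_job_args(argv):
    """Parse the tokens that a processing job would pass on the command line."""
    parser = argparse.ArgumentParser(description="Process")
    # Hypothetical flags, for illustration only.
    parser.add_argument("--train-split-percentage", type=float, default=0.9)
    parser.add_argument("--balance-dataset", type=str, default="True")
    return parser.parse_args(argv)


# Mirrors a call such as run(..., arguments=["--train-split-percentage", "0.8"]).
args = parse_job_args(["--train-split-percentage", "0.8"])
print(args.train_split_percentage, args.balance_dataset)
```

Because the values travel as strings, the script is responsible for converting them (here through `type=float`).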

In [5]:
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

output_prefix = 'amazon-reviews-scikit-processor-{}'.format(timestamp_prefix)
processing_job_name = 'amazon-reviews-scikit-processor-{}'.format(timestamp_prefix)

print('Processing job name:  {}'.format(processing_job_name))

Processing job name:  amazon-reviews-scikit-processor-2020-03-06-23-50-36


In [6]:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor = SKLearnProcessor(
    # base_job_name='amazon-reviews-processor-scikit',
    framework_version='0.20.0',
    role=role,
    instance_type='ml.m5.xlarge',
    instance_count=2,
)

In [7]:
# Inputs
s3_input_data = 's3://amazon-reviews-pds/tsv/'
print(s3_input_data)

s3://amazon-reviews-pds/tsv/


In [None]:
!aws s3 ls $s3_input_data

In [8]:
# Outputs
s3_output_train_data = 's3://{}/{}/train'.format(bucket, output_prefix)
s3_output_validation_data = 's3://{}/{}/validation'.format(bucket, output_prefix)
s3_output_test_data = 's3://{}/{}/test'.format(bucket, output_prefix)

print(s3_output_train_data)
print(s3_output_validation_data)
print(s3_output_test_data)

s3://sagemaker-us-east-1-835319576252/amazon-reviews-scikit-processor-2020-03-06-23-50-36/train
s3://sagemaker-us-east-1-835319576252/amazon-reviews-scikit-processor-2020-03-06-23-50-36/validation
s3://sagemaker-us-east-1-835319576252/amazon-reviews-scikit-processor-2020-03-06-23-50-36/test


In [9]:
# ShardedByS3Key spreads the input objects, and so the transformations, across all instances
sklearn_processor.run(code='preprocess-scikit.py',
                      inputs=[ProcessingInput(source=s3_input_data,
                                              destination='/opt/ml/processing/input/data/',
                                              s3_data_distribution_type='ShardedByS3Key')],
                      outputs=[ProcessingOutput(output_name='train_data',
                                                source='/opt/ml/processing/output/train'),
                               ProcessingOutput(output_name='validation_data',
                                                source='/opt/ml/processing/output/validation'),
                               ProcessingOutput(output_name='test_data',
                                                source='/opt/ml/processing/output/test')],
                      logs=True,
                      wait=False)


Job Name:  sagemaker-scikit-learn-2020-03-06-23-50-51-679
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://amazon-reviews-pds/tsv/', 'LocalPath': '/opt/ml/processing/input/data/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'ShardedByS3Key', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-06-23-50-51-679/input/code/preprocess-scikit.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'train_data', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-06-23-50-51-679/output/train_data', 'LocalPath': '/opt/ml/processing/output/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'validation_data', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/sagemak
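To illustrate what `ShardedByS3Key` means for the two instances in this job: each instance receives only a subset of the objects under the input prefix, rather than a full copy as with `FullyReplicated`. The sketch below uses a simple round-robin split over sorted keys. This is an assumption for illustration only, since the actual key-to-instance assignment is made by the service, and the object keys are hypothetical.

```python
def shard_keys(keys, instance_count):
    """Split S3 keys across instances round-robin (illustrative assumption only)."""
    shards = [[] for _ in range(instance_count)]
    for index, key in enumerate(sorted(keys)):
        shards[index % instance_count].append(key)
    return shards


# Hypothetical object keys under the input prefix.
keys = ["tsv/part-0.tsv.gz", "tsv/part-1.tsv.gz", "tsv/part-2.tsv.gz", "tsv/part-3.tsv.gz"]
for host, shard in zip(["algo-1", "algo-2"], shard_keys(keys, 2)):
    print(host, shard)
```

Whatever the real assignment, the practical consequence is the same: the script on each instance should process only the files it finds locally and write its own output file.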

In [21]:
#############
# TODO:  CHANGE THE processing_job_name BELOW
#############

processing_job_name = 'sagemaker-scikit-learn-2020-03-06-23-50-51-679'

from IPython.display import display, HTML
display(HTML('<b>Review <a href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, processing_job_name)))

#############
# TODO:  CHANGE THE processing_job_name ABOVE ^^
#############

In [26]:
#!aws sagemaker list-processing-jobs

In [27]:
#running_processor = sagemaker.processing.ProcessingJob.from_processing_name(processing_job_name=processing_job_name,
#                                                                            sagemaker_session=sagemaker_session)
#running_processor.describe()

#### Inspect the preprocessed dataset
Take a look at a few rows of the transformed dataset to make sure the preprocessing was successful.

In [22]:
preprocessing_job_description = sklearn_processor.jobs[-1].describe()

output_config = preprocessing_job_description['ProcessingOutputConfig']
for output in output_config['Outputs']:
    if output['OutputName'] == 'train_data':
        preprocessed_train_data = output['S3Output']['S3Uri']
    if output['OutputName'] == 'validation_data':
        preprocessed_validation_data = output['S3Output']['S3Uri']        
    if output['OutputName'] == 'test_data':
        preprocessed_test_data = output['S3Output']['S3Uri']
        
print(preprocessed_train_data)
print(preprocessed_validation_data)
print(preprocessed_test_data)

s3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-06-23-50-51-679/output/train_data
s3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-06-23-50-51-679/output/validation_data
s3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-03-06-23-50-51-679/output/test_data
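The lookup above can also be written as a dictionary keyed by output name, which avoids one `if` per output. This is a sketch: `outputs_by_name` is a hypothetical helper, and the sample description only mimics the shape of the `describe()` response, with placeholder URIs.

```python
def outputs_by_name(job_description):
    """Map each OutputName to its S3 URI from a describe()-style response."""
    outputs = job_description["ProcessingOutputConfig"]["Outputs"]
    return {output["OutputName"]: output["S3Output"]["S3Uri"] for output in outputs}


# Placeholder response with the same structure as the job description.
sample_description = {
    "ProcessingOutputConfig": {
        "Outputs": [
            {"OutputName": "train_data",
             "S3Output": {"S3Uri": "s3://example-bucket/job/output/train_data"}},
            {"OutputName": "test_data",
             "S3Output": {"S3Uri": "s3://example-bucket/job/output/test_data"}},
        ]
    }
}

uris = outputs_by_name(sample_description)
print(uris["train_data"])
```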


In [23]:
!aws s3 ls $preprocessed_train_data/

2020-03-06 23:56:32         33 algo-1.csv
2020-03-06 23:56:32         33 algo-2.csv


In [24]:
!aws s3 ls $preprocessed_validation_data/

In [25]:
!aws s3 ls $preprocessed_test_data/

In [29]:
from IPython.display import display, HTML
display(HTML('<b>Review <a href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Spark Job Has Completed</b>'.format(bucket, processing_job_name, region)))