# Feature Transformation with an Amazon SageMaker Processing Job and Scikit-Learn

Typically, a machine learning (ML) process consists of a few steps: gathering data with various ETL jobs, pre-processing the data, featurizing the dataset by incorporating standard techniques or prior knowledge, and finally training an ML model using an algorithm.

Often, data processing libraries such as Scikit-Learn are used to pre-process data sets in order to prepare them for training. In this notebook we'll use Amazon SageMaker Processing and leverage the power of Scikit-Learn in a managed SageMaker environment to run our processing workload.

![](img/prepare_dataset.png)

![](img/processing.jpg)


## Contents

1. Setup Environment
1. Setup Input Data
1. Setup Output Data
1. Run the Processing Job using Amazon SageMaker
1. Wait Until the Processing Job Completes
1. Inspect the Processed Output Data

# Setup Environment

Let's start by specifying:
* The S3 bucket and prefixes that you use for training and model data. Use the default bucket specified by the Amazon SageMaker session.
* The IAM role ARN used to give processing and training access to the dataset.

In [None]:
!pip install boto3

In [None]:
import sagemaker
from time import gmtime, strftime
import boto3

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()
region = boto3.Session().region_name

# Setup Input Data

In [None]:
# Inputs
s3_input_data = 's3://{}/amazon-reviews-pds/tsv/'.format(bucket)
print(s3_input_data)

In [None]:
!aws s3 ls $s3_input_data

# Setup Output Data

In [None]:
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())

output_prefix = 'amazon-reviews-scikit-processor-{}'.format(timestamp_prefix)
scikit_processing_job_name = 'amazon-reviews-scikit-processor-{}'.format(timestamp_prefix)

print('Processing job name:  {}'.format(scikit_processing_job_name))
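
The timestamp suffix keeps each job name unique. A minimal sketch of checking such a name before submitting it is shown below. It assumes the naming constraint documented for the `CreateProcessingJob` API (1 to 63 characters, alphanumerics and hyphens, no leading or trailing hyphen). The helper names `is_valid_job_name` and `make_job_name` are hypothetical and not part of the SageMaker SDK.

```python
import re
from time import gmtime, strftime

# Assumed constraint for ProcessingJobName: alphanumerics separated by
# hyphens, at most 63 characters, no leading or trailing hyphen.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")
MAX_JOB_NAME_LENGTH = 63


def is_valid_job_name(name):
    """Return True if `name` satisfies the assumed naming constraint."""
    if len(name) > MAX_JOB_NAME_LENGTH:
        return False
    return JOB_NAME_PATTERN.match(name) is not None


def make_job_name(base):
    """Append a UTC timestamp to `base` and validate the result."""
    timestamp = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
    name = "{}-{}".format(base, timestamp)
    if not is_valid_job_name(name):
        raise ValueError("Invalid processing job name: {}".format(name))
    return name


print(make_job_name("amazon-reviews-scikit-processor"))
```

Underscores are a common reason for rejection under this constraint, which is why the prefix above uses hyphens only.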

# Run the Processing Job using Amazon SageMaker

Next, use the Amazon SageMaker Python SDK to submit a processing job. Use the managed Scikit-Learn container and a Scikit-Learn script for processing in the job configuration.

Review the Scikit-Learn processing script.

In [None]:
!cat preprocess-scikit-label-split.py

Run this script as a processing job. Specify one `ProcessingInput` whose `source` is the Amazon S3 location of the input data and whose `destination` is the local path inside the Docker container where the script reads that data, here `/opt/ml/processing/input/data/`. All local paths inside the processing container must begin with `/opt/ml/processing/`.

Also give the `run()` method a `ProcessingOutput` for each output, where `source` is the local path the script writes output data to. For outputs, the `destination` defaults to an S3 bucket that the Amazon SageMaker Python SDK creates for you, following the format `s3://sagemaker-<region>-<account_id>/<processing_job_name>/output/<output_name>/`. Give each `ProcessingOutput` an `output_name` as well, to make it easier to retrieve these output artifacts after the job has run.
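
To make the default destination format concrete, here is a small sketch that fills in the template quoted above. The region, account ID, and job name are placeholder values chosen for illustration, and `default_output_uri` is a hypothetical helper. The authoritative URIs are the ones returned by the job description later in this notebook.

```python
def default_output_uri(region, account_id, job_name, output_name):
    """Fill in the default S3 destination template for one ProcessingOutput."""
    return "s3://sagemaker-{}-{}/{}/output/{}/".format(
        region, account_id, job_name, output_name
    )


# Placeholder values for illustration only.
example_region = "us-east-1"
example_account_id = "123456789012"
example_job_name = "amazon-reviews-scikit-processor-2020-01-01-00-00-00"

for name in [
    "raw-labeled-split-balanced-header-train",
    "raw-labeled-split-balanced-header-validation",
    "raw-labeled-split-balanced-header-test",
]:
    print(default_output_uri(example_region, example_account_id, example_job_name, name))
```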

The `arguments` parameter of the `run()` method passes command-line arguments to our `preprocess-*.py` script.
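
As a rough sketch of the receiving side, a processing script could read such arguments with `argparse`. The flag names and defaults below are hypothetical and are not taken from `preprocess-scikit-label-split.py`. Only the `/opt/ml/processing/` path prefix comes from the text above.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments forwarded by the processing job."""
    parser = argparse.ArgumentParser(description="Hypothetical preprocessing script")
    # Local paths inside the container; both begin with /opt/ml/processing/.
    parser.add_argument("--input-data", default="/opt/ml/processing/input/data")
    parser.add_argument("--output-data", default="/opt/ml/processing/output")
    # Hypothetical tuning flag, shown only to illustrate typed arguments.
    parser.add_argument("--train-split-percentage", type=float, default=0.90)
    return parser.parse_args(argv)


# Simulate the argument list that a job would forward to the script.
args = parse_args(["--train-split-percentage", "0.8"])
print(args.input_data, args.output_data, args.train_split_percentage)
```

With a script like this, the matching call would pass `arguments=['--train-split-percentage', '0.8']` to `run()`. The call in this notebook passes no arguments, so a script would fall back to its defaults.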

Note that we are sharding the data using `ShardedByS3Key` to spread the transformations across all worker nodes in the cluster.
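
With `ShardedByS3Key`, each instance receives only a subset of the S3 objects under the input prefix, roughly `1/n` of them for `n` instances. The sketch below illustrates the idea with a simple round-robin split. This is an assumption made for illustration and not necessarily how SageMaker assigns objects. The file names and the `shard_keys` helper are hypothetical.

```python
def shard_keys(keys, instance_count):
    """Split S3 keys into `instance_count` groups, round-robin over sorted keys."""
    shards = [[] for _ in range(instance_count)]
    for index, key in enumerate(sorted(keys)):
        shards[index % instance_count].append(key)
    return shards


# Hypothetical object names under the input prefix.
example_keys = [
    "amazon-reviews-pds/tsv/part-0.tsv.gz",
    "amazon-reviews-pds/tsv/part-1.tsv.gz",
    "amazon-reviews-pds/tsv/part-2.tsv.gz",
    "amazon-reviews-pds/tsv/part-3.tsv.gz",
    "amazon-reviews-pds/tsv/part-4.tsv.gz",
]

for instance_index, shard in enumerate(shard_keys(example_keys, instance_count=2)):
    print("instance {}: {} objects".format(instance_index, len(shard)))
```

One consequence worth keeping in mind: sharding works at the object level, so a prefix holding a single large file cannot be spread across instances.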

In [None]:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor = SKLearnProcessor(framework_version='0.20.0',
                             role=role,
                             instance_type='ml.m5.4xlarge',
                             instance_count=2)

In [None]:
processor.run(
    code='preprocess-scikit-label-split.py',
    inputs=[
        ProcessingInput(source=s3_input_data,
                        destination='/opt/ml/processing/input/data/',
                        s3_data_distribution_type='ShardedByS3Key'),
    ],
    outputs=[
        ProcessingOutput(s3_upload_mode='EndOfJob',
                         output_name='raw-labeled-split-balanced-header-train',
                         source='/opt/ml/processing/output/raw/labeled/split/balanced/header/train'),
        ProcessingOutput(s3_upload_mode='EndOfJob',
                         output_name='raw-labeled-split-balanced-header-validation',
                         source='/opt/ml/processing/output/raw/labeled/split/balanced/header/validation'),
        ProcessingOutput(s3_upload_mode='EndOfJob',
                         output_name='raw-labeled-split-balanced-header-test',
                         source='/opt/ml/processing/output/raw/labeled/split/balanced/header/test'),
    ],
    logs=True,
    wait=False,
)


In [None]:
scikit_processing_job_name = processor.jobs[-1].describe()['ProcessingJobName']

from IPython.display import display, HTML
display(HTML('<b>Review <a href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, scikit_processing_job_name)))


In [None]:
from IPython.display import display, HTML

# Our job writes to `processing_job_name` since we are using ProcessingOutput above
scikit_processing_job_s3_output_prefix = scikit_processing_job_name

display(HTML('<b>Review <a href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Processing Job Has Completed</b>'.format(bucket, scikit_processing_job_s3_output_prefix, region)))


# Please Wait Until the Processing Job Completes
Re-run the next cell until the job status shows `Completed`.

In [None]:
running_processor = sagemaker.processing.ProcessingJob.from_processing_name(processing_job_name=scikit_processing_job_name,
                                                                            sagemaker_session=sagemaker_session)

processing_job_description = running_processor.describe()

processing_job_status = processing_job_description['ProcessingJobStatus']
print('\n')
print(processing_job_status)
print('\n')

print(processing_job_description)
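
Instead of re-running the cell above by hand, the status check can be wrapped in a polling loop. The sketch below is one possible approach and not part of the SageMaker SDK: `wait_for_job` is a hypothetical helper that takes any callable returning a job description. The stand-in `describe` function exists only so the example runs without an AWS account.

```python
import time

# Statuses after which a processing job no longer changes state.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_job(describe, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call `describe()` until the job reaches a terminal status.

    `describe` must return a dict containing 'ProcessingJobStatus'.
    Raises TimeoutError if no terminal status is seen within `max_polls`.
    """
    for _ in range(max_polls):
        status = describe()["ProcessingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("Job did not finish after {} polls".format(max_polls))


# Stand-in for running_processor.describe so the sketch runs offline.
fake_statuses = iter(["InProgress", "InProgress", "Completed"])
final_status = wait_for_job(
    lambda: {"ProcessingJobStatus": next(fake_statuses)},
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
print(final_status)
```

In this notebook the call would be `wait_for_job(running_processor.describe)`. The SDK's `ProcessingJob.wait()` method is a blocking alternative when streaming logs into the notebook is acceptable.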

# Inspect the Processed Output Data

## The next cells will not work properly until the job completes above.

Take a look at a few rows of the transformed dataset to make sure the processing was successful.

In [None]:
output_config = processing_job_description['ProcessingOutputConfig']
for output in output_config['Outputs']:
    if output['OutputName'] == 'raw-labeled-split-balanced-header-train':
        processed_balanced_train_data = output['S3Output']['S3Uri']
    elif output['OutputName'] == 'raw-labeled-split-balanced-header-validation':
        processed_balanced_validation_data = output['S3Output']['S3Uri']
    elif output['OutputName'] == 'raw-labeled-split-balanced-header-test':
        processed_balanced_test_data = output['S3Output']['S3Uri']

print(processed_balanced_train_data)
print(processed_balanced_validation_data)
print(processed_balanced_test_data)

In [None]:
!aws s3 ls $processed_balanced_train_data/

In [None]:
!aws s3 ls $processed_balanced_validation_data/

In [None]:
!aws s3 ls $processed_balanced_test_data/
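
The listings above confirm that files exist, but not what they contain. A minimal sketch for looking at a few rows follows. It assumes the output files are comma-separated text with a header row, which this notebook does not confirm. The sample text, its column names, and both helper functions are hypothetical.

```python
import csv
import io
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Not an S3 URI: {}".format(uri))
    return parsed.netloc, parsed.path.lstrip("/")


def preview_rows(text, n=3, delimiter=","):
    """Return the first `n` rows of delimited text as lists of strings."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [row for _, row in zip(range(n), reader)]


# Hypothetical file content, standing in for a downloaded output object.
sample_text = "star_rating,review_body\n5,Great product\n1,Stopped working\n"

print(split_s3_uri("s3://example-bucket/example-job/output/train"))
for row in preview_rows(sample_text, n=2):
    print(row)
```

The bucket and key returned by `split_s3_uri` could then be passed to an S3 client to download one object before previewing it.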

# Pass `scikit_processing_job_s3_output_prefix` above as input to the next notebook

In [None]:
print(scikit_processing_job_s3_output_prefix)

In [None]:
%store scikit_processing_job_s3_output_prefix
