# Feature Transformation with an Amazon SageMaker Processing Job and Scikit-Learn

Typically, a machine learning (ML) process consists of a few steps: gathering data with various ETL jobs, pre-processing the data, featurizing the dataset by incorporating standard techniques or prior knowledge, and finally training an ML model using an algorithm.

Often, data processing libraries such as Scikit-Learn are used to pre-process datasets in order to prepare them for training. In this notebook, we'll use Amazon SageMaker Processing to run a Scikit-Learn processing workload in a managed SageMaker environment, spread across multiple instances.

![](img/prepare_dataset_bert.png)

![](img/processing.jpg)


## Contents

1. Setup Environment
1. Setup Input Data
1. Run the Processing Job using Amazon SageMaker
1. Review the Processing Script
1. Set the Processing Job Hyper-Parameters
1. Inspect the Processed Output Data
1. Pass Variables to the Next Notebook(s)

# Setup Environment

Let's start by specifying:
* The S3 bucket and prefixes that you use for training and model data. Use the default bucket specified by the Amazon SageMaker session.
* The IAM role ARN used to give processing and training access to the dataset.

In [1]:
import sagemaker
import boto3

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()
region = boto3.Session().region_name

sm = boto3.Session().client(service_name='sagemaker', region_name=region)

# Setup Input Data

In [2]:
raw_input_data_s3_uri = 's3://{}/amazon-reviews-pds/tsv-all/'.format(bucket)
print(raw_input_data_s3_uri)

s3://sagemaker-us-east-1-835319576252/amazon-reviews-pds/tsv-all/


In [3]:
!aws s3 ls $raw_input_data_s3_uri

2020-08-18 20:56:00  648641286 amazon_reviews_us_Apparel_v1_00.tsv.gz
2020-08-18 20:56:00  582145299 amazon_reviews_us_Automotive_v1_00.tsv.gz
2020-08-18 20:56:00  357392893 amazon_reviews_us_Baby_v1_00.tsv.gz
2020-08-18 20:56:00  914070021 amazon_reviews_us_Beauty_v1_00.tsv.gz
2020-08-18 20:56:00 2740337188 amazon_reviews_us_Books_v1_00.tsv.gz
2020-08-18 20:56:08 2692708591 amazon_reviews_us_Books_v1_01.tsv.gz
2020-08-18 20:56:13 1329539135 amazon_reviews_us_Books_v1_02.tsv.gz
2020-08-18 20:56:14  442653086 amazon_reviews_us_Camera_v1_00.tsv.gz
2020-08-18 20:56:20 2689739299 amazon_reviews_us_Digital_Ebook_Purchase_v1_00.tsv.gz
2020-08-18 20:56:24 1294879074 amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tsv.gz
2020-08-18 20:56:41  253570168 amazon_reviews_us_Digital_Music_Purchase_v1_00.tsv.gz
2020-08-18 20:56:46   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2020-08-18 20:56:47  506979922 amazon_reviews_us_Digital_Video_Download_v1_00.tsv.gz
2020-08-18 20:56

# Run the Processing Job using Amazon SageMaker

Next, use the Amazon SageMaker Python SDK to submit a processing job using our custom Python script.

# Review the Processing Script

In [4]:
!pygmentize preprocess-scikit-text-to-bert.py

from sklearn.model_selection import train_test_split
from sklearn.utils import resample
import functools
import multiprocessing

import pandas as pd
from datetime import datetime
import subprocess
import sys
subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'tensorflow==2.1.0'])
import tensorf

Run this script as a processing job. You need to specify one `ProcessingInput` whose `source` is the Amazon S3 location of the raw data and whose `destination` is the local path inside the Docker container, under `/opt/ml/processing/input`, from which the script reads the data. All local paths inside the processing container must begin with `/opt/ml/processing/`.

Also give the `run()` method a `ProcessingOutput` for each output, where the `source` is the path the script writes output data to. For outputs, the `destination` defaults to an S3 bucket that the Amazon SageMaker Python SDK creates for you, following the format `s3://sagemaker-<region>-<account_id>/<processing_job_name>/output/<output_name>/`. Each `ProcessingOutput` also takes an `output_name`, which makes it easier to retrieve the output artifacts after the job has run.
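
As a rough illustration of the default destination format described above, the URI can be assembled from its parts. The helper name `default_output_uri` and the sample values are hypothetical and not part of the SageMaker SDK, which builds this path for you.

```python
def default_output_uri(region, account_id, processing_job_name, output_name):
    """Build an S3 URI in the default output format (illustrative helper only)."""
    bucket = 'sagemaker-{}-{}'.format(region, account_id)
    return 's3://{}/{}/output/{}/'.format(bucket, processing_job_name, output_name)


# Placeholder values, not taken from a real account or job.
print(default_output_uri('us-east-1', '123456789012', 'my-processing-job', 'bert-train'))
```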

The `arguments` parameter of the `run()` method supplies the command-line arguments passed to our `preprocess-scikit-text-to-bert.py` script.
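
The script listing above is cut off before its argument handling, so the following is only a sketch of how such command-line arguments could be parsed with `argparse`. The names `parse_args` and `str_to_bool` are hypothetical. Note that a boolean passed as `str(True)` arrives as the string `'True'`, so it needs an explicit conversion; `type=bool` would treat any non-empty string as `True`.

```python
import argparse


def str_to_bool(value):
    """Convert strings such as 'True' or 'False' (as produced by str(bool)) to a bool."""
    lowered = value.lower()
    if lowered in ('true', '1', 'yes'):
        return True
    if lowered in ('false', '0', 'no'):
        return False
    raise argparse.ArgumentTypeError('Expected a boolean, got {!r}'.format(value))


def parse_args(argv=None):
    """Hypothetical parser for the arguments passed to the processing script."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--train-split-percentage', type=float, default=0.90)
    parser.add_argument('--validation-split-percentage', type=float, default=0.05)
    parser.add_argument('--test-split-percentage', type=float, default=0.05)
    parser.add_argument('--max-seq-length', type=int, default=64)
    parser.add_argument('--balance-dataset', type=str_to_bool, default=True)
    return parser.parse_args(argv)


# Same shape as the list given to the arguments parameter: every value is a string.
args = parse_args(['--max-seq-length', str(64), '--balance-dataset', str(True)])
print(args.max_seq_length, args.balance_dataset)
```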

Note that we are sharding the data using `ShardedByS3Key` to spread the transformations across all worker nodes in the cluster.
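
To build intuition for `ShardedByS3Key`: each instance receives a subset of the S3 objects under the input prefix rather than a full copy. The sketch below is a simplified model that assumes a round-robin assignment; the real assignment is handled by SageMaker and may differ. The function name `shard_keys` is hypothetical.

```python
def shard_keys(s3_keys, instance_count):
    """Assign each key to exactly one worker, round-robin over the sorted keys."""
    shards = {worker: [] for worker in range(1, instance_count + 1)}
    for index, key in enumerate(sorted(s3_keys)):
        shards[index % instance_count + 1].append(key)
    return shards


keys = [
    'amazon_reviews_us_Apparel_v1_00.tsv.gz',
    'amazon_reviews_us_Automotive_v1_00.tsv.gz',
    'amazon_reviews_us_Baby_v1_00.tsv.gz',
    'amazon_reviews_us_Beauty_v1_00.tsv.gz',
    'amazon_reviews_us_Books_v1_00.tsv.gz',
]
for worker, assigned in shard_keys(keys, instance_count=2).items():
    print('algo-{}: {}'.format(worker, assigned))
```

With `FullyReplicated`, by contrast, every instance would receive all of the files.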

# Set the Processing Job Hyper-Parameters 

In [5]:
train_split_percentage=0.90
validation_split_percentage=0.05
test_split_percentage=0.05
max_seq_length=64
balance_dataset=True
processing_instance_type='ml.c5.2xlarge'
processing_instance_count=20
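
Because the three split values are passed to the script independently, a quick sanity check before launching the job can catch a typo early. This is a minimal sketch; `validate_splits` is a hypothetical helper that is not part of the original notebook.

```python
import math


def validate_splits(train, validation, test):
    """Raise ValueError unless each split is strictly between 0 and 1 and they sum to 1.0."""
    for name, value in (('train', train), ('validation', validation), ('test', test)):
        if not 0.0 < value < 1.0:
            raise ValueError('{} split must be between 0 and 1, got {}'.format(name, value))
    total = train + validation + test
    # Compare with a tolerance, since floating-point sums are rarely exact.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError('Splits must sum to 1.0, got {}'.format(total))


validate_splits(0.90, 0.05, 0.05)  # passes silently
```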

In [7]:
from sagemaker.sklearn.processing import SKLearnProcessor

processor = SKLearnProcessor(framework_version='0.20.0',
                             role=role,
                             instance_type=processing_instance_type,
                             instance_count=processing_instance_count,
                             max_runtime_in_seconds=7200)

In [8]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor.run(code='preprocess-scikit-text-to-bert.py',
              inputs=[
                    ProcessingInput(source=raw_input_data_s3_uri,
                                    destination='/opt/ml/processing/input/data/',
                                    s3_data_distribution_type='ShardedByS3Key')
              ],
              outputs=[
                    ProcessingOutput(s3_upload_mode='EndOfJob',
                                     output_name='bert-train',
                                     source='/opt/ml/processing/output/bert/train'),
                    ProcessingOutput(s3_upload_mode='EndOfJob',
                                     output_name='bert-validation',
                                     source='/opt/ml/processing/output/bert/validation'),
                    ProcessingOutput(s3_upload_mode='EndOfJob',
                                     output_name='bert-test',
                                     source='/opt/ml/processing/output/bert/test'),
              ],
              arguments=['--train-split-percentage', str(train_split_percentage),
                         '--validation-split-percentage', str(validation_split_percentage),
                         '--test-split-percentage', str(test_split_percentage),
                         '--max-seq-length', str(max_seq_length),
                         '--balance-dataset', str(balance_dataset)
              ],
              logs=True,
              wait=False)

Parameter 'session' will be renamed to 'sagemaker_session' in SageMaker Python SDK v2.



Job Name:  sagemaker-scikit-learn-2020-08-18-21-30-21-410
Inputs:  [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/amazon-reviews-pds/tsv-all/', 'LocalPath': '/opt/ml/processing/input/data/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'ShardedByS3Key', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-08-18-21-30-21-410/input/code/preprocess-scikit-text-to-bert.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'bert-train', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-08-18-21-30-21-410/output/bert-train', 'LocalPath': '/opt/ml/processing/output/bert/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'bert-validation', 'S3Output': {

In [9]:
scikit_processing_job_name = processor.jobs[-1].describe()['ProcessingJobName']
print(scikit_processing_job_name)

sagemaker-scikit-learn-2020-08-18-21-30-21-410


In [10]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs/{}">Processing Job</a></b>'.format(region, scikit_processing_job_name)))


In [11]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(region, scikit_processing_job_name)))


In [12]:
from IPython.display import display, HTML

display(HTML('<b>Review <a target="_blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Processing Job Has Completed</b>'.format(bucket, scikit_processing_job_name, region)))


# Please Wait Until the Processing Job Completes
Re-run this next cell until the job status shows `Completed`.

In [13]:
running_processor = sagemaker.processing.ProcessingJob.from_processing_name(processing_job_name=scikit_processing_job_name,
                                                                            sagemaker_session=sagemaker_session)

processing_job_description = running_processor.describe()

processing_job_status = processing_job_description['ProcessingJobStatus']
print('\n')
print(processing_job_status)
print('\n')

print(processing_job_description)



InProgress


{'ProcessingInputs': [{'InputName': 'input-1', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/amazon-reviews-pds/tsv-all/', 'LocalPath': '/opt/ml/processing/input/data/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'ShardedByS3Key', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-08-18-21-30-21-410/input/code/preprocess-scikit-text-to-bert.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}], 'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'bert-train', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-08-18-21-30-21-410/output/bert-train', 'LocalPath': '/opt/ml/processing/output/bert/train', 'S3UploadMode': 'EndOfJob'}}, {'OutputName': 'bert-validation', 'S3Output': {'S3
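
Instead of re-running the cell by hand, the status check can be wrapped in a small polling loop. The sketch below is an assumption-laden example: `wait_for_job` is a hypothetical helper that accepts any zero-argument callable returning a description dictionary, such as `running_processor.describe`. It treats `Completed`, `Failed`, and `Stopped` as terminal values of `ProcessingJobStatus`.

```python
import time

TERMINAL_STATUSES = ('Completed', 'Failed', 'Stopped')


def wait_for_job(describe, poll_seconds=30, max_polls=240):
    """Poll describe() until ProcessingJobStatus is terminal, then return that status."""
    for _ in range(max_polls):
        status = describe()['ProcessingJobStatus']
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError('Job did not finish after {} polls'.format(max_polls))


# Stand-in for running_processor.describe so the sketch runs without AWS access.
fake_statuses = iter(['InProgress', 'InProgress', 'Completed'])
print(wait_for_job(lambda: {'ProcessingJobStatus': next(fake_statuses)}, poll_seconds=0))
```

In the notebook this could be called as `wait_for_job(running_processor.describe)`; the SDK's own `running_processor.wait(logs=False)` serves the same purpose.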

# _Please Wait Until the ^^ Processing Job ^^ Completes Above._

In [14]:
running_processor.wait(logs=False)

........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

# Inspect the Processed Output Data

Take a look at a few rows of the transformed dataset to make sure the processing was successful.

In [15]:
output_config = processing_job_description['ProcessingOutputConfig']
for output in output_config['Outputs']:
    if output['OutputName'] == 'bert-train':
        processed_train_data_s3_uri = output['S3Output']['S3Uri']
    if output['OutputName'] == 'bert-validation':
        processed_validation_data_s3_uri = output['S3Output']['S3Uri']
    if output['OutputName'] == 'bert-test':
        processed_test_data_s3_uri = output['S3Output']['S3Uri']
        
print(processed_train_data_s3_uri)
print(processed_validation_data_s3_uri)
print(processed_test_data_s3_uri)

s3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-08-18-21-30-21-410/output/bert-train
s3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-08-18-21-30-21-410/output/bert-validation
s3://sagemaker-us-east-1-835319576252/sagemaker-scikit-learn-2020-08-18-21-30-21-410/output/bert-test
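
The lookup above can also be expressed as a dictionary keyed by output name, which avoids one `if` per output. This is a sketch using a trimmed stand-in for the job description; `output_uris` is a hypothetical helper and the URIs are placeholders.

```python
def output_uris(job_description):
    """Map each OutputName to its S3 URI from a processing job description."""
    outputs = job_description['ProcessingOutputConfig']['Outputs']
    return {output['OutputName']: output['S3Output']['S3Uri'] for output in outputs}


# Trimmed example with the same shape as the describe() response shown earlier.
sample_description = {
    'ProcessingOutputConfig': {
        'Outputs': [
            {'OutputName': 'bert-train',
             'S3Output': {'S3Uri': 's3://example-bucket/example-job/output/bert-train'}},
            {'OutputName': 'bert-validation',
             'S3Output': {'S3Uri': 's3://example-bucket/example-job/output/bert-validation'}},
            {'OutputName': 'bert-test',
             'S3Output': {'S3Uri': 's3://example-bucket/example-job/output/bert-test'}},
        ]
    }
}

uris = output_uris(sample_description)
print(uris['bert-train'])
```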


In [16]:
!aws s3 ls $processed_train_data_s3_uri/

2020-08-18 21:37:52     544110 part-algo-1-amazon_reviews_us_Apparel_v1_00.tfrecord
2020-08-18 21:37:52     188249 part-algo-1-amazon_reviews_us_Home_Improvement_v1_00.tfrecord
2020-08-18 21:37:52     344551 part-algo-1-amazon_reviews_us_Toys_v1_00.tfrecord
2020-08-18 21:37:57     353794 part-algo-10-amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tfrecord
2020-08-18 21:37:57     241462 part-algo-10-amazon_reviews_us_Music_v1_00.tfrecord
2020-08-18 21:36:37      37469 part-algo-11-amazon_reviews_us_Digital_Music_Purchase_v1_00.tfrecord
2020-08-18 21:36:37      60966 part-algo-11-amazon_reviews_us_Musical_Instruments_v1_00.tfrecord
2020-08-18 21:37:12      10713 part-algo-12-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-08-18 21:37:12     209024 part-algo-12-amazon_reviews_us_Office_Products_v1_00.tfrecord
2020-08-18 21:37:17     289336 part-algo-13-amazon_reviews_us_Digital_Video_Download_v1_00.tfrecord
2020-08-18 21:37:17     166156 part-algo-13-amazon_reviews_us_Out

In [17]:
!aws s3 ls $processed_validation_data_s3_uri/

2020-08-18 21:37:52      30193 part-algo-1-amazon_reviews_us_Apparel_v1_00.tfrecord
2020-08-18 21:37:52      10653 part-algo-1-amazon_reviews_us_Home_Improvement_v1_00.tfrecord
2020-08-18 21:37:52      19451 part-algo-1-amazon_reviews_us_Toys_v1_00.tfrecord
2020-08-18 21:37:57      19887 part-algo-10-amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tfrecord
2020-08-18 21:37:57      13552 part-algo-10-amazon_reviews_us_Music_v1_00.tfrecord
2020-08-18 21:36:38       2283 part-algo-11-amazon_reviews_us_Digital_Music_Purchase_v1_00.tfrecord
2020-08-18 21:36:38       3754 part-algo-11-amazon_reviews_us_Musical_Instruments_v1_00.tfrecord
2020-08-18 21:37:13        669 part-algo-12-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-08-18 21:37:13      11527 part-algo-12-amazon_reviews_us_Office_Products_v1_00.tfrecord
2020-08-18 21:37:18      16304 part-algo-13-amazon_reviews_us_Digital_Video_Download_v1_00.tfrecord
2020-08-18 21:37:18       9387 part-algo-13-amazon_reviews_us_Out

In [18]:
!aws s3 ls $processed_test_data_s3_uri/

2020-08-18 21:37:53      30244 part-algo-1-amazon_reviews_us_Apparel_v1_00.tfrecord
2020-08-18 21:37:53      10434 part-algo-1-amazon_reviews_us_Home_Improvement_v1_00.tfrecord
2020-08-18 21:37:53      19832 part-algo-1-amazon_reviews_us_Toys_v1_00.tfrecord
2020-08-18 21:37:58      19811 part-algo-10-amazon_reviews_us_Digital_Ebook_Purchase_v1_01.tfrecord
2020-08-18 21:37:58      13406 part-algo-10-amazon_reviews_us_Music_v1_00.tfrecord
2020-08-18 21:36:38       2358 part-algo-11-amazon_reviews_us_Digital_Music_Purchase_v1_00.tfrecord
2020-08-18 21:36:38       3725 part-algo-11-amazon_reviews_us_Musical_Instruments_v1_00.tfrecord
2020-08-18 21:37:13        713 part-algo-12-amazon_reviews_us_Digital_Software_v1_00.tfrecord
2020-08-18 21:37:13      11842 part-algo-12-amazon_reviews_us_Office_Products_v1_00.tfrecord
2020-08-18 21:37:18      16321 part-algo-13-amazon_reviews_us_Digital_Video_Download_v1_00.tfrecord
2020-08-18 21:37:18       9469 part-algo-13-amazon_reviews_us_Out

# Pass Variables to the Next Notebook(s)

In [19]:
%store raw_input_data_s3_uri

Stored 'raw_input_data_s3_uri' (str)


In [20]:
%store max_seq_length

Stored 'max_seq_length' (int)


In [21]:
%store train_split_percentage

Stored 'train_split_percentage' (float)


In [22]:
%store validation_split_percentage

Stored 'validation_split_percentage' (float)


In [23]:
%store test_split_percentage

Stored 'test_split_percentage' (float)


In [24]:
%store balance_dataset

Stored 'balance_dataset' (bool)


In [25]:
%store processed_train_data_s3_uri

Stored 'processed_train_data_s3_uri' (str)


In [26]:
%store processed_validation_data_s3_uri

Stored 'processed_validation_data_s3_uri' (str)


In [27]:
%store processed_test_data_s3_uri

Stored 'processed_test_data_s3_uri' (str)


In [28]:
%store

Stored variables and their in-db values:
balance_dataset                                   -> True
experiment_name                                   -> 'Amazon-Customer-Reviews-BERT-Experiment-159772236
max_seq_length                                    -> 64
model_ab_endpoint_name                            -> 'tensorflow-training-2020-08-18-03-46-11-116-abtes
prepare_trial_component_name                      -> 'TrialComponent-2020-08-18-034608-bgtf'
processed_test_data_s3_uri                        -> 's3://sagemaker-us-east-1-835319576252/sagemaker-s
processed_train_data_s3_uri                       -> 's3://sagemaker-us-east-1-835319576252/sagemaker-s
processed_validation_data_s3_uri                  -> 's3://sagemaker-us-east-1-835319576252/sagemaker-s
processing_code_s3_prefix                         -> 'pipeline_sklearn_processing/1597769474/code'
pytorch_endpoint_name                             -> 'tensorflow-training-2020-08-18-03-46-11-116-pt-15
pytorch_model_name           

In [None]:
%%javascript
Jupyter.notebook.save_checkpoint();
Jupyter.notebook.session.delete();