# Feature Transformation with an Amazon SageMaker Processing Job and Scikit-Learn

In this notebook, we convert raw text into embeddings.  This will allow us to perform natural language processing tasks such as text classification.

Typically, a machine learning (ML) process consists of a few steps: gathering data with various ETL jobs, pre-processing the data, featurizing the dataset by incorporating standard techniques or prior knowledge, and finally training an ML model using an algorithm.

Often, libraries such as Scikit-Learn are used to pre-process data sets in order to prepare them for training. In this notebook, we'll use Amazon SageMaker Processing to run our Scikit-Learn processing workload in a managed SageMaker environment.

# NOTE:  THIS NOTEBOOK WILL TAKE 5-10 MINUTES TO COMPLETE.

# PLEASE BE PATIENT.

![](img/processing.jpg)

## Contents

1. Setup Environment
1. Setup Input Data
1. Setup Output Data
1. Build a Scikit-Learn container for running the processing job
1. Run the Processing Job using Amazon SageMaker
1. Inspect the Processed Output Data

# Setup Environment

Let's start by specifying:
* The S3 bucket and prefixes that you use for training and model data. Use the default bucket specified by the Amazon SageMaker session.
* The IAM role ARN used to give processing and training access to the dataset.

In [2]:
import sagemaker
import boto3

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sess.default_bucket()
region = boto3.Session().region_name

import botocore.config

config = botocore.config.Config(
    user_agent_extra='dsoaws/2.0'
)

sm = boto3.Session().client(service_name="sagemaker", 
                            region_name=region, 
                            config=config)
s3 = boto3.Session().client(service_name="s3", 
                            region_name=region,
                            config=config)

# Setup Input Data

In [3]:
%store -r s3_public_path_tsv

In [4]:
try:
    s3_public_path_tsv
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] Please run the notebooks in the INGEST section before you continue.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

In [5]:
print(s3_public_path_tsv)

s3://amazon-reviews-pds/tsv


In [6]:
%store -r s3_private_path_tsv

In [7]:
try:
    s3_private_path_tsv
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] Please run the notebooks in the INGEST section before you continue.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

In [8]:
print(s3_private_path_tsv)

s3://sagemaker-us-east-1-079002598131/amazon-reviews-pds/tsv


In [9]:
raw_input_data_s3_uri = "s3://{}/amazon-reviews-pds/tsv/".format(bucket)
print(raw_input_data_s3_uri)

s3://sagemaker-us-east-1-079002598131/amazon-reviews-pds/tsv/


In [10]:
!aws s3 ls $raw_input_data_s3_uri

2023-03-10 00:46:18   18997559 amazon_reviews_us_Digital_Software_v1_00.tsv.gz
2023-03-10 00:46:19   27442648 amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz
2023-03-10 00:46:20   12134676 amazon_reviews_us_Gift_Card_v1_00.tsv.gz


# Run the Processing Job using Amazon SageMaker

Next, use the Amazon SageMaker Python SDK to submit a processing job using our custom Python script.

# Review the Processing Script

In [11]:
!pygmentize preprocess.py

from sklearn.model_selection import train_test_split
from sklearn.utils import resample
import functools
import multiprocessing
from datetime import datetime
from time import gmtime, strftime, sleep
import sys
import re
import collections
import argparse
import json

Run this script as a processing job.  You need to specify one `ProcessingInput` whose `source` argument is the Amazon S3 location of the raw data and whose `destination` is the local path, inside the Docker container, from which the script reads that data (here, a path under `/opt/ml/processing/input`).  All local paths inside the processing container must begin with `/opt/ml/processing/`.

Also give the `run()` method a `ProcessingOutput` for each output, where the `source` is the local path the script writes output data to.  For outputs, the `destination` defaults to an S3 bucket that the Amazon SageMaker Python SDK creates for you, following the format `s3://sagemaker-<region>-<account_id>/<processing_job_name>/output/<output_name>/`.  Give each `ProcessingOutput` an `output_name` as well, to make it easier to retrieve these output artifacts after the job has run.
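The path conventions described above can be sketched in a few lines of plain Python. This is only an illustration: the helper names `is_valid_local_path` and `default_output_s3_uri` are hypothetical and not part of the SageMaker SDK, and the account ID in the usage note is a placeholder.

```python
from posixpath import normpath

# Root that all local paths inside the processing container must begin with.
PROCESSING_ROOT = "/opt/ml/processing/"


def is_valid_local_path(path: str) -> bool:
    """Return True if `path` lies under the processing container root."""
    # Normalize first so that ".." segments cannot escape the root.
    return (normpath(path) + "/").startswith(PROCESSING_ROOT)


def default_output_s3_uri(region: str, account_id: str, job_name: str, output_name: str) -> str:
    """Build an output URI following the default format quoted in the text."""
    return f"s3://sagemaker-{region}-{account_id}/{job_name}/output/{output_name}/"
```

For example, `is_valid_local_path("/opt/ml/processing/input/data/")` holds while `is_valid_local_path("/tmp/data")` does not, and `default_output_s3_uri("us-east-1", "123456789012", "my-job", "train")` yields a URI of the form shown in the format string.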

The `arguments` parameter of the `run()` method holds the command-line arguments that are passed to our `preprocess.py` script.
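As a rough sketch of the receiving side, the script could read these command-line arguments with `argparse` as shown below. The real `preprocess.py` is not reproduced in full here, so the parser, the defaults, and the `str_to_bool` helper are assumptions for illustration only.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Interpret strings such as "True" or "False".

    The notebook passes str(True) on the command line, and plain bool("False")
    would evaluate to True, so an explicit conversion is needed.
    """
    return value.strip().lower() in ("true", "1", "yes")


def parse_args(argv=None):
    """Hypothetical argument parsing for a script like preprocess.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-split-percentage", type=float, default=0.90)
    parser.add_argument("--validation-split-percentage", type=float, default=0.05)
    parser.add_argument("--test-split-percentage", type=float, default=0.05)
    parser.add_argument("--balance-dataset", type=str_to_bool, default=True)
    parser.add_argument("--model-checkpoint", type=str, default=None)
    parser.add_argument("--dataset-templates-name", type=str, default=None)
    parser.add_argument("--prompt-template-name", type=str, default=None)
    return parser.parse_args(argv)


# Simulate the argument list that run() would hand to the script.
args = parse_args([
    "--train-split-percentage", str(0.90),
    "--balance-dataset", str(True),
    "--model-checkpoint", "bigscience/bloomz-560m",
])
print(args.train_split_percentage, args.balance_dataset, args.model_checkpoint)
```

Because every value arrives as a string, the `type=` converters do the work of restoring floats and booleans inside the container.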

Note that we shard the data using `ShardedByS3Key` to spread the transformations across all worker nodes in the cluster.
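To build intuition for sharding by S3 key, the toy simulation below assigns each input file to exactly one instance. The round-robin rule and the `shard_keys` helper are assumptions made for illustration; the actual assignment is decided by SageMaker and may differ.

```python
def shard_keys(keys, instance_count):
    """Assign each S3 key to exactly one instance (round-robin over sorted keys)."""
    shards = [[] for _ in range(instance_count)]
    for index, key in enumerate(sorted(keys)):
        shards[index % instance_count].append(key)
    return shards


# The three input files listed earlier in this notebook.
input_keys = [
    "amazon_reviews_us_Digital_Software_v1_00.tsv.gz",
    "amazon_reviews_us_Digital_Video_Games_v1_00.tsv.gz",
    "amazon_reviews_us_Gift_Card_v1_00.tsv.gz",
]

for instance, shard in enumerate(shard_keys(input_keys, instance_count=2)):
    print(f"instance {instance}: {shard}")
```

The point of the sketch is that each instance sees only a subset of the files, in contrast to a fully replicated input, where every instance would receive all of them.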

In [12]:
processing_instance_type = "ml.c5.2xlarge"
processing_instance_count = 2
train_split_percentage = 0.90
validation_split_percentage = 0.05
test_split_percentage = 0.05
balance_dataset = True
model_checkpoint = 'bigscience/bloomz-560m'
dataset_templates_name = 'amazon_us_reviews/Wireless_v1_00'
prompt_template_name = 'Given the review body return a categorical rating'

In [13]:
from sagemaker.sklearn.processing import SKLearnProcessor

processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
    env={"AWS_DEFAULT_REGION": region},
    max_runtime_in_seconds=7200,
)

In [14]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

processor.run(
    code="preprocess.py",
    inputs=[
        ProcessingInput(
            input_name="raw-input-data",
            source=raw_input_data_s3_uri,
            destination="/opt/ml/processing/input/data/",
            s3_data_distribution_type="ShardedByS3Key",
        )
    ],
    outputs=[
        ProcessingOutput(
            output_name="train", 
            s3_upload_mode="EndOfJob", 
            source="/opt/ml/processing/output/data/train"
        ),
        ProcessingOutput(
            output_name="validation",
            s3_upload_mode="EndOfJob",
            source="/opt/ml/processing/output/data/validation",
        ),
        ProcessingOutput(
            output_name="test", s3_upload_mode="EndOfJob", source="/opt/ml/processing/output/data/test"
        ),
    ],
    arguments=[
        "--train-split-percentage",
        str(train_split_percentage),
        "--validation-split-percentage",
        str(validation_split_percentage),
        "--test-split-percentage",
        str(test_split_percentage),
        "--balance-dataset",
        str(balance_dataset),
        "--model-checkpoint",
        str(model_checkpoint),
        "--dataset-templates-name",
        str(dataset_templates_name),
        "--prompt-template-name",
        str(prompt_template_name),
    ],
    logs=True,
    wait=False,
)

INFO:sagemaker:Creating processing-job with name sagemaker-scikit-learn-2023-03-10-01-44-04-130


In [15]:
scikit_processing_job_name = processor.jobs[-1].describe()["ProcessingJobName"]
print(scikit_processing_job_name)

sagemaker-scikit-learn-2023-03-10-01-44-04-130


In [16]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="_blank" href="https://console.aws.amazon.com/sagemaker/home?region={}#/processing-jobs/{}">Processing Job</a></b>'.format(
            region, scikit_processing_job_name
        )
    )
)

In [17]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="_blank" href="https://console.aws.amazon.com/cloudwatch/home?region={}#logStream:group=/aws/sagemaker/ProcessingJobs;prefix={};streamFilter=typeLogStreamPrefix">CloudWatch Logs</a> After About 5 Minutes</b>'.format(
            region, scikit_processing_job_name
        )
    )
)

In [18]:
from IPython.display import display, HTML

display(
    HTML(
        '<b>Review <a target="_blank" href="https://s3.console.aws.amazon.com/s3/buckets/{}/{}/?region={}&tab=overview">S3 Output Data</a> After The Processing Job Has Completed</b>'.format(
            bucket, scikit_processing_job_name, region
        )
    )
)

# Monitor the Processing Job

In [19]:
running_processor = sagemaker.processing.ProcessingJob.from_processing_name(
    processing_job_name=scikit_processing_job_name, sagemaker_session=sess
)

processing_job_description = running_processor.describe()

print(processing_job_description)

{'ProcessingInputs': [{'InputName': 'raw-input-data', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-079002598131/amazon-reviews-pds/tsv/', 'LocalPath': '/opt/ml/processing/input/data/', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'ShardedByS3Key', 'S3CompressionType': 'None'}}, {'InputName': 'code', 'AppManaged': False, 'S3Input': {'S3Uri': 's3://sagemaker-us-east-1-079002598131/sagemaker-scikit-learn-2023-03-10-01-44-04-130/input/code/preprocess.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}], 'ProcessingOutputConfig': {'Outputs': [{'OutputName': 'train', 'S3Output': {'S3Uri': 's3://sagemaker-us-east-1-079002598131/sagemaker-scikit-learn-2023-03-10-01-44-04-130/output/train', 'LocalPath': '/opt/ml/processing/output/data/train', 'S3UploadMode': 'EndOfJob'}, 'AppManaged': False}, {'OutputName': 'validation', 

In [20]:
running_processor.wait(logs=False)

..............................................................................................................!

# _Please Wait Until the ^^ Processing Job ^^ Completes Above._
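If you prefer explicit control over waiting, a polling loop along the following lines is one option. It is a minimal sketch: `wait_for_processing_job` is a hypothetical helper, the polling interval and budget are arbitrary, and it assumes the description dictionary carries a `ProcessingJobStatus` field as returned by the DescribeProcessingJob API.

```python
import time

# Statuses after which the job will not change state again.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_processing_job(describe, poll_seconds=30, max_polls=240, sleep=time.sleep):
    """Poll `describe()` until the job reaches a terminal status, then return it."""
    for _ in range(max_polls):
        status = describe()["ProcessingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError("Processing job did not finish within the polling budget")


# In this notebook, the call could look like:
# final_status = wait_for_processing_job(running_processor.describe)
```

Passing the `describe` callable in, rather than hard-coding a client, keeps the loop easy to exercise with a stub before pointing it at a real job.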

# Inspect the Processed Output Data

Take a look at a few rows of the transformed dataset to make sure the processing was successful.
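Before reading any rows, it can help to confirm programmatically that each split actually produced files. The sketch below lists objects under an output URI using the S3 `list_objects_v2` call; the helper names are hypothetical, and only the first page of results (up to 1,000 keys) is read.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/prefix' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_output_objects(s3_client, uri):
    """Return (key, size) pairs for objects under `uri` (first page only)."""
    bucket, prefix = split_s3_uri(uri)
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    # "Contents" is absent from the response when no objects match the prefix.
    return [(obj["Key"], obj["Size"]) for obj in response.get("Contents", [])]


# In this notebook, the call could look like:
# list_output_objects(s3, processed_train_data_s3_uri)
```

An empty list for any of the three splits would be a sign that the job did not write the expected output.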

In [21]:
processing_job_description = running_processor.describe()

output_config = processing_job_description["ProcessingOutputConfig"]
for output in output_config["Outputs"]:
    if output["OutputName"] == "train":
        processed_train_data_s3_uri = output["S3Output"]["S3Uri"]
    if output["OutputName"] == "validation":
        processed_validation_data_s3_uri = output["S3Output"]["S3Uri"]
    if output["OutputName"] == "test":
        processed_test_data_s3_uri = output["S3Output"]["S3Uri"]

print(processed_train_data_s3_uri)
print(processed_validation_data_s3_uri)
print(processed_test_data_s3_uri)

s3://sagemaker-us-east-1-079002598131/sagemaker-scikit-learn-2023-03-10-01-44-04-130/output/train
s3://sagemaker-us-east-1-079002598131/sagemaker-scikit-learn-2023-03-10-01-44-04-130/output/validation
s3://sagemaker-us-east-1-079002598131/sagemaker-scikit-learn-2023-03-10-01-44-04-130/output/test


In [22]:
!aws s3 ls $processed_train_data_s3_uri/

2023-03-10 01:52:57   12095208 amazon_reviews_us_Digital_Software_v1_00.parquet
2023-03-10 01:53:12   13611143 amazon_reviews_us_Digital_Video_Games_v1_00.parquet
2023-03-10 01:52:57    1438759 amazon_reviews_us_Gift_Card_v1_00.parquet


In [23]:
!aws s3 ls $processed_validation_data_s3_uri/

2023-03-10 01:52:58     687294 amazon_reviews_us_Digital_Software_v1_00.parquet
2023-03-10 01:53:13     773591 amazon_reviews_us_Digital_Video_Games_v1_00.parquet
2023-03-10 01:52:58      79277 amazon_reviews_us_Gift_Card_v1_00.parquet


In [24]:
!aws s3 ls $processed_test_data_s3_uri/

2023-03-10 01:52:58     687294 amazon_reviews_us_Digital_Software_v1_00.parquet
2023-03-10 01:53:13     773591 amazon_reviews_us_Digital_Video_Games_v1_00.parquet
2023-03-10 01:52:58      79277 amazon_reviews_us_Gift_Card_v1_00.parquet


# Pass Variables to the Next Notebook(s)

In [25]:
%store raw_input_data_s3_uri

Stored 'raw_input_data_s3_uri' (str)


In [26]:
%store train_split_percentage

Stored 'train_split_percentage' (float)


In [27]:
%store validation_split_percentage

Stored 'validation_split_percentage' (float)


In [28]:
%store test_split_percentage

Stored 'test_split_percentage' (float)


In [29]:
%store balance_dataset

Stored 'balance_dataset' (bool)


In [30]:
%store model_checkpoint

Stored 'model_checkpoint' (str)


In [31]:
%store dataset_templates_name

Stored 'dataset_templates_name' (str)


In [32]:
%store prompt_template_name

Stored 'prompt_template_name' (str)


In [33]:
%store processed_train_data_s3_uri

Stored 'processed_train_data_s3_uri' (str)


In [34]:
%store processed_validation_data_s3_uri

Stored 'processed_validation_data_s3_uri' (str)


In [35]:
%store processed_test_data_s3_uri

Stored 'processed_test_data_s3_uri' (str)


In [36]:
%store

Stored variables and their in-db values:
balance_dataset                                       -> True
dataset_templates_name                                -> 'amazon_us_reviews/Wireless_v1_00'
ingest_create_athena_table_parquet_passed             -> True
model_checkpoint                                      -> 'bigscience/bloomz-560m'
pipeline_endpoint_name                                -> 'model-from-registry-ep-1678382518'
pipeline_experiment_name                              -> 'pipeline-1678412578'
pipeline_name                                         -> 'pipeline-1678412578'
pipeline_trial_name                                   -> 'trial-1678412578'
processed_test_data_s3_uri                            -> 's3://sagemaker-us-east-1-079002598131/sagemaker-s
processed_train_data_s3_uri                           -> 's3://sagemaker-us-east-1-079002598131/sagemaker-s
processed_validation_data_s3_uri                      -> 's3://sagemaker-us-east-1-079002598131/sagemaker-s
prompt_tem

# Release Resources

In [37]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>
        
<script>
try {
    els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>