# Module 3: Feature Engineering using Amazon SageMaker Processing Jobs

In module 3, you use SageMaker Processing jobs to perform feature engineering using the transformations you applied to the data in module 2. In module 4, you use the SageMaker XGBoost algorithm to train your model. In module 5, you deploy a SageMaker Inference Pipeline endpoint consisting of Feature Transformer and XGBoost steps for real-time inference, and in module 6 you deploy an API endpoint for consumers using Amazon API Gateway and AWS Lambda. You perform inference by invoking the API endpoint in module 7. Finally, in module 8, you create a workflow for the end-to-end process by using Amazon SageMaker Pipelines. You use SageMaker Experiments throughout the process to track the steps.

In the previous module, you performed some exploratory data analysis and preprocessing in the notebook using Python scripts. In this module, you will use [Amazon SageMaker Processing](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html), which allows you to run preprocessing on fully managed infrastructure, to create and run a processing job. You will leverage the same feature transformer model you created in the previous module using SKLearn.

In [None]:
import boto3
import sagemaker

role = sagemaker.get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.Session()
s3_bucket_name = sagemaker_session.default_bucket()
s3_key_prefix = 'end-to-end-ml'

print(region)
print(role)
print(s3_bucket_name)

You will use the same dataset as the previous module. Download the "AI4I 2020 Predictive Maintenance Dataset" from the UCI Machine Learning Repository.

In [None]:
import urllib.request  # importing the submodule explicitly makes urlretrieve available

dataset_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00601/ai4i2020.csv"
raw_data_file_name = "predictive_maintenance_raw_data_header.csv"
urllib.request.urlretrieve(dataset_url, raw_data_file_name)

The SageMaker Processing job you are going to create reads its input dataset from Amazon S3, so you need to upload the dataset to S3 first. Each SageMaker session is associated with a default S3 bucket, so you will upload the file to that bucket.

In [None]:
raw_data_key = f'{s3_key_prefix}/data/raw'
sagemaker_session.upload_data(raw_data_file_name, s3_bucket_name, key_prefix=raw_data_key)

## Data Exploration

You have already performed data exploration in the previous module, so you are not going to repeat the same process in this module. However, you can load the downloaded file into a dataframe and explore the data further if needed.

In [None]:
import pandas as pd

df = pd.read_csv(raw_data_file_name)
print('The shape of the dataset is:', df.shape)

## Create an experiment

As before, you will leverage [Amazon SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html) to track your experiments. Remember that each experiment is a collection of runs. Each run is a collection of inputs, parameters, configurations, and results of an iteration in the ML lifecycle.

You create an **experiment** to track a series of iterations including processing and training jobs, and create a new **run** each time you run a processing or a training job.

Choose a name for the experiment.

In [None]:
from sagemaker.experiments.run import Run

import time
experiment_name = f"ml-end-to-end-{time.strftime('%Y-%m-%d-%H-%M-%S', time.localtime())}"
print(f"Experiment name: {experiment_name}")

From now on, you will use the experiment name for the preprocessing and training jobs, so persist it using the store magic.

In [None]:
%store experiment_name

## Using Amazon SageMaker Processing for preprocessing

The preprocessing and feature engineering code is implemented in the `source_dir/preprocessor.py` file. The preprocessing logic in this file is similar to the previous module. However, this time, you will use SageMaker Processing to run the preprocessing logic. 

The script splits the dataset into train, validation, and test sets and transforms the data. It then saves the datasets to specific output paths. After SageMaker Processing runs the script, it automatically copies the contents from the local directories in the processing container to Amazon S3. The featurizer model is needed later for transforming the features during inference, so the script also saves the featurizer model for later use.

In [None]:
!pygmentize source_dir/preprocessor.py
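
The argument handling and splitting step described above can be sketched with the standard library alone. This is a minimal illustration, not the contents of `source_dir/preprocessor.py`: the `--train-test-split-ratio` argument name comes from this notebook, while the helper names `parse_args` and `three_way_split`, the fixed seed, and the choice to divide the held-out rows evenly between validation and test are assumptions made for the example.

```python
import argparse
import random


def parse_args(argv=None):
    """Parse the argument that the processing job passes to the script."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-test-split-ratio", type=float, default=0.2)
    return parser.parse_args(argv)


def three_way_split(rows, holdout_ratio, seed=42):
    """Shuffle rows, hold out `holdout_ratio` of them, and divide the
    held-out rows evenly into validation and test sets (an assumption)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_ratio))
    n_train = len(shuffled) - n_holdout
    n_val = n_holdout // 2
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Placeholder rows stand in for the records of the raw dataset.
args = parse_args(["--train-test-split-ratio", "0.2"])
train, val, test = three_way_split(range(10000), args.train_test_split_ratio)
print(len(train), len(val), len(test))
```

In a real processing script, each split would then be written under the container directories that the job maps to S3, such as `/opt/ml/processing/train`.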

Configuring an Amazon SageMaker Processing job through the SageMaker Python SDK requires creating a `Processor` object. In this case, you will use the default SKLearn container for processing, so you will use `SKLearnProcessor`. The object takes a number of parameters, including the number and type of instances to use.

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

sklearn_processor = SKLearnProcessor(role=role,
                                     base_job_name='sm-feature-engineering',
                                     instance_type='ml.m5.large',
                                     instance_count=1,
                                     framework_version='0.20.0')

Invoke the `run` method of the `SKLearnProcessor` object to start the job, specifying the script to execute, the arguments to pass to the script, and the configuration of inputs and outputs.

In [None]:
run_name=f'feature-engineering-{time.strftime("%H-%M-%S", time.localtime())}'
run_display_name=run_name

raw_data_path = f's3://{s3_bucket_name}/{s3_key_prefix}/data/raw/'
train_data_path = f's3://{s3_bucket_name}/{s3_key_prefix}/data/preprocessed/train/'
val_data_path = f's3://{s3_bucket_name}/{s3_key_prefix}/data/preprocessed/val/'
test_data_path = f's3://{s3_bucket_name}/{s3_key_prefix}/data/preprocessed/test/'
model_path = f's3://{s3_bucket_name}/{s3_key_prefix}/output/sklearn/'

with Run(
    experiment_name=experiment_name,
    run_name=run_name,
    run_display_name=run_display_name,
    sagemaker_session=sagemaker_session,
) as run:

    sklearn_processor.run(code='source_dir/preprocessor.py',
                          inputs=[ProcessingInput(input_name='raw_data', source=raw_data_path, destination='/opt/ml/processing/input')],
                          outputs=[ProcessingOutput(output_name='train_data', source='/opt/ml/processing/train', destination=train_data_path),
                                   ProcessingOutput(output_name='val_data', source='/opt/ml/processing/val', destination=val_data_path),
                                   ProcessingOutput(output_name='test_data', source='/opt/ml/processing/test', destination=test_data_path),
                                   ProcessingOutput(output_name='model', source='/opt/ml/processing/model', destination=model_path)],
                          arguments=['--train-test-split-ratio', '0.2'])

While the job is running, you can review its configurations, logs, and metrics under the Processing tab in the SageMaker Console (not in SageMaker Studio).
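
The job configuration can also be inspected programmatically. The sketch below assumes a job description shaped like the response of the SageMaker `DescribeProcessingJob` API, which should be obtainable in this notebook through `sklearn_processor.latest_job.describe()`. The helper name `output_locations` is hypothetical, and the sample dictionary is a hand-written stand-in with a placeholder bucket name, not output from a real job.

```python
def output_locations(job_description: dict) -> dict:
    """Map each processing output name to the S3 URI it is uploaded to."""
    outputs = job_description.get("ProcessingOutputConfig", {}).get("Outputs", [])
    return {out["OutputName"]: out["S3Output"]["S3Uri"] for out in outputs}


# Hand-written stand-in for a job description (placeholder bucket name).
sample_description = {
    "ProcessingJobStatus": "Completed",
    "ProcessingOutputConfig": {
        "Outputs": [
            {
                "OutputName": "train_data",
                "S3Output": {
                    "S3Uri": "s3://example-bucket/end-to-end-ml/data/preprocessed/train/",
                    "LocalPath": "/opt/ml/processing/train",
                    "S3UploadMode": "EndOfJob",
                },
            },
            {
                "OutputName": "model",
                "S3Output": {
                    "S3Uri": "s3://example-bucket/end-to-end-ml/output/sklearn/",
                    "LocalPath": "/opt/ml/processing/model",
                    "S3UploadMode": "EndOfJob",
                },
            },
        ]
    },
}

print("Status:", sample_description["ProcessingJobStatus"])
for name, uri in output_locations(sample_description).items():
    print(name, "->", uri)
```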

Once the job is completed, you can analyze the preprocessed dataset by downloading one of the feature files from S3 and loading it into a dataframe.

In [None]:
validation_features_key_prefix = f'{s3_key_prefix}/data/preprocessed/val/val_features.csv'

sagemaker_session.download_data('./', s3_bucket_name, validation_features_key_prefix)

In [None]:
import pandas as pd
df = pd.read_csv('val_features.csv')

df.head(10)

You can see that the categorical variables have been one-hot encoded. You can also verify that no NaN values remain, as expected.
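
One way to run both checks is sketched below. The helper name `summarize_features` and the column names in the sample frame are hypothetical; the actual columns of `val_features.csv` depend on the preprocessing script. The sample frame is built in memory so that the example runs without the downloaded file.

```python
import pandas as pd


def summarize_features(features: pd.DataFrame) -> dict:
    """Count NaN values and list columns that contain only 0/1 values,
    which is what one-hot encoded columns look like."""
    nan_count = int(features.isna().sum().sum())
    binary_columns = [
        col for col in features.columns
        if set(features[col].dropna().unique()) <= {0, 1}
    ]
    return {"nan_count": nan_count, "binary_columns": binary_columns}


# Tiny in-memory stand-in for the preprocessed features (hypothetical columns).
sample = pd.DataFrame({
    "type_L": [1.0, 0.0, 0.0],
    "type_M": [0.0, 1.0, 0.0],
    "air_temperature": [-0.95, 0.12, 1.30],
})
summary = summarize_features(sample)
print(summary)
```

In the notebook, the same helper could be applied to the `df` loaded from `val_features.csv`.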

### Experiment analytics

As before, you can visualize experiment analytics using Amazon SageMaker Studio Experiments ([learn more](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments-view-compare.html)) or use the SageMaker SDK to display the experiment analytics, as follows:

In [None]:
from sagemaker.analytics import ExperimentAnalytics

analytics = ExperimentAnalytics(experiment_name=experiment_name)
analytics.dataframe()

## You have completed Module 3

The preprocessed data is now stored in Amazon S3 and is ready for the training step. 

Open the notebook **04_train_model.ipynb** in module 4.