<h1>Data exploration, preprocessing and feature engineering</h1>

In this and the following notebooks, we demonstrate how to build an ML pipeline that combines SKLearn feature transformers with the SageMaker XGBoost algorithm. After the model is trained, we deploy the pipeline (feature transformer and XGBoost) as a SageMaker Inference Pipeline behind a single endpoint for real-time inference.

In this notebook we tackle the first steps: data exploration and preparation. We use a standard Python library (Pandas) to query the dataset and get a first impression of data quality and available features, and then [Amazon SageMaker Processing](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html) to process the data and build the feature transformer model with SKLearn.

In [None]:
from notebook_utilities import check_dependencies
check_dependencies() # Check SageMaker, Numpy, Pandas versions and install upgrades if needed

In [None]:
import sagemaker, pandas, numpy

print('sagemaker version:', sagemaker.__version__)
print('pandas version:', pandas.__version__)
print('numpy version:', numpy.__version__)


In [None]:
import boto3
import time

role = sagemaker.get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.Session()
bucket_name = sagemaker_session.default_bucket()
prefix = 'endtoendmlsm'

print(region)
print(role)
print(bucket_name)

We can now copy the dataset used for this use case to our bucket. We will use the "AI4I 2020 Predictive Maintenance Dataset" from the UCI Machine Learning Repository, available at: https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset

Let's download the dataset and upload it to our Amazon S3 bucket so that AWS services can access the data.

In [None]:
import urllib.request  # the request submodule must be imported explicitly

dataset_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00601/ai4i2020.csv"
file_name = "predmain_raw_data_header.csv"
urllib.request.urlretrieve(dataset_url, file_name)

In [None]:
sagemaker_session.upload_data(file_name, bucket_name, 
                              key_prefix='{0}/data/raw'.format(prefix))

<h2>Data Exploration</h2>

Let's take a look at the shape of our dataset.

In [None]:
import pandas as pd

df = pd.read_csv('predmain_raw_data_header.csv')
print('The shape of the dataset is:', df.shape)

Let's now look at the records by printing the first 8 rows.

In [None]:
df.head(8)

Let's see the data types for each column and identify any columns with missing values.

In [None]:
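df.info()      # prints the dtype and non-null count of each column
df.describe()  # summary statistics for the numeric columns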
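
To identify columns with missing values more directly, one option is to count the NaN entries per column. The sketch below uses a small made-up DataFrame (the column names `a` and `b` are hypothetical); the same `isna().sum()` call can be applied to `df`.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only; not taken from the workshop dataset.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", "y", None],
})

# isna() marks missing cells; summing the booleans counts them per column.
missing_per_column = toy.isna().sum()

# Keep only the columns that actually contain missing values.
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()
print(missing_per_column)
print(columns_with_missing)
```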

Let's see which values the field "Machine failure" takes and how frequently each one occurs across the entire dataset.

In [None]:
df['Machine failure'].value_counts()

In [None]:
import matplotlib.pyplot as plt

df['Machine failure'].value_counts().plot.bar()
plt.show()

We have discovered that the dataset is quite imbalanced; we are not going to rebalance it in this workshop.
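
To put a number on the imbalance, the class counts can be turned into a positive rate and a negatives-per-positive ratio. This is a minimal sketch on made-up labels (the 95/5 split is an assumption for illustration, not the real dataset counts); with the real data you would use `df['Machine failure']` instead of `labels`.

```python
import pandas as pd

# Made-up labels for illustration only; not the real dataset counts.
labels = pd.Series([0] * 95 + [1] * 5, name="Machine failure")

counts = labels.value_counts()

# For 0/1 labels the mean is the fraction of positive (failure) records.
positive_rate = labels.mean()

# How many non-failure records exist for each failure record.
negatives_per_positive = counts.loc[0] / counts.loc[1]

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline_accuracy = 1.0 - positive_rate
print(positive_rate, negatives_per_positive, majority_baseline_accuracy)
```

The majority-class baseline is worth keeping in mind later: on an imbalanced dataset, a model needs to beat it before its accuracy says anything useful.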

We can now select the numeric attributes we are interested in and plot them pairwise to look for correlations.

In [None]:
import seaborn
import matplotlib.pyplot as plt

df1 = df.sample(frac=0.1)  # plot a random 10% sample to keep the pairplot fast
df1 = df1.drop(['UDI', 'TWF', 'HDF', 'PWF', 'OSF', 'RNF'], axis=1).select_dtypes(include='number')
df1.head()

In [None]:
seaborn.pairplot(df1, hue='Machine failure', corner=True)
plt.show()

For the purpose of keeping the data exploration step short during the workshop, we are not going to execute additional queries. However, feel free to explore the dataset more if you have time.

## Create an experiment

Before getting started with preprocessing and feature engineering, we want to use Amazon SageMaker Experiments to track the experiments that we will run.
We are going to create a new experiment and then a new trial, which represents a multi-step ML workflow (e.g. preprocessing stage1, preprocessing stage2, training stage, etc.). Each step of a trial maps to a trial component in SageMaker Experiments.

We will use the Amazon SageMaker Experiments SDK to interact with the service from the notebooks. Additional info and documentation is available here: https://github.com/aws/sagemaker-experiments

In [None]:
!pip install sagemaker-experiments

Now we create the experiment. The timestamp suffix keeps the name unique, so each run of this cell creates a new experiment.

In [None]:
import time
from smexperiments import experiment

experiment_name = 'end-to-end-ml-sagemaker-{0}'.format(str(int(time.time())))
current_experiment = experiment.Experiment.create(experiment_name=experiment_name,
                                                  description='SageMaker workshop experiment')

print(experiment_name)

Once we have our experiment, we can create a new trial.

In [None]:
trial_name = 'sklearn-xgboost-{0}'.format(str(int(time.time())))
current_trial = current_experiment.create_trial(trial_name=trial_name)

From now on, we will use the experiment and the trial as configuration parameters for the preprocessing and training jobs, to make sure we track executions.

In [None]:
%store experiment_name
%store trial_name

<h2>Preprocessing and Feature Engineering with Amazon SageMaker Processing</h2>

The preprocessing and feature engineering code is implemented in the `source_dir/preprocessor.py` file.

Reading through the code, you will see that a few categorical columns are one-hot encoded and that some NaN values are filled based on domain knowledge.
Once the SKLearn fit() and transform() calls are done, the script splits the dataset 80/20 into training and validation sets and saves them to the output paths, whose content SageMaker Processing automatically uploads to Amazon S3. Finally, the script also saves the featurizer model, as it will be reused later for inference.
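
The steps described above can be sketched on a toy dataset as follows. This is not the content of `preprocessor.py`: the column names are hypothetical, median imputation stands in for the domain-knowledge fill, and the featurizer is written to a temporary directory rather than to `/opt/ml/processing/model`.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the raw data; column names and values are made up.
toy = pd.DataFrame({
    "kind": ["L", "M", "H", "L", "M", "L", "H", "M", "L", "L"],
    "temperature": [298.1, np.nan, 299.4, 298.6, 300.2,
                    np.nan, 297.9, 298.8, 299.0, 298.3],
    "failure": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
})
feature_columns = ["kind", "temperature"]

# One-hot encode the categorical column and fill NaN in the numeric one.
# sparse_threshold=0 forces a dense NumPy array as output.
featurizer = ColumnTransformer(
    transformers=[
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["kind"]),
        ("numeric", SimpleImputer(strategy="median"), ["temperature"]),
    ],
    sparse_threshold=0,
)

# Fit and transform first, then split, mirroring the order described above.
features = featurizer.fit_transform(toy[feature_columns])
labels = toy["failure"].to_numpy()

# 80/20 train/validation split.
X_train, X_val, y_train, y_val = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

# Persist the fitted featurizer so it can be reused at inference time.
model_file = os.path.join(tempfile.mkdtemp(), "featurizer.joblib")
joblib.dump(featurizer, model_file)
reloaded = joblib.load(model_file)

print(features.shape, X_train.shape, X_val.shape)
```

Saving the fitted transformer, rather than only the transformed data, is what allows exactly the same encoding to be applied to new records at inference time.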

In [None]:
!pygmentize source_dir/preprocessor.py

Configuring an Amazon SageMaker Processing job through the SageMaker Python SDK requires creating a `Processor` object (in this case an `SKLearnProcessor`, as we are using the default SKLearn container for processing). We can specify how many instances to use and which instance type to request.

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

sklearn_processor = SKLearnProcessor(role=role,
                                     base_job_name='end-to-end-ml-sm-proc',
                                     instance_type='ml.m5.large',
                                     instance_count=1,
                                     framework_version='0.20.0')

Then, we can invoke the `run()` method of the `Processor` object to kick off the job, specifying the script to execute, its arguments, and the configuration of inputs and outputs as shown below.

In [None]:
raw_data_path = 's3://{0}/{1}/data/raw/'.format(bucket_name, prefix)
train_data_path = 's3://{0}/{1}/data/preprocessed/train/'.format(bucket_name, prefix)
val_data_path = 's3://{0}/{1}/data/preprocessed/val/'.format(bucket_name, prefix)
model_path = 's3://{0}/{1}/output/sklearn/'.format(bucket_name, prefix)

# Experiment tracking configuration
experiment_config={
    "ExperimentName": current_experiment.experiment_name,
    "TrialName": current_trial.trial_name,
    "TrialComponentDisplayName": "sklearn-preprocessing",
}

sklearn_processor.run(code='source_dir/preprocessor.py',
                      inputs=[ProcessingInput(input_name='raw_data', source=raw_data_path, destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(output_name='train_data', source='/opt/ml/processing/train', destination=train_data_path),
                               ProcessingOutput(output_name='val_data', source='/opt/ml/processing/val', destination=val_data_path),
                               ProcessingOutput(output_name='model', source='/opt/ml/processing/model', destination=model_path)],
                      arguments=['--train-test-split-ratio', '0.2'],
                      experiment_config=experiment_config)
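
The `arguments` list is passed to the script on its command line. As a hedged sketch of how a processing script might read it (assuming `argparse`; the actual `preprocessor.py` may parse its arguments differently, and the `parse_args` helper is hypothetical):

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line arguments handed over by the processing job."""
    parser = argparse.ArgumentParser()
    # Fraction of the data held out for validation.
    parser.add_argument("--train-test-split-ratio", type=float, default=0.2)
    return parser.parse_args(argv)


# Simulate the arguments used in the run() call above.
args = parse_args(["--train-test-split-ratio", "0.2"])

# argparse converts dashes to underscores in attribute names.
validation_fraction = args.train_test_split_ratio
train_fraction = 1.0 - validation_fraction
print(train_fraction, validation_fraction)
```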

While the job is running, feel free to review its configurations, logs and metrics from SageMaker's views in the AWS Console.

Once the job is completed, we can take a look at the preprocessed dataset by loading the validation features as follows:

In [None]:
file_name = 'val_features.csv'
s3_key_prefix = '{0}/data/preprocessed/val/{1}'.format(prefix, file_name)

sagemaker_session.download_data('./', bucket_name, s3_key_prefix)

In [None]:
import pandas as pd
df = pd.read_csv(file_name)

df.head(10)

We can see that the categorical variables have been one-hot encoded. You can also check that, as expected, there are no NaN values left.
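
Both properties can also be checked programmatically. The sketch below runs on a made-up frame (the `type_*` and `temperature` column names are hypothetical and may not match the columns in `val_features.csv`); adapt the column list before applying it to `df`.

```python
import pandas as pd

# Made-up preprocessed rows for illustration only.
toy = pd.DataFrame({
    "type_L": [1, 0, 0],
    "type_M": [0, 1, 0],
    "type_H": [0, 0, 1],
    "temperature": [298.1, 299.4, 298.6],
})

# True if any cell in the frame is missing.
has_missing = bool(toy.isna().any().any())

# In a valid one-hot group, exactly one indicator is set in each row.
one_hot_columns = ["type_L", "type_M", "type_H"]
rows_are_one_hot = bool((toy[one_hot_columns].sum(axis=1) == 1).all())

print(has_missing, rows_are_one_hot)
```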

### Experiment analytics

You can visualize experiment analytics either from the Amazon SageMaker Studio Experiments plug-in or by using the SDK from a notebook, as follows:

In [None]:
from sagemaker.analytics import ExperimentAnalytics

analytics = ExperimentAnalytics(experiment_name=experiment_name)
analytics.dataframe()

After the preprocessing and feature engineering are completed, you can move to the next notebook in the **03_train_model** folder to start model training.