# Module 2: Fast Experimentation in SageMaker Studio Notebooks

In this module, you will perform data exploration and fast experimentation in a familiar Jupyter Notebook environment using SageMaker Studio notebooks. You will explore the data, preprocess it with scikit-learn feature transformers, train a model, and use the trained model to perform inference, all in the same notebook. You will use SageMaker Experiments to track each step of the fast experimentation.

You will use the [AI4I 2020 Predictive Maintenance Dataset](https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset) from the UCI Machine Learning Repository, which contains information about machine failures, to train a binary classification model that predicts whether a machine will fail based on input data.

## Environment setup

Start by installing the `xgboost` Python package.

In [None]:
%pip install -q xgboost

Retrieve the session's default Amazon S3 bucket for storing training data and the IAM role that provides the required permissions.

In [None]:
import sagemaker
import boto3
role = sagemaker.get_execution_role()
region = boto3.Session().region_name
sagemaker_session = sagemaker.Session()
s3_bucket_name = sagemaker_session.default_bucket()
s3_key_prefix = 'fast-experimentation'

print(region)
print(role)
print(s3_bucket_name)

Download the dataset from the UCI website.

In [None]:
import os
import urllib.request  # importing only `urllib` does not guarantee that `urllib.request` is loaded

input_data_dir = '/opt/ml/processing/input'
# Create the local input directory if it does not exist yet
os.makedirs(input_data_dir, exist_ok=True)
input_data_path = os.path.join(input_data_dir, 'predictive_maintenance_raw_data_header.csv')
dataset_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00601/ai4i2020.csv"
urllib.request.urlretrieve(dataset_url, input_data_path)
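If you rerun the notebook often, you may prefer not to download the file every time. The following is a minimal sketch of a caching helper, not part of the workshop code: `download_if_missing` and its `fetch` parameter are hypothetical names introduced here, and `fetch` defaults to the same `urllib.request.urlretrieve` call used above.

```python
import os
import urllib.request


def download_if_missing(url, dest_path, fetch=urllib.request.urlretrieve):
    """Download `url` to `dest_path` unless the file already exists.

    `fetch` is injectable so the helper can be exercised without network access.
    Returns True if a download happened, False if the existing file was reused.
    """
    if os.path.exists(dest_path):
        return False
    # Create the parent directory if needed before writing the file.
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    fetch(url, dest_path)
    return True
```

In this notebook, it could be called as `download_if_missing(dataset_url, input_data_path)`.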

## Exploratory Data Analysis

The main goal of this notebook is to show you how you can perform Exploratory Data Analysis (EDA) in a familiar notebook environment. Hence, you will perform a fairly simple analysis to view the shape of the raw data, descriptive statistics of the features, frequency of the labels, and pairwise relationships between the features.

Feel free to spend more time on EDA.

Find out how many samples and columns are included in the dataset.

In [None]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv(input_data_path)

print('The shape of the dataset is:', df.shape)

Preview the first 10 rows.

In [None]:
df.head(10)

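View descriptive statistics for the numeric columns. The `count` row shows the number of non-missing values in each column, so a count lower than the number of rows indicates missing values.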

In [None]:
df.describe()
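To inspect data types and missing values directly, pandas offers `DataFrame.dtypes` and `DataFrame.isna()`. The sketch below runs on a small made-up frame rather than the real dataset; the column names other than `Machine failure` are placeholders chosen for illustration.

```python
import pandas as pd

# Toy frame standing in for the real dataset; all values are made up.
toy = pd.DataFrame(
    {
        "Type": ["L", "M", None, "H"],
        "Torque [Nm]": [42.8, None, 49.4, 39.5],
        "Machine failure": [0, 0, 1, 0],
    }
)

dtypes = toy.dtypes          # data type of each column
missing = toy.isna().sum()   # number of missing values per column

print(dtypes)
print(missing)
```

Applied to the notebook's data, the same two expressions would be `df.dtypes` and `df.isna().sum()`.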

List the possible values for the "Machine failure" column and the frequency of their occurrence over the entire dataset.

In [None]:
df['Machine failure'].value_counts()

Plot the target column to visualise the distribution of its values.

In [None]:
import matplotlib.pyplot as plt

df['Machine failure'].value_counts().plot.bar()
plt.show()

The bar chart shows that the dataset is highly imbalanced: failures are far less frequent than non-failures. You will not rebalance the dataset in this workshop.
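To put a number on the imbalance, you can look at class shares instead of raw counts. The following sketch uses made-up labels (97 negatives and 3 positives), not the real dataset, to show the idea with `value_counts(normalize=True)`.

```python
import pandas as pd

# Made-up labels for illustration only: 97 non-failures and 3 failures.
labels = pd.Series([0] * 97 + [1] * 3, name="Machine failure")

counts = labels.value_counts()                 # absolute frequency per class
shares = labels.value_counts(normalize=True)   # relative frequency per class
imbalance_ratio = counts.max() / counts.min()  # majority-to-minority ratio

print(shares)
print(f"majority:minority ratio = {imbalance_ratio:.1f}")
```

With labels this skewed, a model that always predicts "no failure" would already reach 97% accuracy on the toy data, which is why precision and recall are worth checking alongside accuracy.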

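Drop the attributes you are not interested in and keep only the numeric attributes. The cell also takes a 10% random sample of the rows so that the pair plot in the next step renders quickly.

In [None]:
df1 = df.sample(frac=0.1)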
df1 = df1.drop(['UDI', 'TWF', 'HDF', 'PWF', 'OSF', 'RNF'], axis=1).select_dtypes(include='number')
df1.head()

In [None]:
df1.info()

Use a pair plot to spot correlations.

In [None]:
import seaborn
import matplotlib.pyplot as plt

seaborn.pairplot(df1, hue='Machine failure', corner=True)
plt.show()
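A correlation matrix can complement the visual impression from a pair plot. The sketch below computes Pearson correlations with `DataFrame.corr()` on a tiny made-up frame; the column names are hypothetical, and the values are chosen so that the expected correlations are obvious.

```python
import pandas as pd

# Made-up columns: `feature_b` rises with `feature_a`, `feature_c` falls with it.
toy = pd.DataFrame(
    {
        "feature_a": [1.0, 2.0, 3.0, 4.0],
        "feature_b": [2.0, 4.0, 6.0, 8.0],
        "feature_c": [4.0, 3.0, 2.0, 1.0],
    }
)

corr = toy.corr()  # pairwise Pearson correlation between numeric columns
print(corr.round(2))
```

On the sampled data, `df1.corr()` would give the equivalent matrix for the numeric attributes kept above.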

To keep the data exploration step short during the workshop, no additional queries are included. However, feel free to explore the dataset more if you have time.

## Feature Engineering

### Use SageMaker Experiments to track the experiments

Even though you are in the fast experimentation stage, it is still a good idea to track the experiments to gain comparative insights and track your best performing models.

You will leverage [Amazon SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html) to track your experiments. SageMaker Experiments is fully integrated into the SageMaker Python SDK, so there is no need to use a separate module for creating or managing experiments. 

Each experiment is a collection of runs. Each run is a collection of inputs, parameters, configurations, and results of your iterations.

In [None]:
from sagemaker.experiments.run import Run

Preprocessing is the first step in the experimentation phase. You define a name for the experiment and include the current timestamp to differentiate between experiments if you run the notebook multiple times.


In [None]:
import time
experiment_name = f"sm-fast-experimentation-{time.strftime('%Y-%m-%d-%H-%M-%S', time.localtime())}"
print(f"Experiment name: {experiment_name}")
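The same timestamp pattern is reused later for run names, so it can be convenient to wrap it in a helper. This is a sketch only: `unique_name` is a hypothetical function, and the validation pattern (alphanumerics separated by hyphens, at most 120 characters) is an assumption about SageMaker naming rules that you should verify against the service documentation.

```python
import re
import time

# Assumed naming rule: alphanumerics separated by hyphens, at most 120 characters.
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,119}$")


def unique_name(prefix, fmt="%Y-%m-%d-%H-%M-%S", now=None):
    """Return `prefix` plus a local-time timestamp, or raise if the name is invalid."""
    stamp = time.strftime(fmt, time.localtime(now))
    name = f"{prefix}-{stamp}"
    if not NAME_PATTERN.match(name):
        raise ValueError(f"Invalid name: {name!r}")
    return name


print(unique_name("sm-fast-experimentation"))
```

Note that two calls within the same second produce the same name, so this approach only separates executions that are at least one second apart.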

Since you are performing fast experimentation, you can use the local copy of the dataset for preprocessing and training. However, to keep a record of the data used in the experiment, upload a copy of the raw dataset to an Amazon S3 bucket.

In [None]:
raw_data_key = f"{s3_key_prefix}/data/raw"
s3_raw_data = sagemaker_session.upload_data(input_data_path, s3_bucket_name, key_prefix=raw_data_key)
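For a single file, the returned `s3_raw_data` value is expected to be the S3 URI formed from the bucket, the key prefix, and the file name. The sketch below composes that URI by hand so you can see where the object should land; `expected_s3_uri` is a hypothetical helper, `my-bucket` is a placeholder, and the layout is an assumption about `upload_data` behavior that you can confirm by printing `s3_raw_data`.

```python
import os
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Compose the URI that a single-file upload under `key_prefix` is assumed to produce."""
    key = posixpath.join(key_prefix, os.path.basename(local_path))
    return f"s3://{bucket}/{key}"


uri = expected_s3_uri(
    "my-bucket",  # placeholder bucket name
    "fast-experimentation/data/raw",
    "/opt/ml/processing/input/predictive_maintenance_raw_data_header.csv",
)
print(uri)
```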

### Data Processing

You are now ready to perform data processing. The `preprocessor.py` script in the `source_dir` directory will one-hot encode the relevant categorical columns and fill in the NaN values based on domain knowledge. The script splits the dataset into training, validation, and test datasets, fits the featurizer model, and transforms the datasets. The model and output datasets are written to the file system in directories under `/opt/ml/processing`.
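The `preprocessor.py` script itself is not shown in this notebook. As a rough illustration of the steps described above, the following sketch splits a toy dataset, fits a scikit-learn featurizer on the training split only, and transforms all three splits. Everything in it is an assumption made for illustration: the rows are made up, `Type` and `Torque` are placeholder column names, the 60/20/20 split is arbitrary, and median imputation stands in for the domain-knowledge rules used by the actual script.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Made-up rows; column names other than "Machine failure" are placeholders.
toy = pd.DataFrame(
    {
        "Type": ["L", "M", "H", "L", "M", "L", "H", "L", "M", "L"],
        "Torque": [40.0, np.nan, 45.5, 38.2, 50.1, np.nan, 41.3, 39.9, 47.0, 44.4],
        "Machine failure": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    }
)
X = toy.drop(columns="Machine failure")
y = toy["Machine failure"]

# Split first so that the featurizer is fitted on training rows only.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

featurizer = ColumnTransformer(
    transformers=[
        # One-hot encode the categorical column; unseen categories become all zeros.
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["Type"]),
        # Median imputation is a stand-in for the script's domain-specific rules.
        ("numeric", SimpleImputer(strategy="median"), ["Torque"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_features = featurizer.fit_transform(X_train)
val_features = featurizer.transform(X_val)
test_features = featurizer.transform(X_test)

print(train_features.shape, val_features.shape, test_features.shape)
```

Fitting on the training split and only transforming the validation and test splits keeps information from the held-out data out of the featurizer.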

The `Run` object creates the experiment using the name you provide if an experiment with that name does not exist. This experiment will consist of one or more runs. You use the `run_name` to describe the nature of the step, which in this case is transforming using sklearn. If you use the Run object multiple times using the same `run_name`, the new information for that run will replace the existing data. That might be desirable in some cases, but in this notebook, you make the run name unique by adding the timestamp. This way, you can track each execution if you run the cell multiple times.

In [None]:
preprocessed_data_dir = '/opt/ml/processing/output'
featurizer_model_dir = '/opt/ml/processing/model'

run_name=f'processing-{time.strftime("%H-%M-%S", time.localtime())}'
run_display_name='sklearn-transform'

with Run(
    experiment_name=experiment_name,
    run_name=run_name,
    run_display_name=run_display_name,
    sagemaker_session=sagemaker_session,
) as run:

    run.log_artifact(name='input_data', value=input_data_path, media_type='text/csv', is_output=False)

    %run 'source_dir/preprocessor.py' --input-data-path $input_data_path --output-data-dir $preprocessed_data_dir --featurizer-model-dir $featurizer_model_dir --s3-bucket-name $s3_bucket_name --s3-key-prefix $s3_key_prefix                              

Take a look at the processed training dataset.

In [None]:
import pandas as pd
df = pd.read_csv(train_features_output_path)
df.head(10)

You will see that the categorical variables have been one-hot encoded. You can also verify that the dataset no longer contains NaN values.


### Experiment Analytics

Analyze the experiment by opening the Experiments tab in the Amazon SageMaker Studio sidebar menu and choosing the latest experiment, which should be at the top of the list.

<p align="center">
    <img src="./images/experimentation_01.png" alt_text="Experiments menu" border=1 width=200>
</p>

<p align="center">
    <img src="./images/experimentation_02.png" alt_text="Choose an experiment" border=1 width=800>
</p>

Choose the processing run.

<p align="center">
    <img src="./images/experimentation_03.png" alt_text="Experiments - Choose a processing run" border=1 width=800>
</p>

Note that the input artifacts have been recorded by the `log_artifact` call you ran in the notebook.

<p align="center">
    <img src="./images/experimentation_04.png" alt_text="Experiments - Processing Input Artifacts" border=1 width=800>
</p>

Similarly, the train/val/test ratio parameters and output artifacts have been recorded by the commands included in the preprocessing script.

<p align="center">
    <img src="./images/experimentation_05.png" alt_text="Experiments - Processing train/val/test ratios" border=1 width=800>
</p>

<p align="center">
    <img src="./images/experimentation_06.png" alt_text="Experiments - Processing Output Artifacts" border=1 width=800>
</p>

## Model Training

In this part, you use XGBoost to train a simple binary classification model on the preprocessed data generated in the previous step. You will record the hyperparameter values and the results to track the experiments.

In [None]:
model_dir = "/opt/ml/model"
eta = 0.3
max_depth = 2

run_name=f'training-{time.strftime("%H-%M-%S", time.localtime())}'
run_display_name=f'max-depth-{max_depth}'

with Run(
    experiment_name=experiment_name,
    run_name=run_name,
    run_display_name=run_display_name,
    sagemaker_session=sagemaker_session,
) as run:

    %run -i source_dir/xgboost_training.py --eta $eta --max-depth $max_depth --preprocessed-data-dir $preprocessed_data_dir --model-dir $model_dir

### Experiment analytics

Go back to SageMaker Experiments using the sidebar menu in SageMaker Studio.

This time, choose the training run and switch to the Metrics tab to see the training metrics, output artifacts, and parameters recorded by the xgboost training script. 

<p align="center">
    <img src="./images/experimentation_07.png" border=1 alt_text="Experiments - Training Metrics" width=800>
</p>

<p align="center">
    <img src="./images/experimentation_08.png" border=1 alt_text="Experiments - Training Output Artifacts" width=800>
</p>

<p align="center">
    <img src="./images/experimentation_09.png" border=1 alt_text="Experiments - Training Parameters" width=800>
</p>


### Using the model to generate predictions

Now use the trained model to generate predictions on the test dataset.

In [None]:
import xgboost

# Load the test split; the files have no header row
df_test_features = pd.read_csv(test_features_output_path, header=None)
df_test_labels = pd.read_csv(test_labels_output_path, header=None)
test_X = df_test_features.values
test_y = df_test_labels.values.reshape(-1)
# Wrap the features and labels in XGBoost's DMatrix format
dtest = xgboost.DMatrix(test_X, label=test_y)

model_xgb_trial = xgboost.Booster()
model_xgb_trial.load_model(model_path)
test_predictions = model_xgb_trial.predict(dtest)
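The predictions are continuous scores rather than class labels. Assuming the training script uses a logistic objective, they can be read as failure probabilities and turned into labels with a threshold. The sketch below uses made-up scores to compare `np.round` with an explicit threshold; the two differ only for a score of exactly 0.5, because `np.round` rounds halves to the nearest even number.

```python
import numpy as np

# Made-up scores for illustration; real values come from `model.predict`.
probabilities = np.array([0.02, 0.49, 0.5, 0.51, 0.97])

rounded = np.round(probabilities)                 # rounds half to even: 0.5 -> 0.0
thresholded = (probabilities >= 0.5).astype(int)  # explicit cut-off: 0.5 -> 1

print(rounded)
print(thresholded)
```

An explicit comparison also makes it easy to try a cut-off other than 0.5, which can be useful on imbalanced data.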

After performing inference on the test features, you can compare the predictions with the labels in the test dataset to measure model performance. With SageMaker Experiments, you can also log model performance charts to the training run.

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Threshold the predicted scores to get class labels
rounded_predict = np.round(test_predictions)

print("===Metrics for Test Set===")
print('')
print(pd.crosstab(index=test_y, columns=rounded_predict,
                  rownames=['Actuals'],
                  colnames=['Predictions'],
                  margins=True))
print('')

accuracy = accuracy_score(test_y, rounded_predict)
precision = precision_score(test_y, rounded_predict)
recall = recall_score(test_y, rounded_predict)
print('')

print("Accuracy Model A: %.2f%%" % (accuracy * 100.0))
print("Precision Model A: %.2f" % (precision))
# Report recall itself; 1 - recall would be the miss rate
print("Recall Model A: %.2f" % (recall))

# AUC is computed from the raw scores, not the thresholded labels
auc = roc_auc_score(test_y, test_predictions)
print("AUC A: %.2f" % (auc))


with Run(
    experiment_name=experiment_name,
    run_name=run_name
) as run:
    run.log_roc_curve(y_true=test_y, y_score=test_predictions, title="roc")
    run.log_precision_recall(y_true=test_y, predicted_probabilities=test_predictions, title="precision-recall")
    run.log_confusion_matrix(y_true=test_y, y_pred=rounded_predict, title="confusion-matrix")
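To see how accuracy, precision, and recall relate to the confusion matrix, the following sketch computes them by hand in plain Python. The labels and predictions are made up, and `confusion_counts` and `basic_metrics` are hypothetical helper names. Recall is the share of actual failures that the model catches, TP / (TP + FN).

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def basic_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall); undefined ratios fall back to 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up example: 3 actual failures, of which the model catches 1.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy, precision, recall = basic_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

In this toy example the accuracy is 0.70 even though two of the three failures are missed, which illustrates why accuracy alone can be misleading on imbalanced data.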

SageMaker Experiments also supports common chart types to visualize model training results. Open SageMaker Experiments from the sidebar menu, choose the training run, and go to the Charts tab to see the three graphs recorded by the last cell you ran.

<p align="center">
    <img src="./images/experimentation_10.png" alt_text="Experiments menu" width=800>
</p>

## You have completed Module 2

In this module, you carried out fast experimentation: you explored and preprocessed the data, trained a model, and used the model to generate predictions on the test dataset. Throughout, you used SageMaker Experiments to keep track of the steps and of the parameters, inputs, and outputs of each step.

Open the notebook **03_feature_engineering.ipynb** in module 3 to run the preprocessing logic as a feature engineering step with Amazon SageMaker Processing.