<h1>Build and train models</h1>

## Workshop overview

In this workshop, you will go through a complete machine learning process. You will use the ["AI4I 2020 Predictive Maintenance Dataset" from the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/AI4I+2020+Predictive+Maintenance+Dataset), which contains information about machine failures, to train a binary classification model that predicts whether a machine will fail based on input data.

In this module, you will perform data exploration, data preprocessing, and model training in a familiar JupyterLab notebook environment in SageMaker Studio. In module 2, you will deploy an [inference pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html) behind an endpoint. The inference pipeline will consist of a Feature Transformer and an XGBoost model. Finally, in module 3, you will create a pipeline for a complete machine learning development process using [SageMaker Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-sdk.html).

## In this notebook...

You will explore the data and use SKLearn Feature Transformers to preprocess the data. You then build and train an XGBoost logistic regression model and test it. You will use the SageMaker [@remote decorator](https://docs.aws.amazon.com/sagemaker/latest/dg/train-remote-decorator.html) to scale data processing and model training as Amazon SageMaker Training jobs.


If you are not familiar with Jupyter notebooks: Run the code cells in this notebook one by one. You can use Jupyter environment shortcuts such as Shift + Enter to run the current cell and move the cursor to the next cell.

Read the notes between the code cells to understand what the notebook is doing and observe the output from each cell.

## Environment setup

Let's get started by installing the requirements.

In [None]:
%pip install -r requirements.txt

The Amazon SageMaker Python SDK supports setting default values for AWS infrastructure primitive types, such as instance types, Amazon S3 folder locations, and IAM roles. The SDK reads these defaults from configuration files. You can override the default location of the user-defined configuration file by setting the `SAGEMAKER_USER_CONFIG_OVERRIDE` environment variable.

In [None]:
import os

# Use the current working directory as the location for SageMaker Python SDK config file
os.environ["SAGEMAKER_USER_CONFIG_OVERRIDE"] = os.getcwd()

[SageMaker Distribution images](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-distribution.html) include popular libraries for machine learning and data science.

In [None]:
import pandas
import xgboost
import sklearn

print(f"Pandas version: {pandas.__version__}")
print(f"XGBoost version: {xgboost.__version__}")
print(f"SKLearn version: {sklearn.__version__}")

Install the seaborn data visualization library.

In [None]:
%pip install seaborn

Download the dataset from the UCI website.

In [None]:
import urllib.request  # the request submodule must be imported explicitly
import os

input_data_dir = 'data/'
if not os.path.exists(input_data_dir):
    os.makedirs(input_data_dir)
input_data_path = os.path.join(input_data_dir, 'predictive_maintenance_raw_data_header.csv')
dataset_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00601/ai4i2020.csv"
urllib.request.urlretrieve(dataset_url, input_data_path)

## Exploratory data analysis

In this section, you will perform a fairly simple analysis to examine the shape and distribution of the raw data, summary statistics of the features, frequency counts of the labels, and the relationships between pairs of features. Feel free to spend more time on data analysis if you wish.

Determine the number of samples (rows) and features (columns) in the dataset.

In [None]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv(input_data_path)

print('The shape of the dataset is:', df.shape)

Preview the first 10 rows.

In [None]:
df.head(10)

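Check the data types and non-null counts for each column to identify columns with missing values, then view summary statistics for the numeric columns.

In [None]:
df.info()      # prints data types and non-null counts per column
df.describe()  # summary statistics; displayed as the cell output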

List the possible values for the "Machine failure" column and the frequency of their occurrence over the entire dataset.

In [None]:
df['Machine failure'].value_counts()

Plot the target column to visualize the distribution of values.

In [None]:
import matplotlib.pyplot as plt

df['Machine failure'].value_counts().plot.bar()
plt.show()

You will notice that the dataset is quite unbalanced. However, you are not going to balance it in this workshop.
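One practical consequence of the imbalance is that accuracy alone can look good for a useless model. The following minimal sketch uses hypothetical labels (97 non-failures and 3 failures, not the real dataset counts) to show that a "model" which always predicts "no failure" still reaches high accuracy while detecting none of the failures. The names `labels` and `always_negative` are illustrative only.

```python
from collections import Counter

# Hypothetical labels for illustration only: 97 non-failures (0), 3 failures (1).
labels = [0] * 97 + [1] * 3

counts = Counter(labels)
failure_rate = counts[1] / len(labels)

# A trivial "model" that always predicts the majority class (no failure).
always_negative = [0] * len(labels)

# Accuracy = correct predictions / all predictions.
accuracy = sum(p == y for p, y in zip(always_negative, labels)) / len(labels)
# Recall = detected failures / actual failures.
recall = sum(p == 1 and y == 1 for p, y in zip(always_negative, labels)) / counts[1]

print(f"Failure rate: {failure_rate:.2%}")
print(f"Baseline accuracy: {accuracy:.2%}")
print(f"Baseline recall: {recall:.2%}")
```

This is why the later sections of this notebook report precision, recall, and AUC in addition to accuracy.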

In [None]:
# Compute the number of unique values for each column in df
df.nunique()

Take a 10% random sample of the rows, drop the attributes you are not interested in, and keep only the numeric attributes.

In [None]:
df1 = df.sample(frac=0.1)
df1 = df1.drop(['UDI', 'TWF', 'HDF', 'PWF', 'OSF', 'RNF'], axis=1).select_dtypes(include='number')
df1.head()

View the summary of the pre-processed dataset.

In [None]:
df1.info()

Use a pair plot to spot correlations.

In [None]:
import seaborn
import matplotlib.pyplot as plt

seaborn.pairplot(df1, hue='Machine failure', corner=True)
plt.show()

To keep the data exploration step short during the workshop, no additional queries are included. However, feel free to explore the dataset more if you wish.

## Feature Engineering

### Data Processing

You will run data pre-processing in the `preprocess` function in the following cell. This function selects the relevant columns and splits the dataset into training, validation, and test datasets. It then fits the featurizer model on the training data (standard scaling for the numeric columns and one-hot encoding for the categorical `Type` column) and transforms the training and validation datasets. The test features are returned untransformed; they are transformed later, in the `test` function. The function returns the model and the output datasets, and saves the serialized model to the file system.

The following cell annotates the `preprocess` function with the [@remote decorator](https://docs.aws.amazon.com/sagemaker/latest/dg/train-remote-decorator.html) to run the Python function as a SageMaker job without requiring any other modifications to the function code. Feel free to comment out the remote decorator in the cells below to seamlessly move from running the function remotely via SageMaker Training to local execution. If you comment out the decorator to run the function locally, you will need to run this command in the terminal to give permission to the output directory where the function will save the models: `sudo chmod -R 777 /opt/ml/model`. You don't need to run this command if you leave the remote decorator in, since the `config.yaml` file runs that command before executing the training job.

The code also uses [SageMaker Managed Warm Pools](https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html) by setting the `keep_alive_period_in_seconds` parameter. SageMaker Managed Warm Pools let you retain and reuse provisioned infrastructure after the completion of a job to reduce latency for repetitive workloads, such as iterative experimentation or running many jobs consecutively. Subsequent training jobs that match specified parameters run on the retained warm pool infrastructure, which speeds up start times by reducing the time spent provisioning resources. Note that Managed Warm Pools might not be enabled for your AWS account. In that case, the code will still work, but you might not see lower latencies for subsequent iterations.

In [None]:
import os
import joblib

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

from sagemaker.remote_function import remote

@remote(keep_alive_period_in_seconds=3600, job_name_prefix="amzn-sm-btd-preprocess")
def preprocess(df):
    columns = ['Type', 'Air temperature [K]', 'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]', 'Machine failure']
    cat_columns = ['Type']
    num_columns = ['Air temperature [K]', 'Process temperature [K]', 'Rotational speed [rpm]', 'Torque [Nm]', 'Tool wear [min]']
    target_column = 'Machine failure'

    df = df[columns]

    training_ratio = 0.8
    validation_ratio = 0.1
    test_ratio = 0.1

    X = df.drop(target_column, axis=1)
    y = df[target_column]

    print(f'Splitting data into training ({training_ratio}), validation ({validation_ratio}), and test ({test_ratio}) sets')

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_ratio, random_state=0, stratify=y)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=validation_ratio/(validation_ratio+training_ratio), random_state=2, stratify=y_train)

    # Apply transformations
    transformer = ColumnTransformer(transformers=[('numeric', StandardScaler(), num_columns),
                                                  ('categorical', OneHotEncoder(), cat_columns)],
                                    remainder='passthrough')
    featurizer_model = transformer.fit(X_train)
    X_train = featurizer_model.transform(X_train)
    X_val = featurizer_model.transform(X_val)

    print(f'Shape of train features after preprocessing: {X_train.shape}')
    print(f'Shape of validation features after preprocessing: {X_val.shape}')
    print(f'Shape of test features (transformed later in test()): {X_test.shape}')
    
    y_train = y_train.values.reshape(-1)
    y_val = y_val.values.reshape(-1)
    
    print(f'Shape of train labels after preprocessing: {y_train.shape}')
    print(f'Shape of validation labels after preprocessing: {y_val.shape}')
    print(f'Shape of test labels after preprocessing: {y_test.shape}')

    model_file_path="/opt/ml/model/sklearn_model.joblib"
    os.makedirs(os.path.dirname(model_file_path), exist_ok=True)
    joblib.dump(featurizer_model, model_file_path)

    return X_train, y_train, X_val, y_val, X_test, y_test, featurizer_model

The function returns multiple values, including the training, validation, and test features and labels, and the featurizer model.
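The second `train_test_split` call inside `preprocess` passes `validation_ratio/(validation_ratio+training_ratio)` rather than `validation_ratio`, because it operates only on the rows that remain after the test split. The following arithmetic check, which uses only the ratios defined in the function, is a sketch of why that yields an 80/10/10 split overall. The variable names other than the three ratios are illustrative.

```python
import math

training_ratio = 0.8
validation_ratio = 0.1
test_ratio = 0.1

# First split: hold out the test share from the full dataset.
remaining = 1.0 - test_ratio

# Second split: the validation share is expressed relative to what remains.
second_split_size = validation_ratio / (validation_ratio + training_ratio)

val_share_of_total = remaining * second_split_size
train_share_of_total = remaining * (1.0 - second_split_size)

print(f"test_size passed to the second split: {second_split_size:.4f}")
print(f"validation share of all rows: {val_share_of_total:.2f}")
print(f"training share of all rows: {train_share_of_total:.2f}")
```

The actual row counts may differ slightly from these shares because the split sizes are rounded to whole rows.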

In [None]:
X_train, y_train, X_val, y_val, X_test, y_test, featurizer_model = preprocess(df)

Analyze the featurizer model structure.

In [None]:
featurizer_model

Previewing a few rows from the preprocessed training dataset shows that the categorical features have been one-hot encoded. If you wish, you can perform more analysis to make sure there are no NaN values in the dataset.


In [None]:
import pandas as pd
pd.DataFrame(X_train).head(10)

## Model Training

In this section, you will use XGBoost to train a binary classification model (with the `binary:logistic` objective) using the preprocessed data generated in the previous step. Again, you will use a standard Python function that accepts some of the XGBoost hyperparameters as input and returns the model.
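As background for the `binary:logistic` objective that the training function uses by default: with this objective, XGBoost predictions are probabilities between 0 and 1, obtained by applying the logistic (sigmoid) function to the raw score of the boosted trees. The sketch below is plain Python, not XGBoost code. The raw scores and the helper names `sigmoid` and `to_label` are hypothetical.

```python
import math

def sigmoid(raw_score: float) -> float:
    """Map a raw (log-odds) score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-raw_score))

def to_label(probability: float, threshold: float = 0.5) -> int:
    """Turn a probability into a 0/1 class label using a cutoff."""
    return int(probability > threshold)

# Hypothetical raw scores, for illustration only.
raw_scores = [-3.0, -0.2, 0.0, 1.5]
probabilities = [sigmoid(s) for s in raw_scores]
labels = [to_label(p) for p in probabilities]

for s, p, y in zip(raw_scores, probabilities, labels):
    print(f"raw={s:+.1f}  probability={p:.3f}  label={y}")
```

In the training function, `np.round(predictions)` plays the role of `to_label` with a 0.5 cutoff, while the AUC metric is computed on the unrounded probabilities.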

In [None]:
import os
import xgboost
import numpy as np

from sagemaker.remote_function import remote

@remote(keep_alive_period_in_seconds=3600, job_name_prefix="amzn-sm-btd-train")
def train(X_train, y_train, X_val, y_val,
          eta=0.1, 
          max_depth=2, 
          gamma=0.0,
          min_child_weight=1,
          verbosity=0,
          objective='binary:logistic',
          eval_metric='auc',
          num_boost_round=5):

    print('Train features shape: {}'.format(X_train.shape))
    print('Train labels shape: {}'.format(y_train.shape))
    print('Validation features shape: {}'.format(X_val.shape))
    print('Validation labels shape: {}'.format(y_val.shape))

    # Creating DMatrix(es)
    dtrain = xgboost.DMatrix(X_train, label=y_train)
    dval = xgboost.DMatrix(X_val, label=y_val)
    watchlist = [(dtrain, "train"), (dval, "validation")]

    print('')
    print (f'===Starting training with max_depth {max_depth}===')

    param_dist = {
        "max_depth": max_depth,
        "eta": eta,
        "gamma": gamma,
        "min_child_weight": min_child_weight,
        "verbosity": verbosity,
        "objective": objective,
        "eval_metric": eval_metric
    }

    xgb = xgboost.train(
        params=param_dist,
        dtrain=dtrain,
        evals=watchlist,
        num_boost_round=num_boost_round)

    predictions = xgb.predict(dval)

    print ("Metrics for validation set")
    print('')
    print (pd.crosstab(index=y_val, columns=np.round(predictions),
                       rownames=['Actuals'], colnames=['Predictions'], margins=True))
    print('')

    rounded_predict = np.round(predictions)

    val_accuracy = accuracy_score(y_val, rounded_predict)
    val_precision = precision_score(y_val, rounded_predict)
    val_recall = recall_score(y_val, rounded_predict)

    print("Accuracy Model A: %.2f%%" % (val_accuracy * 100.0))
    print("Precision Model A: %.2f" % (val_precision))
    print("Recall Model A: %.2f" % (val_recall))

    from sklearn.metrics import roc_auc_score

    val_auc = roc_auc_score(y_val, predictions)
    print("Validation AUC A: %.2f" % (val_auc))

    model_file_path="/opt/ml/model/xgboost_model.bin"
    os.makedirs(os.path.dirname(model_file_path), exist_ok=True)
    xgb.save_model(model_file_path)

    return xgb

Running the training function will initiate a SageMaker training job because the function is decorated with the @remote decorator.

In [None]:
eta=0.3
max_depth=8

booster = train(X_train, y_train, X_val, y_val,
              eta=eta, 
              max_depth=max_depth)

Display the information about the trained model.

In [None]:
booster

### Using the models to generate predictions

Finally, you use the models for inference and evaluate model accuracy. The @remote decorator is commented out because the function does not require additional processing power or memory. However, you can uncomment the first line of the cell to run the job remotely if you wish.

In [None]:
#@remote(keep_alive_period_in_seconds=600, job_name_prefix="amzn-sm-btd-test")
def test(featurizer_model, booster, X_test, y_test):

    X_test = featurizer_model.transform(X_test)
    y_test = y_test.values.reshape(-1)

    dtest = xgboost.DMatrix(X_test, label=y_test)
    test_predictions = booster.predict(dtest)
    
    print ("===Metrics for Test Set===")
    print('')
    print (pd.crosstab(index=y_test, columns=np.round(test_predictions), 
                                     rownames=['Actuals'], 
                                     colnames=['Predictions'], 
                                     margins=True)
          )
    print('')

    rounded_predict = np.round(test_predictions)

    accuracy = accuracy_score(y_test, rounded_predict)
    precision = precision_score(y_test, rounded_predict)
    recall = recall_score(y_test, rounded_predict)
    print('')

    print("Accuracy Model A: %.2f%%" % (accuracy * 100.0))
    print("Precision Model A: %.2f" % (precision))
    print("Recall Model A: %.2f" % (recall))

    from sklearn.metrics import roc_auc_score

    auc = roc_auc_score(y_test, test_predictions)
    print("AUC A: %.2f" % (auc))

Test the trained model using the test features and labels.
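The `test` function reports accuracy, precision, and recall, all of which derive from the confusion matrix it prints. As a reminder of how they relate, here is a minimal sketch with hypothetical counts (not results from this notebook); the variable names are illustrative.

```python
# Hypothetical confusion-matrix counts, for illustration only.
true_positive = 20   # failures predicted as failures
false_positive = 5   # non-failures predicted as failures
false_negative = 10  # failures predicted as non-failures
true_negative = 965  # non-failures predicted as non-failures

total = true_positive + false_positive + false_negative + true_negative

# Share of all predictions that are correct.
accuracy = (true_positive + true_negative) / total
# Share of predicted failures that are real failures.
precision = true_positive / (true_positive + false_positive)
# Share of real failures that the model detects.
recall = true_positive / (true_positive + false_negative)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
```

With an unbalanced target such as "Machine failure", recall and precision usually say more about the model than accuracy does.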

In [None]:
test(featurizer_model, booster, X_test, y_test)

### You have completed module 1
In this module, you built a featurizer model using SKLearn to preprocess the data. You also built and trained a binary classification model using XGBoost.

Proceed to module 2 to deploy the models on a SageMaker inference endpoint.