# MLOps Manual to Repeatable Workflow

<div class="alert alert-warning"> 
	⚠️ <strong> PRE-REQUISITE: </strong> Before proceeding with this notebook, please ensure that you have executed the <code>1-data-prep-feature-store.ipynb</code> notebook.
</div>

## Contents

- [Introduction](#Introduction)
- [Recap](#Recap)
- [Experiment tracking](#Experiment-tracking)
- [SageMaker Training](#SageMaker-Training)
- [SageMaker Training with Automatic Model Tuning (HPO)](#SageMaker-Training-with-Automatic-Model-Tuning-(HPO))
- [Model Registry](#Model-Registry)

## Introduction

This is our second notebook which will explore the model training stage of the ML workflow.

Here, we will put on the hat of the `Data Scientist` and will perform the task of modeling which includes training a model, performing hyperparameter tuning, and registering our model in a model registry. This task is highly iterative in nature and hence we need to track our experimentation until we reach desired results.

As in the previous notebook on preprocessing datasets, we will start by performing the above tasks manually inside our notebook's local environment, using the local data generated during the previous steps. Then we will learn how to bring scale and experiment tracking into these steps using managed SageMaker training capabilities, and how to connect them to SageMaker Feature Store.

Let's get started!

**Important:** for this example, we will use SageMaker's [XGBoost algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) as a built-in model. That means that you don't have to write your own model code; SageMaker takes care of it. We will use CSV data as input. For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record. Let's query our Feature Store group to get the necessary data.
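
The CSV layout described above can be produced from any tabular data with a few lines of standard-library Python. The following is a minimal sketch, not part of the original notebook: the helper name `to_xgboost_csv` and the column names in the usage example are hypothetical.

```python
import csv
import io


def to_xgboost_csv(rows, columns, target):
    """Return CSV text with the target column first and no header row.

    rows:    list of dicts keyed by column name
    columns: column names in their current order
    target:  name of the target column (must be in `columns`)
    """
    if target not in columns:
        raise ValueError(f"target column {target!r} not found")
    # Put the target first, keep the remaining columns in their original order.
    ordered = [target] + [c for c in columns if c != target]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[c] for c in ordered])  # no header is written
    return buffer.getvalue()


# Hypothetical example data: two features and a PRICE target.
example_rows = [
    {"SQUARE_FEET": 1200, "NUM_BEDROOMS": 3, "PRICE": 250000},
    {"SQUARE_FEET": 800, "NUM_BEDROOMS": 2, "PRICE": 180000},
]
csv_text = to_xgboost_csv(example_rows, ["SQUARE_FEET", "NUM_BEDROOMS", "PRICE"], "PRICE")
print(csv_text)
```

In this notebook the same effect comes from the order of the selected features in the query and from writing the CSV with `header=False`.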

**Imports**

Let's first restore the variables stored by the previous notebook, then install the `sagemaker-experiments` library in case it is not yet installed.

In [None]:
%store -r

In [None]:
import sys
!{sys.executable} -m pip install sagemaker-experiments

In [None]:
# Standard library
import json
import os
from time import gmtime, strftime

# Third-party
import boto3
import pandas as pd
import sagemaker
from sagemaker import image_uris
from sagemaker.analytics import ExperimentAnalytics
from sagemaker.feature_store.feature_group import FeatureGroup
from sagemaker.inputs import TrainingInput
from sagemaker.model_metrics import ModelMetrics, MetricsSource
from sagemaker.tuner import IntegerParameter, ContinuousParameter, HyperparameterTuner

# SageMaker Experiments objects
from smexperiments.experiment import Experiment
from smexperiments.trial import Trial

**Session variables**

In [None]:
# Useful SageMaker variables
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
role_arn = sagemaker.get_execution_role()
region = sagemaker_session.boto_region_name
s3_client = boto3.client('s3', region_name=region)
sagemaker_client = boto3.client('sagemaker')

enable_local_mode_training = False
model_package_group_name = 'synthetic-housing-models'
model_name = 'xgboost-model'


fs_dir = os.path.join(os.getcwd(), 'data/fs_data')
os.makedirs(fs_dir, exist_ok=True)

fs_train_dir = os.path.join(os.getcwd(), 'data/fs_data/train')
os.makedirs(fs_train_dir, exist_ok=True)

fs_validation_dir = os.path.join(os.getcwd(), 'data/fs_data/validation')
os.makedirs(fs_validation_dir, exist_ok=True)

## Recap

So far we've processed our data and now have training and validation sets available in Feature Store. Since SageMaker training jobs expect the training data to be in S3, let's first export our Feature Store data to S3.

In [None]:
def save_fs_data_to_s3(fg_name, features_to_select, sm_session, file_name, local_path, bucket, bucket_prefix):
    """Query a feature group through Athena, save the result as a headerless CSV and upload it to S3."""
    fs_group = FeatureGroup(name=fg_name, sagemaker_session=sm_session)
    query = fs_group.athena_query()
    table = query.table_name
    # Order by record_id so repeated runs return the rows in the same order
    query_string = f'SELECT {features_to_select} FROM "sagemaker_featurestore"."{table}" ORDER BY record_id'
    output_location = f's3://{bucket}/sagemaker-featurestore/query_results/'
    query.run(query_string=query_string, output_location=output_location)
    query.wait()
    df = query.as_dataframe()
    # The built-in XGBoost algorithm expects CSV input without index or header
    local_file = os.path.join(local_path, file_name)
    df.to_csv(local_file, index=False, header=False)
    s3_client.upload_file(local_file, bucket, f'{bucket_prefix}/{file_name}')
    # Return the S3 prefix (not the object key) that contains the uploaded file
    return f's3://{bucket}/{bucket_prefix}'

train_data = save_fs_data_to_s3(
    train_feature_group_name, 
    features_to_select, 
    sagemaker_session, 
    "train.csv", 
    fs_train_dir, 
    bucket, 
    s3_prefix+"/data/fs_data/train"
)
val_data = save_fs_data_to_s3(
    validation_feature_group_name, 
    features_to_select, 
    sagemaker_session, 
    "validation.csv", 
    fs_validation_dir, 
    bucket, 
    s3_prefix+"/data/fs_data/validation"
)
train_data, val_data

Let's compare the dataset distribution of our original dataset and the one read from Feature Store.

In [None]:
# read original training data
df_train_orig = pd.read_csv(sm_processed_train_dir+'/train.csv', header=None)
df_train_orig.describe()

In [None]:
# reading training data from Feature Store
df_train_fs = pd.read_csv(fs_train_dir+'/train.csv', header=None)
df_train_fs.describe()

Great! Our dataset distribution seems intact.
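
The same comparison can be automated rather than read off two `describe()` tables. The sketch below is an illustration only: `column_summary` and `summaries_match` are hypothetical helpers, the rows are made-up numbers, and the tolerance is an arbitrary choice. It compares order-independent statistics because the Feature Store query sorts rows by `record_id`, so the row order may differ from the original file.

```python
import math
import statistics


def column_summary(rows):
    """Return (mean, min, max) per column for a list of equal-length numeric rows."""
    columns = list(zip(*rows))  # transpose rows into columns
    return [(statistics.fmean(col), min(col), max(col)) for col in columns]


def summaries_match(rows_a, rows_b, rel_tol=1e-9):
    """True if both datasets have the same per-column mean, min and max."""
    summary_a, summary_b = column_summary(rows_a), column_summary(rows_b)
    if len(summary_a) != len(summary_b):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol)
        for stats_a, stats_b in zip(summary_a, summary_b)
        for x, y in zip(stats_a, stats_b)
    )


# Made-up rows: the second list is the first one in a different order.
original = [[250000.0, 1200.0, 3.0], [180000.0, 800.0, 2.0], [320000.0, 1500.0, 4.0]]
from_feature_store = [original[2], original[0], original[1]]

same = summaries_match(original, from_feature_store)
different = summaries_match(original, [[1.0, 2.0, 3.0]] * 3)
print(same, different)
```

Matching summary statistics do not prove the two datasets are identical, but a mismatch is a quick signal that something went wrong during ingestion or export.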

We are ready to train a SageMaker built-in XGBoost model with it!

## Experiment tracking

[SageMaker Experiments](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html) can track all the model training iterations. Experiments are a great way to organize your data science work. You can create experiments to organize all your model development work for:

1. A business use case you are addressing (e.g. create experiment named "customer churn prediction"), or
2. A data science team that owns the experiment (e.g. create experiment named "marketing analytics experiment"), or
3. A specific data science and ML project. Think of it as a "folder" for organizing your "files".

In [None]:
synthetic_housing_experiment = Experiment.create(
    experiment_name=f'synthetic-housing-xgboost-{strftime("%d-%H-%M-%S", gmtime())}', 
    description='Synthetic housing price estimation.',
    sagemaker_boto_client=sagemaker_client)

## SageMaker Training

Now that we've prepared our training and validation data, we can move on to SageMaker's hosted training functionality, [SageMaker Training](https://docs.aws.amazon.com/sagemaker/latest/dg/train-model.html). Hosted training is preferred for doing actual training, especially large-scale, distributed training. Unlike training a model on a local computer or server, SageMaker hosted training spins up a separate cluster of machines managed by SageMaker to train your model. Before starting hosted training, the data must be in S3, or in an EFS or FSx for Lustre file system. We uploaded the data to S3 in the [Recap](#Recap) section above, so we're good to go here.

Let's go ahead and create a built-in XGBoost model. You can see that we use the `Estimator` object with the xgboost container and all we need to do is pass the parameters to the model. SageMaker takes care of the implementation.

In [None]:
# initialize hyperparameters
hyperparameters = {
    "max_depth": "5",
    "eta": "0.2",
    "gamma": "4",
    "min_child_weight": "6",
    "subsample": "0.7",
    "objective": "reg:squarederror",
    "num_round": "50",
    "verbosity": "2",
    "eval_metric": "mse"
}

# instance type used for the training job (previously hard-coded in the Estimator call)
train_instance_type = 'ml.m5.2xlarge'

train_input = TrainingInput(train_data, content_type='text/csv')
validation_input = TrainingInput(val_data, content_type='text/csv')
inputs = {'train': train_input, 'validation': validation_input}

# look up the URI of the SageMaker-managed XGBoost container image for this region.
# change the version argument ("1.5-1") depending on your preference.
xgboost_container = sagemaker.image_uris.retrieve("xgboost", region, "1.5-1")

# construct a SageMaker estimator that runs the XGBoost container
estimator = sagemaker.estimator.Estimator(
    image_uri=xgboost_container, 
    hyperparameters=hyperparameters,
    role=role_arn,
    instance_count=1, 
    instance_type=train_instance_type, 
    volume_size=5, # 5 GB 
)

Before we actually train the `XGBoost` model, we'll create a trial under the experiment we created at the beginning of this notebook. The results of the training job we're about to run will be tracked by SageMaker Experiments under this trial.

In [None]:
regressor_trial = Trial.create(
    trial_name = f'xgboost-{strftime("%d-%H-%M-%S", gmtime())}',
    experiment_name = synthetic_housing_experiment.experiment_name
)

experiment_config = {
    'ExperimentName': synthetic_housing_experiment.experiment_name,
    'TrialName': regressor_trial.trial_name,
    'TrialComponentDisplayName': 'TrainingJob',
}

Now that we've passed the necessary inputs to the `Estimator` object, we can call its `fit` method to train the XGBoost model on our data.

In [None]:
estimator.fit(inputs, experiment_config=experiment_config)

Now that training has finished, we can use SageMaker Experiments to examine the results and see how they compare to other training jobs within the experiment. Right now this is the only job captured in Experiments, but let's take a look anyway to see what data it stores.

In [None]:
trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=sagemaker_session, 
    experiment_name=synthetic_housing_experiment.experiment_name,
    sort_order="Descending"
)

df_experiments = trial_component_analytics.dataframe()
df_experiments[[
    'Trials', 'TrialComponentName', 'DisplayName', 'train:mse - Avg', 'validation:mse - Avg', 
    'max_depth', 'eta', 'gamma', 'min_child_weight', 'subsample', 'objective', 'num_round', 
    'verbosity','SourceArn'
]]

Well, that MSE looks quite good. In cases where it doesn't, we could improve it by adjusting the model hyperparameters. Rather than guessing which hyperparameter values to use, we can let SageMaker search the hyperparameter space in an intelligent way on our behalf.
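
For reference, the `mse` metric reported above is the mean of the squared differences between predicted and actual values. Its square root (RMSE) is in the same units as the target, which can make it easier to judge. A minimal sketch with made-up numbers (the function name and the values are illustrative, not taken from the training job):

```python
import math


def mean_squared_error(actual, predicted):
    """Mean of squared differences between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up house prices and predictions, for illustration only.
actual = [200000.0, 300000.0, 250000.0]
predicted = [210000.0, 290000.0, 250000.0]

mse = mean_squared_error(actual, predicted)
rmse = math.sqrt(mse)
print(f"MSE={mse:.1f}, RMSE={rmse:.1f}")
```

Whether a given MSE is "good" depends on the scale of the target, so comparing the RMSE with typical target values is often a more intuitive check.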

## SageMaker Training with Automatic Model Tuning (HPO)

[Amazon SageMaker Automatic Model Tuning](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning.html), also known as hyperparameter tuning/optimization, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.

You can use SageMaker automatic model tuning with built-in algorithms, custom algorithms, and SageMaker pre-built containers for machine learning frameworks.

We begin by specifying the hyperparameters we wish to tune, and the range of values over which to tune each one.  We also must specify an objective metric to be optimized:  in this use case, we'd like to minimize the validation loss.

In [None]:
hyperparameter_ranges = {
  'max_depth': IntegerParameter(1, 10),
  'alpha': ContinuousParameter(0, 1000),
  'gamma': ContinuousParameter(0, 5),
}

objective_metric_name = 'validation:mse'
objective_type = 'Minimize'

Next we specify a HyperparameterTuner object that takes the above definitions as parameters.  Each tuning job must be given a budget:  a maximum number of training jobs.  A tuning job will complete after that many training jobs have been executed.  

We also can specify how much parallelism to employ, in this case two jobs, meaning that the tuning job will complete after two series of two jobs in parallel have completed (so, a total of 4 jobs as set by `max_jobs`).  For the default Bayesian Optimization tuning strategy used here, the tuning search is informed by the results of previous groups of training jobs, so we don't run all of the jobs in parallel, but rather divide the jobs into groups of parallel jobs.  There is a trade-off: using more parallel jobs will finish tuning sooner, but likely will sacrifice tuning search accuracy. 
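
The budget arithmetic above can be written out explicitly. This is only a sketch of the batching described in the text; the helper name is hypothetical, and the real tuner may start a new job as soon as a slot frees up rather than in strict rounds.

```python
import math


def sequential_rounds(max_jobs, max_parallel_jobs):
    """Number of back-to-back batches needed to run max_jobs training jobs."""
    if max_jobs < 1 or max_parallel_jobs < 1:
        raise ValueError("both values must be positive")
    return math.ceil(max_jobs / max_parallel_jobs)


# Values used in this notebook: 4 jobs in total, 2 at a time.
rounds = sequential_rounds(max_jobs=4, max_parallel_jobs=2)
print(rounds)
```

Fewer, larger rounds finish sooner but give the search strategy fewer completed results to learn from between rounds.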

Now we can launch a hyperparameter tuning job by calling the `fit` method of the `HyperparameterTuner` object. The tuning job may take some minutes to finish. While you're waiting, the status of the tuning job, including metadata and results for individual training jobs within the tuning job, can be checked in the SageMaker console in the **Hyperparameter tuning jobs** panel.

In [None]:
tuner_parameters = {
    'estimator': estimator,
    'objective_metric_name': objective_metric_name,
    'hyperparameter_ranges': hyperparameter_ranges,
    'max_jobs': 4,
    'max_parallel_jobs': 2,
    'objective_type': objective_type
}

tuner = HyperparameterTuner(**tuner_parameters)

tuning_job_name = f'xboost-model-tuning-{strftime("%d-%H-%M-%S", gmtime())}'
tuner.fit(inputs, job_name=tuning_job_name)
tuner.wait()

After the tuning job is finished, we can use the `HyperparameterTuningJobAnalytics` object from the SageMaker Python SDK to list the training jobs with the best performance (up to 5). Although the results vary from tuning job to tuning job, the best validation loss from the tuning job (under the `FinalObjectiveValue` column) will likely be lower than the validation loss from the hosted training job above, where we did not tune any hyperparameters.

In [None]:
tuner_metrics = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)
tuner_metrics.dataframe().sort_values(['FinalObjectiveValue'], ascending=True).head(5)

## Model Registry

With the [SageMaker Model Registry](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html) you can do the following:

- Catalog models for production.
- Manage model versions.
- Associate metadata, such as training metrics, with a model.
- Manage the approval status of a model.
- Deploy models to production.
- Automate model deployment with CI/CD.

You can catalog models by creating model package groups that contain different versions of a model. You can create a model group that tracks all of the models that you train to solve a particular problem. You can then register each model you train and the model registry adds it to the model group as a new model version. A typical workflow might look like the following:

- Create a model group.
- Create an ML pipeline that trains a model.
- For each run of the ML pipeline, create a model version that you register in the model group you created in the first step.

So first we'll create a [Model Package Group](https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry-model-group.html) in which we can store/group all related models and their versions.

In [None]:
def create_model_package_group(model_package_group_name, model_package_group_description, sagemaker_session):
    sagemaker_client = sagemaker_session.sagemaker_client

    # Check if model package group already exists
    model_package_group_exists = False
    model_package_groups = sagemaker_client.list_model_package_groups(NameContains=model_package_group_name)
    for list_item in model_package_groups['ModelPackageGroupSummaryList']:
        if list_item['ModelPackageGroupName'] == model_package_group_name:
            model_package_group_exists = True

    # Create new model package group if it doesn't already exist
    if not model_package_group_exists:
        sagemaker_client.create_model_package_group(ModelPackageGroupName=model_package_group_name,
                                                  ModelPackageGroupDescription=model_package_group_description)
    else:
        print(f'{model_package_group_name} Model Package Group already exists')

create_model_package_group(model_package_group_name, 'Models predicting synthetic housing prices',
                           sagemaker_session)
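
One caveat with the check above: `list_model_package_groups` returns results in pages, and `NameContains` is a substring filter, so in an account with many similarly named groups the exact match could sit on a later page. The following sketch shows one way to follow `NextToken` until the name is found. `group_exists` is a hypothetical helper, and `FakeSageMakerClient` is a stand-in written only so the example runs without AWS credentials; with a real boto3 SageMaker client the same function should apply, but treat that as an assumption to verify.

```python
def group_exists(client, group_name):
    """Return True if a model package group named exactly `group_name` exists."""
    kwargs = {"NameContains": group_name}
    while True:
        page = client.list_model_package_groups(**kwargs)
        for summary in page["ModelPackageGroupSummaryList"]:
            if summary["ModelPackageGroupName"] == group_name:
                return True
        token = page.get("NextToken")
        if not token:
            return False
        kwargs["NextToken"] = token  # request the next page


class FakeSageMakerClient:
    """Stand-in for a boto3 client that serves one group name per page."""

    def __init__(self, names):
        self._names = names

    def list_model_package_groups(self, NameContains, NextToken=None):
        matches = [n for n in self._names if NameContains in n]
        index = int(NextToken) if NextToken else 0
        page = {"ModelPackageGroupSummaryList": [
            {"ModelPackageGroupName": name} for name in matches[index:index + 1]
        ]}
        if index + 1 < len(matches):
            page["NextToken"] = str(index + 1)
        return page


fake = FakeSageMakerClient(["synthetic-housing-models-old", "synthetic-housing-models"])
found = group_exists(fake, "synthetic-housing-models")
missing = group_exists(fake, "synthetic-housing")
print(found, missing)
```

For the small number of groups created in this workshop, the single-page check in `create_model_package_group` is sufficient.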

Next we'll register the best model found by the tuning job above.

In [None]:
def create_training_job_metrics(estimator, s3_prefix, region, bucket, problem_type='regression'):
    # Define supervised learning problem type
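    if problem_type == 'regression':
        model_metrics_report = {'regression_metrics': {}}
    elif problem_type == 'classification':
        model_metrics_report = {'classification_metrics': {}}
    else:
        # Fail early instead of hitting an undefined variable further down
        raise ValueError(f"Unsupported problem_type: {problem_type!r}")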
    
    # Parse training job metrics defined in metric_definitions
    training_job_info = estimator.latest_training_job.describe()
    training_job_name = training_job_info['TrainingJobName']
    metrics = training_job_info['FinalMetricDataList']
    for metric in metrics:
        metric_dict = {metric['MetricName']: {'value': metric['Value'], 'standard_deviation': 'NaN'}}
        if problem_type == 'regression':
            model_metrics_report['regression_metrics'].update(metric_dict)
        if problem_type == 'classification':
            model_metrics_report['classification_metrics'].update(metric_dict)
            
    with open('training_metrics.json', 'w') as f:
        json.dump(model_metrics_report, f)
    
    training_metrics_s3_prefix = f'{s3_prefix}/training_jobs/{training_job_name}/training_metrics.json'
    s3_client = boto3.client('s3', region_name=region)
    s3_client.upload_file(Filename='training_metrics.json', Bucket=bucket, Key=training_metrics_s3_prefix)
    training_metrics_s3_uri = f's3://{bucket}/{training_metrics_s3_prefix}'
    model_statistics = MetricsSource('application/json', training_metrics_s3_uri)
    model_metrics = ModelMetrics(model_statistics=model_statistics)
    return model_metrics

# Register model
best_estimator = tuner.best_estimator()
model_metrics = create_training_job_metrics(best_estimator, s3_prefix, region, bucket)

model_package = best_estimator.register(content_types=['text/csv'],
                                        response_types=['application/json'],
                                        inference_instances=['ml.m5.xlarge'],
                                        transform_instances=['ml.m5.xlarge'],
                                        image_uri=best_estimator.image_uri,
                                        model_package_group_name=model_package_group_name,
                                        model_metrics=model_metrics,
                                        approval_status='PendingManualApproval',
                                        description='XGBoost model to predict synthetic housing prices',
                                        model_name=model_name,
                                        name=model_name)
model_package_arn = model_package.model_package_arn
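
To make the shape of `training_metrics.json` concrete, here is a small sketch of the report that `create_training_job_metrics` builds for a regression problem. The metric names mirror those used in this notebook, but the values are made up and `build_metrics_report` is a hypothetical helper, not part of the SageMaker SDK.

```python
import json


def build_metrics_report(final_metrics, problem_type="regression"):
    """Build the metrics report dict from a FinalMetricDataList-style list."""
    if problem_type not in ("regression", "classification"):
        raise ValueError(f"unsupported problem type: {problem_type}")
    section = f"{problem_type}_metrics"
    report = {section: {}}
    for metric in final_metrics:
        report[section][metric["MetricName"]] = {
            "value": metric["Value"],
            "standard_deviation": "NaN",  # placeholder, as in the notebook
        }
    return report


# Made-up values in the shape of describe()["FinalMetricDataList"].
example_metrics = [
    {"MetricName": "train:mse", "Value": 1.5},
    {"MetricName": "validation:mse", "Value": 2.0},
]
report = build_metrics_report(example_metrics)
print(json.dumps(report, indent=2))
```

The resulting JSON document is what gets uploaded to S3 and attached to the model version through `ModelMetrics`.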

We'll store relevant variables to be used in the next notebooks.

In [None]:
%store model_package_arn
%store model_name
%store model_package_group_name
%store model_metrics