# Use Different Combinations of SageMaker Components

1. [Example 1: build, train, and deploy _**all**_ on SageMaker](#Example1)

2. [Example 2: bring own training script](#Example2)

3. [Example 3: bring own trained model](#Example3)

4. [Example 4: bring own container](#Example4)

Useful Docs:

1. [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)

2. [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/index.html)

<a id="Example1"></a>
# Example 1: build, train, and deploy _**all**_ on SageMaker

---
1. Introduction
2. Prerequisites and Preprocessing
    1. Data ingestion
    2. Data inspection
    3. Data conversion
    4. Data uploading
3. Training
4. Hosting
5. Validating
6. (Optional) Clean-up
---

## Introduction

Using Amazon SageMaker's Built-in Linear Learner to Predict Whether a Handwritten Digit is a 0.

> Amazon SageMaker's Linear Learner algorithm extends typical linear models by training many models in parallel, in a computationally efficient manner. Each model has a different set of hyperparameters, and the algorithm then finds the set that optimizes a specific criterion. This can provide substantially more accurate models than typical linear algorithms at the same, or lower, cost.

## Prerequisites and Preprocessing

Kernel in SageMaker Studio: `Data Science`

In [None]:
import sagemaker

role = sagemaker.get_execution_role()
bucket = sagemaker.Session().default_bucket()  # “sagemaker-{region}-{aws-account-id}”
prefix = 'demo-linear-mnist'

### Data ingestion

Next, we read the dataset into memory, for preprocessing prior to training. This processing could be done *in situ* by Amazon Athena, Apache Spark in Amazon EMR, Amazon Redshift, etc., assuming the dataset is present in the appropriate location. Then, the next step would be to transfer the data to S3 for use in training. For small datasets, such as this one, reading into memory isn't onerous, though it would be for larger datasets.

In [None]:
%%time
import pickle, gzip

# Manually download mnist.pkl.gz from https://www.kaggle.com/pablotab/mnistpklgz
# Upload mnist.pkl.gz to ./datasets

# Load the dataset
with gzip.open('./datasets/mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
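
To make the layout that the cell above unpacks more concrete, here is a minimal sketch that builds a tiny stand-in archive in memory and reads it back the same way. The values and the `fake_archive` name are hypothetical; only the shape (three `(features, labels)` pairs for train, validation, and test) mirrors how `mnist.pkl.gz` is used in this notebook.

```python
import gzip
import io
import pickle

# Hypothetical stand-in data: three (features, labels) pairs.
splits = (
    ([[0.0, 0.5], [1.0, 0.0]], [5, 0]),  # train
    ([[0.2, 0.2]], [3]),                 # validation
    ([[0.9, 0.1]], [0]),                 # test
)

# Pickle the tuple through gzip into an in-memory buffer.
fake_archive = io.BytesIO()
with gzip.GzipFile(fileobj=fake_archive, mode='wb') as f:
    pickle.dump(splits, f)
fake_archive.seek(0)

# Read it back with the same calls the notebook uses for the real file.
with gzip.open(fake_archive, 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f, encoding='latin1')

print(len(train_set[0]), 'training rows; labels:', train_set[1])
```

Index `[0]` of each split holds the feature rows and index `[1]` holds the matching labels, which is why later cells use `train_set[0]` and `train_set[1]`.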

### Data inspection

Once the dataset is imported, it's typical as part of the machine learning process to inspect the data, understand the distributions, and determine what type(s) of preprocessing might be needed. You can perform those tasks right here in the notebook. As an example, let's go ahead and look at one of the digits that is part of the dataset.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (2, 10)


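def show_digit(img, caption='', subplot=None):
    # Create a new axes only when the caller did not supply one
    if subplot is None:
        _, subplot = plt.subplots(1, 1)
    imgr = img.reshape((28, 28))
    subplot.axis('off')
    subplot.imshow(imgr, cmap='gray')
    # Title the axes we drew on rather than whatever axes is current
    subplot.set_title(caption)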

show_digit(train_set[0][30], 'This is a {}'.format(train_set[1][30]))

### Data conversion

Since algorithms have particular input and output requirements, converting the dataset is also part of the process that a data scientist goes through prior to initiating training. In this particular case, the Amazon SageMaker implementation of Linear Learner takes recordIO-wrapped protobuf, where the data we have today is a pickle-ized numpy array on disk.

In [None]:
import io
import numpy as np
import sagemaker.amazon.common as smac

vectors = np.array([t.tolist() for t in train_set[0]]).astype('float32')
labels = np.where(np.array([t.tolist() for t in train_set[1]]) == 0, 1, 0).astype('float32')

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, vectors, labels)
buf.seek(0)
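
The label conversion in the cell above turns the ten-class digit labels into a binary target. A pure-Python sketch of the same mapping, using hypothetical sample labels and a hypothetical helper name, may make the convention easier to see:

```python
def to_binary_labels(digits, positive_digit=0):
    """Map digit labels to 1.0 for the positive digit and 0.0 for all others."""
    return [1.0 if d == positive_digit else 0.0 for d in digits]


sample_digits = [5, 0, 4, 1, 0]  # hypothetical labels
binary = to_binary_labels(sample_digits)
print(binary)
```

A label of `1.0` therefore means "this image is a 0", which is the convention the predictions follow later on.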

### Data uploading

Now that we've created our recordIO-wrapped protobuf, we'll need to upload it to S3, so that Amazon SageMaker training can use it.

In [None]:
import boto3
import os

key = 'recordio-pb-data'
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)
s3url_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)
print(s3url_train_data)

Let's also set up an output S3 location for the model artifact that training with the algorithm will produce.

In [None]:
s3url_output = 's3://{}/{}/output'.format(bucket, prefix)
print(s3url_output)
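
S3 object keys always use `/` as the separator, whereas `os.path.join` uses the separator of the local operating system. That works on the Linux-based notebook kernels, but a helper built on `posixpath` behaves the same everywhere. The following is a small sketch; the helper names and the `example-bucket` name are hypothetical.

```python
import posixpath


def s3_key(*parts):
    """Join key segments with '/', regardless of the local operating system."""
    return posixpath.join(*parts)


def s3_uri(bucket, *parts):
    """Build an s3:// URI from a bucket name and key segments."""
    return 's3://{}/{}'.format(bucket, s3_key(*parts))


train_key = s3_key('demo-linear-mnist', 'train', 'recordio-pb-data')
output_uri = s3_uri('example-bucket', 'demo-linear-mnist', 'output')
print(train_key)
print(output_uri)
```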

## Training

First let's specify our algorithm image. 

In [None]:
from sagemaker.amazon.amazon_estimator import get_image_uri

image = get_image_uri(boto3.Session().region_name, 'linear-learner')

Next we'll kick off the base estimator, making sure to pass in the necessary hyperparameters.  Notice:
- `feature_dim` is set to 784, which is the number of pixels in each 28 x 28 image.
- `predictor_type` is set to 'binary_classifier' since we are trying to predict whether the image is or is not a 0.
- `mini_batch_size` is set to 200.  This value can be tuned for relatively minor improvements in fit and speed, but selecting a reasonable value relative to the dataset is appropriate in most cases.
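
The arithmetic behind these values can be checked directly. The sketch below assumes a training split of 50,000 examples, which is not stated in this notebook; substitute `len(train_set[0])` for the real count.

```python
image_side = 28
feature_dim = image_side * image_side  # one feature per pixel

train_examples = 50000  # assumption: size of the training split
mini_batch_size = 200

# Ceiling division: a final partial batch still counts as a batch.
batches_per_epoch = -(-train_examples // mini_batch_size)

print(feature_dim, batches_per_epoch)
```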

In [None]:
import boto3

linear = sagemaker.estimator.Estimator(image,
                                       role, 
                                       train_instance_count=1, 
                                       train_instance_type='ml.m5.xlarge',
                                       output_path=s3url_output,
                                       sagemaker_session=sagemaker.Session())

linear.set_hyperparameters(feature_dim=784,
                           predictor_type='binary_classifier',
                           mini_batch_size=200)

linear.fit({'train': s3url_train_data})

## Hosting

Now that we've trained our model, we can deploy it behind an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inferences) from the model dynamically. (_Note: Amazon SageMaker allows you the flexibility of importing models trained elsewhere, as well as the choice of not importing models if the target of model creation is AWS Lambda, AWS Greengrass, Amazon Redshift, Amazon Athena, or another deployment target._)

In [None]:
# (Optional) Use an existing model trained by SageMaker

job_name = 'linear-learner-2021-02-20-17-21-37-544'
linear = sagemaker.estimator.Estimator.attach(job_name)

In [None]:
all_instance_types = ["ml.r5d.12xlarge", "ml.r5.12xlarge", "ml.p2.xlarge", "ml.m5.4xlarge", "ml.m4.16xlarge", "ml.r5d.24xlarge", "ml.r5.24xlarge", "ml.p3.16xlarge", "ml.m5d.xlarge", "ml.m5.large", "ml.t2.xlarge", "ml.p2.16xlarge", "ml.m5d.12xlarge", "ml.inf1.2xlarge", "ml.m5d.24xlarge", "ml.c4.2xlarge", "ml.c5.2xlarge", "ml.c4.4xlarge", "ml.inf1.6xlarge", "ml.c5d.2xlarge", "ml.c5.4xlarge", "ml.g4dn.xlarge", "ml.g4dn.12xlarge", "ml.c5d.4xlarge", "ml.g4dn.2xlarge", "ml.c4.8xlarge", "ml.c4.large", "ml.c5d.xlarge", "ml.c5.large", "ml.g4dn.4xlarge", "ml.c5.9xlarge", "ml.g4dn.16xlarge", "ml.c5d.large", "ml.c5.xlarge", "ml.c5d.9xlarge", "ml.c4.xlarge", "ml.inf1.xlarge", "ml.g4dn.8xlarge", "ml.inf1.24xlarge", "ml.m5d.2xlarge", "ml.t2.2xlarge", "ml.c5d.18xlarge", "ml.m5d.4xlarge", "ml.t2.medium", "ml.c5.18xlarge", "ml.r5d.2xlarge", "ml.r5.2xlarge", "ml.p3.2xlarge", "ml.m5d.large", "ml.m5.xlarge", "ml.m4.10xlarge", "ml.t2.large", "ml.r5d.4xlarge", "ml.r5.4xlarge", "ml.m5.12xlarge", "ml.m4.xlarge", "ml.m5.24xlarge", "ml.m4.2xlarge", "ml.p2.8xlarge", "ml.m5.2xlarge", "ml.r5d.xlarge", "ml.r5d.large", "ml.r5.xlarge", "ml.r5.large", "ml.p3.8xlarge", "ml.m4.4xlarge"]
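for instance_type in all_instance_types:
    try:
        linear_predictor = linear.deploy(initial_instance_count=1,
                                         instance_type=instance_type)
    except Exception as e:
        # Report why this type was rejected (for example, a quota limit) and try the next one
        print('Skipping {}: {}'.format(instance_type, e))
    else:
        print('\nUsing instance type: ', instance_type)
        break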
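
The cell above is an instance of a general "try candidates until one works" pattern. Here is a self-contained sketch of that pattern; `first_successful` and `fake_deploy` are hypothetical names, and `fake_deploy` merely stands in for `linear.deploy` so that the example runs without AWS access.

```python
def first_successful(candidates, action):
    """Return (candidate, result) for the first candidate on which action succeeds."""
    errors = {}
    for candidate in candidates:
        try:
            result = action(candidate)
        except Exception as exc:
            errors[candidate] = exc  # remember why it failed, then move on
            continue
        return candidate, result
    raise RuntimeError('all candidates failed: {}'.format(errors))


def fake_deploy(instance_type):
    # Hypothetical stand-in: pretend only one instance type is within quota.
    if instance_type != 'ml.m5.large':
        raise ValueError('quota exceeded for {}'.format(instance_type))
    return 'endpoint-on-{}'.format(instance_type)


chosen, endpoint = first_successful(
    ['ml.r5d.12xlarge', 'ml.p2.xlarge', 'ml.m5.large'], fake_deploy)
print(chosen, endpoint)
```

Collecting the errors instead of discarding them makes it possible to tell a quota problem from, say, a misspelled instance type when every candidate fails.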

## Validating

Finally, we can now validate the model for use.  We can pass HTTP POST requests to the endpoint to get back predictions.  We'll specify how to serialize requests and deserialize responses that are specific to the algorithm.

In [None]:
from sagemaker.predictor import csv_serializer, json_deserializer

# linear_predictor.content_type = 'text/csv'
linear_predictor.serializer = csv_serializer
linear_predictor.deserializer = json_deserializer
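
Conceptually, CSV serialization turns each feature row into one comma-separated line of the request body. The sketch below illustrates the idea only; it is not the SDK's implementation, and `rows_to_csv` is a hypothetical helper.

```python
def rows_to_csv(rows):
    """Render a list of numeric rows as CSV text, one row per line."""
    return '\n'.join(','.join(str(value) for value in row) for row in rows)


payload = rows_to_csv([[0.0, 0.5, 1.0], [0.25, 0.0, 0.75]])
print(payload)
```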

Now let's try to get a prediction for a single record.

In [None]:
result = linear_predictor.predict(train_set[0][30:31])
print(result)

We should see that for one record our endpoint returned some JSON which contains `predictions`, including the `score` and `predicted_label`. In this case, `score` is a continuous value in the range [0, 1] representing the estimated probability that the digit is a 0. `predicted_label` takes a value of either `0` or `1`, where (somewhat counterintuitively) `1` denotes that we predict the image is a 0, while `0` denotes that we predict the image is not a 0.
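
As a sketch of how such a response can be read, the following parses a hand-written JSON string with the two fields described above. The scores are made up for illustration and are not output from a real endpoint.

```python
import json

# Hypothetical response body with the fields described above.
raw = ('{"predictions": ['
       '{"score": 0.97, "predicted_label": 1}, '
       '{"score": 0.02, "predicted_label": 0}]}')

result = json.loads(raw)

# predicted_label == 1 means "this image is a 0".
is_zero = [p['predicted_label'] == 1 for p in result['predictions']]
scores = [p['score'] for p in result['predictions']]
print(is_zero, scores)
```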

Let's do a whole batch of images and evaluate our predictive accuracy.

In [None]:
import numpy as np

predictions = []
for array in np.array_split(test_set[0], 100):
    result = linear_predictor.predict(array)
    predictions += [r['predicted_label'] for r in result['predictions']]

predictions = np.array(predictions)
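
`np.array_split` is used above so that each request carries a manageable number of rows. Unlike `np.split`, it accepts a length that is not evenly divisible and gives the leading chunks one extra element. A pure-Python sketch of that behavior, with a hypothetical helper name:

```python
def split_evenly(items, n_chunks):
    """Split a list into n_chunks nearly equal parts; leading chunks get the extras."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


chunks = split_evenly(list(range(10)), 4)
print(chunks)
```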

In [None]:
import pandas as pd
pd.crosstab(np.where(test_set[1] == 0, 1, 0), predictions, rownames=['actuals'], colnames=['predictions'])
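
The cross-tabulation above is a confusion matrix. To show what its four cells count and how accuracy follows from them, here is a small sketch with hypothetical labels and a hypothetical helper name:

```python
from collections import Counter


def confusion_counts(actuals, predictions):
    """Count (actual, predicted) pairs; keys are tuples such as (1, 0)."""
    return Counter(zip(actuals, predictions))


actuals = [1, 0, 0, 1, 0, 0]    # hypothetical ground truth (1 = digit is a 0)
predicted = [1, 0, 1, 0, 0, 0]  # hypothetical model output

counts = confusion_counts(actuals, predicted)
# Correct predictions sit on the diagonal: (1, 1) and (0, 0).
accuracy = (counts[(1, 1)] + counts[(0, 0)]) / len(actuals)
print(dict(counts), accuracy)
```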

## (Optional) Clean-up

To avoid incurring unnecessary charges, delete the endpoints and resources that you created while running this example.

Delete endpoints

In [None]:
sagemaker.Session().delete_endpoint(linear_predictor.endpoint)

Delete models

In [None]:
import boto3
from pprint import pprint

client = boto3.client('sagemaker')


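def main():
    # Caution: this collects and deletes EVERY model in the account for the
    # current Region, not only the models created by this example.
    model_names = []
    for key in paginate(client.list_models):
        model_names.append(key['ModelName'])
    delete_multiple_models(model_names)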


def delete_multiple_models(model_names):
    for model_name in model_names:
        print('Deleting model: {}'.format(model_name))
        client.delete_model(ModelName=model_name)


def paginate(method, **kwargs):
    client = method.__self__
    paginator = client.get_paginator(method.__name__)
    for page in paginator.paginate(**kwargs).result_key_iters():
        for result in page:
            yield result

            
main()
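
To illustrate what the paginator handles, the sketch below walks a token-paginated listing by hand and then narrows the result to names with a given prefix. `fake_list_models` and its pages are hypothetical stand-ins so that the example runs offline; the `Models` and `NextToken` keys follow the shape of the `list_models` response.

```python
def iter_models(fetch_page):
    """Yield every model entry from a token-paginated listing."""
    token = None
    while True:
        page = fetch_page(token)
        for model in page['Models']:
            yield model
        token = page.get('NextToken')
        if token is None:
            break


# Hypothetical pages standing in for successive list_models responses.
FAKE_PAGES = {
    None: {'Models': [{'ModelName': 'demo-a'}, {'ModelName': 'other-b'}],
           'NextToken': 'page-2'},
    'page-2': {'Models': [{'ModelName': 'demo-c'}]},
}


def fake_list_models(token):
    return FAKE_PAGES[token]


names = [m['ModelName'] for m in iter_models(fake_list_models)]
demo_only = [n for n in names if n.startswith('demo-')]
print(names, demo_only)
```

Filtering by a prefix such as `demo-` before deleting is one way to avoid removing models that belong to other projects.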

Delete S3 bucket

Open the Amazon S3 [console](https://console.aws.amazon.com/s3/), and then delete the bucket that you created for storing model artifacts and the training dataset.

Delete logs

Open the Amazon CloudWatch [console](https://console.aws.amazon.com/cloudwatch/), and then delete all of the log groups that have names starting with `/aws/sagemaker/`.

<a id="Example2"></a>
# Example 2: bring own training script

---
1. Introduction
2. Prerequisites and Preprocessing
    1. Data ingestion
    2. Data inspection
    3. Data uploading
3. Debugger
4. Experiments
5. Monitor
6. (Optional) Clean-up
---

## Introduction

Using Gradient Boosted Trees as a Framework to Predict Mobile Customer Departure.

> Using XGBoost **as a framework** provides more flexibility than using it **as a built-in algorithm** because it enables more advanced scenarios that allow you to incorporate preprocessing and post-processing scripts into your training script. Specifically, we'll be able to specify a list of rules that we want Debugger to evaluate our training process against.

## Prerequisites and Preprocessing

Kernel in SageMaker Studio: `Data Science`

In [None]:
import sys

!{sys.executable} -m pip install sagemaker-experiments

In [None]:
# Check the SageMaker Python SDK version
# If you see the "Please restart the kernel" prompt, simply click Kernel above and hit Restart.
import sagemaker

if int(sagemaker.__version__.split('.')[0]) >= 2:
    !{sys.executable} -m pip install "sagemaker>=1.71.0,<2.0.0"
    print("Installing previous SageMaker Version. Please restart the kernel")
else:
    print("Version is good")
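
The check above relies on the major component of the version string. A small sketch of that logic in isolation, with hypothetical helper names and sample version strings:

```python
def major_version(version):
    """Return the leading integer of a dotted version string such as '1.72.0'."""
    return int(version.split('.')[0])


def needs_downgrade(version, max_major=1):
    """True when the installed major version is newer than this notebook targets."""
    return major_version(version) > max_major


for v in ('1.72.0', '2.24.1'):
    print(v, needs_downgrade(v))
```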

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import io
import os
import sys
import time
import json
from IPython.display import display
from time import strftime, gmtime
import awscli
import boto3
import re

import sagemaker
from sagemaker import get_execution_role
from sagemaker.predictor import csv_serializer
from sagemaker.debugger import rule_configs, Rule, DebuggerHookConfig
from sagemaker.model_monitor import DataCaptureConfig, DatasetFormat, DefaultModelMonitor
from sagemaker.s3 import S3Uploader, S3Downloader

from smexperiments.experiment import Experiment
from smexperiments.trial import Trial
from smexperiments.trial_component import TrialComponent
from smexperiments.tracker import Tracker

In [None]:
sess = boto3.Session()
# sess.get_available_services()
sm = sess.client('sagemaker')
role = sagemaker.get_execution_role()
bucket = sagemaker.Session().default_bucket()  # “sagemaker-{region}-{aws-account-id}”
prefix = 'demo-xgboost-churn'

### Data ingestion

The dataset that we use is publicly available and was mentioned in the book [Discovering Knowledge in Data](https://www.amazon.com/dp/0470908742/) by Daniel T. Larose. It's attributed by the author to the University of California Irvine Repository of Machine Learning Datasets. The downloaded and preprocessed dataset is in the `data` folder that accompanies the [notebook](https://github.com/aws/amazon-sagemaker-examples/blob/master/aws_sagemaker_studio/getting_started/xgboost_customer_churn_studio.ipynb). It's been split into training and validation datasets. To see how the dataset was preprocessed, see this [XGBoost customer churn notebook that starts with the original dataset](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn.ipynb).

In [None]:
# Set the path to the data files
# Make sure you cloned https://github.com/aws/amazon-sagemaker-examples.git
%cd /root/amazon-sagemaker-examples/aws_sagemaker_studio/getting_started
local_data_path = './data/training-dataset-with-header.csv'
data = pd.read_csv(local_data_path)

### Data inspection

We'll train on a CSV file without the header. But for now, we load some of the data from a version of the training data that has a header. Explore the data to see the dataset's features and what data will be used to train the model.

In [None]:
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 10)         # Keep the output on one page
data

### Data uploading

Now we'll upload the files to S3 for training, but first we will create an S3 bucket for the data if one does not already exist.

In [None]:
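from botocore.exceptions import ClientError

try:
    if sess.region_name == "us-east-1":
        sess.client('s3').create_bucket(Bucket=bucket)
    else:
        sess.client('s3').create_bucket(Bucket=bucket, 
                                        CreateBucketConfiguration={'LocationConstraint': sess.region_name})
except ClientError as e:
    # Only an existing bucket that we already own is safe to ignore; surface anything else
    if e.response['Error']['Code'] != 'BucketAlreadyOwnedByYou':
        raise
    print("Looks like you already have a bucket of this name. That's good. Uploading the data files...")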

# Return the URLs of the uploaded files, so they can be reviewed or used elsewhere
s3url = S3Uploader.upload('data/train.csv', 's3://{}/{}/{}'.format(bucket, prefix,'train'))
print(s3url)
s3url = S3Uploader.upload('data/validation.csv', 's3://{}/{}/{}'.format(bucket, prefix,'validation'))
print(s3url)

Then, because we're training with the CSV file format, we'll create `s3_input`s that our training function can use as a pointer to the files in S3.

In [None]:
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(bucket, prefix), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(bucket, prefix), content_type='csv')

## Debugger

With Amazon SageMaker Debugger, you can debug models during training. During training, Debugger periodically saves tensors, which specify the state of the model at that point in time. Debugger saves the tensors to Amazon S3 for analysis and visualization. This allows you to diagnose training issues with Studio.

To enable automated detection of common issues during training, you can attach a list of rules to evaluate the training job against. Some rule configurations that apply to XGBoost include `AllZero`, `ClassImbalance`, `Confusion`, `LossNotDecreasing`, `Overfit`, `Overtraining`, `SimilarAcrossRuns`, `TensorVariance`, `UnchangedTensor`, `TreeDepth`.

Let's use the `LossNotDecreasing` rule, which is triggered if the loss doesn't decrease monotonically at any point during training, together with the `Overtraining` and `Overfit` rules.

In [None]:
debug_rules = [Rule.sagemaker(rule_configs.loss_not_decreasing()),
               Rule.sagemaker(rule_configs.overtraining()),
               Rule.sagemaker(rule_configs.overfit())
              ]
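
To give a feel for what a "loss not decreasing" check looks for, here is a much-simplified sketch. It is not Debugger's implementation: the function name, the `patience` parameter, and the sample loss values are all hypothetical.

```python
def first_stalled_step(losses, patience=1):
    """Return the index where loss has failed to fall for `patience` steps in a row.

    Returns None when the loss keeps decreasing throughout.
    """
    stalled = 0
    for i in range(1, len(losses)):
        if losses[i] < losses[i - 1]:
            stalled = 0
        else:
            stalled += 1
            if stalled >= patience:
                return i
    return None


print(first_stalled_step([0.9, 0.7, 0.6, 0.5]))
print(first_stalled_step([0.9, 0.7, 0.7, 0.5]))
```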

## Experiments

Amazon SageMaker Experiments allows us to keep track of model training; organize related models together; and log model configuration, parameters, and metrics so we can reproduce and iterate on previously trained models and compare models. We'll create a single experiment to keep track of the different trials to train the model that we'll try.

In [None]:
sess = sagemaker.session.Session()
create_date = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
customer_churn_experiment = Experiment.create(experiment_name="customer-churn-prediction-xgboost-{}".format(create_date), 
                                              description="Using xgboost to predict customer churn", 
                                              sagemaker_boto_client=boto3.client('sagemaker'))

We'll create a trial for this run and associate it with the experiment that we just created so that we can use Studio to compare it with other trials. To train the model, we'll create an estimator and specify a few parameters, such as the type and number of training instances and the [hyperparameters](https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst). Because we are running in framework mode, we also need to pass the estimator additional parameters, such as the framework version and the entry point script, which can incorporate additional processing into the training job.

> We've made a couple of simple changes to enable the Debugger. We created a SessionHook which we pass as a callback function when creating a Booster. We passed a SaveConfig object that tells the hook to save the evaluation metrics, feature importances, and SHAP values at regular intervals. Debugger is highly configurable, so you can choose exactly what to save. For even more detail, see the [Developer Guide for XGBoost](https://github.com/awslabs/sagemaker-debugger/tree/master/docs/xgboost.md).
<code>!pygmentize xgboost_customer_churn.py</code>

After that, we call `fit` to start the training job.

==**IMPORTANT**==

The last line in the given [script](https://github.com/aws/amazon-sagemaker-examples/blob/master/aws_sagemaker_studio/getting_started/xgboost_customer_churn.py) must be modified, otherwise the notebook will have errors (See the [issue](https://github.com/aws/amazon-sagemaker-examples/issues/2031#issuecomment-787990970)).

In [None]:
hyperparams = {"max_depth":5,
               "subsample":0.8,
               "num_round":600,
               "eta":0.2,
               "gamma":4,
               "min_child_weight":6,
               "silent":0,
               "objective":'binary:logistic'}

In [None]:
entry_point_script = "xgboost_customer_churn.py"

trial = Trial.create(trial_name="trial-{}-weight-{}".format(strftime("%Y-%m-%d-%H-%M-%S", gmtime()), hyperparams['min_child_weight']), 
                     experiment_name=customer_churn_experiment.experiment_name,
                     sagemaker_boto_client=boto3.client('sagemaker'))

framework_xgb = sagemaker.xgboost.XGBoost(entry_point=entry_point_script,
                                          role=role,
                                          framework_version="0.90-2",
                                          py_version="py3",
                                          hyperparameters=hyperparams,
                                          train_instance_count=1, 
                                          train_instance_type='ml.m5.xlarge',
                                          output_path='s3://{}/{}/output'.format(bucket, prefix),
                                          sagemaker_session=sess,
                                          rules=debug_rules
                                          )

framework_xgb.fit({'train': s3_input_train,
                   'validation': s3_input_validation}, 
                  experiment_config={
                      "ExperimentName": customer_churn_experiment.experiment_name, 
                      "TrialName": trial.trial_name,
                      "TrialComponentDisplayName": "Training",
                  })

To improve a model, you typically try other hyperparameter values to see if they affect the final validation error. Let's change the `min_child_weight` value and start other training jobs with those values to see how they affect the validation error. For each `min_child_weight` value, we'll create a separate trial so that we can compare the results in Studio.

In [None]:
min_child_weights = [1, 2, 4, 8, 10]

for weight in min_child_weights:
    hyperparams["min_child_weight"] = weight
    trial = Trial.create(trial_name="trial-{}-weight-{}".format(strftime("%Y-%m-%d-%H-%M-%S", gmtime()), weight), 
                         experiment_name=customer_churn_experiment.experiment_name,
                         sagemaker_boto_client=boto3.client('sagemaker'))

    t_xgb = sagemaker.xgboost.XGBoost(entry_point=entry_point_script,
                                      role=role,
                                      framework_version="0.90-2",
                                      py_version="py3",
                                      hyperparameters=hyperparams,
                                      train_instance_count=1, 
                                      train_instance_type='ml.m5.xlarge',
                                      output_path='s3://{}/{}/output'.format(bucket, prefix),
                                      sagemaker_session=sess,
                                      rules=debug_rules
                                     )

    t_xgb.fit({'train': s3_input_train,
               'validation': s3_input_validation},
                wait=False,
                experiment_config={
                    "ExperimentName": customer_churn_experiment.experiment_name, 
                    "TrialName": trial.trial_name,
                    "TrialComponentDisplayName": "Training",
                }
               )
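
When sweeping one hyperparameter, it can help to build each trial's settings as a separate copy so that the base dictionary stays unchanged and every trial's configuration can be inspected afterwards. A minimal sketch, with a hypothetical helper name and a shortened base dictionary:

```python
from time import gmtime, strftime

# Shortened, illustrative base settings; the notebook's dictionary has more keys.
base_hyperparams = {"max_depth": 5, "eta": 0.2, "min_child_weight": 6}


def trial_configs(base, weights, timestamp):
    """Return (trial_name, hyperparameters) pairs, one per min_child_weight value."""
    configs = []
    for weight in weights:
        params = dict(base, min_child_weight=weight)  # copy, do not mutate base
        name = "trial-{}-weight-{}".format(timestamp, weight)
        configs.append((name, params))
    return configs


stamp = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
configs = trial_configs(base_hyperparams, [1, 2, 4, 8, 10], stamp)
for name, params in configs:
    print(name, params["min_child_weight"])
```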

After the training jobs succeed, you can view `Experiments and trials` in Studio via `SageMaker Components and registries` in the left bar. To see the trials of a specific experiment, double-click it in the Experiments list. To see several trials together, multi-select them and right-click to choose `Open in trial component list`. You can click the button next to `Deploy model` to view more properties in the table, e.g., `Metrics`, `Parameters`, `Input artifacts` and `Output artifacts`. To create charts, multi-select the trials and then click `Add chart`. Here is an example: for `Data type`, choose `Summary statistics`; for `Chart type`, choose `Scatter plot`; then choose the `min_child_weight` hyperparameter as the `X-axis`; for `Y-axis` metrics, choose either `validation:error_last` or `validation:error_avg`, and then color them by `trialComponentName`.

## Monitor

Now that we've trained the model, let's deploy it to a hosted endpoint. To monitor the model after it's hosted and serving requests, we will also add configurations to capture data that is being sent to the endpoint.

In [None]:
endpoint_name = "xgboost-churn-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
job_name = 'sagemaker-xgboost-2021-03-01-13-52-39-970'
xgb = sagemaker.xgboost.XGBoost.attach(job_name)
data_capture_prefix = '{}/datacapture'.format(prefix)

xgb_predictor = xgb.deploy(initial_instance_count=1, 
                           instance_type='ml.m5.large',
                           endpoint_name=endpoint_name,
                           data_capture_config=DataCaptureConfig(enable_capture=True,
                                                                 sampling_percentage=100,
                                                                 destination_s3_uri='s3://{}/{}'.format(bucket, data_capture_prefix)
                                                                )
                           )

Now that we have a hosted endpoint running, we can make real-time predictions from our model by making an HTTP POST request. But first, we'll need to set up serializers and deserializers for passing our `test_data` NumPy arrays to the model behind the endpoint.

In [None]:
xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
xgb_predictor.deserializer = None

Now, we'll loop over our test dataset and collect predictions by invoking the XGBoost endpoint:

In [None]:
print("Sending test traffic to the endpoint {}. \nPlease wait for a minute...".format(endpoint_name))

with open('data/test_sample.csv', 'r') as f:
    for row in f:
        payload = row.rstrip('\n')
        response = xgb_predictor.predict(data=payload)
        time.sleep(0.5)

Because we made some real-time predictions by sending data to our endpoint, we should have also captured that data for monitoring purposes. Let's list the data capture files stored in Amazon S3. Expect to see different files from different time periods organized by the hour in which the invocation occurred. The format of the S3 path is:

`s3://{bucket}/{data_capture_prefix}/{endpoint_name}/AllTraffic/yyyy/mm/dd/hh/filename.jsonl`
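
As a sketch of how this path format can be used, the helper below builds the prefix for a given hour so that a listing can be narrowed to one time window. The helper name and the sample bucket and endpoint names are hypothetical, and the sketch assumes the hour folders are expressed in UTC.

```python
from datetime import datetime, timezone


def capture_hour_prefix(bucket, data_capture_prefix, endpoint_name, when):
    """Build the S3 prefix that holds the capture files for the hour containing `when`."""
    hour_path = when.astimezone(timezone.utc).strftime('%Y/%m/%d/%H')
    return 's3://{}/{}/{}/AllTraffic/{}/'.format(
        bucket, data_capture_prefix, endpoint_name, hour_path)


example = capture_hour_prefix(
    'example-bucket',
    'demo-xgboost-churn/datacapture',
    'xgboost-churn-example',
    datetime(2021, 3, 1, 13, 52, tzinfo=timezone.utc),
)
print(example)
```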

In [None]:
from time import sleep

current_endpoint_capture_prefix = '{}/{}'.format(data_capture_prefix, endpoint_name)
for _ in range(12): # wait up to a minute to see captures in S3
    capture_files = S3Downloader.list("s3://{}/{}".format(bucket, current_endpoint_capture_prefix))
    if capture_files:
        break
    sleep(5)

print("Found Data Capture Files:")
print(capture_files)

All the captured data is stored in SageMaker-specific JSON Lines files. Next, let's take a quick peek at the contents of a single line, pretty-printed as JSON, so that we can observe the format a little better.

In [None]:
capture_file = S3Downloader.read_file(capture_files[-1])

print("=====Single Data Capture====")
print(json.dumps(json.loads(capture_file.split('\n')[0]), indent=2)[:2000])

As you can see, each inference request is captured in one line in the jsonl file. The line contains both the input and output merged together. In our example, we provided the ContentType as `text/csv`, which is reflected in the `observedContentType` value. Also, the `encoding` value exposes the encoding that we used for the input and output payloads in the capture format.
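
The following sketch shows one way to pull the request and response out of a captured line. The sample record is hand-written: `observedContentType` and `encoding` are the fields named above, while the remaining key names and all values are assumptions about the layout, so compare them with the output printed by the previous cell.

```python
import json

# Hand-written stand-in for one line of a capture file (layout partly assumed).
sample_line = json.dumps({
    "captureData": {
        "endpointInput": {"observedContentType": "text/csv", "mode": "INPUT",
                          "data": "186,0.1,137.8", "encoding": "CSV"},
        "endpointOutput": {"observedContentType": "text/csv; charset=utf-8",
                           "mode": "OUTPUT", "data": "0.014", "encoding": "CSV"},
    },
    "eventMetadata": {"eventId": "hypothetical-id",
                      "inferenceTime": "2021-03-01T14:00:00Z"},
})

record = json.loads(sample_line)
request = record["captureData"]["endpointInput"]
response = record["captureData"]["endpointOutput"]

# With CSV encoding, the payloads are plain comma-separated text.
features = [float(x) for x in request["data"].split(",")]
score = float(response["data"])
print(request["observedContentType"], features, score)
```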

To use Amazon SageMaker Model Monitor, we need to:
1. Create a baseline to compare against real-time traffic. 
1. When the baseline is ready, set up a schedule to continuously evaluate and compare against the baseline.
1. Send synthetic traffic to trigger alarms.

It takes one hour or more to complete this section because the shortest monitoring polling time is one hour.

### 1. Suggest baseline constraints with the training dataset

The training dataset that you used to train the model is usually a good baseline dataset. Note that the training dataset data schema and the inference dataset schema should match exactly (for example, the number and type of the features).

From our training dataset, let's ask Amazon SageMaker to suggest a set of baseline `constraints` and generate descriptive `statistics` to explore the data. For this example, let's upload the training dataset that we used to train the model. We'll use the dataset file with column headers so that we have descriptive feature names.

In [None]:
baseline_prefix = prefix + '/baselining'
baseline_data_prefix = baseline_prefix + '/data'
baseline_results_prefix = baseline_prefix + '/results'

baseline_data_uri = 's3://{}/{}'.format(bucket, baseline_data_prefix)
baseline_results_uri = 's3://{}/{}'.format(bucket, baseline_results_prefix)
print('Baseline data uri: {}'.format(baseline_data_uri))
print('Baseline results uri: {}'.format(baseline_results_uri))
baseline_data_path = S3Uploader.upload("data/training-dataset-with-header.csv", baseline_data_uri)

Create a baselining job with the training dataset

Now that we have the training data ready in Amazon S3, let's start a job to `suggest` constraints. To generate the constraints, the convenient helper starts a `ProcessingJob` using a ProcessingJob container provided by Amazon SageMaker.

In [None]:
my_default_monitor = DefaultModelMonitor(role=role,
                                         instance_count=1,
                                         instance_type='ml.m5.xlarge',
                                         volume_size_in_gb=20,
                                         max_runtime_in_seconds=3600,
                                        )

baseline_job = my_default_monitor.suggest_baseline(baseline_dataset=baseline_data_path,
                                                   dataset_format=DatasetFormat.csv(header=True),
                                                   output_s3_uri=baseline_results_uri,
                                                   wait=True
                                                  )

Once the job succeeds, we can explore the `baseline_results_uri` to see what files are stored there.

In [None]:
print("Found Files:")
S3Downloader.list("s3://{}/{}".format(bucket, baseline_results_prefix))

We have a `constraints.json` file that has information about suggested constraints. We also have a `statistics.json` file which contains statistical information about the data in the baseline.

In [None]:
baseline_job = my_default_monitor.latest_baselining_job
schema_df = pd.io.json.json_normalize(baseline_job.baseline_statistics().body_dict["features"])
schema_df.head(10)

In [None]:
constraints_df = pd.io.json.json_normalize(baseline_job.suggested_constraints().body_dict["features"])
constraints_df.head(10)

### 2. Analyze subsequent captures for data quality issues

Now that we have processed the baseline dataset to get baseline statistics and constraints, let's monitor and analyze the data that is being sent to the endpoint with monitoring schedules.

First, create a monitoring schedule for the previously created endpoint. The schedule specifies the cadence at which we run a new processing job to compare recent data captures to the baseline.

In [None]:
# First, copy over some test scripts to the S3 bucket so that they can be used for pre and post processing
code_prefix = '{}/code'.format(prefix)
preprocessor_script = S3Uploader.upload('preprocessor.py', 's3://{}/{}'.format(bucket, code_prefix))
postprocessor_script = S3Uploader.upload('postprocessor.py', 's3://{}/{}'.format(bucket, code_prefix))

We are ready to create a model monitoring schedule that uses the endpoint created earlier together with the baseline resources (constraints and statistics) generated above.

In [None]:
from sagemaker.model_monitor import CronExpressionGenerator
from time import gmtime, strftime

reports_prefix = '{}/reports'.format(prefix)
reports_path = 's3://{}/{}'.format(bucket, reports_prefix)

mon_schedule_name = 'demo-xgboost-customer-churn-model-schedule-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
my_default_monitor.create_monitoring_schedule(monitor_schedule_name=mon_schedule_name,
                                              endpoint_input=xgb_predictor.endpoint,
                                              #record_preprocessor_script=preprocessor_script,
                                              post_analytics_processor_script=postprocessor_script,
                                              output_s3_uri=reports_path,
                                              statistics=my_default_monitor.baseline_statistics(),
                                              constraints=my_default_monitor.suggested_constraints(),
                                              schedule_cron_expression=CronExpressionGenerator.hourly(),
                                              enable_cloudwatch_metrics=True,
                                             )

### 3. Start generating some artificial traffic

The following block starts a thread that keeps sending traffic to the created endpoint, so that data is continually captured for analysis. Without traffic, later monitoring jobs start to fail because there is no captured data to process.

To terminate this thread, you need to stop the kernel.

In [None]:
import boto3
from threading import Thread
from time import sleep

runtime_client = boto3.client('runtime.sagemaker')

# Repeated from above so that this section can be run independently
def invoke_endpoint(ep_name, file_name, runtime_client):
    """Send each row of a CSV file to the endpoint, one request per second."""
    with open(file_name, 'r') as f:
        for row in f:
            payload = row.rstrip('\n')
            response = runtime_client.invoke_endpoint(EndpointName=ep_name,
                                                      ContentType='text/csv',
                                                      Body=payload)
            response['Body'].read()
            sleep(1)

def invoke_endpoint_forever():
    """Replay the test file in an endless loop to keep traffic flowing."""
    while True:
        invoke_endpoint(endpoint_name, 'data/test-dataset-input-cols.csv', runtime_client)

thread = Thread(target=invoke_endpoint_forever)
thread.start()
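
The thread above can only be ended by stopping the kernel because its loop has no exit condition. One possible alternative, sketched here with the standard library only, is to have the loop check a `threading.Event`. The function `send_one_request` is a hypothetical stand-in for the `invoke_endpoint` call so that the sketch runs without an endpoint, and the short intervals are chosen only to keep the demonstration fast.

```python
import threading
import time

stop_event = threading.Event()
sent = []  # records each simulated request


def send_one_request(payload):
    """Hypothetical stand-in for runtime_client.invoke_endpoint(...)."""
    sent.append(payload)


def send_traffic_until_stopped(rows, interval_seconds=0.01):
    """Cycle through rows until stop_event is set."""
    while not stop_event.is_set():
        for row in rows:
            if stop_event.is_set():
                break
            send_one_request(row)
            # Unlike sleep(), wait() returns as soon as the event is set,
            # so the thread reacts promptly to a stop request
            stop_event.wait(interval_seconds)


traffic_thread = threading.Thread(
    target=send_traffic_until_stopped,
    args=(["1,2,3", "4,5,6"],),
    daemon=True,
)
traffic_thread.start()

time.sleep(0.1)       # let a few simulated requests go out
stop_event.set()      # ask the loop to exit
traffic_thread.join(timeout=5)
print("thread alive:", traffic_thread.is_alive(), "| requests sent:", len(sent))
```

With this pattern, calling `stop_event.set()` from another cell would end the traffic without restarting the kernel.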

After you've set up the schedule, jobs start at specified intervals. Let's list the latest five executions. If you run this code after creating the hourly schedule, you might find the executions empty. You might have to wait until you cross the hour boundary (in UTC) to see executions start. The following code includes the logic for waiting.

In [None]:
import time

mon_executions = my_default_monitor.list_executions()
if len(mon_executions) == 0:
    print("We created an hourly schedule above and it will kick off executions ON the hour.\nWe will have to wait till we hit the hour...")

# Poll once a minute until the first execution appears
while len(mon_executions) == 0:
    print("Waiting for the 1st execution to happen...")
    time.sleep(60)
    mon_executions = my_default_monitor.list_executions()
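
Because the hourly schedule is tied to the hour boundary in UTC, you can estimate how long the wait will be. The sketch below uses only the standard library. The helper name `seconds_until_next_hour` is made up for illustration, and the result is only the time to the boundary, not a guarantee of when the execution actually starts.

```python
from datetime import datetime, timedelta, timezone


def seconds_until_next_hour(now=None):
    """Return the number of seconds from `now` (UTC) to the next full hour."""
    now = now or datetime.now(timezone.utc)
    next_hour = now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    return (next_hour - now).total_seconds()


# Fixed example timestamp: 15:54:57 UTC is 5 min 3 s before 16:00:00
example = datetime(2021, 3, 1, 15, 54, 57, tzinfo=timezone.utc)
print(seconds_until_next_hour(example))  # 303.0

# Time remaining from the current moment
print(seconds_until_next_hour())
```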

Evaluate the latest execution and list the generated reports.

In [None]:
latest_execution = mon_executions[-1]
latest_execution.wait()

In [None]:
print("Latest execution result: {}".format(latest_execution.describe()['ExitMessage']))
report_uri = latest_execution.output.destination

print("Found Report Files:")
S3Downloader.list(report_uri)

Any violations relative to the baseline are recorded in the constraint violations report. Let's list them.

In [None]:
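violations = my_default_monitor.latest_monitoring_constraint_violations()
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated since pandas 1.0
# Use a separate name so the constraints DataFrame created earlier is not overwritten
violations_df = pd.json_normalize(violations.body_dict["violations"])
violations_df.head(10)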
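
When a report contains many violations, a count per check type can be easier to scan than the raw table. Here is a minimal standard-library sketch. The `sample_violations` list is invented for illustration, and the keys `feature_name`, `constraint_check_type`, and `description` are assumptions about the report layout, so check them against your own `violations.body_dict` before relying on them.

```python
from collections import Counter

# Invented sample data; not produced by a real monitoring execution
sample_violations = [
    {"feature_name": "feature_a", "constraint_check_type": "data_type_check",
     "description": "illustrative description"},
    {"feature_name": "feature_b", "constraint_check_type": "data_type_check",
     "description": "illustrative description"},
    {"feature_name": "feature_c", "constraint_check_type": "completeness_check",
     "description": "illustrative description"},
]

# Count how many violations fall under each check type
counts = Counter(v["constraint_check_type"] for v in sample_violations)
for check_type, count in counts.most_common():
    print(f"{check_type}: {count}")
```

The same `Counter` expression should work on `violations.body_dict["violations"]` if the records carry a `constraint_check_type` key.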

You can plug the processing job ARN for a single monitoring execution into [this notebook](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker_model_monitor/visualization/SageMaker-Model-Monitor-Visualize.ipynb) to see more detailed visualizations of the violations and distribution statistics of the data capture that was processed in that execution.

In [None]:
latest_execution.describe()['ProcessingJobArn']

## (Optional) Clean-up

If you're done with this example, run the clean-up cell below. It removes the hosted endpoint that you created, which prevents charges for instances that would otherwise keep running, and it cleans up all artifacts related to the experiment. The S3 bucket and the logs are deleted manually, as described next.

Delete S3 bucket

Open the Amazon S3 [console](https://console.aws.amazon.com/s3/), and then delete the bucket that you created for storing model artifacts and datasets.

Delete logs

Open the Amazon CloudWatch [console](https://console.aws.amazon.com/cloudwatch/), and then delete all of the log groups that have names starting with `/aws/sagemaker/`.

In [None]:
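from time import sleep

try:
    sess.delete_monitoring_schedule(mon_schedule_name)
except Exception:
    # The schedule may already be gone; continue with the rest of the clean-up
    pass

# describe_monitoring_schedule raises once the schedule no longer exists
while True:
    try:
        print("Waiting for schedule to be deleted")
        sess.describe_monitoring_schedule(mon_schedule_name)
        sleep(15)
    except Exception:
        print("Schedule deleted")
        break

sess.delete_endpoint(xgb_predictor.endpoint)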

def cleanup(experiment):
    '''Clean up everything in the given experiment object'''
    for trial_summary in experiment.list_trials():
        trial = Trial.load(trial_name=trial_summary.trial_name)
        
        for trial_comp_summary in trial.list_trial_components():
            trial_step=TrialComponent.load(trial_component_name=trial_comp_summary.trial_component_name)
            print('Starting to delete TrialComponent..' + trial_step.trial_component_name)
            sm.disassociate_trial_component(TrialComponentName=trial_step.trial_component_name, TrialName=trial.trial_name)
            trial_step.delete()
            time.sleep(1)
         
        trial.delete()
    
    experiment.delete()

cleanup(customer_churn_experiment)
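
The polling loop in the cell above keeps waiting indefinitely if the schedule is never deleted. One way to bound the wait is sketched below with the standard library. The names `wait_until_gone` and `fake_describe` are hypothetical: `fake_describe` simulates a describe call that starts raising once the resource is gone, so the sketch runs without any AWS resources.

```python
import time


def wait_until_gone(describe, poll_seconds=15, timeout_seconds=600):
    """Call describe() until it raises. Return True if gone, False on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        try:
            describe()
        except Exception:
            # The describe call failing is treated as "resource no longer exists"
            return True
        time.sleep(poll_seconds)
    return False


calls = {"count": 0}


def fake_describe():
    """Simulated describe call: succeeds twice, then raises."""
    calls["count"] += 1
    if calls["count"] >= 3:
        raise LookupError("schedule not found")
    return {"status": "Deleting"}


gone = wait_until_gone(fake_describe, poll_seconds=0.01, timeout_seconds=5)
print("deleted:", gone, "| describe calls:", calls["count"])
```

In the notebook, the `describe` argument could be something like `lambda: sess.describe_monitoring_schedule(mon_schedule_name)`. Treating every exception as "deleted" mirrors the original loop, though a stricter version would check for the specific not-found error.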

Or you can list the items to delete manually.

In [None]:
import boto3
from smexperiments.experiment import Experiment

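# Named boto_session so that it does not shadow the SageMaker session `sess` used in the clean-up cell
boto_session = boto3.Session()
sm = boto_session.client('sagemaker')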

# List all experiments in the account
print(sm.list_experiments())
# List monitoring schedules
print(sm.list_monitoring_schedules())
# List endpoints
print(sm.list_endpoints())
# List endpoint configs
print(sm.list_endpoint_configs())
# List models
sm.list_models()

In [None]:
# Delete a monitoring schedule
name = 'demo-xgboost-customer-churn-model-schedule-2021-03-01-15-54-57'
sm.delete_monitoring_schedule(MonitoringScheduleName=name)

# Delete an endpoint
name = 'xgboost-churn-2021-03-01-13-57-30'
sm.delete_endpoint(EndpointName=name)

# Delete an endpoint config
name = 'xgboost-churn-2021-03-01-13-57-30'
sm.delete_endpoint_config(EndpointConfigName=name)

# Force-delete the experiment together with its associated trials and trial components
name = 'customer-churn-prediction-xgboost-2021-03-01-10-19-35'
experiment = Experiment.load(experiment_name=name, sagemaker_boto_client=sm)
experiment.delete_all(action='--force')

# Delete a model
name = 'sagemaker-xgboost-2021-03-01-13-52-39-970'
sm.delete_model(ModelName=name)

<a id="Example3"></a>
# Example 3: bring own trained model

<a id="Example4"></a>
# Example 4: bring own container