# Welcome!
This workshop is based on our upcoming O'Reilly Book, [**Data Science on AWS**](https://www.amazon.com/dp/1492079391/) due in 2021.

[![](img/book_full_color_sm.png)](https://www.amazon.com/dp/1492079391/)

<img src="img/aws-ai-ml-stack.png" width="90%" align="left">

# Start the "Data Science" Kernel
The kernel powers all of our notebook interactions.

# Click on "No Kernel" in the Upper Right
![](img/select_kernel.png)

# Select the `Data Science` Kernel
![](img/select_data_science_kernel.png)

# Confirm the Kernel is Started in Upper Right
![](img/confirm_kernel_started.png)

# NOTE: YOU CANNOT CONTINUE UNTIL THE KERNEL IS STARTED
# PLEASE WAIT UNTIL THE KERNEL IS STARTED BEFORE CONTINUING!

# Use `Shift+Enter` to run each cell of every notebook

Use `Shift+Enter` on the cell below to see the output.

# Follow Us On Twitter

In [None]:
%%html

<a href="https://twitter.com/cfregly" class="twitter-follow-button" data-size="large" data-lang="en" data-show-count="false">Follow @cfregly</a>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

## _Click This Button ^^ Above ^^_

In [None]:
%%html

<a href="https://twitter.com/anbarth" class="twitter-follow-button" data-size="large" data-lang="en" data-show-count="false">Follow @anbarth</a>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

## _Click This Button ^^ Above ^^_

# Star Our GitHub Repo

In [None]:
%%html

<a class="github-button" href="https://github.com/data-science-on-aws/workshop" data-color-scheme="no-preference: light; light: light; dark: dark;" data-icon="octicon-star" data-size="large" data-show-count="true" aria-label="Star data-science-on-aws/workshop on GitHub">Star</a>
<script async defer src="https://buttons.github.io/buttons.js"></script>

## _Click This Button ^^ Above ^^_

# Visit our Website

In [None]:
%%html

<iframe src="https://datascienceonaws.com" width="800px" height="600px"></iframe>

# Orchestrating Jobs, Model Registration, Continuous Deployment, and Lineage Tracking with Amazon SageMaker

Amazon SageMaker gives Machine Learning application developers and Machine Learning operations engineers the ability to orchestrate SageMaker jobs and author reproducible Machine Learning pipelines, deploy custom-built models for real-time, low-latency inference or for offline inference with Batch Transform, and track the lineage of artifacts. You can institute sound operational practices for deploying and monitoring production workflows, deploying model artifacts, and tracking artifact lineage through a simple interface, adhering to safety and best-practice paradigms for Machine Learning application development.

The SageMaker Workflow service supports a SageMaker Machine Learning Pipeline Domain Specific Language (DSL), which is a declarative JSON specification. This DSL defines a Directed Acyclic Graph (DAG) of pipeline parameters and SageMaker job steps. The SageMaker Python Software Development Kit (SDK) streamlines the generation of the pipeline DSL using constructs that are already familiar to engineers and scientists alike.
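
To make the idea of a declarative DAG concrete, here is a minimal sketch that uses only the Python standard library (Python 3.9+ for `graphlib`). The field names (`Parameters`, `Steps`, `DependsOn`) and the step names are assumptions made for illustration. They do not reproduce the actual SageMaker pipeline schema, and no SageMaker API is called.

```python
import json
from graphlib import TopologicalSorter

# Hypothetical, simplified pipeline definition. The field names are
# illustrative only and do not follow the real SageMaker pipeline schema.
pipeline_definition = {
    "Parameters": [{"Name": "InputDataUrl", "Type": "String"}],
    "Steps": [
        {"Name": "Processing", "DependsOn": []},
        {"Name": "Train", "DependsOn": ["Processing"]},
        {"Name": "Evaluate", "DependsOn": ["Train", "Processing"]},
        {"Name": "Condition", "DependsOn": ["Evaluate"]},
    ],
}


def execution_order(definition):
    """Return step names in an order that respects every dependency."""
    graph = {step["Name"]: set(step["DependsOn"]) for step in definition["Steps"]}
    # TopologicalSorter raises CycleError if the graph is not acyclic.
    return list(TopologicalSorter(graph).static_order())


# Round-trip through JSON to show that the definition is plain declarative data.
as_json = json.dumps(pipeline_definition, indent=2)
order = execution_order(json.loads(as_json))
print(order)
```

Because the definition is only data, it can be stored, diffed, and validated before anything runs, which is the main appeal of a declarative specification.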

The SageMaker Model Registry is where trained models are stored, versioned, and managed. Data Scientists and Machine Learning Engineers can compare model versions, approve models for deployment, and deploy models from different AWS accounts, all from a single Model Registry. SageMaker helps customers follow MLOps best practices from the start. Customers can stand up a full end-to-end MLOps system with a single API call.

And the SageMaker Lineage service makes it easy to track all the artifacts created in a SageMaker Machine Learning Pipeline from start to finish.

# Use Case: Fine-Tune a BERT Model and Create a Text Classifier

We have chosen the [Amazon Customer Reviews Dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html) as our main dataset. Each review has the following columns:

- `marketplace`: 2-letter country code (in this case all "US").
- `customer_id`: Random identifier that can be used to aggregate reviews written by a single author.
- `review_id`: A unique ID for the review.
- `product_id`: The Amazon Standard Identification Number (ASIN).  `http://www.amazon.com/dp/<ASIN>` links to the product's detail page.
- `product_parent`: The parent of that ASIN.  Multiple ASINs (color or format variations of the same product) can roll up into a single parent.
- `product_title`: Title description of the product.
- `product_category`: Broad product category that can be used to group reviews (in this case digital videos).
- `star_rating`: The review's rating (1 to 5 stars).
- `helpful_votes`: Number of helpful votes for the review.
- `total_votes`: Number of total votes the review received.
- `vine`: Was the review written as part of the [Vine](https://www.amazon.com/gp/vine/help) program?
- `verified_purchase`: Was the review from a verified purchase?
- `review_headline`: The title of the review itself.
- `review_body`: The text of the review.
- `review_date`: The date the review was written.
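
As a rough sketch of how these columns might be read, the example below parses one tab-separated row with Python's `csv` module. It assumes a tab-separated file with a header row. The sample row is invented for illustration and is not taken from the dataset, and `read_reviews` is a hypothetical helper name.

```python
import csv
import io

COLUMNS = [
    "marketplace", "customer_id", "review_id", "product_id", "product_parent",
    "product_title", "product_category", "star_rating", "helpful_votes",
    "total_votes", "vine", "verified_purchase", "review_headline",
    "review_body", "review_date",
]

# Invented sample data: a header line plus a single made-up review.
sample_row = [
    "US", "12345", "R0000000000000", "B000000000", "111111111",
    "Example Title", "Digital_Video_Download", "5", "3",
    "4", "N", "Y", "Great", "Really enjoyed it.", "2015-08-31",
]
sample_tsv = "\t".join(COLUMNS) + "\n" + "\t".join(sample_row) + "\n"


def read_reviews(handle):
    """Yield one dict per review, converting the numeric columns to int."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        for name in ("star_rating", "helpful_votes", "total_votes"):
            row[name] = int(row[name])
        yield row


reviews = list(read_reviews(io.StringIO(sample_tsv)))
print(reviews[0]["star_rating"], reviews[0]["review_body"])
```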

![BERT Training](img/bert_training.png)

BERT is built on the Transformer architecture, which relies on an attention mechanism. This is, not coincidentally, the name of the popular BERT Python library, “Transformers,” maintained by a company called [Hugging Face](https://github.com/huggingface/transformers). We will use a variant of BERT called [DistilBERT](https://arxiv.org/pdf/1910.01108.pdf), which requires less memory and compute but maintains very good accuracy on our dataset.

## SageMaker Pipelines

Amazon SageMaker Pipelines supports the following:

* Pipelines - A Directed Acyclic Graph of steps and conditions to orchestrate SageMaker jobs and resource creation.
* Processing Job steps - A simplified, managed experience on SageMaker to run data processing workloads, such as feature engineering, data validation, model evaluation, and model interpretation.
* Training Job steps - An iterative process that teaches a model to make predictions by presenting examples from a training dataset.
* Conditional step execution - Provides conditional execution of branches in a pipeline.
* Registering Models - Creates a model package resource in the Model Registry that can be used to create deployable models in Amazon SageMaker.
* Parametrized Pipeline executions - Allows pipeline executions to vary by supplied parameters.
* Transform Job steps - A batch transform that preprocesses datasets to remove noise or bias that interferes with training or inference, gets inferences from large datasets, or runs inference when you don't need a persistent endpoint.


## Our BERT Pipeline

In the Processing Step, we perform Feature Engineering to create BERT embeddings from the `review_body` text using the pre-trained BERT model, and split the dataset into train, validation, and test files. To optimize for TensorFlow training, we save the files in TFRecord format.
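
The splitting part of this step can be sketched with the standard library alone. The 90/5/5 proportions, the seed, and the `split_dataset` name are assumptions chosen for illustration. The sketch does not create BERT embeddings or write TFRecord files.

```python
import random


def split_dataset(rows, train_frac=0.9, validation_frac=0.05, seed=42):
    """Shuffle rows reproducibly and split them into train, validation, and test."""
    rows = list(rows)
    # A dedicated Random instance keeps the shuffle reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train_frac)
    n_validation = round(len(rows) * validation_frac)
    train = rows[:n_train]
    validation = rows[n_train:n_train + n_validation]
    test = rows[n_train + n_validation:]  # whatever remains
    return train, validation, test


train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Fixing the seed means repeated runs produce the same three subsets, which keeps later evaluation results comparable.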

In the Training Step, we fine-tune the BERT model to our Customer Reviews Dataset and add a new classification layer to predict the `star_rating` for a given `review_body`.

In the Evaluation Step, we take the trained model and a test dataset as input, and produce a JSON file containing classification evaluation metrics.
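
A minimal sketch of such a report is shown below. It computes plain accuracy from true and predicted star ratings and serializes the result as JSON. The JSON layout (`metrics` → `accuracy` → `value`) and the `evaluate` name are assumptions for illustration, not a format required by SageMaker.

```python
import json


def evaluate(y_true, y_pred):
    """Return a small metrics report for a classifier's predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    # Hypothetical report layout, chosen only for this example.
    return {"metrics": {"accuracy": {"value": correct / len(y_true)}}}


# Made-up star ratings: three of the four predictions are correct.
report = evaluate([5, 4, 1, 3], [5, 4, 2, 3])
report_json = json.dumps(report)
print(report_json)
```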

In the Condition Step, we decide whether to register this model based on whether its accuracy, as determined by our Evaluation Step, exceeds some value.
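
The decision itself is a simple threshold check, sketched here in plain Python. The JSON layout, the `should_register` name, and the 0.70 threshold are all assumptions for illustration. In the pipeline, the equivalent check is expressed as a condition step rather than as local code.

```python
import json


def should_register(evaluation_json, min_accuracy):
    """Return True when the reported accuracy strictly exceeds min_accuracy."""
    report = json.loads(evaluation_json)
    # Hypothetical report layout: {"metrics": {"accuracy": {"value": <float>}}}
    accuracy = report["metrics"]["accuracy"]["value"]
    return accuracy > min_accuracy


example_report = '{"metrics": {"accuracy": {"value": 0.75}}}'
decision = should_register(example_report, min_accuracy=0.70)
print("register model" if decision else "skip registration")
```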



![](./img/bert_sagemaker_pipeline.png)

## SageMaker Model Registry

Amazon SageMaker Model Registry supports the following:

* Catalog models after the training step - data scientists run tens to thousands of experiments and may select a small set of models as candidates for production.
* Manage model versions - data scientists can register new models which will be automatically versioned in the model registry.
* Compare models - data scientists can run model evaluation steps in the pipeline and generate model metrics (e.g. accuracy metrics and bias metrics) which are recorded in the Model Registry and can be used to compare model versions.
* Approve models - data scientists can mark model versions as “approved” or “rejected”. Alternatively, the pipeline can automate model approvals. If a deployment pipeline is associated with a model and a live endpoint, the approved model version is propagated to production.
* Deploy models in different AWS accounts - models in the Model Registry support resource sharing across accounts which enables models built in data scientist accounts to be deployed in different pre-production and production accounts.
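
The versioning and approval ideas above can be pictured with a small in-memory model. This is a conceptual sketch only: `ModelVersion`, `ModelPackageGroup`, the status strings, and the `s3://example-bucket/...` URIs are hypothetical names used for illustration, and nothing here calls the SageMaker Model Registry.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    version: int
    artifact_uri: str
    metrics: dict
    status: str = "PendingManualApproval"  # illustrative status value


@dataclass
class ModelPackageGroup:
    name: str
    versions: list = field(default_factory=list)

    def register(self, artifact_uri, metrics):
        """Add a model and assign the next version number automatically."""
        model_version = ModelVersion(len(self.versions) + 1, artifact_uri, metrics)
        self.versions.append(model_version)
        return model_version

    def approve(self, version):
        """Mark the given version as approved."""
        for model_version in self.versions:
            if model_version.version == version:
                model_version.status = "Approved"
                return model_version
        raise KeyError(f"unknown version {version}")

    def latest_approved(self):
        """Return the most recent approved version, or None if there is none."""
        approved = [v for v in self.versions if v.status == "Approved"]
        return approved[-1] if approved else None


group = ModelPackageGroup(name="bert-reviews")
v1 = group.register("s3://example-bucket/model-v1.tar.gz", {"accuracy": 0.71})
v2 = group.register("s3://example-bucket/model-v2.tar.gz", {"accuracy": 0.75})
group.approve(v2.version)
best = group.latest_approved()
print(best.version, best.status)
```

A deployment process would then only ever look at `latest_approved()`, so unapproved versions never reach an endpoint.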

## SageMaker Lineage

Amazon SageMaker Lineage supports the following:

* Automatically track all the artifacts created in a machine learning workflow from start to finish, modeled as a directed graph.
* Explore the lineage artifacts with easy-to-use SDK methods.
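
To illustrate what a directed lineage graph makes possible, the sketch below walks backwards from a model to everything that contributed to it. The entity names and the `upstream` helper are invented for this example. The SageMaker SDK exposes its own lineage classes, which are not used here.

```python
from collections import deque

# Each edge reads "source contributed to destination". All names are made up.
edges = [
    ("raw-reviews", "processing-job"),
    ("processing-job", "train-data"),
    ("processing-job", "test-data"),
    ("train-data", "training-job"),
    ("training-job", "model"),
    ("test-data", "evaluation-job"),
    ("model", "evaluation-job"),
    ("evaluation-job", "evaluation-report"),
]


def upstream(entity, edges):
    """Return every entity that directly or indirectly contributed to `entity`."""
    parents = {}
    for source, destination in edges:
        parents.setdefault(destination, []).append(source)
    seen = set()
    queue = deque([entity])
    # Breadth-first walk against the direction of the edges.
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


model_lineage = upstream("model", edges)
print(sorted(model_lineage))
```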


## Notebooks

These notebooks show how to:

### SageMaker Workflows

* Define a set of Workflow Parameters that can be used to parametrize a Workflow Pipeline
* Define a Processing step that performs cleaning and feature engineering, splitting the input data into train and test data sets
* Define a Training step that trains a model on the pre-processed train data set
* Define a Processing step that evaluates the trained model's performance on the test data set
* Define a Register Model step that creates a model package from the estimator and model artifacts used in training
* Define a Conditional step that measures a condition based on output from prior steps and conditionally executes the Register Model step
* Define and create a Pipeline as a Workflow DAG from the defined parameters and steps
* Start a Pipeline execution and wait for execution to complete

### SageMaker Model Registry

* Create a SageMaker Project based on the Model Package Group name from the pipeline execution defined before


### SageMaker Lineage

* Provide the inputs and outputs of SageMaker job artifacts

# SageMaker Studio Extensions

SageMaker Studio provides a rich set of features to visually inspect SageMaker resources including experiments, training jobs, and pipelines.

![](img/sm_studio_extensions.png)

# Release Resources

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>

<script>
try {
    // Click the hidden button to trigger the "kernelmenu:shutdown" command.
    const els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}
</script>