# SageMaker Pipelines to Train a BERT-Based Text Classifier

In this lab, we will do the following:
* Define a set of Workflow Parameters that can be used to parametrize a Workflow Pipeline
* Define a Processing step that performs cleaning and feature engineering, splitting the input data into train and test data sets
* Define a Training step that trains a model on the pre-processed train data set
* Define a Processing step that evaluates the trained model's performance on the test data set
* Define a Register Model step that creates a model package from the estimator and model artifacts used in training
* Define a Conditional step that measures a condition based on output from prior steps and conditionally executes the Register Model step
* Define and create a Pipeline as a Workflow DAG from the defined parameters and steps
* Start a Pipeline execution and wait for execution to complete

# Terminology

Amazon SageMaker Pipelines supports the following concepts and step types:

* Pipelines - A Directed Acyclic Graph of steps and conditions to orchestrate SageMaker jobs and resource creation.
* Processing Job steps - A simplified, managed experience on SageMaker to run data processing workloads, such as feature engineering, data validation, model evaluation, and model interpretation.
* Training Job steps - An iterative process that teaches a model to make predictions by presenting examples from a training dataset.
* Conditional step execution - Provides conditional execution of branches in a pipeline.
* Registering Models - Creates a model package resource in the Model Registry that can be used to create deployable models in Amazon SageMaker.
* Parametrized Pipeline executions - Allows pipeline executions to vary by supplied parameters.
* Transform Job steps - A batch transform that preprocesses datasets to remove noise or bias that interferes with training or inference, gets inferences from large datasets, and runs inference when you don't need a persistent endpoint.
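
To make the idea of parametrized executions concrete, here is a minimal plain-Python sketch of defaults that a single execution can override. It does not use the SageMaker SDK; the parameter names, the default values, and the `resolve_parameters` helper are all hypothetical and chosen only for illustration.

```python
# Hypothetical parameter names and defaults, for illustration only.
DEFAULT_PARAMETERS = {
    "InputDataUrl": "s3://example-bucket/reviews/",
    "TrainInstanceType": "ml.c5.9xlarge",
    "MinAccuracy": 0.10,
}


def resolve_parameters(defaults, overrides=None):
    """Return the parameters for one execution: defaults plus any overrides."""
    overrides = overrides or {}
    # Reject names the pipeline never declared, so typos fail early.
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"Unknown pipeline parameters: {sorted(unknown)}")
    # Build a new dict so the defaults themselves are left untouched.
    return {**defaults, **overrides}


# One execution raises the accuracy threshold and keeps everything else.
execution_parameters = resolve_parameters(DEFAULT_PARAMETERS, {"MinAccuracy": 0.25})
```

The same pipeline definition can then be started repeatedly with different values, which is the behavior the "Parametrized Pipeline executions" entry describes.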

# Our BERT Pipeline

In the Processing Step, we perform feature engineering to create BERT embeddings from the `review_body` text using the pre-trained BERT model, and split the dataset into train, validation, and test files. To optimize for TensorFlow training, we save the files in TFRecord format.
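
The splitting part of the Processing Step can be sketched in plain Python. This is a minimal sketch under stated assumptions: the 90/5/5 split fractions, the fixed seed, and the `split_dataset` helper are hypothetical, and the BERT tokenization and TFRecord writing are left out entirely.

```python
import random


def split_dataset(rows, train_frac=0.90, validation_frac=0.05, seed=42):
    """Shuffle rows and split them into train, validation, and test lists."""
    if not 0 < train_frac + validation_frac < 1:
        raise ValueError("train_frac + validation_frac must be between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_validation = int(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    # Whatever remains after train and validation becomes the test set.
    test = shuffled[n_train + n_validation:]
    return train, validation, test


# Synthetic stand-in rows using the column names from the text.
reviews = [
    {"review_body": f"review {i}", "star_rating": (i % 5) + 1} for i in range(100)
]
train, validation, test = split_dataset(reviews)
```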

In the Training Step, we fine-tune the BERT model on our Customer Reviews Dataset and add a new classification layer to predict the `star_rating` for a given `review_body`.

In the Evaluation Step, we take the trained model and a test dataset as input, and produce a JSON file containing classification evaluation metrics.
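
As a rough illustration of what such an evaluation report could look like, the sketch below computes accuracy from true and predicted star ratings and serializes it to JSON. The nested `metrics.accuracy.value` layout and the `evaluate` helper are assumptions made for this example, not something this notebook specifies, and the labels are made-up sample values.

```python
import json


def evaluate(y_true, y_pred):
    """Return a metrics report comparing true and predicted labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    accuracy = correct / len(y_true)
    # Assumed report layout: metrics -> accuracy -> value.
    return {"metrics": {"accuracy": {"value": accuracy}}}


# Made-up star ratings: four of the five predictions match.
report = evaluate([5, 4, 1, 3, 5], [5, 4, 2, 3, 5])
evaluation_json = json.dumps(report)
```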

In the Condition Step, we register the model only if its accuracy, as determined by the Evaluation Step, exceeds a specified threshold.
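
The decision itself reduces to reading the accuracy from the evaluation report and comparing it with a threshold. The following plain-Python sketch shows that logic only. The JSON layout, the `should_register` helper, and the threshold values are hypothetical, and a real pipeline would express this comparison through the SageMaker condition step rather than a function call.

```python
import json


def should_register(evaluation_json, min_accuracy):
    """Return True when the reported accuracy exceeds the threshold."""
    report = json.loads(evaluation_json)
    # Assumed report layout: metrics -> accuracy -> value.
    accuracy = report["metrics"]["accuracy"]["value"]
    return accuracy > min_accuracy


# Example report with a made-up accuracy value.
sample_report = '{"metrics": {"accuracy": {"value": 0.8}}}'
register_low_bar = should_register(sample_report, min_accuracy=0.5)
register_high_bar = should_register(sample_report, min_accuracy=0.9)
```

With the sample report above, the model would be registered under the 0.5 threshold and skipped under the 0.9 threshold.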

![BERT SageMaker pipeline](./img/bert_sagemaker_pipeline.png)

The pipeline that we create follows a typical Machine Learning Application pattern of pre-processing, training, evaluation, and model registration:

![A typical ML Application pipeline](img/pipeline-full.png)
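
The pattern above is a dependency graph, so the order in which steps can run follows from a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+) to derive that order. The step names and the exact dependency edges are assumptions for illustration, and the sketch shows ordering only: it does not model the fact that registration is skipped when the condition fails.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it needs.
# Step names are hypothetical labels for the stages described above.
dependencies = {
    "Processing": set(),
    "Train": {"Processing"},
    "Evaluation": {"Processing", "Train"},
    "Condition": {"Evaluation"},
    "RegisterModel": {"Condition"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(" -> ".join(order))
```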

# Release Resources

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>

<script>
try {
    var els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>

In [None]:
%%javascript

try {
    Jupyter.notebook.save_checkpoint();
    Jupyter.notebook.session.delete();
}
catch(err) {
    // NoOp
}