# Start the `Data Science` Kernel in Amazon SageMaker Studio environments 

The kernel powers all of our notebook interactions.

## Confirm the Kernel is Started in Upper Right
![](img/confirm_kernel_started.png)

In [None]:
import psutil

notebook_memory = psutil.virtual_memory()
print(notebook_memory)

# An m5.2xlarge instance provides 32 GiB of memory, so less than 32 GB total means a smaller instance.
correct_instance_type = notebook_memory.total >= 32 * 1000 * 1000 * 1000

if not correct_instance_type:
    print('*******************************************')
    print('YOU ARE NOT USING THE CORRECT INSTANCE TYPE')
    print('PLEASE CHANGE INSTANCE TYPE TO  m5.2xlarge ')
    print('*******************************************')

# NOTE:  YOU CANNOT CONTINUE UNTIL THE KERNEL IS STARTED
# ### PLEASE WAIT UNTIL THE KERNEL IS STARTED BEFORE CONTINUING!!! ###

# Use `Shift+Enter` to Run Each Cell

Use `Shift+Enter` on the cell below to see the output.

# Click `Kernel` => `Restart Kernel and Run All Cells` to Run All Cells
![](img/restart-kernel-and-run-all-cells.png)

# Workshop Intro

This workshop is based on our O'Reilly book, *Data Science on AWS*.

[![Data Science on AWS](img/book_full_color_sm.png)](https://www.amazon.com/Data-Science-AWS-End-End/dp/1492079391/)

YouTube Videos, Meetups, Book, and Code Here:  **https://datascienceonaws.com**

# Workshop Description
In this hands-on workshop, we will build an end-to-end AI/ML pipeline for natural language processing with Amazon SageMaker. We will train and tune a generative language model built on the state-of-the-art GPT3 model.

To build our GPT3-based NLP model, we use the [Amazon Customer Reviews Dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html), which contains 150+ million customer reviews from Amazon.com for the 20-year period between 1995 and 2015. In particular, we train a review generator using the `review_body` column (free-form review text).

# Learning Objectives
Attendees will learn how to do the following:
* Ingest data into S3 using Amazon Athena and the Parquet data format
* Visualize data with pandas and matplotlib on SageMaker notebooks
* Perform feature engineering on a raw dataset using Scikit-Learn and SageMaker Processing Jobs
* Store and share features using SageMaker Feature Store
* Train and evaluate a custom GPT3 model using PyTorch and SageMaker Training Jobs
* Evaluate the model using Scikit-Learn and SageMaker Processing Jobs
* Track model artifacts using Amazon SageMaker ML Lineage Tracking
* Register and version models using SageMaker Model Registry
* Deploy a model to a REST Inference Endpoint using SageMaker Endpoints
* Automate ML workflow steps by building end-to-end model pipelines using SageMaker Pipelines
* Perform hyper-parameter tuning to find the best model configuration for your dataset

# Amazon AI and Machine Learning Stack
<img src="img/aws_ml_stack.png" width="90%" align="left">

# Follow Us On Twitter

In [None]:
%%html

<a href="https://twitter.com/cfregly" class="twitter-follow-button" data-size="large" data-lang="en" data-show-count="false">Follow @cfregly</a>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

## _Click This Button ^^ Above ^^_

In [None]:
%%html

<a href="https://twitter.com/anbarth" class="twitter-follow-button" data-size="large" data-lang="en" data-show-count="false">Follow @anbarth</a>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

## _Click This Button ^^ Above ^^_

# Star Our GitHub Repo

In [None]:
%%html

<a class="github-button" href="https://github.com/data-science-on-aws/workshop" data-color-scheme="no-preference: light; light: light; dark: dark;" data-icon="octicon-star" data-size="large" data-show-count="true" aria-label="Star data-science-on-aws/workshop on GitHub">Star</a>
<script async defer src="https://buttons.github.io/buttons.js"></script>

## _Click This Button ^^ Above ^^_

# Visit our Website

In [None]:
%%html

<iframe src="https://datascienceonaws.com" width="800px" height="600px"></iframe>

# Use Case: Fine-Tune a GPT3 Model and Create a Review Generator

We have chosen the [Amazon Customer Reviews Dataset](https://s3.amazonaws.com/amazon-reviews-pds/readme.html) as our main dataset. Each review contains the following fields:

- `marketplace`: 2-letter country code (in this case all "US").
- `customer_id`: Random identifier that can be used to aggregate reviews written by a single author.
- `review_id`: A unique ID for the review.
- `product_id`: The Amazon Standard Identification Number (ASIN).  `http://www.amazon.com/dp/<ASIN>` links to the product's detail page.
- `product_parent`: The parent of that ASIN.  Multiple ASINs (color or format variations of the same product) can roll up into a single parent.
- `product_title`: Title description of the product.
- `product_category`: Broad product category that can be used to group reviews (in this case digital videos).
- `star_rating`: The review's rating (1 to 5 stars).
- `helpful_votes`: Number of helpful votes for the review.
- `total_votes`: Number of total votes the review received.
- `vine`: Was the review written as part of the [Vine](https://www.amazon.com/gp/vine/help) program?
- `verified_purchase`: Was the review from a verified purchase?
- `review_headline`: The title of the review itself.
- `review_body`: The text of the review.
- `review_date`: The date the review was written.
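The schema above can be illustrated with a minimal sketch that parses one review using only the Python standard library. It assumes the data is tab-separated with a header row; the sample row, its values, and the `parse_reviews` helper are hypothetical and invented purely for illustration.

```python
import csv
import io

# Column names taken from the field list above.
COLUMNS = [
    "marketplace", "customer_id", "review_id", "product_id", "product_parent",
    "product_title", "product_category", "star_rating", "helpful_votes",
    "total_votes", "vine", "verified_purchase", "review_headline",
    "review_body", "review_date",
]

# Hypothetical sample row, invented for illustration only (not real data).
SAMPLE_ROW = [
    "US", "12345", "R0000000000001", "B000000001", "987654321",
    "Example Movie", "Digital_Video_Download", "5", "3",
    "4", "N", "Y", "Loved it",
    "Great picture quality and a fun story.", "2015-08-31",
]


def parse_reviews(tsv_text):
    """Parse tab-separated review text into a list of dicts with typed numeric fields."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    reviews = []
    for row in reader:
        # Convert the numeric columns from strings to integers.
        for numeric_column in ("star_rating", "helpful_votes", "total_votes"):
            row[numeric_column] = int(row[numeric_column])
        reviews.append(row)
    return reviews


sample_tsv = "\t".join(COLUMNS) + "\n" + "\t".join(SAMPLE_ROW) + "\n"
reviews = parse_reviews(sample_tsv)
print(reviews[0]["star_rating"], reviews[0]["review_body"])
```

In this workshop only `review_body` feeds the review generator; the other columns are useful for filtering and exploration.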


GPT3 is built on a neural network architecture called the Transformer, which relies on an attention mechanism. This is, not coincidentally, the name of the popular Python library, “Transformers,” maintained by a company called [Hugging Face](https://github.com/huggingface/transformers).

# SageMaker Pipeline with a Generative Model

![](./img/generative_sagemaker_pipeline.png)

In the Processing Step, we perform Feature Engineering to create generative embeddings from the `review_body` text using the pre-trained generative model, and split the dataset into train, validation and test files.
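The dataset split in the Processing Step can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the 90/5/5 split fractions, the fixed seed, and the `split_dataset` helper are hypothetical and are not taken from the workshop's processing script.

```python
import random


def split_dataset(records, train_frac=0.9, validation_frac=0.05, seed=42):
    """Shuffle records and split them into train, validation, and test lists.

    The fractions are assumptions for illustration; the remainder after the
    train and validation slices becomes the test set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split

    n_train = round(len(shuffled) * train_frac)
    n_validation = round(len(shuffled) * validation_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test


# Hypothetical review texts standing in for the `review_body` column.
review_bodies = [f"hypothetical review text {i}" for i in range(100)]
train, validation, test = split_dataset(review_bodies)
print(len(train), len(validation), len(test))
```

Shuffling before slicing keeps the three files representative of the whole dataset, and the seed makes the split repeatable across pipeline runs.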

In the Training Step, we fine-tune the generative model using the `review_body` column from the Amazon Customer Reviews Dataset.

In the Evaluation Step, we take the trained model and a test dataset as input, and produce a JSON file containing evaluation metrics.
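As a rough sketch of what the Evaluation Step produces, the snippet below computes two metrics and writes them to a JSON file. The metric names (`eval_loss`, `perplexity`), the JSON layout, the `evaluation.json` file name, and the per-token loss values are all assumptions made for illustration; perplexity is computed as the exponential of the mean cross-entropy loss.

```python
import json
import math
import os
import tempfile


def build_evaluation_report(token_losses):
    """Build a metrics dict from per-token cross-entropy losses."""
    mean_loss = sum(token_losses) / len(token_losses)
    return {
        "metrics": {
            "eval_loss": {"value": mean_loss},
            # Perplexity is the exponential of the mean cross-entropy loss.
            "perplexity": {"value": math.exp(mean_loss)},
        }
    }


def write_evaluation_report(report, output_dir):
    """Write the report as JSON (hypothetical file name) and return its path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "evaluation.json")
    with open(path, "w") as f:
        json.dump(report, f, indent=2)
    return path


# Hypothetical per-token losses from scoring the test dataset.
token_losses = [2.0, 3.0, 2.5, 2.5]
report = build_evaluation_report(token_losses)
report_path = write_evaluation_report(report, tempfile.mkdtemp())

with open(report_path) as f:
    print(f.read())
```

Writing the metrics to a file, rather than only logging them, lets a later pipeline step read the values back and act on them.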

In the Condition Step, we register the model only if its metrics, as determined by the Evaluation Step, exceed a given threshold.
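The decision made in the Condition Step can be sketched as a simple threshold check in plain Python. This does not use the SageMaker SDK; the `should_register` helper, the `eval_accuracy` metric name, the report layout, and the threshold values are hypothetical.

```python
def should_register(report, metric_name, threshold):
    """Return True when the named metric meets or exceeds the threshold."""
    value = report["metrics"][metric_name]["value"]
    return value >= threshold


# Hypothetical evaluation output, shaped like a JSON metrics file.
evaluation_report = {"metrics": {"eval_accuracy": {"value": 0.93}}}

if should_register(evaluation_report, "eval_accuracy", threshold=0.90):
    print("Metric meets the threshold: register the model.")
else:
    print("Metric is below the threshold: skip registration.")
```

For metrics where lower is better, such as a loss, the comparison would need to be reversed.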


# Release Resources

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>

<script>
try {
    const els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch(err) {
    // NoOp
}    
</script>