First, check that the correct kernel is chosen.

<img src="img/kernel_set_up.png" width="300"/>

You can click on that to see and check the details of the image, kernel, and instance type.

<img src="img/w3_kernel_and_instance_type.png" width="600"/>

In [None]:
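import psutil

# Memory statistics for the instance backing this notebook
notebook_memory = psutil.virtual_memory()
print(notebook_memory)

# An m5.2xlarge has 32 GiB of RAM; less than ~32 GB means a smaller instance
MIN_MEMORY_BYTES = 32 * 1000 * 1000 * 1000

# Always define the flag so later cells can read it in either case
correct_instance_type = notebook_memory.total >= MIN_MEMORY_BYTES

if not correct_instance_type:
    print('*******************************************')
    print('YOU ARE NOT USING THE CORRECT INSTANCE TYPE')
    print('PLEASE CHANGE INSTANCE TYPE TO  m5.2xlarge ')
    print('*******************************************')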

# NOTE:  YOU CANNOT CONTINUE UNTIL THE KERNEL IS STARTED
# ### PLEASE WAIT UNTIL THE KERNEL IS STARTED BEFORE CONTINUING!!! ###

# Use `Shift+Enter` to Run Each Cell

Use `Shift+Enter` on the cell below to see the output.

# Click `Kernel` => `Restart Kernel and Run All Cells` to Run All Cells
![](img/restart-kernel-and-run-all-cells.png)

# Workshop Intro

This workshop is based on our O'Reilly book, *Data Science on AWS*.

[![Data Science on AWS](img/book_full_color_sm.png)](https://www.amazon.com/Data-Science-AWS-End-End/dp/1492079391/)

YouTube Videos, Meetups, Book, and Code Here:  **https://datascienceonaws.com**

# Workshop Description
In this hands-on workshop, we will build an end-to-end AI/ML pipeline for natural language processing with Amazon SageMaker. We will train and tune a generative language model based on the pre-trained FLAN-T5 model.

In this workshop, we will use the [Amazon Customer Reviews Dataset](https://s3.amazonaws.com/dsoaws/amazon-reviews-pds/readme.html) for the data processing labs because it contains a very large corpus of ~150 million customer reviews. This is useful for showcasing SageMaker's distributed processing abilities, which extend to many other large datasets. After the data processing sections, we will build our FLAN-T5-based NLP model using the [dialogsum](https://huggingface.co/datasets/knkarthick/dialogsum) dataset from Hugging Face, which contains ~15k example dialogues with associated summaries.

# Learning Objectives
Attendees will learn how to do the following:
* Ingest data into S3 using Amazon Athena and the Parquet data format
* Visualize data with pandas and matplotlib in SageMaker notebooks
* Perform feature engineering on a raw dataset using Scikit-Learn and SageMaker Processing Jobs
* Store and share features using SageMaker Feature Store
* Train and evaluate a custom generative AI model using PyTorch, HuggingFace, and SageMaker Training Jobs
* Evaluate the model using Scikit-Learn and SageMaker Processing Jobs
* Track model artifacts using Amazon SageMaker ML Lineage Tracking
* Register and version models using SageMaker Model Registry
* Deploy a model to an HTTPS Inference Endpoint using SageMaker Endpoints
* Automate ML workflow steps by building end-to-end model pipelines using SageMaker Pipelines

# Follow Us On Twitter

In [None]:
%%html

<a href="https://twitter.com/cfregly" class="twitter-follow-button" data-size="large" data-lang="en" data-show-count="false">Follow @cfregly</a>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

## _Click This Button ^^ Above ^^_

In [None]:
%%html

<a href="https://twitter.com/anbarth" class="twitter-follow-button" data-size="large" data-lang="en" data-show-count="false">Follow @anbarth</a>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

## _Click This Button ^^ Above ^^_

# Star Our GitHub Repo

In [None]:
%%html

<a class="github-button" href="https://github.com/data-science-on-aws/workshop" data-color-scheme="no-preference: light; light: light; dark: dark;" data-icon="octicon-star" data-size="large" data-show-count="true" aria-label="Star data-science-on-aws/workshop on GitHub">Star</a>
<script async defer src="https://buttons.github.io/buttons.js"></script>

## _Click This Button ^^ Above ^^_

# Visit our Website

In [None]:
%%html

<iframe src="https://datascienceonaws.com" width="800px" height="600px"></iframe>

# Use Case: Fine-Tune a Foundation LLM for Dialogue Summarization

We have chosen the [dialogsum](https://huggingface.co/datasets/knkarthick/dialogsum) dataset from Hugging Face as the main dataset for our model building labs. Here is an example of an input dialogue and the output summary that the model is fine-tuned to produce.

**INPUT DIALOGUE:**
```
#Person1#: Here we come.
#Person2#: Thank you. What's the fare?
#Person1#: $ 10.
#Person2#: How can it be?
#Person1#: Well, the rate is two dollars for the first two kilometers and twenty cents for each additional two hundred meters.
#Person2#: I see. Thanks for your drive.
```

**OUTPUT SUMMARY:**
```
#Person1# tells #Person2# the fare of taking a taxi.
```
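
To make the input/output pairing concrete, here is a minimal sketch of how one dialogue and its summary could be turned into a prompt/target pair for fine-tuning. The prompt template and the `build_example` helper are assumptions for illustration only; the template used in the labs may differ.

```python
# Hypothetical prompt template; the wording used in the workshop labs may differ.
PROMPT_TEMPLATE = "Summarize the following conversation.\n\n{dialogue}\n\nSummary: "


def build_example(dialogue: str, summary: str) -> dict:
    """Return one (prompt, target) pair for sequence-to-sequence fine-tuning."""
    return {
        "prompt": PROMPT_TEMPLATE.format(dialogue=dialogue.strip()),
        "target": summary.strip(),
    }


dialogue = """
#Person1#: Here we come.
#Person2#: Thank you. What's the fare?
#Person1#: $ 10.
#Person2#: How can it be?
#Person1#: Well, the rate is two dollars for the first two kilometers and twenty cents for each additional two hundred meters.
#Person2#: I see. Thanks for your drive.
"""
summary = "#Person1# tells #Person2# the fare of taking a taxi."

example = build_example(dialogue, summary)
print(example["prompt"])
print(example["target"])
```

During fine-tuning, the model would receive `prompt` as input and learn to generate `target`.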

# SageMaker Pipeline with Generative AI

![](./img/generative_sagemaker_pipeline.png)

In the Processing Step, we perform Feature Engineering to create generative embeddings from the `review_body` text using the pre-trained generative model, and split the dataset into train, validation and test files.
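
The splitting part of this step can be sketched in plain Python as follows. This is only an illustration: the `split_dataset` helper, the 90/5/5 split fractions, and the sample records are assumptions, not the values used by the workshop's processing script.

```python
import random


def split_dataset(records, train_frac=0.9, validation_frac=0.05, seed=42):
    """Shuffle records and split them into train, validation, and test lists."""
    shuffled = list(records)
    # A fixed seed keeps the split reproducible across runs
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_validation = int(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    # Whatever remains after train and validation becomes the test set
    test = shuffled[n_train + n_validation:]
    return train, validation, test


# Hypothetical stand-in records with a `review_body` field
records = [{"review_body": f"review {i}"} for i in range(100)]
train, validation, test = split_dataset(records)
print(len(train), len(validation), len(test))
```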

In the Training Step, we fine-tune the generative model on the dialogue and summary pairs from the dialogsum dataset.

In the Evaluation Step, we take the trained model and a test dataset as input, and produce a JSON file containing evaluation metrics.
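
As a rough sketch of what such an evaluation script might do, the example below scores predicted summaries against reference summaries and writes the result to a JSON file. The `unigram_recall` metric, the `evaluation.json` file name, and the JSON layout are assumptions chosen for illustration; the workshop's evaluation script may compute different metrics.

```python
import json
import tempfile
from pathlib import Path


def unigram_recall(prediction: str, reference: str) -> float:
    """Fraction of reference words that also appear in the prediction."""
    reference_words = reference.lower().split()
    prediction_words = set(prediction.lower().split())
    if not reference_words:
        return 0.0
    hits = sum(word in prediction_words for word in reference_words)
    return hits / len(reference_words)


def write_evaluation_report(predictions, references, output_dir) -> Path:
    """Average the per-example scores and write them to a JSON report."""
    scores = [unigram_recall(p, r) for p, r in zip(predictions, references)]
    mean_score = sum(scores) / len(scores) if scores else 0.0
    report = {"metrics": {"unigram_recall": {"value": mean_score}}}
    output_path = Path(output_dir) / "evaluation.json"
    output_path.write_text(json.dumps(report, indent=2))
    return output_path


predictions = ["#Person1# tells #Person2# the taxi fare."]
references = ["#Person1# tells #Person2# the fare of taking a taxi."]
report_path = write_evaluation_report(predictions, references, tempfile.mkdtemp())
print(report_path.read_text())
```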

In the Condition Step, we register the model only if its metrics, as computed in the Evaluation Step, exceed a threshold.
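
The decision logic can be sketched in plain Python as below. The metric name, the JSON layout, and the threshold value are hypothetical; in the actual pipeline this comparison is expressed as a pipeline step rather than a hand-written function.

```python
import json

# Hypothetical threshold; pick a value that suits your metric
MIN_METRIC_VALUE = 0.3


def should_register(evaluation_json: str, metric_name: str, threshold: float) -> bool:
    """Return True when the named metric in the report exceeds the threshold."""
    report = json.loads(evaluation_json)
    value = report["metrics"][metric_name]["value"]
    return value > threshold


# Hypothetical evaluation reports, using an assumed metric name
good_report = json.dumps({"metrics": {"unigram_recall": {"value": 0.44}}})
poor_report = json.dumps({"metrics": {"unigram_recall": {"value": 0.10}}})

print(should_register(good_report, "unigram_recall", MIN_METRIC_VALUE))
print(should_register(poor_report, "unigram_recall", MIN_METRIC_VALUE))
```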


# Release Resources

In [None]:
%%html

<p><b>Shutting down your kernel for this notebook to release resources.</b></p>
<button class="sm-command-button" data-commandlinker-command="kernelmenu:shutdown" style="display:none;">Shutdown Kernel</button>

<script>
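try {
    // Click the hidden button to trigger the kernel shutdown command
    const els = document.getElementsByClassName("sm-command-button");
    els[0].click();
}
catch (err) {
    // NoOp: ignore errors if the button cannot be clicked
}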
</script>