<p style="padding: 10px; border: 1px solid black;">
<img src="images/MLU-NEW-logo.png" alt="drawing" width="400"/> <br/>


# <a name="0">MLU LLM Workshop </a>
## <a name="0">Lab 1: Inference with Pretrained Models </a>

This notebook shows how to use pretrained large language models (LLMs) to process either a single text or a custom dataset. With LLMs, we can perform a variety of natural language processing (NLP) tasks such as text classification, sentiment analysis, and text generation. LLMs are pretrained on enormous amounts of text, which makes them effective at capturing the nuances of language and generating coherent responses.


1. <a href="#1">Import libraries</a>
2. <a href="#2">Load an LLM</a>
3. <a href="#3">LLM inference with text generation</a>
4. <a href="#4">Batch inference with a customized dataset</a>
5. <a href="#5">Quizzes</a>

__Jupyter notebooks environment__:

* Jupyter notebooks allow creating and sharing documents that contain both code and rich text cells. If you are not familiar with Jupyter notebooks, read more [here](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html). 
* This is a quick-start demo to bring you up to speed on coding and experimenting with machine learning. Move through the notebook __from top to bottom__. 
* Run each code cell to see its output. To run a cell, click within the cell and press __Shift+Enter__, or click __Run__ from the top of the page menu. 
* A `[*]` symbol next to the cell indicates the code is still running. A `[#]` symbol, where # is an integer, indicates it is finished.
* Note that __some code cells take longer to run__, sometimes 5-10 minutes, depending on the task (installing packages and libraries, training models, etc.).
    
    
Please work through this notebook from top to bottom and don't skip sections. Later cells depend on code from earlier ones, so skipping can cause errors.

---

You will be presented with two kinds of exercises throughout the notebook: activities and challenges. <br/>

| <img style="float: center;" src="./images/activity.png" alt="Activity" width="125"/>| <img style="float: center;" src="./images/challenge.png" alt="Challenge" width="125"/>|
| --- | --- |
|<p style="text-align:center;">No coding is needed for an activity. You try to understand a concept, <br/>answer questions, or run a code cell.</p> |<p style="text-align:center;">Challenges are where you test your understanding by taking a short quiz.</p> |

----    


Let's start by loading some libraries and packages!

---

### <a name="1">Import libraries</a>
(<a href="#0">Go to top</a>)


First, let's install and import the necessary libraries, including the Hugging Face Transformers library and the PyTorch library, which is a dependency for Transformers.


In [1]:
%%capture
!pip3 install -r requirements.txt --quiet

In [2]:
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

---

### <a name="2">Load an LLM</a>
(<a href="#0">Go to top</a>)

---

The LLMs we are going to use come from the [Dolly](https://github.com/databrickslabs/dolly) family, a set of instruction-following LLMs open-sourced by Databricks and licensed for commercial use. The models come in three sizes: 3 billion (`3b`), 7 billion (`7b`), and 12 billion (`12b`) parameters. The Dolly models are derived from EleutherAI’s Pythia model family (the largest, `12b`, is based on Pythia-12b) and fine-tuned on a [~15K record instruction corpus](https://huggingface.co/datasets/databricks/databricks-dolly-15k) generated by Databricks employees and released under a permissive license (CC-BY-SA).

First, let's pick a model to load. In this lab we use `dolly-v2-3b`, the smallest model in the family, downloaded from the Hugging Face Hub. Loading it involves two components: a tokenizer, which converts raw text into tokens, and a base model, which generates text from a given prompt. The `pipeline()` function from the Transformers library loads both and connects them for us.

The following code initializes a text generation pipeline with `pipeline()`, using these parameters:
- `model="databricks/dolly-v2-3b"` specifies the name of the pre-trained model to be used.
- `device_map="auto"` specifies the device where the model will be loaded. Setting it to "auto" allows the library to automatically select the appropriate device (e.g., CPU or GPU) based on availability.
- `torch_dtype=torch.float16` specifies the data type for the model's weights and activations. In this case, the model will use half-precision floating-point numbers (`torch.float16`) to save memory and potentially speed up computation. Note that not all models support this data type.
- `trust_remote_code=True` determines whether to trust custom code shipped in the model's repository. Setting it to True lets the pipeline download and run Python code from that repository (here, the `instruct_pipeline.py` file that appears in the download log), so enable it only for repositories you trust.

In [3]:
instruct_pipeline = pipeline(model="databricks/dolly-v2-3b", 
                             device_map="auto",
                             torch_dtype=torch.float16, 
                             trust_remote_code=True, 
                             )


Downloading (…)lve/main/config.json:   0%|          | 0.00/819 [00:00<?, ?B/s]

Downloading (…)instruct_pipeline.py:   0%|          | 0.00/9.16k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/5.68G [00:00<?, ?B/s]

The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.


Downloading (…)okenizer_config.json:   0%|          | 0.00/450 [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/228 [00:00<?, ?B/s]
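
As a rough sanity check on the `torch_dtype=torch.float16` choice, the sketch below estimates the memory taken by the weights alone at two precisions. The parameter count of about 2.8 billion is an assumption (the `3b` in the model name is a rounded figure), and `weight_memory_gb` is a made-up helper name for illustration. Activations and other runtime overhead are ignored.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Assumption: roughly 2.8 billion parameters for the `3b` model.
num_params = 2.8e9

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(num_params, dtype):.1f} GB")
```

Under that assumption, the half-precision estimate lands close to the 5.68G `pytorch_model.bin` download shown above, and is half of what single precision would need.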

---

### <a name="3">LLM inference with text generation</a>
(<a href="#0">Go to top</a>)

---

Let's test our `instruct_pipeline` with its text generation functionalities!

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h2><i>Try it Yourself!</i></h2>
    <br>
    <p style="text-align:center;margin:auto;"><img src="./images/activity.png" alt="Activity" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">Try different prompts and observe the responses generated by the model.</p>
    <p style=" text-align: center; margin: auto;"><b>Note: Results may not be factually accurate and may be based on false assumptions.</b></p>
    <br>
</div>


In [4]:
prompt = "What solutions come pre-built with Amazon SageMaker JumpStart?"
output1 = instruct_pipeline(prompt)
output1

[{'generated_text': 'Amazon SageMaker JumpStart includes code, documentation, and a pre-built docker image that you can use for training your machine learning models on your existing data sources.'}]

The raw output above is hard to read. Let's display it in an easy-to-read Markdown format.

In [5]:
from IPython.display import Markdown
Markdown(output1[0]['generated_text'])

Amazon SageMaker JumpStart includes code, documentation, and a pre-built docker image that you can use for training your machine learning models on your existing data sources.
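
The pipeline returns nested Python structures rather than plain strings: a list of dictionaries for a single prompt, and a list of such lists when it is given several prompts (the shape that the batch inference cells in this lab index with `x[0]['generated_text']`). A minimal sketch of a helper that handles both shapes follows. `extract_texts` is a hypothetical name, and the sample data is shortened from outputs shown in this notebook rather than produced by a live model call.

```python
def extract_texts(output):
    """Return the generated strings from a text-generation pipeline result.

    Accepts either a list of dicts (single prompt) or a list of
    lists of dicts (batch of prompts), keeping the first candidate each.
    """
    if not output:
        return []
    # Single prompt: [{'generated_text': ...}]
    if isinstance(output[0], dict):
        return [output[0]["generated_text"]]
    # Batch: [[{'generated_text': ...}], [{'generated_text': ...}], ...]
    return [candidates[0]["generated_text"] for candidates in output]

# Sample data shortened from this notebook's outputs (no model call).
single = [{"generated_text": "Amazon SageMaker JumpStart includes code and documentation."}]
batch = [
    [{"generated_text": "Yes, R is supported with Amazon SageMaker."}],
    [{"generated_text": "Amazon SageMaker is available in all regions where AWS is available."}],
]

print(extract_texts(single))
print(extract_texts(batch))
```

Returning a list in both cases keeps downstream code uniform, whether it displays one answer or fills a dataframe column.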


---


### <a name="4">Batch inference with a customized dataset</a>
(<a href="#0">Go to top</a>)

---



In [6]:
import pandas as pd
test = pd.read_csv("data/amazon_sagemaker_faqs.csv")
with pd.option_context('display.max_colwidth', None):
    display(test.head())

Unnamed: 0,instruction,response
0,What is Amazon SageMaker?,"Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows."
1,In which Regions is Amazon SageMaker available?\r\n,"For a list of the supported Amazon SageMaker AWS Regions, please visit the AWS Regional Services page. Also, for more information, see Regional endpoints in the AWS general reference guide."
2,What is the service availability of Amazon SageMaker?\r\n,"Amazon SageMaker is designed for high availability. There are no maintenance windows or scheduled downtimes. SageMaker APIs run in Amazon’s proven, high-availability data centers, with service stack replication configured across three facilities in each AWS Region to provide fault tolerance in the event of a server failure or Availability Zone outage."
3,How does Amazon SageMaker secure my code?,"Amazon SageMaker stores code in ML storage volumes, secured by security groups and optionally encrypted at rest."
4,What security measures does Amazon SageMaker have?,"Amazon SageMaker ensures that ML model artifacts and other system artifacts are encrypted in transit and at rest. Requests to the SageMaker API and console are made over a secure (SSL) connection. You pass AWS Identity and Access Management roles to SageMaker to provide permissions to access resources on your behalf for training and deployment. You can use encrypted Amazon Simple Storage Service (Amazon S3) buckets for model artifacts and data, as well as pass an AWS Key Management Service (KMS) key to SageMaker notebooks, training jobs, and endpoints, to encrypt the attached ML storage volume. Amazon SageMaker also supports Amazon Virtual Private Cloud (VPC) and AWS PrivateLink support."
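
Some questions in the preview above end with a stray `\r\n`. Whether this affects Dolly's answers is not tested here, but trimming such whitespace before inference is a cheap precaution. The sketch below uses plain Python on two questions copied from the table. `clean_instruction` is a hypothetical helper. On the dataframe, the same idea could be applied with `test['instruction'].str.strip()`.

```python
def clean_instruction(text: str) -> str:
    """Trim leading and trailing whitespace, including carriage returns and newlines."""
    return text.strip()

# Two questions copied from the dataset preview; the second has a trailing "\r\n".
questions = [
    "What is Amazon SageMaker?",
    "In which Regions is Amazon SageMaker available?\r\n",
]

cleaned = [clean_instruction(q) for q in questions]
print(cleaned)
```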


Let's run batch inference on the question dataset. Running this cell might take up to ~15 minutes, depending on the number of samples.

In [7]:
%%time

# Set number of samples for inference
num_samples = 10

# Randomly sample rows, then run the pipeline on every sampled question
test = test.sample(num_samples)
predictions = instruct_pipeline(list(test['instruction']))

CPU times: user 32 s, sys: 0 ns, total: 32 s
Wall time: 32 s


Add the predictions to the sampled dataframe as a new column.

In [8]:
predictions_clean = [x[0]['generated_text'] for x in predictions]
test['prediction'] = predictions_clean

with pd.option_context('display.max_colwidth', None):
    display(test.head())

Unnamed: 0,instruction,response,prediction
9,Is R supported with Amazon SageMaker?,"Yes, R is supported with Amazon SageMaker. You can use R within SageMaker notebook instances, which include a preinstalled R kernel and the reticulate library. Reticulate offers an R interface for the Amazon SageMaker Python SDK, enabling ML practitioners to build, train, tune, and deploy R models. \r\n","Yes, R is supported with Amazon SageMaker."
1,In which Regions is Amazon SageMaker available?\r\n,"For a list of the supported Amazon SageMaker AWS Regions, please visit the AWS Regional Services page. Also, for more information, see Regional endpoints in the AWS general reference guide.","Amazon SageMaker is available in all regions where AWS is available. If you are using the AWS VPC endpoint, we also support Amazon SageMaker on a private subnet."
12,How does Amazon SageMaker Clarify improve model explainability?,Amazon SageMaker Clarify is integrated with Amazon SageMaker Experiments to provide a feature importance graph detailing the importance of each input for your model’s overall decision-making process after the model has been trained. These details can help determine if a particular model input has more influence than it should on overall model behavior. SageMaker Clarify also makes explanations for individual predictions available via an API.,"When models trained on data stored in a DBMS, such as Postgres, are run in an analytical processing (ALP) shop (Amazon T, Google Analytics, etc.) model explainability is often one of the top concerns. SageMaker Clarify provides a way for DBMS data to be exposed as friendly, pre-formulated outputs that follow industry standards for business impact dashboards, predictive performance dashboards, and even automated reporting on arbitrary buckets."
115,Why should I use Amazon SageMaker Serverless Inference?,"Amazon SageMaker Serverless Inference simplifies the developer experience by eliminating the need to provision capacity up front and manage scaling policies. SageMaker Serverless Inference can scale instantly from tens to thousands of inferences within seconds based on the usage patterns, making it ideal for ML applications with intermittent or unpredictable traffic. For example, a chatbot service used by a payroll processing company experiences an increase in inquiries at the end of the month while for rest of the month traffic is intermittent. Provisioning instances for the entire month in such scenarios is not cost-effective, as you end up paying for idle periods. SageMaker Serverless Inference helps address these types of use cases by providing you automatic and fast scaling out of the box without the need for you to forecast traffic up front or manage scaling policies. Additionally, you pay only for the compute time to run your inference code (billed in milliseconds) and for data processing, making it a cost-effective option for workloads with intermittent traffic.","As a fully serverlessML stack, SageMaker Serverless Inference can be used to build ML workflows on any cloud that supports serverless. The fully serverless ML stack delivers truly pay as you go ML capabilities on top of existing data lakes, big data sources, and data warehouses without the need for long-running containers or virtual machines."
137,How do I deploy models to the edge devices?,Amazon SageMaker Edge Manager stores the model package in your specified Amazon S3 bucket. You can use the over-the-air (OTA) deployment feature provided by AWS IoT Greengrass or any other deployment mechanism of your choice to deploy the model package from your S3 bucket to the devices.,"In a ""Hybrid Edge-Cloud"" deployment, the model is hosted on the cloud and the model/edge is responsible for delivering the trained model for consumption on the edge devices. In order to deploy models to edge devices, model providers typically provide a tool to pre-compile the model on cloud and push it to the edge device. Alternatively, you can also write a ""Man in the middle"" node on the edge that will push the model to the edge devices for them to consume."


### <a name="5">Quizzes</a>
(<a href="#0">Go to top</a>)

Well done on completing the lab! Now, it's time for a brief knowledge assessment.

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h2><i>Try it Yourself!</i></h2>
    <br>
    <p style="text-align:center;margin:auto;"><img src="./images/challenge.png" alt="Challenge" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">Answer the following questions to test your understanding of using pre-trained LLMs for inference.</p>
    <br>
</div>


In [9]:
from mlu_utils.quiz_questions import *
lab1_question1

In [10]:
lab1_question2

Let's restart the kernel to release CPU and GPU memory for the next lab. To do so, uncomment the two lines in the cell below and run it.

**You might see a pop up indicating the kernel has been restarted.**

In [11]:
# import IPython
# IPython.Application.instance().kernel.do_shutdown(True)

<p style="padding: 10px; border: 1px solid black;">
<img src="images/MLU-NEW-logo.png" alt="drawing" width="400"/> <br/>

# Thank you!