<p style="padding: 10px; border: 1px solid black;">
<img src="images/MLU-NEW-logo.png" alt="drawing" width="400"/> <br/>


# <a name="0">MLU LLM Workshop </a>
## <a name="0">Lab 1: Inference with Pretrained Models </a>

This notebook shows how to use pretrained large language models (LLMs) to process either a single piece of text or a custom dataset. With LLMs, we can perform a variety of natural language processing (NLP) tasks such as text classification, sentiment analysis, and text generation. LLMs are pretrained on enormous amounts of data, which makes them effective at capturing the nuances of language and generating coherent responses.


1. <a href="#1">Import libraries</a>
2. <a href="#2">Load an LLM</a>
3. <a href="#3">LLM inference with text generation</a>
4. <a href="#4">Batch inference with a customized dataset</a>
5. <a href="#5">Quizzes</a>

__Jupyter notebooks environment__:

* Jupyter notebooks allow creating and sharing documents that contain both code and rich text cells. If you are not familiar with Jupyter notebooks, read more [here](https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/what_is_jupyter.html). 
* This is a quick-start demo to bring you up to speed on coding and experimenting with machine learning. Move through the notebook __from top to bottom__. 
* Run each code cell to see its output. To run a cell, click within the cell and press __Shift+Enter__, or click __Run__ from the top of the page menu. 
* A `[*]` symbol next to the cell indicates the code is still running. A `[#]` symbol, where # is an integer, indicates it is finished.
* Beware, __some code cells might take longer to run__, sometimes 5-10 minutes (depending on the task, installing packages and libraries, training models, etc.)
    
    
Please work through this notebook from top to bottom and don't skip sections. Later cells depend on code from earlier cells, so skipping ahead can lead to error messages.

---

You will be presented with two kinds of exercises throughout the notebook: activities and challenges. <br/>

| <img style="float: center;" src="./images/activity.png" alt="Activity" width="125"/>| <img style="float: center;" src="./images/challenge.png" alt="Challenge" width="125"/>|
| --- | --- |
|<p style="text-align:center;">No coding is needed for an activity. You try to understand a concept, <br/>answer questions, or run a code cell.</p> |<p style="text-align:center;">Challenges are where you test your understanding by taking a short quiz.</p> |

----    


Let's start by loading some libraries and packages!

---

### <a name="1">Import libraries</a>
(<a href="#0">Go to top</a>)


First, let's install and import the necessary libraries, including the Hugging Face Transformers library and the PyTorch library, which is a dependency for Transformers.


In [1]:
%%capture
!pip3 install -r requirements.txt --quiet

In [2]:
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer


Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /home/ec2-user/anaconda3/envs/pytorch_p310/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...


---

### <a name="2">Load an LLM</a>
(<a href="#0">Go to top</a>)

---

The LLMs we are going to use come from the [Dolly](https://github.com/databrickslabs/dolly) family, a set of instruction-following LLMs released by Databricks under a license that permits commercial use. The models come in three sizes: 3 billion (`3b`), 7 billion (`7b`), and 12 billion (`12b`) parameters. They are derived from EleutherAI’s Pythia models and fine-tuned on a [~15K record instruction corpus](https://huggingface.co/datasets/databricks/databricks-dolly-15k) generated by Databricks employees and released under a permissive license (CC-BY-SA).

First, let's pick a model to load. We use `dolly-v2-3b`, the smallest model in the family, from the Hugging Face Hub. Running it involves two components: a tokenizer, which converts raw text into tokens, and a base model, which generates text from a given prompt. Both are loaded for us by the inference pipeline.
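
To make the tokenizer's role concrete, here is a minimal sketch of the round trip from text to token IDs and back. It is a toy word-level tokenizer written for illustration only: the names `build_vocab`, `encode`, and `decode` are hypothetical and are not part of the Transformers library, and Dolly's real tokenizer works on subword units with a much larger vocabulary.

```python
# Toy word-level tokenizer (illustration only, not Dolly's actual tokenizer).

def build_vocab(texts):
    """Assign an integer ID to every distinct lowercase word, in order of appearance."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Convert raw text into a list of token IDs."""
    return [vocab[word] for word in text.lower().split()]


def decode(ids, vocab):
    """Convert token IDs back into text."""
    id_to_word = {idx: word for word, idx in vocab.items()}
    return " ".join(id_to_word[idx] for idx in ids)


question = "What is Amazon SageMaker?"
vocab = build_vocab([question])
token_ids = encode(question, vocab)

print(token_ids)                 # [0, 1, 2, 3]
print(decode(token_ids, vocab))  # what is amazon sagemaker?
```

The model itself only ever sees the integer IDs. The tokenizer is what maps between those IDs and human-readable text on the way in and on the way out.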

The following code initializes an inference pipeline using the `pipeline()` function from the Transformers library. The pipeline is created with the purpose of text generation with the following parameters:
- `model="databricks/dolly-v2-3b"` specifies the name of the pre-trained model to be used.
- `device_map="auto"` specifies the device where the model will be loaded. Setting it to "auto" allows the library to automatically select the appropriate device (e.g., CPU or GPU) based on availability.
- `torch_dtype=torch.float16` specifies the data type for the model's weights and activations. In this case, the model will use half-precision floating-point numbers (`torch.float16`) to save memory and potentially speed up computation. Note that not all models support this data type.
- `trust_remote_code=True` determines whether to trust remote code from the model's repository. By setting it to True, the pipeline will allow remote code execution, such as custom methods or functions provided by the model's repository. 

In [3]:
instruct_pipeline = pipeline(model="databricks/dolly-v2-3b", 
                             device_map="auto",
                             torch_dtype=torch.float16, 
                             trust_remote_code=True,
                             )
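
To see why `torch_dtype=torch.float16` saves memory, the sketch below estimates the size of the model weights at two precisions. The parameter count is an assumption (roughly 2.8 billion for the `3b` model), the helper `weight_memory_gib` is hypothetical, and the estimate covers the weights only, not activations or other runtime overhead.

```python
# Back-of-the-envelope estimate of weight memory at different precisions.

def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


approx_params = 2_800_000_000  # assumption: approximate size of the 3b model
bytes_per_dtype = {"float32": 4, "float16": 2}

estimates = {
    dtype: weight_memory_gib(approx_params, nbytes)
    for dtype, nbytes in bytes_per_dtype.items()
}

for dtype, gib in estimates.items():
    print(f"{dtype}: ~{gib:.1f} GiB")
```

Under these assumptions, half precision needs about half the memory of single precision for the weights, which can decide whether the model fits on a single GPU.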


---

### <a name="3">LLM inference with text generation</a>
(<a href="#0">Go to top</a>)

---

Let's test our `instruct_pipeline` with its text generation functionalities!

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h2><i>Try it Yourself!</i></h2>
    <br>
    <p style="text-align:center;margin:auto;"><img src="./images/activity.png" alt="Activity" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">Try different prompts and observe the responses generated by the model.</p>
    <p style=" text-align: center; margin: auto;"><b>Note: Results may not be factually accurate and may be based on false assumptions.</b></p>
    <br>
</div>


In [4]:
prompt = "What solutions come pre-built with Amazon SageMaker JumpStart?"
output1 = instruct_pipeline(prompt)
output1

[{'generated_text': 'Here is a list of solutions that come pre-built with Amazon SageMaker JumpStart:\n1. MLFlow\n2. Gluon\n3. Dali\n4. ML Flow Model Builisher\n5. MLFlow Model Builisher'}]

The raw output above is hard to read. Let's display it in an easy-to-read Markdown format.

In [5]:
from IPython.display import Markdown
Markdown(output1[0]['generated_text'])

Here is a list of solutions that come pre-built with Amazon SageMaker JumpStart:
1. MLFlow
2. Gluon
3. Dali
4. ML Flow Model Builisher
5. MLFlow Model Builisher


---


### <a name="4">Batch inference with a customized dataset</a>
(<a href="#0">Go to top</a>)

---



In [6]:
import pandas as pd
test = pd.read_csv("data/amazon_sagemaker_faqs.csv")
with pd.option_context('display.max_colwidth', None):
    display(test.head())

Unnamed: 0,instruction,response
0,What is Amazon SageMaker?,"Amazon SageMaker is a fully managed service to prepare data and build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows."
1,In which Regions is Amazon SageMaker available?\r\n,"For a list of the supported Amazon SageMaker AWS Regions, please visit the AWS Regional Services page. Also, for more information, see Regional endpoints in the AWS general reference guide."
2,What is the service availability of Amazon SageMaker?\r\n,"Amazon SageMaker is designed for high availability. There are no maintenance windows or scheduled downtimes. SageMaker APIs run in Amazon’s proven, high-availability data centers, with service stack replication configured across three facilities in each AWS Region to provide fault tolerance in the event of a server failure or Availability Zone outage."
3,How does Amazon SageMaker secure my code?,"Amazon SageMaker stores code in ML storage volumes, secured by security groups and optionally encrypted at rest."
4,What security measures does Amazon SageMaker have?,"Amazon SageMaker ensures that ML model artifacts and other system artifacts are encrypted in transit and at rest. Requests to the SageMaker API and console are made over a secure (SSL) connection. You pass AWS Identity and Access Management roles to SageMaker to provide permissions to access resources on your behalf for training and deployment. You can use encrypted Amazon Simple Storage Service (Amazon S3) buckets for model artifacts and data, as well as pass an AWS Key Management Service (KMS) key to SageMaker notebooks, training jobs, and endpoints, to encrypt the attached ML storage volume. Amazon SageMaker also supports Amazon Virtual Private Cloud (VPC) and AWS PrivateLink support."
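
Some of the questions above end with a stray `\r\n`. Trailing whitespace like this becomes part of the prompt and may influence the generated text, so it can be worth stripping before inference. The sketch below shows the idea on a plain Python list; the sample strings are copied from the table above, and the variable names are illustrative.

```python
# Strip leading/trailing whitespace (including "\r\n") from each question.
raw_questions = [
    "What is Amazon SageMaker?",
    "In which Regions is Amazon SageMaker available?\r\n",
]

clean_questions = [question.strip() for question in raw_questions]

for question in clean_questions:
    print(repr(question))
```

On the DataFrame itself, the equivalent would be `test['instruction'] = test['instruction'].str.strip()`.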


Let's run batch inference on the question dataset. Running this cell might take up to ~15 minutes, depending on the number of samples.

In [7]:
%%time

# Set number of samples for inference
num_samples = 10

# Randomly sample the questions, then run inference on exactly those rows
test = test.sample(num_samples)
predictions = instruct_pipeline(list(test['instruction']))

CPU times: user 30.9 s, sys: 0 ns, total: 30.9 s
Wall time: 30.9 s


Add the predictions to the original DataFrame as a new column.

In [8]:
predictions_clean = [x[0]['generated_text'] for x in predictions]
test['LLM prediction'] = predictions_clean

with pd.option_context('display.max_colwidth', None):
    display(test.head())

Unnamed: 0,instruction,response,LLM prediction
70,What is Amazon SageMaker Studio Lab?,"Amazon SageMaker Studio Lab is a free ML development environment that provides the compute, storage (up to 15 GB), and security—all at no cost—for anyone to learn and experiment with ML. All you need to get started is a valid email ID; you don’t need to configure infrastructure or manage identity and access or even sign up for an AWS account. SageMaker Studio Lab accelerates model building through GitHub integration, and it comes preconfigured with the most popular ML tools, frameworks, and libraries to get you started immediately. SageMaker Studio Lab automatically saves your work so you don’t need to restart between sessions. It’s as easy as closing your laptop and coming back later.","Amazon SageMaker Studio Lab makes it easy to develop and deploy machine learning (ML) applications on Amazon SageMaker. With Amazon SageMaker Studio Lab, you can easily create an Amazon SageMaker model with pre-trained ML models, train and validate your models, and deploy your models to Amazon SageMaker through your AWS account."
122,What type of endpoints does SageMaker Inference Recommender support?,Currently we support only real-time endpoints.,SageMaker Inference Recommender supports the endpoint for retrieving a set of machine-learned recommendations from SageMaker Sentiment Inference Service.
50,How do I maintain consistency between online and offline features?,Amazon SageMaker Feature Store automatically maintains consistency between online and offline features without additional management or code. SageMaker Feature Store is fully managed and maintains consistency across training and inference environments.,"One of the key challenges when building a distributed system is maintaining consistency between online and offline features. One common approach to handling this challenge is to handle offline data in a read-only state, and provide the read-write APIs to allow for modifications."
118,What is Amazon SageMaker Inference Recommender?,"Amazon SageMaker Inference Recommender is a new capability of Amazon SageMaker that reduces the time required to get ML models in production by automating performance benchmarking and tuning model performance across SageMaker ML instances. You can now use SageMaker Inference Recommender to deploy your model to an endpoint that delivers the best performance and minimizes cost. You can get started with SageMaker Inference Recommender in minutes while selecting an instance type and get recommendations for optimal endpoint configurations within hours, eliminating weeks of manual testing and tuning time. With SageMaker Inference Recommender, you pay only for the SageMaker ML instances used during load testing, and there are no additional charges.","The Inference Recommender enables users to perform data science with their favorite Hadoop frameworks, without having to rely on complex or specialized systems. It uses the Deep Learning Libraries for computer vision and machine learning tasks, which are optimized for speed and memory consumption and provide an efficient and scalable solution for large-scale recommender systems."
100,"Can I optimize multiple objectives simultaneously, such as optimizing a model to be both fast and accurate?\r\n","Not at this time. Currently, you need to specify a single objective metric to optimize or change your algorithm code to emit a new metric, which is a weighted average between two or more useful metrics, and have the tuning process optimize towards that objective metric.","No, you can optimize one objective at a time, and each objective needs to be defined well. If you are optimizing a model to be fast, and the objective is based on loss, and the loss is not going to be accurate, the model will be slow even if it has the correct output. On the other hand, if the loss is based on the reference values, the model might give the wrong output for correct values, but it might be fast. So, if both the loss and accuracy are important for you, it is better to optimize one at a time and do a proper data tuning."
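
The merge step relies on the shape of the pipeline's batch output: one list per input prompt, each holding a dictionary with a `generated_text` key. The sketch below reproduces that extraction on mocked data, so it runs without loading a model. The mocked answers and variable names are placeholders, not real model output.

```python
# Mocked batch output with the same nesting as the pipeline's result:
# one inner list per prompt, each containing a dict with "generated_text".
mock_predictions = [
    [{"generated_text": "Placeholder answer to question 1"}],
    [{"generated_text": "Placeholder answer to question 2"}],
]
questions = ["Question 1", "Question 2"]

# Take the first (here, only) candidate for each prompt.
texts = [candidates[0]["generated_text"] for candidates in mock_predictions]

# One prediction per question is required before adding a DataFrame column.
assert len(texts) == len(questions)

for question, answer in zip(questions, texts):
    print(f"{question} -> {answer}")
```

Checking that the number of predictions matches the number of rows is a cheap safeguard, since a length mismatch would otherwise only surface as an error when the column is assigned.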


### <a name="5">Quizzes</a>
(<a href="#0">Go to top</a>)

Well done on completing the lab! Now, it's time for a brief knowledge assessment.

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h2><i>Try it Yourself!</i></h2>
    <br>
    <p style="text-align:center;margin:auto;"><img src="./images/challenge.png" alt="Challenge" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">Answer the following questions to test your understanding of using pre-trained LLMs for inference.</p>
    <br>
</div>


In [9]:
from mlu_utils.quiz_questions import *
lab1_question1

In [10]:
lab1_question2

Let's restart the kernel to release CPU and GPU memory for the next lab. To do so, uncomment and run the cell below.

**You might see a pop up indicating the kernel has been restarted.**

In [11]:
# import IPython
# IPython.Application.instance().kernel.do_shutdown(True)

<p style="padding: 10px; border: 1px solid black;">
<img src="images/MLU-NEW-logo.png" alt="drawing" width="400"/> <br/>

# Thank you!