<p style="padding: 10px; border: 1px solid black;">
<img src="images/MLU-NEW-logo.png" alt="drawing" width="400"/> <br/>


# <a name="0">MLU LLM Workshop </a>
## <a name="0">Lab 4: Deploy a Pretrained Model </a>
    
In this notebook, we will deploy the large language model (LLM) that we finetuned in the previous notebook. Specifically, we will use [DJL (Deep Java Library)](https://djl.ai/), a machine learning library that provides a seamless interface for developing and deploying deep learning models, with support for Python and other programming languages. DJL is designed to be framework-agnostic: it provides a unified interface and abstractions for working with various deep learning frameworks, such as TensorFlow and PyTorch, so developers can switch between frameworks without extensive code changes. DJL also simplifies deploying trained models into production environments. It integrates with popular deployment platforms such as Amazon SageMaker, making it easier to serve models at scale.
    

In this notebook, we will cover the following key aspects:

1. <a href="#1">Install and import libraries</a>
2. <a href="#2">Overview of deployment parameters</a>
3. <a href="#3">Instantiate SageMaker parameters</a>
4. <a href="#4">Create the model artifact</a>
5. <a href="#5">Produce the inference script</a>
6. <a href="#6">Upload the model artifact to S3</a>
7. <a href="#7">Deploy the finetuned LLM</a>
8. <a href="#8">Inference via the deployed LLM</a>
9. <a href="#9">Cleaning up everything</a>
10. <a href="#10">Quizzes</a>

Please work through this notebook from top to bottom and don't skip sections, as this could lead to error messages caused by missing code.

---

You will be presented with two kinds of exercises throughout the notebook: activities and challenges. <br/>

| <img style="float: center;" src="./images/activity.png" alt="Activity" width="125"/>| <img style="float: center;" src="./images/challenge.png" alt="Challenge" width="125"/>|
| --- | --- |
|<p style="text-align:center;">No coding is needed for an activity. You try to understand a concept, <br/>answer questions, or run a code cell.</p> |<p style="text-align:center;">Challenges are where you test your understanding by taking a short quiz.</p> |

---

### <a name="1">Install and Import libraries</a>
(<a href="#0">Go to top</a>)

First, let's install and import the necessary libraries for deployment, including the `sagemaker` library and `Boto3`, the AWS SDK for Python. If you haven't installed the packages yet, run the cell below to install them.

In [1]:
%%capture
!pip3 install -U sagemaker --quiet

In [2]:
import boto3
import json
import sagemaker.djl_inference
from sagemaker.session import Session
from sagemaker import image_uris
from sagemaker import Model

---
### <a name="2">Overview of deployment parameters</a>
(<a href="#0">Go to top</a>)

To deploy with DJL using the SageMaker Python SDK, we need to instantiate the `Model` class with the following parameters:
```{python}
model = Model(
    image_uri,
    model_data=...,
    predictor_cls=...,
    role=aws_role
)
```
- `image_uri`: The Docker image URI representing the deep learning framework and version to be used.
- `model_data`: The location of the finetuned LLM model artifact in an S3 bucket. It specifies the path to the `.tar.gz` file containing the model's parameters, architecture, and any other necessary artifacts.
- `predictor_cls`: The predictor class used to call the endpoint. Here it is a simple "JSON in, JSON out" predictor with no DJL-specific logic. For more details, see [sagemaker.djl_inference.DJLPredictor](https://sagemaker.readthedocs.io/en/stable/frameworks/djl/sagemaker.djl_inference.html#djlpredictor).
- `role`: The AWS Identity and Access Management (IAM) role ARN that provides necessary permissions to access resources like the S3 bucket containing the model data.

---
### <a name="3">Instantiate SageMaker parameters</a>
(<a href="#0">Go to top</a>)

Let's initialize a SageMaker session and retrieve information related to the AWS environment such as SageMaker role and AWS region. We also specify the image URI for a specific version of the "djl-deepspeed" framework using the SageMaker session's region. The image URI is a unique identifier for a specific Docker container image that can be used in various AWS services, such as Amazon SageMaker or Elastic Container Registry (ECR).

In [3]:
sagemaker_session = Session()
print("sagemaker_session: ", sagemaker_session)

aws_role = sagemaker_session.get_caller_identity_arn()
print("aws_role: ", aws_role)

aws_region = boto3.Session().region_name
print("aws_region: ", aws_region)

# Use the session's public region attribute rather than the private _region_name
image_uri = image_uris.retrieve(framework="djl-deepspeed",
                                version="0.22.1",
                                region=sagemaker_session.boto_region_name)
print("image_uri: ", image_uri)


sagemaker_session:  <sagemaker.session.Session object at 0x7f32cfe33a30>
aws_role:  arn:aws:iam::757420736997:role/WorkshopEndtoEnd
aws_region:  us-east-1
image_uri:  763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.22.1-deepspeed0.9.2-cu118
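
The image URI printed above follows the usual ECR layout `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. As a minimal sketch, the helper below splits such a URI into its parts, which can help confirm that the region and DJL version are the ones you expect. The name `parse_ecr_image_uri` is hypothetical and not part of the SageMaker SDK.

```python
def parse_ecr_image_uri(uri: str) -> dict:
    """Split an ECR image URI into account, region, repository, and tag."""
    registry, _, image = uri.partition("/")
    repository, _, tag = image.partition(":")
    host_parts = registry.split(".")
    # Expected host layout: <account>.dkr.ecr.<region>.amazonaws.com
    if len(host_parts) < 6 or host_parts[1:3] != ["dkr", "ecr"]:
        raise ValueError(f"Not an ECR image URI: {uri}")
    return {
        "account": host_parts[0],
        "region": host_parts[3],
        "repository": repository,
        "tag": tag,
    }


# The URI below is the one printed by the cell above
parts = parse_ecr_image_uri(
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.22.1-deepspeed0.9.2-cu118"
)
print(parts["region"], parts["repository"], parts["tag"])
```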



---

### <a name="4">Create the model artifact</a>
(<a href="#0">Go to top</a>)

To upload the model artifact to the S3 bucket, we need to create a `.tar.gz` file containing the model's parameters. First, we create a directory named `lora_model` and a subdirectory named `dolly-3b-lora`. The `-p` option ensures that `mkdir` creates any intermediate directories that don't exist yet. Then, we copy the LoRA checkpoint files `adapter_model.bin` and `adapter_config.json` to `dolly-3b-lora`. The base Dolly model will be downloaded at runtime from the Hugging Face Hub.

In [4]:
%%bash
rm -rf lora_model
mkdir -p lora_model
mkdir -p lora_model/dolly-3b-lora
cp dolly-3b-lora/adapter_config.json lora_model/dolly-3b-lora/
cp dolly-3b-lora/adapter_model.bin lora_model/dolly-3b-lora/

Next, we need to set the [DJL Serving configuration options](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-configuration.html) in `serving.properties`. The Jupyter `%%writefile` magic command writes the cell content below it to a file named `lora_model/serving.properties`. The file contains the following lines:
- `engine=Python`: This line specifies the engine used for serving.
- `option.entryPoint=model.py`: This line specifies the entry point for the serving process, which is set to "model.py". 
- `option.adapter_checkpoint=dolly-3b-lora`: This line points to the directory inside the model artifact that holds the LoRA adapter checkpoint, "dolly-3b-lora". A checkpoint is the saved state of a model or its parameters; here it is the adapter weights and configuration we copied above.
- `option.adapter_name=dolly-lora`: This line sets the name of the adapter to "dolly-lora". The adapter is the LoRA adapter, that is, the small set of finetuned weights applied on top of the base model.

In [5]:
%%writefile lora_model/serving.properties
engine=Python
option.entryPoint=model.py
option.adapter_checkpoint=dolly-3b-lora
option.adapter_name=dolly-lora

Writing lora_model/serving.properties
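
To double-check what was written, the file can be read back as plain `key=value` pairs. The following is a minimal sketch of such a parser; `parse_properties` is a hypothetical helper, not a DJL or SageMaker API. It parses an inline copy of the content so that it runs on its own; in the notebook you could pass the text of `lora_model/serving.properties` instead.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"Malformed line: {line!r}")
        props[key.strip()] = value.strip()
    return props


# Inline copy of the file content written by the cell above
serving_properties = """\
engine=Python
option.entryPoint=model.py
option.adapter_checkpoint=dolly-3b-lora
option.adapter_name=dolly-lora
"""

props = parse_properties(serving_properties)
print(props["option.adapter_checkpoint"])
```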


Another file we need in the model artifact is the environment requirements file. Let's use the same `requirements.txt` file from the lab folder.

In [6]:
%%bash
cp requirements.txt lora_model/

---
### <a name="5">Produce the inference script</a>
(<a href="#0">Go to top</a>)

Similar to the finetuning notebook, we have defined a text generation pipeline for inference. The code is provided in `mlu_utils/deployment_model.py`. 

We copy this script to `lora_model/model.py`, the entry point named in `serving.properties`.

In [7]:
%%bash
cp mlu_utils/deployment_model.py lora_model/model.py

---
### <a name="6">Upload the model artifact to S3</a>
(<a href="#0">Go to top</a>)

Let's create a compressed tarball archive of the `lora_model` directory and save it as `lora_model.tar.gz`.

In [8]:
%%bash
tar -cvzf lora_model.tar.gz lora_model/


lora_model/
lora_model/model.py
lora_model/dolly-3b-lora/
lora_model/dolly-3b-lora/adapter_model.bin
lora_model/dolly-3b-lora/adapter_config.json
lora_model/serving.properties
lora_model/requirements.txt
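
The same archive can also be built and inspected from Python with the standard-library `tarfile` module. The sketch below is only an illustration: `make_tarball` is a hypothetical helper, and the demo runs on a throwaway directory of placeholder files rather than the real checkpoints.

```python
import tarfile
import tempfile
from pathlib import Path


def make_tarball(source_dir: Path, archive_path: Path) -> list:
    """Create a gzip-compressed tarball of source_dir and return its member names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps the top-level folder name, like `tar -cvzf x.tar.gz lora_model/`
        tar.add(source_dir, arcname=source_dir.name)
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


# Demo on a temporary directory with placeholder files (not the real model files)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "lora_model"
    (root / "dolly-3b-lora").mkdir(parents=True)
    for rel in [
        "model.py",
        "serving.properties",
        "requirements.txt",
        "dolly-3b-lora/adapter_config.json",
        "dolly-3b-lora/adapter_model.bin",
    ]:
        (root / rel).write_text("placeholder")
    names = make_tarball(root, Path(tmp) / "lora_model.tar.gz")

print(sorted(names))
```

Listing the members before uploading is a cheap way to confirm that `model.py`, `serving.properties`, `requirements.txt`, and both adapter files made it into the archive.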


By default, we use the SageMaker session's default bucket, which is created if it does not already exist. To use a different bucket, create it and assign its name to the `mybucket` variable.

We then upload the `lora_model.tar.gz` file to that S3 bucket.

In [9]:
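s3_client = boto3.client('s3')

# Use the session's default bucket; replace with your own bucket name if desired
mybucket = sagemaker_session.default_bucket()

# upload_file(local_path, bucket, key) returns None and raises an exception on failure
s3_client.upload_file("lora_model.tar.gz", mybucket, "lora_model.tar.gz")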

---

### <a name="7">Deploy the finetuned LLM</a>
(<a href="#0">Go to top</a>)

Now it's time to deploy the finetuned LLM using the SageMaker Python SDK. We instantiate the `Model` class with the four parameters described in the <a href="#2">overview of deployment parameters</a>: `image_uri`, `model_data` (the S3 path of the `lora_model.tar.gz` file we just uploaded), `predictor_cls`, and `role`.

In [10]:
model_data="s3://{}/lora_model.tar.gz".format(mybucket)

model = Model(image_uri=image_uri,
              model_data=model_data,
              predictor_cls=sagemaker.djl_inference.DJLPredictor,
              role=aws_role)

Note: **The deployment should finish within 10 minutes. If it takes longer than that, your endpoint may have failed.**

In [11]:
%%time
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")

------------!CPU times: user 141 ms, sys: 13.6 ms, total: 155 ms
Wall time: 6min 33s
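
If you want to check on an endpoint yourself, for example from a separate session, one option is to poll its status against the 10-minute budget mentioned above. The sketch below is a hedged illustration: `wait_for_endpoint` is a hypothetical helper, and the demo uses a fake status source so it runs without AWS access. The commented lines show how a real status lookup might be passed in, assuming the `describe_endpoint` call of the Boto3 SageMaker client.

```python
import time


def wait_for_endpoint(get_status, timeout_s=600, poll_s=30, sleep=time.sleep):
    """Poll get_status() until the endpoint leaves 'Creating' or the timeout expires."""
    waited = 0
    while True:
        status = get_status()
        if status != "Creating":
            return status  # e.g. "InService" or "Failed"
        if waited >= timeout_s:
            raise TimeoutError(f"Endpoint still 'Creating' after {waited} s")
        sleep(poll_s)
        waited += poll_s


# Demo with a fake status source. In the notebook one might instead pass:
#   lambda: boto3.client("sagemaker").describe_endpoint(
#       EndpointName=predictor.endpoint_name)["EndpointStatus"]
fake_statuses = iter(["Creating", "Creating", "InService"])
final_status = wait_for_endpoint(lambda: next(fake_statuses), sleep=lambda s: None)
print(final_status)
```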


---

### <a name="8">Inference via the deployed model</a>
(<a href="#0">Go to top</a>)

Now let's test our inference endpoint with [predictor.predict](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html#sagemaker.predictor.Predictor.predict)!

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h2><i>Try it Yourself!</i></h2>
    <br>
    <p style="text-align:center;margin:auto;"><img src="./images/activity.png" alt="Activity" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">Try different prompts and observe the quality of responses generated by the deployed model.</p>
    <br>
</div>

In [12]:
outputs = predictor.predict({"inputs": "What solutions come pre-built with Amazon SageMaker JumpStart?"})

In [13]:
outputs

'Amazon SageMaker JumpStart includes pre-built solutions for ML model training, model deployment, and model monitoring. You can choose from a variety of pre-built solutions, including ML model training, model deployment, and model monitoring. You can also customize your solution to meet your business needs. For example, you can choose from a variety of ML model training solutions, such as Amazon SageMaker Training Manager, Amazon SageMaker Training Studio, and Amazon SageMaker Studio. You can also choose from a variety of model deployment solutions, such as Amazon SageMaker Model Deployment Manager, Amazon SageMaker Model Deployment Studio, and Amazon SageMaker Model Deployment Studio. You can choose from a variety of model monitoring solutions, such as Amazon SageMaker Model Monitoring Manager, Amazon SageMaker Model Monitoring Studio, and Amazon SageMaker Model Monitoring Studio. You can also choose from a variety of solutions for model deployment and model monitoring. You can also c

In [14]:
from IPython.display import Markdown
Markdown(outputs)

Amazon SageMaker JumpStart includes pre-built solutions for ML model training, model deployment, and model monitoring. You can choose from a variety of pre-built solutions, including ML model training, model deployment, and model monitoring. You can also customize your solution to meet your business needs. For example, you can choose from a variety of ML model training solutions, such as Amazon SageMaker Training Manager, Amazon SageMaker Training Studio, and Amazon SageMaker Studio. You can also choose from a variety of model deployment solutions, such as Amazon SageMaker Model Deployment Manager, Amazon SageMaker Model Deployment Studio, and Amazon SageMaker Model Deployment Studio. You can choose from a variety of model monitoring solutions, such as Amazon SageMaker Model Monitoring Manager, Amazon SageMaker Model Monitoring Studio, and Amazon SageMaker Model Monitoring Studio. You can also choose from a variety of solutions for model deployment and model monitoring. You can also customize your solution to meet your business needs. For example,
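
To compare several prompts side by side, the calls can be wrapped in a small loop. This is a minimal sketch: `run_prompts` is a hypothetical helper, the example prompts are made up for illustration, and `fake_predict` stands in for `predictor.predict` so that the block runs without a live endpoint. The payload format `{"inputs": ...}` is the one used in the cells above.

```python
def run_prompts(predict, prompts):
    """Send each prompt as {"inputs": prompt} and collect the responses by prompt."""
    return {prompt: predict({"inputs": prompt}) for prompt in prompts}


# Stand-in for predictor.predict so the sketch runs without a deployed endpoint
def fake_predict(payload):
    return f"(response to: {payload['inputs']})"


results = run_prompts(fake_predict, ["What is Amazon SageMaker?", "What is Amazon S3?"])
for prompt, response in results.items():
    print(prompt, "->", response)
```

In the notebook, passing `predictor.predict` instead of `fake_predict` would send the prompts to the deployed model.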

---

### <a name="9">Cleaning up everything</a>
(<a href="#0">Go to top</a>)

After finishing this notebook, let's be frugal and delete the endpoint and the model so they no longer incur costs.

In [15]:
predictor.delete_endpoint()
model.delete_model()

Let's also remove the local model artifacts to free up disk space.

In [16]:
%%bash
rm -rf lora_model*

### <a name="10">Quizzes</a>
(<a href="#0">Go to top</a>)

Well done on completing the lab! Now, it's time for a brief knowledge assessment.

<div style="border: 4px solid coral; text-align: center; margin: auto;">
    <h2><i>Try it Yourself!</i></h2>
    <br>
    <p style="text-align:center;margin:auto;"><img src="./images/challenge.png" alt="Challenge" width="100" /> </p>
    <p style=" text-align: center; margin: auto;">Answer the following questions to test your understanding of deploying LLMs to an endpoint using SageMaker.</p>
    <br>
</div>


In [17]:
from mlu_utils.quiz_questions import *
lab4_question1

In [18]:
lab4_question2

<p style="padding: 10px; border: 1px solid black;">
<img src="images/MLU-NEW-logo.png" alt="drawing" width="400"/> <br/>

# Thank you!