# Serve Flan T5 Fine-Tuned with LoRA using SageMaker LMI DLC

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/inference|generativeai|llm-workshop|lab11-lora-inference|flan-t5-lora.ipynb)

---

In this example, we walk through how to take a Flan T5 model fine-tuned with Low-Rank Adaptation (LoRA) and deploy it for inference using the SageMaker Large Model Inference (LMI) Deep Learning Containers (DLCs). We use the FasterTransformer DLC for optimized inference. For more information on LoRA, see the [relevant paper](https://arxiv.org/abs/2106.09685).
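
For intuition, LoRA keeps a base weight matrix `W` frozen and trains two small matrices `A` and `B` whose product is a low-rank update. At inference time that update can be folded into the base weight as `W + (alpha / r) * B @ A`. The following is a minimal pure-Python sketch of that merge on toy matrices. The helper names (`matmul`, `merge_lora`) and the toy values are illustrative assumptions, not code from the LMI container.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * (B @ A) for matrices given as lists of rows."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (d_out x r) @ (r x d_in) -> (d_out x d_in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy example: a 2x3 base weight with a rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]  # r x d_in = 1 x 3
B = [[0.5], [-0.5]]    # d_out x r = 2 x 1

merged = merge_lora(W, A, B, alpha=2, r=1)
print(merged)  # [[2.0, 2.0, 3.0], [-1.0, -1.0, -3.0]]
```

Because the merged matrix has the same shape as `W`, the merged model can be served like any standard model of the same architecture.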

## Setup

Install the dependencies required to package the model and run inference with Amazon SageMaker. This upgrades `sagemaker` and `boto3` to their latest versions.

In [None]:
!pip install sagemaker boto3 --upgrade  --quiet

In [None]:
import sagemaker
import jinja2
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path
from sagemaker.utils import name_from_base

In [None]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
model_bucket = sess.default_bucket()  # bucket to house artifacts
s3_code_prefix_fastertransformer = "hf-large-model-ft/code_flant5lora/fastertransformer"  # folder within bucket where code artifact will go

region = sess.boto_region_name  # public attribute; avoids relying on the private _region_name
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

jinja_env = jinja2.Environment()

## Create SageMaker Model Artifacts

Creating model artifacts for a LoRA fine-tuned model is very similar to creating artifacts for a standard model served with the LMI DLCs. For this notebook, we use [this LoRA checkpoint available on the HuggingFace hub](https://huggingface.co/lorahub/flan_t5_large-quarel_logic_test). If you have LoRA checkpoints for a different model, you can substitute them here as long as the base model architecture is supported by FasterTransformer. We demonstrate two ways to create model artifacts, depending on whether you are deploying a fine-tuned model available from the HuggingFace hub or a fine-tuned model with local artifacts.

In [None]:
!rm -rf code_flant5_lora
!mkdir -p code_flant5_lora

### Option 1: Create a serving.properties for a LoRA checkpoint available on the hub

In [None]:
%%writefile code_flant5_lora/serving.properties
engine = FasterTransformer
# Built-in default handler; converts the LoRA model into a FasterTransformer-compatible format
option.entryPoint = djl_python.fastertransformer
# Pointer to the LoRA checkpoint available on the HuggingFace hub
option.model_id = lorahub/flan_t5_large-quarel_logic_test
option.tensor_parallel_degree = 1
option.dtype = fp16
option.model_loading_timeout = 1800

That is all you need for deploying a LoRA fine-tuned model available on the HuggingFace hub!
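
Because a malformed `serving.properties` only surfaces as a failure at endpoint startup, it can help to sanity-check the file before packaging. The following is a minimal sketch. The helper names and the choice of which keys to treat as required are assumptions made for illustration, not rules enforced by the LMI container.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {raw!r}")
        props[key.strip()] = value.strip()
    return props


def check_serving_properties(props):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    # Assumption: these two keys are treated as required for this notebook.
    for key in ("engine", "option.model_id"):
        if key not in props:
            problems.append(f"missing required key: {key}")
    model_id = props.get("option.model_id", "")
    if any(ch.isspace() for ch in model_id):
        problems.append(f"option.model_id contains whitespace: {model_id!r}")
    tp = props.get("option.tensor_parallel_degree", "1")
    if not tp.isdigit() or int(tp) < 1:
        problems.append(f"tensor_parallel_degree must be a positive integer: {tp!r}")
    return problems


sample = """
engine = FasterTransformer
option.entryPoint = djl_python.fastertransformer
option.model_id = lorahub/flan_t5_large-quarel_logic_test
option.tensor_parallel_degree = 1
"""
props = parse_properties(sample)
print(check_serving_properties(props))  # []
```

To check the file written above, pass the contents of `code_flant5_lora/serving.properties` to `parse_properties` instead of the inline sample.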

### Option 2: Create a serving.properties for a local LoRA checkpoint

If you have a LoRA checkpoint available locally that you want to deploy for inference, you need to package the checkpoint as part of the code artifact that is uploaded to S3. The directory structure should look like this:

```
code_flant5_lora/
├── lora_checkpoint_dir
│   ├── adapter_config.json
│   └── adapter_model.bin # can also be in safetensors format
├── (optional) base_model_artifacts
│   ├── *.bin # checkpoint files
│   └── *.json # configuration files for model, tokenizer, generation etc.
└── serving.properties
```
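
The layout above can be checked before building the tarball. The following is a minimal sketch that reports which expected files are missing. The function name `check_lora_layout` is hypothetical, and the demo builds a throwaway directory with empty placeholder files rather than a real checkpoint.

```python
import tempfile
from pathlib import Path

# PEFT adapters are saved either as a .bin file or in safetensors format.
ADAPTER_WEIGHT_NAMES = ("adapter_model.bin", "adapter_model.safetensors")


def check_lora_layout(model_dir, checkpoint_dir="lora_checkpoint_dir"):
    """Return the expected files that are missing under model_dir."""
    model_dir = Path(model_dir)
    ckpt = model_dir / checkpoint_dir
    missing = []
    if not (model_dir / "serving.properties").is_file():
        missing.append("serving.properties")
    if not (ckpt / "adapter_config.json").is_file():
        missing.append(f"{checkpoint_dir}/adapter_config.json")
    if not any((ckpt / name).is_file() for name in ADAPTER_WEIGHT_NAMES):
        missing.append(f"{checkpoint_dir}/adapter_model.(bin|safetensors)")
    return missing


# Demo on a temporary directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "code_flant5_lora"
    ckpt = root / "lora_checkpoint_dir"
    ckpt.mkdir(parents=True)
    (root / "serving.properties").write_text("engine=FasterTransformer\n")
    (ckpt / "adapter_config.json").write_text("{}")
    missing_before = check_lora_layout(root)  # adapter weights not written yet
    (ckpt / "adapter_model.safetensors").write_bytes(b"")
    missing_after = check_lora_layout(root)

print(missing_before)
print(missing_after)
```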

If you have saved your LoRA checkpoints in a directory called `lora_checkpoint_dir`, the serving.properties file will look like this (uncommented, of course):

In [None]:
# engine=FasterTransformer
# # Built-in default handler; converts the LoRA model into a FasterTransformer-compatible format
# option.entryPoint=djl_python.fastertransformer
# # Path to the local LoRA checkpoint directory, relative to the model directory
# option.model_id=lora_checkpoint_dir
# option.tensor_parallel_degree=1
# option.dtype=fp16
# option.model_loading_timeout=1800

Note: For LoRA models generated by the HuggingFace PEFT library, the `adapter_config.json` file records the base model used for training in a field called `base_model_name_or_path`. To use the model for inference, the LMI container expects this value to be a path to the base model relative to the model directory (in this example, the `code_flant5_lora` directory). If that directory is not found, the base model is downloaded from the HuggingFace hub at runtime.
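
The lookup described in this note can be sketched as follows. This mirrors the documented behavior only; it is not the container's actual code, and the function name `resolve_base_model` is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def resolve_base_model(model_dir, checkpoint_dir="lora_checkpoint_dir"):
    """Return ("local", path) if the base model directory exists, else ("hub", name)."""
    model_dir = Path(model_dir)
    config_path = model_dir / checkpoint_dir / "adapter_config.json"
    config = json.loads(config_path.read_text())
    base = config["base_model_name_or_path"]
    candidate = model_dir / base  # interpreted relative to the model directory
    if candidate.is_dir():
        return "local", str(candidate)
    return "hub", base


# Demo on a temporary directory with a minimal adapter_config.json.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    ckpt = root / "lora_checkpoint_dir"
    ckpt.mkdir()
    (ckpt / "adapter_config.json").write_text(
        json.dumps({"base_model_name_or_path": "base_model_artifacts"})
    )
    before = resolve_base_model(root)  # directory absent: fall back to the hub
    (root / "base_model_artifacts").mkdir()
    after = resolve_base_model(root)   # directory present: use local artifacts

print(before)
print(after[0])
```

If you package the base model locally, editing `base_model_name_or_path` to match the packaged directory name avoids the download at startup.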

### Create requirements.txt

In [None]:
%%writefile code_flant5_lora/requirements.txt
peft==0.4.0

## Create the Tarball and Upload It to S3
Next, we package our artifacts as a `*.tar.gz` file and upload it to S3 for SageMaker to use during deployment.

In [None]:
!rm -f model.tar.gz
!rm -rf code_flant5_lora/.ipynb_checkpoints
!tar czvf model.tar.gz -C code_flant5_lora .
s3_code_artifact_fastertransformer = sess.upload_data(
    "model.tar.gz", bucket, s3_code_prefix_fastertransformer
)
print(
    f"S3 Code or Model tar for fastertransformer uploaded to --- > {s3_code_artifact_fastertransformer}"
)
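
If shell commands are not available, the same archive can be built with Python's standard `tarfile` module. The following is a sketch under that assumption. The function name `make_model_tarball` is hypothetical, and unlike `tar -C code_flant5_lora .` it stores entries without a leading `./`, which still places `serving.properties` at the archive root.

```python
import tarfile
import tempfile
from pathlib import Path


def make_model_tarball(src_dir, out_path):
    """Archive the contents of src_dir at the tarball root and return the member names."""
    src_dir = Path(src_dir)
    with tarfile.open(out_path, "w:gz") as tar:
        for path in sorted(src_dir.rglob("*")):
            rel = path.relative_to(src_dir)
            if ".ipynb_checkpoints" in rel.parts:
                continue  # skip Jupyter checkpoint debris
            # recursive=False because rglob already visits every nested entry
            tar.add(path, arcname=rel.as_posix(), recursive=False)
    with tarfile.open(out_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Demo: the output archive lives outside the source directory.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as out:
    src = Path(src)
    (src / "serving.properties").write_text("engine=FasterTransformer\n")
    (src / "requirements.txt").write_text("peft==0.4.0\n")
    (src / ".ipynb_checkpoints").mkdir()
    (src / ".ipynb_checkpoints" / "stale.txt").write_text("x")
    names = make_model_tarball(src, Path(out) / "model.tar.gz")

print(names)  # ['requirements.txt', 'serving.properties']
```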

## Define a serving container, SageMaker Model and SageMaker endpoint
Now that we have uploaded the model artifacts to S3, we can create a SageMaker endpoint.

### Define the serving container
Here we define the container to use for inference. We use the SageMaker Large Model Inference (LMI) container with FasterTransformer.

In [None]:
inference_image_uri = image_uris.retrieve(
    framework="djl-fastertransformer", region=region, version="0.23.0"
)

### Create SageMaker model, endpoint configuration and endpoint.

In [None]:
model_name_ds = name_from_base("flant5-lora-ds")
print(model_name_ds)

In [None]:
create_model_response = sm_client.create_model(
    ModelName=model_name_ds,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact_fastertransformer,
    },
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

In [None]:
model_name = model_name_ds
print(f"Building EndpointConfig and Endpoint for: {model_name}")

In [None]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.4xlarge",
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
            # "VolumeSizeInGB": 512
        },
    ],
)
endpoint_config_response
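
The instance type chosen here has to be compatible with `option.tensor_parallel_degree` in `serving.properties`, since the degree cannot exceed the number of GPUs on the instance. The following sketch checks that for the `ml.g5` family. The table is hard-coded for illustration and should be verified against the current instance documentation; the function name is hypothetical.

```python
# GPU counts for the ml.g5 family (hard-coded for illustration).
G5_GPU_COUNT = {
    "ml.g5.xlarge": 1,
    "ml.g5.2xlarge": 1,
    "ml.g5.4xlarge": 1,
    "ml.g5.8xlarge": 1,
    "ml.g5.16xlarge": 1,
    "ml.g5.12xlarge": 4,
    "ml.g5.24xlarge": 4,
    "ml.g5.48xlarge": 8,
}


def check_tensor_parallel(instance_type, tensor_parallel_degree):
    """Return True if the degree fits within the instance's GPU count."""
    gpus = G5_GPU_COUNT.get(instance_type)
    if gpus is None:
        raise KeyError(f"unknown instance type for this sketch: {instance_type}")
    return 1 <= tensor_parallel_degree <= gpus


print(check_tensor_parallel("ml.g5.4xlarge", 1))   # True: matches this notebook
print(check_tensor_parallel("ml.g5.4xlarge", 2))   # False: only one GPU available
print(check_tensor_parallel("ml.g5.12xlarge", 4))  # True
```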

In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

In [None]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
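
The loop above exits on any status other than `Creating`, so a `Failed` endpoint is only noticed by reading the printed status. A variant that raises on failure and enforces a timeout might look like the following sketch. `wait_for_endpoint` and the fake `describe` factory are hypothetical helpers written for this example. boto3 also provides a built-in `endpoint_in_service` waiter through `sm_client.get_waiter`.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=60,
                      timeout_seconds=3600, sleep=time.sleep):
    """Poll describe(EndpointName=...) until InService; raise on failure or timeout."""
    waited = 0
    while True:
        resp = describe(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return resp
        if status != "Creating":
            reason = resp.get("FailureReason", "no reason given")
            raise RuntimeError(f"endpoint {endpoint_name} is {status}: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"endpoint {endpoint_name} still Creating after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


def make_fake_describe(statuses):
    """Hypothetical stand-in for sm_client.describe_endpoint, used only for the demo."""
    remaining = iter(statuses)

    def describe(EndpointName):
        return {"EndpointStatus": next(remaining), "EndpointArn": f"arn:fake:{EndpointName}"}

    return describe


# Demo with a fake client and a no-op sleep so it finishes instantly.
fake = make_fake_describe(["Creating", "Creating", "InService"])
result = wait_for_endpoint(fake, "demo-endpoint", sleep=lambda seconds: None)
print(result["EndpointStatus"])  # InService
```

With a real client, the call would be `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`.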

### Run Inference

In this example we are running a LoRA fine-tuned version of the flan-t5-large model, but the same steps can be used to deploy larger models using tensor parallelism and larger instances. Note that this model may not always produce accurate or correct outputs; the example is intended to showcase the functionality of deploying LoRA fine-tuned models with optimized inference.

In [None]:
%%time

prompt = """Infer the date from context.  
Q: Today, 8/3/1997, is a day that we will never forget. 
What is the date one week ago from today in MM/DD/YYYY? Options: 
(A) 03/27/1998 
(B) 09/02/1997 
(C) 07/27/1997 
(D) 06/29/1997 
(E) 07/27/1973 
(F) 12/27/1997 A:
"""

data = {
    "inputs": prompt,
}

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(data),
    ContentType="application/json",
)

output = json.loads(response_model["Body"].read().decode("utf8"))
print(output[0]["generated_text"])
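
Indexing `output[0]["generated_text"]` directly raises an opaque `KeyError` or `TypeError` if the endpoint returns an unexpected payload. The following sketch wraps the parsing in a small helper that fails with a clearer message. The function name is hypothetical, and the response shape it expects (a list of objects with a `generated_text` field) is taken from the cell above.

```python
import json


def extract_generated_text(body_bytes):
    """Decode a response body and return the list of generated strings."""
    payload = json.loads(body_bytes.decode("utf-8"))
    if isinstance(payload, dict):
        payload = [payload]  # tolerate a single object instead of a list
    texts = []
    for item in payload:
        if not isinstance(item, dict) or "generated_text" not in item:
            raise ValueError(f"unexpected response item: {item!r}")
        texts.append(item["generated_text"])
    return texts


# Demo with a made-up body; a real one comes from response_model["Body"].read().
fake_body = json.dumps([{"generated_text": "example output"}]).encode("utf-8")
print(extract_generated_text(fake_body))  # ['example output']
```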

## Summary

In this example we showed how to take LoRA fine-tuned checkpoints and deploy the full fine-tuned model on SageMaker using the FasterTransformer LMI DLC. This DLC merges the LoRA weights with the base model weights, and then optimizes the resulting model using tensor parallelism and fused CUDA kernels (if the model type is supported).

The same functionality is available in the DeepSpeed DLC. You can refer to the Falcon 7b LoRA inference example in this notebook: [Serve Falcon 7b Fine-Tuned with LoRA using SageMaker LMI DLC](https://github.com/aws/amazon-sagemaker-examples/tree/main/inference/generativeai/llm-workshop/lab11-lora-inference/falcon-7b-lora.ipynb).

A unique feature of LoRA is that, because the fine-tuned adapter artifacts are separate from the base model, we can deploy a single endpoint with multiple adapters and switch between them at inference time. This functionality is demonstrated in this notebook: [Serve Multiple Fine-Tuned LoRA Adapters on a single endpoint using SageMaker LMI DLC](https://github.com/aws/amazon-sagemaker-examples/tree/main/inference/generativeai/llm-workshop/lab11-lora-inference/lora-multi-adapter.ipynb).

### Clean Up

In [None]:
# Delete the endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)

In [None]:
# Delete the endpoint configuration and the model, even if endpoint creation failed
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)
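
If one of the delete calls raises (for example, because the endpoint was never created), the remaining resources are left behind. The following sketch attempts every step and collects the failures instead of stopping at the first one. The `cleanup` helper and `FakeClient` class are hypothetical and exist only for this example.

```python
def cleanup(steps):
    """Run each (label, action) step; collect failures instead of stopping early."""
    errors = []
    for label, action in steps:
        try:
            action()
        except Exception as exc:  # broad on purpose so later resources are still deleted
            errors.append((label, str(exc)))
    return errors


class FakeClient:
    """Hypothetical stand-in for sm_client, used only to exercise the helper."""

    def __init__(self):
        self.deleted = []

    def delete_endpoint(self, EndpointName):
        raise RuntimeError("endpoint not found")

    def delete_endpoint_config(self, EndpointConfigName):
        self.deleted.append(EndpointConfigName)

    def delete_model(self, ModelName):
        self.deleted.append(ModelName)


fake = FakeClient()
errors = cleanup([
    ("endpoint", lambda: fake.delete_endpoint(EndpointName="demo-endpoint")),
    ("endpoint config", lambda: fake.delete_endpoint_config(EndpointConfigName="demo-config")),
    ("model", lambda: fake.delete_model(ModelName="demo")),
])
print(errors)        # [('endpoint', 'endpoint not found')]
print(fake.deleted)  # ['demo-config', 'demo']
```

With the real client, each lambda would call the corresponding `sm_client.delete_*` method with the names defined earlier in this notebook.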

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/inference|generativeai|llm-workshop|lab11-lora-inference|flan-t5-lora.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/inference|generativeai|llm-workshop|lab11-lora-inference|flan-t5-lora.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/inference|generativeai|llm-workshop|lab11-lora-inference|flan-t5-lora.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/inference|generativeai|llm-workshop|lab11-lora-inference|flan-t5-lora.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/inference|generativeai|llm-workshop|lab11-lora-inference|flan-t5-lora.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/inference|generativeai|llm-workshop|lab11-lora-inference|flan-t5-lora.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/inference|generativeai|llm-workshop|lab11-lora-inference|flan-t5-lora.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/inference|generativeai|llm-workshop|lab11-lora-inference|flan-t5-lora.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/inference|generativeai|llm-workshop|lab11-lora-inference|flan-t5-lora.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/inference|generativeai|llm-workshop|lab11-lora-inference|flan-t5-lora.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/inference|generativeai|llm-workshop|lab11-lora-inference|flan-t5-lora.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/inference|generativeai|llm-workshop|lab11-lora-inference|flan-t5-lora.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/inference|generativeai|llm-workshop|lab11-lora-inference|flan-t5-lora.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/inference|generativeai|llm-workshop|lab11-lora-inference|flan-t5-lora.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/inference|generativeai|llm-workshop|lab11-lora-inference|flan-t5-lora.ipynb)