# Serve Multiple Fine-Tuned LoRA Adapters on a single endpoint using SageMaker LMI DLC

---

This notebook's CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.

![This us-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-2/inference|generativeai|llm-workshop|lab11-lora-inference|lora-multi-adapter.ipynb)

---

This notebook demonstrates how to deploy multiple fine-tuned LoRA adapters with a single copy of the base model on SageMaker using the DJL Serving Large Model Inference (LMI) DLC. LoRA (Low-Rank Adaptation) is a technique for fine-tuning large language models that keeps the base model weights frozen and trains small low-rank matrices instead. This significantly reduces the number of trainable parameters compared to full fine-tuning while achieving comparable or, in some cases, superior performance. You can learn more about the LoRA technique in this [paper](https://arxiv.org/abs/2106.09685).

A major benefit of LoRA is that the fine-tuned adapters can easily be added to and removed from the base model, which makes switching adapters at runtime cheap. In this notebook we show how to deploy a SageMaker endpoint with a single base model and multiple LoRA adapters, and how to select a different adapter for each request.
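
To see why switching is cheap, recall that a LoRA adapter only contributes a low-rank update on top of a frozen weight matrix: for an input `x`, the adapted layer computes `W x + s * B (A x)`. The following is a minimal pure-Python sketch of that idea. The helper names (`matvec`, `lora_forward`), the adapter names, and the toy matrices are illustrative assumptions and are not part of PEFT or the LMI container.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, x, adapter=None):
    """Return W @ x, plus the low-rank update scale * B @ (A @ x) if an adapter is given."""
    y = matvec(W, x)
    if adapter is None:
        return y
    A, B, scale = adapter
    delta = matvec(B, matvec(A, x))
    return [yi + scale * di for yi, di in zip(y, delta)]


# Frozen base weight (2 x 2) shared by every adapter.
W = [[1.0, 0.0], [0.0, 1.0]]

# Two hypothetical rank-1 adapters: A is (1 x 2), B is (2 x 1), plus a scale.
adapters = {
    "adapter_a": ([[1.0, 1.0]], [[1.0], [0.0]], 2.0),
    "adapter_b": ([[0.0, 1.0]], [[0.0], [1.0]], 0.5),
}

x = [1.0, 2.0]
base_out = lora_forward(W, x)                     # base model only
out_a = lora_forward(W, x, adapters["adapter_a"])  # same W, adapter_a active
out_b = lora_forward(W, x, adapters["adapter_b"])  # same W, adapter_b active
print(base_out, out_a, out_b)
```

Switching adapters only changes which small `(A, B, scale)` triple is applied; `W` is never copied or modified.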

Since LoRA adapters are much smaller than the base model (realistically 100x-1000x smaller), we can deploy an endpoint with a single base model and multiple LoRA adapters using much less hardware than deploying an equivalent number of fully fine-tuned models.
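
As a rough back-of-the-envelope check of that size ratio, the sketch below counts the parameters a LoRA adapter adds to a LLaMA-7B-sized model. The hidden size (4096) and layer count (32) are those of LLaMA-7B; the rank of 16 and the choice of four adapted projections per layer are assumptions for illustration and may not match the adapters used in this notebook.

```python
def lora_param_count(d_in, d_out, rank):
    """A rank-r adapter for a (d_out x d_in) weight adds A (r x d_in) and B (d_out x r)."""
    return rank * (d_in + d_out)


HIDDEN_SIZE = 4096      # LLaMA-7B hidden dimension
NUM_LAYERS = 32         # LLaMA-7B transformer layers
BASE_PARAMS = 6.7e9     # approximate LLaMA-7B parameter count

RANK = 16               # assumption: LoRA rank
TARGETS_PER_LAYER = 4   # assumption: q, k, v and o projections are adapted

adapter_params = (
    NUM_LAYERS * TARGETS_PER_LAYER * lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
)
ratio = BASE_PARAMS / adapter_params

print(f"adapter parameters: {adapter_params:,}")
print(f"base model is roughly {ratio:.0f}x larger")
```

Under these assumptions the adapter has about 16.8 million parameters, which puts the base model a few hundred times larger, in line with the range quoted above.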

The example we will work through in this notebook is guided by the multi-adapter example in Hugging Face's PEFT library: https://github.com/huggingface/peft/blob/main/examples/multi_adapter_examples/PEFT_Multi_LoRA_Inference.ipynb

## Install Packages and Import Dependencies

In [None]:
!pip install huggingface_hub sagemaker boto3 awscli --upgrade --quiet

In [None]:
import sagemaker
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path
from sagemaker.utils import name_from_base
from huggingface_hub import snapshot_download, notebook_login

In this example we'll be using a model that requires a Hugging Face account and a token with read permissions. You can read more about the setup here: https://huggingface.co/docs/hub/security-tokens

In [None]:
notebook_login()

## Download Model Artifacts and Upload to S3

We will be deploying an endpoint with 2 LoRA adapters. These are the models we will be using:
- Base Model: https://huggingface.co/decapoda-research/llama-7b-hf
- LoRA Fine Tuned Adapter 1: https://huggingface.co/tloen/alpaca-lora-7b
- LoRA Fine Tuned Adapter 2: https://huggingface.co/22h/cabrita-lora-v0-1

In [None]:
!rm -rf lora-multi-adapter
!mkdir -p lora-multi-adapter/base_model
!mkdir -p lora-multi-adapter/adapters

In [None]:
snapshot_download("decapoda-research/llama-7b-hf", local_dir="lora-multi-adapter/base_model")

In [None]:
snapshot_download("tloen/alpaca-lora-7b", local_dir="lora-multi-adapter/adapters/eng_alpaca")

In [None]:
snapshot_download(
    "22h/cabrita-lora-v0-1", local_dir="lora-multi-adapter/adapters/portuguese_alpaca"
)

In [None]:
sagemaker_session = sagemaker.session.Session()
s3_bucket = sagemaker_session.default_bucket()

In [None]:
!aws s3 sync lora-multi-adapter s3://{s3_bucket}/lora-mutli-adapter

## Creating Inference Handler and DJL Serving Configuration

The following files cover the model server configuration (`serving.properties`) and custom inference handler (`model.py`). This configuration can be used as an example to write your own inference handler for different models. 

The core structure to cover here is the model directory. We include both the base model and LoRA adapters in the model directory like this:

```
|- base_model/
|- adapters/
|--- <adapter_1>/
|--- <adapter_2>/
|--- ...
|--- <adapter_n>/
|- serving.properties
|- model.py

```

The `base_model` directory contains all the base model artifacts from the corresponding repository on the Hugging Face Hub. These model artifacts are generated by the `model.save_pretrained()` method of Hugging Face's Transformers library.

Each subdirectory of the `adapters` directory contains the artifacts for one LoRA adapter. Typically there are two files: `adapter_model.bin` (the adapter weights) and `adapter_config.json` (the adapter configuration). These are typically produced by the PEFT library via the `PeftModel.save_pretrained()` method. In this example, the inference handler registers each adapter under the name of its directory, and we use that name in the request to target a specific adapter.
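
Because the adapter name is simply the directory name, you can preview which names the handler will register by listing the local `adapters` directory before uploading it. The helpers below (`list_adapter_names`, `missing_adapter_configs`) are hypothetical names introduced for this sketch; they assume the local layout created earlier in this notebook.

```python
import os


def list_adapter_names(adapters_path):
    """Return the sorted names of subdirectories, i.e. the adapter names to be registered."""
    if not os.path.isdir(adapters_path):
        return []
    return sorted(
        entry
        for entry in os.listdir(adapters_path)
        if os.path.isdir(os.path.join(adapters_path, entry))
    )


def missing_adapter_configs(adapters_path):
    """Return adapter names whose directory lacks an adapter_config.json file."""
    return [
        name
        for name in list_adapter_names(adapters_path)
        if not os.path.isfile(os.path.join(adapters_path, name, "adapter_config.json"))
    ]


# Prints an empty list if the directory has not been created yet.
print(list_adapter_names("lora-multi-adapter/adapters"))
```

For the two adapters downloaded above, this should list `eng_alpaca` and `portuguese_alpaca`, which are the values to pass as `adapter_name` in a request.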

In [None]:
!rm -rf lora-multi-adapter-code
!mkdir -p lora-multi-adapter-code

In [None]:
%%writefile lora-multi-adapter-code/serving.properties
# Python engine is currently the only engine supported for multi adapter use-case
engine=Python
option.model_id=s3://{{s3_bucket}}/lora-mutli-adapter
option.base_model_path=base_model
option.adapters_path=adapters
option.dtype=fp16
option.entryPoint=model.py

In [None]:
import jinja2

jinja_env = jinja2.Environment()
# we plug in the appropriate model location into our `serving.properties` file based on the region in which this notebook is running
properties_path = Path("lora-multi-adapter-code/serving.properties")
# read_text/write_text open and close the file for us
template = jinja_env.from_string(properties_path.read_text())
properties_path.write_text(template.render(s3_bucket=s3_bucket))
!pygmentize lora-multi-adapter-code/serving.properties | cat -n
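
Before packaging, it can be worth confirming that the template was actually rendered, since an unrendered `{{s3_bucket}}` placeholder would only surface as a failure at endpoint startup. The following is a small sketch of such a check; `parse_properties` and `unrendered_keys` are hypothetical helper names, and the parser only handles the simple `key=value` lines used in this notebook.

```python
from pathlib import Path


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def unrendered_keys(props):
    """Return keys whose values still contain Jinja-style placeholders."""
    return [key for key, value in props.items() if "{{" in value or "}}" in value]


properties_file = Path("lora-multi-adapter-code/serving.properties")
if properties_file.exists():
    props = parse_properties(properties_file.read_text())
    leftover = unrendered_keys(props)
    if leftover:
        raise ValueError(f"Unrendered placeholders in: {leftover}")
    print(props["option.model_id"])
```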

In [None]:
%%writefile lora-multi-adapter-code/model.py
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig
from peft import PeftModel
import torch
import os
from djl_python.inputs import Input
from djl_python.outputs import Output
import logging

model = None
tokenizer = None
logger = logging.getLogger()


def generate_prompt(instruction, input=None):
    if input:
        return f"""Below is an instruction that describes a task, paired with an input that provides further context. 
        Write a response that appropriately completes the request. ### Instruction: {instruction} ### Input: {input} 
        ### Response:"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the 
        request.### Instruction: {instruction} ### Response:"""


def evaluate(
    instruction,
    input=None,
    temperature=0.1,
    top_p=0.75,
    top_k=40,
    num_beams=4,
    max_new_tokens=256,
    **kwargs,
):
    prompt = generate_prompt(instruction, input)
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to(torch.cuda.current_device())
    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=top_p,
        top_k=top_k,
        num_beams=num_beams,
        no_repeat_ngram_size=3,
        **kwargs,
    )

    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=max_new_tokens,
        )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s)
    return output.split("### Response:")[1].strip()


def load_model(base_model_path, adapters_path=None):
    model = LlamaForCausalLM.from_pretrained(
        base_model_path, low_cpu_mem_usage=True, torch_dtype=torch.float16, device_map="auto"
    )
    tokenizer = LlamaTokenizer.from_pretrained(base_model_path)
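    logging.info(f"Loaded Base Model {base_model_path}")

    # Guard against an unset or missing adapters directory before listing it
    if adapters_path and os.path.exists(adapters_path):
        logging.info(f"Adapter directory contents: {os.listdir(adapters_path)}")
        first_adapter = True
        for adapter_dir in os.listdir(adapters_path):
            # Only directories hold adapter artifacts; skip stray files
            if not os.path.isdir(os.path.join(adapters_path, adapter_dir)):
                continue
            logging.info(f"Registering Adapter {os.path.join(adapters_path, adapter_dir)}")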
            if first_adapter:
                model = PeftModel.from_pretrained(
                    model, os.path.join(adapters_path, adapter_dir), adapter_name=adapter_dir
                )
                first_adapter = False
            else:
                model.load_adapter(
                    os.path.join(adapters_path, adapter_dir), adapter_name=adapter_dir
                )
    return model, tokenizer


def handle(inputs: Input):
    global model, tokenizer
    if not model:
        properties = inputs.get_properties()
        model_dir = properties.get("model_id")
        base_model_path = os.path.join(model_dir, properties.get("base_model_path"))
        adapters_path = os.path.join(model_dir, properties.get("adapters_path"))
        logging.info(base_model_path)
        logging.info(adapters_path)
        model, tokenizer = load_model(base_model_path, adapters_path=adapters_path)

    if inputs.is_empty():
        # initialization request
        return None

    json_inputs = inputs.get_as_json()
    sentence = json_inputs.get("inputs")
    adapter_name = json_inputs.get("adapter_name", None)
    generation_kwargs = json_inputs.get("parameters", {})
    if adapter_name is not None:
        model.set_adapter(adapter_name)
        outputs = evaluate(sentence, **generation_kwargs)
    else:
        with model.disable_adapter():
            outputs = evaluate(sentence, **generation_kwargs)

    return Output().add_as_json(outputs)
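
The routing at the end of `handle` is the core of the multi-adapter behaviour: a request with `adapter_name` activates that adapter, and a request without it runs the base model with adapters disabled. The sketch below reproduces only that branching with a stand-in object so it can be run without a GPU. `StubPeftModel` and `route` are hypothetical names for this illustration; the stub mimics the `set_adapter` and `disable_adapter` calls used in `model.py` but does no inference.

```python
from contextlib import contextmanager


class StubPeftModel:
    """Hypothetical stand-in that only tracks which adapter is active."""

    def __init__(self, adapter_names):
        self.adapter_names = set(adapter_names)
        self.active_adapter = None
        self.adapters_enabled = True

    def set_adapter(self, name):
        # Assumption of this sketch: unknown names are rejected
        if name not in self.adapter_names:
            raise ValueError(f"Unknown adapter: {name}")
        self.active_adapter = name

    @contextmanager
    def disable_adapter(self):
        # Adapters are disabled only for the duration of the with-block
        self.adapters_enabled = False
        try:
            yield
        finally:
            self.adapters_enabled = True


def route(model, request):
    """Mirror the branching in handle(): pick an adapter, or fall back to the base model."""
    adapter_name = request.get("adapter_name")
    if adapter_name is not None:
        model.set_adapter(adapter_name)
        return f"adapter:{model.active_adapter}"
    with model.disable_adapter():
        return "base" if not model.adapters_enabled else "adapter"


stub = StubPeftModel(["eng_alpaca", "portuguese_alpaca"])
print(route(stub, {"inputs": "Tell me about Alpacas", "adapter_name": "eng_alpaca"}))
print(route(stub, {"inputs": "Tell me about Alpacas"}))
```

Note that the active adapter is state held on the shared model object, so this pattern assumes requests are processed one at a time by a given worker.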

In [None]:
!rm -f model.tar.gz
!rm -rf lora-multi-adapter-code/.ipynb_checkpoints
!tar czvf model.tar.gz -C lora-multi-adapter-code .

## Create SageMaker Model and Endpoint

In [None]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
model_bucket = sess.default_bucket()  # bucket to house artifacts
s3_code_prefix = (
    "hf-large-model-djl/lora-multi-adapter"  # folder within bucket where code artifact will go
)

region = sess.boto_region_name  # public attribute; avoids relying on the private _region_name
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

In [None]:
s3_code_artifact_accelerate = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)

In [None]:
inference_image_uri = sagemaker.image_uris.retrieve(
    "djl-deepspeed", region=region, version="0.23.0"
)

model_name_acc = name_from_base("lora-multi-adapter")

create_model_response = sm_client.create_model(
    ModelName=model_name_acc,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "ModelDataUrl": s3_code_artifact_accelerate},
)
model_arn = create_model_response["ModelArn"]

In [None]:
endpoint_config_name = f"{model_name_acc}-config"
endpoint_name = f"{model_name_acc}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name_acc,
            "InstanceType": "ml.g5.8xlarge",
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
            # "VolumeSizeInGB": 512
        },
    ],
)

In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

In [None]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
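
The loop above exits as soon as the status is no longer `Creating`, which also happens when deployment fails. A slightly more defensive variant, sketched below, raises on any terminal status other than `InService` and gives up after a timeout. `wait_for_endpoint` is a hypothetical helper written for this notebook; it takes the describe function as an argument so it can be exercised without AWS access.

```python
import time


def wait_for_endpoint(
    describe_endpoint, endpoint_name, poll_seconds=60, timeout_seconds=3600, sleep=time.sleep
):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    waited = 0
    while True:
        resp = describe_endpoint(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return resp
        if status != "Creating":
            reason = resp.get("FailureReason", "no reason given")
            raise RuntimeError(f"Endpoint {endpoint_name} is in status {status}: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {endpoint_name} still Creating after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with the client defined earlier in this notebook:
# resp = wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)
# print("Arn: " + resp["EndpointArn"])
```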

## Make Inference Requests

We show how you can make requests targeting each of the adapters configured in the endpoint, as well as requests targeting just the base model with no adapters.

Inference request targeting the tloen/alpaca-lora-7b adapter (named `eng_alpaca` based on the directory name of the adapter).

In [None]:
%%time

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": "Tell me about Alpacas", "adapter_name": "eng_alpaca"}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

Inference request targeting the 22h/cabrita-lora-v0-1 adapter (named `portuguese_alpaca` based on the directory name of the adapter).

In [None]:
%%time

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": "Invente uma desculpa criativa pra dizer que não preciso ir à festa.",
            "adapter_name": "portuguese_alpaca",
        }
    ),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

Inference request targeting the base model without any adapters.

In [None]:
%%time

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": "Tell me about Alpacas"}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")
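
The three requests above differ only in the JSON body: `inputs` carries the instruction, `adapter_name` is optional, and the handler also reads an optional `parameters` object that it forwards to generation. A small helper can make that contract explicit. `build_payload` is a hypothetical name used for this sketch, and the keyword arguments shown are those accepted by `evaluate` in `model.py`.

```python
import json


def build_payload(prompt, adapter_name=None, **parameters):
    """Build the JSON request body understood by the handler in model.py."""
    payload = {"inputs": prompt}
    if adapter_name is not None:
        payload["adapter_name"] = adapter_name  # omit to use the base model
    if parameters:
        payload["parameters"] = parameters  # e.g. max_new_tokens, temperature
    return json.dumps(payload)


body = build_payload("Tell me about Alpacas", adapter_name="eng_alpaca", max_new_tokens=128)
print(body)
```

The resulting string can be passed as `Body` to `smr_client.invoke_endpoint` with `ContentType="application/json"`, exactly as in the cells above.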

## Clean up Resources

In [None]:
sm_client.delete_endpoint(EndpointName=endpoint_name)

In [None]:
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name_acc)

## Notebook CI Test Results

This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.


![This us-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-1/inference|generativeai|llm-workshop|lab11-lora-inference|lora-multi-adapter.ipynb)

![This us-east-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-east-2/inference|generativeai|llm-workshop|lab11-lora-inference|lora-multi-adapter.ipynb)

![This us-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/us-west-1/inference|generativeai|llm-workshop|lab11-lora-inference|lora-multi-adapter.ipynb)

![This ca-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ca-central-1/inference|generativeai|llm-workshop|lab11-lora-inference|lora-multi-adapter.ipynb)

![This sa-east-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/sa-east-1/inference|generativeai|llm-workshop|lab11-lora-inference|lora-multi-adapter.ipynb)

![This eu-west-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-1/inference|generativeai|llm-workshop|lab11-lora-inference|lora-multi-adapter.ipynb)

![This eu-west-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-2/inference|generativeai|llm-workshop|lab11-lora-inference|lora-multi-adapter.ipynb)

![This eu-west-3 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-west-3/inference|generativeai|llm-workshop|lab11-lora-inference|lora-multi-adapter.ipynb)

![This eu-central-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-central-1/inference|generativeai|llm-workshop|lab11-lora-inference|lora-multi-adapter.ipynb)

![This eu-north-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/eu-north-1/inference|generativeai|llm-workshop|lab11-lora-inference|lora-multi-adapter.ipynb)

![This ap-southeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-1/inference|generativeai|llm-workshop|lab11-lora-inference|lora-multi-adapter.ipynb)

![This ap-southeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-southeast-2/inference|generativeai|llm-workshop|lab11-lora-inference|lora-multi-adapter.ipynb)

![This ap-northeast-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-1/inference|generativeai|llm-workshop|lab11-lora-inference|lora-multi-adapter.ipynb)

![This ap-northeast-2 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-northeast-2/inference|generativeai|llm-workshop|lab11-lora-inference|lora-multi-adapter.ipynb)

![This ap-south-1 badge failed to load. Check your device's internet connectivity, otherwise the service is currently unavailable](https://h75twx4l60.execute-api.us-west-2.amazonaws.com/sagemaker-nb/ap-south-1/inference|generativeai|llm-workshop|lab11-lora-inference|lora-multi-adapter.ipynb)

