# Serve Multiple Fine-Tuned LoRA Adapters on a single endpoint using SageMaker LMI DLC

This notebook shows how to deploy multiple fine-tuned LoRA adapters that share a single base model on Amazon SageMaker using the DJL Serving Large Model Inference (LMI) container. LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning (PEFT) technique that trains far fewer parameters than conventional fine-tuning while delivering comparable or better results. Because each adapter is small, many fine-tuned variants can share one copy of the base model, which reduces the memory footprint at inference time. For more on the LoRA technique, refer to this [paper](https://arxiv.org/abs/2106.09685).

One of the main advantages of LoRA is that fine-tuned adapters can be attached to and detached from the base model easily, which makes swapping adapters at runtime practical and cost-effective. This notebook deploys a SageMaker endpoint with one base model and multiple LoRA adapters, and shows how to select adapters on a per-request basis.

Because LoRA adapters are much smaller than the base model (typically 100x-1000x smaller), an endpoint with one base model and several LoRA adapters needs substantially less hardware than the same number of fully fine-tuned models.
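
To make the size argument concrete, the following sketch counts the parameters a LoRA adapter adds. Each adapted weight matrix of shape `d_out x d_in` gains two low-rank factors, so it adds `rank * (d_in + d_out)` parameters. The layer sizes, rank, number of adapted projections, and base parameter count below are illustrative assumptions, not values read from the adapters used in this lab.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical configuration, chosen for illustration only.
hidden_size = 4096      # assumed width of each attention projection
num_layers = 32         # assumed number of transformer layers
adapted_per_layer = 4   # assumed number of projections adapted in each layer
rank = 16               # assumed LoRA rank
base_params = 6.7e9     # assumed approximate parameter count of the base model

adapter_params = (
    lora_param_count(hidden_size, hidden_size, rank) * adapted_per_layer * num_layers
)
ratio = base_params / adapter_params

print(f"adapter parameters: {adapter_params:,}")
print(f"base model is roughly {ratio:.0f}x larger")
```

Under these assumptions the ratio falls inside the 100x-1000x range quoted above; a smaller rank or fewer adapted projections pushes it higher.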

### Install Packages and Import Dependencies

First, let's install the required libraries: `sagemaker`, `boto3`, `awscli`, and `huggingface_hub`. You can safely ignore any warning about running pip as the `root` user.

In [None]:
!pip install huggingface_hub sagemaker boto3 awscli --upgrade --quiet

### Imports and set up

In [None]:
import sagemaker
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path
from sagemaker.utils import name_from_base
from huggingface_hub import snapshot_download, notebook_login

In [None]:
sagemaker_session = sagemaker.session.Session()
s3_bucket = sagemaker_session.default_bucket()

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
model_bucket = sess.default_bucket()  # bucket to house artifacts
s3_code_prefix = (
    "aim351-lab3/llama7b-lora"  # folder within bucket where code artifact will go
)

region = sess.boto_region_name  # public attribute; avoids relying on the private _region_name
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

### Download Model Artifacts and Upload to S3

For the purpose of this lab, we use LoRA adapters from the Hugging Face Hub that were fine-tuned for LLaMA 7B using the PEFT LoRA technique. We will deploy an endpoint with the base LLaMA 7B model and 2 LoRA adapters. These are the models we will be using:
- Base model: https://huggingface.co/huggyllama/llama-7b
- Adapter: `tloen/alpaca-lora-7b`
- Adapter: `22h/cabrita-lora-v0-1`

The `base_model` directory contains all the base model artifacts from the corresponding repository on the Hugging Face Hub. These model artifacts are generated by the `model.save_pretrained()` method of the Hugging Face Transformers library. In this notebook the base model is instead referenced by `option.model_id` in `serving.properties`, so only the `adapters` directory is created locally.

Each subdirectory of the `adapters` directory contains the artifacts for one LoRA adapter. Typically there are two files: `adapter_model.bin` (the adapter weights) and `adapter_config.json` (the adapter configuration). These are typically produced by the PEFT library via the `PeftModel.save_pretrained()` method. In this example, the inference handler registers each adapter under the name of its directory, and we use that name in the request to target a specific adapter.

```
|- base_model/
|--- <base-model-artifacts>/
|- adapters/
|--- <lora1>/
|-------- <lora1-model-artifacts>/
|--- <lora2>/
|-------- <lora2-model-artifacts>/
|--- <lora3>/
|-------- <lora3-model-artifacts>/
|--- <lora_n>/
|-------- <lora_n-model-artifacts>/
```
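
Because adapters are addressed by directory name, it can help to list the names before packaging. The sketch below is a hypothetical helper (`list_adapters` is not part of DJL Serving or PEFT) that assumes each valid adapter directory contains an `adapter_config.json` file, as described above.

```python
from pathlib import Path


def list_adapters(adapters_root):
    """Return sorted names of subdirectories that contain an adapter_config.json."""
    root = Path(adapters_root)
    if not root.is_dir():
        return []
    return sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir() and (child / "adapter_config.json").is_file()
    )


# Prints an empty list until the adapters have been downloaded.
print(list_adapters("lora-multi-adapter/adapters"))
```

The returned names are the values you would later pass in the `adapters` field of a request.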

In [None]:
!rm -rf lora-multi-adapter
!mkdir -p lora-multi-adapter/adapters

In [None]:
snapshot_download("tloen/alpaca-lora-7b", local_dir="lora-multi-adapter/adapters/eng_alpaca", local_dir_use_symlinks=False)

In [None]:
snapshot_download("22h/cabrita-lora-v0-1", local_dir="lora-multi-adapter/adapters/portuguese_alpaca", local_dir_use_symlinks=False)

## Creating Inference Handler and DJL Serving Configuration

The following files cover the model server configuration (`serving.properties`) and custom inference handler (`model.py`). This configuration can be used as an example to write your own inference handler for different models. 

In [None]:
%%writefile lora-multi-adapter/serving.properties
# Python engine is currently the only engine supported for multi adapter use-case
engine=Python
option.model_id=huggyllama/llama-7b
option.adapters_path=adapters
option.dtype=fp16
option.entryPoint=model.py
option.tensor_parallel_degree=4
load_on_devices=0

In [None]:
import jinja2

jinja_env = jinja2.Environment()
# Render any Jinja placeholders (for example `{{ s3_bucket }}`) in `serving.properties`.
# The file written above contains no placeholders, so rendering leaves its settings unchanged.
properties_path = Path("lora-multi-adapter/serving.properties")
template = jinja_env.from_string(properties_path.read_text())
properties_path.write_text(template.render(s3_bucket=s3_bucket))
!pygmentize lora-multi-adapter/serving.properties | cat -n

This section walks through how to serve a Python-based model with DJL Serving.

1. Define the model. Implement a Python source file named `model.py` as the entry point. DJL Serving handles each request by invoking a `handle` function that you provide, with the following signature: `def handle(inputs: Input)`

2. Optionally add dependencies. If your script needs other packages, include a `requirements.txt` file in the same directory as your model file to install them at runtime. A `requirements.txt` file is a text file listing the packages to install with `pip install`; you can also pin the version of each package.

In [None]:
%%writefile lora-multi-adapter/model.py
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig
from peft import PeftModel
import torch
import os
from djl_python.inputs import Input
from djl_python.outputs import Output
import logging

model = None
tokenizer = None

def generate_prompt(instruction, input=None):
    if input:
        return f"""Below is an instruction that describes a task, paired with an input that provides further context. 
        Write a response that appropriately completes the request. ### Instruction: {instruction} ### Input: {input} 
        ### Response:"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the 
        request.### Instruction: {instruction} ### Response:"""


def evaluate(
        instruction,
        adapters,
        input=None,
        max_new_tokens=64,
        **kwargs,
):
    prompts = []
    for inp in instruction:
        prompts.append(generate_prompt(inp, input))
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    input_ids = inputs["input_ids"].to(torch.cuda.current_device())
    attention_mask = inputs["attention_mask"].to(torch.cuda.current_device())
    generation_config = GenerationConfig(num_beams=1, do_sample=False)

    logging.info(f"using adapters: {adapters}")
    with torch.no_grad():
        generation_output = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            adapters=adapters,
            generation_config=generation_config,
            max_new_tokens=max_new_tokens,
        )
    output = tokenizer.batch_decode(generation_output, skip_special_tokens=True)
    return output


def load_model(model_id):
    model = LlamaForCausalLM.from_pretrained(
        model_id,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    tokenizer = LlamaTokenizer.from_pretrained(model_id)
    if not tokenizer.pad_token:
        tokenizer.pad_token = '[PAD]'
    logging.info(f"Loaded Base Model {model_id}")
    return model, tokenizer


def register_adapter(inputs: Input):
    """
    Registers lora adapter with the model.
    """
    global model
    adapter_name = inputs.get_property("name")
    adapter_model_id_or_path = inputs.get_property("src")
    logging.info(
        f"Registering adapter {adapter_name} from {adapter_model_id_or_path}")
    if isinstance(model, PeftModel):
        model.load_adapter(adapter_model_id_or_path, adapter_name)
    else:
        model = PeftModel.from_pretrained(model,
                                           adapter_model_id_or_path,
                                           adapter_name)


def handle(inputs: Input):
    global model, tokenizer
    if not model:
        properties = inputs.get_properties()
        model_id = properties.get("model_id")
        model, tokenizer = load_model(model_id)

    if inputs.is_empty():
        return None


    json_inputs = inputs.get_as_json()
    sentence = json_inputs.get("inputs")
    adapters = json_inputs.get("adapters", [])
    generation_kwargs = json_inputs.get("parameters", {})
    outputs = evaluate(sentence, adapters, **generation_kwargs)

    return Output().add_as_json(outputs)

DJL Serving supports model artifacts in model directory, .zip or .tar.gz format. To package model artifacts in a .tar.gz:

In [None]:
!rm -f model.tar.gz
!rm -rf lora-multi-adapter/.ipynb_checkpoints
!tar czvf model.tar.gz -C lora-multi-adapter .
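
Before uploading, you may want to confirm that the archive has the files at its top level rather than nested under an extra directory. The following is a minimal sketch using the standard-library `tarfile` module; the helper names and the list of required files are assumptions based on the files this notebook writes.

```python
import os
import tarfile


def archive_members(archive_path):
    """Return member paths inside a .tar.gz archive, without a leading './'."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(
            name[2:] if name.startswith("./") else name for name in tar.getnames()
        )


def missing_top_level_files(members, required=("serving.properties", "model.py")):
    """Return the required files that are not present at the archive root."""
    return [name for name in required if name not in members]


# Only inspect the archive if it has already been created.
if os.path.exists("model.tar.gz"):
    print(missing_top_level_files(archive_members("model.tar.gz")))
```

An empty list means both files were found at the archive root.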

## Create SageMaker Model and Endpoint

SageMaker expects the model artifact to be in S3, so we upload the tar artifact to S3.

In [None]:
s3_code_artifact_accelerate = sess.upload_data("model.tar.gz", bucket, s3_code_prefix)

We then retrieve the LMI container image URI and create the model from the S3 code artifact and the container image URI.

In [None]:
inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=region, version="0.25.0"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

In [None]:
model_name_acc = name_from_base("llama7b-lora")

create_model_response = sm_client.create_model(
    ModelName=model_name_acc,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, 
                      "ModelDataUrl": s3_code_artifact_accelerate,
                      "Environment": {"ENABLE_ADAPTERS_PREVIEW": "true"},
                     })
model_arn = create_model_response["ModelArn"]

We will deploy this endpoint to an `ml.g5.12xlarge` instance, which provides the 4 GPUs required by `option.tensor_parallel_degree=4`.

In [None]:
endpoint_config_name = f"{model_name_acc}-config"
endpoint_name = f"{model_name_acc}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name_acc,
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
            # "VolumeSizeInGB": 512
        },
    ],
)

In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

In [None]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
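
The polling loop above runs until the status leaves `Creating`, with no upper bound and no explicit failure handling. A slightly more defensive variant is sketched below. `wait_for_endpoint` is a hypothetical helper; it takes any callable that returns a dictionary with an `EndpointStatus` key, so it can be tried out without AWS access.

```python
import time


def wait_for_endpoint(describe, timeout_s=3600, poll_s=60, sleep=time.sleep):
    """Poll describe() until the status leaves "Creating" or the timeout elapses."""
    waited = 0
    status = describe()["EndpointStatus"]
    while status == "Creating":
        if waited >= timeout_s:
            raise TimeoutError(f"endpoint still Creating after {waited}s")
        sleep(poll_s)
        waited += poll_s
        status = describe()["EndpointStatus"]
    if status != "InService":
        # Any other terminal status (for example "Failed") is surfaced as an error.
        raise RuntimeError(f"endpoint ended in status {status}")
    return status
```

In this notebook it could be called as `wait_for_endpoint(lambda: sm_client.describe_endpoint(EndpointName=endpoint_name))`.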

## Make Inference Requests

We show how you can make requests targeting each of the adapters configured in the endpoint, as well as requests targeting just the base model with no adapters.

The following inference request targets both the `eng_alpaca` and `portuguese_alpaca` adapters (each adapter is named after its directory). The invocation batches prompts for multiple adapters into a single call, which demonstrates support for heterogeneous batched multi-LoRA requests.

In [None]:
%%time


response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": ["what is aws reinvent", "translate to english: Estou na AWS re:invent e estou gostando", "Does aws reinvent happen in Las Vegas?"],
                     "adapters": ["eng_alpaca", "portuguese_alpaca", "eng_alpaca"]}),
    ContentType="application/json",
)

response = response_model["Body"].read().decode("utf8").replace("[","").replace("]","")

In [None]:
response
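
Stripping brackets from the raw string is fragile if a generation itself contains `[` or `]`. As an alternative sketch, the body can be parsed as JSON and each generation paired with the adapter that produced it. This assumes the handler returns a JSON array of strings in the same order as `inputs`, which is what `Output().add_as_json(outputs)` in `model.py` suggests; `pair_outputs` and the sample body are hypothetical.

```python
import json


def pair_outputs(body_text, adapters):
    """Decode a JSON array of generations and pair each with its adapter name."""
    generations = json.loads(body_text)
    if len(generations) != len(adapters):
        raise ValueError("expected one generation per adapter entry")
    return list(zip(adapters, generations))


# Stand-in body for illustration; a real body would come from
# response_model["Body"].read().decode("utf8").
sample_body = json.dumps(["first answer", "second answer"])
for adapter, text in pair_outputs(sample_body, ["eng_alpaca", "portuguese_alpaca"]):
    print(f"{adapter}: {text}")
```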

You may want to post-process the response depending on your requirements. Below is an example of an inference request targeting only the `portuguese_alpaca` adapter (named after the adapter's directory).

In [None]:
%%time

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": ["Translate this sentence from portugese to english: Estou na AWS re:invent e estou gostando"],
            "adapters": ["portuguese_alpaca"],
        }
    ),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")
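
Both requests above follow the same shape: a list of prompts in `inputs` and a parallel list of adapter names in `adapters`. The handler in `model.py` also reads an optional `parameters` object and forwards it to `evaluate`, which accepts `max_new_tokens`. The helper below is a hypothetical sketch that builds such a body; the one-adapter-per-prompt check is an assumption drawn from the examples in this notebook.

```python
import json


def build_payload(inputs, adapters, max_new_tokens=None):
    """Build the JSON request body read by the handler in model.py."""
    if len(inputs) != len(adapters):
        raise ValueError("provide exactly one adapter name per input prompt")
    payload = {"inputs": list(inputs), "adapters": list(adapters)}
    if max_new_tokens is not None:
        # Forwarded by the handler as a keyword argument to evaluate().
        payload["parameters"] = {"max_new_tokens": max_new_tokens}
    return json.dumps(payload)


body = build_payload(
    ["what is aws reinvent"],
    ["eng_alpaca"],
    max_new_tokens=128,
)
print(body)
```

The resulting string can be passed as `Body` to `smr_client.invoke_endpoint` with `ContentType="application/json"`.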

### Congrats, you have successfully completed Lab3.

Please proceed to the clean-up section.

## Clean up Resources

In [None]:
sm_client.delete_endpoint(EndpointName=endpoint_name)

In [None]:
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name_acc)

### (Optional) Fine-tune and deploy Llama-2-7B multi-LoRA adapter model

Learn how to fine-tune LoRA adapters on your own dataset and set up multi-adapter inference using the optional notebooks included in this lab.

#### Fine-tuning

The following PEFT LoRA configuration is set during the fine-tuning process. The Llama 7B model is fine-tuned on different datasets, producing 3 adapters that are uploaded to an S3 location.

The key parameters are:

- task_type: Sets this as a Causal LM task (generating text auto-regressively)
- inference_mode: False indicates this is for training, not inference/evaluation
- r: The rank for the LoRA decomposition (defaults to 8)
- lora_alpha: The alpha parameter for LoRA (defaults to 32)
- lora_dropout: The dropout rate for LoRA (defaults to 0.05)
- target_modules: The transformer modules to apply LoRA to. Here it is commented out, so PEFT selects its default target modules for the model architecture rather than a user-specified list.
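
The parameters above can be collected as keyword arguments for PEFT's `LoraConfig`. The sketch below keeps them in a plain dictionary so it runs without `peft` installed; the values mirror the defaults listed above and are assumptions about the optional fine-tuning notebook, not values read from it. It also prints the LoRA scaling factor, `lora_alpha / r`, which multiplies the low-rank update.

```python
# Keyword arguments mirroring the parameters listed above (assumed values).
lora_kwargs = {
    "task_type": "CAUSAL_LM",   # causal language modeling
    "inference_mode": False,    # training, not inference
    "r": 8,                     # rank of the low-rank decomposition
    "lora_alpha": 32,           # scaling numerator
    "lora_dropout": 0.05,       # dropout applied on the LoRA path
    # "target_modules" is left unset, as in the description above.
}

# With peft installed, this dictionary could be used as:
#   from peft import LoraConfig
#   config = LoraConfig(**lora_kwargs)

scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(f"LoRA scaling factor: {scaling}")
```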

For more details on the fine-tuning process, please refer to the `(optional)-llama7b-lora-fine-tune.ipynb` notebook and associated scripts.

#### Multi-LoRA Inference

For more details on multi-LoRA inference, please refer to the `(optional)-llama7b-lora-adapter.ipynb` notebook and associated scripts.