# Serving LoRA-based Llama 2 and Mistral adapters with high performance on SageMaker 


This notebook demonstrates how to deploy multiple base models and their fine-tuned LoRA adapters on SageMaker using the DJL Serving Large Model Inference (LMI) DLC. LoRA (Low-Rank Adaptation) is a technique for fine-tuning large language models that significantly reduces the number of trainable parameters compared to traditional fine-tuning while achieving comparable or superior performance. You can learn more about the technique in the [LoRA paper](https://arxiv.org/abs/2106.09685).

A major benefit of LoRA is that fine-tuned adapters can easily be added to and removed from the base model, which makes switching adapters cheap enough to do at runtime. In this notebook we show how to deploy a SageMaker endpoint with a single base model and multiple LoRA adapters, and how to select a different adapter for each request.
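To see why swapping adapters is cheap, it helps to recall the update that LoRA learns. The expression below is the standard formulation from the LoRA paper rather than anything specific to this notebook; the symbols $W_0$, $A$, $B$, $r$ and $\alpha$ are introduced here only for illustration.

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad W_0 \in \mathbb{R}^{d \times k},\; B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

The base weight $W_0$ stays frozen and is shared by every request. Each adapter contributes only its own pair $(A, B)$, so serving a different adapter amounts to applying a different low-rank term on top of the same $W_0$.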

Because a LoRA adapter is much smaller than its base model (realistically 100x to 1000x smaller), we can deploy an endpoint with a single base model and multiple LoRA adapters using far less hardware than deploying an equivalent number of fully fine-tuned models.
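The size gap can be checked with some quick arithmetic. The sketch below counts the parameters of a hypothetical adapter; the hidden size, layer count, target modules, rank and base parameter count are assumptions chosen to resemble a 7B Llama-style model, not values read from the adapters used in this notebook.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in) and B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed, illustrative setup (not taken from any adapter in this notebook).
HIDDEN_SIZE = 4096           # hidden dimension of a 7B Llama-style model
NUM_LAYERS = 32              # number of transformer layers
TARGETS_PER_LAYER = 2        # e.g. adapting only two attention projections
RANK = 16                    # LoRA rank
BASE_PARAMS = 6_700_000_000  # roughly 6.7 billion base parameters

per_matrix = lora_param_count(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
adapter_params = per_matrix * TARGETS_PER_LAYER * NUM_LAYERS
ratio = BASE_PARAMS / adapter_params

print(f"adapter parameters: {adapter_params:,}")
print(f"base model is about {ratio:.0f}x larger")
```

Under these assumptions the adapter lands inside the 100x to 1000x range quoted above. Adapters that use a higher rank or target more modules sit closer to the 100x end.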

In this notebook, we deploy [llama2-7B](https://huggingface.co/TheBloke/Llama-2-7b-fp16) and [mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1) as the base models, together with LoRA adapters fine-tuned for specific languages, on the same SageMaker endpoint (shown below) using the [SageMaker Large Model Inference Container](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers).

![Multiple adapters for Llama 2 as a base model](adapter-basemodel.png)

The LMI container offers out-of-the-box integration with SageMaker for hosting multiple LoRA adapters with high performance (low latency and high throughput) using the [vLLM](https://docs.vllm.ai/en/latest/models/lora.html) library, which builds on [S-LoRA](https://github.com/S-LoRA/S-LoRA) and [Punica](https://arxiv.org/pdf/2310.18547.pdf). S-LoRA stores all adapters in main memory and fetches the adapters used by the currently running queries into GPU memory. To use GPU memory efficiently and reduce fragmentation, S-LoRA proposes Unified Paging, which uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead.
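The split between host memory and GPU memory described above can be pictured with a toy cache. This is only a conceptual sketch: the class name, the string stand-ins for adapter weights and the least-recently-used eviction rule are assumptions made for illustration, and the code does not reproduce S-LoRA's Unified Paging or its CUDA kernels.

```python
from collections import OrderedDict


class TwoTierAdapterCache:
    """Toy model: every adapter lives in host memory, only a few are resident on the GPU."""

    def __init__(self, host_adapters: dict, max_gpu_adapters: int):
        self.host = dict(host_adapters)  # all adapters (stand-in for main memory)
        self.gpu = OrderedDict()         # resident adapters, least recently used first
        self.max_gpu_adapters = max_gpu_adapters

    def fetch(self, name: str):
        """Return the adapter for a request, loading it onto the 'GPU' if needed."""
        if name in self.gpu:
            self.gpu.move_to_end(name)   # mark as most recently used
            return self.gpu[name]
        if name not in self.host:
            raise KeyError(f"unknown adapter: {name}")
        if len(self.gpu) >= self.max_gpu_adapters:
            self.gpu.popitem(last=False)  # evict the least recently used adapter
        self.gpu[name] = self.host[name]
        return self.gpu[name]


cache = TwoTierAdapterCache(
    {"es": "es-weights", "fr": "fr-weights", "ru": "ru-weights"},
    max_gpu_adapters=2,
)
for requested in ["es", "fr", "es", "ru"]:
    cache.fetch(requested)

print(list(cache.gpu))  # adapters currently resident on the simulated GPU
```

The point of the sketch is that the number of adapters a server can offer is bounded by host memory, while the number it can run at the same moment is bounded by GPU memory.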

The diagram below shows the multi-LoRA-adapter serving stack of the LMI container on SageMaker:
![Multi LoRA-Adapter serving stack of LMI container on SageMaker](LoRA-LMI-SageMaker.png)

# License agreement
 - View the license information at https://huggingface.co/meta-llama before using the model.
 - This notebook is a sample and is not intended for production use. Please refer to the license at https://github.com/aws/mit-0.

## Install, import the required libraries; set some variables

In [1]:
!pip install sagemaker boto3 huggingface_hub awscli --upgrade --quiet

In [2]:
import sagemaker
import jinja2
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path
from huggingface_hub import snapshot_download

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


In [3]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts

In [4]:
model_bucket = sess.default_bucket()  # bucket to house model artifacts
s3_code_prefix = "hf-large-model-djl/multi-lora/Llama-2-7b-fp16/code"  # folder within bucket where model/code artifacts will go
s3_code_prefix2 = "hf-large-model-djl/multi-lora/Mistral-7b-fp16/code"  # folder within bucket where model/code artifacts will go

region = sess.boto_region_name  # public attribute; avoids relying on the private _region_name
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

jinja_env = jinja2.Environment()

We will deploy an endpoint whose Llama 2 base model can serve several LoRA adapters. Three adapters are listed here; the download cell for the Russian adapter is commented out below, so only the Spanish and French adapters are packaged as written:

- Base Model: https://huggingface.co/TheBloke/Llama-2-7B-Chat-fp16
- LoRA Fine-Tuned Adapter 1: https://huggingface.co/UnderstandLing/llama-2-7b-chat-ru
- LoRA Fine-Tuned Adapter 2: https://huggingface.co/UnderstandLing/llama-2-7b-chat-es
- LoRA Fine-Tuned Adapter 3: https://huggingface.co/UnderstandLing/llama-2-7b-chat-fr

The core structure to cover here is the model directory. We include both the base model and LoRA adapters in the model directory like this:

```
|- model_dir
    |- adapters/
        |--- <adapter_1>/
        |--- <adapter_2>/
        |--- ...
        |--- <adapter_n>/
```

It is also possible to keep the model files in a separate S3 location by setting `option.model_id` in `serving.properties` to that S3 URI. In this case, the adapters directory can be located either alongside `serving.properties` or alongside the model files in S3.

Each subdirectory of `adapters` contains the artifacts for one LoRA adapter. Typically there are two files: the adapter weights (`adapter_model.bin`, or `adapter_model.safetensors` as in the adapters downloaded below) and the adapter configuration (`adapter_config.json`). These are typically produced by the PEFT library via the `PeftModel.save_pretrained()` method.
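Before packaging, it can be worth checking that every adapter directory actually contains these files. The helper below is a minimal sketch: the function name and the messages are hypothetical, and it only checks for file presence, not for valid contents.

```python
from pathlib import Path

# Either weights file is accepted, matching the two formats mentioned above.
WEIGHT_FILES = ("adapter_model.bin", "adapter_model.safetensors")


def find_incomplete_adapters(model_dir: str) -> dict:
    """Map adapter name -> list of problems; an empty dict means every adapter looks complete."""
    adapters_dir = Path(model_dir) / "adapters"
    if not adapters_dir.is_dir():
        return {"<adapters>": ["missing adapters/ directory"]}

    problems = {}
    for adapter in sorted(p for p in adapters_dir.iterdir() if p.is_dir()):
        issues = []
        if not (adapter / "adapter_config.json").is_file():
            issues.append("missing adapter_config.json")
        if not any((adapter / name).is_file() for name in WEIGHT_FILES):
            issues.append("missing adapter weights")
        if issues:
            problems[adapter.name] = issues
    return problems
```

Once the adapters have been downloaded, calling `find_incomplete_adapters("llama-lora-multi-adapter")` should return an empty dictionary if each adapter has both files.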

In [5]:
!rm -rf llama-lora-multi-adapter
!mkdir -p llama-lora-multi-adapter/adapters
!echo "Lora Multi Adapter Model" > llama-lora-multi-adapter/README.txt

In [6]:
# snapshot_download("UnderstandLing/llama-2-7b-chat-ru", local_dir="llama-lora-multi-adapter/adapters/ru", local_dir_use_symlinks=False)

In [7]:
snapshot_download("UnderstandLing/llama-2-7b-chat-es", local_dir="llama-lora-multi-adapter/adapters/es", local_dir_use_symlinks=False)

For more details, check out https://huggingface.co/docs/huggingface_hub/main/en/guides/download#download-files-to-local-folder.


Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

README.md:   0%|          | 0.00/5.51k [00:00<?, ?B/s]

adapter_config.json:   0%|          | 0.00/498 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.06k [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/134M [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

'/home/sagemaker-user/sagemaker-genai-hosting-examples/LORA-adapters-IC/llama-lora-multi-adapter/adapters/es'

In [8]:
snapshot_download("UnderstandLing/llama-2-7b-chat-fr", local_dir="llama-lora-multi-adapter/adapters/fr", local_dir_use_symlinks=False)

Fetching 9 files:   0%|          | 0/9 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/435 [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.51k [00:00<?, ?B/s]

adapter_config.json:   0%|          | 0.00/498 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/134M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.06k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

'/home/sagemaker-user/sagemaker-genai-hosting-examples/LORA-adapters-IC/llama-lora-multi-adapter/adapters/fr'

In [9]:
# !rm -f model.tar.gz
# !rm -rf llama-lora-multi-adapter/.ipynb_checkpoints
# !tar czvf model.tar.gz -C llama-lora-multi-adapter .

In [10]:
!rm -f adapters.tar.gz
!rm -rf llama-lora-multi-adapter/.ipynb_checkpoints
!tar czvf adapters.tar.gz -C llama-lora-multi-adapter .

./
./adapters/
./adapters/es/
./adapters/es/.huggingface/
./adapters/es/.huggingface/.gitignore
./adapters/es/.huggingface/download/
./adapters/es/.huggingface/download/.gitattributes.lock
./adapters/es/.huggingface/download/README.md.lock
./adapters/es/.huggingface/download/adapter_config.json.lock
./adapters/es/.huggingface/download/adapter_model.safetensors.lock
./adapters/es/.huggingface/download/added_tokens.json.lock
./adapters/es/.huggingface/download/special_tokens_map.json.lock
./adapters/es/.huggingface/download/tokenizer.json.lock
./adapters/es/.huggingface/download/tokenizer.model.lock
./adapters/es/.huggingface/download/README.md.metadata
./adapters/es/.huggingface/download/tokenizer_config.json.lock
./adapters/es/.huggingface/download/adapter_config.json.metadata
./adapters/es/.huggingface/download/special_tokens_map.json.metadata
./adapters/es/.huggingface/download/added_tokens.json.metadata
./adapters/es/.huggingface/download/.gitattributes.metadata
./adapters/es/.huggi
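The listing above shows that the archive also picks up the `.huggingface` download bookkeeping (lock and metadata files), which are not adapter artifacts. If you prefer a leaner archive, a Python alternative to the `tar` command could look like the sketch below. The function name and the set of excluded directory names are assumptions; the notebook itself uploads the archive produced by the shell command.

```python
import tarfile
from pathlib import Path

# Directory names to leave out of the archive (assumed to be safe to drop).
EXCLUDED_DIRS = {".huggingface", ".ipynb_checkpoints"}


def pack_model_dir(src_dir: str, out_path: str) -> list:
    """Write a gzip tarball of src_dir, skipping anything under an excluded directory."""
    src = Path(src_dir)
    added = []
    with tarfile.open(out_path, "w:gz") as tar:
        for path in sorted(src.rglob("*")):
            rel = path.relative_to(src)
            if EXCLUDED_DIRS.intersection(rel.parts):
                continue  # the path is inside (or is) an excluded directory
            # recursive=False because rglob already visits every child itself
            tar.add(path, arcname=rel.as_posix(), recursive=False)
            added.append(rel.as_posix())
    return added


# Example usage (write the archive outside the source directory):
# pack_model_dir("llama-lora-multi-adapter", "adapters.tar.gz")
```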

In [11]:
s3_code_artifact_accelerate = sess.upload_data("adapters.tar.gz", model_bucket, s3_code_prefix)

### Select the appropriate configuration parameters and container

To optimize the deployment of Large Language Models (LLMs), you need to choose the appropriate model partitioning framework, batching technique, batch size, tensor parallelism degree, and so on. The right configuration depends on the use case.

Based on the use case, you need to:
1. Set the configuration parameters for the container.
2. Select the appropriate container image to be used for inference.

### Set the configuration parameters using environment variables
1. `SERVING_LOAD_MODELS`: Specifies the engine that will be used for this workload. In this notebook the Mistral configuration sets it so that the model is hosted with the **Python** engine.

2. `OPTION_MODEL_ID`: Set this to the URI of the Amazon S3 location that contains the model. When this is set, the container uses [s5cmd](https://github.com/peak/s5cmd) to download the model from S3, which speeds up deployment by using the DJL inference container's optimized path for transferring the model from S3 to the hosting instance.
Alternatively, to download the model from huggingface.co, set `OPTION_MODEL_ID` to the model id of a pre-trained model hosted in a model repository on huggingface.co (https://huggingface.co/models). The container uses this model id to download the corresponding repository. This notebook uses Hugging Face model ids.

3. `OPTION_TENSOR_PARALLEL_DEGREE`: Set to the number of GPU devices over which the model is partitioned. This parameter also controls the number of workers per model that are started when DJL Serving runs. The `ml.g5.12xlarge` instance used in this example has 4 GPUs, but the configuration below sets this value to 1 because each inference component reserves a single accelerator, which leaves the remaining GPUs available for other base models on the same instance.

4. `OPTION_ROLLING_BATCH`: Enables a particular technique for continuous (iteration-level) batching, which merges multiple concurrent requests that arrive at different times into shared inference batches.
Scenarios that involve open-ended generation and chatbots need high throughput. [vLLM](https://arxiv.org/pdf/2309.06180.pdf) is a fast LLM inference and serving framework that uses techniques like PagedAttention and continuous batching to improve throughput. The Llama 2 configuration in this notebook sets `OPTION_ROLLING_BATCH` to `lmi-dist`; `vllm` is another supported value. When using `vllm`, you can also use some [additional parameters](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/configurations_large_model_inference_containers.md#vllm).

5. `OPTION_MAX_ROLLING_BATCH_SIZE`: The maximum number of concurrent requests that the model server combines into a batch for inference. Clients can still send more requests to the endpoint; the extra requests are queued.

6. `OPTION_ENABLE_LORA`: This config enables support for LoRA adapters. Default: false.

7. `OPTION_MAX_LORAS`: The maximum number of LoRA adapters that can run at once. A higher value allocates more GPU memory for adapters. Default: 4

8. `OPTION_MAX_LORA_RANK`: The maximum rank allowed for a LoRA adapter. A larger value admits more adapters (those with higher ranks) at a greater memory cost. Default: 16

9. `OPTION_LORA_EXTRA_VOCAB_SIZE`: The maximum additional vocabulary that can be added through a LoRA adapter. Default: 256

10. `OPTION_MAX_CPU_LORAS`: The maximum number of LoRA adapters to cache in CPU memory. All others are evicted to disk. Default: None


For more information on the available options, please refer to the [DJL Serving - SageMaker Large Model Inference Configurations](https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/configurations_large_model_inference_containers.md)
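The same settings can be expressed either as environment variables or as `option.*` entries in `serving.properties` (the `option.model_id` key mentioned earlier is one example). The sketch below converts one form to the other, assuming the LMI naming convention in which `OPTION_FOO_BAR` corresponds to `option.foo_bar`. The function name is hypothetical; check the configuration guide linked above for the authoritative key names.

```python
def env_to_serving_properties(env: dict) -> str:
    """Render OPTION_* variables as option.* lines; other variables are skipped."""
    prefix = "OPTION_"
    lines = []
    for key, value in env.items():
        if key.startswith(prefix):
            lines.append(f"option.{key[len(prefix):].lower()}={value}")
    return "\n".join(lines)


example_env = {
    "OPTION_MODEL_ID": "TheBloke/Llama-2-7B-Chat-fp16",
    "OPTION_ENABLE_LORA": "true",
    "OPTION_MAX_LORA_RANK": "64",
    "HUGGINGFACE_HUB_CACHE": "/tmp",  # not an OPTION_* variable, so it is skipped
}

print(env_to_serving_properties(example_env))
```

Keeping the configuration in environment variables, as this notebook does, means the uploaded archive only needs to contain the adapters.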


### Select the relevant Large Model Inference container
SageMaker offers optimized [large model inference containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers) that contain different frameworks for model parallelism, enabling inference of LLMs on multiple GPUs.

In this scenario, since we are leveraging `vllm` as the batching technique, we leverage the `deepspeed` container that has frameworks like deepspeed, vllm, etc.

In [12]:
deepspeed_image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=sess.boto_session.region_name,
    version="0.27.0"
)

env_generation = {"HUGGINGFACE_HUB_CACHE": "/tmp",
                  "TRANSFORMERS_CACHE": "/tmp",
                  "OPTION_MODEL_ID": "TheBloke/Llama-2-7B-Chat-fp16",
                  "OPTION_TRUST_REMOTE_CODE": "true",
                  "OPTION_TENSOR_PARALLEL_DEGREE": "1",
                  "OPTION_ROLLING_BATCH": "lmi-dist",
                  "OPTION_MAX_ROLLING_BATCH_SIZE": "32",
                  "OPTION_DTYPE": "fp16",
                  "OPTION_ENABLE_LORA": "true",
                  "OPTION_GPU_MEMORY_UTILIZATION": "0.8",
                  "OPTION_MAX_LORA_RANK": "64",
                  "OPTION_MAX_CPU_LORAS": "4"
                 }

In [13]:
# - Select the appropriate environment variable which will tune the deployment server.
env = env_generation # use this in case it is 'generation' task 
# - now we select the appropriate container 
inference_image_uri = deepspeed_image_uri # use this in case it is 'generation' task 
#inference_image_uri = trtllm_image_uri # enable this in case your use case is summarization ( high input and medium output sizes ) 

print(f"Environment variables are ---- > {env}")
print(f"Image going to be used is ---- > {inference_image_uri}")

Environment variables are ---- > {'HUGGINGFACE_HUB_CACHE': '/tmp', 'TRANSFORMERS_CACHE': '/tmp', 'OPTION_MODEL_ID': 'TheBloke/Llama-2-7B-Chat-fp16', 'OPTION_TRUST_REMOTE_CODE': 'true', 'OPTION_TENSOR_PARALLEL_DEGREE': '1', 'OPTION_ROLLING_BATCH': 'lmi-dist', 'OPTION_MAX_ROLLING_BATCH_SIZE': '32', 'OPTION_DTYPE': 'fp16', 'OPTION_ENABLE_LORA': 'true', 'OPTION_GPU_MEMORY_UTILIZATION': '0.8', 'OPTION_MAX_LORA_RANK': '64', 'OPTION_MAX_CPU_LORAS': '4'}
Image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121


### Create an endpoint config
Create an endpoint configuration using the appropriate instance type. Set `ContainerStartupHealthCheckTimeoutInSeconds` to cover both the time taken to download the LLM weights from S3 or the model hub and the time taken to load the model onto the GPUs.

In [14]:
model_name = sagemaker.utils.name_from_base("lmi-llama2-7b")
print(model_name)

endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

# Set variant name and instance type for hosting
variant_name = "AllTraffic"
instance_type = "ml.g5.12xlarge"
model_data_download_timeout_in_seconds = 1600
container_startup_health_check_timeout_in_seconds = 1600
initial_instance_count = 1
max_instance_count = 1

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ExecutionRoleArn=role,
    ProductionVariants=[
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": initial_instance_count,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": initial_instance_count,
                "MaxInstanceCount": max_instance_count,
            },
            "RoutingConfig": {
                'RoutingStrategy': 'LEAST_OUTSTANDING_REQUESTS'
            },
        },
    ],
)
endpoint_config_response

lmi-llama2-7b-2024-05-14-02-51-09-214


{'EndpointConfigArn': 'arn:aws:sagemaker:us-east-1:972812897072:endpoint-config/lmi-llama2-7b-2024-05-14-02-51-09-214-config',
 'ResponseMetadata': {'RequestId': 'fd5c58f0-1ccf-42ba-8afa-0d14a7b2b09f',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'fd5c58f0-1ccf-42ba-8afa-0d14a7b2b09f',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '125',
   'date': 'Tue, 14 May 2024 02:51:09 GMT'},
  'RetryAttempts': 0}}

### Create an endpoint using the endpoint config

With the endpoint configuration in place, create the endpoint itself by passing the endpoint name and the name of the endpoint configuration.

In this notebook we use the boto3 SDK. You can also use the [SageMaker SDK](https://sagemaker.readthedocs.io/en/stable/).

In [15]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws:sagemaker:us-east-1:972812897072:endpoint/lmi-llama2-7b-2024-05-14-02-51-09-214-endpoint


#### This step can take ~10 mins or longer

In [16]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-east-1:972812897072:endpoint/lmi-llama2-7b-2024-05-14-02-51-09-214-endpoint
Status: InService
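The polling loop above runs until the status changes, however long that takes. A variant with a timeout is sketched below. The helper name, its parameters and the default set of in-progress states are assumptions made for this illustration; the `describe` argument is any zero-argument callable that returns a describe-style dictionary.

```python
import time


def wait_for_status(describe, status_key, in_progress=("Creating", "Updating"),
                    poll_seconds=30.0, timeout_seconds=1800.0, sleep=time.sleep):
    """Poll describe() until status_key leaves the in-progress states or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe()[status_key]
        if status not in in_progress:
            return status  # terminal state, e.g. "InService" or "Failed"
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Example usage with the clients defined in this notebook:
# status = wait_for_status(
#     lambda: sm_client.describe_endpoint(EndpointName=endpoint_name),
#     "EndpointStatus",
# )
```

Passing the describe call as a callable lets the same helper wait on an endpoint (`"EndpointStatus"`) or on an inference component (`"InferenceComponentStatus"`).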


## Create an inference component to your endpoint for Llama-2-7B-chat with LoRA adapters
Inference components can reuse a SageMaker model that you have already created. You also have the option to specify your artifacts and container directly when creating an inference component (shown as a commented-out alternative below). In this example we create a SageMaker model so that you can reference it later.

### Create the Model

Use the `inference_image_uri` to create a model object. The endpoint configuration created earlier uses the [least outstanding requests routing strategy](https://aws.amazon.com/blogs/machine-learning/minimize-real-time-inference-latency-by-using-amazon-sagemaker-routing-strategies/). This SageMaker feature has been shown to reduce latency by 10% or more when multiple instances are configured to serve the endpoint.

In [17]:
model_name = sagemaker.utils.name_from_base("lmi-llama2-7b")
print(model_name)

create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "Environment": env,
        "ModelDataUrl": s3_code_artifact_accelerate,
    }
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

lmi-llama2-7b-2024-05-14-02-55-10-800
Created Model: arn:aws:sagemaker:us-east-1:972812897072:model/lmi-llama2-7b-2024-05-14-02-55-10-800


### Create Inference Component

We can now create our inference component. Note below that we specify an inference component name. You can use this name to update your inference component or to view its metrics and logs in CloudWatch. You also need to set `ComputeResourceRequirements`, which tells SageMaker how much of each resource to reserve for each copy of your inference component. Finally, we set the number of copies to deploy. The number of copies can be managed through autoscaling policies.
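The cell that creates the inference component requests `7*2*1024` MB of memory. One plausible reading of that expression, and it is only an interpretation, is roughly 7 billion parameters times 2 bytes per fp16 parameter, converted from GB to MB. The hypothetical helper below makes the arithmetic explicit so it can be adapted to other model sizes or dtypes.

```python
def weights_memory_mb(params_billion: float, bytes_per_param: int) -> int:
    """Approximate memory for the model weights alone, in MB (1 GB taken as 1024 MB)."""
    return int(params_billion * bytes_per_param * 1024)


print(weights_memory_mb(7, 2))  # 14336, the same value as 7*2*1024
```

This estimate covers the weights only. Adapters, KV cache and runtime overhead are not included, so treat it as a lower bound rather than a sizing recommendation.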

In [18]:
prefix = sagemaker.utils.unique_name_from_base("lmi-llama2-7b")

inference_component_name = f"{prefix}-inference-component"
print(f"Demo inference component name: {inference_component_name}:: endpoint_name={endpoint_name}")

Demo inference component name: lmi-llama2-7b-1715655311-a135-inference-component:: endpoint_name=lmi-llama2-7b-2024-05-14-02-51-09-214-endpoint


In [19]:
sm_client.create_inference_component(
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": model_name,
        # "Container": {
        #     "Image": inference_image_uri,
        #     "ArtifactUrl": s3_code_artifact,
        # },
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": 1200,
            "ContainerStartupHealthCheckTimeoutInSeconds": 1200,
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 7*2*1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)

{'InferenceComponentArn': 'arn:aws:sagemaker:us-east-1:972812897072:inference-component/lmi-llama2-7b-1715655311-a135-inference-component',
 'ResponseMetadata': {'RequestId': '24e5628c-9dd6-4e16-8845-32234174da65',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '24e5628c-9dd6-4e16-8845-32234174da65',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '138',
   'date': 'Tue, 14 May 2024 02:55:11 GMT'},
  'RetryAttempts': 0}}

#### This step can take ~15 mins or longer

In [20]:
import sys

while True:
    desc = sm_client.describe_inference_component(
        InferenceComponentName=inference_component_name
    )
    status = desc["InferenceComponentStatus"]
    print("Status: " + status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService


### Invoke the endpoint with a sample prompt

In [21]:
params = { "max_new_tokens": 100}  

In [22]:
%%time
# Testing Spanish (es) adapter
response_model = smr_client.invoke_endpoint(
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": ["Piensa en una excusa creativa para decir que no necesito ir a la fiesta."],
                     "parameters": params,
                     "adapters": ["es"]}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

CPU times: user 15.9 ms, sys: 2.83 ms, total: 18.7 ms
Wall time: 3.55 s


'{"generated_text": "\\n\\nEsta es una lista de excusas creativas para decir que no necesitas ir a la fiesta:\\n\\n1. Tengo que hacer una tarea urgente para la escuela.\\n2. Tengo que ir a la tienda a comprar algo importante.\\n3. Tengo que hacer una llamada importante para mi trabajo.\\n4. Tengo que hacer una llamada importante para mi trabajo.\\n5. Tengo que hacer una llam"}'

In [23]:
%%time
# Testing French (fr) adapter
response_model = smr_client.invoke_endpoint(
    InferenceComponentName=inference_component_name,
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": ["Pensez à une excuse créative pour dire que je n'ai pas besoin d'aller à la fête."],
                     "parameters": params,
                     "adapters": ["fr"]}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

CPU times: user 4.46 ms, sys: 0 ns, total: 4.46 ms
Wall time: 2.98 s


'{"generated_text": "\\n\\nPensez à une excuse créative pour dire que je n\'ai pas besoin d\'aller à la fête."}'

In [None]:
# %%time
# Testing Russian (ru) adapter
# response_model = smr_client.invoke_endpoint(
#     InferenceComponentName=inference_component_name,
#     EndpointName=endpoint_name,
#     Body=json.dumps({"inputs": ["Придумайте креативное оправдание, чтобы сказать, что мне не нужно идти на вечеринку."],
#                      "parameters": params,
#                      "adapters": ["ru"]}),
#     ContentType="application/json",
# )

# response_model["Body"].read().decode("utf8")
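The invocation cells above all build the same kind of request body and differ only in the prompt and the adapter name. The sketch below factors that out. The helper names are hypothetical, and the response parsing assumes the single-prompt response shape shown in the outputs above (a JSON object with a `generated_text` field).

```python
import json


def build_request_body(prompt: str, adapter: str, max_new_tokens: int = 100) -> str:
    """Serialize one prompt together with the adapter that should handle it."""
    return json.dumps({
        "inputs": [prompt],
        "parameters": {"max_new_tokens": max_new_tokens},
        "adapters": [adapter],
    })


def extract_generated_text(raw_body: str) -> str:
    """Pull the generated text out of a JSON response body."""
    return json.loads(raw_body)["generated_text"]


# Example usage with the clients defined in this notebook:
# response = smr_client.invoke_endpoint(
#     InferenceComponentName=inference_component_name,
#     EndpointName=endpoint_name,
#     Body=build_request_body("Hola, ¿cómo estás?", adapter="es"),
#     ContentType="application/json",
# )
# print(extract_generated_text(response["Body"].read().decode("utf8")))
```

The adapter name passed in `adapters` must match the name of a subdirectory under `adapters/` in the uploaded archive, such as `es` or `fr`.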

## Mistral 7B with LoRA adapter
To showcase multiple base models with their LoRA adapters, we add another base model, mistralai/Mistral-7B-v0.1, and its LoRA adapter to the same SageMaker endpoint. Here we repeat the steps used for the Llama 2 base model and its LoRA adapters, this time for Mistral 7B.

In [24]:
!rm -rf mistral-lora-multi-adapter
!mkdir -p mistral-lora-multi-adapter/adapters
!echo "Lora Multi Adapter Model" > mistral-lora-multi-adapter/README.txt

In [25]:
snapshot_download("CATIE-AQ/mistral7B-FR-InstructNLP-LoRA", local_dir="mistral-lora-multi-adapter/adapters/fr", local_dir_use_symlinks=False)

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/13.0k [00:00<?, ?B/s]

adapter_config.json:   0%|          | 0.00/539 [00:00<?, ?B/s]

adapter_model.bin:   0%|          | 0.00/85.2M [00:00<?, ?B/s]

'/home/sagemaker-user/sagemaker-genai-hosting-examples/LORA-adapters-IC/mistral-lora-multi-adapter/adapters/fr'

In [28]:
!rm -f adapters.tar.gz
!rm -rf mistral-lora-multi-adapter/.ipynb_checkpoints
!tar czvf adapters.tar.gz -C mistral-lora-multi-adapter .

./
./adapters/
./adapters/fr/
./adapters/fr/.huggingface/
./adapters/fr/.huggingface/.gitignore
./adapters/fr/.huggingface/download/
./adapters/fr/.huggingface/download/.gitattributes.lock
./adapters/fr/.huggingface/download/README.md.lock
./adapters/fr/.huggingface/download/adapter_config.json.lock
./adapters/fr/.huggingface/download/adapter_model.bin.lock
./adapters/fr/.huggingface/download/.gitattributes.metadata
./adapters/fr/.huggingface/download/adapter_config.json.metadata
./adapters/fr/.huggingface/download/README.md.metadata
./adapters/fr/.huggingface/download/adapter_model.bin.metadata
./adapters/fr/.gitattributes
./adapters/fr/adapter_config.json
./adapters/fr/README.md
./adapters/fr/adapter_model.bin
./README.txt


In [29]:
s3_code_artifact_accelerate = sess.upload_data("adapters.tar.gz", model_bucket, s3_code_prefix2)

Note: mistralai/Mistral-7B-v0.1 is gated. Access to model mistralai/Mistral-7B-v0.1 is restricted and you must be in the authorized list to use it. Visit https://huggingface.co/mistralai/Mistral-7B-v0.1 to ask for access.

In [30]:
deepspeed_image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=sess.boto_session.region_name,
    version="0.27.0"
)

# Supply your own Hugging Face access token here; never commit a real token to a notebook.
my_hf_token = "<YOUR_HuggingFacePersonalAccessToken_HERE>"

env_generation = { #"HUGGINGFACE_HUB_CACHE": "/tmp",
                  "HF_TOKEN": my_hf_token,
                  "SERVING_LOAD_MODELS": "test::Python=/opt/ml/model",
                  "OPTION_MODEL_ID": "mistralai/Mistral-7B-v0.1",
                  "OPTION_TRUST_REMOTE_CODE": "true",
                  "OPTION_TENSOR_PARALLEL_DEGREE": "1",
                  "OPTION_ENABLE_LORA": "true",
                  "OPTION_GPU_MEMORY_UTILIZATION": "0.8",
                  "OPTION_MAX_LORA_RANK": "64",
                  "OPTION_MAX_CPU_LORAS": "4"
                 }

In [31]:
# - Select the appropriate environment variable which will tune the deployment server.
env = env_generation # use this in case it is 'generation' task 
# - now we select the appropriate container 
inference_image_uri = deepspeed_image_uri # use this in case it is 'generation' task 
# inference_image_uri = trtllm_image_uri # enable this in case your use case is summarization ( high input and medium output sizes ) 

print(f"Environment variables are ---- > {env}")
print(f"Image going to be used is ---- > {inference_image_uri}")

Environment variables are ---- > {'HF_TOKEN': '<redacted>', 'SERVING_LOAD_MODELS': 'test::Python=/opt/ml/model', 'OPTION_MODEL_ID': 'mistralai/Mistral-7B-v0.1', 'OPTION_TRUST_REMOTE_CODE': 'true', 'OPTION_TENSOR_PARALLEL_DEGREE': '1', 'OPTION_ENABLE_LORA': 'true', 'OPTION_GPU_MEMORY_UTILIZATION': '0.8', 'OPTION_MAX_LORA_RANK': '64', 'OPTION_MAX_CPU_LORAS': '4'}
Image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121


### Create an inference component/model for Mistral 7B


In [32]:
model_name2 = sagemaker.utils.name_from_base("lmi-mistral-7b")
print(model_name2)

create_model_response = sm_client.create_model(
    ModelName=model_name2,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "Environment": env,
        "ModelDataUrl": s3_code_artifact_accelerate,
    }
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

lmi-mistral-7b-2024-05-14-03-16-09-738
Created Model: arn:aws:sagemaker:us-east-1:972812897072:model/lmi-mistral-7b-2024-05-14-03-16-09-738


In [33]:
prefix = sagemaker.utils.unique_name_from_base("lmi-mistral-7b")

inference_component_name2 = f"{prefix}-inference-component"
print(f"Demo inference component name: {inference_component_name2}:: endpoint_name={endpoint_name}")

Demo inference component name: lmi-mistral-7b-1715656570-1c78-inference-component:: endpoint_name=lmi-llama2-7b-2024-05-14-02-51-09-214-endpoint


In [34]:
sm_client.create_inference_component(
    InferenceComponentName=inference_component_name2,
    EndpointName=endpoint_name,
    VariantName=variant_name,
    Specification={
        "ModelName": model_name2,
        # "Container": {
        #     "Image": inference_image_uri,
        #     "ArtifactUrl": s3_code_artifact,
        # },
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 1200,
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 7*2*1024,
            # "MinMemoryRequiredInMb": 1024,
        },
    },
    RuntimeConfig={"CopyCount": 1},
)

{'InferenceComponentArn': 'arn:aws:sagemaker:us-east-1:972812897072:inference-component/lmi-mistral-7b-1715656570-1c78-inference-component',
 'ResponseMetadata': {'RequestId': '415035a6-7725-4c20-9ab7-b10d49ba4096',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '415035a6-7725-4c20-9ab7-b10d49ba4096',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '139',
   'date': 'Tue, 14 May 2024 03:16:12 GMT'},
  'RetryAttempts': 0}}

In [35]:
inference_component_name2

'lmi-mistral-7b-1715656570-1c78-inference-component'

#### This step can take ~15 mins or longer

In [36]:
import sys

while True:
    desc = sm_client.describe_inference_component(
        InferenceComponentName=inference_component_name2
    )
    status = desc["InferenceComponentStatus"]
    print("Status: " + status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService


### Invoke the endpoint with a sample prompt

In [None]:
params = { "max_new_tokens": 100}

In [37]:
%%time
# Testing French (fr) adapter

response_model = smr_client.invoke_endpoint(
    InferenceComponentName=inference_component_name2,
    EndpointName=endpoint_name,
    Body=json.dumps({"inputs": ["Pensez à une excuse créative pour dire que je n'ai pas besoin d'aller à la fête."],
                     "parameters": params,
                     "adapters": ["fr"]}),
    ContentType="application/json",
)

response_model["Body"].read().decode("utf8")

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (0) from TPYNTkar7bxmTljCro42FHJPsiEqvzQZ696F with message "Your invocation timed out while waiting for a response from container TPYNTkar7bxmTljCro42FHJPsiEqvzQZ696F. Review the latency metrics for each container in Amazon CloudWatch, resolve the issue, and try again.". See https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logEventViewer:group=/aws/sagemaker/Endpoints/lmi-llama2-7b-2024-05-14-02-51-09-214-endpoint in account 972812897072 for more information.

## Clean up the environment

In [38]:
sm_client.delete_inference_component(InferenceComponentName=inference_component_name)
sm_client.delete_inference_component(InferenceComponentName=inference_component_name2)
sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)
sm_client.delete_model(ModelName=model_name2)

{'ResponseMetadata': {'RequestId': '6103241f-3a92-4513-8b96-8234074fc28f',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '6103241f-3a92-4513-8b96-8234074fc28f',
   'content-type': 'application/x-amz-json-1.1',
   'date': 'Tue, 14 May 2024 03:32:56 GMT',
   'content-length': '0'},
  'RetryAttempts': 3}}

#### Resources
- [Deep Learning containers for large model inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-dlc.html)
- [Deep Java Library for Large Model Inference](https://docs.djl.ai/docs/serving/serving/docs/large_model_inference.html)