# Serve the MosaicML MPT-30B model with Amazon SageMaker Hosting

In this example, we walk through how to deploy the **MosaicML MPT-30B model** and run inference on it using the **Large Model Inference (LMI)** container provided by AWS, which is built on **DJL Serving**. Because this is a large language model (LLM) that does not fit on a single GPU, we will use an 'ml.g5.12xlarge' instance, which has **4** GPUs.
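
The sizing argument can be checked with back-of-envelope arithmetic. The sketch below is only an estimate: the parameter count (about 30 billion) and the 24 GB of memory per GPU are assumptions that do not come from this notebook, and the calculation covers the bfloat16 weights only, ignoring activations and the generation cache.

```python
# Rough estimate of weight memory versus GPU memory.
# Assumptions (not taken from the notebook): ~30e9 parameters, 24 GB per GPU.
NUM_PARAMS = 30e9
BYTES_PER_PARAM = 2  # bfloat16 stores each parameter in 2 bytes
GPU_MEMORY_GB = 24


def weights_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


def weights_fit(num_gpus: int) -> bool:
    """True if the weights alone fit in the combined memory of `num_gpus` GPUs."""
    return weights_gb(NUM_PARAMS, BYTES_PER_PARAM) < num_gpus * GPU_MEMORY_GB


for gpus in (1, 4):
    print(f"{gpus} GPU(s): weights fit = {weights_fit(gpus)}")
```

Under these assumptions the weights need about 60 GB, which exceeds one GPU but fits in the combined memory of four.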


## Setup

Install the dependencies required to package the model and run inference with Amazon SageMaker. This cell upgrades the SageMaker SDK and boto3.

In [58]:
!pip install sagemaker boto3 --upgrade --quiet

In [90]:
import sagemaker
import jinja2
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path
from sagemaker.utils import name_from_base

## Imports and variables

In [60]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
model_bucket = sess.default_bucket()  # bucket to house artifacts
hf_model_id = 'mosaicml/mpt-30b'
model_id = hf_model_id.replace('/','-')
s3_code_prefix_accelerate = f"hf-large-model/{model_id}/accelerate"  # folder within bucket where code artifact will go

region = sess.boto_region_name  # AWS region of the session (public attribute)
account_id = sess.account_id()

s3_client = boto3.client("s3")
sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

jinja_env = jinja2.Environment()

### 1. Create SageMaker compatible model artifacts

To prepare our model for deployment to a SageMaker endpoint, we need to create a few files for SageMaker and our container. We will collect them in a local folder: **serving.properties**, which defines parameters for the LMI container, and **requirements.txt**, which lists the dependencies to install.

In [61]:
directory_name = f"code_{model_id.replace('-','_')}_accelerate"
os.makedirs(directory_name, exist_ok=True)

In the **serving.properties** file, define the **engine** to use and the **model** to host. Note the **tensor_parallel_degree** parameter, which is also required in this scenario. We will use **tensor parallelism** to divide the model into multiple parts because no single GPU has enough memory for the entire model. In this case we will use an 'ml.g5.12xlarge' instance, which provides **4** GPUs. Be careful not to specify a value larger than the number of GPUs the instance provides, or your deployment will fail.

In [63]:
%%writefile ./{directory_name}/serving.properties
engine=Python
option.model_id={{hf_model_id}}
option.tensor_parallel_degree=4

Overwriting ./code_mosaicml_mpt_30b_accelerate/serving.properties
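
Because a `tensor_parallel_degree` larger than the instance's GPU count makes the deployment fail, it can be worth checking the value before packaging. The following is a minimal sketch using only the standard library; `parse_properties` and `check_tensor_parallel_degree` are hypothetical helpers written for illustration, not part of DJL Serving or the SageMaker SDK, and the GPU count is supplied by the caller.

```python
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple `key=value` lines, skipping blank lines and `#` comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_tensor_parallel_degree(props: Dict[str, str], available_gpus: int) -> int:
    """Return the configured degree, or raise if it exceeds the GPU count."""
    degree = int(props["option.tensor_parallel_degree"])
    if degree > available_gpus:
        raise ValueError(
            f"tensor_parallel_degree={degree} exceeds the {available_gpus} GPUs available"
        )
    return degree


# The same content this notebook writes to serving.properties (after rendering).
EXAMPLE = """engine=Python
option.model_id=mosaicml/mpt-30b
option.tensor_parallel_degree=4
"""

print(check_tensor_parallel_degree(parse_properties(EXAMPLE), available_gpus=4))
```

In the notebook, the text could come from `Path(f"{directory_name}/serving.properties").read_text()` once the file has been rendered.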


In [64]:
%%writefile ./{directory_name}/requirements.txt
torch==2.0.1
einops==0.5.0

Overwriting ./code_mosaicml_mpt_30b_accelerate/requirements.txt


In [65]:
# Render the Jinja placeholder in `serving.properties` with the Hugging Face model ID
properties_path = Path(f"{directory_name}/serving.properties")
template = jinja_env.from_string(properties_path.read_text())
properties_path.write_text(template.render(hf_model_id=hf_model_id))
!pygmentize {directory_name}/serving.properties | cat -n

     1	engine=Python
     2	option.model_id=mosaicml/mpt-30b
     3	option.tensor_parallel_degree=4
     4	#option.s3url =


### 2. Create a model.py with custom inference code

SageMaker allows you to bring your own script for inference. Here we create our **model.py** file with the appropriate code for the MPT 30B model.

In [66]:
%%writefile ./{directory_name}/model.py
from djl_python import Input, Output
import torch
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer
from typing import Optional

# Text-generation pipeline, created lazily on the first call to handle()
predictor = None


def get_model(properties):
    # `model_id` comes from `option.model_id` in serving.properties
    model_name = properties["model_id"]
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        low_cpu_mem_usage=True,
        trust_remote_code=True,  # the MPT checkpoint ships its own modeling code
        torch_dtype=torch.bfloat16,
        device_map="auto",  # let Accelerate place the layers across the available GPUs
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    generator = pipeline(
        task="text-generation",
        model=model,
        tokenizer=tokenizer,
        device_map="auto",
        torch_dtype=torch.bfloat16,
    )
    return generator


def handle(inputs: Input) -> Optional[Output]:
    global predictor
    if not predictor:
        predictor = get_model(inputs.get_properties())
    if inputs.is_empty():
        # Model server makes an empty call to warm up the model on startup
        return None
    data = inputs.get_as_json()
    text = data["text"]
    text_length = data["text_length"]
    # min_length == max_length pins the total length (prompt included) to `text_length` tokens
    outputs = predictor(text, do_sample=True, min_length=text_length, max_length=text_length)
    result = {"generated_text": outputs[0]["generated_text"]}
    return Output().add_as_json(result)

Overwriting ./code_mosaicml_mpt_30b_accelerate/model.py
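
The handler above reads `text` and `text_length` directly from the request, so a malformed payload surfaces as a `KeyError` or as an error deep inside generation. One possible hardening, sketched here, is to validate the payload first. `parse_request` is a hypothetical helper introduced for illustration; it is not part of `djl_python`.

```python
from typing import Any, Dict, Tuple


def parse_request(data: Dict[str, Any]) -> Tuple[str, int]:
    """Validate the JSON payload expected by handle() and return (text, text_length)."""
    text = data.get("text")
    if not isinstance(text, str) or not text:
        raise ValueError("'text' must be a non-empty string")

    text_length = data.get("text_length")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(text_length, bool) or not isinstance(text_length, int):
        raise ValueError("'text_length' must be an integer")
    if text_length <= 0:
        raise ValueError("'text_length' must be positive")

    return text, text_length


print(parse_request({"text": "The population of Greece is", "text_length": 20}))
```

Inside `handle`, the two dictionary lookups could then be replaced by `text, text_length = parse_request(inputs.get_as_json())`.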


### 3. Create the Tarball and then upload to S3 location
Next, we package our artifacts as a `model.tar.gz` file and upload it to S3 for SageMaker to use during deployment.

In [67]:
!rm -f model.tar.gz
!rm -rf {directory_name}/.ipynb_checkpoints
!tar czvf model.tar.gz -C {directory_name} .
s3_code_artifact_accelerate = sess.upload_data("model.tar.gz", bucket, s3_code_prefix_accelerate)
print(f"S3 Code or Model tar for accelerate uploaded to --- > {s3_code_artifact_accelerate}")

./
./model.py
./requirements.txt
./serving.properties
S3 Code or Model tar for accelerate uploaded to --- > s3://sagemaker-eu-west-1-069230569860/hf-large-model/mosaicml-mpt-30b/accelerate/model.tar.gz
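
The shell commands above rely on `tar` and `rm` being available in the notebook environment. As an alternative, the same archive can be built with Python's standard `tarfile` module. This is a sketch rather than the notebook's method: `make_model_tarball` is a hypothetical helper, and it stores entries as `model.py` rather than `./model.py`, which still places them at the archive root.

```python
import tarfile
from pathlib import Path
from typing import List


def make_model_tarball(source_dir: str, output_path: str = "model.tar.gz") -> List[str]:
    """Pack every file under `source_dir` at the archive root and return the member names."""
    source = Path(source_dir)
    with tarfile.open(output_path, "w:gz") as tar:
        for path in sorted(source.rglob("*")):
            # Skip Jupyter checkpoint folders, mirroring the `rm -rf` step above.
            if ".ipynb_checkpoints" in path.parts:
                continue
            if path.is_file():
                tar.add(path, arcname=path.relative_to(source).as_posix())
    # Re-open the archive to report what was actually written.
    with tarfile.open(output_path, "r:gz") as tar:
        return tar.getnames()
```

In this notebook it would be called as `make_model_tarball(directory_name)` before `sess.upload_data(...)`. Write the archive outside `source_dir` so that it does not end up packing itself.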


### 4. Define a serving container, SageMaker Model and SageMaker endpoint
Now that we have uploaded the model artifacts to S3, we can create a SageMaker endpoint.


#### Define the serving container
Here we define the container to use for inference. We will use SageMaker's Large Model Inference (LMI) container with Accelerate.

In [68]:
inference_image_uri = (
    f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.22.1-deepspeed0.8.3-cu118"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

Image going to be used is ---- > 763104351884.dkr.ecr.eu-west-1.amazonaws.com/djl-inference:0.22.1-deepspeed0.8.3-cu118


#### Create SageMaker model, endpoint configuration and endpoint


In [69]:
model_name_acc = name_from_base(model_id)
print(model_name_acc)

mosaicml-mpt-30b-2023-06-26-15-30-18-665


In [70]:
create_model_response = sm_client.create_model(
    ModelName=model_name_acc,
    ExecutionRoleArn=role,
    PrimaryContainer={"Image": inference_image_uri, "ModelDataUrl": s3_code_artifact_accelerate},
)
model_arn = create_model_response["ModelArn"]

print(f"Created Model: {model_arn}")

Created Model: arn:aws:sagemaker:eu-west-1:069230569860:model/mosaicml-mpt-30b-2023-06-26-15-30-18-665


In [71]:
model_name = model_name_acc
print(f"Building EndpointConfig and Endpoint for: {model_name}")

Building EndpointConfig and Endpoint for: mosaicml-mpt-30b-2023-06-26-15-30-18-665


In [72]:
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
            # "VolumeSizeInGB": 512
        },
    ],
)
endpoint_config_response

{'EndpointConfigArn': 'arn:aws:sagemaker:eu-west-1:069230569860:endpoint-config/mosaicml-mpt-30b-2023-06-26-15-30-18-665-config',
 'ResponseMetadata': {'RequestId': '3cf7abbf-a370-48ab-963a-e5d7f7c1d621',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '3cf7abbf-a370-48ab-963a-e5d7f7c1d621',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '128',
   'date': 'Mon, 26 Jun 2023 15:30:26 GMT'},
  'RetryAttempts': 0}}

In [73]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", EndpointConfigName=endpoint_config_name
)
print(f"Created Endpoint: {create_endpoint_response['EndpointArn']}")

Created Endpoint: arn:aws:sagemaker:eu-west-1:069230569860:endpoint/mosaicml-mpt-30b-2023-06-26-15-30-18-665-endpoint


In [74]:
import time

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:eu-west-1:069230569860:endpoint/mosaicml-mpt-30b-2023-06-26-15-30-18-665-endpoint
Status: InService
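
The polling loop above exits as soon as the status is no longer `Creating`, so a `Failed` endpoint looks the same as a successful one until the final print. A slightly more defensive variant is sketched below. `wait_for_endpoint` is a hypothetical helper: it takes any zero-argument callable that returns a `describe_endpoint`-style dictionary, which keeps the sketch independent of boto3, and the timeout value is an arbitrary choice.

```python
import time
from typing import Any, Callable, Dict


def wait_for_endpoint(
    describe: Callable[[], Dict[str, Any]],
    poll_seconds: float = 60,
    timeout_seconds: float = 3600,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the endpoint is InService; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        resp = describe()
        status = resp["EndpointStatus"]
        print("Status: " + status)
        if status == "InService":
            return resp
        if status != "Creating":
            # Any other status (for example "Failed") means creation did not succeed.
            reason = resp.get("FailureReason", "no reason reported")
            raise RuntimeError(f"Endpoint entered status {status}: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint still Creating after {timeout_seconds} seconds")
        sleep(poll_seconds)
```

With the clients defined earlier, the call would be `wait_for_endpoint(lambda: sm_client.describe_endpoint(EndpointName=endpoint_name))`. boto3 also provides a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, which may be preferable when no custom logging is needed.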


### Run Inference

In [93]:
%%time

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps({"text": "The population of Greece is", "text_length": 20}),
    ContentType="application/json",
)

CPU times: user 15.8 ms, sys: 0 ns, total: 15.8 ms
Wall time: 2.09 s


In [94]:
r = response_model["Body"].read().decode("utf8")

In [95]:
# Load the JSON string as a dictionary
data_dict = json.loads(r)

# Access the dictionary elements
generated_text = data_dict['generated_text']

# Print the generated text
print(generated_text)

The population of Greece is 9,999,450 (2019 EUD) people. Population density is
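
The three cells above (invoke, read, decode) can be folded into one function for repeated use. The sketch below assumes the request and response shapes defined by the `model.py` in this notebook. `generate` is a hypothetical helper, and the client is passed in as an argument so that the function works with `smr_client` or with a stand-in object.

```python
import json


def generate(runtime_client, endpoint_name: str, text: str, text_length: int) -> str:
    """Send one prompt to the endpoint and return the generated text."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=json.dumps({"text": text, "text_length": text_length}),
        ContentType="application/json",
    )
    # The response body is a stream; read it once and decode the JSON document.
    payload = json.loads(response["Body"].read().decode("utf8"))
    return payload["generated_text"]
```

For example: `print(generate(smr_client, endpoint_name, "The population of Greece is", 20))`.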


### Clean Up

In [45]:
# Delete the endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)

{'ResponseMetadata': {'RequestId': 'eec95601-c765-46cc-9498-7f2aa55cf437',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'eec95601-c765-46cc-9498-7f2aa55cf437',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Mon, 26 Jun 2023 12:36:22 GMT'},
  'RetryAttempts': 0}}

In [46]:
# Delete the model and the endpoint configuration
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)

{'ResponseMetadata': {'RequestId': '7bd12f20-4db8-431f-803f-1ae44a909291',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '7bd12f20-4db8-431f-803f-1ae44a909291',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Mon, 26 Jun 2023 12:36:23 GMT'},
  'RetryAttempts': 0}}

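### Delete all endpoints, endpoint configurations, and models

The following cell deletes every model, endpoint, and endpoint configuration returned by the list calls in the current account and region, not only the ones created in this notebook.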

In [96]:
import boto3

def delete_resources(resource_type):
    client = boto3.client('sagemaker')
    list_method = getattr(client, f"list_{resource_type}s")
    delete_method = getattr(client, f"delete_{resource_type}")
    resource_type_name = resource_type.replace('_', ' ').title().replace(' ', '')
    resources = list_method()[f"{resource_type_name}s"]
    for resource in resources:
        resource_name = resource[f"{resource_type_name}Name"]
        print(f"Deleting {resource_type}: {resource_name}")
        # if resource_name == "falcon-40b-instruct": continue
        # if resource_name == "llama-30b-supercot-2023-06-15-16-54-10-187-endpoint": continue
        delete_method(**{f"{resource_type_name}Name": resource_name})

def main():
    resource_types = ['model', 'endpoint', 'endpoint_config']  # Add more resource types if needed

    for resource_type in resource_types:
        delete_resources(resource_type)

if __name__ == "__main__":
    main()

Deleting model: huggingface-pytorch-inference-2023-06-26-19-45-53-859
Deleting model: mosaicml-mpt-30b-2023-06-26-15-30-18-665
Deleting endpoint: huggingface-pytorch-inference-2023-06-26-19-45-54-426
Deleting endpoint: mosaicml-mpt-30b-2023-06-26-15-30-18-665-endpoint
Deleting endpoint_config: huggingface-pytorch-inference-2023-06-26-19-45-54-426
Deleting endpoint_config: mosaicml-mpt-30b-2023-06-26-15-30-18-665-config
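
As the output shows, the cell above also removed resources that this notebook did not create. A narrower variant is sketched below under the assumption that the resources to remove share a name prefix. `delete_by_prefix` is a hypothetical helper; it uses the SageMaker client's paginators so that long listings are not cut off at the first page, and it defaults to a dry run.

```python
from typing import List, Tuple

# (list operation, key in the list response, name field, delete operation)
RESOURCES = [
    ("list_endpoints", "Endpoints", "EndpointName", "delete_endpoint"),
    ("list_endpoint_configs", "EndpointConfigs", "EndpointConfigName", "delete_endpoint_config"),
    ("list_models", "Models", "ModelName", "delete_model"),
]


def delete_by_prefix(client, prefix: str, dry_run: bool = True) -> List[Tuple[str, str]]:
    """Delete resources whose name starts with `prefix`; return the (operation, name) pairs."""
    if not prefix:
        raise ValueError("refusing to run with an empty prefix")
    matched = []
    for list_op, response_key, name_key, delete_op in RESOURCES:
        # Collect matching names first so that deleting does not disturb pagination.
        names = [
            item[name_key]
            for page in client.get_paginator(list_op).paginate()
            for item in page[response_key]
            if item[name_key].startswith(prefix)
        ]
        for name in names:
            matched.append((delete_op, name))
            if not dry_run:
                getattr(client, delete_op)(**{name_key: name})
    return matched
```

A cautious workflow is to call `delete_by_prefix(boto3.client("sagemaker"), "mosaicml-mpt-30b")` first, inspect the returned list, and only then repeat the call with `dry_run=False`.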
