# Open-LLAMA 7B implementation using LMI container on SageMaker


#### Model source: https://github.com/openlm-research/open_llama
#### Model download hub: https://huggingface.co/openlm-research/open_llama_7b
#### License: Apache-2.0


In this tutorial, you will use the AWS-provided LMI (Large Model Inference) container on SageMaker to host Open-LLaMA 7B and run inference with it.
Make sure the following permissions are granted before running the notebook:

- ECR Push/Pull access
- S3 bucket push access
- SageMaker access


## Step 1: Upgrade the SageMaker SDK and import dependencies

In [1]:
%pip install --upgrade pip --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install sagemaker boto3 awscli --upgrade --quiet

Note: you may need to restart the kernel to use updated packages.


In [3]:
import boto3
import sagemaker
from sagemaker import Model, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


In [4]:
print(role, region, account_id)

arn:aws:iam::171503325295:role/test-ecr us-east-1 171503325295


In [5]:
sagemaker.__version__

'2.208.0'

## Step 2: Retrieve the image URI for the DJL container



In [6]:
inference_image_uri = sagemaker.image_uris.retrieve(
    framework="djl-deepspeed", region=region, version="0.26.0"
)
print(f"Image going to be used is ---- > {inference_image_uri}")

Image going to be used is ---- > 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.26.0-deepspeed0.12.6-cu121
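
The retrieved URI bundles several facts (registry account, region, repository, tag) that are worth checking before deployment. The snippet below is a minimal sketch that splits a URI of the shape printed above; `parse_ecr_image_uri` is a hypothetical helper, not part of the SageMaker SDK, and the pattern only covers the standard `amazonaws.com` domain.

```python
import re

# Pattern for <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
_ECR_URI = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com"
    r"/(?P<repository>[^:]+):(?P<tag>.+)$"
)


def parse_ecr_image_uri(uri):
    """Split an ECR image URI into account, region, repository, and tag."""
    match = _ECR_URI.match(uri)
    if match is None:
        raise ValueError(f"Not a recognized ECR image URI: {uri}")
    return match.groupdict()


parts = parse_ecr_image_uri(
    "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.26.0-deepspeed0.12.6-cu121"
)
print(parts["repository"], parts["tag"])
```

Comparing `parts["region"]` with the session region is one quick way to catch a mismatched image before the endpoint starts.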


## Step 3: Start preparing model artifacts
In the LMI container, a few artifacts help set up the model.
Either environment variables or a `serving.properties` file is required.

- environment variables | serving.properties (required): defines the model server settings.

```
%%writefile serving.properties
engine = MPI
option.tensor_parallel_degree = max
option.model_id = openlm-research/open_llama_7b
option.rolling_batch=vllm
```

```
%%sh
mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel
```
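
The shell steps above can also be sketched in pure Python, which helps when the notebook kernel has no shell access. This is an illustrative equivalent, not the tutorial's own method: `package_model_dir` is a hypothetical helper, and it writes the properties text directly instead of moving an existing file.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def package_model_dir(properties_text, workdir=".", name="mymodel"):
    """Write serving.properties into <name>/, archive it as <name>.tar.gz, remove <name>/."""
    workdir = Path(workdir)
    model_dir = workdir / name
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "serving.properties").write_text(properties_text)

    archive_path = workdir / f"{name}.tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps "mymodel/" as the top-level folder inside the archive
        tar.add(model_dir, arcname=name)

    shutil.rmtree(model_dir)  # mirrors `rm -rf mymodel`
    return archive_path


# Demo in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    archive = package_model_dir("engine = MPI\n", workdir=tmp)
    with tarfile.open(archive) as tar:
        members = sorted(tar.getnames())

print(members)
```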

```
s3_code_prefix = "large-model-lmi/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")
```

## Step 4: Start building SageMaker endpoint
In this step, we will build a SageMaker endpoint from scratch.

### 4.1 Create SageMaker endpoint
We will use environment variables to specify the LMI config.

In [8]:
env = {
    "SERVING_LOAD_MODELS": "test::MPI=/opt/ml/model",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_MODEL_ID": "openlm-research/open_llama_7b",
    "OPTION_ROLLING_BATCH": "vllm"
}

model = Model(
    image_uri=inference_image_uri,
    role=role,
    sagemaker_session=sess,
    # model_data=code_artifact,  # required only when using serving.properties / model.py
    env=env,
)
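
Comparing the `env` dictionary with the `serving.properties` file from Step 3 suggests a naming pattern: each `option.<name>` key appears as an upper-case `OPTION_<NAME>` variable. The sketch below applies that pattern as inferred from this tutorial's two configurations. `properties_to_env` is a hypothetical helper, and it deliberately skips `engine`, which this notebook passes through `SERVING_LOAD_MODELS` instead.

```python
def properties_to_env(properties_text):
    """Map `option.*` entries of a serving.properties text to OPTION_* variables."""
    env = {}
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and lines without a key/value separator
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.startswith("option."):
            env["OPTION_" + key[len("option."):].upper()] = value
    return env


example = """
engine = MPI
option.tensor_parallel_degree = max
option.model_id = openlm-research/open_llama_7b
option.rolling_batch=vllm
"""
env_from_properties = properties_to_env(example)
print(env_from_properties)
```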

Next, specify the instance type and the endpoint name.

In [9]:
instance_type = "ml.g5.2xlarge"  # single-GPU instance

endpoint_name = sagemaker.utils.name_from_base("open-llama-lmi-model")

model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=900,
    sagemaker_session=sess
)

# our requests and responses will be in json format so we specify the serializer and the deserializer
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

-------------!

## Step 5a: Test and benchmark inference latency
### Latency depends heavily on the `max_new_tokens` parameter

In [None]:
import time

tic = time.time()
predictor.predict(
    {"inputs": "tuna sandwich nutritional content is ", "parameters": {"max_new_tokens": 16}}
)
toc = time.time()
print(toc - tic)
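
To see how latency scales with `max_new_tokens`, the single timed call above can be repeated over several values. The following is a minimal sketch: `sweep_max_new_tokens` is a hypothetical helper, and `fake_predict` is a stand-in that only sleeps, so the printed numbers say nothing about the real endpoint. With a deployed endpoint, `predictor.predict` would be passed in its place.

```python
import time


def sweep_max_new_tokens(predict_fn, prompt, token_counts):
    """Time one request per max_new_tokens value and return (count, seconds) pairs."""
    results = []
    for count in token_counts:
        payload = {"inputs": prompt, "parameters": {"max_new_tokens": count}}
        start = time.perf_counter()
        predict_fn(payload)
        results.append((count, time.perf_counter() - start))
    return results


# Stand-in for predictor.predict so the sketch runs without an endpoint
def fake_predict(payload):
    time.sleep(0.001 * payload["parameters"]["max_new_tokens"])
    return {"generated_text": ""}


timings = sweep_max_new_tokens(
    fake_predict, "tuna sandwich nutritional content is ", [4, 16, 32]
)
for count, seconds in timings:
    print(f"max_new_tokens={count}: {seconds:.3f}s")
```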

## Let us define a helper function to get a histogram of invocation latency distribution

In [None]:
import matplotlib.pyplot as plt
import time
import numpy as np
from tqdm import tqdm


def _latency_hist_plot(endpoint_name, invocation_number=100, sleep_time=1):
    """Invoke the endpoint repeatedly and plot a histogram of client-side latency.

    Note: requests go through the global `predictor`; `endpoint_name` is not
    used and is kept only so existing calls keep working.
    """
    latency_array = []
    for _ in tqdm(range(invocation_number)):
        tic = time.time()
        predictor.predict(
            {"inputs": "Large model inference is", "parameters": {"max_new_tokens": 256}}
        )
        toc = time.time()
        latency_array.append(toc - tic)
        time.sleep(sleep_time)

    latency_array_np = np.array(latency_array)
    plt.hist(latency_array_np, bins="auto")  # arguments are passed to np.histogram
    plt.title("Invocation Latency Histogram with 'auto' bins")
    plt.xlabel("Latency (s)")
    plt.show()
    return latency_array
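
A histogram shows the shape of the distribution; percentiles give numbers that are easier to compare between runs. The sketch below assumes the per-request latencies (the `latency_array` built inside the helper above) are available as a plain list. `summarize_latencies` is a hypothetical helper and the sample values are made up for illustration.

```python
import statistics


def summarize_latencies(latencies):
    """Return mean and p50/p90/p99 in the same unit as the input."""
    if len(latencies) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points p1..p99
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(latencies),
        "p50": cuts[49],
        "p90": cuts[89],
        "p99": cuts[98],
    }


# Made-up sample latencies in seconds
summary = summarize_latencies([1.0, 2.0, 3.0, 4.0, 5.0])
print(summary)
```

With only a handful of invocations, the upper percentiles are dominated by one or two samples, so a larger `invocation_number` gives more meaningful tail numbers.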

In [None]:
%%time
inv_start_time = time.time()
invocation_number = 10
# Real-time endpoint
_latency_hist_plot(endpoint_name, invocation_number, sleep_time=1)
inv_lapse_time = time.time() - inv_start_time
print(inv_lapse_time)

In [None]:
endpoint_name = predictor.endpoint_name
print(endpoint_name)
print(region)

## Step 5b: Analyze Inference Latency via CloudWatch

In [None]:
# https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-test-endpoints.html
import pandas as pd

cw = boto3.client("cloudwatch", region_name=region)


def get_invocation_metrics_for_endpoint(endpoint_name, metric_name, start_time, end_time):
    """Return the per-minute average of a SageMaker endpoint metric as a DataFrame."""
    metric = "Average"
    metrics = cw.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName=metric_name,
        StartTime=start_time,
        EndTime=end_time,
        Period=60,  # standard-resolution metrics need a period that is a multiple of 60 seconds
        Statistics=[metric],
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": "AllTraffic"},
        ],
    )
    return (
        pd.DataFrame(metrics["Datapoints"])
        .sort_values("Timestamp")
        .set_index("Timestamp")
        .drop("Unit", axis=1)
        .rename(columns={metric: metric_name})
    )

In [None]:
import datetime

def plot_endpoint_metrics(start_time=None, end_time=None):
    """Plot ModelLatency and OverheadLatency (converted to ms) for the endpoint."""
    # Default to a window that covers the benchmark run plus one minute
    end_time = end_time or datetime.datetime.now(datetime.timezone.utc)
    start_time = start_time or end_time - datetime.timedelta(seconds=inv_lapse_time + 60)
    model_metrics = get_invocation_metrics_for_endpoint(
        endpoint_name, "ModelLatency", start_time, end_time
    )
    overhead_metrics = get_invocation_metrics_for_endpoint(
        endpoint_name, "OverheadLatency", start_time, end_time
    )
    total_metrics = model_metrics.join(overhead_metrics)
    # CloudWatch reports both metrics in microseconds; convert to milliseconds
    total_metrics["ModelLatency"] = total_metrics["ModelLatency"] / 1000
    total_metrics["OverheadLatency"] = total_metrics["OverheadLatency"] / 1000
    total_metrics.plot()
    return total_metrics
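
The function above plots the two latency components separately. A combined per-minute total can be derived from the same `Datapoints` lists, as in this dependency-free sketch. `total_latency_ms` is a hypothetical helper, and the sample datapoints are invented to show the shape of the data, not measured values.

```python
from datetime import datetime, timezone


def total_latency_ms(model_points, overhead_points, stat="Average"):
    """Join two Datapoints lists on Timestamp and return {timestamp: total ms}."""
    overhead_by_ts = {p["Timestamp"]: p[stat] for p in overhead_points}
    totals = {}
    for point in sorted(model_points, key=lambda p: p["Timestamp"]):
        ts = point["Timestamp"]
        if ts in overhead_by_ts:
            # Both metrics are reported in microseconds
            totals[ts] = (point[stat] + overhead_by_ts[ts]) / 1000
    return totals


# Invented datapoints in the shape returned by get_metric_statistics
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
t1 = datetime(2024, 1, 1, 12, 1, tzinfo=timezone.utc)
model = [{"Timestamp": t1, "Average": 900000.0}, {"Timestamp": t0, "Average": 800000.0}]
overhead = [{"Timestamp": t0, "Average": 5000.0}, {"Timestamp": t1, "Average": 7000.0}]

totals = total_latency_ms(model, overhead)
for ts, ms in totals.items():
    print(ts.isoformat(), ms)
```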

In [None]:
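# Use an explicit UTC timestamp so the query window does not depend on the local time zone
endtime = datetime.datetime.now(datetime.timezone.utc)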
print(endtime)
startime = endtime - datetime.timedelta(seconds=inv_lapse_time + 60)
print(startime)

In [None]:
# wait for cloudwatch metrics to populate
time.sleep(300)
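
A fixed five-minute sleep is simple but waits the full time even when the metrics appear sooner. One alternative is to poll until a readiness check passes or a timeout expires. The sketch below is generic: `wait_until` is a hypothetical helper, and `metrics_ready` is a stub that succeeds on the third call. Wiring it to a real check (for example, whether a CloudWatch query returns any datapoints) is left as an assumption.

```python
import time


def wait_until(check_fn, timeout_s=300, interval_s=15,
               sleep_fn=time.sleep, clock=time.monotonic):
    """Call check_fn until it returns True or timeout_s elapses; return the outcome."""
    deadline = clock() + timeout_s
    while True:
        if check_fn():
            return True
        if clock() >= deadline:
            return False
        sleep_fn(interval_s)


# Stub check that reports readiness on the third attempt
attempts = []


def metrics_ready():
    attempts.append(1)
    return len(attempts) >= 3


ready = wait_until(metrics_ready, timeout_s=5, interval_s=0, sleep_fn=lambda _s: None)
print(ready, len(attempts))
```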

In [None]:
total_metrics = plot_endpoint_metrics(start_time=startime, end_time=endtime)

In [None]:
# Latency expressed in ms
total_metrics

## Clean up the environment

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()
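
If one of the delete calls above fails (for example, because the endpoint was already removed), the remaining resources are left behind. The sketch below shows one way to run every step and collect failures. `run_cleanup` is a hypothetical helper, and the steps here are stubs rather than real AWS calls.

```python
def run_cleanup(steps):
    """Run each (label, callable) step; continue on failure and return the errors."""
    errors = {}
    for label, step in steps:
        try:
            step()
        except Exception as exc:  # keep going so later resources are still removed
            errors[label] = exc
    return errors


# Stub steps standing in for the delete calls
done = []


def failing_step():
    raise RuntimeError("endpoint not found")


cleanup_errors = run_cleanup([
    ("endpoint", failing_step),
    ("endpoint_config", lambda: done.append("endpoint_config")),
    ("model", lambda: done.append("model")),
])
print(sorted(cleanup_errors), done)
```

In the notebook, the steps would wrap the three calls above, for example `("endpoint", lambda: sess.delete_endpoint(endpoint_name))`.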