# BYOC instructions for using an LMI container on SageMaker
In this tutorial, you will bring your own container (BYOC) from Docker Hub to SageMaker and run inference with it.
Make sure the following permissions are granted before running the notebook:

- ECR Push/Pull access
- S3 bucket push access
- SageMaker access
- DynamoDB access (create tables, read and write items)

If you plan to do Step 6, you also need Lambda and API Gateway permissions:

- AWSLambda access (Create lambda function)
- IAM access (Create role, delete role)
- APIGateway (Creation, deletion)

## Step 1: Upgrade the SageMaker SDK and import dependencies

In [1]:
%pip install sagemaker boto3 awscli --upgrade  --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
import boto3
import sagemaker
from sagemaker import Model, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment (public attribute)
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment

## Step 2: Pull the Docker image from Docker Hub and push it to an ECR repository

*Note: Make sure your AWS credentials have permission to push to the ECR repository.*

This process may take a while, depending on the container size and your network bandwidth.
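For reference, the fully qualified image name that the bash cell below assembles follows a fixed pattern. A minimal Python sketch of the same string construction, assuming the standard `aws` partition; the helper name `ecr_image_uri` is hypothetical and not part of any SDK:

```python
def ecr_image_uri(account_id: str, region: str, repo_name: str, tag: str = "latest") -> str:
    """Build the fully qualified ECR image name used for docker tag/push."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{repo_name}:{tag}"


# Placeholder account ID, for illustration only.
print(ecr_image_uri("123456789012", "us-west-2", "djlserving-byoc"))
```

The same pattern is reused in Step 4.1 to build `image_uri` for the SageMaker model.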

In [3]:
%%bash

# The name of our container
repo_name=djlserving-byoc
# Target container
target_container="deepjavalibrary/djl-serving:deepspeed-nightly"

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${repo_name}:latest"
echo "Creating ECR repository ${fullname}"

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "${repo_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${repo_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin "${account}.dkr.ecr.${region}.amazonaws.com"

# Pull the image from Docker Hub, tag it with the full ECR name,
# and push it to ECR.
echo "Start pulling container: ${target_container}"

docker pull ${target_container}
docker tag ${target_container} ${fullname}
docker push ${fullname}

Creating ECR repository 125045733377.dkr.ecr.us-east-1.amazonaws.com/djlserving-byoc:latest
Login Succeeded
Start pulling container: deepjavalibrary/djl-serving:deepspeed-nightly
deepspeed-nightly: Pulling from deepjavalibrary/djl-serving
846c0b181fff: Pulling fs layer
6599ec13b57f: Pulling fs layer
e1426114a55b: Pulling fs layer
f5a45ff4f0b5: Pulling fs layer
51873ca0db79: Pulling fs layer
8539eb40abca: Pulling fs layer
f5a45ff4f0b5: Waiting
b5e218b9a64b: Pulling fs layer
d71da67a71de: Pulling fs layer
8539eb40abca: Waiting
4b4331e1e893: Pulling fs layer
ba157caf65a0: Pulling fs layer
4b4331e1e893: Waiting
7dc2c379e8a4: Pulling fs layer
57819aec952b: Pulling fs layer
4a352577dbbb: Pulling fs layer
7dc2c379e8a4: Waiting
57819aec952b: Waiting
9ac8cc841e71: Pulling fs layer
b5e218b9a64b: Waiting
4a352577dbbb: Waiting
d71da67a71de: Waiting
ed5e468826d6: Pulling fs layer
c56c5861fc54: Pulling fs layer
5e4c6f1cc111: Pulling fs layer
f83e1b46ff02: Pulling fs layer
ed5e468826d6: Waiting



## Step 3: Start preparing model artifacts
The LMI container expects the following artifacts to set up the model:
- serving.properties (required): Defines the model server settings
- model.py (optional): A Python file that defines the core inference logic
- requirements.txt (optional): Any additional pip packages that need to be installed
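serving.properties is a plain `key=value` file. As an illustration only, the sketch below parses such text into a dict so the settings can be inspected before packaging. `parse_properties` is a hypothetical helper and is not how the model server itself reads the file.

```python
def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Same settings as the serving.properties used in this tutorial.
sample = """
engine=Python
option.tensor_parallel_degree=1
option.predict_timeout=1200
option.model_id=cerebras/Cerebras-GPT-1.3B
"""
props = parse_properties(sample)
print(props["option.model_id"])
```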

In [3]:
%%writefile serving.properties
engine=Python
option.tensor_parallel_degree=1
option.predict_timeout=1200
option.model_id=cerebras/Cerebras-GPT-1.3B

Writing serving.properties


In [4]:
%%writefile model.py
from djl_python import Input, Output
import torch
import logging
from transformers import AutoModelForCausalLM, AutoTokenizer
from djl_python.streaming_utils import StreamingUtils
from paginator import DDBPaginator
import uuid


def load_model(properties):
    model_location = properties['model_dir']
    if "model_id" in properties:
        model_location = properties['model_id']
    logging.info(f"Loading model in {model_location}")
    device = "cuda:0"
    model = AutoModelForCausalLM.from_pretrained(model_location, low_cpu_mem_usage=True, torch_dtype=torch.float16).to(device)
    tokenizer = AutoTokenizer.from_pretrained(model_location)
    stream_generator = StreamingUtils.get_stream_generator("Accelerate")
    return model, tokenizer, stream_generator


model = None
tokenizer = None
stream_generator = None
paginator = None

def separate_inference(session_id, inputs):
    prompt = inputs["prompt"]
    length = inputs["max_new_tokens"]
    generate_kwargs = dict(max_new_tokens=length, do_sample=True)
    generator = stream_generator(model, tokenizer, prompt, **generate_kwargs)
    generated = ""
    iterator = 0
    for text in generator:
        generated += text[0]
        if iterator == 5:
            paginator.add_cache(session_id, generated)
            iterator = 0
        iterator += 1
    paginator.add_cache(session_id, generated + "<eos>")



def handle(inputs: Input):
    global model, tokenizer, stream_generator, paginator
    if not model:
        model, tokenizer, stream_generator = load_model(inputs.get_properties())
        paginator = DDBPaginator("test_DB_Qing")

    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None
    session_id = str(uuid.uuid4())
    return Output().add({"session_id": session_id}).finalize(separate_inference, session_id, inputs.get_as_json())

Writing model.py
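The `separate_inference` function in model.py writes the accumulated text to the cache every few tokens rather than on every token, then writes a final copy ending with an `<eos>` marker. The sketch below isolates that pattern with an in-memory stand-in for the paginator. `InMemoryCache`, `stream_to_cache`, and the `flush_every` parameter are hypothetical names, and a plain modulo check stands in for the original counter-reset logic.

```python
class InMemoryCache:
    """Stand-in for DDBPaginator that records every write."""

    def __init__(self):
        self.store = {}
        self.writes = 0

    def add_cache(self, session_id, content):
        self.store[session_id] = content
        self.writes += 1


def stream_to_cache(cache, session_id, tokens, flush_every=5):
    """Accumulate tokens, flushing the running text every `flush_every` tokens."""
    generated = ""
    for count, token in enumerate(tokens, start=1):
        generated += token
        if count % flush_every == 0:
            cache.add_cache(session_id, generated)
    # The final write carries the end-of-stream marker that clients poll for.
    cache.add_cache(session_id, generated + "<eos>")


cache = InMemoryCache()
stream_to_cache(cache, "demo", [str(i) for i in range(12)])
print(cache.store["demo"], cache.writes)
```

Batching writes this way trades a little latency for fewer DynamoDB requests per generation.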


In [5]:
%%writefile paginator.py
import boto3
import logging


class DDBPaginator:
    DEFAULT_KEY_NAME = "cache_id"

    def __init__(self, db_name):
        self.db_name = db_name
        self.ddb_client = boto3.client('dynamodb')
        try:
            self.ddb_client.describe_table(TableName=db_name)
        except self.ddb_client.exceptions.ResourceNotFoundException:
            logging.info(f"Table {db_name} not found")
            self.ddb_client.create_table(TableName=db_name,
                                         AttributeDefinitions=[{
                                             'AttributeName': self.DEFAULT_KEY_NAME,
                                             'AttributeType': 'S'
                                         }, ],
                                         KeySchema=[
                                             {
                                                 'AttributeName': self.DEFAULT_KEY_NAME,
                                                 'KeyType': 'HASH'
                                             }],
                                         BillingMode='PAY_PER_REQUEST'
                                         )
            waiter = self.ddb_client.get_waiter('table_exists')
            waiter.wait(TableName=db_name, WaiterConfig={'Delay': 1})

    def add_cache(self, session_id, content):
        return self.ddb_client.put_item(TableName=self.db_name,
                                        Item={self.DEFAULT_KEY_NAME: {"S": session_id}, "content": {"S": content}})

    def get_cache(self, session_id):
        result = self.ddb_client.get_item(TableName=self.db_name, Key={self.DEFAULT_KEY_NAME: {"S": session_id}})
        return result['Item']['content']['S']

Writing paginator.py
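The low-level DynamoDB client used in paginator.py works with typed attribute values such as `{"S": "..."}` for strings, and `get_item` omits the `Item` key when no row matches. The sketch below shows that item shape in isolation, with a fallback for the missing-row case; `to_item` and `content_from_response` are hypothetical helpers, not part of boto3.

```python
KEY_NAME = "cache_id"


def to_item(session_id: str, content: str) -> dict:
    """Build the typed item that put_item expects; "S" marks a string attribute."""
    return {KEY_NAME: {"S": session_id}, "content": {"S": content}}


def content_from_response(response: dict, default: str = "") -> str:
    """Extract the cached text from a get_item response.

    get_item omits the "Item" key when no row matches, so fall back to a default.
    """
    item = response.get("Item")
    if item is None:
        return default
    return item["content"]["S"]


print(to_item("abc", "hello"))
print(repr(content_from_response({})))
```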


In [6]:
%%writefile requirements.txt
boto3
transformers==4.27.2

Writing requirements.txt


In [7]:
%%sh
mkdir mymodel
mv serving.properties mymodel/
mv model.py mymodel/
mv paginator.py mymodel/
mv requirements.txt mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel

mymodel/
mymodel/model.py
mymodel/serving.properties
mymodel/paginator.py
mymodel/requirements.txt
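The same archive layout can be produced without shell commands. A minimal sketch using Python's standard `tarfile` module, assuming the artifact files already exist on disk; the function name `package_model` is hypothetical.

```python
import os
import tarfile


def package_model(files, archive="mymodel.tar.gz", top_dir="mymodel"):
    """Add each file to a gzip tarball under a single top-level directory."""
    with tarfile.open(archive, "w:gz") as tar:
        for path in files:
            tar.add(path, arcname=f"{top_dir}/{os.path.basename(path)}")
    return archive


# Usage, assuming the four files are in the current directory:
# package_model(["serving.properties", "model.py", "paginator.py", "requirements.txt"])
```

Unlike the shell cell, this version leaves the source files in place instead of moving them.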


## Step 4: Start building SageMaker endpoint
In this step, we will build a SageMaker endpoint from scratch.

### 4.1 Upload the artifact to S3 and create a SageMaker model

In [8]:
s3_code_prefix = "large-model-lmi/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

repo_name="djlserving-byoc"
image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}:latest"
env = {"HUGGINGFACE_HUB_CACHE": "/tmp", "TRANSFORMERS_CACHE": "/tmp"}

model = Model(image_uri=image_uri, model_data=code_artifact, env=env, role=role)

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-east-1-125045733377/large-model-lmi/code/mymodel.tar.gz


### 4.2 Create SageMaker endpoint

Specify the instance type and the endpoint name.

In [9]:
instance_type = "ml.g5.2xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-model")

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name,
             # container_startup_health_check_timeout=3600
            )

# our requests and responses will be in json format so we specify the serializer and the deserializer
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

------------!
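By default `model.deploy` blocks until the endpoint is ready. If you reconnect to the notebook later, the endpoint status can be polled instead. The sketch below assumes a boto3 SageMaker client is passed in; `wait_until_in_service` and its parameters are hypothetical names.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30, max_polls=120):
    """Poll describe_endpoint until the endpoint reports 'InService'."""
    for _ in range(max_polls):
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Endpoint {endpoint_name} not ready after {max_polls} polls")


# Usage (requires AWS credentials):
# wait_until_in_service(boto3.client("sagemaker"), endpoint_name)
```

boto3 also ships an `endpoint_in_service` waiter on the SageMaker client that covers the same case.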

## Step 5: Test and benchmark the inference

Here, we call the SageMaker endpoint to start generation, then poll DynamoDB (DDB) with a simple fetcher to stream the result.

In [17]:
import time


def get_stream(session_id):
    """Poll DynamoDB and print newly generated text until the <eos> marker appears."""
    ddb_client = boto3.client('dynamodb')
    prev = 0  # number of characters already printed
    while True:
        result = ddb_client.get_item(TableName="test_DB_Qing", Key={"cache_id": {"S": session_id}})
        if 'Item' in result:
            text = result['Item']['content']['S']
            print(text[prev:], end='')
            prev = len(text)
            if text.endswith('<eos>'):
                break
        time.sleep(0.1)


# The endpoint returns a session_id right away; generation continues in the background.
session_id = predictor.predict({"prompt": ["write a bubble sort algorithm in python"], "max_new_tokens": 512})
get_stream(session_id['session_id'])

.

A:

You can use the following code to sort a list of tuples:
def bubbleSort(lst):
    """Sort a list of tuples in ascending order."""
    for i in lst:
        yield i

lst = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def bubbleSort(lst):
    """Sort a list of tuples in descending order."""
    for i in lst:
        yield i

lst = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

lst.sort()

Output:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4<eos>
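The polling loop in this step runs forever if the `<eos>` marker never arrives, for example when generation fails on the endpoint. A minimal sketch of the same loop with a deadline follows. The text source is passed in as a callable so the logic can be tried without AWS, and the `<eos>` marker is stripped from the output; `stream_text` and its parameters are hypothetical names.

```python
import time


def stream_text(fetch, timeout_s=300.0, poll_s=0.1, eos="<eos>"):
    """Yield new text as it appears, stopping at the eos marker or the deadline.

    `fetch` is any zero-argument callable returning the full text so far
    (or None if nothing has been written yet).
    """
    deadline = time.monotonic() + timeout_s
    prev = 0  # number of characters already yielded
    while time.monotonic() < deadline:
        text = fetch()
        if text is not None:
            done = text.endswith(eos)
            visible = text[:-len(eos)] if done else text
            if len(visible) > prev:
                yield visible[prev:]
                prev = len(visible)
            if done:
                return
        time.sleep(poll_s)
    raise TimeoutError("generation did not finish before the deadline")


# Simulated cache contents over successive polls.
snapshots = iter([None, "def ", "def bubble", "def bubble_sort<eos>"])
print("".join(stream_text(lambda: next(snapshots), poll_s=0)))
```

In the notebook, `fetch` could wrap the `get_item` call on the `test_DB_Qing` table for a given `session_id`.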

## Step 6: Turn this into a single-endpoint service

In the previous example, we showed how to create an endpoint and run inference from the notebook. Now, let's build a real-world application using Lambda and API Gateway. Here we use [Chalice](https://github.com/aws/chalice), an open-source toolkit from AWS that wires together commonly used Lambda, DynamoDB, and API Gateway functionality so the stack is easy to deploy.

Chalice requires four major components:

- `app.py`: The place to define your Lambda function and related services
- `requirements.txt`: The pip packages needed to run the application
- `.chalice/config.json`: A JSON file that defines the policy-generation behavior and the deployment stages
- `.chalice/policy-<stage>.json`: A JSON file that defines the policy attached to the Lambda function's IAM role

In [19]:
%pip install chalice requests --upgrade  --quiet

Note: you may need to restart the kernel to use updated packages.


In [60]:
%%writefile app.py
import boto3
import sagemaker
from sagemaker import serializers, deserializers
from chalice import Chalice

app = Chalice(app_name='stream_endpoint')
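TABLE_NAME="test_DB_Qing"  # must match the table name passed to DDBPaginator in model.py
SM_ENDPOINT_NAME="lmi-model-deployment"  # replace with the endpoint_name created in Step 4.2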
sm_predictor = None

@app.route('/query', methods=['POST'])
def run_inference():
    body = app.current_request.json_body
    if "session_id" in body:
        return ddb_fetcher(body["session_id"])
    elif "prompt" in body:
        return get_sm_predictor().predict(body)
    else:
        return {"result" : "Error!", "_debug": body}

def ddb_fetcher(session_id):
    ddb_client = boto3.client('dynamodb')
    result = ddb_client.get_item(TableName=TABLE_NAME, Key={"cache_id": {"S": session_id}})
    if 'Item' in result:
        return {"result" : result['Item']['content']['S']}
    return {"result": "", "_debug": result}


def get_sm_predictor():
    global sm_predictor
    if sm_predictor is None:
        sess = sagemaker.session.Session()
        sm_predictor = sagemaker.Predictor(
            endpoint_name=SM_ENDPOINT_NAME,
            sagemaker_session=sess,
            serializer=serializers.JSONSerializer(),
            deserializer=deserializers.JSONDeserializer(),
        )
    return sm_predictor

Writing app.py


In [61]:
%%writefile requirements.txt
boto3
sagemaker

Writing requirements.txt


In [62]:
%%writefile config.json
{
  "version": "2.0",
  "app_name": "stream_endpoint",
  "stages": {
    "dev": {
      "autogen_policy": false,
      "api_gateway_stage": "api"
    }
  }
}

Writing config.json


In [63]:
%%writefile policy-dev.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*",
      "Effect": "Allow"
    },
    {
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:Scan",
        "dynamodb:Query"
      ],
      "Resource": [
        "arn:aws:dynamodb:*:*:table/test_DB*"
      ],
      "Effect": "Allow"
    },
    {
      "Action": [
        "sagemaker:ListEndpoints",
        "sagemaker:InvokeEndpoint"
      ],
      "Resource": [
        "arn:aws:sagemaker:*:*:endpoint/lmi*"
      ],
      "Effect": "Allow"
    }
  ]
}

Writing policy-dev.json


Now, let's deploy!

In [64]:
%%bash
mkdir -p deployment/.chalice
mv app.py deployment/
mv requirements.txt deployment/
mv policy-dev.json deployment/.chalice/
mv config.json deployment/.chalice/
cd deployment/
chalice deploy

Creating deployment package.
Reusing existing deployment package.
Updating policy for IAM role: stream_endpoint-dev-api_handler
Creating lambda function: stream_endpoint-dev
Creating Rest API
Resources deployed:
  - Lambda ARN: arn:aws:lambda:us-east-1:125045733377:function:stream_endpoint-dev
  - Rest API URL: https://fdtazsc92c.execute-api.us-east-1.amazonaws.com/api/
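Once deployed, the `/query` route defined in app.py accepts either a `prompt` (which starts generation and returns a `session_id`) or a `session_id` (which returns the text cached so far). The sketch below shows one way a client might call it using only the standard library. `post_json` and `start_and_fetch` are hypothetical helper names, and the URL in the usage comment is a placeholder for the "Rest API URL" printed by `chalice deploy`.

```python
import json
import urllib.request


def post_json(url, payload):
    """POST a JSON payload and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def start_and_fetch(api_url, prompt, max_new_tokens=128, post=post_json):
    """Start generation, then fetch whatever text has been cached so far."""
    query_url = api_url.rstrip("/") + "/query"
    started = post(query_url, {"prompt": [prompt], "max_new_tokens": max_new_tokens})
    fetched = post(query_url, {"session_id": started["session_id"]})
    return started["session_id"], fetched["result"]


# Usage (replace the URL with the one printed by `chalice deploy`):
# session_id, text = start_and_fetch("https://<api-id>.execute-api.<region>.amazonaws.com/api/", "hello")
```

A real client would repeat the second request until the returned text ends with `<eos>`, as in Step 5.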




## Clean up the environment

If you deployed the Lambda function and API Gateway in Step 6, run the following to clean them up:

In [66]:
%%bash
cd deployment/
chalice delete

Deleting Rest API: fdtazsc92c
Deleting function: arn:aws:lambda:us-east-1:125045733377:function:stream_endpoint-dev
Deleting IAM role: stream_endpoint-dev-api_handler


Clean up the SageMaker endpoint:

In [18]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()
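The cells above do not remove the DynamoDB table that model.py creates for caching. A minimal sketch for deleting it, assuming the table name `test_DB_Qing` from model.py and credentials that allow `dynamodb:DeleteTable`; `delete_cache_table` is a hypothetical helper name.

```python
def delete_cache_table(ddb_client, table_name):
    """Delete the cache table and block until DynamoDB reports it gone."""
    ddb_client.delete_table(TableName=table_name)
    ddb_client.get_waiter("table_not_exists").wait(TableName=table_name)


# Usage (requires AWS credentials):
# import boto3
# delete_cache_table(boto3.client("dynamodb"), "test_DB_Qing")
```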