# < CANNOT BE USED FOR PRODUCTION >
# Codegen SageMaker inference with Intel optimizations

## Agenda
0. Prerequisites
1. Build the Deep Learning Container and push it to AWS ECR
2. Create a TorchServe file and put it in an S3 bucket
3. Create an AWS SageMaker endpoint
4. Invoke the endpoint
5. Clean up

### Prerequisites

Install all libraries required to run the example.

In [31]:
!pip install "sagemaker>=2.175.0" --upgrade --quiet
! pip install awscli boto3 botocore numpy s3transfer torch-model-archiver==0.8.1 torchserve==0.8.2 --upgrade --quiet
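After the install cells finish, it can help to confirm which versions actually ended up in the kernel. The following is a minimal sketch that uses only the standard library; the package list is copied from the pip commands above, and the helper name `installed_versions` is hypothetical.

```python
from importlib import metadata

# Packages installed by the cells above (a subset, chosen for illustration).
PACKAGES = ["sagemaker", "boto3", "torch-model-archiver", "torchserve"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

A `NOT INSTALLED` entry usually means the kernel needs a restart or the install ran in a different environment.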

Also make sure that your AWS account has the required permissions. To run this example, you need the following managed policies:
- AmazonEC2ContainerRegistryFullAccess
- AmazonEC2FullAccess
- AmazonS3FullAccess

**Also define the following variables.** They are used to build the Deep Learning Container image and push it to AWS ECR.

In [32]:
from datetime import datetime

current_datetime = datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
current_datetime

'2024-03-08-14-05-25'

In [33]:
ACCOUNT_ID = ""
REPOSITORY_NAME = "pytorch_inference"
REGION = "us-west-2"
# modify this based on your S3 Bucket name
S3_BUCKET_NAME = "" # s3://<s3 bucket name>/

In [34]:
# define these variable names based on S3 Bucket name and ECR url
import os
tag = f"2.2.0-cpu-intel-py310-ubuntu20.04-sagemaker-codegen-{current_datetime}"
ECR_URL = f"{ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com/{REPOSITORY_NAME}:{tag}"
# Join with "/" explicitly; os.path.join would use "\" on Windows.
S3_URL = S3_BUCKET_NAME.rstrip("/") + "/codegen25.tar.gz"
endpoint_name = "codegen-ipex"
ECR_URL

'205130860845.dkr.ecr.us-west-2.amazonaws.com/pytorch_inference:2.2.0-cpu-intel-py310-ubuntu20.04-sagemaker-codegen-2024-03-08-14-05-25'
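Because `ACCOUNT_ID` and `S3_BUCKET_NAME` start out empty, a typo here only surfaces later as a failed push or deploy. The sketch below shows one way to validate the inputs before building the two URLs. The helper names `build_ecr_url` and `build_s3_url` are hypothetical, and the account ID and bucket in the usage lines are placeholders.

```python
import re


def build_ecr_url(account_id, region, repository, tag):
    """Return the full ECR image URI, rejecting obviously invalid inputs."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("ACCOUNT_ID must be a 12-digit AWS account ID")
    if not repository or not tag:
        raise ValueError("repository and tag must be non-empty")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def build_s3_url(bucket_url, key):
    """Join an s3://bucket[/prefix] URL and an object key with exactly one slash."""
    base = bucket_url.rstrip("/")
    if not bucket_url.startswith("s3://") or base == "s3:":
        raise ValueError("S3_BUCKET_NAME must look like s3://<s3 bucket name>/")
    return base + "/" + key.lstrip("/")


# Placeholder values for illustration only.
print(build_ecr_url("123456789012", "us-west-2", "pytorch_inference", "demo-tag"))
print(build_s3_url("s3://my-example-bucket/", "codegen25.tar.gz"))
```

Failing early with a `ValueError` keeps the mistake close to the cell where the variables are defined.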

### Build Deep Learning Container and push it to AWS ECR

If you don't have a Docker image prepared beforehand, build one with all the required Intel optimizations.

In [35]:
# Review the Dockerfile
!cat docker/Dockerfile

ARG PYTHON=python3
ARG PYTHON_VERSION=3.10.13
ARG MINIFORGE3_VERSION=23.11.0-0

FROM ubuntu:20.04 AS sagemaker

LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
LABEL com.amazonaws.sagemaker.capabilities.multi-models=true

ARG PYTHON
ARG PYTHON_VERSION
ARG MINIFORGE3_VERSION

ENV TORCHSERVE_VERSION="0.8.2"
ENV SM_TOOLKIT_VERSION="2.0.22"
ENV SAGEMAKER_SERVING_MODULE sagemaker_pytorch_serving_container.serving:main

# Set Env Variables for the images
ENV DEBIAN_FRONTEND=noninteractive
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/lib"
ENV LD_LIBRARY_PATH="/opt/conda/lib:${LD_LIBRARY_PATH}"
ENV PYTHONIOENCODING=UTF-8
# See http://bugs.python.org/issue19846
ENV LANG=C.UTF-8
ENV PATH=/opt/conda/bin:$PATH
ENV TEMP=/home/model-server/tmp
ENV MKL_THREADING_LAYER=GNU

ENV DLC_CONTAINER_TYPE=inference

RUN apt-get -y update \
 && apt-get -y upgrade \
 && apt-get install -y --no-install-recommends \
    build-essential \
    ca-certificat

In [36]:
# build docker image
!docker build -t $ECR_URL docker

Sending build context to Docker daemon  9.216kB
Step 1/42 : ARG PYTHON=python3
Step 2/42 : ARG PYTHON_VERSION=3.10.13
Step 3/42 : ARG MINIFORGE3_VERSION=23.11.0-0
Step 4/42 : FROM ubuntu:20.04 AS sagemaker
 ---> 3cff1c6ff37e
Step 5/42 : LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
 ---> Using cache
 ---> 37386c6722a6
Step 6/42 : LABEL com.amazonaws.sagemaker.capabilities.multi-models=true
 ---> Using cache
 ---> dbb0b22555ff
Step 7/42 : ARG PYTHON
 ---> Using cache
 ---> d653f71c2132
Step 8/42 : ARG PYTHON_VERSION
 ---> Using cache
 ---> 606b73eaefda
Step 9/42 : ARG MINIFORGE3_VERSION
 ---> Using cache
 ---> 47b945f0abbb
Step 10/42 : ENV TORCHSERVE_VERSION="0.8.2"
 ---> Using cache
 ---> 2e3bf6056a8b
Step 11/42 : ENV SM_TOOLKIT_VERSION="2.0.22"
 ---> Using cache
 ---> 0bb5c0bae7b7
Step 12/42 : ENV SAGEMAKER_SERVING_MODULE sagemaker_pytorch_serving_container.serving:main
 ---> Using cache
 ---> 14721f4f33a8
Step 13/42 : ENV DEBIAN_FRONTEND=noninteractive
 ---> Usi

Removing intermediate container 8f78913418f8
 ---> 40ae302009e6
Step 26/42 : RUN pip install --no-cache-dir --extra-index-url https://download.pytorch.org/whl/cpu -U     opencv-python     pyopenssl     "cryptography>41.0.6"     "ipython>=8.10.0,<9.0"     "urllib3>=1.26.18,<2"     "prompt-toolkit<3.0.39"
 ---> Running in cbb4f8a176f9
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu
Collecting opencv-python
  Downloading opencv_python-4.9.0.80-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)
Collecting pyopenssl
  Downloading pyOpenSSL-24.0.0-py3-none-any.whl.metadata (12 kB)
Collecting cryptography>41.0.6
  Downloading cryptography-42.0.5-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (5.3 kB)
Collecting ipython<9.0,>=8.10.0
  Downloading ipython-8.22.2-py3-none-any.whl.metadata (4.8 kB)
Collecting urllib3<2,>=1.26.18
  Downloading urllib3-1.26.18-py2.py3-none-any.whl.metadata (48 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.7/5.7 MB 243.5 MB/s eta 0:00:00
Collecting networkx (from torch==2.2)
  Downloading https://download.pytorch.org/whl/networkx-3.2.1-py3-none-any.whl (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 228.2 MB/s eta 0:00:00
Collecting jinja2 (from torch==2.2)
  Downloading https://download.pytorch.org/whl/Jinja2-3.1.2-py3-none-any.whl (133 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 133.1/133.1 kB 200.6 MB/s eta 0:00:00
Collecting fsspec (from torch==2.2)
  Downloading https://download.pytorch.org/whl/fsspec-2023.4.0-py3-none-any.whl (153 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 154.0/154.0 kB 325.7 MB/s eta 0:00:00
INFO: pip is looking at multiple versions of torchvision to determine which version is compatible with other requirements. This could take a while.
Collecting torchvision
  Downloading https://download.pytorch.org/whl/cpu/torchvision-0.17.0%2Bcpu-cp310-cp310-linux_x86_64.whl (1.6 MB)
     ━━━━━━━━━━━━━━━

Collecting fsspec>=2023.5.0 (from huggingface-hub<1.0,>=0.15.1->transformers==4.33.2)
  Downloading fsspec-2024.2.0-py3-none-any.whl.metadata (6.8 kB)
Collecting zipp>=0.5 (from importlib-metadata->diffusers)
  Downloading zipp-3.17.0-py3-none-any.whl.metadata (3.7 kB)
Downloading transformers-4.33.2-py3-none-any.whl (7.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.6/7.6 MB 132.4 MB/s eta 0:00:00
Downloading diffusers-0.26.3-py3-none-any.whl (1.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.9/1.9 MB 330.2 MB/s eta 0:00:00
Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 280.0/280.0 kB 364.1 MB/s eta 0:00:00
Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 341.2 MB/s eta 0:00:00
Downloading huggingface_hub-0.21.4-py3-none-any.whl (346 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 346.4/346.4 kB 366.7 MB/s eta 0:00:00
Downloadi

Collecting contourpy>=1.0.1 (from matplotlib->captum==0.6.0->-r https://raw.githubusercontent.com/pytorch/serve/v0.8.2/requirements/common.txt (line 3))
  Downloading contourpy-1.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.8 kB)
Collecting cycler>=0.10 (from matplotlib->captum==0.6.0->-r https://raw.githubusercontent.com/pytorch/serve/v0.8.2/requirements/common.txt (line 3))
  Downloading cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting fonttools>=4.22.0 (from matplotlib->captum==0.6.0->-r https://raw.githubusercontent.com/pytorch/serve/v0.8.2/requirements/common.txt (line 3))
  Downloading fonttools-4.49.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (159 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 159.1/159.1 kB 17.0 MB/s eta 0:00:00
Collecting kiwisolver>=1.3.1 (from matplotlib->captum==0.6.0->-r https://raw.githubusercontent.com/pytorch/serve/v0.8.2/requirements/common.txt (line 3))
  Downloading kiwisolver-1.4.5-

Collecting PTable (from -r /root/oss_compliance/python_packages/piplicenses/requirements.txt (line 2))
  Downloading PTable-0.9.2.tar.gz (31 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Building wheels for collected packages: PTable
  Building wheel for PTable (setup.py): started
  Building wheel for PTable (setup.py): finished with status 'done'
  Created wheel for PTable: filename=PTable-0.9.2-py3-none-any.whl size=22907 sha256=34cce791d894fb9119938f350f8f50a824f1581f737f9d17e415f50cf2986680
  Stored in directory: /root/.cache/pip/wheels/bc/88/52/f2e9fc70f3a657cf256e9b01a8a42938c4c5ee69118d51ed90
Successfully built PTable
Installing collected packages: PTable
Successfully installed PTable-0.9.2
  import pkg_resources
Path doesnt exists /opt/conda/lib/python3.10/site-packages/fonttools
Path doesnt exists /opt/conda/lib/python3.10/site-packages/fonttools
Path doesnt exists /opt/conda/lib/python3.10/site-packages/intel_ope

In [37]:
# Authenticate to ECR
!aws ecr get-login-password --region {REGION} | docker login --username AWS --password-stdin {ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
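The CLI pipeline above asks ECR for a temporary password and feeds it to `docker login` with the user name `AWS`. A rough Python equivalent would call boto3's `get_authorization_token`, whose `authorizationToken` field is a base64-encoded `user:password` pair. The sketch below only shows the decoding step and uses a fake token, so it runs without AWS access; `decode_ecr_token` is a hypothetical helper name.

```python
import base64


def decode_ecr_token(authorization_token):
    """Split a base64-encoded 'user:password' ECR token into its two parts."""
    decoded = base64.b64decode(authorization_token).decode("utf-8")
    username, password = decoded.split(":", 1)
    return username, password


# Fake token for illustration; a real one would come from
# boto3.client("ecr").get_authorization_token()["authorizationData"][0]["authorizationToken"].
fake_token = base64.b64encode(b"AWS:not-a-real-password").decode("ascii")
print(decode_ecr_token(fake_token)[0])
```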


In [38]:
# Push docker image
!docker push $ECR_URL

The push refers to repository [205130860845.dkr.ecr.us-west-2.amazonaws.com/pytorch_inference]

6dbbae6c: Preparing
ed8b56a6: Preparing
e530a422: Preparing
ef9f7289: Preparing
c00388c2: Preparing
7c01c12a: Preparing
dce3d9a0: Preparing
1dcac4b3: Preparing
850ed2d1: Preparing
ce5ff191: Preparing
ccc317b3: Preparing
4f7a5ea5: Preparing
50ce8c77: Preparing
6df6acab: Preparing
1bf48f86: Preparing
5a4c4867: Preparing
ccc317b3: Pushed
2.2.0-cpu-intel-py310-ubuntu20.04-sagemaker-codegen-2024-03-08-14-05-25: digest: sha256:45cf2dccdc3176062cab8016194379a85bfece4f2018478c9a17b61fb8b56e7e size: 3898


### Create a TorchServe file and put it in an S3 bucket

The endpoint has been tested with the `Salesforce/codegen25-7b-multi` model. The steps below create the TorchServe file that the Deep Learning Container needs to run the endpoint and upload it to an S3 bucket.

To change the batch size, max length, or max new tokens of the model, modify the corresponding fields in `model-config.yaml` before creating the TorchServe file.

In [39]:
!cd codegen_model && cat model-config.yaml

minWorkers: 1
maxWorkers: 1
responseTimeout: 1500

handler:
    model_name: "Salesforce/codegen25-7b-multi"
    batch_size: 1
    max_length: 128
    max_new_tokens: 32
    ipex_weight_only_quantization: true
    woq_dtype: "INT8"
    lowp_mode: "BF16"
    act_quant_mode: "PER_IC_BLOCK"
    group_size: -1
    token_latency: true
    benchmark: true 
    num_warmup: 2
    num_iter: 8
    greedy: true
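
The handler code (`codegen_handler.py`) is not shown here, so how it consumes these fields is an assumption. As a rough sketch, a handler could copy the `handler` section into a small typed object with defaults; recent TorchServe versions expose the parsed YAML to handlers as `context.model_yaml_config`. The `HandlerConfig` class below is hypothetical and models only a few of the keys.

```python
from dataclasses import dataclass, fields


@dataclass
class HandlerConfig:
    """Hypothetical typed view of a few keys from the `handler` section."""
    model_name: str
    batch_size: int = 1
    max_length: int = 128
    max_new_tokens: int = 32
    greedy: bool = True

    @classmethod
    def from_dict(cls, raw):
        # Ignore keys this sketch does not model (for example woq_dtype).
        known = {f.name for f in fields(cls)}
        cfg = cls(**{k: v for k, v in raw.items() if k in known})
        if cfg.batch_size < 1 or cfg.max_new_tokens < 1:
            raise ValueError("batch_size and max_new_tokens must be positive")
        return cfg


# Stand-in for the dict a YAML parser would produce from model-config.yaml.
handler_section = {
    "model_name": "Salesforce/codegen25-7b-multi",
    "batch_size": 1,
    "max_length": 128,
    "max_new_tokens": 64,  # changed from 32 for illustration
    "woq_dtype": "INT8",
}
config = HandlerConfig.from_dict(handler_section)
print(config)
```

Changes to the YAML only take effect once the file has been repackaged and uploaded again.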
    


To generate the TorchServe file, use the following command:

In [40]:
!cd codegen_model && torch-model-archiver --force --model-name codegen25 --version 1.0 --handler codegen_handler.py --config-file model-config.yaml --extra-files codegen25.py --archive-format tgz
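
Before uploading, it may be worth checking that the archive contains the files passed to the archiver. The sketch below lists the members of a `.tar.gz` file and reports missing names. It assumes the archiver stores the handler, config, and extra file under their original names, which is not verified here. The demo runs on a throwaway archive so that it works without the real model directory.

```python
import io
import tarfile
import tempfile
from pathlib import Path

# File names taken from the torch-model-archiver command above.
EXPECTED = ["codegen_handler.py", "model-config.yaml", "codegen25.py"]


def archive_members(path):
    """Return the sorted file names stored in a .tar.gz archive."""
    with tarfile.open(path, "r:gz") as archive:
        names = [m.name for m in archive.getmembers() if m.isfile()]
    # Normalize a leading "./" so names compare cleanly.
    return sorted(n[2:] if n.startswith("./") else n for n in names)


def missing_files(path, expected=EXPECTED):
    """Return the expected file names that are absent from the archive."""
    present = set(archive_members(path))
    return [name for name in expected if name not in present]


# Demo on a throwaway archive that deliberately omits codegen25.py.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.tar.gz"
    with tarfile.open(demo, "w:gz") as archive:
        for name in ["codegen_handler.py", "model-config.yaml"]:
            payload = b"# placeholder\n"
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    demo_missing = missing_files(demo)

print(demo_missing)
```

For the real archive, the call would be `missing_files("codegen_model/codegen25.tar.gz")`.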



Next, copy the model into an S3 bucket of your choice:

In [41]:
!cd codegen_model && aws s3 cp codegen25.tar.gz $S3_BUCKET_NAME

upload: ./codegen25.tar.gz to s3://intel-sagemaker/codegen25.tar.gz


### Create an AWS SageMaker endpoint

The next step is to deploy the model to AWS SageMaker and create an endpoint to run inference.

In [42]:
import sagemaker
import boto3

boto3_session = boto3.session.Session(region_name=REGION)
# Create both clients from the session so they use the same region.
smr = boto3_session.client('sagemaker-runtime')
sm = boto3_session.client('sagemaker')
role = sagemaker.get_execution_role()
sess = sagemaker.session.Session(boto3_session, sagemaker_client=sm, sagemaker_runtime_client=smr)
region = sess.boto_region_name
account = sess.account_id()

bucket_name = sess.default_bucket()
prefix = "torchserve"
output_path = f"s3://{bucket_name}/{prefix}"
print(f'account={account}, region={region}, role={role}, output_path={output_path}')

account=205130860845, region=us-west-2, role=arn:aws:iam::205130860845:role/sagemaker_fullaccess, output_path=s3://sagemaker-us-west-2-205130860845/torchserve


In [43]:
from sagemaker import Model

instance_type = "ml.m7i.8xlarge"
sagemaker_name = sagemaker.utils.name_from_base(endpoint_name)

model = Model(
    name="torchserve-codegen-ipex-" + datetime.now().strftime("%Y-%m-%d-%H-%M-%S"),
    # TorchServe archive (.tar.gz) uploaded to S3 in the previous step
    model_data=S3_URL,
    image_uri=ECR_URL,
    role=role,
    sagemaker_session=sess,
    env={"TS_INSTALL_PY_DEP_PER_MODEL": "true",
         "SAGEMAKER_CONTAINER_LOG_LEVEL": "0",
         "SAGEMAKER_REGION": region},
)
print(sagemaker_name)
print(model)

codegen-ipex-2024-03-08-14-09-28-465
<sagemaker.model.Model object at 0x7f985c6cf220>


In [44]:
model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=sagemaker_name,
    #volume_size=32, # increase the size to store large model
    model_data_download_timeout=3600, # increase the timeout to download large model
    container_startup_health_check_timeout=600, # increase the timeout to load large model
)

-----!

You can inspect the logs to check whether the model has been deployed successfully.
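
`model.deploy` blocks until the endpoint is ready, but when checking from another session it can be useful to poll the status yourself. The sketch below is a generic polling loop. The helper `wait_for_status` is hypothetical, and the status sequence is simulated so that the code runs without AWS access.

```python
import time


def wait_for_status(get_status, done=("InService",), failed=("Failed",),
                    poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll `get_status()` until it returns a terminal status or the timeout expires."""
    waited = 0
    while True:
        status = get_status()
        if status in done:
            return status
        if status in failed:
            raise RuntimeError(f"endpoint entered status {status}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated status sequence; no real endpoint is contacted.
statuses = iter(["Creating", "Creating", "InService"])
print(wait_for_status(lambda: next(statuses), sleep=lambda seconds: None))
```

Against a real endpoint, `get_status` could be `lambda: sm.describe_endpoint(EndpointName=sagemaker_name)["EndpointStatus"]`. The boto3 waiter `sm.get_waiter("endpoint_in_service")` covers the same need without custom code.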

### Invoke the endpoint

Once the model is deployed, invoke the endpoint and print the streamed response with the following code.

In [None]:
import io
import time

# Create the runtime client in the same region as the endpoint.
client = boto3.client('sagemaker-runtime', region_name=REGION)
task = "Write a python function to compute the factorial of an integer."

custom_attributes = "c000b4f9-df62-4c85-a0bf-7c525f9104a4"  # An example of a trace ID.
content_type = "text/plain"  # The MIME type of the input data in the request body.
accept = "*/*"               # The desired MIME type of the inference in the response.


class Parser:
    """Buffers streamed bytes and yields the text received since the last scan."""

    def __init__(self):
        self.buff = io.BytesIO()
        self.read_pos = 0

    def write(self, content):
        # Append the new chunk at the end of the buffer.
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(content)

    def scan_lines(self):
        # Resume from the last position that was already yielded.
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            self.read_pos += len(line)
            # Indexing bytes returns an int, so test with endswith() and
            # strip the trailing newline only when one is present.
            yield line[:-1] if line.endswith(b'\n') else line

    def reset(self):
        self.read_pos = 0
start_time = time.time()
response = client.invoke_endpoint_with_response_stream(
    EndpointName=sagemaker_name, 
    CustomAttributes=custom_attributes, 
    ContentType=content_type,
    Accept=accept,
    Body=task)
print("--- %s seconds ---" % (time.time() - start_time))

parser = Parser()
for event in response['Body']:
    parser.write(event['PayloadPart']['Bytes'])
    for line in parser.scan_lines():
        print("\n", line.decode("utf-8"), end=' \n')
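
The streamed payload arrives as byte chunks that need not end on line boundaries. As a standalone illustration, independent of the `Parser` class above, the sketch below reassembles complete lines from arbitrary chunks. The chunks are simulated, and `iter_lines` is a hypothetical helper name.

```python
def iter_lines(chunks):
    """Yield complete lines from byte chunks that may split lines arbitrarily."""
    pending = b""
    for chunk in chunks:
        pending += chunk
        # Everything before the last newline is complete; keep the rest buffered.
        *complete, pending = pending.split(b"\n")
        for line in complete:
            yield line
    if pending:
        # Flush a final fragment that was not newline-terminated.
        yield pending


# Simulated payload parts; a real stream would supply event['PayloadPart']['Bytes'].
simulated = [
    b"def fact",
    b"orial(n):\n    return 1 if n < 2 ",
    b"else n * factorial(n - 1)\n",
]
for line in iter_lines(simulated):
    print(line.decode("utf-8"))
```

Buffering until a newline arrives trades a little latency for the guarantee that each decoded line is whole.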

### Clean up

Once you are done with the endpoint, you can delete it with the following call.

In [104]:
sm.delete_endpoint(EndpointName=sagemaker_name)

{'ResponseMetadata': {'RequestId': '666d3bc9-4614-4f72-a267-66677e440ce4',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '666d3bc9-4614-4f72-a267-66677e440ce4',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Thu, 07 Mar 2024 17:42:18 GMT'},
  'RetryAttempts': 0}}
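
Deleting the endpoint does not remove its endpoint configuration or the model registered in SageMaker. As an alternative to the single call above, the sketch below looks both up and deletes all three. The function name `delete_endpoint_resources` is hypothetical, and the demo uses a fake client that records calls, so nothing is deleted when the example runs.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint plus the endpoint config and models it references."""
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [v["ModelName"] for v in config["ProductionVariants"]]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)
    return config_name, model_names


class FakeSageMakerClient:
    """Stand-in that records calls so the sketch runs without AWS access."""

    def __init__(self):
        self.deleted = []

    def describe_endpoint(self, EndpointName):
        return {"EndpointConfigName": EndpointName + "-config"}

    def describe_endpoint_config(self, EndpointConfigName):
        return {"ProductionVariants": [{"ModelName": "demo-model"}]}

    def delete_endpoint(self, EndpointName):
        self.deleted.append(("endpoint", EndpointName))

    def delete_endpoint_config(self, EndpointConfigName):
        self.deleted.append(("endpoint-config", EndpointConfigName))

    def delete_model(self, ModelName):
        self.deleted.append(("model", ModelName))


fake = FakeSageMakerClient()
print(delete_endpoint_resources(fake, "demo-endpoint"))
print(fake.deleted)
```

With the real client from this notebook, the call would be `delete_endpoint_resources(sm, sagemaker_name)`, run instead of `sm.delete_endpoint`, because the lookups fail once the endpoint is gone.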