# < CANNOT BE USED FOR PRODUCTION >
# Codegen Sagemaker inference with Intel optimizations

## Agenda
0. Prerequisites
1. Build Deep Learning Container and push it to AWS ECR
2. Create a Torchserve file and put it on S3 bucket
3. Create AWS Sagemaker endpoint
4. Invoke the endpoint
5. Clean up

### Prerequisites

Install all libraries required to run the example.

In [1]:
!pip install "sagemaker>=2.175.0" --upgrade --quiet
! pip install awscli boto3 botocore numpy s3transfer torch-model-archiver==0.8.1 torchserve==0.8.2 --upgrade

Collecting awscli
  Downloading awscli-1.32.57-py3-none-any.whl.metadata (11 kB)
Collecting boto3
  Downloading boto3-1.34.57-py3-none-any.whl.metadata (6.6 kB)
Collecting botocore
  Downloading botocore-1.34.57-py3-none-any.whl.metadata (5.7 kB)
Collecting numpy
  Downloading numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting torch-model-archiver==0.8.1
  Downloading torch_model_archiver-0.8.1-py3-none-any.whl.metadata (1.3 kB)
Collecting torchserve==0.8.2
  Downloading torchserve-0.8.2-py3-none-any.whl.metadata (1.4 kB)
Collecting enum-compat (from torch-model-archiver==0.8.1)
  Downloading enum_compat-0.0.3-py3-none-any.whl.metadata (954 bytes)
Downloading torch_model_archiver-0.8.1-py3-none-any.whl (14 kB)
Downloading torchserve-0.8.2-py3-none-any.whl (23.8 MB)

Also make sure that your AWS account has all the required permissions. To run this example you need the following access policies:
- AmazonEC2ContainerRegistryFullAccess
- AmazonEC2FullAccess
- AmazonS3FullAccess

**Also define the following variables.** They are needed to build the Deep Learning Containers Docker image and push it to AWS ECR.

In [99]:
ACCOUNT_ID = ""
AWS_SECRET_ACCESS_KEY=""
REPOSITORY_NAME = "pytorch_inference"
REGION = "us-west-2"
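
These variables end up in the image URI that the endpoint pulls from ECR. A minimal sketch of how they combine, assuming the usual `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>` layout; `ecr_image_uri` is a hypothetical helper, and the account ID and tag below are placeholders rather than real values.

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Compose an ECR image URI from its parts (hypothetical helper)."""
    # AWS account IDs are 12-digit numbers; catch an empty or mistyped value early.
    if not account_id.isdigit() or len(account_id) != 12:
        raise ValueError("ACCOUNT_ID must be a 12-digit AWS account ID")
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{repository}:{tag}"


# Placeholder values for illustration only.
example_uri = ecr_image_uri("123456789012", "us-west-2", "pytorch_inference", "example-tag")
print(example_uri)
```

Validating `ACCOUNT_ID` up front avoids a confusing failure later, when the image is pushed or pulled.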

### Build Deep Learning Container and push it to AWS ECR

If you don't have a Docker image prepared beforehand, clone the Deep Learning Containers repository and build the image with all required Intel optimizations.

In [108]:
!rm -rf deep-learning-containers
!git clone https://github.com/aalbersk/deep-learning-containers
!cd deep-learning-containers && git checkout intel_pytorch_ipex

Cloning into 'deep-learning-containers'...
remote: Enumerating objects: 31543, done.
remote: Counting objects: 100% (890/890), done.
remote: Compressing objects: 100% (536/536), done.
remote: Total 31543 (delta 446), reused 599 (delta 297), pack-reused 30653
Receiving objects: 100% (31543/31543), 209.48 MiB | 30.44 MiB/s, done.
Resolving deltas: 100% (19288/19288), done.
Updating files: 100% (1867/1867), done.
branch 'intel_pytorch_ipex' set up to track 'origin/intel_pytorch_ipex'.
Switched to a new branch 'intel_pytorch_ipex'


By default, the Intel DLC contains only the essential Pytorch libraries plus the latest Transformers (4.37). Codegen additionally requires the following libraries:
```python
transformers==4.33.2
tiktoken
```
Because Sagemaker creates the endpoint inside a read-only folder, we cannot use Torchserve's ability to install libraries during initialization. That is why these libraries have to be added to the Dockerfile.

In [110]:
!sed -i 's/transformers/transformers==4.33.2/' deep-learning-containers/pytorch/inference/docker/2.2/py3/Dockerfile.intel.cpu
!sed -i 's/accelerate/accelerate \\\n    tiktoken/' deep-learning-containers/pytorch/inference/docker/2.2/py3/Dockerfile.intel.cpu ;

In [111]:
!cd deep-learning-containers && git diff

diff --git a/pytorch/inference/docker/2.2/py3/Dockerfile.intel.cpu b/pytorch/inference/docker/2.2/py3/Dockerfile.intel.cpu
index 5f6022ba..30d74bb6 100644
--- a/pytorch/inference/docker/2.2/py3/Dockerfile.intel.cpu
+++ b/pytorch/inference/docker/2.2/py3/Dockerfile.intel.cpu
@@ -127,9 +127,10 @@ RUN pip install --no-cache-dir -U \
 RUN python -m pip install oneccl_bind_pt --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/cpu/us/
 
 RUN pip install --no-cache-dir -U \
-    transformers \
+    transformers==4.33.2 \
     diffusers \
-    accelerate
+    accelerate \
+    tiktoken
 
 # Install TorchServe pypi dependencies directly from their requirements.txt file
 # NOTE: This also brings in unnecessary cpu dependencies like nvgpu
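
The effect of the two `sed` substitutions can be previewed in Python before touching the real Dockerfile. This is a minimal sketch, assuming literal (non-regex) patterns; `sed_first_match` is a hypothetical helper that mimics `sed 's/old/new/'` by replacing only the first match on each line, and the fragment is copied from the hunk in the diff.

```python
def sed_first_match(lines, old, new):
    """Replace the first occurrence of `old` on each line, like `sed 's/old/new/'` with a literal pattern."""
    return [line.replace(old, new, 1) for line in lines]


# Fragment copied from the Dockerfile hunk shown in the diff.
fragment = [
    "RUN pip install --no-cache-dir -U \\",
    "    transformers \\",
    "    diffusers \\",
    "    accelerate",
]

patched = sed_first_match(fragment, "transformers", "transformers==4.33.2")
# The replacement adds a line continuation and a new indented line for tiktoken.
patched = sed_first_match(patched, "accelerate", "accelerate \\\n    tiktoken")
patched_text = "\n".join(patched)
print(patched_text)
```

Note that the substitutions are not idempotent: applying the first one twice would yield `transformers==4.33.2==4.33.2`, which is one reason to start from a fresh clone before re-running the `sed` cell.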


By default, the build produces version `2.2` of the Pytorch+IPEX image. To build another version, modify the `version` and `short_version` fields in [pytorch/inference/buildspec-intel.yml](https://github.com/aalbersk/deep-learning-containers/blob/intel_pytorch_ipex/pytorch/inference/buildspec-intel.yml). The command below builds the image and pushes it to your ECR automatically.

In [None]:
!cd deep-learning-containers && PYTHONPATH=$PYTHONPATH:$(pwd):$(pwd)/src INTEL_DEDICATED=true python src/main.py --buildspec pytorch/inference/buildspec-intel.yml --framework pytorch --image_types inference --device_types cpu

### Create a Torchserve file and put it on S3 bucket

In [112]:
# modify this based on your S3 Bucket name
S3_BUCKET_NAME = "" # s3://<s3 bucket name>/

The endpoint has been tested with the `Salesforce/codegen25-7b-multi` model. The steps below create the Torchserve file and upload it to the S3 bucket, which is required to run the endpoint with Deep Learning Containers.

To change the batch size, max length or max new tokens of the model, modify the corresponding fields in `model-config.yaml` before creating the Torchserve file.

In [113]:
!cd codegen_model && cat model-config.yaml

minWorkers: 1
maxWorkers: 1
responseTimeout: 1500

handler:
    model_name: "Salesforce/codegen25-7b-multi"
    batch_size: 1
    max_length: 512
    max_new_tokens: 128
    ipex_weight_only_quantization: true
    woq_dtype: "INT8"
    lowp_mode: "BF16"
    act_quant_mode: "PER_IC_BLOCK"
    group_size: -1
    token_latency: true
    benchmark: true 
    num_warmup: 2
    num_iter: 3
    greedy: true
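
Editing one of the handler fields shown above can also be scripted. The sketch below is a plain-text edit that uses only the standard library and assumes each field sits on its own `name: value` line, as in the file above; `set_handler_field` is a hypothetical helper, not part of Torchserve.

```python
import re


def set_handler_field(config_text: str, field: str, value) -> str:
    """Return config_text with the first `field: ...` line set to `value`, keeping its indentation."""
    pattern = re.compile(rf"^([ \t]*{re.escape(field)}:[ \t]*).*$", flags=re.MULTILINE)
    new_text, count = pattern.subn(lambda m: f"{m.group(1)}{value}", config_text, count=1)
    if count == 0:
        raise KeyError(f"{field} not found in config")
    return new_text


# Shortened sample of the handler section, for illustration.
sample = """handler:
    batch_size: 1
    max_length: 512
    max_new_tokens: 128
"""
updated = set_handler_field(sample, "max_new_tokens", 256)
print(updated)
```

In the notebook, the same function could be applied to the contents of `codegen_model/model-config.yaml` and the result written back before running the archiver.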
    


To generate the Torchserve file, use the following command:

In [114]:
!cd codegen_model && torch-model-archiver --force --model-name codegen25 --version 1.0 --handler codegen_handler.py --config-file model-config.yaml --extra-files codegen25.py --archive-format tgz
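
Before uploading, it can be worth checking what ended up in the archive. A minimal sketch using the standard `tarfile` module; `list_archive_members` is a hypothetical helper, and the exact layout inside the archive is not asserted here. You would expect to find the files passed on the command line (`codegen_handler.py`, `codegen25.py`, `model-config.yaml`) among the entries.

```python
import tarfile


def list_archive_members(archive_path: str) -> list:
    """Return the sorted names of regular files stored in a .tar.gz archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return sorted(member.name for member in archive.getmembers() if member.isfile())


# Example call, assuming the archive was created by the command above:
# print(list_archive_members("codegen_model/codegen25.tar.gz"))
```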



Next, copy the model into an S3 bucket of your choice:

In [115]:
!cd codegen_model && aws s3 cp codegen25.tar.gz $S3_BUCKET_NAME

Completed 5.1 KiB/5.1 KiB (50.7 KiB/s) with 1 file(s) remaining
upload: ./codegen25.tar.gz to s3://intel-sagemaker/codegen25.tar.gz


### Create AWS Sagemaker endpoint

The next step is to deploy the model to AWS Sagemaker and create an endpoint to run inference.

In [116]:
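# Define these variables based on your S3 bucket name and ECR image URL.
# The image tag below comes from the author's build; replace it with the tag of your own image.
ECR_URL = f"{ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com/{REPOSITORY_NAME}:2.2.0-cpu-intel-py310-ubuntu20.04-sagemaker-codegen-2024-03-07-17-00-38"
# Join with "/" explicitly; os.path.join is meant for filesystem paths, not S3 URLs.
S3_URL = S3_BUCKET_NAME.rstrip("/") + "/codegen25.tar.gz"
endpoint_name = "codegen-ipex"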

In [117]:
from datetime import datetime

current_datetime = datetime.now().strftime('%Y-%m-%d-%H-%M-%S')

In [118]:
import sagemaker
import boto3

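boto3_session = boto3.session.Session(region_name=REGION)
# Create both clients from the session so that they use the same region
smr = boto3_session.client('sagemaker-runtime')
sm = boto3_session.client('sagemaker')
role = sagemaker.get_execution_role()
sess = sagemaker.session.Session(boto3_session, sagemaker_client=sm, sagemaker_runtime_client=smr)
region = sess.boto_region_name  # public attribute instead of the private _region_name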
account = sess.account_id()

bucket_name = sess.default_bucket()
prefix = "torchserve"
output_path = f"s3://{bucket_name}/{prefix}"
print(f'account={account}, region={region}, role={role}, output_path={output_path}')

account=205130860845, region=us-west-2, role=arn:aws:iam::205130860845:role/sagemaker_fullaccess, output_path=s3://sagemaker-us-west-2-205130860845/torchserve


In [119]:
from sagemaker import Model

instance_type = "ml.m7i.8xlarge"
sagemaker_name = sagemaker.utils.name_from_base(endpoint_name)

model = Model(
    name="torchserve-codegen-ipex-" + datetime.now().strftime("%Y-%m-%d-%H-%M-%S"),
    # Torchserve archive (codegen25.tar.gz) uploaded to S3 above
    model_data=S3_URL,
    image_uri=ECR_URL,
    role=role,
    sagemaker_session=sess,
    env={"TS_INSTALL_PY_DEP_PER_MODEL": "true",
         "SAGEMAKER_CONTAINER_LOG_LEVEL": "0",
         "SAGEMAKER_REGION": region},
)
print(sagemaker_name)
print(model)

codegen-ipex-2024-03-07-18-00-44-327
<sagemaker.model.Model object at 0x7f3bdd69bac0>


In [None]:
model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=sagemaker_name,
    #volume_size=32, # increase the size to store large model
    model_data_download_timeout=3600, # increase the timeout to download large model
    container_startup_health_check_timeout=600, # increase the timeout to load large model
)


You can inspect the logs to check whether the model has been deployed successfully.
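
Besides reading the logs, the endpoint status can be polled programmatically, for example from a second session while `model.deploy` is still running. The sketch below assumes the boto3 `describe_endpoint` call and its `EndpointStatus` field; `wait_for_endpoint` is a hypothetical helper, and it takes the describe function as an argument so it can be exercised without AWS access.

```python
import time


def wait_for_endpoint(describe_endpoint, endpoint_name, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll until the endpoint leaves the "Creating" state and return the final status.

    `describe_endpoint` is any callable shaped like boto3's SageMaker
    `describe_endpoint`, e.g. `sm.describe_endpoint`.
    """
    for _ in range(max_polls):
        status = describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status != "Creating":
            return status  # e.g. "InService" or "Failed"
        sleep(poll_seconds)
    raise TimeoutError(f"{endpoint_name} is still being created after {max_polls} polls")


# Example call with the client defined earlier in this notebook:
# print(wait_for_endpoint(sm.describe_endpoint, sagemaker_name))
```

A status of `InService` means the endpoint is ready to be invoked; `Failed` means the container logs are the next place to look.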

### Invoke the endpoint

Once the model is deployed, invoke the endpoint with a sample request using the following code.

In [100]:
import time, json

client = boto3.client('sagemaker-runtime')
task = "Write a python function to compute the factorial of an integer."

custom_attributes = "c000b4f9-df62-4c85-a0bf-7c525f9104a4"  # An example of a trace ID.
content_type = "text/plain"                           # The MIME type of the input data in the request body.
accept = "*/*"                                              # The desired MIME type of the inference in the response.

import io

class Parser:
    def __init__(self):
        self.buff = io.BytesIO()
        self.read_pos = 0
        
    def write(self, content):
        # Always append new bytes at the end of the buffer
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(content)

    def scan_lines(self):
        # Yield only complete lines; a partial line stays buffered until its newline arrives
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            # Indexing bytes returns an int, so compare a one-byte slice instead
            if line[-1:] == b'\n':
                self.read_pos += len(line)
                yield line[:-1]

    def reset(self):
        self.read_pos = 0

start_time = time.time()
response = client.invoke_endpoint_with_response_stream(
    EndpointName=sagemaker_name,
    CustomAttributes=custom_attributes,
    ContentType=content_type,
    Accept=accept,
    Body=task)
print("--- %s seconds ---" % (time.time() - start_time))

parser = Parser()
for event in response['Body']:
    parser.write(event['PayloadPart']['Bytes'])
    for line in parser.scan_lines():
        print("\n", line.decode("utf-8"), end=' \n')

# Print whatever is left if the stream did not end with a newline
parser.buff.seek(parser.read_pos)
tail = parser.buff.read()
if tail:
    print("\n", tail.decode("utf-8"), end=' \n')

ReadTimeoutError: Read timeout on endpoint URL: "https://runtime.sagemaker.us-west-2.amazonaws.com/endpoints/codegen-ipex-2024-03-07-17-28-49-991/invocations-response-stream"
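
The timer in the cell above stops as soon as `invoke_endpoint_with_response_stream` returns, so it most likely does not cover the time spent receiving the streamed body. A minimal sketch of timing each chunk as it arrives; `timed_chunks` is a hypothetical helper, and the events below are simulated stand-ins for `response['Body']`.

```python
import time


def timed_chunks(events, start, clock=time.perf_counter):
    """Yield (seconds_since_start, payload_bytes) for each PayloadPart event in a response stream."""
    for event in events:
        part = event.get("PayloadPart")
        if part is None:
            continue  # ignore events that carry no payload
        yield clock() - start, part["Bytes"]


# Simulated events standing in for response['Body'].
fake_events = [
    {"PayloadPart": {"Bytes": b"def factorial(n):\n"}},
    {"PayloadPart": {"Bytes": b"    return 1 if n < 2 else n * factorial(n - 1)\n"}},
]
start = time.perf_counter()  # in the notebook, take this timestamp just before invoking the endpoint
for elapsed, payload in timed_chunks(fake_events, start):
    print(f"{elapsed:.3f}s {payload!r}")
```

The first yielded timestamp approximates time to first chunk and the last one the total streaming time.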

### Clean up

Once you are done with the endpoint, you can delete it with the following call.

In [104]:
sm.delete_endpoint(EndpointName=sagemaker_name)

{'ResponseMetadata': {'RequestId': '666d3bc9-4614-4f72-a267-66677e440ce4',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '666d3bc9-4614-4f72-a267-66677e440ce4',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Thu, 07 Mar 2024 17:42:18 GMT'},
  'RetryAttempts': 0}}
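
`delete_endpoint` removes only the endpoint itself; the endpoint configuration and the model registered by `model.deploy` stay in the account. A minimal sketch of removing all three, assuming the boto3 Sagemaker client calls `describe_endpoint`, `delete_endpoint_config` and `delete_model`; `delete_endpoint_resources` is a hypothetical helper.

```python
def delete_endpoint_resources(sm_client, endpoint_name, model_name):
    """Delete an endpoint, its endpoint configuration, and the model behind it."""
    # Look up the configuration name before the endpoint is gone.
    config_name = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointConfigName"]
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    sm_client.delete_model(ModelName=model_name)
    return config_name


# Example call with the objects defined earlier in this notebook,
# to be used instead of the plain delete_endpoint call above:
# delete_endpoint_resources(sm, sagemaker_name, model.name)
```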