<div class="alert alert-info"> <strong> Note </strong>
This notebook was tested with the `Data Science` kernel on an Amazon SageMaker notebook instance of type `t3.medium`.
</div>

In [3]:
!pip install -U sagemaker  #update sagemaker to the latest version

Collecting sagemaker
  Downloading sagemaker-2.142.0.tar.gz (685 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 685.8/685.8 kB 7.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting importlib-metadata<5.0,>=1.4.0
  Using cached importlib_metadata-4.13.0-py3-none-any.whl (23 kB)
Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py) ... done
  Created wheel for sagemaker: filename=sagemaker-2.142.0-py2.py3-none-any.whl size=927429 sha256=15fc654d33a29c025462f97486c96ecc9be379b9bd824065320d7e1acee9e8a8
  Stored in directory: /root/.cache/pip/wheels/90/1d/81/207852cda88c2f85a32fb5c7cd0aac3ec6c09069747b9ec526
Successfully built sagemaker
Installing collected packages: importlib-metadata, sagemaker
  Attempting uninstall: importlib-metadata
    Found existing installation: importlib-metadata 6.1.0
    Uninstalling importlib-metadata-6.1.0:
      Successfully unin

In [5]:
import sagemaker 
print(sagemaker.__version__)

2.142.0


## Setting up the environment

In [6]:
import sagemaker
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/spaCY"

role = sagemaker.get_execution_role()

bucket

'sagemaker-ap-southeast-1-201364840562'

## SageMaker training job

In [None]:
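# GPU version: train on a single ml.g4dn.xlarge using managed spot capacity
pytorch_estimator = PyTorch(entry_point='train.py',
                            source_dir='code',
                            role=role,
                            instance_type='ml.g4dn.xlarge',
                            instance_count=1,
                            framework_version='1.12',
                            py_version='py38',
                            use_spot_instances=True,
                            max_run=6400,   # maximum training time, in seconds
                            max_wait=7200,  # maximum total time incl. waiting for spot capacity; must be >= max_run
                            use_cuda=True
                           )
pytorch_estimator.fit()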


In [None]:
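# CPU version: same job on a single ml.m5.xlarge using managed spot capacity
pytorch_estimator = PyTorch(entry_point='train.py',
                            source_dir='code',
                            role=role,
                            instance_type='ml.m5.xlarge',
                            instance_count=1,
                            framework_version='1.12',
                            py_version='py38',
                            use_spot_instances=True,
                            max_run=6400,   # maximum training time, in seconds
                            max_wait=7200,  # maximum total time incl. waiting for spot capacity; must be >= max_run
                            use_cuda=False
                           )
pytorch_estimator.fit()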

In [2]:
# Describe the latest training job and look up where its model artifact was written
description = pytorch_estimator.latest_training_job.describe()
description.keys()
description["ModelArtifacts"]["S3ModelArtifacts"]

NameError: name 'pytorch_estimator' is not defined
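
The dictionary returned by `describe()` also carries timing fields for the job. As a minimal sketch, assuming the job ran on managed spot capacity and that the response includes `TrainingTimeInSeconds` and `BillableTimeInSeconds`, the spot savings could be estimated as below. `spot_savings_percent` is a hypothetical helper and the sample values are made up for illustration, not taken from the run above.

```python
def spot_savings_percent(description):
    """Estimate managed spot savings from a DescribeTrainingJob-style dict."""
    training = description["TrainingTimeInSeconds"]
    billable = description["BillableTimeInSeconds"]
    if training <= 0:
        return 0.0
    # Savings are the share of training time that was not billed
    return round((1 - billable / training) * 100, 1)


# Illustrative values only (assumption): 600 s of training, 180 s billed
sample_description = {"TrainingTimeInSeconds": 600, "BillableTimeInSeconds": 180}
savings = spot_savings_percent(sample_description)
print(f"Managed spot training savings: {savings}%")
```

In the notebook, the same function could be called on `description` once the training cell has been run in the current kernel session.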

In [73]:
!aws s3 cp s3://sagemaker-ap-southeast-1-201364840562/pytorch-training-2023-03-29-02-20-30-090/output/model.tar.gz  s3://sagemaker-ap-southeast-1-201364840562/spacy_model/en_ner_fashion-0.0.0.tar.gz 

copy: s3://sagemaker-ap-southeast-1-201364840562/pytorch-training-2023-03-29-02-20-30-090/output/model.tar.gz to s3://sagemaker-ap-southeast-1-201364840562/spacy_model/en_ner_fashion-0.0.0.tar.gz
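
The copy above moves the training artifact under the `spacy_model/` prefix with the CLI. When handling such locations from Python instead, it helps to split an S3 URI into its bucket and key first. A minimal sketch using only the standard library; `split_s3_uri` and the example URI are hypothetical and chosen for illustration:

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into a (bucket, key) tuple."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {uri}")
    # netloc is the bucket; the path (minus its leading slash) is the object key
    return parsed.netloc, parsed.path.lstrip("/")


# Example URI is an assumption, not a real artifact location
bucket_name, key = split_s3_uri("s3://example-bucket/training-job/output/model.tar.gz")
print(bucket_name, key)
```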


## Inference with the model through a real-time endpoint

### Run Multiple NLP BERT Models on GPU with Amazon SageMaker Multi-Model Endpoints (MME)

[Amazon SageMaker](https://aws.amazon.com/sagemaker/) multi-model endpoints (MME) provide a scalable and cost-effective way to deploy a large number of deep learning models. Previously, customers had limited options for deploying hundreds of deep learning models that need accelerated compute with GPUs. Now customers can deploy thousands of deep learning models behind one SageMaker endpoint. MME can run multiple models on a GPU core, share GPU instances behind an endpoint across multiple models, and dynamically load and unload models based on incoming traffic. This lets customers reduce cost significantly while improving price performance.
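
To make the idea of loading and unloading models on demand concrete, here is a toy sketch of a least-recently-used cache. This is an illustration only: the text above does not say which eviction policy SageMaker uses, so LRU is an assumption, and `ModelCache` and its `loader` callback are hypothetical names.

```python
from collections import OrderedDict


class ModelCache:
    """Toy LRU cache: load a model on first use, evict the least recently used one."""

    def __init__(self, capacity, loader):
        self.capacity = capacity          # how many models fit in memory at once
        self.loader = loader              # callable that loads a model by name
        self._models = OrderedDict()      # insertion order tracks recency of use

    def get(self, name):
        if name in self._models:
            # Already loaded: mark it as most recently used
            self._models.move_to_end(name)
            return self._models[name]
        model = self.loader(name)
        self._models[name] = model
        if len(self._models) > self.capacity:
            # Over capacity: unload the least recently used model
            self._models.popitem(last=False)
        return model

    def loaded(self):
        return list(self._models)


cache = ModelCache(capacity=2, loader=lambda name: f"<model {name}>")
for target in ["model1.tar.gz", "model2.tar.gz", "model1.tar.gz", "model3.tar.gz"]:
    cache.get(target)
print(cache.loaded())
```

With room for two models, requesting a third evicts `model2.tar.gz`, which was used least recently, while `model1.tar.gz` stays loaded.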




In [61]:
# Access the necessary SageMaker clients
import boto3

sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client(service_name="sagemaker-runtime")

account_id = boto3.client("sts").get_caller_identity()["Account"]
region = boto3.Session().region_name


In [None]:
# now, we have 3 models in place

In [None]:
!aws s3 cp s3://sagemaker-ap-southeast-1-201364840562/pytorch-training-2023-03-29-02-20-30-090/output/model.tar.gz  s3://sagemaker-ap-southeast-1-201364840562/spacy_model/en_ner_fashion-0.0.0.tar.gz 

In [79]:
!aws s3 ls s3://sagemaker-ap-southeast-1-201364840562/spacy_model/

2023-03-29 01:59:09          0 
2023-03-29 03:06:34    8202671 en_ner_fashion-0.0.0.tar.gz
2023-03-29 03:15:19    8202671 en_ud_en_ewt-0.0.0.tar.gz
2023-03-29 02:46:48    8202671 model1.tar.gz
2023-03-29 02:47:00    8202671 model2.tar.gz
2023-03-29 02:47:11    8202671 model3.tar.gz


In [75]:
# Model creation artefacts
from time import gmtime, strftime

model_name = "spacy-realtime-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
# S3 prefix that holds every model archive served by the multi-model endpoint
spacy_model_url = 's3://sagemaker-ap-southeast-1-201364840562/spacy_model/'
# URI of the custom inference image in Amazon ECR
image_uri = "{}.dkr.ecr.{}.amazonaws.com/spacy-sagemaker-realtime:latest".format(account_id, region)
print(image_uri)
instance_type = "ml.m5.xlarge"

print("Model name: " + model_name)
print("Model data Url: " + spacy_model_url)
print("Container image: " + image_uri)

# "Mode": "MultiModel" makes SageMaker treat ModelDataUrl as a prefix containing many models
container = {"Image": image_uri, 
             "ModelDataUrl": spacy_model_url, 
             "Mode": "MultiModel"}

create_model_response = sm_client.create_model(
                            ModelName=model_name, 
                            ExecutionRoleArn=role, 
                            Containers=[container]
)

print("Model ARN: " + create_model_response["ModelArn"])

201364840562.dkr.ecr.ap-southeast-1.amazonaws.com/spacy-sagemaker-realtime:latest
Model name: spacy-realtime-2023-03-29-03-15-40
Model data Url: s3://sagemaker-ap-southeast-1-201364840562/spacy_model/
Container image: 201364840562.dkr.ecr.ap-southeast-1.amazonaws.com/spacy-sagemaker-realtime:latest
Model ARN: arn:aws:sagemaker:ap-southeast-1:201364840562:model/spacy-realtime-2023-03-29-03-15-40
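
With `"Mode": "MultiModel"`, `ModelDataUrl` is an S3 prefix rather than a single archive, and each invocation later names one archive relative to that prefix through `TargetModel`. The sketch below shows how the two combine into a full object location. `resolve_target_model` is a hypothetical helper and the bucket name is a placeholder.

```python
def resolve_target_model(model_data_url, target_model):
    """Join an MME ModelDataUrl prefix and a TargetModel into a full S3 URI."""
    # Normalise the prefix so that exactly one slash separates it from the archive name
    prefix = model_data_url if model_data_url.endswith("/") else model_data_url + "/"
    return prefix + target_model.lstrip("/")


# Placeholder bucket (assumption); the archive name matches one listed earlier
full_uri = resolve_target_model("s3://example-bucket/spacy_model/", "en_ner_fashion-0.0.0.tar.gz")
print(full_uri)
```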


In [76]:
# Endpoint configuration used to create the SageMaker endpoint
endpoint_config_name = "spacy-realtime-config" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Endpoint config name: " + endpoint_config_name)

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 1,
            "ModelName": model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint config ARN: " + create_endpoint_config_response["EndpointConfigArn"])

Endpoint config name: spacy-realtime-config2023-03-29-03-16-15
Endpoint config ARN: arn:aws:sagemaker:ap-southeast-1:201364840562:endpoint-config/spacy-realtime-config2023-03-29-03-16-15


In [77]:
#Create the new endpoint
endpoint_name = "spacy-realtime-endpoint-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print("Endpoint name: " + endpoint_name)

create_endpoint_response = sm_client.create_endpoint(
                                            EndpointName=endpoint_name, 
                                            EndpointConfigName=endpoint_config_name
                                        )

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Endpoint Status: " + status)

print("Waiting for {} endpoint to be in service...".format(endpoint_name))
waiter = sm_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)

Endpoint name: spacy-realtime-endpoint-2023-03-29-03-16-27
Endpoint Arn: arn:aws:sagemaker:ap-southeast-1:201364840562:endpoint/spacy-realtime-endpoint-2023-03-29-03-16-27
Endpoint Status: Creating
Waiting for spacy-realtime-endpoint-2023-03-29-03-16-27 endpoint to be in service...
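
The `endpoint_in_service` waiter hides a polling loop over `describe_endpoint`. The following is a rough sketch of that idea, not boto3's actual implementation: `wait_for_endpoint` is a hypothetical helper, the delay and attempt counts are arbitrary, and a fake `describe` function stands in for `sm_client.describe_endpoint` so the example runs without AWS access.

```python
import time


def wait_for_endpoint(describe, endpoint_name, delay=30, max_attempts=60, sleep=time.sleep):
    """Poll `describe` until the endpoint is InService, fails, or attempts run out."""
    for _ in range(max_attempts):
        status = describe(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        sleep(delay)
    raise TimeoutError(f"Endpoint {endpoint_name} not in service after {max_attempts} attempts")


# Fake describe function (assumption) that reports Creating twice, then InService
statuses = iter(["Creating", "Creating", "InService"])


def fake_describe(EndpointName):
    return {"EndpointStatus": next(statuses)}


final_status = wait_for_endpoint(fake_describe, "demo-endpoint", sleep=lambda seconds: None)
print(final_status)
```

Against a real endpoint, `describe` would be `sm_client.describe_endpoint` and the default `sleep` would be left in place.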


## Invoke the model through the MME

In [78]:
import json

import pandas as pd
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

#csv_serializer = CSVSerializer()
test_df = pd.read_csv("payload_spacy.csv")
#payload = "Amazon SageMaker helps data scientists and developers to prepare, build, train, and deploy high-quality machine learning (ML) models quickly by bringing together a broad set of capabilities purpose-built for ML."

# TargetModel is the archive name relative to the ModelDataUrl prefix
response = runtime_sm_client.invoke_endpoint(
        EndpointName=endpoint_name,
        TargetModel="en_ner_fashion-0.0.0.tar.gz",
        Body=test_df.to_csv(index=False)
    )

# Raw bytes of the response body
return_df = response["Body"].read()

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from primary with message "{
  "code": 500,
  "type": "InternalServerException",
  "message": "Failed to start workers"
}
". See https://ap-southeast-1.console.aws.amazon.com/cloudwatch/home?region=ap-southeast-1#logEventViewer:group=/aws/sagemaker/Endpoints/spacy-realtime-endpoint-2023-03-29-03-16-27 in account 201364840562 for more information.

In [None]:
from io import BytesIO

# Parse the CSV bytes returned by the endpoint, skipping malformed rows
results = pd.read_csv(BytesIO(return_df), on_bad_lines='skip')
results
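
The `ModelError` returned by the invocation above points to the endpoint's CloudWatch log group, `/aws/sagemaker/Endpoints/<endpoint name>`, for details on why the workers failed to start. A minimal sketch of pulling the most recent log lines from Python follows. `latest_endpoint_logs` is a hypothetical helper, and a stub client with made-up responses stands in for `boto3.client("logs")` so the example runs without AWS access.

```python
def latest_endpoint_logs(logs_client, endpoint_name, limit=20):
    """Return messages from the most recently active log stream of an endpoint."""
    log_group = f"/aws/sagemaker/Endpoints/{endpoint_name}"
    # Find the log stream that received events most recently
    streams = logs_client.describe_log_streams(
        logGroupName=log_group, orderBy="LastEventTime", descending=True, limit=1
    )["logStreams"]
    if not streams:
        return []
    events = logs_client.get_log_events(
        logGroupName=log_group,
        logStreamName=streams[0]["logStreamName"],
        limit=limit,
    )["events"]
    return [event["message"] for event in events]


class StubLogsClient:
    """Stand-in for boto3.client("logs"); the responses are made up for illustration."""

    def describe_log_streams(self, **kwargs):
        return {"logStreams": [{"logStreamName": "AllTraffic/i-stub"}]}

    def get_log_events(self, **kwargs):
        return {"events": [{"message": "stub log line"}]}


messages = latest_endpoint_logs(StubLogsClient(), "demo-endpoint")
print(messages)
```

In the notebook, passing `boto3.client("logs")` and `endpoint_name` instead of the stub would return the container's own startup messages.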

In [None]:
text_triton = "Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs."
input_ids, attention_mask = tokenize_text(text_triton)

payload = {
    "inputs": [
        {"name": "token_ids", "shape": [1, 128], "datatype": "INT32", "data": input_ids},
        {"name": "attn_mask", "shape": [1, 128], "datatype": "INT32", "data": attention_mask},
    ]
}

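# Send the same payload to each of the N models (N and tokenize_text must be defined beforehand)
for i in range(N):
    response = runtime_sm_client.invoke_endpoint(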
        EndpointName=endpoint_name, 
        ContentType="application/octet-stream", 
        Body=json.dumps(payload), 
        TargetModel=f"bert-{i}.tar.gz"
    )

    print(json.loads(response["Body"].read().decode("utf8")))

In [None]:
import tritonclient.http as httpclient  # Triton client, used below to parse the binary response

text_sm = "Amazon SageMaker helps data scientists and developers to prepare, build, train, and deploy high-quality machine learning (ML) models quickly by bringing together a broad set of capabilities purpose-built for ML."
request_body, header_length = get_sample_tokenized_text_binary_trt(text_sm)

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(
        header_length
    ),
    Body=request_body,
    TargetModel="bert-0.tar.gz"
)

# Parse json header size length from the response
header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
header_length_str = response["ContentType"][len(header_length_prefix) :]

# Read response body
result = httpclient.InferenceServerClient.parse_response_body(
    response["Body"].read(), header_length=int(header_length_str)
)
# print(response)
# print(result)
output0_data = result.as_numpy("output")
output1_data = result.as_numpy("pooled_output")
print(output0_data)
print(output1_data)