### Hosting a HuggingFace model in Amazon SageMaker
#### This notebook runs on both GPU and CPU instances, but it runs faster on a GPU because it uses an NLP neural network model twice before hosting: once to generate word embeddings and once to test the model locally. Measured end-to-end runtimes: ml.p3.2xlarge, 9.5 minutes; ml.c5.4xlarge, 18.5 minutes. Your mileage may vary.
#### This notebook pulls a BERT model from the HuggingFace repo to demonstrate a Semantic Search use case.
#### In addition, this notebook demonstrates the following capabilities:
* how to add extra artifacts, such as word embeddings, to the pretrained model.tar.gz.
* how to pull a container from ECR.
* how to add a custom inference.py “entry point” script, as well as additional files/directories to be used during inference.
* how to add dependencies (e.g., requirements.txt) to be installed when the container launches.
* how to set up environment variables that allow the code inside the inference container to access content outside the VPC.
* how to deploy the model.
* how to invoke the model.
* how to call CloudWatch via Python APIs to obtain model invocation metrics such as model latency and overhead latency.

![semantic_search_image](notebook_images/semantic_search_image.png)

#### Semantic Search is a search technique that incorporates the contextual meaning of words or phrases. It relies on an offline conversion of the corpus into word embeddings using (in this example) a distilbert model. During a real-time search (i.e., "model inference"), the query is also converted into an embedding using the same distilbert model. The query embedding is then ranked against all embeddings of the corpus using cosine similarity, and the top-ranked matches are presented. In this simplified example, the corpus embeddings are included in the same tarball as the model itself. Including the corpus in the model's tarball is not a scalable solution, because it is limited by the available memory and compute power of the inference endpoint. A scalable solution would use a dedicated search service such as Elasticsearch instead.
#### See this reference for the semantic search similarity algorithm implemented here:
https://www.sbert.net/examples/applications/semantic-search/README.html
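To make the ranking step concrete, here is a minimal, dependency-free sketch of cosine-similarity ranking. It is an illustration only: the notebook itself relies on `util.semantic_search` from sentence-transformers, and the helper names and the two-dimensional toy vectors below are assumptions made for this example. The returned dictionaries mirror the `corpus_id` and `score` keys used later in the notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_corpus(query_emb, corpus_embs, top_k=3):
    """Score every corpus embedding against the query and keep the top_k best hits."""
    scored = [
        {"corpus_id": i, "score": cosine_similarity(query_emb, emb)}
        for i, emb in enumerate(corpus_embs)
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]


# Toy data (hypothetical): three "documents" and one "query" in a 2-D embedding space.
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.1]
hits = rank_corpus(toy_query, toy_corpus, top_k=2)
print([hit["corpus_id"] for hit in hits])  # [0, 2]
```

Real sentence embeddings have hundreds of dimensions (768 for the model used here), so a vectorized library routine is the practical choice; the loop above only shows what is being computed.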

#### For the dataset, we will use the misinformation_papers.csv file available here:
https://github.com/orion-search/tutorials/blob/master/data/misinformation_papers.csv
Please use your own method to download this dataset and place it in the ./dataset folder.
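One possible way to fetch the file is sketched below using only the standard library. It assumes that the usual GitHub convention applies, i.e. that a `github.com/.../blob/...` page URL maps to a `raw.githubusercontent.com` URL for the file content, and that the notebook environment has internet access. The helper names are hypothetical.

```python
import os
import urllib.request

BLOB_URL = "https://github.com/orion-search/tutorials/blob/master/data/misinformation_papers.csv"


def to_raw_url(blob_url):
    """Convert a github.com "blob" page URL into the matching raw-content URL (assumed convention)."""
    raw = blob_url.replace("https://github.com/", "https://raw.githubusercontent.com/", 1)
    return raw.replace("/blob/", "/", 1)


def download_dataset(blob_url=BLOB_URL, target_dir="./dataset"):
    """Download the CSV into target_dir and return the local path."""
    os.makedirs(target_dir, exist_ok=True)
    target_path = os.path.join(target_dir, os.path.basename(blob_url))
    urllib.request.urlretrieve(to_raw_url(blob_url), target_path)
    return target_path


# download_dataset()  # uncomment to fetch the file; requires internet access
```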

Publications and other sources used in creating this notebook: https://towardsdatascience.com/how-to-build-a-semantic-search-engine-with-transformers-and-faiss-dcbea307a0e8 

https://github.com/orion-search/tutorials

In [1]:
import time
nb_start = time.time()

### Installing required libraries

In [2]:
!python -m pip install --upgrade pip

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting pip
  Downloading pip-22.3.1-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.4
    Uninstalling pip-22.0.4:
      Successfully uninstalled pip-22.0.4
Successfully installed pip-22.3.1


In [3]:
!pip install "sagemaker>=2.48.0" "transformers>=4.12.3" "datasets[s3]>=1.18.3" --upgrade

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting sagemaker>=2.48.0
  Downloading sagemaker-2.121.2.tar.gz (620 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers>=4.12.3
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting datasets[s3]>=1.18.3
  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)
Collecting packaging==20.9
  Downloading packaging-20.9-py2.py3-none-any.whl (40 kB)

Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py) ... done
  Created wheel for sagemaker: filename=sagemaker-2.121.2-py2.py3-none-any.whl size=844051 sha256=737bbdbc083e9d4d5bc591b8e149e92cd7320b103ce555d3b2be717c92cc20bc
  Stored in directory: /home/ec2-user/.cache/pip/wheels/46/dc/fc/d947addd83079e53244196ead545f1800abfc5e043a0cf9c1a
Successfully built sagemaker
Installing collected packages: tokenizers, xxhash, packaging, responses, huggingface-hub, botocore, transformers, datasets, sagemaker
  Attempting uninstall: packaging
    Found existing installation: packaging 21.3
    Uninstalling packaging-21.3:
      Successfully uninstalled packaging-21.3
  Attempting uninstall: botocore
    Found existing installation: botocore 1.24.19
    Uninstalling botocore-1.24.19:
      Successfully uninstalled botocore-1.24.19
  Attempting uninstall: sagemaker
    Found existing installation: sagemaker 2.119.0
    Uninstalling sagemaker-2.119.

In [4]:
!pip install sentence-transformers
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125938 sha256=707885de39bfa61f853975b98d8725a82be2bd760cad3fe0bf0e35fe06743bc5
  Stored in directory: /home/ec2-user/.cache/pip/wheel

### Importing libraries, setting up AWS S3 buckets and AWS IAM roles. 

In [5]:
import sagemaker

sm_session = sagemaker.Session()
# SageMaker session bucket -> used for uploading data, models and logs.
# SageMaker automatically creates this bucket if one is not provided.
# However, enterprise users typically don't have permission to create buckets themselves
# or to have AWS services (such as SageMaker) create buckets for them.
sagemaker_session_bucket = "huggingface-bucket-se"
if sagemaker_session_bucket is None and sm_session is not None:
    # fall back to the session's default bucket if no bucket name is given
    sagemaker_session_bucket = sm_session.default_bucket()

role = sagemaker.get_execution_role()
sm_session = sagemaker.Session(default_bucket=sagemaker_session_bucket)
default_bucket=sagemaker_session_bucket
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sm_session.default_bucket()}")
print(f"sagemaker session region: {sm_session.boto_region_name}")

sagemaker role arn: arn:aws:iam::328296961357:role/service-role/AmazonSageMaker-ExecutionRole-20191125T182032
sagemaker bucket: huggingface-bucket-se
sagemaker session region: us-west-2


In [6]:
import json
from transformers import T5Tokenizer, T5ForConditionalGeneration
import sagemaker
import torch
import tqdm
import os
import time
from sentence_transformers import util, losses
import boto3
import pickle

### 1) Download and save a HuggingFace model that generates word embeddings

In [7]:
#helper function to pull quora_distilbert_model from Hugging Face repo.
from sentence_transformers import SentenceTransformer, InputExample, losses, models, datasets
def quora_distilbert_model():
    # Load quora-distilbert-base
    word_emb = models.Transformer('sentence-transformers/quora-distilbert-base')
    pooling = models.Pooling(word_emb.get_word_embedding_dimension())
    model = SentenceTransformer(modules=[word_emb, pooling])
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model.to(device)
    
    return model, 'quora_distilbert'

In [8]:
#saving the model on local "disk"
distilbert_model=quora_distilbert_model()[0]
from pprint import pprint
pprint(vars(distilbert_model))
distilbert_model.save('./trained_models/quora-distilbert-untrained/')

Downloading:   0%|          | 0.00/540 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/265M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/490 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/112 [00:00<?, ?B/s]

{'_backward_hooks': OrderedDict(),
 '_buffers': OrderedDict(),
 '_forward_hooks': OrderedDict(),
 '_forward_pre_hooks': OrderedDict(),
 '_is_full_backward_hook': None,
 '_load_state_dict_pre_hooks': OrderedDict(),
 '_model_card_text': None,
 '_model_card_vars': {},
 '_model_config': {},
 '_modules': OrderedDict([('0',
                           Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: DistilBertModel ),
                          ('1',
                           Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False}))]),
 '_non_persistent_buffers_set': set(),
 '_parameters': OrderedDict(),
 '_state_dict_hooks': OrderedDict(),
 '_target_device': device(type='cpu'),
 'training': True}


### 2) Generate the embeddings file and save it in *.pkl format

#### Depending on the type (GPU vs. CPU) and size of the machine where your notebook is running, generating the embeddings may take some time. For example, the "Generate Embeddings" cell below takes 6 minutes on ml.c5.4xlarge (16 vCPU, 32 GiB memory), 12 minutes on ml.c5.2xlarge (8 vCPU, 16 GiB memory), and 40 seconds on ml.p3.2xlarge (1 V100 GPU with 16 GiB memory; 8 vCPU, 61 GiB memory).

In [9]:
# Generate Embeddings
docs = set()
with open('./dataset/misinformation_papers.csv') as fIn:
    for line in fIn:
        doc = line.rstrip("\n")
        docs.add(doc)

docs = list(docs)        
paragraph_emb = distilbert_model.encode([d for d in docs], convert_to_tensor=True)    

# Save Embeddings as a pickle file on local "disk"
with open('./inference/embed_support_titles.pkl', "wb") as fOut:
    pickle.dump({'titles': docs, 'embeddings': paragraph_emb}, fOut, protocol=pickle.HIGHEST_PROTOCOL)

### 3) Test Model Locally

In [10]:
# Load sentences & embeddings from disk
with open('./inference/embed_support_titles.pkl', "rb") as fIn:    
    stored_data = pickle.load(fIn)
    support_titles = stored_data['titles']
    support_titles_embed = stored_data['embeddings']

In [11]:
query = "grammatical errors in spoken English"
query_emb = distilbert_model.encode(query, convert_to_tensor=True)
prediction = util.semantic_search(query_emb, support_titles_embed, top_k=10)[0]

output = []
for hit in prediction:
    doc = support_titles[hit['corpus_id']]
    output.append({"score": hit['score'], "title": doc})

print(json.dumps(output))

[{"score": 0.7601248025894165, "title": "Those difficulties tend to lead students to make errors in building English"}, {"score": 0.7531709671020508, "title": "developmental stage to gain English competence and errors are a result from the "}, {"score": 0.7259856462478638, "title": " grammar intensively so they often produce errors regarding to grammatical rules in"}, {"score": 0.6749558448791504, "title": "students often feel difficult in learning English especially in terms of grammar."}, {"score": 0.6726627945899963, "title": "interlanguage. This study aims to find out the grammatical errors that students"}, {"score": 0.6695536971092224, "title": " grammar items which relates to the errors that mostly produced in narrative text such"}, {"score": 0.6657508015632629, "title": "AN ANALYSIS OF STUDENTS\u2019 ERRORS IN USING ENGLISH PRONOUNS: A CASE STUDY AT NINTH GRADE STUDENTS OF SMPN 2 LINGSAR IN ACADEMIC YEAR 2017/2018,\"This study is aimed to analyze students\u2019 errors in using E

### 4) Prepare model file to be deployed on a SageMaker Inference Endpoint


In [12]:
# Copy the embeddings from ./inference/ to the top level of the HuggingFace model directory.
# Note: the SageMaker documentation can be read as saying that the embeddings *.pkl file could be placed in
# the ./inference/ directory and referenced in inference.py via ./inference/filename.pkl. That reading is incorrect.
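The `inference.py` script itself is not shown in this notebook. The following is a hypothetical sketch of what such an entry point could look like, assuming the `model_fn` and `predict_fn` override hooks of the SageMaker HuggingFace inference toolkit and assuming that the embeddings file sits at the top level of the unpacked model directory, as the note above describes. The `format_hits` helper and the dictionary passed between the two hooks are illustrative choices, not the author's code.

```python
import os
import pickle

EMBEDDINGS_FILE = "embed_support_titles.pkl"  # file name used elsewhere in this notebook


def format_hits(hits, titles):
    """Turn semantic_search hits into JSON-serializable {"score", "title"} records."""
    return [{"score": float(hit["score"]), "title": titles[hit["corpus_id"]]} for hit in hits]


def model_fn(model_dir):
    """Load the model and the corpus embeddings from the unpacked model.tar.gz."""
    # Imported lazily: the library is only needed inside the inference container.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_dir)
    with open(os.path.join(model_dir, EMBEDDINGS_FILE), "rb") as f:
        stored = pickle.load(f)
    return {"model": model, "titles": stored["titles"], "embeddings": stored["embeddings"]}


def predict_fn(data, bundle):
    """Embed the query and return the top-ranked corpus titles."""
    from sentence_transformers import util

    # The endpoint test later in this notebook sends a JSON list containing one string.
    query = data[0] if isinstance(data, list) else data
    query_emb = bundle["model"].encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, bundle["embeddings"], top_k=10)[0]
    return format_hits(hits, bundle["titles"])
```

One caveat with this sketch: embeddings pickled as GPU tensors may fail to unpickle on a CPU-only endpoint, so it can be safer to move them to the CPU before saving.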

In [13]:
!cp ./inference/embed_support_titles.pkl ./trained_models/quora-distilbert-untrained/embed_support_titles.pkl

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [14]:
!rm ./inference/embed_support_titles.pkl

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [15]:
# create a model *.tar.gz file
import tarfile
import os.path

def make_tarfile(output_filename, source_dir):
    with tarfile.open(output_filename, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.sep)

source_dir="./trained_models/quora-distilbert-untrained"
output_filename="./trained_models/quora-distilbert_untrained_model_artifact.tar.gz"
make_tarfile(output_filename, source_dir)
print(output_filename)

./trained_models/quora-distilbert_untrained_model_artifact.tar.gz
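Because the extra artifacts must end up at the top level of the archive, it can be worth inspecting the tarball before uploading it. The sketch below is one way to do that with the standard library; the helper name is hypothetical, and the demo builds a throwaway archive with placeholder file names rather than reading the real model artifact. The demo uses `arcname="."`, whereas the cell above uses `os.path.sep`; the helper ignores leading `./` and `/` so both layouts are reported the same way.

```python
import os
import tarfile
import tempfile


def top_level_entries(tar_path):
    """Return the sorted first path components stored in a tar archive."""
    entries = set()
    with tarfile.open(tar_path, "r:*") as tar:
        for name in tar.getnames():
            # Ignore a leading "./" or "/" so "./config.json" and "config.json" compare equal.
            parts = [p for p in name.split("/") if p not in ("", ".")]
            if parts:
                entries.add(parts[0])
    return sorted(entries)


# Demo on a temporary archive with placeholder (hypothetical) file names.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "model")
    os.makedirs(os.path.join(src, "0_Transformer"))
    for rel in ("config.json", "embed_support_titles.pkl", os.path.join("0_Transformer", "weights.bin")):
        with open(os.path.join(src, rel), "wb") as f:
            f.write(b"placeholder")
    archive = os.path.join(tmp, "model.tar.gz")
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname=".")
    demo_entries = top_level_entries(archive)

print(demo_entries)  # ['0_Transformer', 'config.json', 'embed_support_titles.pkl']
```

Calling `top_level_entries(output_filename)` on the real artifact should list `embed_support_titles.pkl` next to the model's own files if the copy step above succeeded.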


In [16]:
#upload model file with embeddings to S3
from datetime import datetime
from sagemaker.s3 import S3Downloader, S3Uploader
key_prefix='datalab-hf-1/trained_models'
model_s3_uri_prefix=os.path.join("s3://", default_bucket, key_prefix, datetime.now().strftime("%m%d%I%p"))
s3_model_uri=S3Uploader.upload(desired_s3_uri = model_s3_uri_prefix,
                                   local_path = output_filename,
                                   sagemaker_session=sm_session)
print(model_s3_uri_prefix)
print(s3_model_uri)

s3://huggingface-bucket-se/datalab-hf-1/trained_models/121311PM
s3://huggingface-bucket-se/datalab-hf-1/trained_models/121311PM/quora-distilbert_untrained_model_artifact.tar.gz


### 5) Create SageMaker Endpoint with HuggingFace Image

In [17]:
from sagemaker.huggingface import HuggingFace

# Retrieve huggingface docker image for the inference container
hf_inf_image_uri = sagemaker.image_uris.retrieve(
    framework="huggingface",
    region="us-west-2",
    version='4.12',
    py_version="py38",
    image_scope="inference",
    base_framework_version='pytorch1.9',
    instance_type="ml.c5.xlarge" #"ml.p3.2xlarge",
)

hf_inf_image_uri

'763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-inference:1.9-transformers4.12-cpu-py38-ubuntu20.04'

In [18]:
from sagemaker.huggingface.model import HuggingFaceModel

source_dir="./inference" 

#kms_key = "00f78e2f-dd0b-488b-aa5f-cf0f39cb374f"

#env_variables_dict = {
#    "SAGEMAKER_TS_BATCH_SIZE": "3",
#    "SAGEMAKER_TS_MAX_BATCH_DELAY": "100000",
#    "SAGEMAKER_TS_MIN_WORKERS": "1",
#    "SAGEMAKER_TS_MAX_WORKERS": "1",
#    'http_proxy': proxy_endpoint, 
#    'HTTP_PROXY': proxy_endpoint,
#    'https_proxy': proxy_endpoint,
#    'HTTPS_PROXY': proxy_endpoint,
#    'NO_PROXY': no_proxy_endpoint,
#    'no_proxy': no_proxy_endpoint
#}

hugface_model = HuggingFaceModel(
    model_data=s3_model_uri,
    role=sagemaker.get_execution_role(),
    image_uri=hf_inf_image_uri,
    source_dir=source_dir,
    entry_point="inference.py",
    dependencies=["./inference/requirements.txt"], 
#    env=env_variables_dict,
#    vpc_config = {'Subnets': ['subnet-XXX', 'subnet-XXX'], 'SecurityGroupIds': ['sg-XXX', 'sg-XXX', 'sg-XXX']},
#    model_kms_key = kms_key
)


In [19]:
tic = time.time()
inf_instance_type = "ml.c5.xlarge" #"ml.p3.2xlarge"
predictor = hugface_model.deploy(
#    kms_key=kms_key,
    initial_instance_count=1,
    instance_type=inf_instance_type,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer()
)
toc=time.time()
print(toc-tic)

---------!289.9066252708435


In [20]:
endpoint_name = predictor.endpoint_name
endpoint_name

'huggingface-pytorch-inference-2022-12-13-23-46-59-900'

### 6) Test Endpoint Deployment

In [21]:
# https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-test-endpoints.html

# Create a low-level client representing Amazon SageMaker Runtime
sagemaker_runtime = boto3.client("sagemaker-runtime", region_name="us-west-2")

# If ContentType="application/json", input must be a list of 1 string
query = '["grammatical errors in spoken English"]'

# Invoke the endpoint using the client created earlier
response = sagemaker_runtime.invoke_endpoint(
                            EndpointName=endpoint_name, 
                            Body=query.encode('utf-8'),
                            ContentType="application/json", 
                            Accept="application/json",
                            )

# Optional - print the response and decode its body so it is human-readable.
print(response)
import pandas as pd
pd.set_option('display.max_colwidth', None)
# Pandas DataFrame from lists of dicts. 
list_of_dicts_output = json.loads(response['Body'].read().decode('utf-8'))
df_output = pd.DataFrame(list_of_dicts_output)
    
df_output

{'ResponseMetadata': {'RequestId': '0333121d-02c5-423e-97dd-86293f248f42', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '0333121d-02c5-423e-97dd-86293f248f42', 'x-amzn-invoked-production-variant': 'AllTraffic', 'date': 'Tue, 13 Dec 2022 23:51:31 GMT', 'content-type': 'application/json', 'content-length': '1551'}, 'RetryAttempts': 0}, 'ContentType': 'application/json', 'InvokedProductionVariant': 'AllTraffic', 'Body': <botocore.response.StreamingBody object at 0x7fa0fa5e13a0>}


| | score | title |
|---|---|---|
| 0 | 0.762311 | developmental stage to gain English competence and errors are a result from the |
| 1 | 0.740182 | Those difficulties tend to lead students to make errors in building English |
| 2 | 0.733867 | grammar intensively so they often produce errors regarding to grammatical rules in |
| 3 | 0.709804 | interlanguage. This study aims to find out the grammatical errors that students |
| 4 | 0.709719 | INTERLANGUAGE: GRAMMATICAL ERRORS (A Case Study of First Yein the Academic Year 2014/2015)(A Case Study of First Year of MAN 2 Banjarnegarahe Academic Year 2014/2015),"The difference between Indonesian and English language makes the |
| 5 | 0.703969 | sentences. However, errors are actually natural because they are regarded as a |
| 6 | 0.701805 | grammar items which relates to the errors that mostly produced in narrative text such |
| 7 | 0.693199 | AN ANALYSIS OF STUDENTS’ ERRORS IN USING ENGLISH PRONOUNS: A CASE STUDY AT NINTH GRADE STUDENTS OF SMPN 2 LINGSAR IN ACADEMIC YEAR 2017/2018,"This study is aimed to analyze students’ errors in using English pronouns functioning as; |
| 8 | 0.653287 | students often feel difficult in learning English especially in terms of grammar. |
| 9 | 0.652491 | •\tThe Harms of Social Spoiling, Social Construction, and Language |
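The introduction lists CloudWatch invocation metrics among the capabilities of this notebook. A minimal sketch of such a query is given below, to be run before the endpoint is deleted. It assumes the standard `AWS/SageMaker` namespace with the `ModelLatency` and `OverheadLatency` metrics (both reported in microseconds) and the `AllTraffic` variant name seen in the invocation response above. The helper names, the 30-minute window, and the chosen statistics are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def latency_query(endpoint_name, metric_name, variant_name="AllTraffic", minutes=30):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": metric_name,  # "ModelLatency" or "OverheadLatency", in microseconds
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Average", "Maximum", "SampleCount"],
    }


def fetch_latency(endpoint_name, region_name="us-west-2"):
    """Query both latency metrics for an endpoint; needs AWS credentials and a live endpoint."""
    import boto3  # imported lazily so latency_query can be used without AWS access

    cloudwatch = boto3.client("cloudwatch", region_name=region_name)
    results = {}
    for metric_name in ("ModelLatency", "OverheadLatency"):
        response = cloudwatch.get_metric_statistics(**latency_query(endpoint_name, metric_name))
        results[metric_name] = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return results


# Usage (not executed here): fetch_latency(endpoint_name)
```

CloudWatch datapoints can lag behind invocations by a minute or more, so an empty result right after the first request does not necessarily indicate a problem.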


In [22]:
#clean up to save cost
predictor.delete_model()
predictor.delete_endpoint()

In [23]:
nb_end = time.time()
print(f"notebook execution time, seconds: {nb_end-nb_start}")

notebook execution time, seconds: 1179.633066892624
