<div style="background-color: #FFDDDD; border-left: 5px solid red; padding: 10px; color: black;">
    <strong>Kernel:</strong> Python 3 (ipykernel)
</div>

## Lab 0: Warm Up: Deploy Embedding Models on ml.inf2.8xlarge for Inference

In this lab, we'll walk you through the process of deploying an open source BGE embeddings model to a SageMaker endpoint for inference. We'll use a single `ml.inf2.8xlarge` instance for this and subsequent labs. In practice, you can deploy a SageMaker model behind a single load-balanced endpoint with auto-scaling policies defined, allowing your LLM SaaS endpoint to scale with input demand.
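As a rough illustration of the auto-scaling idea mentioned above, the sketch below builds the two Application Auto Scaling requests (`register_scalable_target` and `put_scaling_policy`) that are typically used for a SageMaker endpoint variant. The helper names, the capacity limits, and the target of 100 invocations per instance are assumptions chosen for illustration, not values from this lab. `AllTraffic` is assumed to be the variant name, which is the SageMaker Python SDK default.

```python
def build_autoscaling_requests(endpoint_name, variant_name="AllTraffic",
                               min_instances=1, max_instances=2,
                               invocations_per_instance=100.0):
    """Build request parameters for target-tracking auto-scaling.

    All defaults are illustrative assumptions; tune them for your workload.
    """
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    dimension = "sagemaker:variant:DesiredInstanceCount"

    # Parameters for register_scalable_target: which resource may scale, and how far.
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_instances,
        "MaxCapacity": max_instances,
    }

    # Parameters for put_scaling_policy: track invocations per instance.
    policy = {
        "PolicyName": f"{endpoint_name}-invocations-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy


def apply_autoscaling(autoscaling_client, target, policy):
    """Send both requests using a boto3 'application-autoscaling' client."""
    autoscaling_client.register_scalable_target(**target)
    autoscaling_client.put_scaling_policy(**policy)


# Example usage once an endpoint exists (not executed here):
#   import boto3
#   client = boto3.client("application-autoscaling", region_name="us-west-2")
#   target, policy = build_autoscaling_requests("my-endpoint-name", max_instances=4)
#   apply_autoscaling(client, target, policy)
```

Separating request construction from the API calls makes the configuration easy to inspect before anything is sent to AWS. This lab itself keeps a single instance and does not require auto-scaling.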

In [None]:
!python3 -m pip install sagemaker==2.197.0

### Define Global Variables

In [None]:
_MODEL_NAME = "BGE-Base-v15"
_MODEL_SIZE = "512-to-768"
MODEL_DATA_S3_URI = "s3://sagemaker-us-west-2-914153712152/embedding-model-artifact/bge-base-en-v1-5-seqlen-384-bs-1.tar.gz"
ROLE = "arn:aws:iam::914153712152:role/workshop-studio-v2-cfn-OSE-EMR-SageMakerExecutionRole"
INSTANCE_TYPE = "ml.inf2.8xlarge"

MODEL_NAME = f"{_MODEL_NAME}-{_MODEL_SIZE}-neuron-embedding-model"
ENDPOINT_NAME = f"{_MODEL_NAME}-{_MODEL_SIZE}-neuron-embedding-ep"

In [None]:
MODEL_NAME, ENDPOINT_NAME

## Let's Deploy!

### Model and Instance Configuration

We're going to deploy the BGE embedding model on AWS Inferentia2 (`Inf2`) instances, which run on Amazon's custom silicon. Inferentia instances are purpose-built for deep learning (DL) inference. AWS positions them as delivering high performance at the lowest cost in Amazon EC2 for generative artificial intelligence (AI) models, including large language models (LLMs) and vision transformers.

In [None]:
from sagemaker.huggingface.model import HuggingFaceModel

In [None]:
hf_bge_model = HuggingFaceModel(
    model_data=MODEL_DATA_S3_URI,        
    transformers_version="4.34.1",
    pytorch_version="1.13.1",
    py_version='py310',
    model_server_workers=2,
    role=ROLE,
    name=MODEL_NAME,
)

In [None]:
%%time
print("===== SageMaker Deployment =====")

print("\nPreparing to deploy the model...")
predictor = hf_bge_model.deploy(
    endpoint_name=ENDPOINT_NAME,
    initial_instance_count=1,
    instance_type=INSTANCE_TYPE,
    volume_size=50,
)
print("\n===== Deployment Complete =====")

### Test Sample Input

In [None]:
data = {
  "inputs": "This workshop is the best way to learn about Amazon SageMaker new feature releases!",
}

res = predictor.predict(data=data)

In [None]:
# print some results
print(f"length of embeddings: {len(res['embeddings'])}")
print(f"first 10 elements of embeddings: {res['embeddings'][:10]}")
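Embeddings are usually compared rather than inspected directly. The sketch below computes the cosine similarity between two embedding vectors in plain Python. It assumes that `res['embeddings']` is a flat list of floats, as the prints above suggest; if your endpoint returns a nested list, index into it first. The function name `cosine_similarity` is a helper introduced here, not part of the SageMaker SDK.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")

    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))

    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")

    return dot / (norm_a * norm_b)


# Example usage with two endpoint responses (not executed here);
# `other` is a hypothetical second call to predictor.predict:
#   other = predictor.predict(data={"inputs": "Another sentence to compare."})
#   score = cosine_similarity(res["embeddings"], other["embeddings"])
```

Values close to 1 indicate that the two inputs point in nearly the same direction in embedding space.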

### Boto3 Inference

In [None]:
import json
import boto3

region_name = "us-west-2"
# "sagemaker-runtime" is the canonical boto3 service name for endpoint invocation
client = boto3.client("sagemaker-runtime", region_name=region_name)

In [None]:
region_name

In [None]:
data = {
  "inputs": "This workshop is the best way to learn about Amazon SageMaker new feature releases!",
}

In [None]:
body = json.dumps(data)
response = client.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,  # use the endpoint deployed above, not a hard-coded name
    Body=body,
    ContentType="application/json",
)
result = json.loads(response["Body"].read().decode("utf-8"))

In [None]:
# print some results
print(f"length of embeddings: {len(result['embeddings'])}")
print(f"first 10 elements of embeddings: {result['embeddings'][:10]}")
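To embed several texts, the same `invoke_endpoint` call can be wrapped in a small helper. This is a minimal sketch that sends one request per text and assumes each response has the `{"embeddings": [...]}` shape seen above. The helper name `embed_texts` is hypothetical, and whether the endpoint accepts batched inputs in one request is not covered here.

```python
import json


def embed_texts(runtime_client, endpoint_name, texts):
    """Return one embedding per input text, calling the endpoint once per text."""
    embeddings = []
    for text in texts:
        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            Body=json.dumps({"inputs": text}),
            ContentType="application/json",
        )
        # The response body is a stream; read and decode it before parsing.
        payload = json.loads(response["Body"].read().decode("utf-8"))
        embeddings.append(payload["embeddings"])
    return embeddings


# Example usage with the client and endpoint from this notebook (not executed here):
#   vectors = embed_texts(client, ENDPOINT_NAME, ["first text", "second text"])
```

Passing the client in as an argument keeps the helper easy to exercise with a stub when no endpoint is available.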