# BERT text embedding inference deployment guide
In this tutorial, you will deploy a text embedding model to SageMaker using the LMI (Large Model Inference) container from the AWS Deep Learning Containers (DLC) and run inference against it.

Make sure the following permissions are granted before running the notebook:

- S3 bucket push access
- SageMaker access

## Step 1: Upgrade the SageMaker SDK and import dependencies

In [None]:
%pip install sagemaker --upgrade --quiet

In [None]:
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
session = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs

## Step 2: Build the SageMaker endpoint
In this step, we will build a SageMaker endpoint from scratch: pick a container image, create a model, and deploy it.

### Getting the container image URI

See the list of available images: [Large Model Inference DLC images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers)

In [None]:
image_uri = image_uris.retrieve(
    framework="djl-lmi",
    region=session.boto_session.region_name,
    version="0.28.0"
)

# To use the LMI nightly image instead, uncomment the line below:
# image_uri = "125045733377.dkr.ecr.us-east-1.amazonaws.com/djl-serving:lmi-nightly"

### Create SageMaker model

You can deploy a model from the Hugging Face Hub or the DJL model zoo by setting the `HF_MODEL_ID` environment variable.

In [None]:
# model_id = "djl://ai.djl.huggingface.onnxruntime/BAAI/bge-base-en-v1.5"
model_id = "BAAI/bge-base-en-v1.5"

env = {
    "HF_MODEL_ID": model_id,
    "OPTION_ENGINE": "OnnxRuntime",
    "SERVING_MIN_WORKERS": "1",  # keep min and max workers equal when deploying the model on GPU
    "SERVING_MAX_WORKERS": "1",
}

model = Model(image_uri=image_uri, env=env, role=role)
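One way to keep the two worker settings from drifting apart is to derive both from a single value. The helper below is a minimal sketch; `build_lmi_env` is a hypothetical name and not part of the SageMaker SDK or the LMI container. It only assembles the same dictionary used above, with every value converted to a string.

```python
def build_lmi_env(model_id, engine="OnnxRuntime", workers=1):
    """Return a container env dict with min and max workers set to the same value.

    Hypothetical helper for illustration; it mirrors the env dict used in this notebook.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return {
        "HF_MODEL_ID": model_id,
        "OPTION_ENGINE": engine,
        # Environment variable values are passed to the container as strings.
        "SERVING_MIN_WORKERS": str(workers),
        "SERVING_MAX_WORKERS": str(workers),
    }


env = build_lmi_env("BAAI/bge-base-en-v1.5")
```

The resulting `env` can be passed to `Model(image_uri=image_uri, env=env, role=role)` exactly as before.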

### Create SageMaker endpoint

Specify the instance type to deploy on and a name for the endpoint.

In [None]:
instance_type = "ml.g4dn.2xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-text-embedding")

model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
)

# Requests and responses are JSON, so specify the matching serializer and deserializer
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=session,
    serializer=serializers.JSONSerializer(),
    deserializer=deserializers.JSONDeserializer(),
)

## Step 3: Test and benchmark the inference

Send a single input to the endpoint to confirm that it responds:

In [None]:
predictor.predict(
    {"inputs": "What is Deep Learning?"}
)
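To get a rough latency figure for the endpoint, you can time repeated calls to it. The sketch below is one simple way to do that; `benchmark` is a hypothetical helper, not part of the SageMaker SDK, and the warm-up and run counts are arbitrary defaults. It measures client-side, sequential round-trip time only, so it includes network overhead and says nothing about throughput under concurrent load.

```python
import math
import statistics
import time


def benchmark(predict_fn, payload, warmup=2, runs=20):
    """Call predict_fn(payload) repeatedly and summarize latency in milliseconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")

    for _ in range(warmup):
        predict_fn(payload)  # warm-up calls are not timed

    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(payload)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms.sort()
    # Nearest-rank 90th percentile over the sorted samples.
    p90_index = max(0, math.ceil(0.9 * len(latencies_ms)) - 1)
    return {
        "runs": runs,
        "mean_ms": statistics.mean(latencies_ms),
        "p50_ms": statistics.median(latencies_ms),
        "p90_ms": latencies_ms[p90_index],
        "max_ms": latencies_ms[-1],
    }


# Usage against the endpoint created above (requires a live endpoint):
# print(benchmark(predictor.predict, {"inputs": "What is Deep Learning?"}))
```

Because the helper accepts any callable, the same function can be tried locally with a stub before pointing it at `predictor.predict`.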

## Clean up the environment

In [None]:
session.delete_endpoint(endpoint_name)
session.delete_endpoint_config(endpoint_name)
model.delete_model()
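If one of the delete calls above raises, the calls after it never run and the remaining resources stay behind. The sketch below shows one possible way to attempt every step and report failures afterwards; `run_cleanup` is a hypothetical helper, not part of the SageMaker SDK.

```python
def run_cleanup(steps):
    """Run each (name, callable) step and collect failures instead of stopping at the first."""
    failures = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:  # keep going so later resources are still removed
            failures.append((name, exc))
    return failures


# Usage with the objects from this notebook (requires the deployed resources):
# failures = run_cleanup([
#     ("endpoint", lambda: session.delete_endpoint(endpoint_name)),
#     ("endpoint config", lambda: session.delete_endpoint_config(endpoint_name)),
#     ("model", model.delete_model),
# ])
# for name, exc in failures:
#     print(f"Failed to delete {name}: {exc}")
```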