# Deploy TEI as an AWS SageMaker Real-time Endpoint


Text Embeddings Inference (TEI) is not yet supported as a Deep Learning Container (DLC) for SageMaker the way TGI is. We therefore need to extend the container to make it compatible with SageMaker before we can use it in a deployment. Specifically, we need to:

1. Create a modified TEI image that's [compatible with SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html)
2. Push the new TEI image to ECR
3. Deploy it to SageMaker

## 1. Create a modified TEI image

In [82]:
%%writefile sagemaker-entrypoint.sh
#!/bin/bash

if [[ -z "${HF_MODEL_ID}" ]]; then
  echo "HF_MODEL_ID must be set"
  exit 1
fi
export MODEL_ID="${HF_MODEL_ID}"

if [[ -n "${HF_MODEL_REVISION}" ]]; then
  export REVISION="${HF_MODEL_REVISION}"
fi

if [[ -n "${POOLING}" ]]; then
  export POOLING="${POOLING}"
fi

if [[ -n "${MAX_CONCURRENT_REQUESTS}" ]]; then
  export MAX_CONCURRENT_REQUESTS="${MAX_CONCURRENT_REQUESTS}"
fi

if [[ -n "${MAX_BATCH_TOKENS}" ]]; then
  export MAX_BATCH_TOKENS="${MAX_BATCH_TOKENS}"
fi

# SageMaker sends inference traffic to port 8080 inside the container
text-embeddings-router --port 8080

Overwriting sagemaker-entrypoint.sh
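
The renaming performed by the entrypoint script can be mirrored in Python to sanity-check a configuration before deploying. This is only a sketch: `resolve_tei_env` is a hypothetical helper, not part of TEI or the SageMaker SDK, and it assumes the router picks up `MODEL_ID` and `REVISION` from the environment, as the script above relies on.

```python
def resolve_tei_env(env):
    """Mirror the variable mapping in sagemaker-entrypoint.sh (hypothetical helper)."""
    # The script exits when HF_MODEL_ID is unset or empty.
    if not env.get("HF_MODEL_ID"):
        raise ValueError("HF_MODEL_ID must be set")

    resolved = dict(env)
    # HF_MODEL_ID is re-exported under the name MODEL_ID.
    resolved["MODEL_ID"] = env["HF_MODEL_ID"]

    # HF_MODEL_REVISION is optional and re-exported as REVISION when non-empty.
    if env.get("HF_MODEL_REVISION"):
        resolved["REVISION"] = env["HF_MODEL_REVISION"]

    # POOLING, MAX_CONCURRENT_REQUESTS and MAX_BATCH_TOKENS keep their names.
    return resolved
```

For example, `resolve_tei_env({"HF_MODEL_ID": "BAAI/bge-base-en-v1.5"})` returns a dictionary that also contains `MODEL_ID`, while an empty dictionary raises `ValueError`, matching the script's `exit 1`.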


In [83]:
%%writefile Dockerfile
FROM ghcr.io/huggingface/text-embeddings-inference:86-1.1 as base

COPY sagemaker-entrypoint.sh entrypoint.sh
RUN chmod +x entrypoint.sh

ENTRYPOINT ["./entrypoint.sh"]

Writing Dockerfile


Build the Docker image:

In [None]:
!docker build -t tei-sagemaker .
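
Before pushing, it may be worth checking locally that the container answers the two routes SageMaker calls on port 8080: `GET /ping` for health checks and `POST /invocations` for inference. The sketch below uses only the standard library. It assumes the container is already running on `localhost:8080` and that the TEI build in the image serves both routes. The helper names are hypothetical.

```python
import json
import urllib.request


def build_ping_request(base_url="http://localhost:8080"):
    """Health-check request in the shape SageMaker uses (GET /ping)."""
    return urllib.request.Request(f"{base_url}/ping", method="GET")


def build_invocation_request(texts, base_url="http://localhost:8080"):
    """Inference request in the shape SageMaker uses (POST /invocations)."""
    body = json.dumps({"inputs": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/invocations",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(base_url="http://localhost:8080"):
    """Send both requests to a locally running container and print a summary."""
    with urllib.request.urlopen(build_ping_request(base_url), timeout=10) as resp:
        print("ping status:", resp.status)
    request = build_invocation_request(["Hello"], base_url)
    with urllib.request.urlopen(request, timeout=60) as resp:
        vectors = json.loads(resp.read())
        print("embedding length:", len(vectors[0]))
```

With the container started locally (for instance via `docker run --gpus all -p 8080:8080 -e HF_MODEL_ID=BAAI/bge-base-en-v1.5 tei-sagemaker`, an assumed invocation), calling `smoke_test()` should print a ping status and the embedding dimension.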

## 2. Push to ECR

This will require setting up [a repository on ECR](https://docs.aws.amazon.com/AmazonECR/latest/userguide/getting-started-cli.html).

The general steps to authenticate and push an image to your repository are:
1. Retrieve an authentication token and authenticate your Docker client to your registry.
2. After the build completes, tag your image so you can push it to this repository.
3. Push the image to your newly created AWS repository.
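
The three steps above can be sketched as a small Python helper that builds the image URI and the matching shell commands. The helper names, the repository name, and the `latest` tag are assumptions for illustration. The URI layout is the one used by standard AWS regions and may differ in other partitions.

```python
def ecr_registry(account_id, region):
    """Registry host for a private ECR registry in a standard AWS region."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Full image URI, usable as `image_uri` when creating the model."""
    return f"{ecr_registry(account_id, region)}/{repository}:{tag}"


def ecr_push_commands(account_id, region, repository,
                      local_image="tei-sagemaker", tag="latest"):
    """Shell commands for the authenticate, tag, and push steps."""
    registry = ecr_registry(account_id, region)
    uri = ecr_image_uri(account_id, region, repository, tag)
    return [
        # 1. Authenticate the Docker client against the registry.
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry}",
        # 2. Tag the locally built image with the repository URI.
        f"docker tag {local_image}:latest {uri}",
        # 3. Push the tagged image.
        f"docker push {uri}",
    ]
```

Printing `ecr_push_commands(account_id, region, "tei-sagemaker")` gives commands that can be pasted into a terminal, and `ecr_image_uri(...)` yields the value to use in place of the image URI placeholder in the deployment step.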



## 3. Deploy as SageMaker Endpoint

In [66]:
import boto3
from datetime import datetime
from sagemaker import get_execution_role, Session

sess = Session()
sm_client = boto3.client(service_name='sagemaker')
runtime_sm_client = boto3.client(service_name='sagemaker-runtime')

account_id = boto3.client('sts').get_caller_identity()['Account']
region = sess.boto_region_name

role = get_execution_role()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [30]:
import json
from sagemaker.huggingface import HuggingFaceModel


# sagemaker config
model_id = "BAAI/bge-base-en-v1.5"
instance_type = "ml.g5.xlarge"
number_of_gpu = 1
health_check_timeout = 150
resource_name = "Demo-{}-{}"

# Define model and endpoint configuration parameters
# (environment values must be strings, hence json.dumps for the numbers)
config = {
  'HF_MODEL_ID': model_id, # model_id from hf.co/models
  'POOLING': "cls",
  'MAX_CONCURRENT_REQUESTS': json.dumps(512), # Maximum number of concurrent requests for this deployment
  'MAX_BATCH_TOKENS': json.dumps(16384),  # Maximum total number of tokens in a single batch
}

# create HuggingFaceModel with the image uri
hf_model = HuggingFaceModel(
  role=role,
  image_uri="<IMAGE_URI_FROM_ECR>",
  env=config
)

# deploy to real-time endpoint
tei_endpoint = hf_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,
  endpoint_name=resource_name.format("TEI-Endpoint", datetime.now().strftime("%Y-%m-%d-%H-%M-%S")),
)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
----!
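
Once the endpoint is in service, it can also be called through the low-level `sagemaker-runtime` client created earlier, which is useful outside the SageMaker Python SDK. The following is a minimal sketch: `embed_with_runtime_client` is a hypothetical helper, and the payload shape mirrors the `inputs`/`truncate` request used elsewhere in this notebook.

```python
import json


def embed_with_runtime_client(client, endpoint_name, texts, truncate=True):
    """Call the endpoint via boto3's invoke_endpoint and decode the JSON reply."""
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps({"inputs": texts, "truncate": truncate}),
    )
    # The response body is a stream; read it fully before parsing.
    return json.loads(response["Body"].read())
```

A typical call would be `embed_with_runtime_client(runtime_sm_client, tei_endpoint.endpoint_name, ["Hello world"])`, which should return one embedding vector per input text.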

In [71]:
example = {"inputs":["Hello "*512], "truncate":True}
tei_endpoint.predict(example)

[[0.0013170368,
  0.006132093,
  0.01187253,
  0.005940105,
  0.038366858,
  0.060453143,
  0.046292115,
  0.008194042,
  -0.02515809,
  -0.043957543,
  -0.010244473,
  -0.009929613,
  -0.055661123,
  0.0075796815,
  -0.0001281519,
  0.01663383,
  0.04767443,
  0.020949718,
  -0.0030602866,
  0.0039050335,
  0.012809431,
  0.04451047,
  0.055384662,
  0.022454903,
  0.023514675,
  -0.015382068,
  0.019121991,
  0.06315633,
  -0.085211895,
  0.04690648,
  -0.006715736,
  0.005798034,
  0.023468597,
  -0.00054812536,
  0.008201722,
  -0.07114302,
  -0.02677079,
  0.009545637,
  -0.0034308233,
  -0.034865,
  0.008324594,
  -0.013500587,
  -0.008155645,
  0.018108297,
  -0.017140677,
  -0.027292997,
  -0.0253424,
  0.044571906,
  -0.0065467865,
  0.036370184,
  -0.006589024,
  0.04478693,
  0.026709354,
  -0.022516338,
  0.017540012,
  0.026878303,
  0.006266484,
  -0.040639993,
  -0.030119058,
  -0.025757093,
  0.04401898,
  0.0025054417,
  0.039841324,
  -0.007886861,
  0.06192761,
  0.0