In [7]:
%pip install sagemaker --upgrade  --quiet

Note: you may need to restart the kernel to use updated packages.


In [1]:
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


## Step 2: Start preparing model artifacts
The LMI container expects a few artifacts to set up the model:
- serving.properties (required): defines the model server settings
- model.py (optional): a Python file that defines the core inference logic
- requirements.txt (optional): any additional pip packages to install

In [2]:
%%writefile serving.properties
engine=Python
option.model_id=s3://<your-s3-bucket>/llama3/nous-llama3-8b-instruct/
option.task=text-generation
option.trust_remote_code=true
option.tensor_parallel_degree=1
option.rolling_batch=vllm
option.dtype=fp16

Writing serving.properties


In [3]:
%%sh
mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel

mymodel/
mymodel/serving.properties
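
The shell cell above can also be reproduced in Python when `%%sh` is not available. The following is a minimal sketch using the standard `tarfile` module; the helper names `package_model` and `list_archive` are hypothetical and not part of the SageMaker SDK. Unlike the shell version, it leaves `serving.properties` in place instead of moving it.

```python
import tarfile
from pathlib import Path


def package_model(properties_path, archive_path="mymodel.tar.gz", arcdir="mymodel"):
    """Pack a serving.properties file into a gzipped tarball under arcdir/.

    Hypothetical helper: mirrors `tar czvf mymodel.tar.gz mymodel/`.
    """
    properties_path = Path(properties_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store the file as mymodel/serving.properties inside the archive.
        tar.add(str(properties_path), arcname=f"{arcdir}/{properties_path.name}")
    return archive_path


def list_archive(archive_path):
    """Return the member names of the archive, for a quick sanity check."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

Calling `package_model("serving.properties")` followed by `list_archive("mymodel.tar.gz")` lets you confirm the layout before uploading.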


## Step 3: Start building SageMaker endpoint
In this step, we build a SageMaker endpoint from scratch.

### Getting the container image URI

[Large Model Inference available DLC](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers)


In [4]:
image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region=sess.boto_session.region_name,
    version="0.27.0",
)

### Upload the artifact to S3 and create the SageMaker model

In [6]:
s3_code_prefix = "large-model-lmi/code"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-cn-northwest-1-542319707026/large-model-lmi/code/mymodel.tar.gz


### Create the SageMaker endpoint

You need to specify the instance type to use and the endpoint name.

In [7]:
instance_type = "ml.g5.2xlarge"
endpoint_name = sagemaker.utils.name_from_base("lmi-llama3-8b")

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name,
             # container_startup_health_check_timeout=3600
            )

# requests are sent as JSON, so we specify a JSON serializer; with no deserializer set, responses come back as raw bytes
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
)

--------------!

## Step 5: Test and benchmark the inference

First, let's try an input that is not wrapped in the Llama 3 chat template:

In [8]:
predictor.predict(
    {"inputs": "tell me a story of the little red riding hood", "parameters": {"max_tokens":128}}
)

b'{"generated_text": ". I will tell you a story of the little red riding hood. Once upon a time, there was a little girl named Little Red Riding Hood."}'
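
The request above sends raw text, so the model simply continues it. Instruct-tuned Llama 3 models respond better when the prompt is wrapped in header tokens. The sketch below builds such a prompt and decodes the raw bytes the predictor returns. The helper names (`build_llama3_prompt`, `build_payload`, `extract_text`) are hypothetical, and the template mirrors the one used in this notebook rather than an authoritative specification.

```python
import json


def build_llama3_prompt(user_message):
    """Wrap a single user message in the Llama 3 header tokens used in this notebook."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}\n\n"
        "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
    )


def build_payload(user_message, max_new_tokens=128, temperature=0.6):
    """Build the JSON-serializable request body; parameter names are taken from the notebook."""
    return {
        "inputs": build_llama3_prompt(user_message),
        "parameters": {
            "do_sample": True,
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }


def extract_text(raw_response):
    """Decode the endpoint's raw bytes and return the generated text."""
    return json.loads(raw_response)["generated_text"]
```

With these helpers, a call could look like `extract_text(predictor.predict(build_payload("what is machine learning?")))`.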

In [9]:
predictor.predict(
    {
        "inputs": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nwhat is machine learning?\n\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>", 
        "parameters": {"max_tokens":128, "do_sample": True, "max_new_tokens": 128, "temperature": 0.6}
    }
)

b'{"generated_text": "\\n\\nMachine learning is a subfield of artificial intelligence (AI) that involves training algorithms to learn from data, recognize patterns, and make predictions or decisions without being explicitly programmed. In other words, machine learning enables machines to improve their performance on a task over time, based on the data they receive.\\n\\nMachine learning is often categorized into three types:\\n\\n1. **Supervised Learning**: In this type of learning, the algorithm is trained on labeled data, where the correct output is already known. The algorithm learns to map inputs to outputs based on the labeled data, and can then be used to make predictions on new, unseen data.\\n2."}'
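
For a rough latency benchmark, one option is to time repeated calls to the predictor. This is a minimal sketch under the assumption that sequential, client-side wall-clock timing is good enough for a first impression. The `benchmark` function is a hypothetical helper; it measures end-to-end latency including network overhead, not token throughput.

```python
import statistics
import time


def benchmark(predict_fn, payload, runs=5):
    """Call predict_fn(payload) `runs` times and summarize wall-clock latency in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_fn(payload)
        latencies.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "mean_s": statistics.mean(latencies),
        "min_s": min(latencies),
        "max_s": max(latencies),
    }
```

For example, `benchmark(predictor.predict, {"inputs": "...", "parameters": {"max_new_tokens": 128}})` returns a small summary dictionary. The first call may be slower than later ones, so consider discarding it.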

## Clean up the environment

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()