# Deploy a LLaMA 3.1 8B Instruct Model Adapter Using SageMaker Endpoints and SageMaker Large Model Inference (LMI) Container with the SageMaker Python SDK 

In this example you will deploy a trained adapter of `LLaMA-3.1-8B-instruct` to a SageMaker Managed Endpoint.

Install the version of the SageMaker Python SDK used in this example.

In [None]:
!pip install sagemaker==2.229 --quiet

Baseline setup. Create the SageMaker session, look up the execution role, and create a boto3 client for SageMaker.

In [None]:
import time
from pprint import pprint

import boto3
import sagemaker
from datasets import load_dataset
from sagemaker import ModelPackage

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
sm_client = boto3.client('sagemaker')



This example uses the snapshot of the base model that was downloaded from Hugging Face.

## Large Model Inference (LMI) Containers

In this example you will deploy your model using [SageMaker's Large Model Inference (LMI) Containers](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/index.html).

LMI containers are a set of high-performance Docker containers purpose-built for large language model (LLM) inference. With these containers, you can use high-performance open-source inference libraries such as vLLM, TensorRT-LLM, and Transformers NeuronX to deploy LLMs on Amazon SageMaker endpoints. Each container bundles a model server with these inference libraries to deliver an all-in-one LLM serving solution.

The LMI container supports a variety of different backends, outlined in the table below. 

The model for this example can be deployed using the vLLM backend, which corresponds to the `djl-lmi` container image.

| Backend | SageMaker DLC | Example URI |
| --- | --- | --- |
| vLLM | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124 |
| lmi-dist | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124 |
| hf-accelerate | djl-lmi | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-lmi11.0.0-cu124 |
| tensorrt-llm | djl-tensorrtllm | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-tensorrtllm0.11.0-cu124 |
| transformers-neuronx | djl-neuronx | 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-neuronx-sdk2.19.1 |
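
The mapping in the table can be sketched as a small lookup. This is only an illustration: `lmi_image_uri` is a hypothetical helper, the tags are copied from the example URIs above, and the registry account is assumed to be the one shown for `us-east-1` (some regions use a different account).

```python
# Assumption: account ID and repository are copied from the example URIs in the
# table; they are not guaranteed to be valid in every AWS region.
LMI_REGISTRY_ACCOUNT = "763104351884"
LMI_REPOSITORY = "djl-inference"

# Image tags from the table, keyed by backend name (lower case).
LMI_TAGS = {
    "vllm": "0.29.0-lmi11.0.0-cu124",
    "lmi-dist": "0.29.0-lmi11.0.0-cu124",
    "hf-accelerate": "0.29.0-lmi11.0.0-cu124",
    "tensorrt-llm": "0.29.0-tensorrtllm0.11.0-cu124",
    "transformers-neuronx": "0.29.0-neuronx-sdk2.19.1",
}


def lmi_image_uri(backend, region="us-east-1"):
    """Hypothetical helper: build the ECR image URI for an LMI backend."""
    try:
        tag = LMI_TAGS[backend.lower()]
    except KeyError:
        raise ValueError(f"Unknown LMI backend: {backend!r}") from None
    return f"{LMI_REGISTRY_ACCOUNT}.dkr.ecr.{region}.amazonaws.com/{LMI_REPOSITORY}:{tag}"


print(lmi_image_uri("vLLM"))
```

In practice, `sagemaker.image_uris.retrieve` is the usual way to resolve an image for a given region; the lookup above only makes the table's structure explicit.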

In the following steps you will use the SageMaker Python SDK to build your model configuration and deploy it to a SageMaker endpoint. There are alternative methods as well, such as the Boto3 SDK, but the SageMaker Python SDK reduces the amount of code necessary to perform the same activities.

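The first step in model deployment is to [create a SageMaker Model object](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html). A model object consists of a unique name, a container image, and the environment configuration for that container. Here the object is created from the model package whose ARN was stored earlier and is restored with `%store -r model_package_arn`.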

In [None]:
%store -r model_package_arn

model_package_arn

In [None]:
timestamp = time.strftime("%Y-%m-%d-%H-%M-%S")

endpoint_name = f"endpoint-llama3-8b-instruct-adapter-{timestamp}"

ft_model = ModelPackage(
    model_package_arn=model_package_arn,
    sagemaker_session=sess,
    role=role,
    name=endpoint_name,
)
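
The cell above reuses one timestamped string as both the model name and the endpoint name. SageMaker resource names may contain only alphanumeric characters and hyphens and are limited to 63 characters, so a long prefix can make the request fail. A minimal sketch of a name builder that checks this up front follows; `make_resource_name` is a hypothetical helper, not part of the SageMaker SDK.

```python
import re
import time

# Naming rule for SageMaker models and endpoints: alphanumerics and hyphens,
# starting and ending with an alphanumeric character, at most 63 characters.
_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")
_MAX_NAME_LENGTH = 63


def make_resource_name(prefix, now=None):
    """Hypothetical helper: append a timestamp to `prefix` and validate the result."""
    if now is None:
        now = time.localtime()
    name = f"{prefix}-{time.strftime('%Y-%m-%d-%H-%M-%S', now)}"
    if len(name) > _MAX_NAME_LENGTH:
        raise ValueError(f"{name!r} is longer than {_MAX_NAME_LENGTH} characters")
    if not _NAME_PATTERN.fullmatch(name):
        raise ValueError(f"{name!r} may contain only alphanumerics and hyphens")
    return name


print(make_resource_name("endpoint-llama3-8b-instruct-adapter"))
```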

Now that you have a model object ready, you will use the SageMaker Python SDK to create a SageMaker Managed Endpoint. The SDK eliminates some of the intermediate steps, such as creating an Endpoint Configuration.

## Creating a new endpoint 

In [None]:
%%time

print(f"Deploying model to endpoint {endpoint_name}")
ft_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    endpoint_name=endpoint_name,
    container_startup_health_check_timeout=900,
    wait=False,  # return immediately; the next cell polls the endpoint status
)
print("\nEndpoint creation started ===>", endpoint_name)

In [None]:
# Check the endpoint status
while True:
    response = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = response['EndpointStatus']
    print(f"Endpoint status: {status}")
    
    if status == 'InService':
        print("Endpoint is ready!")
        break
    elif status == 'Failed':
        print("Endpoint creation failed.")
        break
    
    # Wait for 30 seconds before checking again
    time.sleep(30)

print(f"Final endpoint status: {status}")
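
The loop above runs until the endpoint reaches `InService` or `Failed`, with no upper bound on the wait. A minimal sketch of the same polling logic with a timeout is shown below. `wait_for_endpoint` is a hypothetical helper; it takes the describe call as an argument so it can be exercised here with a fake instead of a real endpoint.

```python
import time


def wait_for_endpoint(describe, endpoint_name, timeout_s=1800, poll_s=30,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until the endpoint is InService or Failed, or raise TimeoutError.

    `describe` is any callable shaped like boto3's `sm_client.describe_endpoint`.
    """
    deadline = clock() + timeout_s
    while True:
        status = describe(EndpointName=endpoint_name)["EndpointStatus"]
        print(f"Endpoint status: {status}")
        if status in ("InService", "Failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"{endpoint_name} is still {status} after {timeout_s} seconds"
            )
        sleep(poll_s)


# Simulated run with a fake describe call, so the sketch runs without AWS access.
_fake_statuses = iter(["Creating", "Creating", "InService"])


def fake_describe(EndpointName):
    return {"EndpointStatus": next(_fake_statuses)}


final_status = wait_for_endpoint(fake_describe, "demo-endpoint", sleep=lambda s: None)
print(f"Final endpoint status: {final_status}")
```

Against a real endpoint the call would be `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`.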

## Pick a random prompt

In [None]:
def create_summarization_prompts(data_point):
    full_prompt =f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
                    You are an AI assistant trained to summarize conversations. Provide a concise summary of the dialogue, capturing the key points and overall context.
                    <|eot_id|><|start_header_id|>user<|end_header_id|>
                    Summarize the following conversation:

                    {data_point["dialogue"]}
                    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
                    Here's a concise summary of the conversation in a single sentence:

                    <|eot_id|>"""
    return {"prompt": full_prompt}
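
For comparison, the published Llama 3 instruct chat layout puts two newlines after each header and leaves the assistant turn open for generation. The sketch below builds that layout; `format_llama3_prompt` is a hypothetical helper and the dialogue is made up. An adapter is normally prompted with the same template it was fine-tuned on, so treat this as a reference for the token layout rather than a replacement for the function above.

```python
def format_llama3_prompt(system, user):
    """Hypothetical helper: one system turn and one user turn in the Llama 3 chat layout."""
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
        # The assistant header is left open so the model writes the summary.
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


# Made-up dialogue, used only to show the resulting string.
example_prompt = format_llama3_prompt(
    system="You are an AI assistant trained to summarize conversations.",
    user="Summarize the following conversation:\n\nA: Lunch at noon?\nB: Sure, see you then.",
)
print(example_prompt)
```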

In [None]:
# Hugging Face dataset used for the sample prompt
dataset_name = "Samsung/samsum"

# Load the test split from the Hub
dataset = load_dataset(dataset_name, split="test")

# Select one random row and format it with the prompt template
random_row = dataset.shuffle().select(range(1))[0]
random_prompt = create_summarization_prompts(random_row)
pprint(random_prompt)

## Run Inference

With your endpoint successfully deployed, you will want to test it to ensure that it is fully functional.

To do so, you will take a sample dialogue and summarize it using your deployed model. The dialogue is the random row selected above from the test split of the [SAMSum dataset](https://huggingface.co/datasets/Samsung/samsum).

Using the sample dialogue and prompt template, invoke the model to view the structure of the response and its contents.

In [None]:
%%time

# Create a predictor that sends and receives JSON for the deployed endpoint
print(endpoint_name)

llm = sagemaker.Predictor(
    endpoint_name = endpoint_name,
    sagemaker_session = sess,
    serializer = sagemaker.serializers.JSONSerializer(),
    deserializer = sagemaker.deserializers.JSONDeserializer(),
)

In [None]:
response = llm.predict(
    {
        "inputs": random_prompt['prompt'],
        "parameters": {
            "do_sample":True,
            "max_new_tokens":256,
            "top_p":0.9,
            "temperature":0.6,
        },
    }
)

response['generated_text']
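
The request body and the response handling above can be wrapped in two small functions so that every call uses the same parameters. This is a sketch under assumptions: `build_payload` and `extract_generated_text` are hypothetical helpers, and the list-shaped response is an assumption about other serving stacks, not something this notebook's endpoint is shown to return.

```python
def build_payload(prompt, max_new_tokens=256, temperature=0.6, top_p=0.9):
    """Hypothetical helper: build the JSON body used in the predict call above."""
    return {
        "inputs": prompt,
        "parameters": {
            "do_sample": True,
            "max_new_tokens": max_new_tokens,
            "top_p": top_p,
            "temperature": temperature,
        },
    }


def extract_generated_text(response):
    """Hypothetical helper: return the generated text with surrounding whitespace removed."""
    # Assumption: some servers wrap the result in a one-element list.
    if isinstance(response, list):
        response = response[0]
    return response["generated_text"].strip()


# Fake response, used only to show the expected shape.
print(extract_generated_text({"generated_text": "  A and B agree to meet for lunch. "}))
```

With the predictor defined above, the call would be `extract_generated_text(llm.predict(build_payload(random_prompt["prompt"])))`.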