# LMI v16 Intro Notebook
In this example we showcase how to set up the LMI v16 container with the vLLM backend. Additional samples covering some of the newer features and support in v16 are linked below:

- [Custom Input/Output Formatters](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/lmi/user_guides/input_formatter_schema.md)
- [Multi-Adapter Inference](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/adapters.md)
- [Sticky Session Routing](https://github.com/deepjavalibrary/djl-serving/blob/0.34.0-dlc/serving/docs/stateful_sessions.md)

## Prerequisites
To work with this notebook, ensure that you have a [HuggingFace Access Token](https://huggingface.co/docs/hub/en/security-tokens) to access model artifacts. You can run this notebook in SageMaker Classic Notebook Instances, SageMaker Studio, or an IDE that has access to the necessary AWS services.

## Additional Resources
A lot of this code is borrowed from other samples in this repository and adjusted for this specific model and container: https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/OpenAI/gpt-oss/deploy/openai_gpt_oss.ipynb.

## Setup

In [None]:
%pip install sagemaker --upgrade --quiet --no-warn-conflicts

In [None]:
import json
import sagemaker
import boto3

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
region = sess.boto_region_name  # region name of the current SageMaker session (public attribute)
account_id = sess.account_id()

sm_client = boto3.client("sagemaker")  # client to interact with SageMaker
smr_client = boto3.client("sagemaker-runtime")  # client to interact with SageMaker Endpoints

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
print(f"sagemaker version: {sagemaker.__version__}")

## Specify Container and vLLM Serving Properties

In [None]:
# specify hardware
instance_type = "ml.g5.4xlarge"
num_gpu = 1

# specify container LMIv16
CONTAINER_VERSION = "0.34.0-lmi16.0.0-cu128"
inference_image = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:{CONTAINER_VERSION}"
print(f"Using image URI: {inference_image}")

# utilize the vLLM async handler
vllm_env = {
    "HF_MODEL_ID": "Qwen/Qwen3-1.7B",
    "HF_TOKEN": "Enter HF token here",  # replace with your HuggingFace access token
    "SERVING_FAIL_FAST": "true",
    "OPTION_ASYNC_MODE": "true",
    "OPTION_ROLLING_BATCH": "disable",
    "OPTION_TENSOR_PARALLEL_DEGREE": json.dumps(num_gpu),
    "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service",
    "OPTION_TRUST_REMOTE_CODE": "true",
}
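Container environment values are passed as strings, which is why the cell above wraps `num_gpu` in `json.dumps`. The following is a minimal sketch of a pre-deployment check, assuming you want to catch non-string values and an unreplaced token placeholder before creating the model. The helper name `check_container_env` and the `example_env` dictionary are hypothetical and not part of the SageMaker SDK or the LMI container.

```python
import json

# Placeholder string used for HF_TOKEN in the cell above.
PLACEHOLDER_TOKEN = "Enter HF token here"


def check_container_env(env):
    """Return a list of human-readable problems found in a container env dict."""
    problems = []
    for key, value in env.items():
        # Environment values are expected to be strings.
        if not isinstance(value, str):
            problems.append(f"{key}: expected str, got {type(value).__name__}")
    if env.get("HF_TOKEN") == PLACEHOLDER_TOKEN:
        problems.append("HF_TOKEN: placeholder was not replaced")
    return problems


# Hypothetical environment with two deliberate mistakes, for illustration only.
example_env = {
    "HF_MODEL_ID": "Qwen/Qwen3-1.7B",
    "HF_TOKEN": PLACEHOLDER_TOKEN,
    "OPTION_TENSOR_PARALLEL_DEGREE": json.dumps(1),  # json.dumps(1) yields the string "1"
    "OPTION_ASYNC_MODE": True,  # deliberately wrong: a bool instead of "true"
}

for problem in check_container_env(example_env):
    print(problem)
```

In the notebook you could call `check_container_env(vllm_env)` and proceed only if the returned list is empty.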

## Create Inference Component and Endpoint
Here we use the higher-level SageMaker Python SDK to create both an endpoint and an Inference Component. Adjust the resource requirements depending on your model and use case.

In [None]:
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

# SageMaker Constructs
model_name = sagemaker.utils.name_from_base("model-lmi")
endpoint_name = model_name
inference_component_name = f"ic-{model_name}"

# SageMaker Model Object -> vLLM env
lmi_model = sagemaker.Model(
    image_uri=inference_image,
    env=vllm_env,
    role=role,
    name=model_name,
)

lmi_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=600,
    endpoint_name=endpoint_name,
    endpoint_type=sagemaker.enums.EndpointType.INFERENCE_COMPONENT_BASED,
    inference_component_name=inference_component_name,
    # Check the memory available on your instance type; for example, ml.g5.4xlarge has 24 GB of GPU memory and 64 GB of host memory.
    # "memory" is specified in MB, so 1024 * 10 requests 10 GB of host memory.
    resources=ResourceRequirements(requests={"num_accelerators": 1, "memory": 1024 * 10, "copies": 1}),
)
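If you want to confirm that the Inference Component is ready before invoking it, one option is to poll its status. The sketch below assumes the boto3 SageMaker client's `describe_inference_component` call and its `InferenceComponentStatus` and `FailureReason` response fields. The function name `wait_for_inference_component`, the timeout, and the polling interval are illustrative choices, not values from this notebook.

```python
import time


def wait_for_inference_component(client, name, timeout_s=1800, poll_s=30, sleep=time.sleep):
    """Poll until the inference component is InService, or raise on failure or timeout.

    `client` is expected to behave like boto3.client("sagemaker").
    """
    deadline = time.monotonic() + timeout_s
    while True:
        desc = client.describe_inference_component(InferenceComponentName=name)
        status = desc["InferenceComponentStatus"]
        if status == "InService":
            return desc
        if status == "Failed":
            reason = desc.get("FailureReason", "unknown")
            raise RuntimeError(f"Inference component {name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Inference component {name} still {status} after {timeout_s}s")
        sleep(poll_s)  # wait before checking the status again


# Example usage in this notebook (requires AWS credentials):
# wait_for_inference_component(sm_client, inference_component_name)
```

The `sleep` parameter exists only so the loop can be exercised without real delays.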

## Invoke Endpoint

In [None]:
import json
content_type = "application/json"

# Adjust payload and parameters as needed
payload = "What is the capital of the United States?"
response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=inference_component_name,  # specify the inference component name
    ContentType=content_type,
    Accept=content_type,
    Body=json.dumps(
        {
            "inputs": payload,
            "parameters": {
                "max_new_tokens": 200  # Adjust this value as needed
            },
        }
    ),
)
result = json.loads(response["Body"].read().decode())['generated_text']
result
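The request body and response parsing above can be factored into two small helpers, which makes it easier to vary parameters and to surface a clear error when the response lacks a `generated_text` field. This is a sketch under the assumption that the endpoint keeps the `inputs`/`parameters` request shape and the `generated_text` response key used in the cell above. The helper names and the `temperature` parameter in the example are illustrative.

```python
import json


def build_request_body(prompt, max_new_tokens=200, **extra_params):
    """Serialize a prompt and generation parameters into the JSON body used above."""
    parameters = {"max_new_tokens": max_new_tokens, **extra_params}
    return json.dumps({"inputs": prompt, "parameters": parameters})


def extract_generated_text(raw_body):
    """Parse a response body (str or bytes) and return its generated_text field."""
    data = json.loads(raw_body)
    if not isinstance(data, dict) or "generated_text" not in data:
        raise ValueError(f"Unexpected response shape: {data!r}")
    return data["generated_text"]


# Illustrative round trip with a hand-written response, no endpoint required.
body = build_request_body("What is the capital of the United States?", temperature=0.2)
print(body)
print(extract_generated_text(b'{"generated_text": "Washington, D.C."}'))
```

Against the live endpoint, you would pass `Body=build_request_body(payload)` to `invoke_endpoint` and call `extract_generated_text(response["Body"].read())` on the result.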

## Cleanup
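The cell below deletes the Inference Component first, then the endpoint, endpoint configuration, and model. You can also delete these resources via the Studio UI under Deployments.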

In [None]:
sess.delete_inference_component(inference_component_name, wait=True)
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
sess.delete_model(model_name)
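If one of the delete calls raises (for example, because a resource was already removed through the UI), the remaining calls in the cell above are skipped. The following is a minimal sketch of a more tolerant cleanup, assuming the same four `Session` methods used above. The helper name `cleanup_resources` is hypothetical, and catching every exception is a deliberate simplification for a teardown script.

```python
def cleanup_resources(session, endpoint_name, model_name, inference_component_name):
    """Attempt each delete in order and return a list of (label, exception) failures."""
    steps = [
        # The inference component is removed first because it runs on the endpoint.
        ("inference component",
         lambda: session.delete_inference_component(inference_component_name, wait=True)),
        ("endpoint", lambda: session.delete_endpoint(endpoint_name)),
        ("endpoint config", lambda: session.delete_endpoint_config(endpoint_name)),
        ("model", lambda: session.delete_model(model_name)),
    ]
    failures = []
    for label, step in steps:
        try:
            step()
        except Exception as exc:  # keep going so later resources are still attempted
            failures.append((label, exc))
    return failures


# Example usage in this notebook (requires AWS credentials):
# for label, exc in cleanup_resources(sess, endpoint_name, model_name, inference_component_name):
#     print(f"Could not delete {label}: {exc}")
```

Review any reported failures manually, since a resource that was not deleted may continue to incur charges.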