# Llama 3.2 vision stateful inference with SageMaker

## Contents

This notebook demonstrates how to use TorchServe to deploy the Llama 3.2 Vision model on SageMaker. It uses the `conda_pytorch_p310` kernel on a SageMaker notebook instance.

Run it on an Amazon SageMaker notebook instance rather than SageMaker Studio, because Docker commands are easier to run on a notebook instance.

Make sure to follow the [README](../llama32-11b-vision/README.md) and set up a notebook instance.



## Step 0: Let's bump up SageMaker and import stuff

In [None]:
!python --version && aws --version

In [None]:
!pip install -Uq pip
!pip install -Uq sagemaker
!pip install torch-model-archiver
!pip install -Uq botocore
!pip install -Uq boto3

In [None]:
!cat > .env <<EOF
TS_HF_TOKEN_VALUE="hf_...."
EOF

Make sure you have accepted Meta's terms and conditions for downloading Llama models [here](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision)  
Generate a Hugging Face access token. [Learn more](https://huggingface.co/docs/hub/en/security-tokens)  
Open the `.env` file and set the access token as the value of `TS_HF_TOKEN_VALUE`.

In [None]:
!pip install python-dotenv
from dotenv import load_dotenv
import os
load_dotenv(override=True)  # Loads the variables from .env
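
Before moving on, it can help to confirm that the token was actually picked up from `.env` and is no longer the placeholder. The check below is a minimal sketch: `hf_token_problem` is a hypothetical helper (not part of any library), and treating any value containing `.` as the placeholder is an assumption based on the `hf_....` template above.

```python
import os


def hf_token_problem(value):
    """Return a short description of what looks wrong, or None if the token looks usable."""
    if not value:
        return "TS_HF_TOKEN_VALUE is not set"
    # Assumption: real tokens contain no dots, so dots mean the template was not edited.
    if "." in value:
        return "TS_HF_TOKEN_VALUE still looks like the placeholder"
    return None


problem = hf_token_problem(os.environ.get("TS_HF_TOKEN_VALUE"))
print(problem or "Hugging Face token looks set")
```

This only inspects the string. It does not verify with Hugging Face that the token is valid or that the model license was accepted.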

In [None]:
import os
import shutil
import importlib
import botocore

In [None]:
import sagemaker
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path

In [None]:
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers
barebone_session = sagemaker.session.Session()  # bare session, used only to look up the current region
# region name of the current SageMaker environment
region = barebone_session.boto_region_name
boto3_session = boto3.session.Session(region_name=region)
# Create a SageMaker runtime client object (used to invoke the endpoint)
smr = boto3.client('sagemaker-runtime', region_name=region)
# Create a SageMaker client object
sm = boto3.client('sagemaker', region_name=region)
# execution role for the endpoint
role = sagemaker.get_execution_role()
# sagemaker session for interacting with different AWS APIs
sess = sagemaker.session.Session(boto3_session, sagemaker_client=sm, sagemaker_runtime_client=smr)
# account ID of the current AWS account
account = sess.account_id()

# Configuration:
bucket_name = sess.default_bucket()
prefix = "torchserve"
output_path = f"s3://{bucket_name}/{prefix}"
model_name = "llama32vision-sm"
print(f'account={account}, region={region}, role={role}, output_path={output_path}')

## Step 1: Build a BYOD TorchServe Docker container and push it to Amazon ECR

1. Create an ECR repo: https://docs.aws.amazon.com/AmazonECR/latest/userguide/repository-create.html
2. Get Base Image: https://github.com/aws/deep-learning-containers/blob/master/available_images.md

In [None]:
baseimage = f"763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-inference:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker"
reponame = "llama32-11b-vision-stateful"
versiontag = "1.0"
print("Use the output of the print below to run ./build_and_push.sh in a terminal. You get better feedback in a terminal.")
print(f"cd docker && ./build_and_push.sh {reponame} {versiontag} {baseimage} {region} {account}")
print("If you do end up running this command in a terminal, you can skip the next cell.")

In [None]:
# %%capture build_output

# # Build our own docker image
# !cd docker && ./build_and_push.sh {reponame} {versiontag} {baseimage} {region} {account}

In [None]:
# Image URI of the container pushed to ECR in the previous step
container = f"{account}.dkr.ecr.{region}.amazonaws.com/{reponame}:{versiontag}"
print(container)
print(baseimage)
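
Both image URIs above follow the same `registry/repository:tag` layout, which is useful to keep in mind when checking that the pushed image matches what the endpoint will pull. The sketch below composes and splits such a URI. `ecr_image_uri` and `split_image_uri` are hypothetical helper names, the account ID is a dummy value, and the `amazonaws.com` suffix is an assumption that does not hold in every AWS partition.

```python
def ecr_image_uri(account, region, repo, tag):
    """Compose an ECR image URI in the same layout as the `container` variable."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"


def split_image_uri(uri):
    """Split an image URI into (registry, repository, tag)."""
    registry, _, rest = uri.partition("/")
    repo, _, tag = rest.rpartition(":")
    return registry, repo, tag


# Dummy account ID and region, for illustration only.
uri = ecr_image_uri("123456789012", "us-west-2", "llama32-11b-vision-stateful", "1.0")
print(uri)
print(split_image_uri(uri))
```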


## Step 2: Build TorchServe Model Artifacts and Upload to S3

In [None]:
!rm -rf code/{model_name}

In [None]:
!cd code && torch-model-archiver --model-name {model_name} --version 1.0 --handler handler/custom_handler.py --config-file handler/model-config.yaml --archive-format no-archive --extra-files handler/ -f

In [None]:
!cd code && aws s3 cp {model_name} {output_path}/{model_name} --recursive

In [None]:
s3_uri = f"{output_path}/{model_name}/"
print(s3_uri)
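
The model in Step 3 is created with `"S3DataType": "S3Prefix"`, which treats `S3Uri` as a key name prefix. That is why the URI above ends with `/`. If the URI is ever assembled differently, a small normalizer along these lines keeps the trailing slash in place. `as_s3_prefix` is a hypothetical helper and the bucket name is made up.

```python
def as_s3_prefix(uri):
    """Return the S3 URI with exactly one trailing slash, rejecting non-S3 URIs."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri!r}")
    return uri.rstrip("/") + "/"


# Made-up bucket name, for illustration only.
print(as_s3_prefix("s3://example-bucket/torchserve/llama32vision-sm"))
print(as_s3_prefix("s3://example-bucket/torchserve/llama32vision-sm/"))
```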

## Step 3: Create SageMaker Endpoint

### 3.1 Create Model

In [None]:
from datetime import datetime

# We deploy this model on a single GPU. Each GPU in a g5 instance has 24 GB of memory.
# The model size is 22 GB at 16 bits per weight, so g5 instances cut it a bit close.
# For a production use case it is better to use ml.p4d.24xlarge or higher, since p4d has 40 GB of memory per GPU.
# We have kept the instance type as g5 to reduce cost and to make this more accessible to people who want to
# understand how stateful inference works.
instance_type = "ml.g5.4xlarge"  
endpoint_name = sagemaker.utils.name_from_base(model_name)

model = Model(
    name=model_name + datetime.now().strftime("%Y-%m-%d-%H-%M-%S"),
    # Enable SageMaker uncompressed model artifacts via "S3DataType": "S3Prefix"
    model_data={
        "S3DataSource": {
                "S3Uri": s3_uri,
                "S3DataType": "S3Prefix",
                "CompressionType": "None",
        }
    },
    image_uri=container,
    role=role,
    sagemaker_session=sess,
    env={
        # TorchServe configuration file
        "TS_CONFIG_FILE": "/home/model-server/config.properties",
        # Disable token authorization for REST APIs
        "TS_DISABLE_TOKEN_AUTHORIZATION": "true", 
        # Headers to indicate Session ID
        "TS_HEADER_KEY_SEQUENCE_ID": "X-Amzn-SageMaker-Session-Id",
        "TS_REQUEST_SEQUENCE_ID": "X-Amzn-SageMaker-Session-Id",
        # Headers to indicate closed session
        "TS_HEADER_KEY_SEQUENCE_END": "X-Amzn-SageMaker-Closed-Session-Id",
        "TS_REQUEST_SEQUENCE_END": "X-Amzn-SageMaker-Closed-Session-Id",
        # Enable system metrics aggregation
        "TS_DISABLE_SYSTEM_METRICS": "false",
        "TS_HF_TOKEN": os.environ["TS_HF_TOKEN_VALUE"]
    },
)
print(model)
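
The 22 GB figure in the comments above follows from the parameter count and the weight precision. The sketch below reproduces that arithmetic. It assumes a nominal 11 billion parameters (taken from the "11B" in the model name), uses decimal gigabytes, and counts the weights only, so the KV cache and activations still have to fit in the remaining headroom.

```python
def weights_gb(num_params, bits_per_weight=16):
    """Approximate size of the model weights in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


params = 11e9  # nominal count, assumed from the "11B" in the model name
gpu_gb = 24    # memory per GPU on g5, as stated in the cell above

needed = weights_gb(params)
print(f"weights: {needed:.0f} GB, headroom on a 24 GB GPU: {gpu_gb - needed:.0f} GB")
```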

### 3.2 Deploy Model and Create Endpoint

In [None]:
model.deploy(
    initial_instance_count=1, # increase the number of instances based on your load
    instance_type=instance_type,
    endpoint_name=endpoint_name,
    #volume_size=512, # increase the size to store large model
    model_data_download_timeout=3600, 
    container_startup_health_check_timeout=3600, 
)
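
`model.deploy` waits for the endpoint by default. If the kernel is restarted while the endpoint is still being created, the status can be polled separately. The following is a minimal sketch: `wait_until_in_service` is a hypothetical helper that accepts any client exposing `describe_endpoint`, such as the `sm` client created in Step 0.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=3600):
    """Poll describe_endpoint until the endpoint is InService, fails, or the timeout passes."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"endpoint {endpoint_name} failed to deploy")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint {endpoint_name} is still {status}")
        time.sleep(poll_seconds)


# In this notebook the call would look like:
# wait_until_in_service(sm, endpoint_name)
```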

### 3.3 Create a Predictor

In [None]:
predictor = sagemaker.predictor.Predictor(
    endpoint_name=model.endpoint_name,
    sagemaker_session=sess
)
print(predictor)

In [None]:
# predictor = sagemaker.predictor.Predictor(
#     endpoint_name='llava-sm-2024-09-04-06-35-10-354',
#     sagemaker_session=sess
# )
# print(predictor)

## Step 4: Run Inference

In [None]:
# Add the handler module path to sys.path so that data_types can be imported
import os, sys

demo_data_path = os.path.join(os.getcwd(), "code/handler")
if demo_data_path not in sys.path:
    sys.path.append(demo_data_path)

In [None]:
# Install dependencies
!pip install torch dataclasses_json

### 4.1 Open Session 1

In [None]:
image_url="https://images.pexels.com/photos/1519753/pexels-photo-1519753.jpeg"

In [None]:
%%time
from data_types import (
    BaseRequest,
    CloseSessionRequest,
    StartSessionRequest,
    TextPromptRequest,
    OpenSessionResponse,
    TextPromptResponse,
    CloseSessionResponse
)

ts_request_sequence_id = "SessionId"


def send_and_check_request(r, seq_id):
    response = smr.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=r.to_json(),
        ContentType="application/json",
        SessionId=seq_id
    )
    assert response["ResponseMetadata"]["HTTPStatusCode"] == 200, f"Sending request failed: {r}"
    return response['Body'].readlines()[0]

open_request = StartSessionRequest(
    type="start_session",
    path=image_url,
)

open_response = send_and_check_request(open_request, "NEW_SESSION")
open_response = OpenSessionResponse.from_json(open_response)
print(open_response)
assert open_response.session_id.startswith("ts-seq-")

In [None]:
open_response.session_id
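
To make the request bodies easier to picture, the sketch below serializes stand-in request objects to JSON. The classes are hypothetical and are not the ones in `data_types`. They only reuse the field names visible in the cells of this step (`type`, `path`, `session_id`, `prompt_text`), and the exact JSON produced by the real classes may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SketchStartSession:  # hypothetical stand-in, not the handler's class
    path: str
    type: str = "start_session"


@dataclass
class SketchTextPrompt:  # hypothetical stand-in, not the handler's class
    session_id: str
    prompt_text: str
    type: str = "send_text_prompt"


def to_body(request):
    """Serialize a request dataclass to the JSON string sent as the request body."""
    return json.dumps(asdict(request))


print(to_body(SketchStartSession(path="https://example.com/image.jpeg")))
print(to_body(SketchTextPrompt(session_id="ts-seq-example", prompt_text="describe the picture")))
```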

### 4.2 Send Text Prompt 1

In [None]:
%%time
text_prompt_request1 = TextPromptRequest(
    type="send_text_prompt",
    session_id=open_response.session_id,
    prompt_text="describe the picture"
)

text_prompt_response1 = send_and_check_request(text_prompt_request1, open_response.session_id)
text_prompt_response1 = TextPromptResponse.from_json(text_prompt_response1)
print(text_prompt_response1.response_text)
assert text_prompt_response1.response_text

### 4.3 Send Text Prompt 2

In [None]:
%%time
text_prompt_request2 = TextPromptRequest(
    type="send_text_prompt",
    session_id=open_response.session_id,
    prompt_text="is there a mountain in the picture, describe it"
)

text_prompt_response2 = send_and_check_request(text_prompt_request2, open_response.session_id)
text_prompt_response2 = TextPromptResponse.from_json(text_prompt_response2)
print(text_prompt_response2.response_text)
assert text_prompt_response2.response_text

### 4.4 Close session

In [None]:
# close session
close_request = CloseSessionRequest(
    type="close_session",
    session_id=open_response.session_id,
)
    
close_response = send_and_check_request(
    close_request, open_response.session_id
)

close_response = CloseSessionResponse.from_json(close_response)
assert close_response.success
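
A session that is opened but never closed keeps its state on the endpoint, so it can be worth guaranteeing that the close request is sent even when a prompt fails. One way to sketch this is a context manager. `stateful_session` is a hypothetical helper, and the lambdas below are stand-ins so the example runs without an endpoint. In this notebook, `open_fn` would wrap the `StartSessionRequest` call and `close_fn` the `CloseSessionRequest` call.

```python
from contextlib import contextmanager


@contextmanager
def stateful_session(open_fn, close_fn):
    """Open a session, yield its ID, and close it even if the body raises."""
    session_id = open_fn()
    try:
        yield session_id
    finally:
        close_fn(session_id)


# Stand-in callables: opening returns a fixed ID, closing records the ID.
events = []
with stateful_session(lambda: "ts-seq-demo", events.append) as sid:
    events.append(f"prompt sent in {sid}")
print(events)
```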

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()