# Create a SageMaker endpoint with the Qwen2 0.5B model

This notebook walks through the steps to deploy Qwen2 (0.5B parameters) to a SageMaker endpoint.

## Deploy model

Install and import dependencies

In [2]:
!pip install sagemaker boto3 --upgrade --quiet

In [3]:
import sagemaker
import jinja2
from sagemaker import image_uris
import boto3
import time
import json

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


Retrieve the role and session to use for the operations, along with the S3 bucket that will store the artifacts.

In [15]:
role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
model_bucket = sess.default_bucket()
region = sess.boto_region_name  # public attribute; avoids relying on the private _region_name
print(f"Using bucket {model_bucket}")

Using bucket sagemaker-us-east-1-576219157147


Define a variable that holds the S3 URL where the model artifacts will be stored.


In [6]:
s3_model_prefix = "qwen_2"  # folder within bucket where code artifact will go
jinja_env = jinja2.Environment()
pretrained_model_location = f"s3://{model_bucket}/{s3_model_prefix}/"
print(f"Pretrained model will be uploaded to ---- > {pretrained_model_location}")

Pretrained model will be uploaded to ---- > s3://sagemaker-us-east-1-576219157147/qwen_2/
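
The location above is a plain S3 URI of the form `s3://<bucket>/<prefix>/`. Some boto3 calls expect the bucket and the key as separate arguments, so it can help to split the URI back into its parts. The following is a minimal sketch using only the standard library; `split_s3_uri` and the bucket name `example-bucket` are hypothetical and not part of the original notebook.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(s3_uri: str) -> Tuple[str, str]:
    """Split an S3 URI into (bucket, key_or_prefix)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {s3_uri!r}")
    # The path starts with "/", which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical bucket name, used for illustration only.
bucket, prefix = split_s3_uri("s3://example-bucket/qwen_2/")
print(bucket, prefix)
```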


Retrieve the URI of the DJL inference container image

In [7]:
inference_image_uri = image_uris.retrieve(
    framework="djl-deepspeed", region=sess.boto_session.region_name, version="0.25.0"
)
inference_image_uri

'763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.25.0-deepspeed0.11.0-cu118'

Set up the model name variables and configs

In [8]:
model_version = "Qwen/Qwen2-0.5B-Chat"
model_names = {
    "model_name": model_version,  # model identifier stored in model_name.json
}
with open("inference-artifacts/model_name.json",'w') as file:
    json.dump(model_names, file)

with open("inference-artifacts/serving.properties", 'r') as f:
    current = f.read()

with open("inference-artifacts/serving.properties", 'w') as f:
    updated = current.replace("SAGEMAKER_BUCKET", model_bucket)
    f.write(updated)
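
Note that the cell above rewrites `serving.properties` in place, so the `SAGEMAKER_BUCKET` placeholder is gone after the first run and re-running the cell with a different bucket changes nothing. One possible way around this is to keep an untouched template and render the final file from it. The sketch below assumes a hypothetical `serving.properties.template` file; the helper name `render_properties` and the example property lines are illustrative and not taken from the notebook's actual artifacts.

```python
import tempfile
from pathlib import Path


def render_properties(template_path: Path, output_path: Path, bucket: str) -> str:
    """Render serving.properties from a template, leaving the template untouched."""
    template = template_path.read_text()
    if "SAGEMAKER_BUCKET" not in template:
        raise ValueError(f"Placeholder SAGEMAKER_BUCKET not found in {template_path}")
    rendered = template.replace("SAGEMAKER_BUCKET", bucket)
    output_path.write_text(rendered)
    return rendered


# Demo in a temporary directory with made-up file contents (assumption).
with tempfile.TemporaryDirectory() as tmp:
    template_file = Path(tmp) / "serving.properties.template"
    template_file.write_text("engine=Python\noption.s3url=s3://SAGEMAKER_BUCKET/qwen_2/\n")
    rendered_text = render_properties(
        template_file, Path(tmp) / "serving.properties", "example-bucket"
    )
    template_after = template_file.read_text()

print(rendered_text)
```

If this approach is used, the template should probably live outside `inference-artifacts/` so that it does not end up in the archive.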

Create the compressed archive containing the inference artifacts

In [10]:
%%sh
# -f: do not fail when the checkpoints directory does not exist
rm -rf inference-artifacts/.ipynb_checkpoints
tar czvf model.tar.gz inference-artifacts/

inference-artifacts/
inference-artifacts/model.py
inference-artifacts/requirements.txt
inference-artifacts/serving.properties
inference-artifacts/model_name.json
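
The same archive can also be built from Python, which avoids deleting the checkpoint directory first. The following is a minimal sketch using the standard `tarfile` module with a filter that skips `.ipynb_checkpoints`; the helper names `skip_checkpoints` and `build_archive` and the placeholder files are hypothetical.

```python
import tarfile
import tempfile
from pathlib import Path, PurePosixPath


def skip_checkpoints(member: tarfile.TarInfo):
    """Return None to drop Jupyter checkpoint entries, otherwise keep the member."""
    if ".ipynb_checkpoints" in PurePosixPath(member.name).parts:
        return None
    return member


def build_archive(source_dir: Path, archive_path: Path) -> list:
    """Create a gzip-compressed tarball of source_dir and return the stored names."""
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=source_dir.name, filter=skip_checkpoints)
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Demo with placeholder files in a temporary directory (assumption).
with tempfile.TemporaryDirectory() as tmp:
    artifacts = Path(tmp) / "inference-artifacts"
    (artifacts / ".ipynb_checkpoints").mkdir(parents=True)
    (artifacts / "model.py").write_text("# placeholder\n")
    (artifacts / ".ipynb_checkpoints" / "model-checkpoint.py").write_text("# stale copy\n")
    names = build_archive(artifacts, Path(tmp) / "model.tar.gz")

print(names)
```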


Upload to S3

In [11]:
s3_code_artifact = sess.upload_data("model.tar.gz", model_bucket, s3_model_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {s3_code_artifact}")

S3 Code or Model tar ball uploaded to --- > s3://sagemaker-us-east-1-576219157147/qwen_2/model.tar.gz


Create the model object, then set the endpoint name and deploy

In [12]:
from sagemaker.model import Model
from sagemaker.utils import name_from_base

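# name_from_base appends a timestamp; keep the part after "/" and replace "." with "-"
model_name = name_from_base(model_version).split('/')[-1].replace(".", "-")
model = Model(
    image_uri=inference_image_uri,
    model_data=s3_code_artifact,
    role=role,
    name=model_name,
)
model_name  # display the generated name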


'Qwen2-0-5B-Chat-2024-06-06-19-36-29-157'

In [13]:
%%time
endpoint_name = "endpoint-" + model_name
print(f"Deploying {endpoint_name}")
model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    endpoint_name=endpoint_name
)

----------!
CPU times: user 80.9 ms, sys: 24.3 ms, total: 105 ms
Wall time: 5min 32s
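
By default `model.deploy` blocks until the endpoint is ready, which accounts for the wall time of several minutes above. If the deployment is started with `wait=False`, or is monitored from another process, the status can be polled through the SageMaker `describe_endpoint` API instead. The following is a minimal sketch under that assumption; `wait_for_endpoint` is a hypothetical helper, and `FakeSageMakerClient` is a stand-in used only so the loop can be run without AWS access. In real use the first argument would be `boto3.client("sagemaker")`.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30.0, timeout_seconds=1800.0):
    """Poll describe_endpoint until the endpoint is InService, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = description.get("FailureReason", "unknown")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Endpoint {endpoint_name} still {status} after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)


class FakeSageMakerClient:
    """Stand-in for a SageMaker client; returns a scripted sequence of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)

    def describe_endpoint(self, EndpointName):
        return {"EndpointName": EndpointName, "EndpointStatus": next(self._statuses)}


final_status = wait_for_endpoint(
    FakeSageMakerClient(["Creating", "Creating", "InService"]),
    "endpoint-example",  # hypothetical endpoint name
    poll_seconds=0.0,
)
print(final_status)
```

boto3 also provides an `endpoint_in_service` waiter on the SageMaker client, which may be preferable to a hand-written loop.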


## Test deployed model

In [19]:
sagemaker_runtime = boto3.client("sagemaker-runtime", region_name=region)

response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=bytes('{"prompt": "Give me a short introduction to large language model."}', 'utf-8'),
)

# Decodes and prints the response body:
print(response['Body'].read().decode('utf-8'))

A large language model is an artificial intelligence system that can generate human-like text by analyzing large amounts of natural language data. These models are typically trained on vast amounts of text data, which includes documents, news articles, and social media posts, among other sources. The goal of these models is to produce text that is coherent, grammatically correct, and informative, with the ability to generate new ideas and concepts based on context and patterns in the input data.

Large language models have been used in a variety of applications, including chatbots, virtual assistants, natural language processing (NLP), and knowledge representation systems. They have also been used to train machine learning models for tasks such as speech recognition, image classification, and recommendation systems. Large language models are becoming increasingly important in many industries, particularly those involving complex reasoning, decision-making, and creativity.
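
The test cell builds the JSON request by hand, which breaks as soon as the prompt contains a double quote or a newline. A small wrapper that serializes the payload with `json.dumps` sidesteps that. The sketch below assumes the endpoint accepts the same `{"prompt": ...}` body used above; `ask` is a hypothetical helper, and `EchoRuntimeClient` is a stand-in for `boto3.client("sagemaker-runtime")` so the example runs without a deployed endpoint.

```python
import io
import json


def ask(runtime_client, endpoint_name, prompt):
    """Send a prompt to the endpoint and return the decoded response text."""
    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    response = runtime_client.invoke_endpoint(EndpointName=endpoint_name, Body=payload)
    return response["Body"].read().decode("utf-8")


class EchoRuntimeClient:
    """Stand-in for the sagemaker-runtime client; echoes the prompt back."""

    def invoke_endpoint(self, EndpointName, Body):
        prompt = json.loads(Body)["prompt"]
        return {"Body": io.BytesIO(f"echo: {prompt}".encode("utf-8"))}


# The embedded quotes are escaped by json.dumps and survive the round trip.
answer = ask(EchoRuntimeClient(), "endpoint-example", 'What does "LLM" stand for?')
print(answer)
```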


## References

Real-time inference - Amazon SageMaker - https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html