## vLLM-LMI Mixtral-8x7B-DPO-AWQ deployment guide

### In this tutorial, you will use the vLLM backend of the Large Model Inference (LMI) DLC to deploy Mixtral-8x7B-DPO-AWQ and run inference with it.

Make sure the following permissions are granted before running the notebook:

* S3 bucket push access
* SageMaker access




### Step 1: Upgrade the SageMaker SDK and import dependencies

In [1]:
%pip install sagemaker --upgrade  --quiet

Note: you may need to restart the kernel to use updated packages.


In [None]:
%pip install transformers sentencepiece --upgrade  --quiet

In [2]:
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment
account_id = sess.account_id()

### Step 2: Start preparing model artifacts

The LMI container expects the following artifacts to set up the model:

* serving.properties (required): Defines the model server settings
* model.py (optional): A Python file that defines the core inference logic
* requirements.txt (optional): Any additional pip packages to install

In [8]:
%%writefile serving.properties
engine=Python
option.model_id=TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ
option.tensor_parallel_degree=4
option.max_rolling_batch_size=64
option.rolling_batch=vllm
option.task=text-generation
option.dtype=fp16
option.quantize=awq
option.max_model_len=8192

Writing serving.properties
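
Before packaging, it can help to read the settings back and confirm that the values are what you intended. The following is a minimal sketch, assuming the file only contains simple `key=value` lines; `parse_properties` is a hypothetical helper, not part of the LMI container or the SageMaker SDK.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and '#' comments.

    Hypothetical helper for a quick sanity check; it does not implement
    the full Java .properties syntax (no escapes or line continuations).
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


example = """engine=Python
option.tensor_parallel_degree=4
option.rolling_batch=vllm
"""
props = parse_properties(example)

# The tensor parallel degree should not exceed the GPU count of the instance
# you deploy to (assumed here to be 4).
assumed_gpu_count = 4
assert int(props["option.tensor_parallel_degree"]) <= assumed_gpu_count
```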


In [9]:
%%sh
mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel

mymodel/
mymodel/serving.properties
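
If a shell is not available, the same archive layout can be produced from Python with the standard library. This is a sketch under the assumption that the archive only needs a `mymodel/` directory containing `serving.properties`, as in the cell above; `package_model_dir` is a hypothetical name.

```python
import os
import tarfile
import tempfile


def package_model_dir(properties_text, out_path, arcdir="mymodel"):
    """Write serving.properties into a temp directory and tar it as arcdir/."""
    with tempfile.TemporaryDirectory() as tmp:
        model_dir = os.path.join(tmp, arcdir)
        os.makedirs(model_dir)
        with open(os.path.join(model_dir, "serving.properties"), "w") as f:
            f.write(properties_text)
        # arcname keeps the temp path out of the archive, so entries are
        # stored as mymodel/ and mymodel/serving.properties.
        with tarfile.open(out_path, "w:gz") as tar:
            tar.add(model_dir, arcname=arcdir)
    return out_path
```

Calling `package_model_dir(open("serving.properties").read(), "mymodel.tar.gz")` would then stand in for the shell cell.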


### Step 3: Start building SageMaker endpoint

#### Getting the container image URI

In [10]:
image_uri = image_uris.retrieve(
        framework="djl-deepspeed",
        region=sess.boto_session.region_name,
        version="0.26.0"
    )

#### Upload the artifact to S3 and create the SageMaker model

In [None]:
model_name = "TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-AWQ"
s3_code_prefix = f"large-model-vllm/{model_name}/code"  # S3 key prefix for the uploaded tarball
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

#### Create SageMaker endpoint with a specified instance type

In [None]:
instance_type = "ml.g4dn.12xlarge"
endpoint_name = sagemaker.utils.name_from_base(f"lmi-model-{model_name.replace('/', '-')}")
print(f"endpoint_name: {endpoint_name}")

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name,
             container_startup_health_check_timeout=1800
            )

# Requests are sent as JSON, so we specify a JSON serializer; responses come back as raw bytes and are decoded later
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
)

### Step 4: Run inference

In [28]:
system_message = ""
input_text = "请解释一下AI"  # Chinese for "Please explain AI"

prompt_template=f'''<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{input_text}<|im_end|>
<|im_start|>assistant
'''

In [32]:
parameters = {
                "max_new_tokens":128,
                "do_sample":True,
                "temperature":0.7,
                "top_p":0.95,
                "top_k":40,
                "repetition_penalty":1.1
            }

## Non-streaming

In [33]:
%%time
response = predictor.predict(
    {
        "inputs": prompt_template, 
         "parameters": parameters
    }
)
text = str(response, 'utf-8')
text

CPU times: user 10.5 ms, sys: 4.19 ms, total: 14.7 ms
Wall time: 6.42 s


'{"generated_text": "AI，全称为人工智能（Artificial Intelligence），是指计算机系统通过学习、自我优化和模拟人类思维方式来执行复杂任务的技术。AI可以从数据中学习并生成预测或决策，以实现各种应用场景，例如图像识别、语音合成、自然语言处理等。AI的主要目标是使计算机系统能够进行类似于人类智能的行动，包括"}'
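
The response body shown above is a JSON string with a `generated_text` field. To work with the text itself rather than the raw string, it can be parsed with the `json` module. This is a sketch that assumes the response keeps the shape shown above; `extract_generated_text` is a hypothetical helper.

```python
import json


def extract_generated_text(raw_response):
    """Return the generated_text field from a raw bytes or str response."""
    if isinstance(raw_response, (bytes, bytearray)):
        raw_response = raw_response.decode("utf-8")
    return json.loads(raw_response)["generated_text"]
```

With the cell above, `extract_generated_text(response)` would return only the generated text.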

## Streaming

In [34]:
import json
import boto3

smr_client = boto3.client("sagemaker-runtime")

In [35]:
def get_realtime_response_stream(sagemaker_runtime, endpoint_name, payload):
    response_stream = sagemaker_runtime.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(payload), 
        ContentType="application/json",
        CustomAttributes='accept_eula=false'
    )
    return response_stream

In [36]:
payload = {
    "inputs":  prompt_template,
    "parameters": parameters,
    "stream": True ## <-- to have response stream.
}


In [43]:
from utils.LineIterator import LineIterator

def print_response_stream(response_stream):
    event_stream = response_stream.get('Body')
    for line in LineIterator(event_stream):
        print(line, end='')
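
The `LineIterator` helper is imported from a local `utils` package that is not shown in this notebook. As a rough stand-in, the sketch below buffers the `PayloadPart` bytes from the event stream and yields complete lines. `iter_stream_lines` is a hypothetical name, and the real helper may differ, for example by also parsing each line and yielding only the token text.

```python
def iter_stream_lines(event_stream):
    """Yield decoded lines from a SageMaker response event stream.

    Chunks can end in the middle of a line or of a multi-byte UTF-8
    character, so bytes are buffered and only whole lines are decoded.
    """
    buffer = b""
    for event in event_stream:
        part = event.get("PayloadPart")
        if not part:
            continue  # ignore events that carry no payload
        buffer += part["Bytes"]
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            yield line.decode("utf-8")
    if buffer:
        # Flush whatever remains when the stream ends without a newline.
        yield buffer.decode("utf-8")
```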

In [45]:
response_stream = get_realtime_response_stream(smr_client, endpoint_name, payload)
print_response_stream(response_stream)

AI，全称为人工智能（Artificial Intelligence），是一个广泛的研究领域和技术实践。它涉及在计算机系统中模拟、扩展和创造智能行为和决策能力。

AI的目标是使计算机系统能够像人类一样学习、理解、推理和进行决策。这意味着AI可以处理复杂的任务、自动化过程、提高效率和降低成本

## Clean up resources

In [24]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()