# deepseek-ai/deepseek-coder-6.7b-base LMI deployment guide
In this tutorial, you will deploy a Large Model Inference (LMI) container from the AWS Deep Learning Containers (DLC) to SageMaker and run inference against it.

Make sure the following permissions are granted before running the notebook:

- S3 bucket push access
- SageMaker access

## Step 1: Upgrade the SageMaker SDK and import dependencies

In [None]:
%pip install sagemaker --upgrade  --quiet
%pip install boto3

In [None]:
%pip show sagemaker

In [None]:
import boto3
import sagemaker
from sagemaker import Model, image_uris, serializers, deserializers

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
region = sess.boto_region_name  # region name of the current SageMaker Studio environment (public attribute instead of the private _region_name)
account_id = sess.account_id()  # account_id of the current SageMaker Studio environment
sagemaker_default_bucket = sess.default_bucket()

## Step 2: Start preparing model artifacts
The LMI container expects the following artifacts to help set up the model:
- serving.properties (required): defines the model server settings
- model.py (optional): a Python file that defines the core inference logic
- requirements.txt (optional): any additional pip packages to install
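
As a rough illustration of what `serving.properties` contains, the sketch below renders a dictionary of settings into `key=value` lines. The helper name `render_serving_properties` is hypothetical and not part of any library; the keys and values simply mirror the ones written later in this guide.

```python
def render_serving_properties(settings: dict) -> str:
    """Render settings as the text of a serving.properties file (one key=value per line)."""
    return "".join(f"{key}={value}\n" for key, value in settings.items())


# Example values mirror the settings used later in this guide.
example = render_serving_properties(
    {
        "engine": "Python",
        "option.model_id": "deepseek-ai/deepseek-coder-6.7b-base",
        "option.rolling_batch": "vllm",
        "option.max_model_len": 2048,
    }
)
print(example)
```

Keeping the settings in a dictionary makes it easier to switch `option.model_id` between a Hugging Face model ID and an S3 URL without rewriting the file by hand.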

In [None]:
model_name="deepseek-ai/deepseek-coder-6.7b-base"
model_lineage=model_name.split("/")[0]
model_specific_name = model_name.split("/")[1]
cmd=f's/option.model_id=model_name/option.model_id={model_lineage}\/{model_specific_name}/g'
print(cmd)

#### Option 1 (Global region)

- The cell below points `option.model_id` at the Hugging Face model ID. It is better to pre-download the model, upload it to S3, and then use the S3 URL for deployment.

In [None]:

with open("serving.properties", "w") as wf:
    wf.write("engine=Python\n")
    wf.write(f"option.model_id={model_name}\n")
    wf.write("option.rolling_batch=vllm\n")
    wf.write("option.max_model_len=2048\n")

#### Option 2 (China region)

1. Download the model from Hugging Face
2. Upload the model to an S3 bucket
3. Write serving.properties using the S3 URL

In [None]:
!pip install -U huggingface_hub

In [None]:
import os
from pathlib import Path

os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

# Local directory that will hold the downloaded model files
local_model_path = Path(model_specific_name)
local_model_path.mkdir(exist_ok=True)

s3_model_prefix = f"lmi/{model_specific_name}"
s3url=f"s3://{sagemaker_default_bucket}/{s3_model_prefix}"

In [None]:
!huggingface-cli download --resume-download {model_name} --local-dir {local_model_path}

In [None]:
# Upload model to S3
!aws s3 cp {local_model_path} {s3url} --recursive

In [None]:
with open("serving.properties", "w") as wf:
    wf.write("engine=Python\n")
    wf.write(f"option.model_id={s3url}\n")
    wf.write("option.rolling_batch=vllm\n")
    wf.write("option.max_model_len=2048\n")

### Compress model artifacts

In [None]:
%%sh
mkdir mymodel
mv serving.properties mymodel/
tar czvf mymodel.tar.gz mymodel/
rm -rf mymodel
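
If a shell is not available, the same packaging step can be approximated in pure Python with the standard `tarfile` module. This is a minimal sketch under stated assumptions: the helper name `package_model_dir` is hypothetical, and unlike the shell cell it writes the properties text itself and leaves the `mymodel` directory in place.

```python
import tarfile
import tempfile
from pathlib import Path


def package_model_dir(properties_text: str, work_dir: Path,
                      archive_name: str = "mymodel.tar.gz") -> Path:
    """Write serving.properties into work_dir/mymodel and gzip-tar that directory."""
    model_dir = work_dir / "mymodel"
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "serving.properties").write_text(properties_text)

    archive_path = work_dir / archive_name
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps the top-level folder name "mymodel" inside the archive
        tar.add(model_dir, arcname="mymodel")
    return archive_path


# Demonstrate in a temporary directory so no files are left behind.
with tempfile.TemporaryDirectory() as tmp:
    archive = package_model_dir("engine=Python\n", Path(tmp))
    with tarfile.open(archive) as tar:
        members = tar.getnames()

print(members)
```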

## Step 3: Start building SageMaker endpoint
In this step, we will create a SageMaker endpoint from scratch.

### Getting the container image URI

[Available Large Model Inference DLC images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers)


In [None]:
image_uri = image_uris.retrieve(
        framework="djl-lmi",
        region=sess.boto_session.region_name,
        version="v0.28.0"
    )

### Upload the artifact to S3 and create the SageMaker model

In [None]:
s3_code_prefix = f"large-model-lmi/code-{model_lineage}-{model_specific_name}"
bucket = sess.default_bucket()  # bucket to house artifacts
code_artifact = sess.upload_data("mymodel.tar.gz", bucket, s3_code_prefix)
print(f"S3 Code or Model tar ball uploaded to --- > {code_artifact}")

model = Model(image_uri=image_uri, model_data=code_artifact, role=role)

## Step 4: Create SageMaker endpoint

You need to specify the instance type and the endpoint name.

In [None]:
instance_type = "ml.g5.2xlarge"
endpoint_name = sagemaker.utils.name_from_base(f"{model_lineage}-{model_specific_name.replace('.','-')}")

model.deploy(initial_instance_count=1,
             instance_type=instance_type,
             endpoint_name=endpoint_name,
             # container_startup_health_check_timeout=3600
            )

# our requests and responses will be in json format so we specify the serializer and the deserializer
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sess,
    serializer=serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)
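
The endpoint name above replaces `.` with `-` because SageMaker endpoint names only accept alphanumeric characters and hyphens. The sketch below shows one way to sanitize and check a base name before deployment. The helper names are hypothetical, and the pattern and 63-character limit are taken from the SageMaker `CreateEndpoint` API reference as understood here, so treat them as assumptions to verify.

```python
import re

# Assumed constraints for EndpointName (verify against the CreateEndpoint API reference).
ENDPOINT_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9])*$")
MAX_ENDPOINT_NAME_LENGTH = 63


def sanitize_endpoint_base(base: str) -> str:
    """Replace every character that is not alphanumeric or '-' with '-', then trim hyphens."""
    cleaned = re.sub(r"[^a-zA-Z0-9-]", "-", base)
    return cleaned.strip("-")


def is_valid_endpoint_name(name: str) -> bool:
    """Check a candidate name against the assumed length and pattern constraints."""
    return len(name) <= MAX_ENDPOINT_NAME_LENGTH and bool(ENDPOINT_NAME_PATTERN.match(name))


base = sanitize_endpoint_base("deepseek-ai-deepseek-coder-6.7b-base")
print(base, is_valid_endpoint_name(base))
```

Note that `sagemaker.utils.name_from_base` appends a timestamp suffix, so the base name should leave room for it within the length limit.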

## Step 5: Test and benchmark the inference

### Standard schema
Ref: https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/lmi_input_output_schema.html

In [None]:
%pip install transformers -U

In [None]:
prompt = "Help me write a quicksort implementation"

parameters = {
        "max_new_tokens":1024, 
        "do_sample": True,
    }
response = predictor.predict(
    {"inputs": prompt, "parameters": parameters}
)
# text = str(response, "utf-8")
print(response)
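
The deserialized response is a Python object rather than raw text. As a hedged sketch, the helper below pulls the generated text out of it, assuming the response carries a `generated_text` field as described in the schema reference linked above. The helper name `extract_generated_text` and the list-handling branch are illustrative assumptions, not part of the SageMaker SDK.

```python
def extract_generated_text(response) -> str:
    """Return the generated text from a deserialized response, or a string fallback."""
    # Some schemas wrap the result in a list; unwrap the first element if so (assumption).
    if isinstance(response, list) and response:
        response = response[0]
    if isinstance(response, dict):
        # "generated_text" is the assumed field name from the LMI output schema.
        return response.get("generated_text", "")
    return str(response)


# Offline examples with hand-written responses (no endpoint call is made here).
print(extract_generated_text({"generated_text": "def quicksort(arr): ..."}))
print(extract_generated_text([{"generated_text": "wrapped"}]))
```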

### Streaming

In [None]:
import json
import boto3

smr_client = boto3.client("sagemaker-runtime")

In [None]:
import io
import json

class TokenIterator:
    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()

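            if line and line[-1] == ord("\n"):
                self.read_pos += len(line)
                full_line = line[:-1].decode("utf-8").strip()
                # Remove an optional "data:" prefix. str.lstrip("data:") strips any
                # leading 'd', 'a', 't', ':' characters rather than the prefix itself.
                if full_line.startswith("data:"):
                    full_line = full_line[len("data:"):].strip()
                if not full_line:
                    continue  # skip blank lines between events
                line_data = json.loads(full_line)
                return line_data["token"].get("text", "")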
            chunk = next(self.byte_iterator)
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])


def get_realtime_response_stream(sagemaker_runtime, endpoint_name, payload):
    response_stream = sagemaker_runtime.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(payload),
        ContentType="application/json",
        CustomAttributes='accept_eula=false'
    )
    return response_stream
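
To see how the line buffering behaves without calling an endpoint, the sketch below parses a hand-built list of events shaped like the `PayloadPart` chunks consumed above. The generator name `iter_token_texts` and the fake payloads are illustrative assumptions; real streams may also contain other event types that this sketch does not handle.

```python
import json


def iter_token_texts(events):
    """Yield token texts from newline-delimited JSON that may be split across chunks."""
    buffer = b""
    for event in events:
        buffer += event["PayloadPart"]["Bytes"]
        # Only complete lines are decoded, so a JSON object split across chunks is safe.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            text = line.decode("utf-8").strip()
            if text.startswith("data:"):
                text = text[len("data:"):].strip()
            if not text:
                continue
            yield json.loads(text)["token"].get("text", "")


# Fake events: the second JSON object is deliberately split across two chunks.
fake_events = [
    {"PayloadPart": {"Bytes": b'{"token": {"text": "def"}}\n{"token": {"te'}},
    {"PayloadPart": {"Bytes": b'xt": " quicksort"}}\n'}},
    {"PayloadPart": {"Bytes": b'data: {"token": {"text": "("}}\n'}},
]
tokens = list(iter_token_texts(fake_events))
print(tokens)
```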

#### Standard schema streaming 

In [None]:
prompt = "Help me write a quicksort implementation"
parameters = {
        "max_new_tokens":1024, 
        "do_sample": True,
    }

payload = {
    "inputs":  prompt,
    "parameters": parameters,
    "stream": True,  # request a streamed response
}
response_stream = get_realtime_response_stream(smr_client, endpoint_name, payload)
# print_response_stream(response_stream)
for token in TokenIterator(response_stream["Body"]):
    # pass
    print(token, end="", flush=True)

## Clean up the environment

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_name)
model.delete_model()