# DeepSeek-R1 671B dynamic quants on SageMaker endpoint

The original version of DeepSeek R1 is an FP8 model with 671B parameters, which requires larger GPU instances (such as p5en type) for deployment. 

When such instances are not available, dynamic quantization can reduce the memory footprint enough to deploy on g5, g6, and similar instance types. Following the Unsloth technical blog post ([https://unsloth.ai/blog/deepseekr1-dynamic](https://unsloth.ai/blog/deepseekr1-dynamic)), this notebook deploys the dynamically quantized DeepSeek-R1 671B model on a SageMaker endpoint.
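As a rough sanity check on why quantization matters, the weight footprint can be estimated from the average bits per weight of the 1.58-bit and 2.51-bit variants used later in this notebook. The sketch below assumes 24 GB per GPU on the 8-GPU g5/g6 48xlarge sizes and 48 GB per GPU on g6e.48xlarge (verify against the current AWS specifications). It ignores the KV cache and runtime buffers, so the numbers are lower bounds, not exact file sizes.

```python
# Back-of-the-envelope weight footprint for a 671B-parameter model.
# GPU memory totals are assumptions (8 GPUs per instance), not values from the notebook.
PARAMS = 671e9
GPU_MEMORY_GB = {
    "ml.g5.48xlarge": 8 * 24,
    "ml.g6.48xlarge": 8 * 24,
    "ml.g6e.48xlarge": 8 * 48,
}


def weight_size_gb(bits_per_weight, params=PARAMS):
    """Approximate size of the weights in GB (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


for bits in (1.58, 2.51):
    size = weight_size_gb(bits)
    fits = [name for name, mem in GPU_MEMORY_GB.items() if size < mem]
    print(f"{bits} bit: ~{size:.0f} GB of weights, fits in total GPU memory on: {fits}")
```

Under these assumptions the 1.58-bit weights fit on all three instance types, while the 2.51-bit weights only fit on g6e.48xlarge, which matches the instance choices in the next section.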

## 1. Define some variables

Inference for this model runs on llama.cpp. Deploying it on a SageMaker endpoint requires BYOC (bring your own container).

First, build a llama.cpp endpoint Docker image and store it in your private ECR repository (for example `sagemaker_endpoint/llama.cpp`). Start by defining the following variables.


**⚠️ For China regions, make sure the Docker image `ghcr.io/ggerganov/llama.cpp:server-cuda` is accessible.**

In [1]:
MODEL_ID = "unsloth/DeepSeek-R1-GGUF"
QUANT_TYPE = "DeepSeek-R1-UD-IQ1_S" # 1.58 bit
INSTANCE_TYPE = "ml.g5.48xlarge"
# INSTANCE_TYPE = "ml.g6.48xlarge"

# QUANT_TYPE = "DeepSeek-R1-UD-Q2_K_XL"  # 2.51 bit, better quality, not supported on g5/g6 instances
# INSTANCE_TYPE = "ml.g6e.48xlarge"

REPO_NAMESPACE = "sagemaker_endpoint/llama.cpp"
REPO_TAG = "server-cuda"
ACCOUNT = !aws sts get-caller-identity --query Account --output text
REGION = !aws configure get region
ACCOUNT = ACCOUNT[0]
REGION = REGION[0]


if REGION.startswith("cn"):
    CONTAINER = f"{ACCOUNT}.dkr.ecr.{REGION}.amazonaws.com.cn/{REPO_NAMESPACE}:{REPO_TAG}"
else:
    CONTAINER = f"{ACCOUNT}.dkr.ecr.{REGION}.amazonaws.com/{REPO_NAMESPACE}:{REPO_TAG}"
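The region branch above can also be written as a small helper, which makes the China-region suffix easier to test in isolation. This is only a sketch: `ecr_image_uri` is a hypothetical name and the account ID is a placeholder.

```python
def ecr_image_uri(account, region, namespace, tag):
    """Build a private ECR image URI; China regions use the .com.cn suffix."""
    suffix = "amazonaws.com.cn" if region.startswith("cn") else "amazonaws.com"
    return f"{account}.dkr.ecr.{region}.{suffix}/{namespace}:{tag}"


# Placeholder account ID, for illustration only.
print(ecr_image_uri("123456789012", "us-west-2", "sagemaker_endpoint/llama.cpp", "server-cuda"))
print(ecr_image_uri("123456789012", "cn-north-1", "sagemaker_endpoint/llama.cpp", "server-cuda"))
```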

## 2. Build the container

The endpoint startup code is in `app/`. The `build_and_push.sh` script builds the image and pushes it to ECR.

**The Docker image only needs to be built once.** Later endpoint deployments can share the same image.

In [None]:
cmd = f"REPO_TAG={REPO_TAG} REPO_NAMESPACE={REPO_NAMESPACE} ACCOUNT={ACCOUNT} REGION={REGION} bash ./build_and_push.sh"
print("Running:", cmd)
!{cmd}

## 3. Deploy on SageMaker

Define the model and deploy it on SageMaker.


In [3]:
%pip install -U boto3 sagemaker transformers huggingface_hub hf_transfer

Note: you may need to restart the kernel to use updated packages.


### 3.1 Init SageMaker session

In [4]:
import os
import re
import glob
import json
from datetime import datetime
import time

import boto3
import sagemaker

sess = sagemaker.Session()
role = sagemaker.get_execution_role()
default_bucket = sess.default_bucket()

sagemaker_client = boto3.client("sagemaker")



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


### 3.2 Download and upload model file

You need to prepare model weights and upload to S3. You can download from [https://huggingface.co/unsloth/DeepSeek-R1-GGUF](https://huggingface.co/unsloth/DeepSeek-R1-GGUF). 

In [6]:
model_name = MODEL_ID.replace("/", "-").replace(".", "-")
local_repo_path = os.environ['HOME'] + "/models/" + model_name

s3_model_path = f"s3://{default_bucket}/models/{model_name}/{QUANT_TYPE}"

# local_model_path is only defined in the next cell, so create the repo directory here.
%mkdir -p {local_repo_path}

print("local_repo_path:", local_repo_path)

local_repo_path: /home/ec2-user/models/unsloth-DeepSeek-R1-GGUF
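The `model_name` derived above is reused later as the base for the SageMaker model, endpoint-config, and endpoint names, which is why `/` and `.` are replaced with `-`. The sketch below checks a candidate name against the naming rule as I understand it (alphanumerics and hyphens, no leading or trailing hyphen, at most 63 characters). The pattern is an assumption to confirm against the SageMaker API reference, and both helper names are hypothetical.

```python
import re

# Assumed naming rule for SageMaker resources; confirm against the API reference.
_NAME_RE = re.compile(r"^[a-zA-Z0-9]([-a-zA-Z0-9]*[a-zA-Z0-9])?$")


def to_resource_name(model_id):
    """Same normalization as in the cell above: '/' and '.' become '-'."""
    return model_id.replace("/", "-").replace(".", "-")


def is_valid_resource_name(name, max_len=63):
    """True if the name fits the assumed character and length constraints."""
    return len(name) <= max_len and _NAME_RE.match(name) is not None


name = to_resource_name("unsloth/DeepSeek-R1-GGUF")
print(name, is_valid_resource_name(name))
```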


Download the dynamic quant model

In [7]:
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

if REGION.startswith("cn"):
    # if you are in China region, use a mirror of huggingface
    os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

from huggingface_hub import snapshot_download
snapshot_download(
  repo_id = MODEL_ID,
  local_dir = local_repo_path,
  allow_patterns = [f"*{QUANT_TYPE}*"],
)

local_model_path = f"{local_repo_path}/{QUANT_TYPE}"
llamma_cpp_model_name = glob.glob(f"{local_model_path}/*00001-of-*.gguf")[0].split("/")[-1]
print("model downloaded to", local_model_path)
print("llama.cpp model", llamma_cpp_model_name)

model downloaded to /home/ec2-user/models/unsloth-DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S
llama.cpp model DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf
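The model is split into several GGUF shards, and llama.cpp is pointed at the first one, so an incomplete download tends to surface only when the server starts. A quick check that every shard is present could look like the sketch below. `missing_shards` is a hypothetical helper, and it assumes the `-NNNNN-of-NNNNN.gguf` naming seen in the output above.

```python
import re

# Matches names such as DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf
_SHARD_RE = re.compile(r"^(?P<stem>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.gguf$")


def missing_shards(filenames):
    """Return the shard file names that are expected but not present."""
    present = set(filenames)
    missing = set()
    for name in filenames:
        m = _SHARD_RE.match(name)
        if not m:
            continue  # not a split GGUF file
        total = int(m["total"])
        for i in range(1, total + 1):
            expected = f"{m['stem']}-{i:05d}-of-{total:05d}.gguf"
            if expected not in present:
                missing.add(expected)
    return sorted(missing)


# Example with one shard missing.
print(missing_shards([
    "DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
    "DeepSeek-R1-UD-IQ1_S-00003-of-00003.gguf",
]))
```

In the notebook this could be called as `missing_shards(os.listdir(local_model_path))` before uploading to S3.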


#### Upload to S3

In [8]:
!aws s3 sync {local_model_path} {s3_model_path}
print("s3_model_path:", s3_model_path)

s3_model_path: s3://sagemaker-us-west-2-236995464743/models/unsloth-DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S


### 3.3 Prepare llama.cpp start scripts

Next, write the llama.cpp startup script for the endpoint. The container automatically uses `start.sh` as the entrypoint.

Adjust the startup script as needed, for example the model runtime parameters. All parameters are documented at [https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md](https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md)

Here is a simple script that pulls the model from S3 and starts a llama.cpp server.

In [9]:
endpoint_model_name = sagemaker.utils.name_from_base(model_name, short=True)
local_code_path = endpoint_model_name
s3_code_path = f"s3://{default_bucket}/endpoint_code/llamacpp_byoc/{endpoint_model_name}.tar.gz"

%mkdir -p {local_code_path}

print("local_code_path:", local_code_path)

with open(f"{local_code_path}/start.sh", "w") as f:
    # The shebang must be the first line of the script, so the string starts without a newline.
    f.write(f"""#!/bin/bash

# download model to local
s5cmd sync --concurrency 64 \
    \"{s3_model_path}/*\" /temp/{model_name}/{QUANT_TYPE}

/app/llama-server \
    --host 0.0.0.0  --port 8000 \
    -m /temp/{model_name}/{QUANT_TYPE}/{llamma_cpp_model_name} \
    --n-gpu-layers 62 --tensor-split 8,7,8,8,8,8,7,8 \
    -ctk q4_0 \
    --ctx-size 10240 --parallel 2 --batch-size 32 \
    --threads 96 --prio 2 --temp 0.6 --top-p 0.95
""")

local_code_path: unsloth-DeepSeek-R1-GGUF-250207-0646
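The `--tensor-split` values in the script sum to 62, the same number passed to `--n-gpu-layers`, so each entry can be read as roughly the number of layers placed on one of the 8 GPUs (llama.cpp treats the values as proportions). When adapting the script to a different GPU count, a starting point can be computed as below. `even_tensor_split` is a hypothetical helper, and the notebook's own split places the two lighter shares on different GPUs than this sketch does.

```python
def even_tensor_split(n_layers, n_gpus):
    """Spread n_layers across n_gpus as evenly as possible."""
    base, extra = divmod(n_layers, n_gpus)
    # The first `extra` GPUs take one additional layer.
    return [base + 1 if i < extra else base for i in range(n_gpus)]


split = even_tensor_split(62, 8)
print("--tensor-split", ",".join(map(str, split)))
```

An even split is only a first guess. Uneven shares, as in the script above, may leave more headroom on specific GPUs for the KV cache and other buffers.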


In [10]:
!rm -f {local_code_path}.tar.gz
!tar czvf {local_code_path}.tar.gz {local_code_path}/
!aws s3 cp {local_code_path}.tar.gz {s3_code_path}
print("s3_code_path:", s3_code_path)

unsloth-DeepSeek-R1-GGUF-250207-0646/
unsloth-DeepSeek-R1-GGUF-250207-0646/start.sh
upload: ./unsloth-DeepSeek-R1-GGUF-250207-0646.tar.gz to s3://sagemaker-us-west-2-236995464743/endpoint_code/llamacpp_byoc/unsloth-DeepSeek-R1-GGUF-250207-0646.tar.gz
s3_code_path: s3://sagemaker-us-west-2-236995464743/endpoint_code/llamacpp_byoc/unsloth-DeepSeek-R1-GGUF-250207-0646.tar.gz


### 3.4 Deploy endpoint on SageMaker

In [11]:
# Step 0. create model

# endpoint_model_name already defined in above step

create_model_response = sagemaker_client.create_model(
    ModelName=endpoint_model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": CONTAINER,
        "ModelDataUrl": s3_code_path
    },
    
)
print(create_model_response)
print("endpoint_model_name:", endpoint_model_name)

{'ModelArn': 'arn:aws:sagemaker:us-west-2:236995464743:model/unsloth-DeepSeek-R1-GGUF-250207-0646', 'ResponseMetadata': {'RequestId': '6e7b3ea5-ce41-4ceb-b489-75d32ed0516f', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '6e7b3ea5-ce41-4ceb-b489-75d32ed0516f', 'content-type': 'application/x-amz-json-1.1', 'content-length': '98', 'date': 'Fri, 07 Feb 2025 06:46:41 GMT'}, 'RetryAttempts': 0}}
endpoint_model_name: unsloth-DeepSeek-R1-GGUF-250207-0646


In [12]:
# Step 1. create endpoint config

endpoint_config_name = sagemaker.utils.name_from_base(model_name, short=True)

endpoint_config_response = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": endpoint_model_name,
            "InstanceType": INSTANCE_TYPE,
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 1000,
            # "EnableSSMAccess": True,
        },
    ],
)
print(endpoint_config_response)
print("endpoint_config_name:", endpoint_config_name)

{'EndpointConfigArn': 'arn:aws:sagemaker:us-west-2:236995464743:endpoint-config/unsloth-DeepSeek-R1-GGUF-250207-0646', 'ResponseMetadata': {'RequestId': '15606040-3c5c-4438-acf2-b6f92a582f6d', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '15606040-3c5c-4438-acf2-b6f92a582f6d', 'content-type': 'application/x-amz-json-1.1', 'content-length': '117', 'date': 'Fri, 07 Feb 2025 06:46:48 GMT'}, 'RetryAttempts': 0}}
endpoint_config_name: unsloth-DeepSeek-R1-GGUF-250207-0646


In [13]:
# Step 2. create endpoint

endpoint_name = sagemaker.utils.name_from_base(model_name, short=True)

create_endpoint_response = sagemaker_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
print(create_endpoint_response)
print("endpoint_name:", endpoint_name)
# Poll once a minute until the endpoint leaves the "Creating" state.
while True:
    status = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
    if status != "Creating":
        break
    print(datetime.now().strftime('%Y%m%d-%H:%M:%S') + " status: " + status)
    time.sleep(60)
# "Failed" also ends the loop, so confirm that the endpoint is actually in service.
if status != "InService":
    raise RuntimeError(f"Endpoint {endpoint_name} ended in status {status}")
print("Endpoint created:", endpoint_name)

{'EndpointArn': 'arn:aws:sagemaker:us-west-2:236995464743:endpoint/unsloth-DeepSeek-R1-GGUF-250207-0646', 'ResponseMetadata': {'RequestId': '203995d4-4d1e-401b-a74f-faef4b96c7cb', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '203995d4-4d1e-401b-a74f-faef4b96c7cb', 'content-type': 'application/x-amz-json-1.1', 'content-length': '104', 'date': 'Fri, 07 Feb 2025 06:46:50 GMT'}, 'RetryAttempts': 0}}
endpoint_name: unsloth-DeepSeek-R1-GGUF-250207-0646
20250207-06:46:50 status: Creating
20250207-06:47:51 status: Creating
20250207-06:48:51 status: Creating
20250207-06:49:51 status: Creating
20250207-06:50:51 status: Creating
20250207-06:51:51 status: Creating
Endpoint created: unsloth-DeepSeek-R1-GGUF-250207-0646
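The polling loop above has no upper bound on how long it waits. A bounded variant might look like the following sketch. `wait_for_status` is a hypothetical helper that takes any zero-argument callable returning the current status, so it can be exercised without AWS access. The default timeout and interval are arbitrary choices.

```python
import time


def wait_for_status(get_status, pending=("Creating",), timeout_s=3600,
                    interval_s=60, sleep=time.sleep):
    """Poll get_status() until it leaves the pending states or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status not in pending:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status} after {timeout_s} seconds")
        sleep(interval_s)


# Offline demonstration with a canned sequence of statuses.
statuses = iter(["Creating", "Creating", "InService"])
print(wait_for_status(lambda: next(statuses), sleep=lambda _: None))

# In the notebook, the callable would wrap describe_endpoint:
# status = wait_for_status(
#     lambda: sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
# )
```

boto3 also provides an `endpoint_in_service` waiter on the SageMaker client, which may be simpler when custom progress output is not needed.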


## 4. Test

You can invoke your model with SageMaker runtime.

In [14]:
messages = [{
        "role": "user",
        "content": "Hi, who are you?"
}]

### 4.1 Message API non-stream mode

In [15]:
sagemaker_runtime = boto3.client('runtime.sagemaker')

payload = {
    "messages": messages,
    "max_tokens": 1024,
    "stream": False
}
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

print(json.loads(response['Body'].read())["choices"][0]["message"]["content"])

<think>

</think>

Hi! I'm DeepSeek-R1, an AI assistant from DeepSeek. I'm glad to interact with you!
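As the output shows, the response text carries the model's reasoning inside `<think>...</think>` before the final answer. If only the answer is needed downstream, the two parts can be separated with a helper like the sketch below. `split_reasoning` is a hypothetical name, and the code assumes at most one reasoning block at the start of the text.

```python
def split_reasoning(text):
    """Split '<think>...</think>answer' into (reasoning, answer)."""
    open_tag, close_tag = "<think>", "</think>"
    head, sep, tail = text.partition(close_tag)
    if not sep:
        # No closing tag: treat the whole text as the answer.
        return "", text.strip()
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


reasoning, answer = split_reasoning("<think>\n\n</think>\n\nHi! I'm DeepSeek-R1.")
print(repr(reasoning), repr(answer))
```

Only the closing tag is required, so the helper also handles responses where the opening `<think>` was emitted as part of the prompt template.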


### 4.2 Message API stream mode

In [16]:
payload = {
    "messages": messages,
    "max_tokens": 1024,
    "stream": True
}

response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

buffer = ""
for t in response['Body']:
    buffer += t["PayloadPart"]["Bytes"].decode()
    last_idx = 0
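    # re.MULTILINE lets ^ match at every line start, so several events in one chunk are all handled.
    for match in re.finditer(r'^data:\s*(.+?)(\n\n)', buffer, re.MULTILINE):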
        try:
            data = json.loads(match.group(1).strip())
            last_idx = match.span()[1]
            print(data["choices"][0]["delta"]["content"], end="")
        except (json.JSONDecodeError, KeyError, IndexError) as e:
            pass
    buffer = buffer[last_idx:]
print()

<think>

</think>

Hi! I'm DeepSeek-R1, an AI assistant from DeepSeek. I'm at your service.
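The streaming loop above decodes each chunk separately and extracts events with a regular expression. Two edge cases are worth keeping in mind: a multi-byte UTF-8 character (common in Chinese output) can be split across two chunks, and one chunk can carry several events. The sketch below handles both with an incremental decoder. `iter_sse_json` is a hypothetical helper, and the `[DONE]` sentinel is assumed to follow the OpenAI-style streaming convention.

```python
import codecs
import json


def iter_sse_json(byte_chunks):
    """Yield parsed JSON objects from server-sent 'data:' events."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in byte_chunks:
        # The incremental decoder keeps incomplete multi-byte sequences for the next chunk.
        buffer += decoder.decode(chunk)
        while "\n\n" in buffer:
            event, buffer = buffer.split("\n\n", 1)
            for line in event.split("\n"):
                if not line.startswith("data:"):
                    continue
                data = line[len("data:"):].strip()
                if data == "[DONE]":
                    return
                try:
                    yield json.loads(data)
                except json.JSONDecodeError:
                    continue  # skip payloads that are not JSON


# Offline demonstration: a 3-byte character split across two chunks,
# followed by a second event in the same chunk.
chunks = [
    b'data: {"choices":[{"delta":{"content":"\xe4',
    b'\xbd\xa0"}}]}\n\ndata: {"choices":[{"delta":{"content":"!"}}]}\n\ndata: [DONE]\n\n',
]
for event in iter_sse_json(chunks):
    print(event["choices"][0]["delta"].get("content", ""), end="")
print()
```

With the SageMaker runtime response, the input could be the generator `(part["PayloadPart"]["Bytes"] for part in response["Body"])`.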


### 4.3 Completion API non-stream mode

In [17]:
from transformers import AutoTokenizer
hf_model_id = "deepseek-ai/DeepSeek-R1"
tokenizer = AutoTokenizer.from_pretrained(hf_model_id, trust_remote_code=True)

prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

payload = {
    "prompt": prompt,
    "max_tokens": 1024,
    "stream": False
}

response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

print(json.loads(response['Body'].read())["choices"][0]["text"])

<think>

</think>

Hi! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your disposal. Feel free to ask me anything, and I'll do my best to provide effective assistance.


### 4.4 Completion API stream mode

In [18]:
payload = {
    "prompt": prompt,
    "max_tokens": 1024,
    "stream": True
}

response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

buffer = ""
for t in response['Body']:
    buffer += t["PayloadPart"]["Bytes"].decode()
    last_idx = 0
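    # re.MULTILINE lets ^ match at every line start, so several events in one chunk are all handled.
    for match in re.finditer(r'^data:\s*(.+?)(\n\n)', buffer, re.MULTILINE):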
        try:
            data = json.loads(match.group(1).strip())
            last_idx = match.end()
            # print(data)
            print(data["choices"][0]["text"], end="")
        except (json.JSONDecodeError, KeyError, IndexError) as e:
            pass
    buffer = buffer[last_idx:]
print()

<think>

</think>

Hi! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your disposal. Feel free to ask me anything, and I'll do my best to help you.


### 4.5 Speed test

In [24]:
from transformers import AutoTokenizer
hf_model_id = "deepseek-ai/DeepSeek-R1"
tokenizer = AutoTokenizer.from_pretrained(hf_model_id, trust_remote_code=True)

sagemaker_runtime = boto3.client('runtime.sagemaker')

messages = [{
        "role": "user",
        # English: "Write me a seven-character regulated verse (qiyan lüshi) introducing Shanghai"
        "content": "帮我写一首七言律诗介绍上海"
}]

payload = {
    "messages": messages,
    "max_tokens": 4096,
    "temperature": 0.0,
    "stream": True
}

# Start the clock before sending the request so the latency includes the request round trip.
time_start = time.time()
response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(payload)
)

buffer = ""
first_token_latency = 0
output = []
for t in response['Body']:
    buffer += t["PayloadPart"]["Bytes"].decode()
    last_idx = 0
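    # re.MULTILINE lets ^ match at every line start, so several events in one chunk are all handled.
    for match in re.finditer(r'^data:\s*(.+?)(\n\n)', buffer, re.MULTILINE):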
        try:
            data = json.loads(match.group(1).strip())
            last_idx = match.span()[1]
            if first_token_latency == 0:
                first_token_latency = time.time() - time_start
            print(data["choices"][0]["delta"]["content"], end="")
            output.append(data["choices"][0]["delta"]["content"])
        except (json.JSONDecodeError, KeyError, IndexError) as e:
            pass
    buffer = buffer[last_idx:]


total_time = time.time() - time_start

# Count output tokens from a plain list (no PyTorch needed), excluding any special tokens the tokenizer would add.
num_tokens = len(tokenizer("".join(output), add_special_tokens=False).input_ids)

print("\n" + "=" * 50)
print(f"First token latency {first_token_latency:.3} seconds")
print(f"Output speed {num_tokens/(total_time-first_token_latency):.3} tokens/second")
print("=" * 50)

<think>
嗯，用户让我帮忙写一首七言律诗来介绍上海。首先，我需要考虑用户的需求是什么。可能他们需要一首符合传统格律的诗，用来介绍上海的城市特色。七言律诗通常有八句，每句七个字，平仄对仗都要注意。所以我要确保每句七个字，押韵，对仗工整。

接下来，我需要思考上海的特点。上海是国际大都市，有外滩、东方明珠、黄浦江这些地标。还有历史与现代的结合，比如弄堂里的传统和陆家嘴的现代建筑。交通方面，地铁发达，高架桥很多。另外，上海的夜景很美，灯火辉煌，商业繁华，比如南京路步行街。还有上海的历史，比如作为昔日的渔村，后来成为国际港口，有海纳百川的包容精神。

然后，我需要把这些元素融入诗中。第一句可能提到上海的繁华，比如“申城锦绣”。然后提到外滩和东方明珠，黄浦江的潮水。接着对比弄堂里的传统和陆家嘴的现代高楼。然后讲交通便利，地铁和高架。再写夜景，灯火如星雨，商业区如南京路。最后总结历史变迁，从渔村到国际都市，体现包容精神。

还要注意平仄和对仗。比如“申城锦绣映重霄”对“外滩明珠黄浦潮”，结构上要对应。中间两联要对仗，比如“弄巷犹存风雅韵”对“云楼已矗霁虹桥”，这里“弄巷”和“云楼”都是名词，“风雅韵”和“霁虹桥”也是结构相似。

最后两句要总结，提到历史变迁和包容精神。押韵方面，通常律诗押平声韵，一韵到底。检查韵脚是否正确，比如“霄、潮、韵、桥、雨、遥、标”，可能需要调整确保押同一个韵，比如“霄、潮、韵”可能不在同一韵部，可能需要调整韵脚，使其统一。

可能用户希望诗既有传统韵味，又能体现现代上海的特点，所以需要平衡古典诗歌形式和现代内容。还要避免生硬堆砌，让诗句流畅自然，有画面感。最后检查字数、对仗、押韵是否符合七律的要求，确保整体结构严谨，意境连贯。
</think>

好的，我来创作一首介绍上海的七言律诗，并附上赏析：

《咏上海》
申城锦绣映重霄，外滩明珠黄浦潮。
弄巷犹存风雅韵，云楼已矗霁虹桥。
千街灯火如星雨，十里商圈胜市朝。
昔日渔村今海港，五洲潮涌一襟遥。

赏析：
这首七言律诗通过生动的意象展现了上海古今交融的都市风貌。首联“申城锦绣映重霄”以璀璨的灯火勾勒出城市天际线的壮丽，“外滩明珠黄浦潮”则巧妙点出外滩与黄浦江的经典地标。颔联“弄巷犹存风雅韵”体现老城厢的历史底蕴，“云楼已矗霁虹桥”展现陆家嘴的现代天际线。颈联描绘南京路等商圈如星雨般绚丽的夜景与繁