# Deploying DeepSeek-R1-Distill-Qwen-14B reasoning model on Amazon SageMaker with DJLServing DLC


❗This notebook runs well on an `ml.t3.medium` instance with the `PyTorch 2.2.0 Python 3.10 CPU optimized` kernel in **SageMaker Studio Classic** or the `Python3` kernel in **JupyterLab**.

# Set up Environment

In [None]:
%%capture --no-stderr

!pip install -U pip
!pip install -U "sagemaker>=2.237.3"
!pip install -U "transformers>=4.47.0"

In [None]:
import boto3
import sagemaker

aws_region = boto3.Session().region_name
sess = sagemaker.Session()
bucket = sess.default_bucket()

aws_region, bucket

## Upload the `model.tar.gz`

Create `model.tar.gz` and upload it to the default bucket of the SageMaker session.

In [None]:
%%sh

mkdir -p model
rm -rf model/*

# copy the custom inference script into the working directory
cp -rp ../python/code/* model/

rm -f model.tar.gz
tar --exclude "*/.ipynb_checkpoints*" -czvf model.tar.gz model/
tar -tvf model.tar.gz

model/
model/model.py
model/serving.properties
drwxr-xr-x sagemaker-user/users 0 2025-01-27 02:06 model/
-rw-r--r-- sagemaker-user/users 3944 2025-01-27 02:06 model/model.py
-rw-r--r-- sagemaker-user/users  123 2025-01-27 02:06 model/serving.properties
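
The same packaging step can also be done from Python, which may be handy where the `%%sh` magic is unavailable. The sketch below uses only the standard-library `tarfile` module; the helper name `build_model_archive` is hypothetical and not part of this notebook.

```python
import tarfile
from pathlib import Path


def build_model_archive(src_dir: str, out_path: str = "model.tar.gz") -> list[str]:
    """Pack src_dir into a gzip tarball, skipping Jupyter checkpoint folders."""

    def skip_checkpoints(member: tarfile.TarInfo):
        # Returning None from the filter drops the member from the archive.
        return None if ".ipynb_checkpoints" in member.name else member

    arcname = Path(src_dir).name  # keep the top-level folder name, e.g. "model"
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(src_dir, arcname=arcname, filter=skip_checkpoints)

    # Re-open the archive and list its members, like `tar -tvf`.
    with tarfile.open(out_path, "r:gz") as tar:
        return tar.getnames()
```

Calling `build_model_archive("model")` after the `cp` step should list the same three entries as the `tar` output above.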


In [None]:
model_name = 'deepseek-ai/DeepSeek-R1-Distill-Qwen-14B'

base_name = model_name.split('/')[-1].replace('.', '-').lower()
base_name

In [None]:
from sagemaker.s3 import S3Uploader

# upload model.tar.gz to s3
s3_model_uri = S3Uploader.upload(
    local_path="./model.tar.gz",
    desired_s3_uri=f"s3://{bucket}/{base_name}"
)

print(f"S3 Code or Model tar ball uploaded to:\n{s3_model_uri}")

In [None]:
!aws s3 ls {s3_model_uri}

## Deploy the `DeepSeek-R1-Distill-Qwen-14B` model to a SageMaker real-time endpoint

In [None]:
from sagemaker import image_uris

image_uri = image_uris.retrieve(
    framework="djl-lmi",
    region=aws_region,
    version="0.30.0",
    py_version="py311"
)

image_uri

In [None]:
import sagemaker
from sagemaker import Model
from sagemaker.utils import name_from_base

sm_model_name = name_from_base(base_name, short=True).lower()
role = sagemaker.get_execution_role()

model = Model(
    name=sm_model_name,
    image_uri=image_uri,
    model_data=s3_model_uri,
    role=role
)

In [None]:
%%time

from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
from sagemaker.utils import name_from_base

endpoint_name = name_from_base(base_name, short=True).lower()
instance_type = 'ml.g5.12xlarge'

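# A generic `Model` has no `predictor_cls`, so `deploy()` returns None here;
# a `Predictor` is created explicitly from the endpoint name in the next section.
_predictor = model.deploy(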
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type=instance_type,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

# Create a Predictor from the SageMaker endpoint name

In [None]:
from sagemaker import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

predictor = Predictor(
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

# Run Inference

### Standard schema

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
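# Raw string: in a normal string literal, the "\b" in "\boxed" is parsed as a backspace character.
prompt = r"Help me write a quick sort code. Please reason step by step, and put your final answer within \boxed{}."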

messages = [
    {"role": "user", "content": prompt}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

parameters = {
    "max_new_tokens": 1000,
}

response = predictor.predict(
    {"inputs": inputs, "parameters": parameters}
)

print(response["generated_text"])

Sure, I'd be happy to help you with that. Here's a quick step-by-step guide on how to write a quick sort code:


1. **Choose a pivot**: The pivot can be the first element, the last element, a random element, or the median. For simplicity, we'll choose the first element as the pivot.

2. **Partition the array**: Rearrange the array so that all elements with values less than the pivot come before the pivot, while all elements with values greater than the pivot come after it. After this step, the pivot is in its final position.

3. **Recursively apply the steps**: Recursively apply the above steps to the sub-array of elements with smaller values and separately to the sub-array of elements with greater values.

Here's a Python code for quick sort:

```python
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        less_than_pivot = [x for x in arr[1:] if x <= pivot]
        greater_than_pivot = [x for x in arr[1:] if x > pivot]
        return q

In [None]:
prompt = "Generate 10 random numbers. Please reason step by step."

messages = [
    {"role": "user", "content": prompt}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

parameters = {
    "max_new_tokens": 5000,
    "temperature": 0.6,
    "top_p": 0.9
}

response = predictor.predict(
    {"inputs": inputs, "parameters": parameters}
)

print(response["generated_text"])

Sure, I can help with that. Here's a Python code snippet that generates 10 random numbers:


```python
import random

random_numbers = [random.randint(0, 100) for _ in range(10)]
print(random_numbers)
```

This code works in the following steps:

1. Import the `random` module, which provides various functions for generating random numbers.

2. Define a list comprehension that generates 10 random integers between 0 and 100. The `random.randint(0, 100)` function generates a random integer between 0 and 100, and the `for _ in range(10)` part repeats this process 10 times.

3. Print the list of random numbers.

You can run this code in a Python environment to see the output.
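
The prompts above ask the model to reason step by step and, in one case, to put the final answer inside `\boxed{}`. Below is a minimal post-processing sketch for such responses. It assumes the R1-style convention of wrapping the reasoning in `<think>...</think>` tags; the outputs recorded in this notebook do not show those tags, so treat that as an assumption. The helper names `split_reasoning` and `extract_boxed` are hypothetical.

```python
from typing import Optional, Tuple


def split_reasoning(generated_text: str) -> Tuple[str, str]:
    """Return (reasoning, answer); reasoning is "" when no </think> tag is present."""
    head, sep, tail = generated_text.partition("</think>")
    if not sep:
        return "", generated_text.strip()
    # The opening tag may be missing when the chat template already emitted it.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, honoring nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
    return None  # unbalanced braces, e.g. output cut off by max_new_tokens
```

With a response dict from the standard schema, usage would look like `reasoning, answer = split_reasoning(response["generated_text"])`.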



### Message API

- Ref: https://docs.djl.ai/docs/serving/serving/docs/lmi/user_guides/chat_input_output_schema.html#message
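
As the calls in this notebook suggest, the two request shapes differ mainly in where the prompt and sampling options live. The standard schema takes a prompt string that was already passed through the chat template plus a nested `parameters` dict, while the Messages schema takes raw chat turns with options such as `max_tokens` at the top level. The following sketch only builds the request dictionaries; the helper names are hypothetical.

```python
from typing import Any


def standard_payload(inputs: str, max_new_tokens: int, **sampling: Any) -> dict:
    """Standard schema: a pre-templated prompt string plus a `parameters` dict."""
    return {
        "inputs": inputs,
        "parameters": {"max_new_tokens": max_new_tokens, **sampling},
    }


def messages_payload(prompt: str, max_tokens: int, **sampling: Any) -> dict:
    """Messages schema: chat turns, with sampling options at the top level."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        **sampling,
    }
```

Either dictionary can then be passed to `predictor.predict(...)` as in the surrounding cells.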

In [None]:
import json


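# Raw string: in a normal string literal, the "\b" in "\boxed" is parsed as a backspace character.
prompt = r"Help me write a quick sort code in python. Please reason step by step, and put your final answer within \boxed{}."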

messages = [
    {
        "role": "user",
        "content": prompt
    }
]

response = predictor.predict({
    "messages": messages,
    "max_tokens": 500
})

print(json.dumps(response, indent=2))

{
  "id": "chatcmpl-139695694322928",
  "object": "chat.completion",
  "created": 1737944189,
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Sure, I can help you with that. Here is a simple implementation of the Quick Sort algorithm in Python.\n\n```python\ndef partition(arr, low, high):\n    i = (low-1)\n    pivot = arr[high]\n    for j in range(low , high):\n        if arr[j] <= pivot:\n            i += 1\n            arr[i], arr[j] = arr[j], arr[i]\n    arr[i+1], arr[high] = arr[high], arr[i+1]\n    return i+1\n\ndef quick_sort(arr, low, high):\n    if low < high:\n        pi = partition(arr, low, high)\n        quick_sort(arr, low, pi-1)\n        quick_sort(arr, pi+1, high)\n\n# Test the code\narr = [10, 7, 8, 9, 1, 5]\nn = len(arr)\nquick_sort(arr, 0, n-1)\nprint(\"Sorted array is:\", arr)\n```\n\nIn this code, the `partition` function takes the last element as pivot, places the pivot element at its correct position in th

In [None]:
prompt = "Generate 10 random numbers. Please reason step by step."

messages = [
    {
        "role": "user",
        "content": prompt
    }
]

response = predictor.predict({
    "messages": messages,
    "max_tokens": 5000,
    "temperature": 0.6,
    "top_p": 0.9,
})

print(response['choices'][0]['message']['content'])

Sure, I can help with that. Here is a step-by-step process to generate 10 random numbers using Python.

1. **Import the random module**: The random module in Python provides various functions to generate random numbers.

```python
import random
```

2. **Use the randint() function**: The randint() function generates a random integer within a specified range. If you want 10 random numbers, you can generate a random integer between a minimum and maximum value.

```python
for _ in range(10):
    random_number = random.randint(0, 100)
    print(random_number)
```

In the above code, `random.randint(0, 100)` generates a random integer between 0 and 100. The `_` is a convention for a variable that we don't actually use in the loop. The `range(10)` generates a sequence of numbers from 0 to 9. So, the loop runs 10 times, generating 10 random numbers.

Please note that `randint` generates an inclusive range, so the maximum value 100 is included in the possible generated numbers.

3. **Print the

# Streaming

### Standard schema streaming

In [None]:
import io
import json
from sagemaker.iterators import BaseIterator


class TokenIterator(BaseIterator):
    def __init__(self, stream):
        super().__init__(stream)
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()

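            # A complete line ends with "\n"; anything shorter is still being received.
            if line and line[-1] == ord("\n"):
                self.read_pos += len(line)
                full_line = line[:-1].decode('utf-8')
                # removeprefix drops only a literal "data:" prefix, whereas
                # lstrip("data:") strips any leading run of those characters.
                line_data = json.loads(full_line.removeprefix("data:"))
                return line_data["token"].get("text", "")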
            chunk = next(self.byte_iterator)
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])
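
The buffering logic in `TokenIterator` can be tried out without an endpoint. The sketch below is a simplified generator version of the same idea: bytes are accumulated until a full newline-terminated line is available, and only then parsed as JSON. The events in `fake_events` are hypothetical stand-ins shaped like the `PayloadPart` events the class reads, and the JSON layout (`token.text`) is taken from the class above.

```python
import json
from typing import Iterable, Iterator


def iter_tokens(events: Iterable[dict]) -> Iterator[str]:
    """Yield token text from PayloadPart events carrying newline-delimited JSON."""
    buffer = b""
    for event in events:
        buffer += event["PayloadPart"]["Bytes"]
        # Only complete lines are parsed; a trailing partial line stays buffered.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            text = line.decode("utf-8").removeprefix("data:").strip()
            if text:
                yield json.loads(text)["token"].get("text", "")


# Hypothetical events: the first JSON line is split across two chunks.
fake_events = [
    {"PayloadPart": {"Bytes": b'{"token": {"text": "Hel'}},
    {"PayloadPart": {"Bytes": b'lo"}}\n{"token": {"text": " world"}}\n'}},
]
print("".join(iter_tokens(fake_events)))
```

Decoding only complete lines also avoids errors when a multi-byte UTF-8 character is split across two chunks.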

In [None]:
prompt = "Generate 10 random numbers. Please reason step by step."

messages = [
    {"role": "user", "content": prompt}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

parameters = {
    "max_new_tokens": 5000,
    "temperature": 0.6,
    "top_p": 0.9
}

payload = {
    "inputs": inputs,
    "parameters": parameters,
    "stream": True
}

response_stream = predictor.predict_stream(
    data=payload,
    custom_attributes="accept_eula=false",
    iterator=TokenIterator,
)

for token in response_stream:
    print(token, end="", flush=True)

Sure, I can help with that. Here's a Python code snippet that generates 10 random numbers:


```python
import random

random_numbers = [random.randint(0, 100) for _ in range(10)]
print(random_numbers)
```

This code works in the following steps:

1. Import the `random` module, which provides various functions for generating random numbers.

2. Define a list comprehension that generates 10 random integers between 0 and 100. The `random.randint(0, 100)` function generates a random integer between 0 and 100, and the `for _ in range(10)` part repeats this process 10 times.

3. Print the list of random numbers.

You can run this code in a Python environment to see the output.


### Message Schema streaming

In [None]:
import io
import json
from sagemaker.iterators import BaseIterator


class MessageTokenIterator(BaseIterator):
    def __init__(self, stream):
        super().__init__(stream)
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()

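            # A complete line ends with "\n"; anything shorter is still being received.
            if line and line[-1] == ord("\n"):
                self.read_pos += len(line)
                full_line = line[:-1].decode('utf-8')
                # removeprefix drops only a literal "data:" prefix, whereas
                # lstrip('data:') strips any leading run of those characters.
                line_data = json.loads(full_line.removeprefix('data:'))
                return line_data['choices'][0]['delta'].get('content', '')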
            chunk = next(self.byte_iterator)
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk['PayloadPart']['Bytes'])
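
The per-line parsing in `MessageTokenIterator` assumes every line is a JSON chunk with a non-empty `choices` list. A more defensive variant might skip blank lines and chunks without text, as sketched below. The `[DONE]` sentinel is an assumption borrowed from OpenAI-style event streams; this notebook does not confirm that the endpoint emits it. The helper name `parse_chat_chunk` is hypothetical.

```python
import json
from typing import Optional


def parse_chat_chunk(line: str) -> Optional[str]:
    """Return the delta text in one streamed line, or None when there is none."""
    body = line.strip().removeprefix("data:").strip()
    if not body or body == "[DONE]":  # "[DONE]" is an assumed sentinel
        return None
    choices = json.loads(body).get("choices") or []
    if not choices:
        return None
    # A chunk may carry only a role or finish reason and no content.
    return choices[0].get("delta", {}).get("content")
```

Inside an iterator, a `None` result would mean "read the next line" rather than "return an empty token".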

In [None]:
prompt = "Generate 10 random numbers. Please reason step by step."

messages = [
    {
        "role": "user",
        "content": prompt
    }
]

payload = {
    "messages": messages,
    "max_tokens": 5000,
    "temperature": 0.6,
    "top_p": 0.9,
    "stream": True  # boolean, consistent with the standard-schema payload above
}

response_stream = predictor.predict_stream(
    data=payload,
    custom_attributes="accept_eula=false",
    iterator=MessageTokenIterator,
)

for token in response_stream:
    print(token, end="", flush=True)

Sure, I can help you with that. Let's use Python as our programming language for this.

Step 1: Import the random module
The random module in Python provides various functions that are used to generate random numbers.

```python
import random
```

Step 2: Generate the random numbers
We can use the `random.randint()` function to generate random integers within a specific range.

```python
for i in range(10):
    print(random.randint(1, 100))
```

In the above code, `random.randint(1, 100)` generates a random integer between 1 and 100 (inclusive). The `for` loop runs this 10 times, so it generates 10 random integers. The `print()` function outputs the generated numbers.

If you want to generate random floating point numbers between 0 and 1, you can use `random.random()`. Here is how you can do it:

```python
for i in range(10):
    print(random.random())
```

In this case, `random.random()` generates a random float number between 0 and 1.

I hope this helps! Let me know if you have any o

In [None]:
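# Korean math prompt. English translation: "For a geometric sequence {a_n} whose first term
# and common ratio are both the positive number k, (a_4/a_2)+(a_2/a_1)=30 holds. What is k?"
prompt = "첫째항과 공비가 모두 양수 k인 등비수열 {a_n}이 (a_4/a_2)+(a_2/a_1)=30을 만족할 때, k의 값은?"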

messages = [
    {
        "role": "user",
        "content": prompt
    }
]

payload = {
    "messages": messages,
    "max_tokens": 5000,
    "temperature": 0.7,
    "top_p": 0.9,
    "stream": True  # boolean, consistent with the standard-schema payload above
}

response_stream = predictor.predict_stream(
    data=payload,
    custom_attributes="accept_eula=false",
    iterator=MessageTokenIterator,
)

for token in response_stream:
    print(token, end="", flush=True)

주어진 등비수열에 대한 합은 등비수열의 4번째 항과 2번째 항의 비율과 2번째 항과 1번째 항의 비율의 합이다. 즉, (a_4/a_2) + (a_2/a_1) = 30이다.

주어진 등비수열의 첫 4개 항은 a_1, a_2, a_3, a_4이다. 공비 k는 a_n = k * a_(n-1) (n > 1)로 나타낼 수 있다. 이를 전개하면 a_4/a_2 = a_2/a_1 = k 이다. 따라서 (a_4/a_2) + (a_2/a_1) = 2k가 된다.

주어진 식 30 = 2k를 풀면 k = 30/2 = 15가 된다.

Python 코드로 구현하면 다음과 같다.

```python
# 주어진 식을 변수에 할당
given_sum = 30

# 등비수열의 공비 k를 계산
k = given_sum / 2

# 결과 출력
print(k)
```

이 코드는 공비 k를 계산하고 출력한다. 이 경우 k의 값은 15가 된다.


# Clean up the environment

In [None]:
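# Delete the model first: `delete_model()` looks up the model name through the
# endpoint configuration, which `delete_endpoint()` removes by default.
predictor.delete_model()
predictor.delete_endpoint()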

# References

- [DeepSeek-R1 Model Card](https://huggingface.co/deepseek-ai/DeepSeek-R1#usage-recommendations)
- [deepseek-ai/deepseek-coder-6.7b-instruct SageMaker LMI deployment guide](https://github.com/aws-samples/llm_deploy_gcr/blob/main/sagemaker/deepseek_coder_6.7_instruct.ipynb)