# Deploy DeepSeek-R1-0528 models on Amazon SageMaker Endpoint

The DeepSeek R1 model has undergone a minor version upgrade, with the current version being DeepSeek-R1-0528. In the latest update, DeepSeek R1 has significantly improved its depth of reasoning and inference capabilities by leveraging increased computational resources and introducing algorithmic optimization mechanisms during post-training. The model has demonstrated outstanding performance across various benchmark evaluations, including mathematics, programming, and general logic. Its overall performance is now approaching that of leading models, such as o3 and Gemini 2.5 Pro.

Compared to the previous version, the upgraded model shows significant improvements in handling complex reasoning tasks. For instance, in the AIME 2025 test, the model’s accuracy has increased from 70% in the previous version to 87.5% in the current version. This advancement stems from enhanced thinking depth during the reasoning process: in the AIME test set, the previous model used an average of 12K tokens per question, whereas the new version averages 23K tokens per question.

See the [DeepSeek blog](https://api-docs.deepseek.com/news/news250528) for more details.

## Environment Setup

First, we'll upgrade the SageMaker SDK to ensure compatibility with the latest features, particularly those needed for large language model deployment and streaming inference.

> **Note**: The `--quiet` and `--no-warn-conflicts` flags are used to minimize unnecessary output while installing dependencies.

> ⚠️ **Important**: After running the installation cell below, you may need to restart your notebook kernel so that the updated packages are loaded.

In [None]:
%pip install --upgrade --quiet --no-warn-conflicts sagemaker sagemaker-core python-dotenv

In [None]:
import io
import time
import json
from datetime import datetime
from IPython.display import clear_output
import boto3
import sagemaker

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket()  # bucket to house artifacts
region = sess.boto_region_name  # region name of the current SageMaker Studio environment

smr_client = boto3.client("sagemaker-runtime")

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {bucket}")
print(f"sagemaker session region: {region}")
print(f"sagemaker version: {sagemaker.__version__}")

## Configure Model Container and Instance

For deploying DeepSeek-R1, we'll use:
- **LMI (Large Model Inference) container**: a DJL (Deep Java Library) Serving container optimized for large language model inference
- **P5en instance**: a GPU instance type with NVIDIA H200 GPUs, suited to large model inference

Key configurations:
- The container URI points to the DJL inference container in ECR (Elastic Container Registry)
- We use the `ml.p5en.48xlarge` instance, which offers:
  - 8 NVIDIA H200 GPUs
  - 1128 GB of total GPU memory (8 × 141 GB)
  - High network bandwidth for optimal inference performance

> **Note**: The region in the container URI must match your AWS region. The code below builds the URI from the `region` variable of the current session, so no manual change is needed unless you deploy to a different region.
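
The instance specification above can be checked against the model size with a rough back-of-the-envelope estimate. The sketch below assumes about 671 billion parameters stored at one byte each (FP8); both numbers are assumptions taken from the public model card rather than from this notebook. It also ignores framework overhead, and the helper names are hypothetical.

```python
def gpu_memory_budget_gb(num_gpus=8, gb_per_gpu=141, utilization=0.87):
    """GPU memory the inference engine may claim across all GPUs, in GB.

    utilization mirrors OPTION_GPU_MEMORY_UTILIZATION used later in this notebook.
    """
    return num_gpus * gb_per_gpu * utilization


def weights_gb(params_billion=671, bytes_per_param=1):
    """Approximate weight size in GB (assumed: 671B parameters, FP8)."""
    return params_billion * bytes_per_param


budget = gpu_memory_budget_gb()
weights = weights_gb()
headroom = budget - weights  # what remains for KV cache and activations

print(f"Usable GPU memory: {budget:.0f} GB")
print(f"Model weights:     {weights:.0f} GB")
print(f"Headroom:          {headroom:.0f} GB")
```

Under these assumptions the weights fit with room left for the KV cache, which is consistent with choosing an 8 × H200 instance for the full model.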

In [None]:
CONTAINER_VERSION = "0.33.0-lmi15.0.0-cu128"

inference_image = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:{CONTAINER_VERSION}"

print(f"Inference image: {inference_image}")

## Deployment of DeepSeek-R1-0528

First, let's deploy the full version of the model using the latest version of the LMI container. Please see the [Hugging Face repo](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) for more details about the model.

Please note that you will need access to an `ml.p5en.48xlarge` instance to deploy the full version of the model.

In [None]:
common_env = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-0528",
}
lmi_env = {
    "SERVING_FAIL_FAST": "true",
    "OPTION_ASYNC_MODE": "true",
    "OPTION_ROLLING_BATCH": "disable",
    "OPTION_ENTRYPOINT": "djl_python.lmi_vllm.vllm_async_service",
    "OPTION_TRUST_REMOTE_CODE": "True",
    "OPTION_GPU_MEMORY_UTILIZATION": "0.87",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
    "OPTION_MAX_MODEL_LEN": "32768",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
}
env = common_env | lmi_env
model_name = sagemaker.utils.name_from_base("model")
endpoint_config_name = model_name
endpoint_name = model_name
instance_type = "ml.p5en.48xlarge"
timeout = 3600
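
The `OPTION_*` variables above follow the LMI convention in which an environment variable such as `OPTION_MAX_MODEL_LEN` corresponds to the `option.max_model_len` key of a `serving.properties` file, and `HF_MODEL_ID` corresponds to `option.model_id`. The helper below is a hypothetical illustration of that mapping, useful for reading the configuration. It is not part of the container or the SDK.

```python
def env_to_serving_properties(env):
    """Render LMI-style environment variables as serving.properties lines."""
    lines = []
    for key, value in env.items():
        if key == "HF_MODEL_ID":
            lines.append(f"option.model_id={value}")
        elif key.startswith("OPTION_"):
            lines.append(f"option.{key[len('OPTION_'):].lower()}={value}")
        # Other variables (for example SERVING_FAIL_FAST) are server-level
        # settings and are skipped in this sketch.
    return "\n".join(lines)


example_env = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-0528",
    "SERVING_FAIL_FAST": "true",
    "OPTION_MAX_MODEL_LEN": "32768",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
}
print(env_to_serving_properties(example_env))
```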

We use the sagemaker-core SDK for model deployment.

This [SDK](https://github.com/aws/sagemaker-core) offers a more "Pythonic" syntax than boto3.

**Key Features:**
- Object-Oriented Interface: Provides a structured way to interact with SageMaker resources, making it easier to manage them using familiar object-oriented programming techniques.
- Resource Chaining: Allows seamless connection of SageMaker resources by passing outputs as inputs between them, simplifying workflows and reducing the complexity of parameter management.
- Full Parity with SageMaker APIs: Ensures access to all SageMaker capabilities through the SDK, providing a comprehensive toolset for building and deploying machine learning models.
- Abstraction of Low-Level Details: Automatically handles resource state transitions and polling logic, freeing developers from managing these intricacies and allowing them to focus on higher-level tasks.
- Auto Code Completion: Enhances the developer experience by offering real-time suggestions and completions in popular IDEs, reducing syntax errors and speeding up the coding process.
- Comprehensive Documentation and Type Hints: Provides detailed guidance and type hints to help developers understand functionalities, write code faster, and reduce errors without complex API navigation.
- Incorporation of Intelligent Defaults: Integrates the previous SageMaker SDK feature of intelligent defaults, allowing developers to set default values for parameters like IAM roles and VPC configurations. This streamlines the setup process, enabling developers to focus on customizations specific to their use case.

In [None]:
from sagemaker_core.resources import Endpoint, EndpointConfig, Model
from sagemaker_core.shapes import (
    ContainerDefinition,
    ProductionVariant,
    ProductionVariantRoutingConfig,
)

In [None]:
model = Model.create(
    model_name=model_name,
    primary_container=ContainerDefinition(image=inference_image, environment=env),
    execution_role_arn=role,
    session=sess.boto_session,
    region=region,
)

In [None]:
endpoint = Endpoint.create(
    endpoint_name=endpoint_name,
    endpoint_config_name=EndpointConfig.create(
        endpoint_config_name=endpoint_config_name,
        production_variants=[
            ProductionVariant(
                variant_name=model_name,
                initial_instance_count=1,
                instance_type=instance_type,
                model_name=model,
                container_startup_health_check_timeout_in_seconds=timeout,
                model_data_download_timeout_in_seconds=timeout,
                routing_config=ProductionVariantRoutingConfig(routing_strategy="LEAST_OUTSTANDING_REQUESTS"),
            )
        ],
    ), 
)
endpoint.wait_for_status("InService")
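
`wait_for_status` hides the polling loop. For readers who want to see roughly what such a loop looks like, here is a minimal sketch. The function name `wait_until_in_service` is hypothetical, and the `describe` argument is assumed to behave like `boto3.client("sagemaker").describe_endpoint`, which returns `EndpointStatus` and, on failure, `FailureReason`.

```python
import time


def wait_until_in_service(describe, endpoint_name, poll_seconds=30,
                          timeout_seconds=3600, sleep=time.sleep):
    """Poll an endpoint until it is InService; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        desc = describe(EndpointName=endpoint_name)
        status = desc["EndpointStatus"]
        if status == "InService":
            return desc
        if status == "Failed":
            reason = desc.get("FailureReason", "unknown")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {timeout_seconds}s")
        sleep(poll_seconds)


# Possible usage with a real client (not executed here):
# sm_client = boto3.client("sagemaker")
# wait_until_in_service(sm_client.describe_endpoint, endpoint_name)
```

Passing `describe` and `sleep` as arguments keeps the loop easy to test without touching AWS.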

## Inference examples

### Synchronous invocation

In [None]:
payload=json.dumps(
    {
        "messages": [
            {"role": "user", "content": "Name popular places to visit in London?"}
        ],
        "temperature": 0.9,
    }
)
response = smr_client.invoke_endpoint(EndpointName=endpoint_name, ContentType='application/json', Body=payload)
result = json.loads(response['Body'].read().decode("utf8"))
print(result["choices"][0]["message"]["content"])
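
Reasoning models usually emit their chain of thought before the final answer. The sketch below separates the two, assuming the endpoint returns the reasoning inline in `content` and terminates it with a `</think>` tag (with or without an opening `<think>` tag). That format is an assumption, so inspect a real response first. The helper name `split_reasoning` is hypothetical.

```python
def split_reasoning(text):
    """Return (reasoning, answer) for a response that may contain a think block."""
    head, sep, tail = text.partition("</think>")
    if not sep:
        # No closing tag found: treat the whole text as the answer.
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


sample = "<think>\nLondon has many landmarks.\n</think>\n\nTry the Tower of London."
reasoning, answer = split_reasoning(sample)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

In the notebook this could be applied to `result["choices"][0]["message"]["content"]`.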

### Streaming invocation

In [None]:
class LineIterator:
    def __init__(self, stream):
        self.byte_iterator = iter(stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord("\n"):
                self.read_pos += len(line)
                return line[:-1]
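            try:
                chunk = next(self.byte_iterator)
            except StopIteration:
                if self.read_pos < self.buffer.getbuffer().nbytes:
                    # Flush a final line that has no trailing newline
                    self.buffer.seek(self.read_pos)
                    line = self.buffer.read()
                    self.read_pos += len(line)
                    return line
                raise
            if "PayloadPart" not in chunk:
                # chunk is a dict, so format it instead of concatenating it to a str
                print(f"Unknown event type: {chunk}")
                continue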
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])

def stream_response(endpoint_name, inputs, max_tokens=8189, temperature=0.7, top_p=0.9):
    body = {
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": inputs}]}
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "stream": True,
    }

    resp = smr_client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(body),
        ContentType="application/json",
    )

    event_stream = resp["Body"]
    start_json = b"{"
    full_response = ""
    start_time = time.time()
    token_count = 0

    for line in LineIterator(event_stream):
        if line != b"" and start_json in line:
            data = json.loads(line[line.find(start_json):].decode("utf-8"))
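            # Some chunks carry no text (for example a role-only delta), so fall back to ""
            token_text = data['choices'][0]['delta'].get('content') or ''
            full_response += token_text
            token_count += 1  # counts streamed chunks, an approximation of tokens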

            # Calculate tokens per second
            elapsed_time = time.time() - start_time
            tps = token_count / elapsed_time if elapsed_time > 0 else 0

            # Clear the output and reprint everything
            clear_output(wait=True)
            print(full_response)
            print(f"\nTokens per Second: {tps:.2f}", end="")

    print("\n") # Add a newline after response is complete
    
    return full_response
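
The line-parsing step inside `stream_response` can also be written as a small standalone function, which makes it easier to test without an endpoint. This sketch assumes OpenAI-style server-sent events, where each line looks like `data: {...}` and the stream ends with `data: [DONE]`. The exact framing returned by your endpoint may differ, and the helper names are hypothetical.

```python
import json


def parse_sse_line(line):
    """Return the decoded JSON chunk for a `data:` line, or None otherwise."""
    text = line.decode("utf-8").strip()
    if not text.startswith("data:"):
        return None
    payload = text[len("data:"):].strip()
    if not payload or payload == "[DONE]":
        return None
    return json.loads(payload)


def delta_text(chunk):
    """Extract the streamed text from a chunk, tolerating missing fields."""
    choices = chunk.get("choices") or []
    if not choices:
        return ""
    return choices[0].get("delta", {}).get("content") or ""


sample_lines = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    b'data: {"choices": [{"delta": {"content": "lo"}}]}',
    b"data: [DONE]",
]
chunks = (parse_sse_line(line) for line in sample_lines)
text = "".join(delta_text(c) for c in chunks if c is not None)
print(text)
```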

In [None]:
inputs = "What is greater 9.11 or 9.8?"
output = stream_response(endpoint_name, inputs, max_tokens=8000)

## Cleanup

In [None]:
sess.delete_endpoint(endpoint_name)
sess.delete_endpoint_config(endpoint_config_name)
sess.delete_model(model_name)

## Deployment of DeepSeek-R1-0528-Qwen3-8B

Now, let's deploy the distilled version of the model (DeepSeek-R1-0528-Qwen3-8B) using a CloudFormation template. Please see the [Hugging Face repo](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B) for more details about the model.

Please note that you will need access to an `ml.g6e.4xlarge` instance to deploy the distilled version of the model, and the SageMaker role must have privileges to deploy CloudFormation stacks.

In [None]:
stack_timestamp = datetime.now().isoformat(timespec="seconds").replace(":", "-")
stack_name = f"deepseek-r1-0528-{stack_timestamp}"
print("stack_name:", stack_name)

In [None]:
cloudformation = boto3.client("cloudformation")
model_name = sagemaker.utils.name_from_base("model")

We will define a string variable that holds the CloudFormation template.

You can also save it to a YAML file and use it outside of this notebook.

Please refer to the [CloudFormation documentation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html) for more details.
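
Saving the template to a file takes only a few lines. The sketch below uses a placeholder template string and a hypothetical helper named `save_template`. In the notebook you would pass the `cfn_deploy` variable once it has been defined.

```python
from pathlib import Path


def save_template(template_body, path):
    """Write a CloudFormation template string to disk and return its path."""
    target = Path(path)
    # Drop the leading newline left by a triple-quoted string
    target.write_text(template_body.lstrip("\n"), encoding="utf-8")
    return target


# Placeholder template used only for illustration
sample_template = """
AWSTemplateFormatVersion: 2010-09-09
Description: placeholder
"""

# Possible usage in the notebook (file name is an example):
# save_template(cfn_deploy, "deepseek-r1-0528-qwen3-8b.yaml")
```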

In [None]:
cfn_deploy = f"""
AWSTemplateFormatVersion: 2010-09-09

Parameters:

  DockerImageArn:
    Type: String
    Default: {inference_image}
  RoleArn:
    Type: String
    Default: {role}
  ModelName:
    Type: String
    Default: {model_name}
  InstanceType:
    Type: String
    Default: ml.g6e.4xlarge
  InitialInstanceCount:
    Type: Number
    Default: 1


Resources:

  Model:
    Type: "AWS::SageMaker::Model"
    Properties:
      ModelName: !Ref ModelName
      Containers:
        -
          ContainerHostname: 'GenericContainer'
          Image: !Ref DockerImageArn
          Environment:
            HF_MODEL_ID: "deepseek-ai/DeepSeek-R1-0528-Qwen3-8B"
            SERVING_FAIL_FAST: "True"
            OPTION_ASYNC_MODE: "True"
            OPTION_ROLLING_BATCH: "disable"
            OPTION_ENTRYPOINT: "djl_python.lmi_vllm.vllm_async_service"
            OPTION_TRUST_REMOTE_CODE: "True"
            OPTION_MAX_ROLLING_BATCH_SIZE: "16"
            OPTION_MAX_MODEL_LEN: "32768"
            OPTION_TENSOR_PARALLEL_DEGREE: "max"
      ExecutionRoleArn: !Ref RoleArn

  EndpointConfig:
    Type: "AWS::SageMaker::EndpointConfig"
    Properties:
      EndpointConfigName: !GetAtt Model.ModelName
      ProductionVariants:
        - ModelName: !GetAtt Model.ModelName
          VariantName: default
          InitialInstanceCount: !Ref InitialInstanceCount
          InstanceType: !Ref InstanceType
          ModelDataDownloadTimeoutInSeconds: 600
          ContainerStartupHealthCheckTimeoutInSeconds: 600

  GenericPredictEndpoint:
    Type: "AWS::SageMaker::Endpoint"
    Properties:
      EndpointConfigName: !GetAtt EndpointConfig.EndpointConfigName
      EndpointName: !GetAtt Model.ModelName

Outputs:
  EndpointArn:
    Value: !Ref GenericPredictEndpoint
  EndpointName:
    Value: !GetAtt GenericPredictEndpoint.EndpointName
"""

In [None]:
stack_parameters = {
    "DockerImageArn": inference_image,
    "RoleArn": role,
    "ModelName": model_name,
}

stack_parameters = [
    {'ParameterKey': key, 'ParameterValue': str(value)}
    for key, value in stack_parameters.items()
]

In [None]:
create_result = cloudformation.create_stack(
        StackName=stack_name,
        TemplateBody=cfn_deploy,
        Parameters=stack_parameters,
        Capabilities=['CAPABILITY_IAM'],
        OnFailure='ROLLBACK'
)
create_result

In [None]:
create_start = datetime.now()

print(f'\nWaiting for {stack_name} stack to be in service...')

waiter = cloudformation.get_waiter('stack_create_complete')
waiter.wait(StackName = stack_name)

print(f"Creation of stack {stack_name} took {datetime.now() - create_start}\n")

In [None]:
resources = cloudformation.describe_stack_resources(StackName=stack_name)
endpoint_arn = next(r["PhysicalResourceId"] for r in resources["StackResources"] if r["LogicalResourceId"] == "GenericPredictEndpoint")
endpoint_name = endpoint_arn.split("/")[-1]
endpoint_name
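
Because the template declares an `EndpointName` output, the name can also be read from the stack outputs instead of being parsed from the ARN. The sketch below shows one way to do that, assuming the response shape returned by `cloudformation.describe_stacks`. The helper name `stack_outputs` and the sample values are hypothetical.

```python
def stack_outputs(describe_stacks_response):
    """Map OutputKey to OutputValue for the first stack in the response."""
    stack = describe_stacks_response["Stacks"][0]
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


# Sample response with made-up values, shaped like describe_stacks output
sample_response = {
    "Stacks": [
        {
            "StackName": "example-stack",
            "Outputs": [
                {"OutputKey": "EndpointArn", "OutputValue": "arn:aws:sagemaker:us-east-1:111122223333:endpoint/example"},
                {"OutputKey": "EndpointName", "OutputValue": "example"},
            ],
        }
    ]
}
print(stack_outputs(sample_response)["EndpointName"])

# Possible usage in the notebook (not executed here):
# endpoint_name = stack_outputs(cloudformation.describe_stacks(StackName=stack_name))["EndpointName"]
```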

## Inference examples

### Synchronous invocation

In [None]:
payload=json.dumps(
    {
        "messages": [
            {"role": "user", "content": "Name popular places to visit in Italy?"}
        ],
        "temperature": 0.9,
    }
)
response = smr_client.invoke_endpoint(EndpointName=endpoint_name, ContentType='application/json', Body=payload)
result = json.loads(response['Body'].read().decode("utf8"))
print(result["choices"][0]["message"]["content"])

### Streaming invocation

In [None]:
inputs = "How many R in a word 'strawberry'?"
output = stream_response(endpoint_name, inputs, max_tokens=8000)

## Cleanup

In [None]:
cloudformation.delete_stack(StackName=stack_name)