# Deploy DeepSeek-R1-Distill-Llama-8B on Amazon SageMaker AI with SGLang

❗This notebook works well on an `ml.g5.xlarge` instance with 50 GB of disk space, using the `PyTorch 2.2.0 Python 3.10 CPU optimized` kernel from **SageMaker Studio Classic** or the `Python3` kernel from **JupyterLab**.

This notebook is adapted from [sagemaker-genai-hosting-examples/Deepseek/SGLang-Deepseek/deepseek-r1-llama-70b-sglang.ipynb](https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/Deepseek/SGLang-Deepseek/deepseek-r1-llama-70b-sglang.ipynb).

SageMaker AI provides [pre-built SageMaker AI Docker images](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) that help you get started quickly with model inference. It also lets you [bring your own Docker container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-inference-container.html) and use it inside SageMaker AI for training and inference. To be compatible with SageMaker AI, your container must have the following characteristics:

- Your container must have a web server listening on port `8080`.
- Your container must accept POST requests to the `/invocations` endpoint for real-time inference and respond to GET requests on the `/ping` health-check endpoint.
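
As a rough illustration of this contract (not the SGLang container itself), the sketch below uses only Python's standard library to answer health checks on `/ping`, which SageMaker sends as GET requests, and to echo JSON posted to `/invocations`. The names `handle`, `ContractHandler`, and `make_server`, as well as the echo behavior, are assumptions made for the example; a real container would forward the request to the model server instead.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle(method, path, body=b""):
    """Return (status_code, response_dict) for a request to the container."""
    if method == "GET" and path == "/ping":
        # Health check: an HTTP 200 signals that the container is ready.
        return 200, {"status": "healthy"}
    if method == "POST" and path == "/invocations":
        payload = json.loads(body or b"{}")
        # Placeholder: echo the request instead of running a model.
        return 200, {"received": payload}
    return 404, {"error": "not found"}


class ContractHandler(BaseHTTPRequestHandler):
    """Hypothetical HTTP handler that delegates routing to handle()."""

    def _respond(self, status, body):
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        self._respond(*handle("GET", self.path))

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self._respond(*handle("POST", self.path, self.rfile.read(length)))


def make_server(port=8080):
    """Bind to all interfaces on the port SageMaker expects (8080)."""
    return HTTPServer(("0.0.0.0", port), ContractHandler)
```

Calling `make_server().serve_forever()` would start the server; the routing logic lives in the plain function `handle` so that it can be checked without opening a socket.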

In this notebook, we'll demonstrate how to adapt the [SGLang](https://github.com/sgl-project/sglang) framework to run on SageMaker AI endpoints. SGLang is a serving framework for large language models that provides state-of-the-art performance, including a fast backend runtime for efficient serving with RadixAttention, extensive model support, and an active open-source community. For more information, refer to [https://docs.sglang.ai/index.html](https://docs.sglang.ai/index.html) and [https://github.com/sgl-project/sglang](https://github.com/sgl-project/sglang).

By using SGLang and building a custom Docker container, you can run advanced AI models like the [DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B) on a SageMaker AI endpoint.

### Set up Environment

In [None]:
%%capture --no-stderr

!pip install -U "sagemaker>=2.237.1"
!pip install -U sagemaker-studio-image-build==0.6.0

In [None]:
!pip list | grep -E -w "sagemaker|sagemaker_studio_image_build"

### Prepare the SGLang SageMaker container

In [None]:
DOCKER_IMAGE = "sglang-sagemaker"
DOCKER_IMAGE_TAG = "latest"

[sm-docker](https://github.com/aws-samples/sagemaker-studio-image-build-cli) is a CLI for building Docker images in SageMaker Studio using AWS CodeBuild.

In [None]:
%%time
!cd ../container && sm-docker build . --repository {DOCKER_IMAGE}:{DOCKER_IMAGE_TAG} --build-arg CUDA_VERSION=12.4.1

### Create SageMaker AI endpoint for DeepSeek-R1-Distill-Llama-8B model

In this example, we use the DeepSeek-R1-Distill-Llama-8B model artifacts directly from [SageMaker JumpStart](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html). This saves you the time of downloading the model from Hugging Face and uploading it to S3.

In [None]:
from sagemaker.jumpstart.model import JumpStartModel

model_id, model_version = "deepseek-llm-r1-distill-llama-8b", "1.0.0"
# model_id, model_version = "deepseek-llm-r1-distill-llama-8b", "*"

model = JumpStartModel(model_id=model_id, model_version=model_version)
model_data = model.model_data['S3DataSource']['S3Uri']
model_data
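
`model_data` is an S3 URI that points at the model artifacts. If you want to inspect those artifacts, for example by listing them with boto3, it helps to split the URI into a bucket and a key prefix. The helper `split_s3_uri` below is a hypothetical name, and the sample URI is a made-up placeholder rather than the actual JumpStart location.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder URI for illustration only.
bucket, prefix = split_s3_uri("s3://example-bucket/models/deepseek-8b/")
print(bucket, prefix)
```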

In [None]:
import sagemaker
from sagemaker.session import Session

session = Session()
region = session.boto_region_name  # public attribute; avoids the private _region_name
role = sagemaker.get_execution_role()

ecr_uri = f'{session.account_id()}.dkr.ecr.{region}.amazonaws.com/{DOCKER_IMAGE}:{DOCKER_IMAGE_TAG}'

Next, we create the [SageMaker model](https://sagemaker.readthedocs.io/en/stable/api/inference/model.html) from the custom Docker image and the model data available on S3.

In [None]:
from sagemaker.model import Model
from sagemaker.predictor import Predictor


model = Model(
    model_data={
        "S3DataSource": {
            "S3Uri": model_data,
            "S3DataType": "S3Prefix",
            "CompressionType": "None",
        },
    },
    role=role,
    image_uri=ecr_uri,
    env={
        'TENSOR_PARALLEL_DEGREE': '1', # ml.g5.2xlarge
        # 'TENSOR_PARALLEL_DEGREE': '8' # ml.g5.48xlarge
    },
    predictor_cls=Predictor
)
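
The comments in the cell above pair `TENSOR_PARALLEL_DEGREE` with the number of GPUs on the instance, which is why the value changes with the instance type. A minimal sketch of keeping the two in sync is shown here. The lookup table `GPUS_PER_INSTANCE` and the helper `tensor_parallel_env` are hypothetical, the table covers only a few instance types, and how the variable is consumed depends on the container's startup script.

```python
# GPU counts for a few instance types; extend as needed.
GPUS_PER_INSTANCE = {
    "ml.g5.xlarge": 1,
    "ml.g5.2xlarge": 1,
    "ml.g5.12xlarge": 4,
    "ml.g5.48xlarge": 8,
    "ml.p4d.24xlarge": 8,
}


def tensor_parallel_env(instance_type):
    """Build the env dict for Model(...) from the instance's GPU count."""
    if instance_type not in GPUS_PER_INSTANCE:
        raise ValueError(f"Unknown GPU count for instance type: {instance_type}")
    return {"TENSOR_PARALLEL_DEGREE": str(GPUS_PER_INSTANCE[instance_type])}


print(tensor_parallel_env("ml.g5.2xlarge"))  # {'TENSOR_PARALLEL_DEGREE': '1'}
```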

In [None]:
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer


instance_type = 'ml.g5.2xlarge' # you can also change to ml.g5.48xlarge or ml.p4d.24xlarge

predictor = model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer()
)

### Invoke endpoint with SageMaker Python SDK

In [None]:
response = predictor.predict({
    'model':'mymodel',
    'messages':[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    'temperature': 0,
    'max_tokens': 1000,
    'top_logprobs': 2,
    'logprobs': True
})

print(response['choices'][0]['message']['content'])

<think>
Okay, so I need to list three countries and their capitals. Hmm, let me think about how to approach this. First, I should probably pick countries that I'm somewhat familiar with. Maybe I can start with some nearby ones or ones I've heard about in the news.

Let me consider the United States. I know their capital is Washington, D.C. That's a good one. Now, where else? Maybe a European country. France's capital is Paris, right? That's a major city I've heard of. Okay, so that's two down.

Now, for the third country, I should pick one that's a bit different. Maybe an Asian country. Japan comes to mind. I believe their capital is Tokyo. Yeah, that sounds right. I've heard Tokyo mentioned a lot in the context of Japan's government.

Wait, let me double-check to make sure I'm not mixing up capitals. Sometimes I get confused between countries that have similar-sounding names. For example, I know that Germany's capital is Berlin, but I didn't list that. And I'm pretty sure the UK's cap
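
The response content begins with the model's reasoning inside a `<think>` block, followed by the final answer. If only the answer is needed, the two parts can be separated after the call. The sketch below assumes the `<think>...</think>` layout seen in the output above. `split_reasoning` is a hypothetical helper, and when the closing tag is missing (for instance because `max_tokens` cut the generation short) it treats all of the text as reasoning.

```python
def split_reasoning(content, open_tag="<think>", close_tag="</think>"):
    """Return (reasoning, answer) extracted from a response string."""
    text = content.lstrip()
    if not text.startswith(open_tag):
        # No reasoning block: everything is the answer.
        return "", content.strip()
    body = text[len(open_tag):]
    reasoning, found, answer = body.partition(close_tag)
    if not found:
        # Generation stopped before the closing tag was produced.
        return reasoning.strip(), ""
    return reasoning.strip(), answer.strip()


# Made-up sample text for illustration.
sample = "<think>\nParis is the capital of France.\n</think>\n\nFrance: Paris"
reasoning, answer = split_reasoning(sample)
print(answer)  # France: Paris
```

With the endpoint response, the call would be `split_reasoning(response['choices'][0]['message']['content'])`.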

### Streaming response from the endpoint

SGLang also lets you invoke the endpoint and receive a streaming response. The example below shows how to interact with the endpoint in streaming mode.

In [None]:
import io
import json
from sagemaker.iterators import BaseIterator
from sagemaker.iterators import handle_stream_errors


class TokenIterator(BaseIterator):
    def __init__(self, event_stream):
        super().__init__(event_stream)
        self.byte_iterator = iter(self.event_stream)
        self.buffer = io.BytesIO()
        self.read_pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        r"""Return the content of the next chat completion chunk in the stream.

        The output of the event stream will be in the following format:

        ```
        b'data: {"id":"2d81e745f32e46879c2e6bf28171570f","object":"chat.completion.chunk","created":1742104124,"model":"mymodel","choices":[{"index":0,"delta":{"role":"assistant","content":"","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":null,"matched_stop":null}'
        ...
        b'bf28171570f","object":"chat.completion.chunk","created":1742104141,"model":"mymodel","choices":[],"usage":{"prompt_tokens":11,"total_tokens":523,"completion_tokens":512}}\n\n'
        b'data: [DONE]\n\n'
        ```

        While usually each PayloadPart event from the event stream will contain a byte array
        with a full json, this is not guaranteed and some of the json objects may be split across
        PayloadPart events. For example:
        ```
        {'PayloadPart': {'Bytes': b'data: {"id":"1f7cb39ac2e24f6187305bdb20fc0002",'}
        {'PayloadPart': {'Bytes': b'"object":"chat.completion.chunk",'}
        ...
        {'PayloadPart': {'Bytes': b'}\n\n'}
        ```

        This class accounts for this by appending the bytes of each PayloadPart to an
        in-memory buffer and reading complete lines (ending with a '\n' character) from
        that buffer. It tracks the position of the last read to ensure that previous
        bytes are not exposed again.

        Returns:
            str: The `content` field of the next `data:` line, or an empty string
                if the line carries no content.
        """
        while True:
            self.buffer.seek(self.read_pos)
            line = self.buffer.readline()
            if line and line[-1] == ord("\n"):
                self.read_pos += len(line)
                full_line = line[:-1].decode('utf-8')
                if not full_line.startswith("data:"):
                    # Skip blank separator lines and keep draining buffered lines.
                    continue
                try:
                    # Slice off the prefix: str.lstrip("data:") strips a set of
                    # characters, not the literal prefix.
                    json_line = json.loads(full_line[len("data:"):].strip())
                except json.JSONDecodeError:
                    # For example, the final 'data: [DONE]' sentinel is not JSON.
                    json_line = {}
                choices = json_line.get('choices')
                # 'content' may be missing or null in some chunks; return "" then.
                return (choices[0].get('delta', {}).get('content') or "") if choices else ""
            # Every complete buffered line has been consumed at this point, so a
            # StopIteration from the byte iterator correctly ends the iteration.
            chunk = next(self.byte_iterator)
            if "PayloadPart" not in chunk:
                # handle API response errors and force terminate.
                handle_stream_errors(chunk)
                # print and move on to next response byte
                print(f"Unknown event type: {chunk}")
                continue
            self.buffer.seek(0, io.SEEK_END)
            self.buffer.write(chunk["PayloadPart"]["Bytes"])
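
To see the buffering idea in isolation, the following self-contained sketch reassembles `data:` lines from byte chunks that split a JSON object at an arbitrary point. It does not use the SageMaker SDK. The function `iter_sse_content` and the sample chunks are made up for illustration and only mimic the event format described in the docstring above.

```python
import json


def iter_sse_content(chunks):
    """Yield non-empty 'content' strings from an iterable of byte chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Only complete lines are parsed; a partial line waits for more bytes.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            text = line.decode("utf-8").strip()
            if not text.startswith("data:"):
                continue  # blank separator line between events
            data = text[len("data:"):].strip()
            if data == "[DONE]":
                return  # end-of-stream sentinel
            choices = json.loads(data).get("choices") or []
            if choices:
                content = choices[0].get("delta", {}).get("content")
                if content:
                    yield content


# The first JSON object is deliberately split across two chunks.
chunks = [
    b'data: {"choices":[{"delta":{"con',
    b'tent":"Hello"}}]}\n\n',
    b'data: {"choices":[{"delta":{"content":", world"}}]}\n\ndata: [DONE]\n\n',
]
print("".join(iter_sse_content(chunks)))  # Hello, world
```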

In [None]:
payload = {
    'model':'mymodel',
    'messages':[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    'temperature': 0,
    'max_tokens': 1000,
    'top_logprobs': 2,
    'logprobs': True,
    'stream': True,
    # 'stream_options': {'include_usage': True}
}

response_stream = predictor.predict_stream(
    data=payload,
    iterator=TokenIterator,
)

for token in response_stream:
    print(token, end="", flush=True)

<think>
Okay, so I need to list three countries and their capitals. Hmm, let me think about how to approach this. First, I should probably pick countries that I'm somewhat familiar with. Maybe I can start with some nearby ones or ones I've heard about in the news.

Let me consider the United States. I know their capital is Washington, D.C. That's a good one. Now, where else? Maybe a European country. France's capital is Paris, right? That's a major city I've heard of. Okay, so that's two down.

Now, for the third country, I should pick one that's a bit different. Maybe an Asian country. Japan comes to mind. I believe their capital is Tokyo. Yeah, that sounds right. I've heard Tokyo mentioned a lot in the context of Japan's government.

Wait, let me double-check to make sure I'm not mixing up capitals. Sometimes I get confused between countries that have similar-sounding names. For example, I know that Germany's capital is Berlin, but I didn't list that. And I'm pretty sure the UK's cap

### Invoke endpoint with boto3

You can also invoke the endpoint with boto3. If you have an existing endpoint, you don't need to recreate the predictor; follow the example below to invoke the endpoint by name.

In [None]:
import boto3
import json

sagemaker_runtime = boto3.client('sagemaker-runtime', region_name=region)
endpoint_name = predictor.endpoint_name # you can manually set the endpoint name with an existing endpoint

prompt = {
    'model':'mymodel',
    'messages':[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    'temperature': 0,
    'max_tokens': 1000,
}

response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(prompt)
)

response_dict = json.loads(response['Body'].read().decode("utf-8"))
response_content = response_dict['choices'][0]['message']['content']
print(response_content)

<think>
Okay, so I need to list three countries and their capitals. Hmm, let me think about how to approach this. First, I should probably pick countries that I'm somewhat familiar with. Maybe I can start with some nearby ones or ones I've heard about in the news.

Let me consider the United States. I know their capital is Washington, D.C. That's a good one. Now, where else? Maybe a European country. France's capital is Paris, right? That's a major city I've heard of. Okay, so that's two down.

Now, for the third country, I should pick one that's a bit different. Maybe an Asian country. Japan comes to mind. I believe their capital is Tokyo. Yeah, that sounds right. I've heard Tokyo mentioned a lot in the context of Japan's government.

Wait, let me double-check to make sure I'm not mixing up capitals. Sometimes I get confused between countries that have similar-sounding names. For example, I know that Germany's capital is Berlin, but I didn't list that. And I'm pretty sure the UK's cap
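
When you reuse an existing endpoint by name, it can be worth confirming that it is `InService` before sending requests. The sketch below polls the SageMaker `describe_endpoint` API. The helper `wait_until_in_service`, its polling interval, and its timeout are assumptions for the example; boto3 also provides a built-in `endpoint_in_service` waiter that serves the same purpose.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll describe_endpoint until the endpoint reports InService.

    sm_client is expected to be a boto3 'sagemaker' client (not 'sagemaker-runtime').
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} is in Failed state")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after timeout")
        time.sleep(poll_seconds)
```

A call might look like `wait_until_in_service(boto3.client('sagemaker', region_name=region), endpoint_name)`.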

### Streaming response from the endpoint with boto3

In [None]:
request_body = {
    'model':'mymodel',
    'messages':[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    'temperature': 0,
    'max_tokens': 1000,
    'top_logprobs': 2,
    'logprobs': True,
    'stream': True,
    # 'stream_options': {'include_usage': True}
}

response = sagemaker_runtime.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(request_body),
    ContentType="application/json"
)

# Gets the EventStream object returned by the SDK:
response_stream = TokenIterator(response['Body'])
for token in response_stream:
    print(token, end="", flush=True)

<think>
Okay, so I need to list three countries and their capitals. Hmm, let me think about how to approach this. First, I should probably pick countries that I'm somewhat familiar with. Maybe I can start with some nearby ones or ones I've heard about in the news.

Let me consider the United States. I know their capital is Washington, D.C. That's a good one. Now, where else? Maybe a European country. France's capital is Paris, right? That's a major city I've heard of. Okay, so that's two down.

Now, for the third country, I should pick one that's a bit different. Maybe an Asian country. Japan comes to mind. I believe their capital is Tokyo. Yeah, that sounds right. I've heard Tokyo mentioned a lot in the context of Japan's government.

Wait, let me double-check to make sure I'm not mixing up capitals. Sometimes I get confused between countries that have similar-sounding names. For example, I know that Germany's capital is Berlin, but I didn't list that. And I'm pretty sure the UK's cap

### Clean up the environment

Make sure to delete the endpoint and other artifacts that were created to avoid unnecessary cost. You can also go to the SageMaker AI console to delete all the resources created in this example.

In [None]:
predictor.delete_model()
predictor.delete_endpoint()
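
If the `predictor` object is no longer available, the same cleanup can be done by endpoint name through the boto3 `sagemaker` client. This is a sketch under that assumption: `delete_endpoint_resources` is a hypothetical helper that looks up the endpoint configuration and its models before deleting the endpoint, the configuration, and the models.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint, its endpoint config, and the models it references.

    sm_client is expected to be a boto3 'sagemaker' client.
    """
    # Resolve dependent resources first, while the endpoint still exists.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [
        variant["ModelName"]
        for variant in config["ProductionVariants"]
        if "ModelName" in variant
    ]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for model_name in model_names:
        sm_client.delete_model(ModelName=model_name)

    return {"endpoint": endpoint_name, "endpoint_config": config_name, "models": model_names}
```

A call might look like `delete_endpoint_resources(boto3.client('sagemaker', region_name=region), endpoint_name)`.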

### References

- [DeepSeek-R1 Model Card](https://huggingface.co/deepseek-ai/DeepSeek-R1#usage-recommendations)
- [sagemaker-genai-hosting-examples/Deepseek/SGLang-Deepseek/deepseek-r1-llama-70b-sglang.ipynb](https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/Deepseek/SGLang-Deepseek/deepseek-r1-llama-70b-sglang.ipynb)