# Run DeepSeek R1 Llama 70B efficiently on Amazon SageMaker AI with SGLang

> This notebook has been tested on the Python 3 kernel of a SageMaker Jupyter notebook instance (ml.m5.xlarge) with 50 GB of disk.

Amazon SageMaker AI provides the ability to build Docker containers to run on SageMaker endpoints, where they listen for health checks on /ping and receive real-time inference requests on /invocations. Using SageMaker AI for inference offers several benefits:

- **Scalability**: SageMaker AI can automatically scale your inference endpoints up and down based on demand, ensuring your models can handle varying workloads.
- **High Availability**: SageMaker AI manages the infrastructure and maintains the availability of your inference endpoints, so you don't have to worry about managing the underlying resources.
- **Monitoring and Logging**: SageMaker AI provides built-in monitoring and logging capabilities, making it easier to track the performance and health of your inference endpoints.
- **Security**: SageMaker AI integrates with other AWS services, such as AWS Identity and Access Management (IAM), to provide robust security controls for your inference workloads.

Note that SageMaker AI provides [pre-built SageMaker AI Docker images](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html) that help you get started quickly with model inference on SageMaker AI. It also allows you to [bring your own Docker container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-inference-container.html) and use it inside SageMaker AI for training and inference. To be compatible with SageMaker AI, your container must have the following characteristics:

- Your container must have a web server listening on port 8080.
- Your container must accept POST requests to the /invocations endpoint for real-time inference and answer the health checks (HTTP GET requests) that SageMaker AI sends to /ping.
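
As a rough illustration of this contract, the sketch below serves both routes using only Python's standard library. It is not what SGLang runs internally; the class name `SageMakerContractHandler`, the helper `start_server`, and the echo "model" are hypothetical and exist only to show the shape of the two endpoints (SageMaker AI's health check is an HTTP GET on /ping, inference is a POST on /invocations).

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class SageMakerContractHandler(BaseHTTPRequestHandler):
    """Hypothetical handler illustrating the two routes SageMaker AI calls."""

    def _reply(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health check: answer 200 once the model is ready to serve.
        if self.path == "/ping":
            self._reply(200, {"status": "ok"})
        else:
            self._reply(404, {"error": "not found"})

    def do_POST(self):
        # Inference: read the request body and return a response.
        if self.path != "/invocations":
            self._reply(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        # Placeholder "model": echo the request back to the caller.
        self._reply(200, {"echo": request})

    def log_message(self, format, *args):
        # Silence per-request logging in this sketch.
        pass


def start_server(port=8080):
    """Start the server in a background thread and return it."""
    server = HTTPServer(("0.0.0.0", port), SageMakerContractHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `start_server()` binds to port 8080, the port SageMaker AI expects; passing `port=0` lets the operating system pick a free port for local experiments.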

In this notebook, we'll demonstrate how to adapt the [**SGLang**](https://github.com/sgl-project/sglang) framework to run on SageMaker AI endpoints. SGLang is a serving framework for large language models that provides state-of-the-art performance, including a fast backend runtime for efficient serving with RadixAttention, extensive model support, and an active open-source community. For more information refer to https://docs.sglang.ai/index.html and https://github.com/sgl-project/sglang.

By using SGLang and building a custom Docker container, you can run advanced AI models like the [DeepSeek R1 Llama 70B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B) on a SageMaker AI endpoint.


## Prepare the SGLang SageMaker container

SageMaker AI makes extensive use of Docker containers for build and runtime tasks. Using containers, you can train machine learning algorithms and deploy models quickly and reliably at any scale. See [this link](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-code-run-image) to understand how SageMaker AI runs your inference image. 

- For model inference, SageMaker AI runs the container as:
```
docker run image serve
```

- You can provide your entrypoint script in `exec` form to tell the container how to run the inference process, for example:
```
ENTRYPOINT ["python", "inference.py"]
```

- When deploying ML models, one option is to archive and compress the model artifacts into a `tar.gz` file and provide the S3 path of the archive as the `ModelDataUrl` in the [`CreateModel`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) API request. SageMaker AI copies the archive from the S3 location and decompresses it into the `/opt/ml/model` directory before your container starts, so that your inference code can use it. For large models, however, SageMaker AI also allows you to [deploy uncompressed models](https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-uncompressed.html). In this example, we will show you how to use the uncompressed DeepSeek R1 Distilled Llama 70B model.

- To receive inference requests, the container must have a web server listening on port `8080` that accepts `POST` requests on `/invocations` and responds to health checks (`GET` requests) on `/ping`.
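
To make the uncompressed-model option more concrete, here is a minimal sketch of the request body that the `CreateModel` API accepts when the artifacts live under an S3 prefix rather than in a `tar.gz` archive. The function name `build_create_model_request` and its arguments are hypothetical; the field names (`PrimaryContainer`, `ModelDataSource`, `S3DataSource`, `S3DataType`, `CompressionType`) follow the `CreateModel` API reference.

```python
def build_create_model_request(model_name, image_uri, role_arn, s3_prefix, env=None):
    """Build keyword arguments for the SageMaker `CreateModel` API (sketch)."""
    # With S3DataType "S3Prefix", the URI is a key prefix and ends with "/".
    if not s3_prefix.endswith("/"):
        s3_prefix += "/"
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": {
            "Image": image_uri,
            "ModelDataSource": {
                "S3DataSource": {
                    "S3Uri": s3_prefix,
                    "S3DataType": "S3Prefix",
                    # "None" tells SageMaker AI the files are not archived.
                    "CompressionType": "None",
                }
            },
            # Environment variables are passed to the container at startup.
            "Environment": dict(env or {}),
        },
    }
```

The resulting dictionary could be passed to boto3 as `boto3.client("sagemaker").create_model(**request)`. Later in this notebook the SageMaker Python SDK `Model` class builds an equivalent request for you.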

The diagram below shows, at a high level, how to prepare your own container image so that it is compatible with SageMaker AI hosting.

![inference](./img/sagemaker-real-time-inference.png)


If you already have a Docker image, see the instructions for [adapting your own inference container for SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-inference-container.html). Note also that the containers provided by SageMaker AI automatically implement a web server that responds to `/invocations` and `/ping` (health check) requests. You can find out more about the [prebuilt SageMaker AI Docker images for deep learning in the SageMaker documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/pre-built-containers-frameworks-deep-learning.html).




#### Create the entrypoint serve file 
The `serve` file is used as the `exec`-form entrypoint and is executed when the container starts. The main command to start SGLang in the SageMaker Docker image is:
```
python3 -m sglang.launch_server --model-path <your model path> --host 0.0.0.0 --port 8080
```
Here, `--model-path` can be set to `/opt/ml/model`, because this is where SageMaker AI copies the model artifacts from S3 on the endpoint, and `--port` is set to **8080** as required by SageMaker hosting.

In [1]:
%%writefile serve
#!/bin/bash

echo "Starting server"

SERVER_ARGS="--host 0.0.0.0 --port 8080"

if [ -n "$TENSOR_PARALLEL_DEGREE" ]; then
    SERVER_ARGS="${SERVER_ARGS} --tp-size ${TENSOR_PARALLEL_DEGREE}"
fi

if [ -n "$DATA_PARALLEL_DEGREE" ]; then
    SERVER_ARGS="${SERVER_ARGS} --dp-size ${DATA_PARALLEL_DEGREE}"
fi

if [ -n "$EXPERT_PARALLEL_DEGREE" ]; then
    SERVER_ARGS="${SERVER_ARGS} --ep-size ${EXPERT_PARALLEL_DEGREE}"
fi

if [ -n "$MEM_FRACTION_STATIC" ]; then
    SERVER_ARGS="${SERVER_ARGS} --mem-fraction-static ${MEM_FRACTION_STATIC}"
fi

if [ -n "$QUANTIZATION" ]; then
    SERVER_ARGS="${SERVER_ARGS} --quantization ${QUANTIZATION}"
fi

if [ -n "$CHUNKED_PREFILL_SIZE" ]; then
    SERVER_ARGS="${SERVER_ARGS} --chunked-prefill-size ${CHUNKED_PREFILL_SIZE}"
fi

if [ -n "$MODEL_ID" ]; then
    SERVER_ARGS="${SERVER_ARGS} --model-path ${MODEL_ID}"
else
    SERVER_ARGS="${SERVER_ARGS} --model-path /opt/ml/model"
fi

if [ -n "$TORCH_COMPILE" ]; then
    SERVER_ARGS="${SERVER_ARGS} --enable-torch-compile"
fi

if [ -n "$TORCHAO_CONFIG" ]; then
    SERVER_ARGS="${SERVER_ARGS} --torchao-config ${TORCHAO_CONFIG}"
fi

if [ -n "$KV_CACHE_DTYPE" ]; then
    SERVER_ARGS="${SERVER_ARGS} --kv-cache-dtype ${KV_CACHE_DTYPE}"
fi

python3 -m sglang.launch_server $SERVER_ARGS

Writing serve
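
The mapping that `serve` performs from environment variables to `sglang.launch_server` flags can be sketched in Python as follows. This is only an illustration for reasoning about (or unit-testing) the configuration: the names `VALUE_FLAGS` and `build_server_command` are hypothetical, the flags are the ones used in the script above, and the argument order differs slightly from the shell version.

```python
# Environment variable -> sglang.launch_server flag, as in the serve script.
VALUE_FLAGS = {
    "TENSOR_PARALLEL_DEGREE": "--tp-size",
    "DATA_PARALLEL_DEGREE": "--dp-size",
    "EXPERT_PARALLEL_DEGREE": "--ep-size",
    "MEM_FRACTION_STATIC": "--mem-fraction-static",
    "QUANTIZATION": "--quantization",
    "CHUNKED_PREFILL_SIZE": "--chunked-prefill-size",
    "TORCHAO_CONFIG": "--torchao-config",
    "KV_CACHE_DTYPE": "--kv-cache-dtype",
}


def build_server_command(env):
    """Return the launch command that `serve` would run for a given environment."""
    args = ["python3", "-m", "sglang.launch_server",
            "--host", "0.0.0.0", "--port", "8080"]
    for variable, flag in VALUE_FLAGS.items():
        value = env.get(variable)
        if value:  # mirrors the shell test [ -n "$VAR" ]
            args += [flag, value]
    # MODEL_ID overrides the default SageMaker model directory.
    args += ["--model-path", env.get("MODEL_ID") or "/opt/ml/model"]
    # TORCH_COMPILE is a switch: any non-empty value enables it.
    if env.get("TORCH_COMPILE"):
        args.append("--enable-torch-compile")
    return args
```

For example, `build_server_command({"TENSOR_PARALLEL_DEGREE": "8"})` yields the command used for the endpoint created later in this notebook, with `--tp-size 8` and the default `/opt/ml/model` path.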


SGLang provides a base [Dockerfile here](https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile). You can directly extend the base image with

```
# Extend from the base sglang image
FROM lmsysorg/sglang:latest
```

In this example, we have copied the whole base Dockerfile and added the lines below to make it compatible with SageMaker AI:
```
COPY serve /usr/bin/serve
RUN chmod 777 /usr/bin/serve

ENTRYPOINT [ "/usr/bin/serve" ]
```
You can add additional layers to the container image to accommodate your specific use case.

In [2]:
%%writefile Dockerfile
ARG CUDA_VERSION=12.5.1

FROM nvcr.io/nvidia/tritonserver:24.04-py3-min

ARG BUILD_TYPE=all
ENV DEBIAN_FRONTEND=noninteractive

RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
    && echo 'tzdata tzdata/Zones/America select Los_Angeles' | debconf-set-selections \
    && apt update -y \
    && apt install software-properties-common -y \
    && add-apt-repository ppa:deadsnakes/ppa -y && apt update \
    && apt install python3.10 python3.10-dev -y \
    && update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1 \
    && update-alternatives --set python3 /usr/bin/python3.10 && apt install python3.10-distutils -y \
    && apt install curl git sudo libibverbs-dev -y \
    && apt install -y rdma-core infiniband-diags openssh-server perftest ibverbs-providers libibumad3 libibverbs1 libnl-3-200 libnl-route-3-200 librdmacm1 \
    && curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py \
    && python3 --version \
    && python3 -m pip --version \
    && rm -rf /var/lib/apt/lists/* \
    && apt clean

# For openbmb/MiniCPM models
RUN pip3 install datamodel_code_generator

WORKDIR /sgl-workspace

ARG CUDA_VERSION
RUN python3 -m pip install --upgrade pip setuptools wheel html5lib six \
    && git clone --depth=1 https://github.com/sgl-project/sglang.git \
    && if [ "$CUDA_VERSION" = "12.1.1" ]; then \
         python3 -m pip install torch --index-url https://download.pytorch.org/whl/cu121; \
       elif [ "$CUDA_VERSION" = "12.4.1" ]; then \
         python3 -m pip install torch --index-url https://download.pytorch.org/whl/cu124; \
       elif [ "$CUDA_VERSION" = "12.5.1" ]; then \
         python3 -m pip install torch --index-url https://download.pytorch.org/whl/cu124; \
       elif [ "$CUDA_VERSION" = "11.8.0" ]; then \
         python3 -m pip install torch --index-url https://download.pytorch.org/whl/cu118; \
         python3 -m pip install sgl-kernel -i https://docs.sglang.ai/whl/cu118; \
       else \
         echo "Unsupported CUDA version: $CUDA_VERSION" && exit 1; \
       fi \
    && cd sglang \
    && if [ "$BUILD_TYPE" = "srt" ]; then \
         if [ "$CUDA_VERSION" = "12.1.1" ]; then \
           python3 -m pip --no-cache-dir install -e "python[srt]" --find-links https://flashinfer.ai/whl/cu121/torch2.5/flashinfer-python; \
         elif [ "$CUDA_VERSION" = "12.4.1" ]; then \
           python3 -m pip --no-cache-dir install -e "python[srt]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python; \
         elif [ "$CUDA_VERSION" = "12.5.1" ]; then \
           python3 -m pip --no-cache-dir install -e "python[srt]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python; \
         elif [ "$CUDA_VERSION" = "11.8.0" ]; then \
           python3 -m pip --no-cache-dir install -e "python[srt]" --find-links https://flashinfer.ai/whl/cu118/torch2.5/flashinfer-python; \
           python3 -m pip install sgl-kernel -i https://docs.sglang.ai/whl/cu118; \
         else \
           echo "Unsupported CUDA version: $CUDA_VERSION" && exit 1; \
         fi; \
       else \
         if [ "$CUDA_VERSION" = "12.1.1" ]; then \
           python3 -m pip --no-cache-dir install -e "python[all]" --find-links https://flashinfer.ai/whl/cu121/torch2.5/flashinfer-python; \
         elif [ "$CUDA_VERSION" = "12.4.1" ]; then \
           python3 -m pip --no-cache-dir install -e "python[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python; \
         elif [ "$CUDA_VERSION" = "12.5.1" ]; then \
           python3 -m pip --no-cache-dir install -e "python[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python; \
         elif [ "$CUDA_VERSION" = "11.8.0" ]; then \
           python3 -m pip --no-cache-dir install -e "python[all]" --find-links https://flashinfer.ai/whl/cu118/torch2.5/flashinfer-python; \
           python3 -m pip install sgl-kernel -i https://docs.sglang.ai/whl/cu118; \
         else \
           echo "Unsupported CUDA version: $CUDA_VERSION" && exit 1; \
         fi; \
       fi

ENV DEBIAN_FRONTEND=interactive

COPY serve /usr/bin/serve
RUN chmod 777 /usr/bin/serve

ENTRYPOINT [ "/usr/bin/serve" ]

Writing Dockerfile


Next, we need to create an ECR repository for the custom Docker image, build the image locally, and push it to the ECR repository. Make sure that the IAM role you use here has permission to push to ECR.

The cell below might take some time, so please be patient. If you have already built the Docker image in another development environment, feel free to skip it.

In [None]:
%%sh
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
REGION=$(aws configure get region)
REPOSITORY_NAME=sglang-sagemaker

# Create ECR repository if needed
if aws ecr describe-repositories --repository-names "${REPOSITORY_NAME}" >/dev/null 2>&1; then
    echo "Repository ${REPOSITORY_NAME} already exists"
else
    echo "Creating ECR repository ${REPOSITORY_NAME}..."
    aws ecr create-repository \
        --repository-name "${REPOSITORY_NAME}" \
        --region "${REGION}"
fi

# Build the Docker image and push it to the ECR repository
docker build -t sglang .
aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com
docker tag sglang:latest $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPOSITORY_NAME:latest
docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPOSITORY_NAME:latest

### Create SageMaker AI endpoint for DeepSeek R1 distilled Llama 70B model
In this example, we will use the DeepSeek R1 distilled Llama 70B model artifacts directly from [SageMaker JumpStart](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html). This saves you the time it would take to download the model from Hugging Face and upload it to S3.

SageMaker JumpStart provides pretrained, open-source models for a wide range of problem types to help you get started with machine learning. You can incrementally train and tune these models before deployment. JumpStart also provides solution templates that set up infrastructure for common use cases, and executable example notebooks for machine learning with SageMaker AI.

You can see the DeepSeek model on SageMaker JumpStart in SageMaker AI Studio as shown below:

![deepseek-jumpstart](./img/jumpstart-deepseek-model.png)

In [14]:
from sagemaker.jumpstart.model import JumpStartModel
model_id, model_version = "deepseek-llm-r1-distill-llama-70b", "*"

model = JumpStartModel(model_id=model_id, model_version=model_version)
model_data=model.model_data['S3DataSource']['S3Uri']
model_data

No instance type selected for inference hosting endpoint. Defaulting to ml.p4d.24xlarge.


's3://jumpstart-cache-prod-us-west-2/deepseek-llm/deepseek-llm-r1-distill-llama-70b/artifacts/inference-prepack/v1.0.0/'

Then we will create the [SageMaker model](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/model.py#L149) with the custom Docker image and the model data available on S3.

In [18]:
import sagemaker
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.session import Session
session = Session()
region = session.boto_region_name
role = sagemaker.get_execution_role()
ecr_uri = f'{session.account_id()}.dkr.ecr.{region}.amazonaws.com/sglang-sagemaker:latest'

model = Model(
    model_data={"S3DataSource": {
                    "S3Uri": model_data,
                    "S3DataType": "S3Prefix",
                    "CompressionType": "None",
                },
    },
    role=role,
    image_uri=ecr_uri,
    env={
        'TENSOR_PARALLEL_DEGREE': '8'
    },
    predictor_cls=Predictor
)

You can simply call the [`deploy` function](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/model.py#L149) from the SageMaker Model class to deploy the model to an endpoint and it will return a [`Predictor`](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/base_predictor.py#L98) object to perform invocation against this endpoint.

In [19]:
predictor = model.deploy(
    initial_instance_count=1, 
    instance_type='ml.g5.48xlarge', # you can also change to p4d.24xlarge
    serializer=JSONSerializer(), 
    deserializer=JSONDeserializer()
)

---------------!
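
By default, `deploy` waits until the endpoint is in service. If you start a deployment with `wait=False`, or return to an endpoint from another session, a polling loop along the following lines can be used. This is a sketch: `wait_for_endpoint` and its parameters are hypothetical, while `describe_endpoint` and the `EndpointStatus` field come from the boto3 SageMaker client.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30,
                      timeout_seconds=3600, sleep=time.sleep):
    """Poll DescribeEndpoint until the endpoint is InService or has failed."""
    waited = 0
    while True:
        description = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = description["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = description.get("FailureReason", "unknown")
            raise RuntimeError(f"Endpoint {endpoint_name} failed: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(
                f"Endpoint {endpoint_name} still {status} after {waited}s")
        # Any other status (for example "Creating") means: keep waiting.
        sleep(poll_seconds)
        waited += poll_seconds
```

A typical call would be `wait_for_endpoint(boto3.client("sagemaker"), endpoint_name)`. boto3 also ships a built-in waiter, `get_waiter("endpoint_in_service")`, which may be preferable when you do not need custom handling.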

#### OpenAI API chat completion interface

In [25]:
response = predictor.predict({
    'model':'mymodel',
    'messages':[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    'temperature':0,
    'max_tokens':200,
    'top_logprobs': 2,
    'logprobs': True
})
print(response['choices'][0]['message']['content'])

<think>
Okay, so I need to list three countries and their capitals. Hmm, let's see. I'm not super familiar with all the countries and capitals, but I can try to think of some that I know. 

First, I'll start with the country I live in, which is the United States. I'm pretty sure the capital is Washington, D.C. Yeah, that's right. So that's one.

Next, I think about France. I remember learning that Paris is the capital. That seems correct. So France and Paris would be the second pair.

Now, for the third one, I'm a bit unsure. I know that Japan is a country, and I think the capital is Tokyo. I've heard that before, so I'll go with that. So Japan and Tokyo.

Wait, let me double-check if I'm mixing up any capitals. Sometimes I confuse Tokyo with another city, but I think it's correct. Yeah, Tokyo is definitely the capital


#### Invoke endpoint with boto3
Note that you can also invoke the endpoint with boto3. If you have an existing endpoint, you don't need to recreate the `predictor`; follow the example below to invoke the endpoint by its name.

In [32]:
import boto3
import json
sagemaker_runtime = boto3.client('sagemaker-runtime')
endpoint_name = predictor.endpoint_name # you can manually set the endpoint name with an existing endpoint

prompt = {
    'model':'mymodel',
    'messages':[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    'temperature':0,
    'max_tokens':512,
}
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(prompt)
)
response_dict = json.loads(response['Body'].read().decode("utf-8"))
response_content = response_dict['choices'][0]['message']['content']
print(response_content)

<think>
Okay, so I need to list three countries and their capitals. Hmm, let's see. I'm not super familiar with all the countries and capitals, but I can try to think of some that I know. 

First, I'll start with the country I live in, which is the United States. I'm pretty sure the capital is Washington, D.C. Yeah, that's right. So that's one.

Next, I think about France. I remember learning that Paris is the capital. That seems correct. So France and Paris would be the second pair.

Now, for the third one, I'm a bit unsure. I know that Japan is a country, and I think the capital is Tokyo. I've heard that before, so I'll go with that. So Japan and Tokyo.

Wait, let me double-check if I'm mixing up any capitals. Sometimes I confuse Tokyo with another city, but I think it's correct. Yeah, Tokyo is definitely the capital of Japan. Okay, so I think I have three correct pairs: USA and Washington, D.C., France and Paris, and Japan and Tokyo.
</think>

Here are three countries and their capi
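
As the outputs above show, the model emits its reasoning between `<think>` and `</think>` before the final answer, and a small `max_tokens` value can cut the response off before the closing tag. A small helper along these lines can separate the two parts; `split_reasoning` is a hypothetical name, and the tag strings are taken from the outputs shown in this notebook.

```python
def split_reasoning(text, open_tag="<think>", close_tag="</think>"):
    """Split a completion into (reasoning, answer) based on the think tags."""
    start = text.find(open_tag)
    if start == -1:
        # No reasoning block at all: everything is the answer.
        return "", text.strip()
    end = text.find(close_tag, start)
    if end == -1:
        # Truncated before the closing tag: everything is reasoning.
        return text[start + len(open_tag):].strip(), ""
    reasoning = text[start + len(open_tag):end].strip()
    answer = text[end + len(close_tag):].strip()
    return reasoning, answer
```

For example, `reasoning, answer = split_reasoning(response_content)` returns an empty `answer` when the generation stopped inside the reasoning block, which is a hint to raise `max_tokens`.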

#### Streaming response from the endpoint
Additionally, SGLang allows you to invoke the endpoint and receive a streaming response. Below is an example of how to interact with the endpoint in streaming mode.

In [26]:
import io
import json

# Example class that processes an inference stream:
class SmrInferenceStream:
    
    def __init__(self, sagemaker_runtime, endpoint_name):
        self.sagemaker_runtime = sagemaker_runtime
        self.endpoint_name = endpoint_name
        # A buffered I/O stream to combine the payload parts:
        self.buff = io.BytesIO() 
        self.read_pos = 0
        
    def stream_inference(self, request_body):
        # Gets a streaming inference response 
        # from the specified model endpoint:
        response = self.sagemaker_runtime\
            .invoke_endpoint_with_response_stream(
                EndpointName=self.endpoint_name, 
                Body=json.dumps(request_body), 
                ContentType="application/json"
        )
        # Gets the EventStream object returned by the SDK:
        event_stream = response['Body']
        for event in event_stream:
            # Passes the contents of each payload part
            # to be concatenated:
            self._write(event['PayloadPart']['Bytes'])
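            # Iterates over complete lines to parse whole JSON objects:
            for line in self._readlines():
                line = line.decode('utf-8').strip()
                # Each server-sent event carries its payload after 'data:':
                if not line.startswith('data:'):
                    continue
                data = line[len('data:'):].strip()
                try:
                    resp = json.loads(data)
                except json.JSONDecodeError:
                    # Skips non-JSON payloads such as a trailing '[DONE]':
                    continue
                # The usage chunk sent at the end has an empty 'choices' list:
                choices = resp.get('choices') if isinstance(resp, dict) else None
                if not choices:
                    continue
                part = choices[0].get('delta', {}).get('content')
                # Returns non-empty parts incrementally:
                if part:
                    yield part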
    
    # Writes to the buffer to concatenate the contents of the parts:
    def _write(self, content):
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(content)

    # The JSON objects in buffer end with '\n'.
    # This method reads lines to yield a series of JSON objects.
    # A trailing partial line stays in the buffer until its '\n' arrives:
    def _readlines(self):
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            if not line.endswith(b'\n'):
                break
            self.read_pos += len(line)
            yield line[:-1]

In [27]:
request_body = {
    'model':'mymodel',
    'messages':[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    'temperature':0,
    'max_tokens':512,
    # 'top_logprobs': 2,
    # 'logprobs': True,
    'stream': True,
    'stream_options': {'include_usage': True}
}

smr_inference_stream = SmrInferenceStream(
    sagemaker_runtime, predictor.endpoint_name)
stream = smr_inference_stream.stream_inference(request_body)
for part in stream:
    print(part, end='')

<think>
Okay, so I need to list three countries and their capitals. Hmm, let's see. I'm not super familiar with all the countries and capitals, but I can try to think of some that I know. 

First, I'll start with the country I live in, which is the United States. I'm pretty sure the capital is Washington, D.C. Yeah, that's right. So that's one.

Next, I think about France. I remember learning that Paris is the capital. That seems correct. So France and Paris would be the second pair.

Now, for the third one, I'm a bit unsure. I know that Japan is a country, and I think the capital is Tokyo. I've heard that before, so I'll go with that. So Japan and Tokyo.

Wait, let me double-check if I'm mixing up any capitals. Sometimes I confuse Tokyo with another city, but I think it's correct. Yeah, Tokyo is definitely the capital of Japan. Okay, so I think I have three correct pairs: USA and Washington, D.C., France and Paris, and Japan and Tokyo.
</think>

Here are three countries and their capitals:

1. **United States** - Washington, D.C.
2. **France** - Paris
. 

## Cleanup
Make sure to delete the endpoint and other artifacts that were created to avoid unnecessary cost. You can also go to the SageMaker AI console to delete all the resources created in this example.

In [None]:
predictor.delete_endpoint()
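
If you also want to remove the SageMaker model that backs the endpoint, a variant of the cell above is sketched below; use it instead of calling `delete_endpoint()` on its own. The helper name `cleanup` is hypothetical, while `delete_model` and `delete_endpoint(delete_endpoint_config=...)` are methods of the SageMaker Python SDK `Predictor`. The ordering assumes that `delete_model` finds the model through the endpoint configuration, so it runs first.

```python
def cleanup(predictor):
    """Delete the model first, then the endpoint and its configuration."""
    # Assumption: delete_model() resolves the model name through the endpoint
    # configuration, so call it before that configuration is removed.
    predictor.delete_model()
    predictor.delete_endpoint(delete_endpoint_config=True)
```

The ECR repository and the pushed image are not touched by this helper and have to be removed separately if you no longer need them.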