<div style="background-color: #FFDDDD; border-left: 5px solid red; padding: 10px; color: black;">
    <strong>Kernel:</strong> Python 3 (ipykernel)
</div>

# 🚀 Deploy `deepseek-ai/DeepSeek-R1-Distill-Llama-8B` on Amazon SageMaker

To start off, let's install the packages these notebooks depend on. **Restart the kernel after the packages have been installed.**

In [7]:
%pip install -r ./scripts/requirements.txt --upgrade

Collecting transformers (from -r ./scripts/requirements.txt (line 1))
  Downloading transformers-4.51.0-py3-none-any.whl.metadata (38 kB)
Collecting peft (from -r ./scripts/requirements.txt (line 2))
  Downloading peft-0.15.1-py3-none-any.whl.metadata (13 kB)
Collecting accelerate (from -r ./scripts/requirements.txt (line 3))
  Downloading accelerate-1.6.0-py3-none-any.whl.metadata (19 kB)
Collecting bitsandbytes (from -r ./scripts/requirements.txt (line 4))
  Downloading bitsandbytes-0.45.4-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting datasets (from -r ./scripts/requirements.txt (line 5))
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting sagemaker (from -r ./scripts/requirements.txt (line 10))
  Downloading sagemaker-2.243.0-py3-none-any.whl.metadata (16 kB)
Collecting trl (from -r ./scripts/requirements.txt (line 15))
  Downloading trl-0.16.1-py3-none-any.whl.metadata (12 kB)
Downloading transformers-4.51.0-py3-none-any.whl (10.4 MB)

In [1]:
import os
import sagemaker
from sagemaker.djl_inference import DJLModel
from ipywidgets import Dropdown

import sys
# Make the parent directory importable so the shared `utilities` package can be found.
sys.path.append(os.path.dirname(os.getcwd()))

from utilities.helpers import (
    pretty_print_html, 
    set_meta_llama_params,
    print_dialog,
    format_messages,
    write_eula
)



sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml


In [2]:
session = sagemaker.Session()
role = sagemaker.get_execution_role()
default_bucket = session.default_bucket()

print(f"Execution Role: {role}")
print(f"Default S3 Bucket: {default_bucket}")

Execution Role: arn:aws:iam::058264176820:role/amazon-sagemaker-base-executionrole
Default S3 Bucket: sagemaker-us-east-1-058264176820


## Deploy Model to SageMaker Hosting

### Step 1: Retrieve the SageMaker LMI container image to host DeepSeek

In [5]:
inference_image_uri = sagemaker.image_uris.retrieve(
    framework="djl-lmi", 
    region=session.boto_session.region_name, 
    version="latest"
)
pretty_print_html(f"using image to host: {inference_image_uri}")
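With `version="latest"`, the resolved tag depends on the image catalog bundled with the installed SageMaker SDK, so the URI printed above can change after `pip install --upgrade`. Recording the exact tag makes a deployment easier to reproduce. The helper below, `split_image_uri`, is a hypothetical sketch (not part of the SageMaker SDK) that splits an ECR image URI of the form `registry/repository:tag`; the URI in the example is made up for illustration.

```python
from typing import NamedTuple


class ImageRef(NamedTuple):
    registry: str
    repository: str
    tag: str


def split_image_uri(uri: str) -> ImageRef:
    """Split an image URI of the form registry/repository:tag.

    Hypothetical helper. It assumes an explicit tag (not an @sha256 digest),
    which is the usual shape of SDK-resolved image URIs.
    """
    registry, _, rest = uri.partition("/")
    if not rest:
        raise ValueError(f"no repository in image URI: {uri!r}")
    # The tag follows the last colon of the repository part.
    repository, sep, tag = rest.rpartition(":")
    if not sep:
        raise ValueError(f"no tag in image URI: {uri!r}")
    return ImageRef(registry, repository, tag)


# Made-up URI with the same shape as an ECR image reference.
ref = split_image_uri("123456789012.dkr.ecr.us-east-1.amazonaws.com/example-repo:1.0.0-example")
print(ref.repository, ref.tag)  # example-repo 1.0.0-example
```

In the notebook, `split_image_uri(inference_image_uri).tag` gives the resolved tag to note alongside your results.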

### Step 2: Deploy the model with `sagemaker.Model` and the LMI image

In [11]:
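# All values are strings because SageMaker passes them to the container as environment variables.
inference_llm_config = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    # Only needed for gated or private models; this model is publicly available.
    "HF_API_TOKEN": "",
    # Maximum sequence length (prompt plus generated tokens) the server accepts.
    "OPTION_MAX_MODEL_LEN": "4096",
    # Fraction of GPU memory the inference engine may reserve.
    "OPTION_GPU_MEMORY_UTILIZATION": "0.8",
    "OPTION_ENABLE_STREAMING": "false",
    # Let LMI pick the rolling-batch (continuous batching) backend.
    "OPTION_ROLLING_BATCH": "auto",
    # Seconds to wait for the weights to download and load.
    "OPTION_MODEL_LOADING_TIMEOUT": "3600",
    # "OPTION_OUTPUT_FORMATTER": "jsonlines",
    "OPTION_PAGED_ATTENTION": "false",
    # Load the weights in half precision.
    "OPTION_DTYPE": "fp16",
}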
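Because the `CreateModel` API expects container environment keys and values to be strings, an unquoted number or boolean in this dictionary fails only at deployment time. A quick local check can catch that earlier. The function below, `check_lmi_env`, is a hypothetical pre-flight sketch; the range it enforces for `OPTION_GPU_MEMORY_UTILIZATION` (a fraction in (0, 1]) and the positive-integer rule for `OPTION_MAX_MODEL_LEN` are assumptions based on how those settings are used here, not rules taken from the LMI documentation.

```python
from typing import Any, List, Mapping


def check_lmi_env(env: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in a container env mapping (empty if none).

    Hypothetical helper, not part of the SageMaker SDK.
    """
    problems = []
    for key, value in env.items():
        # Container environment variables are passed as strings.
        if not isinstance(key, str) or not isinstance(value, str):
            problems.append(f"{key!r}: expected str key and value, got {type(value).__name__}")

    util = env.get("OPTION_GPU_MEMORY_UTILIZATION")
    if isinstance(util, str):
        try:
            fraction = float(util)
        except ValueError:
            problems.append(f"OPTION_GPU_MEMORY_UTILIZATION is not a number: {util!r}")
        else:
            # Assumed valid range: a fraction of total GPU memory.
            if not 0.0 < fraction <= 1.0:
                problems.append(f"OPTION_GPU_MEMORY_UTILIZATION outside (0, 1]: {util!r}")

    max_len = env.get("OPTION_MAX_MODEL_LEN")
    if isinstance(max_len, str) and not (max_len.isdigit() and int(max_len) > 0):
        problems.append(f"OPTION_MAX_MODEL_LEN should be a positive integer string: {max_len!r}")
    return problems


print(check_lmi_env({"OPTION_MAX_MODEL_LEN": "4096", "OPTION_GPU_MEMORY_UTILIZATION": "0.8"}))  # []
```

In the notebook, `assert not check_lmi_env(inference_llm_config)` before building the model would stop on any of these problems.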

In [12]:
model_name = "DeepSeek-R1-Distill-Llama-8B"

lmi_model = sagemaker.Model(
    image_uri=inference_image_uri,
    env=inference_llm_config,
    role=role,
    name=model_name
)
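SageMaker model and endpoint names must match the pattern `^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}` and be at most 63 characters long: letters, digits, and hyphens only, starting and ending with a letter or digit. Underscores and dots, which appear in many Hugging Face model IDs, are rejected. The hypothetical helper below checks a name locally; the endpoint name derived in the next cell (`model_name` plus `-endpoint`) can be checked the same way.

```python
import re

# Naming pattern documented for SageMaker ModelName and EndpointName.
_SAGEMAKER_NAME = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}")


def is_valid_sagemaker_name(name: str) -> bool:
    """Return True if `name` is usable as a SageMaker model or endpoint name."""
    # The pattern alone allows extra hyphens, so the 63-character cap is checked separately.
    return len(name) <= 63 and _SAGEMAKER_NAME.fullmatch(name) is not None


for candidate in ("DeepSeek-R1-Distill-Llama-8B", "DeepSeek-R1-Distill-Llama-8B-endpoint", "DeepSeek_R1"):
    print(candidate, is_valid_sagemaker_name(candidate))
```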

In [None]:
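base_endpoint_name = f"{model_name}-endpoint"

# sagemaker.Model has no predictor_cls, so deploy() creates the endpoint but
# returns None; a Predictor is attached to the endpoint in the inference section.
predictor = lmi_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name=base_endpoint_name
)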

------------------------------------------

## Run Inference

In [None]:
import sagemaker
from datasets import load_dataset

In [None]:
sagemaker_session = sagemaker.Session()

In [None]:
model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

endpoint_name = base_endpoint_name

In [None]:
predictor = sagemaker.Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)
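A `sagemaker.Predictor` with JSON serializer and deserializer wraps the SageMaker Runtime `InvokeEndpoint` API. Outside the SageMaker SDK (for example, in a Lambda function), the same request can be sent with boto3. The sketch below takes the runtime client as a parameter so it can be exercised without AWS credentials; in practice you would pass `boto3.client("sagemaker-runtime")`. The function name `invoke_json` is hypothetical.

```python
import json
from typing import Any


def invoke_json(runtime_client, endpoint_name: str, payload: dict) -> Any:
    """Send a JSON payload to a SageMaker endpoint and decode the JSON reply.

    `runtime_client` is expected to be a boto3 "sagemaker-runtime" client.
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
    )
    # The body is a streaming object: read all bytes, then parse them as JSON.
    return json.loads(response["Body"].read())


# Usage against the deployed endpoint (requires AWS credentials):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# invoke_json(runtime, endpoint_name, {"inputs": "Hello", "parameters": {"max_new_tokens": 64}})
```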

In [None]:
def create_summarization_prompts(data_point):
    # Llama 3-style chat prompt. The template text is kept flush-left so no code
    # indentation leaks into the prompt, and the assistant turn is left open
    # (no closing <|eot_id|>) so that the model writes the summary.
    full_prompt = f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are an AI assistant trained to summarize conversations. Provide a concise summary of the dialogue, capturing the key points and overall context.<|eot_id|><|start_header_id|>user<|end_header_id|>

Summarize the following conversation:

{data_point['dialogue']}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Here's a concise summary of the conversation in a single sentence:
"""
    return {"prompt": full_prompt}
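The prompt above uses Llama 3 chat-header tokens. The DeepSeek-R1 distilled checkpoints ship their own chat template in `tokenizer_config.json` on the Hugging Face Hub, built around `<｜begin▁of▁sentence｜>`, `<｜User｜>`, and `<｜Assistant｜>` markers, and DeepSeek's usage recommendations for the R1 series advise putting all instructions in the user turn instead of a system prompt. The sketch below builds a prompt in that style by hand. Treat the marker strings as an assumption to verify, for example against `AutoTokenizer.from_pretrained(model_id).apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`. If you switch to this format, the matching stop string would presumably be `<｜end▁of▁sentence｜>` rather than the Llama 3 tokens in the request's `stop` list. The function name is hypothetical.

```python
# Marker strings assumed from the DeepSeek-R1-Distill chat template; verify
# them against the model's tokenizer before relying on this sketch.
BOS = "<｜begin▁of▁sentence｜>"
USER = "<｜User｜>"
ASSISTANT = "<｜Assistant｜>"


def create_deepseek_summarization_prompt(dialogue: str) -> dict:
    """Build a single-turn, DeepSeek-R1-style summarization prompt (hypothetical helper)."""
    # All instructions go in the user turn; there is no system prompt.
    user_message = (
        "Summarize the following conversation in a single sentence, "
        "capturing the key points and overall context.\n\n"
        f"{dialogue}"
    )
    # End with the assistant marker so the model writes the reply. Opening the
    # reply with "<think>\n" follows DeepSeek's recommendation for R1 models.
    return {"prompt": f"{BOS}{USER}{user_message}{ASSISTANT}<think>\n"}
```

With the dataset loaded, `create_deepseek_summarization_prompt(random_row["dialogue"])` returns a payload-ready dictionary of the same shape as `create_summarization_prompts`.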

In [None]:
from pprint import pprint

# HF dataset that we will be working with
dataset_name = "Samsung/samsum"

# Load the test split from the Hugging Face Hub
dataset = load_dataset(dataset_name, split="test")

# Pick one random conversation to summarize
random_row = dataset.shuffle().select(range(1))[0]

random_prompt = create_summarization_prompts(random_row)
pprint(random_prompt)

In [None]:
response = predictor.predict(
    {
        "inputs": random_prompt["prompt"],
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 200,
            "top_p": 0.95,
            "top_k": 50,
            "temperature": 0.7,
            "stop": ["<|eot_id|>", "<|end_of_text|>"],
        },
    }
)

response['generated_text']
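R1-style models typically write their reasoning between `<think>` and `</think>` before the final answer, so `generated_text` may hold both the reasoning and the summary. With `max_new_tokens` set to 200, generation may also stop before the closing tag. The hypothetical helper below separates the two parts, returns `None` as the answer when reasoning was started but never closed, and treats output with no markers at all as a plain answer. Because a prompt can itself open the `<think>` block, the `think_opened_in_prompt` flag (also hypothetical) lets the caller say so.

```python
from typing import Optional, Tuple

OPEN, CLOSE = "<think>", "</think>"


def split_reasoning(text: str, think_opened_in_prompt: bool = False) -> Tuple[str, Optional[str]]:
    """Split R1-style output into (reasoning, answer); answer is None if unfinished."""
    head, sep, tail = text.partition(CLOSE)
    head = head.strip()
    if head.startswith(OPEN):
        head = head[len(OPEN):].strip()
    if sep:
        # Closed reasoning block: everything after </think> is the answer.
        return head, tail.strip()
    if think_opened_in_prompt or text.lstrip().startswith(OPEN):
        # Reasoning started but never closed, e.g. cut off by max_new_tokens.
        return head, None
    # No reasoning markers at all: treat the whole output as the answer.
    return "", text.strip()


reasoning, answer = split_reasoning("<think>\nThe speakers plan lunch.\n</think>\n\nAnna and Tom agree to meet for lunch.")
print(answer)  # Anna and Tom agree to meet for lunch.
```

In the notebook, `split_reasoning(response["generated_text"])` applies it to the endpoint's reply; if the answer comes back as `None`, raising `max_new_tokens` is the first thing to try.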