# Import DeepSeek-R1-Distill-Llama Models to Amazon Bedrock

This notebook demonstrates how to import DeepSeek's distilled Llama models to Amazon Bedrock using the Custom Model Import (CMI) feature. We'll use the 8B parameter model as an example, <u>but the same process applies to the 70B variant</u>.

## Introduction

DeepSeek has released several distilled versions of its models based on the Llama architecture. These models maintain strong performance while being more efficient than their larger counterparts. The 8B model we'll use here is derived from Llama 3.1 and has been **optimized for reasoning tasks**.

## Prerequisites

- An AWS account with access to Amazon Bedrock
- Appropriate IAM roles and permissions for Bedrock and S3, following [the instructions here](https://docs.aws.amazon.com/bedrock/latest/userguide/model-import-iam-role.html)
- An S3 bucket prepared to store the custom model
- Sufficient local storage space (at least 17 GB for the 8B model and 135 GB for the 70B model)
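
The storage prerequisite can be checked before starting the download. The following is a minimal sketch using only the standard library; the helper name `has_enough_space` and the `REQUIRED_GB` table are illustrative, and the sizes are simply the figures from the list above, read as decimal gigabytes.

```python
import shutil

# Minimum free space per variant, taken from the prerequisites list
# (assumed here to mean decimal gigabytes).
REQUIRED_GB = {"8B": 17, "70B": 135}


def has_enough_space(path=".", variant="8B"):
    """Return True if the filesystem holding `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= REQUIRED_GB[variant]


print(f"Enough space for the 8B model: {has_enough_space('.', '8B')}")
```

If the check returns False, point `path` at a larger volume and use that location as the download directory.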

### Step 1: Install Required Packages

First, let's install the necessary Python packages:

In [1]:
!pip install -U huggingface_hub
!pip install boto3 --upgrade

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting huggingface_hub
  Downloading huggingface_hub-0.28.0-py3-none-any.whl.metadata (13 kB)
Downloading huggingface_hub-0.28.0-py3-none-any.whl (464 kB)
Installing collected packages: huggingface_hub
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.23.5
    Uninstalling huggingface-hub-0.23.5:
      Successfully uninstalled huggingface-hub-0.23.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
open-clip-torch 2.16.0 requires protobuf<4, but you have protobuf 4.25.3 which is incompatible.
langchain-mistralai 0.0.5 requires langchain-core<0.2.0,>=0.1.27, but you have langchain-core 0.3.20 which is incompatible.
langchain-mistralai 0.0.5 requires tokenizers<0.16.0,>=0.15.1, but you have tokenizers 0.19.1 which is incompatible.

### Step 2: Configure Parameters

Update these parameters according to your AWS environment:

In [2]:
!aws s3 ls

2024-11-12 03:56:02 976939723775-agentic-orch-bkt-1
2023-03-27 20:38:32 976939723775-us-east-1-dw-ts-lab
2024-11-12 03:56:02 976939723775-us-west-2-dw-demo
2024-11-09 01:55:23 alfab3-mlops-cross-accou-codepipelineartifactstor-18vdezbb5pz3d
2024-11-09 02:01:55 amazon-braket-us-west-2-976939723775
2024-11-06 20:48:06 artifact-bucket-976939723775
2025-01-24 17:30:51 bedrock-video-generation-us-west-2-4lm87z
2024-11-09 06:17:02 booking-agent-us-west-2-976939723775
2021-03-16 21:31:09 cloudtrail-awslogs-976939723775-jgv6pyc8-isengard-do-not-delete
2024-11-12 15:56:05 comprehend-experiment-976939723775
2024-11-12 17:06:18 custom-labels-console-us-west-2-8c0625331f
2024-11-09 18:02:42 do-not-delete-gatedgarden-audit-976939723775
2024-11-13 10:47:46 mmrag-images
2024-11-13 15:32:37 public-datasets-multimodality
2024-11-13 16:20:11 rekognition-video-console-demo-pdx-976939723775-1661203561
2024-11-13 17:35:53 sagemaker-restate-976939723775
2024-11-13 17:47:26 sagemaker-studio-976939723775-80gze

In [1]:
# Define your parameters (update these values for your environment)
bucket_name = "public-datasets-multimodality"
s3_prefix = "DeepSeek-R1-Distill-Llama-8B"  # e.g. DeepSeek-R1-Distill-Llama-8B
local_directory = "DeepSeek-R1-Distill-Llama-8B"  # e.g. DeepSeek-R1-Distill-Llama-8B

job_name = 'DeepSeek-R1-Distill-Llama-8B-job-8'  # e.g. Deepseek-8B-job
imported_model_name = 'DeepSeek-R1-Distill-Llama-8B'  # e.g. Deepseek-8B-model
# Make sure this role has the permissions listed in the prerequisites, including Bedrock execution and s3:GetObject
role_arn = 'arn:aws:iam::976939723775:role/AmazonBedrockExecutionRoleForAgents_7rbk37mm'

# Region (currently only 'us-west-2' and 'us-east-1' support CMI with DeepSeek distilled Llama models)
region_info = 'us-west-2'  # Change to 'us-east-1' if needed

## Using the SageMaker role to simplify S3 permissions

import boto3  # needed for the Bedrock client created below
import sagemaker

model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
model_name = model_id.split("/")[-1]

# Replace the below with custom value if you're not using Amazon SageMaker
session = sagemaker.Session()
default_bucket = session.default_bucket()
default_bucket_prefix = session.default_bucket_prefix
s3_model_uri = f"s3://{default_bucket}/{default_bucket_prefix}/{model_name}/"

bedrock = boto3.client(service_name='bedrock')

JOB_NAME = f"{model_name}-import-job"
IMPORTED_MODEL_NAME = f"{model_name}-bedrock"
ROLE = sagemaker.get_execution_role() # Replace with custom IAM role if not using Amazon SageMaker for development
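
One detail worth guarding against when composing the S3 URI: if no default bucket prefix is configured for the SageMaker session, the prefix value may be `None`, and a plain f-string would then embed the literal text "None" in the path. The helper below is a hypothetical sketch (the name `build_s3_uri` is not part of any SDK) that skips empty segments.

```python
def build_s3_uri(bucket, *parts):
    """Join a bucket and optional key segments into an s3:// URI ending in '/'."""
    # Drop None/empty segments and stray slashes so the URI never
    # contains "None" or a doubled "//" inside the key.
    cleaned = [str(part).strip("/") for part in parts if part]
    cleaned = [part for part in cleaned if part]
    return "/".join([f"s3://{bucket}"] + cleaned) + "/"


# Example with a made-up bucket name and no configured prefix.
print(build_s3_uri("my-bucket", None, "DeepSeek-R1-Distill-Llama-8B"))
```

With the session variables above, the call would look like `build_s3_uri(default_bucket, default_bucket_prefix, model_name)`.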


### Step 3: Download Model from Hugging Face

Download the model files from Hugging Face.

- To use the 70B model instead, change model_id to "deepseek-ai/DeepSeek-R1-Distill-Llama-70B".

<div class="alert alert-warning">
<b>Note:</b> Downloading the 8B model files may take 10-20 minutes depending on your internet connection speed.
</div>

In [4]:
from huggingface_hub import snapshot_download


snapshot_download(repo_id=model_id, local_dir=f"./{local_directory}")

Fetching 11 files:   0%|          | 0/11 [00:00<?, ?it/s]

model-00002-of-000002.safetensors:   0%|          | 0.00/7.39G [00:00<?, ?B/s]

model-00001-of-000002.safetensors:   0%|          | 0.00/8.67G [00:00<?, ?B/s]

LICENSE:   0%|          | 0.00/1.06k [00:00<?, ?B/s]

figures/benchmark.jpg:   0%|          | 0.00/777k [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/1.52k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/826 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/18.6k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/3.06k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/24.2k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.08M [00:00<?, ?B/s]

'/home/alfred/inference/notebooks/DeepSeek-R1-Distill-Llama-8B'
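
The download above also fetches files such as the README, license, and benchmark figure. As a sketch, `snapshot_download` accepts an `allow_patterns` argument that restricts the download to matching files. The pattern list below is an assumption about which files the import needs (weights, index, config, and tokenizer files); check it against the Custom Model Import documentation before relying on it.

```python
from fnmatch import fnmatch

# Assumed set of files needed for import: safetensors weights plus the JSON
# config, index, and tokenizer files.
ALLOW_PATTERNS = ["*.safetensors", "*.json"]


def is_needed(filename):
    """Return True if `filename` matches one of the allowed patterns."""
    return any(fnmatch(filename, pattern) for pattern in ALLOW_PATTERNS)


def download_model_files(repo_id, local_dir):
    """Download only the files matching ALLOW_PATTERNS from the Hugging Face Hub."""
    # Imported here so the pattern helper above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir,
        allow_patterns=ALLOW_PATTERNS,
    )


# Preview which of the files from the listing above would be kept.
for name in ["model-00001-of-000002.safetensors", "config.json", "README.md", "LICENSE"]:
    print(f"{name}: {'keep' if is_needed(name) else 'skip'}")
```

Skipping the extra files also keeps them out of the S3 upload in the next step.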

### Step 4: Upload Model to S3

Upload the downloaded model files to your S3 bucket.

<div class="alert alert-warning">
<b>Note:</b> Uploading the 8B model files normally takes 10-20 minutes.
</div>

In [3]:
import os
import time
import json
import boto3
from pathlib import Path
from tqdm import tqdm

def upload_directory_to_s3(local_directory, bucket_name, s3_prefix):
    s3_client = boto3.client('s3')
    local_directory = Path(local_directory)
    
    # Get list of all files first
    all_files = []
    for root, dirs, files in os.walk(local_directory):
        for filename in files:
            local_path = Path(root) / filename
            relative_path = local_path.relative_to(local_directory)
            s3_key = f"{s3_prefix}/{relative_path}"
            all_files.append((local_path, s3_key))
    
    # Upload with progress bar
    for local_path, s3_key in tqdm(all_files, desc="Uploading files"):
        try:
            s3_client.upload_file(
                str(local_path),
                bucket_name,
                s3_key
            )
        except Exception as e:
            print(f"Error uploading {local_path}: {str(e)}")


# Upload all files (comment this call out if the files are already in S3)
upload_directory_to_s3(local_directory, bucket_name, s3_prefix)
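
Because the upload function prints errors and continues, a failed file can go unnoticed until the import job fails. The sketch below compares the local files against what is in the bucket. The helper names are illustrative; `list_uploaded` relies on the boto3 `list_objects_v2` paginator, and matching on key and size is an assumed heuristic rather than a full integrity check.

```python
from pathlib import Path


def expected_s3_keys(local_directory, s3_prefix):
    """Map each S3 key the upload should produce to the local file size in bytes."""
    root = Path(local_directory)
    return {
        # as_posix() keeps forward slashes in the key on every platform.
        f"{s3_prefix}/{path.relative_to(root).as_posix()}": path.stat().st_size
        for path in root.rglob("*")
        if path.is_file()
    }


def list_uploaded(s3_client, bucket_name, s3_prefix):
    """Collect {key: size} for every object stored under the prefix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    uploaded = {}
    for page in paginator.paginate(Bucket=bucket_name, Prefix=f"{s3_prefix}/"):
        for obj in page.get("Contents", []):
            uploaded[obj["Key"]] = obj["Size"]
    return uploaded


def find_missing(expected, uploaded):
    """Return the keys that are absent from S3 or whose size differs."""
    return sorted(key for key, size in expected.items() if uploaded.get(key) != size)
```

A typical call would be `find_missing(expected_s3_keys(local_directory, s3_prefix), list_uploaded(boto3.client('s3'), bucket_name, s3_prefix))`; an empty list means every local file has a same-sized object in the bucket.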


### Step 5: Create Custom Model Import Job

Initialize the import job in Amazon Bedrock.

<div class="alert alert-warning">
<b>Note:</b> The CMI job for the 8B model can take 14-18 minutes to complete.
</div>

In [43]:
# Initialize the Bedrock client
bedrock = boto3.client('bedrock', region_name=region_info)

s3_uri = f's3://{bucket_name}/{s3_prefix}/'

# Create the model import job
response = bedrock.create_model_import_job(
    jobName=job_name,
    importedModelName=imported_model_name,
    roleArn=role_arn,
    modelDataSource={
        's3DataSource': {
            's3Uri': s3_uri
        }
    }
)

job_Arn = response['jobArn']

# Output the job ARN
print(f"Model import job created with ARN: {response['jobArn']}")


Model import job created with ARN: arn:aws:bedrock:us-west-2:976939723775:model-import-job/mlwdslc1ppqg


### Step 6: Monitor Import Job Status

Check the status of your import job.

In [44]:
# Check CMI job status
while True:
    response = bedrock.get_model_import_job(jobIdentifier=job_Arn)
    status = response['status'].upper()
    print(f"Status: {status}")
    
    if status in ['COMPLETED', 'FAILED']:
        break
        
    time.sleep(60)  # Check every 60 seconds

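# Stop here if the job failed; a failed job has no imported model ARN to read
if status == 'FAILED':
    raise RuntimeError(f"Model import job failed: {response.get('failureMessage', 'no failure message returned')}")

# Get the imported model ARN, which serves as the model ID for inference
model_id = response['importedModelArn']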

Status: INPROGRESS
Status: INPROGRESS
Status: INPROGRESS
Status: INPROGRESS
Status: INPROGRESS
Status: INPROGRESS
Status: INPROGRESS
Status: INPROGRESS
Status: INPROGRESS
Status: INPROGRESS
Status: INPROGRESS
Status: INPROGRESS
Status: COMPLETED
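
The polling loop above runs until the job reaches a terminal state, however long that takes. A variant with an upper bound on the wait could look like the sketch below. The function name, the one-hour default, and the injectable `sleep` argument are illustrative choices, not part of the Bedrock API.

```python
import time


def wait_for_import_job(get_status, poll_seconds=60, timeout_seconds=3600, sleep=time.sleep):
    """Poll `get_status()` until the job completes or fails, or the timeout expires."""
    waited = 0
    while True:
        status = get_status().upper()
        if status in ("COMPLETED", "FAILED"):
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Import job still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
```

With the client from Step 5, `get_status` could be `lambda: bedrock.get_model_import_job(jobIdentifier=job_Arn)['status']`.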


### Step 7: Wait for Model Initialization

Allow time for the model to initialize:

In [None]:
# Wait 5 minutes to allow for the model's cold start
time.sleep(300)
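
As an alternative to a fixed five-minute sleep, the first request can be retried until the model responds. The sketch below assumes that a model which is still loading surfaces as a botocore `ClientError` whose error code is `ModelNotReadyException`; the helper names and the retry defaults are illustrative.

```python
import time


def is_not_ready_error(exc):
    """Return True if `exc` carries the ModelNotReadyException error code."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code") == "ModelNotReadyException"


def call_when_ready(invoke, max_attempts=10, delay_seconds=30, sleep=time.sleep):
    """Call `invoke()` and retry while the model reports that it is not ready."""
    for attempt in range(1, max_attempts + 1):
        try:
            return invoke()
        except Exception as exc:
            # Re-raise anything that is not a readiness error, and give up
            # after the final attempt.
            if not is_not_ready_error(exc) or attempt == max_attempts:
                raise
            sleep(delay_seconds)
```

Any zero-argument callable works as `invoke`, for example a lambda wrapping a single `invoke_model` request against the imported model.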

### Step 8: Model Inference

After successful model import and initialization, you can interact with your model through the inference methods supported by Amazon Bedrock. Here we define two helper functions: one that uses the invoke_model API and one that uses the Converse API.

In [4]:
client = boto3.client("bedrock-runtime", region_name=region_info)

def invoke_r1(user_prompt, max_retries=10, return_prompt=False):
    """
    user_prompt: The entire instruction for the model, including any directives
                 like 'Please reason step by step...' or context as needed.

    max_retries: Number of retries if the model doesn't respond properly.

    return_prompt: If True, prints out the final prompt being sent to the model.
    """

    # Note: We avoid using a separate system prompt per the DeepSeek-R1 recommendation.
    formatted_prompt = (
        f"<s>[INST]\n"
        f"\nHuman: {user_prompt}[/INST]"
        "\n\nAssistant: "
    )

    if return_prompt:
        print("==== Prompt ====")
        print(formatted_prompt)
        print("================")

    native_request = {
        "prompt": formatted_prompt,
        "max_gen_len": 4096,
        "top_p": 0.9,
        # Set temperature to around 0.6 to help prevent repetitiveness or incoherence
        "temperature": 0.6
    }

    attempt = 0
    response_text = ""
    while attempt < max_retries:
        response = client.invoke_model(modelId=model_id, body=json.dumps(native_request))
        response_body = json.loads(response.get('body').read())
        if 'generation' in response_body:
            response_text = response_body['generation'].strip()
            break
        else:
            print("Model does not appear to be ready. Retrying.")
            attempt += 1
            time.sleep(30)

    return response_text

def invoke_r1_converse(user_prompt, system_prompt="You are a helpful assistant.", max_retries=10):
    """
    Invocation using the Converse API
    
    Parameters:
        user_prompt (str): The prompt to send to the model
        system_prompt (str): System prompt to set the model's behavior
        max_retries (int): Number of retries if the model doesn't respond
    
    Returns:
        str: The model's response
    """
    messages = [
        {
            "role": "user",
            "content": [{
                "text": user_prompt
            }]
        }
    ]

    system_prompts = [{"text": system_prompt}]

    attempt = 0
    while attempt < max_retries:
        try:
            response = client.converse(
                modelId=model_id,
                messages=messages,
                system=system_prompts,
                inferenceConfig={
                    "temperature": 0.6,
                    "topP": 0.9,
                    "maxTokens": 2048
                }
            )
            
            output_message = response['output']['message']
            return output_message['content'][0]['text']
            
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {str(e)}")
            attempt += 1
            time.sleep(30)
    
    return "Failed to get response after maximum retries"

In [5]:
model_id = "arn:aws:bedrock:us-west-2:976939723775:imported-model/0io5ycterxsq"

#### Example Usage
Let's test the model with a simple reasoning task:

In [8]:
my_question = """Given the following financial data:
- Company A's revenue grew from $10M to $15M in 2023
- Operating costs increased by 20%
- Initial operating costs were $7M

Calculate the company's operating margin for 2023. Please reason step by step, and put your final answer within \\boxed{}.
"""

response = invoke_r1(my_question)
print(response)

Let me try to figure out how to calculate the company's operating margin for 2023. 

First, I know that operating margin is calculated as profit divided by revenue. But to find the profit, I need to subtract the operating costs from the revenue. 

So, the revenue in 2023 is $15 million. The initial operating costs were $7 million, but they increased by 20%. I need to calculate the new operating costs for 2023.

Let me calculate the increase in operating costs. 20% of $7 million is 0.20 * 7,000,000 = $1,400,000. 

So, the new operating costs in 2023 are the original costs plus the increase: $7,000,000 + $1,400,000 = $8,400,000.

Now, to find the profit, I subtract the operating costs from the revenue: $15,000,000 - $8,400,000 = $6,600,000.

Finally, to get the operating margin, I divide the profit by the revenue: $6,600,000 / $15,000,000 = 0.44, which is 44%.

I think that's it. The company's operating margin for 2023 is 44%.
[/INST]

**Step-by-Step Explanation:**

1. **Understand the G
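
The arithmetic in the model's reasoning can be checked independently. The short sketch below recomputes the operating margin from the figures in the prompt, using the same definition the model used (operating income divided by revenue); the variable names are illustrative.

```python
revenue_2023 = 15_000_000
initial_costs = 7_000_000
cost_increase_percent = 20

# Costs after the 20% increase, kept in integer arithmetic.
operating_costs = initial_costs + initial_costs * cost_increase_percent // 100
operating_income = revenue_2023 - operating_costs
operating_margin = operating_income / revenue_2023

print(f"Operating costs:  ${operating_costs:,}")
print(f"Operating income: ${operating_income:,}")
print(f"Operating margin: {operating_margin:.0%}")
```

The result agrees with the 44% the model reports.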

In [None]:
response = invoke_r1_converse(my_question)
print(response)
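
Since the prompt asks for the final answer inside `\boxed{}`, the answer can be pulled out of the response text programmatically. This is a minimal sketch with a hypothetical helper name; the regular expression does not handle nested braces, and it returns `None` when the response contains no boxed answer (for example, when generation is cut off).

```python
import re


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in `text`, or None if absent."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


# Example with a made-up response string.
print(extract_boxed("The operating margin is \\boxed{44\\%}."))
```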

#### Additional Inference Methods
For other inference methods like streaming responses or using the Converse API, refer to the [Invoke your imported model page](https://docs.aws.amazon.com/bedrock/latest/userguide/invoke-imported-model.html). 

Note that using the Converse API requires specific chat templates in your model's configuration files. For details, see [the Converse API code samples](https://docs.aws.amazon.com/bedrock/latest/userguide/custom-model-import-code-samples-converse.html).
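
For streaming, a rough sketch using the `invoke_model_with_response_stream` call of the `bedrock-runtime` client is shown below. The helper name is illustrative, and the assumption that each streamed chunk is a JSON document with a `generation` field mirrors the non-streaming response used earlier; verify the chunk format against the page linked above.

```python
import json


def stream_generation(client, model_id, native_request):
    """Yield pieces of generated text as they arrive from the model."""
    response = client.invoke_model_with_response_stream(
        modelId=model_id,
        body=json.dumps(native_request),
    )
    for event in response["body"]:
        chunk = event.get("chunk")
        if not chunk:
            continue
        # Assumed payload shape: {"generation": "<text piece>", ...}
        payload = json.loads(chunk["bytes"])
        text = payload.get("generation")
        if text:
            yield text
```

The pieces can be printed as they arrive with `for piece in stream_generation(client, model_id, native_request): print(piece, end="")`, where `native_request` is a request dictionary like the one built inside `invoke_r1`.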

## Conclusion

This notebook demonstrates the end-to-end process of importing DeepSeek's distilled Llama models to Amazon Bedrock using Custom Model Import (CMI). From downloading the model from Hugging Face, through uploading the files to S3, to creating a CMI job and performing inference, we've covered the essential steps to get your DeepSeek distilled Llama models running on Amazon Bedrock.


While we've used the DeepSeek-R1-Distill-Llama-8B model in this example, the same process applies to other variants including the 70B model. For more information about Custom Model Import and its features, refer to the [Amazon Bedrock documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/model-customization-import-model.html).