# 🚀 Customize and Deploy `deepseek-ai/DeepSeek-R1-0528` on Amazon SageMaker AI

In this notebook, we explore **DeepSeek-R1-0528**, a cutting-edge reasoning model from DeepSeek AI. You'll learn how to fine-tune it on reasoning datasets, evaluate its mathematical and logical capabilities, and deploy it using SageMaker for advanced reasoning tasks.

## What is DeepSeek-R1-0528?
DeepSeek-R1-0528 is an updated release in DeepSeek's R1 series of reasoning models. R1 models are trained with large-scale reinforcement learning to write out an explicit chain of thought before the final answer, which helps on complex mathematical, logical, and analytical problems.  
🔗 Model card: [deepseek-ai/DeepSeek-R1-0528 on Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528)

## Key Specifications
| Feature | Details |
|---|---|
| **Parameters** | Approximately 671B total, about 37B activated per token (as reported for DeepSeek-R1; confirm on the model card) |
| **Architecture** | Mixture-of-Experts Transformer built on DeepSeek-V3 |
| **Context Length** | Long context window (up to 128K tokens reported for DeepSeek-R1; confirm on the model card) |
| **Modalities** | Text-in / Text-out |
| **License** | MIT (confirm the terms on the model card) |
| **Release Date** | May 28, 2025 (the `0528` suffix) |

## Benchmarks & Behavior
- The model card reports improved results over the original DeepSeek-R1 on **mathematics, programming, and general-logic** benchmarks; see it for the exact figures.  
- Designed for **multi-step reasoning tasks**: the model writes its chain of thought inside `<think>` tags before the final answer.  
- Targets competition mathematics, coding challenges, and analytical reasoning tasks.  
- Tends toward **step-by-step problem decomposition**, which makes its solutions easier to inspect.  

## Using This Notebook
You'll cover:
* Load the NuminaMath-CoT reasoning dataset from Hugging Face and prepare it for fine-tuning  
* Fine-tune with SageMaker Training Jobs using reasoning-optimized configurations  
* Run model evaluation on mathematical reasoning benchmarks  
* Deploy to SageMaker Endpoints for production reasoning tasks  

Let's begin by installing the dependencies and setting up the SageMaker session for `deepseek-ai/DeepSeek-R1-0528`.


In [1]:
%pip install -Uq sagemaker datasets



In [2]:
import boto3
import sagemaker
import time


In [3]:
region = boto3.Session().region_name
sess = sagemaker.Session(boto3.Session(region_name=region))

sagemaker_session_bucket = sess.default_bucket()
role = sagemaker.get_execution_role()

In [4]:
print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sagemaker_session_bucket}")
print(f"sagemaker session region: {sess.boto_region_name}")

### [NuminaMath-CoT](https://huggingface.co/datasets/AI-MO/NuminaMath-CoT)

**NuminaMath-CoT** is a large-scale dataset of **about 860,000 math competition question-solution pairs**, designed to support chain-of-thought reasoning in mathematical problem solving.

**Data Format & Structure**:
- Each example is a question followed by a solution; the solution is formatted with detailed **Chain-of-Thought (CoT)** reasoning.  
- The data sources include *Chinese high school math exercises*, *US and international mathematics competition problems*, *online test-papers PDFs*, and *math discussion forums*.  
- Preprocessing includes OCR from PDFs, segmentation to extract problem-solution pairs, translation into English, alignment into CoT style, and formatting of final answers.  

**License**: Released under the **Apache-2.0** license.  

**Applications**:

This dataset is useful for training and evaluating models on tasks including:  
- Complex math problem solving with reasoning steps (algebra, geometry, number theory, etc.)  
- Benchmarking chain-of-thought performance of LLMs on competition-level math tasks  
- Educational tools and tutoring systems that require explainable math solutions  
- Fine-tuning models to improve consistency, reasoning depth, and accuracy in mathematical domains  


In [3]:
import os
import json
import pprint
from tqdm import tqdm
from datasets import load_dataset

In [4]:
dataset_parent_path = os.path.join(os.getcwd(), "tmp_cache_local_dataset")
os.makedirs(dataset_parent_path, exist_ok=True)

**Preparing Your Dataset in `messages` format**

This section walks you through creating a conversation-style dataset—the required `messages` format—for directly training LLMs using SageMaker AI.

**What Is the `messages` Format?**

The `messages` format structures instances as chat-like exchanges, wrapping each conversation turn into a role-labeled JSON array. It’s widely used by frameworks like TRL.
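Before training, it can help to check that every record follows this structure. The sketch below is a hypothetical helper, not part of TRL or SageMaker. It assumes the only roles are `system`, `user`, and `assistant`, and that a supervised fine-tuning record should end with an assistant turn.

```python
from typing import Any

# Assumption: these are the only roles used in this notebook's records.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_messages_record(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one record (an empty list means valid)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, turn in enumerate(messages):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        if turn.get("role") not in ALLOWED_ROLES:
            problems.append(f"turn {i}: unexpected role {turn.get('role')!r}")
        content = turn.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"turn {i}: content must be a non-empty string")

    # Assumption: the training target is the last turn, so it should be the assistant's.
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last turn should come from the assistant")
    return problems


record = {
    "messages": [
        {"role": "user", "content": "How do I bake sourdough?"},
        {"role": "assistant", "content": "First, you need to create a starter by..."},
    ]
}
print(validate_messages_record(record))
```

Returning a list of problems rather than raising makes it easy to scan a whole dataset and report every malformed record at once.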

Example entry:

```json
{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "How do I bake sourdough?" },
    { "role": "assistant", "content": "First, you need to create a starter by..." }
  ]
}
```


In [5]:
dataset_name = "AI-MO/NuminaMath-CoT"
dataset = load_dataset(dataset_name, split="train[:1000]")

Generating train split:   0%|          | 0/859494 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

In [6]:
pprint.pp(dataset[0])

{'source': 'synthetic_math',
 'problem': 'Consider the terms of an arithmetic sequence: $-\\frac{1}{3}, '
            'y+2, 4y, \\ldots$. Solve for $y$.',
 'solution': 'For an arithmetic sequence, the difference between consecutive '
             'terms must be equal. Therefore, we can set up the following '
             'equations based on the sequence given:\n'
             '\\[ (y + 2) - \\left(-\\frac{1}{3}\\right) = 4y - (y+2) \\]\n'
             '\n'
             'Simplify and solve these equations:\n'
             '\\[ y + 2 + \\frac{1}{3} = 4y - y - 2 \\]\n'
             '\\[ y + \\frac{7}{3} = 3y - 2 \\]\n'
             '\\[ \\frac{7}{3} + 2 = 3y - y \\]\n'
             '\\[ \\frac{13}{3} = 2y \\]\n'
             '\\[ y = \\frac{13}{6} \\]\n'
             '\n'
             'Thus, the value of $y$ that satisfies the given arithmetic '
             'sequence is $\\boxed{\\frac{13}{6}}$.',
 'messages': [{'content': 'Consider the terms of an arithmetic sequence: '
                

In [7]:
print(f"total number of fine-tunable samples: {len(dataset)}")

total number of fine-tunable samples: 1000


In [8]:
def convert_to_messages_reasoning(row):
    """Convert one NuminaMath-CoT row into system/user/assistant messages with a <think> block."""
    system_content = "You are a mathematical reasoning assistant. Read the problem, restate the key givens and goal, then solve step-by-step with clear algebra (use LaTeX), keeping exact arithmetic (fractions/surds) and justifying each transformation (e.g., equal differences for arithmetic sequences). Verify any domain or extraneous-solution constraints, and present the final simplified answer concisely on the last line."

    # The dataset's own `messages` column holds the user problem and the assistant solution.
    messages_user_row = row["messages"][0]
    assert messages_user_row["role"] == "user", "user row unmatched"
    user_content = messages_user_row["content"]

    messages_assistant_row = row["messages"][1]
    assert messages_assistant_row["role"] == "assistant", "assistant row unmatched"
    assistant_content = messages_assistant_row["content"]

    # Wrap the reference solution in <think> tags so it is trained as the reasoning trace.
    # Note: if the assistant message mirrors `solution`, the same text appears both
    # inside the <think> block and as the final answer.
    think_block = f"<think>{row['solution']}</think>"

    return {
        "messages": [
            {"role": "system", "content": system_content},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": f"{think_block}\n\n{assistant_content}"}
        ]
    }
    
    
dataset = dataset.map(convert_to_messages_reasoning, remove_columns=dataset.column_names)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
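To spot-check the converted records, it can be useful to split an assistant message back into its reasoning and its final answer. The helper below is a minimal sketch with a hypothetical name (`split_reasoning`). It assumes each message has at most one `<think>...</think>` block at the start, which is how the conversion function above builds it.

```python
import re

# One leading <think> block followed by the visible answer.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>\s*(.*)", re.DOTALL)


def split_reasoning(assistant_content: str) -> tuple[str, str]:
    """Split a '<think>...</think>' prefixed message into (reasoning, answer)."""
    text = assistant_content.strip()
    match = THINK_PATTERN.fullmatch(text)
    if match is None:
        # No think block: treat the whole text as the answer.
        return "", text
    return match.group(1).strip(), match.group(2).strip()


example = "<think>2x = 8, so x = 4.</think>\n\nThe solution is $\\boxed{4}$."
reasoning, answer = split_reasoning(example)
print("reasoning:", reasoning)
print("answer:", answer)
```

The same split is handy at inference time, when only the text after the think block should be shown to users or scored.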

In [9]:
dataset_filename = os.path.join(dataset_parent_path, f"{dataset_name.replace('/', '--').replace('.', '-')}.jsonl")
dataset.to_json(dataset_filename, lines=True)

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

3243521
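The integer printed above is the return value of `to_json`, which appears to be the number of bytes (or characters) written. A quick way to confirm the file is well-formed JSON Lines is to read it back with the standard library. The sketch below does that on a small stand-in record in a temporary file; `read_jsonl` is a hypothetical helper name.

```python
import json
import os
import tempfile


def read_jsonl(path: str) -> list[dict]:
    """Load a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round-trip a tiny stand-in record through a temporary file.
sample = {
    "messages": [
        {"role": "user", "content": "1 + 1?"},
        {"role": "assistant", "content": "<think>1 + 1 = 2</think>\n\n2"},
    ]
}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(sample) + "\n")
    records = read_jsonl(path)

print(len(records), [m["role"] for m in records[0]["messages"]])
```

Pointing `read_jsonl` at `dataset_filename` should return one dict per training sample, each with a `messages` key.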

#### Upload file to S3

In [None]:
from sagemaker.s3 import S3Uploader

In [None]:
data_s3_uri = f"s3://{sess.default_bucket()}/dataset"

uploaded_s3_uri = S3Uploader.upload(
    local_path=dataset_filename,
    desired_s3_uri=data_s3_uri
)
print(f"Uploaded {dataset_filename} to > {uploaded_s3_uri}")

## Fine-Tune LLMs using SageMaker `Estimator`/`ModelTrainer`

In [None]:
import time
from sagemaker.pytorch import PyTorch
from getpass import getpass
import yaml
from jinja2 import Template

In [None]:
hf_token = getpass()

### Training using `PyTorch` Estimator

This option uses the official SageMaker PyTorch container to run a custom training script (`sft.py`) built on the Accelerate and DeepSpeed libraries. It suits users who want full control over the training pipeline.

In [None]:
model_id = "deepseek-ai/DeepSeek-R1-0528"
model_name = model_id.split("/")[-1]

# Training configuration optimized for reasoning tasks
training_config = {
    "model_id": model_id,
    "dataset_path": "/opt/ml/input/data/training",
    "num_train_epochs": 3,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "learning_rate": 2e-5,
    "max_seq_len": 2048,
    "packing": False,
    "use_flash_attention_2": True,
    "merge_adapters": True,
    "bf16": True,
    "tf32": True,
    "gradient_checkpointing": True,
    "logging_steps": 10,
    "save_strategy": "epoch",
    "output_dir": "/opt/ml/model",
    "optim": "adamw_torch",
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
    "seed": 42
}

print(f"Training DeepSeek-R1-0528 with configuration:")
for key, value in training_config.items():
    print(f"  {key}: {value}")
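To see what these settings imply, the sketch below works out the effective batch size and the approximate number of optimizer steps. It assumes a single GPU, the 1,000 samples loaded earlier, no packing, and a warmup count rounded up from `warmup_ratio`. The trainer's exact step count may differ slightly depending on how it handles the last partial batch.

```python
import math


def training_schedule(num_samples, per_device_batch, grad_accum, num_gpus, epochs, warmup_ratio):
    """Estimate effective batch size and step counts for a training run."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


schedule = training_schedule(
    num_samples=1000,      # samples loaded with split="train[:1000]"
    per_device_batch=1,
    grad_accum=8,
    num_gpus=1,            # assumption: one GPU
    epochs=3,
    warmup_ratio=0.1,
)
print(schedule)
```

With these values the effective batch size is 8, giving 125 steps per epoch, 375 steps in total, and 38 warmup steps.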

In [None]:
# Create PyTorch estimator for DeepSeek-R1-0528
pytorch_estimator = PyTorch(
    entry_point="sft.py",
    source_dir="sagemaker_code",
    role=role,
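    # Placeholder: one 24 GB A10G GPU cannot hold the full model; pick a multi-GPU
    # (likely multi-node) configuration sized to the model you actually train.
    instance_type="ml.g5.2xlarge",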
    instance_count=1,
    framework_version="2.0.1",
    py_version="py310",
    hyperparameters=training_config,
    environment={
        "HUGGINGFACE_HUB_CACHE": "/opt/ml/input/data/cache",
        "HF_TOKEN": hf_token,
        "TRANSFORMERS_CACHE": "/opt/ml/input/data/cache",
        "ACCELERATE_USE_FSDP": "0",
        "FSDP_CPU_RAM_EFFICIENT_LOADING": "1"
    },
    disable_output_compression=True,
    keep_alive_period_in_seconds=1800,
    volume_size=100
)

print(f"Created PyTorch estimator for {model_id}")
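As a rough sizing check for the instance type, the sketch below estimates the memory needed just to hold the model weights. The parameter count (671B, the figure reported for DeepSeek-R1) is an assumption to verify against the model card. The 24 GiB figure is the single A10G GPU of an `ml.g5.2xlarge`. Gradients, optimizer state, and activations need additional memory on top of this.

```python
def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


TOTAL_PARAMS = 671e9   # assumption: total parameter count reported for DeepSeek-R1
GPU_MEMORY_GIB = 24    # one NVIDIA A10G, as on ml.g5.2xlarge

for precision, nbytes in [("fp8", 1), ("bf16", 2)]:
    gib = weight_memory_gib(TOTAL_PARAMS, nbytes)
    gpus = gib / GPU_MEMORY_GIB
    print(f"{precision}: ~{gib:,.0f} GiB of weights, ~{gpus:.0f}x one {GPU_MEMORY_GIB} GiB GPU")
```

The weights alone come to hundreds of GiB, so a realistic job would either use many large GPUs across nodes or switch to one of the smaller distilled R1 checkpoints.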

In [None]:
# Start training
training_job_name = f"deepseek-r1-0528-reasoning-{int(time.time())}"

pytorch_estimator.fit(
    inputs={
        "training": uploaded_s3_uri
    },
    job_name=training_job_name,
    wait=False
)

print(f"Training job '{training_job_name}' started for DeepSeek-R1-0528")
print(f"Monitor progress in SageMaker console or use: pytorch_estimator.logs()")
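Because the job is started with `wait=False`, the notebook continues immediately. One way to block until the job finishes is to poll `describe_training_job` on a boto3 SageMaker client. The helper below is a sketch with a hypothetical name and default intervals. It takes the client as an argument so it can be exercised without AWS credentials.

```python
import time

# Statuses after which a training job no longer changes.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(sm_client, job_name, poll_seconds=60, max_polls=240):
    """Poll a SageMaker training job until it reaches a terminal status."""
    for _ in range(max_polls):
        response = sm_client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Training job {job_name!r} did not finish after {max_polls} polls")


# Usage in this notebook (requires AWS credentials):
# import boto3
# status = wait_for_training_job(boto3.client("sagemaker"), training_job_name)
# print(f"Training job finished with status: {status}")
```

Checking for `Completed` before calling `deploy` avoids attempting to deploy a job that is still running or has failed.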

### Training using Hugging Face `Estimator`

This option uses the Hugging Face SageMaker container with TRL (Transformer Reinforcement Learning) for fine-tuning. The configuration below adds LoRA adapters, sequence packing, and an 8-bit AdamW optimizer to reduce memory use.

In [None]:
from sagemaker.huggingface import HuggingFace

In [None]:
# Hugging Face training configuration for reasoning tasks
hf_training_config = {
    "model_id": model_id,
    "dataset_path": "/opt/ml/input/data/training",
    "num_train_epochs": 2,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "learning_rate": 1e-5,
    "max_seq_len": 2048,
    "packing": True,
    "use_flash_attention_2": True,
    "merge_adapters": True,
    "bf16": True,
    "tf32": True,
    "gradient_checkpointing": True,
    "logging_steps": 5,
    "save_strategy": "epoch",
    "output_dir": "/opt/ml/model",
    "optim": "adamw_8bit",
    "lr_scheduler_type": "linear",
    "warmup_steps": 100,
    "seed": 42,
    "use_lora": True,
    "lora_r": 64,
    "lora_alpha": 16,
    "lora_dropout": 0.1,
    "target_modules": "all-linear"
}

print(f"Hugging Face training configuration for DeepSeek-R1-0528:")
for key, value in hf_training_config.items():
    print(f"  {key}: {value}")
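The LoRA settings above determine how strongly the adapter update is scaled and how many parameters are trained. In the standard LoRA formulation, a linear layer with a weight of shape `d_out x d_in` gains two low-rank matrices with `r * (d_in + d_out)` parameters in total, and the update is scaled by `lora_alpha / lora_r`. The sketch below applies this to a hypothetical 4096 x 4096 layer; the layer size is an illustrative assumption, not the model's actual dimension.

```python
def lora_scaling(lora_alpha: int, lora_r: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / lora_r


def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


lora_r, lora_alpha = 64, 16   # values from hf_training_config
d_in = d_out = 4096           # hypothetical layer size, for illustration only

scaling = lora_scaling(lora_alpha, lora_r)
added = lora_trainable_params(d_in, d_out, lora_r)
full = d_in * d_out
print(f"scaling = {scaling}")
print(f"adapter params = {added:,} vs full layer = {full:,} ({added / full:.1%})")
```

With `lora_r=64` and `lora_alpha=16` the scaling factor is 0.25, and the adapter for this example layer trains 524,288 parameters against 16,777,216 in the full weight, about 3.1%.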

In [None]:
# Create Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point="sft.py",
    source_dir="sagemaker_code",
    role=role,
    # Placeholder: one 24 GB A10G GPU cannot hold the full model, even with LoRA.
    instance_type="ml.g5.xlarge",
    instance_count=1,
    transformers_version="4.36.0",
    pytorch_version="2.1.0",
    py_version="py310",
    hyperparameters=hf_training_config,
    environment={
        "HUGGINGFACE_HUB_CACHE": "/opt/ml/input/data/cache",
        "HF_TOKEN": hf_token,
        "TRANSFORMERS_CACHE": "/opt/ml/input/data/cache"
    },
    disable_output_compression=True,
    volume_size=80
)

print(f"Created Hugging Face estimator for {model_id}")

In [None]:
# Start Hugging Face training
hf_training_job_name = f"deepseek-r1-0528-hf-reasoning-{int(time.time())}"

huggingface_estimator.fit(
    inputs={
        "training": uploaded_s3_uri
    },
    job_name=hf_training_job_name,
    wait=False
)

print(f"Hugging Face training job '{hf_training_job_name}' started")
print(f"Monitor progress: huggingface_estimator.logs()")

### Training using SageMaker JumpStart

This option provides a managed training experience with preconfigured settings for models in the JumpStart catalog. DeepSeek-R1-0528 may not be available in JumpStart, so the model ID below is hypothetical and the cell only illustrates the pattern.

In [None]:
from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.jumpstart.estimator import JumpStartEstimator

In [None]:
# Note: DeepSeek-R1-0528 may not be directly available in JumpStart
# This example shows the pattern for when it becomes available

try:
    # Check if DeepSeek models are available in JumpStart
    jumpstart_model_id = "huggingface-llm-deepseek-r1-0528"  # Hypothetical ID
    
    jumpstart_estimator = JumpStartEstimator(
        model_id=jumpstart_model_id,
        role=role,
        instance_type="ml.g5.2xlarge",
        instance_count=1,
        hyperparameters={
            "epochs": "3",
            "learning_rate": "2e-5",
            "train_batch_size": "1",
            "max_input_length": "2048",
            "validation_split_ratio": "0.1",
            "lora_r": "64",
            "lora_alpha": "16",
            "lora_dropout": "0.1",
            "bf16": "True",
            "gradient_checkpointing": "True"
        },
        environment={
            "HF_TOKEN": hf_token
        }
    )
    
    print(f"JumpStart estimator created for {jumpstart_model_id}")
    
except Exception as e:
    print(f"DeepSeek-R1-0528 not yet available in JumpStart: {e}")
    print("Use PyTorch or Hugging Face estimators above for now")

In [None]:
# Uncomment when DeepSeek-R1-0528 becomes available in JumpStart
# jumpstart_training_job_name = f"deepseek-r1-0528-jumpstart-{int(time.time())}"

# jumpstart_estimator.fit(
#     inputs={
#         "training": uploaded_s3_uri
#     },
#     job_name=jumpstart_training_job_name,
#     wait=False
# )

print("JumpStart training will be available when DeepSeek-R1-0528 is added to the model catalog")

## Model Evaluation and Testing

**Evaluating DeepSeek-R1-0528 Reasoning Performance**

After fine-tuning, evaluate the model's reasoning on mathematical and logical tasks. This section defines helpers that extract a final answer from a response and score exact-match accuracy.

In [None]:
import json
import re

In [None]:
def extract_final_answer(response):
    """Extract the final answer from a model response."""
    # Prefer the last \boxed{...}; track brace depth so nested braces
    # such as \boxed{\frac{13}{6}} are captured whole.
    start = response.rfind("\\boxed{")
    if start != -1:
        i = start + len("\\boxed{")
        depth, chars = 1, []
        while i < len(response):
            ch = response[i]
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    return "".join(chars).strip()
            chars.append(ch)
            i += 1

    # Look for "The answer is" patterns
    answer_pattern = r'[Tt]he answer is[:\s]*([^\n\.]+)'
    match = re.search(answer_pattern, response)
    if match:
        return match.group(1).strip()

    # Return last line as fallback
    lines = response.strip().split('\n')
    return lines[-1].strip() if lines else ""

def evaluate_reasoning_accuracy(predictions, ground_truths):
    """Evaluate reasoning accuracy by comparing final answers"""
    correct = 0
    total = len(predictions)
    
    for pred, truth in zip(predictions, ground_truths):
        pred_answer = extract_final_answer(pred)
        truth_answer = extract_final_answer(truth)
        
        # Normalize answers for comparison
        pred_normalized = re.sub(r'\s+', '', pred_answer.lower())
        truth_normalized = re.sub(r'\s+', '', truth_answer.lower())
        
        if pred_normalized == truth_normalized:
            correct += 1
    
    return correct / total if total > 0 else 0

print("Evaluation functions defined for DeepSeek-R1-0528 reasoning assessment")

In [None]:
# Sample evaluation on test problems
test_problems = [
    "Solve for x: 2x + 5 = 13",
    "Find the derivative of f(x) = x^3 + 2x^2 - 5x + 1",
    "If a triangle has sides of length 3, 4, and 5, what is its area?"
]

expected_answers = [
    "x = 4",
    "f'(x) = 3x^2 + 4x - 5",
    "6"
]

print("Sample test problems for DeepSeek-R1-0528 evaluation:")
for i, problem in enumerate(test_problems):
    print(f"{i+1}. {problem}")
    print(f"   Expected: {expected_answers[i]}")
    print()
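Exact-match scoring is sensitive to formatting: an expected answer of `x = 4` will not match a model answer of `4`. One option is to strip a short left-hand side before comparing. The function below is a hypothetical sketch of that idea. It assumes answers are single expressions and would wrongly shorten answers where the left-hand side matters, such as the equation of a line.

```python
import re


def normalize_answer(text: str) -> str:
    """Lowercase, drop whitespace, and strip a short leading left-hand side."""
    compact = re.sub(r"\s+", "", text.lower())
    # Remove a prefix such as "x=" or "f'(x)=" if the answer starts with one.
    return re.sub(r"^[a-z]\w*'*(\([a-z]\))?=", "", compact)


pairs = [
    ("x = 4", "4"),
    ("f'(x) = 3x^2 + 4x - 5", "3x^2+4x-5"),
    ("6", "6"),
]
for expected, predicted in pairs:
    match = normalize_answer(expected) == normalize_answer(predicted)
    print(f"{expected!r} vs {predicted!r}: {match}")
```

For anything beyond simple cases, a symbolic comparison (for example with a computer algebra library) is more reliable than string normalization.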

## Model Deployment

**Deploying Fine-tuned DeepSeek-R1-0528**

Deploy your fine-tuned reasoning model to SageMaker endpoints for production use. The deployment supports real-time inference for mathematical reasoning and problem-solving tasks.

In [None]:
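# Deploy the fine-tuned model (using PyTorch estimator as example)
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

endpoint_name = f"deepseek-r1-0528-reasoning-endpoint-{int(time.time())}"

try:
    predictor = pytorch_estimator.deploy(
        initial_instance_count=1,
        # Placeholder: size the instance to the model actually being served
        instance_type="ml.g5.xlarge",
        endpoint_name=endpoint_name,
        # The inference cell sends a JSON payload, so use JSON (de)serialization
        serializer=JSONSerializer(),
        deserializer=JSONDeserializer(),
        wait=False
    )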
    
    print(f"Deploying DeepSeek-R1-0528 to endpoint: {endpoint_name}")
    print("Deployment in progress... This may take 10-15 minutes.")
    
except Exception as e:
    print(f"Deployment error: {e}")
    print("Ensure training job completed successfully before deployment")

In [None]:
# Example inference with the deployed model
def test_reasoning_inference(predictor, problem):
    """Test reasoning inference with deployed model"""
    try:
        response = predictor.predict({
            "inputs": problem,
            "parameters": {
                "max_new_tokens": 512,
                "temperature": 0.1,
                "do_sample": True,
                "top_p": 0.9,
                "repetition_penalty": 1.1
            }
        })
        return response
    except Exception as e:
        return f"Inference error: {e}"

# Test problem for reasoning
test_problem = """Solve the following step by step:
A rectangular garden has a length that is 3 meters more than twice its width. 
If the perimeter is 36 meters, find the dimensions of the garden."""

print(f"Test problem for DeepSeek-R1-0528:")
print(test_problem)
print("\nWaiting for endpoint deployment to complete...")
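The deployment call returns before the endpoint is ready, so inference should only be attempted once the endpoint reports `InService`. A minimal sketch using the boto3 SageMaker waiter `endpoint_in_service` follows. The helper name and the default delay and attempt values are assumptions, and the client is passed in so the function can be tried without AWS access.

```python
def wait_until_in_service(sm_client, endpoint_name, delay_seconds=30, max_attempts=60):
    """Block until the endpoint is InService, then return its reported status."""
    waiter = sm_client.get_waiter("endpoint_in_service")
    waiter.wait(
        EndpointName=endpoint_name,
        WaiterConfig={"Delay": delay_seconds, "MaxAttempts": max_attempts},
    )
    return sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]


# Usage in this notebook (requires AWS credentials and a deployed endpoint):
# import boto3
# status = wait_until_in_service(boto3.client("sagemaker"), endpoint_name)
# print(f"Endpoint status: {status}")
# print(test_reasoning_inference(predictor, test_problem))
```

The waiter raises an error if the endpoint fails or the attempts run out, which is a clearer signal than an inference call failing against a half-created endpoint.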

In [None]:
# Cleanup resources (uncomment when done)
# predictor.delete_endpoint()
# print(f"Endpoint {endpoint_name} deleted")

print("Remember to delete the endpoint when finished to avoid charges:")
print(f"predictor.delete_endpoint()")

## Conclusion

This notebook demonstrated how to fine-tune and deploy **DeepSeek-R1-0528** for advanced reasoning tasks using Amazon SageMaker. Key highlights:

### What We Accomplished:
- **Model Understanding**: Explored DeepSeek-R1-0528's reasoning capabilities and architecture
- **Dataset Preparation**: Processed NuminaMath-CoT for chain-of-thought reasoning training
- **Multiple Training Strategies**: 
  - PyTorch Estimator for full control
  - Hugging Face Estimator with TRL optimization
  - SageMaker JumpStart (when available)
- **Evaluation Framework**: Built reasoning-specific evaluation metrics
- **Production Deployment**: Deployed for real-time mathematical reasoning inference

### Key Benefits of DeepSeek-R1-0528:
- **Advanced Reasoning**: Trained to work through multi-step logical inference
- **Mathematical Excellence**: Strong performance on competition-level math problems
- **Chain-of-Thought**: Natural step-by-step problem decomposition
- **Scalable Deployment**: Served from SageMaker endpoints for real-time reasoning applications

### Next Steps:
- Experiment with different reasoning datasets (GSM8K, MATH, etc.)
- Fine-tune on domain-specific reasoning tasks
- Implement reasoning evaluation benchmarks
- Explore multi-modal reasoning with models that accept non-text inputs (DeepSeek-R1-0528 is text-only)
- Optimize inference performance for production workloads

DeepSeek-R1-0528 is a strong choice for applications that need multi-step mathematical and logical problem solving.