# Fine-tune DeepSeek-R1-Distill-Qwen-7B using SageMaker HyperPod recipes and ModelTrainer

In this notebook, we fine-tune [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) on Amazon SageMaker AI, using SageMaker HyperPod recipes and the [ModelTrainer](https://sagemaker.readthedocs.io/en/v2.239.0/api/training/model_trainer.html) class.

Recipe: [DeepSeek R1 Distill Qwen 7b - LoRA](https://github.com/aws/sagemaker-hyperpod-recipes/blob/main/recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_distilled_qwen_7b_seq8k_gpu_lora.yaml)


## Prerequisites

Our first step is to install the libraries we need on the client to prepare our dataset and start our training and evaluation jobs.

In [None]:
%pip install -r ./scripts/requirements.txt --upgrade
%pip install -q -U s3fs boto3 botocore

***

## Global variables

This section defines the Python variables used throughout the notebook.

In [None]:
import sagemaker
from datasets import load_dataset
import pandas as pd
from transformers import AutoTokenizer
import boto3
import os

sagemaker_session = sagemaker.Session()
bucket_name = sagemaker_session.default_bucket()

# HuggingFace Model ID
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

# Max number of steps for the training loop
max_steps = 215

# define Training Job Name 
job_prefix = f"train-{model_id.split('/')[-1].replace('.', '-')}-recipe-lora"
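
As a quick sanity check, the sketch below reproduces the prefix derivation and tests the result against the naming rule for SageMaker training jobs (alphanumerics and hyphens, at most 63 characters). The `JOB_NAME_PATTERN` and `is_valid` names are introduced here for illustration only. ModelTrainer may append its own suffix to `base_job_name`, so a valid prefix is a necessary check rather than a guarantee.

```python
import re

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

# Same derivation as in the cell above: keep the part after the organization
# name and replace dots, which are not allowed in training job names.
job_prefix = f"train-{model_id.split('/')[-1].replace('.', '-')}-recipe-lora"
print(job_prefix)  # train-DeepSeek-R1-Distill-Qwen-7B-recipe-lora

# Training job names allow only alphanumerics and hyphens, up to 63 characters.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")
is_valid = bool(JOB_NAME_PATTERN.match(job_prefix))
print(is_valid, len(job_prefix))
```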

***

## Prepare the dataset

In this example, we use the [FreedomIntelligence/medical-o1-reasoning-SFT](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT) dataset from Hugging Face. The dataset is used to fine-tune HuatuoGPT-o1, a medical LLM designed for advanced medical reasoning. It was constructed using GPT-4o, which searches for solutions to verifiable medical problems and validates them through a medical verifier.

For details, see the paper and GitHub repository.

In [None]:
# HF dataset that we will be working with 
dataset_name="FreedomIntelligence/medical-o1-reasoning-SFT"

In [None]:
def generate_prompt(data_point):
    """
    Formats a dataset record as an instruction-style medical reasoning prompt.

    Args:
        data_point (dict): Dictionary containing "Question" and "Complex_CoT" keys
        
    Returns:
        dict: Dictionary containing the formatted prompt
    """
    full_prompt = f"""
    Below is an instruction that describes a task, paired with an input that provides further context. 
    Write a response that appropriately completes the request. 
    Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

    ### Instruction:
    You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning. 
    Please answer the following medical question. 

    ### Question:
    {data_point["Question"]}

    ### Response:
    {data_point["Complex_CoT"]}

    """
    return {"prompt": full_prompt.strip()}
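
To make the output format concrete, here is a minimal sketch that applies a shortened version of the same template to a made-up record. Both `generate_prompt_sketch` and the sample record are hypothetical and exist only for illustration. It also shows that `strip()` removes only the outer whitespace, so the interior lines keep the four-space indentation of the template.

```python
def generate_prompt_sketch(data_point):
    """Shortened stand-in for generate_prompt, using the same layout."""
    full_prompt = f"""
    ### Question:
    {data_point["Question"]}

    ### Response:
    {data_point["Complex_CoT"]}

    """
    return {"prompt": full_prompt.strip()}


# Hypothetical record with the two keys that the template reads.
sample = {
    "Question": "Which vitamin deficiency causes scurvy?",
    "Complex_CoT": "Scurvy follows from impaired collagen synthesis, which points to vitamin C.",
}

prompt = generate_prompt_sketch(sample)["prompt"]
print(prompt)

# Only the first line loses its indentation; the other lines keep it.
lines = prompt.split("\n")
indented = [line for line in lines if line.startswith("    ")]
print(len(lines), len(indented))
```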

In [None]:
# Load dataset from the HF hub
train_set = load_dataset(dataset_name, 'en', split="train[5%:]")
test_set = load_dataset(dataset_name, 'en', split="train[:5%]")

# Replace the original columns with a single formatted "prompt" column
columns_to_remove = list(train_set.features)

train_dataset = train_set.map(
    generate_prompt,
    remove_columns=columns_to_remove,
    batched=False
)

test_dataset = test_set.map(
    generate_prompt,
    remove_columns=columns_to_remove,
    batched=False
)

In [None]:
# Review dataset
train_dataset, test_dataset

Load the DeepSeek-R1 Distill Qwen 7B tokenizer from the Hugging Face Transformers library, and tokenize the train and test datasets.

In [None]:
####################
# Model & Tokenizer
####################
max_seq_length=1024

# Initialize a tokenizer by loading a pre-trained tokenizer configuration, using the fast tokenizer implementation if available.
tokenizer = AutoTokenizer.from_pretrained(
        model_id,
        use_fast=True
    )

tokenizer.pad_token = tokenizer.eos_token
    
def tokenize(text):
    result = tokenizer(
        text['prompt'],
        max_length=max_seq_length,
        padding="max_length",
        truncation=True
    )
    result["labels"] = result["input_ids"].copy()
    return result
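
The toy sketch below illustrates what `padding="max_length"` combined with `truncation=True` does to sequences of different lengths, and what copying `input_ids` into `labels` implies. It uses plain lists instead of the real tokenizer. The function `pad_or_truncate`, the token IDs, and the choice of right-padding are assumptions for illustration, since the real padding side depends on the tokenizer configuration.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Toy stand-in for padding="max_length" with truncation=True."""
    kept = token_ids[:max_length]          # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)         # padding: fill up to max_length
    input_ids = kept + [pad_id] * n_pad    # right-padding is assumed here
    attention_mask = [1] * len(kept) + [0] * n_pad
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        # As in the notebook, labels are a plain copy, so they include pad IDs.
        "labels": input_ids.copy(),
    }


short = pad_or_truncate([11, 12, 13], max_length=5, pad_id=0)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5, pad_id=0)

print(short["input_ids"], short["attention_mask"])
# [11, 12, 13, 0, 0] [1, 1, 1, 0, 0]
print(long["input_ids"], long["attention_mask"])
# [1, 2, 3, 4, 5] [1, 1, 1, 1, 1]
```

Every example ends up with exactly `max_length` entries, which keeps the stored dataset rectangular at the cost of storing pad tokens for short prompts.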

In [None]:
train_dataset = train_dataset.map(tokenize, remove_columns=["prompt"])
test_dataset = test_dataset.map(tokenize, remove_columns=["prompt"])

### Upload the tokenized data to Amazon S3

In [None]:
input_path = 'datasets/deepseek-r1-distilled-qwen-7b-recipe-lora'
train_dataset_s3_path = f"s3://{bucket_name}/{input_path}/train"
test_dataset_s3_path = f"s3://{bucket_name}/{input_path}/test"

train_dataset.save_to_disk(train_dataset_s3_path)
test_dataset.save_to_disk(test_dataset_s3_path)

***

## Fine-tune model

Below, we use ModelTrainer to configure and run the fine-tuning job.

#### Get the training image_uri

We use the `smdistributed-modelparallel` container image, pre-built for Amazon SageMaker.

In [None]:
from sagemaker.config import load_sagemaker_config

In [None]:
configs = load_sagemaker_config()

In [None]:
instance_type = "ml.p4d.24xlarge" # Override the instance type if you want to get a different container version

instance_type

In [None]:
image_uri = (
    f"658645717510.dkr.ecr.{sagemaker_session.boto_session.region_name}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
)

image_uri

Define the checkpoint S3 path.

In [None]:
checkpoint_s3_path = f"s3://{bucket_name}/deepseek-r1-distilled-qwen-7b-recipe-lora/checkpoints"

In [None]:
from sagemaker.modules.configs import CheckpointConfig, Compute, InputData, SourceCode, StoppingCondition
from sagemaker.modules.distributed import Torchrun
from sagemaker.modules.train import ModelTrainer

instance_count = 1

# Working override for custom dataset
recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "trainer": {
        "num_nodes": instance_count, # Required when instance_count > 1,
        "max_steps": max_steps,
    },
    "exp_manager": {
        "exp_dir": "/opt/ml/output",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },
    "use_smp_model": False, # Required for PEFT
    "model": {
        "hf_model_name_or_path": model_id,
        "train_batch_size": 14,
        "val_batch_size": 2,
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/test",
        },
    },
}

# Define the compute
compute_configs = Compute(
    instance_type=instance_type,
    instance_count=instance_count,
    keep_alive_period_in_seconds=1800
)

model_trainer = ModelTrainer.from_recipe(
    training_image=image_uri,
    training_recipe="fine-tuning/deepseek/hf_deepseek_r1_distilled_qwen_7b_seq8k_gpu_lora",
    recipe_overrides=recipe_overrides,
    requirements="./scripts/requirements.txt",
    base_job_name=job_prefix,
    compute=compute_configs,
    stopping_condition=StoppingCondition(
        max_runtime_in_seconds=7200
    ),
    checkpoint_config=CheckpointConfig(
        s3_uri=f"{checkpoint_s3_path}/{job_prefix}"
    ),
)
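
The `recipe_overrides` dictionary only lists the keys that differ from the recipe. The sketch below shows one way such nested overrides can be layered over defaults, so that untouched keys survive. It is an assumption about the merge semantics, not the SDK's implementation, and `apply_overrides` as well as the values in `base_recipe` are invented for illustration.

```python
import copy


def apply_overrides(base, overrides):
    """Recursively layer overrides on top of base without mutating either."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge key by key instead of replacing.
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical recipe defaults, invented for illustration only.
base_recipe = {
    "trainer": {"num_nodes": 4, "max_steps": 50, "precision": "bf16"},
    "model": {"train_batch_size": 1, "data": {"train_dir": None, "val_dir": None}},
}
overrides = {
    "trainer": {"num_nodes": 1, "max_steps": 215},
    "model": {"train_batch_size": 14, "data": {"train_dir": "/opt/ml/input/data/train"}},
}

merged = apply_overrides(base_recipe, overrides)
print(merged["trainer"])
# {'num_nodes': 1, 'max_steps': 215, 'precision': 'bf16'}
print(merged["model"]["data"])
# {'train_dir': '/opt/ml/input/data/train', 'val_dir': None}
```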

In [None]:
from sagemaker.modules.configs import InputData

# Pass the input data
train_input = InputData(
    channel_name="train",
    data_source=train_dataset_s3_path, # S3 path where training data is stored
)

test_input = InputData(
    channel_name="test",
    data_source=test_dataset_s3_path, # S3 path where test data is stored
)

# Check input channels configured
data = [train_input, test_input]
data
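
SageMaker mounts each input channel inside the training container at `/opt/ml/input/data/<channel_name>`, which is why the channel names must line up with the `train_dir` and `val_dir` values in the recipe overrides. The short check below sketches that relationship. The helper `channel_path` and the variable names are hypothetical.

```python
def channel_path(channel_name):
    """Container path where SageMaker mounts the given input channel."""
    return f"/opt/ml/input/data/{channel_name}"


# Channel names used for the InputData objects in this notebook.
channels = ["train", "test"]

# Data directories passed to the recipe through recipe_overrides.
recipe_dirs = {
    "train_dir": "/opt/ml/input/data/train",
    "val_dir": "/opt/ml/input/data/test",
}

mounted = {channel_path(name) for name in channels}
missing = [path for path in recipe_dirs.values() if path not in mounted]
print(missing)  # []
```

An empty list means every directory the recipe reads from is backed by a configured channel.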

In [None]:
# starting the train job with our uploaded datasets as input
model_trainer.train(input_data_config=data, wait=True)

## Evaluation

Define the S3 path of the trained model.

In [None]:
checkpoint_dir = f"s3://{bucket_name}/deepseek-r1-distilled-qwen-7b-recipe-lora/checkpoints/{job_prefix}"

trained_model=f"{checkpoint_dir}/peft_full/steps_{max_steps}/final-model/"

trained_model
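
If you prefer to inspect the location programmatically rather than with the CLI, the S3 URI can be split into a bucket and a key prefix using only the standard library. The sketch below assumes a placeholder bucket name (`example-bucket`) and reuses the path layout from the cell above.

```python
from urllib.parse import urlparse

bucket_name = "example-bucket"  # placeholder, not a real bucket
job_prefix = "train-DeepSeek-R1-Distill-Qwen-7B-recipe-lora"
max_steps = 215

checkpoint_dir = (
    f"s3://{bucket_name}/deepseek-r1-distilled-qwen-7b-recipe-lora/checkpoints/{job_prefix}"
)
trained_model = f"{checkpoint_dir}/peft_full/steps_{max_steps}/final-model/"

# Split the URI into the parts that S3 APIs expect separately.
parsed = urlparse(trained_model)
bucket = parsed.netloc
key_prefix = parsed.path.lstrip("/")

print(bucket)
print(key_prefix)
```

The resulting `bucket` and `key_prefix` values can then be passed as `Bucket` and `Prefix` to the boto3 S3 client's `list_objects_v2` call.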

In [None]:
!aws s3 ls {trained_model}

### Run evaluation job using SageMaker ModelTrainer

In [None]:
instance_type = "ml.p4d.24xlarge" # Override the instance type if you want to get a different container version

instance_type

In [None]:
image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=sagemaker_session.boto_session.region_name,
    version="2.4",
    instance_type=instance_type,
    image_scope="training"
)

image_uri

In [None]:
from sagemaker.modules.configs import Compute, InputData, OutputDataConfig, SourceCode, StoppingCondition
from sagemaker.modules.distributed import Torchrun
from sagemaker.modules.train import ModelTrainer

# Define the script to be run
source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    entry_script="evaluate_recipe.py",
    
)

# Define the compute
compute_configs = Compute(
    instance_type=instance_type,
    instance_count=1,
    keep_alive_period_in_seconds=1800
)

# define Training Job Name 
job_name = f"eval-{job_prefix}"

# Define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    base_job_name=job_name,
    compute=compute_configs,
    stopping_condition=StoppingCondition(
        max_runtime_in_seconds=7200
    ),
    hyperparameters={
        "model_id": model_id,  # Hugging Face model id
        "dataset_name": dataset_name
    }
)

In [None]:
from sagemaker.modules.configs import InputData

# Pass the input data
adapter_input = InputData(
    channel_name="adapterdir",
    data_source=trained_model, # S3 path where the fine-tuned model artifacts are stored
)

test_input = InputData(
    channel_name="testdata",
    data_source=test_dataset_s3_path, # S3 path where test data is stored
)

# Check input channels configured
data = [adapter_input, test_input]
data

In [None]:
# Start the evaluation job with the trained model and the test dataset as input
model_trainer.train(input_data_config=data, wait=False)