# CodeFu-7B training with veRL and Ray on Amazon SageMaker training jobs

In this notebook, we train [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) on Amazon SageMaker AI using [veRL](https://github.com/volcengine/verl) and [Ray on SageMaker training jobs](https://github.com/aws-samples/sample-ray-on-amazon-sagemaker-training-jobs).

This code replicates the training setup used for **[CodeFu-7B-v0.1](https://huggingface.co/aws-prototyping/codefu-7b-v0.1)**.

## Prerequisites

In [None]:
%pip install -r ./scripts/requirements.txt --upgrade

***

## Set up the AWS profile (optional)

In [None]:
import os

# os.environ["AWS_PROFILE"] = "<aws_profile>"

***

### Upload to Amazon S3

In [None]:
import boto3
import sagemaker

In [None]:
sagemaker_session = sagemaker.Session()
s3_client = boto3.client('s3')

bucket_name = sagemaker_session.default_bucket()
default_prefix = sagemaker_session.default_bucket_prefix

In [None]:
# Build the S3 key prefix for the datasets, honoring the session's default prefix if one is set
if default_prefix:
    input_path = f"{default_prefix}/datasets/codefu-verl-ray"
else:
    input_path = "datasets/codefu-verl-ray"

train_dataset_s3_path = f"s3://{bucket_name}/{input_path}/train/dataset.parquet"
test_dataset_s3_path = f"s3://{bucket_name}/{input_path}/test/dataset.parquet"

In [None]:
# Save datasets to s3

s3_client.upload_file(
    "./data/train/dataset.parquet", bucket_name, f"{input_path}/train/dataset.parquet"
)
s3_client.upload_file(
    "./data/test/dataset.parquet", bucket_name, f"{input_path}/test/dataset.parquet"
)

print("Datasets uploaded to:")
print(train_dataset_s3_path)
print(test_dataset_s3_path)
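
The prefix handling above is repeated in several cells of this notebook. As a minimal sketch, the hypothetical helper `build_s3_uri` (not part of the SageMaker SDK) shows one way to keep that logic in a single place. The bucket name in the usage line is a placeholder.

```python
from typing import Optional


def build_s3_uri(bucket: str, *parts: str, prefix: Optional[str] = None) -> str:
    """Join an optional default prefix and key parts into an s3:// URI."""
    segments = [prefix, *parts] if prefix else list(parts)
    # Strip stray slashes so that no empty path segment ("//") ends up in the key
    key = "/".join(s.strip("/") for s in segments if s and s.strip("/"))
    return f"s3://{bucket}/{key}"


# Placeholder bucket name, for illustration only
print(build_s3_uri("my-bucket", "datasets/codefu-verl-ray", "train", "dataset.parquet"))
```

Passing `prefix=sagemaker_session.default_bucket_prefix` would cover both branches of the `if default_prefix:` check, since a `None` or empty prefix is simply skipped.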

***

## Model fine-tuning

We are now ready to fine-tune our model. Training runs with veRL on a Ray cluster inside the SageMaker training job. We prepared a script [train.py](./scripts/train.py), which loads the dataset from disk, prepares the model and tokenizer, and starts the training.

For configuration we use `TrlParser`, which lets us provide hyperparameters in a `yaml` file. This file is uploaded to S3 and provided to the Amazon SageMaker training job in the same way as our datasets. We save the config as `args.yaml` and upload it to S3.

In [None]:
%%bash

cat > ./args.yaml <<EOF
model_id: "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"                           # Hugging Face model id
# sagemaker specific parameters
output_dir: "/opt/ml/checkpoints"                       # local path that SageMaker syncs to the S3 checkpoint location
train_dataset_path: "/opt/ml/input/data/train/"   # local path of the "train" input channel
test_dataset_path: "/opt/ml/input/data/test/"       # local path of the "test" input channel
# training parameters
run_name: "sagemaker-training-run"
learning_rate: 1e-5                    # learning rate
num_train_epochs: 2                    # number of training epochs
per_device_train_batch_size: 64        # batch size per device during training
per_device_eval_batch_size: 8          # batch size for evaluation
gradient_checkpointing: true
algorithm_adv_estimator: "grpo" 
data_max_prompt_length: 4096 
data_max_response_length: 20480  
actor_rollout_ref_model_use_remove_padding: true 
actor_rollout_ref_actor_ppo_mini_batch_size: 32 
actor_rollout_ref_actor_use_dynamic_bsz: true 
actor_rollout_ref_actor_ppo_micro_batch_size: 8 
actor_rollout_ref_actor_use_kl_loss: true 
actor_rollout_ref_actor_kl_loss_coef: 0.001 
actor_rollout_ref_actor_kl_loss_type: "low_var_kl" 
actor_rollout_ref_actor_ulysses_sequence_parallel_size: 4 
actor_rollout_ref_actor_fsdp_config_param_offload: true 
actor_rollout_ref_actor_fsdp_config_grad_offload: true 
actor_rollout_ref_actor_fsdp_config_optimizer_offload: true 
actor_rollout_ref_rollout_log_prob_micro_batch_size: 8
actor_rollout_ref_rollout_tensor_model_parallel_size: 2 
actor_rollout_ref_rollout_name: "vllm" 
actor_rollout_ref_rollout_temperature: 1.0 
actor_rollout_ref_rollout_gpu_memory_utilization: 0.9 
actor_rollout_ref_rollout_n: 8 
actor_rollout_ref_ref_log_prob_micro_batch_size: 8
actor_rollout_ref_ref_fsdp_config_param_offload: true 
algorithm_kl_ctrl_kl_coef: 0.001 
trainer_critic_warmup: 0 
trainer_save_freq: 16 
trainer_test_freq: 16
wandb_token: YOUR_WANDB_TOKEN
EOF
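
Before launching a multi-day job, it can help to sanity-check a few of the values above against the hardware. The sketch below is an illustration, not part of `train.py`: the function `check_parallelism` is hypothetical, and the divisibility checks are assumptions about what a sensible configuration looks like rather than the exact validation veRL performs. It relies on `ml.p4de.24xlarge` providing 8 GPUs per instance.

```python
# Subset of the values written to args.yaml above
config = {
    "data_max_prompt_length": 4096,
    "data_max_response_length": 20480,
    "actor_rollout_ref_rollout_n": 8,
    "actor_rollout_ref_rollout_tensor_model_parallel_size": 2,
    "actor_rollout_ref_actor_ulysses_sequence_parallel_size": 4,
}

# ml.p4de.24xlarge provides 8 GPUs per instance
GPUS_PER_INSTANCE = 8


def check_parallelism(cfg: dict, num_gpus: int) -> dict:
    """Check that parallel sizes divide the GPU count and derive a few totals."""
    tp = cfg["actor_rollout_ref_rollout_tensor_model_parallel_size"]
    sp = cfg["actor_rollout_ref_actor_ulysses_sequence_parallel_size"]

    # Assumption: each parallel group must be filled with a whole number of GPUs
    for name, size in (("tensor parallel", tp), ("sequence parallel", sp)):
        if num_gpus % size != 0:
            raise ValueError(f"{name} size {size} does not divide {num_gpus} GPUs")

    return {
        # Longest sequence (prompt plus response) the model has to handle
        "max_sequence_length": cfg["data_max_prompt_length"]
        + cfg["data_max_response_length"],
        # Number of responses sampled per prompt during rollout
        "responses_per_prompt": cfg["actor_rollout_ref_rollout_n"],
        "tensor_parallel_groups": num_gpus // tp,
        "sequence_parallel_groups": num_gpus // sp,
    }


summary = check_parallelism(config, GPUS_PER_INSTANCE)
print(summary)
```

With the values above, prompts and responses together can reach 4096 + 20480 = 24576 tokens, which is the main driver of GPU memory use in this setup.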

Let's upload the config file to S3.

In [None]:
import os
from sagemaker.s3 import S3Uploader

if default_prefix:
    input_path = f"s3://{bucket_name}/{default_prefix}/datasets/codefu-verl-ray"
else:
    input_path = f"s3://{bucket_name}/datasets/codefu-verl-ray"

# upload the model yaml file to s3
model_yaml = "args.yaml"
train_config_s3_path = S3Uploader.upload(
    local_path=model_yaml, desired_s3_uri=f"{input_path}/config"
)

os.remove("./args.yaml")

print("Training config uploaded to:")
print(train_config_s3_path)

## Fine-tune model

The `ModelTrainer` below launches the training job and saves the checkpoints and the training output to Amazon S3.

### Get the training image_uri

We use a custom PyTorch container image, which is expected to be available in the Amazon ECR repository `codefu-pytorch` of the current account and region.

In [None]:
import boto3
import sagemaker
from sagemaker.config import load_sagemaker_config

In [None]:
sts = boto3.client("sts")

sagemaker_session = sagemaker.Session()

bucket_name = sagemaker_session.default_bucket()
default_prefix = sagemaker_session.default_bucket_prefix
configs = load_sagemaker_config()

In [None]:
instance_type = "ml.p4de.24xlarge"
instance_count = 1

instance_type

In [None]:
account_id = sts.get_caller_identity()["Account"]
region = sagemaker_session.boto_session.region_name
repo_name = "codefu-pytorch"
tag = "latest"

image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}:{tag}"

image_uri
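
A typo in the account ID or region only surfaces when the training job fails to pull the image. As a minimal sketch, the hypothetical helper `ecr_image_uri` builds the same URI format as the cell above and rejects account IDs that are not 12 digits. The account ID in the usage line is a placeholder.

```python
import re


def ecr_image_uri(account_id: str, region: str, repo_name: str, tag: str = "latest") -> str:
    """Build a private Amazon ECR image URI in the format used above."""
    # AWS account IDs are 12-digit numbers
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError(f"Invalid AWS account ID: {account_id!r}")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo_name}:{tag}"


# Placeholder account ID, for illustration only
print(ecr_image_uri("123456789012", "us-east-1", "codefu-pytorch"))
```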

In [None]:
from sagemaker import get_execution_role
from sagemaker.modules.configs import (
    CheckpointConfig,
    Compute,
    OutputDataConfig,
    RemoteDebugConfig,
    SourceCode,
    StoppingCondition,
)
from sagemaker.modules.train import ModelTrainer

args = [
    "--entrypoint",
    "train.py",
    "--config",
    "/opt/ml/input/data/config/args.yaml",
]

# Define the script to be run
source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    command=f"python launcher.py {' '.join(args)}",
)

# Define the compute
compute_configs = Compute(
    instance_type=instance_type,
    instance_count=instance_count,
    keep_alive_period_in_seconds=1800,
)

# Define the training job base name
job_name = "train-codefu-verl-ray"

# define OutputDataConfig path
if default_prefix:
    output_path = f"s3://{bucket_name}/{default_prefix}/{job_name}"
else:
    output_path = f"s3://{bucket_name}/{job_name}"

# Define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    base_job_name=job_name,
    compute=compute_configs,
    stopping_condition=StoppingCondition(
        max_runtime_in_seconds=3600 * 24 * 5
    ),  # 5 days
    output_data_config=OutputDataConfig(s3_output_path=output_path),
    checkpoint_config=CheckpointConfig(
        s3_uri=output_path + "/checkpoints", local_path="/opt/ml/checkpoints"
    ),
    environment={
        "RAY_PROMETHEUS_HOST": "<PROMETHEUS_HOST>",
        "RAY_GRAFANA_HOST": "<GRAFANA_HOST>",
        "RAY_PROMETHEUS_NAME": "prometheus",
        "BASE_MODEL": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        "RUN_NAME": "sagemaker-training-run",
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "VLLM_ATTENTION_BACKEND": "XFORMERS",
        "MIN_PUBLIC_RATIO": "0",
        "NCCL_P2P_DISABLE": "1",
        "NCCL_IB_DISABLE": "0",
        "NCCL_NET_PLUGIN": "none",
        "NCCL_TIMEOUT": "1800",
    },
    role=get_execution_role(),
).with_remote_debug_config(RemoteDebugConfig(enable_remote_debug=True))
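
The command string and the runtime limit in the cell above are both assembled by hand. The sketch below shows an alternative using only the Python standard library: `shlex.join` quotes each argument, which matters if a path ever contains spaces, and `timedelta` makes the 5-day limit explicit. It is an illustration and does not change what the cell above does.

```python
import shlex
from datetime import timedelta

args = [
    "--entrypoint",
    "train.py",
    "--config",
    "/opt/ml/input/data/config/args.yaml",
]

# shlex.join quotes arguments where needed; these arguments need no quoting
command = shlex.join(["python", "launcher.py", *args])
print(command)

# Same value as 3600 * 24 * 5
max_runtime_in_seconds = int(timedelta(days=5).total_seconds())
print(max_runtime_in_seconds)
```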

In [None]:
from sagemaker.modules.configs import InputData, S3DataSource

# Pass the input data
train_input = InputData(
    channel_name="train",
    data_source=S3DataSource(
        s3_data_type="S3Prefix",
        s3_uri=train_dataset_s3_path,
        s3_data_distribution_type="FullyReplicated",
    ),  # S3 path where training data is stored
)

val_input = InputData(
    channel_name="test",
    data_source=S3DataSource(
        s3_data_type="S3Prefix",
        s3_uri=test_dataset_s3_path,
        s3_data_distribution_type="FullyReplicated",
    ),  # S3 path where val data is stored
)

config_input = InputData(
    channel_name="config",
    data_source=S3DataSource(
        s3_data_type="S3Prefix",
        s3_uri=train_config_s3_path,
        s3_data_distribution_type="FullyReplicated",
    ),  # S3 path where configs are stored
)

# Check input channels configured
data = [
    train_input,
    val_input,
    config_input,
]
data
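
Each input channel is made available inside the training container under `/opt/ml/input/data/<channel_name>`, which is why the paths in `args.yaml` and the `--config` argument start with that directory. As a minimal sketch, the hypothetical helper `channel_dir` checks that the paths used in this notebook line up with the channel names defined above.

```python
from pathlib import PurePosixPath

# Directory under which SageMaker exposes input channels in the container
SAGEMAKER_INPUT_DIR = PurePosixPath("/opt/ml/input/data")


def channel_dir(channel_name: str) -> str:
    """Return the container directory for a given input channel."""
    return str(SAGEMAKER_INPUT_DIR / channel_name)


# Paths referenced in args.yaml and in the launcher arguments
expected_paths = {
    "train": "/opt/ml/input/data/train/",
    "test": "/opt/ml/input/data/test/",
    "config": "/opt/ml/input/data/config/args.yaml",
}

for channel, path in expected_paths.items():
    # Fail early if a configured path does not live under its channel directory
    assert path.startswith(channel_dir(channel)), f"{path} is not in channel {channel}"
    print(f"{channel}: {channel_dir(channel)}")
```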

In [None]:
# Start the training job with the uploaded datasets and config as input
model_trainer.train(input_data_config=data, wait=False)
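
Because the job is started with `wait=False`, the call returns immediately. As a minimal sketch, the hypothetical helper `wait_for_training_job` polls the job status through the SageMaker `describe_training_job` API until it reaches a terminal state. The job name is a placeholder here: it has to be the full training job name, for example as shown in the SageMaker console.

```python
import time
from typing import Callable

# Statuses after which a training job no longer changes
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(
    sm_client,
    job_name: str,
    poll_seconds: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a SageMaker training job until it reaches a terminal status."""
    while True:
        response = sm_client.describe_training_job(TrainingJobName=job_name)
        status = response["TrainingJobStatus"]
        print(f"{job_name}: {status}")
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)


# Usage (requires AWS credentials; the job name is a placeholder):
# import boto3
# final_status = wait_for_training_job(boto3.client("sagemaker"), "<training-job-name>")
```

The `sleep` parameter exists only so the loop can be exercised without real waiting, for example with a stub client in a test.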