## RLVR Example - Fine-tuning with SageMaker

This notebook demonstrates the basic user flow for RLVR (Reinforcement Learning with Verifiable Rewards) fine-tuning of a model available in SageMaker JumpStart.
For information on the models available in JumpStart, see: https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-foundation-models-latest.html

### Setup and Configuration

Initialize the environment by importing the necessary libraries and configuring AWS credentials.

In [1]:

from sagemaker.train.rlvr_trainer import RLVRTrainer
from sagemaker.train.configs import InputData
from rich import print as rprint
from rich.pretty import pprint
from sagemaker.core.resources import ModelPackage
from sagemaker.train.common import TrainingType

import os
import boto3
from sagemaker.core.helper.session_helper import Session

# SageMaker session; passed as `sagemaker_session` in the continued fine-tuning cells below
sagemaker_session = Session()

# For native MLflow metrics during Trainer wait, uncomment the line below and set the appropriate region
# os.environ["SAGEMAKER_MLFLOW_CUSTOM_ENDPOINT"] = "https://mlflow.sagemaker.us-east-1.app.aws"



sagemaker.config INFO - Not applying SDK defaults from location: /Library/Application Support/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /Users/rsareddy/Library/Application Support/sagemaker/config.yaml
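The commented-out MLflow endpoint in the setup cell is region-specific. A minimal sketch, assuming the `https://mlflow.sagemaker.<region>.app.aws` pattern shown for us-east-1 holds for the region you use (this notebook does not confirm it for other regions), builds the URL from a region name instead of editing it by hand. `mlflow_custom_endpoint` is a hypothetical helper, not part of the SDK.

```python
import os


def mlflow_custom_endpoint(region: str) -> str:
    """Build the MLflow endpoint URL for a region.

    Assumption: the pattern from the setup cell's comment applies to every region.
    """
    return f"https://mlflow.sagemaker.{region}.app.aws"


# Point Trainer wait at the us-east-1 endpoint, as in the setup cell's comment
os.environ["SAGEMAKER_MLFLOW_CUSTOM_ENDPOINT"] = mlflow_custom_endpoint("us-east-1")
print(os.environ["SAGEMAKER_MLFLOW_CUSTOM_ENDPOINT"])
```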


### Create RLVRTrainer
**Required Parameters**

* `model`: ID of a base model in SageMaker Hub content that supports fine-tuning, or a `ModelPackage` (previously trained artifacts)

**Optional Parameters**

* `custom_reward_function`: ARN of a custom reward function (evaluator)
* `model_package_group_name`: name of the model package group, or a `ModelPackageGroup` object
* `mlflow_resource_arn`: ARN (or name) of the MLflow app used to track the training job; auto-resolved if not provided
* `mlflow_experiment_name`: MLflow experiment name (str)
* `mlflow_run_name`: MLflow run name (str)
* `training_dataset`: training dataset, given as either a Dataset ARN or an S3 path. A training dataset is required to run a training job; it can be provided either to the trainer or to `.train()`.
* `validation_dataset`: validation dataset, given as either a Dataset ARN or an S3 path
* `s3_output_path`: S3 path for the trained model artifacts
* `training_type`: training type, for example `TrainingType.LORA`
* `sagemaker_session`: SageMaker session to use
* `role`: ARN of the IAM execution role for the training job
* `accept_eula`: set to `True` to accept the base model's end-user license agreement, where one applies
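The dataset parameters accept either an S3 path or a Dataset ARN. A minimal sketch of a local pre-check, assuming the Dataset ARN shape used later in this notebook (`arn:aws:sagemaker:<region>:<account>:hub-content/<hub>/DataSet/<name>/<version>`). `dataset_source_kind` is a hypothetical helper, not an SDK function, and the SDK may accept forms it rejects.

```python
import re

# Assumed Dataset ARN shape, based on the ARNs that appear in this notebook
_DATASET_ARN = re.compile(
    r"^arn:aws:sagemaker:[a-z0-9-]+:\d{12}:hub-content/.+/DataSet/.+$"
)


def dataset_source_kind(value: str) -> str:
    """Classify a dataset argument as an S3 path or a Dataset ARN."""
    if value.startswith("s3://"):
        return "s3"
    if _DATASET_ARN.match(value):
        return "dataset_arn"
    raise ValueError(f"Expected an s3:// path or a Dataset ARN, got: {value!r}")


print(dataset_source_kind(
    "s3://mc-flows-sdk-testing/input_data/rlvr-rlaif-test-data/train_285.jsonl"
))
```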

In [2]:
# Fine-tune a JumpStart base model
rlvr_trainer = RLVRTrainer(
    model="meta-textgeneration-llama-3-2-1b-instruct",  # Union[str, ModelPackage]
    model_package_group_name="sdk-test-finetuned-models",
    # mlflow_resource_arn="arn:aws:sagemaker:us-west-2:052150106756:mlflow-tracking-server/mmlu-eval-experiment",  # Optional[str]: MLflow app ARN or name; auto-resolved if not provided
    mlflow_experiment_name="test-rlvr-finetuned-models-exp",  # Optional[str]
    mlflow_run_name="test-rlvr-finetuned-models-run",  # Optional[str]
    # S3 path; a Dataset ARN also works, e.g.
    # "arn:aws:sagemaker:us-west-2:052150106756:hub-content/AIRegistry/DataSet/MarketingDemoDataset1/1.0.0"
    training_dataset="s3://mc-flows-sdk-testing/input_data/rlvr-rlaif-test-data/train_285.jsonl",
    s3_output_path="s3://mc-flows-sdk-testing/output/",
    # sagemaker_session=sagemaker_session,
    # role="arn:aws:iam::052150106756:role/service-role/AmazonSageMaker-ExecutionRole-20250731T162975",
    # role="arn:aws:iam::052150106756:role/Admin",
    accept_eula=True,
)

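### Discover and update fine-tuning options

Each technique and model exposes hyperparameters that you can override. `hyperparameters.to_dict()` returns the current values, and `hyperparameters.get_info()` prints each option's current value, type, default, valid range or options, and whether it is required.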

In [3]:
print("Default Finetuning Options:")
pprint(rlvr_trainer.hyperparameters.to_dict())

# Show the type, default, valid values, and required flag of each option
rlvr_trainer.hyperparameters.get_info()



Default Finetuning Options:



data_path:
  Current value: None
  Type: string
  Default: None
  Required: Yes

global_batch_size:
  Current value: 64
  Type: integer
  Default: 64
  Valid options: [64, 128, 256, 512, 1024]
  Required: Yes

learning_rate:
  Current value: 1e-05
  Type: float
  Default: 1e-05
  Range: 1e-07 - 0.001
  Required: Yes

max_epochs:
  Current value: 2
  Type: integer
  Default: 2
  Range: 1 - 30
  Required: Yes

max_prompt_length:
  Current value: 1024
  Type: integer
  Default: 1024
  Range: 512 - 16384
  Required: Yes

mlflow_run_id:
  Current value: 
  Type: string
  Default: 

mlflow_tracking_uri:
  Current value: 
  Type: string
  Default: 

model_name_or_path:
  Current value: meta-llama/Llama-3.2-1B-Instruct
  Type: string
  Default: meta-llama/Llama-3.2-1B-Instruct
  Required: Yes

name:
  Current value: example-name-3pl5q
  Type: string
  Default: example-name-3pl5q
  Required: Yes

output_path:
  Current value: /opt/ml/model
  Type: string
  Default: /opt/ml/model
  Required: Yes
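The limits that `get_info()` reports can be checked locally before you override a value. A minimal sketch, with `CONSTRAINTS` copied by hand from the output above for this model (other models or techniques may report different limits, and this notebook does not show whether the SDK validates assignments itself). `check_override` is a hypothetical helper.

```python
# Limits copied from the get_info() output for meta-textgeneration-llama-3-2-1b-instruct
CONSTRAINTS = {
    "global_batch_size": {"options": [64, 128, 256, 512, 1024]},
    "learning_rate": {"range": (1e-07, 0.001)},
    "max_epochs": {"range": (1, 30)},
    "max_prompt_length": {"range": (512, 16384)},
}


def check_override(name: str, value) -> None:
    """Raise ValueError if value falls outside the reported options or range."""
    rule = CONSTRAINTS.get(name)
    if rule is None:
        return  # no local rule for this option
    if "options" in rule and value not in rule["options"]:
        raise ValueError(f"{name}={value!r} not in {rule['options']}")
    if "range" in rule:
        low, high = rule["range"]
        if not low <= value <= high:
            raise ValueError(f"{name}={value!r} outside [{low}, {high}]")


check_override("learning_rate", 5e-6)     # within range, no error
check_override("global_batch_size", 128)  # valid option, no error
```

If the check passes, the value can then be assigned with the attribute pattern used in the Nova cells, for example `rlvr_trainer.hyperparameters.learning_rate = 5e-6`.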

#### Start RLVR training


In [11]:
training_job = rlvr_trainer.train(wait=True)

In [12]:
pprint(training_job)

In [9]:
from sagemaker.core.resources import TrainingJob

response = TrainingJob.get(training_job_name="meta-textgeneration-llama-3-2-1b-instruct-rlvr-20251125142110")
pprint(response)

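### View training job details

You can retrieve the details and status of any training job by name with `TrainingJob.get(training_job_name=...)`. To update a job object you already hold, such as the one returned by `train()`, call its `refresh()` method.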
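When a job is started with `train(wait=False)`, or you look one up by name, you may want to poll until it reaches a terminal state. A minimal, SDK-agnostic sketch: `wait_for_status` is a hypothetical helper that takes any zero-argument callable returning a SageMaker training job status string (`InProgress`, `Completed`, `Failed`, `Stopping`, or `Stopped`).

```python
import time
from typing import Callable

# Terminal values of the SageMaker TrainingJobStatus field
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_status(
    get_status: Callable[[], str],
    poll_seconds: float = 60.0,
    timeout_seconds: float = 6 * 3600,
) -> str:
    """Call get_status() until it returns a terminal status, then return it."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {status!r} after {timeout_seconds} s")
        time.sleep(poll_seconds)
```

With the SDK you might pass `lambda: TrainingJob.get(training_job_name=name).training_job_status`; the attribute name `training_job_status` is an assumption about the resource object, so check it against the `pprint` output first.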

In [6]:
from sagemaker.core.resources import TrainingJob

response = TrainingJob.get(training_job_name="meta-textgeneration-llama-3-2-3b-instruct-rlvr-20251123033517")
pprint(response)

In [14]:
training_job.refresh()
pprint(training_job)

### RLVR with a custom reward function

Here we pass the ARN of a user-defined reward function through the `custom_reward_function` parameter.
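In RLVR, the reward comes from a verifier that checks the model's output against a known answer rather than from a learned reward model. The input/output contract of the function referenced by `custom_reward_function` is not shown in this notebook, so the following is only a standalone, hypothetical sketch of the kind of check such a function might perform: it extracts the last number in a response and returns 1.0 if it matches the reference answer, else 0.0.

```python
import re
from typing import Optional

# Integers or decimals, optionally negative
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number in the text (thousands separators removed), or None."""
    matches = _NUMBER.findall(text.replace(",", ""))
    return float(matches[-1]) if matches else None


def verifiable_reward(response: str, reference: str, tol: float = 1e-6) -> float:
    """Binary reward: 1.0 if the response's final number matches the reference."""
    predicted = extract_final_number(response)
    expected = extract_final_number(reference)
    if predicted is None or expected is None:
        return 0.0
    return 1.0 if abs(predicted - expected) <= tol else 0.0


print(verifiable_reward("So the answer is 42.", "42"))  # 1.0
```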

In [3]:

# Fine-tune with a user-defined reward function
rlvr_trainer = RLVRTrainer(
    model="meta-textgeneration-llama-3-2-1b-instruct",  # Union[str, ModelPackage]
    model_package_group_name="sdk-test-finetuned-models",
    # mlflow_resource_arn="arn:aws:sagemaker:us-west-2:052150106756:mlflow-tracking-server/mmlu-eval-experiment",  # Optional[str]: MLflow app ARN or name; auto-resolved if not provided
    mlflow_experiment_name="test-rlvr-finetuned-models-exp",  # Optional[str]
    mlflow_run_name="test-rlvr-finetuned-models-run",  # Optional[str]
    training_dataset="s3://mc-flows-sdk-testing/input_data/rlvr-rlaif-test-data/train_285.jsonl",
    s3_output_path="s3://mc-flows-sdk-testing/output/",
    # sagemaker_session=sagemaker_session,
    # role="arn:aws:iam::052150106756:role/service-role/AmazonSageMaker-ExecutionRole-20250731T162975",
    # role="arn:aws:iam::052150106756:role/Admin",
    custom_reward_function="arn:aws:sagemaker:us-west-2:729646638167:hub-content/sdktest/JsonDoc/rlvr-test-rf/0.0.1",  # reward function ARN
    accept_eula=True,
)

In [4]:
training_job = rlvr_trainer.train(wait=True)

In [5]:
#training_job.refresh()
pprint(training_job)


## Continued Fine-tuning, or Fine-tuning on Model Artifacts

#### Discover a ModelPackage and get its details

In [6]:
from rich import print as rprint
from rich.pretty import pprint
from sagemaker.core.resources import ModelPackage, ModelPackageGroup

# To list the model packages in a group instead:
# model_package_iter = ModelPackage.get_all(model_package_group_name="test-finetuned-models-gamma")

# Fetch a specific model package version by ARN
model_package = ModelPackage.get(model_package_name="arn:aws:sagemaker:us-west-2:052150106756:model-package/test-finetuned-models-gamma/61")

pprint(model_package)

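#### Create Trainer

Trainer creation is the same as in the fine-tuning section above, except that `model` is a `ModelPackage` (previously trained artifacts) and `training_type` is set to `TrainingType.LORA`. As the second cell shows, `model` also accepts the model package ARN as a string.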
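A versioned model package ARN has the form `arn:aws:sagemaker:<region>:<account>:model-package/<group>/<version>`. A minimal sketch splits such an ARN so you can, for example, compare its group with the `model_package_group_name` passed to the trainer. The helper `parse_model_package_arn` is hypothetical and handles only versioned ARNs like the one used here.

```python
from typing import NamedTuple


class ModelPackageRef(NamedTuple):
    region: str
    account: str
    group: str
    version: int


def parse_model_package_arn(arn: str) -> ModelPackageRef:
    """Split arn:aws:sagemaker:<region>:<account>:model-package/<group>/<version>."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[:3] != ["arn", "aws", "sagemaker"]:
        raise ValueError(f"Not a SageMaker ARN: {arn!r}")
    region, account, resource = parts[3], parts[4], parts[5]
    kind, _, rest = resource.partition("/")
    group, _, version = rest.rpartition("/")
    if kind != "model-package" or not group or not version.isdigit():
        raise ValueError(f"Not a versioned model package ARN: {arn!r}")
    return ModelPackageRef(region, account, group, int(version))


ref = parse_model_package_arn(
    "arn:aws:sagemaker:us-west-2:052150106756:model-package/test-finetuned-models-gamma/61"
)
print(ref.group, ref.version)  # test-finetuned-models-gamma 61
```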

In [7]:
# Continue fine-tuning from a ModelPackage object
rlvr_trainer = RLVRTrainer(
    model=model_package,  # Union[str, ModelPackage]
    training_type=TrainingType.LORA,
    model_package_group_name="test-finetuned-models-gamma",
    mlflow_resource_arn="arn:aws:sagemaker:us-west-2:052150106756:mlflow-tracking-server/mmlu-eval-experiment",  # Optional[str]: MLflow app ARN or name; auto-resolved if not provided
    mlflow_experiment_name="test-rlvr-finetuned-models-exp",  # Optional[str]
    mlflow_run_name="test-rlvr-finetuned-models-run",  # Optional[str]
    # Dataset ARN; another example:
    # "arn:aws:sagemaker:us-west-2:052150106756:hub-content/AIRegistry/DataSet/MarketingDemoDataset1/1.0.0"
    training_dataset="arn:aws:sagemaker:us-west-2:052150106756:hub-content/AIRegistry/DataSet/rlvr-rlaif-test-dataset/0.0.2",
    s3_output_path="s3://open-models-testing-pdx/output",
    sagemaker_session=sagemaker_session,
    # role="arn:aws:iam::052150106756:role/service-role/AmazonSageMaker-ExecutionRole-20250731T162975",
    role="arn:aws:iam::052150106756:role/Admin",
)

In [2]:
# Continue fine-tuning from a model package ARN passed as a string
rlvr_trainer = RLVRTrainer(
    model="arn:aws:sagemaker:us-west-2:052150106756:model-package/test-finetuned-models-gamma/61",  # Union[str, ModelPackage]
    training_type=TrainingType.LORA,
    model_package_group_name="test-finetuned-models-gamma",
    mlflow_resource_arn="arn:aws:sagemaker:us-west-2:052150106756:mlflow-tracking-server/mmlu-eval-experiment",  # Optional[str]: MLflow app ARN or name; auto-resolved if not provided
    mlflow_experiment_name="test-rlvr-finetuned-models-exp",  # Optional[str]
    mlflow_run_name="test-rlvr-finetuned-models-run",  # Optional[str]
    # Dataset ARN; another example:
    # "arn:aws:sagemaker:us-west-2:052150106756:hub-content/AIRegistry/DataSet/MarketingDemoDataset1/1.0.0"
    training_dataset="arn:aws:sagemaker:us-west-2:052150106756:hub-content/AIRegistry/DataSet/rlvr-rlaif-test-dataset/0.0.2",
    s3_output_path="s3://open-models-testing-pdx/output",
    sagemaker_session=sagemaker_session,
    # role="arn:aws:iam::052150106756:role/service-role/AmazonSageMaker-ExecutionRole-20250731T162975",
    role="arn:aws:iam::052150106756:role/Admin",
)

#### Start the Training

In [None]:
training_job = rlvr_trainer.train(
    wait=True,
)

Training Job Name: meta-textgeneration-llama-3-2-3b-instruct-rlvr-20251124140103



In [11]:
pprint(training_job)

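## Nova RLVR job

This example fine-tunes Nova Lite V2 (`nova-textgeneration-lite-v2`) in `us-east-1` with a custom reward function and a validation dataset. Creating the trainer prints a list of the available RLVR training recipes for the model.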

In [2]:
import os
os.environ['SAGEMAKER_REGION'] = 'us-east-1'  # region for this example

# Fine-tune Nova Lite V2 with a custom reward function and a validation dataset
rlvr_trainer = RLVRTrainer(
    model="nova-textgeneration-lite-v2",  # Union[str, ModelPackage]
    model_package_group_name="sdk-test-finetuned-models",
    # mlflow_resource_arn="arn:aws:sagemaker:us-east-1:052150106756:mlflow-app/app-UNBKLOAX64PX",  # Optional[str]: MLflow app ARN or name; auto-resolved if not provided
    mlflow_experiment_name="test-nova-rlvr-finetuned-models-exp",  # Optional[str]
    mlflow_run_name="test-nova-rlvr-finetuned-models-run",  # Optional[str]
    training_dataset="s3://mc-flows-sdk-testing-us-east-1/input_data/rlvr-nova/grpo-64-sample.jsonl",
    validation_dataset="s3://mc-flows-sdk-testing-us-east-1/input_data/rlvr-nova/grpo-64-sample.jsonl",
    s3_output_path="s3://mc-flows-sdk-testing-us-east-1/output/",
    custom_reward_function="arn:aws:sagemaker:us-east-1:729646638167:hub-content/sdktest/JsonDoc/rlvr-nova-test-rf/0.0.1",
    accept_eula=True,
)


[{'DisplayName': 'Nova Lite V2 LoRA RLVR SMTJ training on GPU', 'Name': 'nova_lite_v2_smtj_p5_p5en_gpu_lora_rft', 'RecipeFilePath': 'recipes/fine-tuning/nova/nova_2_0/nova_lite/RFT/nova_lite_v2_smtj_p5_p5en_gpu_lora_rft.yaml', 'CustomizationTechnique': 'RLVR', 'InstanceCount': 4, 'Type': 'FineTuning', 'Versions': ['1.0'], 'Hardware': 'GPU', 'SupportedInstanceTypes': ['ml.p5.48xlarge', 'ml.p5en.48xlarge'], 'Peft': 'LORA', 'SequenceLength': '8K', 'ServerlessMeteringType': 'Hourly', 'SmtjRecipeTemplateS3Uri': 's3://jumpstart-cache-prod-us-east-1/recipes/nova_lite_v2_smtj_p5_p5en_gpu_lora_rft_payload_template_sm_jobs_v1.0.19.yaml', 'SmtjOverrideParamsS3Uri': 's3://jumpstart-cache-prod-us-east-1/recipes/nova_lite_v2_smtj_p5_p5en_gpu_lora_rft_override_params_sm_jobs_v1.0.19.json', 'SmtjImageUri': '708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-TJ-RFT-V2-latest'}, {'DisplayName': 'Nova Lite V2 Full Rank RLVR SMTJ training on GPU', 'Name': 'nova_lite_v2_smtj_p5_p5en_gpu_rf
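Each recipe in the printed list records its PEFT mode (`Peft`) and its `SupportedInstanceTypes`. This notebook does not show whether the SDK exposes the list as an attribute, so the `recipes` variable below is hypothetical and holds only fields copied from the first (LoRA) entry above; the remaining entries are truncated in this output. A small filter then picks recipe names by PEFT mode and instance type.

```python
from typing import Iterable, List, Optional


def recipes_for(
    recipes: Iterable[dict],
    peft: Optional[str] = None,
    instance_type: Optional[str] = None,
) -> List[str]:
    """Return names of recipes matching the given PEFT mode and instance type."""
    names = []
    for recipe in recipes:
        if peft is not None and recipe.get("Peft") != peft:
            continue
        if instance_type is not None and instance_type not in recipe.get("SupportedInstanceTypes", []):
            continue
        names.append(recipe["Name"])
    return names


# Fields copied from the first entry of the printed list
recipes = [
    {
        "Name": "nova_lite_v2_smtj_p5_p5en_gpu_lora_rft",
        "CustomizationTechnique": "RLVR",
        "Peft": "LORA",
        "InstanceCount": 4,
        "SupportedInstanceTypes": ["ml.p5.48xlarge", "ml.p5en.48xlarge"],
    },
]

print(recipes_for(recipes, peft="LORA", instance_type="ml.p5.48xlarge"))
# ['nova_lite_v2_smtj_p5_p5en_gpu_lora_rft']
```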

In [3]:
rlvr_trainer.hyperparameters.to_dict()

{'name': 'my-rft-run-dcx1l',
 'data_s3_path': 's3://<bucket>/<data file>',
 'reward_lambda_arn': 'arn:aws:lambda:<region>:<account-id>:function:<function-name>',
 'learning_rate': '0.0001',
 'max_steps': '10',
 'lora_alpha': '32',
 'global_batch_size': '64',
 'max_length': '8192',
 'learning_rate_ratio': '64.0'}
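The defaults above still contain template placeholders such as `<bucket>` and `<region>`, which the next cell overwrites. A minimal sketch of a pre-flight check lists the keys whose values still contain a `<...>` placeholder. The helper `unset_placeholders` is hypothetical, not part of the SDK; in the notebook you would call it on `rlvr_trainer.hyperparameters.to_dict()`.

```python
import re

# Matches template placeholders such as <bucket> or <function-name>
_PLACEHOLDER = re.compile(r"<[^<>]+>")


def unset_placeholders(hparams: dict) -> list:
    """Return the sorted keys whose string values still contain a <...> placeholder."""
    return sorted(
        key for key, value in hparams.items()
        if isinstance(value, str) and _PLACEHOLDER.search(value)
    )


# Default values copied from the output above
defaults = {
    "name": "my-rft-run-dcx1l",
    "data_s3_path": "s3://<bucket>/<data file>",
    "reward_lambda_arn": "arn:aws:lambda:<region>:<account-id>:function:<function-name>",
    "learning_rate": "0.0001",
    "max_steps": "10",
    "lora_alpha": "32",
    "global_batch_size": "64",
    "max_length": "8192",
    "learning_rate_ratio": "64.0",
}

print(unset_placeholders(defaults))  # ['data_s3_path', 'reward_lambda_arn']
```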

In [4]:
# Replace the template placeholders with real values before training
rlvr_trainer.hyperparameters.data_s3_path = 's3://example-bucket'

rlvr_trainer.hyperparameters.reward_lambda_arn = 'arn:aws:lambda:us-east-1:729646638167:function:rlvr-nova-reward-function'

In [5]:
rlvr_trainer.hyperparameters.to_dict()

{'name': 'my-rft-run-dcx1l',
 'data_s3_path': 's3://example-bucket',
 'reward_lambda_arn': 'arn:aws:lambda:us-east-1:729646638167:function:rlvr-nova-reward-function',
 'learning_rate': '0.0001',
 'max_steps': '10',
 'lora_alpha': '32',
 'global_batch_size': '64',
 'max_length': '8192',
 'learning_rate_ratio': '64.0'}

In [6]:
training_job = rlvr_trainer.train(wait=True)  # block until the job finishes

In [6]:
training_job = rlvr_trainer.train(wait=False)  # start the job and return without waiting


### Nova RLVR job in account 551952248621

In [3]:
import os
os.environ['SAGEMAKER_REGION'] = 'us-east-1'  # region for this example

# Same Nova Lite V2 RLVR setup, using resources in account 551952248621
rlvr_trainer = RLVRTrainer(
    model="nova-textgeneration-lite-v2",  # Union[str, ModelPackage]
    model_package_group_name="test-prod-iad-model-pkg-group",
    # mlflow_resource_arn="arn:aws:sagemaker:us-east-1:052150106756:mlflow-app/app-UNBKLOAX64PX",  # Optional[str]: MLflow app ARN or name; auto-resolved if not provided
    mlflow_experiment_name="test-nova-rlvr-finetuned-models-exp",  # Optional[str]
    mlflow_run_name="test-nova-rlvr-finetuned-models-run",  # Optional[str]
    training_dataset="s3://ease-integ-test-input-551952248621-us-east-1/converse-serverless-test-data/grpo-64-sample.jsonl",
    validation_dataset="s3://ease-integ-test-input-551952248621-us-east-1/converse-serverless-test-data/grpo-64-sample.jsonl",
    s3_output_path="s3://ease-integ-test-output-551952248621-us-east-1/model-customization-algo/",
    custom_reward_function="arn:aws:sagemaker:us-east-1:551952248621:hub-content/recipestest/JsonDoc/nova-prod-iad-test-evaluator-lambda-reward-function/0.0.1",
    accept_eula=True,
)


[{'DisplayName': 'Nova Lite V2 LoRA RLVR SMTJ training on GPU', 'Name': 'nova_lite_v2_smtj_p5_p5en_gpu_lora_rft', 'RecipeFilePath': 'recipes/fine-tuning/nova/nova_2_0/nova_lite/RFT/nova_lite_v2_smtj_p5_p5en_gpu_lora_rft.yaml', 'CustomizationTechnique': 'RLVR', 'InstanceCount': 4, 'Type': 'FineTuning', 'Versions': ['1.0'], 'Hardware': 'GPU', 'SupportedInstanceTypes': ['ml.p5.48xlarge', 'ml.p5en.48xlarge'], 'Peft': 'LORA', 'SequenceLength': '8K', 'ServerlessMeteringType': 'Hourly', 'SmtjRecipeTemplateS3Uri': 's3://jumpstart-cache-prod-us-east-1/recipes/nova_lite_v2_smtj_p5_p5en_gpu_lora_rft_payload_template_sm_jobs_v1.0.19.yaml', 'SmtjOverrideParamsS3Uri': 's3://jumpstart-cache-prod-us-east-1/recipes/nova_lite_v2_smtj_p5_p5en_gpu_lora_rft_override_params_sm_jobs_v1.0.19.json', 'SmtjImageUri': '708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-fine-tune-repo:SM-TJ-RFT-V2-latest'}, {'DisplayName': 'Nova Lite V2 Full Rank RLVR SMTJ training on GPU', 'Name': 'nova_lite_v2_smtj_p5_p5en_gpu_rf

In [4]:
rlvr_trainer.hyperparameters.to_dict()

{'name': 'my-rft-run-dcx1l',
 'data_s3_path': 's3://<bucket>/<data file>',
 'reward_lambda_arn': 'arn:aws:lambda:<region>:<account-id>:function:<function-name>',
 'learning_rate': '0.0001',
 'max_steps': '10',
 'lora_alpha': '32',
 'global_batch_size': '64',
 'max_length': '8192',
 'learning_rate_ratio': '64.0'}

In [5]:
# Replace the template placeholders with real values before training
rlvr_trainer.hyperparameters.data_s3_path = 's3://example-bucket'

rlvr_trainer.hyperparameters.reward_lambda_arn = 'arn:aws:lambda:us-east-1:729646638167:function:rlvr-nova-reward-function'

In [6]:
rlvr_trainer.hyperparameters.to_dict()

{'name': 'my-rft-run-dcx1l',
 'data_s3_path': 's3://example-bucket',
 'reward_lambda_arn': 'arn:aws:lambda:us-east-1:729646638167:function:rlvr-nova-reward-function',
 'learning_rate': '0.0001',
 'max_steps': '10',
 'lora_alpha': '32',
 'global_batch_size': '64',
 'max_length': '8192',
 'learning_rate_ratio': '64.0'}

In [None]:
training_job = rlvr_trainer.train(wait=True)