# HyperPod Nova RFT One-Stop Notebook (Single-Turn)
This notebook provides an end-to-end workflow for running single-turn reinforcement fine-tuning (RFT) of Nova models on Amazon SageMaker HyperPod. It guides you through cluster setup, IAM/EKS permissions, training, and evaluation in a single, reproducible notebook.

Multi-turn RFT is a Nova Forge-only feature: https://docs.aws.amazon.com/sagemaker/latest/dg/nova-forge.html

## Prerequisites

- First, create a SageMaker Studio JupyterLab space:
    - Go to SageMaker AI -> Domains -> Set up for single user (Quick setup) -> Submit
    - This opens SageMaker Studio
    - Go to JupyterLab -> Create JupyterLab space -> Submit
    - Once the space is created:
       -  Change Storage to `50GB`
       -  Run JupyterLab and wait for it to reach the `Running` state
       -  Open JupyterLab and upload this notebook
- Next, follow the notebook `Nova Hyperpod RIG - One-Time Cluster & Dependency Setup` (`amazon-nova-samples/customization/hyperpod-rig-cluster-setup/Hyperpod Nova Cluster and Dependencies setup.ipynb`) to set up the dependencies and the cluster
    - Note: Single-turn RFT needs a minimum of 8 `p5.48xlarge` instances
    - After the setup is done, refresh the page and change the notebook kernel to `Python (nova_sdk)`

## Cell Execution Guide

- 🔵 **ONE-TIME SETUP**: Run once per JupyterLab instance
- 🟣 **KERNEL RESTART**: Run every time the kernel restarts
- 🟡 **PER-JOB**: Run for every training/evaluation job

### 🟣 KERNEL RESTART: Validate dependencies and update paths

Verify that the cluster and dependency setup from `Nova Hyperpod RIG - One-Time Cluster & Dependency Setup` completed successfully.

In [1]:
# Update paths so that Python and CLI commands use the dependencies installed in the nova_sdk venv

import os
os.environ["BASH_ENV"] = os.path.expanduser("~/.bashrc")
os.environ["PATH"] = f"{os.path.expanduser('~')}/.local/bin:" + os.environ["PATH"]

# Single quotes inside the f-strings keep them valid on Python versions before 3.12
venv_bin = f"{os.environ['HOME']}/nova_sdk/bin"
os.environ["PATH"] = f"{venv_bin}:" + os.environ["PATH"]

# Temporary workaround to unblock the Nova Customization SDK
os.environ["PYTHONPATH"] = f"{os.environ['HOME']}/nova_sdk/lib/python3.12/site-packages/hyperpod_cli/sagemaker_hyperpod_recipes/launcher/nemo/nemo_framework_launcher/launcher_scripts:" + os.environ.get("PYTHONPATH", "")

# Set the paths-updated flag
paths_updated = True
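
Re-running the cell above within the same kernel prepends the same directories to `PATH` again. That is harmless, but a small helper can keep the value tidy. The following is only a sketch: `prepend_to_path` is a hypothetical name and is not part of the Nova Customization SDK or the HyperPod CLI.

```python
import os


def prepend_to_path(directory: str, path_value: str) -> str:
    """Return path_value with directory first and any later duplicates removed."""
    parts = [p for p in path_value.split(os.pathsep) if p and p != directory]
    return os.pathsep.join([directory, *parts])


# Demonstration on a literal value; the real environment is not modified here.
current = os.pathsep.join(["/usr/bin", "/home/user/nova_sdk/bin", "/bin"])
print(prepend_to_path("/home/user/nova_sdk/bin", current))
```

To apply it to the live environment, one could assign the result back, for example `os.environ["PATH"] = prepend_to_path(venv_bin, os.environ["PATH"])`.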

In [2]:
if not globals().get("paths_updated"):
    raise Exception("Paths not updated, please run KERNEL RESTART: Update paths step")

import sys
import shutil
import subprocess
from pathlib import Path

EXPECTED_PYTHON = Path(f"{os.environ['HOME']}/nova_sdk/bin")
LOCAL_BIN = Path(f"{os.environ['HOME']}/.local/bin")
EXPECTED_SITE_PACKAGES = Path(f"{os.environ['HOME']}/nova_sdk/lib")

def check(cond, ok, fail):
    if cond:
        print(f"✅ {ok}")
    else:
        raise RuntimeError(f"❌ {fail}")

# 1. Python executable
check(
    EXPECTED_PYTHON == Path(sys.executable).parent,
    f"Python executable: {sys.executable}",
    f"Wrong Python env. Expected {EXPECTED_PYTHON}, got {Path(sys.executable).parent}"
)

# 2. Python package location
import amzn_nova_customization_sdk
pkg_path = Path(amzn_nova_customization_sdk.__file__).resolve()

print(pkg_path)
check(
    str(EXPECTED_SITE_PACKAGES.parent) in str(pkg_path),
    f"amzn-nova-customization-sdk loaded from {pkg_path}",
    f"amzn-nova-customization-sdk NOT loaded from nova_sdk env: {pkg_path}"
)

# 3. hyperpod CLI
hyperpod_path = shutil.which("hyperpod")
print(hyperpod_path)
check(
    hyperpod_path and Path(hyperpod_path).resolve().parent == EXPECTED_PYTHON,
    f"hyperpod found at {hyperpod_path}",
    f"hyperpod not found in {EXPECTED_PYTHON}"
)

# 4. AWS CLI v2
aws_path = shutil.which("aws")
check(
    aws_path is not None,
    "aws CLI found",
    "aws CLI not found"
)

aws_version = subprocess.check_output(["aws", "--version"], stderr=subprocess.STDOUT).decode()
check(
    "aws-cli/2" in aws_version,
    f"AWS CLI version OK: {aws_version.strip()}",
    f"AWS CLI is not v2: {aws_version.strip()}"
)

# 5. helm
helm_path = shutil.which("helm")
check(
    helm_path and Path(helm_path).resolve().parent == LOCAL_BIN,
    f"helm found at {helm_path}",
    f"helm not found in {LOCAL_BIN}"
)

# 6. kubectl
kubectl_path = shutil.which("kubectl")
check(
    kubectl_path and Path(kubectl_path).resolve().parent == LOCAL_BIN,
    f"kubectl found at {kubectl_path}",
    f"kubectl not found in {LOCAL_BIN}"
)

print("\n🎉 Environment verification successful.")



✅ Python executable: /home/sagemaker-user/nova_sdk/bin/python
/home/sagemaker-user/nova_sdk/lib/python3.12/site-packages/amzn_nova_customization_sdk/__init__.py
✅ amzn-nova-customization-sdk loaded from /home/sagemaker-user/nova_sdk/lib/python3.12/site-packages/amzn_nova_customization_sdk/__init__.py
/home/sagemaker-user/nova_sdk/bin/hyperpod
✅ hyperpod found at /home/sagemaker-user/nova_sdk/bin/hyperpod
✅ aws CLI found
✅ AWS CLI version OK: aws-cli/2.33.5 Python/3.13.11 Linux/6.1.158-180.294.amzn2023.x86_64 exe/x86_64.ubuntu.22
✅ helm found at /home/sagemaker-user/.local/bin/helm
✅ kubectl found at /home/sagemaker-user/.local/bin/kubectl

🎉 Environment verification successful.


## 🟡 PER-JOB: Preparing Dataset

- Follow these steps to prepare your dataset: https://docs.aws.amazon.com/sagemaker/latest/dg/nova-hp-rft-nova2.html#nova-hp-rft-data-format
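
Before uploading a dataset, a quick structural check can catch malformed lines early. The sketch below assumes the dataset is a JSONL file with one JSON object per line (the example path later in this notebook ends in `.jsonl`). It does not check the field-level schema, which is defined in the linked documentation. `find_invalid_jsonl_lines` is a hypothetical helper name.

```python
import json


def find_invalid_jsonl_lines(lines):
    """Return (line_number, reason) for every line that is not a JSON object."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
    return problems


# Demonstration on in-memory lines; the second line is deliberately broken.
sample = ['{"id": 1}', '{"id": 2', '[1, 2, 3]']
for number, reason in find_invalid_jsonl_lines(sample):
    print(f"line {number}: {reason}")
```

For a file on disk, the same function accepts an open file handle, for example `with open("data.jsonl", encoding="utf-8") as f: find_invalid_jsonl_lines(f)`.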

## 🟡 PER-JOB: Run Training

Recipes: https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/fine-tuning/nova/nova_2_0/nova_lite/RFT

### Prerequisites
- A dataset in the format linked above
- A reward Lambda function that follows https://docs.aws.amazon.com/sagemaker/latest/dg/nova-hp-rft-reward-functions.html
- An S3 output path for the generated artifacts
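
A typo in an S3 path or Lambda ARN usually surfaces only after the job has been submitted, so a local sanity check can save a round trip. The sketch below is illustrative: `split_s3_uri` and `looks_like_lambda_arn` are hypothetical helpers, and the ARN pattern is a loose shape check based on the general `arn:<partition>:lambda:<region>:<account-id>:function:<name>` layout. It does not confirm that the resources exist.

```python
import re
from urllib.parse import urlparse

# Loose shape check only; an optional ":<version-or-alias>" suffix is allowed.
LAMBDA_ARN_PATTERN = re.compile(
    r"^arn:aws[a-z-]*:lambda:[a-z0-9-]+:\d{12}:function:[A-Za-z0-9_-]+(:[A-Za-z0-9$_-]+)?$"
)


def split_s3_uri(uri: str):
    """Split 's3://bucket/key' into (bucket, key); raise ValueError otherwise."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def looks_like_lambda_arn(arn: str) -> bool:
    """Return True if the string has the general shape of a Lambda function ARN."""
    return LAMBDA_ARN_PATTERN.match(arn) is not None


print(split_s3_uri("s3://bucket/data.jsonl"))
print(looks_like_lambda_arn("<YOUR_LAMBDA_FUNCTION_ARN>"))
```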

### Option 1: Using Nova Customization SDK

Use this option to get started with Nova RFT quickly.

In [None]:
# Start training using Nova Customization SDK

if not globals().get("paths_updated"):
    raise Exception("Paths not updated, please run KERNEL RESTART: Update paths step")

from amzn_nova_customization_sdk.manager.runtime_manager import SMHPRuntimeManager
from amzn_nova_customization_sdk.model.nova_model_customizer import NovaModelCustomizer
from amzn_nova_customization_sdk.model.model_enums import Model, TrainingMethod

# 1. Set up the runtime
runtime = SMHPRuntimeManager(
    instance_type="ml.p5.48xlarge",  # Instance type
    instance_count=2,                # Trainer replicas (defaults: replicas: 2, generation_replicas: 2, rollout_replicas: 1, system_replicas: 3)
    cluster_name="<CLUSTER_NAME>",   # Pretty name of the SageMaker RIG cluster
    namespace="kubeflow"             # Cluster namespace (Default: Kubectl)
)

# 2. Initialize the customizer
customizer = NovaModelCustomizer(
    model=Model.NOVA_LITE_2,                # Only NOVA_LITE_2 is supported
    method=TrainingMethod.RFT,              # Choose RFT or RFT_LORA
    infra=runtime,
    data_s3_path="s3://bucket/data.jsonl",  # Input dataset path
    output_s3_path="s3://bucket/output/",   # Output artifacts path
)


# 3. Define recipe overrides
rft_lambda_arn = "<YOUR_LAMBDA_FUNCTION_ARN>" # Required: Ensure your function follows https://docs.aws.amazon.com/sagemaker/latest/dg/nova-hp-rft-reward-functions.html#nova-hp-rft-reward-implementation
training_overrides = {
    "global_batch_size": 64,
    "max_steps": 10
}

# 4. Start training
training_result = customizer.train(
    recipe_path=None,              # Optional: Path to custom recipe. Refer: https://github.com/aws/sagemaker-hyperpod-recipes/tree/main/recipes_collection/recipes/fine-tuning/nova/nova_2_0/nova_lite/RFT
    job_name="<JOB_NAME>-rft",     # Should include `rft` in the name
    overrides=training_overrides,  # Training overrides
    rft_lambda_arn=rft_lambda_arn  # Reward Lambda ARN
)
print(f"Training started: {training_result.job_id}")
training_result.dump()

#### Monitor training job

In [None]:
if not globals().get("paths_updated"):
    raise Exception("Paths not updated, please run KERNEL RESTART: Update paths step")

from amzn_nova_customization_sdk.monitor.log_monitor import CloudWatchLogMonitor
from amzn_nova_customization_sdk.model.model_enums import Platform


training_job_monitor = CloudWatchLogMonitor.from_job_id(
    job_id="<TRAIN_JOB_ID>",   # Printed by the training cell above
    platform=Platform.SMHP,
    cluster_name="<CLUSTER_NAME>",
    namespace="kubeflow"
)
training_job_monitor.show_logs(limit=20)

In [None]:
# You can also use kubectl commands directly

if not globals().get("paths_updated"):
    raise Exception("Paths not updated, please run KERNEL RESTART: Update paths step")

JOB_ID="<TRAIN_JOB_ID>"
!kubectl get pods -n kubeflow -o wide | grep $JOB_ID

To monitor the job:
- CloudWatch
  - Go to CloudWatch -> Log management -> `/aws/sagemaker/Clusters/<PRETTY_CLUSTER_NAME>`
  - Open the log stream for the instance ID shown by the `kubectl get pods` command above. For example, if the master pod runs on `hyperpod-i-1234567890`, open the log stream whose name ends with `i-1234567890`
- [Optional] If you set `mlflow_tracking_uri` in your job, you can track metrics in MLflow
  - Go to the AWS console
  - Go to SageMaker AI -> SageMaker Studio -> click the domain created in the Prerequisites -> Open Studio
  - Go to Applications -> MLflow -> click `Open MLflow`
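
The mapping from node name to log stream described above can be sketched in a few lines. This is illustrative only: `instance_id_from_node` and `streams_for_instance` are hypothetical helpers, and the stream names in the demonstration are made up. Real stream names come from the CloudWatch log group.

```python
import re


def instance_id_from_node(node_name: str) -> str:
    """Extract the trailing instance ID (for example 'i-1234567890') from a node name."""
    match = re.search(r"i-[0-9a-f]+$", node_name)
    if match is None:
        raise ValueError(f"No instance ID found in node name: {node_name!r}")
    return match.group(0)


def streams_for_instance(stream_names, instance_id):
    """Keep only the log stream names that end with the given instance ID."""
    return [name for name in stream_names if name.endswith(instance_id)]


node = "hyperpod-i-1234567890"  # node name from the example above
example_streams = [             # made-up names, for illustration only
    "example-stream/i-1234567890",
    "example-stream/i-0987654321",
]
instance_id = instance_id_from_node(node)
print(instance_id, streams_for_instance(example_streams, instance_id))
```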

In [None]:
# Delete the job once training has completed and the manifest file appears in run.output_s3_path
!helm uninstall -n kubeflow $JOB_ID

### Option 2: Directly using the HyperPod CLI

Use this option when you want more control over the recipe.

In [None]:
# Set required variables

RIG_PRETTY_CLUSTER_NAME=""    # Your RIG name
TRAIN_RUN_NAME=""             # Run name for the training job
INPUT_DATASET_S3_PATH=""      # S3 path to your input dataset
OUTPUT_S3_PATH=""             # S3 path to the output directory
REWARD_LAMBDA_ARN=""          # ARN of the reward Lambda function

In [None]:
# Connect to the cluster

if not globals().get("paths_updated"):
    raise Exception("Paths not updated, please run KERNEL RESTART: Update paths step")

!hyperpod connect-cluster --cluster-name "$RIG_PRETTY_CLUSTER_NAME"

In [None]:
import json

# Recipe overrides. Override other parameters as required
overrides = {
    "instance_type": "ml.p5.48xlarge",
    "recipes.run.name": TRAIN_RUN_NAME,
    "recipes.run.data_s3_path": INPUT_DATASET_S3_PATH,
    "recipes.run.output_s3_path": OUTPUT_S3_PATH,
    "recipes.run.reward_lambda_arn": REWARD_LAMBDA_ARN,
    "recipes.training_config.global_batch_size": 64,
    "recipes.training_config.trainer.max_steps": 10,
    # Optional: the following overrides are for MLflow
    # "recipes.run.mlflow_tracking_uri": "", # MLflow App ARN created during cluster and dependencies setup
    # "recipes.run.mlflow_experiment_name": "", # Leave blank to use TRAIN_RUN_NAME
    # "recipes.run.mlflow_run_name": "", # Leave blank to use TRAIN_RUN_NAME
}
OVERRIDES = json.dumps(overrides)

! cd sagemaker-hyperpod-cli/src/hyperpod_cli/sagemaker_hyperpod_recipes && hyperpod start-job \
--namespace kubeflow \
--recipe fine-tuning/nova/nova_2_0/nova_lite/RFT/nova_lite_2_0_p5_gpu_rft \
--override-parameters '{OVERRIDES}'
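
The cell above wraps the JSON in single quotes by hand, which breaks if any override value itself contains a single quote. One alternative, sketched here, is to let `shlex` do the quoting. `build_start_job_command` is a hypothetical helper; it only builds the command string using the `hyperpod start-job` flags already shown in this notebook and does not run anything.

```python
import json
import shlex


def build_start_job_command(recipe: str, overrides: dict, namespace: str = "kubeflow") -> str:
    """Build a shell-safe `hyperpod start-job` command line as a single string."""
    args = [
        "hyperpod", "start-job",
        "--namespace", namespace,
        "--recipe", recipe,
        "--override-parameters", json.dumps(overrides),
    ]
    # shlex.join quotes each argument so the shell passes it through unchanged.
    return shlex.join(args)


example_overrides = {
    "instance_type": "ml.p5.48xlarge",
    "recipes.run.name": "it's-a-test-rft",  # the single quote is handled by shlex
}
print(build_start_job_command(
    "fine-tuning/nova/nova_2_0/nova_lite/RFT/nova_lite_2_0_p5_gpu_rft",
    example_overrides,
))
```

The printed command still has to be run from the recipes directory used in the cell above.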

In [None]:
# Get the job NAME from the output of the hyperpod start-job command above

if not globals().get("paths_updated"):
    raise Exception("Paths not updated, please run KERNEL RESTART: Update paths step")

JOB_ID="<TRAIN_JOB_ID>"
!kubectl get pods -n kubeflow -o wide | grep $JOB_ID
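
As an alternative to grepping the table output, the pod list can be requested as JSON (`kubectl get pods -n kubeflow -o json`) and filtered in Python. The sketch below assumes, as the `grep` above does, that pod names contain the job ID. `pods_for_job` is a hypothetical helper, and the sample data is made up so the block runs without a cluster.

```python
def pods_for_job(pod_list: dict, job_id: str):
    """Return (pod name, node name, phase) for pods whose name contains job_id."""
    rows = []
    for item in pod_list.get("items", []):
        name = item["metadata"]["name"]
        if job_id in name:
            node = item.get("spec", {}).get("nodeName")
            phase = item.get("status", {}).get("phase")
            rows.append((name, node, phase))
    return rows


# Made-up sample shaped like `kubectl get pods -o json` output.
sample_pod_list = {
    "items": [
        {"metadata": {"name": "demo-rft-abc12-master-0"},
         "spec": {"nodeName": "hyperpod-i-1234567890"},
         "status": {"phase": "Running"}},
        {"metadata": {"name": "unrelated-pod"},
         "spec": {"nodeName": "hyperpod-i-0987654321"},
         "status": {"phase": "Running"}},
    ]
}
for row in pods_for_job(sample_pod_list, "demo-rft-abc12"):
    print(row)
```

Against a live cluster, the input could be obtained with `json.loads(subprocess.check_output(["kubectl", "get", "pods", "-n", "kubeflow", "-o", "json"]))`.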

#### Monitor training job

To monitor the job:
- CloudWatch
  - Go to CloudWatch -> Log management -> `/aws/sagemaker/Clusters/<PRETTY_CLUSTER_NAME>`
  - Open the log stream for the instance ID shown by the `kubectl get pods` command above. For example, if the master pod runs on `hyperpod-i-1234567890`, open the log stream whose name ends with `i-1234567890`
- [Optional] If you set `mlflow_tracking_uri` in your job, you can track metrics in MLflow
  - Go to the AWS console
  - Go to SageMaker AI -> SageMaker Studio -> click the domain created in the Prerequisites -> Open Studio
  - Go to Applications -> MLflow -> click `Open MLflow`

In [None]:
# Delete the job once training has completed and the manifest file appears in run.output_s3_path
!helm uninstall -n kubeflow $JOB_ID

## 🟡 PER-JOB: Run Eval

Recipes: https://github.com/aws/sagemaker-hyperpod-recipes/blob/main/recipes_collection/recipes/evaluation/nova/nova_2_0/nova_lite/nova_lite_2_0_p5_48xl_gpu_rft_eval.yaml

- Retrieve your model path (`TRAINING_CHECKPOINT_S3_PATH`) from the `manifest.json` in the training job's `run.output_s3_path`
- Update any other parameters in the `hyperpod start-job` command as required
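
Because the eval overrides are filled in by hand, it is easy to submit a job with a `<PLACEHOLDER>` still in place. A small check along these lines can flag that before launching. This is a sketch: `unfilled_overrides` is a hypothetical helper, and the override values in the demonstration are examples.

```python
import re

# Matches angle-bracket placeholders such as <RUN_NAME>.
PLACEHOLDER = re.compile(r"<[A-Z0-9_]+>")


def unfilled_overrides(overrides: dict):
    """Return the override keys whose string value still contains a <PLACEHOLDER>."""
    return sorted(
        key for key, value in overrides.items()
        if isinstance(value, str) and PLACEHOLDER.search(value)
    )


eval_overrides = {
    "instance_type": "ml.p5.48xlarge",
    "recipes.run.model_name_or_path": "<TRAINING_CHECKPOINT_S3_PATH>",
    "recipes.run.name": "my-eval-run",
}
missing = unfilled_overrides(eval_overrides)
if missing:
    print("Fill in these overrides before launching:", missing)
```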

In [None]:
%%bash

cd sagemaker-hyperpod-cli/src/hyperpod_cli/sagemaker_hyperpod_recipes && hyperpod start-job \
  --namespace "kubeflow" \
  --recipe "evaluation/nova/nova_2_0/nova_lite/nova_lite_2_0_p5_48xl_gpu_rft_eval" \
  --override-parameters '{
    "instance_type": "ml.p5.48xlarge",
    "container": "708977205387.dkr.ecr.us-east-1.amazonaws.com/nova-evaluation-repo:SM-HP-Eval-V2-latest",
    "recipes.run.model_name_or_path": "<TRAINING_CHECKPOINT_S3_PATH>",
    "recipes.run.name": "<RUN_NAME>",
    "recipes.run.data_s3_path": "<EVAL_DATASET_PATH>",
    "recipes.run.output_s3_path": "<OUTPUT_S3_PATH>",
    "recipes.rl_env.reward_lambda_arn": "<YOUR_LAMBDA_FUNCTION>"
  }'

# Override other params in nova_lite_2_0_p5_48xl_gpu_rft_eval as required

In [None]:
# Get JOB_ID from the output of the hyperpod start-job command above

JOB_ID="<JOB_ID>"
!kubectl get pods -n kubeflow -o wide | grep $JOB_ID

### Monitor eval job

To monitor the eval job, follow the same steps as for monitoring the training job.