# veRL on SageMaker

### GRPO Algorithm Example

Getting Started with veRL on SageMaker

> **Note:** This example must be run from a machine with an NVIDIA GPU to build the Docker image.

Prerequisites:

- Create a `.env` file with the following variables:
  - WANDB_API_KEY=XXXX
  - HF_TOKEN=XXXX
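
Before launching a job, it can help to confirm that both variables are actually set. The sketch below is a hypothetical helper (`missing_env_vars` is not part of veRL or SageMaker) that reports which required names are absent or empty in a mapping such as `os.environ`. It assumes the `.env` file has already been loaded into the environment.

```python
import os
from typing import Iterable, List, Mapping

# Variable names taken from the prerequisites list above.
REQUIRED_VARS = ("WANDB_API_KEY", "HF_TOKEN")


def missing_env_vars(
    environ: Mapping[str, str], required: Iterable[str] = REQUIRED_VARS
) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environ.get(name)]


# Example: check the current process environment.
missing = missing_env_vars(os.environ)
```

An empty `missing` list means both values are available to the notebook.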

1. Build and push a veRL container to ECR


In [None]:
region = 'us-east-1'

In [None]:
import boto3

# look up the AWS account ID of the current credentials
sts = boto3.client('sts')
account = sts.get_caller_identity()['Account']

# create a session pinned to the chosen region
boto_session = boto3.session.Session(region_name=region)
region = boto_session.region_name

# setup image name and tag
image = "verl-on-sagemaker"
tag = "v1"
fullname = f"{account}.dkr.ecr.{region}.amazonaws.com/{image}:{tag}"

fullname
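
The image name built above follows the usual ECR pattern `<account>.dkr.ecr.<region>.amazonaws.com/<image>:<tag>`. As a minimal sketch, the same construction can be wrapped in a function that rejects obviously malformed input before anything is pushed. The function name `ecr_image_uri` is hypothetical and used only for illustration.

```python
import re


def ecr_image_uri(account: str, region: str, image: str, tag: str) -> str:
    """Build an ECR image URI in the same form as `fullname`."""
    # AWS account IDs are 12-digit numbers.
    if not re.fullmatch(r"\d{12}", account):
        raise ValueError(f"expected a 12-digit AWS account ID, got {account!r}")
    if not region or not image or not tag:
        raise ValueError("region, image, and tag must be non-empty")
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{image}:{tag}"
```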

> **Note:** Run the command below directly in a bash shell, not from the notebook. It fails when run from the notebook.


In [None]:
# bash container/build_tools/build_and_push.sh {region} {image} {tag}

2. Clone the veRL repo, copy the example scripts into the local `scripts` directory, and run the preprocessing script to download and format the gsm8k dataset. For simplicity, we upload the data as part of the `scripts` directory so that it is copied to the training instance and training container.


In [None]:
%%bash

# git clone https://github.com/volcengine/verl

# mv verl/verl scripts/verl
# mv verl/examples scripts/examples
# rm -rf verl
python3 scripts/examples/data_preprocess/gsm8k.py --local_dir scripts/data/gsm8k
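
Because the data travels with the `scripts` directory, a quick check that preprocessing produced the expected files can save a failed job. The sketch below assumes the output consists of `train.parquet` and `test.parquet` (the file names referenced by the training script); `missing_dataset_files` is a hypothetical helper.

```python
from pathlib import Path
from typing import List

# File names assumed from the data.train_files / data.val_files settings.
EXPECTED_FILES = ("train.parquet", "test.parquet")


def missing_dataset_files(local_dir: str) -> List[str]:
    """Return the expected dataset files that are not present in local_dir."""
    root = Path(local_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Example: check the directory used in the preprocessing command.
missing = missing_dataset_files("scripts/data/gsm8k")
```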

3. Modify the example script so it knows where to find the data. The script we will run is located at `scripts/examples/grpo_trainer/run_qwen2-7b.sh`. Modify the following two lines in the script:

```bash
    data.train_files=data/gsm8k/train.parquet \
    data.val_files=data/gsm8k/test.parquet \
```


In [None]:
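# rewrite the data paths in the training script so they are relative to the scripts directory
# (assumes the upstream script references the data as $HOME/data/...)
!sed -i 's#\$HOME/data/#data/#' scripts/examples/grpo_trainer/run_qwen2-7b.sh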
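
The same path change can also be made in Python, which avoids shell quoting issues. This is a minimal sketch rather than part of veRL: `set_data_paths` is a hypothetical helper that rewrites whatever value follows `data.train_files=` and `data.val_files=` in the script text.

```python
import re


def set_data_paths(script_text: str, train_path: str, val_path: str) -> str:
    """Point data.train_files and data.val_files at the given paths."""
    replacements = {"data.train_files": train_path, "data.val_files": val_path}
    for key, path in replacements.items():
        # Match "key=" followed by the current value (up to the next whitespace).
        pattern = rf"({re.escape(key)}=)\S+"
        # A function replacement keeps backslashes in the path literal.
        script_text = re.sub(pattern, lambda m, p=path: m.group(1) + p, script_text)
    return script_text
```

To apply it, read `scripts/examples/grpo_trainer/run_qwen2-7b.sh`, pass its contents through `set_data_paths` with `data/gsm8k/train.parquet` and `data/gsm8k/test.parquet`, and write the result back.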

4. Execute a training job with the ModelTrainer API.


In [None]:
import os
import boto3
from dotenv import load_dotenv
from huggingface_hub import HfFolder
from sagemaker.modules import Session
from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.train.model_trainer import Mode
from sagemaker.modules.configs import SourceCode, Compute, InputData

sess = Session(boto3.session.Session(region_name=region))
# iam = boto3.client('iam')
# role = iam.get_role(RoleName='sagemaker')['Role']['Arn']

load_dotenv()

# image URI for the training job
# pytorch_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.6.0-gpu-py312-cu126-ubuntu22.04-sagemaker"
verl_image = fullname
# you can find all available images here
# https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html

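env = {
    'WANDB_API_KEY': os.getenv('WANDB_API_KEY'),
    # prefer HF_TOKEN from .env; fall back to the locally cached Hugging Face token
    'HF_TOKEN': os.getenv('HF_TOKEN') or HfFolder.get_token(),
}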

# define the script to be run
source_code = SourceCode(
    source_dir="scripts", command='bash ./examples/grpo_trainer/run_qwen2-7b.sh'
)

# Compute configuration for the training job
compute = Compute(
    instance_count=1,
    # for local mode
    # instance_type='local_gpu',
    instance_type="ml.p5.48xlarge",
    # instance_type="ml.p4d.24xlarge",
    volume_size_in_gb=96,
    keep_alive_period_in_seconds=3600,
)

# define the ModelTrainer
model_trainer = ModelTrainer(
    sagemaker_session=sess,
    training_image=verl_image,
    source_code=source_code,
    base_job_name="verl-grpo-example",
    compute=compute,
    environment=env,
    # for local mode
    # training_mode=Mode.LOCAL_CONTAINER,
)

# pass the input data
# input_data = InputData(
#     channel_name="train",
#     data_source=training_input_path,  #s3 path where training data is stored
# )

# start the training job
model_trainer.train(wait=True)
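
The commented-out lines in the cell above switch between a managed instance and local mode. One way to make that switch explicit is to build the `Compute` arguments from a single flag. The sketch below is an illustration only: `compute_kwargs` is a hypothetical helper, the values are the ones used in this example, and dropping the keep-alive period for local runs is an assumption.

```python
from typing import Any, Dict


def compute_kwargs(local: bool) -> Dict[str, Any]:
    """Return keyword arguments for Compute(...) for local or managed runs."""
    kwargs: Dict[str, Any] = {"instance_count": 1, "volume_size_in_gb": 96}
    if local:
        # Instance type from the "for local mode" comment in the cell above.
        kwargs["instance_type"] = "local_gpu"
    else:
        kwargs["instance_type"] = "ml.p5.48xlarge"
        # Assumption: the keep-alive period only matters for managed instances.
        kwargs["keep_alive_period_in_seconds"] = 3600
    return kwargs
```

With this helper, `Compute(**compute_kwargs(local=False))` reproduces the configuration used above. A local run would additionally need `training_mode=Mode.LOCAL_CONTAINER` on the `ModelTrainer`, as in the commented-out line.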

---

### Appendix


Testing a container manually before running in local mode


In [None]:
%%bash
docker run --shm-size=10.24gb --gpus all -it 10489a46d273 bash

SageMaker local docker compose command


In [None]:
# docker compose -f /home/ubuntu/verl-on-sagemaker/docker-compose.yaml up --build  --abort-on-container-exit