OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

📖Introduction

OGER is a framework that enhances reasoning capabilities by unifying offline teacher guidance and online reinforcement learning (RL) through reward modeling. First, the framework trains collaboratively on teacher trajectories drawn from multiple sources to solidify foundational reasoning capabilities. Second, it constructs a divergence-based exploration reward that quantifies the semantic disparity between online and offline trajectories, so that expert imitation and autonomous discovery reinforce each other instead of competing. Third, to keep training stable, a hybrid sampling mechanism integrates offline expert data directly into the online training batches. Finally, the exploration signal is refined with the policy model's token-level entropy distribution, which gives fine-grained control over the incentive for novel reasoning behaviors and mitigates the risk of premature convergence.
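To make the idea of a divergence-based exploration reward concrete, here is a minimal sketch in plain Python. It is not the OGER implementation: the function names, the `weight` parameter, the use of cosine similarity over sentence embeddings, and the `exp(-entropy)` scaling are all assumptions chosen for illustration (the Quick Start flags `similarity_reverse` and `last_token_enp_reward='exp_neg_enp'` hint at something in this spirit, but the exact formula is not stated here).

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_token_entropy(token_dists: Sequence[Sequence[float]]) -> float:
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    if not token_dists:
        return 0.0
    total = 0.0
    for dist in token_dists:
        total -= sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_dists)


def exploration_reward(
    online_emb: Sequence[float],
    offline_embs: Sequence[Sequence[float]],
    token_dists: Sequence[Sequence[float]],
    correct: bool,
    weight: float = 0.5,  # hypothetical bonus scale, not a documented OGER setting
) -> float:
    """Outcome reward plus a bonus for diverging from the offline teachers.

    Assumption: only correct responses receive the bonus, and the bonus is
    damped by exp(-mean token entropy) so that uncertain rollouts earn less.
    """
    base = 1.0 if correct else 0.0
    if not correct or not offline_embs:
        return base
    mean_sim = sum(cosine_similarity(online_emb, e) for e in offline_embs) / len(offline_embs)
    divergence = 1.0 - mean_sim  # 0 when identical to the teachers, up to 2 when opposite
    entropy_scale = math.exp(-mean_token_entropy(token_dists))
    return base + weight * divergence * entropy_scale
```

Under these assumptions, a correct response that matches the teacher embedding earns only the outcome reward, while a correct response that is semantically distant earns an extra bonus that shrinks as the policy's token-level entropy grows.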

Key Highlights:

Multi-source Offline Trajectories
Offline-Guided Exploration Reward
OGER Reward with Hybrid Set
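
The hybrid set in the last highlight can be pictured as a rollout group in which a few online samples are swapped for offline teacher trajectories. The sketch below is one possible reading, not the repository's code: the `Trajectory` type, the `build_hybrid_group` helper, and the random choice of which slots to replace are hypothetical, and the `max_replace_num` argument only mirrors the name of the `+trainer.max_replace_num` flag in the Quick Start script.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trajectory:
    text: str
    source: str  # "online" (policy rollout) or "offline" (teacher)


def build_hybrid_group(
    online: List[Trajectory],
    offline: List[Trajectory],
    max_replace_num: int = 1,
    rng: Optional[random.Random] = None,
) -> List[Trajectory]:
    """Return a copy of the online group with up to max_replace_num slots
    replaced by distinct offline teacher trajectories.

    Assumption: slots and teachers are picked uniformly at random.
    """
    rng = rng or random.Random()
    k = max(0, min(max_replace_num, len(online), len(offline)))
    group = list(online)
    slots = rng.sample(range(len(group)), k)
    teachers = rng.sample(offline, k)
    for slot, teacher in zip(slots, teachers):
        group[slot] = teacher
    return group
```

With eight online rollouts per prompt and `max_replace_num=1`, the resulting group would hold seven policy samples and one teacher sample, so the group size seen by the advantage estimator stays unchanged.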

✨Getting Started

Installation

You can install OGER dependencies by running the following commands:

conda create -n oger python=3.10
conda activate oger
cd oger
pip install -r requirements.txt
pip install -e .
cd verl
pip install -e .

Quick Start

We provide an example script that trains OGER on our multi-source offline subset of OpenR1-Math-220k. Replace the placeholder paths (save_dir, MODEL_PATH, TENSORBOARD_DIR, DATA_DIR, and the embedding model path) with your own, then run:

conda activate oger

cd ./verl

save_dir=/path/to/your/dir

export MODEL_PATH=/path/to/your/model
export PROJECT_NAME=project_name
export EXPERIMENT_NAME=oger
export TENSORBOARD_DIR=/path/to/your/log


# Set XFormers backend to avoid CUDA errors
export VLLM_ATTENTION_BACKEND=XFORMERS
export DATA_DIR=/path/to/your/data

# Train over a single node, 8 AH200 GPUs.
python3 -m verl.mix_src.main_mix_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$DATA_DIR/luffy_offline_samples.parquet \
    data.val_files=$DATA_DIR/valid.parquet \
    data.train_batch_size=128 \
    data.val_batch_size=512 \
    data.max_prompt_length=1024 \
    data.max_response_length=8192 \
    actor_rollout_ref.model.path=$MODEL_PATH \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size=64 \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=32768 \
    actor_rollout_ref.actor.kl_loss_coef=0.00 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=1 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.grad_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.temperature=1.0 \
    actor_rollout_ref.rollout.val_temperature=0.6 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.60 \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.rollout.n_val=1 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    actor_rollout_ref.rollout.max_prefix_len=8192 \
    algorithm.kl_ctrl.kl_coef=0.000 \
    actor_rollout_ref.actor.entropy_coeff=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger=['console','tensorboard'] \
    trainer.project_name="$PROJECT_NAME" \
    trainer.experiment_name="$EXPERIMENT_NAME" \
    +trainer.val_before_train=False \
    +trainer.max_replace_num=1 \
    +trainer.similarity_embedding=True \
    +trainer.similarity_embedding_model=/path/to/embedding_model \
    +trainer.replace_mode="similarity_mean_des" \
    +trainer.use_similarity_score=True \
    +trainer.similarity_shaping=False \
    +trainer.similarity_reverse=True \
    +trainer.similarity_reverse_fail=False \
    +trainer.last_token_enp_reward='exp_neg_enp' \
    +trainer.skip_all_success=False \
    +trainer.use_history_samples=False \
    +trainer.use_mu_schedule=False \
    +trainer.repeating_check=False \
    +trainer.similarity_reranker=False \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=50 \
    trainer.test_freq=10 \
    +actor_rollout_ref.rollout.replace_after_rm=True \
    trainer.default_local_dir=$save_dir/$PROJECT_NAME/$EXPERIMENT_NAME \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.actor.use_sft_prefix_reward=False \
    actor_rollout_ref.rollout.prefix_share_across_samples=False \
    actor_rollout_ref.rollout.prefix_strategy=random \
    actor_rollout_ref.rollout.n_prefix=1 \
    actor_rollout_ref.rollout.min_prefix_ratio=0.0 \
    actor_rollout_ref.rollout.max_prefix_ratio=0.0 \
    actor_rollout_ref.rollout.prefix_reward_weight_alpha=1.0 \
    actor_rollout_ref.ref.use_ref=False \
    actor_rollout_ref.actor.use_off_policy_loss=True \
    actor_rollout_ref.actor.off_policy_normalize=False \
    actor_rollout_ref.actor.off_policy_reshape="p_div_p_0.1" \
    actor_rollout_ref.actor.off_policy_loss_impl=token \
    algorithm.grpo_use_std=False \
    actor_rollout_ref.actor.loss_remove_token_mean=True \
    actor_rollout_ref.actor.loss_remove_clip=True \
    data.reward_impl_version=3 \
    data.shuffle=True \
    trainer.default_hdfs_dir=null \
    trainer.max_optim_to_keep=-1 \
    trainer.total_training_steps=2000 "${@:1}"
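
Before launching, it can help to check how the batch and parallelism settings above fit together. The helper below is a hypothetical sanity check, not part of OGER or verl; it only does arithmetic on values copied from the command, and it assumes the usual reading of these options (one rollout engine replica per tensor-parallel group, `rollout.n` responses sampled per prompt).

```python
def derive_run_shape(cfg: dict) -> dict:
    """Derive a few run-level quantities from the launch settings."""
    world_size = cfg["n_gpus_per_node"] * cfg["nnodes"]
    tp_size = cfg["tensor_model_parallel_size"]
    if world_size % tp_size != 0:
        raise ValueError("GPU count must be divisible by the tensor-parallel size")
    max_seq_tokens = cfg["max_prompt_length"] + cfg["max_response_length"]
    return {
        "world_size": world_size,
        "rollout_replicas": world_size // tp_size,
        "responses_per_step": cfg["train_batch_size"] * cfg["rollout_n"],
        "max_sequence_tokens": max_seq_tokens,
        # How many full-length sequences fit in one per-GPU token budget.
        "full_length_sequences_per_gpu": cfg["ppo_max_token_len_per_gpu"] // max_seq_tokens,
    }


# Values taken from the Quick Start command.
quick_start = {
    "n_gpus_per_node": 8,
    "nnodes": 1,
    "tensor_model_parallel_size": 2,
    "train_batch_size": 128,
    "rollout_n": 8,
    "max_prompt_length": 1024,
    "max_response_length": 8192,
    "ppo_max_token_len_per_gpu": 32768,
}

if __name__ == "__main__":
    for name, value in derive_run_shape(quick_start).items():
        print(f"{name}: {value}")
```

For the settings shown, this gives 8 GPUs split into 4 rollout replicas, 1024 sampled responses per training step, and a worst-case sequence of 9216 tokens, so the 32768-token per-GPU budget holds at least three full-length sequences.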
