Add train/inference mismatch correction (add feature train-infer-mismatch) #273
Conversation
new file: examples/qwen2.5-infer_correction/agentic_webshop_infer_correction.yaml
new file: examples/qwen2.5-infer_correction/rlvr_infer_correction_config.yaml
new file: examples/qwen2.5-infer_correction/run_agentic_pipeline_webshop.sh
new file: examples/qwen2.5-infer_correction/run_rlvr_pipeline.sh
modified: roll/configs/base_config.py
modified: roll/configs/generating_args.py
modified: roll/distributed/scheduler/generate_scheduler.py
modified: roll/pipeline/agentic/env_manager/step_env_manager.py
modified: roll/pipeline/agentic/env_manager/traj_env_manager.py
modified: roll/pipeline/base_worker.py
modified: roll/pipeline/rlvr/actor_pg_worker.py
modified: roll/pipeline/rlvr/actor_worker.py
Thank you very much for contributing the code. Let's review this feature together. @wwxFromTju
OK, I took a quick look. The main change from the community contributor is the new `infer_correction` function. Once the training results are verified, it can be merged directly with our new code.
Hey, hello. Do you have any test results from running it? Let's do a comparison.
Hi, I do have some test comparisons, but since they were run on some of our company's private data, it will probably be hard to release them. All I can say is that training is indeed more stable and performance improves.
Could you run the datasets in the repository? The 7B RLVR one, and the agentic FrozenLake or Sokoban one?
OK, I'll get back to you tomorrow.
@millioniron 👋 Hello from ROLL! Thank you for your interest and contributions to our project! 🤝 We at ROLL are eager to build stronger connections with our community developers. Let's foster collaboration and grow together! Looking ahead, the ROLL project will also organize regular events and offer gifts to active contributors. 📧 For further discussion:
The configuration is as follows (a sketch of how the infer_correction thresholds could be used follows after the config):
defaults:
- ../config/traj_envs@_here_
- ../config/deepspeed_zero@_here_
- ../config/deepspeed_zero2@_here_
- ../config/deepspeed_zero3@_here_
- ../config/deepspeed_zero3_cpuoffload@_here_
hydra:
run:
dir: .
output_subdir: null
exp_name: "agentic_pipeline_none_correction"
seed: 42
logging_dir: ./output/logs/${exp_name}
output_dir: ./output
render_save_dir: ./output/render
system_envs:
USE_MODELSCOPE: '1'
#track_with: wandb
#tracker_kwargs:
# api_key:
# project: roll-agentic
# name: ${exp_name}_sokoban
# notes: "agentic_pipeline"
# tags:
# - agentic
# - roll
# - baseline
track_with: tensorboard
tracker_kwargs:
log_dir: ./tensorboard/roll_exp/${exp_name}
checkpoint_config:
type: file_system
output_dir: ./agentic/models/${exp_name}
num_gpus_per_node: 8
max_steps: 500
save_steps: 10000
logging_steps: 1
eval_steps: 20
resume_from_checkpoint: false
async_generation_ratio: 1
rollout_batch_size: 256
val_batch_size: 256
sequence_length: 24576
ppo_epochs: 1
adv_estimator: "grpo"
#pg_clip: 0.1
#dual_clip_loss: True
init_kl_coef: 0.0
whiten_advantages: true
entropy_loss_coef: 0
max_grad_norm: 1.0
add_token_level_kl: false
use_kl_loss: false
global_template: qwen3
pretrain: /Qwen3-4B-Thinking-2507
reward_pretrain: /Qwen3-4B-Thinking-2507
infer_correction: true
infer_is_mode: None
infer_is_threshold_min: 0.0
infer_is_threshold_max: 5.0 # 1.5~5.0
enable_token_reject: true
infer_token_mask_threshold_min: 0.1
infer_token_mask_threshold_max: 10.0 # 2~10
enable_catastrophic_reject: true
infer_catastrophic_threshold: 1e-4
enable_seq_reject: None
actor_train:
model_args:
attn_implementation: fa2
disable_gradient_checkpointing: false
dtype: bf16
model_type: ~
training_args:
learning_rate: 1.0e-6
weight_decay: 0
per_device_train_batch_size: 1
gradient_accumulation_steps: 32
warmup_steps: 10
lr_scheduler_type: cosine
data_args:
template: qwen3
strategy_args:
# strategy_name: deepspeed_train
# strategy_config: ${deepspeed_zero3}
strategy_name: megatron_train
strategy_config:
tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
expert_model_parallel_size: 1
use_distributed_optimizer: true
recompute_granularity: full
device_mapping: list(range(0,4))
infer_batch_size: 2
# use_dynamic_batching_in_infer: true
# max_tokens_per_microbatch_in_infer: 30732
# sequence_length_round_in_infer: 128
actor_infer:
model_args:
disable_gradient_checkpointing: true
dtype: bf16
generating_args:
max_new_tokens: 2048 # single-turn response length
top_p: 1
top_k: 100
num_beams: 1
temperature: 1
num_return_sequences: 1
logprobs: 1
data_args:
template: qwen3
strategy_args:
strategy_name: sglang
strategy_config:
mem_fraction_static: 0.85
load_format: auto
device_mapping: list(range(4,8))
# use_dynamic_batching_in_infer: true
# max_tokens_per_microbatch_in_infer: 30732
# sequence_length_round_in_infer: 128
reference:
model_args:
attn_implementation: fa2
disable_gradient_checkpointing: true
dtype: bf16
model_type: ~
data_args:
template: qwen3
strategy_args:
strategy_name: hf_infer
strategy_config: ~
device_mapping: list(range(0,4))
infer_batch_size: 8
# use_dynamic_batching_in_infer: true
# max_tokens_per_microbatch_in_infer: 30732
# sequence_length_round_in_infer: 128
reward_normalization:
grouping: traj_group_id # group_by used when computing reward/adv; can be tags (env_type) / traj_group_id (group) / batch (rollout_batch) ...
method: mean_std # asym_clip / identity / mean_std
train_env_manager:
max_env_num_per_worker: 16
num_env_groups: 32
# under the same group, the env config and env seed are ensured to be equal
group_size: 8
tags: [LargerSokoban]
num_groups_partition: [32] # If not set, all env names divide nums equally. Under the same group, the env config and env seed (prompt) are equal in each generation
val_env_manager:
max_env_num_per_worker: 32
num_env_groups: 256
group_size: 1 # should be set to 1 because val temperature is set to 0 and same prompt leads to same output
tags: [LargerSokoban]
num_groups_partition: [256] # TODO: If not set, all env names divide nums equally. Under the same group, the env config and env seed (prompt) are equal in each generation
# Here, you can override variables defined in the imported envs, e.g. max_tokens_per_step defined in custom_env.SimpleSokoban is overridden by the value below
max_tokens_per_step: 2048
sokoban_format_penalty: 0.0
max_actions_per_traj: 8
custom_envs:
SimpleSokoban:
${custom_env.SimpleSokoban}
LargerSokoban:
${custom_env.LargerSokoban}
SokobanDifferentGridVocab:
${custom_env.SokobanDifferentGridVocab}
FrozenLake:
${custom_env.FrozenLake}
FrozenLakeThink:
${custom_env.FrozenLakeThink}
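For reference, the `infer_is_*`, `infer_token_mask_threshold_*`, and `infer_catastrophic_threshold` fields above drive the mismatch correction. Below is a minimal, hypothetical sketch (assumed tensor names, not the PR's actual implementation) of a truncated-importance-sampling (TIS) weighted policy-gradient loss that such thresholds could control:

```python
import torch

def tis_policy_loss(train_logprobs, infer_logprobs, advantages, loss_mask,
                    is_min=0.0, is_max=5.0):
    """Hypothetical sketch: REINFORCE-style loss weighted by a truncated
    per-token importance ratio between the training policy and the policy
    that actually generated the rollout in the inference engine.

    All tensors are [batch, seq_len]; loss_mask marks response tokens.
    """
    # ratio = pi_train(token) / pi_infer(token), computed in log space
    ratio = torch.exp(train_logprobs - infer_logprobs)
    # truncate the ratio (TIS) so a few badly mismatched tokens cannot
    # dominate the gradient; is_min / is_max correspond to the
    # infer_is_threshold_min / infer_is_threshold_max fields above
    ratio = ratio.clamp(min=is_min, max=is_max).detach()
    pg_loss = -(ratio * advantages * train_logprobs) * loss_mask
    return pg_loss.sum() / loss_mask.sum().clamp(min=1)
```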
…oken/seq mask & add rollout importance sampling.
…ptimize memory usage during top-k logits computation.
#283 also includes some infer_logprobs changes, which introduces some conflicts into this PR. Please rebase.
OK.
Hi, I have merged and revised the PR. Specifically, for fetching the inference log-probs I followed the official version in the new codebase, but for the actual train/inference mismatch fix I kept my own revision.

What does this PR do? ✨ What's Changed
1. Refactored core components
2. Three-tier rejection strategy (see the sketch after this list)
3. Smart importance sampling
4. Industrial-grade diagnostics
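To make the "three-tier rejection" idea concrete, here is a minimal, hypothetical sketch (assumed tensor names and defaults, not the code in this PR) that combines token-level ratio masking, catastrophic filtering of near-zero inference probabilities, and optional whole-sequence rejection:

```python
import torch

def build_rejection_mask(train_logprobs, infer_logprobs, loss_mask,
                         token_min=0.1, token_max=10.0,
                         catastrophic_p=1e-4, seq_reject=False):
    """Hypothetical three-tier rejection sketch.

    Returns an updated loss mask in which rejected tokens (or whole
    sequences) no longer contribute to the loss.
    """
    ratio = torch.exp(train_logprobs - infer_logprobs)

    # Tier 1: token-level rejection of ratios outside [token_min, token_max]
    token_ok = (ratio >= token_min) & (ratio <= token_max)

    # Tier 2: catastrophic rejection of tokens to which the inference engine
    # assigned an almost-zero probability, which makes the ratio explode
    not_catastrophic = torch.exp(infer_logprobs) >= catastrophic_p

    keep = token_ok & not_catastrophic & loss_mask.bool()

    # Tier 3 (optional): drop a whole sequence if any of its tokens was rejected
    if seq_reject:
        seq_ok = keep.float().sum(dim=-1) == loss_mask.float().sum(dim=-1)
        keep = keep & seq_ok.unsqueeze(-1)

    return keep.to(loss_mask.dtype)
```

Here `token_min` / `token_max` would play the role of `infer_token_mask_threshold_min` / `infer_token_mask_threshold_max`, and `catastrophic_p` the role of `infer_catastrophic_threshold` from the config above.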
Because I kept following the latest version, this PR has accumulated quite a few commits. Should I resubmit a new PR?
Could you squash the commits yourself and then push?
I opened a new pull request: #288






This includes TIS (truncated importance sampling), masking out samples (tokens, sequences) whose ratios exceed the thresholds, and catastrophic filtering.
The changes were added in base_worker / actor_worker / actor_pg_worker.
After testing, both the rlvr and agentic pipelines run. However, the agentic VL trajectory env manager does not work yet, because the re-encoding in its formulate_rollouts may shuffle the tokens(?). Perhaps the log-prob recomputation from verl could be borrowed here; in any case, I did not modify that part.
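Regarding the verl-style fallback mentioned above, here is a minimal sketch (assuming a HuggingFace-style causal LM; the names are illustrative, not this PR's code) of recomputing the rollout log-probs with the current model instead of relying on the values returned by the inference engine:

```python
import torch

@torch.no_grad()
def recompute_logprobs(model, input_ids, attention_mask, response_mask):
    """Hypothetical sketch: recompute per-token log-probs of an already
    generated rollout under the current (training) model."""
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    # logits at position t predict the token at position t + 1
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:].unsqueeze(-1)
    token_logprobs = logprobs.gather(-1, targets).squeeze(-1)
    # keep only the response tokens (mask aligned with input_ids[:, 1:])
    return token_logprobs * response_mask[:, 1:]
```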