ET-Agent is a framework for fully optimizing agents' behavioral patterns in tool-integrated reasoning (TIR) tasks. It approaches the problem from both the data side and the algorithm side. On the data side, it uses a Self-Evolving Data Flywheel to generate a large amount of enhanced data, providing the data foundation for RFT and broadening coverage of the tool-use action space. On the algorithm side, it introduces a Behavior Calibration Training framework that gradually calibrates the agent's behavioral patterns toward the optimal state. Evaluations across multiple dimensions on two types of reasoning tasks show that ET-Agent outperforms commonly used multi-TIR methods.
We first set up all the required environments.

```bash
git clone https://github.com/asilverlight/ET-Agent/
cd ET-Agent/evaluation
conda create -n evaluation python=3.10
conda activate evaluation
pip install -r requirements.txt
```

```bash
cd ET-Agent/evaluation/vllm
conda create -n vllm python=3.10
conda activate vllm
pip install -r requirements.txt
```

```bash
cd ET-Agent/FlashRAG-main
conda create -n retrieval python=3.10
conda activate retrieval
pip install -r requirements.txt
```

```bash
cd ET-Agent/LLaMA-Factory
conda create -n rft python=3.10
conda activate rft
pip install -r requirements.txt
```

```bash
cd ET-Agent/ARPO
conda create -n rl python=3.10
conda activate rl
pip install -r requirements.txt
```

- Set up services for all required LLMs.
In a terminal, execute the following code:

```bash
cd ET-Agent/flywheel
conda activate vllm
bash vllm_launch_eval1.sh
```

In another terminal, execute the following code:

```bash
cd ET-Agent/flywheel
conda activate vllm
bash vllm_launch_evolve1.sh
```
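Before moving on, you can optionally check that the vLLM servers are reachable. The sketch below assumes the launch scripts expose the standard vLLM OpenAI-compatible API; the host and port are placeholders, so substitute whatever `vllm_launch_eval1.sh` and `vllm_launch_evolve1.sh` actually configure.

```bash
# Optional sanity check: list the models served by a vLLM instance.
# The port 8000 is only an example; use the ports set in the launch scripts.
curl -s http://localhost:8000/v1/models
```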
- Deploy the local retrieval service. We provide a Wikipedia retriever service implemented with FlashRAG and FastAPI. Before starting the retriever serving, you need to download the pre-built Wikipedia index, the Wikipedia corpus, and the corresponding retriever models. The index construction method can be found here. More details can be found in the FlashRAG documentation.
To start the retriever serving, first fill in `ET-Agent/retriever/serving_config.yaml` with the correct paths to the retrieval model, index, and corpus, as well as the available GPU IDs. Then, run the following command to start the retriever serving:

```bash
python host_wiki.py \
--config serving_config.yaml \
--num_retriever {num_retriever} \
--port {port}
```
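For reference, a filled-in invocation might look like the sketch below; the number of retrievers is illustrative, and port 1243 simply matches the default `LOCAL_SEARCH_URL` (0.0.0.0:1243) used in the later steps.

```bash
# Illustrative values only: choose --num_retriever based on your GPU budget,
# and keep the port consistent with LOCAL_SEARCH_URL in the later scripts.
python host_wiki.py \
    --config serving_config.yaml \
    --num_retriever 1 \
    --port 1243
```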
- Run the Self-Evolving Data Flywheel.

```bash
cd ET-Agent/flywheel
conda activate evaluation
bash run.sh
```

Note that you need to set `CONDA_ENV` in `run.sh` to your environment path, and fill in your `BING_API_KEY` and `BING_ZONE`. After that, replace `LOCAL_SEARCH_URL` with the address and port you set in `host_wiki.py` (the default is 0.0.0.0:1243).
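As a rough guide, the relevant assignments at the top of `run.sh` might look like the sketch below; the variable names come from the note above, while the values are placeholders and the exact syntax in the script may differ.

```bash
# Placeholders to adapt in flywheel/run.sh (example values only).
CONDA_ENV=/path/to/conda/envs/evaluation   # path to the evaluation environment
BING_API_KEY=your_bing_api_key             # Bing search credentials
BING_ZONE=your_bing_zone
LOCAL_SEARCH_URL=0.0.0.0:1243              # address/port served by host_wiki.py
```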
- Conduct RFT. Execute the following code:

```bash
conda activate rft
cd ET-Agent/LLaMA-Factory/examples/train_full
bash train_sft.sh
```

- Conduct Group-wise Pareto Sampling. First, directly guide the agent to sample several chains for each question in the original data. In a terminal, execute the following code:

```bash
cd ET-Agent/direct_rollout
conda activate vllm
bash vllm_launch_infer1.sh
```

In another terminal, execute the following code:

```bash
cd ET-Agent/direct_rollout
conda activate vllm
bash vllm_launch_summarize_model1.sh
```

You should also deploy the local retrieval service. Then, run the following code to execute the direct rollout:

```bash
cd ET-Agent/direct_rollout
conda activate evaluation
bash run_direct_inference.sh
```

Note that you need to set `CONDA_ENV` in `run_direct_inference.sh` to your environment path, and fill in your `BING_API_KEY` and `BING_ZONE`. After that, replace `LOCAL_SEARCH_URL` with the address and port you set in `host_wiki.py` (the default is 0.0.0.0:1243).
Then, you can conduct Pareto sampling on the data:

```bash
cd ET-Agent/pareto
conda activate evaluation
bash data_process.sh
```

You can find the training data for RL at `/path/to/ARPO/rl_datasets/train_processed_stage.parquet`.
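If you want a quick sanity check of the generated file, something like the sketch below works, assuming pandas with a parquet engine (e.g., pyarrow) is installed in the `evaluation` environment; the `prompt` column name matches `PROMPT_KEY` in the RL training script below.

```bash
# Optional: inspect the generated RL training data (illustrative only).
python -c "
import pandas as pd
df = pd.read_parquet('/path/to/ARPO/rl_datasets/train_processed_stage.parquet')
print(df.shape)
print(df.columns.tolist())
"
```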
- Conduct Curriculum RL Training. You can execute the RL code for each round in sequence:

```bash
cd ET-Agent/ARPO
conda activate rl
bash scripts/Efficiency_Qwen_7B_Stage.sh
```

🔍 Click here to view the details of the training script:

```bash
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
PARENT_DIR="$(dirname "$SCRIPT_DIR")"
cd "$PARENT_DIR"
echo "Switched to parent directory: $PARENT_DIR"
# ============================ Environment Setup ============================
# Set basic environment variables
export PYTHONUNBUFFERED=1
export HYDRA_FULL_ERROR=1
export VLLM_ATTENTION_BACKEND=XFORMERS
export VERL_LOGGING_LEVEL=DEBUG
export MKL_SERVICE_FORCE_INTEL=1
export MKL_THREADING_LAYER=GNU
export RAY_memory_usage_threshold=0.8
export RAY_memory_monitor_refresh_ms=0
export CUDA_VISIBLE_DEVICES=0,1,2,3
# Set Python path
export PYTHONPATH="/path/to/ARPO"/verl_arpo_entropy:$PYTHONPATH
# ============================ Basic Configuration ============================
# Experiment name and project
PROJECT_NAME="reasoning_tasks" # Modify experiment group
EXPERIMENT_NAME="Efficiency_Qwen_7B_Stage1" # Modify experiment name
# Configuration file path
CONFIG_PATH="/path/to/ARPO/scripts/config" # Modify the absolute path of the config folder, relative path is not recommended
CONFIG_NAME="ppo_trainer.yaml"
# Distributed training settings
NNODES=1
N_GPUS_PER_NODE=4
# ============================ Data Configuration ============================
# Data parameters
PROMPT_KEY="prompt" # Prompt field name
TRAIN_BATCH_SIZE=64 # Training batch size
PPO_MINI_BATCH_SIZE=4 # PPO mini-batch size
MAX_PROMPT_LENGTH=2000 # Maximum prompt length
MAX_RESPONSE_LENGTH=4096 # Maximum response length
# Data file paths
TRAIN_FILES="/path/to/ARPO/rl_datasets/train_processed_stage1.parquet" # Modify training data path
VALID_FILES="/path/to/ARPO/rl_datasets/valid.parquet" # Modify validation data path
# ============================ Model Configuration ============================
# Actor model path
ACTOR_MODEL_PATH="/path/to/output/sft" # Modify training model path
# ============================ Rollout Configuration ==========================
# Rollout settings
ROLLOUT_NAME="vllm" # Use vllm engine
ROLLOUT_MODE="sync_with_tool" # Synchronous mode with tool support
ROLLOUT_N=16 # Number of responses generated per sample
INITIAL_ROLLOUTS=8 # Initial rollout number
BEAM_SIZE=2 # Beam size
BRANCH_PROBABILITY=0.5 # Branch probability
Entropy_weight=0.2
SIGMA_TOOL=0.0
SIGMA_LENGTH=0.0
ENABLE_MULTI_TURN=True # Used below; not defined in the original script, assumed True for multi-turn tool rollout
# ============================ Rollout Tools Configuration ==========================
SEARCH_CACHE_PATH="/path/to/ARPO/search_cache/search_cache.json" # Modify
# ============================ Reward Model Configuration ==========================
# Reward model settings
REWARD_MANAGER="efficiency" # Reward manager type
CUSTOM_REWARD_FUNCTION_PATH="/path/to/ARPO/verl_arpo_entropy/verl/utils/reward_score/efficiency.py"
CUSTOM_REWARD_FUNCTION_NAME="compute_score"
# ============================ Training Configuration ============================
# Training parameters
TOTAL_EPOCHS=3 # Total training epochs
SAVE_FREQ=30 # Save frequency
TEST_FREQ=10 # Test frequency
# ============================ Path Configuration ============================
# Save path
SAVE_PATH="/path/to/output/${EXPERIMENT_NAME}" # Modify save path
ROLLOUT_SAVE_PATH="${SAVE_PATH}/rollout"
# ============================ WandB Configuration ============================
# WandB settings
WANDB_API_KEY="your_wandb_api_key" # Modify your wandb key
# SEARCH_CLASS_PATH="verl.workers.agent.tools.search_tool.BingSearchTool"
SEARCH_CLASS_PATH="verl.workers.rollout.tools.search_tool.LocalSearchTool"
LOCAL_SEARCH_URLS="your_local_search_url"
# ============================ Preparation ============================
# Login to WandB (if API key is provided)
if [ "$WANDB_API_KEY" != "" ]; then
# export WANDB_BASE_URL="https://api.bandw.top"
wandb login --relogin $WANDB_API_KEY
export WANDB_DIR=${SAVE_PATH}
fi
# Create save directory
if [ ! -d "$SAVE_PATH" ]; then
mkdir -p $SAVE_PATH
fi
# Create rollout save directory
if [ ! -d "$ROLLOUT_SAVE_PATH" ]; then
mkdir -p $ROLLOUT_SAVE_PATH
fi
# ============================ Start Training ============================
python3 -m verl.trainer.main_ppo \
--config-path=$CONFIG_PATH \
--config-name=$CONFIG_NAME \
algorithm.adv_estimator=grpo \
algorithm.kl_ctrl.kl_coef=0.0 \
data.train_files=${TRAIN_FILES} \
data.val_files=${VALID_FILES} \
data.prompt_key=${PROMPT_KEY} \
data.train_batch_size=${TRAIN_BATCH_SIZE} \
data.max_prompt_length=${MAX_PROMPT_LENGTH} \
data.max_response_length=${MAX_RESPONSE_LENGTH} \
actor_rollout_ref.model.path=${ACTOR_MODEL_PATH} \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=${PPO_MINI_BATCH_SIZE} \
actor_rollout_ref.actor.use_dynamic_bsz=True \
actor_rollout_ref.actor.ppo_max_token_len_per_gpu=$((2*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.0 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.fsdp_config.param_offload=False \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=$((4*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.name=${ROLLOUT_NAME} \
actor_rollout_ref.rollout.mode=${ROLLOUT_MODE} \
actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
actor_rollout_ref.rollout.n=${ROLLOUT_N} \
actor_rollout_ref.rollout.initial_rollouts=${INITIAL_ROLLOUTS} \
actor_rollout_ref.rollout.beam_size=${BEAM_SIZE} \
actor_rollout_ref.rollout.branch_probability=${BRANCH_PROBABILITY} \
actor_rollout_ref.rollout.entropy_weight=${Entropy_weight} \
actor_rollout_ref.rollout.tools.tool_instances.search.use_local_search=True \
actor_rollout_ref.rollout.tools.tool_instances.search.params.localsearch.local_search_url=${LOCAL_SEARCH_URLS} \
actor_rollout_ref.rollout.tools.tool_instances.search.params.localsearch.max_results=4 \
actor_rollout_ref.rollout.tools.tool_instances.search.params.localsearch.max_document_length=1200 \
actor_rollout_ref.rollout.multi_turn.enable=${ENABLE_MULTI_TURN} \
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=$((4*(MAX_PROMPT_LENGTH+MAX_RESPONSE_LENGTH))) \
actor_rollout_ref.ref.fsdp_config.param_offload=False \
reward_model.reward_manager=${REWARD_MANAGER} \
reward_model.reward_kwargs.sigma_tool=${SIGMA_TOOL} \
reward_model.reward_kwargs.sigma_length=${SIGMA_LENGTH} \
custom_reward_function.path=${CUSTOM_REWARD_FUNCTION_PATH} \
custom_reward_function.name=${CUSTOM_REWARD_FUNCTION_NAME} \
trainer.critic_warmup=0 \
trainer.logger="[console, wandb]" \
trainer.project_name=${PROJECT_NAME} \
trainer.experiment_name=${EXPERIMENT_NAME} \
trainer.n_gpus_per_node=${N_GPUS_PER_NODE} \
trainer.nnodes=${NNODES} \
trainer.save_freq=${SAVE_FREQ} \
trainer.test_freq=${TEST_FREQ} \
trainer.total_epochs=${TOTAL_EPOCHS} \
trainer.default_local_dir=${SAVE_PATH} \
trainer.val_before_train=False \
trainer.rollout_data_dir=${ROLLOUT_SAVE_PATH} \
hydra.run.dir=${SAVE_PATH}/outputs 2>&1 | tee ${SAVE_PATH}/run.log
```

On the Correctness and Efficiency metrics, ET-Agent achieves the best performance in most cases across six difficult tasks, surpassing a range of baselines. On the Conciseness, Successful Execution, and Reasoning Length metrics, ET-Agent also achieves the best average performance, outperforming six commonly used multi-TIR methods.
- Deploy the local retrieval service.
- Deploy the reasoning model:

```bash
cd ET-Agent/evaluation
conda activate vllm
bash vllm_launch_reasoning_model-1.sh
```

- Deploy the summarization model:

```bash
cd ET-Agent/evaluation
conda activate vllm
bash vllm_launch_summarization_model-1.sh
```

- Run inference:

```bash
cd ET-Agent/evaluation
conda activate evaluation
bash infer_local_sds.sh
```

You need to set `MODEL_PATH`, `OUTPUT_PATH`, `CONDA_ENV`, `BING_API_KEY`, `BING_ZONE`, `SUMM_MODEL_PATH`, `SEARCH_CACHE_FILE`, `URL_CACHE_FILE`, and `LOCAL_SEARCH_URL` in `infer_local_sds.sh`.
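As a rough guide, the assignments in `infer_local_sds.sh` might look like the sketch below; the variable names come from the list above, and all values are placeholders.

```bash
# Placeholders to adapt in infer_local_sds.sh (example values only).
MODEL_PATH=/path/to/model_to_evaluate
OUTPUT_PATH=/path/to/eval_outputs
CONDA_ENV=/path/to/conda/envs/evaluation
BING_API_KEY=your_bing_api_key
BING_ZONE=your_bing_zone
SUMM_MODEL_PATH=/path/to/summarization_model
SEARCH_CACHE_FILE=/path/to/search_cache.json
URL_CACHE_FILE=/path/to/url_cache.json
LOCAL_SEARCH_URL=0.0.0.0:1243
```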
- After that, you can evaluate the model's metrics. First, deploy the judging model:

```bash
cd ET-Agent/evaluation
conda activate vllm
bash evaluate/deploy_qwen2.5_72B_instruct.sh
```

Then, run `ET-Agent/evaluation/evaluate_all_datas.sh` to evaluate the performance of the model. Here, we evaluate the F1 score, LLM-as-Judge, and Efficiency metrics:

```bash
bash evaluate/evaluate_all_datas.sh
```

- For the measurement of the Conciseness, Successful Execution, and Reasoning Length metrics, please run `ET-Agent/radar/run.sh`.
First, deploy the judging model:

```bash
cd ET-Agent/radar
conda activate vllm
bash serve_model.sh
```

Then, you can run `ET-Agent/radar/run.sh`. You need to set `methods_paths` in `run.sh`.

```bash
cd ET-Agent/radar
conda activate evaluation
bash run.sh
```

