
This repository contains the source code for Self-Evaluation Guided MCTS for online DPO.

Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning

This repository contains code and analysis for the paper: Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning. Below is the framework of our proposed method.

Model Framework

Environment Setup

conda env create --file conda-recipe.yaml
pip install -r requirements.txt

Dataset Download

Run MCTS-DPO

Our main code is in ./mcts_rl/algorithms/mcts and ./mcts_rl/trainers/tsrl_trainer.py.
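At a high level, the search collects step-level preference pairs by ranking sibling nodes in the tree. Below is a minimal, self-contained sketch of two core ideas — UCT-style selection and pairing a node's best and worst children as a (preferred, dispreferred) step — using simplified node dictionaries, not the repository's actual classes:

```python
import math

def ucb_score(total_value, visits, parent_visits, c=1.0):
    """UCT score: mean value plus an exploration bonus (unvisited nodes first)."""
    if visits == 0:
        return float("inf")
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

def collect_preference_pair(children):
    """Rank a node's children by mean value and pair the best and worst
    reasoning steps as one (preferred, dispreferred) example for DPO."""
    ranked = sorted(children,
                    key=lambda ch: ch["value"] / max(ch["visits"], 1),
                    reverse=True)
    return ranked[0]["step"], ranked[-1]["step"]
```

In the actual trainer, node values combine outcome signals with the model's self-evaluation; the field names above (step, value, visits) are illustrative.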

To run MCTS-DPO for MathQA on Mistral (SFT):

bash scripts/mcts_mathqa.sh

To run MCTS-DPO for CSR (commonsense reasoning) on Mistral (SFT):

bash scripts/mcts_csr.sh
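The collected pairs are then used for iterative DPO updates. For reference, here is a minimal sketch of the standard DPO loss on a single (chosen, rejected) pair of sequence log-probabilities — the generic formulation, not the repository's exact implementation:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * margin), where the margin is the
    policy's log-prob advantage of chosen over rejected, measured relative to
    the frozen reference model. Inputs are summed token log-probabilities."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy prefers the chosen response more strongly than the reference does, the margin is positive and the loss drops below log 2; the beta value above is an illustrative default.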

Citation

@article{xie2024monte,
  title={Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning},
  author={Xie, Yuxi and Goyal, Anirudh and Zheng, Wenyue and Kan, Min-Yen and Lillicrap, Timothy P and Kawaguchi, Kenji and Shieh, Michael},
  journal={arXiv preprint arXiv:2405.00451},
  year={2024}
}

This repository is adapted from the Safe-RLHF codebase.
