This repository contains the official code for DRIFT (Dissatisfaction-Refined Iterative preference Training), a post-training recipe that leverages abundant implicit user dissatisfaction (DSAT) signals from real-world deployments while requiring minimal explicit satisfaction (SAT) feedback. DRIFT anchors training on DSAT and dynamically samples positives from the evolving policy. On real-world WildFeedback and synthetic UltraFeedback datasets, DRIFT achieves up to +6.23% (7B) / +7.61% (14B) on WildBench Task Score and up to +8.95% (7B) / +12.29% (14B) on AlpacaEval2 win rate over base models, outperforming strong baselines such as iterative DPO and SPIN. At larger scales, 14B models trained with DRIFT surpass GPT-4o-mini on WildBench. DRIFT also preserves exploratory capacity, yielding more diverse high-reward solutions rather than collapsing to narrow subsets. Theoretically, we show that DRIFT preserves preference margins and avoids gradient degeneration, providing an effective and scalable recipe for real-world post-training that exploits the most abundant and informative signal.
This project uses the Alignment Handbook for DPO training. Follow these steps to set up the environment:
- Clone the Alignment Handbook repository:

  ```shell
  git clone https://github.com/huggingface/alignment-handbook.git
  cd alignment-handbook
  ```

- Set up the environment as described in the Alignment Handbook documentation.

- Clone this repository:

  ```shell
  git clone https://github.com/cacayaya/DRIFT.git
  cd DRIFT
  ```

We provide the user satisfied/dissatisfied chat data (SAT/DSAT) we used from WildFeedback in ./data/sat_data.jsonl and ./data/dsat_data.jsonl.
We also curate a DSAT->SAT seed set (491 pairs) from WildFeedback, where a dissatisfied user turn (DSAT) is followed by a revised model response that satisfies the user (SAT). Each pair provides a natural preference: the DSAT response fails to meet expectations, while the subsequent SAT response is preferred. We provide this seed data in ./data/seed-data.
All data can also be found on Hugging Face: DRIFT Collection.
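If you want to inspect the data before training, the minimal sketch below loads the JSONL files. It assumes ./data/seed-data also contains JSONL files; adjust the paths and glob pattern to the actual layout, and check the printed keys for the real schema.

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Read one JSON object per line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Real-world dissatisfied (DSAT) and satisfied (SAT) chat data from WildFeedback
dsat_records = load_jsonl("./data/dsat_data.jsonl")
sat_records = load_jsonl("./data/sat_data.jsonl")
print(len(dsat_records), "DSAT records;", len(sat_records), "SAT records")
print("example DSAT keys:", list(dsat_records[0].keys()))

# DSAT->SAT seed pairs (491); adjust the glob if ./data/seed-data uses a different layout
seed_pairs = [rec for p in sorted(Path("./data/seed-data").glob("*.jsonl")) for rec in load_jsonl(p)]
print(len(seed_pairs), "seed preference pairs")
```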
To generate DRIFT preference data for iterative training, run:
```shell
CUDA_VISIBLE_DEVICES=0 python gen-drift.py \
    --model_name Qwen/Qwen2.5-7B-Instruct-seed \
    --input_file ./data/dsat_data.jsonl
```

This uses the provided model to generate chosen responses, which are paired with the DSAT responses from the real-world chats.
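Conceptually, each DRIFT pair anchors on a real DSAT response as the rejected side and a fresh sample from the current policy as the chosen side. The sketch below only illustrates that pairing; it is not the actual gen-drift.py, the field names and sampling setup are assumptions, and the real script may sample and select candidates differently.

```python
# Conceptual sketch of DRIFT pair construction (not the actual gen-drift.py).
# Assumed field names: "prompt" (user turn) and "response" (the real DSAT response).
import json
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",  # replace with the current-iteration checkpoint
    device_map="auto",
)

def build_drift_pairs(dsat_path, out_path, max_new_tokens=512):
    pairs = []
    with open(dsat_path, "r", encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            prompt = rec["prompt"]       # user turn that led to dissatisfaction (assumed key)
            rejected = rec["response"]   # real DSAT response, used as the negative anchor (assumed key)
            # Sample a candidate from the current policy as the chosen response.
            # Chat templating and any quality filtering are omitted here for brevity.
            chosen = generator(
                prompt, max_new_tokens=max_new_tokens, do_sample=True, return_full_text=False
            )[0]["generated_text"]
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    with open(out_path, "w", encoding="utf-8") as f:
        for p in pairs:
            f.write(json.dumps(p, ensure_ascii=False) + "\n")

build_drift_pairs("./data/dsat_data.jsonl", "./data/drift_pairs_iter1.jsonl")
```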
For a new iteration, simply change the model to the current iteration's checkpoint:
```shell
CUDA_VISIBLE_DEVICES=0 python gen-drift.py \
    --model_name Qwen/Qwen2.5-7B-Instruct-DRIFT-iter1 \
    --input_file ./data/dsat_data.jsonl
```

After generating the preference data, you can use the Alignment Handbook for DPO training. The corresponding training configs can be found in ./configs.
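As a reference, a typical Alignment Handbook DPO launch looks like the following. The accelerate config and recipe path here are placeholders; substitute the appropriate accelerate config from the handbook and one of the training configs provided in ./configs.

```shell
# Example launch following the Alignment Handbook's documented pattern.
# The accelerate config and the config path below are placeholders; use the files
# that match your hardware and the DRIFT configs in ./configs.
ACCELERATE_LOG_LEVEL=info accelerate launch \
    --config_file recipes/accelerate_configs/deepspeed_zero3.yaml \
    scripts/run_dpo.py ./configs/drift_dpo_iter1.yaml
```

Repeat the generation step with the newly trained checkpoint and run DPO training again for each DRIFT iteration.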
For evaluation, we use WildBench and AlpacaEval2. WildBench is built from challenging user queries drawn from real human-ChatGPT conversations in WildChat-1M, making it well suited for assessing our method's real-world performance.
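For AlpacaEval2, once you have generated model outputs on the evaluation set, a typical invocation looks like the sketch below. The output path is a placeholder, and you should check the AlpacaEval repository for the current CLI options and the API key required by the default annotator.

```shell
pip install alpaca-eval
# model_outputs points to a JSON file of your model's responses on the AlpacaEval set;
# the path below is a placeholder. The default GPT-4-based annotator requires OPENAI_API_KEY.
alpaca_eval --model_outputs results/drift-7b-outputs.json
```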
