
# DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning

This repository contains the official code for DRIFT (Dissatisfaction-Refined Iterative preference Training), a post-training recipe that leverages abundant implicit user dissatisfaction (DSAT) signals from real-world deployments while requiring minimal explicit satisfaction (SAT) feedback. DRIFT anchors training on DSAT and dynamically samples positives from the evolving policy. On real-world WildFeedback and synthetic UltraFeedback datasets, DRIFT achieves up to +6.23% (7B) / +7.61% (14B) on WildBench Task Score and up to +8.95% (7B) / +12.29% (14B) on AlpacaEval2 win rate over base models, outperforming strong baselines such as iterative DPO and SPIN. At larger scales, 14B models trained with DRIFT surpass GPT-4o-mini on WildBench. DRIFT also preserves exploratory capacity, yielding more diverse high-reward solutions rather than collapsing to narrow subsets. Theoretically, we show that DRIFT preserves preference margins and avoids gradient degeneration, providing an effective and scalable recipe for real-world post-training that exploits the most abundant and informative signal.
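The core recipe described above can be sketched in a few lines: anchor each preference pair on a real DSAT response and sample the "chosen" side from the current policy. This is a minimal illustration, not the repository's implementation; `build_drift_pairs` and `policy_generate` are hypothetical names standing in for the actual data pipeline and generation backend.

```python
def build_drift_pairs(dsat_records, policy_generate):
    """Sketch of DRIFT pair construction (hypothetical, not the official code).

    dsat_records: iterable of {"prompt": str, "response": str} DSAT turns.
    policy_generate: callable mapping a prompt to a response from the
        current iteration's policy.
    """
    pairs = []
    for rec in dsat_records:
        chosen = policy_generate(rec["prompt"])  # positive sampled from the evolving policy
        pairs.append({
            "prompt": rec["prompt"],
            "chosen": chosen,             # dynamically sampled positive
            "rejected": rec["response"],  # real-world dissatisfied (DSAT) response, the anchor
        })
    return pairs
```

Because the rejected side is fixed real-world data while the chosen side is re-sampled each iteration, the preference margin is refreshed as the policy improves.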


## Setup

### Requirements

This project uses the Alignment Handbook for DPO training. Follow these steps to set up the environment:

1. Clone the Alignment Handbook repository:

   ```shell
   git clone https://github.com/huggingface/alignment-handbook.git
   cd alignment-handbook
   ```

2. Set up the environment as described in the Alignment Handbook documentation.

3. Clone this repository:

   ```shell
   git clone https://github.com/cacayaya/DRIFT.git
   cd DRIFT
   ```

## Data Generation

We provide the user-satisfied/dissatisfied chat data (SAT/DSAT) we used from WildFeedback in `./data/sat_data.jsonl` and `./data/dsat_data.jsonl`.

We also curate a DSAT->SAT seed set (491 pairs) from WildFeedback, in which a dissatisfied user turn (DSAT) is followed by a revised model response that satisfies the user (SAT). Each pair provides a natural preference: the DSAT response fails to meet the user's expectations, while the subsequent SAT response is preferred. We provide this seed data in `./data/seed-data`.

All data can also be found at Huggingface: DRIFT Collection.

To generate DRIFT preference data for iterative training, run:

```shell
CUDA_VISIBLE_DEVICES=0 python gen-drift.py \
    --model_name Qwen/Qwen2.5-7B-Instruct-seed \
    --input_file ./data/dsat_data.jsonl
```

This uses the provided model to generate the chosen responses, which are then paired with the DSAT responses from the real-world chats.

For a new iteration, simply change `--model_name` to the model produced by the current iteration:

```shell
CUDA_VISIBLE_DEVICES=0 python gen-drift.py \
    --model_name Qwen/Qwen2.5-7B-Instruct-DRIFT-iter1 \
    --input_file ./data/dsat_data.jsonl
```
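To see how the command line evolves across rounds, the per-iteration invocation can be generated programmatically. The naming pattern `<base>-seed` / `<base>-DRIFT-iter<N>` is extrapolated from the two examples above; treating it as a general convention is an assumption.

```python
def gen_drift_cmd(iteration, base="Qwen/Qwen2.5-7B-Instruct"):
    """Build the gen-drift.py command for a given iteration (sketch only)."""
    model = f"{base}-seed" if iteration == 0 else f"{base}-DRIFT-iter{iteration}"
    return (
        "CUDA_VISIBLE_DEVICES=0 python gen-drift.py "
        f"--model_name {model} --input_file ./data/dsat_data.jsonl"
    )

# Print the commands for the seed round and two DRIFT iterations.
for it in range(3):
    print(gen_drift_cmd(it))
```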

## Training with DPO

After generating preference data, you can run DPO training with the Alignment Handbook. The corresponding training configs can be found in `./configs`.
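For orientation, a trimmed Alignment Handbook-style DPO config might look like the following. The field names follow the handbook's recipe format, but the dataset path and hyperparameter values here are illustrative placeholders, not the values shipped in `./configs`:

```yaml
# Illustrative only -- see ./configs for the actual recipes used in this repo.
model_name_or_path: Qwen/Qwen2.5-7B-Instruct
torch_dtype: bfloat16

dataset_mixer:
  path/to/drift-preference-data: 1.0   # output of gen-drift.py (placeholder path)
dataset_splits:
  - train_prefs

beta: 0.1                # DPO temperature
learning_rate: 5.0e-7
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
output_dir: data/qwen2.5-7b-drift-iter1
```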

## Evaluation

For evaluation, we use WildBench and AlpacaEval2. WildBench is built from challenging real user queries drawn from the ChatGPT-human conversations in WildChat-1M, making it well suited for assessing our method's real-world performance.

