We introduce Compositional Preference Models (CPMs), a novel framework for training robust and interpretable preference models.
The generic handling of language models and text generation relies on HuggingFace's Transformers library.
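As an illustration of that dependency, feature annotation can be driven through the standard Transformers API. The sketch below is illustrative only; the model name and prompt are placeholders rather than the repository's actual configuration.

```python
# Minimal sketch of scoring one feature with an off-the-shelf LM via Transformers.
# The model name and prompt are placeholders, not the repository's settings.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the experiments use larger models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Rate the helpfulness of the following response on a scale of 1-10:\n..."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)

# Decode only the newly generated tokens (the model's rating).
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```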
run.sh: Run overall experiments
feature_extract/annotation.sh: Extract feature values using LM
mle-train/logistic_fits.sh: Train logistic classifier that combines feature values into a single model (see the sketch after this list)
reward_model/pm_training.sh: Train standard preference model
mle-train/preference_evaluation.sh: Evaluate preference alignment with LLM
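Conceptually, the combination step performed by mle-train/logistic_fits.sh is a pairwise logistic regression over per-feature scores. The sketch below illustrates that idea with scikit-learn and synthetic feature values; the feature count, data, and choice of scikit-learn are assumptions for illustration, not the repository's actual implementation.

```python
# Illustrative sketch of combining feature values into a single preference score,
# assuming scikit-learn; the repository's mle-train scripts may differ.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_pairs, n_features = 200, 5  # e.g. helpfulness, specificity, factuality, ...

# Placeholder feature scores produced by the annotation step for the
# preferred and dispreferred response in each pair (synthetic data here).
feats_chosen = rng.normal(loc=0.5, size=(n_pairs, n_features))
feats_rejected = rng.normal(loc=0.0, size=(n_pairs, n_features))

# A Bradley-Terry-style pairwise objective with a linear score reduces to
# logistic regression on feature differences, labeled by which side was preferred.
X = np.vstack([feats_chosen - feats_rejected, feats_rejected - feats_chosen])
y = np.concatenate([np.ones(n_pairs), np.zeros(n_pairs)])

clf = LogisticRegression(fit_intercept=False).fit(X, y)
print("learned feature weights:", clf.coef_.ravel())

# The combined score of a new response is then a weighted sum of its feature values,
# which keeps the model interpretable: each weight shows a feature's contribution.
```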
@inproceedings{go2023compositional,
title={Compositional Preference Models for Aligning LMs},
author={Go, Dongyoung and Korbak, Tomasz and Kruszewski, Germ{\'a}n and Rozen, Jos and Dymetman, Marc},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024}
}