This repository implements Multi-Objective Reinforcement Learning from AI Feedback (MORLAIF). Instead of training the target model with a single preference model representing "all human preferences", the idea is to break this down into many simpler principles such as "toxicity", "factuality" and "sycophancy". This is essentially a form of task decomposition applied to the preference modeling stage of an RLAIF or RLHF system. It improves alignment performance and interpretability, and allows fine-grained control over the model's behavior without retraining preference models. The results show that MORLAIF outperforms standard single-objective RLAIF baselines, and that it can be used to align larger language models using smaller preference models. For more information, read the full paper on arXiv: Multi-Objective Reinforcement Learning from AI Feedback.
- Replication
- Methodology
- Current Setup
- Principles
- Results
- Theoretical Advantages
- Frequently Asked Questions (FAQ)
- Acknowledgments
- Build Docker Image: Run `docker build -t morlaif .` to build the Docker environment.
- Run Docker Container: Use `docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it --rm morlaif` to start the container.
- Prepare the Dataset: Execute `process_HH_dataset.py` to format the Anthropic `hh-rlhf` dataset. The principle-specific datasets in the `data/datasets` folder can alternatively be used.
- Acquire Feedback: Run `get_feedback.bash` to obtain feedback from GPT-3.5 or GPT-4 according to the different principles in `principles`. Note that this will by default send API requests in parallel, so set `max_requests_per_minute` appropriately. There is also the option of using OpenAI's Batch API via `create_batches.py` and `send_batches.py`.
- Launch PM Training: Initiate preference model training with the provided bash scripts in `PM_training`.
- Create a Scalarization Function: A few different scalarization functions (see the scalarization section) are defined in `PPO_training/MORL_scalarizer.py`. Good weights for the linear scalarization can be calculated using `PM_training/PM_regression.py` (a rough sketch of the idea follows this list).
- Start PPO Training: Follow up with PPO training using the designated bash scripts in `PPO_training`.
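To illustrate how linear scalarization weights can be derived, here is a minimal sketch of regressing overall preference labels on per-principle preference model scores. The file names, data layout, and use of logistic regression are assumptions for illustration; the repository's actual procedure is in `PM_training/PM_regression.py`.

```python
# Hypothetical sketch: fit linear scalarization weights by regressing overall
# preference labels on per-principle PM score differences. File names and data
# layout are assumptions, not the repository's actual interface.
import numpy as np
from sklearn.linear_model import LogisticRegression

PRINCIPLES = ["helpfulness", "ethicality", "factuality", "toxicity",
              "sycophancy", "empathy", "relevance", "context", "bias",
              "understandability", "detail", "conciseness"]

# scores_a, scores_b: (n_pairs, n_principles) PM scores for each response in a
# pair; labels: 1 if response A was preferred overall, else 0.
scores_a = np.load("pm_scores_a.npy")
scores_b = np.load("pm_scores_b.npy")
labels = np.load("overall_preference_labels.npy")

# Characterise each pair by its per-principle score difference and fit a
# logistic regression without intercept; the coefficients serve as weights.
X = scores_a - scores_b
reg = LogisticRegression(fit_intercept=False).fit(X, labels)
print(dict(zip(PRINCIPLES, reg.coef_.ravel())))
```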
- Generating responses: An SFT model produces pairs of responses for prompts.
- Rating by Feedback Model: A feedback model evaluates which of these responses is better according to each individual principle. Most experiments used the 12 principles listed below.
- Training Preference Models: The ratings are used to train separate preference models (full models or LoRAs) for each principle (a minimal training sketch follows this list).
- MORL Scalarization Function: A MORL scalarization function combines the ratings from each preference model into a reward signal.
- PPO Training: The combined score from the scalarization function acts as a reward signal, guiding the training of the target model using Proximal Policy Optimization (PPO).
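As a concrete illustration of step 3, the sketch below trains a single-principle preference model with a Bradley-Terry pairwise loss; the base model, placeholder data, and hyperparameters are assumptions rather than the repository's actual training code.

```python
# Minimal sketch of training one per-principle preference model with a
# Bradley-Terry pairwise loss. Base model, data, and hyperparameters are
# illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def score(texts):
    """Return one scalar preference score per input text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return model(**batch).logits.squeeze(-1)

# Placeholder data: (chosen, rejected) response pairs labelled by the feedback
# model for a single principle.
pairs = [("Here is a balanced, factual answer...",
          "You're absolutely right about everything!")]

for chosen, rejected in pairs:
    # Bradley-Terry objective: the chosen response should score higher.
    loss = -F.logsigmoid(score([chosen]) - score([rejected])).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```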
- Target Model: The code currently supports GPT-2 small/medium/large/XL, Llama-2-7B/13B/70B and Gemma-2B/7B.
- Preference Models: Currently the code implements finetuning of GPT-2, Llama-2, Gemma, or LoRAs of these models as the preference models (see the LoRA sketch after this list).
- Feedback Model: GPT-3.5 and GPT-4 are supported to rate response pairs according to each individual principle.
- Datasets: Currently Anthropic's hh-rlhf and openassistant-guanaco are used.
- Hardware: For GPT-2 models an RTX 3090 (24 GB) was used, while for Llama and Gemma models a remote cluster with 8x A100 (80 GB) was used.
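Since the preference models can also be trained as LoRAs, here is a minimal sketch of attaching a LoRA adapter with the `peft` library; the rank, target modules, and base model are illustrative assumptions, not the repository's configuration.

```python
# Hypothetical sketch: attach a LoRA adapter to a preference model so only the
# low-rank matrices are trained. Hyperparameters are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

base = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf", num_labels=1
)
lora_config = LoraConfig(
    r=16,                                 # low-rank dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama-2
    task_type="SEQ_CLS",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # only the LoRA weights require grad
```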
These 12 principles were used for most of the experiments (an example feedback request is sketched after the list).
- helpfulness
- ethicality
- factuality
- toxicity
- sycophancy
- empathy
- relevance
- context
- bias
- understandability
- detail
- conciseness
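For illustration, the sketch below shows what a per-principle feedback request to the feedback model might look like. The actual prompt templates live in the `principles` folder and are sent by `get_feedback.bash`; the wording, parsing, and parameters here are assumptions.

```python
# Illustrative per-principle feedback request. The real prompt templates are
# stored in the `principles` folder; this wording is an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def compare(prompt: str, response_a: str, response_b: str, principle: str) -> str:
    query = (
        f"Consider the principle '{principle}'.\n"
        f"Prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response better satisfies the principle? Answer with 'A' or 'B'."
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": query}],
        max_tokens=1,
        temperature=0,
    )
    return reply.choices[0].message.content.strip()
```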
Different MORL scalarization functions were evaluated to combine the preference model outputs (a few are sketched after this list), including:
- Weighted Linear Combination
- Worst-Case Optimization, aka Minimax, Max-Min or Rawlsian social welfare
- Soft Max-Min
- Uncertainty-Weighted Optimization
- Lower Quantile Optimization
- Max-Median
- Bernoulli-Nash
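A few of these scalarizations are sketched below; the repository's implementations live in `PPO_training/MORL_scalarizer.py`, and the exact formulations here (e.g. the soft max-min temperature and the positivity assumption for Bernoulli-Nash) are assumptions.

```python
# Illustrative scalarization functions over a vector of per-principle PM
# scores. Formulations are assumptions; see PPO_training/MORL_scalarizer.py
# for the repository's versions.
import numpy as np

def weighted_linear(scores, weights):
    return float(np.dot(weights, scores))

def worst_case(scores):
    # Max-min / Rawlsian: the reward is the worst-performing principle's score.
    return float(np.min(scores))

def soft_max_min(scores, temperature=1.0):
    # Smooth approximation of the minimum via a log-sum-exp of negated scores.
    s = np.asarray(scores, dtype=float)
    return float(-temperature * np.log(np.mean(np.exp(-s / temperature))))

def bernoulli_nash(scores, eps=1e-6):
    # Geometric mean of the scores; assumes they have been shifted to be positive.
    s = np.clip(np.asarray(scores, dtype=float), eps, None)
    return float(np.exp(np.mean(np.log(s))))
```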
Our experiments demonstrate that preference models trained on individual principles generally achieve much higher accuracy than single-objective PMs. Furthermore, all MORL objectives outperform the standard single-PM RLAIF baselines.
In human preference experiments, MORLAIF Llama-2-7b is strongly preferred over single-objective RLAIF. Notably, a version trained with GPT-2-medium preference models performs on par with the single-objective model. GPT-4-Turbo judgments show high win rates for GPT-2-medium, with a decreasing but still significant win rate for larger models. The correlation matrix reveals weak correlations for sycophancy, which is also the only principle that received a negative weight, indicating that sycophancy is actually preferred. Multi-objective PM accuracy depends on the number of principles used, shown for GPT-2-medium, Llama-2-7b, and the theoretical performance ceiling (representing 100% accuracy for each principle).
- Special thanks to the Long-Term Future Fund for funding this project.