Official Code For the Paper "More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness"
This repo contains the experiment code for the paper; see the paper for details.
This project is organized around the following five primary areas of trustworthiness:
- Toxicity
- Stereotypical bias
- Machine Ethics
- Truthfulness
- Privacy
The repository is organized into one subdirectory per trustworthiness area. Each subdirectory includes Python scripts to reproduce our experiment results, and the data for each evaluation is included in the `data/` subdirectory.
.
├── README.md
├── data/
├── toxicity/
├── stereotype/
├── machine_ethics/
├── truthfulness/
└── privacy/
In general, Python scripts ending with `_exp.py` are used for language model inference, and scripts ending with `_analysis.py` are used to compute the numerical benchmark values. Other area-specific analyses are explained in the corresponding subdirectories.
In this project, we mainly focus on the open-source Pythia suite. Here are links to the specific models hosted on Hugging Face:

- Pythia 70M: `EleutherAI/pythia-70m`
- Pythia 160M: `EleutherAI/pythia-160m`
- Pythia 410M: `EleutherAI/pythia-410m`
- Pythia 2.8B: `EleutherAI/pythia-2.8b`
- Pythia 6.9B: `EleutherAI/pythia-6.9b`
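
For reference, here is a minimal sketch of loading one of these checkpoints with the Hugging Face `transformers` library. The prompt and generation settings below are illustrative placeholders, not the exact configuration used in our experiments:

```python
# Minimal sketch: load a Pythia checkpoint and run greedy generation.
# The prompt and max_new_tokens value are placeholders for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"  # any of the checkpoints listed above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "The quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```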