Official Code For the Paper "More RLHF, More Trust? On The Impact of Human Preference Alignment On Language Model Trustworthiness"
This repo contains the experiment code for the paper; see the paper for details.
This project is organized around the following five primary areas of trustworthiness:
- Toxicity
- Stereotypical bias
- Machine Ethics
- Truthfulness
- Privacy
The repository is organized into one subdirectory per trustworthiness area. Each subdirectory includes Python scripts to reproduce our experiment results, and the data for each evaluation is included in the `data/` subdirectory.
.
├── README.md
├── data/
├── toxicity/
├── stereotype/
├── machine_ethics/
├── truthfulness/
└── privacy/
In general, Python scripts ending with `_exp.py` are used for language model inference, and scripts ending with `_analysis.py` are used to compute the numerical benchmark values. Other area-specific analyses are explained in the corresponding subdirectories.
In this project, we mainly focus on the open-source Pythia suite. Here are links to the specific models hosted on Hugging Face:

- Pythia 70M: `EleutherAI/pythia-70m`
- Pythia 160M: `EleutherAI/pythia-160m`
- Pythia 410M: `EleutherAI/pythia-410m`
- Pythia 2.8B: `EleutherAI/pythia-2.8b`
- Pythia 6.9B: `EleutherAI/pythia-6.9b`
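
For reference, here is a minimal sketch of loading one of these checkpoints with the Hugging Face `transformers` library. The prompt and generation settings below are illustrative placeholders, not the exact configuration used in our experiments:

```python
# Minimal sketch: load a Pythia checkpoint and run greedy generation.
# The prompt and max_new_tokens value are placeholders for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-70m"  # any of the checkpoints listed above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "The quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```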