Representation Engineering (RepE)

This is the official repository for "Representation Engineering: A Top-Down Approach to AI Transparency"
by Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, Zico Kolter, and Dan Hendrycks.

Check out our website and demo here.

Introduction

In this paper, we introduce and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including truthfulness, memorization, power-seeking, and more, demonstrating the promise of representation-centered transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.

Installation

To install repe from the main branch of the GitHub repository, run:

git clone https://github.com/andyzoujm/representation-engineering.git
cd representation-engineering
pip install -e .

Quickstart

Our RepReading and RepControl pipelines inherit the 🤗 Hugging Face pipelines for both classification and generation.

from transformers import pipeline
from repe import repe_pipeline_registry

# Register the 'rep-reading' and 'rep-control' tasks with Hugging Face pipelines
repe_pipeline_registry()

# ... initializing model and tokenizer ....

rep_reading_pipeline = pipeline("rep-reading", model=model, tokenizer=tokenizer)
# control_kwargs is not defined in this snippet; supply the rep-control options for your setup
rep_control_pipeline = pipeline("rep-control", model=model, tokenizer=tokenizer, **control_kwargs)
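
For intuition about what reading and control mean at the level of hidden states, here is a toy sketch in plain NumPy. It is not the repe implementation and does not use the pipelines above. It assumes that hidden states for contrastive stimulus pairs are already available as arrays, takes the top right singular vector of their (uncentered) differences as a reading direction, scores states by projection onto that direction, and steers by adding a multiple of it. The function names, the synthetic data, and the choice of `alpha` are all hypothetical.

```python
import numpy as np


def reading_direction(pos_states, neg_states):
    """Unit vector along the dominant direction of the pairwise differences."""
    diffs = np.asarray(pos_states) - np.asarray(neg_states)  # (n_pairs, hidden_dim)
    # Top right singular vector of the uncentered differences (a PCA-like step).
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    direction = vt[0] / np.linalg.norm(vt[0])
    # The sign of a singular vector is arbitrary; orient it so that
    # positive stimuli score higher than negative ones on average.
    if (diffs @ direction).mean() < 0:
        direction = -direction
    return direction


def read_score(hidden_states, direction):
    """Reading: project each hidden state onto the direction."""
    return np.asarray(hidden_states) @ direction


def control(hidden_states, direction, alpha):
    """Control: shift each hidden state along the direction by alpha."""
    return np.asarray(hidden_states) + alpha * direction


# Synthetic stand-in for model activations (assumption: a real run would
# collect these from a chosen layer of the language model).
rng = np.random.default_rng(0)
hidden_dim, n_pairs = 16, 64
concept = rng.normal(size=hidden_dim)
concept /= np.linalg.norm(concept)
base = rng.normal(size=(n_pairs, hidden_dim))
pos = base + 2.0 * concept + 0.1 * rng.normal(size=base.shape)
neg = base - 2.0 * concept + 0.1 * rng.normal(size=base.shape)

direction = reading_direction(pos, neg)
steered = control(neg, direction, alpha=4.0)

print("cosine with planted concept:", float(direction @ concept))
print("mean score, positive stimuli:", float(read_score(pos, direction).mean()))
print("mean score, negative stimuli:", float(read_score(neg, direction).mean()))
print("mean score, steered negatives:", float(read_score(steered, direction).mean()))
```

Because the direction has unit norm, steering by `alpha` raises every projection score by exactly `alpha`, which makes the effect of the toy control step easy to check.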

RepReading and RepControl Experiments

Check out the example frontiers of Representation Engineering (RepE) in the examples directory, which contain both RepControl and RepReading implementations. We welcome community contributions as well!

RepE_eval

We also release RepE_eval, a language model evaluation framework based on RepReading that can serve as an additional baseline alongside zero-shot and few-shot evaluation on standard benchmarks. Please check out our paper for more details.
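
The paper describes the actual procedure. As a rough illustration of how a reading-based score could act as a benchmark baseline, the sketch below assumes that each multiple-choice question provides one representation vector per candidate answer and that a reading direction is already available. It picks the candidate with the highest projection and reports accuracy. The function names and the toy data are hypothetical and are not part of RepE_eval.

```python
import numpy as np


def pick_answer(choice_states, direction):
    """Index of the candidate whose representation projects highest onto direction."""
    scores = np.asarray(choice_states) @ np.asarray(direction)
    return int(np.argmax(scores))


def accuracy(questions, labels, direction):
    """Fraction of questions where the top-scoring candidate matches the label."""
    predictions = [pick_answer(choices, direction) for choices in questions]
    return float(np.mean(np.array(predictions) == np.array(labels)))


# Toy data: two questions, two candidates each, 2-dimensional representations.
toy_direction = np.array([1.0, 0.0])
toy_questions = [
    [[0.1, 5.0], [0.9, -3.0]],
    [[2.0, 0.0], [1.0, 9.0]],
]
toy_labels = [1, 0]

print("toy accuracy:", accuracy(toy_questions, toy_labels, toy_direction))
```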

Citation

If you find this useful in your research, please consider citing:

@misc{zou2023transparency,
      title={Representation Engineering: A Top-Down Approach to AI Transparency}, 
      author={Andy Zou and Long Phan and Sarah Chen and James Campbell and Phillip Guo and Richard Ren and Alexander Pan and Xuwang Yin and Mantas Mazeika and Ann-Kathrin Dombrowski and Shashwat Goel and Nathaniel Li and Michael J. Byun and Zifan Wang and Alex Mallen and Steven Basart and Sanmi Koyejo and Dawn Song and Matt Fredrikson and Zico Kolter and Dan Hendrycks},
      year={2023},
      eprint={2310.01405},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
