This repository tracks the latest research on representation engineering (RepE), originally introduced by Zou et al. (2023). The goal is to offer a comprehensive list of papers and resources relevant to the topic. Work that falls under the broader umbrella of representation engineering is also included.
Note
If your paper on representation engineering (or a related topic) is missing, or if you spot a mistake, a typo, or outdated information, please open an issue and I will address it as soon as possible.
If you want to add a new paper, feel free to either open an issue or create a pull request.
Also:
Important
Note that representation engineering is a relatively new framework, so the categorization below reflects my subjective understanding of the techniques. The first table includes work that explicitly uses the term "representation engineering." Other closely related work is grouped in the later tables.
If you disagree with the categorization or have suggestions for improvement, please let me know by opening an issue.
- Words in Motion: Representation Engineering for Motion Forecasting
- Author(s): Omer Sahin Tas, Royden Wagner
- Date: 2024-06
- Venue: -
- Code: GitHub
- Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets
- Author(s): Duanyu Feng, Bowen Qin, Chen Huang, Youcheng Huang, Zheng Zhang, Wenqiang Lei
- Date: 2024-06
- Venue: -
- Code: GitHub
- PaCE: Parsimonious Concept Engineering for Large Language Models
- Author(s): Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Darshan Thaker, Aditya Chattopadhyay, Chris Callison-Burch, René Vidal
- Date: 2024-06
- Venue: -
- Code: GitHub
- Improving Alignment and Robustness with Circuit Breakers
- Author(s): Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks
- Date: 2024-06
- Venue: -
- Code: GitHub
- ConTrans: Weak-to-Strong Alignment Engineering via Concept Transplantation
- Author(s): Weilong Dong, Xinwei Wu, Renren Jin, Shaoyang Xu, Deyi Xiong
- Date: 2024-05
- Venue: -
- Code: -
- Towards General Conceptual Model Editing via Adversarial Representation Engineering
- Author(s): Yihao Zhang, Zeming Wei, Jun Sun, Meng Sun
- Date: 2024-04
- Venue: -
- Code: GitHub
- Towards Uncovering How Large Language Model Works: An Explainability Perspective
- Author(s): Haiyan Zhao, Fan Yang, Bo Shen, Himabindu Lakkaraju, Mengnan Du
- Date: 2024-02
- Venue: -
- Code: -
- Tradeoffs Between Alignment and Helpfulness in Language Models
- Author(s): Yotam Wolf, Noam Wies, Dorin Shteyman, Binyamin Rothberg, Yoav Levine, Amnon Shashua
- Date: 2024-01
- Venue: -
- Code: -
- Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering
- Author(s): Tianlong Li, Shihan Dou, Wenhao Liu, Muling Wu, Changze Lv, Xiaoqing Zheng, Xuanjing Huang
- Date: 2024-01
- Venue: -
- Code: -
- Aligning Large Language Models with Human Preferences through Representation Engineering
- Author(s): Wenhao Liu, Xiaohua Wang, Muling Wu, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, Xuanjing Huang
- Date: 2023-12
- Venue: -
- Code: -
- Representation Engineering: A Top-Down Approach to AI Transparency
- Author(s): Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks
- Date: 2023-10
- Venue: -
- Code: GitHub
- Controlling Large Language Model Agents with Entropic Activation Steering
- Author(s): Nate Rahn, Pierluca D'Oro, Marc G. Bellemare
- Date: 2024-06
- Venue: -
- Code: -
- Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
- Author(s): Tianlong Wang, Xianfeng Jiao, Yifan He, Zhongzhi Chen, Yinghao Zhu, Xu Chu, Junyi Gao, Yasha Wang, Liantao Ma
- Date: 2024-06
- Venue: -
- Code: -
- Activation Steering for Robust Type Prediction in CodeLLMs
- Author(s): Francesca Lucchetti, Arjun Guha
- Date: 2024-04
- Venue: -
- Code: -
- Extending Activation Steering to Broad Skills and Multiple Behaviours
- Author(s): Teun van der Weij, Massimo Poesio, Nandi Schoots
- Date: 2024-03
- Venue: -
- Code: GitHub
- Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models
- Author(s): Chen Qian, Jie Zhang, Wei Yao, Dongrui Liu, Zhenfei Yin, Yu Qiao, Yong Liu, Jing Shao
- Date: 2024-02
- Venue: -
- Code: GitHub
- MiMiC: Minimally Modified Counterfactuals in the Representation Space
- Author(s): Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, Ponnurangam Kumaraguru
- Date: 2024-02
- Venue: -
- Code: -
- Investigating Bias Representations in Llama 2 Chat via Activation Steering
- Author(s): Dawn Lu, Nina Rimsky
- Date: 2024-02
- Venue: -
- Code: -
- InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
- Author(s): Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Ke Ren, Botian Jiang, Xipeng Qiu
- Date: 2024-01
- Venue: -
- Code: GitHub
- Steering Llama 2 via Contrastive Activation Addition
- Author(s): Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, Alexander Matt Turner
- Date: 2023-12
- Venue: -
- Code: GitHub
- Improving Activation Steering in Language Models with Mean-Centring
- Author(s): Ole Jorgensen, Dylan Cope, Nandi Schoots, Murray Shanahan
- Date: 2023-12
- Venue: -
- Code: -
- Backdoor Activation Attack: Attack Large Language Models using Activation Steering for Safety-Alignment
- Author(s): Haoran Wang, Kai Shu
- Date: 2023-11
- Venue: -
- Code: GitHub
- The Linear Representation Hypothesis and the Geometry of Large Language Models
- Author(s): Kiho Park, Yo Joong Choe, Victor Veitch
- Date: 2023-11
- Venue: -
- Code: GitHub
- Activation Addition: Steering Language Models Without Optimization
- Author(s): Alexander Matt Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, Monte MacDiarmid
- Date: 2023-08
- Venue: -
- Code: GitHub
- Extracting Latent Steering Vectors from Pretrained Language Models
- Author(s): Nishant Subramani, Nivedita Suresh, Matthew E. Peters
- Date: 2022-05
- Venue: -
- Code: GitHub
- Uncovering Safety Risks in Open-source LLMs through Concept Activation Vector
- Author(s): Zhihao Xu, Ruixuan Huang, Xiting Wang, Fangzhao Wu, Jing Yao, Xing Xie
- Date: 2024-04
- Venue: -
- Code: -
- Explaining Explainability: Understanding Concept Activation Vectors
- Author(s): Angus Nicolson, Lisa Schut, J. Alison Noble, Yarin Gal
- Date: 2024-04
- Venue: -
- Code: -
- Demystifying Embedding Spaces using Large Language Models
- Author(s): Guy Tennenholtz, Yinlam Chow, Chih-Wei Hsu, Jihwan Jeong, Lior Shani, Azamat Tulepbergenov, Deepak Ramachandran, Martin Mladenov, Craig Boutilier
- Date: 2023-10
- Venue: ICLR 2024
- Code: -
- The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
- Author(s): Samuel Marks, Max Tegmark
- Date: 2023-10
- Venue: -
- Code: GitHub
Other relevant posts: