This repo contains my experiments aimed at better understanding recent advances in mechanistic interpretability research, with an emphasis on characterizing the stability of discovered circuits. Please see the drafts summarizing the findings.
Abstract
Mechanistic interpretability aims to explain what neural networks learn at the circuit level. So far, only a handful of circuits with compelling evidence have been discovered, and only in relatively small language models [1,2,3]. It remains unclear whether these circuits (a) persist after further training once they have formed, and (b) retain similar explanatory power in larger models or in models with different architectures. A recent case study reverse engineering the circuit behind the indirect object identification (IOI) task uncovered an interesting circuit with a fairly specialized division of labor [3]. While the authors performed extensive experiments and ablation analyses to validate the specific circuit components (i.e., specialized attention heads), how model performance varies under different perturbations remains underexplored. Here we study performance on the IOI task and its variants in the original GPT-2 small and in other models of equivalent size. This analysis makes some progress toward addressing open research questions 5.34 and 5.35 from the list curated by Neel Nanda [4].
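As a concrete starting point, the sketch below (a minimal illustrative example, not the exact evaluation code used in this repo) computes the standard IOI logit-difference metric from [3] on a single prompt using the TransformerLens library [5]. The prompt template and the names "Mary"/"John" follow the example in the IOI paper; the metric compares the logit assigned to the indirect object against the logit assigned to the repeated subject.

```python
import torch
from transformer_lens import HookedTransformer

# Load GPT-2 small via TransformerLens (the model studied in the IOI paper).
model = HookedTransformer.from_pretrained("gpt2")

# A standard IOI prompt: the model should predict the indirect object ("Mary"),
# not the repeated subject ("John").
prompt = "When Mary and John went to the store, John gave a drink to"
correct = model.to_single_token(" Mary")    # indirect object
incorrect = model.to_single_token(" John")  # subject

with torch.no_grad():
    logits = model(prompt, return_type="logits")  # shape: [batch, seq, d_vocab]

# Logit difference at the final position: higher means the model prefers the
# indirect object, which is the behavior attributed to the IOI circuit.
final_logits = logits[0, -1]
logit_diff = (final_logits[correct] - final_logits[incorrect]).item()
print(f"logit diff (Mary - John): {logit_diff:.3f}")
```

Averaging this metric over many name pairs and prompt templates, and repeating it for perturbed task variants and other similarly sized models, gives the kind of performance comparison described above.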
References
- [1] Elhage, Nelson, Neel Nanda, Catherine Olsson, et al. "A Mathematical Framework for Transformer Circuits." Transformer Circuits Thread (2021).
- [2] Nanda, Neel, et al. "Progress Measures for Grokking via Mechanistic Interpretability." arXiv:2301.05217 (2023).
- [3] Wang, Kevin, et al. "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small." arXiv:2211.00593 (2022).
- [4] Nanda, Neel. "200 Concrete Open Problems in Mechanistic Interpretability."
- [5] TransformerLens library.