annahdo

counterfactuals Public

Python 12 3

implementing_activation_steering Public

A collection of different ways to access and modify internal model activations in LLMs

Jupyter Notebook 11

exploring_directions Public

We find concept directions in hidden layers of an LLM and use them for classification, activation steering, and knowledge removal

Jupyter Notebook 2 1

pankessel/adv_explanation_ref Public

Reference implementation for "Explanations can be manipulated and geometry is to blame"

Python 36 12