GitHub - Butanium/self-obf-interp

All files in the root folder are mine. The notebook_results folder contains my plots, and the data folder contains my custom Gemma chat templates. The rest comes from a version of https://github.com/mgm52/self-obfuscation.

To run the code you'll need access to private models from Max, so reach out if you want to run it :)

Setup

git config --global credential.helper store
pip install huggingface_hub uv
huggingface-cli login --token <YOUR_ACCESS_TOKEN_WITH_GEMMA_AND_MAX_REPO> --add-to-git-credential
apt update && apt install git-lfs -y
git lfs install
git-lfs clone https://huggingface.co/self-obfuscation-model-viewers/self-obf-files
uv sync
cd self-obf-files
git switch models-only
cd ..
uv run python -c "from transformers import AutoModelForCausalLM; AutoModelForCausalLM.from_pretrained('google/gemma-2-9b-it')"  # optional: downloads gemma-2-9b-it if it is not already in your cache
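
If the clone step ran before `git lfs install` took effect, the large files in self-obf-files can be left as small Git LFS pointer stubs instead of real weights. The following is a minimal pre-flight sketch, not part of this repo: the function names `find_unfetched_lfs_files` and `preflight` are hypothetical, and it assumes the clone landed in ./self-obf-files as in the commands above. It relies only on the fact that an LFS pointer file starts with the line `version https://git-lfs.github.com/spec/v1`.

```python
import shutil
from pathlib import Path

# Every Git LFS pointer stub begins with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_unfetched_lfs_files(repo_dir):
    """Return files under repo_dir that are still Git LFS pointer stubs."""
    repo = Path(repo_dir)
    stubs = []
    for path in repo.rglob("*"):
        # Skip git internals and anything that is not a regular file.
        if ".git" in path.relative_to(repo).parts or not path.is_file():
            continue
        with path.open("rb") as fh:
            head = fh.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            stubs.append(path)
    return sorted(stubs)


def preflight(repo_dir="self-obf-files"):
    """Summarize whether the setup steps appear to have worked.

    The default directory name is an assumption taken from the clone URL.
    """
    repo = Path(repo_dir)
    exists = repo.is_dir()
    stubs = find_unfetched_lfs_files(repo) if exists else []
    return {
        "git_lfs_on_path": shutil.which("git-lfs") is not None,
        "clone_exists": exists,
        "unfetched_lfs_files": [str(p) for p in stubs],
    }


if __name__ == "__main__":
    for key, value in preflight().items():
        print(f"{key}: {value}")
```

If `unfetched_lfs_files` is non-empty, running `git lfs pull` inside self-obf-files (after switching to the models-only branch) should replace the stubs with the actual files.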

File Descriptions

probe_experiments_notebook.ipynb: Jupyter notebook that runs the probing experiments used to analyze self-obfuscation in language models. Includes layer stitching experiments, representation analysis, and visualization.
data_utils.py: Utilities for loading and processing test data for self-obfuscation experiments, including functions to create variations of examples with different adversarial words.
utils.py: General utility functions for self-obfuscation experiments, including functions for patching model activations, extracting token positions, and layer stitching.
visualization.py: Extensive visualization utilities for plotting experimental results, including layer sweep visualization, representation norm plots, KL divergence visualization, and token probability visualizations.
probes.py: Functions for loading and evaluating probes to detect specific concepts in model representations.
models.py: Model loading and patching utilities, including functions to load checkpoints, stitch model layers, and extract model representations.
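
To make the layer stitching and KL divergence terms above concrete, here is a toy sketch in pure Python. It is not the code in utils.py or models.py: it assumes that "layer stitching" means running the first k layers of one model followed by the remaining layers of another, and every name in it (`make_layer`, `stitch`, `stitch_sweep`) is hypothetical. Real experiments would operate on transformer blocks rather than these scalar toy layers.

```python
import math


def make_layer(scale, shift):
    """Toy stand-in for a model layer: elementwise affine map, then tanh."""
    def layer(hidden):
        return [math.tanh(scale * h + shift) for h in hidden]
    return layer


def stitch(layers_a, layers_b, stitch_at):
    """Take layers_a[:stitch_at] followed by layers_b[stitch_at:]."""
    if len(layers_a) != len(layers_b):
        raise ValueError("both models need the same number of layers")
    if not 0 <= stitch_at <= len(layers_a):
        raise ValueError("stitch_at is out of range")
    return layers_a[:stitch_at] + layers_b[stitch_at:]


def run(layers, hidden):
    """Apply the layers in order to an input vector."""
    for layer in layers:
        hidden = layer(hidden)
    return hidden


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def stitch_sweep(layers_a, layers_b, x):
    """KL(model A output || stitched output) for every stitch point."""
    reference = softmax(run(layers_a, x))
    results = []
    for k in range(len(layers_a) + 1):
        stitched_out = softmax(run(stitch(layers_a, layers_b, k), x))
        results.append(kl_divergence(reference, stitched_out))
    return results


if __name__ == "__main__":
    model_a = [make_layer(1.0, 0.0), make_layer(0.5, 0.1), make_layer(2.0, -0.3)]
    model_b = [make_layer(1.5, 0.2), make_layer(0.5, 0.1), make_layer(1.0, 0.4)]
    for k, kl in enumerate(stitch_sweep(model_a, model_b, [0.3, -1.2, 0.8])):
        print(f"first {k} layers from A, rest from B: KL = {kl:.6f}")
```

Sweeping the stitch point like this shows at which depth swapping in the other model's layers starts to change the output distribution. The last entry of the sweep uses only model A's layers, so its divergence from the reference is zero.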
