<a target="_blank" href="https://colab.research.google.com/github/TransformerLensOrg/TransformerLens/blob/main/demos/SVD_Interpreter_Demo.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## TransformerLens SVD Interpreter Demo

A few months ago, a Conjecture post showed that the singular value decompositions of transformer weight matrices are [surprisingly interpretable](https://www.lesswrong.com/posts/mkbGjzxD8d8XqKHzA/the-singular-value-decompositions-of-transformer-weight#Directly_editing_SVD_representations), with the singular directions forming recognisable semantic clusters. This seemed like good functionality to add to TransformerLens, which is what the SVD Interpreter feature does. You pass it a model, the type of matrix you want, and the size of the results you want, and then plot the output using PySvelte. This demo shows how it's done.

How to use this notebook:

**Go to Runtime > Change Runtime Type and select GPU as the hardware accelerator.**

Tips for reading this Colab:

* You can run all this code for yourself!
* The graphs are interactive!
* Use the table of contents pane in the sidebar to navigate
* Collapse irrelevant sections with the dropdown arrows
* Search the page using the search in the sidebar, not CTRL+F

## Setup (Can be ignored)

In [None]:
# Janky code to do different setup when run in a Colab notebook vs VSCode
DEBUG_MODE = False
try:
    import google.colab
    IN_COLAB = True
    print("Running as a Colab notebook")
    %pip install git+https://github.com/JayBaileyCS/TransformerLens.git # TODO: Change!
    # Install Neel's personal plotting utils
    %pip install git+https://github.com/neelnanda-io/neel-plotly.git
    # Install another version of node that makes PySvelte work way faster
    !curl -fsSL https://deb.nodesource.com/setup_16.x | sudo -E bash -; sudo apt-get install -y nodejs
    %pip install git+https://github.com/neelnanda-io/PySvelte.git
    # Needed for PySvelte to work, v3 came out and broke things...
    %pip install typeguard==2.13.3
    %pip install typing-extensions
except ImportError:
    IN_COLAB = False
    print("Running as a Jupyter notebook - intended for development only!")
    from IPython import get_ipython

    ipython = get_ipython()
    # Code to automatically update the HookedTransformer code as it's edited without restarting the kernel
    ipython.run_line_magic("load_ext", "autoreload")
    ipython.run_line_magic("autoreload", "2")

In [2]:
# Plotly needs a different renderer for VSCode/Notebooks vs Colab argh
import plotly.io as pio

if IN_COLAB or not DEBUG_MODE:
    # Thanks to annoying rendering issues, Plotly graphics will either show up in colab OR Vscode depending on the renderer - this is bad for developing demos! Thus creating a debug mode.
    pio.renderers.default = "colab"
else:
    pio.renderers.default = "png"

In [3]:
import torch
import pysvelte
import numpy as np
import transformer_lens
import transformer_lens.utils as utils
from transformer_lens import HookedTransformer, SVDInterpreter

In [4]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"{device = }")

device = 'cuda'


## SVD Interpretation

The SVD Interpreter supports interpretation for three types of Transformer matrix:

* OV - The [output-value circuit](https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J#z=CLmGoD1pvjmsg0dPyL3wkuGS) of an attention head. (d_model x d_model) in size.
* w_in - The input weights of a layer's MLP block. (d_model x (4 x d_model)) in size.
* w_out - The output weights of a layer's MLP block. ((4 x d_model) x d_model) in size.
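The shapes of the three matrix types listed above can be sanity-checked on toy tensors. This is only a sketch with random NumPy arrays rather than real model weights; the names `W_V`, `W_O`, `W_in` and `W_out` and the toy dimensions are assumptions chosen for illustration.

```python
import numpy as np

# Toy dimensions, assumed purely for illustration (real models are much larger).
d_model, d_head = 16, 4
d_mlp = 4 * d_model

rng = np.random.default_rng(0)
W_V = rng.standard_normal((d_model, d_head))   # value projection of one attention head
W_O = rng.standard_normal((d_head, d_model))   # output projection of the same head
W_in = rng.standard_normal((d_model, d_mlp))   # MLP input weights
W_out = rng.standard_normal((d_mlp, d_model))  # MLP output weights

# The OV circuit is the product of the value and output projections.
W_OV = W_V @ W_O

shapes = {"OV": W_OV.shape, "w_in": W_in.shape, "w_out": W_out.shape}
ov_rank = int(np.linalg.matrix_rank(W_OV))
print(shapes)
print("OV rank:", ov_rank, "(bounded by d_head =", d_head, ")")
```

Although the OV matrix is (d_model x d_model), it is a product of two thin matrices, so its rank is at most d_head. Only that many singular directions can carry a non-zero singular value.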

The SVD interpreter handles everything behind the scenes, so you only need to pass in the model and the type of matrix you want. Let's give it a go!

We'll be passing in **fold_ln=False, center_writing_weights=False, and center_unembed=False** here to mimic the existing post as closely as possible, in order to demonstrate that this works (and to show the numerical instability that keeps it from working *completely*). You can do interpretability on the default model without these parameters, but you won't be able to replicate the same results. I haven't checked much to see how these settings affect the quality of the results, though the w_out results seemed to degrade greatly when center_unembed was True - this would be worth testing properly!

Replication with this type of analysis is inherently difficult, because linear dependence is numerically unstable. Very minor numerical changes (such as floating-point discrepancies) can alter the results slightly (see [this comment](https://www.lesswrong.com/posts/mkbGjzxD8d8XqKHzA/the-singular-value-decompositions-of-transformer-weight?commentId=4e8534hbyWCpZFgFD)). So don't worry if you don't get exactly the same results on different devices - this is, unfortunately, expected. Try to stick to the same device for all your experiments, and be sure to state which one you used when writing them up. (And if anyone has a more stable way to get these results, [let us know](https://github.com/TransformerLensOrg/TransformerLens/issues)!)
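To get a feel for how small numerical changes can move SVD results, here is a toy sketch in NumPy. It shows one way such sensitivity can arise (two nearly equal singular values) and is not necessarily the mechanism behind the discrepancies described in the linked comment. The matrix and the perturbation size are made up for illustration.

```python
import numpy as np

# Symmetric positive-definite matrix whose 2nd and 3rd singular values nearly coincide.
A = np.diag([2.0, 1.0 + 1e-10, 1.0])

# A tiny symmetric perturbation (size 1e-8) coupling the two near-degenerate directions.
E = np.zeros((3, 3))
E[1, 2] = E[2, 1] = 1e-8

_, S_a, Vh_a = np.linalg.svd(A)
_, S_b, Vh_b = np.linalg.svd(A + E)

# The singular values barely move under the perturbation...
max_value_shift = float(np.max(np.abs(S_a - S_b)))

# ...but the second singular direction rotates substantially.
# abs() removes the arbitrary sign of each singular vector before comparing.
overlap = float(abs(Vh_a[1] @ Vh_b[1]))

print("largest change in a singular value:", max_value_shift)
print("overlap between old and new 2nd direction:", overlap)
```

An overlap of 1 would mean the direction is unchanged. Here it falls well below that even though the perturbation is far smaller than any entry of the matrix, which is why comparing singular values is usually more robust than comparing individual singular vectors.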

In [10]:
model = HookedTransformer.from_pretrained("gpt2-medium", fold_ln=False, center_writing_weights=False, center_unembed=False)

Using pad_token, but it is not set yet.


Loaded pretrained model gpt2-medium into HookedTransformer


In [11]:
all_tokens = [model.to_str_tokens(np.array([i])) for i in range(model.cfg.d_vocab)]
all_tokens = [all_tokens[i][0] for i in range(model.cfg.d_vocab)]

# Utility function to plot values in the same style as the Conjecture post.
def plot_matrix(matrix, tokens, k=10, filter="topk"):
    # Use the tokens argument rather than silently relying on the global all_tokens.
    pysvelte.TopKTable(tokens=tokens, activations=matrix, obj_type="SVD direction", k=k, filter=filter).show()

In [12]:
svd_interpreter = SVDInterpreter(model)

ov = svd_interpreter.get_singular_vectors('OV', layer_index=22, head_index=10)
w_in = svd_interpreter.get_singular_vectors('w_in', layer_index=20)
w_out = svd_interpreter.get_singular_vectors('w_out', layer_index=16)

plot_matrix(ov, all_tokens)
plot_matrix(w_in, all_tokens)
plot_matrix(w_out, all_tokens)


To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
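As a rough picture of the kind of computation behind tables like these, the sketch below takes the SVD of a weight matrix, projects its leading right singular vectors through an unembedding matrix, and lists the highest-scoring tokens per direction. It uses random NumPy arrays and a fake vocabulary, and the helper `top_tokens_per_direction` is hypothetical. It follows the approach described in the Conjecture post and is not a copy of the SVDInterpreter implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (all hypothetical): random weights and a fake vocabulary.
d_model, d_head, d_vocab = 16, 4, 50
W_OV = rng.standard_normal((d_model, d_head)) @ rng.standard_normal((d_head, d_model))
W_U = rng.standard_normal((d_model, d_vocab))  # unembedding: residual stream -> token scores
vocab = [f"tok{i}" for i in range(d_vocab)]


def top_tokens_per_direction(weight, unembed, vocab, num_vectors=3, k=5):
    """Project the leading right singular vectors of `weight` onto the vocabulary."""
    # Rows of Vh are right singular vectors, ordered by decreasing singular value.
    _, _, Vh = np.linalg.svd(weight)
    results = []
    for i in range(num_vectors):
        scores = Vh[i] @ unembed        # one score per vocabulary entry
        top = np.argsort(-scores)[:k]   # indices of the k highest-scoring tokens
        results.append([vocab[j] for j in top])
    return results


directions = top_tokens_per_direction(W_OV, W_U, vocab)
for i, toks in enumerate(directions):
    print(f"direction {i}: {toks}")
```

Singular vectors are only defined up to sign, so the most negative tokens along a direction can be just as meaningful as the most positive ones. With random weights the token lists are of course meaningless; the interesting structure only appears with trained weights.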



Currently, this is the extent of our support for SVD interpretability. However, this is a very new idea, and we're excited to see how people use it! If you find an interesting use for this type of research that we don't cover, feel free to [open a ticket](https://github.com/TransformerLensOrg/TransformerLens/issues) or contact the code's author at jaybaileycs@gmail.com.

One thing I'd love to see, and that basically anyone who followed this demo could get started with (I'd consider it an **A-level problem** from Neel's [Concrete Open Problems sequence](https://www.lesswrong.com/s/yivyHaCAmMJ3CqSyj)), is an exploration of different combinations of model parameters (fold_ln, center_writing_weights, center_unembed) to see which ones lead to big changes in the interpretability of the SVD matrices.

Are these changes positive, or negative? Can you pick any set of parameters you want? Are different parameters more or less interpretable in general, or does it vary by head and layer? Can you get two different interpretations of the same head with different parameters? What else can you find? This is very low-hanging fruit that would be immediately tractable and immediately useful!
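A possible starting point for that experiment is to enumerate every combination of the three flags. The sketch below only builds the keyword-argument dictionaries; loading a model for each one (for example with `HookedTransformer.from_pretrained("gpt2-medium", **kwargs)`, as in the cell above) and comparing the resulting tables is left to you. The variable names are illustrative.

```python
from itertools import product

# The three loading flags discussed in this demo.
flags = ("fold_ln", "center_writing_weights", "center_unembed")

# All 2**3 = 8 combinations of False/True, as keyword-argument dictionaries.
configs = [dict(zip(flags, values)) for values in product((False, True), repeat=len(flags))]

for kwargs in configs:
    # Each dictionary could be unpacked into the model-loading call.
    print(kwargs)
```

The first entry, with all three flags set to False, matches the settings used in this demo, so it can serve as the baseline for the comparison.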