<a href="https://colab.research.google.com/github/gg2001/transformer-circuits/blob/master/puzzles/1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [Mech Interp Puzzle 1: Suspiciously Similar Embeddings in GPT-Neo](https://www.alignmentforum.org/posts/eLNo7b56kQQerCzp2/mech-interp-puzzle-1-suspiciously-similar-embeddings-in-gpt)

In [1]:
%pip install transformer_lens plotly nbformat

In [2]:
from tqdm import tqdm
from transformer_lens import HookedTransformer
import plotly.express as px
import torch
import heapq

In [3]:
model = HookedTransformer.from_pretrained("gpt-neo-small")

Loaded pretrained model gpt-neo-small into HookedTransformer


In [4]:
W_E = model.W_E  # token embedding matrix, [d_vocab, d_model]
W_E.requires_grad = False  # no gradients needed for this analysis

In [5]:
# subsample = torch.randperm(model.cfg.d_vocab)[:5000].to(model.cfg.device)
# W_E = model.W_E[subsample]  # Take a random subset of 5,000 for memory reasons

In [6]:
W_E_normed = W_E / W_E.norm(dim=-1, keepdim=True) # [d_vocab, d_model]
# cosine_sims = W_E_normed @ W_E_normed.T # [d_vocab, d_vocab]
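
Why normalizing first is enough: once every row has unit length, a plain matrix product gives all pairwise cosine similarities. The sketch below checks this on a small random matrix; `toy_W_E` is a hypothetical stand-in for the real `W_E`, not data from the model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
toy_W_E = torch.randn(6, 4)  # hypothetical stand-in: 6 "tokens", d_model = 4

# Normalize every row to unit length, as in the cell above.
toy_normed = toy_W_E / toy_W_E.norm(dim=-1, keepdim=True)

# With unit-length rows, the matrix product holds all pairwise cosine sims.
toy_sims = toy_normed @ toy_normed.T  # [6, 6]

# Cross-check against PyTorch's built-in cosine similarity (via broadcasting).
reference = F.cosine_similarity(toy_W_E[:, None, :], toy_W_E[None, :, :], dim=-1)
print(torch.allclose(toy_sims, reference, atol=1e-5))
```

On the full vocabulary the same product would be a `[d_vocab, d_vocab]` matrix, which is why the line above is commented out and the next cell works in batches.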

In [57]:
num_tokens = W_E_normed.size(0)
batch = 1000
top_k = 1000
heap = []
sum_cosine_sims = 0.0
count_cosine_sims = 0


for i in tqdm(range(0, num_tokens, batch)):
    # Rows are tokens i..i+batch-1; columns are tokens i+1..num_tokens-1.
    token_cosine_sims = W_E_normed[i:i+batch] @ W_E_normed[i + 1:].T
    # Mask entries with col < row so each unordered pair (token_0 < token_1) is kept exactly once.
    mask = torch.tril(torch.ones_like(token_cosine_sims, dtype=torch.bool), diagonal=-1)
    cosine_sims_top = token_cosine_sims.masked_fill(mask, float('-inf'))
    flattened = cosine_sims_top.view(-1)

    valid_mask = flattened != float('-inf')
    sum_cosine_sims += flattened[valid_mask].sum().item()
    count_cosine_sims += valid_mask.sum().item()

    # Guard against a final block with fewer than top_k entries.
    top_cosine_sims = torch.topk(flattened, min(top_k, flattened.numel()))
    # Recover (row, col) from the flat index, then shift to global token ids.
    top_token1 = top_cosine_sims.indices // (num_tokens - i - 1) + i
    top_token2 = top_cosine_sims.indices % (num_tokens - i - 1) + i + 1

    # .tolist() puts plain floats on the heap instead of 0-dim tensors.
    for token_0, token_1, cosine_sim in zip(top_token1.tolist(), top_token2.tolist(), top_cosine_sims.values.tolist()):
        if len(heap) < top_k:
            heapq.heappush(heap, (cosine_sim, token_0, token_1))
        elif cosine_sim > heap[0][0]:
            heapq.heappushpop(heap, (cosine_sim, token_0, token_1))


sorted_heap = sorted(heap, key=lambda x: x[0], reverse=True)
mean = sum_cosine_sims / count_cosine_sims
print(f"Mean cosine similarity: {mean} {count_cosine_sims}")

100%|██████████| 51/51 [00:05<00:00, 10.06it/s]


Mean cosine similarity: 0.9293529306070797 1262857896
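
A quick sanity check on the index math in the loop above. This is a pure-Python sketch that replays the same block shapes, mask, and flat-index decoding on a tiny vocabulary; `decode_pairs` is a hypothetical helper written for this check, and the sizes 23 and 5 are arbitrary. It should visit every pair with `token_0 < token_1` exactly once.

```python
def decode_pairs(num_tokens: int, batch: int):
    """Yield (token_0, token_1) for every unmasked entry, mirroring the loop's index math."""
    for i in range(0, num_tokens, batch):
        n_rows = min(batch, num_tokens - i)  # rows of W_E_normed[i:i+batch]
        n_cols = num_tokens - i - 1          # rows of W_E_normed[i + 1:]
        for flat in range(n_rows * n_cols):
            row, col = divmod(flat, n_cols)  # same as // and % on the flat index
            if col < row:                    # hidden by torch.tril(..., diagonal=-1)
                continue
            yield row + i, col + i + 1


pairs = list(decode_pairs(num_tokens=23, batch=5))
assert all(a < b for a, b in pairs)                   # never a token with itself
assert len(pairs) == len(set(pairs)) == 23 * 22 // 2  # each pair exactly once
print(len(pairs))  # → 253
```

The same formula, n(n-1)/2, gives 1262857896 for n = 50257, which matches the count printed above.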


In [50]:
for cosine_sim, token_0, token_1 in sorted_heap:
    print(f"Token 0: {model.tokenizer.convert_ids_to_tokens(int(token_0))}, Token 1: {model.tokenizer.convert_ids_to_tokens(int(token_1))}, Cosine Similarity: {cosine_sim}")

Token 0: TPPStreamerBot, Token 1: EStreamFrame, Cosine Similarity: 0.9999352097511292
Token 0: StreamerBot, Token 1: EStreamFrame, Cosine Similarity: 0.9999347925186157
Token 0: PsyNetMessage, Token 1: EStreamFrame, Cosine Similarity: 0.9999344944953918
Token 0: ÿ, Token 1: ĠRandomRedditor, Cosine Similarity: 0.999933660030365
Token 0: ö, Token 1: ÿ, Cosine Similarity: 0.9999334812164307
Token 0: ĠRandomRedditor, Token 1: EStreamFrame, Cosine Similarity: 0.9999333024024963
Token 0: ù, Token 1: ú, Cosine Similarity: 0.9999331831932068
Token 0: rawdownload, Token 1: TPPStreamerBot, Cosine Similarity: 0.9999327063560486
Token 0: ù, Token 1: ĠattRot, Cosine Similarity: 0.999932587146759
Token 0: ù, Token 1: EStreamFrame, Cosine Similarity: 0.9999322295188904
Token 0: embedreportprint, Token 1: EStreamFrame, Cosine Similarity: 0.9999321699142456
Token 0: ú, Token 1: PsyNetMessage, Cosine Similarity: 0.999932050704956
Token 0: ÿ, Token 1: EStreamFrame, Cosine Similarity: 0.999931812286377
To

In [None]:
# px.histogram(
#    cosine_sims.flatten().detach().cpu().numpy(),
#    title="Pairwise cosine sims of embedding",
#)
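
The histogram above needs the full `cosine_sims` matrix, which is roughly 50257² float32 values (about 10 GB). One way around that, sketched here under assumptions, is to accumulate bin counts block by block with `torch.histc`. `batched_cosine_histogram` is a hypothetical helper, and the toy input stands in for `W_E_normed`; like the commented-out version, it counts the diagonal and both triangles.

```python
import torch


def batched_cosine_histogram(normed: torch.Tensor, bins: int = 200, batch: int = 1000):
    """Accumulate a histogram of all pairwise cosine sims without building the full matrix."""
    counts = torch.zeros(bins, dtype=torch.float64)  # float64 keeps large counts exact
    for i in range(0, normed.size(0), batch):
        sims = normed[i:i + batch] @ normed.T        # one [<=batch, n] block of rows
        sims = sims.clamp(-1.0, 1.0).float().cpu()   # rounding can push values just past 1
        counts += torch.histc(sims, bins=bins, min=-1.0, max=1.0).double()
    edges = torch.linspace(-1.0, 1.0, bins + 1)
    return counts, edges


# Toy stand-in for W_E_normed (assumption: any row-normalized matrix works the same way).
torch.manual_seed(0)
toy = torch.randn(50, 8)
toy_normed = toy / toy.norm(dim=-1, keepdim=True)
counts, edges = batched_cosine_histogram(toy_normed, bins=20, batch=16)
print(int(counts.sum().item()))  # total number of entries counted

# Plotting sketch (not run here):
# px.bar(x=edges[:-1].numpy(), y=counts.numpy(), title="Pairwise cosine sims of embedding")
```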