# What I'm Doing

## 1. Prove necessity and sufficiency of Layer 37 chunks

### 1.1 Necessity Sweep

Ablate (zero-out) each block of Layer 37 separately:

- Whole residual-stream -> logits
- Each attention-head -> logits
- Each MLP -> logits

Record the logit difference between the superset answer and the top foil after each ablation.
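
A minimal sketch of the per-head ablation, assuming the attention block concatenates head outputs along the last dimension before a linear output projection (as in Hugging Face Llama-style attention, where that projection is called `o_proj`). The helper name `zero_head_pre_hook` is hypothetical, and the `nn.Linear` below is only a stand-in for the real projection.

```python
import torch
from torch import nn


def zero_head_pre_hook(head: int, n_heads: int):
    """Build a forward pre-hook that zeroes one head's slice of the projection input.

    Assumes the input is [batch, seq, n_heads * d_head] with heads laid out
    contiguously along the last dimension.
    """

    def hook(_module, args):
        (z,) = args
        batch, seq, d_model = z.shape
        z = z.clone().reshape(batch, seq, n_heads, d_model // n_heads)
        z[:, :, head, :] = 0.0  # remove this head's contribution
        return (z.reshape(batch, seq, d_model),)

    return hook


# Toy stand-in for an attention output projection (not the real model).
torch.manual_seed(0)
o_proj = nn.Linear(8, 8, bias=False)
z = torch.randn(1, 3, 8)

handle = o_proj.register_forward_pre_hook(zero_head_pre_hook(head=1, n_heads=4))
ablated_out = o_proj(z)
handle.remove()  # always detach the hook before the next run
```

Zeroing before the projection removes exactly one head's write to the residual stream, which is what the "each attention-head -> logits" row of the sweep needs.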

### 1.2 Sufficiency Sweep

Patch the same blocks from the clean run into the corrupt run while all other layers stay corrupted.

- If a single head/MLP patch restores the answer, that slice is sufficient.

### 1.3 Hydra Check

After ablating the critical slice, look for backup heads lighting up.
If backup appears, widen the slice; otherwise you've pinned the unique causal pathway.

Outcome: a shortlist of critical sub-modules inside Layer 37

## 2. Locate earlier retrieval heads feeding Layer 37

1. Craft one-name probes (e.g. "Brad Pitt is a...") and run path patching from each earlier head to the Layer 37 slice you just isolated.
2. Heads whose value vectors inject the correct profession label into that slice are retrieval heads.
3. Verify by ablating the retrieval head -> Layer 37 path; the superset answer should break.

Outcome: a two-hop path: Name tokens -> retrieval heads (layers ~ 10-25) -> Layer 37 aggregator.

## 3. Test the generality of the circuit

1. Dense grid dataset - generate hundreds of entity pairs spanning >= 10 professions and 5 super-classes (artist, athlete, scientist ...)
2. Repeat the same necessity/sufficiency tests.
    - If the same heads fail across the grid, you have a general superset circuit.
    - If the failure is class-specific, split the dataset and continue per class.
3. Negative controls - include pairs whose superset is ambiguous or undefined. The circuit should stay silent or route elsewhere.

Outcome: evidence that the circuit represents the logical rule "f(x)=superset(profession(x))" rather than memorized templates.
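
One way the grid and its negative controls could be generated is sketched below. The profession and super-class mappings are small illustrative assumptions built from names already used in this notebook; a real grid would cover at least 10 professions. `build_pair_grid` is a hypothetical helper, not part of the `mechinterp` repo.

```python
from itertools import combinations

# Illustrative mappings only (assumptions for this sketch).
PROFESSION = {
    "Brad Pitt": "actor",
    "Leonardo DiCaprio": "actor",
    "Stephen King": "writer",
    "Mark Twain": "writer",
    "Usain Bolt": "athlete",
    "Lionel Messi": "athlete",
    "Isaac Newton": "scientist",
    "Albert Einstein": "scientist",
}
SUPERCLASS = {
    "actor": "artist",
    "writer": "artist",
    "athlete": "athlete",
    "scientist": "scientist",
}


def build_pair_grid(profession, superclass):
    """Enumerate every entity pair and label it with the shared super-class.

    Pairs whose super-classes differ get superset=None and serve as
    negative controls.
    """
    rows = []
    for a, b in combinations(sorted(profession), 2):
        sup_a = superclass[profession[a]]
        sup_b = superclass[profession[b]]
        rows.append(
            {
                "entities": (a, b),
                "professions": (profession[a], profession[b]),
                "superset": sup_a if sup_a == sup_b else None,
            }
        )
    return rows


grid = build_pair_grid(PROFESSION, SUPERCLASS)
positives = [row for row in grid if row["superset"] is not None]
controls = [row for row in grid if row["superset"] is None]
```

Keeping the controls in the same structure means the necessity/sufficiency tests can run over `grid` unchanged and be split by `superset` afterwards.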

## 4. Open the black box of the Layer 37 aggregator

1. Linear sub-space probe - train a probe on the Layer 37 residual to predict the one-hot superset label.
   - Low-rank => likely a single direction encoding "artistness".
   - High-rank => multiple features; cluster them.
2. Feature patching - patch only the "artist" direction from clean -> corrupt; if that alone fixes outputs, you have a feature-level explanation.
3. Neuron search - run SAE or feature-visualization on the MLP to map specific neurons/features to superset classes.
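
The low-rank versus high-rank check in step 1 could be approximated as follows. This is a sketch under assumptions: it fits a closed-form least-squares probe rather than a trained one, it uses synthetic activations in place of the Layer 37 residual, and the relative threshold `rel_tol` is an arbitrary choice. `probe_effective_rank` is a hypothetical name.

```python
import torch


def probe_effective_rank(acts, labels, n_classes, rel_tol=0.2):
    """Fit a least-squares linear probe and count its significant singular values.

    acts:   [n_examples, d_model] activations (here synthetic)
    labels: [n_examples] integer class labels
    """
    X = acts - acts.mean(dim=0, keepdim=True)
    Y = torch.nn.functional.one_hot(labels, n_classes).to(acts.dtype)
    Y = Y - Y.mean(dim=0, keepdim=True)
    W = torch.linalg.lstsq(X, Y).solution  # [d_model, n_classes]
    svals = torch.linalg.svdvals(W)  # sorted in descending order
    rank = int((svals > rel_tol * svals[0]).sum())
    return rank, svals


# Synthetic stand-in: three classes determined by a single direction.
torch.manual_seed(0)
acts = torch.randn(5000, 16)
direction = torch.zeros(16)
direction[0] = 1.0
score = acts @ direction
labels = (score > -0.5).long() + (score > 0.5).long()

rank, svals = probe_effective_rank(acts, labels, n_classes=3)
```

An effective rank of 1 suggests the classes are read off one direction; a rank close to the number of classes points to several separate features worth clustering.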

Outcome: a concrete mechanistic story:

> "Head 15.2 retrieves *profession* for each name -> its value is added to residual -> MLP 37.4 projects those values onto an 'artistness' direction; if either value is non-zero the logit for artist is boosted."

## Practical guard-rails

- Keep prompts short (name pairs + query) to avoid unrelated context features.
- Match entity fame so retrieval confidence is uniform; otherwise uncertainty, not reasoning, may dominate activations.
- Log all logits not just differences; a large negative swing in foils can masquerade as a positive causal effect.
- Automate the sweeps--one DataFrame with columns: layer, head/MLP, metric-before, metric-after--and plot heatmaps.
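
A sketch of the sweep table from the last guard-rail, using pandas. The column names follow the bullet above; the numbers are made-up placeholders that only show the layout, not measured results, and `sweep_grid` is a hypothetical helper.

```python
import pandas as pd


def sweep_grid(records):
    """Turn per-component sweep records into a layer x component grid of metric changes."""
    df = pd.DataFrame(records)
    df["delta"] = df["metric_after"] - df["metric_before"]
    grid = df.pivot(index="layer", columns="component", values="delta")
    return df, grid


# Placeholder values for illustration only (not real measurements).
records = [
    {"layer": 36, "component": "head_0", "metric_before": 3.0, "metric_after": 2.75},
    {"layer": 36, "component": "head_1", "metric_before": 3.0, "metric_after": 3.0},
    {"layer": 36, "component": "mlp", "metric_before": 3.0, "metric_after": 2.5},
    {"layer": 37, "component": "head_0", "metric_before": 3.0, "metric_after": 0.5},
    {"layer": 37, "component": "head_1", "metric_before": 3.0, "metric_after": 2.75},
    {"layer": 37, "component": "mlp", "metric_before": 3.0, "metric_after": -2.0},
]

df, grid = sweep_grid(records)
```

The pivoted `grid` has one row per layer and one column per component, so it can be passed directly to a heatmap routine such as matplotlib's `imshow`.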


In [1]:
%load_ext autoreload
%autoreload 2

In [None]:
!git clone --progress -v https://github.com/giordanorogers/mechinterp

In [None]:
!pip install git+https://github.com/davidbau/baukit.git

In [None]:
import os
os.chdir("mechinterp")

In [None]:
!pip install -r requirements.txt

In [9]:
import os
import json
import sys

sys.path.append("../")

##################################################################
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
##################################################################

import logging
from src.utils import logging_utils
from src.utils import env_utils

logger = logging.getLogger(__name__)

logging.basicConfig(
    level=logging.DEBUG,
    format=logging_utils.DEFAULT_FORMAT,
    datefmt=logging_utils.DEFAULT_DATEFMT,
    stream=sys.stdout,
)

import torch
import transformers
transformers.logging.set_verbosity_error()

logger.info(f"{torch.__version__=}, {torch.version.cuda=}")
logger.info(
    f"{torch.cuda.is_available()=}, {torch.cuda.device_count()=}, {torch.cuda.get_device_name()=}"
)
logger.info(f"{transformers.__version__=}")

from src.utils.training_utils import get_device_map

model_key = "meta-llama/Llama-3.3-70B-Instruct"

device_map = get_device_map(model_key, 32, n_gpus=1)
print(device_map)

print(os.getcwd())

2025-07-02 10:05:23 __main__ INFO     torch.__version__='2.7.0', torch.version.cuda=None


AssertionError: Torch not compiled with CUDA enabled

In [11]:
from huggingface_hub import login

# Option A: Direct login
login()  # prompts for the access token; avoid hard-coding credentials in the notebook

2025-07-02 11:55:45 urllib3.connectionpool DEBUG    https://huggingface.co:443 "GET /api/whoami-v2 HTTP/1.1" 200 861


In [2]:
from src.models import ModelandTokenizer
from transformers import BitsAndBytesConfig
import torch

mt = ModelandTokenizer(
    model_key="gpt2",
    torch_dtype=torch.bfloat16,
    # device_map=device_map,
    device_map="auto",
    #quantization_config = BitsAndBytesConfig(
    #    load_in_4bit=True
    #    #load_in_8bit=True
    #)
)

env.yml not found in /Users/giordanorogers/Documents/Code/mechinterp!
Setting MODEL_ROOT="". Models will now be downloaded to conda env cache, if not already there
Other defaults are set to:
    DATA_DIR = "data"
    RESULTS_DIR = "results"
    HPARAMS_DIR = "hparams"
gpt2 not found in models/
If not found in cache, model will be downloaded from HuggingFace to cache directory


In [21]:
from src.probing.prompt import BiAssociationPrefix

prefix_generator_cls = BiAssociationPrefix

prefix_generator = prefix_generator_cls(
    filter_attributes=[
        #"nationality", 
        "profession", 
        "school"
    ],
    format = "_3"
)

prefix = prefix_generator.get_prefix()
print(prefix)

# Task: Find Common Attributes Between Two People
You will be given two people's names. Your job is to determine if they share ANY common attribute from the list below.

## Response Format:
- If you find a match: "Yes - [shared attribute] - [description of what they share]"
- If no match: "No - [Person 1] and [Person 2] have nothing in common"

## Attributes to Consider:
1. Same profession → "Yes - [profession] - they are both [profession]"
2. Same school → "Yes - [school] - they both graduated from [school]"

Q: Person Y and Person Z
A: No - Person Y and Person Z have nothing in common.

Q: Person W and Person X
A: No - Person W and Person X have nothing in common.

Q: Person C and Person D
A: Yes - Doctor - they are both doctors.

Q: Person E and Person F
A: Yes - Boston University - they both graduated from Boston University.

## Your turn, give your answer in a single line.


## Prompt Pair

Only the entity embeddings and their retrieval heads should differ, isolating weight-stored knowledge.

In [5]:
import json
from src.functional import generate_with_patch

clean_prompt = "Q: Pick the odd person out: Isaac Newton, Brad Pitt, Leonardo DiCaprio\nA:"
corrupt_prompt = "Q: Pick the odd person out: Isaac Newton, Albert Einstein, Leonardo DiCaprio\nA:"
print(json.dumps(
    generate_with_patch(
        mt=mt,
        inputs=clean_prompt,
        n_gen_per_prompt=1,
        do_sample=False,
        max_new_tokens=30
    ),
    indent=2,
))

[
  "Q: Pick the odd person out: Isaac Newton, Brad Pitt, Leonardo DiCaprio\nA: I don't know. I don't know. I don't know. I don't know. I don't know. I don't know."
]


In [6]:
from typing import List
from src.tokens import prepare_input, find_token_range
from src.probing.prompt import ProbingPrompt
from src.models import ModelandTokenizer

def find_token_range(string, substring, tokenizer, offset_mapping, **kwargs):
    """
    Return the (start, end) token indices, both inclusive, of the first
    occurrence of `substring` in `string`, or (None, None) if not found.
    """
    char_start = string.find(substring)
    if char_start == -1:
        return None, None
    char_end = char_start + len(substring) - 1

    token_start, token_end = None, None
    for index, (token_char_start, token_char_end) in enumerate(offset_mapping):
        if token_start is None:
            if token_char_start <= char_start and token_char_end >= char_start:
                token_start = index
        if token_end is None:
            if token_char_start <= char_end and token_char_end >= char_end:
                token_end = index
                break
    return (token_start, token_end)

def prepare_ooo_input(
    mt: ModelandTokenizer,
    entities: List[str],
    prefix: str = "Q: Pick the odd person out: ",
    suffix: str = "\nA:",
    return_offsets_mapping: bool = False,
) -> ProbingPrompt:
    prompt = f"{prefix}{(', ').join(entities)}{suffix}"

    tokenized = prepare_input(
        prompts=prompt,
        tokenizer=mt,
        return_offsets_mapping=True
    )
    offset_mapping = tokenized["offset_mapping"][0]

    entity_ranges = tuple(
        [
            find_token_range(
                string=prompt,
                substring=entity,
                tokenizer=mt,
                offset_mapping=offset_mapping,
            )
            for entity in entities
        ]
    )

    query_range = find_token_range(
        string=prompt,
        substring=suffix,
        tokenizer=mt,
        offset_mapping=offset_mapping
    )
    query_token_idx = query_range[1]

    tokenized = dict(
        input_ids=tokenized["input_ids"],
        attention_mask=tokenized["attention_mask"],
    )
    if return_offsets_mapping:
        tokenized["offset_mapping"] = [offset_mapping]

    return ProbingPrompt(
        prompt=prompt,
        entities=entities,
        model_key=mt.name.split("/")[-1],
        tokenized=tokenized,
        entity_ranges=entity_ranges,
        query_range=(query_token_idx, query_token_idx)
    )

In [7]:
prepare_ooo_input(
    mt,
    ["Isaac Newton", "Brad Pitt", "Leonardo DiCaprio"]
    #["Tim Ferris", "Lex Fridman", "Dwarkesh Patel"]
)

ProbingPrompt(prompt='Q: Pick the odd person out: Isaac Newton, Brad Pitt, Leonardo DiCaprio\nA:', entities=['Isaac Newton', 'Brad Pitt', 'Leonardo DiCaprio'], model_key='gpt2', tokenized={'input_ids': tensor([[   48,    25, 12346,   262,  5629,  1048,   503,    25, 19068, 17321,
            11,  8114, 10276,    11, 38083,  6031, 15610, 27250,   198,    32,
            25]], device='mps:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]],
       device='mps:0')}, entity_ranges=((8, 9), (11, 12), (14, 17)), query_range=(19, 19))

In [8]:
from typing import List
from src.models import ModelandTokenizer
from src.functional import predict_next_token
from src.probing.utils import get_lm_generated_answer

def get_odd_entity_out(
    mt: ModelandTokenizer,
    entities: List[str],
    prefix = "Q: Pick the odd person out: ",
    suffix = "\nA:",
    return_next_token_probs = True,
    return_interesting_logits = True
):
    ooo_prompt = prepare_ooo_input(
        mt=mt,
        entities=entities,
        prefix=prefix,
        suffix=suffix,
    )

    answer = get_lm_generated_answer(
        mt,
        prompt=ooo_prompt
    )
    answer = answer.split("\n")[0]

    if return_next_token_probs:
        # First token of each name; stays None when not requested
        # (assumes predict_next_token accepts preds_of_interest=None).
        first_toks = None
        if return_interesting_logits:
            entity_toks = [mt.tokenizer.encode(entity) for entity in entities]
            first_toks = [name_toks[0] for name_toks in entity_toks]
        return answer, predict_next_token(
            mt=mt, inputs=ooo_prompt.prompt, k=5, preds_of_interest=first_toks
        )
    return answer

In [9]:
ents = ["Isaac Newton", "Brad Pitt", "Leonardo DiCaprio"]

entity_toks = [mt.tokenizer.encode(entity) for entity in ents]

first_toks = [name_toks[0] for name_toks in entity_toks]

first_toks

[39443, 30805, 36185]

In [10]:
odd1out_dataset = {
    "actors_scientists": [
        {
            "entities": ["Isaac Newton", "Brad Pitt", "Leonardo DiCaprio"],
            "target": "Isaac Newton"
        },
        {
            "entities": ["Isaac Newton", "Albert Einstein", "Leonardo DiCaprio"],
            "target": "Leonardo DiCaprio"
        }
    ],
    "writers_athletes": [
        {
            "entities": ["Stephen King", "Mark Twain", "Usain Bolt"],
            "target": "Usain Bolt"
        },
        {
            "entities": ["Lionel Messi", "Mark Twain", "Usain Bolt"],
            "target": "Mark Twain"
        }
    ],
    "musicians_politician": [
        {
            "entities": ["Barack Obama", "Bob Dylan", "George Bush"],
            "target": "Bob Dylan"
        },
        {
            "entities": ["Barack Obama", "Bob Dylan", "John Lennon"],
            "target": "Barack Obama"
        }
    ]
}


In [28]:
import logging

logger = logging.getLogger(__name__)

limit = 100
results = {}

for category, examples in odd1out_dataset.items():
    logger.info("-" * 10 + f" {category} " + "-" * 10)
    targets = []
    predictions = []
    ooo_results = []

    # Evaluate at most `limit` examples per category.
    for ent_targ in examples[:limit]:
        query_entities = ent_targ["entities"]
        target = ent_targ["target"]

        answer, next_tok_probs = get_odd_entity_out(
            mt=mt,
            entities=query_entities,
        )

        next_tok_print = [str(pred) for pred in next_tok_probs[0]]
        print(f"{query_entities} => {target} | {next_tok_print}")

        ooo_results.append({
            "query_entities": query_entities,
            "target": target,
            "model_answer": answer,
            "next_tok_probs": next_tok_probs[0]
        })

        targets.append(target)

        # Top-k predictions expose `.token`; the trailing dict maps each
        # name's first token to its logit/prob stats.
        processed_tokens = []
        for item in next_tok_probs[0]:
            if hasattr(item, "token"):
                processed_tokens.append(item.token)
            elif isinstance(item, dict):
                processed_tokens.extend(item.keys())
        predictions.append(processed_tokens)

    results[category] = {
        "results": ooo_results
    }

['Isaac Newton', 'Brad Pitt', 'Leonardo DiCaprio'] => Isaac Newton | ['" I"[314] (p=0.047, logit=-149.000)', '" The"[383] (p=0.047, logit=-149.000)', '" It"[632] (p=0.017, logit=-150.000)', '" He"[679] (p=0.017, logit=-150.000)', '" You"[921] (p=0.017, logit=-150.000)', "{'Isa': {'logit': -160.0, 'prob': 7.911264106041926e-07, 'token_id': 39443}, 'Brad': {'logit': -160.0, 'prob': 7.911264106041926e-07, 'token_id': 30805}, 'Leon': {'logit': -160.0, 'prob': 7.911264106041926e-07, 'token_id': 36185}}"]
['Isaac Newton', 'Albert Einstein', 'Leonardo DiCaprio'] => Leonardo DiCaprio | ['" I"[314] (p=0.060, logit=-149.000)', '" The"[383] (p=0.022, logit=-150.000)', '" It"[632] (p=0.022, logit=-150.000)', '" You"[921] (p=0.022, logit=-150.000)', '" If"[1002] (p=0.022, logit=-150.000)', "{'Isa': {'logit': -160.0, 'prob': 9.936701417245786e-07, 'token_id': 39443}, 'Albert': {'logit': -161.0, 'prob': 3.6555076121658203e-07, 'token_id': 42590}, 'Leon': {'logit': -160.0, 'prob': 9.936701417245786e-0

In [27]:
predictions

[[' I', ' The', ' It', ' You', ' That', 'Bar', 'Bob', 'George'],
 [' I', ' The', ' It', ' He', ' You', 'Bar', 'Bob', 'John']]

- Find token indices for first names
    - Needs to get returned by `get_odd_entity_out` and added to the ooo_results
- Find logit for each of those tokens in the predictions


So now I need to find the token indices for all the first names.
Then I will check for each prediction what the logit is for the correct first name token.
And I will also check the difference between the first name token and the incorrect first name tokens.
The difference between the correct first name token and the highest incorrect first name token will be the margin.
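
The bookkeeping described above can be sketched directly on the dictionary of first-name tokens that appears at the end of each `next_tok_probs[0]` printout. The `"logit"` key is taken from that printed output; `first_token_margin` is a hypothetical helper, and the example values are copied from the second `actors_scientists` row.

```python
def first_token_margin(interest, correct_tok):
    """Logit of the correct first-name token minus the best wrong first-name token.

    interest: dict mapping first-token string -> {"logit": float, ...}
    """
    correct = interest[correct_tok]["logit"]
    rivals = [
        stats["logit"] for tok, stats in interest.items() if tok != correct_tok
    ]
    return correct - max(rivals)


# Values as printed for ['Isaac Newton', 'Albert Einstein', 'Leonardo DiCaprio'].
interest = {
    "Isa": {"logit": -160.0, "token_id": 39443},
    "Albert": {"logit": -161.0, "token_id": 42590},
    "Leon": {"logit": -160.0, "token_id": 36185},
}

margin = first_token_margin(interest, correct_tok="Leon")
```

A margin of zero here means the model (GPT-2 in this run) does not separate the target from its strongest rival at all.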

## 3. Define a metric

For any model output, grab the logit for the correct answer, subtract the logit for the main rival. That single number -- the logit difference -- tells you how confidently the model picks the right odd one out.

---

Margin metric -- logit of the correct name minus the highest logit among the two wrong names.
- Moves smoothly as you ablate or patch slices (helpful for localization)
- Tells how far model is from flipping
- Avoids picking arbitrary "rival" in advance

```python
def logit_margin(logits, correct_id, wrong_ids):
    wrong_max = logits[0, -1, wrong_ids].max()
    return (logits[0, -1, correct_id] - wrong_max).item()
```

- `wrong_ids` is a list of the two other name-token IDs.
- For every sweep you store that single float.
- When the margin goes from +3 to -2, you know the slice you just modified is pivotal.

Since names span more than one token:
- Compute the margin on the first token of each name (in the examples above the first tokens "Isa", "Brad", and "Leon" are all distinct).
- Or add the logits of both tokens for each name before taking the margin; results rarely differ.

## 4. Write three tiny hook helpers

- `ablate_layer`: zeros a slice
- `patch_layer`: pastes in stored activations
- `capture_layer`: copies activations into a list
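
A minimal sketch of the three helpers using PyTorch forward hooks. It assumes the hooked module returns a single tensor shaped `[batch, seq, hidden]`; many transformer blocks return a tuple instead, in which case the hook would need to unpack and repack it. The signatures below are assumptions, not the `mechinterp` repo's API.

```python
import torch
from torch import nn


def ablate_layer(module, positions):
    """Zero the module's output at the given sequence positions."""

    def hook(_module, _inputs, output):
        patched = output.clone()
        patched[:, positions, :] = 0.0
        return patched  # a non-None return value replaces the output

    return module.register_forward_hook(hook)


def patch_layer(module, stored, positions):
    """Overwrite the module's output at `positions` with stored activations."""

    def hook(_module, _inputs, output):
        patched = output.clone()
        patched[:, positions, :] = stored[:, positions, :]
        return patched

    return module.register_forward_hook(hook)


def capture_layer(module, store):
    """Append a detached copy of the module's output to `store`."""

    def hook(_module, _inputs, output):
        store.append(output.detach().clone())

    return module.register_forward_hook(hook)
```

Each helper returns the hook handle, so a sweep can call `handle.remove()` after every run and avoid stacking hooks across iterations.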

## 5. Run the model once on each prompt

Record the baseline logit differences.
These baselines show you the gap you'll try to destroy (necessity) or restore (sufficiency).

## 6. Necessity sweep

Loop over layers. At every layer, zero the residual stream only at the swapped-name position and rerun the clean prompt.
If the logit difference collapses at Layer 37, that layer is necessary.

## 7. Sufficiency sweep

First, save the clean activations for every layer at the swapped position.
Then run the corrupt prompt. One layer at a time, paste the clean activation back in.
The first layer that flips the answer back marks where the correct signal first becomes sufficient.
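
The sweep loop might look like the sketch below. The model is a toy stack of position-wise blocks, not the real transformer: because nothing mixes positions, the margin is read at the patched position and every layer trivially restores it. In a real model the metric would be read at the final token and only some layers would flip the answer. `run_with_cache` and `sufficiency_sweep` are hypothetical names.

```python
import torch
from torch import nn


def run_with_cache(model, blocks, x):
    """Run the model once and return (output, list of each block's output)."""
    cache = []
    handles = [
        b.register_forward_hook(lambda _m, _i, out: cache.append(out.detach().clone()))
        for b in blocks
    ]
    try:
        with torch.no_grad():
            out = model(x)
    finally:
        for h in handles:
            h.remove()
    return out, cache


def sufficiency_sweep(model, blocks, clean_x, corrupt_x, position, metric):
    """Patch one block at a time with its clean activation during a corrupt run."""
    _, clean_cache = run_with_cache(model, blocks, clean_x)
    scores = []
    for block, clean_act in zip(blocks, clean_cache):

        def hook(_m, _i, out, clean_act=clean_act):
            patched = out.clone()
            patched[:, position, :] = clean_act[:, position, :]
            return patched

        handle = block.register_forward_hook(hook)
        try:
            with torch.no_grad():
                scores.append(metric(model(corrupt_x)))
        finally:
            handle.remove()
    return scores


# Toy stand-in model and inputs (assumptions for illustration only).
torch.manual_seed(0)
blocks = [nn.Sequential(nn.Linear(8, 8), nn.Tanh()) for _ in range(3)]
model = nn.Sequential(*blocks, nn.Linear(8, 5))
swapped = 2
clean_x = torch.randn(1, 4, 8)
corrupt_x = clean_x.clone()
corrupt_x[:, swapped, :] = torch.randn(8)


def margin(logits):
    # Logit 0 plays the "correct" token; logits 1 and 2 play the two foils.
    return (logits[0, swapped, 0] - logits[0, swapped, 1:3].max()).item()


with torch.no_grad():
    clean_margin = margin(model(clean_x))
scores = sufficiency_sweep(model, blocks, clean_x, corrupt_x, swapped, margin)
```

Comparing each entry of `scores` against `clean_margin` and the unpatched corrupt margin gives the fraction of the gap that each layer restores.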

## 8. Pick critical layers

## 9. Zoom inside each critical layer

## 10. Path Patch

## 11. Backup-head test

## 12. Generalization Grid

## 13. Visualize

## 14. Write the report

## Coarse Sweep

Patch residual stream at the two name positions across layers.
These positions now contain all profession evidence. If patching either name from corrupt -> clean restores 