<a href="https://colab.research.google.com/github/kmeng01/rome/blob/main/notebooks/causal_trace.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" align="left"/></a>&nbsp;or in a local notebook.

In [None]:
%%bash
!(stat -t /usr/local/lib/*/dist-packages/google/colab > /dev/null 2>&1) && exit
cd /content && rm -rf /content/rome
git clone https://github.com/kmeng01/rome rome > install.log 2>&1
ln -s rome/dsets .
pip install -r /content/rome/scripts/colab_reqs/rome.txt >> install.log 2>&1

In [None]:
IS_COLAB = False
try:
    import google.colab, torch, sys
    if not torch.cuda.is_available():
        print("Change runtime type to include a GPU.")
    sys.path.append('/content/rome')
    IS_COLAB = True
except ImportError:
    pass

# Who does GPT-2 know?

The puzzle in this notebook: if we generate a random person name, can we tell whether it belongs to someone GPT-2 specifically knows about or to someone it doesn't?

Some ideas to try:

   * If a name is known, it will have lower perplexity than unknown names.  But I guess that a very common-sounding name of an unknown person might also have low perplexity.  Similarly a known person with an unusual name that is rarely mentioned could have high perplexity.
   * If a person is known, then paraphrased prompts asking about the person should produce similar information, rather than changing fundamental details.
   * If a person is known, then maybe Causal Tracing would reveal a retrieval of information that is specific to their name.
   * If a person is unknown, then maybe changing key facts in a sentence about them could be done without a major change in perplexity, while for a known person, such changes might be penalized strongly.
   * If a person is unknown, then similarly maybe changing their name in a sentence might not be penalized very strongly.
   * If a person is known, then genuinely true facts about the person should be generated, or at least scored higher than false assertions.
   * If a person is known, then maybe there will be large parameters in the model that mediate processing of the person's name.  (On the theory that regularizers will tend to shrink parameters for unseen cases.)
   * If a name is known, then given a distinguishing attribute about the person and a portion of their name, the model should be able to complete the rest of their name.



In [None]:
%load_ext autoreload
%autoreload 2

We begin by importing several utility functions that deal with tokens and transformer models.

In [None]:
import os, re, json
import torch, numpy
from collections import defaultdict
from util import nethook
from util.globals import DATA_DIR
from experiments.causal_trace import ModelAndTokenizer, layername, guess_subject, plot_trace_heatmap
from experiments.causal_trace import make_inputs, decode_tokens, find_token_range
from experiments.causal_trace import predict_token, predict_from_input
from dsets import KnownsDataset

Now we load a model and tokenizer.  We use gpt2-xl, but we can also load much larger models.

In [None]:
torch.set_grad_enabled(False)

model_name = r"gpt2-xl"

# Note that if you trace other models, you should set noise_level appropriately.
# (We use 0.03 for gpt-neox-20b and 0.025 for gpt-j-6b)
#model_name = r"EleutherAI/gpt-neox-20b"
#model_name = r"EleutherAI/gpt-j-6B"

torch_dtype = torch.float16 if '20b' in model_name else None

mt = ModelAndTokenizer(model_name, low_cpu_mem_usage=IS_COLAB, torch_dtype=torch_dtype)


## Estimating probability of a single name

One idea (see https://www.overleaf.com/read/shbgwqhkrgcd) is that a model's knowledge of a subject can be estimated by asking the model how likely it thinks the appearance of the subject is.  If it's more likely, then (maybe) the subject has been observed more frequently during training and so knowledge is grounded in observations.

So here we just feed a name to the model and ask for the estimated negative log probability of that name (summing the negative log probabilities of its tokens).  The lower the total, the more common the name.

In [None]:
inp = make_inputs(mt.tokenizer, ['<|endoftext|>LeBron James'])
out = mt.model(**inp)["logits"]
tokens = [mt.tokenizer.decode(t) for t in inp['input_ids'][0,1:]]
logprobs = -torch.gather(torch.log_softmax(out, -1)[:,:-1], 2, inp['input_ids'][:,1:,None])[0,:,0]
for t, lp in zip(tokens, logprobs):
    print(t, lp.item())
print('Total:', sum(logprobs).item())
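
The total printed above is just a sum of per-token negative log probabilities, which by the chain rule equals the negative log of the probability of the whole token sequence.  The sketch below shows that arithmetic in plain Python on made-up per-token probabilities (the numbers are illustrative, not GPT-2 outputs).  The per-token average and the perplexity are additions that this notebook does not use elsewhere; they are included only because longer names accumulate more terms in the sum.

```python
import math

def total_nll(token_probs):
    """Sum of -log p over tokens; equals -log of the product of the probabilities."""
    return sum(-math.log(p) for p in token_probs)

def mean_nll(token_probs):
    """Length-normalized score (an assumption, not used in this notebook)."""
    return total_nll(token_probs) / len(token_probs)

# Hypothetical conditional probabilities for a two-token name.
toy_probs = [0.5, 0.25]

total = total_nll(toy_probs)
print(total)                           # same value as -log(0.5 * 0.25)
print(math.exp(mean_nll(toy_probs)))   # perplexity of the toy name
```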

Instead of scoring a raw name, I found it works a little better to prefix the name with some text that might be used to introduce a webpage about a real person, using the word "About", as in: "About LeBron James".

This function grabs the negative log probability of a piece of text after this prefix.

In [None]:
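def logprob_of(text):
    # Negative log probability of `text` following the prefix "About".
    # prefix_size counts the prefix tokens; this assumes "About" is one token.
    prefix_size = 1
    inp = make_inputs(mt.tokenizer, ['About ' + text])
    out = mt.model(**inp)["logits"]
    #tokens = [mt.tokenizer.decode(t) for t in inp['input_ids'][0,prefix_size:]]
    # Logits at position i predict token i + 1, so the targets starting at
    # prefix_size are scored by positions prefix_size - 1 through the second-to-last.
    logprobs = -torch.gather(torch.log_softmax(out, -1)[:,prefix_size - 1:-1],
                             2, inp['input_ids'][:,prefix_size:,None])[0,:,0]
    #for t, lp in zip(tokens, logprobs):
    #    print(t, lp.item())
    return sum(logprobs).item()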


## Generating candidate names

Now, based on a list of 200 of the most common U.S. first names and 100 of the most common U.S. last names, I generate all 20,000 distinct first/last combinations.

Then I score them with the `logprob_of` function above.

In [None]:
boy_names = (
    'James Robert John Michael David William Richard Joseph Thomas Charles Christopher Daniel Matthew Anthony Mark Donald Steven Paul Andrew Joshua Kenneth Kevin Brian George Timothy Ronald Edward Jason Jeffrey Ryan Jacob Gary Nicholas Eric Jonathan Stephen Larry Justin Scott Brandon Benjamin Samuel Gregory Alexander Frank Patrick Raymond Jack Dennis Jerry Tyler Aaron Jose Adam Nathan Henry Douglas Zachary Peter Kyle Ethan Walter Noah Jeremy Christian Keith Roger Terry Gerald Harold Sean Austin Carl Arthur Lawrence Dylan Jesse Jordan Bryan Billy Joe Bruce Gabriel Logan Albert Willie Alan Juan Wayne Elijah Randy Roy Vincent Ralph Eugene Russell Bobby Mason Philip Louis'
).split()
girl_names = (
    'Mary Patricia Jennifer Linda Elizabeth Barbara Susan Jessica Sarah Karen Lisa Nancy Betty Margaret Sandra Ashley Kimberly Emily Donna Michelle Carol Amanda Dorothy Melissa Deborah Stephanie Rebecca Sharon Laura Cynthia Kathleen Amy Angela Shirley Anna Brenda Pamela Emma Nicole Helen Samantha Katherine Christine Debra Rachel Carolyn Janet Catherine Maria Heather Diane Ruth Julie Olivia Joyce Virginia Victoria Kelly Lauren Christina Joan Evelyn Judith Megan Andrea Cheryl Hannah Jacqueline Martha Gloria Teresa Ann Sara Madison Frances Kathryn Janice Jean Abigail Alice Julia Judy Sophia Grace Denise Amber Doris Marilyn Danielle Beverly Isabella Theresa Diana Natalie Brittany Charlotte Marie Kayla Alexis Lori'
).split()

first_names = boy_names + girl_names
last_names = (
    'Smith Johnson Williams Brown Jones Garcia Miller Davis Rodriguez Martinez Hernandez Lopez Gonzales Wilson Anderson Thomas Taylor Moore Jackson Martin Lee Perez Thompson White Harris Sanchez Clark Ramirez Lewis Robinson Walker Young Allen King Wright Scott Torres Nguyen Hill Flores Green Adams Nelson Baker Hall Rivera Campbell Mitchell Carter Roberts Gomez Phillips Evans Turner Diaz Parker Cruz Edwards Collins Reyes Stewart Morris Morales Murphy Cook Rogers Gutierrez Ortiz Morgan Cooper Peterson Bailey Reed Kelly Howard Ramos Kim Cox Ward Richardson Watson Brooks Chavez Wood James Bennett Gray Mendoza Ruiz Hughes Price Alvarez Castillo Sanders Patel Myers Long Ross Foster Jimenez'
).split()

results = []
for f in first_names:
    for l in last_names:
        n = f'{f} {l}'
        p = logprob_of(n)
        results.append((p, n))
    # Progress indicator: print the last name scored for each first name.
    print(p, n)

If we sort names by the log prob, then several well-known names rise to the top: for example famous author Stephen King, famous tastemaker Martha Stewart, famous rock star Michael Jackson.  Gary Johnson was the Libertarian nominee for U.S. president a few years ago; Russell Wilson is a football player...  Walter White is a fictional character on a popular TV show.

In [None]:
known = sorted(results)  # ascending negative log prob: most likely names first
print(len(known))
print(known[:10])
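
Since `results` holds `(score, name)` tuples, `sorted` orders them by score first and breaks ties alphabetically by name.  When only the best few are needed, `heapq.nsmallest` from the standard library returns the same leading entries without sorting the whole list.  A minimal sketch on toy data: the first two scores are the ones quoted elsewhere in this notebook, and the third entry is invented for illustration.

```python
import heapq

toy_results = [
    (19.87, 'Beverly Cooper'),
    (13.76, 'Mark Allen'),
    (25.00, 'Made Up'),   # hypothetical entry, not a real score
]

# The two lowest negative log probabilities, lowest first.
top2 = heapq.nsmallest(2, toy_results)
print(top2)
```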

## Checking consistency of knowledge via prompts

Another approach is to test knowledge by getting the model to generate text about the person.  If slight changes in wording cause the model to predict different professions, then it might be a clue that the model doesn't know about the person.

Here are a few examples.

Here are a couple of interesting cases where the "consistency test" seems to contradict the "log prob" test.

   * Beverly Cooper does not score very well on log prob (19.87), but seems to be recognized as a Hollywood actress (which is true).  Does the model know who she is?
   * Conversely, "Mark Allen" scores very well on log prob (13.76), but the model doesn't recognize the athlete or actor by that name, switching between assertions that he is an author, photographer, or in the computer field.

In [None]:
templates = [
    '. {}, the well-known',
    '. {} is a professional',
    '. {} is the consummate',
    '. {} is a professional in the field of',
    #'. {} is known for work in the field of',
    #'. {} works in the domain of',
    #'{} works as a ',
]
for k in known:
    prompts = [t.format(k[1]) for t in templates]
    #print(prompts)
    #for t in templates:
    #x    print(t.format(k[1]))
    print(k, ' '.join([predict_token(mt, [p])[0] for p in prompts]))
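
The loop above prints the predicted next words side by side and leaves the judgment to the reader.  One way to turn that into a number, sketched here as an assumption rather than something this notebook computes, is the share of templates that agree with the most common prediction.  The helper name `consistency` and the prediction lists are hypothetical.

```python
from collections import Counter

def consistency(predictions):
    """Return (most common prediction, fraction of templates agreeing with it)."""
    counts = Counter(p.strip().lower() for p in predictions)
    top, n = counts.most_common(1)[0]
    return top, n / len(predictions)

# Made-up predictions for four templates, for illustration only.
stable = [' actress', ' actress', ' actress', ' actor']
unstable = [' author', ' photographer', ' computer', ' author']

print(consistency(stable))
print(consistency(unstable))
```

A score near 1.0 would suggest the templates agree; a low score would flag a name like the "Mark Allen" case, where the predicted field keeps changing.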

## Causal Tracing

A natural next test is to try examining Causal Traces to see if they can distinguish knowledge retrieval from guessing behavior.

The code below is taken from the causal tracing notebook from the ROME project.

In [None]:
def trace_with_patch(
    model,            # The model
    inp,              # A set of inputs
    states_to_patch,  # A list of (token index, layername) pairs to restore
    answers_t,        # Answer probabilities to collect
    tokens_to_mix,    # Range of tokens to corrupt (begin, end)
    noise=0.1,        # Level of noise to add
    trace_layers=None # List of traced outputs to return
):
    prng = numpy.random.RandomState(1)  # For reproducibility, use pseudorandom noise
    patch_spec = defaultdict(list)
    for t, l in states_to_patch:
        patch_spec[l].append(t)
    embed_layername = layername(model, 0, 'embed')
    
    def untuple(x):
        return x[0] if isinstance(x, tuple) else x

    # Define the model-patching rule.
    def patch_rep(x, layer):
        if layer == embed_layername:
            # If requested, we corrupt a range of token embeddings on batch items x[1:]
            if tokens_to_mix is not None:
                b, e = tokens_to_mix
                x[1:, b:e] += noise * torch.from_numpy(
                    prng.randn(x.shape[0] - 1, e - b, x.shape[2])
                ).to(x.device)
            return x
        if layer not in patch_spec:
            return x
        # If this layer is in the patch_spec, restore the uncorrupted hidden state
        # for selected tokens.
        h = untuple(x)
        for t in patch_spec[layer]:
            h[1:, t] = h[0, t]
        return x

    # With the patching rules defined, run the patched model in inference.
    additional_layers = [] if trace_layers is None else trace_layers
    with torch.no_grad(), nethook.TraceDict(
        model,
        [embed_layername] +
            list(patch_spec.keys()) + additional_layers,
        edit_output=patch_rep
    ) as td:
        outputs_exp = model(**inp)

    # We report softmax probabilities for the answers_t token predictions of interest.
    probs = torch.softmax(outputs_exp.logits[1:, -1, :], dim=1).mean(dim=0)[answers_t]

    # If tracing all layers, collect all activations together to return.
    if trace_layers is not None:
        all_traced = torch.stack(
            [untuple(td[layer].output).detach().cpu() for layer in trace_layers], dim=2)
        return probs, all_traced

    return probs

## Scanning all locations

A causal flow heatmap is created by repeating `trace_with_patch` at every individual hidden state, and measuring the impact of restoring state at each location.

The `calculate_hidden_flow` function does this loop.  It handles both the case of restoring a single hidden state, and also restoring MLP or attention states.  Because MLP and attention make small residual contributions, to observe a causal effect in those cases, we need to restore several layers of contributions at once, which is done by `trace_important_window`.
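
The restore-and-measure idea can be seen on a toy example with no neural network at all.  The sketch below is an illustration under invented assumptions, not the ROME implementation: a two-token "model" in which the second token adds the first token's state at every layer, so a corrupted first-token embedding leaks into the output layer by layer.  Restoring the clean first-token state after a given layer undoes only the leakage that has not yet happened, which is why earlier restoration has the larger effect in this toy.  The names `run_toy`, `clean_trace`, and `effects` are hypothetical.

```python
def run_toy(emb, num_layers=4, patch=None):
    """Toy two-token model.

    emb is (h0, h1): the state of token 0 (the "subject") and token 1.
    At each layer token 1 adds the state token 0 had on entering that layer.
    patch=(layer, value) overwrites token 0's output state at that layer.
    Returns token 1's final state and token 0's per-layer output states.
    """
    h0, h1 = emb
    trace = []
    for layer in range(num_layers):
        new_h1 = h1 + h0          # token 1 reads token 0
        new_h0 = h0               # token 0 is carried through unchanged
        if patch is not None and patch[0] == layer:
            new_h0 = patch[1]     # restore the clean state here
        h0, h1 = new_h0, new_h1
        trace.append(h0)
    return h1, trace

clean_out, clean_trace = run_toy((1.0, 0.0))   # clean run
corrupt_out, _ = run_toy((3.0, 0.0))           # corrupted subject embedding

# Restore token 0 at each layer in turn and see how far the output moves
# away from the corrupted value.
restored = [
    run_toy((3.0, 0.0), patch=(layer, clean_trace[layer]))[0]
    for layer in range(4)
]
effects = [corrupt_out - r for r in restored]
print(clean_out, corrupt_out, restored, effects)
```

In the real traces the measured quantity is the probability of the answer token rather than a raw state value, and restoring a late state can still matter a great deal; the toy only shows the mechanics of scanning one restore location at a time.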

In [None]:
def calculate_hidden_flow(
    mt, prompt, subject, samples=10, noise=0.1, window=10, kind=None
):
    """
    Runs causal tracing over every token/layer combination in the network
    and returns a dictionary numerically summarizing the results.
    """
    inp = make_inputs(mt.tokenizer, [prompt] * (samples + 1))
    with torch.no_grad():
        answer_t, base_score = [d[0] for d in predict_from_input(mt.model, inp)]
    [answer] = decode_tokens(mt.tokenizer, [answer_t])
    e_range = find_token_range(mt.tokenizer, inp["input_ids"][0], subject)
    low_score = trace_with_patch(mt.model, inp, [], answer_t, e_range,
            noise=noise).item()
    if not kind:
        differences = trace_important_states(
            mt.model, mt.num_layers, inp, e_range, answer_t, noise=noise
        )
    else:
        differences = trace_important_window(
            mt.model,
            mt.num_layers,
            inp,
            e_range,
            answer_t,
            noise=noise,
            window=window,
            kind=kind,
        )
    differences = differences.detach().cpu()
    return dict(
        scores=differences,
        low_score=low_score,
        high_score=base_score,
        input_ids=inp["input_ids"][0],
        input_tokens=decode_tokens(mt.tokenizer, inp["input_ids"][0]),
        subject_range=e_range,
        answer=answer,
        window=window,
        kind=kind or "",
    )

def trace_important_states(model, num_layers, inp, e_range, answer_t, noise=0.1):
    ntoks = inp["input_ids"].shape[1]
    table = []
    for tnum in range(ntoks):
        row = []
        for layer in range(0, num_layers):
            r = trace_with_patch(
                model,
                inp,
                [(tnum, layername(model, layer))],
                answer_t,
                tokens_to_mix=e_range,
                noise=noise,
            )
            row.append(r)
        table.append(torch.stack(row))
    return torch.stack(table)


def trace_important_window(
    model, num_layers, inp, e_range, answer_t, kind, window=10, noise=0.1
):
    ntoks = inp["input_ids"].shape[1]
    table = []
    for tnum in range(ntoks):
        row = []
        for layer in range(0, num_layers):
            layerlist = [
                (tnum, layername(model, L, kind))
                for L in range(
                    max(0, layer - window // 2), min(num_layers, layer - (-window // 2))
                )
            ]
            r = trace_with_patch(
                model, inp, layerlist, answer_t, tokens_to_mix=e_range, noise=noise
            )
            row.append(r)
        table.append(torch.stack(row))
    return torch.stack(table)
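
The range expression in `trace_important_window` is easy to misread, so here is the same arithmetic pulled out into a small standalone helper (the name `window_layers` is hypothetical).  `-window // 2` parses as `(-window) // 2`, which floors toward negative infinity, so subtracting it adds the ceiling of `window / 2`.  The window is therefore centered on `layer`, has exactly `window` layers when it fits, and is clipped at the first and last layers.

```python
def window_layers(layer, window, num_layers):
    """Layers restored together when tracing `layer` with the given window."""
    start = max(0, layer - window // 2)
    stop = min(num_layers, layer - (-window // 2))
    return list(range(start, stop))

# 48 is used here as an example layer count.
print(window_layers(10, 10, 48))   # full window in the middle of the stack
print(window_layers(0, 10, 48))    # clipped at the bottom
print(window_layers(47, 10, 48))   # clipped at the top
print(window_layers(10, 5, 48))    # odd window size
```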



## Plotting the results

The `plot_trace_heatmap` function draws the data on a heatmap.  That function is not shown here; it is in `experiments.causal_trace`.


In [None]:
def plot_hidden_flow(
    mt, prompt, subject=None, samples=10, noise=0.1, window=10, kind=None, modelname=None, savepdf=None
):
    if subject is None:
        subject = guess_subject(prompt)
    result = calculate_hidden_flow(
        mt, prompt, subject, samples=samples, noise=noise, window=window, kind=kind
    )
    plot_trace_heatmap(result, savepdf, modelname=modelname)
    
def plot_all_flow(mt, prompt, subject=None, noise=0.1, modelname=None):
    for kind in [None, "mlp", "attn"]:
        plot_hidden_flow(mt, prompt, subject, modelname=modelname, noise=noise, kind=kind)

## Causal traces of a few test cases

Martha Stewart seems to reveal very clearly defined knowledge retrieval based on her name.

On the other hand, Mark Allen does not at all.

Beverly Cooper seems to be weakly predicted as an actress mainly based on her first name.

In [None]:
plot_all_flow(mt, 'Martha Stewart, the well-known')
plot_all_flow(mt, 'Beverly Cooper, the well-known')
plot_all_flow(mt, 'Mark Allen, the well-known')