# Eliciting latent knowledge - experiments

Alexander Cai, Gabriel Wu, Max Nadeau

Some experiments initially performed for [CS 229br Foundations of Deep Learning](https://boazbk.github.io/mltheoryseminar/) as taught in Spring 2023 at Harvard University by Boaz Barak.

This is a work-in-progress research draft to explore some properties of the "Contrast-Consistent Search" algorithm (and later algorithms)
for identifying a language model's internal representation of truth. This would be helpful in identifying model misbehaviours
or "eliciting latent knowledge" from intelligent models.
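
For orientation, the CCS objective from the original paper can be sketched in a few lines. For each contrast pair, a linear probe followed by a sigmoid assigns a probability to the "positive" and to the "negative" completion; the loss asks the two to be consistent (they should sum to one) and confident (they should not both sit at 0.5). This is a minimal pure-Python sketch under stated assumptions, not the `elk` implementation: the names `sigmoid` and `ccs_loss` are illustrative, and the activation normalization described in the paper is omitted.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def ccs_loss(pos_acts, neg_acts, weights, bias):
    """Average CCS loss over contrast pairs.

    pos_acts / neg_acts: lists of activation vectors (lists of floats) for the
    "positive" and "negative" completion of each statement.
    weights / bias: parameters of the linear probe (illustrative names).
    """
    total = 0.0
    for x_pos, x_neg in zip(pos_acts, neg_acts):
        p_pos = sigmoid(sum(w * x for w, x in zip(weights, x_pos)) + bias)
        p_neg = sigmoid(sum(w * x for w, x in zip(weights, x_neg)) + bias)
        # consistency: p(x+) should equal 1 - p(x-)
        consistency = (p_pos - (1.0 - p_neg)) ** 2
        # confidence: penalize the degenerate solution p(x+) = p(x-) = 0.5
        confidence = min(p_pos, p_neg) ** 2
        total += consistency + confidence
    return total / len(pos_acts)
```

A probe that ignores its input (all-zero weights, zero bias) is perfectly consistent but pays the confidence penalty, which is why both terms are needed.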

Our research questions:

- Does the "direction" discovered by CCS carry any semantic meaning outside its original setting (i.e. the residual stream on the final "positive" / "negative" token)?

See [adzcai/llama-ccs](https://github.com/adzcai/llama-ccs) for some preliminary experiments on the Meta LLaMA models.

Currently the [EleutherAI/elk](https://github.com/EleutherAI/elk) package must be installed in editable mode to download the associated prompt templates. Also note that it requires Python 3.10, which is not supported on Google Colab.

To do this, activate the desired environment, navigate to a convenient directory, and run the following.

In [None]:
# ! git clone https://github.com/EleutherAI/elk.git
# ! cd elk && pip install -qe .

In [None]:
# install the remaining requirements

# ! pip install -q \
#     circuitsvis \
#     plotly \
#     git+https://github.com/neelnanda-io/TransformerLens.git

## Resources

[EleutherAI/elk](https://github.com/EleutherAI/elk): Contains many further innovations on top of CCS. Very convenient tool for interacting with HF models and datasets.

[Discovering Latent Knowledge in Language Models Without Supervision](https://arxiv.org/abs/2212.03827): The original paper by Collin Burns, Haotian Ye, et al. that proposes "Contrast-Consistent Search" (CCS).
- [collin-burns/discovering_latent_knowledge](https://github.com/collin-burns/discovering_latent_knowledge): The corresponding repository.
  - This is claimed to be quite buggy. See [Bugs of the Initial Release of CCS](https://docs.google.com/document/d/16Q8ZJFloA-x2lR65hs80rbbjX70TteCSMhuDQGcC75Q/edit?usp=sharing) by Fabien Roger.
- [How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme](https://www.lesswrong.com/posts/L4anhrxjv8j2yRKKp/how-discovering-latent-knowledge-in-language-models-without)

[What Discovering Latent Knowledge Did and Did Not Find](https://www.lesswrong.com/posts/bWxNPMy5MhPnQTzKz/what-discovering-latent-knowledge-did-and-did-not-find-4): A writeup by Fabien Roger on takeaways from the original paper.

- [safer-ai/Exhaustive-CCS](https://github.com/safer-ai/Exhaustive-CCS): The corresponding repository. Similar to Collin Burns's but with fewer bugs.
- [Several experiments with CCS.](https://docs.google.com/document/d/1LCjjnUPN51gHl_rmCWEmmtbY-Wu1dixzOif14e-7i-U/edit)

## Getting started

In [1]:
import os
from pathlib import Path

cwd = Path(os.getcwd())
data_path = cwd / "data"
reporters_path = cwd / "reporters"

In [2]:
use_data_dir = True
"""Optionally store data in this folder instead of the default."""

if use_data_dir:
    data_path.mkdir(parents=True, exist_ok=True)
    os.environ["HF_HOME"] = data_path.as_posix()

In [8]:
datasets = ["super_glue boolq", "imdb"]

Here we elicit latent knowledge from the [Pythia](https://github.com/EleutherAI/pythia) model family from EleutherAI.
We used the _non-deduplicated_ version of the models as of 17 April 2023. We use the 1B and 1.4B parameter models.

These models are notable in that every model in the family is trained on the same data in the same order.
A [paper](https://arxiv.org/pdf/2304.01373.pdf) with detailed information about these models is also available.

These models are also available for use with [TransformerLens](https://github.com/neelnanda-io/TransformerLens/blob/main/transformer_lens/model_properties_table.md).

We use the [SuperGLUE (BoolQ)](https://huggingface.co/datasets/super_glue/viewer/boolq/test) dataset for a QA task and the [IMDB](https://huggingface.co/datasets/imdb) dataset for sentiment analysis.

We chose these datasets for preliminary analysis since they're simple archetypes for their respective tasks.

In [None]:
# ! elk elicit EleutherAI/pythia-1b 'super_glue boolq' --net ccs --out_dir 'reporters/ccs/pythia-1b/super_glue boolq'

# ! elk elicit EleutherAI/pythia-1b 'super_glue boolq' --net eigen --out_dir 'reporters/eigen/pythia-1b/super_glue boolq'

# ! elk elicit EleutherAI/pythia-1b imdb --net ccs --out_dir reporters/ccs/pythia-1b/imdb

# ! elk elicit EleutherAI/pythia-1b imdb --net eigen --out_dir reporters/eigen/pythia-1b/imdb

# ! elk elicit EleutherAI/pythia-1.4b 'super_glue boolq' --net ccs --out_dir reporters/ccs/pythia-1.4b/'super_glue boolq'

# ! elk elicit EleutherAI/pythia-1.4b 'super_glue boolq' --net eigen --out_dir reporters/eigen/pythia-1.4b/'super_glue boolq'

# ! elk elicit EleutherAI/pythia-1.4b imdb --net ccs --out_dir reporters/ccs/pythia-1.4b/imdb

# ! elk elicit EleutherAI/pythia-1.4b imdb --net eigen --out_dir reporters/eigen/pythia-1.4b/imdb
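
The eight commands above follow one pattern (model × dataset × probe type), so they can also be generated rather than typed out. The sketch below only builds the command strings, using the `elk elicit` arguments that appear in the cell above; the variable names and the helper `elicit_command` are illustrative. `shlex.join` takes care of quoting the dataset name that contains a space.

```python
import itertools
import shlex

elicit_models = ["EleutherAI/pythia-1b", "EleutherAI/pythia-1.4b"]
elicit_datasets = ["super_glue boolq", "imdb"]
elicit_nets = ["ccs", "eigen"]


def elicit_command(model: str, dataset: str, net: str) -> str:
    """Build one `elk elicit` command line, mirroring the layout used above."""
    model_name = model.split("/")[-1]  # e.g. "pythia-1b"
    out_dir = f"reporters/{net}/{model_name}/{dataset}"
    return shlex.join(
        ["elk", "elicit", model, dataset, "--net", net, "--out_dir", out_dir]
    )


commands = [
    elicit_command(model, dataset, net)
    for model, dataset, net in itertools.product(
        elicit_models, elicit_datasets, elicit_nets
    )
]

for command in commands:
    print(command)
```

Each string can then be pasted into a shell or passed to `subprocess.run(..., shell=True)`.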

## Load the learned directions

In [3]:
import torch

# disable gradients since we're not doing any training here
torch.set_grad_enabled(False)
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.cuda.device_count()


1

In [4]:
ccs_path = reporters_path / "ccs/pythia-1b/imdb"
list(ccs_path.iterdir())

[PosixPath('/data/projects/ccs/elk-experiments/reporters/ccs/pythia-1b/imdb/lr_models'),
 PosixPath('/data/projects/ccs/elk-experiments/reporters/ccs/pythia-1b/imdb/reporters'),
 PosixPath('/data/projects/ccs/elk-experiments/reporters/ccs/pythia-1b/imdb/cfg.yaml'),
 PosixPath('/data/projects/ccs/elk-experiments/reporters/ccs/pythia-1b/imdb/eval.csv'),
 PosixPath('/data/projects/ccs/elk-experiments/reporters/ccs/pythia-1b/imdb/fingerprints.yaml')]

In [5]:
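# iterdir() yields files in arbitrary order, so sort by layer index to make
# reporters[layer] match the model layer (assumes names like "layer_<n>.pt")
reporter_paths = sorted(
    (ccs_path / "reporters").iterdir(),
    key=lambda path: int(path.stem.split("_")[-1]),
)
reporters = [torch.load(path, map_location=device) for path in reporter_paths]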

## Load the model

We use the [TransformerLens](https://github.com/neelnanda-io/TransformerLens) library to interact with model internals.

The reference documentation can be found [here](https://neelnanda-io.github.io/TransformerLens/transformer_lens.html).

The [main tutorial](https://neelnanda.io/transformer-lens-demo) was very helpful in getting started with the library.

In [6]:
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("EleutherAI/pythia-1b").to(device)

Using pad_token, but it is not set yet.


Loaded pretrained model EleutherAI/pythia-1b into HookedTransformer
Moving model to device:  cuda


In [7]:
n_layers = model.cfg.n_layers
n_layers, len(reporters) == n_layers

(16, True)

## Load prompts

In the original DLK paper, the authors found that out of the models they tested,
the single decoder-only model (GPT-J-6B) performed the worst (measured in terms of accuracy across datasets).

Later on, researchers at EleutherAI found that this could be resolved by prompting the model using different prompt templates.
Their method leverages the fact that the truth of a given statement should be the same regardless of the prompt template chosen.

In [9]:
from elk.extraction.prompt_loading import load_prompts
import itertools
import pandas as pd

n_prompts = 12
batch_range = torch.arange(n_prompts)

prompt_datasets = dict()
for dataset in datasets:
    prompt_dataset = load_prompts(dataset, split_type="val")
    prompt_dataset = list(itertools.islice(prompt_dataset, n_prompts))
    prompt_datasets[dataset] = prompt_dataset
    del prompt_dataset

Found cached dataset super_glue (/data/projects/ccs/elk-experiments/data/datasets/super_glue/boolq/1.0.3/bb9675f958ebfee0d5d6dc5476fafe38c79123727a7258d515c450873dbdbbed)
100%|██████████| 3/3 [00:00<00:00, 1002.14it/s]
Loading cached shuffled indices for dataset at /data/projects/ccs/elk-experiments/data/datasets/super_glue/boolq/1.0.3/bb9675f958ebfee0d5d6dc5476fafe38c79123727a7258d515c450873dbdbbed/cache-0bfd301019086411.arrow
Loading cached shuffled indices for dataset at /data/projects/ccs/elk-experiments/data/datasets/super_glue/boolq/1.0.3/bb9675f958ebfee0d5d6dc5476fafe38c79123727a7258d515c450873dbdbbed/cache-1ad2eed97462bf18.arrow


Using 10 variants of each prompt


Found cached dataset imdb (/data/projects/ccs/elk-experiments/data/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0)
100%|██████████| 3/3 [00:00<00:00, 869.47it/s]
Loading cached shuffled indices for dataset at /data/projects/ccs/elk-experiments/data/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-c1eaa46e94dfbfd3.arrow
Loading cached shuffled indices for dataset at /data/projects/ccs/elk-experiments/data/datasets/imdb/plain_text/1.0.0/d613c88cf8fa3bab83b4ded3713f1f74830d1100e171db75bbddb80b3345c9c0/cache-9c48ce5d173413c7.arrow


Using 13 variants of each prompt


Each loaded prompt (each element of `prompt_dataset`) has the following structure:

```
{
    "label": 0 if correct answer is "negative", 1 if correct answer is "positive"
    "prompts": [
        [
            {
                "answer": "negative" or "bad" or ...
                "text": formatted prompt with the "negative" answer
            },
            {
                "answer": "positive" or "good" or ...
                "text": formatted prompt with the "positive" answer
            }
        ],
        ...
    ],
    "template_names": [
        template name for prompt 0,
        ...
    ]
}
```

Note that prompts vary along two binary axes:

1. Whether the statement ends in the "positive" answer or the "negative" answer;
2. Whether the statement is factually correct or incorrect.

It's important to distinguish these two axes; the goal of CCS is to uncover the latter.
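
To make the two axes concrete: given the structure above, where the contrast pair is ordered (negative, positive) and `label` is 0 or 1, a statement is factually correct exactly when the index of its answer matches the label. A small sketch (the helper name `statement_is_correct` is illustrative):

```python
def statement_is_correct(label: int, answer_index: int) -> bool:
    """answer_index is 0 for the "negative" ending and 1 for the "positive" ending."""
    return label == answer_index


# enumerate all four combinations of the two binary axes
for label in (0, 1):
    for answer_index, ending in enumerate(("negative", "positive")):
        print(
            f"label={label} ending={ending} "
            f"correct={statement_is_correct(label, answer_index)}"
        )
```

A probe that only tracked the first axis (which answer the prompt ends in) would output the same value for both rows with a given ending, whatever the label.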

In [12]:
dfs = dict()

for dataset, prompt_dataset in prompt_datasets.items():
    records = []

    for j, prompts in enumerate(prompt_dataset):
        for i, ((negative, positive), template_name) in enumerate(
            zip(prompts["prompts"], prompts["template_names"])
        ):
            records.append(
                dict(
                    negative_prompt=negative["text"],
                    positive_prompt=positive["text"],
                    negative_answer=negative["answer"],
                    positive_answer=positive["answer"],
                    label=prompts["label"],
                    incorrect_answer=negative["answer"]
                    if prompts["label"]
                    else positive["answer"],
                    correct_answer=positive["answer"]
                    if prompts["label"]
                    else negative["answer"],
                    template_name=template_name,
                    template_id=i,
                    prompt_id=j,
                )
            )

    # set a multi-index using the prompt_id and template_id
    dfs[dataset] = pd.DataFrame(records).set_index(["prompt_id", "template_id"])
    del dataset, records, prompt_dataset, prompts, negative, positive, template_name

In [13]:
dfs["super_glue boolq"].head()

| prompt_id | template_id | negative_prompt | positive_prompt | negative_answer | positive_answer | label | incorrect_answer | correct_answer | template_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Passage: The Ranch (TV series) -- The Ranch is... | Passage: The Ranch (TV series) -- The Ranch is... | False | True | 0 | True | False | after_reading |
| 0 | 1 | The Ranch (TV series) -- The Ranch is an Ameri... | The Ranch (TV series) -- The Ranch is an Ameri... | No | Yes | 0 | Yes | No | GPT-3 Style |
| 0 | 2 | The Ranch (TV series) -- The Ranch is an Ameri... | The Ranch (TV series) -- The Ranch is an Ameri... | No | Yes | 0 | Yes | No | I wonder… |
| 0 | 3 | Text: The Ranch (TV series) -- The Ranch is an... | Text: The Ranch (TV series) -- The Ranch is an... | No | Yes | 0 | Yes | No | yes_no_question |
| 0 | 4 | The Ranch (TV series) -- The Ranch is an Ameri... | The Ranch (TV series) -- The Ranch is an Ameri... | No | Yes | 0 | Yes | No | could you tell me… |


In [14]:
dfs["imdb"].head()

| prompt_id | template_id | negative_prompt | positive_prompt | negative_answer | positive_answer | label | incorrect_answer | correct_answer | template_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | The following movie review expresses what sent... | The following movie review expresses what sent... | negative | positive | 1 | negative | positive | Movie Expressed Sentiment 2 |
| 0 | 1 | `<br /><br />When I unsuspectedly rented A Thou...` | `<br /><br />When I unsuspectedly rented A Thou...` | bad | good | 1 | bad | good | Reviewer Opinion bad good choices |
| 0 | 2 | `<br /><br />When I unsuspectedly rented A Thou...` | `<br /><br />When I unsuspectedly rented A Thou...` | negative | positive | 1 | negative | positive | Sentiment with choices |
| 0 | 3 | `<br /><br />When I unsuspectedly rented A Thou...` | `<br /><br />When I unsuspectedly rented A Thou...` | negative | positive | 1 | negative | positive | Reviewer Sentiment Feeling |
| 0 | 4 | `<br /><br />When I unsuspectedly rented A Thou...` | `<br /><br />When I unsuspectedly rented A Thou...` | negative | positive | 1 | negative | positive | Writer Expressed Sentiment |


In [15]:
print(dfs["super_glue boolq"].at[(0, 0), "negative_prompt"])

Passage: The Ranch (TV series) -- The Ranch is an American comedy web television series starring Ashton Kutcher, Danny Masterson, Debra Winger, Elisha Cuthbert, and Sam Elliott that debuted in 2016 on Netflix. The show takes place on the fictional Iron River Ranch in the fictitious small town of Garrison, Colorado; detailing the life of the Bennetts, a dysfunctional family consisting of two brothers, their rancher father, and his divorced wife and local bar owner. While the opening sequence shows scenes from Norwood and Ouray, Colorado and surrounding Ouray and San Miguel Counties, The Ranch is filmed on a sound stage in front of a live audience in Burbank, California. Each season consists of 20 episodes broken up into two parts, each containing 10 episodes.

After reading this passage, I have a question: is garrison from the ranch a real place? True or False?

False


## Forward pass on the text

In [16]:
from jaxtyping import Float
from tqdm import tqdm
from transformer_lens.hook_points import HookPoint
import transformer_lens.utils as utils
from functools import partial

Here we fix a given template ID. This gives us a single fully-formatted prompt for each original text sample.

### Preprocessing prompts

In [18]:
template_id = 2

save_paths = {
    dataset: Path(f"results/ccs/pythia-1b/{dataset}/template-{template_id}")
    for dataset in datasets
}

prompts = {
    dataset: dfs[dataset].loc[pd.IndexSlice[:, template_id], :].reset_index(drop=True)
    for dataset in datasets
}

In [19]:
prompts["super_glue boolq"].head()

|   | negative_prompt | positive_prompt | negative_answer | positive_answer | label | incorrect_answer | correct_answer | template_name |
|---|---|---|---|---|---|---|---|---|
| 0 | The Ranch (TV series) -- The Ranch is an Ameri... | The Ranch (TV series) -- The Ranch is an Ameri... | No | Yes | 0 | Yes | No | I wonder… |
| 1 | Lanugo -- Lanugo (/ləˈnjuːɡoʊ/; from Latin lan... | Lanugo -- Lanugo (/ləˈnjuːɡoʊ/; from Latin lan... | No | Yes | 1 | No | Yes | I wonder… |
| 2 | Administrative law judge -- An administrative ... | Administrative law judge -- An administrative ... | No | Yes | 1 | No | Yes | I wonder… |
| 3 | Plants in space -- Plant research continued on... | Plants in space -- Plant research continued on... | No | Yes | 1 | No | Yes | I wonder… |
| 4 | HCF Health Insurance -- HCF (The Hospitals Con... | HCF Health Insurance -- HCF (The Hospitals Con... | No | Yes | 1 | No | Yes | I wonder… |


In [20]:
def get_value(column: str) -> dict:
    # collect one column per dataset as a plain Python list
    return {dataset: prompts[dataset][column].tolist() for dataset in datasets}

neg_prompts = get_value("negative_prompt")
pos_prompts = get_value("positive_prompt")
correct_answers = get_value("correct_answer")
incorrect_answers = get_value("incorrect_answer")

In [21]:
neg_tokens = {dataset: model.to_tokens(neg_prompts[dataset]) for dataset in datasets}
pos_tokens = {dataset: model.to_tokens(pos_prompts[dataset]) for dataset in datasets}

neg_str_tokens = {
    dataset: model.to_str_tokens(neg_prompts[dataset]) for dataset in datasets
}
pos_str_tokens = {
    dataset: model.to_str_tokens(pos_prompts[dataset]) for dataset in datasets
}

prompt_lengths = {
    dataset: torch.tensor([len(tokens) for tokens in neg_str_tokens[dataset]])
    for dataset in datasets
}

### Hooks for collecting activation values

In [22]:
prompt_lengths["super_glue boolq"]

tensor([184, 147,  75, 211, 111, 171, 120, 101, 171, 108, 128, 206])

In [23]:
# the negative and positive prompts match up until the very last token,
# so we just record one and then get only the last token of the other one
neg_results = {
    dataset: torch.zeros(
        (
            n_prompts,
            n_layers,
            prompt_lengths[dataset].max().item(),
        ),
        device="cpu",
    )
    for dataset in datasets
}
pos_results = {
    dataset: torch.zeros((n_prompts, n_layers), device="cpu")
    for dataset in datasets
}


def projection(
    resid_pre: Float[torch.Tensor, "batch pos d_model"],  # at a given layer
    hook: HookPoint,
    dataset: str,
    layer: int,
):
    # TODO should we be normalizing here?
    # since the prompts have different lengths, technically this is more computation than we need to do
    neg_results[dataset][:, layer, : resid_pre.shape[1]] = reporters[layer](
        resid_pre
    ).cpu()
    return resid_pre


def final_projection(
    resid_pre: Float[torch.Tensor, "batch pos d_model"],  # at a given layer
    hook: HookPoint,
    dataset: str,
    layer: int,
):
    # TODO should we be normalizing here?
    x = resid_pre[batch_range, prompt_lengths[dataset] - 1, :]
    pos_results[dataset][:, layer] = reporters[layer](x).cpu()
    return resid_pre

{ dataset: neg_results[dataset].shape for dataset in datasets }

{'super_glue boolq': torch.Size([12, 16, 211]),
 'imdb': torch.Size([12, 16, 424])}
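
The indexing in `final_projection`, `resid_pre[batch_range, prompt_lengths[dataset] - 1, :]`, picks out, for each prompt in the batch, the activation at its last real token rather than at the last (possibly padded) position. The same idea with plain Python lists, as a sketch that assumes right padding; the helper name `last_real_token` and the toy rows are illustrative:

```python
def last_real_token(padded_rows, lengths):
    """For each row, return the element at index length - 1.

    padded_rows: equal-length lists, padded on the right.
    lengths: number of real (non-padding) entries in each row.
    """
    return [row[length - 1] for row, length in zip(padded_rows, lengths)]


# toy batch: the second row is two tokens long and padded to length four
toy_rows = [
    ["The", "movie", "was", "good"],
    ["Great", "film", "<pad>", "<pad>"],
]
toy_lengths = [4, 2]

print(last_real_token(toy_rows, toy_lengths))
```

Indexing with `-1` instead would return `<pad>` for the shorter row, which is why the hooks carry `prompt_lengths` around.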

### Extraction loop

In [24]:
# should be quite fast
for dataset in datasets:
    for layer in tqdm(range(n_layers), desc=dataset):
        act_name = utils.get_act_name("resid_pre", layer)

        patch_hook_fn = partial(projection, dataset=dataset, layer=layer)
        model.run_with_hooks(
            neg_tokens[dataset], fwd_hooks=[(act_name, patch_hook_fn)]
        )

        patch_hook_fn = partial(final_projection, dataset=dataset, layer=layer)
        model.run_with_hooks(
            pos_tokens[dataset], fwd_hooks=[(act_name, patch_hook_fn)]
        )

super_glue boolq: 100%|██████████| 16/16 [00:08<00:00,  1.86it/s]
imdb: 100%|██████████| 16/16 [00:17<00:00,  1.09s/it]


In [25]:
for dataset in datasets:
    save_path = save_paths[dataset]
    save_path.mkdir(parents=True, exist_ok=True)
    torch.save(neg_results[dataset], save_path / "neg.pt")
    torch.save(pos_results[dataset], save_path / "pos.pt")

## Visualize outputs

In [26]:
import circuitsvis as cv
from circuitsvis.tokens import colored_tokens

In [27]:
neg_results = {
    dataset: torch.load(save_paths[dataset] / "neg.pt") for dataset in datasets
}
pos_results = {
    dataset: torch.load(save_paths[dataset] / "pos.pt") for dataset in datasets
}

projections = dict()
for dataset in datasets:
    projections[dataset] = torch.cat(
        [neg_results[dataset], torch.zeros((n_prompts, n_layers, 2))], dim=-1
    )
    projections[dataset][batch_range, :, prompt_lengths[dataset]] = pos_results[dataset]
    projections[dataset][batch_range, :, prompt_lengths[dataset] + 1] = 0  # for visualization

    # make the signs across layers consistent with the final token
    # since CCS only identifies the hyperplane up to sign
    projections[dataset] = pos_results[dataset].sign()[:, :, None] * projections[dataset]
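
The sign alignment in the last line of the cell above can be illustrated without torch. For every (prompt, layer) pair, all token projections are multiplied by the sign of that layer's value on the final "positive" token, so the final token always ends up non-negative. A minimal sketch with nested lists indexed as `[prompt][layer][position]`; the helper names and numbers are illustrative only:

```python
def sign(x: float) -> int:
    """Return -1, 0 or 1, matching the convention of torch.sign."""
    return (x > 0) - (x < 0)


def align_to_final_token(projections, final_values):
    """Flip each layer's projections so its final-token value is non-negative.

    projections: nested list indexed [prompt][layer][position].
    final_values: nested list indexed [prompt][layer].
    """
    return [
        [
            [sign(final) * value for value in layer_values]
            for layer_values, final in zip(prompt_layers, prompt_finals)
        ]
        for prompt_layers, prompt_finals in zip(projections, final_values)
    ]


# one prompt, two layers, three positions (made-up numbers)
toy_projections = [[[1.0, -2.0, -3.0], [0.5, 0.25, 2.0]]]
toy_finals = [[-3.0, 2.0]]

aligned = align_to_final_token(toy_projections, toy_finals)
print(aligned)
```

Because the flip is applied per prompt and per layer, it only fixes the orientation used for display and does not change how far any token is from the hyperplane.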

Here we choose a given prompt to visualize. "Positive" answers always correspond to _blue_ and "negative" answers always correspond to _red_,
independently of the correct answer.

In [28]:
def plot_colors(dataset: str, prompt_id: int):
    flattened_tokens = (
        neg_str_tokens[dataset][prompt_id] + pos_str_tokens[dataset][prompt_id][-1:] + ["\n\n\n"]
    ) * n_layers
    flattened_projections = projections[dataset][
        prompt_id, :, : prompt_lengths[dataset][prompt_id] + 2
    ].flatten()

    # clip the color scale to [-5, 5] to make the visualization more readable
    print("distance bounds:", flattened_projections.min(), flattened_projections.max())
    return colored_tokens(
        flattened_tokens, flattened_projections, min_value=-5, max_value=5
    )

In [29]:
plot_colors('imdb', prompt_id=4)

distance bounds: tensor(-116.5692) tensor(6.9223)


In [30]:
plot_colors('super_glue boolq', prompt_id=6)

distance bounds: tensor(-116.5692) tensor(97.0533)


It's interesting that these almost look like attention patterns! In our experiments on IMDB, the blue tokens are often ones that carry a positive connotation
and the red ones often carry a negative connotation.
The probe for BoolQ seems to favour statistics and proper nouns.
It's also interesting that the logits across the board seem to get farther and farther from the CCS hyperplane
as we go into the later layers of the network.
We're not yet sure what's causing this behaviour.
In future experiments we'll check whether there are any similarities with the actual attention patterns.
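
One way to turn the "farther from the hyperplane in later layers" observation into a number is to average the absolute probe output per layer over the real (non-padding) tokens. The sketch below does this for nested lists indexed as `[prompt][layer][position]`; the helper name `mean_abs_per_layer` is illustrative, the toy values are made up and are not results from these experiments, and the absolute probe output is only a proxy for the geometric distance unless the probe weights have unit norm.

```python
def mean_abs_per_layer(projections, lengths):
    """Average |projection| per layer over the first `length` tokens of each prompt.

    projections: nested list indexed [prompt][layer][position].
    lengths: number of real tokens in each prompt.
    """
    n_layers = len(projections[0])
    totals = [0.0] * n_layers
    n_tokens = 0
    for prompt_layers, length in zip(projections, lengths):
        for layer, values in enumerate(prompt_layers):
            totals[layer] += sum(abs(value) for value in values[:length])
        n_tokens += length
    return [total / n_tokens for total in totals]


# one toy prompt with two real tokens and one padding position, two layers
toy_projections = [[[1.0, -1.0, 0.0], [2.0, -4.0, 0.0]]]
toy_lengths = [2]

print(mean_abs_per_layer(toy_projections, toy_lengths))
```

With the tensors from this notebook, the same statistic could be computed from `projections[dataset]` after masking out positions beyond `prompt_lengths[dataset]`.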