This notebook contains code to replicate the experiments mentioned in the Google doc.

The graphs themselves were made by 4o from raw text outputs, so there's no code here for making the actual visualisations.

Warning: A lot of this code is ugly and inefficient. Time constraints! If I were doing a larger scale project I would definitely spend some time making the code nicer.

Big thanks to my bestie o1 who helped with a lot of the code.

# Setup

First load in the models

In [2]:
import torch
from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM
from nnsight import NNsight

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
small_model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
small_tokenizer = AutoTokenizer.from_pretrained(model_name)

small_model = NNsight(small_model)

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
large_model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
large_tokenizer = AutoTokenizer.from_pretrained(model_name)

large_model = NNsight(large_model)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Now load in the generations

In [3]:
import os

os.chdir('/home/lfletcher/MATS_app')

In [4]:
import pickle

with open('data/generations/high_school_gens_large.pkl', 'rb') as f:

    large_gens = pickle.load(f)

with open('data/generations/high_school_gens_small.pkl', 'rb') as f:

    small_gens = pickle.load(f)

Filter the generations to only consider those <2000 tokens and which follow the format of having an answer in \boxed{}.

This code also adds the predicted_label and token position of the predicted label to the generation dict.

In [5]:
import re
from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM
small_tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

small_filtered_gens = []
for item in small_gens:
    full_text = item['full_generated_text']
    prompt = item['prompt']
    # Estimate tokenized length by splitting on whitespace and punctuation
    tokenized_length = len(small_tokenizer.tokenize(full_text)) - len(small_tokenizer.tokenize(prompt))

    # Check tokenized length condition
    if tokenized_length < 2000:
        # Extract boxed answer using regex patterns
        match = re.search(r'\\boxed{\(?([A-Da-d])\)?}', full_text)
        if match:
            label = match.group(1).upper()  # Normalize to uppercase
            label_index = ord(label) - ord('A')  # Convert A-D to 0-3 index
            answer_str = match.group(0)  # Full match (e.g., \boxed{C})
            char_index = full_text.find(answer_str)  # Character index
            token_index = len(small_tokenizer.tokenize(full_text[:char_index]))  # Convert to token index

            item['predicted_label'] = label_index
            item['answer_token_index'] = token_index  # Store token index of answer
            small_filtered_gens.append(item)

large_filtered_gens = []
for item in large_gens:
    full_text = item['full_generated_text']
    prompt = item['prompt']
    # Estimate tokenized length by splitting on whitespace and punctuation
    tokenized_length = len(small_tokenizer.tokenize(full_text)) - len(small_tokenizer.tokenize(prompt))

    # Check tokenized length condition
    if tokenized_length < 2000:
        # Extract boxed answer using regex patterns
        match = re.search(r'\\boxed{\(?([A-Da-d])\)?}', full_text)
        if match:
            label = match.group(1).upper()  # Normalize to uppercase
            label_index = ord(label) - ord('A')  # Convert A-D to 0-3 index
            answer_str = match.group(0)  # Full match (e.g., \boxed{C})
            char_index = full_text.find(answer_str)  # Character index
            token_index = len(large_tokenizer.tokenize(full_text[:char_index]))  # Convert to token index

            item['predicted_label'] = label_index
            item['answer_token_index'] = token_index  # Store token index of answer
            large_filtered_gens.append(item)



# Experiment 1

First we find the incorrect generations for each of the small and large model

In [5]:
small_incorrect_gens = [gen for gen in small_filtered_gens if gen['predicted_label'] != gen['correct_label']]
large_incorrect_gens = [gen for gen in large_filtered_gens if gen['predicted_label'] != gen['correct_label']]
                           
print(len(small_incorrect_gens))
print(len(large_incorrect_gens))

624
421


Get the pre and post CoT logits for the predicted_answer for the small and large model

This code is monstrously inefficient and not batched. It takes like 10 mins to run on an A100. Sorry!

The resulting lists are dictionaries of the form

    {
        "final_answer": the final answer given by the model after the CoT,
        "post_logits": the logits of A-D after the CoT,
        "pre_logits": the logits of A-D before the CoT,
        "subject": the subject of this datapoint (i.e. geography, math etc.)
    }

In [10]:
from utils import get_pre_post_cot_logits

small_logits = []
large_logits = []

print("Getting logits for small model")
for i, gen in enumerate(small_incorrect_gens):

    if i % 10 == 0:
        print(f"Getting logits for generation {i}")

    small_logits.append(get_pre_post_cot_logits(gen, small_model, small_tokenizer))

print("\nGetting logits for large model")
for i, gen in enumerate(large_incorrect_gens):

    if i % 10 == 0:
        print(f"Getting logits for generation {i}")
        
    large_logits.append(get_pre_post_cot_logits(gen, large_model, large_tokenizer))

Getting logits for small model
Getting logits for generation 0


Getting logits for generation 10
Getting logits for generation 20
Getting logits for generation 30
Getting logits for generation 40
Getting logits for generation 50
Getting logits for generation 60
Getting logits for generation 70
Getting logits for generation 80
Getting logits for generation 90
Getting logits for generation 100
Getting logits for generation 110
Getting logits for generation 120
Getting logits for generation 130
Getting logits for generation 140
Getting logits for generation 150
Getting logits for generation 160
Getting logits for generation 170
Getting logits for generation 180
Getting logits for generation 190
Getting logits for generation 200
Getting logits for generation 210
Getting logits for generation 220
Getting logits for generation 230
Getting logits for generation 240
Getting logits for generation 250
Getting logits for generation 260
Getting logits for generation 270
Getting logits for generation 280
Getting logits for generation 290
Getting logits for gene

Now that we have these logits, we get a couple of relevant quantities

1) the probability of the final answer pre and post CoT (saved in cur_prob_dict)
2) the probability of all 4 of A-D pre and post CoT (saved in small/large_probs_pre_all and small/large_probs_post_all)

The latter we calculate to check some assumptions that we're relying on (see Google doc)

In [11]:
import torch.nn.functional as F

small_probs = []

small_probs_pre_all = []
small_probs_post_all = []

for i in range(len(small_incorrect_gens)):

    cur_gen = small_logits[i]
    predict_idx = small_incorrect_gens[i]['predicted_label']

    logits_prompt = cur_gen['pre_logits']
    logits_cot = cur_gen['post_logits']
    
    cur_prob_prompt = F.softmax(logits_prompt)
    cur_prob_cot = F.softmax(logits_cot)

    small_probs_pre_all.append(cur_prob_prompt.tolist())
    small_probs_post_all.append(cur_prob_cot.tolist())

    cur_prob_dict = {"pre_prob":cur_prob_prompt[predict_idx], "post_prob":cur_prob_cot[predict_idx], "subject":cur_gen["subject"]}
    small_probs.append(cur_prob_dict)


large_probs = []

large_probs_pre_all = []
large_probs_post_all = []

for i in range(len(large_incorrect_gens)):

    cur_gen = large_logits[i]
    predict_idx = large_incorrect_gens[i]['predicted_label']

    logits_prompt = cur_gen['pre_logits']
    logits_cot = cur_gen['post_logits']
    
    cur_prob_prompt = F.softmax(logits_prompt)
    cur_prob_cot = F.softmax(logits_cot)

    large_probs_pre_all.append(cur_prob_prompt.tolist())
    large_probs_post_all.append(cur_prob_cot.tolist())

    cur_prob_dict = {"pre_prob":cur_prob_prompt[predict_idx], "post_prob":cur_prob_cot[predict_idx], "subject":cur_gen["subject"]}
    large_probs.append(cur_prob_dict)

  cur_prob_prompt = F.softmax(logits_prompt)
  cur_prob_cot = F.softmax(logits_cot)
  cur_prob_prompt = F.softmax(logits_prompt)
  cur_prob_cot = F.softmax(logits_cot)


Now we get the average probability of the final (post-CoT) answer pre-CoT for each model. This is the main result of this experiment.

In [12]:
from collections import defaultdict

subject_stats = defaultdict(lambda: {"sum": 0.0, "count": 0})

# Accumulate sums and counts
for entry in small_probs:
    subj = entry["subject"]
    subject_stats[subj]["sum"] += entry["pre_prob"]
    subject_stats[subj]["count"] += 1

# Compute mean and store results
result = []
for subj, stats in subject_stats.items():
    mean_prob = stats["sum"] / stats["count"]
    result.append({
        "subject": subj,
        "mean_pre_prob": mean_prob,
        "count": stats["count"]
    })

print("Results for small model")
for entry in result:
    print(f"Subject: {entry['subject']}, Mean Pre-CoT Probability: {entry['mean_pre_prob']:.4f}, Count: {entry['count']}")


# Accumulate sums and counts
for entry in large_probs:
    subj = entry["subject"]
    subject_stats[subj]["sum"] += entry["pre_prob"]
    subject_stats[subj]["count"] += 1

# Compute mean and store results
result = []
for subj, stats in subject_stats.items():
    mean_prob = stats["sum"] / stats["count"]
    result.append({
        "subject": subj,
        "mean_pre_prob": mean_prob,
        "count": stats["count"]
    })

print("\nResults for large model")
for entry in result:
    print(f"Subject: {entry['subject']}, Mean Pre-CoT Probability: {entry['mean_pre_prob']:.4f}, Count: {entry['count']}")



Results for small model
Subject: high_school_biology, Mean Pre-CoT Probability: 0.9994, Count: 81
Subject: high_school_chemistry, Mean Pre-CoT Probability: 0.9687, Count: 28
Subject: high_school_computer_science, Mean Pre-CoT Probability: 0.9997, Count: 11
Subject: high_school_european_history, Mean Pre-CoT Probability: 0.9809, Count: 16
Subject: high_school_geography, Mean Pre-CoT Probability: 0.9956, Count: 64
Subject: high_school_government_and_politics, Mean Pre-CoT Probability: 0.9995, Count: 66
Subject: high_school_macroeconomics, Mean Pre-CoT Probability: 0.9969, Count: 74
Subject: high_school_mathematics, Mean Pre-CoT Probability: 0.8658, Count: 3
Subject: high_school_microeconomics, Mean Pre-CoT Probability: 0.9985, Count: 50
Subject: high_school_physics, Mean Pre-CoT Probability: 0.9986, Count: 14
Subject: high_school_psychology, Mean Pre-CoT Probability: 0.9888, Count: 146
Subject: high_school_statistics, Mean Pre-CoT Probability: 0.9990, Count: 17
Subject: high_school_us_hi

Finally, we check for one possible alternative explanation for these results, namely that both the pre and post probabilities are biased in the same way 

(i.e., if the model says B 50% of the time both pre and post CoT, then our metric of mean pre-CoT probability will be >50% without post-hoc reasoning being necessary)

In [13]:
import numpy as np

small_probs_prompt_all = np.array(small_probs_pre_all)
small_probs_cot_all = np.array(small_probs_post_all)

large_probs_prompt_all = np.array(large_probs_pre_all)
large_probs_cot_all = np.array(large_probs_post_all)

print(np.mean(large_probs_prompt_all, axis=0))
print(np.mean(large_probs_cot_all, axis=0))

print(np.mean(small_probs_prompt_all, axis=0))
print(np.mean(small_probs_cot_all, axis=0))

[0.28692206 0.24727051 0.24898808 0.21681934]
[0.22922276 0.27841208 0.24832366 0.24404151]
[0.30003469 0.22485375 0.26373966 0.2113719 ]
[0.25316039 0.35668414 0.21492729 0.17522819]


The two distributions are not biased in the same way (although the small model in particular has got a weird bias for B pre-CoT and A post-CoT)

In particular, if pre/post were independent, then for both large and small models the probability of ans_pre = ans_post is approximately 0.25

# Experiment 2

I've moved the code which actually extracts the features to the appendix and just load the features here. This is because the code extracting the features is pretty slow (particularly for the big model). Feel free to skip to the appendix to take a look at that if you like.

So we load the features here

In [6]:
with open('data/probe_features/small_features.pkl', 'rb') as f:

    small_features = pickle.load(f)

with open('data/probe_features/small_labels.pkl', 'rb') as f:

    small_labels = pickle.load(f)

with open('data/probe_features/large_features.pkl', 'rb') as f:

    large_features = pickle.load(f)

with open('data/probe_features/large_labels.pkl', 'rb') as f:

    large_labels = pickle.load(f)

Now split into test and train sets, along with randomly discarding datapoints until we have balanced labels [balance_classes()] (this is basically the most naive thing you could possibly do but time constraints!)

In [7]:
from utils import balance_classes
from sklearn.model_selection import train_test_split

def prepare_splits(features, labels):

    bal_features, bal_labels = balance_classes(features, labels)

    features_np = features.numpy()
    labels_np = labels.numpy()

    # Perform an 80–20 split
    train_features_np, test_features_np, \
    train_labels_np, test_labels_np = train_test_split(
        features_np,
        labels_np,
        test_size=0.2,
        random_state=42,
        stratify=labels_np  # Stratify for class balance
    )

    # Convert NumPy arrays back to tensors
    train_features = torch.tensor(train_features_np)
    test_features  = torch.tensor(test_features_np)
    train_labels    = torch.tensor(train_labels_np)
    test_labels     = torch.tensor(test_labels_np)

    return train_features, test_features, train_labels, test_labels

small_train_features, small_test_features, small_train_labels, small_test_labels = prepare_splits(small_features, small_labels)
large_train_features, large_test_features, large_train_labels, large_test_labels = prepare_splits(large_features, large_labels)

Now we train the linear probe 3 times at each position/layer for the small model and print the results

In [8]:
import torch
import numpy as np
from utils import train_linear_classifier_with_positions


def get_probe_results(train_features, train_labels, test_features, test_labels):

    results = {}

    for layer_idx in range(5):           # Layers 0 through 4
        results[layer_idx] = {}
        
        for pos in range(17):           # Positions 0 through 16
            run_accuracies = []
            
            for run_i in range(3):      # Repeat 3 times
                # Prepare the position labels (replacing the "*2" with "*pos")
                positions_train = torch.ones(len(train_features)).long() * pos
                positions_test  = torch.ones(len(test_features)).long() * pos
                
                # Train the probe
                probe, acc = train_linear_classifier_with_positions(
                    train_features,
                    positions_train,
                    train_labels,
                    test_features,
                    positions_test,
                    test_labels,
                    layer_idx=layer_idx,
                    num_classes=4,    # or whatever your task requires
                    epochs=100,
                    batch_size=32,
                    device='cuda'
                )
                
                # Extract the accuracy (adjust if your function returns the accuracy differently)
                run_accuracies.append(acc)
            
            # Compute mean and std across the 3 runs
            mean_acc = np.mean(run_accuracies)
            std_acc  = np.std(run_accuracies)
            
            # Store in the results structure
            results[layer_idx][pos] = (mean_acc, std_acc)

    return results

small_results = get_probe_results(small_train_features, small_train_labels, small_test_features, small_test_labels)

# Now 'results' contains (mean_acc, std_acc) for each (layer_idx, pos).
# You can do further processing or just print:
for layer_idx in range(5):
    for pos in range(17):
        mean_acc, std_acc = small_results[layer_idx][pos]
        print(f"Layer {layer_idx}, Pos {pos} -> Mean: {mean_acc:.4f}, Std: {std_acc:.4f}")


Layer 0, Pos 0 -> Mean: 0.2579, Std: 0.0304
Layer 0, Pos 1 -> Mean: 0.2350, Std: 0.0122
Layer 0, Pos 2 -> Mean: 0.2512, Std: 0.0059
Layer 0, Pos 3 -> Mean: 0.2808, Std: 0.0041
Layer 0, Pos 4 -> Mean: 0.2407, Std: 0.0257
Layer 0, Pos 5 -> Mean: 0.2770, Std: 0.0239
Layer 0, Pos 6 -> Mean: 0.2923, Std: 0.0084
Layer 0, Pos 7 -> Mean: 0.2693, Std: 0.0041
Layer 0, Pos 8 -> Mean: 0.4499, Std: 0.0062
Layer 0, Pos 9 -> Mean: 0.4604, Std: 0.0179
Layer 0, Pos 10 -> Mean: 0.4308, Std: 0.0105
Layer 0, Pos 11 -> Mean: 0.4460, Std: 0.0172
Layer 0, Pos 12 -> Mean: 0.5062, Std: 0.0236
Layer 0, Pos 13 -> Mean: 0.2359, Std: 0.0049
Layer 0, Pos 14 -> Mean: 0.2369, Std: 0.0049
Layer 0, Pos 15 -> Mean: 0.2521, Std: 0.0047
Layer 0, Pos 16 -> Mean: 0.2063, Std: 0.0047
Layer 1, Pos 0 -> Mean: 0.2225, Std: 0.0027
Layer 1, Pos 1 -> Mean: 0.2197, Std: 0.0111
Layer 1, Pos 2 -> Mean: 0.2598, Std: 0.0164
Layer 1, Pos 3 -> Mean: 0.2760, Std: 0.0199
Layer 1, Pos 4 -> Mean: 0.2598, Std: 0.0082
Layer 1, Pos 5 -> Mean: 0

Now do the same for the large model

In [None]:

large_results = get_probe_results(large_train_features, large_train_labels, large_test_features, large_test_labels)

for layer_idx in range(5):
    for pos in range(17):
        mean_acc, std_acc = large_results[layer_idx][pos]
        print(f"Layer {layer_idx}, Pos {pos} -> Mean: {mean_acc:.4f}, Std: {std_acc:.4f}")

# Appendix  

Below is the code to generate the features for the linear probe

You need to only load in the relevant model otherwise seems to always OOM? doesn't seem like that should happen

In [10]:
import gc

gc.collect()
torch.cuda.empty_cache()

del small_model
del large_model

For small model:

In [3]:
import re
from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM
small_tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

small_filtered_gens = []
for item in small_gens:
    full_text = item['full_generated_text']
    prompt = item['prompt']
    # Estimate tokenized length by splitting on whitespace and punctuation
    tokenized_length = len(small_tokenizer.tokenize(full_text)) - len(small_tokenizer.tokenize(prompt))

    # Check tokenized length condition
    if tokenized_length < 2000:
        # Extract boxed answer using regex patterns
        match = re.search(r'\\boxed{\(?([A-Da-d])\)?}', full_text)
        if match:
            label = match.group(1).upper()  # Normalize to uppercase
            label_index = ord(label) - ord('A')  # Convert A-D to 0-3 index
            answer_str = match.group(0)  # Full match (e.g., \boxed{C})
            char_index = full_text.find(answer_str)  # Character index
            token_index = len(small_tokenizer.tokenize(full_text[:char_index]))  # Convert to token index

            item['predicted_label'] = label_index
            item['answer_token_index'] = token_index  # Store token index of answer
            small_filtered_gens.append(item)

In [5]:
from torch.utils.data import DataLoader
from utils import TextClassificationDataset, get_probe_features
import torch
from transformers import AutoModel, AutoTokenizer, AutoModelForCausalLM
from nnsight import NNsight
import pickle

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
small_model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
small_tokenizer = AutoTokenizer.from_pretrained(model_name)

small_model = NNsight(small_model)


dataset = TextClassificationDataset(small_filtered_gens, small_tokenizer, max_length=2000)
dataloader = DataLoader(dataset, batch_size=8)

features, labels, _, _, _ = get_probe_features(small_model, dataloader, hidden_dim=1536, layer_indices=[5,10,15,20,25], max_batches=10000)

with open('data/probe_features/small_features.pkl', 'wb') as f:

    pickle.dump(features, f)

with open('data/probe_features/small_labels.pkl', 'wb') as f:

    pickle.dump(labels, f)

Batch 0
Batch 1
Batch 2
Batch 3
Batch 4
Batch 5
Batch 6
Batch 7
Batch 8
Batch 9
Batch 10
Batch 11
Batch 12
Batch 13
Batch 14
Batch 15
Batch 16
Batch 17
Batch 18
Batch 19
Batch 20
Batch 21
Batch 22
Batch 23
Batch 24
Batch 25
Batch 26
Batch 27
Batch 28
Batch 29
Batch 30
Batch 31
Batch 32
Batch 33
Batch 34
Batch 35
Batch 36
Batch 37
Batch 38
Batch 39
Batch 40
Batch 41
Batch 42
Batch 43
Batch 44
Batch 45
Batch 46
Batch 47
Batch 48
Batch 49
Batch 50
Batch 51
Batch 52
Batch 53
Batch 54
Batch 55
Batch 56
Batch 57
Batch 58
Batch 59
Batch 60
Batch 61
Batch 62
Batch 63
Batch 64
Batch 65
Batch 66
Batch 67
Batch 68
Batch 69
Batch 70
Batch 71
Batch 72
Batch 73
Batch 74
Batch 75
Batch 76
Batch 77
Batch 78
Batch 79
Batch 80
Batch 81
Batch 82
Batch 83
Batch 84
Batch 85
Batch 86
Batch 87
Batch 88
Batch 89
Batch 90
Batch 91
Batch 92
Batch 93
Batch 94
Batch 95
Batch 96
Batch 97
Batch 98
Batch 99
Batch 100
Batch 101
Batch 102
Batch 103
Batch 104
Batch 105
Batch 106
Batch 107
Batch 108
Batch 109
Batch 110


For large model (batch_size=6 is the max that will not cause an OOM error for me)

In [None]:
from torch.utils.data import DataLoader
from utils import TextClassificationDataset, get_probe_features

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
large_model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
large_tokenizer = AutoTokenizer.from_pretrained(model_name)

large_model = NNsight(large_model)

dataset = TextClassificationDataset(large_filtered_gens, small_tokenizer, max_length=2000)
dataloader = DataLoader(dataset, batch_size=6)

features, labels, _, _, _ = get_probe_features(large_model, dataloader, hidden_dim=3584, layer_indices=[5,10,15,20,25], max_batches=10000)

with open('data/probe_features/large_features.pkl', 'wb') as f:

    pickle.dump(features, f)

with open('data/probe_features/large_labels.pkl', 'wb') as f:

    pickle.dump(labels, f)