This experiment uses the Moore-Penrose Pseudoinverse to identify and erase a specific concept (Verb Tense) from the model's internal representation.

This is a linear probe technique. We assume the model represents "Past Tense" as a specific direction vector ($v$) in its high-dimensional activation space. If we find that vector and project it out (zeroing the component of each hidden state along $v$), the model should lose the ability to understand or generate past tense.

## The Experiment: "The Tense Lobotomy"

We will:
1. Harvest Activations: Run the model on "Past" vs. "Present" sentences and save the hidden states.
2. Compute the Direction: Use the Pseudoinverse to find the linear direction that best separates these two groups.
3. Create the Eraser: Build a projection matrix ($P$) that removes only that direction.
4. Intervene: Implant this matrix into the model and verify if it "forgets" the concept of the past.

## Step 1: Data Setup & Harvesting

First, we collect the model's hidden states at the last token of each past-tense and present-tense sentence.

In [91]:
import torch
from nnsight import LanguageModel
                            
model = LanguageModel("openai-community/gpt2")
model.to_device("mps")

'mps'

In [92]:
past_sentences = [
    "I walked to the store", "She ran fast", "They ate takeout",
    "He studied hard", "We played soccer", "It rained heavily",
]
present_sentences = [
    "I walk to the store", "She runs fast", "They eat takeout",
    "He studies hard", "We play soccer", "It rains heavily"
]

all_prompts = past_sentences + present_sentences
n_past = len(past_sentences)
n_present = len(present_sentences)

target_layer = 6

with torch.no_grad(), model.trace(all_prompts):
    # Get just the last token hidden states.
    hidden_states = model.transformer.h[target_layer].output[0][:, -1, :].save()

activations = hidden_states
print(activations.shape) # (12 x 768)

torch.Size([12, 768])
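The `[:, -1, :]` slice only returns the last real token of every sentence if the batch is left-padded, because the sentences tokenize to different lengths. The generated outputs later in this notebook begin with `<|endoftext|>` padding, which suggests left padding is in effect, but that is an inference from the output rather than something stated here. A minimal NumPy sketch, using made-up attention masks and a hypothetical helper `last_real_token_index`, of how the last real position could be read from the mask regardless of padding side:

```python
import numpy as np

def last_real_token_index(attention_mask: np.ndarray) -> np.ndarray:
    """Return, per row, the index of the last position whose mask is 1."""
    seq_len = attention_mask.shape[1]
    # Reverse each row, find the first 1, and map back to the original index.
    first_one_from_right = np.argmax(attention_mask[:, ::-1] == 1, axis=1)
    return seq_len - 1 - first_one_from_right

# Hypothetical masks for a batch of two prompts (1 = real token, 0 = padding).
left_padded = np.array([[0, 0, 1, 1, 1],
                        [1, 1, 1, 1, 1]])
right_padded = np.array([[1, 1, 1, 0, 0],
                         [1, 1, 1, 1, 1]])

idx_left = last_real_token_index(left_padded)    # same (last) column for every row
idx_right = last_real_token_index(right_padded)  # differs per row

# Toy hidden states of shape (batch, seq, hidden) to show the gather.
hidden = np.arange(2 * 5 * 3, dtype=np.float32).reshape(2, 5, 3)
last_hidden = hidden[np.arange(hidden.shape[0]), idx_right]  # (batch, hidden)
print(idx_left, idx_right, last_hidden.shape)
```

With left padding the index is always the final column, which is why the simple `-1` slice works in that case.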


## Step 2: Finding the "Tense Direction"

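We want a vector $w$ such that $A w \approx y$, where $A$ is the $12 \times 768$ activation matrix and $y$ is the vector of labels. Because $A$ is not square, it has no inverse, so we use the pseudoinverse ($A^+$) and set $w = A^+ y$. With fewer samples than dimensions the system is underdetermined, and the pseudoinverse picks the minimum-norm $w$ among all least-squares solutions.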
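To see what the pseudoinverse does when there are more dimensions than samples, here is a small NumPy sketch on random data (a stand-in with the same shape as the activations, not real model output). It checks two properties: the fit is exact, and adding any null-space component of $A$ leaves the fit unchanged while increasing the norm of the solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the activation matrix: 12 samples, 768 dimensions (random data).
A = rng.standard_normal((12, 768))
y = np.array([1.0] * 6 + [-1.0] * 6)

# Pseudoinverse solution, the same formula as w = pinv(A) @ y.
A_pinv = np.linalg.pinv(A)
w = A_pinv @ y

# With more dimensions than samples (and full row rank), the fit is exact.
residual = np.linalg.norm(A @ w - y)

# Any vector in the null space of A can be added without changing the fit,
# but it can only increase the norm of the solution.
noise = rng.standard_normal(768)
null_component = noise - A_pinv @ (A @ noise)  # part of noise that A maps to zero
w_other = w + null_component

print(f"residual: {residual:.2e}")
print(f"|w| = {np.linalg.norm(w):.3f}, |w_other| = {np.linalg.norm(w_other):.3f}")
```

An exact fit on 12 points in 768 dimensions is expected for almost any labelling, so a perfect training fit alone is weak evidence that $w$ captures tense.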

In [93]:
# 1. Create labels (+1 for Past, -1 for Present)
# We cast to the same device/dtype as activations
labels = torch.cat([
    torch.ones(n_past),
    torch.ones(n_present) * -1
]).to(activations.device).to(activations.dtype)

print(labels)

tensor([ 1.,  1.,  1.,  1.,  1.,  1., -1., -1., -1., -1., -1., -1.])


In [94]:
# 2. Compute the "Concept Direction" vector (w)
# This is the Linear Probe
# w = pinv(A) @ Y
# A is our collected activations.
pseudo_inverse = torch.linalg.pinv(activations)
print("Pseudo Inverse Shape:", pseudo_inverse.shape) # (768 x 12)

# w = pinv(A) @ Y
tense_direction = torch.matmul(pseudo_inverse, labels)
print("Tense Direction Shape:", tense_direction.shape)

Pseudo Inverse Shape: torch.Size([768, 12])
Tense Direction Shape: torch.Size([768])


In [95]:
# 3. Construct the "Eraser" (Projection Matrix)
# We want to keep everything orthogonal to the tense direction.
# P = I - (v v^T) / (v^T v)
v = tense_direction.unsqueeze(1) # Make it a column vector
print("v Shape:", v.shape) # (768 x 1)

P = torch.eye(v.shape[0], device=v.device) - (v @ v.T) / (v.T @ v)
print("P Shape:", P.shape) # (768 x 768)

v Shape: torch.Size([768, 1])
P Shape: torch.Size([768, 768])
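Before implanting $P$, it can help to confirm that it behaves like an orthogonal projector. The NumPy sketch below uses a small random direction (hidden size 8 instead of 768, purely for readability) and checks the properties the intervention relies on.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # small hidden size for readability; the notebook uses 768

v = rng.standard_normal((d, 1))            # stand-in for the tense direction
P = np.eye(d) - (v @ v.T) / (v.T @ v)      # same formula as the eraser

x = rng.standard_normal((3, d))            # three toy "hidden states" as rows
x_erased = x @ P

print(np.allclose(P, P.T))                 # symmetric: x @ P equals (P @ x.T).T
print(np.allclose(P @ P, P))               # idempotent: erasing twice changes nothing more
print(np.allclose(x_erased @ v, 0.0))      # no component is left along v
print(np.linalg.matrix_rank(P))            # d - 1: exactly one direction is removed
```

Because $P$ is symmetric, right-multiplying row vectors by $P$ needs no transpose.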


## Step 3: The Intervention

Now we verify if the surgery works. We feed the model a prompt that requires past tense completion and force the activation through our Eraser matrix $P$.

In [96]:
test_prompts = [
    "Yesterday, I",
    "In the past, she"
]

# CORRUPTED RUN
with model.generate(test_prompts, max_new_tokens=5, do_sample=False):
    
    # Access the exact same layer we trained on
    current_hidden = model.transformer.h[target_layer].output[0]
    
    # The intervention
    # Apply the projection matrix P to the hidden states.
    # This removes the "Past Tense" component from the vector.
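    # P is symmetric, so no transpose is needed: (Batch, Seq, Dim) @ (Dim, Dim)
    # Note: We apply this to ALL token positions in the traced forward pass
    # (depending on the nnsight version, this may cover only the first generation step)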
    
    # We cast P to the correct dtype (usually float16/bfloat16 on GPU)
    P = P.to(current_hidden.device).to(current_hidden.dtype)
    
    # Intervention: x_new = x @ P
    # We update the hidden state in-place
    current_hidden[:] = torch.matmul(current_hidden, P)
    
    # Save the output to see what the model generates
    output = model.generator.output.save()
    
# Decode the result
print("RESULT:")
for i, prompt in enumerate(test_prompts):
    completion = model.tokenizer.decode(output[i])
    print(f"Prompt: '{prompt}' --> Generated: '{completion}'")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


RESULT:
Prompt: 'Yesterday, I' --> Generated: '<|endoftext|><|endoftext|>Yesterday, I'm going to be doing'
Prompt: 'In the past, she' --> Generated: 'In the past, she has been a vocal critic'


In [97]:
test_prompts = [
    "Yesterday, I",
    "In the past, she"
]

# CLEAN RUN (For comparison)
with model.generate(test_prompts, max_new_tokens=5, do_sample=False):
    
    # Access the exact same layer we trained on
    current_hidden = model.transformer.h[target_layer].output[0]
    
    # Save the output to see what the model generates
    output = model.generator.output.save()
    
# Decode the result
print("RESULT:")
for i, prompt in enumerate(test_prompts):
    completion = model.tokenizer.decode(output[i])
    print(f"Prompt: '{prompt}' --> Generated: '{completion}'")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


RESULT:
Prompt: 'Yesterday, I' --> Generated: '<|endoftext|><|endoftext|>Yesterday, I was in the middle of'
Prompt: 'In the past, she' --> Generated: 'In the past, she has been a vocal critic'
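Comparing the two runs by eye works for two prompts. A small helper makes the comparison explicit; this is a plain-Python sketch with the completions copied from the outputs above, and `continuation` is a hypothetical helper name.

```python
# Completions copied from the two runs above (padding tokens stripped).
clean = {
    "Yesterday, I": "Yesterday, I was in the middle of",
    "In the past, she": "In the past, she has been a vocal critic",
}
ablated = {
    "Yesterday, I": "Yesterday, I'm going to be doing",
    "In the past, she": "In the past, she has been a vocal critic",
}

def continuation(prompt: str, completion: str) -> str:
    """Strip the prompt prefix so only the newly generated text remains."""
    return completion[len(prompt):] if completion.startswith(prompt) else completion

# Keep only the prompts whose completion differs between the two runs.
changed = {
    prompt: (continuation(prompt, clean[prompt]), continuation(prompt, ablated[prompt]))
    for prompt in clean
    if clean[prompt] != ablated[prompt]
}
for prompt, (before, after) in changed.items():
    print(f"{prompt!r}: {before!r} -> {after!r}")
```

Only the first prompt changes: "was" becomes "'m going to be", while the second prompt keeps the present perfect "has been" in both runs. On this evidence the erasure is partial.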


## Probe Comparison Experiment

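The `pinv` method is linear regression (ordinary least squares) without an intercept term. When there are fewer samples than dimensions, as here, it returns the minimum-norm solution. Here is how the family of linear probes compares: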

1. Linear Regression (OLS / Pseudoinverse): Minimizes the squared error between the projection and the label. Good for regression, sometimes noisy for classification.
2. Difference of Means: Simply subtracts the "Present" centroid from the "Past" centroid ($\mu_{past} - \mu_{present}$). This is extremely robust when you have very little data (like our $N = 12$), but it ignores the shape/variance of the data.
3. Logistic Regression: Fits a sigmoid classifier by minimizing the log loss, so it optimizes for separating the two classes rather than for matching numeric labels. This is often considered the "gold standard" for probing.
4. Support Vector Machines (SVM): Finds the hyperplane that maximizes the margin (distance) between two classes.

We will implement a "Probe Battle" to see which method produces the most effective "Concept Erasure" for our small dataset.

**Note:** Since our dataset is tiny ($N = 12$), expect Difference of Means to perform well because it is less prone to overfitting than Logistic Regression.
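One simple way to check whether two probes found the same direction is the cosine similarity between their unit vectors. The toy NumPy sketch below uses synthetic data; the `concept` axis and all sizes are made up for illustration. It compares the difference-of-means and pseudoinverse directions with each other and with the known synthetic axis.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class, d = 6, 32  # toy sizes; the notebook has 6 per class and d = 768

# Synthetic two-class data: the classes differ along one hypothetical "concept" axis.
concept = np.zeros(d)
concept[0] = 1.0
X_pos = rng.standard_normal((n_per_class, d)) + 2.0 * concept
X_neg = rng.standard_normal((n_per_class, d)) - 2.0 * concept
X = np.vstack([X_pos, X_neg])
y_signed = np.array([1.0] * n_per_class + [-1.0] * n_per_class)

def unit(vec: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length."""
    return vec / np.linalg.norm(vec)

dir_means = unit(X_pos.mean(axis=0) - X_neg.mean(axis=0))  # difference of means
dir_ols = unit(np.linalg.pinv(X) @ y_signed)               # OLS / pseudoinverse

cos_between = float(dir_means @ dir_ols)
cos_means_true = float(dir_means @ concept)
cos_ols_true = float(dir_ols @ concept)
print(f"cos(means, ols)     = {cos_between:.3f}")
print(f"cos(means, concept) = {cos_means_true:.3f}")
print(f"cos(ols, concept)   = {cos_ols_true:.3f}")
```

The same cosine comparison could be applied to the four directions computed on the real activations in the cell that follows.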

In [102]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Prepare the data for scikit-learn (CPU, Numpy)
X = hidden_states.cpu().numpy()
# Labels: 1 for Past, 0 for Present (standard for sklearn)
y = np.array([1] * n_past + [0] * n_present)

# Signed labels for Linear Regression (+1 / -1 keeps the targets centered at zero for OLS)
y_ols = np.array([1] * n_past + [-1] * n_present)

def get_direction_diff_means(X, y):
    # Mean of Past - Mean of Present
    mu_past = X[y==1].mean(axis=0)
    mu_present = X[y==0].mean(axis=0)
    direction = mu_past - mu_present
    return torch.tensor(direction, dtype=torch.float32)

def get_direction_ols(X, y_signed):
    # w = A^+ Y (The method we used before)
    # Convert back to torch for simple linalg
    A = torch.tensor(X, dtype=torch.float32)
    Y = torch.tensor(y_signed, dtype=torch.float32)
    return torch.linalg.pinv(A) @ Y

def get_direction_logistic(X, y):
    # Train Logistic Regression classifier
    clf = LogisticRegression(random_state=9001, solver='liblinear', max_iter=1000).fit(X, y)
    # The coefficients are the normal vector to the decision boundary
    return torch.tensor(clf.coef_[0], dtype=torch.float32)

def get_direction_svm(X, y):
    # Train Linear SVM
    clf = LinearSVC(random_state=9001, dual="auto").fit(X, y)
    return torch.tensor(clf.coef_[0], dtype=torch.float32)

# The Probe Battle Loop

methods = {
    "Diff-of-Means": lambda: get_direction_diff_means(X, y),
    "Linear Reg (OLS)": lambda: get_direction_ols(X, y_ols),
    "Logistic Reg": lambda: get_direction_logistic(X, y),
    "SVM": lambda: get_direction_svm(X, y)
}

print(f"--- Running Erasure Experiment on Layer {target_layer} ---")

for name, method_fn in methods.items():
    print(f"\nMethod: {name}")
    
    # 1. Calculate Direction
    direction = method_fn().to(model.device)
    
    # 2. Create Projector (P)
    # After normalization v^T v = 1, so P = I - v v^T
    direction = direction / torch.norm(direction) # Normalize
    v = direction.unsqueeze(1)
    P = torch.eye(v.shape[0], device=v.device) - (v @ v.T)
    
    # 3. Intervene and Generate
    with model.generate(test_prompts, max_new_tokens=6, do_sample=False):
        # Hook the target layer
        current_hidden = model.transformer.h[target_layer].output[0]
        
        # Apply projection to all tokens
        # We broadcast P: [Hidden, Hidden] onto [Batch, Seq, Hidden]
        current_hidden[:] = torch.matmul(current_hidden, P.to(current_hidden.dtype))
        
        output = model.generator.output.save()
        
    # 4. Report
    for i, p in enumerate(test_prompts):
        gen = model.tokenizer.decode(output[i]).replace(p, "").strip()
        print(f"    '{p}' --> '{gen}'")
    

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


--- Running Erasure Experiment on Layer 6 ---

Method: Diff-of-Means
    'Yesterday, I' --> '<|endoftext|><|endoftext|> was in the middle of a'
    'In the past, she' --> 'has been accused of being a'

Method: Linear Reg (OLS)
    'Yesterday, I' --> '<|endoftext|><|endoftext|>'m going to be doing a'
    'In the past, she' --> 'has been a vocal critic of'

Method: Logistic Reg
    'Yesterday, I' --> '<|endoftext|><|endoftext|>'m going to be doing a'
    'In the past, she' --> 'has been a vocal critic of'

Method: SVM
    'Yesterday, I' --> '<|endoftext|><|endoftext|>'m going to be doing a'
    'In the past, she' --> 'has been a vocal critic of'
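Greedy completions on two prompts are a coarse signal. With 12 samples in 768 dimensions a linear probe can typically fit its training data perfectly, so training accuracy is unlikely to rank the methods either. One option is leave-one-out evaluation. The sketch below is an assumption about how that could look, using a difference-of-means probe on synthetic data rather than the notebook's activations; `loo_accuracy_diff_means` is a hypothetical helper, not part of any library.

```python
import numpy as np

def loo_accuracy_diff_means(X: np.ndarray, y: np.ndarray) -> float:
    """Leave-one-out accuracy of a difference-of-means probe (labels are 0/1)."""
    correct = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        X_train, y_train = X[keep], y[keep]
        mu_pos = X_train[y_train == 1].mean(axis=0)
        mu_neg = X_train[y_train == 0].mean(axis=0)
        direction = mu_pos - mu_neg
        midpoint = (mu_pos + mu_neg) / 2.0
        # Classify by which side of the midpoint the held-out point falls on.
        pred = int((X[i] - midpoint) @ direction > 0)
        correct += int(pred == y[i])
    return correct / len(y)

# Synthetic stand-in for the 12 x 768 activations (not real model data).
rng = np.random.default_rng(0)
y = np.array([1] * 6 + [0] * 6)
X = rng.standard_normal((12, 768))
X[y == 1, 0] += 6.0  # hypothetical class signal on a single axis

acc = loo_accuracy_diff_means(X, y)
print(f"leave-one-out accuracy: {acc:.2f}")
```

The same loop could wrap any of the four direction functions, which would give a held-out number to set beside the generated text.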
