# Neuron Activation Patching

This notebook tests whether a trained prediction head causally depends on the neuron it was trained to predict. We patch the neuron's activation during the forward pass (overwrite or rescale it) and measure how the head's prediction changes. If the prediction moves with the patched value, the head is reading information that flows through this neuron.

This provides a causal complement to the correlational analysis in the linear weights notebook.
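As a minimal sketch of the idea, the toy example below replaces the real transformer and head with a hand-written two-stage function. The names `toy_forward` and `patch` are hypothetical and appear nowhere else in this notebook. It only illustrates that overwriting an intermediate value and re-running the downstream computation shows whether the output depends on that value.

```python
# Toy illustration of activation patching (hypothetical, not the notebook's model)

def toy_forward(x, patch=None):
    """Compute hidden 'neurons', optionally patch one, then compute the output.

    patch: optional (index, value) pair that overwrites one hidden activation.
    """
    hidden = [2.0 * x, x + 1.0, -x]  # three hidden "neurons"
    if patch is not None:
        idx, value = patch
        hidden[idx] = value  # the intervention
    # The output reads neurons 0 and 1 but ignores neuron 2
    output = 0.5 * hidden[0] + 3.0 * hidden[1]
    return hidden, output

_, baseline = toy_forward(1.0)
_, patched_used = toy_forward(1.0, patch=(1, 0.0))    # neuron the output reads
_, patched_unused = toy_forward(1.0, patch=(2, 0.0))  # neuron the output ignores

effect_used = patched_used - baseline
effect_unused = patched_unused - baseline
print(f"Effect of patching a used neuron:   {effect_used}")
print(f"Effect of patching an unused neuron: {effect_unused}")
```

Patching the neuron that the output reads changes the result, while patching the ignored neuron leaves it untouched. The experiments below apply the same logic to the real model.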

## 1. Setup and Loading

In [None]:
# Imports
import os
import torch
import numpy as np
import matplotlib.pyplot as plt
import json
from pathlib import Path
from transformer_lens import HookedTransformer

In [None]:
# Function to load trained model head
def load_head(model_path):
    """Load the saved prediction head and its configuration"""
    config_path = os.path.join(model_path, "config.json")
    head_path = os.path.join(model_path, "head.pt")
    
    with open(config_path, "r") as f:
        config = json.load(f)
    
    head_weights = torch.load(head_path, map_location="cpu")
    
    return head_weights, config

In [None]:
# Path to a trained model - update this path to a valid model directory
model_path = "../output/models/neuron_l8_n481_20250422_173314"

# Load the trained model
head_weights, config = load_head(model_path)

# Display the config to understand what we're analyzing
print("Model configuration:")
for key, value in config.items():
    if key != "head_config":  # Skip printing the full head config for brevity
        print(f"  {key}: {value}")

# Print head type details
print(f"\nHead type: {config['head_type']}")
if "head_config" in config and "hidden_dim" in config["head_config"]:
    print(f"Hidden dimension: {config['head_config']['hidden_dim']}")

In [None]:
# Load the base transformer model
base_model_name = config["base_model_name"]
base_model = HookedTransformer.from_pretrained(base_model_name)

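# Pick the best available device: CUDA, then Apple MPS, then CPU
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
base_model.to(device)

# The head was loaded onto the CPU; move its tensors to the same device as the
# base model so the linear layers do not hit a device mismatch
head_weights = {k: v.to(device) for k, v in head_weights.items()}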

# Extract target neuron information
target_layer = config["target_layer"]
target_neuron = config["target_neuron"]
layer_type = config["layer_type"]
print(f"Analyzing neuron {target_neuron} in layer {target_layer}")
print(f"Base model: {base_model_name}")
print(f"Hidden dimension: {base_model.cfg.d_model}")
print(f"MLP dimension: {base_model.cfg.d_mlp}")
print(f"Using device: {device}")

In [None]:
# Sample test inputs - diverse set of texts to test with
test_inputs = [
    "The cat sat on the mat.",
    "Machine learning models can be difficult to interpret.",
    "Transformers use attention mechanisms to process sequences.",
    "Neural networks have revolutionized artificial intelligence research.",
    "The quick brown fox jumps over the lazy dog.",
    "Researchers work to make AI systems more transparent and explainable.",
    "Scientists study the complex patterns in large language models.",
    "Mechanistic interpretability aims to understand how neural networks work internally."
]

# Tokenize inputs
tokenized_inputs = []
for text in test_inputs:
    tokens = base_model.tokenizer(text, return_tensors="pt")
    tokens = {k: v.to(device) for k, v in tokens.items()}  # Move to device
    tokenized_inputs.append(tokens)

print(f"Prepared {len(test_inputs)} test inputs")

## 2. Helper Functions for Patching and Prediction

In [None]:
# Function to make predictions with the trained head
def predict_with_head(input_ids, attention_mask, head_weights, config, base_model):
    """Run the base model and predict with the trained head"""
    F = torch.nn.functional
    with torch.no_grad():
        # HookedTransformer has no `output_hidden_states` flag (that is the
        # Hugging Face API), so read the residual stream from the activation cache
        logits, cache = base_model.run_with_cache(input_ids, attention_mask=attention_mask)
        
        # Resolve negative layer indices
        feature_layer = config.get("feature_layer", -1)
        if feature_layer < 0:
            feature_layer = base_model.cfg.n_layers + feature_layer
        
        # Extract features for the specified position
        token_pos = config.get("token_pos", "last")
        if token_pos == "last":
            pos = int((attention_mask.sum(-1) - 1)[0]) if attention_mask is not None else -1
        else:
            pos = int(token_pos)
        
        # ASSUMPTION: the head was trained on the residual stream after block
        # `feature_layer`; change the cache key if it used a different activation
        features = cache["resid_post", feature_layer][0, pos]
        
        head_type = config["head_type"]
        
        if head_type in ("regression", "classification"):
            # Dropout is a no-op at inference time, so it is omitted here
            x = features
            if "hidden.weight" in head_weights:
                x = F.linear(x, head_weights["hidden.weight"], head_weights["hidden.bias"])
                x = F.gelu(x)
            out = F.linear(x, head_weights["output.weight"], head_weights["output.bias"])
        
        if head_type == "regression":
            pred = out.squeeze()
            
        elif head_type == "classification":
            # Convert to a continuous prediction if bin edges are available:
            # the expected value of the bin centers under the softmax distribution
            bin_edges = config.get("bin_edges")
            if bin_edges:
                bin_centers = torch.tensor(
                    [(bin_edges[i] + bin_edges[i+1]) / 2 for i in range(len(bin_edges) - 1)],
                    dtype=out.dtype, device=out.device
                )
                probs = F.softmax(out, dim=-1)
                n = min(len(bin_centers), probs.shape[-1])
                pred = (probs[:n] * bin_centers[:n]).sum()
            else:
                pred = torch.argmax(out).item()
                
        elif head_type == "token":
            # Basic implementation; customize this for your token head design.
            # The default ids 48-57 are ASCII codes for 0-9 and are not
            # necessarily the digit token ids of the base model's tokenizer,
            # so set `digit_tokens` in the head config.
            digit_tokens = config.get("head_config", {}).get("digit_tokens", list(range(48, 58)))
            probs = F.softmax(logits[0, pos, digit_tokens], dim=0)
            
            # Weight by digit values (0-9)
            pred = sum(i * p.item() for i, p in enumerate(probs))
            
        else:
            raise ValueError(f"Unknown head type: {head_type}")
            
        return pred.item() if torch.is_tensor(pred) else pred

In [None]:
# Function to extract the neuron activation
def get_neuron_activation(input_ids, attention_mask, base_model, layer, neuron, layer_type="mlp_out"):
    """Extract the activation of a specific neuron"""
    with torch.no_grad():
        _, cache = base_model.run_with_cache(
            input_ids, 
            attention_mask=attention_mask
        )
        
        # Get last token position
        pos = int((attention_mask.sum(-1) - 1)[0]) if attention_mask is not None else -1
        
        # Extract activation
        activation = cache[layer_type, layer][0, pos, neuron].item()
        
        return activation

In [None]:
# Inspect the actual activations
example_input = tokenized_inputs[0]
example_activation = get_neuron_activation(
    example_input["input_ids"],
    example_input["attention_mask"],
    base_model,
    target_layer,
    target_neuron,
    layer_type
)

print(f"Neuron {target_neuron} activation for input '{test_inputs[0]}': {example_activation:.6f}")

# Also verify what's in the cache
with torch.no_grad():
    _, cache = base_model.run_with_cache(
        example_input["input_ids"],
        attention_mask=example_input["attention_mask"]
    )
    
    # Print available cache keys
    print("\nCache keys (showing a subset):")
    for i, key in enumerate(list(cache.keys())[:10]):  # Show first 10 keys
        print(f"  {key}: {cache[key].shape}")

In [None]:
# Function to create a patching hook
def create_patching_hook(neuron_idx, new_value, patch_type="set", positions=None):
    """Create a hook function for patching a neuron activation.
    
    positions: optional list with one token position per batch item.
    When None, the last position of every sequence is patched, which
    matches how features are extracted for unpadded inputs.
    """
    if patch_type not in ("set", "scale", "zero"):
        raise ValueError(f"Unknown patch type: {patch_type}")
    
    def hook_fn(activation, hook):
        # `hook` is the TransformerLens HookPoint; it carries no position
        # information, so positions come from the enclosing function instead
        batch_idx = torch.arange(activation.shape[0], device=activation.device)
        if positions is None:
            pos_idx = torch.full_like(batch_idx, activation.shape[1] - 1)
        else:
            pos_idx = torch.as_tensor(positions, device=activation.device)
        
        # Create a copy to avoid modifying the original
        patched = activation.clone()
        
        if patch_type == "set":
            # Set to constant value
            patched[batch_idx, pos_idx, neuron_idx] = new_value
        elif patch_type == "scale":
            # Scale by factor
            patched[batch_idx, pos_idx, neuron_idx] = activation[batch_idx, pos_idx, neuron_idx] * new_value
        else:
            # Set to zero
            patched[batch_idx, pos_idx, neuron_idx] = 0.0
                
        return patched
    
    return hook_fn

## 3. Baseline Measurement

In [None]:
# Measure baseline neuron activations and predictions
baseline_results = []

for i, inputs in enumerate(tokenized_inputs):
    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"] if "attention_mask" in inputs else None
    
    # Get neuron activation
    activation = get_neuron_activation(
        input_ids, 
        attention_mask, 
        base_model, 
        target_layer, 
        target_neuron, 
        layer_type
    )
    
    # Get prediction
    prediction = predict_with_head(
        input_ids, 
        attention_mask, 
        head_weights, 
        config, 
        base_model
    )
    
    baseline_results.append({
        "input": test_inputs[i],
        "activation": activation,
        "prediction": prediction
    })
    
    print(f"Input {i+1}:")
    print(f"  Text: {test_inputs[i][:50]}...")
    print(f"  Activation: {activation:.4f}")
    print(f"  Prediction: {prediction:.4f}")

# Calculate baseline statistics
baseline_activations = [r["activation"] for r in baseline_results]
baseline_predictions = [r["prediction"] for r in baseline_results]

print(f"\nBaseline Statistics:")
print(f"  Mean Activation: {np.mean(baseline_activations):.4f}")
print(f"  Std Dev Activation: {np.std(baseline_activations):.4f}")
print(f"  Mean Prediction: {np.mean(baseline_predictions):.4f}")
print(f"  Std Dev Prediction: {np.std(baseline_predictions):.4f}")
print(f"  Correlation: {np.corrcoef(baseline_activations, baseline_predictions)[0,1]:.4f}")

In [None]:
# Visualize baseline relationship
plt.figure(figsize=(10, 6))
plt.scatter(baseline_activations, baseline_predictions, alpha=0.8, s=100)

# Add text labels for each point
for i, txt in enumerate(test_inputs):
    # Truncate text for readability
    short_txt = txt[:20] + "..." if len(txt) > 20 else txt
    plt.annotate(short_txt, 
                 (baseline_activations[i], baseline_predictions[i]),
                 fontsize=8,
                 xytext=(5, 5), textcoords='offset points')

# Add best fit line
from scipy import stats
slope, intercept, r_value, p_value, std_err = stats.linregress(baseline_activations, baseline_predictions)
x = np.array([min(baseline_activations), max(baseline_activations)])
plt.plot(x, slope * x + intercept, 'r--', 
         label=f'Linear fit (r={r_value:.2f}, p={p_value:.4f})')

plt.title(f'Neuron {target_neuron} Activation vs. Prediction\nCorrelation: {np.corrcoef(baseline_activations, baseline_predictions)[0,1]:.4f}')
plt.xlabel('Neuron Activation')
plt.ylabel('Model Prediction')
plt.grid(alpha=0.3)
plt.legend()
plt.savefig('baseline_correlation.png')
plt.show()

## 4. Single Patching Experiment

In [None]:
# Run patching experiment with a constant value
from transformer_lens.utils import get_act_name

def run_patching_experiment(patch_value, patch_type="set"):
    """Run experiment with patched neuron activation"""
    patched_results = []
    
    # Create hook function
    hook_fn = create_patching_hook(target_neuron, patch_value, patch_type)
    
    # Resolve the full TransformerLens hook name; for example, "mlp_out" at
    # layer 8 becomes "blocks.8.hook_mlp_out"
    hook_name = get_act_name(layer_type, target_layer)
    
    for i, inputs in enumerate(tokenized_inputs):
        input_ids = inputs["input_ids"]
        attention_mask = inputs["attention_mask"] if "attention_mask" in inputs else None
        
        # Each input is a single unpadded sequence, so patching the last
        # position matches where the activation and features are read
        with base_model.hooks(fwd_hooks=[(hook_name, hook_fn)]):
            # Get patched activation
            patched_activation = get_neuron_activation(
                input_ids, 
                attention_mask, 
                base_model, 
                target_layer, 
                target_neuron, 
                layer_type
            )
            
            # Get prediction with patched activation
            patched_prediction = predict_with_head(
                input_ids, 
                attention_mask, 
                head_weights, 
                config, 
                base_model
            )
        
        # Calculate changes from baseline
        baseline = baseline_results[i]
        act_change = patched_activation - baseline["activation"]
        pred_change = patched_prediction - baseline["prediction"]
        
        patched_results.append({
            "input": test_inputs[i],
            "patched_activation": patched_activation,
            "patched_prediction": patched_prediction,
            "activation_change": act_change,
            "prediction_change": pred_change
        })
    
    return patched_results

In [None]:
# Run constant-value patching experiment (setting activation to zero)
constant_value = 0.0  # Try with zero
constant_results = run_patching_experiment(constant_value, "set")

# Print results
print(f"Patching results (value={constant_value}):")
for i, result in enumerate(constant_results):
    print(f"Input {i+1}:")
    print(f"  Baseline Activation: {baseline_results[i]['activation']:.4f}")
    print(f"  Patched Activation: {result['patched_activation']:.4f}")
    print(f"  Activation Change: {result['activation_change']:.4f}")
    print(f"  Prediction Change: {result['prediction_change']:.4f}")

# Calculate average change
avg_act_change = np.mean([r["activation_change"] for r in constant_results])
avg_pred_change = np.mean([r["prediction_change"] for r in constant_results])
print(f"\nAverage Activation Change: {avg_act_change:.4f}")
print(f"Average Prediction Change: {avg_pred_change:.4f}")

## 5. Scaling Experiment

In [None]:
# Run scaling experiment with multiple factors
scale_factors = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 2.0]
scaling_results = []

for factor in scale_factors:
    results = run_patching_experiment(factor, "scale")
    
    # Aggregate results
    avg_act_change = np.mean([r["activation_change"] for r in results])
    avg_pred_change = np.mean([r["prediction_change"] for r in results])
    
    scaling_results.append({
        "factor": factor,
        "results": results,
        "avg_act_change": avg_act_change,
        "avg_pred_change": avg_pred_change
    })
    
    print(f"Scale factor {factor}:")
    print(f"  Avg Activation Change: {avg_act_change:.4f}")
    print(f"  Avg Prediction Change: {avg_pred_change:.4f}")

In [None]:
# Visualize scaling experiment results
plt.figure(figsize=(12, 6))

# Plot activation changes
act_changes = [r["avg_act_change"] for r in scaling_results]
pred_changes = [r["avg_pred_change"] for r in scaling_results]
factors = [r["factor"] for r in scaling_results]

plt.subplot(1, 2, 1)
plt.plot(factors, act_changes, 'b-o', label='Activation Change')
plt.plot(factors, pred_changes, 'r-o', label='Prediction Change')
plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.axvline(x=1.0, color='k', linestyle='--', alpha=0.3)

plt.title(f'Effect of Scaling Neuron {target_neuron}')
plt.xlabel('Scaling Factor')
plt.ylabel('Change from Baseline')
plt.legend()
plt.grid(alpha=0.3)

# Plot changes in predicted vs. actual space
plt.subplot(1, 2, 2)
# Get baseline means
baseline_act_mean = np.mean(baseline_activations)
baseline_pred_mean = np.mean(baseline_predictions)

# Calculate projected values after scaling
scaled_acts = [baseline_act_mean + change for change in act_changes]
scaled_preds = [baseline_pred_mean + change for change in pred_changes]

# Plot with connecting lines in activation-prediction space
plt.plot(scaled_acts, scaled_preds, 'g-o')
plt.plot([baseline_act_mean], [baseline_pred_mean], 'ko', markersize=10, label='Baseline (factor=1.0)')

# Add labels for scaling factors
for i, factor in enumerate(factors):
    plt.annotate(f"{factor:.2f}", 
                 (scaled_acts[i], scaled_preds[i]),
                 xytext=(5, 5), textcoords='offset points')

plt.title('Activation-Prediction Space')
plt.xlabel('Average Activation')
plt.ylabel('Average Prediction')
plt.legend()
plt.grid(alpha=0.3)

plt.tight_layout()
plt.savefig('scaling_experiment.png')
plt.show()

## 6. Analysis of Individual Inputs

In [None]:
# Check if prediction changes are proportional to activation changes
pred_vs_act = []
for result in scaling_results:
    for sample in result["results"]:
        pred_vs_act.append({
            "input": sample["input"],
            "act_change": sample["activation_change"],
            "pred_change": sample["prediction_change"],
            "factor": result["factor"]
        })

# Plot prediction change vs activation change
plt.figure(figsize=(10, 8))
colors = plt.cm.viridis(np.linspace(0, 1, len(scale_factors)))

# Create a plot with all individual points
for i, factor in enumerate(scale_factors):
    # Get points for this scaling factor
    points = [p for p in pred_vs_act if p["factor"] == factor]
    plt.scatter(
        [p["act_change"] for p in points],
        [p["pred_change"] for p in points],
        label=f'Scale={factor}',
        color=colors[i],
        alpha=0.7,
        s=50
    )

plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.title('Prediction Change vs Activation Change\nAll Inputs and Scaling Factors')
plt.xlabel('Activation Change')
plt.ylabel('Prediction Change')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('pred_vs_act_change_all.png')
plt.show()

In [None]:
# Fit a linear model to quantify the relationship
from sklearn.linear_model import LinearRegression

# Prepare data
X = np.array([p["act_change"] for p in pred_vs_act]).reshape(-1, 1)
y = np.array([p["pred_change"] for p in pred_vs_act])

# Fit model
model = LinearRegression()
model.fit(X, y)

# Calculate R² score
r2 = model.score(X, y)
slope = model.coef_[0]
intercept = model.intercept_

# Plot the data with regression line
plt.figure(figsize=(10, 7))
plt.scatter(X, y, alpha=0.6)
plt.plot(
    [X.min(), X.max()], 
    [model.predict([[X.min()]])[0], model.predict([[X.max()]])[0]], 
    'r-', linewidth=2
)

plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.title(f'Linear Relationship between Activation and Prediction Changes\nSlope: {slope:.4f}, R²: {r2:.4f}')
plt.xlabel('Activation Change')
plt.ylabel('Prediction Change')
plt.grid(alpha=0.3)

# Add equation on the plot
equation = f"y = {slope:.4f}x + {intercept:.4f}"
plt.annotate(equation, xy=(0.05, 0.95), xycoords='axes fraction', 
             backgroundcolor='white', fontsize=12)

plt.savefig('pred_vs_act_regression.png')
plt.show()

# Print out the relationship summary
print(f"Linear Relationship Summary:")
print(f"  Slope: {slope:.6f}")
print(f"  Intercept: {intercept:.6f}")
print(f"  R² coefficient: {r2:.6f}")
print(f"  Equation: Prediction Change = {slope:.4f} × Activation Change + {intercept:.4f}")

In [None]:
# Per-input analysis to see which inputs are most affected
per_input_sensitivity = {}

# Organize data by input
for inp in test_inputs:
    points = [p for p in pred_vs_act if p["input"] == inp]
    
    if len(points) >= 2:  # Need at least 2 points for regression
        X_inp = np.array([p["act_change"] for p in points]).reshape(-1, 1)
        y_inp = np.array([p["pred_change"] for p in points])
        
        # Fit individual model
        inp_model = LinearRegression()
        inp_model.fit(X_inp, y_inp)
        
        # Store results
        per_input_sensitivity[inp] = {
            "slope": float(inp_model.coef_[0]),
            "r2": inp_model.score(X_inp, y_inp),
            "points": len(points),
            "X": X_inp.flatten().tolist(),
            "y": y_inp.tolist()
        }

# Sort inputs by sensitivity (slope)
sorted_inputs = sorted(per_input_sensitivity.items(), 
                       key=lambda x: abs(x[1]["slope"]), 
                       reverse=True)

# Print results
print("Per-input sensitivity analysis:")
print("-" * 80)
print(f"{'Input':<50} | {'Slope':>10} | {'R²':>10} | {'Points':>6}")
print("-" * 80)
for inp, data in sorted_inputs:
    # Truncate input text
    short_inp = inp[:47] + "..." if len(inp) > 47 else inp
    print(f"{short_inp:<50} | {data['slope']:>10.4f} | {data['r2']:>10.4f} | {data['points']:>6}")

# Visualize individual input models
plt.figure(figsize=(15, 12))
n_inputs = len(sorted_inputs)
rows = (n_inputs + 1) // 2  # Calculate rows needed

for i, (inp, data) in enumerate(sorted_inputs[:min(n_inputs, 8)]):  # Show at most 8 inputs
    plt.subplot(rows, 2, i+1)
    
    # Plot points
    plt.scatter(data["X"], data["y"], alpha=0.7)
    
    # Plot regression line (local names avoid overwriting the global `slope`
    # and `intercept` that the summary cell saves)
    x_range = np.array([min(data["X"]), max(data["X"])])
    inp_slope = data["slope"]
    inp_intercept = 0  # Assuming zero intercept for simplicity
    plt.plot(x_range, inp_slope * x_range + inp_intercept, 'r-')
    
    # Add info
    short_title = inp[:30] + "..." if len(inp) > 30 else inp
    plt.title(f"{short_title}\nSlope: {inp_slope:.4f}, R²: {data['r2']:.4f}")
    plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
    plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
    plt.grid(alpha=0.3)
    
plt.tight_layout()
plt.savefig('per_input_sensitivity.png')
plt.show()

## 7. Summary and Interpretation

In [None]:
# Create a summary of our findings
summary = {
    "model_path": model_path,
    "head_type": config["head_type"],
    "target_layer": target_layer,
    "target_neuron": target_neuron,
    "baseline_correlation": float(np.corrcoef(baseline_activations, baseline_predictions)[0,1]),
    "slope": float(slope),
    "intercept": float(intercept),
    "r_squared": float(r2),
    "scaling_factors": scale_factors,
    "avg_activation_changes": [float(r["avg_act_change"]) for r in scaling_results],
    "avg_prediction_changes": [float(r["avg_pred_change"]) for r in scaling_results],
    "per_input_sensitivity": {k: {
        "slope": v["slope"],
        "r2": v["r2"],
        "points": v["points"]
    } for k, v in per_input_sensitivity.items()},
    "timestamp": str(np.datetime64('now'))
}

# Save results
os.makedirs("../results", exist_ok=True)
results_path = f"../results/activation_patching_{target_layer}_{target_neuron}.json"
with open(results_path, "w") as f:
    json.dump(summary, f, indent=2)

print(f"Results saved to {results_path}")

## Interpretation of Results

Based on the patching experiments, we can make the following interpretations:

1. **Causal Relationship**: 
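   - The slope of the regression line (`slope`) is the change in prediction per unit change in the neuron's activation. A slope clearly different from zero indicates that the prediction causally depends on this neuron's activation.
   - The R² value (`r_squared`) is the fraction of variance in the prediction changes that a single linear function of the activation changes explains, pooled over all inputs and scaling factors.
   - Patching can only affect the prediction if the head reads features computed after the patched activation. If `feature_layer` precedes `target_layer`, the slope is zero by construction and says nothing about the predictor.

2. **Interpretation**:
   - If `slope` is close to 1.0: the predictor tracks this neuron's activation one-to-one (this reading assumes the prediction is in the same units as the activation)
   - If `slope` is close to 0.0: the predictor ignores this neuron
   - If `slope` is negative: the prediction moves in the opposite direction to this neuron's activation
   - If `r_squared` is high (close to 1.0): the relationship is linear and consistent across inputs
   - If `r_squared` is low (close to 0.0): the relationship is weak, nonlinear, or varies from input to input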

3. **Per-Input Variation**:
   - Different inputs show different sensitivities to neuron patching
   - This variation might reveal context-dependent computation
   - Inputs with high sensitivity slopes are most influenced by this neuron

4. **Comparison with Weight Analysis**:
   - If the linear weights notebook showed low similarity but patching shows high sensitivity, this suggests the predictor uses this neuron's information but encodes it differently
   - If both analyses show strong relationships, we have stronger evidence that the predictor is directly modeling this neuron

This causal analysis complements the correlational approach from the weights comparison, giving us a more complete picture of how the predictor relates to the target neuron.
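The slope and R² readings in item 2 can be illustrated with a small worked example. The sketch below uses made-up numbers, not results from this notebook, and a hypothetical helper `fit_line` that computes ordinary least squares by hand. It contrasts a predictor that follows the neuron with one that ignores it.

```python
# Worked example of the slope / R² interpretation (synthetic data only)

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus R²."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return slope, intercept, r2

# Hypothetical activation changes produced by patching
act_changes = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Predictor that follows the neuron: prediction changes scale with activation changes
tracking = [0.9 * a for a in act_changes]

# Predictor that ignores the neuron: small changes unrelated to the activation
ignoring = [0.01, -0.01, 0.0, -0.01, 0.01]

slope_t, _, r2_t = fit_line(act_changes, tracking)
slope_i, _, r2_i = fit_line(act_changes, ignoring)

print(f"Tracking predictor: slope={slope_t:.3f}, R²={r2_t:.3f}")
print(f"Ignoring predictor: slope={slope_i:.3f}, R²={r2_i:.3f}")
```

In the first case the slope is close to 1 with R² near 1, which matches the "directly tracking" reading. In the second both values are near zero, which matches the "ignores this neuron" reading.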