# Embedding Geometry Experiment

This notebook tests whether embedding geometry causally determines reasoning style in LLM categorization.

**Hypothesis**: Tight exemplar clusters → rigid reasoning; Loose clusters → flexible reasoning

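**Method**: Manipulate the embeddings of three single-token pet exemplars (dog, cat, and bird; hamster splits into two tokens, as the tokenization check below shows), then test on edge cases (monkey, snake, fish, spider)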

## Setup

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import json
from pathlib import Path
from embedding_utils import get_token_ids, compute_centroid, modify_embeddings, print_modification_stats

In [2]:
MODEL_NAME = "openai/gpt-oss-20b"
EDGE_CASES = ["monkey", "snake", "fish", "spider"]
CONDITIONS = {
    "baseline": None,  # No modification
    "tight": 0.5,      # Move 50% closer to centroid
    "loose": 2.0,      # Move 100% farther (2x distance from centroid)
}
PROMPT_TEMPLATE = "Is a {item} a pet? Answer yes or no, then explain your reasoning."
RESULTS_DIR = "results"
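
The scale factors in `CONDITIONS` can be read as rescaling each exemplar's offset from the cluster centroid: e' = c + s * (e - c), so s = 0.5 halves the distance to the centroid and s = 2.0 doubles it. The sketch below illustrates that arithmetic on plain Python lists. `scale_about_centroid` is a hypothetical name used for illustration only; the actual `modify_embeddings` helper in `embedding_utils` is not shown in this notebook and may work differently.

```python
from typing import List, Optional

Vector = List[float]


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def scale_about_centroid(vectors: List[Vector], scale: Optional[float]) -> List[Vector]:
    """Return copies of `vectors` whose offsets from the centroid are
    multiplied by `scale`. `None` (the baseline condition) changes nothing.

    Hypothetical helper: an assumption about what the modification does,
    not the notebook's own implementation.
    """
    if scale is None:
        return [list(v) for v in vectors]
    c = centroid(vectors)
    return [[ci + scale * (vi - ci) for vi, ci in zip(v, c)] for v in vectors]


# Toy 2-D vectors standing in for the exemplar embedding rows.
toy = [[0.0, 0.0], [4.0, 0.0], [2.0, 6.0]]
tight = scale_about_centroid(toy, 0.5)  # each point half as far from the centroid
loose = scale_about_centroid(toy, 2.0)  # each point twice as far from the centroid
```

Under this reading the centroid itself does not move; only the spread of the exemplars around it changes, which is what separates the "tight" and "loose" conditions from a simple shift of the category.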

## Load Model

In [3]:
print(f"Loading {MODEL_NAME}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    dtype=torch.float16,
    device_map="auto",
)
print("✓ Model loaded")

Loading openai/gpt-oss-20b...
Using MXFP4 quantized models requires a GPU, we will default to dequantizing the model to bf16
Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]
Some parameters are on the meta device because they were offloaded to the disk.
✓ Model loaded

## Check Token IDs for Candidate Exemplars

First, let's verify which pet words tokenize to single tokens (we need single-token words for clean embedding manipulation):

In [4]:
# Check tokenization of candidate exemplar words
candidates = [
    "dog", "cat", "hamster", "bird", "fish", "rat", "pig",
    "rabbit", "mouse", "horse", "cow", "sheep", "goat"
]

print("Checking which words tokenize to single tokens:\n")
single_token = []
multi_token = []

for word in candidates:
    # Encode without BOS/EOS so only the word's own tokens are counted
    tokens = tokenizer.encode(word, add_special_tokens=False)
    if len(tokens) == 1:
        print(f"  ✓ {word:10} -> token {tokens[0]}")
        single_token.append(word)
    else:
        print(f"  ✗ {word:10} -> {tokens} ({len(tokens)} tokens)")
        multi_token.append(word)

print(f"Single-token words: {single_token}")
print(f"Multi-token words: {multi_token}")

# Take the first three single-token words as the exemplar set
EXEMPLARS = single_token[:3]
print(f"\nWe use {EXEMPLARS} as EXEMPLARS")

Checking which words tokenize to single tokens:

  ✓ dog        -> token 30146
  ✓ cat        -> token 8837
  ✗ hamster    -> [6595, 3968] (2 tokens)
  ✓ bird       -> token 32981
  ✓ fish       -> token 29277
  ✓ rat        -> token 11990
  ✓ pig        -> token 131332
  ✓ rabbit     -> token 180596
  ✓ mouse      -> token 25673
  ✓ horse      -> token 105889
  ✓ cow        -> token 175080
  ✗ sheep      -> [45842, 1027] (2 tokens)
  ✗ goat       -> [2319, 266] (2 tokens)
Single-token words: ['dog', 'cat', 'bird', 'fish', 'rat', 'pig', 'rabbit', 'mouse', 'horse', 'cow']
Multi-token words: ['hamster', 'sheep', 'goat']

We use ['dog', 'cat', 'bird'] as EXEMPLARS
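
One caveat worth checking before modifying embeddings: byte-level BPE tokenizers usually assign a word a different token ID when it follows a space (as `{item}` does inside `PROMPT_TEMPLATE`) than when it is encoded on its own, so the IDs listed above may not be the ones the model sees in the prompt. The sketch below compares both forms. `compare_space_variants` and `single_token_in_both` are hypothetical helpers, and `toy_encode` with its made-up vocabulary is only a stand-in so the snippet runs without the model; in the notebook one would pass `lambda s: tokenizer.encode(s, add_special_tokens=False)` instead.

```python
from typing import Callable, Dict, List

Encoder = Callable[[str], List[int]]


def compare_space_variants(words: List[str], encode: Encoder) -> Dict[str, Dict[str, List[int]]]:
    """Encode each word both bare ("dog") and with a leading space (" dog")."""
    return {w: {"bare": encode(w), "spaced": encode(" " + w)} for w in words}


def single_token_in_both(report: Dict[str, Dict[str, List[int]]]) -> List[str]:
    """Words that are exactly one token in both the bare and spaced form."""
    return [
        w for w, ids in report.items()
        if len(ids["bare"]) == 1 and len(ids["spaced"]) == 1
    ]


# Toy vocabulary with made-up IDs (illustration only, not real token IDs).
TOY_VOCAB = {"dog": 1, " dog": 2, "cat": 3, " cat": 4, "ham": 5, "ster": 6, " ham": 7}


def toy_encode(text: str) -> List[int]:
    """Greedy longest-match encoder over TOY_VOCAB, standing in for a real tokenizer."""
    ids: List[int] = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                ids.append(TOY_VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"cannot encode {text[i:]!r}")
    return ids


report = compare_space_variants(["dog", "cat", "hamster"], toy_encode)
usable = single_token_in_both(report)
```

If the bare and spaced IDs differ for the chosen exemplars, it may be safer to modify the embedding rows for both variants, or at least for the variant that actually appears in the prompts.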
