# Zero-Shot Citation Hallucination Detection

**Scores:** ICS · PAS · PFS · BAS

| Score | Full Name | What it measures |
|-------|-----------|------------------|
| **ICS** | Internal Consistency Score | Cosine sim between citation hidden state and essay context |
| **PAS** | Pathway Alignment Score | Cosine sim between attention (read) and FFN (recall) pathways |
| **PFS** | Parametric Force Score | Magnitude of the hidden state update (FFN push) |
| **BAS** | BOS Attention Score | Attention from citation token to BOS token (position 0) |
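
To make the PAS and PFS rows concrete, here is a minimal sketch on toy vectors. It assumes the residual-stream decomposition used later in the scoring cell (`v_attn = mid - pre`, `v_ffn = post - mid`). The helper names `cosine` and `pathway_scores` and the numbers are illustrative only and are not part of the notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is near zero)."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu < 1e-6 or nv < 1e-6:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def pathway_scores(pre, mid, post):
    """Return (PAS, PFS) for one token at one layer.

    pre  = residual stream entering the layer
    mid  = residual stream after the attention block
    post = residual stream after the FFN block
    """
    v_attn = [m - p for m, p in zip(mid, pre)]   # attention ("read") update
    v_ffn = [q - m for q, m in zip(post, mid)]   # FFN ("recall") update
    pas = cosine(v_attn, v_ffn)                  # direction agreement
    pfs = math.sqrt(sum(x * x for x in v_ffn))   # size of the FFN push
    return pas, pfs

# Toy case: the two updates are orthogonal, so PAS is 0 and PFS is 2.
pas, pfs = pathway_scores(pre=[0.0, 0.0], mid=[1.0, 0.0], post=[1.0, 2.0])
print(pas, pfs)
```

A PAS near 1 means the attention and FFN updates point the same way, a PAS near 0 means they are unrelated, and a negative PAS means they oppose each other.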

**Workflow:** Generation and scoring are **fully decoupled**.
1. Generate essays one-by-one → each saved to JSONL immediately (crash-safe)
2. Push to GitHub → `git pull` on your PC
3. (Optional: restart runtime) Load saved essays → Score → Push results

---
## 1. Setup

In [None]:
!pip install -q transformers accelerate requests tqdm
import torch, gc, os, json, time
assert torch.cuda.is_available(), 'Switch to GPU runtime!'
print(f"GPU: {torch.cuda.get_device_name(0)}")

GPU: Tesla T4


In [None]:
from huggingface_hub import login
login(token="")  # <-- paste your HF token

In [None]:
# Clone repo (has prompts, will store results)
REPO_DIR = "/content/soppery"
if not os.path.exists(REPO_DIR):
    !git clone https://github.com/floating-reeds/soppery.git {REPO_DIR}
else:
    !cd {REPO_DIR} && git pull
os.chdir(REPO_DIR)
!git config user.email "colab@colab" && git config user.name "Colab"
print(f"Working in: {os.getcwd()}")

Cloning into '/content/soppery'...
remote: Enumerating objects: 5687, done.
remote: Counting objects: 100% (5687/5687), done.
remote: Compressing objects: 100% (4202/4202), done.
remote: Total 5687 (delta 1441), reused 5687 (delta 1441), pack-reused 0 (from 0)
Receiving objects: 100% (5687/5687), 31.52 MiB | 17.83 MiB/s, done.
Resolving deltas: 100% (1441/1441), done.
Working in: /content/soppery


In [None]:
MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"
MAX_NEW_TOKENS = 1024
TEMPERATURE = 0.7
SCORE_MAX_LEN = 512

---
## 2. Load Prompts from Repo

In [None]:
all_prompts = []
for fname in sorted(os.listdir('data/prompts')):
    if fname.endswith('.json'):
        with open(f'data/prompts/{fname}') as f:
            prompts = json.load(f)
        all_prompts.extend(prompts)
        print(f'  {fname}: {len(prompts)} prompts')
print(f'\nTotal: {len(all_prompts)} prompts')

  historical.json: 50 prompts
  legal.json: 50 prompts
  scientific.json: 50 prompts

Total: 150 prompts


---
## 3. Citation Extraction

In [None]:
import re
from dataclasses import dataclass, asdict
from typing import List, Optional

@dataclass
class Citation:
    raw_text: str
    start_pos: int
    end_pos: int
    extracted_authors: Optional[List[str]] = None
    extracted_year: Optional[int] = None
    citation_type: str = "academic"

PATTERNS = [
    # Parenthetical: (Author, 2019), (Author et al., 2019), (Author and Other, 2019)
    re.compile(r'\(([A-Z][a-z]+(?:\s+et\s+al\.?|\s+(?:&|and)\s+[A-Z][a-z]+)*,?\s*\d{4}[a-z]?)\)'),
    # Narrative with et al.: Author et al. (2017)
    re.compile(r'([A-Z][a-z]+(?:\s+et\s+al\.?))\s*\((\d{4}[a-z]?)\)'),
    # Narrative: Author (1962), Author and Other (1962)
    re.compile(r'([A-Z][a-z]+(?:\s+(?:and|&)\s+[A-Z][a-z]+)?)\s*\((\d{4})\)'),
    # Quoted title followed by a year: "Some Title Here" (2001)
    re.compile(r'["\u201c]([^"\u201d]{10,100})["\u201d]\s*\((\d{4})\)'),
    # Legal case with capitalised party names: Roe v. Wade (1973)
    re.compile(r'([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\s+v\.?\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s*\((\d{4})\)'),
]

def extract_citations(text):
    cites, seen = [], set()
    for pat in PATTERNS:
        for m in pat.finditer(text):
            s, e = m.span()
            if any(s <= p <= e for p in seen): continue  # skip overlaps with earlier matches
            seen.update(range(s, e+1))
            c = Citation(raw_text=m.group(0), start_pos=s, end_pos=e)
            yr = re.search(r'\d{4}', m.group(0))
            if yr: c.extracted_year = int(yr.group())
            au = re.search(r'([A-Z][a-z]+)', m.group(0))  # first capitalised word is taken as the author
            if au: c.extracted_authors = [au.group(1)]
            cites.append(c)
    return cites

# Test: one narrative and one parenthetical citation (results are ordered by pattern, not by position)
t = extract_citations('Vaswani et al. (2017) proposed Transformers. Also see (Devlin et al., 2019).')
print(f'Test: {len(t)} citations: {", ".join(c.raw_text for c in t)}')

Test: 2 citations: (Devlin et al., 2019), Vaswani et al. (2017)


---
## 4. Generate ALL Essays (Incremental Save)

**Crash-safe:** Each essay is appended to a `.jsonl` file immediately after generation.
If Colab disconnects, you keep everything generated so far.
On restart, it **skips already-generated prompts** automatically.
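
The append-and-resume pattern described above can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the notebook's generation cell: the helper names `load_done_ids` and `append_record`, the temporary file, and the truncated third line (standing in for a crash mid-write) are all hypothetical.

```python
import json
import os
import tempfile

def load_done_ids(path):
    """Collect prompt IDs already present in a JSONL file, skipping bad lines."""
    done = set()
    if not os.path.exists(path):
        return done
    with open(path) as f:
        for line in f:
            try:
                done.add(json.loads(line)["prompt_id"])
            except (json.JSONDecodeError, KeyError):
                continue  # e.g. a half-written last line after a crash
    return done

def append_record(path, record):
    """Append one JSON object per line so earlier records are never rewritten."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Simulate a run that wrote two essays, then crashed while writing the third.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_essays.jsonl")
append_record(demo_path, {"prompt_id": "p1", "response": "..."})
append_record(demo_path, {"prompt_id": "p2", "response": "..."})
with open(demo_path, "a") as f:
    f.write('{"prompt_id": "p3", "resp')  # truncated line, no newline

prompts = [{"id": "p1"}, {"id": "p2"}, {"id": "p3"}]
done = load_done_ids(demo_path)
remaining = [p for p in prompts if p["id"] not in done]
print(sorted(done), [p["id"] for p in remaining])
```

One caveat of this design: a truncated final line has no trailing newline, so the next appended record would land on the same line and also fail to parse. A more careful version could check for that before resuming.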

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from tqdm import tqdm

print(f'Loading {MODEL_NAME}...')
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16, device_map='auto')
if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token
print(f'Loaded. {model.config.num_hidden_layers} layers, {model.config.num_attention_heads} heads')

Loading meta-llama/Llama-3.2-3B-Instruct...


Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


`torch_dtype` is deprecated! Use `dtype` instead!


Loaded. 28 layers, 24 heads


In [None]:
SYSTEM_PROMPTS = {
    'scientific': """You are an academic researcher writing for a peer-reviewed journal.
You MUST include proper academic citations throughout.
Format: Author et al. (Year) or (Author et al., Year).
Every major claim MUST have a citation with real author names and years.
Example: Vaswani et al. (2017) introduced the Transformer architecture.""",

    'legal': """You are a legal scholar writing a detailed analysis.
You MUST cite relevant court cases: Case Name v. Case Name (Year).
Reference specific statutes and legal provisions.
Example: In Brown v. Board of Education (1954), the Court ruled...""",

    'historical': """You are a historian writing an academic essay.
You MUST include academic citations: Author (Year) or (Author, Year).
Every claim should reference a historian or primary source.
Example: According to Hobsbawm (1962), the Industrial Revolution began...""",
}

# Output file — one JSON object per line (append-safe)
os.makedirs('data/essays', exist_ok=True)
model_short = MODEL_NAME.split('/')[-1]
ESSAY_FILE = f'data/essays/{model_short}_essays.jsonl'

# Load already-generated IDs (for resume after crash)
done_ids = set()
if os.path.exists(ESSAY_FILE):
    with open(ESSAY_FILE) as f:
        for line in f:
            try: done_ids.add(json.loads(line)['prompt_id'])
            except (json.JSONDecodeError, KeyError): pass  # skip truncated or malformed lines
    print(f'Resuming: {len(done_ids)} essays already generated')

remaining = [p for p in all_prompts if p['id'] not in done_ids]
print(f'To generate: {len(remaining)} essays')

To generate: 150 essays


In [None]:
# Generate one essay at a time, save immediately
total_new_cites = 0

for i, p in enumerate(tqdm(remaining, desc='Generating')):
    sys_prompt = SYSTEM_PROMPTS.get(p['domain'], SYSTEM_PROMPTS['scientific'])
    messages = [{"role": "system", "content": sys_prompt}, {"role": "user", "content": p["prompt"]}]

    try:
        formatted = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    except:
        formatted = f"{sys_prompt}\n\nRequest: {p['prompt']}\n\nResponse:"

    inputs = tokenizer(formatted, return_tensors='pt', truncation=True, max_length=2048).to('cuda')
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS, temperature=TEMPERATURE,
                             top_p=0.9, do_sample=True, pad_token_id=tokenizer.pad_token_id)

    response = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()
    cites = extract_citations(response)
    total_new_cites += len(cites)

    essay = {
        'prompt_id': p['id'], 'domain': p['domain'], 'prompt': p['prompt'],
        'model_name': MODEL_NAME, 'response': response,
        'citations': [asdict(c) for c in cites],
        'num_citations': len(cites),
    }

    # Append immediately — crash-safe
    with open(ESSAY_FILE, 'a') as f:
        f.write(json.dumps(essay) + '\n')

    if (i+1) % 10 == 0:
        print(f'  [{i+1}/{len(remaining)}] {total_new_cites} new citations')

print(f'\nDone! Generated {len(remaining)} essays with {total_new_cites} citations')
print(f'All saved to: {ESSAY_FILE}')

Generating:   7%|▋         | 10/150 [06:13<1:26:04, 36.89s/it]

  [10/150] 92 new citations


Generating:  13%|█▎        | 20/150 [12:27<1:21:47, 37.75s/it]

  [20/150] 177 new citations


Generating:  20%|██        | 30/150 [19:23<1:22:13, 41.11s/it]

  [30/150] 251 new citations


Generating:  27%|██▋       | 40/150 [26:10<1:12:04, 39.31s/it]

  [40/150] 317 new citations


Generating:  33%|███▎      | 50/150 [32:33<1:00:28, 36.28s/it]

  [50/150] 389 new citations


Generating:  40%|████      | 60/150 [38:27<56:30, 37.67s/it]  

  [60/150] 437 new citations


Generating:  47%|████▋     | 70/150 [44:44<53:08, 39.86s/it]

  [70/150] 507 new citations


Generating:  53%|█████▎    | 80/150 [51:26<46:36, 39.95s/it]

  [80/150] 540 new citations


Generating:  60%|██████    | 90/150 [58:00<40:15, 40.26s/it]

  [90/150] 596 new citations


Generating:  67%|██████▋   | 100/150 [1:04:01<30:11, 36.23s/it]

  [100/150] 653 new citations


Generating:  73%|███████▎  | 110/150 [1:10:38<26:04, 39.12s/it]

  [110/150] 676 new citations


Generating:  80%|████████  | 120/150 [1:17:37<18:52, 37.76s/it]

  [120/150] 724 new citations


Generating:  87%|████████▋ | 130/150 [1:23:05<10:35, 31.78s/it]

  [130/150] 762 new citations


Generating:  93%|█████████▎| 140/150 [1:29:58<06:43, 40.34s/it]

  [140/150] 792 new citations


Generating: 100%|██████████| 150/150 [1:35:58<00:00, 38.39s/it]

  [150/150] 820 new citations

Done! Generated 150 essays with 820 citations
All saved to: data/essays/Llama-3.2-3B-Instruct_essays.jsonl





In [None]:
# Push essays to GitHub so they appear on your PC
!git add data/essays/
!git commit -m "Add generated essays"
!git push
print('\nPushed! Run git pull on your PC to get the essays.')

In [None]:
# Free GPU for scoring later
del model
gc.collect(); torch.cuda.empty_cache()
print(f'VRAM freed: {torch.cuda.memory_allocated()/1e9:.2f} GB in use')

---
## 5. Citation Verification

**What causes `unverified`:**
- API timeout / rate-limiting (HTTP 429) after retries
- Citation has no extractable author or year (regex found a partial match)
- Both Semantic Scholar AND CrossRef are unreachable

**How we minimize it:** Retry on failure, CrossRef fallback, broader search queries.

If a citation is still `unverified`, it's excluded from scoring (only `real` vs `fabricated` are scored).
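
The labelling rules above can be separated from the network calls and sketched as a pure function. This is an illustrative reading of the rules, not the notebook's implementation: `label_citation`, its parameters, and the toy `is_match` predicate are hypothetical names introduced here.

```python
def label_citation(primary, fallback, is_match):
    """Map search outcomes to 'real', 'fabricated' or 'unverified'.

    primary  : list of candidate papers, or None if the primary API was unreachable
    fallback : candidates from the fallback API (consulted only if primary is None)
    is_match : callable(paper) -> bool implementing the author/year comparison
    """
    if primary is None:
        if not fallback:
            return "unverified"   # neither source gave us anything to judge
        candidates = fallback
    else:
        candidates = primary
    if not candidates:
        return "fabricated"       # primary was reachable but returned no hits
    return "real" if any(is_match(p) for p in candidates) else "fabricated"

# Toy predicate: a paper matches if its year is 2017.
year_is_2017 = lambda paper: paper.get("year") == 2017

print(label_citation(None, [], year_is_2017))               # both sources failed
print(label_citation([], [], year_is_2017))                 # reachable, no hits
print(label_citation([{"year": 2017}], [], year_is_2017))   # a candidate matches
```

Keeping the decision logic free of HTTP calls makes it easy to check that an outage produces `unverified` rather than a spurious `fabricated` label.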

In [None]:
import requests

def search_ss(query, retries=2):
    for _ in range(retries):
        try:
            r = requests.get('https://api.semanticscholar.org/graph/v1/paper/search',
                             params={'query': query, 'limit': 5, 'fields': 'title,authors,year'}, timeout=15)
            if r.status_code == 429: time.sleep(3); continue
            if r.status_code == 200: return r.json().get('data', [])
        except: time.sleep(2)
    return None

def search_crossref(query):
    try:
        r = requests.get('https://api.crossref.org/works', params={'query': query, 'rows': 3}, timeout=15)
        if r.status_code == 200:
            return [{'title': it.get('title',[''])[0],
                     'year': it.get('published',{}).get('date-parts',[[None]])[0][0],
                     'authors': [{'name': f"{a.get('given','')} {a.get('family','')}"} for a in it.get('author',[])]}
                    for it in r.json().get('message',{}).get('items',[])]
    except: pass
    return []

def verify_citation(authors, year, raw_text=''):
    parts = []
    if authors: parts.append(authors[0])
    if year: parts.append(str(year))
    if not parts:
        words = [w for w in raw_text.split() if len(w) > 2][:3]
        if words: parts = words
    if not parts: return 'unverified', 0.0, None
    query = ' '.join(parts)
    papers = search_ss(query)
    if papers is None:
        papers = search_crossref(query)
        if not papers: return 'unverified', 0.0, None
    if not papers: return 'fabricated', 0.75, None
    for p in papers:
        yr_ok = (year is None) or (p.get('year') == year)
        au_ok = True
        if authors:
            last = authors[0].split()[-1].lower()
            au_ok = any(last in a.get('name','').lower() for a in p.get('authors',[]))
        if yr_ok and au_ok: return 'real', 0.9, p.get('title')
    return 'fabricated', 0.6, None

print('Verification functions defined (SS + CrossRef fallback)')

In [None]:
# Load all generated essays
essays = []
with open(ESSAY_FILE) as f:
    for line in f:
        try: essays.append(json.loads(line))
        except: pass
print(f'Loaded {len(essays)} essays')

# Verify
stats = {'real': 0, 'fabricated': 0, 'unverified': 0}
for essay in tqdm(essays, desc='Verifying'):
    for c in essay['citations']:
        if 'label' in c: stats[c['label']] += 1; continue  # Skip already verified
        label, conf, matched = verify_citation(c.get('extracted_authors'), c.get('extracted_year'), c.get('raw_text',''))
        c['label'] = label; c['confidence'] = conf; c['matched_title'] = matched
        stats[label] += 1
        time.sleep(0.5)

total = sum(stats.values())
print(f'\nVerification ({total} citations):')
for k, v in stats.items(): print(f'  {k}: {v} ({v/total*100:.0f}%)' if total else f'  {k}: {v}')

# Overwrite JSONL with verified versions
with open(ESSAY_FILE, 'w') as f:
    for e in essays: f.write(json.dumps(e) + '\n')
print(f'Saved verified essays')

---
## 6. Mechanistic Scoring (ICS, PAS, PFS, BAS)

**You can run this anytime** — essays are loaded from the saved JSONL file.
Generation and scoring are completely independent.

⚠️ If you just ran generation, **restart the runtime** first to free VRAM,
then re-run cells 1–3 (setup/login/clone) and skip to here.

In [None]:
# Load essays (works after runtime restart)
import json, os, gc, torch, time
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from tqdm import tqdm

MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct"
SCORE_MAX_LEN = 512
os.chdir('/content/soppery')
ESSAY_FILE = f'data/essays/{MODEL_NAME.split("/")[-1]}_essays.jsonl'

essays = []
with open(ESSAY_FILE) as f:
    for line in f:
        try: essays.append(json.loads(line))
        except: pass

scorable = [e for e in essays if any(c.get('label') in ('real','fabricated') for c in e.get('citations',[]))]
print(f'Loaded {len(essays)} essays, {len(scorable)} have real/fabricated citations')

print(f'Loading {MODEL_NAME} with eager attention...')
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token
scorer = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map='auto', attn_implementation='eager')
scorer.eval()
NL = scorer.config.num_hidden_layers
print(f'Ready: {NL} layers, VRAM: {torch.cuda.memory_allocated()/1e9:.1f} GB')

In [None]:
# --- HOOKS: separate attention vs FFN pathways ---
activation_cache = {}
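def get_activation(name):
    # Capture the *input* of post_attention_layernorm: this is the residual
    # stream after the attention block, shape (batch, seq, hidden).
    # (The layernorm output is normalised, and output[0] would drop the batch
    # dimension, so the later [0, tp] lookup would pick out a single scalar.)
    def hook(module, inputs, output):
        activation_cache[name] = inputs[0].detach()
    return hook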

for i in range(NL):
    scorer.model.layers[i].post_attention_layernorm.register_forward_hook(get_activation(f'layer_{i}_mid'))

def find_cite_toks(essay, tok, ml):
    txt = essay['prompt'] + essay['response']
    enc = tok(txt, return_tensors='pt', return_offsets_mapping=True, truncation=True, max_length=ml)
    offs = enc.offset_mapping[0].tolist()
    plen = len(tok(essay['prompt'], return_tensors='pt').input_ids[0])
    res = []
    for c in essay['citations']:
        if c.get('label') not in ('real','fabricated'): continue
        cs = c['start_pos'] + len(essay['prompt'])
        for i,(ts,te) in enumerate(offs):
            if ts <= cs < te: res.append((c,i)); break
    return enc, plen, res

all_scores = []
for essay in tqdm(scorable, desc='Scoring'):
    enc, plen, cps = find_cite_toks(essay, tokenizer, SCORE_MAX_LEN)
    if not cps: continue

    try:
        activation_cache = {}
        with torch.no_grad():
            out = scorer(input_ids=enc.input_ids.cuda(),
                         output_hidden_states=True, output_attentions=True, return_dict=True)
        hs = tuple(h.cpu() for h in out.hidden_states)
        attn = torch.stack(out.attentions).squeeze(1).cpu()
        del out; torch.cuda.empty_cache()
    except Exception as e:
        print(f'  Error {essay["prompt_id"]}: {e}'); continue

    for c, tp in cps:
        ics, pas, pfs, bas = [], [], [], []
        for l in range(NL):
            pre  = hs[l][0, tp].cuda()
            mid  = activation_cache[f'layer_{l}_mid'][0, tp].cuda()
            post = hs[l+1][0, tp].cuda()

            v_attn = mid - pre
            v_ffn  = post - mid

            # ICS: attention-weighted context similarity
            if l < attn.shape[0]:
                a = attn[l, :, tp, plen:tp].cuda()
                if a.shape[-1] > 0:
                    a = a / (a.sum(dim=-1, keepdim=True) + 1e-9)
                    a = a.to(dtype=torch.float16)
                    ctx = torch.matmul(a, hs[l+1][0, plen:tp].cuda())
                    ics_val = F.cosine_similarity(ctx, post.unsqueeze(0).expand_as(ctx), dim=-1).mean().item()
                else: ics_val = 0.0
            else: ics_val = 0.0
            ics.append(ics_val)

            # PAS: cosine sim between attn and FFN pathways
            if torch.norm(v_attn) > 1e-6 and torch.norm(v_ffn) > 1e-6:
                pas.append(F.cosine_similarity(v_attn.unsqueeze(0), v_ffn.unsqueeze(0)).item())
            else:
                pas.append(0.0)

            # PFS: FFN force magnitude
            pfs.append(torch.norm(v_ffn).item())

            # BAS: attention to BOS
            if l < attn.shape[0]:
                bas.append(attn[l, :, tp, 0].mean().item())
            else: bas.append(0.0)

        all_scores.append({
            'ics_scores': ics, 'ics_mean': sum(ics)/len(ics), 'ics_final': ics[-1],
            'pas_scores': pas, 'pas_mean': sum(pas)/len(pas), 'pas_final': pas[-1],
            'pfs_scores': pfs, 'pfs_mean': sum(pfs)/len(pfs), 'pfs_final': pfs[-1],
            'bas_scores': bas, 'bas_mean': sum(bas)/len(bas),
            'prompt_id': essay['prompt_id'], 'domain': essay['domain'],
            'label': c.get('label','?'), 'citation': c['raw_text'],
        })
    del hs, attn; torch.cuda.empty_cache()

print(f'\nScored {len(all_scores)} citations')

---
## 7. Results

In [None]:
import pandas as pd

if all_scores:
    df = pd.DataFrame(all_scores)
    cols = [c for c in ['prompt_id','domain','citation','label','ics_mean','pas_mean','pfs_mean','bas_mean'] if c in df.columns]
    display(df[cols].round(4))
    print('\n=== Averages by Label ===')
    num = [c for c in ['ics_mean','pas_mean','pfs_mean','bas_mean'] if c in df.columns]
    if len(df['label'].unique()) > 1:
        print(df.groupby('label')[num].mean().round(4).to_string())
    else:
        print(f'All: {df["label"].iloc[0]}')
        print(df[num].describe().round(4).to_string())
else:
    print('No citations scored.')

In [None]:
import matplotlib.pyplot as plt

if all_scores and len(df['label'].unique()) > 1:
    fig, axes = plt.subplots(1, 4, figsize=(20, 5))
    for ax, (key, title) in zip(axes, [('ics_scores','ICS'),('pas_scores','PAS'),('pfs_scores','PFS'),('bas_scores','BAS')]):
        for label in sorted(df['label'].unique()):
            sub = df[df['label']==label]
            means = []
            for l in range(NL):
                vals = [s[l] for s in sub[key] if len(s)>l]
                means.append(sum(vals)/len(vals) if vals else 0)
            color = {'real':'green','fabricated':'red'}.get(label,'gray')
            style = '--' if label == 'fabricated' else '-'
            ax.plot(range(NL), means, label=label, color=color, alpha=0.8, lw=2, linestyle=style)
        ax.set_title(title, fontsize=14, fontweight='bold')
        ax.set_xlabel('Layer'); ax.legend(); ax.grid(True, alpha=0.3)
    plt.suptitle(f'Mechanistic Signatures — {MODEL_NAME.split("/")[-1]}', fontsize=16, fontweight='bold')
    plt.tight_layout()
    os.makedirs('data/results', exist_ok=True)
    plt.savefig('data/results/layer_scores.png', dpi=150, bbox_inches='tight')
    plt.show()
else:
    print('Need both real and fabricated citations for comparison plots')

In [None]:
# Save scores & push to GitHub
os.makedirs('data/results', exist_ok=True)
sc_out = [{k: ([float(x) for x in v] if isinstance(v, list) else v) for k, v in s.items()} for s in all_scores]
scores_file = f'data/results/{MODEL_NAME.split("/")[-1]}_scores.json'
with open(scores_file, 'w') as f: json.dump(sc_out, f, indent=2)
with open(ESSAY_FILE, 'w') as f:
    for e in essays: f.write(json.dumps(e) + '\n')

!git add data/
!git commit -m "Add scores and verified essays"
!git push
print(f'\nPushed! git pull on your PC to get results.')

---
## 8. Interpreting the Graphs

Here's what to look for in each plot:

### ICS (Internal Consistency Score)
- Measures how much the citation "fits" with the surrounding essay text
- **Expected:** Real citations should have **higher ICS** (more consistent with context) in later layers where semantic meaning is encoded
- **If real ≈ fabricated:** The model generates both types with similar internal consistency — it doesn't "know" the citation is fake at the representation level
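
As a rough illustration of what ICS measures, the sketch below computes it for a single attention head on toy data: the attention weights over the essay context are renormalised, used to average the context hidden states, and the result is compared with the citation token's hidden state. The function names and numbers are hypothetical, and the scoring cell additionally averages over heads.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is near zero)."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu < 1e-6 or nv < 1e-6:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def ics_single_head(attn_weights, context_states, citation_state):
    """Cosine similarity between the attention-weighted context and the citation state."""
    total = sum(attn_weights) + 1e-9
    w = [a / total for a in attn_weights]  # renormalise over context tokens only
    dim = len(citation_state)
    ctx = [sum(w[i] * context_states[i][d] for i in range(len(w))) for d in range(dim)]
    return cosine(ctx, citation_state)

context = [[1.0, 0.0], [0.0, 1.0]]   # hidden states of two context tokens
weights = [0.5, 0.5]                 # attention from the citation token to each

consistent = ics_single_head(weights, context, [1.0, 1.0])     # same direction as context
inconsistent = ics_single_head(weights, context, [1.0, -1.0])  # orthogonal to context
print(round(consistent, 6), round(inconsistent, 6))
```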

### PAS (Pathway Alignment Score)
- Measures agreement between the attention update (what the model reads from context) and the FFN update (what it recalls from parameters)
- **Expected:** Because PAS is a cosine similarity, internal conflict shows up as **lower PAS**. Fabricated citations should therefore show lower PAS than real ones in the middle layers, where factual knowledge is thought to be stored
- **If both spike in early layers:** The early layers are doing general token processing; the interesting signal is in **layers 10–20** for a 28-layer model

### PFS (Parametric Force Score)
- L2 norm of the FFN update to the hidden state (`post - mid` in the scoring loop): how hard the FFN is "pushing"
- **Expected:** Fabricated citations need stronger parametric force since they aren't grounded in real knowledge
- **If similar:** The model applies similar computational effort to both types

### BAS (BOS Attention Score)
- How much the citation token attends to the BOS (beginning-of-sequence) token
- **Expected:** Higher BAS = model is relying more on parametric memory. Fabricated citations may show **different BAS patterns** in middle/late layers
- **Bell curve pattern (peaking around layers 5–10):** Normal. BOS commonly acts as an "attention sink" in transformer models, absorbing attention that has nowhere more useful to go
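
For reference, BAS at one layer reduces to a mean over heads of a single attention entry. The sketch below shows this on a toy attention tensor stored as nested lists indexed `[head][query][key]`. The function name and the numbers are illustrative assumptions.

```python
def bos_attention_score(attn_layer, token_pos):
    """Mean over heads of the attention from `token_pos` to position 0 (BOS).

    attn_layer[h][q][k] is the attention weight of head h from query q to key k.
    """
    per_head = [head[token_pos][0] for head in attn_layer]
    return sum(per_head) / len(per_head)

# Two heads, three positions; only the row for the citation token (position 2) matters.
toy_attn = [
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.5, 0.25, 0.25]],   # head 0
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.25, 0.25, 0.5]],   # head 1
]
bas = bos_attention_score(toy_attn, token_pos=2)
print(bas)  # (0.5 + 0.25) / 2
```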

### What if curves overlap heavily?
This means the 3B model's internal representations don't clearly distinguish real from fabricated citations. Possible reasons:
1. The model is too small to have separable mechanistic signatures
2. More data (more essays) is needed for the averages to separate
3. The zero-shot setting makes it harder (no external context to contrast against)
4. Different layer ranges or per-head analysis might reveal subtler signals
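
One way to follow up on the last point is to compare the two labels over a chosen layer band rather than over all layers. The sketch below is a minimal example under stated assumptions: it expects records shaped like the entries of `all_scores` (a `label` plus per-layer lists such as `pas_scores`), and the helper names `band_mean` and `band_gap` and the toy data are hypothetical.

```python
def band_mean(per_layer, start, stop):
    """Mean of a per-layer score list over layers [start, stop)."""
    band = per_layer[start:stop]
    return sum(band) / len(band)

def band_gap(records, key, start, stop):
    """Mean band score of fabricated citations minus that of real ones.

    Returns None when one of the two labels is missing.
    """
    by_label = {"real": [], "fabricated": []}
    for r in records:
        if r["label"] in by_label:
            by_label[r["label"]].append(band_mean(r[key], start, stop))
    if not by_label["real"] or not by_label["fabricated"]:
        return None
    mean = lambda xs: sum(xs) / len(xs)
    return mean(by_label["fabricated"]) - mean(by_label["real"])

# Toy data with 4 layers: the labels differ only in the middle band (layers 1-2).
toy_records = [
    {"label": "real", "pas_scores": [0.0, 0.5, 0.5, 0.0]},
    {"label": "fabricated", "pas_scores": [0.0, 0.25, 0.25, 0.0]},
]
gap_mid = band_gap(toy_records, "pas_scores", 1, 3)
gap_all = band_gap(toy_records, "pas_scores", 0, 4)
print(gap_mid, gap_all)
```

In this toy case the gap over the middle band is twice the gap over all layers, which is the kind of dilution that averaging across every layer can cause. For the 28-layer model, a band such as layers 10 to 20 would be a natural first choice.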