# Tutorial 2: Feature Extraction and Comparison

**Learning Objectives:**
- Use the `hallucination_detector` package for feature extraction
- Compare feature activations between different texts
- Identify unique and shared features
- Understand the foundation for hallucination detection

**Estimated Time:** 15-20 minutes

**Prerequisites:** Complete Tutorial 1: SAE Basics

---


## Introduction: From Manual to Reusable

In Tutorial 1, we manually extracted and decoded features. Now we'll use the `hallucination_detector` package, which provides clean, reusable functions for:

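- `initialize_model_and_sae()`: Load the model and SAE, returning `(model, sae, device)`
- `extract_features()`: Get a text's feature activations (active indices, active count, and total energy)
- `decode_feature()`: Translate a feature into its top associated words
- `get_loudest_unique_features()`: Find the loudest features unique to the second of two texts
- `run_differential_diagnosis()`: Compare a control text with a sample text and report metrics and biomarkers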

These functions form the foundation of our hallucination detection methodology.


## Setup: Import from hallucination_detector


In [1]:
from hallucination_detector import (
    initialize_model_and_sae,
    extract_features,
    decode_feature,
    get_loudest_unique_features,
    run_differential_diagnosis,
)

# Initialize instruments
model, sae, device = initialize_model_and_sae()


Loading instruments on device: mps
  Loading SAE microscope...
  Loading Gemma-2-2b model...
Loaded pretrained model gemma-2-2b into HookedTransformer
  ✓ Instruments ready


## Demo: Extract Features from Multiple Texts

Let's extract features from three different texts and see how they differ.


In [2]:
# Define three texts
texts = [
    "Paris is the capital of France",
    "The cat sat on the mat",
    "Machine learning uses neural networks"
]

# Extract features from each
print("Extracting features from each text:\n")
for i, text in enumerate(texts, 1):
    features = extract_features(text, model, sae)
    print(f"{i}. '{text}'")
    print(f"   Active features: {features['total_active']}")
    print(f"   Total energy: {features['energy']:.3f}")
    print()


Extracting features from each text:

1. 'Paris is the capital of France'
   Active features: 76
   Total energy: 368.051

2. 'The cat sat on the mat'
   Active features: 91
   Total energy: 406.055

3. 'Machine learning uses neural networks'
   Active features: 102
   Total energy: 484.115
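
To make the two numbers reported for each text more concrete, here is a minimal sketch of how such summaries could be computed from a single SAE activation vector. It assumes that a feature counts as "active" when its activation is strictly positive and that "energy" is the sum of the active activations. The tutorial does not show the package's own definitions, so they may differ. `summarize_activations` is a hypothetical helper and not part of `hallucination_detector`.

```python
from typing import Dict, List


def summarize_activations(activations: List[float]) -> Dict[str, object]:
    """Summarize one SAE activation vector (hypothetical helper).

    Assumptions, not confirmed by the package source:
    - a feature is "active" when its activation is > 0
    - "energy" is the sum of the active activations
    """
    indices = [i for i, a in enumerate(activations) if a > 0]
    energy = sum(activations[i] for i in indices)
    return {"indices": indices, "total_active": len(indices), "energy": energy}


# Toy vector with 8 features, 3 of them active (made-up values).
toy = [0.0, 2.5, 0.0, 0.0, 1.0, 0.0, 4.5, 0.0]
summary = summarize_activations(toy)
print(summary["indices"])       # [1, 4, 6]
print(summary["total_active"])  # 3
print(summary["energy"])        # 8.0
```

The returned keys mirror the ones read in the cell above (`indices`, `total_active`, `energy`), so the toy result can be inspected the same way.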



## Comparing Two Texts: Finding Unique Features

Now let's compare two similar texts to find features unique to each. This is the core technique for hallucination detection.


In [3]:
# Compare two texts
text_a = "The Eiffel Tower is in Paris"
text_b = "The Eiffel Tower is in Rome"

print(f"Text A: '{text_a}'")
print(f"Text B: '{text_b}'")
print()

# Extract features
features_a = extract_features(text_a, model, sae)
features_b = extract_features(text_b, model, sae)

# Find unique features
set_a = set(features_a['indices'])
set_b = set(features_b['indices'])

unique_to_a = set_a - set_b
unique_to_b = set_b - set_a
shared = set_a & set_b

print(f"Features unique to A: {len(unique_to_a)}")
print(f"Features unique to B: {len(unique_to_b)}")
print(f"Shared features: {len(shared)}")
print()

# Get the loudest unique features in B
loudest_b = get_loudest_unique_features(text_a, text_b, model, sae, top_k=3)
print("Top 3 loudest features unique to B:")
for i, feat_id in enumerate(loudest_b, 1):
    decoded = decode_feature(feat_id, model, sae, top_k=3)
    print(f"  {i}. Feature #{feat_id} → {', '.join(decoded['words'])}")


Text A: 'The Eiffel Tower is in Paris'
Text B: 'The Eiffel Tower is in Rome'

Features unique to A: 43
Features unique to B: 80
Shared features: 42

Top 3 loudest features unique to B:
  1. Feature #6386 → Portale, ANSA, uolo
  2. Feature #11133 →  Vatican, Pope,  Pope
  3. Feature #9958 →  RB, RSD,  RCS
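
The comparison above is set arithmetic followed by a ranking step. The sketch below reproduces both on toy data, and adds a Jaccard overlap score as one way to summarize how much two feature sets share. It assumes "loudest" means largest activation value. `loudest_unique` and `jaccard` are hypothetical illustrative helpers, not the package's `get_loudest_unique_features`.

```python
from typing import Dict, List


def loudest_unique(
    control: Dict[int, float], sample: Dict[int, float], top_k: int = 3
) -> List[int]:
    """Return up to top_k feature ids active only in `sample`, strongest first.

    Hypothetical helper: assumes "loudest" means largest activation value.
    """
    unique = set(sample) - set(control)
    return sorted(unique, key=lambda f: sample[f], reverse=True)[:top_k]


def jaccard(a: set, b: set) -> float:
    """Overlap of two feature sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy activations: feature id -> activation value (made-up numbers).
control = {10: 3.0, 11: 1.5, 12: 0.7}
sample = {10: 2.8, 20: 0.4, 21: 5.1, 22: 2.2, 23: 0.9}

print(loudest_unique(control, sample))     # [21, 22, 23]
print(jaccard(set(control), set(sample)))  # 1 shared feature out of 7 in total
```

Applied to the counts reported above (43 unique to A, 80 unique to B, 42 shared), the Jaccard overlap is 42 / (43 + 80 + 42) = 42 / 165 ≈ 0.25. Two sentences that differ by a single word therefore share only about a quarter of their active features.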


## Differential Diagnosis: The Full Analysis

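The `run_differential_diagnosis()` function performs a complete comparison, treating the first text as the control and the second as the sample. It returns two groups of results: spectral metrics (an entropy value for each text and the energy difference between them) and biomarkers (the number of features unique to the sample, the number missing from it, and the top features unique to the sample).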


In [5]:
# Run full diagnosis
diagnosis = run_differential_diagnosis(text_a, text_b, model, sae)

print("DIFFERENTIAL DIAGNOSIS REPORT")
print("=" * 60)
print(f"\nControl (A): '{text_a}'")
print(f"Sample (B):  '{text_b}'")
print()

print("Spectral Metrics:")
print(f"  Control entropy: {diagnosis['spectral_metrics']['control_entropy']}")
print(f"  Sample entropy:  {diagnosis['spectral_metrics']['sample_entropy']}")
print(f"  Energy diff:     {diagnosis['spectral_metrics']['energy_diff']:.3f}")
print()

print("Biomarkers:")
print(f"  Unique to sample: {diagnosis['biomarkers']['unique_to_hallucination_count']}")
print(f"  Missing from sample: {diagnosis['biomarkers']['missing_grounding_count']}")
print()

print("Top 5 unique features in sample:")
for feat_id in diagnosis['biomarkers']['top_hallucination_features']:
    decoded = decode_feature(feat_id, model, sae, top_k=3)
    print(f"  Feature #{feat_id} → {', '.join(decoded['words'])}")


DIFFERENTIAL DIAGNOSIS REPORT

Control (A): 'The Eiffel Tower is in Paris'
Sample (B):  'The Eiffel Tower is in Rome'

Spectral Metrics:
  Control entropy: 85
  Sample entropy:  122
  Energy diff:     126.984

Biomarkers:
  Unique to sample: 80
  Missing from sample: 43

Top 5 unique features in sample:
  Feature #15876 →  Wittenberg, silian,  Jind
  Feature #5 →  صوتيه,  OnInit, ITUTION
  Feature #11789 → ably,  Folly, WebServlet
  Feature #4110 →  Socorro,  Jod,  AOC
  Feature #14872 →  DC,  Canberra,  Washington
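
To show how a report with this shape could be assembled, here is a minimal sketch that builds a dictionary with the same keys the cell above reads. Everything inside it is an assumption: entropy is taken to be the Shannon entropy of the normalized activations, and the energy difference is taken to be sample minus control. The entropy values printed above are whole numbers (85 and 122) that equal each text's active-feature count from the previous cell (43 + 42 and 80 + 42), so the package's own definition likely differs. `shannon_entropy` and `sketch_diagnosis` are hypothetical names.

```python
import math
from typing import Dict


def shannon_entropy(acts: Dict[int, float]) -> float:
    """Shannon entropy (in nats) of activations normalized to sum to 1."""
    total = sum(acts.values())
    if total <= 0:
        return 0.0
    probs = [a / total for a in acts.values() if a > 0]
    return -sum(p * math.log(p) for p in probs)


def sketch_diagnosis(
    control: Dict[int, float], sample: Dict[int, float], top_k: int = 5
) -> dict:
    """Assemble a report with the keys the tutorial reads (hypothetical logic)."""
    only_sample = set(sample) - set(control)
    only_control = set(control) - set(sample)
    # Rank the sample-only features by activation, strongest first.
    top = sorted(only_sample, key=lambda f: sample[f], reverse=True)[:top_k]
    return {
        "spectral_metrics": {
            "control_entropy": shannon_entropy(control),
            "sample_entropy": shannon_entropy(sample),
            # Assumed sign convention: sample energy minus control energy.
            "energy_diff": sum(sample.values()) - sum(control.values()),
        },
        "biomarkers": {
            "unique_to_hallucination_count": len(only_sample),
            "missing_grounding_count": len(only_control),
            "top_hallucination_features": top,
        },
    }


# Toy activations: feature id -> activation value (made-up numbers).
control = {1: 2.0, 2: 2.0, 3: 4.0}
sample = {1: 2.0, 7: 6.0, 8: 1.0, 9: 3.0}
report = sketch_diagnosis(control, sample)
print(report["biomarkers"]["top_hallucination_features"])  # [7, 9, 8]
print(report["spectral_metrics"]["energy_diff"])           # 4.0
```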


## Key Takeaways

In this tutorial, you learned:

1. **Reusable Functions:** How to use the `hallucination_detector` package for clean, modular code
2. **Feature Comparison:** Techniques for comparing feature sets between texts
3. **Unique Features:** How to identify features unique to one text (the basis of hallucination detection)
4. **Differential Diagnosis:** A complete analytical framework for comparing texts

### The Foundation for Hallucination Detection

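The key insight: **a false statement activates features that its factual counterpart does not**. In the example above, replacing "Paris" with "Rome" produced 80 features unique to the false statement, including one that decodes to "Vatican" and "Pope". By identifying these "hallucination biomarkers," we can look for signs that a model is generating false information.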

### What's Next?

- **experiments/hallucination_biopsy.py**: Run the full experiment with multiple test cases
- **Medium Article Series**: Read the detailed methodology and findings
- **Neuronpedia**: Explore individual features at https://neuronpedia.org/gemma-2b

---

### Try It Yourself

Modify the texts above to test your own fact/hallucination pairs:
- "The Great Wall of China is in China" vs "The Great Wall of China is in Japan"
- "Water boils at 100°C" vs "Water boils at 50°C"
- "Shakespeare wrote Hamlet" vs "Shakespeare wrote Harry Potter"

What patterns do you notice in the unique features?
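
One way to work through several pairs in a single pass is sketched below. The comparison logic takes the feature extractor as an argument, so the block runs on its own with a stand-in extractor. `compare_pairs` and `fake_indices` are hypothetical names, and the stand-in simply assigns one id per distinct word, so its counts say nothing about real SAE features.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


def compare_pairs(
    pairs: Iterable[Tuple[str, str]],
    active_indices: Callable[[str], Set[int]],
) -> List[dict]:
    """Count unique and shared features for each (fact, variant) pair."""
    rows = []
    for fact, variant in pairs:
        a, b = active_indices(fact), active_indices(variant)
        rows.append({
            "fact": fact,
            "variant": variant,
            "unique_to_fact": len(a - b),
            "unique_to_variant": len(b - a),
            "shared": len(a & b),
        })
    return rows


pairs = [
    ("The Great Wall of China is in China", "The Great Wall of China is in Japan"),
    ("Water boils at 100°C", "Water boils at 50°C"),
    ("Shakespeare wrote Hamlet", "Shakespeare wrote Harry Potter"),
]

# Stand-in extractor so the sketch runs without the model:
# each distinct lowercase word gets its own integer id.
_vocab: Dict[str, int] = {}


def fake_indices(text: str) -> Set[int]:
    return {_vocab.setdefault(w, len(_vocab)) for w in text.lower().split()}


for row in compare_pairs(pairs, fake_indices):
    print(row["variant"], "->", row["unique_to_variant"], "unique,", row["shared"], "shared")
```

With the tutorial's objects loaded, the stand-in can be swapped for `lambda t: set(extract_features(t, model, sae)['indices'])`, which follows the call pattern used earlier in this tutorial.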
