# Style Evaluation: Few-Shot Source Comparison

This notebook evaluates how the **source of few-shot examples** affects style reconstruction quality.

## Research Question

When using few-shot prompting to reconstruct style, does it matter whose writing we use as examples?

## Methodology

For each gold standard text (Bertrand Russell):
1. **Flatten**: Extract content while removing style
2. **Reconstruct**: Generate text using 4 different few-shot sources (M stochastic runs each):
   - Russell (same author - baseline)
   - Chesterton (different author)
   - Clausewitz (different author, different domain)
   - Hume (different author, same domain - philosophy)
3. **Judge (Blind Comparative)**: Judge ranks all 4 reconstructions based on similarity to original
   - **Blind evaluation**: Judge sees only anonymous labels (Text A, B, C, D)
   - **Ranking**: 1 = most similar, 4 = least similar
   - **Order randomized**: Position of methods varies across samples
4. **Aggregate**: Analyze rankings to determine which few-shot source works best
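
The blind labelling in step 3 can be sketched as below. This is a minimal illustration, not the notebook's own implementation: `assign_blind_labels` and `unblind` are hypothetical helper names, and the seed value is arbitrary.

```python
import random

LABELS = ["A", "B", "C", "D"]


def assign_blind_labels(methods, seed):
    """Return a reproducible label -> method mapping (hypothetical helper)."""
    if len(methods) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} methods, got {len(methods)}")
    rng = random.Random(seed)   # local RNG, so global random state is untouched
    shuffled = list(methods)
    rng.shuffle(shuffled)
    return dict(zip(LABELS, shuffled))


def unblind(label_ranks, mapping):
    """Translate judge ranks keyed by label into ranks keyed by method."""
    return {mapping[label]: rank for label, rank in label_ranks.items()}


methods = ["fewshot_russell", "fewshot_chesterton", "fewshot_clausewitz", "fewshot_hume"]
mapping = assign_blind_labels(methods, seed=7)

# The judge only ever sees the labels; ranks are mapped back afterwards.
ranks_by_method = unblind({"A": 2, "B": 1, "C": 4, "D": 3}, mapping)
```

Seeding per sample and run keeps the label order varied across judgments while still letting a rerun rebuild the same prompt.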

## Hypothesis

We expect Russell examples (same author) to perform best, but want to quantify:
- How much worse are other authors?
- Does domain similarity (Hume - philosophy) matter more than authorship?
- Can cross-author examples still capture some style elements?

## Key Features

- **Crash resilient**: All LLM responses saved to SQLite immediately
- **Resume support**: Can restart after failures, skips completed work
- **Blind evaluation**: Eliminates judge bias by hiding method names
- **Comparative ranking**: More informative than binary comparisons
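
The save-immediately and skip-completed behaviour can be illustrated with plain `sqlite3`. The table layout and the function names below are assumptions made for this sketch; they are not the schema used by `StyleEvaluationStore`.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a tiny response store; the schema is illustrative only."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS responses (
               sample_id TEXT, run INTEGER, method TEXT, text TEXT,
               PRIMARY KEY (sample_id, run, method))"""
    )
    return conn


def has_response(conn, sample_id, run, method):
    row = conn.execute(
        "SELECT 1 FROM responses WHERE sample_id=? AND run=? AND method=?",
        (sample_id, run, method),
    ).fetchone()
    return row is not None


def save_response(conn, sample_id, run, method, text):
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO responses VALUES (?, ?, ?, ?)",
            (sample_id, run, method, text),
        )


def run_all(conn, jobs, generate):
    """Run only the jobs that are not stored yet; return the number of calls made."""
    calls = 0
    for sample_id, run, method in jobs:
        if has_response(conn, sample_id, run, method):
            continue  # resume support: completed work is skipped
        save_response(conn, sample_id, run, method, generate(sample_id, run, method))
        calls += 1
    return calls


conn = open_store()
jobs = [("sample_000", 0, "fewshot_russell"), ("sample_000", 0, "fewshot_hume")]
fake_llm = lambda sample_id, run, method: f"text for {method}"  # stand-in for an LLM call

first_pass = run_all(conn, jobs, fake_llm)
second_pass = run_all(conn, jobs, fake_llm)  # nothing left to do
```

Committing after every single response costs a little throughput, but a crash can then lose at most the one call that was in flight.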

### Install Libraries and Check

In [1]:
!pip install -r requirements.txt

[notice] A new release of pip is available: 24.1.2 -> 25.3
[notice] To update, run: pip install --upgrade pip


In [2]:
try:
    import litellm
    print('Providers\n=========')
    print('* ' + '\n* '.join(litellm.LITELLM_CHAT_PROVIDERS))
    litellm.drop_params = True
except ImportError as e:
    print(f"✗ Cannot import litellm: {e}")

Providers
* openai
* openai_like
* bytez
* xai
* custom_openai
* text-completion-openai
* cohere
* cohere_chat
* clarifai
* anthropic
* anthropic_text
* replicate
* huggingface
* together_ai
* datarobot
* openrouter
* cometapi
* vertex_ai
* vertex_ai_beta
* gemini
* ai21
* baseten
* azure
* azure_text
* azure_ai
* sagemaker
* sagemaker_chat
* bedrock
* vllm
* nlp_cloud
* petals
* oobabooga
* ollama
* ollama_chat
* deepinfra
* perplexity
* mistral
* groq
* nvidia_nim
* cerebras
* baseten
* ai21_chat
* volcengine
* codestral
* text-completion-codestral
* deepseek
* sambanova
* maritalk
* cloudflare
* fireworks_ai
* friendliai
* watsonx
* watsonx_text
* triton
* predibase
* databricks
* empower
* github
* custom
* litellm_proxy
* hosted_vllm
* llamafile
* lm_studio
* galadriel
* gradient_ai
* github_copilot
* novita
* meta_llama
* featherless_ai
* nscale
* nebius
* dashscope
* moonshot
* v0
* heroku
* oci
* morph
* lambda_ai
* vercel_ai_gateway
* wandb
* ovhcloud
* lemonade


## Setup and Configuration

In [3]:
import os
import json
from pathlib import Path
import pandas as pd
from datetime import datetime

### Initialize Base Objects

In [4]:
# Model Configuration
#model_reconstruction_string = 'anthropic/claude-sonnet-4-5-20250929'
#model_reconstruction_api_key_env_var = 'ANTHROPIC_API_KEY'
#model_reconstruction_string = 'mistral/mistral-large-2512'
#model_reconstruction_api_key_env_var = 'MISTRAL_API_KEY'
model_reconstruction_string = 'openai/gpt-5.1-2025-11-13'
model_reconstruction_api_key_env_var = 'OPENAI_API_KEY'
model_judge_string = 'anthropic/claude-sonnet-4-5-20250929'
model_judge_api_key_env_var = 'ANTHROPIC_API_KEY'

In [5]:
from belletrist import PromptMaker, DataSampler, StyleEvaluationStore

# ============================================================================
# CONFIGURATION - Modify these parameters before running
# ============================================================================

# Data paths
DATA_PATH_RUSSELL = Path(os.getcwd()) / "data" / "russell"
DATA_PATH_OTHER = Path(os.getcwd()) / "data" / "other_author"
EVALUATION_DB_PATH = Path(os.getcwd()) / "style_eval_fewshot_sources_openai.db"

# Methods for this experiment (must be exactly 4 for the 4-way comparison)
METHODS = [
    'fewshot_russell',
    'fewshot_chesterton', 
    'fewshot_clausewitz',
#    'fewshot_freud',
    'fewshot_hume'
]

# Mapping of methods to source files
FEWSHOT_SOURCE_FILES = {
    'fewshot_chesterton': 'excerpt_chesterton_orthodoxy.txt',
    'fewshot_clausewitz': 'excerpt_clausewitz_on_war.txt',
#    'fewshot_freud': 'excerpt_freud_a_general_introduction_to_psychoanalysis.txt',
    'fewshot_hume': 'excerpt_hume_a_treatise_on_human_nature.txt'
}

# ============================================================================

# Validate configuration
if not DATA_PATH_RUSSELL.exists():
    raise FileNotFoundError(
        f"Russell data directory not found: {DATA_PATH_RUSSELL}\n"
        f"Please ensure the data directory exists."
    )

if not DATA_PATH_OTHER.exists():
    raise FileNotFoundError(
        f"Other authors data directory not found: {DATA_PATH_OTHER}\n"
        f"Please ensure the data directory exists."
    )

# Validate that all source files exist
for method, filename in FEWSHOT_SOURCE_FILES.items():
    filepath = DATA_PATH_OTHER / filename
    if not filepath.exists():
        raise FileNotFoundError(f"Source file not found: {filepath}")

# Initialize components
prompt_maker = PromptMaker()

sampler_russell = DataSampler(data_path=DATA_PATH_RUSSELL.resolve())
sampler_other = DataSampler(data_path=DATA_PATH_OTHER.resolve())

store = StyleEvaluationStore(
    EVALUATION_DB_PATH,
    methods=METHODS
)

print(f"✓ Russell data path: {DATA_PATH_RUSSELL}")
print(f"✓ Other authors data path: {DATA_PATH_OTHER}")
print(f"✓ Evaluation database: {EVALUATION_DB_PATH}")
print(f"✓ Configured methods: {METHODS}")

ImportError: cannot import name 'StyleRewritePlannerConfig' from 'belletrist.prompts' (/Users/andersohrn/PycharmProjects/ClaudeCodeCourse/style-retrieval/belletrist/prompts/__init__.py)

In [6]:
#store.reset('all')

In [7]:
from belletrist import LLM, LLMConfig

reconstruction_llm = LLM(LLMConfig(
    model=model_reconstruction_string,
    api_key=os.environ.get(model_reconstruction_api_key_env_var)
))
judge_llm = LLM(LLMConfig(
    model=model_judge_string,
    api_key=os.environ.get(model_judge_api_key_env_var)
))
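
Because `os.environ.get` returns `None` for an unset variable, a missing key would only surface at the first API call. A small guard can fail earlier; `require_env` is a hypothetical helper and `EXAMPLE_API_KEY` is a placeholder name used only for this demonstration.

```python
import os


def require_env(var_name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(var_name)
    if not value:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return value


# Placeholder value so that the demonstration runs anywhere; not a real key.
os.environ.setdefault("EXAMPLE_API_KEY", "not-a-real-key")
key = require_env("EXAMPLE_API_KEY")
```

In the cell above, the same check could wrap the two `os.environ.get(...)` lookups if failing fast is preferred.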

### Sample Test Data and Few-Shot Data

Test texts are Russell paragraphs. Few-shot examples come from 4 different sources.

In [8]:
# Experiment parameters
n_sample = 5
m_paragraphs_per_sample = 3
n_few_shot_per_source = 5  # Number of examples from each source
m_paragraphs_per_few_shot = 2

In [9]:
# Select deterministic test samples (Russell)
quality_texts_deterministic = [
    sampler_russell.get_paragraph_chunk(file_index=0, paragraph_range=slice(9, 9+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=0, paragraph_range=slice(29, 29+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=0, paragraph_range=slice(131, 131+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=1, paragraph_range=slice(13, 13+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=1, paragraph_range=slice(39, 39+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=1, paragraph_range=slice(192, 192+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=2, paragraph_range=slice(20, 20+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=2, paragraph_range=slice(43, 43+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=2, paragraph_range=slice(146, 146+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=3, paragraph_range=slice(7, 7+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=3, paragraph_range=slice(73, 73+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=3, paragraph_range=slice(202, 202+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=3, paragraph_range=slice(285, 285+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=4, paragraph_range=slice(4, 4+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=4, paragraph_range=slice(67, 67+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=4, paragraph_range=slice(124, 124+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=5, paragraph_range=slice(6, 6+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=5, paragraph_range=slice(119, 119+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=5, paragraph_range=slice(301, 301+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=6, paragraph_range=slice(23, 23+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=6, paragraph_range=slice(75, 75+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=6, paragraph_range=slice(152, 152+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=6, paragraph_range=slice(198, 198+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=6, paragraph_range=slice(271, 271+m_paragraphs_per_sample)),
]
reindex = [0, 5, 10, 15, 20, 1, 6, 11, 16, 21, 2, 7, 12, 17, 22, 3, 8, 13, 18, 23, 4, 9, 14, 19]
test_texts = [
    quality_texts_deterministic[i] for i in reindex[:n_sample]
]
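
The hard-coded `reindex` list follows a stride-5 interleaving pattern, so the first `n_sample` entries are spread across the list of candidate chunks rather than taken from its start. A sketch that regenerates the same order (`interleave_indices` is a hypothetical name):

```python
def interleave_indices(n_items, stride):
    """Visit 0, stride, 2*stride, ... first, then 1, 1+stride, ... and so on."""
    return [i for offset in range(stride) for i in range(offset, n_items, stride)]


# 24 candidate chunks, stride 5: reproduces the literal list used in the cell above.
reindex_generated = interleave_indices(24, 5)
```

Generating the list this way would keep it consistent if chunks are added or removed later.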

In [10]:
# Load few-shot examples from Russell (baseline)
few_shot_russell = []
while len(few_shot_russell) < n_few_shot_per_source:
    p = sampler_russell.sample_segment(p_length=m_paragraphs_per_few_shot)
    
    # Check if p overlaps with any test text
    has_overlap = any(
        p.file_index == test_seg.file_index and
        p.paragraph_start < test_seg.paragraph_end and
        test_seg.paragraph_start < p.paragraph_end
        for test_seg in test_texts
    )
    
    if not has_overlap:
        few_shot_russell.append(p.text)

print(f"✓ Loaded {len(few_shot_russell)} Russell few-shot examples")

✓ Loaded 5 Russell few-shot examples
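
The overlap check used when sampling Russell few-shot examples is the standard test for two half-open intervals. The sketch below assumes that `paragraph_end` is exclusive, which is what the strict inequalities suggest; the `Segment` class is a stand-in for whatever `DataSampler` returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Stand-in for a sampled segment; paragraph_end is assumed to be exclusive."""
    file_index: int
    paragraph_start: int
    paragraph_end: int


def overlaps(a, b):
    """True if both segments come from the same file and share a paragraph."""
    return (
        a.file_index == b.file_index
        and a.paragraph_start < b.paragraph_end
        and b.paragraph_start < a.paragraph_end
    )


test_seg = Segment(file_index=0, paragraph_start=9, paragraph_end=12)

shares_paragraph_11 = overlaps(Segment(0, 11, 13), test_seg)   # overlapping
starts_where_test_ends = overlaps(Segment(0, 12, 14), test_seg)  # adjacent only
different_file = overlaps(Segment(1, 9, 12), test_seg)           # other file
```

If `paragraph_end` turned out to be inclusive, the two strict comparisons would need to become `<=` to avoid leaking the boundary paragraph into the few-shot set.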


In [11]:
# Load few-shot examples from other authors
# Files are alphabetically ordered: chesterton=0, clausewitz=1, freud=2, hume=3
AUTHOR_FILE_INDICES = {
    'excerpt_chesterton_orthodoxy.txt': 0,
    'excerpt_clausewitz_on_war.txt': 1,
    'excerpt_freud_a_general_introduction_to_psychoanalysis.txt': 2,
    'excerpt_hume_a_treatise_on_human_nature.txt': 3
}

def load_fewshot_from_file(filename: str, n_examples: int) -> list[str]:
    """Load n_examples evenly spaced segments of m_paragraphs_per_few_shot paragraphs each."""
    file_idx = AUTHOR_FILE_INDICES.get(filename)
    
    if file_idx is None:
        raise ValueError(f"File not found in mapping: {filename}")
    
    # Get total paragraphs in file using the n_paragraphs dict
    file_path = sampler_other.fps[file_idx]
    total_paras = sampler_other.n_paragraphs[file_path.name]
    
    # Sample evenly across the text
    step = max(1, total_paras // (n_examples + 1))
    examples = []
    
    for i in range(n_examples):
        # Clamp to the valid range so short files cannot yield a negative start
        start = max(0, min(step * (i + 1), total_paras - m_paragraphs_per_few_shot))
        segment = sampler_other.get_paragraph_chunk(
            file_index=file_idx,
            paragraph_range=slice(start, start + m_paragraphs_per_few_shot)
        )
        examples.append(segment.text)
    
    return examples

# Load examples from each source
few_shot_chesterton = load_fewshot_from_file(
    FEWSHOT_SOURCE_FILES['fewshot_chesterton'],
    n_few_shot_per_source
)
few_shot_clausewitz = load_fewshot_from_file(
    FEWSHOT_SOURCE_FILES['fewshot_clausewitz'],
    n_few_shot_per_source
)
few_shot_hume = load_fewshot_from_file(
    FEWSHOT_SOURCE_FILES['fewshot_hume'],
    n_few_shot_per_source
)

print(f"✓ Loaded {len(few_shot_chesterton)} Chesterton few-shot examples")
print(f"✓ Loaded {len(few_shot_clausewitz)} Clausewitz few-shot examples")
print(f"✓ Loaded {len(few_shot_hume)} Hume few-shot examples")

✓ Loaded 5 Chesterton few-shot examples
✓ Loaded 5 Clausewitz few-shot examples
✓ Loaded 5 Hume few-shot examples
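
The even-spacing rule in `load_fewshot_from_file` can be checked in isolation. The helper below mirrors that arithmetic under a hypothetical name (`even_starts`), with the start index clamped to the valid range; the paragraph counts are made-up inputs.

```python
def even_starts(total_paras, n_examples, seg_len):
    """Start indices for n_examples segments spread evenly over a file."""
    step = max(1, total_paras // (n_examples + 1))
    last_valid = max(0, total_paras - seg_len)  # last start that still fits a full segment
    return [min(step * (i + 1), last_valid) for i in range(n_examples)]


long_file = even_starts(total_paras=120, n_examples=5, seg_len=2)
short_file = even_starts(total_paras=6, n_examples=5, seg_len=2)
```

For a long file the starts are well separated. For a very short file the clamping produces repeated or overlapping segments, which is worth keeping in mind when an excerpt has few paragraphs.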


## Evaluation Pipeline

In [12]:
for x in few_shot_chesterton:
    print(x)
    print('========')

First, I found the whole modern world talking scientific fatalism;
saying that everything is as it must always have been, being unfolded
without fault from the beginning.  The leaf on the tree is green
because it could never have been anything else.  Now, the fairy-tale
philosopher is glad that the leaf is green precisely because it
might have been scarlet.  He feels as if it had turned green an
instant before he looked at it.  He is pleased that snow is white
on the strictly reasonable ground that it might have been black.
Every colour has in it a bold quality as of choice; the red of garden
roses is not only decisive but dramatic, like suddenly spilt blood.
He feels that something has been DONE.  But the great determinists
of the nineteenth century were strongly against this native
feeling that something had happened an instant before.  In fact,
according to them, nothing ever really had happened since the beginning
of the world.  Nothing ever had happened since existence had happene

### Step 1: Content Flattening

Extract content from each test sample, removing stylistic elements:

In [13]:
from belletrist.prompts import StyleFlatteningConfig

print("=== Step 1: Flattening and Saving Samples ===\n")

for k_text, test_sample in enumerate(test_texts):
    sample_id = f"sample_{k_text:03d}"
    
    # Skip if already saved
    if store.get_sample(sample_id):
        print(f"✓ {sample_id} already flattened (skipping)")
        continue
    
    print(f"Flattening {sample_id}...", end=" ")
    
    # Flatten content
    flatten_prompt = prompt_maker.render(
        StyleFlatteningConfig(text=test_sample.text)
    )
    flattened = reconstruction_llm.complete(flatten_prompt)
    
    # Save to store with provenance
    source_info = f"File {test_sample.file_index}, para {test_sample.paragraph_start}-{test_sample.paragraph_end}"
    store.save_sample(
        sample_id=sample_id,
        original_text=test_sample.text,
        flattened_content=flattened.content,
        flattening_model=flattened.model,
        source_info=source_info
    )
    
    print(f"✓ ({len(flattened.content)} chars)")

print(f"\n✓ All samples flattened and saved to store")

=== Step 1: Flattening and Saving Samples ===

✓ sample_000 already flattened (skipping)
✓ sample_001 already flattened (skipping)
✓ sample_002 already flattened (skipping)
✓ sample_003 already flattened (skipping)
✓ sample_004 already flattened (skipping)

✓ All samples flattened and saved to store


### Step 2: Reconstruction with Different Few-Shot Sources

Generate reconstructions using the same prompt template but different few-shot examples:

In [14]:
from belletrist.prompts import StyleReconstructionFewShotConfig

# Configuration
n_runs = 3  # Stochastic runs per sample

# Map methods to their few-shot examples
FEWSHOT_EXAMPLES = {
    'fewshot_russell': few_shot_russell,
    'fewshot_chesterton': few_shot_chesterton,
    'fewshot_clausewitz': few_shot_clausewitz,
    'fewshot_hume': few_shot_hume
}

print("=== Step 2: Generating Reconstructions ===\n")

for sample_id in store.list_samples():
    sample = store.get_sample(sample_id)
    print(f"\n{sample_id}:")
    
    for run in range(n_runs):
        print(f"  Run {run}:")
        
        for method in METHODS:
            if store.has_reconstruction(sample_id, run, method):
                print(f"    ✓ {method:25s} (already done)")
                continue
            
            # Get few-shot examples for this method
            few_shot_examples = FEWSHOT_EXAMPLES[method]
            
            # Build reconstruction prompt
            config = StyleReconstructionFewShotConfig(
                content_summary=sample['flattened_content'],
                few_shot_examples=few_shot_examples
            )
            prompt = prompt_maker.render(config)
            response = reconstruction_llm.complete(prompt)
            
            # Save immediately (crash resilient!)
            store.save_reconstruction(
                sample_id=sample_id,
                run=run,
                method=method,
                reconstructed_text=response.content,
                model=response.model
            )
            print(f"    ✓ {method:25s} ({len(response.content)} chars)")

stats = store.get_stats()
print(f"\n✓ Generated {stats['n_reconstructions']} total reconstructions")
print(f"✓ Configured methods: {stats['configured_methods']}")

=== Step 2: Generating Reconstructions ===


sample_000:
  Run 0:
    ✓ fewshot_russell           (already done)
    ✓ fewshot_chesterton        (already done)
    ✓ fewshot_clausewitz        (already done)
    ✓ fewshot_hume              (already done)
  Run 1:
    ✓ fewshot_russell           (already done)
    ✓ fewshot_chesterton        (already done)
    ✓ fewshot_clausewitz        (already done)
    ✓ fewshot_hume              (already done)
  Run 2:
    ✓ fewshot_russell           (already done)
    ✓ fewshot_chesterton        (already done)
    ✓ fewshot_clausewitz        (already done)
    ✓ fewshot_hume              (already done)

sample_001:
  Run 0:
    ✓ fewshot_russell           (already done)
    ✓ fewshot_chesterton        (already done)
    ✓ fewshot_clausewitz        (already done)
    ✓ fewshot_hume              (already done)
  Run 1:
    ✓ fewshot_russell           (already done)
    ✓ fewshot_chesterton        (already done)
    ✓ fewshot_clausewitz        (alread

### Step 3: Judging

Compare each reconstruction against the original using blind 4-way ranking:

In [15]:
from belletrist.style_evaluation_models import StyleJudgmentComparative
from belletrist.prompts import StyleJudgeComparativeConfig
import zlib

n_judge_runs = 1

print("=== Step 3: Comparative Blind Judging (4-way) ===\n")

for sample_id in store.list_samples():
    sample = store.get_sample(sample_id)
    print(f"\n{sample_id}:")
    
    for reconstruction_run in range(n_runs):
        print(f"  Reconstruction run {reconstruction_run}:")
        
        # Get all 4 reconstructions ONCE
        reconstructions = store.get_reconstructions(sample_id, reconstruction_run)
        if len(reconstructions) != 4:
            print(f"    ✗ Missing reconstructions (found {len(reconstructions)}/4)")
            continue
        
        # Create mapping ONCE per reconstruction_run.
        # zlib.crc32 gives the same seed in every session; the built-in hash() of a
        # string is salted per process, so it would change after a restart.
        seed = zlib.crc32(f"{sample_id}_{reconstruction_run}".encode("utf-8"))
        mapping = store.create_random_mapping(seed=seed)
        
        # Build prompt ONCE per reconstruction_run
        judge_config = StyleJudgeComparativeConfig(
            original_text=sample['original_text'],
            reconstruction_text_a=reconstructions[mapping.text_a],
            reconstruction_text_b=reconstructions[mapping.text_b],
            reconstruction_text_c=reconstructions[mapping.text_c],
            reconstruction_text_d=reconstructions[mapping.text_d]
        )
        judge_prompt = prompt_maker.render(judge_config)
        
        # Judge the SAME reconstructions multiple times
        for judge_run in range(n_judge_runs):
            if store.has_judgment(sample_id, reconstruction_run, judge_run):
                print(f"    Judge run {judge_run}: ✓ Already judged (skipping)")
                continue
            
            print(f"    Judge run {judge_run}: Judging...", end=" ")
            
            # Get structured JSON judgment
            try:
                response = judge_llm.complete_with_schema(judge_prompt, StyleJudgmentComparative)
                judgment = response.content
                
                # Save judgment
                store.save_judgment(
                    sample_id=sample_id,
                    reconstruction_run=reconstruction_run,
                    judgment=judgment,
                    mapping=mapping,
                    judge_model=response.model,
                    judge_run=judge_run
                )
                print(f"✓ (confidence: {judgment.confidence})")
                
            except Exception as e:
                print(f"✗ Error: {e}")

stats = store.get_stats()
print(f"\n✓ Completed {stats['n_judgments']} judgments")

=== Step 3: Comparative Blind Judging (4-way) ===


sample_000:
  Reconstruction run 0:
    Judge run 0: ✓ Already judged (skipping)
  Reconstruction run 1:
    Judge run 0: ✓ Already judged (skipping)
  Reconstruction run 2:
    Judge run 0: ✓ Already judged (skipping)

sample_001:
  Reconstruction run 0:
    Judge run 0: ✓ Already judged (skipping)
  Reconstruction run 1:
    Judge run 0: ✓ Already judged (skipping)
  Reconstruction run 2:
    Judge run 0: ✓ Already judged (skipping)

sample_002:
  Reconstruction run 0:
    Judge run 0: ✓ Already judged (skipping)
  Reconstruction run 1:
    Judge run 0: ✓ Already judged (skipping)
  Reconstruction run 2:
    Judge run 0: ✓ Already judged (skipping)

sample_003:
  Reconstruction run 0:
    Judge run 0: ✓ Already judged (skipping)
  Reconstruction run 1:
    Judge run 0: ✓ Already judged (skipping)
  Reconstruction run 2:
    Judge run 0: ✓ Already judged (skipping)

sample_004:
  Reconstruction run 0:
    Judge run 0: ✓ Already judge

## Results Analysis

In [30]:
# View judge reasoning for a specific sample/run
INSPECT_SAMPLE = 'sample_004'
INSPECT_RUN = 1
INSPECT_JUDGE_RUN = 0

# Query judgment directly from database
judgment = store.conn.execute("""
    SELECT * FROM comparative_judgments
    WHERE sample_id=? AND reconstruction_run=? AND judge_run=?
""", (INSPECT_SAMPLE, INSPECT_RUN, INSPECT_JUDGE_RUN)).fetchone()

if judgment:
    print(f"=== JUDGE REASONING: {INSPECT_SAMPLE}, Reconstruction Run {INSPECT_RUN}, Judge Run {INSPECT_JUDGE_RUN} ===\n")
    
    # Build mapping to show which label = which method
    label_to_method = {
        'A': judgment['method_text_a'],
        'B': judgment['method_text_b'],
        'C': judgment['method_text_c'],
        'D': judgment['method_text_d']
    }
    
    label_to_rank = {
        'A': judgment['ranking_text_a'],
        'B': judgment['ranking_text_b'],
        'C': judgment['ranking_text_c'],
        'D': judgment['ranking_text_d']
    }
    
    print("ANONYMOUS LABELS → METHODS:")
    for label in ['A', 'B', 'C', 'D']:
        method = label_to_method[label]
        rank = label_to_rank[label]
        print(f"  Text {label} = {method:25s} → Rank {rank}")
    
    print(f"\nConfidence: {judgment['confidence']}")
    print(f"Judge Model: {judgment['judge_model']}")
    
    # Show reasoning
    print(f"\n{'REASONING':-^80}")
    print(judgment['reasoning'])
else:
    print(f"No judgment found for {INSPECT_SAMPLE}, run {INSPECT_RUN}, judge run {INSPECT_JUDGE_RUN}")

=== JUDGE REASONING: sample_004, Reconstruction Run 1, Judge Run 0 ===

ANONYMOUS LABELS → METHODS:
  Text A = fewshot_chesterton        → Rank 4
  Text B = fewshot_hume              → Rank 1
  Text C = fewshot_clausewitz        → Rank 3
  Text D = fewshot_russell           → Rank 2

Confidence: high
Judge Model: claude-sonnet-4-5-20250929

-----------------------------------REASONING------------------------------------
Let me analyze each text's stylistic similarity to the original.

**Original's distinctive voice:**
- Clear, methodical exposition with logical connectives ("In the preceding chapter," "But first of all," "Thus")
- Direct, conversational relationship with reader ("We shall say," "we must make clear")
- Uses concrete examples (the table) embedded naturally in philosophical argument
- Balanced sentence structure—neither overly ornate nor terse
- Measured, patient tone that explains step-by-step
- Uses italics for technical terms being defined
- Pragmatic qualifications ("

In [31]:
# Compare rankings across all runs for a specific sample
INSPECT_SAMPLE = 'sample_002'

# Query all judgments for this sample from database
judgments = store.conn.execute("""
    SELECT * FROM comparative_judgments
    WHERE sample_id=?
    ORDER BY reconstruction_run, judge_run
""", (INSPECT_SAMPLE,)).fetchall()

if judgments:
    print(f"=== RANKING CONSISTENCY FOR {INSPECT_SAMPLE} ===\n")
    print(f"Total judgments: {len(judgments)}\n")
    
    # Show rankings by run
    print(f"{'Run':<6} {'Judge':<6} ", end='')
    for method in METHODS:
        print(f"{method[:12]:<14}", end='')
    print()
    print("-" * 70)
    
    # Collect rankings for mean calculation
    rankings_by_method = {method: [] for method in METHODS}
    
    for j in judgments:
        # Build method->rank mapping for this judgment
        method_ranks = {}
        for label in ['a', 'b', 'c', 'd']:
            method = j[f'method_text_{label}']
            rank = j[f'ranking_text_{label}']
            method_ranks[method] = rank
        
        # Print row
        print(f"{j['reconstruction_run']:<6} {j['judge_run']:<6} ", end='')
        for method in METHODS:
            rank = method_ranks[method]
            rankings_by_method[method].append(rank)
            print(f"{rank:<14}", end='')
        print()
    
    # Show mean rankings
    print("\n\nMean rankings across all runs:")
    import statistics
    for method in METHODS:
        ranks = rankings_by_method[method]
        mean_rank = statistics.mean(ranks)
        std_rank = statistics.stdev(ranks) if len(ranks) > 1 else 0.0
        print(f"  {method:25s}: {mean_rank:.2f} ± {std_rank:.2f}")
else:
    print(f"No judgments found for {INSPECT_SAMPLE}")

=== RANKING CONSISTENCY FOR sample_002 ===

Total judgments: 3

Run    Judge  fewshot_russ  fewshot_ches  fewshot_clau  fewshot_hume  
----------------------------------------------------------------------
0      0      1             4             2             3             
1      0      3             4             2             1             
2      0      1             4             3             2             


Mean rankings across all runs:
  fewshot_russell          : 1.67 ± 1.15
  fewshot_chesterton       : 4.00 ± 0.00
  fewshot_clausewitz       : 2.33 ± 0.58
  fewshot_hume             : 2.00 ± 1.00


In [33]:
# View all reconstructions for a specific sample and run
INSPECT_SAMPLE = 'sample_002'
INSPECT_RUN = 0

print(f"=== RECONSTRUCTIONS FOR {INSPECT_SAMPLE}, RUN {INSPECT_RUN} ===\n")

sample = store.get_sample(INSPECT_SAMPLE)
reconstructions = store.get_reconstructions(INSPECT_SAMPLE, INSPECT_RUN)

print(f"{'ORIGINAL':-^80}")
print(sample['original_text'])
print("\n")

for method in METHODS:
    print(f"{method.upper():-^80}")
    print(reconstructions[method])
    print(f"\n({len(reconstructions[method])} chars)\n")

=== RECONSTRUCTIONS FOR sample_002, RUN 0 ===

------------------------------------ORIGINAL------------------------------------
In attempting to understand the elements out of which mental phenomena
are compounded, it is of the greatest importance to remember that from
the protozoa to man there is nowhere a very wide gap either in structure
or in behaviour. From this fact it is a highly probable inference that
there is also nowhere a very wide mental gap. It is, of course, POSSIBLE
that there may be, at certain stages in evolution, elements which are
entirely new from the standpoint of analysis, though in their nascent
form they have little influence on behaviour and no very marked
correlatives in structure. But the hypothesis of continuity in mental
development is clearly preferable if no psychological facts make it
impossible. We shall find, if I am not mistaken, that there are no facts
which refute the hypothesis of mental continuity, and that, on the other
hand, this hypothesis aff

In [19]:
# View the reconstruction prompt for a specific method
INSPECT_SAMPLE = 'sample_000'
INSPECT_METHOD = 'fewshot_chesterton'  # Change to view different method's prompt

sample = store.get_sample(INSPECT_SAMPLE)
few_shot_examples = FEWSHOT_EXAMPLES[INSPECT_METHOD]

config = StyleReconstructionFewShotConfig(
    content_summary=sample['flattened_content'],
    few_shot_examples=few_shot_examples
)
prompt = prompt_maker.render(config)

print(f"=== RECONSTRUCTION PROMPT: {INSPECT_METHOD.upper()} ===\n")
print(prompt)

=== RECONSTRUCTION PROMPT: FEWSHOT_CHESTERTON ===

You are given a content summary that captures the core ideas and arguments of a text, but stripped of stylistic elements.

Your task is to expand this summary into a complete prose passage using a writing style similar to the examples below.

**CRITICAL REQUIREMENTS:**
- Match the STYLE of the examples (rhythm, vocabulary, tone, sentence structure), not their formatting choices.
- If the examples are plain prose with no titles/headers/formatting, write plain prose.
- Do NOT add meta-commentary, preambles like "Here is the passage...", or postscripts explaining your work.
- Simply write naturally in this style.

**EXAMPLE TEXTS (demonstrating the target style):**

---

First, I found the whole modern world talking scientific fatalism;
saying that everything is as it must always have been, being unfolded
without fault from the beginning.  The leaf on the tree is green
because it could never have been anything else.  Now, the fairy-tale
p

## Aggregate Results

Export the judgments to a DataFrame and summarize the rankings per few-shot source:

In [22]:
# Export results to DataFrame
print("=== Exporting Results ===\n")

df = store.to_dataframe()

print(f"Total judgments: {len(df)}")
print(f"Samples: {df['sample_id'].nunique()}")
print(f"Reconstruction runs per sample: {df.groupby('sample_id')['reconstruction_run'].nunique().mean():.1f}")
print(f"Judge runs per reconstruction: {df.groupby(['sample_id', 'reconstruction_run'])['judge_run'].nunique().mean():.1f}")

# Show first few rows
print(f"\n=== Sample Results ===\n")
display_cols = ['sample_id', 'reconstruction_run', 'judge_run'] + [f'ranking_{m}' for m in METHODS] + ['confidence']
print(df[display_cols].head(25))

# Export to CSV
output_file = f"style_eval_fewshot_sources_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
df.to_csv(output_file, index=False)
print(f"\n✓ Results saved to {output_file}")

=== Exporting Results ===

Total judgments: 15
Samples: 5
Reconstruction runs per sample: 3.0
Judge runs per reconstruction: 1.0

=== Sample Results ===

     sample_id  reconstruction_run  judge_run  ranking_fewshot_russell  \
0   sample_000                   0          0                        1   
1   sample_000                   1          0                        1   
2   sample_000                   2          0                        1   
3   sample_001                   0          0                        1   
4   sample_001                   1          0                        1   
5   sample_001                   2          0                        1   
6   sample_002                   0          0                        1   
7   sample_002                   1          0                        3   
8   sample_002                   2          0                        1   
9   sample_003                   0          0                        1   
10  sample_003                  

In [23]:
# Analyze ranking distributions
print("=== Ranking Distribution by Few-Shot Source ===\n")

for method in METHODS:
    col = f'ranking_{method}'
    print(f"\n{method.upper()}:")
    ranking_counts = df[col].value_counts().sort_index()
    for rank in [1, 2, 3, 4]:
        count = ranking_counts.get(rank, 0)
        pct = (count / len(df) * 100) if len(df) > 0 else 0
        print(f"  Rank {rank}: {count:3d} ({pct:5.1f}%)")

print("\n=== Confidence Distribution ===\n")
print(df['confidence'].value_counts())

=== Ranking Distribution by Few-Shot Source ===


FEWSHOT_RUSSELL:
  Rank 1:   9 ( 60.0%)
  Rank 2:   2 ( 13.3%)
  Rank 3:   3 ( 20.0%)
  Rank 4:   1 (  6.7%)

FEWSHOT_CHESTERTON:
  Rank 1:   1 (  6.7%)
  Rank 2:   2 ( 13.3%)
  Rank 3:   3 ( 20.0%)
  Rank 4:   9 ( 60.0%)

FEWSHOT_CLAUSEWITZ:
  Rank 1:   2 ( 13.3%)
  Rank 2:   8 ( 53.3%)
  Rank 3:   3 ( 20.0%)
  Rank 4:   2 ( 13.3%)

FEWSHOT_HUME:
  Rank 1:   3 ( 20.0%)
  Rank 2:   3 ( 20.0%)
  Rank 3:   6 ( 40.0%)
  Rank 4:   3 ( 20.0%)

=== Confidence Distribution ===

confidence
high    15
Name: count, dtype: int64


In [24]:
# Calculate method performance metrics
print("=== Few-Shot Source Performance Metrics ===\n")

# Calculate mean ranking for each method (lower is better: 1 = best, 4 = worst)
mean_rankings = {}
for method in METHODS:
    col = f'ranking_{method}'
    mean_rankings[method] = df[col].mean()

# Sort by mean ranking (best first)
sorted_methods = sorted(mean_rankings.items(), key=lambda x: x[1])

print("Average Ranking (lower is better):")
for i, (method, mean_rank) in enumerate(sorted_methods, 1):
    # Count how often this method ranked 1st
    first_place = (df[f'ranking_{method}'] == 1).sum()
    first_place_pct = (first_place / len(df) * 100) if len(df) > 0 else 0
    
    print(f"{i}. {method:25s}: {mean_rank:.2f} (1st place: {first_place}/{len(df)} = {first_place_pct:.1f}%)")

# Top-2 rate
print("\nTop-2 Rate (ranked 1st or 2nd):")
for method in METHODS:
    col = f'ranking_{method}'
    top2 = ((df[col] == 1) | (df[col] == 2)).sum()
    top2_pct = (top2 / len(df) * 100) if len(df) > 0 else 0
    print(f"  {method:25s}: {top2}/{len(df)} = {top2_pct:.1f}%")

# Bottom-2 rate (how often ranked 3rd or 4th)
print("\nBottom-2 Rate (ranked 3rd or 4th):")
for method in METHODS:
    col = f'ranking_{method}'
    bottom2 = ((df[col] == 3) | (df[col] == 4)).sum()
    bottom2_pct = (bottom2 / len(df) * 100) if len(df) > 0 else 0
    print(f"  {method:25s}: {bottom2}/{len(df)} = {bottom2_pct:.1f}%")

=== Few-Shot Source Performance Metrics ===

Average Ranking (lower is better):
1. fewshot_russell          : 1.73 (1st place: 9/15 = 60.0%)
2. fewshot_clausewitz       : 2.33 (1st place: 2/15 = 13.3%)
3. fewshot_hume             : 2.60 (1st place: 3/15 = 20.0%)
4. fewshot_chesterton       : 3.33 (1st place: 1/15 = 6.7%)

Top-2 Rate (ranked 1st or 2nd):
  fewshot_russell          : 11/15 = 73.3%
  fewshot_chesterton       : 3/15 = 20.0%
  fewshot_clausewitz       : 10/15 = 66.7%
  fewshot_hume             : 6/15 = 40.0%

Bottom-2 Rate (ranked 3rd or 4th):
  fewshot_russell          : 4/15 = 26.7%
  fewshot_chesterton       : 12/15 = 80.0%
  fewshot_clausewitz       : 5/15 = 33.3%
  fewshot_hume             : 9/15 = 60.0%
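
As a consistency check, the mean rankings and top-2 rates above can be recomputed from the rank counts printed in the ranking distribution. The counts are copied from that output; the function names are illustrative.

```python
# Rank counts per method, copied from the ranking distribution output (15 judgments).
rank_counts = {
    "fewshot_russell":    {1: 9, 2: 2, 3: 3, 4: 1},
    "fewshot_chesterton": {1: 1, 2: 2, 3: 3, 4: 9},
    "fewshot_clausewitz": {1: 2, 2: 8, 3: 3, 4: 2},
    "fewshot_hume":       {1: 3, 2: 3, 3: 6, 4: 3},
}


def mean_rank(counts):
    """Weighted average of the ranks, weighted by how often each rank occurred."""
    n = sum(counts.values())
    return sum(rank * count for rank, count in counts.items()) / n


def top2_rate(counts):
    """Fraction of judgments in which the method was ranked 1st or 2nd."""
    return (counts[1] + counts[2]) / sum(counts.values())


mean_ranks = {method: mean_rank(c) for method, c in rank_counts.items()}
top2_rates = {method: top2_rate(c) for method, c in rank_counts.items()}

# With complete rankings and no ties, the mean ranks must add up to 1+2+3+4 = 10.
total_of_means = sum(mean_ranks.values())
```

The gap between Russell and the next-best source (Clausewitz) is 35/15 - 26/15 = 0.60 rank positions.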


In [25]:
# Export final results
output_file = f"style_evaluation_fewshot_sources_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
df.to_csv(output_file, index=False)
print(f"\n✓ Results saved to {output_file}")


✓ Results saved to style_evaluation_fewshot_sources_results_20251212_153948.csv


## Interpretation

### Key Questions:

1. **Does same-author few-shot win?** 
   - Compare Russell vs. all other sources
   - Expected: Russell should rank highest on average

2. **Does domain matter more than author?**
   - Compare Hume (philosophy) vs. Chesterton/Clausewitz (different domains)
   - If Hume ranks higher than others, domain similarity helps

3. **How much worse are cross-author examples?**
   - Mean rank difference between Russell and next-best
   - Practical significance: is cross-author few-shot viable?

4. **Are some authors actively harmful?**
   - Check if any method consistently ranks 3rd or 4th
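   - A consistently low rank would suggest that a strong stylistic mismatch pulls the output away from the target; showing that it is worse than no examples would require a zero-shot baseline, which this experiment does not include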

### Next Steps:

- Statistical significance testing (Friedman test for rankings)
- Pairwise comparisons (Russell vs. each other source)
- Qualitative review of reconstructions
- Analyze judge reasoning for insights
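
A possible starting point for the Friedman test, using only the standard library. The rank sums are derived from the rank counts reported above (for example, Russell: 9·1 + 2·2 + 3·3 + 1·4 = 26). This sketch assumes complete rankings without ties and treats all 15 judgments as independent blocks, which is optimistic because the three runs of one sample share the same source text. The closed-form p-value applies only to four methods (3 degrees of freedom) and relies on the chi-square approximation, so it should be read as indicative at this sample size.

```python
import math


def friedman_statistic(rank_sums, n_blocks):
    """Friedman chi-square statistic for k treatments ranked in n_blocks blocks (no ties)."""
    k = len(rank_sums)
    sum_of_squares = sum(r * r for r in rank_sums)
    return 12.0 / (n_blocks * k * (k + 1)) * sum_of_squares - 3.0 * n_blocks * (k + 1)


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


# Rank sums in the order russell, chesterton, clausewitz, hume (15 judgments each).
rank_sums = [26, 50, 35, 39]
stat = friedman_statistic(rank_sums, n_blocks=15)
p_value = chi2_sf_df3(stat)

print(f"Friedman chi-square = {stat:.2f}, approximate p = {p_value:.4f}")
```

With `scipy` available, `scipy.stats.friedmanchisquare` applied to the per-judgment rank columns would be the more usual route, and averaging ranks per sample first would address the dependence between runs.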