# Style Evaluation: Few-Shot Source Comparison

This notebook evaluates how the **source of few-shot examples** affects style reconstruction quality.

## Research Question

When using few-shot prompting to reconstruct style, does it matter whose writing we use as examples?

## Methodology

For each gold standard text (Bertrand Russell):
1. **Flatten**: Extract content while removing style
2. **Reconstruct**: Generate text using 5 different few-shot sources (`n_runs` stochastic runs each):
   - Russell (same author - baseline)
   - Chesterton (different author)
   - Clausewitz (different author, different domain)
   - Freud (different author, different domain)
   - Hume (different author, same domain)
3. **Judge (Blind Comparative)**: Judge ranks all 5 reconstructions based on similarity to original
   - **Blind evaluation**: Judge sees only anonymous labels (Text A, B, C, D, E)
   - **Ranking**: 1 = most similar, 5 = least similar
   - **Order randomized**: Position of methods varies across samples
4. **Aggregate**: Analyze rankings to determine which few-shot source works best
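
The blind, order-randomized labelling in step 3 can be sketched as follows. This is a minimal illustration using only the standard library: `make_blind_mapping` and `unblind` are hypothetical helpers, not the `belletrist` API, which may build its mapping differently, and the judge ranks shown are made up.

```python
import random

LABELS = ["A", "B", "C", "D", "E"]

def make_blind_mapping(methods, seed):
    """Assign each method to an anonymous label in a seeded random order."""
    if len(methods) != len(LABELS):
        raise ValueError("expected exactly 5 methods")
    rng = random.Random(seed)
    shuffled = list(methods)
    rng.shuffle(shuffled)
    return dict(zip(LABELS, shuffled))

def unblind(label_ranks, mapping):
    """Translate the judge's per-label ranks back to method names."""
    return {mapping[label]: rank for label, rank in label_ranks.items()}

methods = [
    "fewshot_russell",
    "fewshot_chesterton",
    "fewshot_clausewitz",
    "fewshot_freud",
    "fewshot_hume",
]
mapping = make_blind_mapping(methods, seed=7)

# Made-up judge output: the judge only ever sees the labels A-E.
judge_ranks = {"A": 2, "B": 5, "C": 1, "D": 3, "E": 4}
method_ranks = unblind(judge_ranks, mapping)
```

Because the judge prompt carries only the labels, the method names re-enter the pipeline only when ranks are unblinded for aggregation.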

## Hypothesis

We expect Russell examples (same author) to perform best, but want to quantify:
- How much worse are other authors?
- Does domain similarity (Hume - philosophy) matter more than authorship?
- Can cross-author examples still capture some style elements?

## Key Features

- **Crash resilient**: All LLM responses saved to SQLite immediately
- **Resume support**: Can restart after failures, skips completed work
- **Blind evaluation**: Reduces judge bias by hiding method names
- **Comparative ranking**: More informative than binary comparisons
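
The save-immediately and skip-on-resume pattern behind the first two features might look roughly like this. It is a sketch under assumptions: the table layout and the helper names are hypothetical and do not describe the actual `StyleEvaluationStore` schema.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a store with one row per (sample, run, method)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS reconstructions (
            sample_id TEXT NOT NULL,
            run INTEGER NOT NULL,
            method TEXT NOT NULL,
            text TEXT NOT NULL,
            PRIMARY KEY (sample_id, run, method)
        )
        """
    )
    return conn

def has_reconstruction(conn, sample_id, run, method):
    """Return True if this unit of work has already been saved."""
    row = conn.execute(
        "SELECT 1 FROM reconstructions WHERE sample_id = ? AND run = ? AND method = ?",
        (sample_id, run, method),
    ).fetchone()
    return row is not None

def save_reconstruction(conn, sample_id, run, method, text):
    """Write one result; the context manager commits before returning."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO reconstructions VALUES (?, ?, ?, ?)",
            (sample_id, run, method, text),
        )

conn = open_store()
calls = 0
for attempt in range(2):  # the second pass simulates a restart
    for method in ["fewshot_russell", "fewshot_hume"]:
        if has_reconstruction(conn, "sample_000", 0, method):
            continue  # completed work is skipped
        calls += 1  # stands in for an expensive LLM call
        save_reconstruction(conn, "sample_000", 0, method, f"text from {method}")
```

Committing after every response means a crash loses at most the call that was in flight.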

### Install Libraries and Check

In [None]:
!pip install -r requirements.txt

In [None]:
try:
    import litellm
    print('Providers\n=========')
    print('* ' + '\n* '.join(litellm.LITELLM_CHAT_PROVIDERS))
    litellm.drop_params = True
except ImportError as e:
    print(f"✗ Cannot import litellm: {e}")

## Setup and Configuration

In [None]:
import os
import json
from pathlib import Path
import pandas as pd
from datetime import datetime

### Initialize Base Objects

In [None]:
# Model Configuration
model_reconstruction_string = 'anthropic/claude-sonnet-4-5-20250929'
model_reconstruction_api_key_env_var = 'ANTHROPIC_API_KEY'
model_judge_string = 'anthropic/claude-sonnet-4-5-20250929'
model_judge_api_key_env_var = 'ANTHROPIC_API_KEY'

In [None]:
from belletrist import PromptMaker, DataSampler, StyleEvaluationStore

# ============================================================================
# CONFIGURATION - Modify these parameters before running
# ============================================================================

# Data paths
DATA_PATH_RUSSELL = Path(os.getcwd()) / "data" / "russell"
DATA_PATH_OTHER = Path(os.getcwd()) / "data" / "other_author"
EVALUATION_DB_PATH = Path(os.getcwd()) / "style_eval_fewshot_sources.db"

# Methods for this experiment (must be exactly 5 for 5-way comparison)
METHODS = [
    'fewshot_russell',
    'fewshot_chesterton', 
    'fewshot_clausewitz',
    'fewshot_freud',
    'fewshot_hume'
]

# Mapping of methods to source files
FEWSHOT_SOURCE_FILES = {
    'fewshot_chesterton': 'excerpt_chesterton_orthodoxy.txt',
    'fewshot_clausewitz': 'excerpt_clausewitz_on_war.txt',
    'fewshot_freud': 'excerpt_freud_a_general_introduction_to_psychoanalysis.txt',
    'fewshot_hume': 'excerpt_hume_a_treatise_on_human_nature.txt'
}

# ============================================================================

# Validate configuration
if not DATA_PATH_RUSSELL.exists():
    raise FileNotFoundError(
        f"Russell data directory not found: {DATA_PATH_RUSSELL}\n"
        f"Please ensure the data directory exists."
    )

if not DATA_PATH_OTHER.exists():
    raise FileNotFoundError(
        f"Other authors data directory not found: {DATA_PATH_OTHER}\n"
        f"Please ensure the data directory exists."
    )

# Validate that all source files exist
for method, filename in FEWSHOT_SOURCE_FILES.items():
    filepath = DATA_PATH_OTHER / filename
    if not filepath.exists():
        raise FileNotFoundError(f"Source file not found: {filepath}")

# Initialize components
prompt_maker = PromptMaker()

sampler_russell = DataSampler(data_path=DATA_PATH_RUSSELL.resolve())
sampler_other = DataSampler(data_path=DATA_PATH_OTHER.resolve())

store = StyleEvaluationStore(
    EVALUATION_DB_PATH,
    methods=METHODS
)

print(f"✓ Russell data path: {DATA_PATH_RUSSELL}")
print(f"✓ Other authors data path: {DATA_PATH_OTHER}")
print(f"✓ Evaluation database: {EVALUATION_DB_PATH}")
print(f"✓ Configured methods: {METHODS}")

In [None]:
from belletrist import LLM, LLMConfig

reconstruction_llm = LLM(LLMConfig(
    model=model_reconstruction_string,
    api_key=os.environ.get(model_reconstruction_api_key_env_var)
))
judge_llm = LLM(LLMConfig(
    model=model_judge_string,
    api_key=os.environ.get(model_judge_api_key_env_var)
))

### Sample Test Data and Few-Shot Data

Test texts are Russell paragraphs. Few-shot examples come from 5 different sources.

In [None]:
# Experiment parameters
n_sample = 5
m_paragraphs_per_sample = 1
n_few_shot_per_source = 3  # Number of examples from each source

In [None]:
# Select deterministic test samples (Russell)
quality_texts_deterministic = [
    sampler_russell.get_paragraph_chunk(file_index=0, paragraph_range=slice(9, 9+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=0, paragraph_range=slice(29, 29+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=0, paragraph_range=slice(131, 131+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=1, paragraph_range=slice(13, 13+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=1, paragraph_range=slice(39, 39+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=1, paragraph_range=slice(192, 192+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=2, paragraph_range=slice(20, 20+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=2, paragraph_range=slice(43, 43+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=2, paragraph_range=slice(146, 146+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=3, paragraph_range=slice(7, 7+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=3, paragraph_range=slice(73, 73+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=3, paragraph_range=slice(202, 202+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=3, paragraph_range=slice(285, 285+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=4, paragraph_range=slice(4, 4+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=4, paragraph_range=slice(67, 67+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=4, paragraph_range=slice(124, 124+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=5, paragraph_range=slice(6, 6+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=5, paragraph_range=slice(119, 119+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=5, paragraph_range=slice(301, 301+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=6, paragraph_range=slice(23, 23+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=6, paragraph_range=slice(75, 75+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=6, paragraph_range=slice(152, 152+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=6, paragraph_range=slice(198, 198+m_paragraphs_per_sample)),
    sampler_russell.get_paragraph_chunk(file_index=6, paragraph_range=slice(271, 271+m_paragraphs_per_sample)),
]
reindex = [0, 5, 10, 15, 20, 1, 6, 11, 16, 21, 2, 7, 12, 17, 22, 3, 8, 13, 18, 23, 4, 9, 14, 19]
test_texts = [
    quality_texts_deterministic[i] for i in reindex[:n_sample]
]

In [None]:
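# Load few-shot examples from Russell (baseline)
few_shot_russell = []
few_shot_segments = []  # segments already chosen, so examples are not repeated
max_attempts = 1000  # guard against an endless loop on a small corpus
attempts = 0
while len(few_shot_russell) < n_few_shot_per_source:
    attempts += 1
    if attempts > max_attempts:
        raise RuntimeError("Could not sample enough non-overlapping Russell few-shot examples")
    p = sampler_russell.sample_segment(p_length=m_paragraphs_per_sample)

    # Reject p if it overlaps a test text or an already chosen example
    # (same file and intersecting paragraph ranges)
    has_overlap = any(
        p.file_index == seg.file_index and
        p.paragraph_start < seg.paragraph_end and
        seg.paragraph_start < p.paragraph_end
        for seg in test_texts + few_shot_segments
    )

    if not has_overlap:
        few_shot_russell.append(p.text)
        few_shot_segments.append(p)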

print(f"✓ Loaded {len(few_shot_russell)} Russell few-shot examples")
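
The overlap test used when sampling Russell few-shot examples is the standard interval-intersection check. The sketch below isolates it, assuming `paragraph_end` is exclusive (as the `slice(start, start + m)` calls suggest); `ranges_overlap` is a hypothetical helper, not part of `belletrist`.

```python
def ranges_overlap(start_a, end_a, start_b, end_b):
    """True if half-open ranges [start_a, end_a) and [start_b, end_b) intersect."""
    return start_a < end_b and start_b < end_a

same_paragraph = ranges_overlap(9, 10, 9, 10)   # identical single paragraphs
adjacent = ranges_overlap(9, 10, 10, 11)        # neighbours share no paragraph
partial = ranges_overlap(8, 11, 10, 12)         # paragraph 10 is in both
```

With exclusive ends, adjacent paragraphs do not count as overlapping, so a few-shot example may sit directly next to a test paragraph.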

In [None]:
# Load few-shot examples from other authors
def load_fewshot_from_file(filename: str, n_examples: int) -> list[str]:
    """Load n_examples paragraphs from a text file."""
    # Find file index for this filename
    file_idx = None
    for idx, f in enumerate(sampler_other.files):
        if f.name == filename:
            file_idx = idx
            break
    
    if file_idx is None:
        raise ValueError(f"File not found in sampler: {filename}")
    
    # Sample n_examples paragraphs spaced evenly across the text
    examples = []

    # Get total paragraphs in file
    total_paras = len(sampler_other.paragraph_indices[file_idx])

    # Evenly spaced start positions, clamped so each chunk stays inside the file.
    # Note: in a very short file the clamped starts can coincide.
    step = max(1, total_paras // (n_examples + 1))
    for i in range(n_examples):
        start = max(0, min(step * (i + 1), total_paras - m_paragraphs_per_sample))
        segment = sampler_other.get_paragraph_chunk(
            file_index=file_idx,
            paragraph_range=slice(start, start + m_paragraphs_per_sample)
        )
        examples.append(segment.text)
    
    return examples

# Load examples from each source
few_shot_chesterton = load_fewshot_from_file(
    FEWSHOT_SOURCE_FILES['fewshot_chesterton'],
    n_few_shot_per_source
)
few_shot_clausewitz = load_fewshot_from_file(
    FEWSHOT_SOURCE_FILES['fewshot_clausewitz'],
    n_few_shot_per_source
)
few_shot_freud = load_fewshot_from_file(
    FEWSHOT_SOURCE_FILES['fewshot_freud'],
    n_few_shot_per_source
)
few_shot_hume = load_fewshot_from_file(
    FEWSHOT_SOURCE_FILES['fewshot_hume'],
    n_few_shot_per_source
)

print(f"✓ Loaded {len(few_shot_chesterton)} Chesterton few-shot examples")
print(f"✓ Loaded {len(few_shot_clausewitz)} Clausewitz few-shot examples")
print(f"✓ Loaded {len(few_shot_freud)} Freud few-shot examples")
print(f"✓ Loaded {len(few_shot_hume)} Hume few-shot examples")
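
To see which paragraphs the even-spacing rule in `load_fewshot_from_file` picks, the start-index arithmetic can be run on its own. `evenly_spaced_starts` is a hypothetical stand-alone helper for illustration; the lower clamp at 0 is a defensive choice of this sketch.

```python
def evenly_spaced_starts(total_paras, n_examples, chunk_len=1):
    """Start indices spaced evenly across a file of total_paras paragraphs."""
    step = max(1, total_paras // (n_examples + 1))
    return [
        max(0, min(step * (i + 1), total_paras - chunk_len))
        for i in range(n_examples)
    ]

starts_long = evenly_spaced_starts(100, 3)  # step is 100 // 4 = 25
starts_short = evenly_spaced_starts(3, 3)   # step falls back to 1

print(starts_long)
print(starts_short)
```

For a 100-paragraph file the three examples start at paragraphs 25, 50 and 75. For a 3-paragraph file the clamp yields 1, 2, 2, so very short source files can produce repeated examples.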

## Evaluation Pipeline

### Step 1: Content Flattening

Extract content from each test sample, removing stylistic elements:

In [None]:
from belletrist.prompts import StyleFlatteningConfig

print("=== Step 1: Flattening and Saving Samples ===\n")

for k_text, test_sample in enumerate(test_texts):
    sample_id = f"sample_{k_text:03d}"
    
    # Skip if already saved
    if store.get_sample(sample_id):
        print(f"✓ {sample_id} already flattened (skipping)")
        continue
    
    print(f"Flattening {sample_id}...", end=" ")
    
    # Flatten content
    flatten_prompt = prompt_maker.render(
        StyleFlatteningConfig(text=test_sample.text)
    )
    flattened = reconstruction_llm.complete(flatten_prompt)
    
    # Save to store with provenance
    source_info = f"File {test_sample.file_index}, para {test_sample.paragraph_start}-{test_sample.paragraph_end}"
    store.save_sample(
        sample_id=sample_id,
        original_text=test_sample.text,
        flattened_content=flattened.content,
        flattening_model=flattened.model,
        source_info=source_info
    )
    
    print(f"✓ ({len(flattened.content)} chars)")

print(f"\n✓ All samples flattened and saved to store")

### Step 2: Reconstruction with Different Few-Shot Sources

Generate reconstructions using the same prompt template but different few-shot examples:

In [None]:
from belletrist.prompts import StyleReconstructionFewShotConfig

# Configuration
n_runs = 2  # Stochastic runs per sample

# Map methods to their few-shot examples
FEWSHOT_EXAMPLES = {
    'fewshot_russell': few_shot_russell,
    'fewshot_chesterton': few_shot_chesterton,
    'fewshot_clausewitz': few_shot_clausewitz,
    'fewshot_freud': few_shot_freud,
    'fewshot_hume': few_shot_hume
}

print("=== Step 2: Generating Reconstructions ===\n")

for sample_id in store.list_samples():
    sample = store.get_sample(sample_id)
    print(f"\n{sample_id}:")
    
    for run in range(n_runs):
        print(f"  Run {run}:")
        
        for method in METHODS:
            if store.has_reconstruction(sample_id, run, method):
                print(f"    ✓ {method:25s} (already done)")
                continue
            
            # Get few-shot examples for this method
            few_shot_examples = FEWSHOT_EXAMPLES[method]
            
            # Build reconstruction prompt
            config = StyleReconstructionFewShotConfig(
                content_summary=sample['flattened_content'],
                few_shot_examples=few_shot_examples
            )
            prompt = prompt_maker.render(config)
            response = reconstruction_llm.complete(prompt)
            
            # Save immediately (crash resilient!)
            store.save_reconstruction(
                sample_id=sample_id,
                run=run,
                method=method,
                reconstructed_text=response.content,
                model=response.model
            )
            print(f"    ✓ {method:25s} ({len(response.content)} chars)")

stats = store.get_stats()
print(f"\n✓ Generated {stats['n_reconstructions']} total reconstructions")
print(f"✓ Configured methods: {stats['configured_methods']}")

### Step 3: Judging

Compare each reconstruction against the original using blind 5-way ranking:

In [None]:
import zlib

from belletrist.style_evaluation_models import StyleJudgmentComparative5Way
from belletrist.prompts import StyleJudgeComparative5WayConfig

n_judge_runs = 1

print("=== Step 3: Comparative Blind Judging (5-way) ===\n")

for sample_id in store.list_samples():
    sample = store.get_sample(sample_id)
    print(f"\n{sample_id}:")

    for reconstruction_run in range(n_runs):
        print(f"  Reconstruction run {reconstruction_run}:")

        # Get all 5 reconstructions ONCE
        reconstructions = store.get_reconstructions(sample_id, reconstruction_run)
        if len(reconstructions) != len(METHODS):
            print(f"    ✗ Missing reconstructions (found {len(reconstructions)}/{len(METHODS)})")
            continue

        # Create mapping ONCE per reconstruction_run.
        # Built-in hash() on strings is salted per process, so it would give a
        # different seed after a restart; crc32 is stable across sessions.
        mapping_seed = zlib.crc32(f"{sample_id}_{reconstruction_run}".encode("utf-8"))
        mapping = store.create_random_mapping(seed=mapping_seed)
        
        # Build prompt ONCE per reconstruction_run
        judge_config = StyleJudgeComparative5WayConfig(
            original_text=sample['original_text'],
            reconstruction_text_a=reconstructions[mapping.text_a],
            reconstruction_text_b=reconstructions[mapping.text_b],
            reconstruction_text_c=reconstructions[mapping.text_c],
            reconstruction_text_d=reconstructions[mapping.text_d],
            reconstruction_text_e=reconstructions[mapping.text_e]
        )
        judge_prompt = prompt_maker.render(judge_config)
        
        # Judge the SAME reconstructions multiple times
        for judge_run in range(n_judge_runs):
            if store.has_judgment(sample_id, reconstruction_run, judge_run):
                print(f"    Judge run {judge_run}: ✓ Already judged (skipping)")
                continue
            
            print(f"    Judge run {judge_run}: Judging...", end=" ")
            
            # Get structured JSON judgment
            try:
                response = judge_llm.complete_with_schema(judge_prompt, StyleJudgmentComparative5Way)
                judgment = response.content
                
                # Save judgment
                store.save_judgment(
                    sample_id=sample_id,
                    reconstruction_run=reconstruction_run,
                    judgment=judgment,
                    mapping=mapping,
                    judge_model=response.model,
                    judge_run=judge_run
                )
                print(f"✓ (confidence: {judgment.confidence})")
                
            except Exception as e:
                print(f"✗ Error: {e}")

stats = store.get_stats()
print(f"\n✓ Completed {stats['n_judgments']} judgments")

## Results Analysis

In [None]:
# Export results to DataFrame
print("=== Exporting Results ===\n")

df = store.to_dataframe()

print(f"Total judgments: {len(df)}")
print(f"Samples: {df['sample_id'].nunique()}")
print(f"Reconstruction runs per sample: {df.groupby('sample_id')['reconstruction_run'].nunique().mean():.1f}")
print(f"Judge runs per reconstruction: {df.groupby(['sample_id', 'reconstruction_run'])['judge_run'].nunique().mean():.1f}")

# Show first few rows
print(f"\n=== Sample Results ===\n")
display_cols = ['sample_id', 'reconstruction_run', 'judge_run'] + [f'ranking_{m}' for m in METHODS] + ['confidence']
print(df[display_cols].head(25))

# Export to CSV
output_file = f"style_eval_fewshot_sources_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
df.to_csv(output_file, index=False)
print(f"\n✓ Results saved to {output_file}")

In [None]:
# Analyze ranking distributions
print("=== Ranking Distribution by Few-Shot Source ===\n")

for method in METHODS:
    col = f'ranking_{method}'
    print(f"\n{method.upper()}:")
    ranking_counts = df[col].value_counts().sort_index()
    for rank in [1, 2, 3, 4, 5]:
        count = ranking_counts.get(rank, 0)
        pct = (count / len(df) * 100) if len(df) > 0 else 0
        print(f"  Rank {rank}: {count:3d} ({pct:5.1f}%)")

print("\n=== Confidence Distribution ===\n")
print(df['confidence'].value_counts())

In [None]:
# Calculate method performance metrics
print("=== Few-Shot Source Performance Metrics ===\n")

# Calculate mean ranking for each method (lower is better: 1 = best, 5 = worst)
mean_rankings = {}
for method in METHODS:
    col = f'ranking_{method}'
    mean_rankings[method] = df[col].mean()

# Sort by mean ranking (best first)
sorted_methods = sorted(mean_rankings.items(), key=lambda x: x[1])

print("Average Ranking (lower is better):")
for i, (method, mean_rank) in enumerate(sorted_methods, 1):
    # Count how often this method ranked 1st
    first_place = (df[f'ranking_{method}'] == 1).sum()
    first_place_pct = (first_place / len(df) * 100) if len(df) > 0 else 0
    
    print(f"{i}. {method:25s}: {mean_rank:.2f} (1st place: {first_place}/{len(df)} = {first_place_pct:.1f}%)")

# Top-2 rate
print("\nTop-2 Rate (ranked 1st or 2nd):")
for method in METHODS:
    col = f'ranking_{method}'
    top2 = ((df[col] == 1) | (df[col] == 2)).sum()
    top2_pct = (top2 / len(df) * 100) if len(df) > 0 else 0
    print(f"  {method:25s}: {top2}/{len(df)} = {top2_pct:.1f}%")

# Bottom-2 rate (how often ranked 4th or 5th)
print("\nBottom-2 Rate (ranked 4th or 5th):")
for method in METHODS:
    col = f'ranking_{method}'
    bottom2 = ((df[col] == 4) | (df[col] == 5)).sum()
    bottom2_pct = (bottom2 / len(df) * 100) if len(df) > 0 else 0
    print(f"  {method:25s}: {bottom2}/{len(df)} = {bottom2_pct:.1f}%")

In [None]:
# Export final results
output_file = f"style_evaluation_fewshot_sources_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
df.to_csv(output_file, index=False)
print(f"\n✓ Results saved to {output_file}")

## Interpretation

### Key Questions:

1. **Does same-author few-shot win?** 
   - Compare Russell vs. all other sources
   - Expected: Russell should rank highest on average

2. **Does domain matter more than author?**
   - Compare Hume (philosophy) vs. Freud/Clausewitz (different domains)
   - If Hume ranks higher than Freud and Clausewitz, domain similarity helps; only a rank close to Russell's would suggest domain matters as much as authorship

3. **How much worse are cross-author examples?**
   - Mean rank difference between Russell and next-best
   - Practical significance: is cross-author few-shot viable?

4. **Are some authors actively harmful?**
   - Check if any method consistently ranks 4th or 5th
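   - Would suggest stylistic mismatch hurts reconstruction; showing it is worse than no examples needs a zero-shot baseline, which this experiment does not include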

### Next Steps:

- Statistical significance testing (Friedman test for rankings)
- Pairwise comparisons (Russell vs. each other source)
- Qualitative review of reconstructions
- Analyze judge reasoning for insights
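
As a starting point for the significance testing listed above, the Friedman statistic can be computed directly from the ranking columns. The sketch below uses only the standard library and assumes complete rankings without ties (each row a permutation of 1..5); the helper names and the toy rows are illustrative, not results. `scipy.stats.friedmanchisquare` is the usual library route if SciPy is installed.

```python
import math

def friedman_statistic(rank_rows):
    """Friedman chi-square for complete, tie-free rankings.

    rank_rows: list of rows, each holding the ranks 1..k given to k methods.
    Returns (statistic, degrees_of_freedom).
    """
    n = len(rank_rows)
    k = len(rank_rows[0])
    rank_sums = [sum(row[j] for row in rank_rows) for j in range(k)]
    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return q, k - 1

def chi2_sf_even_dof(x, dof):
    """Chi-square survival function; closed form valid for even dof only."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("closed form requires a positive even dof")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Toy data: four judgments that all agree on the same ordering.
unanimous = [[1, 2, 3, 4, 5]] * 4
q_unanimous, dof = friedman_statistic(unanimous)
p_unanimous = chi2_sf_even_dof(q_unanimous, dof)

# Toy data: rankings that rotate, so every method has the same rank sum.
rotating = [[(i + j) % 5 + 1 for j in range(5)] for i in range(5)]
q_rotating, _ = friedman_statistic(rotating)
```

With five methods the test has 4 degrees of freedom, which is why the even-dof closed form suffices here. The chi-square approximation is rough for small sample counts, and rows from repeated runs on the same sample are not independent, so averaging ranks per sample before testing may be more defensible.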