# Style Instruction Evaluation Framework

This notebook evaluates how well derived style instructions enable style replication, compared to alternative approaches.

## Methodology

For each gold standard text:
1. **Neutralize**: Rewrite text in bland journalistic prose, preserving rhetorical moves and argumentative structure while removing distinctive stylistic choices
2. **Reconstruct**: Generate text using 4 different methods (M stochastic runs each):
   - Generic baseline
   - Few-shot learning
   - Author name prompting
   - Derived style instructions
3. **Judge (Blind Comparative)**: Judge ranks all 4 reconstructions from 1-4 based on similarity to original
   - **Blind evaluation**: Judge sees only anonymous labels (Text A, B, C, D) - no method names
   - **Ranking**: 1 = most similar, 2 = second, 3 = third, 4 = least similar
   - **Order randomized**: The assignment of methods to labels varies across samples to reduce position bias
4. **Aggregate**: Analyze rankings to determine which method best captures style
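
The blind labelling in step 3 can be sketched as below. This is an illustration only, not the `belletrist` implementation: `make_blind_mapping`, `resolve_rankings`, and the example judge answer are hypothetical names and values introduced here.

```python
import random

LABELS = ["A", "B", "C", "D"]

def make_blind_mapping(methods, seed):
    """Assign each method to an anonymous label in a seeded random order."""
    if len(methods) != len(LABELS):
        raise ValueError("expected exactly four methods")
    shuffled = list(methods)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    return dict(zip(LABELS, shuffled))

def resolve_rankings(label_ranks, mapping):
    """Translate the judge's label -> rank answer into method -> rank."""
    if sorted(label_ranks.values()) != list(range(1, len(LABELS) + 1)):
        raise ValueError("ranks must be a permutation of 1..4")
    return {mapping[label]: rank for label, rank in label_ranks.items()}

methods = ["generic", "fewshot", "author", "agent_holistic"]
mapping = make_blind_mapping(methods, seed=0)

# Hypothetical judge output: the judge only ever sees the labels.
judge_answer = {"A": 2, "B": 1, "C": 4, "D": 3}
method_ranks = resolve_rankings(judge_answer, mapping)
print(method_ranks)
```

Keeping the mapping alongside each judgment is what allows anonymous rankings to be resolved back to methods at analysis time.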

## Key Features

- **Crash resilient**: All LLM responses saved to SQLite immediately
- **Resume support**: Can restart after failures, skips completed work
- **Blind evaluation**: Hides method names so the judge cannot favor a method by name
- **Comparative ranking**: More informative than binary comparisons
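
The save-immediately and skip-if-done pattern behind the first two features can be sketched with the standard `sqlite3` module. The table layout and function names here are assumptions for illustration; the actual `StyleEvaluationStore` schema is not shown in this notebook.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a store with one row per (sample, run, method)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reconstructions ("
        " sample_id TEXT, run INTEGER, method TEXT, text TEXT,"
        " PRIMARY KEY (sample_id, run, method))"
    )
    return conn

def has_reconstruction(conn, sample_id, run, method):
    row = conn.execute(
        "SELECT 1 FROM reconstructions WHERE sample_id=? AND run=? AND method=?",
        (sample_id, run, method),
    ).fetchone()
    return row is not None

def save_reconstruction(conn, sample_id, run, method, text):
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT OR REPLACE INTO reconstructions VALUES (?, ?, ?, ?)",
            (sample_id, run, method, text),
        )

conn = open_store()
calls = 0
for attempt in range(2):  # the second pass simulates a restart
    for method in ["generic", "fewshot"]:
        if has_reconstruction(conn, "sample_000", 0, method):
            continue  # completed work is skipped
        calls += 1  # stand-in for an expensive LLM call
        save_reconstruction(conn, "sample_000", 0, method, f"text from {method}")
print(calls)
```

Because each result is committed as soon as it arrives, a crash loses at most the call that was in flight.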

## Updated Approach (StyleNeutralization)

Unlike pure content extraction (StyleFlattening), this version uses **StyleNeutralization**, which:
- Preserves complete rhetorical structure (concessions, qualifications, emphasis patterns)
- Maintains logical connectors and argumentative moves
- Removes only distinctive stylistic execution (word choices, sentence rhythm, figurative language)
- Produces full-length neutral rewrites (~80-100% of original length)

This gives reconstruction methods the rhetorical scaffolding while leaving room for stylistic variation.
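
A quick sanity check on the length target could look like the sketch below. It measures length in whitespace-separated words, which is an assumption (the notebook itself only reports character counts), and the function names are hypothetical.

```python
def length_ratio(original: str, neutralized: str) -> float:
    """Word-count ratio of the neutral rewrite to the original."""
    n_original = len(original.split())
    if n_original == 0:
        raise ValueError("original text is empty")
    return len(neutralized.split()) / n_original

def looks_full_length(original: str, neutralized: str,
                      low: float = 0.8, high: float = 1.0) -> bool:
    """True if the rewrite falls inside the ~80-100% band described above."""
    return low <= length_ratio(original, neutralized) <= high

original = "one two three four five six seven eight nine ten"
print(looks_full_length(original, "one two three four five six seven eight nine"))
print(looks_full_length(original, "one two three four five"))
```

A rewrite far below the band suggests the neutralizer summarized the text instead of rewriting it in full.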

### Install Libraries and Check

In [6]:
!pip install -r requirements.txt



DEPRECATION: pyodbc 4.0.0-unsupported has a non-standard version number. pip 24.1 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pyodbc or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063

[notice] A new release of pip is available: 24.0 -> 25.0.1
[notice] To update, run: pip install --upgrade pip


In [7]:
try:
    import litellm
    print('Providers\n=========')
    print('* ' + '\n* '.join(litellm.LITELLM_CHAT_PROVIDERS))
    litellm.drop_params = True
except ImportError as e:
    print(f"✗ Cannot import litellm: {e}")

Providers
* openai
* openai_like
* bytez
* xai
* custom_openai
* text-completion-openai
* cohere
* cohere_chat
* clarifai
* anthropic
* anthropic_text
* replicate
* huggingface
* together_ai
* datarobot
* openrouter
* cometapi
* vertex_ai
* vertex_ai_beta
* gemini
* ai21
* baseten
* azure
* azure_text
* azure_ai
* sagemaker
* sagemaker_chat
* bedrock
* vllm
* nlp_cloud
* petals
* oobabooga
* ollama
* ollama_chat
* deepinfra
* perplexity
* mistral
* groq
* nvidia_nim
* cerebras
* baseten
* ai21_chat
* volcengine
* codestral
* text-completion-codestral
* deepseek
* sambanova
* maritalk
* cloudflare
* fireworks_ai
* friendliai
* watsonx
* watsonx_text
* triton
* predibase
* databricks
* empower
* github
* custom
* litellm_proxy
* hosted_vllm
* llamafile
* lm_studio
* galadriel
* gradient_ai
* github_copilot
* novita
* meta_llama
* featherless_ai
* nscale
* nebius
* dashscope
* moonshot
* v0
* heroku
* oci
* morph
* lambda_ai
* vercel_ai_gateway
* wandb
* ovhcloud
* lemonade


## Setup and Configuration

In [8]:
import os
import json
from pathlib import Path
import pandas as pd
from datetime import datetime

### Initialize Base Objects
The following base objects from the project library (`belletrist`) are initialized:
* `LLM`: the LLM object.
* `LLMConfig`: the configuration of the LLM object, such as what model to use.
* `PromptMaker`: generates prompts from templates and variables
* `DataSampler`: retrieves and samples text at a source directory

These objects carry out the LLM text transformations used in the evaluation process. They wrap third-party LLMs, which are split into a reconstruction model and a judge model; the key parameters for both are set below.
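
Each model string is paired with the name of the environment variable that holds its API key. Since `os.environ.get` returns `None` when a variable is unset, it can help to fail early. The helper below is a hypothetical sketch, not part of `belletrist`, and the key value in the example is a placeholder.

```python
import os

def resolve_api_key(model: str, env_var: str, environ=os.environ) -> str:
    """Return the API key for `model`, or raise if the variable is unset or empty."""
    key = environ.get(env_var)
    if not key:
        raise RuntimeError(f"{env_var} is not set; needed for model {model!r}")
    return key

# A plain dict stands in for os.environ so the example has no side effects.
fake_env = {"MISTRAL_API_KEY": "placeholder-key"}
print(resolve_api_key("mistral/mistral-large-2512", "MISTRAL_API_KEY", fake_env))

try:
    resolve_api_key("anthropic/claude-sonnet-4-5-20250929", "ANTHROPIC_API_KEY", fake_env)
except RuntimeError as err:
    print(err)
```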

In [9]:
#model_reconstruction_string = 'together_ai/Qwen/Qwen3-235B-A22B-Instruct-2507-tput'
#model_reconstruction_api_key_env_var = 'TOGETHER_AI_API_KEY'
#model_reconstruction_string = 'anthropic/claude-sonnet-4-5-20250929'
#model_reconstruction_api_key_env_var = 'ANTHROPIC_API_KEY'
#model_reconstruction_string = 'openai/gpt-5.1-2025-11-13'
#model_reconstruction_api_key_env_var = 'OPENAI_API_KEY'
#model_reconstruction_string = 'together_ai/moonshotai/Kimi-K2-Instruct'
#model_reconstruction_api_key_env_var = 'TOGETHER_AI_API_KEY'
model_reconstruction_string = 'mistral/mistral-large-2512'
model_reconstruction_api_key_env_var = 'MISTRAL_API_KEY'
model_judge_string = 'anthropic/claude-sonnet-4-5-20250929'
model_judge_api_key_env_var = 'ANTHROPIC_API_KEY'
#model_judge_string = 'mistral/mistral-large-2512'
#model_judge_api_key_env_var = 'MISTRAL_API_KEY'

In [10]:
from belletrist import PromptMaker, DataSampler, StyleEvaluationStore, SegmentStore

# ============================================================================
# CONFIGURATION - Modify these parameters before running
# ============================================================================

# Data paths
DATA_PATH = Path(os.getcwd()) / "data" / "russell"
SEGMENT_DB_PATH = Path(os.getcwd()) / "segments_mistral.db"
EVALUATION_DB_PATH = Path(os.getcwd()) / "tmp_style_eval_bland_s_mistral_r_mistral_j_anthropic.db"

# Methods for this experiment (must be exactly 4)
METHODS = ['generic', 'fewshot', 'author', 'agent_holistic']

# ============================================================================

# Validate configuration
if not DATA_PATH.exists():
    raise FileNotFoundError(
        f"Data directory not found: {DATA_PATH}\n"
        f"Please ensure the data directory exists."
    )

if not SEGMENT_DB_PATH.exists():
    raise FileNotFoundError(
        f"Segment database not found: {SEGMENT_DB_PATH}\n"
        f"The 'agent_holistic' method requires a segment catalog.\n"
        f"Please run 'python runs/style_retrieval.py' first to build the catalog."
    )

# Initialize components
prompt_maker = PromptMaker()

sampler = DataSampler(data_path=DATA_PATH.resolve())

store = StyleEvaluationStore(
    EVALUATION_DB_PATH,
    methods=METHODS
)

print(f"✓ Data path: {DATA_PATH}")
print(f"✓ Segment database: {SEGMENT_DB_PATH}")
print(f"✓ Evaluation database: {EVALUATION_DB_PATH}")
print(f"✓ Configured methods: {METHODS}")

✓ Data path: /Users/andersohrn/PycharmProjects/ClaudeCodeCourse/style-retrieval/data/russell
✓ Segment database: /Users/andersohrn/PycharmProjects/ClaudeCodeCourse/style-retrieval/segments_mistral.db
✓ Evaluation database: /Users/andersohrn/PycharmProjects/ClaudeCodeCourse/style-retrieval/tmp_style_eval_bland_s_mistral_r_mistral_j_anthropic.db
✓ Configured methods: ['generic', 'fewshot', 'author', 'agent_holistic']


In [11]:
store.reset('reconstructions_and_judgments')

In [12]:
from belletrist import LLM, LLMConfig

reconstruction_llm = LLM(LLMConfig(
    model=model_reconstruction_string,
    api_key=os.environ.get(model_reconstruction_api_key_env_var)
))
judge_llm = LLM(LLMConfig(
    model=model_judge_string,
    api_key=os.environ.get(model_judge_api_key_env_var)
))

### Sample Test Data and Few-Shot Data
The reconstruction tests build on gold standard texts. The few-shot method also draws its examples from the gold standard texts. To avoid skewing the results, no few-shot example may overlap with a test text.
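
The non-overlap rule can be sketched as follows. `Segment` is a hypothetical stand-in for whatever the sampler returns; only the three attributes used here (`file_index`, `paragraph_start`, `paragraph_end`) are assumed. The sketch samples without replacement, so no example is picked twice, and it raises instead of looping when too few candidates qualify.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:  # hypothetical stand-in for the sampler's segment type
    file_index: int
    paragraph_start: int
    paragraph_end: int  # exclusive

def overlaps(a: Segment, b: Segment) -> bool:
    """Half-open ranges [a0, a1) and [b0, b1) in the same file overlap if a0 < b1 and b0 < a1."""
    return (a.file_index == b.file_index
            and a.paragraph_start < b.paragraph_end
            and b.paragraph_start < a.paragraph_end)

def pick_few_shot(candidates, test_segments, k, seed=None):
    """Sample k distinct candidates that overlap none of the test segments."""
    eligible = [c for c in candidates
                if not any(overlaps(c, t) for t in test_segments)]
    if len(eligible) < k:
        raise ValueError(f"only {len(eligible)} eligible segments, need {k}")
    return random.Random(seed).sample(eligible, k)

candidates = [Segment(0, 9, 12), Segment(0, 11, 14), Segment(1, 13, 16),
              Segment(2, 20, 23), Segment(3, 7, 10)]
test_segments = [Segment(0, 9, 12)]
picked = pick_few_shot(candidates, test_segments, k=3, seed=1)
print(picked)
```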

In [13]:
n_sample = 1
m_paragraphs_per_sample = 3
n_few_shot_sample = 3

In [14]:
#test_texts = []
#for _ in range(n_sample):
#    test_texts.append(sampler.sample_segment(p_length=m_paragraphs_per_sample))

In [15]:
quality_texts_deterministic = [
    sampler.get_paragraph_chunk(file_index=0, paragraph_range=slice(9, 9+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=0, paragraph_range=slice(29, 29+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=0, paragraph_range=slice(131, 131+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=1, paragraph_range=slice(13, 13+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=1, paragraph_range=slice(39, 39+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=1, paragraph_range=slice(192, 192+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=2, paragraph_range=slice(20, 20+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=2, paragraph_range=slice(43, 43+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=2, paragraph_range=slice(146, 146+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=3, paragraph_range=slice(7, 7+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=3, paragraph_range=slice(73, 73+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=3, paragraph_range=slice(202, 202+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=3, paragraph_range=slice(285, 285+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=4, paragraph_range=slice(4, 4+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=4, paragraph_range=slice(67, 67+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=4, paragraph_range=slice(124, 124+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=5, paragraph_range=slice(6, 6+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=5, paragraph_range=slice(119, 119+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=5, paragraph_range=slice(301, 301+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=6, paragraph_range=slice(23, 23+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=6, paragraph_range=slice(75, 75+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=6, paragraph_range=slice(152, 152+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=6, paragraph_range=slice(198, 198+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=6, paragraph_range=slice(271, 271+m_paragraphs_per_sample)),
]
reindex = [0, 5, 10, 15, 20, 1, 6, 11, 16, 21, 2, 7, 12, 17, 22, 3, 8, 13, 18, 23, 4, 9, 14, 19]
test_texts = [
    quality_texts_deterministic[i] for i in reindex[:n_sample]
]

In [16]:
from random import choice
few_shot_texts = []
while len(few_shot_texts) < n_few_shot_sample:
    #p = sampler.sample_segment(p_length=m_paragraphs_per_sample)
    p = choice(quality_texts_deterministic)

    # Check if p overlaps with any test text
    # Two segments overlap if they're from the same file AND their paragraph ranges overlap
    # Ranges [a, b) and [c, d) overlap if: a < d AND c < b
    has_overlap = any(
        p.file_index == test_seg.file_index and
        p.paragraph_start < test_seg.paragraph_end and
        test_seg.paragraph_start < p.paragraph_end
        for test_seg in test_texts
    )

    if not has_overlap:
        few_shot_texts.append(p)

## Create the Test Transformation Objects
Pairing a prompt template with an LLM yields the following operators in the test chain:
* **Style Neutralizer**, which given a text rewrites it in bland journalistic prose while preserving rhetorical structure and argumentative moves.
* **Reconstructor, LLM House Style**, which given neutralized content expands it into a complete text with the "house style" of the LLM employed for the reconstruction.
* **Reconstructor, Few Shot**, which given neutralized content expands it into a complete text with a few text excerpts on unrelated topics as style guide.
* **Reconstructor, LLM Author Model**, which given neutralized content expands it into a complete text with the named author's style as the LLM conceives it without any other guidance.
* **Reconstructor, Style Instruction**, which given neutralized content expands it into a complete text following the detailed style instruction, as derived from previous analysis.

In [17]:
# Import from style-retrieval (no models/ subdirectory)
from belletrist.style_evaluation_models import (
    MethodMapping,
    StyleJudgmentComparative
)
from belletrist.prompts import (
    StyleNeutralizationConfig,  # Changed from StyleFlatteningConfig
    StyleReconstructionGenericConfig,
    StyleReconstructionFewShotConfig,
    StyleReconstructionAuthorConfig,
    StyleJudgeComparativeConfig
)

# Configuration
n_runs = 2
n_judge_runs = 1
AUTHOR_NAME = "Bertrand Russell"

# Reconstructor configs (agent_holistic handled separately)
RECONSTRUCTORS_CFGS = {
    'generic': StyleReconstructionGenericConfig,
    'fewshot': StyleReconstructionFewShotConfig,
    'author': StyleReconstructionAuthorConfig,
    'agent_holistic': None  # Special case: uses agent_rewriter.agent_rewrite_holistic
}

# Reconstructor kwargs
RECONSTRUCTORS_KWARGS = {
    'generic': {},
    'fewshot': {'few_shot_examples': [seg.text for seg in few_shot_texts]},
    'author': {'author_name': AUTHOR_NAME},
    'agent_holistic': {}  # Handled separately
}

print(f"✓ Configuration: {n_runs} runs per sample, {len(METHODS)} reconstruction methods")
print(f"✓ Methods: {', '.join(METHODS)}")

✓ Configuration: 2 runs per sample, 4 reconstruction methods
✓ Methods: generic, fewshot, author, agent_holistic


## Evaluation Pipeline

### Step 1: Content Neutralization

Rewrite each test sample in neutral journalistic prose, preserving rhetorical structure and argumentative moves while removing distinctive stylistic choices:

In [18]:
# Step 1: Save samples and neutralize content (preserving rhetorical structure)
print("=== Step 1: Neutralizing Style (Preserving Argument Structure) ===\n")

for k_text, test_sample in enumerate(test_texts):
    sample_id = f"sample_{k_text:03d}"
    
    # Skip if already saved
    if store.get_sample(sample_id):
        print(f"✓ {sample_id} already neutralized (skipping)")
        continue
    
    print(f"Neutralizing {sample_id}...", end=" ")
    
    # Neutralize content (rewrite in bland journalistic prose)
    neutralize_prompt = prompt_maker.render(
        StyleNeutralizationConfig(text=test_sample.text)
    )
    neutralized = reconstruction_llm.complete(neutralize_prompt)
    
    # Save to store with provenance
    source_info = f"File {test_sample.file_index}, para {test_sample.paragraph_start}-{test_sample.paragraph_end}"
    store.save_sample(
        sample_id=sample_id,
        original_text=test_sample.text,
        flattened_content=neutralized.content,  # Still using 'flattened_content' column for compatibility
        flattening_model=neutralized.model,
        source_info=source_info
    )
    
    print(f"✓ ({len(neutralized.content)} chars)")

print(f"\n✓ All samples neutralized and saved to store")

=== Step 1: Neutralizing Style (Preserving Argument Structure) ===

Neutralizing sample_000... ✓ (4280 chars)

✓ All samples neutralized and saved to store


In [19]:
print('====ORIGINAL===')
print(store.get_sample('sample_000')['original_text'])
print('\n\n====NEUTRALIZED (BLAND REWRITE)====')
print(store.get_sample('sample_000')['flattened_content'])

====ORIGINAL===
In reading even the best treatises on education written in former
times, one becomes aware of certain changes that have come over
educational theory. The two great reformers of educational theory
before the nineteenth century were Locke and Rousseau. Both deserved
their reputation, for both repudiated many errors which were
wide-spread when they wrote. But neither went as far in his own
direction as almost all modern educationists go. Both, for example,
belong to the tendency which led to liberalism and democracy; yet
both consider only the education of an aristocratic boy, to which one
man’s whole time is devoted. However excellent might be the results
of such a system, no man with a modern outlook would give it serious
consideration, because it is arithmetically impossible for every child
to absorb the whole time of an adult tutor. The system is therefore
one which can only be employed by a privileged caste; in a just world,
its existence would be impossible. The mode

### Step 2: Reconstruction

Generate reconstructions using all 4 methods, with M stochastic runs each:

In [20]:
# Step 2: Generate reconstructions (with crash resume and agent_holistic support)
print("=== Step 2: Generating Reconstructions ===\n")

for sample_id in store.list_samples():
    sample = store.get_sample(sample_id)
    print(f"\n{sample_id}:")
    
    for run in range(n_runs):
        print(f"  Run {run}:")
        
        # Check which methods need reconstruction
        for method in METHODS:
            if store.has_reconstruction(sample_id, run, method):
                print(f"    ✓ {method:15s} (already done)")
                continue
            
            # Special handling for agent_holistic
            if method == 'agent_holistic':
                from belletrist import agent_rewrite_holistic
                
                # Initialize LLMs with proper temperatures for agent workflow
                planning_llm = LLM(LLMConfig(
                    model=model_reconstruction_string,
                    api_key=os.environ.get(model_reconstruction_api_key_env_var),
                    temperature=0.5,  # Lower temperature for more stable (not deterministic) planning
                    max_tokens=8192  # Large JSON plans need generous token budget
                ))
                rewriting_llm = LLM(LLMConfig(
                    model=model_reconstruction_string,
                    api_key=os.environ.get(model_reconstruction_api_key_env_var),
                    temperature=0.7,  # Creative rewriting
                    max_tokens=8192  # Full paragraph rewrites need generous token budget
                ))
                
                with SegmentStore(SEGMENT_DB_PATH) as segment_store:
                    reconstructed_text = agent_rewrite_holistic(
                        flattened_content=sample['flattened_content'],
                        segment_store=segment_store,
                        planning_llm=planning_llm,
                        rewriting_llm=rewriting_llm,
                        prompt_maker=prompt_maker,
                        target_example_count=9,
                    )
                
                # Save reconstruction
                store.save_reconstruction(
                    sample_id=sample_id,
                    run=run,
                    method=method,
                    reconstructed_text=reconstructed_text,
                    model=model_reconstruction_string
                )
                print(f"    ✓ {method:15s} ({len(reconstructed_text)} chars)")
            else:
                # Standard reconstruction
                config = RECONSTRUCTORS_CFGS[method](
                    content_summary=sample['flattened_content'],
                    **RECONSTRUCTORS_KWARGS[method]
                )
                prompt = prompt_maker.render(config)
                response = reconstruction_llm.complete(prompt)
                
                # Save immediately (crash resilient!)
                store.save_reconstruction(
                    sample_id=sample_id,
                    run=run,
                    method=method,
                    reconstructed_text=response.content,
                    model=response.model
                )
                print(f"    ✓ {method:15s} ({len(response.content)} chars)")

stats = store.get_stats()
print(f"\n✓ Generated {stats['n_reconstructions']} total reconstructions")
print(f"✓ Configured methods: {stats['configured_methods']}")

=== Step 2: Generating Reconstructions ===


sample_000:
  Run 0:
    ✓ generic         (4111 chars)
    ✓ fewshot         (4387 chars)
    ✓ author          (4985 chars)
structural_diagnosis='Dialectical argument comparing historical and modern educational theories, needing sharper contrast, concrete grounding, and a stronger closing that reinforces democratic ideals without sacrificing nuance. Currently abstract and repetitive; requires texture and persuasive momentum.' selected_tags=['opens_with_concession', 'builds_via_contrast', 'builds_via_examples', 'builds_via_refinement', 'pivots_with_but', 'pairs_abstract_and_concrete', 'acknowledges_complexity', 'asserts_with_confidence', 'varies_sentence_length', 'closes_with_implication', 'closes_by_widening_scope'] target_example_count=10
You are rewriting a text that has been neutralized (written in bland journalistic prose). The text preserves all argumentative structure and rhetorical moves, but lacks distinctive stylistic choices. You

    ✓ agent_holistic  (4659 chars)
  Run 1:
    ✓ generic         (6291 chars)
    ✓ fewshot         (4140 chars)
    ✓ author          (5622 chars)
structural_diagnosis='Dialectical argument exploring the tension between democratic ideals and educational excellence, needing sharper grounding in concrete examples and clearer pivots between concession and assertion. Currently abstract; lacks rhythmic variation and persuasive momentum.' selected_tags=['opens_with_concession', 'builds_via_contrast', 'builds_via_examples', 'builds_via_refinement', 'pivots_with_but', 'pairs_abstract_and_concrete', 'acknowledges_complexity', 'varies_sentence_length', 'closes_with_implication', 'embeds_qualification', 'asserts_with_confidence'] target_example_count=10
You are rewriting a text that has been neutralized (written in bland journalistic prose). The text preserves all argumentative structure and rhetorical moves, but lacks distinctive stylistic choices. Your task is to enhance it using demonstrated

    ✓ agent_holistic  (4439 chars)

✓ Generated 8 total reconstructions
✓ Configured methods: ['generic', 'fewshot', 'author', 'agent_holistic']


In [21]:
config = RECONSTRUCTORS_CFGS['fewshot'](
    content_summary=sample['flattened_content'],
    **RECONSTRUCTORS_KWARGS['fewshot']
)
prompt = prompt_maker.render(config)
print(prompt)

You are given a text written in neutral, straightforward journalistic prose. The text preserves all argumentative structure, rhetorical moves, and content, but uses bland, unmarked language without distinctive stylistic choices.

Your task is to rewrite this text with the writing style demonstrated in the examples below.

**CRITICAL REQUIREMENTS:**
- Match the STYLE of the examples (rhythm, vocabulary, tone, sentence structure), not their formatting choices or topics.
- Preserve all argumentative structure and rhetorical moves from the neutral text.
- Transform the neutral prose by applying the stylistic techniques shown in the examples.
- If the examples are plain prose with no titles/headers/formatting, write plain prose.
- Do NOT add meta-commentary, preambles like "Here is the passage...", or postscripts explaining your work.
- Simply write naturally in this style.

**EXAMPLE TEXTS (demonstrating the target style):**

---

Much of mysticism underlies the ethics of Heraclitus. It is

In [22]:
reconstructions = store.get_reconstructions('sample_000', 0)
for reconstructor in reconstructions.keys():
    print(f"{reconstructor.upper()}\n===================")
    print(f"\n{reconstructions.get(reconstructor)}\n\n")

AGENT_HOLISTIC

Educational theory has not so much evolved as it has lurched—each generation of reformers convinced they have finally uncovered the one true method, only to leave behind a trail of discarded philosophies like so many broken slates. To read Locke or Rousseau today is to step into a world where the very air hums with the certainty of their contradictions. Both men were giants, yes, but giants who stood on the shoulders of a very small boy: the aristocratic heir, cosseted by tutors, nursemaids, and the quiet assurance that the world existed to serve his education. Their theories were revolutionary, but their vision was not. They dreamed of liberty and democracy in the abstract while designing systems that could only ever serve the few. And here lies the first great tension of modern pedagogy: how to reconcile the undeniable excellence of an elite education with the democratic imperative that every child should have a fair shot at it.

We are all hypocrites in this regard. 

### Step 3: Judging

Have the judge LLM rank all four reconstructions by similarity to the original:

In [41]:
import zlib

# Import from style-retrieval (no models/ subdirectory)
from belletrist.style_evaluation_models import StyleJudgmentComparative
from belletrist.prompts import StyleJudgeComparativeConfig

# Step 3: Comparative blind judging (with crash resume and judge consistency testing)
print("=== Step 3: Comparative Blind Judging ===\n")

for sample_id in store.list_samples():
    sample = store.get_sample(sample_id)
    print(f"\n{sample_id}:")
    
    for reconstruction_run in range(n_runs):
        print(f"  Reconstruction run {reconstruction_run}:")
        
        # Get all 4 reconstructions ONCE
        reconstructions = store.get_reconstructions(sample_id, reconstruction_run)
        if len(reconstructions) != 4:
            print(f"    ✗ Missing reconstructions (found {len(reconstructions)}/4)")
            continue
        
        # Create mapping ONCE per reconstruction_run. The seed comes from a CRC32 digest:
        # built-in hash() on strings is salted per process, so it is not reproducible.
        seed = zlib.crc32(f"{sample_id}_{reconstruction_run}".encode("utf-8"))
        mapping = store.create_random_mapping(seed=seed)
        
        # Build prompt ONCE per reconstruction_run (SAME prompt for all judge runs)
        judge_config = StyleJudgeComparativeConfig(
            original_text=sample['original_text'],
            reconstruction_text_a=reconstructions[mapping.text_a],
            reconstruction_text_b=reconstructions[mapping.text_b],
            reconstruction_text_c=reconstructions[mapping.text_c],
            reconstruction_text_d=reconstructions[mapping.text_d]
        )
        judge_prompt = prompt_maker.render(judge_config)
        
        # Judge the SAME reconstructions with the SAME prompt multiple times
        for judge_run in range(n_judge_runs):
            if store.has_judgment(sample_id, reconstruction_run, judge_run):
                print(f"    Judge run {judge_run}: ✓ Already judged (skipping)")
                continue
            
            print(f"    Judge run {judge_run}: Judging...", end=" ")
            
            # Get structured JSON judgment with schema enforcement
            try:
                response = judge_llm.complete_with_schema(judge_prompt, StyleJudgmentComparative)
                judgment = response.content  # Already validated Pydantic instance
                
                # Save judgment with both reconstruction_run and judge_run
                store.save_judgment(
                    sample_id=sample_id,
                    reconstruction_run=reconstruction_run,
                    judgment=judgment,
                    mapping=mapping,
                    judge_model=response.model,
                    judge_run=judge_run
                )
                print(f"✓ (confidence: {judgment.confidence})")
                
            except Exception as e:
                print(f"✗ Error: {e}")

stats = store.get_stats()
print(f"\n✓ Completed {stats['n_judgments']} judgments")

=== Step 3: Comparative Blind Judging ===


sample_000:
  Reconstruction run 0:
    Judge run 0: Judging... ✓ (confidence: high)
  Reconstruction run 1:
    Judge run 0: Judging... ✓ (confidence: high)

sample_001:
  Reconstruction run 0:
    Judge run 0: Judging... ✓ (confidence: high)
  Reconstruction run 1:
    Judge run 0: Judging... ✓ (confidence: high)

sample_002:
  Reconstruction run 0:
    Judge run 0: Judging... ✓ (confidence: high)
  Reconstruction run 1:
    Judge run 0: Judging... ✓ (confidence: high)

sample_003:
  Reconstruction run 0:
    Judge run 0: Judging... ✓ (confidence: high)
  Reconstruction run 1:
    Judge run 0: Judging... ✓ (confidence: high)

sample_004:
  Reconstruction run 0:
    Judge run 0: Judging... ✓ (confidence: high)
  Reconstruction run 1:
    Judge run 0: Judging... ✓ (confidence: high)

sample_005:
  Reconstruction run 0:
    Judge run 0: Judging... ✓ (confidence: high)
  Reconstruction run 1:
    Judge run 0: Judging... ✓ (confidence: high)

sa

## Results Analysis

In [42]:
# Export results to DataFrame
print("=== Exporting Results ===\n")

# Export from store (resolves anonymous rankings to methods)
df = store.to_dataframe()

print(f"Total judgments: {len(df)}")
print(f"Samples: {df['sample_id'].nunique()}")
print(f"Reconstruction runs per sample: {df.groupby('sample_id')['reconstruction_run'].nunique().mean():.1f}")
print(f"Judge runs per reconstruction: {df.groupby(['sample_id', 'reconstruction_run'])['judge_run'].nunique().mean():.1f}")

# Show first few rows (with dynamic method columns)
print(f"\n=== Sample Results ===\n")
display_cols = ['sample_id', 'reconstruction_run', 'judge_run'] + [f'ranking_{m}' for m in METHODS] + ['confidence']
print(df[display_cols].head(25))

# Export to CSV
output_file = f"style_eval_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
df.to_csv(output_file, index=False)
print(f"\n✓ Results saved to {output_file}")

=== Exporting Results ===

Total judgments: 20
Samples: 10
Reconstruction runs per sample: 2.0
Judge runs per reconstruction: 1.0

=== Sample Results ===

     sample_id  reconstruction_run  judge_run  ranking_generic  \
0   sample_000                   0          0                4   
1   sample_000                   1          0                2   
2   sample_001                   0          0                4   
3   sample_001                   1          0                3   
4   sample_002                   0          0                3   
5   sample_002                   1          0                2   
6   sample_003                   0          0                1   
7   sample_003                   1          0                2   
8   sample_004                   0          0                4   
9   sample_004                   1          0                3   
10  sample_005                   0          0                2   
11  sample_005                   1          0        

In [43]:
# Analyze judge consistency across multiple judgments of same reconstructions
print("=== Judge Consistency Analysis ===\n")

if 'judge_run' in df.columns and df['judge_run'].nunique() > 1:
    # For each (sample_id, reconstruction_run), check ranking variance across judge_runs
    consistency_results = []
    
    for (sample_id, recon_run), group in df.groupby(['sample_id', 'reconstruction_run']):
        if len(group) > 1:  # Only if multiple judge runs exist
            for method in METHODS:  # use the configured methods, not a hardcoded list
                col = f'ranking_{method}'
                ranks = group[col].values
                std_dev = ranks.std()
                mean_rank = ranks.mean()
                
                consistency_results.append({
                    'sample_id': sample_id,
                    'reconstruction_run': recon_run,
                    'method': method,
                    'mean_rank': mean_rank,
                    'std_dev': std_dev,
                    'min_rank': ranks.min(),
                    'max_rank': ranks.max(),
                    'rank_range': ranks.max() - ranks.min(),
                    'n_judgments': len(ranks)
                })
    
    if consistency_results:
        consistency_df = pd.DataFrame(consistency_results)
        
        print("Method-level consistency (across all samples):")
        method_consistency = consistency_df.groupby('method').agg({
            'std_dev': 'mean',
            'rank_range': 'mean'
        }).round(2)
        method_consistency.columns = ['Avg Std Dev', 'Avg Rank Range']
        print(method_consistency)
        print("\n(Lower values = more consistent)")
        
        print("\n\nMost inconsistent cases (judge disagreed most):")
        inconsistent = consistency_df.nlargest(10, 'rank_range')[
            ['sample_id', 'reconstruction_run', 'method', 'min_rank', 'max_rank', 'rank_range']
        ]
        print(inconsistent.to_string(index=False))
        
        print("\n\nMost consistent cases (judge agreed most):")
        consistent = consistency_df.nsmallest(10, 'std_dev')[
            ['sample_id', 'reconstruction_run', 'method', 'mean_rank', 'std_dev']
        ]
        print(consistent.to_string(index=False))
        
else:
    print("⚠ Only one judge_run per reconstruction. Set n_judge_runs > 1 to test consistency.")

=== Judge Consistency Analysis ===

⚠ Only one judge_run per reconstruction. Set n_judge_runs > 1 to test consistency.


In [44]:
print(df.loc[0,'reasoning'])

Let me analyze each text's stylistic similarity to the original by internalizing the original author's voice first.

**The Original's Voice:**
The original has a distinctive philosophical yet accessible style. Key characteristics include:
- Clear, direct prose that builds complex arguments methodically
- Balanced sentences that often use qualifications ("I do not mean that... What I do mean is...")
- A conversational yet authoritative tone that addresses the reader directly
- Practical examples grounded in reality (Montessori's work in slums)
- Measured, reasonable argumentation that acknowledges counterpoints
- Natural flow without excessive ornamentation
- Straightforward vocabulary with occasional formal constructions
- A sense of earnest engagement with real-world problems

**Text A Analysis:**
This text immediately strikes a very different tone. The opening—"It is a curious fact, and one not without its irony"—feels overly literary and self-conscious. The prose is heavily ornament

In [45]:
# Analyze ranking distributions (dynamic for configured methods)
print("=== Ranking Distribution by Method ===\n")

for method in METHODS:
    col = f'ranking_{method}'
    print(f"\n{method.upper()}:")
    ranking_counts = df[col].value_counts().sort_index()
    for rank in range(1, len(METHODS) + 1):  # ranks run from 1 to the number of methods
        count = ranking_counts.get(rank, 0)
        pct = (count / len(df) * 100) if len(df) > 0 else 0
        print(f"  Rank {rank}: {count:3d} ({pct:5.1f}%)")

print("\n=== Confidence Distribution ===\n")
print(df['confidence'].value_counts())

=== Ranking Distribution by Method ===


GENERIC:
  Rank 1:   1 (  5.0%)
  Rank 2:   6 ( 30.0%)
  Rank 3:   7 ( 35.0%)
  Rank 4:   6 ( 30.0%)

FEWSHOT:
  Rank 1:  13 ( 65.0%)
  Rank 2:   4 ( 20.0%)
  Rank 3:   2 ( 10.0%)
  Rank 4:   1 (  5.0%)

AUTHOR:
  Rank 1:   0 (  0.0%)
  Rank 2:   1 (  5.0%)
  Rank 3:   9 ( 45.0%)
  Rank 4:  10 ( 50.0%)

AGENT_HOLISTIC:
  Rank 1:   6 ( 30.0%)
  Rank 2:   9 ( 45.0%)
  Rank 3:   2 ( 10.0%)
  Rank 4:   3 ( 15.0%)

=== Confidence Distribution ===

confidence
high    20
Name: count, dtype: int64


In [46]:
# Calculate method performance metrics (dynamic for configured methods)
print("=== Method Performance Metrics ===\n")

# Calculate mean ranking for each method (lower is better: 1 = best, 4 = worst)
mean_rankings = {}
for method in METHODS:
    col = f'ranking_{method}'
    mean_rankings[method] = df[col].mean()

# Sort by mean ranking (best first)
sorted_methods = sorted(mean_rankings.items(), key=lambda x: x[1])

print("Average Ranking (lower is better):")
for i, (method, mean_rank) in enumerate(sorted_methods, 1):
    # Count how often this method ranked 1st
    first_place = (df[f'ranking_{method}'] == 1).sum()
    first_place_pct = (first_place / len(df) * 100) if len(df) > 0 else 0
    
    print(f"{i}. {method:15s}: {mean_rank:.2f} (1st place: {first_place}/{len(df)} = {first_place_pct:.1f}%)")

# Top-2 rate (share of judgments in which the method ranked 1st or 2nd)
print("\nTop-2 Rate (ranked 1st or 2nd):")
for method in METHODS:
    col = f'ranking_{method}'
    top2 = ((df[col] == 1) | (df[col] == 2)).sum()
    top2_pct = (top2 / len(df) * 100) if len(df) > 0 else 0
    print(f"  {method:15s}: {top2}/{len(df)} = {top2_pct:.1f}%")

=== Method Performance Metrics ===

Average Ranking (lower is better):
1. fewshot        : 1.55 (1st place: 13/20 = 65.0%)
2. agent_holistic : 2.10 (1st place: 6/20 = 30.0%)
3. generic        : 2.90 (1st place: 1/20 = 5.0%)
4. author         : 3.45 (1st place: 0/20 = 0.0%)

Top-2 Rate (ranked 1st or 2nd):
  generic        : 7/20 = 35.0%
  fewshot        : 17/20 = 85.0%
  author         : 1/20 = 5.0%
  agent_holistic : 15/20 = 75.0%
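
Mean rank and top-2 rate summarize each method in isolation. A pairwise view, counting how often one method is ranked ahead of another within the same judgment, can complement them. The sketch below is a minimal standard-library illustration. The function `pairwise_win_rates` and the `toy_rows` data are hypothetical: they are not part of the notebook and are not results from this evaluation.

```python
from itertools import permutations


def pairwise_win_rates(rank_rows, methods):
    """Return {(a, b): fraction of judgments in which a is ranked ahead of b}.

    rank_rows: list of dicts mapping method name -> rank (1 = most similar).
    """
    n = len(rank_rows)
    rates = {}
    for a, b in permutations(methods, 2):
        # A lower rank number means "more similar to the original".
        wins = sum(1 for row in rank_rows if row[a] < row[b])
        rates[(a, b)] = wins / n if n else float("nan")
    return rates


# Made-up rankings for illustration only (NOT results from this evaluation).
toy_methods = ["generic", "fewshot", "author", "agent_holistic"]
toy_rows = [
    {"generic": 3, "fewshot": 1, "author": 4, "agent_holistic": 2},
    {"generic": 4, "fewshot": 2, "author": 3, "agent_holistic": 1},
    {"generic": 2, "fewshot": 1, "author": 4, "agent_holistic": 3},
]

rates = pairwise_win_rates(toy_rows, toy_methods)
for (a, b), rate in sorted(rates.items()):
    print(f"{a:15s} ahead of {b:15s}: {rate:.2f}")
```

With the notebook's DataFrame, one dict per row could presumably be built from the `ranking_<method>` columns. That mapping is an assumption about the column layout, so check it against the actual schema first.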


## Detailed Inspection

Examine individual samples, reconstructions, and judge reasoning:

In [48]:
# View judge reasoning for a specific sample/run
INSPECT_SAMPLE = 'sample_008'
INSPECT_RUN = 0
INSPECT_JUDGE_RUN = 0

# Query judgment directly from database
judgment = store.conn.execute("""
    SELECT * FROM comparative_judgments
    WHERE sample_id=? AND reconstruction_run=? AND judge_run=?
""", (INSPECT_SAMPLE, INSPECT_RUN, INSPECT_JUDGE_RUN)).fetchone()

if judgment:
    print(f"=== JUDGE REASONING: {INSPECT_SAMPLE}, Reconstruction Run {INSPECT_RUN}, Judge Run {INSPECT_JUDGE_RUN} ===\n")
    
    # Build mapping to show which label = which method
    label_to_method = {
        'A': judgment['method_text_a'],
        'B': judgment['method_text_b'],
        'C': judgment['method_text_c'],
        'D': judgment['method_text_d']
    }
    
    label_to_rank = {
        'A': judgment['ranking_text_a'],
        'B': judgment['ranking_text_b'],
        'C': judgment['ranking_text_c'],
        'D': judgment['ranking_text_d']
    }
    
    print("ANONYMOUS LABELS → METHODS:")
    for label in ['A', 'B', 'C', 'D']:
        method = label_to_method[label]
        rank = label_to_rank[label]
        print(f"  Text {label} = {method:25s} → Rank {rank}")
    
    print(f"\nConfidence: {judgment['confidence']}")
    print(f"Judge Model: {judgment['judge_model']}")
    
    # Show reasoning
    print(f"\n{'REASONING':-^80}")
    print(judgment['reasoning'])
else:
    print(f"No judgment found for {INSPECT_SAMPLE}, run {INSPECT_RUN}, judge run {INSPECT_JUDGE_RUN}")

=== JUDGE REASONING: sample_008, Reconstruction Run 0, Judge Run 0 ===

ANONYMOUS LABELS → METHODS:
  Text A = fewshot                   → Rank 1
  Text B = author                    → Rank 3
  Text C = agent_holistic            → Rank 2
  Text D = generic                   → Rank 4

Confidence: high
Judge Model: claude-sonnet-4-5-20250929

-----------------------------------REASONING------------------------------------
Let me analyze each text's stylistic similarity to the original by examining voice, rhythm, word choice, and the relationship with the reader.

**Original's Distinctive Voice:**
The original has a conversational yet philosophical tone. It's direct and unpretentious—"I must therefore be content merely to state"—with natural transitions and a sense of someone thinking aloud. The prose flows smoothly without ornate flourishes. Sentences vary in length naturally. The author addresses the reader directly ("the reader may feel likewise") and uses straightforward constructions

In [None]:
# Compare rankings across all runs for a specific sample
INSPECT_SAMPLE = 'sample_000'

# Query all judgments for this sample from database
judgments = store.conn.execute("""
    SELECT * FROM comparative_judgments
    WHERE sample_id=?
    ORDER BY reconstruction_run, judge_run
""", (INSPECT_SAMPLE,)).fetchall()

if judgments:
    print(f"=== RANKING CONSISTENCY FOR {INSPECT_SAMPLE} ===\n")
    print(f"Total judgments: {len(judgments)}\n")
    
    # Show rankings by run
    print(f"{'Run':<6} {'Judge':<6} ", end='')
    for method in METHODS:
        print(f"{method[:12]:<14}", end='')
    print()
    print("-" * 70)
    
    # Collect rankings for mean calculation
    rankings_by_method = {method: [] for method in METHODS}
    
    for j in judgments:
        # Build method->rank mapping for this judgment
        method_ranks = {}
        for label in ['a', 'b', 'c', 'd']:
            method = j[f'method_text_{label}']
            rank = j[f'ranking_text_{label}']
            method_ranks[method] = rank
        
        # Print row
        print(f"{j['reconstruction_run']:<6} {j['judge_run']:<6} ", end='')
        for method in METHODS:
            rank = method_ranks[method]
            rankings_by_method[method].append(rank)
            print(f"{rank:<14}", end='')
        print()
    
    # Show mean rankings
    print("\n\nMean rankings across all runs:")
    import statistics
    for method in METHODS:
        ranks = rankings_by_method[method]
        mean_rank = statistics.mean(ranks)
        std_rank = statistics.stdev(ranks) if len(ranks) > 1 else 0.0
        print(f"  {method:25s}: {mean_rank:.2f} ± {std_rank:.2f}")
else:
    print(f"No judgments found for {INSPECT_SAMPLE}")

In [None]:
# View all reconstructions for a specific sample and run
INSPECT_SAMPLE = 'sample_000'
INSPECT_RUN = 0

print(f"=== RECONSTRUCTIONS FOR {INSPECT_SAMPLE}, RUN {INSPECT_RUN} ===\n")

sample = store.get_sample(INSPECT_SAMPLE)
reconstructions = store.get_reconstructions(INSPECT_SAMPLE, INSPECT_RUN)

print(f"{'ORIGINAL':-^80}")
print(sample['original_text'])
print("\n\n")

for method in METHODS:
    print(f"{method.upper():-^80}")
    print(reconstructions[method])
    print(f"\n({len(reconstructions[method])} chars)\n")

In [None]:
# View the reconstruction prompt for a specific method
INSPECT_SAMPLE = 'sample_000'
INSPECT_METHOD = 'fewshot'  # One of 'generic', 'fewshot', 'author'; 'agent_holistic' has no single prompt (multi-step workflow)

sample = store.get_sample(INSPECT_SAMPLE)

print(f"=== RECONSTRUCTION PROMPT: {INSPECT_METHOD.upper()} ===\n")

if INSPECT_METHOD == 'agent_holistic':
    print("⚠ The 'agent_holistic' method uses a multi-step agentic workflow,")
    print("  not a single prompt. See belletrist/agent_rewriter.py for details.")
elif INSPECT_METHOD in RECONSTRUCTORS_CFGS:
    config = RECONSTRUCTORS_CFGS[INSPECT_METHOD](
        content_summary=sample['flattened_content'],
        **RECONSTRUCTORS_KWARGS[INSPECT_METHOD]
    )
    prompt = prompt_maker.render(config)
    print(prompt)
else:
    print(f"Unknown method: {INSPECT_METHOD}")

## Next Steps

TODO: Add analysis cells for:
- Statistical significance testing (e.g., Friedman test for rankings)
- Visualization of results (bar charts, violin plots)
- Pairwise method comparisons
- Effect size calculations
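
As a possible starting point for the significance and effect-size items, the Friedman statistic and Kendall's W (a common effect size for it) can be computed directly from per-judgment ranks. The sketch below uses only the standard library and assumes every judgment is a complete ranking with no ties. The function `friedman_q_and_w` and the `toy_rows` data are hypothetical and made up for illustration. A p-value would typically come from a chi-square distribution with k - 1 degrees of freedom (for example via `scipy.stats.friedmanchisquare`, assuming SciPy is installed); that approximation is rough for small samples.

```python
def friedman_q_and_w(rank_rows, methods):
    """Friedman chi-square statistic Q and Kendall's W for untied rankings.

    rank_rows: list of dicts mapping method name -> rank (1..k, no ties).
    """
    n = len(rank_rows)  # number of judgments (blocks)
    k = len(methods)    # number of methods (treatments)
    if n == 0 or k < 2:
        raise ValueError("need at least one judgment and two methods")

    # Sum of ranks each method received across all judgments.
    rank_sums = [sum(row[m] for row in rank_rows) for m in methods]

    # Friedman statistic (no tie correction, since ties are assumed absent).
    q = 12 * sum(r * r for r in rank_sums) / (n * k * (k + 1)) - 3 * n * (k + 1)

    # Kendall's W: 0 = no agreement across judgments, 1 = identical rankings.
    w = q / (n * (k - 1))
    return q, w


# Made-up rankings for illustration only (NOT results from this evaluation).
toy_methods = ["generic", "fewshot", "author", "agent_holistic"]
toy_rows = [
    {"generic": 3, "fewshot": 1, "author": 4, "agent_holistic": 2},
    {"generic": 4, "fewshot": 2, "author": 3, "agent_holistic": 1},
    {"generic": 2, "fewshot": 1, "author": 4, "agent_holistic": 3},
]

q, w = friedman_q_and_w(toy_rows, toy_methods)
print(f"Friedman Q = {q:.3f}, Kendall's W = {w:.3f}")
```

One caveat on the design: the test treats each judgment as an independent block. Repeated reconstruction or judge runs of the same sample are correlated, so it may be safer to average ranks per sample before testing.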

The detailed inspection cells above allow you to:
- View judge reasoning with method labels revealed
- Check ranking consistency across multiple runs
- Compare all reconstructions side-by-side
- Examine the prompts used for each method