# Style Instruction Evaluation Framework

This notebook evaluates how well derived style instructions enable style replication, compared to alternative approaches.

## Methodology

For each gold standard text:
1. **Flatten**: Extract content while removing style
2. **Reconstruct**: Generate text using 4 different methods (M stochastic runs each):
   - Generic baseline
   - Few-shot learning
   - Author name prompting
   - Derived style instructions
3. **Judge (Blind Comparative)**: Judge ranks all 4 reconstructions from 1-4 based on similarity to original
   - **Blind evaluation**: Judge sees only anonymous labels (Text A, B, C, D) - no method names
   - **Ranking**: 1 = most similar, 2 = second, 3 = third, 4 = least similar
   - **Order randomized**: Position of methods varies across samples to eliminate bias
4. **Aggregate**: Analyze rankings to determine which method best captures style

## Key Features

- **Crash resilient**: All LLM responses saved to SQLite immediately
- **Resume support**: Can restart after failures, skips completed work
- **Blind evaluation**: Eliminates judge bias by hiding method names
- **Comparative ranking**: More informative than binary comparisons

### Install Libraries and Check

In [1]:
!pip install -r requirements.txt




[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.1.2[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [2]:
try:
    import litellm
    print('Providers\n=========')
    print('* ' + '\n* '.join(litellm.LITELLM_CHAT_PROVIDERS))
    litellm.drop_params = True
except ImportError as e:
    print(f"✗ Cannot import litellm: {e}")

Providers
* openai
* openai_like
* bytez
* xai
* custom_openai
* text-completion-openai
* cohere
* cohere_chat
* clarifai
* anthropic
* anthropic_text
* replicate
* huggingface
* together_ai
* datarobot
* openrouter
* cometapi
* vertex_ai
* vertex_ai_beta
* gemini
* ai21
* baseten
* azure
* azure_text
* azure_ai
* sagemaker
* sagemaker_chat
* bedrock
* vllm
* nlp_cloud
* petals
* oobabooga
* ollama
* ollama_chat
* deepinfra
* perplexity
* mistral
* groq
* nvidia_nim
* cerebras
* baseten
* ai21_chat
* volcengine
* codestral
* text-completion-codestral
* deepseek
* sambanova
* maritalk
* cloudflare
* fireworks_ai
* friendliai
* watsonx
* watsonx_text
* triton
* predibase
* databricks
* empower
* github
* custom
* litellm_proxy
* hosted_vllm
* llamafile
* lm_studio
* galadriel
* gradient_ai
* github_copilot
* novita
* meta_llama
* featherless_ai
* nscale
* nebius
* dashscope
* moonshot
* v0
* heroku
* oci
* morph
* lambda_ai
* vercel_ai_gateway
* wandb
* ovhcloud
* lemonade


## Setup and Configuration

In [3]:
import os
import json
from pathlib import Path
import pandas as pd
from datetime import datetime

### Initialize Base Objects
The base objects part of the current project library (`belletrist`) are initialized. They are:
* `LLM`: the LLM object.
* `LLMConfig`: the configuration of the LLM object, such as what model to use.
* `PromptMaker`: generates prompts from templates and variables
* `DataSampler`: retrieves and samples text at a source directory

These will implement text transformations by LLMs part of the evaluation process. They build on the third-party LLMs, which we furthermore split into LLMs for text reconstruction and text judging, the key parameters for which are set below.

In [4]:
model_reconstruction_string = 'together_ai/Qwen/Qwen3-235B-A22B-Instruct-2507-tput'
model_reconstruction_api_key_env_var = 'TOGETHER_AI_API_KEY'
#model_reconstruction_string = 'anthropic/claude-sonnet-4-5-20250929'
#model_reconstruction_api_key_env_var = 'ANTHROPIC_API_KEY'
#model_reconstruction_string = 'openai/gpt-5.1-2025-11-13'
#model_reconstruction_api_key_env_var = 'OPENAI_API_KEY'
#model_reconstruction_string = 'together_ai/moonshotai/Kimi-K2-Instruct'
#model_reconstruction_api_key_env_var = 'TOGETHER_AI_API_KEY'
#model_reconstruction_string = 'mistral/mistral-large-2512'
#model_reconstruction_api_key_env_var = 'MISTRAL_API_KEY'
#model_judge_string = 'anthropic/claude-sonnet-4-5-20250929'
#model_judge_api_key_env_var = 'ANTHROPIC_API_KEY'
model_judge_string = 'mistral/mistral-large-2512'
model_judge_api_key_env_var = 'MISTRAL_API_KEY'

In [5]:
from belletrist import PromptMaker, DataSampler, StyleEvaluationStore

prompt_maker = PromptMaker()
sampler = DataSampler(
    data_path=(Path(os.getcwd()) / "data" / "russell").resolve()
)
store = StyleEvaluationStore(Path("style_eval_instruct_sonnet_write_qwen_judge_mistral.db"))

In [6]:
#store.reset('all')

In [7]:
from belletrist import LLM, LLMConfig

reconstruction_llm = LLM(LLMConfig(
    model=model_reconstruction_string,
    api_key=os.environ.get(model_reconstruction_api_key_env_var)
))
judge_llm = LLM(LLMConfig(
    model=model_judge_string,
    api_key=os.environ.get(model_judge_api_key_env_var)
))

### Sample Test Data and Few-Shot Data
The reconstruction method tests build on gold standard texts. The test also includes few-shot prompting with the gold standard texts. In order to not skew the tests, no few-shot examples can overlap with the test texts.

In [8]:
n_sample = 5
m_paragraphs_per_sample = 5
n_few_shot_sample = 5

In [9]:
#test_texts = []
#for _ in range(n_sample):
#    test_texts.append(sampler.sample_segment(p_length=m_paragraphs_per_sample))

In [10]:
quality_texts_deterministic = [
    sampler.get_paragraph_chunk(file_index=0, paragraph_range=slice(9, 9+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=0, paragraph_range=slice(29, 29+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=0, paragraph_range=slice(131, 131+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=1, paragraph_range=slice(13, 13+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=1, paragraph_range=slice(39, 39+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=1, paragraph_range=slice(192, 192+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=2, paragraph_range=slice(20, 20+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=2, paragraph_range=slice(43, 43+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=2, paragraph_range=slice(146, 146+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=3, paragraph_range=slice(7, 7+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=3, paragraph_range=slice(73, 73+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=3, paragraph_range=slice(202, 202+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=3, paragraph_range=slice(285, 285+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=4, paragraph_range=slice(4, 4+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=4, paragraph_range=slice(67, 67+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=4, paragraph_range=slice(124, 124+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=5, paragraph_range=slice(6, 6+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=5, paragraph_range=slice(119, 119+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=5, paragraph_range=slice(301, 301+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=6, paragraph_range=slice(23, 23+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=6, paragraph_range=slice(75, 75+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=6, paragraph_range=slice(152, 152+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=6, paragraph_range=slice(198, 198+m_paragraphs_per_sample)),
    sampler.get_paragraph_chunk(file_index=6, paragraph_range=slice(271, 271+m_paragraphs_per_sample)),
]
reindex = [0, 5, 10, 15, 20, 1, 6, 11, 16, 21, 2, 7, 12, 17, 22, 3, 8, 13, 18, 23, 4, 9, 14, 19]
test_texts = [
    quality_texts_deterministic[i] for i in reindex[:n_sample]
]

In [11]:
#few_shot_texts = []
#while len(few_shot_texts) < n_few_shot_sample:
#    p = sampler.sample_segment(p_length=m_paragraphs_per_sample)

    # Check if p overlaps with any test text
    # Two segments overlap if they're from the same file AND their paragraph ranges overlap
    # Ranges [a, b) and [c, d) overlap if: a < d AND c < b
#    has_overlap = any(
#        p.file_index == test_seg.file_index and
#        p.paragraph_start < test_seg.paragraph_end and
#        test_seg.paragraph_start < p.paragraph_end
#        for test_seg in test_texts
#    )
#
#    if not has_overlap:
#        few_shot_texts.append(p)

In [12]:
few_shot_texts = [
    quality_texts_deterministic[i] for i in reversed(reindex[:n_few_shot_sample])
]

## Create the Test Transformation Objects
The combination of prompt and LLM leads to the following operators in the test chain:
* **Style Flattener**, which given a text compresses it into its content bare bones.
* **Reconstructor, LLM House Style**, which given a compressed content expands it into a complete text with the "house style" of the LLM employed for the reconstruction.
* **Reconstructor, Few Shot**, which given a compressed content expands it into a complete text with a few text excerpts on unrelated topics as style guide.
* **Reconstructor, LLM Author Model**, which given a compressed content expands it into a complete text with the named author's style as the LLM conceives it without any other guidance.
* **Reconstructor, Style Instruction**, which given a compressed content expands it into a complete text following the detailed style instruction, as derived from previous analysis.

In [13]:
style_instructions_path = Path("outputs/author_modeling/author_model_definition_001_sonnet.txt")

if style_instructions_path.exists():
    style_instructions = style_instructions_path.read_text()
    print(f"✓ Loaded style instructions ({len(style_instructions)} chars)")
    print(f"\nFirst 200 chars:\n{style_instructions[:500]}...")
else:
    print("⚠ Style instructions file not found. Please provide path.")
    style_instructions = ""  # Placeholder

✓ Loaded style instructions (20519 chars)

First 200 chars:
---
synthesis_id: author_model_definition_001
synthesis_type: author_model_definition
model: claude-sonnet-4-5-20250929
created_at: 2025-12-02T12:29:47.382030
parent_synthesis_id: implied_author_synthesis_001
num_samples: 5
sample_ids:
  - sample_005
  - sample_003
  - sample_002
  - sample_004
  - sample_001
is_homogeneous_model: true
models_used:
  - claude-sonnet-4-5-20250929
---

## PART 1: THE GENERATIVE STANCE

When you write as this author, you inhabit a position of **clarity achieved and...


In [14]:
from belletrist.models import (
    StyleFlatteningConfig,
    StyleReconstructionGenericConfig,
    StyleReconstructionFewShotConfig,
    StyleReconstructionAuthorConfig,
    StyleReconstructionInstructionsConfig,
    MethodMapping,
    StyleJudgeComparativeConfig
)

# Configuration
n_runs = 3
n_judge_runs = 3
AUTHOR_NAME = "Bertrand Russell"

# Reconstructor configs
RECONSTRUCTORS_CFGS = {
    'generic': StyleReconstructionGenericConfig,
    'fewshot': StyleReconstructionFewShotConfig,
    'author': StyleReconstructionAuthorConfig,
    'instructions': StyleReconstructionInstructionsConfig
}

# Reconstructor kwargs
RECONSTRUCTORS_KWARGS = {
    'generic': {},
    'fewshot': {'few_shot_examples': [seg.text for seg in few_shot_texts]},
    'author': {'author_name': AUTHOR_NAME},
    'instructions': {'style_instructions': style_instructions}
}

print(f"✓ Store initialized at {store.filepath}")
print(f"✓ Configuration: {n_runs} runs per sample, 4 reconstruction methods")

✓ Store initialized at style_eval_instruct_sonnet_write_qwen_judge_mistral.db
✓ Configuration: 3 runs per sample, 4 reconstruction methods


## Evaluation Pipeline

### Step 1: Content Flattening

Extract content from each test sample, removing stylistic elements:

In [15]:
# Step 1: Save samples and flatten content
print("=== Step 1: Flattening and Saving Samples ===\n")

for k_text, test_sample in enumerate(test_texts):
    sample_id = f"sample_{k_text:03d}"
    
    # Skip if already saved
    if store.get_sample(sample_id):
        print(f"✓ {sample_id} already flattened (skipping)")
        continue
    
    print(f"Flattening {sample_id}...", end=" ")
    
    # Flatten content
    flatten_prompt = prompt_maker.render(
        StyleFlatteningConfig(text=test_sample.text)
    )
    flattened = reconstruction_llm.complete(flatten_prompt)
    
    # Save to store with provenance
    source_info = f"File {test_sample.file_index}, para {test_sample.paragraph_start}-{test_sample.paragraph_end}"
    store.save_sample(
        sample_id=sample_id,
        original_text=test_sample.text,
        flattened_content=flattened.content,
        flattening_model=flattened.model,
        source_info=source_info
    )
    
    print(f"✓ ({len(flattened.content)} chars)")

print(f"\n✓ All samples flattened and saved to store")

=== Step 1: Flattening and Saving Samples ===

✓ sample_000 already flattened (skipping)
✓ sample_001 already flattened (skipping)
✓ sample_002 already flattened (skipping)
✓ sample_003 already flattened (skipping)
✓ sample_004 already flattened (skipping)

✓ All samples flattened and saved to store


In [16]:
print('====ORIGINAL===')
print(store.get_sample('sample_001')['original_text'])
print('\n\n====FLATTENED====')
print(store.get_sample('sample_001')['flattened_content'])

====ORIGINAL===
The infinitesimal played formerly a great part in mathematics. It was
introduced by the Greeks, who regarded a circle as differing
infinitesimally from a polygon with a very large number of very small
equal sides. It gradually grew in importance, until, when Leibniz
invented the Infinitesimal Calculus, it seemed to become the
fundamental notion of all higher mathematics. Carlyle tells, in his
_Frederick the Great_, how Leibniz used to discourse to Queen Sophia
Charlotte of Prussia concerning the infinitely little, and how she
would reply that on that subject she needed no instruction--the
behaviour of courtiers had made her thoroughly familiar with it. But
philosophers and mathematicians--who for the most part had less
acquaintance with courts--continued to discuss this topic, though
without making any advance. The Calculus required continuity, and
continuity was supposed to require the infinitely little; but nobody
could discover what the infinitely little might be. It

### Step 2: Reconstruction

Generate reconstructions using all 4 methods, with M stochastic runs each:

In [17]:
# Step 2: Generate reconstructions (with crash resume)
print("=== Step 2: Generating Reconstructions ===\n")

for sample_id in store.list_samples():
    sample = store.get_sample(sample_id)
    print(f"\n{sample_id}:")
    
    for run in range(n_runs):
        print(f"  Run {run}:")
        
        # Check which methods need reconstruction
        for method in ['generic', 'fewshot', 'author', 'instructions']:
            if store.has_reconstruction(sample_id, run, method):
                print(f"    ✓ {method:12s} (already done)")
                continue
            
            # Generate reconstruction
            config = RECONSTRUCTORS_CFGS[method](
                content_summary=sample['flattened_content'],
                **RECONSTRUCTORS_KWARGS[method]
            )
            prompt = prompt_maker.render(config)
            response = reconstruction_llm.complete(prompt)
            
            # Save immediately (crash resilient!)
            store.save_reconstruction(
                sample_id=sample_id,
                run=run,
                method=method,
                reconstructed_text=response.content,
                model=response.model
            )
            print(f"    ✓ {method:12s} ({len(response.content)} chars)")

stats = store.get_stats()
print(f"\n✓ Generated {stats['n_reconstructions']} total reconstructions")

=== Step 2: Generating Reconstructions ===


sample_000:
  Run 0:
    ✓ generic      (already done)
    ✓ fewshot      (already done)
    ✓ author       (already done)
    ✓ instructions (already done)
  Run 1:
    ✓ generic      (already done)
    ✓ fewshot      (already done)
    ✓ author       (already done)
    ✓ instructions (already done)
  Run 2:
    ✓ generic      (already done)
    ✓ fewshot      (already done)
    ✓ author       (already done)
    ✓ instructions (already done)

sample_001:
  Run 0:
    ✓ generic      (already done)
    ✓ fewshot      (already done)
    ✓ author       (already done)
    ✓ instructions (already done)
  Run 1:
    ✓ generic      (already done)
    ✓ fewshot      (already done)
    ✓ author       (already done)
    ✓ instructions (already done)
  Run 2:
    ✓ generic      (already done)
    ✓ fewshot      (already done)
    ✓ author       (already done)
    ✓ instructions (already done)

sample_002:
  Run 0:
    ✓ generic      (already done)
    

In [18]:
config = RECONSTRUCTORS_CFGS['instructions'](
                content_summary=sample['flattened_content'],
                **RECONSTRUCTORS_KWARGS['instructions']
            )
prompt = prompt_maker.render(config)
print(prompt)

You are given a content summary and detailed writing style guidelines.

Expand the content summary into a complete prose passage, writing naturally in the style described by the guidelines.

**CRITICAL REQUIREMENTS:**
- Write plain prose. Do NOT add titles, section headers, or formatting (bold, italics, markdown, etc.) unless the style guidelines explicitly call for them.
- Do NOT include meta-commentary about following instructions, demonstrating features, or explaining your approach.
- Do NOT add preambles like "Here is the passage..." or postscripts analyzing your work.
- Simply write as if you naturally embody this style.

---

**STYLE GUIDELINES:**

## PART 1: THE GENERATIVE STANCE

When you write as this author, you inhabit a position of **clarity achieved and being transmitted**. You are not discovering in real-time; you have thought through the confusions, identified where thinking goes wrong, and now demonstrate the path to understanding. Your relationship to your material is 

In [19]:
reconstructions = store.get_reconstructions('sample_001', 1)
for reconstructor in reconstructions.keys():
    print(f"{reconstructor.upper()}\n===================")
    print(f"\n{reconstructions.get(reconstructor)}\n\n")

AUTHOR

It is a curious feature of human thought that ideas which, for centuries, have exercised a powerful influence upon both science and philosophy, may eventually be found to rest upon no foundation whatever—ideas so insubstantial that they dissolve at the touch of rigorous analysis, yet so seductive that even the most acute minds have been drawn into their spell. Such was the case with the infinitesimal, a notion that, from its first dim conception among the Greeks to its apparent triumph in the calculus of Leibniz, held an almost mystical sway over the development of higher mathematics. The Greeks, in their contemplation of the circle, approached it as the limit of a polygon with an ever-increasing number of sides, each of which became smaller and smaller—so small, indeed, that one might suppose them to be vanishingly slight, almost nothing, yet not quite nothing. This artifice, though suggestive, remained vague, and the infinitesimal hovered at the edge of legitimacy: too useful

### Step 3: Judging

Compare each reconstruction against the original using the judge LLM:

In [21]:
from belletrist.models import StyleJudgeComparativeConfig, StyleJudgmentComparative

# Configuration: how many times to judge each reconstruction set
n_judge_runs = 3  # Set to 1 for no consistency testing, 3-5 for consistency testing

# Step 3: Comparative blind judging (with crash resume and judge consistency testing)
print("=== Step 3: Comparative Blind Judging ===\n")

for sample_id in store.list_samples():
    sample = store.get_sample(sample_id)
    print(f"\n{sample_id}:")
    
    for reconstruction_run in range(n_runs):
        print(f"  Reconstruction run {reconstruction_run}:")
        
        # Get all 4 reconstructions ONCE
        reconstructions = store.get_reconstructions(sample_id, reconstruction_run)
        if len(reconstructions) != 4:
            print(f"    ✗ Missing reconstructions (found {len(reconstructions)}/4)")
            continue
        
        # Create mapping ONCE per reconstruction_run (deterministic seed for reproducibility)
        mapping = store.create_random_mapping(seed=hash(f"{sample_id}_{reconstruction_run}"))
        
        # Build prompt ONCE per reconstruction_run (SAME prompt for all judge runs)
        judge_config = StyleJudgeComparativeConfig(
            original_text=sample['original_text'],
            reconstruction_text_a=reconstructions[mapping.text_a],
            reconstruction_text_b=reconstructions[mapping.text_b],
            reconstruction_text_c=reconstructions[mapping.text_c],
            reconstruction_text_d=reconstructions[mapping.text_d]
        )
        judge_prompt = prompt_maker.render(judge_config)
        
        # Judge the SAME reconstructions with the SAME prompt multiple times
        for judge_run in range(n_judge_runs):
            if store.has_judgment(sample_id, reconstruction_run, judge_run):
                print(f"    Judge run {judge_run}: ✓ Already judged (skipping)")
                continue
            
            print(f"    Judge run {judge_run}: Judging...", end=" ")
            
            # Get structured JSON judgment with schema enforcement
            try:
                response = judge_llm.complete_with_schema(judge_prompt, StyleJudgmentComparative)
                #print(response)
                judgment = response.content  # Already validated Pydantic instance
                #print (judgment)
                
                # Save judgment with both reconstruction_run and judge_run
                store.save_judgment(
                    sample_id=sample_id,
                    reconstruction_run=reconstruction_run,
                    judgment=judgment,
                    mapping=mapping,
                    judge_model=response.model,
                    judge_run=judge_run
                )
                print(f"✓ (confidence: {judgment.confidence})")
                
            except Exception as e:
                print(f"✗ Error: {e}")

stats = store.get_stats()
print(f"\n✓ Completed {stats['n_judgments']} judgments")

=== Step 3: Comparative Blind Judging ===


sample_000:
  Reconstruction run 0:
    Judge run 0: ✓ Already judged (skipping)
    Judge run 1: ✓ Already judged (skipping)
    Judge run 2: Judging... ✓ (confidence: high)
  Reconstruction run 1:
    Judge run 0: Judging... ✓ (confidence: high)
    Judge run 1: Judging... ✓ (confidence: high)
    Judge run 2: Judging... ✓ (confidence: high)
  Reconstruction run 2:
    Judge run 0: Judging... ✓ (confidence: high)
    Judge run 1: Judging... ✓ (confidence: high)
    Judge run 2: Judging... ✓ (confidence: high)

sample_001:
  Reconstruction run 0:
    Judge run 0: Judging... ✓ (confidence: high)
    Judge run 1: Judging... ✓ (confidence: high)
    Judge run 2: Judging... ✓ (confidence: high)
  Reconstruction run 1:
    Judge run 0: Judging... ✓ (confidence: high)
    Judge run 1: Judging... ✓ (confidence: high)
    Judge run 2: Judging... ✓ (confidence: high)
  Reconstruction run 2:
    Judge run 0: Judging... ✓ (confidence: high)
    Judge r

## Results Analysis

In [22]:
# Export results to DataFrame
print("=== Exporting Results ===\n")

# Export from store (resolves anonymous rankings to methods)
df = store.to_dataframe()

print(f"Total judgments: {len(df)}")
print(f"Samples: {df['sample_id'].nunique()}")
print(f"Reconstruction runs per sample: {df.groupby('sample_id')['reconstruction_run'].nunique().mean():.1f}")
print(f"Judge runs per reconstruction: {df.groupby(['sample_id', 'reconstruction_run'])['judge_run'].nunique().mean():.1f}")

# Show first few rows
print(f"\n=== Sample Results ===\n")
display_cols = ['sample_id', 'reconstruction_run', 'judge_run', 'ranking_generic', 'ranking_fewshot', 
                'ranking_author', 'ranking_instructions', 'confidence']
print(df[display_cols].head(15))

# Export to CSV
output_file = f"style_eval_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
df.to_csv(output_file, index=False)
print(f"\n✓ Results saved to {output_file}")

=== Exporting Results ===

Total judgments: 45
Samples: 5
Reconstruction runs per sample: 3.0
Judge runs per reconstruction: 3.0

=== Sample Results ===

     sample_id  reconstruction_run  judge_run  ranking_generic  \
0   sample_000                   0          0                3   
1   sample_000                   0          1                3   
2   sample_000                   0          2                3   
3   sample_000                   1          0                3   
4   sample_000                   1          1                3   
5   sample_000                   1          2                3   
6   sample_000                   2          0                4   
7   sample_000                   2          1                4   
8   sample_000                   2          2                4   
9   sample_001                   0          0                2   
10  sample_001                   0          1                3   
11  sample_001                   0          2         

In [23]:
# Analyze judge consistency across multiple judgments of same reconstructions
print("=== Judge Consistency Analysis ===\n")

if 'judge_run' in df.columns and df['judge_run'].nunique() > 1:
    # For each (sample_id, reconstruction_run), check ranking variance across judge_runs
    consistency_results = []
    
    for (sample_id, recon_run), group in df.groupby(['sample_id', 'reconstruction_run']):
        if len(group) > 1:  # Only if multiple judge runs exist
            for method in ['generic', 'fewshot', 'author', 'instructions']:
                col = f'ranking_{method}'
                ranks = group[col].values
                variance = ranks.std()
                mean_rank = ranks.mean()
                
                consistency_results.append({
                    'sample_id': sample_id,
                    'reconstruction_run': recon_run,
                    'method': method,
                    'mean_rank': mean_rank,
                    'std_dev': variance,
                    'min_rank': ranks.min(),
                    'max_rank': ranks.max(),
                    'rank_range': ranks.max() - ranks.min(),
                    'n_judgments': len(ranks)
                })
    
    if consistency_results:
        consistency_df = pd.DataFrame(consistency_results)
        
        print("Method-level consistency (across all samples):")
        method_consistency = consistency_df.groupby('method').agg({
            'std_dev': 'mean',
            'rank_range': 'mean'
        }).round(2)
        method_consistency.columns = ['Avg Std Dev', 'Avg Rank Range']
        print(method_consistency)
        print("\n(Lower values = more consistent)")
        
        print("\n\nMost inconsistent cases (judge disagreed most):")
        inconsistent = consistency_df.nlargest(10, 'rank_range')[
            ['sample_id', 'reconstruction_run', 'method', 'min_rank', 'max_rank', 'rank_range']
        ]
        print(inconsistent.to_string(index=False))
        
        print("\n\nMost consistent cases (judge agreed most):")
        consistent = consistency_df.nsmallest(10, 'std_dev')[
            ['sample_id', 'reconstruction_run', 'method', 'mean_rank', 'std_dev']
        ]
        print(consistent.to_string(index=False))
        
else:
    print("⚠ Only one judge_run per reconstruction. Set n_judge_runs > 1 to test consistency.")

=== Judge Consistency Analysis ===

Method-level consistency (across all samples):
              Avg Std Dev  Avg Rank Range
method                                   
author               0.09            0.20
fewshot              0.00            0.00
generic              0.03            0.07
instructions         0.06            0.13

(Lower values = more consistent)


Most inconsistent cases (judge disagreed most):
 sample_id  reconstruction_run       method  min_rank  max_rank  rank_range
sample_001                   0       author         1         3           2
sample_000                   1       author         1         2           1
sample_000                   1 instructions         1         2           1
sample_001                   0      generic         2         3           1
sample_001                   0 instructions         1         2           1
sample_000                   0      generic         3         3           0
sample_000                   0      fewshot      

In [67]:
print(df.loc[4,'reasoning'])

Let me analyze each text's stylistic fit with the original Russell passage.

**ORIGINAL VOICE CHARACTERISTICS:**
The original has a distinctive voice: crisp, analytical, relentlessly clear. Russell uses concrete examples (symphony/trombone) to illuminate abstractions. His sentences have a measured, almost conversational rhythm despite the philosophical density. He addresses objections systematically but without excessive signposting. The prose feels like a patient teacher walking you through a problem, not discovering it in real-time but presenting a path already cleared. Crucially, there's an economy of expression—no word wasted, no rhetorical flourish for its own sake.

**TEXT A Analysis:**
This text is massively over-written and self-conscious about its own method. It opens with meta-commentary ("carefully following the provided style instructions") and ends with a numbered list of "Key Stylistic Features Demonstrated"—an absolute giveaway that this is a construction following expli

In [68]:
# Analyze ranking distributions
print("=== Ranking Distribution by Method ===\n")

for method in ['generic', 'fewshot', 'author', 'instructions']:
    col = f'ranking_{method}'
    print(f"\n{method.upper()}:")
    ranking_counts = df[col].value_counts().sort_index()
    for rank in [1, 2, 3, 4]:
        count = ranking_counts.get(rank, 0)
        pct = (count / len(df) * 100) if len(df) > 0 else 0
        print(f"  Rank {rank}: {count:3d} ({pct:5.1f}%)")

print("\n=== Confidence Distribution ===\n")
print(df['confidence'].value_counts())

=== Ranking Distribution by Method ===


GENERIC:
  Rank 1:   0 (  0.0%)
  Rank 2:   5 ( 35.7%)
  Rank 3:   4 ( 28.6%)
  Rank 4:   5 ( 35.7%)

FEWSHOT:
  Rank 1:   9 ( 64.3%)
  Rank 2:   3 ( 21.4%)
  Rank 3:   2 ( 14.3%)
  Rank 4:   0 (  0.0%)

AUTHOR:
  Rank 1:   0 (  0.0%)
  Rank 2:   1 (  7.1%)
  Rank 3:   7 ( 50.0%)
  Rank 4:   6 ( 42.9%)

INSTRUCTIONS:
  Rank 1:   5 ( 35.7%)
  Rank 2:   5 ( 35.7%)
  Rank 3:   1 (  7.1%)
  Rank 4:   3 ( 21.4%)

=== Confidence Distribution ===

confidence
high    14
Name: count, dtype: int64


In [69]:
# Calculate method performance metrics
print("=== Method Performance Metrics ===\n")

# Calculate mean ranking for each method (lower is better: 1 = best, 4 = worst)
mean_rankings = {}
for method in ['generic', 'fewshot', 'author', 'instructions']:
    col = f'ranking_{method}'
    mean_rankings[method] = df[col].mean()

# Sort by mean ranking (best first)
sorted_methods = sorted(mean_rankings.items(), key=lambda x: x[1])

print("Average Ranking (lower is better):")
for i, (method, mean_rank) in enumerate(sorted_methods, 1):
    # Count how often this method ranked 1st
    first_place = (df[f'ranking_{method}'] == 1).sum()
    first_place_pct = (first_place / len(df) * 100) if len(df) > 0 else 0
    
    print(f"{i}. {method:12s}: {mean_rank:.2f} (1st place: {first_place}/{len(df)} = {first_place_pct:.1f}%)")

# Win rate (percentage of times ranked 1st or 2nd)
print("\nTop-2 Rate (ranked 1st or 2nd):")
for method in ['generic', 'fewshot', 'author', 'instructions']:
    col = f'ranking_{method}'
    top2 = ((df[col] == 1) | (df[col] == 2)).sum()
    top2_pct = (top2 / len(df) * 100) if len(df) > 0 else 0
    print(f"  {method:12s}: {top2}/{len(df)} = {top2_pct:.1f}%")

=== Method Performance Metrics ===

Average Ranking (lower is better):
1. fewshot     : 1.50 (1st place: 9/14 = 64.3%)
2. instructions: 2.14 (1st place: 5/14 = 35.7%)
3. generic     : 3.00 (1st place: 0/14 = 0.0%)
4. author      : 3.36 (1st place: 0/14 = 0.0%)

Top-2 Rate (ranked 1st or 2nd):
  generic     : 5/14 = 35.7%
  fewshot     : 12/14 = 85.7%
  author      : 1/14 = 7.1%
  instructions: 10/14 = 71.4%


In [25]:
# Export results
output_file = f"style_evaluation_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
df.to_csv(output_file, index=False)
print(f"\n✓ Results saved to {output_file}")


✓ Results saved to style_evaluation_results_20251202_164845.csv


## Next Steps

TODO: Add analysis cells for:
- Statistical significance testing
- Visualization of results
- Sample-by-sample breakdown
- Confidence level analysis
- Qualitative review of reasoning