# Style Instruction Evaluation Framework

This notebook evaluates how well derived style instructions enable style replication, compared to alternative approaches.

## Methodology

For each gold standard text:
1. **Flatten**: Extract content while removing style
2. **Reconstruct**: Generate text using 4 different methods (M stochastic runs each):
   - Generic baseline
   - Few-shot learning
   - Author name prompting
   - Derived style instructions
3. **Judge (Blind Comparative)**: Judge ranks all 4 reconstructions from 1-4 based on similarity to original
   - **Blind evaluation**: Judge sees only anonymous labels (Text A, B, C, D) - no method names
   - **Ranking**: 1 = most similar, 2 = second, 3 = third, 4 = least similar
   - **Order randomized**: The label assigned to each method varies across samples and runs to reduce position bias
4. **Aggregate**: Analyze rankings to determine which method best captures style
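
The blind labeling and ranking steps above can be sketched as follows. This is a minimal illustration, not the `belletrist` implementation: `make_blind_mapping` and `resolve_rankings` are hypothetical helpers, and the judge response is made up for the example.

```python
import random

METHODS = ["generic", "fewshot", "author", "instructions"]
LABELS = ["A", "B", "C", "D"]


def make_blind_mapping(seed: int) -> dict[str, str]:
    """Assign each method to an anonymous label using a seeded shuffle."""
    rng = random.Random(seed)
    shuffled = METHODS[:]
    rng.shuffle(shuffled)
    return dict(zip(LABELS, shuffled))


def resolve_rankings(label_ranks: dict[str, int], mapping: dict[str, str]) -> dict[str, int]:
    """Translate the judge's per-label ranks back into per-method ranks."""
    # A valid comparative judgment uses each rank 1-4 exactly once.
    if sorted(label_ranks.values()) != [1, 2, 3, 4]:
        raise ValueError("ranks must be a permutation of 1-4")
    return {mapping[label]: rank for label, rank in label_ranks.items()}


mapping = make_blind_mapping(seed=7)
judge_output = {"A": 2, "B": 1, "C": 4, "D": 3}  # hypothetical judge response
resolved = resolve_rankings(judge_output, mapping)
print(resolved)
```

The judge only ever sees the labels; the mapping is kept alongside the judgment so that ranks can be attributed to methods afterwards.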

## Key Features

- **Crash resilient**: All LLM responses saved to SQLite immediately
- **Resume support**: Can restart after failures, skips completed work
- **Blind evaluation**: Reduces judge bias by hiding method names
- **Comparative ranking**: More informative than binary comparisons
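
The save-immediately and skip-completed-work pattern can be sketched with the standard `sqlite3` module. The table layout and helper functions below are assumptions made for illustration; the actual `StyleEvaluationStore` schema is not shown in this notebook.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a store with one row per (sample, run, method)."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS reconstructions (
               sample_id TEXT, run INTEGER, method TEXT, text TEXT,
               PRIMARY KEY (sample_id, run, method))"""
    )
    return conn


def has_reconstruction(conn, sample_id, run, method) -> bool:
    """Return True if this unit of work is already stored."""
    row = conn.execute(
        "SELECT 1 FROM reconstructions WHERE sample_id=? AND run=? AND method=?",
        (sample_id, run, method),
    ).fetchone()
    return row is not None


def save_reconstruction(conn, sample_id, run, method, text) -> None:
    """Write one result and commit right away so a crash loses nothing."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO reconstructions VALUES (?, ?, ?, ?)",
            (sample_id, run, method, text),
        )


conn = open_store()
calls = 0
for attempt in range(2):  # the second pass simulates a restart
    for method in ("generic", "fewshot"):
        if has_reconstruction(conn, "sample_000", 0, method):
            continue
        calls += 1  # stands in for an LLM call
        save_reconstruction(conn, "sample_000", 0, method, f"text from {method}")
print(calls)  # 2: the second pass skips everything already stored
```

Committing after every response costs little next to an LLM call, and it means a restart repeats at most the one call that was in flight.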

### Install Libraries and Check

In [49]:
!pip install -r requirements.txt

In [50]:
try:
    import litellm
    print('Providers\n=========')
    print('* ' + '\n* '.join(litellm.LITELLM_CHAT_PROVIDERS))
    litellm.drop_params = True
except ImportError as e:
    print(f"✗ Cannot import litellm: {e}")

Providers
* openai
* openai_like
* bytez
* xai
* custom_openai
* text-completion-openai
* cohere
* cohere_chat
* clarifai
* anthropic
* anthropic_text
* replicate
* huggingface
* together_ai
* datarobot
* openrouter
* cometapi
* vertex_ai
* vertex_ai_beta
* gemini
* ai21
* baseten
* azure
* azure_text
* azure_ai
* sagemaker
* sagemaker_chat
* bedrock
* vllm
* nlp_cloud
* petals
* oobabooga
* ollama
* ollama_chat
* deepinfra
* perplexity
* mistral
* groq
* nvidia_nim
* cerebras
* baseten
* ai21_chat
* volcengine
* codestral
* text-completion-codestral
* deepseek
* sambanova
* maritalk
* cloudflare
* fireworks_ai
* friendliai
* watsonx
* watsonx_text
* triton
* predibase
* databricks
* empower
* github
* custom
* litellm_proxy
* hosted_vllm
* llamafile
* lm_studio
* galadriel
* gradient_ai
* github_copilot
* novita
* meta_llama
* featherless_ai
* nscale
* nebius
* dashscope
* moonshot
* v0
* heroku
* oci
* morph
* lambda_ai
* vercel_ai_gateway
* wandb
* ovhcloud
* lemonade


## Setup and Configuration

In [51]:
import os
import json
from pathlib import Path
import pandas as pd
from datetime import datetime

### Initialize Base Objects
The base objects from the project library (`belletrist`) are initialized here:
* `LLM`: the LLM object.
* `LLMConfig`: the configuration of the LLM object, such as what model to use.
* `PromptMaker`: generates prompts from templates and variables
* `DataSampler`: retrieves and samples text at a source directory

These objects carry out the LLM text transformations used in the evaluation process. They wrap third-party LLMs, which are split into one model for text reconstruction and one for judging; the key parameters for both are set below.

In [52]:
#model_reconstruction_string = 'together_ai/Qwen/Qwen3-235B-A22B-Instruct-2507-tput'
#model_reconstruction_api_key_env_var = 'TOGETHER_AI_API_KEY'
#model_reconstruction_string = 'anthropic/claude-sonnet-4-5-20250929'
#model_reconstruction_api_key_env_var = 'ANTHROPIC_API_KEY'
#model_reconstruction_string = 'openai/gpt-5.1-2025-11-13'
#model_reconstruction_api_key_env_var = 'OPENAI_API_KEY'
#model_reconstruction_string = 'together_ai/moonshotai/Kimi-K2-Instruct'
#model_reconstruction_api_key_env_var = 'TOGETHER_AI_API_KEY'
model_reconstruction_string = 'mistral/mistral-large-2512'
model_reconstruction_api_key_env_var = 'MISTRAL_API_KEY'
model_judge_string = 'anthropic/claude-sonnet-4-5-20250929'
model_judge_api_key_env_var = 'ANTHROPIC_API_KEY'

In [53]:
from belletrist import PromptMaker, DataSampler, StyleEvaluationStore

prompt_maker = PromptMaker()
sampler = DataSampler(
    data_path=(Path(os.getcwd()) / "data" / "russell").resolve()
)
store = StyleEvaluationStore(Path("style_evaluation_results_author_sonnet_write_mistral_judge_sonnet.db"))

In [54]:
store.reset('all')

In [55]:
from belletrist import LLM, LLMConfig

reconstruction_llm = LLM(LLMConfig(
    model=model_reconstruction_string,
    api_key=os.environ.get(model_reconstruction_api_key_env_var)
))
judge_llm = LLM(LLMConfig(
    model=model_judge_string,
    api_key=os.environ.get(model_judge_api_key_env_var)
))

### Sample Test Data and Few-Shot Data
The reconstruction tests are built on gold standard texts, and the few-shot method draws its examples from the same corpus. To avoid skewing the tests, no few-shot example may overlap with a test text.

In [56]:
n_sample = 5
m_paragraphs_per_sample = 5
n_few_shot_sample = 5

test_texts = []
for _ in range(n_sample):
    test_texts.append(sampler.sample_segment(p_length=m_paragraphs_per_sample))

few_shot_texts = []
while len(few_shot_texts) < n_few_shot_sample:
    p = sampler.sample_segment(p_length=m_paragraphs_per_sample)

    # Check if p overlaps with any test text
    # Two segments overlap if they're from the same file AND their paragraph ranges overlap
    # Ranges [a, b) and [c, d) overlap if: a < d AND c < b
    has_overlap = any(
        p.file_index == test_seg.file_index and
        p.paragraph_start < test_seg.paragraph_end and
        test_seg.paragraph_start < p.paragraph_end
        for test_seg in test_texts
    )

    if not has_overlap:
        few_shot_texts.append(p)

## Create the Test Transformation Objects
Combining a prompt with an LLM yields the following operators in the test chain:
* **Style Flattener**: compresses a text into its bare content.
* **Reconstructor, LLM House Style**: expands the compressed content into a complete text in the "house style" of the LLM used for reconstruction.
* **Reconstructor, Few Shot**: expands the compressed content into a complete text, using a few text excerpts on unrelated topics as a style guide.
* **Reconstructor, LLM Author Model**: expands the compressed content into a complete text in the named author's style as the LLM conceives it, without any other guidance.
* **Reconstructor, Style Instruction**: expands the compressed content into a complete text following the detailed style instructions derived from previous analysis.

In [57]:
style_instructions_path = Path("outputs/author_modeling/author_model_definition_001_sonnet.txt")

if style_instructions_path.exists():
    style_instructions = style_instructions_path.read_text()
    print(f"✓ Loaded style instructions ({len(style_instructions)} chars)")
    print(f"\nFirst 500 chars:\n{style_instructions[:500]}...")
else:
    print("⚠ Style instructions file not found. Please provide path.")
    style_instructions = ""  # Placeholder

✓ Loaded style instructions (20519 chars)

First 500 chars:
---
synthesis_id: author_model_definition_001
synthesis_type: author_model_definition
model: claude-sonnet-4-5-20250929
created_at: 2025-12-02T12:29:47.382030
parent_synthesis_id: implied_author_synthesis_001
num_samples: 5
sample_ids:
  - sample_005
  - sample_003
  - sample_002
  - sample_004
  - sample_001
is_homogeneous_model: true
models_used:
  - claude-sonnet-4-5-20250929
---

## PART 1: THE GENERATIVE STANCE

When you write as this author, you inhabit a position of **clarity achieved and...


In [58]:
from belletrist.models import (
    StyleFlatteningConfig,
    StyleReconstructionGenericConfig,
    StyleReconstructionFewShotConfig,
    StyleReconstructionAuthorConfig,
    StyleReconstructionInstructionsConfig,
    MethodMapping,
    StyleJudgeComparativeConfig
)

# Configuration
n_runs = 3
AUTHOR_NAME = "Bertrand Russell"

# Reconstructor configs
RECONSTRUCTORS_CFGS = {
    'generic': StyleReconstructionGenericConfig,
    'fewshot': StyleReconstructionFewShotConfig,
    'author': StyleReconstructionAuthorConfig,
    'instructions': StyleReconstructionInstructionsConfig
}

# Reconstructor kwargs
RECONSTRUCTORS_KWARGS = {
    'generic': {},
    'fewshot': {'few_shot_examples': [seg.text for seg in few_shot_texts]},
    'author': {'author_name': AUTHOR_NAME},
    'instructions': {'style_instructions': style_instructions}
}

print(f"✓ Store initialized at {store.filepath}")
print(f"✓ Configuration: {n_runs} runs per sample, 4 reconstruction methods")

✓ Store initialized at style_evaluation_results_author_sonnet_write_mistral_judge_sonnet.db
✓ Configuration: 3 runs per sample, 4 reconstruction methods


## Evaluation Pipeline

### Step 1: Content Flattening

Extract content from each test sample, removing stylistic elements:

In [59]:
# Step 1: Save samples and flatten content
print("=== Step 1: Flattening and Saving Samples ===\n")

for k_text, test_sample in enumerate(test_texts):
    sample_id = f"sample_{k_text:03d}"
    
    # Skip if already saved
    if store.get_sample(sample_id):
        print(f"✓ {sample_id} already flattened (skipping)")
        continue
    
    print(f"Flattening {sample_id}...", end=" ")
    
    # Flatten content
    flatten_prompt = prompt_maker.render(
        StyleFlatteningConfig(text=test_sample.text)
    )
    flattened = reconstruction_llm.complete(flatten_prompt)
    
    # Save to store with provenance
    source_info = f"File {test_sample.file_index}, para {test_sample.paragraph_start}-{test_sample.paragraph_end}"
    store.save_sample(
        sample_id=sample_id,
        original_text=test_sample.text,
        flattened_content=flattened.content,
        flattening_model=flattened.model,
        source_info=source_info
    )
    
    print(f"✓ ({len(flattened.content)} chars)")

print(f"\n✓ All samples flattened and saved to store")

=== Step 1: Flattening and Saving Samples ===

Flattening sample_000... ✓ (4941 chars)
Flattening sample_001... ✓ (8205 chars)
Flattening sample_002... ✓ (2483 chars)
Flattening sample_003... ✓ (3633 chars)
Flattening sample_004... ✓ (4892 chars)

✓ All samples flattened and saved to store


In [60]:
flattened

LLMResponse(content='### **Summary of Informational and Argumentative Content**\n\n#### **1. Position and Influence of the Chinese Intelligentsia**\n- **Historical Context:**\n  - China has lacked a hereditary aristocracy for ~2,000 years.\n  - Governance has long been based on competitive examinations, granting the educated a prestige comparable to that of a governing aristocracy in other societies.\n- **Current Status:**\n  - Traditional education is declining, replaced by modern subjects.\n  - The prestige of education persists; public opinion remains influenced by intellectuals.\n  - Warlords (*Tuchuns*), often uneducated (e.g., former brigands like Chang-tso-lin), lack this respect, making their rule weak and unstable.\n- **Role of "Young China":**\n  - Refers to those educated abroad or in modern domestic institutions.\n  - Their influence is disproportionately strong due to China’s historical respect for learning.\n  - Their numbers are rapidly increasing; their outlook and goal

In [61]:
print('====ORIGINAL===')
print(store.get_sample('sample_001')['original_text'])
print('\n\n====FLATTENED====')
print(store.get_sample('sample_001')['flattened_content'])

====ORIGINAL===
The world may be conceived as consisting of a multitude of entities
arranged in a certain pattern. The entities which are arranged I shall
call "particulars." The arrangement or pattern results from relations
among particulars. Classes or series of particulars, collected
together on account of some property which makes it convenient to be
able to speak of them as wholes, are what I call logical constructions
or symbolic fictions. The particulars are to be conceived, not on the
analogy of bricks in a building, but rather on the analogy of notes
in a symphony. The ultimate constituents of a symphony (apart from
relations) are the notes, each of which lasts only for a very short
time. We may collect together all the notes played by one instrument:
these may be regarded as the analogues of the successive particulars
which common sense would regard as successive states of one "thing."
But the "thing" ought to be regarded as no more "real" or
"substantial" than, for example, 

### Step 2: Reconstruction

Generate reconstructions using all 4 methods, with M stochastic runs each (`n_runs` in the code):

In [62]:
# Step 2: Generate reconstructions (with crash resume)
print("=== Step 2: Generating Reconstructions ===\n")

for sample_id in store.list_samples():
    sample = store.get_sample(sample_id)
    print(f"\n{sample_id}:")
    
    for run in range(n_runs):
        print(f"  Run {run}:")
        
        # Check which methods need reconstruction
        for method in ['generic', 'fewshot', 'author', 'instructions']:
            if store.has_reconstruction(sample_id, run, method):
                print(f"    ✓ {method:12s} (already done)")
                continue
            
            # Generate reconstruction
            config = RECONSTRUCTORS_CFGS[method](
                content_summary=sample['flattened_content'],
                **RECONSTRUCTORS_KWARGS[method]
            )
            prompt = prompt_maker.render(config)
            response = reconstruction_llm.complete(prompt)
            
            # Save immediately (crash resilient!)
            store.save_reconstruction(
                sample_id=sample_id,
                run=run,
                method=method,
                reconstructed_text=response.content,
                model=response.model
            )
            print(f"    ✓ {method:12s} ({len(response.content)} chars)")

stats = store.get_stats()
print(f"\n✓ Generated {stats['n_reconstructions']} total reconstructions")

=== Step 2: Generating Reconstructions ===


sample_000:
  Run 0:
    ✓ generic      (11067 chars)
    ✓ fewshot      (6822 chars)
    ✓ author       (6193 chars)
    ✓ instructions (8052 chars)
  Run 1:
    ✓ generic      (7817 chars)
    ✓ fewshot      (6165 chars)
    ✓ author       (7149 chars)
    ✓ instructions (8397 chars)
  Run 2:
    ✓ generic      (10382 chars)
    ✓ fewshot      (9859 chars)
    ✓ author       (6550 chars)
    ✓ instructions (11200 chars)

sample_001:
  Run 0:
    ✓ generic      (8577 chars)
    ✓ fewshot      (9003 chars)
    ✓ author       (9829 chars)
    ✓ instructions (9240 chars)
  Run 1:
    ✓ generic      (10639 chars)
    ✓ fewshot      (8287 chars)
    ✓ author       (8115 chars)
    ✓ instructions (12610 chars)
  Run 2:
    ✓ generic      (8899 chars)
    ✓ fewshot      (10544 chars)
    ✓ author       (9043 chars)
    ✓ instructions (10172 chars)

sample_002:
  Run 0:
    ✓ generic      (5336 chars)
    ✓ fewshot      (4623 chars)
    ✓ author   

In [63]:
config = RECONSTRUCTORS_CFGS['instructions'](
                content_summary=sample['flattened_content'],
                **RECONSTRUCTORS_KWARGS['instructions']
            )
prompt = prompt_maker.render(config)
print(prompt)

You are given a content summary and a set of detailed writing style instructions.

Your task is to expand the content summary into a complete prose passage, carefully following the style instructions provided.

---

**WRITING STYLE INSTRUCTIONS:**

## PART 1: THE GENERATIVE STANCE

When you write as this author, you inhabit a position of **clarity achieved and being transmitted**. You are not discovering in real-time; you have thought through the confusions, identified where thinking goes wrong, and now demonstrate the path to understanding. Your relationship to your material is that of someone who has **solved the puzzle and now shows others the solution**—not with condescension, but with the confidence that clear thinking is accessible to anyone willing to follow the reasoning.

**Your orientation to material** is fundamentally diagnostic. When you encounter a philosophical problem or cultural phenomenon, you ask: *Where does the confusion originate? What historical accident or logic

In [64]:
reconstructions = store.get_reconstructions('sample_001', 1)
for reconstructor in reconstructions.keys():
    print(f"{reconstructor.upper()}\n===================")
    print(f"\n{reconstructions.get(reconstructor)}\n\n")

AUTHOR

**On the Nature of Particulars, Perception, and the Illusion of Substance**

It is a curious fact, and one not sufficiently remarked upon, that the common understanding of the world is built upon a foundation of unexamined assumptions—assumptions so deeply ingrained that they pass for self-evident truths. We speak of "things" as though they were the solid, enduring stuff of reality, when in fact they are no more substantial than the role of an instrument in a symphony. A trombone, for instance, is not a single, unchanging entity but a succession of notes, each fleeting, each dependent upon the relations that bind it to the whole. And so it is with the objects of our perception: they are not the ultimate constituents of the world, but rather logical constructions, symbolic fictions woven from the brief, evanescent particulars that alone possess the dignity of reality.

This distinction is of more than mere academic interest, for upon it hinges the resolution of one of philosophy

### Step 3: Judging

The judge LLM ranks the four reconstructions of each run by similarity to the original text:

In [65]:
import zlib

from belletrist.models import StyleJudgeComparativeConfig, StyleJudgmentComparative
# Step 3: Comparative blind judging (with crash resume)
print("=== Step 3: Comparative Blind Judging ===\n")

for sample_id in store.list_samples():
    sample = store.get_sample(sample_id)
    print(f"\n{sample_id}:")
    
    for run in range(n_runs):
        if store.has_judgment(sample_id, run):
            print(f"  Run {run}: ✓ Already judged (skipping)")
            continue
        
        print(f"  Run {run}: Judging...", end=" ")
        
        # Get all 4 reconstructions
        reconstructions = store.get_reconstructions(sample_id, run)
        if len(reconstructions) != 4:
            print(f"✗ Missing reconstructions (found {len(reconstructions)}/4)")
            continue
        
        # Create randomized mapping for blind evaluation
        # Built-in hash() is salted per process for strings, so derive the seed
        # from a CRC32 checksum to keep it stable per (sample, run) across restarts
        mapping = store.create_random_mapping(seed=zlib.crc32(f"{sample_id}_{run}".encode()))
        
        # Build prompt with anonymous labels (BLIND EVALUATION)
        judge_config = StyleJudgeComparativeConfig(
            original_text=sample['original_text'],
            reconstruction_text_a=reconstructions[mapping.text_a],
            reconstruction_text_b=reconstructions[mapping.text_b],
            reconstruction_text_c=reconstructions[mapping.text_c],
            reconstruction_text_d=reconstructions[mapping.text_d]
        )
        judge_prompt = prompt_maker.render(judge_config)
        
        # Get structured JSON judgment
        try:
            response = judge_llm.complete_json(judge_prompt)
            judgment_data = json.loads(response.content)
            judgment = StyleJudgmentComparative(**judgment_data)
            
            # Save judgment with mapping (crash resilient!)
            store.save_judgment(sample_id, run, judgment, mapping, response.model)
            print(f"✓ (confidence: {judgment.confidence})")
            
        except Exception as e:
            print(f"✗ Error: {e}")

stats = store.get_stats()
print(f"\n✓ Completed {stats['n_judgments']} judgments")

=== Step 3: Comparative Blind Judging ===


sample_000:
  Run 0: Judging... ✓ (confidence: high)
  Run 1: Judging... ✓ (confidence: high)
  Run 2: Judging... ✓ (confidence: high)

sample_001:
  Run 0: Judging... ✓ (confidence: high)
  Run 1: Judging... ✓ (confidence: high)
  Run 2: Judging... ✗ Error: 5 validation errors for StyleJudgmentComparative
ranking_text_a
  Field required [type=missing, input_value={'reasoning': 'Let me ana...a-commentary disaster.'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/missing
ranking_text_b
  Field required [type=missing, input_value={'reasoning': 'Let me ana...a-commentary disaster.'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/missing
ranking_text_c
  Field required [type=missing, input_value={'reasoning': 'Let me ana...a-commentary disaster.'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.7/v/missing
ranking_text_d
  Field require

## Results Analysis

In [66]:
# Export results to DataFrame
print("=== Exporting Results ===\n")

# Export from store (resolves anonymous rankings to methods)
df = store.to_dataframe()

print(f"Total judgments: {len(df)}")
print(f"Samples: {df['sample_id'].nunique()}")
print(f"Runs per sample: {df.groupby('sample_id')['run'].nunique().mean():.1f}")

# Show first few rows
print(f"\n=== Sample Results ===\n")
display_cols = ['sample_id', 'run', 'ranking_generic', 'ranking_fewshot', 
                'ranking_author', 'ranking_instructions', 'confidence']
print(df[display_cols].head(15))

# Export to CSV
output_file = f"style_eval_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
df.to_csv(output_file, index=False)
print(f"\n✓ Results saved to {output_file}")

=== Exporting Results ===

Total judgments: 14
Samples: 5
Runs per sample: 2.8

=== Sample Results ===

     sample_id  run  ranking_generic  ranking_fewshot  ranking_author  \
0   sample_000    0                2                1               4   
1   sample_000    1                3                2               4   
2   sample_000    2                4                2               3   
3   sample_001    0                3                1               4   
4   sample_001    1                2                1               3   
5   sample_002    0                2                1               3   
6   sample_002    1                4                3               2   
7   sample_002    2                4                1               3   
8   sample_003    0                4                1               3   
9   sample_003    1                3                1               4   
10  sample_003    2                4                1               3   
11  sample_004    0 

In [67]:
print(df.loc[4,'reasoning'])

Let me analyze each text's stylistic fit with the original Russell passage.

**ORIGINAL VOICE CHARACTERISTICS:**
The original has a distinctive voice: crisp, analytical, relentlessly clear. Russell uses concrete examples (symphony/trombone) to illuminate abstractions. His sentences have a measured, almost conversational rhythm despite the philosophical density. He addresses objections systematically but without excessive signposting. The prose feels like a patient teacher walking you through a problem, not discovering it in real-time but presenting a path already cleared. Crucially, there's an economy of expression—no word wasted, no rhetorical flourish for its own sake.

**TEXT A Analysis:**
This text is massively over-written and self-conscious about its own method. It opens with meta-commentary ("carefully following the provided style instructions") and ends with a numbered list of "Key Stylistic Features Demonstrated"—an absolute giveaway that this is a construction following expli

In [68]:
# Analyze ranking distributions
print("=== Ranking Distribution by Method ===\n")

for method in ['generic', 'fewshot', 'author', 'instructions']:
    col = f'ranking_{method}'
    print(f"\n{method.upper()}:")
    ranking_counts = df[col].value_counts().sort_index()
    for rank in [1, 2, 3, 4]:
        count = ranking_counts.get(rank, 0)
        pct = (count / len(df) * 100) if len(df) > 0 else 0
        print(f"  Rank {rank}: {count:3d} ({pct:5.1f}%)")

print("\n=== Confidence Distribution ===\n")
print(df['confidence'].value_counts())

=== Ranking Distribution by Method ===


GENERIC:
  Rank 1:   0 (  0.0%)
  Rank 2:   5 ( 35.7%)
  Rank 3:   4 ( 28.6%)
  Rank 4:   5 ( 35.7%)

FEWSHOT:
  Rank 1:   9 ( 64.3%)
  Rank 2:   3 ( 21.4%)
  Rank 3:   2 ( 14.3%)
  Rank 4:   0 (  0.0%)

AUTHOR:
  Rank 1:   0 (  0.0%)
  Rank 2:   1 (  7.1%)
  Rank 3:   7 ( 50.0%)
  Rank 4:   6 ( 42.9%)

INSTRUCTIONS:
  Rank 1:   5 ( 35.7%)
  Rank 2:   5 ( 35.7%)
  Rank 3:   1 (  7.1%)
  Rank 4:   3 ( 21.4%)

=== Confidence Distribution ===

confidence
high    14
Name: count, dtype: int64


In [69]:
# Calculate method performance metrics
print("=== Method Performance Metrics ===\n")

# Calculate mean ranking for each method (lower is better: 1 = best, 4 = worst)
mean_rankings = {}
for method in ['generic', 'fewshot', 'author', 'instructions']:
    col = f'ranking_{method}'
    mean_rankings[method] = df[col].mean()

# Sort by mean ranking (best first)
sorted_methods = sorted(mean_rankings.items(), key=lambda x: x[1])

print("Average Ranking (lower is better):")
for i, (method, mean_rank) in enumerate(sorted_methods, 1):
    # Count how often this method ranked 1st
    first_place = (df[f'ranking_{method}'] == 1).sum()
    first_place_pct = (first_place / len(df) * 100) if len(df) > 0 else 0
    
    print(f"{i}. {method:12s}: {mean_rank:.2f} (1st place: {first_place}/{len(df)} = {first_place_pct:.1f}%)")

# Top-2 rate (share of judgments in which the method ranked 1st or 2nd)
print("\nTop-2 Rate (ranked 1st or 2nd):")
for method in ['generic', 'fewshot', 'author', 'instructions']:
    col = f'ranking_{method}'
    top2 = ((df[col] == 1) | (df[col] == 2)).sum()
    top2_pct = (top2 / len(df) * 100) if len(df) > 0 else 0
    print(f"  {method:12s}: {top2}/{len(df)} = {top2_pct:.1f}%")

=== Method Performance Metrics ===

Average Ranking (lower is better):
1. fewshot     : 1.50 (1st place: 9/14 = 64.3%)
2. instructions: 2.14 (1st place: 5/14 = 35.7%)
3. generic     : 3.00 (1st place: 0/14 = 0.0%)
4. author      : 3.36 (1st place: 0/14 = 0.0%)

Top-2 Rate (ranked 1st or 2nd):
  generic     : 5/14 = 35.7%
  fewshot     : 12/14 = 85.7%
  author      : 1/14 = 7.1%
  instructions: 10/14 = 71.4%


In [25]:
# Export results
output_file = f"style_evaluation_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
df.to_csv(output_file, index=False)
print(f"\n✓ Results saved to {output_file}")


✓ Results saved to style_evaluation_results_20251202_164845.csv


## Next Steps

TODO: Add analysis cells for:
- Statistical significance testing
- Visualization of results
- Sample-by-sample breakdown
- Confidence level analysis
- Qualitative review of reasoning
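
As a starting point for the significance testing item, here is a minimal sketch of a Friedman test computed from the rank counts printed in the "Ranking Distribution by Method" output. It is not part of the notebook's analysis. It relies on two assumptions: the 14 judgments are treated as independent blocks, although runs of the same sample are likely correlated, and the chi-square approximation is used even though the sample is small. The helper `chi2_sf_3df` is a hypothetical name for the closed-form survival function of a chi-square distribution with 3 degrees of freedom.

```python
import math

# Rank counts per method, copied from the ranking distribution output
# (index 0 = rank 1, ..., index 3 = rank 4).
rank_counts = {
    "generic": [0, 5, 4, 5],
    "fewshot": [9, 3, 2, 0],
    "author": [0, 1, 7, 6],
    "instructions": [5, 5, 1, 3],
}

n = 14                # number of judgments (blocks)
k = len(rank_counts)  # number of methods being ranked

# Rank sum per method: sum of rank * number of times that rank was given.
rank_sums = {
    method: sum(rank * count for rank, count in enumerate(counts, start=1))
    for method, counts in rank_counts.items()
}

# Friedman statistic for untied ranks.
q = 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums.values()) - 3.0 * n * (k + 1)


def chi2_sf_3df(x: float) -> float:
    """Survival function of the chi-square distribution with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_sf_3df(q)  # k - 1 = 3 degrees of freedom
print(f"Friedman Q = {q:.2f}, approximate p = {p_value:.4f}")
```

A small p-value here only indicates that the methods are not ranked interchangeably; pairwise comparisons (for example few-shot versus instructions) would need a separate post-hoc test.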