# Style Instruction Evaluation Framework

This notebook evaluates how well derived style instructions enable style replication, compared to alternative approaches.

## Methodology

For each gold standard text:
1. **Flatten**: Extract content while removing style
2. **Reconstruct**: Generate text using 4 different methods (M stochastic runs each, set by `n_runs` in the code):
   - Generic baseline
   - Few-shot learning
   - Author name prompting
   - Derived style instructions
3. **Judge (Blind Comparative)**: Judge ranks all 4 reconstructions from 1-4 based on similarity to original
   - **Blind evaluation**: Judge sees only anonymous labels (Text A, B, C, D) - no method names
   - **Ranking**: 1 = most similar, 2 = second, 3 = third, 4 = least similar
   - **Order randomized**: Position of methods varies across samples to reduce position bias
4. **Aggregate**: Analyze rankings to determine which method best captures style
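
The blind labelling in step 3 can be sketched as follows. This is a minimal illustration, not the `belletrist` implementation: the function names `make_blind_mapping` and `resolve_ranking`, the label strings, and the example judge output are all assumptions made for the sketch.

```python
import random

METHODS = ["generic", "fewshot", "author", "instructions"]
LABELS = ["A", "B", "C", "D"]


def make_blind_mapping(seed: int) -> dict[str, str]:
    """Return a label -> method mapping shuffled with a seeded RNG (hypothetical helper)."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    shuffled = METHODS[:]
    rng.shuffle(shuffled)
    return dict(zip(LABELS, shuffled))


def resolve_ranking(label_ranks: dict[str, int], mapping: dict[str, str]) -> dict[str, int]:
    """Translate the judge's label-based ranks back to method-based ranks."""
    return {mapping[label]: rank for label, rank in label_ranks.items()}


mapping = make_blind_mapping(seed=7)
# Hypothetical judge output: Text B ranked most similar, Text C least similar.
judge_ranks = {"A": 2, "B": 1, "C": 4, "D": 3}
method_ranks = resolve_ranking(judge_ranks, mapping)
print(mapping)
print(method_ranks)
```

The judge only ever sees the labels; the mapping is kept on the side and applied after the ranking comes back.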

## Key Features

- **Crash resilient**: All LLM responses saved to SQLite immediately
- **Resume support**: Can restart after failures, skips completed work
- **Blind evaluation**: Hiding method names keeps the judge from favoring a method by name
- **Comparative ranking**: More informative than binary comparisons
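
The save-immediately and skip-if-done pattern behind the first two features can be sketched with the standard `sqlite3` module. This is a sketch under assumptions: the table layout, the helper names, and the `fake_llm` stand-in are invented for illustration and are not taken from `StyleEvaluationStore`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path would be used for real persistence
conn.execute(
    "CREATE TABLE IF NOT EXISTS reconstructions ("
    " sample_id TEXT, run INTEGER, method TEXT, text TEXT,"
    " PRIMARY KEY (sample_id, run, method))"
)


def has_reconstruction(sample_id: str, run: int, method: str) -> bool:
    """Check whether a result for this (sample, run, method) is already stored."""
    row = conn.execute(
        "SELECT 1 FROM reconstructions WHERE sample_id=? AND run=? AND method=?",
        (sample_id, run, method),
    ).fetchone()
    return row is not None


def save_reconstruction(sample_id: str, run: int, method: str, text: str) -> None:
    """Write one result and commit right away, so a later crash loses nothing."""
    with conn:  # the context manager commits on success
        conn.execute(
            "INSERT OR REPLACE INTO reconstructions VALUES (?, ?, ?, ?)",
            (sample_id, run, method, text),
        )


calls = 0


def fake_llm(prompt: str) -> str:
    """Stand-in for an expensive LLM call; counts how often it is invoked."""
    global calls
    calls += 1
    return f"response to {prompt}"


def run_all() -> None:
    for method in ["generic", "fewshot"]:
        if has_reconstruction("sample_000", 0, method):
            continue  # resume support: completed work is skipped
        save_reconstruction("sample_000", 0, method, fake_llm(method))


run_all()
run_all()  # second pass finds everything stored and makes no new calls
print(calls)
```

Committing after every response trades a little write overhead for the guarantee that no paid LLM call has to be repeated after a failure.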

### Install Libraries and Check

In [45]:
!pip install -r requirements.txt


In [46]:
try:
    import litellm
    print('Providers\n=========')
    print('* ' + '\n* '.join(litellm.LITELLM_CHAT_PROVIDERS))
    litellm.drop_params = True
except ImportError as e:
    print(f"✗ Cannot import litellm: {e}")

Providers
* openai
* openai_like
* bytez
* xai
* custom_openai
* text-completion-openai
* cohere
* cohere_chat
* clarifai
* anthropic
* anthropic_text
* replicate
* huggingface
* together_ai
* datarobot
* openrouter
* cometapi
* vertex_ai
* vertex_ai_beta
* gemini
* ai21
* baseten
* azure
* azure_text
* azure_ai
* sagemaker
* sagemaker_chat
* bedrock
* vllm
* nlp_cloud
* petals
* oobabooga
* ollama
* ollama_chat
* deepinfra
* perplexity
* mistral
* groq
* nvidia_nim
* cerebras
* baseten
* ai21_chat
* volcengine
* codestral
* text-completion-codestral
* deepseek
* sambanova
* maritalk
* cloudflare
* fireworks_ai
* friendliai
* watsonx
* watsonx_text
* triton
* predibase
* databricks
* empower
* github
* custom
* litellm_proxy
* hosted_vllm
* llamafile
* lm_studio
* galadriel
* gradient_ai
* github_copilot
* novita
* meta_llama
* featherless_ai
* nscale
* nebius
* dashscope
* moonshot
* v0
* heroku
* oci
* morph
* lambda_ai
* vercel_ai_gateway
* wandb
* ovhcloud
* lemonade


## Setup and Configuration

In [47]:
import os
import json
from pathlib import Path
import pandas as pd
from datetime import datetime

### Initialize Base Objects
The base objects from the project library (`belletrist`) are initialized here:
* `LLM`: the LLM object.
* `LLMConfig`: the configuration of the LLM object, such as what model to use.
* `PromptMaker`: generates prompts from templates and variables
* `DataSampler`: retrieves and samples text at a source directory

These objects carry out the LLM text transformations in the evaluation process. They wrap third-party LLMs, which we split into one model for text reconstruction and one for judging; the key parameters for both are set below.

In [48]:
#model_reconstruction_string = 'together_ai/Qwen/Qwen3-235B-A22B-Instruct-2507-tput'
#model_reconstruction_api_key_env_var = 'TOGETHER_AI_API_KEY'
#model_reconstruction_string = 'anthropic/claude-sonnet-4-5-20250929'
#model_reconstruction_api_key_env_var = 'ANTHROPIC_API_KEY'
#model_reconstruction_string = 'openai/gpt-5.1-2025-11-13'
#model_reconstruction_api_key_env_var = 'OPENAI_API_KEY'
model_reconstruction_string = 'together_ai/moonshotai/Kimi-K2-Instruct'
model_reconstruction_api_key_env_var = 'TOGETHER_AI_API_KEY'
model_judge_string = 'anthropic/claude-sonnet-4-5-20250929'
model_judge_api_key_env_var = 'ANTHROPIC_API_KEY'

In [49]:
from belletrist import PromptMaker, DataSampler, StyleEvaluationStore

prompt_maker = PromptMaker()
sampler = DataSampler(
    data_path=(Path(os.getcwd()) / "data" / "russell").resolve()
)
store = StyleEvaluationStore(Path("style_evaluation_results_mistral_kimik2_sonnet.db"))

In [50]:
# Clears the store for a fresh run; skip this cell when resuming an interrupted run.
store.reset('all')

In [51]:
from belletrist import LLM, LLMConfig

reconstruction_llm = LLM(LLMConfig(
    model=model_reconstruction_string,
    api_key=os.environ.get(model_reconstruction_api_key_env_var)
))
judge_llm = LLM(LLMConfig(
    model=model_judge_string,
    api_key=os.environ.get(model_judge_api_key_env_var)
))

### Sample Test Data and Few-Shot Data
The reconstruction methods are tested on gold standard texts. The few-shot method draws its examples from the same corpus, so to avoid leaking test content into the prompts, no few-shot example may overlap with a test text.

In [52]:
n_sample = 5
m_paragraphs_per_sample = 5
n_few_shot_sample = 5

test_texts = []
for _ in range(n_sample):
    test_texts.append(sampler.sample_segment(p_length=m_paragraphs_per_sample))

few_shot_texts = []
while len(few_shot_texts) < n_few_shot_sample:
    p = sampler.sample_segment(p_length=m_paragraphs_per_sample)

    # Check if p overlaps with any test text
    # Two segments overlap if they're from the same file AND their paragraph ranges overlap
    # Ranges [a, b) and [c, d) overlap if: a < d AND c < b
    has_overlap = any(
        p.file_index == test_seg.file_index and
        p.paragraph_start < test_seg.paragraph_end and
        test_seg.paragraph_start < p.paragraph_end
        for test_seg in test_texts
    )

    if not has_overlap:
        few_shot_texts.append(p)
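
The overlap condition used above can be isolated into a small helper and checked on a few cases. This is a sketch that assumes, as the code comment states, that `paragraph_end` is exclusive; the function name `ranges_overlap` is hypothetical.

```python
def ranges_overlap(a: int, b: int, c: int, d: int) -> bool:
    """True if half-open ranges [a, b) and [c, d) share at least one index."""
    return a < d and c < b


print(ranges_overlap(0, 5, 3, 8))    # True: paragraphs 3 and 4 are shared
print(ranges_overlap(0, 5, 5, 10))   # False: the ranges only touch at 5
print(ranges_overlap(2, 4, 0, 10))   # True: the first range lies inside the second
```

Adjacent segments count as non-overlapping, which is what allows a few-shot example to start exactly where a test text ends.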

## Create the Test Transformation Objects
Combining prompts with the LLMs yields the following operators in the test chain:
* **Style Flattener**: compresses a text into its bare content.
* **Reconstructor, LLM House Style**: expands compressed content into a complete text in the "house style" of the LLM used for reconstruction.
* **Reconstructor, Few Shot**: expands compressed content into a complete text, using a few excerpts on unrelated topics as a style guide.
* **Reconstructor, LLM Author Model**: expands compressed content into a complete text in the named author's style as the LLM conceives it, with no other guidance.
* **Reconstructor, Style Instruction**: expands compressed content into a complete text following the detailed style instructions derived from previous analysis.

In [53]:
style_instructions_path = Path("outputs/derived_style_instructions_mistral_1127.txt")

if style_instructions_path.exists():
    style_instructions = style_instructions_path.read_text()
    print(f"✓ Loaded style instructions ({len(style_instructions)} chars)")
    print(f"\nFirst 500 chars:\n{style_instructions[:500]}...")
else:
    print("⚠ Style instructions file not found. Please provide path.")
    style_instructions = ""  # Placeholder

✓ Loaded style instructions (24952 chars)

First 500 chars:
---
synthesis_id: principles_guide_001
synthesis_type: principles_guide
model: mistral-large-2411
created_at: 2025-11-27T18:18:13.356530
parent_synthesis_id: cross_text_synthesis_001
num_samples: 5
sample_ids:
  - sample_005
  - sample_002
  - sample_001
  - sample_004
  - sample_003
is_homogeneous_model: true
models_used:
  - mistral-large-2411
---

# THE SYNTHESIST: A STYLE GUIDE FOR WRITERS

## SECTION 1: FOUNDATIONAL PRINCIPLES

### Subordinate Complexity to Simple Frames

Place your core cl...


In [54]:
from belletrist.models import (
    StyleFlatteningConfig,
    StyleReconstructionGenericConfig,
    StyleReconstructionFewShotConfig,
    StyleReconstructionAuthorConfig,
    StyleReconstructionInstructionsConfig,
    MethodMapping,
    StyleJudgeComparativeConfig
)

# Configuration
n_runs = 3
AUTHOR_NAME = "Bertrand Russell"

# Reconstructor configs
RECONSTRUCTORS_CFGS = {
    'generic': StyleReconstructionGenericConfig,
    'fewshot': StyleReconstructionFewShotConfig,
    'author': StyleReconstructionAuthorConfig,
    'instructions': StyleReconstructionInstructionsConfig
}

# Reconstructor kwargs
RECONSTRUCTORS_KWARGS = {
    'generic': {},
    'fewshot': {'few_shot_examples': [seg.text for seg in few_shot_texts]},
    'author': {'author_name': AUTHOR_NAME},
    'instructions': {'style_instructions': style_instructions}
}

print(f"✓ Store initialized at {store.filepath}")
print(f"✓ Configuration: {n_runs} runs per sample, 4 reconstruction methods")

✓ Store initialized at style_evaluation_results_mistral_kimik2_sonnet.db
✓ Configuration: 3 runs per sample, 4 reconstruction methods


## Evaluation Pipeline

### Step 1: Content Flattening

Extract content from each test sample, removing stylistic elements:

In [55]:
# Step 1: Save samples and flatten content
print("=== Step 1: Flattening and Saving Samples ===\n")

for k_text, test_sample in enumerate(test_texts):
    sample_id = f"sample_{k_text:03d}"
    
    # Skip if already saved
    if store.get_sample(sample_id):
        print(f"✓ {sample_id} already flattened (skipping)")
        continue
    
    print(f"Flattening {sample_id}...", end=" ")
    
    # Flatten content
    flatten_prompt = prompt_maker.render(
        StyleFlatteningConfig(text=test_sample.text)
    )
    flattened = reconstruction_llm.complete(flatten_prompt)
    
    # Save to store with provenance
    source_info = f"File {test_sample.file_index}, para {test_sample.paragraph_start}-{test_sample.paragraph_end}"
    store.save_sample(
        sample_id=sample_id,
        original_text=test_sample.text,
        flattened_content=flattened.content,
        flattening_model=flattened.model,
        source_info=source_info
    )
    
    print(f"✓ ({len(flattened.content)} chars)")

print(f"\n✓ All samples flattened and saved to store")

=== Step 1: Flattening and Saving Samples ===

Flattening sample_000... ✓ (1701 chars)
Flattening sample_001... ✓ (1892 chars)
Flattening sample_002... ✓ (1935 chars)
Flattening sample_003... ✓ (2459 chars)
Flattening sample_004... ✓ (4793 chars)

✓ All samples flattened and saved to store


In [56]:
flattened

LLMResponse(content='**Core thesis:**  \nCapitalism cannot deliver both peace and freedom; any peace it produces is an imperial peace that suppresses weaker nations. Only international socialism can reconcile peace with genuine freedom.\n\n---\n\n### 1. The Washington naval treaty (four-power pact)  \n- Claimed purpose: safeguard peace.  \n- Actual effect: moves freedom farther away.  \n- Mechanism: functions like a business combine that ends competition among dominant firms; advantages for weaker parties are accidental and unlikely.  \n- Specific consequence for China: domination without formal violation of the Open-Door principle, because U.S. financial and commercial supremacy turns the principle into a guarantee of American control.\n\n---\n\n### 2. American objectives in China  \n**Interests overlapping with Chinese welfare:**  \n- Stable government.  \n- Rising purchasing power.  \n- Prevention of territorial seizures by other powers.  \n\n**Interests conflicting with Chinese wel

In [68]:
print('====ORIGINAL===')
print(store.get_sample('sample_004')['original_text'])
print('\n\n====FLATTENED====')
print(store.get_sample('sample_004')['flattened_content'])

====ORIGINAL===
But it is impossible to make a silk purse out of a sow's ear, or peace
and freedom out of capitalism. The fourfold agreement between England,
France, America and Japan is, perhaps, a safeguard of peace, but in so
far as it brings peace nearer it puts freedom further off. It is the
peace obtained when competing firms join in a combine, which is by no
means always advantageous to those who have profited by the previous
competition. It is quite possible to dominate China without infringing
the principle of the Open Door. This principle merely ensures that the
domination everywhere shall be American, because America is the
strongest Power financially and commercially. It is to America's
interest to secure, in China, certain things consistent with Chinese
interests, and certain others inconsistent with them. The Americans, for
the sake of commerce and good investments, would wish to see a stable
government in China, an increase in the purchasing power of the people,
and an a

### Step 2: Reconstruction

Generate reconstructions using all 4 methods, with `n_runs` stochastic runs each:

In [58]:
# Step 2: Generate reconstructions (with crash resume)
print("=== Step 2: Generating Reconstructions ===\n")

for sample_id in store.list_samples():
    sample = store.get_sample(sample_id)
    print(f"\n{sample_id}:")
    
    for run in range(n_runs):
        print(f"  Run {run}:")
        
        # Check which methods need reconstruction
        for method in ['generic', 'fewshot', 'author', 'instructions']:
            if store.has_reconstruction(sample_id, run, method):
                print(f"    ✓ {method:12s} (already done)")
                continue
            
            # Generate reconstruction
            config = RECONSTRUCTORS_CFGS[method](
                content_summary=sample['flattened_content'],
                **RECONSTRUCTORS_KWARGS[method]
            )
            prompt = prompt_maker.render(config)
            response = reconstruction_llm.complete(prompt)
            
            # Save immediately (crash resilient!)
            store.save_reconstruction(
                sample_id=sample_id,
                run=run,
                method=method,
                reconstructed_text=response.content,
                model=response.model
            )
            print(f"    ✓ {method:12s} ({len(response.content)} chars)")

stats = store.get_stats()
print(f"\n✓ Generated {stats['n_reconstructions']} total reconstructions")

=== Step 2: Generating Reconstructions ===


sample_000:
  Run 0:
    ✓ generic      (3881 chars)
    ✓ fewshot      (4367 chars)
    ✓ author       (4188 chars)
    ✓ instructions (1565 chars)
  Run 1:
    ✓ generic      (3463 chars)
    ✓ fewshot      (4212 chars)
    ✓ author       (4839 chars)
    ✓ instructions (2173 chars)
  Run 2:
    ✓ generic      (5382 chars)
    ✓ fewshot      (3603 chars)
    ✓ author       (4140 chars)
    ✓ instructions (1374 chars)

sample_001:
  Run 0:
    ✓ generic      (6586 chars)
    ✓ fewshot      (7338 chars)
    ✓ author       (5501 chars)
    ✓ instructions (3047 chars)
  Run 1:
    ✓ generic      (3807 chars)
    ✓ fewshot      (5792 chars)
    ✓ author       (6229 chars)
    ✓ instructions (3278 chars)
  Run 2:
    ✓ generic      (5370 chars)
    ✓ fewshot      (11243 chars)
    ✓ author       (4821 chars)
    ✓ instructions (3381 chars)

sample_002:
  Run 0:
    ✓ generic      (4225 chars)
    ✓ fewshot      (5350 chars)
    ✓ author       (4

In [59]:
# Inspect the few-shot prompt for the last sample processed in the loop above
config = RECONSTRUCTORS_CFGS['fewshot'](
    content_summary=sample['flattened_content'],
    **RECONSTRUCTORS_KWARGS['fewshot']
)
prompt = prompt_maker.render(config)
print(prompt)

You are given a content summary that captures the core ideas and arguments of a text, but stripped of stylistic elements.

Your task is to expand this summary into a complete prose passage, following the writing style demonstrated in the example texts below.

**EXAMPLE TEXTS (demonstrating the target style):**

---

The fundamental epistemological principle in the analysis of
propositions containing descriptions is this: _Every proposition which
we can understand must be composed wholly of constituents with which
we are acquainted._ From what has been said already, it will be plain
why I advocate this principle, and how I propose to meet the case of
propositions which at first sight contravene it. Let us begin with the
reasons for supposing the principle true.

The chief reason for supposing the principle true is that it seems
scarcely possible to believe that we can make a judgment or entertain
a supposition without knowing what it is that we are judging or
supposing about. If we make

In [69]:
reconstructions = store.get_reconstructions('sample_004', 0)
for method, text in reconstructions.items():
    print(f"{method.upper()}\n===================")
    print(f"\n{text}\n\n")

AUTHOR

It is customary, among those who inherit the comfortable empires of the nineteenth century, to speak of peace as though it were a self-evident blessing, like sunlight or maternal affection.  The Washington Naval Treaty, signed in February of 1922 amid a gratifying clatter of battleship rivets, was hailed by the press on both sides of the Pacific as the dawn of a new and costless millennium.  Yet the language of its clauses—so prudently technical, so reassuringly barren of moral enthusiasm—already betrayed the real character of the transaction.  It was, in essence, a cartel among the strong, a gentleman's agreement to suppress the vulgar spectacle of gun-fire while reserving every effective means of economic coercion.  Competition between the great Powers was to be curtailed only in that department where it had become dangerously expensive; in every other sphere the struggle would proceed with a decorum befitting creditors who have discovered that foreclosure is more profitable 

### Step 3: Judging

Rank the four reconstructions of each run against the original using the judge LLM:

In [61]:
import zlib

from belletrist.models import StyleJudgeComparativeConfig, StyleJudgmentComparative
# Step 3: Comparative blind judging (with crash resume)
print("=== Step 3: Comparative Blind Judging ===\n")

for sample_id in store.list_samples():
    sample = store.get_sample(sample_id)
    print(f"\n{sample_id}:")
    
    for run in range(n_runs):
        if store.has_judgment(sample_id, run):
            print(f"  Run {run}: ✓ Already judged (skipping)")
            continue
        
        print(f"  Run {run}: Judging...", end=" ")
        
        # Get all 4 reconstructions
        reconstructions = store.get_reconstructions(sample_id, run)
        if len(reconstructions) != 4:
            print(f"✗ Missing reconstructions (found {len(reconstructions)}/4)")
            continue
        
        # Create randomized mapping for blind evaluation
        # Using deterministic seed per (sample, run) for reproducibility.
        # The built-in hash() is salted per process for strings, so it would
        # give a different seed in every session; crc32 is stable.
        seed = zlib.crc32(f"{sample_id}_{run}".encode("utf-8"))
        mapping = store.create_random_mapping(seed=seed)
        
        # Build prompt with anonymous labels (BLIND EVALUATION)
        judge_config = StyleJudgeComparativeConfig(
            original_text=sample['original_text'],
            reconstruction_text_a=reconstructions[mapping.text_a],
            reconstruction_text_b=reconstructions[mapping.text_b],
            reconstruction_text_c=reconstructions[mapping.text_c],
            reconstruction_text_d=reconstructions[mapping.text_d]
        )
        judge_prompt = prompt_maker.render(judge_config)
        
        # Get structured JSON judgment
        try:
            response = judge_llm.complete_json(judge_prompt)
            judgment_data = json.loads(response.content)
            judgment = StyleJudgmentComparative(**judgment_data)
            
            # Save judgment with mapping (crash resilient!)
            store.save_judgment(sample_id, run, judgment, mapping, response.model)
            print(f"✓ (confidence: {judgment.confidence})")
            
        except Exception as e:
            print(f"✗ Error: {e}")

stats = store.get_stats()
print(f"\n✓ Completed {stats['n_judgments']} judgments")

=== Step 3: Comparative Blind Judging ===


sample_000:
  Run 0: Judging... ✓ (confidence: high)
  Run 1: Judging... ✓ (confidence: high)
  Run 2: Judging... ✓ (confidence: high)

sample_001:
  Run 0: Judging... ✓ (confidence: high)
  Run 1: Judging... ✓ (confidence: high)
  Run 2: Judging... ✓ (confidence: high)

sample_002:
  Run 0: Judging... ✓ (confidence: high)
  Run 1: Judging... ✓ (confidence: high)
  Run 2: Judging... ✓ (confidence: high)

sample_003:
  Run 0: Judging... ✓ (confidence: high)
  Run 1: Judging... ✓ (confidence: high)
  Run 2: Judging... ✓ (confidence: high)

sample_004:
  Run 0: Judging... ✓ (confidence: high)
  Run 1: Judging... ✓ (confidence: high)
  Run 2: Judging... ✓ (confidence: high)

✓ Completed 15 judgments
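
Before a judgment is saved, it can be worth checking that the judge actually returned a valid ranking, meaning each of the ranks 1 to 4 appears exactly once. The sketch below shows one way to do this. The JSON keys (`text_a` to `text_d`) and the function name `parse_ranking` are assumptions for illustration; the real schema is whatever `StyleJudgmentComparative` defines.

```python
import json

LABELS = ("text_a", "text_b", "text_c", "text_d")  # hypothetical key names


def parse_ranking(raw: str) -> dict[str, int]:
    """Parse a judge response and verify the ranks form a permutation of 1-4."""
    data = json.loads(raw)
    ranks = {label: int(data[label]) for label in LABELS}
    if sorted(ranks.values()) != [1, 2, 3, 4]:
        raise ValueError(f"ranks must be a permutation of 1-4, got {ranks}")
    return ranks


good = '{"text_a": 2, "text_b": 1, "text_c": 4, "text_d": 3}'
print(parse_ranking(good))

bad = '{"text_a": 1, "text_b": 1, "text_c": 4, "text_d": 3}'  # rank 1 given twice
try:
    parse_ranking(bad)
except ValueError as err:
    print(f"rejected: {err}")
```

A rejected response can then be retried on the next pass, since nothing was stored for that run.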


## Results Analysis

In [62]:
# Export results to DataFrame
print("=== Exporting Results ===\n")

# Export from store (resolves anonymous rankings to methods)
df = store.to_dataframe()

print(f"Total judgments: {len(df)}")
print(f"Samples: {df['sample_id'].nunique()}")
print(f"Runs per sample: {df.groupby('sample_id')['run'].nunique().mean():.1f}")

# Show first few rows
print(f"\n=== Sample Results ===\n")
display_cols = ['sample_id', 'run', 'ranking_generic', 'ranking_fewshot', 
                'ranking_author', 'ranking_instructions', 'confidence']
print(df[display_cols].head(15))

# Export to CSV
output_file = f"style_eval_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
df.to_csv(output_file, index=False)
print(f"\n✓ Results saved to {output_file}")

=== Exporting Results ===

Total judgments: 15
Samples: 5
Runs per sample: 3.0

=== Sample Results ===

     sample_id  run  ranking_generic  ranking_fewshot  ranking_author  \
0   sample_000    0                3                1               2   
1   sample_000    1                3                2               1   
2   sample_000    2                3                1               2   
3   sample_001    0                4                1               2   
4   sample_001    1                3                1               2   
5   sample_001    2                1                4               3   
6   sample_002    0                4                1               3   
7   sample_002    1                2                3               4   
8   sample_002    2                2                3               4   
9   sample_003    0                4                1               3   
10  sample_003    1                2                3               1   
11  sample_003    2 

In [67]:
print(df.loc[12,'reasoning'])

Let me carefully analyze the original's distinctive voice and compare each reconstruction.

**Original's Stylistic Fingerprint:**

The original has a direct, argumentative, almost conversational voice. Russell writes with clarity and force, building arguments through concrete examples rather than elaborate metaphors. Key features:

1. **Straightforward declarative sentences** that state positions bluntly ("But it is impossible to make a silk purse out of a sow's ear")
2. **Concrete, specific examples** (Gorky's expulsion, the dossier of students, Open Door principle)
3. **Logical progression** with clear signposting ("To sum up," "But," "Hence")
4. **Restrained rhetoric** - persuasive but not ornate
5. **Mix of sentence lengths** with some long analytical sentences but also punchy short ones
6. **Direct address of counter-arguments** in a matter-of-fact tone
7. **Technical precision** when needed (State Capitalism, State Socialism) without excessive elaboration

**Text A Analysis:**

T

In [64]:
# Analyze ranking distributions
print("=== Ranking Distribution by Method ===\n")

for method in ['generic', 'fewshot', 'author', 'instructions']:
    col = f'ranking_{method}'
    print(f"\n{method.upper()}:")
    ranking_counts = df[col].value_counts().sort_index()
    for rank in [1, 2, 3, 4]:
        count = ranking_counts.get(rank, 0)
        pct = (count / len(df) * 100) if len(df) > 0 else 0
        print(f"  Rank {rank}: {count:3d} ({pct:5.1f}%)")

print("\n=== Confidence Distribution ===\n")
print(df['confidence'].value_counts())

=== Ranking Distribution by Method ===


GENERIC:
  Rank 1:   2 ( 13.3%)
  Rank 2:   5 ( 33.3%)
  Rank 3:   4 ( 26.7%)
  Rank 4:   4 ( 26.7%)

FEWSHOT:
  Rank 1:   8 ( 53.3%)
  Rank 2:   3 ( 20.0%)
  Rank 3:   3 ( 20.0%)
  Rank 4:   1 (  6.7%)

AUTHOR:
  Rank 1:   3 ( 20.0%)
  Rank 2:   4 ( 26.7%)
  Rank 3:   5 ( 33.3%)
  Rank 4:   3 ( 20.0%)

INSTRUCTIONS:
  Rank 1:   2 ( 13.3%)
  Rank 2:   3 ( 20.0%)
  Rank 3:   3 ( 20.0%)
  Rank 4:   7 ( 46.7%)

=== Confidence Distribution ===

confidence
high    15
Name: count, dtype: int64


In [65]:
# Calculate method performance metrics
print("=== Method Performance Metrics ===\n")

# Calculate mean ranking for each method (lower is better: 1 = best, 4 = worst)
mean_rankings = {}
for method in ['generic', 'fewshot', 'author', 'instructions']:
    col = f'ranking_{method}'
    mean_rankings[method] = df[col].mean()

# Sort by mean ranking (best first)
sorted_methods = sorted(mean_rankings.items(), key=lambda x: x[1])

print("Average Ranking (lower is better):")
for i, (method, mean_rank) in enumerate(sorted_methods, 1):
    # Count how often this method ranked 1st
    first_place = (df[f'ranking_{method}'] == 1).sum()
    first_place_pct = (first_place / len(df) * 100) if len(df) > 0 else 0
    
    print(f"{i}. {method:12s}: {mean_rank:.2f} (1st place: {first_place}/{len(df)} = {first_place_pct:.1f}%)")

# Top-2 rate (percentage of times ranked 1st or 2nd)
print("\nTop-2 Rate (ranked 1st or 2nd):")
for method in ['generic', 'fewshot', 'author', 'instructions']:
    col = f'ranking_{method}'
    top2 = ((df[col] == 1) | (df[col] == 2)).sum()
    top2_pct = (top2 / len(df) * 100) if len(df) > 0 else 0
    print(f"  {method:12s}: {top2}/{len(df)} = {top2_pct:.1f}%")

=== Method Performance Metrics ===

Average Ranking (lower is better):
1. fewshot     : 1.80 (1st place: 8/15 = 53.3%)
2. author      : 2.53 (1st place: 3/15 = 20.0%)
3. generic     : 2.67 (1st place: 2/15 = 13.3%)
4. instructions: 3.00 (1st place: 2/15 = 13.3%)

Top-2 Rate (ranked 1st or 2nd):
  generic     : 7/15 = 46.7%
  fewshot     : 11/15 = 73.3%
  author      : 7/15 = 46.7%
  instructions: 5/15 = 33.3%
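
As a cross-check, the mean rankings can be recomputed from the rank counts in the distribution table above. This is a small standalone sketch; the dictionary simply restates the counts printed earlier, and `mean_rank` is a hypothetical helper.

```python
# Counts of rank 1, 2, 3, 4 per method, copied from the ranking distribution output.
rank_counts = {
    "generic": [2, 5, 4, 4],
    "fewshot": [8, 3, 3, 1],
    "author": [3, 4, 5, 3],
    "instructions": [2, 3, 3, 7],
}


def mean_rank(counts: list[int]) -> float:
    """Weighted average of ranks 1..len(counts), weighted by how often each occurred."""
    total = sum(counts)
    return sum(rank * n for rank, n in enumerate(counts, start=1)) / total


for method, counts in rank_counts.items():
    print(f"{method:12s}: {mean_rank(counts):.2f}")
# generic     : 2.67
# fewshot     : 1.80
# author      : 2.53
# instructions: 3.00
```

For reference, a judge that ranked at random would give every method an expected mean rank of (1 + 2 + 3 + 4) / 4 = 2.5.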


In [66]:
# Export results
output_file = f"style_evaluation_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
df.to_csv(output_file, index=False)
print(f"\n✓ Results saved to {output_file}")


✓ Results saved to style_evaluation_results_20251128_183043.csv


## Next Steps

TODO: Add analysis cells for:
- Statistical significance testing
- Visualization of results
- Sample-by-sample breakdown
- Confidence level analysis
- Qualitative review of reasoning
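
For the significance-testing item, one possible starting point is the Friedman test, which is designed for several methods ranked within each block. The sketch below computes only the test statistic from per-method rank sums and assumes no tied ranks. Using this test here is a suggestion rather than part of the notebook, and `friedman_statistic` is a hypothetical helper. The rank sums are the mean rankings reported above multiplied by the 15 judgments.

```python
def friedman_statistic(rank_sums: list[int], n_blocks: int) -> float:
    """Friedman chi-square statistic for k methods ranked in n_blocks blocks (no ties)."""
    k = len(rank_sums)
    expected_total = n_blocks * k * (k + 1) // 2
    if sum(rank_sums) != expected_total:
        raise ValueError("rank sums are inconsistent with untied rankings")
    sum_sq = sum(r * r for r in rank_sums)
    return 12.0 / (n_blocks * k * (k + 1)) * sum_sq - 3.0 * n_blocks * (k + 1)


# generic, fewshot, author, instructions
rank_sums = [40, 27, 38, 45]
q = friedman_statistic(rank_sums, n_blocks=15)
print(f"{q:.2f}")  # 6.92
```

Under the usual chi-square approximation with k - 1 = 3 degrees of freedom, the 5% critical value is about 7.81. The 15 judgments come from 3 runs over 5 samples, so they are not independent blocks, and any result from this sketch is best read as a rough indication only.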