# Style Instruction Evaluation Framework

This notebook evaluates how well derived style instructions enable style replication, compared to alternative approaches.

## Methodology

For each gold standard text:
1. **Flatten**: Extract content while removing style
2. **Reconstruct**: Generate text using 4 different methods (`n_runs` stochastic runs each):
   - Generic baseline
   - Few-shot learning
   - Author name prompting
   - Derived style instructions
3. **Judge (Blind Comparative)**: Judge ranks all 4 reconstructions from 1-4 based on similarity to original
   - **Blind evaluation**: Judge sees only anonymous labels (Text A, B, C, D) - no method names
   - **Ranking**: 1 = most similar, 2 = second, 3 = third, 4 = least similar
   - **Order randomized**: The assignment of methods to labels varies across judgments to reduce position bias
4. **Aggregate**: Analyze rankings to determine which method best captures style
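
The blind, order-randomized judging step can be sketched as follows. This is a minimal illustration and not the `belletrist` implementation: `make_blind_mapping` and `resolve_rankings` are hypothetical helper names, and the judge's answer is assumed to arrive as a dict from label to rank.

```python
import random

METHODS = ["generic", "fewshot", "author", "instructions"]
LABELS = ["A", "B", "C", "D"]


def make_blind_mapping(seed: int) -> dict[str, str]:
    """Assign each method to an anonymous label in a seeded random order."""
    shuffled = METHODS.copy()
    random.Random(seed).shuffle(shuffled)
    return dict(zip(LABELS, shuffled))


def resolve_rankings(label_ranks: dict[str, int], mapping: dict[str, str]) -> dict[str, int]:
    """Translate the judge's per-label ranks back to per-method ranks."""
    if sorted(label_ranks.values()) != [1, 2, 3, 4]:
        raise ValueError("ranks must be a permutation of 1-4")
    return {mapping[label]: rank for label, rank in label_ranks.items()}


mapping = make_blind_mapping(seed=7)
# Hypothetical judge output: Text C judged most similar, Text B least similar.
judge_output = {"A": 2, "B": 4, "C": 1, "D": 3}
method_ranks = resolve_rankings(judge_output, mapping)
print(mapping)
print(method_ranks)
```

The judge only ever sees the labels; the mapping is stored next to the judgment so that rankings can be resolved to methods during aggregation.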

## Key Features

- **Crash resilient**: All LLM responses saved to SQLite immediately
- **Resume support**: Can restart after failures, skips completed work
- **Blind evaluation**: Reduces judge bias by hiding method names
- **Comparative ranking**: More informative than binary comparisons
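
The crash-resilience and resume features amount to a "check, compute, save immediately" loop over a persistent store. The sketch below illustrates the idea with the standard-library `sqlite3` module. The table layout, the helper functions and `fake_llm` are assumptions made for this example and do not describe the actual `StyleEvaluationStore` schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real run would use a file path
conn.execute(
    """CREATE TABLE IF NOT EXISTS reconstructions (
           sample_id TEXT, run INTEGER, method TEXT, text TEXT,
           PRIMARY KEY (sample_id, run, method))"""
)


def has_reconstruction(sample_id: str, run: int, method: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM reconstructions WHERE sample_id=? AND run=? AND method=?",
        (sample_id, run, method),
    ).fetchone()
    return row is not None


def save_reconstruction(sample_id: str, run: int, method: str, text: str) -> None:
    # Commit per row so that a crash loses at most the call in flight.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO reconstructions VALUES (?, ?, ?, ?)",
            (sample_id, run, method, text),
        )


call_log = []  # records every (expensive) LLM call


def fake_llm(prompt: str) -> str:
    """Stand-in for an LLM call."""
    call_log.append(prompt)
    return f"reconstruction of {prompt}"


def run_all() -> None:
    for method in ["generic", "fewshot", "author", "instructions"]:
        if has_reconstruction("sample_000", 0, method):
            continue  # finished in an earlier session
        save_reconstruction("sample_000", 0, method, fake_llm(method))


run_all()
run_all()  # simulated restart: nothing is recomputed
print(len(call_log))  # 4
```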

### Install Libraries and Check

In [68]:
!pip install -r requirements.txt





In [69]:
try:
    import litellm
    print('Providers\n=========')
    print('* ' + '\n* '.join(litellm.LITELLM_CHAT_PROVIDERS))
except ImportError as e:
    print(f"✗ Cannot import litellm: {e}")

Providers
* openai
* openai_like
* bytez
* xai
* custom_openai
* text-completion-openai
* cohere
* cohere_chat
* clarifai
* anthropic
* anthropic_text
* replicate
* huggingface
* together_ai
* datarobot
* openrouter
* cometapi
* vertex_ai
* vertex_ai_beta
* gemini
* ai21
* baseten
* azure
* azure_text
* azure_ai
* sagemaker
* sagemaker_chat
* bedrock
* vllm
* nlp_cloud
* petals
* oobabooga
* ollama
* ollama_chat
* deepinfra
* perplexity
* mistral
* groq
* nvidia_nim
* cerebras
* baseten
* ai21_chat
* volcengine
* codestral
* text-completion-codestral
* deepseek
* sambanova
* maritalk
* cloudflare
* fireworks_ai
* friendliai
* watsonx
* watsonx_text
* triton
* predibase
* databricks
* empower
* github
* custom
* litellm_proxy
* hosted_vllm
* llamafile
* lm_studio
* galadriel
* gradient_ai
* github_copilot
* novita
* meta_llama
* featherless_ai
* nscale
* nebius
* dashscope
* moonshot
* v0
* heroku
* oci
* morph
* lambda_ai
* vercel_ai_gateway
* wandb
* ovhcloud
* lemonade


## Setup and Configuration

In [70]:
import os
import json
from pathlib import Path
import pandas as pd
from datetime import datetime

### Initialize Base Objects
The base objects from the current project library (`belletrist`) are initialized here. They are:
* `LLM`: the LLM object.
* `LLMConfig`: the configuration of the LLM object, such as what model to use.
* `PromptMaker`: generates prompts from templates and variables
* `DataSampler`: retrieves and samples text at a source directory

These objects carry out the LLM text transformations in the evaluation process. They wrap third-party LLMs, which are split into one model for text reconstruction and one for text judging; the key parameters for both are set below.

In [71]:
model_reconstruction_string = 'anthropic/claude-sonnet-4-5-20250929'
model_reconstruction_api_key_env_var = 'ANTHROPIC_API_KEY'
model_judge_string = 'anthropic/claude-sonnet-4-5-20250929'
model_judge_api_key_env_var = 'ANTHROPIC_API_KEY'

In [None]:
from belletrist import PromptMaker, DataSampler, StyleEvaluationStore

prompt_maker = PromptMaker()
sampler = DataSampler(
    data_path=(Path(os.getcwd()) / "data" / "russell").resolve()
)
store = StyleEvaluationStore(Path("style_evaluation_results_sonnet_sonnet.db"))

In [None]:
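# Resets the store for a fresh evaluation. Skip this cell when resuming an
# interrupted run, otherwise previously completed work is discarded.
store.reset('all')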

In [None]:
from belletrist import LLM, LLMConfig

reconstruction_llm = LLM(LLMConfig(
    model=model_reconstruction_string,
    api_key=os.environ.get(model_reconstruction_api_key_env_var)
))
judge_llm = LLM(LLMConfig(
    model=model_judge_string,
    api_key=os.environ.get(model_judge_api_key_env_var)
))

### Sample Test Data and Few-Shot Data
The reconstruction tests are built on gold standard texts sampled from the source corpus. The few-shot method is prompted with further excerpts from the same corpus. To avoid skewing the test, no few-shot example may overlap with a test text.

In [75]:
n_sample = 5
m_paragraphs_per_sample = 5
n_few_shot_sample = 5

test_texts = []
for _ in range(n_sample):
    test_texts.append(sampler.sample_segment(p_length=m_paragraphs_per_sample))

few_shot_texts = []
while len(few_shot_texts) < n_few_shot_sample:
    p = sampler.sample_segment(p_length=m_paragraphs_per_sample)

    # Check if p overlaps with any test text
    # Two segments overlap if they're from the same file AND their paragraph ranges overlap
    # Ranges [a, b) and [c, d) overlap if: a < d AND c < b
    has_overlap = any(
        p.file_index == test_seg.file_index and
        p.paragraph_start < test_seg.paragraph_end and
        test_seg.paragraph_start < p.paragraph_end
        for test_seg in test_texts
    )

    if not has_overlap:
        few_shot_texts.append(p)
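
The overlap condition used above is the standard test for half-open intervals. The following sketch checks its boundary cases in isolation. `Segment` is a hypothetical stand-in for the objects returned by `sampler.sample_segment`, and it assumes, as the comment in the cell does, that `paragraph_end` is exclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for a sampled segment; the end index is exclusive."""
    file_index: int
    paragraph_start: int
    paragraph_end: int


def overlaps(a: Segment, b: Segment) -> bool:
    """True if both segments come from the same file and share a paragraph."""
    return (
        a.file_index == b.file_index
        and a.paragraph_start < b.paragraph_end
        and b.paragraph_start < a.paragraph_end
    )


test_seg = Segment(file_index=0, paragraph_start=10, paragraph_end=15)

print(overlaps(Segment(0, 14, 19), test_seg))  # True: paragraph 14 is shared
print(overlaps(Segment(0, 15, 20), test_seg))  # False: the ranges only touch
print(overlaps(Segment(1, 10, 15), test_seg))  # False: different file
```

If `paragraph_end` were inclusive instead, the two strict comparisons would need to become `<=` for adjacent segments to be treated correctly.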

## Create the Test Transformation Objects
Combining prompts with LLMs yields the following operators in the test chain:
* **Style Flattener**: compresses a text down to its bare content.
* **Reconstructor, LLM House Style**: expands the compressed content into a complete text in the "house style" of the LLM used for the reconstruction.
* **Reconstructor, Few Shot**: expands the compressed content into a complete text, using a few text excerpts on unrelated topics as a style guide.
* **Reconstructor, LLM Author Model**: expands the compressed content into a complete text in the named author's style as the LLM conceives it, without any other guidance.
* **Reconstructor, Style Instruction**: expands the compressed content into a complete text following the detailed style instructions derived from previous analysis.

In [76]:
style_instructions_path = Path("outputs/derived_style_instructions_1125_000.txt")

if style_instructions_path.exists():
    style_instructions = style_instructions_path.read_text()
    print(f"✓ Loaded style instructions ({len(style_instructions)} chars)")
    print(f"\nFirst 500 chars:\n{style_instructions[:500]}...")
else:
    print("⚠ Style instructions file not found. Please provide path.")
    style_instructions = ""  # Placeholder

✓ Loaded style instructions (23968 chars)

First 500 chars:
---
synthesis_id: principles_guide_001
synthesis_type: principles_guide
model: mistral-large-2411
created_at: 2025-11-25T16:33:20.019525
parent_synthesis_id: cross_text_synthesis_002
num_samples: 5
sample_ids:
  - sample_005
  - sample_003
  - sample_001
  - sample_002
  - sample_004
is_homogeneous_model: true
models_used:
  - mistral-large-2411
---

# A GUIDE TO CLEAR AND ENGAGING PROSE
## Principles Extracted from Pattern Analysis

### PART I: FOUNDATIONS
## Core Principles

### 1. Subordinate...


In [77]:
from belletrist.models import (
    StyleFlatteningConfig,
    StyleReconstructionGenericConfig,
    StyleReconstructionFewShotConfig,
    StyleReconstructionAuthorConfig,
    StyleReconstructionInstructionsConfig,
    MethodMapping,
    StyleJudgeComparativeConfig
)

# Configuration
n_runs = 3
AUTHOR_NAME = "Bertrand Russell"

# Reconstructor configs
RECONSTRUCTORS_CFGS = {
    'generic': StyleReconstructionGenericConfig,
    'fewshot': StyleReconstructionFewShotConfig,
    'author': StyleReconstructionAuthorConfig,
    'instructions': StyleReconstructionInstructionsConfig
}

# Reconstructor kwargs
RECONSTRUCTORS_KWARGS = {
    'generic': {},
    'fewshot': {'few_shot_examples': [seg.text for seg in few_shot_texts]},
    'author': {'author_name': AUTHOR_NAME},
    'instructions': {'style_instructions': style_instructions}
}

print(f"✓ Store initialized at {store.filepath}")
print(f"✓ Configuration: {n_runs} runs per sample, 4 reconstruction methods")

✓ Store initialized at style_evaluation_results_sonnet_sonnet.db
✓ Configuration: 3 runs per sample, 4 reconstruction methods


## Evaluation Pipeline

### Step 1: Content Flattening

Extract content from each test sample, removing stylistic elements:

In [78]:
# Step 1: Save samples and flatten content
print("=== Step 1: Flattening and Saving Samples ===\n")

for k_text, test_sample in enumerate(test_texts):
    sample_id = f"sample_{k_text:03d}"
    
    # Skip if already saved
    if store.get_sample(sample_id):
        print(f"✓ {sample_id} already flattened (skipping)")
        continue
    
    print(f"Flattening {sample_id}...", end=" ")
    
    # Flatten content
    flatten_prompt = prompt_maker.render(
        StyleFlatteningConfig(text=test_sample.text)
    )
    flattened = reconstruction_llm.complete(flatten_prompt)
    
    # Save to store with provenance
    source_info = f"File {test_sample.file_index}, para {test_sample.paragraph_start}-{test_sample.paragraph_end}"
    store.save_sample(
        sample_id=sample_id,
        original_text=test_sample.text,
        flattened_content=flattened.content,
        flattening_model=flattened.model,
        source_info=source_info
    )
    
    print(f"✓ ({len(flattened.content)} chars)")

print(f"\n✓ All samples flattened and saved to store")

=== Step 1: Flattening and Saving Samples ===

Flattening sample_000... ✓ (4272 chars)
Flattening sample_001... ✓ (4258 chars)
Flattening sample_002... ✓ (3357 chars)
Flattening sample_003... ✓ (5134 chars)
Flattening sample_004... ✓ (2925 chars)

✓ All samples flattened and saved to store


In [79]:
print('====ORIGINAL===')
print(store.get_sample('sample_000')['original_text'])
print('\n\n====FLATTENED====')
print(store.get_sample('sample_000')['flattened_content'])

====ORIGINAL===
In regard to mines, development by the Chinese themselves is urgent,
since undeveloped resources tempt the greed of the Great Powers, and
development by foreigners makes it possible to keep China enslaved. It
should therefore be enacted that, in future, no sale of mines or of any
interest in mines to foreigners, and no loan from foreigners on the
security of mines, will be recognized as legally valid. In view of
extra-territoriality, it will be difficult to induce foreigners to
accept such legislation, and Consular Courts will not readily admit its
validity. But, as the example of extra-territoriality in Japan shows,
such matters depend upon the national strength; if the Powers fear
China, they will recognize the validity of Chinese legislation, but if
not, not. In view of the need of rapid development of mining by Chinese,
it would probably be unwise to nationalize all mines here and now. It
would be better to provide every possible encouragement to genuinely
Chinese p

### Step 2: Reconstruction

Generate reconstructions using all 4 methods, with `n_runs` stochastic runs each:

In [None]:
# Step 2: Generate reconstructions (with crash resume)
print("=== Step 2: Generating Reconstructions ===\n")

for sample_id in store.list_samples():
    sample = store.get_sample(sample_id)
    print(f"\n{sample_id}:")
    
    for run in range(n_runs):
        print(f"  Run {run}:")
        
        # Check which methods need reconstruction
        for method in ['generic', 'fewshot', 'author', 'instructions']:
            if store.has_reconstruction(sample_id, run, method):
                print(f"    ✓ {method:12s} (already done)")
                continue
            
            # Generate reconstruction
            config = RECONSTRUCTORS_CFGS[method](
                content_summary=sample['flattened_content'],
                **RECONSTRUCTORS_KWARGS[method]
            )
            prompt = prompt_maker.render(config)
            response = reconstruction_llm.complete(prompt)
            
            # Save immediately (crash resilient!)
            store.save_reconstruction(
                sample_id=sample_id,
                run=run,
                method=method,
                reconstructed_text=response.content,
                model=response.model
            )
            print(f"    ✓ {method:12s} ({len(response.content)} chars)")

stats = store.get_stats()
print(f"\n✓ Generated {stats['n_reconstructions']} total reconstructions")

=== Step 2: Generating Reconstructions ===


sample_000:
  Run 0:
    ✓ generic      (5590 chars)
    ✓ fewshot      (5507 chars)
    ✓ author       (6374 chars)
    ✓ instructions (9295 chars)
  Run 1:
    ✓ generic      (6510 chars)
    ✓ fewshot      (5156 chars)
    ✓ author       (5315 chars)
    ✓ instructions (8600 chars)
  Run 2:
    ✓ generic      (7144 chars)
    ✓ fewshot      (4860 chars)
    ✓ author       (5734 chars)
    ✓ instructions (4648 chars)

sample_001:
  Run 0:
    ✓ generic      (5889 chars)
    ✓ fewshot      (4715 chars)
    ✓ author       (5226 chars)


In [None]:
reconstructions = store.get_reconstructions('sample_001', 0)
for reconstructor in reconstructions.keys():
    print(f"{reconstructor.upper()}\n===================")
    print(f"\n{reconstructions.get(reconstructor)}\n\n")

### Step 3: Judging

For each sample and run, the judge LLM ranks the four reconstructions together by their similarity to the original:

In [31]:
import zlib

from belletrist.models import StyleJudgeComparativeConfig, StyleJudgmentComparative
# Step 3: Comparative blind judging (with crash resume)
print("=== Step 3: Comparative Blind Judging ===\n")

for sample_id in store.list_samples():
    sample = store.get_sample(sample_id)
    print(f"\n{sample_id}:")
    
    for run in range(n_runs):
        if store.has_judgment(sample_id, run):
            print(f"  Run {run}: ✓ Already judged (skipping)")
            continue
        
        print(f"  Run {run}: Judging...", end=" ")
        
        # Get all 4 reconstructions
        reconstructions = store.get_reconstructions(sample_id, run)
        if len(reconstructions) != 4:
            print(f"✗ Missing reconstructions (found {len(reconstructions)}/4)")
            continue
        
        # Create randomized mapping for blind evaluation
        # Seed from a CRC32 of (sample, run): unlike the built-in hash() of a str,
        # which is salted per interpreter process, this is stable across sessions
        seed = zlib.crc32(f"{sample_id}_{run}".encode("utf-8"))
        mapping = store.create_random_mapping(seed=seed)
        
        # Build prompt with anonymous labels (BLIND EVALUATION)
        judge_config = StyleJudgeComparativeConfig(
            original_text=sample['original_text'],
            reconstruction_text_a=reconstructions[mapping.text_a],
            reconstruction_text_b=reconstructions[mapping.text_b],
            reconstruction_text_c=reconstructions[mapping.text_c],
            reconstruction_text_d=reconstructions[mapping.text_d]
        )
        judge_prompt = prompt_maker.render(judge_config)
        
        # Get structured JSON judgment
        try:
            response = judge_llm.complete_json(judge_prompt)
            judgment_data = json.loads(response.content)
            judgment = StyleJudgmentComparative(**judgment_data)
            
            # Save judgment with mapping (crash resilient!)
            store.save_judgment(sample_id, run, judgment, mapping, response.model)
            print(f"✓ (confidence: {judgment.confidence})")
            
        except Exception as e:
            print(f"✗ Error: {e}")

stats = store.get_stats()
print(f"\n✓ Completed {stats['n_judgments']} judgments")

=== Step 3: Comparative Blind Judging ===


sample_000:
  Run 0: Judging... ✓ (confidence: high)
  Run 1: Judging... ✓ (confidence: high)
  Run 2: Judging... ✓ (confidence: high)

sample_001:
  Run 0: Judging... ✓ (confidence: high)
  Run 1: Judging... ✓ (confidence: high)
  Run 2: Judging... ✓ (confidence: high)

sample_002:
  Run 0: Judging... ✓ (confidence: high)
  Run 1: Judging... ✓ (confidence: high)
  Run 2: Judging... ✓ (confidence: high)

sample_003:
  Run 0: Judging... ✓ (confidence: high)
  Run 1: Judging... ✓ (confidence: high)
  Run 2: Judging... ✓ (confidence: high)

sample_004:
  Run 0: Judging... ✓ (confidence: high)
  Run 1: Judging... ✓ (confidence: high)
  Run 2: Judging... ✓ (confidence: high)

✓ Completed 15 judgments
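
For the label mapping to be reproducible across notebook sessions, its seed has to depend only on the sample and run identifiers. Python's built-in `hash()` of a string is salted per interpreter process unless `PYTHONHASHSEED` is fixed, so a checksum-based seed is one option. The sketch below uses `zlib.crc32`; `stable_seed` and `shuffled_methods` are hypothetical helpers for illustration, not part of `belletrist`.

```python
import random
import zlib


def stable_seed(sample_id: str, run: int) -> int:
    """Derive a 32-bit seed that is identical in every Python process."""
    return zlib.crc32(f"{sample_id}_{run}".encode("utf-8"))


def shuffled_methods(seed: int) -> list[str]:
    """Return the four methods in a seeded random order."""
    methods = ["generic", "fewshot", "author", "instructions"]
    random.Random(seed).shuffle(methods)
    return methods


seed = stable_seed("sample_000", 0)
print(seed, shuffled_methods(seed))
```

Running this in two separate interpreter sessions prints the same seed and the same order.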


## Results Analysis

In [36]:
# Export results to DataFrame
print("=== Exporting Results ===\n")

# Export from store (resolves anonymous rankings to methods)
df = store.to_dataframe()

print(f"Total judgments: {len(df)}")
print(f"Samples: {df['sample_id'].nunique()}")
print(f"Runs per sample: {df.groupby('sample_id')['run'].nunique().mean():.1f}")

# Show first few rows
print(f"\n=== Sample Results ===\n")
display_cols = ['sample_id', 'run', 'ranking_generic', 'ranking_fewshot', 
                'ranking_author', 'ranking_instructions', 'confidence']
print(df[display_cols].head(15))

# Export to CSV
output_file = f"style_eval_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
df.to_csv(output_file, index=False)
print(f"\n✓ Results saved to {output_file}")

=== Exporting Results ===

Total judgments: 15
Samples: 5
Runs per sample: 3.0

=== Sample Results ===

     sample_id  run  ranking_generic  ranking_fewshot  ranking_author  \
0   sample_000    0                3                2               1   
1   sample_000    1                4                3               2   
2   sample_000    2                4                3               2   
3   sample_001    0                3                1               4   
4   sample_001    1                4                1               2   
5   sample_001    2                2                3               4   
6   sample_002    0                3                2               4   
7   sample_002    1                4                3               2   
8   sample_002    2                3                1               2   
9   sample_003    0                3                2               1   
10  sample_003    1                4                1               2   
11  sample_003    2 

In [37]:
print(df.loc[0,'reasoning'])

Let me work through each text systematically, focusing on the distinctive voice of the original.

**The Original's Voice:**
The original has a particular philosophical tone that balances formality with accessibility. Key features include:
- Direct, matter-of-fact statements ("The inductive principle, however, is equally incapable...")
- Concrete, vivid examples woven naturally into abstract arguments ("to expect bread to be more nourishing than a stone")
- A conversational relationship with the reader, using "we" inclusively
- Measured pacing with neither excessive elaboration nor stark minimalism
- Phrases like "begging the question," "intrinsic evidence," "forgo all justification"
- A distinctive rhythm that moves from principle to example to implication
- The memorable image of the friend's body inhabited by an enemy's mind
- Chapter transitions that are straightforward and organizational

**TEXT A Analysis:**
This text feels like an educational expansion or modernization. The voice

In [38]:
# Analyze ranking distributions
print("=== Ranking Distribution by Method ===\n")

for method in ['generic', 'fewshot', 'author', 'instructions']:
    col = f'ranking_{method}'
    print(f"\n{method.upper()}:")
    ranking_counts = df[col].value_counts().sort_index()
    for rank in [1, 2, 3, 4]:
        count = ranking_counts.get(rank, 0)
        pct = (count / len(df) * 100) if len(df) > 0 else 0
        print(f"  Rank {rank}: {count:3d} ({pct:5.1f}%)")

print("\n=== Confidence Distribution ===\n")
print(df['confidence'].value_counts())

=== Ranking Distribution by Method ===


GENERIC:
  Rank 1:   1 (  6.7%)
  Rank 2:   3 ( 20.0%)
  Rank 3:   6 ( 40.0%)
  Rank 4:   5 ( 33.3%)

FEWSHOT:
  Rank 1:   6 ( 40.0%)
  Rank 2:   4 ( 26.7%)
  Rank 3:   5 ( 33.3%)
  Rank 4:   0 (  0.0%)

AUTHOR:
  Rank 1:   3 ( 20.0%)
  Rank 2:   6 ( 40.0%)
  Rank 3:   0 (  0.0%)
  Rank 4:   6 ( 40.0%)

INSTRUCTIONS:
  Rank 1:   5 ( 33.3%)
  Rank 2:   2 ( 13.3%)
  Rank 3:   4 ( 26.7%)
  Rank 4:   4 ( 26.7%)

=== Confidence Distribution ===

confidence
high    15
Name: count, dtype: int64


In [39]:
# Calculate method performance metrics
print("=== Method Performance Metrics ===\n")

# Calculate mean ranking for each method (lower is better: 1 = best, 4 = worst)
mean_rankings = {}
for method in ['generic', 'fewshot', 'author', 'instructions']:
    col = f'ranking_{method}'
    mean_rankings[method] = df[col].mean()

# Sort by mean ranking (best first)
sorted_methods = sorted(mean_rankings.items(), key=lambda x: x[1])

print("Average Ranking (lower is better):")
for i, (method, mean_rank) in enumerate(sorted_methods, 1):
    # Count how often this method ranked 1st
    first_place = (df[f'ranking_{method}'] == 1).sum()
    first_place_pct = (first_place / len(df) * 100) if len(df) > 0 else 0
    
    print(f"{i}. {method:12s}: {mean_rank:.2f} (1st place: {first_place}/{len(df)} = {first_place_pct:.1f}%)")

# Top-2 rate (percentage of times ranked 1st or 2nd)
print("\nTop-2 Rate (ranked 1st or 2nd):")
for method in ['generic', 'fewshot', 'author', 'instructions']:
    col = f'ranking_{method}'
    top2 = ((df[col] == 1) | (df[col] == 2)).sum()
    top2_pct = (top2 / len(df) * 100) if len(df) > 0 else 0
    print(f"  {method:12s}: {top2}/{len(df)} = {top2_pct:.1f}%")

=== Method Performance Metrics ===

Average Ranking (lower is better):
1. fewshot     : 1.93 (1st place: 6/15 = 40.0%)
2. instructions: 2.47 (1st place: 5/15 = 33.3%)
3. author      : 2.60 (1st place: 3/15 = 20.0%)
4. generic     : 3.00 (1st place: 1/15 = 6.7%)

Top-2 Rate (ranked 1st or 2nd):
  generic     : 4/15 = 26.7%
  fewshot     : 10/15 = 66.7%
  author      : 9/15 = 60.0%
  instructions: 7/15 = 46.7%
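
As a cross-check, the mean rankings and top-2 rates can be recomputed by hand from the rank counts printed in the "Ranking Distribution by Method" output. The counts below are copied from that output; nothing else is assumed.

```python
# Rank counts per method, copied from the ranking distribution output.
rank_counts = {
    "generic":      {1: 1, 2: 3, 3: 6, 4: 5},
    "fewshot":      {1: 6, 2: 4, 3: 5, 4: 0},
    "author":       {1: 3, 2: 6, 3: 0, 4: 6},
    "instructions": {1: 5, 2: 2, 3: 4, 4: 4},
}
n_judgments = 15

mean_ranks = {}
top2_counts = {}
for method, counts in rank_counts.items():
    assert sum(counts.values()) == n_judgments
    # Mean rank is the count-weighted average of the rank values.
    mean_ranks[method] = sum(rank * c for rank, c in counts.items()) / n_judgments
    top2_counts[method] = counts[1] + counts[2]
    print(f"{method:12s} mean={mean_ranks[method]:.2f} top2={top2_counts[method]}/{n_judgments}")

# Every rank is handed out exactly once per judgment.
for rank in (1, 2, 3, 4):
    assert sum(c[rank] for c in rank_counts.values()) == n_judgments
```

The two metrics order the methods differently: `instructions` is second by mean ranking but third by top-2 rate, because it is ranked 3rd or 4th more often than `author` while also taking first place more often.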


In [40]:
# Export results
output_file = f"style_evaluation_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
df.to_csv(output_file, index=False)
print(f"\n✓ Results saved to {output_file}")


✓ Results saved to style_evaluation_results_20251127_092147.csv


## Next Steps

TODO: Add analysis cells for:
- Statistical significance testing
- Visualization of results
- Sample-by-sample breakdown
- Confidence level analysis
- Qualitative review of reasoning
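
As a possible starting point for the significance-testing item, the sketch below computes a Friedman chi-square statistic from the rank sums implied by the ranking distribution above. It is only indicative: the test treats the 15 judgments as independent blocks, although the three runs of each sample share the same original text, and the chi-square approximation is rough for so few judgments. The closed-form survival function is specific to 3 degrees of freedom and is used here only to avoid an extra dependency.

```python
import math

# Rank sums per method over the 15 judgments, derived from the rank counts
# in the ranking distribution output (e.g. fewshot: 6*1 + 4*2 + 5*3 = 29).
rank_sums = {"generic": 45, "fewshot": 29, "author": 39, "instructions": 37}
n = 15  # judgments, treated as blocks (an assumption, see the note above)
k = 4   # methods ranked in each judgment

# Friedman chi-square statistic for untied ranks.
statistic = (
    12 / (n * k * (k + 1)) * sum(r ** 2 for r in rank_sums.values())
    - 3 * n * (k + 1)
)


def chi2_sf_df3(x: float) -> float:
    """Survival function of the chi-square distribution with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


p_value = chi2_sf_df3(statistic)
print(f"Friedman chi-square = {statistic:.2f}, approximate p = {p_value:.3f}")
```

A more careful analysis would probably average or otherwise pool the runs within each sample before testing, or use a model that accounts for the repeated runs.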