# Style Instruction Evaluation Framework

This notebook evaluates how well derived style instructions enable style replication, compared to alternative approaches.

## Methodology

For each gold standard text:
1. **Flatten**: Extract content while removing style
2. **Reconstruct**: Generate text using 4 different methods:
   - Generic baseline
   - Few-shot learning
   - Author name prompting
   - Derived style instructions
3. **Judge**: Compare each reconstruction against the original
4. **Aggregate**: Analyze results across multiple texts and runs

## Setup and Configuration

In [None]:
import os
import json
from pathlib import Path
import pandas as pd
from datetime import datetime

### Initialize Base Objects
The base objects from the project library (`belletrist`) are initialized here. They are:
* `LLM`: the LLM object
* `LLMConfig`: the configuration of the LLM object, such as which model to use
* `PromptMaker`: generates prompts from templates and variables
* `DataSampler`: retrieves and samples text from a source directory

These objects carry out the LLM-driven text transformations of the evaluation process. They build on third-party LLMs, which are split into one model for text reconstruction and one for text judging; the key parameters for both are set below.

In [None]:
model_reconstruction_string = 'together_ai/Qwen/Qwen3-235B-A22B-Instruct-2507-tput'
model_reconstruction_api_key_env_var = 'TOGETHER_AI_API_KEY'
model_judge_string = 'anthropic/claude-sonnet-4-5-20250929'
model_judge_api_key_env_var = 'ANTHROPIC_API_KEY'

In [None]:
from belletrist import PromptMaker, DataSampler

prompt_maker = PromptMaker()
sampler = DataSampler(
    data_path=(Path(os.getcwd()) / "data" / "russell").resolve()
)

In [None]:
from belletrist import LLM, LLMConfig

reconstruction_llm = LLM(LLMConfig(
    model=model_reconstruction_string,
    api_key=os.environ.get(model_reconstruction_api_key_env_var)
))
judge_llm = LLM(LLMConfig(
    model=model_judge_string,
    api_key=os.environ.get(model_judge_api_key_env_var)
))
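
Because `os.environ.get` returns `None` when a variable is missing, a forgotten API key only surfaces at the first LLM call. A minimal sketch of an early check follows; `require_env` is a hypothetical helper, not part of `belletrist`, and `DEMO_API_KEY` is a made-up variable name.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail loudly if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Hypothetical usage with a made-up variable name
os.environ["DEMO_API_KEY"] = "abc123"
print(require_env("DEMO_API_KEY"))
```

In this notebook the helper could wrap the two `api_key=` lookups, so that a missing key stops the run before any sampling or prompting happens.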

### Sample Test Data and Few-Shot Data
The reconstruction tests build on gold standard texts, and the few-shot method draws its examples from the same corpus. To avoid skewing the tests, no few-shot example may overlap with a test text.

In [None]:
n_sample = 10
m_paragraphs_per_sample = 5
n_few_shot_sample = 3

test_texts = []
for _ in range(n_sample):
    test_texts.append(sampler.sample_segment(p_length=m_paragraphs_per_sample))

few_shot_texts = []
while len(few_shot_texts) < n_few_shot_sample:
    p = sampler.sample_segment(p_length=m_paragraphs_per_sample)

    # Check if p overlaps with any test text
    # Two segments overlap if they're from the same file AND their paragraph ranges overlap
    # Ranges [a, b) and [c, d) overlap if: a < d AND c < b
    has_overlap = any(
        p.file_index == test_seg.file_index and
        p.paragraph_start < test_seg.paragraph_end and
        test_seg.paragraph_start < p.paragraph_end
        for test_seg in test_texts
    )

    if not has_overlap:
        few_shot_texts.append(p)
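
The overlap rule in the comments above can be checked in isolation. The sketch below uses `ToySegment`, a hypothetical stand-in for whatever `sample_segment` returns, and assumes `paragraph_end` is exclusive, as the `[a, b)` notation suggests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySegment:
    """Hypothetical stand-in for a sampled segment; only the fields used in the check."""
    file_index: int
    paragraph_start: int  # inclusive
    paragraph_end: int    # exclusive (assumed)


def segments_overlap(a: ToySegment, b: ToySegment) -> bool:
    """Half-open ranges [a0, a1) and [b0, b1) in the same file overlap iff a0 < b1 and b0 < a1."""
    return (
        a.file_index == b.file_index
        and a.paragraph_start < b.paragraph_end
        and b.paragraph_start < a.paragraph_end
    )


test_seg = ToySegment(file_index=0, paragraph_start=10, paragraph_end=15)

print(segments_overlap(ToySegment(0, 14, 19), test_seg))  # shares paragraph 14
print(segments_overlap(ToySegment(0, 15, 20), test_seg))  # adjacent, no shared paragraph
print(segments_overlap(ToySegment(1, 10, 15), test_seg))  # same range, different file
```

Adjacent segments count as non-overlapping under the half-open convention, so a few-shot example may start at the paragraph where a test text ends.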

## Create the Test Transformation Objects
Combining prompts with an LLM yields the following operators in the test chain:
* **Style Flattener**, which compresses a given text into its bare-bones content.
* **Reconstructor, LLM House Style**, which expands compressed content into a complete text in the "house style" of the LLM used for the reconstruction.
* **Reconstructor, Few Shot**, which expands compressed content into a complete text, using a few text excerpts on unrelated topics as a style guide.
* **Reconstructor, LLM Author Model**, which expands compressed content into a complete text in the named author's style as the LLM conceives it, without any other guidance.
* **Reconstructor, Style Instruction**, which expands compressed content into a complete text following the detailed style instructions derived from previous analysis.

In [None]:
style_instructions_path = Path("outputs/derived_style_instructions_1125_000.txt")

if style_instructions_path.exists():
    style_instructions = style_instructions_path.read_text()
    print(f"✓ Loaded style instructions ({len(style_instructions)} chars)")
    print(f"\nFirst 200 chars:\n{style_instructions[:200]}...")
else:
    print("⚠ Style instructions file not found. Please provide path.")
    style_instructions = ""  # Placeholder

In [None]:
from belletrist.models import (
    StyleFlatteningConfig,
    StyleReconstructionGenericConfig,
    StyleReconstructionFewShotConfig,
    StyleReconstructionAuthorConfig,
    StyleReconstructionInstructionsConfig
)
n_runs = 3
RECONSTRUCTORS = [
    'generic',
    'fewshot',
    'author',
    'instructions'
]
RECONSTRUCTORS_CFGS = {
    'generic': StyleReconstructionGenericConfig,
    'fewshot': StyleReconstructionFewShotConfig,
    'author': StyleReconstructionAuthorConfig,
    'instructions': StyleReconstructionInstructionsConfig
}
RECONSTRUCTORS_KWARGS = {
    'generic': {},
    'fewshot': {'few_shot_examples': [seg.text for seg in few_shot_texts]},
    'author': {'author_name': 'Bertrand Russell'},
    'instructions': {'style_instructions': style_instructions}
}

reconstructions = {}
for k_text, test_sample in enumerate(test_texts):
    flatten_prompt = prompt_maker.render(
        StyleFlatteningConfig(text=test_sample.text)
    )
    flattened_content = reconstruction_llm.complete(flatten_prompt)
    for k_run in range(n_runs):
        for reconstructor in RECONSTRUCTORS:
            config = RECONSTRUCTORS_CFGS[reconstructor](
                content_summary=flattened_content.content,
                **RECONSTRUCTORS_KWARGS[reconstructor]
            )
            reconstruct_prompt = prompt_maker.render(config)
            reconstructed_text = reconstruction_llm.complete(reconstruct_prompt)

            reconstructions[(k_text, k_run, reconstructor)] = reconstructed_text.content
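
The loop above stores results in a dict keyed by `(k_text, k_run, reconstructor)`. Downstream steps that iterate over rows, or that load results into pandas, may find a list of records more convenient. A minimal sketch of that conversion follows; `to_records` and the record field names are illustrative assumptions, and the toy dict stands in for real LLM output.

```python
def to_records(reconstructions_by_key: dict) -> list[dict]:
    """Turn {(text_index, run, method): text} into a list of flat records, sorted by key."""
    return [
        {"text_index": k_text, "run": k_run, "method": method, "text": text}
        for (k_text, k_run, method), text in sorted(reconstructions_by_key.items())
    ]


# Toy stand-in for the dict built by the loop above
toy = {
    (0, 1, "generic"): "text B",
    (0, 0, "instructions"): "text A2",
    (0, 0, "generic"): "text A1",
}

for record in to_records(toy):
    print(record)
```

Sorting by the tuple key gives a stable order (text, then run, then method name) regardless of the order in which the reconstructions were generated.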

## Evaluation Pipeline

### Step 1: Content Flattening

Extract content from each test sample, removing stylistic elements:

In [None]:
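# Build per-sample records from the segments sampled earlier (ids are positional)
test_samples = [
    {'id': f"sample_{k}", 'text': seg.text}
    for k, seg in enumerate(test_texts)
]

# Flatten all test samples
for sample in test_samples: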
    print(f"Flattening {sample['id']}...", end=" ")
    
    flattening_config = StyleFlatteningConfig(text=sample['text'])
    flattening_prompt = prompt_maker.render(flattening_config)
    
    response = reconstruction_llm.complete(flattening_prompt)
    sample['content_summary'] = response.content
    
    print(f"✓ ({len(response.content)} chars)")

print(f"\n✓ All samples flattened")

### Step 2: Reconstruction

Generate reconstructions using all 4 methods, with M stochastic runs each:

In [None]:
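# Names used below, tied to the values defined in earlier cells
M_RUNS = n_runs
AUTHOR_NAME = 'Bertrand Russell'
fewshot_examples = [seg.text for seg in few_shot_texts]

# Store reconstructions as a list of records
# (this rebinds `reconstructions`, which the compact loop earlier built as a dict)
reconstructions = []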

for sample in test_samples:
    print(f"\nProcessing {sample['id']}:")
    
    for run in range(M_RUNS):
        print(f"  Run {run+1}/{M_RUNS}:")
        
        content_summary = sample['content_summary']
        
        # Method 1: Generic baseline
        print("    - Generic...", end=" ")
        config = StyleReconstructionGenericConfig(content_summary=content_summary)
        prompt = prompt_maker.render(config)
        response = reconstruction_llm.complete(prompt)
        reconstructions.append({
            'sample_id': sample['id'],
            'run': run,
            'method': 'generic',
            'text': response.content
        })
        print("✓")
        
        # Method 2: Few-shot
        print("    - Few-shot...", end=" ")
        config = StyleReconstructionFewShotConfig(
            content_summary=content_summary,
            few_shot_examples=fewshot_examples
        )
        prompt = prompt_maker.render(config)
        response = reconstruction_llm.complete(prompt)
        reconstructions.append({
            'sample_id': sample['id'],
            'run': run,
            'method': 'fewshot',
            'text': response.content
        })
        print("✓")
        
        # Method 3: Author name
        print("    - Author name...", end=" ")
        config = StyleReconstructionAuthorConfig(
            content_summary=content_summary,
            author_name=AUTHOR_NAME
        )
        prompt = prompt_maker.render(config)
        response = reconstruction_llm.complete(prompt)
        reconstructions.append({
            'sample_id': sample['id'],
            'run': run,
            'method': 'author',
            'text': response.content
        })
        print("✓")
        
        # Method 4: Derived instructions
        print("    - Instructions...", end=" ")
        config = StyleReconstructionInstructionsConfig(
            content_summary=content_summary,
            style_instructions=style_instructions
        )
        prompt = prompt_maker.render(config)
        response = reconstruction_llm.complete(prompt)
        reconstructions.append({
            'sample_id': sample['id'],
            'run': run,
            'method': 'instructions',
            'text': response.content
        })
        print("✓")

print(f"\n✓ Generated {len(reconstructions)} total reconstructions")

### Step 3: Judging

Compare each reconstruction against the original using the judge LLM:

In [None]:
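# Judge config and result model; assumed to be exported from belletrist.models
# like the configs imported earlier
from belletrist.models import StyleJudgeConfig, StyleJudgment

# Store judgments
judgments = []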

for reconstruction in reconstructions:
    # Find corresponding original sample
    sample = next(s for s in test_samples if s['id'] == reconstruction['sample_id'])
    
    print(f"Judging {reconstruction['sample_id']}, run {reconstruction['run']}, method {reconstruction['method']}...", end=" ")
    
    # Build judge prompt
    judge_config = StyleJudgeConfig(
        original_text=sample['text'],
        reconstruction_text=reconstruction['text']
    )
    judge_prompt = prompt_maker.render(judge_config)
    
    # Get structured JSON judgment
    response = judge_llm.complete_json(judge_prompt)
    
    try:
        # Parse and validate JSON
        judgment_data = json.loads(response.content)
        judgment = StyleJudgment(**judgment_data)
        
        # Store judgment
        judgments.append({
            'sample_id': reconstruction['sample_id'],
            'run': reconstruction['run'],
            'method': reconstruction['method'],
            'ranking': judgment.ranking,
            'confidence': judgment.confidence,
            'reasoning': judgment.reasoning,
            'timestamp': datetime.now().isoformat()
        })
        print(f"✓ {judgment.ranking}")
        
    except Exception as e:
        print(f"✗ Error: {e}")
        judgments.append({
            'sample_id': reconstruction['sample_id'],
            'run': reconstruction['run'],
            'method': reconstruction['method'],
            'ranking': 'error',
            'confidence': 'error',
            'reasoning': str(e),
            'timestamp': datetime.now().isoformat()
        })

print(f"\n✓ Completed {len(judgments)} judgments")
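
The parse-and-validate step can be tried without calling a judge model. The sketch below uses `ToyJudgment`, a hypothetical dataclass with the three fields read in the loop above; the real `StyleJudgment` may validate more strictly, and the example field values are assumptions.

```python
import json
from dataclasses import dataclass


@dataclass
class ToyJudgment:
    """Hypothetical stand-in for StyleJudgment with the three fields used above."""
    ranking: str
    confidence: str
    reasoning: str


def parse_judgment(raw: str) -> dict:
    """Parse a raw JSON string; on malformed or incomplete input return an 'error' row."""
    try:
        judgment = ToyJudgment(**json.loads(raw))
    except (json.JSONDecodeError, TypeError) as exc:
        # Invalid JSON, a non-object payload, or missing/unexpected fields
        return {"ranking": "error", "confidence": "error", "reasoning": str(exc)}
    return {
        "ranking": judgment.ranking,
        "confidence": judgment.confidence,
        "reasoning": judgment.reasoning,
    }


good = '{"ranking": "reconstruction_better", "confidence": "high", "reasoning": "closer cadence"}'
print(parse_judgment(good))
print(parse_judgment("not json at all"))
print(parse_judgment('{"ranking": "reconstruction_better"}'))
```

Returning an `error` row instead of raising keeps one bad response from aborting the whole run, which matches how the results analysis later filters on `ranking != 'error'`.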

## Results Analysis

In [None]:
# Convert to DataFrame for analysis
df = pd.DataFrame(judgments)

# Remove error rows
df_clean = df[df['ranking'] != 'error']

print(f"Total judgments: {len(df)}")
print(f"Valid judgments: {len(df_clean)}")
print(f"Errors: {len(df) - len(df_clean)}")

# Show summary
df_clean.head(10)

In [None]:
# Aggregate results by method
print("\n=== Results by Method ===")
print("\nRanking distribution:")
print(df_clean.groupby(['method', 'ranking']).size().unstack(fill_value=0))

print("\nConfidence distribution:")
print(df_clean.groupby(['method', 'confidence']).size().unstack(fill_value=0))
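
To see what the `groupby(...).size().unstack(fill_value=0)` pattern produces, here is a small sketch on made-up rows. The label `original_better` is an assumed counterpart to `reconstruction_better`, and the toy data is illustrative only, not a result of this evaluation.

```python
import pandas as pd

# Made-up judgments, for illustration only
toy = pd.DataFrame({
    "method": ["generic", "generic", "instructions", "instructions"],
    "ranking": [
        "original_better",
        "original_better",
        "reconstruction_better",
        "original_better",
    ],
})

# One row per method, one column per ranking label; absent combinations become 0
counts = toy.groupby(["method", "ranking"]).size().unstack(fill_value=0)
print(counts)

# Row-normalize to get each method's share per ranking label
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares)
```

`fill_value=0` matters here: without it, a method that never received a given ranking would show `NaN` in that column rather than a count of zero.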

In [None]:
# Calculate success rate: share of judgments per method ranked 'reconstruction_better'.
# Taking the mean of a boolean mask keeps methods with zero successes at 0.0 instead of NaN.
success_rates = (
    df_clean['ranking'].eq('reconstruction_better')
    .groupby(df_clean['method'])
    .mean()
)

print("\n=== Success Rate by Method ===")
print("(Fraction of judgments where the reconstruction was judged better than the original)\n")
print(success_rates.sort_values(ascending=False))

In [None]:
# Export results
output_file = f"style_evaluation_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
df.to_csv(output_file, index=False)
print(f"\n✓ Results saved to {output_file}")

## Next Steps

TODO: Add analysis cells for:
- Statistical significance testing
- Visualization of results
- Sample-by-sample breakdown
- Confidence level analysis
- Qualitative review of reasoning