# Style Instruction Evaluation Framework

This notebook evaluates how well derived style instructions enable style replication compared with three alternative prompting approaches.

## Methodology

For each gold standard text:
1. **Flatten**: Extract content while removing style
2. **Reconstruct**: Generate text using 4 different methods:
   - Generic baseline
   - Few-shot learning
   - Author name prompting
   - Derived style instructions
3. **Judge**: Compare each reconstruction against the original
4. **Aggregate**: Analyze results across multiple texts and runs
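
The design above amounts to a full grid over samples, runs, and methods. The sketch below enumerates that grid to estimate how many LLM calls one evaluation needs. It assumes the default values from the configuration cell (5 samples, 3 runs), the four method labels recorded during reconstruction, and exactly one call per flattening, reconstruction, and judgment with no retries.

```python
from itertools import product

# Assumed values, mirroring the notebook's default parameters.
N_SAMPLES = 5
M_RUNS = 3
METHODS = ["generic", "fewshot", "author", "instructions"]

# One reconstruction (and one judgment) per (sample, run, method) triple.
design = list(product(range(N_SAMPLES), range(M_RUNS), METHODS))

flatten_calls = N_SAMPLES           # each sample is flattened once
reconstruction_calls = len(design)  # one generation per grid cell
judge_calls = len(design)           # one judgment per reconstruction
total_calls = flatten_calls + reconstruction_calls + judge_calls

print(f"{reconstruction_calls} reconstructions, {total_calls} LLM calls in total")
# prints "60 reconstructions, 125 LLM calls in total"
```

Scaling `N_SAMPLES` or `M_RUNS` multiplies the reconstruction and judging cost, so it may be worth checking this number before a large run.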

## Setup and Configuration

In [None]:
import os
import json
from pathlib import Path
import pandas as pd
from datetime import datetime

from belletrist import (
    LLM, LLMConfig, PromptMaker, DataSampler,
    StyleFlatteningConfig,
    StyleReconstructionGenericConfig,
    StyleReconstructionFewShotConfig,
    StyleReconstructionAuthorConfig,
    StyleReconstructionInstructionsConfig,
    StyleJudgeConfig,
    StyleJudgment
)

### Experimental Parameters

Configure the evaluation settings here:

In [None]:
# Evaluation parameters
N_SAMPLES = 5              # Number of gold standard texts to test
M_RUNS = 3                 # Stochastic runs per sample
AUTHOR_NAME = "Bertrand Russell"
N_FEWSHOT_EXAMPLES = 3

# Model configuration
# TODO: Adjust models and API keys for your setup
RECONSTRUCTION_MODEL = "mistral/mistral-large-2411"  # For generating reconstructions
JUDGE_MODEL = "openai/gpt-4o"                        # For judging reconstructions
RECONSTRUCTION_TEMPERATURE = 0.8                     # Higher for variety across runs
JUDGE_TEMPERATURE = 0.3                              # Lower for consistent judgments

# API keys (set in environment)
RECONSTRUCTION_API_KEY = os.environ.get('MISTRAL_API_KEY')
JUDGE_API_KEY = os.environ.get('OPENAI_API_KEY')

# Sample selection (TODO: Specify text indices manually)
# Split your corpus into:
# - INSTRUCTION_SAMPLES: Used to derive style instructions (already done)
# - TEST_SAMPLES: Used for flattening and reconstruction (this evaluation)
# - FEWSHOT_SAMPLES: Used as few-shot examples (different from test samples)
TEST_SAMPLE_INDICES = [10, 15, 20, 25, 30]  # Example indices
FEWSHOT_SAMPLE_INDICES = [5, 12, 18]        # Example indices
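
Because the few-shot examples must differ from the test samples, a quick overlap check can catch accidental leakage before any LLM call is made. This is a minimal sketch: `find_overlap` is a hypothetical helper, not part of `belletrist`, and the index lists repeat the example values above.

```python
# Example index lists, copied from the configuration cell.
TEST_SAMPLE_INDICES = [10, 15, 20, 25, 30]
FEWSHOT_SAMPLE_INDICES = [5, 12, 18]


def find_overlap(test_indices, fewshot_indices):
    """Return the sorted indices that appear in both lists."""
    return sorted(set(test_indices) & set(fewshot_indices))


overlap = find_overlap(TEST_SAMPLE_INDICES, FEWSHOT_SAMPLE_INDICES)
if overlap:
    # A shared index would let the few-shot prompt see a test text.
    raise ValueError(f"Indices used for both testing and few-shot: {overlap}")
print("No overlap between test and few-shot indices")
```

The same check could be extended to the indices used to derive the style instructions, if those are recorded.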

### Load Artifacts

In [None]:
# Initialize components
prompt_maker = PromptMaker()
sampler = DataSampler()

# Initialize LLMs
reconstruction_llm = LLM(LLMConfig(
    model=RECONSTRUCTION_MODEL,
    api_key=RECONSTRUCTION_API_KEY,
    temperature=RECONSTRUCTION_TEMPERATURE,
    max_tokens=2000
))

judge_llm = LLM(LLMConfig(
    model=JUDGE_MODEL,
    api_key=JUDGE_API_KEY,
    temperature=JUDGE_TEMPERATURE
))

print("✓ LLMs initialized")

In [None]:
# TODO: Load derived style instructions
# This should be the output from SynthesizerOfPrinciplesConfig
style_instructions_path = Path("derived_style_instructions.txt")

if style_instructions_path.exists():
    style_instructions = style_instructions_path.read_text()
    print(f"✓ Loaded style instructions ({len(style_instructions)} chars)")
    print(f"\nFirst 200 chars:\n{style_instructions[:200]}...")
else:
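    print("⚠ Style instructions file not found. Set style_instructions_path to the correct file.")
    style_instructions = ""  # Placeholder: the 'instructions' method runs without style guidance until this is set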

In [None]:
# Load test samples and few-shot examples
# TODO: Implement sample selection based on your corpus structure

test_samples = []
for idx in TEST_SAMPLE_INDICES[:N_SAMPLES]:
    segment = sampler.get_paragraph_chunk(file_index=0, start_para=idx*10, length=5)
    test_samples.append({
        'id': f'test_{idx}',
        'text': segment.text,
        'source': f'File {segment.file_index}, para {segment.paragraph_start}-{segment.paragraph_end}'
    })

fewshot_examples = []
for idx in FEWSHOT_SAMPLE_INDICES[:N_FEWSHOT_EXAMPLES]:
    segment = sampler.get_paragraph_chunk(file_index=0, start_para=idx*10, length=5)
    fewshot_examples.append(segment.text)

print(f"✓ Loaded {len(test_samples)} test samples")
print(f"✓ Loaded {len(fewshot_examples)} few-shot examples")

## Evaluation Pipeline

### Step 1: Content Flattening

Extract content from each test sample, removing stylistic elements:

In [None]:
# Flatten all test samples
for sample in test_samples:
    print(f"Flattening {sample['id']}...", end=" ")
    
    flattening_config = StyleFlatteningConfig(text=sample['text'])
    flattening_prompt = prompt_maker.render(flattening_config)
    
    response = reconstruction_llm.complete(flattening_prompt)
    sample['content_summary'] = response.content
    
    print(f"✓ ({len(response.content)} chars)")

print("\n✓ All samples flattened")

### Step 2: Reconstruction

Generate reconstructions using all 4 methods, with M stochastic runs each:

In [None]:
# Store reconstructions
reconstructions = []

for sample in test_samples:
    print(f"\nProcessing {sample['id']}:")
    
    for run in range(M_RUNS):
        print(f"  Run {run+1}/{M_RUNS}:")
        
        content_summary = sample['content_summary']
        
        # Method 1: Generic baseline
        print("    - Generic...", end=" ")
        config = StyleReconstructionGenericConfig(content_summary=content_summary)
        prompt = prompt_maker.render(config)
        response = reconstruction_llm.complete(prompt)
        reconstructions.append({
            'sample_id': sample['id'],
            'run': run,
            'method': 'generic',
            'text': response.content
        })
        print("✓")
        
        # Method 2: Few-shot
        print("    - Few-shot...", end=" ")
        config = StyleReconstructionFewShotConfig(
            content_summary=content_summary,
            few_shot_examples=fewshot_examples
        )
        prompt = prompt_maker.render(config)
        response = reconstruction_llm.complete(prompt)
        reconstructions.append({
            'sample_id': sample['id'],
            'run': run,
            'method': 'fewshot',
            'text': response.content
        })
        print("✓")
        
        # Method 3: Author name
        print("    - Author name...", end=" ")
        config = StyleReconstructionAuthorConfig(
            content_summary=content_summary,
            author_name=AUTHOR_NAME
        )
        prompt = prompt_maker.render(config)
        response = reconstruction_llm.complete(prompt)
        reconstructions.append({
            'sample_id': sample['id'],
            'run': run,
            'method': 'author',
            'text': response.content
        })
        print("✓")
        
        # Method 4: Derived instructions
        print("    - Instructions...", end=" ")
        config = StyleReconstructionInstructionsConfig(
            content_summary=content_summary,
            style_instructions=style_instructions
        )
        prompt = prompt_maker.render(config)
        response = reconstruction_llm.complete(prompt)
        reconstructions.append({
            'sample_id': sample['id'],
            'run': run,
            'method': 'instructions',
            'text': response.content
        })
        print("✓")

print(f"\n✓ Generated {len(reconstructions)} total reconstructions")
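
The four method blocks above differ only in how the prompt is built. One possible way to reduce the repetition is a mapping from method name to prompt builder. The sketch below uses stand-in builders and a fake completion function so that it runs on its own; `run_methods`, `builders`, and `fake_complete` are hypothetical names, not part of `belletrist`.

```python
def run_methods(content_summary, builders, complete, sample_id, run):
    """Build one reconstruction row per method.

    builders: dict mapping method name -> function(content_summary) -> prompt
    complete: function(prompt) -> generated text
    """
    rows = []
    for method, build_prompt in builders.items():
        prompt = build_prompt(content_summary)
        rows.append({
            'sample_id': sample_id,
            'run': run,
            'method': method,
            'text': complete(prompt),
        })
    return rows


# Stand-in builders; real ones would render the matching config object.
builders = {
    'generic': lambda summary: f"Rewrite: {summary}",
    'author': lambda summary: f"Rewrite as Bertrand Russell: {summary}",
}


def fake_complete(prompt):
    """Stand-in for an LLM call; simply upper-cases the prompt."""
    return prompt.upper()


rows = run_methods("cats sleep", builders, fake_complete, 'test_10', 0)
for row in rows:
    print(row['method'], '->', row['text'])
```

In this notebook, each builder would wrap `prompt_maker.render(...)` on the corresponding config, and `complete` would wrap `reconstruction_llm.complete(prompt).content`.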

### Step 3: Judging

Compare each reconstruction against the original using the judge LLM:

In [None]:
# Store judgments
judgments = []

for reconstruction in reconstructions:
    # Find corresponding original sample
    sample = next(s for s in test_samples if s['id'] == reconstruction['sample_id'])
    
    print(f"Judging {reconstruction['sample_id']}, run {reconstruction['run']}, method {reconstruction['method']}...", end=" ")
    
    # Build judge prompt
    judge_config = StyleJudgeConfig(
        original_text=sample['text'],
        reconstruction_text=reconstruction['text']
    )
    judge_prompt = prompt_maker.render(judge_config)
    
    try:
        # Get structured JSON judgment (inside the try so API failures are recorded too)
        response = judge_llm.complete_json(judge_prompt)

        # Parse and validate JSON
        judgment_data = json.loads(response.content)
        judgment = StyleJudgment(**judgment_data)
        
        # Store judgment
        judgments.append({
            'sample_id': reconstruction['sample_id'],
            'run': reconstruction['run'],
            'method': reconstruction['method'],
            'ranking': judgment.ranking,
            'confidence': judgment.confidence,
            'reasoning': judgment.reasoning,
            'timestamp': datetime.now().isoformat()
        })
        print(f"✓ {judgment.ranking}")
        
    except Exception as e:
        print(f"✗ Error: {e}")
        judgments.append({
            'sample_id': reconstruction['sample_id'],
            'run': reconstruction['run'],
            'method': reconstruction['method'],
            'ranking': 'error',
            'confidence': 'error',
            'reasoning': str(e),
            'timestamp': datetime.now().isoformat()
        })

print(f"\n✓ Completed {len(judgments)} judgments")
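
Judge models sometimes wrap their JSON in a Markdown code fence, which makes a plain `json.loads` fail. Whether `complete_json` already strips such wrappers is not stated in this notebook, so the following is only a defensive sketch; `extract_json_object` is a hypothetical helper.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker, built indirectly to keep this block readable


def extract_json_object(raw):
    """Parse a JSON object from model output, tolerating a surrounding code fence."""
    text = raw.strip()
    if len(text) >= 2 * len(FENCE) and text.startswith(FENCE) and text.endswith(FENCE):
        body = text[len(FENCE):-len(FENCE)]
        first_line, newline, rest = body.partition("\n")
        # Drop the first line only if it is a language tag such as "json".
        if newline and not first_line.strip().startswith("{"):
            body = rest
        text = body
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    return data


wrapped = FENCE + 'json\n{"ranking": "reconstruction_better"}\n' + FENCE
print(extract_json_object(wrapped))
print(extract_json_object('{"ranking": "reconstruction_better"}'))
```

The returned dictionary could then be passed to `StyleJudgment(**data)` exactly as in the judging cell.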

## Results Analysis

In [None]:
# Convert to DataFrame for analysis
df = pd.DataFrame(judgments)

# Remove error rows
df_clean = df[df['ranking'] != 'error']

print(f"Total judgments: {len(df)}")
print(f"Valid judgments: {len(df_clean)}")
print(f"Errors: {len(df) - len(df_clean)}")

# Show summary
df_clean.head(10)

In [None]:
# Aggregate results by method
print("\n=== Results by Method ===")
print("\nRanking distribution:")
print(df_clean.groupby(['method', 'ranking']).size().unstack(fill_value=0))

print("\nConfidence distribution:")
print(df_clean.groupby(['method', 'confidence']).size().unstack(fill_value=0))

In [None]:
# Calculate success rate: share of judgments ranked 'reconstruction_better'.
# A boolean mean keeps methods with zero successes at 0.0 instead of NaN.
success_rates = (
    df_clean['ranking'].eq('reconstruction_better')
    .groupby(df_clean['method'])
    .mean()
)

print("\n=== Success Rate by Method ===")
print("(Fraction of judgments where the reconstruction was judged better than the original)\n")
print(success_rates.sort_values(ascending=False))
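
A toy comparison of two ways to compute a per-method success rate shows how a method that never wins is handled. The data below is invented for illustration, and the label `original_better` is an assumed ranking value; only `reconstruction_better` appears in this notebook.

```python
import pandas as pd

# Invented judgments: 'generic' never wins, 'instructions' wins once out of twice.
toy = pd.DataFrame({
    'method': ['generic', 'generic', 'instructions', 'instructions'],
    'ranking': ['original_better', 'original_better',
                'reconstruction_better', 'original_better'],
})

# Dividing group sizes: 'generic' is absent from the numerator, so it becomes NaN.
wins = toy[toy['ranking'] == 'reconstruction_better'].groupby('method').size()
totals = toy.groupby('method').size()
ratio = wins / totals

# Boolean mean: every method is present, and zero successes give 0.0.
rate = toy['ranking'].eq('reconstruction_better').groupby(toy['method']).mean()

print(ratio)
print(rate)
```

The boolean mean is usually the safer choice when sorting or plotting, since NaN entries are easy to overlook.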

In [None]:
# Export results
output_file = f"style_evaluation_results_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
df.to_csv(output_file, index=False)
print(f"\n✓ Results saved to {output_file}")

## Next Steps

TODO: Add analysis cells for:
- Statistical significance testing
- Visualization of results
- Sample-by-sample breakdown
- Confidence level analysis
- Qualitative review of reasoning
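
As a starting point for the significance item, a Wilson score interval gives a rough uncertainty range for each method's success rate. This is a sketch rather than a full analysis: `wilson_interval` is a hypothetical helper, and the interval treats judgments as independent, which is optimistic here because several runs share the same sample.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must lie between 0 and total")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Example: 9 successes out of 15 judgments (5 samples x 3 runs) for one method.
low, high = wilson_interval(9, 15)
print(f"Success rate 0.60, approximate 95% interval: [{low:.3f}, {high:.3f}]")
```

With only 15 judgments per method the intervals are wide, so overlapping intervals between methods should be expected unless the differences are large.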