# Author Modeling Pipeline

This notebook implements a 5-stage pipeline for modeling an author's characteristic writing patterns through few-shot example curation.

## Methodology

Rather than extracting explicit style rules, this approach models the author's **decision-making patterns** and **sensibility**, then curates exemplary passages for tacit transmission via few-shot learning.

## Pipeline Stages

1. **Stage 1: Analytical Mining** - Analyze sample texts through three lenses:
   - Implied Author: Sensibility and stance
   - Decision Patterns: Compositional choices
   - Functional Texture: How surface features serve purpose

2. **Stage 2: Cross-Text Synthesis** - Synthesize each dimension across multiple samples to identify stable patterns

3. **Stage 3: Field Guide Construction** - Integrate all three syntheses into unified recognition criteria and a density rubric

4. **Stage 4: Passage Evaluation** - Evaluate passages from the corpus using the field guide (1-5 density rating)

5. **Stage 5: Example Set Construction** - Curate 3-4 high-density passages that balance density, diversity, and complementarity
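
To make the Stage 5 trade-off between density and diversity concrete, here is a minimal sketch of one possible selection heuristic. It is not the selection logic used by `belletrist`, which this notebook does not show. The `Passage` dataclass, its `tags` field, and `select_examples` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Passage:
    passage_id: str
    density: int  # 1-5 rating, as produced in Stage 4
    tags: frozenset = field(default_factory=frozenset)  # hypothetical feature labels


def select_examples(passages, k=3, min_density=4):
    """Greedily pick up to k high-density passages that each add unseen tags."""
    candidates = [p for p in passages if p.density >= min_density]
    selected, covered = [], set()
    while candidates and len(selected) < k:
        # Prefer the passage covering the most unseen tags, then the denser one.
        best = max(candidates, key=lambda p: (len(p.tags - covered), p.density))
        selected.append(best)
        covered |= best.tags
        candidates.remove(best)
    return selected


passages = [
    Passage("p1", 5, frozenset({"argument", "irony"})),
    Passage("p2", 5, frozenset({"argument"})),
    Passage("p3", 4, frozenset({"definition", "example"})),
    Passage("p4", 3, frozenset({"narrative"})),
    Passage("p5", 4, frozenset({"irony"})),
]
chosen = select_examples(passages, k=3)
print([p.passage_id for p in chosen])
```

Filtering on density first and only then rewarding new tags keeps a low-density passage from being chosen just because it is unusual.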

## Note

This pipeline is author-agnostic. The Russell corpus is used as a test case, but the same approach applies to any author.

## 1. Setup & Dependencies

In [14]:
!pip install -r requirements.txt



DEPRECATION: pyodbc 4.0.0-unsupported has a non-standard version number. pip 24.1 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of pyodbc or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063

[notice] A new release of pip is available: 24.0 -> 25.0.1
[notice] To update, run: pip install --upgrade pip


In [15]:
import os
from pathlib import Path
from pprint import pprint
from typing import List, Dict, Any

from belletrist import (
    LLM,
    LLMConfig,
    PromptMaker,
    DataSampler,
    ResultStore,
    # Stage 1 configs
    ImpliedAuthorConfig,
    DecisionPatternConfig,
    FunctionalTextureConfig,
    # Stage 2 configs
    ImpliedAuthorSynthesisConfig,
    DecisionPatternSynthesisConfig,
    TexturalSynthesisConfig,
    # Stage 3 config
    AuthorModelDefinitionConfig,
    FieldGuideConstructionConfig,
    # Stage 4 config
    PassageEvaluationConfig,
    # Stage 5 config
    ExampleSetConstructionConfig,
    # Utilities
    extract_paragraph_windows,
    extract_logical_sections,
    get_full_sample_as_passage,
    parse_passage_evaluation,
    parse_example_set_selection,
    validate_passage_evaluation,
    validate_example_set_selection,
)

print("Dependencies imported successfully")

Dependencies imported successfully


### Initialize Base Objects

In [16]:
# Initialize LLM
llm_config = LLMConfig(
    model="mistral/mistral-large-2411",  # Change as needed
    api_key=os.environ.get('MISTRAL_API_KEY'),
    temperature=0.7
)
llm = LLM(llm_config)

# Initialize prompt maker
prompt_maker = PromptMaker()

# Initialize data sampler (Russell corpus)
data_dir = Path("data/russell")
sampler = DataSampler(data_dir)

# Initialize result store
store = ResultStore("russell_author_modeling.db")
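
`os.environ.get('MISTRAL_API_KEY')` returns `None` when the variable is unset, so a missing key may only surface later as a failed API call. The sketch below shows one way to fail early. `require_env` is a hypothetical helper, not part of `belletrist`, and `DEMO_API_KEY` is a placeholder variable used only for the demonstration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demonstration with a placeholder variable (not a real credential).
os.environ["DEMO_API_KEY"] = "secret"
print(require_env("DEMO_API_KEY"))
```

In the cell above, the result of `require_env("MISTRAL_API_KEY")` could be passed as `api_key`.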

In [17]:
#store.reset('all')

### Configuration: Select Samples for Analysis

For Stages 1-3, we'll analyze 3-5 samples to extract stable cross-text patterns.

In [18]:
# Select samples for Stage 1-3 analysis
# These should be diverse, substantial passages (500-800 words recommended)
ANALYSIS_SAMPLES = [
    "sample_001",
    "sample_002",
    "sample_003",
    "sample_004",
    "sample_005",
    # Add more as needed for richer cross-text synthesis
]

# For this demo, we'll generate samples if they don't exist
# In production, you might use specific pre-selected samples
NUM_SAMPLES = len(ANALYSIS_SAMPLES)  # 5 here; stays in sync with the list above
SAMPLE_PARAGRAPH_LENGTH = 10

print(f"Will analyze {NUM_SAMPLES} samples through Stages 1-3")

Will analyze 5 samples through Stages 1-3
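
The later cells rebuild each sample ID with `f"sample_{i+1:03d}"`, which has to agree with the IDs listed in `ANALYSIS_SAMPLES`. The sketch below keeps that naming in one place. `sample_ids` is a hypothetical helper, not a `belletrist` function.

```python
from typing import List


def sample_ids(n: int, prefix: str = "sample", width: int = 3) -> List[str]:
    """Return 1-based, zero-padded IDs such as 'sample_001' for n samples."""
    return [f"{prefix}_{i:0{width}d}" for i in range(1, n + 1)]


ids = sample_ids(5)
print(ids)
```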


## 2. Stage 1: Analytical Mining

Run three specialized analyses on each sample:
- **Implied Author**: What sensibility emerges from the prose?
- **Decision Patterns**: What compositional choices are made at key junctures?
- **Functional Texture**: How do surface features serve purpose?

### Generate or Load Samples

In [19]:
# Generate samples if they don't exist
for i in range(NUM_SAMPLES):
    sample_id = f"sample_{i+1:03d}"
    
    # Skip if already saved
    if store.get_sample(sample_id):
        print(f"✓ {sample_id} already gathered")
        continue
        
    # Generate new sample
    segment = sampler.sample_segment(p_length=SAMPLE_PARAGRAPH_LENGTH)
    store.save_segment(sample_id, segment)
    print(f"Generated {sample_id}")

# Display the start of the first sample for reference
sample = store.get_sample("sample_001")
print(f"\nSample 001 preview:\n{sample['text'][:800]}...")

✓ sample_001 already gathered
✓ sample_002 already gathered
✓ sample_003 already gathered
✓ sample_004 already gathered
✓ sample_005 already gathered

Sample 001 preview:
"Conscious" desire, which we have now to consider, consists of desire
in the sense hitherto discussed, together with a true belief as to its
"purpose," i.e. as to the state of affairs that will bring quiescence
with cessation of the discomfort. If our theory of desire is correct,
a belief as to its purpose may very well be erroneous, since only
experience can show what causes a discomfort to cease. When the
experience needed is common and simple, as in the case of hunger, a
mistake is not very probable. But in other cases--e.g. erotic desire in
those who have had little or no experience of its satisfaction--mistakes
are to be expected, and do in fact very often occur. The practice of
inhibiting impulses, which is to a great extent necessary to civilized
life, makes mistakes easier, by preventing experience of the acti

### Run Stage 1A: Implied Author Analysis

In [20]:
# Analyze each sample for implied author
for i in range(NUM_SAMPLES):
    sample_id = f"sample_{i+1:03d}"
    analyst_name = ImpliedAuthorConfig.analyst_name()
    
    # Check if analysis already exists (resume support)
    if store.get_analysis(sample_id, analyst_name):
        print(f"{sample_id}: {analyst_name} analysis already complete")
        continue
    
    # Get sample text
    sample = store.get_sample(sample_id)
    
    # Generate prompt
    config = ImpliedAuthorConfig(text=sample['text'])
    prompt = prompt_maker.render(config)
    
    # Run LLM
    print(f"{sample_id}: Running {analyst_name} analysis...")
    response = llm.complete(prompt)
    
    # Save analysis
    store.save_analysis(
        sample_id=sample_id,
        analyst=analyst_name,
        output=response.content,
        model=response.model
    )
    print(f"{sample_id}: {analyst_name} analysis complete")

print("\nStage 1A complete: All samples analyzed for implied author")

sample_001: implied_author analysis already complete
sample_002: implied_author analysis already complete
sample_003: implied_author analysis already complete
sample_004: implied_author analysis already complete
sample_005: implied_author analysis already complete

Stage 1A complete: All samples analyzed for implied author


### Run Stage 1B: Decision Pattern Analysis

In [21]:
# Analyze each sample for decision patterns
for i in range(NUM_SAMPLES):
    sample_id = f"sample_{i+1:03d}"
    analyst_name = DecisionPatternConfig.analyst_name()
    
    if store.get_analysis(sample_id, analyst_name):
        print(f"{sample_id}: {analyst_name} analysis already complete")
        continue
    
    sample = store.get_sample(sample_id)
    config = DecisionPatternConfig(text=sample['text'])
    prompt = prompt_maker.render(config)
    
    print(f"{sample_id}: Running {analyst_name} analysis...")
    response = llm.complete(prompt)
    
    store.save_analysis(
        sample_id=sample_id,
        analyst=analyst_name,
        output=response.content,
        model=response.model
    )
    print(f"{sample_id}: {analyst_name} analysis complete")

print("\nStage 1B complete: All samples analyzed for decision patterns")

sample_001: decision_pattern analysis already complete
sample_002: decision_pattern analysis already complete
sample_003: decision_pattern analysis already complete
sample_004: decision_pattern analysis already complete
sample_005: decision_pattern analysis already complete

Stage 1B complete: All samples analyzed for decision patterns


### Run Stage 1C: Functional Texture Analysis

In [22]:
# Analyze each sample for functional texture
for i in range(NUM_SAMPLES):
    sample_id = f"sample_{i+1:03d}"
    analyst_name = FunctionalTextureConfig.analyst_name()
    
    if store.get_analysis(sample_id, analyst_name):
        print(f"{sample_id}: {analyst_name} analysis already complete")
        continue
    
    sample = store.get_sample(sample_id)
    config = FunctionalTextureConfig(text=sample['text'])
    prompt = prompt_maker.render(config)
    
    print(f"{sample_id}: Running {analyst_name} analysis...")
    response = llm.complete(prompt)
    
    store.save_analysis(
        sample_id=sample_id,
        analyst=analyst_name,
        output=response.content,
        model=response.model
    )
    print(f"{sample_id}: {analyst_name} analysis complete")

print("\nStage 1C complete: All samples analyzed for functional texture")

sample_001: functional_texture analysis already complete
sample_002: functional_texture analysis already complete
sample_003: functional_texture analysis already complete
sample_004: functional_texture analysis already complete
sample_005: functional_texture analysis already complete

Stage 1C complete: All samples analyzed for functional texture
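
The three Stage 1 cells share the same loop and differ only in the config class. A minimal sketch of one shared function follows. It uses only the calls that appear in the cells above (`analyst_name`, `get_analysis`, `get_sample`, `render`, `complete`, `save_analysis`). `run_analyst` and the `Fake*` stand-ins are hypothetical and exist only so that the sketch runs without `belletrist` or an API key.

```python
from types import SimpleNamespace


def run_analyst(config_cls, sample_ids, store, llm, prompt_maker):
    """Run one Stage 1 analyst over all samples, skipping finished ones."""
    analyst_name = config_cls.analyst_name()
    ran = []
    for sample_id in sample_ids:
        if store.get_analysis(sample_id, analyst_name):
            continue  # resume support: analysis already stored
        sample = store.get_sample(sample_id)
        prompt = prompt_maker.render(config_cls(text=sample["text"]))
        response = llm.complete(prompt)
        store.save_analysis(sample_id=sample_id, analyst=analyst_name,
                            output=response.content, model=response.model)
        ran.append(sample_id)
    return ran


# --- Hypothetical in-memory stand-ins for the demonstration ---
class FakeConfig:
    def __init__(self, text):
        self.text = text

    @classmethod
    def analyst_name(cls):
        return "implied_author"


class FakeStore:
    def __init__(self, samples):
        self.samples, self.analyses = samples, {}

    def get_sample(self, sample_id):
        return self.samples.get(sample_id)

    def get_analysis(self, sample_id, analyst):
        return self.analyses.get((sample_id, analyst))

    def save_analysis(self, sample_id, analyst, output, model):
        self.analyses[(sample_id, analyst)] = output


fake_llm = SimpleNamespace(
    complete=lambda prompt: SimpleNamespace(content=f"analysis of: {prompt}",
                                            model="fake-model")
)
fake_prompts = SimpleNamespace(render=lambda config: config.text)

store_demo = FakeStore({"sample_001": {"text": "first"},
                        "sample_002": {"text": "second"}})
all_ids = ["sample_001", "sample_002"]
first_run = run_analyst(FakeConfig, all_ids, store_demo, fake_llm, fake_prompts)
second_run = run_analyst(FakeConfig, all_ids, store_demo, fake_llm, fake_prompts)
print(first_run, second_run)
```

The second call does no work because every analysis is already stored, which is the resume behaviour the notebook cells rely on.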


### Inspect Stage 1 Results

In [23]:
# Display completion status
for i in range(NUM_SAMPLES):
    sample_id = f"sample_{i+1:03d}"
    sample, analyses = store.get_sample_with_analyses(sample_id)
    print(f"{sample_id}: {len(analyses)} analyses complete")
    for analyst_name in analyses.keys():
        print(f"  - {analyst_name}")

# Optionally display the start of one analysis
print("\n--- Sample Implied Author Analysis (first 500 chars) ---")
sample, analyses = store.get_sample_with_analyses("sample_005")
implied_author = analyses.get(ImpliedAuthorConfig.analyst_name(), "Not found")
print(f"Total chars: {len(implied_author)}")
print(implied_author[:500])

sample_001: 3 analyses complete
  - decision_pattern
  - functional_texture
  - implied_author
sample_002: 3 analyses complete
  - decision_pattern
  - functional_texture
  - implied_author
sample_003: 3 analyses complete
  - decision_pattern
  - functional_texture
  - implied_author
sample_004: 3 analyses complete
  - decision_pattern
  - functional_texture
  - implied_author
sample_005: 3 analyses complete
  - decision_pattern
  - functional_texture
  - implied_author

--- Sample Implied Author Analysis (first 500 chars) ---
Total chars: 7483
## PART 1: DIMENSIONAL ANALYSIS

### 1. RELATIONSHIP TO MATERIAL

**Observation:**
This author approaches the material with a critical and analytical stance, often challenging the ideas presented and seeking to dissect them rigorously. The text reveals a mind that is both curious and skeptical, eager to probe the depths of philosophical claims to reveal their inconsistencies and limitations.

**Quoted Passages:**
1. "Of Bergson's theory that int
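
Several cells print previews of long texts. A small helper can keep the truncation and the trailing ellipsis consistent, adding `...` only when something was actually cut. `preview` is a hypothetical name, not a `belletrist` utility.

```python
def preview(text: str, limit: int = 500) -> str:
    """Return text unchanged if short, else its first `limit` chars plus '...'."""
    return text if len(text) <= limit else text[:limit] + "..."


print(preview("A short analysis."))
print(preview("x" * 600, limit=10))
```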

## 3. Stage 2: Cross-Text Synthesis

Synthesize each analytical dimension across samples to identify stable patterns.

### Stage 2A: Implied Author Synthesis

In [24]:
# Collect all implied author analyses
implied_author_analyses = {}
for i in range(NUM_SAMPLES):
    sample_id = f"sample_{i+1:03d}"
    analysis = store.get_analysis(sample_id, ImpliedAuthorConfig.analyst_name())
    if analysis:
        implied_author_analyses[sample_id] = analysis

# Check if synthesis already exists
synthesis_type = ImpliedAuthorSynthesisConfig.synthesis_type()
existing = store.list_syntheses(synthesis_type)

if existing:
    print(f"Implied author synthesis already exists: {existing[0]}")
    implied_author_synthesis_id = existing[0]
else:
    # Generate synthesis prompt
    config = ImpliedAuthorSynthesisConfig(
        implied_author_analyses=implied_author_analyses
    )
    prompt = prompt_maker.render(config)
    
    # Run synthesis
    print("Running implied author synthesis...")
    response = llm.complete(prompt)
    
    # Save synthesis
    sample_contributions = [
        (sample_id, ImpliedAuthorConfig.analyst_name())
        for sample_id in implied_author_analyses.keys()
    ]
    
    implied_author_synthesis_id = store.save_synthesis(
        synthesis_type=synthesis_type,
        output=response.content,
        model=response.model,
        sample_contributions=sample_contributions,
        config=config
    )
    print(f"Synthesis saved: {implied_author_synthesis_id}")

# Display preview
synth = store.get_synthesis(implied_author_synthesis_id)
print(f"\nPreview (first 400 chars):\n{synth['output'][:400]}...")

Implied author synthesis already exists: implied_author_synthesis_001

Preview (first 400 chars):
## OUTPUT

### PART 1: THE STABLE CORE

#### FUNDAMENTAL STANCE TOWARD IDEAS AND INQUIRY

**How does this author consistently engage with their subject matter?**
This author consistently engages with their subject matter with a blend of analytical rigor, curiosity, and a systematic approach. They approach complex topics with a sense of mastery but also allow for the inherent uncertainties and nuan...


### Stage 2B: Decision Pattern Synthesis

In [25]:
# Collect all decision pattern analyses
decision_pattern_analyses = {}
for i in range(NUM_SAMPLES):
    sample_id = f"sample_{i+1:03d}"
    analysis = store.get_analysis(sample_id, DecisionPatternConfig.analyst_name())
    if analysis:
        decision_pattern_analyses[sample_id] = analysis

# Check if synthesis already exists
synthesis_type = DecisionPatternSynthesisConfig.synthesis_type()
existing = store.list_syntheses(synthesis_type)

if existing:
    print(f"Decision pattern synthesis already exists: {existing[0]}")
    decision_pattern_synthesis_id = existing[0]
else:
    config = DecisionPatternSynthesisConfig(
        decision_pattern_analyses=decision_pattern_analyses
    )
    prompt = prompt_maker.render(config)
    
    print("Running decision pattern synthesis...")
    response = llm.complete(prompt)
    
    sample_contributions = [
        (sample_id, DecisionPatternConfig.analyst_name())
        for sample_id in decision_pattern_analyses.keys()
    ]
    
    decision_pattern_synthesis_id = store.save_synthesis(
        synthesis_type=synthesis_type,
        output=response.content,
        model=response.model,
        sample_contributions=sample_contributions,
        config=config
    )
    print(f"Synthesis saved: {decision_pattern_synthesis_id}")

synth = store.get_synthesis(decision_pattern_synthesis_id)
print(f"\nPreview (first 400 chars):\n{synth['output'][:400]}...")

Decision pattern synthesis already exists: decision_pattern_synthesis_001

Preview (first 400 chars):
## PART 1: COMPOSITIONAL PROBLEM TAXONOMY

Across all texts, this author recurrently faces the following types of compositional problems:

1. **Introducing a Complex Concept**:
   - **Description**: The author often needs to introduce complex or abstract ideas in a way that is accessible and engaging for the reader.

2. **Handling Objections or Alternative Views**:
   - **Description**: The author...


### Stage 2C: Textural Synthesis

In [26]:
# Collect all functional texture analyses
textural_analyses = {}
for i in range(NUM_SAMPLES):
    sample_id = f"sample_{i+1:03d}"
    analysis = store.get_analysis(sample_id, FunctionalTextureConfig.analyst_name())
    if analysis:
        textural_analyses[sample_id] = analysis

# Check if synthesis already exists
synthesis_type = TexturalSynthesisConfig.synthesis_type()
existing = store.list_syntheses(synthesis_type)

if existing:
    print(f"Textural synthesis already exists: {existing[0]}")
    textural_synthesis_id = existing[0]
else:
    config = TexturalSynthesisConfig(
        textural_analyses=textural_analyses
    )
    prompt = prompt_maker.render(config)
    
    print("Running textural synthesis...")
    response = llm.complete(prompt)
    
    sample_contributions = [
        (sample_id, FunctionalTextureConfig.analyst_name())
        for sample_id in textural_analyses.keys()
    ]
    
    textural_synthesis_id = store.save_synthesis(
        synthesis_type=synthesis_type,
        output=response.content,
        model=response.model,
        sample_contributions=sample_contributions,
        config=config
    )
    print(f"Synthesis saved: {textural_synthesis_id}")

synth = store.get_synthesis(textural_synthesis_id)
print(f"\nPreview (first 400 chars):\n{synth['output'][:400]}...")

Textural synthesis already exists: textural_synthesis_001

Preview (first 400 chars):
## UNIFIED DESCRIPTION OF THE AUTHOR'S CHARACTERISTIC TEXTURE

### PART 1: SENTENCE ARCHITECTURE

**Structural Character**
This author's characteristic sentence is complex, often embedding subordinate clauses and parenthetical explanations to build nuanced arguments. Complexity is typically managed through distributed information, front-loaded main clauses, and the use of semicolons for linking re...
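
The three Stage 2 cells follow one pattern: collect the stored analyses for a dimension, reuse an existing synthesis if one is listed, and create one otherwise. The sketch below isolates those two steps. `collect_analyses`, `get_or_create_synthesis`, and `DemoStore` are hypothetical names. Only `get_analysis` and `list_syntheses` mirror calls used in the cells above, and the ID format is an assumption modelled on the IDs shown in the outputs.

```python
def collect_analyses(store, sample_ids, analyst_name):
    """Map sample_id -> analysis text, skipping samples with no stored analysis."""
    found = {}
    for sample_id in sample_ids:
        analysis = store.get_analysis(sample_id, analyst_name)
        if analysis:
            found[sample_id] = analysis
    return found


def get_or_create_synthesis(store, synthesis_type, create):
    """Return (synthesis_id, created): reuse the first stored ID, else call create()."""
    existing = store.list_syntheses(synthesis_type)
    if existing:
        return existing[0], False
    return create(), True


class DemoStore:
    """Hypothetical in-memory stand-in for the demonstration."""

    def __init__(self):
        self.analyses = {("sample_001", "implied_author"): "analysis A"}
        self.syntheses = {}

    def get_analysis(self, sample_id, analyst):
        return self.analyses.get((sample_id, analyst))

    def list_syntheses(self, synthesis_type):
        return sorted(k for k in self.syntheses if k.startswith(synthesis_type))

    def save(self, synthesis_type, output):
        synthesis_id = f"{synthesis_type}_{len(self.syntheses) + 1:03d}"
        self.syntheses[synthesis_id] = output
        return synthesis_id


demo = DemoStore()
analyses_demo = collect_analyses(demo, ["sample_001", "sample_002"], "implied_author")


def make_synthesis():
    return demo.save("implied_author_synthesis", "synthesis text")


first = get_or_create_synthesis(demo, "implied_author_synthesis", make_synthesis)
second = get_or_create_synthesis(demo, "implied_author_synthesis", make_synthesis)
print(analyses_demo, first, second)
```

One consequence of this pattern, visible in the cells above as well, is that an existing synthesis is reused even if more samples have been analyzed since it was created.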


## 4. Stage 3: Author Model Definition

Integrate the three Stage 2 syntheses into a single author model definition, then export it for use in generation.

In [32]:
implied_synth = store.get_synthesis(implied_author_synthesis_id)
decision_synth = store.get_synthesis(decision_pattern_synthesis_id)
textural_synth = store.get_synthesis(textural_synthesis_id)

# Check if author model definition already exists
synthesis_type = AuthorModelDefinitionConfig.synthesis_type()
existing = store.list_syntheses(synthesis_type)

if existing:
    print(f"Author model definition already exists: {existing[0]}")
    author_model_id = existing[0]
else:
    # Generate author model definition
    config = AuthorModelDefinitionConfig(
        implied_author_synthesis=implied_synth['output'],
        decision_pattern_synthesis=decision_synth['output'],
        textural_synthesis=textural_synth['output']
    )
    prompt = prompt_maker.render(config)

    print("Constructing author model definition...")
    response = llm.complete(prompt)

    # Save with parent linkage (no direct sample contributions)
    author_model_id = store.save_synthesis(
        synthesis_type=synthesis_type,
        output=response.content,
        model=response.model,
        sample_contributions=[],  # Inherits from parents
        config=config,
        parent_synthesis_id=implied_author_synthesis_id  # Link to one parent
    )
    print(f"Author model definition saved: {author_model_id}")

# Display preview
author_model = store.get_synthesis(author_model_id)
print(f"\nTotal chars of Author Model: {len(author_model['output'])}")
print(f"\nAuthor Model Preview (first 800 chars):\n{author_model['output'][:800]}...")

# Export to filesystem for use in generation
output_dir = Path("outputs/author_modeling")
output_dir.mkdir(parents=True, exist_ok=True)

author_model_path = output_dir / f"{author_model_id}.txt"
store.export_synthesis(
    synthesis_id=author_model_id,
    output_path=author_model_path,
    metadata_format='yaml'
)
print(f"\nAuthor model exported to: {author_model_path}")

Constructing author model definition...
Author model definition saved: author_model_definition_001

Total chars of Author Model: 25659

Author Model Preview (first 800 chars):
## OUTPUT

### PART 1: THE GENERATIVE STANCE

**What it is like to be this author when writing**:

#### Orientation to Material

You approach complex topics with a sense of mastery and curiosity. You hold your subject matter with a blend of analytical rigor and intellectual curiosity, acknowledging the inherent uncertainties and nuances. When you sit down to write, you are fundamentally trying to unravel complexities and convey these intricacies accurately to your reader. Your relationship to complexity is one of relish; you enjoy breaking down complex topics into digestible parts, often using examples and anecdotes to illustrate your points. You are comfortable with uncertainty and acknowledge the limitations of your knowledge while asserting confidence in logical reasoning.

**Exemplary ...
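
The cell above exports the author model with `metadata_format='yaml'`. The exact file layout that `store.export_synthesis` writes is not shown in this notebook, so the following is only a sketch of one common layout: a YAML-style header between `---` lines followed by the text. `export_with_front_matter` and the metadata keys are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def export_with_front_matter(output: str, metadata: dict, path: Path) -> Path:
    """Write `output` to `path`, preceded by a YAML-style metadata header."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # JSON scalars are valid YAML, so json.dumps gives safely quoted values.
    header = "\n".join(f"{key}: {json.dumps(value)}" for key, value in metadata.items())
    path.write_text(f"---\n{header}\n---\n\n{output}", encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "outputs" / "author_model_definition_001.txt"
    export_with_front_matter(
        "## OUTPUT\n...",
        {"synthesis_id": "author_model_definition_001", "model": "example-model"},
        target,
    )
    exported = target.read_text(encoding="utf-8")
print(exported)
```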

Author model export