# Multi-Analyst Text Analysis Pipeline

This notebook demonstrates the full pipeline for analyzing text through multiple specialist lenses (rhetorician, syntactician, lexicologist, etc.) and synthesizing their observations.

## Installations and Preparations
First, install the external packages and verify that they import correctly.

In [None]:
# Optional: install requirements if running in a fresh kernel.
# Uncomment one of the following lines if needed:
# !pip install -r requirements.txt

# Or install individual packages:
# !pip install litellm pydantic jinja2

In [None]:
try:
    import litellm
    print('Providers\n=========')
    print('* ' + '\n* '.join(litellm.LITELLM_CHAT_PROVIDERS))
except ImportError as e:
    print(f"✗ Cannot import litellm: {e}")

## Initialize Base Objects
Set up the connection to a Large Language Model provider through the `litellm` model router. Also set up the tools that retrieve the text placed in the context window, that is, the instruction prompts and the texts to analyze. A basic result store is initialized as well.

The LLM is selected by `model_string`, which has the form `<provider>/<model>`. The available providers are defined by the `litellm` package; see in particular `litellm.LITELLM_CHAT_PROVIDERS`. The provider's API key should be stored in the environment variable whose name is given by `model_provider_api_key_env_var`. Do **not** store the API key as a string variable directly in the notebook.
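As a minimal sketch of the two conventions described above, the following checks a model string and reads the key from the environment before any request is made. The helper names `split_model_string` and `require_api_key` are hypothetical and not part of `belletrist` or `litellm`; splitting on the first `/` is an assumption about how the provider prefix is separated.

```python
import os


def split_model_string(model_string: str) -> tuple[str, str]:
    """Split '<provider>/<model>' at the first slash (hypothetical helper)."""
    provider, _, model = model_string.partition("/")
    if not provider or not model:
        raise ValueError(f"Expected '<provider>/<model>', got {model_string!r}")
    return provider, model


def require_api_key(env_var_name: str) -> str:
    """Return the key stored in the named environment variable, or raise."""
    api_key = os.environ.get(env_var_name)
    if not api_key:
        raise RuntimeError(
            f"Environment variable {env_var_name} is not set; "
            "export it in the shell before starting the notebook kernel."
        )
    return api_key


provider, model = split_model_string("mistral/mistral-large-2411")
print(f"provider={provider}, model={model}")
```

Failing early with a clear message can be easier to debug than an authentication error returned by the provider, since `os.environ.get` silently returns `None` when the variable is missing.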

In [None]:
model_string = 'mistral/mistral-large-2411'
model_provider_api_key_env_var = 'MISTRAL_API_KEY'

In [None]:
import os
from pathlib import Path
from belletrist import LLM, LLMConfig, PromptMaker, DataSampler, ResultStore

llm = LLM(LLMConfig(
    model=model_string,
    api_key=os.environ.get(model_provider_api_key_env_var)
))
prompt_maker = PromptMaker()
sampler = DataSampler()
store = ResultStore(Path.cwd() / "belletrist_storage.db")
store.reset()

## Generate and Store Text Samples to be Analyzed

A random text sample is drawn from the data corpus and stored with full provenance (source file and paragraph range). Each sample is an instance of `TextSegment`.

The number of samples is set by `n_sample`, and each sample comprises `m_paragraphs_per_sample` consecutive paragraphs.

If non-random text samples are preferred, use the `get_paragraph_chunk` method of the `DataSampler` instance.
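The idea of sampling a run of consecutive paragraphs while keeping its provenance can be sketched as follows. This is an illustration only, not the `DataSampler` implementation: the `Segment` dataclass and `sample_consecutive` function are hypothetical stand-ins, and treating the end index as exclusive is an assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for TextSegment: text plus paragraph provenance."""
    paragraph_start: int
    paragraph_end: int  # exclusive end index (an assumption of this sketch)
    text: str


def sample_consecutive(paragraphs: list[str], p_length: int, rng: random.Random) -> Segment:
    """Pick p_length consecutive paragraphs at a uniformly random start position."""
    if not 1 <= p_length <= len(paragraphs):
        raise ValueError("p_length must be between 1 and the number of paragraphs")
    # The last valid start index leaves room for a full run of p_length paragraphs.
    start = rng.randrange(len(paragraphs) - p_length + 1)
    end = start + p_length
    return Segment(start, end, "\n\n".join(paragraphs[start:end]))


paragraphs = [f"Paragraph {i}." for i in range(20)]
segment = sample_consecutive(paragraphs, p_length=4, rng=random.Random(0))
print(f"Paragraph range: {segment.paragraph_start} - {segment.paragraph_end}")
```

Passing an explicit `random.Random` instance makes the draw reproducible when a fixed seed is used.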

In [5]:
text_sample = sampler.sample_segment(p_length=4)
print(f'Text source: {text_sample.file_path}')
print(f'Paragraph range: {text_sample.paragraph_start} - {text_sample.paragraph_end}')
print(f'\n{text_sample.text}')

Text source: /Users/andersohrn/PycharmProjects/russell_writes/data/russell/education_and_the_good_life.txt
Paragraph range: 171 - 175

In play, we have two forms of the will to power: the form which
consists in learning to do things, and the form which consists in
fantasy. Just as the balked adult may indulge in daydreams that have
a sexual significance, so the normal child indulges in pretences that
have a power-significance. He likes to be a giant, or a lion, or a
train; in his make-believe, he inspires terror. When I told my boy the
story of Jack the Giant Killer, I tried to make him identify himself
with Jack, but he firmly chose the giant. When his mother told him the
story of Bluebeard, he insisted on being Bluebeard, and regarded the
wife as justly punished for insubordination. In his play, there was
a sanguinary outbreak of cutting off ladies’ heads. Sadism, Freudians
would say; but he enjoyed just as much being a giant who ate little
boys, or an engine that could pull a heavy 

In [6]:
n_sample = 1
m_paragraphs_per_sample = 10

for _ in range(n_sample):
    sample_id = f'sample_{len(store.list_samples()) + 1:03d}'
    segment = sampler.sample_segment(p_length=m_paragraphs_per_sample)
    store.save_segment(sample_id, segment)


In [7]:
print('Sample keys:\n============')
store.list_samples()

Sample keys:
============

['sample_001']

## Construct the Analyst Agents and Analyze

Send each text sample to every specialist analyst. Each analyst produces an independent analysis from its own domain expertise.

**Prompt structure for caching optimization (static parts first, dynamic part last):**
1. Preamble instruction (static)
2. Analyst-specific template (static per analyst)
3. Text to analyze (dynamic)
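The effect of this ordering can be illustrated with a small sketch: when only the final component changes, two prompts share a long common prefix, which is the part a provider with prefix-based prompt caching could reuse. The placeholder strings and the `build_prompt` helper below are hypothetical, and whether a given provider actually caches prefixes is not verified here.

```python
import os


def build_prompt(preamble: str, analyst_template: str, text: str) -> str:
    """Join the components in static-to-dynamic order."""
    return f"{preamble}\n\n{analyst_template}\n\n{text}"


# Placeholder content standing in for the rendered templates.
preamble = "You are one of several specialist analysts."
template = "Analyze the rhetorical strategy of the text."

prompt_a = build_prompt(preamble, template, "First sample text.")
prompt_b = build_prompt(preamble, template, "Second sample text.")

# os.path.commonprefix compares strings character by character.
shared_prefix = os.path.commonprefix([prompt_a, prompt_b])
print(f"Shared prefix: {len(shared_prefix)} of {len(prompt_a)} characters")

# With the dynamic text placed first, nothing is shared.
reversed_a = f"First sample text.\n\n{preamble}\n\n{template}"
reversed_b = f"Second sample text.\n\n{preamble}\n\n{template}"
print(f"Shared prefix when text comes first: {len(os.path.commonprefix([reversed_a, reversed_b]))}")
```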

In [8]:
from belletrist.models import (
    PreambleInstructionConfig,
    PreambleTextConfig,
    RhetoricianConfig,
    SyntacticianConfig,
    LexicologistConfig,
    InformationArchitectConfig,
    EfficiencyAuditorConfig,
    PatternRecognizerTextConfig,
)
ANALYSTS = ["rhetorician", "syntactician", "lexicologist", "information_architect", "efficiency_auditor"]
ANALYST_CONFIGS = {
    "rhetorician": RhetoricianConfig,
    "syntactician": SyntacticianConfig,
    "lexicologist": LexicologistConfig,
    "information_architect": InformationArchitectConfig,
    "efficiency_auditor": EfficiencyAuditorConfig,
}

def build_analyst_prompt(preamble_instruction: str, analyst_prompt: str, preamble_text: str) -> str:
    """
    Assemble the full prompt for one analyst.

    The static parts (preamble instruction, analyst template) come first
    and the dynamic part (the text to analyze) comes last.
    """
    return f"{preamble_instruction}\n\n{analyst_prompt}\n\n{preamble_text}"

In [9]:
# Get all samples from the store
all_samples = store.list_samples()
print(f"Processing {len(all_samples)} samples with {len(ANALYSTS)} analysts each\n")

# Outer loop: iterate over each text sample
for sample_id in all_samples:
    print(f"Sample: {sample_id}")
    
    # Get the sample text
    sample = store.get_sample(sample_id)
    text = sample['text']
    
    # Build shared prompt components (reused across all analysts for this sample)
    preamble_instruction = prompt_maker.render(PreambleInstructionConfig())
    preamble_text = prompt_maker.render(PreambleTextConfig(text_to_analyze=text))
    
    # Inner loop: run each analyst on this sample
    for analyst_name in ANALYSTS:
        print(f"  Running {analyst_name}...", end=" ")
        
        # Get analyst-specific prompt using the config class
        analyst_config = ANALYST_CONFIGS[analyst_name]()
        analyst_prompt = prompt_maker.render(analyst_config)
        
        # Build full prompt using helper function
        full_prompt = build_analyst_prompt(preamble_instruction, analyst_prompt, preamble_text)
        
        # Run analysis
        response = llm.complete(full_prompt)
        store.save_analysis(sample_id, analyst_name, response.content, response.model)
        
        print(f"✓ ({len(response.content)} chars)")
    
    print()  # Blank line between samples

print(f"All analyses complete for {len(all_samples)} samples")

Processing 1 samples with 5 analysts each

Sample: sample_001
  Running rhetorician... ✓ (6058 chars)
  Running syntactician... ✓ (9411 chars)
  Running lexicologist... ✓ (6344 chars)
  Running information_architect... ✓ (16208 chars)
  Running efficiency_auditor... ✓ (6266 chars)

All analyses complete for 1 samples


## Retrieve and Examine Results

Check what's been stored and verify all analyses are present.
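Conceptually, the completeness check asks whether every required analyst has a stored analysis. The sketch below shows that logic on a plain dict; `missing_analysts` is a hypothetical helper for illustration and says nothing about how `ResultStore.is_complete` is implemented.

```python
def missing_analysts(stored: dict[str, str], required: list[str]) -> list[str]:
    """Return the required analyst names that have no stored analysis."""
    return [name for name in required if name not in stored]


# Toy data: two of three required analyses are present.
stored = {"rhetorician": "analysis text", "lexicologist": "analysis text"}
required = ["rhetorician", "syntactician", "lexicologist"]

missing = missing_analysts(stored, required)
print(f"Complete: {not missing}; missing: {missing}")
```

Reporting the missing names, rather than a bare boolean, makes it clear which analyst needs to be rerun.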

In [10]:
# Check that all required analyses are present.
# Note: sample_id still holds the last sample processed in the loop above.
is_complete = store.is_complete(sample_id, ANALYSTS)
print(f"Analysis complete: {is_complete}")

# Retrieve the sample and all of its analyses (both are dicts)
sample, analyses = store.get_sample_with_analyses(sample_id)

print(f"\nSample: {sample['sample_id']}")
print(f"Source: File {sample['file_index']}, paragraphs {sample['paragraph_start']}-{sample['paragraph_end']}")
print(f"Analyses available: {list(analyses.keys())}")

# Examine one analysis
print("\n--- Rhetorician Output (first 500 chars) ---")
print(analyses.get("rhetorician", "Not found")[:500])

Analysis complete: True

Sample: sample_001
Source: File 4, paragraphs 198-208
Analyses available: ['efficiency_auditor', 'information_architect', 'lexicologist', 'rhetorician', 'syntactician']

--- Rhetorician Output (first 500 chars) ---
### RHETORICAL STRATEGY AND STANCE ANALYSIS

#### 1. WRITER'S POSITION

**Persona:**
The persona that emerges is authoritative yet exploratory, blending the voice of an expert with the curiosity of a thinker. The tone is intellectual and somewhat detached, suggesting a scholar who is both knowledgeable and reflective.

**Relationship to Subject Matter:**
The writer positions himself as an expert, deeply engaged with the subject matter. He offers insights and critiques that imply a high level of 


## Pattern Recognition (Cross-Perspective Integration)

Synthesize all analyst perspectives to identify interactions, tensions, and load-bearing features.

In [None]:
# Get the sample and all of its analyses (both are dicts)
sample, analyses = store.get_sample_with_analyses(sample_id)

# Build nested analysts dict with analysis and description metadata
# Using config classes to get the descriptions dynamically
analysts_dict = {}
for analyst_name in ANALYSTS:
    config_class = ANALYST_CONFIGS[analyst_name]
    analysts_dict[analyst_name] = {
        'analysis': analyses[analyst_name],
        'analyst_descr_short': config_class.description()
    }

# Build pattern recognizer prompt using PromptMaker with dynamic analysts dict
pattern_config = PatternRecognizerTextConfig(
    original_text=sample['text'],
    analysts=analysts_dict
)
pattern_prompt = prompt_maker.render(pattern_config)
print(pattern_prompt)

In [None]:
# Get cross-perspective integration
print("Running pattern recognizer...", end=" ")
pattern_response = llm.complete(pattern_prompt)
print(f"✓ ({len(pattern_response.content)} chars)")

# Display first part of the synthesis
print("\n--- Pattern Recognition Output (first 1000 chars) ---")
print(pattern_response.content[:1000])

## Utilities: Working with Stored Samples

Helper snippets for browsing stored samples and checking their completion status.

In [None]:
# List all samples in the database
all_samples = store.list_samples()
print(f"Total samples: {len(all_samples)}")
print(f"Sample IDs: {all_samples}")

# Check completion status for each
print("\nCompletion status:")
for sid in all_samples:
    complete = store.is_complete(sid, ANALYSTS)
    status = "✓" if complete else "✗"
    print(f"  {status} {sid}")

# Close database connection when done
# store.close()