# Multi-Analyst Text Analysis Pipeline

This notebook demonstrates the full pipeline for analyzing text through multiple specialist lenses (rhetorician, syntactician, lexicologist, etc.) and synthesizing their observations.

## Installations and Preparations
First, install the external packages and confirm that they import correctly.
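
As a rough sketch of how "in working order" could be checked without importing anything heavy, the helper below asks the import system whether each package can be found. The function name `missing_modules` is hypothetical; the package names are the ones listed in the install cell.

```python
from importlib.util import find_spec


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# An empty list means every package is importable from the current kernel.
print(missing_modules(["litellm", "pydantic", "jinja2"]))
```

A non-empty result suggests that the install cell should be run before continuing.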

In [1]:
# Optional: Install requirements if running in a fresh kernel
# Comment out the next line if the packages are already installed:
!pip install -r requirements.txt

# Or install individual packages:
# !pip install litellm pydantic jinja2


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.1.2[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [2]:
try:
    import litellm
    print('Providers\n=========')
    print('* ' + '\n* '.join(litellm.LITELLM_CHAT_PROVIDERS))
except ImportError as e:
    print(f"✗ Cannot import litellm: {e}")

Providers
* openai
* openai_like
* bytez
* xai
* custom_openai
* text-completion-openai
* cohere
* cohere_chat
* clarifai
* anthropic
* anthropic_text
* replicate
* huggingface
* together_ai
* datarobot
* openrouter
* cometapi
* vertex_ai
* vertex_ai_beta
* gemini
* ai21
* baseten
* azure
* azure_text
* azure_ai
* sagemaker
* sagemaker_chat
* bedrock
* vllm
* nlp_cloud
* petals
* oobabooga
* ollama
* ollama_chat
* deepinfra
* perplexity
* mistral
* groq
* nvidia_nim
* cerebras
* baseten
* ai21_chat
* volcengine
* codestral
* text-completion-codestral
* deepseek
* sambanova
* maritalk
* cloudflare
* fireworks_ai
* friendliai
* watsonx
* watsonx_text
* triton
* predibase
* databricks
* empower
* github
* custom
* litellm_proxy
* hosted_vllm
* llamafile
* lm_studio
* galadriel
* gradient_ai
* github_copilot
* novita
* meta_llama
* featherless_ai
* nscale
* nebius
* dashscope
* moonshot
* v0
* heroku
* oci
* morph
* lambda_ai
* vercel_ai_gateway
* wandb
* ovhcloud
* lemonade


## Initialize Base Objects
Set up the connection to a Large Language Model provider through the `litellm` model router. Also set up the tools that retrieve the text placed in the context window, that is, the instruction prompts and the texts to analyze. A basic result store is initialized as well.

The LLM is selected by `model_string`, which has the form `<provider>/<model>`. The valid providers are defined by the `litellm` package; see `litellm.LITELLM_CHAT_PROVIDERS`. The provider's API key must be stored in the environment variable named by `model_provider_api_key_env_var`. Do **not** store the API key as a string variable directly in the notebook.
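
As a minimal sketch of the convention described above, the helpers below split a model string into its provider and model parts and fail early when the key variable is unset. The names `split_model_string` and `require_api_key` are hypothetical and belong to neither `litellm` nor `belletrist`.

```python
import os


def split_model_string(model_string: str) -> tuple[str, str]:
    """Split '<provider>/<model>' at the first slash."""
    provider, sep, model = model_string.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"Expected '<provider>/<model>', got {model_string!r}")
    return provider, model


def require_api_key(env_var: str) -> str:
    """Return the key stored in env_var, or raise with a clear message."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key


provider, model = split_model_string("mistral/mistral-large-2411")
print(provider, model)
```

Checking the key up front is one way to avoid a less obvious authentication error on the first request; whether that is worthwhile depends on how the notebook is run.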

In [16]:
model_string = 'mistral/mistral-large-2411'
model_provider_api_key_env_var = 'MISTRAL_API_KEY'

In [17]:
import os
from pathlib import Path
from belletrist import LLM, LLMConfig, PromptMaker, DataSampler, ResultStore
from belletrist.models import (
    PreambleInstructionConfig,
    PreambleTextConfig,
    RhetoricianConfig,
    SyntacticianConfig,
    LexicologistConfig,
    InformationArchitectConfig,
    EfficiencyAuditorConfig,
    PatternRecognizerTextConfig,
)

llm = LLM(LLMConfig(
    model=model_string,
    api_key=os.environ.get(model_provider_api_key_env_var)
))
prompt_maker = PromptMaker()
sampler = DataSampler()
store = ResultStore(Path.cwd() / "belletrist_storage.db")
store.reset()

## Generate and Store Text Samples to be Analyzed

A random text sample is drawn from the data corpus and stored with full provenance (which file, which paragraphs). Each sample is an instance of `TextSegment`.

The number of samples is set by `n_sample`, and each sample comprises `m_paragraphs_per_sample` consecutive paragraphs.

If non-random text samples are preferred, use the `get_paragraph_chunk` method of the `DataSampler` instance.
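
To illustrate the idea of sampling consecutive paragraphs with provenance, here is a self-contained sketch. It is not the `DataSampler` implementation: the `Segment` class, the `sample_consecutive` function, the blank-line paragraph split, and the exclusive end index are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical stand-in for a sampled segment with provenance."""
    file_path: str
    paragraph_start: int
    paragraph_end: int  # exclusive in this sketch
    text: str


def sample_consecutive(file_path: str, raw_text: str, p_length: int,
                       rng: random.Random) -> Segment:
    """Pick p_length consecutive paragraphs starting at a random offset."""
    # Assumption: paragraphs are blocks separated by blank lines.
    paragraphs = [p.strip() for p in raw_text.split("\n\n") if p.strip()]
    if p_length > len(paragraphs):
        raise ValueError("Text has fewer paragraphs than requested")
    start = rng.randrange(len(paragraphs) - p_length + 1)
    end = start + p_length
    return Segment(file_path, start, end, "\n\n".join(paragraphs[start:end]))


corpus = "\n\n".join(f"Paragraph {i}." for i in range(20))
segment = sample_consecutive("example.txt", corpus, p_length=4, rng=random.Random(0))
print(segment.paragraph_start, segment.paragraph_end)
```

Passing an explicit `random.Random` instance keeps the draw reproducible, which can help when a particular sample needs to be regenerated.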

In [18]:
text_sample = sampler.sample_segment(p_length=4)
print(f'Text source: {text_sample.file_path}')
print(f'Paragraph range: {text_sample.paragraph_start} - {text_sample.paragraph_end}')
print(f'\n{text_sample.text}')

Text source: /Users/andersohrn/PycharmProjects/russell_writes/data/russell/mysticism_and_logic_and_other_essays.txt
Paragraph range: 434 - 438

And Bergson, who has rightly perceived that the law as stated by
philosophers is worthless, nevertheless continues to suppose that it
is used in science. Thus he says:--

"Now, it is argued, this law [the law of causality] means that every
phenomenon is determined by its conditions, or, in other words, that
the same causes produce the same effects."[37]

And again:--

"We perceive physical phenomena, and these phenomena obey laws. This
means: (1) That phenomena _a_, _b_, _c_, _d_, previously perceived,
can occur again in the same shape; (2) that a certain phenomenon P,
which appeared after the conditions _a_, _b_, _c_, _d_, and after
these conditions only, will not fail to recur as soon as the same
conditions are again present."[38]


In [15]:
n_sample = 10
m_paragraphs_per_sample = 10

for _ in range(n_sample):
    sample_id = f'sample_{len(store.list_samples()) + 1:03d}'
    segment = sampler.sample_segment(p_length=m_paragraphs_per_sample)
    store.save_segment(sample_id, segment)


## Step 2: Run Multi-Analyst Pipeline

Send the text through each specialist analyst. Each produces an independent analysis from their domain expertise.

**Prompt structure for caching optimization:**
1. Preamble instruction (static)
2. Analyst-specific template (static per analyst)
3. Text to analyze (dynamic)
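
The ordering above can be illustrated with plain strings. The sketch assumes that a provider-side cache, where one exists, reuses a shared leading span of the prompt; the placeholder texts and the `build_prompt` helper are hypothetical and stand in for the rendered templates.

```python
import os.path


def build_prompt(preamble: str, analyst_template: str, text_block: str) -> str:
    """Join the parts so that static content comes first and the text last."""
    return "\n\n".join([preamble, analyst_template, text_block])


# Placeholder strings, not the real templates.
preamble = "You are part of a panel of text analysts."
template = "Analyze the rhetorical strategies in the text."

prompt_a = build_prompt(preamble, template, "TEXT: first sample")
prompt_b = build_prompt(preamble, template, "TEXT: second sample")

# The shared leading span is what a prefix cache could reuse between calls.
shared = os.path.commonprefix([prompt_a, prompt_b])
print(len(shared), "of", len(prompt_a), "characters shared")
```

Placing the text to analyze first would shrink the shared prefix to almost nothing, which is why the dynamic part goes last.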

In [None]:
# Get the sample text (the store returns a dict).
# sample_id still holds the last ID created in the sampling loop above.
sample = store.get_sample(sample_id)
text = sample['text']

# Build shared prompt components (reused across all analysts)
preamble_instruction = prompt_maker.render(PreambleInstructionConfig())
preamble_text = prompt_maker.render(PreambleTextConfig(text_to_analyze=text))

# Analyst name -> config class; each config is instantiated with its defaults
analyst_configs = {
    "rhetorician": RhetoricianConfig,
    "syntactician": SyntacticianConfig,
    "lexicologist": LexicologistConfig,
    "information_architect": InformationArchitectConfig,
    "efficiency_auditor": EfficiencyAuditorConfig,
}

# Run each analyst independently on the same text
for analyst, config_cls in analyst_configs.items():
    print(f"Running {analyst}...", end=" ")
    analyst_prompt = prompt_maker.render(config_cls())
    # Static parts first, text to analyze last
    full_prompt = f"{preamble_instruction}\n\n{analyst_prompt}\n\n{preamble_text}"
    response = llm.complete(full_prompt)
    store.save_analysis(sample_id, analyst, response.content, response.model)
    print(f"✓ ({len(response.content)} chars)")

print(f"\nAll analyses complete for {sample_id}")

## Step 3: Retrieve and Examine Results

Check what's been stored and verify all analyses are present.
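
The completeness check amounts to comparing the stored analyst names with the required ones. A minimal sketch of that comparison follows; it is independent of `ResultStore`, and the function name `missing_analysts` is hypothetical.

```python
def missing_analysts(analyses: dict[str, str], required: list[str]) -> list[str]:
    """Return the required analyst names that have no stored, non-empty analysis."""
    return [name for name in required if not analyses.get(name)]


required = ["rhetorician", "syntactician", "lexicologist"]
stored = {"rhetorician": "analysis text", "syntactician": "analysis text"}
print(missing_analysts(stored, required))  # ['lexicologist']
```

Returning the missing names rather than a single boolean makes it easier to rerun only the analysts that failed.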

In [None]:
# Define list of analyst types we're using
ANALYSTS = ["rhetorician", "syntactician", "lexicologist", "information_architect", "efficiency_auditor"]

# Check if all required analyses are present
is_complete = store.is_complete(sample_id, ANALYSTS)
print(f"Analysis complete: {is_complete}")

# Retrieve sample and all analyses (both are dicts)
sample, analyses = store.get_sample_with_analyses(sample_id)

print(f"\nSample: {sample['sample_id']}")
print(f"Source: File {sample['file_index']}, paragraphs {sample['paragraph_start']}-{sample['paragraph_end']}")
print(f"Analyses available: {list(analyses.keys())}")

# Examine one analysis
print("\n--- Rhetorician Output (first 500 chars) ---")
print(analyses.get("rhetorician", "Not found")[:500])

## Step 4: Pattern Recognition (Cross-Perspective Integration)

Synthesize all analyst perspectives to identify interactions, tensions, and load-bearing features.
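
The first part of this step, combining the individual reports into one block of text, can be sketched generically. The helper `format_reports` is hypothetical; it produces bold upper-case headings derived from the analyst names, one per report.

```python
def format_reports(analyses: dict[str, str], order: list[str]) -> str:
    """Join analyst reports under bold upper-case headings, in a fixed order."""
    sections = []
    for name in order:
        # "information_architect" becomes "INFORMATION ARCHITECT"
        heading = name.replace("_", " ").upper()
        sections.append(f"**{heading}:**\n{analyses[name]}")
    return "\n\n".join(sections) + "\n"


demo = {"rhetorician": "Report A", "information_architect": "Report B"}
print(format_reports(demo, ["rhetorician", "information_architect"]))
```

Passing the order explicitly keeps the layout stable no matter how the analyses dict was populated.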

In [None]:
# Get sample and all analyses (both are dicts)
sample, analyses = store.get_sample_with_analyses(sample_id)

# Format all analyst reports into a single string
specialist_analyses = f"""**RHETORICIAN:**
{analyses['rhetorician']}

**SYNTACTICIAN:**
{analyses['syntactician']}

**LEXICOLOGIST:**
{analyses['lexicologist']}

**INFORMATION ARCHITECT:**
{analyses['information_architect']}

**EFFICIENCY AUDITOR:**
{analyses['efficiency_auditor']}
"""

# Build pattern recognizer prompt using PromptMaker
pattern_config = PatternRecognizerTextConfig(
    original_text=sample['text'],
    specialist_analyses=specialist_analyses
)
pattern_prompt = prompt_maker.render(pattern_config)

# Get cross-perspective integration
print("Running pattern recognizer...", end=" ")
pattern_response = llm.complete(pattern_prompt)
print(f"✓ ({len(pattern_response.content)} chars)")

# Display first part of the synthesis
print("\n--- Pattern Recognition Output (first 1000 chars) ---")
print(pattern_response.content[:1000])

## Utilities: Working with Stored Samples

Helper functions for browsing and managing stored results.

In [None]:
# List all samples in the database
all_samples = store.list_samples()
print(f"Total samples: {len(all_samples)}")
print(f"Sample IDs: {all_samples}")

# Check completion status for each
print("\nCompletion status:")
for sid in all_samples:
    complete = store.is_complete(sid, ANALYSTS)
    status = "✓" if complete else "✗"
    print(f"  {status} {sid}")

# Close database connection when done
# store.close()