# Style Segmentor: Building a Catalog of Exemplary Passages

This notebook builds a segment catalog by analyzing chapters and extracting exemplary passages that demonstrate specific craft moves.

## Workflow

For each enabled chapter defined in `chapters_config.yaml`:
1. **Analyze**: The LLM identifies exemplary passages that demonstrate teachable craft moves
2. **Store**: Save each passage to the SQLite catalog with craft annotations (`craft_move`, `teaching_note`, `tags`)

Once all chapters have been processed:

3. **Browse**: Demonstrate the catalog browsing pattern (list tags → browse summaries → retrieve full text)

## Output

- **segments.db**: SQLite database containing annotated passages ready for agent retrieval
- Agents can browse by tags, search by craft move, and retrieve full text with provenance
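
The browse pattern described above (tags, then summaries, then full text) can be sketched directly against SQLite. This is only an illustration: the two-table schema, the `segment_tags` table, and the sample rows are assumptions made for the example, not the actual layout that `SegmentStore` creates.

```python
import sqlite3

# Hypothetical schema and sample rows, for illustration only
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE segments (
        segment_id TEXT PRIMARY KEY,
        file_index INTEGER, paragraph_start INTEGER, paragraph_end INTEGER,
        craft_move TEXT, text TEXT
    );
    CREATE TABLE segment_tags (segment_id TEXT, tag TEXT);
    INSERT INTO segments VALUES
        ('seg_001', 0, 12, 14, 'Concession before rebuttal', 'Full passage text one.'),
        ('seg_002', 0, 30, 31, 'Concrete example after abstraction', 'Full passage text two.');
    INSERT INTO segment_tags VALUES
        ('seg_001', 'argument'), ('seg_001', 'transition'), ('seg_002', 'argument');
""")

# Step 1: list tags with their segment counts
tags = conn.execute(
    "SELECT tag, COUNT(*) FROM segment_tags GROUP BY tag ORDER BY COUNT(*) DESC, tag"
).fetchall()
print(tags)  # [('argument', 2), ('transition', 1)]

# Step 2: browse short summaries for one tag, without loading full text
summaries = conn.execute(
    "SELECT s.segment_id, s.craft_move FROM segments s "
    "JOIN segment_tags t ON t.segment_id = s.segment_id "
    "WHERE t.tag = ? ORDER BY s.segment_id",
    ("transition",),
).fetchall()
print(summaries)  # [('seg_001', 'Concession before rebuttal')]

# Step 3: retrieve full text plus provenance for the chosen segment
row = conn.execute(
    "SELECT text, file_index, paragraph_start, paragraph_end "
    "FROM segments WHERE segment_id = ?",
    (summaries[0][0],),
).fetchone()
print(row)  # ('Full passage text one.', 0, 12, 14)
```

Keeping full text out of the first two steps means an agent reads only the passages it actually selects.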

## Key Features

- **Skip existing chapters**: Resume after interruption without re-analyzing
- **Tag consistency**: Encourages reuse of existing tags across chapters
- **Provenance tracking**: Every segment includes file index and paragraph range
- **Skills pattern**: Catalog designed for agent browsing, not pre-loaded prompts
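
The skip-existing check can be sketched with an in-memory SQLite table. The three column names come from the query used later in this notebook. The rest of the table layout and the helper name `already_processed` are assumptions for illustration.

```python
import sqlite3

# Hypothetical minimal table; the real catalog has more columns
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE segments "
    "(file_index INTEGER, paragraph_start INTEGER, paragraph_end INTEGER)"
)
# One stored segment from file 0, paragraphs 12 to 14
conn.execute("INSERT INTO segments VALUES (0, 12, 14)")


def already_processed(conn, file_index, start, end):
    """Return True if any stored segment lies inside [start, end) of the file."""
    row = conn.execute(
        "SELECT COUNT(*) FROM segments "
        "WHERE file_index = ? AND paragraph_start >= ? AND paragraph_end <= ?",
        (file_index, start, end),
    ).fetchone()
    return row[0] > 0


print(already_processed(conn, 0, 10, 40))  # True
print(already_processed(conn, 0, 40, 80))  # False
print(already_processed(conn, 1, 10, 40))  # False
```

Note that a single stored segment is enough to mark a chapter as processed. If segments are committed one at a time, a chapter interrupted partway through storage would therefore be skipped on the next run, and a chapter that produced no stored segments would be analyzed again.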

### Install Libraries and Check

In [None]:
!pip install -r requirements.txt

In [None]:
try:
    import litellm
    print('Providers\n=========')
    print('* ' + '\n* '.join(litellm.LITELLM_CHAT_PROVIDERS))
    litellm.drop_params = True
except ImportError as e:
    print(f"✗ Cannot import litellm: {e}")

## Setup and Configuration

In [None]:
import os
import yaml
from pathlib import Path
from typing import List, Optional
from pydantic import BaseModel, Field

### Chapter Configuration Models

Define Pydantic models for validating chapter configurations loaded from YAML.
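
As a rough illustration of the expected input, the dictionary below mirrors what `yaml.safe_load` might return for a small `chapters_config.yaml`. The field names match the models defined in the next cell, while the values and descriptions are made up. The second half shows why `paragraph_end` is exclusive: the range behaves like a Python slice.

```python
# Hypothetical parsed configuration; values are invented for illustration
example_config = {
    "chapters": [
        {
            "file_index": 0,
            "paragraph_start": 10,
            "paragraph_end": 42,
            "description": "Example chapter one",
        },
        {
            "file_index": 0,
            "paragraph_start": 42,
            "paragraph_end": 75,
            "description": "Example chapter two",
            "enabled": False,  # disabled chapters are filtered out before processing
        },
    ]
}

# Stand-in for a file's paragraphs: p0, p1, ..., p99
paragraphs = [f"p{i}" for i in range(100)]

# paragraph_end is exclusive, so adjacent chapters can share a boundary value
first = example_config["chapters"][0]
selected = paragraphs[slice(first["paragraph_start"], first["paragraph_end"])]
print(len(selected), selected[0], selected[-1])  # 32 p10 p41
```

Because the end index is exclusive, chapter two can start at 42 without overlapping chapter one.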

In [None]:
class ChapterConfig(BaseModel):
    """Configuration for a single chapter to analyze."""
    file_index: int = Field(..., ge=0, description="Index of file in data directory")
    paragraph_start: int = Field(..., ge=0, description="Starting paragraph (inclusive)")
    paragraph_end: int = Field(..., gt=0, description="Ending paragraph (exclusive)")
    description: str = Field(..., min_length=1, description="Human-readable chapter description")
    enabled: bool = Field(default=True, description="Whether to process this chapter")

    @property
    def paragraph_range(self) -> slice:
        """Convert to slice for DataSampler."""
        return slice(self.paragraph_start, self.paragraph_end)


class ChaptersConfig(BaseModel):
    """Root configuration containing all chapters."""
    # Pydantic v2 spells the list-size constraint min_length; min_items is deprecated
    chapters: List[ChapterConfig] = Field(..., min_length=1)

print("✓ Chapter configuration models loaded")

### Model Configuration

Configure which LLM to use for segment analysis.
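
The initialization cell further down reads the key with `os.environ.get`, which returns `None` when the variable is unset, so a missing key may only surface at the first LLM call. A small guard like the one below can fail earlier with a clearer message. The helper name `require_api_key` and the demo variable name are hypothetical and not part of the notebook or of any library.

```python
import os


def require_api_key(env_var: str) -> str:
    """Return the API key stored in env_var, or raise if it is unset or empty."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key


# Demo with a variable name that is presumably unset on this machine
try:
    require_api_key("EXAMPLE_UNSET_API_KEY")
except RuntimeError as err:
    print(f"Configuration problem: {err}")
```

In this notebook the call would look like `require_api_key(model_api_key_env_var)`, placed before the `LLM` object is created.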

In [None]:
# Model configuration
model_string = 'together_ai/Qwen/Qwen3-235B-A22B-Thinking-2507'
model_api_key_env_var = 'TOGETHER_AI_API_KEY'

# Alternative examples (uncomment to use):
#model_string = 'anthropic/claude-sonnet-4-5-20250929'
#model_api_key_env_var = 'ANTHROPIC_API_KEY'
#model_string = 'openai/gpt-4o'
#model_api_key_env_var = 'OPENAI_API_KEY'
#model_string = 'mistral/mistral-large-2512'
#model_api_key_env_var = 'MISTRAL_API_KEY'

print(f"✓ Model: {model_string}")
print(f"✓ API key from env var: {model_api_key_env_var}")

### Initialize Base Objects

Initialize core components:
- `LLM`: LLM interface for analysis
- `PromptMaker`: Template rendering engine
- `DataSampler`: Text loading with provenance
- `SegmentStore`: SQLite catalog for passages

In [None]:
from belletrist import LLM, LLMConfig, PromptMaker, DataSampler, SegmentStore
from belletrist.prompts import ExemplarySegmentAnalysisConfig, ExemplarySegmentAnalysis

# ============================================================================
# CONFIGURATION - Modify these parameters before running
# ============================================================================

# Data paths
DATA_PATH = Path(os.getcwd()) / "data" / "russell"
SEGMENT_DB_PATH = Path(os.getcwd()) / "segments.db"
CHAPTERS_CONFIG_PATH = Path(os.getcwd()) / "chapters_config.yaml"

# Analysis parameters
TEMPERATURE = 0.7
NUM_SEGMENTS_PER_CHAPTER = 5  # How many passages to extract per chapter

# Processing control
SKIP_EXISTING_CHAPTERS = True  # Skip chapters already processed

# Catalog browsing (for demo at end)
CATALOG_PREVIEW_LIMIT = 5  # Number of segments to show in browse demo
TAG_PREVIEW_LIMIT = 10  # Number of tags to show in tag list

# ============================================================================

# Validate configuration
if not DATA_PATH.exists():
    raise FileNotFoundError(
        f"Data directory not found: {DATA_PATH}\n"
        f"Please ensure the data directory exists."
    )

if not CHAPTERS_CONFIG_PATH.exists():
    raise FileNotFoundError(
        f"Chapters configuration not found: {CHAPTERS_CONFIG_PATH}\n"
        f"Please create chapters_config.yaml with chapter definitions."
    )

# Initialize components
prompt_maker = PromptMaker()
sampler = DataSampler(data_path=DATA_PATH.resolve())

llm = LLM(LLMConfig(
    model=model_string,
    api_key=os.environ.get(model_api_key_env_var),
    temperature=TEMPERATURE,
    max_tokens=16384  # Ensure enough tokens for full JSON response
))

print(f"✓ Data path: {DATA_PATH}")
print(f"✓ Segment database: {SEGMENT_DB_PATH}")
print(f"✓ Chapters config: {CHAPTERS_CONFIG_PATH}")
print(f"✓ DataSampler loaded {len(sampler.fps)} files")
print(f"✓ LLM configured: {model_string} (temp={TEMPERATURE})")
print(f"✓ Skip existing chapters: {SKIP_EXISTING_CHAPTERS}")

### Load Chapters Configuration

Load chapter definitions from YAML file.

In [None]:
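def load_chapters_config(config_path: Path) -> ChaptersConfig:
    """Load and validate chapters configuration from YAML file."""
    # Read as UTF-8 explicitly so descriptions load the same on every platform
    with open(config_path, 'r', encoding='utf-8') as f:
        data = yaml.safe_load(f)
    
    try:
        config = ChaptersConfig(**data)
    except Exception as e:
        # Chain the original exception so validation details stay in the traceback
        raise ValueError(f"Invalid chapters configuration: {e}") from e
    
    return config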

# Load configuration
chapters_config = load_chapters_config(CHAPTERS_CONFIG_PATH)
enabled_chapters = [ch for ch in chapters_config.chapters if ch.enabled]
total_chapters = len(enabled_chapters)

print(f"✓ Loaded {total_chapters} enabled chapters")
print(f"  (Skipping {len(chapters_config.chapters) - total_chapters} disabled chapters)")

# Preview chapters
print("\nChapters to process:")
for i, chapter in enumerate(enabled_chapters[:5], 1):
    print(f"  {i}. {chapter.description}")
    print(f"     File {chapter.file_index}, paragraphs {chapter.paragraph_start}-{chapter.paragraph_end}")
if len(enabled_chapters) > 5:
    print(f"  ... and {len(enabled_chapters) - 5} more")

## Helper Functions

Define functions for the analysis workflow.
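
The passage-location helper below relies on whitespace-normalized substring matching over growing paragraph windows. The idea can be sketched on plain lists of strings, independent of `DataSampler`. The names `normalize` and `locate`, and the choice to join paragraphs with blank lines, are assumptions for this sketch.

```python
from typing import Optional


def normalize(text: str) -> str:
    """Collapse every run of whitespace (including newlines) to one space."""
    return " ".join(text.split())


def locate(
    passage: str, paragraphs: list[str], max_len: int = 10
) -> Optional[tuple[int, int]]:
    """Return the shortest (start, end) paragraph window containing passage.

    The end index is exclusive. Returns None when no window matches.
    """
    target = normalize(passage)
    for length in range(1, max_len + 1):  # shortest windows first
        for start in range(0, len(paragraphs) - length + 1):
            window = normalize("\n\n".join(paragraphs[start:start + length]))
            if target in window:
                return (start, start + length)
    return None


paragraphs = [
    "The first paragraph sets the scene.",
    "The second paragraph makes\na claim.",
    "The third paragraph qualifies it.",
]
print(locate("makes a claim. The third", paragraphs))  # (1, 3)
print(locate("not in the text", paragraphs))           # None
```

Matching is exact after normalization, so a passage that the LLM paraphrases or re-punctuates will not be found. That is why the storage loop needs a fallback for unlocated passages.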

In [None]:
def chapter_already_processed(
    store: SegmentStore,
    chapter: ChapterConfig
) -> bool:
    """Check if a chapter has already been processed."""
    cursor = store.conn.cursor()
    cursor.execute(
        """
        SELECT COUNT(*) FROM segments
        WHERE file_index = ?
        AND paragraph_start >= ?
        AND paragraph_end <= ?
        """,
        (chapter.file_index, chapter.paragraph_start, chapter.paragraph_end)
    )
    count = cursor.fetchone()[0]
    return count > 0


def find_passage_in_chapter(
    passage_text: str,
    chapter_text: str,
    sampler: DataSampler,
    file_index: int,
    chapter_start_paragraph: int
) -> tuple[int, int] | None:
    """Find the absolute paragraph range of a passage within a chapter.

    Returns (start, end) with an exclusive end index, or None if not found.
    """
    # Normalize text for comparison
    normalized_passage = ' '.join(passage_text.split())
    
    # Check if passage exists in chapter
    if normalized_passage not in ' '.join(chapter_text.split()):
        return None
    
    # Iterate through paragraphs to find the match
    file_path = sampler.fps[file_index]
    max_paragraphs = sampler.n_paragraphs[file_path.name]
    
    for length in range(1, 11):  # Try windows of 1 to 10 paragraphs, shortest first
        for start_offset in range(0, 50):  # Only the first 50 start positions in the chapter are searched
            abs_start = chapter_start_paragraph + start_offset
            abs_end = abs_start + length
            
            if abs_end > max_paragraphs:
                break
            
            chunk = sampler.get_paragraph_chunk(file_index, slice(abs_start, abs_end))
            normalized_chunk = ' '.join(chunk.text.split())
            
            if normalized_passage in normalized_chunk:
                return (abs_start, abs_end)
    
    return None

print("✓ Helper functions loaded")

## Phase 1 & 2: Analyze and Store Chapters

Loop through chapters, analyzing each to identify exemplary passages and storing them in the catalog.

In [None]:
# Open segment store
store = SegmentStore(SEGMENT_DB_PATH)

# Track progress
all_segment_ids = []
processed_count = 0
skipped_count = 0
failed_chapters = []

print("="*60)
print("PROCESSING CHAPTERS")
print("="*60)
if SKIP_EXISTING_CHAPTERS:
    print("Mode: Skip already-processed chapters\n")
else:
    print("Mode: Reprocess all chapters\n")

for chapter_idx, chapter in enumerate(enabled_chapters, 1):
    print(f"\n{'='*60}")
    print(f"CHAPTER {chapter_idx}/{total_chapters}: {chapter.description}")
    print(f"{'='*60}")
    print(f"File: {chapter.file_index}, Paragraphs: {chapter.paragraph_start}-{chapter.paragraph_end}")
    
    # Check if already processed
    if SKIP_EXISTING_CHAPTERS and chapter_already_processed(store, chapter):
        print(f"\n⏭ Skipping - chapter already processed")
        skipped_count += 1
        continue
    
    try:
        # Get existing tags for consistency
        existing_tags_dict = store.list_all_tags()
        existing_tags = list(existing_tags_dict.keys()) if existing_tags_dict else []
        
        if chapter_idx == 1:
            if existing_tags:
                print(f"\nCatalog currently contains {len(existing_tags)} unique tags")
                print(f"Will encourage reuse for consistency")
            else:
                print("\nCatalog is empty - this will establish initial tag vocabulary")
        else:
            print(f"\nCatalog now contains {len(existing_tags)} unique tags")
        
        # PHASE 1: ANALYSIS
        print("\n[1/3] Loading chapter text...")
        chapter_segment = sampler.get_paragraph_chunk(
            chapter.file_index,
            chapter.paragraph_range
        )
        print(f"      File: {chapter_segment.file_path.name}")
        print(f"      Length: {len(chapter_segment.text):,} characters")
        
        print("\n[2/3] Analyzing with LLM...")
        config = ExemplarySegmentAnalysisConfig(
            chapter_text=chapter_segment.text,
            file_name=chapter_segment.file_path.name,
            num_segments=NUM_SEGMENTS_PER_CHAPTER,
            existing_tags=existing_tags
        )
        prompt = prompt_maker.render(config)
        
        response = llm.complete_with_schema(
            prompt=prompt,
            schema_model=ExemplarySegmentAnalysis
        )
        analysis = response.content
        
        print(f"      ✓ Identified {len(analysis.passages)} exemplary passages")
        if analysis.overall_observations:
            print(f"      Observations: {analysis.overall_observations[:100]}...")
        
        # PHASE 2: STORAGE
        print("\n[3/3] Storing passages in catalog...")
        chapter_segment_ids = []
        
        for i, passage in enumerate(analysis.passages, 1):
            print(f"  [{i}/{len(analysis.passages)}] {passage.craft_move}")
            
            # Find passage location
            para_range = find_passage_in_chapter(
                passage.text,
                chapter_segment.text,
                sampler,
                chapter.file_index,
                chapter.paragraph_start
            )
            
            if para_range is None:
                print("      ⚠ Warning: Could not locate passage, falling back to the chapter's first paragraph")
                para_start = chapter.paragraph_start
                para_end = chapter.paragraph_start + 1
            else:
                para_start, para_end = para_range
            
            # Get TextSegment with provenance
            text_segment = sampler.get_paragraph_chunk(
                chapter.file_index,
                slice(para_start, para_end)
            )
            
            # Save to catalog
            segment_id = store.save_segment(
                text_segment=text_segment,
                craft_move=passage.craft_move,
                teaching_note=passage.teaching_note,
                tags=passage.tags
            )
            
            chapter_segment_ids.append(segment_id)
            print(f"      ✓ Saved: {segment_id} (para {para_start}-{para_end})")
            print(f"      Tags: {', '.join(passage.tags)}")
        
        all_segment_ids.extend(chapter_segment_ids)
        processed_count += 1
        
        print(f"\n✓ Chapter {chapter_idx} complete: {len(chapter_segment_ids)} passages stored")
        print(f"  Progress: {processed_count}/{total_chapters} chapters processed")
        
    except Exception as e:
        print(f"\n✗ ERROR processing chapter {chapter_idx}: {e}")
        failed_chapters.append((chapter_idx, chapter.description, str(e)))
        print(f"  Continuing with next chapter...")
        continue

# Summary
print(f"\n{'='*60}")
print(f"CHAPTER PROCESSING COMPLETE")
print(f"{'='*60}")
print(f"Successfully processed: {processed_count}/{total_chapters} chapters")
if skipped_count > 0:
    print(f"Skipped (already processed): {skipped_count} chapters")
print(f"Total segments stored (this run): {len(all_segment_ids)}")

if failed_chapters:
    print(f"\nFailed chapters ({len(failed_chapters)}):")
    for idx, desc, error in failed_chapters:
        print(f"  - Chapter {idx} ({desc}): {error}")
elif processed_count > 0:
    print("\n✓ All processed chapters completed successfully!")

if skipped_count == total_chapters:
    print("\n⏭ All chapters were already processed - no new segments added")
    print("  Set SKIP_EXISTING_CHAPTERS=False to reprocess")

## Phase 3: Catalog Browsing Demo

Demonstrate the skills pattern: how agents browse the catalog to find and retrieve passages.

In [None]:
print("="*60)
print("CATALOG BROWSING DEMONSTRATION (Skills Pattern)")
print("="*60)

# Step 1: List available tags
print("\n[1/3] Listing available tags...")
tags = store.list_all_tags()
print(f"      Found {len(tags)} unique tags in catalog")
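print(f"\n      Top {TAG_PREVIEW_LIMIT} tags:")
# Sort by count so the preview shows the most-used tags (assumes values are segment counts)
top_tags = sorted(tags.items(), key=lambda item: item[1], reverse=True)[:TAG_PREVIEW_LIMIT]
for tag, count in top_tags:
    print(f"      - {tag}: {count} segments")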

# Step 2: Browse catalog summaries
print(f"\n[2/3] Browsing catalog (showing {CATALOG_PREVIEW_LIMIT} segments)...")
catalog = store.browse_catalog(limit=CATALOG_PREVIEW_LIMIT)
print(f"      Retrieved {len(catalog)} segment summaries")
for i, entry in enumerate(catalog, 1):
    print(f"\n      Segment {i}/{len(catalog)}: {entry['segment_id']}")
    print(f"      File: {entry['file_name']}")
    print(f"      Range: paragraphs {entry['paragraph_range']}")
    print(f"      Craft Move: {entry['craft_move']}")
    print(f"      Teaching Note: {entry['teaching_note'][:80]}...")
    print(f"      Tags: {', '.join(entry['tags'])}")

# Step 3: Retrieve a specific segment
if catalog:
    segment_id = catalog[0]['segment_id']
    print(f"\n[3/3] Retrieving full text for segment: {segment_id}")
    record = store.get_segment(segment_id)
    
    if record:
        print(f"      ✓ Retrieved {len(record.text)} characters")
        print(f"\n      Preview (first 300 chars):")
        print(f"      {record.text[:300]}...")
        
        # Demonstrate conversion back to TextSegment
        text_segment = record.to_text_segment(sampler)
        print(f"\n      ✓ Re-retrieved via DataSampler:")
        print(f"        File: {text_segment.file_path.name}")
        print(f"        Range: {text_segment.paragraph_start}-{text_segment.paragraph_end}")

## Summary and Next Steps

Show catalog statistics and suggest next steps for agents.

In [None]:
print("\n" + "="*60)
print("WORKFLOW COMPLETE")
print("="*60)
print(f"\nSegment catalog saved to: {SEGMENT_DB_PATH}")
print(f"\nThis run:")
print(f"  Chapters processed: {processed_count}/{total_chapters}")
if skipped_count > 0:
    print(f"  Chapters skipped: {skipped_count}")
print(f"  New segments stored: {len(all_segment_ids)}")

if all_segment_ids:
    print(f"    First: {all_segment_ids[0]}")
    print(f"    Last:  {all_segment_ids[-1]}")

# Show total catalog size
total_in_catalog = store.get_count()
total_tags = len(store.list_all_tags())

print(f"\nCatalog totals:")
print(f"  Total segments: {total_in_catalog}")
print(f"  Unique tags: {total_tags}")

print("\nNext steps - agents can now:")
print("  • store.list_all_tags() → discover available categories")
print("  • store.browse_catalog() → read segment descriptions")
print("  • store.get_segment(id) → retrieve full text")
print("  • store.search_by_tag(tag) → filter by form/function")
print("="*60)

## Close Store

Clean up database connection.

In [None]:
store.close()
print("✓ Database connection closed")