# End-to-End Spindle Workflow

This notebook demonstrates the complete Spindle pipeline from document ingestion to knowledge graph construction and analytics visualization.

## Overview

This walkthrough covers:
1. Setting up configuration in a new project directory
2. Creating sample documents with entity variations
3. Ingesting documents into storage
4. Iteratively recommending and extending an ontology
5. Extracting knowledge graph triples
6. Resolving duplicate entities
7. Inserting into a persistent knowledge graph
8. Viewing analytics from event logs

## Prerequisites

- Ensure dependencies are installed: `uv pip install -e ".[dev]"`
- Set `ANTHROPIC_API_KEY` environment variable for LLM operations
- Run cells sequentially; each step builds on the previous one


In [1]:
from __future__ import annotations

import os
import json
import shutil
from pathlib import Path
from datetime import datetime
from pprint import pprint

# Spindle imports
from spindle.configuration import SpindleConfig
from spindle.ingestion.service import build_config, run_ingestion
from spindle.extraction.extractor import SpindleExtractor
from spindle.extraction.recommender import OntologyRecommender
from spindle.entity_resolution import EntityResolver, ResolutionConfig, get_duplicate_clusters
from spindle.graph_store import GraphStore
from spindle.vector_store import ChromaVectorStore, get_default_embedding_function
from spindle.analytics import AnalyticsStore
from spindle.analytics.views import (
    corpus_overview,
    ontology_recommendation_metrics,
    triple_extraction_metrics,
    entity_resolution_metrics,
)
from spindle.observability import get_event_recorder, attach_persistent_observer
from spindle.observability.storage import EventLogStore
from spindle.baml_client.types import Triple, Entity, SourceMetadata, CharacterSpan

# Check for API key
if not os.getenv("ANTHROPIC_API_KEY"):
    print("⚠️  Warning: ANTHROPIC_API_KEY not set. LLM operations will fail.")
else:
    print("✅ ANTHROPIC_API_KEY configured")

print("✅ Imports successful")


✅ ANTHROPIC_API_KEY configured
✅ Imports successful


## Step 1: Setup Configuration

We'll create a new `example_project` directory and configure Spindle to use it for all storage operations.


In [2]:
# Enable BAML collector debug output (this produces the
# "=== DEBUG: Collector attributes ===" dumps in later cell outputs)
os.environ['DEBUG_BAML_COLLECTOR'] = '1'

# Create example_project directory
BASE_DIR = Path.cwd()
PROJECT_DIR = BASE_DIR / "example_project"

# Clean up if it exists (for fresh runs)
if PROJECT_DIR.exists():
    shutil.rmtree(PROJECT_DIR)

PROJECT_DIR.mkdir(parents=True, exist_ok=True)
print(f"Created project directory: {PROJECT_DIR}")

# Initialize SpindleConfig with auto-detected LLM
config = SpindleConfig.with_auto_detected_llm(
    root=PROJECT_DIR,
)

# Ensure all directories are created
config.storage.ensure_directories()

# Set up event persistence for analytics
# Events from ontology recommendation, triple extraction, and entity resolution
# will be saved to the analytics database
analytics_db_path = config.storage.log_dir / "analytics.db"
event_store = EventLogStore(f"sqlite:///{analytics_db_path}")
detach_observer = attach_persistent_observer(get_event_recorder(), event_store)

print(f"✅ Event persistence configured: {analytics_db_path}")


print(f"\nConfiguration initialized:")
print(f"  Storage root: {config.storage.root}")
print(f"  Vector store: {config.storage.vector_store_dir}")
print(f"  Graph store: {config.storage.graph_store_path}")
print(f"  Document catalog: {config.storage.catalog_path}")
print(f"  Logs directory: {config.storage.log_dir}")
print(f"  LLM configured: {config.llm is not None}")


Created project directory: c:\Users\danie\Repos\spindle\spindle\notebooks\example_project
✅ Event persistence configured: c:\Users\danie\Repos\spindle\spindle\notebooks\example_project\logs\analytics.db

Configuration initialized:
  Storage root: c:\Users\danie\Repos\spindle\spindle\notebooks\example_project
  Vector store: c:\Users\danie\Repos\spindle\spindle\notebooks\example_project\vector_store
  Graph store: c:\Users\danie\Repos\spindle\spindle\notebooks\example_project\graph\graph.db
  Document catalog: c:\Users\danie\Repos\spindle\spindle\notebooks\example_project\catalog\ingestion.db
  Logs directory: c:\Users\danie\Repos\spindle\spindle\notebooks\example_project\logs
  LLM configured: True


## Step 2: Create Sample Documents

We'll create 3 documents with intentional entity variations to demonstrate entity resolution:
- Document 1: "TechCorp" company, "John Smith" person
- Document 2: "Tech Corp" (space variation), "J. Smith" (abbreviation)  
- Document 3: "TechCorp Inc." (legal suffix), "John Smith" (full name again)
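
To see why these variations need more than string cleanup, here is a minimal sketch of naive name normalization. The `normalize_name` helper and the `LEGAL_SUFFIXES` set are illustrative assumptions and are not part of Spindle.

```python
import re

# Hypothetical suffix list for illustration; extend it for your own corpus.
LEGAL_SUFFIXES = {"inc", "llc", "ltd"}


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, drop trailing legal suffixes, remove spaces."""
    tokens = re.sub(r"[^\w\s]", "", name.lower()).split()
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return "".join(tokens)


for raw in ["TechCorp", "Tech Corp", "TechCorp Inc.", "John Smith", "J. Smith"]:
    print(f"{raw!r} -> {normalize_name(raw)!r}")
```

The three company spellings collapse to the same key, but "John Smith" and "J. Smith" do not. Abbreviations like this are the cases the semantic entity resolution in Step 6 is meant to handle.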


In [3]:
# Create documents directory
DOCUMENTS_DIR = PROJECT_DIR / "documents"
DOCUMENTS_DIR.mkdir(parents=True, exist_ok=True)

# Document 1: TechCorp and John Smith (made longer to create multiple chunks)
# Default chunk size is 800 chars with 100 overlap, so we'll make this ~2000 chars
doc1_content = """
# TechCorp Company Overview

TechCorp is a leading technology company based in San Francisco, California. 
Founded in 2010, the company specializes in cloud infrastructure and AI solutions.
The company has grown rapidly over the past decade, establishing itself as a key 
player in the enterprise software market. TechCorp's innovative approach to cloud 
computing has attracted numerous Fortune 500 clients seeking scalable and secure 
infrastructure solutions.

John Smith serves as the Chief Technology Officer at TechCorp. He has been with 
the company since 2015 and leads the engineering team. John Smith holds a PhD in 
Computer Science from Stanford University. Under his leadership, TechCorp has 
developed several breakthrough products that have revolutionized how businesses 
manage their cloud infrastructure. His expertise in distributed systems and 
machine learning has been instrumental in the company's success.

The company's headquarters are located in San Francisco, and it employs over 500 
people worldwide. TechCorp has partnerships with major cloud providers including 
AWS and Microsoft Azure. These strategic partnerships enable TechCorp to offer 
comprehensive solutions that integrate seamlessly with existing enterprise 
infrastructure. The company's commitment to innovation and customer satisfaction 
has earned it numerous industry awards and recognition from leading technology 
publications.

TechCorp's product portfolio includes enterprise-grade cloud management platforms, 
AI-powered analytics tools, and security solutions designed for modern businesses. 
The company continues to invest heavily in research and development, with plans 
to expand into new markets and develop cutting-edge technologies that will shape 
the future of cloud computing.
"""

# Document 2: Tech Corp (space variation) and J. Smith (abbreviation)
doc2_content = """
# Tech Corp Expansion Plans

Tech Corp announced plans to expand its operations to New York City. The expansion 
will create 200 new jobs in the metropolitan area.

J. Smith, CTO of Tech Corp, commented on the expansion: "This move represents our 
commitment to growing our presence on the East Coast. We're excited about the 
talent pool available in New York."

The New York office will focus on sales and customer support, complementing the 
engineering work done at the San Francisco headquarters. Tech Corp expects to 
complete the move by Q2 2025.
"""

# Document 3: TechCorp Inc. (legal suffix) and John Smith (full name)
doc3_content = """
# TechCorp Inc. Financial Results

TechCorp Inc. reported strong financial results for fiscal year 2024. Revenue 
grew by 35% year-over-year, reaching $150 million.

John Smith, the Chief Technology Officer, attributed the growth to successful 
product launches and strategic partnerships. "Our cloud infrastructure platform 
has seen tremendous adoption," said John Smith.

The company's board of directors approved a $50 million investment in R&D for 
2025. TechCorp Inc. plans to hire 100 additional engineers, with positions 
available in both San Francisco and New York offices.
"""

# Write documents
doc1_path = DOCUMENTS_DIR / "doc1_techcorp_overview.md"
doc2_path = DOCUMENTS_DIR / "doc2_techcorp_expansion.md"
doc3_path = DOCUMENTS_DIR / "doc3_techcorp_financials.md"

doc1_path.write_text(doc1_content.strip(), encoding="utf-8")
doc2_path.write_text(doc2_content.strip(), encoding="utf-8")
doc3_path.write_text(doc3_content.strip(), encoding="utf-8")

print("✅ Created 3 sample documents:")
print(f"  1. {doc1_path.name}")
print(f"  2. {doc2_path.name}")
print(f"  3. {doc3_path.name}")
print("\nEntity variations to resolve:")
print("  - TechCorp / Tech Corp / TechCorp Inc. (same company)")
print("  - John Smith / J. Smith (same person)")


✅ Created 3 sample documents:
  1. doc1_techcorp_overview.md
  2. doc2_techcorp_expansion.md
  3. doc3_techcorp_financials.md

Entity variations to resolve:
  - TechCorp / Tech Corp / TechCorp Inc. (same company)
  - John Smith / J. Smith (same person)


## Step 3: Document Ingestion

Ingest the documents into document storage (catalog) and vector storage.
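
For intuition about chunking, the sketch below splits text into fixed-size character windows using the 800-character size and 100-character overlap mentioned in Step 2. It is an assumption-laden illustration: `chunk_text` is a hypothetical helper, and the chunk previews in the output below start at paragraph boundaries, which suggests Spindle's real splitter is boundary-aware rather than a plain sliding window.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into character windows that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "".join(str(i % 10) for i in range(2000))
print([len(c) for c in chunk_text(sample)])
```

With these defaults, a 2,000-character input yields three windows of 800, 800, and 600 characters, and each window repeats the last 100 characters of the previous one.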


In [4]:
# Build ingestion config using SpindleConfig
ingestion_config = build_config(spindle_config=config)

# Ingest all documents
document_paths = [doc1_path, doc2_path, doc3_path]
print("Ingesting documents...")
ingestion_result = run_ingestion(document_paths, ingestion_config)

print(f"\n✅ Ingestion complete!")
print(f"  Documents processed: {ingestion_result.metrics.processed_documents}")
print(f"  Chunks created: {ingestion_result.metrics.processed_chunks}")
print(f"  Bytes read: {ingestion_result.metrics.bytes_read:,}")
print(f"  Errors: {len(ingestion_result.metrics.errors)}")

# Display chunk information
print(f"\nChunk details:")
for i, chunk in enumerate(ingestion_result.chunks, 1):
    print(f"  Chunk {i}: {chunk.chunk_id[:8]}... (doc: {chunk.document_id[:8]}...)")
    print(f"    Preview: {chunk.text[:80]}...")


Ingesting documents...

✅ Ingestion complete!
  Documents processed: 3
  Chunks created: 6
  Bytes read: 2,974
  Errors: 0

Chunk details:
  Chunk 1: 80dd3827... (doc: 9202ddce...)
    Preview: # TechCorp Company Overview

TechCorp is a leading technology company based in S...
  Chunk 2: 09fcd32f... (doc: 9202ddce...)
    Preview: John Smith serves as the Chief Technology Officer at TechCorp. He has been with ...
  Chunk 3: 7d4c29b0... (doc: 9202ddce...)
    Preview: The company's headquarters are located in San Francisco, and it employs over 500...
  Chunk 4: f2fbfec0... (doc: 9202ddce...)
    Preview: TechCorp's product portfolio includes enterprise-grade cloud management platform...
  Chunk 5: 539f2c09... (doc: d70176e4...)
    Preview: # Tech Corp Expansion Plans

Tech Corp announced plans to expand its operations ...
  Chunk 6: b260f733... (doc: bb912dbf...)
    Preview: # TechCorp Inc. Financial Results

TechCorp Inc. reported strong financial resul...


## Step 4: Iterative Ontology Recommendation

We'll recommend an ontology from the first document, then iteratively extend it as we analyze the remaining documents.
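
Conceptually, extending an ontology is a name-keyed merge: types the ontology already knows are kept, and only unseen ones are appended. The sketch below shows that idea with stand-in dataclasses. `EntityType`, `RelationType`, `Ontology`, and `merge_extension` are hypothetical here and may not match how `recommender.extend_ontology` works internally.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EntityType:  # hypothetical stand-in, not Spindle's class
    name: str
    description: str = ""


@dataclass(frozen=True)
class RelationType:  # hypothetical stand-in, not Spindle's class
    name: str
    domain: str
    range: str


@dataclass
class Ontology:  # hypothetical stand-in, not Spindle's class
    entity_types: list = field(default_factory=list)
    relation_types: list = field(default_factory=list)


def _append_new(existing, candidates):
    """Copy `existing`, then append candidates whose name is not yet present."""
    seen = {item.name for item in existing}
    merged = list(existing)
    for item in candidates:
        if item.name not in seen:
            seen.add(item.name)
            merged.append(item)
    return merged


def merge_extension(base, new_entity_types, new_relation_types):
    """Return a new Ontology; the base ontology is left unmodified."""
    return Ontology(
        entity_types=_append_new(base.entity_types, new_entity_types),
        relation_types=_append_new(base.relation_types, new_relation_types),
    )
```

Returning a new object instead of mutating the base makes it easy to compare the ontology before and after each document.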


In [5]:
# Create recommender using config
recommender = config.create_recommender()

# Get document texts
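# Read with the same encoding used when the files were written
doc1_text = doc1_path.read_text(encoding="utf-8")
doc2_text = doc2_path.read_text(encoding="utf-8")
doc3_text = doc3_path.read_text(encoding="utf-8")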

# Step 1: Recommend initial ontology from Document 1
print("Step 1: Recommending initial ontology from Document 1...")
print("-" * 70)
rec1 = recommender.recommend(text=doc1_text, scope="balanced")

print(f"Initial ontology:")
print(f"  Entity types: {len(rec1.ontology.entity_types)}")
for et in rec1.ontology.entity_types:
    print(f"    - {et.name}: {et.description}")
print(f"  Relation types: {len(rec1.ontology.relation_types)}")
for rt in rec1.ontology.relation_types:
    print(f"    - {rt.name}: {rt.domain} → {rt.range}")

current_ontology = rec1.ontology


Step 1: Recommending initial ontology from Document 1...
----------------------------------------------------------------------

=== DEBUG: Collector attributes ===
Collector type: <class 'baml_py.baml_py.Collector'>
Collector dir: ['clear', 'id', 'last', 'logs', 'usage']
Number of logs: 1
Log type: <class 'baml_py.baml_py.FunctionLog'>
Log dir: ['calls', 'function_name', 'id', 'log_type', 'metadata', 'raw_llm_response', 'selected_call', 'tags', 'timing', 'usage']

selected_call type: <class 'baml_py.baml_py.LLMCall'>
selected_call dir: ['client_name', 'http_request', 'http_response', 'provider', 'selected', 'timing', 'usage']
selected_call.client_name: CustomHaiku
selected_call.provider: anthropic

http_response type: <class 'baml_py.baml_py.HTTPResponse'>
http_response dir: ['body', 'headers', 'status']
http_response.body type: <class 'baml_py.baml_py.HTTPBody'>

usage type: <class 'baml_py.baml_py.Usage'>
usage dir: ['cached_input_tokens', 'input_tokens', 'output_tokens']

metadata 

In [6]:
# Step 2: Analyze Document 2 for extension needs
print("\nStep 2: Analyzing Document 2 for ontology extension...")
print("-" * 70)
extension2 = recommender.analyze_extension(
    text=doc2_text,
    current_ontology=current_ontology,
    scope="balanced"
)

print(f"Extension needed: {extension2.needs_extension}")
if extension2.needs_extension:
    print(f"  New entity types: {len(extension2.new_entity_types)}")
    for et in extension2.new_entity_types:
        print(f"    - {et.name}: {et.description}")
    print(f"  New relation types: {len(extension2.new_relation_types)}")
    for rt in extension2.new_relation_types:
        print(f"    - {rt.name}: {rt.domain} → {rt.range}")
    print(f"\nReasoning: {extension2.reasoning[:200]}...")
    
    # Extend ontology
    current_ontology = recommender.extend_ontology(current_ontology, extension2)
    print(f"\n✅ Extended ontology now has:")
    print(f"  Entity types: {len(current_ontology.entity_types)}")
    print(f"  Relation types: {len(current_ontology.relation_types)}")
else:
    print("  No extension needed - existing ontology covers Document 2")



Step 2: Analyzing Document 2 for ontology extension...
----------------------------------------------------------------------
Extension needed: True
  New entity types: 1
    - Office: A discrete organizational unit or branch location of an Organization (e.g., regional office, sales office, support center). Used to represent planned or existing offices with their specific operational focus, staffing plans, and timelines.
  New relation types: 2
    - has_office: Organization → Office
    - office_located_in: Office → Location

Reasoning: Step 1 - Concepts in the text: organization (Tech Corp), location (New York City, San Francisco), a planned/new office (New York office), job creation count (200 new jobs), office functional focus (sa...

✅ Extended ontology now has:
  Entity types: 5
  Relation types: 7


In [7]:
# Step 3: Analyze Document 3 for extension needs
print("\nStep 3: Analyzing Document 3 for ontology extension...")
print("-" * 70)
extension3 = recommender.analyze_extension(
    text=doc3_text,
    current_ontology=current_ontology,
    scope="balanced"
)

print(f"Extension needed: {extension3.needs_extension}")
if extension3.needs_extension:
    print(f"  New entity types: {len(extension3.new_entity_types)}")
    for et in extension3.new_entity_types:
        print(f"    - {et.name}: {et.description}")
    print(f"  New relation types: {len(extension3.new_relation_types)}")
    for rt in extension3.new_relation_types:
        print(f"    - {rt.name}: {rt.domain} → {rt.range}")
    print(f"\nReasoning: {extension3.reasoning[:200]}...")
    
    # Extend ontology
    current_ontology = recommender.extend_ontology(current_ontology, extension3)
    print(f"\n✅ Extended ontology now has:")
    print(f"  Entity types: {len(current_ontology.entity_types)}")
    print(f"  Relation types: {len(current_ontology.relation_types)}")
else:
    print("  No extension needed - existing ontology covers Document 3")

print(f"\n📊 Final ontology summary:")
print(f"  Total entity types: {len(current_ontology.entity_types)}")
print(f"  Total relation types: {len(current_ontology.relation_types)}")



Step 3: Analyzing Document 3 for ontology extension...
----------------------------------------------------------------------
Extension needed: False
  No extension needed - existing ontology covers Document 3

📊 Final ontology summary:
  Total entity types: 5
  Total relation types: 7


## Step 5: Triple Extraction

Extract knowledge graph triples from all documents using the final ontology.
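
The cell below also scans the extracted triples for the first `Organization` and the first `Person` so it can build name variations. That lookup can be written as a small helper. This is a sketch only: the `Entity` and `Triple` dataclasses are simplified stand-ins for the BAML-generated types, and `first_entity_of_type` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Entity:  # simplified stand-in for spindle.baml_client.types.Entity
    name: str
    type: str


@dataclass
class Triple:  # simplified stand-in for spindle.baml_client.types.Triple
    subject: Entity
    predicate: str
    object: Entity


def first_entity_of_type(triples: Iterable[Triple], entity_type: str) -> Optional[Entity]:
    """Return the first subject or object with the given type, or None."""
    for triple in triples:
        # Check the subject before the object, matching the notebook's loop.
        for entity in (triple.subject, triple.object):
            if entity.type == entity_type:
                return entity
    return None
```

Calling it once per type (`"Organization"`, then `"Person"`) gives the same result as the combined loop in the notebook cell, at the cost of scanning the list twice.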


In [8]:
# Create extractor with final ontology
extractor = config.create_extractor(ontology=current_ontology)

# Extract triples from all documents, maintaining entity consistency
all_triples = []

print("Extracting triples from Document 1...")
result1 = extractor.extract(
    text=doc1_text,
    source_name="doc1_techcorp_overview.md",
    source_url=str(doc1_path)
)
all_triples.extend(result1.triples)
print(f"  Extracted {len(result1.triples)} triples")

print("\nExtracting triples from Document 2...")
result2 = extractor.extract(
    text=doc2_text,
    source_name="doc2_techcorp_expansion.md",
    source_url=str(doc2_path),
    existing_triples=all_triples  # Maintain entity consistency
)
all_triples.extend(result2.triples)
print(f"  Extracted {len(result2.triples)} triples")

print("\nExtracting triples from Document 3...")
result3 = extractor.extract(
    text=doc3_text,
    source_name="doc3_techcorp_financials.md",
    source_url=str(doc3_path),
    existing_triples=all_triples  # Maintain entity consistency
)
all_triples.extend(result3.triples)
print(f"  Extracted {len(result3.triples)} triples")

print(f"\n✅ Total triples extracted: {len(all_triples)}")

# Display sample triples
print("\nSample triples:")
for i, triple in enumerate(all_triples[:5], 1):
    print(f"\n  {i}. {triple.subject.name} ({triple.subject.type})")
    print(f"     --[{triple.predicate}]-->")
    print(f"     {triple.object.name} ({triple.object.type})")
    if triple.supporting_spans:
        print(f"     Evidence: \"{triple.supporting_spans[0].text[:60]}...\"")

# Add manual triples with entity name variations to test entity resolution
# These refer to existing entities but use different names
print("\n\nAdding manual triples with entity name variations for resolution testing...")

# Find existing entities to create variations for
existing_company = None
existing_person = None
for triple in all_triples:
    if triple.subject.type == "Organization" and existing_company is None:
        existing_company = triple.subject
    if triple.subject.type == "Person" and existing_person is None:
        existing_person = triple.subject
    if triple.object.type == "Organization" and existing_company is None:
        existing_company = triple.object
    if triple.object.type == "Person" and existing_person is None:
        existing_person = triple.object
    if existing_company and existing_person:
        break

if existing_company and existing_person:
    # Create triples with name variations
    # Variation 1: "TechCorp Industries" (should resolve to "TechCorp")
    techcorp_variation = Entity(
        name="TechCorp Industries",
        type=existing_company.type,
        description=existing_company.description,
        custom_atts={}
    )
    
    # Variation 2: "Dr. John Smith" (should resolve to "John Smith")
    john_variation = Entity(
        name="Dr. John Smith",
        type=existing_person.type,
        description=existing_person.description,
        custom_atts={}
    )
    
    # Create manual triples
    manual_triple1 = Triple(
        subject=techcorp_variation,
        predicate="located_in",
        object=Entity(
            name="San Francisco Bay Area",
            type="Location",
            description="Geographical region",
            custom_atts={}
        ),
        source=SourceMetadata(
            source_name="manual_addition.md",
            source_url=str(PROJECT_DIR / "manual_addition.md")
        ),
        supporting_spans=[CharacterSpan(text="TechCorp Industries is located in the San Francisco Bay Area")],
        extraction_datetime=datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
    )
    
    manual_triple2 = Triple(
        subject=john_variation,
        predicate="works_at",
        object=techcorp_variation,
        source=SourceMetadata(
            source_name="manual_addition.md",
            source_url=str(PROJECT_DIR / "manual_addition.md")
        ),
        supporting_spans=[CharacterSpan(text="Dr. John Smith works at TechCorp Industries")],
        extraction_datetime=datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
    )
    
    manual_triple3 = Triple(
        subject=Entity(
            name="TechCorp Industries",
            type="Organization",
            description="Technology company",
            custom_atts={}
        ),
        predicate="partners_with",
        object=Entity(
            name="Google Cloud",
            type="Organization",
            description="Cloud provider",
            custom_atts={}
        ),
        source=SourceMetadata(
            source_name="manual_addition.md",
            source_url=str(PROJECT_DIR / "manual_addition.md")
        ),
        supporting_spans=[CharacterSpan(text="TechCorp Industries partners with Google Cloud")],
        extraction_datetime=datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
    )
    
    all_triples.extend([manual_triple1, manual_triple2, manual_triple3])
    print(f"  Added 3 manual triples with entity variations:")
    print(f"    - TechCorp Industries (should resolve to TechCorp)")
    print(f"    - Dr. John Smith (should resolve to John Smith)")
    print(f"  Total triples now: {len(all_triples)}")
else:
    print("  ⚠️  Could not find existing entities to create variations for")


Extracting triples from Document 1...

=== DEBUG: Collector attributes ===
Collector type: <class 'baml_py.baml_py.Collector'>
Collector dir: ['clear', 'id', 'last', 'logs', 'usage']
Number of logs: 1
Log type: <class 'baml_py.baml_py.FunctionLog'>
Log dir: ['calls', 'function_name', 'id', 'log_type', 'metadata', 'raw_llm_response', 'selected_call', 'tags', 'timing', 'usage']

raw_llm_response type: <class 'str'>
raw_llm_response preview (first 500 chars): {
  "triples": [
    {
      "subject": {
        "name": "TechCorp",
        "type": "Organization",
        "description": "TechCorp is a technology company specializing in cloud infrastructure and AI solutions, positioned as a key player in the enterprise software market.",
        "custom_atts": {
          "founded_date": {
            "value": "2010-01-01",
            "type": "date"
          },
          "headquarters_location": {
            "value": "San Francisco, California",
       
Parsed response has keys: ['triples', 

## Step 6: Entity Resolution

Identify and link duplicate entities using semantic blocking and matching.
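
Blocking narrows the search: instead of asking the LLM to compare every pair of entities, only pairs whose embeddings are close enough become candidates for matching. Below is a minimal sketch of that idea using cosine similarity on toy two-dimensional vectors. We assume here that `blocking_threshold` acts as a similarity cutoff; Spindle's actual blocking strategy may differ, and `candidate_pairs` is a hypothetical helper.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidate_pairs(embeddings, threshold=0.85):
    """Return name pairs whose embedding similarity meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(embeddings), 2)
        if cosine(embeddings[a], embeddings[b]) >= threshold
    ]


# Toy vectors for illustration only; real embeddings have hundreds of dimensions.
toy = {
    "TECHCORP": [1.0, 0.0],
    "TECH CORP": [0.9, 0.1],
    "NEW YORK CITY": [0.0, 1.0],
}
print(candidate_pairs(toy))
```

Only the two company spellings pass the 0.85 cutoff here, so only that pair would go on to the more expensive matching step.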


In [9]:
# Initialize stores for entity resolution
# First, create a temporary graph store to load triples
temp_graph_name = "e2e_temp_resolution"
temp_store = GraphStore(temp_graph_name, config=config)
temp_store.create_graph(temp_graph_name)

# Add triples to temporary graph
print("Loading triples into temporary graph store...")
triples_added = temp_store.add_triples(all_triples)
print(f"  Added {triples_added} triples")

# Get initial node count
initial_nodes = temp_store.nodes()
print(f"  Initial nodes: {len(initial_nodes)}")

# Initialize vector store for embeddings
vector_store = ChromaVectorStore(
    collection_name="e2e_resolution_embeddings",
    config=config
)

# Configure resolution
resolution_config = ResolutionConfig(
    blocking_threshold=0.85,
    matching_threshold=0.75,
    clustering_method="hierarchical",
    batch_size=20,
)

# Create resolver
resolver = EntityResolver(config=resolution_config)

# Provide domain context
context = """
This is a knowledge graph about technology companies and people.
Common variations:
- Corp, Corporation, Inc., LLC are equivalent suffixes
- Abbreviations like NYC = New York City are common
- First name abbreviations (J. = John) are common
"""

print("\nRunning entity resolution...")
resolution_result = resolver.resolve_entities(
    graph_store=temp_store,
    vector_store=vector_store,
    apply_to_nodes=True,
    apply_to_edges=True,
    context=context
)

print(f"\n✅ Resolution complete!")
print(f"  Nodes processed: {resolution_result.total_nodes_processed}")
print(f"  Edges processed: {resolution_result.total_edges_processed}")
print(f"  Blocks created: {resolution_result.blocks_created}")
print(f"  SAME_AS edges created: {resolution_result.same_as_edges_created}")
print(f"  Duplicate clusters: {resolution_result.duplicate_clusters}")
print(f"  Execution time: {resolution_result.execution_time_seconds:.2f}s")


Loading triples into temporary graph store...
  Added 15 triples
  Initial nodes: 35

Running entity resolution...

✅ Resolution complete!
  Nodes processed: 35
  Edges processed: 85
  Blocks created: 23
  SAME_AS edges created: 52
  Duplicate clusters: 23
  Execution time: 277.83s


In [10]:
# Display resolution results
print("\nNode matches found:")
for i, match in enumerate(resolution_result.node_matches, 1):
    if match.is_duplicate:
        print(f"\n  Match {i}:")
        print(f"    Entity 1: {match.entity1_id}")
        print(f"    Entity 2: {match.entity2_id}")
        print(f"    Confidence: {match.confidence:.2f}")
        print(f"    Reasoning: {match.reasoning[:150]}...")

# Get duplicate clusters
clusters = get_duplicate_clusters(temp_store)
print(f"\n\nDuplicate clusters: {len(clusters)}")
for i, cluster in enumerate(clusters, 1):
    print(f"  Cluster {i}: {sorted(cluster)}")

# Query SAME_AS edges
same_as_edges = temp_store.query_by_pattern(predicate="SAME_AS")
print(f"\n\nSAME_AS edges: {len(same_as_edges)}")
for edge in same_as_edges[:5]:
    print(f"  {edge['subject']} --[SAME_AS]--> {edge['object']}")



Node matches found:

  Match 1:
    Entity 1: CHIEF TECHNOLOGY OFFICER
    Entity 2: CHIEF TECHNOLOGY OFFICER AT TECHCORP
    Confidence: 0.95
    Reasoning: These are clearly the same job role. Both have identical titles ('Chief Technology Officer'), identical start dates (2015-01-01), and nearly identical...

  Match 2:
    Entity 1: CHIEF TECHNOLOGY OFFICER
    Entity 2: CTO, TECHCORP
    Confidence: 0.95
    Reasoning: These represent the same job role at TechCorp. 'CTO' is a standard abbreviation for 'Chief Technology Officer'. Both are JobRole entities at TechCorp ...

  Match 3:
    Entity 1: CHIEF TECHNOLOGY OFFICER AT TECHCORP
    Entity 2: CTO, TECHCORP
    Confidence: 0.95
    Reasoning: These are the same role - CTO is the standard abbreviation for Chief Technology Officer. Both are JobRole entities at TechCorp with consistent descrip...

  Match 10:
    Entity 1: SAN FRANCISCO, CALIFORNIA
    Entity 2: SAN FRANCISCO
    Confidence: 0.95
    Reasoning: Both entities refer 

## Step 7: Knowledge Graph Insertion

Insert the resolved triples into a persistent knowledge graph.
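
Note that `SAME_AS` edges link duplicates without merging them, so queries against the graph can still return separate name variants. If you want one canonical name per cluster at query time, a union-find pass over the `SAME_AS` pairs is one option. The sketch below is an illustration only: `build_canonical_map` is a hypothetical helper, and picking the alphabetically first name as the canonical one is an arbitrary choice.

```python
def build_canonical_map(same_as_pairs):
    """Map every name seen in SAME_AS pairs to one canonical name per cluster."""
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the alphabetically first root as the cluster representative.
            keep, drop = sorted((root_a, root_b))
            parent[drop] = keep

    return {name: find(name) for name in list(parent)}


# Pairs taken from the Step 6 match output.
pairs = [
    ("CHIEF TECHNOLOGY OFFICER", "CHIEF TECHNOLOGY OFFICER AT TECHCORP"),
    ("CHIEF TECHNOLOGY OFFICER", "CTO, TECHCORP"),
    ("CHIEF TECHNOLOGY OFFICER AT TECHCORP", "CTO, TECHCORP"),
    ("SAN FRANCISCO, CALIFORNIA", "SAN FRANCISCO"),
]
canonical = build_canonical_map(pairs)
print(canonical["CTO, TECHCORP"])
```

Edge endpoints can then be passed through `canonical.get(name, name)` before display, leaving names without a `SAME_AS` link unchanged.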


In [11]:
# Create persistent graph store
graph_name = "e2e_knowledge_graph"
kg_store = GraphStore(graph_name, config=config)
kg_store.create_graph(graph_name)

# Get all triples from temporary store (including SAME_AS edges)
all_resolved_triples = temp_store.get_triples()

# Insert into persistent graph
print("Inserting resolved triples into knowledge graph...")
inserted = kg_store.add_triples(all_resolved_triples)
print(f"  Inserted {inserted} triples")

# Get graph statistics
stats = kg_store.get_statistics()
print(f"\n✅ Knowledge graph created!")
print(f"  Nodes: {stats['node_count']}")
print(f"  Edges: {stats['edge_count']}")
print(f"  Sources: {', '.join(stats['sources'])}")
print(f"  Predicates: {', '.join(stats['predicates'][:10])}...")

# Query example relationships
print("\n\nExample queries:")
print("\n1. All 'works_at' relationships:")
works_at = kg_store.query_by_pattern(predicate="works_at")
for edge in works_at[:5]:
    print(f"   {edge['subject']} --[works_at]--> {edge['object']}")

print("\n2. All 'located_in' relationships:")
located_in = kg_store.query_by_pattern(predicate="located_in")
for edge in located_in[:5]:
    print(f"   {edge['subject']} --[located_in]--> {edge['object']}")

print("\n3. SAME_AS relationships (duplicate links):")
same_as = kg_store.query_by_pattern(predicate="SAME_AS")
for edge in same_as:
    print(f"   {edge['subject']} --[SAME_AS]--> {edge['object']}")

# Clean up temporary store
temp_store.close()


Inserting resolved triples into knowledge graph...
  Inserted 93 triples

✅ Knowledge graph created!
  Nodes: 35
  Edges: 85
  Sources: doc1_techcorp_overview.md, doc2_techcorp_expansion.md, doc3_techcorp_financials.md, entity_resolution, manual_addition.md
  Predicates: LOCATED_IN, ROLE_AT, TARGETS_MARKET, ANNOUNCED_ACTION, LEADS, SPECIALIZES_IN, HOLDS_POSITION, HAS_PLAN, APPROVED_INVESTMENT, DEVELOPS...


Example queries:

1. All 'works_at' relationships:
   J. SMITH --[works_at]--> TECHCORP
   JOHN SMITH --[works_at]--> TECHCORP
   DR. JOHN SMITH --[works_at]--> TECHCORP INDUSTRIES
   JOHN SMITH --[works_at]--> TECHCORP INC.

2. All 'located_in' relationships:
   TECHCORP --[located_in]--> SAN FRANCISCO, CALIFORNIA
   TECHCORP --[located_in]--> NEW YORK CITY, NEW YORK
   TECHCORP --[located_in]--> NEW YORK, NEW YORK
   TECHCORP INDUSTRIES --[located_in]--> SAN FRANCISCO BAY AREA

3. SAME_AS relationships (duplicate links):
   SAN FRANCISCO --[SAME_AS]--> SAN FRANCISCO, CALIFORNIA


## Step 8: Analytics Visualization

Query and display analytics from event logs showing LLM usage, costs, and performance metrics.
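
The first analytics cell groups events by service with a plain dictionary. The same tally can be written more compactly with `collections.Counter`, as sketched below. `ServiceEvent` is a hypothetical stand-in that only models the `service` attribute the notebook reads, and the event list is made-up sample data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ServiceEvent:  # hypothetical stand-in; only `service` is used here
    service: str
    name: str = ""


def count_by_service(events):
    """Count events per service name."""
    return Counter(event.service for event in events)


sample_events = [
    ServiceEvent("ingestion.pipeline"),
    ServiceEvent("ingestion.pipeline"),
    ServiceEvent("extractor"),
]
for service, count in count_by_service(sample_events).most_common():
    print(f"  {service}: {count} events")
```

`most_common()` orders services from most to fewest events, which puts the noisiest services at the top of the listing.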


In [12]:
# Initialize analytics store
# Analytics database is typically in log_dir/analytics.db
analytics_db_path = config.storage.log_dir / "analytics.db"
analytics_store = AnalyticsStore(f"sqlite:///{analytics_db_path}", event_store=event_store)

print("Analytics Store initialized")
print(f"  Database: {analytics_db_path}")

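# Fetch service events. The result is capped at `limit`, so if the total
# printed below equals the limit, some events may have been cut off.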
print("\nFetching service events...")
all_events = analytics_store.fetch_service_events(limit=100)
print(f"  Total events: {len(all_events)}")

# Group by service
by_service = {}
for event in all_events:
    service = event.service
    if service not in by_service:
        by_service[service] = []
    by_service[service].append(event)

print(f"\nEvents by service:")
for service, events in by_service.items():
    print(f"  {service}: {len(events)} events")


Analytics Store initialized
  Database: c:\Users\danie\Repos\spindle\spindle\notebooks\example_project\logs\analytics.db

Fetching service events...
  Total events: 100

Events by service:
  ingestion.service: 2 events
  ingestion.pipeline: 90 events
  ingestion.analytics: 3 events
  ontology.recommender: 2 events
  extractor: 3 events


In [13]:
# Display ontology recommendation metrics
print("\n" + "="*70)
print("ONTOLOGY RECOMMENDATION METRICS")
print("="*70)
ontology_metrics = ontology_recommendation_metrics(analytics_store)
if ontology_metrics["total_calls"] > 0:
    print(f"Total calls: {ontology_metrics['total_calls']}")
    print(f"Total tokens: {ontology_metrics['total_tokens']:,}")
    if ontology_metrics.get('input_tokens'):
        print(f"  Input tokens: {ontology_metrics['input_tokens']:,}")
    if ontology_metrics.get('output_tokens'):
        print(f"  Output tokens: {ontology_metrics['output_tokens']:,}")
    print(f"Total cost: ${ontology_metrics['total_cost']:.4f}")
    print(f"Avg latency: {ontology_metrics['avg_latency_ms']:.1f} ms")
    
    if ontology_metrics.get("by_scope"):
        print("\nBy scope:")
        for scope, metrics in ontology_metrics["by_scope"].items():
            print(f"  {scope}:")
            print(f"    Calls: {metrics['calls']}")
            print(f"    Tokens: {metrics['total_tokens']:,}")
            if metrics.get('input_tokens'):
                print(f"    Input tokens: {metrics['input_tokens']:,}")
            if metrics.get('output_tokens'):
                print(f"    Output tokens: {metrics['output_tokens']:,}")
            print(f"    Cost: ${metrics['total_cost']:.4f}")
else:
    print("No ontology recommendation metrics available")



ONTOLOGY RECOMMENDATION METRICS
Total calls: 1
Total tokens: 6,216
Total cost: $0.0000
Avg latency: 60036.0 ms

By scope:
  balanced:
    Calls: 1
    Tokens: 6,216
    Input tokens: 2,562
    Output tokens: 3,654
    Cost: $0.0000
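
The reported cost is $0.0000 even though 6,216 tokens were consumed, which suggests that no pricing was applied to these events in this run. A rough figure can be derived from the token counts instead. In the sketch below, `estimate_cost` is a hypothetical helper (not part of Spindle) and the per-million-token prices are placeholders, not actual provider rates.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_mtok, output_price_per_mtok):
    """Estimate cost in dollars from token counts and per-million-token prices."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


# Token counts come from the "balanced" scope above; prices are placeholders.
cost = estimate_cost(2_562, 3_654,
                     input_price_per_mtok=1.0, output_price_per_mtok=2.0)
print(f"Estimated cost: ${cost:.4f}")
```

Substitute the current prices for the model in use before relying on the result.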


In [14]:
# Display triple extraction metrics
print("\n" + "="*70)
print("TRIPLE EXTRACTION METRICS")
print("="*70)
extraction_metrics = triple_extraction_metrics(analytics_store)
if extraction_metrics["total_calls"] > 0:
    print(f"Total calls: {extraction_metrics['total_calls']}")
    print(f"Total tokens: {extraction_metrics['total_tokens']:,}")
    # Aggregate input/output tokens from by_scope if available
    total_input = sum(m.get('input_tokens', 0) or 0 for m in extraction_metrics.get('by_scope', {}).values())
    total_output = sum(m.get('output_tokens', 0) or 0 for m in extraction_metrics.get('by_scope', {}).values())
    if total_input > 0:
        print(f"  Input tokens: {total_input:,}")
    if total_output > 0:
        print(f"  Output tokens: {total_output:,}")
    print(f"Total cost: ${extraction_metrics['total_cost']:.4f}")
    print(f"Avg latency: {extraction_metrics['avg_latency_ms']:.1f} ms")
    print(f"Total triples: {extraction_metrics['total_triples']}")
    
    if extraction_metrics.get("by_scope"):
        print("\nBy scope:")
        for scope, metrics in extraction_metrics["by_scope"].items():
            print(f"  {scope}:")
            print(f"    Calls: {metrics['calls']}")
            print(f"    Tokens: {metrics['total_tokens']:,}")
            if metrics.get('input_tokens'):
                print(f"    Input tokens: {metrics['input_tokens']:,}")
            if metrics.get('output_tokens'):
                print(f"    Output tokens: {metrics['output_tokens']:,}")
            print(f"    Cost: ${metrics['total_cost']:.4f}")
            print(f"    Triples: {metrics['triples']}")
else:
    print("No triple extraction metrics available")



TRIPLE EXTRACTION METRICS
Total calls: 3
Total tokens: 18,961
  Input tokens: 9,973
  Output tokens: 8,988
Total cost: $0.0000
Avg latency: 43622.4 ms
Total triples: 12

By scope:
  balanced:
    Calls: 3
    Tokens: 18,961
    Input tokens: 9,973
    Output tokens: 8,988
    Cost: $0.0000
    Triples: 12
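
The totals above can be turned into per-call and per-triple ratios, which are easier to compare across runs. This is a minimal sketch using the figures printed above; `per_unit` is a hypothetical helper introduced only for illustration.

```python
def per_unit(total, count):
    """Return total / count, or 0.0 when count is zero to avoid ZeroDivisionError."""
    return total / count if count else 0.0


# Values copied from the triple extraction output above.
metrics = {"total_calls": 3, "total_tokens": 18_961, "total_triples": 12}

tokens_per_call = per_unit(metrics["total_tokens"], metrics["total_calls"])
tokens_per_triple = per_unit(metrics["total_tokens"], metrics["total_triples"])
triples_per_call = per_unit(metrics["total_triples"], metrics["total_calls"])

print(f"Tokens per call:   {tokens_per_call:,.1f}")
print(f"Tokens per triple: {tokens_per_triple:,.1f}")
print(f"Triples per call:  {triples_per_call:.1f}")
```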


In [15]:
# Display entity resolution metrics
print("\n" + "="*70)
print("ENTITY RESOLUTION METRICS")
print("="*70)
resolution_metrics = entity_resolution_metrics(analytics_store)
if resolution_metrics["total_calls"] > 0:
    print(f"Total calls: {resolution_metrics['total_calls']}")
    print(f"Total tokens: {resolution_metrics['total_tokens']:,}")
    # Aggregate input/output tokens from both steps
    total_input = 0
    total_output = 0
    for step_name in ["entity_matching", "edge_matching"]:
        step_metrics = resolution_metrics.get(step_name, {})
        if step_metrics.get('input_tokens'):
            total_input += step_metrics['input_tokens']
        if step_metrics.get('output_tokens'):
            total_output += step_metrics['output_tokens']
    if total_input > 0:
        print(f"  Input tokens: {total_input:,}")
    if total_output > 0:
        print(f"  Output tokens: {total_output:,}")
    print(f"Total cost: ${resolution_metrics['total_cost']:.4f}")
    print(f"Avg latency: {resolution_metrics.get('avg_latency_ms', 0):.1f} ms")
    
    print("\nBy step:")
    print("  Entity Matching:")
    em = resolution_metrics["entity_matching"]
    print(f"    Calls: {em['calls']}")
    print(f"    Tokens: {em['total_tokens']:,}")
    if em.get('input_tokens'):
        print(f"    Input tokens: {em['input_tokens']:,}")
    if em.get('output_tokens'):
        print(f"    Output tokens: {em['output_tokens']:,}")
    print(f"    Cost: ${em['total_cost']:.4f}")
    
    print("  Edge Matching:")
    ed = resolution_metrics["edge_matching"]
    print(f"    Calls: {ed['calls']}")
    print(f"    Tokens: {ed['total_tokens']:,}")
    if ed.get('input_tokens'):
        print(f"    Input tokens: {ed['input_tokens']:,}")
    if ed.get('output_tokens'):
        print(f"    Output tokens: {ed['output_tokens']:,}")
    print(f"    Cost: ${ed['total_cost']:.4f}")
else:
    print("No entity resolution metrics available")



ENTITY RESOLUTION METRICS
Total calls: 23
Total tokens: 36,779
  Input tokens: 22,562
  Output tokens: 14,217
Total cost: $0.0000
Avg latency: 0.0 ms

By step:
  Entity Matching:
    Calls: 5
    Tokens: 9,901
    Input tokens: 5,849
    Output tokens: 4,052
    Cost: $0.0000
  Edge Matching:
    Calls: 18
    Tokens: 26,878
    Input tokens: 16,713
    Output tokens: 10,165
    Cost: $0.0000
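
The three metric cells repeat the same token and cost printing logic. One way to reduce that duplication is a small formatting helper. `format_usage` below is a hypothetical helper, and it assumes only the dictionary keys already used in the cells above (`calls`, `total_tokens`, `input_tokens`, `output_tokens`, `total_cost`).

```python
def format_usage(metrics, indent="    "):
    """Format one metrics dict as indented lines.

    Optional token keys are skipped when missing or zero, matching the
    `.get(...)` checks in the cells above.
    """
    lines = [
        f"{indent}Calls: {metrics['calls']}",
        f"{indent}Tokens: {metrics['total_tokens']:,}",
    ]
    if metrics.get("input_tokens"):
        lines.append(f"{indent}Input tokens: {metrics['input_tokens']:,}")
    if metrics.get("output_tokens"):
        lines.append(f"{indent}Output tokens: {metrics['output_tokens']:,}")
    lines.append(f"{indent}Cost: ${metrics['total_cost']:.4f}")
    return lines


# Entity-matching figures copied from the output above.
em = {"calls": 5, "total_tokens": 9_901, "input_tokens": 5_849,
      "output_tokens": 4_052, "total_cost": 0.0}
print("  Entity Matching:")
print("\n".join(format_usage(em)))
```

The same call would serve the per-scope loops in the ontology and extraction cells, with any stage-specific line (such as the triple count) printed separately.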


In [16]:
# Display corpus overview
print("\n" + "="*70)
print("CORPUS OVERVIEW")
print("="*70)
overview = corpus_overview(analytics_store)
print(f"Total documents: {overview['documents']}")
print(f"Total tokens: {overview['total_tokens']:,}")
print(f"Avg tokens/doc: {overview['avg_tokens']:.1f}")
print(f"Avg chunks/doc: {overview['avg_chunks']:.1f}")

if overview.get("context_strategy_counts"):
    print("\nContext strategy distribution:")
    for strategy, count in overview["context_strategy_counts"].items():
        print(f"  {strategy}: {count}")



CORPUS OVERVIEW
Total documents: 3
Total tokens: 417
Avg tokens/doc: 139.0
Avg chunks/doc: 2.0

Context strategy distribution:
  document: 3


## Optional: Launch Dashboard

You can also launch the interactive Streamlit dashboard to explore analytics visually:

```python
from spindle.dashboard.app import cli_main

# Launch dashboard
cli_main(["--database", str(analytics_db_path)])
```

Or run it from the command line:
```bash
python -m spindle.dashboard.app --database path/to/analytics.db
```
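
If calling the dashboard in-process would block the notebook kernel, the command-line form above can be started as a child process instead. This is a sketch under that assumption: `dashboard_command` and `launch_dashboard` are hypothetical helpers that wrap the `python -m spindle.dashboard.app --database ...` invocation shown above.

```python
import subprocess
import sys
from pathlib import Path


def dashboard_command(db_path):
    """Build the argv for the dashboard CLI, using the current interpreter."""
    return [sys.executable, "-m", "spindle.dashboard.app",
            "--database", str(Path(db_path))]


def launch_dashboard(db_path):
    """Start the dashboard as a child process and return the Popen handle."""
    return subprocess.Popen(dashboard_command(db_path))


print(dashboard_command("path/to/analytics.db"))
# proc = launch_dashboard(analytics_db_path)  # uncomment to start the dashboard
# proc.terminate()                            # stop it when finished
```

Using `sys.executable` keeps the dashboard in the same virtual environment as the notebook kernel.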


## Summary

This notebook demonstrated the complete Spindle workflow:

✅ **Configuration**: Set up SpindleConfig in `example_project/` directory  
✅ **Document Creation**: Created 3 documents with entity variations  
✅ **Ingestion**: Ingested documents into document storage and vector storage  
✅ **Ontology Recommendation**: Iteratively recommended and extended ontology across documents  
✅ **Triple Extraction**: Extracted knowledge graph triples using the final ontology  
✅ **Entity Resolution**: Identified and linked duplicate entities (TechCorp variants, John Smith variants)  
✅ **Knowledge Graph**: Inserted resolved triples into persistent graph  
✅ **Analytics**: Viewed metrics from event logs showing LLM usage and costs  

### Key Results

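- **Documents**: 3 documents ingested (417 tokens in total, 2 chunks per document on average)
- **Triples**: 12 triples extracted across 3 extraction calls
- **Entity Resolution**: Found duplicate clusters for company and person name variations, using 5 entity-matching and 18 edge-matching calls
- **Knowledge Graph**: Persistent graph with resolved entities and relationships
- **Analytics**: Token usage recorded for every LLM stage; cost was reported as $0.0000 and entity resolution latency as 0.0 ms in this run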

### Next Steps

- Query the knowledge graph with Cypher
- Extend the ontology for new document types
- Fine-tune entity resolution thresholds
- Explore the dashboard for detailed analytics
- Export triples for use in other systems
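
As one possible starting point for the export step, triples can be written to CSV with the standard library. The sketch below assumes each triple is available as a plain dict with `subject`, `predicate`, and `object` keys; that shape, the `triples_to_csv` helper, and the sample row are illustrative assumptions, not Spindle's actual `Triple` type or output from this notebook.

```python
import csv
import io


def triples_to_csv(triples, out):
    """Write subject/predicate/object rows to a file-like object.

    Returns the number of data rows written (the header is not counted).
    """
    writer = csv.writer(out)
    writer.writerow(["subject", "predicate", "object"])
    count = 0
    for triple in triples:
        writer.writerow([triple["subject"], triple["predicate"], triple["object"]])
        count += 1
    return count


# Illustrative data only.
sample = [
    {"subject": "John Smith", "predicate": "works_at", "object": "TechCorp"},
]
buffer = io.StringIO()
n = triples_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write to disk, pass a file opened with `open(path, "w", newline="", encoding="utf-8")` in place of the `StringIO` buffer.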


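In [9]:
# Cleanup: close stores and detach the observer. globals().get() avoids a
# NameError when earlier cells were skipped or the kernel was restarted.
for name in ("kg_store", "vector_store"):
    if (store := globals().get(name)) is not None:
        store.close()
if (detach := globals().get("detach_observer")) is not None:
    detach()  # Remove persistent observer
print("✅ Cleanup finished")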