# Ontology Pipeline Quick Start

This notebook demonstrates how to use the Ontology Pipeline to build semantic knowledge systems from a corpus of documents.

The pipeline consists of six stages:
1. **Controlled Vocabulary** - Extract and define key terms
2. **Metadata Standards** - Define metadata schema
3. **Taxonomy** - Build hierarchical structure
4. **Thesaurus** - Add semantic relationships (BT, NT, RT)
5. **Ontology** - Generate entity and relation types
6. **Knowledge Graph** - Extract triples and build queryable graph
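
To make the thesaurus relationships in stage 4 concrete, here is a minimal, self-contained sketch of how BT (broader term), NT (narrower term), and RT (related term) links behave. It is not Spindle's implementation: the `MiniThesaurus` class and its methods are hypothetical names used only for illustration, and the RT link between TensorFlow and PyTorch is an assumed example.

```python
from collections import defaultdict


class MiniThesaurus:
    """Toy thesaurus (hypothetical): BT/NT are inverse pairs, RT is symmetric."""

    def __init__(self):
        self.broader = defaultdict(set)   # term -> its broader terms (BT)
        self.narrower = defaultdict(set)  # term -> its narrower terms (NT)
        self.related = defaultdict(set)   # term -> its related terms (RT)

    def add_broader(self, term, broader_term):
        # Recording a BT link also records the inverse NT link.
        self.broader[term].add(broader_term)
        self.narrower[broader_term].add(term)

    def add_related(self, term_a, term_b):
        # RT is symmetric, so store both directions.
        self.related[term_a].add(term_b)
        self.related[term_b].add(term_a)

    def ancestors(self, term):
        """Return every term reachable by following BT links upward."""
        seen = set()
        stack = [term]
        while stack:
            for parent in self.broader.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


thesaurus = MiniThesaurus()
thesaurus.add_broader("deep learning", "machine learning")
thesaurus.add_broader("machine learning", "artificial intelligence")
thesaurus.add_related("TensorFlow", "PyTorch")  # assumed RT link, for illustration

print(sorted(thesaurus.ancestors("deep learning")))
# ['artificial intelligence', 'machine learning']
print(sorted(thesaurus.narrower["artificial intelligence"]))
# ['machine learning']
```

Storing BT and NT as inverse pairs keeps the hierarchy consistent no matter which direction a link is entered from.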


## Setup


In [1]:
# Import required modules
from pathlib import Path

from spindle.graph_store import GraphStore
from spindle.ingestion.storage import DocumentCatalog
from spindle.ingestion.storage.corpus import CorpusManager
from spindle.pipeline import (
    PipelineOrchestrator,
    PipelineStage,
    ExtractionStrategyType,
)


In [2]:
# Remove the database and graph left over from previous runs so the demo starts clean
import os
import shutil
if os.path.exists("pipeline_demo.db"):
    os.remove("pipeline_demo.db")
if os.path.exists("pipeline_demo_graph"):
    shutil.rmtree("pipeline_demo_graph")

In [3]:
# Initialize storage components
catalog = DocumentCatalog("sqlite:///pipeline_demo.db")
corpus_manager = CorpusManager(catalog)
graph_store = GraphStore("pipeline_demo_graph")

# Create the pipeline orchestrator
orchestrator = PipelineOrchestrator(corpus_manager, graph_store)
orchestrator.register_default_stages()

print("Pipeline orchestrator initialized with all stages")


Pipeline orchestrator initialized with all stages


## Step 1: Create a Corpus

A corpus is a collection of documents that will be processed through the Ontology Pipeline.


In [4]:
# Create a corpus for our documents
corpus = corpus_manager.create_corpus(
    name="Demo Corpus",
    description="Sample documents for demonstrating the Ontology Pipeline",
)

print(f"Created corpus: {corpus.corpus_id}")
print(f"Name: {corpus.name}")


Created corpus: 40480027-952b-48d7-8c05-ac924e0f6671
Name: Demo Corpus


## Step 2: Add Sample Documents

For this demo, we'll simulate documents by directly inserting sample text.
In production, you'd first ingest documents using `spindle-ingest`.


In [5]:
# Sample documents about AI and machine learning
sample_texts = [
    """
    Machine learning is a subset of artificial intelligence that enables computers 
    to learn from data without being explicitly programmed. Deep learning, a further 
    subset, uses neural networks with multiple layers to model complex patterns.
    
    Supervised learning algorithms learn from labeled training data to make predictions.
    Common examples include classification and regression tasks. In contrast, 
    unsupervised learning finds patterns in unlabeled data, such as clustering.
    
    TensorFlow and PyTorch are popular deep learning frameworks used by researchers
    and practitioners worldwide. Google developed TensorFlow while Meta created PyTorch.
    """,
    """
    Natural language processing (NLP) is a field of AI focused on the interaction
    between computers and human language. Transformer models have revolutionized NLP,
    enabling breakthrough applications like GPT and BERT.
    
    Large language models (LLMs) are trained on massive text datasets and can
    generate human-like text, answer questions, and perform various language tasks.
    
    OpenAI developed GPT-4, while Google created PaLM and Gemini. These models
    use attention mechanisms to process sequential data effectively.
    """,
    """
    Computer vision enables machines to interpret and understand visual information
    from images and videos. Convolutional neural networks (CNNs) are fundamental
    to many computer vision applications.
    
    Object detection, image classification, and semantic segmentation are key
    computer vision tasks. YOLO and ResNet are influential architectures in this field.
    
    Applications include autonomous vehicles, medical imaging, and facial recognition.
    """
]

print(f"Prepared {len(sample_texts)} sample documents")


Prepared 3 sample documents


In [6]:
# For demo purposes, we'll manually create chunk artifacts
# In production, use spindle-ingest to properly ingest documents

import hashlib
from datetime import datetime

from spindle.ingestion.storage.catalog import ChunkRow, DocumentRow

# First, create document and chunk records
doc_ids = []
with catalog.session() as session:
    for i, text in enumerate(sample_texts):
        doc_id = f"demo_doc_{i}"
        chunk_id = f"demo_chunk_{i}"
        doc_ids.append(doc_id)
        encoded = text.encode("utf-8")
        
        # Create document record
        session.merge(DocumentRow(
            document_id=doc_id,
            source_path=f"sample_{i}.txt",
            # Content digest: the built-in hash() of a str changes between Python processes
            checksum=hashlib.sha256(encoded).hexdigest(),
            loader_name="demo",
            template_name="default",
            metadata_={"title": f"Sample Document {i}"},
            created_at=datetime.utcnow(),
            bytes_read=len(encoded),
        ))
        
        # Create chunk record
        session.merge(ChunkRow(
            chunk_id=chunk_id,
            document_id=doc_id,
            text=text.strip(),
            metadata_={"index": i},
        ))

# Then add documents to corpus (outside the session to avoid lock)
corpus_manager.add_documents(corpus.corpus_id, doc_ids)

print(f"Added {len(sample_texts)} documents to corpus")
print(f"Document count: {corpus_manager.get_corpus_document_count(corpus.corpus_id)}")


Added 3 documents to corpus
Document count: 3


## Step 3: Run the Pipeline

Stages can be run individually or all at once. Here we run all six with `run_all`, stopping at the first failure (`stop_on_error=True`).


In [7]:
# Run all pipeline stages
print("Running all pipeline stages...\n")

results = orchestrator.run_all(
    corpus,
    strategy_type=ExtractionStrategyType.SEQUENTIAL,
    stop_on_error=True,
)

for result in results:
    icon = "✓" if result.success else "✗"
    print(f"{icon} {result.stage.value}: ", end="")
    if result.success:
        print(f"{result.artifact_count} artifacts")
    else:
        print(f"FAILED - {result.error_message}")


Running all pipeline stages...

2025-11-27T23:29:48.965 [BAML INFO] Function ExtractControlledVocabulary:
    Client: CustomGPT5Mini (gpt-5-mini-2025-08-07) - 38489ms. StopReason: completed. Tokens(in/out): 604/2634
    ---PROMPT---
    system: You are a knowledge organization expert specializing in controlled vocabulary development.
    Your task is to extract key terms from the text and create clean, disambiguated vocabulary entries.
    user: TEXT TO ANALYZE:
    Machine learning is a subset of artificial intelligence that enables computers 
        to learn from data without being explicitly programmed. Deep learning, a further 
        subset, uses neural networks with multiple layers to model complex patterns.
    
        Supervised learning algorithms learn from labeled training data to make predictions.
        Common examples include classification and regression tasks. In contrast, 
        unsupervised learni

## Step 4: Explore the Results

### Controlled Vocabulary


In [12]:
# Get the vocabulary stage and load terms
vocab_stage = orchestrator.get_stage(PipelineStage.VOCABULARY)
terms = vocab_stage.load_artifacts(corpus.corpus_id)

print(f"Extracted {len(terms)} vocabulary terms:\n")
for term in terms:
    print(f"• {term.preferred_label}")
    print(f"  Definition: {term.definition[:100]}...")
    if term.synonyms:
        print(f"  Synonyms: {', '.join(term.synonyms)}")
    print()


Extracted 29 vocabulary terms:

• machine learning
  Definition: A subfield of artificial intelligence that enables computers to improve performance on tasks by lear...
  Synonyms: ML

• artificial intelligence
  Definition: A broad field of computer science that focuses on creating systems capable of performing tasks that ...
  Synonyms: AI

• deep learning
  Definition: A subset of machine learning that uses neural networks with multiple layers to automatically learn h...
  Synonyms: DL

• neural network
  Definition: A computational model that consists of interconnected layers of simple processing units (neurons) wh...
  Synonyms: neural networks, NN

• supervised learning
  Definition: A category of machine learning that trains models using labeled examples so they can make prediction...
  Synonyms: supervised machine learning

• labeled training data
  Definition: A dataset used for training supervised learning models in which each input instance is paired with a...
  Synonyms: la
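
One way to put the preferred labels and synonyms to work is to normalize free-text mentions to a single preferred label. The sketch below is self-contained and copies a few terms from the output above into a plain dictionary. `build_label_index` and `normalize` are hypothetical helpers, not part of Spindle.

```python
# A few terms copied from the printed output: preferred label -> synonyms
vocabulary = {
    "machine learning": ["ML"],
    "artificial intelligence": ["AI"],
    "deep learning": ["DL"],
    "neural network": ["neural networks", "NN"],
    "supervised learning": ["supervised machine learning"],
}


def build_label_index(vocab):
    """Map every preferred label and synonym (case-folded) to its preferred label."""
    index = {}
    for preferred, synonyms in vocab.items():
        for label in [preferred, *synonyms]:
            index[label.casefold()] = preferred
    return index


def normalize(mention, index):
    """Return the preferred label for a mention, or None if it is unknown."""
    return index.get(mention.strip().casefold())


label_index = build_label_index(vocabulary)
print(normalize("ML", label_index))                 # machine learning
print(normalize("Neural Networks", label_index))    # neural network
print(normalize("quantum computing", label_index))  # None
```

Case-folding makes the lookup tolerant of capitalization, at the cost of conflating labels that differ only by case.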

### Ontology


In [11]:
# Get the generated ontology
ontology_stage = orchestrator.get_stage(PipelineStage.ONTOLOGY)
ontology = ontology_stage.get_ontology(corpus.corpus_id)

if ontology:
    print("Generated ontology with:")
    print(f"  - {len(ontology.entity_types)} entity types")
    print(f"  - {len(ontology.relation_types)} relation types")
    print()
    
    print("Entity Types:")
    for et in ontology.entity_types:
        print(f"  • {et.name}: {et.description}")
    
    print("\nRelation Types:")
    for rt in ontology.relation_types:
        print(f"  • {rt.name}: {rt.domain} → {rt.range}")


Generated ontology with:
  - 9 entity types
  - 12 relation types

Entity Types:
  • ResearchArea: High-level scientific or technological field (e.g., artificial intelligence).
  • Technique: A method, approach, or subfield used within a research area (e.g., machine learning, deep learning).
  • ModelArchitecture: An architecture or model family for computation/learning (e.g., neural network, transformer, CNN, YOLO, ResNet, large language model).
  • Framework: Software library or framework used to build/train models (e.g., TensorFlow, PyTorch, YOLO implementations).
  • Organization: Commercial, academic, or non-profit organization involved in development or research (e.g., Google, Meta).
  • Dataset: A dataset used for model training or evaluation, including information about labels (e.g., labeled training data, unlabeled data).
  • Task: A concrete ML/AI task or problem (e.g., classification, regression, clustering, object detection, image classification, semantic segmentation, faci
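
Each relation type is printed as `domain → range`, which constrains the entity types allowed as subject and object. The sketch below shows how such a constraint could be checked. It is an illustration under stated assumptions: the `RelationType` dataclass here is a stand-in defined in the cell (not Spindle's class), and `developed_by` is a hypothetical relation name, since the relation list is not shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationType:
    """Stand-in for an ontology relation type (not Spindle's class)."""
    name: str
    domain: str  # entity type allowed as the subject
    range: str   # entity type allowed as the object


def is_valid_triple(subject_type, relation, object_type):
    """Check a candidate triple's types against the relation's domain and range."""
    return subject_type == relation.domain and object_type == relation.range


# Hypothetical relation type; the real relation names are not shown in the output.
developed_by = RelationType(name="developed_by", domain="Framework", range="Organization")

# Entity typing taken from the examples in the printed entity type descriptions.
entity_types = {"TensorFlow": "Framework", "Google": "Organization"}

print(is_valid_triple(entity_types["TensorFlow"], developed_by, entity_types["Google"]))
# True
print(is_valid_triple(entity_types["Google"], developed_by, entity_types["TensorFlow"]))
# False
```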

## Summary

This notebook demonstrated the complete Ontology Pipeline workflow:

1. Created a corpus to organize documents
2. Added sample documents about AI/ML
3. Ran all six pipeline stages
4. Explored the extracted artifacts:
   - Controlled vocabulary with definitions
   - Domain ontology with entity and relation types

For production use:
- Use `spindle-ingest` to properly ingest documents
- Use `spindle-pipeline` CLI for batch processing
- Access the REST API for integration with other systems

### Next Steps
- Explore the full taxonomy, thesaurus, and knowledge graph
- Try different extraction strategies (batch, sample-based)
- Query the knowledge graph for insights
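
As a rough picture of what querying the knowledge graph involves, the sketch below does wildcard pattern matching over a handful of triples taken from statements in the sample documents. It uses a plain Python list, not `GraphStore`, and the `match` helper is a hypothetical name for illustration only.

```python
# Triples transcribed by hand from the sample documents (subject, predicate, object)
triples = [
    ("Google", "developed", "TensorFlow"),
    ("Meta", "created", "PyTorch"),
    ("OpenAI", "developed", "GPT-4"),
    ("Google", "created", "PaLM"),
    ("Google", "created", "Gemini"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Everything attributed to Google, regardless of predicate
google_outputs = [o for _, _, o in match(triples, subject="Google")]
print(google_outputs)
# ['TensorFlow', 'PaLM', 'Gemini']

# Who is linked to PyTorch?
print(match(triples, obj="PyTorch"))
# [('Meta', 'created', 'PyTorch')]
```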
