# Document Ingestion Walkthrough

This notebook demonstrates how to route raw documents through the Spindle ingestion pipeline. You'll inspect the default templates, build a pipeline, ingest sample files, and examine the resulting artifacts, chunks, metrics, document graph, and events.


## Getting Started

- Install the repository dependencies (for example, via `uv pip install -e ".[dev]"`).
- Make sure you can import `spindle` and the LangChain components referenced by the default templates.
- Run the cells sequentially; each step builds on the previous one.
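
The import check in the second bullet can be run up front. The following is a minimal sketch, assuming the top-level package names `spindle` and `langchain_community` (the latter inferred from the loader paths used by the default templates). `check_packages` is a hypothetical helper, not part of Spindle.

```python
from importlib.util import find_spec

# Top-level packages this walkthrough imports (assumed names; adjust as needed).
REQUIRED_PACKAGES = ["spindle", "langchain_community"]


def check_packages(names):
    """Map each top-level package name to True if Python can locate it."""
    return {name: find_spec(name) is not None for name in names}


for name, available in check_packages(REQUIRED_PACKAGES).items():
    print(f"{name:<20} {'ok' if available else 'MISSING'}")
```

Only top-level names are checked, because `find_spec` imports parent packages when given a dotted name and raises if a parent is missing.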

### Walkthrough roadmap

1. Inspect default ingestion templates
2. Prepare sample documents
3. Build an ingestion pipeline
4. Execute ingestion
5. Explore metrics, artifacts, chunks, the document graph, and events


In [None]:
from __future__ import annotations

from dataclasses import asdict
from pathlib import Path
from pprint import pprint
import shutil
import textwrap

from spindle.ingestion import (
    DEFAULT_TEMPLATE_SPECS,
    IngestionConfig,
    TemplateRegistry,
    build_ingestion_pipeline,
)

BASE_DIR = Path.cwd()
DATA_DIR = BASE_DIR / "_doc_ingestion_demo"

# Start from a clean demo directory on every run.
if DATA_DIR.exists():
    shutil.rmtree(DATA_DIR)
DATA_DIR.mkdir(parents=True, exist_ok=True)

print(f"Working in: {BASE_DIR}")
print(f"Demo documents will be written to: {DATA_DIR}")


Working in: /Users/thalamus/Repos/spindle/spindle/notebooks
Demo documents will be written to: /Users/thalamus/Repos/spindle/spindle/notebooks/spindle/notebooks/_doc_ingestion_demo


## 1. Inspect the default templates

The ingestion pipeline picks a `TemplateSpec` based on file metadata. The default templates cover common text and PDF files. Let's examine what is registered out of the box.


In [2]:
registry = TemplateRegistry(DEFAULT_TEMPLATE_SPECS)

print(f"Loaded {len(registry)} template(s):\n")
for spec in registry:
    selector = spec.selector
    print(f"- {spec.name}")
    print(f"  description : {spec.description or '—'}")
    print(f"  loader      : {spec.loader}")
    print(
        "  selector    : extensions="
        f"{', '.join(selector.file_extensions) or '—'}"
    )
    print()



Loaded 2 template(s):

- default-text
  description : Plain text and Markdown documents
  loader      : langchain_community.document_loaders.TextLoader
  selector    : extensions=.txt, .md, .mdx, .rst

- default-pdf
  description : PDF documents with layout-aware parsing
  loader      : langchain_community.document_loaders.PDFMinerLoader
  selector    : extensions=.pdf



## 2. Prepare sample documents

We'll fabricate a small knowledge base with two Markdown files so the pipeline has something to ingest. Each file uses headings and bullet points to simulate real content.


In [3]:
sample_docs = {
    "product_overview.md": """
        # Spindle Product Overview

        ## What problem does Spindle solve?
        - Automates extracting structured knowledge from unstructured text.
        - Produces chunk-level artifacts primed for retrieval and graph reasoning.

        ## Key capabilities
        - Template-driven ingestion
        - Document graph construction
        - Event hooks for observability
        - Extensible metadata enrichment

        ## Next steps
        1. Ingest documents
        2. Build embeddings
        3. Integrate with downstream agents
        """,
    "faq.md": """
        # Frequently Asked Questions

        ## How does chunking work?
        Spindle relies on LangChain splitters configured per template. You can
        override chunk sizes, separators, or even switch to semantic splitters.

        ## Where are ingestion metrics stored?
        Metrics are returned with every run. Optionally, connect a document catalog
        or vector store to persist them.

        ## Can I add custom metadata?
        Yes! Provide metadata extractor callables in your template to populate
        document-level context (titles, tags, owners, etc.).
        """,
}

documents_to_ingest = []
for name, body in sample_docs.items():
    path = DATA_DIR / name
    path.write_text(textwrap.dedent(body).strip() + "\n", encoding="utf-8")
    documents_to_ingest.append(path)
    print(f"Created demo document: {path.relative_to(BASE_DIR)}")

print("\nDocuments queued for ingestion:")
for path in documents_to_ingest:
    print(f"- {path}")


Created demo document: spindle/notebooks/_doc_ingestion_demo/product_overview.md
Created demo document: spindle/notebooks/_doc_ingestion_demo/faq.md

Documents queued for ingestion:
- /Users/thalamus/Repos/spindle/spindle/notebooks/spindle/notebooks/_doc_ingestion_demo/product_overview.md
- /Users/thalamus/Repos/spindle/spindle/notebooks/spindle/notebooks/_doc_ingestion_demo/faq.md
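
The cell above relies on `textwrap.dedent(...).strip()` to turn indented triple-quoted strings into clean Markdown. A short standalone sketch of that normalization, using a made-up body rather than the demo files:

```python
import textwrap

# An indented triple-quoted string, shaped like the entries in sample_docs.
body = """
        # Title

        - first point
        - second point
        """

# dedent removes the common leading whitespace; strip drops the blank
# first and last lines; a single trailing newline is then added back.
normalized = textwrap.dedent(body).strip() + "\n"
print(repr(normalized))
# '# Title\n\n- first point\n- second point\n'
```

Without the dedent step, every line would keep its eight leading spaces and Markdown parsers would treat the content as an indented code block.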


## 3. Build the ingestion pipeline

The pipeline combines an `IngestionConfig` with a `TemplateRegistry`. The default templates inspected above are enough for our Markdown examples, so both the config and the registry are built from `DEFAULT_TEMPLATE_SPECS`.


In [4]:
config = IngestionConfig(template_specs=DEFAULT_TEMPLATE_SPECS)
pipeline = build_ingestion_pipeline(config=config, registry=registry)

print("Pipeline ready.")
print(f"Templates registered: {[spec.name for spec in registry]}")


Pipeline ready.
Templates registered: ['default-text', 'default-pdf']


## 4. Execute ingestion

Call `pipeline.ingest(paths=...)` with a list of `Path` objects. The run returns document artifacts, chunk records, a document graph, events, and metrics.


In [5]:
result = pipeline.ingest(paths=documents_to_ingest)

print("Ingestion complete.")
print(f"Documents processed: {len(result.documents)}")
print(f"Chunks produced   : {len(result.chunks)}")


Ingestion complete.
Documents processed: 2
Chunks produced   : 2


## 5. Explore the results

Start with the run metrics to understand what happened during ingestion.


In [6]:
metrics = result.metrics

print(f"Started at          : {metrics.started_at:%Y-%m-%d %H:%M:%S}")
print(f"Finished at         : {metrics.finished_at:%Y-%m-%d %H:%M:%S}")
print(f"Processed documents : {metrics.processed_documents}")
print(f"Processed chunks    : {metrics.processed_chunks}")
print(f"Bytes read          : {metrics.bytes_read}")
print(f"Errors              : {metrics.errors or '—'}\n")

print("Stage durations (ms):")
for stage, duration in metrics.extra.get("stage_durations_ms", {}).items():
    print(f"- {stage:<12} {duration:.2f}")

print("\nStage call counts:")
for stage, count in metrics.extra.get("stage_calls", {}).items():
    print(f"- {stage:<12} {count}")


Started at          : 2025-11-07 22:51:11
Finished at         : 2025-11-07 22:51:11
Processed documents : 2
Processed chunks    : 2
Bytes read          : 946
Errors              : —

Stage durations (ms):
- checksum     0.20
- load         137.59
- preprocess   0.01
- split        0.19
- metadata     0.01
- chunks       0.05
- graph        0.04

Stage call counts:
- checksum     2
- load         2
- preprocess   2
- split        2
- metadata     2
- chunks       2
- graph        2
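
A quick way to read these numbers is to rank stages by their share of total time. The sketch below hard-codes the durations printed above so it runs on its own; in the notebook you could substitute `metrics.extra.get("stage_durations_ms", {})`.

```python
# Stage durations in milliseconds, copied from the run output above.
stage_durations_ms = {
    "checksum": 0.20,
    "load": 137.59,
    "preprocess": 0.01,
    "split": 0.19,
    "metadata": 0.01,
    "chunks": 0.05,
    "graph": 0.04,
}

total_ms = sum(stage_durations_ms.values())

# Sort stages from slowest to fastest.
ranked = sorted(stage_durations_ms.items(), key=lambda item: item[1], reverse=True)

for stage, duration in ranked:
    print(f"{stage:<12} {duration:8.2f} ms  {duration / total_ms:6.1%}")
```

In this run the `load` stage accounts for more than 99% of the measured time, so loader choice matters far more than splitting or graph construction for small text files.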


### Document artifacts

Each `DocumentArtifact` captures high-level metadata about a source file. The raw bytes are available if you need to persist or reprocess the original content.


In [7]:
for artifact in result.documents:
    artifact_dict = asdict(artifact)
    raw_bytes = artifact_dict.pop("raw_bytes", None)
    artifact_dict["raw_bytes"] = f"{len(raw_bytes)} bytes" if raw_bytes else "—"
    pprint(artifact_dict)
    print()


{'checksum': 'ce9069f071b88346fc1c5c687f05a51c08081ed6a8a21cd9af1f7cf71bb12e2e',
 'created_at': datetime.datetime(2025, 11, 7, 22, 51, 11, 204714),
 'document_id': 'a6f75dcf673149b0b9045bb9f6a4b45c',
 'loader_name': 'langchain_community.document_loaders.TextLoader',
 'metadata': {},
 'raw_bytes': '442 bytes',
 'source_path': PosixPath('/Users/thalamus/Repos/spindle/spindle/notebooks/spindle/notebooks/_doc_ingestion_demo/product_overview.md'),
 'template_name': 'default-text'}

{'checksum': '8fce55c63acf05dfc68f97c0fbf326d32f0c256207dea7d4e8a508111a5819e2',
 'created_at': datetime.datetime(2025, 11, 7, 22, 51, 11, 212722),
 'document_id': 'a0c002723159470ebafe1106fb78fe35',
 'loader_name': 'langchain_community.document_loaders.TextLoader',
 'metadata': {},
 'raw_bytes': '504 bytes',
 'source_path': PosixPath('/Users/thalamus/Repos/spindle/spindle/notebooks/spindle/notebooks/_doc_ingestion_demo/faq.md'),
 'template_name': 'default-text'}
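
The 64-character hexadecimal checksums above have the length of a SHA-256 digest, although this notebook does not state which algorithm Spindle uses. Under that assumption, a file checksum could be computed as in the sketch below; `sha256_of_file` is a hypothetical helper.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, block_size=65536):
    """Hash a file in fixed-size blocks so large files are not read at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Demonstrate on a throwaway file rather than the demo documents.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.md"
    sample.write_bytes(b"# Hello\n")
    checksum = sha256_of_file(sample)

print(len(checksum))  # 64
```

A content checksum like this lets a catalog detect unchanged files and skip reprocessing them, assuming the stored value is compared before loading.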



### Chunk artifacts

Chunks are suitable for vector stores or downstream retrieval. Below we preview the first 160 characters of each chunk to confirm the splitter configuration behaved as expected.


In [8]:
for index, chunk in enumerate(result.chunks, start=1):
    preview = chunk.text.replace("\n", " ")[:160]
    print(f"Chunk {index}")
    print(f"  chunk_id   : {chunk.chunk_id}")
    print(f"  document_id: {chunk.document_id}")
    print(f"  metadata   : {chunk.metadata}")
    print(f"  preview    : {preview}...")
    print()


Chunk 1
  chunk_id   : b6b4007997574adda39ddf6b900dbf68
  document_id: a6f75dcf673149b0b9045bb9f6a4b45c
  metadata   : {'source': '/Users/thalamus/Repos/spindle/spindle/notebooks/spindle/notebooks/_doc_ingestion_demo/product_overview.md', 'document_id': 'a6f75dcf673149b0b9045bb9f6a4b45c'}
  preview    : # Spindle Product Overview  ## What problem does Spindle solve? - Automates extracting structured knowledge from unstructured text. - Produces chunk-level artif...

Chunk 2
  chunk_id   : e76222dfd0eb4c0f823bb37a5a46415b
  document_id: a0c002723159470ebafe1106fb78fe35
  metadata   : {'source': '/Users/thalamus/Repos/spindle/spindle/notebooks/spindle/notebooks/_doc_ingestion_demo/faq.md', 'document_id': 'a0c002723159470ebafe1106fb78fe35'}
  preview    : # Frequently Asked Questions  ## How does chunking work? Spindle relies on LangChain splitters configured per template. You can override chunk sizes, separators...
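
Each document produced exactly one chunk. A plausible explanation is that both files (442 and 504 bytes) are smaller than the configured chunk size, though the actual splitter settings are not shown in this notebook. The sketch below illustrates the effect with a naive fixed-width character splitter; it is not LangChain's algorithm, and `chunk_size=1000` and `overlap=100` are made-up values.

```python
def split_text(text, chunk_size=1000, overlap=100):
    """Naive fixed-width splitter with overlapping windows (illustrative only)."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    if len(text) <= chunk_size:
        # Short inputs fit in a single chunk; empty input yields no chunks.
        return [text] if text else []
    step = chunk_size - overlap
    return [
        text[start:start + chunk_size]
        for start in range(0, len(text) - overlap, step)
    ]


print(len(split_text("x" * 442)))                 # 1
print([len(c) for c in split_text("x" * 2500)])   # [1000, 1000, 700]
```

To see multi-chunk behavior in the walkthrough itself, ingest a longer document or lower the chunk size in a custom template.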



### Document graph snapshot

The pipeline also builds a `DocumentGraph` that links documents to their chunks. More advanced templates can add extra relationships (for example, linking glossary terms or headings).


In [9]:
graph = result.document_graph

print(f"Graph nodes : {len(graph.nodes)}")
print(f"Graph edges : {len(graph.edges)}\n")

print("Sample nodes:")
for node in graph.nodes[:6]:
    pprint(asdict(node))
    print()

print("Sample edges:")
for edge in graph.edges[:6]:
    pprint(asdict(edge))
    print()


Graph nodes : 4
Graph edges : 2

Sample nodes:
{'attributes': {},
 'document_id': 'a6f75dcf673149b0b9045bb9f6a4b45c',
 'label': 'product_overview.md',
 'node_id': 'doc::a6f75dcf673149b0b9045bb9f6a4b45c'}

{'attributes': {'document_id': 'a6f75dcf673149b0b9045bb9f6a4b45c',
                'source': '/Users/thalamus/Repos/spindle/spindle/notebooks/spindle/notebooks/_doc_ingestion_demo/product_overview.md'},
 'document_id': 'a6f75dcf673149b0b9045bb9f6a4b45c',
 'label': 'Chunk 0',
 'node_id': 'chunk::b6b4007997574adda39ddf6b900dbf68'}

{'attributes': {},
 'document_id': 'a0c002723159470ebafe1106fb78fe35',
 'label': 'faq.md',
 'node_id': 'doc::a0c002723159470ebafe1106fb78fe35'}

{'attributes': {'document_id': 'a0c002723159470ebafe1106fb78fe35',
                'source': '/Users/thalamus/Repos/spindle/spindle/notebooks/spindle/notebooks/_doc_ingestion_demo/faq.md'},
 'document_id': 'a0c002723159470ebafe1106fb78fe35',
 'label': 'Chunk 0',
 'node_id': 'chunk::e76222dfd0eb4c0f823bb37a5a46415b'}
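
The node listing already carries enough information to reconstruct the document-to-chunk relationship: `node_id` prefixes distinguish documents from chunks, and both share a `document_id`. The sketch below works on plain dictionaries shaped like the `asdict(node)` output above, with IDs shortened for readability; it does not assume anything about the edge fields, which are not shown here.

```python
from collections import defaultdict

# Dictionaries shaped like the node output above (IDs shortened for readability).
nodes = [
    {"node_id": "doc::a6f7", "document_id": "a6f7", "label": "product_overview.md"},
    {"node_id": "chunk::b6b4", "document_id": "a6f7", "label": "Chunk 0"},
    {"node_id": "doc::a0c0", "document_id": "a0c0", "label": "faq.md"},
    {"node_id": "chunk::e762", "document_id": "a0c0", "label": "Chunk 0"},
]

labels = {}                             # document_id -> file name
chunks_by_document = defaultdict(list)  # document_id -> chunk node ids

for node in nodes:
    kind, _, _ = node["node_id"].partition("::")
    if kind == "doc":
        labels[node["document_id"]] = node["label"]
    elif kind == "chunk":
        chunks_by_document[node["document_id"]].append(node["node_id"])

for document_id, label in labels.items():
    print(f"{label} -> {chunks_by_document[document_id]}")
```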


### Emitted events

Every pipeline stage pushes `IngestionEvent` records that you can stream to observers for logging, metrics, or tracing. Here we echo them inline.


In [10]:
for event in result.events:
    print(f"{event.timestamp:%Y-%m-%d %H:%M:%S} | {event.name:<15} | {event.payload}")


2025-11-07 22:51:11 | stage_start     | {'stage': 'checksum'}
2025-11-07 22:51:11 | stage_complete  | {'stage': 'checksum', 'duration_ms': 0.12024299940094352}
2025-11-07 22:51:11 | stage_start     | {'stage': 'load'}
2025-11-07 22:51:11 | stage_complete  | {'stage': 'load', 'duration_ms': 137.5019010156393}
2025-11-07 22:51:11 | stage_start     | {'stage': 'preprocess'}
2025-11-07 22:51:11 | stage_complete  | {'stage': 'preprocess', 'duration_ms': 0.00528499367646873}
2025-11-07 22:51:11 | stage_start     | {'stage': 'split'}
2025-11-07 22:51:11 | stage_complete  | {'stage': 'split', 'duration_ms': 0.13472599675878882}
2025-11-07 22:51:11 | stage_start     | {'stage': 'metadata'}
2025-11-07 22:51:11 | stage_complete  | {'stage': 'metadata', 'duration_ms': 0.004964007530361414}
2025-11-07 22:51:11 | stage_start     | {'stage': 'chunks'}
2025-11-07 22:51:11 | stage_complete  | {'stage': 'chunks', 'duration_ms': 0.026021996745839715}
2025-11-07 22:51:11 | stage_start     | {'stage': 'gra

## Next steps

- Add or override templates with `TemplateSpec` instances tailored to your corpus.
- Connect a document catalog or vector store (for example, Chroma) when you want results persisted automatically.
- Attach observers to `build_ingestion_pipeline` to feed events into your logging or monitoring stack.
- Extend metadata extractors to enrich document context before indexing.
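
As a starting point for the observer idea, the sketch below shows a callable that accumulates per-stage timings from `stage_complete` events. How observers are registered with `build_ingestion_pipeline` is not shown in this notebook, so the example calls the observer directly. `SketchEvent` and `StageTimingObserver` are hypothetical names; only the `timestamp`, `name`, and `payload` fields mirror the events printed earlier.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class SketchEvent:
    """Hypothetical stand-in with the fields seen on the printed events."""

    timestamp: datetime
    name: str
    payload: dict


@dataclass
class StageTimingObserver:
    """Callable that sums duration_ms per stage from stage_complete events."""

    totals_ms: dict[str, float] = field(default_factory=dict)

    def __call__(self, event):
        if event.name != "stage_complete":
            return  # Ignore stage_start and any other event types.
        stage = event.payload["stage"]
        duration = event.payload["duration_ms"]
        self.totals_ms[stage] = self.totals_ms.get(stage, 0.0) + duration


observer = StageTimingObserver()
now = datetime.now()
for event in [
    SketchEvent(now, "stage_start", {"stage": "load"}),
    SketchEvent(now, "stage_complete", {"stage": "load", "duration_ms": 100.0}),
    SketchEvent(now, "stage_complete", {"stage": "load", "duration_ms": 37.5}),
]:
    observer(event)

print(observer.totals_ms)  # {'load': 137.5}
```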
