[backend] CorpusBackend — wiki/raw knowledge storage abstracted from filesystem walks #65

@dep0we

Background

<agent>/wiki/ (Atomic Wiki: a distilled knowledge corpus, Karpathy-style) and <agent>/raw/ (source documents that feed the wiki) are filesystem directories. Pages are discovered by recursive directory walks, and reads call frontmatter.load() directly.

There's no abstraction layer between the agent and the corpus storage.

Why it matters

Memory backend (#57) handles short-term, agent-evolving notes. Wiki/raw is a different beast — long-term, often-large knowledge corpus. The scaling pressures are different:

  • Volume. A wiki/raw corpus can run into hundreds of MB or GB (PDFs, transcripts, scraped content). Filesystem reads stay fast for hundreds of files but degrade as the file count and total size grow. SQLite is fine up to ~10 GB; Postgres FTS handles larger corpora. A VectorCorpusBackend makes RAG-style retrieval possible.
  • Search semantics. Filesystem grep is keyword-only. Operators want semantic search across the corpus ("find every wiki page where I discussed avalanche vs snowball"). That's a query(text, top_k) shape on a backend, not a directory walk.
  • Multi-tenant. SaaS deployment with shared corpora (every tenant's agent reads the same internal knowledge base) needs the wiki to live somewhere other than per-tenant filesystem.
  • Sync / ingestion. A backend lets ingestion be a write API, not "drop files in raw/ and hope the watcher picks them up."

What to change

  1. New module atomic_agents/corpus/ with backend.py (Protocol) + filesystem.py (default wrapping current wiki/ + raw/ walks).
  2. CorpusBackend protocol exposes:
    • list_pages(corpus="wiki" | "raw") → list of CorpusRef
    • read_page(name, corpus) → CorpusPage (body + metadata)
    • write_page(name, content, corpus) — wiki distillation writes
    • query(text, top_k, corpus) — semantic search (capability-gated; FS backend can fall back to keyword)
    • stats(corpus) — page count, total bytes, last update
    • ingest(source, corpus) — operator-driven addition (capability-gated)
  3. Migrate call sites: agent prompt assembly (wiki INDEX read), wiki writers (if any), dashboard wiki tab.
  4. Spec doc docs/spec/26-corpus-backend.md.
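
A minimal sketch of how the protocol and the filesystem default listed above might look. It is an illustration under assumptions, not the proposed implementation: the field names on CorpusRef and CorpusPage, the return types of query() and stats(), and the choice to scan only *.md files are all hypothetical. It stays stdlib-only, so frontmatter parsing is left out and metadata stays empty, and the capability-gated ingest() is omitted.

```python
from __future__ import annotations

import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Literal, Protocol, runtime_checkable

Corpus = Literal["wiki", "raw"]


@dataclass(frozen=True)
class CorpusRef:
    name: str  # path relative to the corpus root, e.g. "debt/avalanche.md"
    corpus: Corpus


@dataclass
class CorpusPage:
    ref: CorpusRef
    body: str
    metadata: dict = field(default_factory=dict)  # empty here: no frontmatter parsing


@runtime_checkable
class CorpusBackend(Protocol):
    # ingest() is left out of this sketch; the issue marks it capability-gated.
    def list_pages(self, corpus: Corpus) -> list[CorpusRef]: ...
    def read_page(self, name: str, corpus: Corpus) -> CorpusPage: ...
    def write_page(self, name: str, content: str, corpus: Corpus) -> None: ...
    def query(self, text: str, top_k: int, corpus: Corpus) -> list[CorpusRef]: ...
    def stats(self, corpus: Corpus) -> dict: ...


class FilesystemCorpusBackend:
    """Stand-in for the default backend: <agent_root>/wiki and <agent_root>/raw."""

    def __init__(self, agent_root: Path) -> None:
        self.root = Path(agent_root)

    def _files(self, corpus: Corpus) -> list[Path]:
        base = self.root / corpus
        return sorted(base.rglob("*.md")) if base.is_dir() else []

    def list_pages(self, corpus: Corpus) -> list[CorpusRef]:
        base = self.root / corpus
        return [CorpusRef(p.relative_to(base).as_posix(), corpus) for p in self._files(corpus)]

    def read_page(self, name: str, corpus: Corpus) -> CorpusPage:
        body = (self.root / corpus / name).read_text(encoding="utf-8")
        return CorpusPage(CorpusRef(name, corpus), body)

    def write_page(self, name: str, content: str, corpus: Corpus) -> None:
        path = self.root / corpus / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content, encoding="utf-8")

    def query(self, text: str, top_k: int, corpus: Corpus) -> list[CorpusRef]:
        # Keyword fallback: rank pages by how often the query terms occur.
        terms = text.lower().split()
        scored = []
        for ref in self.list_pages(corpus):
            body = self.read_page(ref.name, corpus).body.lower()
            score = sum(body.count(term) for term in terms)
            if score:
                scored.append((score, ref))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [ref for _, ref in scored[:top_k]]

    def stats(self, corpus: Corpus) -> dict:
        files = self._files(corpus)
        return {
            "pages": len(files),
            "total_bytes": sum(p.stat().st_size for p in files),
            "last_update": max((p.stat().st_mtime for p in files), default=None),
        }


# Usage against a throwaway directory.
backend = FilesystemCorpusBackend(Path(tempfile.mkdtemp()))
backend.write_page("debt/avalanche.md", "avalanche beats snowball; avalanche wins", "wiki")
backend.write_page("index.md", "snowball only", "wiki")
hits = backend.query("avalanche snowball", top_k=1, corpus="wiki")
```

Passing the corpus as a parameter rather than creating two backend instances matches the tentative answer in the open questions; a two-instance design would drop the corpus argument from every method instead.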

Relationship to MemoryBackend

Memory and corpus are intentionally separate primitives. Memory is short, agent-mutable, behavioral. Corpus is long, distilled, knowledge-shaped. They share a write-path discipline but have very different access patterns. Don't collapse them into one abstraction.

Future backends

  • SQLiteCorpusBackend — single-box, fast keyword + FTS5 full-text search
  • PostgresCorpusBackend — multi-tenant, full Postgres FTS
  • PgvectorCorpusBackend — semantic search via embeddings
  • S3CorpusBackend — large file storage with metadata in DB
  • ChromaCorpusBackend / WeaviateCorpusBackend — purpose-built vector stores

Acceptance

  • Existing wiki tests + raw read paths pass with FilesystemCorpusBackend as default.
  • Protocol conformance suite (~12 tests) — list, read, write, stats, query (with FS fallback to keyword).
  • One vector-shaped mock backend proves the semantic-search capability fits.
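
One way the vector-shaped mock from the acceptance list could look, as a rough sketch. Everything here is an assumption made for illustration: the class name MockVectorCorpusBackend, the `capabilities` attribute as a way to advertise gated features, and the toy term-frequency "embedding" with cosine similarity, which stands in for a real embedding model. For brevity the methods return plain page names and bodies rather than CorpusRef and CorpusPage objects.

```python
from __future__ import annotations

import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a term-frequency vector. A real backend would call a model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class MockVectorCorpusBackend:
    """In-memory mock; `capabilities` is a hypothetical capability-gating hook."""

    capabilities = frozenset({"semantic_query"})

    def __init__(self) -> None:
        # Keyed by (corpus, page name).
        self._pages: dict[tuple[str, str], str] = {}
        self._vectors: dict[tuple[str, str], Counter] = {}

    def write_page(self, name: str, content: str, corpus: str) -> None:
        self._pages[(corpus, name)] = content
        self._vectors[(corpus, name)] = embed(content)

    def list_pages(self, corpus: str) -> list[str]:
        return sorted(name for c, name in self._pages if c == corpus)

    def read_page(self, name: str, corpus: str) -> str:
        return self._pages[(corpus, name)]

    def query(self, text: str, top_k: int, corpus: str) -> list[str]:
        # Rank pages in one corpus by similarity to the query vector.
        q = embed(text)
        scored = [
            (cosine(q, vec), name)
            for (c, name), vec in self._vectors.items()
            if c == corpus
        ]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [name for score, name in scored[:top_k] if score > 0]

    def stats(self, corpus: str) -> dict:
        bodies = [body for (c, _), body in self._pages.items() if c == corpus]
        return {"pages": len(bodies), "total_bytes": sum(len(b.encode("utf-8")) for b in bodies)}


# Usage: the same query(text, top_k, corpus) shape as the keyword fallback.
mock = MockVectorCorpusBackend()
mock.write_page("avalanche.md", "pay the highest interest debt first", "wiki")
mock.write_page("snowball.md", "pay the smallest balance debt first", "wiki")
mock.write_page("notes.md", "meeting transcript", "raw")
top = mock.query("highest interest", top_k=1, corpus="wiki")
```

A conformance suite could run the same list, read, write, stats, and query assertions against both this mock and the filesystem backend, and check `capabilities` before exercising the semantic path.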

Open questions

  • Should wiki and raw be separate corpus types in the same backend, or two backend instances? (Probably same backend, different "corpus" parameter — consistent with how filesystem implements them as sibling directories.)
  • Embeddings storage: per-page embedding column vs. external vector store? Backend choice.
  • Page versioning: same as memory (.versions/ per page), or different model? Probably same to keep operators' mental model consistent.

Labels: backend (Protocol-pattern backend abstractions: memory, logs, locks, etc.), enhancement (New feature or request)