Background
<agent>/wiki/ (Atomic Wiki — distilled knowledge corpus, Karpathy-style) and <agent>/raw/ (source documents that feed the wiki) are filesystem directories. Discovery is recursive walks; reads are direct frontmatter.load().
There's no abstraction layer between the agent and the corpus storage.
Why it matters
Memory backend (#57) handles short-term, agent-evolving notes. Wiki/raw is a different beast: a long-term, often large knowledge corpus. The scaling pressures are different:
- Volume. A wiki/raw corpus can run into hundreds of MB or GB (PDFs, transcripts, scraped content). Filesystem reads stay fast for hundreds of files but degrade as the corpus grows beyond that. SQLite is fine to ~10GB; Postgres FTS handles larger corpora. A VectorCorpusBackend makes RAG-style retrieval possible.
- Search semantics. Filesystem grep is keyword-only. Operators want semantic search across the corpus ("find every wiki page where I discussed avalanche vs snowball"). That's a query(text, top_k) shape on a backend, not a directory walk.
- Multi-tenant. A SaaS deployment with shared corpora (every tenant's agent reads the same internal knowledge base) needs the wiki to live somewhere other than a per-tenant filesystem.
- Sync / ingestion. A backend lets ingestion be a write API, not "drop files in raw/ and hope the watcher picks them up."
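The query(text, top_k) shape mentioned above can be sketched with a keyword-only fallback, roughly what a filesystem backend might do when no embeddings are available. Everything here is illustrative: the keyword_query name, the .md extension, and the term-count scoring are assumptions, not part of the existing codebase.

```python
import tempfile
from pathlib import Path


def keyword_query(root: Path, text: str, top_k: int = 5) -> list:
    """Rank pages under `root` by how often the query terms occur in them."""
    terms = text.lower().split()
    scored = []
    for path in sorted(root.rglob("*.md")):  # same recursive walk as today
        body = path.read_text(encoding="utf-8").lower()
        score = sum(body.count(term) for term in terms)
        if score > 0:
            scored.append((path.relative_to(root).as_posix(), score))
    # Highest score first; ties broken by page name so output is stable.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


def demo() -> list:
    """Build a throwaway wiki directory and run one query against it."""
    with tempfile.TemporaryDirectory() as tmp:
        wiki = Path(tmp) / "wiki"
        (wiki / "finance").mkdir(parents=True)
        (wiki / "finance" / "debt.md").write_text(
            "avalanche vs snowball: avalanche wins on interest", encoding="utf-8"
        )
        (wiki / "notes.md").write_text("snowball keeps motivation high", encoding="utf-8")
        (wiki / "other.md").write_text("unrelated page", encoding="utf-8")
        return keyword_query(wiki, "avalanche snowball", top_k=2)


if __name__ == "__main__":
    print(demo())
```

A backend with real semantic search would keep the same call signature and swap the scoring for an embedding lookup, which is the point of moving the operation behind an interface.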
What to change
- New module atomic_agents/corpus/ with backend.py (Protocol) + filesystem.py (default wrapping current wiki/ + raw/ walks).
- CorpusBackend protocol exposes:
  - list_pages(corpus="wiki" | "raw") → list of CorpusRef
  - read_page(name, corpus) → CorpusPage (body + metadata)
  - write_page(name, content, corpus): wiki distillation writes
  - query(text, top_k, corpus): semantic search (capability-gated; FS backend can fall back to keyword)
  - stats(corpus): page count, total bytes, last update
  - ingest(source, corpus): operator-driven addition (capability-gated)
- Migrate call sites: agent prompt assembly (wiki INDEX read), wiki writers (if any), dashboard wiki tab.
- Spec doc: docs/spec/26-corpus-backend.md.
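One possible shape for the protocol listed above, written as a typing.Protocol with a toy in-memory implementation to show it is satisfiable. The field names on CorpusRef and CorpusPage, the CorpusStats type, the capabilities attribute, and InMemoryCorpusBackend are all assumptions made for illustration; the issue only fixes the method names and rough return types.

```python
import time
from dataclasses import dataclass, field
from typing import Literal, Optional, Protocol, runtime_checkable

Corpus = Literal["wiki", "raw"]


@dataclass(frozen=True)
class CorpusRef:
    name: str
    corpus: Corpus


@dataclass
class CorpusPage:
    ref: CorpusRef
    body: str
    metadata: dict = field(default_factory=dict)


@dataclass(frozen=True)
class CorpusStats:
    page_count: int
    total_bytes: int
    last_update: Optional[float]  # POSIX timestamp, None when the corpus is empty


@runtime_checkable
class CorpusBackend(Protocol):
    # Hypothetical capability flags, e.g. {"query", "ingest"}.
    capabilities: frozenset

    def list_pages(self, corpus: Corpus) -> list: ...
    def read_page(self, name: str, corpus: Corpus) -> CorpusPage: ...
    def write_page(self, name: str, content: str, corpus: Corpus) -> None: ...
    def query(self, text: str, top_k: int, corpus: Corpus) -> list: ...
    def stats(self, corpus: Corpus) -> CorpusStats: ...
    def ingest(self, source: str, corpus: Corpus) -> CorpusRef: ...


class InMemoryCorpusBackend:
    """Toy backend: supports the core calls, declares no optional capabilities."""

    capabilities = frozenset()

    def __init__(self) -> None:
        # Maps (corpus, name) to (content, last write time).
        self._pages = {}

    def list_pages(self, corpus):
        return [CorpusRef(n, c) for (c, n) in sorted(self._pages) if c == corpus]

    def read_page(self, name, corpus):
        content, _ = self._pages[(corpus, name)]
        return CorpusPage(CorpusRef(name, corpus), content)

    def write_page(self, name, content, corpus):
        self._pages[(corpus, name)] = (content, time.time())

    def query(self, text, top_k, corpus):
        raise NotImplementedError("backend does not declare the 'query' capability")

    def stats(self, corpus):
        entries = [v for (c, _), v in self._pages.items() if c == corpus]
        total = sum(len(content.encode("utf-8")) for content, _ in entries)
        latest = max((ts for _, ts in entries), default=None)
        return CorpusStats(len(entries), total, latest)

    def ingest(self, source, corpus):
        raise NotImplementedError("backend does not declare the 'ingest' capability")
```

Callers would check something like "query" in backend.capabilities before calling a gated method. Whether gating is expressed as a set of flags, separate sub-protocols, or a raised exception is left open here.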
Relationship to MemoryBackend
Memory and corpus are intentionally separate primitives. Memory is short, agent-mutable, behavioral. Corpus is long, distilled, knowledge-shaped. They share a write-path discipline but have very different access patterns. Don't collapse them into a single abstraction.
Future backends
- SQLiteCorpusBackend: single-box, fast keyword + FTS5 full-text search
- PostgresCorpusBackend: multi-tenant, full Postgres FTS
- PgvectorCorpusBackend: semantic search via embeddings
- S3CorpusBackend: large file storage with metadata in DB
- ChromaCorpusBackend / WeaviateCorpusBackend: purpose-built vector stores
Acceptance
- Existing wiki tests + raw read paths pass with FilesystemCorpusBackend as default.
- Protocol conformance suite (~12 tests) — list, read, write, stats, query (with FS fallback to keyword).
- One vector-shaped mock backend proves the semantic-search capability fits.
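A vector-shaped mock for the last acceptance item could look roughly like the sketch below. The embed function is a stand-in bag-of-words vector, not a real semantic model, so it only proves that write_page and query(text, top_k, corpus) fit a similarity-ranked backend. MockVectorCorpusBackend and the capabilities attribute are hypothetical names.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: sparse term counts (NOT a real semantic model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MockVectorCorpusBackend:
    """Keeps one vector per page and ranks pages by similarity to the query."""

    capabilities = frozenset({"query"})  # assumed capability flag

    def __init__(self) -> None:
        # Maps (corpus, name) to the page's vector.
        self._vectors = {}

    def write_page(self, name: str, content: str, corpus: str) -> None:
        self._vectors[(corpus, name)] = embed(content)

    def query(self, text: str, top_k: int, corpus: str) -> list:
        q = embed(text)
        scored = [
            (cosine(q, vec), name)
            for (c, name), vec in self._vectors.items()
            if c == corpus
        ]
        # Most similar first; ties broken by name. Zero-similarity pages are dropped.
        scored.sort(key=lambda s: (-s[0], s[1]))
        return [name for score, name in scored[:top_k] if score > 0]


if __name__ == "__main__":
    backend = MockVectorCorpusBackend()
    backend.write_page("debt", "avalanche versus snowball payoff", "wiki")
    backend.write_page("garden", "tomato planting schedule", "wiki")
    print(backend.query("snowball avalanche", top_k=2, corpus="wiki"))
```

The same conformance tests that exercise the filesystem keyword fallback could then run against this mock, with only the ranking expectations differing.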
Open questions
- Should wiki and raw be separate corpus types in the same backend, or two backend instances? (Probably same backend, different "corpus" parameter, consistent with how the filesystem implements them as sibling directories.)
- Embeddings storage: per-page embedding column vs. external vector store? Backend choice.
- Page versioning: same as memory (.versions/ per page), or different model? Probably same to keep operators' mental model consistent.
Context
MemoryBackend from PR "refactor(memory): extract MemoryBackend protocol; FilesystemBackend default" (#57)