Proposal: LLM-maintained knowledge base from Swarm papers and specs #5425

@gacevicljubisa

Description

Summary

Create a structured, interlinked Markdown knowledge base that converts all Swarm publications from papers.ethswarm.org into wiki pages — maintained by LLM agents, not humans. The bee repo would reference this knowledge base via AGENTS.md or CLAUDE.md, giving AI coding agents deep Swarm context when working on the codebase.

Motivation

AI coding agents (Claude Code, Codex, Cursor, etc.) are increasingly used to contribute to open-source projects. These agents work best when they have structured domain knowledge — not just code, but the concepts, architecture, protocols, and design rationale behind the code.

Swarm has excellent source material across multiple publications, but today this knowledge is locked in PDFs and scattered docs. An AI agent working on bee has no efficient way to understand why things are designed the way they are, what trade-offs were made, or how the protocol layers interact. A developer asking an AI to work on the redistribution game, for example, would benefit enormously from the agent having access to the formal definitions (Definitions 23–43 from the spec), the design rationale (Book of Swarm Chapter 3), and how the game interacts with postage stamps, reserve sampling, and the price oracle — all cross-referenced in one place.

How It Would Work

The approach follows the LLM Wiki pattern described by Andrej Karpathy — a persistent, compounding knowledge base maintained by LLM agents rather than humans.

Architecture — Three Layers

```
raw/              # Layer 1: Immutable source documents (converted Markdown from PDFs)
wiki/             # Layer 2: LLM-maintained structured wiki pages
CLAUDE.md         # Layer 3: Schema — conventions, page types, workflows
```

Layer 1 — Raw Sources: All PDFs from papers.ethswarm.org converted to Markdown (using tools like Datalab/Marker). These are immutable — the LLM reads but never modifies them.

Layer 2 — Wiki Pages: Structured pages organized by type (concepts, protocols, incentives, integration, papers, queries). Each page has YAML frontmatter, cross-references, glossary terms, formal definitions where applicable, and source citations.
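As an illustration, a concept page could open like the sketch below (the file name, field values, and section layout are hypothetical, not a fixed format):

```markdown
---
title: Chunks
type: concept
sources:
  - raw/book-of-swarm.md
tags: [storage, disc]
last_updated: 2024-01-01
---

# Chunks

Chunks are the canonical 4 KB unit of storage in Swarm...

## Sources
- The Book of Swarm, storage chapter
```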

Layer 3 — Schema: A CLAUDE.md that defines the wiki's structure, page format conventions, and three core workflows:

  • Ingest — process a source document into wiki pages, update cross-references, update index, log the operation
  • Query — answer questions by reading relevant wiki pages; file valuable answers back as new pages
  • Lint — two-layer health check: automated structural checks, then LLM semantic analysis (contradictions, gaps, missing connections)
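To make the automated half of the Lint workflow concrete, here is a minimal sketch of one structural check — broken internal links. The function name and the `[text](page.md)` link convention are assumptions; the real linter would cover more checks:

```python
import re
from pathlib import Path

# Matches Markdown links to .md targets, ignoring any #anchor suffix.
WIKI_LINK = re.compile(r"\[[^\]]*\]\(([^)#]+\.md)")

def find_broken_links(wiki_dir):
    """Return (page, target) pairs where a link points at a missing wiki page."""
    wiki = Path(wiki_dir)
    pages = {p.name for p in wiki.rglob("*.md")}
    broken = []
    for page in wiki.rglob("*.md"):
        for target in WIKI_LINK.findall(page.read_text(encoding="utf-8")):
            if Path(target).name not in pages:
                broken.append((page.name, target))
    return broken
```

The LLM semantic pass would then run on top of a clean structural report, looking for contradictions and gaps that no regex can catch.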

What this produces

Starting from 9 source documents (~13,000 lines of raw material from papers.ethswarm.org), the ingestion process would build 17+ wiki pages (~2,250+ lines of structured content) from the first two major sources alone (the Book of Swarm and Formal Specification Chapter 1). The full set of sources would produce significantly more.

Here is the kind of page structure that emerges:

| Type | Pages | What they cover |
| --- | --- | --- |
| Concepts | chunks, DISC, Kademlia, postage stamps, feeds, encryption, erasure coding | Design rationale, formal definitions, glossary terms, cross-references |
| Protocols | PSS, push-sync, pull-sync, retrieval, hive | Message formats, flows, state machines |
| Incentives | SWAP, redistribution game, price oracle | Bandwidth/storage incentives, formal game mechanics |
| Integration | Bee API | Upload/download, collections, access control, PSS messaging |
| Papers | Per-paper summaries | Chapter breakdowns, key contributions, notation |
| Meta | overview, index, operation log | Synthesis, catalog, audit trail |

Key qualities of the output

Cross-referenced: Every page links to related pages. A redistribution page would link to postage stamps, price oracle, DISC, chunks, Kademlia, and pull-sync. An agent reading any page discovers the full context web.

Formally enriched: Mathematical definitions from the Formal Specification get integrated into the relevant concept pages. For instance, chunks.md would include the formal BMT hash definition (Δ[H,n]), CAC/SOC/PAC constructors, and segment inclusion proof structures — woven into the concept explanation, not isolated in a separate spec page.

Glossary-enriched: The ~180 formal terms from the Book of Swarm glossary get distributed across all topic pages, each term placed where it's most relevant.

Auditable: Every page has YAML frontmatter (title, type, sources, tags, last_updated), a Sources section citing the exact raw documents and sections it was derived from, and the operation log records every ingestion step.
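For example, a log.md entry for one ingestion step might look like this (the entry format and field names are hypothetical):

```markdown
## 2024-01-15 — ingest: Book of Swarm, Chapter 2
- created: wiki/concepts/chunks.md, wiki/concepts/kademlia.md
- updated: wiki/index.md, wiki/overview.md
- lint: 0 errors, 2 warnings (orphan pages pending later chapters)
```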

Suggested approach for ingestion

Given that the Book of Swarm is ~6,500 lines and cannot fit in a single LLM context window, the ingestion should be split into chapter-level steps. A practical plan:

Phase 1 — Book of Swarm (8 steps): Ingest chapter by chapter. Each step reads one chapter, creates/updates the relevant concept/protocol/incentive pages, updates cross-references, and runs the linter. The first step creates the backbone (overview, index, paper summary); subsequent steps enrich it.

Phase 2 — Formal Specification (3 steps): Ingest by chapter. Chapter 1 (Definitions 1–43) enriches existing concept pages with mathematical formalism. Chapter 2 (Definitions 44–102) provides implementation-level data types and algorithms. Appendices add density estimation, randomness analysis, and parameter constants.

Phase 3 — Remaining papers (7 steps): Each smaller paper (whitepaper, protocol spec, erasure coding papers, price oracle, batch utilisation, DREAM) is ingested in a single step, creating its paper summary page and enriching relevant wiki pages.

After each step: update index.md, update overview.md, append to log.md, run the structural linter, commit.

Tooling

Three scripts support the workflow:

  • convert_docs.py — PDF/DOCX to Markdown conversion (e.g., via Datalab API)
  • wiki_lint.py — 9 automated structural checks (broken links, orphan pages, stale content, missing frontmatter, placeholder detection, tag consistency)
  • wiki_search.py — BM25 search with title/tag/body boosting; CLI and optional Flask web UI
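As a sketch of the scoring core behind wiki_search.py, here is the standard Okapi BM25 formula over pre-tokenized documents (the function name and defaults are assumptions; title/tag boosting and the CLI are omitted):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query using Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency per query term, for the IDF component.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue  # term appears nowhere; contributes nothing
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            s += idf * tf[t] * (k1 + 1) / denom
        scores.append(s)
    return scores
```

Field boosting would then be a weighted sum of per-field BM25 scores (title weighted above tags, tags above body).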

Integration with bee

Add an AGENTS.md (or CLAUDE.md) to the bee repository pointing to the knowledge base repo. When an AI agent opens bee, it discovers the knowledge base and can:

  • Understand protocol design rationale before modifying code
  • Look up formal specifications when implementing features
  • Understand the relationship between protocol layers (e.g., how push-sync interacts with postage stamp validation)
  • Get context on Swarm-specific concepts (chunks, neighborhoods, postage, redistribution, etc.)
  • Answer "why" questions — not just "what does this function do" but "why was it designed this way"
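A minimal AGENTS.md stub along these lines could look like the following (the repository URL and paths are placeholders, not decided here):

```markdown
# Swarm Knowledge Base

Before modifying protocol code, consult the LLM-maintained knowledge base:
<knowledge-base-repo>/wiki/index.md

- Concept pages explain design rationale and formal definitions.
- Protocol pages document message formats, flows, and state machines.
- Start from wiki/index.md and follow cross-references for full context.
```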

Proposed Scope

Phase 1 — Foundation

  • Create the knowledge base repo with the three-layer structure
  • Convert all 9 papers from papers.ethswarm.org to Markdown
  • Write the schema (CLAUDE.md) with page format, workflows, conventions
  • Build tooling (lint, search, converter)
  • Ingest The Book of Swarm (all chapters + glossary, ~8 steps)
  • Ingest Swarm Formal Specification (all chapters + appendices, ~3 steps)
  • Ingest remaining 7 papers (whitepaper, protocol spec, erasure coding, price oracle, batch utilisation, DREAM)

Phase 2 — Protocol Pages and Integration

  • Create protocol pages (push-sync, pull-sync, retrieval, hive) from Protocol Specification
  • Create node architecture page from Protocol Specification
  • Add AGENTS.md to the bee repo
  • Ingest relevant content from docs.ethswarm.org

Phase 3 — Community and Automation

  • Document contribution workflow
  • Set up CI for wiki linting
  • Explore automated re-ingestion when sources update

Benefits

  • AI agents get deep Swarm context — architecture, rationale, formal specs, and cross-references, not just API docs
  • Knowledge compounds — every source ingested enriches the entire wiki through cross-references and glossary terms
  • Low ongoing maintenance burden — the LLM handles cross-references, summaries, consistency, and index updates
  • Always current — re-ingest when sources change; the LLM propagates updates across all affected pages
  • Human-readable — plain Markdown files in a git repo; anyone can browse, review, and contribute
