Proposal: LLM-maintained knowledge base from Swarm papers and specs #5425

@gacevicljubisa

Description

Summary

Create a structured, interlinked Markdown knowledge base that converts all Swarm publications from papers.ethswarm.org into wiki pages — maintained by LLM agents, not humans. The bee repo would reference this knowledge base via AGENTS.md or CLAUDE.md, giving AI coding agents deep Swarm context when working on the codebase.

Motivation

AI coding agents (Claude Code, Codex, Cursor, etc.) are increasingly used to contribute to open-source projects. These agents work best when they have structured domain knowledge — not just code, but the concepts, architecture, protocols, and design rationale behind the code.

Swarm has excellent source material across multiple publications, but today this knowledge is locked in PDFs and scattered docs. An AI agent working on bee has no efficient way to understand why things are designed the way they are, what trade-offs were made, or how the protocol layers interact. A developer asking an AI to work on the redistribution game, for example, would benefit enormously from the agent having access to the formal definitions (Definitions 23–43 from the spec), the design rationale (Book of Swarm Chapter 3), and how the game interacts with postage stamps, reserve sampling, and the price oracle — all cross-referenced in one place.

How It Would Work

The approach follows the LLM Wiki pattern described by Andrej Karpathy — a persistent, compounding knowledge base maintained by LLM agents rather than humans.

Architecture — Three Layers

```
raw/              # Layer 1: Immutable source documents (converted Markdown from PDFs)
wiki/             # Layer 2: LLM-maintained structured wiki pages
CLAUDE.md         # Layer 3: Schema — conventions, page types, workflows
```

Layer 1 — Raw Sources: All PDFs from papers.ethswarm.org converted to Markdown (using tools like Datalab/Marker). These are immutable — the LLM reads but never modifies them.

Layer 2 — Wiki Pages: Structured pages organized by type (concepts, protocols, incentives, integration, papers, queries). Each page has YAML frontmatter, cross-references, glossary terms, formal definitions where applicable, and source citations.
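As an illustration, a concept page could open like the sketch below (the file name, field values, and section layout are hypothetical, not a fixed format):

```markdown
---
title: Chunks
type: concept
sources:
  - raw/book-of-swarm.md
tags: [storage, disc]
last_updated: 2024-01-01
---

# Chunks

Chunks are the canonical 4 KB unit of storage in Swarm...

## Sources
- The Book of Swarm, storage chapter
```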

Layer 3 — Schema: A CLAUDE.md that defines the wiki's structure, page format conventions, and three core workflows:

  • Ingest — process a source document into wiki pages, update cross-references, update index, log the operation
  • Query — answer questions by reading relevant wiki pages; file valuable answers back as new pages
  • Lint — two-layer health check: automated structural checks, then LLM semantic analysis (contradictions, gaps, missing connections)
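To make the automated half of the Lint workflow concrete, here is a minimal sketch of one structural check — broken internal links. The function name and the `[text](page.md)` link convention are assumptions; the real linter would cover more checks:

```python
import re
from pathlib import Path

# Matches Markdown links to .md targets, ignoring any #anchor suffix.
WIKI_LINK = re.compile(r"\[[^\]]*\]\(([^)#]+\.md)")

def find_broken_links(wiki_dir):
    """Return (page, target) pairs where a link points at a missing wiki page."""
    wiki = Path(wiki_dir)
    pages = {p.name for p in wiki.rglob("*.md")}
    broken = []
    for page in wiki.rglob("*.md"):
        for target in WIKI_LINK.findall(page.read_text(encoding="utf-8")):
            if Path(target).name not in pages:
                broken.append((page.name, target))
    return broken
```

The LLM semantic pass would then run on top of a clean structural report, looking for contradictions and gaps that no regex can catch.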

What this produces

Starting from 9 source documents (~13,000 lines of raw material from papers.ethswarm.org), the ingestion process would build 17+ wiki pages (~2,250+ lines of structured content) from the first two major sources alone (the Book of Swarm and Formal Specification Chapter 1). The full set of sources would produce significantly more.

Here is the kind of page structure that emerges:

| Type | Pages | What they cover |
| --- | --- | --- |
| Concepts | chunks, DISC, Kademlia, postage stamps, feeds, encryption, erasure coding | Design rationale, formal definitions, glossary terms, cross-references |
| Protocols | PSS, push-sync, pull-sync, retrieval, hive | Message formats, flows, state machines |
| Incentives | SWAP, redistribution game, price oracle | Bandwidth/storage incentives, formal game mechanics |
| Integration | Bee API | Upload/download, collections, access control, PSS messaging |
| Papers | Per-paper summaries | Chapter breakdowns, key contributions, notation |
| Meta | overview, index, operation log | Synthesis, catalog, audit trail |

Key qualities of the output

Cross-referenced: Every page links to related pages. A redistribution page would link to postage stamps, price oracle, DISC, chunks, Kademlia, and pull-sync. An agent reading any page discovers the full context web.

Formally enriched: Mathematical definitions from the Formal Specification get integrated into the relevant concept pages. For instance, chunks.md would include the formal BMT hash definition (Δ[H,n]), CAC/SOC/PAC constructors, and segment inclusion proof structures — woven into the concept explanation, not isolated in a separate spec page.

Glossary-enriched: The ~180 formal terms from the Book of Swarm glossary get distributed across all topic pages, each term placed where it's most relevant.

Auditable: Every page has YAML frontmatter (title, type, sources, tags, last_updated), a Sources section citing the exact raw documents and sections it was derived from, and the operation log records every ingestion step.
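For example, a log.md entry for one ingestion step might look like this (the entry format and field names are hypothetical):

```markdown
## 2024-01-15 — ingest: Book of Swarm, Chapter 2
- created: wiki/concepts/chunks.md, wiki/concepts/kademlia.md
- updated: wiki/index.md, wiki/overview.md
- lint: 0 errors, 2 warnings (orphan pages pending later chapters)
```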

Suggested approach for ingestion

Given that the Book of Swarm is ~6,500 lines and cannot fit in a single LLM context window, the ingestion should be split into chapter-level steps. A practical plan:

Phase 1 — Book of Swarm (8 steps): Ingest chapter by chapter. Each step reads one chapter, creates/updates the relevant concept/protocol/incentive pages, updates cross-references, and runs the linter. The first step creates the backbone (overview, index, paper summary); subsequent steps enrich it.

Phase 2 — Formal Specification (3 steps): Ingest by chapter. Chapter 1 (Definitions 1–43) enriches existing concept pages with mathematical formalism. Chapter 2 (Definitions 44–102) provides implementation-level data types and algorithms. Appendices add density estimation, randomness analysis, and parameter constants.

Phase 3 — Remaining papers (7 steps): Each smaller paper (whitepaper, protocol spec, erasure coding papers, price oracle, batch utilisation, DREAM) is ingested in a single step, creating its paper summary page and enriching relevant wiki pages.

After each step: update index.md, update overview.md, append to log.md, run the structural linter, commit.

Tooling

Three scripts support the workflow:

  • convert_docs.py — PDF/DOCX to Markdown conversion (e.g., via Datalab API)
  • wiki_lint.py — 9 automated structural checks (broken links, orphan pages, stale content, missing frontmatter, placeholder detection, tag consistency)
  • wiki_search.py — BM25 search with title/tag/body boosting; CLI and optional Flask web UI
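As a sketch of the scoring core behind wiki_search.py, here is the standard Okapi BM25 formula over pre-tokenized documents (the function name and defaults are assumptions; title/tag boosting and the CLI are omitted):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query using Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency per query term, for the IDF component.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue  # term appears nowhere; contributes nothing
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            s += idf * tf[t] * (k1 + 1) / denom
        scores.append(s)
    return scores
```

Field boosting would then be a weighted sum of per-field BM25 scores (title weighted above tags, tags above body).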

Integration with bee

Add an AGENTS.md (or CLAUDE.md) to the bee repository pointing to the knowledge base repo. When an AI agent opens bee, it discovers the knowledge base and can:

  • Understand protocol design rationale before modifying code
  • Look up formal specifications when implementing features
  • Understand the relationship between protocol layers (e.g., how push-sync interacts with postage stamp validation)
  • Get context on Swarm-specific concepts (chunks, neighborhoods, postage, redistribution, etc.)
  • Answer "why" questions — not just "what does this function do" but "why was it designed this way"
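A minimal AGENTS.md stub along these lines could look like the following (the repository URL and paths are placeholders, not decided here):

```markdown
# Swarm Knowledge Base

Before modifying protocol code, consult the LLM-maintained knowledge base:
<knowledge-base-repo>/wiki/index.md

- Concept pages explain design rationale and formal definitions.
- Protocol pages document message formats, flows, and state machines.
- Start from wiki/index.md and follow cross-references for full context.
```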

Proposed Scope

Phase 1 — Foundation

  • Create the knowledge base repo with the three-layer structure
  • Convert all 9 papers from papers.ethswarm.org to Markdown
  • Write the schema (CLAUDE.md) with page format, workflows, conventions
  • Build tooling (lint, search, converter)
  • Ingest The Book of Swarm (all chapters + glossary, ~8 steps)
  • Ingest Swarm Formal Specification (all chapters + appendices, ~3 steps)
  • Ingest remaining 7 papers (whitepaper, protocol spec, erasure coding, price oracle, batch utilisation, DREAM)

Phase 2 — Protocol Pages and Integration

  • Create protocol pages (push-sync, pull-sync, retrieval, hive) from Protocol Specification
  • Create node architecture page from Protocol Specification
  • Add AGENTS.md to the bee repo
  • Ingest relevant content from docs.ethswarm.org

Phase 3 — Community and Automation

  • Document contribution workflow
  • Set up CI for wiki linting
  • Explore automated re-ingestion when sources update

Benefits

  • AI agents get deep Swarm context — architecture, rationale, formal specs, and cross-references, not just API docs
  • Knowledge compounds — every source ingested enriches the entire wiki through cross-references and glossary terms
  • Low ongoing maintenance burden — the LLM handles cross-references, summaries, consistency, and index updates
  • Always current — re-ingest when sources change; the LLM propagates updates across all affected pages
  • Human-readable — plain Markdown files in a git repo; anyone can browse, review, and contribute
