feat(wiki): stable page IDs + redirect stubs (ADR-2244 Phase 3 foundation) #33
Merged
Phase 3 makes the path a *view* over a stable identifier. Without
stable IDs the upcoming Phase 4 bulk migration (renaming the .md.md /
timestamp-slug / path-leak pages, re-bucketing the 7820 file-docs)
would rot every inbound link.
What lands here
---------------
  mcp_server/core/wiki_identity.py   UUID4 generation, parsing, validation,
                                     frontmatter helpers. Pure logic.
  mcp_server/core/wiki_redirect.py   Redirect data model, frontmatter
                                     detection, path-based chain resolution
                                     with cycle + depth protection, stub
                                     authoring. Pure logic.
  mcp_server/core/wiki_sync.py       New writes carry ``id: <uuid>`` in their
                                     frontmatter.
  scripts/wiki_backfill_ids.py       One-shot CLI that walks the wiki and
                                     mints an id on every page that lacks
                                     one. Dry-run by default; apply with
                                     ``--apply``.
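As a rough illustration of the identity helpers listed above, the sketch below mints and validates UUID4 strings with the standard library only. The names ``mint_page_id`` and ``is_valid_page_id`` are hypothetical and are not taken from ``wiki_identity.py``; the strictness rules (canonical, lowercase, version 4 only) are an assumption about what "validation" means here.

```python
import uuid


def mint_page_id() -> str:
    """Return a fresh random (version 4) UUID in canonical string form."""
    return str(uuid.uuid4())


def is_valid_page_id(value: str) -> bool:
    """Accept only canonical, lowercase, hyphenated UUID4 strings.

    Hypothetical helper: the real module may be more or less strict.
    """
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # uuid.UUID() also accepts braces, URN prefixes, and unhyphenated hex,
    # so compare against the canonical rendering to reject those forms.
    return parsed.version == 4 and str(parsed) == value
```

Comparing against ``str(parsed)`` is one simple way to reject the alternative spellings that ``uuid.UUID`` tolerates, so every stored id has the same shape.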
Design choices
--------------
* UUID4, not UUID1 — UUID1 leaks host MAC into the identifier; the
wiki may be exported, so we use the random variant.
* Canonical 36-character form (32 hex digits plus four hyphens). No
structured embedding in the path; paths stay human-readable.
* IDs are independent of ``memory_id`` and ``draft_id``. A page can
be re-synthesised from a memory and still keep its identity.
* Redirect stubs accept either ``redirect_to`` (path) or
``redirect_id`` (UUID). When both are present the ID wins, because
paths are mutable but IDs are stable. Path-based chain resolution is
implemented here; ID-based resolution requires an id→path index
that the read-handler layer will provide in Phase 3.2.
* Cycle + depth protection: ``resolve_chain`` returns None on a
visited-node cycle or chains longer than ``MAX_REDIRECT_DEPTH=5``,
matching MediaWiki convention.
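The cycle and depth protection described above can be sketched roughly as follows. This is a minimal illustration, not the code in ``wiki_redirect.py``: the signature, the ``redirects`` mapping (a stand-in for reading stub frontmatter from disk), and the returned list shape are all assumptions. Only the name ``resolve_chain``, the ``None`` result, and ``MAX_REDIRECT_DEPTH = 5`` come from the description.

```python
from typing import Mapping, Optional

MAX_REDIRECT_DEPTH = 5  # value stated in the commit message


def resolve_chain(start: str, redirects: Mapping[str, str]) -> Optional[list[str]]:
    """Follow path-based redirects beginning at `start`.

    `redirects` maps a stub path to its target path (hypothetical stand-in
    for on-disk frontmatter). Returns the visited paths, ending at the live
    page, or None on a cycle or when more than MAX_REDIRECT_DEPTH hops
    would be needed.
    """
    chain = [start]
    seen = {start}
    current = start
    while current in redirects:
        if len(chain) > MAX_REDIRECT_DEPTH:
            return None  # another hop would exceed the depth limit
        current = redirects[current]
        if current in seen:
            return None  # visited-node cycle, including a self-loop
        seen.add(current)
        chain.append(current)
    return chain
```

Tracking visited nodes catches short cycles immediately, while the hop counter bounds the work on long but acyclic chains.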
Backfill behaviour
------------------
Idempotent. Pages with valid ids are skipped. Redirect stubs are
skipped (they don't need their own identity, they reference another
page's). Pages with no frontmatter at all are skipped (synthesising
frontmatter would change page semantics). Pages with a malformed id
(``id: garbage``) have the line replaced rather than duplicated.
Dry-run against the live wiki (9608 pages):
  Scanned:          9608
  Would mint:       9607
  Skipped (no fm):     1
  Errored:             0
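The per-page backfill decisions described above (skip valid ids, skip redirect stubs, skip pages without frontmatter, replace a malformed id in place) might look roughly like the sketch below. The function name ``backfill``, the action labels, and the regex-based frontmatter handling are assumptions made for illustration; the actual script may parse frontmatter differently.

```python
import re
import uuid

_FRONTMATTER = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)
_ID_LINE = re.compile(r"^id:.*$", re.MULTILINE)
_REDIRECT_KEY = re.compile(r"^redirect_(to|id):", re.MULTILINE)


def _is_uuid4(value: str) -> bool:
    try:
        parsed = uuid.UUID(value)
    except ValueError:
        return False
    return parsed.version == 4 and str(parsed) == value


def backfill(text: str) -> tuple[str, str]:
    """Return (action, new_text) for one page; hypothetical helper."""
    match = _FRONTMATTER.match(text)
    if match is None:
        return "skip-no-frontmatter", text
    fm = match.group(1)
    if _REDIRECT_KEY.search(fm):
        return "skip-redirect", text
    new_line = f"id: {uuid.uuid4()}"
    id_match = _ID_LINE.search(fm)
    if id_match:
        current = id_match.group(0)[len("id:"):].strip()
        if _is_uuid4(current):
            return "skip-has-id", text
        # Malformed id: replace the existing line rather than adding a second.
        new_fm = fm[: id_match.start()] + new_line + fm[id_match.end():]
    else:
        new_fm = new_line + ("\n" + fm if fm else "")
    return "mint", text[: match.start(1)] + new_fm + text[match.end(1):]
```

Running ``backfill`` on its own output returns ``skip-has-id``, which is the idempotence property the commit message relies on.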
The remaining work — applying the backfill to the user's wiki, and
the handler-layer changes (``wiki_read`` follows redirects,
``wiki_migrate`` writes stubs on rename) — lands in follow-ups.
Tests
-----
  tests_py/core/test_wiki_identity.py          22 tests — format validation,
                                               generation uniqueness,
                                               extraction, ensure-or-mint.
  tests_py/core/test_wiki_redirect.py          24 tests — parse, dataclass
                                               validation, chain resolution
                                               (single hop, multi hop, cycles,
                                               self-loop, depth limit, id-only
                                               redirect), stub authoring +
                                               roundtrip.
  tests_py/core/test_wiki_sync_routing.py      2 new tests — sync writes a
                                               valid id, distinct ids per page.
  tests_py/scripts/test_wiki_backfill_ids.py   12 tests — dry-run idempotence,
                                               redirect/no-fm skipping,
                                               malformed-id replacement,
                                               distinct ids across pages.
Targeted suite: 66 passed. tests_py/core/ + tests_py/shared/ +
tests_py/scripts/: 2049 passed. ``ruff format --check`` and
``ruff check`` clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cdeust added a commit that referenced this pull request on May 13, 2026
…Phase 3.2) (#34)
Wires the Phase 3 data model (#33) into the read path and adds a new
write handler that performs the rename + stub atomically.
With this change ``wiki_rename old.md new.md`` produces:
  * ``new.md`` — the original content moved verbatim (id preserved)
  * ``old.md`` — a redirect stub pointing at new.md (with
    redirect_id = source page id, for future id-based resolution)
And ``wiki_read old.md`` then returns the content of ``new.md`` along
with ``redirect_chain: ["old.md", "new.md"]``. Inbound links to the
old path keep working through the migration.
Handler changes
---------------
* ``wiki_read`` — follow redirect stubs transparently up to 5 hops.
  ``follow_redirects: false`` opts out (admin/migration tooling that
  needs to inspect the stub itself). New response field:
  ``redirect_chain``.
* ``wiki_list`` — exclude redirect stubs from the listing by default.
  ``include_redirects: true`` opts in. New response field:
  ``redirect_count``.
* ``wiki_reindex`` — drop redirect stubs from .generated/INDEX.md and
  surface the count by kind in the response. The index now lists only
  live pages, which is what readers actually want.
* ``wiki_rename`` — NEW. Move a page from one path to another and
  leave a stub at the old path. Refuses to operate on pages without a
  stable frontmatter id (run ``scripts/wiki_backfill_ids.py --apply``
  first), refuses to chain stubs (rename the terminal page instead),
  refuses to overwrite an existing destination unless
  ``overwrite_dest=true``.
Tool registry: ``wiki_rename`` registered alongside the other 8 wiki
tools. ``wiki_read`` and ``wiki_list`` MCP signatures extended with
their new optional parameters.
Stub semantics
--------------
The stub carries ``redirect_id = <source page id>`` so future id-based
resolution (which a follow-up will add for cross-rename resolution
when the path itself is renamed twice) works. ``redirect_to`` is
populated with the new path as the cheap path-based resolution target.
Both forms are emitted; the id wins when an id-aware reader arrives.
Tests
-----
``tests_py/handlers/test_wiki_redirect_handlers.py`` (NEW) — 20 tests
covering every handler change:
read:
  - returns content for a normal page (chain = [])
  - follows single-hop redirect
  - follows multi-hop chain (3 pages, 2 hops)
  - ``follow_redirects: false`` returns the stub itself
  - cycle returns error
  - dangling redirect returns error
  - missing source returns error
list:
  - excludes stubs by default; redirect_count surfaced
  - ``include_redirects: true`` returns both
  - redirect_count is 0 when no stubs
reindex:
  - stubs absent from INDEX.md; by_kind counts only live pages
rename:
  - creates stub at old path with correct redirect_to, redirect_id,
    redirect_reason
  - refuses missing source
  - refuses source without id
  - refuses existing destination
  - ``overwrite_dest=true`` works
  - refuses to chain stubs
  - refuses same path
  - end-to-end: rename then read resolves to the new content
  - body preserved verbatim through the move
Targeted suite: 86 passed (Phase 3 + Phase 3.2 surface). Broader:
tests_py/core/ + tests_py/shared/ + tests_py/scripts/ + relevant
tests_py/handlers/ → 2075 passed. ``ruff format --check`` and
``ruff check`` clean.
What still ships in a follow-up
-------------------------------
* ID→path index for ID-only redirect resolution (currently only
  path-based chain walking works; id-only stubs return None from
  resolve_chain so they error in wiki_read with a clear message).
* Phase 4 bulk migration script that loops wiki_rename over the 88
  known pollution paths (.md.md slug bug, timestamp-slugs, path-leak
  titles) — gated on this PR + #33 landing.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
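The rename-plus-stub behaviour described in the Phase 3.2 message can be sketched on a plain directory as follows. This is an illustrative approximation only: ``rename_with_stub`` is a hypothetical name, the stub layout is assumed (the real stub also carries a ``redirect_reason``), and the two writes here are not atomic, unlike what the handler is said to provide.

```python
import re
from pathlib import Path

_FRONTMATTER = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)


def rename_with_stub(root: Path, old: str, new: str,
                     overwrite_dest: bool = False) -> None:
    """Move `old` to `new` under `root` and leave a redirect stub behind."""
    if old == new:
        raise ValueError("source and destination are the same path")
    src, dst = root / old, root / new
    if not src.is_file():
        raise FileNotFoundError(old)
    text = src.read_text(encoding="utf-8")
    fm_match = _FRONTMATTER.match(text)
    front = fm_match.group(1) if fm_match else ""
    if re.search(r"^redirect_(to|id):", front, re.MULTILINE):
        raise ValueError("source is a redirect stub; rename the terminal page")
    id_match = re.search(r"^id:\s*(\S+)\s*$", front, re.MULTILINE)
    if id_match is None:
        raise ValueError("source has no frontmatter id; run the backfill first")
    if dst.exists() and not overwrite_dest:
        raise FileExistsError(new)
    dst.parent.mkdir(parents=True, exist_ok=True)
    dst.write_text(text, encoding="utf-8")  # content moved verbatim
    # Assumed stub shape: both the path form and the id form are emitted.
    stub = f"---\nredirect_to: {new}\nredirect_id: {id_match.group(1)}\n---\n"
    src.write_text(stub, encoding="utf-8")
```

Because the stub itself contains ``redirect_to``, a second rename of the old path is refused, which mirrors the "refuses to chain stubs" rule.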
cdeust added a commit that referenced this pull request on May 13, 2026
…(ADR-2244 Phase 4.1) (#35)
Phase 4 of ADR-2244 — the bulk migration. This is the deterministic
half: three pollution classes with mechanically computable target
paths. The LLM-assisted re-classification half (the 7820 file-doc
re-bucket) is a separate scope and lands in a follow-up.
Targets
-------
Audit 2026-05-12 found three deterministic-rename pollution classes:
  Pattern                                      Audit count
  ────────────────────────────────────────────────────────
  ``*.md.md``                                  58
  ``*-decision-created-YYYY-MM-DDt...z.md``    10
  ``*users-cdeust-... .md``                    11+ (path-leak in slug)
Live dry-run after this commit:
  Pollution paths detected: 70 (all currently skipped because the
  backfill from #33 hasn't been applied yet — the script correctly
  refuses to rename pages without a stable id)
Script flow
-----------
scripts/wiki_bulk_migrate.py
  1. Walk wiki, classify each .md page by pollution pattern.
  2. For each match:
     a. Skip redirect stubs (already moved).
     b. Skip pages without a frontmatter ``id`` (Phase 3 invariant).
        Caller is told to run ``wiki_backfill_ids.py --apply`` first.
     c. Compute clean target path:
        - .md.md         → strip duplicate extension
        - timestamp-slug → derive slug from frontmatter title
                           or first body heading
        - path-leak      → same, plus reject path-shaped titles
     d. Record the Pollution record.
  3. On --apply: call the ``wiki_rename`` handler for each item, which
     writes content at the new path and a redirect stub at the old
     one. Inbound links keep resolving.
Idempotency: a second --apply finds zero pollution paths (the
renames landed; their stubs are detected and skipped).
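The classification step of the flow above could be sketched with three filename patterns. The regular expressions are guesses reconstructed from the pattern table, and ``classify`` and ``strip_double_extension`` are hypothetical names; the path-leak pattern in particular is deliberately loose and would need tightening to avoid false positives on slugs that merely contain ``users-``.

```python
import re
from typing import Optional

# Assumed patterns, derived from the audit table rather than the script.
_DOUBLE_EXT = re.compile(r"\.md\.md$")
_TIMESTAMP = re.compile(r"-decision-created-\d{4}-\d{2}-\d{2}t.*z\.md$")
_PATH_LEAK = re.compile(r"users-[a-z0-9_]+-")


def classify(filename: str) -> Optional[str]:
    """Return the pollution class of a page filename, or None if clean."""
    name = filename.lower()
    if _DOUBLE_EXT.search(name):
        return "double-extension"
    if _TIMESTAMP.search(name):
        return "timestamp-slug"
    if _PATH_LEAK.search(name):
        return "path-leak"
    return None


def strip_double_extension(filename: str) -> str:
    """Drop one trailing '.md' when the name ends in '.md.md'."""
    if classify(filename) == "double-extension":
        return filename[: -len(".md")]
    return filename
```

Only the ``.md.md`` class has a target that follows from the filename alone; the other two classes need the title or heading lookup described under slug derivation.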
Slug derivation
---------------
``_derive_clean_slug`` picks from three sources in order:
  1. Frontmatter ``title`` (if non-empty and not path-shaped /
     timestamp-shaped / too short / synthetic ``memory-XXX``)
  2. First body H1/H2 heading (same cleanness check)
  3. Deterministic 6-hex-character hash of the body content
     prefixed with the kind (``decision-abc123`` / ``page-def456``)
The hash fallback is rare — most pollution pages already have a
proper ``title`` field; it's the *slug* that's broken, not the
metadata.
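The three-source fallback above might be approximated as follows. Everything concrete here is an assumption made for illustration: the minimum title length, the exact "path-shaped" and "timestamp-shaped" tests, the slugification rule, and the use of SHA-256 for the 6-character hash are not stated in the commit message, and ``derive_clean_slug`` is a stand-in for the private ``_derive_clean_slug``.

```python
import hashlib
import re


def _slugify(text: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def _is_clean(title: str) -> bool:
    t = title.strip().lower()
    if len(t) < 4:  # "too short"; the threshold is an assumption
        return False
    if "/" in t or "\\" in t:  # path-shaped
        return False
    if re.search(r"\d{4}-\d{2}-\d{2}t", t):  # timestamp-shaped
        return False
    if re.fullmatch(r"memory-\w+", t):  # synthetic placeholder title
        return False
    return _slugify(t) != ""


def derive_clean_slug(title: str, body: str, kind: str = "page") -> str:
    """Pick a slug from the title, then the first H1/H2, then a body hash."""
    if _is_clean(title):
        return _slugify(title)
    for line in body.splitlines():
        heading = re.match(r"#{1,2}\s+(.+)", line)
        if heading:
            if _is_clean(heading.group(1)):
                return _slugify(heading.group(1))
            break  # only the first H1/H2 heading is considered
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()[:6]
    return f"{kind}-{digest}"
```

Hashing the body keeps the fallback deterministic, so a dry-run and the later apply run propose the same target for the same page.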
Tests
-----
``tests_py/scripts/test_wiki_bulk_migrate.py`` (NEW) — 22 tests:
  Detection (6):
    .md.md positive + negative; timestamp-slug positive + negative;
    path-leak positive + negative.
  Slug derivation (5):
    accepts real titles; rejects path / timestamp / too-short titles;
    falls back to body heading; falls back to hash.
  plan() (5):
    finds all three classes in one pass; skips pages without id;
    skips existing redirect stubs; proposes the correct target for
    timestamp-slug and path-leak (preserving numeric and date prefixes).
  apply() / end-to-end (4):
    renames + creates stubs with correct redirect_to and redirect_id;
    idempotent (second run is a no-op); handles three classes in one
    pass; doesn't crash on id-less skipped pages.
  Plus 2 sanity tests for boundary slug shapes.
Targeted: 22 passed. ruff format and check clean.
Operational order
-----------------
  1. Merge #33 (Phase 3 — UUID + redirect modules + backfill script)
  2. Merge #34 (Phase 3.2 — wiki_read / wiki_rename handlers)
  3. Merge this PR (Phase 4.1 — bulk-migrate script)
  4. Run:
       python scripts/wiki_backfill_ids.py --apply
       python scripts/wiki_bulk_migrate.py           # dry-run review
       python scripts/wiki_bulk_migrate.py --apply   # commit moves
Out of scope (follow-ups)
-------------------------
* ID→path index for ID-only redirect resolution (path-based works
today; id-only stubs error in wiki_read).
* Phase 4.2 — file-doc re-bucket (7820 ``notes/<domain>/<id>-file-*``
pages → ``reference/<domain>/<file-slug>.md`` with provenance
rewrite). Different operation (changes kind directory, rewrites
frontmatter); separate script.
* Phase 5 — classifier-driven cleanup for ai-generated stubs
(filter not delete).
* Phase 6 — producer audit (codebase_analyze emits correct
provenance / lifecycle on its outputs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cdeust added a commit that referenced this pull request on May 13, 2026
…se 4.2)
Producer-side fix #27 routed new file-doc pages to
``reference/<domain>/`` with ``provenance: auto-generated``. The
existing population — 8,734 pages written under
``notes/<domain>/<id>-file-*.md`` — never got moved. This script
handles that one-time migration.
Operation per page
------------------
  1. Walk ``notes/<domain>/``; match the file-doc shape
     ``\d+-file-...``.
  2. Skip redirect stubs (already migrated).
  3. Require a frontmatter ``id`` (Phase 3 invariant — run
     ``wiki_backfill_ids.py --apply`` first).
  4. Extract the original source path from the ``file:<path>`` tag
     (canonical even when the on-disk filename was truncated to
     ``98817-file-....md``).
  5. Compute target ``reference/<domain>/<file-slug>.md``.
  6. Rewrite frontmatter to the modern schema:
       kind: reference
       lifecycle: seedling
       audience: [developer]
       provenance: auto-generated
       generator: {model: cortex-codebase-analyze, version: v1,
                   prompt_template: file-doc-v1,
                   generated_at: <original-created>}
     Plus migration trace fields (``source_file_path``,
     ``rebucketed_from``). The original id, title, tags, and body are
     preserved verbatim.
  7. Write the rewritten page at the new path.
  8. Replace the source with a redirect stub that carries
     ``redirect_to`` (path) + ``redirect_id`` (source id) so
     ``wiki_read`` resolves the old path through the stub
     transparently.
The script is intentionally NOT a thin wrapper around ``wiki_rename``:
that handler preserves content verbatim, whereas the file-doc
re-bucket must REWRITE the frontmatter as part of the move. The
stub-creation half does use
``mcp_server.core.wiki_redirect.build_redirect_stub`` for consistency
with Phase 3.2.
Live dry-run
------------
  Detected file-doc pages: 8734
  Plan: re-bucket             0
  Skipped (no id):         8734
Same correct refusal as Phase 4.1 — the backfill from #33 hasn't been
applied to the live wiki yet. Once ``wiki_backfill_ids.py --apply``
runs, the plan will flip to ``8734 to re-bucket``.
Idempotency
-----------
* Second --apply finds zero: source pages are now redirect stubs
  (skipped by plan()), new producers write to reference/ directly
  (skipped by the pattern match).
* Collision handling: two notes documenting the same source file get
  distinct targets via a ``-<memory_id>`` suffix on the second one
  (rare in practice; observed 0 times on the live wiki).
Tests
-----
``tests_py/scripts/test_wiki_rebucket_file_docs.py`` (NEW) — 19 tests:
detection (6):
  - canonical file-doc shape matches; non-file-doc notes don't
  - file tag extracted from block-list and inline-list frontmatter
  - missing/empty file tag handled
slug derivation (3):
  - separators flattened to hyphens
  - empty source returns empty target
  - empty domain falls back to ``_general``
plan (5):
  - finds file-doc notes, skips other notes
  - skips pages without id (refusal message)
  - skips pages without file tag
  - disambiguates colliding targets via memory-id suffix
  - skips existing redirect stubs (idempotent re-runs)
apply (5):
  - modern frontmatter at target (kind/lifecycle/audience/
    provenance/generator/source_file_path)
  - body preserved verbatim
  - redirect stub at source with correct target_path + target_id
  - refuses when destination already exists
  - idempotent (second pass = no-op)
end-to-end (1):
  - 25 pages across 3 domains move correctly; spot-check each domain
19 passed; ruff format and check clean.
Post-merge operations
---------------------
After PR #36 + this PR land on main:
  python scripts/wiki_backfill_ids.py --apply
  python scripts/wiki_bulk_migrate.py --apply         # Phase 4.1 — 70 paths
  python scripts/wiki_rebucket_file_docs.py           # dry-run review
  python scripts/wiki_rebucket_file_docs.py --apply   # Phase 4.2 — 8734 pages
After all three apply runs:
* notes/ drops from 92% of the wiki to ~5% (real catch-all content
  only)
* reference/ grows to host the 8734 file docs with proper provenance
* 70 + 8734 redirect stubs preserve all inbound links
Out of scope (Phase 5+)
-----------------------
* Phase 5 — classifier-driven cleanup for ai-generated seedlings
  (filter from search, do not delete; preserves the auto-gen
  reference pages but hides empty stubs from default views).
* Phase 6 — producer audit (codebase_analyze emits the modern 4-tuple
  directly on new writes; would also write provenance =
  auto-generated + generator block on every output).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
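The target-path computation from the Phase 4.2 message (step 5, together with the three slug-derivation test cases) can be sketched as below. ``rebucket_target`` is a hypothetical name, and the decision to flatten every non-alphanumeric run (including dots) to a hyphen and to lowercase the result is an assumption; the script may keep more of the original filename.

```python
import re


def rebucket_target(domain: str, source_path: str) -> str:
    """Return ``reference/<domain>/<file-slug>.md`` for a source file path.

    Returns an empty string when no slug can be derived, and falls back to
    the ``_general`` domain when `domain` is empty.
    """
    # Assumed flattening rule: every run of separators becomes one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", source_path).strip("-").lower()
    if not slug:
        return ""
    return f"reference/{domain or '_general'}/{slug}.md"
```

Collision handling (the ``-<memory_id>`` suffix) is left out of this sketch; it would be applied by the caller once two pages map to the same target.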
cdeust added a commit that referenced this pull request on May 13, 2026
…on onto main (#36)
* feat(wiki): handler-layer redirect mechanics + wiki_rename
  (ADR-2244 Phase 3.2)
* feat(wiki): deterministic bulk migration for the ~88 pollution paths
  (ADR-2244 Phase 4.1)
* test: bump tool count assertion 47 → 48 for new wiki_rename
  (ADR-2244 Phase 3.2)
  CI on PR #36 fails on tests_py/test_main.py:70 — the mcp_server tool
  count is now 48 because Phase 3.2 (#34's content, now flowing into
  main via this PR) registers ``wiki_rename`` as a new tool. The
  assertion is a hard count + membership check; both updated.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cdeust
added a commit
that referenced
this pull request
May 13, 2026
…se 4.2)

Producer-side fix #27 routed new file-doc pages to
``reference/<domain>/`` with ``provenance: auto-generated``. The
existing population (8,734 pages written under
``notes/<domain>/<id>-file-*.md``) never got moved. This script handles
that one-time migration.

Operation per page
------------------
1. Walk ``notes/<domain>/``; match the file-doc shape ``\d+-file-...``.
2. Skip redirect stubs (already migrated).
3. Require a frontmatter ``id`` (Phase 3 invariant; run
   ``wiki_backfill_ids.py --apply`` first).
4. Extract the original source path from the ``file:<path>`` tag
   (canonical even when the on-disk filename was truncated to
   ``98817-file-....md``).
5. Compute target ``reference/<domain>/<file-slug>.md``.
6. Rewrite frontmatter to the modern schema:
       kind: reference
       lifecycle: seedling
       audience: [developer]
       provenance: auto-generated
       generator: {model: cortex-codebase-analyze, version: v1,
                   prompt_template: file-doc-v1,
                   generated_at: <original-created>}
   Plus migration trace fields (``source_file_path``,
   ``rebucketed_from``). The original id, title, tags, and body are
   preserved verbatim.
7. Write the rewritten page at the new path.
8. Replace the source with a redirect stub that carries
   ``redirect_to`` (path) + ``redirect_id`` (source id) so
   ``wiki_read`` resolves the old path through the stub transparently.

The script is intentionally NOT a thin wrapper around ``wiki_rename``:
that handler preserves content verbatim, whereas the file-doc re-bucket
must REWRITE the frontmatter as part of the move. The stub-creation
half does use ``mcp_server.core.wiki_redirect.build_redirect_stub`` for
consistency with Phase 3.2.

Live dry-run
------------
    Detected file-doc pages:  8734
    Plan: re-bucket           0
    Skipped (no id):          8734

Same correct refusal as Phase 4.1: the backfill from #33 hasn't been
applied to the live wiki yet. Once ``wiki_backfill_ids.py --apply``
runs, the plan will flip to ``8734 to re-bucket``.

Idempotency
-----------
* Second --apply finds zero: source pages are now redirect stubs
  (skipped by plan()), new producers write to reference/ directly
  (skipped by the pattern match).
* Collision handling: two notes documenting the same source file get
  distinct targets via a ``-<memory_id>`` suffix on the second one
  (rare in practice; observed 0 times on the live wiki).

Tests
-----
``tests_py/scripts/test_wiki_rebucket_file_docs.py`` (NEW), 19 tests:

detection (6):
  - canonical file-doc shape matches; non-file-doc notes don't
  - file tag extracted from block-list and inline-list frontmatter
  - missing/empty file tag handled
slug derivation (3):
  - separators flattened to hyphens
  - empty source returns empty target
  - empty domain falls back to ``_general``
plan (5):
  - finds file-doc notes, skips other notes
  - skips pages without id (refusal message)
  - skips pages without file tag
  - disambiguates colliding targets via memory-id suffix
  - skips existing redirect stubs (idempotent re-runs)
apply (5):
  - modern frontmatter at target (kind/lifecycle/audience/
    provenance/generator/source_file_path)
  - body preserved verbatim
  - redirect stub at source with correct target_path + target_id
  - refuses when destination already exists
  - idempotent (second pass = no-op)
end-to-end (1):
  - 25 pages across 3 domains move correctly; spot-check each domain

19 passed; ruff format and check clean.

Post-merge operations
---------------------
After PR #36 + this PR land on main:

    python scripts/wiki_backfill_ids.py --apply
    python scripts/wiki_bulk_migrate.py --apply        # Phase 4.1: 70 paths
    python scripts/wiki_rebucket_file_docs.py          # dry-run review
    python scripts/wiki_rebucket_file_docs.py --apply  # Phase 4.2: 8734 pages

After all three apply runs:
* notes/ drops from 92% of the wiki to ~5% (real catch-all content only)
* reference/ grows to host the 8734 file docs with proper provenance
* 70 + 8734 redirect stubs preserve all inbound links

Out of scope (Phase 5+)
-----------------------
* Phase 5: classifier-driven cleanup for ai-generated seedlings
  (filter from search, do not delete; preserves the auto-gen reference
  pages but hides empty stubs from default views).
* Phase 6: producer audit (codebase_analyze emits the modern 4-tuple
  directly on new writes; would also write provenance = auto-generated
  + generator block on every output).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
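The slug derivation and collision handling described in the commit message above can be sketched as follows. This is a minimal illustration, not the script's actual code: ``file_slug``, ``target_path``, and ``plan_targets`` are hypothetical names, and the exact set of characters flattened to hyphens is an assumption (the tests only say "separators flattened to hyphens").

```python
import re


def file_slug(source_path: str) -> str:
    """Flatten separators to hyphens.

    Assumption: every run of characters other than letters, digits and
    underscores counts as a separator.
    """
    return re.sub(r"[^A-Za-z0-9_]+", "-", source_path).strip("-")


def target_path(domain: str, source_path: str) -> str:
    """Compute reference/<domain>/<file-slug>.md for one file-doc page."""
    slug = file_slug(source_path)
    if not slug:
        return ""  # empty source returns empty target
    # empty domain falls back to _general
    return f"reference/{domain or '_general'}/{slug}.md"


def plan_targets(pages):
    """Map memory_id -> target for (memory_id, domain, source_path) tuples.

    A second page that resolves to an already-claimed target gets a
    -<memory_id> suffix so the two targets stay distinct.
    """
    taken = set()
    plan = {}
    for memory_id, domain, source_path in pages:
        target = target_path(domain, source_path)
        if not target:
            continue  # nothing to plan without a source path
        if target in taken:
            target = f"{target[:-len('.md')]}-{memory_id}.md"
        taken.add(target)
        plan[memory_id] = target
    return plan
```

Calling ``plan_targets`` on two notes that document the same source file yields one plain target and one suffixed with the second note's memory id.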
cdeust added a commit that referenced this pull request on May 13, 2026
…n complete) (#41)

Bundles 11 merged PRs (#30-#40) since v3.15.4 closing out the ADR-2244
wiki classification cycle:

  Phase 2     #31 #32   pilot migration analyzer + 1000-page verification
                        (96.7% kind-kept, passes target)
  Phase 3     #33       stable page IDs (UUID4) + redirect data model +
                        backfill CLI
  Phase 3.2   #34       handler-layer redirect mechanics (wiki_read follows
                        transparently, wiki_list/wiki_reindex exclude stubs,
                        new wiki_rename tool)
  Phase 4.1   #35 #36   deterministic bulk migration for the 70 known
                        pollution paths (.md.md, timestamp-slug, path-leak)
  Phase 4.2   #37       file-doc re-bucket (8734 pages from notes/ to
                        reference/ with modern frontmatter)
  Phase 5     #39       filter auto-generated pages from default listings;
                        INDEX.md splits human-authored from auto-gen
  Phase 6     #38       producer audit: codebase_analyze output routes to
                        kind=reference (root-causes the 8734-page misroute)
  Phase 6.2   #40       producer audit: wiki_seed_codebase emits modern kind
                        tags the classifier reads
  Security    #30       authlib CVE-2026-44681 bump (dependabot #4)

Notes for users:
- Wiki on disk not migrated yet. Apply scripts (in scripts/) are
  dry-run by default. Three commands to fully migrate; each is
  idempotent and leaves redirect stubs.
- Phases 5/6/6.2 take effect on next MCP restart.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Phase 3 of ADR-2244: making the path a view over a stable identifier so the upcoming Phase 4 bulk migration (rename the ``.md.md`` / timestamp-slug / path-leak pages, re-bucket the 7820 file-docs) doesn't rot inbound links.

What lands here
* ``mcp_server/core/wiki_identity.py``: UUID4 generation, parsing, validation, frontmatter helpers. Pure logic.
* ``mcp_server/core/wiki_redirect.py``: redirect data model, frontmatter detection, path-based chain resolution with cycle + depth protection, stub authoring. Pure logic.
* ``mcp_server/core/wiki_sync.py``: new writes carry ``id: <uuid>`` in their frontmatter.
* ``scripts/wiki_backfill_ids.py``: one-shot CLI that mints an id on every page that lacks one. Dry-run by default; pass ``--apply`` to write. Idempotent.

Design choices
* IDs are independent of ``memory_id`` and ``draft_id``. A page can be re-synthesised from a memory and still keep its identity across that operation.
* Redirect stubs accept either ``redirect_to`` (path) or ``redirect_id`` (UUID). When both are present the ID wins: paths are mutable but IDs are stable. Path-based chain resolution is implemented here; ID-based resolution requires an id→path index that lands with the read-handler changes in Phase 3.2.
* Cycle + depth protection: ``resolve_chain`` returns ``None`` on a visited-node cycle or chains longer than ``MAX_REDIRECT_DEPTH=5``, matching MediaWiki convention.

Backfill behaviour
Idempotent. Pages with valid ids are skipped. Redirect stubs are skipped (they don't need their own identity, they reference another page's). Pages with no frontmatter at all are skipped (synthesising frontmatter would change page semantics). Pages with a malformed id (``id: garbage``) have the line replaced rather than duplicated.

Dry-run against the live wiki (9608 pages):

The remaining Phase 3 work (applying the backfill to your wiki, plus the handler-layer changes: ``wiki_read`` follows redirects, ``wiki_migrate`` writes stubs on rename, ``wiki_list`` optionally hides redirects) lands in follow-ups.

Tests
* ``tests_py/core/test_wiki_identity.py``
* ``tests_py/core/test_wiki_redirect.py``
* ``tests_py/core/test_wiki_sync_routing.py``
* ``tests_py/scripts/test_wiki_backfill_ids.py``
* ``tests_py/core/`` + ``tests_py/shared/`` + ``tests_py/scripts/``: 2049 passed
* ``ruff format --check`` and ``ruff check`` clean

How to use after merge
Run ``scripts/wiki_backfill_ids.py`` for a dry-run, then re-run it with ``--apply`` to write the ids.
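The per-page rules listed under "Backfill behaviour" can be sketched as a single pure function. This is an illustrative sketch under stated assumptions, not the code in ``scripts/wiki_backfill_ids.py``: the function name ``backfill``, the action labels, the regex-based frontmatter handling, and the choice to insert the new ``id`` as the first frontmatter line are all hypothetical.

```python
import re
import uuid

# Assumption: frontmatter is a leading block delimited by '---' lines.
ID_LINE = re.compile(r"^id:[ \t]*(.*)$", re.MULTILINE)
REDIRECT_LINE = re.compile(r"^redirect_(to|id):", re.MULTILINE)


def is_valid_uuid4(value: str) -> bool:
    """True only for a UUID4 in the canonical 36-character form."""
    try:
        parsed = uuid.UUID(value)
    except ValueError:
        return False
    return parsed.version == 4 and str(parsed) == value.lower()


def backfill(text: str) -> tuple[str, str]:
    """Return (action, new_text) for one page; the input is never mutated."""
    if not text.startswith("---\n"):
        return "skip:no-frontmatter", text
    end = text.find("\n---", 3)
    if end == -1:
        return "skip:no-frontmatter", text
    front, rest = text[4:end], text[end:]

    if REDIRECT_LINE.search(front):
        return "skip:redirect", text  # stubs reference another page's id
    match = ID_LINE.search(front)
    if match and is_valid_uuid4(match.group(1).strip()):
        return "skip:has-id", text  # this is what makes re-runs idempotent

    new_id = str(uuid.uuid4())
    if match:
        # Malformed id: replace the line in place instead of adding a second one.
        front = front[: match.start()] + f"id: {new_id}" + front[match.end():]
        action = "replace"
    else:
        front = f"id: {new_id}\n{front}" if front else f"id: {new_id}"
        action = "mint"
    return action, "---\n" + front + rest
```

A dry-run would call ``backfill`` on each page and only count the actions; an apply run would additionally write ``new_text`` back when the action is ``mint`` or ``replace``.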
Out of scope for this PR (Phase 3.2 / Phase 4 follow-ups)
* ``wiki_read`` handler resolves redirects transparently
* ``wiki_migrate`` writes a redirect stub at the old path when moving a page
* ``wiki_list`` / ``wiki_reindex`` learn to hide or annotate redirect stubs

🤖 Generated with Claude Code
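For readers following the redirect design, the cycle + depth protection of path-based chain resolution could look roughly like this. Only the name ``resolve_chain``, the ``None`` return on failure, and ``MAX_REDIRECT_DEPTH=5`` come from the PR description; the signature, the ``redirect_of`` lookup callback, and the exact boundary (a chain of exactly five hops is accepted here) are assumptions made for illustration.

```python
from typing import Callable, Optional

MAX_REDIRECT_DEPTH = 5  # value stated in the PR description


def resolve_chain(
    start: str,
    redirect_of: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Follow path redirects from `start` to a real page.

    `redirect_of(path)` is a hypothetical lookup returning the stub's
    target path, or None when `path` is not a redirect stub.
    Returns None on a cycle or when the chain needs more than
    MAX_REDIRECT_DEPTH hops.
    """
    visited = {start}
    current = start
    for _ in range(MAX_REDIRECT_DEPTH):
        nxt = redirect_of(current)
        if nxt is None:
            return current  # reached a non-redirect page
        if nxt in visited:
            return None  # visited-node cycle
        visited.add(nxt)
        current = nxt
    # All allowed hops are used; accept only if we landed on a real page.
    return current if redirect_of(current) is None else None
```

Passing the lookup in as a callback keeps the function pure, in line with the "Pure logic" note on ``wiki_redirect.py``; a handler would supply a function that reads stub frontmatter from disk.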