
feat(wiki): stable page IDs + redirect stubs (ADR-2244 Phase 3 foundation)#33

Merged: cdeust merged 1 commit into main from feat/wiki-stable-ids-phase3 on May 13, 2026


@cdeust commented May 13, 2026

Summary

Phase 3 of ADR-2244 — making the path a view over a stable identifier so the upcoming Phase 4 bulk migration (rename the .md.md / timestamp-slug / path-leak pages, re-bucket the 7820 file-docs) doesn't rot inbound links.

What lands here

| Module | Purpose |
| --- | --- |
| mcp_server/core/wiki_identity.py | UUID4 generation, parsing, validation, frontmatter helpers. Pure logic. |
| mcp_server/core/wiki_redirect.py | Redirect data model, frontmatter detection, path-based chain resolution (cycle + depth protection), stub authoring. Pure logic. |
| mcp_server/core/wiki_sync.py | New writes carry id: <uuid> in their frontmatter. |
| scripts/wiki_backfill_ids.py | One-shot CLI that walks the wiki and mints an id on every page that lacks one. Dry-run by default; --apply to write. Idempotent. |

Design choices

  • UUID4, not UUID1 — UUID1 leaks host MAC into the identifier; the wiki may be exported, so we use the random variant.
  • Canonical 36-character form (32 hex digits plus 4 hyphens). No structured embedding in the path; paths stay human-readable.
  • IDs are independent of memory_id and draft_id. A page can be re-synthesised from a memory and still keep its identity across that operation.
  • Redirect stubs accept either redirect_to (path) or redirect_id (UUID). When both are present the ID wins, because paths are mutable but IDs are stable. Path-based chain resolution is implemented here; ID-based resolution requires an id→path index that lands with the read-handler changes in Phase 3.2.
  • Cycle + depth protection: resolve_chain returns None on a visited-node cycle or on chains longer than MAX_REDIRECT_DEPTH=5, matching MediaWiki convention.
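
To make the identifier rules above concrete, here is a minimal sketch of minting, validating, and extracting a page id. The function names and the regex are hypothetical; the actual API of wiki_identity.py is not shown in this PR text, so treat this as an illustration of the stated design (random UUID4, canonical 36-character form) rather than the module itself.

```python
import re
import uuid
from typing import Optional

# Hypothetical helpers; names are assumptions, not the wiki_identity.py API.
_ID_LINE = re.compile(r"^id:\s*(\S+)\s*$", re.MULTILINE)


def mint_page_id() -> str:
    """Return a fresh random (version 4) UUID in canonical 36-character form."""
    return str(uuid.uuid4())


def is_valid_page_id(value: object) -> bool:
    """Accept only canonical, lowercase, hyphenated UUID4 strings."""
    try:
        parsed = uuid.UUID(value)  # type: ignore[arg-type]
    except (ValueError, AttributeError, TypeError):
        return False
    # str(parsed) is the canonical form, so braces, URN prefixes and
    # uppercase spellings are rejected by the equality check.
    return parsed.version == 4 and str(parsed) == value


def extract_page_id(frontmatter: str) -> Optional[str]:
    """Return the id found in a frontmatter block, or None if absent or malformed."""
    match = _ID_LINE.search(frontmatter)
    if match and is_valid_page_id(match.group(1)):
        return match.group(1)
    return None
```

A UUID1 string fails the version check in this sketch, which mirrors the "UUID4, not UUID1" choice above.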

Backfill behaviour

Idempotent. Pages with valid ids are skipped. Redirect stubs are skipped (they don't need their own identity, they reference another page's). Pages with no frontmatter at all are skipped (synthesising frontmatter would change page semantics). Pages with a malformed id (id: garbage) have the line replaced rather than duplicated.
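
The four backfill rules can be sketched as a single pure function over page text. This is an assumption-laden illustration, not the code in scripts/wiki_backfill_ids.py: the function name, the status strings, the frontmatter regex, and the choice to put a newly minted id on the first frontmatter line are all hypothetical.

```python
import re
import uuid
from typing import Tuple

# Hypothetical patterns; the real script's parsing is not shown in the PR.
_FRONTMATTER = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)
_ID_LINE = re.compile(r"^id:[ \t]*(.*)$", re.MULTILINE)
_REDIRECT_LINE = re.compile(r"^redirect_(to|id):", re.MULTILINE)


def _is_uuid4(value: str) -> bool:
    try:
        return uuid.UUID(value).version == 4
    except ValueError:
        return False


def backfill_page(text: str) -> Tuple[str, str]:
    """Return (status, new_text) for one page.

    status is 'skipped-no-fm', 'skipped-redirect', 'skipped-has-id' or 'minted'.
    """
    fm_match = _FRONTMATTER.match(text)
    if fm_match is None:
        return "skipped-no-fm", text  # never synthesise frontmatter
    frontmatter = fm_match.group(1)
    if _REDIRECT_LINE.search(frontmatter):
        return "skipped-redirect", text  # stubs reference another page's id
    new_line = f"id: {uuid.uuid4()}"
    id_match = _ID_LINE.search(frontmatter)
    if id_match is None:
        frontmatter = new_line + "\n" + frontmatter
    elif _is_uuid4(id_match.group(1).strip()):
        return "skipped-has-id", text
    else:
        # Malformed id: replace the line in place instead of adding a second one.
        frontmatter = (
            frontmatter[: id_match.start()] + new_line + frontmatter[id_match.end():]
        )
    return "minted", "---\n" + frontmatter + "\n---\n" + text[fm_match.end():]
```

Running the function on its own output returns 'skipped-has-id' with the text unchanged, which is the idempotence property described above.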

Dry-run against the live wiki (9608 pages):

Scanned:           9608
Would mint:        9607
Skipped (no fm):   1
Errored:           0

The remaining Phase 3 work — applying the backfill to your wiki, plus the handler-layer changes (wiki_read follows redirects, wiki_migrate writes stubs on rename, wiki_list optionally hides redirects) — lands in follow-ups.

Tests

| File | Tests |
| --- | --- |
| tests_py/core/test_wiki_identity.py | 22 (format validation, generation uniqueness, extraction, ensure-or-mint) |
| tests_py/core/test_wiki_redirect.py | 24 (parse, dataclass validation, chain resolution: single/multi-hop, cycles, self-loop, depth limit, id-only; stub authoring + roundtrip) |
| tests_py/core/test_wiki_sync_routing.py | +2 (sync writes a valid id, distinct ids per page) |
| tests_py/scripts/test_wiki_backfill_ids.py | 12 (dry-run idempotence, redirect/no-fm skipping, malformed-id replacement, distinct ids across pages) |
| Total new | 60 |
  • Targeted suite: 66 passed
  • tests_py/core/ + tests_py/shared/ + tests_py/scripts/: 2049 passed
  • ruff format --check and ruff check clean
  • Dry-run against live 9608-page wiki: 0 errors, 9607 would-mint, 1 skipped

How to use after merge

# Dry-run (recommended first pass)
python scripts/wiki_backfill_ids.py

# Apply to the live wiki
python scripts/wiki_backfill_ids.py --apply

# Idempotent — second --apply is a no-op
python scripts/wiki_backfill_ids.py --apply

Out of scope for this PR (Phase 3.2 / Phase 4 follow-ups)

  • wiki_read handler resolves redirects transparently
  • wiki_migrate writes a redirect stub at the old path when moving a page
  • wiki_list / wiki_reindex learn to hide or annotate redirect stubs
  • Bulk migration script that uses redirect stubs (Phase 4)

🤖 Generated with Claude Code

…tion)

Phase 3 makes the path a *view* over a stable identifier. Without
stable IDs the upcoming Phase 4 bulk migration (renaming the .md.md /
timestamp-slug / path-leak pages, re-bucketing the 7820 file-docs)
would rot every inbound link.

What lands here
---------------

  mcp_server/core/wiki_identity.py    UUID4 generation, parsing,
                                      validation, frontmatter helpers.
                                      Pure logic.
  mcp_server/core/wiki_redirect.py    Redirect data model, frontmatter
                                      detection, path-based chain
                                      resolution with cycle + depth
                                      protection, stub authoring.
                                      Pure logic.
  mcp_server/core/wiki_sync.py        New writes carry ``id: <uuid>``
                                      in their frontmatter.
  scripts/wiki_backfill_ids.py        One-shot CLI that walks the wiki
                                      and mints an id on every page
                                      that lacks one. Dry-run by
                                      default; apply with ``--apply``.

Design choices
--------------

* UUID4, not UUID1 — UUID1 leaks host MAC into the identifier; the
  wiki may be exported, so we use the random variant.
* Canonical 36-character form (32 hex digits plus 4 hyphens). No
  structured embedding in the path; paths stay human-readable.
* IDs are independent of ``memory_id`` and ``draft_id``. A page can
  be re-synthesised from a memory and still keep its identity.
* Redirect stubs accept either ``redirect_to`` (path) or
  ``redirect_id`` (UUID). When both are present the ID wins, because
  paths are mutable but IDs are stable. Path-based chain resolution is
  implemented here; ID-based resolution requires an id→path index
  that the read-handler layer will provide in Phase 3.2.
* Cycle + depth protection: ``resolve_chain`` returns None on a
  visited-node cycle or chains longer than ``MAX_REDIRECT_DEPTH=5``,
  matching MediaWiki convention.
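
The cycle and depth rules can be illustrated with a small path-based walker. This is a sketch under assumptions: the real ``resolve_chain`` signature is not shown here, so the mapping argument, the list return value, and the reading of "longer than 5" as "more than 5 hops" are all hypothetical.

```python
from typing import List, Mapping, Optional

MAX_REDIRECT_DEPTH = 5  # value taken from the commit message


def resolve_chain(
    start: str, redirects: Mapping[str, Optional[str]]
) -> Optional[List[str]]:
    """Follow path-based redirects starting at `start`.

    `redirects` maps every known page path to its redirect target,
    or to None when the page is a live (non-stub) page.
    Returns the visited paths ending at the live page, or None on a
    cycle, a dangling target, or a chain of more than 5 hops.
    """
    chain = [start]
    seen = {start}
    current = start
    while True:
        if current not in redirects:
            return None  # dangling: the target page does not exist
        target = redirects[current]
        if target is None:
            return chain  # reached a live page
        if target in seen:
            return None  # visited-node cycle (includes self-loops)
        if len(chain) > MAX_REDIRECT_DEPTH:
            return None  # following this hop would exceed the depth limit
        chain.append(target)
        seen.add(target)
        current = target
```

Returning None for every failure mode keeps the function pure; a caller such as a read handler can then decide how to report the error.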

Backfill behaviour
------------------

Idempotent. Pages with valid ids are skipped. Redirect stubs are
skipped (they don't need their own identity, they reference another
page's). Pages with no frontmatter at all are skipped (synthesising
frontmatter would change page semantics). Pages with a malformed id
(``id: garbage``) have the line replaced rather than duplicated.

Dry-run against the live wiki (9608 pages):
  Scanned:           9608
  Would mint:        9607
  Skipped (no fm):   1
  Errored:           0

The remaining work — applying the backfill to the user's wiki, and
the handler-layer changes (``wiki_read`` follows redirects,
``wiki_migrate`` writes stubs on rename) — lands in follow-ups.

Tests
-----

  tests_py/core/test_wiki_identity.py          22 tests — format
                                               validation, generation
                                               uniqueness, extraction,
                                               ensure-or-mint.
  tests_py/core/test_wiki_redirect.py          24 tests — parse, dataclass
                                               validation, chain resolution
                                               (single hop, multi hop,
                                               cycles, self-loop, depth
                                               limit, id-only redirect),
                                               stub authoring + roundtrip.
  tests_py/core/test_wiki_sync_routing.py      2 new tests — sync writes
                                               a valid id, distinct ids
                                               per page.
  tests_py/scripts/test_wiki_backfill_ids.py   12 tests — dry-run
                                               idempotence, redirect/no-fm
                                               skipping, malformed-id
                                               replacement, distinct ids
                                               across pages.

Targeted suite: 66 passed. tests_py/core/ + tests_py/shared/ +
tests_py/scripts/: 2049 passed. ``ruff format --check`` and
``ruff check`` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cdeust cdeust merged commit 4590a14 into main May 13, 2026
11 checks passed
cdeust added a commit that referenced this pull request May 13, 2026
…Phase 3.2) (#34)

Wires the Phase 3 data model (#33) into the read path and adds a new
write handler that performs the rename + stub atomically. With this
change ``wiki_rename old.md new.md`` produces:

  * ``new.md``  — the original content moved verbatim (id preserved)
  * ``old.md``  — a redirect stub pointing at new.md (with redirect_id
                  = source page id, for future id-based resolution)

And ``wiki_read old.md`` then returns the content of ``new.md`` along
with ``redirect_chain: ["old.md", "new.md"]``. Inbound links to the
old path keep working through the migration.

Handler changes
---------------

* ``wiki_read``  — follow redirect stubs transparently up to 5 hops.
                   ``follow_redirects: false`` opts out (admin/migration
                   tooling that needs to inspect the stub itself).
                   New response field: ``redirect_chain``.

* ``wiki_list``  — exclude redirect stubs from the listing by default.
                   ``include_redirects: true`` opts in. New response
                   field: ``redirect_count``.

* ``wiki_reindex`` — drop redirect stubs from .generated/INDEX.md and
                     surface the count by kind in the response. The
                     index now lists only live pages, which is what
                     readers actually want.

* ``wiki_rename``  — NEW. Move a page from one path to another and
                     leave a stub at the old path. Refuses to operate
                     on pages without a stable frontmatter id (run
                     ``scripts/wiki_backfill_ids.py --apply`` first),
                     refuses to chain stubs (rename the terminal page
                     instead), refuses to overwrite an existing
                     destination unless ``overwrite_dest=true``.

Tool registry: ``wiki_rename`` registered alongside the other 8 wiki
tools. ``wiki_read`` and ``wiki_list`` MCP signatures extended with
their new optional parameters.
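
The read and list behaviour described above can be sketched against an in-memory wiki. Only ``follow_redirects``, ``include_redirects``, ``redirect_chain`` and ``redirect_count`` come from the commit message; the function signatures, the page dictionary shape, and the other response keys (``path``, ``body``, ``pages``, ``error``) are assumptions made for illustration.

```python
from typing import Dict, Optional

MAX_HOPS = 5  # "up to 5 hops", per the handler description

# Hypothetical page shape: {"body": str, "redirect_to": Optional[str]}
Wiki = Dict[str, Dict[str, Optional[str]]]


def wiki_read(pages: Wiki, path: str, follow_redirects: bool = True) -> dict:
    """Return a page, following redirect stubs unless the caller opts out."""
    if path not in pages:
        return {"error": f"page not found: {path}"}
    chain = [path]
    current = path
    while follow_redirects and pages[current].get("redirect_to"):
        target = pages[current]["redirect_to"]
        if target in chain:
            return {"error": f"redirect cycle at {target}"}
        if target not in pages:
            return {"error": f"dangling redirect to {target}"}
        if len(chain) > MAX_HOPS:
            return {"error": "redirect chain too long"}
        chain.append(target)
        current = target
    return {
        "path": current,
        "body": pages[current]["body"],
        # An empty chain means no redirect was followed.
        "redirect_chain": chain if len(chain) > 1 else [],
    }


def wiki_list(pages: Wiki, include_redirects: bool = False) -> dict:
    """List live pages; stubs are hidden by default but always counted."""
    stubs = {p for p, page in pages.items() if page.get("redirect_to")}
    visible = sorted(p for p in pages if include_redirects or p not in stubs)
    return {"pages": visible, "redirect_count": len(stubs)}
```

With ``follow_redirects=False`` the sketch returns the stub itself, which is the inspection mode the message describes for admin and migration tooling.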

Stub semantics
--------------

The stub carries ``redirect_id = <source page id>`` so that id-based
resolution, which a follow-up will add to handle a page whose path is
renamed more than once, has the data it needs. ``redirect_to`` is
populated with the new path as the cheap path-based resolution
target. Both forms are emitted; the id takes precedence once an
id-aware reader exists.
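
A stub that carries both fields, with the id taking precedence, might look like the sketch below. The frontmatter keys ``redirect_to``, ``redirect_id`` and ``redirect_reason`` appear in this commit; the dataclass, the function names, and the simple line-based serialisation are hypothetical and are not the ``wiki_redirect`` module's actual code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

_KEYS = ("redirect_to", "redirect_id", "redirect_reason")


@dataclass(frozen=True)
class Redirect:
    redirect_to: Optional[str] = None  # mutable path of the target page
    redirect_id: Optional[str] = None  # stable id of the target page
    reason: str = ""

    def __post_init__(self) -> None:
        if not (self.redirect_to or self.redirect_id):
            raise ValueError("a redirect needs redirect_to, redirect_id, or both")

    def preferred_target(self) -> Tuple[str, Optional[str]]:
        """The id wins when both forms are present."""
        if self.redirect_id:
            return ("id", self.redirect_id)
        return ("path", self.redirect_to)


def render_stub(redirect: Redirect) -> str:
    """Serialise a redirect as a frontmatter-only page."""
    values = (redirect.redirect_to, redirect.redirect_id, redirect.reason)
    lines = ["---"]
    lines += [f"{key}: {value}" for key, value in zip(_KEYS, values) if value]
    lines.append("---")
    return "\n".join(lines) + "\n"


def parse_stub(text: str) -> Optional[Redirect]:
    """Return the Redirect described by a page, or None for a normal page."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in _KEYS:
            fields[key.strip()] = value.strip()
    if not (fields.get("redirect_to") or fields.get("redirect_id")):
        return None
    return Redirect(
        fields.get("redirect_to"),
        fields.get("redirect_id"),
        fields.get("redirect_reason", ""),
    )
```

Emitting both fields costs one extra line per stub and lets a path-only reader and a later id-aware reader consume the same file.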

Tests
-----

``tests_py/handlers/test_wiki_redirect_handlers.py`` (NEW) — 20 tests
covering every handler change:

  read:
    - returns content for a normal page (chain = [])
    - follows single-hop redirect
    - follows multi-hop chain (3 pages, 2 hops)
    - ``follow_redirects: false`` returns the stub itself
    - cycle returns error
    - dangling redirect returns error
    - missing source returns error

  list:
    - excludes stubs by default; redirect_count surfaced
    - ``include_redirects: true`` returns both
    - redirect_count is 0 when no stubs

  reindex:
    - stubs absent from INDEX.md; by_kind counts only live pages

  rename:
    - creates stub at old path with correct redirect_to, redirect_id,
      redirect_reason
    - refuses missing source
    - refuses source without id
    - refuses existing destination
    - ``overwrite_dest=true`` works
    - refuses to chain stubs
    - refuses same path
    - end-to-end: rename then read resolves to the new content
    - body preserved verbatim through the move

Targeted suite: 86 passed (Phase 3 + Phase 3.2 surface).
Broader: tests_py/core/ + tests_py/shared/ + tests_py/scripts/ +
relevant tests_py/handlers/ → 2075 passed.
``ruff format --check`` and ``ruff check`` clean.

What still ships in a follow-up
-------------------------------

  * ID→path index for ID-only redirect resolution (currently only
    path-based chain walking works; id-only stubs return None from
    resolve_chain so they error in wiki_read with a clear message).
  * Phase 4 bulk migration script that loops wiki_rename over the 88
    known pollution paths (.md.md slug bug, timestamp-slugs, path-leak
    titles) — gated on this PR + #33 landing.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cdeust added a commit that referenced this pull request May 13, 2026
…(ADR-2244 Phase 4.1) (#35)

Phase 4 of ADR-2244 — the bulk migration. This is the deterministic
half: three pollution classes with mechanically computable target
paths. The LLM-assisted re-classification half (the 7820 file-doc
re-bucket) is a separate scope and lands in a follow-up.

Targets
-------

Audit 2026-05-12 found three deterministic-rename pollution classes:

  Pattern                                      Audit count
  ────────────────────────────────────────────────────────
  ``*.md.md``                                  58
  ``*-decision-created-YYYY-MM-DDt...z.md``    10
  ``*users-cdeust-... .md``                    11+ (path-leak in slug)

Live dry-run after this commit:
  Pollution paths detected: 70  (all currently skipped because the
                                 backfill from #33 hasn't been applied
                                 yet — the script correctly refuses
                                 to rename pages without a stable id)

Script flow
-----------

  scripts/wiki_bulk_migrate.py

  1. Walk wiki, classify each .md page by pollution pattern.
  2. For each match:
     a. Skip redirect stubs (already moved).
     b. Skip pages without a frontmatter ``id`` (Phase 3 invariant).
        Caller is told to run ``wiki_backfill_ids.py --apply`` first.
     c. Compute clean target path:
          - .md.md             → strip duplicate extension
          - timestamp-slug     → derive slug from frontmatter title
                                 or first body heading
          - path-leak          → same, plus reject path-shaped titles
     d. Record the Pollution record.
  3. On --apply: call the ``wiki_rename`` handler for each item, which
     writes content at the new path and a redirect stub at the old
     one. Inbound links keep resolving.

Idempotency: a second --apply finds zero pollution paths (the
renames landed; their stubs are detected and skipped).
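
Step 1 of the flow, classifying a filename by pollution pattern, could be sketched as follows. The regexes are guesses reconstructed from the three patterns in the Targets table, the class labels and the matching order are assumptions, and the example filenames used with it are invented; the real detection code in scripts/wiki_bulk_migrate.py may differ.

```python
import re
from typing import Optional

# Assumed regexes approximating the three audited patterns.
_DOUBLE_EXT = re.compile(r"\.md\.md$")
_TIMESTAMP_SLUG = re.compile(r"-decision-created-\d{4}-\d{2}-\d{2}t.*z\.md$")
_PATH_LEAK = re.compile(r"(^|-)users-[a-z0-9_.]+-")


def classify(filename: str) -> Optional[str]:
    """Return the pollution class of a page filename, or None if it is clean."""
    name = filename.lower()
    if _DOUBLE_EXT.search(name):
        return "md-md"
    if _TIMESTAMP_SLUG.search(name):
        return "timestamp-slug"
    if _PATH_LEAK.search(name):
        return "path-leak"
    return None


def strip_duplicate_extension(filename: str) -> str:
    """Target path for the .md.md class: drop the repeated extension."""
    while filename.endswith(".md.md"):
        filename = filename[: -len(".md")]
    return filename
```

Only the ``.md.md`` class has a target that depends on the filename alone; the other two classes need the page content, as the slug derivation section explains.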

Slug derivation
---------------

``_derive_clean_slug`` picks from three sources in order:

  1. Frontmatter ``title`` (if non-empty and not path-shaped /
     timestamp-shaped / too short / synthetic ``memory-XXX``)
  2. First body H1/H2 heading (same cleanness check)
  3. Deterministic 6-hex-character hash of the body content
     prefixed with the kind (``decision-abc123`` / ``page-def456``)

The hash fallback is rare — most pollution pages already have a
proper ``title`` field; it's the *slug* that's broken, not the
metadata.
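
The three-source fallback can be sketched like this. The order of sources and the 6-hex-character, kind-prefixed hash come from the description above; the public function name, the use of SHA-256, the minimum title length of 4, and the exact "path-shaped" and "timestamp-shaped" tests are assumptions.

```python
import hashlib
import re
from typing import Optional

MIN_TITLE_LENGTH = 4  # assumed threshold for "too short"
_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[tT ]\d{2}")
_SYNTHETIC = re.compile(r"^memory-\w+$", re.IGNORECASE)
_HEADING = re.compile(r"^#{1,2}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def _is_clean(title: str) -> bool:
    """Reject too-short, path-shaped, timestamp-shaped and synthetic titles."""
    title = title.strip()
    if len(title) < MIN_TITLE_LENGTH:
        return False
    if "/" in title or "\\" in title:
        return False
    if _TIMESTAMP.search(title):
        return False
    return not _SYNTHETIC.match(title)


def _slugify(text: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def derive_clean_slug(title: Optional[str], body: str, kind: str = "page") -> str:
    """Pick a slug from the title, then the first H1/H2, then a content hash."""
    heading = _HEADING.search(body)  # first H1/H2 heading only
    for candidate in (title, heading.group(1) if heading else None):
        if candidate and _is_clean(candidate):
            slug = _slugify(candidate)
            if slug:
                return slug
    # Deterministic fallback: same body always yields the same slug.
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()[:6]
    return f"{kind}-{digest}"
```

Because the fallback hashes the body, re-running the planner on an unchanged page proposes the same target, which keeps the dry-run and the apply pass in agreement.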

Tests
-----

``tests_py/scripts/test_wiki_bulk_migrate.py`` (NEW) — 22 tests:

  Detection (6):
    .md.md positive + negative; timestamp-slug positive + negative;
    path-leak positive + negative.

  Slug derivation (5):
    accepts real titles; rejects path / timestamp / too-short titles;
    falls back to body heading; falls back to hash.

  plan() (5):
    finds all three classes in one pass; skips pages without id;
    skips existing redirect stubs; proposes the correct target for
    timestamp-slug and path-leak (preserving numeric and date prefixes).

  apply() / end-to-end (4):
    renames + creates stubs with correct redirect_to and redirect_id;
    idempotent (second run is a no-op); handles three classes in one
    pass; doesn't crash on id-less skipped pages.

  Plus 2 sanity tests for boundary slug shapes.

Targeted: 22 passed. ruff format and check clean.

Operational order
-----------------

  1. Merge #33 (Phase 3 — UUID + redirect modules + backfill script)
  2. Merge #34 (Phase 3.2 — wiki_read / wiki_rename handlers)
  3. Merge this PR (Phase 4.1 — bulk-migrate script)
  4. Run:
       python scripts/wiki_backfill_ids.py --apply
       python scripts/wiki_bulk_migrate.py                # dry-run review
       python scripts/wiki_bulk_migrate.py --apply        # commit moves

Out of scope (follow-ups)
-------------------------

  * ID→path index for ID-only redirect resolution (path-based works
    today; id-only stubs error in wiki_read).
  * Phase 4.2 — file-doc re-bucket (7820 ``notes/<domain>/<id>-file-*``
    pages → ``reference/<domain>/<file-slug>.md`` with provenance
    rewrite). Different operation (changes kind directory, rewrites
    frontmatter); separate script.
  * Phase 5 — classifier-driven cleanup for ai-generated stubs
    (filter not delete).
  * Phase 6 — producer audit (codebase_analyze emits correct
    provenance / lifecycle on its outputs).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cdeust added a commit that referenced this pull request May 13, 2026
…se 4.2)

Producer-side fix #27 routed new file-doc pages to ``reference/<domain>/``
with ``provenance: auto-generated``. The existing population — 8,734
pages written under ``notes/<domain>/<id>-file-*.md`` — never got
moved. This script handles that one-time migration.

Operation per page
------------------

  1. Walk ``notes/<domain>/``; match the file-doc shape
     ``\d+-file-...``.
  2. Skip redirect stubs (already migrated).
  3. Require a frontmatter ``id`` (Phase 3 invariant — run
     ``wiki_backfill_ids.py --apply`` first).
  4. Extract the original source path from the ``file:<path>`` tag
     (canonical even when the on-disk filename was truncated to
     ``98817-file-....md``).
  5. Compute target ``reference/<domain>/<file-slug>.md``.
  6. Rewrite frontmatter to the modern schema:
       kind: reference
       lifecycle: seedling
       audience: [developer]
       provenance: auto-generated
       generator: {model: cortex-codebase-analyze, version: v1,
                   prompt_template: file-doc-v1,
                   generated_at: <original-created>}
     Plus migration trace fields (``source_file_path``,
     ``rebucketed_from``). The original id, title, tags, and body are
     preserved verbatim.
  7. Write the rewritten page at the new path.
  8. Replace the source with a redirect stub that carries
     ``redirect_to`` (path) + ``redirect_id`` (source id) so
     ``wiki_read`` resolves the old path through the stub
     transparently.

The script is intentionally NOT a thin wrapper around ``wiki_rename``:
that handler preserves content verbatim, whereas the file-doc re-bucket
must REWRITE the frontmatter as part of the move. The stub-creation
half does use ``mcp_server.core.wiki_redirect.build_redirect_stub``
for consistency with Phase 3.2.
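
Steps 4 to 6 can be sketched as three small pure functions. The schema values (kind, lifecycle, audience, provenance, generator block) and the ``source_file_path`` and ``rebucketed_from`` fields are taken from the list above; the function names, the slug rule (every run of non-alphanumeric characters becomes one hyphen), and the assumption that the original creation time lives under a ``created`` key are hypothetical.

```python
import re
from typing import Dict, List, Optional


def source_path_from_tags(tags: List[str]) -> Optional[str]:
    """Step 4: recover the documented file's path from its ``file:<path>`` tag."""
    prefix = "file:"
    for tag in tags:
        if tag.startswith(prefix) and len(tag) > len(prefix):
            return tag[len(prefix):]
    return None


def target_path(domain: str, source_path: str) -> str:
    """Step 5: compute ``reference/<domain>/<file-slug>.md`` (empty if no slug)."""
    slug = re.sub(r"[^A-Za-z0-9]+", "-", source_path).strip("-")
    if not slug:
        return ""
    return f"reference/{domain or '_general'}/{slug}.md"


def modern_frontmatter(
    old: Dict[str, object], source_path: str, old_path: str
) -> Dict[str, object]:
    """Step 6: keep id, title and tags; replace the rest with the modern schema."""
    return {
        "id": old["id"],
        "title": old.get("title"),
        "tags": old.get("tags", []),
        "kind": "reference",
        "lifecycle": "seedling",
        "audience": ["developer"],
        "provenance": "auto-generated",
        "generator": {
            "model": "cortex-codebase-analyze",
            "version": "v1",
            "prompt_template": "file-doc-v1",
            "generated_at": old.get("created"),  # assumed source key
        },
        "source_file_path": source_path,
        "rebucketed_from": old_path,
    }
```

Reading the path from the tag rather than from the filename matters because, as step 4 notes, the on-disk filename may have been truncated.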

Live dry-run
------------

  Detected file-doc pages:   8734
  Plan: re-bucket            0
  Skipped (no id):           8734

Same correct refusal as Phase 4.1 — the backfill from #33 hasn't been
applied to the live wiki yet. Once ``wiki_backfill_ids.py --apply``
runs, the plan will flip to ``8734 to re-bucket``.

Idempotency
-----------

  * Second --apply finds zero: source pages are now redirect stubs
    (skipped by plan()), new producers write to reference/ directly
    (skipped by the pattern match).
  * Collision handling: two notes documenting the same source file
    get distinct targets via a ``-<memory_id>`` suffix on the second
    one (rare in practice; observed 0 times on the live wiki).
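
The collision rule can be sketched as a single pass over the planned moves. The function name and the tuple layout are hypothetical, and the sketch assumes the memory id is available for each source page; it only shows the "second one gets a ``-<memory_id>`` suffix" behaviour described above.

```python
from typing import Dict, List, Tuple


def assign_targets(candidates: List[Tuple[str, str, str]]) -> Dict[str, str]:
    """Map each source page to a unique target path.

    Each candidate is (source_page_path, memory_id, proposed_target).
    The first page to claim a target keeps it; a later page proposing
    the same target gets ``-<memory_id>`` appended before the extension.
    """
    taken = set()
    result: Dict[str, str] = {}
    for source, memory_id, target in candidates:
        final = target
        if final in taken:
            stem = target[: -len(".md")] if target.endswith(".md") else target
            final = f"{stem}-{memory_id}.md"
        taken.add(final)
        result[source] = final
    return result
```

Processing candidates in a stable order (for example, sorted by source path) would keep the assignment deterministic across runs; that ordering is a suggestion, not something the commit states.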

Tests
-----

``tests_py/scripts/test_wiki_rebucket_file_docs.py`` (NEW) — 19 tests:

  detection (6):
    - canonical file-doc shape matches; non-file-doc notes don't
    - file tag extracted from block-list and inline-list frontmatter
    - missing/empty file tag handled

  slug derivation (3):
    - separators flattened to hyphens
    - empty source returns empty target
    - empty domain falls back to ``_general``

  plan (5):
    - finds file-doc notes, skips other notes
    - skips pages without id (refusal message)
    - skips pages without file tag
    - disambiguates colliding targets via memory-id suffix
    - skips existing redirect stubs (idempotent re-runs)

  apply (5):
    - modern frontmatter at target (kind/lifecycle/audience/
      provenance/generator/source_file_path)
    - body preserved verbatim
    - redirect stub at source with correct target_path + target_id
    - refuses when destination already exists
    - idempotent (second pass = no-op)

  end-to-end (1):
    - 25 pages across 3 domains move correctly; spot-check each domain

19 passed; ruff format and check clean.

Post-merge operations
---------------------

After PR #36 + this PR land on main:

  python scripts/wiki_backfill_ids.py --apply
  python scripts/wiki_bulk_migrate.py --apply        # Phase 4.1 — 70 paths
  python scripts/wiki_rebucket_file_docs.py          # dry-run review
  python scripts/wiki_rebucket_file_docs.py --apply  # Phase 4.2 — 8734 pages

After all three apply runs:
  * notes/ drops from 92% of the wiki to ~5% (real catch-all content only)
  * reference/ grows to host the 8734 file docs with proper provenance
  * 70 + 8734 redirect stubs preserve all inbound links

Out of scope (Phase 5+)
-----------------------

  * Phase 5 — classifier-driven cleanup for ai-generated seedlings
    (filter from search, do not delete; preserves the auto-gen
    reference pages but hides empty stubs from default views).
  * Phase 6 — producer audit (codebase_analyze emits the modern
    4-tuple directly on new writes; would also write provenance =
    auto-generated + generator block on every output).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cdeust added a commit that referenced this pull request May 13, 2026
…on onto main (#36)

* feat(wiki): handler-layer redirect mechanics + wiki_rename (ADR-2244 Phase 3.2)

Wires the Phase 3 data model (#33) into the read path and adds a new
write handler that performs the rename + stub atomically. With this
change ``wiki_rename old.md new.md`` produces:

  * ``new.md``  — the original content moved verbatim (id preserved)
  * ``old.md``  — a redirect stub pointing at new.md (with redirect_id
                  = source page id, for future id-based resolution)

And ``wiki_read old.md`` then returns the content of ``new.md`` along
with ``redirect_chain: ["old.md", "new.md"]``. Inbound links to the
old path keep working through the migration.

Handler changes
---------------

* ``wiki_read``  — follow redirect stubs transparently up to 5 hops.
                   ``follow_redirects: false`` opts out (admin/migration
                   tooling that needs to inspect the stub itself).
                   New response field: ``redirect_chain``.

* ``wiki_list``  — exclude redirect stubs from the listing by default.
                   ``include_redirects: true`` opts in. New response
                   field: ``redirect_count``.

* ``wiki_reindex`` — drop redirect stubs from .generated/INDEX.md and
                     surface the count by kind in the response. The
                     index now lists only live pages, which is what
                     readers actually want.

* ``wiki_rename``  — NEW. Move a page from one path to another and
                     leave a stub at the old path. Refuses to operate
                     on pages without a stable frontmatter id (run
                     ``scripts/wiki_backfill_ids.py --apply`` first),
                     refuses to chain stubs (rename the terminal page
                     instead), refuses to overwrite an existing
                     destination unless ``overwrite_dest=true``.

Tool registry: ``wiki_rename`` registered alongside the other 8 wiki
tools. ``wiki_read`` and ``wiki_list`` MCP signatures extended with
their new optional parameters.
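
The read-path behaviour described above (follow stubs up to 5 hops, stop
on cycles and dangling targets, report ``redirect_chain``) could be
sketched roughly as follows. This is an illustration only, not the
handler's actual code: ``follow_redirects``, the ``redirects`` mapping and
the ``live_pages`` set are hypothetical stand-ins for reading frontmatter
from disk.

```python
from typing import Dict, List, Optional, Set

MAX_REDIRECT_DEPTH = 5  # hop limit stated in the commit message


def follow_redirects(
    start: str, redirects: Dict[str, str], live_pages: Set[str]
) -> Optional[List[str]]:
    """Return the redirect chain for ``start``.

    [] for a normal page, [start, ..., terminal] when stubs were followed,
    None on a cycle, a dangling target, a missing source, or too many hops.
    ``redirects`` maps stub path -> target path (assumed in-memory model).
    """
    chain = [start]
    visited = {start}
    current = start
    while current in redirects:
        if len(chain) - 1 >= MAX_REDIRECT_DEPTH:
            return None  # chain longer than the hop limit
        nxt = redirects[current]
        if nxt in visited:
            return None  # cycle
        visited.add(nxt)
        chain.append(nxt)
        current = nxt
    if current not in live_pages:
        return None  # dangling redirect or missing source
    return chain if len(chain) > 1 else []
```

Returning ``None`` for every failure keeps the sketch short; a real
handler would presumably distinguish the error cases in its message.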

Stub semantics
--------------

The stub carries ``redirect_id = <source page id>`` so that id-based
resolution, which a follow-up will add, keeps working when a page is
renamed more than once. ``redirect_to`` holds the new path and is the
cheap, path-based resolution target used today. Both fields are
emitted; once an id-aware reader exists, the id takes precedence.
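
A minimal sketch of the "both fields emitted, id wins" rule, under stated
assumptions: the field names ``redirect_to``, ``redirect_id`` and
``redirect_reason`` come from this commit, while ``build_stub_fields``,
``resolve_target`` and the ``id_index`` mapping are hypothetical names
for illustration (the id→path index does not exist yet).

```python
from typing import Dict, Optional


def build_stub_fields(new_path: str, source_id: str, reason: str) -> Dict[str, str]:
    """Frontmatter fields for a redirect stub; both target forms are emitted."""
    return {
        "redirect_to": new_path,    # path-based target, usable today
        "redirect_id": source_id,   # stable id, for a future id-aware reader
        "redirect_reason": reason,
    }


def resolve_target(
    stub: Dict[str, str], id_index: Optional[Dict[str, str]] = None
) -> Optional[str]:
    """Prefer the id when an id->path index is available, else use the path."""
    page_id = stub.get("redirect_id")
    if id_index is not None and page_id in id_index:
        return id_index[page_id]
    return stub.get("redirect_to")
```

With no index the stub resolves by path; once an index knows the page's
current location, a second rename no longer strands the first stub.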

Tests
-----

``tests_py/handlers/test_wiki_redirect_handlers.py`` (NEW) — 20 tests
covering every handler change:

  read:
    - returns content for a normal page (chain = [])
    - follows single-hop redirect
    - follows multi-hop chain (3 pages, 2 hops)
    - ``follow_redirects: false`` returns the stub itself
    - cycle returns error
    - dangling redirect returns error
    - missing source returns error

  list:
    - excludes stubs by default; redirect_count surfaced
    - ``include_redirects: true`` returns both
    - redirect_count is 0 when no stubs

  reindex:
    - stubs absent from INDEX.md; by_kind counts only live pages

  rename:
    - creates stub at old path with correct redirect_to, redirect_id,
      redirect_reason
    - refuses missing source
    - refuses source without id
    - refuses existing destination
    - ``overwrite_dest=true`` works
    - refuses to chain stubs
    - refuses same path
    - end-to-end: rename then read resolves to the new content
    - body preserved verbatim through the move

Targeted suite: 86 passed (Phase 3 + Phase 3.2 surface).
Broader: tests_py/core/ + tests_py/shared/ + tests_py/scripts/ +
relevant tests_py/handlers/ → 2075 passed.
``ruff format --check`` and ``ruff check`` clean.

What still ships in a follow-up
-------------------------------

  * ID→path index for ID-only redirect resolution (currently only
    path-based chain walking works; id-only stubs return None from
    resolve_chain so they error in wiki_read with a clear message).
  * Phase 4 bulk migration script that loops wiki_rename over the 88
    known pollution paths (.md.md slug bug, timestamp-slugs, path-leak
    titles) — gated on this PR + #33 landing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(wiki): deterministic bulk migration for the ~88 pollution paths (ADR-2244 Phase 4.1)

Phase 4 of ADR-2244 — the bulk migration. This is the deterministic
half: three pollution classes with mechanically computable target
paths. The LLM-assisted re-classification half (the 7820 file-doc
re-bucket) is a separate scope and lands in a follow-up.

Targets
-------

Audit 2026-05-12 found three deterministic-rename pollution classes:

  Pattern                                      Audit count
  ──────────────────────────────────────────────────────────
  ``*.md.md``                                  58
  ``*-decision-created-YYYY-MM-DDt...z.md``    10
  ``*users-cdeust-... .md``                    11+ (path-leak in slug)

Live dry-run after this commit:
  Pollution paths detected: 70  (all currently skipped because the
                                 backfill from #33 hasn't been applied
                                 yet — the script correctly refuses
                                 to rename pages without a stable id)

Script flow
-----------

  scripts/wiki_bulk_migrate.py

  1. Walk wiki, classify each .md page by pollution pattern.
  2. For each match:
     a. Skip redirect stubs (already moved).
     b. Skip pages without a frontmatter ``id`` (Phase 3 invariant).
        Caller is told to run ``wiki_backfill_ids.py --apply`` first.
     c. Compute clean target path:
          - .md.md             → strip duplicate extension
          - timestamp-slug     → derive slug from frontmatter title
                                 or first body heading
          - path-leak          → same, plus reject path-shaped titles
     d. Record the Pollution record.
  3. On --apply: call the ``wiki_rename`` handler for each item, which
     writes content at the new path and a redirect stub at the old
     one. Inbound links keep resolving.

Idempotency: a second --apply finds zero pollution paths (the
renames landed; their stubs are detected and skipped).
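
Step 1 of the flow (classify each page by pollution pattern) might look
something like the sketch below. The regular expressions are guesses
reconstructed from the pattern names in the audit table, not the ones in
``scripts/wiki_bulk_migrate.py``; ``classify`` and
``strip_duplicate_extension`` are hypothetical helper names.

```python
import re
from typing import Optional

# Assumed shapes; the real patterns may be stricter or looser.
TIMESTAMP_SLUG = re.compile(r"-created-\d{4}-\d{2}-\d{2}t[\w-]*z\.md$")
PATH_LEAK = re.compile(r"(^|-)users-[a-z0-9_]+-")


def classify(filename: str) -> Optional[str]:
    """Return the pollution class of a page filename, or None if it is clean."""
    name = filename.lower()
    if name.endswith(".md.md"):
        return "double-extension"
    if TIMESTAMP_SLUG.search(name):
        return "timestamp-slug"
    if PATH_LEAK.search(name):
        return "path-leak"
    return None


def strip_duplicate_extension(filename: str) -> str:
    """Collapse a repeated ``.md`` suffix down to a single one."""
    while filename.endswith(".md.md"):
        filename = filename[: -len(".md")]
    return filename
```

Only the ``.md.md`` class has a target that follows from the filename
alone; the other two need the slug derivation described next.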

Slug derivation
---------------

``_derive_clean_slug`` picks from three sources in order:

  1. Frontmatter ``title`` (if non-empty and not path-shaped /
     timestamp-shaped / too short / synthetic ``memory-XXX``)
  2. First body H1/H2 heading (same cleanness check)
  3. Deterministic 6-hex-character hash of the body content
     prefixed with the kind (``decision-abc123`` / ``page-def456``)

The hash fallback is rare — most pollution pages already have a
proper ``title`` field; it's the *slug* that's broken, not the
metadata.
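
The three-source fallback can be sketched as below. This is an
approximation under assumptions, not ``_derive_clean_slug`` itself: the
minimum title length, the exact "path-shaped" and "timestamp-shaped"
checks, the slugifier and the choice of SHA-256 for the hash are all
placeholders picked for illustration.

```python
import hashlib
import re
from typing import Optional

MIN_TITLE_LEN = 4  # assumed threshold for "too short"


def _is_clean(title: str) -> bool:
    t = title.strip().lower()
    if len(t) < MIN_TITLE_LEN:
        return False
    if "/" in t or "\\" in t:                   # path-shaped
        return False
    if re.search(r"\d{4}-\d{2}-\d{2}t", t):     # timestamp-shaped
        return False
    if re.fullmatch(r"memory-\w+", t):          # synthetic memory-XXX
        return False
    return True


def _slugify(text: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def derive_clean_slug(title: Optional[str], body: str, kind: str) -> str:
    """Title first, then the first H1/H2 heading, then a kind-prefixed hash."""
    if title and _is_clean(title):
        return _slugify(title)
    heading = re.search(r"^#{1,2}[ \t]+(.+)$", body, flags=re.MULTILINE)
    if heading and _is_clean(heading.group(1)):
        return _slugify(heading.group(1))
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()[:6]
    return f"{kind}-{digest}"
```

Because the fallback hashes the body, re-running the derivation on an
unchanged page yields the same slug, which keeps dry-run output stable.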

Tests
-----

``tests_py/scripts/test_wiki_bulk_migrate.py`` (NEW) — 22 tests:

  Detection (6):
    .md.md positive + negative; timestamp-slug positive + negative;
    path-leak positive + negative.

  Slug derivation (5):
    accepts real titles; rejects path / timestamp / too-short titles;
    falls back to body heading; falls back to hash.

  plan() (5):
    finds all three classes in one pass; skips pages without id;
    skips existing redirect stubs; proposes the correct target for
    timestamp-slug and path-leak (preserving numeric and date prefixes).

  apply() / end-to-end (4):
    renames + creates stubs with correct redirect_to and redirect_id;
    idempotent (second run is a no-op); handles three classes in one
    pass; doesn't crash on id-less skipped pages.

  Plus 2 sanity tests for boundary slug shapes.

Targeted: 22 passed. ruff format and check clean.

Operational order
-----------------

  1. Merge #33 (Phase 3 — UUID + redirect modules + backfill script)
  2. Merge #34 (Phase 3.2 — wiki_read / wiki_rename handlers)
  3. Merge this PR (Phase 4.1 — bulk-migrate script)
  4. Run:
       python scripts/wiki_backfill_ids.py --apply
       python scripts/wiki_bulk_migrate.py                # dry-run review
       python scripts/wiki_bulk_migrate.py --apply        # commit moves
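
The dry-run-by-default convention used by these commands can be sketched
with ``argparse``. Only the ``--apply`` flag comes from the commit
message; ``parse_args``, ``run`` and the ``rename`` callback are
hypothetical and stand in for whatever the real scripts do.

```python
import argparse
from typing import Callable, Iterable, List, Optional, Tuple


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Migration sketch (dry-run by default)")
    parser.add_argument(
        "--apply",
        action="store_true",
        help="perform the renames; without this flag the plan is only printed",
    )
    return parser.parse_args(argv)


def run(
    plan: Iterable[Tuple[str, str]],
    apply: bool,
    rename: Callable[[str, str], None],
) -> int:
    """Print the plan, or execute it when ``apply`` is set. Returns pages moved."""
    moved = 0
    for old_path, new_path in plan:
        if apply:
            rename(old_path, new_path)
            moved += 1
        else:
            print(f"would rename {old_path} -> {new_path}")
    return moved
```

Making the destructive path opt-in means the review step in the order
above costs nothing: the same command without ``--apply`` shows the plan.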

Out of scope (follow-ups)
-------------------------

  * ID→path index for ID-only redirect resolution (path-based works
    today; id-only stubs error in wiki_read).
  * Phase 4.2 — file-doc re-bucket (7820 ``notes/<domain>/<id>-file-*``
    pages → ``reference/<domain>/<file-slug>.md`` with provenance
    rewrite). Different operation (changes kind directory, rewrites
    frontmatter); separate script.
  * Phase 5 — classifier-driven cleanup for ai-generated stubs
    (filter not delete).
  * Phase 6 — producer audit (codebase_analyze emits correct
    provenance / lifecycle on its outputs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: bump tool count assertion 47 → 48 for new wiki_rename (ADR-2244 Phase 3.2)

CI on PR #36 fails on tests_py/test_main.py:70 — the mcp_server tool
count is now 48 because Phase 3.2 (#34's content, now flowing into
main via this PR) registers ``wiki_rename`` as a new tool. The
assertion is a hard count + membership check; both updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cdeust added a commit that referenced this pull request May 13, 2026
…se 4.2)

Producer-side fix #27 routed new file-doc pages to ``reference/<domain>/``
with ``provenance: auto-generated``. The existing population — 8,734
pages written under ``notes/<domain>/<id>-file-*.md`` — never got
moved. This script handles that one-time migration.

Operation per page
------------------

  1. Walk ``notes/<domain>/``; match the file-doc shape
     ``\\d+-file-...``.
  2. Skip redirect stubs (already migrated).
  3. Require a frontmatter ``id`` (Phase 3 invariant — run
     ``wiki_backfill_ids.py --apply`` first).
  4. Extract the original source path from the ``file:<path>`` tag
     (canonical even when the on-disk filename was truncated to
     ``98817-file-....md``).
  5. Compute target ``reference/<domain>/<file-slug>.md``.
  6. Rewrite frontmatter to the modern schema:
       kind: reference
       lifecycle: seedling
       audience: [developer]
       provenance: auto-generated
       generator: {model: cortex-codebase-analyze, version: v1,
                   prompt_template: file-doc-v1,
                   generated_at: <original-created>}
     Plus migration trace fields (``source_file_path``,
     ``rebucketed_from``). The original id, title, tags, and body are
     preserved verbatim.
  7. Write the rewritten page at the new path.
  8. Replace the source with a redirect stub that carries
     ``redirect_to`` (path) + ``redirect_id`` (source id) so
     ``wiki_read`` resolves the old path through the stub
     transparently.

The script is intentionally NOT a thin wrapper around ``wiki_rename``:
that handler preserves content verbatim, whereas the file-doc re-bucket
must REWRITE the frontmatter as part of the move. The stub-creation
half does use ``mcp_server.core.wiki_redirect.build_redirect_stub``
for consistency with Phase 3.2.
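
Steps 1, 4 and 5 (match the file-doc shape, read the ``file:<path>`` tag,
compute the target) could look roughly like this. The helper names and
the exact slug rule (lowercasing, every non-alphanumeric run becoming a
hyphen) are assumptions; the commit only states that separators are
flattened to hyphens, that an empty source yields an empty target, and
that an empty domain falls back to ``_general``.

```python
import re
from typing import List

FILE_DOC_NAME = re.compile(r"^\d+-file-")  # the ``\d+-file-...`` shape


def file_tag_source(tags: List[str]) -> str:
    """Return the path carried by the first ``file:<path>`` tag, or ""."""
    for tag in tags:
        if tag.startswith("file:"):
            return tag[len("file:"):].strip()
    return ""


def target_path(source_path: str, domain: str) -> str:
    """Compute ``reference/<domain>/<file-slug>.md`` for a source file path."""
    if not source_path:
        return ""
    slug = re.sub(r"[^A-Za-z0-9]+", "-", source_path).strip("-").lower()
    if not slug:
        return ""
    return f"reference/{domain or '_general'}/{slug}.md"
```

Deriving the slug from the tag rather than the on-disk filename is what
lets truncated names still map to a full, readable target.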

Live dry-run
------------

  Detected file-doc pages:   8734
  Plan: re-bucket            0
  Skipped (no id):           8734

Same correct refusal as Phase 4.1 — the backfill from #33 hasn't been
applied to the live wiki yet. Once ``wiki_backfill_ids.py --apply``
runs, the plan will flip to ``8734 to re-bucket``.

Idempotency
-----------

  * Second --apply finds zero: source pages are now redirect stubs
    (skipped by plan()), new producers write to reference/ directly
    (skipped by the pattern match).
  * Collision handling: two notes documenting the same source file
    get distinct targets via a ``-<memory_id>`` suffix on the second
    one (rare in practice; observed 0 times on the live wiki).
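
The collision rule can be illustrated with a short sketch. It assumes the
plan is a list of ``(memory_id, proposed_target)`` pairs processed in
order; ``assign_targets`` is a hypothetical name and the real script may
track collisions differently.

```python
from typing import Dict, List, Set, Tuple


def assign_targets(pages: List[Tuple[str, str]]) -> Dict[str, str]:
    """Map memory_id -> final target, suffixing later pages that collide."""
    taken: Set[str] = set()
    result: Dict[str, str] = {}
    for memory_id, target in pages:
        final = target
        if final in taken:
            # Second and later claimants get ``-<memory_id>`` before ``.md``.
            stem = target[: -len(".md")] if target.endswith(".md") else target
            final = f"{stem}-{memory_id}.md"
        taken.add(final)
        result[memory_id] = final
    return result
```

The first page to claim a target keeps the clean name, so the common
case (no collision) produces paths with no id noise in them.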

Tests
-----

``tests_py/scripts/test_wiki_rebucket_file_docs.py`` (NEW) — 19 tests:

  detection (6):
    - canonical file-doc shape matches; non-file-doc notes don't
    - file tag extracted from block-list and inline-list frontmatter
    - missing/empty file tag handled

  slug derivation (3):
    - separators flattened to hyphens
    - empty source returns empty target
    - empty domain falls back to ``_general``

  plan (5):
    - finds file-doc notes, skips other notes
    - skips pages without id (refusal message)
    - skips pages without file tag
    - disambiguates colliding targets via memory-id suffix
    - skips existing redirect stubs (idempotent re-runs)

  apply (5):
    - modern frontmatter at target (kind/lifecycle/audience/
      provenance/generator/source_file_path)
    - body preserved verbatim
    - redirect stub at source with correct target_path + target_id
    - refuses when destination already exists
    - idempotent (second pass = no-op)

  end-to-end (1):
    - 25 pages across 3 domains move correctly; spot-check each domain

19 passed; ruff format and check clean.

Post-merge operations
---------------------

After PR #36 + this PR land on main:

  python scripts/wiki_backfill_ids.py --apply
  python scripts/wiki_bulk_migrate.py --apply        # Phase 4.1 — 70 paths
  python scripts/wiki_rebucket_file_docs.py          # dry-run review
  python scripts/wiki_rebucket_file_docs.py --apply  # Phase 4.2 — 8734 pages

After all three apply runs:
  * notes/ drops from 92% of the wiki to ~5% (real catch-all content only)
  * reference/ grows to host the 8734 file docs with proper provenance
  * 70 + 8734 redirect stubs preserve all inbound links

Out of scope (Phase 5+)
-----------------------

  * Phase 5 — classifier-driven cleanup for ai-generated seedlings
    (filter from search, do not delete; preserves the auto-gen
    reference pages but hides empty stubs from default views).
  * Phase 6 — producer audit (codebase_analyze emits the modern
    4-tuple directly on new writes; would also write provenance =
    auto-generated + generator block on every output).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cdeust added a commit that referenced this pull request May 13, 2026
…se 4.2) (#37)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cdeust added a commit that referenced this pull request May 13, 2026
…n complete) (#41)

Bundles 11 merged PRs (#30-#40) since v3.15.4 closing out the
ADR-2244 wiki classification cycle:

  Phase 2     #31 #32  pilot migration analyzer + 1000-page
                       verification (96.7% kind-kept, passes target)
  Phase 3     #33      stable page IDs (UUID4) + redirect data model
                       + backfill CLI
  Phase 3.2   #34      handler-layer redirect mechanics (wiki_read
                       follows transparently, wiki_list/wiki_reindex
                       exclude stubs, new wiki_rename tool)
  Phase 4.1   #35 #36  deterministic bulk migration for the 70
                       known pollution paths (.md.md, timestamp-slug,
                       path-leak)
  Phase 4.2   #37      file-doc re-bucket (8734 pages from notes/
                       to reference/ with modern frontmatter)
  Phase 5     #39      filter auto-generated pages from default
                       listings; INDEX.md splits human-authored
                       from auto-gen
  Phase 6     #38      producer audit — codebase_analyze output
                       routes to kind=reference (root-causes the
                       8734-page misroute)
  Phase 6.2   #40      producer audit — wiki_seed_codebase emits
                       modern kind tags the classifier reads
  Security    #30      authlib CVE-2026-44681 bump (dependabot #4)

Notes for users:
  - The wiki on disk is not migrated yet. The migration scripts (in
    scripts/) are dry-run by default. Fully migrating takes three
    ``--apply`` commands; each is idempotent and leaves redirect stubs.
  - Phases 5/6/6.2 take effect on next MCP restart.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>