Skip to content

fix(wiki): plug slug/title leaks producing .md.md, timestamp-slugs, path-titles#26

Merged
cdeust merged 2 commits into
mainfrom
fix/wiki-slug-title-leaks
May 12, 2026
Merged

fix(wiki): plug slug/title leaks producing .md.md, timestamp-slugs, path-titles#26
cdeust merged 2 commits into
mainfrom
fix/wiki-slug-title-leaks

Conversation

@cdeust
Copy link
Copy Markdown
Owner

@cdeust cdeust commented May 12, 2026

Context

Audit of the methodology wiki (7,883 pages) on 2026-05-12 surfaced systematic pollution:

Bug Count Example
.md.md double extension 58 adr/_general/2234-decision-001-zero-dependencies.md.md
Timestamp-as-slug ADRs 10 adr/_general/1828-decision-created-2026-04-15t09-29-10z.md
Embedded path in slug 11+ specs/2026/2026-04-17-also-on-users-cdeust-documents-developments-ai-architect-prd.md

All polluted pages were written 2026-04-21 — six days after v3.10.1 (commit 554f1ac, 2026-04-15) shipped the audit-artefact filter. These are live bugs, not historical pollution.

Root causes and fixes

1. .md.md (slugify preserves . ; six callers append .md)

Reproducer (before fix):

>>> from mcp_server.core.wiki_layout import slugify, adr_filename
>>> adr_filename(2234, slugify("001-zero-dependencies.md"))
'2234-001-zero-dependencies.md.md'

Six callsites all do f"{...}{slug}.md": adr_filename, domain_page_path, wiki_sync:91, draft_compiler:127, ingest_prd:205, ingest_codebase_pages:28. Single-point fix in slugify to strip a trailing chain of .md benefits all of them. Non-.md extensions (.py, .yaml, etc.) are preserved — file_path_slug callers still get login.py as before.

2. Timestamp-as-title (YAML metadata leaked through derive_title)

derive_title accepted any line len > 10 that didn't start with {/[. A YAML line like created: 2026-04-15T09:29:10Z passed that gate. New _YAML_KV_TITLE_PATTERNS reject (created|updated|date|timestamp|time|id|uuid|version): lines and bare ISO-8601 timestamps anywhere in the candidate line.

3. Path-embedded-mid-sentence titles

Existing _PATH_OR_URL_TITLE_PATTERNS only matched paths at line start (^\s*#*\s*[/~]). When content like "also on /Users/cdeust/Documents/..." became first_line, the path was mid-line so the regex didn't trigger; slugify then folded the entire path into the slug. Added two patterns matching /Users/, /home/, /root/, /opt/, /var/, /etc/, /tmp/ and Windows drive paths anywhere in the line.

4. derive_title fallback defeated the deterministic hash fallback

The function fell back to content[:80] when no clean line existed, leaking raw fragments. wiki_sync already had a memory-<hash> fallback but it was unreachable. Now derive_title returns "" when every candidate line is rejected; the caller routes to the hash. Extracted _line_is_title_candidate for a single source-of-truth predicate.

Files changed

  • mcp_server/core/wiki_layout.py — strip trailing .md chain from slug; document the postcondition.
  • mcp_server/core/wiki_classifier.py — extend path/URL filter, add YAML KV filter, extract _line_is_title_candidate, return "" on no-candidate.
  • tests_py/core/test_wiki_layout.py — 4 new tests covering .md strip, multi-.md collapse, non-.md preservation, end-to-end adr_filename.
  • tests_py/core/test_wiki_classifier.py — 6 new tests covering YAML-timestamp rejection, embedded-POSIX-path rejection, Windows path rejection, empty-on-no-candidate, bare ISO timestamp, positive control.

Test plan

  • pytest tests_py/core/ — 1720 passed
  • pytest tests_py/handlers/test_wiki_sync_errors.py tests_py/infrastructure/test_wiki_store.py — 20 passed
  • After merge + plugin re-publish: re-audit wiki page count; verify zero .md.md, zero decision-created-* ADRs, zero users-cdeust slugs in newly-written pages

Scope of fix vs scope of cleanup

This PR prevents future pollution. The existing ~88 polluted pages must be purged separately (see follow-up: cleanup pass using wiki_purge once this lands).

🤖 Generated with Claude Code

cdeust and others added 2 commits May 12, 2026 09:59
…ath-titles

Audit of the methodology wiki (7,883 pages) on 2026-05-12 found:
  - 58 pages with .md.md double extension
  - 10 ADRs slugged "decision-created-2026-04-15t09-29-10z" from YAML
    frontmatter timestamps leaking through as titles
  - 11+ pages with embedded filesystem paths in slugs, e.g.
    "specs/2026-04-17-also-on-users-cdeust-documents-developments-..."
  - All polluted pages written 2026-04-21 — *after* v3.10.1 (2026-04-15)
    shipped the audit-artefact filter, so these are live bugs not history.

Root causes and fixes:

  1. .md.md (slugify preserves '.') — wiki_layout.slugify now strips a
     trailing chain of ".md" before returning. Single-point repair
     benefits all six filename builders (adr_filename, domain_page_path,
     wiki_sync, draft_compiler, ingest_prd, ingest_codebase_pages).

  2. Timestamp-as-title (YAML "created:" line accepted by derive_title) —
     new _YAML_KV_TITLE_PATTERNS reject lines matching
     "(created|updated|date|...): <value>" and bare ISO-8601 timestamps.

  3. Path-embedded-mid-sentence titles (existing _PATH_OR_URL_TITLE_PATTERNS
     only matched paths at line start) — added two patterns that match
     /Users/, /home/, /opt/, /var/, /etc/, /tmp/, /root/ anywhere in the
     line, plus Windows drive paths anywhere.

  4. derive_title fallback leaked raw content[:80] when no clean line
     existed, defeating wiki_sync's deterministic memory-<hash> fallback.
     derive_title now returns "" when every candidate line is rejected
     (path, URL, YAML, JSON, too-short); callers route to the hash.

Regression tests in test_wiki_layout.py (4 new) and test_wiki_classifier.py
(6 new). 1720 core tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cdeust cdeust merged commit f9fab48 into main May 12, 2026
11 checks passed
cdeust added a commit that referenced this pull request May 12, 2026
Bundles four PRs landed since v3.15.3:

  #25  codebase_analyze: default max_files=0 (no cap); fixed truncation
       at 5000 in skill invocation
  #26  wiki slug/title leaks (.md.md, timestamp-slugs, path-titles)
  #27  ADR-2244 Phase 1: multi-axis classification (kind, lifecycle,
       audience, provenance) + Task #8 (file→reference/ routing fix)
  #28  ADR-2244 follow-up: data-driven axis registry replacing closed
       enums with wiki/_schema/<axis>/<value>.md user-extensible files

Changelog entry documents migration impact and the extension contract
for adding new classification values without code changes.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cdeust cdeust deleted the fix/wiki-slug-title-leaks branch May 13, 2026 09:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant