fix(wiki): plug slug/title leaks producing .md.md, timestamp-slugs, path-titles #26
Merged
Audit of the methodology wiki (7,883 pages) on 2026-05-12 found:
- 58 pages with .md.md double extension
- 10 ADRs slugged "decision-created-2026-04-15t09-29-10z" from YAML
frontmatter timestamps leaking through as titles
- 11+ pages with embedded filesystem paths in slugs, e.g.
"specs/2026-04-17-also-on-users-cdeust-documents-developments-..."
- All polluted pages written 2026-04-21 — *after* v3.10.1 (2026-04-15)
shipped the audit-artefact filter, so these are live bugs not history.
Root causes and fixes:
1. .md.md (slugify preserves '.') — wiki_layout.slugify now strips a
trailing chain of ".md" before returning. Single-point repair
benefits all six filename builders (adr_filename, domain_page_path,
wiki_sync, draft_compiler, ingest_prd, ingest_codebase_pages).
2. Timestamp-as-title (YAML "created:" line accepted by derive_title) —
new _YAML_KV_TITLE_PATTERNS reject lines matching
"(created|updated|date|...): <value>" and bare ISO-8601 timestamps.
3. Path-embedded-mid-sentence titles (existing _PATH_OR_URL_TITLE_PATTERNS
only matched paths at line start) — added two patterns that match
/Users/, /home/, /opt/, /var/, /etc/, /tmp/, /root/ anywhere in the
line, plus Windows drive paths anywhere.
4. derive_title fallback leaked raw content[:80] when no clean line
existed, defeating wiki_sync's deterministic memory-<hash> fallback.
derive_title now returns "" when every candidate line is rejected
(path, URL, YAML, JSON, too-short); callers route to the hash.
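Fix 1 can be illustrated with a minimal sketch. The real wiki_layout.slugify is not shown in this PR, so the normalisation rules below are assumptions. The adr_filename signature and its numeric prefix are also hypothetical, standing in for the f"{...}{slug}.md" pattern the callers share. Only the trailing ".md" chain strip reflects the described change.

```python
import re

# Matches ".md", ".md.md", ".md.md.md", ... at the very end of the slug.
_TRAILING_MD_CHAIN = re.compile(r"(?:\.md)+$")


def slugify(text: str) -> str:
    """Hypothetical stand-in for wiki_layout.slugify."""
    slug = text.strip().lower()
    # Assumed normalisation: keep [a-z0-9.] and collapse the rest to "-".
    # Preserving "." is what let ".md" survive into the slug before the fix.
    slug = re.sub(r"[^a-z0-9.]+", "-", slug).strip("-")
    # The fix: strip the trailing ".md" chain so callers can append ".md" once.
    return _TRAILING_MD_CHAIN.sub("", slug)


def adr_filename(number: int, title: str) -> str:
    """Hypothetical caller using the shared f"{...}{slug}.md" pattern."""
    return f"{number}-{slugify(title)}.md"


if __name__ == "__main__":
    print(adr_filename(2234, "Decision 001 Zero Dependencies.md"))
    print(slugify("notes.md.md"))
    print(slugify("login.py"))
```

Because the strip lives in slugify itself, a caller that receives a title already ending in ".md" still emits a single extension, and non-.md suffixes such as ".py" pass through unchanged.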
Regression tests in test_wiki_layout.py (4 new) and test_wiki_classifier.py
(6 new). 1720 core tests pass.
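Fixes 2 to 4 can be sketched in the same spirit. The pattern-list names and the function names _line_is_title_candidate and derive_title come from this PR. The regex bodies, the URL pattern, the helper title_or_hash, and the choice of SHA-256 truncated to 12 hex characters are assumptions made for illustration, not the merged code.

```python
import hashlib
import re

# Assumed regex bodies; only the list names appear in the PR.
_YAML_KV_TITLE_PATTERNS = [
    re.compile(r"^\s*(created|updated|date|timestamp|time|id|uuid|version)\s*:", re.I),
    # Bare ISO-8601 timestamp anywhere in the line.
    re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"),
]
_PATH_OR_URL_TITLE_PATTERNS = [
    re.compile(r"^\s*#*\s*[/~]"),                        # path at line start
    re.compile(r"https?://"),                            # URL (assumed form)
    re.compile(r"/(Users|home|root|opt|var|etc|tmp)/"),  # POSIX path anywhere
    re.compile(r"\b[A-Za-z]:[\\/]"),                     # Windows drive path anywhere
]


def _line_is_title_candidate(line: str) -> bool:
    """Single predicate deciding whether a line may become a title."""
    stripped = line.strip().lstrip("#").strip()
    if len(stripped) <= 10 or stripped[0] in "{[":
        return False  # too short, or looks like JSON
    patterns = _YAML_KV_TITLE_PATTERNS + _PATH_OR_URL_TITLE_PATTERNS
    return not any(p.search(line) for p in patterns)


def derive_title(content: str) -> str:
    """Return the first clean line, or "" when every line is rejected."""
    for line in content.splitlines():
        if _line_is_title_candidate(line):
            return line.strip().lstrip("#").strip()
    return ""  # no content[:80] fallback, so the caller can use the hash


def title_or_hash(content: str) -> str:
    """Hypothetical caller-side routing to a deterministic memory-<hash> name."""
    title = derive_title(content)
    if title:
        return title
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:12]
    return f"memory-{digest}"
```

Returning "" rather than a content fragment keeps the decision in one place: the predicate rejects the line, and the caller alone decides what the fallback name looks like.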
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cdeust added a commit that referenced this pull request on May 12, 2026
Bundles four PRs landed since v3.15.3:
- #25 codebase_analyze: default max_files=0 (no cap); fixed truncation at 5000 in skill invocation
- #26 wiki slug/title leaks (.md.md, timestamp-slugs, path-titles)
- #27 ADR-2244 Phase 1: multi-axis classification (kind, lifecycle, audience, provenance) + Task #8 (file→reference/ routing fix)
- #28 ADR-2244 follow-up: data-driven axis registry replacing closed enums with wiki/_schema/<axis>/<value>.md user-extensible files
Changelog entry documents migration impact and the extension contract for adding new classification values without code changes.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Context
Audit of the methodology wiki (7,883 pages) on 2026-05-12 surfaced systematic pollution:
- .md.md double extension, e.g. adr/_general/2234-decision-001-zero-dependencies.md.md
- Timestamp slug, e.g. adr/_general/1828-decision-created-2026-04-15t09-29-10z.md
- Path-embedded slug, e.g. specs/2026/2026-04-17-also-on-users-cdeust-documents-developments-ai-architect-prd.md
All polluted pages were written 2026-04-21, six days after v3.10.1 (commit 554f1ac, 2026-04-15) shipped the audit-artefact filter. These are live bugs, not historical pollution.
Root causes and fixes
1. .md.md (slugify preserves "."; six callers append ".md")
Reproducer (before fix):
Six callsites all do f"{...}{slug}.md": adr_filename, domain_page_path, wiki_sync:91, draft_compiler:127, ingest_prd:205, ingest_codebase_pages:28. A single-point fix in slugify that strips a trailing chain of .md benefits all of them. Non-.md extensions (.py, .yaml, etc.) are preserved, so file_path_slug callers still get login.py as before.
2. Timestamp-as-title (YAML metadata leaked through derive_title)
derive_title accepted any line with len > 10 that didn't start with { or [. A YAML line like created: 2026-04-15T09:29:10Z passed that gate. New _YAML_KV_TITLE_PATTERNS reject (created|updated|date|timestamp|time|id|uuid|version): lines and bare ISO-8601 timestamps anywhere in the candidate line.
3. Path-embedded-mid-sentence titles
Existing _PATH_OR_URL_TITLE_PATTERNS only matched paths at line start (^\s*#*\s*[/~]). When content like "also on /Users/cdeust/Documents/..." became first_line, the path was mid-line so the regex didn't trigger; slugify then folded the entire path into the slug. Added two patterns matching /Users/, /home/, /root/, /opt/, /var/, /etc/, /tmp/ and Windows drive paths anywhere in the line.
4. derive_title fallback defeated the deterministic hash fallback
The function fell back to content[:80] when no clean line existed, leaking raw fragments. wiki_sync already had a memory-<hash> fallback but it was unreachable. Now derive_title returns "" when every candidate line is rejected; the caller routes to the hash. Extracted _line_is_title_candidate for a single source-of-truth predicate.
Files changed
- mcp_server/core/wiki_layout.py: strip trailing .md chain from slug; document the postcondition.
- mcp_server/core/wiki_classifier.py: extend path/URL filter, add YAML KV filter, extract _line_is_title_candidate, return "" on no-candidate.
- tests_py/core/test_wiki_layout.py: 4 new tests covering .md strip, multi-.md collapse, non-.md preservation, end-to-end adr_filename.
- tests_py/core/test_wiki_classifier.py: 6 new tests covering YAML-timestamp rejection, embedded-POSIX-path rejection, Windows path rejection, empty-on-no-candidate, bare ISO timestamp, positive control.
Test plan
- pytest tests_py/core/: 1720 passed
- pytest tests_py/handlers/test_wiki_sync_errors.py tests_py/infrastructure/test_wiki_store.py: 20 passed
- .md.md, zero decision-created-* ADRs, zero users-cdeust slugs in newly-written pages
Scope of fix vs scope of cleanup
This PR prevents future pollution. The existing ~88 polluted pages must be purged separately (see follow-up: cleanup pass using wiki_purge once this lands).
🤖 Generated with Claude Code
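For the separate cleanup pass, the three pollution signatures from the audit could be used to shortlist candidate pages. The sketch below is a heuristic built only from the example paths quoted in this PR. The function name classify_polluted and the regexes are hypothetical, and wiki_purge is not called because its interface is not shown here. The path-slug rule only covers the "users-<name>" shape, so its matches should be treated as review candidates, not a definitive list.

```python
import re

# Hypothetical signatures derived from the three example paths in this PR.
_SIGNATURES = {
    "double-md": re.compile(r"\.md\.md$"),
    # Slugified ISO timestamp, e.g. 2026-04-15t09-29-10z
    "timestamp-slug": re.compile(r"\d{4}-\d{2}-\d{2}t\d{2}-\d{2}-\d{2}z"),
    # Slugified /Users/<name>/ path fragment, e.g. -users-cdeust-
    "path-slug": re.compile(r"(^|-)users-[a-z0-9]+-"),
}


def classify_polluted(path: str) -> list[str]:
    """Return the labels of every signature the page filename matches."""
    name = path.rsplit("/", 1)[-1].lower()
    return [label for label, pat in _SIGNATURES.items() if pat.search(name)]


if __name__ == "__main__":
    pages = [
        "adr/_general/2234-decision-001-zero-dependencies.md.md",
        "adr/_general/1828-decision-created-2026-04-15t09-29-10z.md",
        "adr/_general/0001-zero-dependencies.md",
    ]
    for page in pages:
        print(page, classify_polluted(page))
```

An empty list means no signature matched. Any non-empty result would still need a human check before being handed to a purge step.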