docs: fix legacy redirect destinations 404ing on docs.cube.dev #11004
The cross-domain redirect table (#11002) sent most legacy cube.dev/docs URLs to non-existent docs.cube.dev pages: 386 of 509 destinations (76%) returned 404. For example, /product/data-modeling/reference/pre-aggregations went to the dead /docs/data-modeling/reference/pre-aggregations instead of the real /reference/data-modeling/pre-aggregations.

Root cause: build_old_site_redirects.py reused rewrite_links.PATH_REWRITES, which describes an intermediate migration layout (/access-security, /api-reference, /analytics tabs and /docs/data-modeling/reference/...) that never shipped. Content was consolidated into the live tabs (admin, reference, docs, recipes, embedding, configuration, cube-core).

Rewrite the generator to be content-driven and self-validating:

- Derive each destination by matching the old page's body text against the live Mintlify pages (new files were copied from old ones during migration, so prose is preserved). 319 old pages mapped.
- Pin ambiguous/consolidated/removed pages in an explicit OVERRIDES table, verified by hand against the real tree.
- Emit a redirect for every legacy alias and every canonical /product page, plus a /product/:path* catch-all so nothing 404s.
- Validate every destination against the on-disk Mintlify tree and exit non-zero if any would 404, so a broken table can't be committed silently.

Regenerated redirects-new-docs.json (509 -> 774 entries, all validated). All 455 original alias sources remain covered.

Live-checked the reported pre-aggregations page (now 200, was 404) plus a 30-destination sample (all 200).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
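The self-validation step described above could look roughly like the sketch below. It is not the code from build_old_site_redirects.py: the function names, the `source`/`destination` entry shape, and the use of site-relative destination paths are assumptions made for illustration.

```python
import sys


def find_dead_destinations(redirects, valid_urls):
    """Return the redirect entries whose destination path is not in valid_urls.

    Assumes each entry is a dict with 'source' and 'destination' keys and
    that destinations are site-relative paths (both are assumptions).
    """
    dead = []
    for entry in redirects:
        # Ignore any #fragment: only the page itself has to exist.
        path = entry["destination"].split("#", 1)[0]
        if path not in valid_urls:
            dead.append(entry)
    return dead


def validate_or_exit(redirects, valid_urls):
    """Report every dead destination on stderr and exit non-zero if any exist."""
    dead = find_dead_destinations(redirects, valid_urls)
    for entry in dead:
        print(f"DEAD {entry['source']} -> {entry['destination']}", file=sys.stderr)
    if dead:
        sys.exit(1)
```

Exiting with a non-zero status is what lets a CI job or pre-commit hook block a table that contains dead destinations.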
Claude finished @keydunov's task in 1m 23s

Claude Code Review

Verdict: approve with minor suggestions (0 high, 1 medium, 2 low)

The fix is solid: content-driven mapping, explicit pinned overrides, and a hard-fail validator are exactly the right shape. Posted 3 inline comments.
Findings:
Positives
Did not run the script locally — destinations were sampled by reading |
            osh = shingles(f.read_text(encoding="utf-8", errors="ignore"))
            best_url, best_score = None, 0.0
            for nurl, nsh in new_pages:
                if not osh or not nsh:
                    continue
                inter = len(osh & nsh)
                if not inter:
                    continue
                score = inter / len(osh | nsh)
                if score > best_score:
                    best_score, best_url = score, nurl
            if best_url is not None:
                page_map[old_url] = best_url
        return page_map
No minimum confidence threshold on the content match. Any non-zero Jaccard overlap wins: `best_score > 0` is the only bar. A short or list-style page (e.g. a near-empty stub or a page that is mostly headings) can pick up a tiny incidental overlap with an unrelated new page and be silently mapped to it. The validator only checks that the destination exists, not that it's the right page, so a wrong-but-existing match passes silently.

Two cheap mitigations:

- Require `best_score >= MIN_SCORE` (~0.05–0.10); below the bar, log the page and either fall through to `PRODUCT_CATCH_ALL` or require an explicit `OVERRIDES` entry.
- Print the lowest-N matches with scores to stderr for spot-checking before commit.
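A minimal sketch of both mitigations, assuming shingle sets are plain Python sets and `new_pages` is a list of `(url, shingle_set)` pairs as in the snippet under review. `MIN_SCORE = 0.05`, `jaccard`, `best_match`, and `report_weakest` are hypothetical names and values, not part of the script.

```python
import sys

MIN_SCORE = 0.05  # hypothetical threshold; tune against the real corpus


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets; 0.0 when either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(old_shingles, new_pages, min_score=MIN_SCORE):
    """Return (url, score) for the best-scoring new page.

    url is None when the best score is below min_score, so the caller can
    fall back to a catch-all or demand an explicit override.
    """
    best_url, best_score = None, 0.0
    for new_url, new_shingles in new_pages:
        score = jaccard(old_shingles, new_shingles)
        if score > best_score:
            best_url, best_score = new_url, score
    if best_score < min_score:
        return None, best_score
    return best_url, best_score


def report_weakest(scored, n=10, stream=sys.stderr):
    """Print the n lowest-scoring accepted matches for manual spot-checking.

    scored is an iterable of (old_url, new_url, score) tuples.
    """
    for old_url, new_url, score in sorted(scored, key=lambda row: row[2])[:n]:
        print(f"{score:.3f}  {old_url} -> {new_url}", file=stream)
```

Returning the score alongside the URL keeps rejected matches visible, so the pages that fell below the bar can be logged rather than silently dropped.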
        base, frag = split_fragment(old_dest)
        if base in OVERRIDES:
            new_base = OVERRIDES[base]
        elif base in page_map:
            new_base = page_map[base]
        elif base.startswith("/product/"):
            new_base = PRODUCT_CATCH_ALL
        else:
            new_base = PRODUCT_CATCH_ALL
        return to_absolute(new_base + frag)
Minor: the `OVERRIDES` check at line 239 is redundant. `build_page_map` already substitutes `OVERRIDES[old_url]` into `page_map`, so the `base in page_map` lookup already returns the override. Not a bug, just dead code.

Also, the `elif base.startswith("/product/")` and trailing `else` branches both return `PRODUCT_CATCH_ALL`. The two branches can be collapsed, or, more usefully, the `else` should probably preserve a non-`/product` destination as-is (e.g. legacy `/cloud/...` or pure-anchor URLs in `redirects.json`); routing those to `/docs/introduction` silently discards intent. Worth a quick audit of what non-`/product` destinations appear in `redirects.json`.
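One way the resolver could look with both suggestions applied is sketched below. It assumes `page_map` already contains the override entries, takes `/docs/introduction` as the catch-all target based on the comment above, and omits the `to_absolute` step; `resolve` and this `split_fragment` are illustrative stand-ins, not the script's own functions.

```python
PRODUCT_CATCH_ALL = "/docs/introduction"  # assumed from the review comment


def split_fragment(url: str):
    """Split '/path#anchor' into ('/path', '#anchor'); fragment is '' if absent."""
    base, sep, frag = url.partition("#")
    return base, sep + frag


def resolve(old_dest: str, page_map: dict) -> str:
    """Map a legacy destination to its new path, keeping any #fragment."""
    base, frag = split_fragment(old_dest)
    if base in page_map:
        # Overrides are assumed to be merged into page_map already.
        new_base = page_map[base]
    elif base.startswith("/product/"):
        # Unmapped product page: send it to the catch-all.
        new_base = PRODUCT_CATCH_ALL
    else:
        # Non-/product destination: keep it unchanged rather than discard intent.
        new_base = base
    return new_base + frag
```

Whether unmapped non-`/product` destinations should really pass through unchanged depends on what actually appears in `redirects.json`, which is why an audit comes first.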
    def valid_url_set(new_root: Path) -> set:
        urls = set()
        for d in sorted(NEW_CONTENT_DIRS):
            base = new_root / d
            if not base.is_dir():
                continue
            for f in base.rglob("*.mdx"):
                urls.add(file_to_url(f, new_root))
        return urls
Nit: `valid_url_set` re-walks the same `.mdx` files that `index_new_pages` already walked. For a one-shot script this is fine, but returning `(pages, valid_set)` from a single pass would avoid the double `rglob` and double file stat, and guarantee the two views can't drift.
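A single-pass version might look like the sketch below. The helpers `file_to_url` and `shingles` are simplified stand-ins for the script's own functions, whose real behavior is not shown in this PR, and `NEW_CONTENT_DIRS` is an illustrative subset of the live tabs.

```python
from pathlib import Path

NEW_CONTENT_DIRS = {"admin", "reference", "docs"}  # illustrative subset


def file_to_url(path: Path, root: Path) -> str:
    """Hypothetical mapping: <root>/docs/intro.mdx -> '/docs/intro'."""
    return "/" + path.relative_to(root).with_suffix("").as_posix()


def shingles(text: str, k: int = 5) -> set:
    """Set of k-word shingles; a stand-in for the script's own helper."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 0))}


def index_new_pages(new_root: Path):
    """Walk the tree once and return (pages, valid_urls).

    pages is a list of (url, shingle_set) pairs; valid_urls is the set of the
    same URLs, so the matcher and the validator always see the same files.
    """
    pages, valid_urls = [], set()
    for d in sorted(NEW_CONTENT_DIRS):
        base = new_root / d
        if not base.is_dir():
            continue
        for f in sorted(base.rglob("*.mdx")):
            url = file_to_url(f, new_root)
            valid_urls.add(url)
            text = f.read_text(encoding="utf-8", errors="ignore")
            pages.append((url, shingles(text)))
    return pages, valid_urls
```

Because both views are built from the same loop iteration, a file cannot appear in one and be missing from the other.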
Check List
Description of Changes Made
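The cross-domain redirect table from #11002 sent most legacy `cube.dev/docs` URLs to non-existent `docs.cube.dev` pages: 386 of 509 destinations (76%) returned 404, including the pre-aggregations reference page reported internally (`/product/data-modeling/reference/pre-aggregations` → dead `/docs/data-modeling/reference/pre-aggregations` instead of the real `/reference/data-modeling/pre-aggregations`).

Root cause: `build_old_site_redirects.py` reused `rewrite_links.PATH_REWRITES`, which describes an intermediate migration layout (`/access-security`, `/api-reference`, `/analytics` tabs and `/docs/data-modeling/reference/...` pages) that never shipped. Content was later consolidated into the live tabs (`admin`, `reference`, `docs`, `recipes`, `embedding`, `configuration`, `cube-core`), so the prefix rewrites produced dead URLs.

Fix: rewrote the generator to be content-driven and self-validating.

- Derive each destination by matching the old page's body text against the live Mintlify pages (new files were copied from old ones during migration, so prose is preserved). 319 old pages mapped.
- Pin ambiguous/consolidated/removed pages in an explicit `OVERRIDES` table, verified by hand against the real tree.
- Emit a redirect for every legacy alias and every canonical `/product/*` page, plus a `/product/:path*` catch-all so nothing 404s.
- Validate every destination against the on-disk Mintlify tree and exit non-zero if any would 404, so a broken table can't be committed silently.

Result: regenerated `redirects-new-docs.json` (509 → 774 entries, all validated). All 455 original alias sources remain covered.

Verification:

- Live-checked the reported pre-aggregations page (now 200, was 404).
- Live-checked a 30-destination sample (all 200).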
🤖 Generated with Claude Code