docs: fix legacy redirect destinations 404ing on docs.cube.dev #11004

Merged: keydunov merged 1 commit into master from fix/docs-redirect-destinations on Jun 3, 2026

keydunov (Member) commented Jun 3, 2026

Check List

  • Tests have been run in packages where changes have been made if available
  • Linter has been run for changed code
  • Tests for the changes have been added if not covered yet
  • Docs have been added / updated if required

Description of Changes Made

The cross-domain redirect table from #11002 sent most legacy cube.dev/docs URLs to non-existent docs.cube.dev pages — 386 of 509 destinations (76%) returned 404, including the pre-aggregations reference page reported internally (/product/data-modeling/reference/pre-aggregations → dead /docs/data-modeling/reference/pre-aggregations instead of the real /reference/data-modeling/pre-aggregations).

Root cause: build_old_site_redirects.py reused rewrite_links.PATH_REWRITES, which describes an intermediate migration layout (/access-security, /api-reference, /analytics tabs and /docs/data-modeling/reference/... pages) that never shipped. Content was later consolidated into the live tabs (admin, reference, docs, recipes, embedding, configuration, cube-core), so the prefix rewrites produced dead URLs.

Fix — rewrote the generator to be content-driven and self-validating:

  • Derives each destination by matching the old page's body text against the live Mintlify pages (new files were copied from the old ones during migration, so prose is preserved). 319 old pages mapped.
  • Pins ambiguous / consolidated / removed pages in an explicit OVERRIDES table, verified by hand against the real tree.
  • Emits a redirect for every legacy alias and every canonical /product/* page, plus a /product/:path* catch-all so nothing 404s.
  • Validates every destination against the on-disk Mintlify tree and exits non-zero if any would 404, so a broken table can't be committed silently.
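
The self-validation step in the last bullet can be sketched roughly as follows. This is a minimal illustration and not the code from build_old_site_redirects.py: the helper names `collect_valid_urls` and `find_dead_destinations`, the `source`/`destination` keys, and the rule that a page URL is its `.mdx` path relative to the docs root are all assumptions. Only the exit code 2 comes from the review below.

```python
import sys
from pathlib import Path


def collect_valid_urls(docs_root: Path) -> set[str]:
    """Return the URL of every .mdx page under docs_root.

    Assumed layout: the URL is the file path relative to the root,
    without the .mdx extension.
    """
    urls = set()
    for f in docs_root.rglob("*.mdx"):
        rel = f.relative_to(docs_root).with_suffix("")
        urls.add("/" + rel.as_posix())
    return urls


def find_dead_destinations(redirects: list[dict], valid: set[str]) -> list[dict]:
    """Return redirects whose destination (ignoring any #fragment) is not a page."""
    dead = []
    for r in redirects:
        base = r["destination"].split("#", 1)[0]
        if base not in valid:
            dead.append(r)
    return dead


def main(docs_root: Path, redirects: list[dict]) -> int:
    """Report dead destinations on stderr and return a process exit code."""
    dead = find_dead_destinations(redirects, collect_valid_urls(docs_root))
    for r in dead:
        print(f"DEAD {r['source']} -> {r['destination']}", file=sys.stderr)
    return 2 if dead else 0  # non-zero exit blocks a broken table
```

A caller would typically end with `sys.exit(main(root, redirects))`, so that a CI step or pre-commit hook fails whenever any destination is missing from the tree.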

Result: regenerated redirects-new-docs.json (509 → 774 entries, all validated). All 455 original alias sources remain covered.

Verification:

  • All 774 generated destinations resolve to existing pages (local tree).
  • Live production check: the reported pre-aggregations page now returns 200 (was 404); a 30-destination random sample returned 200 across all section types.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
github-actions Bot added the docs and python labels Jun 3, 2026
claude Bot (Contributor) commented Jun 3, 2026

Claude finished @keydunov's task in 1m 23s


Claude Code Review

Verdict: approve with minor suggestions — 0 high, 1 medium, 2 low

The fix is solid: content-driven mapping, explicit pinned overrides, and a hard-fail validator are exactly the right shape. Posted 3 inline comments.

Details (checklist + summary)
  • Load tools
  • Read changed files
  • List existing review threads (no prior threads to dedupe)
  • Analyze script and JSON
  • Post review

Findings:

| Sev | Where | Issue |
| --- | --- | --- |
| medium | build_page_map L200–213 | No minimum Jaccard threshold: any non-zero overlap wins. Validator checks the destination exists, not that it's the right page, so a wrong-but-existing match passes silently. Suggest MIN_SCORE ~0.05–0.10 + a lowest-N log line for spot-checking. |
| low | resolve_destination L238–247 | Redundant OVERRIDES check (already folded into page_map); elif/else both return PRODUCT_CATCH_ALL, which discards intent for non-/product legacy destinations (e.g. /cloud/..., pure-anchor URLs). |
| low | valid_url_set L273–281 | Re-walks the same .mdx tree as index_new_pages. Fine for one-shot use, but a single-pass return would prevent the two views from drifting. |

Positives

  • Hard-fail validator (sys.exit(2) if any destination 404s) prevents silent regression of exactly the bug this PR fixes.
  • OVERRIDES is well-commented per-cluster (AI / monitoring / VPC / SSO / workspace / FAQ) — easy to audit.
  • Catch-all /product/:path* is a sensible last-resort.
  • Docstring explains the why (intermediate vs. shipped layout) clearly, so future maintainers won't re-introduce the prefix-rewrite approach.

Did not run the script locally — destinations were sampled by reading OVERRIDES and trusting the author's reported live-200 verification.

· branch [`fix/docs-redirect-destinations`](https://github.com/cube-js/cube/tree/fix/docs-redirect-destinations)

Comment on lines +200 to +213
            osh = shingles(f.read_text(encoding="utf-8", errors="ignore"))
            best_url, best_score = None, 0.0
            for nurl, nsh in new_pages:
                if not osh or not nsh:
                    continue
                inter = len(osh & nsh)
                if not inter:
                    continue
                score = inter / len(osh | nsh)
                if score > best_score:
                    best_score, best_url = score, nurl
            if best_url is not None:
                page_map[old_url] = best_url
        return page_map

No minimum confidence threshold on the content match. Any non-zero Jaccard overlap wins — best_score > 0 is the only bar. A short/list page (mostly headings or near-empty stub) can pick up a tiny incidental overlap with an unrelated new page and be silently mapped to it. The validator only checks that the destination exists, not that it's the right page, so a wrong-but-existing match passes silently.

Two cheap mitigations:

  1. Require best_score >= MIN_SCORE (~0.05–0.10); below the bar, fall through to PRODUCT_CATCH_ALL or require an OVERRIDES entry.
  2. Print the lowest-N matches with scores to stderr for spot-checking before commit.
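
Both mitigations could look roughly like the sketch below. Everything here is illustrative and not taken from the PR: the `MIN_SCORE` value, the `jaccard`, `best_match`, and `report_weakest` helpers, and the word-level `shingles` implementation are assumptions, and the script's own `shingles` may tokenize differently.

```python
import sys

MIN_SCORE = 0.08  # assumed value inside the suggested 0.05-0.10 range


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Set of k-word shingles; an assumed stand-in for the script's shingles()."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection over size of the union; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(old_sh: set, new_pages: list[tuple[str, set]]):
    """Return (url, score) for the best candidate, or (None, score) below MIN_SCORE."""
    best_url, best_score = None, 0.0
    for url, new_sh in new_pages:
        score = jaccard(old_sh, new_sh)
        if score > best_score:
            best_url, best_score = url, score
    if best_score < MIN_SCORE:
        return None, best_score  # caller falls back or demands an override
    return best_url, best_score


def report_weakest(scored: dict[str, tuple[str, float]], n: int = 10) -> list[str]:
    """Print the n lowest-scoring accepted matches to stderr for spot-checking."""
    weakest = sorted(scored, key=lambda old: scored[old][1])[:n]
    for old in weakest:
        new, score = scored[old]
        print(f"{score:.3f} {old} -> {new}", file=sys.stderr)
    return weakest
```

Returning the score even when the match is rejected lets the caller log near-misses, which helps when tuning the threshold against the hand-verified OVERRIDES entries.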

Comment on lines +238 to +247
    base, frag = split_fragment(old_dest)
    if base in OVERRIDES:
        new_base = OVERRIDES[base]
    elif base in page_map:
        new_base = page_map[base]
    elif base.startswith("/product/"):
        new_base = PRODUCT_CATCH_ALL
    else:
        new_base = PRODUCT_CATCH_ALL
    return to_absolute(new_base + frag)

Minor: the OVERRIDES check at line 239 is redundant. build_page_map already substitutes OVERRIDES[old_url] into page_map, so the base in page_map branch would return the same value. Not a bug, just redundant code.

Also, the elif base.startswith("/product/") and trailing else both return PRODUCT_CATCH_ALL. Worth deciding whether the else should instead preserve a non-/product destination as-is (e.g. legacy /cloud/... or pure-anchor URLs in redirects.json); routing those to /docs/introduction silently discards intent.
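
One possible shape for the suggested change is sketched below, as an illustration rather than the PR's code. It assumes that `page_map` already has OVERRIDES folded in, that `PRODUCT_CATCH_ALL` is `/docs/introduction` (inferred from the comment above), and that `split_fragment` behaves as written here. The `to_absolute` step is left out.

```python
PRODUCT_CATCH_ALL = "/docs/introduction"  # assumed from the review comment


def split_fragment(url: str) -> tuple[str, str]:
    """Split '/a/b#c' into ('/a/b', '#c'); the fragment is '' when absent."""
    base, sep, frag = url.partition("#")
    return base, sep + frag


def resolve_destination(old_dest: str, page_map: dict[str, str]) -> str:
    """Map a legacy destination to a new path.

    Assumes page_map already contains the OVERRIDES entries, so there is
    no separate OVERRIDES branch.
    """
    base, frag = split_fragment(old_dest)
    if base in page_map:
        return page_map[base] + frag
    if base.startswith("/product/"):
        return PRODUCT_CATCH_ALL + frag
    # Not a /product page (e.g. /cloud/... or a pure-anchor URL): keep as-is.
    return old_dest
```

Whether pass-through is the right fallback depends on which non-/product destinations actually appear in redirects.json, so the last branch is a choice to confirm rather than a given.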

Comment on lines +273 to +281
    def valid_url_set(new_root: Path) -> set:
        urls = set()
        for d in sorted(NEW_CONTENT_DIRS):
            base = new_root / d
            if not base.is_dir():
                continue
            for f in base.rglob("*.mdx"):
                urls.add(file_to_url(f, new_root))
        return urls

Nit: valid_url_set re-walks the same .mdx files that index_new_pages already walked. For a one-shot script this is fine, but returning (pages, valid_set) from a single pass would avoid the double rglob and guarantee the two views can't drift.
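
A single-pass variant might look like the sketch below. The `NEW_CONTENT_DIRS` contents and the `file_to_url` mapping are assumptions made for illustration. The sketch also stores raw page text, where the real script presumably stores shingles.

```python
from pathlib import Path

NEW_CONTENT_DIRS = {"docs", "reference"}  # assumed subset of the live tabs


def file_to_url(f: Path, new_root: Path) -> str:
    """Assumed mapping: path relative to the root, minus the .mdx suffix."""
    return "/" + f.relative_to(new_root).with_suffix("").as_posix()


def index_new_pages(new_root: Path) -> tuple[list[tuple[str, str]], set[str]]:
    """Walk the tree once; return (pages, valid_urls) built from the same files."""
    pages, valid = [], set()
    for d in sorted(NEW_CONTENT_DIRS):
        base = new_root / d
        if not base.is_dir():
            continue
        for f in sorted(base.rglob("*.mdx")):
            url = file_to_url(f, new_root)
            pages.append((url, f.read_text(encoding="utf-8", errors="ignore")))
            valid.add(url)
    return pages, valid
```

Because both return values come from the same loop, the matcher and the validator see the same set of files by construction.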

mintlify Bot (Contributor) commented Jun 3, 2026

Preview deployment for your docs.

| Project | Status | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| cubed3 | 🟢 Ready | View Preview | Jun 3, 2026, 3:49 AM |

keydunov merged commit 559bef7 into master Jun 3, 2026
50 checks passed
keydunov deleted the fix/docs-redirect-destinations branch June 3, 2026 04:07