[SPARK-57035][DOCS] Always target /docs/latest/ in DocSearch index #56080

Open
gengliangwang wants to merge 1 commit into apache:master from gengliangwang:docs-search-latest

Member

@gengliangwang gengliangwang commented May 23, 2026

What changes were proposed in this pull request?

  • Stop dev/create-release/release-tag.sh from rewriting 'facetFilters' in docs/_config.yml at release-cut and post-release-bump time. The line stays pinned to "version:latest" on every branch going forward.
  • Refresh the stale comment in docs/_config.yml to point at https://crawler.algolia.com/ instead of the legacy DocSearch v1 config repo.

Why are the changes needed?

We are moving DocSearch to a single shared index built from https://spark.apache.org/docs/latest/, used by every release. With a shared index, all branches should pin facetFilters to "version:latest", so the per-release rewrite in the release script is no longer needed.

The crawler-side change is being made separately on https://crawler.algolia.com/.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

N/A - documentation config and release-script change only.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-7)

### What changes were proposed in this pull request?

Switch DocSearch to a single shared index built from
https://spark.apache.org/docs/latest/, used by all release branches.

- `docs/_config.yml`: rewrite the stale comment that pointed at the
  dead `github.com/algolia/docsearch-configs` repo. Document the new
  setup: the Algolia crawler at https://crawler.algolia.com/ indexes
  only `/docs/latest/` and tags every page with `version:latest`, so
  `facetFilters` stays pinned to `version:latest` on every branch.
- `dev/create-release/release-tag.sh`: remove the two `sed` lines that
  rewrote `facetFilters` to `version:<release>` at release-cut and
  post-release-bump time. They are no longer needed (and stayed wrong
  on the last few releases, which is why the search box on
  https://spark.apache.org/docs/latest/ has been returning no results).

### Why are the changes needed?

The legacy DocSearch v1 scheme crawled every released `/docs/<X.Y.Z>/`
and assigned a `version:X.Y.Z` facet, so each release branch had to
pin `facetFilters` to its own version. Since the SPARK-38122 migration
to the new DocSearch infra, we no longer maintain per-version indexes.
The release-script `sed` rewrites kept producing `version:<release>`
filters that don't match anything in the new index, so post-release
search on https://spark.apache.org/docs/latest/ returns empty results
until the crawler config is manually re-pointed.

Pinning to `version:latest` everywhere matches what the crawler tags
and removes the manual release-time step entirely.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A - documentation config and release-script change only.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-7)
Member

@viirya viirya left a comment

Thanks for picking this up, @gengliangwang. I went to verify the premise in the description and I think this PR needs a rework before it can land — the assumption that the Algolia index is latest-only doesn't match what the live index actually contains, and as written this change is a user-facing regression rather than a no-op. Details below.

The Algolia index still contains per-version facets. Querying the live apache_spark index with the public DocSearch key in _config.yml:

curl -s -X POST "https://rai69rxrsk-dsn.algolia.net/1/indexes/apache_spark/query" \
  -H "X-Algolia-API-Key: d62f962a82bc9abb53471cb7b89da35e" \
  -H "X-Algolia-Application-Id: RAI69RXRSK" \
  -H "Content-Type: application/json" \
  -d '{"query":"DataFrame","facets":["version"],"hitsPerPage":1}'

returns:

"facets": {
    "version": {
        "latest": 1075,
        "4.1.2":  1075,
        "4.1.1":  1075,
        "4.1.0":  1075,
        "4.0.0":  1066
    }
}

and a follow-up query with facetFilters: ["version:4.1.2"] still returns 4.1.2-specific URLs (https://spark.apache.org/docs/4.1.2/sql-programming-guide.html#content, not /docs/latest/). So the crawler is indexing each released /docs/<X.Y.Z>/ and tagging it with the corresponding version facet — not "only /docs/latest/ tagged with version:latest" as the new comment claims.
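The follow-up query can be reproduced with a small wrapper around the same endpoint. This is only a sketch: `build_query` and `search_version` are hypothetical helper names, the endpoint and key are the ones quoted above, and running `search_version` needs network access.

```shell
#!/usr/bin/env bash
# Build the JSON body for a query restricted to one version facet.
# $1 = search term, $2 = version facet value, $3 = hits per page (default 5).
build_query() {
  local query="$1" version="$2"
  printf '{"query":"%s","facetFilters":["version:%s"],"hitsPerPage":%d}' \
    "$query" "$version" "${3:-5}"
}

# Send the query to the apache_spark index quoted in this thread.
search_version() {
  curl -s -X POST "https://rai69rxrsk-dsn.algolia.net/1/indexes/apache_spark/query" \
    -H "X-Algolia-API-Key: d62f962a82bc9abb53471cb7b89da35e" \
    -H "X-Algolia-Application-Id: RAI69RXRSK" \
    -H "Content-Type: application/json" \
    -d "$(build_query "$1" "$2")"
}
```

For example, `search_version DataFrame 4.1.2` prints the raw response, whose hits can then be inspected to see which `/docs/<X.Y.Z>/` path they point at.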

Consistent with that, the currently published docs still ship per-version filters:

  • https://spark.apache.org/docs/latest/ → 'facetFilters': ["version:4.1.2"]
  • https://spark.apache.org/docs/4.1.1/ → 'facetFilters': ["version:4.1.1"]
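One way to check which filter a published page ships is to grep its HTML. A minimal sketch, assuming the filter appears in the page source in the same `'facetFilters': [...]` form as in `_config.yml`; `extract_facet_filters` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Read HTML on stdin and print every facetFilters setting found in it,
# from the key name through the closing bracket of its value.
extract_facet_filters() {
  grep -o "facetFilters[^]]*]"
}
```

Usage (needs network access): `curl -s https://spark.apache.org/docs/latest/ | extract_facet_filters`.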

Why this matters for the PR:

  1. The symptom in the description ("search on /docs/latest/ returns no results after a release") doesn't look like it's caused by the per-release sed in release-tag.sh. The index still has a populated version:latest facet, and latest currently points at 4.1.2 where search does work for me. It would be worth root-causing the original report (timing of the latest symlink flip vs. the next crawler run? a stale version:latest snapshot during a specific window?) before changing the script, otherwise we may be removing the wrong mechanism.

  2. After this PR, every new release branch and every released /docs/<X.Y.Z>/ HTML will hard-code facetFilters: ["version:latest"]. That means searches performed from /docs/4.1.3/ (or any future release) will start returning results from /docs/latest/ — i.e. the search no longer stays on the user's release. That's the exact behavior SPARK-33479 set out to fix, and it contradicts the "Does this PR introduce any user-facing change? No" line in the description.

  3. The new comment ("indexes only /docs/latest/", "tags every page with version:latest", "no per-release update is required") will actively mislead the next release manager, because — going by the live index — version-specific search is still a supported and currently-working feature.

Suggested paths forward, depending on intent:

  • If we want to keep release-specific search (status quo, matches what the index actually supports today): keep the two sed lines in release-tag.sh, and only update the comment in _config.yml to replace the dead algolia/docsearch-configs link with a pointer to https://crawler.algolia.com/. The rest of this PR isn't needed.
  • If we genuinely want to drop release-specific search and always target latest: this needs (a) a corresponding crawler-side config change so we stop spending crawler budget on /docs/<X.Y.Z>/, (b) an updated description that acknowledges the user-facing change ("search on /docs/<release>/ will jump to /docs/latest/"), (c) a comment that says this is intentional rather than describing it as the crawler's behavior, and (d) likely cherry-picks to active release branches so the version:X.Y.Z filter there is also reset to version:latest for consistency.

Happy to help land either version once we agree on which direction is intended.

Comment thread docs/_config.yml
# The DocSearch index is maintained by the Algolia crawler at https://crawler.algolia.com/.
# The crawler indexes only https://spark.apache.org/docs/latest/ and tags every page with
# `version:latest`. All release branches share this single index, so `facetFilters` stays
# pinned to `version:latest` everywhere and no per-release update is required.
Member

This comment doesn't match the live index: querying apache_spark returns version facet values {latest, 4.1.0, 4.1.1, 4.1.2, 4.0.0}, and facetFilters: ["version:4.1.2"] still returns 4.1.2-specific URLs. The crawler isn't latest-only — release pages are still indexed and tagged with their version. As written this will mislead the next release manager into thinking version filters are unused; please either correct the description of the crawler's behavior, or — if the intent is to deliberately switch to latest-only — say so explicitly and link to the crawler-side change that makes it true.

Member Author

For the released pages, we will just keep using the indexes for those docs.
For new releases (4.1.3/4.2.0), we will start using the latest index only. Otherwise, release manager may forget to create new index and break the search function.

Comment thread dev/create-release/release-tag.sh
# Set the release version in docs
sed -i".tmp1" 's/SPARK_VERSION:.*$/SPARK_VERSION: '"$RELEASE_VERSION"'/g' docs/_config.yml
sed -i".tmp2" 's/SPARK_VERSION_SHORT:.*$/SPARK_VERSION_SHORT: '"$RELEASE_VERSION"'/g' docs/_config.yml
sed -i".tmp3" "s/'facetFilters':.*$/'facetFilters': [\"version:$RELEASE_VERSION\"]/g" docs/_config.yml
Member

Removing this rewrite means every future release branch (and the HTML shipped under /docs/<X.Y.Z>/) will ship facetFilters: ["version:latest"]. Combined with the live index still containing populated per-version facets, that turns release-page search into "jump to /docs/latest/" rather than staying on the user's release — a user-facing regression vs. the intent of SPARK-33479. If we do want this change, the PR description should reflect it; otherwise this sed (and the symmetric one below for R_NEXT_VERSION) should stay.
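To make the effect of that rewrite concrete, here is a sketch that runs the same substitution against a throwaway file. The two-line config, the `SPARK_VERSION` value, and `RELEASE_VERSION=4.1.3` are stand-ins chosen for illustration, not the real contents of `docs/_config.yml`.

```shell
#!/usr/bin/env bash
# Stand-in for docs/_config.yml; only two keys are reproduced (assumed layout).
config="$(mktemp)"
printf '%s\n' \
  "SPARK_VERSION: 4.2.0-SNAPSHOT" \
  "  'facetFilters': [\"version:latest\"]" > "$config"

RELEASE_VERSION="4.1.3"  # hypothetical release being cut

# Same substitution as the release script, but printing the result
# instead of editing the file in place.
rewritten="$(sed "s/'facetFilters':.*$/'facetFilters': [\"version:$RELEASE_VERSION\"]/g" "$config")"
printf '%s\n' "$rewritten"
rm -f "$config"
```

With the `sed` in place, the branch ships a filter naming its own release; without it, the `version:latest` line is carried into the release HTML unchanged.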

Member Author

Updated the PR description.

Member

@viirya viirya left a comment

Thanks for the clarification and the description update, @gengliangwang — the two-phase intent ("existing released pages keep their per-version indexes; from 4.1.3 / 4.2.0 onward everything shares the latest index") is much clearer now. Before this lands, though, I'd like to push back on the direction rather than the wording, because I think pinning every future release to facetFilters: ["version:latest"] makes the on-release-page search quietly wrong in a way that's hard to recover from.

Why this worries me:

  1. facetFilters: ["version:latest"] is not "no filter"; it means "only results tagged version:latest". Whatever version:latest happens to point at when the user runs the query is what they get. After 4.2.0 ships and the /docs/latest/ symlink flips, a user reading /docs/4.1.3/sql-programming-guide.html and typing into its search box will get results pointing at 4.2.0 pages, silently and with no version-mismatch indicator. They click a result, land on a page where the API signature has changed, and don't realize they crossed a release boundary. "Search returns nothing" is loud and gets reported; "search returns the wrong version" is quiet and gets internalized as "the docs are bad".

  2. The frozen-HTML problem is one-way. Once /docs/4.1.3/index.html ships with facetFilters: ["version:latest"] baked in, that HTML is immutable in spark-website. There is no recovery path if we later decide this was the wrong call — we can't go back and rewrite already-published release HTML to use a different filter. Every release we ship under this policy permanently inherits "search jumps to whatever latest is at query time".

  3. It moves the maintenance burden from a tracked place to an untracked place, rather than removing it. The motivation ("release manager may forget to create new index and break the search function") is a process problem. The fix being proposed isn't "make the process safer" — it's "remove the per-version contract so the process step is no longer needed". But the new model still requires the crawler to (a) keep populating version:latest correctly forever, and (b) keep the existing version:4.0.0/4.1.0/4.1.1/4.1.2 facets alive for already-shipped HTML. That's at least as much ongoing crawler-side maintenance as before, except now it lives entirely outside the Spark repo, isn't reviewed, isn't version-controlled, and has no failure alarm visible to committers. Forgetting on the crawler side is just as easy as forgetting in the release script — and harder to notice.

  4. The original symptom hasn't actually been root-caused. The earlier version of the description attributed "search on /docs/latest/ returns no results after a release" to the per-release sed rewrite. We established earlier in this thread that the live Algolia index does contain version:latest with populated hits, and that facetFilters: ["version:latest"] does return results today — so that hypothesis doesn't fit. The updated description has wisely dropped the bug-fix framing, but that means we still don't know what actually broke search on /docs/latest/ after the last release. Whatever the real cause is, this PR doesn't address it. I'd rather we diagnose the original report (crawler schedule vs. latest symlink flip timing? a stale version:latest snapshot during a specific window? something on the crawler-config side?) before changing the contract for every future release.
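As an illustration of the "no failure alarm" concern in point 3, a check along these lines could flag a missing or empty version facet before a release ships. It is a sketch under assumptions, not an existing Spark script: `facet_count` and `facet_is_populated` are hypothetical names, and the parsing assumes a facet response in the simple `"name": count` layout shown earlier in this thread.

```shell
#!/usr/bin/env bash
# Print the hit count for one version facet from a response body,
# or nothing if the facet is absent. Uses grep rather than a JSON parser.
# $1 = response text, $2 = version facet value (e.g. latest or 4.1.2).
facet_count() {
  local response="$1"
  local version="${2//./\\.}"   # escape dots so 4.1.2 is matched literally
  printf '%s' "$response" \
    | grep -Eo "\"$version\"[[:space:]]*:[[:space:]]*[0-9]+" \
    | grep -Eo '[0-9]+$' \
    | head -n 1
}

# Return 0 if the facet exists with at least one hit, 1 otherwise.
facet_is_populated() {
  local count
  count="$(facet_count "$1" "$2")"
  [ -n "$count" ] && [ "$count" -gt 0 ]
}
```

A release step could call `facet_is_populated "$response" "$RELEASE_VERSION"` on the output of the facet query and fail loudly when it returns non-zero, which keeps the check inside the repository where it can be reviewed.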

Suggested alternatives I'd find easier to support:

  • (A) Minimal cleanup. Keep both sed lines in release-tag.sh (preserve per-version search). Update only the comment in _config.yml to replace the dead algolia/docsearch-configs link with a pointer to https://crawler.algolia.com/. Open a separate JIRA to root-cause the post-release search outage on /docs/latest/ — that's a real bug worth fixing, just not by this mechanism.

  • (B) Drop per-version search explicitly. If after diagnosis we genuinely want to abandon version-scoped search, the cleaner expression of that intent is to remove the facetFilters line entirely (no filter → search across all indexed pages), with a comment that says so. That at least degrades gracefully if version:latest ever stops being populated, and the "wrong version" failure mode becomes visible to the user (multi-version hits in the dropdown) rather than silent. It would still want a tracking link to the crawler-side change.

Happy to help land either direction. I just don't think shipping ["version:latest"] into every future release HTML is the right shape — it bakes a silent-failure mode into immutable artifacts, and it doesn't actually remove the maintenance burden it claims to remove.
