
[SPARK-57088] [SQL] Allow non-deterministic ranking expression for EXACT NEAREST BY #56128

Closed
zhidongqu-db wants to merge 3 commits into apache:master from zhidongqu-db:allow-exact-no-deter-expr

@zhidongqu-db (Contributor) commented May 26, 2026

What changes were proposed in this pull request?

Removes the NEAREST_BY_JOIN.EXACT_WITH_NONDETERMINISTIC_EXPRESSION rejection in CheckAnalysis so the EXACT mode of NEAREST BY JOIN (added in SPARK-56395) accepts non-deterministic ranking expressions, the same way APPROX already does.

Concretely:

  • Drop the NearestByJoin arm in CheckAnalysis that failed analysis when approx = false and the ranking expression was non-deterministic.
  • Change NearestByJoin.allowNonDeterministicExpression to return true unconditionally (it previously returned approx).
  • Delete the NEAREST_BY_JOIN.EXACT_WITH_NONDETERMINISTIC_EXPRESSION error condition.
  • Update scaladoc/comments in NearestByJoin and RewriteNearestByJoin to reflect that both modes permit a non-deterministic ranking expression.
  • Update the user-facing docs in sql-ref-syntax-qry-select-join.md.
  • Convert the existing rejection tests (Scala, Python, SQL golden) to positive tests asserting that EXACT + a non-deterministic ranking expression now succeeds.
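
To make the effect of the first two bullets concrete, here is a toy Python model of the analyzer gate. It is only a sketch: `NearestByJoinModel`, `passes_analysis`, and the two `allow_*` functions are hypothetical names invented for illustration, not Spark classes or APIs, and the real check lives in Scala inside `CheckAnalysis`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NearestByJoinModel:
    """Hypothetical stand-in for the NearestByJoin logical plan node."""
    approx: bool
    ranking_is_deterministic: bool


def allow_nondeterministic_before(node: NearestByJoinModel) -> bool:
    # Behaviour described as "previous" in the PR: the flag mirrored `approx`.
    return node.approx


def allow_nondeterministic_after(node: NearestByJoinModel) -> bool:
    # Behaviour after the PR: both modes permit a non-deterministic ranking.
    return True


def passes_analysis(node: NearestByJoinModel, allow) -> bool:
    # Toy gate: a non-deterministic ranking expression is accepted only
    # when the node says it is allowed.
    return node.ranking_is_deterministic or allow(node)


exact_nondet = NearestByJoinModel(approx=False, ranking_is_deterministic=False)
approx_nondet = NearestByJoinModel(approx=True, ranking_is_deterministic=False)

rejected_before = not passes_analysis(exact_nondet, allow_nondeterministic_before)
accepted_after = passes_analysis(exact_nondet, allow_nondeterministic_after)
```

In this model only the EXACT case with a non-deterministic ranking changes outcome; the APPROX case is accepted under both versions of the gate.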

Why are the changes needed?

APPROX vs. EXACT and determinism are orthogonal concerns:

  • APPROX vs. EXACT is about the search algorithm contract: APPROX permits the optimizer to use faster approximate strategies (e.g. indexed ANN); EXACT forces brute-force evaluation.
  • Determinism is a property of the ranking expression itself. Ordinary joins, for example, accept non-deterministic join conditions without forcing the user into an "approximate" join.

EXACT describes algebraic semantics ("compute the exact top-K according to the user's ranking expression"); it does not promise reproducibility across runs when the ranking expression is itself non-deterministic. Coupling the two was an over-restriction that this PR removes.
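
The distinction can be illustrated outside Spark with a small brute-force top-K in plain Python. This is a sketch under stated assumptions, not the Spark rewrite: `exact_nearest_by` and `noisy_distance` are hypothetical helpers, and the random noise term stands in for any non-deterministic ranking expression. Each score is evaluated once per pair and then ranked, so the result is the exact top-K of the scores that were actually computed, even though a second run may compute different scores.

```python
import heapq
import random


def exact_nearest_by(left, right, k, distance):
    """Brute-force top-k per left row; `distance` is called once per pair."""
    result = {}
    for l in left:
        # Materialize every score first so ranking uses the same values
        # that were computed, rather than re-evaluating the expression.
        scored = [(distance(l, r), r) for r in right]
        result[l] = heapq.nsmallest(k, scored, key=lambda pair: pair[0])
    return result


rng = random.Random()  # deliberately unseeded: scores differ between runs
calls = {}             # records each materialized score for inspection


def noisy_distance(l, r):
    # Hypothetical non-deterministic ranking expression.
    score = abs(l - r) + rng.random()
    calls[(l, r)] = score
    return score


left_rows = [0, 10]
right_rows = [1, 2, 3, 11, 12]
top2 = exact_nearest_by(left_rows, right_rows, k=2, distance=noisy_distance)
```

The chosen neighbours can change from run to run, but within a single run they are always the two smallest recorded scores for each left row, which is the sense of "exact" described above.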

Does this PR introduce any user-facing change?

Yes. Queries of the form

SELECT ... FROM left JOIN right EXACT NEAREST k BY {DISTANCE | SIMILARITY} <non-deterministic expression>

previously failed at analysis with NEAREST_BY_JOIN.EXACT_WITH_NONDETERMINISTIC_EXPRESSION; they are now accepted and evaluated through the same brute-force rewrite as the APPROX variant.

The error condition NEAREST_BY_JOIN.EXACT_WITH_NONDETERMINISTIC_EXPRESSION is removed.

How was this patch tested?

  • Updated DataFrameNearestByJoinSuite: the rejection test is converted to a positive test asserting the result count (21/21 passing locally).
  • Updated the PySpark equivalent in test_nearest_by_join.py.
  • Updated the SQL golden file join-nearest-by.sql (replaced the failing-EXACT query with a COUNT(*) query mirroring the existing APPROX case); regenerated results/ and analyzer-results/. SQLQueryTestSuite -z join-nearest-by passes (2/2).
  • RewriteNearestByJoinSuite (12/12) still passes — the materializing-Project path in the optimizer rewrite already handled non-deterministic ranking expressions; only the analyzer gate changes.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.7)

@zhidongqu-db changed the title from "draft" to "[SPARK-57088] [SQL] Allow non-deterministic ranking expression for EXACT NEAREST BY" May 26, 2026

@dtenedor (Contributor) left a comment

LGTM, thanks for working on this.

@cloud-fan (Contributor) commented

thanks, merging to master/4.x/4.2!

@cloud-fan closed this in 2603f6a May 27, 2026
cloud-fan added a commit that referenced this pull request May 27, 2026
…CT NEAREST BY

Closes #56128 from zhidongqu-db/allow-exact-no-deter-expr.

Lead-authored-by: Zero Qu <zhidong.qu@databricks.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 2603f6a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request May 27, 2026
…CT NEAREST BY
