Skip to content

feat(cubestore): Fold IS NOT DISTINCT FROM join predicates into HashJoin keys#10923

Merged
waralexrom merged 1 commit into
masterfrom
cubestore-is-not-distinct-from
May 20, 2026
Merged

feat(cubestore): Fold IS NOT DISTINCT FROM join predicates into HashJoin keys#10923
waralexrom merged 1 commit into
masterfrom
cubestore-is-not-distinct-from

Conversation

@waralexrom
Copy link
Copy Markdown
Member

Summary

LEFT JOIN ... ON a IS NOT DISTINCT FROM b (used to emulate FULL OUTER JOIN via a driver table + chained LEFT JOINs on nullable keys) was planned as NestedLoopJoinExec — O(N×M) — because DataFusion's extract_equijoin_predicate only lifts plain = predicates. This PR adds a CubeStore logical optimizer rule that lifts IS NOT DISTINCT FROM conjuncts into the join's equi-keys with null_equals_null = true, letting DF's physical planner pick HashJoinExec (or SortMergeJoinExec) — both already support null-safe key comparison internally.

Changes

  • optimizations/is_not_distinct_from_join_keys.rs: new OptimizerRule (ApplyOrder::TopDown). For each LogicalPlan::Join with empty on and a filter that is a conjunction of IS NOT DISTINCT FROM predicates, the rule routes left/right-resolved hashable pairs into on, leaves the rest in filter, and flips null_equals_null = true.
  • Registered alongside RollingOptimizerRule in QueryPlannerImpl::make_execution_context.
  • Conservative scope: bails out when on is already populated by DF's own extraction, since null_equals_null is a single flag on the whole join and would silently change semantics of pre-extracted = keys.
  • Five regression tests covering: baseline = join, single-pair lift, 5-predicate lift, FULL-OUTER-via-driver + 3 chained LEFT JOINs, and a mixed = + IS NOT DISTINCT FROM case documenting the conservative no-op.

Testing

  • cargo test -p cubestore is_not_distinct_from_join_test:: — 5 passed.
  • Manual plan inspection: before the rule, IS NOT DISTINCT FROM chain produces NestedLoopJoinExec with the predicate in filter=; after the rule, the same chain produces HashJoinExec with on=[...] for every step.

…oin keys

DataFusion's extract_equijoin_predicate only lifts `=` predicates, so a
LEFT JOIN whose ON clause uses IS NOT DISTINCT FROM (typical for FULL
OUTER emulation via a driver + chained LEFT JOINs on nullable keys) ends
up as NestedLoopJoinExec — O(N*M) instead of hashed.

Adds an OptimizerRule that, when a Join has no existing equi-keys, lifts
all `expr IS NOT DISTINCT FROM expr` conjuncts whose sides resolve to
opposite inputs into `on` and sets null_equals_null=true. DF's physical
planner then propagates the flag to HashJoinExec/SortMergeJoinExec,
which already handle null-safe key comparison.

The rule is conservative: when `on` is already populated by DF's own
extraction, it bails out to avoid silently changing the null semantics
of existing `=` keys via the shared null_equals_null flag.
@waralexrom waralexrom requested a review from a team as a code owner May 20, 2026 20:17
@github-actions github-actions Bot added cube store Issues relating to Cube Store rust Pull requests that update Rust code labels May 20, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 20, 2026

Claude finished @waralexrom's task in 2m 7s —— View job


Code Review: IS NOT DISTINCT FROM Join Key Optimizer

  • Read and understand all changed files
  • Review optimizer rule logic and correctness
  • Review test coverage and quality
  • Check for security, performance, and edge cases
  • Post final review

Overall: This is a well-implemented, well-scoped optimizer rule. The conservative approach (only firing when on is empty) is the correct design choice given that null_equals_null is a single flag on the entire Join node. The code is clean, uses the right DataFusion utilities, and the test suite covers the motivating use case thoroughly.

Correctness

The optimizer rule correctly:

  • Matches only Join nodes with empty on and a non-None filter
  • Splits the filter conjunction and identifies IS NOT DISTINCT FROM predicates
  • Uses find_valid_equijoin_key_pair to validate left/right column resolution (same utility DF uses for =)
  • Checks can_hash to avoid promoting non-hashable types to equi-keys
  • Restores the original filter verbatim if nothing is liftable (avoiding structural changes)
  • Sets null_equals_null = true only when predicates are actually moved to on

The conservative bail-out when on is already populated is the right call — without it, enabling null_equals_null would silently change = key semantics to null-safe matching.

Suggestions (non-blocking)

  1. Join type coverage in tests: All tests use LEFT JOIN. The rule applies to all join types since there's no join_type guard. This is likely correct, but an INNER JOIN test would add confidence. See inline comment.

  2. Single-predicate test: The PR description mentions "single-pair lift" but the corresponding test lifts two pairs. A true single-predicate test would exercise the split_conjunction_owned → single-element path. See inline comment.

  3. filter_orig.clone() on line 57: The clone preserves the original tree for the bail-out path. It's clear and correct, but could be avoided by reconstructing from remaining in the empty case. Minor perf consideration if this runs on many joins. See inline comment.

Test Quality

The 5 tests provide strong coverage:

Test What it covers
equality_join_uses_hash_join_and_excludes_null_keys Baseline: = joins work as before, NULLs don't match
is_not_distinct_from_folds_into_hash_join_keys Core: IS NOT DISTINCT FROM lifted, NULLs now match
is_not_distinct_from_lifts_many_predicates 5-key lift, verifying multi-predicate handling
full_outer_via_driver_chain_uses_hash_joins Real-world motivating pattern with chained LEFT JOINs
mixed_eq_and_is_not_distinct_from_is_left_alone Conservative scope: mixed predicates not touched

Each test validates three layers: logical plan structure, physical plan choice (HashJoinExec vs NestedLoopJoinExec), and result correctness. The tag column pattern in the driver-chain test is a clever way to distinguish joined NULLs from LEFT-padded NULLs.

Missing coverage (nice-to-have)

  • INNER JOIN with IS NOT DISTINCT FROM
  • A true single-predicate lift
  • Non-hashable type (e.g., a struct column) to confirm the can_hash guard leaves the filter intact
  • IS NOT DISTINCT FROM with expressions (e.g., a.x + 1 IS NOT DISTINCT FROM b.y) rather than bare columns

None of these are blocking — the current test suite covers all the critical paths for the stated use case.


@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.43%. Comparing base (346eb56) to head (f8a8bcb).
⚠️ Report is 5 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (346eb56) and HEAD (f8a8bcb). Click for more details.

HEAD has 1 upload less than BASE
Flag BASE (346eb56) HEAD (f8a8bcb)
cubesql 1 0
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #10923       +/-   ##
===========================================
- Coverage   78.93%   58.43%   -20.51%     
===========================================
  Files         470      216      -254     
  Lines       92862    16993    -75869     
  Branches     3449     3449               
===========================================
- Hits        73304     9930    -63374     
+ Misses      19054     6559    -12495     
  Partials      504      504               
Flag Coverage Δ
cube-backend 58.43% <ø> (-0.01%) ⬇️
cubesql ?

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@waralexrom waralexrom merged commit 858c1e4 into master May 20, 2026
73 checks passed
@waralexrom waralexrom deleted the cubestore-is-not-distinct-from branch May 20, 2026 21:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cube store Issues relating to Cube Store rust Pull requests that update Rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants