Memory-limited NestedLoopJoin execution can still produce incorrect unmatched-left output when:
The right side has multiple output partitions (target_partitions > 1), AND
The join type requires a final unmatched-left emission pass (LEFT, LEFT ANTI, LEFT MARK, and any other join mode with left-final emission semantics)
Root cause: In memory-limited fallback, per-chunk unmatched-left emission uses a per-stream probe completion counter initialized to 1, instead of coordinating across all right partitions like the single-pass path does. This can cause a left row to be marked "unmatched" by partition 0 before partition 1 finishes probing and matches it.
Impact: Query results can become incorrect and memory-dependent, violating LEFT-family join semantics under memory pressure.
Why this matters
Correctness: LEFT-family join results can differ based on memory availability, breaking deterministic SQL semantics.
Severity: The bug triggers on large inputs, which are exactly the cases where fallback activates, so it is not an edge case.
User impact: Queries may emit incorrect null-padded unmatched-left rows under memory pressure.
Still open: LEFT-family join paths are still allowed to fall back and still use per-chunk local completion (probe_threads_counter = 1), which does not coordinate unmatched-left emission globally across right partitions.
Current behavior and gap
Single-pass path behavior
The normal path buffers the full left side and uses right partition count to coordinate unmatched-left emission only after all right partitions finish probing.
Relevant code:
datafusion/physical-plan/src/joins/nested_loop_join.rs around collect_left_input call in execute
datafusion/physical-plan/src/joins/nested_loop_join.rs around process_left_unmatched (report_probe_completed gate)
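The gating idea described for the single-pass path can be sketched in isolation. This is a simplified model, not the DataFusion source: SharedLeftState and count_final_emitters are hypothetical names, and only the counter-decrement pattern is meant to mirror the report_probe_completed gate.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for the shared left-side state (not the real JoinLeftData).
struct SharedLeftState {
    /// Number of right-partition streams that have not finished probing yet.
    probe_threads_counter: AtomicUsize,
}

impl SharedLeftState {
    fn new(right_partition_count: usize) -> Self {
        Self {
            probe_threads_counter: AtomicUsize::new(right_partition_count),
        }
    }

    /// Returns true only for the stream that finishes probing last.
    fn report_probe_completed(&self) -> bool {
        self.probe_threads_counter.fetch_sub(1, Ordering::Relaxed) == 1
    }
}

/// Runs one thread per right partition and counts how many of them
/// are told to emit the unmatched-left rows.
fn count_final_emitters(right_partition_count: usize) -> usize {
    let state = SharedLeftState::new(right_partition_count);
    let emitters = AtomicUsize::new(0);
    std::thread::scope(|s| {
        for _ in 0..right_partition_count {
            s.spawn(|| {
                if state.report_probe_completed() {
                    emitters.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    emitters.load(Ordering::Relaxed)
}
```

Because the counter starts at the right partition count and is shared, exactly one stream observes the final decrement, regardless of thread scheduling.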
Memory-limited fallback behavior
In fallback, each left chunk builds JoinLeftData with probe_threads_counter set to 1 (per stream/per partition chunk instance), so unmatched-left gating is local rather than global across right partitions.
Relevant code:
datafusion/physical-plan/src/joins/nested_loop_join.rs in handle_buffering_left_memory_limited where JoinLeftData::new is created with AtomicUsize::new(1)
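To make the difference concrete, the following sketch compares a counter shared across all right partitions with a fresh counter of 1 per stream, as the fallback path is described above. The function streams_that_emit is a hypothetical helper for illustration; it models only the counter arithmetic, not the actual stream code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts how many streams conclude that they are the last prober and may
/// therefore emit unmatched-left rows.
/// `shared == true` models the single-pass path (one counter for all streams);
/// `shared == false` models the fallback path (a counter of 1 per stream).
fn streams_that_emit(right_partitions: usize, shared: bool) -> usize {
    let shared_counter = AtomicUsize::new(right_partitions);
    let mut emitters = 0;
    for _ in 0..right_partitions {
        // In the fallback model every stream gets its own counter set to 1.
        let local_counter = AtomicUsize::new(1);
        let counter = if shared { &shared_counter } else { &local_counter };
        if counter.fetch_sub(1, Ordering::Relaxed) == 1 {
            emitters += 1;
        }
    }
    emitters
}
```

With four right partitions, the shared counter lets one stream emit, while the per-stream counter lets all four emit, each based only on what it has probed itself.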
Reproduction concept
Force memory-limited fallback with a tight memory limit.
Ensure right side is partitioned into multiple partitions (target_partitions > 1 and plan shape that preserves multi-partition probe behavior).
Use LEFT JOIN (or LEFT ANTI / LEFT MARK) with a predicate where a given left row matches only rows in a non-local right partition.
Observe that another partition may emit that left row as unmatched.
Expected:
No unmatched-left row for left rows that match in any right partition.
Observed risk:
Duplicate or incorrect null-padded unmatched-left output.
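The expected and observed outcomes can be illustrated with a toy equality join over integers. This is a model of the risk only, under the assumption that each stream decides "unmatched" from its own right partition; unmatched_global and unmatched_per_stream are hypothetical helpers and do not correspond to DataFusion functions.

```rust
/// Left rows that match no row in any right partition.
/// This is the unmatched set a correct LEFT JOIN should null-pad.
fn unmatched_global(left: &[i32], right_partitions: &[Vec<i32>]) -> Vec<i32> {
    let mut out = Vec::new();
    for l in left {
        if !right_partitions.iter().any(|p| p.contains(l)) {
            out.push(*l);
        }
    }
    out
}

/// Left rows emitted as unmatched when every stream looks only at its own
/// right partition (the uncoordinated behavior described above).
fn unmatched_per_stream(left: &[i32], right_partitions: &[Vec<i32>]) -> Vec<i32> {
    let mut out = Vec::new();
    for partition in right_partitions {
        for l in left {
            if !partition.contains(l) {
                out.push(*l);
            }
        }
    }
    out
}
```

For left rows 1, 2, 3 and right partitions [1] and [2], the global view reports only row 3 as unmatched. The per-stream view reports 2 and 3 from the first partition and 1 and 3 from the second, so rows 1 and 2 are wrongly null-padded and row 3 is duplicated.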
Proposed fix
The most direct fix is to disable memory-limited fallback for LEFT-family join modes that require unmatched-left emission when right_partition_count > 1, until proper cross-partition coordination is implemented.
This trades off spill-fallback availability for correctness:
LEFT, LEFT ANTI, LEFT MARK (and other left-final-emission variants) will fail with an out-of-memory error instead of spilling when the right side has multiple partitions.
FULL remains unchanged: it is already blocked today.
INNER, RIGHT, RIGHT SEMI, RIGHT ANTI, RIGHT MARK continue to spill normally.
Future PRs can re-enable spill for LEFT-family joins with shared chunk-state coordination.
Alternative (more complex): implement cross-partition coordination now as part of this PR fix.
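Exact join-type scope for fallback-disable guard
Guard should explicitly cover:
JoinType::Left
JoinType::LeftAnti
JoinType::LeftMark
Guard should explicitly not cover:
JoinType::Inner, JoinType::Right, JoinType::RightSemi, JoinType::RightAnti, JoinType::RightMark (spill fallback remains enabled)
JoinType::Full (already blocked by existing full_join_multi_partition guard)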
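A minimal sketch of such a guard follows. It is an assumption-laden illustration: the JoinType enum here is a local stand-in that lists only the variants named in this issue, and check_fallback_allowed and FALLBACK_DISABLED_MSG are hypothetical names. The real check would live in initiate_fallback and use the crate's own error type.

```rust
/// Local stand-in for the join types discussed in this issue (hypothetical,
/// not the DataFusion enum).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JoinType {
    Inner,
    Left,
    Right,
    Full,
    LeftAnti,
    LeftMark,
    RightSemi,
    RightAnti,
    RightMark,
}

/// Hypothetical message; the issue only requires a stable substring.
const FALLBACK_DISABLED_MSG: &str =
    "memory-limited NestedLoopJoin fallback disabled for LEFT-producing join with multi-partition right side";

/// Rejects fallback for join types that need a final unmatched-left pass
/// when more than one right partition probes the same left data.
/// FULL is deliberately not handled here: it is covered by a separate guard.
fn check_fallback_allowed(
    join_type: JoinType,
    right_partition_count: usize,
) -> Result<(), String> {
    let needs_left_final_emission = matches!(
        join_type,
        JoinType::Left | JoinType::LeftAnti | JoinType::LeftMark
    );
    if needs_left_final_emission && right_partition_count > 1 {
        return Err(FALLBACK_DISABLED_MSG.to_string());
    }
    Ok(())
}
```

Keeping the covered variants in a single matches! expression makes the scope easy to audit and easy to relax once cross-partition coordination exists.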
Acceptance criteria (disabling-fallback approach)
Correctness restored: LEFT-family joins with multi-partition right input no longer produce duplicate/incorrect unmatched-left rows.
Implementation: Add a join-type check in initiate_fallback that returns Disabled state (or error) for JoinType::Left | JoinType::LeftAnti | JoinType::LeftMark when right_partition_count > 1.
Regression test: Add a deterministic unit test in nested_loop_join.rs
Setup: multi-partition right side, LEFT JOIN, tight memory limit.
Expected: deterministic error with stable substring, for example: memory-limited NestedLoopJoin fallback disabled for LEFT-producing join with multi-partition right side.
Non-goal for this PR: the test should assert only the stable substring, not the full formatted error text.
Verify: query still succeeds with single-partition right side.
FULL behavior unchanged: Existing FULL JOIN multi-partition fallback-disabled behavior remains intact.
No regressions on other join types:
INNER, RIGHT, RIGHT SEMI, RIGHT ANTI, RIGHT MARK continue to spill.
Existing memory-limited test suite green.
Documentation:
Code comment explaining why LEFT-family joins do not fall back with a multi-partition right side.
Suggested tests
Unit test in nested_loop_join.rs

```rust
#[tokio::test]
async fn test_nlj_memory_limited_left_join_multi_partition_fallback_disabled() -> Result<()> {
    // Setup: multi-partition right side + tight memory limit to force fallback path
    // Build data so at least one left row matches only in a non-local right partition.
    // Expectation:
    // - multi-partition LEFT join under memory pressure should not spill fallback;
    //   it should fail fast (OOM/disabled fallback) until global coordination exists.
    // - single-partition LEFT join should continue to succeed.
    Ok(())
}
```
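The stable-substring assertion called for in the acceptance criteria could look like the helper below. It is a sketch under assumptions: is_fallback_disabled_error and STABLE_SUBSTRING are hypothetical names, and the helper works on any error type that implements Display rather than on a specific DataFusion error.

```rust
use std::fmt::Display;

/// Hypothetical stable fragment of the error message. The test matches this
/// fragment rather than the full formatted error text.
const STABLE_SUBSTRING: &str = "fallback disabled for LEFT-producing join";

/// Returns true when the result is an error whose rendered text contains the
/// stable fragment, independent of any prefix or suffix the error type adds.
fn is_fallback_disabled_error<T, E: Display>(result: &Result<T, E>) -> bool {
    match result {
        Ok(_) => false,
        Err(e) => e.to_string().contains(STABLE_SUBSTRING),
    }
}
```

Matching on a fragment keeps the regression test from breaking when surrounding error context, such as a resource-exhaustion prefix, changes.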
Non-goals
Implementing cross-partition chunk-state sharing in this issue.
Changing optimizer rules or join selection logic.
Definition of done
Check-in disables memory-limited fallback for LEFT-producing joins with multi-partition right side.
New unit test verifies the error/disabled behavior.
Existing FULL JOIN fallback-disabled behavior remains covered.
Existing memory-limited test suite remains green (other join types unaffected).