NestedLoopJoinExec spill path: untracked allocation overshoots memory pool #22723

@avantgardnerio

Describe the bug

NestedLoopJoinExec's memory-limited (spill) path allocates more memory than it reserves through the MemoryPool. The accounting-pool framework added in #22626 surfaces this if HEADROOM_FACTOR in the SLT pool is tightened — see #22721, which lowers it from 8.0 → 5.0 and trips this test.

To Reproduce

In datafusion/sqllogictest/src/accounting_pool.rs, set:

```rust
const HEADROOM_FACTOR: f64 = 5.0;
```

Then run nested_loop_join_spill.slt. The first query, at line 33, is the one that fails:

```sql
SET datafusion.execution.target_partitions = 1;
SET datafusion.runtime.memory_limit = '150K';

SELECT count(*) as cnt, min(v1) as mn, max(v1) as mx
FROM generate_series(1, 100000) AS t1(v1)
INNER JOIN generate_series(1, 1) AS t2(v2)
ON (t1.v1 + t2.v2) > 0;
```

Expected behavior

The query succeeds with allocator usage bounded by the configured memory_limit, plus roughly 10% slop for legitimate untracked overhead such as Tokio/Rayon thread state.

Actual behavior

```
External error: 1 errors in file datafusion/sqllogictest/test_files/nested_loop_join_spill.slt

  1. query failed: Other Error: allocator overdraft: account balance at panic = -20245 bytes
    at nested_loop_join_spill.slt:33
```

Peak allocator usage reaches ~770KB against a declared 150KB pool — 5.13× the budget. The first ~750KB (5×) is absorbed by the headroom factor; the additional 20245 bytes is allocator usage that was never reserved through MemoryPool::try_grow.
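
The arithmetic behind those figures can be checked with a short sketch. It assumes that '150K' parses as 150 × 1024 bytes and that the accounting pool's allowance is memory_limit × HEADROOM_FACTOR; `allowance_bytes` and `peak_bytes` are hypothetical helper names used only for illustration, not functions from accounting_pool.rs.

```rust
/// Headroom multiplier from the reproduction step.
const HEADROOM_FACTOR: f64 = 5.0;

/// Bytes the allocator may use before the accounting check trips.
/// Assumed model: memory_limit × HEADROOM_FACTOR.
fn allowance_bytes(memory_limit: u64) -> u64 {
    (memory_limit as f64 * HEADROOM_FACTOR) as u64
}

/// Peak allocator usage implied by a reported overdraft.
fn peak_bytes(memory_limit: u64, overdraft: u64) -> u64 {
    allowance_bytes(memory_limit) + overdraft
}

fn main() {
    let memory_limit: u64 = 150 * 1024; // '150K', assuming K = 1024 bytes
    let overdraft: u64 = 20_245; // magnitude of the reported negative balance

    let peak = peak_bytes(memory_limit, overdraft);
    println!("allowance: {} bytes", allowance_bytes(memory_limit)); // 768000
    println!("peak: {} bytes ({:.1} KiB)", peak, peak as f64 / 1024.0); // 788245, 769.8
    println!("ratio: {:.2}x", peak as f64 / memory_limit as f64); // 5.13x
}
```

Under those assumptions the numbers in the report are self-consistent: 768000 bytes of allowance plus the 20245-byte overdraft gives a peak of 788245 bytes, about 770 KiB.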

Likely root cause

The spill path in NestedLoopJoinExec appears to have at least these untracked allocation sites (each needs verification with targeted instrumentation):

  • generate_next_batch buffering — RecordBatches accumulated for the build side before the spill decision
  • concat_batches at the spill boundary — copies into a coalesced batch before the IPC writer sees it
  • take_native during the probe phase — gather kernels allocate output buffers
  • IPC reader path on spill re-read — the StreamReader/FileReader decoder owns buffers that aren't accounted
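
For reference, a fix at any of these sites would presumably follow the usual reserve-before-allocate pattern. The sketch below models that pattern with a toy pool in plain Rust. `ToyPool` and `buffer_with_reservation` are hypothetical stand-ins, not DataFusion's `MemoryPool` or `MemoryReservation` types, and batches are represented only by their byte sizes.

```rust
/// Toy stand-in for a memory pool (hypothetical; not DataFusion's trait).
struct ToyPool {
    limit: usize,
    reserved: usize,
}

impl ToyPool {
    fn new(limit: usize) -> Self {
        Self { limit, reserved: 0 }
    }

    /// Reserve `bytes`, failing if the pool limit would be exceeded.
    fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if self.reserved + bytes > self.limit {
            return Err(format!(
                "cannot reserve {bytes} bytes: {} of {} already reserved",
                self.reserved, self.limit
            ));
        }
        self.reserved += bytes;
        Ok(())
    }

    /// Release a previous reservation.
    fn shrink(&mut self, bytes: usize) {
        self.reserved = self.reserved.saturating_sub(bytes);
    }
}

/// Buffer batches (given as byte sizes), reserving each one before holding it.
/// When a reservation fails, the buffered data is "spilled": its reservation
/// is released and the incoming batch is retried. Returns the spill count.
fn buffer_with_reservation(pool: &mut ToyPool, batch_sizes: &[usize]) -> Result<usize, String> {
    let mut buffered = 0usize;
    let mut spills = 0usize;
    for &size in batch_sizes {
        if pool.try_grow(size).is_err() {
            pool.shrink(buffered); // buffered data goes to disk in a real operator
            buffered = 0;
            spills += 1;
            pool.try_grow(size)?; // a single batch larger than the pool is an error
        }
        buffered += size;
    }
    Ok(spills)
}

fn main() {
    let mut pool = ToyPool::new(150 * 1024);
    let spills = buffer_with_reservation(&mut pool, &[64 * 1024; 5]).unwrap();
    println!("spills: {spills}, reserved at end: {} bytes", pool.reserved);
}
```

The point of the model is that every buffer is accounted for before it is held, so the pool's view and the allocator's view cannot drift apart the way the overdraft above suggests they do.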

The first query in the test stays small (100K × i32 = ~400KB build side), so the overshoot is whatever lives off-pool in the spill setup, not the bulk data itself.
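
One way to do the targeted instrumentation is to wrap the system allocator and record peak growth around a suspect call. The sketch below uses only the standard library; `CountingAlloc` and `peak_during` are hypothetical names, and this is not necessarily how the accounting pool from #22626 measures usage.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Bytes currently allocated and the highest value seen so far.
static CURRENT: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

/// Wrapper around the system allocator that counts live bytes.
struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            let now = CURRENT.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(now, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        CURRENT.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static ALLOC: CountingAlloc = CountingAlloc;

/// Run `f` and return how far allocator usage rose above its starting level.
/// Allocations made by other threads during `f` are counted as well.
fn peak_during<F: FnOnce()>(f: F) -> usize {
    let base = CURRENT.load(Ordering::Relaxed);
    PEAK.store(base, Ordering::Relaxed);
    f();
    PEAK.load(Ordering::Relaxed).saturating_sub(base)
}

fn main() {
    // Stand-in for a suspect site such as a batch concatenation.
    let peak = peak_during(|| {
        let buf = vec![0u8; 1 << 20];
        std::hint::black_box(&buf);
    });
    println!("peak growth: {peak} bytes");
}
```

Comparing the value returned for each suspect site against the bytes reserved in the pool over the same span would show which sites allocate off-pool.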

Component

  • Physical Plan

Labels

  • bug
