Skip to content

Conversation

@nuno-faria
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

The original code was referencing a record batch for each line in the limit, which was causing the arrow_select::interleave_record_batch to allocate vectors with billions of elements. For example, if we had 32 batches and LIMIT 50000, those 32 batches would be referenced 50k times. The new version uses only one reference per existing batch, limiting the memory used as well as improving the execution time:

> select *
from lineitem
order by l_comment;
-- Elapsed 1.804 seconds.

-- would OOM before
> select *
from lineitem
order by l_comment
limit 50000;
-- Elapsed 0.625 seconds.

What changes are included in this PR?

  • Updated the way the record batches and indices are created in topk/mod.rs.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Sep 17, 2025
@adriangb
Copy link
Contributor

Very exciting, amazing work!!

@adriangb adriangb added this pull request to the merge queue Sep 17, 2025
Merged via the queue into apache:main with commit d2f8fa5 Sep 17, 2025
28 checks passed
@nuno-faria nuno-faria deleted the fix_sort_topk branch September 17, 2025 12:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SortExec with TopK causes OOM

2 participants