
Use binary search partitioning in ReaderUtil#partitionByLeaf #15938

Open
gsmiller wants to merge 3 commits into apache:main from gsmiller:GH/leaf-partition-bsearch

Conversation

Contributor

@gsmiller gsmiller commented Apr 7, 2026

Description

This change modifies the "partition" step of ReaderUtil#partitionByLeaf to leverage binary search instead of a linear scan.

  • Current Implementation: Iterates a sorted list of docIDs, checking each against leaf/segment boundaries to partition the sorted docs into their corresponding segments.
  • Proposed Implementation: Iterate the leaves/segments and binary search the sorted docIDs to determine the partitions.
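The proposed approach can be sketched roughly as follows. This is a standalone illustration, not the PR's actual code: `LeafPartitionSketch`, the `leafStarts` array, and its trailing sentinel entry are assumptions standing in for Lucene's `LeafReaderContext#docBase` / `LeafReader#maxDoc` machinery.

```java
import java.util.Arrays;

// Standalone sketch (NOT the PR's code): leaf i covers the global docID
// range [leafStarts[i], leafStarts[i + 1]); the last entry of leafStarts is
// a sentinel equal to the index's total maxDoc.
public class LeafPartitionSketch {
  public static int[][] partitionByLeaf(int[] sortedDocIds, int[] leafStarts) {
    int numLeaves = leafStarts.length - 1;
    int[][] result = new int[numLeaves][];
    int from = 0;
    for (int leafIdx = 0; leafIdx < numLeaves && from < sortedDocIds.length; leafIdx++) {
      int leafEnd = leafStarts[leafIdx + 1];
      if (sortedDocIds[from] >= leafEnd) {
        result[leafIdx] = new int[0]; // no matching docs in this leaf
        continue;
      }
      // Binary search for the first index whose docID is >= leafEnd
      // (the exclusive upper bound of this leaf's slice).
      int to = Arrays.binarySearch(sortedDocIds, from, sortedDocIds.length, leafEnd);
      if (to < 0) {
        to = -to - 1; // not found: convert to insertion point
      }
      result[leafIdx] = Arrays.copyOfRange(sortedDocIds, from, to);
      from = to;
    }
    for (int leafIdx = 0; leafIdx < numLeaves; leafIdx++) {
      if (result[leafIdx] == null) {
        result[leafIdx] = new int[0]; // leaves past the last doc are empty
      }
    }
    return result;
  }
}
```

Each leaf costs one binary search over the remaining docIDs, so the work is O(L * log(D)) rather than the linear scan's O(D + L).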

I've included a JMH benchmark with this change that I used to compare the two approaches. Benchmark results are below. At a high level, the binary search approach outperforms the linear scan except when there's a very large number of index segments relative to the number of docs being partitioned. My sense is that the cases where the linear scan performs best are uncommon "in the wild," so I think the binary search approach generally makes sense. But open to feedback/thoughts of course!

Benchmarks

Some iterations and more details are in #15934 where I initially experimented with this, but the most concise benchmarks are detailed here.

Benchmark Hardware

I ran benchmarks on two AWS ec2 Amazon Linux hosts—one with x86 (m5.12xlarge) and one with ARM (m6g.4xlarge):

            x86 Linux                         ARM Linux
CPU         Intel Xeon Platinum 8175M        Neoverse-N1 (Graviton2)
Clock       2.5 GHz (base), 3.1 GHz (turbo)  ~2.5 GHz
L1d cache   32 KB                            64 KB
L2 cache    1 MB                             1 MB
Cores       48                               16

Summary Results

Ran benchmarks as: java -jar lucene/benchmark-jmh/build/benchmarks/lucene-benchmark-jmh-*.jar PartitionByLeafBenchmark

I fed the results to an AI tool to build the summary tables, but the raw output is also included below.

x86 Results

numDocIds numLeaves Linear (ops/ms) BS (ops/ms) Difference
100 5 4,352 8,429 BS +94% ✅
100 10 3,501 4,880 BS +39% ✅
100 20 2,682 2,589 ~tie
100 50 1,470 1,180 Linear +25% ❌
100 200 625 461 Linear +36% ❌
1,000 5 484 1,911 BS +295% ✅
1,000 10 479 1,922 BS +301% ✅
1,000 20 463 1,470 BS +217% ✅
1,000 50 416 735 BS +77% ✅
1,000 200 264 195 Linear +35% ❌
10,000 5 45 182 BS +304% ✅
10,000 10 46 175 BS +283% ✅
10,000 20 44 155 BS +252% ✅
10,000 50 47 193 BS +311% ✅
10,000 200 43 119 BS +176% ✅
100,000 5 6.1 10.9 BS +80% ✅
100,000 10 7.5 15.8 BS +111% ✅
100,000 20 8.0 17.2 BS +115% ✅
100,000 50 8.2 17.7 BS +116% ✅
100,000 200 8.0 14.1 BS +76% ✅

ARM Results

numDocIds numLeaves Linear (ops/ms) BS (ops/ms) Difference
100 5 5,285 6,608 BS +25% ✅
100 10 3,967 3,920 ~tie
100 20 2,331 1,948 Linear +20% ❌
100 50 1,470 840 Linear +75% ❌
100 200 580 363 Linear +60% ❌
1,000 5 703 1,798 BS +156% ✅
1,000 10 648 1,402 BS +116% ✅
1,000 20 615 1,068 BS +74% ✅
1,000 50 486 550 BS +13% ✅
1,000 200 265 140 Linear +89% ❌
10,000 5 69 273 BS +295% ✅
10,000 10 68 205 BS +201% ✅
10,000 20 67 196 BS +193% ✅
10,000 50 65 169 BS +160% ✅
10,000 200 58 74 BS +27% ✅
100,000 5 7.9 38.9 BS +392% ✅
100,000 10 7.5 36.6 BS +388% ✅
100,000 20 7.6 31.6 BS +316% ✅
100,000 50 7.2 23.9 BS +232% ✅
100,000 200 6.6 16.2 BS +145% ✅

Raw Benchmark Output

x86
Benchmark (numDocIds) (numLeaves) Mode Cnt Score Error Units
PartitionByLeafBenchmark.binarySearchPartition 100 5 thrpt 15 8429.348 ± 440.448 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 100 10 thrpt 15 4880.362 ± 54.478 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 100 20 thrpt 15 2589.496 ± 65.347 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 100 50 thrpt 15 1180.417 ± 93.940 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 100 200 thrpt 15 460.616 ± 22.819 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 1000 5 thrpt 15 1911.034 ± 23.741 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 1000 10 thrpt 15 1922.153 ± 77.887 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 1000 20 thrpt 15 1470.400 ± 32.570 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 1000 50 thrpt 15 734.610 ± 3.304 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 1000 200 thrpt 15 195.314 ± 1.438 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 10000 5 thrpt 15 182.224 ± 2.416 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 10000 10 thrpt 15 174.868 ± 1.510 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 10000 20 thrpt 15 155.064 ± 2.676 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 10000 50 thrpt 15 193.320 ± 1.270 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 10000 200 thrpt 15 119.443 ± 2.780 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 100000 5 thrpt 15 10.909 ± 0.058 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 100000 10 thrpt 15 15.837 ± 0.137 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 100000 20 thrpt 15 17.244 ± 0.160 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 100000 50 thrpt 15 17.748 ± 0.121 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 100000 200 thrpt 15 14.092 ± 0.460 ops/ms
PartitionByLeafBenchmark.linearPartition 100 5 thrpt 15 4351.953 ± 32.738 ops/ms
PartitionByLeafBenchmark.linearPartition 100 10 thrpt 15 3500.875 ± 134.317 ops/ms
PartitionByLeafBenchmark.linearPartition 100 20 thrpt 15 2681.611 ± 14.056 ops/ms
PartitionByLeafBenchmark.linearPartition 100 50 thrpt 15 1469.889 ± 283.971 ops/ms
PartitionByLeafBenchmark.linearPartition 100 200 thrpt 15 624.940 ± 33.866 ops/ms
PartitionByLeafBenchmark.linearPartition 1000 5 thrpt 15 484.472 ± 1.494 ops/ms
PartitionByLeafBenchmark.linearPartition 1000 10 thrpt 15 478.512 ± 5.470 ops/ms
PartitionByLeafBenchmark.linearPartition 1000 20 thrpt 15 463.223 ± 7.010 ops/ms
PartitionByLeafBenchmark.linearPartition 1000 50 thrpt 15 416.226 ± 9.632 ops/ms
PartitionByLeafBenchmark.linearPartition 1000 200 thrpt 15 264.385 ± 2.228 ops/ms
PartitionByLeafBenchmark.linearPartition 10000 5 thrpt 15 44.861 ± 0.470 ops/ms
PartitionByLeafBenchmark.linearPartition 10000 10 thrpt 15 45.709 ± 0.101 ops/ms
PartitionByLeafBenchmark.linearPartition 10000 20 thrpt 15 44.448 ± 0.156 ops/ms
PartitionByLeafBenchmark.linearPartition 10000 50 thrpt 15 47.107 ± 0.505 ops/ms
PartitionByLeafBenchmark.linearPartition 10000 200 thrpt 15 43.276 ± 0.152 ops/ms
PartitionByLeafBenchmark.linearPartition 100000 5 thrpt 15 6.059 ± 0.015 ops/ms
PartitionByLeafBenchmark.linearPartition 100000 10 thrpt 15 7.519 ± 0.169 ops/ms
PartitionByLeafBenchmark.linearPartition 100000 20 thrpt 15 8.019 ± 0.033 ops/ms
PartitionByLeafBenchmark.linearPartition 100000 50 thrpt 15 8.233 ± 0.045 ops/ms
PartitionByLeafBenchmark.linearPartition 100000 200 thrpt 15 8.048 ± 0.034 ops/ms

ARM
Benchmark (numDocIds) (numLeaves) Mode Cnt Score Error Units
PartitionByLeafBenchmark.binarySearchPartition 100 5 thrpt 15 6607.852 ± 248.856 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 100 10 thrpt 15 3920.374 ± 147.577 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 100 20 thrpt 15 1948.412 ± 3.964 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 100 50 thrpt 15 840.036 ± 24.553 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 100 200 thrpt 15 362.670 ± 5.729 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 1000 5 thrpt 15 1798.015 ± 26.241 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 1000 10 thrpt 15 1401.657 ± 25.145 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 1000 20 thrpt 15 1067.532 ± 38.887 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 1000 50 thrpt 15 550.367 ± 5.071 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 1000 200 thrpt 15 139.719 ± 1.729 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 10000 5 thrpt 15 273.486 ± 17.731 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 10000 10 thrpt 15 204.891 ± 1.491 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 10000 20 thrpt 15 195.903 ± 1.537 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 10000 50 thrpt 15 168.672 ± 8.833 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 10000 200 thrpt 15 74.241 ± 3.183 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 100000 5 thrpt 15 38.910 ± 0.332 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 100000 10 thrpt 15 36.608 ± 0.374 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 100000 20 thrpt 15 31.596 ± 0.274 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 100000 50 thrpt 15 23.895 ± 0.305 ops/ms
PartitionByLeafBenchmark.binarySearchPartition 100000 200 thrpt 15 16.244 ± 0.143 ops/ms
PartitionByLeafBenchmark.linearPartition 100 5 thrpt 15 5284.656 ± 31.317 ops/ms
PartitionByLeafBenchmark.linearPartition 100 10 thrpt 15 3966.757 ± 32.919 ops/ms
PartitionByLeafBenchmark.linearPartition 100 20 thrpt 15 2330.805 ± 15.104 ops/ms
PartitionByLeafBenchmark.linearPartition 100 50 thrpt 15 1469.664 ± 49.736 ops/ms
PartitionByLeafBenchmark.linearPartition 100 200 thrpt 15 579.968 ± 8.910 ops/ms
PartitionByLeafBenchmark.linearPartition 1000 5 thrpt 15 703.493 ± 4.639 ops/ms
PartitionByLeafBenchmark.linearPartition 1000 10 thrpt 15 648.284 ± 6.565 ops/ms
PartitionByLeafBenchmark.linearPartition 1000 20 thrpt 15 614.930 ± 6.369 ops/ms
PartitionByLeafBenchmark.linearPartition 1000 50 thrpt 15 485.980 ± 12.309 ops/ms
PartitionByLeafBenchmark.linearPartition 1000 200 thrpt 15 264.950 ± 10.029 ops/ms
PartitionByLeafBenchmark.linearPartition 10000 5 thrpt 15 69.465 ± 4.038 ops/ms
PartitionByLeafBenchmark.linearPartition 10000 10 thrpt 15 68.010 ± 0.186 ops/ms
PartitionByLeafBenchmark.linearPartition 10000 20 thrpt 15 67.008 ± 0.213 ops/ms
PartitionByLeafBenchmark.linearPartition 10000 50 thrpt 15 65.174 ± 3.478 ops/ms
PartitionByLeafBenchmark.linearPartition 10000 200 thrpt 15 58.380 ± 3.943 ops/ms
PartitionByLeafBenchmark.linearPartition 100000 5 thrpt 15 7.858 ± 0.032 ops/ms
PartitionByLeafBenchmark.linearPartition 100000 10 thrpt 15 7.525 ± 0.506 ops/ms
PartitionByLeafBenchmark.linearPartition 100000 20 thrpt 15 7.645 ± 0.029 ops/ms
PartitionByLeafBenchmark.linearPartition 100000 50 thrpt 15 7.245 ± 0.033 ops/ms
PartitionByLeafBenchmark.linearPartition 100000 200 thrpt 15 6.632 ± 0.005 ops/ms

Contributor

@jainankitk jainankitk left a comment


The existing logic was O(D + L) where D = docs and L = leaves. The new approach is O(L * log(D)). This explains why:

  • Binary search wins big when D >> L (the common case) — e.g., 100K docs / 5 leaves: +80-392%
  • Linear scan wins when L is large relative to D (uncommon) — e.g., 100 docs / 200 leaves: linear +36-60%

Given we know the number of matching documents and leaves upfront, I am wondering if it makes sense to keep existing logic for sparse cases?

Contributor Author

gsmiller commented Apr 8, 2026

Given we know the number of matching documents and leaves upfront, I am wondering if it makes sense to keep existing logic for sparse cases?

Great question! I did play with this a bit while benchmarking. I forked the logic based on whether or not there were more leaves than docs (essentially handling the outlier cases, where there are lots of leaves and very few docs, with the current linear scan). I ultimately shelved that idea for two reasons: (1) I questioned the trade-off of maintaining two separate implementations to handle outlier cases that I don't think are very likely, and (2) the tuning heuristic for when to use linear isn't as clean as checking which is larger, docs or leaves (it seemed to be hardware dependent). Also, the cases where the linear scan is more performant are already very fast (because there must be very few docs), so I wasn't sure it was worth further optimizing them with a forked implementation when the practical difference would likely be small. One thing I like about the binary search approach is that it makes the "slow cases" (i.e., lots of docs) much faster.
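For illustration, the shelved dispatch idea might look something like this. Everything here is hypothetical: `PartitionDispatchSketch` is not in the PR, and the naive "which is larger" comparison stands in for the tuned, hardware-dependent threshold the comment above says would actually be needed.

```java
// Hypothetical sketch of the shelved forked approach: dispatch on whether
// leaves outnumber docs. The PR deliberately does NOT do this; the naive
// comparison below is a placeholder for a hardware-dependent threshold.
public class PartitionDispatchSketch {
  public static String chooseStrategy(int numDocIds, int numLeaves) {
    // Linear scan only won in the benchmarks when there were many leaves
    // and few docs (e.g., 100 docs / 200 leaves).
    return numLeaves > numDocIds ? "linear" : "binarySearch";
  }
}
```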

But stepping back, I'm not opposed to having two implementations if there's support for it; we do that sort of thing in other places. I was trying to start simple though, and to avoid feedback in the other direction (e.g., "why are you overcomplicating this with two implementations?" :)).

Greg Miller added 3 commits April 9, 2026 07:14
@gsmiller gsmiller force-pushed the GH/leaf-partition-bsearch branch from f301e05 to d614658 Compare April 9, 2026 14:37
@mikemccand
Member

+1 to keep it simple (just binary search option). I think it's OK if the already fast cases get a bit slower, and the slow cases get sizably faster? We've made similar tradeoffs in the past for query execution optimizations.

Member

@mikemccand mikemccand left a comment


I love it -- it's also a nice code simplification since we no longer need the additional/duplicated special-case handling after the loop.

Thanks @gsmiller.

I also made a fun little JMH benchmark with CC (Claude Code) ... will try to open PR soon. @zihanx's partitionByLeaf PR (#15803) and your JMH benchmarking inspired me!

LeafReaderContext leaf = leaves.get(leafIdx);
int leafEnd = leaf.docBase + leaf.reader().maxDoc();
if (sortedDocIds[from] >= leafEnd) {
result[leafIdx] = EMPTY_INT_ARRAY;
Member


I wish we had per-line code coverage pulled out to the GitHub PR (here) so we could quickly confirm whether unit tests are covering this path... I think we do somewhere run tests with code coverage but it's a fully separate UI maybe?

if (to < 0) {
to = -to - 1;
}
int count = to - from;
Member


Maybe assert count > 0 since we think we're always handling/optimizing the empty case above?

3 participants