Skip to content

PanamaDocValuesRangeSupport: OR mask directly into FixedBitSet instead of per-bit loop#16123

Open
sgup432 wants to merge 2 commits into
apache:mainfrom
sgup432:or_mask_doc_values
Open

PanamaDocValuesRangeSupport: OR mask directly into FixedBitSet instead of per-bit loop#16123
sgup432 wants to merge 2 commits into
apache:mainfrom
sgup432:or_mask_doc_values

Conversation

@sgup432
Copy link
Copy Markdown
Contributor

@sgup432 sgup432 commented May 25, 2026

Description

As a followup to one of the comment in this PR - #16050 by @neoremind and @benwtrent, I am adding this change.

Earlier, after the vectorized comparison, we would loop over the matching bits and do bit by bit update on the bitset. This is not needed as ideally we could directly perform a OR operation and set it in the bitset in just one operation.

Also handles the case where if we need to span across two words, for example:
base = 60, and we compared 8 docs together giving the maskBits as 0b10110001.

Then docs 60-67 span two words:

  • Docs 60-63 are in bits[0] (bit positions 60-63)
  • Docs 64-67 are in bits[1] (bit positions 0-3)

Results below. This change gave a further boost in some of the cases!

c5.2xlarge (AVX-512):

Pattern docCount Fields Before OR mask (ops/s) With OR mask (ops/s) Change
random 1M 1 200.2 273.5 +37%
random 1M 3 66.8 91.5 +37%
random 1M 5 62.6 81.9 +31%
random 10M 1 24.4 34.4 +41%
random 10M 3 8.1 11.1 +37%
random 10M 5 7.2 9.8 +36%
clustered 1M 1 8,856 11,345 +28%
clustered 1M 3 33,846 36,577 +8%
clustered 1M 5 35,055 34,975 ~0%
clustered 10M 1 1,308 1,324 +1%
clustered 10M 3 24,388 24,052 ~0%
clustered 10M 5 12,940 12,884 ~0%

Mac (Apple M-series, 128-bit NEON):

Pattern docCount Fields Before OR mask (ops/s) With OR mask (ops/s) Change
random 1M 1 219.5 250.2 +14%
random 1M 3 86.9 93.6 +8%
random 1M 5 80.2 80.2 ~0%
random 10M 1 27.1 28.2 +4%
random 10M 3 9.0 7.5 -17%* (did another run where it gave ~5% improvement)
random 10M 5 7.7 8.7 +12%
clustered 1M 1 27,137 36,003 +33%
clustered 1M 3 56,333 81,988 +46%
clustered 1M 5 77,159 78,315 +2%
clustered 10M 1 3,502 3,617 +3%
clustered 10M 3 57,900 61,969 +7%
clustered 10M 5 31,728 31,502 ~0%

@github-actions github-actions Bot added this to the 10.5.0 milestone May 25, 2026
@neoremind
Copy link
Copy Markdown
Contributor

Nice follow-up! The speedup looks solid!

One minor comment, non-blocking: would it be better to move this to a method like orMask in FixedBitSet? Now it feels like leaking the internal bit manipulation outside, maybe other experts have more opinion on the inline way vs. encapsulation in FixetBitSet.

Comment on lines +61 to +68
int wordIndex = base >> 6;
int bitOffset = base & 63;
long[] bits = bitSet.getBits();
if (bitOffset + vectorLen <= 64) {
bits[wordIndex] |= maskBits << bitOffset;
} else {
bits[wordIndex] |= maskBits << bitOffset;
bits[wordIndex + 1] |= maskBits >>> (64 - bitOffset);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my main concern here is correctness. I don't remember from the previous PR, but do we have a test that takes random docs of various densities, etc. and ensures that the "slow iterative path" and "fast path" produce the exact same results?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I am fine with a new method just on FixedBitSet to achieve this called orBits or orMask or something.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants