Vectorize FixedBitSet.cardinality() via VectorizationProvider (Java 25 MRJAR) by iprithv · Pull Request #15832 · apache/lucene

iprithv · 2026-03-16T21:41:31Z

Summary

This change adds SIMD acceleration for FixedBitSet.cardinality() using the Java Vector API,
integrated through Lucene's existing VectorizationProvider dispatch framework.

Implementation

New interface: org.apache.lucene.internal.vectorization.BitSetUtilSupport
with a single method popCount(long[], int, int).
Java 25 SIMD implementation: PanamaBitSetUtilSupport in lucene/core/src/java25/,
using LongVector with VectorOperators.BIT_COUNT lane operation and
SPECIES_PREFERRED vector size.
Integration: hooked into VectorizationProvider alongside
getVectorUtilSupport() and getPostingDecodingUtil().
All vectorization decisions live in PanamaBitSetUtilSupport: the threshold
(64 longs = 4096 bits), the system property gate, and the scalar fallback. FixedBitSet
is a pure unconditional delegate with no awareness of thresholds or properties.
System property: -Dlucene.useVectorizedBitSetOps=true (default false,
opt-in only).
Fallback: On Java < 25, or when --add-modules jdk.incubator.vector is not
resolved, DefaultVectorizationProvider is used and popCount() is a plain scalar
Long.bitCount loop, so there is no behavioral change.
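
For readers following the description, the scalar fallback contract can be sketched roughly as below. This is a minimal illustration and not the PR's code: the long return type is inferred from the Math.toIntExact(...) call in the diff quoted later in the thread, and PopCountSketch and ScalarBitSetUtilSupport are hypothetical names chosen for this example.

```java
/** Illustration only; all names here are hypothetical stand-ins. */
final class PopCountSketch {

  /** Assumed shape of the single-method interface: popCount(long[], int, int). */
  interface BitSetUtilSupport {
    /** Counts the set bits in a[start] .. a[start + length - 1]. */
    long popCount(long[] a, int start, int length);
  }

  /** Plain scalar fallback: one Long.bitCount call per 64-bit word. */
  static final class ScalarBitSetUtilSupport implements BitSetUtilSupport {
    @Override
    public long popCount(long[] a, int start, int length) {
      long total = 0;
      for (int i = start, end = start + length; i < end; i++) {
        total += Long.bitCount(a[i]);
      }
      return total;
    }
  }

  public static void main(String[] args) {
    long[] words = {0L, 1L, -1L, 0xF0L};
    BitSetUtilSupport support = new ScalarBitSetUtilSupport();
    System.out.println(support.popCount(words, 0, words.length));
  }
}
```

A vectorized implementation would have to return exactly the same value for every (start, length) pair, which is what makes a scalar reference like this useful in tests.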

Test plan

./gradlew :lucene:core:test --tests org.apache.lucene.util.TestFixedBitSet
./gradlew :lucene:core:test --tests org.apache.lucene.util.TestFixedBitSet \
    -Dlucene.useVectorizedBitSetOps=true
./gradlew :lucene:core:test \
    --tests org.apache.lucene.internal.vectorization.TestVectorizationProvider

Benchmarks

./gradlew :lucene:benchmark-jmh:assemble

java -jar lucene/benchmark-jmh/build/benchmarks/lucene-benchmark-jmh-11.0.0-SNAPSHOT.jar \
    FixedBitSetBenchmark \
    -p numBits=512,1024,4096,65536,1048576 -p density=0.10 -wi 3 -i 5 -f 1

Results (ops/µs, -wi 3 -i 5 -f 1, density=0.10):
cardinalityScalar uses no --add-modules; cardinalityVector forks with --add-modules=jdk.incubator.vector.

numBits	cardinalityScalar	cardinalityVector	Gain
512	336.008 ± 8.743	334.150 ± 12.890	— (scalar path¹)
1,024	208.010 ± 1.655	207.387 ± 10.341	— (scalar path¹)
4,096	51.997 ± 0.275	85.069 ± 1.921	+63%
65,536	1.908 ± 0.046	4.167 ± 0.331	+118%
1,048,576	0.112 ± 0.002	0.237 ± 0.016	+112%

¹ Below the 64-long (4096-bit) threshold both methods run the same scalar loop.

Signed-off-by: prithvi <prithvisivasankar@gmail.com>

uschindler

This is wrongly implemented. The classes should not be loaded directly but through the VectorizationProvider interfaces, where an instance of the interface implemented in the java25 folder is handled. This replicates some already existing code and also does not do the correct checks (whether all is sane with the JVM).

This can't be merged until changed to behave in line with how the other vector stuff is handled with the VectorizationProvider factory. Hook it in there and move to the internal code package containing that provider, guarded with correct checks and then we can look into this again.

@rmuir

rmuir · 2026-03-17T14:03:14Z

FixedBitSetBenchmark.(cardinality|intersectionCount|unionCount|andNotCount)

intersectionCount/unionCount/andNotCount are not used anywhere in the Lucene codebase, so you can drop any optimization for them. I don't think cardinality is appropriate to optimize in this way either: for most sparse cases FixedBitSet is not even used. Instead, a different implementation geared at sparse data (SparseFixedBitSet/SparseLiveDocs/etc.) will be used...

These optimizations are expensive to maintain and require a lot of work for correctness. We should have a clear search use-case in mind when attempting to optimize any functions.

For example, the dot-product was optimized because it is a clear hotspot for vector search, and because java's floating point rules don't allow the autovectorizer to do it.

rmuir · 2026-03-17T14:14:35Z

For any search regressions resulting from the change, the cause must be cardinality, which is the only one used in the codebase. The problem is likely the use of SPECIES_PREFERRED (AVX-512), but this is just a guess. It might look 2x better in isolation but cause slowdowns due to downclocking. Personally, I think the autovectorizer is a safer approach.

uschindler · 2026-03-17T19:52:24Z

-    }
-    return Math.toIntExact(tot);
+    return Math.toIntExact(
+        VectorizationProvider.getInstance().getBitSetUtilSupport().popCount(bits, 0, numWords));


This call is too expensive. Did you benchmark at all? (It calculates a stack trace to figure out whether the caller is allowed to call it.)

Basically you need to create a private static final constant in FixedBitSet referring to the BitSetUtilSupport instance, initialized when the class is loaded, and then call popCount() here on the static constant.

initialised at class load, thank you.
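
The pattern requested in this thread can be sketched as follows. It is only an illustration under stated assumptions: Provider.lookup() is a hypothetical stand-in for VectorizationProvider.getInstance().getBitSetUtilSupport(), the lookup counter exists purely to make the effect visible, and BitSetLike is not the real FixedBitSet.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustration only; every name in this class is hypothetical. */
final class CachedSupportSketch {

  interface BitSetUtilSupport {
    long popCount(long[] a, int start, int length);
  }

  /** Stand-in for the provider; counts how often the costly lookup runs. */
  static final class Provider {
    static final AtomicInteger LOOKUPS = new AtomicInteger();

    static BitSetUtilSupport lookup() {
      LOOKUPS.incrementAndGet(); // models the expensive caller check
      return (a, start, length) -> {
        long total = 0;
        for (int i = start, end = start + length; i < end; i++) {
          total += Long.bitCount(a[i]);
        }
        return total;
      };
    }
  }

  /** Simplified bit set that resolves its support instance once, at class load. */
  static final class BitSetLike {
    private static final BitSetUtilSupport BITSET_UTIL = Provider.lookup();

    private final long[] bits;

    BitSetLike(long[] bits) {
      this.bits = bits;
    }

    int cardinality() {
      // Hot path: only a static field read, no provider lookup per call.
      return Math.toIntExact(BITSET_UTIL.popCount(bits, 0, bits.length));
    }
  }

  public static void main(String[] args) {
    BitSetLike set = new BitSetLike(new long[] {0b1011L, -1L});
    for (int i = 0; i < 3; i++) {
      System.out.println(set.cardinality());
    }
    System.out.println("lookups: " + Provider.LOOKUPS.get());
  }
}
```

Because the field is static and final, the lookup cost is paid once during class initialization, and the JIT is free to treat the instance as a constant.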

uschindler · 2026-03-17T20:00:47Z

+          long.class, VectorShape.forBitSize(PanamaVectorConstants.PREFERRED_VECTOR_BITSIZE));
+  private static final int VL = LONG_SPECIES.length();
+
+  private static final boolean ENABLED;


I don't think we need this, because users can enable/disable it with the Panama JVM parameter (enabling the module).

removed the ENABLED flag, thank you.

rmuir · 2026-03-17T21:45:48Z

+    for (; i < loopBound; i += VL) {
+      acc =
+          acc.add(
+              LongVector.fromArray(LONG_SPECIES, a, start + i).lanewise(VectorOperators.BIT_COUNT));


What will this do on my computer without AVX-512? My concern is that it might go 30x slower since I don't have such an instruction.

I tested with -Dlucene.panama.vectorSize=256 to simulate narrower vectors, but my hardware is ARM (Mac), so this doesn't replicate x86 AVX2 without VPOPCNTDQ. I don't have an AVX2 x86 machine available to measure directly.

I changed to gate the SIMD path on PREFERRED_VECTOR_BITSIZE == 512 as a conservative guard, so the vectorized path only activates when AVX-512 VPOPCNTDQ is available, thank you.

The problem with Intel/AMD machines is that many of them throttle CPU frequency when AVX3 is used, which can have side effects on other systems, so in Lucene we generally avoid using AVX3 features at the moment.

Maybe the popcnt optimization should only be used for ARM CPUs with the correct checks as done in the provider.

P.S.: As the BitSet provider only serves popcnt, maybe instead of having the implementation do the 512-bit check, it could also be done in the factory call. So when there are no 512 bits available, the factory would return the default implementation. Currently the default implementation is a lambda; I'd prefer to have it as a separate class and have both providers return an instance of that class, depending on CPU settings.

autovectorizer in hotspot will already take care of all this for us.

Benchmarked this on ARM (which has native NEON vector popcount). The scalar loop (cardinalityScalar, no --add-modules) runs at 1.908 ops/µs at 65K bits. The Panama path runs at 4.167 ops/µs (+118%). If HotSpot's autovectorizer were already using native vector popcount for Long.bitCount, those numbers would be identical.
Given this difference, I think C2 SuperWord doesn't currently pattern match Long.bitCount intrinsics into vectorized popcount; it appears to bail out on intrinsic nodes in reduction loops.
Would be great if there's a way to verify the JIT output on some other hardware.

For the factory-level check, I can move the capability check into PanamaVectorizationProvider so it returns the scalar implementation when not on the right hardware, and remove POPCOUNT_SUPPORTED from the impl.
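
A rough sketch of the factory-level check discussed in this thread is shown below. It is an assumption-laden illustration, not the PR's code: select(int) is a hypothetical factory method, the 512-bit condition mirrors the guard mentioned above, and BlockedBitSetUtilSupport is only a placeholder for the Panama implementation (it uses a blocked scalar loop with a tail so the example runs without jdk.incubator.vector).

```java
/** Illustration only; every name in this class is hypothetical. */
final class FactoryGateSketch {

  interface BitSetUtilSupport {
    long popCount(long[] a, int start, int length);
  }

  /** Default scalar implementation as a named class rather than a lambda. */
  static final class DefaultBitSetUtilSupport implements BitSetUtilSupport {
    @Override
    public long popCount(long[] a, int start, int length) {
      long total = 0;
      for (int i = start, end = start + length; i < end; i++) {
        total += Long.bitCount(a[i]);
      }
      return total;
    }
  }

  /** Placeholder for the vectorized path: main loop over full blocks, then a tail. */
  static final class BlockedBitSetUtilSupport implements BitSetUtilSupport {
    private static final int LANES = 8; // 512 bits / 64 bits per long

    @Override
    public long popCount(long[] a, int start, int length) {
      long total = 0;
      int i = 0;
      int loopBound = length - (length % LANES);
      for (; i < loopBound; i += LANES) {
        for (int lane = 0; lane < LANES; lane++) {
          total += Long.bitCount(a[start + i + lane]);
        }
      }
      for (; i < length; i++) { // scalar tail for the remaining words
        total += Long.bitCount(a[start + i]);
      }
      return total;
    }
  }

  /** The capability check lives in the factory; implementations stay check-free. */
  static BitSetUtilSupport select(int preferredVectorBitSize) {
    return preferredVectorBitSize == 512
        ? new BlockedBitSetUtilSupport()
        : new DefaultBitSetUtilSupport();
  }
}
```

With the decision made once in the factory, callers such as FixedBitSet only ever see the interface, and the condition can later be changed (for example to an ARM-only check) without touching either implementation.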

uschindler · 2026-03-18T12:47:20Z

In general we should collect performance measurements for different CPUs, comparing scalar with optimized impl.

Once this is done, we can look into further streamlining the implementation. An alternative would be to add the popcnt on a long[] array to the public VectorUtil class (so it's publicly available) and then implement it only in VectorUtilProvider. That's just an idea, so that other BitSet implementations can use it.

Vectorize FixedBitSet popcount-based ops via Vector API (Java 25 MRJAR)

295782f

Signed-off-by: prithvi <prithvisivasankar@gmail.com>

github-actions Bot added the module:core/other label Mar 16, 2026

Updated CHANGES.txt

f9a1a88

Signed-off-by: prithvi <prithvisivasankar@gmail.com>

github-actions Bot added this to the 11.0.0 milestone Mar 16, 2026

iprithv added 2 commits March 17, 2026 03:31

code check error google-java-format in VectorizedBitSetOpsHelper fixed

ce7f858

Signed-off-by: prithvi <prithvisivasankar@gmail.com>

code check error fix ecj lint in VectorizedBitSetOps

f5308f6

Signed-off-by: prithvi <prithvisivasankar@gmail.com>

uschindler requested changes Mar 17, 2026

View reviewed changes

uschindler marked this pull request as draft March 17, 2026 11:37

iprithv changed the title ~~Vectorize FixedBitSet popcount-based ops via Vector API (Java 25 MRJAR)~~ Vectorize FixedBitSet.cardinality() via VectorizationProvider (Java 25 MRJAR) Mar 17, 2026

iprithv added 5 commits March 18, 2026 00:59

review changes

ae5bb40

Signed-off-by: prithvi <prithvisivasankar@gmail.com>

Merge branch 'main' into lucene-smid-bitset

f6b426b

fix lint: trailing newlines, revert spurious whitespace changes in Fi…

6c397cf

…xedBitSet Signed-off-by: prithvi <prithvisivasankar@gmail.com>

Merge remote-tracking branch 'origin/lucene-smid-bitset' into lucene-…

b9032f1

…smid-bitset

review changes

02f87f0

Signed-off-by: prithvi <prithvisivasankar@gmail.com>

uschindler requested changes Mar 17, 2026

View reviewed changes

uschindler reviewed Mar 17, 2026

View reviewed changes

Comment thread lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/FixedBitSetBenchmark.java

uschindler reviewed Mar 17, 2026

View reviewed changes

iprithv added 2 commits March 18, 2026 01:31

review changes

6d6cad5

Signed-off-by: prithvi <prithvisivasankar@gmail.com>

review changes

84272dc

Signed-off-by: prithvi <prithvisivasankar@gmail.com>

rmuir reviewed Mar 17, 2026

View reviewed changes

iprithv added 3 commits March 18, 2026 12:03

review changes

4b68ff5

Signed-off-by: prithvi <prithvisivasankar@gmail.com>

docs: update CHANGES.txt to reflect AVX-512 gating and removal of sys…

9303679

…prop

Updated CHANGES.txt

7a5c212

Signed-off-by: prithvi <prithvisivasankar@gmail.com>

iprithv closed this Mar 21, 2026
