Skip to content

Vectorize FixedBitSet.cardinality() via VectorizationProvider (Java 25 MRJAR)#15832

Closed
iprithv wants to merge 14 commits into
apache:mainfrom
iprithv:lucene-smid-bitset
Closed

Vectorize FixedBitSet.cardinality() via VectorizationProvider (Java 25 MRJAR)#15832
iprithv wants to merge 14 commits into
apache:mainfrom
iprithv:lucene-smid-bitset

Conversation

@iprithv
Copy link
Copy Markdown
Contributor

@iprithv iprithv commented Mar 16, 2026

Summary

This change adds SIMD acceleration for FixedBitSet.cardinality() using the Java Vector API,
integrated through Lucene's existing VectorizationProvider dispatch framework.


Implementation

  • New interface: org.apache.lucene.internal.vectorization.BitSetUtilSupport
    with a single method popCount(long[], int, int).
  • Java 25 SIMD implementation: PanamaBitSetUtilSupport in lucene/core/src/java25/,
    using LongVector with VectorOperators.BIT_COUNT lane operation and
    SPECIES_PREFERRED vector size.
  • Integration: hooked into VectorizationProvider alongside
    getVectorUtilSupport() and getPostingDecodingUtil().
  • All vectorization decisions live in PanamaBitSetUtilSupport, threshold
    (64 longs = 4096 bits), system property gate, and scalar fallback. FixedBitSet
    is a pure unconditional delegate with no awareness of thresholds or properties.
  • System property: -Dlucene.useVectorizedBitSetOps=true (default false,
    opt-in only).
  • Fallback: On Java < 25, or when --add-modules jdk.incubator.vector is not
    resolved, DefaultVectorizationProvider is used and popCount() is a plain scalar
    Long.bitCount loop — no behavioral change.

Test plan

./gradlew :lucene:core:test --tests org.apache.lucene.util.TestFixedBitSet
./gradlew :lucene:core:test --tests org.apache.lucene.util.TestFixedBitSet \
    -Dlucene.useVectorizedBitSetOps=true
./gradlew :lucene:core:test \
    --tests org.apache.lucene.internal.vectorization.TestVectorizationProvider

Benchmarks

./gradlew :lucene:benchmark-jmh:assemble

java -jar lucene/benchmark-jmh/build/benchmarks/lucene-benchmark-jmh-11.0.0-SNAPSHOT.jar
FixedBitSetBenchmark
-p numBits=512,1024,4096,65536,1048576 -p density=0.10 -wi 3 -i 5 -f 1

Results : (ops/µs, -wi 3 -i 5 -f 1, density=0.10).
cardinalityScalar uses no --add-modules; cardinalityVector forks with --add-modules=jdk.incubator.vector.

numBits cardinalityScalar cardinalityVector Gain
512 336.008 ± 8.743 334.150 ± 12.890 — (scalar path¹)
1,024 208.010 ± 1.655 207.387 ± 10.341 — (scalar path¹)
4,096 51.997 ± 0.275 85.069 ± 1.921 +63%
65,536 1.908 ± 0.046 4.167 ± 0.331 +118%
1,048,576 0.112 ± 0.002 0.237 ± 0.016 +112%

Below the 64-long (4096-bit) threshold both methods run the same scalar loop.

Signed-off-by: prithvi <prithvisivasankar@gmail.com>
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
@github-actions github-actions Bot added this to the 11.0.0 milestone Mar 16, 2026
iprithv added 2 commits March 17, 2026 03:31
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
Copy link
Copy Markdown
Contributor

@uschindler uschindler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrongly implemented. The classes should not be loaded directly but using VectorizationProvider interfaces where an instance of the interface implemented in the Java25 folder is handled. This replicates some already existing code and also does not do the correct checks (if all is sane with JVM).

This can't be merged until changed to behave in line with how the other vector stuff is handled with the VectorizationProvider factory. Hook it in there and move to the internal code package containing that provider, guarded with correct checks and then we can look into this again.

@rmuir

@uschindler uschindler marked this pull request as draft March 17, 2026 11:37
@rmuir
Copy link
Copy Markdown
Member

rmuir commented Mar 17, 2026

FixedBitSetBenchmark.(cardinality|intersectionCount|unionCount|andNotCount)

intersectionCount/unionCount/andNotCount are not used anywhere in the lucene codebase. So you can drop any optimization for them. I don't think cardinality is appropriate to optimize in this way either: for most sparse cases FixedBitSet is not even used. Instead different implementation (SparseFixedBitset/SparseLiveDocs/etc) geared at sparse will be used...

These optimizations are expensive to maintain and require a lot of work for correctness. We should have a clear search use-case in mind when attempting to optimize any functions.

For example, the dot-product was optimized because it is a clear hotspot for vector search, and because java's floating point rules don't allow the autovectorizer to do it.

@rmuir
Copy link
Copy Markdown
Member

rmuir commented Mar 17, 2026

for any search regressions resulting from the change, it must be from cardinality, which is the only one used in the codebase. Problem is likely use of SPECIES_PREFERRED (avx512), this is just a guess. It might look 2x better in isolation but cause slowdowns due to downclocking. Personally I think autovectorizer is a safer approach.

@iprithv iprithv changed the title Vectorize FixedBitSet popcount-based ops via Vector API (Java 25 MRJAR) Vectorize FixedBitSet.cardinality() via VectorizationProvider (Java 25 MRJAR) Mar 17, 2026
iprithv added 5 commits March 18, 2026 00:59
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
…xedBitSet

Signed-off-by: prithvi <prithvisivasankar@gmail.com>
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
}
return Math.toIntExact(tot);
return Math.toIntExact(
VectorizationProvider.getInstance().getBitSetUtilSupport().popCount(bits, 0, numWords));
Copy link
Copy Markdown
Contributor

@uschindler uschindler Mar 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this call is too expensive, did you benchmark at all? (it calciulates stack trace to figure out if it is allowed to call it).

Basically you need to create a private static final constant in FixedBitset refering to the BitSetUtilSupport instance, initialized when class is loading and the call popcount() here on the static constant.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initialised at class load, thank you.

long.class, VectorShape.forBitSize(PanamaVectorConstants.PREFERRED_VECTOR_BITSIZE));
private static final int VL = LONG_SPECIES.length();

private static final boolean ENABLED;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think we need this because use can enable/disable it with Panama JVM parameter (enable module).

Copy link
Copy Markdown
Contributor Author

@iprithv iprithv Mar 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the ENABLED flag, thank you.

iprithv added 2 commits March 18, 2026 01:31
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
for (; i < loopBound; i += VL) {
acc =
acc.add(
LongVector.fromArray(LONG_SPECIES, a, start + i).lanewise(VectorOperators.BIT_COUNT));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what will this do on my computer without avx512? my concern is that it might go 30x slower since I don't have such instruction

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested with -Dlucene.panama.vectorSize=256 to simulate narrower vectors, but my hardware is ARM (Mac), so this doesn't replicate x86 AVX2 without VPOPCNTDQ. I don't have an AVX2 x86 machine available to measure directly.

I changed to gate the SIMD path on PREFERRED_VECTOR_BITSIZE == 512 as a conservative guard, so the vectorized path only activates when AVX-512 VPOPCNTDQ is available, thank you.

Copy link
Copy Markdown
Contributor

@uschindler uschindler Mar 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with Intel/AMD machines ist that many of them throttle CPU freq when AVX3 is used, which can have side effects on other system, so in Lucene we generally avoid to use AVX3 features at moment.

Maybe the popcnt optimization should only be used for ARM CPUs with the correct checks as done in the provider.

P.S.: As the BitSet proviider only serves popcnt, maybe instead of having the implementation do the 512-bit check it could also be done in the factory call. So when there are no 512 bits available, the factory would return the default implementation. Currently the default implementation is a lambda, I'd prefer to have it as a separate class and both providers return an instance of that class, dependning on CPU settings.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

autovectorizer in hotspot will already take care of all this for us.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Benchmarked this on ARM (which has native NEON vector popcount). The scalar loop (cardinalityScalar, no --add-modules) runs at 1.908 ops/µs at 65K bits. The Panama path runs at 4.167 ops/µs (+118%). If HotSpot's autovectorizer were already using native vector popcount for Long.bitCount, those numbers would be identical.
With this difference I think C2 SuperWord doesn't currently pattern match Long.bitCount intrinsics into vectorized popcount, it appears to bail out on intrinsic nodes in reduction loops.
Would be great if there's a way to verify the JIT output on some other hardware.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the factory-level check, I can move the capability check into PanamaVectorizationProvider so it returns the scalar implementation when not on the right hardware, and remove POPCOUNT_SUPPORTED from the impl

iprithv added 3 commits March 18, 2026 12:03
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
@uschindler
Copy link
Copy Markdown
Contributor

In general we should collect performance measurements for different CPUs, comparing scalar with optimized impl.

Once this is done, we can look into further streamlining the implementation. An alternative would be to add the popcnt on a long[] array to the public VectorUtil class (so its publicly available) and then implement it only in VectorUtilProvider. That's just an idea, so other other BitSet implementations can use it.

@iprithv iprithv closed this Mar 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants