Vectorize FixedBitSet.cardinality() via VectorizationProvider (Java 25 MRJAR)#15832
Vectorize FixedBitSet.cardinality() via VectorizationProvider (Java 25 MRJAR)#15832iprithv wants to merge 14 commits into
Conversation
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
There was a problem hiding this comment.
This is wrongly implemented. The classes should not be loaded directly but using VectorizationProvider interfaces where an instance of the interface implemented in the Java25 folder is handled. This replicates some already existing code and also does not do the correct checks (if all is sane with JVM).
This can't be merged until changed to behave in line with how the other vector stuff is handled with the VectorizationProvider factory. Hook it in there and move to the internal code package containing that provider, guarded with correct checks and then we can look into this again.
intersectionCount/unionCount/andNotCount are not used anywhere in the lucene codebase. So you can drop any optimization for them. I don't think cardinality is appropriate to optimize in this way either: for most sparse cases FixedBitSet is not even used. Instead different implementation (SparseFixedBitset/SparseLiveDocs/etc) geared at sparse will be used... These optimizations are expensive to maintain and require a lot of work for correctness. We should have a clear search use-case in mind when attempting to optimize any functions. For example, the dot-product was optimized because it is a clear hotspot for vector search, and because java's floating point rules don't allow the autovectorizer to do it. |
|
for any search regressions resulting from the change, it must be from cardinality, which is the only one used in the codebase. Problem is likely use of SPECIES_PREFERRED (avx512), this is just a guess. It might look 2x better in isolation but cause slowdowns due to downclocking. Personally I think autovectorizer is a safer approach. |
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
…xedBitSet Signed-off-by: prithvi <prithvisivasankar@gmail.com>
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
| } | ||
| return Math.toIntExact(tot); | ||
| return Math.toIntExact( | ||
| VectorizationProvider.getInstance().getBitSetUtilSupport().popCount(bits, 0, numWords)); |
There was a problem hiding this comment.
this call is too expensive, did you benchmark at all? (it calciulates stack trace to figure out if it is allowed to call it).
Basically you need to create a private static final constant in FixedBitset refering to the BitSetUtilSupport instance, initialized when class is loading and the call popcount() here on the static constant.
There was a problem hiding this comment.
initialised at class load, thank you.
| long.class, VectorShape.forBitSize(PanamaVectorConstants.PREFERRED_VECTOR_BITSIZE)); | ||
| private static final int VL = LONG_SPECIES.length(); | ||
|
|
||
| private static final boolean ENABLED; |
There was a problem hiding this comment.
i don't think we need this because use can enable/disable it with Panama JVM parameter (enable module).
There was a problem hiding this comment.
removed the ENABLED flag, thank you.
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
| for (; i < loopBound; i += VL) { | ||
| acc = | ||
| acc.add( | ||
| LongVector.fromArray(LONG_SPECIES, a, start + i).lanewise(VectorOperators.BIT_COUNT)); |
There was a problem hiding this comment.
what will this do on my computer without avx512? my concern is that it might go 30x slower since I don't have such instruction
There was a problem hiding this comment.
I tested with -Dlucene.panama.vectorSize=256 to simulate narrower vectors, but my hardware is ARM (Mac), so this doesn't replicate x86 AVX2 without VPOPCNTDQ. I don't have an AVX2 x86 machine available to measure directly.
I changed to gate the SIMD path on PREFERRED_VECTOR_BITSIZE == 512 as a conservative guard, so the vectorized path only activates when AVX-512 VPOPCNTDQ is available, thank you.
There was a problem hiding this comment.
The problem with Intel/AMD machines ist that many of them throttle CPU freq when AVX3 is used, which can have side effects on other system, so in Lucene we generally avoid to use AVX3 features at moment.
Maybe the popcnt optimization should only be used for ARM CPUs with the correct checks as done in the provider.
P.S.: As the BitSet proviider only serves popcnt, maybe instead of having the implementation do the 512-bit check it could also be done in the factory call. So when there are no 512 bits available, the factory would return the default implementation. Currently the default implementation is a lambda, I'd prefer to have it as a separate class and both providers return an instance of that class, dependning on CPU settings.
There was a problem hiding this comment.
autovectorizer in hotspot will already take care of all this for us.
There was a problem hiding this comment.
Benchmarked this on ARM (which has native NEON vector popcount). The scalar loop (cardinalityScalar, no --add-modules) runs at 1.908 ops/µs at 65K bits. The Panama path runs at 4.167 ops/µs (+118%). If HotSpot's autovectorizer were already using native vector popcount for Long.bitCount, those numbers would be identical.
With this difference I think C2 SuperWord doesn't currently pattern match Long.bitCount intrinsics into vectorized popcount, it appears to bail out on intrinsic nodes in reduction loops.
Would be great if there's a way to verify the JIT output on some other hardware.
There was a problem hiding this comment.
For the factory-level check, I can move the capability check into PanamaVectorizationProvider so it returns the scalar implementation when not on the right hardware, and remove POPCOUNT_SUPPORTED from the impl
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
Signed-off-by: prithvi <prithvisivasankar@gmail.com>
|
In general we should collect performance measurements for different CPUs, comparing scalar with optimized impl. Once this is done, we can look into further streamlining the implementation. An alternative would be to add the popcnt on a long[] array to the public VectorUtil class (so its publicly available) and then implement it only in VectorUtilProvider. That's just an idea, so other other BitSet implementations can use it. |
Summary
This change adds SIMD acceleration for
FixedBitSet.cardinality()using the Java Vector API,integrated through Lucene's existing
VectorizationProviderdispatch framework.Implementation
org.apache.lucene.internal.vectorization.BitSetUtilSupportwith a single method
popCount(long[], int, int).PanamaBitSetUtilSupportinlucene/core/src/java25/,using
LongVectorwithVectorOperators.BIT_COUNTlane operation andSPECIES_PREFERREDvector size.VectorizationProvideralongsidegetVectorUtilSupport()andgetPostingDecodingUtil().PanamaBitSetUtilSupport, threshold(64 longs = 4096 bits), system property gate, and scalar fallback.
FixedBitSetis a pure unconditional delegate with no awareness of thresholds or properties.
-Dlucene.useVectorizedBitSetOps=true(defaultfalse,opt-in only).
--add-modules jdk.incubator.vectoris notresolved,
DefaultVectorizationProvideris used andpopCount()is a plain scalarLong.bitCountloop — no behavioral change.Test plan
./gradlew :lucene:core:test --tests org.apache.lucene.util.TestFixedBitSet ./gradlew :lucene:core:test --tests org.apache.lucene.util.TestFixedBitSet \ -Dlucene.useVectorizedBitSetOps=true ./gradlew :lucene:core:test \ --tests org.apache.lucene.internal.vectorization.TestVectorizationProviderBenchmarks
./gradlew :lucene:benchmark-jmh:assemble
java -jar lucene/benchmark-jmh/build/benchmarks/lucene-benchmark-jmh-11.0.0-SNAPSHOT.jar
FixedBitSetBenchmark
-p numBits=512,1024,4096,65536,1048576 -p density=0.10 -wi 3 -i 5 -f 1
Results : (ops/µs, -wi 3 -i 5 -f 1, density=0.10).
cardinalityScalar uses no
--add-modules; cardinalityVector forks with--add-modules=jdk.incubator.vector.Below the 64-long (4096-bit) threshold both methods run the same scalar loop.