speedup all binary functions on avx256, speedup binary square on avx512 #12681

rmuir · 2023-10-14T20:34:49Z

This builds on #12680 so please review that one first to make it easier. The advantage there is we split out vector kernels into smaller manageable methods, making it easy to specialize the avx-512 cases.

This speeds up all the binary functions on avx-256, without slowing anything down on avx-512 (it explicitly avoids using 512-bit integer multiply). the squareDistance is sped up a bit on avx-512, only because avx-512 users were suffering from downclocking already (already doing 512-bit multiply), but we might as well clean up the other math. 512-bit multiply can't be easily avoided in this function due to the formula.

Old:
Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryCosineVector        1024  thrpt    5   3.544 ± 0.046  ops/us
VectorUtilBenchmark.binaryDotProductVector    1024  thrpt    5   7.268 ± 0.081  ops/us
VectorUtilBenchmark.binarySquareVector        1024  thrpt    5   6.387 ± 0.091  ops/us

New:
Benchmark                                   (size)   Mode  Cnt  Score   Error   Units
VectorUtilBenchmark.binaryCosineVector        1024  thrpt    5  3.704 ± 0.252  ops/us
VectorUtilBenchmark.binaryDotProductVector    1024  thrpt    5  8.556 ± 0.145  ops/us
VectorUtilBenchmark.binarySquareVector        1024  thrpt    5  7.514 ± 0.048  ops/us

We have to think about testing. I don't want to rely upon various hardware for correctness. I think there's a way to alter the code so that we can test the correctness of everything (albeit slowly), e.g. exercise 128, 256, and 512 in our tests.

rmuir · 2023-10-14T20:36:06Z

Here's the diff of just the commit for this change: 3ec9c26

uschindler · 2023-10-14T21:07:42Z

We have to think about testing. I don't want to rely upon various hardware for correctness. I think there's a way to alter the code so that we can test the correctness of everything (albeit slowly), e.g. exercise 128, 256, and 512 in our tests.

This can be done in the same way like the "testMode" flag, we should just extend it to cover more cases. You could also pass an override for the bit size instead of true/false.

By default it passes the preferred species, in test mode it could pass others.

rmuir · 2023-10-14T21:52:53Z

This can be done in the same way like the "testMode" flag, we should just extend it to cover more cases. You could also pass an override for the bit size instead of true/false.

By default it passes the preferred species, in test mode it could pass others.

A lot of this code relies on this stuff being static final constants. For example multiplication by power of 2 on every single vector, but thats just for starts. I don't want to rewrite the entire code for this, just have a plan.

We may have to get more creative here, such as using system property and doing it with "distribution test" passing jvm flags or similar.

I can also use qemu approach which requires no code changes.

rmuir · 2023-10-14T22:43:50Z

I tried it out, making species final instead of static final. performance completely falls apart, slower than scalar impl even. it is a non-option... We should keep everything here static final and just test it differently (e.g. more like integration tests that pass different sysprops/jvm flags)

…sprops for testing. Set those sysprops by the build when running tests: force integer vectors on always, and randomize vector size

rmuir · 2023-10-14T23:54:27Z

@uschindler I did the 'fast integer vectors' override differently, and configured the build to randomize the vector size used for testing.

So it still does the same thing it was doing, but now at least it runs with different vector sizes too. Of course they will be slow, but it is for testing code correctness and the test is reasonable.

rmuir · 2023-10-15T00:20:45Z

for me, when investigating a modification, this works easily enough:

$ for bits in 128 256 512; do ./gradlew -p lucene/core test --tests TestVectorUtilSupport -Dtests.vectorsize=$bits; done

rmuir · 2023-10-15T00:36:33Z

This also makes this test reproducible from random seed regardless of the hardware, as SPECIES_PREFERRED is not used at all in tests. From a test perspective, it is like a forbidden-api.

uschindler · 2023-10-15T09:18:54Z

This also makes this test reproducible from random seed regardless of the hardware, as SPECIES_PREFERRED is not used at all in tests. From a test perspective, it is like a forbidden-api.

I am mostly ok, but initially I stumbled on this. I was expecting that all tests get slow. But luckily we run tests without C2 so by default for all test except the one using the testMode, we use the scalar impl.

I'd like to add some extra safety. If the provider class detects that any of those properties are given at command line and at same time it is not in Test mode, it should scream loud in logger and use default impl. The problem is that Jenkins uses different GC and does not force Client mode, so it would take very long as all tests run using fixed bit size.

I have a small change to implement this, are you ok? Basically I would read the system props in the provider code.

ChrisHegarty · 2023-10-15T11:23:06Z

lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java

+ *
+ * <ul>
+ *   <li>tests.vectorsize (int)
+ *   <li>tests.forceintegervectors (boolean)


This is great. I love the integrated benchmarks, and now this improved testing support. It'll really help the maintainability of this code going forward. 👍

ChrisHegarty · 2023-10-15T11:25:48Z

for me, when investigating a modification, this works easily enough:
$ for bits in 128 256 512; do ./gradlew -p lucene/core test --tests TestVectorUtilSupport -Dtests.vectorsize=$bits; done

I love it! I'm noodling on something in this area too, and having this will make my life soooooo much better.

uschindler · 2023-10-15T11:32:36Z

This also makes this test reproducible from random seed regardless of the hardware, as SPECIES_PREFERRED is not used at all in tests. From a test perspective, it is like a forbidden-api.

I am mostly ok, but initially I stumbled on this. I was expecting that all tests get slow. But luckily we run tests without C2 so by default for all test except the one using the testMode, we use the scalar impl.

I'd like to add some extra safety. If the provider class detects that any of those properties are given at command line and at same time it is not in Test mode, it should scream loud in logger and use default impl. The problem is that Jenkins uses different GC and does not force Client mode, so it would take very long as all tests run using fixed bit size.

I have a small change to implement this, are you ok? Basically I would read the system props in the provider code.

Here is my idea, I can commit it if you agree. It basically moves the system properties to the main VectorizationProvider (as optional). The SecurityException is also handled there.

If the system properties are detected, it complains on logger and uses default provider, unless testMode is enabled. This leads to the following:

TestVectorUtilSupport uses the Panama provider
All other tests fallback to default provider (they do by default due to client VM, but on jenkins they won't be slow)

 .../vectorization/VectorizationProvider.java       | 45 ++++++++++++++++++++--
 .../vectorization/PanamaVectorUtilSupport.java     | 16 ++------
 2 files changed, 45 insertions(+), 16 deletions(-)

diff --git a/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java b/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java
index 3192af72b2c..d3aac3123ed 100644
--- a/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java
+++ b/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java
@@ -25,8 +25,11 @@ import java.security.AccessController;
 import java.security.PrivilegedAction;
 import java.util.Locale;
 import java.util.Objects;
+import java.util.OptionalInt;
 import java.util.Set;
+import java.util.function.Predicate;
 import java.util.logging.Logger;
+import java.util.stream.Stream;
 import org.apache.lucene.util.SuppressForbidden;
 import org.apache.lucene.util.VectorUtil;
 
@@ -39,6 +42,35 @@ import org.apache.lucene.util.VectorUtil;
  */
 public abstract class VectorizationProvider {
 
+  static final OptionalInt TESTS_VECTOR_SIZE;
+  static final boolean TESTS_FORCE_INTEGER_VECTORS;
+
+  static {
+    var vs = OptionalInt.empty();
+    try {
+      vs =
+          Stream.ofNullable(System.getProperty("tests.vectorsize"))
+              .filter(Predicate.not(String::isEmpty))
+              .mapToInt(Integer::parseInt)
+              .findAny();
+    } catch (
+        @SuppressWarnings("unused")
+        SecurityException se) {
+      // ignored
+    }
+
+    TESTS_VECTOR_SIZE = vs;
+    boolean enforce = false;
+    try {
+      enforce = Boolean.getBoolean("tests.forceintegervectors");
+    } catch (
+        @SuppressWarnings("unused")
+        SecurityException se) {
+      // ignored
+    }
+    TESTS_FORCE_INTEGER_VECTORS = enforce;
+  }
+
   /**
    * Returns the default instance of the provider matching vectorization possibilities of actual
    * runtime.
@@ -87,9 +119,16 @@ public abstract class VectorizationProvider {
             "Java vector incubator module is not readable. For optimal vector performance, pass '--add-modules jdk.incubator.vector' to enable Vector API.");
         return new DefaultVectorizationProvider();
       }
-      if (!testMode && isClientVM()) {
-        LOG.warning("C2 compiler is disabled; Java vector incubator API can't be enabled");
-        return new DefaultVectorizationProvider();
+      if (!testMode) {
+        if (TESTS_VECTOR_SIZE.isPresent() || TESTS_FORCE_INTEGER_VECTORS) {
+          LOG.warning(
+              "Vector bitsize and/or integer vectors enforcement; using default vectorization provider outside of testMode");
+          return new DefaultVectorizationProvider();
+        }
+        if (isClientVM()) {
+          LOG.warning("C2 compiler is disabled; Java vector incubator API can't be enabled");
+          return new DefaultVectorizationProvider();
+        }
       }
       try {
         // we use method handles with lookup, so we do not need to deal with setAccessible as we
diff --git a/lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java b/lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
index dad428845b7..e509f6ef49c 100644
--- a/lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
+++ b/lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
@@ -57,10 +57,7 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
     // default to platform supported bitsize
     int vectorBitSize = VectorShape.preferredShape().vectorBitSize();
     // but allow easy overriding for testing
-    try {
-      vectorBitSize = Integer.getInteger("tests.vectorsize", vectorBitSize);
-    } catch (SecurityException ignored) {
-    }
+    vectorBitSize = VectorizationProvider.TESTS_VECTOR_SIZE.orElse(vectorBitSize);
     INT_SPECIES = VectorSpecies.of(int.class, VectorShape.forBitSize(vectorBitSize));
     VECTOR_BITSIZE = INT_SPECIES.vectorBitSize();
     FLOAT_SPECIES = INT_SPECIES.withLanes(float.class);
@@ -76,15 +73,8 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
     // hotspot misses some SSE intrinsics, workaround it
     // to be fair, they do document this thing only works well with AVX2/AVX3 and Neon
     boolean isAMD64withoutAVX2 = Constants.OS_ARCH.equals("amd64") && VECTOR_BITSIZE < 256;
-    boolean hasFastIntegerVectors = isAMD64withoutAVX2 == false;
-    try {
-      hasFastIntegerVectors =
-          Boolean.parseBoolean(
-              System.getProperty(
-                  "tests.forceintegervectors", Boolean.toString(hasFastIntegerVectors)));
-    } catch (SecurityException ignored) {
-    }
-    HAS_FAST_INTEGER_VECTORS = hasFastIntegerVectors;
+    HAS_FAST_INTEGER_VECTORS =
+        VectorizationProvider.TESTS_FORCE_INTEGER_VECTORS || (isAMD64withoutAVX2 == false);
   }
 
   @Override

EDIT: I removed the bullshit Optional<Boolean>.

uschindler · 2023-10-15T12:14:03Z

In addition to that, maybe we should change TestVectorUtilSupport to spawn 3 child JVMs and don't set the property top-level?

The problem is currently any of those:

If we enable the system properties globally without above patch, Jenkins tests would suddenly get slow (unless it happily use the correct bitsize fitting CPU)
If we apply my patch, normal Lucene tests would always use the unoptimized panama code. In that case we could also run jenkins tests without --add-modules

If we run the TestVectorUtilSupport tests in 3 JVMs, the normal lucene tests would run with platform default settings (without overrides) and only the relevant tests checks all bit sizes and enforces integer vectors in separate JVMs. I would prefer this.

uschindler · 2023-10-15T12:16:37Z

Otherwise this is a great improvement for testing, we just need to fine-tune it! ❤️

rmuir · 2023-10-15T12:17:13Z

I don't mind if sysprop code moves, as long as you can ensure no classloading deadlocks

uschindler · 2023-10-15T12:19:03Z

I don't mind if sysprop code moves, as long as you can ensure no classloading deadlocks

No classloading deadlock possible. We have a Holder for that already:

lucene/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java

Lines 189 to 193 in db746bd

    
           private static final class Holder { 
        
             private Holder() {} 
        
             static final VectorizationProvider INSTANCE = lookup(false); 
        
           }

rmuir · 2023-10-15T12:19:22Z

wow, merge conflicts, really? git is terrible sometimes. literally didn't merge anything that isn't in this branch already.

…icting changes: if you made any, i erased them. git is garbage.

rmuir · 2023-10-15T12:25:16Z

I understand the git issue now. Why do we configure the merge button as "squash"? I'd like to default it to "merge" because I value my sanity!

…ossible) and fallback to default provider in non-test mode

uschindler

@rmuir I pushed my change. Maybe we should make the test improvements a separate PR. I tend to make a special forking test (like the crsuher test or the LockFactory test).
This issue was more about the 512bit improvements, so a first step this looks fine! Thanks..

If you add a changes entry (as this is an improvement), I am fine with merging.

uschindler · 2023-10-15T12:40:21Z

I understand the git issue now. Why do we configure the merge button as "squash"? I'd like to default it to "merge" because I value my sanity!

At least we should have the option to select both variants! We had the same issue with @risdenk when we updated Gradle, as we wanted to have the whitespace/formatter changes separately.

Maybe start a discussion on mailing list! Thanks.

rmuir · 2023-10-15T12:42:46Z

OK, i just wanted to do "enough" on the tests to make some progress. If we just add _128, _256, _512 variants without addressing the testing, then we create a potential problem.

I will add a CHANGES.txt and hope to avoid more conflicts (it is a conflict nightmare...)

uschindler · 2023-10-15T12:44:13Z

and hope to avoid more conflicts (it is a conflict nightmare...)

This is why I tend to make one PR after the other. So I don't want to start another test cleanup one before this is merged. 💩

rmuir · 2023-10-15T12:45:25Z

At least we should have the option to select both variants!

Yeah thats strange, It doesnt make sense to remove the option from the merge button. I will use "git push" to workaround this for now so I can do proper merge commits, and avoid using this button until it is fixed.

rmuir · 2023-10-15T12:50:41Z

ok, all done. "asfgit" can do proper merge commit :) thanks @uschindler !

uschindler · 2023-10-15T13:14:46Z

Will you backport?

uschindler · 2023-10-15T14:05:42Z

ok, all done. "asfgit" can do proper merge commit :) thanks @uschindler !

The only downside is that backporting gets harder. So squashed PRs are easier to handle in that regard.

rmuir · 2023-10-15T14:13:27Z

I haven't been backporting recent vector changes as they are fairly aggressive. That's just my thoughts. If we want to do that, we have to do it proper and backport test, benchmarks, logging improvements too. Is lucene 10 really so far away?

rmuir · 2023-10-15T14:15:17Z

ok, all done. "asfgit" can do proper merge commit :) thanks @uschindler !

The only downside is that backporting gets harder. So squashed PRs are easier to handle in that regard.

Not if u use an explicit merge commit. You can backport entire PRs that way with a single git cherry-pick -m. Your problem is fast forwards. I don't want those, I argue for merge commit.

uschindler · 2023-10-15T15:37:38Z

ok, all done. "asfgit" can do proper merge commit :) thanks @uschindler !

The only downside is that backporting gets harder. So squashed PRs are easier to handle in that regard.

Not if u use an explicit merge commit. You can backport entire PRs that way with a single git cherry-pick -m. Your problem is fast forwards. I don't want those, I argue for merge commit.

Actually the merge done by asfgit was a fast-forward. So backporting is only possible by a compare of branches.

uschindler · 2023-10-15T17:27:50Z

I can take care of backporting, also your previous commits. For maintenenace it would be good as the code differs dramatically now between main and 9.x and makes merging hard.

…12 (#12681) - Move testing properties to provider class (no classloading deadlock possible) and fallback to default provider in non-test mode - fix html - tidy - fix Panama provider to only have static final constants, but allow sysprops for testing. - Set those sysprops by the build when running tests: force integer vectors on always, and randomize vector size - stop using SPECIES_PREFERRED except at the top of this file - speedup all binary functions on avx256, speedup binary square on avx512 - cleanup cosine too, no perf impact - simple cleanups to vector code

uschindler · 2023-10-15T18:04:03Z

I backported this one to 9.x. For that I selected the separate commits, cherrypicked and squashed them with my tortoise GUI (sorry!).

rmuir added 3 commits October 14, 2023 13:08

simple cleanups to vector code

a72bf7c

cleanup cosine too, no perf impact

e93bd52

speedup all binary functions on avx256, speedup binary square on avx512

3ec9c26

rmuir requested review from uschindler and ChrisHegarty October 14, 2023 20:34

stop using SPECIES_PREFERRED except at the top of this file

cce3cf1

fix Panama provider to only have static final constants, but allow sy…

94619e0

…sprops for testing. Set those sysprops by the build when running tests: force integer vectors on always, and randomize vector size

rmuir added 2 commits October 14, 2023 19:54

tidy

90c01c1

fix html

a69cdef

ChrisHegarty reviewed Oct 15, 2023

View reviewed changes

Merge branch 'main' into vector_512: stomping over any possible confl…

650a47e

…icting changes: if you made any, i erased them. git is garbage.

Move testing properties to provider class (no classloading deadlock p…

c8852a0

…ossible) and fallback to default provider in non-test mode

uschindler approved these changes Oct 15, 2023

View reviewed changes

CHANGES

2b105f7

asfgit merged commit 2b105f7 into apache:main Oct 15, 2023
4 checks passed

uschindler added this to the 9.9.0 milestone Oct 15, 2023

uschindler self-assigned this Oct 15, 2023

uschindler added the type:enhancement label Oct 15, 2023

uschindler mentioned this pull request Oct 15, 2023

simple cleanups to vector code #12680

Merged

reta mentioned this pull request Oct 18, 2023

[SIMD] [POC] Improve performance of rounding dates in date_histogram aggregation opensearch-project/OpenSearch#10392

Closed

rmuir mentioned this pull request Oct 20, 2023

Speedup integer functions for 128-bit neon vectors #12632

Merged

uschindler mentioned this pull request May 6, 2024

How to run tests with the Panama Vector implementation #13344

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

speedup all binary functions on avx256, speedup binary square on avx512 #12681

speedup all binary functions on avx256, speedup binary square on avx512 #12681

rmuir commented Oct 14, 2023

rmuir commented Oct 14, 2023

uschindler commented Oct 14, 2023 •

edited

Loading

rmuir commented Oct 14, 2023

rmuir commented Oct 14, 2023

rmuir commented Oct 14, 2023

rmuir commented Oct 15, 2023

rmuir commented Oct 15, 2023

uschindler commented Oct 15, 2023 •

edited

Loading

ChrisHegarty Oct 15, 2023

ChrisHegarty commented Oct 15, 2023

uschindler commented Oct 15, 2023 •

edited

Loading

uschindler commented Oct 15, 2023

uschindler commented Oct 15, 2023

rmuir commented Oct 15, 2023

uschindler commented Oct 15, 2023

rmuir commented Oct 15, 2023

rmuir commented Oct 15, 2023

uschindler left a comment

uschindler commented Oct 15, 2023

rmuir commented Oct 15, 2023

uschindler commented Oct 15, 2023

rmuir commented Oct 15, 2023

rmuir commented Oct 15, 2023

uschindler commented Oct 15, 2023

uschindler commented Oct 15, 2023

rmuir commented Oct 15, 2023

rmuir commented Oct 15, 2023

uschindler commented Oct 15, 2023

uschindler commented Oct 15, 2023

uschindler commented Oct 15, 2023

speedup all binary functions on avx256, speedup binary square on avx512 #12681

speedup all binary functions on avx256, speedup binary square on avx512 #12681

Conversation

rmuir commented Oct 14, 2023

rmuir commented Oct 14, 2023

uschindler commented Oct 14, 2023 • edited Loading

rmuir commented Oct 14, 2023

rmuir commented Oct 14, 2023

rmuir commented Oct 14, 2023

rmuir commented Oct 15, 2023

rmuir commented Oct 15, 2023

uschindler commented Oct 15, 2023 • edited Loading

ChrisHegarty Oct 15, 2023

Choose a reason for hiding this comment

ChrisHegarty commented Oct 15, 2023

uschindler commented Oct 15, 2023 • edited Loading

uschindler commented Oct 15, 2023

uschindler commented Oct 15, 2023

rmuir commented Oct 15, 2023

uschindler commented Oct 15, 2023

rmuir commented Oct 15, 2023

rmuir commented Oct 15, 2023

uschindler left a comment

Choose a reason for hiding this comment

uschindler commented Oct 15, 2023

rmuir commented Oct 15, 2023

uschindler commented Oct 15, 2023

rmuir commented Oct 15, 2023

rmuir commented Oct 15, 2023

uschindler commented Oct 15, 2023

uschindler commented Oct 15, 2023

rmuir commented Oct 15, 2023

rmuir commented Oct 15, 2023

uschindler commented Oct 15, 2023

uschindler commented Oct 15, 2023

uschindler commented Oct 15, 2023

uschindler commented Oct 14, 2023 •

edited

Loading

uschindler commented Oct 15, 2023 •

edited

Loading

uschindler commented Oct 15, 2023 •

edited

Loading