Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

speedup all binary functions on avx256, speedup binary square on avx512 #12681

Merged
merged 10 commits into from
Oct 15, 2023

Conversation

rmuir
Copy link
Member

@rmuir rmuir commented Oct 14, 2023

This builds on #12680 so please review that one first to make it easier. The advantage there is we split out vector kernels into smaller manageable methods, making it easy to specialize the avx-512 cases.

This speeds up all the binary functions on avx-256, without slowing anything down on avx-512 (it explicitly avoids using 512-bit integer multiply). the squareDistance is sped up a bit on avx-512, only because avx-512 users were suffering from downclocking already (already doing 512-bit multiply), but we might as well clean up the other math. 512-bit multiply can't be easily avoided in this function due to the formula.

Old:
Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryCosineVector        1024  thrpt    5   3.544 ± 0.046  ops/us
VectorUtilBenchmark.binaryDotProductVector    1024  thrpt    5   7.268 ± 0.081  ops/us
VectorUtilBenchmark.binarySquareVector        1024  thrpt    5   6.387 ± 0.091  ops/us

New:
Benchmark                                   (size)   Mode  Cnt  Score   Error   Units
VectorUtilBenchmark.binaryCosineVector        1024  thrpt    5  3.704 ± 0.252  ops/us
VectorUtilBenchmark.binaryDotProductVector    1024  thrpt    5  8.556 ± 0.145  ops/us
VectorUtilBenchmark.binarySquareVector        1024  thrpt    5  7.514 ± 0.048  ops/us

We have to think about testing. I don't want to rely upon various hardware for correctness. I think there's a way to alter the code so that we can test the correctness of everything (albeit slowly), e.g. exercise 128, 256, and 512 in our tests.

@rmuir
Copy link
Member Author

rmuir commented Oct 14, 2023

Here's the diff of just the commit for this change: 3ec9c26

@uschindler
Copy link
Contributor

uschindler commented Oct 14, 2023

We have to think about testing. I don't want to rely upon various hardware for correctness. I think there's a way to alter the code so that we can test the correctness of everything (albeit slowly), e.g. exercise 128, 256, and 512 in our tests.

This can be done in the same way like the "testMode" flag, we should just extend it to cover more cases. You could also pass an override for the bit size instead of true/false.

By default it passes the preferred species, in test mode it could pass others.

@rmuir
Copy link
Member Author

rmuir commented Oct 14, 2023

This can be done in the same way like the "testMode" flag, we should just extend it to cover more cases. You could also pass an override for the bit size instead of true/false.

By default it passes the preferred species, in test mode it could pass others.

A lot of this code relies on this stuff being static final constants. For example multiplication by power of 2 on every single vector, but thats just for starts. I don't want to rewrite the entire code for this, just have a plan.

We may have to get more creative here, such as using system property and doing it with "distribution test" passing jvm flags or similar.

I can also use qemu approach which requires no code changes.

@rmuir
Copy link
Member Author

rmuir commented Oct 14, 2023

I tried it out, making species final instead of static final. performance completely falls apart, slower than scalar impl even. it is a non-option... We should keep everything here static final and just test it differently (e.g. more like integration tests that pass different sysprops/jvm flags)

…sprops for testing.

Set those sysprops by the build when running tests: force integer vectors on always, and randomize vector size
@rmuir
Copy link
Member Author

rmuir commented Oct 14, 2023

@uschindler I did the 'fast integer vectors' override differently, and configured the build to randomize the vector size used for testing.

So it still does the same thing it was doing, but now at least it runs with different vector sizes too. Of course they will be slow, but it is for testing code correctness and the test is reasonable.

@rmuir
Copy link
Member Author

rmuir commented Oct 15, 2023

for me, when investigating a modification, this works easily enough:

$ for bits in 128 256 512; do ./gradlew -p lucene/core test --tests TestVectorUtilSupport -Dtests.vectorsize=$bits; done

@rmuir
Copy link
Member Author

rmuir commented Oct 15, 2023

This also makes this test reproducible from random seed regardless of the hardware, as SPECIES_PREFERRED is not used at all in tests. From a test perspective, it is like a forbidden-api.

@uschindler
Copy link
Contributor

uschindler commented Oct 15, 2023

This also makes this test reproducible from random seed regardless of the hardware, as SPECIES_PREFERRED is not used at all in tests. From a test perspective, it is like a forbidden-api.

I am mostly ok, but initially I stumbled on this. I was expecting that all tests get slow. But luckily we run tests without C2 so by default for all test except the one using the testMode, we use the scalar impl.

I'd like to add some extra safety. If the provider class detects that any of those properties are given at command line and at same time it is not in Test mode, it should scream loud in logger and use default impl. The problem is that Jenkins uses different GC and does not force Client mode, so it would take very long as all tests run using fixed bit size.

I have a small change to implement this, are you ok? Basically I would read the system props in the provider code.

*
* <ul>
* <li>tests.vectorsize (int)
* <li>tests.forceintegervectors (boolean)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great. I love the integrated benchmarks, and now this improved testing support. It'll really help the maintainability of this code going forward. 👍

@ChrisHegarty
Copy link
Contributor

for me, when investigating a modification, this works easily enough:

$ for bits in 128 256 512; do ./gradlew -p lucene/core test --tests TestVectorUtilSupport -Dtests.vectorsize=$bits; done

I love it! I'm noodling on something in this area too, and having this will make my life soooooo much better.

@uschindler
Copy link
Contributor

uschindler commented Oct 15, 2023

This also makes this test reproducible from random seed regardless of the hardware, as SPECIES_PREFERRED is not used at all in tests. From a test perspective, it is like a forbidden-api.

I am mostly ok, but initially I stumbled on this. I was expecting that all tests get slow. But luckily we run tests without C2 so by default for all test except the one using the testMode, we use the scalar impl.

I'd like to add some extra safety. If the provider class detects that any of those properties are given at command line and at same time it is not in Test mode, it should scream loud in logger and use default impl. The problem is that Jenkins uses different GC and does not force Client mode, so it would take very long as all tests run using fixed bit size.

I have a small change to implement this, are you ok? Basically I would read the system props in the provider code.

Here is my idea, I can commit it if you agree. It basically moves the system properties to the main VectorizationProvider (as optional). The SecurityException is also handled there.

If the system properties are detected, it complains on logger and uses default provider, unless testMode is enabled. This leads to the following:

  • TestVectorUtilSupport uses the Panama provider
  • All other tests fallback to default provider (they do by default due to client VM, but on jenkins they won't be slow)
 .../vectorization/VectorizationProvider.java       | 45 ++++++++++++++++++++--
 .../vectorization/PanamaVectorUtilSupport.java     | 16 ++------
 2 files changed, 45 insertions(+), 16 deletions(-)

diff --git a/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java b/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java
index 3192af72b2c..d3aac3123ed 100644
--- a/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java
+++ b/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java
@@ -25,8 +25,11 @@ import java.security.AccessController;
 import java.security.PrivilegedAction;
 import java.util.Locale;
 import java.util.Objects;
+import java.util.OptionalInt;
 import java.util.Set;
+import java.util.function.Predicate;
 import java.util.logging.Logger;
+import java.util.stream.Stream;
 import org.apache.lucene.util.SuppressForbidden;
 import org.apache.lucene.util.VectorUtil;
 
@@ -39,6 +42,35 @@ import org.apache.lucene.util.VectorUtil;
  */
 public abstract class VectorizationProvider {
 
+  static final OptionalInt TESTS_VECTOR_SIZE;
+  static final boolean TESTS_FORCE_INTEGER_VECTORS;
+
+  static {
+    var vs = OptionalInt.empty();
+    try {
+      vs =
+          Stream.ofNullable(System.getProperty("tests.vectorsize"))
+              .filter(Predicate.not(String::isEmpty))
+              .mapToInt(Integer::parseInt)
+              .findAny();
+    } catch (
+        @SuppressWarnings("unused")
+        SecurityException se) {
+      // ignored
+    }
+
+    TESTS_VECTOR_SIZE = vs;
+    boolean enforce = false;
+    try {
+      enforce = Boolean.getBoolean("tests.forceintegervectors");
+    } catch (
+        @SuppressWarnings("unused")
+        SecurityException se) {
+      // ignored
+    }
+    TESTS_FORCE_INTEGER_VECTORS = enforce;
+  }
+
   /**
    * Returns the default instance of the provider matching vectorization possibilities of actual
    * runtime.
@@ -87,9 +119,16 @@ public abstract class VectorizationProvider {
             "Java vector incubator module is not readable. For optimal vector performance, pass '--add-modules jdk.incubator.vector' to enable Vector API.");
         return new DefaultVectorizationProvider();
       }
-      if (!testMode && isClientVM()) {
-        LOG.warning("C2 compiler is disabled; Java vector incubator API can't be enabled");
-        return new DefaultVectorizationProvider();
+      if (!testMode) {
+        if (TESTS_VECTOR_SIZE.isPresent() || TESTS_FORCE_INTEGER_VECTORS) {
+          LOG.warning(
+              "Vector bitsize and/or integer vectors enforcement; using default vectorization provider outside of testMode");
+          return new DefaultVectorizationProvider();
+        }
+        if (isClientVM()) {
+          LOG.warning("C2 compiler is disabled; Java vector incubator API can't be enabled");
+          return new DefaultVectorizationProvider();
+        }
       }
       try {
         // we use method handles with lookup, so we do not need to deal with setAccessible as we
diff --git a/lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java b/lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
index dad428845b7..e509f6ef49c 100644
--- a/lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
+++ b/lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
@@ -57,10 +57,7 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
     // default to platform supported bitsize
     int vectorBitSize = VectorShape.preferredShape().vectorBitSize();
     // but allow easy overriding for testing
-    try {
-      vectorBitSize = Integer.getInteger("tests.vectorsize", vectorBitSize);
-    } catch (SecurityException ignored) {
-    }
+    vectorBitSize = VectorizationProvider.TESTS_VECTOR_SIZE.orElse(vectorBitSize);
     INT_SPECIES = VectorSpecies.of(int.class, VectorShape.forBitSize(vectorBitSize));
     VECTOR_BITSIZE = INT_SPECIES.vectorBitSize();
     FLOAT_SPECIES = INT_SPECIES.withLanes(float.class);
@@ -76,15 +73,8 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
     // hotspot misses some SSE intrinsics, workaround it
     // to be fair, they do document this thing only works well with AVX2/AVX3 and Neon
     boolean isAMD64withoutAVX2 = Constants.OS_ARCH.equals("amd64") && VECTOR_BITSIZE < 256;
-    boolean hasFastIntegerVectors = isAMD64withoutAVX2 == false;
-    try {
-      hasFastIntegerVectors =
-          Boolean.parseBoolean(
-              System.getProperty(
-                  "tests.forceintegervectors", Boolean.toString(hasFastIntegerVectors)));
-    } catch (SecurityException ignored) {
-    }
-    HAS_FAST_INTEGER_VECTORS = hasFastIntegerVectors;
+    HAS_FAST_INTEGER_VECTORS =
+        VectorizationProvider.TESTS_FORCE_INTEGER_VECTORS || (isAMD64withoutAVX2 == false);
   }
 
   @Override

EDIT: I removed the bullshit Optional<Boolean>.

@uschindler
Copy link
Contributor

In addition to that, maybe we should change TestVectorUtilSupport to spawn 3 child JVMs and don't set the property top-level?

The problem is currently any of those:

  • If we enable the system properties globally without above patch, Jenkins tests would suddenly get slow (unless it happily use the correct bitsize fitting CPU)
  • If we apply my patch, normal Lucene tests would always use the unoptimized panama code. In that case we could also run jenkins tests without --add-modules

If we run the TestVectorUtilSupport tests in 3 JVMs, the normal lucene tests would run with platform default settings (without overrides) and only the relevant tests checks all bit sizes and enforces integer vectors in separate JVMs. I would prefer this.

@uschindler
Copy link
Contributor

Otherwise this is a great improvement for testing, we just need to fine-tune it! ❤️

@rmuir
Copy link
Member Author

rmuir commented Oct 15, 2023

I don't mind if sysprop code moves, as long as you can ensure no classloading deadlocks

@uschindler
Copy link
Contributor

I don't mind if sysprop code moves, as long as you can ensure no classloading deadlocks

No classloading deadlock possible. We have a Holder for that already:

private static final class Holder {
private Holder() {}
static final VectorizationProvider INSTANCE = lookup(false);
}

@rmuir
Copy link
Member Author

rmuir commented Oct 15, 2023

wow, merge conflicts, really? git is terrible sometimes. literally didn't merge anything that isn't in this branch already.

…icting changes: if you made any, i erased them. git is garbage.
@rmuir
Copy link
Member Author

rmuir commented Oct 15, 2023

I understand the git issue now. Why do we configure the merge button as "squash"? I'd like to default it to "merge" because I value my sanity!

…ossible) and fallback to default provider in non-test mode
Copy link
Contributor

@uschindler uschindler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rmuir I pushed my change. Maybe we should make the test improvements a separate PR. I tend to make a special forking test (like the crsuher test or the LockFactory test).
This issue was more about the 512bit improvements, so a first step this looks fine! Thanks..

If you add a changes entry (as this is an improvement), I am fine with merging.

@uschindler
Copy link
Contributor

I understand the git issue now. Why do we configure the merge button as "squash"? I'd like to default it to "merge" because I value my sanity!

At least we should have the option to select both variants! We had the same issue with @risdenk when we updated Gradle, as we wanted to have the whitespace/formatter changes separately.

Maybe start a discussion on mailing list! Thanks.

@rmuir
Copy link
Member Author

rmuir commented Oct 15, 2023

OK, i just wanted to do "enough" on the tests to make some progress. If we just add _128, _256, _512 variants without addressing the testing, then we create a potential problem.

I will add a CHANGES.txt and hope to avoid more conflicts (it is a conflict nightmare...)

@uschindler
Copy link
Contributor

and hope to avoid more conflicts (it is a conflict nightmare...)

This is why I tend to make one PR after the other. So I don't want to start another test cleanup one before this is merged. 💩

@rmuir
Copy link
Member Author

rmuir commented Oct 15, 2023

At least we should have the option to select both variants!

Yeah thats strange, It doesnt make sense to remove the option from the merge button. I will use "git push" to workaround this for now so I can do proper merge commits, and avoid using this button until it is fixed.

@asfgit asfgit merged commit 2b105f7 into apache:main Oct 15, 2023
4 checks passed
@rmuir
Copy link
Member Author

rmuir commented Oct 15, 2023

ok, all done. "asfgit" can do proper merge commit :) thanks @uschindler !

@uschindler
Copy link
Contributor

Will you backport?

@uschindler
Copy link
Contributor

ok, all done. "asfgit" can do proper merge commit :) thanks @uschindler !

The only downside is that backporting gets harder. So squashed PRs are easier to handle in that regard.

@rmuir
Copy link
Member Author

rmuir commented Oct 15, 2023

I haven't been backporting recent vector changes as they are fairly aggressive. That's just my thoughts. If we want to do that, we have to do it proper and backport test, benchmarks, logging improvements too. Is lucene 10 really so far away?

@rmuir
Copy link
Member Author

rmuir commented Oct 15, 2023

ok, all done. "asfgit" can do proper merge commit :) thanks @uschindler !

The only downside is that backporting gets harder. So squashed PRs are easier to handle in that regard.

Not if u use an explicit merge commit. You can backport entire PRs that way with a single git cherry-pick -m. Your problem is fast forwards. I don't want those, I argue for merge commit.

@uschindler
Copy link
Contributor

ok, all done. "asfgit" can do proper merge commit :) thanks @uschindler !

The only downside is that backporting gets harder. So squashed PRs are easier to handle in that regard.

Not if u use an explicit merge commit. You can backport entire PRs that way with a single git cherry-pick -m. Your problem is fast forwards. I don't want those, I argue for merge commit.

Actually the merge done by asfgit was a fast-forward. So backporting is only possible by a compare of branches.

@uschindler
Copy link
Contributor

I can take care of backporting, also your previous commits. For maintenenace it would be good as the code differs dramatically now between main and 9.x and makes merging hard.

@uschindler uschindler added this to the 9.9.0 milestone Oct 15, 2023
@uschindler uschindler self-assigned this Oct 15, 2023
asfgit pushed a commit that referenced this pull request Oct 15, 2023
…12 (#12681)

- Move testing properties to provider class (no classloading deadlock possible) and fallback to default provider in non-test mode
- fix html
- tidy
- fix Panama provider to only have static final constants, but allow sysprops for testing.
- Set those sysprops by the build when running tests: force integer vectors on always, and randomize vector size
- stop using SPECIES_PREFERRED except at the top of this file
- speedup all binary functions on avx256, speedup binary square on avx512
- cleanup cosine too, no perf impact
- simple cleanups to vector code
@uschindler
Copy link
Contributor

I backported this one to 9.x. For that I selected the separate commits, cherrypicked and squashed them with my tortoise GUI (sorry!).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants