-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
speedup all binary functions on avx256, speedup binary square on avx512 #12681
Conversation
Here's the diff of just the commit for this change: 3ec9c26 |
This can be done in the same way like the "testMode" flag, we should just extend it to cover more cases. You could also pass an override for the bit size instead of true/false. By default it passes the preferred species, in test mode it could pass others. |
A lot of this code relies on this stuff being We may have to get more creative here, such as using system property and doing it with "distribution test" passing jvm flags or similar. I can also use qemu approach which requires no code changes. |
I tried it out, making species |
…sprops for testing. Set those sysprops by the build when running tests: force integer vectors on always, and randomize vector size
@uschindler I did the 'fast integer vectors' override differently, and configured the build to randomize the vector size used for testing. So it still does the same thing it was doing, but now at least it runs with different vector sizes too. Of course they will be slow, but it is for testing code correctness and the test is reasonable. |
for me, when investigating a modification, this works easily enough: $ for bits in 128 256 512; do ./gradlew -p lucene/core test --tests TestVectorUtilSupport -Dtests.vectorsize=$bits; done |
This also makes this test reproducible from random seed regardless of the hardware, as |
I am mostly ok, but initially I stumbled on this. I was expecting that all tests get slow. But luckily we run tests without C2 so by default for all test except the one using the testMode, we use the scalar impl. I'd like to add some extra safety. If the provider class detects that any of those properties are given at command line and at same time it is not in Test mode, it should scream loud in logger and use default impl. The problem is that Jenkins uses different GC and does not force Client mode, so it would take very long as all tests run using fixed bit size. I have a small change to implement this, are you ok? Basically I would read the system props in the provider code. |
* | ||
* <ul> | ||
* <li>tests.vectorsize (int) | ||
* <li>tests.forceintegervectors (boolean) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great. I love the integrated benchmarks, and now this improved testing support. It'll really help the maintainability of this code going forward. 👍
I love it! I'm noodling on something in this area too, and having this will make my life soooooo much better. |
Here is my idea, I can commit it if you agree. It basically moves the system properties to the main If the system properties are detected, it complains on logger and uses default provider, unless testMode is enabled. This leads to the following:
.../vectorization/VectorizationProvider.java | 45 ++++++++++++++++++++--
.../vectorization/PanamaVectorUtilSupport.java | 16 ++------
2 files changed, 45 insertions(+), 16 deletions(-)
diff --git a/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java b/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java
index 3192af72b2c..d3aac3123ed 100644
--- a/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java
+++ b/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java
@@ -25,8 +25,11 @@ import java.security.AccessController;
import java.security.PrivilegedAction;
import java.util.Locale;
import java.util.Objects;
+import java.util.OptionalInt;
import java.util.Set;
+import java.util.function.Predicate;
import java.util.logging.Logger;
+import java.util.stream.Stream;
import org.apache.lucene.util.SuppressForbidden;
import org.apache.lucene.util.VectorUtil;
@@ -39,6 +42,35 @@ import org.apache.lucene.util.VectorUtil;
*/
public abstract class VectorizationProvider {
+ static final OptionalInt TESTS_VECTOR_SIZE;
+ static final boolean TESTS_FORCE_INTEGER_VECTORS;
+
+ static {
+ var vs = OptionalInt.empty();
+ try {
+ vs =
+ Stream.ofNullable(System.getProperty("tests.vectorsize"))
+ .filter(Predicate.not(String::isEmpty))
+ .mapToInt(Integer::parseInt)
+ .findAny();
+ } catch (
+ @SuppressWarnings("unused")
+ SecurityException se) {
+ // ignored
+ }
+
+ TESTS_VECTOR_SIZE = vs;
+ boolean enforce = false;
+ try {
+ enforce = Boolean.getBoolean("tests.forceintegervectors");
+ } catch (
+ @SuppressWarnings("unused")
+ SecurityException se) {
+ // ignored
+ }
+ TESTS_FORCE_INTEGER_VECTORS = enforce;
+ }
+
/**
* Returns the default instance of the provider matching vectorization possibilities of actual
* runtime.
@@ -87,9 +119,16 @@ public abstract class VectorizationProvider {
"Java vector incubator module is not readable. For optimal vector performance, pass '--add-modules jdk.incubator.vector' to enable Vector API.");
return new DefaultVectorizationProvider();
}
- if (!testMode && isClientVM()) {
- LOG.warning("C2 compiler is disabled; Java vector incubator API can't be enabled");
- return new DefaultVectorizationProvider();
+ if (!testMode) {
+ if (TESTS_VECTOR_SIZE.isPresent() || TESTS_FORCE_INTEGER_VECTORS) {
+ LOG.warning(
+ "Vector bitsize and/or integer vectors enforcement; using default vectorization provider outside of testMode");
+ return new DefaultVectorizationProvider();
+ }
+ if (isClientVM()) {
+ LOG.warning("C2 compiler is disabled; Java vector incubator API can't be enabled");
+ return new DefaultVectorizationProvider();
+ }
}
try {
// we use method handles with lookup, so we do not need to deal with setAccessible as we
diff --git a/lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java b/lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
index dad428845b7..e509f6ef49c 100644
--- a/lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
+++ b/lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
@@ -57,10 +57,7 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
// default to platform supported bitsize
int vectorBitSize = VectorShape.preferredShape().vectorBitSize();
// but allow easy overriding for testing
- try {
- vectorBitSize = Integer.getInteger("tests.vectorsize", vectorBitSize);
- } catch (SecurityException ignored) {
- }
+ vectorBitSize = VectorizationProvider.TESTS_VECTOR_SIZE.orElse(vectorBitSize);
INT_SPECIES = VectorSpecies.of(int.class, VectorShape.forBitSize(vectorBitSize));
VECTOR_BITSIZE = INT_SPECIES.vectorBitSize();
FLOAT_SPECIES = INT_SPECIES.withLanes(float.class);
@@ -76,15 +73,8 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
// hotspot misses some SSE intrinsics, workaround it
// to be fair, they do document this thing only works well with AVX2/AVX3 and Neon
boolean isAMD64withoutAVX2 = Constants.OS_ARCH.equals("amd64") && VECTOR_BITSIZE < 256;
- boolean hasFastIntegerVectors = isAMD64withoutAVX2 == false;
- try {
- hasFastIntegerVectors =
- Boolean.parseBoolean(
- System.getProperty(
- "tests.forceintegervectors", Boolean.toString(hasFastIntegerVectors)));
- } catch (SecurityException ignored) {
- }
- HAS_FAST_INTEGER_VECTORS = hasFastIntegerVectors;
+ HAS_FAST_INTEGER_VECTORS =
+ VectorizationProvider.TESTS_FORCE_INTEGER_VECTORS || (isAMD64withoutAVX2 == false);
}
@Override EDIT: I removed the bullshit |
In addition to that, maybe we should change TestVectorUtilSupport to spawn 3 child JVMs and don't set the property top-level? The problem is currently any of those:
If we run the TestVectorUtilSupport tests in 3 JVMs, the normal lucene tests would run with platform default settings (without overrides) and only the relevant tests checks all bit sizes and enforces integer vectors in separate JVMs. I would prefer this. |
Otherwise this is a great improvement for testing, we just need to fine-tune it! ❤️ |
I don't mind if sysprop code moves, as long as you can ensure no classloading deadlocks |
No classloading deadlock possible. We have a Holder for that already: lucene/lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java Lines 189 to 193 in db746bd
|
wow, merge conflicts, really? git is terrible sometimes. literally didn't merge anything that isn't in this branch already. |
…icting changes: if you made any, i erased them. git is garbage.
I understand the git issue now. Why do we configure the merge button as "squash"? I'd like to default it to "merge" because I value my sanity! |
…ossible) and fallback to default provider in non-test mode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rmuir I pushed my change. Maybe we should make the test improvements a separate PR. I tend to make a special forking test (like the crsuher test or the LockFactory test).
This issue was more about the 512bit improvements, so a first step this looks fine! Thanks..
If you add a changes entry (as this is an improvement), I am fine with merging.
At least we should have the option to select both variants! We had the same issue with @risdenk when we updated Gradle, as we wanted to have the whitespace/formatter changes separately. Maybe start a discussion on mailing list! Thanks. |
OK, i just wanted to do "enough" on the tests to make some progress. If we just add _128, _256, _512 variants without addressing the testing, then we create a potential problem. I will add a CHANGES.txt and hope to avoid more conflicts (it is a conflict nightmare...) |
This is why I tend to make one PR after the other. So I don't want to start another test cleanup one before this is merged. 💩 |
Yeah thats strange, It doesnt make sense to remove the option from the merge button. I will use "git push" to workaround this for now so I can do proper merge commits, and avoid using this button until it is fixed. |
ok, all done. "asfgit" can do proper merge commit :) thanks @uschindler ! |
Will you backport? |
The only downside is that backporting gets harder. So squashed PRs are easier to handle in that regard. |
I haven't been backporting recent vector changes as they are fairly aggressive. That's just my thoughts. If we want to do that, we have to do it proper and backport test, benchmarks, logging improvements too. Is lucene 10 really so far away? |
Not if u use an explicit merge commit. You can backport entire PRs that way with a single |
Actually the merge done by asfgit was a fast-forward. So backporting is only possible by a compare of branches. |
I can take care of backporting, also your previous commits. For maintenenace it would be good as the code differs dramatically now between main and 9.x and makes merging hard. |
…12 (#12681) - Move testing properties to provider class (no classloading deadlock possible) and fallback to default provider in non-test mode - fix html - tidy - fix Panama provider to only have static final constants, but allow sysprops for testing. - Set those sysprops by the build when running tests: force integer vectors on always, and randomize vector size - stop using SPECIES_PREFERRED except at the top of this file - speedup all binary functions on avx256, speedup binary square on avx512 - cleanup cosine too, no perf impact - simple cleanups to vector code
I backported this one to 9.x. For that I selected the separate commits, cherrypicked and squashed them with my tortoise GUI (sorry!). |
This builds on #12680 so please review that one first to make it easier. The advantage there is we split out vector kernels into smaller manageable methods, making it easy to specialize the avx-512 cases.
This speeds up all the binary functions on avx-256, without slowing anything down on avx-512 (it explicitly avoids using 512-bit integer multiply). the squareDistance is sped up a bit on avx-512, only because avx-512 users were suffering from downclocking already (already doing 512-bit multiply), but we might as well clean up the other math. 512-bit multiply can't be easily avoided in this function due to the formula.
We have to think about testing. I don't want to rely upon various hardware for correctness. I think there's a way to alter the code so that we can test the correctness of everything (albeit slowly), e.g. exercise 128, 256, and 512 in our tests.