Speedup integer functions for 128-bit neon vectors #12632
Conversation
I did manage to get a little bit more out of the ARM chip. I will look at the other 2 functions there too...
FYI, I ran the benchmark at the latest benchmark commit on a linux-x86-64 server with AVX-512 support.
Thanks for running. I will just revert it then and get folks to test the ARM changes. I don't want to hurt AVX-512...
This reverts commit 132bf28.
OK, I reverted the 256-bit changes from here, and from the vectorbench, but kept the 128-bit ones for people to test on Macs. Now this issue does the opposite of what it says; I will edit it...
I don't know how to do the same tricks for the BinarySquare one due to the subtraction. So I'm done for now. I think, given the reports from @gf2121, the 256/512-bit experiment was a loss :(
Thanks for looking into this @rmuir, I've been thinking similarly myself (just didn't get around to anything other than the thinking!). On my Mac M2, JDK 20.0.2:
JDK 21
What is the maximum value that we can see in the input bytes? Can they ever hold ...
And of course, ...
All possible values is how I test.
Yes!
No! Integer FMA won't overflow here. And I know that isn't obvious and probably needs a code comment lol. That's why it only works for these two methods and not the square.
Ok, cool. If there is not already one, we should add a test to the Panama / scalar unit tests for the boundary values.
Yeah, agreed: we should test the boundaries for all 3 functions.
Yeah, you are right, I am wrong. The trick only works in the unsigned case; Byte.MIN_VALUE is a problem :(
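To make the overflow concrete: assuming the trick accumulates pairwise byte products in a signed 16-bit lane (my reading of the discussion, not something spelled out above), the boundary arithmetic looks like this:

```java
// Illustrative arithmetic only, not Lucene code. Assumes the optimization
// sums pairwise byte products in a signed 16-bit lane before widening.
public class ByteProductBounds {
  public static void main(String[] args) {
    // worst-case signed product: (-128) * (-128) = 16384
    int worst = Byte.MIN_VALUE * Byte.MIN_VALUE;
    // adding just two such products already exceeds Short.MAX_VALUE (32767)
    int pair = worst + worst;
    System.out.println(worst);                  // 16384
    System.out.println(pair > Short.MAX_VALUE); // true: 32768 overflows a 16-bit lane
  }
}
```

With unsigned 7-bit values the worst pairwise sum would be 2 × 127 × 127 = 32258, which still fits, which is why the trick only works in the unsigned case.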
At least we can improve the testing out of this: #12634
Don't worry, I have a plan B. It is just frustrating due to the nightmare of operating on the Mac, combined with the fact that this benchmark and the Lucene source are separate repos. It makes the situation very slow and error-prone...
See the latest commit for the idea. On my Mac it gives a decent boost. It uses a "32-bit" vector by loading a 64-bit vector from the array but only processing half of it. The tests should fail as I need to fix the other functions.
OK, on my Mac I see:
And it passes the new boundary tests (but no sneakiness with boundary values; instead another type of sneakiness).
@gf2121 @ChrisHegarty you can see the issue from his assembler output with the failed Intel optimization:
@gf2121 I think we could diagnose it further with https://github.com/travisdowns/avx-turbo
I compiled the code and ran it easily, just ...
I guess you will probably have to ...
And at least the theory makes sense: this integer multiply is definitely "AVX-512 heavy", so if you have a CPU susceptible to throttling, it is better to do the 256-bit multiplies that we do today. I guess I feel it's an issue with SPECIES_PREFERRED on such CPUs, but for now I prefer being conservative. Would be cool if we could confirm it.
I'm gonna merge this but we should continue to explore the Intel case. Not sure what we can do there though.
Thank y'all so much for digging into this @rmuir @gf2121 @ChrisHegarty @uschindler! Maybe one day Panama Vector will mature enough to allow us to do nicer things with ...
@benwtrent it isn't a Panama thing. These functions are 32-bit (they return ...
I backported this one to 9.x. |
Thanks @rmuir for the profiling guide! Sorry for the delay. It took me some time to apply for root privileges on this machine. Here is the result: avx_turbo.log
Thank you @gf2121, it is confirmed. I include just the part of the table that is relevant. It is really great that you caught this.
@ChrisHegarty there are plenty of actions we could take... but I implemented this specific optimization in question safely in #12681. See https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Downclocking for a brief introduction.
To have any decent performance, we really need information on the CPU in question and its vector capabilities. And the idea that you can write "one loop" that "runs anywhere" is an obvious pipe dream on the part of OpenJDK... we have to deal with special-case BS like this for every CPU. My advice on the OpenJDK side: give up on the pipe dream and give us some cpuinfo, or at least tell us which vector operations are supported. I'm almost to the point of writing a /proc/cpuinfo parser.
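For the curious, the /proc/cpuinfo idea could be sketched roughly like this (a hypothetical illustration, not anything Lucene ships; the flag name `avx512f` is the Linux kernel's spelling of the AVX-512 Foundation bit):

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: pull the feature flags out of /proc/cpuinfo text.
// In real use you would read the file; here the text is passed in so the
// parsing logic is testable anywhere.
public class CpuFlags {
  static Set<String> parseFlags(String cpuinfo) {
    for (String line : cpuinfo.split("\n")) {
      // the "flags" line lists whitespace-separated feature names
      if (line.startsWith("flags")) {
        String[] parts = line.split(":", 2);
        return Arrays.stream(parts[1].trim().split("\\s+"))
                     .collect(Collectors.toSet());
      }
    }
    return Set.of();
  }

  public static void main(String[] args) {
    String sample = "processor\t: 0\nflags\t\t: fpu sse sse2 avx avx2 avx512f\n";
    Set<String> flags = parseFlags(sample);
    System.out.println(flags.contains("avx512f")); // true for this sample
  }
}
```

This only answers "does the CPU support it", not "does it downclock", which is the harder half of the problem discussed here.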
Also, I think JMH is bad news when it comes to downclocking. It does not show the true performance impact of this: it slows down other things on the machine as well, and the user might have other shit going on too. This is why the AVX-512 variants of the functions avoid the 512-bit multiply and just use 256-bit vectors for that part; it is the only safe approach. Unfortunately this approach is slightly suboptimal for your Rocket Lake, which doesn't suffer from downclocking, but it is a disaster elsewhere, so we have to play it safe.
I think we should really open an issue at the JDK. We discussed this before. In my opinion, the system should have an API to query, for each operator, whether it is supported in the current configuration (CPU, species size, and Hotspot abilities). In addition, a shortcut to figure out if C2 is enabled at all; if C2 is disabled, each of the above queries for operator and species combinations should return "no". Everything else makes the Vector API a huge trap. It is a specialist API, so please add detailed information to allow specialist use!
Doesn't work on Windows. 😱
We could add a sysprop to enable advanced AVX-512 support. It could be a static boolean on the impl. If somebody enables it, they must be aware that it may slow things down. I have seen stuff like this in native vector databases at Buzzwords (they told me).
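The sysprop idea could look something like this (a minimal sketch; the property name is invented for illustration and is not a real Lucene property):

```java
// Hypothetical opt-in flag, mirroring how Lucene gates other risky behavior
// behind system properties (e.g. the MMap-related ones mentioned above).
public class Avx512Opt {
  // read once into a static final so HotSpot can constant-fold the check
  static final boolean ENABLE_HEAVY_AVX512 =
      Boolean.getBoolean("example.vector.enableHeavyAVX512");

  public static void main(String[] args) {
    // false unless the JVM was started with -Dexample.vector.enableHeavyAVX512=true
    System.out.println(ENABLE_HEAVY_AVX512);
  }
}
```

Defaulting to off matches the concern in this thread: the slow path is only a missed optimization, while the fast path can be a regression on downclock-susceptible CPUs.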
The Vector API should also fix its bugs. It is totally senseless: OpenJDK has the same heuristics in its intrinsics/asm that I allude to, to avoid these problems. Hacks in every intrinsic, checking specific families of CPUs and shit. Go look for yourself. But it doesn't want to expose this stuff via the Vector API to developers.
I would really just fix the API: instead of ...
Then OpenJDK can keep its hacks to itself. But if they won't fix this, they should expose them.
Such a method would solve 95% of my problems, if it would throw UnsupportedOperationException or return ...
The JVM already has these. For example, a user can already set the max vector width and the AVX instruction level. I assume that AVX-512 users who are running on downclock-susceptible CPUs would already set flags to use only 256-bit vectors. So I am afraid of adding our own flags that conflict with those.
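For reference, the existing HotSpot switches alluded to here look like this on x86 (exact behavior can vary by JDK version):

```
# cap vector register width at 32 bytes (i.e. 256 bits)
java -XX:MaxVectorSize=32 ...

# or restrict the instruction set level to AVX2
java -XX:UseAVX=2 ...
```

Note that `-XX:MaxVectorSize` is specified in bytes, so 32 corresponds to 256-bit registers; either flag keeps SPECIES_PREFERRED away from the 512-bit path.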
Hi,
The problem is that by default the JVM enables those flags, and then those machines get slow when they use Lucene, and support cases are opened at Elasticsearch/OpenSearch and the like. So my wish would be to have that opt-in. My idea was to add a sysprop like the MMap ones that enable/disable things manually or disable unsafe unmapping. In that case the sysprop would enable the 512-bit AVX case. It won't conflict with a manual AVX=2 JVM override setting (because then the preferred bit size is 256 and our flag is a no-op).
Just dropping an ACK here, for now. I do get the issues, and I agree that there could be better ways to frame things at the vector API level. |
Let's write a proposal together in a Google doc (or similar) and we could later open an issue or JEP. As we are both OpenJDK members, this would be helpful to me to participate in that process (I want to learn something new).
Gives a little love to the Mac for dot-product and binary-cosine.
I see on m1 some improvement:
You can reproduce with
java -jar target/vectorbench.jar Binary -p size=1024
See https://github.com/rmuir/vectorbench which has a README now!
Edit: originally I tried to optimize the 256/512-bit path, but it caused slowdowns on AVX-512 so I reverted the changes. Sorry for the confusion. Hopefully we get some improvement that can stick :)