[DRAFT] Add unsigned byte vector operations for uint8 quantization #12694
Conversation
I don't think we should 'add' an unsigned vectors format; if it is better, we should change to it and remove the signed format. We have to maintain all this stuff.
@@ -352,6 +382,11 @@ private int dotProductBody512(byte[] a, byte[] b, int limit) {
      // 16-bit multiply: avoid AVX-512 heavy multiply on zmm
      Vector<Short> va16 = va8.convertShape(B2S, SHORT_SPECIES, 0);
      Vector<Short> vb16 = vb8.convertShape(B2S, SHORT_SPECIES, 0);
      if (unsigned) {
please don't add branches like this to the vector code. needs to be a separate method.
the problem is that you won't see this problem in a benchmark, because each benchmark runs in a separate VM which always calls dotProductBody512 with the same parameter. Hotspot will for sure optimize this. But if you have production code that sometimes uses signed and sometimes unsigned multiplication, the method will be deoptimized on every change as it runs into a trap (or something like that). That's not what you want.
To try it out (haven't tried), add a benchmark for this:
@Benchmark
@Fork(
    value = 1,
    jvmArgsPrepend = {"--add-modules=jdk.incubator.vector"})
public float binaryCosineUnsignedMixed() {
  return VectorUtil.cosine(bytesA, bytesB) + VectorUtil.cosineUnsigned(bytesA, bytesB);
}
If we need to, we can add an option that adds pollution to the profiles. But we kinda already know that it will be bad.
seems like this should be implemented as e.g. ZERO_EXTEND_B2I and ZERO_EXTEND_B2S instead of adding branches to the code and AND instructions.
This doesn't make any sense to me; it is 8 bits either way. But supporting both signed and unsigned is a nonstarter for me, it is too much. So if unsigned is better, then remove the signed functions from VectorUtil and their associated vectorized methods completely. Then I'm happy: we still have 6 similarity methods, they just use ZERO_EXTEND_B2I instead of B2I and so on.
Thank you!
This is tricky as folks who give Lucene …
The old signed stuff needs to be removed in order for the unsigned stuff to be added here. I'm gonna stand pretty firm on this. If you feel changes are "breaking" or a "back compat" problem, just ADD ADD ADD-ing features is not the solution.
The number of formats (float, binary) multiplies by the number of functions (dot product, cosine, square), so you aren't just adding one function here, it is 3. And in the future perhaps it equates to 4. And I have struggled very hard to make the existing 6 functions we have perform well. Some of them are just extremely inefficient mathematically. So we absolutely must remove the signed functions to add these unsigned ones, if they are better. We can't just keep exploding the amount of stuff we have to support. I am sure my opinion here will be unpopular; that is ok. I have fought the shit out of these methods.
also i'd recommend writing some tests, at least enough to know if the code is viable. It is not clear to me that the vector methods are correct: if they do 16-bit multiplication on two unsigned 8-bit integers and store the result in a signed short, it overflows.
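The overflow concern can be reproduced in a few lines of plain Java (a hedged sketch for illustration only; the class and method names here are made up and not from this PR). Zero-extending two unsigned bytes to 16 bits and multiplying in 16 bits, as the vectorized B2S path effectively does, wraps for large values:

```java
public class UnsignedShortOverflow {
    // Widen an unsigned byte to 16 bits (zero-extend), then multiply in
    // 16 bits: this is the scalar analogue of the B2S vector path.
    static short mul16(byte a, byte b) {
        short wa = (short) (a & 0xFF);
        short wb = (short) (b & 0xFF);
        return (short) (wa * wb); // product truncated back to 16 bits
    }

    // The correct 32-bit product of the same unsigned values.
    static int mul32(byte a, byte b) {
        return (a & 0xFF) * (b & 0xFF);
    }

    public static void main(String[] args) {
        byte v = (byte) 0xFF; // 255 unsigned
        // 255 * 255 = 65025, but a signed short tops out at 32767:
        System.out.println(mul16(v, v)); // -511 (65025 - 65536, wrapped)
        System.out.println(mul32(v, v)); // 65025
    }
}
```

With signed bytes the worst case is (-128) * (-128) = 16384, which fits in a short; that safety margin is exactly what the unsigned domain loses.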
This means the only way you can do this correctly is to remove all 16-bit multiplications and all use of … It means suffering downclocking on AVX-512 or shortening vectors in half. It means much slower ARM performance. If it gives better search results and is worth the tradeoff, that is fine. I just want you to be aware of the tradeoffs, because I think the benchmarks you have posted are unrealistic.
fwiw, i think you can keep the performance and solve the last problem by zero-extending twice: 8→16 bits, then 16→32 bits
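In scalar terms the idea looks like the sketch below (illustrative only; `dotProductUnsigned` is a hypothetical name, not a method in this PR). Masking with `& 0xFF` plays the role of the ZERO_EXTEND conversions: it widens without copying the sign bit, so the intermediate values are always non-negative and the multiply happens in a width that cannot wrap:

```java
public class ZeroExtendDot {
    // Hypothetical scalar sketch of a zero-extended unsigned dot product.
    // "a[i] & 0xFF" is the scalar analogue of ZERO_EXTEND_B2S/ZERO_EXTEND_B2I:
    // 0xFF widens to 255, not to -1 as a sign-extending cast would give.
    static int dotProductUnsigned(byte[] a, byte[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            // widen each lane to 32 bits before multiplying:
            // 255 * 255 = 65025 fits in an int, so no 16-bit wraparound
            sum += (a[i] & 0xFF) * (b[i] & 0xFF);
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] v = {(byte) 0xFF, (byte) 0xFF};
        System.out.println(dotProductUnsigned(v, v)); // 2 * 65025 = 130050
    }
}
```

The vectorized version would do the same widening in two hops (8→16, then 16→32) to keep as many lanes as possible at each step.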
There is a test missing in TestVectorUtilSupport that compares the results of the vectorized and standard impl. Also, some basic tests using extreme vectors should be added due to overflows. As Robert says, I am quite sure that the current code overflows when vectorized if you have large values (like 0xFF). So please add a test that compares results (like we have for all other methods).
@@ -164,6 +173,23 @@ public float cosine(byte[] a, byte[] b) {
    return (float) (sum / Math.sqrt((double) norm1 * (double) norm2));
  }

  @Override
  public float cosineUnsigned(byte[] a, byte[] b) {
    // Note: this will not overflow if dim < 2^18, since max(byte * byte) = 2^14.
All these comments need to be adjusted if the final result is still a signed java value. If Integer.compareUnsigned is used correctly on the results of the int methods, and used before conversion to double, then that problem goes away.
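A small sketch of that suggestion, using only JDK methods (the scenario is illustrative, not code from this PR): treat the accumulated int as unsigned by comparing with `Integer.compareUnsigned` and widening with `Integer.toUnsignedLong` before any conversion to double.

```java
public class UnsignedIntResult {
    public static void main(String[] args) {
        // Suppose an unsigned accumulation produced 2^31, which as a Java
        // int is Integer.MIN_VALUE (the sign bit is set):
        int sum = Integer.MIN_VALUE;

        // A plain signed comparison misorders it...
        System.out.println(sum > 1); // false, though the unsigned sum is 2^31

        // ...Integer.compareUnsigned reads the same bits as unsigned:
        System.out.println(Integer.compareUnsigned(sum, 1) > 0); // true

        // Widen to long BEFORE converting to double, so the value survives:
        double d = (double) Integer.toUnsignedLong(sum);
        System.out.println(d); // 2.147483648E9
    }
}
```

With this pattern the int accumulator keeps its full 32-bit unsigned range, and only the comments about overflow bounds need restating.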
Add tests, and same comment as Robert's.
Even if the code is fixed to always use … edit: add …
P.S.: It is too bad that we have no C preprocessor, so we could expand and inline the methods automatically. We could maybe write a Python script that generates the PanamaVectorUtilSupport class, but I think that's too much at the current state. It could be an option in the future; the downside is that it is IDE-unfriendly.
I also question whether this is the correct design with respect to the hardware. Look at the instruction support for doing this stuff, which uses signed bytes: https://www.felixcloutier.com/x86/vpdpbusd and, with saturation: https://www.felixcloutier.com/x86/vpdpbusds It would be bad to "pick" a vectors format that is not friendly to such instructions. Maybe it is the saturation that you really want for better results, which would still be supported by the hardware, at least in theory. I don't know if we can coerce the vector api into doing it today. Maybe it is possible to get something reasonable using xor/shift like the NumericUtils code :)
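The xor trick mentioned at the end can be sketched in a few lines (an illustration, not code from this PR or from NumericUtils itself): flipping the top bit with `^ 0x80` maps unsigned byte order onto signed byte order, the same idea NumericUtils-style sortable encodings rely on.

```java
public class SignBitFlip {
    // Flip the sign bit: unsigned 0..255 maps to signed -128..127,
    // preserving the relative ordering of the original unsigned values.
    static byte flip(byte b) {
        return (byte) (b ^ 0x80);
    }

    public static void main(String[] args) {
        System.out.println(flip((byte) 0x00)); // unsigned 0   -> -128
        System.out.println(flip((byte) 0x80)); // unsigned 128 ->    0
        System.out.println(flip((byte) 0xFF)); // unsigned 255 ->  127

        // Order is preserved: for unsigned u < v, flip(u) < flip(v) as signed.
        for (int u = 0; u < 255; u++) {
            if (flip((byte) u) >= flip((byte) (u + 1))) {
                throw new AssertionError("order not preserved at " + u);
            }
        }
    }
}
```

Note the caveat: the flip preserves comparisons and sort order, not products, so it helps with encodings and comparisons rather than with the dot product arithmetic itself.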
After finalizing work and merging: #12582
Investigation on whether adding unsigned vector operations should occur. Quantizing within [0, 255] can reduce error. However, Panama vector operations over unsigned bytes are slightly more expensive (see JMH benchmarks below). Need to benchmark recall vs. latency over some data sets to verify whether this is worth it or not.

M1 (ARM 128 NEON)
GCP AVX512