[DRAFT] Add unsigned byte vector operations for uint8 quantization #12694

Closed

Conversation

benwtrent (Member)

[DRAFT]

After finalizing work and merging: #12582

Investigation into whether we should add unsigned vector operations. Quantizing into [0, 255] can reduce error. However, Panama vector operations over unsigned bytes are slightly more expensive (see JMH benchmarks below). We need to benchmark recall vs. latency over some datasets to verify whether this is worth it.

M1 (ARM 128-bit NEON)
Benchmark                                           (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryCosineScalar                 128  thrpt    5   8.369 ± 0.208  ops/us
VectorUtilBenchmark.binaryCosineScalar                 207  thrpt    5   5.124 ± 0.210  ops/us
VectorUtilBenchmark.binaryCosineScalar                 256  thrpt    5   4.193 ± 0.014  ops/us
VectorUtilBenchmark.binaryCosineScalar                1024  thrpt    5   1.043 ± 0.002  ops/us
VectorUtilBenchmark.binaryCosineUnsignedScalar         128  thrpt    5   8.359 ± 0.100  ops/us
VectorUtilBenchmark.binaryCosineUnsignedScalar         207  thrpt    5   5.193 ± 0.025  ops/us
VectorUtilBenchmark.binaryCosineUnsignedScalar         256  thrpt    5   4.194 ± 0.015  ops/us
VectorUtilBenchmark.binaryCosineUnsignedScalar        1024  thrpt    5   1.043 ± 0.002  ops/us
VectorUtilBenchmark.binaryCosineUnsignedVector         128  thrpt    5  21.068 ± 0.072  ops/us
VectorUtilBenchmark.binaryCosineUnsignedVector         207  thrpt    5  12.901 ± 0.041  ops/us
VectorUtilBenchmark.binaryCosineUnsignedVector         256  thrpt    5  11.595 ± 0.128  ops/us
VectorUtilBenchmark.binaryCosineUnsignedVector        1024  thrpt    5   3.197 ± 0.007  ops/us
VectorUtilBenchmark.binaryCosineVector                 128  thrpt    5  23.552 ± 0.081  ops/us
VectorUtilBenchmark.binaryCosineVector                 207  thrpt    5  14.358 ± 0.077  ops/us
VectorUtilBenchmark.binaryCosineVector                 256  thrpt    5  13.165 ± 0.053  ops/us
VectorUtilBenchmark.binaryCosineVector                1024  thrpt    5   3.681 ± 0.027  ops/us
VectorUtilBenchmark.binaryDotProductScalar             128  thrpt    5  25.125 ± 0.043  ops/us
VectorUtilBenchmark.binaryDotProductScalar             207  thrpt    5  15.512 ± 0.061  ops/us
VectorUtilBenchmark.binaryDotProductScalar             256  thrpt    5  12.557 ± 0.044  ops/us
VectorUtilBenchmark.binaryDotProductScalar            1024  thrpt    5   3.110 ± 0.029  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedScalar     128  thrpt    5  25.115 ± 0.082  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedScalar     207  thrpt    5  15.518 ± 0.039  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedScalar     256  thrpt    5  12.554 ± 0.037  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedScalar    1024  thrpt    5   3.112 ± 0.011  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedVector     128  thrpt    5  38.071 ± 0.060  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedVector     207  thrpt    5  25.039 ± 0.120  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedVector     256  thrpt    5  20.578 ± 0.062  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedVector    1024  thrpt    5   5.465 ± 0.008  ops/us
VectorUtilBenchmark.binaryDotProductVector             128  thrpt    5  45.923 ± 0.150  ops/us
VectorUtilBenchmark.binaryDotProductVector             207  thrpt    5  30.516 ± 0.053  ops/us
VectorUtilBenchmark.binaryDotProductVector             256  thrpt    5  25.510 ± 0.053  ops/us
VectorUtilBenchmark.binaryDotProductVector            1024  thrpt    5   6.744 ± 0.046  ops/us
GCP AVX512
Benchmark                                           (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryCosineScalar                 128  thrpt    5   7.290 ± 0.003  ops/us
VectorUtilBenchmark.binaryCosineScalar                 207  thrpt    5   4.236 ± 0.015  ops/us
VectorUtilBenchmark.binaryCosineScalar                 256  thrpt    5   3.452 ± 0.015  ops/us
VectorUtilBenchmark.binaryCosineScalar                1024  thrpt    5   0.885 ± 0.003  ops/us
VectorUtilBenchmark.binaryCosineUnsignedScalar         128  thrpt    5   7.304 ± 0.007  ops/us
VectorUtilBenchmark.binaryCosineUnsignedScalar         207  thrpt    5   4.225 ± 0.013  ops/us
VectorUtilBenchmark.binaryCosineUnsignedScalar         256  thrpt    5   3.431 ± 0.026  ops/us
VectorUtilBenchmark.binaryCosineUnsignedScalar        1024  thrpt    5   0.879 ± 0.006  ops/us
VectorUtilBenchmark.binaryCosineUnsignedVector         128  thrpt    5  29.931 ± 0.049  ops/us
VectorUtilBenchmark.binaryCosineUnsignedVector         207  thrpt    5  17.284 ± 0.018  ops/us
VectorUtilBenchmark.binaryCosineUnsignedVector         256  thrpt    5  19.145 ± 0.067  ops/us
VectorUtilBenchmark.binaryCosineUnsignedVector        1024  thrpt    5   6.109 ± 0.004  ops/us
VectorUtilBenchmark.binaryCosineVector                 128  thrpt    5  32.736 ± 0.027  ops/us
VectorUtilBenchmark.binaryCosineVector                 207  thrpt    5  18.272 ± 0.640  ops/us
VectorUtilBenchmark.binaryCosineVector                 256  thrpt    5  21.435 ± 0.051  ops/us
VectorUtilBenchmark.binaryCosineVector                1024  thrpt    5   7.029 ± 0.011  ops/us
VectorUtilBenchmark.binaryDotProductScalar             128  thrpt    5  16.971 ± 0.053  ops/us
VectorUtilBenchmark.binaryDotProductScalar             207  thrpt    5   9.508 ± 0.091  ops/us
VectorUtilBenchmark.binaryDotProductScalar             256  thrpt    5   8.121 ± 0.059  ops/us
VectorUtilBenchmark.binaryDotProductScalar            1024  thrpt    5   2.501 ± 0.011  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedScalar     128  thrpt    5  16.977 ± 0.056  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedScalar     207  thrpt    5  10.448 ± 0.045  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedScalar     256  thrpt    5   8.352 ± 0.042  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedScalar    1024  thrpt    5   2.502 ± 0.042  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedVector     128  thrpt    5  69.663 ± 0.079  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedVector     207  thrpt    5  44.077 ± 0.059  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedVector     256  thrpt    5  41.963 ± 0.030  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedVector    1024  thrpt    5  11.856 ± 0.020  ops/us
VectorUtilBenchmark.binaryDotProductVector             128  thrpt    5  85.247 ± 0.175  ops/us
VectorUtilBenchmark.binaryDotProductVector             207  thrpt    5  48.486 ± 0.055  ops/us
VectorUtilBenchmark.binaryDotProductVector             256  thrpt    5  50.560 ± 0.045  ops/us
VectorUtilBenchmark.binaryDotProductVector            1024  thrpt    5  14.697 ± 0.010  ops/us
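For context on what the unsigned variants compute: at the scalar level the only difference is masking each byte with `0xFF` before multiplying, which reinterprets it as a value in [0, 255]. A minimal sketch (hypothetical names, not Lucene's actual VectorUtil code):

```java
// Hypothetical scalar sketch of unsigned vs. signed byte dot products.
// Names are illustrative; this is not Lucene's actual VectorUtil code.
public class UnsignedDotSketch {

  // Treat each byte as an unsigned value in [0, 255] via (x & 0xFF).
  static int dotProductUnsigned(byte[] a, byte[] b) {
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      total += (a[i] & 0xFF) * (b[i] & 0xFF);
    }
    return total;
  }

  // Signed variant for comparison: bytes are sign-extended to int.
  static int dotProductSigned(byte[] a, byte[] b) {
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      total += a[i] * b[i];
    }
    return total;
  }

  public static void main(String[] args) {
    byte[] v = {(byte) 0xFF}; // unsigned 255, signed -1
    System.out.println(dotProductUnsigned(v, v)); // 65025
    System.out.println(dotProductSigned(v, v));   // 1
  }
}
```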

@rmuir (Member) commented Oct 19, 2023

I don't think we should 'add' an unsigned vector format; if it is better, we should change to it and remove the signed format. We have to maintain all this stuff.

@@ -352,6 +382,11 @@ private int dotProductBody512(byte[] a, byte[] b, int limit) {
// 16-bit multiply: avoid AVX-512 heavy multiply on zmm
Vector<Short> va16 = va8.convertShape(B2S, SHORT_SPECIES, 0);
Vector<Short> vb16 = vb8.convertShape(B2S, SHORT_SPECIES, 0);
if (unsigned) {
Member:

Please don't add branches like this to the vector code; it needs to be a separate method.

@uschindler (Contributor) commented Oct 19, 2023:

The problem is that you won't see this in a benchmark, because each benchmark runs in a separate VM that always calls dotProductBody512 with the same parameter, so HotSpot will certainly optimize it. But if you have production code that sometimes uses signed and sometimes unsigned multiplication, the method will be deoptimized on every change as it hits an uncommon trap (or similar). That's not what you want.

To try it out (haven't tried), add a benchmark for this:

@Benchmark
@Fork(
    value = 1,
    jvmArgsPrepend = {"--add-modules=jdk.incubator.vector"})
public float binaryCosineUnsignedMixed() {
  return VectorUtil.cosine(bytesA, bytesB) + VectorUtil.cosineUnsigned(bytesA, bytesB);
}

Contributor:

If we need to, we can add an option that adds pollution to the profiles. But we kinda already know that it will be bad.

@rmuir (Member) commented Oct 19, 2023

Seems like this should be implemented as e.g. ZERO_EXTEND_B2I and ZERO_EXTEND_B2S, instead of adding branches and AND instructions to the code.

@rmuir (Member) commented Oct 19, 2023

> Quantizing into [0, 255] can reduce error.

This doesn't make any sense to me, it is 8 bits either way.

But supporting both signed and unsigned is a nonstarter for me, it is too much. So if unsigned is better then remove the signed functions from VectorUtil and their associated vectorized methods completely.

Then I'm happy, we still have 6 similarity methods, just they use ZERO_EXTEND_B2I instead of B2I and so on.
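At the lane level, the difference between the B2I and ZERO_EXTEND_B2I conversions is sign extension versus zero extension; the scalar equivalent of zero extension is a mask. A small illustration (plain scalar Java, not the Vector API):

```java
// Scalar equivalents of the two byte-to-int lane conversions.
public class ZeroExtendSketch {

  // What B2I does: sign-extend, so 0xFF becomes -1.
  static int signExtend(byte b) {
    return b;
  }

  // What ZERO_EXTEND_B2I does: zero-extend, so 0xFF becomes 255.
  static int zeroExtend(byte b) {
    return b & 0xFF;
  }

  public static void main(String[] args) {
    byte b = (byte) 0xFF; // bit pattern 1111_1111
    System.out.println(signExtend(b)); // -1
    System.out.println(zeroExtend(b)); // 255
  }
}
```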

@benwtrent (Member, Author)

> ZERO_EXTEND_B2I and ZERO_EXTEND_B2S instead of adding branches to the code and AND instructions.

Thank you!

> I don't think we should 'add' an unsigned vector format; if it is better, we should change to it and remove the signed format. We have to maintain all this stuff.

This is tricky, as folks who give Lucene byte[] vectors now expect signed operations. While this isn't an issue for euclidean, it is for dot_product, etc. Wouldn't it be a breaking change to adjust how scoring works?

@rmuir (Member) commented Oct 19, 2023

> This is tricky, as folks who give Lucene byte[] vectors now expect signed operations. While this isn't an issue for euclidean, it is for dot_product, etc. Wouldn't it be a breaking change to adjust how scoring works?

The old signed stuff needs to be removed in order for the unsigned stuff to be added here. I'm gonna stand pretty firm on this.

If you feel changes are "breaking" or "back compat", just ADD ADD ADD features is not the solution.

@rmuir (Member) commented Oct 19, 2023

The number of formats (float, binary) multiplies by the number of functions (dot product, cosine, square), so you aren't just adding one function here; it is three, and in the future perhaps four.

And I have struggled very hard to make the existing 6 functions we have perform well. Some of them are just extremely inefficient mathematically.

So we absolutely must remove the signed functions to add these unsigned ones, if they are better. We can't just keep exploding the amount of stuff we have to support.

I am sure my opinion here will be unpopular, that is ok. I have fought the shit out of these methods.

@rmuir (Member) commented Oct 19, 2023

Also, I'd recommend writing some tests, at least enough to know the code is viable. It is not clear to me that the vector methods are correct: if they do a 16-bit multiplication of two unsigned 8-bit integers and store the result in a signed short, it overflows.
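The overflow is easy to see with plain arithmetic: the maximum unsigned 8-bit product is 255 * 255 = 65025, which exceeds Short.MAX_VALUE (32767), so a signed 16-bit lane wraps:

```java
// Demonstrates why a 16-bit signed lane cannot hold an unsigned byte product.
public class ShortOverflowSketch {

  // The maximum product of two unsigned bytes, truncated to a signed short.
  static short truncatedMaxProduct() {
    int product = 255 * 255; // 65025
    return (short) product;  // wraps: 65025 - 65536 = -511
  }

  public static void main(String[] args) {
    System.out.println(truncatedMaxProduct()); // -511
  }
}
```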

@rmuir (Member) commented Oct 19, 2023

This means the only way to do this correctly is to remove all 16-bit multiplications and all use of short completely, and go straight from 8-bit to 32-bit with ZERO_EXTEND_B2I.

That means suffering downclocking on AVX-512 or halving the vector width. It means much slower ARM performance.

If it gives better search results and it is worth the tradeoff, that is fine. I just want you to be aware of the tradeoffs because the benchmarks you have posted I think are unrealistic.

@rmuir (Member) commented Oct 19, 2023

FWIW, I think you can keep the performance and solve the last problem by zero-extending twice: 8→16 bit, then 16→32 bit.
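A scalar analogue of that two-step widening (a sketch only; the vectorized version would use ZERO_EXTEND_B2S followed by ZERO_EXTEND_S2I on lanes): the bytes are zero-extended to shorts, the shorts zero-extended to ints, and the multiplication happens in 32 bits where it cannot overflow:

```java
// Two-step zero-extension: byte -> short -> int, multiply in 32 bits.
public class TwoStepWidenSketch {

  static int widenedProduct(byte a, byte b) {
    short a16 = (short) (a & 0xFF); // like ZERO_EXTEND_B2S: value in [0, 255]
    short b16 = (short) (b & 0xFF);
    int a32 = a16 & 0xFFFF;         // like ZERO_EXTEND_S2I
    int b32 = b16 & 0xFFFF;
    return a32 * b32;               // at most 65025, safe in 32 bits
  }

  public static void main(String[] args) {
    System.out.println(widenedProduct((byte) 0xFF, (byte) 0xFF)); // 65025
  }
}
```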

@uschindler (Contributor)

> Also, I'd recommend writing some tests, at least enough to know the code is viable. It is not clear to me that the vector methods are correct: if they do a 16-bit multiplication of two unsigned 8-bit integers and store the result in a signed short, it overflows.

There is a test missing in TestVectorUtilSupport that compares the results of the vectorized and standard implementations. Also, some basic tests using extreme vectors should be added because of the overflows.

As Robert says, I am quite sure that the current code overflows when vectorized if you have large values (like 0xFF). So please add a test that compares results (like we have for all the other methods).

@@ -164,6 +173,23 @@ public float cosine(byte[] a, byte[] b) {
return (float) (sum / Math.sqrt((double) norm1 * (double) norm2));
}

@Override
public float cosineUnsigned(byte[] a, byte[] b) {
// Note: this will not overflow if dim < 2^18, since max(byte * byte) = 2^14.
Member:

All these comments need to be adjusted if the final result is still signed Java values. If Integer.compareUnsigned is used correctly on the results of the int methods, and used before conversion to double, then that problem goes away.

@uschindler (Contributor) left a comment:

Add tests, and same comment as Robert's.


@rmuir (Member) commented Oct 19, 2023

> There is a test missing in TestVectorUtilSupport that compares the results of the vectorized and standard implementations. Also, some basic tests using extreme vectors should be added because of the overflows.
>
> As Robert says, I am quite sure that the current code overflows when vectorized if you have large values (like 0xFF). So please add a test that compares results (like we have for all the other methods).

Even if the code is fixed to always use ZERO_EXTEND_B2I, ZERO_EXTEND_B2S, and ZERO_EXTEND_S2I so that it does true unsigned math, there is the problem of the end result currently being defined as a 'signed integer'. It should be 'unsigned integer', but that leads to problems in Java-land: we have to worry that all code treats it correctly with Integer.compareUnsigned, and that the various "finalizer" functions acting on the result first cast to long, and so on. This is not happening in the scalar versions right now either.

edit: added ZERO_EXTEND_S2I, as that is critical to prevent the multiplication from overflowing.
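The hazard with a result that is "defined as unsigned" but stored in a Java int can be shown directly; any consumer that compares with `<` or widens with a plain cast silently gets the signed interpretation:

```java
// How an "unsigned" int result goes wrong with ordinary signed operations.
public class UnsignedResultSketch {

  static boolean signedLess(int a, int b) {
    return a < b; // wrong for unsigned values
  }

  static boolean unsignedLess(int a, int b) {
    return Integer.compareUnsigned(a, b) < 0; // correct for unsigned values
  }

  // Widening must zero-extend; a plain (long) cast sign-extends.
  static long widenUnsigned(int a) {
    return a & 0xFFFFFFFFL;
  }

  public static void main(String[] args) {
    int result = 0x80000000; // large unsigned value, negative as a signed int
    System.out.println(signedLess(result, 1));   // true (wrong)
    System.out.println(unsignedLess(result, 1)); // false (correct)
    System.out.println(widenUnsigned(result));   // 2147483648
  }
}
```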

@uschindler (Contributor)

P.S.: It is too bad that we have no C preprocessor so we could expand and inline the methods automatically. We could maybe write a Python script that generates the PanamaVectorUtilSupport class, but I think that's too much at the current state. It could be an option in the future, though the downside is that it is IDE-unfriendly.

@rmuir (Member) commented Oct 19, 2023

I also question whether this is the correct design with respect to the hardware. Look at the instruction support for doing this stuff, which uses signed bytes: https://www.felixcloutier.com/x86/vpdpbusd and, with saturation: https://www.felixcloutier.com/x86/vpdpbusds

It would be bad to pick a vector format that is not friendly to such instructions. Maybe it is saturation that you really want for better results, which would still be supported by the hardware, at least in theory. I don't know if we can coerce the Vector API into doing it today. Maybe it is possible to get something reasonable using xor/shift like the NumericUtils code :)
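The xor trick alluded to (as in Lucene's NumericUtils sortable encodings) flips the sign bit, mapping unsigned byte order onto signed byte order. Note this preserves comparisons, not products, so it would only help order-based uses; a sketch:

```java
// Sign-bit flip: maps unsigned [0, 255] onto signed [-128, 127],
// preserving ordering (the NumericUtils-style trick).
public class SignFlipSketch {

  static byte flip(byte b) {
    return (byte) (b ^ 0x80);
  }

  public static void main(String[] args) {
    System.out.println(flip((byte) 0x00)); // -128 (smallest unsigned -> smallest signed)
    System.out.println(flip((byte) 0xFF)); //  127 (largest unsigned -> largest signed)
  }
}
```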
