[DRAFT] Add unsigned byte vector operations for uint8 quantization #12694

Closed

Conversation

benwtrent (Member)

[DRAFT]

After finalizing work and merging: #12582

Investigation into whether we should add unsigned vector operations. Quantizing into [0, 255] can reduce error. However, Panama vector operations over unsigned bytes are slightly more expensive (see JMH benchmarks below). We need to benchmark recall vs. latency over some datasets to verify whether this is worth it.

M1 (ARM 128-bit NEON)
Benchmark                                           (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryCosineScalar                 128  thrpt    5   8.369 ± 0.208  ops/us
VectorUtilBenchmark.binaryCosineScalar                 207  thrpt    5   5.124 ± 0.210  ops/us
VectorUtilBenchmark.binaryCosineScalar                 256  thrpt    5   4.193 ± 0.014  ops/us
VectorUtilBenchmark.binaryCosineScalar                1024  thrpt    5   1.043 ± 0.002  ops/us
VectorUtilBenchmark.binaryCosineUnsignedScalar         128  thrpt    5   8.359 ± 0.100  ops/us
VectorUtilBenchmark.binaryCosineUnsignedScalar         207  thrpt    5   5.193 ± 0.025  ops/us
VectorUtilBenchmark.binaryCosineUnsignedScalar         256  thrpt    5   4.194 ± 0.015  ops/us
VectorUtilBenchmark.binaryCosineUnsignedScalar        1024  thrpt    5   1.043 ± 0.002  ops/us
VectorUtilBenchmark.binaryCosineUnsignedVector         128  thrpt    5  21.068 ± 0.072  ops/us
VectorUtilBenchmark.binaryCosineUnsignedVector         207  thrpt    5  12.901 ± 0.041  ops/us
VectorUtilBenchmark.binaryCosineUnsignedVector         256  thrpt    5  11.595 ± 0.128  ops/us
VectorUtilBenchmark.binaryCosineUnsignedVector        1024  thrpt    5   3.197 ± 0.007  ops/us
VectorUtilBenchmark.binaryCosineVector                 128  thrpt    5  23.552 ± 0.081  ops/us
VectorUtilBenchmark.binaryCosineVector                 207  thrpt    5  14.358 ± 0.077  ops/us
VectorUtilBenchmark.binaryCosineVector                 256  thrpt    5  13.165 ± 0.053  ops/us
VectorUtilBenchmark.binaryCosineVector                1024  thrpt    5   3.681 ± 0.027  ops/us
VectorUtilBenchmark.binaryDotProductScalar             128  thrpt    5  25.125 ± 0.043  ops/us
VectorUtilBenchmark.binaryDotProductScalar             207  thrpt    5  15.512 ± 0.061  ops/us
VectorUtilBenchmark.binaryDotProductScalar             256  thrpt    5  12.557 ± 0.044  ops/us
VectorUtilBenchmark.binaryDotProductScalar            1024  thrpt    5   3.110 ± 0.029  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedScalar     128  thrpt    5  25.115 ± 0.082  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedScalar     207  thrpt    5  15.518 ± 0.039  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedScalar     256  thrpt    5  12.554 ± 0.037  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedScalar    1024  thrpt    5   3.112 ± 0.011  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedVector     128  thrpt    5  38.071 ± 0.060  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedVector     207  thrpt    5  25.039 ± 0.120  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedVector     256  thrpt    5  20.578 ± 0.062  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedVector    1024  thrpt    5   5.465 ± 0.008  ops/us
VectorUtilBenchmark.binaryDotProductVector             128  thrpt    5  45.923 ± 0.150  ops/us
VectorUtilBenchmark.binaryDotProductVector             207  thrpt    5  30.516 ± 0.053  ops/us
VectorUtilBenchmark.binaryDotProductVector             256  thrpt    5  25.510 ± 0.053  ops/us
VectorUtilBenchmark.binaryDotProductVector            1024  thrpt    5   6.744 ± 0.046  ops/us
GCP AVX512
Benchmark                                           (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryCosineScalar                 128  thrpt    5   7.290 ± 0.003  ops/us
VectorUtilBenchmark.binaryCosineScalar                 207  thrpt    5   4.236 ± 0.015  ops/us
VectorUtilBenchmark.binaryCosineScalar                 256  thrpt    5   3.452 ± 0.015  ops/us
VectorUtilBenchmark.binaryCosineScalar                1024  thrpt    5   0.885 ± 0.003  ops/us
VectorUtilBenchmark.binaryCosineUnsignedScalar         128  thrpt    5   7.304 ± 0.007  ops/us
VectorUtilBenchmark.binaryCosineUnsignedScalar         207  thrpt    5   4.225 ± 0.013  ops/us
VectorUtilBenchmark.binaryCosineUnsignedScalar         256  thrpt    5   3.431 ± 0.026  ops/us
VectorUtilBenchmark.binaryCosineUnsignedScalar        1024  thrpt    5   0.879 ± 0.006  ops/us
VectorUtilBenchmark.binaryCosineUnsignedVector         128  thrpt    5  29.931 ± 0.049  ops/us
VectorUtilBenchmark.binaryCosineUnsignedVector         207  thrpt    5  17.284 ± 0.018  ops/us
VectorUtilBenchmark.binaryCosineUnsignedVector         256  thrpt    5  19.145 ± 0.067  ops/us
VectorUtilBenchmark.binaryCosineUnsignedVector        1024  thrpt    5   6.109 ± 0.004  ops/us
VectorUtilBenchmark.binaryCosineVector                 128  thrpt    5  32.736 ± 0.027  ops/us
VectorUtilBenchmark.binaryCosineVector                 207  thrpt    5  18.272 ± 0.640  ops/us
VectorUtilBenchmark.binaryCosineVector                 256  thrpt    5  21.435 ± 0.051  ops/us
VectorUtilBenchmark.binaryCosineVector                1024  thrpt    5   7.029 ± 0.011  ops/us
VectorUtilBenchmark.binaryDotProductScalar             128  thrpt    5  16.971 ± 0.053  ops/us
VectorUtilBenchmark.binaryDotProductScalar             207  thrpt    5   9.508 ± 0.091  ops/us
VectorUtilBenchmark.binaryDotProductScalar             256  thrpt    5   8.121 ± 0.059  ops/us
VectorUtilBenchmark.binaryDotProductScalar            1024  thrpt    5   2.501 ± 0.011  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedScalar     128  thrpt    5  16.977 ± 0.056  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedScalar     207  thrpt    5  10.448 ± 0.045  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedScalar     256  thrpt    5   8.352 ± 0.042  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedScalar    1024  thrpt    5   2.502 ± 0.042  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedVector     128  thrpt    5  69.663 ± 0.079  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedVector     207  thrpt    5  44.077 ± 0.059  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedVector     256  thrpt    5  41.963 ± 0.030  ops/us
VectorUtilBenchmark.binaryDotProductUnsignedVector    1024  thrpt    5  11.856 ± 0.020  ops/us
VectorUtilBenchmark.binaryDotProductVector             128  thrpt    5  85.247 ± 0.175  ops/us
VectorUtilBenchmark.binaryDotProductVector             207  thrpt    5  48.486 ± 0.055  ops/us
VectorUtilBenchmark.binaryDotProductVector             256  thrpt    5  50.560 ± 0.045  ops/us
VectorUtilBenchmark.binaryDotProductVector            1024  thrpt    5  14.697 ± 0.010  ops/us
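For context on what the unsigned variants compute: at the scalar level the only difference is masking each byte with `0xFF` before multiplying, which reinterprets it as a value in [0, 255]. A minimal sketch (hypothetical names, not Lucene's actual VectorUtil code):

```java
// Hypothetical scalar sketch of unsigned vs. signed byte dot products.
// Names are illustrative; this is not Lucene's actual VectorUtil code.
public class UnsignedDotSketch {

  // Treat each byte as an unsigned value in [0, 255] via (x & 0xFF).
  static int dotProductUnsigned(byte[] a, byte[] b) {
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      total += (a[i] & 0xFF) * (b[i] & 0xFF);
    }
    return total;
  }

  // Signed variant for comparison: bytes are sign-extended to int.
  static int dotProductSigned(byte[] a, byte[] b) {
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      total += a[i] * b[i];
    }
    return total;
  }

  public static void main(String[] args) {
    byte[] v = {(byte) 0xFF}; // unsigned 255, signed -1
    System.out.println(dotProductUnsigned(v, v)); // 65025
    System.out.println(dotProductSigned(v, v));   // 1
  }
}
```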

@rmuir (Member) commented Oct 19, 2023

I don't think we should 'add' an unsigned vector format; if it is better, we should change to it and remove the signed format. We have to maintain all this stuff.

@@ -352,6 +382,11 @@ private int dotProductBody512(byte[] a, byte[] b, int limit) {
// 16-bit multiply: avoid AVX-512 heavy multiply on zmm
Vector<Short> va16 = va8.convertShape(B2S, SHORT_SPECIES, 0);
Vector<Short> vb16 = vb8.convertShape(B2S, SHORT_SPECIES, 0);
if (unsigned) {
Member:

Please don't add branches like this to the vector code; it needs to be a separate method.

@uschindler (Contributor) commented Oct 19, 2023:

The problem is that you won't see this in a benchmark, because each benchmark runs in a separate VM that always calls dotProductBody512 with the same parameter, so HotSpot will certainly optimize it. But if you have production code that sometimes uses signed and sometimes unsigned multiplication, the method will be deoptimized on every change as it hits an uncommon trap (or similar). That's not what you want.

To try it out (haven't tried), add a benchmark for this:

@Benchmark
@Fork(
    value = 1,
    jvmArgsPrepend = {"--add-modules=jdk.incubator.vector"})
public float binaryCosineUnsignedMixed() {
  return VectorUtil.cosine(bytesA, bytesB) + VectorUtil.cosineUnsigned(bytesA, bytesB);
}

Contributor:

If we need to, we can add an option that adds pollution to the profiles. But we kinda already know that it will be bad.

@rmuir (Member) commented Oct 19, 2023

Seems like this should be implemented as e.g. ZERO_EXTEND_B2I and ZERO_EXTEND_B2S, instead of adding branches and AND instructions to the code.

@rmuir (Member) commented Oct 19, 2023

> Quantizing into [0, 255] can reduce error.

This doesn't make any sense to me, it is 8 bits either way.

But supporting both signed and unsigned is a nonstarter for me, it is too much. So if unsigned is better then remove the signed functions from VectorUtil and their associated vectorized methods completely.

Then I'm happy, we still have 6 similarity methods, just they use ZERO_EXTEND_B2I instead of B2I and so on.
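At the lane level, the difference between the B2I and ZERO_EXTEND_B2I conversions is sign extension versus zero extension; the scalar equivalent of zero extension is a mask. A small illustration (plain scalar Java, not the Vector API):

```java
// Scalar equivalents of the two byte-to-int lane conversions.
public class ZeroExtendSketch {

  // What B2I does: sign-extend, so 0xFF becomes -1.
  static int signExtend(byte b) {
    return b;
  }

  // What ZERO_EXTEND_B2I does: zero-extend, so 0xFF becomes 255.
  static int zeroExtend(byte b) {
    return b & 0xFF;
  }

  public static void main(String[] args) {
    byte b = (byte) 0xFF; // bit pattern 1111_1111
    System.out.println(signExtend(b)); // -1
    System.out.println(zeroExtend(b)); // 255
  }
}
```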

@benwtrent (Member, Author)

> ZERO_EXTEND_B2I and ZERO_EXTEND_B2S instead of adding branches to the code and AND instructions.

Thank you!

> I don't think we should 'add' an unsigned vector format; if it is better, we should change to it and remove the signed format. We have to maintain all this stuff.

This is tricky, as folks who give Lucene byte[] vectors now expect signed operations. While this isn't an issue for euclidean, it is for dot_product, etc. Wouldn't it be a breaking change to adjust how scoring works?

@rmuir (Member) commented Oct 19, 2023

> This is tricky, as folks who give Lucene byte[] vectors now expect signed operations. While this isn't an issue for euclidean, it is for dot_product, etc. Wouldn't it be a breaking change to adjust how scoring works?

The old signed stuff needs to be removed in order for the unsigned stuff to be added here. I'm gonna stand pretty firm on this.

If you feel changes are "breaking" or "back compat", just ADD ADD ADD features is not the solution.

@rmuir (Member) commented Oct 19, 2023

The number of formats (float, binary) multiplies by the number of functions (dot product, cosine, square), so you aren't just adding one function here; it is three, and in the future perhaps four.

And I have struggled very hard to make the existing 6 functions we have perform well. Some of them are just extremely inefficient mathematically.

So we absolutely must remove the signed functions to add these unsigned ones, if they are better. We can't just keep exploding the amount of stuff we have to support.

I am sure my opinion here will be unpopular, that is ok. I have fought the shit out of these methods.

@rmuir (Member) commented Oct 19, 2023

Also, I'd recommend writing some tests, at least enough to know the code is viable. It is not clear to me that the vector methods are correct: if they do a 16-bit multiplication of two unsigned 8-bit integers and store the result in a signed short, it overflows.
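The overflow is easy to see with plain arithmetic: the maximum unsigned 8-bit product is 255 * 255 = 65025, which exceeds Short.MAX_VALUE (32767), so a signed 16-bit lane wraps:

```java
// Demonstrates why a 16-bit signed lane cannot hold an unsigned byte product.
public class ShortOverflowSketch {

  // The maximum product of two unsigned bytes, truncated to a signed short.
  static short truncatedMaxProduct() {
    int product = 255 * 255; // 65025
    return (short) product;  // wraps: 65025 - 65536 = -511
  }

  public static void main(String[] args) {
    System.out.println(truncatedMaxProduct()); // -511
  }
}
```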

@rmuir (Member) commented Oct 19, 2023

This means the only way to do this correctly is to remove all 16-bit multiplications and all use of short completely, and go straight from 8-bit to 32-bit with ZERO_EXTEND_B2I.

That means suffering downclocking on AVX-512 or halving the vector width. It means much slower ARM performance.

If it gives better search results and it is worth the tradeoff, that is fine. I just want you to be aware of the tradeoffs because the benchmarks you have posted I think are unrealistic.

@rmuir (Member) commented Oct 19, 2023

FWIW, I think you can keep the performance and solve the last problem by zero-extending twice: 8→16 bit, then 16→32 bit.
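A scalar analogue of that two-step widening (a sketch only; the vectorized version would use ZERO_EXTEND_B2S followed by ZERO_EXTEND_S2I on lanes): the bytes are zero-extended to shorts, the shorts zero-extended to ints, and the multiplication happens in 32 bits where it cannot overflow:

```java
// Two-step zero-extension: byte -> short -> int, multiply in 32 bits.
public class TwoStepWidenSketch {

  static int widenedProduct(byte a, byte b) {
    short a16 = (short) (a & 0xFF); // like ZERO_EXTEND_B2S: value in [0, 255]
    short b16 = (short) (b & 0xFF);
    int a32 = a16 & 0xFFFF;         // like ZERO_EXTEND_S2I
    int b32 = b16 & 0xFFFF;
    return a32 * b32;               // at most 65025, safe in 32 bits
  }

  public static void main(String[] args) {
    System.out.println(widenedProduct((byte) 0xFF, (byte) 0xFF)); // 65025
  }
}
```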

@uschindler (Contributor)

> Also, I'd recommend writing some tests, at least enough to know the code is viable. It is not clear to me that the vector methods are correct: if they do a 16-bit multiplication of two unsigned 8-bit integers and store the result in a signed short, it overflows.

There is a test missing in TestVectorUtilSupport that compares the results of the vectorized and standard implementations. Also, some basic tests using extreme vectors should be added because of the overflows.

As Robert says, I am quite sure that the current code overflows when vectorized if you have large values (like 0xFF). So please add a test that compares results (like we have for all the other methods).

@@ -164,6 +173,23 @@ public float cosine(byte[] a, byte[] b) {
return (float) (sum / Math.sqrt((double) norm1 * (double) norm2));
}

@Override
public float cosineUnsigned(byte[] a, byte[] b) {
// Note: this will not overflow if dim < 2^18, since max(byte * byte) = 2^14.
Member:

All these comments need to be adjusted if the final result is still signed Java values. If Integer.compareUnsigned is used correctly on the results of the int methods, and used before conversion to double, then that problem goes away.

@uschindler (Contributor) left a comment:

Add tests, and same comment as Robert's.


@rmuir (Member) commented Oct 19, 2023

> There is a test missing in TestVectorUtilSupport that compares the results of the vectorized and standard implementations. Also, some basic tests using extreme vectors should be added because of the overflows.
>
> As Robert says, I am quite sure that the current code overflows when vectorized if you have large values (like 0xFF). So please add a test that compares results (like we have for all the other methods).

Even if the code is fixed to always use ZERO_EXTEND_B2I, ZERO_EXTEND_B2S, and ZERO_EXTEND_S2I so that it does true unsigned math, there is the problem of the end result currently being defined as a 'signed integer'. It should be 'unsigned integer', but that leads to problems in Java-land: we have to worry that all code treats it correctly with Integer.compareUnsigned, and that the various "finalizer" functions acting on the result first cast to long, and so on. This is not happening in the scalar versions right now either.

edit: added ZERO_EXTEND_S2I, as that is critical to prevent the multiplication from overflowing.
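The hazard with a result that is "defined as unsigned" but stored in a Java int can be shown directly; any consumer that compares with `<` or widens with a plain cast silently gets the signed interpretation:

```java
// How an "unsigned" int result goes wrong with ordinary signed operations.
public class UnsignedResultSketch {

  static boolean signedLess(int a, int b) {
    return a < b; // wrong for unsigned values
  }

  static boolean unsignedLess(int a, int b) {
    return Integer.compareUnsigned(a, b) < 0; // correct for unsigned values
  }

  // Widening must zero-extend; a plain (long) cast sign-extends.
  static long widenUnsigned(int a) {
    return a & 0xFFFFFFFFL;
  }

  public static void main(String[] args) {
    int result = 0x80000000; // large unsigned value, negative as a signed int
    System.out.println(signedLess(result, 1));   // true (wrong)
    System.out.println(unsignedLess(result, 1)); // false (correct)
    System.out.println(widenUnsigned(result));   // 2147483648
  }
}
```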

@uschindler (Contributor)

P.S.: It is too bad that we have no C preprocessor so we could expand and inline the methods automatically. We could maybe write a Python script that generates the PanamaVectorUtilSupport class, but I think that's too much at the current state. It could be an option in the future, though the downside is that it is IDE-unfriendly.

@rmuir (Member) commented Oct 19, 2023

I also question whether this is the correct design with respect to the hardware. Look at the instruction support for doing this stuff, which uses signed bytes: https://www.felixcloutier.com/x86/vpdpbusd and, with saturation: https://www.felixcloutier.com/x86/vpdpbusds

It would be bad to pick a vector format that is not friendly to such instructions. Maybe it is saturation that you really want for better results, which would still be supported by the hardware, at least in theory. I don't know if we can coerce the Vector API into doing it today. Maybe it is possible to get something reasonable using xor/shift like the NumericUtils code :)
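The xor trick alluded to (as in Lucene's NumericUtils sortable encodings) flips the sign bit, mapping unsigned byte order onto signed byte order. Note this preserves comparisons, not products, so it would only help order-based uses; a sketch:

```java
// Sign-bit flip: maps unsigned [0, 255] onto signed [-128, 127],
// preserving ordering (the NumericUtils-style trick).
public class SignFlipSketch {

  static byte flip(byte b) {
    return (byte) (b ^ 0x80);
  }

  public static void main(String[] args) {
    System.out.println(flip((byte) 0x00)); // -128 (smallest unsigned -> smallest signed)
    System.out.println(flip((byte) 0xFF)); //  127 (largest unsigned -> largest signed)
  }
}
```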
