
Expand scalar quantization with adding half-byte (int4) quantization #13197

Merged
merged 24 commits into apache:main from the feature/improve-int8-quantization branch on Apr 2, 2024

Conversation

benwtrent
Member

@benwtrent benwtrent commented Mar 21, 2024

This PR is the culmination of several streams of work:

  • Confidence interval optimizations, unlocking even smaller quantization bytes.
  • The ability to quantize down smaller than just int8 or int7.
  • An optimized int4 (half-byte) vector API comparison for dot-product.

Further scalar quantization gives users the choice between:

  • Quantizing further to save space by compressing the quantized bits into single-byte values, or
  • Allowing quantization to give guarantees around maximal values, which affords faster vector operations.

I didn't add more Panama vector APIs, as I think trying to micro-optimize int4 for anything other than dot-product was a fool's errand. Additionally, I only focused on ARM. I experimented with trying to get better performance on other architectures, but didn't get very far, so I fall back to dotProduct.
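For readers following along, here is a minimal sketch in plain Java of the half-byte idea (illustrative only, not the actual Lucene implementation): two 4-bit quantized values share one byte, and the dot product unpacks the nibbles on the fly. The nibble layout (high nibble holds one dimension, low nibble the next) is an assumption for illustration.

// Minimal sketch: dot product over half-byte (int4) packed vectors.
// Each byte of a and b holds two quantized dimensions (values 0..15).
static int int4DotProductPacked(byte[] a, byte[] b) {
  int sum = 0;
  for (int i = 0; i < a.length; i++) {
    int aHi = (a[i] & 0xF0) >>> 4, aLo = a[i] & 0x0F;
    int bHi = (b[i] & 0xF0) >>> 4, bLo = b[i] & 0x0F;
    sum += aHi * bHi + aLo * bLo;
  }
  return sum;
}

Because every value fits in 4 bits, the intermediate products stay small, which is the kind of guarantee around maximal values that the optimized vector comparisons can exploit.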

@tteofili
Contributor

@benwtrent to me it makes sense to have the quantization bits configurable in this case.

Contributor

@jpountz jpountz left a comment


For that case, adding an option makes sense to me since it seems extremely similar to int8 scalar quantization.

// Unpack in place: the low nibble of each packed byte moves to the second half
// of the array, while the high nibble stays (shifted down) in the first half.
for (int i = 0; i < numBytes; ++i) {
  compressed[numBytes + i] = (byte) (compressed[i] & 0x0F);
  compressed[i] = (byte) ((compressed[i] & 0xFF) >> 4);
}
Contributor


Cool, this should get auto-vectorized on JDK13+.

Member Author


@jpountz it's pretty fast. This, combined with the Panama-optimized int4 vector comparison, keeps runtime faster than float32. However, doing this with only the int8 vector comparison makes us about the same speed or slightly slower than float32.

I am going to run a bunch more benchmarks once I get this all refactored and show all the numbers.

@benwtrent
Member Author

benwtrent commented Mar 26, 2024

I did a bunch of local benchmarking on this. I am adding a parameter to allow optional compression, as the numbers without compression are compelling enough on ARM to justify it IMO.

To achieve similar recall, int4 without compression is about 30% faster. With compression it's about 30% slower, but with 50% of the memory requirements.

Here are some latency vs. recall numbers for int7 and int4 with this change.

import matplotlib.pyplot as plt

plt.plot([2.01], [0.964], marker='x', markersize=10, label='f32')
plt.plot([1.49, 1.53, 1.54, 1.83, 2.09], [0.952, 0.962, 0.965, 0.974, 0.981], marker='o', label='int7')
plt.plot([1.72, 1.75, 1.79, 2.04, 2.48], [0.897, 0.915, 0.929, 0.971, 0.980], marker='o', label='int4_compressed')
plt.plot([1.08, 1.12, 1.12, 1.34, 1.50], [0.897, 0.915, 0.929, 0.971, 0.980], marker='o', label='int4')
plt.legend()
plt.show()

[plot: latency vs. recall for f32, int7, int4_compressed, and int4]

int4 with compression gives a 2x space improvement over int7, but it comes at an obvious cost, as we have to (un)pack bytes during dot-products.
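For a rough sense of scale, here is a back-of-the-envelope comparison, assuming a hypothetical 1024-dimensional vector and ignoring any small per-vector metadata:

int dims = 1024;                        // hypothetical dimension count
int float32Bytes = dims * Float.BYTES;  // 4096 bytes per vector
int int7Bytes = dims;                   // 1024 bytes: one byte per dimension
int int4PackedBytes = dims / 2;         //  512 bytes: two 4-bit values per byte, i.e. 2x smaller than int7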

Here are the numbers around index building as well. I committed every 1 MB to ensure merging occurred and that force-merging was adequately exercised.

Int4 no compression:

Indexed 500000 documents in 312090ms
Force merge done in: 76169 ms

Int4 compression:

Indexed 500000 documents in 326978ms
Force merge done in: 124961 ms

Int7:

Indexed 500000 documents in 344584ms
Force merge done in: 98311 ms

@benwtrent benwtrent changed the title New int4 scalar quantization Expand scalar quantization with adding half-byte (int4) quantization Mar 26, 2024
@benwtrent benwtrent marked this pull request as ready for review March 26, 2024 19:34
Contributor

@jpountz jpountz left a comment


I wonder if this feature should be more opinionated, e.g. should it only accept 4 and 7 as the number of bits? These look like the two most interesting numbers to me. And maybe we should enforce compression with 4 bits or less; I understand that there is a performance hit, but storing vectors in a wasteful way doesn't feel great.

@tteofili
Contributor

I tend to agree on being opinionated about the set of allowed configurations for the number of bits (4 and 7).
Given the speed-space trade-off for packing, I think it's useful to leave that as an option.
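A hypothetical sketch of what that opinionated validation could look like; the class and field names are illustrative only and are not the actual Lucene API:

// Illustrative only: restrict the quantization bits to 4 or 7.
final class ScalarQuantizationOptions {
  final int bits;          // 4 (half-byte) or 7
  final boolean compress;  // whether to pack two 4-bit values into one byte

  ScalarQuantizationOptions(int bits, boolean compress) {
    if (bits != 4 && bits != 7) {
      throw new IllegalArgumentException("bits must be 4 or 7, got: " + bits);
    }
    this.bits = bits;
    this.compress = compress;
  }
}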

Comment on lines 42 to 44
@Param({"1", "128", "207", "256", "300", "512", "702", "1024"})
@Param({"1024"})
Contributor


shouldn't we keep the other options too?

Member Author


Yeah, it was a mistake to commit this change; I was benchmarking :/

@@ -82,6 +82,16 @@ public void testCreateSortedIndex() throws IOException {
sortedTest.createBWCIndex();
}

public void testCreateInt8HNSWIndices() throws IOException {
Member Author


@jpountz because the change adds a version metadata difference to scalar-quantized HNSW, I added some BWC tests. I only built BWC indices for 8.10.1.

The reason for its own BWC class is that the codec here for the particular field isn't the default testing codec, and I didn't want to adjust the other tests unnecessarily.

@benwtrent benwtrent merged commit 07d3be5 into apache:main Apr 2, 2024
3 checks passed
@benwtrent benwtrent deleted the feature/improve-int8-quantization branch April 2, 2024 17:38
@benwtrent benwtrent added this to the 9.11.0 milestone Apr 2, 2024
benwtrent added a commit that referenced this pull request Apr 2, 2024
@jpountz
Contributor

jpountz commented Apr 4, 2024

@benwtrent I tried to fix the compilation on luceneutil at mikemccand/luceneutil@027146b. I could use your help to check if this is the right fix.

@benwtrent
Member Author

@jpountz I can add the parameters today and fix the compilation. I think your change is the correct one.
