
Expand scalar quantization with adding half-byte (int4) quantization #13197

Merged
merged 24 commits into apache:main from the feature/improve-int8-quantization branch on Apr 2, 2024

Conversation

benwtrent
Member

@benwtrent benwtrent commented Mar 21, 2024

This PR is the culmination of several streams of work:

  • Confidence interval optimizations, unlocking even smaller quantization bytes.
  • The ability to quantize down smaller than just int8 or int7.
  • An optimized int4 (half-byte) vector API comparison for dot-product.

Further scalar quantization gives users the choice between:

  • Quantizing further to save space by compressing the quantized bits into single-byte values, or
  • Allowing quantization to give guarantees around maximal values, which affords faster vector operations.

I didn't add more Panama vector APIs, as I think trying to micro-optimize int4 for anything other than dot-product was a fool's errand. Additionally, I only focused on ARM. I experimented with trying to get better performance on other architectures, but didn't get very far, so I fall back to dotProduct.
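For readers following along, here is a minimal sketch in plain Java of the half-byte idea (illustrative only, not the actual Lucene implementation): two 4-bit quantized values share one byte, and the dot product unpacks the nibbles on the fly. The nibble layout (high nibble holds one dimension, low nibble the next) is an assumption for illustration.

// Minimal sketch: dot product over half-byte (int4) packed vectors.
// Each byte of a and b holds two quantized dimensions (values 0..15).
static int int4DotProductPacked(byte[] a, byte[] b) {
  int sum = 0;
  for (int i = 0; i < a.length; i++) {
    int aHi = (a[i] & 0xF0) >>> 4, aLo = a[i] & 0x0F;
    int bHi = (b[i] & 0xF0) >>> 4, bLo = b[i] & 0x0F;
    sum += aHi * bHi + aLo * bLo;
  }
  return sum;
}

Because every value fits in 4 bits, the intermediate products stay small, which is the kind of guarantee around maximal values that the optimized vector comparisons can exploit.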

@tteofili
Contributor

@benwtrent to me it makes sense to have the quantization bits configurable in this case.

Contributor

@jpountz jpountz left a comment


For that case, adding an option makes sense to me since it seems extremely similar to int8 scalar quantization.

// Unpack in place: the low nibble of each packed byte moves to the second half
// of the array, while the high nibble stays (shifted down) in the first half.
for (int i = 0; i < numBytes; ++i) {
  compressed[numBytes + i] = (byte) (compressed[i] & 0x0F);
  compressed[i] = (byte) ((compressed[i] & 0xFF) >> 4);
}
Contributor


Cool, this should get auto-vectorized on JDK13+.

Member Author


@jpountz it's pretty fast. This, combined with the Panama-optimized int4 vector comparison, keeps runtime faster than float32. However, doing this with only the int8 vector comparison makes us about the same speed or slightly slower than float32.

I am going to run a bunch more benchmarks once I get this all refactored and show all the numbers.

@benwtrent
Member Author

benwtrent commented Mar 26, 2024

I did a bunch of local benchmarking on this. I am adding a parameter to allow optional compression, as the numbers without compression are compelling enough on ARM to justify it IMO.

To achieve similar recall, int4 without compression is about 30% faster. With compression it's about 30% slower, but with 50% of the memory requirements.

Here are some latency vs. recall numbers for int7 and int4 with this change.

import matplotlib.pyplot as plt

plt.plot([2.01], [0.964], marker='x', markersize=10, label='f32')
plt.plot([1.49, 1.53, 1.54, 1.83, 2.09], [0.952, 0.962, 0.965, 0.974, 0.981], marker='o', label='int7')
plt.plot([1.72, 1.75, 1.79, 2.04, 2.48], [0.897, 0.915, 0.929, 0.971, 0.980], marker='o', label='int4_compressed')
plt.plot([1.08, 1.12, 1.12, 1.34, 1.50], [0.897, 0.915, 0.929, 0.971, 0.980], marker='o', label='int4')
plt.legend()
plt.show()

[plot: latency vs. recall for f32, int7, int4_compressed, and int4]

int4 with compression gives a 2x space improvement over int7, but it comes at an obvious cost, as we have to (un)pack bytes during dot-products.
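For a rough sense of scale, here is a back-of-the-envelope comparison, assuming a hypothetical 1024-dimensional vector and ignoring any small per-vector metadata:

int dims = 1024;                        // hypothetical dimension count
int float32Bytes = dims * Float.BYTES;  // 4096 bytes per vector
int int7Bytes = dims;                   // 1024 bytes: one byte per dimension
int int4PackedBytes = dims / 2;         //  512 bytes: two 4-bit values per byte, i.e. 2x smaller than int7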

Here are the numbers around index building as well. I committed every 1 MB to ensure merging occurred and that force-merging was adequately exercised.

Int4 no compression:

Indexed 500000 documents in 312090ms
Force merge done in: 76169 ms

Int4 compression:

Indexed 500000 documents in 326978ms
Force merge done in: 124961 ms

Int7:

Indexed 500000 documents in 344584ms
Force merge done in: 98311 ms

@benwtrent benwtrent changed the title New int4 scalar quantization Expand scalar quantization with adding half-byte (int4) quantization Mar 26, 2024
@benwtrent benwtrent marked this pull request as ready for review March 26, 2024 19:34
Contributor

@jpountz jpountz left a comment


I wonder if this feature should be more opinionated, e.g. should it only accept 4 and 7 as the number of bits? These look like the two most interesting numbers to me. And maybe we should enforce compression with 4 bits or less; I understand that there is a performance hit, but storing vectors in a wasteful way doesn't feel great.

@tteofili
Contributor

I tend to agree on being opinionated about the set of allowed configurations for the number of bits (4 and 7).
Given the speed-space trade-off for packing, I think it's useful to leave that as an option.
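A hypothetical sketch of what that opinionated validation could look like; the class and field names are illustrative only and are not the actual Lucene API:

// Illustrative only: restrict the quantization bits to 4 or 7.
final class ScalarQuantizationOptions {
  final int bits;          // 4 (half-byte) or 7
  final boolean compress;  // whether to pack two 4-bit values into one byte

  ScalarQuantizationOptions(int bits, boolean compress) {
    if (bits != 4 && bits != 7) {
      throw new IllegalArgumentException("bits must be 4 or 7, got: " + bits);
    }
    this.bits = bits;
    this.compress = compress;
  }
}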

Comment on lines 42 to 44
@Param({"1", "128", "207", "256", "300", "512", "702", "1024"})
@Param({"1024"})
Contributor


shouldn't we keep the other options too?

Member Author


Yeah, it was a mistake to commit this change; I was benchmarking :/

@@ -82,6 +82,16 @@ public void testCreateSortedIndex() throws IOException {
sortedTest.createBWCIndex();
}

public void testCreateInt8HNSWIndices() throws IOException {
Member Author


@jpountz because the change adds a version metadata difference to scalar-quantized HNSW, I added some BWC tests. I only built BWC indices for 8.10.1.

The reason for its own BWC class is that the codec here for the particular field isn't the default testing codec, and I didn't want to adjust the other tests unnecessarily.

@benwtrent benwtrent merged commit 07d3be5 into apache:main Apr 2, 2024
3 checks passed
@benwtrent benwtrent deleted the feature/improve-int8-quantization branch April 2, 2024 17:38
@benwtrent benwtrent added this to the 9.11.0 milestone Apr 2, 2024
benwtrent added a commit that referenced this pull request Apr 2, 2024
@jpountz
Contributor

jpountz commented Apr 4, 2024

@benwtrent I tried to fix the compilation on luceneutil at mikemccand/luceneutil@027146b. I could use your help to check if this is the right fix.

@benwtrent
Member Author

@jpountz I can add the parameters today and fix the compilation. I think your change is the correct one.
