Expand scalar quantization with adding half-byte (int4) quantization #13197
Conversation
@benwtrent to me it makes sense to have quantization
For that case, adding an option makes sense to me since it seems extremely similar to int8 scalar quantization.
```java
for (int i = 0; i < numBytes; ++i) {
  compressed[numBytes + i] = (byte) (compressed[i] & 0x0F);
  compressed[i] = (byte) ((compressed[i] & 0xFF) >> 4);
}
```
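For context, the loop above is the unpacking half: it takes `numBytes` packed bytes and spreads them into `2 * numBytes` one-nibble-per-byte values, high nibbles into the first half and low nibbles into the second half. A minimal standalone sketch of both directions (hypothetical helper names, not Lucene's actual API):

```java
public class NibblePacking {
    // Pack 2 * numBytes nibble values (each in 0..15) into numBytes bytes:
    // the first half of the input becomes the high nibbles, the second half the low nibbles.
    static byte[] pack(byte[] nibbles) {
        int numBytes = nibbles.length / 2;
        byte[] packed = new byte[numBytes];
        for (int i = 0; i < numBytes; ++i) {
            packed[i] = (byte) ((nibbles[i] << 4) | (nibbles[numBytes + i] & 0x0F));
        }
        return packed;
    }

    // Unpack: mirrors the loop from the diff, but writes into a fresh array.
    static byte[] unpack(byte[] packed) {
        int numBytes = packed.length;
        byte[] out = new byte[2 * numBytes];
        for (int i = 0; i < numBytes; ++i) {
            out[numBytes + i] = (byte) (packed[i] & 0x0F);
            out[i] = (byte) ((packed[i] & 0xFF) >> 4);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] nibbles = {7, 3, 15, 0, 1, 9, 4, 12};
        byte[] roundTripped = unpack(pack(nibbles));
        System.out.println(java.util.Arrays.equals(nibbles, roundTripped)); // prints "true"
    }
}
```

The `& 0xFF` before the shift matters because Java bytes are signed: without it, a byte with its top bit set would sign-extend and the right shift would smear ones into the high nibble.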
Cool, this should get auto-vectorized on JDK13+.
@jpountz it's pretty fast. Combined with the Panama-optimized int4 vector comparison, this keeps runtime faster than float32. However, doing this with only the int8 vector comparison makes us about the same speed as, or slightly slower than, float32.
I am going to run a bunch more benchmarks once I get this all refactored and show all the numbers.
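As a rough illustration of why the packed representation costs something at query time, here is a plain scalar sketch of a dot product computed directly over packed int4 bytes (my own sketch for illustration; the PR's actual comparison is Panama-vectorized inside Lucene's internals):

```java
public class PackedInt4Dot {
    // Dot product of two vectors stored packed, two 4-bit values per byte.
    // Each byte's high nibble holds one dimension and its low nibble another,
    // so every byte must be split apart before multiplying.
    static int dotProductPacked(byte[] a, byte[] b) {
        int total = 0;
        for (int i = 0; i < a.length; i++) {
            total += (a[i] & 0x0F) * (b[i] & 0x0F);               // low nibbles
            total += ((a[i] & 0xFF) >> 4) * ((b[i] & 0xFF) >> 4); // high nibbles
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] a = {0x12, 0x34}; // nibble values 1, 2, 3, 4
        byte[] b = {0x11, 0x11}; // nibble values 1, 1, 1, 1
        System.out.println(dotProductPacked(a, b)); // prints 10
    }
}
```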
I did a bunch of local benchmarking on this. I am adding a parameter to allow optional compression, as the numbers without compression are compelling enough on ARM to justify it IMO. Here are some latency vs. recall numbers for int7 and int4 with this change.
int4 with compression gives a 2x space improvement over int7, but it comes at an obvious cost, as we have to (un)pack bytes during dot products. Here are the numbers around index building as well. I committed every 1MB to ensure merging occurred and that force-merging was adequately exercised. Int4 no compression:
Int4 compression:
Int7:
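To make the 2x space figure concrete (my own arithmetic, not numbers from the PR): int7 stores one quantized byte per dimension, while packed int4 fits two dimensions per byte (ignoring per-vector corrective floats):

```java
public class QuantizedSizes {
    // Bytes needed for one int7-quantized vector: one byte per dimension.
    static int int7Bytes(int dims) {
        return dims;
    }

    // Bytes needed for one packed int4 vector: two dimensions per byte, rounded up.
    static int packedInt4Bytes(int dims) {
        return (dims + 1) / 2;
    }

    public static void main(String[] args) {
        int dims = 1024;
        System.out.println(int7Bytes(dims));       // prints 1024
        System.out.println(packedInt4Bytes(dims)); // prints 512
    }
}
```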
I wonder if this feature should be more opinionated, e.g. should it only accept 4 and 7 as numbers of bits, these look like the two most interesting numbers to me? And maybe we should enforce compression with 4 bits or less, I understand that there is a performance hit, but storing vectors in a wasteful way doesn't feel great?
I tend to agree on being opinionated about the set of allowed configurations for the number of bits (4 and 7).
```java
- @Param({"1", "128", "207", "256", "300", "512", "702", "1024"})
+ @Param({"1024"})
```
shouldn't we keep the other options too?
Yeah, it was a mistake to commit this change; I was benchmarking :/
```java
@@ -82,6 +82,16 @@ public void testCreateSortedIndex() throws IOException {
     sortedTest.createBWCIndex();
   }

+  public void testCreateInt8HNSWIndices() throws IOException {
```
@jpountz because the change adds a version metadata difference to scalar-quantized HNSW, I added some BWC tests. I only built BWC indices for 8.10.1.
The reason for giving this its own BWC class is that the codec for this particular field isn't the default testing codec, and I didn't want to adjust the other tests unnecessarily.
…13197) This PR is a culmination of several streams of work:
- Confidence-interval optimizations, which unlocked even smaller quantization bytes.
- The ability to quantize down smaller than just int8 or int7.
- An optimized int4 (half-byte) vector API comparison for dot product.

The idea of further scalar quantization gives users the choice between:
- further quantizing to gain space by compressing the bits into single-byte values, or
- allowing quantization to give guarantees around maximal values that afford faster vector operations.

I didn't add more Panama vector APIs, as I think trying to micro-optimize int4 for anything other than dot product was a fool's errand. Additionally, I only focused on ARM. I experimented with trying to get better performance on other architectures but didn't get very far, so I fall back to dotProduct.
@benwtrent I tried to fix the compilation on luceneutil at mikemccand/luceneutil@027146b. I could use your help to check if this is the right fix.
@jpountz I can add the parameters today and fix the compilation. I think your change is the correct one.