add bulk off-heap scoring for uint8 quantized vectors using panama vector api #16203
Open
iprithv wants to merge 1 commit into
…ctor API

Adds 4-wide bulk dot product and square distance operations for uint8 quantized vectors using the Panama Vector API with MemorySegment-based data access. This reduces reduceLanes calls by 4x compared to single-vector scoring, which helps under icache pressure during HNSW graph traversal.

Key implementation details:
- Uint8DotProduct and Uint8SqrDistance inner classes in MemorySegmentBulkVectorOps with platform-specific widening:
  - 128-bit: bytes -> shorts -> ints (2 convertShape parts)
  - 256-bit: bytes -> ints directly
  - 512-bit: bytes -> shorts -> ints (single part)
- UINT8_NEEDS_PART1 guard prevents convertShape(..., 1) call on 512-bit where all 16 shorts fit in 16 ints in a single part
- Int4 bulk scoring is explicitly not implemented; int4 scorers fall back to the existing single-vector Panama/Native paths via !isUint8() guard in bulkScoreBody

Tests cover: basic uint8 (int7), large dimensions (128), odd dimensions (97), SIMD boundaries (15/16/17), tail paths (0-3 nodes), updateable scorer (MemorySegment query), and int4 fallback verification.

Benchmarks (AMD Ryzen 7 7800X3D, AVX-512):
- Float32 bulk: ~1.5x speedup (reference)
- Uint8 bulk (clean icache): ~5-7% regression
- Uint8 bulk (polluted icache): ~2x speedup

Int4 bulk was evaluated but benchmarked ~2.3x slower than single-vector and has been removed from this PR.
Adds 4-wide bulk dot product and square distance for uint8 quantized vectors using the Panama Vector API with MemorySegment-based access. Scoring four vectors per pass reduces reduceLanes calls by 4x compared to single-vector scoring, which helps under icache pressure during HNSW graph traversal.
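To illustrate the 4-wide idea, here is a minimal scalar sketch, not the code in this PR. It uses plain `byte[]` arrays instead of MemorySegment and ordinary loops instead of the Panama Vector API, and the class and method names (`Uint8BulkSketch`, `dotProductBulk4`, `squareDistanceBulk4`) are hypothetical. It shows the two points that matter: each byte is widened as unsigned (`& 0xFF`) before the multiply, and one pass over the query feeds four independent accumulators, so the final reduction happens once per batch of four rather than once per vector.

```java
import java.util.Arrays;

// Hypothetical scalar reference for 4-wide bulk uint8 scoring (illustration only).
final class Uint8BulkSketch {

  // Single-vector dot product; bytes are treated as unsigned values in 0..255.
  static int dotProduct(byte[] q, byte[] d) {
    int acc = 0;
    for (int i = 0; i < q.length; i++) {
      acc += (q[i] & 0xFF) * (d[i] & 0xFF);
    }
    return acc;
  }

  // Scores one query against four vectors in a single pass over the query.
  static void dotProductBulk4(
      byte[] q, byte[] d0, byte[] d1, byte[] d2, byte[] d3, int[] out) {
    int a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    for (int i = 0; i < q.length; i++) {
      int qi = q[i] & 0xFF; // widen the query byte once, reuse it four times
      a0 += qi * (d0[i] & 0xFF);
      a1 += qi * (d1[i] & 0xFF);
      a2 += qi * (d2[i] & 0xFF);
      a3 += qi * (d3[i] & 0xFF);
    }
    out[0] = a0;
    out[1] = a1;
    out[2] = a2;
    out[3] = a3;
  }

  // Same structure for squared Euclidean distance.
  static void squareDistanceBulk4(
      byte[] q, byte[] d0, byte[] d1, byte[] d2, byte[] d3, int[] out) {
    int a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    for (int i = 0; i < q.length; i++) {
      int qi = q[i] & 0xFF;
      int t0 = qi - (d0[i] & 0xFF);
      int t1 = qi - (d1[i] & 0xFF);
      int t2 = qi - (d2[i] & 0xFF);
      int t3 = qi - (d3[i] & 0xFF);
      a0 += t0 * t0;
      a1 += t1 * t1;
      a2 += t2 * t2;
      a3 += t3 * t3;
    }
    out[0] = a0;
    out[1] = a1;
    out[2] = a2;
    out[3] = a3;
  }

  public static void main(String[] args) {
    byte[] q = {1, 2, (byte) 200};
    byte[] d0 = {3, 4, (byte) 255};
    byte[] d1 = {0, 0, 0};
    byte[] d2 = {1, 2, (byte) 200};
    byte[] d3 = {(byte) 255, (byte) 255, (byte) 255};
    int[] out = new int[4];
    dotProductBulk4(q, d0, d1, d2, d3, out);
    System.out.println("dot: " + Arrays.toString(out));
    squareDistanceBulk4(q, d0, d1, d2, d3, out);
    System.out.println("sqr: " + Arrays.toString(out));
  }
}
```

In the vectorized version described above, each accumulator would be a SIMD register and the per-lane sums would be collapsed with reduceLanes at the end, which is where the 4x reduction in reduceLanes calls comes from.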
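Benchmarks (AMD Ryzen 7 7800X3D, AVX-512):
- Float32 bulk: ~1.5x speedup (reference)
- Uint8 bulk (clean icache): ~5-7% regression
- Uint8 bulk (polluted icache): ~2x speedup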
related: #15155 #15257 #14980
I also tried this for int4 bulk, but it was ~2.3x slower on AVX-512 due to nibble unpacking overhead.