
add bulk off-heap scoring for uint8 quantized vectors using panama vector api#16203

Open: iprithv wants to merge 1 commit into apache:main from iprithv:bulk-offheap-scoring

@iprithv iprithv commented Jun 5, 2026

Added 4-wide bulk dot product and square distance for uint8 quantized vectors using the Panama Vector API with MemorySegment-based access. This reduces reduceLanes calls by 4x compared to single-vector scoring, which helps under icache pressure during HNSW graph traversal.
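
For reviewers who want an oracle to compare against, here is a minimal scalar sketch of what a 4-wide bulk scorer computes: one query scored against four candidates in a single pass, with bytes read as unsigned values (0..255). It deliberately uses plain `byte[]` arrays instead of the Vector API and MemorySegment, and the class and method names are hypothetical, not taken from this PR.

```java
// Scalar reference sketch. Names are hypothetical and not from the PR;
// the real code uses the Panama Vector API over MemorySegment.
final class Uint8BulkReference {

  /** Dot products of one query against four candidates, bytes treated as unsigned. */
  static int[] dotProductBulk4(byte[] q, byte[] v0, byte[] v1, byte[] v2, byte[] v3) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < q.length; i++) {
      // The query element is read once and reused for all four candidates.
      int qi = Byte.toUnsignedInt(q[i]);
      s0 += qi * Byte.toUnsignedInt(v0[i]);
      s1 += qi * Byte.toUnsignedInt(v1[i]);
      s2 += qi * Byte.toUnsignedInt(v2[i]);
      s3 += qi * Byte.toUnsignedInt(v3[i]);
    }
    return new int[] {s0, s1, s2, s3};
  }

  /** Squared Euclidean distances of one query against four candidates. */
  static int[] squareDistanceBulk4(byte[] q, byte[] v0, byte[] v1, byte[] v2, byte[] v3) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < q.length; i++) {
      int qi = Byte.toUnsignedInt(q[i]);
      int d0 = qi - Byte.toUnsignedInt(v0[i]);
      int d1 = qi - Byte.toUnsignedInt(v1[i]);
      int d2 = qi - Byte.toUnsignedInt(v2[i]);
      int d3 = qi - Byte.toUnsignedInt(v3[i]);
      s0 += d0 * d0;
      s1 += d1 * d1;
      s2 += d2 * d2;
      s3 += d3 * d3;
    }
    return new int[] {s0, s1, s2, s3};
  }
}
```

A SIMD implementation should return the same four integers for the same inputs, which makes this a convenient baseline for randomized tests.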

benchmarks (amd ryzen 7 7800x3d, avx-512):

  • float32 bulk: ~1.5x speedup
  • uint8 bulk (clean icache): ~5-7% slower
  • uint8 bulk (polluted icache): ~2x faster

related: #15155 #15257 #14980
Also, I tried this for int4 bulk but it was ~2.3x slower on AVX-512 due to nibble unpacking overhead, so it is not included in this PR.

@github-actions github-actions Bot added this to the 10.5.0 milestone Jun 5, 2026
@iprithv iprithv force-pushed the bulk-offheap-scoring branch 4 times, most recently from 2336b8b to 94e8a44 Compare June 5, 2026 18:29
@iprithv iprithv changed the title Add bulk off-heap scoring for quantized vectors using Panama Vector API. add bulk off-heap scoring for uint8 quantized vectors using panama vector api Jun 5, 2026
@iprithv iprithv force-pushed the bulk-offheap-scoring branch 3 times, most recently from c47bf55 to 8a127b8 Compare June 5, 2026 18:54
@iprithv iprithv marked this pull request as ready for review June 5, 2026 19:09
…ctor API

Adds 4-wide bulk dot product and square distance operations for uint8
quantized vectors using the Panama Vector API with MemorySegment-based
data access. This reduces reduceLanes calls by 4x compared to single-
vector scoring, which helps under icache pressure during HNSW graph
traversal.

Key implementation details:
- Uint8DotProduct and Uint8SqrDistance inner classes in
  MemorySegmentBulkVectorOps with platform-specific widening:
  - 128-bit: bytes -> shorts -> ints (2 convertShape parts)
  - 256-bit: bytes -> ints directly
  - 512-bit: bytes -> shorts -> ints (single part)
- UINT8_NEEDS_PART1 guard prevents convertShape(..., 1) call on 512-bit
  where all 16 shorts fit in 16 ints in a single part
- Int4 bulk scoring is explicitly not implemented; int4 scorers fall
  back to the existing single-vector Panama/Native paths via !isUint8()
  guard in bulkScoreBody
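
The part counts in the widening list above follow from simple lane arithmetic, sketched below. This is only an illustration of the reasoning, not the PR's code: the helper names are hypothetical, and the 128-bit case assumes 8 shorts are widened into 4-lane int vectors, which is an inference from "2 convertShape parts" rather than something stated in the commit message.

```java
// Lane-arithmetic sketch. Names are hypothetical; lane counts for the
// 128-bit case are an assumption inferred from the commit message.
final class Uint8WideningSketch {

  /** Number of int vectors (convertShape parts) needed to hold shortLanes widened shorts. */
  static int intParts(int shortLanes, int intLanes) {
    return (shortLanes + intLanes - 1) / intLanes; // ceiling division
  }

  /** True when a second part (index 1) exists, i.e. when part 1 must also be converted. */
  static boolean needsPart1(int shortLanes, int intLanes) {
    return intParts(shortLanes, intLanes) > 1;
  }
}
```

With 8 shorts and 4 int lanes (128-bit) this gives two parts, so part 1 is needed; with 16 shorts and 16 int lanes (512-bit) it gives one part, which matches the case the UINT8_NEEDS_PART1 guard is described as protecting.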

Tests cover: basic uint8 (int7), large dimensions (128), odd dimensions
(97), SIMD boundaries (15/16/17), tail paths (0-3 nodes), updateable
scorer (MemorySegment query), and int4 fallback verification.
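
The tail-path cases can be pictured with the sketch below: nodes are scored in blocks of four, and the remaining 0-3 nodes go through a single-vector path. It is a scalar stand-in under stated assumptions (plain `byte[]` data, hypothetical class and method names) and is not the PR's test or scorer code.

```java
// Tail-handling sketch. Names are hypothetical; scalar loops stand in for
// the SIMD bulk and single-vector scorers.
final class BulkTailSketch {
  static final int BULK = 4;

  /** Single-vector unsigned dot product, used for the tail. */
  static int dot(byte[] q, byte[] v) {
    int sum = 0;
    for (int i = 0; i < q.length; i++) {
      sum += Byte.toUnsignedInt(q[i]) * Byte.toUnsignedInt(v[i]);
    }
    return sum;
  }

  /** Scores four consecutive nodes starting at offset in one pass over the dimensions. */
  static void dotBulk4(byte[] q, byte[][] nodes, int offset, int[] scores) {
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < q.length; i++) {
      int qi = Byte.toUnsignedInt(q[i]);
      s0 += qi * Byte.toUnsignedInt(nodes[offset][i]);
      s1 += qi * Byte.toUnsignedInt(nodes[offset + 1][i]);
      s2 += qi * Byte.toUnsignedInt(nodes[offset + 2][i]);
      s3 += qi * Byte.toUnsignedInt(nodes[offset + 3][i]);
    }
    scores[offset] = s0;
    scores[offset + 1] = s1;
    scores[offset + 2] = s2;
    scores[offset + 3] = s3;
  }

  /** Number of nodes left over after all full blocks of four. */
  static int tailCount(int nodeCount) {
    return nodeCount % BULK;
  }

  /** Scores every node: full blocks via the bulk path, the last 0-3 nodes one at a time. */
  static int[] scoreAll(byte[] q, byte[][] nodes) {
    int[] scores = new int[nodes.length];
    int bulkLimit = nodes.length - tailCount(nodes.length);
    int i = 0;
    for (; i < bulkLimit; i += BULK) {
      dotBulk4(q, nodes, i, scores);
    }
    for (; i < nodes.length; i++) {
      scores[i] = dot(q, nodes[i]); // tail path
    }
    return scores;
  }
}
```

Checking `scoreAll` against `dot` for node counts around multiples of four, and for dimensions such as 15, 16, 17, and 97, exercises the same boundaries the test list names.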

Benchmarks (AMD Ryzen 7 7800X3D, AVX-512):
- Float32 bulk: ~1.5x speedup (reference)
- Uint8 bulk (clean icache): ~5-7% regression
- Uint8 bulk (polluted icache): ~2x speedup

Int4 bulk was evaluated but benchmarked ~2.3x slower than single-vector
and has been removed from this PR.
@iprithv iprithv force-pushed the bulk-offheap-scoring branch from 8a127b8 to fdab794 Compare June 5, 2026 19:11