Conversation

@benwtrent (Member)

This provides a minor but measurable speed improvement.

JMH shows a nicer story:

this PR

Int7ScorerBenchmark.scoreFromMemorySegmentBulk     384  thrpt    5  57.738 ± 0.510  ops/ms
Int7ScorerBenchmark.scoreFromMemorySegmentBulk     782  thrpt    5  25.804 ± 0.196  ops/ms
Int7ScorerBenchmark.scoreFromMemorySegmentBulk    1024  thrpt    5  23.813 ± 2.751  ops/ms

baseline:

Int7ScorerBenchmark.scoreFromMemorySegmentBulk     384  thrpt    5  35.412 ± 0.202  ops/ms
Int7ScorerBenchmark.scoreFromMemorySegmentBulk     782  thrpt    5  20.663 ± 0.521  ops/ms
Int7ScorerBenchmark.scoreFromMemorySegmentBulk    1024  thrpt    5  19.765 ± 1.296  ops/ms

I will need help rebuilding and publishing the binaries :)

@elasticsearchmachine added the v9.3.0 and Team:Search Relevance (meta label for the Search Relevance team in Elasticsearch) labels Nov 17, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine (Collaborator)

Hi @benwtrent, I've created a changelog YAML for you.

if (dims > DOT7U_STRIDE_BYTES_LEN) {
    for (size_t c = 0; c < count; c++) {
        int i = 0;
        i += dims & ~(DOT7U_STRIDE_BYTES_LEN - 1);
Contributor

Nit: this could be extracted out of the loop. I'd expect a decent optimizer to do that, but you never know :)
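For illustration, a minimal sketch of what hoisting that masked block length out of the per-vector loop might look like. This is not the PR's code: the `DOT7U_STRIDE_BYTES_LEN` value and the helper names are assumptions, and a scalar loop stands in for the real SIMD inner kernel.

```c
#include <stddef.h>
#include <stdint.h>

#define DOT7U_STRIDE_BYTES_LEN 32 /* assumption: one SIMD stride in bytes */

/* Scalar stand-in for the real SIMD inner kernel. */
int32_t dot7u_inner_scalar(const int8_t* a, const int8_t* b, size_t n) {
    int32_t res = 0;
    for (size_t i = 0; i < n; i++) {
        res += a[i] * b[i];
    }
    return res;
}

/* The masked length depends only on dims, so compute it once, not per vector. */
void dot7u_bulk_hoisted(const int8_t* a, const int8_t* b, size_t dims,
                        size_t count, float* results) {
    const size_t blk = dims & ~(size_t)(DOT7U_STRIDE_BYTES_LEN - 1);
    for (size_t c = 0; c < count; c++) {
        int32_t res = dot7u_inner_scalar(a, b, blk);
        for (size_t i = blk; i < dims; i++) { /* scalar tail */
            res += a[i] * b[i];
        }
        results[c] = (float)res;
        a += dims; /* advance to the next stored vector */
    }
}
```

Note that with `blk` hoisted, the `dims > DOT7U_STRIDE_BYTES_LEN` branch can also go away: when `dims` is below one stride, `blk` is 0 and the scalar tail handles the whole vector.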

@ldematte (Contributor) left a comment

Looks good to me; of course we'd need to build and publish the native lib and update it in libs/native/libraries/build.gradle.
I think what we did in the past was to raise a separate PR with the native code changes (basically, the libs/simdvec/native files), get it approved/merged, publish the library, and then do the Java changes/Gradle version bump in a separate PR.

@ldematte (Contributor)

ldematte commented Nov 18, 2025

An alternative is to do everything within the same PR:

  1. bump the native library version in the library build
  2. build and publish the native library to Artifactory
  3. bump the reference to the native library to the new version number
  4. build (or let CI build it)

The pro of this approach is that the changes to the native lib are immediately tested in the Java layer.
The con is that, if you need to iterate, you have to repeat the process with a new version.

Let me know if you prefer this approach, and I can take care of steps 1-3 for you (or give you more detailed instructions).

@benwtrent (Member, Author)

Looks good to me; of course we'd need to build and publish the native lib and update it in libs/native/libraries/build.gradle.
I think what we did in the past was to raise a separate PR with the native code changes (basically, the libs/simdvec/native files), get it approved/merged, publish the library, and then do the Java changes/Gradle version bump in a separate PR.

I think a separate PR is fine for now (I can open that). But this does show that we need a nicer process (possibly a tag or CI job?) that publishes and uses the binaries without impacting main until things are merged.

benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Nov 18, 2025
@benwtrent (Member, Author)

native PR: #138239

@iverase (Contributor)

iverase commented Nov 18, 2025

I love this; it should leverage the capacity of bulk scoring in DiskBBQ. I am +1 here, but I'll let Lorenzo review the native stuff. Otherwise LGTM.

@ChrisHegarty (Contributor)

Very nice!

This type of bulk scoring - against every vector in a contiguous block - is good for this particular use case (no issue with this). I mention this because I did think that prefetching could help, but I imagine it would help more for the Lucene bulkScore - scoring against a given set of ordinals in the dataset.

A quick check on my Mac laptop shows just a little improvement!?

Int7ScorerBenchmark.scoreFromMemorySegmentBulk     384  thrpt    5  58.981 ± 2.602  ops/ms
Int7ScorerBenchmark.scoreFromMemorySegmentBulk     782  thrpt    5  26.450 ± 0.122  ops/ms
Int7ScorerBenchmark.scoreFromMemorySegmentBulk    1024  thrpt    5  24.411 ± 0.229  ops/ms
EXPORT void dot7u_bulk(int8_t* a, int8_t* b, size_t dims, size_t count, float_t* results) {
    for (size_t c = 0; c < count; c++) {
        // Prefetch next vector of 'a' for the *next* iteration
        if (c + 1 < count) {
            __builtin_prefetch(a + dims, 0, 3);   // read, high locality
        }
        int32_t res = 0;
        if (dims > DOT7U_STRIDE_BYTES_LEN) {
            size_t i = 0;
            size_t blk = dims & ~(DOT7U_STRIDE_BYTES_LEN - 1);
            res = dot7u_inner(a, b, blk);
            for (i = blk; i < dims; i++) {
                res += a[i] * b[i];
            }
        } else {
            for (size_t i = 0; i < dims; i++) {
                res += a[i] * b[i];
            }
        }
        results[c] = (float_t)res;
        a += dims;
    }
}

@benwtrent (Member, Author)

I can add

        if (c + 1 < count) {
            __builtin_prefetch(a + dims, 0, 3);   // read, high locality
        }

@ldematte (Contributor)

Prefetching is a strange beast :)
If I understood it correctly (but please challenge me on this point), in this particular case we are scanning all the vectors in the input linearly; in that case the "regular" prefetching done by the CPU should already be optimal.

I think that prefetching could help in cases where we don't scan the 2 "matrices" (arrays of vectors) linearly, but through an array of offsets. Even in that case I would benchmark it, as I'm not sure it would help much if the access is really sparse/random, or whether it's better to "hint" the compiler/processor to prefetch, e.g. by manually unrolling part of the loop (so that the processor "knows" to go and fetch some data while it's still processing the current vector).
In any case, I'd leave this for a follow-up (e.g. when we introduce random access to the matrices). Again, assuming I understood this correctly :)
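To make that follow-up concrete, here is a hedged sketch (the function name, the `PREFETCH_AHEAD` distance, and the memory layout are all assumptions, not the PR's code) of what a software prefetch might look like once vectors are visited through an array of ordinals, where the hardware prefetcher has no linear stream to lock onto:

```c
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_AHEAD 4 /* iterations to look ahead; worth tuning by benchmark */

/* Score query q against vectors picked out by ords[], not a contiguous scan. */
void dot7u_bulk_ordinals(const int8_t* vectors, const int8_t* q, size_t dims,
                         const size_t* ords, size_t count, float* results) {
    for (size_t c = 0; c < count; c++) {
#if defined(__GNUC__) || defined(__clang__)
        if (c + PREFETCH_AHEAD < count) {
            /* Hint a vector a few iterations ahead; with arbitrary ordinals the
               CPU cannot predict this access pattern on its own. */
            __builtin_prefetch(vectors + ords[c + PREFETCH_AHEAD] * dims, 0, 3);
        }
#endif
        const int8_t* a = vectors + ords[c] * dims;
        int32_t res = 0;
        for (size_t i = 0; i < dims; i++) { /* scalar dot for clarity */
            res += a[i] * q[i];
        }
        results[c] = (float)res;
    }
}
```

Whether a look-ahead distance of 4 (or any software prefetch at all) wins here is exactly the kind of question only a benchmark can settle.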

@ldematte (Contributor)

(so that the processor "knows" to go and fetch some data while it's still processing the current vector)

To be clear: this should be exactly what the lines suggested by @benwtrent would do, but sometimes the internal prefetch logic "knows better" and makes better decisions if left alone (e.g. whether it actually wants to prefetch or not, to avoid thrashing). The only way to know is to benchmark, IMO.

@benwtrent (Member, Author)

I will happily leave it for now :D. It's fun that we already get such great results with just this simple change.

benwtrent added a commit that referenced this pull request Nov 18, 2025
* Adding native code related to (#138204)

* Bump simdvec native lib build/publish VERSION

---------

Co-authored-by: Lorenzo Dematte <lorenzo.dematte@elastic.co>
@benwtrent benwtrent requested a review from ldematte November 18, 2025 18:55
@ChrisHegarty (Contributor) left a comment

LGTM

@benwtrent added the auto-merge-without-approval (automatically merge pull request when CI checks pass; NB: doesn't wait for reviews!) label Nov 18, 2025
@elasticsearchmachine elasticsearchmachine merged commit 8a839dc into elastic:main Nov 18, 2025
34 checks passed
@benwtrent benwtrent deleted the feature/add-bulk-int7-centroid-scoring branch November 18, 2025 20:00
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Nov 19, 2025
john-wagster pushed a commit that referenced this pull request Nov 19, 2025

Labels

auto-merge-without-approval, >enhancement, :Search Relevance/Vectors (vector search), Team:Search Relevance (meta label for the Search Relevance team in Elasticsearch), v9.3.0
