Conversation

@benwtrent (Member)

This provides a minor but measurable speed improvement.

JMH shows a nicer story:

this PR

Int7ScorerBenchmark.scoreFromMemorySegmentBulk     384  thrpt    5  57.738 ± 0.510  ops/ms
Int7ScorerBenchmark.scoreFromMemorySegmentBulk     782  thrpt    5  25.804 ± 0.196  ops/ms
Int7ScorerBenchmark.scoreFromMemorySegmentBulk    1024  thrpt    5  23.813 ± 2.751  ops/ms

baseline:

Int7ScorerBenchmark.scoreFromMemorySegmentBulk     384  thrpt    5  35.412 ± 0.202  ops/ms
Int7ScorerBenchmark.scoreFromMemorySegmentBulk     782  thrpt    5  20.663 ± 0.521  ops/ms
Int7ScorerBenchmark.scoreFromMemorySegmentBulk    1024  thrpt    5  19.765 ± 1.296  ops/ms

I will need help rebuilding and publishing the binaries :)

@elasticsearchmachine added the v9.3.0 and Team:Search Relevance (meta label for the Search Relevance team in Elasticsearch) labels Nov 17, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine (Collaborator)

Hi @benwtrent, I've created a changelog YAML for you.

if (dims > DOT7U_STRIDE_BYTES_LEN) {
    for (size_t c = 0; c < count; c++) {
        int i = 0;
        i += dims & ~(DOT7U_STRIDE_BYTES_LEN - 1);
Contributor

Nit: this could be extracted out of the loop. I'd expect a decent optimizer to do that, but you never know :)
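For illustration, a minimal sketch of what hoisting that masked block length out of the per-vector loop might look like. This is not the PR's code: the `DOT7U_STRIDE_BYTES_LEN` value and the helper names are assumptions, and a scalar loop stands in for the real SIMD inner kernel.

```c
#include <stddef.h>
#include <stdint.h>

#define DOT7U_STRIDE_BYTES_LEN 32 /* assumption: one SIMD stride in bytes */

/* Scalar stand-in for the real SIMD inner kernel. */
int32_t dot7u_inner_scalar(const int8_t* a, const int8_t* b, size_t n) {
    int32_t res = 0;
    for (size_t i = 0; i < n; i++) {
        res += a[i] * b[i];
    }
    return res;
}

/* The masked length depends only on dims, so compute it once, not per vector. */
void dot7u_bulk_hoisted(const int8_t* a, const int8_t* b, size_t dims,
                        size_t count, float* results) {
    const size_t blk = dims & ~(size_t)(DOT7U_STRIDE_BYTES_LEN - 1);
    for (size_t c = 0; c < count; c++) {
        int32_t res = dot7u_inner_scalar(a, b, blk);
        for (size_t i = blk; i < dims; i++) { /* scalar tail */
            res += a[i] * b[i];
        }
        results[c] = (float)res;
        a += dims; /* advance to the next stored vector */
    }
}
```

Note that with `blk` hoisted, the `dims > DOT7U_STRIDE_BYTES_LEN` branch can also go away: when `dims` is below one stride, `blk` is 0 and the scalar tail handles the whole vector.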

@ldematte (Contributor) left a comment

Looks good to me; of course we'd need to build and publish the native lib and update it in libs/native/libraries/build.gradle.
I think what we did in the past was to raise a separate PR with the native code changes (basically, the libs/simdvec/native files), get it approved/merged, publish the library, and then do the Java changes/Gradle version bump in a separate PR.

@ldematte (Contributor)

ldematte commented Nov 18, 2025

An alternative is to do everything within the same PR:

  1. bump the native library version in the library build
  2. build and publish the native library to Artifactory
  3. bump the reference to the native library to the new version number
  4. build (or let CI build it)

The pro of this approach is that the changes to the native lib are immediately tested in the Java layer.
The con is that, if you need to iterate, you have to repeat the process with a new version.

Let me know if you prefer this approach, and I can take care of steps 1-3 for you (or give you more detailed instructions).

@benwtrent (Member, Author)

Looks good to me; of course we'd need to build and publish the native lib and update it in libs/native/libraries/build.gradle.
I think what we did in the past was to raise a separate PR with the native code changes (basically, the libs/simdvec/native files), get it approved/merged, publish the library, and then do the Java changes/Gradle version bump in a separate PR.

I think a separate PR is fine for now (I can open that). But this does show that we need a nicer process (possibly a tag or CI job?) that publishes and uses the binaries without impacting main until things are merged.

benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Nov 18, 2025
@benwtrent (Member, Author)

native PR: #138239

@iverase (Contributor)

iverase commented Nov 18, 2025

I love this; it should leverage the capacity of bulk scoring in DiskBBQ. I am +1 here, but I'll let Lorenzo review the native stuff. Otherwise LGTM.

@ChrisHegarty (Contributor)

Very nice!

This type of bulk scoring - against every vector in a contiguous block - is good for this particular use case (no issue with this). I mention this because I did think that prefetching could help, but I imagine it would help more for the Lucene bulkScore - scoring against a given set of ordinals in the dataset.

A quick check on my Mac laptop shows just a little improvement!?

Int7ScorerBenchmark.scoreFromMemorySegmentBulk     384  thrpt    5  58.981 ± 2.602  ops/ms
Int7ScorerBenchmark.scoreFromMemorySegmentBulk     782  thrpt    5  26.450 ± 0.122  ops/ms
Int7ScorerBenchmark.scoreFromMemorySegmentBulk    1024  thrpt    5  24.411 ± 0.229  ops/ms
EXPORT void dot7u_bulk(int8_t* a, int8_t* b, size_t dims, size_t count, float_t* results) {
    for (size_t c = 0; c < count; c++) {
        // Prefetch next vector of 'a' for the *next* iteration
        if (c + 1 < count) {
            __builtin_prefetch(a + dims, 0, 3);   // read, high locality
        }
        int32_t res = 0;
        if (dims > DOT7U_STRIDE_BYTES_LEN) {
            size_t i = 0;
            size_t blk = dims & ~(DOT7U_STRIDE_BYTES_LEN - 1);
            res = dot7u_inner(a, b, blk);
            for (i = blk; i < dims; i++) {
                res += a[i] * b[i];
            }
        } else {
            for (size_t i = 0; i < dims; i++) {
                res += a[i] * b[i];
            }
        }
        results[c] = (float_t)res;
        a += dims;
    }
}

@benwtrent (Member, Author)

I can add

        if (c + 1 < count) {
            __builtin_prefetch(a + dims, 0, 3);   // read, high locality
        }

@ldematte (Contributor)

Prefetching is a strange beast :)
If I understood it correctly (but please challenge me on this point), in this particular case we are scanning all the vectors in the input linearly; in that case the "regular" prefetching done by the CPU should already be optimal.

I think that prefetching could help in cases where we don't scan the 2 "matrices" (arrays of vectors) linearly, but through an array of offsets. Even in that case I would benchmark it, as I'm not sure it would help much if the access is really sparse/random, or whether it's better to "hint" the compiler/processor to prefetch, e.g. by manually unrolling part of the loop (so that the processor "knows" to go and fetch some data while it's still processing the current vector).
In any case, I'd leave this for a follow-up (e.g. when we introduce random access to the matrices). Again, assuming I understood this correctly :)
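To make that follow-up concrete, here is a hedged sketch (the function name, the `PREFETCH_AHEAD` distance, and the memory layout are all assumptions, not the PR's code) of what a software prefetch might look like once vectors are visited through an array of ordinals, where the hardware prefetcher has no linear stream to lock onto:

```c
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_AHEAD 4 /* iterations to look ahead; worth tuning by benchmark */

/* Score query q against vectors picked out by ords[], not a contiguous scan. */
void dot7u_bulk_ordinals(const int8_t* vectors, const int8_t* q, size_t dims,
                         const size_t* ords, size_t count, float* results) {
    for (size_t c = 0; c < count; c++) {
#if defined(__GNUC__) || defined(__clang__)
        if (c + PREFETCH_AHEAD < count) {
            /* Hint a vector a few iterations ahead; with arbitrary ordinals the
               CPU cannot predict this access pattern on its own. */
            __builtin_prefetch(vectors + ords[c + PREFETCH_AHEAD] * dims, 0, 3);
        }
#endif
        const int8_t* a = vectors + ords[c] * dims;
        int32_t res = 0;
        for (size_t i = 0; i < dims; i++) { /* scalar dot for clarity */
            res += a[i] * q[i];
        }
        results[c] = (float)res;
    }
}
```

Whether a look-ahead distance of 4 (or any software prefetch at all) wins here is exactly the kind of question only a benchmark can settle.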

@ldematte (Contributor)

(so that the processor "knows" to go and fetch some data while it's still processing the current vector)

To be clear: this should be exactly what the lines suggested by @benwtrent would do, but sometimes the internal prefetch logic "knows better" and makes better decisions if left alone (e.g. whether it actually wants to prefetch or not, to avoid thrashing). The only way to know is to benchmark, IMO.

@benwtrent (Member, Author)

I will happily leave it for now :D. It's fun that we already get such great results with just this simple change.

benwtrent added a commit that referenced this pull request Nov 18, 2025
* Adding native code related to (#138204)

* Bump simdvec native lib build/publish VERSION

---------

Co-authored-by: Lorenzo Dematte <lorenzo.dematte@elastic.co>
@benwtrent benwtrent requested a review from ldematte November 18, 2025 18:55
@ChrisHegarty (Contributor) left a comment

LGTM

@benwtrent added the auto-merge-without-approval (automatically merge pull request when CI checks pass; NB: doesn't wait for reviews!) label Nov 18, 2025
@elasticsearchmachine elasticsearchmachine merged commit 8a839dc into elastic:main Nov 18, 2025
34 checks passed
@benwtrent benwtrent deleted the feature/add-bulk-int7-centroid-scoring branch November 18, 2025 20:00
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Nov 19, 2025
john-wagster pushed a commit that referenced this pull request Nov 19, 2025

Labels

auto-merge-without-approval, >enhancement, :Search Relevance/Vectors (vector search), Team:Search Relevance (meta label for the Search Relevance team in Elasticsearch), v9.3.0
