Optimize BytesArray::indexOf, which is used heavily in ndjson parsing #135087
Conversation
Pinging @elastic/es-perf (Team:Performance)
Pinging @elastic/es-core-infra (Team:Core/Infra)
Hi @ChrisHegarty, I've created a changelog YAML for you.
libs/simdvec/src/main/java/org/elasticsearch/simdvec/internal/vectorization/ByteArrayUtils.java
The numbers on Skylake are great! I will run it on my macbook :).
Would be neat to see numbers on cloud-focused ARM, though I imagine it's still pretty impressive.
Seems like maybe we don't need to check the length at all? This is surprising to me. But for index size of 64 (way too small to be…)

Running on my macbook (128 arm) and it's looking nice!

ha! Thanks for checking this. Lemme add those sizes to the benchmark and do some more runs. But otherwise it looks like we don't need to fall back to the old code, as it's slower.
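For reference, the scalar path being compared against here (the "old code") is essentially a linear scan over the byte range. A minimal sketch, assuming the same (byte[], offset, length, marker) shape as the signature shown later in the diff; illustrative only, not the exact method in the PR:

```java
// Minimal sketch of a scalar indexOf fallback; illustrative only, not the
// exact code touched by this PR.
static int scalarIndexOf(byte[] bytes, int offset, int length, byte marker) {
    final int end = offset + length;
    for (int i = offset; i < end; i++) {
        if (bytes[i] == marker) {
            return i; // index of the first occurrence of the marker
        }
    }
    return -1; // marker not present in the range
}
```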
Benchmark summary: Panama consistently outperforms the baseline on all architectures. Benchmark results were collected on:

- Linux 6.8.0-1029-aws, aarch64, AWS Graviton3
- Linux 6.14.0-1010-aws, x86_64, Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
- Linux 6.8.0-79-generic, x86_64, 11th Gen Intel(R) Core(TM) i5-11400 @ 2.60GHz
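For context on how these numbers are typically produced: the benchmark referenced above (BytesArrayIndexOfBenchmark) is a JMH microbenchmark parameterized by input size. A rough sketch of such a benchmark, with an assumed class name and setup that may differ from the actual file in this PR:

```java
// Hypothetical JMH sketch of an indexOf benchmark over a byte[]; the real
// BytesArrayIndexOfBenchmark in this PR may differ in setup and parameters.
import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 3)
@Measurement(iterations = 5)
@Fork(1)
@State(Scope.Benchmark)
public class IndexOfBenchmarkSketch {

    @Param({ "4096", "16384", "65536", "1048576" })
    int size;

    byte[] bytes;

    @Setup
    public void setup() {
        bytes = new byte[size];
        Arrays.fill(bytes, (byte) 'a');
        // Put the marker at the very end so every invocation scans the whole array.
        bytes[size - 1] = '\n';
    }

    @Benchmark
    public int scalarIndexOf() {
        // Baseline: plain linear scan for the newline marker.
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == '\n') {
                return i;
            }
        }
        return -1;
    }
}
```

The Panama variant would then be a second @Benchmark method calling the vectorized implementation, so both paths are measured over the same data.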
🚀
.../main21/java/org/elasticsearch/simdvec/internal/vectorization/PanamaESVectorUtilSupport.java
Latest results; the summary is as good as or better than the previous set. Small sizes with tails have improved.

- Linux 6.8.0-1029-aws, aarch64, AWS Graviton3
- Linux 6.14.0-1010-aws, x86_64, Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
- Linux 6.8.0-79-generic, x86_64, 11th Gen Intel(R) Core(TM) i5-11400 @ 2.60GHz
Very neat use of Panama! Vectorization without writing a line of native code, nice!
One small point, and I'm late to the party :) but LGTM!
}

@Override
public int indexOf(final byte[] bytes, final int offset, final int length, final byte marker)
Nit: I don't think we need to keep this in main21, as 21 is the minimum version we support now. We need to apply the mrjar plugin for the preview bit handling, but no need for a different sourceset I think
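For readers following along, the shape of the change under review is roughly: BytesArray delegates the scan to a vector-support abstraction, with a scalar default and a Panama override (the one living in the main21 sourceset discussed above). A simplified sketch with stand-in names, not the real ESVectorUtil / vectorization support classes:

```java
// Simplified stand-ins for the default-vs-Panama split discussed in this thread;
// names are illustrative, not the actual Elasticsearch classes.
interface ByteIndexOfSupport {
    // Both the scalar default and the Panama-vectorized override provide this.
    int indexOf(byte[] bytes, int offset, int length, byte marker);
}

final class BytesArraySketch {
    // In the real code this would be resolved once, picking a Panama-backed
    // support when the Vector API is available and a scalar default otherwise.
    private static final ByteIndexOfSupport SUPPORT = loadBestAvailableSupport();

    private final byte[] bytes;
    private final int offset;
    private final int length;

    BytesArraySketch(byte[] bytes, int offset, int length) {
        this.bytes = bytes;
        this.offset = offset;
        this.length = length;
    }

    int indexOf(byte marker, int from) {
        // Delegate the scan; in this sketch the returned value is the absolute
        // index into the backing array, or -1 if the marker is not found.
        return SUPPORT.indexOf(bytes, offset + from, length - from, marker);
    }

    private static ByteIndexOfSupport loadBestAvailableSupport() {
        // Placeholder: this sketch always returns a trivial scalar implementation.
        return (b, off, len, marker) -> {
            for (int i = off; i < off + len; i++) {
                if (b[i] == marker) {
                    return i;
                }
            }
            return -1;
        };
    }
}
```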
Optimize BytesArray::indexOf, which is used heavily in ndjson parsing.
I simply extracted the indexOf functionality into ESVectorUtil, so that we could have a default implementation (the current one) and a Panama-vectorized one. BytesArray::indexOf is used heavily for newline-delimited JSON parsing, which backs the _bulk API.
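To give a sense of what the Panama-vectorized path looks like, here is an illustrative re-implementation using the jdk.incubator.vector API. It is a sketch of the general technique (vector compare plus scalar tail), not a copy of the code in PanamaESVectorUtilSupport:

```java
// Illustrative Panama Vector API indexOf; requires --add-modules jdk.incubator.vector.
// The actual PanamaESVectorUtilSupport code may differ in species choice and tail handling.
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

final class VectorizedIndexOfSketch {

    private static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_PREFERRED;

    static int indexOf(byte[] bytes, int offset, int length, byte marker) {
        int i = offset;
        final int end = offset + length;
        final int loopBound = offset + SPECIES.loopBound(length);
        // Main loop: compare a whole vector of bytes against the marker per iteration.
        for (; i < loopBound; i += SPECIES.length()) {
            ByteVector chunk = ByteVector.fromArray(SPECIES, bytes, i);
            int lane = chunk.eq(marker).firstTrue();
            if (lane != SPECIES.length()) {
                return i + lane; // first matching lane gives the absolute index
            }
        }
        // Scalar tail for the remaining bytes that don't fill a full vector.
        for (; i < end; i++) {
            if (bytes[i] == marker) {
                return i;
            }
        }
        return -1;
    }
}
```

Calling it as `indexOf(json, 0, json.length, (byte) '\n')` finds the next newline delimiter, which is the hot operation in _bulk request parsing.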
Here's the result of the benchmark run on Linux x64, Intel Skylake:

| Size | Baseline ops/ms | Panama ops/ms | Speedup |
|---|---|---|---|
| 4,096 | 2,439 | 41,689 | ~17.1x |
| 16,384 | 624 | 12,297 | ~19.7x |
| 65,536 | 156 | 1,689 | ~10.8x |
| 1,048,576 | 9.8 | 73.4 | ~7.5x |
The Panama version is 7–20× faster than the baseline depending on input size.
For small and medium arrays (4 KB and 16 KB), the speedup is dramatic (~17–20×), showing that the vectorization pays off most when the data set fits easily in cache.
For larger arrays (64 KB – 1 MB), the relative advantage drops (~10× at 64 KB, ~7.5× at 1 MB). That’s expected - as array size grows, the bottleneck shifts from CPU to memory bandwidth, reducing the benefit of SIMD.
In short: Panama delivers an order-of-magnitude improvement over the scalar implementation, especially for small-to-medium buffers, while still offering a solid multiple (~7×) speedup even on very large buffers.
The Panama implementation is between 6-10x faster on AVX2, and ~5x faster on ARM.