ChrisHegarty
Contributor

@ChrisHegarty ChrisHegarty commented Sep 19, 2025

Optimize BytesArray::indexOf, which is used heavily in ndjson parsing.

I simply extracted the indexOf functionality into ESVectorUtil so that we can have a default implementation (the current one) and a Panama vectorized one. BytesArray::indexOf is used heavily for newline-delimited JSON (ndjson) parsing, which backs the _bulk API.
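
For readers who have not looked at the code, here is a minimal sketch of what a scalar (default) indexOf and its use for splitting an ndjson payload could look like. The class name, the splitLines helper, and the convention of returning an index relative to offset are assumptions made for illustration; this is not the Elasticsearch source.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the scalar (default) code path; not the Elasticsearch source. */
final class ScalarIndexOfSketch {

    /**
     * Returns the index of the first occurrence of marker within
     * bytes[offset, offset + length), relative to offset, or -1 if absent.
     * The relative-index convention is an assumption of this sketch.
     */
    static int indexOf(byte[] bytes, int offset, int length, byte marker) {
        final int end = offset + length;
        for (int i = offset; i < end; i++) {
            if (bytes[i] == marker) {
                return i - offset;
            }
        }
        return -1;
    }

    /** Splits an ndjson payload into lines by repeatedly searching for '\n'. */
    static List<String> splitLines(byte[] payload) {
        List<String> lines = new ArrayList<>();
        int from = 0;
        while (from < payload.length) {
            int rel = indexOf(payload, from, payload.length - from, (byte) '\n');
            // No delimiter left: the rest of the payload is the final line.
            int lineEnd = (rel == -1) ? payload.length : from + rel;
            lines.add(new String(payload, from, lineEnd - from, StandardCharsets.UTF_8));
            from = lineEnd + 1; // skip the delimiter
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] body = "{\"index\":{}}\n{\"field\":1}\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(splitLines(body));
    }
}
```

Every line of a bulk body costs one such search, which is why the per-call cost of indexOf matters here.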

Here are the results of the benchmark run on Linux x64, Intel Skylake:

Size (bytes)   Baseline ops/ms   Panama ops/ms   Speedup
4,096          2,439             41,689          ~17.1x
16,384         624               12,297          ~19.7x
65,536         156               1,689           ~10.8x
1,048,576      9.8               73.4            ~7.5x

The Panama version is 7–20× faster than the baseline depending on input size.
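
The speedup column is simply the ratio of the two throughput columns. A minimal sketch that recomputes it from the numbers in the table (the class and method names are made up for illustration):

```java
/** Hypothetical helper that recomputes the speedup column from the table above. */
final class SpeedupSketch {

    /** Ratio of Panama throughput to baseline throughput, both in ops/ms. */
    static double speedup(double baselineOpsPerMs, double panamaOpsPerMs) {
        return panamaOpsPerMs / baselineOpsPerMs;
    }

    public static void main(String[] args) {
        // Values copied from the Skylake table in this description.
        int[] sizes = {4_096, 16_384, 65_536, 1_048_576};
        double[] baseline = {2_439, 624, 156, 9.8};
        double[] panama = {41_689, 12_297, 1_689, 73.4};
        for (int i = 0; i < sizes.length; i++) {
            System.out.printf("%,d bytes: ~%.1fx%n", sizes[i], speedup(baseline[i], panama[i]));
        }
    }
}
```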

For small and medium arrays (4 KB and 16 KB), the speedup is dramatic (~17–20×), showing that the vectorization pays off most when the data set fits easily in cache.
For larger arrays (64 KB – 1 MB), the relative advantage drops (~10× at 64 KB, ~7.5× at 1 MB). That’s expected: as array size grows, the bottleneck shifts from CPU to memory bandwidth, reducing the benefit of SIMD.
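
One way to sanity-check the bandwidth argument is to convert ops/ms into bytes scanned per second. The sketch below does that under the assumption that each benchmark operation scans the whole buffer (for example, because the marker is absent or sits at the very end); the benchmark setup is not shown in this description, so treat the result as an upper bound rather than a measurement. The class and method names are hypothetical.

```java
/** Hypothetical helper: converts a JMH ops/ms score into scan throughput. */
final class ThroughputSketch {

    /**
     * Bytes scanned per second, in GB/s (1 GB = 1e9 bytes).
     * Assumes every operation scans all sizeBytes of the buffer.
     */
    static double gigabytesPerSecond(long sizeBytes, double opsPerMs) {
        double bytesPerMs = sizeBytes * opsPerMs;
        return bytesPerMs * 1_000 / 1e9; // ms -> s, then bytes -> GB
    }

    public static void main(String[] args) {
        // Panama column of the Skylake table in this description.
        long[] sizes = {4_096, 16_384, 65_536, 1_048_576};
        double[] panamaOpsPerMs = {41_689, 12_297, 1_689, 73.4};
        for (int i = 0; i < sizes.length; i++) {
            System.out.printf("%,d bytes: %.1f GB/s%n",
                sizes[i], gigabytesPerSecond(sizes[i], panamaOpsPerMs[i]));
        }
    }
}
```

If the per-size throughput falls as the buffer outgrows each cache level, that is consistent with the memory-bound explanation above.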

In short: Panama delivers an order-of-magnitude improvement over the scalar implementation, especially for small-to-medium buffers, while still offering a solid multiple (~7×) speedup even on very large buffers.

The Panama implementation is roughly 6x to 10x faster on AVX2, and ~5x faster on ARM.

@ChrisHegarty ChrisHegarty requested a review from a team as a code owner September 19, 2025 13:54
@ChrisHegarty ChrisHegarty added :Core/Infra/Core Core issues without another label :Performance All issues related to Elasticsearch performance including regressions and investigations Team:Core/Infra Meta label for core/infra team v9.2.0 labels Sep 19, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-perf (Team:Performance)

@elasticsearchmachine elasticsearchmachine added the Team:Performance Meta label for performance team label Sep 19, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine
Collaborator

Hi @ChrisHegarty, I've created a changelog YAML for you.

Member

@benwtrent benwtrent left a comment

The numbers on Skylake are great! I will run it on my MacBook :).

Would be neato to see numbers on cloud-focused ARM, though I imagine it's still pretty crazy.

@benwtrent
Member

Seems like, maybe we don't need to check the length at all? This is surprising to me. But for index size of 64 (way too small to be

Running on my MacBook (128-bit ARM) and it's looking nice!

Benchmark                                          (size)   Mode  Cnt       Score       Error   Units
BytesArrayIndexOfBenchmark.indexOf                     64  thrpt    5  111159.836 ±   654.890  ops/ms
BytesArrayIndexOfBenchmark.indexOf                    127  thrpt    5   63477.492 ±  2255.925  ops/ms
BytesArrayIndexOfBenchmark.indexOf                    128  thrpt    5   65826.505 ±  1153.706  ops/ms
BytesArrayIndexOfBenchmark.indexOf                   4096  thrpt    5    2490.710 ±    26.760  ops/ms
BytesArrayIndexOfBenchmark.indexOf                  16384  thrpt    5     634.546 ±     6.918  ops/ms
BytesArrayIndexOfBenchmark.indexOf                  65536  thrpt    5     157.730 ±     4.007  ops/ms
BytesArrayIndexOfBenchmark.indexOf                1048576  thrpt    5       9.897 ±     0.324  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama               64  thrpt    5  206332.883 ± 37844.369  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama              127  thrpt    5  108412.415 ±  2275.115  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama              128  thrpt    5  163095.756 ±  2229.280  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama             4096  thrpt    5   13856.177 ±   375.879  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama            16384  thrpt    5    3563.749 ±    63.174  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama            65536  thrpt    5     910.785 ±    15.769  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama          1048576  thrpt    5      57.565 ±     0.302  ops/ms
BytesArrayIndexOfBenchmark.withOffsetIndexOf           64  thrpt    5   96301.840 ±  1945.001  ops/ms
BytesArrayIndexOfBenchmark.withOffsetIndexOf          127  thrpt    5   64477.539 ±   814.852  ops/ms
BytesArrayIndexOfBenchmark.withOffsetIndexOf          128  thrpt    5   62245.001 ±   938.305  ops/ms
BytesArrayIndexOfBenchmark.withOffsetIndexOf         4096  thrpt    5    2468.943 ±    61.478  ops/ms
BytesArrayIndexOfBenchmark.withOffsetIndexOf        16384  thrpt    5     627.870 ±    28.683  ops/ms
BytesArrayIndexOfBenchmark.withOffsetIndexOf        65536  thrpt    5     157.878 ±     4.429  ops/ms
BytesArrayIndexOfBenchmark.withOffsetIndexOf      1048576  thrpt    5       9.884 ±     0.255  ops/ms
BytesArrayIndexOfBenchmark.withOffsetIndexPanama       64  thrpt    5  122800.983 ±  3506.417  ops/ms
BytesArrayIndexOfBenchmark.withOffsetIndexPanama      127  thrpt    5  106781.683 ±  2724.090  ops/ms
BytesArrayIndexOfBenchmark.withOffsetIndexPanama      128  thrpt    5  102716.653 ±  5450.408  ops/ms
BytesArrayIndexOfBenchmark.withOffsetIndexPanama     4096  thrpt    5   13241.232 ±   146.939  ops/ms
BytesArrayIndexOfBenchmark.withOffsetIndexPanama    16384  thrpt    5    3362.549 ±   280.103  ops/ms
BytesArrayIndexOfBenchmark.withOffsetIndexPanama    65536  thrpt    5     894.667 ±    68.088  ops/ms
BytesArrayIndexOfBenchmark.withOffsetIndexPanama  1048576  thrpt    5      56.321 ±     3.408  ops/ms

@ChrisHegarty
Contributor Author

Running on my MacBook (128-bit ARM) and it's looking nice!

ha! Thanks for checking this. Lemme add those sizes to the benchmark and do some more runs. But otherwise looks like we don't need to fall back to the old code - as it's slower.
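
For context on why sizes like 127 and 128 behave differently: a block-wise search processes full blocks in a main loop and then has to handle whatever is left over (the tail). The sketch below shows that loop shape in plain Java. It is not the Panama code from this PR; it swaps in a SWAR trick over 8-byte longs so that it runs without the incubator module, and the class name and relative-index return convention are assumptions.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

/** Hypothetical sketch: 8-byte SWAR blocks plus a scalar tail; not the Panama code from this PR. */
final class BlockedIndexOfSketch {

    private static final VarHandle LONG_LE =
        MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);
    private static final long ONES = 0x0101010101010101L;
    private static final long HIGHS = 0x8080808080808080L;

    /** Index of the first marker in bytes[offset, offset + length), relative to offset, or -1. */
    static int indexOf(byte[] bytes, int offset, int length, byte marker) {
        final long pattern = ONES * (marker & 0xFFL); // marker repeated in all 8 byte lanes
        final int end = offset + length;
        int i = offset;
        // Main loop: one 8-byte block per iteration.
        for (; i + Long.BYTES <= end; i += Long.BYTES) {
            long x = (long) LONG_LE.get(bytes, i) ^ pattern; // matching lanes become 0x00
            long zeroLanes = (x - ONES) & ~x & HIGHS;        // lowest set bit marks the first zero lane
            if (zeroLanes != 0) {
                return i - offset + (Long.numberOfTrailingZeros(zeroLanes) >>> 3);
            }
        }
        // Tail: fewer than 8 bytes remain, so scan them one at a time.
        for (; i < end; i++) {
            if (bytes[i] == marker) {
                return i - offset;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = new byte[127];
        data[126] = '\n'; // lands in the scalar tail (127 = 15 * 8 + 7)
        System.out.println(indexOf(data, 0, data.length, (byte) '\n'));
    }
}
```

With a wider block, a length of 127 leaves a much longer tail than 128 does, which is one plausible reading of the dip at 127 in the numbers above.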

@ChrisHegarty
Contributor Author

ChrisHegarty commented Sep 19, 2025

Benchmark summary:

Panama consistently outperforms the baseline on all architectures.
ARM (Graviton3) sees smaller relative gains (~4×), but Panama still provides solid improvements.
x86 (Xeon, Core i5) benefits most: mid-size arrays see order-of-magnitude speedups.

Benchmark results:

Linux 6.8.0-1029-aws aarch64 AWS Graviton3

Benchmark                                  (size)   Mode  Cnt       Score      Error   Units
BytesArrayIndexOfBenchmark.indexOf             64  thrpt    5   67576.777 ±  137.815  ops/ms
BytesArrayIndexOfBenchmark.indexOf            127  thrpt    5   38576.815 ±  127.426  ops/ms
BytesArrayIndexOfBenchmark.indexOf            128  thrpt    5   38679.525 ±   29.822  ops/ms
BytesArrayIndexOfBenchmark.indexOf           4096  thrpt    5    1438.940 ±    0.750  ops/ms
BytesArrayIndexOfBenchmark.indexOf          16384  thrpt    5     360.565 ±    0.210  ops/ms
BytesArrayIndexOfBenchmark.indexOf          65536  thrpt    5      90.102 ±    0.039  ops/ms
BytesArrayIndexOfBenchmark.indexOf        1048576  thrpt    5       5.606 ±    0.008  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama       64  thrpt    5  166784.847 ±  747.773  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama      127  thrpt    5   47398.116 ±   28.134  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama      128  thrpt    5  111873.941 ± 2330.236  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama     4096  thrpt    5    6183.588 ±   28.437  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama    16384  thrpt    5    1565.275 ±    1.924  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama    65536  thrpt    5     384.226 ±    4.493  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama  1048576  thrpt    5      22.812 ±    2.929  ops/ms

Linux 6.14.0-1010-aws x86_64 Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz

Benchmark                                  (size)   Mode  Cnt       Score      Error   Units
BytesArrayIndexOfBenchmark.indexOf             64  thrpt    5   76710.541 ±  569.685  ops/ms
BytesArrayIndexOfBenchmark.indexOf            127  thrpt    5   44880.102 ±  450.004  ops/ms
BytesArrayIndexOfBenchmark.indexOf            128  thrpt    5   46795.429 ±   74.645  ops/ms
BytesArrayIndexOfBenchmark.indexOf           4096  thrpt    5    1882.422 ±    3.302  ops/ms
BytesArrayIndexOfBenchmark.indexOf          16384  thrpt    5     473.050 ±    0.281  ops/ms
BytesArrayIndexOfBenchmark.indexOf          65536  thrpt    5     117.787 ±    0.910  ops/ms
BytesArrayIndexOfBenchmark.indexOf        1048576  thrpt    5       7.172 ±    0.026  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama       64  thrpt    5  443081.678 ± 2030.219  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama      127  thrpt    5   41538.104 ±   43.945  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama      128  thrpt    5  186297.721 ±  499.580  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama     4096  thrpt    5   25778.828 ±  142.519  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama    16384  thrpt    5    6978.396 ±   14.414  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama    65536  thrpt    5     942.210 ±    2.387  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama  1048576  thrpt    5      58.544 ±    0.230  ops/ms

Linux 6.8.0-79-generic x86_64 11th Gen Intel(R) Core(TM) i5-11400 @ 2.60GHz

Benchmark                                  (size)   Mode  Cnt       Score      Error   Units
BytesArrayIndexOfBenchmark.indexOf             64  thrpt    5   97591.143 ± 1092.773  ops/ms
BytesArrayIndexOfBenchmark.indexOf            127  thrpt    5   56637.893 ±  207.175  ops/ms
BytesArrayIndexOfBenchmark.indexOf            128  thrpt    5   60739.373 ±  301.099  ops/ms
BytesArrayIndexOfBenchmark.indexOf           4096  thrpt    5    2501.760 ±    3.686  ops/ms
BytesArrayIndexOfBenchmark.indexOf          16384  thrpt    5     639.158 ±    1.495  ops/ms
BytesArrayIndexOfBenchmark.indexOf          65536  thrpt    5     159.490 ±    0.350  ops/ms
BytesArrayIndexOfBenchmark.indexOf        1048576  thrpt    5       9.268 ±    0.164  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama       64  thrpt    5  262803.047 ± 1202.271  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama      127  thrpt    5   71137.973 ±  143.598  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama      128  thrpt    5  178740.023 ±  604.916  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama     4096  thrpt    5   24265.270 ±  102.993  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama    16384  thrpt    5    6717.387 ±   39.386  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama    65536  thrpt    5    1636.685 ±    2.580  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama  1048576  thrpt    5      76.563 ±    0.257  ops/ms

Member

@benwtrent benwtrent left a comment

🚀

@ChrisHegarty
Contributor Author

Latest results: in summary, as good as or better than the previous set. Small sizes with tails have improved.

Linux 6.8.0-1029-aws aarch64 AWS Graviton3

Benchmark                                  (size)   Mode  Cnt       Score      Error   Units
BytesArrayIndexOfBenchmark.indexOf             64  thrpt    5   67707.455 ±   18.846  ops/ms
BytesArrayIndexOfBenchmark.indexOf            127  thrpt    5   38530.078 ±   52.207  ops/ms
BytesArrayIndexOfBenchmark.indexOf            128  thrpt    5   38706.770 ±    9.581  ops/ms
BytesArrayIndexOfBenchmark.indexOf           4096  thrpt    5    1438.885 ±    0.318  ops/ms
BytesArrayIndexOfBenchmark.indexOf          16384  thrpt    5     360.585 ±    0.148  ops/ms
BytesArrayIndexOfBenchmark.indexOf          65536  thrpt    5      90.127 ±    0.025  ops/ms
BytesArrayIndexOfBenchmark.indexOf        1048576  thrpt    5       5.597 ±    0.008  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama       64  thrpt    5  174350.951 ± 9986.252  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama      127  thrpt    5   64324.577 ±   55.027  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama      128  thrpt    5  126681.926 ± 2941.217  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama     4096  thrpt    5    6073.756 ±    4.627  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama    16384  thrpt    5    1526.167 ±   17.054  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama    65536  thrpt    5     389.074 ±    0.987  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama  1048576  thrpt    5      23.472 ±    2.943  ops/ms

Linux 6.14.0-1010-aws x86_64 Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz

Benchmark                                  (size)   Mode  Cnt       Score      Error   Units
BytesArrayIndexOfBenchmark.indexOf             64  thrpt    5   76674.803 ±  847.656  ops/ms
BytesArrayIndexOfBenchmark.indexOf            127  thrpt    5   44890.652 ±  140.901  ops/ms
BytesArrayIndexOfBenchmark.indexOf            128  thrpt    5   46813.474 ±   32.569  ops/ms
BytesArrayIndexOfBenchmark.indexOf           4096  thrpt    5    1883.676 ±   12.478  ops/ms
BytesArrayIndexOfBenchmark.indexOf          16384  thrpt    5     472.499 ±    0.739  ops/ms
BytesArrayIndexOfBenchmark.indexOf          65536  thrpt    5     118.075 ±    0.125  ops/ms
BytesArrayIndexOfBenchmark.indexOf        1048576  thrpt    5       7.181 ±    0.014  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama       64  thrpt    5  471356.354 ± 1012.151  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama      127  thrpt    5   47259.317 ±   94.854  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama      128  thrpt    5  186583.614 ±  302.880  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama     4096  thrpt    5   25956.985 ±  123.638  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama    16384  thrpt    5    6972.430 ±  165.971  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama    65536  thrpt    5     941.913 ±    0.513  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama  1048576  thrpt    5      58.908 ±    0.109  ops/ms

Linux 6.8.0-79-generic x86_64 11th Gen Intel(R) Core(TM) i5-11400 @ 2.60GHz

Benchmark                                  (size)   Mode  Cnt       Score      Error   Units
BytesArrayIndexOfBenchmark.indexOf             64  thrpt    5   97237.362 ± 4587.096  ops/ms
BytesArrayIndexOfBenchmark.indexOf            127  thrpt    5   57227.153 ±  525.360  ops/ms
BytesArrayIndexOfBenchmark.indexOf            128  thrpt    5   60770.794 ±  286.542  ops/ms
BytesArrayIndexOfBenchmark.indexOf           4096  thrpt    5    2502.030 ±   15.971  ops/ms
BytesArrayIndexOfBenchmark.indexOf          16384  thrpt    5     639.372 ±    1.104  ops/ms
BytesArrayIndexOfBenchmark.indexOf          65536  thrpt    5     159.138 ±    5.858  ops/ms
BytesArrayIndexOfBenchmark.indexOf        1048576  thrpt    5       9.242 ±    0.111  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama       64  thrpt    5  593308.180 ± 1329.063  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama      127  thrpt    5   70080.825 ±  244.430  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama      128  thrpt    5  251281.542 ±  875.517  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama     4096  thrpt    5   40686.370 ± 3350.372  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama    16384  thrpt    5   12251.919 ±  195.274  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama    65536  thrpt    5    1689.770 ±    1.895  ops/ms
BytesArrayIndexOfBenchmark.indexOfPanama  1048576  thrpt    5      61.271 ±    0.165  ops/ms

@ChrisHegarty ChrisHegarty merged commit 9cfeaac into elastic:main Sep 20, 2025
34 checks passed
@ChrisHegarty ChrisHegarty deleted the indexOf_vec branch September 20, 2025 10:39
Contributor

@ldematte ldematte left a comment

Very neat use of Panama! Vectorization without writing a line of native code, nice!
One small point, and I'm late to the party :) but LGTM!

}

@Override
public int indexOf(final byte[] bytes, final int offset, final int length, final byte marker) {
Contributor

Nit: I don't think we need to keep this in main21, as 21 is the minimum version we support now. We need to apply the mrjar plugin for the preview bit handling, but no need for a different sourceset I think

gmjehovich pushed a commit to gmjehovich/elasticsearch that referenced this pull request Sep 22, 2025
…elastic#135087)

DonalEvans pushed a commit to DonalEvans/elasticsearch that referenced this pull request Sep 22, 2025
…elastic#135087)

Labels
:Core/Infra/Core Core issues without another label >feature :Performance All issues related to Elasticsearch performance including regressions and investigations Team:Core/Infra Meta label for core/infra team Team:Performance Meta label for performance team v9.2.0