Speedup integer functions for 128-bit neon vectors #12632
Conversation
I did manage to get a little bit more out of the ARM chip. I will look at the other 2 functions there too...
FYI, I ran the benchmark at the latest benchmark commit on a linux-x86-64 server with AVX-512 support.
Thanks for running. I will just revert it then and get folks to test the ARM changes. I don't want to hurt AVX-512...
This reverts commit 132bf28.
OK, I reverted the 256-bit changes from here, and from the vectorbench, but kept the 128-bit ones for people to test on Macs. Now this issue does the opposite of what it says; I will edit it...
I don't know how to do the same tricks for the BinarySquare one due to the subtraction. So I'm done for now. I think, given the reports from @gf2121, the 256/512-bit experiment was a loss :(
Thanks for looking into this @rmuir, I've been thinking similarly myself (just didn't get around to anything other than the thinking!). On my Mac M2, JDK 20.0.2:
JDK 21
What is the maximum value that we can see in the input bytes? Can they ever hold ...
And of course, ...
All possible values is how I test.
Yes!
No! Integer FMA won't overflow here. And I know that isn't obvious and probably needs a code comment lol. That's why it only works for these two methods and not the square.
Ok, cool. If there is not already one, we should add a test to the Panama / scalar unit tests for the boundary values.
Yeah, agreed: we should test the boundaries for all 3 functions.
Yeah, you are right, I am wrong. The trick only works in the unsigned case; Byte.MIN_VALUE is a problem :(
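To make the overflow concrete: assuming the trick accumulates pairwise byte products in a signed 16-bit lane (my reading of the discussion, not something spelled out above), the boundary arithmetic looks like this:

```java
// Illustrative arithmetic only, not Lucene code. Assumes the optimization
// sums pairwise byte products in a signed 16-bit lane before widening.
public class ByteProductBounds {
  public static void main(String[] args) {
    // worst-case signed product: (-128) * (-128) = 16384
    int worst = Byte.MIN_VALUE * Byte.MIN_VALUE;
    // adding just two such products already exceeds Short.MAX_VALUE (32767)
    int pair = worst + worst;
    System.out.println(worst);                  // 16384
    System.out.println(pair > Short.MAX_VALUE); // true: 32768 overflows a 16-bit lane
  }
}
```

With unsigned 7-bit values the worst pairwise sum would be 2 × 127 × 127 = 32258, which still fits, which is why the trick only works in the unsigned case.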
At least we can improve the testing out of this: #12634
Don't worry, I have a plan B. It is just frustrating due to the nightmare of operating on the Mac, combined with the fact that this benchmark and the Lucene source are separate repos. It makes the situation very slow and error-prone...
See the latest commit for the idea. On my Mac it gives a decent boost. It uses a "32-bit" vector by loading a 64-bit vector from the array but only processing half of it. The tests should fail as I need to fix the other functions.
OK, on my Mac I see:
And it passes the new boundary tests (but no sneakiness with boundary values; instead another type of sneakiness).
@gf2121 @ChrisHegarty you can see the issue from his assembler output with the failed Intel optimization:
@gf2121 I think we could diagnose it further with https://github.com/travisdowns/avx-turbo
I compiled the code and ran it easily, just ...
I guess you will probably have to ...
And at least the theory makes sense: this integer multiply is definitely "AVX-512 heavy", so if you have a CPU susceptible to throttling, it is better to do the 256-bit multiplies that we do today. I guess I feel it's an issue with SPECIES_PREFERRED on such CPUs, but for now I prefer being conservative. Would be cool if we could confirm it.
I'm gonna merge this but we should continue to explore the Intel case. Not sure what we can do there though.
Thank y'all so much for digging into this @rmuir @gf2121 @ChrisHegarty @uschindler! Maybe one day Panama Vector will mature enough to allow us to do nicer things with ...
@benwtrent it isn't a Panama thing. These functions are 32-bit (they return ...
I backported this one to 9.x. |
Thanks @rmuir for the profiling guide! Sorry for the delay. It took me some time to apply for root privileges on this machine. Here is the result: avx_turbo.log
Thank you @gf2121, it is confirmed. I include just the part of the table that is relevant. It is really great that you caught this.
@ChrisHegarty there are plenty of actions we could take... but I implemented this specific optimization in question safely in #12681. See https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Downclocking for a brief introduction.
To have any decent performance, we really need information on the CPU in question and its vector capabilities. And the idea that you can write "one loop" that "runs anywhere" is an obvious pipe dream on the part of OpenJDK... we have to deal with special-case BS like this for every CPU. My advice on the OpenJDK side: give up on the pipe dream and give us some cpuinfo, or at least tell us which vector operations are supported. I'm almost to the point of writing a /proc/cpuinfo parser.
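For the curious, the /proc/cpuinfo idea could be sketched roughly like this (a hypothetical illustration, not anything Lucene ships; the flag name `avx512f` is the Linux kernel's spelling of the AVX-512 Foundation bit):

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: pull the feature flags out of /proc/cpuinfo text.
// In real use you would read the file; here the text is passed in so the
// parsing logic is testable anywhere.
public class CpuFlags {
  static Set<String> parseFlags(String cpuinfo) {
    for (String line : cpuinfo.split("\n")) {
      // the "flags" line lists whitespace-separated feature names
      if (line.startsWith("flags")) {
        String[] parts = line.split(":", 2);
        return Arrays.stream(parts[1].trim().split("\\s+"))
                     .collect(Collectors.toSet());
      }
    }
    return Set.of();
  }

  public static void main(String[] args) {
    String sample = "processor\t: 0\nflags\t\t: fpu sse sse2 avx avx2 avx512f\n";
    Set<String> flags = parseFlags(sample);
    System.out.println(flags.contains("avx512f")); // true for this sample
  }
}
```

This only answers "does the CPU support it", not "does it downclock", which is the harder half of the problem discussed here.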
Also, I think JMH is bad news when it comes to downclocking. It does not show the true performance impact of this: it slows down other things on the machine as well, and the user might have other shit going on too. This is why the AVX-512 variants of the functions avoid the 512-bit multiply and just use 256-bit vectors for that part; it is the only safe approach. Unfortunately this approach is slightly suboptimal for your Rocket Lake, which doesn't suffer from downclocking, but it is a disaster elsewhere, so we have to play it safe.
I think we should really open an issue at the JDK. We discussed this before. In my opinion, the system should have an API to query, for each operator, whether it is supported in the current configuration (CPU, species size, and Hotspot abilities). In addition, a shortcut to figure out if C2 is enabled at all; if C2 is disabled, each of the above queries for operator and species combinations should return "no". Everything else makes the Vector API a huge trap. It is a specialist API, so please add detailed information to allow specialist use!
Doesn't work on Windows. 😱
We could add a sysprop to enable advanced AVX-512 support. It could be a static boolean on the impl. If somebody enables it, they must be aware that it may slow things down. I have seen stuff like this in native vector databases at Buzzwords (they told me).
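The sysprop idea could look something like this (a minimal sketch; the property name is invented for illustration and is not a real Lucene property):

```java
// Hypothetical opt-in flag, mirroring how Lucene gates other risky behavior
// behind system properties (e.g. the MMap-related ones mentioned above).
public class Avx512Opt {
  // read once into a static final so HotSpot can constant-fold the check
  static final boolean ENABLE_HEAVY_AVX512 =
      Boolean.getBoolean("example.vector.enableHeavyAVX512");

  public static void main(String[] args) {
    // false unless the JVM was started with -Dexample.vector.enableHeavyAVX512=true
    System.out.println(ENABLE_HEAVY_AVX512);
  }
}
```

Defaulting to off matches the concern in this thread: the slow path is only a missed optimization, while the fast path can be a regression on downclock-susceptible CPUs.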
The Vector API should also fix its bugs. It is totally senseless: OpenJDK has the same heuristics in its intrinsics/asm that I allude to, to avoid these problems. Hacks in every intrinsic, checking specific families of CPUs and shit. Go look for yourself. But it doesn't want to expose this stuff via the Vector API to developers.
I would really just fix the API: instead of ...
Then OpenJDK can keep its hacks to itself. But if they won't fix this, they should expose them.
Such a method would solve 95% of my problems, if it would throw UnsupportedOperationException or return ...
The JVM already has these. For example, a user can already set the max vector width and the AVX instruction level. I assume that AVX-512 users who are running on downclock-susceptible CPUs would already set flags to use only 256-bit vectors. So I am afraid of adding our own flags that conflict with those.
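For reference, the existing HotSpot switches alluded to here look like this on x86 (exact behavior can vary by JDK version):

```
# cap vector register width at 32 bytes (i.e. 256 bits)
java -XX:MaxVectorSize=32 ...

# or restrict the instruction set level to AVX2
java -XX:UseAVX=2 ...
```

Note that `-XX:MaxVectorSize` is specified in bytes, so 32 corresponds to 256-bit registers; either flag keeps SPECIES_PREFERRED away from the 512-bit path.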
Hi,
The problem is that by default the JVM enables those flags, and then those machines get slow when they use Lucene, and support cases are opened at Elasticsearch/OpenSearch and the like. So my wish would be to have that opt-in. My idea was to add a sysprop like the MMap ones that enable/disable things manually or disable unsafe unmapping. In that case the sysprop would enable the 512-bit AVX case. It won't conflict with a manual AVX=2 JVM override setting (because then the preferred bit size is 256 and our flag is a no-op).
Just dropping an ACK here, for now. I do get the issues, and I agree that there could be better ways to frame things at the vector API level. |
Let's write a proposal together in a Google doc (or similar) and we could later open an issue or JEP. As we are both OpenJDK members, this would be helpful to me to participate in that process (I want to learn something new).
Gives a little love to the Mac for dot-product and binary-cosine.
I see on m1 some improvement:
You can reproduce with
java -jar target/vectorbench.jar Binary -p size=1024
See https://github.com/rmuir/vectorbench which has a README now!
Edit: originally I tried to optimize the 256/512-bit path, but it caused slowdowns on AVX-512 so I reverted the changes. Sorry for the confusion. Hopefully we get some improvement that can stick :)