
Speedup integer functions for 128-bit neon vectors #12632

Merged (9 commits) on Oct 14, 2023

Conversation

rmuir (Member) commented Oct 8, 2023

Gives a little love to the Mac for dot-product and binary-cosine.

I see some improvement on the M1.
You can reproduce with: java -jar target/vectorbench.jar Binary -p size=1024
See https://github.com/rmuir/vectorbench, which has a README now!

Benchmark                                   (size)   Mode  Cnt  Score   Error   Units
BinaryDotProductBenchmark.dotProductNew       1024  thrpt    5  6.135 ± 0.008  ops/us
BinaryDotProductBenchmark.dotProductNewNew    1024  thrpt    5  7.197 ± 0.028  ops/us
BinaryCosineBenchmark.cosineDistanceNew       1024  thrpt    5  2.259 ± 0.003  ops/us
BinaryCosineBenchmark.cosineDistanceNewNew    1024  thrpt    5  3.622 ± 0.017  ops/us

Edit: originally I tried to optimize the 256/512-bit path, but it caused slowdowns on AVX-512, so I reverted those changes. Sorry for the confusion. Hopefully we get some improvement that can stick :)

rmuir (Member Author) commented Oct 8, 2023

I did manage to get a little bit more out of the ARM chip. I will look at the other two functions there too...

Benchmark                                   (size)   Mode  Cnt  Score   Error   Units
BinaryDotProductBenchmark.dotProductNew       1024  thrpt    5  6.135 ± 0.008  ops/us
BinaryDotProductBenchmark.dotProductNewNew    1024  thrpt    5  7.197 ± 0.028  ops/us

gf2121 (Contributor) commented Oct 8, 2023

FYI, I ran the benchmark at the latest benchmark commit on a linux-x86-64 server that supports AVX-512.

Benchmark                                   (size)   Mode  Cnt   Score   Error   Units
BinaryCosineBenchmark.cosineDistanceNew       1024  thrpt    5   5.637 ± 0.003  ops/us
BinaryCosineBenchmark.cosineDistanceNewNew    1024  thrpt    5   4.942 ± 0.009  ops/us
BinaryCosineBenchmark.cosineDistanceOld       1024  thrpt    5   0.848 ± 0.001  ops/us
BinaryDotProductBenchmark.dotProductNew       1024  thrpt    5  11.717 ± 0.013  ops/us
BinaryDotProductBenchmark.dotProductNewNew    1024  thrpt    5   9.623 ± 0.050  ops/us
BinaryDotProductBenchmark.dotProductOld       1024  thrpt    5   1.953 ± 0.005  ops/us
BinarySquareBenchmark.squareDistanceNew       1024  thrpt    5   8.407 ± 0.020  ops/us
BinarySquareBenchmark.squareDistanceNewNew    1024  thrpt    5   9.057 ± 0.045  ops/us
BinarySquareBenchmark.squareDistanceOld       1024  thrpt    5   1.651 ± 0.001  ops/us

rmuir (Member Author) commented Oct 8, 2023

Thanks for running it. I will just revert it then and get folks to test the ARM changes. I don't want to hurt AVX-512...

rmuir (Member Author) commented Oct 8, 2023

OK, I reverted the 256-bit changes from here and from vectorbench, but kept the 128-bit ones for people to test on Macs. Now this issue does the opposite of what its title says, so I will edit it...

rmuir changed the title from "Speedup integer functions for 256bit+ vectors" to "Speedup integer functions for 128-bit neon vectors" on Oct 8, 2023
rmuir (Member Author) commented Oct 8, 2023

I don't know how to do the same tricks for the BinarySquare one due to the subtraction: the difference of two bytes needs 9 bits, and its square (up to 255² = 65025) already overflows a signed short.

So I'm done for now. I think, given the reports from @gf2121, the 256/512-bit experiment was a loss :(

ChrisHegarty (Contributor) commented:

Thanks for looking into this @rmuir, I've been thinking along similar lines myself (just didn't get around to anything other than the thinking!)

On my Mac M2.

JDK 20.0.2.

Benchmark                                   (size)   Mode  Cnt  Score   Error   Units
BinaryDotProductBenchmark.dotProductNew       1024  thrpt    5  6.590 ± 0.098  ops/us
BinaryDotProductBenchmark.dotProductNewNew    1024  thrpt    5  7.769 ± 0.102  ops/us
BinaryDotProductBenchmark.dotProductOld       1024  thrpt    5  3.159 ± 0.034  ops/us

JDK 21

Benchmark                                   (size)   Mode  Cnt  Score   Error   Units
BinaryDotProductBenchmark.dotProductNew       1024  thrpt    5  6.546 ± 0.054  ops/us
BinaryDotProductBenchmark.dotProductNewNew    1024  thrpt    5  7.696 ± 0.103  ops/us
BinaryDotProductBenchmark.dotProductOld       1024  thrpt    5  2.893 ± 0.306  ops/us

ChrisHegarty (Contributor) commented Oct 8, 2023

  // sum into accumulators
  Vector<Short> prod16 = prod16_1.add(prod16_2);
  acc = acc.add(prod16.convert(VectorOperators.S2I, 0));
  acc = acc.add(prod16.convert(VectorOperators.S2I, 1));

What is the maximum value that we can see in the input bytes? Can they ever hold -128? Do we need to handle "overflow" in the short accumulation and subsequent conversion to int? If so, we can use VectorOperators.ZERO_EXTEND_S2I (rather than the sign-extending S2I). (This is just a question rather than a suggestion, since I only thought of it after leaving my keyboard, and I have not tested it.)
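
For intuition, this is the scalar analogue of the two conversions in plain Java (independent of the Vector API):

  short s = (short) 0x8000;                  // bit pattern of -32768
  int signExtended = s;                      // S2I-style widening: -32768
  int zeroExtended = Short.toUnsignedInt(s); // ZERO_EXTEND_S2I-style widening: 32768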

ChrisHegarty (Contributor) commented:

And of course ZERO_EXTEND_S2I will work in the maximum boundary case, but not in others. So the question is then just about the maximum value of the bytes in these input arrays. :-(

rmuir (Member Author) commented Oct 8, 2023

What is the maximum value that we can see in the input bytes?

All possible values is how I test.

Can they ever hold -128?

Yes!

Do we need to handle "overflow" in the short accumulation and subsequent conversion to int?

No! Integer FMA won't overflow here. I know that isn't obvious and probably needs a code comment, lol. That's why it only works for these two methods and not the square.

ChrisHegarty (Contributor) commented:

Ok, cool. If there is not already one, we should add a test to the Panama / scalar unit test for the boundary values.

rmuir (Member Author) commented Oct 8, 2023

Yeah, agreed: we should test the boundaries for all three functions.

rmuir (Member Author) commented Oct 8, 2023

Yeah, you are right, I am wrong. The trick only works in the unsigned case; Byte.MIN_VALUE is a problem: (-128) * (-128) = 16384, and the sum of just two such products is 32768, which already overflows a signed short :(

rmuir (Member Author) commented Oct 8, 2023

At least we can improve the testing out of this: #12634

rmuir (Member Author) commented Oct 8, 2023

Don't worry, I have a plan B. It is just frustrating due to the nightmare of operating on the Mac, combined with the fact that this benchmark and the Lucene source are separate repos. It makes the situation very slow and error-prone...

rmuir (Member Author) commented Oct 8, 2023

See the latest commit for the idea; on my Mac it gives a decent boost. It uses a "32-bit" vector by loading a 64-bit vector from the array but only processing half of it (a sketch follows below). The tests should fail, as I still need to fix the other functions.
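
A minimal sketch of that idea, assuming the Panama Vector API (jdk.incubator.vector, so it needs --add-modules jdk.incubator.vector); this is one reading of the commit, not necessarily the exact code that landed:

  import jdk.incubator.vector.ByteVector;
  import jdk.incubator.vector.IntVector;
  import jdk.incubator.vector.Vector;
  import jdk.incubator.vector.VectorOperators;

  class HalfVectorDot {
    static int dotProduct128(byte[] a, byte[] b) {
      IntVector acc = IntVector.zero(IntVector.SPECIES_128);
      int i = 0;
      // each iteration loads 8 bytes (a 64-bit vector) but consumes only 4,
      // so only loop while a full 8-byte load is still in bounds
      for (; i + ByteVector.SPECIES_64.length() <= a.length; i += 4) {
        ByteVector va8 = ByteVector.fromArray(ByteVector.SPECIES_64, a, i);
        ByteVector vb8 = ByteVector.fromArray(ByteVector.SPECIES_64, b, i);
        // widen only the first half: 4 bytes -> 4 shorts (part 0). The product
        // of two bytes is at most (-128) * (-128) = 16384, which fits a short,
        // and nothing is ever accumulated in 16 bits, so no overflow.
        Vector<Short> va16 = va8.convert(VectorOperators.B2S, 0);
        Vector<Short> vb16 = vb8.convert(VectorOperators.B2S, 0);
        Vector<Short> prod16 = va16.mul(vb16);
        // widen 4 shorts -> 4 ints and accumulate in 32 bits
        acc = acc.add(prod16.convertShape(VectorOperators.S2I, IntVector.SPECIES_128, 0));
      }
      int res = acc.reduceLanes(VectorOperators.ADD);
      for (; i < a.length; i++) {
        res += a[i] * b[i]; // scalar tail
      }
      return res;
    }
  }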

rmuir (Member Author) commented Oct 8, 2023

OK, on my Mac I see:

Benchmark                                   (size)   Mode  Cnt  Score   Error   Units
BinaryCosineBenchmark.cosineDistanceNew       1024  thrpt    5  2.261 ± 0.007  ops/us
BinaryCosineBenchmark.cosineDistanceNewNew    1024  thrpt    5  3.708 ± 0.034  ops/us
BinaryDotProductBenchmark.dotProductNew       1024  thrpt    5  6.138 ± 0.021  ops/us
BinaryDotProductBenchmark.dotProductNewNew    1024  thrpt    5  6.645 ± 0.021  ops/us

And it passes the new boundary tests (no sneakiness with boundary values; instead, another type of sneakiness).

rmuir (Member Author) commented Oct 13, 2023

@gf2121 @ChrisHegarty you can see the issue from the assembler output for the failed Intel optimization:
the current code does 2 x 256-bit vpmull on ymm registers; the proposed simplification (going straight from byte to int) instead uses 1 x 512-bit vpmull on zmm registers. That doesn't explain why it's slower (maybe an AVX-512 downclocking issue for that CPU? I have to do some digging), but it does explain why it's different.

rmuir (Member Author) commented Oct 13, 2023

@gf2121 I think we could diagnose it further with https://github.com/travisdowns/avx-turbo

rmuir (Member Author) commented Oct 13, 2023

I compiled the code and ran it easily: just git clone + make. You do have to run it as root to get the useful output; I took a risk on my machine.
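
The full sequence is roughly this (a sketch assuming a standard Linux toolchain; the modprobe step is only needed if the msr module is not already loaded, see below):

  git clone https://github.com/travisdowns/avx-turbo
  cd avx-turbo
  make
  sudo modprobe msr   # allow MSR reads
  sudo ./avx-turbo

The output on my machine: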

think:avx-turbo[master]$  sudo ./avx-turbo
CPUID highest leaf    : [16h]
Running as root       : [YES]
MSR reads supported   : [YES]
CPU pinning enabled   : [YES]
CPU supports zeroupper: [YES]
CPU supports AVX2     : [YES]
CPU supports AVX-512F : [NO ]
CPU supports AVX-512VL: [NO ]
CPU supports AVX-512BW: [NO ]
CPU supports AVX-512CD: [NO ]
cpuid = eax = 2, ebx = 208, ecx = 0, edx = 0
cpu: family = 6, model = 78, stepping = 3
tsc_freq = 2496.0 MHz (from cpuid leaf 0x15)
CPU brand string: Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
4 available CPUs: [0, 1, 2, 3]
2 physical cores: [0, 1]
Will test up to 2 CPUs
Cores | ID                | Description                      | OVRLP3 |  Mops | A/M-ratio | A/M-MHz | M/tsc-ratio
1     | pause_only        | pause instruction                |  1.000 |  2116 |      1.20 |    2993 |        0.98
1     | ucomis_clean      | scalar ucomis (w/ vzeroupper)    |  1.000 |   742 |      1.20 |    2991 |        0.98
1     | ucomis_dirty      | scalar ucomis (no vzeroupper)    |  1.000 |   717 |      1.18 |    2934 |        0.98
1     | scalar_iadd       | Scalar integer adds              |  1.000 |  2993 |      1.20 |    2989 |        0.97
1     | avx128_iadd       | 128-bit integer serial adds      |  1.000 |  2993 |      1.20 |    2990 |        0.98
1     | avx256_iadd       | 256-bit integer serial adds      |  1.000 |  2993 |      1.20 |    2990 |        0.97
1     | avx128_iadd_t     | 128-bit integer parallel adds    |  1.000 |  8973 |      1.19 |    2983 |        0.96
1     | avx256_iadd_t     | 256-bit integer parallel adds    |  1.000 |  8977 |      1.20 |    2995 |        0.99
1     | avx128_xor_zero   | 128-bit zeroing xor              |  1.000 | 11850 |      1.20 |    2995 |        0.99
1     | avx256_xor_zero   | 256-bit zeroing xor              |  1.000 | 11857 |      1.20 |    2995 |        0.99
1     | avx128_mov_sparse | 128-bit reg-reg mov              |  1.000 |  2970 |      1.20 |    2988 |        0.97
1     | avx256_mov_sparse | 256-bit reg-reg mov              |  1.000 |  2993 |      1.20 |    2995 |        0.99
1     | avx128_vshift     | 128-bit variable shift (vpsrlvd) |  1.000 |  2993 |      1.20 |    2994 |        0.99
1     | avx256_vshift     | 256-bit variable shift (vpsrlvd) |  1.000 |  2993 |      1.20 |    2991 |        0.98
1     | avx128_vshift_t   | 128-bit variable shift (vpsrlvd) |  1.000 |  5985 |      1.20 |    2993 |        0.98
1     | avx256_vshift_t   | 256-bit variable shift (vpsrlvd) |  1.000 |  5986 |      1.20 |    2994 |        0.99
1     | avx128_imul       | 128-bit integer muls (vpmuldq)   |  1.000 |   599 |      1.20 |    2992 |        0.98
1     | avx256_imul       | 256-bit integer muls (vpmuldq)   |  1.000 |   599 |      1.20 |    2993 |        0.98
1     | avx128_fma_sparse | 128-bit 64-bit sparse FMAs       |  1.000 |  2993 |      1.20 |    2992 |        0.98
1     | avx256_fma_sparse | 256-bit 64-bit sparse FMAs       |  1.000 |  2993 |      1.20 |    2991 |        0.98
1     | avx128_fma        | 128-bit serial DP FMAs           |  1.000 |   748 |      1.20 |    2986 |        0.98
1     | avx256_fma        | 256-bit serial DP FMAs           |  1.000 |   748 |      1.20 |    2991 |        0.97
1     | avx128_fma_t      | 128-bit parallel DP FMAs         |  1.000 |  5986 |      1.20 |    2993 |        0.98
1     | avx256_fma_t      | 256-bit parallel DP FMAs         |  1.000 |  5986 |      1.20 |    2995 |        0.99

Cores | ID                | Description                      | OVRLP3 |         Mops |    A/M-ratio |    A/M-MHz | M/tsc-ratio
2     | pause_only        | pause instruction                |  1.000 |   1996, 1994 |  1.16,  1.16 | 2884, 2886 |  1.00, 1.00
2     | ucomis_clean      | scalar ucomis (w/ vzeroupper)    |  1.000 |    717,  717 |  1.16,  1.16 | 2895, 2895 |  1.00, 1.00
2     | ucomis_dirty      | scalar ucomis (no vzeroupper)    |  1.000 |    717,  717 |  1.16,  1.16 | 2895, 2895 |  1.00, 1.00
2     | scalar_iadd       | Scalar integer adds              |  1.000 |   2865, 2888 |  1.16,  1.16 | 2895, 2896 |  1.00, 1.00
2     | avx128_iadd       | 128-bit integer serial adds      |  1.000 |   2893, 2893 |  1.16,  1.16 | 2895, 2895 |  1.00, 1.00
2     | avx256_iadd       | 256-bit integer serial adds      |  1.000 |   2893, 2893 |  1.16,  1.16 | 2895, 2895 |  1.00, 1.00
2     | avx128_iadd_t     | 128-bit integer parallel adds    |  1.000 |   8678, 8679 |  1.16,  1.16 | 2895, 2895 |  1.00, 1.00
2     | avx256_iadd_t     | 256-bit integer parallel adds    |  1.000 |   8680, 8681 |  1.16,  1.16 | 2895, 2895 |  1.00, 1.00
2     | avx128_xor_zero   | 128-bit zeroing xor              |  1.000 | 11456, 11460 |  1.16,  1.16 | 2896, 2895 |  1.00, 1.00
2     | avx256_xor_zero   | 256-bit zeroing xor              |  1.000 | 11457, 11459 |  1.16,  1.16 | 2895, 2895 |  1.00, 1.00
2     | avx128_mov_sparse | 128-bit reg-reg mov              |  1.000 |   2893, 2893 |  1.16,  1.16 | 2895, 2895 |  1.00, 1.00
2     | avx256_mov_sparse | 256-bit reg-reg mov              |  1.000 |   2893, 2893 |  1.16,  1.16 | 2895, 2895 |  1.00, 1.00
2     | avx128_vshift     | 128-bit variable shift (vpsrlvd) |  1.000 |   2893, 2893 |  1.16,  1.16 | 2895, 2895 |  1.00, 1.00
2     | avx256_vshift     | 256-bit variable shift (vpsrlvd) |  1.000 |   2893, 2893 |  1.16,  1.16 | 2895, 2895 |  1.00, 1.00
2     | avx128_vshift_t   | 128-bit variable shift (vpsrlvd) |  1.000 |   5787, 5787 |  1.16,  1.16 | 2895, 2895 |  1.00, 1.00
2     | avx256_vshift_t   | 256-bit variable shift (vpsrlvd) |  1.000 |   5786, 5787 |  1.16,  1.16 | 2895, 2895 |  1.00, 1.00
2     | avx128_imul       | 128-bit integer muls (vpmuldq)   |  1.000 |    579,  579 |  1.16,  1.16 | 2895, 2895 |  1.00, 1.00
2     | avx256_imul       | 256-bit integer muls (vpmuldq)   |  1.000 |    579,  579 |  1.16,  1.16 | 2895, 2895 |  1.00, 1.00
2     | avx128_fma_sparse | 128-bit 64-bit sparse FMAs       |  1.000 |   2893, 2893 |  1.16,  1.16 | 2895, 2895 |  1.00, 1.00
2     | avx256_fma_sparse | 256-bit 64-bit sparse FMAs       |  1.000 |   2893, 2893 |  1.16,  1.16 | 2895, 2895 |  1.00, 1.00
2     | avx128_fma        | 128-bit serial DP FMAs           |  1.000 |    723,  723 |  1.16,  1.16 | 2895, 2895 |  1.00, 1.00
2     | avx256_fma        | 256-bit serial DP FMAs           |  1.000 |    723,  723 |  1.16,  1.16 | 2895, 2895 |  1.00, 1.00
2     | avx128_fma_t      | 128-bit parallel DP FMAs         |  1.000 |   5785, 5787 |  1.16,  1.16 | 2896, 2896 |  1.00, 1.00
2     | avx256_fma_t      | 256-bit parallel DP FMAs         |  1.000 |   5786, 5787 |  1.16,  1.16 | 2895, 2895 |  1.00, 1.00

rmuir (Member Author) commented Oct 13, 2023

I guess you will probably have to modprobe msr first. I already have the msr module loaded for other nefarious purposes.

rmuir (Member Author) commented Oct 13, 2023

And at least the theory makes sense: this integer multiply is definitely "AVX-512 heavy", so if you have a CPU susceptible to throttling, it is better to do the 256-bit multiplies that we do today. I guess I feel it's an issue with SPECIES_PREFERRED on such CPUs, but for now I prefer being conservative. Would be cool if we could confirm it.

rmuir (Member Author) commented Oct 14, 2023

I'm gonna merge this, but we should continue to explore the Intel case. Not sure what we can do there though.

rmuir merged commit 872aee6 into apache:main on Oct 14, 2023 (4 checks passed)
benwtrent (Member) commented:

Thank y'all so much for digging into this @rmuir @gf2121 @ChrisHegarty @uschindler!

Maybe one day Panama Vector will mature enough to allow us to do nicer things with byte comparisons.

rmuir (Member Author) commented Oct 14, 2023

@benwtrent it isn't a Panama thing. These functions are 32-bit (they return int and float). There is no hope for these getting faster; I just hope you understand that.

uschindler added this to the 9.9.0 milestone and self-assigned it on Oct 15, 2023
uschindler (Contributor) commented:

I backported this one to 9.x.

gf2121 (Contributor) commented Oct 19, 2023

@gf2121 I think we could diagnose it further with https://github.com/travisdowns/avx-turbo

Thanks @rmuir for the profiling guidance!

Sorry for the delay. It took me some time to apply for root privileges on this machine. Here is the result: avx_turbo.log

rmuir (Member Author) commented Oct 19, 2023

Thank you @gf2121, it is confirmed. I include just the part of the table that is relevant. It is really great that you caught this.

ID          | Description                    | OVRLP3 |    Mops | A/M-ratio | A/M-MHz  | M/tsc-ratio
avx128_imul | 128-bit integer muls (vpmuldq) |  1.000 |     619 |      1.29 |     3093 |        1.00
avx256_imul | 256-bit integer muls (vpmuldq) |  1.000 |     619 |      1.29 |     3093 |        1.00
avx512_imul | 512-bit integer muls (vpmuldq) |  1.000 | 474 (!) |  1.08 (!) | 2594 (!) |        1.00

ChrisHegarty (Contributor) commented:

Thanks @rmuir @gf2121. I need to spend a bit more time evaluating this, but it looks like no action is needed here?

rmuir (Member Author) commented Oct 20, 2023

@ChrisHegarty there are plenty of actions we could take... but I implemented this same optimization safely in #12681.

See https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#Downclocking for a brief introduction.

rmuir (Member Author) commented Oct 20, 2023

To have any decent performance, we really need information about the CPU in question and its vector capabilities. The idea that you can write "one loop" that "runs anywhere" is an obvious pipe dream on the part of OpenJDK... we have to deal with special-case BS like this for every CPU. My advice to the OpenJDK side: give up on the pipe dream and give us some cpuinfo, or at least tell us what vector operations are supported. I'm almost to the point of writing a /proc/cpuinfo parser.

rmuir (Member Author) commented Oct 20, 2023

Also, I think JMH is bad news when it comes to downclocking: it does not show the true performance impact. Downclocking slows down other things on the machine as well; the user might have other shit going on too.

This is why the AVX-512 variants of the functions avoid the 512-bit multiply and just use 256-bit vectors for that part; it is the only safe approach.

Unfortunately this approach is slightly suboptimal for your Rocket Lake, which doesn't suffer from downclocking, but it is a disaster elsewhere, so we have to play it safe.

uschindler (Contributor) commented:

To have any decent performance, we really need information about the CPU in question and its vector capabilities. The idea that you can write "one loop" that "runs anywhere" is an obvious pipe dream on the part of OpenJDK... we have to deal with special-case BS like this for every CPU. My advice to the OpenJDK side: give up on the pipe dream and give us some cpuinfo, or at least tell us what vector operations are supported.

I think we should really open an issue at the JDK. We discussed this before. In my opinion, the system should have an API to query, for each operator, whether it is supported in the current configuration (CPU, species size, and Hotspot abilities). In addition, there should be a shortcut to figure out whether C2 is enabled at all; and if C2 is disabled, each of the specific queries for operator and species combinations should return "no".

Everything else makes the Vector API a huge trap. It is a specialist API, so please add detailed information to allow specialist use!
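
To make that concrete, here is a hypothetical shape for such a query API (nothing like this exists in the JDK today; all names are invented for illustration):

  import jdk.incubator.vector.VectorOperators;
  import jdk.incubator.vector.VectorSpecies;

  // hypothetical API sketch: every name here is invented
  public interface VectorCapabilities {
    // true only if the operator is intrinsified for the given species
    // in the current configuration (CPU, species size, Hotspot abilities)
    boolean isSupported(VectorOperators.Operator op, VectorSpecies<?> species);

    // shortcut: false when C2 is disabled; in that case every
    // isSupported(op, species) query must also return false
    boolean isVectorizationEnabled();
  }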

I'm almost to the point of writing a /proc/cpuinfo parser.

Doesn't work on Windows. 😱

uschindler (Contributor) commented:

Unfortunately this approach is slightly suboptimal for your Rocket Lake, which doesn't suffer from downclocking, but it is a disaster elsewhere, so we have to play it safe.

We could add a sysprop to enable advanced AVX-512 support; it could then be a static boolean on the implementation. If somebody enables it, they must be aware that it may cause slowdowns. I have seen stuff like this in native vector databases at Buzzwords (they told me).
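
A sketch of what that opt-in could look like (the property name is invented for illustration):

  // hypothetical opt-in, disabled by default; the property name is made up
  static final boolean ENABLE_HEAVY_AVX512 =
      Boolean.getBoolean("org.apache.lucene.vectorization.heavyAVX512");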

rmuir (Member Author) commented Oct 20, 2023

The Vector API should also fix its bugs. It is totally senseless to have IntVector.SPECIES_PREFERRED and FloatVector.SPECIES_PREFERRED and then always set them to '512' on every AVX-512 machine. It is not about the data type but about the operation you are doing.

OpenJDK has the same heuristics in its intrinsics/asm that I allude to, to avoid these problems: hacks in every intrinsic, defaults checking specific families of CPUs, and shit. Go look for yourself. But it doesn't want to expose this stuff to developers via the Vector API.

rmuir (Member Author) commented Oct 20, 2023

I would really just fix the API: instead of the IntVector.SPECIES_PREFERRED constant, which is meaningless, it should be a method taking VectorOperators... describing how you plan to use it. It should be something like this in our code:

import static jdk.incubator.vector.VectorOperators.ADD;
import static jdk.incubator.vector.VectorOperators.MUL;

// get the preferred species on this platform for doing multiplication and addition
static final VectorSpecies<Integer> DOT_PRODUCT_SPECIES = IntVector.preferredFor(MUL, ADD);

Then OpenJDK can keep its hacks to itself. But if they won't fix this, they should expose them.

rmuir (Member Author) commented Oct 20, 2023

Such a method would solve 95% of my problems if it would throw UnsupportedOperationException or return null when the hardware/Hotspot doesn't support all the requested VectorOperators.
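
Usage could then look something like this (again purely hypothetical, since preferredFor does not exist):

  // hypothetical: pick the widest species that supports MUL and ADD efficiently,
  // falling back to a conservative width otherwise
  static final VectorSpecies<Integer> DOT_PRODUCT_SPECIES = pickSpecies();

  static VectorSpecies<Integer> pickSpecies() {
    try {
      return IntVector.preferredFor(VectorOperators.MUL, VectorOperators.ADD);
    } catch (UnsupportedOperationException e) {
      return IntVector.SPECIES_128; // safe everywhere
    }
  }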

rmuir (Member Author) commented Oct 20, 2023

Unfortunately this approach is slightly suboptimal for your Rocket Lake, which doesn't suffer from downclocking, but it is a disaster elsewhere, so we have to play it safe.

We could add a sysprop to enable advanced AVX-512 support; it could then be a static boolean on the implementation. If somebody enables it, they must be aware that it may cause slowdowns. I have seen stuff like this in native vector databases at Buzzwords (they told me).

The JVM already has these: for example, a user can already set the max vector width and the AVX instruction level. I assume that AVX-512 users running on downclock-susceptible CPUs would already set flags to use only 256-bit vectors, so I am afraid of adding our own flags that conflict with those.
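
For reference, those existing HotSpot flags look like this (both are real flags; MaxVectorSize is in bytes, so 32 caps vectors at 256 bits):

  java -XX:UseAVX=2 -XX:MaxVectorSize=32 -jar target/vectorbench.jar Binary -p size=1024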

uschindler (Contributor) commented:

Hi,

The JVM already has these: for example, a user can already set the max vector width and the AVX instruction level. I assume that AVX-512 users running on downclock-susceptible CPUs would already set flags to use only 256-bit vectors, so I am afraid of adding our own flags that conflict with those.

The problem is that the JVM enables those flags by default, and then those machines get slow when they use Lucene, and support cases get opened at Elasticsearch/OpenSearch. So my wish would be to have this opt-in.

My idea was to add a sysprop like the MMap ones that enable/disable things manually (or disable unsafe unmapping). In that case the sysprop would enable the AVX-512 case. It won't conflict with a manual AVX=2 JVM override setting (because then preferredBitSize=256 and our flag is a no-op).

ChrisHegarty (Contributor) commented:

Just dropping an ACK here, for now. I do get the issues, and I agree that there could be better ways to frame things at the vector API level.

uschindler (Contributor) commented:

Just dropping an ACK here, for now. I do get the issues, and I agree that there could be better ways to frame things at the vector API level.

Let's write a proposal together in a Google doc (or similar), and later we could open an issue or a JEP. As we are both OpenJDK members, it would be helpful for me to participate in that process (I want to learn something new).
