New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: fix PPC_FP snan/qnan handling #1623
Conversation
0a613d3
to
4d8fcae
Compare
|
Might it be possible to merge those branches into one somehow? |
|
I can't think of any way without adding even more uops |
|
squees all over the PR IT FIXES IT |
|
You're a wizard. Good job, how did you manage to fix it without having the game? |
|
I figured that the only way things could be broken is if my conversion algorithm was actually lossy on some inputs when it shouldn't be, so I wrote a C implementation and checked it on all possible inputs (2^32 possible float bit patterns)... and found I was right. Then had to fix it <___<; |
|
|
||
| SetJumpTarget(nanConversion); | ||
| CVTSS2SD(dst, R(dst)); | ||
| CMP(32, R(gprsrc), Imm32(0x7fc00000)); |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
4d8fcae
to
f82deb9
Compare
|
Just to recap what I said on IRC yesterday, I'd appreciate if you checked before merging whether using a cmov or whatever to avoid two branches can reduce the performance hit. Thanks... |
|
I don't know any good way to do this with a cmov that won't balloon the instruction count, do you have any suggestions? |
| @@ -906,23 +909,39 @@ void EmuCodeBlock::ConvertDoubleToSingle(X64Reg dst, X64Reg src) | |||
| // if it is. | |||
| MOVQ_xmm(R(RSCRATCH), src); | |||
| SHR(64, R(RSCRATCH), Imm8(55)); | |||
| CMP(8, R(RSCRATCH), Imm8(0xff)); | |||
| FixupBranch nanConversion = J_CC(CC_E, true); | |||
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
|
Something like Okay, I don't know if this would be a good idea or not. But it saves a useless branch on every operation, and is one less byte than a long jne... |
f82deb9
to
ae78ee1
Compare
|
This has little to no performance drop for me. It fixes some errors that probably need to be fixed as well. |
|
8 seconds on povray is at least pretty significant to me, though certainly not that bad. I really do welcome suggestions on specific code to improve any of the asm paths here though, since this is very much "hyperoptimize a small instruction sequence that will be in a million places" |
|
For ConvertDoubleToSingle, you could just fast-path exponents in the range 0x387 < e < 0x7f8, or something like that; that should be the same cost as the old code on the fast path, although it throws more inputs onto the slow path. I guess the primary issue with this is that you want 0.0 on the fast path... but it might be good enough special-case it on the slow path? Might depend on what the inputs look like. |
ae78ee1
to
73f4790
Compare
|
How about this "high-level" approach for static const __m128i GC_ALIGNED16(not_sign_bit) = _mm_set_epi64x(0xffffffffffffffff, 0x7fffffffffffffff);
static const double GC_ALIGNED16(min_norm_single) = std::numeric_limits<float>::min();
UCOMISD(src, R(src));
FixupBranch nanConversion = J_CC(CC_P, true);
VPAND(XMM0, src, M(¬_sign_bit));
UCOMISD(XMM0, M(&min_norm_single));
FixupBranch denormalConversion = J_CC(CC_B, true);
CVTSD2SS(dst, R(src));
// ...C++ equivalent would be: if (src != src) {
// nanConversion
} else if (std::abs(src) < std::numeric_limits<float>::min()) {
// denormalConversion (including zero :/ )
} else {
// just CVTSD2SS
}In the four POV-Ray runs I did, my code was 1-2 seconds slower. I do think it's a lot more readable though. Thoughts? |
5ac0eff
to
1523258
Compare
Also simplify the conversion code with some suggestions by flacs; might even be slightly faster now despite handling more cases.
1523258
to
3d12849
Compare
|
I'm OK with this. And by "OK" I mean "aww yeah". |
| // We'd normally need to MOVDDUP here to put the single in the top half of the output register too, but | ||
| // this function is only used to go directly to a following store, so we omit the MOVDDUP here. | ||
| } | ||
| #endif // MORE_ACCURATE_DOUBLETOSINGLE | ||
|
|
||
| // Converting single->double is a bit easier because all single denormals are double normals. |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
|
Much better now; thanks flacs! |
| // needlessly send it through the slow path. If we subtract 1 before doing the comparison, it turns | ||
| // float-zero into 0xffffffff (skipping the slow path). This results in a single non-denormal being sent | ||
| // through the slow path (0x00800000), but the performance effects of that should be negligible. | ||
| SUB(32, R(gprsrc), Imm8(1)); |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
|
LGTM. This fixes Beyond Good and Evil (issue 7897) and performance is nearly identical. |
|
Code looks fine to me. |
JIT: fix PPC_FP snan/qnan handling
|
Just wanted to report that in fact performance isn't really so identical: on my system this commit causes a speed loss around 3% |
|
A 3% speed loss isn't actually that big compared to one of the first version which had me drop almost twice as much as that. AFAIK @FioraAeterna was aware that there would be a performance loss and that it would be inevitable. |
|
I wanted only to report/prove that the statement "performance is nearly identical" was definitely wrong ;) |
|
I was going off someone else's benchmarks, sorry! |
|
Speed loss is fine once this actually works. PPC_FP stuff is things we kind of need. |
Needs testing to see if this fixes Beyond Good and Evil