Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: fix PPC_FP snan/qnan handling #1623

Merged
merged 1 commit into from Dec 9, 2014
Merged

Conversation

FioraAeterna
Copy link
Contributor

Needs testing to see if this fixes Beyond Good and Evil

@FioraAeterna FioraAeterna force-pushed the fixppcfp branch 5 times, most recently from 0a613d3 to 4d8fcae Compare December 1, 2014 01:01
@comex
Copy link
Contributor

comex commented Dec 1, 2014

Might it be possible to merge those branches into one somehow?

@FioraAeterna
Copy link
Contributor Author

I can't think of any way without adding even more uops

@FioraAeterna
Copy link
Contributor Author

squees all over the PR

IT FIXES IT

@PatrickFerry
Copy link
Contributor

You're a wizard. Good job, how did you manage to fix it without having the game?
squee

@FioraAeterna
Copy link
Contributor Author

I figured that the only way things could be broken is if my conversion algorithm was actually lossy on some inputs when it shouldn't be, so I wrote a C implementation and checked it on all possible inputs (2^32 possible float bit patterns)... and found I was right.

Then had to fix it <___<;


SetJumpTarget(nanConversion);
CVTSS2SD(dst, R(dst));
CMP(32, R(gprsrc), Imm32(0x7fc00000));

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

@comex
Copy link
Contributor

comex commented Dec 1, 2014

Just to recap what I said on IRC yesterday, I'd appreciate if you checked before merging whether using a cmov or whatever to avoid two branches can reduce the performance hit. Thanks...

@FioraAeterna
Copy link
Contributor Author

I don't know any good way to do this with a cmov that won't balloon the instruction count, do you have any suggestions?

@@ -906,23 +909,39 @@ void EmuCodeBlock::ConvertDoubleToSingle(X64Reg dst, X64Reg src)
// if it is.
MOVQ_xmm(R(RSCRATCH), src);
SHR(64, R(RSCRATCH), Imm8(55));
CMP(8, R(RSCRATCH), Imm8(0xff));
FixupBranch nanConversion = J_CC(CC_E, true);

This comment was marked as off-topic.

This comment was marked as off-topic.

@comex
Copy link
Contributor

comex commented Dec 1, 2014

Something like xor %ecx, %ecx; cmovz %ecx, %eax? After the subtraction, so it would become zero, which <= 3.

Okay, I don't know if this would be a good idea or not. But it saves a useless branch on every operation, and is one less byte than a long jne...

@JMC47
Copy link
Contributor

JMC47 commented Dec 5, 2014

This has little to no performance drop for me. It fixes some errors that probably need to be fixed as well.

@FioraAeterna
Copy link
Contributor Author

8 seconds on povray is at least pretty significant to me, though certainly not that bad. I really do welcome suggestions on specific code to improve any of the asm paths here though, since this is very much "hyperoptimize a small instruction sequence that will be in a million places"

@magumagu
Copy link
Contributor

magumagu commented Dec 6, 2014

For ConvertDoubleToSingle, you could just fast-path exponents in the range 0x387 < e < 0x7f8, or something like that; that should be the same cost as the old code on the fast path, although it throws more inputs onto the slow path. I guess the primary issue with this is that you want 0.0 on the fast path... but it might be good enough special-case it on the slow path? Might depend on what the inputs look like.

@Tilka
Copy link
Member

Tilka commented Dec 6, 2014

How about this "high-level" approach for ConvertDoubleToSingle()'s fast path?

static const __m128i GC_ALIGNED16(not_sign_bit) = _mm_set_epi64x(0xffffffffffffffff, 0x7fffffffffffffff);
static const double GC_ALIGNED16(min_norm_single) = std::numeric_limits<float>::min();

UCOMISD(src, R(src));
FixupBranch nanConversion = J_CC(CC_P, true);
VPAND(XMM0, src, M(&not_sign_bit));
UCOMISD(XMM0, M(&min_norm_single));
FixupBranch denormalConversion = J_CC(CC_B, true);
CVTSD2SS(dst, R(src));
// ...

C++ equivalent would be:

if (src != src) {
  // nanConversion
} else if (std::abs(src) < std::numeric_limits<float>::min()) {
  // denormalConversion (including zero :/ )
} else {
  // just CVTSD2SS
}

In the four POV-Ray runs I did, my code was 1-2 seconds slower. I do think it's a lot more readable though. Thoughts?

@FioraAeterna FioraAeterna force-pushed the fixppcfp branch 5 times, most recently from 5ac0eff to 1523258 Compare December 6, 2014 20:57
Also simplify the conversion code with some suggestions by flacs; might even
be slightly faster now despite handling more cases.
@Tilka
Copy link
Member

Tilka commented Dec 6, 2014

I'm OK with this. And by "OK" I mean "aww yeah".

// We'd normally need to MOVDDUP here to put the single in the top half of the output register too, but
// this function is only used to go directly to a following store, so we omit the MOVDDUP here.
}
#endif // MORE_ACCURATE_DOUBLETOSINGLE

// Converting single->double is a bit easier because all single denormals are double normals.

This comment was marked as off-topic.

@FioraAeterna
Copy link
Contributor Author

Much better now; thanks flacs!

// needlessly send it through the slow path. If we subtract 1 before doing the comparison, it turns
// float-zero into 0xffffffff (skipping the slow path). This results in a single non-denormal being sent
// through the slow path (0x00800000), but the performance effects of that should be negligible.
SUB(32, R(gprsrc), Imm8(1));

This comment was marked as off-topic.

This comment was marked as off-topic.

@OussamaDanba
Copy link
Contributor

LGTM. This fixes Beyond Good and Evil (issue 7897) and performance is nearly identical.

@skidau
Copy link
Contributor

skidau commented Dec 9, 2014

Code looks fine to me.

skidau added a commit that referenced this pull request Dec 9, 2014
JIT: fix PPC_FP snan/qnan handling
@skidau skidau merged commit 1ce3696 into dolphin-emu:master Dec 9, 2014
@psennermann
Copy link

Just wanted to report that in fact performance isn't really so identical: on my system this commit causes a speed loss around 3%

@OussamaDanba
Copy link
Contributor

A 3% speed loss isn't actually that big compared to one of the first version which had me drop almost twice as much as that. AFAIK @FioraAeterna was aware that there would be a performance loss and that it would be inevitable.

@psennermann
Copy link

I wanted only to report/prove that the statement "performance is nearly identical" was definitely wrong ;)

@FioraAeterna
Copy link
Contributor Author

I was going off someone else's benchmarks, sorry!

@JMC47
Copy link
Contributor

JMC47 commented Dec 9, 2014

Speed loss is fine once this actually works. PPC_FP stuff is things we kind of need.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
9 participants