Stop treating sine/cosine/round as target intrinsics on x86. #7023
Conversation
@dotnet/jit-contrib @CarolEidt @BruceForstall PTAL |
@@ -2641,6 +2644,26 @@ regMaskTP LinearScan::getKillSetForNode(GenTree* tree) | |||
} | |||
break; | |||
|
|||
#ifdef _TARGET_X86_ | |||
case GT_INTRINSIC: | |||
if (compiler->compFloatingPointUsed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor question - why did you add this check; does it really add value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same check performed above at line 2603: https://github.com/dotnet/coreclr/pull/7023/files#diff-93fb9832a03a9783e2a415b581fcb45cL2603. I don't think it's necessary, so I'm happy to remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's your call; I don't feel strongly about it.
Wouldn't be better to just do what x64 is doing (helper calls)? What we'll get now is neither the old JIT32 bit behavior (due to the use of SSE in the new JIT32) nor the x64 behavior (due to the use of x87 intrinsics in the new JIT32). Now that both x86 and x64 are using SSE it would be better to use the same transcendental math helpers to avoid precision differences between platforms. |
@mikedn: that's a fair point. Depending on the calling convention for the helpers, we may even get better codegen out of them. @dotnet/jit-contrib thoughts? |
I think that this approach is the right one to get us unblocked on these. |
There were some decent perf gains when I modified the x64 calls to use the CRT implementation of 'cos', 'sin', and 'tan' (#4847), this also resulted in no loss of precision for the existing tests. This was because all of the existing tests were only checking to within a few decimal places (9 IIRC). I ended up adding a number of new tests that improve the precision checks in the tests and coverage overall. These same perf gains would likely be seen if brought over to the x86 platform and should also not result in any unreasonable loss in precision (and in some cases, may result in better precision due to limitations of the fsin, fcos, and ftan instructions). |
If I understand correctly, we need to bring in new CRT helpers that use SSE calling convention to x86 and implement VM side helpers that in turn calls into CRT helpers. This needs to be done so that VM supports both Legacy Jit32 and RyuJIt 32. If we think that using SSE-based helpers is the long-term direction and using x87 instructions for now is to unblock us, there is an easier option to unblock ourselves functionally: not to treat sin/cos as intrinsics on RyuJIT x86 (by making changes to importer.cpp::IsTargetIntrinsic()). Obviously this will have poor code quality compared to emitting x87 instructions. Another question is do we think that SSE-based sin/cos helpers a must for shipping RyuJIT x86 1.0 in coreclr? If yes, we can use the above mentioned quick option to functionally unblock us. |
|
||
// Load the source onto the x87 stack. | ||
genConsumeReg(source); | ||
genFld(source); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a clean and easy way to achieve what genFld() does plus it also allows us to optimize for the case when the operand is a memory-op:
- Compiler class declares two members to hold local numbers, these are initialized to BAD_VAR_NUM
floatTemp and doubleTemp - TreeNodeInfoInitIntrinsic() - in case of cos/sin/round intrinsics:
if the operand is a memory-op, mark it as contained, otherwise create a temp of float/double depending on target type of GT_INTRINSIC node.
e.g. if (floatTemp == BAD_VAR_NUM)
{
floatTemp = comp->lvaGrabTemp(...)
// initialize lv type, lvRefCnt, lvDoNotEnregister
// set lvaSortAgain to true
}
- genIntrinsic() - in case of sin/cos/round, if the operand is not a contained memor-op, use floatTemp/doubleTemp to store the value of the operand to memory and to copy to x87 stack from memory. If it is a contained memory op, we can use it directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is a contained memory op, we can use it directly.
How? I tried to go down this road but could not find an appropriate emit instruction that to generate an fld
with a memory operand that was not on the stack.
Even if there is a straightforward way to do this, I'm not sure it's worth the effort unless we stick with this solution long-term.
a7e7d30
to
72e2eb5
Compare
@dotnet/jit-contrib the test failures are due to instances of IR like the following:
In this case, the result of the call and the result of the intrinsic are live concurrently, and each needs a spill temp for the transition from the x87 stack. In this case, the max number of concurrent I am pretty certain that we cannot end up in a similar situation with calls. If the results of two calls are concurrently live on x86, the execution of the later call will have killed the register in which the result of the earlier call lives (if any). The RA will then spill the earlier result as part of its normal operation. Later, the RA will record one additional spill that will be used to move the result of the later call off of the FP stack, and the number of concurrent spills will be correct. As a result, I'm of the opinion that solving the actual problem here is not worth it, and I intend to switch this PR to simply stop treating sine/cosine/round as intrinsics on RyuJIT-x86. |
These intrinsics were supported by JIT32 using the corresponding x87 instructions. There are a few downsides to doing the same in RyuJIT, however, because it uses SSE for all other floating-point math: - The mix of precisions used by the combination of SSE + x87 instructions matches neither the JIT32 nor the RyuJIT/x64 behavior. - Using the x87 instructions is more expensive, as the result of the instruction may need to be moved via memory to an SSE register. - The RA would need to be udpated in order to better understand that each x87 instruction may require a spill temp in order to retrieve the result. Issue #7097 tracks investigating a custom calling convention for the appropriate helper calls if necessasry.
72e2eb5
to
32f6613
Compare
The OSX failure appears to be an infrastructure issue: http://dotnet-ci.cloudapp.net/job/dotnet_coreclr/job/master/job/checked_osx_prtest/5564/consoleFull#-6271423686ee26338-6221-4f7c-8c8f-1d1d5cb4704a cc @mmitche |
Taking a look, thanks. |
@dotnet-bot test OSX x64 Checked Build and Test |
@dotnet/jit-contrib Ping. This change is now green. |
@@ -17895,7 +17895,7 @@ void Compiler::impMarkInlineCandidate(GenTreePtr callNode, | |||
|
|||
bool Compiler::IsTargetIntrinsic(CorInfoIntrinsics intrinsicId) | |||
{ | |||
#if defined(_TARGET_AMD64_) | |||
#if defined(_TARGET_AMD64_) || (defined(_TARGET_X86_) && !defined(LEGACY_BACKEND)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is not a long term solution, please add a TODO comment.
LGTM |
Stop treating sine/cosine/round as target intrinsics on x86. Commit migrated from dotnet/coreclr@ddc3d67
These intrinsics were supported by JIT32 using the corresponding
x87 instructions. There are a few downsides to doing the same in
RyuJIT, however, because it uses SSE for all other floating-point
math:
instructions matches neither the JIT32 nor the RyuJIT/x64
behavior.
the instruction may need to be moved via memory to an SSE
register.
that each x87 instruction may require a spill temp in order to
retrieve the result.
Issue #7097 tracks investigating a custom calling convention for
the appropriate helper calls if necessasry.