Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Stop treating sine/cosine/round as target intrinsics on x86. #7023

Merged
merged 2 commits into from Sep 7, 2016

Conversation

pgavlin
Copy link

@pgavlin pgavlin commented Sep 1, 2016

These intrinsics were supported by JIT32 using the corresponding
x87 instructions. There are a few downsides to doing the same in
RyuJIT, however, because it uses SSE for all other floating-point
math:

  • The mix of precisions used by the combination of SSE + x87
    instructions matches neither the JIT32 nor the RyuJIT/x64
    behavior.
  • Using the x87 instructions is more expensive, as the result of
    the instruction may need to be moved via memory to an SSE
    register.
  • The RA would need to be udpated in order to better understand
    that each x87 instruction may require a spill temp in order to
    retrieve the result.

Issue #7097 tracks investigating a custom calling convention for
the appropriate helper calls if necessasry.

@pgavlin
Copy link
Author

pgavlin commented Sep 1, 2016

@dotnet/jit-contrib @CarolEidt @BruceForstall PTAL

@@ -2641,6 +2644,26 @@ regMaskTP LinearScan::getKillSetForNode(GenTree* tree)
}
break;

#ifdef _TARGET_X86_
case GT_INTRINSIC:
if (compiler->compFloatingPointUsed)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor question - why did you add this check; does it really add value?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same check performed above at line 2603: https://github.com/dotnet/coreclr/pull/7023/files#diff-93fb9832a03a9783e2a415b581fcb45cL2603. I don't think it's necessary, so I'm happy to remove it.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's your call; I don't feel strongly about it.

@CarolEidt
Copy link

:shipit:

@mikedn
Copy link

mikedn commented Sep 1, 2016

Wouldn't be better to just do what x64 is doing (helper calls)? What we'll get now is neither the old JIT32 bit behavior (due to the use of SSE in the new JIT32) nor the x64 behavior (due to the use of x87 intrinsics in the new JIT32). Now that both x86 and x64 are using SSE it would be better to use the same transcendental math helpers to avoid precision differences between platforms.

@pgavlin
Copy link
Author

pgavlin commented Sep 1, 2016

@mikedn: that's a fair point. Depending on the calling convention for the helpers, we may even get better codegen out of them. @dotnet/jit-contrib thoughts?

@CarolEidt
Copy link

I think that this approach is the right one to get us unblocked on these.
I think (though I'm not really sure) that enabling the helpers requires some configuration changes in the VM. That said, we need to decide whether we are looking for more compatibility with the 32-bit desktop version of .NET or the 64-bit (desktop or coreclr) version. I think the latter provides better long-term compatibility, so IMO we should consider moving in that direction.
Also, I don't think we have an SSE-based calling convention for 32-bit, even for helper calls. That's something we'd want to add (probably for other reasons as well). Pat - perhaps you could open an issue to track this?

@tannergooding
Copy link
Member

There were some decent perf gains when I modified the x64 calls to use the CRT implementation of 'cos', 'sin', and 'tan' (#4847), this also resulted in no loss of precision for the existing tests. This was because all of the existing tests were only checking to within a few decimal places (9 IIRC). I ended up adding a number of new tests that improve the precision checks in the tests and coverage overall.

These same perf gains would likely be seen if brought over to the x86 platform and should also not result in any unreasonable loss in precision (and in some cases, may result in better precision due to limitations of the fsin, fcos, and ftan instructions).

@sivarv
Copy link
Member

sivarv commented Sep 1, 2016

Depending on the calling convention for the helpers, we may even get better codegen out of them.

If I understand correctly, we need to bring in new CRT helpers that use SSE calling convention to x86 and implement VM side helpers that in turn calls into CRT helpers. This needs to be done so that VM supports both Legacy Jit32 and RyuJIt 32.

If we think that using SSE-based helpers is the long-term direction and using x87 instructions for now is to unblock us, there is an easier option to unblock ourselves functionally: not to treat sin/cos as intrinsics on RyuJIT x86 (by making changes to importer.cpp::IsTargetIntrinsic()). Obviously this will have poor code quality compared to emitting x87 instructions.

Another question is do we think that SSE-based sin/cos helpers a must for shipping RyuJIT x86 1.0 in coreclr? If yes, we can use the above mentioned quick option to functionally unblock us.


// Load the source onto the x87 stack.
genConsumeReg(source);
genFld(source);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a clean and easy way to achieve what genFld() does plus it also allows us to optimize for the case when the operand is a memory-op:

  • Compiler class declares two members to hold local numbers, these are initialized to BAD_VAR_NUM
    floatTemp and doubleTemp
  • TreeNodeInfoInitIntrinsic() - in case of cos/sin/round intrinsics:
    if the operand is a memory-op, mark it as contained, otherwise create a temp of float/double depending on target type of GT_INTRINSIC node.
e.g.  if (floatTemp == BAD_VAR_NUM)
       {
            floatTemp = comp->lvaGrabTemp(...)
            // initialize lv type, lvRefCnt, lvDoNotEnregister
            // set lvaSortAgain to true
       }
  • genIntrinsic() - in case of sin/cos/round, if the operand is not a contained memor-op, use floatTemp/doubleTemp to store the value of the operand to memory and to copy to x87 stack from memory. If it is a contained memory op, we can use it directly.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is a contained memory op, we can use it directly.

How? I tried to go down this road but could not find an appropriate emit instruction that to generate an fld with a memory operand that was not on the stack.

Even if there is a straightforward way to do this, I'm not sure it's worth the effort unless we stick with this solution long-term.

@pgavlin
Copy link
Author

pgavlin commented Sep 7, 2016

@dotnet/jit-contrib the test failures are due to instances of IR like the following:

                                                 /--*  t237   double arg0
N107 ( 21,  8) [000224] --C-G-------      t224 = *  call      double System.Math.Exp

N109 (  3,  2) [000060] ------------       t60 =    lclVar    double V03 loc3         u:3 (last use) REG NA $401

                                                 /--*  t60    double 
N111 ( 39,  6) [000061] ------------       t61 = *  intrinsic double sin REG NA $482

                                                 /--*  t224   double 
                                                 +--*  t61    double 
N113 ( 65, 18) [000062] --C-G-------       t62 = *  *         double REG NA $402

In this case, the result of the call and the result of the intrinsic are live concurrently, and each needs a spill temp for the transition from the x87 stack. In this case, the max number of concurrent TYP_DOUBLE spills will be incorrect, since the spills for FP-typed calls and intrinsics are not treated in a first class fashion by the RA.

I am pretty certain that we cannot end up in a similar situation with calls. If the results of two calls are concurrently live on x86, the execution of the later call will have killed the register in which the result of the earlier call lives (if any). The RA will then spill the earlier result as part of its normal operation. Later, the RA will record one additional spill that will be used to move the result of the later call off of the FP stack, and the number of concurrent spills will be correct.

As a result, I'm of the opinion that solving the actual problem here is not worth it, and I intend to switch this PR to simply stop treating sine/cosine/round as intrinsics on RyuJIT-x86.

These intrinsics were supported by JIT32 using the corresponding
x87 instructions. There are a few downsides to doing the same in
RyuJIT, however, because it uses SSE for all other floating-point
math:
- The mix of precisions used by the combination of SSE + x87
  instructions matches neither the JIT32 nor the RyuJIT/x64
  behavior.
- Using the x87 instructions is more expensive, as the result of
  the instruction may need to be moved via memory to an SSE
  register.
- The RA would need to be udpated in order to better understand
  that each x87 instruction may require a spill temp in order to
  retrieve the result.

Issue #7097 tracks investigating a custom calling convention for
the appropriate helper calls if necessasry.
@pgavlin pgavlin changed the title Implement the x87 math intrinsics for x86. Stop treating sine/cosine/round as target intrinsics on x86. Sep 7, 2016
@pgavlin
Copy link
Author

pgavlin commented Sep 7, 2016

@mmitche
Copy link
Member

mmitche commented Sep 7, 2016

Taking a look, thanks.

@pgavlin
Copy link
Author

pgavlin commented Sep 7, 2016

@dotnet-bot test OSX x64 Checked Build and Test

@pgavlin
Copy link
Author

pgavlin commented Sep 7, 2016

@dotnet/jit-contrib Ping. This change is now green.

@@ -17895,7 +17895,7 @@ void Compiler::impMarkInlineCandidate(GenTreePtr callNode,

bool Compiler::IsTargetIntrinsic(CorInfoIntrinsics intrinsicId)
{
#if defined(_TARGET_AMD64_)
#if defined(_TARGET_AMD64_) || (defined(_TARGET_X86_) && !defined(LEGACY_BACKEND))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is not a long term solution, please add a TODO comment.

@CarolEidt
Copy link

LGTM

@sivarv
Copy link
Member

sivarv commented Sep 7, 2016

:shipit:

@pgavlin pgavlin merged commit ddc3d67 into dotnet:master Sep 7, 2016
@pgavlin pgavlin deleted the X87MathIntrinsics branch September 7, 2016 21:56
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Stop treating sine/cosine/round as target intrinsics on x86.

Commit migrated from dotnet/coreclr@ddc3d67
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
7 participants