Improvements to System.Math InternalCall code. #4847

Merged
merged 6 commits into from Jun 2, 2016

Conversation

@tannergooding
Member

commented May 8, 2016

This cleans up the native code for the System.Math internal calls. There were several instances of workarounds implemented for much older versions of the MSVC compiler that are just not relevant anymore.

The workarounds removed are as follows:

  • On Windows AMD64, Sin/Cos/Tan were implemented in vm\amd64\JitHelpers_Fast.asm (using x87 floating-point instructions) because the CRT versions were previously too slow.
  • Log, Log10, Exp, Pow, and Round were being compiled with /fp:precise because the /fp:fast implementation was previously slower.
  • Exp(+/-INFINITY) was previously special-cased because the CRT implementation did not return the expected values.
  • On Windows x86, Pow had some additional code (and inline assembly) that was unnecessary.
  • On Windows x86, Round was using inline assembly to call the frndint x87 floating-point instruction.
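For context on the Exp(+/-INFINITY) item, the results a conforming runtime is expected to return are exp(+inf) = +inf and exp(-inf) = +0. The following is a minimal sketch of that check, assuming a C runtime that follows IEEE 754 / C99 Annex F; the helper name `expHandlesInfinity` is hypothetical and is not code from this PR.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper (not part of the PR): returns true when the C runtime's
// exp() produces the IEEE 754 / C99 Annex F results at +/-infinity.
bool expHandlesInfinity()
{
    const double inf = std::numeric_limits<double>::infinity();

    // exp(+inf) is expected to be +inf.
    const double atPositive = std::exp(inf);
    const bool positiveOk = std::isinf(atPositive) && atPositive > 0.0;

    // exp(-inf) is expected to be zero.
    const bool negativeOk = std::exp(-inf) == 0.0;

    return positiveOk && negativeOk;
}
```

When a check like this holds for the CRT in use, a hand-written special case for the infinities adds nothing.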

In all of these cases, the replaced code was one or more of the following:

  • No longer needed as the bug had been fixed
  • Went against the Intel/AMD CPU optimization guidelines by using x87 floating-point instructions. Both manuals recommend using SSE/SSE2 instructions unless there are special circumstances requiring the x87 floating-point instructions; the SSE/SSE2 instructions can be pipelined, have lower latency, use hardware registers rather than the floating-point stack, and generally provide faster computations than the special x87 floating-point instructions (such as fsin, fcos, and fptan).
  • Was no longer more performant than the 'correct' implementation
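As an illustration of the x87 point above: frndint rounds to an integer using the current rounding mode, which by default is round-to-nearest with ties going to the even neighbour. Standard C++ exposes the same behaviour through `std::nearbyint`, so no inline assembly is needed to get it. This is a sketch under the assumption that the default rounding mode is in effect; `roundHalfToEven` is a hypothetical name and is not necessarily how the PR implements Round.

```cpp
#include <cmath>

// Illustrative only: rounds to the nearest integer using the current rounding
// mode. Under the default mode (FE_TONEAREST), ties go to the even neighbour,
// which matches what frndint does in its default configuration.
double roundHalfToEven(double x)
{
    return std::nearbyint(x);
}
```

For example, 2.5 rounds to 2.0 and 3.5 rounds to 4.0 under the default mode.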
@tannergooding

Member Author

commented May 9, 2016

This should be reviewable. Everything is passing and looking good, so I don't have any other changes I am planning on making (unless otherwise requested).

@jkotas

Member

commented May 10, 2016

@tannergooding Could you please split this into several smaller PRs? It will be hard to make progress on it as is because it has too much in it.

  • File rename so that the diff for the rest looks more reasonable
  • Non-x86 specific changes
  • x86 specific changes. How did you test that these changes are good?

Thanks!

@tannergooding

Member Author

commented May 10, 2016

@jkotas, this becomes significantly easier to review if you look at each of the commits individually, rather than all at once. It also becomes simpler to review if you add ?w=1 to the end of the URL, as it causes whitespace-only changes to be left out of the diff.

// instead returns +1.0.

if(IS_DBL_INFINITY(y) && IS_DBL_NEGATIVEONE(x)) {
*((INT64 *)(&result)) = CLR_NAN_64;

@jamesqo

jamesqo May 11, 2016

Contributor

Just a question, not too familiar with native code but isn't violating strict aliasing supposed to be UB?

@tannergooding

tannergooding May 23, 2016

Author Member

Yes, violating strict aliasing is UB according to the spec, but clang, gcc, and msvc will handle this code as intended.
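For readers who want to avoid the aliasing question entirely, a byte copy through `std::memcpy` is the usual well-defined way to reinterpret bits in C++, and mainstream compilers typically optimize it to a plain register move. The sketch below is not code from this PR: `doubleFromBits` and `kClrNan64` are hypothetical names, and the bit pattern is an assumed stand-in for the `CLR_NAN_64` macro.

```cpp
#include <cstdint>
#include <cstring>

// Assumed value: a quiet NaN bit pattern with the sign bit set. The name
// kClrNan64 is hypothetical and stands in for the CLR_NAN_64 macro.
constexpr std::uint64_t kClrNan64 = 0xFFF8000000000000ULL;

// Builds a double from raw bits without casting between unrelated pointer
// types, so no strict-aliasing rule is violated.
double doubleFromBits(std::uint64_t bits)
{
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "double must be 64 bits wide");
    double value;
    std::memcpy(&value, &bits, sizeof(value)); // well-defined byte copy
    return value;
}
```

With this helper, the assignment in the snippet above could be written as `result = doubleFromBits(kClrNan64);`.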

@AndyAyersMS

Member

commented May 19, 2016

@tannergooding any plans to implement perf tests for these math functions?

Also how about more rigorous correctness tests?

There are a few tests over in CoreFx, but they're minimal and don't check boundary cases. There are PAL tests for the underlying native methods, but nothing that verifies these when called from managed code.

@tannergooding

Member Author

commented May 19, 2016

@AndyAyersMS. I'll see about getting some perf tests implemented as well as some more rigorous correctness tests.

@AndyAyersMS

Member

commented May 20, 2016

I wrote a simple perf test for Log, feel free to build on that if you like.

BTW math function performance seems significantly worse on Linux than on Windows.

@tannergooding tannergooding force-pushed the tannergooding:math branch from 20b8ce0 to 6357499 May 22, 2016

@tannergooding

Member Author

commented May 22, 2016

I have added some very basic performance tests for all of the System.Math functions which are implemented via InternalCall.

All performance tests are implemented as follows:

  • 100,000 iterations are executed
  • The times of all iterations are summed to compute the Total Time
  • The times of all iterations are averaged to compute the Average Time
  • A single iteration executes some simple operation, using the function under test, 5000 times
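The structure described in the list above can be sketched roughly as follows. This is a C++ approximation under stated assumptions, not the managed test code from the PR: `BenchResult` and `benchmarkSin` are hypothetical names, `std::sin` stands in for the function under test, and the input step of 0.001 is arbitrary.

```cpp
#include <chrono>
#include <cmath>

struct BenchResult
{
    double totalSeconds;   // sum of the time taken by every outer iteration
    double averageSeconds; // totalSeconds divided by the iteration count
};

// Hypothetical harness: times `iterations` outer iterations, each of which
// calls std::sin `innerCount` times on a slowly changing input.
BenchResult benchmarkSin(int iterations, int innerCount)
{
    using Clock = std::chrono::steady_clock;
    volatile double sink = 0.0; // keeps the compiler from discarding the work
    double total = 0.0;

    for (int i = 0; i < iterations; ++i)
    {
        const auto start = Clock::now();

        double value = 0.0;
        double acc = 0.0;
        for (int j = 0; j < innerCount; ++j)
        {
            acc += std::sin(value);
            value += 0.001; // arbitrary step so each call sees a new input
        }

        const auto stop = Clock::now();
        sink = acc;
        total += std::chrono::duration<double>(stop - start).count();
    }

    return { total, iterations > 0 ? total / iterations : 0.0 };
}
```

Calling `benchmarkSin(100000, 5000)` would mirror the iteration counts listed above; smaller counts are more practical for a quick local run.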

None of the functions that were left unmodified showed any improvement or regression.

The execution time below is the Total Time for all 100,000 iterations, measured in seconds.

Hardware: Desktop w/ 3.7GHz Quad-Core A10-7850K (AMD) and 16GB RAM

The functions which were changed show the following improvements (x64):

| Function | Improvement | Execution Time (Before) | Execution Time (After) |
| --- | --- | --- | --- |
| Cos | 59.8373609% | 17.5828667s | 7.06169908s |
| Exp | 13.0115399% | 7.78959558s | 6.77604924s |
| Log | 0.691849597% | 5.47524991s | 5.43736941s |
| Log10 | 0.214979727% | 5.79203835s | 5.77961335s |
| Pow | 0.982046086% | 30.8941570s | 30.5907621s |
| Round | 31.0027713% | 5.90780906s | 4.07622453s |
| Sin | 68.0349993% | 17.6330190s | 5.63639463s |
| Tan | 59.4324800% | 25.4978910s | 10.3438620s |
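The Improvement figures are consistent with the usual relative-change formula, (before - after) / before × 100. The exact computation used for the table is not shown in the thread, so treat the following as an assumption; `improvementPercent` is a hypothetical helper name.

```cpp
// Relative improvement, in percent, of `after` over `before`.
// Both arguments are execution times in the same unit (seconds here).
double improvementPercent(double before, double after)
{
    return (before - after) / before * 100.0;
}
```

For the x64 Cos row, `improvementPercent(17.5828667, 7.06169908)` comes out at roughly 59.84, which agrees with the reported figure to within rounding.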

The functions which were changed show the following improvements (x86):

| Function | Improvement | Execution Time (Before) | Execution Time (After) |
| --- | --- | --- | --- |
| Exp | 21.8584468% | 9.46195789s | 7.39372086s |
| Log | 14.4277698% | 11.2499700s | 9.62685020s |
| Log10 | 14.9602930% | 11.5140937s | 9.79155157s |
| Pow | 15.1681239% | 31.8424078s | 27.0125119s |
| Round | 0.172920825% | 0.63726968s | 0.63616771s |

FYI. @AndyAyersMS


var diff = Math.Abs(absDoubleExpectedResult - result);

if ((diff != 0) && (diff < absDoubleEpsilon))

@AndyAyersMS

AndyAyersMS May 22, 2016

Member

I think this should just be if (diff > absDoubleEpsilon) -- no need to check against zero, and fail if the diff is too large.

Similarly for checks in the other tests.
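The suggested shape of the check, expressed as a pass condition, can be sketched as below. This is a C++ illustration rather than the managed test code under review, and `isWithinTolerance` is a hypothetical name. One deliberate difference is noted in the comments: writing the test as "pass if diff <= tolerance" also rejects NaN results, whereas "fail if diff > tolerance" would let them through.

```cpp
#include <cmath>

// Returns true when `actual` is within `tolerance` of `expected`.
bool isWithinTolerance(double expected, double actual, double tolerance)
{
    const double diff = std::fabs(expected - actual);

    // A diff of exactly zero passes without a separate check. If either input
    // is NaN, diff is NaN and the comparison is false, so the check fails.
    return diff <= tolerance;
}
```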

@tannergooding

tannergooding May 22, 2016

Author Member

Fixing.

public static partial class MathTests
{
    private const double absDoubleDelta = 0.0004;
    private const double absDoubleEpsilon = 2.22e-16;

@AndyAyersMS

AndyAyersMS May 22, 2016

Member

If this epsilon value is the same across all tests (looks like it is...), I'd suggest you just have one definition for it.

Also you should add a comment on why it has the value it does.
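One plausible explanation for the value, offered here as an assumption rather than the author's stated rationale, is that 2.22e-16 is a truncation of the machine epsilon for double: 2^-52, or about 2.220446e-16. The sketch below shows how that relationship could be documented and checked; `kMachineEpsilon` and `isTruncatedMachineEpsilon` are hypothetical names.

```cpp
#include <limits>

// Machine epsilon for double: the gap between 1.0 and the next representable
// value, equal to 2^-52 (about 2.220446e-16) for IEEE 754 binary64.
constexpr double kMachineEpsilon = std::numeric_limits<double>::epsilon();

// Hypothetical check that a hand-written constant such as 2.22e-16 is a
// truncated machine epsilon rather than an arbitrary number.
bool isTruncatedMachineEpsilon(double candidate)
{
    return candidate <= kMachineEpsilon
        && (kMachineEpsilon - candidate) < 1e-18;
}
```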

@tannergooding

tannergooding May 22, 2016

Author Member

Fixing.

@tannergooding tannergooding force-pushed the tannergooding:math branch from 6357499 to 3aa70c7 May 22, 2016

@AndyAyersMS

Member

commented May 22, 2016

Thanks for adding tests.... left a few comments.

I'd be interested to see the Windows vs Linux numbers.

@tannergooding tannergooding force-pushed the tannergooding:math branch from 3aa70c7 to 14a9f28 May 22, 2016

@tannergooding

Member Author

commented May 22, 2016

I've updated according to the feedback provided and am working on getting some Linux (and Mac) numbers for comparison.

@AndyAyersMS

Member

commented May 22, 2016

Thanks for the updates.

One last thing to think about. The code under jit/performance has to work well under 3 different use cases:

  • Regular test, run every CI. We do this a lot
  • Perf test, xunit-perf run. We do this rarely
  • Perf test, manual run. We do this very rarely.

I think what you have is good on the perf test aspects, but it's likely that running this as a regular test will take a long time. What I've done for the other perf tests is (for now) give up on the manual perf test aspect and scale back iterations so that, when run as an exe, the timing is reasonable: say a second or so at most in release and as short as possible in non-release.

To get back the manual mode capabilities, I have plans to go back and add optional -bench arguments to each of the tests so they ramp up the work they do when run as a standalone perf test.

So, look at how long this test takes in CI and if it stands out, you should consider doing something to trim it back.

@AndyAyersMS

Member

commented May 22, 2016

Looks like you hit the timeout ...

D:\j\workspace\debug_windows---17180f1b\bin\tests\Windows_NT.x64.Debug\JIT\Performance\CodeQuality\Math\Functions\Functions\Functions.cmd Timed Out
@tannergooding

Member Author

commented May 22, 2016

I'm updating this to accept two arguments (similar to the SIMD perf tests). One for the math function to test and the other which will be the number of iterations to run. I'm going to default the iterations to 1 for debug and 1000 for release (I can run 2500 iterations of the PowTest in one second on my box)

@tannergooding tannergooding force-pushed the tannergooding:math branch 2 times, most recently from 1e9b274 to a28ad2a May 22, 2016

@tannergooding

Member Author

commented May 22, 2016

Updated. The Math Functions performance test now accepts two arguments. One allows you to specify the name of the test to run (all is supported and is the default) and the other (-bench) allows you to specify the number of iterations to run (defaults to 1 on debug and 1000 on release).
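The argument handling described above might look roughly like the following. This is a C++ sketch, not the managed test code from the PR, and several details are assumptions: the `-bench <count>` syntax, the `BenchOptions` and `parseArgs` names, and the choice to pass the build-specific default (1 for debug, 1000 for release) in as a parameter.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

struct BenchOptions
{
    std::string testName = "all"; // "all" runs every test
    int iterations = 1;           // overwritten with the build-specific default
};

// Hypothetical parser: accepts an optional test name and "-bench <count>".
// `defaultIterations` would be 1 for debug builds and 1000 for release builds.
BenchOptions parseArgs(const std::vector<std::string>& args, int defaultIterations)
{
    BenchOptions options;
    options.iterations = defaultIterations;

    for (std::size_t i = 0; i < args.size(); ++i)
    {
        if (args[i] == "-bench" && i + 1 < args.size())
        {
            // Ignore counts that are missing, non-numeric, or not positive.
            const int parsed = std::atoi(args[++i].c_str());
            if (parsed > 0)
            {
                options.iterations = parsed;
            }
        }
        else
        {
            options.testName = args[i];
        }
    }
    return options;
}
```

With no arguments this yields the "all" test name and the supplied default, so a CI run stays short while a manual run can raise the count.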

@AndyAyersMS

Member

commented May 22, 2016

Nicely done -- LGTM.

@@ -28,7 +28,7 @@
</PropertyGroup>
<!--Leaf Project Items-->
<ItemGroup>
<CppCompile Include="FloatNative.cpp" />
<CppCompile Include="FloatDouble.cpp" />

@bendono

bendono May 22, 2016

Contributor

The casing on the file name differs here from that used on the actual file. Should be fine for Windows, but is this not a problem for Linux type systems?

@tannergooding

tannergooding May 23, 2016

Author Member

I believe (in the case of *.*proj) casing differences are being handled already. However, I have adjusted the casing to match what exists on disk.

////////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////////
///
/// Beggining of /fp:fast scope

@bendono

bendono May 22, 2016

Contributor

Spelling: Beginning of /fp:fast scope

@tannergooding

tannergooding May 22, 2016

Author Member

Fixed.

@tannergooding tannergooding force-pushed the tannergooding:math branch 2 times, most recently from a846d99 to 87e9ab8 May 22, 2016

@tannergooding

Member Author

commented May 23, 2016

I have some perf numbers for Linux and Mac. I'm not sure I trust the Linux numbers because of how small they are (I did double-check that I deployed the right bits), but Mac did show some small (and consistent) improvements (I was happy to see already good perf coming from Mac). I also updated my post above to list the hardware I ran the Windows benchmarks on.

OS X showed very little improvement, but the improvement was consistent.

Hardware: MacBook Pro w/ 2.5GHz Quad-Core i7 (Intel) and 16GB RAM

| Function | Improvement | Execution Time (Before) | Execution Time (After) |
| --- | --- | --- | --- |
| Cos | 8.38015985% | 5.39619623s | 4.94398636s |
| Exp | 3.44344613% | 3.70642912s | 3.57880023s |
| Log | 0.155915035% | 3.37851958s | 3.37325196s |
| Log10 | 0.274460120% | 3.40944615s | 3.40008858s |
| Pow | 1.00818607% | 15.5424187s | 15.3857222s |
| Round | 0.243108725% | 2.61207820s | 2.60572801s |
| Sin | 0.572526337% | 4.50698218s | 4.48117852s |
| Tan | 23.0676004% | 5.25166822s | 4.04023438s |

Linux showed even less improvement and it did not seem to be consistent.

Hardware: Asus ZenBook w/ 2.6GHz Quad-Core i3 (Intel) and 4GB RAM

| Function | Improvement | Execution Time (Before) | Execution Time (After) |
| --- | --- | --- | --- |
| Cos | 0.209634293% | 54.7168588s | 54.6021535s |
| Exp | 1.89459115% | 52.8744791s | 51.8727239s |
| Log | 0.217855270% | 56.1605876s | 56.0382388s |
| Log10 | 0.274460120% | 58.8396717s | 58.3054002s |
| Pow | 0.0565113567% | 51.8469237s | 51.8176243s |
| Round | 3.87328635% | 4.83688612s | 4.64953967s |
| Sin | 1.90053224% | 51.5278393s | 50.5485361s |
| Tan | 0.441676743% | 58.7293771s | 58.4699831s |

FYI. @AndyAyersMS

Let me know what else, if anything, is required to get this merged.

@AndyAyersMS

Member

commented May 23, 2016

LGTM.

Something for future follow-up: try to run Linux on the same HW as Windows, and compare relative perf (e.g. dual-boot a machine or similar). The data above and some experiments I've done suggest Linux math function performance may not be very good.

Also, the Windows 10 math libs seem to light up on AVX2-capable machines. So for Windows, you might want to measure on both older and newer HW.

@tannergooding

Member Author

commented May 24, 2016

@AndyAyersMS, I did some investigations on Linux and it seems that the slowness is coming from the libm implementation being linked in (usr/lib/x86_64-linux-gnu/libm.a).

I tried swapping the implementation out with amdlibm.a and saw the numbers drop to be on par with Windows.

@sergiy-k

Contributor

commented Jun 1, 2016

/cc @janvorli

@janvorli

Member

commented Jun 1, 2016

@tannergooding You have accidentally (I guess) changed the license headers in multiple files from .NET Foundation to Microsoft. Can you please fix that?
LGTM otherwise.

@tannergooding tannergooding force-pushed the tannergooding:math branch from 6f0c484 to 9c13a84 Jun 1, 2016

@tannergooding tannergooding force-pushed the tannergooding:math branch from 9c13a84 to c83e7ec Jun 1, 2016

@tannergooding

Member Author

commented Jun 1, 2016

@janvorli, I've rebased the changes against the current HEAD and fixed the license headers.

@janvorli

Member

commented Jun 1, 2016

LGTM

@tannergooding

Member Author

commented Jun 2, 2016

Everything is passing and I believe I have received all necessary sign-off.

Am I clear to merge?

@janvorli janvorli merged commit 08786f2 into dotnet:master Jun 2, 2016

8 checks passed

  • CentOS7.1 x64 Debug Build and Test: Build finished.
  • FreeBSD x64 Checked Build: Build finished.
  • OSX x64 Checked Build and Test: Build finished.
  • Ubuntu x64 Checked Build and Test: Build finished.
  • Windows_NT x64 Debug Build and Test: Build finished.
  • Windows_NT x64 Release Priority 1 Build and Test: Build finished.
  • Windows_NT x86 legacy_backend Checked Build and Test: Build finished.
  • Windows_NT x86 ryujit Checked Build and Test: Build finished.

@tannergooding

Member Author

commented Jun 2, 2016

Thanks!

@tannergooding tannergooding deleted the tannergooding:math branch Jun 2, 2016

@tarekgh tarekgh added this to the 2.0.0 milestone Mar 22, 2017

tijoytom pushed a commit to tijoytom/corert that referenced this pull request May 5, 2017

jkotas added a commit to dotnet/corert that referenced this pull request May 5, 2017
