
Improvements to System.Math InternalCall code. #4847

Merged
merged 6 commits into from Jun 2, 2016

Conversation

@tannergooding
Member

@tannergooding tannergooding commented May 8, 2016

This cleans up the native code for the System.Math internal calls. Several workarounds in this code were implemented for much older versions of the MSVC compiler and are no longer relevant.

The workarounds removed are as follows:

  • On Windows AMD64, Sin/Cos/Tan were implemented in vm\amd64\JitHelpers_Fast.asm (using x87 floating-point instructions) because the CRT versions were previously too slow.
  • Log, Log10, Exp, Pow, and Round were being compiled with /fp:precise because the /fp:fast implementation was previously slower.
  • Exp(+/-INFINITY) was previously special-cased because the CRT implementation did not return the expected values.
  • On Windows x86, Pow had some additional code (and inline assembly) that was unnecessary.
  • On Windows x86, Round was using inline assembly to execute the frndint x87 floating-point instruction.

In all of these cases, the replaced code was one or more of the following:

  • No longer needed as the bug had been fixed
  • Went against the Intel/AMD CPU optimization guidelines by using x87 floating-point instructions. Both manuals recommend using SSE/SSE2 instructions unless there are special circumstances requiring the x87 floating-point instructions; the SSE/SSE2 instructions can be pipelined, have lower latency, use hardware registers rather than the floating-point stack, and generally provide faster computations than the specialized x87 floating-point instructions (such as fsin, fcos, and fptan).
  • Was no longer more performant than the 'correct' implementation
@tannergooding
Member Author

@tannergooding tannergooding commented May 9, 2016

This should be reviewable. Everything is passing and looking good, so I don't have any other changes I am planning on making (unless otherwise requested).

@jkotas
Member

@jkotas jkotas commented May 10, 2016

@tannergooding Could you please split this into several smaller PRs? It will be hard to make progress on it as is because it has too much in it.

  • File rename so that the diff for the rest looks more reasonable
  • Non-x86 specific changes
  • x86 specific changes. How did you test that these changes are good?

Thanks!

@tannergooding
Member Author

@tannergooding tannergooding commented May 10, 2016

@jkotas, this becomes significantly easier to review if you look at each of the commits individually, rather than all at once. It also becomes simpler to review if you add ?w=1 to the end of the URL, which excludes whitespace-only changes from the diff.

// instead returns +1.0.

if(IS_DBL_INFINITY(y) && IS_DBL_NEGATIVEONE(x)) {
*((INT64 *)(&result)) = CLR_NAN_64;

@jamesqo jamesqo May 11, 2016

Just a question, not too familiar with native code but isn't violating strict aliasing supposed to be UB?

Member Author

@tannergooding tannergooding May 23, 2016

Yes, violating strict aliasing is UB according to the spec, but clang, gcc, and msvc will handle this code as intended.

@AndyAyersMS
Member

@AndyAyersMS AndyAyersMS commented May 19, 2016

@tannergooding any plans to implement perf tests for these math functions?

Also how about more rigorous correctness tests?

There are a few tests over in CoreFx but they're minimal and don't check boundary cases. There are PAL tests for the underlying native methods but nothing that verifies these when called from managed code.

@tannergooding
Member Author

@tannergooding tannergooding commented May 19, 2016

@AndyAyersMS, I'll see about getting some perf tests implemented, as well as some more rigorous correctness tests.

@AndyAyersMS
Member

@AndyAyersMS AndyAyersMS commented May 20, 2016

I wrote a simple perf test for Log, feel free to build on that if you like.

BTW math function performance seems significantly worse on Linux than on Windows.

@tannergooding
Member Author

@tannergooding tannergooding commented May 22, 2016

I have added some very basic performance tests for all of the System.Math functions which are implemented via InternalCall.

All performance tests are implemented as follows:

  • 100,000 iterations are executed
  • The times of all iterations are summed to compute the Total Time
  • The times of all iterations are averaged to compute the Average Time
  • A single iteration executes some simple operation, using the function under test, 5,000 times

The functions which were not modified showed no improvement or regression.

The execution time below is the Total Time for all 100,000 iterations, measured in seconds.

Hardware: Desktop w/ 3.7GHz Quad-Core A10-7850K (AMD) and 16GB RAM

The changed functions show the following improvements (x64):

| Function | Improvement | Execution Time (Before) | Execution Time (After) |
| --- | --- | --- | --- |
| Cos | 59.8373609% | 17.5828667s | 7.06169908s |
| Exp | 13.0115399% | 7.78959558s | 6.77604924s |
| Log | 0.691849597% | 5.47524991s | 5.43736941s |
| Log10 | 0.214979727% | 5.79203835s | 5.77961335s |
| Pow | 0.982046086% | 30.8941570s | 30.5907621s |
| Round | 31.0027713% | 5.90780906s | 4.07622453s |
| Sin | 68.0349993% | 17.6330190s | 5.63639463s |
| Tan | 59.4324800% | 25.4978910s | 10.3438620s |

The changed functions show the following improvements (x86):

| Function | Improvement | Execution Time (Before) | Execution Time (After) |
| --- | --- | --- | --- |
| Exp | 21.8584468% | 9.46195789s | 7.39372086s |
| Log | 14.4277698% | 11.2499700s | 9.62685020s |
| Log10 | 14.9602930% | 11.5140937s | 9.79155157s |
| Pow | 15.1681239% | 31.8424078s | 27.0125119s |
| Round | 0.172920825% | 0.63726968s | 0.63616771s |

FYI. @AndyAyersMS


var diff = Math.Abs(absDoubleExpectedResult - result);

if ((diff != 0) && (diff < absDoubleEpsilon))
Member

@AndyAyersMS AndyAyersMS May 22, 2016

I think this should just be if (diff > absDoubleEpsilon) -- no need to check against zero, and fail if the diff is too large.

Similarly for checks in the other tests.

Member Author

@tannergooding tannergooding May 22, 2016

Fixing.

@AndyAyersMS
Member

@AndyAyersMS AndyAyersMS commented May 22, 2016

Thanks for adding tests.... left a few comments.

I'd be interested to see the Windows vs Linux numbers.

@tannergooding
Member Author

@tannergooding tannergooding commented May 22, 2016

I've updated according to the feedback provided and am working on getting some Linux (and Mac) numbers for comparison.

@AndyAyersMS
Member

@AndyAyersMS AndyAyersMS commented May 22, 2016

Thanks for the updates.

One last thing to think about. The code under jit/performance has to work well under 3 different use cases:

  • Regular test, run every CI. We do this a lot
  • Perf test, xunit-perf run. We do this rarely
  • Perf test, manual run. We do this very rarely.

I think what you have is good on the perf test aspects, but it's likely running this as a regular test will take a long time. What I've done for the other perf tests is (for now) give up on the manual perf test aspect and scale back iterations so when run as an exe the timing is reasonable, say a second or so at most in release and a short as possible in non-release.

To get back the manual mode capabilities, I have plans to go back and add optional -bench arguments to each of tests so they ramp up the work they do when run as a standalone perf test.

So, look at how long this test takes in CI and if it stands out, you should consider doing something to trim it back.

@AndyAyersMS
Member

@AndyAyersMS AndyAyersMS commented May 22, 2016

Looks like you hit the timeout ...

D:\j\workspace\debug_windows---17180f1b\bin\tests\Windows_NT.x64.Debug\JIT\Performance\CodeQuality\Math\Functions\Functions\Functions.cmd Timed Out

@tannergooding
Member Author

@tannergooding tannergooding commented May 22, 2016

I'm updating this to accept two arguments (similar to the SIMD perf tests): one for the math function to test and another for the number of iterations to run. I'm going to default the iterations to 1 for debug and 1000 for release (I can run 2500 iterations of the PowTest in one second on my box).

@tannergooding
Member Author

@tannergooding tannergooding commented May 22, 2016

Updated. The Math Functions performance test now accepts two arguments: one specifies the name of the test to run (all is supported and is the default) and the other (-bench) specifies the number of iterations to run (defaulting to 1 on debug and 1000 on release).

@AndyAyersMS
Member

@AndyAyersMS AndyAyersMS commented May 22, 2016

Nicely done -- LGTM.

@@ -28,7 +28,7 @@
</PropertyGroup>
<!--Leaf Project Items-->
<ItemGroup>
<CppCompile Include="FloatNative.cpp" />
<CppCompile Include="FloatDouble.cpp" />

@bendono bendono May 22, 2016

The casing on the file name differs here from that used on the actual file. Should be fine for Windows, but is this not a problem for Linux type systems?

Member Author

@tannergooding tannergooding May 23, 2016

I believe (in the case of _._proj) casing differences are already handled. However, I have adjusted the casing to match what exists on disk.

@tannergooding
Member Author

@tannergooding tannergooding commented May 23, 2016

I have some perf numbers for Linux and Mac. I'm not sure I trust the Linux numbers because of how small the improvements are (I did double-check that I deployed the right bits), but Mac did show some small (and consistent) improvements (I was happy to see already-good perf coming from Mac). I also updated my post above to list the hardware I ran the Windows benchmarks on.

OS X showed very little improvement, but the improvement was consistent.

Hardware: MacBook Pro w/ 2.5GHz Quad-Core i7 (Intel) and 16GB RAM

| Function | Improvement | Execution Time (Before) | Execution Time (After) |
| --- | --- | --- | --- |
| Cos | 8.38015985% | 5.39619623s | 4.94398636s |
| Exp | 3.44344613% | 3.70642912s | 3.57880023s |
| Log | 0.155915035% | 3.37851958s | 3.37325196s |
| Log10 | 0.274460120% | 3.40944615s | 3.40008858s |
| Pow | 1.00818607% | 15.5424187s | 15.3857222s |
| Round | 0.243108725% | 2.61207820s | 2.60572801s |
| Sin | 0.572526337% | 4.50698218s | 4.48117852s |
| Tan | 23.0676004% | 5.25166822s | 4.04023438s |

Linux showed even less improvement and it did not seem to be consistent.

Hardware: Asus ZenBook w/ 2.6GHz Quad-Core i3 (Intel) and 4GB RAM

| Function | Improvement | Execution Time (Before) | Execution Time (After) |
| --- | --- | --- | --- |
| Cos | 0.209634293% | 54.7168588s | 54.6021535s |
| Exp | 1.89459115% | 52.8744791s | 51.8727239s |
| Log | 0.217855270% | 56.1605876s | 56.0382388s |
| Log10 | 0.274460120% | 58.8396717s | 58.3054002s |
| Pow | 0.0565113567% | 51.8469237s | 51.8176243s |
| Round | 3.87328635% | 4.83688612s | 4.64953967s |
| Sin | 1.90053224% | 51.5278393s | 50.5485361s |
| Tan | 0.441676743% | 58.7293771s | 58.4699831s |

FYI. @AndyAyersMS

Let me know what else, if anything, is required to get this merged.

@AndyAyersMS
Member

@AndyAyersMS AndyAyersMS commented May 23, 2016

LGTM.

Something for future follow-up: try and run Linux on same HW as Windows, and compare relative perf (eg dual-boot a machine or similar). Data above and some experiments I've done suggest Linux math function performance may not be very good.

Also windows 10 math libs seem to light up on AVX2 capable machines. So for Windows, you might want to measure on both older and newer HW.

@tannergooding
Member Author

@tannergooding tannergooding commented May 24, 2016

@AndyAyersMS, I did some investigation on Linux and it seems that the slowness is coming from the libm implementation being linked in (usr/lib/x86_64-linux-gnu/libm.a).

I tried swapping the implementation out with amdlibm.a and saw the numbers drop to be on par with Windows.

@AndyAyersMS
Member

@AndyAyersMS AndyAyersMS commented May 24, 2016

Interesting... so do you think we should be linking amdlibm.a instead? Seems like we should open an issue for this.

@tannergooding
Member Author

@tannergooding tannergooding commented May 24, 2016

I believe, at the very least, it is worth investigating further.

AMD's libm implementation doesn't ship natively, it has to be downloaded externally (http://developer.amd.com/tools-and-sdks/archive/libm/). It also comes with its own licensing agreement, which would need review by LCA before we could adopt it.

I'm still wondering if there is some additional detail I'm missing, because it is a little hard to believe that libm is 6-12x less performant than the Windows or OS X implementation. However, I did validate that the clang commands were the same for both Linux and Mac (aside from OS-specific compiler defines and library references).

I have all the code available on my box (and I believe my office is just down the hall from yours) if you would like to take a closer look tomorrow.

@tannergooding
Member Author

@tannergooding tannergooding commented May 24, 2016

Rebased code onto HEAD and finished cleaning up the code.

@tannergooding
Member Author

@tannergooding tannergooding commented May 25, 2016

I'm updating the existing PAL tests. They should provide much better coverage for inputs both inside and outside of a function's input domain.

@tannergooding
Member Author

@tannergooding tannergooding commented May 26, 2016

This is now passing (again), has new perf tests, and extends the PAL test coverage. Is there anything else required on my end to get this merged in?

@tannergooding
Member Author

@tannergooding tannergooding commented May 27, 2016

Looking at the trigger phrase, it seems my previous comment caused a coverage test to run, which failed because it couldn't find ./corerun or ./runtests.sh.

@AndyAyersMS
Member

@AndyAyersMS AndyAyersMS commented May 27, 2016

@tannergooding I would suggest getting somebody on the runtime side to sign off.

Just browsing through the new PAL tests, I don't understand why the expected variance for inputs is sometimes PAL_EPSILON and sometimes a multiple or fraction of it. Did you come about these variances by trial and error? I would actually hope that the math library functions we link to would end up fairly close to 1 ULP for most reasonable inputs.

I understood why we relaxed the variance on the managed side since we were adding up many results, but here you're just testing isolated values.

@tannergooding
Member Author

@tannergooding tannergooding commented May 27, 2016

@AndyAyersMS, the epsilon used for these tests was already set to 1e-07, so I just maintained what was already there.

I can certainly update the tests to have a more accurate base epsilon (1e-6 for single, and 1e-15 for double) so that the cross-platform math library implementations are required to produce consistent/accurate results (this is a simple change, as I ensured the inputs/outputs for all the tests are at the maximum precision allowed -- 17 digits).

As for why it varies by a multiple/division of 10: floating-point values have a limited number of significant digits (6-9 for single and 15-17 for double). So, for single-precision floating-point, 0.123456789 and 123456789.0 are both as precise as the format allows. The adjustments account for the magnitude of the expected output (so that the delta is based on the last significant digit).

So, in the first example, 0.1234567890 and 0.1234567891 should be considered equally precise, because the 10th digit is outside the range of precision for a single-precision floating-point number. The same is true for 123456789.0 and 123456789.1 in the second example.

So PAL_EPSILON is used for numbers in the format 0.xxxxxx, PAL_EPSILON / 10 is used for 0.0xxxxxx, PAL_EPSILON * 10 is used for x.xxxxx, and so on.

@tannergooding
Member Author

@tannergooding tannergooding commented May 28, 2016

@AndyAyersMS, I have added a new commit which updates the value of PAL_EPSILON to be more precise. Additionally, I have commented all instances of PAL_EPSILON to explain (roughly) what I explained above.

It seems that the various implementations (between Windows, Linux, and Mac) are accurate to within 2^-50 (2^-21 for float). Where the actual machine-epsilon is 2^-52 (2^-23 for float).

Do you know who on the Runtime side would be good to tag for review?

@AndyAyersMS
Member

@AndyAyersMS AndyAyersMS commented May 28, 2016

@sergiy-k Can you suggest somebody to review the PAL changes?

@tannergooding
Member Author

@tannergooding tannergooding commented Jun 1, 2016

Ping @sergiy-k Can you suggest somebody to review the PAL changes?
This was probably missed since it was sent late last Friday.

@sergiy-k

@sergiy-k sergiy-k commented Jun 1, 2016

/cc @janvorli

@janvorli
Member

@janvorli janvorli commented Jun 1, 2016

@tannergooding You have accidentally (I guess) changed the license headers in multiple files from .NET Foundation to Microsoft. Can you please fix that?
LGTM otherwise.

@tannergooding
Member Author

@tannergooding tannergooding commented Jun 1, 2016

@janvorli, I've rebased the changes against the current HEAD and fixed the license headers.

@janvorli
Member

@janvorli janvorli commented Jun 1, 2016

LGTM

@tannergooding
Member Author

@tannergooding tannergooding commented Jun 2, 2016

Everything is passing and I believe I have received all necessary sign-off.

Am I clear to merge?

@janvorli janvorli merged commit 08786f2 into dotnet:master Jun 2, 2016
8 checks passed
@tannergooding
Member Author

@tannergooding tannergooding commented Jun 2, 2016

Thanks!
