
Implement std.math.{atan[2],tan,exp[2],expm1} for single- and double-precision #6272

Merged: 12 commits into dlang:master on Aug 16, 2018

Conversation

@kinke (Contributor) commented Mar 12, 2018

And make the x87 real version CTFE-able.

Based on Cephes like the existing quadruple and x87 implementations,
https://github.com/jeremybarnes/cephes/blob/master/cmath/atan.c &
https://github.com/jeremybarnes/cephes/blob/master/single/atanf.c

@dlang-bot (Contributor) commented Mar 12, 2018

Thanks for your pull request and interest in making D better, @kinke! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the annotated coverage diff directly on GitHub with CodeCov's browser extension)
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally

If you don't have a local development environment set up, you can use Digger to test this PR:

dub fetch digger
dub run digger -- build "master + phobos#6272"

@kinke (Contributor, Author) commented Mar 14, 2018

There are unittest failures due to hypot() etc. still only supporting reals. So I guess more or less all remaining std.math functions need to get float/double support in order to hopefully pass the druntime/Phobos unittests again.

I'll need an okay to proceed though, I don't want to waste any more time on this if this doesn't have a chance to be merged (backwards compatibility concerns).

@kinke kinke changed the title Implement std.math.atan() for single- and double-precision Implement std.math.{atan,atan2,tan} for single- and double-precision Mar 14, 2018
@kinke kinke changed the title Implement std.math.{atan,atan2,tan} for single- and double-precision Implement std.math.{atan[2],tan,exp[2],expm1} for single- and double-precision Mar 15, 2018
@kinke kinke force-pushed the atan_generic branch 2 times, most recently from 859cad1 to cedc96c Compare March 16, 2018 22:13
@kinke (Contributor, Author) commented Mar 17, 2018

compilable/ctfe_math.d fails because DMD would need to be patched to recognize the std.math.tan(float|double) overloads as CTFE builtins in builtin.d. The non-asm version still isn't fully CTFE-able because primitives copysign() and signbit() aren't.

No unittest failures anymore [well, test coverage is minimal at the moment].
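For context, the pattern typically used to make such functions CTFE-able is branching on __ctfe, so compile-time evaluation takes a portable software path while run time can use the asm/intrinsic one. A minimal, self-contained sketch (the trivial function and both paths are illustrative stand-ins, not the actual std.math code):

```d
// Sketch of the __ctfe pattern used in std.math: at compile time take a
// portable path, at run time a (potentially asm-backed) fast path.
double twoX(double x)
{
    if (__ctfe)
        return x + x;   // CTFE-friendly path
    return 2.0 * x;     // stand-in for the runtime/asm path
}

enum ct = twoX(21.0);   // enum initializer forces CTFE

void main()
{
    assert(ct == 42.0);
    assert(twoX(21.0) == 42.0);
}
```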

x86_64 timings are quite interesting. Linux, Intel i5-3550 @4GHz, dmd master with -O -inline:

import std.datetime.stopwatch;
import std.stdio;

version (C)
{
    pragma(msg, "Using core.stdc.tgmath");
    import math = core.stdc.tgmath;
}
else
{
    pragma(msg, "Using std.math");
    import math = std.math;
}

T atan(T)() { return math.atan(cast(T) 0.43685); }
T atan2(T)() { return math.atan2(cast(T) 0.43685, cast(T) 0.06912); }
T tan(T)() { return math.tan(cast(T) 0.43685); }
T exp(T)() { return math.exp(cast(T) 0.43685); }
T exp2(T)() { return math.exp2(cast(T) 0.43685); }
T expm1(T)() { return math.expm1(cast(T) 0.43685); }

version (C) {} else
T poly(T)()
{
    static immutable T[6] coeffs = [ 0.1, 0.2, 0.3, 0.4, 0.5, 0.6 ];
    return math.poly(cast(T) 0.43685, coeffs);
}

void bench(alias Func, T...)()
{
    enum numRounds = 5; // best-of
    enum N = 10_000_000;

    writeln(".: ", Func.stringof);

    Duration[T.length][numRounds] durations;
    foreach (i; 0 .. numRounds)
        foreach (j, F; T)
            durations[i][j] = benchmark!(Func!F)(N)[0];

    foreach (j, F; T)
    {
        Duration bestOf = durations[0][j];
        foreach (i; 1 .. numRounds)
        {
            if (durations[i][j] < bestOf)
                bestOf = durations[i][j];
        }
        writeln(F.stringof, ":\t", bestOf);
    }
}

void main()
{
    bench!(atan, real, double, float);
    bench!(atan2, real, double, float);
    bench!(tan, real, double, float);
    bench!(exp, real, double, float);
    bench!(exp2, real, double, float);
    bench!(expm1, real, double);

    version (C) {} else
    bench!(poly, real, double, float);
}

=>

.: atan(T)()
real:   430 ms, 58 μs, and 7 hnsecs
double: 268 ms, 79 μs, and 3 hnsecs
float:  113 ms, 737 μs, and 4 hnsecs
.: atan2(T)()
real:   427 ms, 635 μs, and 3 hnsecs
double: 480 ms, 229 μs, and 6 hnsecs
float:  266 ms, 752 μs, and 3 hnsecs
.: tan(T)()
real:   351 ms, 427 μs, and 5 hnsecs
double: 909 ms, 280 μs, and 5 hnsecs
float:  220 ms, 476 μs, and 4 hnsecs
.: exp(T)()
real:   223 ms and 639 μs
double: 544 ms, 363 μs, and 5 hnsecs
float:  390 ms, 93 μs, and 9 hnsecs
.: exp2(T)()
real:   226 ms, 165 μs, and 2 hnsecs
double: 619 ms, 915 μs, and 6 hnsecs
float:  513 ms, 462 μs, and 2 hnsecs
.: expm1(T)()
real:   234 ms, 788 μs, and 2 hnsecs
double: 197 ms, 129 μs, and 8 hnsecs

The real timings correspond to the DMD 2.079.0 timings for all 3 floating-point types, as the old float/double overloads simply forward to the real implementation. So the results range from a ~4x speed-up to a ~2.75x slow-down. ;)

@kinke kinke changed the title Implement std.math.{atan[2],tan,exp[2],expm1} for single- and double-precision WIP: Implement std.math.{atan[2],tan,exp[2],expm1} for single- and double-precision Mar 17, 2018
@dnadlinger (Member)

x86_64 timings are quite interesting.

Is that unexpected, though? One would expect the "software" implementations to do worse.

@kinke (Contributor, Author) commented Mar 17, 2018

Well, I for one didn't expect such inconsistent results, e.g. a ~4x slow-down of the software double tan vs. the software float tan.

@kinke (Contributor, Author) commented Mar 17, 2018

Here are the corresponding core.stdc.tgmath timings (Ubuntu 16.04):

.: atan(T)()
real:   403 ms, 803 μs, and 5 hnsecs
double: 685 ms, 848 μs, and 3 hnsecs
float:  58 ms, 113 μs, and 3 hnsecs
.: atan2(T)()
real:   510 ms, 110 μs, and 6 hnsecs
double: 903 ms, 576 μs, and 4 hnsecs
float:  165 ms and 114 μs
.: tan(T)()
real:   193 ms, 504 μs, and 1 hnsec
double: 716 ms, 779 μs, and 5 hnsecs
float:  83 ms and 817 μs
.: exp(T)()
real:   511 ms, 916 μs, and 1 hnsec
double: 690 ms, 991 μs, and 3 hnsecs
float:  87 ms, 717 μs, and 6 hnsecs
.: exp2(T)()
real:   547 ms, 701 μs, and 2 hnsecs
double: 129 ms, 991 μs, and 6 hnsecs
float:  123 ms and 33 μs
.: expm1(T)()
real:   471 ms and 943 μs
double: 110 ms, 476 μs, and 3 hnsecs

@kinke (Contributor, Author) commented Mar 17, 2018

Rough runtime comparison for the single/double-precision overloads:

[benchmark chart]

@wilzbach (Member)

Regarding the project tester failure:

core.exception.AssertError@source/mir/random/flex/internal/area.d(525): unittest failure
----------------
??:? _d_unittestp [0x617ef1]
source/mir/random/flex/internal/area.d:525 @safe void mir.random.flex.internal.area.__unittest_L441_C32() [0x5a65a1]
??:? void mir.random.flex.internal.area.__modtest() [0x5b2f9f]
??:? int core.runtime.runModuleUnitTests().__foreachbody2(object.ModuleInfo*) [0x6308e3]
??:? int object.ModuleInfo.opApply(scope int delegate(object.ModuleInfo*)).__lambda2(immutable(object.ModuleInfo*)) [0x616622]
??:? int rt.minfo.moduleinfos_apply(scope int delegate(immutable(object.ModuleInfo*))).__foreachbody2(ref rt.sections_elf_shared.DSO) [0x61dde1]
??:? int rt.sections_elf_shared.DSO.opApply(scope int delegate(ref rt.sections_elf_shared.DSO)) [0x61df74]
??:? int rt.minfo.moduleinfos_apply(scope int delegate(immutable(object.ModuleInfo*))) [0x61dd6d]
??:? int object.ModuleInfo.opApply(scope int delegate(object.ModuleInfo*)) [0x6165f9]
??:? runModuleUnitTests [0x6306b9]
??:? void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).runAll() [0x619e84]
??:? void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).tryExec(scope void delegate()) [0x619e0b]
??:? _d_run_main [0x619d76]
??:? main [0x560021]
??:? __libc_start_main [0xe29d882f]
core.exception.AssertError@source/mir/random/flex/package.d(916): unittest failure
----------------
??:? _d_unittestp [0x617ef1]
source/mir/random/flex/package.d:916 void mir.random.flex.__unittest_L879_C26() [0x5c9871]

that's because the flex algorithm is quite sensitive to the floating-point behavior, and the tests use a special fpEqual function:

https://github.com/libmir/mir-random/blob/master/source/mir/random/flex/internal/area.d

@kinke (Contributor, Author) commented Mar 17, 2018

that's because the flex algorithm is quite sensitive to the floating point behavior and the tests use a special fpEqual function

... which (only) for DMD and x86_64 is defined as exact equality (the == operator). The failing test there seems to depend on exp(double x), which for DMD x86(_64) is defined as cast(double) exp2Asm(LOG2E * cast(real) x), so I'd argue that exact equality is too strict and approxEqual() a better fit, as used for all other compilers/platforms.
Thanks for checking. :)
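To illustrate the point, a minimal sketch (a hypothetical test, not the actual mir-random code) of why a tolerance-based comparison is more robust than exact == here:

```d
import std.math : approxEqual, exp, nextUp;

void main()
{
    const double a = exp(0.43685);
    const double b = nextUp(a); // perturb by exactly 1 ULP, as a different
                                // exp() implementation might
    assert(a != b);             // exact equality is brittle...
    assert(approxEqual(a, b));  // ...a tolerance-based check still passes
}
```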

@kinke (Contributor, Author) commented Mar 17, 2018

Statically unrolled poly() makes the Cephes version of atan(double) faster by ~31% and expm1(double) by ~19%; the double versions of the other functions show only marginal improvements.
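The idea can be sketched as follows (the helper name is hypothetical; the actual Phobos implementation may differ): with a compile-time-known coefficient count, the Horner loop can be unrolled into straight-line code, removing the loop overhead.

```d
import std.math : poly;

// Hypothetical sketch: unroll Horner's scheme when the coefficient count N
// is known at compile time. Coefficient order matches std.math.poly:
// coeffs[0] + coeffs[1]*x + ... + coeffs[N-1]*x^(N-1).
T polyUnrolled(T, size_t N)(in T x, const ref T[N] coeffs)
{
    T r = coeffs[N - 1];
    static foreach (k; 0 .. N - 1)
        r = r * x + coeffs[N - 2 - k]; // same order as the runtime loop
    return r;
}

void main()
{
    static immutable double[6] coeffs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6];
    const x = 0.43685;
    // Identical operations in the same order => identical result.
    assert(polyUnrolled(x, coeffs) == poly(x, coeffs));
}
```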

@dnadlinger (Member)

How do the timings look on LDC?

@kinke (Contributor, Author) commented Mar 17, 2018

I'm afraid I don't have any LDC timings. This patch cannot simply be cherry-picked due to existing modifications in the touched parts. The main problem, though, is that the timings wouldn't be easily comparable, as e.g. the currently real-only core.math.ldexp() primitive is implemented differently: for DMD it's an intrinsic, i.e., inlined x87 asm; for LDC, it's a non-inlined call into druntime which forwards to the C runtime's long double version (in C, it's apparently a macro).

@kinke (Contributor, Author) commented Mar 17, 2018

Experimentally using core.stdc.math.ldexp[l|f] in std.math.ldexp() (instead of the core.math.ldexp(real) intrinsic, which uses a supposedly slow fscale instruction) leads to a significant performance increase, up to almost 2x, for tan(double) and exp2(float|double), i.e., exactly where the Cephes version lags significantly behind in the chart above.
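A minimal sketch of the related trick of avoiding ldexp() altogether when the binary exponent is a compile-time constant: multiplying by the corresponding power-of-two constant is exact (powers of two are exactly representable) and avoids the call entirely. The function names here are illustrative, not from the patch:

```d
import std.math : ldexp;

enum double TWO_POW_7 = 128.0; // 2 ^^ 7, exactly representable as a double

double scaleLdexp(double x)    { return ldexp(x, 7); }
double scaleMultiply(double x) { return x * TWO_POW_7; } // same result, cheaper

void main()
{
    const x = 0.43685;
    // Bit-identical as long as no over-/underflow occurs.
    assert(scaleLdexp(x) == scaleMultiply(x));
}
```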

@kinke (Contributor, Author) commented Mar 17, 2018

[Chart and benchmark code updated.]

@kinke (Contributor, Author) commented Mar 17, 2018

Chart updated once more after getting rid of a few ldexp() calls, which makes tan(double) twice as fast on my machine (!).

[Note that std.math.ldexp(T) cannot use core.stdc.tgmath.ldexp(T) at the moment, as the latter is impure. I hacked druntime for testing, i.e., the green bars.]

@kinke (Contributor, Author) commented Mar 17, 2018

So on my machine and using the purely arbitrary input data above:

  • The new Cephes double versions are 0.4x - 2x as fast as the current versions, with a geometric mean of 0.88 (0.97 with C ldexp), i.e., overall slower than the current FPU-based versions. That is almost exclusively due to the great performance of the Phobos x87 inline asm for std.math.exp2(real) (twice as fast as the C runtime's long double version!), which distorts the otherwise good results. Cephes almost always clearly beats the C runtime, except for exp2(double) (losing a lot there) and expm1(double).
  • The float versions are 0.5x - 3.8x as fast as the current versions, with a geometric mean of 1.2 (1.4 with C ldexp). The C runtime is always and by far the winner here though (possibly due to fewer handled special cases? - it's up to 12x as fast as the C double version!).

@kinke (Contributor, Author) commented Mar 18, 2018

I ported this to LDC (with LLVM 6.0.0 and using C ldexp), and the results are way better than expected: an overall speed-up factor of >3 for the new double/float versions compared to the old ones, only beaten by the C runtime in 2 cases (exp[2](float)) and otherwise leading the pack. Here's the extended chart:

[benchmark chart incl. LDC port]

@n8sh (Member) commented Jun 23, 2018

Time to squash and merge?

@kinke (Contributor, Author) commented Jun 23, 2018

Ah nice, I forgot to check whether it got green now with the new mir-random. Edit: Oh, no project tester in CI anymore?

squash

Do you really prefer a single 1.7k lines patch?

@n8sh (Member) commented Jun 23, 2018

When you put it that way, perhaps not (EDIT: I mean the squash).

@WalterBright (Member)

std/math.d is getting way too large. The next PR should split this into a package.

@ibuclaw (Member) commented Jun 24, 2018

@WalterBright indeed, we touched on the subject earlier here: #6272 (comment)

And make the x87 `real` version CTFE-able.
I couldn't find the single-precision version in Cephes.
Add a statically unrolled version for 1..10 coefficients.

Results on Linux x86_64 and with an Intel i5-3550 for:

  static immutable T[6] coeffs = [ 0.1, 0.2, 0.3, 0.4, 0.5, 0.6 ];
  std.math.poly(cast(T) 0.43685, coeffs);

=> real:   ~1.2x faster
=> double: ~3.2x faster
=> float:  ~3.0x faster
Prefer an arithmetic multiply by the according power-of-two constant, as
that's much faster. E.g., this makes tan(double) run 2.05x faster on an
Intel i5-3550.
I.e., use poly() as for the other precisions (unlike the C source).
With compiler optimizations enabled, it should be inlined anyway.
@kinke (Contributor, Author) commented Aug 15, 2018

Rebased...

@dlang-bot dlang-bot merged commit a56be52 into dlang:master Aug 16, 2018
@kinke kinke deleted the atan_generic branch August 16, 2018 09:29
@dnadlinger (Member)

Let's see how this goes… At this point, I believe due diligence has been done, even though more performance comparisons are always nice to have (I still didn't get around to writing that microbenchmarking thing…).

@MartinNowak (Member)

@kinke We got the surprising bug report that exp no longer works at compile time.
https://forum.dlang.org/post/oocmwvbzxsekxykxavts@forum.dlang.org
Would be good to figure out what happened there and why this wasn't tested/found beforehand.

@kinke (Contributor, Author) commented Sep 4, 2018

I fixed that for LDC, as we do test it, but forgot to upstream it. If someone gets to it before I do, see ldc-developers/ldc@95af69e (and add a test here). IIRC, the problem is that the statically unrolled poly() version isn't CTFE-able, apparently due to a CTFE bug (https://issues.dlang.org/show_bug.cgi?id=17351#c10).
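The kind of regression test that would have caught this might look like the following sketch (not the actual test added upstream): forcing evaluation through an enum initializer means a non-CTFE-able exp() fails at compile time instead of silently regressing.

```d
import std.math : exp, fabs;

// Force compile-time evaluation: if exp() isn't CTFE-able, this line fails
// to compile rather than producing a wrong runtime result.
enum double ctfeE = exp(1.0);

void main()
{
    // Compare the CTFE result against Euler's number e ~ 2.718281828459045.
    assert(fabs(ctfeE - 2.718281828459045) < 1e-12);
}
```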
