
Implement std.math.{atan[2],tan,exp[2],expm1} for single- and double-precision #6272

Merged: 12 commits into dlang:master on Aug 16, 2018

Conversation

@kinke (Contributor) commented Mar 12, 2018

And make the x87 real version CTFE-able.

Based on Cephes like the existing quadruple and x87 implementations,
https://github.com/jeremybarnes/cephes/blob/master/cmath/atan.c &
https://github.com/jeremybarnes/cephes/blob/master/single/atanf.c

@dlang-bot (Contributor) commented Mar 12, 2018

Thanks for your pull request and interest in making D better, @kinke! We are looking forward to reviewing it, and you should be hearing from a maintainer soon.
Please verify that your PR follows this checklist:

  • My PR is fully covered with tests (you can see the annotated coverage diff directly on GitHub with CodeCov's browser extension)
  • My PR is as minimal as possible (smaller, focused PRs are easier to review than big ones)
  • I have provided a detailed rationale explaining my changes
  • New or modified functions have Ddoc comments (with Params: and Returns:)

Please see CONTRIBUTING.md for more information.


If you have addressed all reviews or aren't sure how to proceed, don't hesitate to ping us with a simple comment.

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

Testing this PR locally

If you don't have a local development environment set up, you can use Digger to test this PR:

dub fetch digger
dub run digger -- build "master + phobos#6272"

@kinke (Contributor, Author) commented Mar 14, 2018

There are unittest failures due to hypot() etc. still only supporting reals. So I guess more or less all remaining std.math functions need to get float/double support in order to hopefully pass the druntime/Phobos unittests again.

I'll need an okay to proceed though, I don't want to waste any more time on this if this doesn't have a chance to be merged (backwards compatibility concerns).

@kinke kinke changed the title Implement std.math.atan() for single- and double-precision Implement std.math.{atan,atan2,tan} for single- and double-precision Mar 14, 2018
@kinke kinke changed the title Implement std.math.{atan,atan2,tan} for single- and double-precision Implement std.math.{atan[2],tan,exp[2],expm1} for single- and double-precision Mar 15, 2018
@kinke kinke force-pushed the atan_generic branch 2 times, most recently from 859cad1 to cedc96c Compare March 16, 2018 22:13
@kinke (Contributor, Author) commented Mar 17, 2018

compilable/ctfe_math.d fails because DMD would need to be patched to recognize the std.math.tan(float|double) overloads as CTFE builtins in builtin.d. The non-asm version still isn't fully CTFE-able because primitives copysign() and signbit() aren't.

No unittest failures anymore [well, test coverage is minimal at the moment].
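For context, the pattern typically used to make such functions CTFE-able is branching on __ctfe, so compile-time evaluation takes a portable software path while run time can use the asm/intrinsic one. A minimal, self-contained sketch (the trivial function and both paths are illustrative stand-ins, not the actual std.math code):

```d
// Sketch of the __ctfe pattern used in std.math: at compile time take a
// portable path, at run time a (potentially asm-backed) fast path.
double twoX(double x)
{
    if (__ctfe)
        return x + x;   // CTFE-friendly path
    return 2.0 * x;     // stand-in for the runtime/asm path
}

enum ct = twoX(21.0);   // enum initializer forces CTFE

void main()
{
    assert(ct == 42.0);
    assert(twoX(21.0) == 42.0);
}
```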

x86_64 timings are quite interesting. Linux, Intel i5-3550 @4GHz, dmd master with -O -inline:

import std.datetime.stopwatch;
import std.stdio;

version (C)
{
    pragma(msg, "Using core.stdc.tgmath");
    import math = core.stdc.tgmath;
}
else
{
    pragma(msg, "Using std.math");
    import math = std.math;
}

T atan(T)() { return math.atan(cast(T) 0.43685); }
T atan2(T)() { return math.atan2(cast(T) 0.43685, cast(T) 0.06912); }
T tan(T)() { return math.tan(cast(T) 0.43685); }
T exp(T)() { return math.exp(cast(T) 0.43685); }
T exp2(T)() { return math.exp2(cast(T) 0.43685); }
T expm1(T)() { return math.expm1(cast(T) 0.43685); }

version (C) {} else
T poly(T)()
{
    static immutable T[6] coeffs = [ 0.1, 0.2, 0.3, 0.4, 0.5, 0.6 ];
    return math.poly(cast(T) 0.43685, coeffs);
}

void bench(alias Func, T...)()
{
    enum numRounds = 5; // best-of
    enum N = 10_000_000;

    writeln(".: ", Func.stringof);

    Duration[T.length][numRounds] durations;
    foreach (i; 0 .. numRounds)
        foreach (j, F; T)
            durations[i][j] = benchmark!(Func!F)(N)[0];

    foreach (j, F; T)
    {
        Duration bestOf = durations[0][j];
        foreach (i; 1 .. numRounds)
        {
            if (durations[i][j] < bestOf)
                bestOf = durations[i][j];
        }
        writeln(F.stringof, ":\t", bestOf);
    }
}

void main()
{
    bench!(atan, real, double, float);
    bench!(atan2, real, double, float);
    bench!(tan, real, double, float);
    bench!(exp, real, double, float);
    bench!(exp2, real, double, float);
    bench!(expm1, real, double);

    version (C) {} else
    bench!(poly, real, double, float);
}

=>

.: atan(T)()
real:   430 ms, 58 μs, and 7 hnsecs
double: 268 ms, 79 μs, and 3 hnsecs
float:  113 ms, 737 μs, and 4 hnsecs
.: atan2(T)()
real:   427 ms, 635 μs, and 3 hnsecs
double: 480 ms, 229 μs, and 6 hnsecs
float:  266 ms, 752 μs, and 3 hnsecs
.: tan(T)()
real:   351 ms, 427 μs, and 5 hnsecs
double: 909 ms, 280 μs, and 5 hnsecs
float:  220 ms, 476 μs, and 4 hnsecs
.: exp(T)()
real:   223 ms and 639 μs
double: 544 ms, 363 μs, and 5 hnsecs
float:  390 ms, 93 μs, and 9 hnsecs
.: exp2(T)()
real:   226 ms, 165 μs, and 2 hnsecs
double: 619 ms, 915 μs, and 6 hnsecs
float:  513 ms, 462 μs, and 2 hnsecs
.: expm1(T)()
real:   234 ms, 788 μs, and 2 hnsecs
double: 197 ms, 129 μs, and 8 hnsecs

The real timings correspond to the DMD 2.079.0 timings for all 3 floating-point types, as the old float/double overloads simply forward to the real implementation. So the results range from a ~4x speed-up to a ~2.75x slow-down. ;)

@kinke kinke changed the title Implement std.math.{atan[2],tan,exp[2],expm1} for single- and double-precision WIP: Implement std.math.{atan[2],tan,exp[2],expm1} for single- and double-precision Mar 17, 2018
@dnadlinger (Member)

x86_64 timings are quite interesting.

Is that unexpected, though? One would expect the "software" implementations to do worse.

@kinke (Contributor, Author) commented Mar 17, 2018

Well, I for one didn't expect such inconsistent results, e.g. a ~4x slow-down of the software double tan vs. the software float tan.

@kinke (Contributor, Author) commented Mar 17, 2018

Here are the corresponding core.stdc.tgmath timings (Ubuntu 16.04):

.: atan(T)()
real:   403 ms, 803 μs, and 5 hnsecs
double: 685 ms, 848 μs, and 3 hnsecs
float:  58 ms, 113 μs, and 3 hnsecs
.: atan2(T)()
real:   510 ms, 110 μs, and 6 hnsecs
double: 903 ms, 576 μs, and 4 hnsecs
float:  165 ms and 114 μs
.: tan(T)()
real:   193 ms, 504 μs, and 1 hnsec
double: 716 ms, 779 μs, and 5 hnsecs
float:  83 ms and 817 μs
.: exp(T)()
real:   511 ms, 916 μs, and 1 hnsec
double: 690 ms, 991 μs, and 3 hnsecs
float:  87 ms, 717 μs, and 6 hnsecs
.: exp2(T)()
real:   547 ms, 701 μs, and 2 hnsecs
double: 129 ms, 991 μs, and 6 hnsecs
float:  123 ms and 33 μs
.: expm1(T)()
real:   471 ms and 943 μs
double: 110 ms, 476 μs, and 3 hnsecs

@kinke (Contributor, Author) commented Mar 17, 2018

Rough runtime comparison for the single/double-precision overloads:

[benchmark chart]

@wilzbach (Member)

Regarding the project tester failure:

core.exception.AssertError@source/mir/random/flex/internal/area.d(525): unittest failure
----------------
??:? _d_unittestp [0x617ef1]
source/mir/random/flex/internal/area.d:525 @safe void mir.random.flex.internal.area.__unittest_L441_C32() [0x5a65a1]
??:? void mir.random.flex.internal.area.__modtest() [0x5b2f9f]
??:? int core.runtime.runModuleUnitTests().__foreachbody2(object.ModuleInfo*) [0x6308e3]
??:? int object.ModuleInfo.opApply(scope int delegate(object.ModuleInfo*)).__lambda2(immutable(object.ModuleInfo*)) [0x616622]
??:? int rt.minfo.moduleinfos_apply(scope int delegate(immutable(object.ModuleInfo*))).__foreachbody2(ref rt.sections_elf_shared.DSO) [0x61dde1]
??:? int rt.sections_elf_shared.DSO.opApply(scope int delegate(ref rt.sections_elf_shared.DSO)) [0x61df74]
??:? int rt.minfo.moduleinfos_apply(scope int delegate(immutable(object.ModuleInfo*))) [0x61dd6d]
??:? int object.ModuleInfo.opApply(scope int delegate(object.ModuleInfo*)) [0x6165f9]
??:? runModuleUnitTests [0x6306b9]
??:? void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).runAll() [0x619e84]
??:? void rt.dmain2._d_run_main(int, char**, extern (C) int function(char[][])*).tryExec(scope void delegate()) [0x619e0b]
??:? _d_run_main [0x619d76]
??:? main [0x560021]
??:? __libc_start_main [0xe29d882f]
core.exception.AssertError@source/mir/random/flex/package.d(916): unittest failure
----------------
??:? _d_unittestp [0x617ef1]
source/mir/random/flex/package.d:916 void mir.random.flex.__unittest_L879_C26() [0x5c9871]

that's because the flex algorithm is quite sensitive to the floating-point behavior, and the tests use a special fpEqual function:

https://github.com/libmir/mir-random/blob/master/source/mir/random/flex/internal/area.d

@kinke (Contributor, Author) commented Mar 17, 2018

that's because the flex algorithm is quite sensitive to the floating point behavior and the tests use a special fpEqual function

... which (only) for DMD and x86_64 is defined as exact equality (the == operator). The failing test there seems to depend on exp(double x), which for DMD x86(_64) is defined as cast(double) exp2Asm(LOG2E * cast(real) x), so I'd argue that exact equality is too strict and approxEqual() a better fit, as used for all other compilers/platforms.
Thanks for checking. :)
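To illustrate the point, a minimal sketch (a hypothetical test, not the actual mir-random code) of why a tolerance-based comparison is more robust than exact == here:

```d
import std.math : approxEqual, exp, nextUp;

void main()
{
    const double a = exp(0.43685);
    const double b = nextUp(a); // perturb by exactly 1 ULP, as a different
                                // exp() implementation might
    assert(a != b);             // exact equality is brittle...
    assert(approxEqual(a, b));  // ...a tolerance-based check still passes
}
```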

@kinke (Contributor, Author) commented Mar 17, 2018

Statically unrolled poly() makes the Cephes version of atan(double) faster by ~31% and expm1(double) by ~19%; the double versions of the other functions show only marginal improvements.
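The idea can be sketched as follows (the helper name is hypothetical; the actual Phobos implementation may differ): with a compile-time-known coefficient count, the Horner loop can be unrolled into straight-line code, removing the loop overhead.

```d
import std.math : poly;

// Hypothetical sketch: unroll Horner's scheme when the coefficient count N
// is known at compile time. Coefficient order matches std.math.poly:
// coeffs[0] + coeffs[1]*x + ... + coeffs[N-1]*x^(N-1).
T polyUnrolled(T, size_t N)(in T x, const ref T[N] coeffs)
{
    T r = coeffs[N - 1];
    static foreach (k; 0 .. N - 1)
        r = r * x + coeffs[N - 2 - k]; // same order as the runtime loop
    return r;
}

void main()
{
    static immutable double[6] coeffs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6];
    const x = 0.43685;
    // Identical operations in the same order => identical result.
    assert(polyUnrolled(x, coeffs) == poly(x, coeffs));
}
```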

@dnadlinger (Member)

How do the timings look on LDC?

@kinke (Contributor, Author) commented Mar 17, 2018

I'm afraid I don't have any LDC timings. This patch cannot simply be cherry-picked due to existing modifications in the touched parts. The main problem, though, is that the timings wouldn't be easily comparable, as e.g. the currently real-only core.math.ldexp() primitive is implemented differently: for DMD it's an intrinsic, i.e., inlined x87 asm; for LDC, it's a non-inlined call into druntime which forwards to the C runtime's long double version (in C, it's apparently a macro).

@kinke (Contributor, Author) commented Mar 17, 2018

Experimentally using core.stdc.math.ldexp[l|f] in std.math.ldexp() (instead of the core.math.ldexp(real) intrinsic, which uses a supposedly slow fscale instruction) leads to a significant performance increase, up to almost 2x, for tan(double) and exp2(float|double), i.e., exactly where the Cephes version lags significantly behind in the chart above.
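A minimal sketch of the related trick of avoiding ldexp() altogether when the binary exponent is a compile-time constant: multiplying by the corresponding power-of-two constant is exact (powers of two are exactly representable) and avoids the call entirely. The function names here are illustrative, not from the patch:

```d
import std.math : ldexp;

enum double TWO_POW_7 = 128.0; // 2 ^^ 7, exactly representable as a double

double scaleLdexp(double x)    { return ldexp(x, 7); }
double scaleMultiply(double x) { return x * TWO_POW_7; } // same result, cheaper

void main()
{
    const x = 0.43685;
    // Bit-identical as long as no over-/underflow occurs.
    assert(scaleLdexp(x) == scaleMultiply(x));
}
```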

@kinke (Contributor, Author) commented Mar 17, 2018

[Chart and benchmark code updated.]

@kinke (Contributor, Author) commented Mar 17, 2018

Chart updated once more after getting rid of a few ldexp() calls, which makes tan(double) twice as fast on my machine (!).

[Note that std.math.ldexp(T) cannot use core.stdc.tgmath.ldexp(T) at the moment, as the latter is impure. I hacked druntime for testing, i.e., the green bars.]

@kinke (Contributor, Author) commented Mar 17, 2018

So on my machine and using the purely arbitrary input data above:

  • The new Cephes double versions are 0.4x - 2x as fast as the current versions, with a geometric mean of 0.88 (0.97 with C ldexp), i.e., overall slower than the current FPU-based versions. That is almost exclusively due to the great performance of the Phobos x87 inline asm for std.math.exp2(real) (twice as fast as the C runtime's long double version!), which distorts the otherwise good results. Cephes almost always clearly beats the C runtime, except for exp2(double) (losing a lot there) and expm1(double).
  • The float versions are 0.5x - 3.8x as fast as the current versions, with a geometric mean of 1.2 (1.4 with C ldexp). The C runtime is always and by far the winner here though (possibly due to fewer handled special cases? - it's up to 12x as fast as the C double version!).

@kinke (Contributor, Author) commented Mar 18, 2018

I ported this to LDC (with LLVM 6.0.0 and using C ldexp), and the results are way better than expected: an overall speed-up factor of >3 for the new double/float versions compared to the old ones, only beaten by the C runtime in 2 cases (exp[2](float)) and otherwise leading the pack. Here's the extended chart:

[benchmark chart incl. LDC port]

@n8sh (Member) commented Jun 23, 2018

Time to squash and merge?

@kinke (Contributor, Author) commented Jun 23, 2018

Ah nice, I forgot to check whether it got green now with the new mir-random. Edit: Oh, no project tester in CI anymore?

squash

Do you really prefer a single 1.7k lines patch?

@n8sh (Member) commented Jun 23, 2018

When you put it that way, perhaps not (EDIT: I mean the squash).

@WalterBright (Member)

std/math.d is getting way too large. The next PR should split this into a package.

@ibuclaw (Member) commented Jun 24, 2018

@WalterBright indeed, we touched on the subject earlier here: #6272 (comment)

And make the x87 `real` version CTFE-able.
I couldn't find the single-precision version in Cephes.
Add a statically unrolled version for 1..10 coefficients.

Results on Linux x86_64 and with an Intel i5-3550 for:

  static immutable T[6] coeffs = [ 0.1, 0.2, 0.3, 0.4, 0.5, 0.6 ];
  std.math.poly(cast(T) 0.43685, coeffs);

=> real:   ~1.2x faster
=> double: ~3.2x faster
=> float:  ~3.0x faster
Prefer an arithmetic multiply by the according power-of-two constant, as
that's much faster. E.g., this makes tan(double) run 2.05x faster on an
Intel i5-3550.
I.e., use poly() as for the other precisions (unlike the C source).
With compiler optimizations enabled, it should be inlined anyway.
@kinke (Contributor, Author) commented Aug 15, 2018

Rebased...

@dlang-bot dlang-bot merged commit a56be52 into dlang:master Aug 16, 2018
@kinke kinke deleted the atan_generic branch August 16, 2018 09:29
@dnadlinger (Member)

Let's see how this goes… At this point, I believe due diligence has been done, even though more performance comparisons are always nice to have (I still didn't get around to writing that microbenchmarking thing…).

@MartinNowak (Member)

@kinke We got the surprising bug report that exp no longer works at compile time.
https://forum.dlang.org/post/oocmwvbzxsekxykxavts@forum.dlang.org
Would be good to figure out what happened there and why this wasn't tested/found beforehand.

@kinke (Contributor, Author) commented Sep 4, 2018

I fixed that for LDC, as we do test it, but forgot to upstream it. If someone gets to it before I do, see ldc-developers/ldc@95af69e (and add a test here). IIRC, the problem is that the statically unrolled poly() version isn't CTFE-able, apparently due to a CTFE bug (https://issues.dlang.org/show_bug.cgi?id=17351#c10).
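The kind of regression test that would have caught this might look like the following sketch (not the actual test added upstream): forcing evaluation through an enum initializer means a non-CTFE-able exp() fails at compile time instead of silently regressing.

```d
import std.math : exp, fabs;

// Force compile-time evaluation: if exp() isn't CTFE-able, this line fails
// to compile rather than producing a wrong runtime result.
enum double ctfeE = exp(1.0);

void main()
{
    // Compare the CTFE result against Euler's number e ~ 2.718281828459045.
    assert(fabs(ctfeE - 2.718281828459045) < 1e-12);
}
```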
