Can sin_pi be sped up without error? #200
Your version returns the wrong sign when reducing by an even number: for example, when the argument is 18.6282 you return -0.920005, but it should be +0.920005.
My static_cast is compiling down to a … In any case, I fixed the bug you found:

```cpp
#include <cmath>    // added: for std::sin, std::floor
#include <iostream> // added: for std::cerr
#include <limits>   // added: for std::numeric_limits

#include "math_unit_test.hpp"
#include <boost/math/special_functions/sin_pi.hpp>
#include <boost/multiprecision/float128.hpp>
#include <boost/multiprecision/cpp_bin_float.hpp>

using std::sin;
using boost::math::constants::pi;
using boost::multiprecision::float128;
using boost::multiprecision::cpp_bin_float_100;

template<class Real>
Real sin_pi2(Real x)
{
    using std::sin;
    using std::floor;
    using boost::math::constants::pi;
    if (x < 0) {
        return -sin_pi2(-x);
    }
    Real integer_part = floor(x);
    Real rem = x - integer_part;
    if (rem > 0.5) {
        rem = 1 - rem;
    }
    long i = static_cast<long>(integer_part);
    if (i & 1) {
        return -sin(pi<Real>()*rem);
    } else {
        return sin(pi<Real>()*rem);
    }
}

template<class Real>
void test_ulp()
{
    Real x = 0;
    Real max_x = 10;
    Real step = std::numeric_limits<Real>::epsilon();
    while (x < max_x) {
        Real expected = static_cast<Real>(sin(pi<cpp_bin_float_100>()*cpp_bin_float_100(x)));
        Real computed = boost::math::sin_pi<Real>(x);
        if (!CHECK_ULP_CLOSE(expected, computed, 2)) {
            std::cerr << "  Boost Baddie at x = " << x << "\n";
        }
        computed = sin_pi2<Real>(x);
        if (!CHECK_ULP_CLOSE(expected, computed, 2)) {
            std::cerr << "  My Baddie at x = " << x << "\n";
        }
        x += step;
    }
}

int main()
{
    test_ulp<float>();
}
```

Yours is still more accurate for reasons I can't understand right now. (Change the ULP tolerance to 1 to see it.)
I've managed to get the existing code up to the same speed as yours (at least as far as MSVC is concerned), but I want to test a bit more and do at least cos_pi as well before pushing...
Can you try out current develop? Note that on some platforms you may need to disable double->long double promotion inside sin_pi/cos_pi to achieve max performance (see below). Here's the updated benchmark:
Output from msvc:
I find it very strange that sine is slower in float precision than double on MSVC; when I run your benchmark using g++-8 I don't see that. In any case, here's the output I get with `g++-8 -I./include -I../../../boost --std=c++17 -march=native -O3 -ffast-math test.cpp -lbenchmark -lbenchmark_main -pthread`:

```
2019-05-03 16:22:29
Running ./a.out
Run on (2 X 3300 MHz CPU s)
CPU Caches:
  L1 Data 32K (x2)
  L1 Instruction 32K (x2)
  L2 Unified 256K (x2)
  L3 Unified 3072K (x1)
Load Average: 0.17, 0.31, 0.19
----------------------------------------------------------------------
Benchmark                        Time             CPU      Iterations
----------------------------------------------------------------------
BM_Sin<float>                  20.6 ns         20.6 ns       33969206
BM_Sin<double>                 44.7 ns         44.6 ns       15679269
BM_SinPi<float>                45.6 ns         45.2 ns       15457933
BM_SinPi<double>               80.6 ns         80.5 ns        8685480
BM_SinPi_no_promote<float>     36.8 ns         36.5 ns       19181025
BM_SinPi_no_promote<double>    53.0 ns         52.9 ns       13229723
BM_SinPi2<float>               37.8 ns         37.7 ns       18576036
BM_SinPi2<double>              53.7 ns         53.3 ns       13103178
BM_Cos<float>                  20.4 ns         20.4 ns       34338087
BM_Cos<double>                 44.7 ns         44.7 ns       15669084
BM_CosPi<float>                45.5 ns         45.5 ns       15386990
BM_CosPi<double>               85.7 ns         85.7 ns        8165762
BM_CosPi_no_promote<float>     33.7 ns         33.7 ns       20775641
BM_CosPi_no_promote<double>    52.7 ns         52.7 ns       13283941
```

So sin_pi_no_promote is a very much acceptable cost above calls to sine, but I'm confused as to why gcc is doing this long double promotion in the first place. Of course, there's nothing we can do about it, but clang 6.0 does a very poor job with this:

```
BM_Sin<float>                  92.1 ns         92.0 ns        7596889
BM_Sin<double>                  116 ns          116 ns        6039997
BM_SinPi<float>                 110 ns          109 ns        6393240
BM_SinPi<double>                286 ns          286 ns        2448960
BM_SinPi_no_promote<float>      101 ns          101 ns        6940553
BM_SinPi_no_promote<double>     117 ns          117 ns        6006419
BM_SinPi2<float>                103 ns          102 ns        6830852
BM_SinPi2<double>               119 ns          119 ns        5905943
BM_Cos<float>                  92.2 ns         92.1 ns        7602179
BM_Cos<double>                  118 ns          118 ns        5918561
BM_CosPi<float>                 109 ns          109 ns        6440883
BM_CosPi<double>                214 ns          214 ns        3274262
BM_CosPi_no_promote<float>     94.6 ns         94.4 ns        7410933
BM_CosPi_no_promote<double>     114 ns          114 ns        6150028
```
Ugh, history. And something we may need to revisit, albeit a big breaking change. When the library was written, most code was 32-bit, most floating point code was x87, and there was next to no difference in performance between double and long double functions. Plus, the C std libraries were typically implemented as long double functions which the float/double versions called internally. So... it made a lot of sense to have "one true version" of each special function, and that was the long double version. Oh, and it did wonders for accuracy. So we have code like this at the start of each function:
The result_type converts all the argument types to the canonical floating point type (including integer->double promotion), while the
Weird, is it using intrinsics or calling the C lib? I suspect the latter?
I think 1 ULP accurate is generally acceptable, and half-ULP accurate is more than expected; I'll try to take a look and see if your implementation satisfies this bound. In any case, I looked into what clang was doing: it's doing a lot of comparisons against a stack pointer, which is taking a lot of time, especially since it appears that (say) [rip + 0x400f5] is just the value 0.5 and could be put in the instruction instead of on the stack. At a high level, I see calls to __ieee_754_logl, which I did not expect: On g++-8, the logl is not called:
As to the ULP accuracy, with …

```cpp
#include <cmath>    // added: for std::nextafter
#include <iostream> // added: for std::cerr
#include <limits>   // added: for std::numeric_limits

#include "math_unit_test.hpp"
#include <boost/math/special_functions/sin_pi.hpp>
#include <boost/multiprecision/float128.hpp>
#include <boost/multiprecision/cpp_bin_float.hpp>

using std::sin;
using boost::math::constants::pi;
using boost::multiprecision::float128;
using boost::multiprecision::cpp_bin_float_100;

template<class Real>
void test_ulp()
{
    Real x = 10;
    Real max_x = 1000;
    typedef boost::math::policies::policy<
        boost::math::policies::promote_float<false>,
        boost::math::policies::promote_double<false> >
        no_promote_policy;
    while (x < max_x) {
        Real expected = static_cast<Real>(sin(pi<cpp_bin_float_100>()*cpp_bin_float_100(x)));
        Real computed = boost::math::sin_pi<Real>(x, no_promote_policy());
        if (!CHECK_ULP_CLOSE(expected, computed, 2)) {
            std::cerr << "  Baddie at x = " << x << "\n";
        }
        x = std::nextafter(x, std::numeric_limits<Real>::max());
    }
}

int main()
{
    test_ulp<float>();
    test_ulp<double>();
    return boost::math::test::report_errors();
}
```

I haven't found any that are more than 2 representables incorrect, and lots that are exactly 2 representables incorrect. Is that an acceptable backwards-incompatible change? Kind of a grey area; I'd definitely say go for it if it was only 1 representable incorrect...
Huh?? I assume this has nothing to do with sin_pi and more to do with the random number generator or something? Are you sure those two log calls are the reason for the slowdown?
It's probably fine for this small function; I would prefer to have a consistent policy throughout, though, rather than an ad-hoc one with different functions doing different things.
Yup, sorry, my bad. Clang is doing something strange with the RNG.
Overall I think it's probably a bad idea but you can replace
with something like
for about a 1% speedup in my (admittedly imprecise) benchmarks. Naturally, the int64_t and the 63 are specific to double, so they would have to be computed at compile time for other types, which may not even be possible for some. You could of course hide this behind an inlined function to make it less ugly and provide the optimization only where the type supports it. It depends how determined you are to squeeze out performance vs. maintainability (and also how representative my benchmarks are). I'm also not sure how consistent the bit layout of floats is, which could put a big hole in the whole idea.
I think I'd rather not start relying on reinterpret_casts and undefined behaviour, unless the difference is huge.
Assuming this is fixed and closing down.
Agreed; I think it's fixed.
I noticed that `sin_pi` seems a little slow. What I don't know is whether or not this is necessary for some reason or another. I have written the following benchmarks, and also given what I think may be a simplified implementation (but not yet checked its ULP accuracy):
The performance is shown below:
Is it reasonable to replace the implementation or is there something fundamental I'm missing?