Round to even in a branchless fashion. #216
@deadalnix Please provide before/after benchmarks indicating a performance gain. Before we make algorithmic changes in the hope that they might improve performance, we need solid empirical evidence that there are robust gains. They do not need to be large, but they need to be substantiated and robust. Please grab https://github.com/lemire/simple_fastfloat_benchmark You may build it and run it like so...
You can change the FetchContent_Declare entry:
FetchContent_Declare(fast_float
  GIT_REPOSITORY https://github.com/fastfloat/fast_float.git
  GIT_TAG origin/main
  GIT_SHALLOW TRUE)
Please post your numbers when running... after and before...
Unfortunately, this doesn't seem to work. I'm failing to compile the benchmark with the following error:
Is your branch up-to-date with https://github.com/fastfloat/fast_float? If not, please sync before doing any benchmarking.
Good catch, I was using lemire/fast_float, which is not up to date.
So, with this patch:
And without:
Honestly, the results are so close that I'm not sure if we are looking at noise or something significant. Nevertheless, I'd argue that the branchless approach is preferable for the following reason:
EDIT: These benchmarks ran on a
Running the benchmark numerous times, it's definitely within the noise.
Thanks. I have added back a branch counter (some cloud systems do not support counting branches, so I had removed that). If you sync with simple_fastfloat_benchmark and execute in privileged mode, you should get the number of branches. On an Intel Ice Lake server with GCC 11, I get the following... (the
After...
At a glance, on this system with this compiler and on this dataset, this PR saves a branch per float (so a 2% reduction in branching), at the cost of 10 extra instructions per float (so a 3% increase in instructions), which translates into about 4 extra cycles per float (so about 6% slower). It appears that the number of instructions per cycle is slightly reduced by the PR, maybe because there is a longer dependency chain.
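As a rough check on that last remark, using only the approximate percentages quoted above (about 3% more instructions and about 6% more cycles per float), the change in instructions per cycle (IPC) can be estimated. Here $I$ and $C$ are shorthand, introduced for this note, for the instruction and cycle counts per float:

$$
\frac{\mathrm{IPC}_{\text{after}}}{\mathrm{IPC}_{\text{before}}}
= \frac{I_{\text{after}} / I_{\text{before}}}{C_{\text{after}} / C_{\text{before}}}
\approx \frac{1.03}{1.06} \approx 0.97
$$

That is roughly a 3% drop in IPC. The inputs are rounded, so this is only indicative, but it points the same way as the longer-dependency-chain hypothesis.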
I'm going to try to play around to see if that chain can be shortened. I'm not too worried about instruction count if there is a corresponding increase in ILP, but it doesn't look like this is the case here. Time to try to shuffle things around and see how that goes.
I'm going to close this, as no matter how I shuffle things around, it doesn't seem to be possible to match the perf of the version with the branch, and the branch is predictable enough that mispredicts are not a major concern.
As per title. There is no point risking a mispredict, especially since the branch condition is at least as complicated as the branchless version anyways.
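For readers unfamiliar with the trade-off discussed here, the following is a minimal sketch contrasting a branchy and a branchless round-to-nearest-even step. It is not the code from this PR or from fast_float. The function names `round_shift_even_branchy` and `round_shift_even_branchless`, and the simplified setting (rounding `x / 2^shift` for a 64-bit integer), are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: compute x / 2^shift rounded to nearest, ties to even,
// using an explicit conditional. Requires 1 <= shift <= 63.
inline uint64_t round_shift_even_branchy(uint64_t x, unsigned shift) {
  assert(shift >= 1 && shift <= 63);
  uint64_t q = x >> shift;                         // truncated quotient
  uint64_t rem = x & ((uint64_t(1) << shift) - 1); // discarded low bits
  uint64_t half = uint64_t(1) << (shift - 1);      // exactly one half
  if (rem > half || (rem == half && (q & 1))) {
    ++q; // above half, or a tie with an odd quotient: round up
  }
  return q;
}

// Hypothetical helper: same result, written as straight-line arithmetic.
// Adding the quotient's low bit to the remainder turns "tie and odd" into
// "strictly greater than half", so one comparison covers both cases.
inline uint64_t round_shift_even_branchless(uint64_t x, unsigned shift) {
  assert(shift >= 1 && shift <= 63);
  uint64_t q = x >> shift;
  uint64_t rem = x & ((uint64_t(1) << shift) - 1);
  uint64_t half = uint64_t(1) << (shift - 1);
  uint64_t round_up = (rem + (q & 1)) > half; // 0 or 1, cannot overflow
  return q + round_up;
}
```

Both functions agree on every valid input: for example, 5 with `shift = 1` (2.5) yields 2, and 7 with `shift = 1` (3.5) yields 4. Whether the second version actually compiles to fewer branches depends on the compiler and target, and, as the measurements in this thread suggest, removing a branch does not guarantee fewer cycles.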