Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Round to even in a branchless fashion. #216

Closed
wants to merge 1 commit into from

Conversation

deadalnix
Copy link

As per title. There is no point risking a mispredict, especially since the branch condition is at least as complicated than the branchless version anyways.

@lemire
Copy link
Member

lemire commented Jul 10, 2023

@deadalnix Please provide before/after benchmarks indicating a performance gain. Before we make algorithmic changes with the hope that it might improve performance, we need solid empirical evidence that there are robust gains. They do not need to be large, but they need to substantiated and robust.

Please grab https://github.com/lemire/simple_fastfloat_benchmark

You may build it and run it like so...

cmake -B buildtest && cmake --build buildtest -j
./buildtest/benchmarks/benchmark -f data/canada.txt

You can change the FetchContent_Declare instruction to point to a new variation and do the routine again.

FetchContent_Declare(fast_float
     GIT_REPOSITORY https://github.com/fastfloat/fast_float.git
     GIT_TAG origin/main
     GIT_SHALLOW TRUE)

Please post your numbers when running... after and before...

./buildtest/benchmarks/benchmark -f data/canada.txt

@deadalnix
Copy link
Author

Unfortunately, this doesn't seems to work. I'm failing to compile the benchmark with the following error:

[ 97%] Building CXX object benchmarks/CMakeFiles/benchmark.dir/benchmark.cpp.o
In file included from /home/deadalnix/repos/simple_fastfloat_benchmark/benchmarks/benchmark.cpp:25: /home/deadalnix/repos/simple_fastfloat_benchmark/benchmarks/dtoa.c:226: warning: "assert" redefined
  226 | #define assert(x) /*nothing*/
      |
In file included from /usr/include/c++/10/cassert:44,
                 from /home/deadalnix/repos/simple_fastfloat_benchmark/buildtest/_deps/doubleconversion-src/double-conversion/utils.h:34,
                 from /home/deadalnix/repos/simple_fastfloat_benchmark/buildtest/_deps/doubleconversion-src/double-conversion/diy-fp.h:31,
                 from /home/deadalnix/repos/simple_fastfloat_benchmark/buildtest/_deps/doubleconversion-src/double-conversion/ieee.h:31,
                 from /home/deadalnix/repos/simple_fastfloat_benchmark/benchmarks/benchmark.cpp:16: /usr/include/assert.h:50: note: this is the location of the previous definition
   50 | # define assert(expr)  (__ASSERT_VOID_CAST (0))
      |
/home/deadalnix/repos/simple_fastfloat_benchmark/benchmarks/benchmark.cpp: In instantiation of ‘double findmax_fastfloat(std::vector<std::__cxx11::basic_string<_CharT> >&) [with CharT = char16_t]’:
/home/deadalnix/repos/simple_fastfloat_benchmark/benchmarks/benchmark.cpp:359:106:   required from here
/home/deadalnix/repos/simple_fastfloat_benchmark/benchmarks/benchmark.cpp:183:42: error: no matching function for call to ‘from_chars(char16_t*, char16_t*, double&)’
  183 |     auto [p, ec] = fast_float::from_chars(st.data(), st.data() + st.size(), x);
      |                    ~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/deadalnix/repos/simple_fastfloat_benchmark/buildtest/_deps/fast_float-src/include/fast_float/fast_float.h:62,
                 from /home/deadalnix/repos/simple_fastfloat_benchmark/benchmarks/benchmark.cpp:9:
/home/deadalnix/repos/simple_fastfloat_benchmark/buildtest/_deps/fast_float-src/include/fast_float/parse_number.h:66:19: note: candidate: ‘template<class T> fast_float::from_chars_result fast_float::from_chars(const char*, const char*, T&, fast_float::chars_format)’
   66 | from_chars_result from_chars(const char *first, const char *last,
      |                   ^~~~~~~~~~
/home/deadalnix/repos/simple_fastfloat_benchmark/buildtest/_deps/fast_float-src/include/fast_float/parse_number.h:66:19: note:   template argument deduction/substitution failed:
/home/deadalnix/repos/simple_fastfloat_benchmark/benchmarks/benchmark.cpp:183:50: note:   cannot convert ‘(& st)->std::__cxx11::basic_string<char16_t>::data()’ (type ‘char16_t*’) to type ‘const char*’
  183 |     auto [p, ec] = fast_float::from_chars(st.data(), st.data() + st.size(), x);
      |                                           ~~~~~~~^~
gmake[2]: *** [benchmarks/CMakeFiles/benchmark.dir/build.make:76: benchmarks/CMakeFiles/benchmark.dir/benchmark.cpp.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:2721: benchmarks/CMakeFiles/benchmark.dir/all] Error 2
gmake: *** [Makefile:136: all] Error 2

@lemire
Copy link
Member

lemire commented Jul 12, 2023

Unfortunately, this doesn't seems to work. I'm failing to compile the benchmark with the following error:

Is your branch up-to-date with https://github.com/fastfloat/fast_float? If not, please sync before doing any benchmarking.

@deadalnix
Copy link
Author

Good catch, I was using lemire/fast_float , which is not up to date.

@deadalnix
Copy link
Author

deadalnix commented Jul 12, 2023

So, with this patch:

# read 111126 lines
ASCII volume = 1.93374 MB
netlib                                  :   425.67 MB/s (+/- 0.4 %)    24.46 Mfloat/s
doubleconversion                        :   339.53 MB/s (+/- 0.5 %)    19.51 Mfloat/s
strtod                                  :   258.97 MB/s (+/- 0.5 %)    14.88 Mfloat/s
abseil                                  :   657.99 MB/s (+/- 0.3 %)    37.81 Mfloat/s
fastfloat                               :  1284.92 MB/s (+/- 0.2 %)    73.84 Mfloat/s
UTF-16 volume = 3.86749 MB
doubleconversion                        :   630.07 MB/s (+/- 0.2 %)    18.10 Mfloat/s
fastfloat                               :  2502.62 MB/s (+/- 0.3 %)    71.91 Mfloat/s

And without:

# read 111126 lines
ASCII volume = 1.93374 MB
netlib                                  :   418.28 MB/s (+/- 1.2 %)    24.04 Mfloat/s
doubleconversion                        :   330.63 MB/s (+/- 0.3 %)    19.00 Mfloat/s
strtod                                  :   250.09 MB/s (+/- 0.5 %)    14.37 Mfloat/s
abseil                                  :   639.56 MB/s (+/- 0.4 %)    36.75 Mfloat/s
fastfloat                               :  1274.06 MB/s (+/- 0.6 %)    73.22 Mfloat/s
UTF-16 volume = 3.86749 MB
doubleconversion                        :   617.71 MB/s (+/- 0.2 %)    17.75 Mfloat/s
fastfloat                               :  2496.53 MB/s (+/- 0.4 %)    71.73 Mfloat/s

Honestly, the result are so close that I'm not sure if we are looking at noise or something significant. Nevertheless, I'd argue that the branchless approach is preferable for the following reason:

  1. There is no degenerate case when the branch mispredicts. How bad/good this is can be highly dependent on input data, so everything else being equal, I'd rather go for the branchless approach.
  2. Because libraries are typically used withing a larger application, it is always good to keep the pressure low on shared resources such as cache and branch predictor, as, while it might not make a difference in the library itself, it might for the program as a whole. A good exemple of that is libc reverting many of the vectorization changes in their memcpy implementation, because, while they did better standalone, they tended to put more pressure, notably on the icache, which caused programs to perform worse.

EDIT: Theses benchmarks ran on a AMD Ryzen 7 5800X 8-Core Processor, and obviously, results might differs on another CPU.

@deadalnix
Copy link
Author

Running the benchmark numerous times, it's definitively within the noise.

@lemire
Copy link
Member

lemire commented Jul 12, 2023

Thanks.

I have added back a branch counter (some cloud systems do not support counting branches, so I had removed that). If you sync with simple_fastfloat_benchmark and execute in privileged mode, you should get the number of branches.

On an Intel Ice Lake server with GCC 11, I get the following... (the b/f value is the average number of branch per float parsed).

fastfloat                               :   861.99 MB/s (+/- 2.1 %)    49.54 Mfloat/s      16.93 i/B   308.97 i/f (+/- 0.0 %)      3.53 c/B    64.34 c/f (+/- 2.1 %)      4.80 i/c     50.40 b/f      3.19 GHz 

After...

fastfloat                               :   812.23 MB/s (+/- 0.8 %)    46.68 Mfloat/s      17.48 i/B   319.00 i/f (+/- 0.0 %)      3.74 c/B    68.26 c/f (+/- 0.6 %)      4.67 i/c     49.49 b/f      3.19 GHz 

At a glance, on this system with this compiler and on this dataset, this PR saves a branch per float (so a 2% reduction in branching), at the cost of 10 extra instructions per float (so a 3% gain in instructions) which translates into about 4 extra cycles per float (so about 6% slower). It appears that the number of instructions per cycle is slightly reduced by the PR, maybe because there is a longer dependency chain.

@deadalnix
Copy link
Author

I'm going to try to play around to see if that chain can be shortened. I'm not too worried about instruction count if there is a corresponding increase in ILP, but it doesn't looks like this is the case here. Time to try to shuffle things around and see how that goes.

@deadalnix
Copy link
Author

I'm going to close this, as no matter how I shuffle things around, it doesn't seems to be possible to match the perf with branch and the branch is predictible enough so that mispredict are not a major concern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants