Round to even in a branchless fashion. #216

deadalnix · 2023-07-10T12:05:20Z

As per title. There is no point risking a mispredict, especially since the branch condition is at least as complicated than the branchless version anyways.

lemire · 2023-07-10T14:47:51Z

@deadalnix Please provide before/after benchmarks indicating a performance gain. Before we make algorithmic changes with the hope that it might improve performance, we need solid empirical evidence that there are robust gains. They do not need to be large, but they need to substantiated and robust.

Please grab https://github.com/lemire/simple_fastfloat_benchmark

You may build it and run it like so...

cmake -B buildtest && cmake --build buildtest -j
./buildtest/benchmarks/benchmark -f data/canada.txt

You can change the FetchContent_Declare instruction to point to a new variation and do the routine again.

FetchContent_Declare(fast_float
     GIT_REPOSITORY https://github.com/fastfloat/fast_float.git
     GIT_TAG origin/main
     GIT_SHALLOW TRUE)

Please post your numbers when running... after and before...

./buildtest/benchmarks/benchmark -f data/canada.txt

deadalnix · 2023-07-12T13:59:58Z

Unfortunately, this doesn't seems to work. I'm failing to compile the benchmark with the following error:

[ 97%] Building CXX object benchmarks/CMakeFiles/benchmark.dir/benchmark.cpp.o
In file included from /home/deadalnix/repos/simple_fastfloat_benchmark/benchmarks/benchmark.cpp:25: /home/deadalnix/repos/simple_fastfloat_benchmark/benchmarks/dtoa.c:226: warning: "assert" redefined
  226 | #define assert(x) /*nothing*/
      |
In file included from /usr/include/c++/10/cassert:44,
                 from /home/deadalnix/repos/simple_fastfloat_benchmark/buildtest/_deps/doubleconversion-src/double-conversion/utils.h:34,
                 from /home/deadalnix/repos/simple_fastfloat_benchmark/buildtest/_deps/doubleconversion-src/double-conversion/diy-fp.h:31,
                 from /home/deadalnix/repos/simple_fastfloat_benchmark/buildtest/_deps/doubleconversion-src/double-conversion/ieee.h:31,
                 from /home/deadalnix/repos/simple_fastfloat_benchmark/benchmarks/benchmark.cpp:16: /usr/include/assert.h:50: note: this is the location of the previous definition
   50 | # define assert(expr)  (__ASSERT_VOID_CAST (0))
      |
/home/deadalnix/repos/simple_fastfloat_benchmark/benchmarks/benchmark.cpp: In instantiation of ‘double findmax_fastfloat(std::vector<std::__cxx11::basic_string<_CharT> >&) [with CharT = char16_t]’:
/home/deadalnix/repos/simple_fastfloat_benchmark/benchmarks/benchmark.cpp:359:106:   required from here
/home/deadalnix/repos/simple_fastfloat_benchmark/benchmarks/benchmark.cpp:183:42: error: no matching function for call to ‘from_chars(char16_t*, char16_t*, double&)’
  183 |     auto [p, ec] = fast_float::from_chars(st.data(), st.data() + st.size(), x);
      |                    ~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/deadalnix/repos/simple_fastfloat_benchmark/buildtest/_deps/fast_float-src/include/fast_float/fast_float.h:62,
                 from /home/deadalnix/repos/simple_fastfloat_benchmark/benchmarks/benchmark.cpp:9:
/home/deadalnix/repos/simple_fastfloat_benchmark/buildtest/_deps/fast_float-src/include/fast_float/parse_number.h:66:19: note: candidate: ‘template<class T> fast_float::from_chars_result fast_float::from_chars(const char*, const char*, T&, fast_float::chars_format)’
   66 | from_chars_result from_chars(const char *first, const char *last,
      |                   ^~~~~~~~~~
/home/deadalnix/repos/simple_fastfloat_benchmark/buildtest/_deps/fast_float-src/include/fast_float/parse_number.h:66:19: note:   template argument deduction/substitution failed:
/home/deadalnix/repos/simple_fastfloat_benchmark/benchmarks/benchmark.cpp:183:50: note:   cannot convert ‘(& st)->std::__cxx11::basic_string<char16_t>::data()’ (type ‘char16_t*’) to type ‘const char*’
  183 |     auto [p, ec] = fast_float::from_chars(st.data(), st.data() + st.size(), x);
      |                                           ~~~~~~~^~
gmake[2]: *** [benchmarks/CMakeFiles/benchmark.dir/build.make:76: benchmarks/CMakeFiles/benchmark.dir/benchmark.cpp.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:2721: benchmarks/CMakeFiles/benchmark.dir/all] Error 2
gmake: *** [Makefile:136: all] Error 2

lemire · 2023-07-12T14:02:26Z

Unfortunately, this doesn't seems to work. I'm failing to compile the benchmark with the following error:

Is your branch up-to-date with https://github.com/fastfloat/fast_float? If not, please sync before doing any benchmarking.

deadalnix · 2023-07-12T14:06:10Z

Good catch, I was using lemire/fast_float , which is not up to date.

deadalnix · 2023-07-12T14:19:54Z

So, with this patch:

# read 111126 lines
ASCII volume = 1.93374 MB
netlib                                  :   425.67 MB/s (+/- 0.4 %)    24.46 Mfloat/s
doubleconversion                        :   339.53 MB/s (+/- 0.5 %)    19.51 Mfloat/s
strtod                                  :   258.97 MB/s (+/- 0.5 %)    14.88 Mfloat/s
abseil                                  :   657.99 MB/s (+/- 0.3 %)    37.81 Mfloat/s
fastfloat                               :  1284.92 MB/s (+/- 0.2 %)    73.84 Mfloat/s
UTF-16 volume = 3.86749 MB
doubleconversion                        :   630.07 MB/s (+/- 0.2 %)    18.10 Mfloat/s
fastfloat                               :  2502.62 MB/s (+/- 0.3 %)    71.91 Mfloat/s

And without:

# read 111126 lines
ASCII volume = 1.93374 MB
netlib                                  :   418.28 MB/s (+/- 1.2 %)    24.04 Mfloat/s
doubleconversion                        :   330.63 MB/s (+/- 0.3 %)    19.00 Mfloat/s
strtod                                  :   250.09 MB/s (+/- 0.5 %)    14.37 Mfloat/s
abseil                                  :   639.56 MB/s (+/- 0.4 %)    36.75 Mfloat/s
fastfloat                               :  1274.06 MB/s (+/- 0.6 %)    73.22 Mfloat/s
UTF-16 volume = 3.86749 MB
doubleconversion                        :   617.71 MB/s (+/- 0.2 %)    17.75 Mfloat/s
fastfloat                               :  2496.53 MB/s (+/- 0.4 %)    71.73 Mfloat/s

Honestly, the result are so close that I'm not sure if we are looking at noise or something significant. Nevertheless, I'd argue that the branchless approach is preferable for the following reason:

There is no degenerate case when the branch mispredicts. How bad/good this is can be highly dependent on input data, so everything else being equal, I'd rather go for the branchless approach.
Because libraries are typically used withing a larger application, it is always good to keep the pressure low on shared resources such as cache and branch predictor, as, while it might not make a difference in the library itself, it might for the program as a whole. A good exemple of that is libc reverting many of the vectorization changes in their memcpy implementation, because, while they did better standalone, they tended to put more pressure, notably on the icache, which caused programs to perform worse.

EDIT: Theses benchmarks ran on a AMD Ryzen 7 5800X 8-Core Processor, and obviously, results might differs on another CPU.

deadalnix · 2023-07-12T14:35:14Z

Running the benchmark numerous times, it's definitively within the noise.

lemire · 2023-07-12T16:59:50Z

Thanks.

I have added back a branch counter (some cloud systems do not support counting branches, so I had removed that). If you sync with simple_fastfloat_benchmark and execute in privileged mode, you should get the number of branches.

On an Intel Ice Lake server with GCC 11, I get the following... (the b/f value is the average number of branch per float parsed).

fastfloat                               :   861.99 MB/s (+/- 2.1 %)    49.54 Mfloat/s      16.93 i/B   308.97 i/f (+/- 0.0 %)      3.53 c/B    64.34 c/f (+/- 2.1 %)      4.80 i/c     50.40 b/f      3.19 GHz

After...

fastfloat                               :   812.23 MB/s (+/- 0.8 %)    46.68 Mfloat/s      17.48 i/B   319.00 i/f (+/- 0.0 %)      3.74 c/B    68.26 c/f (+/- 0.6 %)      4.67 i/c     49.49 b/f      3.19 GHz

At a glance, on this system with this compiler and on this dataset, this PR saves a branch per float (so a 2% reduction in branching), at the cost of 10 extra instructions per float (so a 3% gain in instructions) which translates into about 4 extra cycles per float (so about 6% slower). It appears that the number of instructions per cycle is slightly reduced by the PR, maybe because there is a longer dependency chain.

deadalnix · 2023-07-13T13:09:17Z

I'm going to try to play around to see if that chain can be shortened. I'm not too worried about instruction count if there is a corresponding increase in ILP, but it doesn't looks like this is the case here. Time to try to shuffle things around and see how that goes.

deadalnix · 2023-07-14T13:31:13Z

I'm going to close this, as no matter how I shuffle things around, it doesn't seems to be possible to match the perf with branch and the branch is predictible enough so that mispredict are not a major concern.

Round to even in a branchless fashion.

1eb42b8

deadalnix force-pushed the branchlessroundeven branch from 889a3b0 to 1eb42b8 Compare July 12, 2023 14:05

deadalnix closed this Jul 14, 2023

deadalnix mentioned this pull request Jul 14, 2023

Extract the round to even loigic so it can be reused for hex floats. #218

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Round to even in a branchless fashion. #216

Round to even in a branchless fashion. #216

deadalnix commented Jul 10, 2023

lemire commented Jul 10, 2023 •

edited

deadalnix commented Jul 12, 2023

lemire commented Jul 12, 2023

deadalnix commented Jul 12, 2023

deadalnix commented Jul 12, 2023 •

edited

deadalnix commented Jul 12, 2023

lemire commented Jul 12, 2023

deadalnix commented Jul 13, 2023

deadalnix commented Jul 14, 2023

Round to even in a branchless fashion. #216

Round to even in a branchless fashion. #216

Conversation

deadalnix commented Jul 10, 2023

lemire commented Jul 10, 2023 • edited

deadalnix commented Jul 12, 2023

lemire commented Jul 12, 2023

deadalnix commented Jul 12, 2023

deadalnix commented Jul 12, 2023 • edited

deadalnix commented Jul 12, 2023

lemire commented Jul 12, 2023

deadalnix commented Jul 13, 2023

deadalnix commented Jul 14, 2023

lemire commented Jul 10, 2023 •

edited

deadalnix commented Jul 12, 2023 •

edited