Optimization for hexadecimal formatting #1944

NN--- · 2020-10-20T15:05:50Z

Manual implementation is about 10 times faster than fmtlib.
MSVC x64

Manual: 2414
Fmt: 16481

#include <string>
#include <iostream>
#include <chrono>
#include <fmt/format.h>

constexpr char hexchar[] = "0123456789ABCDEF";

template <typename T>
std::string to_hex(T value)
{
    constexpr size_t sz = sizeof(value) * 2;
    std::string buffer(sz, L'\0');
    for (int i = 0; i < 2 * sizeof(value); i++)
        buffer[i] = hexchar[(value >> 4 * (sz - 1 - i)) & 0xf];
    return buffer;
}

template <typename T>
std::string fmt_to_hex(T value)
{
    return fmt::format(FMT_STRING("{0:X}"), value);
}

int main()
{
    int sum = 0;

    auto before =  std::chrono::system_clock::now();

    for (int i = 0; i < 100000000; i++)
    {
        auto a = to_hex(i);
        sum += a.length();
    }

    auto manual =  std::chrono::system_clock::now() - before;

    before=  std::chrono::system_clock::now();

    for (int i = 0; i < 100000000; i++)
    {
        auto a = fmt_to_hex(i);
        sum += a.length();
    }

    auto fmt=  std::chrono::system_clock::now() -before;

    std::cout << "Manual: " << duration_cast<std::chrono::milliseconds>(manual).count() << std::endl;
    std::cout << "Fmt: " << duration_cast<std::chrono::milliseconds>(fmt).count() << std::endl;

    return sum;
}

The text was updated successfully, but these errors were encountered:

vitaut · 2020-10-20T15:21:01Z

Your implementation of to_hex is not equivalent because it always produces unnecessary leading zeros. That said fmt's hex formatter is indeed quite suboptimal but it's easy to optimize. It's just a matter of adding a fast path in

fmt/include/fmt/compile.h

Lines 537 to 540 in 17fba75

    
           using type = get_type<ID, Args>; 
        
           constexpr auto result = parse_specs<type>(str, POS + 2, ID); 
        
           return parse_tail<Args, result.end, result.next_arg_id>( 
        
               spec_field<char_type, type, ID>{result.fmt}, format_str);

and using format string compilation. PRs are welcome.

BuzaL · 2020-11-17T20:48:18Z

Fortunate, easy to compare, 'cause it is 10 times slower than sprintf:
https://godbolt.org/z/cPqMz3

NobodyXu · 2021-02-25T11:18:43Z

@BuzaL Interestingly, if I use FMT_COMPILE and use -std=c++17 -O3 -flto, then fmt::format is faster than snprintf by roughly 3 time.

https://godbolt.org/z/9Ejjoc

BuzaL · 2021-02-25T20:52:21Z

As I see, its an issue with 7.x.x. The 6.2.1 still faster then reserve + append, but 7.0 and trunk are slow with ANY formatting.
That means something went wrong the on-the-fly format compiler.

vitaut · 2021-02-25T22:35:26Z

Looks like the regression is specific to Compiler Explorer, most likely the installed version is compiled with optimizations disabled.

vitaut · 2021-02-25T23:37:39Z

Confirmed in #2153 (comment). Don't use Compiler Explorer for benchmarking =). Also don't use system clock.

vitaut · 2021-03-11T23:44:45Z

I've landed a few formatting optimizations that improve hex formatting perf among other things.

With these changes, format string compilation and adding size computation to the manual method for consistency

#include <string>
#include <iostream>
#include <chrono>
#include <fmt/compile.h>

constexpr char hexchar[] = "0123456789ABCDEF";

template <typename T>
std::string to_hex(T value)
{
    constexpr size_t sz = sizeof(value) * 2;
    std::string buffer(sz, L'\0');
    int size = 0;
    for (int i = 0; i < 2 * sizeof(value); i++) {
        int digit = (value >> 4 * (sz - 1 - i)) & 0xf;
        buffer[i] = hexchar[digit];
        if (digit != 0) size = digit;
    }
    buffer.resize(size + 1);
    return buffer;
}

template <typename T>
std::string fmt_to_hex(T value)
{
    return fmt::format(FMT_COMPILE("{:X}"), value);
}

int main()
{
    int sum = 0;

    auto before = std::chrono::steady_clock::now();

    for (int i = 0; i < 100000000; i++)
    {
        auto a = to_hex(i);
        sum += a.length();
    }

    auto manual = std::chrono::steady_clock::now() - before;

    before = std::chrono::steady_clock::now();

    for (int i = 0; i < 100000000; i++)
    {
        auto a = fmt_to_hex(i);
        sum += a.length();
    }

    auto fmt = std::chrono::steady_clock::now() - before;

    std::cout << "Manual: " << std::chrono::duration_cast<std::chrono::milliseconds>(manual).count() << std::endl;
    std::cout << "Fmt: " << std::chrono::duration_cast<std::chrono::milliseconds>(fmt).count() << std::endl;

    return sum;
}

we can get comparable performance:

$ g++ test.cc -DNDEBUG -O3 -I include src/format.cc -std=c++17 -flto
$ ./a.out 
Manual: 1099
Fmt: 867

Note that the results are not very reliable, I recommend using a benchmarking framework instead.

alexezeder · 2021-03-12T09:16:50Z

I prepared a bit different still fitting benchmark of hex formatting.
It doesn't check fmt::format() or to_hex from the top message that returns std::string, as well as std::stringstream and std::to_string, because all these approaches introduce additional allocations that may influence on actual formatting. But it does check formatting to the existing buffer, so there are several versions of fmt::format_to, std::to_chars, and custom code (it copied from {fmt} but doesn't introduce any complexity that format_to does).
There are tests named FMTCustomFormatter* which have a slightly modified version of fmt/compile.h that provides a custom int_formatter class for formatting integer types. It's the initial test of my idea of formatter being specific for passed specs, it was previously described here - #2078.
Let me paste some of the results here (full results available in README.md, also you can check it on your PC):

Benchmark	Time
CustomCode	2.93 ns
FMTCompileMaster	12.5 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	3.66 ns
ToChars	2.57 ns

vitaut · 2021-03-12T16:04:25Z

Thanks for the benchmark. I wonder why the results for FMTCompileMaster and FMTCompileMasterNOINLINERemovedFromFill are slightly different in https://github.com/alexezeder/fmt_bnchmrk if fill is not called? Maybe a measurement error or some difference in codegen?

alexezeder · 2021-03-13T03:14:50Z

I wonder why the results for FMTCompileMaster and FMTCompileMasterNOINLINERemovedFromFill are slightly different in https://github.com/alexezeder/fmt_bnchmrk_legacy if fill is not called? Maybe a measurement error or some difference in codegen?

Oops... I made a mistake at the last state of preparing this repo and FMTCompileMasterNOINLINERemovedFromFill actually has the same code as FMTCompileMaster. I updated the benchmark repo, but they are still really close, almost equal.

There is a difference in codegen of course, but looks like as long as fill() function is not used this difference does not really affect performance. So it's probably a measurement error, something like ±0.3 ns which is not bad on my opinion.

EDIT: repo moved from https://github.com/alexezeder/fmt_bnchmrk to https://github.com/alexezeder/fmt_bnchmrk_legacy

vitaut · 2021-03-14T22:59:00Z

Regarding FMTCompileSeparateFormatterNOINLINERemovedFromFill , it is a nice speedup but I wonder if we can avoid replicating the format specifier parsing logic. In principle the compiler should be able to optimize basic_format_specs handling since the specs are constexpr. The LTO results seem to confirm that but the question is how to make it work without LTO.

alexezeder · 2021-03-15T19:15:57Z

In principle the compiler should be able to optimize basic_format_specs handling since the specs are constexpr.

My bet here is that the code is just too complex for the compiler to optimize. We can play with attributes of function to make them always inlined, but that would not probably solve problems with losing knowledge of constness of value in some cases. I mean, I'm not sure that even the formatter object that is created at compile-time can be used inside format method of spec_field at runtime in the most efficient way, i.e. inlined somehow.

The LTO results seem to confirm that but the question is how to make it work without LTO.

I'm not sure how LTO should help in this case, at least for the results in the benchmark. It uses header-only versions of {fmt}, so looks like it could help only with reducing the size of the executable by deleting some of non-inlined functions from different versions. In fact, GCC with -flto did it for my precious benchmark, so I had to build it several times to get the results of different versions. But here are results for LTO and non-LTO builds:

	Master	SeparatedFormatter
0	11.5 ns (w/o LTO), 10.7 ns (with LTO)	2.44 ns (w/o LTO), 2.20 ns (with LTO)
42	12.0 ns (w/o LTO), 11.0 ns (with LTO)	3.06 ns (w/o LTO), 3.25 ns (with LTO)
273123	13.2 ns (w/o LTO), 12.7 ns (with LTO)	6.11 ns (w/o LTO), 7.10 ns (with LTO)
9223372036854775807	14.3 ns (w/o LTO), 13.5 ns (with LTO)	7.89 ns (w/o LTO), 10.1 ns (with LTO)

alexezeder · 2021-03-19T21:55:37Z

And another set of benchmarks. They are also available in my benchmark repository.

Clang

Clang was added, it shows even better results for SeparatedFormatter tests:

Benchmark, `value == 273123`	GCC	Clang
CustomCode	6.15 ns	5.03 ns
FMTCompileMaster	13.3 ns	14.7 ns
FMTCompileMasterNOINLINERemovedFromFill	13.2 ns	14.8 ns
FMTCompileSeparateFormatter	14.2 ns	13.4 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	7.49 ns	5.19 ns
ToChars	3.76 ns	3.44 ns

For Clang there is no overhead of using code from FMTCompileSeparateFormatterNOINLINERemovedFromFill compare to custom code.

Force inlining

To test what I wrote earlier:

We can play with attributes of function to make them always inlined, but that would not probably solve problems with losing knowledge of constness of value in some cases.

I modified all functions of {fmt} (which are invoked by the test code) to add to them always_inline attribute. This forces compiler to make inlining even in case if it decides that the function is too complex. Of course, fill function left untouched for FMTCompileMaster and FMTCompileSeparateFormatter versions of {fmt}. Here are the results (FI stands for force inline version):

Benchmark, `value == 42`	GCC*	GCC* FI	Clang	Clang FI
CustomCode	2.79 ns	2.79 ns	2.22 ns	2.24 ns
FMTCompileMaster	12.0 ns	11.8 ns	13.3 ns	12.8 ns
FMTCompileMasterNOINLINERemovedFromFill	12.0 ns	11.8 ns	13.4 ns	12.7 ns
FMTCompileSeparateFormatter	11.5 ns	4.83 ns	10.6 ns	10.6 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	3.53 ns	3.31 ns	2.22 ns	2.26 ns
ToChars	2.49 ns	2.01 ns	2.21 ns	2.23 ns

Benchmark, `value == 9223372036854775807`	GCC*	GCC* FI	Clang	Clang FI
CustomCode	6.82 ns	6.85 ns	6.47 ns	6.58 ns
FMTCompileMaster	14.5 ns	14.5 ns	16.0 ns	15.5 ns
FMTCompileMasterNOINLINERemovedFromFill	14.4 ns	14.5 ns	16.0 ns	15.5 ns
FMTCompileSeparateFormatter	16.6 ns	11.6 ns	15.4 ns	15.6 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	10.5 ns	9.80 ns	6.49 ns	6.55 ns
ToChars	5.45 ns	4.54 ns	4.92 ns	5.00 ns

note: GCC and GCC FI results cannot be directly compared because there is a time gap between these runs

Looks like this trick can only slice some time for the FMTCompileSeparateFormatter test on GCC, but for the current approach in {fmt} it does nothing.

std::string tests

Since this issue started with std::string based test, I decided to make a proper benchmark for this case. I took code from the first message, and since this code fills the result with leading zeros (up to 8 digits), the format string for {fmt}-based test is {:0>8x}. Also, I replaced ToChars test with StringStream test, just to be sure that stringstream still sucks.

Benchmark, `value == 42`	GCC	Clang
CustomCode	3.51 ns	3.24 ns
FMTCompileMaster	36.4 ns	28.1 ns
FMTCompileMasterNOINLINERemovedFromFill	36.5 ns	27.9 ns
FMTCompileSeparateFormatter	19.8 ns	18.5 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	14.9 ns	18.8 ns
StringStream	311 ns	322 ns

Considering these results looks like {fmt} has a big overhead, but that's quite false. Code from the first message is actually closer to the usage of fmt::format_to because the buffer size is available at compile-time. Just by changing this code like that:

std::string test_custom_code(unsigned value) {
  constexpr unsigned width = 2 * sizeof(value);
  std::string buffer;
  for (int i = 0; i < width; i++)
    buffer.append(1, hexchar[(value >> 4 * (width - 1 - i)) & 0xf]);
  return buffer;
}

the results from GCC (Clang still optimizes this code) become comparable:

Benchmark, `value == 273123`	Time
CustomCode	15.6 ns
FMTCompileMaster	35.4 ns
FMTCompileMasterNOINLINERemovedFromFill	35.4 ns
FMTCompileSeparateFormatter	19.4 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	14.7 ns
StringStream	321 ns

vitaut · 2021-03-21T16:29:40Z

AFAICS the part that prevents optimization for some reason is dynamic width/precision handling:

fmt/include/fmt/format.h

Lines 3540 to 3544 in 0f85a46

    
           auto specs = specs_; 
        
           detail::handle_dynamic_spec<detail::width_checker>(specs.width, 
        
                                                              specs.width_ref, ctx); 
        
           detail::handle_dynamic_spec<detail::precision_checker>( 
        
               specs.precision, specs.precision_ref, ctx);

Temporarily commenting it out makes gcc do the expected optimization:

-----------------------------------------------------------------------------------------------------------------
Benchmark                                                                       Time             CPU   Iterations
-----------------------------------------------------------------------------------------------------------------
CustomCode/0                                                                 1.40 ns         1.40 ns    499241121
CustomCode/42                                                                1.99 ns         1.99 ns    350273792
CustomCode/273123                                                            4.93 ns         4.93 ns    143643778
CustomCode/9223372036854775807                                               5.48 ns         5.48 ns    125456338
FMTCompileMaster/0                                                           1.20 ns         1.20 ns    582823513
FMTCompileMaster/42                                                          1.59 ns         1.59 ns    437414846
FMTCompileMaster/273123                                                      2.89 ns         2.89 ns    238470638
FMTCompileMaster/9223372036854775807                                         3.89 ns         3.89 ns    180152587
FMTCompileMasterNOINLINERemovedFromFill/0                                    9.17 ns         9.17 ns     75675294
FMTCompileMasterNOINLINERemovedFromFill/42                                   9.57 ns         9.57 ns     72413567
FMTCompileMasterNOINLINERemovedFromFill/273123                               10.6 ns         10.6 ns     65808614
FMTCompileMasterNOINLINERemovedFromFill/9223372036854775807                  11.6 ns         11.6 ns     60052123
FMTCompileSeparateFormatter/0                                                7.71 ns         7.71 ns     89553288
FMTCompileSeparateFormatter/42                                               8.52 ns         8.52 ns     81365037
FMTCompileSeparateFormatter/273123                                           10.7 ns         10.7 ns     64955329
FMTCompileSeparateFormatter/9223372036854775807                              12.0 ns         12.0 ns     58023609
FMTCompileSeparateFormatterNOINLINERemovedFromFill/0                         1.91 ns         1.91 ns    368577999
FMTCompileSeparateFormatterNOINLINERemovedFromFill/42                        2.63 ns         2.63 ns    265312761
FMTCompileSeparateFormatterNOINLINERemovedFromFill/273123                    5.57 ns         5.57 ns    126824185
FMTCompileSeparateFormatterNOINLINERemovedFromFill/9223372036854775807       8.26 ns         8.26 ns     84841458
ToChars/0                                                                    1.00 ns         1.00 ns    667463669
ToChars/42                                                                   1.99 ns         1.99 ns    350416544
ToChars/273123                                                               2.99 ns         2.99 ns    233728063
ToChars/9223372036854775807                                                  4.32 ns         4.32 ns    161760882

Need to figure out what specifically gcc doesn't like about the above code snippet.

alexezeder · 2021-03-21T22:02:45Z

Temporarily commenting it out makes gcc do the expected optimization:

Wow, these results are great! I didn't expect that compilers can optimize code like that.

This change makes FMTCompileMaster timings even better than timings for std::to_chars, despite {fmt} is way more generic. But that's true for GCC, not for Clang. On Clang, FMTCompileMaster timings become better with this change, but it's still slower than FMTCompileSeparateFormatterNOINLINERemovedFromFill. Here are combined results (DSH - dynamic specs handling):

Benchmark, `value == 0`	GCC w/o DSH	GCC with DSH	Clang w/o DSH	Clang with DSH
CustomCode	1.76 ns	1.75 ns	1.59 ns	1.55 ns
FMTCompileMaster	1.50 ns	11.5 ns	6.82 ns	12.9 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	2.54 ns	2.52 ns	1.59 ns	1.56 ns
ToChars	1.24 ns	1.25 ns	1.25 ns	1.89 ns

Benchmark, `value == 42`	GCC w/o DSH	GCC with DSH	Clang w/o DSH	Clang with DSH
CustomCode	2.55 ns	2.79 ns	2.26 ns	2.22 ns
FMTCompileMaster	2.00 ns	12.0 ns	7.49 ns	13.3 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	3.80 ns	3.53 ns	2.26 ns	2.22 ns
ToChars	2.90 ns	2.49 ns	2.25 ns	2.21 ns

Benchmark, `value == 273123`	GCC w/o DSH	GCC with DSH	Clang w/o DSH	Clang with DSH
CustomCode	5.81 ns	6.15 ns	5.19 ns	5.03 ns
FMTCompileMaster	3.34 ns	13.3 ns	8.95 ns	14.7 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	8.26 ns	7.49 ns	5.51 ns	5.19 ns
ToChars	4.50 ns	3.76 ns	3.50 ns	3.44 ns

Benchmark, `value == 9223372036854775807`	GCC w/o DSH	GCC with DSH	Clang w/o DSH	Clang with DSH
CustomCode	6.95 ns	6.82 ns	6.72 ns	6.47 ns
FMTCompileMaster	4.86 ns	14.5 ns	10.6 ns	16.0 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	11.3 ns	10.5 ns	6.67 ns	6.49 ns
ToChars	5.63 ns	5.45 ns	5.02 ns	4.92 ns

For some reason FMTCompileSeparateFormatterNOINLINERemovedFromFill is slower than FMTCompileMaster on GCC. I mean, they share the same {fmt} code, and int_formatter is way simpler than the current one 😐. But okay, I would investigate what's going on there.

In addition, I did a quick check for a bit more complex format string 0>8x. Here are also combined results:

Benchmark, `value == 0`	GCC w/o DSH	GCC with DSH	Clang w/o DSH	Clang with DSH
CustomCode	4.07 ns	4.48 ns	5.27 ns	5.23 ns
FMTCompileMaster	21.5 ns	27.8 ns	14.2 ns	19.2 ns
FMTCompileMasterNOINLINERemovedFromFill	4.12 ns	27.7 ns	14.2 ns	19.2 ns
FMTCompileSeparateFormatter	11.6 ns	11.0 ns	9.72 ns	9.97 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	5.60 ns	5.97 ns	4.95 ns	5.48 ns

Benchmark, `value == 42`	GCC w/o DSH	GCC with DSH	Clang w/o DSH	Clang with DSH
CustomCode	4.32 ns	5.61 ns	6.15 ns	6.08 ns
FMTCompileMaster	21.9 ns	28.8 ns	14.8 ns	19.6 ns
FMTCompileMasterNOINLINERemovedFromFill	4.84 ns	28.6 ns	15.0 ns	19.6 ns
FMTCompileSeparateFormatter	11.8 ns	11.8 ns	10.7 ns	10.7 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	5.96 ns	6.21 ns	6.29 ns	5.99 ns

Benchmark, `value == 273123`	GCC w/o DSH	GCC with DSH	Clang w/o DSH	Clang with DSH
CustomCode	6.91 ns	8.07 ns	8.51 ns	8.44 ns
FMTCompileMaster	23.2 ns	29.7 ns	16.3 ns	20.9 ns
FMTCompileMasterNOINLINERemovedFromFill	5.72 ns	29.9 ns	16.2 ns	21.0 ns
FMTCompileSeparateFormatter	14.7 ns	15.1 ns	12.6 ns	13.5 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	7.98 ns	8.06 ns	9.37 ns	8.30 ns

Benchmark, `value == 9223372036854775807`	GCC w/o DSH	GCC with DSH	Clang w/o DSH	Clang with DSH
CustomCode	7.53 ns	9.28 ns	8.38 ns	8.28 ns
FMTCompileMaster	20.5 ns	27.2 ns	14.0 ns	19.3 ns
FMTCompileMasterNOINLINERemovedFromFill	5.07 ns	27.5 ns	13.9 ns	19.0 ns
FMTCompileSeparateFormatter	16.9 ns	16.7 ns	14.9 ns	15.7 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	9.38 ns	9.33 ns	11.7 ns	10.6 ns

alexezeder · 2021-03-23T02:57:12Z

Ok, I realized why FMTCompileSeparateFormatter* are so slow compared to FMTCompileMaster* - it was probably a bad idea to take different revisions of {fmt} without rebasing separate formatter test code onto the master version. Especially when format.h was updated in master (integral formatting refactoring).

Results for {:x} format string:

Benchmark, `value == 42`	GCC	Clang
CustomCode	2.51 ns	2.27 ns
FMTCompileMaster	1.99 ns	7.41 ns
FMTCompileMasterNOINLINERemovedFromFill	2.00 ns	7.17 ns
FMTCompileSeparateFormatter	2.75 ns	2.09 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	2.76 ns	2.08 ns
ToChars	2.10 ns	2.27 ns

Results for {:0>8x} format string:

Benchmark, `value == 42`	GCC	Clang
CustomCode	4.27 ns	6.19 ns
FMTCompileMaster	22.3 ns	14.8 ns
FMTCompileMasterNOINLINERemovedFromFill	4.82 ns	15.0 ns
FMTCompileSeparateFormatter	8.72 ns	6.85 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	5.95 ns	5.32 ns

Still, FMTCompileMaster is a bit faster than FMTCompileSeparateFormatter on GCC, but it's either poor optimization for a non-type template parameter or some optimization magic which cannot be solved.

Anyway, even if those handle_dynamic_spec() calls cannot be optimized out sanely, then I think we still can have that static only specs optimization in {fmt}. This solution requires a new formatter that handles only basic_ not dynamic_ format specs. Also, there is a problem of detecting dynamic format specs on the string parsing step, because we cannot just look for { and } in specs to determine it is dynamic or static. But we can create the default formatter and look at the context argument identifier after calling parse() method of formatter.

alexezeder · 2021-03-26T02:42:19Z

Probably the final results for hexadecimal formatting.

I ran the same benchmarks for {:x} and {:0>8x} format strings, but this time they were compiled using MSVC (version 19.28.29913), since this compiler is also popular. Again, it's C++20 standard, it's a release configuration, everything remains the same. I also added results from GCC (Cygwin) on Windows just to demonstrate that GCC results on Linux and on Windows are almost identical.

Results for {:x} format string:

Benchmark, `value == 42`	MSVC	GCC (Cygwin)	GCC	Clang
CustomCode	4.14 ns	2.78 ns	2.51 ns	2.27 ns
FMTCompileMaster	38.5 ns	2.52 ns	1.99 ns	7.41 ns
FMTCompileMasterNOINLINERemovedFromFill	38.5 ns	2.02 ns	2.00 ns	7.17 ns
FMTCompileSeparateFormatter	7.87 ns	2.52 ns	2.75 ns	2.09 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	7.89 ns	2.85 ns	2.76 ns	2.08 ns
ToChars	8.07 ns	2.02 ns	2.10 ns	2.27 ns

Results for {:0>8x} format string:

Benchmark, `value == 42`	MSVC	GCC (Cygwin)	GCC	Clang
CustomCode	6.85 ns	4.78 ns	4.27 ns	6.19 ns
FMTCompileMaster	45.8 ns	27.4 ns	22.3 ns	14.8 ns
FMTCompileMasterNOINLINERemovedFromFill	45.9 ns	5.37 ns	4.82 ns	15.0 ns
FMTCompileSeparateFormatter	16.7 ns	15.4 ns	8.72 ns	6.85 ns
FMTCompileSeparateFormatterNOINLINERemovedFromFill	16.7 ns	6.64 ns	5.95 ns	5.32 ns

Need to figure out what specifically gcc doesn't like about the above code snippet.

I dig a little into the handle_dynamic_spec(), and based on what I saw - every meaningful line (in this function or in functions that are invoked inside) introduces some overhead (mostly it's ~1-2 ns). But the biggest impact comes from basic_format_args and it's get() method (~5 ns). Note that these observations are not very precise, because they actually just cannot be precise because we are talking about compiler optimizations here with all that magic under the hood.

vitaut · 2021-03-26T15:05:28Z

Thanks for such a comprehensive testing!

I didn't expect that compilers can optimize code like that.

In principle this is mostly inlining and constant propagation. I'm more surprised that clang does a poor job here. I'll try to dig into the generated assembly to see what's going on.

I think we still can have that static only specs optimization in {fmt}.

Right.

I dig a little into the handle_dynamic_spec(), and based on what I saw - every meaningful line (in this function or in functions that are invoked inside) introduces some overhead (mostly it's ~1-2 ns).

True but it should be optimized away completely if dynamic specs are not used.

jovibor · 2021-03-27T03:14:53Z

@alexezeder That's a nice benchmarks indeed.
I hope you'll add a std::format to this contest, when it's ready.

vitaut · 2021-05-13T01:20:34Z

After a recent series of optimizations the perf is on par with the handwritten code. Here are the results on clang and the current master (d1aebdb):

------------------------------------------------------------------------------------------------------
Benchmark                                                            Time             CPU   Iterations
------------------------------------------------------------------------------------------------------
CustomCode/0                                                      1.86 ns         1.86 ns    351709307
CustomCode/42                                                     3.03 ns         3.02 ns    244754389
CustomCode/273123                                                 5.45 ns         5.43 ns    130468007
CustomCode/9223372036854775807                                    6.79 ns         6.77 ns    102097372
FMTCompileMaster/0                                                1.99 ns         1.99 ns    360523686
FMTCompileMaster/42                                               2.33 ns         2.33 ns    279865664
FMTCompileMaster/273123                                           3.72 ns         3.71 ns    190230315
FMTCompileMaster/9223372036854775807                              5.28 ns         5.26 ns    130711631
ToChars/0                                                         4.42 ns         4.41 ns    160196630
ToChars/42                                                        5.00 ns         4.98 ns    140735201
ToChars/273123                                                    7.26 ns         7.24 ns     95784130
ToChars/9223372036854775807                                       8.77 ns         8.75 ns     75872534

For comparison here's are results on an old version (7.1.3):

Running ./fmt_test
Run on (8 X 2800 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 22.02, 12.70, 8.66
------------------------------------------------------------------------------------------------------
Benchmark                                                            Time             CPU   Iterations
------------------------------------------------------------------------------------------------------
CustomCode/0                                                      1.81 ns         1.81 ns    369289856
CustomCode/42                                                     2.66 ns         2.63 ns    277249683
CustomCode/273123                                                 5.30 ns         5.27 ns    131162285
CustomCode/9223372036854775807                                    6.92 ns         6.86 ns    101821144
FMTCompileOld/0                                                   15.5 ns         15.5 ns     43302898
FMTCompileOld/42.                                                 16.6 ns         16.6 ns     43278267
FMTCompileOld/273123                                              18.7 ns         18.6 ns     37035861
FMTCompileOld/9223372036854775807                                 19.4 ns         19.4 ns     35243000
ToChars/0                                                         4.33 ns         4.31 ns    162237581
ToChars/42                                                        5.50 ns         5.44 ns    123228589
ToChars/273123                                                    8.47 ns         8.16 ns     92972600
ToChars/9223372036854775807                                       9.84 ns         9.52 ns     63743569

vitaut closed this as completed May 13, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimization for hexadecimal formatting #1944

Optimization for hexadecimal formatting #1944

NN--- commented Oct 20, 2020

vitaut commented Oct 20, 2020 •

edited

BuzaL commented Nov 17, 2020

NobodyXu commented Feb 25, 2021 •

edited

BuzaL commented Feb 25, 2021 •

edited

vitaut commented Feb 25, 2021 •

edited

vitaut commented Feb 25, 2021 •

edited

vitaut commented Mar 11, 2021 •

edited

alexezeder commented Mar 12, 2021 •

edited

vitaut commented Mar 12, 2021

alexezeder commented Mar 13, 2021 •

edited

vitaut commented Mar 14, 2021 •

edited

alexezeder commented Mar 15, 2021

alexezeder commented Mar 19, 2021 •

edited

vitaut commented Mar 21, 2021

alexezeder commented Mar 21, 2021 •

edited

alexezeder commented Mar 23, 2021 •

edited

alexezeder commented Mar 26, 2021

vitaut commented Mar 26, 2021

jovibor commented Mar 27, 2021

vitaut commented May 13, 2021 •

edited

Optimization for hexadecimal formatting #1944

Optimization for hexadecimal formatting #1944

Comments

NN--- commented Oct 20, 2020

vitaut commented Oct 20, 2020 • edited

BuzaL commented Nov 17, 2020

NobodyXu commented Feb 25, 2021 • edited

BuzaL commented Feb 25, 2021 • edited

vitaut commented Feb 25, 2021 • edited

vitaut commented Feb 25, 2021 • edited

vitaut commented Mar 11, 2021 • edited

alexezeder commented Mar 12, 2021 • edited

vitaut commented Mar 12, 2021

alexezeder commented Mar 13, 2021 • edited

vitaut commented Mar 14, 2021 • edited

alexezeder commented Mar 15, 2021

alexezeder commented Mar 19, 2021 • edited

Clang

Force inlining

std::string tests

vitaut commented Mar 21, 2021

alexezeder commented Mar 21, 2021 • edited

alexezeder commented Mar 23, 2021 • edited

alexezeder commented Mar 26, 2021

vitaut commented Mar 26, 2021

jovibor commented Mar 27, 2021

vitaut commented May 13, 2021 • edited

vitaut commented Oct 20, 2020 •

edited

NobodyXu commented Feb 25, 2021 •

edited

BuzaL commented Feb 25, 2021 •

edited

vitaut commented Feb 25, 2021 •

edited

vitaut commented Feb 25, 2021 •

edited

vitaut commented Mar 11, 2021 •

edited

alexezeder commented Mar 12, 2021 •

edited

alexezeder commented Mar 13, 2021 •

edited

vitaut commented Mar 14, 2021 •

edited

alexezeder commented Mar 19, 2021 •

edited

alexezeder commented Mar 21, 2021 •

edited

alexezeder commented Mar 23, 2021 •

edited

vitaut commented May 13, 2021 •

edited