
ARROW-13536: [C++] Use decimal-point aware conversion from fast-float #11817

Closed
wants to merge 3 commits

Conversation

pitrou (Member) commented Nov 30, 2021

The custom wrapper is still used for decimals.
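For context, a minimal sketch of what a decimal-point aware parse with fast_float looks like, using fast_float::from_chars_advanced and fast_float::parse_options from the upstream fast_float API; the helper name and surrounding code are illustrative only, not the exact wrapper in this PR.

```cpp
#include <fast_float/fast_float.h>

#include <string>
#include <system_error>

// Illustrative helper, not the PR's actual code: parse a floating-point value
// whose decimal separator may differ from '.' (e.g. ',' in some locales).
bool ParseFloatWithSeparator(const std::string& s, char decimal_point, double* out) {
  fast_float::parse_options options{fast_float::chars_format::general, decimal_point};
  const char* begin = s.data();
  const char* end = s.data() + s.size();
  auto result = fast_float::from_chars_advanced(begin, end, *out, options);
  // Succeed only if the whole string was consumed without error.
  return result.ec == std::errc() && result.ptr == end;
}
```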

pitrou (Member, Author) commented Nov 30, 2021

@ursabot please benchmark

ursabot commented Nov 30, 2021

Benchmark runs are scheduled for baseline = 2baed02 and contender = 2113e2a. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Failed] ec2-t3-xlarge-us-east-2
[Failed] ursa-i9-9960x
[Failed] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

pitrou (Member, Author) commented Nov 30, 2021

@ursabot please benchmark

ursabot commented Nov 30, 2021

Benchmark runs are scheduled for baseline = 2baed02 and contender = 0a82075. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️25.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.35% ⬆️0.09%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

cyb70289 (Contributor) commented Dec 1, 2021

There's some regression in FloatParsing. I repeated the test locally with similar results.

The results below are from a Xeon Gold 5218: FloatParsing sees a 25%~30% regression with clang and a 12% regression with gcc.
I also tested on an Arm Neoverse N1, where the regression is about 10% for both compilers.

clang-12
$ archery benchmark diff --suite-filter="arrow-value-parsing-benchmark" --cc=clang --cxx=clang++

-----------------------------------------------------------------------------------------
Non-regressions: (23)                                                                    
-----------------------------------------------------------------------------------------
                                benchmark           baseline          contender  change %
                    HexParsing<Int16Type> 122.889M items/sec 146.914M items/sec    19.551
                IntegerParsing<Int16Type> 122.648M items/sec 130.271M items/sec     6.215
             IntegerFormatting<Int64Type>  50.467M items/sec  53.040M items/sec     5.097
......

---------------------------------------------------------------------------------------
Regressions: (10)
---------------------------------------------------------------------------------------
                              benchmark           baseline          contender  change %
             IntegerParsing<UInt32Type> 158.139M items/sec 149.383M items/sec    -5.537
              IntegerParsing<UInt8Type> 193.646M items/sec 181.939M items/sec    -6.046
              IntegerParsing<Int32Type>  96.648M items/sec  87.516M items/sec    -9.449
                  HexParsing<UInt8Type> 154.437M items/sec 139.302M items/sec    -9.800
                   HexParsing<Int8Type> 159.359M items/sec 143.353M items/sec   -10.044
                 HexParsing<UInt32Type> 104.878M items/sec  90.750M items/sec   -13.470
TimestampParsingISO8601<TimeUnit::NANO>  44.332M items/sec  37.728M items/sec   -14.897
                FloatParsing<FloatType>  52.520M items/sec  38.632M items/sec   -26.444
               FloatParsing<DoubleType>  59.252M items/sec  41.705M items/sec   -29.613
                 HexParsing<UInt16Type> 122.306M items/sec  84.441M items/sec   -30.959

gcc-9.4
$ archery benchmark diff --suite-filter="arrow-value-parsing-benchmark" --cc=gcc --cxx=g++

-----------------------------------------------------------------------------------------
Non-regressions: (28)                                                                    
-----------------------------------------------------------------------------------------
                                benchmark           baseline          contender  change %
                   HexParsing<UInt32Type> 117.514M items/sec 131.796M items/sec    12.153
                    HexParsing<UInt8Type> 191.946M items/sec 213.739M items/sec    11.354
TimestampParsingISO8601<TimeUnit::SECOND>  38.486M items/sec  41.766M items/sec     8.523
                 IntegerParsing<Int8Type> 136.544M items/sec 147.524M items/sec     8.042
                   HexParsing<UInt16Type> 150.635M items/sec 158.778M items/sec     5.406
             IntegerFormatting<UInt8Type> 405.355M items/sec 426.245M items/sec     5.153
......

-----------------------------------------------------------------------------
Regressions: (5)
-----------------------------------------------------------------------------
                    benchmark           baseline          contender  change %
   IntegerParsing<UInt32Type> 167.434M items/sec 157.774M items/sec    -5.769
IntegerFormatting<UInt16Type> 183.618M items/sec 172.128M items/sec    -6.257
        HexParsing<Int16Type> 142.974M items/sec 130.157M items/sec    -8.965
     FloatParsing<DoubleType>  45.377M items/sec  39.540M items/sec   -12.864
      FloatParsing<FloatType>  43.444M items/sec  37.624M items/sec   -13.397

pitrou (Member, Author) commented Dec 1, 2021

@ursabot please benchmark lang=Python,R

ursabot commented Dec 1, 2021

Supported benchmark command examples:

@ursabot benchmark help

To run all benchmarks:
@ursabot please benchmark

To filter benchmarks by language:
@ursabot please benchmark lang=Python
@ursabot please benchmark lang=C++
@ursabot please benchmark lang=R
@ursabot please benchmark lang=Java
@ursabot please benchmark lang=JavaScript

To filter Python and R benchmarks by name:
@ursabot please benchmark name=file-write
@ursabot please benchmark name=file-write lang=Python
@ursabot please benchmark name=file-.*

To filter C++ benchmarks by archery --suite-filter and --benchmark-filter:
@ursabot please benchmark command=cpp-micro --suite-filter=arrow-compute-vector-selection-benchmark --benchmark-filter=TakeStringRandomIndicesWithNulls/262144/2 --iterations=3

For other command=cpp-micro options, please see https://github.com/ursacomputing/benchmarks/blob/main/benchmarks/cpp_micro_benchmarks.py

pitrou (Member, Author) commented Dec 1, 2021

@ursabot please benchmark lang=Python lang=R

ursabot commented Dec 1, 2021

Supported benchmark command examples:

@ursabot benchmark help

To run all benchmarks:
@ursabot please benchmark

To filter benchmarks by language:
@ursabot please benchmark lang=Python
@ursabot please benchmark lang=C++
@ursabot please benchmark lang=R
@ursabot please benchmark lang=Java
@ursabot please benchmark lang=JavaScript

To filter Python and R benchmarks by name:
@ursabot please benchmark name=file-write
@ursabot please benchmark name=file-write lang=Python
@ursabot please benchmark name=file-.*

To filter C++ benchmarks by archery --suite-filter and --benchmark-filter:
@ursabot please benchmark command=cpp-micro --suite-filter=arrow-compute-vector-selection-benchmark --benchmark-filter=TakeStringRandomIndicesWithNulls/262144/2 --iterations=3

For other command=cpp-micro options, please see https://github.com/ursacomputing/benchmarks/blob/main/benchmarks/cpp_micro_benchmarks.py

pitrou (Member, Author) commented Dec 1, 2021

@ursabot please benchmark lang=Python
@ursabot please benchmark lang=R

ursabot commented Dec 1, 2021

Benchmark runs are scheduled for baseline = 2baed02 and contender = 93fcb18. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Provided benchmark filters do not have any benchmark groups to be executed on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.45% ⬆️0.0%] ursa-i9-9960x
[Skipped ⚠️ Only ['C++', 'Java'] langs are supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

pitrou (Member, Author) commented Dec 1, 2021

Hmm, the regression is annoying. I simply made a method non-static and the struct is empty (for non-floats). I expected modern compilers to handle this optimally :-/

cyb70289 (Contributor) commented Dec 1, 2021

> Hmm, the regression is annoying. I simply made a method non-static and the struct is empty (for non-floats). I expected modern compilers to handle this optimally :-/

So it looks like the overhead comes from the benchmark itself. Because the test strings are very short, the indirect call overhead becomes significant.
I don't think it's a real problem in practice.

cyb70289 (Contributor) commented Dec 1, 2021

Interestingly, when I add a long string to the float parsing benchmark, master runs at 2.7M items/sec while this PR runs at 7.8M items/sec, much faster.

diff --git a/cpp/src/arrow/util/value_parsing_benchmark.cc b/cpp/src/arrow/util/value_parsing_benchmark.cc
index 40d139316..b955a4174 100644
--- a/cpp/src/arrow/util/value_parsing_benchmark.cc
+++ b/cpp/src/arrow/util/value_parsing_benchmark.cc
@@ -77,9 +77,8 @@ static std::vector<std::string> MakeHexStrings(int32_t num_items) {
 }

 static std::vector<std::string> MakeFloatStrings(int32_t num_items) {
-  std::vector<std::string> base_strings = {"0.0",         "5",        "-12.3",
-                                           "98765430000", "3456.789", "0.0012345",
-                                           "2.34567e8",   "-5.67e-8"};
+  std::vector<std::string> base_strings = {
+                                           "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111112222222223333"};

pitrou (Member, Author) commented Dec 1, 2021

By "indirect call overhead", you mean the fact that StringToFloat gained a new parameter?

> Interestingly, when I add a long string to the float parsing benchmark, master runs at 2.7M items/sec while this PR runs at 7.8M items/sec, much faster.

This may be because of the updated fast-float version.

pitrou (Member, Author) commented Dec 1, 2021

@ursabot please benchmark

ursabot commented Dec 1, 2021

Benchmark runs are scheduled for baseline = 2baed02 and contender = b5489e2. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.45% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.53% ⬆️0.09%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

cyb70289 (Contributor) commented Dec 1, 2021

By "indirect call overhead", you mean the fact that StringToFloat gained a new parameter?

I mean the change from StringConverter<T>::Convert(type, s, length, out) to StringConverter<T>{}.Convert(type, s, length, out), which might generate a temporary object.
But I'm not certain now; it looks trivial for the compiler to eliminate the overhead: https://godbolt.org/z/TGTqsMdM1
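To make the comparison concrete, here is a minimal sketch of the two call shapes, assuming a simplified stand-in for Arrow's StringConverter; the real converter lives in cpp/src/arrow/util/value_parsing.h, takes a type argument, and parses differently.

```cpp
#include <cstddef>
#include <cstdlib>

// Simplified, hypothetical stand-in for Arrow's StringConverter<T>.
template <typename T>
struct StringConverter {
  // Non-static member function on an otherwise empty struct.
  bool Convert(const char* s, size_t length, T* out) {
    char* end = nullptr;
    *out = static_cast<T>(std::strtod(s, &end));  // placeholder parse
    return end == s + length;
  }
};

bool ParseDouble(const char* s, size_t length, double* out) {
  // Old call shape (static member): StringConverter<double>::Convert(s, length, out);
  // New call shape (temporary object + non-static member):
  return StringConverter<double>{}.Convert(s, length, out);
  // The temporary carries no state, so an optimizing compiler should emit the
  // same code for both shapes, as the godbolt link above suggests.
}
```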

pitrou (Member, Author) commented Dec 1, 2021

Yes, ideally it's trivial, which is why the regression is a bit surprising.

cyb70289 (Contributor) left a comment


+1

For FloatParsing, this PR starts to beat the master code when the string size is >= 8 on my test machine; the longer the string, the bigger the gap. For string sizes < 8, this PR is slower than master. I guess that's due to the fast-float code updates. I think this PR is beneficial overall.

Other regressions such as HexParsing don't look real; clang and gcc give different benchmark results.

pitrou closed this in b8431fb on Dec 2, 2021
pitrou deleted the ARROW-13536-fast-float branch on December 2, 2021 at 11:07
ursabot commented Dec 2, 2021

Benchmark runs are scheduled for baseline = 6cdb80c and contender = b8431fb. b8431fb is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.35% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.49% ⬆️0.27%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True
