ARROW-14269: [C++] Consolidate utf8 benchmark #11376

cyb70289 · 2021-10-11T03:19:09Z

Don't benchmark inlined functions directly inside the benchmark loop.
Instead, wrap them in non-inlined functions.

This patch also improves the performance of validating short ASCII strings.

github-actions · 2021-10-11T03:19:27Z

https://issues.apache.org/jira/browse/ARROW-14269

github-actions · 2021-10-11T03:19:28Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

cyb70289 · 2021-10-11T03:25:18Z

cpp/src/arrow/util/utf8.h

  const uint8_t* data = reinterpret_cast<const uint8_t*>(str.data());
  const size_t length = str.size();

  return ValidateUTF8(data, length);
 }

-inline bool ValidateAsciiSw(const uint8_t* data, int64_t len) {
+static inline bool ValidateAsciiSw(const uint8_t* data, int64_t len) {


The original code unrolls the loop manually, expecting to make better use of the CPU pipeline. But this prevents the compiler from optimizing the loop further through auto-vectorization.

In fact, this simple code performs the same as the SIMD code when built with clang.
With gcc, however, the simple code is much slower than the SIMD code, so I left the SIMD code untouched.

clang: SIMD is no better than the naive code
https://quick-bench.com/q/R5S9gDfyCzxs4pyQO3rLszMLCBI

gcc: SIMD is faster
https://quick-bench.com/q/Xes5_3-CGjJbYNikY0E0BWdfVTo
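
To illustrate the trade-off described above, here is a minimal sketch of the two styles. The names `IsAsciiUnrolled` and `IsAsciiSimple` are hypothetical, and this is not the Arrow implementation; it only contrasts a hand-written word-at-a-time loop with a plain byte loop that the compiler is free to auto-vectorize.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical hand-unrolled variant: ORs the input together 8 bytes at a
// time, then tests the high bit of every byte in the accumulated word.
inline bool IsAsciiUnrolled(const uint8_t* data, int64_t len) {
  assert(len >= 0);
  uint64_t acc = 0;
  int64_t i = 0;
  for (; i + 8 <= len; i += 8) {
    uint64_t word;
    std::memcpy(&word, data + i, sizeof(word));  // safe unaligned load
    acc |= word;
  }
  uint8_t tail = 0;
  for (; i < len; ++i) tail |= data[i];  // leftover bytes (fewer than 8)
  return (acc & 0x8080808080808080ULL) == 0 && tail < 0x80;
}

// Hypothetical simple variant: one byte per iteration, no manual unrolling.
// Whether the compiler vectorizes this loop depends on the compiler and flags.
inline bool IsAsciiSimple(const uint8_t* data, int64_t len) {
  assert(len >= 0);
  uint8_t orall = 0;
  for (int64_t i = 0; i < len; ++i) orall |= data[i];
  return orall < 0x80;
}
```

Both functions return the same answer for any input; which one is faster has to be measured per compiler, as the two quick-bench links above show.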

cyb70289 · 2021-10-11T03:58:04Z

NOTE: the results below are compared against the non-inlined benchmark (the first commit in this PR), not against the master branch.

$ archery benchmark diff --suite-filter="arrow-utf8-util-benchmark" HEAD HEAD^

clang-10, Intel gold 5218
The improvement in ValidateLargeNonAscii does not look real; it should not be affected by this PR.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (8)
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
               benchmark       baseline      contender  change %                                                                                                                counters
       ValidateTinyAscii  1.421 GiB/sec  2.000 GiB/sec    40.758       {'run_name': 'ValidateTinyAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 106006292}
   ValidateLargeNonAscii  1.554 GiB/sec  2.032 GiB/sec    30.731       {'run_name': 'ValidateLargeNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 11668}
      ValidateSmallAscii 14.104 GiB/sec 17.500 GiB/sec    24.082       {'run_name': 'ValidateSmallAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 75859578}
ValidateSmallAlmostAscii  2.983 GiB/sec  3.402 GiB/sec    14.066 {'run_name': 'ValidateSmallAlmostAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 15147502}
ValidateLargeAlmostAscii  3.382 GiB/sec  3.711 GiB/sec     9.736    {'run_name': 'ValidateLargeAlmostAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 25408}
   ValidateSmallNonAscii  1.953 GiB/sec  1.978 GiB/sec     1.280    {'run_name': 'ValidateSmallNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 10763254}
      ValidateLargeAscii 38.357 GiB/sec 38.436 GiB/sec     0.207         {'run_name': 'ValidateLargeAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 288231}
    ValidateTinyNonAscii  1.243 GiB/sec  1.242 GiB/sec    -0.105     {'run_name': 'ValidateTinyNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 84939377}

gcc-9, Intel gold 5218
Again, the regression in ValidateLargeNonAscii does not look real.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (7)
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
               benchmark       baseline      contender  change %                                                                                                                counters
      ValidateSmallAscii 13.695 GiB/sec 13.943 GiB/sec     1.810       {'run_name': 'ValidateSmallAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 74956699}
ValidateSmallAlmostAscii  3.259 GiB/sec  3.303 GiB/sec     1.349 {'run_name': 'ValidateSmallAlmostAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 16556254}
    ValidateTinyNonAscii  1.186 GiB/sec  1.201 GiB/sec     1.270     {'run_name': 'ValidateTinyNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 81825116}
      ValidateLargeAscii 39.003 GiB/sec 39.047 GiB/sec     0.112         {'run_name': 'ValidateLargeAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 293159}
ValidateLargeAlmostAscii  3.489 GiB/sec  3.489 GiB/sec    -0.010    {'run_name': 'ValidateLargeAlmostAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 26201}
   ValidateSmallNonAscii  1.675 GiB/sec  1.674 GiB/sec    -0.061     {'run_name': 'ValidateSmallNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 9388982}
       ValidateTinyAscii  1.624 GiB/sec  1.616 GiB/sec    -0.523       {'run_name': 'ValidateTinyAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 122075013}

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Regressions: (1)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
            benchmark      baseline     contender  change %                                                                                                          counters
ValidateLargeNonAscii 1.747 GiB/sec 1.584 GiB/sec    -9.328 {'run_name': 'ValidateLargeNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 13117}

Arm Neoverse N1, clang-10
The Arm benchmark shows a big improvement in tiny and small ASCII validation, which is reasonable.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (8)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
               benchmark        baseline       contender  change %                                                                                                                counters
       ValidateTinyAscii   1.214 GiB/sec   1.862 GiB/sec    53.331        {'run_name': 'ValidateTinyAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 91250343}
      ValidateSmallAscii  12.344 GiB/sec  17.937 GiB/sec    45.313       {'run_name': 'ValidateSmallAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 67719531}
    ValidateTinyNonAscii 817.353 MiB/sec 820.267 MiB/sec     0.357     {'run_name': 'ValidateTinyNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 54553894}
ValidateSmallAlmostAscii   2.579 GiB/sec   2.583 GiB/sec     0.161 {'run_name': 'ValidateSmallAlmostAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 13097861}
      ValidateLargeAscii  43.055 GiB/sec  43.101 GiB/sec     0.105         {'run_name': 'ValidateLargeAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 323527}
ValidateLargeAlmostAscii   2.738 GiB/sec   2.737 GiB/sec    -0.039    {'run_name': 'ValidateLargeAlmostAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 20551}
   ValidateLargeNonAscii   1.207 GiB/sec   1.205 GiB/sec    -0.195        {'run_name': 'ValidateLargeNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 9066}
   ValidateSmallNonAscii   1.324 GiB/sec   1.318 GiB/sec    -0.468     {'run_name': 'ValidateSmallNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 7425228}

Arm Neoverse N1, gcc-9

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (8)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
               benchmark         baseline       contender  change %                                                                                                                counters
       ValidateTinyAscii    1.637 GiB/sec   2.149 GiB/sec    31.283       {'run_name': 'ValidateTinyAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 122917255}
      ValidateSmallAscii   15.231 GiB/sec  16.637 GiB/sec     9.236       {'run_name': 'ValidateSmallAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 83551083}
      ValidateLargeAscii   43.135 GiB/sec  43.175 GiB/sec     0.093         {'run_name': 'ValidateLargeAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 324160}
   ValidateSmallNonAscii    1.394 GiB/sec   1.395 GiB/sec     0.057     {'run_name': 'ValidateSmallNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 7819363}
ValidateLargeAlmostAscii    2.816 GiB/sec   2.818 GiB/sec     0.055    {'run_name': 'ValidateLargeAlmostAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 21173}
ValidateSmallAlmostAscii    2.733 GiB/sec   2.732 GiB/sec    -0.012 {'run_name': 'ValidateSmallAlmostAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 13878062}
   ValidateLargeNonAscii    1.410 GiB/sec   1.410 GiB/sec    -0.041       {'run_name': 'ValidateLargeNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 10588}
    ValidateTinyNonAscii 1001.808 MiB/sec 999.129 MiB/sec    -0.267     {'run_name': 'ValidateTinyNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 66846809}
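
For reading the tables above: the `change %` column appears to be the relative difference of the contender throughput against the baseline. That is an assumption inferred from the reported numbers, not taken from the archery source, and `ChangePercent` is a hypothetical helper name.

```cpp
#include <cassert>

// Hypothetical helper: relative change of contender versus baseline, in percent.
// Positive means the contender is faster when both values are throughputs.
inline double ChangePercent(double baseline, double contender) {
  assert(baseline > 0.0);
  return (contender - baseline) / baseline * 100.0;
}
```

For example, `ChangePercent(1.421, 2.000)` gives roughly 40.7, close to the 40.758 reported for ValidateTinyAscii with clang-10 on Intel; the small gap is presumably due to rounding of the displayed throughputs.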

cyb70289 · 2021-10-11T04:02:07Z

cpp/src/arrow/util/utf8_util_benchmark.cc

+// Do not benchmark inlined functions directly inside the benchmark loop
+static ARROW_NOINLINE bool ValidateUTF8NoInline(const uint8_t* data, int64_t size) {


Without this noinline wrapper, the benchmarked function is expanded inline in the benchmark loop.
This causes strange behaviour: for example, an obviously irrelevant change may cause large variance in the benchmark result.
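
The wrapper pattern can be sketched outside of Arrow as follows. Everything here is an assumption made for illustration: `EXAMPLE_NOINLINE` stands in for `ARROW_NOINLINE`, `IsAscii` stands in for the inline function under test, and `TimeCalls` is a bare-bones timing loop rather than the real benchmark harness.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for ARROW_NOINLINE, using the compiler-specific
// spelling of the "never inline this function" attribute.
#if defined(_MSC_VER)
#define EXAMPLE_NOINLINE __declspec(noinline)
#else
#define EXAMPLE_NOINLINE __attribute__((noinline))
#endif

// Stand-in for an inline function we want to measure.
inline bool IsAscii(const uint8_t* data, int64_t len) {
  assert(len >= 0);
  uint8_t orall = 0;
  for (int64_t i = 0; i < len; ++i) orall |= data[i];
  return orall < 0x80;
}

// The function under test is inlined into this wrapper only, so its generated
// code no longer depends on what surrounds the call in the timing loop.
static EXAMPLE_NOINLINE bool IsAsciiNoInline(const uint8_t* data, int64_t len) {
  return IsAscii(data, len);
}

// Bare-bones timing loop: returns average seconds per call and reports
// through *all_ok whether every call returned true.
inline double TimeCalls(const uint8_t* data, int64_t len, int iterations,
                        bool* all_ok) {
  assert(iterations > 0);
  bool ok = true;
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) {
    ok = IsAsciiNoInline(data, len) && ok;
  }
  const auto stop = std::chrono::steady_clock::now();
  *all_ok = ok;
  return std::chrono::duration<double>(stop - start).count() / iterations;
}
```

This sketch does not try to stop the compiler from hoisting the call out of the loop; a real benchmark framework normally provides a facility for that, and the benchmark in this PR relies on its own harness rather than a hand-written loop like this one.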

pitrou · 2021-10-11T12:06:37Z

@ursabot please benchmark

ursabot · 2021-10-11T12:07:11Z

Benchmark runs are scheduled for baseline = 39875f1 and contender = a06c34f. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

pitrou · 2021-10-11T14:37:43Z

There is no important regression, so this can be merged IMHO.

cyb70289 · 2021-10-12T01:48:57Z

Rebased to fix merge conflict.
Will merge when CI finishes.

ursabot · 2021-10-12T04:15:10Z

Benchmark runs are scheduled for baseline = ca18e8a and contender = b01a30a. b01a30a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

github-actions bot added the Component: C++ label Oct 11, 2021

cyb70289 commented Oct 11, 2021

cyb70289 force-pushed the utf8-bench branch from 50a65c0 to a06c34f Compare October 11, 2021 09:28

pitrou approved these changes Oct 11, 2021

cyb70289 added 2 commits October 12, 2021 01:45

ARROW-14269: [C++] Consolidate utf8 benchmark

d07367b

Don't benchmark inlined functions directly inside the benchmark loop. Instead, wrap them with non-inlined functions. This patch also improves performance of validating short ascii strings.

Improve short ascii validation

3cf0083

cyb70289 force-pushed the utf8-bench branch from a06c34f to 3cf0083 Compare October 12, 2021 01:48

cyb70289 closed this in b01a30a Oct 12, 2021

cyb70289 deleted the utf8-bench branch October 12, 2021 04:16

asfimport mentioned this pull request Oct 12, 2021

[C++] Consolidate utf8 benchmark #29846

Closed
