ARROW-14269: [C++] Consolidate utf8 benchmark #11376

Closed

cyb70289

Don't benchmark inlined functions directly inside the benchmark loop.
Instead, wrap them with non-inlined functions.

This patch also improves the performance of validating short ASCII strings.

@github-actions

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

const uint8_t* data = reinterpret_cast<const uint8_t*>(str.data());
const size_t length = str.size();

return ValidateUTF8(data, length);
}

inline bool ValidateAsciiSw(const uint8_t* data, int64_t len) {
static inline bool ValidateAsciiSw(const uint8_t* data, int64_t len) {

The original code unrolls the loop manually, expecting to make better use of the CPU pipeline. But that prevents the compiler from optimizing further through auto-vectorization.

In fact, this simple code performs the same as the SIMD code when built with clang.
With gcc, however, it shows a big regression, so I left the SIMD code untouched.
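
To illustrate the trade-off described above, here is a minimal sketch contrasting a plain byte loop with a manually unrolled one. It is not the code from this patch: `IsAsciiSimple` and `IsAsciiUnrolled` are hypothetical names, and whether the simple loop is actually vectorized depends on the compiler and flags.

```cpp
#include <cstdint>
#include <cstring>

// Simple form (hypothetical name): OR all bytes together and test the
// high bit once. A loop of this shape is a candidate for auto-vectorization.
inline bool IsAsciiSimple(const uint8_t* data, int64_t len) {
  uint8_t orall = 0;
  for (int64_t i = 0; i < len; ++i) {
    orall |= data[i];
  }
  return orall < 0x80;
}

// Manually unrolled form (hypothetical name): consume 8 bytes per iteration
// through a 64-bit load, then handle the remaining tail byte by byte.
inline bool IsAsciiUnrolled(const uint8_t* data, int64_t len) {
  uint64_t acc = 0;
  int64_t i = 0;
  for (; i + 8 <= len; i += 8) {
    uint64_t word;
    std::memcpy(&word, data + i, sizeof(word));  // alignment-safe load
    acc |= word;
  }
  uint8_t tail = 0;
  for (; i < len; ++i) {
    tail |= data[i];
  }
  // Any byte >= 0x80 sets a high bit in its lane of the accumulator.
  return (acc & 0x8080808080808080ULL) == 0 && tail < 0x80;
}
```

Both functions return the same answer for any input; the difference is only in how much freedom the compiler has to choose its own vector width.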

cyb70289 commented Oct 11, 2021

NOTE: the results below are compared against the non-inlined benchmark (the first commit in this PR), not against the master branch.

$ archery benchmark diff --suite-filter="arrow-utf8-util-benchmark" HEAD HEAD^

clang-10, Intel Gold 5218
The improvement in ValidateLargeNonAscii does not look real; that benchmark should not be affected by this PR.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (8)
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
               benchmark       baseline      contender  change %                                                                                                                counters
       ValidateTinyAscii  1.421 GiB/sec  2.000 GiB/sec    40.758       {'run_name': 'ValidateTinyAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 106006292}
   ValidateLargeNonAscii  1.554 GiB/sec  2.032 GiB/sec    30.731       {'run_name': 'ValidateLargeNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 11668}
      ValidateSmallAscii 14.104 GiB/sec 17.500 GiB/sec    24.082       {'run_name': 'ValidateSmallAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 75859578}
ValidateSmallAlmostAscii  2.983 GiB/sec  3.402 GiB/sec    14.066 {'run_name': 'ValidateSmallAlmostAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 15147502}
ValidateLargeAlmostAscii  3.382 GiB/sec  3.711 GiB/sec     9.736    {'run_name': 'ValidateLargeAlmostAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 25408}
   ValidateSmallNonAscii  1.953 GiB/sec  1.978 GiB/sec     1.280    {'run_name': 'ValidateSmallNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 10763254}
      ValidateLargeAscii 38.357 GiB/sec 38.436 GiB/sec     0.207         {'run_name': 'ValidateLargeAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 288231}
    ValidateTinyNonAscii  1.243 GiB/sec  1.242 GiB/sec    -0.105     {'run_name': 'ValidateTinyNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 84939377}

gcc-9, Intel gold 5218
Again, the regression in ValidateLargeNonAscii does not look real.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (7)
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
               benchmark       baseline      contender  change %                                                                                                                counters
      ValidateSmallAscii 13.695 GiB/sec 13.943 GiB/sec     1.810       {'run_name': 'ValidateSmallAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 74956699}
ValidateSmallAlmostAscii  3.259 GiB/sec  3.303 GiB/sec     1.349 {'run_name': 'ValidateSmallAlmostAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 16556254}
    ValidateTinyNonAscii  1.186 GiB/sec  1.201 GiB/sec     1.270     {'run_name': 'ValidateTinyNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 81825116}
      ValidateLargeAscii 39.003 GiB/sec 39.047 GiB/sec     0.112         {'run_name': 'ValidateLargeAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 293159}
ValidateLargeAlmostAscii  3.489 GiB/sec  3.489 GiB/sec    -0.010    {'run_name': 'ValidateLargeAlmostAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 26201}
   ValidateSmallNonAscii  1.675 GiB/sec  1.674 GiB/sec    -0.061     {'run_name': 'ValidateSmallNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 9388982}
       ValidateTinyAscii  1.624 GiB/sec  1.616 GiB/sec    -0.523       {'run_name': 'ValidateTinyAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 122075013}

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Regressions: (1)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
            benchmark      baseline     contender  change %                                                                                                          counters
ValidateLargeNonAscii 1.747 GiB/sec 1.584 GiB/sec    -9.328 {'run_name': 'ValidateLargeNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 13117}

Arm Neoverse N1, clang-10
The Arm benchmark shows a big improvement in tiny and small ASCII validation, which is expected.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (8)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
               benchmark        baseline       contender  change %                                                                                                                counters
       ValidateTinyAscii   1.214 GiB/sec   1.862 GiB/sec    53.331        {'run_name': 'ValidateTinyAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 91250343}
      ValidateSmallAscii  12.344 GiB/sec  17.937 GiB/sec    45.313       {'run_name': 'ValidateSmallAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 67719531}
    ValidateTinyNonAscii 817.353 MiB/sec 820.267 MiB/sec     0.357     {'run_name': 'ValidateTinyNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 54553894}
ValidateSmallAlmostAscii   2.579 GiB/sec   2.583 GiB/sec     0.161 {'run_name': 'ValidateSmallAlmostAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 13097861}
      ValidateLargeAscii  43.055 GiB/sec  43.101 GiB/sec     0.105         {'run_name': 'ValidateLargeAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 323527}
ValidateLargeAlmostAscii   2.738 GiB/sec   2.737 GiB/sec    -0.039    {'run_name': 'ValidateLargeAlmostAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 20551}
   ValidateLargeNonAscii   1.207 GiB/sec   1.205 GiB/sec    -0.195        {'run_name': 'ValidateLargeNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 9066}
   ValidateSmallNonAscii   1.324 GiB/sec   1.318 GiB/sec    -0.468     {'run_name': 'ValidateSmallNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 7425228}

Arm Neoverse N1, gcc-9

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (8)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
               benchmark         baseline       contender  change %                                                                                                                counters
       ValidateTinyAscii    1.637 GiB/sec   2.149 GiB/sec    31.283       {'run_name': 'ValidateTinyAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 122917255}
      ValidateSmallAscii   15.231 GiB/sec  16.637 GiB/sec     9.236       {'run_name': 'ValidateSmallAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 83551083}
      ValidateLargeAscii   43.135 GiB/sec  43.175 GiB/sec     0.093         {'run_name': 'ValidateLargeAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 324160}
   ValidateSmallNonAscii    1.394 GiB/sec   1.395 GiB/sec     0.057     {'run_name': 'ValidateSmallNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 7819363}
ValidateLargeAlmostAscii    2.816 GiB/sec   2.818 GiB/sec     0.055    {'run_name': 'ValidateLargeAlmostAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 21173}
ValidateSmallAlmostAscii    2.733 GiB/sec   2.732 GiB/sec    -0.012 {'run_name': 'ValidateSmallAlmostAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 13878062}
   ValidateLargeNonAscii    1.410 GiB/sec   1.410 GiB/sec    -0.041       {'run_name': 'ValidateLargeNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 10588}
    ValidateTinyNonAscii 1001.808 MiB/sec 999.129 MiB/sec    -0.267     {'run_name': 'ValidateTinyNonAscii', 'repetitions': 0, 'repetition_index': 0, 'threads': 1, 'iterations': 66846809}

Comment on lines +27 to +28
// Do not benchmark inlined functions directly inside the benchmark loop
static ARROW_NOINLINE bool ValidateUTF8NoInline(const uint8_t* data, int64_t size) {

Without this noinline wrapper, the benchmarked function is expanded inline in the benchmark loop.
This causes strange behaviour: for example, an obviously irrelevant change may cause large variance in the benchmark result.
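
The wrapper pattern described above can be sketched in isolation as follows. This is a minimal illustration, not Arrow's code: `SKETCH_NOINLINE` is a hypothetical stand-in for `ARROW_NOINLINE`, `IsAsciiInline` stands in for the inlined function under test, and `CountValid` stands in for the benchmark loop.

```cpp
#include <cstdint>
#include <string>

// Hypothetical stand-in for ARROW_NOINLINE.
#if defined(_MSC_VER)
#define SKETCH_NOINLINE __declspec(noinline)
#else
#define SKETCH_NOINLINE __attribute__((noinline))
#endif

// Stand-in for the inlined function being measured.
inline bool IsAsciiInline(const uint8_t* data, int64_t size) {
  uint8_t orall = 0;
  for (int64_t i = 0; i < size; ++i) {
    orall |= data[i];
  }
  return (orall & 0x80) == 0;
}

// Non-inlined wrapper: the loop below calls this through a real call
// boundary, so the body is not expanded into the loop itself.
static SKETCH_NOINLINE bool IsAsciiNoInline(const uint8_t* data, int64_t size) {
  return IsAsciiInline(data, size);
}

// Stand-in for a benchmark loop; returns how many iterations validated.
int64_t CountValid(const std::string& s, int64_t iterations) {
  const auto* data = reinterpret_cast<const uint8_t*>(s.data());
  const auto size = static_cast<int64_t>(s.size());
  int64_t valid = 0;
  for (int64_t i = 0; i < iterations; ++i) {
    valid += IsAsciiNoInline(data, size) ? 1 : 0;
  }
  return valid;
}
```

A real benchmark would additionally need to keep the result observable (for instance with the benchmark framework's own facility for that) so the compiler cannot drop or hoist the call.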

pitrou commented Oct 11, 2021

@ursabot please benchmark

ursabot commented Oct 11, 2021

Benchmark runs are scheduled for baseline = 39875f1 and contender = a06c34f. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

pitrou commented Oct 11, 2021

There is no important regression, so this can be merged IMHO.

@cyb70289
Rebased to fix merge conflict.
Will merge when CI finishes.

@cyb70289 cyb70289 closed this in b01a30a Oct 12, 2021

ursabot commented Oct 12, 2021

Benchmark runs are scheduled for baseline = ca18e8a and contender = b01a30a. b01a30a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.0% ⬆️0.0%] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

@cyb70289 cyb70289 deleted the utf8-bench branch October 12, 2021 04:16