[C++] Investigate recent regressions in some utf8 kernel benchmarks #30040

Closed
asfimport opened this issue Oct 26, 2021 · 7 comments

Comments

asfimport commented Oct 26, 2021

See https://conbench.ursa.dev/benchmarks/6ccff6887e7c47148a09fe46f18c8688/

Some seemingly unrelated commits have caused the performance of a few string kernels to drop sharply. We should try to replicate this locally.

Reporter: David Li / @lidavidm

Note: This issue was originally created as ARROW-14481. Please see the migration documentation for further details.

Eduardo Ponce / @edponce:
Looking more carefully, ARROW-13879 does modify code used by unary string transform and predicate functions:

  1. Kernel registration for string UTF8 transforms was slightly modified here: two explicit statements became a for-loop plus a function call to a generator dispatcher.
  2. A similar change was made here for the unary string predicates.
  3. Also, the exec functor for string predicates was modified here to take the predicate as a template parameter instead of as a function parameter.

Now, #3 should not have a noticeable impact, but #1 and #2 could, because of the loop iteration and the extra function call.
Nevertheless, these registration code paths are not performance-critical and should only run once.
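
To make the reasoning above concrete, here is a toy Python model. It is not Arrow's C++ code: `registry`, `make_kernel` and `register_all` are hypothetical names invented for illustration. It sketches why a loop plus an extra function call at registration time should not affect per-batch throughput: the registration path runs once, while the kernel body runs for every batch.

```python
# Toy model only; none of these names exist in Arrow.
registry = {}
calls = {"register": 0, "exec": 0}


def make_kernel(predicate):
    """Wrap a per-value predicate into a per-batch kernel.

    Stands in for the extra "generator dispatcher" call made at registration.
    """
    def kernel(values):
        calls["exec"] += 1
        return [predicate(v) for v in values]
    return kernel


def register_all():
    # Loop-based registration: one iteration and one extra function call per
    # kernel, paid once at startup regardless of how much data comes later.
    for name, predicate in [("is_alnum", str.isalnum), ("is_upper", str.isupper)]:
        calls["register"] += 1
        registry[name] = make_kernel(predicate)


register_all()

# The hot path: the registered kernel is looked up and run once per batch.
batches = [["abc", "a b"], ["X1", "!"], ["q"]]
results = [registry["is_alnum"](batch) for batch in batches]
print(calls, results)
```

In this model `calls["register"]` stays at 2 no matter how many batches are processed, which is the sense in which the registration path is not critical.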

Antoine Pitrou / @pitrou:
I do not see any regression locally between git master and a random commit from October 21st.

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (26)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                           benchmark         baseline        contender  change %                                                                                                                                                                          counters
                           Utf8Lower  644.618 MiB/sec  814.359 MiB/sec    26.332                            {'family_index': 11, 'per_family_instance_index': 0, 'run_name': 'Utf8Lower', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 29}
                           Utf8Upper  647.102 MiB/sec  779.122 MiB/sec    20.402                            {'family_index': 12, 'per_family_instance_index': 0, 'run_name': 'Utf8Upper', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 29}
                 IsAlphaNumericAscii  506.334 MiB/sec  577.621 MiB/sec    14.079                   {'family_index': 2, 'per_family_instance_index': 0, 'run_name': 'IsAlphaNumericAscii', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 22}
                           MatchLike  808.991 MiB/sec  863.400 MiB/sec     6.725                             {'family_index': 7, 'per_family_instance_index': 0, 'run_name': 'MatchLike', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 38}
                        TrimManyUtf8  669.685 MiB/sec  714.090 MiB/sec     6.631                         {'family_index': 15, 'per_family_instance_index': 0, 'run_name': 'TrimManyUtf8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 29}
                BinaryJoinArrayArray    1.105 GiB/sec    1.131 GiB/sec     2.336               {'family_index': 17, 'per_family_instance_index': 0, 'run_name': 'BinaryJoinArrayArray', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 6817}
 BinaryJoinElementWiseArrayScalar/64    1.018 GiB/sec    1.038 GiB/sec     1.894 {'family_index': 18, 'per_family_instance_index': 2, 'run_name': 'BinaryJoinElementWiseArrayScalar/64', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 102}
   BinaryJoinElementWiseArrayArray/8  754.305 MiB/sec  768.563 MiB/sec     1.890   {'family_index': 19, 'per_family_instance_index': 1, 'run_name': 'BinaryJoinElementWiseArrayArray/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 564}
                     MatchLikePrefix    3.605 GiB/sec    3.665 GiB/sec     1.663                      {'family_index': 9, 'per_family_instance_index': 0, 'run_name': 'MatchLikePrefix', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 161}
                          AsciiUpper    7.936 GiB/sec    8.051 GiB/sec     1.441                           {'family_index': 1, 'per_family_instance_index': 0, 'run_name': 'AsciiUpper', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 359}
                     MatchLikeSuffix    3.637 GiB/sec    3.678 GiB/sec     1.132                     {'family_index': 10, 'per_family_instance_index': 0, 'run_name': 'MatchLikeSuffix', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 165}
                     TrimSingleAscii    1.233 GiB/sec    1.245 GiB/sec     0.970                       {'family_index': 5, 'per_family_instance_index': 0, 'run_name': 'TrimSingleAscii', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 57}
                          AsciiLower    7.933 GiB/sec    8.009 GiB/sec     0.957                           {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'AsciiLower', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 365}
                       TrimManyAscii  955.260 MiB/sec  962.251 MiB/sec     0.732                         {'family_index': 6, 'per_family_instance_index': 0, 'run_name': 'TrimManyAscii', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 42}
  BinaryJoinElementWiseArrayScalar/8  816.534 MiB/sec  822.348 MiB/sec     0.712  {'family_index': 18, 'per_family_instance_index': 1, 'run_name': 'BinaryJoinElementWiseArrayScalar/8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 612}
   BinaryJoinElementWiseArrayArray/2  591.531 MiB/sec  594.633 MiB/sec     0.525  {'family_index': 19, 'per_family_instance_index': 0, 'run_name': 'BinaryJoinElementWiseArrayArray/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 1767}
                      MatchSubstring  577.765 MiB/sec  579.194 MiB/sec     0.247                        {'family_index': 3, 'per_family_instance_index': 0, 'run_name': 'MatchSubstring', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 25}
  BinaryJoinElementWiseArrayScalar/2  747.787 MiB/sec  748.576 MiB/sec     0.106 {'family_index': 18, 'per_family_instance_index': 0, 'run_name': 'BinaryJoinElementWiseArrayScalar/2', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 2232}
               IsAlphaNumericUnicode 1005.992 MiB/sec 1005.034 MiB/sec    -0.095                {'family_index': 13, 'per_family_instance_index': 0, 'run_name': 'IsAlphaNumericUnicode', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 44}
               BinaryJoinArrayScalar    1.238 GiB/sec    1.236 GiB/sec    -0.195              {'family_index': 16, 'per_family_instance_index': 0, 'run_name': 'BinaryJoinArrayScalar', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 7191}
 BinaryJoinElementWiseArrayArray/128    1.225 GiB/sec    1.222 GiB/sec    -0.246  {'family_index': 19, 'per_family_instance_index': 3, 'run_name': 'BinaryJoinElementWiseArrayArray/128', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 58}
  BinaryJoinElementWiseArrayArray/64    1.014 GiB/sec    1.008 GiB/sec    -0.594   {'family_index': 19, 'per_family_instance_index': 2, 'run_name': 'BinaryJoinElementWiseArrayArray/64', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 94}
BinaryJoinElementWiseArrayScalar/128    1.244 GiB/sec    1.232 GiB/sec    -1.014 {'family_index': 18, 'per_family_instance_index': 3, 'run_name': 'BinaryJoinElementWiseArrayScalar/128', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 60}
                        SplitPattern  458.727 MiB/sec  451.400 MiB/sec    -1.597                          {'family_index': 4, 'per_family_instance_index': 0, 'run_name': 'SplitPattern', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 20}
                  MatchLikeSubstring  566.577 MiB/sec  548.685 MiB/sec    -3.158                    {'family_index': 8, 'per_family_instance_index': 0, 'run_name': 'MatchLikeSubstring', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 25}
                      TrimSingleUtf8  982.468 MiB/sec  942.122 MiB/sec    -4.107                       {'family_index': 14, 'per_family_instance_index': 0, 'run_name': 'TrimSingleUtf8', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 44}
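
For reference, the `change %` column appears to be the relative difference between contender and baseline throughput. A minimal sketch, assuming that definition (the `percent_change` helper is hypothetical and not part of the benchmark tooling), using two rows copied from the table:

```python
def percent_change(baseline, contender):
    """Relative change of contender vs. baseline, in percent.

    For throughput figures, a positive value means the contender is faster.
    """
    return (contender - baseline) / baseline * 100.0


# (baseline, contender) in MiB/sec, copied from the table above.
rows = {
    "Utf8Lower": (644.618, 814.359),
    "TrimSingleUtf8": (982.468, 942.122),
}

changes = {name: percent_change(b, c) for name, (b, c) in rows.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.3f}%")
```

Both results agree with the table's `change %` values to within rounding of the displayed throughputs.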

Jonathan Keane / @jonkeane:
Oddly, it looks like these have in fact reverted to their previous performance. Here's a recent run: https://conbench.ursa.dev/benchmarks/673597e52dd24e4e9c04ffcd8570ea99/ at 695 MiB/s (up from 611 MiB/s, and much closer to the ~710-720 MiB/s we were seeing previously).

Jonathan Keane / @jonkeane:
Though there is still quite a bit of variability in that plot. Is that kind of variability expected for this benchmark? (We don't yet have enough history after the release to have a good measure of it.)

Antoine Pitrou / @pitrou:
I don't think variability is expected for this particular benchmark but, as we've already discussed, there may be variability in any benchmark whose working set does not fit in the L2 cache.
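
One way to put a number on the variability discussed above is the coefficient of variation (sample standard deviation divided by the mean) over repeated runs. A minimal sketch: the throughput values below are made up for illustration and are not taken from Conbench, and `coefficient_of_variation` is a hypothetical helper.

```python
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation expressed as a percentage of the mean."""
    return statistics.stdev(samples) / statistics.mean(samples) * 100.0


# Hypothetical throughputs (MiB/s) from five repeated runs of one benchmark.
throughputs = [700.0, 710.0, 690.0, 720.0, 680.0]
cv = coefficient_of_variation(throughputs)
print(f"CV = {cv:.2f}%")
```

As a rough rule of thumb, a change that is not clearly larger than this figure is hard to tell apart from run-to-run noise.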

David Li / @lidavidm:
Maybe we can close this then? (Sorry, I've been unable to actually go look at this.)

Antoine Pitrou / @pitrou:
Closing as invalid for now.
