ARROW-6385: [C++] Use xxh3 instead of custom hashing code for non-tiny strings #5265

Closed
wants to merge 1 commit into from

Member

@pitrou pitrou commented Sep 3, 2019

This yields better performance in addition to relying on less custom code.

@pitrou
Member Author

pitrou commented Sep 3, 2019

Benchmarks on Ubuntu 18.04 with gcc 7.4 (x86-64):

  • before
HashIntegers           12776 ns        12769 ns       154535 bytes_per_second=11.6694G/s items_per_second=1.56623G/s
HashSmallStrings       81587 ns        81541 ns        25652 bytes_per_second=2.49555G/s items_per_second=245.275M/s
HashMediumStrings     483494 ns       483223 ns         4328 bytes_per_second=2.68286G/s items_per_second=41.3888M/s
HashLargeStrings      323605 ns       323421 ns         6577 bytes_per_second=5.92472G/s items_per_second=6.18389M/s

BuildStringDictionaryArray          1462602077 ns   1461951271 ns            1 bytes_per_second=233.616M/s

BuildStringDictionary                 7696631 ns      7686495 ns           91 bytes_per_second=39.2773M/s
UniqueString10bytes/4194304/1024     71341945 ns     71312810 ns           10 bytes_per_second=560.909M/s
UniqueString10bytes/4194304/10240   103873422 ns    103832297 ns            7 bytes_per_second=385.237M/s
UniqueString100bytes/4194304/1024   212966741 ns    212833819 ns            3 bytes_per_second=1.83535G/s
UniqueString100bytes/4194304/10240  274777819 ns    274610854 ns            3 bytes_per_second=1.42247G/s
  • after
HashIntegers           12412 ns        12410 ns       168589 bytes_per_second=12.0078G/s items_per_second=1.61167G/s
HashSmallStrings       59447 ns        59430 ns        35245 bytes_per_second=3.42401G/s items_per_second=336.528M/s
HashMediumStrings     171430 ns       171386 ns        12225 bytes_per_second=7.56435G/s items_per_second=116.696M/s
HashLargeStrings      133742 ns       133705 ns        15833 bytes_per_second=14.3314G/s items_per_second=14.9583M/s

BuildStringDictionaryArray          1525536403 ns   1524966882 ns            1 bytes_per_second=223.962M/s

BuildStringDictionary                 7817772 ns      7804228 ns           84 bytes_per_second=38.6848M/s
UniqueString10bytes/4194304/1024     73274423 ns     73247744 ns           10 bytes_per_second=546.092M/s
UniqueString10bytes/4194304/10240   105917896 ns    105879905 ns            7 bytes_per_second=377.787M/s
UniqueString100bytes/4194304/1024   185390781 ns    185284167 ns            4 bytes_per_second=2.10825G/s
UniqueString100bytes/4194304/10240  253262101 ns    253075255 ns            3 bytes_per_second=1.54351G/s
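
To read the gcc numbers above as speedups, one can divide each "before" time by the matching "after" time. A minimal sketch, assuming the first numeric column is wall time in nanoseconds (the usual Google Benchmark layout); `Speedup` and `PrintGccHashSpeedups` are hypothetical names used for illustration only, not part of Arrow:

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper (not part of Arrow): ratio of the "before" time to the
// "after" time, so a value above 1.0 means the change made things faster.
double Speedup(double before_ns, double after_ns) {
  assert(after_ns > 0.0);
  return before_ns / after_ns;
}

// Prints speedups for the gcc 7.4 hashing benchmarks quoted in this thread.
// The times are copied from the first numeric column of the logs above.
void PrintGccHashSpeedups() {
  struct Row {
    const char* name;
    double before_ns;
    double after_ns;
  };
  const Row rows[] = {
      {"HashIntegers", 12776, 12412},
      {"HashSmallStrings", 81587, 59447},
      {"HashMediumStrings", 483494, 171430},
      {"HashLargeStrings", 323605, 133742},
  };
  for (const Row& row : rows) {
    std::printf("%-18s %.2fx\n", row.name,
                Speedup(row.before_ns, row.after_ns));
  }
}
```

On these figures the string-hashing microbenchmarks come out roughly 1.4x (small), 2.8x (medium) and 2.4x (large) faster with gcc, while HashIntegers is essentially unchanged at about 1.03x.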

@pitrou
Member Author

pitrou commented Sep 3, 2019

Benchmarks on Ubuntu 18.04 with clang 7.0 (x86-64):

  • before
HashIntegers            7818 ns         7817 ns       268959 bytes_per_second=19.0637G/s items_per_second=2.55868G/s
HashSmallStrings      134226 ns       134189 ns        15627 bytes_per_second=1.51645G/s items_per_second=149.044M/s
HashMediumStrings     569914 ns       569758 ns         3680 bytes_per_second=2.27538G/s items_per_second=35.1026M/s
HashLargeStrings      315351 ns       315265 ns         6615 bytes_per_second=6.07801G/s items_per_second=6.34388M/s

BuildStringDictionaryArray          1565437184 ns   1564331162 ns            1 bytes_per_second=218.326M/s

BuildStringDictionary                 7675061 ns      7668470 ns           92 bytes_per_second=39.3696M/s
UniqueString10bytes/4194304/1024     68233039 ns     68205470 ns           10 bytes_per_second=586.463M/s
UniqueString10bytes/4194304/10240   101163657 ns    101119925 ns            7 bytes_per_second=395.57M/s
UniqueString100bytes/4194304/1024   199345875 ns    199220110 ns            4 bytes_per_second=1.96077G/s
UniqueString100bytes/4194304/10240  269394590 ns    269214721 ns            3 bytes_per_second=1.45098G/s
  • after
HashIntegers            7796 ns         7794 ns       269360 bytes_per_second=19.1192G/s items_per_second=2.56614G/s
HashSmallStrings       82722 ns        82701 ns        25383 bytes_per_second=2.46055G/s items_per_second=241.834M/s
HashMediumStrings     179893 ns       179847 ns        11677 bytes_per_second=7.20846G/s items_per_second=111.206M/s
HashLargeStrings      125845 ns       125812 ns        16942 bytes_per_second=15.2306G/s items_per_second=15.8968M/s

BuildStringDictionaryArray          1565018864 ns   1564329811 ns            1 bytes_per_second=218.327M/s

BuildStringDictionary                 7617240 ns      7603062 ns           92 bytes_per_second=39.7083M/s
UniqueString10bytes/4194304/1024     68524774 ns     68497319 ns           10 bytes_per_second=583.964M/s
UniqueString10bytes/4194304/10240   102208886 ns    102174316 ns            7 bytes_per_second=391.488M/s
UniqueString100bytes/4194304/1024   181762167 ns    181652423 ns            4 bytes_per_second=2.1504G/s
UniqueString100bytes/4194304/10240  250831357 ns    250640792 ns            3 bytes_per_second=1.55851G/s

@pitrou
Member Author

pitrou commented Sep 3, 2019

@ursabot benchmark --benchmark-filter=Hash

@ursabot

ursabot commented Sep 3, 2019

AMD64 Ubuntu 18.04 C++ Benchmark (#58316) builder has succeeded.

Revision: f3275a1

  =================  ===========  ===========  =========
  benchmark             baseline    contender     change
  =================  ===========  ===========  =========
  HashMediumStrings  6.25091e+09  6.36296e+09  0.017926
  HashIntegers       8.59461e+09  8.81482e+09  0.0256215
  HashSmallStrings   2.45542e+09  2.48164e+09  0.0106789
  HashLargeStrings   1.17633e+10  1.19911e+10  0.0193699
  =================  ===========  ===========  =========
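
The `change` column in the table above appears to be the relative difference, (contender - baseline) / baseline. That reading is inferred from the numbers, not stated by ursabot here. A small sketch to check it against the reported rows; `RelativeChange` and `MatchesReported` are hypothetical helper names:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: relative difference of the contender against the
// baseline. Assumes larger values are better, so positive means faster.
double RelativeChange(double baseline, double contender) {
  assert(baseline != 0.0);
  return (contender - baseline) / baseline;
}

// Returns true if the recomputed change agrees with the reported value.
// The tolerance absorbs the rounding of the printed six-digit figures.
bool MatchesReported(double baseline, double contender, double reported) {
  return std::fabs(RelativeChange(baseline, contender) - reported) < 1e-4;
}
```

For example, `MatchesReported(6.25091e9, 6.36296e9, 0.017926)` holds for the HashMediumStrings row, and the other three rows agree in the same way. Under this reading ursabot measured changes of roughly 1% to 2.6%.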

@pitrou
Member Author

pitrou commented Sep 3, 2019

Hmm... the almost identical numbers reported by ursabot seem a bit unlikely, since we're completely changing algorithms here. @fsaintjacques

@wesm
Member

wesm commented Sep 4, 2019

Looks like some MSVC warnings need to be suppressed

@pitrou
Member Author

pitrou commented Sep 4, 2019

> Looks like some MSVC warnings need to be suppressed

Done.

@codecov-io

Codecov Report

❗ No coverage uploaded for pull request base (master@96928d5).
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master    #5265   +/-   ##
=========================================
  Coverage          ?   89.22%           
=========================================
  Files             ?      635           
  Lines             ?    83654           
  Branches          ?        0           
=========================================
  Hits              ?    74640           
  Misses            ?     9014           
  Partials          ?        0

Impacted Files                Coverage Δ
cpp/src/plasma/client.cc      89.5% <ø> (ø)
cpp/src/arrow/util/hashing.h  99.43% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 96928d5...ce7f938.

Member

@wesm wesm left a comment

+1

@wesm wesm closed this in b829f53 Sep 5, 2019
@pitrou pitrou deleted the ARROW-6385-xxh3 branch September 5, 2019 17:07