One simple SIMD optimization #402

Merged
merged 3 commits into from May 14, 2023

@lemire lemire commented May 14, 2023

This PR provides SSE2 (x64) and NEON (arm) optimization for one function (has_tabs_newlines).
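For readers unfamiliar with the technique, here is a minimal sketch of what a SIMD check for tabs and newlines can look like. It is not the code merged in this PR: the function names, the `SKETCH_*` macros, and the choice to accumulate matches and test once at the end are assumptions made for illustration. It compares 16 bytes at a time against `'\t'`, `'\n'` and `'\r'`, and falls back to a scalar loop for the tail (or for the whole input on other platforms).

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical feature macros for this sketch (not the library's own).
#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <emmintrin.h>
#define SKETCH_SSE2 1
#elif defined(__aarch64__)
#include <arm_neon.h>
#define SKETCH_NEON 1
#endif

// Scalar reference: true if the input contains '\t', '\n' or '\r'.
inline bool has_tabs_or_newline_scalar(std::string_view s) noexcept {
  for (char c : s) {
    if (c == '\t' || c == '\n' || c == '\r') return true;
  }
  return false;
}

// SIMD sketch: process 16-byte blocks, OR all match masks together,
// check the accumulated mask once, then handle the tail with scalar code.
inline bool has_tabs_or_newline_simd(std::string_view s) noexcept {
  std::size_t i = 0;
#if defined(SKETCH_SSE2)
  const __m128i tab = _mm_set1_epi8('\t');
  const __m128i lf = _mm_set1_epi8('\n');
  const __m128i cr = _mm_set1_epi8('\r');
  __m128i found = _mm_setzero_si128();
  for (; i + 16 <= s.size(); i += 16) {
    const __m128i block =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(s.data() + i));
    const __m128i hit = _mm_or_si128(
        _mm_cmpeq_epi8(block, tab),
        _mm_or_si128(_mm_cmpeq_epi8(block, lf), _mm_cmpeq_epi8(block, cr)));
    found = _mm_or_si128(found, hit);
  }
  if (_mm_movemask_epi8(found) != 0) return true;
#elif defined(SKETCH_NEON)
  const uint8x16_t tab = vdupq_n_u8('\t');
  const uint8x16_t lf = vdupq_n_u8('\n');
  const uint8x16_t cr = vdupq_n_u8('\r');
  uint8x16_t found = vdupq_n_u8(0);
  for (; i + 16 <= s.size(); i += 16) {
    const uint8x16_t block =
        vld1q_u8(reinterpret_cast<const std::uint8_t*>(s.data() + i));
    const uint8x16_t hit = vorrq_u8(
        vceqq_u8(block, tab), vorrq_u8(vceqq_u8(block, lf), vceqq_u8(block, cr)));
    found = vorrq_u8(found, hit);
  }
  if (vmaxvq_u8(found) != 0) return true;
#endif
  // Remaining bytes (fewer than 16), or the whole input without SIMD support.
  return has_tabs_or_newline_scalar(s.substr(i));
}
```

Accumulating the matches keeps the loop body free of branches, which suits short inputs such as URLs; an early exit per block would be an equally valid choice for long inputs.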

It saves about one instruction per input byte:

benchdata results, main branch, GCC11 x64 (Skylake)

BasicBench_AdaURL_href              35821587 ns     35778820 ns           18 GHz=3.18736 cycle/byte=13.1335 cycles/url=1.14076k instructions/byte=36.9871 instructions/cycle=2.81625 instructions/ns=8.97639 instructions/url=3.21267k ns/url=357.902 speed=242.828M/s time/byte=4.11814ns time/url=357.699ns url/s=2.79565M/s
BasicBench_AdaURL_aggregator_href   23950378 ns     23922856 ns           29 GHz=3.18859 cycle/byte=8.77344 cycles/url=762.054 instructions/byte=26.0415 instructions/cycle=2.96821 instructions/ns=9.46443 instructions/url=2.26194k ns/url=238.994 speed=363.171M/s time/byte=2.75352ns time/url=239.169ns url/s=4.18115M/s

With this PR:

BasicBench_AdaURL_href              35823401 ns     35784656 ns           20 GHz=3.18873 cycle/byte=13.1107 cycles/url=1.13878k instructions/byte=35.9141 instructions/cycle=2.73931 instructions/ns=8.73491 instructions/url=3.11947k ns/url=357.127 speed=242.788M/s time/byte=4.11882ns time/url=357.757ns url/s=2.79519M/s
BasicBench_AdaURL_aggregator_href   23093315 ns     23065276 ns           30 GHz=3.1883 cycle/byte=8.41502 cycles/url=730.922 instructions/byte=24.8959 instructions/cycle=2.9585 instructions/ns=9.43259 instructions/url=2.16244k ns/url=229.252 speed=376.674M/s time/byte=2.65481ns time/url=230.595ns url/s=4.33661M/s

ARM (Apple M1, LLVM 14):

Main...

BasicBench_AdaURL_href              24703119 ns     24703143 ns           28 GHz=3.50432 cycle/byte=9.94401 cycles/url=863.729 instructions/byte=41.3928 instructions/cycle=4.16259 instructions/ns=14.587 instructions/url=3.59535k ns/url=246.475 speed=351.7M/s time/byte=2.84333ns time/url=246.97ns url/s=4.04908M/s
BasicBench_AdaURL_aggregator_href   17946808 ns     17938231 ns           39 GHz=3.44577 cycle/byte=7.01056 cycles/url=608.932 instructions/byte=29.8892 instructions/cycle=4.26345 instructions/ns=14.6909 instructions/url=2.59615k ns/url=176.719 speed=484.334M/s time/byte=2.06469ns time/url=179.337ns url/s=5.57608M/s

This PR...

BasicBench_AdaURL_href              24482828 ns     24482966 ns           29 GHz=3.40634 cycle/byte=9.41797 cycles/url=818.037 instructions/byte=39.5885 instructions/cycle=4.2035 instructions/ns=14.3186 instructions/url=3.43862k ns/url=240.152 speed=354.863M/s time/byte=2.81799ns time/url=244.768ns url/s=4.08549M/s
BasicBench_AdaURL_aggregator_href   16123292 ns     16123070 ns           43 GHz=3.40847 cycle/byte=6.29478 cycles/url=546.76 instructions/byte=28.1412 instructions/cycle=4.47056 instructions/ns=15.2378 instructions/url=2.44432k ns/url=160.412 speed=538.861M/s time/byte=1.85577ns time/url=161.19ns url/s=6.20384M/s

So the gains range from 0% to ~10% depending on whether you use ada::url (no change) or ada::url_aggregator (+10% speed). There is always a reduction in the number of instructions, but with BasicBench_AdaURL_href on x64 and ARM, the reduction in instructions translates into a reduction in the number of instructions retired per cycle, so there is no speed gain.
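The relative changes can be recomputed directly from the benchmark lines above. The helper below is only an illustrative sketch (the name `relative_change_percent` is made up for this example); the inputs in the usage note are the `speed` and `instructions/url` values for BasicBench_AdaURL_aggregator_href reported above.

```cpp
// Relative change from `before` to `after`, in percent.
// Positive means `after` is larger than `before`.
inline double relative_change_percent(double before, double after) {
  return (after - before) / before * 100.0;
}
```

For example, `relative_change_percent(363.171, 376.674)` gives the x64 speed change (about +3.7%), `relative_change_percent(484.334, 538.861)` gives the ARM speed change (about +11.3%), and `relative_change_percent(2261.94, 2162.44)` gives the x64 change in instructions per URL (about -4.4%).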

Note that it is possible to do better on x64 than SSE2. Unfortunately, it requires runtime dispatching because not all x64 processors support more than SSE2 (but SSE2 should always be available per the platform definition).
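To make the runtime-dispatching remark concrete, here is a hedged sketch of one common way to do it with GCC or Clang on x64. It is not part of this PR. The choice of AVX2 as the wider instruction set, and every name in the block (`scan_fn`, `scan_scalar`, `scan_avx2`, `pick_scan`, `has_tabs_or_newline_dispatch`, `SKETCH_X64_DISPATCH`), are assumptions for illustration. The idea is to compile the wide kernel with a per-function target attribute, query the CPU once, and cache a function pointer.

```cpp
#include <cstddef>
#include <string_view>

// Assumption: GCC or Clang on x86-64, which provide the target attribute
// and __builtin_cpu_supports. Other toolchains use the scalar path only.
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#include <immintrin.h>
#define SKETCH_X64_DISPATCH 1
#endif

using scan_fn = bool (*)(std::string_view);

// Portable fallback: true if the input contains '\t', '\n' or '\r'.
inline bool scan_scalar(std::string_view s) {
  for (char c : s) {
    if (c == '\t' || c == '\n' || c == '\r') return true;
  }
  return false;
}

#if defined(SKETCH_X64_DISPATCH)
// Compiled with AVX2 enabled for this function only, so the rest of the
// program can still be built for the SSE2 baseline.
__attribute__((target("avx2")))
inline bool scan_avx2(std::string_view s) {
  const __m256i tab = _mm256_set1_epi8('\t');
  const __m256i lf = _mm256_set1_epi8('\n');
  const __m256i cr = _mm256_set1_epi8('\r');
  std::size_t i = 0;
  for (; i + 32 <= s.size(); i += 32) {
    const __m256i block =
        _mm256_loadu_si256(reinterpret_cast<const __m256i*>(s.data() + i));
    const __m256i hit = _mm256_or_si256(
        _mm256_cmpeq_epi8(block, tab),
        _mm256_or_si256(_mm256_cmpeq_epi8(block, lf),
                        _mm256_cmpeq_epi8(block, cr)));
    if (_mm256_movemask_epi8(hit) != 0) return true;
  }
  return scan_scalar(s.substr(i));  // tail of fewer than 32 bytes
}
#endif

// Query the CPU and return the best available kernel.
inline scan_fn pick_scan() {
#if defined(SKETCH_X64_DISPATCH)
  if (__builtin_cpu_supports("avx2")) return scan_avx2;
#endif
  return scan_scalar;
}

// Public entry point: the kernel is selected once, on first use.
inline bool has_tabs_or_newline_dispatch(std::string_view s) {
  static const scan_fn impl = pick_scan();
  return impl(s);
}
```

The indirect call through `impl` is the cost of this approach: it prevents inlining, which matters for a function that is called on short strings.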

Focusing on BasicBench_AdaURL_aggregator_href, I have the following bar charts...

x64
main ▏ 2262 instructions/url █████████████████████████
  pr ▏ 2162 instructions/url ███████████████████████▉
arm
main ▏ 2596 instructions/url █████████████████████████
  pr ▏ 2444 instructions/url ███████████████████████▌


@anonrig anonrig left a comment

There is a linting error, but other than that thanks for this awesome contribution @lemire.

include/ada/common_defs.h (resolved)
src/unicode.cpp (resolved)

ada_really_inline constexpr bool has_tabs_or_newline(
#if ADA_NEON
ada_really_inline bool has_tabs_or_newline(

What do you think about having unicode_simd.cpp file name convention, and moving all simd declarations there?

@lemire (Member, Author) replied:

Same answer as above. I recommend not overengineering at this time... until we know more. It is possible that we might end up using SIMD just for this one function, and creating a lot of extra code could add unnecessary complexity.

However, if we do add more SIMD support, we will have plenty of time then to add more code...

@anonrig anonrig merged commit 90ed1dd into main May 14, 2023
25 of 26 checks passed
@anonrig anonrig deleted the opti_simd_has_tabs_newlines branch May 14, 2023 17:58