One simple SIMD optimization #402

lemire · 2023-05-14T17:18:13Z

This PR provides SSE2 (x64) and NEON (ARM) optimizations for one function (has_tabs_or_newline).
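For readers who want a picture of what such an optimization can look like, here is a minimal sketch. It is not the code from this PR: the function names ending in `_scalar` and `_sketch` and the `SKETCH_*` macros are hypothetical, and it assumes the function reports whether the input contains a tab, line feed or carriage return. It compares 16 bytes at a time against the three characters, accumulates the matches, and checks the accumulator once at the end. The remaining bytes go through a scalar loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#define SKETCH_NEON 1
#elif defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_SSE2 1
#endif

// Scalar reference (hypothetical name): true if s contains '\t', '\n' or '\r'.
inline bool has_tabs_or_newline_scalar(std::string_view s) noexcept {
  for (char c : s) {
    if (c == '\t' || c == '\n' || c == '\r') return true;
  }
  return false;
}

// SIMD sketch (hypothetical name): 16 bytes per iteration, scalar tail.
inline bool has_tabs_or_newline_sketch(std::string_view s) noexcept {
  std::size_t i = 0;
#if defined(SKETCH_NEON)
  const uint8x16_t tab = vdupq_n_u8('\t');
  const uint8x16_t lf = vdupq_n_u8('\n');
  const uint8x16_t cr = vdupq_n_u8('\r');
  uint8x16_t found = vdupq_n_u8(0);
  for (; i + 16 <= s.size(); i += 16) {
    const uint8x16_t chunk =
        vld1q_u8(reinterpret_cast<const std::uint8_t*>(s.data() + i));
    // Each comparison yields 0xFF in matching lanes; OR them together.
    found = vorrq_u8(found, vceqq_u8(chunk, tab));
    found = vorrq_u8(found, vceqq_u8(chunk, lf));
    found = vorrq_u8(found, vceqq_u8(chunk, cr));
  }
  // Any nonzero lane means at least one match was seen.
  if (vmaxvq_u8(found) != 0) return true;
#elif defined(SKETCH_SSE2)
  const __m128i tab = _mm_set1_epi8('\t');
  const __m128i lf = _mm_set1_epi8('\n');
  const __m128i cr = _mm_set1_epi8('\r');
  __m128i found = _mm_setzero_si128();
  for (; i + 16 <= s.size(); i += 16) {
    const __m128i chunk =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(s.data() + i));
    // Each comparison yields 0xFF in matching lanes; OR them together.
    found = _mm_or_si128(found, _mm_cmpeq_epi8(chunk, tab));
    found = _mm_or_si128(found, _mm_cmpeq_epi8(chunk, lf));
    found = _mm_or_si128(found, _mm_cmpeq_epi8(chunk, cr));
  }
  // The movemask is nonzero if any lane matched.
  if (_mm_movemask_epi8(found) != 0) return true;
#endif
  // Fewer than 16 bytes remain (or no SIMD is available): finish in scalar.
  return has_tabs_or_newline_scalar(s.substr(i));
}
```

Deferring the check to the end of the loop keeps the per-chunk work to a load, three compares and three ORs. The trade-off is that the sketch does not exit early on long inputs.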

It saves about one instruction per input byte:

benchdata results, main branch, GCC11 x64 (Skylake)

BasicBench_AdaURL_href              35821587 ns     35778820 ns           18 GHz=3.18736 cycle/byte=13.1335 cycles/url=1.14076k instructions/byte=36.9871 instructions/cycle=2.81625 instructions/ns=8.97639 instructions/url=3.21267k ns/url=357.902 speed=242.828M/s time/byte=4.11814ns time/url=357.699ns url/s=2.79565M/s
BasicBench_AdaURL_aggregator_href   23950378 ns     23922856 ns           29 GHz=3.18859 cycle/byte=8.77344 cycles/url=762.054 instructions/byte=26.0415 instructions/cycle=2.96821 instructions/ns=9.46443 instructions/url=2.26194k ns/url=238.994 speed=363.171M/s time/byte=2.75352ns time/url=239.169ns url/s=4.18115M/s

With this PR:

BasicBench_AdaURL_href              35823401 ns     35784656 ns           20 GHz=3.18873 cycle/byte=13.1107 cycles/url=1.13878k instructions/byte=35.9141 instructions/cycle=2.73931 instructions/ns=8.73491 instructions/url=3.11947k ns/url=357.127 speed=242.788M/s time/byte=4.11882ns time/url=357.757ns url/s=2.79519M/s
BasicBench_AdaURL_aggregator_href   23093315 ns     23065276 ns           30 GHz=3.1883 cycle/byte=8.41502 cycles/url=730.922 instructions/byte=24.8959 instructions/cycle=2.9585 instructions/ns=9.43259 instructions/url=2.16244k ns/url=229.252 speed=376.674M/s time/byte=2.65481ns time/url=230.595ns url/s=4.33661M/s

ARM (Apple M1, LLVM 14):

Main...

BasicBench_AdaURL_href              24703119 ns     24703143 ns           28 GHz=3.50432 cycle/byte=9.94401 cycles/url=863.729 instructions/byte=41.3928 instructions/cycle=4.16259 instructions/ns=14.587 instructions/url=3.59535k ns/url=246.475 speed=351.7M/s time/byte=2.84333ns time/url=246.97ns url/s=4.04908M/s
BasicBench_AdaURL_aggregator_href   17946808 ns     17938231 ns           39 GHz=3.44577 cycle/byte=7.01056 cycles/url=608.932 instructions/byte=29.8892 instructions/cycle=4.26345 instructions/ns=14.6909 instructions/url=2.59615k ns/url=176.719 speed=484.334M/s time/byte=2.06469ns time/url=179.337ns url/s=5.57608M/s

This PR...

BasicBench_AdaURL_href              24482828 ns     24482966 ns           29 GHz=3.40634 cycle/byte=9.41797 cycles/url=818.037 instructions/byte=39.5885 instructions/cycle=4.2035 instructions/ns=14.3186 instructions/url=3.43862k ns/url=240.152 speed=354.863M/s time/byte=2.81799ns time/url=244.768ns url/s=4.08549M/s
BasicBench_AdaURL_aggregator_href   16123292 ns     16123070 ns           43 GHz=3.40847 cycle/byte=6.29478 cycles/url=546.76 instructions/byte=28.1412 instructions/cycle=4.47056 instructions/ns=15.2378 instructions/url=2.44432k ns/url=160.412 speed=538.861M/s time/byte=1.85577ns time/url=161.19ns url/s=6.20384M/s

So the gains range from 0% to ~10% depending on whether you use ada::url (no change) or ada::url_aggregator (about +4% on x64 and +11% on ARM in the runs above). There is always a reduction in the number of instructions, but with BasicBench_AdaURL_href on x64 and ARM there is no meaningful speed gain; on x64, for instance, the reduction in instructions translates into a reduction in the number of instructions retired per cycle.
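The arithmetic behind these statements can be checked directly from the benchmark output. The sketch below uses hypothetical helper names (`cycles_per_byte`, `percent_change`) and only numbers copied from the runs quoted in this description. It relies on the identity cycles/byte = (instructions/byte) / (instructions/cycle).

```cpp
// Cycles per byte implied by an instruction count and an IPC figure.
constexpr double cycles_per_byte(double instructions_per_byte,
                                 double instructions_per_cycle) {
  return instructions_per_byte / instructions_per_cycle;
}

// Relative change from `before` to `after`, in percent.
constexpr double percent_change(double before, double after) {
  return (after - before) / before * 100.0;
}

// x64, BasicBench_AdaURL_href: fewer instructions, but lower IPC as well.
constexpr double href_main_cpb = cycles_per_byte(36.9871, 2.81625);  // ~13.13
constexpr double href_pr_cpb = cycles_per_byte(35.9141, 2.73931);    // ~13.11
constexpr double href_insn_change = percent_change(36.9871, 35.9141);  // ~-2.9
constexpr double href_ipc_change = percent_change(2.81625, 2.73931);   // ~-2.7

// BasicBench_AdaURL_aggregator_href: change in reported speed (M/s).
constexpr double agg_x64_gain = percent_change(363.171, 376.674);  // ~3.7
constexpr double agg_arm_gain = percent_change(484.334, 538.861);  // ~11.3
```

For ada::url on x64, the roughly 3% drop in instructions is offset by a roughly 3% drop in instructions per cycle, so cycles per byte barely move.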

Note that it is possible to do better on x64 than SSE2. Unfortunately, that requires runtime dispatching because not all x64 processors support more than SSE2 (SSE2 should always be available per the platform definition).
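To illustrate what runtime dispatching would involve, here is a hypothetical sketch; it is not part of this PR. It assumes GCC or Clang on x86-64, where the `__builtin_cpu_supports` builtin and the `target` function attribute are available, and it falls back to a plain loop everywhere else. All function names are made up for the example.

```cpp
#include <cstddef>
#include <string_view>

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#include <immintrin.h>
#define SKETCH_CAN_DISPATCH 1
#endif

using has_tabs_fn = bool (*)(std::string_view);

// Baseline (hypothetical name): works on every processor.
inline bool has_tabs_or_newline_baseline(std::string_view s) noexcept {
  for (char c : s) {
    if (c == '\t' || c == '\n' || c == '\r') return true;
  }
  return false;
}

#if defined(SKETCH_CAN_DISPATCH)
// AVX2 variant (hypothetical name): 32 bytes per iteration. The target
// attribute lets this one function use AVX2 without compiling the whole
// file with -mavx2.
__attribute__((target("avx2"))) inline bool has_tabs_or_newline_avx2(
    std::string_view s) noexcept {
  const __m256i tab = _mm256_set1_epi8('\t');
  const __m256i lf = _mm256_set1_epi8('\n');
  const __m256i cr = _mm256_set1_epi8('\r');
  __m256i found = _mm256_setzero_si256();
  std::size_t i = 0;
  for (; i + 32 <= s.size(); i += 32) {
    const __m256i chunk =
        _mm256_loadu_si256(reinterpret_cast<const __m256i*>(s.data() + i));
    found = _mm256_or_si256(found, _mm256_cmpeq_epi8(chunk, tab));
    found = _mm256_or_si256(found, _mm256_cmpeq_epi8(chunk, lf));
    found = _mm256_or_si256(found, _mm256_cmpeq_epi8(chunk, cr));
  }
  if (_mm256_movemask_epi8(found) != 0) return true;
  return has_tabs_or_newline_baseline(s.substr(i));
}
#endif

// Choose an implementation based on what the running CPU supports.
inline has_tabs_fn pick_implementation() noexcept {
#if defined(SKETCH_CAN_DISPATCH)
  if (__builtin_cpu_supports("avx2")) return has_tabs_or_newline_avx2;
#endif
  return has_tabs_or_newline_baseline;
}

// Public entry point: the choice is made once, on first use.
inline bool has_tabs_or_newline_dispatch(std::string_view s) noexcept {
  static const has_tabs_fn impl = pick_implementation();
  return impl(s);
}
```

The cost is an indirect call on every invocation plus an extra code path to build and test, which is the complexity the note above refers to.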

Focusing on BasicBench_AdaURL_aggregator_href, I have the following bar charts...

x64
main ▏ 2262 instructions/url █████████████████████████
  pr ▏ 2162 instructions/url ███████████████████████▉
arm
main ▏ 2596 instructions/url █████████████████████████
  pr ▏ 2444 instructions/url ███████████████████████▌

anonrig

There is a linting error, but other than that thanks for this awesome contribution @lemire.

include/ada/common_defs.h

src/unicode.cpp

anonrig · 2023-05-14T17:32:57Z

src/unicode.cpp

-
-ada_really_inline constexpr bool has_tabs_or_newline(
+#if ADA_NEON
+ada_really_inline bool has_tabs_or_newline(


What do you think about adopting a unicode_simd.cpp file naming convention and moving all SIMD declarations there?

lemire · 2023-05-14

Same answer as above. I recommend not overengineering at this time... until we know more. It is possible that we might end up using SIMD just for this one function, and creating a lot of extra code could add unnecessary complexity.

However, if we do add more SIMD support, we will have plenty of time then to add more code...

lemire and others added 2 commits May 14, 2023 12:41

Adding some SIMD.

848b1f6

SSE2 version.

680df9f

anonrig approved these changes May 14, 2023


Reformat.

9c78458

anonrig approved these changes May 14, 2023


anonrig merged commit 90ed1dd into main May 14, 2023
25 of 26 checks passed

anonrig deleted the opti_simd_has_tabs_newlines branch May 14, 2023 17:58
