This PR provides SSE2 (x64) and NEON (ARM) optimizations for one function (has_tabs_newlines).
It saves about one instruction per input byte:
Benchmark data and results for this PR, GCC 11 on x64 (Skylake):
ARM (Apple M1, LLVM 14):
Main...
This PR...
So the gains range from 0% to about 10%, depending on whether you use ada::url (no change) or ada::url_aggregator (about 10% faster). The number of instructions always goes down, but with BasicBench_AdaURL_href on both x64 and ARM, the reduction in instruction count comes with a reduction in the number of instructions retired per cycle, so there is no speed gain.
Note that it is possible to do better on x64 than SSE2. Unfortunately, that requires runtime dispatching because not all x64 processors support more than SSE2 (whereas SSE2 should always be available per the platform definition).
Focusing on BasicBench_AdaURL_aggregator_href, I have the following bar charts...