Add AArch64 ARM Neon code (complements #1823) #1881
Merged
Pull request #1823 added AArch32 ARM Neon versions of the Blend() and Delta() functions, which are used by ZoneMinder's motion detection.
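As a rough guide to what these routines compute, here is a minimal scalar sketch for 8-bit grayscale buffers. The function names `delta8_gray_ref` and `blend8_ref`, and the shift-based blend formula, are hypothetical illustrations, not ZoneMinder's actual implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference: per-pixel absolute difference between
 * two 8-bit grayscale frames (the kind of result motion detection thresholds). */
void delta8_gray_ref(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int d = (int)a[i] - (int)b[i];
        out[i] = (uint8_t)(d < 0 ? -d : d);
    }
}

/* Hypothetical scalar reference: move `ref` toward `cur` by 1/2^shift of the
 * difference, i.e. a cheap running average of the reference image.
 * Division truncates toward zero, so the result always stays between
 * ref[i] and cur[i] and cannot overflow. Assumes shift is in 0..7. */
void blend8_ref(uint8_t *ref, const uint8_t *cur, size_t n, unsigned shift)
{
    for (size_t i = 0; i < n; i++) {
        int diff = (int)cur[i] - (int)ref[i];
        ref[i] = (uint8_t)(ref[i] + diff / (1 << shift));
    }
}
```

Both loops touch each pixel independently, which is what makes them good candidates for 16-lane Neon vectors.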
This pull request complements pull request #1823 by adding AArch64 ARM Neon versions of the same functions, so both AArch32 and AArch64 are now covered.
This pull request also includes a minor change to the AArch32 Neon functions: the data prefetches are moved to after the loads, to make better use of memory bus bandwidth.
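The reordering can be pictured with a C sketch that uses the GCC/Clang `__builtin_prefetch` builtin, which ARM compilers typically lower to `pld` (AArch32) or `prfm` (AArch64). Each iteration first issues the loads it needs and only then prefetches data for later iterations. The function name, the 16-byte chunk, the 256-byte prefetch distance and the blend formula are all assumptions for illustration; the pull request's own Neon code is not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 16            /* bytes processed per iteration (assumption) */
#define PREFETCH_AHEAD 256  /* prefetch distance in bytes (assumption) */

/* Hypothetical blend loop showing the load-then-prefetch order. */
void blend8_prefetch_sketch(uint8_t *ref, const uint8_t *cur, size_t n, unsigned shift)
{
    size_t i = 0;
    for (; i + CHUNK <= n; i += CHUNK) {
        uint8_t vr[CHUNK], vc[CHUNK];
        /* 1. Loads this iteration depends on come first. */
        memcpy(vr, ref + i, CHUNK);
        memcpy(vc, cur + i, CHUNK);
        /* 2. Then prefetch data for later iterations (only inside the buffer). */
        if (i + PREFETCH_AHEAD < n) {
            __builtin_prefetch(ref + i + PREFETCH_AHEAD, 1); /* will be written */
            __builtin_prefetch(cur + i + PREFETCH_AHEAD, 0); /* read only */
        }
        /* 3. Compute: move ref toward cur by 1/2^shift of the difference. */
        for (size_t j = 0; j < CHUNK; j++)
            ref[i + j] = (uint8_t)(vr[j] + ((int)vc[j] - (int)vr[j]) / (1 << shift));
    }
    /* Scalar tail for lengths that are not a multiple of CHUNK. */
    for (; i < n; i++)
        ref[i] = (uint8_t)(ref[i] + ((int)cur[i] - (int)ref[i]) / (1 << shift));
}
```

Whether this ordering helps depends on the core and its memory system, which is consistent with the differing results between the two test machines.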
In AArch64 mode, Neon is a mandatory feature of ARMv8-A CPUs, so no compiler flags and no runtime detection are needed. Neon is assumed to be always available.
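Because Neon is part of the AArch64 baseline, the code path can be chosen purely at compile time. The sketch below assumes 8-bit grayscale buffers and uses a hypothetical function name `delta8_gray`; it calls the standard `arm_neon.h` intrinsics `vld1q_u8`, `vabdq_u8` and `vst1q_u8` when `__aarch64__` is defined and plain C otherwise. AArch32 handling (a compiler flag such as GCC's `-mfpu=neon` plus a runtime check) is deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>  /* always usable on AArch64: no extra flags needed */
#endif

/* Per-pixel absolute difference of two 8-bit buffers (hypothetical name). */
void delta8_gray(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
{
    size_t i = 0;
#if defined(__aarch64__)
    /* Neon path selected at compile time, no runtime CPU detection. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(out + i, vabdq_u8(va, vb));  /* |a - b| for 16 pixels */
    }
#endif
    /* Remainder on AArch64, or the whole buffer on other targets. */
    for (; i < n; i++)
        out[i] = (uint8_t)(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
}
```

On an AArch64 build the vector loop handles 16 pixels per iteration and the scalar loop only processes the last `n % 16` bytes.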
Performance is pretty much identical to the AArch32 versions, but a comparison is provided:
Odroid C2 with an ARM Cortex-A53 processor @ 1.5 GHz:
Scaleway ARM64-2GB instance with 2 cores of Cavium ThunderX processor:
It seems the ThunderX loves Neon, or isn't as memory-bound as the Odroid C2.
The reduction in CPU usage should be between 20% and 50%. In the future I may work on an ARM Neon version of AlarmedPixels, which is currently the biggest CPU consumer in zma.