Add AArch64 ARM Neon code (complements #1823) #1881

Merged (4 commits) on May 14, 2017

@mastertheknife commented May 13, 2017

Pull request #1823 added AArch32 ARM Neon versions of the Blend() and Delta() functions, which are used by ZoneMinder's motion detection.
This pull request complements #1823 by adding AArch64 ARM Neon versions of the same functions, so both AArch32 and AArch64 versions are now available.
It also includes a minor change to the AArch32 Neon functions: the data prefetches are moved to after the loads, to make better use of memory bus bandwidth.
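For readers unfamiliar with these functions, here is a minimal scalar sketch of the kind of per-pixel work they do. It is an illustration only, not ZoneMinder's code: the names `delta8_scalar` and `fastblend_scalar` are hypothetical, and it assumes that the delta is an absolute difference of two 8-bit buffers and that the blend moves each reference byte toward the current frame by a power-of-two fraction.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference: per-pixel absolute difference of two
// 8-bit grayscale buffers (assumed semantics of the 8-bit delta).
void delta8_scalar(const std::uint8_t* a, const std::uint8_t* b,
                   std::uint8_t* out, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = static_cast<std::uint8_t>(a[i] > b[i] ? a[i] - b[i]
                                                       : b[i] - a[i]);
    }
}

// Hypothetical shift-based blend: moves each byte of `ref` toward `cur`
// by 1/2^shift of their difference. Using a shift instead of a division
// is an assumption made to keep the per-byte work cheap.
void fastblend_scalar(const std::uint8_t* ref, const std::uint8_t* cur,
                      std::uint8_t* out, std::size_t count, unsigned shift) {
    for (std::size_t i = 0; i < count; ++i) {
        if (cur[i] >= ref[i]) {
            out[i] = static_cast<std::uint8_t>(ref[i] + ((cur[i] - ref[i]) >> shift));
        } else {
            out[i] = static_cast<std::uint8_t>(ref[i] - ((ref[i] - cur[i]) >> shift));
        }
    }
}
```

Both loops touch every byte independently, which is why they map well onto 128-bit Neon registers that process 16 bytes per instruction.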

In AArch64 mode, Neon is a mandatory feature of ARMv8-A CPUs, so neither compiler flags nor runtime detection is needed. Neon is assumed to be always available.
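Because Neon can be assumed on AArch64, the choice of code path can be made at compile time. The sketch below shows the idea using the `__aarch64__` predefined macro and intrinsics from `<arm_neon.h>`. It is not the code in this pull request: the function name `delta8_auto` is hypothetical, and intrinsics are used here purely for illustration, whatever form the real Neon functions take.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__aarch64__)
#include <arm_neon.h>  // Neon is always present on AArch64, no extra flags
#endif

// Hypothetical 8-bit delta that picks its path at compile time.
void delta8_auto(const std::uint8_t* a, const std::uint8_t* b,
                 std::uint8_t* out, std::size_t count) {
    std::size_t i = 0;
#if defined(__aarch64__)
    // Process 16 pixels per iteration with 128-bit Neon registers.
    for (; i + 16 <= count; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(out + i, vabdq_u8(va, vb));  // per-byte absolute difference
    }
#endif
    // Scalar path: handles the tail on AArch64 and everything elsewhere.
    for (; i < count; ++i) {
        out[i] = static_cast<std::uint8_t>(a[i] > b[i] ? a[i] - b[i]
                                                       : b[i] - a[i]);
    }
}
```

On AArch32, by contrast, Neon is optional, so a build would typically need a compiler flag and a runtime check before taking the vector path.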
Performance is essentially identical to the AArch32 versions. A comparison follows:

Odroid C2 with ARM Cortex A53 processor @ 1.5 GHz:

| Function | Std 32bit (-O2) | Std 64bit (-O2) | Neon (AArch32) | Neon (AArch64) |
| --- | --- | --- | --- | --- |
| 8bit Delta | 215 MPixels/sec | 258 MPixels/sec | 968 MPixels/sec | 968 MPixels/sec |
| 32bit Delta | 67 MPixels/sec | 79 MPixels/sec | 309 MPixels/sec | 309 MPixels/sec |
| Fastblend | 188 MColors/sec | 188 MColors/sec | 1130 MColors/sec | 1130 MColors/sec |

Scaleway ARM64-2GB instance with 2 cores of Cavium ThunderX processor:

| Function | Std 64bit (-O2) | Neon (AArch64) |
| --- | --- | --- |
| 8bit Delta | 174 MPixels/sec | 2176 MPixels/sec |
| 32bit Delta | 68 MPixels/sec | 373 MPixels/sec |
| Fastblend | 177 MColors/sec | 1757 MColors/sec |

It seems the ThunderX loves Neon, or isn't as memory bound as the Odroid C2 is.
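To put a number on that observation, the speedup is simply the Neon throughput divided by the standard 64-bit throughput. The helper below is a hypothetical illustration that only restates the figures from the tables above.

```cpp
// Hypothetical helper: ratio of Neon throughput to the standard
// (-O2) 64-bit throughput, both in the same units.
double speedup(double std_throughput, double neon_throughput) {
    return neon_throughput / std_throughput;
}
```

For the 8bit Delta function, `speedup(174, 2176)` gives roughly 12.5x on the ThunderX, while `speedup(258, 968)` gives roughly 3.75x on the Odroid C2.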

The reduction in CPU usage should be between 20% and 50%. Perhaps in the future I will work on an ARM Neon version of AlarmedPixels, which is currently the biggest CPU consumer in zma.
