Use ARM Neon to speed up motion detection on ARM processors #1823
So just for kicks, I googled the errors shown in the Travis CI log:
The relevant line numbers in the source file are actually 251 and 254. Googling indicates this is a result of trying to build 32-bit code in a 64-bit environment (Travis runs 64-bit Ubuntu Trusty). This answer seemed the most applicable, since it applies to cmake:
Reference: http://stackoverflow.com/questions/19346667/error-invalid-instruction-suffix-for-push
So there I learned something. I think. It reminds me that it has been too long since I had to push things onto a stack to operate on them... better leave this stuff to @mastertheknife
It's a mistake on my side. Travis is still failing, but now it's because it can't clone ffmpeg. Is there a way to make it run again?
Yeah, the ffmpeg failure seems to be a recurring issue. I don't know why their git server is refusing our clone request. Clicking "Details" in this issue takes me to the Travis site for this build, where I have a "Restart Job" button at the top right. I don't know whether you will see the same button, so I'll go ahead and restart it.
I have been playing with Neon for some time now and have found the combination of code (in terms of instruction order, width, prefetch distance, etc.) that yields the best performance, at least on the RPi2. Before I can push it, there is a change that I need to make in ZM's core. This is not really a problem: I ran a test and it seems all resolutions divide cleanly by 64.
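A check of that kind could be sketched as below. This is only an illustration: it assumes the quantity that has to divide by 64 is the image buffer size in bytes (width × height × bytes per pixel), and both the resolution list and the names `buffer_is_64_aligned` and `count_unaligned` are hypothetical, not taken from ZoneMinder.

```cpp
#include <cstddef>

// Hypothetical helper: true if an image buffer of this size is a whole
// multiple of 64 bytes. Assumes buffer size = width * height * bytes_per_pixel.
constexpr bool buffer_is_64_aligned(std::size_t width, std::size_t height,
                                    std::size_t bytes_per_pixel) {
    return (width * height * bytes_per_pixel) % 64 == 0;
}

struct Resolution {
    std::size_t width;
    std::size_t height;
};

// Illustrative list of common camera resolutions (an assumption, not the
// list used in the original test).
constexpr Resolution kCommonResolutions[] = {
    {320, 240},  {352, 288},   {640, 480},   {704, 576},
    {1280, 720}, {1920, 1080}, {2560, 1440}, {3840, 2160},
};

// Returns how many of the listed resolutions would NOT fit the 64-byte rule
// for the given pixel size (1 = greyscale, 3 = RGB24, 4 = RGB32).
inline std::size_t count_unaligned(std::size_t bytes_per_pixel) {
    std::size_t unaligned = 0;
    for (const Resolution& r : kCommonResolutions) {
        if (!buffer_is_64_aligned(r.width, r.height, bytes_per_pixel)) {
            ++unaligned;
        }
    }
    return unaligned;
}
```

Calling `count_unaligned(1)`, `count_unaligned(3)` and `count_unaligned(4)` covers the greyscale and colour cases; an unusual size such as 100×100 greyscale would fail the check.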
…nd added SSE4.1, SSE4.2 and AVX detection
… performance increase over standard functions. Memory allocations and image size requirements changed as needed for 64-byte alignment. Self-test code for Blend modified accordingly, and a self-test added for the delta functions.
Okay, so I went ahead with this.
Crap, this has merge conflicts
…ng about the failure
Add AArch64 ARM Neon code (complements #1823)
As discussed in #1810 and requested in #726, this pull request adds code that uses ARM Neon technology to speed up motion detection on ARM processors that have Neon (starting with ARMv7) in AArch32 mode.
The ssedetect() function was modified to add detection for Neon. Unlike on x86 and x86-64, we can't use cpuid for runtime detection of Neon, because the CPUID registers on ARM are accessible only in privileged mode (only the kernel can read them). Instead, we get the information from the kernel using the getauxval function, which requires glibc 2.16 or newer. This shouldn't be an issue except on CentOS 6 machines.
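For readers unfamiliar with getauxval, a minimal sketch of this kind of runtime check follows. It is not the code from this pull request: the names `hwcap_has` and `neon_available` are hypothetical, and it assumes Linux with glibc 2.16 or newer, where `HWCAP_NEON` (AArch32) and `HWCAP_ASIMD` (AArch64) come from the kernel header `<asm/hwcap.h>`. On any other platform the sketch simply reports no Neon.

```cpp
#if defined(__linux__) && (defined(__arm__) || defined(__aarch64__))
#include <sys/auxv.h>   // getauxval() and AT_HWCAP, glibc 2.16+
#include <asm/hwcap.h>  // HWCAP_NEON (AArch32) / HWCAP_ASIMD (AArch64)
#endif

// Pure helper: true if every bit of `mask` is set in `hwcap`.
inline bool hwcap_has(unsigned long hwcap, unsigned long mask) {
    return (hwcap & mask) == mask;
}

// Hypothetical detection routine: asks the kernel, through the auxiliary
// vector, whether the CPU advertises Neon / Advanced SIMD.
inline bool neon_available() {
#if defined(__linux__) && defined(__arm__)
    return hwcap_has(getauxval(AT_HWCAP), HWCAP_NEON);
#elif defined(__linux__) && defined(__aarch64__)
    return hwcap_has(getauxval(AT_HWCAP), HWCAP_ASIMD);
#else
    return false;  // not an ARM Linux build: nothing to detect
#endif
}
```

Keeping the bit test in a separate pure function makes it easy to exercise on a development machine that is not ARM.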
While at it, I also added AVX2 detection, which I will be using soon. The function was also renamed to hwcaps_detect() to better reflect what it does and to allow more things to be detected later on (e.g. non-SIMD special instructions).
I use 128-bit registers in the code, but most current ARM processors, up to the Cortex-A15, execute almost all 128-bit instructions as two 64-bit chunks. Newer processors will benefit more from this.
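To illustrate what working on 128-bit registers looks like, here is a small sketch of a per-byte absolute difference, the basic operation behind a delta function. It is not the code from this pull request: it uses compiler intrinsics from `<arm_neon.h>` instead of hand-tuned instructions, the function name `delta8` is hypothetical, and it falls back to plain scalar code when the compiler does not define `__ARM_NEON`.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: out[i] = |a[i] - b[i]| for `count` bytes.
inline void delta8(const std::uint8_t* a, const std::uint8_t* b,
                   std::uint8_t* out, std::size_t count) {
    std::size_t i = 0;
#if defined(__ARM_NEON)
    // One 128-bit register holds 16 pixels of 8 bits each.
    for (; i + 16 <= count; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);      // load 16 bytes from a
        uint8x16_t vb = vld1q_u8(b + i);      // load 16 bytes from b
        vst1q_u8(out + i, vabdq_u8(va, vb));  // absolute difference, store
    }
#endif
    // Scalar path: handles the tail, or everything when Neon is unavailable.
    for (; i < count; ++i) {
        out[i] = static_cast<std::uint8_t>(a[i] > b[i] ? a[i] - b[i]
                                                       : b[i] - a[i]);
    }
}
```

Both paths produce identical output, so the scalar loop can double as a reference when testing the vector loop.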
I ran some benchmarks on the standard functions versus the ARM Neon functions on a Raspberry Pi 2:
To measure the CPU reduction, compare CPU usage with ZM_CPU_EXTENSIONS turned on and turned off in the settings.
I don't have an ARM machine that I can install ZoneMinder on, so please test this.
I have already tested all functions externally, and they all work correctly.