Use ARM Neon to speed up motion detection on ARM processors #1823

Merged
knight-of-ni merged 9 commits into ZoneMinder:master from mastertheknife:armv7_neon on May 10, 2017

Contributor

mastertheknife commented Mar 19, 2017

As discussed in #1810 and requested in #726, this is a pull request to add code that uses ARM Neon technology to speed up motion detection on ARM processors that have Neon (starting with ARMv7) in AArch32 mode.

The ssedetect() function was modified to add detection for Neon. Unlike x86 and x86-64, we can't use cpuid for runtime detection of Neon, because the CPUID registers on ARM are for privileged access only (only the kernel can read them). Instead, we get the information from the kernel using the getauxval function, which requires glibc 2.16 or newer. This shouldn't be an issue except on CentOS 6 machines.
While at it, I also added AVX2 detection, which I will be using soon. The function was also renamed to hwcaps_detect() to better reflect what it does and to allow more things to be detected later on (e.g. non-SIMD special instructions).
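
For readers who have not used getauxval before, here is a minimal sketch of the kind of runtime check described above. It is not the hwcaps_detect() code from this pull request: the function name `cpu_has_neon` and the two constants are hypothetical, and the bit values are copied from the Linux kernel's `<asm/hwcap.h>` for 32-bit ARM and AArch64 rather than taken from the headers, so treat them as assumptions to verify against your toolchain.

```cpp
#if defined(__linux__)
#include <sys/auxv.h>  // getauxval() and AT_HWCAP, available since glibc 2.16
#endif

// Hypothetical constants mirroring the kernel's hardware capability bits.
// 32-bit ARM: HWCAP_NEON is bit 12. AArch64: HWCAP_ASIMD (Advanced SIMD) is bit 1.
constexpr unsigned long kArm32HwcapNeon = 1UL << 12;
constexpr unsigned long kArm64HwcapAsimd = 1UL << 1;

// Returns true when the kernel reports Neon / Advanced SIMD support.
// On non-ARM or non-Linux builds there is nothing to ask, so it returns false.
bool cpu_has_neon() {
#if defined(__linux__) && defined(__arm__)
    return (getauxval(AT_HWCAP) & kArm32HwcapNeon) != 0;
#elif defined(__linux__) && defined(__aarch64__)
    return (getauxval(AT_HWCAP) & kArm64HwcapAsimd) != 0;
#else
    (void)kArm32HwcapNeon;   // silence unused-variable warnings
    (void)kArm64HwcapAsimd;
    return false;
#endif
}
```

A caller would typically run this once at startup and pick the Neon or the standard function pointers based on the result.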

I use 128-bit registers in the code, but most current ARM processors (those prior to the Cortex-A15) execute almost all 128-bit instructions as two 64-bit chunks. Newer processors will benefit more from this.

I ran some benchmarks on the standard functions versus the ARM Neon functions on a Raspberry Pi 2:

| Function | Standard (-O2) | Neon (ARMv7) | Speedup |
|---|---|---|---|
| 8bit Delta | 79 MPixels/sec | 157 MPixels/sec | 1.98x |
| 32bit Delta | 25 MPixels/sec | 32 MPixels/sec | 1.28x |
| Fastblend | 81 MColors/sec | 148 MColors/sec | 1.82x |

The CPU usage reduction can be measured by comparing CPU usage with ZM_CPU_EXTENSIONS turned on and off in the settings.

I don't have an ARM machine that I can install ZoneMinder on, so please test this.
I have already tested all functions externally and they are all good.

@mastertheknife mastertheknife changed the title from Use ARM Neon to speed up Delta() and Blend() on ARM processors to Use ARM Neon to speed up motion detection on ARM processors Mar 19, 2017

knight-of-ni Mar 19, 2017

Member

So just for kicks, I googled the errors shown in the Travis CI log:

[ 85%] Building CXX object src/CMakeFiles/zm.dir/zm_utils.cpp.o
/home/travis/build/ZoneMinder/ZoneMinder/src/zm_utils.cpp: Assembler messages:
/home/travis/build/ZoneMinder/ZoneMinder/src/zm_utils.cpp:261: Error: invalid instruction suffix for `push'
/home/travis/build/ZoneMinder/ZoneMinder/src/zm_utils.cpp:264: Error: invalid instruction suffix for `pop'
make[2]: *** [src/CMakeFiles/zm.dir/zm_utils.cpp.o] Error 1
make[1]: *** [src/CMakeFiles/zm.dir/all] Error 2
make: *** [all] Error 2

The actual relevant line numbers in the source file are 251 and 254.

Googling indicates this is a result of trying to build 32bit code in a 64bit environment (Travis runs 64bit Ubuntu Trusty).

This answer seemed to be most applicable since it applies to cmake:

It appears that you are trying to build 32-bit assembly code with 64-bit assembler.

You have 2 options:

    Use 32-bit assembler, for example by utilizing --32 option;
    Change code by substituting 64-bit (extended) registers such as %rax, for example, instead of 32-bit registers such as %eax used with push/pop instructions.

Since the build system appears to be CMake, I'd refer you to this manual on how to configure the build for various Assembly dialects in CMake.

You could try:

set(CMAKE_ASM_FLAGS "--32")

but I haven't tested it.

Reference: http://stackoverflow.com/questions/19346667/error-invalid-instruction-suffix-for-push

So there I learned something. I think. It reminds me it has been too long since I had to push things onto a stack to operate on them... better leave this stuff to @mastertheknife

mastertheknife Mar 19, 2017

Contributor

It's a mistake on my side.
This code compiled successfully on my home ZM box, but it's 32-bit.
Pushing 32-bit registers in 64-bit mode is not possible. I have corrected the code.

However, Travis is still failing, this time because it can't clone ffmpeg. Is there a way to make it run again?

knight-of-ni Mar 19, 2017

Member

Yeah, the ffmpeg failure seems to be a recurring issue. I don't know why their git server is refusing our clone request.

When clicking on "Details" in this issue, it takes me to the Travis site for this build. On the top-right, I have a "Restart Job" button. Don't know if you will see that same button or not so I'll go ahead and restart it.

mastertheknife Mar 30, 2017

Contributor

I have been playing with Neon for some time now and managed to find the combination of code (in terms of instruction order, width, prefetch distance, etc.) that yields the best performance, at least on the RPi2.
This makes it 2.5x - 3.0x faster than stock.
I would like to test it on other processors, because different processors behave differently and have different cache line sizes (32 bytes on the RPi2).

Before I can push it, there is a change that I need to make in ZM's core.
The new code requires images to be aligned on a 32-byte boundary and the image sizes to be multiples of 32. However, I am taking my work on AVX2 and future AVX-512 into consideration and going for 64 bytes instead.
I need to change the memory allocations to start on a 64-byte boundary, and the image sizes must be divisible by 64 instead of 16 as they are today. We only care about the shared memory images and the reference image being aligned to a 64-byte boundary and their image sizes being multiples of 64.
If 24-bit colour is used, the size also has to be divisible by 12 (as it is now, no change), because the unrolled version of Delta() works on four 24-bit pixels at a time.

This is not really a problem. I ran a test and it seems all resolutions are cleanly divisible by 64:
http://51.15.134.207/~kfir/reslist_results.txt
Note: The current 32-bit SSE2/SSSE3 colour functions require the size to be cleanly divisible by 16, but I am not checking for that, because if it's divisible by 64, it's also divisible by 16.
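
To make the two requirements above concrete, here is a small sketch of a size check and a 64-byte aligned allocation. It is not ZoneMinder's allocator: the names `kAlignment`, `size_is_acceptable`, `alloc_image_buffer` and `is_aligned` are hypothetical, the use of `posix_memalign` is my own choice for illustration, and "image size" is assumed to mean the buffer size in bytes (width x height x bytes per pixel).

```cpp
#include <stdlib.h>   // posix_memalign(), free()
#include <cstddef>
#include <cstdint>

constexpr std::size_t kAlignment = 64;  // hypothetical name for the 64-byte rule

// Checks the size rules described above, assuming "size" is the buffer size in bytes.
bool size_is_acceptable(unsigned width, unsigned height, unsigned bytes_per_pixel) {
    const std::size_t size =
        static_cast<std::size_t>(width) * height * bytes_per_pixel;
    if (size % kAlignment != 0) return false;
    // 24-bit images are processed four pixels (12 bytes) at a time.
    // For whole 3-byte pixels this already follows from the 64-byte rule,
    // but the check is kept explicit to mirror the description.
    if (bytes_per_pixel == 3 && size % 12 != 0) return false;
    return true;
}

// Allocates a buffer that starts on a 64-byte boundary, or returns nullptr.
// The caller releases it with free().
std::uint8_t* alloc_image_buffer(std::size_t size) {
    void* p = nullptr;
    if (posix_memalign(&p, kAlignment, size) != 0) return nullptr;
    return static_cast<std::uint8_t*>(p);
}

// True when the pointer starts on a 64-byte boundary.
bool is_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kAlignment == 0;
}
```

For example, a 640x480 8-bit image is 307200 bytes, which is 4800 x 64, so it passes the check, while a 10x10 32-bit image (400 bytes) does not.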

@knight-of-ni knight-of-ni added this to the 1.31.0 milestone Apr 15, 2017

Neon32 functions now work on 64 bytes at a time. This results in 4-6x performance increase over standard functions

Memory allocations and image size requirements changed to be as needed for 64 byte alignment.
Self-test code for Blend modified accordingly and added Self-test for the delta functions.

mastertheknife Apr 16, 2017

Contributor

Okay, so I went ahead with this.
I tried the new code on multiple ARM Cortex processors (A7, A8 and A53) to find a sweet spot that results in the best performance across all of them. I found that working on 64 bytes at a time, with a prefetch distance of 256 bytes, resulted in the best performance. The typical performance gain is 3-6x depending on the processor, with the A53 benefiting the most.
All alignments were changed to a 64-byte boundary, and image sizes are now required to be multiples of 64.
The self-test code for the blend function was changed accordingly, and I have also added self-test code for the delta function for 8-bit and 32-bit colour.
Here are the results of the new code against the standard functions on my Odroid C2 with an ARM Cortex-A53 processor @ 1.5 GHz:

| Function | Standard (-O2) | Neon (ARMv7) | Speedup |
|---|---|---|---|
| 8bit Delta | 215 MPixels/sec | 968 MPixels/sec | 4.50x |
| 32bit Delta | 67 MPixels/sec | 307 MPixels/sec | 4.58x |
| Fastblend | 195 MColors/sec | 1112 MColors/sec | 5.70x |
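
The loop shape and the self-test idea described in this comment can be sketched in portable C++. This is not the Neon assembly from the pull request: it is a scalar stand-in that assumes the 8-bit delta is a per-byte absolute difference, and every name in it (`delta8_reference`, `delta8_blocked`, `delta8_self_test`, `kBlockBytes`, `kPrefetchBytes`) is hypothetical. The prefetch uses the GCC/Clang `__builtin_prefetch` builtin and is skipped on other compilers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBlockBytes = 64;      // bytes handled per loop iteration
constexpr std::size_t kPrefetchBytes = 256;  // how far ahead to prefetch

// Plain reference: per-byte absolute difference (assumed meaning of "8bit Delta").
void delta8_reference(const std::uint8_t* a, const std::uint8_t* b,
                      std::uint8_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<std::uint8_t>(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
    }
}

// Blocked variant: n must be a multiple of 64, matching the image size rule.
void delta8_blocked(const std::uint8_t* a, const std::uint8_t* b,
                    std::uint8_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += kBlockBytes) {
#if defined(__GNUC__) || defined(__clang__)
        // Hint the cache about data 256 bytes ahead, staying inside the buffers.
        if (i + kPrefetchBytes < n) {
            __builtin_prefetch(a + i + kPrefetchBytes);
            __builtin_prefetch(b + i + kPrefetchBytes);
        }
#endif
        // A SIMD version would replace this inner loop with a few vector operations.
        for (std::size_t j = 0; j < kBlockBytes; ++j) {
            const std::size_t k = i + j;
            out[k] = static_cast<std::uint8_t>(a[k] > b[k] ? a[k] - b[k] : b[k] - a[k]);
        }
    }
}

// Self-test: the blocked routine must match the reference byte for byte.
bool delta8_self_test() {
    const std::size_t n = kBlockBytes * 16;
    std::vector<std::uint8_t> a(n), b(n), expected(n), actual(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<std::uint8_t>((i * 7) & 0xFF);
        b[i] = static_cast<std::uint8_t>((255 - i * 3) & 0xFF);
    }
    delta8_reference(a.data(), b.data(), expected.data(), n);
    delta8_blocked(a.data(), b.data(), actual.data(), n);
    return expected == actual;
}
```

Comparing an optimized routine against a simple reference on known input is a cheap way to catch a miscompiled or misbehaving fast path before it is used on live frames.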

knight-of-ni May 10, 2017

Member

Crap, this has merge conflicts

@knight-of-ni knight-of-ni merged commit 71e6735 into ZoneMinder:master May 10, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

mastertheknife added a commit to mastertheknife/ZoneMinder that referenced this pull request May 10, 2017

mastertheknife added a commit to mastertheknife/ZoneMinder that referenced this pull request May 11, 2017

@mastertheknife mastertheknife deleted the mastertheknife:armv7_neon branch May 11, 2017

connortechnology added a commit that referenced this pull request May 11, 2017

Fix delta self-test introduced in #1823 failing (#1878)
* Fix self-test introduced in #1823 failing and improve logging about the failure

* Remove unnecessary newline added by previous commit

knight-of-ni added a commit that referenced this pull request May 14, 2017

Merge pull request #1881 from mastertheknife/aarch64_neon
Add AArch64 ARM Neon code (complements #1823)