CRC32 implementation with gcc intrinsics instead pure asm. #4
Nice! Solved difficulties with clang missing ppc headers too.
Hi @antonblanchard @grooverdan. Any news about this PR?
This commit implements CRC32 using power8 vector intrinsics and gcc builtins instead of pure assembly. The performance is the same as the .S version:

    time ./vec_crc32_bench 32768 5000000
    CRC: 165b4c91
    real 0m2.799s
    user 0m2.799s
    sys 0m0.000s

    time ./crc32_bench 32768 5000000
    CRC: 165b4c91
    real 0m2.803s
    user 0m2.803s
    sys 0m0.000s

Perf results:

    perf stat -a ./vec_crc32_bench 32768 5000000
    CRC: 165b4c91
    Performance counter stats for 'system wide':
    360774.660732 task-clock (msec) # 128.683 CPUs utilized
    529 context-switches # 0.001 K/sec
    8 cpu-migrations # 0.000 K/sec
    208 page-faults # 0.001 K/sec
    12,468,436,530 cycles # 0.035 GHz (66.62%)
    18,068,249 stalled-cycles-frontend # 0.14% cycles idle
    466,739,548 stalled-cycles-backend # 3.74% cycles idle
    49,670,139,591 instructions # 3.98 insns per cycle
    # 0.01 stalled cycles per insn (66.82%)
    1,370,729,619 branches # 3.799 M/sec (50.09%)
    5,759,980 branch-misses # 0.42% of all branches
    2.803581718 seconds time elapsed

    perf stat -a ./crc32_bench 32768 5000000
    CRC: 165b4c91
    Performance counter stats for 'system wide':
    360942.638504 task-clock (msec) # 128.498 CPUs utilized
    535 context-switches # 0.001 K/sec
    12 cpu-migrations # 0.000 K/sec
    287 page-faults # 0.001 K/sec
    12,476,309,108 cycles # 0.035 GHz (66.67%)
    17,688,340 stalled-cycles-frontend # 0.14% cycles idle
    477,872,611 stalled-cycles-backend # 3.83% cycles idle
    48,459,294,347 instructions # 3.88 insns per cycle
    # 0.01 stalled cycles per insn (66.69%)
    1,371,856,316 branches # 3.801 M/sec (50.01%)
    5,771,271 branch-misses # 0.42% of all branches
    2.808943029 seconds time elapsed

Tested on (tulibee): P8 / LE DD2.1 Murano 32G RAM, 16 Cores. RHEL7.2 LE

Signed-off-by: Rogerio Alves <rogealve@br.ibm.com>
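The two bench invocations above take two numeric arguments, which appear to be a buffer length and an iteration count. A minimal sketch of such a timing loop follows, under stated assumptions: the bench sources are not shown in this thread, so `run_bench`, the buffer fill pattern, and the portable bit-at-a-time `crc32_bitwise` (standard reflected CRC-32, polynomial 0xEDB88320) are all hypothetical stand-ins, not the vpmsum code from this PR, and the sketch is not expected to reproduce the CRC value printed above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Portable stand-in: bit-at-a-time reflected CRC-32 (polynomial 0xEDB88320).
 * This is NOT the vector implementation discussed in this PR. */
static uint32_t crc32_bitwise(uint32_t crc, const unsigned char *p, size_t len)
{
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical bench body: checksum the same buffer `iterations` times.
 * Stores the last CRC in *out and returns 0, or returns -1 if the
 * allocation fails. The fill pattern is an assumption. */
static int run_bench(size_t length, unsigned long iterations, uint32_t *out)
{
    unsigned char *data = malloc(length ? length : 1);
    uint32_t crc = 0;

    if (data == NULL)
        return -1;
    for (size_t i = 0; i < length; i++)
        data[i] = (unsigned char)i;
    for (unsigned long i = 0; i < iterations; i++)
        crc = crc32_bitwise(0, data, length);
    free(data);
    *out = crc;
    return 0;
}
```

Wrapped in a `main` that parses the two arguments and prints the CRC, this could be timed with `time` or `perf stat` in the same way as the runs above.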
Included quickstart instructions for vec_crc32.c in the README. Signed-off-by: Rogerio Alves <rogealve@br.ibm.com>
Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
This ensures that:
- defining __ASSEMBLY__ (gcc builtin) isn't needed for the C implementation.
- MAX_SIZE is defined in both C and __ASSEMBLY__ generations.
Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
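The guard arrangement described in this commit message can be sketched as a shared header. This is only an illustration under assumptions: the real header is not shown here, the value 32768 is illustrative, `first_chunk_len` is a hypothetical helper, and the build is assumed to define `__ASSEMBLY__` when preprocessing the .S file.

```c
/* Shared section: a plain #define is understood both by the C compiler and
 * by the preprocessor pass over the .S file. 32768 is an illustrative value. */
#define MAX_SIZE 32768

#ifdef __ASSEMBLY__
/* Assembler-only section: constant tables for the .S build would go here. */
#else
/* C-only section: reached without the C build defining anything extra. */
#include <stddef.h>

/* Hypothetical helper: length of the first block when input is processed
 * in chunks of at most MAX_SIZE bytes. */
static inline size_t first_chunk_len(size_t len)
{
    return len < MAX_SIZE ? len : MAX_SIZE;
}
#endif
```

Keeping `MAX_SIZE` outside both branches is what lets the C and assembly builds agree on the block size.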
Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
…ures Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
1905b6f to 8d0e3f6
Updated the PR with @grooverdan's review and contribution.
https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gcc/PowerPC-AltiVec_002fVSX-Built-in-Functions.html#PowerPC-AltiVec_002fVSX-Built-in-Functions indicates -mvsx needed for
Also lodged https://bugs.llvm.org/show_bug.cgi?id=34295 and https://bugs.llvm.org/show_bug.cgi?id=34296 as it currently doesn't compile with bleeding edge clang.
Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
Add example crc32_two_implementations on how to use this. Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
This rolls vec_crc32_test into crc32_test by comparing the reference, ASM and C vpmsum CRC32 implementations. It compares the results at up to 16 alignments, as the codepath will change for this. A Makefile target of test is added to test the boundary conditions of the implementations. Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
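The comparison strategy described in this commit can be sketched as follows. This is a portable illustration, not the actual crc32_test source: `crc32_reference`, `crc32_table_driven`, and `compare_at_alignments` are hypothetical names, a table-driven CRC-32 stands in for the ASM and C vpmsum implementations, and the buffer contents and length range are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference: bit-at-a-time reflected CRC-32 (polynomial 0xEDB88320). */
static uint32_t crc32_reference(uint32_t crc, const unsigned char *p, size_t len)
{
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Candidate under test: byte-at-a-time table lookup. In the real test this
 * slot would be filled by the ASM or C vpmsum implementation. */
static uint32_t crc32_table_driven(uint32_t crc, const unsigned char *p, size_t len)
{
    static uint32_t table[256];
    static int ready;

    if (!ready) {
        for (uint32_t i = 0; i < 256; i++) {
            uint32_t c = i;
            for (int k = 0; k < 8; k++)
                c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1u)));
            table[i] = c;
        }
        ready = 1;
    }
    crc = ~crc;
    while (len--)
        crc = table[(crc ^ *p++) & 0xffu] ^ (crc >> 8);
    return ~crc;
}

#define MAX_ALIGN 16
#define MAX_TEST_LEN 1024

/* Runs both implementations at every start offset 0..15 from a 16-byte
 * aligned base and every length 0..max_len (capped at MAX_TEST_LEN).
 * Returns the number of (offset, length) pairs on which they disagree. */
static int compare_at_alignments(size_t max_len)
{
    static _Alignas(16) unsigned char buf[MAX_ALIGN + MAX_TEST_LEN];
    int mismatches = 0;

    if (max_len > MAX_TEST_LEN)
        max_len = MAX_TEST_LEN;
    for (size_t i = 0; i < sizeof(buf); i++)
        buf[i] = (unsigned char)(i * 31u + 7u);
    for (size_t off = 0; off < MAX_ALIGN; off++)
        for (size_t len = 0; len <= max_len; len++)
            if (crc32_reference(0, buf + off, len) !=
                crc32_table_driven(0, buf + off, len))
                mismatches++;
    return mismatches;
}
```

Sweeping short lengths at every offset is what exercises the boundary conditions: a vectorised implementation typically handles unaligned heads and tails separately from its main loop, so those paths only run for particular offset and length combinations.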
Thanks Rogerio and Daniel!