CRC32 implementation with gcc intrinsics instead of pure asm. #4

Merged
merged 16 commits into from
Jan 28, 2018

Conversation

racardoso

This commit implements CRC32 using POWER8 vector intrinsics
and gcc builtins instead of pure assembly. Performance is
on par with the .S version:

time ./vec_crc32_bench 32768 5000000
CRC: 165b4c91
real 0m2.799s
user 0m2.799s
sys 0m0.000s

time ./crc32_bench 32768 5000000
CRC: 165b4c91
real 0m2.803s
user 0m2.803s
sys 0m0.000s

Perf results:

perf stat -a ./vec_crc32_bench 32768 5000000
CRC: 165b4c91
Performance counter stats for 'system wide':
    360774.660732  task-clock (msec)        # 128.683 CPUs utilized
              529  context-switches         #   0.001 K/sec
                8  cpu-migrations           #   0.000 K/sec
              208  page-faults              #   0.001 K/sec
   12,468,436,530  cycles                   #   0.035 GHz (66.62%)
       18,068,249  stalled-cycles-frontend  #   0.14% cycles idle
      466,739,548  stalled-cycles-backend   #   3.74% cycles idle
   49,670,139,591  instructions             #   3.98 insns per cycle
                                            #   0.01 stalled cycles per insn (66.82%)
    1,370,729,619  branches                 #   3.799 M/sec (50.09%)
        5,759,980  branch-misses            #   0.42% of all branches

2.803581718 seconds time elapsed

perf stat -a ./crc32_bench 32768 5000000
CRC: 165b4c91
Performance counter stats for 'system wide':
    360942.638504  task-clock (msec)        # 128.498 CPUs utilized
              535  context-switches         #   0.001 K/sec
               12  cpu-migrations           #   0.000 K/sec
              287  page-faults              #   0.001 K/sec
   12,476,309,108  cycles                   #   0.035 GHz (66.67%)
       17,688,340  stalled-cycles-frontend  #   0.14% cycles idle
      477,872,611  stalled-cycles-backend   #   3.83% cycles idle
   48,459,294,347  instructions             #   3.88 insns per cycle
                                            #   0.01 stalled cycles per insn (66.69%)
    1,371,856,316  branches                 #   3.801 M/sec (50.01%)
        5,771,271  branch-misses            #   0.42% of all branches

2.808943029 seconds time elapsed

Tested on (tulibee): P8 / LE DD2.1 Murano 32G RAM, 16 Cores.
RHEL7.2 LE
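
For readers who want to sanity-check the figures above, here is a minimal sketch of the arithmetic behind them. The "insns per cycle" column in perf is instructions divided by cycles. The throughput helper ASSUMES that the two benchmark arguments are the buffer size in bytes and the iteration count, which the description does not state. Both function names are hypothetical and are not part of this PR.

```c
#include <stdint.h>

/* Instructions per cycle: the ratio perf prints as "insns per cycle". */
static double insns_per_cycle(uint64_t instructions, uint64_t cycles)
{
    return (double)instructions / (double)cycles;
}

/*
 * Bytes processed per second.
 * ASSUMPTION: the benchmark arguments "32768 5000000" mean
 * <buffer size in bytes> <iterations>; the PR text does not say so.
 */
static double bytes_per_second(uint64_t buf_bytes, uint64_t iterations,
                               double seconds)
{
    return (double)buf_bytes * (double)iterations / seconds;
}
```

Feeding in the counters from the vec_crc32_bench run (49,670,139,591 instructions over 12,468,436,530 cycles) gives a ratio that rounds to the reported 3.98. Under the stated assumption about the arguments, the 2.799 s run works out to roughly 58.5 GB/s.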

@grooverdan
Contributor

nice! Solved difficulties with clang missing ppc headers too.

@racardoso
Author

Hi @antonblanchard @grooverdan . Any news about this PR?

Rogerio Alves and others added 6 commits August 22, 2017 10:42
Signed-off-by: Rogerio Alves <rogealve@br.ibm.com>
Included quickstart instructions for vec_crc32.c in the README.

Signed-off-by: Rogerio Alves <rogealve@br.ibm.com>
Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
This ensures that:

defining __ASSEMBLY__ (gcc builtin) isn't needed for C implementation.

MAX_SIZE is defined in both C and __ASSEMBLY__ generations
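
The header arrangement this commit message describes could look roughly like the sketch below. It is illustrative only: the file name, the value given to MAX_SIZE, and the helper `clamp_to_max_size` are all hypothetical, and the project's real header may be organised differently. The sketch uses `__ASSEMBLY__` because that is the macro the commit message names; GCC itself predefines `__ASSEMBLER__` when it preprocesses assembly sources.

```c
/* crc32_constants_sketch.h (hypothetical file name) */
#ifndef CRC32_CONSTANTS_SKETCH_H
#define CRC32_CONSTANTS_SKETCH_H

/*
 * Defined outside any conditional so that both the C build and the
 * assembly build see it. The value is illustrative, not the project's.
 */
#define MAX_SIZE 32768

#ifdef __ASSEMBLY__
/* Assembler-only definitions (register aliases, constant tables) go here. */
#else
/* C-only declarations: a C build needs no extra -D flag to reach them. */
#include <stddef.h>

/* Hypothetical helper: cap a chunk length at MAX_SIZE. */
static inline size_t clamp_to_max_size(size_t len)
{
    return len > (size_t)MAX_SIZE ? (size_t)MAX_SIZE : len;
}
#endif /* __ASSEMBLY__ */

#endif /* CRC32_CONSTANTS_SKETCH_H */
```

With this layout a C file simply includes the header, while an assembly file built with `-D__ASSEMBLY__` gets MAX_SIZE but none of the C declarations.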

Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
…ures

Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
@racardoso
Author

Updated PR with @grooverdan review and contribution.

@grooverdan
Contributor

https://gcc.gnu.org/onlinedocs/gcc-7.2.0/gcc/PowerPC-AltiVec_002fVSX-Built-in-Functions.html#PowerPC-AltiVec_002fVSX-Built-in-Functions indicates that -mvsx is needed for vector unsigned long (though in practice I never saw any warning). I'm not particularly worried about it, however.

Also lodged https://bugs.llvm.org/show_bug.cgi?id=34295 and https://bugs.llvm.org/show_bug.cgi?id=34296 as it currently doesn't compile with bleeding edge clang.

Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
Add example crc32_two_implementations on how to use this.

Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
This rolls vec_crc32_test into crc32_test by comparing the
reference, ASM and C vpmsum CRC32 implementations.

Compares the results at up to 16 alignments, as the codepath will change
for this.

A Makefile target of test is added to test the boundary conditions
of the implementations.
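
The kind of cross-check described here can be sketched in portable C as follows. This is not the project's test. It ASSUMES the common reflected CRC-32 polynomial 0xEDB88320 (the one zlib uses), and it uses a table-driven routine as a stand-in for the ASM and vpmsum implementations, which only build on POWER8. All function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise reference CRC-32 (ASSUMED polynomial: reflected 0xEDB88320). */
static uint32_t crc32_ref(uint32_t crc, const unsigned char *p, size_t len)
{
    crc = ~crc;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Table-driven stand-in for the implementation under test. */
static uint32_t crc_table[256];

static void crc32_init_table(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1u)));
        crc_table[i] = c;
    }
}

static uint32_t crc32_table(uint32_t crc, const unsigned char *p, size_t len)
{
    crc = ~crc;
    while (len--)
        crc = crc_table[(crc ^ *p++) & 0xffu] ^ (crc >> 8);
    return ~crc;
}

/*
 * Compare both implementations at start offsets 0..15 and at every
 * length from 0 to max_len (capped at 4096). Returns the number of
 * mismatches, so 0 means the two routines agree everywhere tested.
 */
static int check_alignments(size_t max_len)
{
    static unsigned char buf[4096 + 16];
    int mismatches = 0;

    if (max_len > 4096)
        max_len = 4096;
    for (size_t i = 0; i < sizeof(buf); i++)
        buf[i] = (unsigned char)(i * 31u + 7u);   /* arbitrary test pattern */
    crc32_init_table();

    for (size_t align = 0; align < 16; align++)
        for (size_t len = 0; len <= max_len; len++)
            if (crc32_ref(0, buf + align, len) !=
                crc32_table(0, buf + align, len))
                mismatches++;
    return mismatches;
}
```

Varying both the starting offset and the length matters for vectorized code in particular, since unaligned heads and short tails typically go through separate code from the main 16-byte loop.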

Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
@antonblanchard antonblanchard merged commit 90c45d7 into antonblanchard:master Jan 28, 2018
@antonblanchard
Owner

Thanks Rogerio and Daniel!
