perf improvements for zstd decode #1668

mgrice · 2019-07-02T19:22:45Z

tldr: 7.5% average decode speedup on silesia corpus at compression levels 1-3 (sandy bridge)

Background: while investigating zstd perf differences between clang and gcc, I noticed that even though gcc vectorizes the loop in wildcopy, it does not do so as well as could be done by hand. The sites where wildcopy is invoked have an interesting distribution of lengths to be copied: the loop trip count is rarely above 1, yet long copies are common enough to make their performance important. The code in zstd_decompress.c that invokes wildcopy handles the latter well, but the gcc autovectorizer introduces a needlessly expensive startup check for vectorization.

See how GCC autovectorizes the loop here:
https://godbolt.org/z/apr0x0

Here is the code after this diff has been applied: (left hand side is the good one, right is with vectorizer on)
After: https://godbolt.org/z/OwO4F8

Note that autovectorization still does not do a good job on the optimized version, so it's turned off via attribute and flag. I found that neither the attribute nor the command-line flag alone was entirely successful in turning off vectorization, which is why both are used.
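A minimal sketch of the attribute half of that approach, assuming GCC's `optimize` function attribute. The macro name `NO_AUTOVEC` and the function `copy_in_8byte_steps` are hypothetical, chosen for illustration only. The guard excludes clang because clang also defines `__GNUC__`; other compilers that define `__GNUC__` may need their own exclusion.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* NO_AUTOVEC is a hypothetical macro name used only in this sketch.
 * clang also defines __GNUC__, so it is excluded explicitly. */
#if defined(__GNUC__) && !defined(__clang__)
#  define NO_AUTOVEC __attribute__((optimize("no-tree-vectorize")))
#else
#  define NO_AUTOVEC
#endif

/* Copies at least `length` bytes in 8-byte steps and may write up to
 * 7 bytes past dst + length, so both buffers need that much slack.
 * Assumes the buffers do not overlap. */
NO_AUTOVEC
static void copy_in_8byte_steps(unsigned char *dst, const unsigned char *src, size_t length)
{
    size_t done = 0;
    do {
        memcpy(dst + done, src + done, 8);
        done += 8;
    } while (done < length);
}
```

The per-file flag (`-fno-tree-vectorize` in the Makefile) covers the same function from the build side; the attribute keeps the setting attached to the source when the file is built some other way.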

silesia benchmark data - second triad of each file is with the original code:

level#file  orig size -> compressed size (ratio),  encode speed        decode speed     decode change
1#dickens   10192446->   4268865(2.388),       198.9MB/s           709.6MB/s
2#dickens   10192446->   3876126(2.630),       128.7MB/s           552.5MB/s
3#dickens   10192446->   3682956(2.767),       104.6MB/s             537MB/s
1#dickens   10192446->   4268865(2.388),       195.4MB/s           659.5MB/s     7.60%
2#dickens   10192446->   3876126(2.630),         127MB/s           516.3MB/s     7.01%
3#dickens   10192446->   3682956(2.767),         105MB/s           479.5MB/s    11.99%
1#mozilla   51220480->  20117517(2.546),       285.4MB/s           734.9MB/s
2#mozilla   51220480->  19067018(2.686),       220.8MB/s           686.3MB/s
3#mozilla   51220480->  18508283(2.767),       152.2MB/s           669.4MB/s
1#mozilla   51220480->  20117517(2.546),       283.4MB/s           697.9MB/s     5.30%
2#mozilla   51220480->  19067018(2.686),       225.9MB/s             665MB/s     3.20%
3#mozilla   51220480->  18508283(2.767),       154.5MB/s           640.6MB/s     4.50%
1#mr         9970564->   3840242(2.596),       262.4MB/s           899.8MB/s
2#mr         9970564->   3600976(2.769),       181.2MB/s           717.9MB/s
3#mr         9970564->   3563987(2.798),       116.3MB/s             620MB/s
1#mr         9970564->   3840242(2.596),       253.2MB/s           827.3MB/s     8.76%
2#mr         9970564->   3600976(2.769),       177.4MB/s           655.4MB/s     9.54%
3#mr         9970564->   3563987(2.798),       111.2MB/s           564.2MB/s     9.89%
1#nci       33553445->   2849306(11.78),       575.2MB/s          1335.8MB/s
2#nci       33553445->   2890166(11.61),       509.3MB/s          1238.1MB/s
3#nci       33553445->   2857408(11.74),         431MB/s          1210.7MB/s
1#nci       33553445->   2849306(11.78),       565.4MB/s          1220.2MB/s     9.47%
2#nci       33553445->   2890166(11.61),       508.2MB/s          1128.4MB/s     9.72%
3#nci       33553445->   2857408(11.74),       429.1MB/s          1097.7MB/s    10.29%
1#ooffice    6152192->   3590954(1.713),       231.4MB/s           662.6MB/s
2#ooffice    6152192->   3323931(1.851),       162.8MB/s           592.6MB/s
3#ooffice    6152192->   3145625(1.956),        99.9MB/s           549.6MB/s
1#ooffice    6152192->   3590954(1.713),       224.7MB/s           624.2MB/s     6.15%
2#ooffice    6152192->   3323931(1.851),         155MB/s           564.5MB/s     4.98%
3#ooffice    6152192->   3145625(1.956),       101.1MB/s           521.2MB/s     5.45%
1#osdb      10085684->   3739042(2.697),       271.9MB/s           876.4MB/s
2#osdb      10085684->   3493875(2.887),       208.2MB/s             857MB/s
3#osdb      10085684->   3515831(2.869),       135.3MB/s           805.4MB/s
1#osdb      10085684->   3739042(2.697),       257.4MB/s           793.8MB/s    10.41%
2#osdb      10085684->   3493875(2.887),       209.7MB/s           776.1MB/s    10.42%
3#osdb      10085684->   3515831(2.869),       130.6MB/s           727.7MB/s    10.68%
1#reymont    6627202->   2152771(3.078),       198.9MB/s           696.2MB/s
2#reymont    6627202->   2071140(3.200),         170MB/s           595.2MB/s
3#reymont    6627202->   1953597(3.392),       128.5MB/s           609.7MB/s
1#reymont    6627202->   2152771(3.078),       199.6MB/s           655.2MB/s     6.26%
2#reymont    6627202->   2071140(3.200),       168.2MB/s           554.4MB/s     7.36%
3#reymont    6627202->   1953597(3.392),       128.7MB/s           557.4MB/s     9.38%
1#samba     21606400->   5510994(3.921),       338.1MB/s            1066MB/s
2#samba     21606400->   5240208(4.123),       258.7MB/s           992.3MB/s
3#samba     21606400->   5003358(4.318),       200.2MB/s           991.1MB/s
1#samba     21606400->   5510994(3.921),       330.8MB/s             974MB/s     9.45%
2#samba     21606400->   5240208(4.123),       257.9MB/s           919.4MB/s     7.93%
3#samba     21606400->   5003358(4.318),       198.5MB/s           908.9MB/s     9.04%
1#sao        7251944->   6256401(1.159),       194.6MB/s           602.2MB/s
2#sao        7251944->   5808761(1.248),       128.2MB/s           532.1MB/s
3#sao        7251944->   5556318(1.305),          73MB/s           509.4MB/s
1#sao        7251944->   6256401(1.159),       198.7MB/s           580.7MB/s     3.70%
2#sao        7251944->   5808761(1.248),       129.1MB/s           502.7MB/s     5.85%
3#sao        7251944->   5556318(1.305),        74.6MB/s           493.1MB/s     3.31%
1#webster   41458703->  13692222(3.028),       222.3MB/s             752MB/s
2#webster   41458703->  12842646(3.228),       157.6MB/s           532.2MB/s
3#webster   41458703->  12191964(3.400),         124MB/s           468.5MB/s
1#webster   41458703->  13692222(3.028),       219.7MB/s             697MB/s     7.89%
2#webster   41458703->  12842646(3.228),       153.9MB/s           495.4MB/s     7.43%
3#webster   41458703->  12191964(3.400),       124.8MB/s           444.8MB/s     5.33%
1#xml        5345280->    696652(7.673),         485MB/s          1333.9MB/s
2#xml        5345280->    681492(7.843),       405.2MB/s          1237.5MB/s
3#xml        5345280->    639057(8.364),       328.5MB/s          1281.3MB/s
1#xml        5345280->    696652(7.673),       473.1MB/s          1232.4MB/s     8.24%
2#xml        5345280->    681492(7.843),       398.6MB/s          1145.9MB/s     7.99%
3#xml        5345280->    639057(8.364),       327.1MB/s            1175MB/s     9.05%
1#x-ray      8474240->   6772557(1.251),       521.3MB/s           762.6MB/s
2#x-ray      8474240->   6684531(1.268),       230.5MB/s           688.5MB/s
3#x-ray      8474240->   6166679(1.374),        68.7MB/s           478.8MB/s
1#x-ray      8474240->   6772557(1.251),       502.8MB/s           736.7MB/s     3.52%
2#x-ray      8474240->   6684531(1.268),       224.4MB/s             662MB/s     4.00%
3#x-ray      8474240->   6166679(1.374),        67.3MB/s           437.8MB/s     9.37%

average decode change                                                            7.51%
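The change column is not defined in the PR text. Judging from the figures, it appears to be the decode speed with the patch relative to the decode speed of the original code at the same level, minus one. A small sketch of that arithmetic follows; the function name `decode_change_percent` is hypothetical.

```c
#include <assert.h>

/* Relative decode speedup in percent, assuming the change column is
 * (patched speed / original speed - 1) * 100. This formula is inferred
 * from the table, not stated by the author. */
static double decode_change_percent(double patched_mbps, double orig_mbps)
{
    return (patched_mbps / orig_mbps - 1.0) * 100.0;
}
```

For the level 1 dickens rows, 709.6 MB/s with the patch against 659.5 MB/s for the original code gives roughly 7.60%, which matches the listed value.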


terrelln · 2019-07-02T19:45:34Z

lib/Makefile

@@ -45,6 +45,8 @@ ZDICT_FILES := $(sort $(wildcard dictBuilder/*.c))
 ZDEPR_FILES := $(sort $(wildcard deprecated/*.c))
 ZSTD_FILES := $(ZSTDCOMMON_FILES)

+decompress/zstd_decompress_block.o : CFLAGS+=-fno-tree-vectorize


I'm wondering if all compilers that use the Makefile support or ignore this flag. For example, people will compile zstd with icc, and also some ancient compilers.

We may have to test to see if the compiler supports this flag.

Good point, they don't.

Don't add "no-tree-vectorize" attribute on clang (which defines __GNUC__)

Cyan4973 · 2019-07-03T19:19:21Z

lib/common/zstd_internal.h

@@ -227,7 +227,9 @@ void ZSTD_wildcopy(void* dst, const void* src, ptrdiff_t length, ZSTD_overlap_e
        COPY16(op, ip);
      }
      while (op < oend - 8);
-      COPY8(op, ip);
+
+      if (op < oend)


This new branch might affect performance, and is probably worth a new decompression speed measurement.

It looks like it's somewhere around 0.5% slower with the branch, in other words about 7% faster overall.

Not sure if you've tried this, but would it be possible to fold this COPY8() into the first if? Something like: if ((length & 15) < 8)? Then we should be able to guarantee that the COPY16() loop finishes the job and overruns by at most 8 bytes. That is an extra &, but we would avoid an unpredictable branch, so it seems like it could help.

https://gcc.godbolt.org/z/MTe_1Y is the generated code for gcc.

    if ((length & 8) == 0)   /* if (length % 16 < 8) */
        COPY8(op, ip);
    do
        COPY16(op, ip);
    while (op < oend);

I did some initial testing, and it seems like it might be a bit faster.

What if length == 17 ?

edit : yep, that works
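To make the length == 17 case concrete, here is a self-contained sketch of the loop shape discussed above. It is not the zstd implementation: `copy8`, `copy16` and `wildcopy_long` are hypothetical stand-ins, and the sketch assumes non-overlapping buffers, length >= 8, and at least 8 bytes of slack after both buffers. The function returns the number of bytes written so the overrun can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers: copy a fixed-size chunk and advance both pointers. */
static void copy8(unsigned char **op, const unsigned char **ip)
{
    memcpy(*op, *ip, 8);
    *op += 8;
    *ip += 8;
}

static void copy16(unsigned char **op, const unsigned char **ip)
{
    memcpy(*op, *ip, 16);
    *op += 16;
    *ip += 16;
}

/* Sketch of the long-copy path. Assumes length >= 8, no overlap, and
 * 8 bytes of slack past the end of both buffers.
 * Returns the number of bytes actually written. */
static size_t wildcopy_long(void *dst, const void *src, size_t length)
{
    unsigned char *op = (unsigned char *)dst;
    const unsigned char *ip = (const unsigned char *)src;
    unsigned char *const oend = op + length;

    if ((length & 8) == 0)      /* same as length % 16 < 8 */
        copy8(&op, &ip);        /* leaves a remainder of 8..15 mod 16 */
    do {
        copy16(&op, &ip);
    } while (op < oend);

    return (size_t)(op - (unsigned char *)dst);
}
```

With length == 17, bit 3 is clear, so the 8-byte copy runs first and a single 16-byte copy finishes the job: 24 bytes written, 7 past the end. Under the length >= 8 assumption the overrun stays between 1 and 8 bytes; for shorter lengths this shape would overrun further, so a real caller would need to route those elsewhere.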

Comparing gcc-9.1 and gcc-8.0 on the 3 versions (buggy, yours, and mine) with zstd -b1 silesia.tar I get:

compiler   mgrice        terrelln      buggy
gcc        1438.8 MB/s   1443.2 MB/s   1456.8 MB/s
clang      1196.6 MB/s   1216.6 MB/s   1199.7 MB/s

terrelln · 2019-07-11T00:55:59Z

I would like to get this landed this week, so we can run tests and make a release by the end of next week, or beginning of the week after.

terrelln

LGTM! I'll let @Cyan4973 take a final look before I merge it.

Cyan4973

Looks great ! Thanks @mgrice !

facebook-github-bot added the CLA Signed label Jul 2, 2019

terrelln reviewed Jul 2, 2019

mgrice added 7 commits July 2, 2019 15:41

makefile changed to only pass -fno-tree-vectorize to gcc

e621572

<Replace this line with a title. Use 1 line only, 67 chars or less>

a34ac22

Don't add "no-tree-vectorize" attribute on clang (which defines __GNUC__)

fix for warning/error with subtraction of void* pointers

c34be8a

Merge branch 'dev' of https://github.com/mgrice/zstd into dev

ea424aa

fix c90 conformance issue - ISO C90 forbids mixed declarations and code

6b9a497

Fix assert for negative diff, only when there is no overlap

e4ac38e

fix overflow revealed in fuzzing tests

73a9b47

Cyan4973 reviewed Jul 3, 2019

tweak for small speed increase

dc4f3ea

terrelln approved these changes Jul 11, 2019

Cyan4973 approved these changes Jul 11, 2019

terrelln merged commit 812e8f2 into facebook:dev Jul 11, 2019

felixhandte mentioned this pull request Jul 19, 2019

Merge v1.4.1 to Master #1691

Merged

Cyan4973 mentioned this pull request Aug 29, 2019

Performance regression (decompress) from 1.4.0 -> 1.4.3 #1758

Closed

crimson-gao mentioned this pull request Nov 9, 2024

Missing the compile flag -fno-tree-vectorize since version v1.4.7 #4193

Open
