perf improvements for zstd decode #1668

Merged (9 commits, Jul 11, 2019)
Conversation

@mgrice (Contributor) commented Jul 2, 2019

tldr: 7.5% average decode speedup on silesia corpus at compression levels 1-3 (sandy bridge)

Background: while investigating zstd perf differences between clang and gcc, I noticed that even though gcc vectorizes the loop in wildcopy, it does not do so as well as can be done by hand. The sites where wildcopy is invoked have an interesting distribution of lengths to be copied: the loop trip count is rarely above 1, yet long copies are common enough to make their performance important. The code in zstd_decompress.c that invokes wildcopy handles the latter well, but the gcc autovectorizer introduces a needlessly expensive startup check for vectorization.

See how GCC autovectorizes the loop here:
https://godbolt.org/z/apr0x0
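
For reference, a minimal standalone rendering of the pattern in question (a sketch only: the function shape is assumed rather than copied from zstd, and COPY8 is a memcpy stand-in for the macro in zstd_internal.h):

    #include <stddef.h>
    #include <string.h>

    typedef unsigned char BYTE;

    /* memcpy stand-in for zstd's 8-byte copy macro (an assumption for this
     * sketch; compilers lower the fixed-size memcpy to a single load/store) */
    #define COPY8(d, s) do { memcpy(d, s, 8); (d) += 8; (s) += 8; } while (0)

    /* Pre-patch loop shape: 8 bytes per iteration, trip count usually 1.
     * GCC's autovectorizer widens this to vector copies, but only after a
     * runtime entry check that dominates when the loop runs just once. */
    void wildcopy_before(void* dst, const void* src, ptrdiff_t length)
    {
        BYTE* op = (BYTE*)dst;
        const BYTE* ip = (const BYTE*)src;
        BYTE* const oend = op + length;
        do {
            COPY8(op, ip);
        } while (op < oend);
    }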

Here is the code after this diff has been applied (left-hand side is the good one; right is with the vectorizer on):
https://godbolt.org/z/OwO4F8

Note that autovectorization still does not do a good job on the optimized version, so it's turned off via attribute and flag. I found that neither the attribute nor the command-line flag was entirely successful on its own at turning off vectorization, which is why both are used.
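
The attribute half of that can be sketched as follows (illustrative; zstd wraps this in a macro, and the exact guards in the source may differ):

    /* Per-function opt-out of GCC's tree vectorizer; expands to nothing
     * for other compilers. The -fno-tree-vectorize flag in the Makefile
     * covers the cases the attribute misses. */
    #if defined(__GNUC__) && !defined(__clang__)
    #  define DONT_VECTORIZE __attribute__((optimize("no-tree-vectorize")))
    #else
    #  define DONT_VECTORIZE
    #endif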

silesia benchmark data (the second triad of each file is with the original code):

file        orig  ->  compressed(ratio)        encode              decode        change
1#dickens   10192446->   4268865(2.388),       198.9MB/s           709.6MB/s
2#dickens   10192446->   3876126(2.630),       128.7MB/s           552.5MB/s
3#dickens   10192446->   3682956(2.767),       104.6MB/s             537MB/s
1#dickens   10192446->   4268865(2.388),       195.4MB/s           659.5MB/s     7.60%
2#dickens   10192446->   3876126(2.630),         127MB/s           516.3MB/s     7.01%
3#dickens   10192446->   3682956(2.767),         105MB/s           479.5MB/s    11.99%
1#mozilla   51220480->  20117517(2.546),       285.4MB/s           734.9MB/s
2#mozilla   51220480->  19067018(2.686),       220.8MB/s           686.3MB/s
3#mozilla   51220480->  18508283(2.767),       152.2MB/s           669.4MB/s
1#mozilla   51220480->  20117517(2.546),       283.4MB/s           697.9MB/s     5.30%
2#mozilla   51220480->  19067018(2.686),       225.9MB/s             665MB/s     3.20%
3#mozilla   51220480->  18508283(2.767),       154.5MB/s           640.6MB/s     4.50%
1#mr         9970564->   3840242(2.596),       262.4MB/s           899.8MB/s
2#mr         9970564->   3600976(2.769),       181.2MB/s           717.9MB/s
3#mr         9970564->   3563987(2.798),       116.3MB/s             620MB/s
1#mr         9970564->   3840242(2.596),       253.2MB/s           827.3MB/s     8.76%
2#mr         9970564->   3600976(2.769),       177.4MB/s           655.4MB/s     9.54%
3#mr         9970564->   3563987(2.798),       111.2MB/s           564.2MB/s     9.89%
1#nci       33553445->   2849306(11.78),       575.2MB/s          1335.8MB/s
2#nci       33553445->   2890166(11.61),       509.3MB/s          1238.1MB/s
3#nci       33553445->   2857408(11.74),         431MB/s          1210.7MB/s
1#nci       33553445->   2849306(11.78),       565.4MB/s          1220.2MB/s     9.47%
2#nci       33553445->   2890166(11.61),       508.2MB/s          1128.4MB/s     9.72%
3#nci       33553445->   2857408(11.74),       429.1MB/s          1097.7MB/s    10.29%
1#ooffice    6152192->   3590954(1.713),       231.4MB/s           662.6MB/s
2#ooffice    6152192->   3323931(1.851),       162.8MB/s           592.6MB/s
3#ooffice    6152192->   3145625(1.956),        99.9MB/s           549.6MB/s
1#ooffice    6152192->   3590954(1.713),       224.7MB/s           624.2MB/s     6.15%
2#ooffice    6152192->   3323931(1.851),         155MB/s           564.5MB/s     4.98%
3#ooffice    6152192->   3145625(1.956),       101.1MB/s           521.2MB/s     5.45%
1#osdb      10085684->   3739042(2.697),       271.9MB/s           876.4MB/s
2#osdb      10085684->   3493875(2.887),       208.2MB/s             857MB/s
3#osdb      10085684->   3515831(2.869),       135.3MB/s           805.4MB/s
1#osdb      10085684->   3739042(2.697),       257.4MB/s           793.8MB/s    10.41%
2#osdb      10085684->   3493875(2.887),       209.7MB/s           776.1MB/s    10.42%
3#osdb      10085684->   3515831(2.869),       130.6MB/s           727.7MB/s    10.68%
1#reymont    6627202->   2152771(3.078),       198.9MB/s           696.2MB/s
2#reymont    6627202->   2071140(3.200),         170MB/s           595.2MB/s
3#reymont    6627202->   1953597(3.392),       128.5MB/s           609.7MB/s
1#reymont    6627202->   2152771(3.078),       199.6MB/s           655.2MB/s     6.26%
2#reymont    6627202->   2071140(3.200),       168.2MB/s           554.4MB/s     7.36%
3#reymont    6627202->   1953597(3.392),       128.7MB/s           557.4MB/s     9.38%
1#samba     21606400->   5510994(3.921),       338.1MB/s            1066MB/s
2#samba     21606400->   5240208(4.123),       258.7MB/s           992.3MB/s
3#samba     21606400->   5003358(4.318),       200.2MB/s           991.1MB/s
1#samba     21606400->   5510994(3.921),       330.8MB/s             974MB/s     9.45%
2#samba     21606400->   5240208(4.123),       257.9MB/s           919.4MB/s     7.93%
3#samba     21606400->   5003358(4.318),       198.5MB/s           908.9MB/s     9.04%
1#sao        7251944->   6256401(1.159),       194.6MB/s           602.2MB/s
2#sao        7251944->   5808761(1.248),       128.2MB/s           532.1MB/s
3#sao        7251944->   5556318(1.305),          73MB/s           509.4MB/s
1#sao        7251944->   6256401(1.159),       198.7MB/s           580.7MB/s     3.70%
2#sao        7251944->   5808761(1.248),       129.1MB/s           502.7MB/s     5.85%
3#sao        7251944->   5556318(1.305),        74.6MB/s           493.1MB/s     3.31%
1#webster   41458703->  13692222(3.028),       222.3MB/s             752MB/s
2#webster   41458703->  12842646(3.228),       157.6MB/s           532.2MB/s
3#webster   41458703->  12191964(3.400),         124MB/s           468.5MB/s
1#webster   41458703->  13692222(3.028),       219.7MB/s             697MB/s     7.89%
2#webster   41458703->  12842646(3.228),       153.9MB/s           495.4MB/s     7.43%
3#webster   41458703->  12191964(3.400),       124.8MB/s           444.8MB/s     5.33%
1#xml        5345280->    696652(7.673),         485MB/s          1333.9MB/s
2#xml        5345280->    681492(7.843),       405.2MB/s          1237.5MB/s
3#xml        5345280->    639057(8.364),       328.5MB/s          1281.3MB/s
1#xml        5345280->    696652(7.673),       473.1MB/s          1232.4MB/s     8.24%
2#xml        5345280->    681492(7.843),       398.6MB/s          1145.9MB/s     7.99%
3#xml        5345280->    639057(8.364),       327.1MB/s            1175MB/s     9.05%
1#x-ray      8474240->   6772557(1.251),       521.3MB/s           762.6MB/s
2#x-ray      8474240->   6684531(1.268),       230.5MB/s           688.5MB/s
3#x-ray      8474240->   6166679(1.374),        68.7MB/s           478.8MB/s
1#x-ray      8474240->   6772557(1.251),       502.8MB/s           736.7MB/s     3.52%
2#x-ray      8474240->   6684531(1.268),       224.4MB/s             662MB/s     4.00%
3#x-ray      8474240->   6166679(1.374),        67.3MB/s           437.8MB/s     9.37%

average                                                                          7.51%

lib/Makefile (outdated diff)
@@ -45,6 +45,8 @@ ZDICT_FILES := $(sort $(wildcard dictBuilder/*.c))
ZDEPR_FILES := $(sort $(wildcard deprecated/*.c))
ZSTD_FILES := $(ZSTDCOMMON_FILES)

decompress/zstd_decompress_block.o : CFLAGS+=-fno-tree-vectorize
Contributor:
I'm wondering if all compilers that use the Makefile support or ignore this flag. For example, people will compile zstd with icc, and also some ancient compilers.

We may have to test to see if the compiler supports this flag.
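
A make-level probe along these lines would work (a sketch only, not necessarily the fix this PR adopts): compile an empty program with the flag and add it to CFLAGS only on success.

    # Hypothetical probe: expands to empty when $(CC) rejects the flag,
    # so unsupported compilers simply get no extra option.
    NO_TREE_VECTORIZE := $(shell echo 'int main(void){return 0;}' | \
        $(CC) -Werror -fno-tree-vectorize -x c - -o /dev/null 2>/dev/null \
        && echo -fno-tree-vectorize)
    decompress/zstd_decompress_block.o : CFLAGS+=$(NO_TREE_VECTORIZE)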

@mgrice (author):
Good point, they don't.

@@ -227,7 +227,9 @@ void ZSTD_wildcopy(void* dst, const void* src, ptrdiff_t length, ZSTD_overlap_e
        COPY16(op, ip);
    }
    while (op < oend - 8);
    if (op < oend)
        COPY8(op, ip);
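
Put together, the patched tail looks roughly like this (a self-contained sketch with memcpy stand-ins for the macros; wildcopy's contract lets it write past oend, since callers reserve slack after the output):

    #include <stddef.h>
    #include <string.h>

    typedef unsigned char BYTE;

    #define COPY8(d, s)  do { memcpy(d, s, 8);  (d) += 8;  (s) += 8;  } while (0)
    #define COPY16(d, s) do { memcpy(d, s, 16); (d) += 16; (s) += 16; } while (0)

    /* 16 bytes per iteration; the loop can stop up to 8 bytes short of
     * oend, so one guarded COPY8 finishes the job. That guard is the
     * branch discussed below. */
    void wildcopy_after(void* dst, const void* src, ptrdiff_t length)
    {
        BYTE* op = (BYTE*)dst;
        const BYTE* ip = (const BYTE*)src;
        BYTE* const oend = op + length;
        do {
            COPY16(op, ip);
        } while (op < oend - 8);
        if (op < oend)
            COPY8(op, ip);
    }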
Contributor:
This new branch might affect performance, and is probably worth a new decompression speed measurement.

@mgrice (author):

I agree

@mgrice (author):
It looks like it's somewhere around 0.5% slower with the branch, in other words about 7% faster overall.

@terrelln (Contributor), Jul 10, 2019:
Not sure if you've tried this, but would it be possible to fold this COPY8() into the first if? Something like: if (length & 15) < 8? Then we should be able to guarantee that the COPY16() loop finishes the job and only overruns by 8. That is an extra &, but we would avoid an unpredictable branch, so it seems like it could help.

@terrelln, Jul 10, 2019:
https://gcc.godbolt.org/z/MTe_1Y is the generated code for gcc.

if ((length & 8) == 0) /* if (length % 16 < 8) */
    COPY8(op, ip);
do
    COPY16(op, ip);
while (op < oend);

I did some initial testing, and it seems like it might be a bit faster.

@Cyan4973 (Contributor), Jul 10, 2019:
What if length == 17?

edit: yep, that works.
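
A quick way to convince yourself for all lengths, not just 17, is to simulate the pointer arithmetic (a hypothetical harness, not part of the PR; it assumes wildcopy is only called with length >= 8):

    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Model of the scheme above: returns how many bytes get written
     * past oend for a given length. */
    static ptrdiff_t overrun(ptrdiff_t length)
    {
        ptrdiff_t op = 0;                /* bytes written so far */
        const ptrdiff_t oend = length;
        if ((length & 8) == 0)           /* i.e. length % 16 < 8 */
            op += 8;                     /* COPY8 */
        do {
            op += 16;                    /* COPY16 */
        } while (op < oend);
        return op - oend;
    }

    int main(void)
    {
        for (ptrdiff_t len = 8; len <= 4096; len++) {
            const ptrdiff_t over = overrun(len);
            assert(over >= 0 && over <= 8);  /* covered, bounded overrun */
        }
        printf("all lengths in [8, 4096] covered with <= 8 bytes overrun\n");
        return 0;
    }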

Contributor:
Comparing gcc-9.1 and clang-8.0 on the 3 versions (buggy, yours, and mine) with zstd -b1 silesia.tar I get:

compiler    mgrice         terrelln       buggy
gcc         1438.8 MB/s    1443.2 MB/s    1456.8 MB/s
clang       1196.6 MB/s    1216.6 MB/s    1199.7 MB/s

@terrelln (Contributor):
I would like to get this landed this week, so we can run tests and make a release by the end of next week, or beginning of the week after.

@terrelln (Contributor) left a review:
LGTM! I'll let @Cyan4973 take a final look before I merge it.

@Cyan4973 (Contributor) left a review:
Looks great! Thanks @mgrice!
