MUCH slower compression speeds using version 1.5.1 #2966

Closed
kareldonk opened this issue Dec 30, 2021 · 22 comments · Fixed by #2969

@kareldonk

After updating to version 1.5.1 of the library, I noticed that compression speeds were much worse than with version 1.4.9 at the same compression level. In my benchmarks for my use case, which is compressing streaming network data, zlib now beats zstd in both speed and output size.

To reproduce, build the attached source file first with version 1.4.9 used as a DLL, run the program and note the output, then do the same with version 1.5.1 and compare.

On my PC, running Windows 10 and using Visual Studio 2022 to build, I get the following results with compression level 10:

with zstd 1.4.9: duration: 512 ms, output size: 422 bytes
with zstd 1.5.1: duration: 71266 ms, output size: 422 bytes

Even at compression level 1 there is a difference; 1.5.1 is slower than 1.4.9, albeit not by as much as at compression level 10.

Where does this huge difference in duration come from?

Source.zip

@Cyan4973
Contributor

Tested on a local desktop:

v1.4.9 :

10#silesia.tar       : 211957760 ->  59534038 (3.560),  36.5 MB/s ,1622.8 MB/s

v1.5.1 :

10#silesia.tar       : 211957760 ->  58644998 (x3.614),   45.2 MB/s, 1633.0 MB/s

I'm not sure what's going on in your case, but it's not a universal experience.

There are too many unstated variables that could produce such a large difference, starting with the actual content being compressed and followed by the exact build configuration, to name a few.

@kareldonk
Author

If you look at the attached source file, it's a pretty simple example of compressing a small string. Run it and check the difference for yourself.

@Cyan4973
Contributor

Cyan4973 commented Dec 30, 2021

I've extracted the small string as a file, and compressed it with v1.4.9 and v1.5.1, giving the following results:

geldings.txt

v1.4.9 :

10#geldings.txt      :       727 ->       432 (1.683),  39.0 MB/s , 268.0 MB/s

v1.5.1 :

10#geldings.txt      :       727 ->       432 (x1.683),   42.8 MB/s,  223.1 MB/s

No significant difference in this test.

I'll try to analyze what could be the potential differences between these tests.

@kareldonk
Author

I think it's the way I'm using zstd in my code that exposes the huge difference in performance. It's not related to the string; you can put any string in my code and you will get a big difference in the program's duration output.

The ZSTD_compressStream() function appears to be much slower in 1.5.1 compared to 1.4.9. This might be related to the 'rebalancing' work you guys did in 1.5.0.
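
For readers following along, the call pattern under discussion has roughly this shape. This is a minimal sketch of the older streaming entry points, not the exact code from Source.zip; the function and variable names are illustrative:

    #include <zstd.h>
    #include <string>
    #include <vector>

    // One message, compressed as an independent frame with the streaming API.
    // The ZSTD_CStream object itself can be created once and kept around, but
    // ZSTD_initCStream() re-initializes the whole session for every message.
    // Error checks omitted for brevity.
    size_t compressMessage(ZSTD_CStream* zcs, const std::string& msg,
                           std::vector<char>& dst, int level)
    {
        ZSTD_initCStream(zcs, level);               // per-message (re)initialization
        dst.resize(ZSTD_compressBound(msg.size()));
        ZSTD_inBuffer  in  = { msg.data(), msg.size(), 0 };
        ZSTD_outBuffer out = { dst.data(), dst.size(), 0 };
        while (in.pos < in.size)
            ZSTD_compressStream(zcs, &out, &in);    // the call that got slower
        ZSTD_endStream(zcs, &out);                  // write the frame epilogue
        return out.pos;                             // compressed size
    }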

@kareldonk
Author

I have benchmark code in a test program I'm using; you can see the output below comparing zlib and zstd. At compression level 10, zstd 1.4.9 outperformed zlib at its default compression level. But using zstd 1.5.1, compression got way slower, while decompression remained fast.
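
For reference, the figures below come from a timing loop of roughly this shape. This is an illustrative sketch, not the actual benchmark code; the helper name and structure are assumptions, chosen to match the output format shown:

    #include <chrono>
    #include <cstdio>
    #include <functional>
    #include <string>

    // Run a callable `iterations` times and print total elapsed milliseconds.
    static void benchmark(const std::string& name, int iterations,
                          const std::function<void()>& fn)
    {
        using namespace std::chrono;
        const auto start = steady_clock::now();
        for (int i = 0; i < iterations; ++i)
            fn();
        const auto ms =
            duration_cast<milliseconds>(steady_clock::now() - start).count();
        std::printf("Benchmark '%s' result: %lldms\n", name.c_str(), (long long)ms);
    }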

Using zstd 1.5.1:

---
Starting Compression benchmark for 50000 iterations
---
Input size: 11 bytes
Benchmark 'Compression using Zlib' result: 116ms
Benchmark 'Decompression using Zlib' result: 7ms
Benchmark 'Compression using Zstd' result: 70267ms
Benchmark 'Decompression using Zstd' result: 5ms
Zlib compression output size: 21
Zstd compression output size: 24
---
Input size: 24 bytes
Benchmark 'Compression using Zlib' result: 129ms
Benchmark 'Decompression using Zlib' result: 9ms
Benchmark 'Compression using Zstd' result: 70165ms
Benchmark 'Decompression using Zstd' result: 5ms
Zlib compression output size: 32
Zstd compression output size: 37
---
Input size: 81 bytes
Benchmark 'Compression using Zlib' result: 180ms
Benchmark 'Decompression using Zlib' result: 17ms
Benchmark 'Compression using Zstd' result: 70440ms
Benchmark 'Decompression using Zstd' result: 10ms
Zlib compression output size: 61
Zstd compression output size: 74
---
Input size: 1547 bytes
Benchmark 'Compression using Zlib' result: 1118ms
Benchmark 'Decompression using Zlib' result: 265ms
Benchmark 'Compression using Zstd' result: 72410ms
Benchmark 'Decompression using Zstd' result: 212ms
Zlib compression output size: 576
Zstd compression output size: 577

Using zstd 1.4.9:

---
Starting Compression benchmark for 50000 iterations
---
Input size: 11 bytes
Benchmark 'Compression using Zlib' result: 118ms
Benchmark 'Decompression using Zlib' result: 7ms
Benchmark 'Compression using Zstd' result: 21ms
Benchmark 'Decompression using Zstd' result: 4ms
Zlib compression output size: 21
Zstd compression output size: 24
---
Input size: 24 bytes
Benchmark 'Compression using Zlib' result: 129ms
Benchmark 'Decompression using Zlib' result: 10ms
Benchmark 'Compression using Zstd' result: 26ms
Benchmark 'Decompression using Zstd' result: 4ms
Zlib compression output size: 32
Zstd compression output size: 37
---
Input size: 81 bytes
Benchmark 'Compression using Zlib' result: 179ms
Benchmark 'Decompression using Zlib' result: 17ms
Benchmark 'Compression using Zstd' result: 107ms
Benchmark 'Decompression using Zstd' result: 10ms
Zlib compression output size: 61
Zstd compression output size: 74
---
Input size: 1547 bytes
Benchmark 'Compression using Zlib' result: 1123ms
Benchmark 'Decompression using Zlib' result: 262ms
Benchmark 'Compression using Zstd' result: 1015ms
Benchmark 'Decompression using Zstd' result: 235ms
Zlib compression output size: 576
Zstd compression output size: 576

Perhaps I'm using the API wrong or something, but the fact remains that there is a huge decrease in performance.

@Cyan4973
Contributor

Cyan4973 commented Dec 30, 2021

OK, I can observe a difference when trying to reproduce a similar scenario, though not as large (~x2.5 slower for v1.5.1 compared to v1.4.9, not the >x100 reported in your experiment).

The x2.5 difference can be explained, and mitigated:

  • The scenario recreates a new streaming state every time a small string is compressed. This is extremely wasteful, as a lot of resources are reserved and initialized for just a tiny job.
    • Whenever possible, try to re-use the streaming state across multiple compression sessions. This saves a lot of time, as resources are re-used in place and initialization can be (mostly) skipped.
    • When re-using the context across all subsequent compression operations, performance differences between v1.4.9 and v1.5.1 disappear almost entirely.
  • The streaming interface is initialized without receiving any information on the size of the future stream to compress. This obliges it to prepare for a source of potentially any size, hence "large", so it initializes a large amount of resources.
    • v1.5.1 can use a lot more resources for large inputs than v1.4.9 at the same level 10: v1.5.1 can reserve and initialize ~24 MB for its "hot" tables, while v1.4.9 "only" uses ~12 MB.
    • Also, v1.5.1 uses a different algorithm than v1.4.9 for large data, which is unfortunately less good for small data. By "less good", we are talking ~10-20% differences, not x10 slower.
    • If the state were aware that the next input to compress is actually much smaller, it would adjust its resources accordingly, resulting in substantial savings. It would also use the algorithm more appropriate for small data.
    • Try adding ZSTD_CCtx_setPledgedSrcSize(zstream, inbuffer.size()); in your experiment. It should sharply reduce the amount of resources needed, and therefore the initialization time. In my experiment, it reduced the difference between v1.4.9 and v1.5.1 by a factor of 3. (Both mitigations are sketched in code right after this list.)
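
To make both mitigations concrete, here is a minimal sketch using the v1.5.x advanced API; the function and variable names are illustrative, not taken from the attached source:

    #include <zstd.h>
    #include <string>
    #include <vector>

    // Create the ZSTD_CCtx once, then re-use it for every message. Resetting
    // only the session keeps the allocated workspace, and pledging the source
    // size lets zstd size its tables for a small input instead of preparing
    // for a worst-case large one. Error checks omitted for brevity.
    size_t compressMessage(ZSTD_CCtx* cctx, const std::string& msg,
                           std::vector<char>& dst, int level)
    {
        ZSTD_CCtx_reset(cctx, ZSTD_reset_session_only);     // keep workspace
        ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
        ZSTD_CCtx_setPledgedSrcSize(cctx, msg.size());      // size the tables
        dst.resize(ZSTD_compressBound(msg.size()));
        ZSTD_inBuffer  in  = { msg.data(), msg.size(), 0 };
        ZSTD_outBuffer out = { dst.data(), dst.size(), 0 };
        ZSTD_compressStream2(cctx, &out, &in, ZSTD_e_end);  // compress + flush
        return out.pos;                                     // compressed size
    }

Note that the pledged size is only valid for the next frame; it has to be set again for each new message.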

However, a x2.5 difference pales in comparison to the huge differences you report (>x100).
Another interesting element in your experiment is that the total time for v1.5.1 barely changes across source sizes, from 11 bytes to 1547 bytes, while it scales as expected for v1.4.9. This suggests some kind of "large fixed cost" in your v1.5.1 experiment. I'm not sure what it is, though, since I don't see the same issue in my tests.

@kareldonk
Author

OK, I see. Is the ZSTD_compressStream() function doing more (or larger) memory allocations in 1.5.1 compared to version 1.4.9? That could explain why it is slower. Also, my tests were on Windows 10, which might be slower at allocations than what you are using.
Where do I find the ZSTD_pledgeSrcSize function?

@kareldonk
Author

OK, found the function; it's ZSTD_CCtx_setPledgedSrcSize().

Now I get the following output:

for zstd 1.4.9: duration: 512 ms, output size: 422 bytes
for zstd 1.5.1: duration: 71266 ms, output size: 422 bytes
for zstd 1.5.1 with pledged source size: duration: 769 ms, output size: 423 bytes

Which is not as bad but still a bit slower compared to 1.4.9.


@Cyan4973
Contributor

Cyan4973 commented Dec 31, 2021

Which is not as bad but still a bit slower compared to 1.4.9.

Could you try to re-use the compression state across compression jobs?
In my experiment, it considerably reduced speed differences between v1.4.9 and v1.5.1.
This would help narrow down our options to explain the remaining differences.

edit: ah, and another good data point would be: what's the status regarding v1.5.0?
This would help determine when the difference(s) were introduced.

@stati64

stati64 commented Dec 31, 2021

Are you using zstdlib.vcxproj or compiling another way?
Our build slowed down 50% from 1.5.0 to 1.5.1 with Visual Studio 2022, but not with Linux clang or gcc.

The GitHub zstdlib.vcxproj got back our lost performance and improved on 1.5.0 in our use case of compressing tens of gigabytes. We don't yet know why.

@kareldonk
Author

This is the benchmark result compared with Zlib when using zstd 1.5.1 with the ZSTD_CCtx_setPledgedSrcSize() function added:

---
Starting Compression benchmark for 50000 iterations
---
Input size: 11 bytes
Benchmark 'Compression using Zlib' result: 118ms
Benchmark 'Decompression using Zlib' result: 7ms
Benchmark 'Compression using Zstd' result: 25ms
Benchmark 'Decompression using Zstd' result: 5ms
Zlib compression output size: 21
Zstd compression output size: 24
---
Input size: 24 bytes
Benchmark 'Compression using Zlib' result: 129ms
Benchmark 'Decompression using Zlib' result: 9ms
Benchmark 'Compression using Zstd' result: 29ms
Benchmark 'Decompression using Zstd' result: 5ms
Zlib compression output size: 32
Zstd compression output size: 37
---
Input size: 81 bytes
Benchmark 'Compression using Zlib' result: 188ms
Benchmark 'Decompression using Zlib' result: 18ms
Benchmark 'Compression using Zstd' result: 125ms
Benchmark 'Decompression using Zstd' result: 11ms
Zlib compression output size: 61
Zstd compression output size: 74
---
Input size: 1547 bytes
Benchmark 'Compression using Zlib' result: 1119ms
Benchmark 'Decompression using Zlib' result: 263ms
Benchmark 'Compression using Zstd' result: 1691ms
Benchmark 'Decompression using Zstd' result: 214ms
Zlib compression output size: 576
Zstd compression output size: 574

With larger input sizes, zstd at compression level 10 starts to get slower than zlib.

I'm not sure if it's possible to reuse compression state. The program I'm working on sends messages over the network, and each message is compressed independently, because each message also needs to be decompressed independently. I don't know whether reusing compression state in zstd is possible in such a scenario. If it is, could you tell me how?

For zlib, there is an API where you can specify your own memory allocation functions, and I use that to allocate once at the beginning and keep reusing the buffers for each message I compress.
If zstd also allocates while compressing, is it possible to do this once upfront and keep reusing the resources? If you look at the example Source.zip file I provided, how would I go about reusing the state there?
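
For what it's worth, zstd does expose a comparable hook: a context can be created with user-supplied allocation functions via ZSTD_customMem and ZSTD_createCCtx_advanced(), which live behind ZSTD_STATIC_LINKING_ONLY. A minimal sketch; the pool functions here are hypothetical placeholders (a real one would carve from a pre-allocated arena rather than forwarding to malloc/free):

    #define ZSTD_STATIC_LINKING_ONLY  // ZSTD_customMem is in the static-only API
    #include <zstd.h>
    #include <cstdlib>

    // Hypothetical allocator hooks, shown forwarding to malloc/free.
    static void* poolAlloc(void* opaque, size_t size) { (void)opaque; return std::malloc(size); }
    static void  poolFree(void* opaque, void* addr)   { (void)opaque; std::free(addr); }

    int main()
    {
        ZSTD_customMem mem = { poolAlloc, poolFree, /* opaque state */ nullptr };
        ZSTD_CCtx* cctx = ZSTD_createCCtx_advanced(mem);
        // ... re-use cctx for every message, as sketched earlier ...
        ZSTD_freeCCtx(cctx);
        return 0;
    }

When the context is re-used across messages, allocation mostly happens once anyway; the custom allocator mainly controls where that one-time workspace comes from.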

@kareldonk
Author

Are you using zstdlib.vcxproj or compiling another way? Our build slowed down 50% from 1.5.0 to 1.5.1 with Visual Studio 2022, but not with Linux clang or gcc.

The GitHub zstdlib.vcxproj got back our lost performance and improved on 1.5.0 in our use case of compressing tens of gigabytes. We don't yet know why.

I'm using libzstd-dll.vcxproj and using zstd as a DLL, compiled with VS2022 and the MSVC compiler.

@kareldonk
Author

ah, and another good data point would be: what's the status regarding v1.5.0?
This would help determine when the difference(s) were introduced.

1.5.0 had the same issue as 1.5.1. The differences were introduced between 1.4.9 and 1.5.0.

@Cyan4973
Contributor

Cyan4973 commented Dec 31, 2021

I'm not sure if it's possible to reuse compression state. The program I'm working on sends messages over the network, and each message is compressed separately/independently, because each message also needs to be decompressed separately/independently. I don't know if in such a scenario it would be possible to reuse compression state in zstd? If possible could you tell me how?

I misinterpreted the example source code provided in Source.zip.
It already properly re-uses the compression state across compression jobs.

This makes the huge performance difference reported even more difficult to explain.

I wish I could reproduce it locally, in order to analyze it.

@kareldonk
Author

I misinterpreted the example source code provided in Source.zip. It already properly re-uses the compression state across compression jobs.

This makes the huge performance difference reported even more difficult to explain.

So, adding the ZSTD_CCtx_setPledgedSrcSize() call does improve the performance considerably for 1.5.1, but it is still a little slower compared to 1.4.9.

for zstd 1.4.9: duration: 512 ms, output size: 422 bytes
for zstd 1.5.1: duration: 71266 ms, output size: 422 bytes
for zstd 1.5.1 with pledged source size: duration: 769 ms, output size: 423 bytes

Maybe the remaining difference is related to the rebalancing work done on compression levels? I'm going to experiment some more tomorrow with other compression levels.

@ghost

ghost commented Dec 31, 2021

Are you using a 32-bit build?
Some Windows clients only provide 32-bit builds for maximum compatibility.

@ghost

ghost commented Dec 31, 2021

In v1.5.1:

  • for small data (<=16KB), level 10 uses btlazy2.
  • for unknown-size data, level 10 uses lazy2 (correct me if I'm wrong).

I don't know if it's related to #2831: ZSTD_VecMask_next() currently has no 32-bit MSVC code path, so it falls back to a very slow software implementation. ZSTD_VecMask_next() happens to be in the zstd_lazy.c source file.

edit:
I tested; the 64-bit MSVC build also has this problem.
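
To illustrate the kind of gap being described, here is a generic sketch of a find-lowest-set-bit helper with an intrinsic path and a portable fallback. This is not zstd's actual ZSTD_VecMask_next() code, just an illustration of intrinsic vs. software cost:

    #include <cstdint>
    #if defined(_MSC_VER)
    #include <intrin.h>
    #endif

    // Index of the lowest set bit (precondition: mask != 0).
    static unsigned lowestSetBit(uint64_t mask)
    {
    #if defined(_MSC_VER) && defined(_M_X64)
        unsigned long idx;
        _BitScanForward64(&idx, mask);      // single hardware instruction
        return (unsigned)idx;
    #elif defined(__GNUC__) || defined(__clang__)
        return (unsigned)__builtin_ctzll(mask);
    #else
        // Portable fallback: up to 63 iterations per call, much slower in hot loops.
        unsigned idx = 0;
        while (!(mask & 1)) { mask >>= 1; ++idx; }
        return idx;
    #endif
    }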

@ghost

ghost commented Dec 31, 2021

It seems memset() takes a lot of time.

For small inputs, this call sets 16 MiB of RAM to 0:
https://github.com/facebook/zstd/blob/v1.5.1/lib/compress/zstd_cwksp.h#L488

This call sets 8 MiB of RAM to 0:
https://github.com/facebook/zstd/blob/v1.5.1/lib/compress/zstd_compress.c#L1782

These two calls are the major hot spots.
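
As a back-of-the-envelope check (assuming a memset throughput on the order of ~20 GB/s, which is an assumption, not a measured figure): zeroing ~24 MiB per compression over the benchmark's 50,000 iterations amounts to roughly 1.2 TB of memory writes, i.e. on the order of a minute of pure memset, which is consistent with the ~70 s totals reported above.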

@kareldonk
Author

Are you using a 32-bit build? Some Windows clients only provide 32-bit builds for maximum compatibility.

Oops, I should have been more specific: I'm using the 64-bit release build in my tests above.

@kareldonk
Author

So I did some more tests today, and it appears I can get (roughly) back to the 1.4.9 level of performance I was used to by using compression level 8 with zstd 1.5.1. In that case I get the following output for the test benchmark:

Using zstd 1.5.1 with the ZSTD_CCtx_setPledgedSrcSize() call added and compression level 8:

Starting Compression benchmark for 50000 iterations
---
Input size: 11 bytes
Benchmark 'Compression using Zlib' result: 119ms
Benchmark 'Decompression using Zlib' result: 6ms
Benchmark 'Compression using Zstd' result: 18ms
Benchmark 'Decompression using Zstd' result: 5ms
Zlib compression output size: 21
Zstd compression output size: 24
---
Input size: 24 bytes
Benchmark 'Compression using Zlib' result: 131ms
Benchmark 'Decompression using Zlib' result: 9ms
Benchmark 'Compression using Zstd' result: 24ms
Benchmark 'Decompression using Zstd' result: 5ms
Zlib compression output size: 32
Zstd compression output size: 37
---
Input size: 81 bytes
Benchmark 'Compression using Zlib' result: 182ms
Benchmark 'Decompression using Zlib' result: 17ms
Benchmark 'Compression using Zstd' result: 105ms
Benchmark 'Decompression using Zstd' result: 11ms
Zlib compression output size: 61
Zstd compression output size: 74
---
Input size: 1547 bytes
Benchmark 'Compression using Zlib' result: 1120ms
Benchmark 'Decompression using Zlib' result: 265ms
Benchmark 'Compression using Zstd' result: 922ms
Benchmark 'Decompression using Zstd' result: 226ms
Zlib compression output size: 576
Zstd compression output size: 574

Using zstd 1.4.9 with compression level 10:

Starting Compression benchmark for 50000 iterations
---
Input size: 11 bytes
Benchmark 'Compression using Zlib' result: 118ms
Benchmark 'Decompression using Zlib' result: 7ms
Benchmark 'Compression using Zstd' result: 21ms
Benchmark 'Decompression using Zstd' result: 4ms
Zlib compression output size: 21
Zstd compression output size: 24
---
Input size: 24 bytes
Benchmark 'Compression using Zlib' result: 129ms
Benchmark 'Decompression using Zlib' result: 10ms
Benchmark 'Compression using Zstd' result: 26ms
Benchmark 'Decompression using Zstd' result: 4ms
Zlib compression output size: 32
Zstd compression output size: 37
---
Input size: 81 bytes
Benchmark 'Compression using Zlib' result: 179ms
Benchmark 'Decompression using Zlib' result: 17ms
Benchmark 'Compression using Zstd' result: 107ms
Benchmark 'Decompression using Zstd' result: 10ms
Zlib compression output size: 61
Zstd compression output size: 74
---
Input size: 1547 bytes
Benchmark 'Compression using Zlib' result: 1123ms
Benchmark 'Decompression using Zlib' result: 262ms
Benchmark 'Compression using Zstd' result: 1015ms
Benchmark 'Decompression using Zstd' result: 235ms
Zlib compression output size: 576
Zstd compression output size: 576

In the application itself, benchmarked transfer speeds using zstd can be roughly 2.5-3 times faster than zlib on random data which is difficult to compress.

So I guess this issue has been resolved.

If I didn't have benchmarks, I would not have noticed the performance regression going from version 1.4.9 to 1.5.1; compression level 10 was also hardcoded in the app. If you change (rebalance) what a particular compression level means in terms of performance and output size again, it would be good to make users aware, so they can check whether their apps are negatively affected and whether they need to change settings to match the previous behavior.

Cyan4973 self-assigned this Dec 31, 2021
Cyan4973 added a commit that referenced this issue Dec 31, 2021
When re-using a compression state, across multiple successive compressions,
the state should minimize the amount of allocation and initialization required.

This mostly matters in situations where initialization is an overwhelming task
compared to compression itself.
This can happen when the amount to compress is small,
while the compression state was given the impression that it would be much larger,
aka, streaming mode without providing a srcSize hint.

This lean-initialization optimization was broken in 980f3bb.

This commit fixes it, making this scenario once again on par with v1.4.9.

Note that this does not completely fix #2966,
since another heavy initialization, specific to row mode,
is also happening (and was not present in v1.4.9).
This will be fixed in a separate commit.
Cyan4973 added a commit that referenced this issue Dec 31, 2021
(note: this might break due to the need to also track the starting candidate nb per row)
Cyan4973 added a commit that referenced this issue Dec 31, 2021
@ghost

ghost commented Jan 1, 2022

If you add a time-stamp to DEBUGLOG(), and add a few extra DEBUGLOG() calls, this kind of problem can be located quickly (a sketch of the idea follows the sorted output below).

timestamp.zip

Run 1000 times, then sort by time:

E:\>sort.py d.txt
408803 zstd/lib/compress/zstd_cwksp.h: cwksp: ZSTD_cwksp_mark_tables_clean
204993 zstd/lib/compress/zstd_compress.c: wksp: finished allocating,
48417 zstd/lib/compress/zstd_lazy.c: ZSTD_row_update_internalImpl(): updateStartIdx=
11090 zstd/lib/compress/zstd_cwksp.h: cwksp: reserving
4154 zstd/lib/compress/zstd_compress.c: ZSTD_compressStream
3133 zstd/lib/compress/zstd_compress_literals.c: Compressed literals:
2497 zstd/lib/compress/zstd_compress.c: ZSTD_compressStream_generic, flush=
2035 zstd/lib/compress/zstd_compress.c: completed ZSTD_compressStream
1661 zstd/lib/compress/zstd_compress.c: ZSTD_compress_insertDictionary (dictSize=
1211 zstd/lib/compress/zstd_compress.c: chainSize:
1067 zstd/lib/compress/zstd_lazy.c: ZSTD_row_fillHashCache(): [
1065 zstd/lib/compress/zstd_compress.c: ZSTD_entropyCompressSeqStore_internal (nbSeq=
1052 zstd/lib/compress/zstd_compress_internal.h: ZSTD_window_enforceMaxDist: blockEndIdx=
1049 zstd/lib/compress/zstd_compress.c: ZSTD_useTargetCBlockSize (targetCBlockSize=
1048 zstd/lib/compress/zstd_compress.c: ZSTD_compressBegin_internal: wlog=
1039 zstd/lib/compress/zstd_compress.c: ZSTD_buildSeqStore (srcSize=
1023 zstd/lib/compress/zstd_compress_internal.h: ZSTD_window_update
1010 zstd/lib/compress/zstd_compress.c: stream compression stage (flushMode==
997 zstd/lib/compress/zstd_compress.c: ZSTD_writeEpilogue
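
A sketch of the timestamping idea described above; this is a hypothetical logging helper, not zstd's actual DEBUGLOG() macro:

    #include <chrono>
    #include <cstdio>

    // Prefix each debug line with microseconds elapsed since the first call,
    // so the log can later be sorted by the gaps between consecutive lines.
    static long long elapsedUs()
    {
        using namespace std::chrono;
        static const auto t0 = steady_clock::now();
        return duration_cast<microseconds>(steady_clock::now() - t0).count();
    }

    static void tlog(const char* msg)
    {
        std::fprintf(stderr, "%lld us: %s\n", elapsedUs(), msg);
    }

    // Usage: tlog("cwksp: ZSTD_cwksp_mark_tables_clean begin");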

Cyan4973 added a commit that referenced this issue Jan 5, 2022
fix performance issue in scenario #2966 (part 1)
otrosien added a commit to otrosien/nakadi that referenced this issue May 5, 2022
According to the release notes https://github.com/facebook/zstd/releases/tag/v1.5.2:

> This release also corrects a performance regression that was introduced in v1.5.0 that slows down compression of very small data when using the streaming API. Issue facebook/zstd#2966 tracks that topic.
adyach pushed a commit to zalando/nakadi that referenced this issue May 9, 2022