
v1.5.0 speed regressions #2662

Closed
ghost opened this issue May 16, 2021 · 24 comments

@ghost

ghost commented May 16, 2021

An msvc2019 speed-regression commit: 634bfd3

EDIT:
This regression only occurs when using the /Ob3 compiler option, AND on an i5-4570.
With the /Ob2 option, OR on a Ryzen 3600X, 634bfd3 shows no speed regression.

/Ob3 specifies more aggressive inlining than /Ob2:
https://docs.microsoft.com/en-us/cpp/build/reference/ob-inline-function-expansion

before:

                  c_speed   d_speed
level 1: 63.99%, 200.3395, 370.5008
level 2: 54.90%, 150.2246, 259.3094
level 3: 52.55%, 104.7193, 231.1173
level 4: 49.95%, 94.5034, 198.3928

after:

                  c_speed   d_speed
level 1: 63.99%, 184.0881, 361.4072
level 2: 54.90%, 144.1343, 259.7859
level 3: 52.55%, 101.5963, 231.4723
level 4: 49.95%, 91.4660, 198.3102

The unit of c_speed/d_speed is MB/s.

@ghost
Author

ghost commented May 16, 2021

There is another speed regression commit.

EDIT:
It seems this regression only occurs in the finalize_dictionary function, which is not a performance-critical function.

Run pyzstd module's unit-tests on Windows 10, msvc2019, i5-4570:
before 980f3bb: 2.8x seconds.
after  980f3bb: 3.3x seconds.

On WSL2, gcc-9.3.0, i5-4570:
before 980f3bb: 5.1x seconds.
after  980f3bb: 5.6x seconds.

On Windows 10, msvc2019, AMD 3600X:
before 980f3bb: 1.9x seconds.
after  980f3bb: 2.1x seconds.

How to run the unit-tests:

  1. Install Python with "Python test suite" checkbox checked.
  2. Install msvc2019 community edition.
  3. Download pyzstd source code: https://github.com/animalize/pyzstd/archive/refs/heads/dev.zip
  4. Run this .bat file:
e:
cd e:\dev\pyzstd

REM remove previous build artifacts and any installed copy
echo Y | rd /q /s build
echo Y | rd /q /s dist
echo Y | py -m pip uninstall pyzstd

REM build, install, and run the test suite
py setup.py install
py E:\dev\pyzstd\tests\test_zstd.py
pause

@ghost ghost changed the title v1.5.0 speed regression on msvc2019 v1.5.0 speed regressions on msvc2019 May 16, 2021
@FrancescAlted

I have recently upgraded C-Blosc2 to Zstd 1.5.0 (from 1.4.9), and I am detecting performance regressions too, especially on the compression side of things.

The differences are very apparent on this Intel box: Clear Linux, GCC 11, i9-10940X @ 3.30GHz:

Before (zstd 1.4.9):

> ./bench/b2bench zstd shuffle single 8                                                                                                 (base)
Blosc version: 2.0.0.rc.2.dev ($Date:: 2021-05-06 #$)
List of supported compressors in this build: blosclz,lz4,lz4hc,zlib,zstd
Supported compression libraries:
  BloscLZ: 2.3.0
  LZ4: 1.9.3
  Zlib: 1.2.11.zlib-ng
  Zstd: 1.4.9
Using compressor: zstd
Using shuffle type: shuffle
Running suite: single
--> 8, 4194304, 4, 19, zstd, shuffle
********************** Run info ******************************
Blosc version: 2.0.0.rc.2.dev ($Date:: 2021-05-06 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 4194304 bytes	Type size: 4 bytes
Working set: 256.0 MB		Number of threads: 8
********************** Running benchmarks *********************
memcpy(write):		  871.3 us, 4590.7 MB/s
memcpy(read):		  481.3 us, 8311.5 MB/s
Compression level: 0
comp(write):	  122.1 us, 32749.7 MB/s	  Final bytes: 4194336  Ratio: 1.00
decomp(read):	   94.7 us, 42224.1 MB/s	  OK
Compression level: 1
comp(write):	  716.9 us, 5579.4 MB/s	  Final bytes: 599602  Ratio: 7.00
decomp(read):	  288.5 us, 13864.8 MB/s	  OK
Compression level: 2
comp(write):	  567.5 us, 7048.5 MB/s	  Final bytes: 345678  Ratio: 12.13
decomp(read):	  261.8 us, 15280.0 MB/s	  OK
Compression level: 3
comp(write):	  806.8 us, 4957.9 MB/s	  Final bytes: 134398  Ratio: 31.21
decomp(read):	  266.0 us, 15039.3 MB/s	  OK
Compression level: 4
comp(write):	  837.3 us, 4777.3 MB/s	  Final bytes: 62832  Ratio: 66.75
decomp(read):	  130.2 us, 30722.6 MB/s	  OK
Compression level: 5
comp(write):	  928.7 us, 4307.3 MB/s	  Final bytes: 60076  Ratio: 69.82
decomp(read):	  122.2 us, 32722.5 MB/s	  OK
Compression level: 6
comp(write):	  909.1 us, 4400.2 MB/s	  Final bytes: 59080  Ratio: 70.99
decomp(read):	  114.9 us, 34818.2 MB/s	  OK
Compression level: 7
comp(write):	 1515.8 us, 2639.0 MB/s	  Final bytes: 37592  Ratio: 111.57
decomp(read):	   89.5 us, 44674.7 MB/s	  OK
Compression level: 8
comp(write):	 1686.7 us, 2371.4 MB/s	  Final bytes: 37464  Ratio: 111.96
decomp(read):	   90.0 us, 44432.8 MB/s	  OK
Compression level: 9
comp(write):	 18625.2 us, 214.8 MB/s	  Final bytes: 15400  Ratio: 272.36
decomp(read):	  179.2 us, 22316.9 MB/s	  OK

Round-trip compr/decompr on 7.5 GB
Elapsed time:	    5.7 s, 2953.1 MB/s

After (zstd 1.5.0):

> ./bench/b2bench zstd shuffle single 8                                                                                                     (base)
Blosc version: 2.0.0.rc.2.dev ($Date:: 2021-05-06 #$)
List of supported compressors in this build: blosclz,lz4,lz4hc,zlib,zstd
Supported compression libraries:
  BloscLZ: 2.3.0
  LZ4: 1.9.3
  Zlib: 1.2.11.zlib-ng
  Zstd: 1.5.0
Using compressor: zstd
Using shuffle type: shuffle
Running suite: single
--> 8, 4194304, 4, 19, zstd, shuffle
********************** Run info ******************************
Blosc version: 2.0.0.rc.2.dev ($Date:: 2021-05-06 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 4194304 bytes	Type size: 4 bytes
Working set: 256.0 MB		Number of threads: 8
********************** Running benchmarks *********************
memcpy(write):		  865.8 us, 4619.7 MB/s
memcpy(read):		  481.9 us, 8299.9 MB/s
Compression level: 0
comp(write):	  122.4 us, 32675.9 MB/s	  Final bytes: 4194336  Ratio: 1.00
decomp(read):	  100.0 us, 39998.4 MB/s	  OK
Compression level: 1
comp(write):	  737.9 us, 5420.7 MB/s	  Final bytes: 599602  Ratio: 7.00
decomp(read):	  281.9 us, 14187.4 MB/s	  OK
Compression level: 2
comp(write):	  586.9 us, 6815.0 MB/s	  Final bytes: 345678  Ratio: 12.13
decomp(read):	  253.1 us, 15801.9 MB/s	  OK
Compression level: 3
comp(write):	 1169.0 us, 3421.8 MB/s	  Final bytes: 134398  Ratio: 31.21
decomp(read):	  262.8 us, 15219.4 MB/s	  OK
Compression level: 4
comp(write):	 1863.3 us, 2146.8 MB/s	  Final bytes: 63200  Ratio: 66.37
decomp(read):	  127.9 us, 31278.8 MB/s	  OK
Compression level: 5
comp(write):	 1758.7 us, 2274.4 MB/s	  Final bytes: 60076  Ratio: 69.82
decomp(read):	  121.7 us, 32866.7 MB/s	  OK
Compression level: 6
comp(write):	 2255.6 us, 1773.4 MB/s	  Final bytes: 59080  Ratio: 70.99
decomp(read):	  112.3 us, 35628.1 MB/s	  OK
Compression level: 7
comp(write):	 3032.0 us, 1319.2 MB/s	  Final bytes: 37592  Ratio: 111.57
decomp(read):	   88.5 us, 45204.3 MB/s	  OK
Compression level: 8
comp(write):	 3274.6 us, 1221.5 MB/s	  Final bytes: 37464  Ratio: 111.96
decomp(read):	   89.0 us, 44924.8 MB/s	  OK
Compression level: 9
comp(write):	 21220.3 us, 188.5 MB/s	  Final bytes: 15400  Ratio: 272.36
decomp(read):	  182.1 us, 21962.1 MB/s	  OK

Round-trip compr/decompr on 7.5 GB
Elapsed time:	    7.5 s, 2251.7 MB/s

As you can see, compression can be more than 2x slower with 1.5.0. Decompression does not seem to be affected.

Curiously, with Apple M1 (Apple clang 12.0.5) the differences are almost negligible:

Before (zstd 1.4.9):

> ./bench/b2bench zstd shuffle single 8                                                                                 (base)
Blosc version: 2.0.0.rc.2.dev ($Date:: 2021-05-06 #$)
List of supported compressors in this build: blosclz,lz4,lz4hc,zlib,zstd
Supported compression libraries:
  BloscLZ: 2.3.0
  LZ4: 1.9.3
  Zlib: 1.2.11.zlib-ng
  Zstd: 1.4.9
Using compressor: zstd
Using shuffle type: shuffle
Running suite: single
--> 8, 4194304, 4, 19, zstd, shuffle
********************** Run info ******************************
Blosc version: 2.0.0.rc.2.dev ($Date:: 2021-05-06 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 4194304 bytes	Type size: 4 bytes
Working set: 256.0 MB		Number of threads: 8
********************** Running benchmarks *********************
memcpy(write):		  477.5 us, 8377.2 MB/s
memcpy(read):		   85.5 us, 46786.6 MB/s
Compression level: 0
comp(write):	  128.9 us, 31040.8 MB/s	  Final bytes: 4194336  Ratio: 1.00
decomp(read):	  107.2 us, 37298.2 MB/s	  OK
Compression level: 1
comp(write):	 1027.6 us, 3892.6 MB/s	  Final bytes: 599602  Ratio: 7.00
decomp(read):	  408.5 us, 9791.2 MB/s	  OK
Compression level: 2
comp(write):	  742.8 us, 5385.3 MB/s	  Final bytes: 345678  Ratio: 12.13
decomp(read):	  403.9 us, 9903.4 MB/s	  OK
Compression level: 3
comp(write):	  965.2 us, 4144.0 MB/s	  Final bytes: 134398  Ratio: 31.21
decomp(read):	  396.9 us, 10078.5 MB/s	  OK
Compression level: 4
comp(write):	 1069.1 us, 3741.3 MB/s	  Final bytes: 62832  Ratio: 66.75
decomp(read):	  235.2 us, 17009.2 MB/s	  OK
Compression level: 5
comp(write):	 1178.7 us, 3393.6 MB/s	  Final bytes: 60076  Ratio: 69.82
decomp(read):	  223.7 us, 17877.1 MB/s	  OK
Compression level: 6
comp(write):	 1566.6 us, 2553.4 MB/s	  Final bytes: 59080  Ratio: 70.99
decomp(read):	  162.6 us, 24593.5 MB/s	  OK
Compression level: 7
comp(write):	 2319.6 us, 1724.4 MB/s	  Final bytes: 37592  Ratio: 111.57
decomp(read):	  137.3 us, 29132.4 MB/s	  OK
Compression level: 8
comp(write):	 2626.4 us, 1523.0 MB/s	  Final bytes: 37464  Ratio: 111.96
decomp(read):	  137.3 us, 29128.3 MB/s	  OK
Compression level: 9
comp(write):	 13825.9 us, 289.3 MB/s	  Final bytes: 15400  Ratio: 272.36
decomp(read):	  145.6 us, 27476.1 MB/s	  OK

Round-trip compr/decompr on 7.5 GB
Elapsed time:	    5.5 s, 3094.6 MB/s

After (zstd 1.5.0):

> ./bench/b2bench zstd shuffle single 8                                                                                     (base)
Blosc version: 2.0.0.rc.2.dev ($Date:: 2021-05-06 #$)
List of supported compressors in this build: blosclz,lz4,lz4hc,zlib,zstd
Supported compression libraries:
  BloscLZ: 2.3.0
  LZ4: 1.9.3
  Zlib: 1.2.11.zlib-ng
  Zstd: 1.4.9
Using compressor: zstd
Using shuffle type: shuffle
Running suite: single
--> 8, 4194304, 4, 19, zstd, shuffle
********************** Run info ******************************
Blosc version: 2.0.0.rc.2.dev ($Date:: 2021-05-06 #$)
Using synthetic data with 19 significant bits (out of 32)
Dataset size: 4194304 bytes	Type size: 4 bytes
Working set: 256.0 MB		Number of threads: 8
********************** Running benchmarks *********************
memcpy(write):		  440.0 us, 9090.7 MB/s
memcpy(read):		   97.7 us, 40942.7 MB/s
Compression level: 0
comp(write):	  132.3 us, 30226.6 MB/s	  Final bytes: 4194336  Ratio: 1.00
decomp(read):	  119.5 us, 33463.8 MB/s	  OK
Compression level: 1
comp(write):	 1123.1 us, 3561.7 MB/s	  Final bytes: 599602  Ratio: 7.00
decomp(read):	  468.5 us, 8537.5 MB/s	  OK
Compression level: 2
comp(write):	  751.2 us, 5325.0 MB/s	  Final bytes: 345678  Ratio: 12.13
decomp(read):	  407.2 us, 9822.2 MB/s	  OK
Compression level: 3
comp(write):	  968.8 us, 4128.9 MB/s	  Final bytes: 134398  Ratio: 31.21
decomp(read):	  398.1 us, 10046.7 MB/s	  OK
Compression level: 4
comp(write):	 1053.0 us, 3798.7 MB/s	  Final bytes: 62832  Ratio: 66.75
decomp(read):	  235.3 us, 17001.6 MB/s	  OK
Compression level: 5
comp(write):	 1188.4 us, 3365.8 MB/s	  Final bytes: 60076  Ratio: 69.82
decomp(read):	  223.4 us, 17905.2 MB/s	  OK
Compression level: 6
comp(write):	 1586.9 us, 2520.7 MB/s	  Final bytes: 59080  Ratio: 70.99
decomp(read):	  167.9 us, 23817.5 MB/s	  OK
Compression level: 7
comp(write):	 2317.5 us, 1726.0 MB/s	  Final bytes: 37592  Ratio: 111.57
decomp(read):	  139.2 us, 28730.5 MB/s	  OK
Compression level: 8
comp(write):	 2616.8 us, 1528.6 MB/s	  Final bytes: 37464  Ratio: 111.96
decomp(read):	  137.5 us, 29082.7 MB/s	  OK
Compression level: 9
comp(write):	 13949.7 us, 286.7 MB/s	  Final bytes: 15400  Ratio: 272.36
decomp(read):	  144.0 us, 27770.3 MB/s	  OK

Round-trip compr/decompr on 7.5 GB
Elapsed time:	    5.5 s, 3061.5 MB/s

To reproduce the benchmarks, build the library following the instructions in:
https://github.com/Blosc/c-blosc2/blob/main/README.rst#compiling-the-c-blosc2-library-with-cmake

./bench/b2bench will then appear in the build directory.

@ghost ghost changed the title v1.5.0 speed regressions on msvc2019 v1.5.0 speed regressions May 17, 2021
@senhuang42
Contributor

senhuang42 commented May 17, 2021

@FrancescAlted I can indeed measure a speed regression as well with your C-Blosc2 project - I took a random file generated by your benchmark tool and benchmarked it using the zstd benchmarking utility:

1.5.0:
 5#out.txt           :    262144 ->      5175 (50.66), 400.8 MB/s ,3193.3 MB/s 
 6#out.txt           :    262144 ->      5173 (50.68), 354.4 MB/s ,3193.2 MB/s 
 7#out.txt           :    262144 ->      3940 (66.53), 361.1 MB/s ,7293.9 MB/s 
 8#out.txt           :    262144 ->      4469 (58.66), 281.0 MB/s ,5744.3 MB/s 
1.4.9:
 5#out.txt           :    262144 ->      5175 (50.66), 541.5 MB/s ,3194.8 MB/s 
 6#out.txt           :    262144 ->      5173 (50.68), 454.7 MB/s ,3191.6 MB/s 
 7#out.txt           :    262144 ->      3917 (66.92), 458.4 MB/s ,7341.2 MB/s 
 8#out.txt           :    262144 ->      4469 (58.66), 332.5 MB/s ,5764.8 MB/s

So I'm measuring around a 30% regression on gcc with an i9-9900K. The file is two long strings repeated many times, so it's not inconceivable that the new match finder doesn't deal with this particular case as well.

One way to mitigate this in your tool in particular could just be disabling the new match finder. Basically, you just need to #define ZSTD_STATIC_LINKING_ONLY before including zstd.h (to expose the advanced API), and you could migrate the ZSTD_compressCCtx() call to:

ZSTD_CCtx_setParameter(thread_context->zstd_cctx, ZSTD_c_compressionLevel, clevel);
ZSTD_CCtx_setParameter(thread_context->zstd_cctx, ZSTD_c_useRowMatchFinder, ZSTD_urm_disableRowMatchFinder);
code = ZSTD_compress2(thread_context->zstd_cctx,
        (void*)output, maxout, (void*)input, input_length);

and for ZSTD_compress_usingCDict(), just add the following line before it:

ZSTD_CCtx_refCDict(thread_context->zstd_cctx, cdict);  /* pass the ZSTD_CDict already in use */

Though maybe this warrants some more investigation. A perf comparison doesn't seem to flag any particular function, but a lot of time is spent in prefetching, and in this case it might just not be as useful for some reason. It could be the case that the Apple M1 deals with heavy prefetching relatively better, and doesn't suffer from it as much.
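Putting the pieces above together, a minimal self-contained sketch of that workaround (illustrative only; this is not C-Blosc2's actual code, and ZSTD_urm_disableRowMatchFinder is the enum name as of v1.5.0):

#define ZSTD_STATIC_LINKING_ONLY   /* exposes the experimental ZSTD_c_useRowMatchFinder parameter */
#include <assert.h>
#include <zstd.h>

/* Sketch: compress src into dst with the row-based match finder disabled.
 * Returns the compressed size, or a zstd error code (test with ZSTD_isError()). */
static size_t compress_without_row_matchfinder(void* dst, size_t dstCapacity,
                                               const void* src, size_t srcSize,
                                               int level)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    size_t result;
    assert(cctx != NULL);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
    /* enum value as of v1.5.0; later releases use ZSTD_ps_disable instead */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_useRowMatchFinder, ZSTD_urm_disableRowMatchFinder);
    result = ZSTD_compress2(cctx, dst, dstCapacity, src, srcSize);
    ZSTD_freeCCtx(cctx);
    return result;
}

As long as dstCapacity is at least ZSTD_compressBound(srcSize), ZSTD_compress2() cannot fail for lack of output space.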

@terrelln
Contributor

terrelln commented May 18, 2021

I think what is happening is that this file has a lot of very long matches. The row-based match finder's update function is slower, but it makes up for it with much faster searches. But when you have very long matches (hundreds of bytes at least), the update function can start to dominate, and the hash chain's update function is very, very simple.

I did a simple experiment. I added this code:

    /* Cap the number of positions inserted into the row-based table on each
       update: if more than 100 new positions would be indexed, skip ahead
       and refill the hash cache from the new starting position. */
    if (target - idx > 100) {
        idx = target - 100;
        if (useCache)
            ZSTD_row_fillHashCache(ms, base, rowLog, mls, idx, ip + 1);
    }

here. I don't think the code is quite right, but it produces a valid result so whatever.

The result is:

--no-row-match-finder:
 5#c-blosc.dat       :    262144 ->      5175 (50.66), 530.5 MB/s ,3188.3 MB/s
--row-match-finder before:
 5#c-blosc.dat       :    262144 ->      5175 (50.66), 410.4 MB/s ,3183.6 MB/s
--row-match-finder after:
 5#c-blosc.dat       :    262144 ->      5131 (51.09), 876.2 MB/s ,3210.0 MB/s

So, I think that we should investigate the right skipping strategy for (in)compressible sections.

@ghost
Author

ghost commented May 20, 2021

The two regressions I reported are invalid.

634bfd3
This regression only occurs when using the /Ob3 compiler option, AND on an i5-4570.
With the /Ob2 option, OR on a Ryzen 3600X, 634bfd3 shows no speed regression.
It may be a regression only for a specific processor.

980f3bb
It seems this regression only occurs in the finalize_dictionary function, which is not a performance-critical function.
This regression can be reproduced on Ryzen-3600X/win10/msvc2019 and i5-4570/wsl2/gcc-9.3.
If you have time, it's worth taking a look, since it is an optimization commit after all.

@Cyan4973
Contributor

Fixed in #2755

@terrelln
Contributor

This should be fixed by #2755

@P-E-Meunier

P-E-Meunier commented Feb 24, 2023

Hi! We're still experiencing this regression very badly with the latest release in Pijul (https://pijul.org), which uses the zstd-seekable format.

We have been pinning our version of ZStd for quite a while, which causes endless issues across the different platforms we support.

ZStd 1.5 is at least 10 times slower than ZStd 1.4.8, even on our most basic test cases.

@Cyan4973
Contributor

ZStd 1.5 is at least 10 times slower than ZStd 1.4.8, even on our most basic test cases.

This is a really huge regression, and we haven't observed anything that bad in our tests so far.

We would be interested in understanding your specific scenario better;
if we can reproduce it, that will be a good starting point for a fix.

@terrelln
Contributor

@P-E-Meunier this is likely unrelated to this Issue. Could you please provide a way for us to reproduce the issue?

Can you also double-check that you're building zstd with asserts disabled (-DNDEBUG) and with -O3?

@P-E-Meunier

I'm actually using the zstd provided by my Linux distribution (NixOS), but the exact same build script is impressively fast with 1.4.8 and extremely slow with 1.5.0.

Reproducing the issue is not particularly easy, since ZStd really sits at the bottom of our stack. I will try to make a more minimal example than what I have.

@P-E-Meunier

If you want to try it now anyway, the way to reproduce this is:

  • install a recent enough version of Rust
  • run cargo install pijul --version "~1.0.0-beta"
  • run a minimal repository example, like pijul init; dd if=/dev/zero of=testfile bs=1024 count=102400; pijul rec -am. testfile

On ZStd 1.4.8, the last command (pijul rec -am. testfile) takes a few seconds, maybe 10s on an old-ish laptop. On ZStd 1.5.0, it takes a few minutes.

One way to debug this could be to download the latest source code for Pijul by running pijul clone https://nest.pijul.com/pijul/pijul, and run it again after placing a new ZStd in the library path. If that isn't possible, then download the latest source code for the Rust bindings of ZStd-seekable: pijul clone https://nest.pijul.com/pmeunier/zstd-seekable, update pijul/libpijul/Cargo.toml to change the path for crate zstd-seekable (adding path="/path/to/my/clone" on the line where zstd-seekable is declared should be enough), and run cargo install in the pijul directory.

@Cyan4973
Contributor

On ZStd 1.5.0, it takes a few minutes.

Have you tried v1.5.4 ?

@P-E-Meunier

Yes, 1.5.4 is still affected, see https://nest.pijul.com/pijul/pijul/discussions/761

@P-E-Meunier

Just a minor note in case you want to test: we're now packaging ZStd 1.4.8 along with our bindings in order to avoid the speed regression, so the regression is harder to observe. I'll still try to put together a test case.

@Cyan4973
Contributor

Cyan4973 commented Mar 7, 2023

I would like to reproduce the faulty scenario,
though installing pijul and all its dependencies is not my first choice.
Instead, I would like to reproduce an equivalent scenario, using libzstd and the seekable format,
to observe a similar effect.

A few initial questions come to mind: in the pijul scenario experiencing the slowdown,

  1. what's the compression level set for libzstd
  2. what's the maximum block size set for the seekable format
  3. what are the final average block size and average compression ratio
  4. is there a read test in the benchmark? If yes, what's the read pattern?

Looking at the proposed reproduction scenario:
dd if=/dev/zero of=testfile bs=1024 count=102400
it seems this is a source document which is just completely filled with zeroes?
In which case, the compression ratio will be very high, but this is hardly a "normal" use case, more like an edge case.

@Cyan4973
Contributor

Cyan4973 commented Mar 7, 2023

By the way,
I tried the process explained in this post,
and when reaching the last command,
I'm not sure what's happening, but it does not seem to match expectations:

time pijul rec -am. testfile
Hash: NI2YIUC6RSD7OFBFVKPEFFU42Z6SI3TW6HCVQB722PZHXKAIVAHAC
pijul rec -am. testfile  0.46s user 0.05s system 92% cpu 0.549 total

with the local system providing libzstd v1.5.0 by default.

Btw, does pijul use the system's libzstd library, or does it vendor in its own version ?
I tested the recommended version, aka cargo install pijul --version "~1.0.0-beta".

@P-E-Meunier

The compression level is 10, the frame size is 256. I don't know about the compression ratio. There is no read test.

Here is the Rust code we use to compress; you should be able to reproduce this by compiling with the zstd-seekable crate at version 1.7.0 (later versions of the crate use the older Zstd).

const LEVEL: usize = 10;
const FRAME_SIZE: usize = 256;
fn compress(input: &[u8], w: &mut Vec<u8>) -> Result<(), ChangeError> {
    info!("compressing with ZStd {}", zstd_seekable::version().to_str().unwrap());
    // Seekable compression stream: level 10, 256-byte frames.
    let mut cstream = zstd_seekable::SeekableCStream::new(LEVEL, FRAME_SIZE).unwrap();
    let mut output = [0; 4096];
    let mut input_pos = 0;
    // Feed the input through the stream, writing compressed chunks as they are produced.
    while input_pos < input.len() {
        let (out_pos, inp_pos) = cstream.compress(&mut output, &input[input_pos..])?;
        w.write_all(&output[..out_pos])?;
        input_pos += inp_pos;
    }
    // Flush whatever remains, including the final seek table.
    while let Ok(n) = cstream.end_stream(&mut output) {
        if n == 0 {
            break;
        }
        w.write_all(&output[..n])?;
    }
    Ok(())
}

The latest Pijul (beta.4) first tries to find ZStd >= 1.4.0 and < 1.5.0 on the system via pkg-config, and uses its own version (ZStd 1.4.8) if that doesn't work. On Windows, it always uses its own version.

Pijul doesn't do any of that on its own; the much smaller zstd-seekable crate is responsible for handling all of that.

@P-E-Meunier

About the faulty reproduction scenario: yes, this is a large file filled with zeros. We observe the same problem with actual text/code files if they are large enough, which is how this issue was found.

@yoniko
Contributor

yoniko commented Mar 7, 2023

A frame size of 256 bytes is rather small.
Couple that with the fact that the seekable compressor uses ZSTD_compressStream, and we basically hit the scenario that #3426 tries to address.
I haven't benchmarked it, but I believe that using ZSTD_compressStream2 in the seekable compressor could solve the problem.

@P-E-Meunier - mind sharing why you have chosen such a small frame size? While a file of zeroes compresses very well, more complex files will not compress as well when the frame size is so small. Additionally, you pay a large overhead for compressing / decompressing every frame.
Have you benchmarked other frame sizes?
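As a rough illustration of that suggestion (a sketch only, not the seekable compressor's actual code): when each frame is small and its size is known in advance, the streaming API can be told so with ZSTD_CCtx_setPledgedSrcSize(), which lets the library size its internal parameters to the frame rather than to the level's default window; ZSTD_compressStream2() with ZSTD_e_end then completes the frame in one call.

#include <zstd.h>

/* Sketch: compress one small frame with the streaming API while declaring
 * its exact size up front, so zstd can adapt its parameters to a tiny input. */
static size_t compress_one_small_frame(ZSTD_CCtx* cctx,
                                       void* dst, size_t dstCapacity,
                                       const void* src, size_t srcSize,
                                       int level)
{
    ZSTD_outBuffer out = { dst, dstCapacity, 0 };
    ZSTD_inBuffer  in  = { src, srcSize, 0 };
    size_t remaining;

    ZSTD_CCtx_reset(cctx, ZSTD_reset_session_only);   /* start a new frame, keep parameters */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
    ZSTD_CCtx_setPledgedSrcSize(cctx, (unsigned long long)srcSize);

    /* With dstCapacity >= ZSTD_compressBound(srcSize), ZSTD_e_end finishes
     * the whole frame in a single call (remaining == 0). */
    remaining = ZSTD_compressStream2(cctx, &out, &in, ZSTD_e_end);
    if (ZSTD_isError(remaining)) return remaining;
    return out.pos;   /* number of compressed bytes written to dst */
}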

@P-E-Meunier

I did this a few years ago. I remember benchmarking a few sizes and not noticing a large difference there, but I will try a larger frame size, thanks for the advice.

I haven't followed the changes in ZStd closely, but does that explain why ZStd 1.5 was orders of magnitude slower than 1.4.9 in our particular case?

@Cyan4973
Contributor

Cyan4973 commented Mar 7, 2023

does that explain why ZStd 1.5 was orders of magnitude slower

Yes, it does.
This scenario is possibly the worst case for v1.5.0,
and the specific combination of parameters selected is so unusual
that we would not have expected it to be a real-world use case.

mind sharing why you have chosen such a small frame size? while a file of zeroes compresses very well, more complex files will not compress as well when the frame size is so small

I did this a few years ago. I remember benchmarking a few sizes and not noticing a large difference there, but I will try a larger frame size, thanks for the advice.

That part is surprising.
256 bytes is extremely small.
Save for some special corner cases (such as a bunch of zeroes), cutting input into independent frames of 256 bytes should result in a very bad compression ratio, almost no compression at all.
Selecting larger sizes should result in much better ratios. Presuming the source data is compressible, the differences from 256 bytes to 1K to 4K to 16K should be large and obvious.
Hence, it's surprising to read that you did not notice any large difference.
So I wonder if we use FRAME_SIZE to mean the same thing.
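To make that point concrete, here is a small, self-contained sketch (synthetic data, illustrative sizes, not from the thread) that compresses the same buffer once as a single frame and once as independent 256-byte frames, so the ratio cost of very small frames can be observed directly:

#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

/* Compress src as independent `chunk`-byte frames and return the total
 * compressed size (0 on error). Each frame is compressed on its own, so no
 * matches can cross chunk boundaries and every frame pays its own header. */
static size_t chunked_compressed_size(const unsigned char* src, size_t srcSize,
                                      size_t chunk, int level)
{
    size_t const dstCapacity = ZSTD_compressBound(chunk);
    void* const dst = malloc(dstCapacity);
    size_t total = 0, pos;
    if (dst == NULL) return 0;
    for (pos = 0; pos < srcSize; pos += chunk) {
        size_t const n = (srcSize - pos < chunk) ? srcSize - pos : chunk;
        size_t const c = ZSTD_compress(dst, dstCapacity, src + pos, n, level);
        if (ZSTD_isError(c)) { free(dst); return 0; }
        total += c;
    }
    free(dst);
    return total;
}

int main(void)
{
    size_t const srcSize = 1 << 20;                 /* 1 MiB of synthetic data */
    unsigned char* const src = malloc(srcSize);
    size_t const dstCapacity = ZSTD_compressBound(srcSize);
    void* const dst = malloc(dstCapacity);
    size_t i, whole, chunked;
    if (src == NULL || dst == NULL) return 1;
    for (i = 0; i < srcSize; i++) src[i] = (unsigned char)(i % 251);  /* compressible repeating pattern */
    whole   = ZSTD_compress(dst, dstCapacity, src, srcSize, 10);      /* one frame at level 10 */
    chunked = chunked_compressed_size(src, srcSize, 256, 10);         /* independent 256-byte frames */
    printf("single frame: %zu bytes, independent 256-byte frames: %zu bytes\n", whole, chunked);
    free(dst); free(src);
    return 0;
}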

@yoniko
Contributor

yoniko commented Mar 7, 2023

So I wonder if we use FRAME_SIZE to mean the same thing.

According to the posted code, 256 is the frame size fed into the zstd-seekable library, so I believe it's the same thing.

@P-E-Meunier

P-E-Meunier commented Mar 7, 2023

Just tested our slowest benchmark, and 4K is indeed very fast with ZStd ≥ 1.5. Thanks again!

Cyan4973 added a commit that referenced this issue Mar 10, 2023
As reported by @P-E-Meunier in #2662 (comment),
seekable format ingestion speed can be particularly slow
when selected `FRAME_SIZE` is very small,
especially in combination with the recent row_hash compression mode.
The specific scenario mentioned was `pijul`,
using frame sizes of 256 bytes and level 10.

This is improved in this PR,
by providing approximate parameter adaptation to the compression process.

Tested locally on a M1 laptop,
ingestion of `enwik8` using `pijul` parameters
went from 35sec. (before this PR) to 2.5sec (with this PR).
For the specific corner case of a file full of zeroes,
this is even more pronounced, going from 45sec. to 0.5sec.

These benefits are unrelated to (and come on top of) other improvement efforts currently being made by @yoniko for the row_hash compression method specifically.

The `seekable_compress` test program has been updated to allow setting the compression level,
in order to produce these performance results.
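For context on what parameter adaptation buys here (a sketch, not the code from that commit): when zstd knows the source size in advance, it already derives much smaller compression parameters. ZSTD_getCParams() from the experimental API makes this visible:

#define ZSTD_STATIC_LINKING_ONLY   /* ZSTD_getCParams() is part of the experimental API */
#include <stdio.h>
#include <zstd.h>

int main(void)
{
    /* Parameters zstd would pick at level 10 for an unknown source size
     * versus a known 256-byte frame: the window (and the match-finder
     * tables sized from it) shrinks sharply for tiny inputs. */
    ZSTD_compressionParameters const unknown = ZSTD_getCParams(10, 0 /* size unknown */, 0);
    ZSTD_compressionParameters const tiny    = ZSTD_getCParams(10, 256, 0);
    printf("level 10, unknown size: windowLog=%u hashLog=%u\n", unknown.windowLog, unknown.hashLog);
    printf("level 10, 256-byte src: windowLog=%u hashLog=%u\n", tiny.windowLog, tiny.hashLog);
    return 0;
}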