Does ZSTD_flushStream have adverse impact on compression ratio? #900

Closed
johnlanni opened this issue Oct 25, 2017 · 9 comments

@johnlanni

I'm using the ZlibWrapper. When the zlib flush flag is Z_SYNC_FLUSH, the z_deflate wrapper function is implemented with ZSTD_flushStream.
In my case I use a trained zstd dictionary, and when I set Z_SYNC_FLUSH, z_deflate writes the output data to the buffer immediately; however, the compression ratio is much lower than with Z_NO_FLUSH & Z_FINISH, which uses ZSTD_endStream at the end instead of flushing immediately.
I want a high compression ratio and also need the flush feature. Are these two goals contradictory?

@terrelln
Contributor

If I understand correctly, you are using both flush and dictionaries, is that right? In general people will use either streaming with flush, or dictionaries, but not both. If you can provide some more details I can help find the best setup.

  • What's your use case, and why do you need to flush?
  • How large is each chunk between flush calls?
  • How much data in total are you compressing: kilobytes, megabytes, or gigabytes?

@Cyan4973
Contributor

Cyan4973 commented Nov 1, 2017

Yes, ZSTD_flushStream() impacts compression ratio.
By requiring an immediate flush, it forces the creation of a block.
The smaller the block, the larger the impact on compression ratio.
Creating a new compressed block introduces a "fixed cost", in the form of headers and table descriptors. The smaller the block, the comparatively larger this "fixed cost" becomes.

There are techniques to reduce this "fixed cost". At a minimum, Zstandard defines some "static" tables which cost almost nothing to send, and are useful for really small blocks.
We can do better, but that requires a well-defined scenario to work on and optimise.
I guess going further will require more information.
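
A minimal sketch of how to observe this effect (record contents, record count, and buffer sizes are arbitrary assumptions; error checks omitted): compress the same small records once with ZSTD_flushStream() after every record, and once with only ZSTD_endStream() at the end, then compare the output sizes.

    #include <stdio.h>
    #include <string.h>
    #include <zstd.h>

    /* Compress nbRecords copies of `record` into one frame and return the
     * total compressed size; optionally flush after every record. */
    static size_t compress_records(int flushEachRecord, const char* record, int nbRecords)
    {
        static char dst[1 << 20];
        ZSTD_outBuffer out = { dst, sizeof(dst), 0 };
        ZSTD_CStream* zs = ZSTD_createCStream();
        ZSTD_initCStream(zs, 19);
        for (int i = 0; i < nbRecords; i++) {
            ZSTD_inBuffer in = { record, strlen(record), 0 };
            ZSTD_compressStream(zs, &out, &in);
            if (flushEachRecord)
                ZSTD_flushStream(zs, &out);   /* forces a block boundary here */
        }
        ZSTD_endStream(zs, &out);             /* terminate the frame */
        ZSTD_freeCStream(zs);
        return out.pos;
    }

    int main(void)
    {
        const char* record = "short repetitive record\n";   /* ~24 bytes */
        printf("flush per record:      %zu bytes\n", compress_records(1, record, 1000));
        printf("no intermediate flush: %zu bytes\n", compress_records(0, record, 1000));
        return 0;
    }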

@edo1

edo1 commented Nov 6, 2019

I have about 1.7 MB of uncompressed test data: 70,000 records (20-30 bytes per record).
This data has a lot of repeating patterns, so the compression ratio is good.
gzip -9 compresses it to 40 KB.
zstd -19 compresses it to 32 KB.

I need to flush after every record. Every compressed record is stored independently.
With zlib, Z_SYNC_FLUSH, and the "00 00 FF FF" tail stripped, I got 324 KB of output (so the overhead is about 4 bytes per flush).
With zstd and ZSTD_flushStream I got 684 KB of output!

@edo1

edo1 commented Nov 6, 2019

BTW, I looked into the compressed file and was surprised by the repetition of one string ("finish").
This string occurs 4000+ times in the uncompressed data and 197 times (!!!) in the compressed data. How could this happen?!?

libzstd version 1.3.8 (Debian stable)
test code:

    #include <unistd.h>   /* write() */
    #include <zstd.h>

    ZSTD_inBuffer  zin;    /* filled with the current record */
    ZSTD_outBuffer zout;   /* zout.dst points to a 16384-byte output buffer */

    ZSTD_CStream* zs = ZSTD_createCStream();
    ZSTD_initCStream(zs, 19);
    // some additional initialization here
    for (;;) {
        // read into zin here
        zout.pos  = 0;
        zout.size = 16384;
        ZSTD_compressStream(zs, &zout, &zin);
        ZSTD_flushStream(zs, &zout);      /* forces a block boundary after every record */
        write(outfd, zout.dst, zout.pos);
    }

@edo1

edo1 commented Nov 6, 2019

I repeated the test with libzstd 1.4.3 (from Debian testing), using ZSTD_compressStream2(zs,&zout,&zin,ZSTD_e_flush) instead of ZSTD_compressStream/ZSTD_flushStream: no change at all.
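
For reference, a minimal sketch of that variant, reusing the same hypothetical zs/zin/zout/outfd names as the snippet above (error checks omitted); with ZSTD_e_flush a single call both compresses and flushes the record:

    for (;;) {
        // read into zin here
        zout.pos  = 0;
        zout.size = 16384;
        /* One call replaces the ZSTD_compressStream + ZSTD_flushStream pair;
         * with a large enough output buffer, a return value of 0 means the
         * record has been fully compressed and flushed. */
        ZSTD_compressStream2(zs, &zout, &zin, ZSTD_e_flush);
        write(outfd, zout.dst, zout.pos);
    }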

@edo1

edo1 commented Nov 6, 2019

What's your use case, and why do you need to flush?

I write the data (a journal) to persistent storage (NOR flash) and I need commits to storage to be consistent (a record must be readable even if a shutdown occurs directly after the write).

How large is each chunk between flush calls?

Usually 20-30 bytes, sometimes larger (up to several KB).

How much data in total are you compressing: kilobytes, megabytes, or gigabytes?

Compressed data is stored in blocks of up to 64 KB (100-200 KB of uncompressed data with zlib).

@Cyan4973
Contributor

Cyan4973 commented Nov 6, 2019

This use case is not yet well-optimized.
20-30 bytes is really small, and there are a number of challenges to beat "raw" representation.

The most important one is that, by triggering a "flush", it effectively creates a block.
A block header is at minimum 3 bytes, but there are a number of additional internal headers for compressed blocks, which push this budget towards ~10 bytes.

A full block header with complete statistics is more like ~100 bytes, which is obviously too large for your use case. Fortunately, the compressor detects that, so it will undo some of these steps, only producing headers when they generate a benefit, up to sending the data uncompressed if need be.

The real issue here is that, by default, the streaming interface doesn't know what's going to happen next. Hence each decision is "local". When creating a block of 20-30 bytes of content, it's very likely that the most sensible decision is to send it uncompressed, or partially compressed but using only some "default" statistics tables. It's extremely likely that the "literals" part will be sent uncompressed.

If the algorithm knew in advance that it's going to compress a bunch of similar data, it could create a kind of "shared statistics" that would be produced once and re-used for all blocks. But because it doesn't know that, it is obliged to make a local decision each time, and it never reaches the point where some efficient re-usable statistics are produced.

There are likely some other issues, such as a mismatch between the statistics level 19 believes it's using and the real statistics employed once the impact of small block size is taken into consideration, making the parser's choices ill-founded. It's likely that using strategy lazy2 or btlazy2 would actually be better, because these strategies do not depend on statistics. You could try levels 11-15 to see if that improves things.

The "best" solution would require a custom interface and dedicated development. Unfortunately, that's a lot of investment, likely too much to justify it.

@Cyan4973
Contributor

Cyan4973 commented Nov 6, 2019

I need to flush after every record. Every compressed record is stored independently.

Note that, even if each record is stored as an independent block, each block still needs to be retrieved in order, so that decompression instructions make sense.

This use case seems a good fit for dictionary compression, where each record would be truly independent, and hence could be extracted and decompressed independently.
It would also help with the statistics issue, since a dictionary contains statistics, ready to be shared.

This is a very different setup though, and is likely going to change your existing data pipeline too much.

Yet, it's an interesting idea to keep in mind, should you have in the future some similar scenario to develop from scratch, where you would not be tied by some existing design.
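
For illustration only, a rough sketch of that setup (not a drop-in change to the journal format; all names and sizes below are placeholders): train a dictionary offline from representative records, then compress each record as its own small frame with it.

    #include <zdict.h>
    #include <zstd.h>

    /* Train a dictionary once from representative sample records.
     * `samples` holds the records back to back; `sampleSizes[i]` is the
     * length of the i-th record. */
    static size_t train_dict(void* dictBuf, size_t dictCapacity,
                             const void* samples, const size_t* sampleSizes,
                             unsigned nbSamples)
    {
        return ZDICT_trainFromBuffer(dictBuf, dictCapacity,
                                     samples, sampleSizes, nbSamples);
    }

    /* Compress a single record as an independent frame, re-using the shared
     * dictionary loaded once into `cdict` via ZSTD_createCDict(). */
    static size_t compress_record(ZSTD_CCtx* cctx, const ZSTD_CDict* cdict,
                                  void* dst, size_t dstCapacity,
                                  const void* record, size_t recordSize)
    {
        return ZSTD_compress_usingCDict(cctx, dst, dstCapacity,
                                        record, recordSize, cdict);
    }

Decompression would go through ZSTD_decompress_usingDDict() with the same dictionary bytes, so each record remains individually readable.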

@edo1

edo1 commented Nov 6, 2019

A block header is at minimum 3 bytes, but there are a number of additional internal headers for compressed blocks, which push this budget towards ~10 bytes.

So zstd does not make sense in my case.
That's OK; one format cannot meet all needs.

This use case seems a good fit for dictionary compression

I thought about it. There is no guarantee that a static dictionary will stay relevant forever, and updating the dictionary looks like a headache.
