Does ZSTD_flushStream have adverse impact on compression ratio? #900

Closed
johnlanni opened this issue Oct 25, 2017 · 9 comments

@johnlanni

I'm using the ZlibWrapper. When the zlib flush flag is Z_SYNC_FLUSH, the z_deflate wrapper function is implemented with ZSTD_flushStream.
In my case I use a trained zstd dictionary, and when I set Z_SYNC_FLUSH, z_deflate writes the output data to the buffer immediately; however, the compression ratio is much lower than with Z_NO_FLUSH & Z_FINISH, which uses ZSTD_endStream at the end instead of flushing immediately.
I want a high compression ratio and also need the flush feature. Are these two goals contradictory?

@terrelln
Contributor

If I understand correctly, you are using both flush and dictionaries, is that right? In general people will use either streaming with flush, or dictionaries, but not both. If you can provide some more details I can help find the best setup.

  • What's your use case, and why do you need to flush?
  • How large is each chunk between flush calls?
  • How much data in total are you compressing: kilobytes, megabytes, or gigabytes?

@Cyan4973
Contributor

Cyan4973 commented Nov 1, 2017

Yes, ZSTD_flushStream() impacts compression ratio.
By requiring an immediate flush, it forces the creation of a block.
The smaller the block, the larger the impact on compression ratio.
Creating a new compressed block introduces a "fixed cost", in the form of headers and table descriptors. The smaller the block, the comparatively larger this "fixed cost" becomes.

There are techniques to reduce this "fixed cost". At a minimum, Zstandard defines some "static" tables which cost almost nothing to send, and are useful for really small blocks.
We can do better, but that requires a well-defined scenario to work on and optimise.
I guess going further will require more information.
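
A minimal sketch of how to observe this effect (record contents, record count, and buffer sizes are arbitrary assumptions; error checks omitted): compress the same small records once with ZSTD_flushStream() after every record, and once with only ZSTD_endStream() at the end, then compare the output sizes.

    #include <stdio.h>
    #include <string.h>
    #include <zstd.h>

    /* Compress nbRecords copies of `record` into one frame and return the
     * total compressed size; optionally flush after every record. */
    static size_t compress_records(int flushEachRecord, const char* record, int nbRecords)
    {
        static char dst[1 << 20];
        ZSTD_outBuffer out = { dst, sizeof(dst), 0 };
        ZSTD_CStream* zs = ZSTD_createCStream();
        ZSTD_initCStream(zs, 19);
        for (int i = 0; i < nbRecords; i++) {
            ZSTD_inBuffer in = { record, strlen(record), 0 };
            ZSTD_compressStream(zs, &out, &in);
            if (flushEachRecord)
                ZSTD_flushStream(zs, &out);   /* forces a block boundary here */
        }
        ZSTD_endStream(zs, &out);             /* terminate the frame */
        ZSTD_freeCStream(zs);
        return out.pos;
    }

    int main(void)
    {
        const char* record = "short repetitive record\n";   /* ~24 bytes */
        printf("flush per record:      %zu bytes\n", compress_records(1, record, 1000));
        printf("no intermediate flush: %zu bytes\n", compress_records(0, record, 1000));
        return 0;
    }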

@edo1

edo1 commented Nov 6, 2019

I have about 1.7 MB of uncompressed test data: 70,000 records (20-30 bytes per record).
This data has a lot of repeating patterns, so the compression ratio is good.
gzip -9 compresses it to 40 KB.
zstd -19 compresses it to 32 KB.

I need to flush after every record. Every compressed record is stored independently.
With zlib, Z_SYNC_FLUSH, and the "00 00 FF FF" tail stripped, I got 324 KB of output (so the overhead is about 4 bytes per flush).
With zstd and ZSTD_flushStream I got 684 KB of output!

@edo1

edo1 commented Nov 6, 2019

BTW, I looked into the compressed file and was surprised by the repetition of one string ("finish").
This string occurs 4000+ times in the uncompressed data and 197 times (!!!) in the compressed data. How could this happen?!?

libzstd version 1.3.8 (Debian stable)
test code:

    #include <unistd.h>   /* write() */
    #include <zstd.h>

    ZSTD_inBuffer  zin;    /* filled with the current record */
    ZSTD_outBuffer zout;   /* zout.dst points to a 16384-byte output buffer */

    ZSTD_CStream* zs = ZSTD_createCStream();
    ZSTD_initCStream(zs, 19);
    // some additional initialization here
    for (;;) {
        // read into zin here
        zout.pos  = 0;
        zout.size = 16384;
        ZSTD_compressStream(zs, &zout, &zin);
        ZSTD_flushStream(zs, &zout);      /* forces a block boundary after every record */
        write(outfd, zout.dst, zout.pos);
    }

@edo1

edo1 commented Nov 6, 2019

I repeated the test with libzstd 1.4.3 (from Debian testing), using ZSTD_compressStream2(zs,&zout,&zin,ZSTD_e_flush) instead of ZSTD_compressStream/ZSTD_flushStream: no change at all.
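
For reference, a minimal sketch of that variant, reusing the same hypothetical zs/zin/zout/outfd names as the snippet above (error checks omitted); with ZSTD_e_flush a single call both compresses and flushes the record:

    for (;;) {
        // read into zin here
        zout.pos  = 0;
        zout.size = 16384;
        /* One call replaces the ZSTD_compressStream + ZSTD_flushStream pair;
         * with a large enough output buffer, a return value of 0 means the
         * record has been fully compressed and flushed. */
        ZSTD_compressStream2(zs, &zout, &zin, ZSTD_e_flush);
        write(outfd, zout.dst, zout.pos);
    }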

@edo1

edo1 commented Nov 6, 2019

What's your use case, and why do you need to flush?

I write the data (a journal) to persistent storage (NOR flash) and I need commits to storage to be consistent (a record must be readable even if a shutdown occurs directly after the write).

How large is each chunk between flush calls?

Usually 20-30 bytes, sometimes larger (up to several KB).

How much data in total are you compressing: kilobytes, megabytes, or gigabytes?

Compressed data is stored in blocks of up to 64 KB (100-200 KB of uncompressed data with zlib).

@Cyan4973
Contributor

Cyan4973 commented Nov 6, 2019

This use case is not yet well-optimized.
20-30 bytes is really small, and there are a number of challenges to beat "raw" representation.

The most important one is that, by triggering a "flush", it effectively creates a block.
A block header is at minimum 3 bytes, but there are a number of additional internal headers for compressed blocks, which push this budget towards ~10 bytes.

A full block header with complete statistics is more like ~100 bytes, which is obviously too large for your use case. Fortunately, the compressor detects that, so it will undo some of these steps, only producing headers when they generate a benefit, up to sending the data uncompressed if need be.

The real issue here is that, by default, the streaming interface doesn't know what's going to happen next. Hence each decision is "local". When creating a block of 20-30 bytes of content, it's very likely that the most sensible decision is to send it uncompressed, or partially compressed but using only some "default" statistics tables. It's extremely likely that the "literals" part will be sent uncompressed.

If the algorithm knew in advance that it's going to compress a bunch of similar data, it could create a kind of "shared statistics" that would be produced once and re-used for all blocks. But because it doesn't know that, it is obliged to make a local decision each time, and it never reaches the point where some efficient re-usable statistics are produced.

There are likely some other issues, such as a mismatch between the statistics level 19 believes it's using and the real statistics employed once the impact of small block size is taken into consideration, making the parser's choices ill-founded. It's likely that using strategy lazy2 or btlazy2 would actually be better, because these strategies do not depend on statistics. You could try levels 11-15 to see if that improves things.

The "best" solution would require a custom interface and dedicated development. Unfortunately, that's a lot of investment, likely too much to justify it.

@Cyan4973
Contributor

Cyan4973 commented Nov 6, 2019

I need to flush after every record. Every compressed record is stored independently.

Note that, even if each record is stored as an independent block, each block still needs to be retrieved in order, so that decompression instructions make sense.

This use case seems a good fit for dictionary compression, where each record would be truly independent, and hence could be extracted and decompressed independently.
It would also help with the statistics issue, since a dictionary contains statistics, ready to be shared.

This is a very different setup though, and is likely going to change your existing data pipeline too much.

Yet, it's an interesting idea to keep in mind, should you have in the future some similar scenario to develop from scratch, where you would not be tied by some existing design.
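
For illustration only, a rough sketch of that setup (not a drop-in change to the journal format; all names and sizes below are placeholders): train a dictionary offline from representative records, then compress each record as its own small frame with it.

    #include <zdict.h>
    #include <zstd.h>

    /* Train a dictionary once from representative sample records.
     * `samples` holds the records back to back; `sampleSizes[i]` is the
     * length of the i-th record. */
    static size_t train_dict(void* dictBuf, size_t dictCapacity,
                             const void* samples, const size_t* sampleSizes,
                             unsigned nbSamples)
    {
        return ZDICT_trainFromBuffer(dictBuf, dictCapacity,
                                     samples, sampleSizes, nbSamples);
    }

    /* Compress a single record as an independent frame, re-using the shared
     * dictionary loaded once into `cdict` via ZSTD_createCDict(). */
    static size_t compress_record(ZSTD_CCtx* cctx, const ZSTD_CDict* cdict,
                                  void* dst, size_t dstCapacity,
                                  const void* record, size_t recordSize)
    {
        return ZSTD_compress_usingCDict(cctx, dst, dstCapacity,
                                        record, recordSize, cdict);
    }

Decompression would go through ZSTD_decompress_usingDDict() with the same dictionary bytes, so each record remains individually readable.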

@edo1

edo1 commented Nov 6, 2019

A block header is at minimum 3 bytes, but there are a number of additional internal headers for compressed blocks, which push this budget towards ~10 bytes.

So zstd does not make sense in my case.
That's OK; one format cannot meet all needs.

This use case seems a good fit for dictionary compression

I thought about it. There is no guarantee that a static dictionary will stay relevant forever, and updating the dictionary looks like a headache.
