Question in understanding Zstd Digested Dictionaries #3808

Closed
Tristan617 opened this issue Oct 31, 2023 · 2 comments


I understand Zstd dictionaries to basically be a header with some metadata, and then a body of raw content bytes.

These dictionaries can then be digested for compression, or digested for decompression. My mental model for this operation is that it serves as a form of compression pre-processing.

1. Is that the correct way of thinking about this? What is actually happening when we digest a dictionary for compression, or for decompression?

I'm running an experiment right now with a max dictionary size of 100k, and I've observed two things about digested dictionaries (which I am digesting by reference).
I'm using ZSTD_sizeof_DDict and ZSTD_sizeof_CDict to get the sizes.
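
For reference, the measurement looks roughly like this (a simplified sketch; `dictBuffer`/`dictSize` stand in for my actual ~100k dictionary):

```c
/* Simplified sketch of the size measurement. The byReference constructors
 * and the ZSTD_sizeof_* helpers are part of the experimental API, hence
 * the define below. */
#define ZSTD_STATIC_LINKING_ONLY
#include <zstd.h>
#include <stdio.h>

static void report_digested_sizes(const void* dictBuffer, size_t dictSize, int level)
{
    /* Digest by reference: the dictionary content itself is not copied,
     * so ZSTD_sizeof_* reports only the digested structures. */
    ZSTD_CDict* const cdict = ZSTD_createCDict_byReference(dictBuffer, dictSize, level);
    ZSTD_DDict* const ddict = ZSTD_createDDict_byReference(dictBuffer, dictSize);

    printf("raw dictionary : %zu bytes\n", dictSize);
    printf("CDict          : %zu bytes\n", ZSTD_sizeof_CDict(cdict));
    printf("DDict          : %zu bytes\n", ZSTD_sizeof_DDict(ddict)); /* always 27352 in my runs */

    ZSTD_freeCDict(cdict);
    ZSTD_freeDDict(ddict);
}
```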

Firstly, the digested decompression dictionary size is always equal to 27352 bytes.
2. Is that to be expected? And does it suggest that digesting a dictionary for decompression is some static workload?

Secondly, the digested compression dictionary size appears to be on average roughly half the size of the raw dictionary.
3. Does this just totally depend on the underlying data, so it's basically a coincidence that I'm observing this pattern, or is there some rule of thumb about digesting for compression that explains it?

Thanks!
Tristan

Cyan4973 self-assigned this Nov 1, 2023

Cyan4973 commented Nov 1, 2023

In terms of mental models, there are probably several ways to "see" this operation that can make sense; for reference, here is how I see it:

  • Dictionaries represent a sort of "past". It's as if we had been compressing up to that point, and would resume from that point. Same thing for decompression.
  • Choosing an "optimal" past for a myriad of (hopefully somewhat similar) inputs is an NP-hard problem. We have some rough approximation algorithms that try to get there, and that seem relatively effective, but there are probably other ways to do the same job, sometimes better, and certainly corner cases that would be better served by some very specific hand-selected strategies. Overfitting is a thing for dictionaries, though.
  • A CDict is a "digested" dictionary for compression. A DDict is the same for decompression. Both are developed from the same source, but developed into very different objects.
  • The purpose of CDict and DDict is to be able to start compression (respectively decompression) without any extra initialization workload, because everything is already developed "ready to use" within these objects. This makes a large performance difference, especially at scale. A minimal usage sketch follows right after this list.
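
For illustration, reusing the digested objects looks roughly like this (a minimal sketch; `cctx`, `dctx`, `cdict` and `ddict` are assumed to have been created once beforehand, e.g. by reference as above):

```c
/* Minimal sketch: one CDict/DDict, created once, shared across many
 * operations. The per-call functions below do no dictionary loading or
 * table building; they start from the pre-developed "past". */
#include <zstd.h>

size_t compress_with_cdict(ZSTD_CCtx* cctx, const ZSTD_CDict* cdict,
                           void* dst, size_t dstCapacity,
                           const void* src, size_t srcSize)
{
    return ZSTD_compress_usingCDict(cctx, dst, dstCapacity, src, srcSize, cdict);
}

size_t decompress_with_ddict(ZSTD_DCtx* dctx, const ZSTD_DDict* ddict,
                             void* dst, size_t dstCapacity,
                             const void* src, size_t srcSize)
{
    /* The DDict already holds the developed entropy tables. */
    return ZSTD_decompress_usingDDict(dctx, dst, dstCapacity, src, srcSize, ddict);
}
```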

the digested decompression dictionary size is always equal to 27352 bytes.
Is that to be expected?

When the content is not included (byRef mode), then yes, the DDict is expected to have a stable size.
Why it's ~27 KB I can't tell offhand; I would need to go into the source code.
It seems a bit higher than I remember, but not by that much.

The main purpose of the DDict, on top of referencing the dictionary content, is to develop the entropy tables. The FSE tables used for the Sequences should cost ~5 KB, while the Huffman tables could use anywhere between 2 KB and 16 KB, and probably always settle on 16 KB, because while it costs a bit more to build, it then runs faster at decoding time.
There are still a few KB to find to reach 27 KB; that's where scrubbing the source code would be useful.
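
One way to check the "stable size" point is to query the estimated byRef DDict footprint across different dictionary sizes (a sketch relying on the experimental ZSTD_estimateDDictSize helper; exact numbers vary by zstd version):

```c
/* Sketch: in byRef mode the DDict footprint comes from its internal tables,
 * not from the dictionary content, so the estimate should stay flat across
 * dictionary sizes. Requires the experimental API. */
#define ZSTD_STATIC_LINKING_ONLY
#include <zstd.h>
#include <stdio.h>

int main(void)
{
    size_t const dictSizes[] = { 10 * 1024, 100 * 1024, 1024 * 1024 };
    for (int i = 0; i < 3; i++) {
        printf("dict %8zu bytes -> estimated byRef DDict: %zu bytes\n",
               dictSizes[i],
               ZSTD_estimateDDictSize(dictSizes[i], ZSTD_dlm_byRef));
    }
    return 0;
}
```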

Secondly, the digested compression dictionary size appears to be on average roughly half the size of the raw dictionary.

This is likely dependent on the compression level.
I would expect the search structures, which cost the majority of the CDict size, to differ depending on the compression level.
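
To see the level dependence concretely, one could compare estimated CDict footprints across levels (a sketch using the experimental ZSTD_estimateCDictSize helper; values vary by zstd version and dictionary size):

```c
/* Sketch: the CDict footprint is dominated by the match-finding (search)
 * structures, whose size depends on the parameters tied to the level.
 * Note: this simple estimator assumes the dictionary content is copied. */
#define ZSTD_STATIC_LINKING_ONLY
#include <zstd.h>
#include <stdio.h>

int main(void)
{
    size_t const dictSize = 100 * 1024;   /* ~100k dictionary, as in the experiment */
    int const levels[] = { 1, 3, 9, 19 };
    for (int i = 0; i < 4; i++) {
        printf("level %2d -> estimated CDict: %zu bytes\n",
               levels[i], ZSTD_estimateCDictSize(dictSize, levels[i]));
    }
    return 0;
}
```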

Tristan617 (Author) commented

Thank you for the detailed explanation!
