Could borg make use of zstd dictionary training? #5633
Theoretically, that would be possible to implement. But I'm not sure whether it would be worth it - storage is often cheap and recompressing a whole repo takes quite some time. Maybe you could do an experiment: compare how much space a repo takes with normal zstd,X compression, then recompress it as you described and check space usage again. If that gives more than a few percent improvement, it might be worth implementing.
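A minimal sketch of such an experiment using the python-zstandard bindings (this is not anything borg ships; the `chunks/*` path, dictionary size, and compression level are placeholders, and it assumes the chunks have already been exported as individual files):

```python
# Compare per-chunk zstd compression with and without a trained dictionary.
# Requires the "zstandard" package.
import glob
import zstandard as zstd

# Placeholder path: one file per exported chunk.
samples = [open(p, "rb").read() for p in glob.glob("chunks/*")]

# Train a 1 MiB dictionary from the chunk samples.
dict_data = zstd.train_dictionary(1024 * 1024, samples)

plain = zstd.ZstdCompressor(level=19)
trained = zstd.ZstdCompressor(level=19, dict_data=dict_data)

size_plain = sum(len(plain.compress(s)) for s in samples)
size_trained = sum(len(trained.compress(s)) for s in samples)
print(f"without dict: {size_plain} bytes, with dict: {size_trained} bytes")
```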
A related, not-yet-implemented feature would be to keep the compression dictionary between different chunks instead of starting from scratch each time. If subsequent chunks are from the same file or file type - maybe even directly following chunks, as in an initial backup run - that might improve compression a bit. Not sure whether it could also have a negative impact when the to-be-compressed data is of a very different kind, like when switching from a binary file's chunks to a text file's chunks.
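For illustration only, a minimal sketch (again with the python-zstandard bindings) of what "not starting from scratch" could look like: one compressor object is shared across consecutive chunks and flushed per chunk, so later chunks can reference data from earlier ones. This is not how borg compresses today; the function name and parameters are placeholders.

```python
import zstandard as zstd

def compress_chunks_shared_history(chunks, level=9):
    """Compress an iterable of byte chunks with one shared zstd context.

    Each chunk is flushed as its own block, but the compressor keeps its
    window/history, so repetitions across chunks can be exploited.
    """
    cobj = zstd.ZstdCompressor(level=level).compressobj()
    out = []
    for chunk in chunks:
        out.append(cobj.compress(chunk) + cobj.flush(zstd.COMPRESSOBJ_FLUSH_BLOCK))
    out.append(cobj.flush(zstd.COMPRESSOBJ_FLUSH_FINISH))
    return out
```

The catch is that the output of such a scheme is only decompressible in order, as one continuous stream, which does not fit well with reading individual chunks independently.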
I could use an uncompressed repo for that and then compress the whole repo with an external zstd tool; but to really try compressing the individual chunks, I would need to access them separately. Is there any way to export all chunks to individual files?
This is just a very preliminary comparison, because I do not know how to extract all individual chunks.

- uncompressed repo: 1.7 GB
- whole repo externally compressed with zstd level 22: 575 MB
- nearly the same size (575 MB) when training a 1 MB dictionary on the full repo directory and then compressing the whole repo with that dictionary at level 22

Having the individual chunks as separate files would provide a more realistic scenario!
You could try to
I tried that. For 4352 unique chunks I got 3880 segment files. Compressing those as individual files (without a dictionary, and with dictionaries of various sizes; dxxx = dict size in KB), or as a single tar containing all files, results in the following file sizes:

531M ./tar.zstd (not individual chunks)

I've not changed any dictionary training settings. It does not look as promising as expected. I don't know whether keeping the compression dicts between chunks would give similar performance to tar.zstd?
You wrote about a related not-yet-implemented feature: keeping compression dicts between different chunks and not starting from scratch each time. Is this in active consideration? |
I am not currently working on that, but feel free to try it if you want.
As far as I understand, borg applies compression at the chunk level and therefore cannot exploit zstd's full potential. I've read that zstd is able to train and later use explicit dictionaries. So I thought about whether borg could use this feature to improve compression. Imagine some recompress / optimization routine that first trains a dictionary based on all the chunks in the repo and then recompresses everything with this pre-trained dictionary; it might also be used for all further compression. It's just an idea which might be worth discussing.
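A very rough sketch of what such a recompression pass could look like, again with the python-zstandard bindings. `iter_raw_chunks` and `store_chunk` are purely hypothetical stand-ins for borg's repository access, and the trained dictionary would have to be persisted with the repo, since decompression needs the same dictionary:

```python
import zstandard as zstd

def recompress_with_dictionary(iter_raw_chunks, store_chunk, level=19):
    """Hypothetical optimization pass: train a dict over all chunks, then recompress."""
    chunks = list(iter_raw_chunks())                 # raw, decompressed chunk data
    dict_data = zstd.train_dictionary(1024 * 1024, chunks)

    cctx = zstd.ZstdCompressor(level=level, dict_data=dict_data)
    for chunk in chunks:
        store_chunk(cctx.compress(chunk))

    # Reading the chunks back later requires the very same dictionary:
    # zstd.ZstdDecompressor(dict_data=dict_data).decompress(stored_bytes)
    return dict_data.as_bytes()                      # persist this alongside the repo
```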