Could borg make use of zstd dictionary training? #5633

Open
Beiri22 opened this issue Jan 17, 2021 · 8 comments

Comments

@Beiri22

Beiri22 commented Jan 17, 2021

As far as I understand, borg applies compression at the chunk level and therefore cannot harvest the compressor's full potential. I've read that zstd is able to train and later use explicit dictionaries, so I wondered whether borg could use this feature to improve compression. Imagine a recompress / optimization routine that first trains a dictionary based on all the chunks in the repo and then recompresses everything with this pre-trained dictionary; it might also be used for all further compression. It's just an idea that might be worth discussing.
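A minimal sketch of what such a routine could look like, using the python-zstandard bindings purely for illustration (borg binds libzstd itself, so none of the names below are borg's actual API); chunks is assumed to be a list of raw, uncompressed chunk bytes:

import zstandard as zstd

def train_and_recompress(chunks, dict_size=1024 * 1024, level=22):
    # chunks: list of raw (uncompressed) chunk bytes.
    # 1) Train a shared dictionary on the chunks.
    dictionary = zstd.train_dictionary(dict_size, chunks)

    # 2) Recompress every chunk with the trained dictionary.
    cctx = zstd.ZstdCompressor(level=level, dict_data=dictionary)
    recompressed = [cctx.compress(chunk) for chunk in chunks]

    # The dictionary itself must be stored in the repo as well, because
    # decompression needs exactly the same dictionary bytes.
    dctx = zstd.ZstdDecompressor(dict_data=dictionary)
    assert all(dctx.decompress(c) == chunk for c, chunk in zip(recompressed, chunks))

    return dictionary.as_bytes(), recompressed

Where to store the dictionary and when to retrain it (per repo, per file type, on prune) would probably be the main design questions for borg.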

@ThomasWaldmann
Member

Theoretically, that would be possible to implement.

But I am not sure whether it would be worth it: storage is often cheap, and recompressing a whole repo takes quite some time.

Maybe you could do an experiment: compare how much space a repo takes with normal zstd,X compression, then recompress it as you described and check the space usage again.

If that is more than a few percent improvement, it might be worth implementing that.

@ThomasWaldmann
Member

A related not-yet-implemented feature would be to keep compression dicts between different chunks and not start from scratch each time.

I guess that if subsequent chunks are from the same file / file type, maybe even directly following chunks (as during an initial backup run), that might improve compression a bit. Not sure whether it could also have a negative impact when the to-be-compressed data is of a very different kind, e.g. when switching from a binary file's chunks to a text file's chunks.
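For illustration, a rough sketch of that idea with the python-zstandard bindings (again not borg's actual compression code): one compressor context is shared across a sequence of chunks and each chunk is flushed as a block, so later chunks can back-reference earlier data:

import zstandard as zstd

def compress_chunk_sequence(chunks, level=3):
    # One shared compression context for the whole sequence; the zstd
    # window/history carries over from chunk to chunk instead of
    # starting from scratch each time.
    cctx = zstd.ZstdCompressor(level=level)
    cobj = cctx.compressobj()
    pieces = []
    for chunk in chunks:
        piece = cobj.compress(chunk) + cobj.flush(zstd.COMPRESSOBJ_FLUSH_BLOCK)
        pieces.append(piece)
    pieces.append(cobj.flush(zstd.COMPRESSOBJ_FLUSH_FINISH))
    # Caveat: all pieces together form ONE zstd frame, so a single chunk
    # can no longer be decompressed in isolation, unlike borg's current
    # per-chunk compression.
    return pieces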

@Beiri22
Author

Beiri22 commented Jan 18, 2021

I could use an uncompressed repo for that and then compress the whole repo with an external zstd tool; but to really try compressing the individual chunks ... I would need to access them separately. Is there any way to export all chunks to individual files?

@Beiri22
Author

Beiri22 commented Jan 18, 2021

This is just a very preliminary comparison, because I do not know how to extract all individual chunks.

REPO_UNCOMPRESSED (1.7 GB) >> external zstd level 22 over the whole repo >> 575 MB
REPO_SETTINGZSTD22 (818 MB) >> external zstd level 22 over the whole repo >> 647 MB

Nearly the same size (575 MB) is obtained when training a 1 MB dictionary on the full repo directory and then using this dictionary to compress the whole repo at level 22.

Having the individual chunks as separate files would provide a more realistic scenario!

@ThomasWaldmann
Member

You could try to borg init a new test repo, then edit the repo config and use a very small segment file size (default: 500 MB, try 1 kB or so). I guess it will then start a new segment file for each chunk, but I never tried that.
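If it helps: on borg 1.x that should mean editing the config file inside the repository directory and shrinking max_segment_size (the value is in bytes; key name and default here are from memory, so double-check against your repo's config), roughly like this:

[repository]
...
# default is about 500 MB (524288000 bytes); use something tiny for this experiment
max_segment_size = 1024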

@Beiri22
Author

Beiri22 commented Jan 18, 2021

I tried that. For 4352 unique chunks I got 3880 segment files. Compressing those as individual files without a dictionary, with trained dictionaries of various sizes (dXXXk = dictionary size in kB), or as one tar of all files results in the following sizes:

531M ./tar.zstd -- not individual
712M ./compressed_d3000k
720M ./compressed_d2000k
725M ./compressed_d5000k
726M ./compressed_d1500k
726M ./compressed_d4000k
727M ./compressed_d1250k
730M ./compressed_d10000k
731M ./compressed_d1000k
733M ./compressed_d750k
736M ./compressed_d500k
745M ./compressed_d200k
750M ./compressed_d100k
755M ./compressed_nodict
1.6G ./orig

I've not changed any dictionary training settings. This does not look as promising as expected. I don't know whether keeping the compression dicts between chunks would result in performance similar to tar.zstd.
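For what it's worth, the same sweep could also be scripted with the python-zstandard bindings instead of the zstd CLI; a sketch (the dictionary sizes and level are just the values from the table above, nothing borg-specific):

import zstandard as zstd

def dict_size_sweep(paths, dict_sizes_kb=(100, 200, 500, 1000, 3000), level=22):
    # Compare the total size of individually compressed files for several
    # trained dictionary sizes against a no-dictionary baseline.
    samples = [open(p, "rb").read() for p in paths]
    totals = {}
    cctx = zstd.ZstdCompressor(level=level)
    totals["nodict"] = sum(len(cctx.compress(s)) for s in samples)
    for kb in dict_sizes_kb:
        d = zstd.train_dictionary(kb * 1024, samples)
        cctx = zstd.ZstdCompressor(level=level, dict_data=d)
        totals[f"d{kb}k"] = sum(len(cctx.compress(s)) for s in samples)
    return totals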

@Beiri22
Author

Beiri22 commented Feb 16, 2021

You wrote about a related not-yet-implemented feature: keeping compression dicts between different chunks and not starting from scratch each time. Is this in active consideration?

@ThomasWaldmann
Member

I am not currently working on that, but feel free to try it if you want.

@ThomasWaldmann modified the milestone: 2.0.0b3 on Sep 19, 2022