Could borg make use of zstd dictionary training? #5633

Open
Beiri22 opened this issue Jan 17, 2021 · 8 comments

Comments

@Beiri22

Beiri22 commented Jan 17, 2021

As far as I understand, borg applies compression at the chunk level and therefore cannot harvest the compressor's full potential. I've read that zstd is able to train and later use explicit dictionaries, so I wondered whether borg could use this feature to improve compression. Imagine a recompress / optimization routine that first trains a dictionary based on all the chunks in the repo and then recompresses everything with this pre-trained dictionary; it might also be used for all further compression. It's just an idea that might be worth discussing.
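A minimal sketch of what such a routine could look like, using the python-zstandard bindings purely for illustration (borg binds libzstd itself, so none of the names below are borg's actual API); chunks is assumed to be a list of raw, uncompressed chunk bytes:

import zstandard as zstd

def train_and_recompress(chunks, dict_size=1024 * 1024, level=22):
    # chunks: list of raw (uncompressed) chunk bytes.
    # 1) Train a shared dictionary on the chunks.
    dictionary = zstd.train_dictionary(dict_size, chunks)

    # 2) Recompress every chunk with the trained dictionary.
    cctx = zstd.ZstdCompressor(level=level, dict_data=dictionary)
    recompressed = [cctx.compress(chunk) for chunk in chunks]

    # The dictionary itself must be stored in the repo as well, because
    # decompression needs exactly the same dictionary bytes.
    dctx = zstd.ZstdDecompressor(dict_data=dictionary)
    assert all(dctx.decompress(c) == chunk for c, chunk in zip(recompressed, chunks))

    return dictionary.as_bytes(), recompressed

Where to store the dictionary and when to retrain it (per repo, per file type, on prune) would probably be the main design questions for borg.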

@ThomasWaldmann
Member

Theoretically, that would be possible to implement.

But I am not sure whether it would be worth it: storage is often cheap, and recompressing a whole repo takes quite some time.

Maybe you could do an experiment: compare how much space a repo takes with normal zstd,X compression, then recompress it as you described and check the space usage again.

If that is more than a few percent improvement, it might be worth implementing that.

@ThomasWaldmann
Member

A related not-yet-implemented feature would be to keep compression dicts between different chunks and not start from scratch each time.

I guess that if subsequent chunks are from the same file / file type, maybe even directly following chunks (as during an initial backup run), that might improve compression a bit. Not sure whether it could also have a negative impact when the to-be-compressed data is of a very different kind, e.g. when switching from a binary file's chunks to a text file's chunks.
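For illustration, a rough sketch of that idea with the python-zstandard bindings (again not borg's actual compression code): one compressor context is shared across a sequence of chunks and each chunk is flushed as a block, so later chunks can back-reference earlier data:

import zstandard as zstd

def compress_chunk_sequence(chunks, level=3):
    # One shared compression context for the whole sequence; the zstd
    # window/history carries over from chunk to chunk instead of
    # starting from scratch each time.
    cctx = zstd.ZstdCompressor(level=level)
    cobj = cctx.compressobj()
    pieces = []
    for chunk in chunks:
        piece = cobj.compress(chunk) + cobj.flush(zstd.COMPRESSOBJ_FLUSH_BLOCK)
        pieces.append(piece)
    pieces.append(cobj.flush(zstd.COMPRESSOBJ_FLUSH_FINISH))
    # Caveat: all pieces together form ONE zstd frame, so a single chunk
    # can no longer be decompressed in isolation, unlike borg's current
    # per-chunk compression.
    return pieces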

@Beiri22
Author

Beiri22 commented Jan 18, 2021

I could use an uncompressed repo for that and then compress the whole repo with an external zstd tool; but to really try compressing the individual chunks ... I would need to access them separately. Is there any way to export all chunks to individual files?

@Beiri22
Author

Beiri22 commented Jan 18, 2021

This is just a very preliminary comparison, because I do not know how to extract all individual chunks.

REPO_UNCOMPRESSED (1.7 GB) >> external zstd level 22 over the whole repo >> 575 MB
REPO_SETTINGZSTD22 (818 MB) >> external zstd level 22 over the whole repo >> 647 MB

Nearly the same size (575 MB) is obtained when training a 1 MB dictionary on the full repo directory and then using this dictionary to compress the whole repo at level 22.

Having the individual chunks as separate files would provide a more realistic scenario!

@ThomasWaldmann
Member

You could try to borg init a new test repo, then edit the repo config and use a very small segment file size (default: 500 MB, try 1 kB or so). I guess it will then start a new segment file for each chunk, but I never tried that.
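If it helps: on borg 1.x that should mean editing the config file inside the repository directory and shrinking max_segment_size (the value is in bytes; key name and default here are from memory, so double-check against your repo's config), roughly like this:

[repository]
...
# default is about 500 MB (524288000 bytes); use something tiny for this experiment
max_segment_size = 1024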

@Beiri22
Author

Beiri22 commented Jan 18, 2021

I tried that. For 4352 unique chunks I got 3880 segment files. Compressing those as individual files without a dictionary, with trained dictionaries of various sizes (dXXXk = dictionary size in kB), or as one tar of all files results in the following sizes:

531M ./tar.zstd -- not individual
712M ./compressed_d3000k
720M ./compressed_d2000k
725M ./compressed_d5000k
726M ./compressed_d1500k
726M ./compressed_d4000k
727M ./compressed_d1250k
730M ./compressed_d10000k
731M ./compressed_d1000k
733M ./compressed_d750k
736M ./compressed_d500k
745M ./compressed_d200k
750M ./compressed_d100k
755M ./compressed_nodict
1.6G ./orig

I've not changed any dictionary training settings. This does not look as promising as expected. I don't know whether keeping the compression dicts between chunks would result in performance similar to tar.zstd.
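For what it's worth, the same sweep could also be scripted with the python-zstandard bindings instead of the zstd CLI; a sketch (the dictionary sizes and level are just the values from the table above, nothing borg-specific):

import zstandard as zstd

def dict_size_sweep(paths, dict_sizes_kb=(100, 200, 500, 1000, 3000), level=22):
    # Compare the total size of individually compressed files for several
    # trained dictionary sizes against a no-dictionary baseline.
    samples = [open(p, "rb").read() for p in paths]
    totals = {}
    cctx = zstd.ZstdCompressor(level=level)
    totals["nodict"] = sum(len(cctx.compress(s)) for s in samples)
    for kb in dict_sizes_kb:
        d = zstd.train_dictionary(kb * 1024, samples)
        cctx = zstd.ZstdCompressor(level=level, dict_data=d)
        totals[f"d{kb}k"] = sum(len(cctx.compress(s)) for s in samples)
    return totals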

@Beiri22
Author

Beiri22 commented Feb 16, 2021

You wrote about a related not-yet-implemented feature: keeping compression dicts between different chunks and not starting from scratch each time. Is this in active consideration?

@ThomasWaldmann
Member

I am not currently working on that, but feel free to try it if you want.

@ThomasWaldmann modified the milestone: 2.0.0b3 on Sep 19, 2022