
Long and very long range redundancy removal #3062

Open
giovariot opened this issue Feb 10, 2022 · 4 comments
Labels: feature request, long-term (valid topics that are expected to take a long time to make progress)

Comments

giovariot commented Feb 10, 2022

The problem is that these days, with huge hard disks being so cheap and so common, it is not unusual to forget about files you already saved and copy them again into another folder, or to try to reorganize a lifetime of files in several different ways.

As an example, take a 1 TB SSD I backed up for a friend of mine who had inadvertently formatted it. It had two partitions: one where files were kept well organized in clearly named folders, and a second used as "messy" storage, holding much of the same data from old backups and serving as a source for organizing the data on the first partition.

I’ve tried ckolivas/lrzip, which implements this kind of unlimited-range redundancy removal through its -U option, and there is a noticeable difference between zstd and lrzip when compressing big disk image backups. zstd is of course faster, but removing redundancy with ckolivas/lrzip usually produces a smaller file in the end. For example, on a 127 GB disk image with three partitions containing three different Windows versions and with free space zeroed out:

  • using lrzip -U -l -p1 resulted in a 28.94 GB file
  • using zstd -T0 --long -19 resulted in a 33.66 GB file

I then tried its -U -n option (redundancy removal only, compression disabled) on the 1000.2 GB disk image I mentioned as an example: it produced a 732.4 GB file. Adding simple LZO compression with lrzip’s -U -l produced a 695 GB file. Running zstd on the original image took many hours (~40) and only got it down to 867.6 GB.

I know I could simply use lrzip -U -n to remove the redundancy and then compress the output file with zstd, but I think this could be a very useful feature for zstd itself: its combination of speed and high compression ratio surely makes it a very common choice for compressing disk images.

So I'd find it very useful to implement some sort of redundancy removal process:

  • ckolivas/lrzip does this using a “sliding mmap” (I’m not sure I’ve fully understood the concept, but it appears to be a bit-by-bit redundancy removal process); it is quite slow, even if very effective
  • another approach that comes to mind is some sort of magic-byte recognition algorithm, which could also remove redundancy created by deleted files (that would be faster, but limited to non-fragmented files)
  • another option would be to support common filesystems and reduce redundancy through some sort of hash-based “duplicate removal” process (see the sketch after this list). This might be even faster, but it would completely miss deleted files.
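To make the hash-based idea in the last bullet concrete, here is a rough, hypothetical sketch of content-defined chunking in C. It is not part of zstd or lrzip; the parameters (a Gear-style rolling hash, ~64 KiB average chunks, FNV-1a fingerprints, a tiny fixed-size fingerprint table) are all illustrative choices, and a real deduplicator would use a collision-resistant hash and verify matches byte-for-byte.

```c
/* Illustrative content-defined chunking + duplicate detection.
 * A Gear rolling hash picks chunk boundaries that survive data shifts,
 * and identical chunks are detected by comparing fingerprints. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define AVG_CHUNK_BITS 16                       /* ~64 KiB average chunk size */
#define BOUNDARY_MASK  ((1u << AVG_CHUNK_BITS) - 1)

static uint32_t gear[256];                      /* per-byte values for the rolling hash */

static void init_gear(void) {
    uint32_t x = 0x9E3779B9u;                   /* simple deterministic pseudo-random fill */
    for (int i = 0; i < 256; i++) {
        x ^= x << 13; x ^= x >> 17; x ^= x << 5;
        gear[i] = x;
    }
}

/* FNV-1a fingerprint of a chunk; a real deduplicator would use a
 * collision-resistant hash (SHA-256, BLAKE3, ...) and verify matches. */
static uint64_t fingerprint(const uint8_t* p, size_t n) {
    uint64_t h = 1469598103934665603ull;
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 1099511628211ull; }
    return h;
}

/* Split `data` into content-defined chunks and report duplicate bytes.
 * The fixed-size table of seen fingerprints stands in for a real index. */
static void dedup_scan(const uint8_t* data, size_t size) {
    enum { MAX_SEEN = 1 << 20 };
    static uint64_t seen[MAX_SEEN];
    size_t nseen = 0, chunk_start = 0, dup_bytes = 0;
    uint32_t h = 0;

    for (size_t i = 0; i < size; i++) {
        h = (h << 1) + gear[data[i]];           /* Gear rolling hash */
        int at_boundary = ((h & BOUNDARY_MASK) == 0) || (i + 1 == size);
        if (!at_boundary) continue;

        size_t len = i + 1 - chunk_start;
        uint64_t fp = fingerprint(data + chunk_start, len);
        int duplicate = 0;
        for (size_t k = 0; k < nseen; k++)      /* linear scan: fine for a sketch */
            if (seen[k] == fp) { duplicate = 1; break; }
        if (duplicate) dup_bytes += len;
        else if (nseen < MAX_SEEN) seen[nseen++] = fp;

        chunk_start = i + 1;
        h = 0;
    }
    printf("duplicate bytes found: %zu of %zu\n", dup_bytes, size);
}

int main(void) {
    static uint8_t buf[1 << 20];
    for (size_t i = 0; i < sizeof(buf); i++)
        buf[i] = (uint8_t)((i * 2654435761u) >> 13);
    memcpy(buf + (1 << 19), buf, 1 << 18);      /* plant a long-range duplicate region */
    init_gear();
    dedup_scan(buf, sizeof(buf));
    return 0;
}
```

The point of content-defined boundaries is that duplicate regions are found even when they sit at completely different offsets, which is exactly the situation with the same files copied into different folders of a disk image.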

I don’t really know whether any of this is feasible within the current zstd framework, but I really think some sort of long-range redundancy removal should be part of it, considering its possible uses in commercial environments as well (virtual machine image compression, for example, is an area where this feature would surely be beneficial).

Thanks in advance, thanks for the great software, and sorry for my pretty bad English; it’s not my mother tongue.

Cyan4973 (Contributor) commented:

As a stopgap, you could start using --long=31. This will considerably increase the match finder's range (and the memory requirement) and should narrow the gap with lrzip.
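(For anyone driving the library rather than the CLI, here is a minimal sketch of roughly what --long=31 maps to in zstd's advanced API; error handling is trimmed, and windowLog=31 is only accepted on 64-bit builds. Note that the decompressor must raise its window limit as well, which is why the CLI asks for --long=31 or a larger --memory on the decode side.)

```c
/* Sketch: roughly what the CLI's --long=31 corresponds to in zstd's advanced API.
 * Enables long-distance matching and widens the match window to ~2 GiB. */
#include <zstd.h>

size_t compress_long_range(void* dst, size_t dstCapacity,
                           const void* src, size_t srcSize, int level)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_enableLongDistanceMatching, 1);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 31);  /* big window, big memory cost */
    size_t const ret = ZSTD_compress2(cctx, dst, dstCapacity, src, srcSize);
    ZSTD_freeCCtx(cctx);
    return ret;  /* callers should check with ZSTD_isError() */
}

size_t decompress_long_range(void* dst, size_t dstCapacity,
                             const void* src, size_t srcSize)
{
    ZSTD_DCtx* const dctx = ZSTD_createDCtx();
    /* The decoder rejects windows above its default limit, so it must be
     * raised as well -- the library-side analogue of `zstd -d --long=31`. */
    ZSTD_DCtx_setParameter(dctx, ZSTD_d_windowLogMax, 31);
    size_t const ret = ZSTD_decompressDCtx(dctx, dst, dstCapacity, src, srcSize);
    ZSTD_freeDCtx(dctx);
    return ret;
}
```

For multi-hundred-GB images the streaming entry points (ZSTD_compressStream2 / ZSTD_decompressStream) honor the same context parameters, so the whole image never has to sit in memory at once.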

Beyond that, we know that the current implementation is limited (though not the format; it's purely an implementation issue). Going beyond that limitation is substantial work, though.
In a world of infinite resources, this could be taken care of. But that's not the world we live in; we have to balance this effort against other ongoing efforts being delivered instead. So far it hasn't reached the top of the list, so it's a matter of priorities. But we are aware of the situation: this item is part of a list of long-term goals, and one of these days it will be its turn.

Beiri22 commented Apr 7, 2022

@giovariot Have you tried the --long=31 option? How do the results compare to the default --long and to lrzip?

giovariot (Author) commented Apr 7, 2022

The compression is a bit better than with plain --long, but not really as good as lrzip's.

I can try again and time the whole thing, but I don't have much free time lately, so I'm not sure how long it will be until I have some news.

@terrelln added the long-term label on Dec 21, 2022

AlexDaniel commented Nov 6, 2023

I'm seeing similar results: lrzip is overall a bit “better”, with a higher compression ratio and faster compression. --long=31 helps a little, but it's not enough. Here's a real-world example (a <30 MB archive holding >1 GB worth of data): https://whateverable.6lang.org/0dc6efd6314bcc27c9b8351453d2ec8ca10196df

And some rough measurements for that data:

| Command | Compression time | File size | Decompression time |
| --- | --- | --- | --- |
| `lrzip -q -L 9` | ≈26s | ≈8M | ≈5.3s |
| `zstd -19 --long=31` | ≈97s | ≈12M | ≈0.8s |
| `zstd --long=31` | ≈3.7s | ≈16M | ≈0.8s |
