
Long and very long range redundancy removal #3062

Open
giovariot opened this issue Feb 10, 2022 · 4 comments
Labels: feature request, long-term (valid topics that are expected to take a long time to make progress)

Comments

giovariot commented Feb 10, 2022

The problem is that these days, with huge hard disks being so cheap and so common, it is not unusual to forget about files you already saved and copy them again into another folder, or to try to reorganize a lifetime of files in several different ways.

As an example, take a 1 TB SSD I backed up for a friend of mine who had inadvertently formatted it. It had two partitions: one where files were kept well organized in clearly named folders, and a second used as "messy" storage, holding much of the same data from old backups and serving as a source for organizing the data on the first partition.

I’ve tried ckolivas/lrzip, which implements this kind of unlimited-range redundancy removal through its -U option, and there is a noticeable difference between zstd and lrzip when compressing big disk image backups. zstd is of course faster, but removing redundancy with ckolivas/lrzip usually produces a smaller file in the end. For example, on a 127 GB disk image with three partitions containing three different Windows versions and with free space zeroed out:

  • using lrzip -U -l -p1 resulted in a 28.94 GB file
  • using zstd -T0 --long -19 resulted in a 33.66 GB file

I then tried its -U -n option (redundancy removal only, compression disabled) on the 1000.2 GB disk image I mentioned as an example: it produced a 732.4 GB file. Adding simple LZO compression with lrzip’s -U -l produced a 695 GB file. Running zstd on the original image took many hours (~40) and only got it down to 867.6 GB.

I know I could simply use lrzip -U -n to remove the redundancy and then compress the output file with zstd, but I think this could be a very useful feature for zstd itself: its combination of speed and high compression ratio surely makes it a very common choice for compressing disk images.

So I'd find it very useful to implement some sort of redundancy removal process:

  • ckolivas/lrzip does this using a “sliding mmap” (I’m not sure I’ve fully understood the concept, but it appears to be a bit-by-bit redundancy removal process); it is quite slow, even if very effective
  • another approach that comes to mind is some sort of magic-byte recognition algorithm, which could also remove redundancy created by deleted files (that would be faster, but limited to non-fragmented files)
  • another option would be to support common filesystems and reduce redundancy through some sort of hash-based “duplicate removal” process (see the sketch after this list). This might be even faster, but it would completely miss deleted files.
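To make the hash-based idea in the last bullet concrete, here is a rough, hypothetical sketch of content-defined chunking in C. It is not part of zstd or lrzip; the parameters (a Gear-style rolling hash, ~64 KiB average chunks, FNV-1a fingerprints, a tiny fixed-size fingerprint table) are all illustrative choices, and a real deduplicator would use a collision-resistant hash and verify matches byte-for-byte.

```c
/* Illustrative content-defined chunking + duplicate detection.
 * A Gear rolling hash picks chunk boundaries that survive data shifts,
 * and identical chunks are detected by comparing fingerprints. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define AVG_CHUNK_BITS 16                       /* ~64 KiB average chunk size */
#define BOUNDARY_MASK  ((1u << AVG_CHUNK_BITS) - 1)

static uint32_t gear[256];                      /* per-byte values for the rolling hash */

static void init_gear(void) {
    uint32_t x = 0x9E3779B9u;                   /* simple deterministic pseudo-random fill */
    for (int i = 0; i < 256; i++) {
        x ^= x << 13; x ^= x >> 17; x ^= x << 5;
        gear[i] = x;
    }
}

/* FNV-1a fingerprint of a chunk; a real deduplicator would use a
 * collision-resistant hash (SHA-256, BLAKE3, ...) and verify matches. */
static uint64_t fingerprint(const uint8_t* p, size_t n) {
    uint64_t h = 1469598103934665603ull;
    for (size_t i = 0; i < n; i++) { h ^= p[i]; h *= 1099511628211ull; }
    return h;
}

/* Split `data` into content-defined chunks and report duplicate bytes.
 * The fixed-size table of seen fingerprints stands in for a real index. */
static void dedup_scan(const uint8_t* data, size_t size) {
    enum { MAX_SEEN = 1 << 20 };
    static uint64_t seen[MAX_SEEN];
    size_t nseen = 0, chunk_start = 0, dup_bytes = 0;
    uint32_t h = 0;

    for (size_t i = 0; i < size; i++) {
        h = (h << 1) + gear[data[i]];           /* Gear rolling hash */
        int at_boundary = ((h & BOUNDARY_MASK) == 0) || (i + 1 == size);
        if (!at_boundary) continue;

        size_t len = i + 1 - chunk_start;
        uint64_t fp = fingerprint(data + chunk_start, len);
        int duplicate = 0;
        for (size_t k = 0; k < nseen; k++)      /* linear scan: fine for a sketch */
            if (seen[k] == fp) { duplicate = 1; break; }
        if (duplicate) dup_bytes += len;
        else if (nseen < MAX_SEEN) seen[nseen++] = fp;

        chunk_start = i + 1;
        h = 0;
    }
    printf("duplicate bytes found: %zu of %zu\n", dup_bytes, size);
}

int main(void) {
    static uint8_t buf[1 << 20];
    for (size_t i = 0; i < sizeof(buf); i++)
        buf[i] = (uint8_t)((i * 2654435761u) >> 13);
    memcpy(buf + (1 << 19), buf, 1 << 18);      /* plant a long-range duplicate region */
    init_gear();
    dedup_scan(buf, sizeof(buf));
    return 0;
}
```

The point of content-defined boundaries is that duplicate regions are found even when they sit at completely different offsets, which is exactly the situation with the same files copied into different folders of a disk image.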

I don’t really know whether any of this is feasible within the current zstd framework, but I really think some sort of long-range redundancy removal should be part of it, considering its possible uses in commercial environments as well (virtual machine image compression, for example, is an area where this feature would surely be beneficial).

Thanks in advance, thanks for the great software, and sorry for my pretty bad English; it’s not my mother tongue.

Cyan4973 (Contributor) commented:

As a stopgap, you could start using --long=31. This will considerably increase the match finder's range (and the memory requirement) and should narrow the gap with lrzip.
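(For anyone driving the library rather than the CLI, here is a minimal sketch of roughly what --long=31 maps to in zstd's advanced API; error handling is trimmed, and windowLog=31 is only accepted on 64-bit builds. Note that the decompressor must raise its window limit as well, which is why the CLI asks for --long=31 or a larger --memory on the decode side.)

```c
/* Sketch: roughly what the CLI's --long=31 corresponds to in zstd's advanced API.
 * Enables long-distance matching and widens the match window to ~2 GiB. */
#include <zstd.h>

size_t compress_long_range(void* dst, size_t dstCapacity,
                           const void* src, size_t srcSize, int level)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_enableLongDistanceMatching, 1);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 31);  /* big window, big memory cost */
    size_t const ret = ZSTD_compress2(cctx, dst, dstCapacity, src, srcSize);
    ZSTD_freeCCtx(cctx);
    return ret;  /* callers should check with ZSTD_isError() */
}

size_t decompress_long_range(void* dst, size_t dstCapacity,
                             const void* src, size_t srcSize)
{
    ZSTD_DCtx* const dctx = ZSTD_createDCtx();
    /* The decoder rejects windows above its default limit, so it must be
     * raised as well -- the library-side analogue of `zstd -d --long=31`. */
    ZSTD_DCtx_setParameter(dctx, ZSTD_d_windowLogMax, 31);
    size_t const ret = ZSTD_decompressDCtx(dctx, dst, dstCapacity, src, srcSize);
    ZSTD_freeDCtx(dctx);
    return ret;
}
```

For multi-hundred-GB images the streaming entry points (ZSTD_compressStream2 / ZSTD_decompressStream) honor the same context parameters, so the whole image never has to sit in memory at once.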

Beyond that, we know that the current implementation is limited (though not the format; it's purely an implementation issue). Going beyond that limitation is substantial work, though.
In a world of infinite resources, this could be taken care of. But that's not the world we live in; we have to balance this effort against other ongoing efforts being delivered instead. So far it hasn't reached the top of the list, so it's a matter of priorities. But we are aware of the situation: this item is part of a list of long-term goals, and one of these days it will be its turn.

Beiri22 commented Apr 7, 2022

@giovariot Have you tried the --long=31 option? How do the results compare to the default --long and to lrzip?

giovariot (Author) commented Apr 7, 2022

The compression is a bit better than with plain --long, but not really as good as lrzip's.

I can try again and time the whole thing, but I don't have much free time lately, so I'm not sure how long it will be until I have some news.

@terrelln added the long-term label on Dec 21, 2022

AlexDaniel commented Nov 6, 2023

I'm seeing similar results: lrzip is overall a bit “better”, with a higher compression ratio and faster compression. --long=31 helps a little, but it's not enough. Here's a real-world example (a <30 MB archive holding >1 GB worth of data): https://whateverable.6lang.org/0dc6efd6314bcc27c9b8351453d2ec8ca10196df

And some rough measurements for that data:

| Command | Compression time | File size | Decompression time |
| --- | --- | --- | --- |
| `lrzip -q -L 9` | ≈26s | ≈8M | ≈5.3s |
| `zstd -19 --long=31` | ≈97s | ≈12M | ≈0.8s |
| `zstd --long=31` | ≈3.7s | ≈16M | ≈0.8s |
