
file type based chunking / compression heuristic #82

Closed
ThomasWaldmann opened this issue Jul 5, 2015 · 9 comments

@ThomasWaldmann
Member

We could have special chunkers (or no chunking) and compression algorithms (or no compression) for specific file types (as determined by file extension or magic), file sizes, etc.

@ThomasWaldmann
Member Author

Release 0.25 introduced different compression options, and the compression type is stored with the data chunk, so this issue's idea is now possible to implement.

@anarcat
Contributor

anarcat commented Sep 30, 2015

Is this related to restic's interesting new chunking algorithm?

@ThomasWaldmann
Member Author

No, borg also does CDC (content-defined chunking), just using a different algorithm.
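For context, here is a toy illustration of content-defined chunking. It is a sketch only, not borg's actual buzhash chunker: a rolling hash over a small window decides where to cut, so identical content tends to produce identical chunk boundaries regardless of its offset in the file.

```python
def cdc_chunks(data, window=48, base=257, mod=1 << 31, mask=(1 << 12) - 1, min_size=1024):
    """Toy content-defined chunker (sketch only, not borg's buzhash):
    a Rabin-Karp style rolling hash over the last `window` bytes; a cut
    is made whenever the low 12 bits of the hash are zero, giving roughly
    4 KiB average chunks on random data."""
    pow_w = pow(base, window, mod)          # base**window, used to drop the oldest byte
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * base + b) % mod
        if i >= window:
            h = (h - data[i - window] * pow_w) % mod
        if i - start + 1 >= min_size and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because the cut decision depends only on the bytes inside the rolling window, inserting or deleting data early in a file only shifts boundaries locally instead of re-chunking everything after it.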

For chunking, it is not easy, but just as an example:
For uncompressed .tar files, we could identify the boundaries of the member files. If we chunked at precisely these boundaries, we could deduplicate against the untarred files or against other, similar tar files that just have some files changed/deleted/added. This also works with the plain CDC algorithm, but a bit less precisely.
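As a rough sketch of what such a format-aware chunker could consume (this is not borg code; offset_data is an attribute CPython's tarfile module sets on each member), this lists the payload boundaries of an uncompressed tar:

```python
import tarfile

def tar_member_boundaries(path):
    """Yield (offset, size) of each member's payload in an uncompressed tar,
    usable as suggested cut points for a format-aware chunker (sketch only)."""
    with tarfile.open(path, mode='r:') as tf:   # 'r:' forbids compressed tars
        for member in tf:
            if member.isfile():
                yield member.offset_data, member.size
```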

For compression, it is easier; we could e.g. do something like the following mapping (sketched in code below):
.mp3 .mp4 .tgz .zip ... -> compression none
.txt .c .py ... and small file size -> some (slow) compression that's good with text
uncompressed and big file size -> fast compression, lz4
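A minimal sketch of that heuristic as a lookup table. The extension sets and size threshold are illustrative, and the returned names only mirror the ones mentioned above; this is not borg's configuration syntax:

```python
import os

ALREADY_COMPRESSED = {'.mp3', '.mp4', '.tgz', '.zip'}   # skip compression entirely
TEXT_LIKE = {'.txt', '.c', '.py'}                       # compresses well as text
SMALL = 1024 * 1024                                     # 1 MiB, arbitrary cutoff

def choose_compression(path, size):
    """Pick a compression method from file extension and size (sketch)."""
    ext = os.path.splitext(path)[1].lower()
    if ext in ALREADY_COMPRESSED:
        return 'none'        # already-compressed data barely shrinks further
    if ext in TEXT_LIKE and size <= SMALL:
        return 'slow-text'   # placeholder for a slow, text-friendly algorithm
    return 'lz4'             # fast default for big or unknown files
```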

@dragetd
Contributor

dragetd commented Nov 17, 2015

It could chunk along file boundaries not only within tars, but also in VM images; at header information of media files, which is more likely to change than the data stream; in OpenOffice files, which are themselves compressed ZIP files; at EXIF tags and the previews embedded in some raw image formats, which might fit neatly into a chunk… etc.

I could see some real benefits, and funky heuristics and support for certain file types could be added incrementally without harming compatibility. And yet, it would require a lot of work and knowledge of all the various formats… and it needs to be crafted carefully so that the parser robustly outputs only suggested chunk boundaries to the chunker and is not prone to security issues when trying to parse dozens of file types.

Cool idea, but maybe something for post-1.0?

@RonnyPfannschmidt
Contributor

It would be nice if there were a more lax, content-equivalent mode that could dedup compressed data more effectively.

@ThomasWaldmann
Member Author

see also #765.

ThomasWaldmann self-assigned this Mar 19, 2016
@Ape
Contributor

Ape commented Mar 21, 2016

Btrfs has the following logic; maybe we could do something similar:

There is a simple decision logic: if the first portion of data being compressed is not smaller than the original, the compression of the file is disabled
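A minimal sketch of that decision, assuming we probe with zlib on the first portion of the file (the probe size here is arbitrary and not what btrfs actually uses):

```python
import zlib

PROBE_SIZE = 64 * 1024   # arbitrary probe size for this sketch

def should_compress(first_portion: bytes) -> bool:
    """Btrfs-style check (sketch): keep compressing the file only if
    compressing its first portion actually made it smaller."""
    probe = first_portion[:PROBE_SIZE]
    return len(zlib.compress(probe)) < len(probe)
```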

@ThomasWaldmann
Member Author

File type based compression implemented by: #810

@ThomasWaldmann
Member Author

ThomasWaldmann commented May 2, 2016

I am splitting this into multiple tickets:

So, as all of this is now covered by those tickets, I am closing this one as a duplicate.
