
file type based chunking / compression heuristic #82

Closed
ThomasWaldmann opened this issue Jul 5, 2015 · 9 comments

@ThomasWaldmann
Member

We could have special chunkers (or no chunking) and compression algorithms (or no compression) for specific file types (as determined by file extension or magic), file sizes, etc.

@ThomasWaldmann
Member Author

Release 0.25 introduced different compression options, and the compression type is stored with the data chunk, so this issue's idea is now possible to implement.

@anarcat
Contributor

anarcat commented Sep 30, 2015

Is this related to restic's interesting new chunking algorithm?

@ThomasWaldmann
Member Author

No, borg also does CDC (content-defined chunking), just using a different algorithm.
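For context, here is a toy illustration of content-defined chunking. It is a sketch only, not borg's actual buzhash chunker: a rolling hash over a small window decides where to cut, so identical content tends to produce identical chunk boundaries regardless of its offset in the file.

```python
def cdc_chunks(data, window=48, base=257, mod=1 << 31, mask=(1 << 12) - 1, min_size=1024):
    """Toy content-defined chunker (sketch only, not borg's buzhash):
    a Rabin-Karp style rolling hash over the last `window` bytes; a cut
    is made whenever the low 12 bits of the hash are zero, giving roughly
    4 KiB average chunks on random data."""
    pow_w = pow(base, window, mod)          # base**window, used to drop the oldest byte
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * base + b) % mod
        if i >= window:
            h = (h - data[i - window] * pow_w) % mod
        if i - start + 1 >= min_size and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because the cut decision depends only on the bytes inside the rolling window, inserting or deleting data early in a file only shifts boundaries locally instead of re-chunking everything after it.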

For chunking, it is not easy, but just as an example:
For uncompressed .tar files, we could identify the boundaries of the member files. If we chunked at precisely these boundaries, we could deduplicate against the untarred files or against other, similar tar files that just have some files changed/deleted/added. This also works with the plain CDC algorithm, but a bit less precisely.
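As a rough sketch of what such a format-aware chunker could consume (this is not borg code; offset_data is an attribute CPython's tarfile module sets on each member), this lists the payload boundaries of an uncompressed tar:

```python
import tarfile

def tar_member_boundaries(path):
    """Yield (offset, size) of each member's payload in an uncompressed tar,
    usable as suggested cut points for a format-aware chunker (sketch only)."""
    with tarfile.open(path, mode='r:') as tf:   # 'r:' forbids compressed tars
        for member in tf:
            if member.isfile():
                yield member.offset_data, member.size
```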

For compression, it is easier; we could e.g. do something like the following mapping (sketched in code below):
.mp3 .mp4 .tgz .zip ... -> compression none
.txt .c .py ... and small file size -> some (slow) compression that's good with text
uncompressed and big file size -> fast compression, lz4
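A minimal sketch of that heuristic as a lookup table. The extension sets and size threshold are illustrative, and the returned names only mirror the ones mentioned above; this is not borg's configuration syntax:

```python
import os

ALREADY_COMPRESSED = {'.mp3', '.mp4', '.tgz', '.zip'}   # skip compression entirely
TEXT_LIKE = {'.txt', '.c', '.py'}                       # compresses well as text
SMALL = 1024 * 1024                                     # 1 MiB, arbitrary cutoff

def choose_compression(path, size):
    """Pick a compression method from file extension and size (sketch)."""
    ext = os.path.splitext(path)[1].lower()
    if ext in ALREADY_COMPRESSED:
        return 'none'        # already-compressed data barely shrinks further
    if ext in TEXT_LIKE and size <= SMALL:
        return 'slow-text'   # placeholder for a slow, text-friendly algorithm
    return 'lz4'             # fast default for big or unknown files
```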

@dragetd
Contributor

dragetd commented Nov 17, 2015

It could chunk along file boundaries not only within tars, but also in VM images; at header information of media files, which is more likely to change than the data stream; in OpenOffice files, which are themselves compressed ZIP files; at EXIF tags and the previews embedded in some raw image formats, which might fit neatly into a chunk… etc.

I could see some real benefits, and funky heuristics and support for certain file types could be added incrementally without harming compatibility. And yet, it would require a lot of work and knowledge of all the various formats… and it needs to be crafted carefully so that the parser robustly outputs only suggested chunk boundaries to the chunker and is not prone to security issues when trying to parse dozens of file types.

Cool idea, but maybe something for post-1.0?

@RonnyPfannschmidt
Contributor

It would be nice if there were a more lax, content-equivalent mode that could dedup compressed data more effectively.

@ThomasWaldmann
Member Author

see also #765.

ThomasWaldmann self-assigned this Mar 19, 2016
@Ape
Contributor

Ape commented Mar 21, 2016

Btrfs has the following logic; maybe we could do something similar:

There is a simple decision logic: if the first portion of data being compressed is not smaller than the original, the compression of the file is disabled
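A minimal sketch of that decision, assuming we probe with zlib on the first portion of the file (the probe size here is arbitrary and not what btrfs actually uses):

```python
import zlib

PROBE_SIZE = 64 * 1024   # arbitrary probe size for this sketch

def should_compress(first_portion: bytes) -> bool:
    """Btrfs-style check (sketch): keep compressing the file only if
    compressing its first portion actually made it smaller."""
    probe = first_portion[:PROBE_SIZE]
    return len(zlib.compress(probe)) < len(probe)
```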

@ThomasWaldmann
Member Author

File type based compression implemented by: #810

@ThomasWaldmann
Member Author

ThomasWaldmann commented May 2, 2016

I am splitting this into multiple tickets:

So, as all of this is now covered by those tickets, I am closing this one as a duplicate.
