significantly reduce cache space needs #235
every chunk archive is 353MB, so if i let this go through, it will take around 100GB. that seems rather unreasonable - it's almost the size of the original dataset...

final size of the chunk cache is indeed 98G.
borg is locally caching all the single-archive indexes, and that takes a lot of space. the "attic way" was to transfer them all from the (remote) repository whenever a cache rebuild was needed, which may take a lot of time. so it's time vs. space currently. but I've 1-2 ideas left for how to optimize that, they're just not implemented yet.
@anarcat for a quick remedy, try compressing 1 single-archive index using gzip, bz2 and lzma. How big is it before / after? But: if we do that, it'll get slower again.
a fresh new backup of the same dataset generates a chunk cache of 575MB. i don't understand why this is necessary: this is a local backup, so from my perspective this is a huge performance hit without any gain in speed whatsoever. how can i do the compression you are referring to?
oh, and in case this wasn't clear: this is a blocker for me right now - i can't afford the (local!) space cost of 100GB for my backups, unfortunately, so right now i have failed to convert to the borg collective. sorry!!! i want to be assimilated, what's going on? ;)
compression: that's just an experiment, not an existing feature. Try to compress your 575MB file using different methods to see if it is worth the effort. I'll add a quick & temporary hack so you won't need the space.
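For anyone wanting to run that experiment, here is a minimal sketch (plain Python, not a borg feature; which cached index file you point it at is up to you):

```python
import bz2
import gzip
import lzma
import sys

# Path to a cached index file, e.g. something under $BORG_CACHE_DIR/<id>/
# (exact location assumed from the discussion above).
path = sys.argv[1]
with open(path, "rb") as f:
    data = f.read()

print(f"original: {len(data):,} bytes")
for name, compress in (("gzip", gzip.compress),
                       ("bz2", bz2.compress),
                       ("lzma", lzma.compress)):
    size = len(compress(data))
    print(f"{name:5}: {size:,} bytes ({100 * size // len(data)} % of original)")
```

Note this reads the whole file into RAM, which is fine for a one-off measurement of a 575MB index; a real implementation would stream.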
See PR #238 (just as a temporary solution, see the comment in the source). A long-term solution might be one of:

a) if archives were only added, just add their chunk refcounts to the chunks index instead of rebuilding it from scratch. Update: this can only work if only new archives were added, but no archives were deleted in the repo. To do the fixup for deleted archives, we would need their index; but as they are deleted, and we do not want to store all indexes locally, we don't have it. So when deleted archives are detected, a full index rebuild is needed (a rough sketch of the resulting resync logic follows after this list).

b) look at how restic does it. IIRC it works using a reference to the previous backup archive and only considers chunk information from there.
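To make option a) concrete, here is a rough sketch of the resync logic it implies (a plain-Python illustration with made-up helper names, not borg's actual API; real archive indexes also carry size/csize per chunk, omitted here):

```python
def merge_index(chunks, archive_index):
    """Add one archive's chunk refcounts into the master chunks index."""
    for chunk_id, refcount in archive_index.items():
        chunks[chunk_id] = chunks.get(chunk_id, 0) + refcount

def resync(chunks, known_ids, repo_ids, fetch_archive_index):
    """Bring the local chunks index up to date with the repository.

    chunks:              dict chunk_id -> refcount (the local cache)
    known_ids:           set of archive ids the cache was last synced against
    repo_ids:            set of archive ids currently in the repository
    fetch_archive_index: callable returning {chunk_id: refcount} for one archive
    """
    deleted = known_ids - repo_ids
    if deleted:
        # We would need the deleted archives' indexes to subtract their
        # refcounts, but those indexes are gone: full rebuild needed.
        chunks.clear()
        for archive_id in repo_ids:
            merge_index(chunks, fetch_archive_index(archive_id))
    else:
        # Fast path: only additions, so merge just the new archives.
        for archive_id in repo_ids - known_ids:
            merge_index(chunks, fetch_archive_index(archive_id))
    return set(repo_ids)  # the new "known" set for the next sync
```

The fast path is why create stays cheap even without locally cached per-archive indexes; only deletes force the expensive rebuild.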
around 30% compression with standard gzip - is that what you were looking for? it would take 100GB down to 70GB, so it doesn't seem like a solution. i don't know how to use #238, sorry...
No, gzip reducing size by 30% is not worth the effort, and bzip2 and lzma would be much slower...
thanks for looking into this! the more i think about this issue, the less sense it makes to me. even if the repository were remote, i wouldn't want 100GB (50% of my dataset!) lying around as a cache on the local filesystem, just for an eventual cache resync. i guess i'm unclear as to why such a cache resync would happen in the first place. i know it can happen if multiple servers back up to the same repository, but that's a special use case that could be disabled by default... is that it?
for some people it is not just an occasional cache resync, but something that happens daily - if you do a daily backup of multiple machines to the same destination repo. and that case is not that special, especially if you have similar machines and you want to exploit the deduplication to the max.
sure, i understand that, but that seems like a special use case that was basically unsupported forever. :) now i understand we would like that to work, but i'm not sure the cost/benefit ratio is worth it. it certainly isn't worth it by default, imho.
…orgbackup#235: if archives are added, we just add their chunk refcounts to our chunks index. this is relatively quick if only a few archives were added, so even for remote repos one can live without the locally cached per-archive indexes, saving a lot of space. but: if archives are deleted from the repo, borg will do a full index rebuild. if one wants this to be relatively fast and has a slow repo connection, the locally cached per-archive indexes are useful. thus: use delete / prune less frequently than create. also: added an error check in hashindex_merge in case hashindex_set fails.
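The error check mentioned at the end lives in borg's C hashindex code; in rough Python terms (a sketch of the logic only, not the real C interface), it amounts to aborting the merge when an insert fails instead of silently continuing:

```python
def hashindex_merge(target, source, set_entry):
    """Merge source's chunk refcounts into target via set_entry.

    set_entry(target, key, value) stands in for the C hashindex_set and
    returns False on failure (e.g. the table cannot be grown); the merge
    now propagates that failure instead of ignoring it.
    """
    for key, refcount in source.items():
        new_value = target.get(key, 0) + refcount
        if not set_entry(target, key, new_value):
            return False  # hashindex_set failed: abort the merge
    return True
```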
See PR #301. Needs careful review and testing.
just for the record: when i ran the final conversion last night, the resulting borg repo was perfectly usable without the cache rebuild. the backup ran within 13 minutes and didn't run out of space at all... so everything is good for me here... but i'll take a look at the PR.
Related: #474
as I rejected my own PR #301 (see there for reasons), this will take a bit longer, thus I adjusted the milestone.
A tangent to this is sync speed. Test set: …

tree: https://github.com/enkore/borg/tree/f/fastsync

EDIT: Interestingly enough, 1.0.x doesn't have nearly as large a variance as the master branch. And it is even some ~20 % faster than the tree linked above. Whoops! A closer look is needed here.
Hmm, interesting, but not really related to "cache space needs", is it?
Do we consider ~33 % on average "significant" in the context of this ticket? (I'm not talking about LZ4'ing the files.) I don't see any technical possibility to do much better than that with the c.a.d. (chunks.archive.d) model. Getting rid of the cache completely is already covered by other tickets.
Depends on cost: free/very cheap -> yes; expensive -> no. We already had expensive packing / compression and it was no good.
Basically free, much faster than LZ4. And merging the resulting c.a.d. becomes a tiny bit faster, too (the main cost of merging A into B is looking up every key from A in B).
Sounds good. More details about the idea?
13.58 GB (used space) + 7.57 GB (saved space) = 21.15 GB (used without .compact), i.e. compacting saves about 36 % of the cache's disk space here.

.compact is only necessary because previously the first index was used …

Since the HashIndex grows at 10 % for significant sizes, it has a bias …

The function itself (hashindex_compact) is very efficient [0], but has some …

[0] It may look on a first glance like it loops multiple times over the …

(GB = 1000³ bytes)
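As a toy model of what such a compaction does (an illustration assuming the index is an open-addressed bucket array, not borg's actual C code): much of the on-disk file is empty buckets, and a single in-place pass can shift all used buckets to the front so that only those need to be written out.

```python
def compact(buckets, EMPTY=None):
    """Shift all used buckets to the front of the array, in place.

    One pass over the array; returns the number of used buckets, so
    everything past that index can be dropped when writing to disk.
    The dense result is no longer a valid hash table in memory and
    must be rehashed on load.
    """
    write = 0
    for bucket in buckets:
        if bucket is not EMPTY:
            buckets[write] = bucket
            write += 1
    return write

# Example: a table at low load factor carries mostly empty buckets.
table = [("id1", 3), None, ("id2", 1), None, None, ("id3", 7)]
used = compact(table)
assert table[:used] == [("id1", 3), ("id2", 1), ("id3", 7)]
```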
Sounds good, let's delay further review until after b6, though.
after converting an attic repo to borg (using #231), i am having problems running `borg create`. i suspect this is not related to the migration, but to the new caching algorithm that borg uses. when running `borg create`, it seems to want to regenerate the chunks cache, even though it was converted and is available in `$BORG_CACHE_DIR/<id>/chunks`:

... you can see how it's quickly running out of space. whereas the cache directory was 1GB in attic, this ran out of space after 8GB written.
💰 there is a bounty for this