
significantly reduce cache space needs #235

Closed
anarcat opened this issue Oct 1, 2015 · 27 comments

Comments

@anarcat
Contributor

anarcat commented Oct 1, 2015

after converting an attic repo to borg (using #231), i am having problems running borg create. i suspect this is not related to the migration, but rather to the new caching algorithm that borg uses.

when running borg create, it seems to want to regenerate the chunks cache, even though it was converted and is available in $BORG_CACHE_DIR/<id>/chunks:

# borg create --do-not-cross-mountpoints --stats \
    /media/anarcat/BOOK/borg::$(hostname)-$(date +%Y-%m-%d)-post-attic \
    / /boot /usr /var /home \
    --exclude-caches \
    --exclude "/home/*/.cache" \
    --exclude "*/.Trash-*" \
    --exclude "*/[Cc]cache/*" \
    --exclude "*/.bitcoin/blocks/*" \
    --exclude "/tmp/*" \
    --exclude "*/build-area/*" \
    --exclude "/proc/*" \
    --exclude "/dev/*" \
    --exclude "/sys/*" \
    --exclude "/var/cache/*" \
    --exclude "/var/tmp/*" \
    --exclude "/var/log/*" -v -v
Synchronizing chunks cache...
Archives: 284, w/ cached Idx: 9, w/ outdated Idx: 0, w/o cached Idx: 275.
Fetching and building archive index for marcos-2015-08-22 ...
Merging into master chunks index ...
Fetching and building archive index for marcos-2015-04-15 ...
Merging into master chunks index ...
Fetching and building archive index for marcos-2015-05-17 ...
Merging into master chunks index ...
[...]
Fetching and building archive index for marcos-2015-02-19 ...
hashindex: /root/.cache/borg/d31f0c3d6969eb44b6a77eb4edac2ec0c1d257c8fbb3a0e26ae4cede83963a89/chunks.archive.d/28fbf8d98f730e04b9835e2c628fed6f00f057267979ed9b3b7c1497782760ca.tmp: fwrite buckets failed (No space left on device)

... you can see how it quickly runs out of space: whereas the cache directory was 1GB with Attic, this ran out of space after 8GB written.


💰 there is a bounty for this

@anarcat
Contributor Author

anarcat commented Oct 1, 2015

every chunk archive is 353MB. so if i let this go through, it will take around 100GB:

> 353Mibyte * 284 à Gibyte

  353 * mebibyte * 284 = approx. 97.902344 gibibytes

that seems rather unreasonable - it's almost the size of the original dataset...
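The estimate above can be verified with a quick back-of-the-envelope calculation (a standalone check of the arithmetic, not borg code):

```python
# 284 archives, each with a ~353 MiB per-archive chunk index.
MIB = 1024 ** 2
GIB = 1024 ** 3

total_bytes = 353 * MIB * 284
print(f"{total_bytes / GIB:.2f} GiB")  # ≈ 97.90 GiB
```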

@anarcat
Contributor Author

anarcat commented Oct 2, 2015

final size of the chunk cache is indeed 98G.

@ThomasWaldmann
Member

borg is locally caching all the single-archive indexes and that takes a lot of space. the "attic way" was to transfer them all from the (remote) repository when a cache rebuild was needed, which may take a lot of time.

So it's time vs. space currently, but I have 1-2 ideas left on how to optimize that; it's just not yet implemented.

@ThomasWaldmann ThomasWaldmann changed the title cache usage explosion significantly reduce cache space needs Oct 2, 2015
@ThomasWaldmann
Member

@anarcat for a quick remedy, try compressing 1 single-archive index using gzip, bz2 and lzma. How much is it before / after? But: if we do that, it'll get slower again.
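The suggested experiment can be run with Python's standard compression modules; a minimal sketch (the function name and the example path are illustrative, not part of borg):

```python
import bz2
import gzip
import lzma

def compression_ratios(path):
    """Report compressed size as a fraction of the original size
    for gzip (zlib), bz2, and lzma; smaller is better."""
    with open(path, "rb") as f:
        data = f.read()
    return {
        name: len(compress(data)) / len(data)
        for name, compress in (("gzip", gzip.compress),
                               ("bz2", bz2.compress),
                               ("lzma", lzma.compress))
    }

# Hypothetical usage on one single-archive index file:
# print(compression_ratios(
#     "/root/.cache/borg/<repo-id>/chunks.archive.d/<archive-id>"))
```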

@anarcat
Contributor Author

anarcat commented Oct 2, 2015

a fresh new backup of the same dataset generates a chunk cache of 575MB.

i don't understand why this is necessary: this is a local backup, so this is a huge performance hit without any gain whatsoever in speed, from my perspective.

how can i do the compression you are referring to?

@anarcat
Contributor Author

anarcat commented Oct 2, 2015

oh, and unless this wasn't clear: this is a blocker for me right now - i can't afford the (local!) space cost of 100GB for my backups, unfortunately, so right now i have failed to convert to the borg collective. sorry!!!

i want to be assimilated, what's going on? ;)

@ThomasWaldmann
Member

compression: that's just an experiment, not an existing feature.

Try to compress your 575MB file using different methods to see if it is worth the effort.

I'll add a quick & temporary hack so you won't need the space.

@ThomasWaldmann
Member

See PR #238 (just as a temporary solution, see comment in the source).

A long term solution might be one of:

a) Update: this can only work if only new archives were added, but no archives were deleted from the repo. To do the fixup for deleted archives, we need their index; but as they are deleted and we do not want to store all indexes locally, we don't have it. So, when deleted archives are detected, a full index rebuild is needed.

Assuming we start from:

  • a valid chunks master index MI (this should usually be the case) ...
  • ... for a KNOWN set of archive ids Ai in that master index (this is a new feature, we currently do not remember the archive ids the master index was built from).
  • a repo with a set of archive ids Ar

Then cache resync is defined as:

added = Ar - Ai
deleted = Ai - Ar
for archive_id in added:
    fixup(MI, archive_id, mult=+1)
for archive_id in deleted:
    fixup(MI, archive_id, mult=-1)
remember_as_Ai(Ar)

fixup:
for id, refcount, size, csize in archive_chunk_infos(archive_id):
    m_refcount, m_size, m_csize = MI.get(id, (0, None, None))
    m_refcount += mult * refcount
    m_size, m_csize = size, csize
    if m_refcount > 0:
        MI[id] = m_refcount, m_size, m_csize
    elif m_refcount == 0:
        del MI[id]
    else:  # < 0, should not happen
        raise ValueError("negative refcount, index inconsistent")
    # note: we must do more sanity checks above, omitted here for brevity

Note:

  • this will be fast in a lot of "normal" use cases, except when MI / Ai gets lost, then a full master index rebuild must be done (which might be slow when connection to repo is slow / has big latency).
  • this does not use / does not need a local single-archive index cache
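A runnable toy version of the resync sketch in (a), using plain dicts and sets in place of borg's real hashindex and repository objects; `archive_chunk_infos` is faked with an in-memory mapping, and it ignores the practical problem noted above that deleted archives' indexes may not be available locally. All names here are illustrative, not borg's actual API:

```python
def fixup(MI, chunk_infos, mult):
    """Apply (mult=+1) or revert (mult=-1) one archive's refcounts to MI."""
    for chunk_id, refcount, size, csize in chunk_infos:
        m_refcount, _, _ = MI.get(chunk_id, (0, None, None))
        m_refcount += mult * refcount
        if m_refcount > 0:
            MI[chunk_id] = (m_refcount, size, csize)
        elif m_refcount == 0:
            del MI[chunk_id]
        else:
            raise ValueError("negative refcount, index inconsistent")

def resync(MI, Ai, Ar, archive_chunk_infos):
    for archive_id in Ar - Ai:  # archives added since last sync
        fixup(MI, archive_chunk_infos[archive_id], mult=+1)
    for archive_id in Ai - Ar:  # archives deleted since last sync
        fixup(MI, archive_chunk_infos[archive_id], mult=-1)
    return set(Ar)              # the new Ai to remember

# Tiny example: archive "a2" was deleted, "a3" was added.
infos = {
    "a2": [("c1", 1, 10, 5)],
    "a3": [("c1", 1, 10, 5), ("c2", 2, 20, 9)],
}
MI = {"c1": (2, 10, 5)}         # master index built from a1 + a2
Ai = {"a1", "a2"}               # archive ids MI was built from
Ar = {"a1", "a3"}               # archive ids now in the repo
Ai = resync(MI, Ai, Ar, infos)
print(MI)  # {'c1': (2, 10, 5), 'c2': (2, 20, 9)}
```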

b) look at how restic does it.

IIRC it works using a reference to the previous backup archive and only considers chunk information from there.

@anarcat
Contributor Author

anarcat commented Oct 2, 2015

Try to compress your 575MB file using different methods to see if it is worth the effort.

root@marcos:/media/anarcat/BOOK/cache/d31f0c3d6969eb44b6a77eb4edac2ec0c1d257c8fbb3a0e26ae4cede83963a89/chunks.archive.d# pv ffdc92e4749f332628aaaea53316aa293aed10e5f3bd282b5aa428e75f344bec  | gzip -c > f.gz
 352MiO 0:00:31 [11,1MiB/s] [===============================================================================================================>] 100%
root@marcos:/media/anarcat/BOOK/cache/d31f0c3d6969eb44b6a77eb4edac2ec0c1d257c8fbb3a0e26ae4cede83963a89/chunks.archive.d# gunzip -l f.gz
         compressed        uncompressed  ratio uncompressed_name
          253679021           369098770  31.3% f

around 30% compression with standard gzip - is that what you were looking for? it would make 100GB go to 70GB. doesn't seem like a solution.

i don't know how to use #238, sorry...

@ThomasWaldmann
Member

No, gzip reducing by 30% is not worth the effort. and bzip2 and lzma would be much slower...

@anarcat
Contributor Author

anarcat commented Oct 3, 2015

thanks for looking into this!

the more i think about this issue, the less sense it seems to make. even if the repository were remote, i wouldn't want 100GB (50% of my dataset!) lying around as a cache on the local filesystem, just for an eventual cache resync.

i guess i'm unclear as to why such a cache resync would happen. i know it can happen if multiple machines back up to the same destination, but that's a special use case that could be disabled by default... is that it?

@ThomasWaldmann
Member

for some people it is not just an eventual cache resync, but a thing that happens daily - if you do a daily backup of multiple machines to the same destination repo. and that case is not that special, especially if you have similar machines and you want to exploit the deduplication to the max.

@anarcat
Contributor Author

anarcat commented Oct 3, 2015

sure, i understand that, but that seems like a special use case that was basically unsupported forever. :) now i understand we would like that to work, but i'm not sure the cost/benefit ratio is worth it.

it certainly is not worth it by default, imho.

@ThomasWaldmann ThomasWaldmann modified the milestones: 0.27, 0.28 Oct 3, 2015
@ThomasWaldmann ThomasWaldmann modified the milestones: 0.29, 0.28 Oct 17, 2015
ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Oct 17, 2015
…orgbackup#235

if archives are added, we just add their chunk refcounts to our chunks index.
this is relatively quick if only a few archives were added, so even for remote repos,
one can live without the locally cached per-archive indexes, saving a lot of space.

but: if archives are deleted from the repo, borg will do a full index rebuild.
if one wants this to be relatively fast and has a slow repo connection,
the locally cached per-archive indexes are useful.
thus: use delete / prune less frequently than create.

also: added error check in hashindex_merge in case hashindex_set fails
ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Oct 17, 2015
@ThomasWaldmann ThomasWaldmann modified the milestones: 0.28, 0.29 Oct 17, 2015
@ThomasWaldmann
Member

See PR #301. Needs careful review and testing.

@anarcat
Contributor Author

anarcat commented Oct 18, 2015

just for the record, when i ran the final conversion last night, the resulting borg repo was perfectly usable without the cache rebuild: the backup ran within 13 minutes and didn't run out of space at all... so everything is good for me here... but i'll take a look at the pr.

@ThomasWaldmann ThomasWaldmann modified the milestones: 0.29, 0.28 Oct 20, 2015
@ThomasWaldmann
Member

Related: #474

@ThomasWaldmann ThomasWaldmann modified the milestones: 0.30 maybe, maybe not, 0.29 - improve locking / logging Dec 12, 2015
@ThomasWaldmann ThomasWaldmann modified the milestones: 2.0 - future goals, 0.30 misc fixes and improvements Jan 15, 2016
@ThomasWaldmann
Member

as I rejected my own PR #301 (see there for reasons), this will take a bit longer, thus I adjusted the milestone.

@enkore
Contributor

enkore commented Oct 15, 2016

The tangent to this is sync speed.

Test set:

  • 16 archives, 984 GB, 447 GB deduplicated, 2283351 unique chunks, 8728178 refs.
  • Full syncs (easy setup).
  • Client <-> Server GbE.
  • Data transfer peaks at 740 MBit/s (=load ~74 %). Server side FS cache was not tampered with (not cleared between runs, so it's hot hot hot). Test command to sync with is "borg info ".

  • Variance between runs is large (worst-best ~1:4).
  • ... because archives are synced in hash order, so random each time
  • This makes a difference for both the fetch phase (because this makes the cache contents and locality different each sync) and the merge phase (somewhat surprised here)
  • As expected, applying a stable sort made this reproducible with low variance (the avg-avg number above; <1 s deltas).
  • In my tests oldest-first was the worst sort, newest-first the best. (factor ~three). But this likely depends on a lot of factors (eg metadata dedup between archives).
  • Moving a hot loop to pyx also made some difference, but a much bigger effect came from removing Item(internal_dict=item) [at the same time, I misanalyzed the data here at first due to that] and working directly on the dict (1:2.4 for the latter part; 1:2.7 for the first part combined with the latter). So the Item stuff looks like quite a perf regression here. A likely explanation is that this hits the allocator.

tree: https://github.com/enkore/borg/tree/f/fastsync

EDIT: Interestingly enough, 1.0.x has nowhere near as large a variance as the master branch. And it is even some ~20 % faster than the tree linked above. Whoops! A closer look is needed here.

@ThomasWaldmann
Member

Hmm, interesting, but not really related to "cache space needs", is it?

@enkore
Contributor

enkore commented Jun 2, 2017

Do we consider ~33 % on average "significant" in the context of this ticket? (I'm not talking about LZ4'ing the files).

I don't see any technical possibilities to do much better than that with the c.a.d. model. Getting completely rid of the cache is already in other tickets.

@ThomasWaldmann
Member

ThomasWaldmann commented Jun 2, 2017 via email

@enkore
Contributor

enkore commented Jun 2, 2017

Basically free, much faster than LZ4. And merging the resulting c.a.d. becomes a tiny bit faster (the main cost of merging A into B is looking every key from A up in B), too.

@ThomasWaldmann
Member

Sounds good. More details about the idea?

@enkore
Contributor

enkore commented Jun 2, 2017

enkore@0d2c904

Cache sync: processed 123.83 GB bytes (770937 chunks) of metadata
Cache sync: compact chunks.archive.d storage saved 7.57 GB bytes
Done.
RepositoryCache: current items 41466, size 2.07 GB / 2.15 GB, 632520 hits, 135240 misses, 4549 slow misses (+5.8s), 96951 evictions, 0 ENOSPC hit

$ du -sbc /home/mabe/.cache/borg/.../chunks.archive.d
13583935276

13.58 GB (used space) + 7.57 GB (saved space) = 21.15 GB (used without this technique); thus, 36 % space savings. Syncing from c.a.d. (i.e. edit the manifest ID in the cache config) is down <5 %; if the disk were the limit, the impact would be larger, since less I/O bandwidth is required.

.compact is only necessary because previously the first index was used as a starting point for the master index, so if that one were read from disk and were a compact index, then the cache would be corrupted.

Since the HashIndex grows at 10 % for significant sizes, it has a bias towards being full rather than empty, which is why the average space savings seem to work out to about 1-1/MAX_LF = 33 % (or slightly less). Savings are higher for small archives, which are still in the 100 % growth area.

The function itself (hashindex_compact) is very efficient [0] but has some cases left where it could be made even more efficient. To munge through the 21.15 GB and compact them, it required ~3.0 s, which means ~7 GB/s.

[0] It may look at first glance like it loops multiple times over the array, but it actually does not and looks at every bucket in the array only once. Note how there is only one loop counter and the nested loops advance that counter.

(GB = 1000³ bytes)
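In the same spirit as the hashindex_compact note in [0], here is an illustrative single-pass compaction of a sparse bucket array (the real implementation is C operating on the on-disk HashIndex layout, not a Python list):

```python
EMPTY = None  # stand-in for an unused hash-table bucket

def compact(buckets):
    """Move all used buckets to the front, preserving their order,
    visiting each bucket exactly once; returns the used-bucket count.
    A compact index stores no empty buckets, which is where the
    on-disk space savings come from."""
    write = 0
    for read in range(len(buckets)):
        if buckets[read] is not EMPTY:
            buckets[write] = buckets[read]
            write += 1
    del buckets[write:]  # drop the now-unused tail
    return write

table = [EMPTY, "a", EMPTY, EMPTY, "b", "c", EMPTY]
n = compact(table)
print(n, table)  # 3 ['a', 'b', 'c']
```

Note that a compacted table is no longer usable for hash lookups (entries are no longer at their hashed positions), which matches why the text above treats compact indexes as an on-disk storage format that must not be read back as a live starting-point index.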

@ThomasWaldmann
Member

Sounds good, let's delay further review to after b6 though.

@ThomasWaldmann ThomasWaldmann modified the milestones: 1.1.0b7, 2.0 - future goals Jun 2, 2017
@enkore enkore self-assigned this Jun 2, 2017
@ThomasWaldmann ThomasWaldmann modified the milestones: 1.1.0b6, 1.1.0b7 Jun 10, 2017