
significantly reduce cache space needs #235

Closed
anarcat opened this Issue Oct 1, 2015 · 27 comments

3 participants
@anarcat
Contributor

anarcat commented Oct 1, 2015

after converting an attic repo to borg (using #231), i am having problems running borg create. i suspect this is not related to the migration, but to the new caching algorithm that borg uses.

when running borg create, it seems to want to regenerate the chunks cache, even though it was converted and is available in $BORG_CACHE_DIR/<id>/chunks:

# borg create --do-not-cross-mountpoints --stats \
    /media/anarcat/BOOK/borg::$(hostname)-$(date +%Y-%m-%d)-post-attic \
    / /boot /usr /var /home \
    --exclude-caches \
    --exclude "/home/*/.cache" \
    --exclude "*/.Trash-*" \
    --exclude "*/[Cc]cache/*" \
    --exclude "*/.bitcoin/blocks/*" \
    --exclude "/tmp/*" \
    --exclude "*/build-area/*" \
    --exclude "/proc/*" \
    --exclude "/dev/*" \
    --exclude "/sys/*" \
    --exclude "/var/cache/*" \
    --exclude "/var/tmp/*" \
    --exclude "/var/log/*" \
    -v -v
Synchronizing chunks cache...
Archives: 284, w/ cached Idx: 9, w/ outdated Idx: 0, w/o cached Idx: 275.
Fetching and building archive index for marcos-2015-08-22 ...
Merging into master chunks index ...
Fetching and building archive index for marcos-2015-04-15 ...
Merging into master chunks index ...
Fetching and building archive index for marcos-2015-05-17 ...
Merging into master chunks index ...
[...]
Fetching and building archive index for marcos-2015-02-19 ...
hashindex: /root/.cache/borg/d31f0c3d6969eb44b6a77eb4edac2ec0c1d257c8fbb3a0e26ae4cede83963a89/chunks.archive.d/28fbf8d98f730e04b9835e2c628fed6f00f057267979ed9b3b7c1497782760ca.tmp: fwrite buckets failed (No space left on device)

... you can see how it's quickly running out of space. whereas the cache directory was 1GB in Attic, this ran out of space after 8GB written.


💰 there is a bounty for this

@anarcat
Contributor

anarcat commented Oct 1, 2015

every chunk archive is 353MB. so if i let this go through, it will take around 100GB:

> 353Mibyte * 284 à Gibyte

  353 * mebibyte * 284 = approx. 97.902344 gibibytes

that seems rather unreasonable - it's almost the size of the original dataset...
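(A throwaway Python check of that estimate, assuming 284 cached archive indexes at ~353 MiB each:)

MiB = 1024 ** 2
GiB = 1024 ** 3
print(353 * MiB * 284 / GiB)   # ~97.9, i.e. roughly 98 GiB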


@anarcat
Contributor

anarcat commented Oct 2, 2015

final size of the chunk cache is indeed 98G.


@ThomasWaldmann
Member

ThomasWaldmann commented Oct 2, 2015

borg is locally caching all the single-archive indexes and that takes a lot of space. the "attic way" was to transfer them all from the (remote) repository when a cache rebuild was needed, which may take a lot of time.

So it's time vs. space currently, but I have 1-2 ideas left for how to optimize that; they're just not yet implemented.


@ThomasWaldmann ThomasWaldmann changed the title from cache usage explosion to significantly reduce cache space needs Oct 2, 2015

@ThomasWaldmann
Member

ThomasWaldmann commented Oct 2, 2015

@anarcat for a quick remedy, try compressing 1 single-archive index using gzip, bz2 and lzma. How much is it before / after? But: if we do that, it'll get slower again.
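A minimal way to run that experiment with Python's gzip, bz2 and lzma modules (the path is just a placeholder for one of the files under $BORG_CACHE_DIR/<repo-id>/chunks.archive.d/):

import bz2, gzip, lzma

path = '/path/to/one/single-archive-index'  # placeholder, pick any cached archive index
with open(path, 'rb') as f:
    data = f.read()
print('original:', len(data))
for name, mod in (('gzip', gzip), ('bz2', bz2), ('lzma', lzma)):
    print(name + ':', len(mod.compress(data)))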


@anarcat
Contributor

anarcat commented Oct 2, 2015

a fresh new backup of the same dataset generates a chunk cache of 575MB.

i don't understand why this is necessary: this is a local backup, so this is a huge performance hit without any gain whatsoever in speed, from my perspective.

how can i do the compression you are referring to?


@anarcat
Contributor

anarcat commented Oct 2, 2015

oh, and in case this wasn't clear: this is a blocker for me right now - i can't afford the (local!) space cost of 100GB for my backups, unfortunately, so right now i have failed to convert to the borg collective. sorry!!!

i want to be assimilated, what's going on? ;)


@ThomasWaldmann
Member

ThomasWaldmann commented Oct 2, 2015

compression: that's just an experiment, not an existing feature.

Try to compress your 575MB file using different methods to see if it is worth the effort.

I'll add a quick & temporary hack so you won't need the space.


@ThomasWaldmann
Member

ThomasWaldmann commented Oct 2, 2015

See PR #238 (just as a temporary solution, see comment in the source).

A long term solution might be one of:

a) Update: this can only work if only new archives were added and no archives were deleted in the repo. To do the fixup for deleted archives, we need their index; but as they are deleted and we do not want to store all indexes locally, we don't have it. So, when deleted archives are detected, a full index rebuild is needed.

Assuming we start from:

  • a valid chunks master index MI (this should usually be the case) ...
  • ... for a KNOWN set of archive ids Ai in that master index (this is a new feature, we currently do not remember the archive ids the master index was built from).
  • a repo with a set of archive ids Ar

Then cache resync is defined as:

added = Ar - Ai
deleted = Ai - Ar
for archive_id in added:
    fixup(MI, archive_id, mult=+1)
for archive_id in deleted:
    fixup(MI, archive_id, mult=-1)
remember_as_Ai(Ar)

def fixup(MI, archive_id, mult):
    for id, refcount, size, csize in archive_chunk_infos(archive_id):
        m_refcount, m_size, m_csize = MI.get(id, (0, None, None))
        m_refcount += mult * refcount
        m_size, m_csize = size, csize
        if m_refcount > 0:
            MI[id] = m_refcount, m_size, m_csize
        elif m_refcount == 0:
            del MI[id]
        else:  # < 0, should not happen
            raise AssertionError('negative refcount')
        # note: we must do more sanity checks above, omitted here for brevity

Note:

  • this will be fast in a lot of "normal" use cases, except when MI / Ai gets lost, then a full master index rebuild must be done (which might be slow when connection to repo is slow / has big latency).
  • this does not use / not need a local single archive index cache

b) look at how restic does it.

IIRC it works using a reference to the previous backup archive and only considers chunk information from there.
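A self-contained toy version of scheme a), for illustration only (the names and chunk-info tuples here are made up, not borg's actual API):

def fixup(MI, chunk_infos, mult):
    for cid, refcount, size, csize in chunk_infos:
        m_refcount, _, _ = MI.get(cid, (0, size, csize))
        m_refcount += mult * refcount
        if m_refcount > 0:
            MI[cid] = (m_refcount, size, csize)
        elif m_refcount == 0:
            del MI[cid]
        else:
            raise ValueError('refcount went negative for %r' % (cid,))

MI = {}                                               # master chunks index: id -> (refcount, size, csize)
archive1 = [('c1', 1, 100, 60), ('c2', 2, 200, 120)]  # per-archive chunk info
archive2 = [('c1', 1, 100, 60)]
fixup(MI, archive1, +1)   # archive added
fixup(MI, archive2, +1)   # another archive added: c1 now referenced twice
fixup(MI, archive1, -1)   # first archive deleted again: c2 disappears, c1 drops back to 1
assert MI == {'c1': (1, 100, 60)}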


@anarcat
Contributor

anarcat commented Oct 2, 2015

> Try to compress your 575MB file using different methods to see if it is worth the effort.

root@marcos:/media/anarcat/BOOK/cache/d31f0c3d6969eb44b6a77eb4edac2ec0c1d257c8fbb3a0e26ae4cede83963a89/chunks.archive.d# pv ffdc92e4749f332628aaaea53316aa293aed10e5f3bd282b5aa428e75f344bec  | gzip -c > f.gz
 352MiO 0:00:31 [11,1MiB/s] [===============================================================================================================>] 100%
root@marcos:/media/anarcat/BOOK/cache/d31f0c3d6969eb44b6a77eb4edac2ec0c1d257c8fbb3a0e26ae4cede83963a89/chunks.archive.d# gunzip -l f.gz
         compressed        uncompressed  ratio uncompressed_name
          253679021           369098770  31.3% f

around 30% compression with standard gzip - is that what you were looking for? it would make 100GB go to 70GB. doesn't seem like a solution.

i don't know how to use #238, sorry...


@ThomasWaldmann
Member

ThomasWaldmann commented Oct 3, 2015

No, gzip reducing it by 30% is not worth the effort, and bzip2 and lzma would be much slower...


@anarcat
Contributor

anarcat commented Oct 3, 2015

thanks for looking into this!

the more i think about this issue, the more it doesn't seem to make sense. even if the repository were remote, i wouldn't want 100GB (50% of my dataset!) lying around as a cache on the local filesystem, just for an eventual cache resync.

i guess i'm unclear as to why such a cache resync would happen. i know it can happen if multiple servers back up to the same repository, but that's a special use case that could be disabled by default.... is that it?


@ThomasWaldmann
Member

ThomasWaldmann commented Oct 3, 2015

for some people it is not just an eventual cache resync, but a thing that happens daily - if you do a daily backup of multiple machines to the same destination repo. and that case is not that special, especially if you have similar machines and you want to exploit the deduplication to the max.


@anarcat
Contributor

anarcat commented Oct 3, 2015

sure, i understand that, but that seems like a special use case that was basically unsupported forever. :) now i understand we would like that to work, but i'm not sure the cost/benefit ratio is worth it.

it certainly is not worth it by default, imho.


@ThomasWaldmann ThomasWaldmann modified the milestones: 0.27, 0.28 Oct 3, 2015

@ThomasWaldmann ThomasWaldmann modified the milestones: 0.29, 0.28 Oct 17, 2015

ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Oct 17, 2015

cache: faster chunks cache resync if archives were only added, fixes borgbackup#235

if archives are added, we just add their chunk refcounts to our chunks index.
this is relatively quick if only a few archives were added, so even for remote repos,
one can live without the locally cached per-archive indexes, saving a lot of space.

but: if archives are deleted from the repo, borg will do a full index rebuild.
if one wants this to be relatively fast and has a slow repo connection,
the locally cached per-archive indexes are useful.
thus: use delete / prune less frequently than create.

also: added error check in hashindex_merge in case hashindex_set fails


@ThomasWaldmann ThomasWaldmann modified the milestones: 0.28, 0.29 Oct 17, 2015

@ThomasWaldmann
Member

ThomasWaldmann commented Oct 18, 2015

See PR #301. Needs careful review and testing.


@anarcat
Contributor

anarcat commented Oct 18, 2015

just for the record, when i ran the final conversion last night, the resulting borg repo was perfectly usable without the cache rebuild: the backup ran within 13 minutes and didn't run out of space at all... so everything is good for me here... but i'll take a look at the pr.


@anarcat
Contributor

anarcat commented Oct 19, 2015

yeah, i don't understand why it broke the last time i tried... i guess i fixed something in the upgrader. yay? :)


@ThomasWaldmann ThomasWaldmann modified the milestones: 0.29, 0.28 Oct 20, 2015

@ThomasWaldmann
Member

ThomasWaldmann commented Dec 8, 2015

Related: #474


@ThomasWaldmann ThomasWaldmann modified the milestones: 0.30 maybe, maybe not, 0.29 - improve locking / logging Dec 12, 2015

@ThomasWaldmann ThomasWaldmann modified the milestones: 2.0 - future goals, 0.30 misc fixes and improvements Jan 15, 2016

@ThomasWaldmann
Member

ThomasWaldmann commented Jan 15, 2016

as I rejected my own PR #301 (see there for reasons), this will take a bit longer, thus I adjusted the milestone.


@enkore
Contributor

enkore commented Oct 15, 2016

The tangent to this is sync speed.

Test set:

  • 16 archives, 984 GB, 447 GB deduplicated, 2283351 unique chunks, 8728178 refs.
  • Full syncs (easy setup).
  • Client <-> Server GbE.
  • Data transfer peaks at 740 MBit/s (=load ~74 %). Server side FS cache was not tampered with (not cleared between runs, so it's hot hot hot). Test command to sync with is "borg info ".

  • Variance between runs is large (worst-best ~1:4).
  • ... because archives are synced in hash order, so random each time
  • This makes a difference for both the fetch phase (because this makes the cache contents and locality different each sync) and the merge phase (somewhat surprised here)
  • As expected, applying a stable sort made this reproducible with low variance (the avg-avg number above; <1 s deltas).
  • In my tests oldest-first was the worst sort, newest-first the best. (factor ~three). But this likely depends on a lot of factors (eg metadata dedup between archives).
  • Moving a hot loop to pyx also made some difference, but a much bigger effect came from removing Item(internal_dict=item) and working directly on the dict (1:2.4 for the latter part; 1:2.7 for the first part combined with the latter) [I misanalyzed the data here at first because of that]. So the Item stuff looks like quite a perf regression here. The likely explanation is that it hits the allocator.

tree: https://github.com/enkore/borg/tree/f/fastsync

EDIT: Interestingly enough, 1.0.x has nowhere near as large a variance as the master branch. And it is even some ~20 % faster than the tree linked above. Whoops! A closer look is needed here.
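For the stable-sort point above, a sketch of what is meant (assuming archive entries carry a timestamp attribute; this is not the actual borg code):

from operator import attrgetter

def sync_order(archives):
    # sorted() is a stable sort; syncing newest-first gave the best results in the tests above
    return sorted(archives, key=attrgetter('ts'), reverse=True)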


@ThomasWaldmann
Member

ThomasWaldmann commented Oct 15, 2016

Hmm, interesting, but not really related to "cache space needs", is it?


@enkore
Contributor

enkore commented Jun 2, 2017

Do we consider ~33 % on average "significant" in the context of this ticket? (I'm not talking about LZ4'ing the files).

I don't see any technical possibility to do much better than that with the c.a.d. (chunks.archive.d) model. Getting rid of the cache completely is already covered by other tickets.


@ThomasWaldmann
Member

ThomasWaldmann commented Jun 2, 2017

@enkore
Contributor

enkore commented Jun 2, 2017

Basically free, much faster than LZ4. And merging the resulting c.a.d. becomes a tiny bit faster, too (the main cost of merging A into B is looking up every key from A in B).
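To illustrate that merge cost with a toy dict-based model (not the real C hashindex; the tuple layout is an assumption):

def merge(master, archive_index):
    # merging A (archive_index) into B (master): one lookup in B per key of A
    for cid, (refcount, size, csize) in archive_index.items():
        m_refcount, _, _ = master.get(cid, (0, size, csize))
        master[cid] = (m_refcount + refcount, size, csize)

master = {'c1': (1, 100, 60)}
archive = {'c1': (1, 100, 60), 'c2': (3, 200, 120)}
merge(master, archive)
assert master == {'c1': (2, 100, 60), 'c2': (3, 200, 120)}

Presumably a compacted A mainly saves reading and skipping empty buckets, while the dominant per-key lookups in B are unchanged; that would be consistent with the merge only getting a tiny bit faster.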


@ThomasWaldmann
Member

ThomasWaldmann commented Jun 2, 2017

Sounds good. More details about the idea?


@enkore
Contributor

enkore commented Jun 2, 2017

enkore@0d2c904

Cache sync: processed 123.83 GB bytes (770937 chunks) of metadata
Cache sync: compact chunks.archive.d storage saved 7.57 GB bytes
Done.
RepositoryCache: current items 41466, size 2.07 GB / 2.15 GB, 632520 hits, 135240 misses, 4549 slow misses (+5.8s), 96951 evictions, 0 ENOSPC hit

$ du -sbc /home/mabe/.cache/borg/.../chunks.archive.d
13583935276

13.58 GB (used space) + 7.57 GB (saved space) = 21.15 GB (used without this technique); thus, 36 % space savings. Syncing from c.a.d. (i.e. edit manifest ID in cache config) is down <5 %; if the disk were the limit, the impact would be larger, since less I/O bandwidth is required.

.compact is only necessary because previously the first index was used as a starting point for the master index, so if that one were read from disk and were a compact index, the cache would be corrupted.

Since the HashIndex grows at 10 % for significant sizes, it has a bias towards being full rather than empty, which is why the average space savings seem to work out to about 1-1/MAX_LF = 33 % (or slightly less). Savings are higher for small archives, which are still in the 100 % growth area.

The function itself (hashindex_compact) is very efficient [0] but has some cases left where it could be made even more efficient. To munge through the 21.15 GB and compact them required ~3.0 s, which means 7 GB/s.

[0] It may look at first glance like it loops multiple times over the array, but it actually does not and looks at every bucket in the array only once. Note how there is only one loop counter and the nested loops advance that counter.

(GB = 1000³ bytes)
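To make the compaction idea concrete, a hedged plain-Python sketch (the real code is hashindex_compact in C; the EMPTY/DELETED markers and the list-of-buckets layout here are stand-ins, not the actual on-disk format):

EMPTY, DELETED = object(), object()

def compact(buckets):
    # one pass over the bucket array: keep only occupied buckets, so the
    # cached file shrinks by roughly the fraction of unused buckets
    return [b for b in buckets if b is not EMPTY and b is not DELETED]

# e.g. a table at ~2/3 load factor, written compacted, saves about a third of the space
sparse = ['a', EMPTY, 'b', DELETED, 'c', EMPTY]
assert compact(sparse) == ['a', 'b', 'c']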


@ThomasWaldmann
Member

ThomasWaldmann commented Jun 2, 2017

Sounds good, let's delay further review to after b6 though.


@ThomasWaldmann ThomasWaldmann modified the milestones: 1.1.0b7, 2.0 - future goals Jun 2, 2017

@enkore enkore self-assigned this Jun 2, 2017

@enkore enkore closed this in #2638 Jun 10, 2017

@ThomasWaldmann ThomasWaldmann modified the milestones: 1.1.0b6, 1.1.0b7 Jun 10, 2017

@enkore enkore added the c: hashindex label Jul 23, 2017
