
borgception #474

Open
ThomasWaldmann opened this issue Dec 7, 2015 · 9 comments


@ThomasWaldmann
Member

commented Dec 7, 2015

I thought a bit about how to optimize the chunks cache and just wanted to document one weird idea.

The issue with the chunks cache is that it needs to match the overall repository state (i.e. have up-to-date information about all chunks in all archives, including refcount, size and csize). When backing up multiple machines into the same repo, creating an archive from one machine invalidates the chunk caches on all the other machines, and they need to resync their chunks cache with the repo, which is expensive.

So there is the idea to also store the chunk index in the repo, so that out-of-sync clients can just fetch the index from there.

But:

  • the index can be large (way larger than the segment size)
  • a raw hashtable can have up to 75% unused bucket space
  • the index has additional information about chunks, so we should not store it unencrypted if the repo is encrypted
  • the index should match the chunks in the repo, so storing it should not create chunks of its own in the repo

So we need:

  • chunking of the index into smaller pieces
  • compression (unused bucket space is mostly binary zeros AFAIK)
  • encryption

This pretty much sounds like we should just back up the index of repo A into a related, but separate, borg repository A'. :-)
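The three requirements above (chunk, compress, encrypt the serialized index) can be sketched roughly as follows. This is only an illustration, not borg's actual implementation: the fixed chunk size, the zlib choice and the `encrypt` placeholder are all assumptions, and a real repo would apply its own cipher where the placeholder sits.

```python
import zlib

CHUNK_SIZE = 2**20  # illustrative fixed 1 MiB pieces, not borg's chunker


def encrypt(data: bytes) -> bytes:
    """Placeholder: a real repo would apply its AES-CTR + MAC here."""
    return data


def index_to_repo_chunks(index_bytes: bytes):
    """Chunk, compress and encrypt a serialized chunk index for storage."""
    for off in range(0, len(index_bytes), CHUNK_SIZE):
        piece = index_bytes[off:off + CHUNK_SIZE]
        # unused hashtable buckets are mostly zero bytes, so they compress well
        yield encrypt(zlib.compress(piece))


# a mostly-empty hashtable: 75% zero buckets shrink to almost nothing
raw = (b"\x00" * 48 + b"somebucketdata16") * 100_000  # ~6 MB of index data
pieces = list(index_to_repo_chunks(raw))
total = sum(len(p) for p in pieces)
print(len(pieces), total)
```

The compressed pieces end up at a small fraction of the raw index size, which is the point of compressing away the unused bucket space before upload.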


💰 there is a bounty for this

@ThomasWaldmann ThomasWaldmann changed the title borg inception borgception Dec 7, 2015

@level323


commented Dec 7, 2015

I don't think it's a weird idea at all.

I was thinking along similar lines, actually.

I was initially "what-iffing" the idea of storing the chunks cache inside the main repo, but then saw benefits in storing it in a special dedicated repo, because this opens up the possibility of exploiting the known structure of the chunks cache to minimise the size of its repo (and minimise transfer time over slow links).

I think similarly fresh approaches could be used to optimise the size of other objects, such as the archive metadata and the files cache, to optimise synchronisation of this data between machines in a multi-machine-backing-up-to-a-central-repo use case.


@RonnyPfannschmidt

Contributor

commented Dec 7, 2015

How about having per-segment index chunks? Then each segment would carry encrypted index update instructions, and those could be removed or regenerated as well.

@RonnyPfannschmidt

Contributor

commented Dec 7, 2015

Yet another method would be to have two files per segment: one for encrypted metadata and one for the blob data it adds, in order.

The whole chunk index could then be reconstructed from the metadata.
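Reconstructing the full index from per-segment metadata could look roughly like the sketch below. The structures (`segments`, the `"adds"` lists, the `[refcount, size]` entries) are purely illustrative assumptions about what such metadata might contain, not borg's on-disk format.

```python
from collections import defaultdict

# hypothetical per-segment metadata, oldest segment first: each segment
# lists the (chunk id, size) pairs it adds to the repository
segments = [
    {"adds": [("id1", 100), ("id2", 200)]},
    {"adds": [("id2", 200), ("id3", 50)]},  # id2 referenced a second time
]

# replay all segments in order to rebuild chunk id -> [refcount, size]
index = defaultdict(lambda: [0, 0])
for seg in segments:
    for chunk_id, size in seg["adds"]:
        entry = index[chunk_id]
        entry[0] += 1       # one more reference to this chunk
        entry[1] = size     # size is the same for every reference

print(dict(index))
```

This is exactly the "tens of thousands of merges" cost discussed below: a full regeneration must replay every segment, while a client that is only slightly behind would replay just the new ones.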

@ThomasWaldmann

Member Author

commented Dec 7, 2015

@level323 the files cache does not need to be shared. It just remembers the mtime/size/inode/chunks info for all files of the last backup, so those files can be quickly skipped next time.

@RonnyPfannschmidt what's the point of a per-segment index if you have to merge 100,000 of them to get the full index?

@RonnyPfannschmidt

Contributor

commented Dec 8, 2015

@ThomasWaldmann you only need to deal with tens of thousands of them on a full index regeneration.

The multi-machine use case would be much less computationally intensive, since each machine only has to obtain and apply the increments.

@ThomasWaldmann

Member Author

commented Dec 8, 2015

@RonnyPfannschmidt well, a segment holds 5 MB, so a 500 GB repo has 100,000 segments. That's just an example, but quite a realistic one; of course it can be more or fewer, depending on how much data you have.

But I still don't see how your suggestion would be efficient. Just for comparison: the normal, uncached and quite slow repo index rebuild goes through ALL archives and ALL files, and uses the per-file chunk list stored in the item metadata as the increment (the item metadata is stored clustered together in a few segments, not together with the file content data).

BTW, an incremental "just add on top of what we already have" approach for the chunks cache only works as long as nothing is removed. If something is removed, the information about it is gone too, so we can't subtract. (See the PRs.)

@RonnyPfannschmidt

Contributor

commented Dec 8, 2015

If each segment also tracks removals, then the chunks index will match the current state after applying each segment's delta. The main problem would be correcting the reference segment of the current state on a vacuum, which is quite hard, since segments are combined on vacuum.
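The idea of deltas that record removals as well as additions can be sketched like this. The delta format (`"add"`/`"remove"` lists over a refcount map) is a hypothetical illustration; it only shows why tracking removals makes the replay invertible where an add-only log is not.

```python
def apply_delta(index, delta):
    """Bring a chunk refcount index forward by one segment's delta."""
    for chunk_id in delta.get("add", []):
        index[chunk_id] = index.get(chunk_id, 0) + 1
    for chunk_id in delta.get("remove", []):
        index[chunk_id] -= 1
        if index[chunk_id] == 0:
            del index[chunk_id]  # chunk no longer exists in the repo
    return index


# an out-of-date client index, plus the deltas written since its last sync
index = {"a": 2, "b": 1}
deltas = [
    {"add": ["c"]},
    {"remove": ["b"], "add": ["a"]},  # without the removal record, "b" would linger
]
for delta in deltas:
    apply_delta(index, delta)

print(index)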

@pszxzsd


commented Dec 9, 2015

No idea whether it's applicable to the problem, but ZeroDB (docs), an end-to-end encrypted database, recently went open source as a Python implementation.

@RonnyPfannschmidt

Contributor

commented Dec 9, 2015

It's highly unrelated, and a completely different kind of data store that is not suitable for borg's needs.
