
What is the concurrent/remote story? #185

Open
jgoerzen opened this issue Jan 3, 2021 · 6 comments

jgoerzen commented Jan 3, 2021

Hi,

First, thanks for rdedup! This looks like a fantastic project to meet some real needs.

I would be happy to write some documentation on this if you can point me in the right direction.

I have two related questions.

First, is concurrent access to the repository allowed, and if so, in what ways? Can two processes write to it at once? Is any locking done? This is relevant for consolidating backups from multiple hosts to a single backup host.

Secondly, I see that cloud storage is WIP, which is fine. I'm wondering what exactly rdedup needs from its underlying filesystem, with an aim to evaluating whether it can run atop the various FUSE remotes; anything from sshfs to the S3-based ones, etc.

Thanks!

  • John
jendakol commented Jan 3, 2021

Hi @jgoerzen,
Not exactly answering your question, but since you're interested in remote storage, see my recent PR: #184

BTW, after you check the linked code, you will understand that rdedup has solved concurrency quite well: the backend has two-level locking.

dpc (owner) commented Jan 4, 2021

Quick relevant link: https://github.com/dpc/rdedup/wiki/Rust's-fearless-concurrency-in-rdedup

IIRC, the whole backend storage is protected by a sort of read-write lock, and most operations (in particular adding new data) take a shared lock:

let _lock = self.aio.lock_shared();

This is thanks to everything being idempotent and content-addressable. You can read and write at will, and any overwritten file etc. is going to have exactly the same content every time, so there's no problem.

Notably, removing data (mostly garbage collection) takes an exclusive lock.

dpc (owner) commented Jan 4, 2021

A backend is anything that implements this basic interface:

pub trait BackendThread: Send {

geek-merlin commented Jan 6, 2021

Notably, removing data (mostly garbage collection) takes an exclusive lock.

@dpc Thanks for pointing that out; I already wondered about that. So starting a GC blocks everything else until it's done, or is there finer granularity? Is the case where one backup writes a chunk while another one GCs it covered? (Source pointer appreciated. ;-)) What if the repo is remote? IIRC, on S3-like remotes writes are not instant, so locking may not even be possible. Do you know duplicacy's two-step process?

dpc (owner) commented Jan 6, 2021

I think GC right now will block everything. The backend is irrelevant: from the main logic's perspective, backends only write and load requested files (more or less).

However, GC can be stopped at any time without losing progress and then resumed, so if a long GC is a problem, I could imagine putting rdedup gc ... behind a timeout or something and running it periodically. It could probably be implemented with finer granularity, but I never investigated that.

I skimmed https://github.com/gilbertchen/duplicacy/wiki/Lock-Free-Deduplication#two-step-fossil-collection and rdedup does the exact opposite, mostly due to the idempotency design and support for concurrent writes synced through a dumb syncing mechanism (Dropbox/Syncthing).

The GC works by creating another "generation" folder, then, stored-name by stored-name, rewriting (moving) all the chunks into the new generation. After all names have been moved from the old generation to the new one, the leftover data chunks in the previous generations are clearly not referenced by anything, so they are deleted (after a reasonably long time has passed, to make sure any concurrent writers had time for Dropbox/Syncthing to sync). This should be fine as long as renames are not very expensive, which is not always the case (e.g. Backblaze B2 had no support for a rename operation at the time).

geek-merlin commented Jan 6, 2021

Ah amazing! Reading old tickets gives me a picture: #75 #32 #132 #37
Relevant to me: #172
