
Deletion/Gc #1

Open · zenhack opened this issue Aug 25, 2020 · 8 comments

Comments

zenhack (Contributor) commented Aug 25, 2020

I'd like to be able to use a blob store like this to store things with the option of purging data that should no longer be kept (e.g. for privacy reasons). But right now the interfaces in this package don't provide any way to delete blobs.

So deletion is my primary interest, but additionally it would be nice if the store could keep track of edges between blobs and do garbage collection, treating the anchors as roots.

bobg (Owner) commented Aug 27, 2020

Hi! Thanks for your interest.

I've been thinking about this feature too, and I would welcome your input on a design decision.

As you may have seen, bs differs from Perkeep in at least one important respect: blobs are optionally typed. If a type is specified, it refers to a protobuf structure into which the blob can be deserialized.

So given a root set of refs and/or anchors, it is possible, in principle, to trace all that set's transitive references, and protect those blobs from GC. Without type information, this would have to be done heuristically, as it is in Perkeep.

The question is, how should the system know which fields of a protobuf are references to blobs that should be protected? There are a few choices:

  • It's a reference if it has a certain specific type.

The type would probably have to be something like this:

message Ref {
  bytes ref = 1;
}

This option seems like it might be a little annoying to deal with. But it'd be typesafe.

  • It's a reference if it has type bytes and the name ref, or a name ending in _ref.

This is verging into heuristic territory, which is something I've been trying to avoid in the design of bs. I don't actually like this option: it places a constraint on the programmer similar to the first option's, minus the measure of type safety that option 1 gives.

  • It's a reference if it has type bytes and is 32 bytes long.

This is totally heuristic, and it might protect some blobs by mistake. On the other hand, it's dead-simple for the programmer, errs on the side of safety (protecting too much instead of too little, which is a danger with the first couple of options), and the likelihood of a 32-byte bytes field not being meant as a ref, but still matching the hash of an existing blob, can probably be totally discounted.

I think I just talked myself into option 3. What do you think?
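The option-3 heuristic can be applied without any schema at all, since ref candidates are visible directly in the protobuf wire format. The following sketch (not bs's actual code; just an illustration of the idea) scans a serialized message and collects every length-delimited field whose payload is exactly 32 bytes:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// candidateRefs scans a serialized protobuf message and returns every
// length-delimited field whose payload is exactly 32 bytes -- the
// option-3 heuristic: treat any such field as a possible blob ref.
func candidateRefs(msg []byte) [][]byte {
	var refs [][]byte
	for len(msg) > 0 {
		tag, n := binary.Uvarint(msg)
		if n <= 0 {
			return refs // malformed; stop scanning
		}
		msg = msg[n:]
		switch tag & 7 { // wire type is the low 3 bits of the tag
		case 0: // varint
			_, n := binary.Uvarint(msg)
			if n <= 0 {
				return refs
			}
			msg = msg[n:]
		case 1: // fixed 64-bit
			if len(msg) < 8 {
				return refs
			}
			msg = msg[8:]
		case 2: // length-delimited (bytes, string, sub-message)
			l, n := binary.Uvarint(msg)
			if n <= 0 || uint64(len(msg)-n) < l {
				return refs
			}
			payload := msg[n : n+int(l)]
			if l == 32 {
				refs = append(refs, payload)
			}
			msg = msg[n+int(l):]
		case 5: // fixed 32-bit
			if len(msg) < 4 {
				return refs
			}
			msg = msg[4:]
		default:
			return refs // unknown wire type; stop
		}
	}
	return refs
}

func main() {
	// Hand-encoded message: field 1, wire type 2, 32-byte payload.
	msg := append([]byte{0x0a, 32}, make([]byte, 32)...)
	fmt.Println(len(candidateRefs(msg))) // 1
}
```

Note that this deliberately doesn't recurse into sub-messages (which share wire type 2 with bytes fields); a real scanner would need type information, or a heuristic, to decide when to descend.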

zenhack (Contributor, Author) commented Aug 27, 2020

One thing I don't like about option 3 is that it makes it impossible to tell the difference between a dangling reference/missing data (e.g. because of an interrupted sync) and something that just isn't a reference at all. This would be less of a danger if all of the backends supported transactions. It also means that when looking at a blob, you can't tell whether something is a reference without asking the store; so if you want to traverse some sub-graph, you'd have to try to fetch every 32-byte field just in case it's a ref. While I'm not really worried about false positives against the full hash, I do expect 32-byte fields that aren't hashes to be reasonably common. I'd like a system where I can rule out non-refs without hitting the store. So option 1, I guess.

bobg (Owner) commented Aug 28, 2020

OK, I will give option 1 a try too to see how it feels, but in the meantime here's a sketch of option 3: https://github.com/bobg/bs/tree/gc/gc. Interested to know what you think.

zenhack (Contributor, Author) commented Aug 28, 2020

Given those interfaces, I'm not sure how transitive deletion is supposed to work -- it doesn't look like reachability information is being stored anywhere, so if I delete a root A, and A is the only thing that refers to B, how do we detect that B should no longer be kept? One obvious solution would be to explicitly store edges/refcounts.

A refcounting approach will probably also be more efficient on large stores, as it avoids the need to do a full sweep over all the blobs.
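The edge/refcount idea can be sketched in a few lines. This is a toy illustration only, with entirely hypothetical names (it is not an interface from bs): each blob records its outgoing edges, and deleting a blob decrements its children's counts, cascading when a count hits zero, with no full sweep required.

```go
package main

import "fmt"

// refStore is a toy refcounting blob store: each blob records its
// outgoing edges, and deletion cascades to children whose incoming-edge
// count drops to zero. All names here are hypothetical.
type refStore struct {
	blobs    map[string][]byte
	edges    map[string][]string // blob -> refs it contains
	refcount map[string]int      // incoming-edge counts
}

func newRefStore() *refStore {
	return &refStore{
		blobs:    map[string][]byte{},
		edges:    map[string][]string{},
		refcount: map[string]int{},
	}
}

// put stores a blob along with its (caller-supplied) outgoing refs,
// bumping each child's refcount.
func (s *refStore) put(ref string, data []byte, children []string) {
	s.blobs[ref] = data
	s.edges[ref] = children
	for _, c := range children {
		s.refcount[c]++
	}
}

// del removes a blob and cascades to children whose refcount reaches
// zero -- no O(n) sweep over the whole store.
func (s *refStore) del(ref string) {
	children := s.edges[ref]
	delete(s.blobs, ref)
	delete(s.edges, ref)
	delete(s.refcount, ref)
	for _, c := range children {
		s.refcount[c]--
		if s.refcount[c] <= 0 {
			s.del(c)
		}
	}
}

func main() {
	s := newRefStore()
	s.put("B", []byte("leaf"), nil)
	s.put("A", []byte("root"), []string{"B"}) // A is B's only referrer
	s.del("A")
	_, ok := s.blobs["B"]
	fmt.Println(ok) // false: B was collected transitively
}
```

The usual caveat applies: pure refcounting can't collect cycles, though content-addressed stores largely avoid them, since a blob's ref depends on its contents.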

bobg (Owner) commented Aug 29, 2020

I don't think transitive deletion should be a feature of a bs.Store, which is and should remain a low-level storage mechanism with minimal knowledge of any kind of data schema. (And as we've seen in #3, the knowledge it does have is already causing problems.)

However, it's not hard to imagine a store defined in terms of bs.Store but at a higher abstraction layer, the way anchor.Store and gc.Store are, in which blobs can be refcounted, and where a refcount of zero means automatic deletion.

Let's continue to use this issue to discuss the pending gc work (on the gc branch). If you'd like to track the idea of a reference-counting abstraction layer I encourage you to open a new feature-request issue.

zenhack (Contributor, Author) commented Aug 29, 2020

/me looks again.

Ah, for some reason I had it in my head that the Keep was going to be persistent, but for a simple mark & sweep we can just throw it out afterwards. Perhaps for the benefit of on-disk Keep implementations we could add a Clear() method to wipe the underlying storage?

I do think edge tracking would still have advantages, but sure, that can be a separate issue.

bobg (Owner) commented Aug 11, 2021

Ahoy there! One year later and I have done away with the whole idea of types in BS. They were not accomplishing what I wanted them to.

For garbage collection, it is the caller's responsibility to identify root refs and to supply callback functions that traverse them to identify additional refs to protect. See https://pkg.go.dev/github.com/bobg/bs@v0.2.0/gc#Protect. Hashsplit trees have a predefined "protect" callback, split.Protect, which doubles as a model for how to write these traversal functions.
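The protect-callback pattern can be sketched as a plain mark & sweep; the real API is at the pkg.go.dev link above, and every name below is hypothetical:

```go
package main

import "fmt"

// protectFn, given a ref, returns the refs it points to, so the
// collector can trace them -- the caller-supplied "protect" callback
// idea, in miniature. All names here are hypothetical.
type protectFn func(ref string) []string

// protect marks ref and everything transitively reachable through next.
func protect(keep map[string]bool, ref string, next protectFn) {
	if keep[ref] {
		return // already marked; also breaks cycles
	}
	keep[ref] = true
	for _, child := range next(ref) {
		protect(keep, child, next)
	}
}

// sweep deletes every blob not in the keep set.
func sweep(blobs map[string][]byte, keep map[string]bool) {
	for ref := range blobs {
		if !keep[ref] {
			delete(blobs, ref)
		}
	}
}

func main() {
	blobs := map[string][]byte{
		"root":   []byte("points at child"),
		"child":  []byte("leaf"),
		"orphan": []byte("unreachable"),
	}
	edges := map[string][]string{"root": {"child"}}

	keep := map[string]bool{}
	protect(keep, "root", func(ref string) []string { return edges[ref] })
	sweep(blobs, keep)
	fmt.Println(len(blobs)) // 2: root and child survive, orphan is swept
}
```

The division of labor is the point: the store knows nothing about blob contents, and each data format (like hashsplit trees) brings its own traversal callback.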

How does this interface feel?

zenhack (Contributor, Author) commented Aug 17, 2021

This seems like a reasonable point in the design space. I do still worry that the mark & sweep collector you've designed will have scaling problems; on very large stores, doing O(n) passes over all live data is likely to be prohibitive.

For what it's worth, I started working on my own Merkle-DAG store a while back, which has explicit knowledge of pointers as I suggest, largely by piggy-backing on Cap'n Proto:

https://github.com/zenhack/ocap-merkledag

The schema in particular: https://github.com/zenhack/ocap-merkledag/blob/master/schema/protocol.capnp
