-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deletion/Gc #1
Comments
Hi! Thanks for your interest. I've been thinking about this feature too, and I would welcome your input on a design decision. As you may have seen, bs differs from Perkeep in at least one important respect: blobs are optionally typed. If a type is specified, it refers to a protobuf structure into which the blob can be deserialized. So given a root set of refs and/or anchors, it is possible, in principle, to trace all that set's transitive references, and protect those blobs from GC. Without type information, this would have to be done heuristically, as it is in Perkeep. The question is, how should the system know which fields of a protobuf are references to blobs that should be protected? There are a few choices:
The type would probably have to be something like this: message Ref {
bytes ref = 1;
} This option seems like it might be a little annoying to deal with. But it'd be typesafe.
This is verging into heuristic territory, which is something I've been trying to avoid in the design of bs. I don't actually like this option, because it places a similar naming constraint on the programmer as the first option, minus the measure of typesafety that option 1 gives.
This is totally heuristic, and it might protect some blobs by mistake. On the other hand, it's dead-simple for the programmer, errs on the side of safety (protecting too much instead of too little, which is a danger with the first couple of options), and the likelihood of a 32-byte I think I just talked myself into option 3. What do you think? |
One thing I don't like about option 3 is it makes it impossible to tell the difference between a dangling reference/missing data, e.g. because of an interrupted sync, and something that just isn't a reference at all. This would be less of a danger if all of the backends supported transactions. But it also means that when looking at a blob, you can't tell if something is a reference or not without asking the store, so e.g.if you want to do a traversal of some sub-graph, you'd have to try to fetch every 32-byte field just in case it's a ref -- while I'm not really that worried about false-positives for the full hash, I do expect 32-byte fields that aren't hashes to be reasonably common. I'd like a system where I can rule out non-refs without hitting the store. So option 1 I guess. |
OK, I will give option 1 a try too to see how it feels, but in the meantime here's a sketch of option 3: https://github.com/bobg/bs/tree/gc/gc. Interested to know what you think. |
Given those interfaces, I'm not sure how transitive deletion is supposed to work -- it doesn't look like reachability information is being stored anywhere, so if I delete a root A refcounting approach will probably also be more efficient on large stores, as it avoids the need to do a full sweep over all the blobs. |
I don't think transitive deletion should be a feature of a However, it's not hard to imagine a store defined in terms of Let's continue to use this issue to discuss the pending gc work (on the gc branch). If you'd like to track the idea of a reference-counting abstraction layer I encourage you to open a new feature-request issue. |
/me looks again. Ah, for some reason I had it in my head that the Keep was going to be persistent, but for a simple mark & sweep we can just throw it out afterwards. Perhaps for the benefit on-disk Keep implementations we could add a Clear() method to wipe the underlying storage? I do think edge tracking would still have advantages, but sure, that can be a separate issue. |
Ahoy there! One year later and I have done away with the whole idea of types in BS. They were not accomplishing what I wanted them to. For garbage collection, it is the caller's responsibility to identify root refs and callback functions that can traverse them to identify other additional refs to protect. See https://pkg.go.dev/github.com/bobg/bs@v0.2.0/gc#Protect. Hashsplit trees have a predefined "protect" callback: split.Protect (which doubles as a model for how to write these traversal functions). How does this interface feel? |
This seems like a reasonable point in the design space. I do still worry that the mark & sweep collector you've designed is going to have scaling problems; on very large data stores doing O(n) passes on all live data is likely to be prohibitive. For what it's worth, I started working on my own merkle-dag store a while back which has explicit knowledge of pointers as I suggest, though largely by piggy-backing on cap'n proto: https://github.com/zenhack/ocap-merkledag The schema in particular: https://github.com/zenhack/ocap-merkledag/blob/master/schema/protocol.capnp |
I'd like to be able to use a blob store like this to store things with the option of purging data that should no longer be kept (e.g. for privacy reasons). But right now the interfaces in this package don't provide any way to delete blobs.
So, I have some interest in that feature, but additionally it would be nice if the store could keep track of edges between blobs and do garbage collection, treating the anchors as roots.
The text was updated successfully, but these errors were encountered: