
Deletion/Gc #1

Open · zenhack opened this issue Aug 25, 2020 · 8 comments

Comments

zenhack (Contributor) commented Aug 25, 2020

I'd like to be able to use a blob store like this to store things with the option of purging data that should no longer be kept (e.g. for privacy reasons). But right now the interfaces in this package don't provide any way to delete blobs.

So deletion is my primary interest, but additionally it would be nice if the store could keep track of edges between blobs and do garbage collection, treating the anchors as roots.

bobg (Owner) commented Aug 27, 2020

Hi! Thanks for your interest.

I've been thinking about this feature too, and I would welcome your input on a design decision.

As you may have seen, bs differs from Perkeep in at least one important respect: blobs are optionally typed. If a type is specified, it refers to a protobuf structure into which the blob can be deserialized.

So given a root set of refs and/or anchors, it is possible, in principle, to trace all that set's transitive references, and protect those blobs from GC. Without type information, this would have to be done heuristically, as it is in Perkeep.

The question is, how should the system know which fields of a protobuf are references to blobs that should be protected? There are a few choices:

  • It's a reference if it has a certain specific type.

The type would probably have to be something like this:

message Ref {
  bytes ref = 1;
}

This option seems like it might be a little annoying to deal with. But it'd be typesafe.

  • It's a reference if it has type bytes and the name ref, or a name ending in _ref.

This is verging into heuristic territory, which is something I've been trying to avoid in the design of bs. I don't actually like this option: it places a constraint on the programmer similar to the first option's, minus the measure of type safety that option 1 gives.

  • It's a reference if it has type bytes and is 32 bytes long.

This is totally heuristic, and it might protect some blobs by mistake. On the other hand, it's dead-simple for the programmer, errs on the side of safety (protecting too much instead of too little, which is a danger with the first couple of options), and the likelihood of a 32-byte bytes field not being meant as a ref, but still matching the hash of an existing blob, can probably be totally discounted.

I think I just talked myself into option 3. What do you think?
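The option-3 heuristic can be applied without any schema at all, since ref candidates are visible directly in the protobuf wire format. The following sketch (not bs's actual code; just an illustration of the idea) scans a serialized message and collects every length-delimited field whose payload is exactly 32 bytes:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// candidateRefs scans a serialized protobuf message and returns every
// length-delimited field whose payload is exactly 32 bytes -- the
// option-3 heuristic: treat any such field as a possible blob ref.
func candidateRefs(msg []byte) [][]byte {
	var refs [][]byte
	for len(msg) > 0 {
		tag, n := binary.Uvarint(msg)
		if n <= 0 {
			return refs // malformed; stop scanning
		}
		msg = msg[n:]
		switch tag & 7 { // wire type is the low 3 bits of the tag
		case 0: // varint
			_, n := binary.Uvarint(msg)
			if n <= 0 {
				return refs
			}
			msg = msg[n:]
		case 1: // fixed 64-bit
			if len(msg) < 8 {
				return refs
			}
			msg = msg[8:]
		case 2: // length-delimited (bytes, string, sub-message)
			l, n := binary.Uvarint(msg)
			if n <= 0 || uint64(len(msg)-n) < l {
				return refs
			}
			payload := msg[n : n+int(l)]
			if l == 32 {
				refs = append(refs, payload)
			}
			msg = msg[n+int(l):]
		case 5: // fixed 32-bit
			if len(msg) < 4 {
				return refs
			}
			msg = msg[4:]
		default:
			return refs // unknown wire type; stop
		}
	}
	return refs
}

func main() {
	// Hand-encoded message: field 1, wire type 2, 32-byte payload.
	msg := append([]byte{0x0a, 32}, make([]byte, 32)...)
	fmt.Println(len(candidateRefs(msg))) // 1
}
```

Note that this deliberately doesn't recurse into sub-messages (which share wire type 2 with bytes fields); a real scanner would need type information, or a heuristic, to decide when to descend.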

zenhack (Contributor, Author) commented Aug 27, 2020

One thing I don't like about option 3 is that it makes it impossible to tell the difference between a dangling reference/missing data (e.g. because of an interrupted sync) and something that just isn't a reference at all. This would be less of a danger if all of the backends supported transactions. It also means that when looking at a blob, you can't tell whether something is a reference without asking the store; so if you want to traverse some sub-graph, you'd have to try to fetch every 32-byte field just in case it's a ref. While I'm not really worried about false positives against the full hash, I do expect 32-byte fields that aren't hashes to be reasonably common. I'd like a system where I can rule out non-refs without hitting the store. So option 1, I guess.

bobg (Owner) commented Aug 28, 2020

OK, I will give option 1 a try too to see how it feels, but in the meantime here's a sketch of option 3: https://github.com/bobg/bs/tree/gc/gc. Interested to know what you think.

zenhack (Contributor, Author) commented Aug 28, 2020

Given those interfaces, I'm not sure how transitive deletion is supposed to work -- it doesn't look like reachability information is being stored anywhere, so if I delete a root A, and A is the only thing that refers to B, how do we detect that B should no longer be kept? One obvious solution would be to explicitly store edges/refcounts.

A refcounting approach will probably also be more efficient on large stores, as it avoids the need to do a full sweep over all the blobs.
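The edge/refcount idea can be sketched in a few lines. This is a toy illustration only, with entirely hypothetical names (it is not an interface from bs): each blob records its outgoing edges, and deleting a blob decrements its children's counts, cascading when a count hits zero, with no full sweep required.

```go
package main

import "fmt"

// refStore is a toy refcounting blob store: each blob records its
// outgoing edges, and deletion cascades to children whose incoming-edge
// count drops to zero. All names here are hypothetical.
type refStore struct {
	blobs    map[string][]byte
	edges    map[string][]string // blob -> refs it contains
	refcount map[string]int      // incoming-edge counts
}

func newRefStore() *refStore {
	return &refStore{
		blobs:    map[string][]byte{},
		edges:    map[string][]string{},
		refcount: map[string]int{},
	}
}

// put stores a blob along with its (caller-supplied) outgoing refs,
// bumping each child's refcount.
func (s *refStore) put(ref string, data []byte, children []string) {
	s.blobs[ref] = data
	s.edges[ref] = children
	for _, c := range children {
		s.refcount[c]++
	}
}

// del removes a blob and cascades to children whose refcount reaches
// zero -- no O(n) sweep over the whole store.
func (s *refStore) del(ref string) {
	children := s.edges[ref]
	delete(s.blobs, ref)
	delete(s.edges, ref)
	delete(s.refcount, ref)
	for _, c := range children {
		s.refcount[c]--
		if s.refcount[c] <= 0 {
			s.del(c)
		}
	}
}

func main() {
	s := newRefStore()
	s.put("B", []byte("leaf"), nil)
	s.put("A", []byte("root"), []string{"B"}) // A is B's only referrer
	s.del("A")
	_, ok := s.blobs["B"]
	fmt.Println(ok) // false: B was collected transitively
}
```

The usual caveat applies: pure refcounting can't collect cycles, though content-addressed stores largely avoid them, since a blob's ref depends on its contents.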

bobg (Owner) commented Aug 29, 2020

I don't think transitive deletion should be a feature of a bs.Store, which is and should remain a low-level storage mechanism with minimal knowledge of any kind of data schema. (And as we've seen in #3, the knowledge it does have is already causing problems.)

However, it's not hard to imagine a store defined in terms of bs.Store but at a higher abstraction layer, the way anchor.Store and gc.Store are, in which blobs can be refcounted, and where a refcount of zero means automatic deletion.

Let's continue to use this issue to discuss the pending gc work (on the gc branch). If you'd like to track the idea of a reference-counting abstraction layer I encourage you to open a new feature-request issue.

zenhack (Contributor, Author) commented Aug 29, 2020

/me looks again.

Ah, for some reason I had it in my head that the Keep was going to be persistent, but for a simple mark & sweep we can just throw it out afterwards. Perhaps for the benefit of on-disk Keep implementations we could add a Clear() method to wipe the underlying storage?

I do think edge tracking would still have advantages, but sure, that can be a separate issue.

bobg (Owner) commented Aug 11, 2021

Ahoy there! One year later and I have done away with the whole idea of types in BS. They were not accomplishing what I wanted them to.

For garbage collection, it is the caller's responsibility to identify root refs and to supply callback functions that traverse them to identify additional refs to protect. See https://pkg.go.dev/github.com/bobg/bs@v0.2.0/gc#Protect. Hashsplit trees have a predefined "protect" callback, split.Protect, which doubles as a model for how to write these traversal functions.
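The protect-callback pattern can be sketched as a plain mark & sweep; the real API is at the pkg.go.dev link above, and every name below is hypothetical:

```go
package main

import "fmt"

// protectFn, given a ref, returns the refs it points to, so the
// collector can trace them -- the caller-supplied "protect" callback
// idea, in miniature. All names here are hypothetical.
type protectFn func(ref string) []string

// protect marks ref and everything transitively reachable through next.
func protect(keep map[string]bool, ref string, next protectFn) {
	if keep[ref] {
		return // already marked; also breaks cycles
	}
	keep[ref] = true
	for _, child := range next(ref) {
		protect(keep, child, next)
	}
}

// sweep deletes every blob not in the keep set.
func sweep(blobs map[string][]byte, keep map[string]bool) {
	for ref := range blobs {
		if !keep[ref] {
			delete(blobs, ref)
		}
	}
}

func main() {
	blobs := map[string][]byte{
		"root":   []byte("points at child"),
		"child":  []byte("leaf"),
		"orphan": []byte("unreachable"),
	}
	edges := map[string][]string{"root": {"child"}}

	keep := map[string]bool{}
	protect(keep, "root", func(ref string) []string { return edges[ref] })
	sweep(blobs, keep)
	fmt.Println(len(blobs)) // 2: root and child survive, orphan is swept
}
```

The division of labor is the point: the store knows nothing about blob contents, and each data format (like hashsplit trees) brings its own traversal callback.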

How does this interface feel?

zenhack (Contributor, Author) commented Aug 17, 2021

This seems like a reasonable point in the design space. I do still worry that the mark & sweep collector you've designed will have scaling problems; on very large stores, doing O(n) passes over all live data is likely to be prohibitive.

For what it's worth, I started working on my own Merkle-DAG store a while back, which has explicit knowledge of pointers as I suggest, largely by piggy-backing on Cap'n Proto:

https://github.com/zenhack/ocap-merkledag

The schema in particular: https://github.com/zenhack/ocap-merkledag/blob/master/schema/protocol.capnp
