Discussion: options for hypercore feed-level metadata #13
Comments
mafintosh
Mar 27, 2018
Been thinking a bit about the handshake message. Instead of making it the first message, it could be powerful if you could update that message. Especially if we have a "relatedFeeds" scheme, since you most likely wanna update that over time (hyperdb does this a bunch!)
What about something like this
message Header {
  message Feed {
    required bytes key = 1;
  }
  required string protocolType = 1;
  optional uint64 version = 2; // defaults to version 0
  repeated Feed relatedFeeds = 3; // use a Feed message so users can extend it with other metadata
}
And then a convention that every message points back to the last header message
message Entry {
  optional uint64 headerSeq = 4;
}
Or some variation of this
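A minimal sketch of the back-pointer convention, in Python rather than protobuf. The in-memory list standing in for a feed, the `append` helper, and `latest_header_seq` are all hypothetical names for illustration; only the `headerSeq` idea comes from the proposal above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Header:
    protocol_type: str
    version: int = 0  # mirrors "defaults to version 0"
    related_feeds: List[bytes] = field(default_factory=list)


@dataclass
class Entry:
    value: bytes
    header_seq: Optional[int] = None  # back-pointer to the last Header


def latest_header_seq(feed):
    """Find the newest header by inspecting only the last message."""
    if not feed:
        return None
    last = feed[-1]
    if isinstance(last, Header):
        return len(feed) - 1
    return last.header_seq


def append(feed, item):
    """Append to the feed, stamping entries with the current header's seq."""
    if isinstance(item, Entry):
        item.header_seq = latest_header_seq(feed)
    feed.append(item)
    return len(feed) - 1


feed = []
append(feed, Header("hyperdb", version=1))
append(feed, Entry(b"a"))
# The owner later updates the header, e.g. to add a related feed.
append(feed, Header("hyperdb", version=1, related_feeds=[b"\x01" * 32]))
append(feed, Entry(b"b"))
append(feed, Entry(b"c"))

newest = feed[latest_header_seq(feed)]
```

With this convention a reader that has only the head of the feed can reach the current header in one hop, without scanning backwards.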
mafintosh
Mar 27, 2018
@bnewbold expanding on the above... what if you could attach an immutable blob and a mutable sequence pointer, or simply just a mutable sequence pointer? Then the header would be stored in the feed but you'd keep the "latest header" pointer outside.
mafintosh
Mar 27, 2018
Another idea.
Being able to attach a mutable blob that is history-less. Meaning that, in the handshake or somehow, each peer exchanges (blob, blobSeq, signature).
blobSeq increments every time the owner updates the blob and is used to pick the newest one if the peers disagree on which one is correct. The owner also signs it.
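A rough sketch of how two peers might settle on the newest blob, assuming each side holds at most one (blob, blobSeq, signature) triple. `SignedBlob`, `sign`, `verify`, and `pick_newest` are hypothetical names. HMAC is used here only as a stdlib stand-in for the owner's signature: it is symmetric, whereas a real feed would verify against the owner's public key.

```python
import hashlib
import hmac
from typing import NamedTuple, Optional


class SignedBlob(NamedTuple):
    blob: bytes
    blob_seq: int
    signature: bytes


def _message(blob, blob_seq):
    # Cover the sequence number as well as the blob, so a peer cannot
    # re-label an old blob with a higher blobSeq.
    return blob_seq.to_bytes(8, "big") + blob


def sign(secret, blob, blob_seq):
    # STAND-IN: symmetric HMAC instead of a public-key signature.
    sig = hmac.new(secret, _message(blob, blob_seq), hashlib.sha256).digest()
    return SignedBlob(blob, blob_seq, sig)


def verify(secret, candidate):
    expected = hmac.new(
        secret, _message(candidate.blob, candidate.blob_seq), hashlib.sha256
    ).digest()
    return hmac.compare_digest(expected, candidate.signature)


def pick_newest(secret, ours: Optional[SignedBlob], theirs: Optional[SignedBlob]):
    """Keep whichever validly signed blob has the highest blobSeq."""
    valid = [c for c in (ours, theirs) if c is not None and verify(secret, c)]
    return max(valid, key=lambda c: c.blob_seq, default=None)


owner_secret = b"demo-secret"
old = sign(owner_secret, b"related: feedA", 1)
new = sign(owner_secret, b"related: feedA, feedB", 2)
forged = SignedBlob(b"related: evil", 99, b"\x00" * 32)
```

Here `pick_newest(owner_secret, old, new)` keeps `new`, and a forged triple with a higher `blobSeq` is ignored. Nothing in this sketch stops a peer from simply not offering the newest blob it has.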
pfrazee
Mar 28, 2018
Member
> Being able to attach a mutable blob that is history-less. Meaning that, in the handshake or somehow, each peer exchanges (blob, blobSeq, signature).
Sounds like a good solution to me
mafintosh
Mar 28, 2018
Been thinking about the security aspects of the mutable header. It becomes a bit tricky fast, imo. I want to avoid situations where peers can withhold the latest blob, which means we need to add it to the merkle tree, which makes the no-history aspect of it hard (open for ideas here!).
Going back to pragmatism, what is it we want to get out of this? Originally my thought about the immutable header was that you'd include a string like this
hyperdb/v1
Or
content-feed
I.e. immutable descriptions of the data that let you pick the right strategy to parse the data.
The main thing gained from the mutable one would be if we could specify which feeds to crawl (makes archivers easier over time), assuming we spec out a required schema for the handshake on top. This of course could be a massive benefit as well. Unsure how to proceed, again open for input.
pfrazee
Mar 28, 2018
Member
Could an immutable blob which identifies the data structure then use custom headers to identify additional feeds? Then that custom header would be part of the data structure and could be made mutable
mafintosh
Mar 28, 2018
@pfrazee yea that's what i've been thinking too. use the immutable string to pick the right strategy to crawl the feed (default to contentfeed which means no crawling).
pros
- easy to impl
- backwards compat (old cores would just have the header '')
- easy to review sec wise.
cons
- means archivers need to know about datastructures to crawl them
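To make the trade-off concrete, here is a small sketch of an archiver choosing a crawl strategy from the immutable header string. The strategy table, the function names, and the entry layout (dicts with an optional `feeds` list) are assumptions for illustration, not how hyperdb actually stores related feeds.

```python
from typing import Callable, Dict, List

Strategy = Callable[[List[dict]], List[str]]


def no_crawl(entries):
    # A plain content feed points at no other feeds.
    return []


def crawl_hyperdb(entries):
    # Hypothetical layout: some entries carry a list of related feed keys.
    keys = []
    for entry in entries:
        keys.extend(entry.get("feeds", []))
    return keys


STRATEGIES: Dict[str, Strategy] = {
    "": no_crawl,              # old cores: no header at all
    "content-feed": no_crawl,  # the default
    "hyperdb/v1": crawl_hyperdb,
}


def feeds_to_crawl(header, entries):
    strategy = STRATEGIES.get(header)
    if strategy is None:
        # The "con" above: an archiver that does not know the data
        # structure cannot discover its related feeds.
        raise LookupError(f"unknown data structure {header!r}")
    return strategy(entries)
```

For example, `feeds_to_crawl("", entries)` returns an empty list for any legacy core, while an unrecognized header string fails loudly instead of silently skipping related feeds.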
bnewbold
Apr 13, 2018
Contributor
As per the original message, I think there are sort of two things going on here.
The first is for all clients/readers/infra/etc to be able to quickly get from a bare dat:// URI to knowing what "type" of content the feed has, at all. The analogy to me is the Content-Type HTTP header, which comes before any content and can be fetched with a quick HEAD (don't need a full GET). In dat-land, the dat CLI should be able to see "this isn't a legacy hyperdrive or a hyperdb-style hyperdrive, so i'm just going to bail", or it selects the appropriate code path to continue with. My assumption is that this is immutable (tied to the feed as a whole at time of creation), but maybe i'm wrong.
The second is the ability to associate generic metadata with a feed, sort of a key/value sidecar to the feed contents proper, which might include related feeds or anything else. We sort of do this with hyperdrive-like feeds via dat.json, but it might be nice to have this for any feed.
I think the first is more urgently needed for hyperdb+hyperdrive roll out. I propose we focus on a solution to the first part, but not include any "related feed" functionality in it, because that is more "mutable". I think a mutable solution to the second bit would probably be good... but I also think more thinking is needed.
In either/any case, off the top of my head I think we should keep all such metadata "in band" in that the same hashing/merkle structure should cover the metadata as well as feed content, so we don't need to add additional verification complexity.
bnewbold
May 7, 2018
Contributor
Just a ping that I think we want to make progress on this in the next week or so. What would be the best next step? A specific implementation proposal?
bnewbold
May 23, 2018
Contributor
This announcement about git wire protocol v2 has some details about how they shoe-horned in a protocol version flag: https://opensource.googleblog.com/2018/05/introducing-git-protocol-version-2.html
(this message is really a poke at @mafintosh to write up what we discussed last week)
bnewbold
Jun 10, 2018
Contributor
In hyperdb v3.0.0, @mafintosh added a minimal protocol header as the first hypercore entry, with protobuf schema (mafintosh/hyperdb#121):
message Header {
  required string protocol = 1;
}
and hyperdb sets the protocol string to hyperdb for now. It's not clear to me yet what hyperdrive-on-hyperdb will do.
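For a sense of what that first entry looks like on the wire, here is a hand-rolled sketch of encoding and decoding the one-field `Header` message using the standard protobuf wire format (field 1, length-delimited). The function names are made up for this example, and real code would normally use a protobuf library instead.

```python
def encode_varint(n):
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_header(protocol):
    data = protocol.encode("utf-8")
    # tag = (field number 1 << 3) | wire type 2 (length-delimited) = 0x0A
    return b"\x0a" + encode_varint(len(data)) + data


def decode_header(buf):
    """Return the `protocol` string, skipping any fields we do not know."""
    pos, protocol = 0, None
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        field_no, wire_type = tag >> 3, tag & 7
        if wire_type == 0:  # varint
            _, pos = decode_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit
            pos += 8
        elif wire_type == 2:  # length-delimited
            length, pos = decode_varint(buf, pos)
            if field_no == 1:
                protocol = buf[pos:pos + length].decode("utf-8")
            pos += length
        elif wire_type == 5:  # fixed 32-bit
            pos += 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
    if protocol is None:
        raise ValueError("missing required field `protocol`")
    return protocol


first_entry = encode_header("hyperdb")
```

Skipping unknown fields is what would let a reader keep working if later versions add more fields to the header.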
Frando
Jul 18, 2018
I actually think that we should adhere to the layered nature of the hyper* tools, which would mean that it does not make sense to state that a hypercore is a hyperdrive, but only that a hypercore is a hyperdb, and then set on the hyperdb level that it's a hyperdrive.
So at the hypercore level: Header "hyperdbv1", or "hyperdbv1-content", because a hyperdb may also have a content feed which is a hypercore, but with a different data structure from a hyperdb.
And then, at the hyperdb level, I propose that we have a single special reserved key that stores some JSON to set more properties. So e.g. /:meta or similar. There, it would say {type: 'hyperdrive', version: 'v1' }. That meta key could also be the place to store mount information (should we decide to implement it at the hyperdb level) or different value encodings per prefix (should we decide to support subhyperdb natively, to e.g. have a part of a hyperdb be a hyperdrive and another part a json key value store).
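A small sketch of what this layered lookup could look like for a tool that wants to show the application-level type. A plain dict stands in for the hyperdb, `describe` is a hypothetical helper, and the `/:meta` value is written as strict JSON.

```python
import json


def describe(core_header, db):
    """Resolve the type layer by layer: hypercore header first, then /:meta."""
    layers = [core_header]
    if core_header == "hyperdbv1":
        raw = db.get("/:meta")  # reserved key proposed above
        if raw is not None:
            meta = json.loads(raw)
            layers.append(f"{meta['type']}/{meta['version']}")
    return layers


drive_db = {"/:meta": '{"type": "hyperdrive", "version": "v1"}'}
plain_db = {"/some/key": "value"}
```

A tool that understands hyperdb but not hyperdrive could still stop at the first layer and inspect the key/value store, at the cost of one extra lookup to learn the full type.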
bnewbold
Jul 18, 2018
Contributor
Hi @Frando! Thanks for the feedback.
We've gone back and forth on this a few times; i'm not sure all the history is in this issue thread. There are advantages to what i'd call the "recursive" approach you mention (typing at each layer of the stack): tooling can fall back to partial support of higher-level protocols (eg, inspect hyperdb even if hyperdrive isn't supported), etc. Some of the trade-offs that pushed me over into the single-top-level-string camp are:
- more complex recursive code is needed to determine the application-layer type (which many tools would want to display to the user, eg hashbase). Importantly, checking the type becomes a less-deterministic operation (with the single string case, it's just a single element to be synchronized; with hyperdb a recursive lookup needs to be done to discover the correct key/value pair)
- immutability of content type as a feature, not a bug
- backwards compatibility (huge!)
- don't want to burden every container format with needing to include "next" level type metadata. AKA, the application-agnostic /:meta key isn't very elegant to me, and potentially constrains use cases that would want to make every key/value semantically meaningful. Would this one value always be JSON or protobuf, regardless of the other value encodings? All the same debates we've had with this header decision, with each content data structure. Not insurmountable, but if we can keep it simpler that seems better.
In the end, this boat has basically sailed, in that DEP-0007 got published. We can leave this thread open a little longer if you have more comments, and then close.
bnewbold commented Mar 21, 2018
Motivation: have a way to annotate the "type" of feed contents. For example, determine if you're looking at a hyperdb key/value feed, a hyperdrive, or some other thing. A requirement is that code/libraries be able to replicate the feed and discover the content type (and schema version) without necessarily understanding the schema itself. A related motivation is to discover related ("content") feeds in a protocol-agnostic manner, but this isn't a requirement.
Question: should this blob be strictly immutable? Being able to change some metadata might be nice (eg, paired feeds), but keeping it immutable is simple for, eg, hosting platforms and archives.
Option 1: protobuf message as special first entry in feed. This is basically what hyperdrive does currently to point from metadata to content feed, only we would want to use (extensible) protobuf instead of bare bytes. Could potentially select a small fixed number of fields for this protobuf schema (eg, "repeated relatedFeeds bytes", "optional protobufSchema String", "optional contentType String"; strings could be mimetype-like), which applications could extend.
Option 2: add a metadata/header blob out-of-band to hypercore feeds. @mafintosh mentioned a scheme where an immutable blob is transmitted during feed handshakes, and the hash of that blob is used as a key for internal hypercore hashing. Would be stored as a new stub file in SLEEP directories (like feed key is currently).
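One way to picture Option 2 is with keyed hashing, sketched below. This is only an illustration of the binding idea, not hypercore's actual hash construction: `header_key` and `leaf_hash` are hypothetical, and BLAKE2b is used simply because Python's `hashlib` exposes a keyed mode for it.

```python
import hashlib


def header_key(header_blob):
    # Hash of the immutable header blob, used as the key below.
    return hashlib.blake2b(header_blob, digest_size=32).digest()


def leaf_hash(header_blob, entry):
    # Every entry hash depends on the header, so swapping the blob
    # would invalidate the whole tree.
    return hashlib.blake2b(
        entry, digest_size=32, key=header_key(header_blob)
    ).digest()


same_entry = b"first entry"
as_hyperdb = leaf_hash(b"hyperdb/v1", same_entry)
as_content = leaf_hash(b"content-feed", same_entry)
```

The same entry hashes differently under different header blobs, which is what would make the blob effectively immutable for the lifetime of the feed.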
There are probably more options if we get creative!