Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discussion: Batched Anchor Data Structure #69

Open
oed opened this issue Nov 2, 2020 · 5 comments
Open

Discussion: Batched Anchor Data Structure #69

oed opened this issue Nov 2, 2020 · 5 comments
Assignees

Comments

@oed
Copy link
Member

oed commented Nov 2, 2020

Discussion for CIP-69
See PR for latest spec: #74

@stbrody
Copy link
Contributor

stbrody commented Nov 2, 2020

@oed, can you elaborate on what exactly your question is about building the merkle tree when there are different numbers of leaf nodes?

Beyond that, I'd like to see a bit more about what exactly the entries in the bloom filter look like/how they are built. Your code example actually has that part commented out and unspecified. Are you imagining that the tags, schema, controllers, and docId all get encoded into one entry somehow? Or does one ceramic document create like 4-10 different entries into the bloom filter? If the latter, do you plan to add anything into how these entries are encoded to allow distinguishing between entries from the different input types (distinguishing an entry that is a CID based on whether it was a main docID, a schema reference docId, or a tag, for example).

Lastly, I'm a little confused as to how one would actually take advantage of the sortedness of the merkle tree, given that when you're at an intermediate node in the tree, you don't actually have any information encoded in that node about what portion of the sorted keyspace it represents, so you'd have no way to actually decide which branch of the tree to follow to get closer to the set of documents you are looking for, unless I'm missing something?

@oed
Copy link
Member Author

oed commented Nov 3, 2020

can you elaborate on what exactly your question is about building the merkle tree when there are different numbers of leaf nodes?

Basically if the number of nodes is not a power of 2 then the tree will not be symmetrical. We should however always form the tree in a deterministic way, so which algorithm should we use for that? cc @simonovic86

Beyond that, I'd like to see a bit more about what exactly the entries in the bloom filter look like/how they are built.

It's an array of strings, updating the first post. Each document will create multiple entries in the filter.

If the latter, do you plan to add anything into how these entries are encoded to allow distinguishing between entries from the different input types

Good point! We need a way to distinguish between DocIDs and Schema-DocIDs at the very least!

Lastly, I'm a little confused as to how one would actually take advantage of the sortedness of the merkle tree,

Yeah since there are more than one tag this is not optimal I think. Basically if you know that a tag is the first tag of a document then you can do a binary search to effectively find the update. However, if you only know some other info you can't really take much advantage of this. Might be better options here so open to suggestions!

@simonovic86
Copy link

@oed @stbrody we already cover the case of odd number of leaves. This is the standard way of constructing merkle trees if I'm not mistaken. Here's the test case: link.

@stbrody
Copy link
Contributor

stbrody commented Nov 3, 2020

Yeah, the algorithm for building a balanced binary search tree is pretty standard and straightforward (I used to ask it as an interview question :P), I believe the only part of it that could be different implementation by implementation is when you have an even number of nodes whether you take the rightmost node in the left half as the root, or the leftmost node of the right half. Example code here. As long as we're consistent though it should be fine either way I'd think.

@oed I see you updated the description above to include prepending "tag-" and "schema-" to those bloom filter records - I'd suggest we do the same for the other entries too ("controller-did-" and "DocId-") to future-proof ourselves in case we want to add additional DIDs or DocIds into the bloom filter in the future. Adding characters to the size of the entries in the bloom filter is free, since the actual entry strings never make it into the filter, just the hashes of them do.

My concern about doing binary search on the tree is about more than having access to the right information of the document you're looking up. Even if you have all the fields of the document you are looking for, I'm still not sure how you could actually use that to search in the tree. The problem is that at the non-leaf nodes of the tree, you only have the CIDs of your left and right sub-tree nodes, the tree itself has no information about the contents of the documents covered by each subtree. To use the sortedness property of the tree, the non-leaf nodes would need to also contain information about which range of the overall sorted keyspace they cover

@oed oed changed the title Batched Anchor Data Structure Discussion: Batched Anchor Data Structure Nov 12, 2020
@oed
Copy link
Member Author

oed commented Nov 12, 2020

Created a PR for this CIP, updated the spec and made this into the discussion thread. See #74
Note that this introduces the topic property in the metadata of documents, which is currently not implemented anywhere. Some work remains to add this in various places.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants