Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unbounded HistoryTreeNode field #112

Closed
slawlor opened this issue Dec 13, 2021 · 5 comments · Fixed by #113
Closed

Unbounded HistoryTreeNode field #112

slawlor opened this issue Dec 13, 2021 · 5 comments · Fixed by #113
Assignees
Labels
enhancement New feature or request optimization

Comments

@slawlor
Copy link
Contributor

slawlor commented Dec 13, 2021

HistoryTreeNode has an epochs field that is used to identify the epochs which an update took place.
As it stands now, the field can grow to an unbounded size. This will pose an issue when the corresponding DB record hits either soft or hard limits imposed by the DB schema or architecture.

ESTIMATED LIMITS

Epochs are u64 = 8 bytes.

  • For mysql, with the current VARBINARY(2000) definition, 2KB / 8B = ~ 250 values (we could always raise the max value in the definition, this is just for illustration)
  • For some alternate key-value db (KVDB) with say 100KB recommended soft limit for record size, ignoring other overhead, 100KB / 8B = ~ 12500 values

SOLUTION?

The associated epochs for a HistoryTreeNode can be obtained directly from the HistoryNodeState table given the node label, since each update corresponds to a HistoryNodeState.

  • For mysql, this can be done through one call with a join, or a separate, concurrent call to the DB.
  • For the alternate KVDB, one call is also sufficient. Instead of a "get" operation to fetch a singular HistoryTreeNode record, we can use a "scan" operation to fetch both the HistoryTreeNode record AND associated HistoryNodeState records at one go, provided we have a good sharding setup.

Accumulating the data from separate tables will incur some computation time on the database server and we will need to profile and assess the impact. However, I think this is necessary to avoid the unbounded record size issue. Are there any other potential solutions I've missed?

@slawlor
Copy link
Contributor Author

slawlor commented Dec 13, 2021

Exporting external group conversation

@kevinlewi
Good catch. The epoch vector will indeed need to be removed. However, how would we efficiently look up the HistoryNodeState table for the appropriate epoch? I believe we will perhaps need each HistoryNodeState to have a "pointer" to the previous valid epoch for the node, right? Or are you thinking of a more efficient way to do this?
Btw, at some point it would also be great to create public-facing issues on the github to track these, even just as a placeholder. Thanks for the finds!!

@afterdusk
Hmm, when an epoch is added to the epochs field, does it not remain perpetually? My impression was that all epochs where a HistoryNodeState exists for the node are valid epochs.
Based on this assumption, the following naive query can be run in parallel to collect the epochs where a state exists for a node:

SELECT a.epoch
FROM HistoryNodeStateTable a
WHERE a.label_len = <label_len> AND a.label_val = <label_val>

(In hindsight, a join to combine this with the query to HistoryTreeNodeTable wouldn’t be feasible, so two queries are needed.)
Also thank you for pointing out GitHub issues, I’ll post there for discussions like these in future!
Edit: To be clear, I'm proposing to remove the epochs field from the database table for HistoryTreeNode, but not the struct definition. When retrieving a HistoryTreeNode from storage, we will populate the epochs field with a query to the table storing HistoryNodeState 🙂

@slawlor
Copy link
Contributor Author

slawlor commented Dec 13, 2021

My 2 cents.

I've actually been thinking about this before as a place to optimize. For a lookup proof & a publish, we only need the most recent state of a HistoryTreeNode correct? (@kevinlewi confirm plz :) )

What if we change HistoryTreeNode to contain the "latest" state as a part of the struct. That way at the data-layer it's constant in size, and it actually saves us a 2nd get operation generally since we (most of the time) only need the latest state. However when we write, we'll also want to push and changed states to their own table. That way for a key-history request, we can still retrieve the information (via JOIN or whatever). It might be worth changing the storage contract to support something like this (instead of just get/set generics).

slawlor added a commit that referenced this issue Dec 13, 2021
…jority of operations only require the birth and current epochs. (Only audit & key history require the full list of epochs a node was mutated in)

Resolves #112
@kevinlewi
Copy link
Contributor

kevinlewi commented Dec 13, 2021

For a lookup proof & a publish, we only need the most recent state of a HistoryTreeNode correct?

Yes, @Jasleen1 to confirm.

However, I think the intermediate states are needed for the append-only / key history consistency proofs. And we still need the information about which epochs a node actually gets changed in for these proofs. Otherwise, we could end up having to search for all epochs between the birth and current epochs for changes -- which may not be very efficient...

Edit: left a comment here (https://github.com/novifinancial/akd/pull/113/files/9367f0ebc7dde560dfaf47491b1b32a5bb2eff51#diff-be3234553a30392a01a72481f1676c8338a439bc2a15a8a9f00f7923d1ef9353) which further elaborates on this

@slawlor
Copy link
Contributor Author

slawlor commented Dec 14, 2021

Yeah @kevinlewi so in MySQL it's relatively straightforward with

SELECT `epoch` 
FROM `states` 
WHERE 
  `label_len` = :len 
  AND `label_val` = :val 
  AND `epoch` <= :epoch 
ORDER BY `epoch` DESC 
LIMIT 1

which is searching over an indexed column space with "retrieve the first epoch which is <= :epoch with a specific label" In-memory, yes it's slower, but I did a bit of a nifty trick to not scan the entire "memory db". Construct an array from

0...=epoch, map it to the NodeStateKey with the provided specific label, and retrieve those batch of nodes. Any epochs where an update didn't occur will simply return a not-found, so after we just take the largest epoch.

@slawlor
Copy link
Contributor Author

slawlor commented Dec 14, 2021

In MySQL, this guarantees that the query is going to return a single u64 which at least saves us on the data transport (we aren't loading an entire node, or state or something, and wasting time serializing)

slawlor added a commit that referenced this issue Dec 14, 2021
* Remove the epochs vector from the history tree node since the vast majority of operations only require the birth and current epochs. (Only audit & key history require the full list of epochs a node was mutated in)

Resolves #112

* Add mysql limit to limit select as small as possible

* Version bump to 0.3.6

Co-authored-by: Sean Lawlor <seanlawlor@fb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request optimization
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants