
Draft: HyperDB #3

Merged
merged 27 commits into datprotocol:master from bnewbold:dep-hyperdb on May 7, 2018

Conversation

@bnewbold
Contributor

bnewbold commented Feb 5, 2018

This DEP is submitted for review for Draft status.

Rendered pre-merge

A diagram showing the trie structure might be nice to help with learning, but maybe the specification isn't the right place for that. The examples are verbose, but I think they are important to clarify exactly how things work. I think exact binary protobuf output (hexdump style?) could also be included for some example messages, to make encoding and endianness completely unambiguous.

@bnewbold bnewbold force-pushed the bnewbold:dep-hyperdb branch from 549a0ac to 7dc1759 Mar 5, 2018

@bnewbold bnewbold referenced this pull request Mar 5, 2018

Closed

Nitty Gritty Questions #57

# Summary
[summary]: #summary
HyperDB is a new abstraction layer providing a general purpose distributed


@pfrazee

pfrazee Mar 5, 2018

Member

Is it "new"? Ain't this all new?


@pfrazee

pfrazee Mar 5, 2018

Member

(Not just nitpicking on prose; I'm concerned "new" will be confusing, both now and later when nothing is comparatively newer.)


@bnewbold

bnewbold Mar 5, 2018

Contributor

Good point; I'm still finding the right "voice" to use: tense, "we/I", "should/could/must", "one/you".

I was thinking of adding a CI pass with something like http://proselint.com/; we should probably have at least a pass that validates the markdown, and maybe checks out-links for valid HTTP responses.

Hyperdrive (used by the Dat application) is expected to be re-implemented on
top of HyperDB for improved performance with many files (eg, millions). The
hyperdrive API should be largely unchanged, but the `metadata` format will be
backwards-incompatible.


@pfrazee

pfrazee Mar 5, 2018

Member

Yeah I suggest we move this paragraph about hyperdrive into the motivation, where the plans and reasoning at the time of writing can all be contained.

A secondary benefit is to refactor the [trie][trie]-structured key/value API
out of hyperdrive, allowing third party code to build applications directly on
this abstraction layer.


@pfrazee

pfrazee Mar 5, 2018

Member

Third motivation: add multi-writer support


@pfrazee

pfrazee Mar 5, 2018

Member

Oh scratch that, multi-writer isn't discussed in this doc

`db.get(key)`: Reading a non-existent `key` is an error. Read-only.
`db.delete(key)`: Removes the key from the database. Deleting a non-existent
key is an error. Requires read-write access.
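
A short sketch of these semantics in use, assuming a Node.js-style callback API (method names follow this DEP; a concrete library may differ):

```js
// Hypothetical usage of the calls quoted above; error behavior per this DEP.
db.put('/a/b', 'hello', (err) => {
  db.get('/a/b', (err, value) => {
    // value === 'hello'
  })
  db.get('/no/such/key', (err) => {
    // err is set: reading a non-existent key is an error
  })
  db.delete('/no/such/key', (err) => {
    // err is set: deleting a non-existent key is also an error
  })
})
```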


@pfrazee

pfrazee Mar 5, 2018

Member

Does deletion of a path segment also cause all keys for which that segment is a prefix to be deleted?


@bnewbold

bnewbold Mar 5, 2018

Contributor

Good question! I think "No"; a separate delete_recursive() (or a flag) would be needed.
I've tried to minimize the API surface here in the spec; eg, I haven't included the diff or history streaming stuff. I think in this case it's pretty straightforward to combine list(prefix) and delete(key), as sketched below.
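
For illustration, a minimal sketch of that combination, assuming hypothetical promise-returning list() and delete() wrappers (not part of the spec):

```js
// Hypothetical recursive delete built from list(prefix) + delete(key).
async function deleteRecursive (db, prefix) {
  const nodes = await db.list(prefix)   // all keys under the prefix
  for (const node of nodes) {
    await db.delete(node.key)           // remove each one individually
  }
}
```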


@mafintosh

mafintosh Apr 9, 2018

Supporting recursive deletes through a single append is actually not hard and could be added in the future

- `path`: a 2-bit hash sequence of `key`'s components.
- `seq`: the zero-indexed sequence number of this Node (hyperdb) entry in
the feed. Note that this may be offset from the hypercore entry index if
there are any leading non-hyperdb entries.


@pfrazee

pfrazee Mar 5, 2018

Member

Does this mean that seq only increments when a hyperdb Node is written to the feed?
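
For orientation, a sketch of how the `path` field quoted above might be derived. Assumptions: SipHash24 with an all-zero key (per the collision discussion later in this thread), least-significant-bits-first ordering, and a stand-in `siphash24` function returning the 64-bit hash as a BigInt:

```js
// Sketch: derive the 2-bit path array for a key. Each component's 64-bit
// hash is split into 32 two-bit values; a terminating value (64, the 65th
// trie index) is appended at the end. Bit order here is an assumption.
function pathArray (key) {
  const path = []
  for (const component of key.split('/').filter(Boolean)) {
    let h = siphash24(Buffer.from(component))   // 64-bit hash as a BigInt
    for (let i = 0; i < 32; i++) {
      path.push(Number(h & 3n))                 // take the next 2 bits
      h >>= 2n
    }
  }
  path.push(64)                                 // terminating value
  return path
}
```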

HyperDB is not backwards compatible with the existing hyperdrive metadata,
meaning dat clients may need to support both versions during a transition
period. This applies both to archives saved to disk (eg, in SLEEP) and to
archives received and published to peers over the network.


@pfrazee

pfrazee Mar 5, 2018

Member

How can a client detect which metadata protocol to use?


@bnewbold

bnewbold Mar 5, 2018

Contributor

Good question. This is related to the issue of hashbase (and other tools) choking when they try to download a non-hyperdrive hypercore.
Two long-term solutions I could imagine are:

- add a Feed-level "type" field to hypercore itself. The obvious place for this would be the Feed message, except this would then leak type information in clear text on the wire, which is bad. A new message type? I could imagine refactoring hypercore messages to have a new "connection init" message for crypto setup, then re-send a Feed encrypted. This would be a good place to toss in a protocol version header as well. hypercore has been pretty stable though, so this would be disruptive. Also (thinking out loud), how would the software figure out the type of feeds on disk (aka, mystery meat SLEEP files)?
- extend the convention of always writing a "special" protobuf message as the first entry in every hypercore feed, and encode type info there. hyperdrive currently does this with a pointer to the data feed, but no other hypercore feed "types" do this (AFAIK).

@mafintosh, thoughts?

@pfrazee


Member

pfrazee commented Mar 5, 2018

Looking really good!

An example pseudo-code session working with a database might be:
db.put('/life/animal/mammal/kitten', '{"cuteness": 500.3}')
db.put('/life/plant/bush/bananna', '{"delicious": 103.4}')


@dcposch

dcposch Mar 14, 2018

don't know if you care but the banana has an extra n


@bnewbold

bnewbold Mar 15, 2018

Contributor

Thanks for the catch!

1. Calculate the `path` array for the key you are looking for.
2. Select the most-recent ("latest") Node for the feed.
3. Compare path arrays. If the paths match exactly, you have found the Node you


@dcposch

dcposch Mar 14, 2018

SipHash is not collision resistant, so two distinct paths could have the same path array, right?


@mafintosh

mafintosh Mar 14, 2018

Yes, in case of a collision the final node bucket will just have more than one node in it. The full final node bucket is read on every read (usually only a single node) to do a full key comparison to check for collisions.
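
That check might look like the following sketch, where `bucket` is the list of Nodes sharing the final hash position (names illustrative):

```js
// Sketch: collision handling as described above. The final bucket may hold
// several Nodes with the same hashed path, so the full key decides.
function resolveBucket (bucket, key) {
  for (const node of bucket) {
    if (node.key === key) return node   // full key comparison, not hash equality
  }
  return null   // same path hash, but no Node with a matching key
}
```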

implementation. The current implementation requires the most recent node in a
directory to point to all other nodes in the directory. Even with pointer
compression, this requires on the order of `O(N^2)` bytes; the HyperDB
implementation scales with `O(N log(N))`.


@dcposch

dcposch Mar 14, 2018

👍 👍

this is what prevented us from doing datpedia the straightforward way

specified limit of 2 GByte (due to 32-bit signed arithmetic), and most
implementations impose a (configurable) 64 MByte limit. Should this DEP impose
specific limits on key and value sizes? Would be good to decide before Draft
status.


@dcposch

dcposch Mar 14, 2018

Since dat does Rabin chunking, the 2 GB limit seems fine ~ large files will always be divided into chunks smaller than that.

Will the chunks always be below 64 MB though? @mafintosh


@mafintosh

mafintosh Mar 15, 2018

@dcposch note this is actually not for the files but just for the file metadata. In our case that is just a fs.Stat object with an additional numeric pointer to another feed where the actual file content is stored. This basically allows files as large as you'd want.


@bnewbold

bnewbold Mar 15, 2018

Contributor

Also, FWIW, dat (via hyperdrive) does not actually use Rabin chunking at the moment, just fixed-size chunks. The protocol is agnostic to chunking, so this can evolve in the future with little disruption if/when content-aware chunking is more performant.


@bnewbold

bnewbold Mar 15, 2018

Contributor

As a reminder, hyperdrive (and dat on top of it) is only one possible application of hyperdb; while hyperdrive should be pretty comfortable even with a couple-KByte metadata size limit, other applications might want to store arbitrary data in hyperdb values; setting expectations around this could save frustration or growing pains later.


@mafintosh

mafintosh Mar 18, 2018

@bnewbold agreed. IMO you should only store "small" values directly in the hyperdb, say < 1 MB. Anything larger you should put in a "content feed" and then link it from the value.


@bnewbold

bnewbold Apr 9, 2018

Contributor

I think for Draft we can mention the inherent limits and then mention:

"It is STRONGLY recommended to keep both keys and values to less than 2 megabytes (individually and combined). If larger value sizes are desired, a separate content feed can be pointed to (via contentFeed), and the value in this feed can be metadata pointing to data (eg, contiguous blocks of entries) in that feed."

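A sketch of that pattern, with `contentFeed` standing in for a separate hypercore feed and all helper names hypothetical:

```js
// Sketch: keep bulk data in a separate content feed; store only a small
// pointer value in hyperdb, as recommended above.
async function putLargeValue (db, contentFeed, key, data) {
  const start = contentFeed.length
  await contentFeed.append(data)                 // bulk bytes go to the content feed
  const pointer = { start, end: contentFeed.length }
  await db.put(key, JSON.stringify(pointer))     // the hyperdb value stays tiny
}
```
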
Are the "deletion" semantics here correct, in that deletion Nodes persist as
"tombstones", or should deleted keys be omitted from future `trie` fields to
remove their presence?


@dcposch

dcposch Mar 14, 2018

it seems important to leave deletion nodes in place--that way, list() can iterate over all current and deleted nodes under a directory and then omit the deleted ones

otherwise, i think list() might return partial results


@mafintosh

mafintosh Mar 15, 2018

@dcposch that's how it works atm actually

array. How unlikely is this situation? Eg, how many keys in a database before
the birthday paradox results in a realistic chance of a collision? How should
implementations behave if they detect a collision (matching path but not
matching key)?


@dcposch

dcposch Mar 15, 2018

I found a collision so we can test this

after an hour or so with a little Go script on a 122 GB r4.4xlarge instance...

Checked 7059000000 hashes
Checked 7060000000 hashes
Checked 7061000000 hashes
Checked 7062000000 hashes
Checked 7063000000 hashes
Checked 7064000000 hashes
COLLISION: SipHash24(mpomeiehc) = SipHash24(idgcmnmna) = 11615559218517537840

that's SipHash24 with an all-zero key, as described above. So if you make a dat and add the following files:

/mpomeiehc
/idgcmnmna

...you should be able to test a path collision

@mafintosh @bnewbold


@mafintosh

mafintosh Mar 15, 2018

mafintosh/hyperdb@fd60344 <-- I've added @dcposch's hash collision to the tests. We support hash collisions; the two nodes just end up in the same bucket, similar to how a multiwriter fork would.
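
Using the two colliding keys found above, such a test might look like this sketch (callback API assumed):

```js
// Sketch: both keys hash to the same SipHash24 value, so they share a
// trie bucket; reads must still resolve each key to its own value.
db.put('/mpomeiehc', 'a', () => {
  db.put('/idgcmnmna', 'b', () => {
    db.get('/mpomeiehc', (err, node) => {
      // expect 'a' here: the full key comparison disambiguates the bucket
    })
  })
})
```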

a pointer to the Entry with the same hash at the final array index.
5. If the paths don't entirely match, find the first index at which the two
arrays differ. Copy all `trie` elements (empty or not) into the new `trie`
for indices between the "current index" and the "differing index".
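
A sketch of that copy step, with `current` and `node` as hypothetical objects carrying `path` and `trie` arrays:

```js
// Sketch of step 5 above: find the first differing path index, then copy
// the current node's trie elements for all indices before it.
function copyTriePrefix (current, node, startIndex) {
  let diff = startIndex
  while (diff < node.path.length && current.path[diff] === node.path[diff]) diff++
  for (let i = startIndex; i < diff; i++) {
    node.trie[i] = current.trie[i]   // copy each element, empty or not
  }
  return diff   // the "differing index"
}
```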


@ralphtheninja
field of an Entry protobuf message.
Consider a trie array with `N` buckets and `M` non-empty buckets (`0 <= M <=
N`). In the encoded field, there will be `M` concatenated bytestrings of the


@ralphtheninja
the second bytestring chunk would be:
- index varint is `64` (65th element in the trie array; the terminating value)
- bitfield is `0b10000` (varint 1); there is a single pointer set... but it


@ralphtheninja
hash arrays, we now get all the way to index 34 before there is a difference.
We again look in the `trie`, find a pointer to entry index 0, and fetch the
first Entry and recurse. Now the path elements match exactly; we have found the
Entry we are looking for, and it has an existent `value`, so we return the


@ralphtheninja
The basic key/value semantics of hyperdb (as discussed in this DEP, not
considering multi-writer changes) are not known to introduce new privacy issues
when compared with, eg, storing binary values at key-like paths using the


@ralphtheninja
Hyperdb is not backwards compatible with the existing hyperdrive metadata,
meaning dat clients may need to support both versions during a transition
period. This applies both to archives saved to disk (eg, in SLEEP) and to


@ralphtheninja
a feed? See also <https://github.com/datprotocol/DEPs/issues/13>
Need to think through deletion process with respect to listing a path prefix;
will previously deleted nodes be occluded, or potentially show up in list


@ralphtheninja
db.get('/life/animal/mammal/kitten')
=> {"cuteness": 500.3}
db.list('/life/')
=> ['/life/animal/mammal/kitten', '/life/plant/tree/banana']


@pfrazee

pfrazee Apr 18, 2018

Member

Is this example listing correct? It's showing keys multiple prefix-segments below.


@bnewbold

bnewbold Apr 19, 2018

Contributor

I intended for it to be recursive; looks like there is a recursive flag in maf's hyperdb library. I've added a clarification.
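
For clarity, a sketch of the intended listing behavior (the option name follows maf's hyperdb library, but treat the exact spelling as an assumption):

```js
// Sketch: listing under a prefix, recursing into sub-paths.
db.list('/life/', { recursive: true }, (err, nodes) => {
  // nodes include keys at any depth below the prefix, eg
  // '/life/animal/mammal/kitten' and '/life/plant/tree/banana'
})
```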

care to the following scenarios: A malicious writer may be able to produce very
frequent key path hash collisions, which could degrade to linear performance. A
malicious writer could send broken trie structures that contain pointer loops,
duplicate pointers, or other invalid contents. A malicious writer could write


@pfrazee

pfrazee Apr 18, 2018

Member

Pointer loops sound like they could be a pretty serious DoS attack vector. Do we have any way to mitigate that?

There's a calculable upper limit on lookups based on the number of prefixes... Perhaps the recursive algorithm needs to track how many lookups it does and then abort if it hits the limit?
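
A sketch of that mitigation, bounding the number of entry fetches per lookup (all names hypothetical; `fetchEntry` resolves a feed sequence number to a Node):

```js
// Sketch: bounded trie traversal. An honest lookup needs at most one fetch
// per path index, so exceeding that budget indicates a malformed or
// malicious trie (eg, a pointer loop) and the lookup aborts.
async function boundedLookup (fetchEntry, headSeq, path) {
  const maxFetches = path.length + 1
  let fetches = 1
  let node = await fetchEntry(headSeq)
  let i = 0
  while (node && i < path.length) {
    if (node.path[i] === path[i]) { i++; continue }     // prefix matches so far
    const ptr = node.trie[i] && node.trie[i][path[i]]   // follow the trie pointer
    if (ptr === undefined) return null                  // no entry for this key
    if (++fetches > maxFetches) throw new Error('lookup budget exceeded: pointer loop?')
    node = await fetchEntry(ptr)
  }
  return node
}
```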

@bnewbold


Contributor

bnewbold commented Apr 19, 2018

@noffle I've still got you in here as a co-author in attribution of your ARCHITECTURE.md document that was very helpful when starting this DEP. However, while this is derivative, almost none of that document remains in here, and I don't want to "co-author" you without consent. Any thoughts? I'll move you down to an Acknowledgements section if I don't hear back in a day or two.

@noffle


noffle commented Apr 19, 2018

@bnewbold whatever you think is appropriate is fine by me!

@ralphtheninja


ralphtheninja commented on 2cffaa1 Apr 19, 2018

@bnewbold ❤️

@bnewbold


Contributor

bnewbold commented Apr 26, 2018

Any further comments here? Poke @mafintosh. If we could approve/merge this week that would be great (eg, at the protocol WG meeting next Wednesday at the latest).

@mafintosh


mafintosh commented Apr 26, 2018

@pfrazee pfrazee referenced this pull request Apr 30, 2018

Closed

Upcoming Meeting Agenda - 2 May 2018 #18

For a database with most keys having N path segments, the cost of a `get()`
scales with the number of entries M as `O(log(M))` with best case 1 lookup and
worst case `4 * 32 * N = 128 * N` lookups (for a saturated `trie`).


@mafintosh

mafintosh May 2, 2018

This worst case will only happen if all the hashes in the db are colliding, which is extremely unlikely. The worst case for a "perfect" hash is still log4(n) afaik. The best case is O(1) and the normal case is log4(n) / k, where k depends on how many shared hash prefixes there are
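
As a rough worked example of those figures (illustrative arithmetic only, not part of the spec):

```js
// Rough numbers for the get() cost discussion above.
const M = 1e6                               // entries in the database
const typical = Math.log(M) / Math.log(4)   // log4(M) ≈ 10 lookups
const N = 4                                 // path segments per key
const saturated = 4 * 32 * N                // 512 lookups, only under pervasive collisions
```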

# Unresolved questions
[unresolved]: #unresolved-questions
Should we declare that `contentFeed` pointers *not* change over the history of


@mafintosh
mafintosh mentioned this might be in the works. Does this DEP need to "leave
room" for those changes, or should we call out the potential for future change?
Probably not, should only describe existing solutions. This can be resolved
after Draft.


@mafintosh

mafintosh May 2, 2018

it can and should be improved in the future (backwards compat). Step one is shipping what we have

after Draft.
Review (or prove?) the `O(log(M))` intuition for `get()` operations. This could
happen after Draft status.


@mafintosh

mafintosh May 2, 2018

it's log(M), yes

@pfrazee pfrazee referenced this pull request May 2, 2018

Closed

Action Items - Meeting #9 (2 May) #20


@bnewbold bnewbold merged commit 23fa355 into datprotocol:master May 7, 2018

@bnewbold bnewbold deleted the bnewbold:dep-hyperdb branch May 7, 2018

@bnewbold


Contributor

bnewbold commented May 7, 2018

To be explicit: I left @noffle as a co-author. Thanks for kicking this whole thing off!
