RFC for document storage #403
Conversation
Looks great, left one addition.
Co-Authored-By: kocolosk <kocolosk@apache.org>
rfcs/004-document-storage.md (Outdated)

would be represented by a key-value pair of

```
pack({"foo", "bar", "baz"}) = pack(123)
```
At least in @davisp's work on erlfdb so far, packing only happens for tuples, so should this be pack({123})? Or should erlfdb allow packing of primitives? I think Paul is reflecting the APIs elsewhere, so our choice may be forced here, as we'd like our data to be readable with those APIs.
Good catch. When reading https://apple.github.io/foundationdb/data-modeling.html#encoding-data-types I get the sense that FDB primarily intends the tuple layer to be used with keys, not values, since much of the encoding work is concerned with getting well-defined orderings of the encoded binaries. For example:
For values, the main concerns for serialization are simply CPU and space efficiency. For keys, there’s an additional consideration: it’s often important for keys to preserve the order of the data types ...
@davisp - what do you think? Does it make sense to use some alternative faster / more compact encoding of values?
@kocolosk In general I'd agree that the tuple layer is largely more focused on keys; however, it's also a language-agnostic encoding which allows tooling written in other languages to easily interoperate. I haven't the slightest idea if that's something we'd want to enable or perhaps even actively discourage.
Also I should point out that the top-level tuple is elided in the tuple layer, so pack({123}) ends up producing the bytes that one would most likely assume are produced by pack(123). There's no length header when unpacking or anything; it just reads values until the binary is exhausted and then converts the list to a tuple. This makes it nice in that you can "extend" previously encoded tuples and the like (i.e., the subspace layer).
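A quick shell sketch of that elision, assuming erlfdb's `erlfdb_tuple` module (the exact bytes follow the FDB tuple layer spec, where type code 16#15 tags a one-byte positive integer):

```
%% Packing {123} adds no outer framing: the result is just the tuple
%% layer encoding of the integer 123 (type code 16#15, then 16#7B).
1> erlfdb_tuple:pack({123}).
<<21,123>>

%% With no outer header, packed tuples concatenate cleanly, which is
%% what lets a packed prefix be "extended" (i.e., the subspace layer).
2> erlfdb_tuple:pack({123, 456}) =:=
   <<(erlfdb_tuple:pack({123}))/binary, (erlfdb_tuple:pack({456}))/binary>>.
true
```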
Good clarification, thanks. I think we want a value encoding which delivers a) interoperability, b) CPU efficiency, and c) storage efficiency. If there are other encodings which deliver better on those traits than the Tuple encoding, it could make good sense to use them.
For now I updated the example to pack({123}).
One idle thought: it seems like we could fairly easily combine things and do something like snappy-encoded tuple-layer-packed binaries. Again, it's one more thing that can easily be tested to compare. In the meantime I don't think it really affects the rest of the RFC logic, so deferring until later doesn't seem unwise.
Yes, that's exactly the sort of thing I had in mind. It's slightly less easy to experiment with because we're changing the on-disk format, but that's why we have versioning of the data structure.

As I'm sure you've noticed, this data model never uses a multi-element tuple for user-supplied data in the values, so any encoding which covers true/false/null, numbers, and strings would suffice on its own.
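As a strawman for that narrower problem, a scalar-only codec could be tiny. This sketch is purely illustrative; every function name and type tag here is invented, not from the RFC or erlfdb:

```
%% Hypothetical one-byte-tagged encoding for JSON scalar values.
encode_value(null) -> <<0>>;
encode_value(true) -> <<1>>;
encode_value(false) -> <<2>>;
encode_value(N) when is_integer(N) -> <<3, N:64/signed>>;
encode_value(F) when is_float(F) -> <<4, F:64/float>>;
encode_value(S) when is_binary(S) -> <<5, S/binary>>.

decode_value(<<0>>) -> null;
decode_value(<<1>>) -> true;
decode_value(<<2>>) -> false;
decode_value(<<3, N:64/signed>>) -> N;
decode_value(<<4, F:64/float>>) -> F;
decode_value(<<5, S/binary>>) -> S.
```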
Feedback added to seek clarification on a few points, but I'm +1 on the model described here. I will be referring to this RFC PR from a new one on attachment storage soon.
Does the suggested 1MiB limit and associated conversation imply that we'll only ever update a single document in a given transaction? One thing I noodled over for a bit for …

@davisp no, I did not mean to imply that multi-doc transactions are off the table. I figured it was a separate topic though.
execute a `get_range_startswith` operation as above.
1. We can start streaming the entire key range from the `?DOCUMENTS` space
prefixed by `DocID` in reverse, and break if we reach another revision of the
document ID besides the winning one.
A couple things to note here. Generally speaking, when we know a specific revision we want to read, we don't actually know whether it's deleted or not, so we would have to do two range reads, one for deleted=true and one for deleted=false. Obviously one of those will be empty, but it's a slight complication on the stated approach.

For the two options on reading a doc, it's obviously up in the air whether the extra round trip against the ?REVISIONS subspace beats possibly streaming extra rows or possibly having to issue a second range read. Also, if we go with Option 1 we don't need NotDeleted at all, which simplifies things a bit if we're always reading this range with a known {RevPos, RevHash} pair.

Another happy side effect of always reading the winning revision from the ?REVISIONS subspace is that we stop having to worry about deleted vs normal revisions in this subspace. An empty doc with no attachments could just as well have zero data in this space, and then we don't have to store something arbitrarily in the metadata for each empty doc as a third kind of odd document state.
Yes, I covered this in the CRUD operations section, though I proposed doing the followup read in ?REVISIONS rather than issuing both deleted=true and deleted=false guesses:

If a reader is implementing Option 2 and does not find any keys associated with the supplied DocID in the ?DOCUMENTS space, it will need to do a followup read on the ?REVISIONS space in order to determine whether the appropriate response is {"not_found": "missing"} or {"not_found": "deleted"}.
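In erlfdb terms that followup might look roughly like this (the ?REVISIONS macro comes from the RFC; the function name and prefix handling are hypothetical):

```
%% Hypothetical: distinguish "missing" from "deleted" once the
%% ?DOCUMENTS range for DocId has come back empty.
not_found_reason(Tx, DbPrefix, DocId) ->
    RevsPrefix = erlfdb_tuple:pack({?REVISIONS, DocId}, DbPrefix),
    case erlfdb:get_range_startswith(Tx, RevsPrefix, [{limit, 1}]) of
        [] -> {not_found, missing};     % no revisions at all
        [_ | _] -> {not_found, deleted} % revisions exist, so it was deleted
    end.
```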
Ah. Yet again I was slightly off track considering when we return the deleted body, which is called out elsewhere as requiring extra fdb requests.

That seems legit, but I am starting to worry a bit about the subtlety of some of these optimizations that presume existing behaviors, as opposed to being less efficient but easier to implement.
I'm definitely not trying to be "cute" or overly subtle, so stop me if you think there's a better path. My thinking on using ?REVISIONS in a followup request was exactly because of this condition you brought up:

An empty doc with no attachments could just as well have zero data in this space
I don't think this is quite to that level. I'm just trying to contemplate that if we go with the first option and always require a read from the ?REVISIONS key space, then there's a good chunk of things that end up falling away complexity-wise. We'd have defined bounds on all reads to the ?DOCUMENTS key space, and our range reads would be optimal.

That said, same as you I believe, forcing that extra round trip for every document read is tempting to try and eliminate. For the single range read to get the current winner, I'm also wondering how often that ends up with either a second range request or wasted rows read, and how that cost compares to the extra round trip.

Luckily, though, it should be easy to implement both approaches to compare apples to apples and get a better idea of the relative cost either way.
Agreed this is an area where an A/B test is doable without too much surgery.
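Both candidates are small enough to sketch for such a test, assuming erlfdb plus two hypothetical helpers (winning_rev/3 to fetch the winner from ?REVISIONS, and rows_for_first_rev/1 to trim the reverse scan):

```
%% Option 1: learn the winning {NotDeleted, RevPos, RevHash} from
%% ?REVISIONS first, then issue one tightly-bounded ?DOCUMENTS read.
read_winner_option1(Tx, DbPrefix, DocId) ->
    {NotDeleted, RevPos, RevHash} = winning_rev(Tx, DbPrefix, DocId),
    Prefix = erlfdb_tuple:pack(
        {?DOCUMENTS, DocId, NotDeleted, RevPos, RevHash}, DbPrefix),
    erlfdb:get_range_startswith(Tx, Prefix).

%% Option 2: read the DocId range in reverse and keep only the rows
%% belonging to the first (winning) revision encountered. A real
%% implementation would stream and stop early rather than read it all.
read_winner_option2(Tx, DbPrefix, DocId) ->
    Prefix = erlfdb_tuple:pack({?DOCUMENTS, DocId}, DbPrefix),
    Rows = erlfdb:get_range_startswith(Tx, Prefix, [{reverse, true}]),
    rows_for_first_rev(Rows).
```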
are baked into the key. The structure looks like this

```
{DbName, ?DOCUMENTS, DocID, NotDeleted, RevPos, RevHash} = RevisionMetadata
```
It occurs to me that this format prevents RevisionMetadata from ever exceeding 100KB, which, depending on what we end up putting in there, may or may not be an issue.

There's also no versioning in this RFC for how we might do schema migrations in the future, like was done for the revision storage RFC. It's obviously possible to repurpose that version marker as a means to version the doc body storage as well, but that seems like something we'd want to call out explicitly in one place or the other, and it also feels a bit limiting that we're tying the two together.
Versioning is included; see the line directly below this one:

RevisionMetadata includes at the minimum an enum to enable schema evolution for subsequent changes to the document encoding structure

In fact, that's the only planned use for that field at the moment. Seems to me that if we have more than 100KB of metadata here, not including the rev tree, something has gone horribly wrong with our database design.
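Concretely, that enum could be as small as a leading version element in the packed value. A minimal sketch under that assumption (the macro name and error shape are invented here):

```
-define(CURR_REV_FORMAT, 0).

%% Hypothetical: the packed value leads with a schema version so a
%% reader can dispatch on it before interpreting any later fields.
encode_rev_metadata() ->
    erlfdb_tuple:pack({?CURR_REV_FORMAT}).

decode_rev_metadata(Bin) ->
    case erlfdb_tuple:unpack(Bin) of
        {?CURR_REV_FORMAT} -> ok;
        Other -> error({unsupported_rev_format, Other})
    end.
```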
Ah. I was definitely making the assumption that this would hold things like the #doc.atts metadata and whatnot. We can obviously do other things there, but it still feels like something to call out explicitly so we ensure that we don't allow a pathological breakage there.
The outer tuple does not take any space in the Tuple Layer encoding, but is required to use the API. We can have a debate about whether the Tuple Layer is even the right thing to use to encode leaf values. There are many other viable alternatives; since the sorting rules don't matter, the priority should be on portability, CPU efficiency, and storage efficiency.
rfcs/004-document-storage.md (Outdated)

```
pack({"states", 0}) = "MA"
pack({"states", 1}) = "OH"
```
Is there a specific reason why we don't pack the values for the array?
Nope, just me being sloppy. Good catch.
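Presumably the fix wraps the array values in single-element tuples as well, along these lines:

```
pack({"states", 0}) = pack({"MA"})
pack({"states", 1}) = pack({"OH"})
```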
Is there more to do here?

Personally I'm still wondering a bit about the value encoding. For example, should we be pursuing a compression option for string values?

The documented tuple format, while motivated by sorting concerns, does look fairly compact. So do you have concerns about using it for values, other than, perhaps, that long string values could be compressed? Short strings won't likely compress by much, though (if we exclude the various header bytes when storing, it might be simplest to compress all strings).
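If we did go down the compression road, a size threshold with a one-byte flag would keep short strings cheap. A minimal sketch using the Erlang stdlib zlib module (the threshold and tag values are invented for illustration; snappy itself would need a NIF):

```
-define(UNCOMPRESSED, 0).
-define(DEFLATED, 1).

%% Hypothetical: only compress values large enough to plausibly win.
maybe_compress(Bin) when byte_size(Bin) < 64 ->
    <<?UNCOMPRESSED, Bin/binary>>;
maybe_compress(Bin) ->
    <<?DEFLATED, (zlib:compress(Bin))/binary>>.

decompress(<<?UNCOMPRESSED, Bin/binary>>) -> Bin;
decompress(<<?DEFLATED, Z/binary>>) -> zlib:uncompress(Z).
```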
Overview
This document describes a data model for storing JSON documents as key-value
pairs in FoundationDB. It includes a discussion of storing multiple versions of
each document, identified by unique revision identifiers, and covers some of
the operations needed to query and modify these documents.