RFC for document storage #403
Conversation
Looks great, left one addition.
Co-Authored-By: kocolosk <kocolosk@apache.org>
rfcs/004-document-storage.md (Outdated)

would be represented by a key-value pair of

```
pack({"foo", "bar", "baz"}) = pack(123)
```
At least in @davisp's work on erlfdb so far, packing only happens for tuples, so should this be pack({123})? Or should erlfdb allow packing of primitives? I think Paul is reflecting the APIs elsewhere, so our choice may be forced here, as we'd like our data to be readable with those APIs.
Good catch. When reading https://apple.github.io/foundationdb/data-modeling.html#encoding-data-types I get the sense that FDB primarily intends the tuple layer to be used with keys, not values, since much of the encoding work is concerned with getting well-defined orderings of the encoded binaries. For example:
For values, the main concerns for serialization are simply CPU and space efficiency. For keys, there’s an additional consideration: it’s often important for keys to preserve the order of the data types ...
@davisp - what do you think? Does it make sense to use some alternative faster / more compact encoding of values?
@kocolosk In general I'd agree that the tuple layer is largely more focused on keys; however, it's also a language-agnostic encoding which allows tooling written in other languages to easily interoperate. I haven't the slightest idea if that's something we'd want to enable or perhaps even actively discourage.
Also I should point out that the top-level tuple is elided in the tuple layer, so pack({123}) ends up producing the bytes that one would most likely assume are produced by pack(123). There's no length header when unpacking or anything; it just reads values until the binary is exhausted and then converts the list to a tuple. This makes it nice in that you can "extend" previously encoded tuples and the like (i.e., the subspace layer).
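A quick shell sketch of that elision, assuming erlfdb's `erlfdb_tuple` module (the exact bytes follow the FDB tuple layer spec, where type code 16#15 tags a one-byte positive integer):

```
%% Packing {123} adds no outer framing: the result is just the tuple
%% layer encoding of the integer 123 (type code 16#15, then 16#7B).
1> erlfdb_tuple:pack({123}).
<<21,123>>

%% With no outer header, packed tuples concatenate cleanly, which is
%% what lets a packed prefix be "extended" (i.e., the subspace layer).
2> erlfdb_tuple:pack({123, 456}) =:=
   <<(erlfdb_tuple:pack({123}))/binary, (erlfdb_tuple:pack({456}))/binary>>.
true
```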
Good clarification, thanks. I think we want a value encoding which delivers a) interoperability, b) CPU efficiency, and c) storage efficiency. If there are other encodings which deliver better on those traits than the Tuple encoding, it could make good sense to use them.
For now I updated the example to pack({123}).
One idle thought: it seems like we could fairly easily combine things and do something like snappy-encoded tuple-layer-packed binaries. Again, it's one more thing that can easily be tested to compare. In the meantime I don't think it really affects the rest of the RFC logic, so deferring until later doesn't seem unwise.
Yes, that's exactly the sort of thing I had in mind. It's slightly less easy to experiment with because we're changing the on-disk format, but that's why we have versioning of the data structure.

As I'm sure you've noticed, this data model never uses a multi-element tuple for user-supplied data in the values, so any encoding which covers true/false/null, numbers, and strings would suffice on its own.
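As a strawman for that narrower problem, a scalar-only codec could be tiny. This sketch is purely illustrative; every function name and type tag here is invented, not from the RFC or erlfdb:

```
%% Hypothetical one-byte-tagged encoding for JSON scalar values.
encode_value(null) -> <<0>>;
encode_value(true) -> <<1>>;
encode_value(false) -> <<2>>;
encode_value(N) when is_integer(N) -> <<3, N:64/signed>>;
encode_value(F) when is_float(F) -> <<4, F:64/float>>;
encode_value(S) when is_binary(S) -> <<5, S/binary>>.

decode_value(<<0>>) -> null;
decode_value(<<1>>) -> true;
decode_value(<<2>>) -> false;
decode_value(<<3, N:64/signed>>) -> N;
decode_value(<<4, F:64/float>>) -> F;
decode_value(<<5, S/binary>>) -> S.
```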
Feedback added to seek clarification on a few points, but I'm +1 on the model described here. I will be referring to this RFC PR from a new one on attachment storage soon.
Does the suggested 1MiB limit and associated conversation imply that we'll only ever update a single document in a given transaction? One thing I noodled over for a bit for …

@davisp no, I did not mean to imply that multi-doc transactions are off the table. I figured it was a separate topic though.
execute a `get_range_startswith` operation as above.
1. We can start streaming the entire key range from the `?DOCUMENTS` space
prefixed by `DocID` in reverse, and break if we reach another revision of the
document ID besides the winning one.
A couple things to note here. Generally speaking, when we know a specific revision we want to read, we don't actually know whether it's deleted or not, so we would have to do two range reads, one for deleted=true and one for deleted=false. Obviously one of those will be empty, but it's a slight complication on the stated approach.

For the two options on reading a doc, it's obviously up in the air whether the extra round trip against the ?REVISIONS subspace beats possibly streaming extra rows or possibly having to issue a second range read. Also, if we go with Option 1 we don't need NotDeleted at all, which simplifies things a bit if we're always reading this range with a known {RevPos, RevHash} pair.

Another happy side effect of always reading the winning revision from the ?REVISIONS subspace is that we stop having to worry about deleted vs normal revisions in this subspace. An empty doc with no attachments could just as well have zero data in this space, and then we don't have to store something arbitrarily in the metadata for each empty doc as a third kind of odd document state.
Yes, I covered this in the CRUD operations section, though I proposed doing the followup read in ?REVISIONS rather than issuing both deleted=true and deleted=false guesses:

If a reader is implementing Option 2 and does not find any keys associated with the supplied DocID in the ?DOCUMENTS space, it will need to do a followup read on the ?REVISIONS space in order to determine whether the appropriate response is {"not_found": "missing"} or {"not_found": "deleted"}.
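In erlfdb terms that followup might look roughly like this (the ?REVISIONS macro comes from the RFC; the function name and prefix handling are hypothetical):

```
%% Hypothetical: distinguish "missing" from "deleted" once the
%% ?DOCUMENTS range for DocId has come back empty.
not_found_reason(Tx, DbPrefix, DocId) ->
    RevsPrefix = erlfdb_tuple:pack({?REVISIONS, DocId}, DbPrefix),
    case erlfdb:get_range_startswith(Tx, RevsPrefix, [{limit, 1}]) of
        [] -> {not_found, missing};     % no revisions at all
        [_ | _] -> {not_found, deleted} % revisions exist, so it was deleted
    end.
```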
Ah. Yet again I was slightly off track considering when we return the deleted body, which is called out elsewhere as requiring extra fdb requests.

That seems legit, but I am starting to worry a bit about the subtlety of some of these optimizations that presume existing behaviors, as opposed to being less efficient but easier to implement.
I'm definitely not trying to be "cute" or overly subtle, so stop me if you think there's a better path. My thinking on using ?REVISIONS in a followup request was exactly because of this condition you brought up:

An empty doc with no attachments could just as well have zero data in this space
I don't think this is quite to that level. I'm just trying to contemplate that if we go with the first option and always require a read from the ?REVISIONS key space, then there's a good chunk of things that end up falling away complexity-wise. We'd have defined bounds on all reads to the ?DOCUMENTS key space, and our range reads would be optimal.

That said, same as you I believe, forcing that extra round trip for every document read is tempting to try and eliminate. For the single range read to get the current winner, I'm also wondering how often that ends up with either a second range request or wasted rows read, and how that cost compares to the extra round trip.

Luckily, though, it should be easy to implement both approaches to compare apples to apples and get a better idea of the relative cost either way.
Agreed this is an area where an A/B test is doable without too much surgery.
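Both candidates are small enough to sketch for such a test, assuming erlfdb plus two hypothetical helpers (winning_rev/3 to fetch the winner from ?REVISIONS, and rows_for_first_rev/1 to trim the reverse scan):

```
%% Option 1: learn the winning {NotDeleted, RevPos, RevHash} from
%% ?REVISIONS first, then issue one tightly-bounded ?DOCUMENTS read.
read_winner_option1(Tx, DbPrefix, DocId) ->
    {NotDeleted, RevPos, RevHash} = winning_rev(Tx, DbPrefix, DocId),
    Prefix = erlfdb_tuple:pack(
        {?DOCUMENTS, DocId, NotDeleted, RevPos, RevHash}, DbPrefix),
    erlfdb:get_range_startswith(Tx, Prefix).

%% Option 2: read the DocId range in reverse and keep only the rows
%% belonging to the first (winning) revision encountered. A real
%% implementation would stream and stop early rather than read it all.
read_winner_option2(Tx, DbPrefix, DocId) ->
    Prefix = erlfdb_tuple:pack({?DOCUMENTS, DocId}, DbPrefix),
    Rows = erlfdb:get_range_startswith(Tx, Prefix, [{reverse, true}]),
    rows_for_first_rev(Rows).
```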
are baked into the key. The structure looks like this

```
{DbName, ?DOCUMENTS, DocID, NotDeleted, RevPos, RevHash} = RevisionMetadata
```
It occurs to me that this format prevents RevisionMetadata from ever exceeding 100KB, which, depending on what we end up putting in there, may or may not be an issue.

There's also no versioning in this RFC for how we might do schema migrations in the future, like was done for the revision storage RFC. It's obviously possible to repurpose that version marker as a means to version the doc body storage as well, but that seems like something we'd want to call out explicitly in one place or the other, and it also feels a bit limiting that we're tying the two together.
Versioning is included; see the line directly below this one:

RevisionMetadata includes at the minimum an enum to enable schema evolution for subsequent changes to the document encoding structure

In fact, that's the only planned use for that field at the moment. Seems to me that if we have more than 100KB of metadata here, not including the rev tree, something has gone horribly wrong with our database design.
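Concretely, that enum could be as small as a leading version element in the packed value. A minimal sketch under that assumption (the macro name and error shape are invented here):

```
-define(CURR_REV_FORMAT, 0).

%% Hypothetical: the packed value leads with a schema version so a
%% reader can dispatch on it before interpreting any later fields.
encode_rev_metadata() ->
    erlfdb_tuple:pack({?CURR_REV_FORMAT}).

decode_rev_metadata(Bin) ->
    case erlfdb_tuple:unpack(Bin) of
        {?CURR_REV_FORMAT} -> ok;
        Other -> error({unsupported_rev_format, Other})
    end.
```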
Ah. I was definitely making the assumption that this would hold things like the #doc.atts metadata and whatnot. We can obviously do other things there, but it still feels like something to call out explicitly so we ensure that we don't allow a pathological breakage there.
The outer tuple does not take any space in the Tuple Layer encoding, but is required to use the API. We can have a debate about whether the Tuple Layer is even the right thing to use to encode leaf values. There are many other viable alternatives; since the sorting rules don't matter, the priority should be on portability, CPU efficiency, and storage efficiency.
rfcs/004-document-storage.md (Outdated)

```
pack({"states", 0}) = "MA"
pack({"states", 1}) = "OH"
```
Is there a specific reason why we don't pack the values for the array?
Nope, just me being sloppy. Good catch.
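Presumably the fix wraps the array values in single-element tuples as well, along these lines:

```
pack({"states", 0}) = pack({"MA"})
pack({"states", 1}) = pack({"OH"})
```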
Is there more to do here?

Personally I'm still wondering a bit about the value encoding. For example, should we be pursuing a compression option for string values?

The documented tuple format, while motivated by sorting concerns, does look fairly compact. So do you have concerns about using it for values, other than, perhaps, that long string values could be compressed? Short strings won't likely compress by much, though (if we exclude the various header bytes when storing, it might be simplest to compress all strings).
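If we did go down the compression road, a size threshold with a one-byte flag would keep short strings cheap. A minimal sketch using the Erlang stdlib zlib module (the threshold and tag values are invented for illustration; snappy itself would need a NIF):

```
-define(UNCOMPRESSED, 0).
-define(DEFLATED, 1).

%% Hypothetical: only compress values large enough to plausibly win.
maybe_compress(Bin) when byte_size(Bin) < 64 ->
    <<?UNCOMPRESSED, Bin/binary>>;
maybe_compress(Bin) ->
    <<?DEFLATED, (zlib:compress(Bin))/binary>>.

decompress(<<?UNCOMPRESSED, Bin/binary>>) -> Bin;
decompress(<<?DEFLATED, Z/binary>>) -> zlib:uncompress(Z).
```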
Overview
This document describes a data model for storing JSON documents as key-value
pairs in FoundationDB. It includes a discussion of storing multiple versions of
each document, identified by unique revision identifiers, and covers some of
the operations needed to query and modify these documents.