251 changes: 251 additions & 0 deletions rfcs/004-document-storage.md
---
name: Formal RFC
about: Submit a formal Request For Comments for consideration by the team.
title: 'JSON document storage in FoundationDB'
labels: rfc, discussion
assignees: ''

---

[NOTE]: # ( ^^ Provide a general summary of the RFC in the title above. ^^ )

# Introduction

This document describes a data model for storing JSON documents as key-value
pairs in FoundationDB. It includes a discussion of storing multiple versions of
the document, each identified by unique revision identifiers, and discusses some
of the operations needed to query and modify these documents.

## Abstract

The data model maps each "leaf" JSON value (number, string, true, false, and
null) to a single KV in FoundationDB. Nested relationships are modeled using a
tuple structure in the keys. Different versions of a document are stored
completely independently from one another. Values are encoded using
FoundationDB's tuple encoding.

The use of a single KV pair for each leaf value implies a new 100KB limit on
those values stored in CouchDB documents. An alternative design could split
these large (string) values across multiple KV pairs.

Extremely deeply-nested data structures and the use of long names in the nesting
objects could cause a path to a leaf value to exceed FoundationDB's 10KB limit
on key sizes. String interning could reduce the likelihood of this occurring but
not eliminate it entirely. Interning could also provide some significant space
savings in the current FoundationDB storage engine, although the introduction of
key prefix elision in the Redwood engine should also help on that front.

FoundationDB imposes a hard 10MB limit on transactions. In order to reserve
space for additional metadata and user-defined indexes, and to generally drive
users towards best practices in data modeling, this RFC proposes a **1MB
(1,000,000 byte)** limit on document sizes going forward.

## Requirements Language

[NOTE]: # ( Do not alter the section below. Follow its instructions. )

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in
[RFC 2119](https://www.rfc-editor.org/rfc/rfc2119.txt).

## Terminology

---

# Detailed Description

## Value Encoding

The `true` (`\x27`), `false` (`\x26`) and `null` (`\x00`) values each have a
single-byte encoding in FoundationDB's tuple layer. Integers are represented
with arbitrary precision (technically, up to 255 bytes can be used).
Floating-point numbers use an IEEE binary representation up to double precision.
More details on these specific byte codes are available in the [FoundationDB
documentation](https://github.com/apple/foundationdb/blob/6.0.18/design/tuple.md).

Unicode strings must be encoded into UTF-8. They are prefixed with a `\x02`
bytecode and are null-terminated. Any nulls within the string must be replaced
by `\x00\xff`. Raw byte strings have their own `\x01` prefix and must follow the
same rules regarding null bytes in the string. Both are limited to 100KB.
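
As a rough illustration of this encoding, the sketch below round-trips a few
leaf values through the tuple layer using the FoundationDB Python bindings'
`fdb.tuple` module. It assumes the `foundationdb` Python package and client
library are installed and is illustrative only, not part of this design.

```python
import fdb

fdb.api_version(600)  # matches the 6.0.x tuple layer docs referenced here

# Round-trip a few JSON leaf values through the tuple layer.
leaves = [None, True, False, 123, 3.14, u"hello", b"raw bytes"]
for leaf in leaves:
    encoded = fdb.tuple.pack((leaf,))        # bytes, using the codes above
    decoded = fdb.tuple.unpack(encoded)[0]   # lossless round trip
    assert decoded == leaf
    print(repr(leaf), "->", repr(encoded))

# Per the text above: null packs to \x00, false to \x26, true to \x27,
# UTF-8 strings carry a \x02 prefix and raw byte strings a \x01 prefix.
```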

An object is decomposed into multiple key-value pairs, where each key is a tuple
identifying the path to a final leaf value. For example, the object

```
{
    "foo": {
        "bar": {
            "baz": 123
        }
    }
}
```

would be represented by a key-value pair of

```
pack({"foo", "bar", "baz"}) = pack({123})
```

Clients SHOULD NOT submit objects containing duplicate keys, as CouchDB will
only preserve the last occurrence of the key and will silently drop the other
occurrences. Similarly, clients MUST NOT rely on the ordering of keys within an
object, as this ordering will generally not be preserved by the database.

An array of N elements is represented by N distinct key-value pairs, where the
last element of the tuple key is an integer representing the zero-indexed
position of the value within the array. As an example:

```
{
"states": ["MA", "OH", "TX", "NM", "PA"]
}
```

becomes

```
pack({"states", 0}) = pack({"MA"})
pack({"states", 1}) = pack({"OH"})
pack({"states", 2}) = pack({"TX"})
pack({"states", 3}) = pack({"NM"})
pack({"states", 4}) = pack({"PA"})
```
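
To make the object and array mappings concrete, here is a minimal plain-Python
sketch; the `explode` helper is a hypothetical name used for illustration, not
part of CouchDB.

```python
def explode(value, path=()):
    """Yield (path_tuple, leaf) pairs for a decoded JSON value.

    Object keys and zero-based array indexes are appended to the path;
    numbers, strings, true, false and null terminate a path.
    """
    if isinstance(value, dict):
        # Note: JSON parsers typically keep only the last occurrence of a
        # duplicate key, matching the behaviour described above.
        for key, child in value.items():
            yield from explode(child, path + (key,))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from explode(child, path + (index,))
    else:
        yield path, value


doc = {"foo": {"bar": {"baz": 123}}, "states": ["MA", "OH", "TX", "NM", "PA"]}
for path, leaf in explode(doc):
    print(path, "=", leaf)
# ('foo', 'bar', 'baz') = 123
# ('states', 0) = MA
# ...
```

Each emitted pair would then be stored as `pack(path) = pack((leaf,))` using
the tuple encoding described above.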

More details on the encodings in the FoundationDB Tuple Layer can be found in
the [design
documentation](https://github.com/apple/foundationdb/blob/6.0.18/design/tuple.md).

## Document Subspace and Versioning

Document bodies will be stored in their own portion of the keyspace with a fixed
single-byte prefix identifying the "subspace". Each revision of a document will
be stored separately without term sharing, and the document ID and revision ID
are baked into the key. The structure looks like this:

```
{DbName, ?DOCUMENTS, DocID, NotDeleted, RevPos, RevHash} = RevisionMetadata
{DbName, ?DOCUMENTS, DocID, NotDeleted, RevPos, RevHash, "foo"} = (value for doc.foo)
et cetera
```

> **Member:** It occurs to me that this format prevents `RevisionMetadata` from
> ever exceeding 100KB, which, depending on what we end up putting in there, may
> or may not be an issue.
>
> There's also no versioning in this RFC for how we might do schema migrations
> in the future, as was done for the revision storage RFC. It's obviously
> possible to repurpose that version marker as a means to version the doc body
> storage as well, but that seems like something we'd want to call out
> explicitly in one place or the other, and it also feels a bit limiting that
> we're tying the two together.

> **Member Author:** Versioning is included, see the line directly below this
> one:
>
> > RevisionMetadata includes at the minimum an enum to enable schema evolution
> > for subsequent changes to the document encoding structure
>
> In fact, that's the only planned use for that field at the moment. Seems to me
> that if we have more than 100KB of metadata here, not including the rev tree,
> something has gone horribly wrong with our database design.

> **Member:** Ah. I was definitely making the assumption that this would hold
> things like the `#docs.atts` metadata and whatnot. We can obviously do other
> things there, but it still feels like something to make sure we call out so we
> ensure that we don't allow a pathological breakage there.

where `RevisionMetadata` includes at the minimum an enum to enable schema
evolution for subsequent changes to the document encoding structure, and
`NotDeleted` is `true` if this revision is a typical `deleted=false` revision,
and `false` if the revision is storing user-supplied data associated with the
tombstone. Regular document deletions without any data in the tombstone do not
show up in the `?DOCUMENTS` subspace at all. This key structure ensures that in
the case of multiple edit branches the "winning" revision's data will sort last
in the key space.
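
Continuing the Python sketches above, the following is a hedged illustration of
how one revision's keys could be assembled. The `DOCUMENTS` prefix byte, the
schema-version value of `0`, and the `revision_keys` helper are placeholders
rather than the actual CouchDB implementation; `explode` is the helper from the
earlier sketch.

```python
import fdb

fdb.api_version(600)

DOCUMENTS = b"\x01"  # placeholder for the real ?DOCUMENTS subspace prefix byte


def revision_keys(db_name, doc_id, not_deleted, rev_pos, rev_hash, doc):
    """Yield (key, value) byte pairs for one revision of a document."""
    prefix = fdb.tuple.pack((db_name,)) + DOCUMENTS
    base = (doc_id, not_deleted, rev_pos, rev_hash)

    # RevisionMetadata row: at minimum a schema-version enum (0 here is a
    # placeholder) to allow the encoding structure to evolve.
    yield prefix + fdb.tuple.pack(base), fdb.tuple.pack((0,))

    # One row per leaf value; the path is appended after the revision id.
    # explode() is the helper from the earlier sketch.
    for path, leaf in explode(doc):
        yield prefix + fdb.tuple.pack(base + path), fdb.tuple.pack((leaf,))
```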

## CRUD Operations

FoundationDB transactions have a hard limit of 10 MB each. Our document
operations will need to modify some metadata alongside the user data, and we'd
also like to reserve space for updating indexes as part of the same transaction.
This document proposes to limit the maximum document size to **1 MB (1,000,000
bytes)** going forward (excluding attachments).

A document insert does not need to clear any data in the `?DOCUMENTS` subspace,
and simply inserts the new document content. The transaction will issue a read
against the `?REVISIONS` subspace to ensure that no `NotDeleted` revision
already exists.

A document update targeting a parent revision will clear the entire range of
keys associated with the parent revision in the `?DOCUMENTS` space as part of
its transaction. Again, the read in the `?REVISIONS` space ensures that this
transaction can only succeed if the parent revision is actually a leaf revision.
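
A rough sketch of that update path under the same assumptions: clear the parent
revision's range in the `?DOCUMENTS` subspace, then write the replacement
revision's rows. The read against `?REVISIONS` is elided, `update_document` is
a hypothetical name, and `revision_keys`/`DOCUMENTS` come from the sketch above.

```python
# Assumes the imports and helpers from the previous sketches.
@fdb.transactional
def update_document(tr, db_name, doc_id, parent_rev, new_rev, doc):
    """Replace the parent revision's data with the new revision's rows."""
    prefix = fdb.tuple.pack((db_name,)) + DOCUMENTS
    parent_pos, parent_hash = parent_rev

    # (The read against ?REVISIONS confirming parent_rev is a leaf revision
    #  would happen here; elided in this sketch.)

    # Clear every key belonging to the parent revision (NotDeleted = True).
    tr.clear_range_startswith(
        prefix + fdb.tuple.pack((doc_id, True, parent_pos, parent_hash)))

    # Write the new revision's metadata and leaf rows.
    new_pos, new_hash = new_rev
    for key, value in revision_keys(db_name, doc_id, True, new_pos, new_hash, doc):
        tr[key] = value
```

Invoked as `update_document(db, ...)`, the `@fdb.transactional` decorator runs
and retries the body as a single FoundationDB transaction.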

Document deletions are a special class of update that typically do not insert
any keys into the `?DOCUMENTS` subspace. However, if a user includes extra
fields in the deletion they will show up in this subspace.

Document reads where we already know the specific revision of interest can be
done efficiently using a single `get_range_startswith` operation. In the more
common case where we do not know the revision identifier, there are two basic
options:

1. We can retrieve the winning revision ID from the `?REVISIONS` subspace, then
execute a `get_range_startswith` operation as above.
1. We can start streaming the entire key range from the `?DOCUMENTS` space
prefixed by `DocID` in reverse, and break if we reach another revision of the
document ID besides the winning one.
> **Member:** A couple things to note here. Generally speaking, when we know a
> specific revision we want to read, we don't actually know whether it's deleted
> or not, so we would have to do two range reads, one for `deleted=true` and one
> for `deleted=false`. Obviously one of those will be empty, but it's a slight
> complication to the stated approach.
>
> For the two options on reading a doc, it's obviously up in the air whether the
> extra round trip against the `?REVISIONS` subspace beats possibly streaming
> extra rows or possibly having to issue a second range read. Also, if we go
> with Option 1 we don't need `NotDeleted` at all, which also simplifies things
> a bit if we're always reading this range with a known `{RevPos, RevHash}`
> pair.
>
> Another happy side effect of always reading the winning revision from the
> `?REVISIONS` subspace is that we stop having to worry about deleted vs normal
> revisions in this subspace. An empty doc with no attachments could just as
> well have zero data in this space, and then we don't have to store something
> arbitrarily in the metadata for each empty doc as a third kind of odd document
> state.

> **Member Author:** Yes, I covered this in the CRUD operations section, though
> I proposed doing the followup read in `?REVISIONS` rather than issuing both
> `deleted=true` and `deleted=false` guesses.
>
> > If a reader is implementing Option 2 and does not find any keys associated
> > with the supplied DocID in the ?DOCUMENTS space, it will need to do a
> > followup read on the ?REVISIONS space in order to determine whether the
> > appropriate response is {"not_found": "missing"} or {"not_found": "deleted"}.

> **Member (@davisp, Apr 5, 2019):** Ah. Yet again I was slightly off track
> considering when we return the deleted body, which is called out elsewhere as
> requiring extra fdb requests.
>
> That seems legit, but I am starting to worry a bit about the subtlety of some
> of these optimizations that presume existing behaviors, as opposed to being
> less efficient but easier to implement.

> **Member Author:** I'm definitely not trying to be "cute" or overly subtle, so
> stop me if you think there's a better path. My thinking on using `?REVISIONS`
> in a followup request was exactly because of this condition you brought up:
>
> > An empty doc with no attachments could just as well have zero data in this
> > space

> **Member:** I don't think this is quite to that level. I'm just trying to
> think through whether, if we go with the first option and always require a
> read from the `?REVISIONS` keyspace, a good chunk of things end up falling
> away complexity-wise. We'd have defined bounds on all reads to the
> `?DOCUMENTS` keyspace and our range reads would be optimal.
>
> That said, same as you I believe, forcing that extra round trip for every
> document read is tempting to try and eliminate. For the single range read to
> get the current winner, I'm also wondering how often that ends up with either
> a second range request or wasted rows read, and how that cost compares to the
> extra round trip.
>
> Luckily, though, it should be easy to implement both approaches to compare
> apples to apples and get a better idea of the relative cost either way.

> **Member Author:** Agreed, this is an area where an A/B test is doable without
> too much surgery.


Document reads specifying `conflicts`, `deleted_conflicts`, `meta`, or
`revs_info` will need to retrieve the revision metadata from the `?REVISIONS`
subspace alongside the document body regardless of which option we pursue above.

If a reader is implementing Option 2 and does not find any keys associated with
the supplied `DocID` in the `?DOCUMENTS` space, it will need to do a followup
read on the `?REVISIONS` space in order to determine whether the appropriate
response is `{"not_found": "missing"}` or `{"not_found": "deleted"}`.
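
For illustration, here is a sketch of Option 1 under the same assumptions as
the earlier snippets: a lookup of the winning revision (represented by a
placeholder `winning_revision` helper against `?REVISIONS`), followed by a
single `get_range_startswith` over that revision's rows, folded back into a
document. The names and helpers are hypothetical.

```python
# Assumes the imports and helpers from the previous sketches.
@fdb.transactional
def read_winner(tr, db_name, doc_id):
    """Option 1: one read in ?REVISIONS, then one range read in ?DOCUMENTS."""
    rev = winning_revision(tr, db_name, doc_id)  # placeholder ?REVISIONS lookup
    if rev is None:
        return None  # caller maps this to {"not_found": "missing"}
    not_deleted, rev_pos, rev_hash = rev

    prefix = (fdb.tuple.pack((db_name,)) + DOCUMENTS +
              fdb.tuple.pack((doc_id, not_deleted, rev_pos, rev_hash)))

    doc = {}
    for kv in tr.get_range_startswith(prefix):
        path = fdb.tuple.unpack(kv.key[len(prefix):])
        if not path:
            continue  # the RevisionMetadata row
        leaf = fdb.tuple.unpack(kv.value)[0]
        node = doc
        for part in path[:-1]:
            node = node.setdefault(part, {})  # arrays folded as int-keyed dicts
        node[path[-1]] = leaf
    return doc
```

Option 2 would instead issue a reverse `get_range_startswith` over everything
prefixed by `DocID` and stop once a second revision appears; as noted in the
thread above, both approaches are straightforward enough to compare directly.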

# Advantages and Disadvantages

A leading alternative to this design in the mailing list discussion was to
simply store each JSON document as a single key-value pair, chunking documents
that exceed the 100KB value threshold across contiguous key-value pairs. The
advantages of the "exploded" approach described in this RFC are

- it lends itself nicely to sub-document operations, e.g. apache/couchdb#1559
- it optimizes the creation of Mango indexes on existing databases since we only
need to retrieve the value(s) we want to index
- it optimizes Mango queries that use field selectors

The disadvantages of this approach are that it uses a larger number of key-value
pairs and has a higher overall storage overhead from the repeated common key
prefixes. The new FoundationDB storage engine should eliminate some of the
storage overhead.
As per [the FoundationDB discussion about co-locating compute operations with data storage servers/nodes](https://forums.foundationdb.org/t/feature-request-predicate-pushdown/954/6), if we were to make use of this hypothetical feature, we would not get a guarantee that entire documents are co-located on one storage node, requiring us to do extra work should we want to, say, assemble a full `doc` to send to a map function. JS views would have a harder time, while Mango indexes, with their explicit field declarations, might get around this particular complexity more easily. For now, this is recorded here so we don't lose track of it later.


# Key Changes

- Individual strings within documents are limited to 100 KB each.
- The "path" to a leaf value within a document can be no longer than 10 KB.
- The entire JSON document is limited to 1 MB (1,000,000 bytes).

Size limitations aside, this design preserves all of the existing API options
for working with CouchDB documents.

## Applications and Modules affected

TBD depending on exact code layout going forward.

## HTTP API additions

None.

## HTTP API deprecations

None, aside from the more restrictive size limitations discussed in the Key
Changes section above.

# Security Considerations

None have been identified.

# References

[Original mailing list discussion](https://lists.apache.org/thread.html/fb8bdd386b83d60dc50411c51c5dddff7503ece32d35f88612d228cc@%3Cdev.couchdb.apache.org%3E)

[Draft RFC for revision metadata](https://github.com/apache/couchdb-documentation/blob/rfc/001-fdb-revision-model/rfcs/001-fdb-revision-metadata-model.md)

[Current version of Tuple Layer documentation](https://github.com/apple/foundationdb/blob/6.0.18/design/tuple.md)

# Acknowledgements

We had lots of input on the mailing list in this discussion, thanks to

- @banjiewen
- @davisp
- @ermouth
- @iilyak
- @janl
- @mikerhodes
- @rnewson
- @vatamane
- @wohali
- Michael Fair.
- Reddy B.