Design Proposal

Chesterfield - A Node Based CouchDB clone

Goals

To create an extensible high performance couchdb clone optimized for SSDs* in Node.JS.
To create an extensible web services platform that is data driven, scales elegantly, and is implemented in pure JavaScript.
To create a solid foundation for experimenting with different datastore formats intended for use with SSDs.

* When I say SSDs I really mean any storage solution with fast random access and 'decent' sequential throughput.

Motivation

couchdb is already closely coupled with javascript
there is a large pool of skilled javascript hackers and that pool will continue to grow
a pure javascript implementation of couchdb will be more extensible, the following are being considered:
- extending the security model to include fine grained read access controls since there will be far less of a performance penalty in Chesterfield
- design document support for services such as web sockets
- document support for automatic conflict resolution
- ability to reference external libraries from inside design documents when explicitly desired
- potential for library sharing with the client
- simple support for clean urls and internal redirections
- extra authentication and session tracking features
- statistics gathering support
SSDs change the storage game dramatically
- no seek penalty (mechanical drive heads need to move into position)
- fast random reads/writes of relatively small block sizes (4096KB*)
- becoming more affordable
- next generation filesystems like ZFS can utilize SSDs as caches (ZIL), making large hybrid HDD/SDD storage arrays inexpensive and highly performant for random writes
- we need to reconsider data store design decisions in light of this information
node shines when it comes to IO intensive services
- callback chains hang off of relatively slow IO operations
couchdb (or any db/datastore) is a textbook example of an IO intensive service
ignoring IO, V8 is very fast (though this is less important)
- suspect V8 is computationally much faster than beam (erlang vm), though single threaded
- V8 is computationally much faster than the spidermonkey which is in couchdb for views (map/reduce). IO latency of a write dwarves the latency of the computations, there's still gains to be had here
append only B+Tree is ideally suited for use in actor model with one writer, many readers
- this can be easily implemented in node with multi processes and IPC
- the OS will cache frequently read pages system wide

* This is drive dependent, more info here.

Design Overiew

Async

using the nodejs async paradigm from socket to disk
trade clear code for brevity and conciseness = make callbacks readable

Document Log

all documents stored in an append only log than can be traversed both frontwards and backwards
documents stored as JSON strings, along which are both appended and prepended with 4 byte long integers used to determine the document lengths
sequence numbers need to be stored along with documents
forward and backward references can either point to the long integers, or if the document length is otherwise available, the documents themselves

Document Indexes

Initial Implementation

append only B+ trees's used for indexing
static page size, 4096K at first, but this can be tweaked
keys map to offsets with in the document log
sequence number of the last indexed document is stored in the root of the tree for O(1) retrieval
main index, object id -> object file reference
indexes for each view
sequence index: sequence number -> document

Possible Future Implementation Variations

Indexes to use proper B+ Tree with document copies stored in leaf nodes
variable Index page sizes which tend towards 4096K over time, rather being hardcoded
hybrid SDD+HDD support

Caching (to be implemented later)

once documents are written to the document log, they are considered durably persisted
in memory an associative array maps document ids to offsets in a temporary cache
add/update requests can receive responses as soon as objects have been added document log and cache
once the response has be sent, they can be added to the index right away, or by set interval
this should eventually be completely configurable
requests for views cause objects in cache to processed immediately (indexed)
this really to reduce write latency and hopefully maximize throughput (100% disk utilization)
all documents can be referenced by an incrementing sequence number which follows the order they were added in
should a document make to the document log, but not an index a mechanism for determining the last sequence number is needed.
last sequence number stored with the root notes of the append only B+tree
at startup these are checked along with the sequence number of the last document in the document log
document log is read backwards until the first unindexed document is found, then read forwards to update the indexes

Other Features

Optimization Tests and Tuning

there are a number of tunables, it would be nice to have tests provide the best values for a given system

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Design Proposal

Chesterfield - A Node Based CouchDB clone

Goals

Motivation

Design Overiew

Async

Document Log

Document Indexes

Initial Implementation

Possible Future Implementation Variations

Caching (to be implemented later)

Other Features

Optimization Tests and Tuning

Clone this wiki locally