-
Notifications
You must be signed in to change notification settings - Fork 0
Design Proposal
-
To create an extensible high performance couchdb clone optimized for SSDs* in Node.JS.
-
To create an extensible web services platform that is data driven, scales elegantly, and is implemented in pure JavaScript.
-
To create a solid foundation for experimenting with different datastore formats intended for use with SSDs.
* When I say SSDs I really mean any storage solution with fast random access and 'decent' sequential throughput.
-
couchdb is already closely coupled with javascript
-
there is a large pool of skilled javascript hackers and that pool will continue to grow
-
a pure javascript implementation of couchdb will be more extensible, the following are being considered:
- extending the security model to include fine grained read access controls since there will be far less of a performance penalty in Chesterfield
- design document support for services such as web sockets
- document support for automatic conflict resolution
- ability to reference external libraries from inside design documents when explicitly desired
- potential for library sharing with the client
- simple support for clean urls and internal redirections
- extra authentication and session tracking features
- statistics gathering support
-
SSDs change the storage game dramatically
- no seek penalty (mechanical drive heads need to move into position)
- fast random reads/writes of relatively small block sizes (4096KB*)
- becoming more affordable
- next generation filesystems like ZFS can utilize SSDs as caches (ZIL), making large hybrid HDD/SDD storage arrays inexpensive and highly performant for random writes
- we need to reconsider data store design decisions in light of this information
-
node shines when it comes to IO intensive services
- callback chains hang off of relatively slow IO operations
-
couchdb (or any db/datastore) is a textbook example of an IO intensive service
-
ignoring IO, V8 is very fast (though this is less important)
- suspect V8 is computationally much faster than beam (erlang vm), though single threaded
- V8 is computationally much faster than the spidermonkey which is in couchdb for views (map/reduce). IO latency of a write dwarves the latency of the computations, there's still gains to be had here
-
append only B+Tree is ideally suited for use in actor model with one writer, many readers
- this can be easily implemented in node with multi processes and IPC
- the OS will cache frequently read pages system wide
* This is drive dependent, more info here.
- using the nodejs async paradigm from socket to disk
- trade clear code for brevity and conciseness = make callbacks readable
- all documents stored in an append only log than can be traversed both frontwards and backwards
- documents stored as JSON strings, along which are both appended and prepended with 4 byte long integers used to determine the document lengths
- sequence numbers need to be stored along with documents
- forward and backward references can either point to the long integers, or if the document length is otherwise available, the documents themselves
- append only B+ trees's used for indexing
- static page size, 4096K at first, but this can be tweaked
- keys map to offsets with in the document log
- sequence number of the last indexed document is stored in the root of the tree for O(1) retrieval
- main index, object id -> object file reference
- indexes for each view
- sequence index: sequence number -> document
- Indexes to use proper B+ Tree with document copies stored in leaf nodes
- variable Index page sizes which tend towards 4096K over time, rather being hardcoded
- hybrid SDD+HDD support
- once documents are written to the document log, they are considered durably persisted
- in memory an associative array maps document ids to offsets in a temporary cache
- add/update requests can receive responses as soon as objects have been added document log and cache
- once the response has be sent, they can be added to the index right away, or by set interval
- this should eventually be completely configurable
- requests for views cause objects in cache to processed immediately (indexed)
- this really to reduce write latency and hopefully maximize throughput (100% disk utilization)
- all documents can be referenced by an incrementing sequence number which follows the order they were added in
- should a document make to the document log, but not an index a mechanism for determining the last sequence number is needed.
- last sequence number stored with the root notes of the append only B+tree
- at startup these are checked along with the sequence number of the last document in the document log
- document log is read backwards until the first unindexed document is found, then read forwards to update the indexes
- there are a number of tunables, it would be nice to have tests provide the best values for a given system