Design Proposal

gflarity edited this page Jun 12, 2011 · 49 revisions

Chesterfield - A Node Based CouchDB clone

Goals

  1. To create an extensible high performance couchdb clone optimized for SSDs* in Node.JS.

  2. To create an extensible web services platform that is data driven, scales elegantly, and is implemented in pure JavaScript.

  3. To create a solid foundation for experimenting with different datastore formats intended for use with SSDs.

* When I say SSDs I really mean any storage solution with fast random access and 'decent' sequential throughput.

Motivation

  • couchdb is already closely coupled with javascript

  • there is a large pool of skilled javascript hackers and that pool will continue to grow

  • a pure javascript implementation of couchdb will be more extensible, the following are being considered:

    • extending the security model to include fine grained read access controls since there will be far less of a performance penalty in Chesterfield
    • design document support for services such as web sockets
    • document support for automatic conflict resolution
    • ability to reference external libraries from inside design documents when explicitly desired
    • potential for library sharing with the client
    • simple support for clean urls and internal redirections
    • extra authentication and session tracking features
    • statistics gathering support
  • SSDs change the storage game dramatically

    • no seek penalty (mechanical drive heads need to move into position)
    • fast random reads/writes of relatively small block sizes (4096KB*)
    • becoming more affordable
    • next generation filesystems like ZFS can utilize SSDs as caches (ZIL), making large hybrid HDD/SDD storage arrays inexpensive and highly performant for random writes
    • we need to reconsider data store design decisions in light of this information
  • node shines when it comes to IO intensive services

    • callback chains hang off of relatively slow IO operations
  • couchdb (or any db/datastore) is a textbook example of an IO intensive service

  • ignoring IO, V8 is very fast (though this is less important)

    • suspect V8 is computationally much faster than beam (erlang vm), though single threaded
    • V8 is computationally much faster than the spidermonkey which is in couchdb for views (map/reduce). IO latency of a write dwarves the latency of the computations, there's still gains to be had here
  • append only B+Tree is ideally suited for use in actor model with one writer, many readers

    • this can be easily implemented in node with multi processes and IPC
    • the OS will cache frequently read pages system wide

* This is drive dependent, more info here.

Design Overiew

Async

  • using the nodejs async paradigm from socket to disk
  • trade clear code for brevity and conciseness = make callbacks readable

Document Log

  • all documents stored in an append only log than can be traversed both frontwards and backwards
  • documents stored as JSON strings, along which are both appended and prepended with 4 byte long integers used to determine the document lengths
  • sequence numbers need to be stored along with documents
  • forward and backward references can either point to the long integers, or if the document length is otherwise available, the documents themselves

Document Indexes

Initial Implementation

  • append only B+ trees's used for indexing
  • static page size, 4096K at first, but this can be tweaked
  • keys map to offsets with in the document log
  • sequence number of the last indexed document is stored in the root of the tree for O(1) retrieval
  • main index, object id -> object file reference
  • indexes for each view
  • sequence index: sequence number -> document

Possible Future Implementation Variations

  • Indexes to use proper B+ Tree with document copies stored in leaf nodes
  • variable Index page sizes which tend towards 4096K over time, rather being hardcoded
  • hybrid SDD+HDD support

Caching (to be implemented later)

  • once documents are written to the document log, they are considered durably persisted
  • in memory an associative array maps document ids to offsets in a temporary cache
  • add/update requests can receive responses as soon as objects have been added document log and cache
  • once the response has be sent, they can be added to the index right away, or by set interval
  • this should eventually be completely configurable
  • requests for views cause objects in cache to processed immediately (indexed)
  • this really to reduce write latency and hopefully maximize throughput (100% disk utilization)
  • all documents can be referenced by an incrementing sequence number which follows the order they were added in
  • should a document make to the document log, but not an index a mechanism for determining the last sequence number is needed.
  • last sequence number stored with the root notes of the append only B+tree
  • at startup these are checked along with the sequence number of the last document in the document log
  • document log is read backwards until the first unindexed document is found, then read forwards to update the indexes

Other Features

Optimization Tests and Tuning

  • there are a number of tunables, it would be nice to have tests provide the best values for a given system
Clone this wiki locally
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.
Press h to open a hovercard with more details.