Skip to content
gflarity edited this page Jun 12, 2011 · 49 revisions

Chesterfield - A Node Based CouchDB clone

Goals

  1. To create an extensible high performance couchdb clone optimized for SSDs* in Node.JS.

  2. To create an extensible web services platform that is data driven, scales elegantly, and is implemented in pure JavaScript.

  3. To create a solid foundation for experimenting with different datastore formats intended for use with SSDs.

* When I say SSDs I really mean any storage solution with fast random access and 'decent' sequential throughput.

Motivation

  • couchdb is already closely coupled with javascript

  • there is a large pool of skilled javascript hackers and that pool will continue to grow

  • a pure javascript implementation of couchdb will be more extensible, the following are being considered:

    • extending the security model to include fine grained read access controls since there will be far less of a performance penalty in Chesterfield
    • design document support for services such as web sockets
    • document support for automatic conflict resolution
    • ability to reference external libraries from inside design documents when explicitly desired
    • potential for library sharing with the client
    • simple support for clean urls and internal redirections
    • extra authentication and session tracking features
    • statistics gathering support
  • SSDs change the storage game dramatically

    • no seek penalty (mechanical drive heads need to move into position)
    • fast random reads/writes of relatively small block sizes (4096KB*)
    • becoming more affordable
    • next generation filesystems like ZFS can utilize SSDs as caches (ZIL), making large hybrid HDD/SDD storage arrays inexpensive and highly performant for random writes
    • we need to reconsider data store design decisions in light of this information
  • node shines when it comes to IO intensive services

    • callback chains hang off of relatively slow IO operations
  • couchdb (or any db/datastore) is a textbook example of an IO intensive service

  • ignoring IO, V8 is very fast (though this is less important)

    • suspect V8 is computationally much faster than beam (erlang vm), though single threaded
    • V8 is computationally much faster than the spidermonkey which is in couchdb for views (map/reduce). IO latency of a write dwarves the latency of the computations, there's still gains to be had here
  • append only B+Tree is ideally suited for use in actor model with one writer, many readers

    • this can be easily implemented in node with multi processes and IPC
    • the OS will cache frequently read pages system wide

* This is drive dependent, more info here.

Design Overiew

Async

  • using the nodejs async paradigm from socket to disk
  • trade clear code for brevity and conciseness = make callbacks readable

Document Log

  • all documents stored in an append only log than can be traversed both frontwards and backwards
  • documents stored as JSON strings, along which are both appended and prepended with 4 byte long integers used to determine the document lengths
  • sequence numbers need to be stored along with documents
  • forward and backward references can either point to the long integers, or if the document length is otherwise available, the documents themselves

Document Indexes

Initial Implementation

  • append only B+ trees's used for indexing
  • static page size, 4096K at first, but this can be tweaked
  • keys map to offsets with in the document log
  • sequence number of the last indexed document is stored in the root of the tree for O(1) retrieval
  • main index, object id -> object file reference
  • indexes for each view
  • sequence index: sequence number -> document

Possible Future Implementation Variations

  • Indexes to use proper B+ Tree with document copies stored in leaf nodes
  • variable Index page sizes which tend towards 4096K over time, rather being hardcoded
  • hybrid SDD+HDD support

Caching (to be implemented later)

  • once documents are written to the document log, they are considered durably persisted
  • in memory an associative array maps document ids to offsets in a temporary cache
  • add/update requests can receive responses as soon as objects have been added document log and cache
  • once the response has be sent, they can be added to the index right away, or by set interval
  • this should eventually be completely configurable
  • requests for views cause objects in cache to processed immediately (indexed)
  • this really to reduce write latency and hopefully maximize throughput (100% disk utilization)
  • all documents can be referenced by an incrementing sequence number which follows the order they were added in
  • should a document make to the document log, but not an index a mechanism for determining the last sequence number is needed.
  • last sequence number stored with the root notes of the append only B+tree
  • at startup these are checked along with the sequence number of the last document in the document log
  • document log is read backwards until the first unindexed document is found, then read forwards to update the indexes

Other Features

Optimization Tests and Tuning

  • there are a number of tunables, it would be nice to have tests provide the best values for a given system