Skip to content
Aaron Miller edited this page Mar 2, 2012 · 10 revisions

On disk file format

This is not fully implemented just yet, working on it!

Numbers are in MSB-first or big-endian byte order.

Blocks and Chunks

Every 4096th byte of the file is a zero, unless a header immediately follows.

All data "chunks" are written to the file prefixed with their length as a 32-bit big endian number, followed by the CRC32 hash of the data. The CRC32 is not included in the length.

File Header

Prefixed with a 32 bit length and a hash, similarly to other data chunks, but the length does include the length of the hash.

Values in the file header

  • 8 bits -- File format version (Currently 10)
  • 48 bits -- Update sequence number. This is the sequence number new updates should start at.
  • 48 bits -- Purge sequence number.
  • 48 bits -- Purged documents pointer
  • 16 bits -- Size of by_seq B-tree root
  • 16 bits -- Size of by_id B-tree root
  • 16 bits -- Size of local documents B-tree root
  • The B-tree roots, in the order of the sizes, are B-tree node pointers as described in the "B-trees" section.

B-trees

Nodes on disk

  • First byte -- 1 if a leaf node or K/V node, 0 if a pointer node or K/P Node
  • A list of key-value or key-pointer pairs, each of which is:
    • 12 bits -- Key size
    • 28 bits -- Value size
    • Followed by the Key data,
    • then the Value data

In pointer nodes the Value parts of these pairs are pointers to another B-tree node, where keys less than or equal to that pair's Key will be.

Node pointers

  • 48 bits -- Node position in file
  • 48 bits -- Sub tree size, or disk size of B-tree data below this node. For pointers to leaf nodes this is the size on disk of the pointed to node, otherwise it is the sum of the size of the pointed to node and all the values of this field on the pointers in that node.
  • 16 bits -- Reduce value size (NOT present in the root pointers embedded in the file header, as it can be inferred from the lengths in the header)
  • Reduce value -- Used for memoizing the intermediate values of a reduce function.

The By-Sequence index

The keys in this B-tree are 48-bit numbers representing the sequence number of a change to the database. This number is strictly increasing.

The values are

  • 12 bits -- Size of the document ID
  • 28 bits -- Size of the document data
  • 1 bit -- Deleted flag (this item in the by-sequence index represents a deletion action if this is set)
  • 47 bits -- Position of the document content on disk
  • 8 bits -- Content type code
  • 32 bits -- Document revision number
  • Document ID
  • Revision ID

The reduce value in this B-tree is a 40-bit number, indicating the amount of records.

The By-ID index

This B-tree is an index of documents by their IDs, the keys are the document IDs. The keys are ordered as if sorted by memcmp

The values are

  • 48 bits -- The sequence number this document was last modified at. If this document is modified, this sequence number should be removed from the by-sequence index.
  • 32 bits -- Size of the document data
  • 1 bit -- Deleted flag. If this is set this document should be ignored in indexing.
  • 47 bits -- Position of the doucment content on disk
  • 8 bits -- Content type code
  • 32 bits -- Document revision umber
  • Revision ID

The reduce value in this B-tree is

  • 40 bits -- Number of documents that are not flagged as deleted.
  • 40 bits -- Number of documents that are flagged as deleted
  • 48 bits -- Total disk size of document content

Content type code

The high bit of this byte is used to indicate whether the document is compressed on disk with the google snappy compression library, the rest is used for identifying the type of the document data.

  • 0 -- JSON data
  • 1 -- BLOB data (Attempted to parse as JSON but failed)
  • 2 -- BLOB data (Attempted to parse as JSON, JSON contained reserved keys)
  • 3 -- BLOB data (No attempt to parse as JSON)

Revision IDs

This is identifying information about a document revision that should be small enough to be kept in the indexes. This value is not handled by Couchstore.

It's currently used by Couchbase as

  • 64 bits -- Memcached CAS value (unique cookie used by memcache to allow atomic writes)
  • 32 bits -- Expiration time of a value
  • 32 bits -- User flags as specified by Memcache protocol
Clone this wiki locally