docs on storage and consistency
dobe committed Nov 27, 2014
1 parent 641c803 commit 83f418e
Showing 4 changed files with 242 additions and 13 deletions.
55 changes: 42 additions & 13 deletions docs/best_practice/multi_node_setup.txt
@@ -6,7 +6,7 @@ Multi Node Setup
================

Crate is a distributed datastore by design so in a production environment
you usually have a cluster of 2 or more nodes. We at Crate.IO try to make
you usually have a cluster of 3 or more nodes. We at Crate.IO try to make
the cluster setup as easy as possible. However, there are a few things to
bear in mind when you start building a new cluster.

@@ -100,32 +100,61 @@ or use internal network IPs + transport port::
as soon as the new node will ping existing ones.


.. _master_node_election:

Master Node Election
====================

Although all Crate nodes in a cluster are equal, there is one node elected as master.
Like any other peer-to-peer system, nodes communicate with each other directly.
The master node is responsible for making changes to and publishing the
global cluster state, as well as for delegating redistribution of shards when
nodes join or leave the cluster.
Although all Crate nodes in a cluster are equal, there is one node
elected as master for managing cluster meta data. Like any other
peer-to-peer system, nodes communicate with each other directly. The
master node is responsible for making changes to and publishing the
global cluster state, as well as for delegating redistribution of
shards when nodes join or leave the cluster. All nodes are eligible to
become master.

There must be only one master per cluster. To ensure this, Crate
allows setting a quorum of nodes that needs to be present in order
to elect a master and for the cluster to be operational.

By default all nodes are eligible to become master. The settings ``minimum_master_nodes``
defines how many nodes need to be available in the cluster to elect a master.
We highly recommend to set the quorum as follows::
This quorum can be configured using the ``minimum_master_nodes``
setting. We highly recommend setting the quorum to more than half of
the maximum number of nodes in the cluster. The formula is as
follows::

(N / 2) + 1

where ``N`` is the number of nodes in the cluster. In a 3-node cluster it would mean
that at least 2 nodes need to be started before they elect the master.
where ``N`` is the maximum number of nodes in the cluster.

.. note::

Setting the quorum lower than described above may lead to split-brain
scenarios: in case of a network partitioning there could be more than
one pool of nodes that meets the quorum and elects a master of its
own. This can cause data loss and inconsistencies.

Also, if the planned number of nodes in a cluster changes, the quorum
needs to be updated too. It should never drop below half of the
available nodes, but it should also allow the cluster to operate
without having all nodes online, so it should not be set too high
either.

For example, in a 3-node cluster this means that at least 2 nodes
need to see each other before they are allowed to elect a master. So
the following line needs to be added to the configuration file.

::

discovery.zen.minimum_master_nodes: 2
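
Applying the same formula to a planned 5-node cluster gives a quorum
of ``(5 / 2) + 1 = 3`` (using integer division), so the setting would
read::

    discovery.zen.minimum_master_nodes: 3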

.. note::

Note that this value must be updated according to the amount of nodes
when you add new or remove nodes from the cluster.
Given the formula above, a cluster with a maximum of 2 nodes has a
quorum of 2 as well. In practice this means that a 2-node cluster
needs both nodes online in order to be operational; a highly
available and fault-tolerant multi-node setup therefore requires at
least 3 nodes.


Publish Host and Port
=====================
1 change: 1 addition & 0 deletions docs/index_html.rst
@@ -23,5 +23,6 @@ Contents
sql/index
blob
udc
storage_consistency
best_practice/index
integration/index
2 changes: 2 additions & 0 deletions docs/sql/ddl.txt
@@ -89,6 +89,8 @@ table creation. Example::

The number of shards can only be set on table creation; it cannot be changed later on.

.. _routing:

Routing
-------

197 changes: 197 additions & 0 deletions docs/storage_consistency.txt
@@ -0,0 +1,197 @@
=======================
Storage and Consistency
=======================

This document provides an overview of how Crate stores and distributes
state across the cluster and which consistency and durability
guarantees are provided.

.. note::

Crate heavily relies on Elasticsearch_ and Lucene_ for storage and
cluster consensus, so the concepts described here might look familiar
to Elasticsearch_ users, since the implementation is actually reused
from the Elasticsearch_ code.

Data Storage
============

Every table in Crate is sharded, which means that tables are divided
and distributed across the nodes of a cluster. Each shard in Crate is
a Lucene_ index which is broken down into segments that are stored on
the filesystem. Physically the files reside under one of the
configured data directories of a node.

Lucene only appends data to segment files, which means that data
written to disk is never mutated. This makes replication and recovery
easy, since syncing a shard is simply a matter of fetching data from a
specific marker.

An arbitrary number of replica shards can be configured per
table. Every operational replica holds a fully synchronized copy of
the primary shard. Writes are synchronous over all active replicas
with the following flow:

1. The operation is routed to the corresponding primary shard for
execution.

2. The operation gets executed on the primary shard.

3. If the operation succeeds on the primary, the operation gets
executed on all replicas in parallel.

4. After all replica operations have finished, the result of the
operation gets returned to the caller.

Should any replica shard fail to write the data or time out in step
4, it is immediately considered unavailable.
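
The number of shards and replicas mentioned above is defined per
table. A minimal sketch using Crate's SQL DDL (the table name and
columns are only illustrative)::

    -- each shard of this table is a separate Lucene index; with one
    -- replica configured, every primary shard has one fully
    -- synchronized copy on another node
    CREATE TABLE my_table (
        id INTEGER PRIMARY KEY,
        name STRING
    ) CLUSTERED INTO 6 SHARDS
    WITH (number_of_replicas = 1);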

Atomic on Document Level
========================

Each row of a table in Crate is a semi-structured document which can
be nested arbitrarily deep through the use of object and array types.

Operations on documents are atomic, meaning that a write operation on
a document either succeeds as a whole or has no effect at all. This is
always the case, regardless of the nesting depth or size of the
document.

Crate does not provide transactions. However, since every document in
Crate has a version number assigned, which is increased every time a
change occurs, patterns like optimistic concurrency control (see
:ref:`sql_occ`) can help to work around that limitation.
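
As a sketch of that pattern (reusing the illustrative table from
above and assuming the current version of the row is known to be
``2``), an update can be made conditional on the version::

    -- the update only takes effect if the row still has version 2;
    -- otherwise it matches zero rows and the client can re-read and retry
    UPDATE my_table SET name = 'updated'
    WHERE id = 1 AND "_version" = 2;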

Durability
==========

Each shard has a WAL_, also known as the translog. It guarantees that
operations on documents are persisted to disk without having to issue
a Lucene commit for every write operation. When the translog is
flushed, all data is written to the persistent index storage of Lucene
and the translog is cleared.

In case of an unclean shutdown of a shard, the transactions in the
translog are replayed upon startup to ensure that all executed
operations are permanent.

The translog is also transferred directly when a newly allocated
replica initialises itself from the primary shard, so there is no need
to flush segments to disk just for replica recovery purposes.

Addressing of Documents
=======================

Every document has an internal identifier (see
:ref:`_id <sql_ddl_system_column_id>`). This identifier is derived from the
primary key by default. Documents in tables without a primary key are
automatically assigned a unique auto-generated id when they are
created.

Each document is routed by its routing key to one specific shard. By
default this key is the value of the ``_id`` column. However, this can
be configured in the table schema (see :ref:`routing`).

While transparent to the user, there are internally two ways in which
Crate accesses documents:

:get: Direct access by identifier. Only applicable if the routing key
and the identifier can be computed from the given query
specification (e.g. the full primary key is given in the ``WHERE``
clause).

This is the most efficient way to access a document, since only
a single shard gets accessed and only a simple index lookup on
the ``_id`` field has to be done.

:search: Query by matching against fields of documents across all
candidate shards of the table.
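
To illustrate the difference (again using the illustrative table from
above, whose primary key is ``id``)::

    -- full primary key in the WHERE clause: resolved as a direct get
    SELECT * FROM my_table WHERE id = 1;

    -- no primary key given: executed as a search across all candidate shards
    SELECT * FROM my_table WHERE name = 'foo';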

Consistency
===========

Crate is eventually consistent for search operations. Search operations
are performed on shared ``IndexReaders`` which, besides other
functionality, provide caching and reverse lookup capabilities for
shards. An ``IndexReader`` is always bound to the Lucene_ segment it
was started from, which means it has to be refreshed in order to see
new changes. This happens periodically, but can also be triggered
manually (see :ref:`refresh_data`). Therefore a search only sees a
change if the corresponding ``IndexReader`` was refreshed after that
change occurred.

If a query specification results in a ``get`` operation, changes are
visible immediately. This is achieved by looking up the document in the
translog first, which always holds the most recent version of the
document. The common update and fetch use-case is therefore possible:
if a client updates a row and then looks that row up by its primary
key, the changes will always be visible, since the information is
retrieved directly from the translog.
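
A sketch of this use-case, again with the illustrative table from
above::

    UPDATE my_table SET name = 'changed' WHERE id = 1;

    -- a lookup by primary key results in a get and sees the change immediately
    SELECT name FROM my_table WHERE id = 1;

    -- a search only sees the change after the next refresh, which can
    -- also be triggered manually
    REFRESH TABLE my_table;
    SELECT * FROM my_table WHERE name = 'changed';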

.. note::

Every replica shard is updated synchronously with its primary and
always carries the same information. In terms of consistency it
therefore does not matter whether the primary or a replica shard is
accessed. Only the refresh of the ``IndexReader`` affects
consistency.

Cluster Meta Data
=================

Cluster meta data is held in the so-called "Cluster State", which
contains the following information:

- Table schemas.

- Primary and replica shard locations. Basically just a mapping from
shard number to the storage node.

- Status of each shard, which tells whether a shard is currently ready
for use, is in another state like "initialising" or "recovering", or
cannot be assigned at all.

- Information about discovered nodes and their status.

- Configuration information.
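
Parts of this information can be inspected at runtime through the
``sys`` tables, for example (a sketch; the exact set of available
columns depends on the Crate version)::

    -- shard allocation and state of a single table
    SELECT table_name, id, "primary", state
    FROM sys.shards WHERE table_name = 'my_table';

    -- discovered nodes in the cluster
    SELECT id, name FROM sys.nodes;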

Every node has its own copy of the cluster state. However, only one
node at a time is allowed to change the cluster state at runtime. This
node is called the "master" node and is auto-elected. The "master"
node has no special configuration at all; any node in the cluster can
be elected as master. There is also an automatic re-election if the
current master node goes down for some reason.

.. note::

In order to avoid a scenario where two masters are elected due to
network partitioning, it is required to define a quorum of nodes with
which it is possible to elect a master. For details on how to do this
and further information see :ref:`master_node_election`.

To illustrate the flow of events for a cluster state change, here is
an example flow for an ``ALTER TABLE`` statement which changes the
schema of a table (a concrete statement is sketched below the list):

#. A node in the cluster receives the ``ALTER TABLE`` request.

#. The node sends out a request to the current master node to change
the table definition.

#. The master node applies the changes locally to the cluster state
and sends out a notification to all affected nodes about the
change.

#. The nodes apply the change, so that they are now in sync with the
master.

#. Every node might take some local action depending on the type of
cluster state change.
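
For example, a statement like the following triggers exactly this
kind of cluster state change (table and column name are again only
illustrative)::

    ALTER TABLE my_table ADD COLUMN description STRING;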



.. _Elasticsearch: http://www.elasticsearch.org/

.. _Lucene: http://lucene.apache.org/core/

.. _WAL: http://en.wikipedia.org/wiki/Write-ahead_logging
