docs on storage and consistency
dobe committed Nov 27, 2014
1 parent 641c803 commit 83f418e
Showing 4 changed files with 242 additions and 13 deletions.
55 changes: 42 additions & 13 deletions docs/best_practice/multi_node_setup.txt
@@ -6,7 +6,7 @@ Multi Node Setup
================

Crate is a distributed datastore by design so in a production environment
you usually have a cluster of 2 or more nodes. We at Crate.IO try to make
you usually have a cluster of 3 or more nodes. We at Crate.IO try to make
the cluster setup as easy as possible. However, there are a few things to
bear in mind when you start building a new cluster.

@@ -100,32 +100,61 @@ or use internal network IPs + transport port::
as soon as the new node will ping existing ones.


.. _master_node_election:

Master Node Election
====================

Although all Crate nodes in a cluster are equal, there is one node elected as master.
Like any other peer-to-peer system, nodes communicate with each other directly.
The master node is responsible for making changes to and publishing the
global cluster state, as well as for delegating redistribution of shards when
nodes join or leave the cluster.
Although all Crate nodes in a cluster are equal, there is one node
elected as master for managing cluster meta data. Like any other
peer-to-peer system, nodes communicate with each other directly. The
master node is responsible for making changes to and publishing the
global cluster state, as well as for delegating redistribution of
shards when nodes join or leave the cluster. All nodes are eligible to
become master.

There must be only one master per cluster. To ensure this, Crate
allows setting a quorum of nodes that needs to be present in order
to elect a master and for the cluster to be operational.

By default all nodes are eligible to become master. The settings ``minimum_master_nodes``
defines how many nodes need to be available in the cluster to elect a master.
We highly recommend to set the quorum as follows::
This quorum can be configured using the ``minimum_master_nodes``
setting. We highly recommend setting the quorum to more than half of
the maximum number of nodes in the cluster. The formula is as
follows::

(N / 2) + 1

where ``N`` is the number of nodes in the cluster. In a 3-node cluster it would mean
that at least 2 nodes need to be started before they elect the master.
where ``N`` is the maximum number of nodes in the cluster.

.. note::

Setting the quorum lower than described above may lead to split-brain
scenarios: in case of a network partitioning there could be more than
one pool of nodes that meets the quorum and elects a master of its
own. This can cause data loss and inconsistencies.

Also, if the planned number of nodes in a cluster changes, the quorum
needs to be updated too. It should never drop below half of the
available nodes, but it should also allow the cluster to operate
without having all nodes online, so it should not be set too high
either.

For example, in a 3-node cluster this means that at least 2 nodes
need to see each other before they are allowed to elect a master. So
the following line needs to be added to the configuration file.

::

discovery.zen.minimum_master_nodes: 2
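
Applying the same formula to a planned 5-node cluster gives a quorum
of ``(5 / 2) + 1 = 3`` (using integer division), so the setting would
read::

    discovery.zen.minimum_master_nodes: 3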

.. note::

Note that this value must be updated according to the amount of nodes
when you add new or remove nodes from the cluster.
Given the formula above, a cluster with a maximum of 2 nodes has a
quorum of 2 as well. In practice this means that a 2-node cluster
needs both nodes online in order to be operational; a highly
available and fault-tolerant multi-node setup therefore requires at
least 3 nodes.


Publish Host and Port
=====================
1 change: 1 addition & 0 deletions docs/index_html.rst
@@ -23,5 +23,6 @@ Contents
sql/index
blob
udc
storage_consistency
best_practice/index
integration/index
2 changes: 2 additions & 0 deletions docs/sql/ddl.txt
@@ -89,6 +89,8 @@ table creation. Example::

The number of shards can only be set on table creation; it cannot be changed later on.

.. _routing:

Routing
-------

197 changes: 197 additions & 0 deletions docs/storage_consistency.txt
@@ -0,0 +1,197 @@
=======================
Storage and Consistency
=======================

This document provides an overview of how Crate stores and distributes
state across the cluster and which consistency and durability
guarantees are provided.

.. note::

Crate heavily relies on Elasticsearch_ and Lucene_ for storage and
cluster consensus, so the concepts described here might look familiar
to Elasticsearch_ users, since the implementation is actually reused
from the Elasticsearch_ code.

Data Storage
============

Every table in Crate is sharded, which means that tables are divided
and distributed across the nodes of a cluster. Each shard in Crate is
a Lucene_ index which is broken down into segments that are stored on
the filesystem. Physically the files reside under one of the
configured data directories of a node.

Lucene only appends data to segment files, which means that data
written to disk is never mutated. This makes replication and recovery
easy, since syncing a shard is simply a matter of fetching data from a
specific marker.

An arbitrary number of replica shards can be configured per
table. Every operational replica holds a fully synchronized copy of
the primary shard. Writes are synchronous over all active replicas
with the following flow:

1. The operation is routed to the corresponding primary shard for
execution.

2. The operation gets executed on the primary shard.

3. If the operation succeeds on the primary, the operation gets
executed on all replicas in parallel.

4. After all replica operations have finished, the result of the
operation gets returned to the caller.

Should any replica shard fail to write the data or time out in step
4, it is immediately considered unavailable.
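
The number of shards and replicas mentioned above is defined per
table. A minimal sketch using Crate's SQL DDL (the table name and
columns are only illustrative)::

    -- each shard of this table is a separate Lucene index; with one
    -- replica configured, every primary shard has one fully
    -- synchronized copy on another node
    CREATE TABLE my_table (
        id INTEGER PRIMARY KEY,
        name STRING
    ) CLUSTERED INTO 6 SHARDS
    WITH (number_of_replicas = 1);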

Atomic on Document Level
========================

Each row of a table in Crate is a semi-structured document which can
be nested arbitrarily deep through the use of object and array types.

Operations on documents are atomic, meaning that a write operation on
a document either succeeds as a whole or has no effect at all. This is
always the case, regardless of the nesting depth or size of the
document.

Crate does not provide transactions. However, since every document in
Crate has a version number assigned, which is increased every time a
change occurs, patterns like optimistic concurrency control (see
:ref:`sql_occ`) can help to work around that limitation.
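
As a sketch of that pattern (reusing the illustrative table from
above and assuming the current version of the row is known to be
``2``), an update can be made conditional on the version::

    -- the update only takes effect if the row still has version 2;
    -- otherwise it matches zero rows and the client can re-read and retry
    UPDATE my_table SET name = 'updated'
    WHERE id = 1 AND "_version" = 2;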

Durability
==========

Each shard has a WAL_, also known as the translog. It guarantees that
operations on documents are persisted to disk without having to issue
a Lucene commit for every write operation. When the translog is
flushed, all data is written to the persistent index storage of Lucene
and the translog is cleared.

In case of an unclean shutdown of a shard, the transactions in the
translog are replayed upon startup to ensure that all executed
operations are permanent.

The translog is also transferred directly when a newly allocated
replica initialises itself from the primary shard, so there is no need
to flush segments to disk just for replica recovery purposes.

Addressing of Documents
=======================

Every document has an internal identifier (see
:ref:`_id <sql_ddl_system_column_id>`). This identifier is derived from the
primary key by default. Documents in tables without a primary key are
automatically assigned a unique auto-generated id when they are
created.

Each document is routed by its routing key to one specific shard. By
default this key is the value of the ``_id`` column. However, this can
be configured in the table schema (see :ref:`routing`).

While transparent to the user, there are internally two ways in which
Crate accesses documents:

:get: Direct access by identifier. Only applicable if the routing key
and the identifier can be computed from the given query
specification (e.g. the full primary key is given in the ``WHERE``
clause).

This is the most efficient way to access a document, since only
a single shard gets accessed and only a simple index lookup on
the ``_id`` field has to be done.

:search: Query by matching against fields of documents across all
candidate shards of the table.
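
To illustrate the difference (again using the illustrative table from
above, whose primary key is ``id``)::

    -- full primary key in the WHERE clause: resolved as a direct get
    SELECT * FROM my_table WHERE id = 1;

    -- no primary key given: executed as a search across all candidate shards
    SELECT * FROM my_table WHERE name = 'foo';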

Consistency
===========

Crate is eventually consistent for search operations. Search operations
are performed on shared ``IndexReaders`` which, besides other
functionality, provide caching and reverse lookup capabilities for
shards. An ``IndexReader`` is always bound to the Lucene_ segment it
was started from, which means it has to be refreshed in order to see
new changes. This happens periodically, but can also be triggered
manually (see :ref:`refresh_data`). Therefore a search only sees a
change if the corresponding ``IndexReader`` was refreshed after that
change occurred.

If a query specification results in a ``get`` operation, changes are
visible immediately. This is achieved by looking up the document in the
translog first, which always holds the most recent version of the
document. The common update and fetch use-case is therefore possible:
if a client updates a row and then looks that row up by its primary
key, the changes will always be visible, since the information is
retrieved directly from the translog.
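
A sketch of this use-case, again with the illustrative table from
above::

    UPDATE my_table SET name = 'changed' WHERE id = 1;

    -- a lookup by primary key results in a get and sees the change immediately
    SELECT name FROM my_table WHERE id = 1;

    -- a search only sees the change after the next refresh, which can
    -- also be triggered manually
    REFRESH TABLE my_table;
    SELECT * FROM my_table WHERE name = 'changed';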

.. note::

Every replica shard is updated synchronously with its primary and
always carries the same information. In terms of consistency it
therefore does not matter whether the primary or a replica shard is
accessed. Only the refresh of the ``IndexReader`` affects
consistency.

Cluster Meta Data
=================

Cluster meta data is held in the so-called "Cluster State", which
contains the following information:

- Table schemas.

- Primary and replica shard locations. Basically just a mapping from
shard number to the storage node.

- Status of each shard, which tells whether a shard is currently ready
for use, is in another state like "initialising" or "recovering", or
cannot be assigned at all.

- Information about discovered nodes and their status.

- Configuration information.
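
Parts of this information can be inspected at runtime through the
``sys`` tables, for example (a sketch; the exact set of available
columns depends on the Crate version)::

    -- shard allocation and state of a single table
    SELECT table_name, id, "primary", state
    FROM sys.shards WHERE table_name = 'my_table';

    -- discovered nodes in the cluster
    SELECT id, name FROM sys.nodes;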

Every node has its own copy of the cluster state. However, only one
node at a time is allowed to change the cluster state at runtime. This
node is called the "master" node and is auto-elected. The "master"
node has no special configuration at all; any node in the cluster can
be elected as master. There is also an automatic re-election if the
current master node goes down for some reason.

.. note::

In order to avoid a scenario where two masters are elected due to
network partitioning, it is required to define a quorum of nodes with
which it is possible to elect a master. For details on how to do this
and further information see :ref:`master_node_election`.

To illustrate the flow of events for a cluster state change, here is
an example flow for an ``ALTER TABLE`` statement which changes the
schema of a table (a concrete statement is sketched below the list):

#. A node in the cluster receives the ``ALTER TABLE`` request.

#. The node sends out a request to the current master node to change
the table definition.

#. The master node applies the changes locally to the cluster state
and sends out a notification to all affected nodes about the
change.

#. The nodes apply the change, so that they are now in sync with the
master.

#. Every node might take some local action depending on the type of
cluster state change.
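
For example, a statement like the following triggers exactly this
kind of cluster state change (table and column name are again only
illustrative)::

    ALTER TABLE my_table ADD COLUMN description STRING;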



.. _Elasticsearch: http://www.elasticsearch.org/

.. _Lucene: http://lucene.apache.org/core/

.. _WAL: http://en.wikipedia.org/wiki/Write-ahead_logging
