@@ -23,5 +23,6 @@ Contents

   sql/index
   blob
   udc
   storage_consistency
   best_practice/index
   integration/index
=======================
Storage and Consistency
=======================

This document provides an overview of how Crate stores and distributes
state across the cluster, and which consistency and durability
guarantees are provided.

.. note::

   Since Crate heavily relies on Elasticsearch_ and Lucene_ for storage
   and cluster consensus, the concepts described here may look familiar
   to Elasticsearch_ users, as the implementation is reused from the
   Elasticsearch_ code.

Data Storage
============

Every table in Crate is sharded, which means that tables are divided
and distributed across the nodes of a cluster. Each shard in Crate is
a Lucene_ index which is broken down into segments that are stored on
the filesystem. Physically, the files reside under one of the
configured data directories of a node.

Lucene only appends data to segment files, which means that data
written to disk will never be mutated. This makes replication and
recovery easy, since syncing a shard is simply a matter of fetching
data from a specific marker.

An arbitrary number of replica shards can be configured per
table. Every operational replica holds a fully synchronized copy of
the primary shard. Writes are synchronous over all active replicas,
with the following flow:

1. The operation is routed to the corresponding primary shard for
   execution.

2. The operation is executed on the primary shard.

3. If the operation succeeds on the primary, it is executed on all
   replicas in parallel.

4. After all replica operations have finished, the result is returned
   to the caller.

Should any replica shard fail to write the data or time out in step
4, it is immediately considered unavailable.
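
The write flow above can be sketched roughly as follows. This is a toy
in-memory model with hypothetical names, not Crate's actual
implementation::

    from concurrent.futures import ThreadPoolExecutor

    class Shard:
        """A toy shard that stores documents in memory."""
        def __init__(self, name):
            self.name = name
            self.docs = {}
            self.available = True

        def write(self, doc_id, doc):
            self.docs[doc_id] = doc
            return True

    def replicated_write(primary, replicas, doc_id, doc):
        # 1./2. Route the operation to the primary shard and execute it.
        primary.write(doc_id, doc)
        # 3. On success, execute the operation on all replicas in parallel.
        with ThreadPoolExecutor() as pool:
            futures = {pool.submit(r.write, doc_id, doc): r for r in replicas}
            for fut, replica in futures.items():
                try:
                    fut.result(timeout=1.0)
                except Exception:
                    # A failing or timed-out replica is considered unavailable.
                    replica.available = False
        # 4. Return the result to the caller once all replicas have finished.
        return True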

Atomic on Document Level
========================

Each row of a table in Crate is a semi-structured document which can
be nested arbitrarily deep through the use of object and array types.

Operations on documents are atomic, meaning that a write operation on
a document either succeeds as a whole or has no effect at all. This is
always the case, regardless of the nesting depth or size of the
document.

Crate does not provide transactions. However, since every document in
Crate has a version number assigned, which is increased every time a
change occurs, patterns like optimistic concurrency control (see
:ref:`sql_occ`) can help to work around that limitation.
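
The optimistic concurrency control pattern can be sketched like this.
The store, exception, and helper names are made up for illustration;
only the version-check idea corresponds to what Crate provides::

    class VersionConflict(Exception):
        pass

    class DocumentStore:
        """Toy store: each document carries a version that grows on every change."""
        def __init__(self):
            self.docs = {}  # doc_id -> (version, doc)

        def get(self, doc_id):
            return self.docs[doc_id]

        def update(self, doc_id, doc, expected_version):
            version, _ = self.docs[doc_id]
            if version != expected_version:
                # Someone else changed the document in between.
                raise VersionConflict(doc_id)
            self.docs[doc_id] = (version + 1, doc)

    def occ_update(store, doc_id, modify, retries=3):
        """Read-modify-write loop that retries on version conflicts."""
        for _ in range(retries):
            version, doc = store.get(doc_id)
            try:
                store.update(doc_id, modify(dict(doc)), expected_version=version)
                return True
            except VersionConflict:
                continue  # re-read the current version and try again
        return False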

Durability
==========

Each shard has a WAL_, also known as translog. It guarantees that
operations on documents are persisted to disk without having to issue
a Lucene commit for every write operation. When the translog is
flushed, all data is written to the persistent index storage of Lucene
and the translog is cleared.

In case of an unclean shutdown of a shard, the transactions in the
translog are replayed upon startup to ensure that all executed
operations are permanent.

The translog is also transferred directly when a newly allocated
replica initializes itself from the primary shard, so there is no need
to flush segments to disk just for replica-recovery purposes.
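
A minimal sketch of the write-ahead idea, assuming a simple
JSON-lines log file (the real translog format and recovery protocol
differ)::

    import json
    import os

    class TranslogShard:
        """Toy shard: writes go to an append-only log before the 'index'."""
        def __init__(self, log_path):
            self.log_path = log_path
            self.index = {}      # stands in for the live Lucene index
            self.committed = {}  # what the last flush made durable

        def write(self, doc_id, doc):
            # Append to the translog and fsync first, then apply to the index.
            with open(self.log_path, "a") as log:
                log.write(json.dumps({"id": doc_id, "doc": doc}) + "\n")
                log.flush()
                os.fsync(log.fileno())
            self.index[doc_id] = doc

        def flush(self):
            # A flush persists the index state and clears the translog.
            self.committed = dict(self.index)
            open(self.log_path, "w").close()

        def recover(self):
            # After an unclean shutdown, replay the translog on top of
            # the last committed state so no acknowledged write is lost.
            self.index = dict(self.committed)
            with open(self.log_path) as log:
                for line in log:
                    entry = json.loads(line)
                    self.index[entry["id"]] = entry["doc"]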

Addressing of Documents
=======================

Every document has an internal identifier (see
:ref:`_id <sql_ddl_system_column_id>`). By default, this identifier is
derived from the primary key. Documents living in tables without a
primary key are automatically assigned a unique, auto-generated id
when they are created.

Each document is routed to one specific shard by its routing key. By
default, this key is the value of the ``_id`` column, but it can be
configured in the table schema (see :ref:`routing`).
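
Hash-based routing of this kind can be sketched as follows. CRC32 is
used here purely for illustration; Crate's actual hash function and
shard assignment may differ::

    import zlib

    def shard_for(routing_key, num_shards):
        """Map a routing key to a shard number with a stable hash."""
        return zlib.crc32(routing_key.encode("utf-8")) % num_shards

Because the mapping is deterministic, every operation carrying the same
routing key lands on the same shard, which is what makes single-shard
``get`` access possible.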

While transparent to the user, internally there are two ways in which
Crate accesses documents:

:get: Direct access by identifier. Only applicable if the routing key
      and the identifier can be computed from the given query
      specification (e.g. the full primary key is defined in the
      ``WHERE`` clause).

      This is the most efficient way to access a document, since only
      a single shard is accessed and only a simple index lookup on the
      ``_id`` field has to be done.

:search: Query by matching against fields of documents across all
         candidate shards of the table.

Consistency
===========

Crate is eventually consistent for search operations. Search
operations are performed on shared ``IndexReaders`` which, among other
functionalities, provide caching and reverse lookup capabilities for
shards. An ``IndexReader`` is always bound to the Lucene_ segment it
was started from, which means it has to be refreshed in order to see
new changes; this happens periodically, but can also be triggered
manually (see :ref:`refresh_data`). Therefore a search only sees a
change if the relevant ``IndexReader`` was refreshed after that change
occurred.

If a query specification results in a ``get`` operation, changes are
visible immediately. This is achieved by looking up the document in
the translog first, which always has the most recent version of the
document. The common update-and-fetch use case is therefore possible:
if a client updates a row and that row is then looked up by its
primary key, the changes will always be visible, since the information
is retrieved directly from the translog.
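
The difference between the two read paths can be sketched with a toy
model (hypothetical names; not Crate's internal API)::

    class ToyShard:
        """Translog is always current; the reader sees a point-in-time view."""
        def __init__(self):
            self.translog = {}     # most recent version of each document
            self.reader_view = {}  # snapshot the IndexReader was opened on

        def write(self, doc_id, doc):
            self.translog[doc_id] = doc

        def refresh(self):
            # Reopening the reader makes the latest writes searchable.
            self.reader_view = dict(self.translog)

        def get(self, doc_id):
            # get: consults the translog first, so it is always up to date.
            return self.translog.get(doc_id)

        def search(self, predicate):
            # search: only sees what the last refresh made visible.
            return [d for d in self.reader_view.values() if predicate(d)]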

.. note::

   Every replica shard is updated synchronously with its primary and
   always carries the same information. In terms of consistency it
   therefore does not matter whether the primary or a replica shard is
   accessed; only the refresh of the ``IndexReader`` affects
   consistency.

Cluster Meta Data
=================

Cluster meta data is held in the so-called "cluster state", which
contains the following information:

- Table schemas.

- Primary and replica shard locations. Basically just a mapping from
  shard number to storage node.

- Status of each shard, which tells whether a shard is currently ready
  for use, is in some other state like "initializing" or "recovering",
  or cannot be assigned at all.

- Information about discovered nodes and their status.

- Configuration information.

Every node has its own copy of the cluster state. However, only one
node at a time is allowed to change the cluster state at runtime. This
node is called the "master" node and is auto-elected. The "master"
node needs no special configuration; any node in the cluster can be
elected as master. There is also an automatic re-election if the
current master node goes down for some reason.

.. note::

   In order to avoid a scenario where two masters are elected due to
   network partitioning, it is required to define a quorum of nodes
   with which it is possible to elect a master. For details on how to
   do this and further information see :ref:`master_node_election`.

To illustrate the flow of events for a cluster state change, here is
an example flow for an ``ALTER TABLE`` statement which changes the
schema of a table:

#. A node in the cluster receives the ``ALTER TABLE`` request.

#. The node sends out a request to the current master node to change
   the table definition.

#. The master node applies the changes locally to the cluster state
   and sends out a notification to all affected nodes about the
   change.

#. The nodes apply the change, so that they are now in sync with the
   master.

#. Every node might take some local action depending on the type of
   cluster state change.
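
The propagation steps above can be sketched roughly as follows. This
is a toy model with made-up names; the real protocol involves
acknowledgements, versioned state diffs, and failure handling::

    class Node:
        def __init__(self, name):
            self.name = name
            self.cluster_state = {"version": 0, "schemas": {}}

    class Master(Node):
        def __init__(self, name, nodes):
            super().__init__(name)
            self.nodes = nodes

        def alter_table(self, table, schema):
            # Apply the change locally to the cluster state first ...
            self.cluster_state["version"] += 1
            self.cluster_state["schemas"][table] = schema
            # ... then notify all other nodes so they sync up.
            for node in self.nodes:
                node.cluster_state = dict(self.cluster_state)

    def handle_alter_table(receiving_node, master, table, schema):
        # Any node can receive the request; it forwards it to the master,
        # which is the only node allowed to change the cluster state.
        master.alter_table(table, schema)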

.. _Elasticsearch: http://www.elasticsearch.org/

.. _Lucene: http://lucene.apache.org/core/

.. _WAL: http://en.wikipedia.org/wiki/Write-ahead_logging