v0.8 - August 2016 - Paul Asmuth
Table of Contents
1. Preface
2. Design Overview
2.1 Metadata File
3. Partition Lifecycle
4. Partition Assignment
4.1 Partition Split
4.2 Server Join
4.3 Server Leave
5. Metadata File
5.1 Create Metadata File
5.2 Change Metadata File
5.3 Join Metadata Server
5.4 Remove Metadata Server
5.5 Background Metadata Replication
6. Implementation Details
6.1 Change Notification
6.2 Placement IDs
7. Alternatives Considered
7.1 Consistent Hashing
8. Code Locations
1. Preface
The EventQL partitioning design is based on the BigTable partitioning scheme
[link to paper] with a few major differences.
2. Design Overview
We define a table's keyspace as the range [begin, end] of all valid
partitioning key values.
For numeric keys, the keyspace could be the interval [UINT64_MIN, UINT64_MAX].
For string keys, the keyspace is the interval ["", "") where an empty string
in the begin or end position represents the lowest or highest permissible
value respectively.
The table's keyspace is split into a number of non-overlapping ranges called
"partitions". Each partition is defined by its start and end position in
the keyspace, i.e. the lowest and highest key that will still be contained
in the partition. The mapping of partitions for a table is recorded in a
METADATA file (see section 2.1).
Each partition is simultaneously served by N servers where N is called the
replication factor.
From all replicas of a partition, we elect one as the leader and designate the
other replicas as followers. All replicas can accept read operations for the
data (although the leader will always have the most recent snapshot), but only
the leader accepts writes.
How exactly the rows are stored on disk is not part of this design document.
For this document, we assume that a partition is stored as a single unit on
each server that hosts it.
Similarly, the exact replication semantics (i.e. how a copy of a partition on
one node is transferred to another node) are also not in the scope of this
document; we will assume there exists a mechanism that can synchronize two
copies of the same partition.
2.1 Metadata File
All partitioning information is stored in a METADATA file. The metadata file
consists mainly of a list of entries, one for each partition in the table. Each
entry in the metadata file contains these pieces of information:
- begin - the begin of the partition in the keyspace (i.e. the first contained key)
- partition_id - a unique/random id for this partition
- servers - the list of servers that serve this partition
- servers_joining - the list of servers that are joining into this partition
- servers_leaving - the list of servers that are leaving from the partition
- split_point - if a split is in progress, the key at which the partition is being split
- split_partition_id_low - the id of the new partition covering the lower half of the range
- split_partition_id_high - the id of the new partition covering the upper half of the range
- split_servers_low - the list of servers assigned to the new lower partition
- split_servers_high - the list of servers assigned to the new upper partition
The metadata file supports transactional (compare-and-swap) updates.
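A minimal sketch of what a single METADATA entry might look like, assuming string
partitioning keys; the struct and field types are illustrative and do not describe
the actual on-disk encoding:

    // Illustrative only; the actual on-disk encoding of a METADATA entry may differ.
    #include <string>
    #include <vector>

    struct MetadataEntry {
      std::string begin;                         // first key contained in the partition
      std::string partition_id;                  // unique/random id for the partition
      std::vector<std::string> servers;          // servers currently serving the partition
      std::vector<std::string> servers_joining;  // servers joining into the partition
      std::vector<std::string> servers_leaving;  // servers leaving the partition

      // only set while a split is in progress:
      std::string split_point;                   // key at which the partition is split
      std::string split_partition_id_low;        // id of the new lower partition
      std::string split_partition_id_high;       // id of the new upper partition
      std::vector<std::string> split_servers_low;   // servers for the new lower partition
      std::vector<std::string> split_servers_high;  // servers for the new upper partition
    };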
3. Partition Lifecycle
Each replica of a partition (i.e. each copy of a partition that is local to
a specific server) has a defined lifetime separated into four stages:
- LOAD -- the local server is joining the replica list for the partition.
  It doesn't take part in leader election and does not serve any read
  requests yet. From the LOAD state, we can transition to the LIVE or
  UNLOAD state
- LIVE -- the local server is a live serving replica for this partition. It
  serves both writes and reads and takes part in the leader election. From
  the LIVE state we can transition to the UNLOAD state only
- UNLOAD -- the local server has either left the replica list for the
  partition or the partition has split and does not exist anymore. The local
  server does not serve reads or writes and does not take part in leader
  election. We transition to the UNLOAD_FINAL state once we have pushed
  out all unreplicated rows to all new leaders
- UNLOAD_FINAL -- this state is the final state and allows the server to
  delete the partition's data from disk. From the UNLOAD_FINAL state we can
  not transition back to any other state
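The lifecycle can be summarized as a small state machine. The following sketch
(names are illustrative, not EventQL's actual types) encodes the transitions
allowed by the list above:

    // Illustrative sketch, not EventQL's actual types: the replica lifecycle
    // as a small state machine.
    enum class ReplicaState { LOAD, LIVE, UNLOAD, UNLOAD_FINAL };

    // Returns true if a replica may move from `from` to `to`.
    bool transition_allowed(ReplicaState from, ReplicaState to) {
      switch (from) {
        case ReplicaState::LOAD:
          return to == ReplicaState::LIVE || to == ReplicaState::UNLOAD;
        case ReplicaState::LIVE:
          return to == ReplicaState::UNLOAD;
        case ReplicaState::UNLOAD:
          // only after all unreplicated rows were pushed to the new leaders
          return to == ReplicaState::UNLOAD_FINAL;
        case ReplicaState::UNLOAD_FINAL:
          return false;  // terminal: the partition data may be deleted
      }
      return false;
    }

    // A replica only serves reads/writes and takes part in leader election
    // while it is LIVE.
    bool serves_requests(ReplicaState s) { return s == ReplicaState::LIVE; }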
Partition replication and lifecycle management follow a simple abstract algorithm:
- For each partition P:
- For each server which should store the data in partition P
(considering splits and joining servers)
- Check if we have already pushed all data to this server
- If not push the unreplicated data to the server
- If all servers have acknowledged all data and we are in the
UNLOAD_FINAL state drop the partition
- On every change in the server list, restart the replication for the
  affected partition
- On every insert, restart the replication for the affected partition
In practice the algorithm is a bit more involved; it is documented in the
replication design document.
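For illustration only, here is a minimal in-memory sketch of the abstract loop
above. All names and the row-count bookkeeping are assumptions made for this
sketch, not the actual replication code:

    // Illustrative only; not the actual replication implementation.
    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    enum class ReplicaState { LOAD, LIVE, UNLOAD, UNLOAD_FINAL };

    struct PartitionReplica {
      ReplicaState state = ReplicaState::LIVE;
      uint64_t local_row_count = 0;                     // rows stored locally
      std::vector<std::string> target_servers;          // incl. joining servers and split targets
      std::map<std::string, uint64_t> acked_row_count;  // rows acknowledged per target server
      bool dropped = false;
    };

    // Stand-in for pushing rows over the network and receiving an acknowledgement.
    void push_unreplicated_rows(PartitionReplica& p, const std::string& server) {
      p.acked_row_count[server] = p.local_row_count;
    }

    // Re-run whenever the server list for the partition changes and whenever
    // new rows are inserted into it.
    void replicate(PartitionReplica& p) {
      bool all_acked = true;
      for (const auto& server : p.target_servers) {
        if (p.acked_row_count[server] < p.local_row_count) {
          push_unreplicated_rows(p, server);
        }
        all_acked = all_acked && p.acked_row_count[server] >= p.local_row_count;
      }

      // Once every target server has acknowledged all data, a replica in the
      // UNLOAD_FINAL state may drop its local copy of the partition.
      if (all_acked && p.state == ReplicaState::UNLOAD_FINAL) {
        p.dropped = true;
      }
    }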
4. Partition Assignment
Every table starts out with a single partition covering the whole keyspace (from
negative to positive infinity) that is then further subdivided (split) to
create more partitions.
When a new table is created, we add this single partition (covering the whole
key range) and a list of servers for the partition to the metadata file.
4.1 Partition Split
Only the leading server for the partition may initiate a split. To split a
parent partition A into B1 and B2, the splitting server decides on the split
point and the initial server assignment for the new partitions B1 and B2 and then
commits a METADATA transaction recording an ongoing split into the entry for
partition A.
Once the split is committed, the splitting server immediately starts to perform
the split operation.
The split procedure is simple: the current partition is read in and sent to
the leading servers for the new partitions. Once all new leading servers have
confirmed that they have committed the data to durable storage, the splitting
server commits a new METADATA transaction to finalize the split. In this
transaction, partition A is removed from the partition list and the new
partitions B1 and B2 are added.
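For illustration, the two METADATA updates involved in a split could be sketched
as follows, using the entry layout from section 2.1. The transaction plumbing is
omitted and all names are assumptions of this sketch:

    // Illustrative only: the two METADATA updates involved in a split.
    #include <string>
    #include <utility>
    #include <vector>

    struct MetadataEntry {
      std::string begin;
      std::string partition_id;
      std::vector<std::string> servers;
      std::string split_point;
      std::string split_partition_id_low, split_partition_id_high;
      std::vector<std::string> split_servers_low, split_servers_high;
    };

    // Transaction 1: the leading server records the ongoing split in the
    // entry of the parent partition A.
    void record_split(MetadataEntry& a, std::string split_point,
                      std::string id_low, std::string id_high,
                      std::vector<std::string> servers_low,
                      std::vector<std::string> servers_high) {
      a.split_point = std::move(split_point);
      a.split_partition_id_low = std::move(id_low);
      a.split_partition_id_high = std::move(id_high);
      a.split_servers_low = std::move(servers_low);
      a.split_servers_high = std::move(servers_high);
    }

    // Transaction 2: after the new leaders have confirmed durable storage of
    // the data, partition A is replaced by the new partitions B1 and B2.
    std::vector<MetadataEntry> finalize_split(const MetadataEntry& a) {
      MetadataEntry b1, b2;
      b1.begin = a.begin;              // B1 covers [A.begin, split_point)
      b1.partition_id = a.split_partition_id_low;
      b1.servers = a.split_servers_low;
      b2.begin = a.split_point;        // B2 covers [split_point, begin of next partition)
      b2.partition_id = a.split_partition_id_high;
      b2.servers = a.split_servers_high;
      return {b1, b2};
    }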
4.2 Server Join
To add a new server for a given partition, the server places itself in the joining
list for that partition. While the server is joining, it will not receive any read
or write requests for the partition but it will receive replicated rows
from the other servers. Once the leading server has pushed all data to the
joining server, it will commit a METADATA transaction to move the joining server
from LOAD to LIVE.
Additionally there is a force-join option where a new server directly enters
the live server list. This is never used except in the case where all servers
for a given partition are down.
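A sketch of the join bookkeeping on the METADATA entry, again with illustrative
names only: the first transaction adds the server to the joining list, the second
one (committed by the leader after all data has been pushed) promotes it to the
live server list.

    // Illustrative only: the join bookkeeping on a METADATA entry.
    #include <algorithm>
    #include <string>
    #include <vector>

    struct MetadataEntry {
      std::vector<std::string> servers;          // LIVE servers
      std::vector<std::string> servers_joining;  // servers still loading the partition
    };

    // Transaction 1: the new server adds itself to the joining list.
    void request_join(MetadataEntry& e, const std::string& server) {
      e.servers_joining.push_back(server);
    }

    // Transaction 2: once the leading server has pushed all data, the server
    // is moved from the joining list to the live server list (LOAD -> LIVE).
    void promote_joining_server(MetadataEntry& e, const std::string& server) {
      auto it = std::find(e.servers_joining.begin(), e.servers_joining.end(), server);
      if (it != e.servers_joining.end()) {
        e.servers_joining.erase(it);
        e.servers.push_back(server);
      }
    }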
4.3 Server Leave
To remove a server from a given partition, a METADATA transaction is committed
to delete the server from either the server or the joining list of that partition.
Once the server is removed from the list, the normal replication and lifecycle
algorithm will eventually drop the local partition data once it is confirmed to
be replicated to all other live servers.
Of course, the server leave algorithm should ensure that no partition falls
below the number of replicas specified by the replication level. In the normal
case, the master handles all rebalances and initiates the leave operation only
when it is safe to remove a node.
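A minimal sketch of such a safety check, assuming the master knows the live
server list and the configured replication factor; the function and its
arguments are illustrative:

    // Illustrative only: a check the master might perform before initiating a
    // leave, ensuring the partition keeps at least `replication_factor` live
    // replicas after the server is removed.
    #include <cstddef>
    #include <string>
    #include <vector>

    bool leave_is_safe(const std::vector<std::string>& live_servers,
                       const std::string& leaving_server,
                       std::size_t replication_factor) {
      std::size_t remaining = 0;
      for (const auto& server : live_servers) {
        if (server != leaving_server) {
          remaining++;
        }
      }
      return remaining >= replication_factor;
    }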
5. Metadata File
The metadata file is not stored in the coordination service as it would be too
large for e.g. ZooKeeper.
Instead, the METADATA file of each table is stored as a simple file on N nodes
which are recorded in the coordination service. Since we need to perform
compare-and-swap updates on the metadata file, we use the coordination service
(e.g. ZooKeeper) to store a bit of metadata, namely a transaction and sequence
ID that allows us to safely update the metadata file from any node at any time.
This is the exact information that we store in the coordination service for each
table:
- metadata_txnid: the currently valid METADATA transaction id for this table
- metadata_sequence: the currently valid METADATA sequence number
- metadata_servers: the list of live metadata servers for this table
The metadata transaction id is a random SHA1 hash. Additionally, we store an
incrementing sequence number with each transaction id to allow us to order
transactions.
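A sketch of this per-table record; the field names follow the list above, the
encoding is illustrative:

    // Illustrative encoding of the per-table record in the coordination service.
    #include <cstdint>
    #include <string>
    #include <vector>

    struct TableCoordinationRecord {
      std::string metadata_txnid;                 // random SHA1 hash of the current transaction
      uint64_t metadata_sequence;                 // incrementing number used to order transactions
      std::vector<std::string> metadata_servers;  // live metadata servers for the table
    };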
The METADATA file supports these general operations:
5.1 Create Metadata File
When a table is created, we pick three random servers from the cluster and create
a new metadata file with a random metadata_txnid and a sequence number of 1.
We then store the metadata file on each of the servers and store the
metadata_txnid, metadata_sequence and metadata_servers information into the
coordination service.
5.2 Change Metadata File
To change the metadata file, a change requester creates a change request. The
change request contains the modification (e.g. SPLIT_PARTITION), the transaction
id the change is based on and a new (random) transaction id.
To commit the change, the requester sends the change request to any coordinating
metadata server. The coordinating metadata server generates a new METADATA file
for the new transaction id and applies the operation. The coordinating metadata
server then asks each other metadata server to store the new transaction.
Once a majority of the metadata servers (including the coordinating metadata
server) have confirmed that they have committed the new transaction to durable
storage, the coordinating metadata server performs a compare-and-swap update in
the coordination service to record the new transaction id as the current
transaction id. If the update succeeds, the change is committed and the coordinating
metadata server returns a success response to the requester.
The compare-and-swap update will fail if another transaction was (atomically)
committed in the meantime or the list of metadata servers changed. If the update
fails, the coordinating metadata server asks the other metadata servers to clean
up the aborted transaction file and returns an error to the requester.
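The commit path could be sketched roughly as follows. The two client interfaces
are stubbed assumptions made for this example; only the control flow mirrors the
protocol described above:

    // Illustrative only; the client interfaces below are stubs, not real APIs.
    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct ChangeRequest {
      std::string base_txnid;  // transaction id the change is based on
      std::string new_txnid;   // new, randomly chosen transaction id
      std::string operation;   // e.g. "SPLIT_PARTITION" plus its arguments
    };

    // Assumed client for a single metadata server (RPCs stubbed out).
    struct MetadataServerClient {
      bool store_transaction(const std::string&, const std::string&) { return true; }
      void discard_transaction(const std::string&) {}
    };

    // Assumed client for the per-table record in the coordination service; the
    // compare-and-swap only succeeds if the stored txnid/sequence still match.
    struct CoordinationService {
      std::string txnid;
      uint64_t sequence = 0;
      bool compare_and_swap(const std::string& expected_txnid, uint64_t expected_seq,
                            const std::string& new_txnid, uint64_t new_seq) {
        if (txnid != expected_txnid || sequence != expected_seq) return false;
        txnid = new_txnid;
        sequence = new_seq;
        return true;
      }
    };

    bool commit_change(const ChangeRequest& req, const std::string& new_metadata_file,
                       uint64_t current_sequence,
                       std::vector<MetadataServerClient>& metadata_servers,
                       CoordinationService& coord) {
      // 1. ask every metadata server (including the coordinating one) to store
      //    the new METADATA file under the new transaction id
      std::size_t confirmed = 0;
      for (auto& server : metadata_servers) {
        if (server.store_transaction(req.new_txnid, new_metadata_file)) {
          confirmed++;
        }
      }
      if (confirmed <= metadata_servers.size() / 2) {
        return false;  // no majority confirmed durable storage, abort
      }

      // 2. flip the currently valid transaction id with a compare-and-swap
      if (coord.compare_and_swap(req.base_txnid, current_sequence,
                                 req.new_txnid, current_sequence + 1)) {
        return true;  // change committed
      }

      // another transaction won the race: clean up the aborted transaction
      for (auto& server : metadata_servers) {
        server.discard_transaction(req.new_txnid);
      }
      return false;
    }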
5.3 Join Metadata Server
A new metadata server can join itself into the list of metadata servers for a given
table. Once the join has been requested, e.g. by the master, the metadata server reads
the latest metadata transaction id from the coordination service and tries to
download the METADATA file for that transaction from any of the currently live
metadata servers. Once it has successfully fetched the latest transaction it
performs a compare-and-swap update in the coordination service to add itself
to the list of live metadata servers. If another transaction was committed while
the new metadata server was trying to join (and hence, the locally stored
transaction isn't the most recent one), the update will fail and the procedure
is restarted from the beginning.
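A sketch of the resulting retry loop, with the coordination-service client again
stubbed out as an assumption of this example:

    // Illustrative only; all interfaces below are stubs, not real APIs.
    #include <cstdint>
    #include <string>
    #include <vector>

    struct CoordinationRecord {
      std::string metadata_txnid;
      uint64_t metadata_sequence = 0;
      std::vector<std::string> metadata_servers;
    };

    // Assumed coordination-service client holding the record for one table.
    struct CoordinationService {
      CoordinationRecord record;
      CoordinationRecord read() { return record; }
      bool compare_and_swap(const CoordinationRecord& expected,
                            const CoordinationRecord& desired) {
        if (record.metadata_txnid != expected.metadata_txnid ||
            record.metadata_sequence != expected.metadata_sequence) {
          return false;
        }
        record = desired;
        return true;
      }
    };

    // Stand-in for downloading and durably storing the METADATA file for the
    // given transaction from another metadata server.
    bool fetch_metadata_file(const std::string& /*server*/, const std::string& /*txnid*/) {
      return true;
    }

    bool join_metadata_server(CoordinationService& coord, const std::string& self) {
      for (;;) {
        // 1. read the currently valid transaction and metadata server list
        CoordinationRecord current = coord.read();

        // 2. fetch the METADATA file for that transaction from any live server
        bool fetched = false;
        for (const auto& server : current.metadata_servers) {
          if (fetch_metadata_file(server, current.metadata_txnid)) {
            fetched = true;
            break;
          }
        }
        if (!fetched) {
          return false;
        }

        // 3. add ourselves to the server list with a compare-and-swap; if the
        //    record changed in the meantime, the swap fails and we retry
        CoordinationRecord desired = current;
        desired.metadata_servers.push_back(self);
        if (coord.compare_and_swap(current, desired)) {
          return true;
        }
      }
    }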
5.4 Remove Metadata Server
To leave the list of metadata servers, an update is written to the coordination
service to remove the host from the list of metadata servers.
5.5 Background Metadata Replication
If a metadata server was unavailable for some time and does not have the
latest transaction stored, it will not serve any metadata requests until it has
successfully retrieved the transaction from one of the other metadata servers.
Additionally, each server watches the information from the coordination service
and ensures that it has the most recent metadata file for each table for
which it is assigned as a metadata server (by downloading the files from one
of the other servers).
6. Implementation Details
6.1 Change Notification
Once a new metadata transaction has been committed (by writing the new metadata
transaction to the coordination service), all other servers in
the cluster will reliably get notified of the change by the coordinator. When a
server sees a metadata change to a table, it will send a "partition discovery"
request for each partition that it has locally stored for the changed table to
one of the responsible metadata servers. This partition discovery request
contains the metadata transaction id and the name/id of the partition. The
response to the partition discovery request is called the partition discovery
response. The partition discovery response contains the should-be status of the
partition on the requesting host ("LOAD", "SERVE" or "UNLOAD") and the list
of other servers (and partition ids, in case of a split) the data should be
replicated to.
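A sketch of the two messages with illustrative field names; the actual message
layout is defined by the implementation:

    // Illustrative field names only, not the actual message definitions.
    #include <string>
    #include <vector>

    struct PartitionDiscoveryRequest {
      std::string metadata_txnid;  // metadata transaction id known to the requester
      std::string partition_id;    // partition stored locally on the requester
    };

    enum class ShouldBeState { LOAD, SERVE, UNLOAD };

    struct ReplicationTarget {
      std::string server;
      std::string partition_id;    // may differ from the requested id after a split
    };

    struct PartitionDiscoveryResponse {
      ShouldBeState state;                                  // should-be status on the requesting host
      std::vector<ReplicationTarget> replication_targets;   // where the data should be replicated to
    };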
6.2 Placement IDs
One important implementation detail is that every time we add a server to or
remove a server from a partition, we assign a unique "placement" ID to prevent
ABA scenarios.
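A minimal sketch of the idea, assuming a per-assignment id; the comparison below
is an illustration, not the actual EventQL data structure:

    // Illustrative only: an assignment is identified by (server, placement id),
    // not by the server name alone.
    #include <cstdint>
    #include <string>

    struct ServerAssignment {
      std::string server;
      uint64_t placement_id;  // unique id assigned when the server is added to the partition
    };

    // Two assignments only count as "the same" if both the server name and the
    // placement id match; a server that left and re-joined compares unequal to
    // its former assignment.
    bool same_assignment(const ServerAssignment& a, const ServerAssignment& b) {
      return a.server == b.server && a.placement_id == b.placement_id;
    }

Without the placement id, a server that left a partition and later re-joined could
be mistaken for its former assignment and be assumed to already hold data it no
longer has.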
7. Alternatives Considered
7.1 Consistent Hashing
8. Code Locations