# About
This document is an updated version of the original design documents
by Spencer Kimball from early 2014.

# Overview

CockroachDB is a distributed SQL database. The primary design goals
are **scalability**, **strong consistency** and **survivability**
(hence the name). Cockroach aims to tolerate disk, machine, rack, and
even **datacenter failures** with minimal latency disruption and **no
manual intervention**. Cockroach nodes are symmetric; a design goal is
**homogeneous deployment** (one binary) with minimal configuration and
no required external dependencies.

The entry point for database clients is the SQL interface. Every node
in a CockroachDB cluster can act as a client SQL gateway. A SQL
gateway transforms and executes client SQL statements to key-value
(KV) operations, which the gateway distributes across the cluster as
necessary and returns results to the client. CockroachDB implements a
**single, monolithic sorted map** from key to value where both keys
and values are byte strings.

The KV map is logically composed of smaller segments of the keyspace
called ranges. Each range is backed by data stored in a local KV
storage engine (we use [RocksDB](http://rocksdb.org/), a variant of
LevelDB). Range data is replicated to a configurable number of
additional CockroachDB nodes. Ranges are merged and split to maintain
a target size, by default `64M`. The relatively small size facilitates
quick repair and rebalancing to address node failures, new capacity
and even read/write load.

CockroachDB achieves horizontal scalability:
- adding more nodes increases the capacity of the cluster by the
  amount of storage on each node (divided by a configurable
  replication factor), theoretically up to 4 exabytes (4E) of logical
  data;
- client queries can be sent to any node in the cluster, and queries
  can operate fully independently from each other, meaning that
  overall throughput scales linearly with the number of nodes in the
  cluster.
- queries are distributed (ref: distributed SQL) so that the overall
  throughput of single queries can also be increased by adding more
  nodes.

CockroachDB achieves strong consistency:
- uses a distributed consensus protocol for synchronous replication of
  data in each key value range. We've chosen to use the [Raft
  consensus algorithm](https://raftconsensus.github.io); all consensus
  state is stored in RocksDB.
- single or batched mutations to a single range are mediated via the
  range's Raft instance. Raft guarantees ACID semantics.
- logical mutations which affect multiple ranges employ distributed
  transactions for ACID semantics. CockroachDB uses an efficient
  **non-locking distributed commit** protocol.

CockroachDB achieves survivability:
- range replicas can be co-located within a single datacenter for low
  latency replication and survive disk or machine failures. They can
  be located across racks to survive some network switch failures.
- range replicas can be located in datacenters spanning increasingly
  disparate geographies to survive ever-greater failure scenarios, from
  datacenter power or networking loss to regional power failures
  (e.g. `{ US-East-1a, US-East-1b, US-East-1c }`, `{ US-East, US-West,
  Japan }`, `{ Ireland, US-East, US-West }`, `{ Ireland, US-East,
  US-West, Japan, Australia }`).

Cockroach provides [snapshot isolation](http://en.wikipedia.org/wiki/Snapshot_isolation) (SI) and
serializable snapshot isolation (SSI) semantics, allowing **externally
consistent, lock-free reads and writes**--both from a historical
snapshot timestamp and from the current wall clock time. SI provides
lock-free reads and writes but still allows write skew. SSI eliminates
write skew, but introduces a performance hit in the case of a
contentious system. SSI is the default isolation; clients must
consciously decide to trade correctness for performance. Cockroach
implements [a limited form of linearizability](#linearizability),
providing ordering for any observer or chain of observers.

Similar to
[Spanner](http://static.googleusercontent.com/media/research.google.com/en/us/archive/spanner-osdi2012.pdf)
directories, Cockroach allows configuration of arbitrary zones of data.
This allows replication factor, storage device type, and/or datacenter
location to be chosen to optimize performance and/or availability.
Unlike Spanner, zones are monolithic and don't allow movement of fine
grained data on the level of entity groups.

# Architecture

Cockroach implements a layered architecture. The highest level of
abstraction is the SQL layer (currently unspecified in this document).
It depends directly on the [*structured data
API*](#structured-data-api), which provides familiar relational concepts
such as schemas, tables, columns, and indexes. The structured data API
in turn depends on the [distributed key value store](#key-value-api),
which handles the details of range addressing to provide the abstraction
of a single, monolithic key value store. The distributed KV store
communicates with any number of physical cockroach nodes. Each node
contains one or more stores, one per physical device.



Each store contains potentially many ranges, the lowest-level unit of
key-value data. Ranges are replicated using the Raft consensus protocol.
The diagram below is a blown-up version of stores from four of the five
nodes in the previous diagram. Each range is replicated three ways using
Raft. The color coding shows associated range replicas.



Each physical node exports a RoachNode service. Each RoachNode exports
one or more key ranges. RoachNodes are symmetric. Each has the same
binary and assumes identical roles.

Nodes and the ranges they provide access to can be arranged with various
physical network topologies to make trade-offs between reliability and
performance. For example, a triplicated (3-way replica) range could have
each replica located on different:

- disks within a server to tolerate disk failures.
- servers within a rack to tolerate server failures.
- servers on different racks within a datacenter to tolerate rack power/network failures.
- servers in different datacenters to tolerate large scale network or power outages.

Up to `F` failures can be tolerated, where the total number of replicas `N = 2F + 1` (e.g. with 3x replication, one failure can be tolerated; with 5x replication, two failures, and so on).

# Cockroach Client

In order to support diverse client usage, Cockroach clients connect to
any node via HTTPS using protocol buffers or JSON. The connected node
proxies the client's work, including key lookups and write buffering.

# Keys

Cockroach keys are arbitrary byte arrays. If textual data is used in
keys, UTF-8 encoding is recommended (this helps for cleaner display of
values in debugging tools). User-supplied keys are encoded using an
ordered code. System keys are either prefixed with null characters (`\0`
or `\0\0`) for system tables, or take the form of
`<user-key><system-suffix>` to sort user-key-range specific system
keys immediately after the user keys they refer to. Null characters are
used in system key prefixes to guarantee that they sort first.

# Versioned Values

Cockroach maintains historical versions of values by storing them with
associated commit timestamps. Reads and scans can specify a snapshot
time to return the most recent writes prior to the snapshot timestamp.
Older versions of values are garbage collected by the system during
compaction according to a user-specified expiration interval. In order
to support long-running scans (e.g. for MapReduce), all versions have a
minimum expiration.

Versioned values are supported via modifications to RocksDB to record
commit timestamps and GC expirations per key.
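
To make the read semantics concrete, here is a minimal sketch (not the actual
storage code) of how a snapshot read over versioned values can work: among all
versions of a key with a commit timestamp at or below the read timestamp, the
most recent one wins. The `Version` type and `readVersioned` helper are
assumptions made for this illustration.

```go
package main

import "fmt"

// Version is a hypothetical versioned value: the value bytes plus the
// commit timestamp (simplified here to an integer HLC-like timestamp).
type Version struct {
	WallTime int64 // commit timestamp
	Value    []byte
}

// readVersioned returns the most recent version of a key whose commit
// timestamp is <= the snapshot timestamp, mirroring the snapshot-read
// rule described above. Versions are assumed sorted oldest to newest.
func readVersioned(versions []Version, snapshot int64) ([]byte, bool) {
	for i := len(versions) - 1; i >= 0; i-- {
		if versions[i].WallTime <= snapshot {
			return versions[i].Value, true
		}
	}
	return nil, false // no version visible at this snapshot
}

func main() {
	history := []Version{
		{WallTime: 10, Value: []byte("v1")},
		{WallTime: 20, Value: []byte("v2")},
	}
	if v, ok := readVersioned(history, 15); ok {
		fmt.Printf("read at t=15 sees %s\n", v) // read at t=15 sees v1
	}
}
```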

# Lock-Free Distributed Transactions

Cockroach provides distributed transactions without locks. Cockroach
transactions support two isolation levels:

- snapshot isolation (SI) and
- *serializable* snapshot isolation (SSI).

*SI* is simple to implement, highly performant, and correct for all but a
handful of anomalous conditions (e.g. write skew). *SSI* requires just a touch
more complexity, is still highly performant (less so with contention), and has
no anomalous conditions. Cockroach's SSI implementation is based on ideas from
the literature and some possibly novel insights.

SSI is the default level, with SI provided for application developers
who are certain enough of their need for performance and the absence of
write skew conditions to consciously elect to use it. In a lightly
contended system, our implementation of SSI is just as performant as SI,
requiring no locking or additional writes. With contention, our
implementation of SSI still requires no locking, but will end up
aborting more transactions. Cockroach's SI and SSI implementations
prevent starvation scenarios even for arbitrarily long transactions.

See the [Cahill paper](https://drive.google.com/file/d/0B9GCVTp_FHJIcEVyZVdDWEpYYXVVbFVDWElrYUV0NHFhU2Fv/edit?usp=sharing)
for one possible implementation of SSI. This is another [great paper](http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf).
For a discussion of SSI implemented by preventing read-write conflicts
(in contrast to detecting them, called write-snapshot isolation), see
the [Yabandeh paper](https://drive.google.com/file/d/0B9GCVTp_FHJIMjJ2U2t6aGpHLTFUVHFnMTRUbnBwc2pLa1RN/edit?usp=sharing),
which is the source of much inspiration for Cockroach's SSI.

Each Cockroach transaction is assigned a random priority and a
"candidate timestamp" at start. The candidate timestamp is the
provisional timestamp at which the transaction will commit, and is
chosen as the current clock time of the node coordinating the
transaction. This means that a transaction without conflicts will
usually commit with a timestamp that, in absolute time, precedes the
actual work done by that transaction.

In the course of coordinating a transaction between one or more
distributed nodes, the candidate timestamp may be increased, but will
never be decreased. The core difference between transactions running with
SI and SSI is that the former allows the transaction's candidate
timestamp to increase and the latter does not.

**Hybrid Logical Clock**

Each cockroach node maintains a hybrid logical clock (HLC), as discussed
in the [Hybrid Logical Clock paper](http://www.cse.buffalo.edu/tech-reports/2014-04.pdf).
HLC time uses timestamps which are composed of a physical component (thought of
as, and always close to, local wall time) and a logical component (used to
distinguish between events with the same physical component). It allows us to
track causality for related events similar to vector clocks, but with less
overhead. In practice, it works much like other logical clocks: when events
are received by a node, it informs the local HLC about the timestamp supplied
with the event by the sender, and when events are sent a timestamp generated by
the local HLC is attached.

For a more in depth description of HLC please read the paper. Our
implementation is [here](https://github.com/cockroachdb/cockroach/blob/master/util/hlc/hlc.go).
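
The update rules described above can be sketched as follows. This is a
simplified illustration, not the `util/hlc` package itself; the `Timestamp`
struct and the `physicalNow` helper are assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Timestamp is a simplified HLC timestamp: a physical wall-time component
// plus a logical counter to order events within the same wall time.
type Timestamp struct {
	WallTime int64 // nanoseconds, tracks (and stays close to) wall time
	Logical  int32
}

type HLC struct {
	latest Timestamp
}

func physicalNow() int64 { return time.Now().UnixNano() }

// Now returns a timestamp for an outgoing event (e.g. attached to an RPC).
func (c *HLC) Now() Timestamp {
	if pt := physicalNow(); pt > c.latest.WallTime {
		c.latest = Timestamp{WallTime: pt}
	} else {
		c.latest.Logical++
	}
	return c.latest
}

// Update folds a timestamp received from another node into the local clock,
// guaranteeing that subsequent local timestamps sort after the received one.
func (c *HLC) Update(remote Timestamp) {
	pt := physicalNow()
	switch {
	case pt > c.latest.WallTime && pt > remote.WallTime:
		c.latest = Timestamp{WallTime: pt}
	case remote.WallTime > c.latest.WallTime:
		c.latest = Timestamp{WallTime: remote.WallTime, Logical: remote.Logical + 1}
	case c.latest.WallTime > remote.WallTime:
		c.latest.Logical++
	default: // equal wall times
		if remote.Logical > c.latest.Logical {
			c.latest.Logical = remote.Logical
		}
		c.latest.Logical++
	}
}

func main() {
	var clock HLC
	clock.Update(Timestamp{WallTime: physicalNow() + 1e9, Logical: 4}) // sender's clock is ahead
	fmt.Println(clock.Now())                                           // sorts after the received timestamp
}
```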

Cockroach picks a timestamp for a transaction using HLC time. Throughout this
document, *timestamp* always refers to the HLC time, which is a singleton
on each node. The HLC is updated by every read/write event on the node, and
the HLC time >= wall time. A read/write timestamp received in a cockroach request
from another node is not only used to version the operation, but also updates
the HLC on the node. This is useful in guaranteeing that all data read/written
on a node is at a timestamp < next HLC time.

Transactions are executed in two phases:

1. Start the transaction by writing a new entry to the system
   transaction table (keys prefixed by *\0tx*) with state "PENDING". In
   parallel, write an "intent" value for each datum being written as part
   of the transaction. These are normal MVCC values, with the addition of
   a special flag (i.e. "intent") indicating that the value may be
   committed after the transaction itself commits. In addition,
   the transaction id (unique and chosen at tx start time by client)
   is stored with intent values. The tx id is used to refer to the
   transaction table when there are conflicts and to make
   tie-breaking decisions on ordering between identical timestamps.
   Each node returns the timestamp used for the write (which is the
   original candidate timestamp in the absence of read/write conflicts);
   the client selects the maximum from amongst all write timestamps as the
   final commit timestamp.

2. Commit the transaction by writing a new entry to the system
   transaction table (keys prefixed by *\0tx*). The value of the
   commit entry contains the candidate timestamp (increased as
   necessary to accommodate any latest read timestamps). Note that
   the transaction is considered fully committed at this point and
   control may be returned to the client.

   In the case of an SI transaction, a commit timestamp which was
   increased to accommodate concurrent readers is perfectly
   acceptable and the commit may continue. For SSI transactions,
   however, a gap between candidate and commit timestamps
   necessitates transaction restart (note: restart is different than
   abort--see below).

   After the transaction is committed, all written intents are upgraded
   in parallel by removing the "intent" flag. The transaction is
   considered fully committed before this step and does not wait for
   it to return control to the transaction coordinator.

In the absence of conflicts, this is the end. Nothing else is necessary
to ensure the correctness of the system.

**Conflict Resolution**

Things get more interesting when a reader or writer encounters an intent
record or newly-committed value in a location that it needs to read or
write. This is a conflict, usually causing either of the transactions to
abort or restart depending on the type of conflict.

***Transaction restart:***

This is the usual (and more efficient) type of behaviour and is used
except when the transaction was aborted (for instance by another
transaction).
In effect, that reduces to two cases; the first being the one outlined
above: an SSI transaction that finds upon attempting to commit that
its commit timestamp has been pushed. The second case involves a transaction
actively encountering a conflict, that is, one of its readers or writers
encounters data that necessitates conflict resolution
(see transaction interactions below).

When it restarts, the transaction changes its priority and/or moves its
timestamp forward depending on data tied to the conflict, and
begins anew reusing the same tx id. The prior run of the transaction might
have written some write intents, which need to be deleted before the
transaction commits, so as to not be included as part of the transaction.
These stale write intent deletions are done during the reexecution of the
transaction, either implicitly, through writing new intents to
the same keys as part of the reexecution of the transaction, or explicitly,
by cleaning up stale intents that are not part of the reexecution of the
transaction. Since most transactions will end up writing to the same keys,
the explicit cleanup run just before committing the transaction is usually
a NOOP.

***Transaction abort:***

This is the case in which a transaction, upon reading its transaction
table entry, finds that it has been aborted. In this case, the
transaction can not reuse its intents; it returns control to the client
before cleaning them up (other readers and writers would clean up
dangling intents as they encounter them) but will make an effort to
clean up after itself. The next attempt (if applicable) then runs as a
new transaction with a new tx id.

There are several scenarios in which transactions interact:

- **Reader encounters write intent or value with newer timestamp far
  enough in the future**: This is not a conflict. The reader is free
  to proceed; after all, it will be reading an older version of the
  value and so does not conflict. Recall that the write intent may
  be committed with a later timestamp than its candidate; it will
  never commit with an earlier one. **Side note**: if an SI transaction
  reader finds an intent with a newer timestamp which the reader's own
  transaction has written, it simply reads its own value from that intent.

- **Reader encounters write intent or value with newer timestamp in the
  near future:** In this case, we have to be careful. The newer
  intent may, in absolute terms, have happened in our read's past if
  the clock of the writer is ahead of the node serving the values.
  In that case, we would need to take this value into account, but
  we just don't know. Hence the transaction restarts, using instead
  a future timestamp (but remembering a maximum timestamp used to
  limit the uncertainty window to the maximum clock skew). In fact,
  this is optimized further; see the details under "choosing a
  timestamp" below.

- **Reader encounters write intent with older timestamp**: the reader
  must follow the intent's transaction id to the transaction table.
  If the transaction has already been committed, then the reader can
  just read the value. If the write transaction has not yet been
  committed, then the reader has two options. If the write conflict
  is from an SI transaction, the reader can *push that transaction's
  commit timestamp into the future* (and consequently not have to
  read it). This is simple to do: the reader just updates the
  transaction's commit timestamp to indicate that when/if the
  transaction does commit, it should use a timestamp *at least* as
  high. However, if the write conflict is from an SSI transaction,
  the reader must compare priorities. If the reader has the higher priority,
  it pushes the transaction's commit timestamp (that
  transaction will then notice its commit timestamp has been pushed when it
  attempts to commit, and restart). If the reader has the lower or equal
  priority, it retries itself, using as a new priority
  *max(new random priority, conflicting txn's priority - 1)*.

- **Writer encounters uncommitted write intent**: if the conflicting
  intent was written by a transaction with a lower
  priority, the writer aborts the conflicting transaction. If the write
  intent has a higher or equal priority the transaction retries, using as a new
  priority *max(new random priority, conflicting txn's priority - 1)*;
  the retry occurs after a short, randomized backoff interval (see the
  sketch of this push-or-abort decision after this list).

- **Writer encounters newer committed value**:
  The committed value could also be an unresolved write intent made by a
  transaction that has already committed. The transaction restarts. On restart,
  the same priority is reused, but the candidate timestamp is moved forward
  to the encountered value's timestamp.

- **Writer encounters more recently read key**: the *read timestamp
  cache* is consulted on each write at a node. If the write's
  candidate timestamp is earlier than the low water mark on the cache itself
  (i.e. its last evicted timestamp) or if the key being written has a read
  timestamp later than the write's candidate timestamp, this later timestamp
  value is returned with the write. A new timestamp forces a transaction
  restart only if it is serializable.
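
A minimal sketch of the push-or-abort decision described in the interactions
above. Names such as `Txn`, `resolveWriteConflict` and `pushTimestamp` are
illustrative only; the real resolution logic lives in the storage layer and
handles many more cases.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Txn is an illustrative transaction record: a priority and a provisional
// (candidate) commit timestamp, as described in the text.
type Txn struct {
	ID        string
	Priority  int32
	Timestamp int64
}

// resolveWriteConflict sketches what a writer does upon finding another
// transaction's uncommitted intent: abort the other transaction if it has
// a lower priority, otherwise retry with a boosted priority after backoff.
func resolveWriteConflict(writer, owner *Txn) string {
	if owner.Priority < writer.Priority {
		return fmt.Sprintf("abort %s (lower priority)", owner.ID)
	}
	// Retry with max(new random priority, conflicting txn's priority - 1),
	// after a short randomized backoff (elided here).
	newPriority := rand.Int31()
	if p := owner.Priority - 1; p > newPriority {
		newPriority = p
	}
	writer.Priority = newPriority
	return fmt.Sprintf("retry %s with priority %d", writer.ID, writer.Priority)
}

// pushTimestamp sketches what a reader does to an SI writer's intent that
// has an older timestamp: bump the writer's provisional commit timestamp
// past the reader's, so the reader does not have to see its write.
func pushTimestamp(reader, owner *Txn) {
	if owner.Timestamp <= reader.Timestamp {
		owner.Timestamp = reader.Timestamp + 1
	}
}

func main() {
	w := &Txn{ID: "writer", Priority: 10, Timestamp: 100}
	o := &Txn{ID: "owner", Priority: 20, Timestamp: 90}
	fmt.Println(resolveWriteConflict(w, o))
	pushTimestamp(&Txn{ID: "reader", Timestamp: 120}, o)
	fmt.Println("owner pushed to", o.Timestamp)
}
```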

**Transaction management**

Transactions are managed by the client proxy (or gateway in SQL Azure
parlance). Unlike in Spanner, writes are not buffered but are sent
directly to all implicated ranges. This allows the transaction to abort
quickly if it encounters a write conflict. The client proxy keeps track
of all written keys in order to resolve write intents asynchronously upon
transaction completion. If a transaction commits successfully, all intents
are upgraded to committed. In the event a transaction is aborted, all written
intents are deleted. The client proxy doesn't guarantee it will resolve intents.

In the event the client proxy restarts before the pending transaction is
committed, the dangling transaction would continue to "live" in the
transaction table until aborted by another transaction. Transactions
heartbeat the transaction table every five seconds by default.
Transactions encountered by readers or writers with dangling intents
which haven't been heartbeated within the required interval are aborted.
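
A hedged sketch of that liveness check, assuming a hypothetical `TxnRecord`
carrying the time of its last heartbeat; the real check also considers the
transaction's status and the priorities involved.

```go
package main

import (
	"fmt"
	"time"
)

// TxnRecord is an illustrative transaction-table entry.
type TxnRecord struct {
	ID            string
	LastHeartbeat time.Time
}

const heartbeatInterval = 5 * time.Second // default heartbeat cadence

// isAbandoned reports whether a transaction with dangling intents may be
// aborted by a conflicting reader/writer because it has stopped heartbeating.
// A grace period of two intervals (an assumption) avoids aborting healthy
// transactions that are merely slow to heartbeat.
func isAbandoned(rec TxnRecord, now time.Time) bool {
	return now.Sub(rec.LastHeartbeat) > 2*heartbeatInterval
}

func main() {
	rec := TxnRecord{ID: "tx1", LastHeartbeat: time.Now().Add(-30 * time.Second)}
	fmt.Println(isAbandoned(rec, time.Now())) // true: safe to abort and clean up intents
}
```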

An exploration of retries with contention and abort times with abandoned
transactions is
[here](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing).

**Transaction Table**

Please see [roachpb/data.proto](https://github.com/cockroachdb/cockroach/blob/master/roachpb/data.proto) for the up-to-date structures, the best entry point being `message Transaction`.

**Pros**

- No requirement for reliable code execution to prevent stalled 2PC
  protocol.
- Readers never block with SI semantics; with SSI semantics, they may
  abort.
- Lower latency than traditional 2PC commit protocol (w/o contention)
  because second phase requires only a single write to the
  transaction table instead of a synchronous round to all
  transaction participants.
- Priorities avoid starvation for arbitrarily long transactions and
  always pick a winner from between contending transactions (no
  mutual aborts).
- Writes not buffered at client; writes fail fast.
- No read-locking overhead required for *serializable* SI (in contrast
  to other SSI implementations).
- Well-chosen (i.e. less random) priorities can flexibly give
  probabilistic guarantees on latency for arbitrary transactions
  (for example: make OLTP transactions 10x less likely to abort than
  low priority transactions, such as asynchronously scheduled jobs).

**Cons**

- Abandoned transactions may block contending writers for up to the
  heartbeat interval, though average wait is likely to be
  considerably shorter (see [graph in link](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing)).
  This is likely considerably more performant than detecting and
  restarting 2PC in order to release read and write locks.
- Behavior different than other SI implementations: no first writer
  wins, and shorter transactions do not always finish quickly.
  Element of surprise for OLTP systems may be a problematic factor.
- Aborts can decrease throughput in a contended system compared with
  two phase locking. Aborts and retries increase read and write
  traffic, increase latency and decrease throughput.

**Choosing a Timestamp**

A key challenge of reading data in a distributed system with clock skew
is choosing a timestamp guaranteed to be greater than the latest
timestamp of any committed transaction (in absolute time). No system can
claim consistency and fail to read already-committed data.

Accomplishing consistency for transactions (or just single operations)
accessing a single node is easy. The timestamp is assigned by the node
itself, so it is guaranteed to be at a greater timestamp than all the
existing timestamped data on the node.

For multiple nodes, the timestamp of the node coordinating the
transaction `t` is used. In addition, a maximum timestamp `t+ε` is
supplied to provide an upper bound on timestamps for already-committed
data (`ε` is the maximum clock skew). As the transaction progresses, any
data read which have timestamps greater than `t` but less than `t+ε`
cause the transaction to abort and retry with the conflicting timestamp
t<sub>c</sub> as the transaction timestamp; the maximum timestamp `t+ε` remains
the same. This implies that transaction restarts due to clock uncertainty
can only happen on a time interval of length `ε`.

We apply another optimization to reduce the restarts caused by
uncertainty. Upon restarting, the transaction not only takes
into account t<sub>c</sub>, but the timestamp of the node at the time
of the uncertain read t<sub>node</sub>. The larger of those two timestamps
t<sub>c</sub> and t<sub>node</sub> (likely equal to the latter) is used
to increase the read timestamp. Additionally, the conflicting node is
marked as "certain". Then, for future reads to that node within the
transaction, we set `MaxTimestamp = Read Timestamp`, preventing further
uncertainty restarts.
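
A minimal sketch of the uncertainty-restart rule under the assumptions above
(single `ε`, integer timestamps, hypothetical names such as
`readWithUncertainty`): a value in the interval `(t, MaxTimestamp]` forces a
restart at a higher timestamp, while nodes marked "certain" no longer trigger
restarts.

```go
package main

import "fmt"

// UncertaintyError is an illustrative signal that a read must restart at a
// higher timestamp because a value may have been committed before the read
// in absolute time, despite carrying a larger timestamp.
type UncertaintyError struct {
	RestartAt int64
}

func (e *UncertaintyError) Error() string {
	return fmt.Sprintf("uncertainty: restart at %d", e.RestartAt)
}

// readWithUncertainty checks a value's timestamp against the transaction's
// read timestamp and its per-node uncertainty limit (MaxTimestamp).
//   - valueTS <= readTS: the value is visible, no conflict.
//   - readTS < valueTS <= maxTS: uncertain; restart at max(valueTS, nodeNow),
//     mirroring the use of t_c and t_node described above.
//   - valueTS > maxTS: definitely in the future, ignore it.
func readWithUncertainty(valueTS, readTS, maxTS, nodeNow int64) error {
	if valueTS <= readTS || valueTS > maxTS {
		return nil
	}
	restart := valueTS
	if nodeNow > restart {
		restart = nodeNow
	}
	return &UncertaintyError{RestartAt: restart}
}

func main() {
	// readTS=100, MaxTimestamp=100+ε with ε=10; the node's clock reads 108.
	err := readWithUncertainty(105, 100, 110, 108)
	fmt.Println(err) // restart at 108; future reads of this node would use
	// MaxTimestamp = 108, so it can cause no further uncertainty restarts.
}
```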

Correctness follows from the fact that we know that at the time of the read,
there exists no version of any key on that node with a higher timestamp than
t<sub>node</sub>. If the restarted transaction later
encounters a key with a higher timestamp, it knows that in absolute time,
the value was written after t<sub>node</sub> was obtained, i.e. after the
uncertain read. Hence the transaction can move forward reading an older version
of the data (at the transaction's timestamp). This limits the time uncertainty
restarts attributed to a node to at most one. The tradeoff is that we might
pick a timestamp larger than the optimal one (> highest conflicting timestamp),
resulting in the possibility of a few more conflicts.

We expect retries will be rare, but this assumption may need to be
revisited if retries become problematic. Note that this problem does not
apply to historical reads. An alternate approach which does not require
retries makes a round to all node participants in advance and
chooses the highest reported node wall time as the timestamp. However,
knowing which nodes will be accessed in advance is difficult and
potentially limiting. Cockroach could also potentially use a global
clock (Google did this with [Percolator](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf)),
which would be feasible for smaller, geographically-proximate clusters.

# Linearizability

First a word about [***Spanner***](http://research.google.com/archive/spanner.html).
By combining judicious use of wait intervals with accurate time signals,
Spanner provides a global ordering between any two non-overlapping transactions
(in absolute time) with \~14ms latencies. Put another way:
Spanner guarantees that if a transaction T<sub>1</sub> commits (in absolute time)
before another transaction T<sub>2</sub> starts, then T<sub>1</sub>'s assigned commit
timestamp is smaller than T<sub>2</sub>'s. Using atomic clocks and GPS receivers,
Spanner reduces their clock skew uncertainty to \< 10ms (`ε`). To make
good on the promised guarantee, transactions must take at least double
the clock skew uncertainty interval to commit (`2ε`). See [*this
article*](http://www.cs.cornell.edu/~ie53/publications/DC-col51-Sep13.pdf)
for a helpful overview of Spanner's concurrency control.

Cockroach could make the same guarantees without specialized hardware,
at the expense of longer wait times. If servers in the cluster were
configured to work only with NTP, transaction wait times would likely
be in excess of 150ms. For wide-area zones, this would be somewhat
mitigated by overlap from cross datacenter link latencies. If clocks
were made more accurate, the minimal limit for commit latencies would
improve.

However, let's take a step back and evaluate whether Spanner's external
consistency guarantee is worth the automatic commit wait. First, if the
commit wait is omitted completely, the system still yields a consistent
view of the map at an arbitrary timestamp. However with clock skew, it
would become possible for commit timestamps on non-overlapping but
causally related transactions to be reversed in time. In other
words, the following scenario is possible for a client without global
ordering:

- Start transaction T<sub>1</sub> to modify value `x` with commit time s<sub>1</sub>

- On commit of T<sub>1</sub>, start T<sub>2</sub> to modify value `y` with commit time s<sub>2</sub>

- Read `x` and `y` and discover that s<sub>1</sub> \> s<sub>2</sub> (**!**)

The external consistency which Spanner guarantees is referred to as
**linearizability**. It goes beyond serializability by preserving
information about the causality inherent in how external processes
interacted with the database. The strength of Spanner's guarantee can be
formulated as follows: any two processes, with clock skew within
expected bounds, may independently record their wall times for the
completion of transaction T<sub>1</sub> (T<sub>1</sub><sup>end</sup>) and start of transaction
T<sub>2</sub> (T<sub>2</sub><sup>start</sup>) respectively, and if later
compared such that T<sub>1</sub><sup>end</sup> \< T<sub>2</sub><sup>start</sup>,
then commit timestamps s<sub>1</sub> \< s<sub>2</sub>.
This guarantee is broad enough to completely cover all cases of explicit
causality, in addition to covering any and all imaginable scenarios of implicit
causality.

Our contention is that causality is chiefly important from the
perspective of a single client or a chain of successive clients (*if a
tree falls in the forest and nobody hears…*). As such, Cockroach
provides two mechanisms to provide linearizability for the vast majority
of use cases without a mandatory transaction commit wait or an elaborate
system to minimize clock skew.

1. Clients provide the highest transaction commit timestamp with
   successive transactions. This allows node clocks from previous
   transactions to effectively participate in the formulation of the
   commit timestamp for the current transaction. This guarantees
   linearizability for transactions committed by this client.

   Newly launched clients wait at least 2 \* ε from process start
   time before beginning their first transaction. This preserves the
   same property even on client restart, and the wait will be
   mitigated by process initialization.

   All causally-related events within Cockroach maintain
   linearizability.

2. Committed transactions respond with a commit wait parameter which
   represents the remaining time in the nominal commit wait. This
   will typically be less than the full commit wait as the consensus
   write at the coordinator accounts for a portion of it.

   Clients taking any action outside of another Cockroach transaction
   (e.g. writing to another distributed system component) can either
   choose to wait the remaining interval before proceeding, or
   alternatively, pass the wait and/or commit timestamp to the
   execution of the outside action for its consideration. This pushes
   the burden of linearizability to clients, but is a useful tool in
   mitigating commit latencies if the clock skew is potentially
   large. This functionality can be used for ordering in the face of
   backchannel dependencies as mentioned in the
   [AugmentedTime](http://www.cse.buffalo.edu/~demirbas/publications/augmentedTime.pdf)
   paper.
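
A small sketch of how a client might use the second mechanism. The
`CommitResult` type and `externalize` helper are hypothetical; the point is
only that the remaining commit wait is slept off (or forwarded) before any
out-of-band action that depends on the committed transaction.

```go
package main

import (
	"fmt"
	"time"
)

// CommitResult is a hypothetical response from a committed transaction:
// its commit timestamp plus the portion of the nominal commit wait that
// the client still has to account for.
type CommitResult struct {
	CommitTimestamp     int64
	RemainingCommitWait time.Duration
}

// externalize performs some action outside of Cockroach (e.g. notifying
// another system) only after the remaining commit wait has elapsed, so the
// downstream observer cannot observe a causality inversion.
func externalize(res CommitResult, action func(commitTS int64)) {
	if res.RemainingCommitWait > 0 {
		time.Sleep(res.RemainingCommitWait)
	}
	action(res.CommitTimestamp)
}

func main() {
	res := CommitResult{CommitTimestamp: 42, RemainingCommitWait: 5 * time.Millisecond}
	externalize(res, func(ts int64) {
		fmt.Println("safe to act on commit at timestamp", ts)
	})
}
```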

Using these mechanisms in place of commit wait, Cockroach's guarantee can be
formulated as follows: any process which signals the start of transaction
T<sub>2</sub> (T<sub>2</sub><sup>start</sup>) after the completion of
transaction T<sub>1</sub> (T<sub>1</sub><sup>end</sup>), will have commit
timestamps such that s<sub>1</sub> \< s<sub>2</sub>.

# Logical Map Content

Logically, the map contains a series of reserved system key / value
pairs followed by the actual user key / value pairs
(e.g. the actual meat of the map).

- `\0\0meta1`: Range metadata for location of `\0\0meta2`.
- `\0\0meta1<key1>`: Range metadata for location of `\0\0meta2<key1>`.
- ...
- `\0\0meta1<keyN>`: Range metadata for location of `\0\0meta2<keyN>`.
- `\0\0meta2`: Range metadata for location of first non-range metadata key.
- `\0\0meta2<key1>`: Range metadata for location of `<key1>`.
- ...
- `\0\0meta2<keyN>`: Range metadata for location of `<keyN>`.
- `\0acct<key0>`: Accounting for key prefix key0.
- ...
- `\0acct<keyN>`: Accounting for key prefix keyN.
- `\0node<node-address0>`: Accounting data for node 0.
- ...
- `\0node<node-addressN>`: Accounting data for node N.
- `\0tx<tx-id0>`: Transaction record for transaction 0.
- ...
- `\0tx<tx-idN>`: Transaction record for transaction N.
- `\0zone<key0>`: Zone information for key prefix key0.
- ...
- `\0zone<keyN>`: Zone information for key prefix keyN.
- `<>acctd<metric0>`: Accounting data for Metric 0 for empty key prefix.
- ...
- `<>acctd<metricN>`: Accounting data for Metric N for empty key prefix.
- `<key0>`: `<value0>` The first user data key.
- ...
- `<keyN>`: `<valueN>` The last user data key.

There are some additional system entries sprinkled amongst the
non-system keys. See the Key-Prefix Accounting section in this document
for further details.

# Node Storage

Nodes maintain a separate instance of RocksDB for each disk. Each
RocksDB instance hosts any number of ranges. RPCs arriving at a
RoachNode are multiplexed based on the disk name to the appropriate
RocksDB instance. A single instance per disk is used to avoid
contention. If every range maintained its own RocksDB, global management
of available cache memory would be impossible and writers for each range
would compete for non-contiguous writes to multiple RocksDB logs.

In addition to the key/value pairs of the range itself, various range
metadata is maintained.

- participating replicas

- consensus metadata

- split/merge activity

A really good reference on tuning Linux installations with RocksDB is
[here](http://docs.basho.com/riak/latest/ops/advanced/backends/leveldb/).

# Range Metadata

The default approximate size of a range is 64M (2\^26 B). In order to
support 1P (2\^50 B) of logical data, metadata is needed for roughly
2\^(50 - 26) = 2\^24 ranges. A reasonable upper bound on range metadata
size is roughly 256 bytes (the triplicated replica addresses plus the range
key itself). 2\^24 ranges \* 2\^8
B would require roughly 4G (2\^32 B) to store--too much to duplicate
between machines. Our conclusion is that range metadata must be
distributed for large installations.

To keep key lookups relatively fast in the presence of distributed metadata,
we store all the top-level metadata in a single range (the first range). These
top-level metadata keys are known as *meta1* keys, and are prefixed such that
they sort to the beginning of the key space. Given the metadata size of 256
bytes given above, a single 64M range would support 64M/256B = 2\^18 ranges,
or 64M \* 2\^18 = 16T of addressable data with one level. To support the 1P quoted
above, we need two levels of indirection, where the first level addresses the
second, and the second addresses user data. With two levels of indirection, we
can address 2\^(18 + 18) = 2\^36 ranges; each range addresses 2\^26 B, and
altogether we address 2\^(36+26) B = 2\^62 B = 4E of user data.

For a given user-addressable `key1`, the associated *meta1* record is found
at the successor key to `key1` in the *meta1* space. Since the *meta1* space
is sparse, the successor key is defined as the next key which is present. The
*meta1* record identifies the range containing the *meta2* record, which is
found using the same process. The *meta2* record identifies the range
containing `key1`, which is again found the same way (see examples below).

Concretely, metadata keys are prefixed by `\0\0meta{1,2}`; the two null
characters provide for the desired sorting behaviour. Thus, `key1`'s
*meta1* record will reside at the successor key to `\0\0meta1<key1>`.

Note: we use the ending key of a range (rather than its starting key) in
its metadata record because
the RocksDB iterator only supports a Seek() interface which acts as a
Ceil(). Using the start key of the range would cause Seek() to find the
key *after* the meta indexing record we're looking for, which would
result in having to back the iterator up, an option which is both less
efficient and not available in all cases.

The following example shows the directory structure for a map with
three ranges worth of data. Ellipses indicate additional key/value pairs to
fill an entire range of data. Except for the fact that splitting ranges
requires updates to the range metadata with knowledge of the metadata layout,
the range metadata itself requires no special treatment or bootstrapping.

**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
`dcrama3:8000`)

- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2<lastkey0>`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2<lastkey1>`: `dcrama4:8000`, `dcrama5:8000`, `dcrama6:8000`
- `\0\0meta2\xff`: `dcrama7:8000`, `dcrama8:8000`, `dcrama9:8000`
- ...
- `<lastkey0>`: `<lastvalue0>`

**Range 1** (located on servers `dcrama4:8000`, `dcrama5:8000`,
`dcrama6:8000`)

- ...
- `<lastkey1>`: `<lastvalue1>`

**Range 2** (located on servers `dcrama7:8000`, `dcrama8:8000`,
`dcrama9:8000`)

- ...
- `<lastkey2>`: `<lastvalue2>`

Consider a simpler example of a map containing less than a single
range of data. In this case, all range metadata and all data are
located in the same range:

**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
`dcrama3:8000`)

- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `<key0>`: `<value0>`
- `...`

Finally, a map large enough to need both levels of indirection would
look like (note that instead of showing range replicas, this
example is simplified to just show range indexes):

**Range 0**

- `\0\0meta1<lastkeyN-1>`: Range 0
- `\0\0meta1\xff`: Range 1
- `\0\0meta2<lastkey1>`: Range 1
- `\0\0meta2<lastkey2>`: Range 2
- `\0\0meta2<lastkey3>`: Range 3
- ...
- `\0\0meta2<lastkeyN-1>`: Range 262143

**Range 1**

- `\0\0meta2<lastkeyN>`: Range 262144
- `\0\0meta2<lastkeyN+1>`: Range 262145
- ...
- `\0\0meta2\xff`: Range 500,000
- ...
- `<lastkey1>`: `<lastvalue1>`

**Range 2**

- ...
- `<lastkey2>`: `<lastvalue2>`

**Range 3**

- ...
- `<lastkey3>`: `<lastvalue3>`

**Range 262144**

- ...
- `<lastkeyN>`: `<lastvalueN>`

**Range 262145**

- ...
- `<lastkeyN+1>`: `<lastvalueN+1>`

Note that the choice of range `262144` is just an approximation. The
actual number of ranges addressable via a single metadata range is
dependent on the size of the keys. If efforts are made to keep key sizes
small, the total number of addressable ranges would increase and vice
versa.

From the examples above it's clear that key location lookups require at
most three reads to get the value for `<key>`:

1. lower bound of `\0\0meta1<key>`
2. lower bound of `\0\0meta2<key>`
3. `<key>`

For small maps, the entire lookup is satisfied in a single RPC to Range 0. Maps
containing less than 16T of data would require two lookups. Clients cache both
levels of range metadata, and we expect that data locality for individual
clients will be high. Clients may end up with stale cache entries. If on a
lookup, the range consulted does not match the client's expectations, the
client evicts the stale entries and possibly does a new lookup.
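
A simplified sketch of the two-level addressing scheme. It models the meta1
and meta2 spaces as ordered indexes and finds the successor (ceiling) entry
for a key, as described above; the real implementation works against range
descriptors and the RocksDB `Seek()` primitive, and the names here
(`metaIndex`, `lookup`) are illustrative only.

```go
package main

import (
	"fmt"
	"sort"
)

// metaIndex models one level of range metadata: ordered end-keys, each
// naming the range (here just a string) that covers keys up to that end-key.
type metaIndex struct {
	endKeys []string // sorted
	ranges  []string
}

// lookup returns the range whose metadata record is the successor of key,
// i.e. the first end-key >= key. This mirrors the "successor key" rule and
// is why the end key of a range is stored in its metadata record.
func (m metaIndex) lookup(key string) string {
	i := sort.SearchStrings(m.endKeys, key)
	if i == len(m.endKeys) {
		i = len(m.endKeys) - 1 // an \xff-style sentinel should make this unreachable
	}
	return m.ranges[i]
}

func main() {
	// Hypothetical two-level directory: meta1 addresses meta2, meta2 addresses data.
	meta1 := metaIndex{endKeys: []string{"\xff"}, ranges: []string{"meta2-range"}}
	meta2 := metaIndex{endKeys: []string{"g", "\xff"}, ranges: []string{"range-1", "range-2"}}

	key := "melon"
	fmt.Println("meta1 ->", meta1.lookup(key)) // locates the meta2 range
	fmt.Println("meta2 ->", meta2.lookup(key)) // locates the range holding "melon"
}
```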

# Raft - Consistency of Range Replicas

Each range is configured to consist of three or more replicas, as specified by
their ZoneConfig. The replicas in a range maintain their own instance of a
distributed consensus algorithm. We use the [*Raft consensus algorithm*](https://raftconsensus.github.io).
[ePaxos](https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf) has
promising performance characteristics for WAN-distributed replicas, but
it does not guarantee a consistent ordering between replicas.

Raft elects a relatively long-lived leader which must be involved to
propose commands. It heartbeats followers periodically and keeps their logs
replicated. In the absence of heartbeats, followers become candidates
after randomized election timeouts and proceed to hold new leader
elections. Cockroach weights random timeouts such that the replicas with
shorter round trip times to peers are more likely to hold elections
first (not implemented yet). Only the Raft leader may propose commands;
followers will simply relay commands to the last known leader.

Our Raft implementation was developed together with CoreOS, but adds an extra
layer of optimization to account for the fact that a single Node may have
millions of consensus groups (one for each Range). Areas of optimization
are chiefly coalesced heartbeats (so that the number of nodes dictates the
number of heartbeats as opposed to the much larger number of ranges) and
batch processing of requests.
Future optimizations may include two-phase elections and quiescent ranges
(i.e. stopping traffic completely for inactive ranges).
(i.e. stopping traffic completely for inactive ranges).
825
827
828
As outlined in the Raft section, the replicas of a Range are organized as a
829
Raft group and execute commands from their shared commit log. Going through
830
Raft is an expensive operation though, and there are tasks which should only be
831
carried out by a single replica at a time (as opposed to all of them).
832
834
This is a lease held for a slice of (database, i.e. hybrid logical) time and is
835
established by committing a special log entry through Raft containing the
837
combination that uniquely describes the requesting replica. Reads and writes
838
must generally be addressed to the replica holding the lease; if none does, any
839
replica may be addressed, causing it to try to obtain the lease synchronously.
842
lease holder. These requests are retried transparently with the updated lease by the
843
gateway node and never reach the client.
844
845
The replica holding the lease is in charge or involved in handling
846
Range-specific maintenance tasks such as
847
848
* gossiping the sentinel and/or first range information
849
* splitting, merging and rebalancing
850
851
and, very importantly, may satisfy reads locally, without incurring the
852
overhead of going through Raft.
853
854
Since reads bypass Raft, a new lease holder will, among other things, ascertain
855
that its timestamp cache does not report timestamps smaller than the previous
856
lease holder's (so that it's compatible with reads which may have occurred on
857
the former lease holder). This is accomplished by setting the low water mark of the
858
timestamp cache to the expiration of the previous lease plus the maximum clock
859
offset.
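
A small sketch of that rule, with assumed types (`TimestampCache`, `Lease`):
on lease acquisition the cache's low water mark is forwarded to the previous
lease's expiration plus the maximum clock offset, so the new lease holder can
never report a read timestamp older than one the previous holder might have
served.

```go
package main

import (
	"fmt"
	"time"
)

// TimestampCache is an illustrative read-timestamp cache: it never reports
// a timestamp below its low water mark.
type TimestampCache struct {
	lowWater time.Time
	reads    map[string]time.Time
}

// Lease is an illustrative range lease with an expiration.
type Lease struct {
	Expiration time.Time
}

const maxClockOffset = 250 * time.Millisecond // assumed maximum clock offset

// OnLeaseAcquired makes the cache compatible with reads served by the
// previous lease holder, as described in the text.
func (tc *TimestampCache) OnLeaseAcquired(prev Lease) {
	if lw := prev.Expiration.Add(maxClockOffset); lw.After(tc.lowWater) {
		tc.lowWater = lw
	}
}

// LatestRead returns the latest read timestamp for a key, but never less
// than the low water mark.
func (tc *TimestampCache) LatestRead(key string) time.Time {
	if ts, ok := tc.reads[key]; ok && ts.After(tc.lowWater) {
		return ts
	}
	return tc.lowWater
}

func main() {
	tc := &TimestampCache{reads: map[string]time.Time{}}
	tc.OnLeaseAcquired(Lease{Expiration: time.Now()})
	fmt.Println("low water mark:", tc.LatestRead("a"))
}
```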

## Relationship to Raft leadership

The range lease is completely separate from Raft leadership, and so without
further efforts, Raft leadership and the Range lease may not be held by the same
replica most of the time. This is convenient semantically since it decouples
these two types of leadership and allows the use of Raft as a "black box", but
for reasons of performance, it is desirable to have both on the same replica.
Otherwise, sending a command through Raft always incurs the overhead of being
proposed to the Range lease holder's Raft instance first, which must relay it to the
Raft leader, adding an extra network hop. For this reason, it is desirable that the
Range lease and Raft leadership coincide. A fairly easy method for achieving this is
to have each new lease period (extension or new) be accompanied by a
stipulation to the lease holder's replica to start Raft elections (unless it's
already the Raft leader). Leases should then be kept
relatively stable and long-lived to avoid a large number of Raft leadership
transitions.

## Command Execution Flow

This section describes how a lease holder replica processes a read/write
command in more detail. Each command specifies (1) a key (or a range
of keys) that the command accesses and (2) the ID of a range which the
key(s) belongs to. When receiving a command, a RoachNode looks up a
range by the specified Range ID and checks if the range is still
responsible for the supplied keys. If any of the keys do not belong to the
range, the RoachNode returns an error so that the client will retry
and send a request to the correct range.

When all the keys belong to the range, the RoachNode attempts to
process the command. If the command is an inconsistent read-only
command, it is processed immediately. If the command is a consistent
read or a write, the command is executed when both of the following
conditions hold:

- The range replica holds the range lease.
- There are no other running commands whose keys overlap with
  the submitted command and cause read/write conflict.

When the first condition is not met, the replica attempts to acquire
a lease or returns an error so that the client will redirect the
command to the current lease holder.

When the above two conditions are met, the lease holder replica processes the
command. Consistent reads are processed on the lease holder immediately.
Write commands are committed into the Raft log so that every replica
will execute the same commands. All commands produce deterministic
results so that the range replicas keep consistent states among them.

When a write command completes, all the replicas update their response
cache to ensure idempotency. When a read command completes, the lease holder
replica updates its timestamp cache to keep track of the latest read
for a given key.

A range lease can expire while a command is being
executed. Before executing a command, each replica checks if the replica
proposing the command still holds a valid lease. When the lease has
expired, the command is rejected by the replica.
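
The checks above can be sketched as a small dispatch function. All types here
(`Command`, `Replica`) are illustrative stand-ins; the real flow also handles
the command queue, the response cache and Raft proposal details.

```go
package main

import (
	"errors"
	"fmt"
)

// Command is an illustrative request: the range it targets, the key it
// touches, and whether it needs consistency and/or writes data.
type Command struct {
	RangeID    int64
	Key        string
	Consistent bool
	Write      bool
}

// Replica is an illustrative range replica.
type Replica struct {
	RangeID             int64
	StartKey, EndKey    string
	HoldsLease          bool
	overlappingInFlight func(key string) bool
}

var errRangeKeyMismatch = errors.New("key not in range; client should retry against the correct range")
var errNotLeaseHolder = errors.New("not lease holder; redirect to current lease holder")

// execute mirrors the conditions described above: key containment, lease
// ownership for consistent commands, and no overlapping in-flight commands.
func (r *Replica) execute(c Command) error {
	if c.RangeID != r.RangeID || c.Key < r.StartKey || c.Key >= r.EndKey {
		return errRangeKeyMismatch
	}
	if !c.Consistent && !c.Write {
		return nil // inconsistent read-only commands execute immediately
	}
	if !r.HoldsLease {
		return errNotLeaseHolder // or attempt to acquire the lease
	}
	if r.overlappingInFlight(c.Key) {
		return errors.New("waiting on overlapping command") // queued in practice
	}
	if c.Write {
		// Writes would be proposed to Raft here so every replica applies them.
	}
	return nil
}

func main() {
	r := &Replica{RangeID: 1, StartKey: "a", EndKey: "m", HoldsLease: true,
		overlappingInFlight: func(string) bool { return false }}
	fmt.Println(r.execute(Command{RangeID: 1, Key: "c", Consistent: true, Write: true}))
}
```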

# Splitting / Merging Ranges

RoachNodes split or merge ranges based on whether they exceed maximum or
minimum thresholds for capacity or load. Ranges exceeding maximums for
either capacity or load are split; ranges below minimums for *both*
capacity and load are merged.

Ranges maintain the same accounting statistics as accounting key
prefixes. These boil down to a time series of data points with minute
granularity. Everything from number of bytes to read/write queue sizes.
Arbitrary distillations of the accounting stats can be determined as the
basis for splitting / merging. Two sensible metrics for use with
split/merge are range size in bytes and IOps. A good metric for
rebalancing a replica from one node to another would be total read/write
queue wait times. These metrics are gossipped, with each range / node
passing along relevant metrics if they're in the bottom or top of the
range it's aware of.

A range finding itself exceeding either capacity or load threshold
splits. To this end, the range lease holder computes an appropriate split key
candidate and issues the split through Raft. In contrast to splitting,
merging requires a range to be below the minimum threshold for both
capacity *and* load. A range being merged chooses the smaller of the
ranges immediately preceding and succeeding it.

Splitting, merging, rebalancing and recovering all follow the same basic
algorithm for moving data between RoachNodes. New target replicas are
created and added to the replica set of the source range. Then each new
replica is brought up to date by either replaying the log in full or
copying a snapshot of the source replica data and then replaying the log
from the timestamp of the snapshot to catch up fully. Once the new
replicas are fully up to date, the range metadata is updated and old,
source replica(s) deleted if applicable.

**Coordinator**

```
if splitting
  SplitRange(split_key): splits happen locally on range replicas and
  only after being completed locally, are moved to new target replicas.
else if merging
  Choose new replicas on same servers as target range replicas;
  add to replica set.
else if rebalancing || recovering
  Choose new replica(s) on least loaded servers; add to replica set.
```

**New Replica**

*Bring replica up to date:*

```
if all info can be read from replicated log
  copy replicated log
else
  snapshot source replica
  send successive ReadRange requests to source replica
  referencing snapshot

if merging
  combine ranges on all replicas
else if rebalancing || recovering
  remove old range replica(s)
```

RoachNodes split ranges when the total data in a range exceeds a
configurable maximum threshold. Similarly, ranges are merged when the
total data falls below a configurable minimum threshold.

**TBD: flesh this out**: Especially for merges (but also rebalancing) we have a
range disappearing from the local node; that range needs to disappear
gracefully, with a smooth handoff of operation to the new owner of its data.

Ranges are rebalanced if a node determines its load or capacity is one
of the worst in the cluster based on gossipped load stats. A node with
spare capacity is chosen in the same datacenter and a special-case split
is done which simply duplicates the data 1:1 and resets the range
configuration metadata.

# Node Allocation (via Gossip)

New nodes must be allocated when a range is split. Instead of requiring
every RoachNode to know about the status of all or even a large number
of peer nodes --or-- alternatively requiring a specialized curator or
master with sufficiently global knowledge, we use a gossip protocol to
efficiently communicate only interesting information between all of the
nodes in the cluster. What's interesting information? One example would
be whether a particular node has a lot of spare capacity. Each node,
when gossiping, compares each topic of gossip to its own state. If its
own state is somehow "more interesting" than the least interesting item
in the topic it's seen recently, it includes its own state as part of
the next gossip session with a peer node. In this way, a node with
capacity sufficiently in excess of the mean quickly becomes discovered
by the entire cluster. To avoid piling onto outliers, nodes from the
high capacity set are selected at random for allocation.

The gossip protocol itself contains two primary components:

- **Peer Selection**: each node maintains up to N peers with which it
  regularly communicates. It selects peers with an eye towards
  maximizing fanout. A peer node which itself communicates with an
  array of otherwise unknown nodes will be selected over one which
  communicates with a set containing significant overlap. Each time
  gossip is initiated, each node's set of peers is exchanged. Each
  node is then free to incorporate the other's peers as it sees fit.
  To avoid any node suffering from excess incoming requests, a node
  may refuse to answer a gossip exchange. Each node is biased
  towards answering requests from nodes without significant overlap
  and refusing requests otherwise.

  Peers are efficiently selected using a heuristic as described in
  [Agarwal & Trachtenberg (2006)](https://drive.google.com/file/d/0B9GCVTp_FHJISmFRTThkOEZSM1U/edit?usp=sharing).

  **TBD**: how to avoid partitions? Need to work out a simulation of
  the protocol to tune the behavior and see empirically how well it
  works.

- **Gossip Selection**: what to communicate. Gossip is divided into
  topics. Load characteristics (capacity per disk, cpu load, and
  state [e.g. draining, ok, failure]) are used to drive node
  allocation. Range statistics (range read/write load, missing
  replicas, unavailable ranges) and network topology (inter-rack
  bandwidth/latency, inter-datacenter bandwidth/latency, subnet
  outages) are used for determining when to split ranges, when to
  recover replicas vs. wait for network connectivity, and for
  debugging / sysops. In all cases, a set of minimums and a set of
  maximums is propagated; each node applies its own view of the
  world to augment the values. Each minimum and maximum value is
  tagged with the reporting node and other accompanying contextual
  information. Each topic of gossip has its own protobuf to hold the
  structured data. The number of items of gossip in each topic is
  limited by a configurable bound.

  For efficiency, nodes assign each new item of gossip a sequence
  number and keep track of the highest sequence number each peer
  node has seen. Each round of gossip communicates only the delta
  containing new items.

# Node Accounting

The gossip protocol discussed in the previous section is useful to
quickly communicate fragments of important information in a
decentralized manner. However, complete accounting for each node is also
stored to a central location, available to any dashboard process. This
is done using the map itself. Each node periodically writes its state to
the map with keys prefixed by `\0node`, similar to the first level of
range metadata, but with a '`node`' suffix. Each value is a protobuf
containing the full complement of node statistics--everything
communicated normally via the gossip protocol plus other useful, but
non-critical data.

The range containing the first key in the node accounting table is
responsible for gossiping the total count of nodes. This total count is
used by the gossip network to most efficiently organize itself. In
particular, the maximum number of hops for gossipped information to take
before reaching a node is given by `ceil(log(node count) / log(max
fanout)) + 1`.
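
For example, under that formula a 1000-node cluster with a maximum fanout of 8
needs at most `ceil(log(1000) / log(8)) + 1 = 5` hops. A tiny illustration:

```go
package main

import (
	"fmt"
	"math"
)

// maxGossipHops computes the bound quoted above for the number of hops a
// piece of gossip needs to reach every node.
func maxGossipHops(nodeCount, maxFanout int) int {
	return int(math.Ceil(math.Log(float64(nodeCount))/math.Log(float64(maxFanout)))) + 1
}

func main() {
	fmt.Println(maxGossipHops(1000, 8)) // 5
}
```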

# Key-Prefix Accounting and Zones

Arbitrarily fine-grained accounting is specified via
key prefixes. Key prefixes can overlap, as is necessary for capturing
hierarchical relationships. For illustrative purposes, let's say keys
specifying rows in a set of databases have the following format:

`<db>:<table>:<primary-key>[:<secondary-key>]`

In this case, we might collect accounting with the following
key prefixes:

`db1`, `db1:user`, `db1:order`

Accounting is kept for the entire map by default.

## Accounting

To keep accounting for a range defined by a key prefix, an entry is created in
the accounting system table. The format of accounting table keys is:

`\0acct<key-prefix>`

In practice, we assume each RoachNode is capable of caching the
entire accounting table as it is likely to be relatively small.

Accounting is kept for key prefix ranges with eventual consistency for
efficiency. There are two types of values which comprise accounting:
counts and occurrences, for lack of better terms. Counts describe
system state, such as the total number of bytes, rows,
etc. Occurrences include transient performance and load metrics. Both
types of accounting are captured as time series with minute
granularity. The length of time accounting metrics are kept is
configurable. Below are examples of each type of accounting value.

**System State Counters/Performance**

- Count of items (e.g. rows)
- Total bytes
- Total key bytes
- Total value length
- Queued message count
- Queued message total bytes
- Count of values \< 16B
- Count of values \< 64B
- Count of values \< 256B
- Count of values \< 1K
- Count of values \< 4K
- Count of values \< 16K
- Count of values \< 64K
- Count of values \< 256K
- Count of values \< 1M
- Count of values \> 1M
- Total bytes of accounting

**Load Occurrences**

- Get op count
- Get total MB
- Put op count
- Put total MB
- Delete op count
- Delete total MB
- Delete range op count
- Delete range total MB
- Scan op count
- Scan op MB
- Split count
- Merge count

Because accounting information is kept as time series and over many
possible metrics of interest, the data can become numerous. Accounting
data are stored in the map near the key prefix described, in order to
distribute load (for both aggregation and storage).

Accounting keys for system state have the form:
`<key-prefix>|acctd<metric-name>*`. Notice the leading 'pipe'
character. It's meant to sort the root level account AFTER any other
system tables. They must increment the same underlying values as they
are permanent counts, and not transient activity. Logic at the
RoachNode takes care of snapshotting the value into an appropriately
suffixed (e.g. with timestamp hour) multi-value time series entry.

Keys for perf/load metrics:
`<key-prefix>acctd<metric-name><hourly-timestamp>`.

`<hourly-timestamp>`-suffixed accounting entries are multi-valued,
containing a varint64 entry for each minute with activity during the
specified hour.

To efficiently keep accounting over large key ranges, the task of
aggregation must be distributed. If activity occurs within the same
range as the key prefix for accounting, the updates are made as part
of the consensus write. If the ranges differ, then a message is sent
to the parent range to increment the accounting. If upon receiving the
message, the parent range also does not include the key prefix, it in
turn forwards it to its parent or left child in the balanced binary
tree which is maintained to describe the range hierarchy. This limits
the number of messages before an update is visible at the root to `2*log N`,
where `N` is the number of ranges in the key prefix.

## Zones

Zones are stored in the map with keys prefixed by
`\0zone` followed by the key prefix to which the zone
configuration applies. Zone values specify a protobuf containing
the datacenters from which replicas for ranges which fall under
the zone must be chosen.

Please see [config/config.proto](https://github.com/cockroachdb/cockroach/blob/master/config/config.proto) for up-to-date data structures used, the best entry point being `message ZoneConfig`.

If zones are modified in situ, each RoachNode verifies the
existing zones for its ranges against the zone configuration. If
it discovers differences, it reconfigures ranges in the same way
that it rebalances away from busy nodes, via special-case 1:1
split to a duplicate range comprising the new configuration.