# About
This document is an updated version of the original design documents
by Spencer Kimball from early 2014.

# Overview
Cockroach is a distributed key:value datastore (SQL and structured
data layers of cockroach have yet to be defined) which supports **ACID
transactional semantics** and **versioned values** as first-class
features. The primary design goal is **global consistency and
survivability**, hence the name. Cockroach aims to tolerate disk,
machine, rack, and even **datacenter failures** with minimal latency
disruption and **no manual intervention**. Cockroach nodes are
symmetric; a design goal is **homogeneous deployment** (one binary) with
minimal configuration.

Cockroach implements a **single, monolithic sorted map** from key to
value where both keys and values are byte strings (not unicode).
Cockroach **scales linearly** (theoretically up to 4 exabytes (4E) of
logical data). The map is composed of one or more ranges and each range
is backed by data stored in [RocksDB](http://rocksdb.org/) (a
variant of LevelDB), and is replicated to a total of three or more
cockroach servers. Ranges are defined by start and end keys. Ranges are
merged and split to maintain total byte size within a globally
configurable min/max size interval. Range sizes default to target `64M` in
order to facilitate quick splits and merges and to distribute load at
hotspots within a key range. Range replicas are intended to be located
in disparate datacenters for survivability (e.g. `{ US-East, US-West,
Japan }`, `{ Ireland, US-East, US-West }`, `{ Ireland, US-East, US-West,
Japan, Australia }`).

Single mutations to ranges are mediated via an instance of a distributed
consensus algorithm to ensure consistency. We’ve chosen to use the
[Raft consensus algorithm](https://raftconsensus.github.io); all consensus
state is stored in RocksDB.

A single logical mutation may affect multiple key/value pairs. Logical
mutations have ACID transactional semantics. If all keys affected by a
logical mutation fall within the same range, atomicity and consistency
are guaranteed by Raft; this is the **fast commit path**. Otherwise, a
**non-locking distributed commit** protocol is employed between affected
ranges.

Cockroach provides [snapshot isolation](http://en.wikipedia.org/wiki/Snapshot_isolation) (SI) and
serializable snapshot isolation (SSI) semantics, allowing **externally
consistent, lock-free reads and writes**--both from a historical
snapshot timestamp and from the current wall clock time. SI provides
lock-free reads and writes but still allows write skew. SSI eliminates
write skew, but introduces a performance hit in the case of a
contentious system. SSI is the default isolation; clients must
consciously decide to trade correctness for performance. Cockroach
implements [a limited form of linearizability](#linearizability),
providing ordering for any observer or chain of observers.

Similar to
[Spanner](http://static.googleusercontent.com/media/research.google.com/en/us/archive/spanner-osdi2012.pdf)
directories, Cockroach allows configuration of arbitrary zones of data.
This allows replication factor, storage device type, and/or datacenter
location to be chosen to optimize performance and/or availability.
Unlike Spanner, zones are monolithic and don’t allow movement of fine
grained data on the level of entity groups.

A
[Megastore](http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf)-like
message queue mechanism is also provided to 1) efficiently sideline
updates which can tolerate asynchronous execution and 2) provide an
integrated message queuing system for asynchronous communication between
distributed system components.
# Architecture
Cockroach implements a layered architecture. The highest level of
abstraction is the SQL layer (currently unspecified in this document).
It depends directly on the [*structured data
API*](#structured-data-api), which provides familiar relational concepts
such as schemas, tables, columns, and indexes. The structured data API
in turn depends on the [distributed key value store](#key-value-api),
which handles the details of range addressing to provide the abstraction
of a single, monolithic key value store. The distributed KV store
communicates with any number of physical cockroach nodes. Each node
contains one or more stores, one per physical device.

![Architecture](media/architecture.png)

Each store contains potentially many ranges, the lowest-level unit of
key-value data. Ranges are replicated using the Raft consensus protocol.
The diagram below is a blown up version of stores from four of the five
nodes in the previous diagram. Each range is replicated three ways using
raft. The color coding shows associated range replicas.

![Ranges](media/ranges.png)

Each physical node exports a RoachNode service. Each RoachNode exports
one or more key ranges. RoachNodes are symmetric. Each has the same
binary and assumes identical roles.

Nodes and the ranges they provide access to can be arranged with various
physical network topologies to make trade offs between reliability and
performance. For example, a triplicated (3-way replica) range could have
each replica located on different:

- disks within a server to tolerate disk failures.
- servers within a rack to tolerate server failures.
- servers on different racks within a datacenter to tolerate rack power/network failures.
- servers in different datacenters to tolerate large scale network or power outages.

Up to `F` failures can be tolerated, where the total number of replicas `N = 2F + 1` (e.g. with 3x replication, one failure can be tolerated; with 5x replication, two failures, and so on).
# Cockroach Client
In order to support diverse client usage, Cockroach clients connect to
any node via HTTPS using protocol buffers or JSON. The connected node
proxies involved client work including key lookups and write buffering.
# Keys
Cockroach keys are arbitrary byte arrays. If textual data is used in
keys, utf8 encoding is recommended (this helps for cleaner display of
values in debugging tools). User-supplied keys are encoded using an
ordered code. System keys are either prefixed with null characters (`\0`
or `\0\0`) for system tables, or take the form of
`<user-key><system-suffix>` to sort user-key-range specific system
keys immediately after the user keys they refer to. Null characters are
used in system key prefixes to guarantee that they sort first.
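
As a rough illustration of these ordering properties, the Go sketch below compares a null-prefixed system key, a user key, and a user-key-specific system key built with a hypothetical suffix (the concrete prefixes and the `!acct` suffix are illustrative, not Cockroach's actual encoding):

```go
package main

import (
	"bytes"
	"fmt"
)

func main() {
	// System tables are prefixed with null characters, so they sort before
	// any user key beginning with a printable byte.
	systemKey := []byte("\x00\x00meta1")

	// A user key, and a user-key-range specific system key formed by
	// appending a (hypothetical) system suffix to it.
	userKey := []byte("photos/123")
	userSystemKey := append(append([]byte{}, userKey...), []byte("!acct")...)

	fmt.Println(bytes.Compare(systemKey, userKey) < 0)     // true: system keys sort first
	fmt.Println(bytes.Compare(userSystemKey, userKey) > 0) // true: suffixed key sorts just after the user key
}
```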
# Versioned Values

Cockroach maintains historical versions of values by storing them with
associated commit timestamps. Reads and scans can specify a snapshot
time to return the most recent writes prior to the snapshot timestamp.
Older versions of values are garbage collected by the system during
compaction according to a user-specified expiration interval. In order
to support long-running scans (e.g. for MapReduce), all versions have a
minimum expiration.

Versioned values are supported via modifications to RocksDB to record
commit timestamps and GC expirations per key.

Each range maintains a small (i.e. latest 10s of read timestamps),
*in-memory* cache from key to the latest timestamp at which the
key was read. This *read timestamp cache* is updated every time a key
is read. The cache’s entries are evicted oldest timestamp first, updating
the low water mark of the cache appropriately. If a new range replica leader
is elected, it sets the low water mark for the cache to the current
wall time + ε (ε = 99th percentile clock skew).
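
A minimal sketch of how such a read timestamp cache might behave (the structure and names are illustrative, not the actual implementation):

```go
package main

import "fmt"

// ReadTimestampCache is an illustrative, non-concurrent sketch of a
// per-range read timestamp cache with a low water mark.
type ReadTimestampCache struct {
	lowWater int64            // evicted entries are folded into this timestamp
	readTS   map[string]int64 // latest read timestamp per key
}

// Add records that key was read at timestamp ts.
func (c *ReadTimestampCache) Add(key string, ts int64) {
	if ts > c.readTS[key] {
		c.readTS[key] = ts
	}
}

// Latest returns the timestamp a writer must exceed for this key: either the
// cached read timestamp or, if the key was never cached (or was evicted),
// the cache's low water mark.
func (c *ReadTimestampCache) Latest(key string) int64 {
	if ts, ok := c.readTS[key]; ok && ts > c.lowWater {
		return ts
	}
	return c.lowWater
}

// SetLowWater is what a newly elected range leader would call, passing
// current wall time + ε so reads served by the old leader are covered.
func (c *ReadTimestampCache) SetLowWater(ts int64) {
	if ts > c.lowWater {
		c.lowWater = ts
	}
}

func main() {
	c := &ReadTimestampCache{readTS: map[string]int64{}}
	c.Add("a", 100)
	c.SetLowWater(150)         // e.g. new leader elected with wall time + ε = 150
	fmt.Println(c.Latest("a")) // 150: the low water mark dominates
}
```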
# Lock-Free Distributed Transactions

Cockroach provides distributed transactions without locks. Cockroach
transactions support two isolation levels:

- snapshot isolation (SI) and
- *serializable* snapshot isolation (SSI).

*SI* is simple to implement, highly performant, and correct for all but a
handful of anomalous conditions (e.g. write skew). *SSI* requires just a touch
more complexity, is still highly performant (less so with contention), and has
no anomalous conditions. Cockroach’s SSI implementation is based on ideas from
the literature and some possibly novel insights.

SSI is the default level, with SI provided for application developers
who are certain enough of their need for performance and the absence of
write skew conditions to consciously elect to use it. In a lightly
contended system, our implementation of SSI is just as performant as SI,
requiring no locking or additional writes. With contention, our
implementation of SSI still requires no locking, but will end up
aborting more transactions. Cockroach’s SI and SSI implementations
prevent starvation scenarios even for arbitrarily long transactions.

See the [Cahill paper](https://drive.google.com/file/d/0B9GCVTp_FHJIcEVyZVdDWEpYYXVVbFVDWElrYUV0NHFhU2Fv/edit?usp=sharing)
for one possible implementation of SSI. Another [great paper](http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf) is also worth reading.
For a discussion of SSI implemented by preventing read-write conflicts
(in contrast to detecting them, called write-snapshot isolation), see
the [Yabandeh paper](https://drive.google.com/file/d/0B9GCVTp_FHJIMjJ2U2t6aGpHLTFUVHFnMTRUbnBwc2pLa1RN/edit?usp=sharing),
which is the source of much inspiration for Cockroach’s SSI.

Each Cockroach transaction is assigned a random priority and a
"candidate timestamp" at start. The candidate timestamp is the
provisional timestamp at which the transaction will commit, and is
chosen as the current clock time of the node coordinating the
transaction. This means that a transaction without conflicts will
usually commit with a timestamp that, in absolute time, precedes the
actual work done by that transaction.

In the course of coordinating a transaction between one or more
distributed nodes, the candidate timestamp may be increased, but will
never be decreased. The core difference between the two isolation levels
SI and SSI is that the former allows the transaction's candidate
timestamp to increase and the latter does not.

**Hybrid Logical Clock**

Each cockroach node maintains a hybrid logical clock (HLC) as discussed
in the [Hybrid Logical Clock paper](http://www.cse.buffalo.edu/tech-reports/2014-04.pdf).
HLC time uses timestamps which are composed of a physical component (thought of
as, and always close to, local wall time) and a logical component (used to
distinguish between events with the same physical component). It allows us to
track causality for related events similar to vector clocks, but with less
overhead. In practice, it works much like other logical clocks: when events
are received by a node, it informs the local HLC about the timestamp supplied
with the event by the sender, and when events are sent, a timestamp generated by
the local HLC is attached.

For a more in-depth description of HLC please read the paper. Our
implementation is [here](https://github.com/cockroachdb/cockroach/blob/master/util/hlc/hlc.go).

Cockroach picks a Timestamp for a transaction using HLC time. Throughout this
document, *timestamp* always refers to the HLC time, which is a singleton
on each node. The HLC is updated by every read/write event on the node, and
the HLC time >= walltime. A read/write timestamp received in a cockroach request
from another node is not only used to version the operation, but also updates
the HLC on the node. This is useful in guaranteeing that all data read/written
on a node is at a timestamp < next HLC time.
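
The Go sketch below illustrates the update rules just described (a simplified, non-concurrent model of the algorithm from the HLC paper; the real implementation is the `util/hlc` package linked above):

```go
package main

import (
	"fmt"
	"time"
)

// Timestamp is an HLC timestamp: a physical (wall time) component plus a
// logical counter to order events sharing the same physical component.
type Timestamp struct {
	WallTime int64 // nanoseconds
	Logical  int32
}

// Clock is a toy, single-goroutine HLC.
type Clock struct {
	now func() int64 // physical clock source
	ts  Timestamp    // last timestamp handed out or observed
}

// Now returns a timestamp for a locally generated event (e.g. sending a request).
func (c *Clock) Now() Timestamp {
	pt := c.now()
	if pt > c.ts.WallTime {
		c.ts = Timestamp{WallTime: pt} // physical clock advanced; reset logical
	} else {
		c.ts.Logical++
	}
	return c.ts
}

// Update folds a timestamp received from another node into the local clock,
// guaranteeing that subsequent local timestamps exceed everything seen so far.
func (c *Clock) Update(remote Timestamp) Timestamp {
	pt := c.now()
	switch {
	case pt > c.ts.WallTime && pt > remote.WallTime:
		c.ts = Timestamp{WallTime: pt}
	case remote.WallTime > c.ts.WallTime:
		c.ts = Timestamp{WallTime: remote.WallTime, Logical: remote.Logical + 1}
	case c.ts.WallTime > remote.WallTime:
		c.ts.Logical++
	default: // equal wall times: take the larger logical component and bump it
		if remote.Logical > c.ts.Logical {
			c.ts.Logical = remote.Logical
		}
		c.ts.Logical++
	}
	return c.ts
}

func main() {
	c := &Clock{now: func() int64 { return time.Now().UnixNano() }}
	fmt.Println(c.Now())
	fmt.Println(c.Update(Timestamp{WallTime: time.Now().UnixNano() + 1e9})) // remote clock ahead
}
```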
**Transaction execution flow**

Transactions are executed in two phases:

1. Start the transaction by writing a new entry to the system
transaction table (keys prefixed by *\0tx*) with state “PENDING”. In
parallel write an "intent" value for each datum being written as part
of the transaction. These are normal MVCC values, with the addition of
a special flag (i.e. “intent”) indicating that the value may be
committed after the transaction itself commits. In addition,
the transaction id (unique and chosen at tx start time by client)
is stored with intent values. The tx id is used to refer to the
transaction table when there are conflicts and to make
tie-breaking decisions on ordering between identical timestamps.
Each node returns the timestamp used for the write (which is the
original candidate timestamp in the absence of read/write conflicts);
the client selects the maximum from amongst all write timestamps as the
final commit timestamp.

2. Commit the transaction by updating its entry in the system
transaction table (keys prefixed by *\0tx*). The value of the
commit entry contains the candidate timestamp (increased as
necessary to accommodate any latest read timestamps). Note that
the transaction is considered fully committed at this point and
control may be returned to the client.

In the case of an SI transaction, a commit timestamp which was
increased to accommodate concurrent readers is perfectly
acceptable and the commit may continue. For SSI transactions,
however, a gap between candidate and commit timestamps
necessitates transaction restart (note: restart is different than
abort--see below).

After the transaction is committed, all written intents are upgraded
in parallel by removing the “intent” flag. The transaction is
considered fully committed before this step and does not wait for
it to return control to the transaction coordinator.

In the absence of conflicts, this is the end. Nothing else is necessary
to ensure the correctness of the system.
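
A condensed sketch of the coordinator-side flow under these rules (the types and function names are illustrative only; it glosses over error handling, parallelism, and the actual RPC layer):

```go
package main

import "fmt"

type Timestamp int64

// intentResult is a hypothetical per-key write result: the timestamp the
// owning range used, possibly pushed above the candidate by conflicts.
type intentResult struct {
	key     string
	writeTS Timestamp
}

// runTxn sketches the two phases: write the PENDING record plus intents, then
// commit at the maximum returned timestamp (SSI restarts if it was pushed).
func runTxn(candidate Timestamp, writes map[string]string, serializable bool) (Timestamp, bool) {
	// Phase 1: write the transaction table entry ("\0tx<id>" = PENDING) and,
	// in parallel, an intent for each key. Here we just simulate the returned
	// per-intent write timestamps for the conflict-free case.
	results := make([]intentResult, 0, len(writes))
	for k := range writes {
		results = append(results, intentResult{key: k, writeTS: candidate})
	}

	commitTS := candidate
	for _, r := range results {
		if r.writeTS > commitTS {
			commitTS = r.writeTS
		}
	}

	// Phase 2: update the transaction table entry to COMMITTED at commitTS.
	// For SSI, any gap between candidate and commit timestamp forces a restart.
	if serializable && commitTS > candidate {
		return commitTS, false // caller restarts at the higher timestamp
	}
	return commitTS, true
}

func main() {
	ts, ok := runTxn(100, map[string]string{"a": "1", "b": "2"}, true)
	fmt.Println(ts, ok) // 100 true in the conflict-free case
}
```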
**Conflict Resolution**

Things get more interesting when a reader or writer encounters an intent
record or newly-committed value in a location that it needs to read or
write. This is a conflict, usually causing either of the transactions to
abort or restart depending on the type of conflict.

***Transaction restart:***

This is the usual (and more efficient) type of behaviour and is used
except when the transaction was aborted (for instance by another
transaction).
In effect, that reduces to two cases; the first being the one outlined
above: an SSI transaction that finds upon attempting to commit that
its commit timestamp has been pushed. The second case involves a transaction
actively encountering a conflict, that is, one of its readers or writers
encounters data that necessitates conflict resolution
(see transaction interactions below).

When a transaction restarts, it changes its priority and/or moves its
timestamp forward depending on data tied to the conflict, and
begins anew reusing the same tx id. The prior run of the transaction might
have written some write intents, which need to be deleted before the
transaction commits, so as to not be included as part of the transaction.
These stale write intent deletions are done during the reexecution of the
transaction, either implicitly, through writing new intents to
the same keys as part of the reexecution of the transaction, or explicitly,
by cleaning up stale intents that are not part of the reexecution of the
transaction. Since most transactions will end up writing to the same keys,
the explicit cleanup run just before committing the transaction is usually
a NOOP.

***Transaction abort:***

This is the case in which a transaction, upon reading its transaction
table entry, finds that it has been aborted. In this case, the
transaction can not reuse its intents; it returns control to the client
before cleaning them up (other readers and writers would clean up
dangling intents as they encounter them) but will make an effort to
clean up after itself. The next attempt (if applicable) then runs as a
new transaction with **a new tx id**.

***Transaction interactions:***

There are several scenarios in which transactions interact:

- **Reader encounters write intent or value with newer timestamp far
enough in the future**: This is not a conflict. The reader is free
to proceed; after all, it will be reading an older version of the
value and so does not conflict. Recall that the write intent may
be committed with a later timestamp than its candidate; it will
never commit with an earlier one. **Side note**: if an SI transaction
reader finds an intent with a newer timestamp which the reader’s own
transaction has written, the reader always returns that intent's value.

- **Reader encounters write intent or value with newer timestamp in the
near future:** In this case, we have to be careful. The newer
intent may, in absolute terms, have happened in our read's past if
the clock of the writer is ahead of the node serving the values.
In that case, we would need to take this value into account, but
we just don't know. Hence the transaction restarts, using instead
a future timestamp (but remembering a maximum timestamp used to
limit the uncertainty window to the maximum clock skew). In fact,
this is optimized further; see the details under "Choosing a
Timestamp" below.

- **Reader encounters write intent with older timestamp**: the reader
must follow the intent’s transaction id to the transaction table.
If the transaction has already been committed, then the reader can
just read the value. If the write transaction has not yet been
committed, then the reader has two options. If the write conflict
is from an SI transaction, the reader can *push that transaction's
commit timestamp into the future* (and consequently not have to
read it). This is simple to do: the reader just updates the
transaction’s commit timestamp to indicate that when/if the
transaction does commit, it should use a timestamp *at least* as
high. However, if the write conflict is from an SSI transaction,
the reader must compare priorities. If the reader has the higher priority,
it pushes the transaction’s commit timestamp (that
transaction will then notice its timestamp has been pushed, and
restart). If it has the lower or same priority, it retries itself using as
a new priority `max(new random priority, conflicting txn’s
priority - 1)`.

- **Writer encounters uncommitted write intent**:
If the other write intent has been written by a transaction with a lower
priority, the writer aborts the conflicting transaction. If the write
intent has a higher or equal priority the transaction retries, using as a new
priority *max(new random priority, conflicting txn’s priority - 1)*;
the retry occurs after a short, randomized backoff interval (see the
sketch following this list).

- **Writer encounters newer committed value**:
The committed value could also be an unresolved write intent made by a
transaction that has already committed. The transaction restarts. On restart,
the same priority is reused, but the candidate timestamp is moved forward
to the encountered value's timestamp.

- **Writer encounters more recently read key**:
The *read timestamp cache* is consulted on each write at a node. If the write’s
candidate timestamp is earlier than the low water mark on the cache itself
(i.e. its last evicted timestamp) or if the key being written has a read
timestamp later than the write’s candidate timestamp, this later timestamp
value is returned with the write. A new timestamp forces a transaction
restart only if it is serializable.
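
The priority update rule used by the reader-push and writer-retry cases above can be written down directly. A small Go sketch (the helper name and the integer priority representation are assumptions, not the production code):

```go
package main

import (
	"fmt"
	"math/rand"
)

// retryPriority returns the priority a transaction adopts when it loses a
// conflict and retries: at least one less than the winning transaction's
// priority, so the retrying transaction keeps a competitively high priority
// instead of starting over from a possibly low random draw.
func retryPriority(conflictingPriority int32) int32 {
	p := rand.Int31() // new random priority
	if conflictingPriority-1 > p {
		return conflictingPriority - 1
	}
	return p
}

func main() {
	fmt.Println(retryPriority(10)) // usually the fresh random draw dominates
}
```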
**Transaction management**

Transactions are managed by the client proxy (or gateway in SQL Azure
parlance). Unlike in Spanner, writes are not buffered but are sent
directly to all implicated ranges. This allows the transaction to abort
quickly if it encounters a write conflict. The client proxy keeps track
of all written keys in order to resolve write intents asynchronously upon
transaction completion. If a transaction commits successfully, all intents
are upgraded to committed. In the event a transaction is aborted, all written
intents are deleted. The client proxy doesn’t guarantee it will resolve intents.

In the event the client proxy restarts before the pending transaction is
committed, the dangling transaction would continue to live in the
transaction table until aborted by another transaction. Transactions
heartbeat the transaction table every five seconds by default.
Transactions encountered by readers or writers with dangling intents
which haven’t been heartbeated within the required interval are aborted.
In the event the proxy restarts after a transaction commits but before
the asynchronous resolution is complete, the dangling intents are upgraded
when encountered by future readers and writers and the system does
not depend on their timely resolution for correctness.
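
A sketch of the staleness check a reader or writer might apply before aborting an apparently abandoned transaction (the PENDING state and the five-second default come from the text above; everything else is illustrative):

```go
package main

import (
	"fmt"
	"time"
)

const heartbeatInterval = 5 * time.Second // default heartbeat cadence from the text

// canAbortAbandoned reports whether a conflicting transaction's record looks
// abandoned: still PENDING but not heartbeated within the required interval.
func canAbortAbandoned(status string, lastHeartbeat, now time.Time) bool {
	return status == "PENDING" && now.Sub(lastHeartbeat) > heartbeatInterval
}

func main() {
	last := time.Now().Add(-12 * time.Second)
	fmt.Println(canAbortAbandoned("PENDING", last, time.Now())) // true: may be aborted
}
```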
An exploration of retries with contention and abort times with abandoned
transactions is
[here](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing).

**Transaction Table**

Please see [proto/data.proto](https://github.com/cockroachdb/cockroach/blob/master/proto/data.proto) for the up-to-date structures, the best entry point being `message Transaction`.

**Pros**

- No requirement for reliable code execution to prevent stalled 2PC
protocol.
- Readers never block with SI semantics; with SSI semantics, they may
abort.
- Lower latency than traditional 2PC commit protocol (w/o contention)
because second phase requires only a single write to the
transaction table instead of a synchronous round to all
transaction participants.
- Priorities avoid starvation for arbitrarily long transactions and
always pick a winner from between contending transactions (no
mutual aborts).
- Writes not buffered at client; writes fail fast.
- No read-locking overhead required for *serializable* SI (in contrast
to other SSI implementations).
- Well-chosen (i.e. less random) priorities can flexibly give
probabilistic guarantees on latency for arbitrary transactions
(for example: make OLTP transactions 10x less likely to abort than
low priority transactions, such as asynchronously scheduled jobs).

**Cons**

- Reads from non-leader replicas still require a ping to the leader to
update the *read timestamp cache*.
- Abandoned transactions may block contending writers for up to the
heartbeat interval, though average wait is likely to be
considerably shorter (see [graph in link](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing)).
This is likely considerably more performant than detecting and
restarting 2PC in order to release read and write locks.
- Behavior different than other SI implementations: no first writer
wins, and shorter transactions do not always finish quickly.
Element of surprise for OLTP systems may be a problematic factor.
- Aborts can decrease throughput in a contended system compared with
two phase locking. Aborts and retries increase read and write
traffic, increase latency and decrease throughput.

**Choosing a Timestamp**

A key challenge of reading data in a distributed system with clock skew
is choosing a timestamp guaranteed to be greater than the latest
timestamp of any committed transaction (in absolute time). No system can
claim consistency and fail to read already-committed data.

Accomplishing consistency for transactions (or just single operations)
accessing a single node is easy. The timestamp is assigned by the node
itself, so it is guaranteed to be at a greater timestamp than all the
existing timestamped data on the node.

For multiple nodes, the timestamp of the node coordinating the
transaction `t` is used. In addition, a maximum timestamp `t+ε` is
supplied to provide an upper bound on timestamps for already-committed
data (`ε` is the maximum clock skew). As the transaction progresses, any
data read which have timestamps greater than `t` but less than `t+ε`
cause the transaction to abort and retry with the conflicting timestamp
t<sub>c</sub>, where t<sub>c</sub> \> t. The maximum timestamp `t+ε` remains
the same. This implies that transaction restarts due to clock uncertainty
can only happen on a time interval of length `ε`.
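
The uncertainty check reduces to a simple predicate over the transaction's read timestamp and its maximum timestamp; a Go sketch with illustrative names:

```go
package main

import "fmt"

type Timestamp int64

// isUncertain reports whether a value at valueTS falls inside the
// transaction's uncertainty window (t, t+ε): newer than the read timestamp
// but not provably newer in absolute time, so the read must restart at a
// higher timestamp rather than silently skip the value.
func isUncertain(valueTS, readTS, maxTS Timestamp) bool {
	return valueTS > readTS && valueTS < maxTS
}

func main() {
	t, eps := Timestamp(1000), Timestamp(250) // ε = maximum clock skew
	fmt.Println(isUncertain(1100, t, t+eps))  // true: restart with t_c = 1100
	fmt.Println(isUncertain(1400, t, t+eps))  // false: definitely in the future, ignore
}
```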
We apply another optimization to reduce the restarts caused
by uncertainty. Upon restarting, the transaction not only takes
into account t<sub>c</sub>, but the timestamp of the node at the time
of the uncertain read t<sub>node</sub>. The larger of those two timestamps
t<sub>c</sub> and t<sub>node</sub> (likely equal to the latter) is used
to increase the read timestamp. Additionally, the conflicting node is
marked as “certain”. Then, for future reads to that node within the
transaction, we set `MaxTimestamp = Read Timestamp`, preventing further
uncertainty restarts.

Correctness follows from the fact that we know that at the time of the read,
there exists no version of any key on that node with a higher timestamp than
t<sub>node</sub>. Upon a restart caused by the node, if the transaction
encounters a key with a higher timestamp, it knows that in absolute time,
the value was written after t<sub>node</sub> was obtained, i.e. after the
uncertain read. Hence the transaction can move forward reading an older version
of the data (at the transaction's timestamp). This limits the time uncertainty
restarts attributed to a node to at most one. The tradeoff is that we might
pick a timestamp larger than the optimal one (> highest conflicting timestamp),
resulting in the possibility of a few more conflicts.

We expect retries will be rare, but this assumption may need to be
revisited if retries become problematic. Note that this problem does not
apply to historical reads. An alternate approach which does not require
retries makes a round to all node participants in advance and
chooses the highest reported node wall time as the timestamp. However,
knowing which nodes will be accessed in advance is difficult and
potentially limiting. Cockroach could also potentially use a global
clock (Google did this with [Percolator](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf)),
which would be feasible for smaller, geographically-proximate clusters.
# Linearizability
First a word about [***Spanner***](http://research.google.com/archive/spanner.html).
By combining judicious use of wait intervals with accurate time signals,
Spanner provides a global ordering between any two non-overlapping transactions
(in absolute time) with \~14ms latencies. Put another way:
Spanner guarantees that if a transaction T<sub>1</sub> commits (in absolute time)
before another transaction T<sub>2</sub> starts, then T<sub>1</sub>'s assigned commit
timestamp is smaller than T<sub>2</sub>'s. Using atomic clocks and GPS receivers,
Spanner reduces their clock skew uncertainty to \< 10ms (`ε`). To make
good on the promised guarantee, transactions must take at least double
the clock skew uncertainty interval to commit (`2ε`). See [*this
article*](http://www.cs.cornell.edu/~ie53/publications/DC-col51-Sep13.pdf)
for a helpful overview of Spanner’s concurrency control.

Cockroach could make the same guarantees without specialized hardware,
at the expense of longer wait times. If servers in the cluster were
configured to work only with NTP, transaction wait times would likely
be in excess of 150ms. For wide-area zones, this would be somewhat
mitigated by overlap from cross datacenter link latencies. If clocks
were made more accurate, the minimal limit for commit latencies would
improve.

However, let’s take a step back and evaluate whether Spanner’s external
consistency guarantee is worth the automatic commit wait. First, if the
commit wait is omitted completely, the system still yields a consistent
view of the map at an arbitrary timestamp. However with clock skew, it
would become possible for commit timestamps on non-overlapping but
causally related transactions to suffer temporal reversal. In other
words, the following scenario is possible for a client without global
ordering:

- Start transaction T<sub>1</sub> to modify value `x` with commit time *s<sub>1</sub>*

- On commit of T<sub>1</sub>, start T<sub>2</sub> to modify value `y` with commit time
*s<sub>2</sub>*

- Read `x` and `y` and discover that s<sub>1</sub> \> s<sub>2</sub> (**!**)

The external consistency which Spanner guarantees is referred to as
**linearizability**. It goes beyond serializability by preserving
information about the causality inherent in how external processes
interacted with the database. The strength of Spanner’s guarantee can be
formulated as follows: any two processes, with clock skew within
expected bounds, may independently record their wall times for the
completion of transaction T<sub>1</sub> (T<sub>1</sub><sup>end</sup>) and start of transaction
T<sub>2</sub> (T<sub>2</sub><sup>start</sup>) respectively, and if later
compared such that T<sub>1</sub><sup>end</sup> \< T<sub>2</sub><sup>start</sup>,
then commit timestamps s<sub>1</sub> \< s<sub>2</sub>.
This guarantee is broad enough to completely cover all cases of explicit
causality, in addition to covering any and all imaginable scenarios of implicit
causality.

Our contention is that causality is chiefly important from the
perspective of a single client or a chain of successive clients (*if a
tree falls in the forest and nobody hears…*). As such, Cockroach
provides two mechanisms to provide linearizability for the vast majority
of use cases without a mandatory transaction commit wait or an elaborate
system to minimize clock skew.

1. Clients provide the highest transaction commit timestamp with
> successive transactions. This allows node clocks from previous
> transactions to effectively participate in the formulation of the
> commit timestamp for the current transaction. This guarantees
> linearizability for transactions committed by this client.
>
> Newly launched clients wait at least 2 \* ε from process start
> time before beginning their first transaction. This preserves the
> same property even on client restart, and the wait will be
> mitigated by process initialization.
>
> All causally-related events within Cockroach maintain
> linearizability. Message queues, for example, guarantee that the
> receipt timestamp is greater than send timestamp, and that
> delivered messages may not be reaped until after the commit wait.

2. Committed transactions respond with a commit wait parameter which
> represents the remaining time in the nominal commit wait. This
> will typically be less than the full commit wait as the consensus
> write at the coordinator accounts for a portion of it.
>
> Clients taking any action outside of another Cockroach transaction
> (e.g. writing to another distributed system component) can either
> choose to wait the remaining interval before proceeding, or
> alternatively, pass the wait and/or commit timestamp to the
> execution of the outside action for its consideration. This pushes
> the burden of linearizability to clients, but is a useful tool in
> mitigating commit latencies if the clock skew is potentially
> large. This functionality can be used for ordering in the face of
> backchannel dependencies as mentioned in the
> [AugmentedTime](http://www.cse.buffalo.edu/~demirbas/publications/augmentedTime.pdf)
> paper.

Using these mechanisms in place of commit wait, Cockroach’s guarantee can be
formulated as follows: any process which signals the start of transaction
T<sub>2</sub> (T<sub>2</sub><sup>start</sup>) after the completion of
transaction T<sub>1</sub> (T<sub>1</sub><sup>end</sup>), will have commit
timestamps such that s<sub>1</sub> \< s<sub>2</sub>.
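
A sketch of how a client might apply the second mechanism, using an assumed commit-wait value returned with the commit response (the names and response shape are illustrative, not an actual API):

```go
package main

import (
	"fmt"
	"time"
)

// commitResult is an assumed shape for what a commit returns: the commit
// timestamp plus the remaining portion of the nominal commit wait.
type commitResult struct {
	commitTimestamp int64
	commitWait      time.Duration
}

// actOutsideCockroach waits out the remaining commit wait before taking an
// action in another system, so that the external action cannot be observed
// "before" the transaction it causally depends on. Alternatively, the caller
// could skip the sleep and forward res.commitTimestamp to the other system.
func actOutsideCockroach(res commitResult, action func()) {
	if res.commitWait > 0 {
		time.Sleep(res.commitWait)
	}
	action()
}

func main() {
	res := commitResult{commitTimestamp: 42, commitWait: 5 * time.Millisecond}
	actOutsideCockroach(res, func() { fmt.Println("notify external system") })
}
```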
# Logical Map Content

Logically, the map contains a series of reserved system key / value
pairs covering accounting, range metadata, node accounting and
permissions before the actual key / value pairs for non-system data
(e.g. the actual meat of the map).

- `\0\0meta1`: Range metadata for location of `\0\0meta2`.
- `\0\0meta1<key1>`: Range metadata for location of `\0\0meta2<key1>`.
- ...
- `\0\0meta1<keyN>`: Range metadata for location of `\0\0meta2<keyN>`.
- `\0\0meta2`: Range metadata for location of first non-range metadata key.
- `\0\0meta2<key1>`: Range metadata for location of `<key1>`.
- ...
- `\0\0meta2<keyN>`: Range metadata for location of `<keyN>`.
- `\0acct<key0>`: Accounting for key prefix key0.
- ...
- `\0acct<keyN>`: Accounting for key prefix keyN.
- `\0node<node-address0>`: Accounting data for node 0.
- ...
- `\0node<node-addressN>`: Accounting data for node N.
- `\0perm<key0><user0>`: Permissions for user0 for key prefix key0.
- ...
- `\0perm<keyN><userN>`: Permissions for userN for key prefix keyN.
- `\0tree_root`: Range key for root of range-spanning tree.
- `\0tx<tx-id0>`: Transaction record for transaction 0.
- ...
- `\0tx<tx-idN>`: Transaction record for transaction N.
- `\0zone<key0>`: Zone information for key prefix key0.
- ...
- `\0zone<keyN>`: Zone information for key prefix keyN.
- `<>acctd<metric0>`: Accounting data for Metric 0 for empty key prefix.
- ...
- `<>acctd<metricN>`: Accounting data for Metric N for empty key prefix.
- `<key0>`: `<value0>` The first user data key.
- ...
- `<keyN>`: `<valueN>` The last user data key.

There are some additional system entries sprinkled amongst the
non-system keys. See the Key-Prefix Accounting section in this document
for further details.
# Node Storage
Nodes maintain a separate instance of RocksDB for each disk. Each
RocksDB instance hosts any number of ranges. RPCs arriving at a
RoachNode are multiplexed based on the disk name to the appropriate
RocksDB instance. A single instance per disk is used to avoid
contention. If every range maintained its own RocksDB, global management
of available cache memory would be impossible and writers for each range
would compete for non-contiguous writes to multiple RocksDB logs.

In addition to the key/value pairs of the range itself, various range
metadata is maintained.

- range-spanning tree node links

- participating replicas

- consensus metadata

- split/merge activity

A really good reference on tuning Linux installations with RocksDB is
[here](http://docs.basho.com/riak/latest/ops/advanced/backends/leveldb/).
# Range Metadata
The default approximate size of a range is 64M (2\^26 B). In order to
support 1P (2\^50 B) of logical data, metadata is needed for roughly
2\^(50 - 26) = 2\^24 ranges. A reasonable upper bound on range metadata
size is roughly 256 bytes (3\*12 bytes for the triplicated node
locations and 220 bytes for the range key itself). 2\^24 ranges \* 2\^8
B would require roughly 4G (2\^32 B) to store--too much to duplicate
between machines. Our conclusion is that range metadata must be
distributed for large installations.

To keep key lookups relatively fast in the presence of distributed metadata,
we store all the top-level metadata in a single range (the first range). These
top-level metadata keys are known as *meta1* keys, and are prefixed such that
they sort to the beginning of the key space. Given the metadata size of 256
bytes given above, a single 64M range would support 64M/256B = 2\^18 ranges,
which gives a total storage of 64M \* 2\^18 = 16.7T. To support the 1P quoted
above, we need two levels of indirection, where the first level addresses the
second, and the second addresses user data. With two levels of indirection, we
can address 2\^(18 + 18) = 2\^36 ranges; each range addresses 2\^26 B, and
altogether we address 2\^(36+26) B = 2\^62 B = 4E of user data.

For a given user-addressable `key1`, the associated *meta1* record is found
at the successor key to `key1` in the *meta1* space. Since the *meta1* space
is sparse, the successor key is defined as the next key which is present. The
*meta1* record identifies the range containing the *meta2* record, which is
found using the same process. The *meta2* record identifies the range
containing `key1`, which is again found the same way (see examples below).
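
A toy model of this successor-key lookup over a sparse, sorted meta space (the in-memory maps stand in for RocksDB range scans; the key layout mirrors the examples that follow):

```go
package main

import (
	"fmt"
	"sort"
)

// lookup returns the value stored at the successor (first present key >= k)
// in a sparse, sorted key space, mimicking a Seek()/Ceil()-style scan.
func lookup(index map[string]string, k string) string {
	keys := make([]string, 0, len(index))
	for key := range index {
		keys = append(keys, key)
	}
	sort.Strings(keys)
	i := sort.SearchStrings(keys, k)
	return index[keys[i]] // assumes a \xff sentinel entry always exists
}

func main() {
	// meta1 and meta2 entries keyed by range end key, as in the examples below.
	meta1 := map[string]string{"\x00\x00meta1\xff": "range holding meta2"}
	meta2 := map[string]string{
		"\x00\x00meta2ccc":  "range [aaa, ccc)",
		"\x00\x00meta2\xff": "range [ccc, ...)",
	}

	userKey := "banana"
	fmt.Println(lookup(meta1, "\x00\x00meta1"+userKey)) // step 1: find meta2's range
	fmt.Println(lookup(meta2, "\x00\x00meta2"+userKey)) // step 2: find the data range
	// step 3: read userKey from the returned data range.
}
```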
Concretely, metadata keys are prefixed by `\0\0meta{1,2}`; the two null
characters provide for the desired sorting behaviour. Thus, `key1`'s
*meta1* record will reside at the successor key to `\0\0meta1<key1>`.

Note: we append the end key of each range to meta[12] records because
the RocksDB iterator only supports a Seek() interface which acts as a
Ceil(). Using the start key of the range would cause Seek() to find the
key *after* the meta indexing record we’re looking for, which would
result in having to back the iterator up, an option which is both less
efficient and not available in all cases.

The following example shows the directory structure for a map with
three ranges worth of data. Ellipses indicate additional key/value pairs to
fill an entire range of data. Except for the fact that splitting ranges
requires updates to the range metadata with knowledge of the metadata layout,
the range metadata itself requires no special treatment or bootstrapping.

**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
`dcrama3:8000`)

- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2<lastkey0>`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2<lastkey1>`: `dcrama4:8000`, `dcrama5:8000`, `dcrama6:8000`
- `\0\0meta2\xff`: `dcrama7:8000`, `dcrama8:8000`, `dcrama9:8000`
- ...
- `<lastkey0>`: `<lastvalue0>`

**Range 1** (located on servers `dcrama4:8000`, `dcrama5:8000`,
`dcrama6:8000`)

- ...
- `<lastkey1>`: `<lastvalue1>`

**Range 2** (located on servers `dcrama7:8000`, `dcrama8:8000`,
`dcrama9:8000`)

- ...
- `<lastkey2>`: `<lastvalue2>`

Consider a simpler example of a map containing less than a single
range of data. In this case, all range metadata and all data are
located in the same range:

**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
`dcrama3:8000`)

- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `<key0>`: `<value0>`
- `...`

Finally, a map large enough to need both levels of indirection would
look like (note that instead of showing range replicas, this
example is simplified to just show range indexes):

**Range 0**

- `\0\0meta1<lastkeyN-1>`: Range 0
- `\0\0meta1\xff`: Range 1
- `\0\0meta2<lastkey1>`: Range 1
- `\0\0meta2<lastkey2>`: Range 2
- `\0\0meta2<lastkey3>`: Range 3
- ...
- `\0\0meta2<lastkeyN-1>`: Range 262143

**Range 1**

- `\0\0meta2<lastkeyN>`: Range 262144
- `\0\0meta2<lastkeyN+1>`: Range 262145
- ...
- `\0\0meta2\xff`: Range 500,000
- ...
- `<lastkey1>`: `<lastvalue1>`

**Range 2**

- ...
- `<lastkey2>`: `<lastvalue2>`

**Range 3**

- ...
- `<lastkey3>`: `<lastvalue3>`

**Range 262144**

- ...
- `<lastkeyN>`: `<lastvalueN>`

**Range 262145**

- ...
- `<lastkeyN+1>`: `<lastvalueN+1>`

Note that the choice of range `262144` is just an approximation. The
actual number of ranges addressable via a single metadata range is
dependent on the size of the keys. If efforts are made to keep key sizes
small, the total number of addressable ranges would increase and vice
versa.

From the examples above it’s clear that key location lookups require at
most three reads to get the value for `<key>`:

1. lower bound of `\0\0meta1<key>`
2. lower bound of `\0\0meta2<key>`
3. `<key>`

For small maps, the entire lookup is satisfied in a single RPC to Range 0. Maps
containing less than 16T of data would require two lookups. Clients cache both
levels of range metadata, and we expect that data locality for individual
clients will be high. Clients may end up with stale cache entries. If on a
lookup, the range consulted does not match the client’s expectations, the
client evicts the stale entries and possibly does a new lookup.
# Raft - Consistency of Range Replicas
Each range is configured to consist of three or more replicas, as specified by
their ZoneConfig. The replicas in a range maintain their own instance of a
distributed consensus algorithm. We use the [*Raft consensus algorithm*](https://raftconsensus.github.io)
as it is simpler to reason about and includes a reference implementation
covering important details.
[ePaxos](https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf) has
promising performance characteristics for WAN-distributed replicas, but
it does not guarantee a consistent ordering between replicas.

Raft elects a relatively long-lived leader which must be involved to
propose commands. It heartbeats followers periodically and keeps their logs
replicated. In the absence of heartbeats, followers become candidates
after randomized election timeouts and proceed to hold new leader
elections. Cockroach weights random timeouts such that the replicas with
shorter round trip times to peers are more likely to hold elections
first (not implemented yet). Only the Raft leader may propose commands;
followers will simply relay commands to the last known leader.

Our Raft implementation was developed together with CoreOS, but adds an extra
layer of optimization to account for the fact that a single Node may have
millions of consensus groups (one for each Range). Areas of optimization
are chiefly coalesced heartbeats (so that the number of nodes dictates the
number of heartbeats as opposed to the much larger number of ranges) and
batch processing of requests.
Future optimizations may include two-phase elections and quiescent ranges
(i.e. stopping traffic completely for inactive ranges).
# Range Leadership (Leader Leases)
As outlined in the Raft section, the replicas of a Range are organized as a
Raft group and execute commands from their shared commit log. Going through
Raft is an expensive operation though, and there are tasks which should only be
carried out by a single replica at a time (as opposed to all of them).

For these reasons, Cockroach introduces the concept of **Range Leadership**:
This is a lease held for a slice of (database, i.e. hybrid logical) time and is
established by committing a special log entry through Raft containing the
interval the leadership is going to be active on, along with the Node:RaftID
combination that uniquely describes the requesting replica. Reads and writes
must generally be addressed to the replica holding the lease; if none does, any
replica may be addressed, causing it to try to obtain the lease synchronously.
Requests received by a non-leader (for the HLC timestamp specified in the
request's header) fail with an error pointing at the replica's last known
leader. These requests are retried transparently with the updated leader by the
gateway node and never reach the client.

The replica holding the lease is in charge of or involved in handling
Range-specific maintenance tasks such as

* gossiping the sentinel and/or first range information
* splitting, merging and rebalancing
* handling message queues (see below)

and, very importantly, may satisfy reads locally, without incurring the
overhead of going through Raft.

Since reads bypass Raft, a new lease holder will, among other things, ascertain
that its timestamp cache does not report timestamps smaller than the previous
lease holder's (so that it's compatible with reads which may have occurred on
the former leader). This is accomplished by setting the low water mark of the
timestamp cache to the expiration of the previous lease plus the maximum clock
offset.
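
In other words, a new lease holder would adjust its timestamp cache roughly as follows (a sketch; the lease fields and helper name are assumptions):

```go
package main

import "fmt"

// applyNewLease computes the read timestamp cache low water mark a new lease
// holder must adopt: the previous lease's expiration plus the maximum clock
// offset, so any read served by the former leader is necessarily covered.
func applyNewLease(prevLeaseExpiration, maxClockOffset, currentLowWater int64) int64 {
	if lw := prevLeaseExpiration + maxClockOffset; lw > currentLowWater {
		return lw
	}
	return currentLowWater
}

func main() {
	fmt.Println(applyNewLease(1000, 250, 900)) // 1250: earlier writes will be pushed above this
}
```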
## Relationship to Raft leadership

Range leadership is completely separate from Raft leadership, and so without
further efforts, Raft and Range leadership may not be represented by the same
replica most of the time. This is convenient semantically since it decouples
these two types of leadership and allows the use of Raft as a "black box", but
for reasons of performance, it is desirable to have both on the same replica.
Otherwise, sending a command through Raft always incurs the overhead of being
proposed to the Range leader's Raft instance first, which must relay it to the
Raft leader, which finally commits it into the log and updates its followers,
including the Range leader. This yields correct results but wastes several
round-trip delays, and so we will make sure that in the vast majority of cases
Range and Raft leadership coincide. A fairly easy method for achieving this is
to have each new lease period (extension or new) be accompanied by a
stipulation to the lease holder's replica to start Raft elections (unless it's
already leading), though some care should be taken that Range leadership is
relatively stable and long-lived to avoid a large number of Raft leadership
transitions.
# Splitting / Merging Ranges
RoachNodes split or merge ranges based on whether they exceed maximum or
minimum thresholds for capacity or load. Ranges exceeding maximums for
either capacity or load are split; ranges below minimums for *both*
capacity and load are merged.

Ranges maintain the same accounting statistics as accounting key
prefixes. These boil down to a time series of data points with minute
granularity. Everything from number of bytes to read/write queue sizes.
Arbitrary distillations of the accounting stats can be determined as the
basis for splitting / merging. Two sensical metrics for use with
split/merge are range size in bytes and IOps. A good metric for
rebalancing a replica from one node to another would be total read/write
queue wait times. These metrics are gossipped, with each range / node
passing along relevant metrics if they’re in the bottom or top of the
range it’s aware of.

A range finding itself exceeding either capacity or load threshold
splits. To this end, the range leader computes an appropriate split key
candidate and issues the split through Raft. In contrast to splitting,
merging requires a range to be below the minimum threshold for both
capacity *and* load. A range being merged chooses the smaller of the
ranges immediately preceding and succeeding it.

Splitting, merging, rebalancing and recovering all follow the same basic
algorithm for moving data between roach nodes. New target replicas are
created and added to the replica set of the source range. Then each new
replica is brought up to date by either replaying the log in full or
copying a snapshot of the source replica data and then replaying the log
from the timestamp of the snapshot to catch up fully. Once the new
replicas are fully up to date, the range metadata is updated and old,
source replica(s) deleted if applicable.

**Coordinator** (leader replica)

```
if splitting
  SplitRange(split_key): splits happen locally on range replicas and
  only after being completed locally, are moved to new target replicas.
else if merging
  Choose new replicas on same servers as target range replicas;
  add to replica set.
else if rebalancing || recovering
  Choose new replica(s) on least loaded servers; add to replica set.
```

**New Replica**

*Bring replica up to date:*

```
if all info can be read from replicated log
  copy replicated log
else
  snapshot source replica
  send successive ReadRange requests to source replica
  referencing snapshot

if merging
  combine ranges on all replicas
else if rebalancing || recovering
  remove old range replica(s)
```

RoachNodes split ranges when the total data in a range exceeds a
configurable maximum threshold. Similarly, ranges are merged when the
total data falls below a configurable minimum threshold.

**TBD: flesh this out**: Especially for merges (but also rebalancing) we have a
range disappearing from the local node; that range needs to disappear
gracefully, with a smooth handoff of operation to the new owner of its data.

Ranges are rebalanced if a node determines its load or capacity is one
of the worst in the cluster based on gossipped load stats. A node with
spare capacity is chosen in the same datacenter and a special-case split
is done which simply duplicates the data 1:1 and resets the range
configuration metadata.
# Message Queues
Each range maintains an array of incoming message queues, referred to
here as **inboxes**. Additionally, each range maintains and *processes*
an array of outgoing message queues, referred to here as **outboxes**.
Both inboxes and outboxes are assigned to keys; messages can be sent or
received on behalf of any key. Inboxes and outboxes can contain any
number of pending messages.

Messages can be *deliverable* or *executable*.

Deliverable messages are defined by Value objects - simple byte arrays -
that are delivered to a key’s inbox, awaiting collection by a client
invoking the ReapQueue operation. These are typically used by client
applications wishing to be notified of changes to an entry for further
processing, such as expensive offline operations like sending emails,
SMSs, etc.

Executable messages are *outgoing-only*, and are instances of
PutRequest, IncrementRequest, DeleteRequest, DeleteRangeRequest
or AccountingRequest. Rather than being delivered to a key’s inbox, they are
executed when encountered. These are primarily useful when updates that
are nominally part of a transaction can tolerate asynchronous execution
(e.g. eventual consistency), and are otherwise too busy or numerous to
make including them in the original [distributed] transaction efficient.
Examples may include updates to the accounting for successive key
prefixes (potentially busy) or updates to a full-text index (potentially
numerous).

These two types of messages are enqueued in different outboxes too - see
key formats below.

At commit time, the range processing the transaction places messages
into a shared outbox located at the start of the range metadata. This is
effectively free as it’s part of the same consensus write for the range
as the COMMIT record. Outgoing messages are processed asynchronously by
the range. To make processing easy, all outboxes are co-located at the
start of the range. To make lookup easy, all inboxes are located
immediately after the recipient key. The leader replica of a range is
responsible for processing message queues.

A dispatcher polls a given range’s *deliverable message outbox*
periodically (configurable), and delivers those messages to the target
key’s inbox. The dispatcher is also woken up whenever a new message is
added to the outbox. A separate executor also polls the range’s
*executable message outbox* periodically as well (again, configurable),
and executes those commands. The executor, too, is woken up whenever a
new message is added to the outbox.

Key formats follow below. Notice that inbox messages for a
given key sort by the `<outbox-timestamp>`. This doesn’t provide a
precise ordering, but it does allow clients to scan messages in an
approximate ordering of when they were originally lodged with senders.
NTP offers walltime deltas to within 100s of milliseconds. The
`<sender-range-key>` suffix provides uniqueness.

**Outbox**
`<sender-range-key>deliverable-outbox:<recipient-key><outbox-timestamp>`
`<sender-range-key>executable-outbox:<recipient-key><outbox-timestamp>`

**Inbox**
`<recipient-key>inbox:<outbox-timestamp><sender-range-key>`
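
For concreteness, a sketch of how such keys might be assembled and how inbox entries for one recipient end up adjacent and timestamp-ordered (plain string concatenation and zero-padded timestamps stand in for the real ordered key encoding):

```go
package main

import (
	"fmt"
	"sort"
)

// inboxKey builds an inbox entry key: it sorts immediately after the
// recipient key, ordered by the sender's outbox timestamp.
func inboxKey(recipient string, outboxTimestamp int64, senderRangeKey string) string {
	return fmt.Sprintf("%sinbox:%020d%s", recipient, outboxTimestamp, senderRangeKey)
}

// deliverableOutboxKey builds the corresponding outbox entry on the sender.
func deliverableOutboxKey(senderRangeKey, recipient string, outboxTimestamp int64) string {
	return fmt.Sprintf("%sdeliverable-outbox:%s%020d", senderRangeKey, recipient, outboxTimestamp)
}

func main() {
	keys := []string{
		inboxKey("user/42", 170, "rangeB"),
		inboxKey("user/42", 150, "rangeA"),
		deliverableOutboxKey("rangeA", "user/42", 150),
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k) // inbox entries for user/42 group together, oldest timestamp first
	}
}
```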
Messages are processed and then deleted as part of a single distributed
transaction. The message will be executed or delivered exactly once,
regardless of failures at either sender or receiver.

Delivered messages may be read by clients via the ReapQueue operation.
This operation may only be used as part of a transaction. Clients should
commit only after having processed the message. If the transaction is
committed, scanned messages are automatically deleted. The operation
name was chosen to reflect its mutating side effect. Deletion of read
messages is mandatory because senders deliver messages asynchronously
and a delay could cause insertion of messages at arbitrary points in the
inbox queue. If clients require persistence, they should re-save read
messages manually; the ReapQueue operation can be incorporated into
normal transactional updates.
# Range-Spanning Binary Tree
1031
1032
A crucial enhancement to the organization of range metadata is to
1033
augment the bi-level range metadata lookup with a minimum spanning tree,
1034
implemented as a left-leaning red-black tree over all ranges in the map.
1035
This tree structure allows the system to start at any key prefix and
1036
efficiently traverse an arbitrary key range with minimal RPC traffic,
1037
minimal fan-in and fan-out, and with bounded time complexity equal to
1038
`2*log N` steps, where `N` is the total number of ranges in the system.
1039
1040
Unlike the range metadata rows prefixed with `\0\0meta[1|2]`, the
1041
metadata for the range-spanning tree (e.g. parent range and left / right
1042
child ranges) is stored directly at the ranges as non-map metadata. The
1043
metadata for each node of the tree (e.g. links to parent range, left
1044
child range, and right child range) is stored with the range metadata.
1045
In effect, the tree metadata is stored implicitly. In order to traverse
1046
the tree, for example, you’d need to query each range in turn for its
1047
metadata.
1048
1049
Any time a range is split or merged, both the bi-level range lookup
1050
metadata and the per-range binary tree metadata are updated as part of
1051
the same distributed transaction. The total number of nodes involved in
1052
the update is bounded by 2 + log N (i.e. 2 updates for meta1 and
1053
meta2, and up to log N updates to balance the range-spanning tree).
1054
The range corresponding to the root node of the tree is stored in
1055
*\0tree_root*.
1056
1057
As an example, consider the following set of nine ranges and their
associated range-spanning tree:

R0: `aa - cc`, R1: `*cc - lll`, R2: `*lll - llr`, R3: `*llr - nn`, R4: `*nn - rr`, R5: `*rr - ssss`, R6: `*ssss - sst`, R7: `*sst - vvv`, R8: `*vvv - zzzz`.

![Range Tree](media/rangetree.png)

The range-spanning tree has many beneficial uses in Cockroach. It makes
the problem of efficiently aggregating accounting information over
potentially vast ranges of data tractable. Imagine a subrange of data
over which accounting is being kept, such as the *photos* table in
a public photo sharing site. To efficiently keep track of data about the
table (e.g. total size, number of rows, etc.), messages can be passed
first up the tree and then down to the left until updates arrive at the
key prefix under which accounting is aggregated. This makes the worst case
number of hops for an update to propagate into the accounting totals
2 \* log N. A 64T database at the default 64M range size comprises 1M
ranges, meaning 40 hops worst case. In our experience, accounting tasks
over vast ranges of data are most often map/reduce jobs scheduled with
coarse-grained periodicity. By contrast, we expect Cockroach to maintain
statistics with sub-10s accuracy and with minimal cycles and minimal IOPs.

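As a quick check on the arithmetic, the worst-case hop count can be computed directly. This minimal sketch simply restates the numbers above, assuming the 64M default range target from the overview.

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const (
		dbSize    = 64 << 40 // 64T of logical data
		rangeSize = 64 << 20 // 64M default range target
	)
	numRanges := float64(dbSize / rangeSize)              // 1M ranges
	worstCaseHops := 2 * math.Ceil(math.Log2(numRanges))  // up the tree, then back down
	fmt.Printf("ranges=%.0f worst-case hops=%.0f\n", numRanges, worstCaseHops)
	// Output: ranges=1048576 worst-case hops=40
}
```
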
Another use for the range-spanning tree is to push accounting, zone, and
permission configurations to all ranges. In the case of zones and
permissions, this is an efficient way to pass updated configuration
information with exponential fan-out. When adding accounting
configurations (i.e. specifying a new key prefix to track), the
implicated ranges are transactionally scanned and zero-state accounting
information is computed as well. Deleting accounting configurations is
similar, except accounting records are deleted.

Last but *not* least, the range-spanning tree provides a convenient
mechanism for planning and executing parallel queries. This provides the
basis for
[Dremel](http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36632.pdf)-like
query execution trees, and it’s easy to imagine supporting a subset of
SQL or even JavaScript-based user functions for complex data analysis
tasks.

# Node Allocation (via Gossip)

New nodes must be allocated when a range is split. Instead of requiring
every RoachNode to know about the status of all or even a large number
of peer nodes, or alternatively requiring a specialized curator or
master with sufficiently global knowledge, we use a gossip protocol to
efficiently communicate only interesting information between all of the
nodes in the cluster. What’s interesting information? One example would
be whether a particular node has a lot of spare capacity. Each node,
when gossiping, compares each topic of gossip to its own state. If its
own state is somehow “more interesting” than the least interesting item
in the topic it’s seen recently, it includes its own state as part of
the next gossip session with a peer node. In this way, a node with
capacity sufficiently in excess of the mean quickly becomes discovered
by the entire cluster. To avoid piling onto outliers, nodes from the
high capacity set are selected at random for allocation.

The gossip protocol itself contains two primary components:

- **Peer Selection**: each node maintains up to N peers with which it
  regularly communicates. It selects peers with an eye towards
  maximizing fanout. A peer node which itself communicates with an
  array of otherwise unknown nodes will be selected over one which
  communicates with a set containing significant overlap. Each time
  gossip is initiated, each node’s set of peers is exchanged. Each
  node is then free to incorporate the other’s peers as it sees fit.
  To avoid any node suffering from excess incoming requests, a node
  may refuse to answer a gossip exchange. Each node is biased
  towards answering requests from nodes without significant overlap
  and refusing requests otherwise.

  Peers are efficiently selected using a heuristic as described in
  [Agarwal & Trachtenberg (2006)](https://drive.google.com/file/d/0B9GCVTp_FHJISmFRTThkOEZSM1U/edit?usp=sharing).

  **TBD**: how to avoid partitions? Need to work out a simulation of
  the protocol to tune the behavior and see empirically how well it
  works.

- **Gossip Selection**: what to communicate. Gossip is divided into
  topics. Load characteristics (capacity per disk, CPU load, and
  state [e.g. draining, ok, failure]) are used to drive node
  allocation. Range statistics (range read/write load, missing
  replicas, unavailable ranges) and network topology (inter-rack
  bandwidth/latency, inter-datacenter bandwidth/latency, subnet
  outages) are used for determining when to split ranges, when to
  recover replicas vs. wait for network connectivity, and for
  debugging / sysops. In all cases, a set of minimums and a set of
  maximums is propagated; each node applies its own view of the
  world to augment the values. Each minimum and maximum value is
  tagged with the reporting node and other accompanying contextual
  information. Each topic of gossip has its own protobuf to hold the
  structured data. The number of items of gossip in each topic is
  limited by a configurable bound.

  For efficiency, nodes assign each new item of gossip a sequence
  number and keep track of the highest sequence number each peer
  node has seen. Each round of gossip communicates only the delta
  containing new items, as sketched after this list.

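The sequence-number bookkeeping lends itself to a small sketch. The types and field names below are illustrative assumptions, not the actual gossip implementation.

```go
package main

import "fmt"

// Item is one piece of gossip; Seq is assigned by the originating node.
type Item struct {
	Seq   int64
	Topic string
	Value string
}

// node tracks its own gossip items plus the highest sequence number each
// peer is known to have seen (peer address -> seq).
type node struct {
	items    []Item
	peerSeen map[string]int64
}

// delta returns only the items the given peer has not yet seen and
// records the new high-water mark for that peer.
func (n *node) delta(peer string) []Item {
	seen := n.peerSeen[peer]
	var out []Item
	for _, it := range n.items {
		if it.Seq > seen {
			out = append(out, it)
			if it.Seq > n.peerSeen[peer] {
				n.peerSeen[peer] = it.Seq
			}
		}
	}
	return out
}

func main() {
	n := &node{
		items: []Item{
			{1, "capacity", "node1: 500G free"},
			{2, "capacity", "node2: 20G free"},
			{3, "ranges", "range 17 missing replica"},
		},
		peerSeen: map[string]int64{"peer-a": 1},
	}
	fmt.Println(n.delta("peer-a")) // items 2 and 3 only
	fmt.Println(n.delta("peer-a")) // nothing new: empty slice
}
```
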
# Node Accounting

The gossip protocol discussed in the previous section is useful to
quickly communicate fragments of important information in a
decentralized manner. However, complete accounting for each node is also
stored to a central location, available to any dashboard process. This
is done using the map itself. Each node periodically writes its state to
the map with keys prefixed by `\0node`, similar to the first level of
range metadata, but with a ‘`node`’ suffix. Each value is a protobuf
containing the full complement of node statistics--everything
communicated normally via the gossip protocol plus other useful, but
non-critical data.

The range containing the first key in the node accounting table is
responsible for gossiping the total count of nodes. This total count is
used by the gossip network to most efficiently organize itself. In
particular, the maximum number of hops for gossiped information to take
before reaching a node is given by `ceil(log(node count) / log(max
fanout)) + 1`.

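For concreteness, the hop bound can be evaluated directly; the node count and fanout below are arbitrary example values.

```go
package main

import (
	"fmt"
	"math"
)

// maxGossipHops computes the bound quoted above:
// ceil(log(node count) / log(max fanout)) + 1.
func maxGossipHops(nodeCount, maxFanout int) int {
	return int(math.Ceil(math.Log(float64(nodeCount))/math.Log(float64(maxFanout)))) + 1
}

func main() {
	// With 1000 nodes and a fanout of 3, information reaches every node
	// in at most ceil(log3(1000)) + 1 = 8 hops.
	fmt.Println(maxGossipHops(1000, 3))
}
```
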
# Key-prefix Accounting, Zones & Permissions

Arbitrarily fine-grained accounting and permissions are specified via
key prefixes. Key prefixes can overlap, as is necessary for capturing
hierarchical relationships. For illustrative purposes, let’s say keys
specifying rows in a set of databases have the following format:

`<db>:<table>:<primary-key>[:<secondary-key>]`

In this case, we might collect accounting or specify permissions with
key prefixes:

`db1`, `db1:user`, `db1:order`

Accounting is kept for the entire map by default.

## Accounting

To keep accounting for a range defined by a key prefix, an entry is created in
the accounting system table. The format of accounting table keys is:

`\0acct<key-prefix>`

In practice, we assume each RoachNode is capable of caching the
entire accounting table as it is likely to be relatively small.

Accounting is kept for key prefix ranges with eventual consistency
for efficiency. Updates to accounting values propagate through the
system using the message queue facility if the accounting keys do
not reside on the same range as ongoing activity (true for all but
the smallest systems). There are two types of values which
comprise accounting: counts and occurrences, for lack of better
terms. Counts describe system state, such as the total number of
bytes, rows, etc. Occurrences include transient performance and
load metrics. Both types of accounting are captured as time series
with minute granularity. The length of time accounting metrics are
kept is configurable. Below are examples of each type of
accounting value.

**System State Counters/Performance**

- Count of items (e.g. rows)
- Total bytes
- Total key bytes
- Total value length
- Queued message count
- Queued message total bytes
- Count of values \< 16B
- Count of values \< 64B
- Count of values \< 256B
- Count of values \< 1K
- Count of values \< 4K
- Count of values \< 16K
- Count of values \< 64K
- Count of values \< 256K
- Count of values \< 1M
- Count of values \> 1M
- Total bytes of accounting

**Load Occurrences**

- Get op count
- Get total MB
- Put op count
- Put total MB
- Delete op count
- Delete total MB
- Delete range op count
- Delete range total MB
- Scan op count
- Scan op MB
- Split count
- Merge count

Because accounting information is kept as time series and over many
possible metrics of interest, the data can become voluminous. Accounting
data are stored in the map near the key prefix described, in order to
distribute load (for both aggregation and storage).

Accounting keys for system state have the form:
`<key-prefix>|acctd<metric-name>*`. Notice the leading ‘pipe’
character. It’s meant to sort the root level account AFTER any other
system tables. Because these are permanent counts rather than transient
activity, updates must increment the same underlying values. Logic at the
RoachNode takes care of snapshotting the value into an appropriately
suffixed (e.g. with timestamp hour) multi-value time series entry.

Keys for perf/load metrics:
`<key-prefix>acctd<metric-name><hourly-timestamp>`.

`<hourly-timestamp>`-suffixed accounting entries are multi-valued,
containing a varint64 entry for each minute with activity during the
specified hour.

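A minimal sketch of how such keys might be assembled follows. The metric names, the hour-granularity timestamp encoding, and the helper functions are assumptions for illustration, not the actual key encoding.

```go
package main

import (
	"fmt"
	"time"
)

// systemStateKey names a permanent counter such as "total-bytes".
func systemStateKey(keyPrefix, metric string) string {
	return fmt.Sprintf("%s|acctd%s", keyPrefix, metric)
}

// loadMetricKey names an hour's worth of per-minute values for a
// transient metric such as "get-op-count".
func loadMetricKey(keyPrefix, metric string, t time.Time) string {
	hour := t.UTC().Truncate(time.Hour)
	return fmt.Sprintf("%sacctd%s%s", keyPrefix, metric, hour.Format("2006010215"))
}

func main() {
	ts := time.Date(2014, 7, 1, 14, 23, 0, 0, time.UTC)
	fmt.Println(systemStateKey("db1:user", "total-bytes"))
	fmt.Println(loadMetricKey("db1:user", "get-op-count", ts))
}
```
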
To efficiently keep accounting over large key ranges, the task of
aggregation must be distributed. If activity occurs within the same
range as the key prefix for accounting, the updates are made as part
of the consensus write. If the ranges differ, then a message is sent
to the parent range to increment the accounting. If, upon receiving the
message, the parent range also does not include the key prefix, it in
turn forwards it to its parent or left child in the balanced binary
tree which is maintained to describe the range hierarchy. This limits
the number of messages before an update is visible at the root to `2*log N`,
where `N` is the number of ranges in the key prefix.

## Zones

Zones are stored in the map with keys prefixed by
`\0zone` followed by the key prefix to which the zone
configuration applies. Zone values specify a protobuf containing
the datacenters from which replicas for ranges which fall under
the zone must be chosen.

Please see [proto/config.proto](https://github.com/cockroachdb/cockroach/blob/master/proto/config.proto) for up-to-date data structures used, the best entry point being `message ZoneConfig`.

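To make the shape of a zone concrete, here is an illustrative struct mirroring the prose; it is not the actual `ZoneConfig` proto, which remains the authoritative schema.

```go
package main

import "fmt"

// zoneConfig mirrors the prose above in plain Go: a zone names the
// datacenters (and, optionally, storage attributes) from which each
// replica of a range under the zone's key prefix must be chosen.
type zoneConfig struct {
	// One entry per desired replica, e.g. "us-east:ssd".
	Replicas []string
}

func main() {
	keyPrefix := "db1"
	z := zoneConfig{Replicas: []string{"us-east:ssd", "us-west:ssd", "japan:hdd"}}
	// The configuration would be stored under `\0zone<key-prefix>`.
	fmt.Printf("key=%q value=%+v\n", "\x00zone"+keyPrefix, z)
}
```
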
If zones are modified in situ, each RoachNode verifies the
existing zones for its ranges against the zone configuration. If
it discovers differences, it reconfigures ranges in the same way
that it rebalances away from busy nodes, via a special-case 1:1
split to a duplicate range comprising the new configuration.

## Permissions

Permissions are stored in the map with keys prefixed by `\0perm` followed by
the key prefix and user to which the specified permissions apply. The format of
permissions keys is:

`\0perm<key-prefix><user>`

Permission values are a protobuf containing the permission details;
please see [proto/config.proto](https://github.com/cockroachdb/cockroach/blob/master/proto/config.proto) for up-to-date data structures used, the best entry point being `message PermConfig`.

A default system root permission is assumed for the entire map
with an empty key prefix and read/write as true.

When determining whether or not to allow a read or write of a key
value (e.g. `db1:user:1` for user `spencer`), a RoachNode would
read the following permissions values:

```
\0perm<db1:user:1>spencer
\0perm<db1:user>spencer
\0perm<db1>spencer
\0perm<>spencer
```

If any prefix in the hierarchy provides the required permission,
the request is satisfied; otherwise, the request returns an
error.

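The prefix walk can be sketched as follows. Splitting on ‘:’ and the literal angle brackets simply mirror the example above; the actual key encoding may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// permKeysToCheck returns the permission keys a node would consult for a
// given data key and user, from most to least specific. The prefix is
// shortened one ':'-separated component at a time, ending with the empty
// (root) prefix.
func permKeysToCheck(dataKey, user string) []string {
	parts := strings.Split(dataKey, ":")
	var keys []string
	for i := len(parts); i >= 0; i-- {
		prefix := strings.Join(parts[:i], ":")
		keys = append(keys, fmt.Sprintf("\x00perm<%s>%s", prefix, user))
	}
	return keys
}

func main() {
	for _, k := range permKeysToCheck("db1:user:1", "spencer") {
		fmt.Printf("%q\n", k)
	}
	// Prints \x00perm<db1:user:1>spencer, then <db1:user>, <db1>, and <>.
}
```
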
The priority for a user permission is used to order requests at
Raft consensus ranges and for choosing an initial priority for
distributed transactions. When scheduling operations at the Raft
consensus range, all outstanding requests are ordered by key
prefix and each is assigned priorities according to key, user and
arrival time. The next request is chosen probabilistically, using
priorities to weight the choice. Each key can have multiple
priorities, as they’re hierarchical (e.g. for /user/key, one
priority for root ‘/’, and one for ‘/user/key’). The most general
priority is used first. If two keys share the most general, then
they’re compared with the next most general if applicable, and so on.

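A generic weighted random choice captures the probabilistic selection step; this sketch ignores the ordering by key prefix, user, and arrival time described above.

```go
package main

import (
	"fmt"
	"math/rand"
)

// request is a pending operation with an effective priority weight.
type request struct {
	key      string
	priority float64
}

// pickNext chooses the next request to schedule, with probability
// proportional to its priority.
func pickNext(rng *rand.Rand, reqs []request) request {
	var total float64
	for _, r := range reqs {
		total += r.priority
	}
	target := rng.Float64() * total
	for _, r := range reqs {
		if target < r.priority {
			return r
		}
		target -= r.priority
	}
	return reqs[len(reqs)-1] // guard against floating point rounding
}

func main() {
	rng := rand.New(rand.NewSource(42))
	reqs := []request{
		{key: "/user/key1", priority: 1},
		{key: "/admin/key2", priority: 5},
	}
	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		counts[pickNext(rng, reqs).key]++
	}
	fmt.Println(counts) // the priority-5 key wins roughly 5x as often
}
```
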
# Key-Value API

See the protobufs in [proto/](https://github.com/cockroachdb/cockroach/blob/master/proto),
in particular [proto/api.proto](https://github.com/cockroachdb/cockroach/blob/master/proto/api.proto) and the comments within.

# Structured Data API

A preliminary design can be found in the [Go source documentation](http://godoc.org/github.com/cockroachdb/cockroach/structured).