1
# About
2
This document is an updated version of the original design documents
3
by Spencer Kimball from early 2014.
4
5
# Overview
6
7
Cockroach is a distributed key:value datastore (SQL and structured
8
data layers of cockroach have yet to be defined) which supports **ACID
9
transactional semantics** and **versioned values** as first-class
10
features. The primary design goal is **global consistency and
11
survivability**, hence the name. Cockroach aims to tolerate disk,
12
machine, rack, and even **datacenter failures** with minimal latency
13
disruption and **no manual intervention**. Cockroach nodes are
14
symmetric; a design goal is **homogeneous deployment** (one binary) with
15
minimal configuration.
16
17
Cockroach implements a **single, monolithic sorted map** from key to
18
value where both keys and values are byte strings (not unicode).
19
Cockroach **scales linearly** (theoretically up to 4 exabytes (4E) of
20
logical data). The map is composed of one or more ranges and each range
21
is backed by data stored in [RocksDB](http://rocksdb.org/) (a
22
variant of LevelDB), and is replicated to a total of three or more
23
cockroach servers. Ranges are defined by start and end keys. Ranges are
24
merged and split to maintain total byte size within a globally
25
configurable min/max size interval. The target range size defaults to `64M` in
26
order to facilitate quick splits and merges and to distribute load at
27
hotspots within a key range. Range replicas are intended to be located
28
in disparate datacenters for survivability (e.g. `{ US-East, US-West,
29
Japan }`, `{ Ireland, US-East, US-West}`, `{ Ireland, US-East, US-West,
30
Japan, Australia }`).
31
32
Single mutations to ranges are mediated via an instance of a distributed
33
consensus algorithm to ensure consistency. We’ve chosen to use the
34
[Raft consensus
35
algorithm](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf).
36
All consensus state is stored in RocksDB.
37
38
A single logical mutation may affect multiple key/value pairs. Logical
39
mutations have ACID transactional semantics. If all keys affected by a
40
logical mutation fall within the same range, atomicity and consistency
41
are guaranteed by Raft; this is the **fast commit path**. Otherwise, a
42
**non-locking distributed commit** protocol is employed between affected
43
ranges.
44
45
Cockroach provides [snapshot isolation](http://en.wikipedia.org/wiki/Snapshot_isolation) (SI) and
46
serializable snapshot isolation (SSI) semantics, allowing **externally
47
consistent, lock-free reads and writes**--both from a historical
48
snapshot timestamp and from the current wall clock time. SI provides
49
lock-free reads and writes but still allows write skew. SSI eliminates
50
write skew, but introduces a performance hit in the case of a
51
contentious system. SSI is the default isolation; clients must
52
consciously decide to trade correctness for performance. Cockroach
53
implements [a limited form of linearizability](#linearizability),
54
providing ordering for any observer or chain of observers.
55
56
Similar to
57
[Spanner](http://static.googleusercontent.com/media/research.google.com/en/us/archive/spanner-osdi2012.pdf)
58
directories, Cockroach allows configuration of arbitrary zones of data.
59
This allows replication factor, storage device type, and/or datacenter
60
location to be chosen to optimize performance and/or availability.
61
Unlike Spanner, zones are monolithic and don’t allow movement of
62
fine-grained data on the level of entity groups.
63
64
A
65
[Megastore](http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf)-like
66
message queue mechanism is also provided to 1) efficiently sideline
67
updates which can tolerate asynchronous execution and 2) provide an
68
integrated message queuing system for asynchronous communication between
69
distributed system components.
70
71
# Architecture
72
73
Cockroach implements a layered architecture. The highest level of
74
abstraction is the SQL layer (currently unspecified in this document).
75
It depends directly on the [*structured data
76
API*](#structured-data-api), which provides familiar relational concepts
77
such as schemas, tables, columns, and indexes. The structured data API
78
in turn depends on the [distributed key value store](#key-value-api),
79
which handles the details of range addressing to provide the abstraction
80
of a single, monolithic key value store. The distributed KV store
81
communicates with any number of physical cockroach nodes. Each node
82
contains one or more stores, one per physical device.
83
84
![Architecture](media/architecture.png)
85
86
Each store contains potentially many ranges, the lowest-level unit of
87
key-value data. Ranges are replicated using the Raft consensus protocol.
88
The diagram below is a blown-up version of stores from four of the five
89
nodes in the previous diagram. Each range is replicated three ways using
90
Raft. The color coding shows associated range replicas.
91
92
![Ranges](media/ranges.png)
93
94
Each physical node exports a RoachNode service. Each RoachNode exports
95
one or more key ranges. RoachNodes are symmetric. Each has the same
96
binary and assumes identical roles.
97
98
Nodes and the ranges they provide access to can be arranged with various
99
physical network topologies to make trade-offs between reliability and
100
performance. For example, a triplicated (3-way replica) range could have
101
each replica located on different:
102
103
- disks within a server to tolerate disk failures.
104
- servers within a rack to tolerate server failures.
105
- servers on different racks within a datacenter to tolerate rack power/network failures.
106
- servers in different datacenters to tolerate large scale network or power outages.
107
108
Up to `F` failures can be tolerated, where the total number of replicas is `N = 2F + 1` (e.g. with 3x replication, one failure can be tolerated; with 5x replication, two failures, and so on).
109
110
# Cockroach Client
111
112
In order to support diverse client usage, Cockroach clients connect to
113
any node via HTTPS using protocol buffers or JSON. The connected node
114
proxies the involved client work, including key lookups and write buffering.
115
116
# Keys
117
118
Cockroach keys are arbitrary byte arrays. If textual data is used in
119
keys, UTF-8 encoding is recommended (this helps for cleaner display of
120
values in debugging tools). User-supplied keys are encoded using an
121
ordered code. System keys are either prefixed with null characters (`\0`
122
or `\0\0`) for system tables, or take the form of
123
`<user-key><system-suffix>` to sort user-key-range specific system
124
keys immediately after the user keys they refer to. Null characters are
125
used in system key prefixes to guarantee that they sort first.
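
To make the key scheme concrete, here is a minimal Go sketch of how such keys might be assembled. The helper names and the exact byte layout are illustrative assumptions, not the actual Cockroach encoding.

```go
package main

import "fmt"

// Illustrative key helpers for the scheme described above; the real
// encoding lives in the Cockroach source and may differ.
func systemKey(table string, rest []byte) []byte {
	// System tables sort first because of the \0 prefix.
	return append([]byte("\x00"+table), rest...)
}

func rangeLocalKey(userKey []byte, suffix string) []byte {
	// <user-key><system-suffix> sorts immediately after the user key.
	return append(append([]byte{}, userKey...), suffix...)
}

func main() {
	fmt.Printf("%q\n", systemKey("acct", []byte("db1")))            // "\x00acctdb1"
	fmt.Printf("%q\n", rangeLocalKey([]byte("app/user/42"), "txn")) // "app/user/42txn"
}
```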
126
127
# Versioned Values
128
129
Cockroach maintains historical versions of values by storing them with
130
associated commit timestamps. Reads and scans can specify a snapshot
131
time to return the most recent writes prior to the snapshot timestamp.
132
Older versions of values are garbage collected by the system during
133
compaction according to a user-specified expiration interval. In order
134
to support long-running scans (e.g. for MapReduce), all versions have a
135
minimum expiration.
136
137
Versioned values are supported via modifications to RocksDB to record
138
commit timestamps and GC expirations per key.
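
As a rough illustration of what a versioned key layout could look like, the Go sketch below appends an inverted commit timestamp to each key so that newer versions of a key sort first under plain bytewise comparison. The suffix encoding is an assumption made for illustration only; the actual RocksDB modification may differ.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// mvccKey appends an encoded commit timestamp (wall nanos) to the user key.
// The timestamp is stored inverted so that, within one key, newer versions
// sort before older ones. This suffix scheme is purely illustrative.
func mvccKey(key []byte, wallNanos uint64) []byte {
	out := append([]byte{}, key...)
	out = append(out, 0x00) // separator between key and version
	var ts [8]byte
	binary.BigEndian.PutUint64(ts[:], ^wallNanos) // inverted => descending order
	return append(out, ts[:]...)
}

func main() {
	newer := mvccKey([]byte("a"), 200)
	older := mvccKey([]byte("a"), 100)
	fmt.Println(string(newer) < string(older)) // true: newest version first
}
```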
139
140
Each range maintains a small (i.e. covering roughly the last 10 seconds of read timestamps),
141
*in-memory* cache from key to the latest timestamp at which the
142
key was read. This *latest-read-cache* is updated every time a key
143
is read. The cache’s entries are evicted oldest timestamp first, updating
144
the low water mark of the cache appropriately. If a new range replica leader
145
is elected, it sets the low water mark for the cache to the current
146
wall time + ε (ε = 99^th^ percentile clock skew).
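
The following Go sketch models the latest-read-cache described above: a per-range map from key to the latest read timestamp, plus a low-water mark that stands in for evicted entries. Eviction and sizing are deliberately omitted, and the type and method names are invented for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// readTimestampCache is a toy latest-read-cache: it maps keys to the latest
// timestamp at which they were read and keeps a low-water mark that bounds
// anything already evicted.
type readTimestampCache struct {
	lowWater time.Time
	reads    map[string]time.Time
}

func newReadTimestampCache(lowWater time.Time) *readTimestampCache {
	return &readTimestampCache{lowWater: lowWater, reads: map[string]time.Time{}}
}

// recordRead remembers that key was read at ts.
func (c *readTimestampCache) recordRead(key string, ts time.Time) {
	if ts.After(c.reads[key]) {
		c.reads[key] = ts
	}
}

// latestRead returns the timestamp a writer of key must exceed.
func (c *readTimestampCache) latestRead(key string) time.Time {
	if ts, ok := c.reads[key]; ok && ts.After(c.lowWater) {
		return ts
	}
	return c.lowWater // evicted or never read: fall back to the low-water mark
}

func main() {
	c := newReadTimestampCache(time.Now())
	c.recordRead("x", time.Now().Add(time.Second))
	fmt.Println(c.latestRead("x").After(c.latestRead("y"))) // true
}
```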
148
# Lock-Free Distributed Transactions
149
150
Cockroach provides distributed transactions without locks. Cockroach
151
transactions support two isolation levels:
152
153
- snapshot isolation (SI) and
154
- *serializable* snapshot isolation (SSI).
155
156
*SI* is simple to implement, highly performant, and correct for all but a
157
handful of anomalous conditions (e.g. write skew). *SSI* requires just a touch
158
more complexity, is still highly performant (less so with contention), and has
159
no anomalous conditions. Cockroach’s SSI implementation is based on ideas from
160
the literature and some possibly novel insights.
161
162
SSI is the default level, with SI provided for application developers
163
who are certain enough of their need for performance and the absence of
164
write skew conditions to consciously elect to use it. In a lightly
165
contended system, our implementation of SSI is just as performant as SI,
166
requiring no locking or additional writes. With contention, our
167
implementation of SSI still requires no locking, but will end up
168
aborting more transactions. Cockroach’s SI and SSI implementations
169
prevent starvation scenarios even for arbitrarily long transactions.
170
171
See the [Cahill paper](https://drive.google.com/file/d/0B9GCVTp_FHJIcEVyZVdDWEpYYXVVbFVDWElrYUV0NHFhU2Fv/edit?usp=sharing)
172
for one possible implementation of SSI. This is another [great paper](http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf).
173
For a discussion of SSI implemented by preventing read-write conflicts
174
(in contrast to detecting them, called write-snapshot isolation), see
175
the [Yabandeh paper](https://drive.google.com/file/d/0B9GCVTp_FHJIMjJ2U2t6aGpHLTFUVHFnMTRUbnBwc2pLa1RN/edit?usp=sharing),
176
which is the source of much inspiration for Cockroach’s SSI.
177
178
Each Cockroach transaction is assigned a random priority and a
179
"candidate timestamp" at start. The candidate timestamp is the
180
provisional timestamp at which the transaction will commit, and is
181
chosen as the current clock time of the node coordinating the
182
transaction. This means that a transaction without conflicts will
183
usually commit with a timestamp that, in absolute time, precedes the
184
actual work done by that transaction.
185
186
In the course of coordinating a transaction between one or more
187
distributed nodes, the candidate timestamp may be increased, but will
188
never be decreased. The core difference between the two isolation levels
189
SI and SSI is that the former allows the transaction's candidate
190
timestamp to increase and the latter does not.
192
Each cockroach node maintains a hybrid logical clock (HLC) as discussed
193
in the [*Hybrid Logical Clock
194
paper*](http://www.cse.buffalo.edu/tech-reports/2014-04.pdf).
195
An HLC is a combination of both a physical and a logical component
196
to support monotonic increments without degenerate cases causing
197
HLC time to diverge dramatically from wall clock time. Cockroach
198
picks a Timestamp for a transaction using HLC time. The timestamp
199
at a node referred to in this design is the HLC time at that node.
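
A minimal HLC sketch in Go follows, showing the physical-plus-logical structure and the monotonic update on observing remote timestamps. It omits the divergence bounds and error handling of a production clock; the names are illustrative.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// hlc is a minimal hybrid logical clock: a wall-clock component plus a
// logical counter that keeps timestamps monotonic even when the physical
// clock stalls or a timestamp from the "future" is observed.
type hlc struct {
	mu      sync.Mutex
	wall    int64 // nanoseconds
	logical int32
}

// Now returns a timestamp strictly greater than any previously returned
// or observed timestamp.
func (c *hlc) Now() (int64, int32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if pt := time.Now().UnixNano(); pt > c.wall {
		c.wall, c.logical = pt, 0
	} else {
		c.logical++
	}
	return c.wall, c.logical
}

// Update merges a timestamp received from another node so that local HLC
// time never falls behind observed events.
func (c *hlc) Update(wall int64, logical int32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if wall > c.wall || (wall == c.wall && logical > c.logical) {
		c.wall, c.logical = wall, logical
	}
}

func main() {
	var c hlc
	w1, l1 := c.Now()
	c.Update(w1+int64(time.Second), 0) // observe a timestamp from a fast peer
	w2, l2 := c.Now()
	fmt.Println(w2 > w1 || (w2 == w1 && l2 > l1)) // true: monotonic
}
```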
200
201
Transactions are executed in two phases:
202
203
1. Start the transaction by writing a new entry to the system
204
transaction table (keys prefixed by *\0tx*) with state “PENDING”. In
205
parallel write an "intent" value for each datum being written as part
206
of the transaction. These are normal MVCC values, with the addition of
207
a special flag (i.e. “intent”) indicating that the value may be
208
committed after the transaction itself commits. In addition,
209
the transaction id (unique and chosen at tx start time by client)
210
is stored with intent values. The tx id is used to refer to the
211
transaction table when there are conflicts and to make
212
tie-breaking decisions on ordering between identical timestamps.
213
Each node returns the timestamp used for the write (which is the
214
original candidate timestamp in the absence of read/write conflicts);
215
the client selects the maximum from amongst all write timestamps as the
216
final commit timestamp.
218
2. Commit the transaction by updating its entry in the system
219
transaction table (keys prefixed by *\0tx*). The value of the
220
commit entry contains the candidate timestamp (increased as
221
necessary to accommodate any latest read timestamps). Note that
222
the transaction is considered fully committed at this point and
223
control may be returned to the client.
224
225
In the case of an SI transaction, a commit timestamp which was
226
increased to accommodate concurrent readers is perfectly
227
acceptable and the commit may continue. For SSI transactions,
228
however, a gap between candidate and commit timestamps
229
necessitates transaction restart (note: restart is different than
230
abort--see below).
231
232
After the transaction is committed, all written intents are upgraded
233
in parallel by removing the “intent” flag. The transaction is
234
considered fully committed before this step and does not wait for
235
it to return control to the transaction coordinator.
236
237
In the absence of conflicts, this is the end. Nothing else is necessary
238
to ensure the correctness of the system.
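
The sketch below outlines this two-phase flow in Go: write intents carrying the transaction id, track the maximum timestamp returned by any range, then commit by flipping the transaction record. All types and function names here are invented for illustration, and Raft, the transaction table layout, and asynchronous intent resolution are elided.

```go
package main

import "fmt"

// Schematic of the two-phase flow described above; purely illustrative.
type txnState int

const (
	pending txnState = iota
	committed
	aborted
)

type txnRecord struct {
	id        string
	state     txnState
	timestamp int64 // candidate, then commit, timestamp
}

type intent struct {
	key, value string
	txnID      string
	timestamp  int64
}

// runTxn writes intents for all keys, tracking the highest timestamp any
// range assigned, then commits by updating the transaction record. Intent
// resolution would happen asynchronously afterwards.
func runTxn(rec *txnRecord, writes map[string]string, writeIntent func(intent) int64) {
	for k, v := range writes {
		ts := writeIntent(intent{key: k, value: v, txnID: rec.id, timestamp: rec.timestamp})
		if ts > rec.timestamp {
			rec.timestamp = ts // a range pushed our timestamp (read/write conflict)
		}
	}
	rec.state = committed // a single write to the txn record; now fully committed
}

func main() {
	rec := &txnRecord{id: "t1", state: pending, timestamp: 100}
	fakeRange := func(in intent) int64 { return in.timestamp } // no conflicts
	runTxn(rec, map[string]string{"a": "1", "b": "2"}, fakeRange)
	fmt.Println(rec.state == committed, rec.timestamp) // true 100
}
```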
239
240
**Conflict Resolution**
241
242
Things get more interesting when a reader or writer encounters an intent
243
record or newly-committed value in a location that it needs to read or
244
write. This is a conflict, usually causing either of the transactions to
245
abort or restart depending on the type of conflict.
246
247
***Transaction restart:***
248
249
This is the usual (and more efficient) type of behaviour and is used
250
except when the transaction was aborted (for instance by another
251
transaction).
252
In effect, that reduces to two cases; the first being the one outlined
253
above: An SSI transaction that finds upon attempting to commit that
254
its commit timestamp has been pushed. The second case involves a transaction
255
actively encountering a conflict, that is, one of its readers or writers
256
encounters data that necessitates conflict resolution
257
(see transaction interactions below).
258
259
When a transaction restarts, it changes its priority and/or moves its
260
timestamp forward depending on data tied to the conflict, and
261
begins anew reusing the same tx id. The prior run of the transaction might
262
have written some write intents, which need to be deleted before the
263
transaction commits, so as to not be included as part of the transaction.
264
These stale write intent deletions are done during the reexecution of the
265
transaction, either implicitly, through writing new intents to
266
the same keys as part of the reexecution of the transaction, or explicitly,
267
by cleaning up stale intents that are not part of the reexecution of the
268
transaction. Since most transactions will end up writing to the same keys,
269
the explicit cleanup run just before committing the transaction is usually
270
a NOOP.
271
272
***Transaction abort:***
273
274
This is the case in which a transaction, upon reading its transaction
275
table entry, finds that it has been aborted. In this case, the
276
transaction can not reuse its intents; it returns control to the client
277
before cleaning them up (other readers and writers would clean up
278
dangling intents as they encounter them) but will make an effort to
279
clean up after itself. The next attempt (if applicable) then runs as a
280
new transaction with **a new tx id**.
281
282
***Transaction interactions:***
283
284
There are several scenarios in which transactions interact:
285
286
- **Reader encounters write intent or value with newer timestamp far
287
enough in the future**: This is not a conflict. The reader is free
288
to proceed; after all, it will be reading an older version of the
289
value and so does not conflict. Recall that the write intent may
290
be committed with a later timestamp than its candidate; it will
291
never commit with an earlier one. **Side note**: if an SI transaction
292
reader finds an intent with a newer timestamp which the reader’s own
293
transaction has written, the reader always returns that intent's value.
294
295
- **Reader encounters write intent or value with newer timestamp in the
296
near future:** In this case, we have to be careful. The newer
297
intent may, in absolute terms, have happened in our read's past if
298
the clock of the writer is ahead of the node serving the values.
299
In that case, we would need to take this value into account, but
300
we just don't know. Hence the transaction restarts, using instead
301
a future timestamp (but remembering a maximum timestamp used to
302
limit the uncertainty window to the maximum clock skew). In fact,
303
this is optimized further; see the details under "Choosing a
304
stamp" below.
305
306
- **Reader encounters write intent with older timestamp**: the reader
307
must follow the intent’s transaction id to the transaction table.
308
If the transaction has already been committed, then the reader can
309
just read the value. If the write transaction has not yet been
310
committed, then the reader has two options. If the write conflict
311
is from an SI transaction, the reader can *push that transaction's
312
commit timestamp into the future* (and consequently not have to
313
read it). This is simple to do: the reader just updates the
314
transaction’s commit timestamp to indicate that when/if the
315
transaction does commit, it should use a timestamp *at least* as
316
high. However, if the write conflict is from an SSI transaction,
317
the reader must compare priorities. If the reader has the higher priority,
318
it pushes the transaction’s commit timestamp (that
319
transaction will then notice its timestamp has been pushed, and
320
restart). If it has the lower or same priority, it retries itself using as
321
a new priority `max(new random priority, conflicting txn’s
322
priority - 1)`.
324
- **Writer encounters uncommitted write intent**:
325
If the other write intent has been written by a transaction with a lower
326
priority, the writer aborts the conflicting transaction. If the write
327
intent has a higher or equal priority the transaction retries, using as a new
328
priority *max(new random priority, conflicting txn’s priority - 1)*;
329
the retry occurs after a short, randomized backoff interval (a sketch of this priority rule follows this list).
331
- **Writer encounters newer committed value**:
332
The committed value could also be an unresolved write intent made by a
333
transaction that has already committed. The transaction restarts. On restart,
334
the same priority is reused, but the candidate timestamp is moved forward
335
to the encountered value's timestamp.
337
- **Writer encounters newer read key**:
338
The *latest-read-cache* is consulted on each write at a node. If the write’s
339
candidate timestamp is earlier than the low water mark on the cache itself
340
(i.e. its last evicted timestamp) or if the key being written has a read
341
timestamp later than the write’s candidate timestamp, this later timestamp
342
value is returned with the write forcing the transaction to restart.
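
The retry-priority rule quoted in the reader and writer cases above can be stated compactly; the Go sketch below is a minimal illustration, and the function name is an assumption.

```go
package main

import (
	"fmt"
	"math/rand"
)

// retryPriority implements the rule quoted above: on retry, a transaction
// takes max(new random priority, conflicting txn's priority - 1), so that
// it eventually wins against the transaction it keeps losing to.
func retryPriority(conflictingPriority int32) int32 {
	p := rand.Int31()
	if c := conflictingPriority - 1; c > p {
		return c
	}
	return p
}

func main() {
	fmt.Println(retryPriority(5) >= 4) // always true
}
```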
343
344
**Transaction management**
345
346
Transactions are managed by the client proxy (or gateway in SQL Azure
347
parlance). Unlike in Spanner, writes are not buffered but are sent
348
directly to all implicated ranges. This allows the transaction to abort
349
quickly if it encounters a write conflict. The client proxy keeps track
350
of all written keys in order to resolve write intents asynchronously upon
351
transaction completion. If a transaction commits successfully, all intents
352
are upgraded to committed. In the event a transaction is aborted, all written
353
intents are deleted. The client proxy doesn’t guarantee it will resolve intents.
354
355
In the event the client proxy restarts before the pending transaction is
356
committed, the dangling transaction would continue to live in the
357
transaction table until aborted by another transaction. Transactions
358
heartbeat the transaction table every five seconds by default.
359
Transactions encountered by readers or writers with dangling intents
360
which haven’t heartbeated within the required interval are aborted.
361
In the event the proxy restarts after a transaction commits but before
362
the resolution is complete, the dangling intents are upgraded
363
when encountered by future readers and writers and the system does
364
not depend on their timely resolution for correctness.
365
366
An exploration of retries with contention and abort times with abandoned
367
transactions is
368
[here](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing).
369
370
**Transaction Table**
371
372
Please see [proto/data.proto](https://github.com/cockroachdb/cockroach/blob/master/proto/data.proto) for the up-to-date structures, the best entry point being `message Transaction`.
373
374
**Pros**
375
376
- No requirement for reliable code execution to prevent stalled 2PC
377
protocol.
378
- Readers never block with SI semantics; with SSI semantics, they may
379
abort.
380
- Lower latency than traditional 2PC commit protocol (w/o contention)
381
because second phase requires only a single write to the
382
transaction table instead of a synchronous round to all
383
transaction participants.
384
- Priorities avoid starvation for arbitrarily long transactions and
385
always pick a winner from between contending transactions (no
386
mutual aborts).
387
- Writes not buffered at client; writes fail fast.
388
- No read-locking overhead required for *serializable* SI (in contrast
389
to other SSI implementations).
390
- Well-chosen (i.e. less random) priorities can flexibly give
391
probabilistic guarantees on latency for arbitrary transactions
392
(for example: make OLTP transactions 10x less likely to abort than
393
low priority transactions, such as asynchronously scheduled jobs).
394
395
**Cons**
396
397
- Reads from non-leader replicas still require a ping to the leader to
398
update *latest-read-cache*.
399
- Abandoned transactions may block contending writers for up to the
400
heartbeat interval, though average wait is likely to be
401
considerably shorter (see [graph in link](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing)).
402
This is likely considerably more performant than detecting and
403
restarting 2PC in order to release read and write locks.
404
- Behavior different than other SI implementations: no first writer
405
wins, and shorter transactions do not always finish quickly.
406
Element of surprise for OLTP systems may be a problematic factor.
407
- Aborts can decrease throughput in a contended system compared with
408
two phase locking. Aborts and retries increase read and write
409
traffic, increase latency and decrease throughput.
410
411
**Choosing a Timestamp**
412
413
A key challenge of reading data in a distributed system with clock skew
414
is choosing a timestamp guaranteed to be greater than the latest
415
timestamp of any committed transaction (in absolute time). No system can
416
claim consistency and fail to read already-committed data.
417
418
Time for a node is maintained by a hybrid logical clock (HLC).
419
The HLC time is >= wall time and is potentially updated by each write at that node.
420
The write's timestamp is not only used to version the data being written,
421
but also potentially updates the logical time on the node. This is useful in
422
guaranteeing that all data written to a node is at a timestamp < HLC time.
423
424
Accomplishing consistency for transactions (or just single operations)
425
accessing a single node is easy. The transaction uses the HLC time as the
426
timestamp, which is guaranteed to be greater than that of all the
427
timestamped data on the node.
429
For multiple nodes, the HLC time of the node coordinating the
430
transaction `t` is used. In addition, a maximum timestamp `t+ε` is
431
supplied to provide an upper bound on timestamps for already-committed
432
data (`ε` is the maximum clock skew). As the transaction progresses, any
433
data read which have timestamps greater than `t` but less than `t+ε`
434
cause the transaction to abort and retry with the conflicting timestamp
435
t<sub>c</sub>, where t<sub>c</sub> \> t. The maximum timestamp `t+ε` remains
436
the same.
437
438
We apply another optimization to reduce the restarts caused
439
by uncertainty. Upon restarting, the transaction not only takes
440
into account t<sub>c</sub>, but the HLC time of the node at the time
441
of the uncertain read t<sub>node</sub>. The larger of those two timestamps
442
(more likely the latter): max(t<sub>c</sub>, t<sub>node</sub>) is used
443
to bump up the read timestamp. Additionally, the conflicting node is
444
marked as “certain”. This means that for future reads to that node
445
within the transaction, we can set `MaxTimestamp = Read Timestamp`.
446
Correctness follows from the fact that we know that at the time of the read,
447
there exists no version of any key on that node with a higher timestamp than
448
t<sub>node</sub>. Upon a restart caused by the node, if the transaction were to
449
encounter a key with a higher timestamp it would imply that the value
450
is written in the future in absolute time, and the transaction can move
451
forward reading an older version of the data (at the transaction's timestamp).
452
This limits the number of uncertainty restarts attributed to a node to <= 1. The
453
tradeoff is that we might pick a timestamp larger than the optimal one
454
(> highest conflicting timestamp), resulting in the possibility of a few
455
more conflicts.
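
A Go sketch of this uncertainty bookkeeping follows: a read timestamp, an upper bound of read timestamp plus ε, and a per-node "certain" set. It is a simplified model (it does not advance the transaction's read timestamp on restart), and the names are illustrative.

```go
package main

import "fmt"

// uncertainty sketches the per-transaction state described above.
type uncertainty struct {
	readTS  int64
	maxTS   int64 // readTS + max clock skew ε
	certain map[string]bool
}

// onValue reports whether seeing a value with timestamp valTS on nodeID
// forces a restart and, if so, the timestamp to restart at (the larger of
// the conflicting timestamp and the node's HLC time).
func (u *uncertainty) onValue(nodeID string, valTS, nodeHLC int64) (restart bool, restartTS int64) {
	limit := u.maxTS
	if u.certain[nodeID] {
		limit = u.readTS // nothing relevant can exist on this node above readTS
	}
	if valTS <= u.readTS || valTS > limit {
		return false, 0 // either visible or definitely in our future
	}
	u.certain[nodeID] = true
	if nodeHLC > valTS {
		return true, nodeHLC
	}
	return true, valTS
}

func main() {
	u := &uncertainty{readTS: 100, maxTS: 110, certain: map[string]bool{}}
	fmt.Println(u.onValue("n1", 105, 108)) // true 108: restart at the node's HLC time
	fmt.Println(u.onValue("n1", 107, 109)) // false 0: n1 is now marked certain
}
```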
456
457
We expect retries will be rare, but this assumption may need to be
458
revisited if retries become problematic. Note that this problem does not
459
apply to historical reads. An alternate approach which does not require
460
retries makes a round to all node participants in advance and
461
chooses the highest reported node wall time as the timestamp. However,
462
knowing which nodes will be accessed in advance is difficult and
463
potentially limiting. Cockroach could also potentially use a global
464
clock (Google did this with [Percolator](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf)), which would be
465
feasible for smaller, geographically-proximate clusters.
466
467
# Linearizability
468
469
First a word about [***Spanner***](http://research.google.com/archive/spanner.html).
470
By combining judicious use of wait intervals with accurate time signals,
471
Spanner provides a global ordering between any two non-overlapping transactions
472
(in absolute time) with \~14ms latencies. Put another way:
473
Spanner guarantees that if a transaction T<sub>1</sub> commits (in absolute time)
474
before another transaction T<sub>2</sub> starts, then T<sub>1</sub>'s assigned commit
475
timestamp is smaller than T<sub>2</sub>'s. Using atomic clocks and GPS receivers,
476
Spanner reduces its clock skew uncertainty to \< 10ms (`ε`). To make
477
good on the promised guarantee, transactions must take at least double
478
the clock skew uncertainty interval to commit (`2ε`). See [*this
479
article*](http://www.cs.cornell.edu/~ie53/publications/DC-col51-Sep13.pdf)
480
for a helpful overview of Spanner’s concurrency control.
481
482
Cockroach could make the same guarantees without specialized hardware,
483
at the expense of longer wait times. If servers in the cluster were
484
configured to work only with NTP, transaction wait times would likely
485
be in excess of 150ms. For wide-area zones, this would be somewhat
486
mitigated by overlap from cross datacenter link latencies. If clocks
487
were made more accurate, the minimal limit for commit latencies would
488
improve.
489
490
However, let’s take a step back and evaluate whether Spanner’s external
491
consistency guarantee is worth the automatic commit wait. First, if the
492
commit wait is omitted completely, the system still yields a consistent
493
view of the map at an arbitrary timestamp. However with clock skew, it
494
would become possible for commit timestamps on non-overlapping but
495
causally related transactions to suffer temporal reversal. In other
496
words, the following scenario is possible for a client without global
497
ordering:
498
499
- Start transaction T<sub>1</sub> to modify value `x` with commit time *s<sub>1</sub>*
500
501
- On commit of T<sub>1</sub>, start T<sub>2</sub> to modify value `y` with commit time
502
*s<sub>2</sub>*
503
504
- Read `x` and `y` and discover that s<sub>1</sub> \> s<sub>2</sub> (**!**)
505
506
The external consistency which Spanner guarantees is referred to as
507
**linearizability**. It goes beyond serializability by preserving
508
information about the causality inherent in how external processes
509
interacted with the database. The strength of Spanner’s guarantee can be
510
formulated as follows: any two processes, with clock skew within
511
expected bounds, may independently record their wall times for the
512
completion of transaction T<sub>1</sub> (T<sub>1</sub><sup>end</sup>) and start of transaction
513
T<sub>2</sub> (T<sub>2</sub><sup>start</sup>) respectively, and if later
514
compared such that T<sub>1</sub><sup>end</sup> \< T<sub>2</sub><sup>start</sup>,
515
then commit timestamps s<sub>1</sub> \< s<sub>2</sub>.
516
This guarantee is broad enough to completely cover all cases of explicit
517
causality, in addition to covering any and all imaginable scenarios of implicit
518
causality.
519
520
Our contention is that causality is chiefly important from the
521
perspective of a single client or a chain of successive clients (*if a
522
tree falls in the forest and nobody hears…*). As such, Cockroach
523
provides two mechanisms to provide linearizability for the vast majority
524
of use cases without a mandatory transaction commit wait or an elaborate
525
system to minimize clock skew.
526
527
1. Clients provide the highest transaction commit timestamp with
528
> successive transactions. This allows node clocks from previous
529
> transactions to effectively participate in the formulation of the
530
> commit timestamp for the current transaction. This guarantees
531
> linearizability for transactions committed by this client.
532
>
533
> Newly launched clients wait at least 2 \* ε from process start
534
> time before beginning their first transaction. This preserves the
535
> same property even on client restart, and the wait will be
536
> mitigated by process initialization.
537
>
538
> All causally-related events within Cockroach maintain
539
> linearizability. Message queues, for example, guarantee that the
540
> receipt timestamp is greater than send timestamp, and that
541
> delivered messages may not be reaped until after the commit wait.
542
543
2. Committed transactions respond with a commit wait parameter which
544
> represents the remaining time in the nominal commit wait. This
545
> will typically be less than the full commit wait as the consensus
546
> write at the coordinator accounts for a portion of it.
547
>
548
> Clients taking any action outside of another Cockroach transaction
549
> (e.g. writing to another distributed system component) can either
550
> choose to wait the remaining interval before proceeding, or
551
> alternatively, pass the wait and/or commit timestamp to the
552
> execution of the outside action for its consideration. This pushes
553
> the burden of linearizability to clients, but is a useful tool in
554
> mitigating commit latencies if the clock skew is potentially
555
> large. This functionality can be used for ordering in the face of
556
> backchannel dependencies as mentioned in the
557
> [AugmentedTime](http://www.cse.buffalo.edu/~demirbas/publications/augmentedTime.pdf)
558
> paper.
559
560
Using these mechanisms in place of commit wait, Cockroach’s guarantee can be
561
formulated as follows: any process which signals the start of transaction
562
T<sub>2</sub> (T<sub>2</sub><sup>start</sup>) after the completion of
563
transaction T<sub>1</sub> (T<sub>1</sub><sup>end</sup>), will have commit
564
timestamps such that s<sub>1</sub> \< s<sub>2</sub>.
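
The two client-side mechanisms can be sketched very simply in Go: carry forward the highest observed commit timestamp, and sleep out any remaining commit wait before acting outside the system. This is an illustrative sketch, not the client library API.

```go
package main

import (
	"fmt"
	"time"
)

// causalityToken sketches mechanism 1 above: the client threads its highest
// observed commit timestamp into the next transaction, so node clocks from
// prior transactions participate in the next commit timestamp.
type causalityToken struct {
	maxCommit time.Time
}

// observe records a transaction's commit timestamp.
func (c *causalityToken) observe(commit time.Time) {
	if commit.After(c.maxCommit) {
		c.maxCommit = commit
	}
}

// commitWait sketches mechanism 2: given the remaining nominal commit wait
// returned with a commit, a client that must act outside Cockroach can
// simply sleep it off before proceeding.
func commitWait(remaining time.Duration) {
	if remaining > 0 {
		time.Sleep(remaining)
	}
}

func main() {
	var tok causalityToken
	tok.observe(time.Now())
	commitWait(10 * time.Millisecond)
	fmt.Println("proceeding after", tok.maxCommit)
}
```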
565
566
# Logical Map Content
567
568
Logically, the map contains a series of reserved system key / value
569
pairs covering accounting, range metadata, node accounting and
570
permissions before the actual key / value pairs for non-system data
571
(e.g. the actual meat of the map).
572
573
- `\0\0meta1`: Range metadata for location of `\0\0meta2`.
574
- `\0\0meta1<key1>`: Range metadata for location of `\0\0meta2<key1>`.
575
- ...
576
- `\0\0meta1<keyN>`: Range metadata for location of `\0\0meta2<keyN>`.
577
- `\0\0meta2`: Range metadata for location of first non-range metadata key.
578
- `\0\0meta2<key1>`: Range metadata for location of `<key1>`.
579
- ...
580
- `\0\0meta2<keyN>`: Range metadata for location of `<keyN>`.
581
- `\0acct<key0>`: Accounting for key prefix key0.
582
- ...
583
- `\0acct<keyN>`: Accounting for key prefix keyN.
584
- `\0node<node-address0>`: Accounting data for node 0.
585
- ...
586
- `\0node<node-addressN>`: Accounting data for node N.
587
- `\0perm<key0><user0>`: Permissions for user0 for key prefix key0.
588
- ...
589
- `\0perm<keyN><userN>`: Permissions for userN for key prefix keyN.
590
- `\0tree_root`: Range key for root of range-spanning tree.
591
- `\0tx<tx-id0>`: Transaction record for transaction 0.
592
- ...
593
- `\0tx<tx-idN>`: Transaction record for transaction N.
594
- `\0zone<key0>`: Zone information for key prefix key0.
595
- ...
596
- `\0zone<keyN>`: Zone information for key prefix keyN.
597
- `<>acctd<metric0>`: Accounting data for Metric 0 for empty key prefix.
598
- ...
599
- `<>acctd<metricN>`: Accounting data for Metric N for empty key prefix.
600
- `<key0>`: `<value0>` The first user data key.
601
- ...
602
- `<keyN>`: `<valueN>` The last user data key.
603
604
There are some additional system entries sprinkled amongst the
605
non-system keys. See the Key-Prefix Accounting section in this document
606
for further details.
607
608
# Node Storage
609
610
Nodes maintain a separate instance of RocksDB for each disk. Each
611
RocksDB instance hosts any number of ranges. RPCs arriving at a
612
RoachNode are multiplexed based on the disk name to the appropriate
613
RocksDB instance. A single instance per disk is used to avoid
614
contention. If every range maintained its own RocksDB, global management
615
of available cache memory would be impossible and writers for each range
616
would compete for non-contiguous writes to multiple RocksDB logs.
617
618
In addition to the key/value pairs of the range itself, various range
619
metadata is maintained.
620
621
- range-spanning tree node links
622
623
- participating replicas
624
625
- consensus metadata
626
627
- split/merge activity
628
629
A really good reference on tuning Linux installations with RocksDB is
630
[here](http://docs.basho.com/riak/latest/ops/advanced/backends/leveldb/).
631
632
# Range Metadata
633
634
The default approximate size of a range is 64M (2\^26 B). In order to
635
support 1P (2\^50 B) of logical data, metadata is needed for roughly
636
2\^(50 - 26) = 2\^24 ranges. A reasonable upper bound on range metadata
637
size is roughly 256 bytes (3\*12 bytes for the triplicated node
638
locations and 220 bytes for the range key itself). 2\^24 ranges \* 2\^8
639
B would require roughly 4G (2\^32 B) to store--too much to duplicate
640
between machines. Our conclusion is that range metadata must be
641
distributed for large installations.
642
643
To distribute the range metadata and keep key lookups relatively fast,
644
we use two levels of indirection. All of the range metadata sorts first
645
in our key-value map. We accomplish this by prefixing range metadata
646
with two null characters (*\0\0*). The *meta1* or *meta2* suffixes are
647
additionally appended to distinguish between the first level and second
648
level of range metadata. In order to do a lookup for *key1*,
649
we first locate the range information for the lower bound of
650
`\0\0meta1<key1>`, and then use that range to locate the lower bound
651
of `\0\0meta2<key1>`. The range specified there will indicate the
652
range location of `<key1>` (refer to examples below). Using two levels
653
of indirection, **our map can address approximately 2\^62 B of data, or
654
roughly 4E** (*each metadata range addresses 2\^(26-8) = 2\^18 ranges;
655
with two levels of indirection, we can address 2\^(18 + 18) = 2\^36
656
ranges; each range addresses 2\^26 B; total is 2\^(36+26) B = 2\^62 B =
657
4E*).
658
659
Note: we append the end key of each range to meta[12] records because
660
the RocksDB iterator only supports a Seek() interface which acts as a
661
Ceil(). Using the start key of the range would cause Seek() to find the
662
key *after* the meta indexing record we’re looking for, which would
663
result in having to back the iterator up, an option which is both less
664
efficient and not available in all cases.
665
666
The following example shows the directory structure for a map with
667
three ranges worth of data. Ellipses indicate additional key/value pairs to
668
fill an entire range of data. Except for the fact that splitting ranges
669
requires updates to the range metadata with knowledge of the metadata layout,
670
the range metadata itself requires no special treatment or bootstrapping.
671
672
**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
673
`dcrama3:8000`)
674
675
- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
676
- `\0\0meta2<lastkey0>`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
677
- `\0\0meta2<lastkey1>`: `dcrama4:8000`, `dcrama5:8000`, `dcrama6:8000`
678
- `\0\0meta2\xff`: `dcrama7:8000`, `dcrama8:8000`, `dcrama9:8000`
679
- ...
680
- `<lastkey0>`: `<lastvalue0>`
681
682
**Range 1** (located on servers `dcrama4:8000`, `dcrama5:8000`,
683
`dcrama6:8000`)
684
685
- ...
686
- `<lastkey1>`: `<lastvalue1>`
687
688
**Range 2** (located on servers `dcrama7:8000`, `dcrama8:8000`,
689
`dcrama9:8000`)
690
691
- ...
692
- `<lastkey2>`: `<lastvalue2>`
693
694
Consider a simpler example of a map containing less than a single
695
range of data. In this case, all range metadata and all data are
696
located in the same range:
697
698
**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
699
`dcrama3:8000`)
700
701
- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
702
- `\0\0meta2\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
703
- `<key0>`: `<value0>`
704
- `...`
705
706
Finally, a map large enough to need both levels of indirection would
707
look like (note that instead of showing range replicas, this
708
example is simplified to just show range indexes):
709
710
**Range 0**
711
712
- `\0\0meta1<lastkeyN-1>`: Range 0
713
- `\0\0meta1\xff`: Range 1
714
- `\0\0meta2<lastkey1>`: Range 1
715
- `\0\0meta2<lastkey2>`: Range 2
716
- `\0\0meta2<lastkey3>`: Range 3
717
- ...
718
- `\0\0meta2<lastkeyN-1>`: Range 262143
719
720
**Range 1**
721
722
- `\0\0meta2<lastkeyN>`: Range 262144
723
- `\0\0meta2<lastkeyN+1>`: Range 262145
724
- ...
725
- `\0\0meta2\xff`: Range 500,000
726
- ...
727
- `<lastkey1>`: `<lastvalue1>`
728
729
**Range 2**
730
731
- ...
732
- `<lastkey2>`: `<lastvalue2>`
733
734
**Range 3**
735
736
- ...
737
- `<lastkey3>`: `<lastvalue3>`
738
739
**Range 262144**
740
741
- ...
742
- `<lastkeyN>`: `<lastvalueN>`
743
744
**Range 262145**
745
746
- ...
747
- `<lastkeyN+1>`: `<lastvalueN+1>`
748
749
Note that the choice of range `262144` is just an approximation. The
750
actual number of ranges addressable via a single metadata range is
751
dependent on the size of the keys. If efforts are made to keep key sizes
752
small, the total number of addressable ranges would increase and vice
753
versa.
754
755
From the examples above it’s clear that key location lookups require at
756
most three reads to get the value for `<key>`:
757
758
1. lower bound of `\0\0meta1<key>`
759
2. lower bound of `\0\0meta2<key>`,
760
3. `<key>`.
761
762
For small maps, the entire lookup is satisfied in a single RPC to Range 0. Maps
763
containing less than 16T of data would require two lookups. Clients cache both
764
levels of range metadata, and we expect that data locality for individual
765
clients will be high. Clients may end up with stale cache entries. If on a
766
lookup, the range consulted does not match the client’s expectations, the
767
client evicts the stale entries and possibly does a new lookup.
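
The Go sketch below mimics the two-level lookup over a sorted, in-memory set of metadata records keyed by range end key (hence the lower-bound search). The record layout and addresses are invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// metaRecord stands in for a range descriptor: it is stored under the
// range's end key (hence the lower-bound / Ceil semantics noted above) and
// names the replicas holding that range. Purely illustrative.
type metaRecord struct {
	key      string // \0\0meta1<endKey> or \0\0meta2<endKey>
	replicas []string
}

// lookup finds the first record whose key is >= the given lookup key, i.e.
// the metadata record for the range containing that key.
func lookup(sorted []metaRecord, key string) metaRecord {
	i := sort.Search(len(sorted), func(i int) bool { return sorted[i].key >= key })
	return sorted[i] // a real implementation would handle "not found"
}

func main() {
	meta := []metaRecord{
		{"\x00\x00meta1\xff", []string{"meta2 range"}},
		{"\x00\x00meta2m", []string{"dcrama1:8000"}}, // ranges up to "m"
		{"\x00\x00meta2\xff", []string{"dcrama4:8000"}},
	}
	// Addressing user key "k": meta1 first, then meta2, then the range itself.
	step1 := lookup(meta, "\x00\x00meta1k")
	step2 := lookup(meta, "\x00\x00meta2k")
	fmt.Println(step1.replicas, step2.replicas) // [meta2 range] [dcrama1:8000]
}
```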
768
769
# Range-Spanning Binary Tree
770
771
A crucial enhancement to the organization of range metadata is to
772
augment the bi-level range metadata lookup with a minimum spanning tree,
773
implemented as a left-leaning red-black tree over all ranges in the map.
774
This tree structure allows the system to start at any key prefix and
775
efficiently traverse an arbitrary key range with minimal RPC traffic,
776
minimal fan-in and fan-out, and with bounded time complexity equal to
777
`2*log N` steps, where `N` is the total number of ranges in the system.
778
779
Unlike the range metadata rows prefixed with `\0\0meta[1|2]`, the
780
metadata for the range-spanning tree (e.g. parent range and left / right
781
child ranges) is stored directly at the ranges as non-map metadata. The
782
metadata for each node of the tree (e.g. links to parent range, left
783
child range, and right child range) is stored with the range metadata.
784
In effect, the tree metadata is stored implicitly. In order to traverse
785
the tree, for example, you’d need to query each range in turn for its
786
metadata.
787
788
Any time a range is split or merged, both the bi-level range lookup
789
metadata and the per-range binary tree metadata are updated as part of
790
the same distributed transaction. The total number of nodes involved in
791
the update is bounded by 2 + log N (i.e. 2 updates for meta1 and
792
meta2, and up to log N updates to balance the range-spanning tree).
793
The range corresponding to the root node of the tree is stored in
794
*\0tree_root*.
795
796
As an example, consider the following set of nine ranges and their
797
associated range-spanning tree:
798
799
R0: `aa - cc`, R1: `*cc - lll`, R2: `*lll - llr`, R3: `*llr - nn`, R4: `*nn - rr`, R5: `*rr - ssss`, R6: `*ssss - sst`, R7: `*sst - vvv`, R8: `*vvv - zzzz`.
800
801
![Range Tree](media/rangetree.png)
802
803
The range-spanning tree has many beneficial uses in Cockroach. It makes
804
the problem of efficiently aggregating accounting information of
805
potentially vast ranges of data tractable. Imagine a subrange of data
806
over which accounting is being kept. For example, the *photos* table in
807
a public photo sharing site. To efficiently keep track of data about the
808
table (e.g. total size, number of rows, etc.), messages can be passed
809
first up the tree and then down to the left until updates arrive at the
810
key prefix under which accounting is aggregated. This makes worst case
811
number of hops for an update to propagate into the accounting totals
812
2 \* log N. A 64T database will require 1M ranges, meaning 40 hops
813
worst case. In our experience, accounting tasks over vast ranges of data
814
are most often map/reduce jobs scheduled with coarse-grained
815
periodicity. By contrast, we expect Cockroach to maintain statistics
816
with sub 10s accuracy and with minimal cycles and minimal IOPs.
817
818
Another use for the range-spanning tree is to push accounting, zones and
819
permissions configurations to all ranges. In the case of zones and
820
permissions, this is an efficient way to pass updated configuration
821
information with exponential fan-out. When adding accounting
822
configurations (i.e. specifying a new key prefix to track), the
823
implicated ranges are transactionally scanned and zero-state accounting
824
information is computed as well. Deleting accounting configurations is
825
similar, except accounting records are deleted.
826
827
Last but *not* least, the range-spanning tree provides a convenient
828
mechanism for planning and executing parallel queries. These provide the
829
basis for
830
[Dremel](http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36632.pdf)-like
831
query execution trees and it’s easy to imagine supporting a subset of
832
SQL or even JavaScript-based user functions for complex data analysis
833
tasks.
834
835
# Raft - Consistency of Range Replicas
836
837
Each range is configured to consist of three or more replicas. The
838
replicas in a range maintain their own instance of a distributed
839
consensus algorithm. We use the [*Raft consensus
840
algorithm*](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf)
841
as it is simpler to reason about and includes a reference implementation
842
covering important details. Every write to replicas is logged twice.
843
Once to RocksDB’s internal log and once to RocksDB itself as part of the
844
Raft consensus log.
845
[ePaxos](https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf) has
846
promising performance characteristics for WAN-distributed replicas, but
847
it does not guarantee a consistent ordering between replicas.
848
849
Raft elects a relatively long-lived leader which must be involved to
850
propose writes. It heartbeats followers periodically to keep their logs
851
replicated. In the absence of heartbeats, followers become candidates
852
after randomized election timeouts and proceed to hold new leader
853
elections. Cockroach weights random timeouts such that the replicas with
854
shorter round trip times to peers are more likely to hold elections
855
first. Although only the leader can propose a new write, and as such
856
must be involved in any write to the consensus log, any replica can
857
service reads if the read is for a timestamp which the replica knows is
858
safe based on the last committed consensus write and the state of any
859
pending transactions.
860
861
Only the leader can propose a new write, but Cockroach accepts writes at
862
any replica. The replica merely forwards the write to the leader.
863
Instead of resending the write, the leader has only to acknowledge the
864
write to the forwarding replica using a log sequence number, as though
865
it were proposing it in the first place. The other replicas receive the
866
full write as though the leader were the originator.
867
868
Having a stable leader provides the choice of replica to handle
869
range-specific maintenance and processing tasks, such as delivering
870
pending message queues, handling splits and merges, rebalancing, etc.
871
872
# Splitting / Merging Ranges
873
874
RoachNodes split or merge ranges based on whether they exceed maximum or
875
minimum thresholds for capacity or load. Ranges exceeding maximums for
876
either capacity or load are split; ranges below minimums for *both*
877
capacity and load are merged.
878
879
Ranges maintain the same accounting statistics as accounting key
880
prefixes. These boil down to a time series of data points with minute
881
granularity. Everything from number of bytes to read/write queue sizes.
882
Arbitrary distillations of the accounting stats can be determined as the
883
basis for splitting / merging. Two sensible metrics for use with
884
split/merge are range size in bytes and IOps. A good metric for
885
rebalancing a replica from one node to another would be total read/write
886
queue wait times. These metrics are gossipped, with each range / node
887
passing along relevant metrics if they’re in the bottom or top of the
888
range it’s aware of.
889
890
A range finding itself exceeding either capacity or load threshold
891
splits. To this end, the range leader computes an appropriate split key
892
candidate and issues the split through Raft. In contrast to splitting,
893
merging requires a range to be below the minimum threshold for both
894
capacity *and* load. A range being merged chooses the smaller of the
895
ranges immediately preceding and succeeding it.
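
A minimal Go sketch of this split/merge decision follows; the stats fields and thresholds are illustrative stand-ins for the gossiped accounting metrics and zone configuration.

```go
package main

import "fmt"

// rangeStats is an illustrative distillation of a range's accounting stats.
type rangeStats struct {
	sizeBytes int64
	iops      float64
}

type thresholds struct {
	maxSize, minSize int64
	maxIOPS, minIOPS float64
}

// shouldSplit: exceeding either the capacity or the load maximum triggers a split.
func shouldSplit(s rangeStats, t thresholds) bool {
	return s.sizeBytes > t.maxSize || s.iops > t.maxIOPS
}

// shouldMerge: a range merges only when it is below the minimum for both.
func shouldMerge(s rangeStats, t thresholds) bool {
	return s.sizeBytes < t.minSize && s.iops < t.minIOPS
}

func main() {
	t := thresholds{maxSize: 64 << 20, minSize: 8 << 20, maxIOPS: 1000, minIOPS: 10}
	fmt.Println(shouldSplit(rangeStats{sizeBytes: 80 << 20, iops: 50}, t)) // true
	fmt.Println(shouldMerge(rangeStats{sizeBytes: 1 << 20, iops: 2}, t))   // true
}
```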
896
897
Splitting, merging, rebalancing and recovering all follow the same basic
898
algorithm for moving data between roach nodes. New target replicas are
899
created and added to the replica set of the source range. Then each new
900
replica is brought up to date by either replaying the log in full or
901
copying a snapshot of the source replica data and then replaying the log
902
from the timestamp of the snapshot to catch up fully. Once the new
903
replicas are fully up to date, the range metadata is updated and old,
904
source replica(s) deleted if applicable.
905
906
**Coordinator** (leader replica)
907
908
```
909
if splitting
910
SplitRange(split_key): splits happen locally on range replicas and
911
only after being completed locally, are moved to new target replicas.
912
else if merging
913
Choose new replicas on same servers as target range replicas;
914
add to replica set.
915
else if rebalancing || recovering
916
Choose new replica(s) on least loaded servers; add to replica set.
917
```
918
919
**New Replica**
920
921
*Bring replica up to date:*
922
923
```
924
if all info can be read from replicated log
925
copy replicated log
926
else
927
snapshot source replica
928
send successive ReadRange requests to source replica
929
referencing snapshot
930
931
if merging
932
combine ranges on all replicas
933
else if rebalancing || recovering
934
remove old range replica(s)
935
```
936
937
RoachNodes split ranges when the total data in a range exceeds a
938
configurable maximum threshold. Similarly, ranges are merged when the
939
total data falls below a configurable minimum threshold.
940
941
**TBD: flesh this out**.
942
943
Ranges are rebalanced if a node determines its load or capacity is one
944
of the worst in the cluster based on gossipped load stats. A node with
945
spare capacity is chosen in the same datacenter and a special-case split
946
is done which simply duplicates the data 1:1 and resets the range
947
configuration metadata.
948
949
# Message Queues
950
951
Each range maintains an array of incoming message queues, referred to
952
here as **inboxes**. Additionally, each range maintains and *processes*
953
an array of outgoing message queues, referred to here as **outboxes**.
954
Both inboxes and outboxes are assigned to keys; messages can be sent or
955
received on behalf of any key. Inboxes and outboxes can contain any
956
number of pending messages.
957
958
Messages can be *deliverable*, or *executable.*
959
960
Deliverable messages are defined by Value objects - simple byte arrays -
961
that are delivered to a key’s inbox, awaiting collection by a client
962
invoking the ReapQueue operation. These are typically used by client
963
applications wishing to be notified of changes to an entry for further
964
processing, such as expensive offline operations like sending emails,
965
SMSs, etc.
966
967
Executable messages are *outgoing-only*, and are instances of
968
PutRequest, IncrementRequest, DeleteRequest, DeleteRangeRequest
969
or AccountingRequest. Rather than being delivered to a key’s inbox, they are
970
executed when encountered. These are primarily useful when updates that
971
are nominally part of a transaction can tolerate asynchronous execution
972
(e.g. eventual consistency), and are otherwise too busy or numerous to
973
make including them in the original [distributed] transaction efficient.
974
Examples may include updates to the accounting for successive key
975
prefixes (potentially busy) or updates to a full-text index (potentially
976
numerous).
977
978
These two types of messages are enqueued in different outboxes too - see
979
key formats below.
980
981
At commit time, the range processing the transaction places messages
982
into a shared outbox located at the start of the range metadata. This is
983
effectively free as it’s part of the same consensus write for the range
984
as the COMMIT record. Outgoing messages are processed asynchronously by
985
the range. To make processing easy, all outboxes are co-located at the
986
start of the range. To make lookup easy, all inboxes are located
987
immediately after the recipient key. The leader replica of a range is
988
responsible for processing message queues.
989
990
A dispatcher polls a given range’s *deliverable message outbox*
991
periodically (configurable), and delivers those messages to the target
992
key’s inbox. The dispatcher is also woken up whenever a new message is
993
added to the outbox. A separate executor also polls the range’s
994
*executable message outbox* periodically as well (again, configurable),
995
and executes those commands. The executor, too, is woken up whenever a
996
new message is added to the outbox.
997
998
Formats follow in the table below. Notice that inbox messages for a
999
given key sort by the `<outbox-timestamp>`. This doesn’t provide a
1000
precise ordering, but it does allow clients to scan messages in an
1001
approximate ordering of when they were originally lodged with senders.
1002
NTP offers wall time deltas to within 100s of milliseconds. The
1003
`<sender-range-key>` suffix provides uniqueness.
1004
1005
**Outbox**
1006
`<sender-range-key>deliverable-outbox:<recipient-key><outbox-timestamp>`
1007
`<sender-range-key>executable-outbox:<recipient-key><outbox-timestamp>`
1008
1009
**Inbox**
1010
`<recipient-key>inbox:<outbox-timestamp><sender-range-key>`
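
The Go sketch below builds keys in these formats; the fixed-width decimal timestamp encoding is an assumption made purely so the keys sort correctly in the example.

```go
package main

import (
	"fmt"
	"time"
)

// Key construction matching the formats listed above; the timestamp
// encoding (fixed-width decimal nanos) is illustrative.
func deliverableOutboxKey(senderRangeKey, recipientKey string, ts time.Time) string {
	return fmt.Sprintf("%sdeliverable-outbox:%s%020d", senderRangeKey, recipientKey, ts.UnixNano())
}

func inboxKey(recipientKey, senderRangeKey string, ts time.Time) string {
	// Inbox keys sort by outbox timestamp, giving an approximate send order;
	// the sender range key suffix provides uniqueness.
	return fmt.Sprintf("%sinbox:%020d%s", recipientKey, ts.UnixNano(), senderRangeKey)
}

func main() {
	now := time.Now()
	fmt.Println(deliverableOutboxKey("rangeA/", "user/42/", now))
	fmt.Println(inboxKey("user/42/", "rangeA/", now))
}
```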
1011
1012
Messages are processed and then deleted as part of a single distributed
1013
transaction. The message will be executed or delivered exactly once,
1014
regardless of failures at either sender or receiver.
1015
1016
Delivered messages may be read by clients via the ReapQueue operation.
1017
This operation may only be used as part of a transaction. Clients should
1018
commit only after having processed the message. If the transaction is
1019
committed, scanned messages are automatically deleted. The operation
1020
name was chosen to reflect its mutating side effect. Deletion of read
1021
messages is mandatory because senders deliver messages asynchronously
1022
and a delay could cause insertion of messages at arbitrary points in the
1023
inbox queue. If clients require persistence, they should re-save read
1024
messages manually; the ReapQueue operation can be incorporated into
1025
normal transactional updates.
1026
1027
# Node Allocation (via Gossip)
1028
1029
New nodes must be allocated when a range is split. Instead of requiring
1030
every RoachNode to know about the status of all or even a large number
1031
of peer nodes --or-- alternatively requiring a specialized curator or
1032
master with sufficiently global knowledge, we use a gossip protocol to
1033
efficiently communicate only interesting information between all of the
1034
nodes in the cluster. What’s interesting information? One example would
1035
be whether a particular node has a lot of spare capacity. Each node,
1036
when gossiping, compares each topic of gossip to its own state. If its
1037
own state is somehow “more interesting” than the least interesting item
1038
in the topic it’s seen recently, it includes its own state as part of
1039
the next gossip session with a peer node. In this way, a node with
1040
capacity sufficiently in excess of the mean quickly becomes discovered
1041
by the entire cluster. To avoid piling onto outliers, nodes from the
1042
high capacity set are selected at random for allocation.
1043
1044
The gossip protocol itself contains two primary components:
1045
1046
- **Peer Selection**: each node maintains up to N peers with which it
1047
regularly communicates. It selects peers with an eye towards
1048
maximizing fanout. A peer node which itself communicates with an
1049
array of otherwise unknown nodes will be selected over one which
1050
communicates with a set containing significant overlap. Each time
1051
gossip is initiated, each node’s set of peers is exchanged. Each
1052
node is then free to incorporate the other’s peers as it sees fit.
1053
To avoid any node suffering from excess incoming requests, a node
1054
may refuse to answer a gossip exchange. Each node is biased
1055
towards answering requests from nodes without significant overlap
1056
and refusing requests otherwise.
1057
1058
Peers are efficiently selected using a heuristic as described in
1059
[Agarwal & Trachtenberg (2006)](https://drive.google.com/file/d/0B9GCVTp_FHJISmFRTThkOEZSM1U/edit?usp=sharing).
1060
1061
**TBD**: how to avoid partitions? Need to work out a simulation of
1062
the protocol to tune the behavior and see empirically how well it
1063
works.
1064
1065
- **Gossip Selection**: what to communicate. Gossip is divided into
1066
topics. Load characteristics (capacity per disk, cpu load, and
1067
state [e.g. draining, ok, failure]) are used to drive node
1068
allocation. Range statistics (range read/write load, missing
1069
replicas, unavailable ranges) and network topology (inter-rack
1070
bandwidth/latency, inter-datacenter bandwidth/latency, subnet
1071
outages) are used for determining when to split ranges, when to
1072
recover replicas vs. wait for network connectivity, and for
1073
debugging / sysops. In all cases, a set of minimums and a set of
1074
maximums is propagated; each node applies its own view of the
1075
world to augment the values. Each minimum and maximum value is
1076
tagged with the reporting node and other accompanying contextual
1077
information. Each topic of gossip has its own protobuf to hold the
1078
structured data. The number of items of gossip in each topic is
1079
limited by a configurable bound.
1080
1081
For efficiency, nodes assign each new item of gossip a sequence
1082
number and keep track of the highest sequence number each peer
1083
node has seen. Each round of gossip communicates only the delta
1084
containing new items.
1085
1086
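
The sequence-number bookkeeping described above lends itself to a simple
delta computation. The sketch below is a minimal illustration under
assumed types and names (`Info`, `Gossiper`, `DeltaFor` are not the actual
implementation): a node tracks the highest sequence number each peer has
seen and sends only newer items on each round.

```go
package main

import "fmt"

// Info is one item of gossip within a topic. Field names are illustrative.
type Info struct {
	Topic string
	Key   string
	Value string
	Seq   int64 // sequence number assigned by the originating node
}

// Gossiper tracks local gossip items and the highest sequence number
// each peer is known to have received.
type Gossiper struct {
	nextSeq  int64
	items    []Info
	peerHigh map[string]int64 // peer node ID -> highest seq already sent
}

func NewGossiper() *Gossiper {
	return &Gossiper{peerHigh: map[string]int64{}}
}

// Add records a new local item and assigns it the next sequence number.
func (g *Gossiper) Add(topic, key, value string) {
	g.nextSeq++
	g.items = append(g.items, Info{Topic: topic, Key: key, Value: value, Seq: g.nextSeq})
}

// DeltaFor returns only the items the given peer has not yet seen and
// advances that peer's high-water mark, so the next round sends nothing
// unless new items have been added in the meantime.
func (g *Gossiper) DeltaFor(peer string) []Info {
	high := g.peerHigh[peer]
	var delta []Info
	for _, it := range g.items {
		if it.Seq > high {
			delta = append(delta, it)
		}
	}
	g.peerHigh[peer] = g.nextSeq
	return delta
}

func main() {
	g := NewGossiper()
	g.Add("capacity", "node1", "disk=80%free")
	g.Add("capacity", "node2", "disk=15%free")
	fmt.Println(len(g.DeltaFor("node3"))) // 2: both items are new to node3
	g.Add("load", "node1", "cpu=0.2")
	fmt.Println(len(g.DeltaFor("node3"))) // 1: only the newly added item
}
```
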
# Node Accounting

The gossip protocol discussed in the previous section is useful to
quickly communicate fragments of important information in a
decentralized manner. However, complete accounting for each node is also
stored to a central location, available to any dashboard process. This
is done using the map itself. Each node periodically writes its state to
the map with keys prefixed by `\0node`, similar to the first level of
range metadata, but with a ‘`node`’ suffix. Each value is a protobuf
containing the full complement of node statistics--everything
communicated normally via the gossip protocol plus other useful, but
non-critical data.

The range containing the first key in the node accounting table is
responsible for gossiping the total count of nodes. This total count is
used by the gossip network to most efficiently organize itself. In
particular, the maximum number of hops for gossiped information to take
before reaching a node is given by `ceil(log(node count) / log(max
fanout)) + 1`.
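
For example, the hop bound above can be evaluated directly; the short
sketch below (the function name is illustrative, not part of the
codebase) computes `ceil(log(node count) / log(max fanout)) + 1` for a
couple of cluster sizes.

```go
package main

import (
	"fmt"
	"math"
)

// maxGossipHops returns the maximum number of hops gossiped information
// takes before reaching every node: ceil(log(nodeCount)/log(maxFanout)) + 1.
func maxGossipHops(nodeCount, maxFanout int) int {
	return int(math.Ceil(math.Log(float64(nodeCount))/math.Log(float64(maxFanout)))) + 1
}

func main() {
	// With a fanout of 3, a 100-node cluster needs at most 6 hops,
	// and a 1000-node cluster at most 8.
	fmt.Println(maxGossipHops(100, 3))  // ceil(4.19) + 1 = 6
	fmt.Println(maxGossipHops(1000, 3)) // ceil(6.29) + 1 = 8
}
```
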
# Key-prefix Accounting, Zones & Permissions

Arbitrarily fine-grained accounting and permissions are specified via
key prefixes. Key prefixes can overlap, as is necessary for capturing
hierarchical relationships. For illustrative purposes, let’s say keys
specifying rows in a set of databases have the following format:

`<db>:<table>:<primary-key>[:<secondary-key>]`

In this case, we might collect accounting or specify permissions with
key prefixes:

`db1`, `db1:user`, `db1:order`

Accounting is kept for the entire map by default.
## Accounting

To keep accounting for a range defined by a key prefix, an entry is created in
the accounting system table. The format of accounting table keys is:

`\0acct<key-prefix>`

In practice, we assume each RoachNode is capable of caching the
entire accounting table as it is likely to be relatively small.

Accounting is kept for key prefix ranges with eventual consistency
for efficiency. Updates to accounting values propagate through the
system using the message queue facility if the accounting keys do
not reside on the same range as ongoing activity (true for all but
the smallest systems). There are two types of values which
comprise accounting: counts and occurrences, for lack of better
terms. Counts describe system state, such as the total number of
bytes, rows, etc. Occurrences include transient performance and
load metrics. Both types of accounting are captured as time series
with minute granularity. The length of time accounting metrics are
kept is configurable. Below are examples of each type of
accounting value.
**System State Counters/Performance**

- Count of items (e.g. rows)
- Total bytes
- Total key bytes
- Total value length
- Queued message count
- Queued message total bytes
- Count of values \< 16B
- Count of values \< 64B
- Count of values \< 256B
- Count of values \< 1K
- Count of values \< 4K
- Count of values \< 16K
- Count of values \< 64K
- Count of values \< 256K
- Count of values \< 1M
- Count of values \> 1M
- Total bytes of accounting

**Load Occurrences**

- Get op count
- Get total MB
- Put op count
- Put total MB
- Delete op count
- Delete total MB
- Delete range op count
- Delete range total MB
- Scan op count
- Scan total MB
- Split count
- Merge count
Because accounting information is kept as time series and over many
possible metrics of interest, the amount of data can become substantial.
Accounting data are stored in the map near the key prefix described,
in order to distribute load (for both aggregation and storage).

Accounting keys for system state have the form:
`<key-prefix>|acctd<metric-name>*`. Notice the leading ‘pipe’
character. It’s meant to sort the root level account AFTER any other
system tables. They must increment the same underlying values as they
are permanent counts, and not transient activity. Logic at the
RoachNode takes care of snapshotting the value into an appropriately
suffixed (e.g. with timestamp hour) multi-value time series entry.

Keys for perf/load metrics:
`<key-prefix>acctd<metric-name><hourly-timestamp>`.

`<hourly-timestamp>`-suffixed accounting entries are multi-valued,
containing a varint64 entry for each minute with activity during the
specified hour.
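
To make the two key layouts concrete, the sketch below assembles a
permanent system-state key (note the leading pipe) and an
hourly-suffixed perf/load key. The helper names and the specific
timestamp rendering are assumptions of this sketch, not the actual key
encoding.

```go
package main

import (
	"fmt"
	"time"
)

// systemStateKey builds a permanent counter key of the form
// <key-prefix>|acctd<metric-name>; the leading pipe sorts these after
// other system entries sharing the same prefix.
func systemStateKey(keyPrefix, metric string) string {
	return keyPrefix + "|acctd" + metric
}

// loadMetricKey builds an hourly time-series key of the form
// <key-prefix>acctd<metric-name><hourly-timestamp>. The hour is encoded
// here as a YYYYMMDDHH string purely for readability; each such entry
// would then hold up to 60 per-minute values.
func loadMetricKey(keyPrefix, metric string, t time.Time) string {
	hour := t.UTC().Truncate(time.Hour).Format("2006010215")
	return keyPrefix + "acctd" + metric + hour
}

func main() {
	fmt.Println(systemStateKey("db1:user", "total-bytes"))
	fmt.Println(loadMetricKey("db1:user", "get-op-count", time.Now()))
}
```
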
To efficiently keep accounting over large key ranges, the task of
aggregation must be distributed. If activity occurs within the same
range as the key prefix for accounting, the updates are made as part
of the consensus write. If the ranges differ, then a message is sent
to the parent range to increment the accounting. If upon receiving the
message, the parent range also does not include the key prefix, it in
turn forwards it to its parent or left child in the balanced binary
tree which is maintained to describe the range hierarchy. This limits
the number of messages before an update is visible at the root to `2*log N`,
where `N` is the number of ranges in the key prefix.
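
As a rough check on that bound, the sketch below evaluates the
worst-case message count for a prefix spanning `N` ranges, assuming the
`log` in `2*log N` is base 2 (an assumption of this sketch; the function
name is illustrative).

```go
package main

import (
	"fmt"
	"math"
)

// maxAccountingMessages returns the 2*ceil(log2(N)) worst-case number of
// messages before an accounting update becomes visible at the root of
// the balanced binary tree over N ranges.
func maxAccountingMessages(numRanges int) int {
	if numRanges <= 1 {
		return 0
	}
	return 2 * int(math.Ceil(math.Log2(float64(numRanges))))
}

func main() {
	fmt.Println(maxAccountingMessages(1024)) // 20 messages for 1024 ranges
	fmt.Println(maxAccountingMessages(100))  // 14 messages for 100 ranges
}
```
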
## Zones

Zones are stored in the map with keys prefixed by
`\0zone` followed by the key prefix to which the zone
configuration applies. Zone values specify a protobuf containing
the datacenters from which replicas must be chosen for ranges
which fall under the zone.

Please see [proto/config.proto](https://github.com/cockroachdb/cockroach/blob/master/proto/config.proto) for up-to-date data structures used, the best entry point being `message ZoneConfig`.

If zones are modified in situ, each RoachNode verifies the
existing zones for its ranges against the zone configuration. If
it discovers differences, it reconfigures ranges in the same way
that it rebalances away from busy nodes, via a special-case 1:1
split to a duplicate range comprising the new configuration.
## Permissions

Permissions are stored in the map with keys prefixed by `\0perm` followed by
the key prefix and user to which the specified permissions apply. The format of
permissions keys is:

`\0perm<key-prefix><user>`

Permission values are a protobuf containing the permission details;
please see [proto/config.proto](https://github.com/cockroachdb/cockroach/blob/master/proto/config.proto) for up-to-date data structures used, the best entry point being `message PermConfig`.

A default system root permission is assumed for the entire map
with an empty key prefix and read/write as true.

When determining whether or not to allow a read or a write of a key
value (e.g. `db1:user:1` for user `spencer`), a RoachNode would
read the following permissions values:

```
\0perm<db1:user:1>spencer
\0perm<db1:user>spencer
\0perm<db1>spencer
\0perm<>spencer
```

If any prefix in the hierarchy provides the required permission,
the request is satisfied; otherwise, the request returns an
error.
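
A minimal sketch of that lookup order, assuming an in-memory map from
permission key to a read/write pair; the `Perm` type, the helper names,
and the colon-delimited prefix trimming are illustrative simplifications
rather than the actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// Perm is an illustrative stand-in for the permission details protobuf.
type Perm struct {
	Read, Write bool
}

// permKey mirrors the \0perm<key-prefix><user> layout described above.
func permKey(prefix, user string) string {
	return "\x00perm" + prefix + user
}

// allowed walks the prefix hierarchy from most to least specific
// (db1:user:1, db1:user, db1, "") and grants the request if any level
// provides the needed permission.
func allowed(perms map[string]Perm, key, user string, write bool) bool {
	prefix := key
	for {
		if p, ok := perms[permKey(prefix, user)]; ok {
			if (write && p.Write) || (!write && p.Read) {
				return true
			}
		}
		if prefix == "" {
			return false
		}
		if i := strings.LastIndex(prefix, ":"); i >= 0 {
			prefix = prefix[:i]
		} else {
			prefix = ""
		}
	}
}

func main() {
	perms := map[string]Perm{
		permKey("db1", "spencer"): {Read: true, Write: true},
	}
	fmt.Println(allowed(perms, "db1:user:1", "spencer", true))  // true, via the db1 prefix
	fmt.Println(allowed(perms, "db2:order:7", "spencer", true)) // false, no matching prefix
}
```
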
The priority for a user permission is used to order requests at
Raft consensus ranges and for choosing an initial priority for
distributed transactions. When scheduling operations at the Raft
consensus range, all outstanding requests are ordered by key
prefix and each is assigned priorities according to key, user and
arrival time. The next request is chosen probabilistically using
priorities to weight the choice. Each key can have multiple
priorities as they’re hierarchical (e.g. for /user/key, one
priority for root ‘/’, and one for ‘/user/key’). The most general
priority is used first. If two keys share the most general, then
they’re compared with the next most general if applicable, and so on.
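
The probabilistic, priority-weighted choice can be sketched as weighted
random selection over the pending requests. Everything below (the
`request` type, a single scalar priority per request already resolved
from its key/user hierarchy) is a simplification assumed for
illustration, not the scheduler itself.

```go
package main

import (
	"fmt"
	"math/rand"
)

// request is an illustrative pending operation whose effective priority
// has already been resolved from its key/user hierarchy.
type request struct {
	key      string
	priority float64
}

// pickNext chooses the next request to schedule, with probability
// proportional to each request's priority.
func pickNext(rng *rand.Rand, pending []request) request {
	total := 0.0
	for _, r := range pending {
		total += r.priority
	}
	target := rng.Float64() * total
	for _, r := range pending {
		target -= r.priority
		if target <= 0 {
			return r
		}
	}
	return pending[len(pending)-1]
}

func main() {
	rng := rand.New(rand.NewSource(42))
	pending := []request{
		{key: "db1:user:1", priority: 1},
		{key: "db1:order:9", priority: 4}, // roughly 4x as likely to be chosen
	}
	counts := map[string]int{}
	for i := 0; i < 10000; i++ {
		counts[pickNext(rng, pending).key]++
	}
	fmt.Println(counts)
}
```
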
# Key-Value API

See the protobufs in [proto/](https://github.com/cockroachdb/cockroach/blob/master/proto),
in particular [proto/api.proto](https://github.com/cockroachdb/cockroach/blob/master/proto/api.proto) and the comments within.

# Structured Data API

A preliminary design can be found in the [Go source documentation](http://godoc.org/github.com/cockroachdb/cockroach/structured).