1
# About
2
This document is an updated version of the original design documents
3
by Spencer Kimball from early 2014.
4
5
# Overview
6
7
Cockroach is a distributed key:value datastore which supports **ACID
8
transactional semantics** and **versioned values** as first-class
9
features. The primary design goal is **global consistency and
10
survivability**, hence the name. Cockroach aims to tolerate disk,
11
machine, rack, and even **datacenter failures** with minimal latency
12
disruption and **no manual intervention**. Cockroach nodes are
13
symmetric; a design goal is **homogeneous deployment** (one binary) with
14
minimal configuration.
15
16
Cockroach implements a **single, monolithic sorted map** from key to
17
value where both keys and values are byte strings (not unicode).
18
Cockroach **scales linearly** (theoretically up to 4 exabytes (4E) of
19
logical data). The map is composed of one or more ranges and each range
20
is backed by data stored in [RocksDB](http://rocksdb.org/) (a
21
variant of LevelDB), and is replicated to a total of three or more
22
cockroach servers. Ranges are defined by start and end keys. Ranges are
23
merged and split to maintain total byte size within a globally
24
configurable min/max size interval. Range sizes default to target `64M` in
25
order to facilitate quick splits and merges and to distribute load at
26
hotspots within a key range. Range replicas are intended to be located
27
in disparate datacenters for survivability (e.g. `{ US-East, US-West,
28
Japan }`, `{ Ireland, US-East, US-West}`, `{ Ireland, US-East, US-West,
29
Japan, Australia }`).
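
As a rough illustration of the data model sketched above, the following Go snippet models a range as a start/end key span plus a set of replica locations. The type and field names are hypothetical and greatly simplified, not Cockroach's actual structures.

```go
package main

import (
	"bytes"
	"fmt"
)

// RangeDescriptor is a hypothetical sketch of the metadata describing one
// range of the monolithic sorted map: its key span and replica locations.
type RangeDescriptor struct {
	StartKey []byte   // inclusive
	EndKey   []byte   // exclusive
	Replicas []string // e.g. datacenter/node addresses
}

// ContainsKey reports whether key falls within the range's span.
func (r RangeDescriptor) ContainsKey(key []byte) bool {
	return bytes.Compare(key, r.StartKey) >= 0 && bytes.Compare(key, r.EndKey) < 0
}

func main() {
	r := RangeDescriptor{
		StartKey: []byte("a"),
		EndKey:   []byte("m"),
		Replicas: []string{"us-east", "us-west", "japan"},
	}
	fmt.Println(r.ContainsKey([]byte("apple"))) // true
	fmt.Println(r.ContainsKey([]byte("zebra"))) // false
}
```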
30
31
Single mutations to ranges are mediated via an instance of a distributed
32
consensus algorithm to ensure consistency. We’ve chosen to use the
33
[Raft consensus
34
algorithm](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf).
35
All consensus state is stored in RocksDB.
36
37
A single logical mutation may affect multiple key/value pairs. Logical
38
mutations have ACID transactional semantics. If all keys affected by a
39
logical mutation fall within the same range, atomicity and consistency
40
are guaranteed by Raft; this is the **fast commit path**. Otherwise, a
41
**non-locking distributed commit** protocol is employed between affected
42
ranges.
43
44
Cockroach provides [snapshot isolation](http://en.wikipedia.org/wiki/Snapshot_isolation) (SI) and
45
serializable snapshot isolation (SSI) semantics, allowing **externally
46
consistent, lock-free reads and writes**--both from a historical
47
snapshot timestamp and from the current wall clock time. SI provides
48
lock-free reads and writes but still allows write skew. SSI eliminates
49
write skew, but introduces a performance hit in the case of a
50
contentious system. SSI is the default isolation; clients must
51
consciously decide to trade correctness for performance. Cockroach
52
implements [a limited form of linearizability](#linearizability),
53
providing ordering for any observer or chain of observers.
54
55
Similar to
56
[Spanner](http://static.googleusercontent.com/media/research.google.com/en/us/archive/spanner-osdi2012.pdf)
57
directories, Cockroach allows configuration of arbitrary zones of data.
58
This allows replication factor, storage device type, and/or datacenter
59
location to be chosen to optimize performance and/or availability.
60
Unlike Spanner, zones are monolithic and don’t allow movement of fine
61
grained data on the level of entity groups.
62
63
A
64
[Megastore](http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf)-like
65
message queue mechanism is also provided to 1) efficiently sideline
66
updates which can tolerate asynchronous execution and 2) provide an
67
integrated message queuing system for asynchronous communication between
68
distributed system components.
69
70
# Architecture
71
72
Cockroach implements a layered architecture. The highest level of
73
abstraction is the SQL layer (currently unspecified in this document).
74
It depends directly on the [*structured data
75
API*](#structured-data-api), which provides familiar relational concepts
76
such as schemas, tables, columns, and indexes. The structured data API
77
in turn depends on the [distributed key value store](#key-value-api),
78
which handles the details of range addressing to provide the abstraction
79
of a single, monolithic key value store. The distributed KV store
80
communicates with any number of physical cockroach nodes. Each node
81
contains one or more stores, one per physical device.
82
83

84
85
Each store contains potentially many ranges, the lowest-level unit of
86
key-value data. Ranges are replicated using the Raft consensus protocol.
87
The diagram below is a blown up version of stores from four of the five
88
nodes in the previous diagram. Each range is replicated three ways using
89
raft. The color coding shows associated range replicas.
90
91

92
93
Each physical node exports a RoachNode service. Each RoachNode exports
94
one or more key ranges. RoachNodes are symmetric. Each has the same
95
binary and assumes identical roles.
96
97
Nodes and the ranges they provide access to can be arranged with various
98
physical network topologies to make trade offs between reliability and
99
performance. For example, a triplicated (3-way replica) range could have
100
each replica located on different:
101
102
- disks within a server to tolerate disk failures.
103
- servers within a rack to tolerate server failures.
104
- servers on different racks within a datacenter to tolerate rack power/network failures.
105
- servers in different datacenters to tolerate large scale network or power outages.
106
107
Up to `F` failures can be tolerated, where the total number of replicas `N = 2F + 1` (e.g. with 3x replication, one failure can be tolerated; with 5x replication, two failures, and so on).
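
A one-line illustration of the `N = 2F + 1` relationship (a hypothetical helper, not part of the codebase):

```go
package main

import "fmt"

// maxFailures returns the number of replica failures a consensus group of
// the given size can tolerate: F = (N - 1) / 2.
func maxFailures(replicas int) int {
	return (replicas - 1) / 2
}

func main() {
	fmt.Println(maxFailures(3)) // 1
	fmt.Println(maxFailures(5)) // 2
}
```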
108
109
# Cockroach Client
110
111
In order to support diverse client usage, Cockroach clients connect to
112
any node via HTTPS using protocol buffers or JSON. The connected node
113
proxies involved client work including key lookups and write buffering.
114
115
# Keys
116
117
Cockroach keys are arbitrary byte arrays. If textual data is used in
118
keys, utf8 encoding is recommended (this helps for cleaner display of
119
values in debugging tools). User-supplied keys are encoded using an
120
ordered code. System keys are either prefixed with null characters (`\0`
121
or `\0\0`) for system tables, or take the form of
122
`<user-key><system-suffix>` to sort user-key-range specific system
123
keys immediately after the user keys they refer to. Null characters are
124
used in system key prefixes to guarantee that they sort first.
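
The sketch below assembles keys in the two system layouts described above (null-prefixed system tables and `<user-key><system-suffix>` keys). The helper names are illustrative only and ignore the ordered encoding of user keys.

```go
package main

import "fmt"

// systemKey prefixes a system table key with a null character so that it
// sorts before all user keys.
func systemKey(table, rest string) []byte {
	return append([]byte{0}, []byte(table+rest)...)
}

// rangeLocalKey appends a system suffix to a user key so the system key
// sorts immediately after the user key it refers to.
func rangeLocalKey(userKey, suffix string) []byte {
	return []byte(userKey + suffix)
}

func main() {
	fmt.Printf("%q\n", systemKey("acct", "db1"))        // "\x00acctdb1"
	fmt.Printf("%q\n", rangeLocalKey("user42", "stat")) // "user42stat"
}
```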
125
126
# Versioned Values
127
128
Cockroach maintains historical versions of values by storing them with
129
associated commit timestamps. Reads and scans can specify a snapshot
130
time to return the most recent writes prior to the snapshot timestamp.
131
Older versions of values are garbage collected by the system during
132
compaction according to a user-specified expiration interval. In order
133
to support long-running scans (e.g. for MapReduce), all versions have a
134
minimum expiration.
135
136
Versioned values are supported via modifications to RocksDB to record
137
commit timestamps and GC expirations per key.
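
A minimal sketch of the versioned-read behavior described above, assuming versions are kept per key sorted by commit timestamp (hypothetical types; the real storage layout lives inside RocksDB):

```go
package main

import "fmt"

// version is one committed value of a key at a commit timestamp.
type version struct {
	ts    int64 // commit timestamp (nanoseconds, say)
	value string
}

// readAt returns the most recent version at or before the snapshot
// timestamp, mirroring the snapshot-read behavior described above.
// Versions are assumed sorted by ascending timestamp.
func readAt(versions []version, snapshotTS int64) (string, bool) {
	for i := len(versions) - 1; i >= 0; i-- {
		if versions[i].ts <= snapshotTS {
			return versions[i].value, true
		}
	}
	return "", false
}

func main() {
	history := []version{{ts: 10, value: "a"}, {ts: 20, value: "b"}, {ts: 30, value: "c"}}
	v, _ := readAt(history, 25)
	fmt.Println(v) // "b": the most recent write prior to the snapshot timestamp
}
```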
138
139
# Lock-Free Distributed Transactions
140
141
Cockroach provides distributed transactions without locks. Cockroach
142
transactions support two isolation levels:
143
144
- snapshot isolation (SI) and
145
- *serializable* snapshot isolation (SSI).
146
147
*SI* is simple to implement, highly performant, and correct for all but a
148
handful of anomalous conditions (e.g. write skew). *SSI* requires just a touch
149
more complexity, is still highly performant (less so with contention), and has
150
no anomalous conditions. Cockroach’s SSI implementation is based on ideas from
151
the literature and some possibly novel insights.
152
153
SSI is the default level, with SI provided for application developers
154
who are certain enough of their need for performance and the absence of
155
write skew conditions to consciously elect to use it. In a lightly
156
contended system, our implementation of SSI is just as performant as SI,
157
requiring no locking or additional writes. With contention, our
158
implementation of SSI still requires no locking, but will end up
159
aborting more transactions. Cockroach’s SI and SSI implementations
160
prevent starvation scenarios even for arbitrarily long transactions.
161
162
See the [Cahill paper](https://drive.google.com/file/d/0B9GCVTp_FHJIcEVyZVdDWEpYYXVVbFVDWElrYUV0NHFhU2Fv/edit?usp=sharing)
163
for one possible implementation of SSI. The [Calvin paper](http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf) is also a great read.
164
For a discussion of SSI implemented by preventing read-write conflicts
165
(in contrast to detecting them, called write-snapshot isolation), see
166
the [Yabandeh paper](https://drive.google.com/file/d/0B9GCVTp_FHJIMjJ2U2t6aGpHLTFUVHFnMTRUbnBwc2pLa1RN/edit?usp=sharing),
167
which is the source of much inspiration for Cockroach’s SSI.
168
169
Each Cockroach transaction is assigned a random priority and a
170
"candidate timestamp" at start. The candidate timestamp is the
171
provisional timestamp at which the transaction will commit, and is
172
chosen as the current clock time of the node coordinating the
173
transaction. This means that a transaction without conflicts will
174
usually commit with a timestamp that, in absolute time, precedes the
175
actual work done by that transaction.
176
177
In the course of organizing the transaction between one or more
178
distributed nodes, the candidate timestamp may be increased, but will
179
never be decreased. The core difference between the two isolation levels
180
SI and SSI is that the former allows its commit timestamp to increase
181
and the latter does not.
182
183
Timestamps are a combination of both a physical and a logical component
184
to support monotonic increments without degenerate cases causing
185
timestamps to diverge from wall clock time, following closely the
186
[*Hybrid Logical Clock
187
paper.*](http://www.cse.buffalo.edu/tech-reports/2014-04.pdf)
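
A toy hybrid logical clock along the lines of the cited paper might look as follows; it is simplified (single-threaded, no incorporation of remote timestamps) and the names are assumptions, not Cockroach's implementation:

```go
package main

import (
	"fmt"
	"time"
)

// Timestamp is a hybrid logical clock reading: wall time plus a logical
// counter used to break ties when the wall clock does not advance.
type Timestamp struct {
	WallTime int64 // nanoseconds
	Logical  int32
}

// HLC is a toy hybrid logical clock (not thread-safe; illustration only).
type HLC struct {
	last Timestamp
}

// Now returns a timestamp strictly greater than any previously returned
// one, even if the physical clock stands still or briefly regresses.
func (c *HLC) Now() Timestamp {
	wall := time.Now().UnixNano()
	if wall > c.last.WallTime {
		c.last = Timestamp{WallTime: wall}
	} else {
		c.last.Logical++
	}
	return c.last
}

func main() {
	var c HLC
	fmt.Println(c.Now())
	fmt.Println(c.Now()) // greater, via the logical component if needed
}
```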
188
189
Transactions are executed in two phases:
190
191
1. Start the transaction by writing a new entry to the system
192
transaction table (keys prefixed by *\0tx*) with state “PENDING”.
193
In practice, this is done along with the first operation in the
194
transaction.
195
196
2. Write an "intent" value for each datum being written as part of the
197
transaction. These are normal MVCC values, with the addition of a
198
special flag (i.e. “intent”) indicating that the value may be
199
committed later, if the transaction itself commits. In addition,
200
the transaction id (unique and chosen at tx start time by client)
201
is stored with intent values. The tx id is used to refer to the
202
transaction table when there are conflicts and to make
203
tie-breaking decisions on ordering between identical timestamps.
204
Each node returns the timestamp used for the write; the client
205
selects the maximum from amongst all writes as the final commit
206
timestamp.
207
208
Each range maintains a small *in-memory* cache (i.e. the latest 10s of
read timestamps) from key to the latest timestamp at which the key(s)
were read. This *latest-read-cache* is consulted on each write. If the
write’s candidate timestamp is earlier than the low water mark on the
cache itself (i.e. its last evicted timestamp), or if the key being
written has a read timestamp later than the write’s candidate
timestamp, this later timestamp value is returned with the write. The
cache’s entries are evicted oldest timestamp first, updating the low
water mark as appropriate. If a new range replica leader is elected, it
sets the low water mark for the cache to the current wall time + ε
(ε = 99th percentile clock skew). A sketch of this check appears after
the transaction phases below.
220
221
3. Commit the transaction by updating its entry in the system
222
transaction table (keys prefixed by *\0tx*). The value of the
223
commit entry contains the candidate timestamp (increased as
224
necessary to accommodate any latest read timestamps). Note that
225
the transaction is considered fully committed at this point and
226
control may be returned to the client.
227
228
In the case of an SI transaction, a commit timestamp which was
229
increased to accommodate concurrent readers is perfectly
230
acceptable and the commit may continue. For SSI transactions,
231
however, a gap between candidate and commit timestamps
232
necessitates transaction restart (note: restart is different than
233
abort--see below).
234
235
Additionally and in parallel, all written values are upgraded by
236
removing the “intent” flag. The transaction is considered fully
237
committed before this step and does not wait for it to return
238
control to the transaction coordinator.
239
240
In the absence of conflicts, this is the end. Nothing else is necessary
241
to ensure the correctness of the system.
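
The following sketch illustrates the *latest-read-cache* check from step 2 above. The type and method names are hypothetical, and a bare map stands in for whatever structure a real implementation would use:

```go
package main

import "fmt"

// readTimestampCache is a toy version of the per-range latest-read-cache:
// it remembers the latest read timestamp per key plus a low water mark
// covering evicted entries.
type readTimestampCache struct {
	lowWater int64
	reads    map[string]int64
}

// noteRead records that key was read at ts.
func (c *readTimestampCache) noteRead(key string, ts int64) {
	if ts > c.reads[key] {
		c.reads[key] = ts
	}
}

// adjustWrite returns the timestamp a write to key must use: at least the
// candidate, pushed up to any later read of the key or the low water mark.
func (c *readTimestampCache) adjustWrite(key string, candidate int64) int64 {
	ts := candidate
	if c.lowWater > ts {
		ts = c.lowWater
	}
	if readTS := c.reads[key]; readTS > ts {
		ts = readTS
	}
	return ts
}

func main() {
	c := &readTimestampCache{lowWater: 5, reads: map[string]int64{}}
	c.noteRead("a", 12)
	fmt.Println(c.adjustWrite("a", 10)) // 12: pushed above the latest read
	fmt.Println(c.adjustWrite("b", 3))  // 5: pushed above the low water mark
}
```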
242
243
**Conflict Resolution**
244
245
Things get more interesting when a reader or writer encounters an intent
246
record or newly-committed value in a location that it needs to read or
247
write. This is a conflict, usually causing either of the transactions to
248
abort or restart depending on the type of conflict.
249
250
***Transaction restart:***
251
252
This is the usual (and more efficient) type of behaviour and is used
except when the transaction was aborted (for instance by another
transaction).
In effect, that reduces to two cases, the first being the one outlined
above: an SSI transaction that finds (upon attempting to commit) that
its commit timestamp has been pushed. In the second case, a transaction
actively encounters a conflict, that is, one of its readers or writers
encounters data that necessitates conflict resolution.
260
261
When a transaction restarts, it changes its priority and/or moves its
timestamp forward depending on data tied to the conflict, and the
transaction begins anew, updating its intents. Since the set of keys
being written may change between restarts, the client maintains the set
of keys written during prior attempts at the transaction. As it
restarts the transaction from the beginning, it removes keys from this
set as it writes them again. The remaining keys--should the transaction
run to completion--are crufty write intents which must be deleted
*before* the transaction commit record’s status is set to COMMITTED.
Many transactions will have no keys in this set.
271
272
***Transaction abort:***
273
274
This is the case in which a transaction, upon reading its transaction
275
table entry, finds that it has been aborted. In this case, the
276
transaction can not reuse its intents; it returns control to the client
277
before cleaning them up (other readers and writers would clean up
278
dangling intents as they encounter them) but will make an effort to
279
clean up after itself. The next attempt (if applicable) then runs as a
280
different transaction.
281
282
There are several scenarios in which transactions interact:
283
284
- **Reader encounters write intent or value with newer timestamp far
285
enough in the future**: This is not a conflict. The reader is free
286
to proceed; after all, it will be reading an older version of the
287
value and so does not conflict. Recall that the write intent may
288
be committed with a later timestamp than its candidate; it will
289
never commit with an earlier one. **Side note**: if the reader
290
finds an intent with a newer timestamp which the reader’s own
291
transaction has written, the reader always returns that value.
292
293
- **Reader encounters write intent or value with newer timestamp in the
294
near future:** In this case, we have to be careful. The newer
295
intent may, in absolute terms, have happened in our read's past if
296
the clock of the writer is ahead of the node serving the values.
297
In that case, we would need to take this value into account, but
298
we just don't know. Hence the transaction restarts, using instead
299
a future timestamp (but remembering a maximum timestamp used to
300
limit the uncertainty window to the maximum clock skew). In fact,
301
this is optimized further; see the details under "Choosing a
Timestamp" below.
303
304
- **Reader encounters write intent with older timestamp**: the reader
305
must follow the intent’s transaction id to the transaction table.
306
If the transaction has already been committed, then the reader can
307
just read the value. If the write transaction has not yet been
308
committed, then the reader has two options. If the write conflict
309
is from an SI transaction, the reader can *push that transaction's
310
commit timestamp into the future* (and consequently not have to
311
read it). This is simple to do: the reader just updates the
312
transaction’s commit timestamp to indicate that when/if the
313
transaction does commit, it should use a timestamp *at least* as
314
high. However, if the write conflict is from an SSI transaction,
315
the reader must compare priorities. If it has the higher priority,
316
it pushes the transaction’s commit timestamp, as with SI (that
317
transaction will then notice its timestamp has been pushed, and
318
restart). If it has the lower priority, it retries itself using as
319
a new priority `max(new random priority, conflicting txn’s
320
priority - 1)`.
321
322
- **Writer encounters uncommitted write intent with lower priority**:
323
writer aborts the conflicting transaction.
324
325
- **Writer encounters uncommitted write intent with higher priority**:
the transaction retries, using as a new priority `max(new random
priority, conflicting txn’s priority - 1)`; the retry occurs after
a short, randomized backoff interval (see the sketch after this list).
329
330
- **Writer encounters committed write intent or newer committed value**:
331
The transaction restarts. On restart, the same priority is reused,
332
but the candidate timestamp is moved forward to the encountered
333
value's timestamp.
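
The priority rule quoted in the retry cases above can be sketched as a single helper; `retryPriority` is a hypothetical name, and real priorities would be carried on the transaction record:

```go
package main

import (
	"fmt"
	"math/rand"
)

// retryPriority implements the rule quoted in the conflict cases above: the
// restarting transaction adopts max(new random priority, other's priority - 1),
// so it cannot keep losing to the same conflicting transaction indefinitely.
func retryPriority(conflictingPriority int32) int32 {
	p := rand.Int31()
	if other := conflictingPriority - 1; other > p {
		return other
	}
	return p
}

func main() {
	fmt.Println(retryPriority(100))
}
```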
334
335
**Transaction management**
336
337
Transactions are managed by the client proxy (or gateway in SQL Azure
338
parlance). Unlike in Spanner, writes are not buffered but are sent
339
directly to all implicated ranges. This allows the transaction to abort
340
quickly if it encounters a write conflict. The client proxy keeps track
341
of all written keys in order to cleanup write intents upon transaction
342
completion.
343
344
If a transaction is completed successfully, all intents are upgraded to
345
committed. In the event a transaction is aborted, all written intents
346
are deleted. The client proxy doesn’t guarantee it will cleanup intents;
347
but dangling intents are upgraded or deleted when encountered by future
348
readers and writers and the system does not depend on their timely
349
cleanup for correctness.
350
351
In the event the client proxy restarts before the pending transaction is
352
completed, the dangling transaction would continue to live in the
353
transaction table until aborted by another transaction. Transactions
354
heartbeat the transaction table every five seconds by default.
355
Transactions encountered by readers or writers with dangling intents
356
which haven’t been heartbeat within the required interval are aborted.
357
358
An exploration of retries with contention and abort times with abandoned
359
transaction is
360
[here](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing).
361
362
**Transaction Table**
363
364
Please see [proto/data.proto](https://github.com/cockroachdb/cockroach/blob/master/proto/data.proto) for the up-to-date structures, the best entry point being `message Transaction`.
365
366
**Pros**
367
368
- No requirement for reliable code execution to prevent stalled 2PC
369
protocol.
370
- Readers never block with SI semantics; with SSI semantics, they may
371
abort.
372
- Lower latency than traditional 2PC commit protocol (w/o contention)
373
because second phase requires only a single write to the
374
transaction table instead of a synchronous round to all
375
transaction participants.
376
- Priorities avoid starvation for arbitrarily long transactions and
377
always pick a winner from between contending transactions (no
378
mutual aborts).
379
- Writes not buffered at client; writes fail fast.
380
- No read-locking overhead required for *serializable* SI (in contrast
381
to other SSI implementations).
382
- Well-chosen (i.e. less random) priorities can flexibly give
383
probabilistic guarantees on latency for arbitrary transactions
384
(for example: make OLTP transactions 10x less likely to abort than
385
low priority transactions, such as asynchronously scheduled jobs).
386
387
**Cons**
388
389
- Reads from non-leader replicas still require a ping to the leader to
390
update *latest-read-cache*.
391
- Abandoned transactions may block contending writers for up to the
392
heartbeat interval, though average wait is likely to be
393
considerably shorter (see [graph in link](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing)).
394
This is likely considerably more performant than detecting and
395
restarting 2PC in order to release read and write locks.
396
- Behavior differs from other SI implementations: there is no
first-writer-wins rule, and shorter transactions do not always finish
quickly. This element of surprise may be a problematic factor for OLTP
systems.
399
- Aborts can decrease throughput in a contended system compared with
400
two phase locking. Aborts and retries increase read and write
401
traffic, increase latency and decrease throughput.
402
403
**Choosing a Timestamp**
404
405
A key challenge of reading data in a distributed system with clock skew
406
is choosing a timestamp guaranteed to be greater than the latest
407
timestamp of any committed transaction (in absolute time). No system can
408
claim consistency and fail to read already-committed data.
409
410
Accomplishing this for transactions (or just single operations)
411
accessing a single node is easy. The transaction supplies 0 for
412
timestamp, indicating that the node should use its current time (time
413
for a node is kept using a hybrid clock which combines wall time and a
414
logical time). This guarantees data already committed to that node have
415
earlier timestamps.
416
417
For multiple nodes, the timestamp of the node coordinating the
418
transaction `t` is used. In addition, a maximum timestamp `t+ε` is
419
supplied to provide an upper bound on timestamps for already-committed
420
data (`ε` is the maximum clock skew). As the transaction progresses, any
421
data read which have timestamps greater than `t` but less than `t+ε`
422
cause the transaction to abort and retry with the conflicting timestamp
423
t<sub>c</sub>, where t<sub>c</sub> \> t. The maximum timestamp `t+ε` remains the same.
424
Time spent retrying because of reading recently committed data has an
425
upper bound of `ε`. In fact, this is further optimized: upon restarting,
426
the transaction not only takes into account the timestamp of the future
427
value, but the timestamp of the node at the time of the uncertain read.
428
The larger of those two timestamps (typically, the latter) is used to
429
bump up the read timestamp, and additionally the node is marked as
430
“certain”. This means that for future reads to that node within the
431
transaction, we can set `MaxTimestamp = Read Timestamp` (and hence avoid
432
further uncertainty restarts). Correctness follows from the fact that we
433
know that at the time of the read, there exists no version of any key on
434
that node with a higher timestamp; if we ran into one during a future
435
read, that node would have happened (in absolute time) after our
436
transaction started.
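
A compact sketch of the uncertainty-window logic just described, with hypothetical names and plain integer nanosecond timestamps:

```go
package main

import "fmt"

// uncertaintyRestart decides, for a value encountered at valueTS, whether a
// read at readTS with uncertainty bound maxTS (= readTS + ε at transaction
// start) must restart, and if so at what new read timestamp.
func uncertaintyRestart(readTS, maxTS, valueTS, nodeWallTS int64) (restart bool, newReadTS int64) {
	if valueTS <= readTS || valueTS > maxTS {
		return false, readTS // clearly in the past, or clearly in the future
	}
	// Uncertain: restart at the larger of the value's timestamp and the
	// node's wall time at the time of the read (typically the latter).
	newReadTS = valueTS
	if nodeWallTS > newReadTS {
		newReadTS = nodeWallTS
	}
	return true, newReadTS
}

func main() {
	fmt.Println(uncertaintyRestart(100, 110, 105, 107)) // true 107
	fmt.Println(uncertaintyRestart(100, 110, 120, 107)) // false 100
}
```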
437
438
We expect retries will be rare, but this assumption may need to be
439
revisited if retries become problematic. Note that this problem does not
440
apply to historical reads. An alternate approach which does not require
441
retries would be to make a round to all node participants in advance and
442
choose the highest reported node wall time as the timestamp. However,
443
knowing which nodes will be accessed in advance is difficult and
444
potentially limiting. Cockroach could also potentially use a global
445
clock (Google did this with Pinax TODO: link to paper), which would be
446
feasible for smaller, geographically-proximate clusters.
447
448
# Linearizability
449
450
First a word about [***Spanner***](http://research.google.com/archive/spanner.html).
451
By combining judicious use of wait intervals with accurate time signals,
452
Spanner provides a global ordering between any two non-overlapping transactions
453
(in absolute time) with \~14ms latencies. Put another way:
454
Spanner guarantees that if a transaction T<sub>1</sub> commits (in absolute time)
455
before another transaction T<sub>2</sub> starts, then T<sub>1</sub>'s assigned commit
456
timestamp is smaller than T<sub>2</sub>'s. Using atomic clocks and GPS receivers,
457
Spanner reduces its clock skew uncertainty to \< 10ms (`ε`). To make
458
good on the promised guarantee, transactions must take at least double
459
the clock skew uncertainty interval to commit (`2ε`). See [*this
460
article*](http://www.cs.cornell.edu/~ie53/publications/DC-col51-Sep13.pdf)
461
for a helpful overview of Spanner’s concurrency control.
462
463
Cockroach could make the same guarantees without specialized hardware,
464
at the expense of longer wait times. If servers in the cluster were
465
configured to work only with NTP, transaction wait times would likely to
466
be in excess of 150ms. For wide-area zones, this would be somewhat
467
mitigated by overlap from cross datacenter link latencies. If clocks
468
were made more accurate, the minimal limit for commit latencies would
469
improve.
470
471
However, let’s take a step back and evaluate whether Spanner’s external
472
consistency guarantee is worth the automatic commit wait. First, if the
473
commit wait is omitted completely, the system still yields a consistent
474
view of the map at an arbitrary timestamp. However with clock skew, it
475
would become possible for commit timestamps on non-overlapping but
476
causally related transactions to suffer temporal reverse. In other
477
words, the following scenario is possible for a client without global
478
ordering:
479
480
- Start transaction T<sub>1</sub> to modify value `x` with commit time *s<sub>1</sub>*
481
482
- On commit of T<sub>1</sub>, start T<sub>2</sub> to modify value `y` with commit time
  s<sub>2</sub>
484
485
- Read `x` and `y` and discover that s<sub>1</sub> \> s<sub>2</sub> (**!**)
486
487
The external consistency which Spanner guarantees is referred to as
488
**linearizability**. It goes beyond serializability by preserving
489
information about the causality inherent in how external processes
490
interacted with the database. The strength of Spanner’s guarantee can be
491
formulated as follows: any two processes, with clock skew within
492
expected bounds, may independently record their wall times for the
493
completion of transaction T<sub>1</sub> (T<sub>1</sub><sup>end</sup>) and start of transaction
494
T<sub>2</sub> (T<sub>2</sub><sup>start</sup>) respectively, and if later
495
compared such that T<sub>1</sub><sup>end</sup> \< T<sub>2</sub><sup>start</sup>,
496
then commit timestamps s<sub>1</sub> \< s<sub>2</sub>.
497
This guarantee is broad enough to completely cover all cases of explicit
498
causality, in addition to covering any and all imaginable scenarios of implicit
499
causality.
500
501
Our contention is that causality is chiefly important from the
502
perspective of a single client or a chain of successive clients (*if a
503
tree falls in the forest and nobody hears…*). As such, Cockroach
504
provides two mechanisms to provide linearizability for the vast majority
505
of use cases without a mandatory transaction commit wait or an elaborate
506
system to minimize clock skew.
507
508
1. Clients provide the highest transaction commit timestamp with
509
> successive transactions. This allows node clocks from previous
510
> transactions to effectively participate in the formulation of the
511
> commit timestamp for the current transaction. This guarantees
512
> linearizability for transactions committed by this client.
513
>
514
> Newly launched clients wait at least 2 \* ε from process start
515
> time before beginning their first transaction. This preserves the
516
> same property even on client restart, and the wait will be
517
> mitigated by process initialization.
518
>
519
> All causally-related events within Cockroach maintain
520
> linearizability. Message queues, for example, guarantee that the
521
> receipt timestamp is greater than send timestamp, and that
522
> delivered messages may not be reaped until after the commit wait.
523
524
2. Committed transactions respond with a commit wait parameter which
525
> represents the remaining time in the nominal commit wait. This
526
> will typically be less than the full commit wait as the consensus
527
> write at the coordinator accounts for a portion of it.
528
>
529
> Clients taking any action outside of another Cockroach transaction
530
> (e.g. writing to another distributed system component) can either
531
> choose to wait the remaining interval before proceeding, or
532
> alternatively, pass the wait and/or commit timestamp to the
533
> execution of the outside action for its consideration. This pushes
534
> the burden of linearizability to clients, but is a useful tool in
535
> mitigating commit latencies if the clock skew is potentially
536
> large. This functionality can be used for ordering in the face of
537
> backchannel dependencies as mentioned in the
538
> [AugmentedTime](http://www.cse.buffalo.edu/~demirbas/publications/augmentedTime.pdf)
539
> paper.
540
541
Using these mechanisms in place of commit wait, Cockroach’s guarantee can be
542
formulated as follows: any process which signals the start of transaction
543
T<sub>2</sub> (T<sub>2</sub><sup>start</sup>) after the completion of
544
transaction T<sub>1</sub> (T<sub>1</sub><sup>end</sup>), will have commit
545
timestamps such that s<sub>1</sub> \< s<sub>2</sub>.
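
A minimal sketch of how a client proxy might apply the two mechanisms above (assumed names; timestamps reduced to nanosecond integers):

```go
package main

import (
	"fmt"
	"time"
)

// client sketches the two mechanisms above: it waits 2ε before issuing its
// first transaction, and it feeds the highest commit timestamp it has seen
// into each subsequent transaction.
type client struct {
	maxClockSkew    time.Duration
	lastCommitNanos int64
}

func newClient(maxClockSkew time.Duration) *client {
	time.Sleep(2 * maxClockSkew) // preserve causality across client restarts
	return &client{maxClockSkew: maxClockSkew}
}

// beginTxn returns the minimum candidate timestamp for the next transaction.
func (c *client) beginTxn(nodeNowNanos int64) int64 {
	if c.lastCommitNanos > nodeNowNanos {
		return c.lastCommitNanos
	}
	return nodeNowNanos
}

// observeCommit records a transaction's commit timestamp for later use.
func (c *client) observeCommit(commitNanos int64) {
	if commitNanos > c.lastCommitNanos {
		c.lastCommitNanos = commitNanos
	}
}

func main() {
	c := newClient(10 * time.Millisecond)
	ts := c.beginTxn(time.Now().UnixNano())
	c.observeCommit(ts + 1)
	fmt.Println(c.beginTxn(ts)) // at least the previously observed commit
}
```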
546
547
# Logical Map Content
548
549
Logically, the map contains a series of reserved system key / value
550
pairs covering accounting, range metadata, node accounting and
551
permissions before the actual key / value pairs for non-system data
552
(e.g. the actual meat of the map).
553
554
- `\0\0meta1` Range metadata for location of `\0\0meta2`.
555
- `\0\0meta1<key1>` Range metadata for location of `\0\0meta2<key1>`.
556
- ...
557
- `\0\0meta1<keyN>`: Range metadata for location of `\0\0meta2<keyN>`.
558
- `\0\0meta2`: Range metadata for location of first non-range metadata key.
559
- `\0\0meta2<key1>`: Range metadata for location of `<key1>`.
560
- ...
561
- `\0\0meta2<keyN>`: Range metadata for location of `<keyN>`.
562
- `\0acct<key0>`: Accounting for key prefix key0.
563
- ...
564
- `\0acct<keyN>`: Accounting for key prefix keyN.
565
- `\0node<node-address0>`: Accounting data for node 0.
566
- ...
567
- `\0node<node-addressN>`: Accounting data for node N.
568
- `\0perm<key0><user0>`: Permissions for user0 for key prefix key0.
569
- ...
570
- `\0perm<keyN><userN>`: Permissions for userN for key prefix keyN.
571
- `\0tree_root`: Range key for root of range-spanning tree.
572
- `\0tx<tx-id0>`: Transaction record for transaction 0.
573
- ...
574
- `\0tx<tx-idN>`: Transaction record for transaction N.
575
- `\0zone<key0>`: Zone information for key prefix key0.
576
- ...
577
- `\0zone<keyN>`: Zone information for key prefix keyN.
578
- `<>acctd<metric0>`: Accounting data for Metric 0 for empty key prefix.
579
- ...
580
- `<>acctd<metricN>`: Accounting data for Metric N for empty key prefix.
581
- `<key0>`: `<value0>` The first user data key.
- ...
- `<keyN>`: `<valueN>` The last user data key.
584
585
There are some additional system entries sprinkled amongst the
586
non-system keys. See the Key-Prefix Accounting section in this document
587
for further details.
588
589
# Node Storage
590
591
Nodes maintain a separate instance of RocksDB for each disk. Each
592
RocksDB instance hosts any number of ranges. RPCs arriving at a
593
RoachNode are multiplexed based on the disk name to the appropriate
594
RocksDB instance. A single instance per disk is used to avoid
595
contention. If every range maintained its own RocksDB, global management
596
of available cache memory would be impossible and writers for each range
597
would compete for non-contiguous writes to multiple RocksDB logs.
598
599
In addition to the key/value pairs of the range itself, various range
600
metadata is maintained.
601
602
- range-spanning tree node links
603
604
- participating replicas
605
606
- consensus metadata
607
608
- split/merge activity
609
610
A really good reference on tuning Linux installations with RocksDB is
611
[here](http://docs.basho.com/riak/latest/ops/advanced/backends/leveldb/).
612
613
# Range Metadata
614
615
The default approximate size of a range is 64M (2\^26 B). In order to
616
support 1P (2\^50 B) of logical data, metadata is needed for roughly
617
2\^(50 - 26) = 2\^24 ranges. A reasonable upper bound on range metadata
618
size is roughly 256 bytes (*3 \* 12 bytes for the triplicated node
619
locations and 220 bytes for the range key itself*). 2\^24 ranges \* 2\^8
620
B would require roughly 4G (2\^32 B) to store--too much to duplicate
621
between machines. Our conclusion is that range metadata must be
622
distributed for large installations.
623
624
To distribute the range metadata and keep key lookups relatively fast,
625
we use two levels of indirection. All of the range metadata sorts first
626
in our key-value map. We accomplish this by prefixing range metadata
627
with two null characters (*\0\0*). The *meta1* or *meta2* suffixes are
628
additionally appended to distinguish between the first level and second
629
level of range metadata. In order to do a lookup for *key1*,
630
we first locate the range information for the lower bound of
631
`\0\0meta1<key1>`, and then use that range to locate the lower bound
632
of `\0\0meta2<key1>`. The range specified there will indicate the
633
range location of `<key1>` (refer to examples below). Using two levels
634
of indirection, **our map can address approximately 2\^62 B of data, or
635
roughly 4E** (*each metadata range addresses 2\^(26-8) = 2\^18 ranges;
636
with two levels of indirection, we can address 2\^(18 + 18) = 2\^36
637
ranges; each range addresses 2\^26 B; total is 2\^(36+26) B = 2\^62 B =
638
4E*).
639
640
Note: we append the end key of each range to meta[12] records because
641
the RocksDB iterator only supports a Seek() interface which acts as a
642
Ceil(). Using the start key of the range would cause Seek() to find the
643
key *after* the meta indexing record we’re looking for, which would
644
result in having to back the iterator up, an option which is both less
645
efficient and not available in all cases.
646
647
The following example shows the directory structure for a map with
648
three ranges worth of data. The key/values in red show range
649
metadata. The key/values in black show actual data. Ellipses
650
indicate additional key/value pairs to fill out entire range of
651
data. Except for the fact that splitting ranges requires updates
652
to the range metadata with knowledge of the metadata layout, the
653
range metadata itself requires no special treatment or
654
bootstrapping.
655
656
**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
657
`dcrama3:8000`)
658
659
- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
660
- `\0\0meta2<lastkey0>`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
661
- `\0\0meta2<lastkey1>`: `dcrama4:8000`, `dcrama5:8000`, `dcrama6:8000`
662
- `\0\0meta2\xff`: `dcrama7:8000`, `dcrama8:8000`, `dcrama9:8000`
663
- ...
664
- `<lastkey0>`: `<lastvalue0>`
665
666
**Range 1** (located on servers `dcrama4:8000`, `dcrama5:8000`,
667
`dcrama6:8000`)
668
669
- ...
670
- `<lastkey1>`: `<lastvalue1>`
671
672
**Range 2** (located on servers `dcrama7:8000`, `dcrama8:8000`,
673
`dcrama9:8000`)
674
675
- ...
676
- `<lastkey2>`: `<lastvalue2>`
677
678
Consider a simpler example of a map containing less than a single
679
range of data. In this case, all range metadata and all data are
680
located in the same range:
681
682
**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
683
`dcrama3:8000`)
684
685
- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
686
- `\0\0meta2\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
687
- `<key0>`: `<value0>`
688
- `...`
689
690
Finally, a map large enough to need both levels of indirection would
691
look like (note that instead of showing range replicas, this
692
example is simplified to just show range indexes):
693
694
**Range 0**
695
696
- `\0\0meta1<lastkeyN-1>`: Range 0
697
- `\0\0meta1\xff`: Range 1
698
- `\0\0meta2<lastkey1>`: Range 1
699
- `\0\0meta2<lastkey2>`: Range 2
700
- `\0\0meta2<lastkey3>`: Range 3
701
- ...
702
- `\0\0meta2<lastkeyN-1>`: Range 262143
703
704
**Range 1**
705
706
- `\0\0meta2<lastkeyN>`: Range 262144
707
- `\0\0meta2<lastkeyN+1>`: Range 262145
708
- ...
709
- `\0\0meta2\xff`: Range 500,000
710
- ...
711
- `<lastkey1>`: `<lastvalue1>`
712
713
**Range 2**
714
715
- ...
716
- `<lastkey2>`: `<lastvalue2>`
717
718
**Range 3**
719
720
- ...
721
- `<lastkey3>`: `<lastvalue3>`
722
723
**Range 262144**
724
725
- ...
726
- `<lastkeyN>`: `<lastvalueN>`
727
728
**Range 262145**
729
730
- ...
731
- `<lastkeyN+1>`: `<lastvalueN+1>`
732
733
Note that the choice of range `262144` is just an approximation. The
734
actual number of ranges addressable via a single metadata range is
735
dependent on the size of the keys. If efforts are made to keep key sizes
736
small, the total number of addressable ranges would increase and vice
737
versa.
738
739
From the examples above it’s clear that key location lookups require at
740
most three reads to get the value for `<key>`:
741
742
1. lower bound of `\0\0meta1<key>`
743
2. lower bound of `\0\0meta2<key>`,
744
3. `<key>`.
745
746
For small maps, the entire lookup is satisfied in a single RPC to Range 0. Maps
747
containing less than 16T of data would require two lookups. Clients cache both
748
levels of range metadata, and we expect that data locality for individual
749
clients will be high. Clients may end up with stale cache entries. If on a
750
lookup, the range consulted does not match the client’s expectations, the
751
client evicts the stale entries and possibly does a new lookup.
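
The two-level lookup can be sketched as a pair of ceiling seeks over end-key-addressed indexes, mirroring the `Seek()`-as-`Ceil()` behavior noted above. Everything below is illustrative: real lookups are RPCs against the meta ranges, not in-memory slices.

```go
package main

import (
	"fmt"
	"sort"
)

// index maps the end key of each (meta)range to its location, mirroring how
// meta1/meta2 records are addressed by range end key.
type index []struct{ endKey, location string }

// lookup returns the location of the first entry whose end key sorts at or
// after key -- the ceiling seek described for RocksDB's Seek(). The trailing
// "\xff" sentinel entry guarantees a match in this sketch.
func (ix index) lookup(key string) string {
	i := sort.Search(len(ix), func(i int) bool { return ix[i].endKey >= key })
	return ix[i].location
}

func main() {
	meta1 := index{{"\xff", "range0"}}
	meta2 := index{{"lastkey0", "range0"}, {"lastkey1", "range1"}, {"\xff", "range2"}}

	// Two-level lookup for a user key: meta1 says where the relevant meta2
	// record lives; meta2 says which range holds the key itself.
	key := "lastkey0zzz"
	fmt.Println(meta1.lookup("\x00\x00meta2" + key)) // "range0": holds the meta2 record
	fmt.Println(meta2.lookup(key))                   // "range1": holds the key
}
```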
752
753
# Range-Spanning Binary Tree
754
755
A crucial enhancement to the organization of range metadata is to
756
augment the bi-level range metadata lookup with a minimum spanning tree,
757
implemented as a left-leaning red-black tree over all ranges in the map.
758
This tree structure allows the system to start at any key prefix and
759
efficiently traverse an arbitrary key range with minimal RPC traffic,
760
minimal fan-in and fan-out, and with bounded time complexity equal to
761
`2*log N` steps, where `N` is the total number of ranges in the system.
762
763
Unlike the range metadata rows prefixed with `\0\0meta[1|2]`, the
764
metadata for the range-spanning tree (e.g. parent range and left / right
765
child ranges) is stored directly at the ranges as non-map metadata. The
766
metadata for each node of the tree (e.g. links to parent range, left
767
child range, and right child range) is stored with the range metadata.
768
In effect, the tree metadata is stored implicitly. In order to traverse
769
the tree, for example, you’d need to query each range in turn for its
770
metadata.
771
772
Any time a range is split or merged, both the bi-level range lookup
773
metadata and the per-range binary tree metadata are updated as part of
774
the same distributed transaction. The total number of nodes involved in
775
the update is bounded by 2 + log N (i.e. 2 updates for meta1 and
776
meta2, and up to log N updates to balance the range-spanning tree).
777
The range corresponding to the root node of the tree is stored in
`\0tree_root`.
779
780
As an example, consider the following set of nine ranges and their
781
associated range-spanning tree:
782
783
R0: `aa - cc`, R1: `*cc - lll`, R2: `*lll - llr`, R3: `*llr - nn`, R4: `*nn - rr`, R5: `*rr - ssss`, R6: `*ssss - sst`, R7: `*sst - vvv`, R8: `*vvv - zzzz`.
784
785

786
787
The range-spanning tree has many beneficial uses in Cockroach. It makes
788
the problem of efficiently aggregating accounting information of
789
potentially vast ranges of data tractable. Imagine a subrange of data
790
over which accounting is being kept. For example, the *photos* table in
791
a public photo sharing site. To efficiently keep track of data about the
792
table (e.g. total size, number of rows, etc.), messages can be passed
793
first up the tree and then down to the left until updates arrive at the
794
key prefix under which accounting is aggregated. This makes worst case
795
number of hops for an update to propagate into the accounting totals
796
2 \* log N. A 64T database will require 1M ranges, meaning 40 hops
797
worst case. In our experience, accounting tasks over vast ranges of data
798
are most often map/reduce jobs scheduled with coarse-grained
799
periodicity. By contrast, we expect Cockroach to maintain statistics
800
with sub 10s accuracy and with minimal cycles and minimal IOPs.
801
802
Another use for the range-spanning tree is to push accounting, zones and
803
permissions configurations to all ranges. In the case of zones and
804
permissions, this is an efficient way to pass updated configuration
805
information with exponential fan-out. When adding accounting
806
configurations (i.e. specifying a new key prefix to track), the
807
implicated ranges are transactionally scanned and zero-state accounting
808
information is computed as well. Deleting accounting configurations is
809
similar, except accounting records are deleted.
810
811
Last but *not* least, the range-spanning tree provides a convenient
812
mechanism for planning and executing parallel queries. These provide the
813
basis for
814
[Dremel](http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36632.pdf)-like
815
query execution trees and it’s easy to imagine supporting a subset of
816
SQL or even javascript-based user functions for complex data analysis
817
tasks.
818
819
# Raft - Consistency of Range Replicas
820
821
Each range is configured to consist of three or more replicas. The
822
replicas in a range maintain their own instance of a distributed
823
consensus algorithm. We use the [*Raft consensus
824
algorithm*](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf)
825
as it is simpler to reason about and includes a reference implementation
826
covering important details. Every write to replicas is logged twice:
once to RocksDB’s internal log and once to RocksDB itself as part of the
Raft consensus log.
829
[ePaxos](https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf) has
830
promising performance characteristics for WAN-distributed replicas, but
831
it does not guarantee a consistent ordering between replicas.
832
833
Raft elects a relatively long-lived leader which must be involved to
834
propose writes. It heartbeats followers periodically to keep their logs
835
replicated. In the absence of heartbeats, followers become candidates
836
after randomized election timeouts and proceed to hold new leader
837
elections. Cockroach weights random timeouts such that the replicas with
838
shorter round trip times to peers are more likely to hold elections
839
first. Although only the leader can propose a new write, and as such
840
must be involved in any write to the consensus log, any replica can
841
service reads if the read is for a timestamp which the replica knows is
842
safe based on the last committed consensus write and the state of any
843
pending transactions.
844
845
Only the leader can propose a new write, but Cockroach accepts writes at
846
any replica. The replica merely forwards the write to the leader.
847
Instead of resending the write, the leader has only to acknowledge the
848
write to the forwarding replica using a log sequence number, as though
849
it were proposing it in the first place. The other replicas receive the
850
full write as though the leader were the originator.
851
852
Having a stable leader provides the choice of replica to handle
853
range-specific maintenance and processing tasks, such as delivering
854
pending message queues, handling splits and merges, rebalancing, etc.
855
856
# Splitting / Merging Ranges
857
858
RoachNodes split or merge ranges based on whether they exceed maximum or
859
minimum thresholds for capacity or load. Ranges exceeding maximums for
860
either capacity or load are split; ranges below minimums for *both*
861
capacity and load are merged.
862
863
Ranges maintain the same accounting statistics as accounting key
864
prefixes. These boil down to a time series of data points with minute
865
granularity. Everything from number of bytes to read/write queue sizes.
866
Arbitrary distillations of the accounting stats can be determined as the
867
basis for splitting / merging. Two sensible metrics for use with
868
split/merge are range size in bytes and IOps. A good metric for
869
rebalancing a replica from one node to another would be total read/write
870
queue wait times. These metrics are gossipped, with each range / node
871
passing along relevant metrics if they’re in the bottom or top of the
872
range it’s aware of.
873
874
A range finding itself exceeding either capacity or load threshold
875
splits. To this end, the range leader computes an appropriate split key
876
candidate and issues the split through Raft. In contrast to splitting,
877
merging requires a range to be below the minimum threshold for both
878
capacity *and* load. A range being merged chooses the smaller of the
879
ranges immediately preceding and succeeding it.
880
881
Splitting, merging, rebalancing and recovering all follow the same basic
882
algorithm for moving data between roach nodes. New target replicas are
883
created and added to the replica set of the source range. Then each new
884
replica is brought up to date by either replaying the log in full or
885
copying a snapshot of the source replica data and then replaying the log
886
from the timestamp of the snapshot to catch up fully. Once the new
887
replicas are fully up to date, the range metadata is updated and old,
888
source replica(s) deleted if applicable.
889
890
**Coordinator** (leader replica)

```
if splitting
  SplitRange(split_key): splits happen locally on range replicas and
  only after being completed locally, are moved to new target replicas.
else if merging
  Choose new replicas on same servers as target range replicas;
  add to replica set.
else if rebalancing || recovering
  Choose new replica(s) on least loaded servers; add to replica set.
```
905
*Bring replica up to date:*
906
907
```
908
if all info can be read from replicated log
909
copy replicated log
910
else
911
snapshot source replica
912
send successive ReadRange requests to source replica
913
referencing snapshot
914
915
if merging
916
combine ranges on all replicas
917
else if rebalancing || recovering
918
remove old range replica(s)
919
```
920
921
RoachNodes split ranges when the total data in a range exceeds a
922
configurable maximum threshold. Similarly, ranges are merged when the
923
total data falls below a configurable minimum threshold.
924
925
**TBD: flesh this out**.
926
927
Ranges are rebalanced if a node determines its load or capacity is one
928
of the worst in the cluster based on gossipped load stats. A node with
929
spare capacity is chosen in the same datacenter and a special-case split
930
is done which simply duplicates the data 1:1 and resets the range
931
configuration metadata.
932
933
# Message Queues
934
935
Each range maintains an array of incoming message queues, referred to
936
here as **inboxes**. Additionally, each range maintains and *processes*
937
an array of outgoing message queues, referred to here as **outboxes**.
938
Both inboxes and outboxes are assigned to keys; messages can be sent or
939
received on behalf of any key. Inboxes and outboxes can contain any
940
number of pending messages.
941
942
Messages can be *deliverable*, or *executable.*
943
944
Deliverable messages are defined by Value objects - simple byte arrays -
945
that are delivered to a key’s inbox, awaiting collection by a client
946
invoking the ReapQueue operation. These are typically used by client
947
applications wishing to be notified of changes to an entry for further
948
processing, such as expensive offline operations like sending emails,
949
SMSs, etc.
950
951
Executable messages are *outgoing-only*, and are instances of
PutRequest, IncrementRequest, DeleteRequest, DeleteRangeRequest, or
AccountingRequest. Rather than being delivered to a key’s inbox, they
are executed when encountered. These are primarily useful when updates that
955
are nominally part of a transaction can tolerate asynchronous execution
956
(e.g. eventual consistency), and are otherwise too busy or numerous to
957
make including them in the original [distributed] transaction efficient.
958
Examples may include updates to the accounting for successive key
959
prefixes (potentially busy) or updates to a full-text index (potentially
960
numerous).
961
962
These two types of messages are enqueued in different outboxes too - see
963
key formats below.
964
965
At commit time, the range processing the transaction places messages
966
into a shared outbox located at the start of the range metadata. This is
967
effectively free as it’s part of the same consensus write for the range
968
as the COMMIT record. Outgoing messages are processed asynchronously by
969
the range. To make processing easy, all outboxes are co-located at the
970
start of the range. To make lookup easy, all inboxes are located
971
immediately after the recipient key. The leader replica of a range is
972
responsible for processing message queues.
973
974
A dispatcher polls a given range’s *deliverable message outbox*
975
periodically (configurable), and delivers those messages to the target
976
key’s inbox. The dispatcher is also woken up whenever a new message is
977
added to the outbox. A separate executor also polls the range’s
978
*executable message outbox* periodically as well (again, configurable),
979
and executes those commands. The executor, too, is woken up whenever a
980
new message is added to the outbox.
981
982
Formats follow in the table below. Notice that inbox messages for a
983
given key sort by the `<outbox-timestamp>`. This doesn’t provide a
984
precise ordering, but it does allow clients to scan messages in an
985
approximate ordering of when they were originally lodged with senders.
986
NTP offers walltime deltas to within 100s of milliseconds. The
987
`<sender-range-key>` suffix provides uniqueness.
988
989
**Outbox**
990
`<sender-range-key>deliverable-outbox:<recipient-key><outbox-timestamp>`
991
`<sender-range-key>executable-outbox:<recipient-key><outbox-timestamp>`
992
993
**Inbox**
994
`<recipient-key>inbox:<outbox-timestamp><sender-range-key>`
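
Hypothetical helpers assembling keys in the formats above (timestamp encoding is simplified to zero-padded decimal rather than a proper ordered encoding):

```go
package main

import "fmt"

// deliverableOutboxKey builds an outbox key for a deliverable message in the
// format shown above.
func deliverableOutboxKey(senderRangeKey, recipientKey string, ts int64) string {
	return fmt.Sprintf("%sdeliverable-outbox:%s%020d", senderRangeKey, recipientKey, ts)
}

// inboxKey builds an inbox key; messages for a recipient sort by outbox
// timestamp, and the sender range key suffix provides uniqueness.
func inboxKey(recipientKey, senderRangeKey string, ts int64) string {
	return fmt.Sprintf("%sinbox:%020d%s", recipientKey, ts, senderRangeKey)
}

func main() {
	fmt.Println(deliverableOutboxKey("rangeA", "user42", 1234))
	fmt.Println(inboxKey("user42", "rangeA", 1234))
}
```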
995
996
Messages are processed and then deleted as part of a single distributed
997
transaction. The message will be executed or delivered exactly once,
998
regardless of failures at either sender or receiver.
999
1000
Delivered messages may be read by clients via the ReapQueue operation.
1001
This operation may only be used as part of a transaction. Clients should
1002
commit only after having processed the message. If the transaction is
1003
committed, scanned messages are automatically deleted. The operation
1004
name was chosen to reflect its mutating side effect. Deletion of read
1005
messages is mandatory because senders deliver messages asynchronously
1006
and a delay could cause insertion of messages at arbitrary points in the
1007
inbox queue. If clients require persistence, they should re-save read
1008
messages manually; the ReapQueue operation can be incorporated into
1009
normal transactional updates.
1010
1011
# Node Allocation (via Gossip)
1012
1013
New nodes must be allocated when a range is split. Instead of requiring
1014
every RoachNode to know about the status of all or even a large number
1015
of peer nodes --or-- alternatively requiring a specialized curator or
1016
master with sufficiently global knowledge, we use a gossip protocol to
1017
efficiently communicate only interesting information between all of the
1018
nodes in the cluster. What’s interesting information? One example would
1019
be whether a particular node has a lot of spare capacity. Each node,
1020
when gossiping, compares each topic of gossip to its own state. If its
1021
own state is somehow “more interesting” than the least interesting item
1022
in the topic it’s seen recently, it includes its own state as part of
1023
the next gossip session with a peer node. In this way, a node with
1024
capacity sufficiently in excess of the mean quickly becomes discovered
1025
by the entire cluster. To avoid piling onto outliers, nodes from the
1026
high capacity set are selected at random for allocation.
1027
1028
The gossip protocol itself contains two primary components:
1029
1030
- **Peer Selection**: each node maintains up to N peers with which it
1031
regularly communicates. It selects peers with an eye towards
1032
maximizing fanout. A peer node which itself communicates with an
1033
array of otherwise unknown nodes will be selected over one which
1034
communicates with a set containing significant overlap. Each time
1035
gossip is initiated, each node’s set of peers is exchanged. Each
1036
node is then free to incorporate the other’s peers as it sees fit.
1037
To avoid any node suffering from excess incoming requests, a node
1038
may refuse to answer a gossip exchange. Each node is biased
1039
towards answering requests from nodes without significant overlap
1040
and refusing requests otherwise.
1041
1042
Peers are efficiently selected using a heuristic as described in
1043
[Agarwal & Trachtenberg (2006)](https://drive.google.com/file/d/0B9GCVTp_FHJISmFRTThkOEZSM1U/edit?usp=sharing).
1044
1045
**TBD**: how to avoid partitions? Need to work out a simulation of
1046
the protocol to tune the behavior and see empirically how well it
1047
works.
1048
1049
- **Gossip Selection**: what to communicate. Gossip is divided into
1050
topics. Load characteristics (capacity per disk, cpu load, and
1051
state [e.g. draining, ok, failure]) are used to drive node
1052
allocation. Range statistics (range read/write load, missing
1053
replicas, unavailable ranges) and network topology (inter-rack
1054
bandwidth/latency, inter-datacenter bandwidth/latency, subnet
1055
outages) are used for determining when to split ranges, when to
1056
recover replicas vs. wait for network connectivity, and for
1057
debugging / sysops. In all cases, a set of minimums and a set of
1058
maximums is propagated; each node applies its own view of the
1059
world to augment the values. Each minimum and maximum value is
1060
tagged with the reporting node and other accompanying contextual
1061
information. Each topic of gossip has its own protobuf to hold the
1062
structured data. The number of items of gossip in each topic is
1063
limited by a configurable bound.
1064
1065
For efficiency, nodes assign each new item of gossip a sequence
1066
number and keep track of the highest sequence number each peer
1067
node has seen. Each round of gossip communicates only the delta
1068
containing new items.
1069
1070
# Node Accounting
1071
1072
The gossip protocol discussed in the previous section is useful to
1073
quickly communicate fragments of important information in a
1074
decentralized manner. However, complete accounting for each node is also
1075
stored to a central location, available to any dashboard process. This
1076
is done using the map itself. Each node periodically writes its state to
1077
the map with keys prefixed by `\0node`, similar to the first level of
1078
range metadata, but with a ‘`node`’ suffix. Each value is a protobuf
1079
containing the full complement of node statistics--everything
1080
communicated normally via the gossip protocol plus other useful, but
1081
non-critical data.
1082
1083
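A minimal sketch of such a periodic write is shown below. The `NodeStatus` fields, the use of the node ID as the key suffix, the `KV.Put` signature, and the JSON encoding (standing in for the protobuf) are all assumptions for illustration.

```go
// Illustrative sketch: persist a node's status under a "\x00node"-prefixed
// key. Only the "\x00node" prefix comes from the design text; everything
// else here is hypothetical.
package accounting

import (
	"encoding/json"
	"fmt"
)

type NodeStatus struct {
	NodeID     int32
	CapacityGB float64
	RangeCount int
	// ... everything gossiped, plus other useful but non-critical data.
}

// KV is a stand-in for the key-value write path.
type KV interface {
	Put(key, value []byte) error
}

func writeNodeStatus(kv KV, status NodeStatus) error {
	key := []byte(fmt.Sprintf("\x00node%d", status.NodeID))
	val, err := json.Marshal(status) // the real system stores a protobuf
	if err != nil {
		return err
	}
	return kv.Put(key, val)
}
```
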
The range containing the first key in the node accounting table is
responsible for gossiping the total count of nodes. This total count is
used by the gossip network to organize itself most efficiently. In
particular, the maximum number of hops for gossiped information to take
before reaching a node is given by `ceil(log(node count) / log(max
fanout)) + 1`.

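For concreteness, a small Go helper computing this bound (purely illustrative):

```go
// maxGossipHops computes ceil(log(nodeCount)/log(maxFanout)) + 1, the upper
// bound on hops before gossiped information reaches every node. For example,
// 1,000 nodes with a maximum fanout of 3 gives ceil(6.29) + 1 = 8 hops.
package gossip

import "math"

func maxGossipHops(nodeCount, maxFanout int) int {
	if nodeCount <= 1 || maxFanout <= 1 {
		return 1
	}
	return int(math.Ceil(math.Log(float64(nodeCount))/math.Log(float64(maxFanout)))) + 1
}
```
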
# Key-prefix Accounting, Zones & Permissions

Arbitrarily fine-grained accounting and permissions are specified via
key prefixes. Key prefixes can overlap, as is necessary for capturing
hierarchical relationships. For illustrative purposes, let’s say keys
specifying rows in a set of databases have the following format:

`<db>:<table>:<primary-key>[:<secondary-key>]`

In this case, we might collect accounting or specify permissions with
key prefixes:

`db1`, `db1:user`, `db1:order`

Accounting is kept for the entire map by default.

## Accounting

To keep accounting for a range defined by a key prefix, an entry is created in
the accounting system table. The format of accounting table keys is:

`\0acct<key-prefix>`

In practice, we assume each RoachNode is capable of caching the
entire accounting table, as it is likely to be relatively small.

Accounting is kept for key prefix ranges with eventual consistency
for efficiency. Updates to accounting values propagate through the
system using the message queue facility if the accounting keys do
not reside on the same range as ongoing activity (true for all but
the smallest systems). There are two types of values which
comprise accounting: counts and occurrences, for lack of better
terms. Counts describe system state, such as the total number of
bytes, rows, etc. Occurrences include transient performance and
load metrics. Both types of accounting are captured as time series
with minute granularity. The length of time accounting metrics are
kept is configurable. Below are examples of each type of
accounting value.

**System State Counters/Performance**

- Count of items (e.g. rows)
- Total bytes
- Total key bytes
- Total value length
- Queued message count
- Queued message total bytes
- Count of values \< 16B
- Count of values \< 64B
- Count of values \< 256B
- Count of values \< 1K
- Count of values \< 4K
- Count of values \< 16K
- Count of values \< 64K
- Count of values \< 256K
- Count of values \< 1M
- Count of values \> 1M
- Total bytes of accounting

**Load Occurrences**

- Get op count
- Get total MB
- Put op count
- Put total MB
- Delete op count
- Delete total MB
- Delete range op count
- Delete range total MB
- Scan op count
- Scan total MB
- Split count
- Merge count

Because accounting information is kept as time series and over many
possible metrics of interest, the data can become numerous. Accounting
data are stored in the map near the key prefix described, in order to
distribute load (for both aggregation and storage).

Accounting keys for system state have the form:
`<key-prefix>|acctd<metric-name>*`. Notice the leading ‘pipe’
character. It’s meant to sort the root level account AFTER any other
system tables. These keys increment the same underlying values, as they
are permanent counts and not transient activity. Logic at the
RoachNode takes care of snapshotting the value into an appropriately
suffixed (e.g. with timestamp hour) multi-value time series entry.

Keys for perf/load metrics have the form:
`<key-prefix>acctd<metric-name><hourly-timestamp>`.

`<hourly-timestamp>`-suffixed accounting entries are multi-valued,
containing a varint64 entry for each minute with activity during the
specified hour.

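The following Go helpers sketch how these keys might be composed. Only the documented `\0acct`, `|acctd`, and `acctd` prefixes come from the text above; the exact byte layout, separator handling, and timestamp encoding are assumptions.

```go
// Illustrative helpers for composing accounting keys as described above.
package accounting

import "fmt"

// acctTableKey returns the accounting system table entry for a key prefix:
// \0acct<key-prefix>.
func acctTableKey(keyPrefix string) string {
	return "\x00acct" + keyPrefix
}

// systemStateKey returns a permanent counter key:
// <key-prefix>|acctd<metric-name>.
func systemStateKey(keyPrefix, metric string) string {
	return keyPrefix + "|acctd" + metric
}

// loadMetricKey returns an hourly time-series key:
// <key-prefix>acctd<metric-name><hourly-timestamp>.
func loadMetricKey(keyPrefix, metric string, hourlyTimestamp int64) string {
	return fmt.Sprintf("%sacctd%s%d", keyPrefix, metric, hourlyTimestamp)
}
```
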
To efficiently keep accounting over large key ranges, the task of
aggregation must be distributed. If activity occurs within the same
range as the key prefix for accounting, the updates are made as part
of the consensus write. If the ranges differ, then a message is sent
to the parent range to increment the accounting. If, upon receiving the
message, the parent range also does not include the key prefix, it in
turn forwards it to its parent or left child in the balanced binary
tree which is maintained to describe the range hierarchy. This limits
the number of messages before an update is visible at the root to `2*log N`,
where `N` is the number of ranges in the key prefix (e.g. at most 20
messages for a prefix spanning 1,024 ranges).

## Zones

Zones are stored in the map with keys prefixed by
`\0zone` followed by the key prefix to which the zone
configuration applies. Zone values specify a protobuf containing
the datacenters from which replicas for ranges which fall under
the zone must be chosen.

Please see [proto/config.proto](https://github.com/cockroachdb/cockroach/blob/master/proto/config.proto) for up-to-date data structures used, the best entry point being `message ZoneConfig`.

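As a rough sketch, storing a zone configuration under its key might look like the Go fragment below. The `zoneConfig` struct is a stand-in for the real `message ZoneConfig`; its single field is an assumption based only on the description above, and JSON stands in for the protobuf encoding.

```go
// Illustrative sketch of composing a zone key and encoding a hypothetical
// zone configuration. Only the "\x00zone" prefix comes from the design text.
package zones

import "encoding/json"

type zoneConfig struct {
	ReplicaDatacenters []string // e.g. {"us-east", "us-west", "japan"}
}

// zoneKey returns \0zone<key-prefix>.
func zoneKey(keyPrefix string) []byte {
	return append([]byte("\x00zone"), keyPrefix...)
}

func encodeZone(cfg zoneConfig) ([]byte, error) {
	return json.Marshal(cfg) // the real system stores a ZoneConfig protobuf
}
```
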
If zones are modified in situ, each RoachNode verifies the
existing zones for its ranges against the zone configuration. If
it discovers differences, it reconfigures ranges in the same way
that it rebalances away from busy nodes: via a special-case 1:1
split to a duplicate range comprising the new configuration.

### Permissions

Permissions are stored in the map with keys prefixed by `\0perm` followed by
the key prefix and user to which the specified permissions apply. The format of
permissions keys is:

`\0perm<key-prefix><user>`

Permission values are a protobuf containing the permission details;
please see [proto/config.proto](https://github.com/cockroachdb/cockroach/blob/master/proto/config.proto) for up-to-date data structures used, the best entry point being `message PermConfig`.

A default system root permission is assumed for the entire map
with an empty key prefix and read/write as true.

When determining whether or not to allow a read or a write of a key
value (e.g. `db1:user:1` for user `spencer`), a RoachNode would
read the following permissions values:

```
\0perm<db1:user:1>spencer
\0perm<db1:user>spencer
\0perm<db1>spencer
\0perm<>spencer
```

If any prefix in the hierarchy provides the required permission,
the request is satisfied; otherwise, the request returns an
error.

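The enumeration of permission keys for a given key and user can be sketched as follows. The prefix-splitting on `:` follows the illustrative `<db>:<table>:<primary-key>` format used above; the real permission check reads `PermConfig` protobufs at each of these keys.

```go
// Illustrative sketch: enumerate the permission keys consulted for a given
// key and user, from most to least specific, mirroring the example above.
package perm

import "strings"

// permKeys returns the \0perm<key-prefix><user> keys to read. For
// ("db1:user:1", "spencer") it yields the four keys shown in the
// preceding example, ending with the empty-prefix root default.
func permKeys(key, user string) []string {
	var keys []string
	prefix := key
	for {
		keys = append(keys, "\x00perm"+prefix+user)
		i := strings.LastIndex(prefix, ":")
		if i < 0 {
			break
		}
		prefix = prefix[:i]
	}
	keys = append(keys, "\x00perm"+user) // empty key prefix: the root default
	return keys
}
```
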
The priority for a user permission is used to order requests at
Raft consensus ranges and for choosing an initial priority for
distributed transactions. When scheduling operations at the Raft
consensus range, all outstanding requests are ordered by key
prefix and each is assigned priorities according to key, user and
arrival time. The next request is chosen probabilistically, using
priorities to weight the choice. Each key can have multiple
priorities, as they’re hierarchical (e.g. for /user/key, one
priority for root ‘/’, and one for ‘/user/key’). The most general
priority is used first. If two keys share the most general, then
they’re compared with the next most general if applicable, and so on.

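The weighted probabilistic choice itself can be sketched in a few lines of Go. The `request` type and the single scalar priority are simplifications of the hierarchical priorities described above, assumed only for illustration.

```go
// Illustrative sketch of choosing the next request with probability
// proportional to its priority.
package scheduler

import "math/rand"

type request struct {
	Key      string
	Priority float64 // higher = more likely to be scheduled next
}

// pickNext selects one request, weighting each by its priority.
func pickNext(rng *rand.Rand, reqs []request) *request {
	var total float64
	for _, r := range reqs {
		total += r.Priority
	}
	if len(reqs) == 0 || total <= 0 {
		return nil
	}
	target := rng.Float64() * total
	for i := range reqs {
		target -= reqs[i].Priority
		if target <= 0 {
			return &reqs[i]
		}
	}
	return &reqs[len(reqs)-1]
}
```
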
# Key-Value API

See the protobufs in [proto/](https://github.com/cockroachdb/cockroach/blob/master/proto),
in particular [proto/api.proto](https://github.com/cockroachdb/cockroach/blob/master/proto/api.proto) and the comments within.

# Structured Data API

A preliminary design can be found in the [Go source documentation](http://godoc.org/github.com/cockroachdb/cockroach/structured).