# About

This document is an updated version of the original design documents
by Spencer Kimball from early 2014.

# Overview

Cockroach is a distributed key:value datastore (SQL and structured
data layers of cockroach have yet to be defined) which supports **ACID
transactional semantics** and **versioned values** as first-class
features. The primary design goal is **global consistency and
survivability**, hence the name. Cockroach aims to tolerate disk,
machine, rack, and even **datacenter failures** with minimal latency
disruption and **no manual intervention**. Cockroach nodes are
symmetric; a design goal is **homogeneous deployment** (one binary) with
minimal configuration.

Cockroach implements a **single, monolithic sorted map** from key to
value where both keys and values are byte strings (not Unicode).
Cockroach **scales linearly** (theoretically up to 4 exabytes (4E) of
logical data). The map is composed of one or more ranges and each range
is backed by data stored in [RocksDB](http://rocksdb.org/) (a
variant of LevelDB), and is replicated to a total of three or more
cockroach servers. Ranges are defined by start and end keys. Ranges are
merged and split to maintain total byte size within a globally
configurable min/max size interval. Range sizes default to target `64M` in
order to facilitate quick splits and merges and to distribute load at
hotspots within a key range. Range replicas are intended to be located
in disparate datacenters for survivability (e.g. `{ US-East, US-West,
Japan }`, `{ Ireland, US-East, US-West }`, `{ Ireland, US-East, US-West,
Japan, Australia }`).

Single mutations to ranges are mediated via an instance of a distributed
consensus algorithm to ensure consistency. We’ve chosen to use the
[Raft consensus
algorithm](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf).
All consensus state is stored in RocksDB.

A single logical mutation may affect multiple key/value pairs. Logical
mutations have ACID transactional semantics. If all keys affected by a
logical mutation fall within the same range, atomicity and consistency
are guaranteed by Raft; this is the **fast commit path**. Otherwise, a
**non-locking distributed commit** protocol is employed between affected
ranges.

Cockroach provides [snapshot isolation](http://en.wikipedia.org/wiki/Snapshot_isolation) (SI) and
serializable snapshot isolation (SSI) semantics, allowing **externally
consistent, lock-free reads and writes**--both from a historical
snapshot timestamp and from the current wall clock time. SI provides
lock-free reads and writes but still allows write skew. SSI eliminates
write skew, but introduces a performance hit in the case of a
contended system. SSI is the default isolation; clients must
consciously decide to trade correctness for performance. Cockroach
implements [a limited form of linearizability](#linearizability),
providing ordering for any observer or chain of observers.

Similar to
[Spanner](http://static.googleusercontent.com/media/research.google.com/en/us/archive/spanner-osdi2012.pdf)
directories, Cockroach allows configuration of arbitrary zones of data.
This allows replication factor, storage device type, and/or datacenter
location to be chosen to optimize performance and/or availability.
Unlike Spanner, zones are monolithic and don’t allow movement of
fine-grained data on the level of entity groups.

A
[Megastore](http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf)-like
message queue mechanism is also provided to 1) efficiently sideline
updates which can tolerate asynchronous execution and 2) provide an
integrated message queuing system for asynchronous communication between
distributed system components.

# Architecture

Cockroach implements a layered architecture. The highest level of
abstraction is the SQL layer (currently unspecified in this document).
It depends directly on the [*structured data
API*](#structured-data-api), which provides familiar relational concepts
such as schemas, tables, columns, and indexes. The structured data API
in turn depends on the [distributed key value store](#key-value-api),
which handles the details of range addressing to provide the abstraction
of a single, monolithic key value store. The distributed KV store
communicates with any number of physical cockroach nodes. Each node
contains one or more stores, one per physical device.

![Architecture](media/architecture.png)

Each store contains potentially many ranges, the lowest-level unit of
key-value data. Ranges are replicated using the Raft consensus protocol.
The diagram below is a blown-up version of the stores from four of the five
nodes in the previous diagram. Each range is replicated three ways using
Raft. The color coding shows associated range replicas.

![Ranges](media/ranges.png)

Each physical node exports a RoachNode service. Each RoachNode exports
one or more key ranges. RoachNodes are symmetric. Each has the same
binary and assumes identical roles.

Nodes and the ranges they provide access to can be arranged with various
physical network topologies to make trade-offs between reliability and
performance. For example, a triplicated (3-way replica) range could have
each replica located on different:

- disks within a server to tolerate disk failures.
- servers within a rack to tolerate server failures.
- servers on different racks within a datacenter to tolerate rack power/network failures.
- servers in different datacenters to tolerate large-scale network or power outages.

Up to `F` failures can be tolerated, where the total number of replicas `N = 2F + 1` (e.g. with 3x replication, one failure can be tolerated; with 5x replication, two failures, and so on).

# Cockroach Client

In order to support diverse client usage, Cockroach clients connect to
any node via HTTPS using protocol buffers or JSON. The connected node
proxies the involved client work, including key lookups and write buffering.

# Keys

Cockroach keys are arbitrary byte arrays. If textual data is used in
keys, utf8 encoding is recommended (this helps for cleaner display of
values in debugging tools). User-supplied keys are encoded using an
ordered code. System keys are either prefixed with null characters (`\0`
or `\0\0`) for system tables, or take the form of
`<user-key><system-suffix>` to sort user-key-range specific system
keys immediately after the user keys they refer to. Null characters are
used in system key prefixes to guarantee that they sort first.

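To make the resulting ordering concrete, the following Go snippet (illustrative only; the helpers and literal prefixes are not taken from the codebase) compares a few keys of each flavor as raw bytes:

```
package main

import (
	"bytes"
	"fmt"
)

func main() {
	// System tables are prefixed with one or two null characters, so they
	// sort ahead of any user key (user keys never start with \0).
	meta := []byte("\x00\x00meta1dbA")
	acct := []byte("\x00acctdbA")
	user := []byte("dbA:user:42")

	// A user-key-range specific system key takes the form
	// <user-key><system-suffix>, and therefore sorts immediately after
	// the user key it refers to.
	local := append(append([]byte{}, user...), "inbox:"...)

	fmt.Println(bytes.Compare(meta, acct) < 0)  // true: \0\0 sorts before \0
	fmt.Println(bytes.Compare(acct, user) < 0)  // true: system keys sort first
	fmt.Println(bytes.Compare(user, local) < 0) // true: suffixed key sorts just after
}
```
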
# Versioned Values

Cockroach maintains historical versions of values by storing them with
associated commit timestamps. Reads and scans can specify a snapshot
time to return the most recent writes prior to the snapshot timestamp.
Older versions of values are garbage collected by the system during
compaction according to a user-specified expiration interval. In order
to support long-running scans (e.g. for MapReduce), all versions have a
minimum expiration.

Versioned values are supported via modifications to RocksDB to record
commit timestamps and GC expirations per key.

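A rough mental model for the versioned store follows (a sketch only; the actual RocksDB modifications, value encoding, and GC bookkeeping are not described here):

```
package mvcc

// version is one committed value of a key; versions are kept newest-first.
type version struct {
	wallTime int64 // commit timestamp
	value    []byte
}

// store maps each key to its version history.
type store map[string][]version

// get returns the most recent value written at or before the snapshot
// timestamp, which is how reads and scans at a snapshot time behave.
func (s store) get(key string, snapshot int64) ([]byte, bool) {
	for _, v := range s[key] {
		if v.wallTime <= snapshot {
			return v.value, true
		}
	}
	return nil, false
}

// gc drops versions older than the expiration cutoff, always keeping at
// least the newest version of each key.
func (s store) gc(cutoff int64) {
	for k, vs := range s {
		for i := 1; i < len(vs); i++ {
			if vs[i].wallTime < cutoff {
				s[k] = vs[:i]
				break
			}
		}
	}
}
```
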
# Lock-Free Distributed Transactions

Cockroach provides distributed transactions without locks. Cockroach
transactions support two isolation levels:

- snapshot isolation (SI) and
- *serializable* snapshot isolation (SSI).

*SI* is simple to implement, highly performant, and correct for all but a
handful of anomalous conditions (e.g. write skew). *SSI* requires just a touch
more complexity, is still highly performant (less so with contention), and has
no anomalous conditions. Cockroach’s SSI implementation is based on ideas from
the literature and some possibly novel insights.

SSI is the default level, with SI provided for application developers
who are certain enough of their need for performance and the absence of
write skew conditions to consciously elect to use it. In a lightly
contended system, our implementation of SSI is just as performant as SI,
requiring no locking or additional writes. With contention, our
implementation of SSI still requires no locking, but will end up
aborting more transactions. Cockroach’s SI and SSI implementations
prevent starvation scenarios even for arbitrarily long transactions.

See the [Cahill paper](https://drive.google.com/file/d/0B9GCVTp_FHJIcEVyZVdDWEpYYXVVbFVDWElrYUV0NHFhU2Fv/edit?usp=sharing)
for one possible implementation of SSI. This is another [great paper](http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf).
For a discussion of SSI implemented by preventing read-write conflicts
(in contrast to detecting them, called write-snapshot isolation), see
the [Yabandeh paper](https://drive.google.com/file/d/0B9GCVTp_FHJIMjJ2U2t6aGpHLTFUVHFnMTRUbnBwc2pLa1RN/edit?usp=sharing),
which is the source of much inspiration for Cockroach’s SSI.

Each Cockroach transaction is assigned a random priority and a
"candidate timestamp" at start. The candidate timestamp is the
provisional timestamp at which the transaction will commit, and is
chosen as the current clock time of the node coordinating the
transaction. This means that a transaction without conflicts will
usually commit with a timestamp that, in absolute time, precedes the
actual work done by that transaction.
In the course of coordinating a transaction between one or more
distributed nodes, the candidate timestamp may be increased, but will
never be decreased. The core difference between the two isolation levels
SI and SSI is that the former allows the transaction's candidate
timestamp to increase and the latter does not.

Timestamps are a combination of both a physical and a logical component
to support monotonic increments without degenerate cases causing
timestamps to diverge from wall clock time, following closely the
[*Hybrid Logical Clock
paper*](http://www.cse.buffalo.edu/tech-reports/2014-04.pdf).

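The clock referenced here follows the hybrid logical clock construction from that paper; a condensed sketch of its two update rules (not the production implementation) looks like this:

```
package hlc

// Timestamp combines a physical wall time with a logical counter, so the
// clock can advance monotonically even when the wall clock stalls or runs
// slightly behind a remote peer.
type Timestamp struct {
	WallTime int64 // nanoseconds
	Logical  int32
}

// Clock is a simplified hybrid logical clock.
type Clock struct {
	physical func() int64 // reads the local wall clock
	ts       Timestamp
}

// Now advances the clock for a local event or an outgoing message.
func (c *Clock) Now() Timestamp {
	if pt := c.physical(); pt > c.ts.WallTime {
		c.ts = Timestamp{WallTime: pt}
	} else {
		c.ts.Logical++
	}
	return c.ts
}

// Update merges a timestamp observed on an incoming message, keeping the
// local clock ahead of everything it has seen while staying close to
// physical time.
func (c *Clock) Update(remote Timestamp) Timestamp {
	pt := c.physical()
	switch {
	case pt > c.ts.WallTime && pt > remote.WallTime:
		c.ts = Timestamp{WallTime: pt}
	case remote.WallTime > c.ts.WallTime:
		c.ts = Timestamp{WallTime: remote.WallTime, Logical: remote.Logical + 1}
	case c.ts.WallTime > remote.WallTime:
		c.ts.Logical++
	default: // equal wall times
		if remote.Logical > c.ts.Logical {
			c.ts.Logical = remote.Logical
		}
		c.ts.Logical++
	}
	return c.ts
}
```
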
Transactions are executed in discrete phases:

1. Start the transaction by writing a new entry to the system
   transaction table (keys prefixed by *\0tx*) with state “PENDING”.
   In practice, this is done along with the next phase of the
   transaction.

2. Write an "intent" value for each datum being written as part of the
   transaction. These are normal MVCC values, with the addition of a
   special flag (i.e. “intent”) indicating that the value may be
   committed after the transaction itself commits. In addition,
   the transaction id (unique and chosen at tx start time by the client)
   is stored with intent values. The tx id is used to refer to the
   transaction table when there are conflicts and to make
   tie-breaking decisions on ordering between identical timestamps.
   Each node returns the timestamp used for the write (which is the
   original candidate timestamp in the absence of conflicts); the client
   selects the maximum from amongst all writes as the final commit
   timestamp.

   Each range maintains a small (i.e. latest 10s of read timestamps),
   *in-memory* cache from key to the latest timestamp at which the
   key(s) were read. This *latest-read-cache* is consulted on each
   write (a sketch of this check follows the list below). If the
   write’s candidate timestamp is earlier than the low
   water mark on the cache itself (i.e. its last evicted timestamp)
   or if the key being written has a read timestamp later than the
   write’s candidate timestamp, this later timestamp value is
   returned with the write. The cache’s entries are evicted oldest
   timestamp first, updating the low water mark as appropriate. If a new
   range replica leader is elected, it sets the low water mark for
   the cache to the current wall time + ε (ε = 99th percentile
   clock skew).

3. Commit the transaction by updating its entry in the system
   transaction table (keys prefixed by *\0tx*). The value of the
   commit entry contains the candidate timestamp (increased as
   necessary to accommodate any latest read timestamps). Note that
   the transaction is considered fully committed at this point and
   control may be returned to the client.

   In the case of an SI transaction, a commit timestamp which was
   increased to accommodate concurrent readers is perfectly
   acceptable and the commit may continue. For SSI transactions,
   however, a gap between candidate and commit timestamps
   necessitates transaction restart (note: restart is different than
   abort--see below).

   After the transaction is committed, all written intents are upgraded
   in parallel by removing the “intent” flag. The transaction is
   considered fully committed before this step and does not wait for
   it to return control to the transaction coordinator.

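The latest-read-cache check from step 2 can be sketched as follows (types and names are hypothetical; the real cache also covers key ranges and evicts entries incrementally):

```
package storage

// readTSCache remembers the latest timestamp at which each key was read,
// plus a low water mark covering everything that has been evicted.
type readTSCache struct {
	lowWater int64            // timestamp of the most recently evicted entry
	byKey    map[string]int64 // key -> latest read timestamp
}

// forwardWriteTS returns the timestamp a write must use: its candidate
// timestamp, pushed forward if the key was read at a later time or if
// the candidate falls below the cache's low water mark.
func (c *readTSCache) forwardWriteTS(key string, candidate int64) int64 {
	ts := candidate
	if c.lowWater > ts {
		ts = c.lowWater
	}
	if readTS, ok := c.byKey[key]; ok && readTS > ts {
		ts = readTS
	}
	return ts
}

// onLeaderElected resets the low water mark to wall time + ε, since a
// newly elected range leader has no knowledge of reads served by the
// previous leader.
func (c *readTSCache) onLeaderElected(now, maxClockSkew int64) {
	c.lowWater = now + maxClockSkew
	c.byKey = map[string]int64{}
}
```
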
In the absence of conflicts, this is the end. Nothing else is necessary
to ensure the correctness of the system.

**Conflict Resolution**

Things get more interesting when a reader or writer encounters an intent
record or newly-committed value in a location that it needs to read or
write. This is a conflict, usually causing either of the transactions to
abort or restart depending on the type of conflict.

***Transaction restart:***

This is the usual (and more efficient) type of behaviour and is used
except when the transaction was aborted (for instance by another
transaction).
In effect, that reduces to two cases; the first being the one outlined
above: an SSI transaction that finds upon attempting to commit that
its commit timestamp has been pushed. The second case involves a transaction
actively encountering a conflict, that is, one of its readers or writers
encounters data that necessitates conflict resolution
(see transaction interactions below).

When a transaction restarts, it changes its priority and/or moves its
timestamp forward depending on data tied to the conflict, and
begins anew reusing the same tx id. The prior run of the transaction might
have written some write intents, which need to be deleted before the
transaction commits, so as not to be included as part of the transaction.
These stale write intent deletions are done during the reexecution of the
transaction, either implicitly, through writing new intents to
the same keys as part of the reexecution of the transaction, or explicitly,
by cleaning up stale intents that are not part of the reexecution of the
transaction. Since most transactions will end up writing to the same keys,
the explicit cleanup run just before committing the transaction is usually
a NOOP.

***Transaction abort:***

This is the case in which a transaction, upon reading its transaction
table entry, finds that it has been aborted. In this case, the
transaction cannot reuse its intents; it returns control to the client
before cleaning them up (other readers and writers would clean up
dangling intents as they encounter them) but will make an effort to
clean up after itself. The next attempt (if applicable) then runs as a
new transaction with **a new tx id**.

***Transaction interactions:***

There are several scenarios in which transactions interact:

- **Reader encounters write intent or value with newer timestamp far
  enough in the future**: This is not a conflict. The reader is free
  to proceed; after all, it will be reading an older version of the
  value and so does not conflict. Recall that the write intent may
  be committed with a later timestamp than its candidate; it will
  never commit with an earlier one. **Side note**: if an SI transaction
  reader finds an intent with a newer timestamp which the reader’s own
  transaction has written, the reader always returns that intent's value.

- **Reader encounters write intent or value with newer timestamp in the
  near future:** In this case, we have to be careful. The newer
  intent may, in absolute terms, have happened in our read's past if
  the clock of the writer is ahead of the node serving the values.
  In that case, we would need to take this value into account, but
  we just don't know. Hence the transaction restarts, using instead
  a future timestamp (but remembering a maximum timestamp used to
  limit the uncertainty window to the maximum clock skew). In fact,
  this is optimized further; see the details under "choosing a
  timestamp" below.

- **Reader encounters write intent with older timestamp**: the reader
  must follow the intent’s transaction id to the transaction table.
  If the transaction has already been committed, then the reader can
  just read the value. If the write transaction has not yet been
  committed, then the reader has two options. If the write conflict
  is from an SI transaction, the reader can *push that transaction's
  commit timestamp into the future* (and consequently not have to
  read it). This is simple to do: the reader just updates the
  transaction’s commit timestamp to indicate that when/if the
  transaction does commit, it should use a timestamp *at least* as
  high. However, if the write conflict is from an SSI transaction,
  the reader must compare priorities. If the reader has the higher priority,
  it pushes the transaction’s commit timestamp (that
  transaction will then notice its timestamp has been pushed, and
  restart). If it has the lower or same priority, it retries itself using as
  a new priority `max(new random priority, conflicting txn’s
  priority - 1)` (see the sketch following this list).

- **Writer encounters uncommitted write intent**:
  If the other write intent has been written by a transaction with a lower
  priority, the writer aborts the conflicting transaction. If the write
  intent has a higher or equal priority the transaction retries, using as a new
  priority *max(new random priority, conflicting txn’s priority - 1)*;
  the retry occurs after a short, randomized backoff interval.

- **Writer encounters newer committed write intent or committed value**:
  The transaction restarts. On restart, the same priority is reused,
  but the candidate timestamp is moved forward to the encountered
  value's timestamp.

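The retry-priority rule used in the reader and writer cases above can be sketched as below (illustrative only). The floor of (conflicting priority - 1) ensures a repeatedly losing transaction is not pushed arbitrarily low by an unlucky random draw, so it cannot be starved indefinitely by the same higher-priority opponent.

```
package txn

import "math/rand"

// retryPriority computes the priority a losing transaction retries with:
//   max(new random priority, conflicting txn's priority - 1).
func retryPriority(conflictingPriority int32) int32 {
	newPriority := rand.Int31()
	if floor := conflictingPriority - 1; floor > newPriority {
		return floor
	}
	return newPriority
}
```
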
**Transaction management**

Transactions are managed by the client proxy (or gateway in SQL Azure
parlance). Unlike in Spanner, writes are not buffered but are sent
directly to all implicated ranges. This allows the transaction to abort
quickly if it encounters a write conflict. The client proxy keeps track
of all written keys in order to clean up write intents upon transaction
completion.

If a transaction is completed successfully, all intents are upgraded to
committed. In the event a transaction is aborted, all written intents
are deleted. The client proxy doesn’t guarantee it will clean up intents,
but dangling intents are upgraded or deleted when encountered by future
readers and writers and the system does not depend on their timely
cleanup for correctness.

In the event the client proxy restarts before the pending transaction is
completed, the dangling transaction would continue to live in the
transaction table until aborted by another transaction. Transactions
heartbeat the transaction table every five seconds by default.
Transactions encountered by readers or writers with dangling intents
which haven’t been heartbeat within the required interval are aborted.

An exploration of retries with contention and abort times with abandoned
transactions is
[here](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing).

**Transaction Table**

Please see [proto/data.proto](https://github.com/cockroachdb/cockroach/blob/master/proto/data.proto) for the up-to-date structures, the best entry point being `message Transaction`.

**Pros**

- No requirement for reliable code execution to prevent a stalled 2PC
  protocol.
- Readers never block with SI semantics; with SSI semantics, they may
  abort.
- Lower latency than traditional 2PC commit protocol (w/o contention)
  because the second phase requires only a single write to the
  transaction table instead of a synchronous round to all
  transaction participants.
- Priorities avoid starvation for arbitrarily long transactions and
  always pick a winner from between contending transactions (no
  mutual aborts).
- Writes are not buffered at the client; writes fail fast.
- No read-locking overhead required for *serializable* SI (in contrast
  to other SSI implementations).
- Well-chosen (i.e. less random) priorities can flexibly give
  probabilistic guarantees on latency for arbitrary transactions
  (for example: make OLTP transactions 10x less likely to abort than
  low priority transactions, such as asynchronously scheduled jobs).

**Cons**

- Reads from non-leader replicas still require a ping to the leader to
  update the *latest-read-cache*.
- Abandoned transactions may block contending writers for up to the
  heartbeat interval, though the average wait is likely to be
  considerably shorter (see [graph in link](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing)).
  This is likely considerably more performant than detecting and
  restarting 2PC in order to release read and write locks.
- Behavior different from other SI implementations: no first writer
  wins, and shorter transactions do not always finish quickly.
  The element of surprise for OLTP systems may be a problematic factor.
- Aborts can decrease throughput in a contended system compared with
  two-phase locking. Aborts and retries increase read and write
  traffic, increase latency and decrease throughput.

**Choosing a Timestamp**

A key challenge of reading data in a distributed system with clock skew
is choosing a timestamp guaranteed to be greater than the latest
timestamp of any committed transaction (in absolute time). No system can
claim consistency and fail to read already-committed data.

Accomplishing this for transactions (or just single operations)
accessing a single node is easy. The transaction supplies 0 for its
timestamp, indicating that the node should use its current time (time
for a node is kept using a hybrid clock which combines wall time and a
logical time). This guarantees data already committed to that node have
earlier timestamps.

For multiple nodes, the timestamp of the node coordinating the
transaction `t` is used. In addition, a maximum timestamp `t+ε` is
supplied to provide an upper bound on timestamps for already-committed
data (`ε` is the maximum clock skew). As the transaction progresses, any
data read which have timestamps greater than `t` but less than `t+ε`
cause the transaction to abort and retry with the conflicting timestamp
t<sub>c</sub>, where t<sub>c</sub> \> t. The maximum timestamp `t+ε` remains the same.
Time spent retrying because of reading recently committed data has an
upper bound of `ε`. In fact, this is further optimized: upon restarting,
the transaction not only takes into account the timestamp of the future
value, but the timestamp of the node at the time of the uncertain read.
The larger of those two timestamps (typically, the latter) is used to
bump up the read timestamp, and additionally the node is marked as
“certain”. This means that for future reads to that node within the
transaction, we can set `MaxTimestamp = Read Timestamp` (and hence avoid
further uncertainty restarts). Correctness follows from the fact that we
know that at the time of the read, there exists no version of any key on
that node with a higher timestamp; if we ran into one during a future
read, that write would have happened (in absolute time) after our
transaction started.

We expect retries will be rare, but this assumption may need to be
revisited if retries become problematic. Note that this problem does not
apply to historical reads. An alternate approach which does not require
retries would be to make a round to all node participants in advance and
choose the highest reported node wall time as the timestamp. However,
knowing which nodes will be accessed in advance is difficult and
potentially limiting. Cockroach could also potentially use a global
clock (Google did this with Pinax TODO: link to paper), which would be
feasible for smaller, geographically-proximate clusters.

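The uncertainty-restart bookkeeping above might look roughly like this (a sketch under assumed names; the real transaction metadata is defined in proto/data.proto):

```
package txn

// uncertainty tracks a transaction's read timestamp, its upper bound
// t+ε, and the nodes already marked "certain".
type uncertainty struct {
	readTS       int64          // t: current read timestamp
	maxTS        int64          // t+ε: upper bound on already-committed data
	certainNodes map[int32]bool // nodes where MaxTimestamp = Read Timestamp
}

// onRead reports whether a value with timestamp valueTS observed on
// nodeID forces an uncertainty restart, and if so the new read timestamp:
// the larger of the value's timestamp and the node's clock at the time
// of the uncertain read.
func (u *uncertainty) onRead(nodeID int32, valueTS, nodeClock int64) (restart bool, newReadTS int64) {
	if u.certainNodes[nodeID] || valueTS <= u.readTS || valueTS > u.maxTS {
		return false, u.readTS // certainly visible or certainly invisible
	}
	newReadTS = valueTS
	if nodeClock > newReadTS {
		newReadTS = nodeClock
	}
	// Future reads from this node cannot surprise us: no version with a
	// higher timestamp existed there at the time of this read.
	u.certainNodes[nodeID] = true
	return true, newReadTS
}
```
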
# Linearizability

First a word about [***Spanner***](http://research.google.com/archive/spanner.html).
By combining judicious use of wait intervals with accurate time signals,
Spanner provides a global ordering between any two non-overlapping transactions
(in absolute time) with \~14ms latencies. Put another way:
Spanner guarantees that if a transaction T<sub>1</sub> commits (in absolute time)
before another transaction T<sub>2</sub> starts, then T<sub>1</sub>'s assigned commit
timestamp is smaller than T<sub>2</sub>'s. Using atomic clocks and GPS receivers,
Spanner reduces their clock skew uncertainty to \< 10ms (`ε`). To make
good on the promised guarantee, transactions must take at least double
the clock skew uncertainty interval to commit (`2ε`). See [*this
article*](http://www.cs.cornell.edu/~ie53/publications/DC-col51-Sep13.pdf)
for a helpful overview of Spanner’s concurrency control.

Cockroach could make the same guarantees without specialized hardware,
at the expense of longer wait times. If servers in the cluster were
configured to work only with NTP, transaction wait times would likely
be in excess of 150ms. For wide-area zones, this would be somewhat
mitigated by overlap from cross-datacenter link latencies. If clocks
were made more accurate, the minimal limit for commit latencies would
improve.

However, let’s take a step back and evaluate whether Spanner’s external
consistency guarantee is worth the automatic commit wait. First, if the
commit wait is omitted completely, the system still yields a consistent
view of the map at an arbitrary timestamp. However with clock skew, it
would become possible for commit timestamps on non-overlapping but
causally related transactions to suffer a temporal reversal. In other
words, the following scenario is possible for a client without global
ordering:

- Start transaction T<sub>1</sub> to modify value `x` with commit time *s<sub>1</sub>*

- On commit of T<sub>1</sub>, start T<sub>2</sub> to modify value `y` with commit time
  *s<sub>2</sub>*

- Read `x` and `y` and discover that s<sub>1</sub> \> s<sub>2</sub> (**!**)

The external consistency which Spanner guarantees is referred to as
**linearizability**. It goes beyond serializability by preserving
information about the causality inherent in how external processes
interacted with the database. The strength of Spanner’s guarantee can be
formulated as follows: any two processes, with clock skew within
expected bounds, may independently record their wall times for the
completion of transaction T<sub>1</sub> (T<sub>1</sub><sup>end</sup>) and start of transaction
T<sub>2</sub> (T<sub>2</sub><sup>start</sup>) respectively, and if later
compared such that T<sub>1</sub><sup>end</sup> \< T<sub>2</sub><sup>start</sup>,
then commit timestamps s<sub>1</sub> \< s<sub>2</sub>.
This guarantee is broad enough to completely cover all cases of explicit
causality, in addition to covering any and all imaginable scenarios of implicit
causality.

Our contention is that causality is chiefly important from the
perspective of a single client or a chain of successive clients (*if a
tree falls in the forest and nobody hears…*). As such, Cockroach
provides two mechanisms to provide linearizability for the vast majority
of use cases without a mandatory transaction commit wait or an elaborate
system to minimize clock skew.

1. Clients provide the highest transaction commit timestamp with
   successive transactions. This allows node clocks from previous
   transactions to effectively participate in the formulation of the
   commit timestamp for the current transaction. This guarantees
   linearizability for transactions committed by this client (a sketch
   of this bookkeeping follows the list).

   Newly launched clients wait at least 2 \* ε from process start
   time before beginning their first transaction. This preserves the
   same property even on client restart, and the wait will be
   mitigated by process initialization.

   All causally-related events within Cockroach maintain
   linearizability. Message queues, for example, guarantee that the
   receipt timestamp is greater than the send timestamp, and that
   delivered messages may not be reaped until after the commit wait.

2. Committed transactions respond with a commit wait parameter which
   represents the remaining time in the nominal commit wait. This
   will typically be less than the full commit wait as the consensus
   write at the coordinator accounts for a portion of it.

   Clients taking any action outside of another Cockroach transaction
   (e.g. writing to another distributed system component) can either
   choose to wait the remaining interval before proceeding, or
   alternatively, pass the wait and/or commit timestamp to the
   execution of the outside action for its consideration. This pushes
   the burden of linearizability to clients, but is a useful tool in
   mitigating commit latencies if the clock skew is potentially
   large. This functionality can be used for ordering in the face of
   backchannel dependencies as mentioned in the
   [AugmentedTime](http://www.cse.buffalo.edu/~demirbas/publications/augmentedTime.pdf)
   paper.

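Mechanism 1 amounts to the client threading its highest observed commit timestamp through successive transactions; a minimal sketch (names are invented) follows:

```
package client

// causalToken is carried by a client, or handed from one client to the
// next in a causal chain, to order successive transactions.
type causalToken struct {
	maxCommitTS int64 // highest commit timestamp observed so far
}

// txnMinTimestamp is supplied when starting the next transaction; the
// coordinating node must choose a candidate timestamp above it.
func (t *causalToken) txnMinTimestamp() int64 {
	return t.maxCommitTS
}

// observeCommit folds a finished transaction's commit timestamp into the
// token so later transactions are ordered after it.
func (t *causalToken) observeCommit(commitTS int64) {
	if commitTS > t.maxCommitTS {
		t.maxCommitTS = commitTS
	}
}
```
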
Using these mechanisms in place of commit wait, Cockroach’s guarantee can be
formulated as follows: any process which signals the start of transaction
T<sub>2</sub> (T<sub>2</sub><sup>start</sup>) after the completion of
transaction T<sub>1</sub> (T<sub>1</sub><sup>end</sup>), will have commit
timestamps such that s<sub>1</sub> \< s<sub>2</sub>.

# Logical Map Content

Logically, the map contains a series of reserved system key / value
pairs covering accounting, range metadata, node accounting and
permissions before the actual key / value pairs for non-system data
(e.g. the actual meat of the map).

- `\0\0meta1`: Range metadata for location of `\0\0meta2`.
- `\0\0meta1<key1>`: Range metadata for location of `\0\0meta2<key1>`.
- ...
- `\0\0meta1<keyN>`: Range metadata for location of `\0\0meta2<keyN>`.
- `\0\0meta2`: Range metadata for location of first non-range metadata key.
- `\0\0meta2<key1>`: Range metadata for location of `<key1>`.
- ...
- `\0\0meta2<keyN>`: Range metadata for location of `<keyN>`.
- `\0acct<key0>`: Accounting for key prefix key0.
- ...
- `\0acct<keyN>`: Accounting for key prefix keyN.
- `\0node<node-address0>`: Accounting data for node 0.
- ...
- `\0node<node-addressN>`: Accounting data for node N.
- `\0perm<key0><user0>`: Permissions for user0 for key prefix key0.
- ...
- `\0perm<keyN><userN>`: Permissions for userN for key prefix keyN.
- `\0tree_root`: Range key for root of range-spanning tree.
- `\0tx<tx-id0>`: Transaction record for transaction 0.
- ...
- `\0tx<tx-idN>`: Transaction record for transaction N.
- `\0zone<key0>`: Zone information for key prefix key0.
- ...
- `\0zone<keyN>`: Zone information for key prefix keyN.
- `<>acctd<metric0>`: Accounting data for Metric 0 for empty key prefix.
- ...
- `<>acctd<metricN>`: Accounting data for Metric N for empty key prefix.
- `<key0>`: `<value0>` The first user data key.
- ...
- `<keyN>`: `<valueN>` The last user data key.

There are some additional system entries sprinkled amongst the
non-system keys. See the Key-Prefix Accounting section in this document
for further details.

# Node Storage

Nodes maintain a separate instance of RocksDB for each disk. Each
RocksDB instance hosts any number of ranges. RPCs arriving at a
RoachNode are multiplexed based on the disk name to the appropriate
RocksDB instance. A single instance per disk is used to avoid
contention. If every range maintained its own RocksDB, global management
of available cache memory would be impossible and writers for each range
would compete for non-contiguous writes to multiple RocksDB logs.

In addition to the key/value pairs of the range itself, various range
metadata is maintained.

- range-spanning tree node links

- participating replicas

- consensus metadata

- split/merge activity

A really good reference on tuning Linux installations with RocksDB is
[here](http://docs.basho.com/riak/latest/ops/advanced/backends/leveldb/).

# Range Metadata

The default approximate size of a range is 64M (2\^26 B). In order to
support 1P (2\^50 B) of logical data, metadata is needed for roughly
2\^(50 - 26) = 2\^24 ranges. A reasonable upper bound on range metadata
size is roughly 256 bytes (*3 \* 12 bytes for the triplicated node
locations and 220 bytes for the range key itself*). 2\^24 ranges \* 2\^8
B would require roughly 4G (2\^32 B) to store--too much to duplicate
between machines. Our conclusion is that range metadata must be
distributed for large installations.

To distribute the range metadata and keep key lookups relatively fast,
we use two levels of indirection. All of the range metadata sorts first
in our key-value map. We accomplish this by prefixing range metadata
with two null characters (*\0\0*). The *meta1* or *meta2* suffixes are
additionally appended to distinguish between the first level and second
level of range metadata. In order to do a lookup for *key1*,
we first locate the range information for the lower bound of
`\0\0meta1<key1>`, and then use that range to locate the lower bound
of `\0\0meta2<key1>`. The range specified there will indicate the
range location of `<key1>` (refer to examples below). Using two levels
of indirection, **our map can address approximately 2\^62 B of data, or
roughly 4E** (*each metadata range addresses 2\^(26-8) = 2\^18 ranges;
with two levels of indirection, we can address 2\^(18 + 18) = 2\^36
ranges; each range addresses 2\^26 B; total is 2\^(36+26) B = 2\^62 B =
4E*).

Note: we append the end key of each range to meta[12] records because
the RocksDB iterator only supports a Seek() interface which acts as a
Ceil(). Using the start key of the range would cause Seek() to find the
key *after* the meta indexing record we’re looking for, which would
result in having to back the iterator up, an option which is both less
efficient and not available in all cases.

The following example shows the directory structure for a map with
three ranges worth of data. The key/values in red show range
metadata. The key/values in black show actual data. Ellipses
indicate additional key/value pairs to fill out the entire range of
data. Except for the fact that splitting ranges requires updates
to the range metadata with knowledge of the metadata layout, the
range metadata itself requires no special treatment or
bootstrapping.

**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
`dcrama3:8000`)

- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2<lastkey0>`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2<lastkey1>`: `dcrama4:8000`, `dcrama5:8000`, `dcrama6:8000`
- `\0\0meta2\xff`: `dcrama7:8000`, `dcrama8:8000`, `dcrama9:8000`
- ...
- `<lastkey0>`: `<lastvalue0>`

**Range 1** (located on servers `dcrama4:8000`, `dcrama5:8000`,
`dcrama6:8000`)

- ...
- `<lastkey1>`: `<lastvalue1>`

**Range 2** (located on servers `dcrama7:8000`, `dcrama8:8000`,
`dcrama9:8000`)

- ...
- `<lastkey2>`: `<lastvalue2>`

Consider a simpler example of a map containing less than a single
range of data. In this case, all range metadata and all data are
located in the same range:

**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
`dcrama3:8000`)

- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `<key0>`: `<value0>`
- `...`

Finally, a map large enough to need both levels of indirection would
look like (note that instead of showing range replicas, this
example is simplified to just show range indexes):

**Range 0**

- `\0\0meta1<lastkeyN-1>`: Range 0
- `\0\0meta1\xff`: Range 1
- `\0\0meta2<lastkey1>`: Range 1
- `\0\0meta2<lastkey2>`: Range 2
- `\0\0meta2<lastkey3>`: Range 3
- ...
- `\0\0meta2<lastkeyN-1>`: Range 262143

**Range 1**

- `\0\0meta2<lastkeyN>`: Range 262144
- `\0\0meta2<lastkeyN+1>`: Range 262145
- ...
- `\0\0meta2\xff`: Range 500,000
- ...
- `<lastkey1>`: `<lastvalue1>`

**Range 2**

- ...
- `<lastkey2>`: `<lastvalue2>`

**Range 3**

- ...
- `<lastkey3>`: `<lastvalue3>`

**Range 262144**

- ...
- `<lastkeyN>`: `<lastvalueN>`

**Range 262145**

- ...
- `<lastkeyN+1>`: `<lastvalueN+1>`

Note that the choice of range `262144` is just an approximation. The
actual number of ranges addressable via a single metadata range is
dependent on the size of the keys. If efforts are made to keep key sizes
small, the total number of addressable ranges would increase and vice
versa.

From the examples above it’s clear that key location lookups require at
most three reads to get the value for `<key>`:

1. lower bound of `\0\0meta1<key>`
2. lower bound of `\0\0meta2<key>`
3. `<key>`.

For small maps, the entire lookup is satisfied in a single RPC to Range 0. Maps
containing less than 16T of data would require two lookups. Clients cache both
levels of range metadata, and we expect that data locality for individual
clients will be high. Clients may end up with stale cache entries. If on a
lookup, the range consulted does not match the client’s expectations, the
client evicts the stale entries and possibly does a new lookup.

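A client-side resolution of those three reads could be sketched like this (hypothetical helpers; the real client also caches both levels of metadata and retries on stale entries):

```
package kv

// metaKey prefixes key with the given metadata level, e.g.
// "\x00\x00meta1" or "\x00\x00meta2".
func metaKey(level string, key []byte) []byte {
	return append([]byte(level), key...)
}

// lookup stands in for "read the first record with key >= the argument
// from the given replicas", i.e. the Seek()/Ceil() behavior; meta
// records are keyed by range end key, so this finds the covering range.
type lookup func(replicas []string, key []byte) (value []string)

// resolveRange performs the first two of the (at most) three reads and
// returns the replicas responsible for key; reading <key> itself from
// those replicas is the third read.
func resolveRange(read lookup, meta1Replicas []string, key []byte) []string {
	meta2Replicas := read(meta1Replicas, metaKey("\x00\x00meta1", key))
	return read(meta2Replicas, metaKey("\x00\x00meta2", key))
}
```
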
# Range-Spanning Binary Tree

A crucial enhancement to the organization of range metadata is to
augment the bi-level range metadata lookup with a minimum spanning tree,
implemented as a left-leaning red-black tree over all ranges in the map.
This tree structure allows the system to start at any key prefix and
efficiently traverse an arbitrary key range with minimal RPC traffic,
minimal fan-in and fan-out, and with bounded time complexity equal to
`2*log N` steps, where `N` is the total number of ranges in the system.

Unlike the range metadata rows prefixed with `\0\0meta[1|2]`, the
metadata for the range-spanning tree (e.g. parent range and left / right
child ranges) is stored directly at the ranges as non-map metadata. The
metadata for each node of the tree (e.g. links to parent range, left
child range, and right child range) is stored with the range metadata.
In effect, the tree metadata is stored implicitly. In order to traverse
the tree, for example, you’d need to query each range in turn for its
metadata.

Any time a range is split or merged, both the bi-level range lookup
metadata and the per-range binary tree metadata are updated as part of
the same distributed transaction. The total number of nodes involved in
the update is bounded by 2 + log N (i.e. 2 updates for meta1 and
meta2, and up to log N updates to balance the range-spanning tree).
The range corresponding to the root node of the tree is stored in
*\0tree_root*.

As an example, consider the following set of nine ranges and their
associated range-spanning tree:

R0: `aa - cc`, R1: `*cc - lll`, R2: `*lll - llr`, R3: `*llr - nn`, R4: `*nn - rr`, R5: `*rr - ssss`, R6: `*ssss - sst`, R7: `*sst - vvv`, R8: `*vvv - zzzz`.

![Range Tree](media/rangetree.png)

The range-spanning tree has many beneficial uses in Cockroach. It makes
the problem of efficiently aggregating accounting information of
potentially vast ranges of data tractable. Imagine a subrange of data
over which accounting is being kept. For example, the *photos* table in
a public photo sharing site. To efficiently keep track of data about the
table (e.g. total size, number of rows, etc.), messages can be passed
first up the tree and then down to the left until updates arrive at the
key prefix under which accounting is aggregated. This makes the worst-case
number of hops for an update to propagate into the accounting totals
2 \* log N. A 64T database will require 1M ranges, meaning 40 hops
worst case. In our experience, accounting tasks over vast ranges of data
are most often map/reduce jobs scheduled with coarse-grained
periodicity. By contrast, we expect Cockroach to maintain statistics
with sub-10s accuracy and with minimal cycles and minimal IOPs.

Another use for the range-spanning tree is to push accounting, zones and
permissions configurations to all ranges. In the case of zones and
permissions, this is an efficient way to pass updated configuration
information with exponential fan-out. When adding accounting
configurations (i.e. specifying a new key prefix to track), the
implicated ranges are transactionally scanned and zero-state accounting
information is computed as well. Deleting accounting configurations is
similar, except accounting records are deleted.

Last but *not* least, the range-spanning tree provides a convenient
mechanism for planning and executing parallel queries. These provide the
basis for
[Dremel](http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36632.pdf)-like
query execution trees and it’s easy to imagine supporting a subset of
SQL or even JavaScript-based user functions for complex data analysis
tasks.

# Raft - Consistency of Range Replicas

Each range is configured to consist of three or more replicas. The
replicas in a range maintain their own instance of a distributed
consensus algorithm. We use the [*Raft consensus
algorithm*](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf)
as it is simpler to reason about and includes a reference implementation
covering important details. Every write to replicas is logged twice:
once to RocksDB’s internal log and once to RocksDB itself as part of the
Raft consensus log.
[ePaxos](https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf) has
promising performance characteristics for WAN-distributed replicas, but
it does not guarantee a consistent ordering between replicas.

Raft elects a relatively long-lived leader which must be involved to
propose writes. It heartbeats followers periodically to keep their logs
replicated. In the absence of heartbeats, followers become candidates
after randomized election timeouts and proceed to hold new leader
elections. Cockroach weights random timeouts such that the replicas with
shorter round-trip times to peers are more likely to hold elections
first. Although only the leader can propose a new write, and as such
must be involved in any write to the consensus log, any replica can
service reads if the read is for a timestamp which the replica knows is
safe based on the last committed consensus write and the state of any
pending transactions.

Only the leader can propose a new write, but Cockroach accepts writes at
any replica. The replica merely forwards the write to the leader.
Instead of resending the write, the leader has only to acknowledge the
write to the forwarding replica using a log sequence number, as though
it were proposing it in the first place. The other replicas receive the
full write as though the leader were the originator.

Having a stable leader provides a natural choice of replica to handle
range-specific maintenance and processing tasks, such as delivering
pending message queues, handling splits and merges, rebalancing, etc.

# Splitting / Merging Ranges

RoachNodes split or merge ranges based on whether they exceed maximum or
minimum thresholds for capacity or load. Ranges exceeding maximums for
either capacity or load are split; ranges below minimums for *both*
capacity and load are merged.

Ranges maintain the same accounting statistics as accounting key
prefixes. These boil down to a time series of data points with minute
granularity, covering everything from number of bytes to read/write queue sizes.
Arbitrary distillations of the accounting stats can be determined as the
basis for splitting / merging. Two sensible metrics for use with
split/merge are range size in bytes and IOps. A good metric for
rebalancing a replica from one node to another would be total read/write
queue wait times. These metrics are gossipped, with each range / node
passing along relevant metrics if they’re in the bottom or top of the
range it’s aware of.

A range finding itself exceeding either the capacity or load threshold
splits. To this end, the range leader computes an appropriate split key
candidate and issues the split through Raft. In contrast to splitting,
merging requires a range to be below the minimum threshold for both
capacity *and* load. A range being merged chooses the smaller of the
ranges immediately preceding and succeeding it.

Splitting, merging, rebalancing and recovering all follow the same basic
algorithm for moving data between RoachNodes. New target replicas are
created and added to the replica set of the source range. Then each new
replica is brought up to date by either replaying the log in full or
copying a snapshot of the source replica data and then replaying the log
from the timestamp of the snapshot to catch up fully. Once the new
replicas are fully up to date, the range metadata is updated and old,
source replica(s) deleted if applicable.

**Coordinator** (leader replica)

```
if splitting
  SplitRange(split_key): splits happen locally on range replicas and
  only after being completed locally, are moved to new target replicas.
else if merging
  Choose new replicas on same servers as target range replicas;
  add to replica set.
else if rebalancing || recovering
  Choose new replica(s) on least loaded servers; add to replica set.
```

**New Replica**

*Bring replica up to date:*

```
if all info can be read from replicated log
  copy replicated log
else
  snapshot source replica
  send successive ReadRange requests to source replica
  referencing snapshot

if merging
  combine ranges on all replicas
else if rebalancing || recovering
  remove old range replica(s)
```

RoachNodes split ranges when the total data in a range exceeds a
configurable maximum threshold. Similarly, ranges are merged when the
total data falls below a configurable minimum threshold.

**TBD: flesh this out**.

Ranges are rebalanced if a node determines its load or capacity is one
of the worst in the cluster based on gossipped load stats. A node with
spare capacity is chosen in the same datacenter and a special-case split
is done which simply duplicates the data 1:1 and resets the range
configuration metadata.

# Message Queues

Each range maintains an array of incoming message queues, referred to
here as **inboxes**. Additionally, each range maintains and *processes*
an array of outgoing message queues, referred to here as **outboxes**.
Both inboxes and outboxes are assigned to keys; messages can be sent or
received on behalf of any key. Inboxes and outboxes can contain any
number of pending messages.

Messages can be *deliverable* or *executable*.

Deliverable messages are defined by Value objects - simple byte arrays -
that are delivered to a key’s inbox, awaiting collection by a client
invoking the ReapQueue operation. These are typically used by client
applications wishing to be notified of changes to an entry for further
processing, such as expensive offline operations like sending emails,
SMSs, etc.

Executable messages are *outgoing-only*, and are instances of
PutRequest, IncrementRequest, DeleteRequest, DeleteRangeRequest
or AccountingRequest. Rather than being delivered to a key’s inbox, they are
executed when encountered. These are primarily useful when updates that
are nominally part of a transaction can tolerate asynchronous execution
(e.g. eventual consistency), and are otherwise too busy or numerous to
make including them in the original [distributed] transaction efficient.
Examples may include updates to the accounting for successive key
prefixes (potentially busy) or updates to a full-text index (potentially
numerous).

These two types of messages are enqueued in different outboxes too - see
key formats below.

At commit time, the range processing the transaction places messages
into a shared outbox located at the start of the range metadata. This is
effectively free as it’s part of the same consensus write for the range
as the COMMIT record. Outgoing messages are processed asynchronously by
the range. To make processing easy, all outboxes are co-located at the
start of the range. To make lookup easy, all inboxes are located
immediately after the recipient key. The leader replica of a range is
responsible for processing message queues.

A dispatcher polls a given range’s *deliverable message outbox*
periodically (configurable), and delivers those messages to the target
key’s inbox. The dispatcher is also woken up whenever a new message is
added to the outbox. A separate executor also polls the range’s
*executable message outbox* periodically as well (again, configurable),
and executes those commands. The executor, too, is woken up whenever a
new message is added to the outbox.

Key formats follow below. Notice that inbox messages for a
given key sort by the `<outbox-timestamp>`. This doesn’t provide a
precise ordering, but it does allow clients to scan messages in an
approximate ordering of when they were originally lodged with senders.
NTP offers walltime deltas to within 100s of milliseconds. The
`<sender-range-key>` suffix provides uniqueness.

**Outbox**

`<sender-range-key>deliverable-outbox:<recipient-key><outbox-timestamp>`
`<sender-range-key>executable-outbox:<recipient-key><outbox-timestamp>`

**Inbox**

`<recipient-key>inbox:<outbox-timestamp><sender-range-key>`

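As an illustration, the outbox and inbox keys might be composed like this (a sketch only; a fixed-width decimal timestamp stands in for whatever ordered encoding is actually used):

```
package queue

import "fmt"

// deliverableOutboxKey lives under the sending range's key, so all of a
// range's outbox entries are co-located at the start of that range.
func deliverableOutboxKey(senderRangeKey, recipientKey string, outboxTS int64) string {
	return fmt.Sprintf("%sdeliverable-outbox:%s%020d", senderRangeKey, recipientKey, outboxTS)
}

// inboxKey sorts immediately after the recipient key; entries for a
// recipient order approximately by outbox timestamp, with the sender
// range key appended for uniqueness.
func inboxKey(recipientKey string, outboxTS int64, senderRangeKey string) string {
	return fmt.Sprintf("%sinbox:%020d%s", recipientKey, outboxTS, senderRangeKey)
}
```
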
Messages are processed and then deleted as part of a single distributed
transaction. The message will be executed or delivered exactly once,
regardless of failures at either sender or receiver.

Delivered messages may be read by clients via the ReapQueue operation.
This operation may only be used as part of a transaction. Clients should
commit only after having processed the message. If the transaction is
committed, scanned messages are automatically deleted. The operation
name was chosen to reflect its mutating side effect. Deletion of read
messages is mandatory because senders deliver messages asynchronously
and a delay could cause insertion of messages at arbitrary points in the
inbox queue. If clients require persistence, they should re-save read
messages manually; the ReapQueue operation can be incorporated into
normal transactional updates.

# Node Allocation (via Gossip)

New nodes must be allocated when a range is split. Instead of requiring
every RoachNode to know about the status of all or even a large number
of peer nodes -- or, alternatively, requiring a specialized curator or
master with sufficiently global knowledge -- we use a gossip protocol to
efficiently communicate only interesting information between all of the
nodes in the cluster. What’s interesting information? One example would
be whether a particular node has a lot of spare capacity. Each node,
when gossiping, compares each topic of gossip to its own state. If its
own state is somehow “more interesting” than the least interesting item
in the topic it’s seen recently, it includes its own state as part of
the next gossip session with a peer node. In this way, a node with
capacity sufficiently in excess of the mean quickly becomes discovered
by the entire cluster. To avoid piling onto outliers, nodes from the
high capacity set are selected at random for allocation.

The gossip protocol itself contains two primary components:

- **Peer Selection**: each node maintains up to N peers with which it
1037
regularly communicates. It selects peers with an eye towards
1038
maximizing fanout. A peer node which itself communicates with an
1039
array of otherwise unknown nodes will be selected over one which
1040
communicates with a set containing significant overlap. Each time
1041
gossip is initiated, each nodes’ set of peers is exchanged. Each
1042
node is then free to incorporate the other’s peers as it sees fit.
1043
To avoid any node suffering from excess incoming requests, a node
1044
may refuse to answer a gossip exchange. Each node is biased
1045
towards answering requests from nodes without significant overlap
1046
and refusing requests otherwise.
1047
1048
Peers are efficiently selected using a heuristic as described in
1049
[Agarwal & Trachtenberg (2006)](https://drive.google.com/file/d/0B9GCVTp_FHJISmFRTThkOEZSM1U/edit?usp=sharing).
1050
1051
**TBD**: how to avoid partitions? Need to work out a simulation of
1052
the protocol to tune the behavior and see empirically how well it
1053
works.
1054
1055
- **Gossip Selection**: what to communicate. Gossip is divided into
1056
topics. Load characteristics (capacity per disk, cpu load, and
1057
state [e.g. draining, ok, failure]) are used to drive node
1058
allocation. Range statistics (range read/write load, missing
1059
replicas, unavailable ranges) and network topology (inter-rack
1060
bandwidth/latency, inter-datacenter bandwidth/latency, subnet
1061
outages) are used for determining when to split ranges, when to
1062
recover replicas vs. wait for network connectivity, and for
1063
debugging / sysops. In all cases, a set of minimums and a set of
1064
maximums is propagated; each node applies its own view of the
1065
world to augment the values. Each minimum and maximum value is
1066
tagged with the reporting node and other accompanying contextual
1067
information. Each topic of gossip has its own protobuf to hold the
1068
structured data. The number of items of gossip in each topic is
1069
limited by a configurable bound.
1070
1071
For efficiency, nodes assign each new item of gossip a sequence
1072
number and keep track of the highest sequence number each peer
1073
node has seen. Each round of gossip communicates only the delta
1074
containing new items.
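
The delta mechanism can be sketched as follows; the `Info` and
`infoStore` names and fields below are hypothetical stand-ins rather
than the real gossip implementation.

```go
// A minimal sketch of delta-based gossip using sequence numbers.
package gossip

// Info is one item of gossip within a topic.
type Info struct {
	Key   string
	Value []byte
	Seq   int64 // sequence number assigned locally when the item is added
}

// infoStore holds gossiped items and remembers the highest sequence
// number each peer has already seen.
type infoStore struct {
	seq      int64
	infos    []Info
	peerSeen map[string]int64 // peer address -> highest seq acknowledged
}

// add records a new item and stamps it with the next sequence number.
func (s *infoStore) add(key string, value []byte) {
	s.seq++
	s.infos = append(s.infos, Info{Key: key, Value: value, Seq: s.seq})
}

// delta returns only the items the given peer has not yet seen, so
// each round of gossip ships just the new information.
func (s *infoStore) delta(peer string) []Info {
	seen := s.peerSeen[peer]
	var out []Info
	for _, info := range s.infos {
		if info.Seq > seen {
			out = append(out, info)
		}
	}
	return out
}

// ack records that the peer has received everything up to seq.
func (s *infoStore) ack(peer string, seq int64) {
	if s.peerSeen == nil {
		s.peerSeen = map[string]int64{}
	}
	if seq > s.peerSeen[peer] {
		s.peerSeen[peer] = seq
	}
}
```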

# Node Accounting

The gossip protocol discussed in the previous section is useful to
quickly communicate fragments of important information in a
decentralized manner. However, complete accounting for each node is also
stored to a central location, available to any dashboard process. This
is done using the map itself. Each node periodically writes its state to
the map with keys prefixed by `\0node`, similar to the first level of
range metadata, but with a ‘`node`’ suffix. Each value is a protobuf
containing the full complement of node statistics--everything
communicated normally via the gossip protocol plus other useful, but
non-critical data.

The range containing the first key in the node accounting table is
responsible for gossiping the total count of nodes. This total count is
used by the gossip network to most efficiently organize itself. In
particular, the maximum number of hops for gossiped information to take
before reaching a node is given by `ceil(log(node count) / log(max
fanout)) + 1`.
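
A small sketch of that arithmetic (the `maxHops` helper below is
illustrative only, not part of the gossip implementation):

```go
package main

import (
	"fmt"
	"math"
)

// maxHops returns the maximum number of hops gossiped information
// should need to reach every node:
// ceil(log(nodeCount) / log(maxFanout)) + 1.
func maxHops(nodeCount, maxFanout int) int {
	if nodeCount <= 1 || maxFanout <= 1 {
		return 1
	}
	return int(math.Ceil(math.Log(float64(nodeCount))/math.Log(float64(maxFanout)))) + 1
}

func main() {
	// For example, 1000 nodes with a max fanout of 3 gives
	// ceil(6.91 / 1.10) + 1 = 7 + 1 = 8 hops.
	fmt.Println(maxHops(1000, 3))
}
```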

# Key-prefix Accounting, Zones & Permissions

Arbitrarily fine-grained accounting and permissions are specified via
key prefixes. Key prefixes can overlap, as is necessary for capturing
hierarchical relationships. For illustrative purposes, let’s say keys
specifying rows in a set of databases have the following format:

`<db>:<table>:<primary-key>[:<secondary-key>]`

In this case, we might collect accounting or specify permissions with
key prefixes:

`db1`, `db1:user`, `db1:order`

Accounting is kept for the entire map by default.

## Accounting

To keep accounting for a range defined by a key prefix, an entry is created in
the accounting system table. The format of accounting table keys is:

`\0acct<key-prefix>`

In practice, we assume each RoachNode is capable of caching the
entire accounting table, as it is likely to be relatively small.

Accounting is kept for key prefix ranges with eventual consistency
for efficiency. Updates to accounting values propagate through the
system using the message queue facility if the accounting keys do
not reside on the same range as ongoing activity (true for all but
the smallest systems). There are two types of values which
comprise accounting: counts and occurrences, for lack of better
terms. Counts describe system state, such as the total number of
bytes, rows, etc. Occurrences include transient performance and
load metrics. Both types of accounting are captured as time series
with minute granularity. The length of time accounting metrics are
kept is configurable. Below are examples of each type of
accounting value.

**System State Counters/Performance**

- Count of items (e.g. rows)
- Total bytes
- Total key bytes
- Total value length
- Queued message count
- Queued message total bytes
- Count of values \< 16B
- Count of values \< 64B
- Count of values \< 256B
- Count of values \< 1K
- Count of values \< 4K
- Count of values \< 16K
- Count of values \< 64K
- Count of values \< 256K
- Count of values \< 1M
- Count of values \> 1M
- Total bytes of accounting

**Load Occurrences**

- Get op count
- Get total MB
- Put op count
- Put total MB
- Delete op count
- Delete total MB
- Delete range op count
- Delete range total MB
- Scan op count
- Scan op MB
- Split count
- Merge count

Because accounting information is kept as time series and over many
possible metrics of interest, the data can become numerous. Accounting
data are stored in the map near the key prefix described, in order to
distribute load (for both aggregation and storage).

Accounting keys for system state have the form:
`<key-prefix>|acctd<metric-name>*`. Note the leading ‘pipe’
character: it is meant to sort the root-level accounting AFTER any
other system tables. Because these are permanent counts rather than
transient activity, updates must increment the same underlying
values. Logic at the RoachNode takes care of snapshotting the value
into an appropriately suffixed (e.g. with timestamp hour) multi-value
time series entry.

Keys for perf/load metrics:
`<key-prefix>acctd<metric-name><hourly-timestamp>`.

`<hourly-timestamp>`-suffixed accounting entries are multi-valued,
containing a varint64 entry for each minute with activity during the
specified hour.
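
As a rough sketch of this layout, the following Go snippet builds such
a key and encodes per-minute varint64 counts. `loadMetricKey` and
`encodeHour` are hypothetical helpers; the real encoding is defined by
the accounting implementation.

```go
// A minimal sketch of an hourly, multi-valued accounting entry: one
// (minute, count) varint pair per minute with activity.
package accounting

import (
	"encoding/binary"
	"fmt"
	"time"
)

// loadMetricKey builds a key of the form
// <key-prefix>acctd<metric-name><hourly-timestamp>.
func loadMetricKey(prefix, metric string, t time.Time) []byte {
	hour := t.UTC().Truncate(time.Hour).Unix()
	return []byte(fmt.Sprintf("%sacctd%s%d", prefix, metric, hour))
}

// encodeHour encodes the minutes (0-59) that saw activity and their
// counts as varint64 pairs.
func encodeHour(counts map[int]int64) []byte {
	var buf []byte
	var tmp [binary.MaxVarintLen64]byte
	for minute := 0; minute < 60; minute++ {
		c, ok := counts[minute]
		if !ok {
			continue // only minutes with activity are stored
		}
		n := binary.PutVarint(tmp[:], int64(minute))
		buf = append(buf, tmp[:n]...)
		n = binary.PutVarint(tmp[:], c)
		buf = append(buf, tmp[:n]...)
	}
	return buf
}
```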

To efficiently keep accounting over large key ranges, the task of
aggregation must be distributed. If activity occurs within the same
range as the key prefix for accounting, the updates are made as part
of the consensus write. If the ranges differ, then a message is sent
to the parent range to increment the accounting. If upon receiving the
message, the parent range also does not include the key prefix, it in
turn forwards it to its parent or left child in the balanced binary
tree which is maintained to describe the range hierarchy. This limits
the number of messages before an update is visible at the root to `2*log N`,
where `N` is the number of ranges in the key prefix (e.g. at most 20
messages for `N = 1024` ranges).

## Zones

Zones are stored in the map with keys prefixed by
`\0zone` followed by the key prefix to which the zone
configuration applies. Zone values specify a protobuf containing
the datacenters from which replicas for ranges which fall under
the zone must be chosen.
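
For illustration only, a zone lookup key and a stand-in for the zone
value might look like the following; the authoritative schema is
`message ZoneConfig` in proto/config.proto, not the `zoneSpec` struct
below.

```go
// A purely illustrative sketch of zone key construction and an
// in-memory zone description.
package zones

// zoneKey returns the key under which the zone config for a given key
// prefix is stored: \0zone<key-prefix>.
func zoneKey(keyPrefix string) []byte {
	return append([]byte("\x00zone"), keyPrefix...)
}

// zoneSpec is a hypothetical stand-in for the zone protobuf: the
// datacenters from which replicas of covered ranges must be chosen.
type zoneSpec struct {
	Datacenters []string // e.g. {"us-east", "us-west", "japan"}
}
```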

Please see [proto/config.proto](https://github.com/cockroachdb/cockroach/blob/master/proto/config.proto) for up-to-date data structures used, the best entry point being `message ZoneConfig`.

If zones are modified in situ, each RoachNode verifies the
existing zones for its ranges against the zone configuration. If
it discovers differences, it reconfigures ranges in the same way
that it rebalances away from busy nodes, via a special-case 1:1
split to a duplicate range comprising the new configuration.

## Permissions

Permissions are stored in the map with keys prefixed by `\0perm` followed by
the key prefix and user to which the specified permissions apply. The format of
permissions keys is:

`\0perm<key-prefix><user>`

Permission values are a protobuf containing the permission details;
please see [proto/config.proto](https://github.com/cockroachdb/cockroach/blob/master/proto/config.proto) for up-to-date data structures used, the best entry point being `message PermConfig`.

A default system root permission is assumed for the entire map
with an empty key prefix and read/write as true.

When determining whether or not to allow a read or a write of a key
value (e.g. `db1:user:1` for user `spencer`), a RoachNode would
read the following permissions values:

```
\0perm<db1:user:1>spencer
\0perm<db1:user>spencer
\0perm<db1>spencer
\0perm<>spencer
```

If any prefix in the hierarchy provides the required permission,
the request is satisfied; otherwise, the request returns an
error.
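
A minimal sketch of generating that chain of lookups, assuming the
illustrative ‘:’-separated key format from earlier in this section
(`permKeys` is a hypothetical helper, not the actual lookup code):

```go
package perm

import (
	"fmt"
	"strings"
)

// permKeys returns the \0perm<key-prefix><user> keys from most to
// least specific, ending with the empty (root) prefix. The <...>
// notation mirrors the example keys listed above.
func permKeys(key, user string) []string {
	var keys []string
	parts := strings.Split(key, ":")
	for i := len(parts); i > 0; i-- {
		prefix := strings.Join(parts[:i], ":")
		keys = append(keys, fmt.Sprintf("\x00perm<%s>%s", prefix, user))
	}
	keys = append(keys, fmt.Sprintf("\x00perm<>%s", user)) // root permission
	return keys
}
```

For key `db1:user:1` and user `spencer`, this yields exactly the four
keys listed above, from most to least specific.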

The priority for a user permission is used to order requests at
Raft consensus ranges and for choosing an initial priority for
distributed transactions. When scheduling operations at the Raft
consensus range, all outstanding requests are ordered by key
prefix and each is assigned priorities according to key, user and
arrival time. The next request is chosen probabilistically using
priorities to weight the choice. Each key can have multiple
priorities as they’re hierarchical (e.g. for /user/key, one
priority for root ‘/’, and one for ‘/user/key’). The most general
priority is used first. If two keys share the most general, then
they’re compared with the next most general if applicable, and so on.
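
The weighted choice can be sketched as follows; `pendingRequest` and
`pickNext` are illustrative names, not the actual scheduler.

```go
// A minimal sketch of choosing the next request probabilistically,
// weighted by priority.
package scheduler

import "math/rand"

// pendingRequest pairs an outstanding request's key with its
// effective priority weight.
type pendingRequest struct {
	Key      string
	Priority float64
}

// pickNext selects one request with probability proportional to its
// priority, so higher-priority requests are favored without starving
// the rest. It returns the index of the chosen request, or -1 if no
// request carries positive weight.
func pickNext(reqs []pendingRequest, rng *rand.Rand) int {
	var total float64
	for _, r := range reqs {
		total += r.Priority
	}
	if total <= 0 {
		return -1
	}
	target := rng.Float64() * total
	for i, r := range reqs {
		target -= r.Priority
		if target < 0 {
			return i
		}
	}
	return len(reqs) - 1
}
```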

# Key-Value API

See the protobufs in [proto/](https://github.com/cockroachdb/cockroach/blob/master/proto),
in particular [proto/api.proto](https://github.com/cockroachdb/cockroach/blob/master/proto/api.proto) and the comments within.

# Structured Data API

A preliminary design can be found in the [Go source documentation](http://godoc.org/github.com/cockroachdb/cockroach/structured).