# About

This document is an updated version of the original design documents
by Spencer Kimball from early 2014.

# Overview

Cockroach is a distributed key:value datastore (SQL and structured
data layers of cockroach have yet to be defined) which supports **ACID
transactional semantics** and **versioned values** as first-class
features. The primary design goal is **global consistency and
survivability**, hence the name. Cockroach aims to tolerate disk,
machine, rack, and even **datacenter failures** with minimal latency
disruption and **no manual intervention**. Cockroach nodes are
symmetric; a design goal is **homogeneous deployment** (one binary) with
minimal configuration.

Cockroach implements a **single, monolithic sorted map** from key to
value where both keys and values are byte strings (not unicode).
Cockroach **scales linearly** (theoretically up to 4 exabytes (4E) of
logical data). The map is composed of one or more ranges and each range
is backed by data stored in [RocksDB](http://rocksdb.org/) (a
variant of LevelDB), and is replicated to a total of three or more
cockroach servers. Ranges are defined by start and end keys. Ranges are
merged and split to maintain total byte size within a globally
configurable min/max size interval. Range sizes default to target `64M` in
order to facilitate quick splits and merges and to distribute load at
hotspots within a key range. Range replicas are intended to be located
in disparate datacenters for survivability (e.g. `{ US-East, US-West,
Japan }`, `{ Ireland, US-East, US-West}`, `{ Ireland, US-East, US-West,
Japan, Australia }`).

Single mutations to ranges are mediated via an instance of a distributed
consensus algorithm to ensure consistency. We’ve chosen to use the
[Raft consensus
algorithm](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf).
All consensus state is stored in RocksDB.

A single logical mutation may affect multiple key/value pairs. Logical
mutations have ACID transactional semantics. If all keys affected by a
logical mutation fall within the same range, atomicity and consistency
are guaranteed by Raft; this is the **fast commit path**. Otherwise, a
**non-locking distributed commit** protocol is employed between affected
ranges.

Cockroach provides [snapshot isolation](http://en.wikipedia.org/wiki/Snapshot_isolation) (SI) and
serializable snapshot isolation (SSI) semantics, allowing **externally
consistent, lock-free reads and writes**--both from a historical
snapshot timestamp and from the current wall clock time. SI provides
lock-free reads and writes but still allows write skew. SSI eliminates
write skew, but introduces a performance hit in the case of a
contentious system. SSI is the default isolation; clients must
consciously decide to trade correctness for performance. Cockroach
implements [a limited form of linearizability](#linearizability),
providing ordering for any observer or chain of observers.

Similar to
[Spanner](http://static.googleusercontent.com/media/research.google.com/en/us/archive/spanner-osdi2012.pdf)
directories, Cockroach allows configuration of arbitrary zones of data.
This allows replication factor, storage device type, and/or datacenter
location to be chosen to optimize performance and/or availability.
Unlike Spanner, zones are monolithic and don’t allow movement of fine
grained data on the level of entity groups.

A
[Megastore](http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf)-like
message queue mechanism is also provided to 1) efficiently sideline
updates which can tolerate asynchronous execution and 2) provide an
integrated message queuing system for asynchronous communication between
distributed system components.

# Architecture

Cockroach implements a layered architecture. The highest level of
abstraction is the SQL layer (currently unspecified in this document).
It depends directly on the [*structured data
API*](#structured-data-api), which provides familiar relational concepts
such as schemas, tables, columns, and indexes. The structured data API
in turn depends on the [distributed key value store](#key-value-api),
which handles the details of range addressing to provide the abstraction
of a single, monolithic key value store. The distributed KV store
communicates with any number of physical cockroach nodes. Each node
contains one or more stores, one per physical device.



Each store contains potentially many ranges, the lowest-level unit of
key-value data. Ranges are replicated using the Raft consensus protocol.
The diagram below is a blown up version of stores from four of the five
nodes in the previous diagram. Each range is replicated three ways using
Raft. The color coding shows associated range replicas.



Each physical node exports a RoachNode service. Each RoachNode exports
one or more key ranges. RoachNodes are symmetric. Each has the same
binary and assumes identical roles.

Nodes and the ranges they provide access to can be arranged with various
physical network topologies to make trade offs between reliability and
performance. For example, a triplicated (3-way replica) range could have
each replica located on different:

- disks within a server to tolerate disk failures.
- servers within a rack to tolerate server failures.
- servers on different racks within a datacenter to tolerate rack power/network failures.
- servers in different datacenters to tolerate large scale network or power outages.

Up to `F` failures can be tolerated, where the total number of replicas `N = 2F + 1` (e.g. with 3x replication, one failure can be tolerated; with 5x replication, two failures, and so on).

# Cockroach Client

In order to support diverse client usage, Cockroach clients connect to
any node via HTTPS using protocol buffers or JSON. The connected node
proxies involved client work including key lookups and write buffering.

# Keys

Cockroach keys are arbitrary byte arrays. If textual data is used in
keys, utf8 encoding is recommended (this helps for cleaner display of
values in debugging tools). User-supplied keys are encoded using an
ordered code. System keys are either prefixed with null characters (`\0`
or `\0\0`) for system tables, or take the form of
`<user-key><system-suffix>` to sort user-key-range specific system
keys immediately after the user keys they refer to. Null characters are
used in system key prefixes to guarantee that they sort first.

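As an aside not present in the original document, the following Go sketch illustrates the byte-wise ordering these prefixes produce; the example keys are made up.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

func main() {
	// Hypothetical example keys: a user key, a user-key-range specific system
	// key (<user-key><system-suffix>), and two null-prefixed system keys.
	keys := [][]byte{
		[]byte("db1:user:42"),           // user-supplied key
		[]byte("db1:user:42perm"),       // <user-key><system-suffix> sorts right after the user key
		[]byte("\x00acct" + "db1"),      // system table key, single null prefix
		[]byte("\x00\x00meta1" + "db1"), // range metadata key, double null prefix
	}
	sort.Slice(keys, func(i, j int) bool { return bytes.Compare(keys[i], keys[j]) < 0 })
	for _, k := range keys {
		fmt.Printf("%q\n", k)
	}
	// Output order: \0\0meta1..., \0acct..., db1:user:42, db1:user:42perm
}
```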

# Versioned Values

Cockroach maintains historical versions of values by storing them with
associated commit timestamps. Reads and scans can specify a snapshot
time to return the most recent writes prior to the snapshot timestamp.
Older versions of values are garbage collected by the system during
compaction according to a user-specified expiration interval. In order
to support long-running scans (e.g. for MapReduce), all versions have a
minimum expiration.

Versioned values are supported via modifications to RocksDB to record
commit timestamps and GC expirations per key.

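As a rough illustration of this idea (the types and encoding below are hypothetical, not the actual storage schema), a versioned read can be thought of as picking the newest version at or below the snapshot timestamp:

```go
package main

import "fmt"

// Hypothetical sketch: each write is kept as a separate version under its
// commit timestamp, and a read at a snapshot timestamp returns the newest
// version at or below that snapshot.
type version struct {
	commitTS int64 // commit timestamp, e.g. HLC wall time in nanoseconds (assumed encoding)
	value    []byte
}

// readAtSnapshot returns the most recent version written at or before snap.
// The versions slice is assumed to be sorted newest-first.
func readAtSnapshot(versions []version, snap int64) ([]byte, bool) {
	for _, v := range versions {
		if v.commitTS <= snap {
			return v.value, true
		}
	}
	return nil, false // no version is visible at this snapshot
}

func main() {
	history := []version{
		{commitTS: 300, value: []byte("v3")},
		{commitTS: 200, value: []byte("v2")},
		{commitTS: 100, value: []byte("v1")},
	}
	if val, ok := readAtSnapshot(history, 250); ok {
		fmt.Printf("value at snapshot 250: %s\n", val) // prints v2
	}
}
```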

the low water mark of the cache appropriately. If a new range replica leader
is elected, it sets the low water mark for the cache to the current

# Lock-Free Distributed Transactions

Cockroach provides distributed transactions without locks. Cockroach
transactions support two isolation levels:

- snapshot isolation (SI) and
- *serializable* snapshot isolation (SSI).

*SI* is simple to implement, highly performant, and correct for all but a
handful of anomalous conditions (e.g. write skew). *SSI* requires just a touch
more complexity, is still highly performant (less so with contention), and has
no anomalous conditions. Cockroach’s SSI implementation is based on ideas from
the literature and some possibly novel insights.

SSI is the default level, with SI provided for application developers
who are certain enough of their need for performance and the absence of
write skew conditions to consciously elect to use it. In a lightly
contended system, our implementation of SSI is just as performant as SI,
requiring no locking or additional writes. With contention, our
implementation of SSI still requires no locking, but will end up
aborting more transactions. Cockroach’s SI and SSI implementations
prevent starvation scenarios even for arbitrarily long transactions.

See the [Cahill paper](https://drive.google.com/file/d/0B9GCVTp_FHJIcEVyZVdDWEpYYXVVbFVDWElrYUV0NHFhU2Fv/edit?usp=sharing)
for one possible implementation of SSI. This is another [great paper](http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf).
For a discussion of SSI implemented by preventing read-write conflicts
(in contrast to detecting them, called write-snapshot isolation), see
the [Yabandeh paper](https://drive.google.com/file/d/0B9GCVTp_FHJIMjJ2U2t6aGpHLTFUVHFnMTRUbnBwc2pLa1RN/edit?usp=sharing),
which is the source of much inspiration for Cockroach’s SSI.

Each Cockroach transaction is assigned a random priority and a
"candidate timestamp" at start. The candidate timestamp is the
provisional timestamp at which the transaction will commit, and is
chosen as the current clock time of the node coordinating the
transaction. This means that a transaction without conflicts will
usually commit with a timestamp that, in absolute time, precedes the
actual work done by that transaction.

In the course of coordinating a transaction between one or more
distributed nodes, the candidate timestamp may be increased, but will
SI and SSI is that the former allows the transaction's candidate
timestamp to increase and the latter does not.

in the [Hybrid Logical Clock paper](http://www.cse.buffalo.edu/tech-reports/2014-04.pdf).
HLC time uses timestamps which are composed of a physical component (thought of
as, and always close to, local wall time) and a logical component (used to
distinguish between events with the same physical component). It allows us to
track causality for related events similar to vector clocks, but with less
overhead. In practice, it works much like other logical clocks: When events
are received by a node, it informs the local HLC about the timestamp supplied
with the event by the sender, and when events are sent a timestamp generated by
the local HLC is attached.

For a more in-depth description of HLC please read the paper. Our
implementation is [here](https://github.com/cockroachdb/cockroach/blob/master/util/hlc/hlc.go).

Cockroach picks a Timestamp for a transaction using HLC time. Throughout this
document, *timestamp* always refers to the HLC time which is a singleton
on each node. The HLC is updated by every read/write event on the node, and
the HLC time >= walltime. A read/write timestamp received in a cockroach request
from another node is not only used to version the operation, but also updates
the HLC on the node. This is useful in guaranteeing that all data read/written
on a node is at a timestamp < next HLC time.

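The sketch below condenses the HLC update rules described above into Go; it is a simplification of the linked implementation and omits locking and error handling.

```go
package hlc

import "time"

// Timestamp is an HLC reading: a physical wall-time component plus a logical
// counter used to order events that share the same physical component.
type Timestamp struct {
	WallTime int64 // nanoseconds; tracks (and stays close to) local wall time
	Logical  int32
}

// Clock is a condensed sketch of the hybrid logical clock; the real
// implementation is in util/hlc. Not safe for concurrent use.
type Clock struct {
	physical func() int64
	ts       Timestamp
}

func NewClock() *Clock {
	return &Clock{physical: func() int64 { return time.Now().UnixNano() }}
}

// Now generates a timestamp for a local event or an outgoing message.
func (c *Clock) Now() Timestamp {
	if pt := c.physical(); pt > c.ts.WallTime {
		c.ts = Timestamp{WallTime: pt}
	} else {
		c.ts.Logical++
	}
	return c.ts
}

// Update folds a timestamp received from another node into the local clock,
// guaranteeing that subsequent local timestamps sort after the remote event.
func (c *Clock) Update(remote Timestamp) Timestamp {
	pt := c.physical()
	switch {
	case pt > c.ts.WallTime && pt > remote.WallTime:
		c.ts = Timestamp{WallTime: pt}
	case remote.WallTime > c.ts.WallTime:
		c.ts = Timestamp{WallTime: remote.WallTime, Logical: remote.Logical + 1}
	case c.ts.WallTime > remote.WallTime:
		c.ts.Logical++
	default: // equal wall times: take the larger logical component, plus one
		if remote.Logical > c.ts.Logical {
			c.ts.Logical = remote.Logical
		}
		c.ts.Logical++
	}
	return c.ts
}
```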

transaction table (keys prefixed by *\0tx*) with state “PENDING”. In
parallel write an "intent" value for each datum being written as part
of the transaction. These are normal MVCC values, with the addition of
a special flag (i.e. “intent”) indicating that the value may be
the transaction id (unique and chosen at tx start time by client)
is stored with intent values. The tx id is used to refer to the
transaction table when there are conflicts and to make
tie-breaking decisions on ordering between identical timestamps.
original candidate timestamp in the absence of read/write conflicts);
the client selects the maximum from amongst all write timestamps as the
final commit timestamp.
transaction table (keys prefixed by *\0tx*). The value of the
commit entry contains the candidate timestamp (increased as
necessary to accommodate any latest read timestamps). Note that
the transaction is considered fully committed at this point and
control may be returned to the client.

In the case of an SI transaction, a commit timestamp which was
increased to accommodate concurrent readers is perfectly
acceptable and the commit may continue. For SSI transactions,
however, a gap between candidate and commit timestamps
necessitates transaction restart (note: restart is different than
abort--see below).

After the transaction is committed, all written intents are upgraded
in parallel by removing the “intent” flag. The transaction is
considered fully committed before this step and does not wait for
it to return control to the transaction coordinator.

In the absence of conflicts, this is the end. Nothing else is necessary
to ensure the correctness of the system.

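A schematic of this commit path, with hypothetical helper functions standing in for the real storage operations, might look like the following; it is meant only to make the ordering of steps concrete.

```go
package txn

type txnStatus int

const (
	pending txnStatus = iota
	committed
)

type write struct {
	key, value []byte
}

// Stubs standing in for the real storage operations (hypothetical).
func writeTxnRecord(txnID string, status txnStatus, ts int64) {}
func writeIntent(txnID string, key, value []byte, ts int64)   {}
func resolveIntent(txnID string, key []byte)                  {}

// commitTxn sketches the conflict-free commit path described above.
func commitTxn(txnID string, candidateTS int64, writes []write) {
	// 1. Create the transaction record (key prefixed by \0tx) with state
	//    PENDING and lay down an "intent" version for every written key.
	//    In practice the intents are written in parallel.
	writeTxnRecord(txnID, pending, candidateTS)
	for _, w := range writes {
		writeIntent(txnID, w.key, w.value, candidateTS)
	}

	// 2. Commit by switching the transaction record to COMMITTED at the final
	//    timestamp (the candidate, possibly increased). For SSI, an increased
	//    timestamp would force a restart instead. The transaction is fully
	//    committed once this write succeeds; control returns to the client.
	writeTxnRecord(txnID, committed, candidateTS)

	// 3. Asynchronously upgrade intents to ordinary MVCC values by clearing
	//    the intent flag; this step is not required for correctness.
	go func() {
		for _, w := range writes {
			resolveIntent(txnID, w.key)
		}
	}()
}
```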

**Conflict Resolution**

Things get more interesting when a reader or writer encounters an intent
record or newly-committed value in a location that it needs to read or
write. This is a conflict, usually causing either of the transactions to
abort or restart depending on the type of conflict.

***Transaction restart:***

This is the usual (and more efficient) type of behaviour and is used
except when the transaction was aborted (for instance by another
transaction).
In effect, that reduces to two cases; the first being the one outlined
above: An SSI transaction that finds upon attempting to commit that
its commit timestamp has been pushed. The second case involves a transaction
actively encountering a conflict, that is, one of its readers or writers
encounter data that necessitate conflict resolution
(see transaction interactions below).

begins anew reusing the same tx id. The prior run of the transaction might
have written some write intents, which need to be deleted before the
transaction commits, so as to not be included as part of the transaction.
These stale write intent deletions are done during the reexecution of the
the same keys as part of the reexecution of the transaction, or explicitly,
by cleaning up stale intents that are not part of the reexecution of the
transaction. Since most transactions will end up writing to the same keys,
the explicit cleanup run just before committing the transaction is usually
a NOOP.

***Transaction abort:***

This is the case in which a transaction, upon reading its transaction
table entry, finds that it has been aborted. In this case, the
transaction can not reuse its intents; it returns control to the client
before cleaning them up (other readers and writers would clean up
dangling intents as they encounter them) but will make an effort to
clean up after itself. The next attempt (if applicable) then runs as a

There are several scenarios in which transactions interact:

- **Reader encounters write intent or value with newer timestamp far
enough in the future**: This is not a conflict. The reader is free
to proceed; after all, it will be reading an older version of the
value and so does not conflict. Recall that the write intent may
be committed with a later timestamp than its candidate; it will
never commit with an earlier one. **Side note**: if an SI transaction
reader finds an intent with a newer timestamp which the reader’s own

- **Reader encounters write intent or value with newer timestamp in the
near future:** In this case, we have to be careful. The newer
intent may, in absolute terms, have happened in our read's past if
the clock of the writer is ahead of the node serving the values.
In that case, we would need to take this value into account, but
we just don't know. Hence the transaction restarts, using instead
a future timestamp (but remembering a maximum timestamp used to
limit the uncertainty window to the maximum clock skew). In fact,
this is optimized further; see the details under "choosing a
timestamp" below.

- **Reader encounters write intent with older timestamp**: the reader
must follow the intent’s transaction id to the transaction table.
If the transaction has already been committed, then the reader can
just read the value. If the write transaction has not yet been
committed, then the reader has two options. If the write conflict
is from an SI transaction, the reader can *push that transaction's
commit timestamp into the future* (and consequently not have to
read it). This is simple to do: the reader just updates the
transaction’s commit timestamp to indicate that when/if the
transaction does commit, it should use a timestamp *at least* as
high. However, if the write conflict is from an SSI transaction,
the reader must compare priorities. If the reader has the higher priority,
it pushes the transaction’s commit timestamp (that
priority, the writer aborts the conflicting transaction. If the write
intent has a higher or equal priority the transaction retries, using as a new
priority *max(new random priority, conflicting txn’s priority - 1)*;
the retry occurs after a short, randomized backoff interval (a sketch
of this retry rule follows the list).

- **Writer encounters newer committed value**:
The committed value could also be an unresolved write intent made by a
transaction that has already committed. The transaction restarts. On restart,
the same priority is reused, but the candidate timestamp is moved forward
to the encountered value's timestamp.

candidate timestamp is earlier than the low water mark on the cache itself
(i.e. its last evicted timestamp) or if the key being written has a read
timestamp later than the write’s candidate timestamp, this later timestamp
value is returned with the write. A new timestamp forces a transaction
restart only if it is serializable.

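The following sketch spells out the retry rule quoted in the list above; the priority and backoff constants are illustrative rather than taken from the document.

```go
package retry

import (
	"math/rand"
	"time"
)

// retryPriority implements the rule max(new random priority,
// conflicting txn's priority - 1), so a transaction that keeps losing to the
// same contender eventually acquires a priority high enough to win.
func retryPriority(conflictingPriority int32) int32 {
	newPriority := rand.Int31()
	if p := conflictingPriority - 1; p > newPriority {
		return p
	}
	return newPriority
}

// backoff returns a short, randomized wait before the retry; the exponential
// base of 50ms is an assumption for illustration only.
func backoff(attempt int) time.Duration {
	base := 50 * time.Millisecond * time.Duration(1<<uint(attempt))
	jitter := time.Duration(rand.Int63n(int64(base)))
	return base/2 + jitter
}
```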

**Transaction management**

Transactions are managed by the client proxy (or gateway in SQL Azure
parlance). Unlike in Spanner, writes are not buffered but are sent
directly to all implicated ranges. This allows the transaction to abort
quickly if it encounters a write conflict. The client proxy keeps track
of all written keys in order to resolve write intents asynchronously upon
transaction completion. If a transaction commits successfully, all intents
are upgraded to committed. In the event a transaction is aborted, all written
intents are deleted. The client proxy doesn’t guarantee it will resolve intents.

transaction table until aborted by another transaction. Transactions
heartbeat the transaction table every five seconds by default.
Transactions encountered by readers or writers with dangling intents
which haven’t been heartbeat within the required interval are aborted.

An exploration of retries with contention and abort times with abandoned
transactions is
[here](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing).

**Transaction Table**

Please see [proto/data.proto](https://github.com/cockroachdb/cockroach/blob/master/proto/data.proto) for the up-to-date structures, the best entry point being `message Transaction`.

**Pros**

- No requirement for reliable code execution to prevent stalled 2PC
protocol.
- Readers never block with SI semantics; with SSI semantics, they may
abort.
- Lower latency than traditional 2PC commit protocol (w/o contention)
because second phase requires only a single write to the
transaction table instead of a synchronous round to all
transaction participants.
- Priorities avoid starvation for arbitrarily long transactions and
always pick a winner from between contending transactions (no
mutual aborts).
- Writes not buffered at client; writes fail fast.
- No read-locking overhead required for *serializable* SI (in contrast
to other SSI implementations).
- Well-chosen (i.e. less random) priorities can flexibly give
probabilistic guarantees on latency for arbitrary transactions
(for example: make OLTP transactions 10x less likely to abort than
low priority transactions, such as asynchronously scheduled jobs).

**Cons**

- Reads from non-leader replicas still require a ping to the leader to
- Abandoned transactions may block contending writers for up to the
heartbeat interval, though average wait is likely to be
considerably shorter (see [graph in link](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing)).
This is likely considerably more performant than detecting and
restarting 2PC in order to release read and write locks.
- Behavior different than other SI implementations: no first writer
wins, and shorter transactions do not always finish quickly.
Element of surprise for OLTP systems may be a problematic factor.
- Aborts can decrease throughput in a contended system compared with
two phase locking. Aborts and retries increase read and write
traffic, increase latency and decrease throughput.

**Choosing a Timestamp**

A key challenge of reading data in a distributed system with clock skew
is choosing a timestamp guaranteed to be greater than the latest
timestamp of any committed transaction (in absolute time). No system can
claim consistency and fail to read already-committed data.

accessing a single node is easy. The timestamp is assigned by the node
itself, so it is guaranteed to be at a greater timestamp than all the
existing timestamped data on the node.

For multiple nodes, the timestamp of the node coordinating the
transaction `t` is used. In addition, a maximum timestamp `t+ε` is
supplied to provide an upper bound on timestamps for already-committed
data (`ε` is the maximum clock skew). As the transaction progresses, any
data read which have timestamps greater than `t` but less than `t+ε`
cause the transaction to abort and retry with the conflicting timestamp
the same. This implies that transaction restarts due to clock uncertainty
can only happen on a time interval of length `ε`.

into account t<sub>c</sub>, but the timestamp of the node at the time
of the uncertain read t<sub>node</sub>. The larger of those two timestamps
t<sub>c</sub> and t<sub>node</sub> (likely equal to the latter) is used
to increase the read timestamp. Additionally, the conflicting node is
marked as “certain”. Then, for future reads to that node within the
transaction, we set `MaxTimestamp = Read Timestamp`, preventing further
uncertainty restarts.

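Stated as code, the uncertainty rule above might look like this; the types are illustrative.

```go
package uncertainty

// Timestamp is a simplified HLC timestamp used only for this sketch.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

func (t Timestamp) Less(o Timestamp) bool {
	return t.WallTime < o.WallTime || (t.WallTime == o.WallTime && t.Logical < o.Logical)
}

// isUncertain reports whether a value at valueTS forces an uncertainty restart
// for a transaction reading at readTS with upper bound maxTS = readTS + ε.
// Values above maxTS are safely ignored; values at or below readTS are simply
// older versions.
func isUncertain(readTS, maxTS, valueTS Timestamp) bool {
	return readTS.Less(valueTS) && !maxTS.Less(valueTS)
}
```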

Correctness follows from the fact that we know that at the time of the read,
there exists no version of any key on that node with a higher timestamp than
encounters a key with a higher timestamp, it knows that in absolute time,
the value was written after t<sub>node</sub> was obtained, i.e. after the
uncertain read. Hence the transaction can move forward reading an older version
of the data (at the transaction's timestamp). This limits the time uncertainty
restarts attributed to a node to at most one. The tradeoff is that we might
pick a timestamp larger than the optimal one (> highest conflicting timestamp),
resulting in the possibility of a few more conflicts.

We expect retries will be rare, but this assumption may need to be
revisited if retries become problematic. Note that this problem does not
apply to historical reads. An alternate approach which does not require
retries makes a round to all node participants in advance and
chooses the highest reported node wall time as the timestamp. However,
knowing which nodes will be accessed in advance is difficult and
potentially limiting. Cockroach could also potentially use a global
clock (Google did this with [Percolator](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf)),
which would be feasible for smaller, geographically-proximate clusters.

# Linearizability

First a word about [***Spanner***](http://research.google.com/archive/spanner.html).
By combining judicious use of wait intervals with accurate time signals,
Spanner provides a global ordering between any two non-overlapping transactions
(in absolute time) with \~14ms latencies. Put another way:
Spanner guarantees that if a transaction T<sub>1</sub> commits (in absolute time)
before another transaction T<sub>2</sub> starts, then T<sub>1</sub>'s assigned commit
timestamp is smaller than T<sub>2</sub>'s. Using atomic clocks and GPS receivers,
Spanner reduces their clock skew uncertainty to \< 10ms (`ε`). To make
good on the promised guarantee, transactions must take at least double
the clock skew uncertainty interval to commit (`2ε`). See [*this
article*](http://www.cs.cornell.edu/~ie53/publications/DC-col51-Sep13.pdf)
for a helpful overview of Spanner’s concurrency control.

Cockroach could make the same guarantees without specialized hardware,
at the expense of longer wait times. If servers in the cluster were
configured to work only with NTP, transaction wait times would likely
be in excess of 150ms. For wide-area zones, this would be somewhat
mitigated by overlap from cross datacenter link latencies. If clocks
were made more accurate, the minimal limit for commit latencies would
improve.

However, let’s take a step back and evaluate whether Spanner’s external
consistency guarantee is worth the automatic commit wait. First, if the
commit wait is omitted completely, the system still yields a consistent
view of the map at an arbitrary timestamp. However with clock skew, it
would become possible for commit timestamps on non-overlapping but
causally related transactions to suffer temporal reversal. In other
words, the following scenario is possible for a client without global
ordering:

- Start transaction T<sub>1</sub> to modify value `x` with commit time *s<sub>1</sub>*

- On commit of T<sub>1</sub>, start T<sub>2</sub> to modify value `y` with commit time
*s<sub>2</sub>*

- Read `x` and `y` and discover that s<sub>1</sub> \> s<sub>2</sub> (**!**)

The external consistency which Spanner guarantees is referred to as
**linearizability**. It goes beyond serializability by preserving
information about the causality inherent in how external processes
interacted with the database. The strength of Spanner’s guarantee can be
formulated as follows: any two processes, with clock skew within
expected bounds, may independently record their wall times for the
completion of transaction T<sub>1</sub> (T<sub>1</sub><sup>end</sup>) and start of transaction
T<sub>2</sub> (T<sub>2</sub><sup>start</sup>) respectively, and if later
compared such that T<sub>1</sub><sup>end</sup> \< T<sub>2</sub><sup>start</sup>,
then commit timestamps s<sub>1</sub> \< s<sub>2</sub>.
This guarantee is broad enough to completely cover all cases of explicit
causality, in addition to covering any and all imaginable scenarios of implicit
causality.

Our contention is that causality is chiefly important from the
perspective of a single client or a chain of successive clients (*if a
tree falls in the forest and nobody hears…*). As such, Cockroach
provides two mechanisms to provide linearizability for the vast majority
of use cases without a mandatory transaction commit wait or an elaborate
system to minimize clock skew.

1. Clients provide the highest transaction commit timestamp with
> successive transactions. This allows node clocks from previous
> transactions to effectively participate in the formulation of the
> commit timestamp for the current transaction. This guarantees
> linearizability for transactions committed by this client.
>
> Newly launched clients wait at least 2 \* ε from process start
> time before beginning their first transaction. This preserves the
> same property even on client restart, and the wait will be
> mitigated by process initialization.
>
> All causally-related events within Cockroach maintain
> linearizability. Message queues, for example, guarantee that the
> receipt timestamp is greater than send timestamp, and that
> delivered messages may not be reaped until after the commit wait.

2. Committed transactions respond with a commit wait parameter which
> represents the remaining time in the nominal commit wait. This
> will typically be less than the full commit wait as the consensus
> write at the coordinator accounts for a portion of it.
>
> Clients taking any action outside of another Cockroach transaction
> (e.g. writing to another distributed system component) can either
> choose to wait the remaining interval before proceeding, or
> alternatively, pass the wait and/or commit timestamp to the
> execution of the outside action for its consideration. This pushes
> the burden of linearizability to clients, but is a useful tool in
> mitigating commit latencies if the clock skew is potentially
> large. This functionality can be used for ordering in the face of
> backchannel dependencies as mentioned in the
> [AugmentedTime](http://www.cse.buffalo.edu/~demirbas/publications/augmentedTime.pdf)
> paper.

Using these mechanisms in place of commit wait, Cockroach’s guarantee can be
formulated as follows: any process which signals the start of transaction
T<sub>2</sub> (T<sub>2</sub><sup>start</sup>) after the completion of
transaction T<sub>1</sub> (T<sub>1</sub><sup>end</sup>), will have commit
timestamps such that s<sub>1</sub> \< s<sub>2</sub>.

# Logical Map Content

Logically, the map contains a series of reserved system key / value
pairs covering accounting, range metadata, node accounting and
permissions before the actual key / value pairs for non-system data
(e.g. the actual meat of the map).

- `\0\0meta1`: Range metadata for location of `\0\0meta2`.
- `\0\0meta1<key1>`: Range metadata for location of `\0\0meta2<key1>`.
- ...
- `\0\0meta1<keyN>`: Range metadata for location of `\0\0meta2<keyN>`.
- `\0\0meta2`: Range metadata for location of first non-range metadata key.
- `\0\0meta2<key1>`: Range metadata for location of `<key1>`.
- ...
- `\0\0meta2<keyN>`: Range metadata for location of `<keyN>`.
- `\0acct<key0>`: Accounting for key prefix key0.
- ...
- `\0acct<keyN>`: Accounting for key prefix keyN.
- `\0node<node-address0>`: Accounting data for node 0.
- ...
- `\0node<node-addressN>`: Accounting data for node N.
- `\0perm<key0><user0>`: Permissions for user0 for key prefix key0.
- ...
- `\0perm<keyN><userN>`: Permissions for userN for key prefix keyN.
- `\0tree_root`: Range key for root of range-spanning tree.
- `\0tx<tx-id0>`: Transaction record for transaction 0.
- ...
- `\0tx<tx-idN>`: Transaction record for transaction N.
- `\0zone<key0>`: Zone information for key prefix key0.
- ...
- `\0zone<keyN>`: Zone information for key prefix keyN.
- `<>acctd<metric0>`: Accounting data for Metric 0 for empty key prefix.
- ...
- `<>acctd<metricN>`: Accounting data for Metric N for empty key prefix.
- `<key0>`: `<value0>` The first user data key.
- ...
- `<keyN>`: `<valueN>` The last user data key.

There are some additional system entries sprinkled amongst the
non-system keys. See the Key-Prefix Accounting section in this document
for further details.

# Node Storage

Nodes maintain a separate instance of RocksDB for each disk. Each
RocksDB instance hosts any number of ranges. RPCs arriving at a
RoachNode are multiplexed based on the disk name to the appropriate
RocksDB instance. A single instance per disk is used to avoid
contention. If every range maintained its own RocksDB, global management
of available cache memory would be impossible and writers for each range
would compete for non-contiguous writes to multiple RocksDB logs.

In addition to the key/value pairs of the range itself, various range
metadata is maintained.

- range-spanning tree node links

- participating replicas

- consensus metadata

- split/merge activity

A really good reference on tuning Linux installations with RocksDB is
[here](http://docs.basho.com/riak/latest/ops/advanced/backends/leveldb/).

# Range Metadata

The default approximate size of a range is 64M (2\^26 B). In order to
support 1P (2\^50 B) of logical data, metadata is needed for roughly
2\^(50 - 26) = 2\^24 ranges. A reasonable upper bound on range metadata
locations and 220 bytes for the range key itself*). 2\^24 ranges \* 2\^8
B would require roughly 4G (2\^32 B) to store--too much to duplicate
between machines. Our conclusion is that range metadata must be
distributed for large installations.

To keep key lookups relatively fast in the presence of distributed metadata,
we store all the top-level metadata in a single range (the first range). These
top-level metadata keys are known as *meta1* keys, and are prefixed such that
they sort to the beginning of the key space. Given the metadata size of 256
bytes given above, a single 64M range would support 64M/256B = 2\^18 ranges,
which gives a total storage of 64M \* 2\^18 = 16.7T. To support the 1P quoted
above, we need two levels of indirection, where the first level addresses the
second, and the second addresses user data. With two levels of indirection, we
can address 2\^(18 + 18) = 2\^36 ranges; each range addresses 2\^26 B, and
altogether we address 2\^(36+26) B = 2\^62 B = 4E of user data.

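The arithmetic above can be spelled out directly (the program below just recomputes the figures quoted in the paragraph):

```go
package main

import "fmt"

// With 64M ranges and 256 B of metadata per range, one 64M metadata range
// indexes 2^18 ranges; two levels of indirection therefore index 2^36 ranges,
// or 2^62 B of user data.
func main() {
	const (
		rangeSize    = int64(64) << 20 // 64M = 2^26 B
		metadataSize = int64(256)      // bytes of metadata per range
	)
	rangesPerMetaRange := rangeSize / metadataSize                  // 2^18
	singleLevel := rangesPerMetaRange * rangeSize                   // one level of indirection
	twoLevel := rangesPerMetaRange * rangesPerMetaRange * rangeSize // 2^62 B

	fmt.Println("ranges per meta range:", rangesPerMetaRange) // 262144
	fmt.Println("one level addresses:  ", singleLevel)        // 17592186044416 (~16.7T)
	fmt.Println("two levels address:   ", twoLevel)           // 4611686018427387904 (4E)
}
```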

For a given user-addressable `key1`, the associated *meta1* record is found
at the successor key to `key1` in the *meta1* space. Since the *meta1* space
is sparse, the successor key is defined as the next key which is present. The
*meta1* record identifies the range containing the *meta2* record, which is
found using the same process. The *meta2* record identifies the range
containing `key1`, which is again found the same way (see examples below).

Concretely, metadata keys are prefixed by `\0\0meta{1,2}`; the two null
characters provide for the desired sorting behaviour. Thus, `key1`'s
*meta1* record will reside at the successor key to `\0\0meta1<key1>`.

Note: we append the end key of each range to meta[12] records because
the RocksDB iterator only supports a Seek() interface which acts as a
Ceil(). Using the start key of the range would cause Seek() to find the
key *after* the meta indexing record we’re looking for, which would
result in having to back the iterator up, an option which is both less
efficient and not available in all cases.

The following example shows the directory structure for a map with
three ranges worth of data. Ellipses indicate additional key/value pairs to
fill an entire range of data. Except for the fact that splitting ranges
requires updates to the range metadata with knowledge of the metadata layout,
the range metadata itself requires no special treatment or bootstrapping.

**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
`dcrama3:8000`)

- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2<lastkey0>`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2<lastkey1>`: `dcrama4:8000`, `dcrama5:8000`, `dcrama6:8000`
- `\0\0meta2\xff`: `dcrama7:8000`, `dcrama8:8000`, `dcrama9:8000`
- ...
- `<lastkey0>`: `<lastvalue0>`

**Range 1** (located on servers `dcrama4:8000`, `dcrama5:8000`,
`dcrama6:8000`)

- ...
- `<lastkey1>`: `<lastvalue1>`

**Range 2** (located on servers `dcrama7:8000`, `dcrama8:8000`,
`dcrama9:8000`)

- ...
- `<lastkey2>`: `<lastvalue2>`

Consider a simpler example of a map containing less than a single
range of data. In this case, all range metadata and all data are
located in the same range:

**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
`dcrama3:8000`)

- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `<key0>`: `<value0>`
- `...`

Finally, a map large enough to need both levels of indirection would
look like (note that instead of showing range replicas, this
example is simplified to just show range indexes):

**Range 0**

- `\0\0meta1<lastkeyN-1>`: Range 0
- `\0\0meta1\xff`: Range 1
- `\0\0meta2<lastkey1>`: Range 1
- `\0\0meta2<lastkey2>`: Range 2
- `\0\0meta2<lastkey3>`: Range 3
- ...
- `\0\0meta2<lastkeyN-1>`: Range 262143

**Range 1**

- `\0\0meta2<lastkeyN>`: Range 262144
- `\0\0meta2<lastkeyN+1>`: Range 262145
- ...
- `\0\0meta2\xff`: Range 500,000
- ...
- `<lastkey1>`: `<lastvalue1>`

**Range 2**

- ...
- `<lastkey2>`: `<lastvalue2>`

**Range 3**

- ...
- `<lastkey3>`: `<lastvalue3>`

**Range 262144**

- ...
- `<lastkeyN>`: `<lastvalueN>`

**Range 262145**

- ...
- `<lastkeyN+1>`: `<lastvalueN+1>`

Note that the choice of range `262144` is just an approximation. The
actual number of ranges addressable via a single metadata range is
dependent on the size of the keys. If efforts are made to keep key sizes
small, the total number of addressable ranges would increase and vice
versa.

From the examples above it’s clear that key location lookups require at
most three reads to get the value for `<key>`:

1. lower bound of `\0\0meta1<key>`
2. lower bound of `\0\0meta2<key>`,
3. `<key>`.

For small maps, the entire lookup is satisfied in a single RPC to Range 0. Maps
containing less than 16T of data would require two lookups. Clients cache both
levels of range metadata, and we expect that data locality for individual
clients will be high. Clients may end up with stale cache entries. If on a
lookup, the range consulted does not match the client’s expectations, the
client evicts the stale entries and possibly does a new lookup.

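A client-side lookup following these steps might be sketched as below; the cache, `metaClient` interface, and `lowerBound` helper are hypothetical stand-ins for the real addressing machinery, and the flow is simplified (for instance, the meta2 scan would really be routed to the range named by the meta1 record).

```go
package kv

// RangeDescriptor names the range and replicas responsible for a span of keys.
type RangeDescriptor struct {
	StartKey, EndKey []byte
	Replicas         []string // e.g. "dcrama1:8000"
}

// metaClient is a hypothetical interface: lowerBound returns the first meta
// record at or after the given meta key (the "successor key" described above).
type metaClient interface {
	lowerBound(metaKey []byte) (RangeDescriptor, error)
}

type rangeCache map[string]RangeDescriptor

func lookupRange(cache rangeCache, meta metaClient, key []byte) (RangeDescriptor, error) {
	if desc, ok := cache[string(key)]; ok {
		return desc, nil // cached; evicted later if it proves stale
	}
	// 1. meta1 record: successor key to \0\0meta1<key>.
	meta1, err := meta.lowerBound(append([]byte("\x00\x00meta1"), key...))
	if err != nil {
		return RangeDescriptor{}, err
	}
	_ = meta1 // identifies the range holding the meta2 record
	// 2. meta2 record: successor key to \0\0meta2<key>.
	desc, err := meta.lowerBound(append([]byte("\x00\x00meta2"), key...))
	if err != nil {
		return RangeDescriptor{}, err
	}
	cache[string(key)] = desc
	return desc, nil // 3. the caller now reads <key> from this range
}
```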

# Splitting / Merging Ranges

RoachNodes split or merge ranges based on whether they exceed maximum or
minimum thresholds for capacity or load. Ranges exceeding maximums for
either capacity or load are split; ranges below minimums for *both*
capacity and load are merged.

Ranges maintain the same accounting statistics as accounting key
prefixes. These boil down to a time series of data points with minute
granularity, covering everything from number of bytes to read/write queue
sizes. Arbitrary distillations of the accounting stats can be determined as the
basis for splitting / merging. Two sensical metrics for use with
split/merge are range size in bytes and IOps. A good metric for
rebalancing a replica from one node to another would be total read/write
queue wait times. These metrics are gossipped, with each range / node
passing along relevant metrics if they’re in the bottom or top of the
range it’s aware of.

A range finding itself exceeding either capacity or load threshold
splits. To this end, the range leader computes an appropriate split key
candidate and issues the split through Raft. In contrast to splitting,
merging requires a range to be below the minimum threshold for both
capacity *and* load. A range being merged chooses the smaller of the
ranges immediately preceding and succeeding it.

Splitting, merging, rebalancing and recovering all follow the same basic
algorithm for moving data between roach nodes. New target replicas are
created and added to the replica set of source range. Then each new
replica is brought up to date by either replaying the log in full or
copying a snapshot of the source replica data and then replaying the log
from the timestamp of the snapshot to catch up fully. Once the new
replicas are fully up to date, the range metadata is updated and old,
source replica(s) deleted if applicable.

**Coordinator** (leader replica)

```
only after being completed locally, are moved to new target replicas.
else if merging
  Choose new replicas on same servers as target range replicas;
  add to replica set.
else if rebalancing || recovering
  Choose new replica(s) on least loaded servers; add to replica set.
```

*Bring replica up to date:*

```
if all info can be read from replicated log
  copy replicated log
else
  snapshot source replica
  send successive ReadRange requests to source replica
    referencing snapshot

if merging
  combine ranges on all replicas
else if rebalancing || recovering
  remove old range replica(s)
```

RoachNodes split ranges when the total data in a range exceeds a
configurable maximum threshold. Similarly, ranges are merged when the
total data falls below a configurable minimum threshold.

**TBD: flesh this out**.

Ranges are rebalanced if a node determines its load or capacity is one
of the worst in the cluster based on gossipped load stats. A node with
spare capacity is chosen in the same datacenter and a special-case split
is done which simply duplicates the data 1:1 and resets the range
configuration metadata.

Each range is configured to consist of three or more replicas, as specified by
their ZoneConfig. The replicas in a range maintain their own instance of a
distributed consensus algorithm. We use the [*Raft consensus algorithm*](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf)
as it is simpler to reason about and includes a reference implementation
covering important details. Every write to replicas is logged twice.
Once to RocksDB’s internal log and once to RocksDB itself as part of the
Raft consensus log.
[ePaxos](https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf) has
promising performance characteristics for WAN-distributed replicas, but
it does not guarantee a consistent ordering between replicas.

Raft elects a relatively long-lived leader which must be involved to
replicated. In the absence of heartbeats, followers become candidates
after randomized election timeouts and proceed to hold new leader
elections. Cockroach weights random timeouts such that the replicas with
shorter round trip times to peers are more likely to hold elections
first (not implemented yet). Only the Raft leader may propose commands;
followers will simply relay commands to the last known leader.

# Message Queues

Each range maintains an array of incoming message queues, referred to
here as **inboxes**. Additionally, each range maintains and *processes*
an array of outgoing message queues, referred to here as **outboxes**.
Both inboxes and outboxes are assigned to keys; messages can be sent or
received on behalf of any key. Inboxes and outboxes can contain any
number of pending messages.

Messages can be *deliverable* or *executable*.

Deliverable messages are defined by Value objects - simple byte arrays -
that are delivered to a key’s inbox, awaiting collection by a client
invoking the ReapQueue operation. These are typically used by client
applications wishing to be notified of changes to an entry for further
processing, such as expensive offline operations like sending emails,
SMSs, etc.

Executable messages are *outgoing-only*, and are instances of
PutRequest, IncrementRequest, DeleteRequest, DeleteRangeRequest
executed when encountered. These are primarily useful when updates that
are nominally part of a transaction can tolerate asynchronous execution
(e.g. eventual consistency), and are otherwise too busy or numerous to
make including them in the original [distributed] transaction efficient.
Examples may include updates to the accounting for successive key
prefixes (potentially busy) or updates to a full-text index (potentially
numerous).

These two types of messages are enqueued in different outboxes too - see
key formats below.

At commit time, the range processing the transaction places messages
into a shared outbox located at the start of the range metadata. This is
effectively free as it’s part of the same consensus write for the range
as the COMMIT record. Outgoing messages are processed asynchronously by
the range. To make processing easy, all outboxes are co-located at the
start of the range. To make lookup easy, all inboxes are located
immediately after the recipient key. The leader replica of a range is
responsible for processing message queues.

A dispatcher polls a given range’s *deliverable message outbox*
periodically (configurable), and delivers those messages to the target
key’s inbox. The dispatcher is also woken up whenever a new message is
added to the outbox. A separate executor also polls the range’s
*executable message outbox* periodically as well (again, configurable),
new message is added to the outbox.

Formats follow in the table below. Notice that inbox messages for a
given key sort by the `<outbox-timestamp>`. This doesn’t provide a
precise ordering, but it does allow clients to scan messages in an
approximate ordering of when they were originally lodged with senders.
NTP offers walltime deltas to within 100s of milliseconds. The
`<sender-range-key>` suffix provides uniqueness.

**Outbox**
`<sender-range-key>deliverable-outbox:<recipient-key><outbox-timestamp>`
`<sender-range-key>executable-outbox:<recipient-key><outbox-timestamp>`

**Inbox**
`<recipient-key>inbox:<outbox-timestamp><sender-range-key>`

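The helpers below simply assemble these key formats; the timestamp encoding is a placeholder, since the document does not specify one.

```go
package mq

import (
	"fmt"
	"time"
)

// deliverableOutboxKey and executableOutboxKey build outbox keys co-located at
// the sender range; inboxKey builds the key stored immediately after the
// recipient key. The RFC 3339 timestamp encoding is an assumption made for
// illustration, not the real ordered encoding.
func deliverableOutboxKey(senderRangeKey, recipientKey string, ts time.Time) string {
	return fmt.Sprintf("%sdeliverable-outbox:%s%s", senderRangeKey, recipientKey, ts.UTC().Format(time.RFC3339Nano))
}

func executableOutboxKey(senderRangeKey, recipientKey string, ts time.Time) string {
	return fmt.Sprintf("%sexecutable-outbox:%s%s", senderRangeKey, recipientKey, ts.UTC().Format(time.RFC3339Nano))
}

func inboxKey(recipientKey, senderRangeKey string, ts time.Time) string {
	// Inbox keys sort by outbox timestamp first, with the sender range key as
	// a uniqueness suffix, giving the approximate ordering described above.
	return fmt.Sprintf("%sinbox:%s%s", recipientKey, ts.UTC().Format(time.RFC3339Nano), senderRangeKey)
}
```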

Messages are processed and then deleted as part of a single distributed
transaction. The message will be executed or delivered exactly once,
regardless of failures at either sender or receiver.

Delivered messages may be read by clients via the ReapQueue operation.
This operation may only be used as part of a transaction. Clients should
commit only after having processed the message. If the transaction is
committed, scanned messages are automatically deleted. The operation
name was chosen to reflect its mutating side effect. Deletion of read
messages is mandatory because senders deliver messages asynchronously
and a delay could cause insertion of messages at arbitrary points in the
inbox queue. If clients require persistence, they should re-save read
messages manually; the ReapQueue operation can be incorporated into
normal transactional updates.

# Range-Spanning Binary Tree

A crucial enhancement to the organization of range metadata is to
augment the bi-level range metadata lookup with a minimum spanning tree,
implemented as a left-leaning red-black tree over all ranges in the map.
This tree structure allows the system to start at any key prefix and
efficiently traverse an arbitrary key range with minimal RPC traffic,
minimal fan-in and fan-out, and with bounded time complexity equal to
`2*log N` steps, where `N` is the total number of ranges in the system.

Unlike the range metadata rows prefixed with `\0\0meta[1|2]`, the
metadata for the range-spanning tree (e.g. parent range and left / right
child ranges) is stored directly at the ranges as non-map metadata. The
metadata for each node of the tree (e.g. links to parent range, left
child range, and right child range) is stored with the range metadata.
In effect, the tree metadata is stored implicitly. In order to traverse
the tree, for example, you’d need to query each range in turn for its
metadata.

Any time a range is split or merged, both the bi-level range lookup
metadata and the per-range binary tree metadata are updated as part of
the same distributed transaction. The total number of nodes involved in
the update is bounded by 2 + log N (i.e. 2 updates for meta1 and
meta2, and up to log N updates to balance the range-spanning tree).
The range corresponding to the root node of the tree is stored in
*\0tree_root*.

As an example, consider the following set of nine ranges and their
associated range-spanning tree:

R0: `aa - cc`, R1: `*cc - lll`, R2: `*lll - llr`, R3: `*llr - nn`, R4: `*nn - rr`, R5: `*rr - ssss`, R6: `*ssss - sst`, R7: `*sst - vvv`, R8: `*vvv - zzzz`.



The range-spanning tree has many beneficial uses in Cockroach. It makes
the problem of efficiently aggregating accounting information of
potentially vast ranges of data tractable. Imagine a subrange of data
over which accounting is being kept. For example, the *photos* table in
a public photo sharing site. To efficiently keep track of data about the
table (e.g. total size, number of rows, etc.), messages can be passed
first up the tree and then down to the left until updates arrive at the
key prefix under which accounting is aggregated. This makes worst case
number of hops for an update to propagate into the accounting totals
2 \* log N. A 64T database will require 1M ranges, meaning 40 hops
worst case. In our experience, accounting tasks over vast ranges of data
are most often map/reduce jobs scheduled with coarse-grained
periodicity. By contrast, we expect Cockroach to maintain statistics
with sub 10s accuracy and with minimal cycles and minimal IOPs.

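The figures in the previous paragraph can be recomputed directly:

```go
package main

import (
	"fmt"
	"math"
)

// A 64T map at 64M per range is ~1M ranges, and propagating an update through
// the range-spanning tree costs at most 2 * log2(N) hops.
func main() {
	const rangeSize = 64 << 20        // 64M
	const totalData = int64(64) << 40 // 64T
	ranges := totalData / rangeSize   // 1,048,576 ranges
	hops := 2 * math.Log2(float64(ranges))
	fmt.Printf("%d ranges, worst case %.0f hops\n", ranges, hops) // 1048576 ranges, worst case 40 hops
}
```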

Another use for the range-spanning tree is to push accounting, zones and
permissions configurations to all ranges. In the case of zones and
permissions, this is an efficient way to pass updated configuration
information with exponential fan-out. When adding accounting
configurations (i.e. specifying a new key prefix to track), the
implicated ranges are transactionally scanned and zero-state accounting
information is computed as well. Deleting accounting configurations is
similar, except accounting records are deleted.

Last but *not* least, the range-spanning tree provides a convenient
mechanism for planning and executing parallel queries. These provide the
basis for
[Dremel](http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36632.pdf)-like
query execution trees and it’s easy to imagine supporting a subset of
SQL or even javascript-based user functions for complex data analysis
tasks.

# Node Allocation (via Gossip)

New nodes must be allocated when a range is split. Instead of requiring
every RoachNode to know about the status of all or even a large number
of peer nodes --or-- alternatively requiring a specialized curator or
master with sufficiently global knowledge, we use a gossip protocol to
efficiently communicate only interesting information between all of the
nodes in the cluster. What’s interesting information? One example would
be whether a particular node has a lot of spare capacity. Each node,
when gossiping, compares each topic of gossip to its own state. If its
own state is somehow “more interesting” than the least interesting item
in the topic it’s seen recently, it includes its own state as part of
the next gossip session with a peer node. In this way, a node with
capacity sufficiently in excess of the mean quickly becomes discovered
by the entire cluster. To avoid piling onto outliers, nodes from the
high capacity set are selected at random for allocation.

The gossip protocol itself contains two primary components:

- **Peer Selection**: each node maintains up to N peers with which it
regularly communicates. It selects peers with an eye towards
maximizing fanout. A peer node which itself communicates with an
array of otherwise unknown nodes will be selected over one which
communicates with a set containing significant overlap. Each time
gossip is initiated, each node’s set of peers is exchanged. Each
node is then free to incorporate the other’s peers as it sees fit.
To avoid any node suffering from excess incoming requests, a node
may refuse to answer a gossip exchange. Each node is biased
towards answering requests from nodes without significant overlap
and refusing requests otherwise.

Peers are efficiently selected using a heuristic as described in
[Agarwal & Trachtenberg (2006)](https://drive.google.com/file/d/0B9GCVTp_FHJISmFRTThkOEZSM1U/edit?usp=sharing).

**TBD**: how to avoid partitions? Need to work out a simulation of
the protocol to tune the behavior and see empirically how well it
works.

- **Gossip Selection**: what to communicate. Gossip is divided into
topics. Load characteristics (capacity per disk, cpu load, and
state [e.g. draining, ok, failure]) are used to drive node
allocation. Range statistics (range read/write load, missing
replicas, unavailable ranges) and network topology (inter-rack
bandwidth/latency, inter-datacenter bandwidth/latency, subnet
outages) are used for determining when to split ranges, when to
recover replicas vs. wait for network connectivity, and for
debugging / sysops. In all cases, a set of minimums and a set of
maximums is propagated; each node applies its own view of the
world to augment the values. Each minimum and maximum value is
tagged with the reporting node and other accompanying contextual
information. Each topic of gossip has its own protobuf to hold the
structured data. The number of items of gossip in each topic is
limited by a configurable bound.

For efficiency, nodes assign each new item of gossip a sequence
number and keep track of the highest sequence number each peer
node has seen. Each round of gossip communicates only the delta
containing new items.

# Node Accounting

The gossip protocol discussed in the previous section is useful to
quickly communicate fragments of important information in a
decentralized manner. However, complete accounting for each node is also
stored to a central location, available to any dashboard process. This
is done using the map itself. Each node periodically writes its state to
the map with keys prefixed by `\0node`, similar to the first level of
range metadata, but with a ‘`node`’ suffix. Each value is a protobuf
containing the full complement of node statistics--everything
communicated normally via the gossip protocol plus other useful, but
non-critical data.

The range containing the first key in the node accounting table is
responsible for gossiping the total count of nodes. This total count is
used by the gossip network to most efficiently organize itself. In
particular, the maximum number of hops for gossipped information to take
before reaching a node is given by `ceil(log(node count) / log(max
fanout)) + 1`.

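For a concrete (illustrative) example of this bound:

```go
package main

import (
	"fmt"
	"math"
)

// The gossip-hop bound evaluated for an illustrative cluster: with 1,000 nodes
// and a maximum fanout of 6, information reaches every node in at most
// ceil(log(1000)/log(6)) + 1 = 5 hops.
func main() {
	nodeCount, maxFanout := 1000.0, 6.0
	maxHops := math.Ceil(math.Log(nodeCount)/math.Log(maxFanout)) + 1
	fmt.Printf("max gossip hops: %.0f\n", maxHops)
}
```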

# Key-prefix Accounting, Zones & Permissions

Arbitrarily fine-grained accounting and permissions are specified via
key prefixes. Key prefixes can overlap, as is necessary for capturing
hierarchical relationships. For illustrative purposes, let’s say keys
specifying rows in a set of databases have the following format:

`<db>:<table>:<primary-key>[:<secondary-key>]`

In this case, we might collect accounting or specify permissions with
key prefixes:

`db1`, `db1:user`, `db1:order`.

Accounting is kept for the entire map by default.

## Accounting

To keep accounting for a range defined by a key prefix, an entry is created in
the accounting system table. The format of accounting table keys is:

`\0acct<key-prefix>`

In practice, we assume each RoachNode is capable of caching the
entire accounting table, as it is likely to be relatively small.

Accounting is kept for key prefix ranges with eventual consistency
for efficiency. Updates to accounting values propagate through the
system using the message queue facility if the accounting keys do
not reside on the same range as the ongoing activity (true for all but
the smallest systems). There are two types of values which
comprise accounting: counts and occurrences, for lack of better
terms. Counts describe system state, such as the total number of
bytes, rows, etc. Occurrences include transient performance and
load metrics. Both types of accounting are captured as time series
with minute granularity. The length of time accounting metrics are
kept is configurable. Below are examples of each type of
accounting value.

**System State Counters/Performance**

- Count of items (e.g. rows)
- Total bytes
- Total key bytes
- Total value length
- Queued message count
- Queued message total bytes
- Count of values \< 16B
- Count of values \< 64B
- Count of values \< 256B
- Count of values \< 1K
- Count of values \< 4K
- Count of values \< 16K
- Count of values \< 64K
- Count of values \< 256K
- Count of values \< 1M
- Count of values \> 1M
- Total bytes of accounting

**Load Occurrences**

- Get op count
- Get total MB
- Put op count
- Put total MB
- Delete op count
- Delete total MB
- Delete range op count
- Delete range total MB
- Scan op count
- Scan op MB
- Split count
- Merge count

Because accounting information is kept as time series and over many
possible metrics of interest, the data can become numerous. Accounting
data are stored in the map near the key prefix described, in order to
distribute load (for both aggregation and storage).

Accounting keys for system state have the form:
`<key-prefix>|acctd<metric-name>*`. Notice the leading ‘pipe’
character: it is meant to sort the root-level accounting entries AFTER
any other system tables. These keys must increment the same underlying
values, as they are permanent counts and not transient activity. Logic
at the RoachNode takes care of snapshotting the value into an
appropriately suffixed (e.g. with timestamp hour) multi-value time
series entry.

Keys for perf/load metrics have the form:
`<key-prefix>acctd<metric-name><hourly-timestamp>`.

`<hourly-timestamp>`-suffixed accounting entries are multi-valued,
containing a varint64 entry for each minute with activity during the
specified hour.

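For illustration, a sketch of how such a perf/load key and its
per-minute values might be assembled, assuming a hypothetical
`YYYYMMDDHH` hourly timestamp encoding (the real encoding is defined by
the accounting implementation):

```go
import (
    "encoding/binary"
    "fmt"
    "time"
)

// loadMetricKey builds "<key-prefix>acctd<metric-name><hourly-timestamp>"
// for a perf/load metric, e.g. prefix "db1:user" and metric "get-op-count".
func loadMetricKey(keyPrefix, metricName string, t time.Time) []byte {
    hourly := t.UTC().Format("2006010215") // YYYYMMDDHH, an assumed encoding
    return []byte(fmt.Sprintf("%sacctd%s%s", keyPrefix, metricName, hourly))
}

// encodeMinuteSeries appends one varint64 per minute with activity,
// matching the multi-valued entry described above.
func encodeMinuteSeries(perMinuteCounts []int64) []byte {
    var out []byte
    for _, c := range perMinuteCounts {
        out = binary.AppendVarint(out, c)
    }
    return out
}
```
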
To efficiently keep accounting over large key ranges, the task of
aggregation must be distributed. If activity occurs within the same
range as the key prefix for accounting, the updates are made as part
of the consensus write. If the ranges differ, then a message is sent
to the parent range to increment the accounting. If, upon receiving the
message, the parent range also does not include the key prefix, it in
turn forwards it to its parent or left child in the balanced binary
tree which is maintained to describe the range hierarchy. This limits
the number of messages before an update is visible at the root to `2*log N`,
where `N` is the number of ranges in the key prefix.

## Zones

Zones are stored in the map with keys prefixed by
`\0zone` followed by the key prefix to which the zone
configuration applies. Zone values specify a protobuf containing
the datacenters from which replicas for ranges which fall under
the zone must be chosen.

Please see [proto/config.proto](https://github.com/cockroachdb/cockroach/blob/master/proto/config.proto) for up-to-date data structures used, the best entry point being `message ZoneConfig`.

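For illustration only, a sketch of the key layout and the kind of
information a zone value carries, using hypothetical Go types rather
than the authoritative `ZoneConfig` protobuf:

```go
// zoneKey builds "\x00zone<key-prefix>", e.g. "\x00zonedb1" for the
// zone governing all keys under the "db1" prefix.
func zoneKey(keyPrefix string) []byte {
    return append([]byte("\x00zone"), []byte(keyPrefix)...)
}

// zoneSpec is a stand-in for the ZoneConfig protobuf: one entry per
// desired replica, naming the datacenter from which that replica must
// be chosen (hypothetical field names).
type zoneSpec struct {
    ReplicaDatacenters []string // e.g. {"us-east-1", "eu-west-1", "ap-northeast-1"}
}
```
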
If zones are modified in situ, each RoachNode verifies the
existing zones for its ranges against the zone configuration. If
it discovers differences, it reconfigures ranges in the same way
that it rebalances away from busy nodes, via a special-case 1:1
split to a duplicate range comprising the new configuration.

## Permissions

Permissions are stored in the map with keys prefixed by `\0perm` followed by
the key prefix and user to which the specified permissions apply. The format of
permissions keys is:

`\0perm<key-prefix><user>`

Permission values are a protobuf containing the permission details;
please see [proto/config.proto](https://github.com/cockroachdb/cockroach/blob/master/proto/config.proto) for up-to-date data structures used, the best entry point being `message PermConfig`.

A default system root permission is assumed for the entire map
with an empty key prefix and read/write as true.

When determining whether or not to allow a read or a write of a key
value (e.g. `db1:user:1` for user `spencer`), a RoachNode would
read the following permissions values:

```
\0perm<db1:user:1>spencer
\0perm<db1:user>spencer
\0perm<db1>spencer
\0perm<>spencer
```

If any prefix in the hierarchy provides the required permission,
the request is satisfied; otherwise, the request returns an
error.

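A minimal sketch of that prefix walk, assuming the illustrative `:`
key separator used above and a hypothetical `readPerm` accessor that
fetches and decodes a single `\0perm` entry (returning `nil` when none
exists):

```go
// permission is a stand-in for the PermConfig protobuf.
type permission struct {
    Read, Write bool
}

// hasPermission walks the key's prefixes from most specific down to
// the empty (root) prefix and grants access as soon as any level
// allows the requested operation.
func hasPermission(readPerm func(prefix, user string) *permission,
    key, user string, write bool) bool {
    prefixes := []string{key}
    for i := len(key) - 1; i >= 0; i-- {
        if key[i] == ':' {
            prefixes = append(prefixes, key[:i])
        }
    }
    prefixes = append(prefixes, "") // the root (empty) prefix
    for _, p := range prefixes {
        if perm := readPerm(p, user); perm != nil {
            if (write && perm.Write) || (!write && perm.Read) {
                return true
            }
        }
    }
    return false
}
```
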
The priority for a user permission is used to order requests at
Raft consensus ranges and for choosing an initial priority for
distributed transactions. When scheduling operations at the Raft
consensus range, all outstanding requests are ordered by key
prefix and each is assigned priorities according to key, user and
arrival time. The next request is chosen probabilistically using
priorities to weight the choice. Each key can have multiple
priorities as they’re hierarchical (e.g. for /user/key, one
priority for root ‘/’, and one for ‘/user/key’). The most general
priority is used first. If two keys share the most general priority,
then they’re compared with the next most general if applicable, and so on.

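A sketch of the probabilistic, priority-weighted choice among pending
requests, using hypothetical `request` and `priority` fields:

```go
import "math/rand"

// request is a stand-in for a queued operation and its effective priority.
type request struct {
    key      string
    priority float64 // larger means more likely to be scheduled next
}

// pickNext chooses one pending request with probability proportional
// to its priority, so higher-priority requests are favored without
// starving lower-priority ones.
func pickNext(pending []request, rng *rand.Rand) *request {
    if len(pending) == 0 {
        return nil
    }
    var total float64
    for _, r := range pending {
        total += r.priority
    }
    target := rng.Float64() * total
    for i := range pending {
        target -= pending[i].priority
        if target <= 0 {
            return &pending[i]
        }
    }
    return &pending[len(pending)-1] // guard against floating-point rounding
}
```
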
# Key-Value API

See the protobufs in [proto/](https://github.com/cockroachdb/cockroach/blob/master/proto),
in particular [proto/api.proto](https://github.com/cockroachdb/cockroach/blob/master/proto/api.proto) and the comments within.

# Structured Data API

A preliminary design can be found in the [Go source documentation](http://godoc.org/github.com/cockroachdb/cockroach/structured).