# About

This document is an updated version of the original design documents
by Spencer Kimball from early 2014.

# Overview

Cockroach is a distributed key:value datastore (SQL and structured
data layers of cockroach have yet to be defined) which supports **ACID
transactional semantics** and **versioned values** as first-class
features. The primary design goal is **global consistency and
survivability**, hence the name. Cockroach aims to tolerate disk,
machine, rack, and even **datacenter failures** with minimal latency
disruption and **no manual intervention**. Cockroach nodes are
symmetric; a design goal is **homogeneous deployment** (one binary) with
minimal configuration.

Cockroach implements a **single, monolithic sorted map** from key to
value where both keys and values are byte strings (not unicode).
Cockroach **scales linearly** (theoretically up to 4 exabytes (4E) of
logical data). The map is composed of one or more ranges and each range
is backed by data stored in [RocksDB](http://rocksdb.org/) (a
variant of LevelDB), and is replicated to a total of three or more
cockroach servers. Ranges are defined by start and end keys. Ranges are
merged and split to maintain total byte size within a globally
configurable min/max size interval. Range sizes default to target `64M` in
order to facilitate quick splits and merges and to distribute load at
hotspots within a key range. Range replicas are intended to be located
in disparate datacenters for survivability (e.g. `{ US-East, US-West,
Japan }`, `{ Ireland, US-East, US-West }`, `{ Ireland, US-East, US-West,
Japan, Australia }`).

Single mutations to ranges are mediated via an instance of a distributed
consensus algorithm to ensure consistency. We’ve chosen to use the
[Raft consensus algorithm](https://raftconsensus.github.io); all consensus
state is stored in RocksDB.

A single logical mutation may affect multiple key/value pairs. Logical
mutations have ACID transactional semantics. If all keys affected by a
logical mutation fall within the same range, atomicity and consistency
are guaranteed by Raft; this is the **fast commit path**. Otherwise, a
**non-locking distributed commit** protocol is employed between affected
ranges.

Cockroach provides [snapshot isolation](http://en.wikipedia.org/wiki/Snapshot_isolation) (SI) and
serializable snapshot isolation (SSI) semantics, allowing **externally
consistent, lock-free reads and writes**--both from a historical
snapshot timestamp and from the current wall clock time. SI provides
lock-free reads and writes but still allows write skew. SSI eliminates
write skew, but introduces a performance hit in the case of a
contentious system. SSI is the default isolation; clients must
consciously decide to trade correctness for performance. Cockroach
implements [a limited form of linearizability](#linearizability),
providing ordering for any observer or chain of observers.

Similar to
[Spanner](http://static.googleusercontent.com/media/research.google.com/en/us/archive/spanner-osdi2012.pdf)
directories, Cockroach allows configuration of arbitrary zones of data.
This allows replication factor, storage device type, and/or datacenter
location to be chosen to optimize performance and/or availability.
Unlike Spanner, zones are monolithic and don’t allow movement of fine
grained data on the level of entity groups.

# Architecture

Cockroach implements a layered architecture. The highest level of
abstraction is the SQL layer (currently unspecified in this document).
It depends directly on the [*structured data
API*](#structured-data-api), which provides familiar relational concepts
such as schemas, tables, columns, and indexes. The structured data API
in turn depends on the [distributed key value store](#key-value-api),
which handles the details of range addressing to provide the abstraction
of a single, monolithic key value store. The distributed KV store
communicates with any number of physical cockroach nodes. Each node
contains one or more stores, one per physical device.



Each store contains potentially many ranges, the lowest-level unit of
key-value data. Ranges are replicated using the Raft consensus protocol.
The diagram below is a blown up version of stores from four of the five
nodes in the previous diagram. Each range is replicated three ways using
Raft. The color coding shows associated range replicas.



Each physical node exports a RoachNode service. Each RoachNode exports
one or more key ranges. RoachNodes are symmetric. Each has the same
binary and assumes identical roles.

Nodes and the ranges they provide access to can be arranged with various
physical network topologies to make trade-offs between reliability and
performance. For example, a triplicated (3-way replica) range could have
each replica located on different:

- disks within a server to tolerate disk failures.
- servers within a rack to tolerate server failures.
- servers on different racks within a datacenter to tolerate rack power/network failures.
- servers in different datacenters to tolerate large scale network or power outages.

Up to `F` failures can be tolerated, where the total number of replicas `N = 2F + 1` (e.g. with 3x replication, one failure can be tolerated; with 5x replication, two failures, and so on).

# Cockroach Client

In order to support diverse client usage, Cockroach clients connect to
any node via HTTPS using protocol buffers or JSON. The connected node
proxies the client's work, including key lookups and write buffering.

# Keys

Cockroach keys are arbitrary byte arrays. If textual data is used in
keys, utf8 encoding is recommended (this helps for cleaner display of
values in debugging tools). User-supplied keys are encoded using an
ordered code. System keys are either prefixed with null characters (`\0`
or `\0\0`) for system tables, or take the form of
`<user-key><system-suffix>` to sort user-key-range specific system
keys immediately after the user keys they refer to. Null characters are
used in system key prefixes to guarantee that they sort first.

# Versioned Values

Cockroach maintains historical versions of values by storing them with
associated commit timestamps. Reads and scans can specify a snapshot
time to return the most recent writes prior to the snapshot timestamp.
Older versions of values are garbage collected by the system during
compaction according to a user-specified expiration interval. In order
to support long-running scans (e.g. for MapReduce), all versions have a
minimum expiration.

Versioned values are supported via modifications to RocksDB to record
commit timestamps and GC expirations per key.

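To make the versioned-value model concrete, the following is a minimal,
self-contained Go sketch of reading at a snapshot timestamp. It is not the
RocksDB-backed implementation; the `Version` type and the in-memory map are
illustrative assumptions only.

```go
package main

import "fmt"

// Version is one committed value of a key at a given commit timestamp.
type Version struct {
	WallTime int64 // commit timestamp (simplified to a wall-time integer)
	Value    string
}

// mvcc maps each key to its versions, ordered from newest to oldest.
var mvcc = map[string][]Version{
	"a": {{WallTime: 30, Value: "a@30"}, {WallTime: 10, Value: "a@10"}},
}

// readAt returns the most recent version of key written at or before the
// snapshot timestamp, mirroring how reads specify a snapshot time.
func readAt(key string, snapshot int64) (string, bool) {
	for _, v := range mvcc[key] {
		if v.WallTime <= snapshot {
			return v.Value, true
		}
	}
	return "", false // no version old enough; key did not exist at snapshot
}

func main() {
	if v, ok := readAt("a", 20); ok {
		fmt.Println(v) // prints "a@10": the newest version at or before t=20
	}
}
```
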
# Lock-Free Distributed Transactions

Cockroach provides distributed transactions without locks. Cockroach
transactions support two isolation levels:

- snapshot isolation (SI) and
- *serializable* snapshot isolation (SSI).

*SI* is simple to implement, highly performant, and correct for all but a
handful of anomalous conditions (e.g. write skew). *SSI* requires just a touch
more complexity, is still highly performant (less so with contention), and has
no anomalous conditions. Cockroach’s SSI implementation is based on ideas from
the literature and some possibly novel insights.

SSI is the default level, with SI provided for application developers
who are certain enough of their need for performance and the absence of
write skew conditions to consciously elect to use it. In a lightly
contended system, our implementation of SSI is just as performant as SI,
requiring no locking or additional writes. With contention, our
implementation of SSI still requires no locking, but will end up
aborting more transactions. Cockroach’s SI and SSI implementations
prevent starvation scenarios even for arbitrarily long transactions.

See the [Cahill paper](https://drive.google.com/file/d/0B9GCVTp_FHJIcEVyZVdDWEpYYXVVbFVDWElrYUV0NHFhU2Fv/edit?usp=sharing)
for one possible implementation of SSI. The [Calvin paper](http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf) is another great read.
For a discussion of SSI implemented by preventing read-write conflicts
(in contrast to detecting them, called write-snapshot isolation), see
the [Yabandeh paper](https://drive.google.com/file/d/0B9GCVTp_FHJIMjJ2U2t6aGpHLTFUVHFnMTRUbnBwc2pLa1RN/edit?usp=sharing),
which is the source of much inspiration for Cockroach’s SSI.

Each Cockroach transaction is assigned a random priority and a
"candidate timestamp" at start. The candidate timestamp is the
provisional timestamp at which the transaction will commit, and is
chosen as the current clock time of the node coordinating the
transaction. This means that a transaction without conflicts will
usually commit with a timestamp that, in absolute time, precedes the
actual work done by that transaction.

In the course of coordinating a transaction between one or more
distributed nodes, the candidate timestamp may be increased, but will
never be decreased. The core difference between the two isolation levels
SI and SSI is that the former allows the transaction's candidate
timestamp to increase and the latter does not.

**Hybrid Logical Clock (HLC)**

Each cockroach node maintains a hybrid logical clock (HLC), as discussed
in the [Hybrid Logical Clock paper](http://www.cse.buffalo.edu/tech-reports/2014-04.pdf).
HLC time uses timestamps which are composed of a physical component (thought of
as, and always close to, local wall time) and a logical component (used to
distinguish between events with the same physical component). It allows us to
track causality for related events similar to vector clocks, but with less
overhead. In practice, it works much like other logical clocks: when events
are received by a node, it informs the local HLC about the timestamp supplied
with the event by the sender, and when events are sent, a timestamp generated by
the local HLC is attached.

For a more in-depth description of HLC please read the paper. Our
implementation is [here](https://github.com/cockroachdb/cockroach/blob/master/util/hlc/hlc.go).

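As an illustration of the HLC update rules described above, here is a
simplified Go sketch. It is not the `util/hlc` implementation; the `Clock` and
`Timestamp` types are stand-ins and the wall-clock source is injected for
clarity.

```go
package main

import (
	"fmt"
	"time"
)

// Timestamp is an HLC time: a physical wall-time component plus a logical
// counter used to order events that share the same physical component.
type Timestamp struct {
	WallTime int64 // nanoseconds, always close to local wall time
	Logical  int32
}

// Clock is a minimal hybrid logical clock.
type Clock struct {
	physical func() int64 // wall clock source
	ts       Timestamp    // last timestamp handed out or observed
}

// Now returns a timestamp for a locally generated event (e.g. a send).
func (c *Clock) Now() Timestamp {
	if pt := c.physical(); pt > c.ts.WallTime {
		c.ts = Timestamp{WallTime: pt} // wall clock moved ahead: reset logical
	} else {
		c.ts.Logical++ // same physical component: bump logical counter
	}
	return c.ts
}

// Update merges a timestamp received with an event into the local HLC,
// ensuring the clock never runs behind anything it has observed.
func (c *Clock) Update(remote Timestamp) Timestamp {
	pt := c.physical()
	switch {
	case pt > c.ts.WallTime && pt > remote.WallTime:
		c.ts = Timestamp{WallTime: pt}
	case remote.WallTime > c.ts.WallTime:
		c.ts = Timestamp{WallTime: remote.WallTime, Logical: remote.Logical + 1}
	case c.ts.WallTime > remote.WallTime:
		c.ts.Logical++
	default: // equal physical components: take the larger logical counter
		if remote.Logical > c.ts.Logical {
			c.ts.Logical = remote.Logical
		}
		c.ts.Logical++
	}
	return c.ts
}

func main() {
	c := &Clock{physical: func() int64 { return time.Now().UnixNano() }}
	fmt.Println(c.Now())
	fmt.Println(c.Update(Timestamp{WallTime: time.Now().UnixNano() + 1e9, Logical: 4}))
}
```
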
Cockroach picks a Timestamp for a transaction using HLC time. Throughout this
document, *timestamp* always refers to the HLC time, which is a singleton
on each node. The HLC is updated by every read/write event on the node, and
the HLC time >= wall time. A read/write timestamp received in a request
from another node is not only used to version the operation, but also updates
the HLC on the node. This is useful in guaranteeing that all data read/written
on a node is at a timestamp < next HLC time.

Transactions are executed in two phases:

1. Start the transaction by writing a new entry to the system
   transaction table (keys prefixed by *\0tx*) with state “PENDING”. In
   parallel write an "intent" value for each datum being written as part
   of the transaction. These are normal MVCC values, with the addition of
   a special flag (i.e. “intent”) indicating that the value may be
   committed after the transaction itself commits. In addition,
   the transaction id (unique and chosen at tx start time by the client)
   is stored with intent values. The tx id is used to refer to the
   transaction table when there are conflicts and to make
   tie-breaking decisions on ordering between identical timestamps.
   Each node returns the timestamp used for the write (which is the
   original candidate timestamp in the absence of read/write conflicts);
   the client selects the maximum from amongst all write timestamps as the
   final commit timestamp.

2. Commit the transaction by updating its entry in the system
   transaction table (keys prefixed by *\0tx*). The value of the
   commit entry contains the candidate timestamp (increased as
   necessary to accommodate any latest read timestamps). Note that
   the transaction is considered fully committed at this point and
   control may be returned to the client.

   In the case of an SI transaction, a commit timestamp which was
   increased to accommodate concurrent readers is perfectly
   acceptable and the commit may continue. For SSI transactions,
   however, a gap between candidate and commit timestamps
   necessitates transaction restart (note: restart is different than
   abort--see below).

   After the transaction is committed, all written intents are upgraded
   in parallel by removing the “intent” flag. The transaction is
   considered fully committed before this step and does not wait for
   it to return control to the transaction coordinator.

In the absence of conflicts, this is the end. Nothing else is necessary
to ensure the correctness of the system.

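The two-phase flow can be sketched as follows. This is a toy Go illustration
under assumed types (`TxnRecord` and `Intent` are hypothetical shapes, not the
actual protobufs); it shows the SI case, where a pushed write timestamp simply
becomes the commit timestamp, whereas an SSI transaction would restart instead.

```go
package main

import "fmt"

// TxnRecord mirrors the system transaction table entry keyed by \0tx<id>.
type TxnRecord struct {
	ID       string
	Status   string // "PENDING", "COMMITTED", or "ABORTED"
	WallTime int64  // candidate, then final, commit timestamp
}

// Intent is a provisional MVCC value tagged with the writing transaction's id.
type Intent struct {
	Key      string
	Value    string
	TxnID    string
	WallTime int64 // timestamp actually used by the node for this write
}

// commit implements phase two: pick the maximum returned write timestamp as
// the final commit timestamp and flip the transaction record to COMMITTED.
// Intent resolution (dropping the intent flag) can then happen asynchronously.
func commit(rec *TxnRecord, intents []Intent) {
	for _, in := range intents {
		if in.WallTime > rec.WallTime {
			rec.WallTime = in.WallTime // pushed by a read/write conflict (fine under SI)
		}
	}
	rec.Status = "COMMITTED" // the transaction is considered fully committed here
}

func main() {
	rec := &TxnRecord{ID: "t1", Status: "PENDING", WallTime: 100}
	intents := []Intent{
		{Key: "a", Value: "1", TxnID: "t1", WallTime: 100},
		{Key: "b", Value: "2", TxnID: "t1", WallTime: 103}, // this write's timestamp was pushed
	}
	commit(rec, intents)
	fmt.Println(rec.Status, rec.WallTime) // COMMITTED 103
}
```
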
**Conflict Resolution**

Things get more interesting when a reader or writer encounters an intent
record or newly-committed value in a location that it needs to read or
write. This is a conflict, usually causing either of the transactions to
abort or restart depending on the type of conflict.

***Transaction restart:***

This is the usual (and more efficient) type of behaviour and is used
except when the transaction was aborted (for instance by another
transaction). In effect, that reduces to two cases; the first being the
one outlined above: an SSI transaction that finds upon attempting to
commit that its commit timestamp has been pushed. The second case
involves a transaction actively encountering a conflict, that is, one of
its readers or writers encountering data that necessitates conflict
resolution (see transaction interactions below).

When a transaction restarts, it changes its priority and/or moves its
timestamp forward depending on data tied to the conflict, and
begins anew reusing the same tx id. The prior run of the transaction might
have written some write intents, which need to be deleted before the
transaction commits, so as to not be included as part of the transaction.
These stale write intent deletions are done during the reexecution of the
transaction, either implicitly, through writing new intents to
the same keys as part of the reexecution of the transaction, or explicitly,
by cleaning up stale intents that are not part of the reexecution of the
transaction. Since most transactions will end up writing to the same keys,
the explicit cleanup run just before committing the transaction is usually
a NOOP.

***Transaction abort:***

This is the case in which a transaction, upon reading its transaction
table entry, finds that it has been aborted. In this case, the
transaction can not reuse its intents; it returns control to the client
before cleaning them up (other readers and writers would clean up
dangling intents as they encounter them) but will make an effort to
clean up after itself. The next attempt (if applicable) then runs as a
new transaction with **a new tx id**.

***Transaction interactions:***

There are several scenarios in which transactions interact:

- **Reader encounters write intent or value with newer timestamp far
  enough in the future**: This is not a conflict. The reader is free
  to proceed; after all, it will be reading an older version of the
  value and so does not conflict. Recall that the write intent may
  be committed with a later timestamp than its candidate; it will
  never commit with an earlier one. **Side note**: if a SI transaction
  reader finds an intent with a newer timestamp which the reader’s own
  transaction has written, the reader always returns that intent's value.

- **Reader encounters write intent or value with newer timestamp in the
  near future:** In this case, we have to be careful. The newer
  intent may, in absolute terms, have happened in our read's past if
  the clock of the writer is ahead of the node serving the values.
  In that case, we would need to take this value into account, but
  we just don't know. Hence the transaction restarts, using instead
  a future timestamp (but remembering a maximum timestamp used to
  limit the uncertainty window to the maximum clock skew). In fact,
  this is optimized further; see the details under "choosing a
  timestamp" below.

- **Reader encounters write intent with older timestamp**: the reader
  must follow the intent’s transaction id to the transaction table.
  If the transaction has already been committed, then the reader can
  just read the value. If the write transaction has not yet been
  committed, then the reader has two options. If the write conflict
  is from an SI transaction, the reader can *push that transaction's
  commit timestamp into the future* (and consequently not have to
  read it). This is simple to do: the reader just updates the
  transaction’s commit timestamp to indicate that when/if the
  transaction does commit, it should use a timestamp *at least* as
  high. However, if the write conflict is from an SSI transaction,
  the reader must compare priorities. If the reader has the higher priority,
  it pushes the transaction’s commit timestamp (that
  transaction will then notice its timestamp has been pushed when it
  attempts to commit, and restart). If the reader has the lower or equal
  priority, it retries itself using as a new priority
  *max(new random priority, conflicting txn’s priority - 1)*.

- **Writer encounters uncommitted write intent**: if the write
  intent has been written by a transaction with a lower
  priority, the writer aborts the conflicting transaction. If the write
  intent has a higher or equal priority the transaction retries, using as a new
  priority *max(new random priority, conflicting txn’s priority - 1)*;
  the retry occurs after a short, randomized backoff interval (see the
  sketch of this priority update after this list).

- **Writer encounters newer committed value**:
  The committed value could also be an unresolved write intent made by a
  transaction that has already committed. The transaction restarts. On restart,
  the same priority is reused, but the candidate timestamp is moved forward
  to the encountered value's timestamp.

- **Writer encounters more recently read key**:
  The *read timestamp cache* is consulted on each write at a node. If the write’s
  candidate timestamp is earlier than the low water mark on the cache itself
  (i.e. its last evicted timestamp) or if the key being written has a read
  timestamp later than the write’s candidate timestamp, this later timestamp
  value is returned with the write. A new timestamp forces a transaction
  restart only if it is serializable.
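A minimal sketch of the retry priority and backoff rule referenced above; the
function names are illustrative, not Cockroach APIs.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// retryPriority returns the priority a restarting transaction adopts after
// losing a conflict: max(new random priority, conflicting txn's priority - 1).
// This lets a retried transaction eventually win against the same opponent
// while avoiding starvation of long-running transactions.
func retryPriority(conflictingPriority int32) int32 {
	newPriority := rand.Int31()
	if p := conflictingPriority - 1; p > newPriority {
		return p
	}
	return newPriority
}

// backoff returns a short, randomized wait before the retry.
func backoff(base time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(base)))
}

func main() {
	fmt.Println(retryPriority(42))
	fmt.Println(backoff(50 * time.Millisecond))
}
```
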
**Transaction management**

Transactions are managed by the client proxy (or gateway in SQL Azure
parlance). Unlike in Spanner, writes are not buffered but are sent
directly to all implicated ranges. This allows the transaction to abort
quickly if it encounters a write conflict. The client proxy keeps track
of all written keys in order to resolve write intents asynchronously upon
transaction completion. If a transaction commits successfully, all intents
are upgraded to committed. In the event a transaction is aborted, all written
intents are deleted. The client proxy doesn’t guarantee it will resolve intents.

In the event the client proxy restarts before the pending transaction is
committed, the dangling transaction would continue to "live" in the
transaction table until aborted by another transaction. Transactions
heartbeat the transaction table every five seconds by default.
Transactions encountered by readers or writers with dangling intents
which haven’t been heartbeat within the required interval are aborted.

An exploration of retries with contention and abort times with abandoned
transactions is
[here](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing).

**Transaction Table**

Please see [roachpb/data.proto](https://github.com/cockroachdb/cockroach/blob/master/roachpb/data.proto) for the up-to-date structures, the best entry point being `message Transaction`.

**Pros**

- No requirement for reliable code execution to prevent stalled 2PC
  protocol.
- Readers never block with SI semantics; with SSI semantics, they may
  abort.
- Lower latency than traditional 2PC commit protocol (w/o contention)
  because second phase requires only a single write to the
  transaction table instead of a synchronous round to all
  transaction participants.
- Priorities avoid starvation for arbitrarily long transactions and
  always pick a winner from between contending transactions (no
  mutual aborts).
- Writes not buffered at client; writes fail fast.
- No read-locking overhead required for *serializable* SI (in contrast
  to other SSI implementations).
- Well-chosen (i.e. less random) priorities can flexibly give
  probabilistic guarantees on latency for arbitrary transactions
  (for example: make OLTP transactions 10x less likely to abort than
  low priority transactions, such as asynchronously scheduled jobs).

**Cons**

- Abandoned transactions may block contending writers for up to the
  heartbeat interval, though average wait is likely to be
  considerably shorter (see [graph in link](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing)).
  This is likely considerably more performant than detecting and
  restarting 2PC in order to release read and write locks.
- Behavior different than other SI implementations: no first writer
  wins, and shorter transactions do not always finish quickly.
  Element of surprise for OLTP systems may be a problematic factor.
- Aborts can decrease throughput in a contended system compared with
  two-phase locking. Aborts and retries increase read and write
  traffic, increase latency and decrease throughput.

**Choosing a Timestamp**

A key challenge of reading data in a distributed system with clock skew
is choosing a timestamp guaranteed to be greater than the latest
timestamp of any committed transaction (in absolute time). No system can
claim consistency and fail to read already-committed data.

Accomplishing consistency for transactions (or just single operations)
accessing a single node is easy. The timestamp is assigned by the node
itself, so it is guaranteed to be at a greater timestamp than all the
existing timestamped data on the node.

For multiple nodes, the timestamp of the node coordinating the
transaction `t` is used. In addition, a maximum timestamp `t+ε` is
supplied to provide an upper bound on timestamps for already-committed
data (`ε` is the maximum clock skew). As the transaction progresses, any
data read which have timestamps greater than `t` but less than `t+ε`
cause the transaction to abort and retry with the conflicting timestamp
t<sub>c</sub>, where t<sub>c</sub> \> t. The maximum timestamp `t+ε` remains
the same. This implies that transaction restarts due to clock uncertainty
can only happen on a time interval of length `ε`.

We apply another optimization to reduce the restarts caused by
uncertainty. Upon restart, the transaction not only takes
into account t<sub>c</sub>, but the timestamp of the node at the time
of the uncertain read t<sub>node</sub>. The larger of those two timestamps
t<sub>c</sub> and t<sub>node</sub> (likely equal to the latter) is used
to increase the read timestamp. Additionally, the conflicting node is
marked as “certain”. Then, for future reads to that node within the
transaction, we set `MaxTimestamp = Read Timestamp`, preventing further
uncertainty restarts.

Correctness follows from the fact that we know that at the time of the read,
there exists no version of any key on that node with a higher timestamp than
t<sub>node</sub>. Upon a restart caused by the node, if the transaction
encounters a key with a higher timestamp, it knows that in absolute time,
the value was written after t<sub>node</sub> was obtained, i.e. after the
uncertain read. Hence the transaction can move forward reading an older version
of the data (at the transaction's timestamp). This limits the time uncertainty
restarts attributed to a node to at most one. The tradeoff is that we might
pick a timestamp larger than the optimal one (> highest conflicting timestamp),
resulting in the possibility of a few more conflicts.

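The uncertainty window check can be summarized with a small sketch; the
timestamps are simplified to integers and the function is illustrative only.

```go
package main

import "fmt"

// uncertain reports whether a value encountered during a read forces an
// uncertainty restart: its timestamp is above the transaction's read
// timestamp but still within the interval bounded by MaxTimestamp = t + ε.
// Values above MaxTimestamp are certainly in the future and can be ignored.
func uncertain(valueTS, readTS, maxTS int64) bool {
	return valueTS > readTS && valueTS <= maxTS
}

func main() {
	const (
		readTS = 100 // transaction timestamp t
		maxTS  = 110 // t + ε, ε being the maximum clock skew
	)
	fmt.Println(uncertain(105, readTS, maxTS)) // true: restart with a higher timestamp
	fmt.Println(uncertain(120, readTS, maxTS)) // false: value is certainly in our future
	fmt.Println(uncertain(90, readTS, maxTS))  // false: value is visible to the read
}
```
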
We expect retries will be rare, but this assumption may need to be
revisited if retries become problematic. Note that this problem does not
apply to historical reads. An alternate approach which does not require
retries makes a round to all node participants in advance and
chooses the highest reported node wall time as the timestamp. However,
knowing which nodes will be accessed in advance is difficult and
potentially limiting. Cockroach could also potentially use a global
clock (Google did this with [Percolator](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf)),
which would be feasible for smaller, geographically-proximate clusters.

# Linearizability

First a word about [***Spanner***](http://research.google.com/archive/spanner.html).
By combining judicious use of wait intervals with accurate time signals,
Spanner provides a global ordering between any two non-overlapping transactions
(in absolute time) with \~14ms latencies. Put another way:
Spanner guarantees that if a transaction T<sub>1</sub> commits (in absolute time)
before another transaction T<sub>2</sub> starts, then T<sub>1</sub>'s assigned commit
timestamp is smaller than T<sub>2</sub>'s. Using atomic clocks and GPS receivers,
Spanner reduces its clock skew uncertainty to \< 10ms (`ε`). To make
good on the promised guarantee, transactions must take at least double
the clock skew uncertainty interval to commit (`2ε`). See [*this
article*](http://www.cs.cornell.edu/~ie53/publications/DC-col51-Sep13.pdf)
for a helpful overview of Spanner’s concurrency control.

Cockroach could make the same guarantees without specialized hardware,
at the expense of longer wait times. If servers in the cluster were
configured to work only with NTP, transaction wait times would likely
be in excess of 150ms. For wide-area zones, this would be somewhat
mitigated by overlap from cross-datacenter link latencies. If clocks
were made more accurate, the minimal limit for commit latencies would
improve.

However, let’s take a step back and evaluate whether Spanner’s external
consistency guarantee is worth the automatic commit wait. First, if the
commit wait is omitted completely, the system still yields a consistent
view of the map at an arbitrary timestamp. However with clock skew, it
would become possible for commit timestamps on non-overlapping but
causally related transactions to suffer temporal reversal. In other
words, the following scenario is possible for a client without global
ordering:

- Start transaction T<sub>1</sub> to modify value `x` with commit time s<sub>1</sub>

- On commit of T<sub>1</sub>, start T<sub>2</sub> to modify value `y` with commit time
  s<sub>2</sub>

- Read `x` and `y` and discover that s<sub>1</sub> \> s<sub>2</sub> (**!**)

The external consistency which Spanner guarantees is referred to as
**linearizability**. It goes beyond serializability by preserving
information about the causality inherent in how external processes
interacted with the database. The strength of Spanner’s guarantee can be
formulated as follows: any two processes, with clock skew within
expected bounds, may independently record their wall times for the
completion of transaction T<sub>1</sub> (T<sub>1</sub><sup>end</sup>) and start of transaction
T<sub>2</sub> (T<sub>2</sub><sup>start</sup>) respectively, and if later
compared such that T<sub>1</sub><sup>end</sup> \< T<sub>2</sub><sup>start</sup>,
then commit timestamps s<sub>1</sub> \< s<sub>2</sub>.
This guarantee is broad enough to completely cover all cases of explicit
causality, in addition to covering any and all imaginable scenarios of implicit
causality.

Our contention is that causality is chiefly important from the
perspective of a single client or a chain of successive clients (*if a
tree falls in the forest and nobody hears…*). As such, Cockroach
provides two mechanisms to provide linearizability for the vast majority
of use cases without a mandatory transaction commit wait or an elaborate
system to minimize clock skew.

1. Clients provide the highest transaction commit timestamp with
   successive transactions. This allows node clocks from previous
   transactions to effectively participate in the formulation of the
   commit timestamp for the current transaction. This guarantees
   linearizability for transactions committed by this client.

   Newly launched clients wait at least 2 \* ε from process start
   time before beginning their first transaction. This preserves the
   same property even on client restart, and the wait will be
   mitigated by process initialization.

   All causally-related events within Cockroach maintain
   linearizability.

2. Committed transactions respond with a commit wait parameter which
   represents the remaining time in the nominal commit wait. This
   will typically be less than the full commit wait as the consensus
   write at the coordinator accounts for a portion of it.

   Clients taking any action outside of another Cockroach transaction
   (e.g. writing to another distributed system component) can either
   choose to wait the remaining interval before proceeding, or
   alternatively, pass the wait and/or commit timestamp to the
   execution of the outside action for its consideration. This pushes
   the burden of linearizability to clients, but is a useful tool in
   mitigating commit latencies if the clock skew is potentially
   large. This functionality can be used for ordering in the face of
   backchannel dependencies as mentioned in the
   [AugmentedTime](http://www.cse.buffalo.edu/~demirbas/publications/augmentedTime.pdf)
   paper.

Using these mechanisms in place of commit wait, Cockroach’s guarantee can be
formulated as follows: any process which signals the start of transaction
T<sub>2</sub> (T<sub>2</sub><sup>start</sup>) after the completion of
transaction T<sub>1</sub> (T<sub>1</sub><sup>end</sup>), will have commit
timestamps such that s<sub>1</sub> \< s<sub>2</sub>.

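A sketch of how a client might consume the commit wait parameter from
mechanism 2 above; the `CommitResponse` shape is hypothetical and only
illustrates waiting out the remaining interval before external side effects.

```go
package main

import (
	"fmt"
	"time"
)

// CommitResponse is a hypothetical shape for a transaction commit result:
// the commit timestamp plus the portion of the nominal commit wait the
// client still has to sit out before acting outside of Cockroach.
type CommitResponse struct {
	CommitWallTime int64
	CommitWait     time.Duration // remaining wait; often near zero
}

// externalAction waits out the remaining commit wait before side effects
// (e.g. notifying another system), preserving ordering for observers that
// communicate through channels Cockroach cannot see.
func externalAction(resp CommitResponse, act func()) {
	time.Sleep(resp.CommitWait)
	act()
}

func main() {
	resp := CommitResponse{CommitWallTime: 12345, CommitWait: 3 * time.Millisecond}
	externalAction(resp, func() { fmt.Println("safe to act externally") })
}
```
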
# Logical Map Content

Logically, the map contains a series of reserved system key / value
pairs covering accounting, range metadata and node accounting
before the actual key / value pairs for non-system data
(e.g. the actual meat of the map).

- `\0\0meta1`: Range metadata for location of `\0\0meta2`.
- `\0\0meta1<key1>`: Range metadata for location of `\0\0meta2<key1>`.
- ...
- `\0\0meta1<keyN>`: Range metadata for location of `\0\0meta2<keyN>`.
- `\0\0meta2`: Range metadata for location of first non-range metadata key.
- `\0\0meta2<key1>`: Range metadata for location of `<key1>`.
- ...
- `\0\0meta2<keyN>`: Range metadata for location of `<keyN>`.
- `\0acct<key0>`: Accounting for key prefix key0.
- ...
- `\0acct<keyN>`: Accounting for key prefix keyN.
- `\0node<node-address0>`: Accounting data for node 0.
- ...
- `\0node<node-addressN>`: Accounting data for node N.
- `\0tree_root`: Range key for root of range-spanning tree.
- `\0tx<tx-id0>`: Transaction record for transaction 0.
- ...
- `\0tx<tx-idN>`: Transaction record for transaction N.
- `\0zone<key0>`: Zone information for key prefix key0.
- ...
- `\0zone<keyN>`: Zone information for key prefix keyN.
- `<>acctd<metric0>`: Accounting data for Metric 0 for empty key prefix.
- ...
- `<>acctd<metricN>`: Accounting data for Metric N for empty key prefix.
- `<key0>`: `<value0>` The first user data key.
- ...
- `<keyN>`: `<valueN>` The last user data key.

There are some additional system entries sprinkled amongst the
non-system keys. See the Key-Prefix Accounting section in this document
for further details.

# Node Storage

Nodes maintain a separate instance of RocksDB for each disk. Each
RocksDB instance hosts any number of ranges. RPCs arriving at a
RoachNode are multiplexed based on the disk name to the appropriate
RocksDB instance. A single instance per disk is used to avoid
contention. If every range maintained its own RocksDB, global management
of available cache memory would be impossible and writers for each range
would compete for non-contiguous writes to multiple RocksDB logs.

In addition to the key/value pairs of the range itself, various range
metadata is maintained.

- range-spanning tree node links

- participating replicas

- consensus metadata

- split/merge activity

A really good reference on tuning Linux installations with RocksDB is
[here](http://docs.basho.com/riak/latest/ops/advanced/backends/leveldb/).

# Range Metadata

The default approximate size of a range is 64M (2\^26 B). In order to
support 1P (2\^50 B) of logical data, metadata is needed for roughly
2\^(50 - 26) = 2\^24 ranges. A reasonable upper bound on range metadata
size is roughly 256 bytes (3\*12 bytes for the triplicated node locations
and 220 bytes for the range key itself). 2\^24 ranges \* 2\^8
B would require roughly 4G (2\^32 B) to store--too much to duplicate
between machines. Our conclusion is that range metadata must be
distributed for large installations.

To keep key lookups relatively fast in the presence of distributed metadata,
we store all the top-level metadata in a single range (the first range). These
top-level metadata keys are known as *meta1* keys, and are prefixed such that
they sort to the beginning of the key space. Using the metadata size of 256
bytes given above, a single 64M range would support 64M/256B = 2\^18 ranges,
which gives a total storage of 64M \* 2\^18 = 16T. To support the 1P quoted
above, we need two levels of indirection, where the first level addresses the
second, and the second addresses user data. With two levels of indirection, we
can address 2\^(18 + 18) = 2\^36 ranges; each range addresses 2\^26 B, and
altogether we address 2\^(36+26) B = 2\^62 B = 4E of user data.

For a given user-addressable `key1`, the associated *meta1* record is found
at the successor key to `key1` in the *meta1* space. Since the *meta1* space
is sparse, the successor key is defined as the next key which is present. The
*meta1* record identifies the range containing the *meta2* record, which is
found using the same process. The *meta2* record identifies the range
containing `key1`, which is again found the same way (see examples below).

Concretely, metadata keys are prefixed by `\0\0meta{1,2}`; the two null
characters provide for the desired sorting behaviour. Thus, `key1`'s
*meta1* record will reside at the successor key to `\0\0meta1<key1>`.

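A small Go sketch of the metadata key construction and the successor-key
(Ceil) lookup described above; the in-memory sorted slice stands in for the
actual meta ranges and the function names are illustrative only.

```go
package main

import (
	"fmt"
	"sort"
)

// metaKey prefixes a user key with \0\0meta1 or \0\0meta2 so that the
// addressing records sort at the beginning of the key space.
func metaKey(level int, userKey string) string {
	return fmt.Sprintf("\x00\x00meta%d%s", level, userKey)
}

// lookup finds the addressing record for a user key at one metadata level by
// taking the first metadata key at or after it in the sparse, sorted
// metadata space, matching the Seek()/Ceil() behaviour described above.
func lookup(sortedMetaKeys []string, level int, userKey string) (string, bool) {
	target := metaKey(level, userKey)
	i := sort.SearchStrings(sortedMetaKeys, target)
	if i == len(sortedMetaKeys) {
		return "", false
	}
	return sortedMetaKeys[i], true
}

func main() {
	// A tiny meta2 space: each record is stored at the end key of the range
	// it describes, so the successor record covers the key being looked up.
	meta2 := []string{
		metaKey(2, "lastkey0"),
		metaKey(2, "lastkey1"),
		metaKey(2, "\xff"),
	}
	sort.Strings(meta2)
	rec, _ := lookup(meta2, 2, "lastkey05")
	fmt.Printf("%q\n", rec) // "\x00\x00meta2lastkey1": the record covering "lastkey05"
}
```
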
Note: we append the end key of each range to meta{1,2} records because
the RocksDB iterator only supports a Seek() interface which acts as a
Ceil(). Using the start key of the range would cause Seek() to find the
key *after* the meta indexing record we’re looking for, which would
result in having to back the iterator up, an option which is both less
efficient and not available in all cases.

The following example shows the directory structure for a map with
three ranges worth of data. Ellipses indicate additional key/value pairs to
fill an entire range of data. Except for the fact that splitting ranges
requires updates to the range metadata with knowledge of the metadata layout,
the range metadata itself requires no special treatment or bootstrapping.

**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
`dcrama3:8000`)

- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2<lastkey0>`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2<lastkey1>`: `dcrama4:8000`, `dcrama5:8000`, `dcrama6:8000`
- `\0\0meta2\xff`: `dcrama7:8000`, `dcrama8:8000`, `dcrama9:8000`
- ...
- `<lastkey0>`: `<lastvalue0>`

**Range 1** (located on servers `dcrama4:8000`, `dcrama5:8000`,
`dcrama6:8000`)

- ...
- `<lastkey1>`: `<lastvalue1>`

**Range 2** (located on servers `dcrama7:8000`, `dcrama8:8000`,
`dcrama9:8000`)

- ...
- `<lastkey2>`: `<lastvalue2>`

Consider a simpler example of a map containing less than a single
range of data. In this case, all range metadata and all data are
located in the same range:

**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
`dcrama3:8000`)

- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `<key0>`: `<value0>`
- `...`

Finally, a map large enough to need both levels of indirection would
look like (note that instead of showing range replicas, this
example is simplified to just show range indexes):

**Range 0**

- `\0\0meta1<lastkeyN-1>`: Range 0
- `\0\0meta1\xff`: Range 1
- `\0\0meta2<lastkey1>`: Range 1
- `\0\0meta2<lastkey2>`: Range 2
- `\0\0meta2<lastkey3>`: Range 3
- ...
- `\0\0meta2<lastkeyN-1>`: Range 262143

**Range 1**

- `\0\0meta2<lastkeyN>`: Range 262144
- `\0\0meta2<lastkeyN+1>`: Range 262145
- ...
- `\0\0meta2\xff`: Range 500,000
- ...
- `<lastkey1>`: `<lastvalue1>`

**Range 2**

- ...
- `<lastkey2>`: `<lastvalue2>`

**Range 3**

- ...
- `<lastkey3>`: `<lastvalue3>`

**Range 262144**

- ...
- `<lastkeyN>`: `<lastvalueN>`

**Range 262145**

- ...
- `<lastkeyN+1>`: `<lastvalueN+1>`

Note that the choice of range `262144` is just an approximation. The
actual number of ranges addressable via a single metadata range is
dependent on the size of the keys. If efforts are made to keep key sizes
small, the total number of addressable ranges would increase and vice
versa.

From the examples above it’s clear that key location lookups require at
most three reads to get the value for `<key>`:

1. lower bound of `\0\0meta1<key>`
2. lower bound of `\0\0meta2<key>`
3. `<key>`

For small maps, the entire lookup is satisfied in a single RPC to Range 0. Maps
containing less than 16T of data would require two lookups. Clients cache both
levels of range metadata, and we expect that data locality for individual
clients will be high. Clients may end up with stale cache entries. If on a
lookup, the range consulted does not match the client’s expectations, the
client evicts the stale entries and possibly does a new lookup.

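A sketch of the client-side caching and eviction behaviour, with a
hypothetical `rangeCache` type standing in for the real metadata cache.

```go
package main

import (
	"fmt"
	"sort"
)

// rangeCache is a client-side cache of meta2 records: the sorted end keys of
// known ranges and the replica addresses responsible for each of them.
type rangeCache struct {
	endKeys  []string            // sorted
	replicas map[string][]string // end key -> replica addresses
}

// lookup returns the cached replicas for the range believed to contain key.
func (c *rangeCache) lookup(key string) ([]string, bool) {
	i := sort.SearchStrings(c.endKeys, key)
	if i == len(c.endKeys) {
		return nil, false // cache miss: fall back to a meta2 lookup RPC
	}
	return c.replicas[c.endKeys[i]], true
}

// evict drops a stale entry after the contacted range reports that it is no
// longer responsible for the key, forcing a fresh metadata lookup.
func (c *rangeCache) evict(endKey string) {
	delete(c.replicas, endKey)
	i := sort.SearchStrings(c.endKeys, endKey)
	if i < len(c.endKeys) && c.endKeys[i] == endKey {
		c.endKeys = append(c.endKeys[:i], c.endKeys[i+1:]...)
	}
}

func main() {
	c := &rangeCache{
		endKeys:  []string{"lastkey0", "lastkey1"},
		replicas: map[string][]string{"lastkey0": {"dcrama1:8000"}, "lastkey1": {"dcrama4:8000"}},
	}
	fmt.Println(c.lookup("apple")) // [dcrama1:8000] true
	c.evict("lastkey0")
	fmt.Println(c.lookup("apple")) // [dcrama4:8000] true
}
```
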
# Raft - Consistency of Range Replicas

Each range is configured to consist of three or more replicas, as specified by
their ZoneConfig. The replicas in a range maintain their own instance of a
distributed consensus algorithm. We use the [*Raft consensus algorithm*](https://raftconsensus.github.io)
as it is simpler to reason about and includes a reference implementation
covering important details.
[ePaxos](https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf) has
promising performance characteristics for WAN-distributed replicas, but
it does not guarantee a consistent ordering between replicas.

Raft elects a relatively long-lived leader which must be involved to
propose commands. It heartbeats followers periodically and keeps their logs
replicated. In the absence of heartbeats, followers become candidates
after randomized election timeouts and proceed to hold new leader
elections. Cockroach weights random timeouts such that the replicas with
shorter round trip times to peers are more likely to hold elections
first (not implemented yet). Only the Raft leader may propose commands;
followers will simply relay commands to the last known leader.

Our Raft implementation was developed together with CoreOS, but adds an extra
layer of optimization to account for the fact that a single Node may have
millions of consensus groups (one for each Range). Areas of optimization
are chiefly coalesced heartbeats (so that the number of nodes dictates the
number of heartbeats as opposed to the much larger number of ranges) and
batch processing of requests.
Future optimizations may include two-phase elections and quiescent ranges
(i.e. stopping traffic completely for inactive ranges).

# Range Leases

As outlined in the Raft section, the replicas of a Range are organized as a
Raft group and execute commands from their shared commit log. Going through
Raft is an expensive operation though, and there are tasks which should only be
carried out by a single replica at a time (as opposed to all of them).

For these reasons, Cockroach introduces the concept of **Range Leases**:
This is a lease held for a slice of (database, i.e. hybrid logical) time and is
established by committing a special log entry through Raft containing the
interval the lease is going to be active on, along with the Node:RaftID
combination that uniquely describes the requesting replica. Reads and writes
must generally be addressed to the replica holding the lease; if none does, any
replica may be addressed, causing it to try to obtain the lease synchronously.
Requests received by a non-lease holder (for the HLC timestamp specified
in the request's header) fail with an error pointing at the replica's last known
lease holder. These requests are retried transparently with the updated lease by the
gateway node and never reach the client.

The replica holding the lease is in charge of or involved in handling
Range-specific maintenance tasks such as

* gossiping the sentinel and/or first range information
* splitting, merging and rebalancing

and, very importantly, may satisfy reads locally, without incurring the
overhead of going through Raft.

Since reads bypass Raft, a new lease holder will, among other things, ascertain
that its timestamp cache does not report timestamps smaller than the previous
lease holder's (so that it's compatible with reads which may have occurred on
the former lease holder). This is accomplished by setting the low water mark of the
timestamp cache to the expiration of the previous lease plus the maximum clock
offset.

## Relationship to Raft leadership

The range lease is completely separate from Raft leadership, and so without
further efforts, Raft leadership and the Range lease may not be represented by the same
replica most of the time. This is convenient semantically since it decouples
these two types of leadership and allows the use of Raft as a "black box", but
for reasons of performance, it is desirable to have both on the same replica.
Otherwise, sending a command through Raft always incurs the overhead of being
proposed to the Range lease holder's Raft instance first, which must relay it to the
Raft leader, adding an unnecessary network hop.

We therefore want to make sure that the
Range lease and Raft leadership coincide. A fairly easy method for achieving this is
to have each new lease period (extension or new) be accompanied by a
stipulation to the lease holder's replica to start Raft elections (unless it's
already the Raft leader). For this to work well, leases need to be
relatively stable and long-lived to avoid a large number of Raft leadership
transitions.

# Command Execution Flow

This section describes how a lease holder replica processes a read/write
command in more detail. Each command specifies (1) a key (or a range
of keys) that the command accesses and (2) the ID of a range which the
key(s) belongs to. When receiving a command, a RoachNode looks up a
range by the specified Range ID and checks if the range is still
responsible for the supplied keys. If any of the keys do not belong to the
range, the RoachNode returns an error so that the client will retry
and send a request to a correct range.

When all the keys belong to the range, the RoachNode attempts to
process the command. If the command is an inconsistent read-only
command, it is processed immediately. If the command is a consistent
read or a write, the command is executed when both of the following
conditions hold:

- The range replica has a range lease.
- There are no other running commands whose keys overlap with
  the submitted command and cause read/write conflict.

When the first condition is not met, the replica attempts to acquire
a lease or returns an error so that the client will redirect the
command to the lease holder. When the second condition is not met,
the command waits in a queue until the overlapping commands are finished.

When the above two conditions are met, the lease holder replica processes the
command. Consistent reads are processed on the lease holder immediately.
Write commands are committed into the Raft log so that every replica
will execute the same commands. All commands produce deterministic
results so that the range replicas keep consistent states among them.

When a write command completes, all the replicas update their response
cache to ensure idempotency. When a read command completes, the lease holder
replica updates its timestamp cache to keep track of the latest read
for a given key.

There is a chance that a range lease expires while a command is being
executed. Before executing a command, each replica checks if the replica
proposing the command still has the lease. When the lease has
expired, the command will be rejected by the replica.


# Splitting / Merging Ranges

RoachNodes split or merge ranges based on whether they exceed maximum or
minimum thresholds for capacity or load. Ranges exceeding maximums for
either capacity or load are split; ranges below minimums for *both*
capacity and load are merged.

Ranges maintain the same accounting statistics as accounting key
prefixes. These boil down to a time series of data points with minute
granularity, covering everything from number of bytes to read/write queue
sizes. Arbitrary distillations of the accounting stats can be determined as the
basis for splitting / merging. The initial metrics used to determine
split/merge are range size in bytes and IOps. A good metric for
rebalancing a replica from one node to another would be total read/write
queue wait times. These metrics are gossipped, with each range / node
passing along relevant metrics if they’re in the bottom or top of the
range it’s aware of.

A range finding itself exceeding either capacity or load threshold
splits. To this end, the range lease holder computes an appropriate split key
candidate and issues the split through Raft. In contrast to splitting,
merging requires a range to be below the minimum threshold for both
capacity *and* load. A range being merged chooses the smaller of the
ranges immediately preceding and succeeding it.

Splitting, merging, rebalancing and recovering all follow the same basic
algorithm for moving data between roach nodes. New target replicas are
created and added to the replica set of the source range. Then each new
replica is brought up to date by either replaying the log in full or
copying a snapshot of the source replica data and then replaying the log
from the timestamp of the snapshot to catch up fully. Once the new
replicas are fully up to date, the range metadata is updated and the old,
source replica(s) deleted if applicable.

940
```
941
if splitting
942
SplitRange(split_key): splits happen locally on range replicas and
943
only after being completed locally, are moved to new target replicas.
944
else if merging
945
Choose new replicas on same servers as target range replicas;
946
add to replica set.
947
else if rebalancing || recovering
948
Choose new replica(s) on least loaded servers; add to replica set.
949
```
950
951
**New Replica**
952
953
*Bring replica up to date:*
954
955
```
956
if all info can be read from replicated log
957
copy replicated log
958
else
959
snapshot source replica
960
send successive ReadRange requests to source replica
961
referencing snapshot
962
963
if merging
964
combine ranges on all replicas
965
else if rebalancing || recovering
966
remove old range replica(s)
967
```
968
969
RoachNodes split ranges when the total data in a range exceeds a
970
configurable maximum threshold. Similarly, ranges are merged when the
971
total data falls below a configurable minimum threshold.
972
973
**TBD: flesh this out**: Especially for merges (but also rebalancing) we have a
974
range disappearing from the local node; that range needs to disappear
975
gracefully, with a smooth handoff of operation to the new owner of its data.
976
977
Ranges are rebalanced if a node determines its load or capacity is one
978
of the worst in the cluster based on gossipped load stats. A node with
979
spare capacity is chosen in the same datacenter and a special-case split
980
is done which simply duplicates the data 1:1 and resets the range
981
configuration metadata.
982
983
# Range-Spanning Binary Tree
984
985
A crucial enhancement to the organization of range metadata is to
986
augment the bi-level range metadata lookup with a minimum spanning tree,
987
implemented as a left-leaning red-black tree over all ranges in the map.
988
This tree structure allows the system to start at any key prefix and
989
efficiently traverse an arbitrary key range with minimal RPC traffic,
990
minimal fan-in and fan-out, and with bounded time complexity equal to
991
`2*log N` steps, where `N` is the total number of ranges in the system.
992
993
Unlike the range metadata rows prefixed with `\0\0meta[1|2]`, the
994
metadata for the range-spanning tree (e.g. parent range and left / right
995
child ranges) is stored directly at the ranges as non-map metadata. The
996
metadata for each node of the tree (e.g. links to parent range, left
997
child range, and right child range) is stored with the range metadata.
998
In effect, the tree metadata is stored implicitly. In order to traverse
999
the tree, for example, you’d need to query each range in turn for its
1000
metadata.
1001
1002
Any time a range is split or merged, both the bi-level range lookup
1003
metadata and the per-range binary tree metadata are updated as part of
1004
the same distributed transaction. The total number of nodes involved in
1005
the update is bounded by 2 + log N (i.e. 2 updates for meta1 and
1006
meta2, and up to log N updates to balance the range-spanning tree).
1007
The range corresponding to the root node of the tree is stored in
1008
*\0tree_root*.
1009
1010
As an example, consider the following set of nine ranges and their
1011
associated range-spanning tree:
1012
1013
R0: `aa - cc`, R1: `*cc - lll`, R2: `*lll - llr`, R3: `*llr - nn`, R4: `*nn - rr`, R5: `*rr - ssss`, R6: `*ssss - sst`, R7: `*sst - vvv`, R8: `*vvv - zzzz`.
1014
1015

1016
1017
The range-spanning tree has many beneficial uses in Cockroach. It
1018
provides a ready made solution to scheduling mappers and sorting /
1019
reducing during map-reduce operations. It also provides a mechanism
1020
for visiting every Raft replica range which comprises a logical key
1021
range. This is used to periodically find the oldest extant write
1022
intent over the entire system.
1023
1024
The range-spanning tree provides a convenient mechanism for planning
1025
and executing parallel queries. These provide the basis for
1026
[Dremel](http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/36632.pdf)-like
1027
query execution trees and it’s easy to imagine supporting a subset of
1028
SQL or even javascript-based user functions for complex data analysis
1029
tasks.
1030
1031
1032
# Node Allocation (via Gossip)

New nodes must be allocated when a range is split. Instead of requiring
every RoachNode to know about the status of all or even a large number
of peer nodes --or-- alternatively requiring a specialized curator or
master with sufficiently global knowledge, we use a gossip protocol to
efficiently communicate only interesting information between all of the
nodes in the cluster. What’s interesting information? One example would
be whether a particular node has a lot of spare capacity. Each node,
when gossiping, compares each topic of gossip to its own state. If its
own state is somehow “more interesting” than the least interesting item
in the topic it’s seen recently, it includes its own state as part of
the next gossip session with a peer node. In this way, a node with
capacity sufficiently in excess of the mean quickly becomes discovered
by the entire cluster. To avoid piling onto outliers, nodes from the
high capacity set are selected at random for allocation.

The gossip protocol itself contains two primary components:

- **Peer Selection**: each node maintains up to N peers with which it
  regularly communicates. It selects peers with an eye towards
  maximizing fanout. A peer node which itself communicates with an
  array of otherwise unknown nodes will be selected over one which
  communicates with a set containing significant overlap. Each time
  gossip is initiated, each node’s set of peers is exchanged. Each
  node is then free to incorporate the other’s peers as it sees fit.
  To avoid any node suffering from excess incoming requests, a node
  may refuse to answer a gossip exchange. Each node is biased
  towards answering requests from nodes without significant overlap
  and refusing requests otherwise.

  Peers are efficiently selected using a heuristic as described in
  [Agarwal & Trachtenberg (2006)](https://drive.google.com/file/d/0B9GCVTp_FHJISmFRTThkOEZSM1U/edit?usp=sharing).

  **TBD**: how to avoid partitions? Need to work out a simulation of
  the protocol to tune the behavior and see empirically how well it
  works.

- **Gossip Selection**: what to communicate. Gossip is divided into
  topics. Load characteristics (capacity per disk, cpu load, and
  state [e.g. draining, ok, failure]) are used to drive node
  allocation. Range statistics (range read/write load, missing
  replicas, unavailable ranges) and network topology (inter-rack
  bandwidth/latency, inter-datacenter bandwidth/latency, subnet
  outages) are used for determining when to split ranges, when to
  recover replicas vs. wait for network connectivity, and for
  debugging / sysops. In all cases, a set of minimums and a set of
  maximums is propagated; each node applies its own view of the
  world to augment the values. Each minimum and maximum value is
  tagged with the reporting node and other accompanying contextual
  information. Each topic of gossip has its own protobuf to hold the
  structured data. The number of items of gossip in each topic is
  limited by a configurable bound.

  For efficiency, nodes assign each new item of gossip a sequence
  number and keep track of the highest sequence number each peer
  node has seen. Each round of gossip communicates only the delta
  containing new items.

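A sketch of the sequence-number delta mechanism; the `info` and `delta` names
are illustrative, not the actual gossip implementation.

```go
package main

import "fmt"

// info is a single gossiped item within some topic.
type info struct {
	Key string
	Seq int64 // sequence number assigned when the item was added locally
}

// delta returns only the items a peer has not seen yet, given the highest
// sequence number that peer is known to have received from us.
func delta(infos []info, peerHighWater int64) []info {
	var out []info
	for _, i := range infos {
		if i.Seq > peerHighWater {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	infos := []info{{"capacity:node1", 4}, {"capacity:node2", 7}, {"ranges:missing", 9}}
	fmt.Println(delta(infos, 7)) // only the item with Seq 9 is sent this round
}
```
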
# Node Accounting

The gossip protocol discussed in the previous section is useful to
quickly communicate fragments of important information in a
decentralized manner. However, complete accounting for each node is also
stored to a central location, available to any dashboard process. This
is done using the map itself. Each node periodically writes its state to
the map with keys prefixed by `\0node`, similar to the first level of
range metadata, but with a ‘`node`’ suffix. Each value is a protobuf
containing the full complement of node statistics--everything
communicated normally via the gossip protocol plus other useful, but
non-critical data.

The range containing the first key in the node accounting table is
responsible for gossiping the total count of nodes. This total count is
used by the gossip network to most efficiently organize itself. In
particular, the maximum number of hops for gossipped information to take
before reaching a node is given by `ceil(log(node count) / log(max
fanout)) + 1`.

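For example, assuming the formula above (a worked calculation, not a measured
value):

```go
package main

import (
	"fmt"
	"math"
)

// maxHops computes the bound quoted above:
// ceil(log(node count) / log(max fanout)) + 1.
func maxHops(nodeCount, maxFanout float64) int {
	return int(math.Ceil(math.Log(nodeCount)/math.Log(maxFanout))) + 1
}

func main() {
	// With 2000 nodes and a fanout of 10, gossip reaches every node
	// within ceil(3.3) + 1 = 5 hops.
	fmt.Println(maxHops(2000, 10))
}
```
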
# Key-Prefix Accounting and Zones

Arbitrarily fine-grained accounting is specified via
key prefixes. Key prefixes can overlap, as is necessary for capturing
hierarchical relationships. For illustrative purposes, let’s say keys
specifying rows in a set of databases have the following format:

`<db>:<table>:<primary-key>[:<secondary-key>]`

Accounting could be kept for the following
key prefixes:

`db1`, `db1:user`, `db1:order`

Accounting is kept for the entire map by default.

1128
## Accounting
1129
to keep accounting for a range defined by a key prefix, an entry is created in
1130
the accounting system table. The format of accounting table keys is:
1131
1132
`\0acct<key-prefix>`
1133
1134
In practice, we assume each RoachNode capable of caching the
1135
entire accounting table as it is likely to be relatively small.
1136
1137
Accounting is kept for key prefix ranges with eventual consistency for
1138
efficiency. There are two types of values which comprise accounting:
1139
counts and occurrences, for lack of better terms. Counts describe
1140
system state, such as the total number of bytes, rows,
1141
etc. Occurrences include transient performance and load metrics. Both
1142
types of accounting are captured as time series with minute
1143
granularity. The length of time accounting metrics are kept is
1144
configurable. Below are examples of each type of accounting value.
1145
1146
**System State Counters/Performance**
1147
1148
- Count of items (e.g. rows)
1149
- Total bytes
1150
- Total key bytes
1151
- Total value length
1152
- Queued message count
1153
- Queued message total bytes
1154
- Count of values \< 16B
1155
- Count of values \< 64B
1156
- Count of values \< 256B
1157
- Count of values \< 1K
1158
- Count of values \< 4K
1159
- Count of values \< 16K
1160
- Count of values \< 64K
1161
- Count of values \< 256K
1162
- Count of values \< 1M
1163
- Count of values \> 1M
1164
- Total bytes of accounting
1165
1166
1167
**Load Occurrences**
1168
1169
- Get op count
1170
- Get total MB
1171
- Put op count
1172
- Put total MB
1173
- Delete op count
1174
- Delete total MB
1175
- Delete range op count
1176
- Delete range total MB
1177
- Scan op count
1178
- Scan op MB
1179
- Split count
1180
- Merge count
1181
1182
Because accounting information is kept as time series and over many
possible metrics of interest, the data can become numerous. Accounting
data are stored in the map near the key prefix described, in order to
distribute load (for both aggregation and storage).

Accounting keys for system state have the form:
`<key-prefix>|acctd<metric-name>*`. Notice the leading ‘pipe’
character. It’s meant to sort the root-level accounting entries AFTER any other
system tables. They must increment the same underlying values as they
are permanent counts, and not transient activity. Logic at the
RoachNode takes care of snapshotting the value into an appropriately
suffixed (e.g. with timestamp hour) multi-value time series entry.

Keys for perf/load metrics have the form:
`<key-prefix>acctd<metric-name><hourly-timestamp>`.

`<hourly-timestamp>`-suffixed accounting entries are multi-valued,
containing a varint64 entry for each minute with activity during the
specified hour.

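The following illustrative Go sketch composes both kinds of keys and packs per-minute counts as varints; the helper names and the hourly-timestamp encoding are assumptions for the example, not the production format.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"time"
)

// counterKey builds a permanent system-state counter key:
// <key-prefix> + "|acctd" + <metric-name>. The leading pipe sorts these
// entries after other system tables under the same prefix.
func counterKey(keyPrefix, metric string) []byte {
	return []byte(keyPrefix + "|acctd" + metric)
}

// loadMetricKey builds an hourly time-series key for a perf/load metric:
// <key-prefix> + "acctd" + <metric-name> + <hourly-timestamp>.
// The "YYYYMMDDHH" rendering of the hour is an assumption for the example.
func loadMetricKey(keyPrefix, metric string, t time.Time) []byte {
	hour := t.UTC().Truncate(time.Hour).Format("2006010215")
	return []byte(keyPrefix + "acctd" + metric + hour)
}

// encodeMinuteCounts packs one varint64 per active minute of the hour,
// mirroring the multi-valued entry format described above.
func encodeMinuteCounts(counts []int64) []byte {
	buf := make([]byte, 0, len(counts)*binary.MaxVarintLen64)
	for _, c := range counts {
		buf = binary.AppendVarint(buf, c)
	}
	return buf
}

func main() {
	now := time.Date(2014, 7, 1, 13, 45, 0, 0, time.UTC)
	fmt.Printf("%q\n", counterKey("db1:user", "total-bytes"))
	fmt.Printf("%q\n", loadMetricKey("db1:user", "get-op-count", now))
	fmt.Printf("%v\n", encodeMinuteCounts([]int64{3, 0, 7}))
}
```
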
To efficiently keep accounting over large key ranges, the task of
aggregation must be distributed. If activity occurs within the same
range as the key prefix for accounting, the updates are made as part
of the consensus write. If the ranges differ, then a message is sent
to the parent range to increment the accounting. If, upon receiving the
message, the parent range also does not include the key prefix, it in
turn forwards it to its parent or left child in the balanced binary
tree which is maintained to describe the range hierarchy. This limits
the number of messages before an update is visible at the root to `2*log N`,
where `N` is the number of ranges in the key prefix.

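As a concrete reading of that bound, the toy Go function below computes the worst-case message count; it assumes a base-2 logarithm, which the document does not pin down.

```go
package main

import (
	"fmt"
	"math"
)

// maxAccountingMessages returns the 2*log N bound described above: the
// worst-case number of increment messages forwarded through the
// balanced binary tree of ranges before an update is visible at the
// root, where n is the number of ranges under the accounting prefix.
func maxAccountingMessages(n int) int {
	if n <= 1 {
		return 0
	}
	return int(2 * math.Ceil(math.Log2(float64(n))))
}

func main() {
	for _, n := range []int{1, 2, 64, 1000} {
		fmt.Printf("ranges=%4d  worst-case messages=%d\n", n, maxAccountingMessages(n))
	}
}
```
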
## Zones
Zones are stored in the map with keys prefixed by
`\0zone` followed by the key prefix to which the zone
configuration applies. Zone values specify a protobuf containing
the datacenters from which replicas for ranges which fall under
the zone must be chosen.

Please see [config/config.proto](https://github.com/cockroachdb/cockroach/blob/master/config/config.proto) for up-to-date data structures used, the best entry point being `message ZoneConfig`.

If zones are modified in situ, each RoachNode verifies the
existing zones for its ranges against the zone configuration. If
it discovers differences, it reconfigures ranges in the same way
that it rebalances away from busy nodes, via a special-case 1:1
split to a duplicate range comprising the new configuration.

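As a rough sketch of this layout, the Go snippet below builds a `\0zone<key-prefix>` key and a simplified zone value; the authoritative definition is `message ZoneConfig` in config/config.proto, and the struct, field names, and sizes here are illustrative stand-ins only.

```go
package main

import "fmt"

// zoneConfig is a simplified stand-in for the ZoneConfig protobuf: it
// lists the datacenters from which replicas for ranges under the
// zone's key prefix must be chosen (fields are illustrative only).
type zoneConfig struct {
	ReplicaDatacenters []string
	RangeMinBytes      int64
	RangeMaxBytes      int64
}

// zoneKey returns the key under which a zone config is stored:
// "\x00zone" followed by the key prefix the config applies to.
func zoneKey(keyPrefix string) []byte {
	return append([]byte("\x00zone"), keyPrefix...)
}

func main() {
	cfg := zoneConfig{
		ReplicaDatacenters: []string{"us-east", "us-west", "japan"},
		RangeMinBytes:      32 << 20, // illustrative values only
		RangeMaxBytes:      64 << 20,
	}
	fmt.Printf("%q -> %+v\n", zoneKey("db1"), cfg)
}
```
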
# Key-Value API

See the protobufs in [roachpb/](https://github.com/cockroachdb/cockroach/blob/master/roachpb),
in particular [roachpb/api.proto](https://github.com/cockroachdb/cockroach/blob/master/roachpb/api.proto) and the comments within.

# Structured Data API

A preliminary design can be found in the [Go source documentation](https://godoc.org/github.com/cockroachdb/cockroach/sql).

# Appendix

## Datastore Goal Articulation

There are other important axes involved in data-stores which are less
well understood and/or explained. There is lots of cross-dependency,
but it's safe to segregate two more of them as (a) scan efficiency,
and (b) read vs. write optimization.

### Datastore Scan Efficiency Spectrum

Scan efficiency refers to the number of IO ops required to scan a set
of sorted adjacent rows matching a criterion. However, it's a
complicated topic, because of the options (or lack of options) for
controlling physical order in different systems.

* Some designs either default to or only support "heap organized"
physical records (Oracle, MySQL, Postgres, SQLite, MongoDB). In this
design, a naive sorted-scan of an index involves one IO op per
record.
* In these systems it's possible to "fully cover" a sorted-query in an
index with some write-amplification.
* In some systems it's possible to put the primary record data in a
sorted btree instead of a heap-table (default in MySQL/Innodb,
option in Oracle).
* Sorted-order LSM NoSQL could be considered index-organized tables,
with efficient scans by the row-key (HBase).
* Some NoSQL is not optimized for sorted-order retrieval, because of
hash-bucketing, primarily based on the Dynamo design (Cassandra,
Riak).



### Read vs. Write Optimization Spectrum

Read vs. write optimization is a product of the underlying sorted-order
data-structure used. Btrees are read-optimized. Hybrid write-deferred
trees are a balance of read-and-write optimizations (shuttle-trees,
fractal-trees, stratified-trees). LSM separates write-incorporation
into a separate step, offering a tunable amount of read-to-write
optimization. An "ideal" LSM at 0%-write-incorporation is a log, and
at 100%-write-incorporation is a btree.

The topic of LSM is confused by the fact that LSM is not an algorithm,
but a design pattern, and usage of LSM is hindered by the lack of a
de-facto optimal LSM design. LevelDB/RocksDB is one of the more
practical LSM implementations, but it is far from optimal. Popular
text-indices like Lucene are non-general purpose instances of
write-optimized LSM.

Further, there is a dependency between access pattern
(read-modify-write vs. blind-write and write-fraction), cache-hitrate,
and ideal sorted-order algorithm selection. At a certain
write-fraction and read-cache-hitrate, systems achieve higher total
throughput with write-optimized designs, at the cost of increased
worst-case read latency. As either write-fraction or
read-cache-hitrate approaches 1.0, write-optimized designs provide
dramatically better sustained system throughput when record-sizes are
small relative to IO sizes.

Given this information, data-stores can be sliced by their
sorted-order storage algorithm selection. Btree stores are
read-optimized (Oracle, SQLServer, Postgres, SQLite2, MySQL, MongoDB,
CouchDB), hybrid stores are read-optimized with better
write-throughput (Tokutek MySQL/MongoDB), while LSM-variants are
write-optimized (HBase, Cassandra, SQLite3/LSM, CockroachDB).



## Architecture

CockroachDB implements a layered architecture, with various
subdirectories implementing layers as appropriate. The highest level of
abstraction is the [SQL layer][5], which depends
directly on the structured data API. The structured
data API provides familiar relational concepts such as schemas,
tables, columns, and indexes. The structured data API in turn depends
on the [distributed key value store][7] ([kv/][8]). The distributed key
value store handles the details of range addressing to provide the
abstraction of a single, monolithic key value store. It communicates
with any number of [RoachNodes][9] ([server/][10]), storing the actual
data. Each node contains one or more [stores][11] ([storage/][12]), one per
physical device.



Each store contains potentially many ranges, the lowest-level unit of
key-value data. Ranges are replicated using the [Raft][2] consensus
protocol. The diagram below is a blown-up version of stores from four
of the five nodes in the previous diagram. Each range is replicated
three ways using Raft. The color coding shows associated range
replicas.



## Client Architecture

RoachNodes serve client traffic using a fully-featured SQL API which accepts requests as either application/x-protobuf or
application/json. Client implementations consist of an HTTP sender
(transport) and a transactional sender which implements a simple
exponential backoff / retry protocol driven by CockroachDB error
codes.

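A minimal Go sketch of such a backoff/retry loop, with a placeholder `errRetryable` standing in for the CockroachDB error codes that permit retry; the function and its parameters are illustrative, not the actual client code.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errRetryable stands in for the CockroachDB error codes that signal a
// request may be retried (e.g. a pushed or aborted transaction).
var errRetryable = errors.New("retryable error")

// runWithRetry retries op with exponential backoff plus jitter until it
// succeeds, returns a non-retryable error, or the attempt budget is spent.
func runWithRetry(op func() error, maxAttempts int) error {
	backoff := 50 * time.Millisecond
	for attempt := 1; ; attempt++ {
		err := op()
		if err == nil || !errors.Is(err, errRetryable) || attempt == maxAttempts {
			return err
		}
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff))))
		if backoff < 2*time.Second {
			backoff *= 2
		}
	}
}

func main() {
	attempts := 0
	err := runWithRetry(func() error {
		attempts++
		if attempts < 3 {
			return errRetryable // simulate two transient failures
		}
		return nil
	}, 5)
	fmt.Println("attempts:", attempts, "err:", err)
}
```
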
The DB client gateway accepts incoming requests and sends them
through a transaction coordinator, which handles transaction
heartbeats on behalf of clients, provides optimization pathways, and
resolves write intents on transaction commit or abort. The transaction
coordinator passes requests on to a distributed sender, which looks up
index metadata, caches the results, and routes internode RPC traffic
based on where the index metadata indicates keys are located in the
distributed cluster.

In addition to the gateway for external DB client traffic, each RoachNode provides the full key/value API (including all internal methods) via
a Go RPC server endpoint. The RPC server endpoint forwards requests to one
or more local stores depending on the specified key range.

Internally, each RoachNode uses the Go implementation of the
CockroachDB client in order to transactionally update system key/value
data; for example, during split and merge operations to update index
metadata records. Unlike an external application, the internal client
eschews the HTTP sender and instead directly shares the transaction
coordinator and distributed sender used by the DB client gateway.



[0]: http://rocksdb.org/
[1]: https://github.com/google/leveldb
[2]: https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf
[3]: http://research.google.com/archive/spanner.html
[4]: http://research.google.com/pubs/pub36971.html
[5]: https://github.com/cockroachdb/cockroach/tree/master/sql
[7]: https://godoc.org/github.com/cockroachdb/cockroach/kv
[8]: https://github.com/cockroachdb/cockroach/tree/master/kv
[9]: https://godoc.org/github.com/cockroachdb/cockroach/server
[10]: https://github.com/cockroachdb/cockroach/tree/master/server
[11]: https://godoc.org/github.com/cockroachdb/cockroach/storage
[12]: https://github.com/cockroachdb/cockroach/tree/master/storage