# About

This document is an updated version of the original design documents
by Spencer Kimball from early 2014.

# Overview

CockroachDB is a distributed SQL database. The primary design goals
are **scalability**, **strong consistency** and **survivability**
(hence the name). CockroachDB aims to tolerate disk, machine, rack, and
even **datacenter failures** with minimal latency disruption and **no
manual intervention**. CockroachDB nodes are symmetric; a design goal is
**homogeneous deployment** (one binary) with minimal configuration and
no required external dependencies.

The entry point for database clients is the SQL interface. Every node
in a CockroachDB cluster can act as a client SQL gateway. A SQL
gateway transforms and executes client SQL statements into key-value
(KV) operations, which the gateway distributes across the cluster as
necessary and returns results to the client. CockroachDB implements a
**single, monolithic sorted map** from key to value where both keys
and values are byte strings.

The KV map is logically composed of smaller segments of the keyspace
called ranges. Each range is backed by data stored in a local KV
storage engine (we use [RocksDB](http://rocksdb.org/), a variant of
LevelDB). Range data is replicated to a configurable number of
additional CockroachDB nodes. Ranges are merged and split to maintain
a target size, by default `64M`. The relatively small size facilitates
quick repair and rebalancing to address node failures, new capacity
and even read/write load. However, the size must be balanced against
the pressure on the system from having more ranges to manage.

CockroachDB achieves horizontal scalability:
- adding more nodes increases the capacity of the cluster by the
amount of storage on each node (divided by a configurable
replication factor), theoretically up to 4 exabytes (4E) of logical
data;
- client queries can be sent to any node in the cluster, and queries
can operate independently (without conflicts), meaning that overall
throughput is a linear factor of the number of nodes in the cluster;
- queries are distributed (ref: distributed SQL) so that the overall
throughput of single queries can be increased by adding more nodes.

CockroachDB achieves strong consistency:
- uses a distributed consensus protocol for synchronous replication of
data in each key value range. We’ve chosen to use the [Raft
consensus algorithm](https://raftconsensus.github.io); all consensus
state is stored in RocksDB.
- single or batched mutations to a single range are mediated via the
range's Raft instance. Raft guarantees ACID semantics.
- logical mutations which affect multiple ranges employ distributed
transactions for ACID semantics. CockroachDB uses an efficient
**non-locking distributed commit** protocol.

CockroachDB achieves survivability:
- range replicas can be co-located within a single datacenter for low
latency replication and survive disk or machine failures. They can
be distributed across racks to survive some network switch failures.
- range replicas can be located in datacenters spanning increasingly
disparate geographies to survive ever-greater failure scenarios, from
datacenter power or networking loss to regional power failures
(e.g. `{ US-East-1a, US-East-1b, US-East-1c }`, `{ US-East, US-West,
Japan }`, `{ Ireland, US-East, US-West }`, `{ Ireland, US-East,
US-West, Japan, Australia }`).
CockroachDB provides [snapshot isolation](http://en.wikipedia.org/wiki/Snapshot_isolation) (SI) and
serializable snapshot isolation (SSI) semantics, allowing **externally
consistent, lock-free reads and writes**--both from a historical
snapshot timestamp and from the current wall clock time. SI provides
lock-free reads and writes but still allows write skew. SSI eliminates
write skew, but introduces a performance hit in the case of a
contentious system. SSI is the default isolation; clients must
consciously decide to trade correctness for performance. CockroachDB
implements [a limited form of linearizability](#linearizability),
providing ordering for any observer or chain of observers.

Similar to
[Spanner](http://static.googleusercontent.com/media/research.google.com/en/us/archive/spanner-osdi2012.pdf)
directories, CockroachDB allows configuration of arbitrary zones of data.
This allows replication factor, storage device type, and/or datacenter
location to be chosen to optimize performance and/or availability.
Unlike Spanner, zones are monolithic and don’t allow movement of fine
grained data on the level of entity groups.
# Architecture

CockroachDB implements a layered architecture. The highest level of
abstraction is the [*SQL layer*](#sql)
(currently unspecified in this document),
which provides familiar relational concepts
such as schemas, tables, columns, and indexes. The SQL layer
in turn depends on the [distributed key value store](#key-value-api),
which handles the details of range addressing to provide the abstraction
of a single, monolithic key value store. The distributed KV store
communicates with any number of physical cockroach nodes. Each node
contains one or more stores, one per physical device.

![Architecture](media/architecture.png)

Each store contains potentially many ranges, the lowest-level unit of
key-value data. Ranges are replicated using the Raft consensus protocol.
The diagram below is a blown-up version of the stores from four of the five
nodes in the previous diagram. Each range is replicated three ways using
Raft. The color coding shows associated range replicas.

![Ranges](media/ranges.png)

Each physical node exports a RoachNode service. Each RoachNode exports
one or more key ranges. RoachNodes are symmetric. Each has the same
binary and assumes identical roles.

Nodes and the ranges they provide access to can be arranged with various
physical network topologies to make trade-offs between reliability and
performance. For example, a triplicated (3-way replica) range could have
each replica located on different:

- disks within a server to tolerate disk failures.
- servers within a rack to tolerate server failures.
- servers on different racks within a datacenter to tolerate rack power/network failures.
- servers in different datacenters to tolerate large scale network or power outages.

Up to `F` failures can be tolerated, where the total number of replicas `N = 2F + 1` (e.g. with 3x replication, one failure can be tolerated; with 5x replication, two failures, and so on).
# Cockroach Client

In order to support diverse client usage, Cockroach clients connect to
any node via HTTPS using protocol buffers or JSON. The connected node
proxies the involved client work, including key lookups and write buffering.

# Keys

Cockroach keys are arbitrary byte arrays. If textual data is used in
keys, UTF-8 encoding is recommended (this helps for cleaner display of
values in debugging tools). User-supplied keys are encoded using an
ordered code. System keys are either prefixed with null characters (`\0`
or `\0\0`) for system tables, or take the form of
`<user-key><system-suffix>` to sort user-key-range specific system
keys immediately after the user keys they refer to. Null characters are
used in system key prefixes to guarantee that they sort first.
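
As an illustration of the key anatomy above, the following Go sketch assembles the two kinds of system keys; the helper names are illustrative assumptions for this document, not the actual CockroachDB `keys` package.

```go
package keys

// Illustrative only: system keys sort before all user keys thanks to the
// null-character prefix; per-user-key system data sorts immediately after
// the user key it refers to.
var systemPrefix = []byte("\x00\x00")

// SystemKey builds a key in the system table space ("\x00\x00" + suffix).
func SystemKey(suffix []byte) []byte {
	return append(append([]byte(nil), systemPrefix...), suffix...)
}

// UserSuffixedKey builds <user-key><system-suffix>, placing range-specific
// system data right after the user key it describes.
func UserSuffixedKey(userKey, systemSuffix []byte) []byte {
	return append(append([]byte(nil), userKey...), systemSuffix...)
}
```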
# Versioned Values

Cockroach maintains historical versions of values by storing them with
associated commit timestamps. Reads and scans can specify a snapshot
time to return the most recent writes prior to the snapshot timestamp.
Older versions of values are garbage collected by the system during
compaction according to a user-specified expiration interval. In order
to support long-running scans (e.g. for MapReduce), all versions have a
minimum expiration.

Versioned values are supported via modifications to RocksDB to record
commit timestamps and GC expirations per key.

Each range maintains a small (i.e. latest 10s of read timestamps),
*in-memory* cache from key to the latest timestamp at which the
key was read. This *read timestamp cache* is updated every time a key
is read. The cache’s entries are evicted oldest timestamp first, updating
the low water mark of the cache appropriately. If a new range lease holder
is elected, it sets the low water mark for the cache to the current
wall time + ε (ε = 99th percentile clock skew).
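
A minimal sketch of the read timestamp cache described above, assuming a flat per-key map with an explicit low water mark; the real cache is bounded and interval-based, so this is illustrative only.

```go
package tscache

// Timestamp stands in for an HLC timestamp (wall time plus logical counter).
type Timestamp struct {
	WallTime int64
	Logical  int32
}

func (t Timestamp) Less(o Timestamp) bool {
	return t.WallTime < o.WallTime || (t.WallTime == o.WallTime && t.Logical < o.Logical)
}

// ReadTimestampCache maps keys to the latest timestamp at which they were
// read; evicted or unknown keys fall back to the low water mark.
type ReadTimestampCache struct {
	lowWater Timestamp
	latest   map[string]Timestamp
}

// NewReadTimestampCache returns an empty cache with the given low water mark.
func NewReadTimestampCache(lowWater Timestamp) *ReadTimestampCache {
	return &ReadTimestampCache{lowWater: lowWater, latest: map[string]Timestamp{}}
}

// Add records a read of key at ts.
func (c *ReadTimestampCache) Add(key string, ts Timestamp) {
	if cur, ok := c.latest[key]; !ok || cur.Less(ts) {
		c.latest[key] = ts
	}
}

// Lookup returns the latest known read timestamp for key.
func (c *ReadTimestampCache) Lookup(key string) Timestamp {
	if ts, ok := c.latest[key]; ok && c.lowWater.Less(ts) {
		return ts
	}
	return c.lowWater
}

// SetLowWater ratchets the low water mark forward, as happens on eviction and
// when a new lease holder takes over (current wall time + maximum clock skew).
func (c *ReadTimestampCache) SetLowWater(ts Timestamp) {
	if c.lowWater.Less(ts) {
		c.lowWater = ts
	}
}
```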
# Lock-Free Distributed Transactions

Cockroach provides distributed transactions without locks. Cockroach
transactions support two isolation levels:

- snapshot isolation (SI) and
- *serializable* snapshot isolation (SSI).

*SI* is simple to implement, highly performant, and correct for all but a
handful of anomalous conditions (e.g. write skew). *SSI* requires just a touch
more complexity, is still highly performant (less so with contention), and has
no anomalous conditions. Cockroach’s SSI implementation is based on ideas from
the literature and some possibly novel insights.

SSI is the default level, with SI provided for application developers
who are certain enough of their need for performance and the absence of
write skew conditions to consciously elect to use it. In a lightly
contended system, our implementation of SSI is just as performant as SI,
requiring no locking or additional writes. With contention, our
implementation of SSI still requires no locking, but will end up
aborting more transactions. Cockroach’s SI and SSI implementations
prevent starvation scenarios even for arbitrarily long transactions.

See the [Cahill paper](https://drive.google.com/file/d/0B9GCVTp_FHJIcEVyZVdDWEpYYXVVbFVDWElrYUV0NHFhU2Fv/edit?usp=sharing)
for one possible implementation of SSI. This is another [great paper](http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf).
For a discussion of SSI implemented by preventing read-write conflicts
(in contrast to detecting them, called write-snapshot isolation), see
the [Yabandeh paper](https://drive.google.com/file/d/0B9GCVTp_FHJIMjJ2U2t6aGpHLTFUVHFnMTRUbnBwc2pLa1RN/edit?usp=sharing),
which is the source of much inspiration for Cockroach’s SSI.

Each Cockroach transaction is assigned a random priority and a
"candidate timestamp" at start. The candidate timestamp is the
provisional timestamp at which the transaction will commit, and is
chosen as the current clock time of the node coordinating the
transaction. This means that a transaction without conflicts will
usually commit with a timestamp that, in absolute time, precedes the
actual work done by that transaction.

In the course of coordinating a transaction between one or more
distributed nodes, the candidate timestamp may be increased, but will
never be decreased. The core difference between the two isolation levels
SI and SSI is that the former allows the transaction's candidate
timestamp to increase and the latter does not.
**Hybrid Logical Clock**

Each cockroach node maintains a hybrid logical clock (HLC) as discussed
in the [Hybrid Logical Clock paper](http://www.cse.buffalo.edu/tech-reports/2014-04.pdf).
HLC time uses timestamps which are composed of a physical component (thought of
as, and always close to, local wall time) and a logical component (used to
distinguish between events with the same physical component). It allows us to
track causality for related events similar to vector clocks, but with less
overhead. In practice, it works much like other logical clocks: when events
are received by a node, it informs the local HLC about the timestamp supplied
with the event by the sender, and when events are sent, a timestamp generated by
the local HLC is attached.

For a more in-depth description of HLC, please read the paper. Our
implementation is [here](https://github.com/cockroachdb/cockroach/blob/master/util/hlc/hlc.go).

Cockroach picks a timestamp for a transaction using HLC time. Throughout this
document, *timestamp* always refers to the HLC time, which is a singleton
on each node. The HLC is updated by every read/write event on the node, and
the HLC time >= wall time. A read/write timestamp received in a cockroach request
from another node is not only used to version the operation, but also updates
the HLC on the node. This is useful in guaranteeing that all data read/written
on a node is at a timestamp < next HLC time.
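
A condensed sketch of the send/receive update rules the HLC paper describes, assuming a simple physical/logical pair; see the linked `util/hlc` package for the actual implementation.

```go
package hlc

import "sync"

// Timestamp is an HLC reading: a physical wall-time component plus a logical
// counter used to order events that share the same wall time.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// Clock is an illustrative HLC; physicalNow would be backed by the wall clock.
type Clock struct {
	mu          sync.Mutex
	physicalNow func() int64
	ts          Timestamp
}

// Now returns a timestamp for an event originating at this node.
func (c *Clock) Now() Timestamp {
	c.mu.Lock()
	defer c.mu.Unlock()
	if pt := c.physicalNow(); pt > c.ts.WallTime {
		c.ts = Timestamp{WallTime: pt}
	} else {
		c.ts.Logical++
	}
	return c.ts
}

// Update folds a timestamp received from another node into the local HLC,
// guaranteeing that subsequent local timestamps exceed the received one.
func (c *Clock) Update(remote Timestamp) Timestamp {
	c.mu.Lock()
	defer c.mu.Unlock()
	pt := c.physicalNow()
	switch {
	case pt > c.ts.WallTime && pt > remote.WallTime:
		c.ts = Timestamp{WallTime: pt} // wall clock dominates; reset logical
	case remote.WallTime > c.ts.WallTime:
		c.ts = Timestamp{WallTime: remote.WallTime, Logical: remote.Logical + 1}
	case c.ts.WallTime > remote.WallTime:
		c.ts.Logical++
	default: // equal wall times: take the larger logical value and bump it
		if remote.Logical > c.ts.Logical {
			c.ts.Logical = remote.Logical
		}
		c.ts.Logical++
	}
	return c.ts
}
```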
**Transaction execution flow**

Transactions are executed in two phases:

1. Start the transaction by selecting a range which is likely to be
   heavily involved in the transaction and writing a new transaction
   record to a reserved area of that range with state "PENDING". In
   parallel write an "intent" value for each datum being written as part
   of the transaction. These are normal MVCC values, with the addition of
   a special flag (i.e. “intent”) indicating that the value may be
   committed after the transaction itself commits. In addition,
   the transaction id (unique and chosen at tx start time by client)
   is stored with intent values. The txn id is used to refer to the
   transaction record when there are conflicts and to make
   tie-breaking decisions on ordering between identical timestamps.
   Each node returns the timestamp used for the write (which is the
   original candidate timestamp in the absence of read/write conflicts);
   the client selects the maximum from amongst all write timestamps as the
   final commit timestamp.

2. Commit the transaction by updating its transaction record. The value
   of the commit entry contains the candidate timestamp (increased as
   necessary to accommodate any latest read timestamps). Note that the
   transaction is considered fully committed at this point and control
   may be returned to the client.

   In the case of an SI transaction, a commit timestamp which was
   increased to accommodate concurrent readers is perfectly
   acceptable and the commit may continue. For SSI transactions,
   however, a gap between candidate and commit timestamps
   necessitates transaction restart (note: restart is different than
   abort--see below).

   After the transaction is committed, all written intents are upgraded
   in parallel by removing the “intent” flag. The transaction is
   considered fully committed before this step and does not wait for
   it to return control to the transaction coordinator.

In the absence of conflicts, this is the end. Nothing else is necessary
to ensure the correctness of the system.
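
A deliberately simplified, self-contained sketch of the two phases above, using in-memory maps in place of ranges; the record and intent shapes are assumptions for illustration, not CockroachDB's actual structures.

```go
package txnflow

// Timestamp stands in for an HLC timestamp.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

func (t Timestamp) Less(o Timestamp) bool {
	return t.WallTime < o.WallTime || (t.WallTime == o.WallTime && t.Logical < o.Logical)
}

// intent is a provisional MVCC value carrying the writing transaction's id.
type intent struct {
	txnID string
	ts    Timestamp
	value []byte
}

// txnRecord lives in a reserved area of one of the implicated ranges.
type txnRecord struct {
	status    string // "PENDING", "COMMITTED" or "ABORTED"
	candidate Timestamp
}

// runTxn sketches the flow: phase 1 writes the PENDING record and one intent
// per key; phase 2 flips the record to COMMITTED. Intent resolution happens
// afterwards and is not required for correctness.
func runTxn(records map[string]*txnRecord, store map[string]intent,
	txnID string, start Timestamp, writes map[string][]byte) Timestamp {

	rec := &txnRecord{status: "PENDING", candidate: start}
	records[txnID] = rec

	for key, value := range writes {
		writeTS := start // may be pushed forward by read/write conflicts
		if rec.candidate.Less(writeTS) {
			rec.candidate = writeTS
		}
		store[key] = intent{txnID: txnID, ts: writeTS, value: value}
	}

	// For SSI, a candidate pushed past the original timestamp would force a
	// restart here instead of a commit.
	rec.status = "COMMITTED"

	// Asynchronous in reality: rewrite intents as ordinary MVCC values.
	for key, in := range store {
		if in.txnID == txnID {
			in.txnID = "" // "intent" flag removed
			in.ts = rec.candidate
			store[key] = in
		}
	}
	return rec.candidate
}
```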
**Conflict Resolution**

Things get more interesting when a reader or writer encounters an intent
record or newly-committed value in a location that it needs to read or
write. This is a conflict, usually causing one of the transactions to
abort or restart depending on the type of conflict.

***Transaction restart:***

This is the usual (and more efficient) type of behaviour and is used
except when the transaction was aborted (for instance by another
transaction).
In effect, that reduces to two cases; the first being the one outlined
above: an SSI transaction that finds upon attempting to commit that
its commit timestamp has been pushed. The second case involves a transaction
actively encountering a conflict, that is, one of its readers or writers
encounters data that necessitates conflict resolution
(see transaction interactions below).

When a transaction restarts, it changes its priority and/or moves its
timestamp forward depending on data tied to the conflict, and
begins anew reusing the same txn id. The prior run of the transaction might
have written some write intents, which need to be deleted before the
transaction commits, so as to not be included as part of the transaction.
These stale write intent deletions are done during the reexecution of the
transaction, either implicitly, through writing new intents to
the same keys as part of the reexecution of the transaction, or explicitly,
by cleaning up stale intents that are not part of the reexecution of the
transaction. Since most transactions will end up writing to the same keys,
the explicit cleanup run just before committing the transaction is usually
a NOOP.
***Transaction abort:***

This is the case in which a transaction, upon reading its transaction
record, finds that it has been aborted. In this case, the transaction
can not reuse its intents; it returns control to the client before
cleaning them up (other readers and writers would clean up dangling
intents as they encounter them) but will make an effort to clean up
after itself. The next attempt (if applicable) then runs as a new
transaction with **a new txn id**.

***Transaction interactions:***

There are several scenarios in which transactions interact:

- **Reader encounters write intent or value with newer timestamp far
enough in the future**: This is not a conflict. The reader is free
to proceed; after all, it will be reading an older version of the
value and so does not conflict. Recall that the write intent may
be committed with a later timestamp than its candidate; it will
never commit with an earlier one. **Side note**: if an SI transaction
reader finds an intent with a newer timestamp which the reader’s own
transaction has written, the reader always returns that intent's value.

- **Reader encounters write intent or value with newer timestamp in the
near future:** In this case, we have to be careful. The newer
intent may, in absolute terms, have happened in our read's past if
the clock of the writer is ahead of the node serving the values.
In that case, we would need to take this value into account, but
we just don't know. Hence the transaction restarts, using instead
a future timestamp (but remembering a maximum timestamp used to
limit the uncertainty window to the maximum clock skew). In fact,
this is optimized further; see the details under "choosing a
timestamp" below.

- **Reader encounters write intent with older timestamp**: the reader
must follow the intent’s transaction id to the transaction record.
If the transaction has already been committed, then the reader can
just read the value. If the write transaction has not yet been
committed, then the reader has two options. If the write conflict
is from an SI transaction, the reader can *push that transaction's
commit timestamp into the future* (and consequently not have to
read it). This is simple to do: the reader just updates the
transaction’s commit timestamp to indicate that when/if the
transaction does commit, it should use a timestamp *at least* as
high. However, if the write conflict is from an SSI transaction,
the reader must compare priorities. If the reader has the higher priority,
it pushes the transaction’s commit timestamp (that
transaction will then notice its timestamp has been pushed, and
restart). If it has the lower or same priority, it retries itself using as
a new priority `max(new random priority, conflicting txn’s
priority - 1)`.

- **Writer encounters uncommitted write intent**:
If the other write intent has been written by a transaction with a lower
priority, the writer aborts the conflicting transaction. If the write
intent has a higher or equal priority the transaction retries, using as a new
priority *max(new random priority, conflicting txn’s priority - 1)*
(see the sketch after this list); the retry occurs after a short,
randomized backoff interval.

- **Writer encounters newer committed value**:
The committed value could also be an unresolved write intent made by a
transaction that has already committed. The transaction restarts. On restart,
the same priority is reused, but the candidate timestamp is moved forward
to the encountered value's timestamp.

- **Writer encounters more recently read key**:
The *read timestamp cache* is consulted on each write at a node. If the write’s
candidate timestamp is earlier than the low water mark on the cache itself
(i.e. its last evicted timestamp) or if the key being written has a read
timestamp later than the write’s candidate timestamp, this later timestamp
value is returned with the write. A new timestamp forces a transaction
restart only if it is serializable.
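
The retry-priority rule quoted in the push and abort cases above boils down to a one-liner; this sketch assumes 31-bit random priorities.

```go
package txnpriority

import "math/rand"

// retryPriority returns max(new random priority, conflicting txn's priority - 1),
// so a restarted transaction keeps roughly its previous chance of winning the
// next conflict instead of starving.
func retryPriority(conflictingPriority int32) int32 {
	p := rand.Int31()
	if other := conflictingPriority - 1; other > p {
		return other
	}
	return p
}
```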
**Transaction management**

Transactions are managed by the client proxy (or gateway in SQL Azure
parlance). Unlike in Spanner, writes are not buffered but are sent
directly to all implicated ranges. This allows the transaction to abort
quickly if it encounters a write conflict. The client proxy keeps track
of all written keys in order to resolve write intents asynchronously upon
transaction completion. If a transaction commits successfully, all intents
are upgraded to committed. In the event a transaction is aborted, all written
intents are deleted. The client proxy doesn’t guarantee it will resolve intents.

In the event the client proxy restarts before the pending transaction is
committed, the dangling transaction would continue to "live" until
aborted by another transaction. Transactions periodically heartbeat
their transaction record to maintain liveness.
Transactions encountered by readers or writers with dangling intents
which haven’t been heartbeat within the required interval are aborted.
In the event the proxy restarts after a transaction commits but before
the asynchronous resolution is complete, the dangling intents are upgraded
when encountered by future readers and writers and the system does
not depend on their timely resolution for correctness.

An exploration of retries with contention and abort times with abandoned
transactions is
[here](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing).

**Transaction Records**

Please see [roachpb/data.proto](https://github.com/cockroachdb/cockroach/blob/master/roachpb/data.proto) for the up-to-date structures, the best entry point being `message Transaction`.
**Pros**

- No requirement for reliable code execution to prevent stalled 2PC
protocol.
- Readers never block with SI semantics; with SSI semantics, they may
abort.
- Lower latency than traditional 2PC commit protocol (w/o contention)
because second phase requires only a single write to the
transaction record instead of a synchronous round to all
transaction participants.
- Priorities avoid starvation for arbitrarily long transactions and
always pick a winner from between contending transactions (no
mutual aborts).
- Writes not buffered at client; writes fail fast.
- No read-locking overhead required for *serializable* SI (in contrast
to other SSI implementations).
- Well-chosen (i.e. less random) priorities can flexibly give
probabilistic guarantees on latency for arbitrary transactions
(for example: make OLTP transactions 10x less likely to abort than
low priority transactions, such as asynchronously scheduled jobs).

**Cons**

- Reads from non-lease holder replicas still require a ping to the lease holder
to update the *read timestamp cache*.
- Abandoned transactions may block contending writers for up to the
heartbeat interval, though average wait is likely to be
considerably shorter (see [graph in link](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing)).
This is likely considerably more performant than detecting and
restarting 2PC in order to release read and write locks.
- Behavior differs from other SI implementations: there is no first-writer-wins,
and shorter transactions do not always finish quickly.
The element of surprise for OLTP systems may be a problematic factor.
- Aborts can decrease throughput in a contended system compared with
two phase locking. Aborts and retries increase read and write
traffic, increase latency and decrease throughput.
**Choosing a Timestamp**

A key challenge of reading data in a distributed system with clock skew
is choosing a timestamp guaranteed to be greater than the latest
timestamp of any committed transaction (in absolute time). No system can
claim consistency and fail to read already-committed data.

Accomplishing consistency for transactions (or just single operations)
accessing a single node is easy. The timestamp is assigned by the node
itself, so it is guaranteed to be at a greater timestamp than all the
existing timestamped data on the node.

For multiple nodes, the timestamp of the node coordinating the
transaction `t` is used. In addition, a maximum timestamp `t+ε` is
supplied to provide an upper bound on timestamps for already-committed
data (`ε` is the maximum clock skew). As the transaction progresses, any
data read which have timestamps greater than `t` but less than `t+ε`
cause the transaction to abort and retry with the conflicting timestamp
t<sub>c</sub>, where t<sub>c</sub> \> t. The maximum timestamp `t+ε` remains
the same. This implies that transaction restarts due to clock uncertainty
can only happen on a time interval of length `ε`.

We apply another optimization to reduce the restarts caused
by uncertainty. Upon restarting, the transaction not only takes
into account t<sub>c</sub>, but also the timestamp of the node at the time
of the uncertain read t<sub>node</sub>. The larger of those two timestamps
t<sub>c</sub> and t<sub>node</sub> (likely equal to the latter) is used
to increase the read timestamp. Additionally, the conflicting node is
marked as “certain”. Then, for future reads to that node within the
transaction, we set `MaxTimestamp = Read Timestamp`, preventing further
uncertainty restarts.

Correctness follows from the fact that we know that at the time of the read,
there exists no version of any key on that node with a higher timestamp than
t<sub>node</sub>. Upon a restart caused by the node, if the transaction
encounters a key with a higher timestamp, it knows that in absolute time,
the value was written after t<sub>node</sub> was obtained, i.e. after the
uncertain read. Hence the transaction can move forward reading an older version
of the data (at the transaction's timestamp). This limits the time uncertainty
restarts attributed to a node to at most one. The tradeoff is that we might
pick a timestamp larger than the optimal one (> highest conflicting timestamp),
resulting in the possibility of a few more conflicts.
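
A sketch of the uncertainty rule and the per-node "certain" optimization described above; the function shape and bookkeeping are assumptions for illustration.

```go
package uncertainty

// Timestamp ordering helper for this sketch.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

func (t Timestamp) Less(o Timestamp) bool {
	return t.WallTime < o.WallTime || (t.WallTime == o.WallTime && t.Logical < o.Logical)
}

// checkUncertainty encodes the read-side rule: a value with a timestamp in
// (readTs, maxTs] *might* have committed before our transaction started, so
// the read must restart at a higher timestamp. certainNodes records nodes for
// which MaxTimestamp has already been lowered to the read timestamp,
// suppressing further uncertainty restarts from that node.
func checkUncertainty(readTs, maxTs, valueTs, nodeNow Timestamp,
	nodeID int, certainNodes map[int]bool) (restartAt Timestamp, restart bool) {

	if certainNodes[nodeID] || !readTs.Less(valueTs) || maxTs.Less(valueTs) {
		return readTs, false // either certainly ordered, or simply newer than maxTs
	}
	// Restart using the larger of the conflicting timestamp and the node's
	// clock at the time of the read, and mark the node as certain.
	restartAt = valueTs
	if restartAt.Less(nodeNow) {
		restartAt = nodeNow
	}
	certainNodes[nodeID] = true
	return restartAt, true
}
```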
We expect retries will be rare, but this assumption may need to be
revisited if retries become problematic. Note that this problem does not
apply to historical reads. An alternate approach which does not require
retries makes a round to all node participants in advance and
chooses the highest reported node wall time as the timestamp. However,
knowing which nodes will be accessed in advance is difficult and
potentially limiting. Cockroach could also potentially use a global
clock (Google did this with [Percolator](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf)),
which would be feasible for smaller, geographically-proximate clusters.
# Linearizability

First a word about [***Spanner***](http://research.google.com/archive/spanner.html).
By combining judicious use of wait intervals with accurate time signals,
Spanner provides a global ordering between any two non-overlapping transactions
(in absolute time) with \~14ms latencies. Put another way:
Spanner guarantees that if a transaction T<sub>1</sub> commits (in absolute time)
before another transaction T<sub>2</sub> starts, then T<sub>1</sub>'s assigned commit
timestamp is smaller than T<sub>2</sub>'s. Using atomic clocks and GPS receivers,
Spanner reduces their clock skew uncertainty to \< 10ms (`ε`). To make
good on the promised guarantee, transactions must take at least double
the clock skew uncertainty interval to commit (`2ε`). See [*this
article*](http://www.cs.cornell.edu/~ie53/publications/DC-col51-Sep13.pdf)
for a helpful overview of Spanner’s concurrency control.

Cockroach could make the same guarantees without specialized hardware,
at the expense of longer wait times. If servers in the cluster were
configured to work only with NTP, transaction wait times would likely
be in excess of 150ms. For wide-area zones, this would be somewhat
mitigated by overlap from cross datacenter link latencies. If clocks
were made more accurate, the minimal limit for commit latencies would
improve.

However, let’s take a step back and evaluate whether Spanner’s external
consistency guarantee is worth the automatic commit wait. First, if the
commit wait is omitted completely, the system still yields a consistent
view of the map at an arbitrary timestamp. However with clock skew, it
would become possible for commit timestamps on non-overlapping but
causally related transactions to suffer temporal reversal. In other
words, the following scenario is possible for a client without global
ordering:

- Start transaction T<sub>1</sub> to modify value `x` with commit time s<sub>1</sub>

- On commit of T<sub>1</sub>, start T<sub>2</sub> to modify value `y` with commit time s<sub>2</sub>

- Read `x` and `y` and discover that s<sub>1</sub> \> s<sub>2</sub> (**!**)

The external consistency which Spanner guarantees is referred to as
**linearizability**. It goes beyond serializability by preserving
information about the causality inherent in how external processes
interacted with the database. The strength of Spanner’s guarantee can be
formulated as follows: any two processes, with clock skew within
expected bounds, may independently record their wall times for the
completion of transaction T<sub>1</sub> (T<sub>1</sub><sup>end</sup>) and start of transaction
T<sub>2</sub> (T<sub>2</sub><sup>start</sup>) respectively, and if later
compared such that T<sub>1</sub><sup>end</sup> \< T<sub>2</sub><sup>start</sup>,
then commit timestamps s<sub>1</sub> \< s<sub>2</sub>.
This guarantee is broad enough to completely cover all cases of explicit
causality, in addition to covering any and all imaginable scenarios of implicit
causality.

Our contention is that causality is chiefly important from the
perspective of a single client or a chain of successive clients (*if a
tree falls in the forest and nobody hears…*). As such, Cockroach
provides two mechanisms to provide linearizability for the vast majority
of use cases without a mandatory transaction commit wait or an elaborate
system to minimize clock skew.

1. Clients provide the highest transaction commit timestamp with
   successive transactions. This allows node clocks from previous
   transactions to effectively participate in the formulation of the
   commit timestamp for the current transaction. This guarantees
   linearizability for transactions committed by this client.

   Newly launched clients wait at least 2 \* ε from process start
   time before beginning their first transaction. This preserves the
   same property even on client restart, and the wait will be
   mitigated by process initialization.

   All causally-related events within Cockroach maintain
   linearizability.

2. Committed transactions respond with a commit wait parameter which
   represents the remaining time in the nominal commit wait. This
   will typically be less than the full commit wait as the consensus
   write at the coordinator accounts for a portion of it.

   Clients taking any action outside of another Cockroach transaction
   (e.g. writing to another distributed system component) can either
   choose to wait the remaining interval before proceeding, or
   alternatively, pass the wait and/or commit timestamp to the
   execution of the outside action for its consideration. This pushes
   the burden of linearizability to clients, but is a useful tool in
   mitigating commit latencies if the clock skew is potentially
   large. This functionality can be used for ordering in the face of
   backchannel dependencies as mentioned in the
   [AugmentedTime](http://www.cse.buffalo.edu/~demirbas/publications/augmentedTime.pdf)
   paper.

Using these mechanisms in place of commit wait, Cockroach’s guarantee can be
formulated as follows: any process which signals the start of transaction
T<sub>2</sub> (T<sub>2</sub><sup>start</sup>) after the completion of
transaction T<sub>1</sub> (T<sub>1</sub><sup>end</sup>), will have commit
timestamps such that s<sub>1</sub> \< s<sub>2</sub>.
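
A hedged client-side sketch of mechanisms (1) and (2) above; the struct and method names are illustrative, not an actual Cockroach client API.

```go
package causality

import "time"

// Client sketches mechanism (1): a client remembers the highest commit
// timestamp it has observed and supplies it with its next transaction, so
// causally chained transactions get monotonically increasing commit timestamps.
type Client struct {
	maxCommitWallTime int64 // nanoseconds; highest commit timestamp seen so far
}

// ObserveCommit records the commit timestamp returned by a finished transaction.
func (c *Client) ObserveCommit(wallTime int64) {
	if wallTime > c.maxCommitWallTime {
		c.maxCommitWallTime = wallTime
	}
}

// NextTxnMinTimestamp is passed with the next transaction so participating
// nodes ratchet their clocks at least this far forward.
func (c *Client) NextTxnMinTimestamp() int64 { return c.maxCommitWallTime }

// WaitCommitWait sketches mechanism (2) for actions outside Cockroach:
// sleep out whatever portion of the nominal commit wait remains.
func WaitCommitWait(remaining time.Duration) { time.Sleep(remaining) }
```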
# Logical Map Content

Logically, the map contains a series of reserved system key/value
pairs preceding the actual user data (which is managed by the SQL
subsystem).

- `\x02<key1>`: Range metadata for range ending `\x03<key1>`. This is a "meta1" key.
- ...
- `\x02<keyN>`: Range metadata for range ending `\x03<keyN>`. This is a "meta1" key.
- `\x03<key1>`: Range metadata for range ending `<key1>`. This is a "meta2" key.
- ...
- `\x03<keyN>`: Range metadata for range ending `<keyN>`. This is a "meta2" key.
- `\x04{desc,node,range,store}-idegen`: ID generation oracles for various component types.
- `\x04status-node-<varint encoded Store ID>`: Store runtime metadata.
- `\x04tsd<key>`: Time-series data key.
- `<key>`: A user key. In practice, these keys are managed by the SQL
subsystem, which employs its own key anatomy.
# Node Storage

Nodes maintain a separate instance of RocksDB for each disk. Each
RocksDB instance hosts any number of ranges. RPCs arriving at a
RoachNode are multiplexed based on the disk name to the appropriate
RocksDB instance. A single instance per disk is used to avoid
contention. If every range maintained its own RocksDB, global management
of available cache memory would be impossible and writers for each range
would compete for non-contiguous writes to multiple RocksDB logs.

In addition to the key/value pairs of the range itself, various range
metadata is maintained.

- participating replicas

- consensus metadata

- split/merge activity

A really good reference on tuning Linux installations with RocksDB is
[here](http://docs.basho.com/riak/latest/ops/advanced/backends/leveldb/).
# Range Metadata

The default approximate size of a range is 64M (2\^26 B). In order to
support 1P (2\^50 B) of logical data, metadata is needed for roughly
2\^(50 - 26) = 2\^24 ranges. A reasonable upper bound on range metadata
size is roughly 256 bytes (3\*12 bytes for the triplicated node
locations and 220 bytes for the range key itself). 2\^24 ranges \* 2\^8
B would require roughly 4G (2\^32 B) to store--too much to duplicate
between machines. Our conclusion is that range metadata must be
distributed for large installations.

To keep key lookups relatively fast in the presence of distributed metadata,
we store all the top-level metadata in a single range (the first range). These
top-level metadata keys are known as *meta1* keys, and are prefixed such that
they sort to the beginning of the key space. Given the metadata size of 256
bytes given above, a single 64M range would support 64M/256B = 2\^18 ranges,
which gives a total storage of 64M \* 2\^18 = 16T. To support the 1P quoted
above, we need two levels of indirection, where the first level addresses the
second, and the second addresses user data. With two levels of indirection, we
can address 2\^(18 + 18) = 2\^36 ranges; each range addresses 2\^26 B, and
altogether we address 2\^(36+26) B = 2\^62 B = 4E of user data.
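
The arithmetic above can be checked mechanically; this small Go program reproduces the 2\^18, 16T, and 4E figures.

```go
package main

import "fmt"

// Back-of-the-envelope check of the addressing math: with 64 MiB ranges and
// ~256 B of metadata per range, one metadata range indexes 2^18 ranges, and
// two levels of indirection address 2^62 B (4 EiB).
func main() {
	const (
		rangeSize    = int64(1) << 26 // 64 MiB
		metadataSize = int64(1) << 8  // 256 B per range descriptor
	)
	rangesPerMetaRange := rangeSize / metadataSize                   // 2^18
	oneLevel := rangesPerMetaRange * rangeSize                       // 16 TiB
	twoLevels := rangesPerMetaRange * rangesPerMetaRange * rangeSize // 4 EiB

	fmt.Printf("ranges per meta range: %d\n", rangesPerMetaRange)
	fmt.Printf("one level of indirection: %d bytes (16T)\n", oneLevel)
	fmt.Printf("two levels of indirection: %d bytes (4E)\n", twoLevels)
}
```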
For a given user-addressable `key1`, the associated *meta1* record is found
at the successor key to `key1` in the *meta1* space. Since the *meta1* space
is sparse, the successor key is defined as the next key which is present. The
*meta1* record identifies the range containing the *meta2* record, which is
found using the same process. The *meta2* record identifies the range
containing `key1`, which is again found the same way (see examples below).

Concretely, metadata keys are prefixed by `\0\0meta{1,2}`; the two null
characters provide for the desired sorting behaviour. Thus, `key1`'s
*meta1* record will reside at the successor key to `\0\0meta1<key1>`.

Note: we append the end key of each range to meta{1,2} records because
the RocksDB iterator only supports a Seek() interface which acts as a
Ceil(). Using the start key of the range would cause Seek() to find the
key *after* the meta indexing record we’re looking for, which would
result in having to back the iterator up, an option which is both less
efficient and not available in all cases.

The following example shows the directory structure for a map with
three ranges worth of data. Ellipses indicate additional key/value pairs to
fill an entire range of data. Except for the fact that splitting ranges
requires updates to the range metadata with knowledge of the metadata layout,
the range metadata itself requires no special treatment or bootstrapping.

**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
`dcrama3:8000`)

- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2<lastkey0>`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2<lastkey1>`: `dcrama4:8000`, `dcrama5:8000`, `dcrama6:8000`
- `\0\0meta2\xff`: `dcrama7:8000`, `dcrama8:8000`, `dcrama9:8000`
- ...
- `<lastkey0>`: `<lastvalue0>`

**Range 1** (located on servers `dcrama4:8000`, `dcrama5:8000`,
`dcrama6:8000`)

- ...
- `<lastkey1>`: `<lastvalue1>`

**Range 2** (located on servers `dcrama7:8000`, `dcrama8:8000`,
`dcrama9:8000`)

- ...
- `<lastkey2>`: `<lastvalue2>`

Consider a simpler example of a map containing less than a single
range of data. In this case, all range metadata and all data are
located in the same range:

**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
`dcrama3:8000`)

- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `<key0>`: `<value0>`
- `...`

Finally, a map large enough to need both levels of indirection would
look like (note that instead of showing range replicas, this
example is simplified to just show range indexes):

**Range 0**

- `\0\0meta1<lastkeyN-1>`: Range 0
- `\0\0meta1\xff`: Range 1
- `\0\0meta2<lastkey1>`: Range 1
- `\0\0meta2<lastkey2>`: Range 2
- `\0\0meta2<lastkey3>`: Range 3
- ...
- `\0\0meta2<lastkeyN-1>`: Range 262143

**Range 1**

- `\0\0meta2<lastkeyN>`: Range 262144
- `\0\0meta2<lastkeyN+1>`: Range 262145
- ...
- `\0\0meta2\xff`: Range 500,000
- ...
- `<lastkey1>`: `<lastvalue1>`

**Range 2**

- ...
- `<lastkey2>`: `<lastvalue2>`

**Range 3**

- ...
- `<lastkey3>`: `<lastvalue3>`

**Range 262144**

- ...
- `<lastkeyN>`: `<lastvalueN>`

**Range 262145**

- ...
- `<lastkeyN+1>`: `<lastvalueN+1>`

Note that the choice of range `262144` is just an approximation. The
actual number of ranges addressable via a single metadata range is
dependent on the size of the keys. If efforts are made to keep key sizes
small, the total number of addressable ranges would increase and vice
versa.

From the examples above it’s clear that key location lookups require at
most three reads to get the value for `<key>`:

1. lower bound of `\0\0meta1<key>`,
2. lower bound of `\0\0meta2<key>`,
3. `<key>`.

For small maps, the entire lookup is satisfied in a single RPC to Range 0. Maps
containing less than 16T of data would require two lookups. Clients cache both
levels of range metadata, and we expect that data locality for individual
clients will be high. Clients may end up with stale cache entries. If on a
lookup, the range consulted does not match the client’s expectations, the
client evicts the stale entries and possibly does a new lookup.
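
As a sketch of the successor-key (Ceil) lookup described above, assuming the meta entries are held in a sorted in-memory slice rather than RocksDB; the types are illustrative.

```go
package addressing

import (
	"bytes"
	"sort"
)

// metaEntry is a stand-in for a range descriptor stored under the *end* key
// of the range it describes, e.g. "\x00\x00meta2" + <range end key>.
type metaEntry struct {
	metaKey  []byte
	replicas []string
}

// lookup returns the first meta entry whose key is >= metaPrefix + key,
// mirroring the Seek()-as-Ceil() behaviour described above.
func lookup(sortedMeta []metaEntry, metaPrefix, key []byte) *metaEntry {
	target := append(append([]byte(nil), metaPrefix...), key...)
	i := sort.Search(len(sortedMeta), func(i int) bool {
		return bytes.Compare(sortedMeta[i].metaKey, target) >= 0
	})
	if i == len(sortedMeta) {
		return nil // every key should be covered by a trailing "\xff" entry
	}
	return &sortedMeta[i]
}
```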
# Raft - Consistency of Range Replicas

Each range is configured to consist of three or more replicas, as specified by
their ZoneConfig. The replicas in a range maintain their own instance of a
distributed consensus algorithm. We use the [*Raft consensus algorithm*](https://raftconsensus.github.io)
as it is simpler to reason about and includes a reference implementation
covering important details.
[ePaxos](https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf) has
promising performance characteristics for WAN-distributed replicas, but
it does not guarantee a consistent ordering between replicas.

Raft elects a relatively long-lived leader which must be involved to
propose commands. It heartbeats followers periodically and keeps their logs
replicated. In the absence of heartbeats, followers become candidates
after randomized election timeouts and proceed to hold new leader
elections. Cockroach weights random timeouts such that the replicas with
shorter round trip times to peers are more likely to hold elections
first (not implemented yet). Only the Raft leader may propose commands;
followers will simply relay commands to the last known leader.

Our Raft implementation was developed together with CoreOS, but adds an extra
layer of optimization to account for the fact that a single Node may have
millions of consensus groups (one for each Range). Areas of optimization
are chiefly coalesced heartbeats (so that the number of nodes dictates the
number of heartbeats as opposed to the much larger number of ranges) and
batch processing of requests.
Future optimizations may include two-phase elections and quiescent ranges
(i.e. stopping traffic completely for inactive ranges).
# Range Leases

As outlined in the Raft section, the replicas of a Range are organized as a
Raft group and execute commands from their shared commit log. Going through
Raft is an expensive operation though, and there are tasks which should only be
carried out by a single replica at a time (as opposed to all of them).

For these reasons, Cockroach introduces the concept of **Range Leases**:
This is a lease held for a slice of (database, i.e. hybrid logical) time and is
established by committing a special log entry through Raft containing the
interval the lease is going to be active on, along with the Node:RaftID
combination that uniquely describes the requesting replica. Reads and writes
must generally be addressed to the replica holding the lease; if none does, any
replica may be addressed, causing it to try to obtain the lease synchronously.
Requests received by a non-lease holder (for the HLC timestamp specified in the
request's header) fail with an error pointing at the replica's last known
lease holder. These requests are retried transparently with the updated lease by the
gateway node and never reach the client.

The replica holding the lease is in charge of or involved in handling
Range-specific maintenance tasks such as

* gossiping the sentinel and/or first range information
* splitting, merging and rebalancing

and, very importantly, may satisfy reads locally, without incurring the
overhead of going through Raft.

Since reads bypass Raft, a new lease holder will, among other things, ascertain
that its timestamp cache does not report timestamps smaller than the previous
lease holder's (so that it's compatible with reads which may have occurred on
the former lease holder). This is accomplished by setting the low water mark of the
timestamp cache to the expiration of the previous lease plus the maximum clock
offset.
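
In code, the rule in the last paragraph amounts to a single ratchet on the timestamp cache; a hedged sketch with illustrative types:

```go
package lease

// Timestamp stands in for an HLC timestamp.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// onLeaseAcquired forwards the timestamp cache's low water mark to the
// previous lease's expiration plus the maximum clock offset, so no read served
// by the former lease holder can be contradicted by the new one.
func onLeaseAcquired(prevLeaseExpiration Timestamp, maxOffsetNanos int64, setLowWater func(Timestamp)) {
	setLowWater(Timestamp{WallTime: prevLeaseExpiration.WallTime + maxOffsetNanos})
}
```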
## Relationship to Raft leadership

The range lease is completely separate from Raft leadership, and so without
further efforts, Raft leadership and the Range lease may not be represented by the same
replica most of the time. This is convenient semantically since it decouples
these two types of leadership and allows the use of Raft as a "black box", but
for reasons of performance, it is desirable to have both on the same replica.
Otherwise, sending a command through Raft always incurs the overhead of being
proposed to the Range lease holder's Raft instance first, which must relay it to the
Raft leader, which finally commits it into the log and updates its followers,
including the Range lease holder. This yields correct results but wastes several
round-trip delays, and so we will make sure that in the vast majority of cases
Range lease and Raft leadership coincide. A fairly easy method for achieving this is
to have each new lease period (extension or new) be accompanied by a
stipulation to the lease holder's replica to start Raft elections (unless it's
already leading), though some care should be taken that Range lease holdership is
relatively stable and long-lived to avoid a large number of Raft leadership
transitions.
## Command Execution Flow

This subsection describes how a lease holder replica processes a read/write
command in more detail. Each command specifies (1) a key (or a range
of keys) that the command accesses and (2) the ID of a range which the
key(s) belongs to. When receiving a command, a RoachNode looks up a
range by the specified Range ID and checks if the range is still
responsible for the supplied keys. If any of the keys do not belong to the
range, the RoachNode returns an error so that the client will retry
and send a request to the correct range.

When all the keys belong to the range, the RoachNode attempts to
process the command. If the command is an inconsistent read-only
command, it is processed immediately. If the command is a consistent
read or a write, the command is executed when both of the following
conditions hold:

- The range replica has a range lease.
- There are no other running commands whose keys overlap with
the submitted command and cause a read/write conflict.

When the first condition is not met, the replica attempts to acquire
a lease or returns an error so that the client will redirect the
command to the current lease holder. The second condition guarantees that
consistent read/write commands for a given key are sequentially
executed.

When the above two conditions are met, the lease holder replica processes the
command. Consistent reads are processed on the lease holder immediately.
Write commands are committed into the Raft log so that every replica
will execute the same commands. All commands produce deterministic
results so that the range replicas keep consistent states among them.

When a write command completes, all the replicas update their response
cache to ensure idempotency. When a read command completes, the lease holder
replica updates its timestamp cache to keep track of the latest read
for a given key.

There is a chance that a range lease expires while a command is
executed. Before executing a command, each replica checks if the replica
proposing the command still has a valid lease. If the lease has
expired, the command is rejected by the replica.
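
The two gating conditions above could be expressed roughly as follows; this is a hedged sketch with stand-in types, not the replica's actual command queue.

```go
package replica

// command is a simplified request: whether it needs consistency, whether it
// writes, and which keys it touches.
type command struct {
	consistent bool
	readOnly   bool
	keys       [][]byte
}

// replica tracks the lease and currently executing commands (simplified:
// exact key match stands in for key-range overlap).
type replica struct {
	holdsLease  bool
	leaseHolder string
	inFlight    [][]byte
}

// canExecute reports whether the command may run now, or where to redirect it.
func (r *replica) canExecute(c command) (ok bool, redirectTo string) {
	if !c.consistent && c.readOnly {
		return true, "" // inconsistent reads are served immediately
	}
	if !r.holdsLease {
		return false, r.leaseHolder // client retries against the lease holder
	}
	for _, k := range c.keys {
		for _, busy := range r.inFlight {
			if string(k) == string(busy) {
				return false, "" // wait: overlapping command still running
			}
		}
	}
	return true, ""
}
```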
# Splitting / Merging Ranges

RoachNodes split or merge ranges based on whether they exceed maximum or
minimum thresholds for capacity or load. Ranges exceeding maximums for
either capacity or load are split; ranges below minimums for *both*
capacity and load are merged.

Ranges maintain the same accounting statistics as accounting key
prefixes. These boil down to a time series of data points with minute
granularity, covering everything from number of bytes to read/write queue sizes.
Arbitrary distillations of the accounting stats can be determined as the
basis for splitting / merging. Two sensible metrics for use with
split/merge are range size in bytes and IOps. A good metric for
rebalancing a replica from one node to another would be total read/write
queue wait times. These metrics are gossipped, with each range / node
passing along relevant metrics if they’re in the bottom or top of the
range it’s aware of.

A range finding itself exceeding either capacity or load threshold
splits. To this end, the range lease holder computes an appropriate split key
candidate and issues the split through Raft. In contrast to splitting,
merging requires a range to be below the minimum threshold for both
capacity *and* load. A range being merged chooses the smaller of the
ranges immediately preceding and succeeding it.

Splitting, merging, rebalancing and recovering all follow the same basic
algorithm for moving data between roach nodes. New target replicas are
created and added to the replica set of the source range. Then each new
replica is brought up to date by either replaying the log in full or
copying a snapshot of the source replica data and then replaying the log
from the timestamp of the snapshot to catch up fully. Once the new
replicas are fully up to date, the range metadata is updated and the old,
source replica(s) deleted if applicable.
**Coordinator** (lease holder replica)

```
if splitting
  SplitRange(split_key): splits happen locally on range replicas and
  only after being completed locally, are moved to new target replicas.
else if merging
  Choose new replicas on same servers as target range replicas;
  add to replica set.
else if rebalancing || recovering
  Choose new replica(s) on least loaded servers; add to replica set.
```

**New Replica**

*Bring replica up to date:*

```
if all info can be read from replicated log
  copy replicated log
else
  snapshot source replica
  send successive ReadRange requests to source replica
  referencing snapshot

if merging
  combine ranges on all replicas
else if rebalancing || recovering
  remove old range replica(s)
```

RoachNodes split ranges when the total data in a range exceeds a
configurable maximum threshold. Similarly, ranges are merged when the
total data falls below a configurable minimum threshold.

**TBD: flesh this out**: Especially for merges (but also rebalancing) we have a
range disappearing from the local node; that range needs to disappear
gracefully, with a smooth handoff of operation to the new owner of its data.

Ranges are rebalanced if a node determines its load or capacity is one
of the worst in the cluster based on gossipped load stats. A node with
spare capacity is chosen in the same datacenter and a special-case split
is done which simply duplicates the data 1:1 and resets the range
configuration metadata.
# Node Allocation (via Gossip)

New nodes must be allocated when a range is split. Instead of requiring
every RoachNode to know about the status of all or even a large number
of peer nodes --or-- alternatively requiring a specialized curator or
master with sufficiently global knowledge, we use a gossip protocol to
efficiently communicate only interesting information between all of the
nodes in the cluster. What’s interesting information? One example would
be whether a particular node has a lot of spare capacity. Each node,
when gossiping, compares each topic of gossip to its own state. If its
own state is somehow “more interesting” than the least interesting item
in the topic it’s seen recently, it includes its own state as part of
the next gossip session with a peer node. In this way, a node with
capacity sufficiently in excess of the mean quickly becomes discovered
by the entire cluster. To avoid piling onto outliers, nodes from the
high capacity set are selected at random for allocation.

The gossip protocol itself contains two primary components:

- **Peer Selection**: each node maintains up to N peers with which it
  regularly communicates. It selects peers with an eye towards
  maximizing fanout. A peer node which itself communicates with an
  array of otherwise unknown nodes will be selected over one which
  communicates with a set containing significant overlap. Each time
  gossip is initiated, each node’s set of peers is exchanged. Each
  node is then free to incorporate the other’s peers as it sees fit.
  To avoid any node suffering from excess incoming requests, a node
  may refuse to answer a gossip exchange. Each node is biased
  towards answering requests from nodes without significant overlap
  and refusing requests otherwise.

  Peers are efficiently selected using a heuristic as described in
  [Agarwal & Trachtenberg (2006)](https://drive.google.com/file/d/0B9GCVTp_FHJISmFRTThkOEZSM1U/edit?usp=sharing).

  **TBD**: how to avoid partitions? Need to work out a simulation of
  the protocol to tune the behavior and see empirically how well it
  works.

- **Gossip Selection**: what to communicate. Gossip is divided into
  topics. Load characteristics (capacity per disk, cpu load, and
  state [e.g. draining, ok, failure]) are used to drive node
  allocation. Range statistics (range read/write load, missing
  replicas, unavailable ranges) and network topology (inter-rack
  bandwidth/latency, inter-datacenter bandwidth/latency, subnet
  outages) are used for determining when to split ranges, when to
  recover replicas vs. wait for network connectivity, and for
  debugging / sysops. In all cases, a set of minimums and a set of
  maximums is propagated; each node applies its own view of the
  world to augment the values. Each minimum and maximum value is
  tagged with the reporting node and other accompanying contextual
  information. Each topic of gossip has its own protobuf to hold the
  structured data. The number of items of gossip in each topic is
  limited by a configurable bound.

  For efficiency, nodes assign each new item of gossip a sequence
  number and keep track of the highest sequence number each peer
  node has seen. Each round of gossip communicates only the delta
  containing new items.
1041
# Node Accounting
1042
1043
The gossip protocol discussed in the previous section is useful to
1044
quickly communicate fragments of important information in a
1045
decentralized manner. However, complete accounting for each node is also
1046
stored to a central location, available to any dashboard process. This
1047
is done using the map itself. Each node periodically writes its state to
1048
the map with keys prefixed by `\0node`, similar to the first level of
1049
range metadata, but with an ‘`node`’ suffix. Each value is a protobuf
1050
containing the full complement of node statistics--everything
1051
communicated normally via the gossip protocol plus other useful, but
1052
non-critical data.

The range containing the first key in the node accounting table is
responsible for gossiping the total count of nodes. This total count is
used by the gossip network to organize itself most efficiently. In
particular, the maximum number of hops for gossiped information to take
before reaching a node is given by `ceil(log(node count) / log(max
fanout)) + 1`.
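
For example, with 1,000 nodes and a maximum fanout of 3, the bound is
`ceil(log(1000) / log(3)) + 1 = 8` hops. The small Go helper below (a
sketch, not taken from the code base) evaluates the same formula:

```go
package gossip

import "math"

// maxHops returns the maximum number of hops gossiped information takes
// before reaching every node: ceil(log(nodeCount) / log(maxFanout)) + 1.
func maxHops(nodeCount, maxFanout int) int {
	if nodeCount <= 1 || maxFanout <= 1 {
		return 1
	}
	ratio := math.Log(float64(nodeCount)) / math.Log(float64(maxFanout))
	return int(math.Ceil(ratio)) + 1
}
```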

# Key-prefix Accounting and Zones

Arbitrarily fine-grained accounting is specified via
key prefixes. Key prefixes can overlap, as is necessary for capturing
hierarchical relationships. For illustrative purposes, let’s say keys
specifying rows in a set of databases have the following format:

`<db>:<table>:<primary-key>[:<secondary-key>]`

In this case, we might collect accounting with
key prefixes:

`db1`, `db1:user`, `db1:order`

Accounting is kept for the entire map by default.

## Accounting

To keep accounting for a range defined by a key prefix, an entry is created in
the accounting system table. The format of accounting table keys is:

`\0acct<key-prefix>`

In practice, we assume each RoachNode is capable of caching the
entire accounting table, as it is likely to be relatively small.

Accounting is kept for key prefix ranges with eventual consistency for
efficiency. There are two types of values which comprise accounting:
counts and occurrences, for lack of better terms. Counts describe
system state, such as the total number of bytes, rows,
etc. Occurrences include transient performance and load metrics. Both
types of accounting are captured as time series with minute
granularity. The length of time accounting metrics are kept is
configurable. Below are examples of each type of accounting value.

**System State Counters/Performance**

- Count of items (e.g. rows)
- Total bytes
- Total key bytes
- Total value length
- Queued message count
- Queued message total bytes
- Count of values \< 16B
- Count of values \< 64B
- Count of values \< 256B
- Count of values \< 1K
- Count of values \< 4K
- Count of values \< 16K
- Count of values \< 64K
- Count of values \< 256K
- Count of values \< 1M
- Count of values \> 1M
- Total bytes of accounting

**Load Occurrences**

- Get op count
- Get total MB
- Put op count
- Put total MB
- Delete op count
- Delete total MB
- Delete range op count
- Delete range total MB
- Scan op count
- Scan total MB
- Split count
- Merge count

Because accounting information is kept as time series and over many
possible metrics of interest, the volume of data can become large.
Accounting data are stored in the map near the key prefix described, in
order to distribute load (for both aggregation and storage).

Accounting keys for system state have the form:
`<key-prefix>|acctd<metric-name>*`. Notice the leading ‘pipe’
character. It’s meant to sort the root level account AFTER any other
system tables. These counters increment the same underlying values, as
they are permanent counts and not transient activity. Logic at the
RoachNode takes care of snapshotting the value into an appropriately
suffixed (e.g. with timestamp hour) multi-value time series entry.

Keys for perf/load metrics have the form:
`<key-prefix>acctd<metric-name><hourly-timestamp>`.

`<hourly-timestamp>`-suffixed accounting entries are multi-valued,
containing a varint64 entry for each minute with activity during the
specified hour.
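
A rough Go sketch of how such keys might be assembled from the formats
above; the helper names, the hourly-timestamp rendering, and the
minute-sample encoding are illustrative assumptions, not the actual key
encoding.

```go
package accounting

import (
	"encoding/binary"
	"fmt"
)

// systemStateKey builds a permanent-counter key of the form
// <key-prefix>|acctd<metric-name>; the leading pipe sorts these after
// other system tables under the same prefix.
func systemStateKey(keyPrefix, metric string) []byte {
	return []byte(keyPrefix + "|acctd" + metric)
}

// loadMetricKey builds an hourly perf/load key of the form
// <key-prefix>acctd<metric-name><hourly-timestamp>.
func loadMetricKey(keyPrefix, metric string, hour int64) []byte {
	return []byte(fmt.Sprintf("%sacctd%s%d", keyPrefix, metric, hour))
}

// appendMinuteSample appends one varint64 sample, for a minute with
// activity, to an hourly multi-valued entry.
func appendMinuteSample(entry []byte, value int64) []byte {
	var buf [binary.MaxVarintLen64]byte
	n := binary.PutVarint(buf[:], value)
	return append(entry, buf[:n]...)
}
```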

To efficiently keep accounting over large key ranges, the task of
aggregation must be distributed. If activity occurs within the same
range as the key prefix for accounting, the updates are made as part
of the consensus write. If the ranges differ, then a message is sent
to the parent range to increment the accounting. If, upon receiving the
message, the parent range also does not include the key prefix, it in
turn forwards it to its parent or left child in the balanced binary
tree which is maintained to describe the range hierarchy. This limits
the number of messages before an update is visible at the root to `2*log N`,
where `N` is the number of ranges in the key prefix.
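
A minimal sketch of that forwarding decision, assuming a hypothetical
`rangeNode` type for the balanced binary tree; the choice between the
parent and the left child is simplified here and should not be read as
the definitive routing rule.

```go
package accounting

// rangeNode is a hypothetical node in the balanced binary tree that
// describes the range hierarchy for a given accounting key prefix.
type rangeNode struct {
	containsPrefix bool       // does this range include the accounting key prefix?
	parent         *rangeNode // nil at the root of the tree
	leftChild      *rangeNode
}

// applyLocally would fold the increment into the range's consensus write.
func (r *rangeNode) applyLocally(metric string, delta int64) { /* ... */ }

// forward would send the increment to another range as a message.
func (r *rangeNode) forward(to *rangeNode, metric string, delta int64) { /* ... */ }

// recordIncrement applies an accounting update locally when this range
// includes the key prefix; otherwise it forwards the update one step
// through the tree (toward the parent, or the left child at the root).
// The number of such hops before the update is visible at the root is
// bounded by 2*log N.
func (r *rangeNode) recordIncrement(metric string, delta int64) {
	if r.containsPrefix {
		r.applyLocally(metric, delta)
		return
	}
	if r.parent != nil {
		r.forward(r.parent, metric, delta)
	} else if r.leftChild != nil {
		r.forward(r.leftChild, metric, delta)
	}
}
```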

## Zones

Zones are stored in the map with keys prefixed by
`\0zone` followed by the key prefix to which the zone
configuration applies. Zone values specify a protobuf containing
the datacenters from which replicas for ranges which fall under
the zone must be chosen.

Please see [config/config.proto](https://github.com/cockroachdb/cockroach/blob/master/config/config.proto) for up-to-date data structures used, the best entry point being `message ZoneConfig`.

If zones are modified in situ, each RoachNode verifies the
existing zones for its ranges against the zone configuration. If
it discovers differences, it reconfigures ranges in the same way
that it rebalances away from busy nodes, via a special-case 1:1
split to a duplicate range comprising the new configuration.
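
To make the key layout concrete, here is a small Go sketch of building a
zone key together with an illustrative stand-in for the zone
configuration; the struct below is not the real `ZoneConfig` message
(see config/config.proto for that), and the datacenter names are
placeholders.

```go
package zones

// zoneKey builds the key under which a zone configuration is stored:
// \0zone followed by the key prefix the configuration applies to.
func zoneKey(keyPrefix string) []byte {
	return append([]byte("\x00zone"), keyPrefix...)
}

// zoneConfig is an illustrative stand-in for message ZoneConfig: it
// lists the datacenters from which replicas of ranges falling under the
// zone must be chosen.
type zoneConfig struct {
	ReplicaDatacenters []string
}

// Example: replicas for keys under "db1" must be placed in these
// datacenters (values chosen arbitrarily for illustration).
var db1Zone = zoneConfig{
	ReplicaDatacenters: []string{"us-east-1a", "us-east-1b", "us-west-1a"},
}
```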

# SQL

Each node in a cluster can accept SQL client connections. CockroachDB
supports the PostgreSQL wire protocol, to enable reuse of native
PostgreSQL client drivers. Connections using SSL and authenticated
using client certificates are supported and even encouraged over
unencrypted (insecure) and password-based connections.

Each connection is associated with a SQL session which holds the
server-side state of the connection. Over the lifespan of a session
the client can send SQL to open/close transactions, issue statements
or queries, or configure session parameters, much like with any other
SQL database.
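
Because the server speaks the PostgreSQL wire protocol, any native
PostgreSQL driver can be used unchanged. The Go sketch below uses the
widely available `lib/pq` driver; the user name, database name, address,
port, and certificate paths are placeholders that depend entirely on the
local deployment.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // a standard native PostgreSQL driver, reused as-is
)

func main() {
	// Placeholder connection string; certificate-based SSL is preferred
	// over insecure or password-based connections.
	dsn := "postgresql://myuser@localhost:26257/mydb?" +
		"sslmode=verify-full&sslcert=client.crt&sslkey=client.key&sslrootcert=ca.crt"

	db, err := sql.Open("postgres", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var one int
	if err := db.QueryRow("SELECT 1").Scan(&one); err != nil {
		log.Fatal(err)
	}
	fmt.Println("connected:", one)
}
```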

## Language support

CockroachDB also attempts to emulate the flavor of SQL supported by
PostgreSQL, although it diverges in significant ways:

- CockroachDB exclusively implements MVCC-based consistency for
  transactions, and thus only supports SQL's isolation levels SNAPSHOT
  and SERIALIZABLE. The other traditional SQL isolation levels are
  internally mapped to either SNAPSHOT or SERIALIZABLE.

- CockroachDB implements its own [SQL type system](RFCS/typing.md)
  which only supports a limited form of implicit coercions between
  types compared to PostgreSQL. The rationale is to keep the
  implementation simple and efficient, capitalizing on the observation
  that 1) most SQL code in clients is automatically generated with
  coherent typing already and 2) existing SQL code for other databases
  will need to be massaged for CockroachDB anyway.

## SQL architecture

Client connections over the network are handled in each node by a
pgwire server process (goroutine). This process handles the stream of
incoming commands and sends back responses including query/statement
results. The pgwire server also handles pgwire-level prepared
statements, binding prepared statements to arguments and looking up
prepared statements for execution.

Meanwhile, the state of a SQL connection is maintained by a Session
object and a monolithic `planner` object (one per connection) which
coordinates execution between the session, the current SQL transaction
state and the underlying KV store.

Upon receiving a query/statement (either directly or via an execute
command for a previously prepared statement) the pgwire server forwards
the SQL text to the `planner` associated with the connection. The SQL
code is then transformed into a SQL query plan.
The query plan is implemented as a tree of objects which describe the
high-level data operations needed to resolve the query, for example
"join", "index join", "scan", "group", etc.

The query plan objects currently also embed the run-time state needed
for the execution of the query plan. Once the SQL query plan is ready,
methods on these objects then carry out the execution in the fashion
of "generators" in other programming languages: each node *starts* its
child nodes and from that point forward each child node serves as a
*generator* for a stream of result rows, which the parent node can
consume and transform incrementally and present to its own parent node
also as a generator.

The top-level planner consumes the data produced by the top node of
the query plan and returns it to the client via pgwire.
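
As a rough sketch of the generator pattern described above (the
interface and method names here are illustrative, not the actual
planner interfaces):

```go
package sql

// planNode is an illustrative generator-style interface for query plan
// nodes: a parent starts its children, then repeatedly pulls rows.
type planNode interface {
	// Start prepares the node (and, recursively, its children) for execution.
	Start() error
	// Next advances to the next result row, returning false when exhausted.
	Next() (bool, error)
	// Values returns the current result row.
	Values() []interface{}
}

// runPlan drains the top node of a plan tree and hands each row to emit,
// much like the top-level planner streaming results back via pgwire.
func runPlan(top planNode, emit func(row []interface{}) error) error {
	if err := top.Start(); err != nil {
		return err
	}
	for {
		ok, err := top.Next()
		if err != nil {
			return err
		}
		if !ok {
			return nil
		}
		if err := emit(top.Values()); err != nil {
			return err
		}
	}
}
```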

## Data mapping between the SQL model and KV

Every SQL table has a primary key in CockroachDB. (If a table is created
without one, an implicit primary key is provided automatically.)
The table identifier, followed by the value of the primary key for
each row, is encoded as the *prefix* of a key in the underlying KV
store.

Each remaining column or *column family* in the table is then encoded
as a value in the underlying KV store, and the column/family identifier
is appended as a *suffix* to the KV key.

For example:

- after table `customers` is created in a database `mydb` with a
  primary key column `name` and normal columns `address` and `URL`, the
  KV pairs to store the schema would be:

| Key                          | Values |
| ---------------------------- | ------ |
| `/system/databases/mydb/id`  | 51     |
| `/system/tables/customer/id` | 42     |
| `/system/desc/51/42/address` | 69     |
| `/system/desc/51/42/url`     | 66     |

(The numeric values on the right are chosen arbitrarily for the
example; the structure of the schema keys on the left is simplified
for the example and subject to change.) Each database/table/column
name is mapped to a spontaneously generated identifier, so as to
simplify renames.

Then for a single row in this table:

| Key               | Values                           |
| ----------------- | -------------------------------- |
| `/51/42/Apple/69` | `1 Infinite Loop, Cupertino, CA` |
| `/51/42/Apple/66` | `http://apple.com/`              |

Each key has the table prefix `/51/42` followed by the primary key
prefix `/Apple` followed by the column/family suffix (`/66`,
`/69`). The KV value is directly encoded from the SQL value.

Efficient storage for the keys is guaranteed by the underlying RocksDB
engine by means of prefix compression.

Finally, for SQL indexes, the KV key is formed using the SQL value of the
indexed columns, and the KV value is the KV key prefix of the rest of
the indexed row.
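
The following Go sketch makes the example mapping above concrete by
producing the same kind of keys shown for the `customers` row. The
helpers are simplified stand-ins for the actual key encoding routines
(which use a compact binary format, not readable strings), and the
index-ID component in `encodeIndexEntry` is an assumption added to keep
index keys distinct from row keys.

```go
package sqlkv

import "fmt"

// kvPair is one key/value entry in the underlying monolithic sorted map.
type kvPair struct {
	Key   string
	Value string
}

// encodeRow maps one SQL row to its KV pairs: the table prefix and the
// primary-key value form the key prefix, and each non-primary column's
// identifier is appended as a suffix.
func encodeRow(dbID, tableID int, primaryKey string, cols map[int]string) []kvPair {
	var pairs []kvPair
	for colID, value := range cols {
		key := fmt.Sprintf("/%d/%d/%s/%d", dbID, tableID, primaryKey, colID)
		pairs = append(pairs, kvPair{Key: key, Value: value})
	}
	return pairs
}

// encodeIndexEntry maps one secondary-index entry: the indexed column
// value forms the key, and the value points back at the row's key prefix.
func encodeIndexEntry(dbID, tableID, indexID int, indexedValue, primaryKey string) kvPair {
	return kvPair{
		Key:   fmt.Sprintf("/%d/%d/%d/%s", dbID, tableID, indexID, indexedValue),
		Value: fmt.Sprintf("/%d/%d/%s", dbID, tableID, primaryKey),
	}
}
```

For the example above, `encodeRow(51, 42, "Apple", map[int]string{69:
"1 Infinite Loop, Cupertino, CA", 66: "http://apple.com/"})` yields the
two keys `/51/42/Apple/69` and `/51/42/Apple/66`.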