# About

This document is an updated version of the original design documents
by Spencer Kimball from early 2014. It may not always be completely up to date.
For a more approachable explanation of how CockroachDB works, consider reading
the [Architecture docs](https://www.cockroachlabs.com/docs/stable/architecture/overview.html).

# Overview

CockroachDB is a distributed SQL database. The primary design goals
are **scalability**, **strong consistency** and **survivability**
(hence the name). CockroachDB aims to tolerate disk, machine, rack, and
even **datacenter failures** with minimal latency disruption and **no
manual intervention**. CockroachDB nodes are symmetric; a design goal is
**homogeneous deployment** (one binary) with minimal configuration and
no required external dependencies.

The entry point for database clients is the SQL interface. Every node
in a CockroachDB cluster can act as a client SQL gateway. A SQL
gateway transforms client SQL statements into key-value (KV)
operations, which it executes and distributes across the cluster as
necessary, returning results to the client. CockroachDB implements a
**single, monolithic sorted map** from key to value where both keys
and values are byte strings.

The KV map is logically composed of smaller segments of the keyspace called
ranges. Each range is backed by data stored in a local KV storage engine (we
use [RocksDB](http://rocksdb.org/), a variant of
[LevelDB](https://github.com/google/leveldb)). Range data is replicated to a
configurable number of additional CockroachDB nodes. Ranges are merged and
split to maintain a target size, by default `64M`. The relatively small size
facilitates quick repair and rebalancing to address node failures, new capacity
and even read/write load. However, the size must be balanced against the
pressure on the system from having more ranges to manage.

CockroachDB achieves horizontal scalability:

- adding more nodes increases the capacity of the cluster by the
  amount of storage on each node (divided by a configurable
  replication factor), theoretically up to 4 exabytes (4E) of logical
  data;
- client queries can be sent to any node in the cluster, and queries
  can operate independently (without conflicts), meaning that overall
  throughput scales linearly with the number of nodes in the cluster;
- queries are distributed (ref: distributed SQL) so that the overall
  throughput of single queries can be increased by adding more nodes.

CockroachDB achieves strong consistency:

- it uses a distributed consensus protocol for synchronous replication of
  data in each key value range. We’ve chosen to use the [Raft
  consensus algorithm](https://raftconsensus.github.io); all consensus
  state is stored in RocksDB;
- single or batched mutations to a single range are mediated via the
  range's Raft instance. Raft guarantees ACID semantics;
- logical mutations which affect multiple ranges employ distributed
  transactions for ACID semantics. CockroachDB uses an efficient
  **non-locking distributed commit** protocol.

CockroachDB achieves survivability:

- range replicas can be co-located within a single datacenter for low
  latency replication and survive disk or machine failures. They can
  be distributed across racks to survive some network switch failures;
- range replicas can be located in datacenters spanning increasingly
  disparate geographies to survive ever-greater failure scenarios, from
  datacenter power or networking loss to regional power failures
  (e.g. `{ US-East-1a, US-East-1b, US-East-1c }`, `{ US-East, US-West,
  Japan }`, `{ Ireland, US-East, US-West }`, `{ Ireland, US-East,
  US-West, Japan, Australia }`).

CockroachDB provides [snapshot
isolation](http://en.wikipedia.org/wiki/Snapshot_isolation) (SI) and
serializable snapshot isolation (SSI) semantics, allowing **externally
consistent, lock-free reads and writes**--both from a historical snapshot
timestamp and from the current wall clock time. SI provides lock-free reads
and writes but still allows write skew. SSI eliminates write skew, but
introduces a performance hit in the case of a contended system. SSI is the
default isolation; clients must consciously decide to trade correctness for
performance. CockroachDB implements [a limited form of linearizability
](#strict-serializability-linearizability), providing ordering for any
observer or chain of observers.

Similar to
[Spanner](http://static.googleusercontent.com/media/research.google.com/en/us/archive/spanner-osdi2012.pdf)
directories, CockroachDB allows configuration of arbitrary zones of data.
This allows replication factor, storage device type, and/or datacenter
location to be chosen to optimize performance and/or availability.
Unlike Spanner, zones are monolithic and don’t allow movement of
fine-grained data on the level of entity groups.

# Architecture

CockroachDB implements a layered architecture. The highest level of
abstraction is the [SQL layer](#sql),
which provides familiar relational concepts
such as schemas, tables, columns, and indexes. The SQL layer
in turn depends on the [distributed key value store](#key-value-api),
which handles the details of range addressing to provide the abstraction
of a single, monolithic key value store. The distributed KV store
communicates with any number of physical cockroach nodes. Each node
contains one or more stores, one per physical device.

![Architecture](media/architecture.png)

Each store contains potentially many ranges, the lowest-level unit of
key-value data. Ranges are replicated using the Raft consensus protocol.
The diagram below is a blown-up version of stores from four of the five
nodes in the previous diagram. Each range is replicated three ways using
Raft. The color coding shows associated range replicas.

![Ranges](media/ranges.png)

Each physical node exports two RPC-based key value APIs: one for
external clients and one for internal clients (exposing sensitive
operational features). Both services accept batches of requests and
return batches of responses. Nodes are symmetric in capabilities and
exported interfaces; each has the same binary and may assume any
role.

Nodes and the ranges they provide access to can be arranged with various
physical network topologies to make trade-offs between reliability and
performance. For example, a triplicated (3-way replica) range could have
each replica located on different:

- disks within a server to tolerate disk failures.
- servers within a rack to tolerate server failures.
- servers on different racks within a datacenter to tolerate rack power/network failures.
- servers in different datacenters to tolerate large scale network or power outages.

Up to `F` failures can be tolerated, where the total number of replicas `N = 2F + 1` (e.g. with 3x replication, one failure can be tolerated; with 5x replication, two failures, and so on).

# Keys

Cockroach keys are arbitrary byte arrays. Keys come in two flavors:
system keys and table data keys. System keys are used by Cockroach for
internal data structures and metadata. Table data keys contain SQL
table data (as well as index data). System and table data keys are
prefixed in such a way that all system keys sort before any table data
keys.

System keys come in several subtypes:

- **Global** keys store cluster-wide data such as the "meta1" and
  "meta2" keys as well as various other system-wide keys such as the
  node and store ID allocators.
- **Store local** keys are used for unreplicated store metadata
  (e.g. the `StoreIdent` structure). "Unreplicated" indicates that
  these values are not replicated across multiple stores because the
  data they hold is tied to the lifetime of the store they are
  present on.
- **Range local** keys store range metadata that is associated with a
  global key. Range local keys have a special prefix followed by a
  global key and a special suffix. For example, transaction records
  are range local keys which look like
  `\x01k<global-key>txn-<txnID>` (see the sketch below).
- **Replicated Range ID local** keys store range metadata that is
  present on all of the replicas for a range. These keys are updated
  via Raft operations. Examples include the range lease state and
  abort cache entries.
- **Unreplicated Range ID local** keys store range metadata that is
  local to a replica. The primary examples of such keys are the Raft
  state and Raft log.

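To make the key anatomy concrete, here is a minimal sketch (not the actual
CockroachDB encoding routines, which live in `pkg/keys`) of how a range local
transaction record key of the form `\x01k<global-key>txn-<txnID>` could be
assembled; the prefix and suffix constants are illustrative assumptions.

```go
package main

import "fmt"

// Illustrative constants; the real prefixes are defined in pkg/keys.
var (
	rangeLocalPrefix = []byte("\x01k") // precedes the embedded global key
	txnSuffix        = []byte("txn-")  // marks a transaction record
)

// makeTxnKey assembles a range local transaction record key of the form
// \x01k<global-key>txn-<txnID>, following the anatomy described above.
func makeTxnKey(globalKey, txnID []byte) []byte {
	k := append([]byte{}, rangeLocalPrefix...)
	k = append(k, globalKey...)
	k = append(k, txnSuffix...)
	return append(k, txnID...)
}

func main() {
	fmt.Printf("%q\n", makeTxnKey([]byte("/table/51/1/42"), []byte("uuid-1234")))
}
```
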
Table data keys are used to store all SQL data. Table data keys
contain internal structure as described in the section on [mapping
data between the SQL model and
KV](#data-mapping-between-the-sql-model-and-kv).

# Versioned Values

Cockroach maintains historical versions of values by storing them with
associated commit timestamps. Reads and scans can specify a snapshot
time to return the most recent writes prior to the snapshot timestamp.
Older versions of values are garbage collected by the system during
compaction according to a user-specified expiration interval. In order
to support long-running scans (e.g. for MapReduce), all versions have a
minimum expiration.

Versioned values are supported via modifications to RocksDB to record
commit timestamps and GC expirations per key.

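The read semantics can be sketched in a few lines of Go: versions of a key are
kept alongside their commit timestamps, and a read at a snapshot timestamp
returns the newest version at or below that timestamp. The types and in-memory
slice below are illustrative assumptions, not the actual storage format.

```go
package main

import "fmt"

// version is one committed value of a key together with its commit timestamp
// (an HLC timestamp in the real system, simplified to an int64 here).
type version struct {
	ts  int64
	val string
}

// readAt returns the most recent version with ts <= snapshot, mirroring how a
// scan at a snapshot time sees only the writes that precede it. Versions are
// assumed to be sorted by ascending timestamp.
func readAt(versions []version, snapshot int64) (string, bool) {
	for i := len(versions) - 1; i >= 0; i-- {
		if versions[i].ts <= snapshot {
			return versions[i].val, true
		}
	}
	return "", false // no version of the key existed at that time
}

func main() {
	vs := []version{{ts: 10, val: "a"}, {ts: 20, val: "b"}, {ts: 35, val: "c"}}
	fmt.Println(readAt(vs, 25)) // "b" true: the newest write at or before 25
}
```
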
# Lock-Free Distributed Transactions

Cockroach provides distributed transactions without locks. Cockroach
transactions support two isolation levels:

- snapshot isolation (SI) and
- *serializable* snapshot isolation (SSI).

*SI* is simple to implement, highly performant, and correct for all but a
handful of anomalous conditions (e.g. write skew). *SSI* requires just a touch
more complexity, is still highly performant (less so with contention), and has
no anomalous conditions. Cockroach’s SSI implementation is based on ideas from
the literature and some possibly novel insights.

SSI is the default level, with SI provided for application developers
who are certain enough of their need for performance and the absence of
write skew conditions to consciously elect to use it. In a lightly
contended system, our implementation of SSI is just as performant as SI,
requiring no locking or additional writes. With contention, our
implementation of SSI still requires no locking, but will end up
aborting more transactions. Cockroach’s SI and SSI implementations
prevent starvation scenarios even for arbitrarily long transactions.

See the [Cahill paper](https://drive.google.com/file/d/0B9GCVTp_FHJIcEVyZVdDWEpYYXVVbFVDWElrYUV0NHFhU2Fv/edit?usp=sharing)
for one possible implementation of SSI. This is another [great paper](http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf).
For a discussion of SSI implemented by preventing read-write conflicts
(in contrast to detecting them, called write-snapshot isolation), see
the [Yabandeh paper](https://drive.google.com/file/d/0B9GCVTp_FHJIMjJ2U2t6aGpHLTFUVHFnMTRUbnBwc2pLa1RN/edit?usp=sharing),
which is the source of much inspiration for Cockroach’s SSI.

Both SI and SSI require that the outcome of reads must be preserved, i.e.
a write of a key at a lower timestamp than a previous read must not succeed. To
this end, each range maintains a bounded *in-memory* cache from key range to
the latest timestamp at which it was read.

Most updates to this *timestamp cache* correspond to keys being read, though
the timestamp cache also protects the outcome of some writes (notably range
deletions) which consequently must also populate the cache. The cache’s entries
are evicted oldest timestamp first, updating the low water mark of the cache
appropriately.

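A minimal sketch of the timestamp cache's role, assuming a per-key map and a
single low water mark (the real structure is an interval cache keyed on key
ranges): reads ratchet the cached timestamp upward, and a write below a cached
read timestamp is pushed to that later timestamp.

```go
package main

import "fmt"

// tsCache records, per key, the latest timestamp at which the key was read.
// lowWater is the timestamp below which all entries have been evicted.
type tsCache struct {
	lowWater int64
	reads    map[string]int64
}

// noteRead records a read of key at ts.
func (c *tsCache) noteRead(key string, ts int64) {
	if ts > c.reads[key] {
		c.reads[key] = ts
	}
}

// writeTimestamp returns the timestamp a write of key must use: the maximum of
// its proposed timestamp, the cache's low water mark, and the latest recorded
// read of the key. If this is later than the proposed timestamp, a
// serializable transaction has to restart.
func (c *tsCache) writeTimestamp(key string, proposed int64) int64 {
	ts := proposed
	if c.lowWater > ts {
		ts = c.lowWater
	}
	if readTS := c.reads[key]; readTS > ts {
		ts = readTS
	}
	return ts
}

func main() {
	c := &tsCache{lowWater: 5, reads: map[string]int64{}}
	c.noteRead("a", 12)
	fmt.Println(c.writeTimestamp("a", 9)) // 12: pushed to the timestamp of the prior read
	fmt.Println(c.writeTimestamp("b", 9)) // 9: no conflicting read recorded
	fmt.Println(c.writeTimestamp("c", 3)) // 5: below the cache's low water mark
}
```
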
Each Cockroach transaction is assigned a random priority and a
"candidate timestamp" at start. The candidate timestamp is the
provisional timestamp at which the transaction will commit, and is
chosen as the current clock time of the node coordinating the
transaction. This means that a transaction without conflicts will
usually commit with a timestamp that, in absolute time, precedes the
actual work done by that transaction.

In the course of coordinating a transaction between one or more
distributed nodes, the candidate timestamp may be increased, but will
never be decreased. The core difference between the two isolation levels
SI and SSI is that the former allows the transaction's candidate
timestamp to increase and the latter does not.

**Hybrid Logical Clock**

Each cockroach node maintains a hybrid logical clock (HLC) as discussed
in the [Hybrid Logical Clock paper](http://www.cse.buffalo.edu/tech-reports/2014-04.pdf).
HLC time uses timestamps which are composed of a physical component (thought of
as, and always close to, local wall time) and a logical component (used to
distinguish between events with the same physical component). It allows us to
track causality for related events similar to vector clocks, but with less
overhead. In practice, it works much like other logical clocks: when events
are received by a node, it informs the local HLC about the timestamp supplied
with the event by the sender, and when events are sent, a timestamp generated by
the local HLC is attached.

For a more in-depth description of HLC please read the paper. Our
implementation is [here](https://github.com/cockroachdb/cockroach/blob/master/pkg/util/hlc/hlc.go).

Cockroach picks a Timestamp for a transaction using HLC time. Throughout this
document, *timestamp* always refers to the HLC time, which is a singleton
on each node. The HLC is updated by every read/write event on the node, and
the HLC time >= wall time. A read/write timestamp received in a cockroach request
from another node is not only used to version the operation, but also updates
the HLC on the node. This is useful in guaranteeing that all data read/written
on a node is at a timestamp < next HLC time.

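As a compressed sketch of the update rules from the paper (assuming a physical
clock source and ignoring locking; the actual implementation in `pkg/util/hlc`
handles more cases), an HLC timestamp pairs a wall time with a logical counter,
and receiving a remote timestamp ratchets the local clock forward so that
subsequent local timestamps are larger.

```go
package main

import (
	"fmt"
	"time"
)

// Timestamp is an HLC timestamp: a physical wall time plus a logical counter
// used to order events that share the same wall time.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// Clock is a hybrid logical clock driven by a physical clock source.
type Clock struct {
	physical func() int64 // physical clock, e.g. wall time in nanoseconds
	ts       Timestamp    // latest timestamp handed out or observed
}

// Now returns a timestamp for a local or send event.
func (c *Clock) Now() Timestamp {
	if pt := c.physical(); pt > c.ts.WallTime {
		c.ts = Timestamp{WallTime: pt} // physical clock moved ahead: reset logical
	} else {
		c.ts.Logical++ // wall time unchanged: tick the logical component
	}
	return c.ts
}

// Update incorporates a timestamp received with an event from another node so
// that every timestamp subsequently generated locally is larger than it.
func (c *Clock) Update(remote Timestamp) {
	pt := c.physical()
	switch {
	case pt > c.ts.WallTime && pt > remote.WallTime:
		c.ts = Timestamp{WallTime: pt}
	case remote.WallTime > c.ts.WallTime:
		c.ts = Timestamp{WallTime: remote.WallTime, Logical: remote.Logical + 1}
	case c.ts.WallTime > remote.WallTime:
		c.ts.Logical++
	default: // equal wall times: take the larger logical value and tick it
		if remote.Logical > c.ts.Logical {
			c.ts.Logical = remote.Logical
		}
		c.ts.Logical++
	}
}

func main() {
	c := &Clock{physical: func() int64 { return time.Now().UnixNano() }}
	fmt.Println(c.Now())
	// A request arrives carrying a timestamp from a node whose clock is ahead.
	c.Update(Timestamp{WallTime: time.Now().UnixNano() + int64(time.Second), Logical: 3})
	fmt.Println(c.Now()) // at least one logical tick past the remote timestamp
}
```
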
**Transaction execution flow**

Transactions are executed in two phases:

1. Start the transaction by selecting a range which is likely to be
   heavily involved in the transaction and writing a new transaction
   record to a reserved area of that range with state "PENDING". In
   parallel, write an "intent" value for each datum being written as part
   of the transaction. These are normal MVCC values, with the addition of
   a special flag (i.e. “intent”) indicating that the value may be
   committed after the transaction itself commits. In addition,
   the transaction id (unique and chosen at txn start time by client)
   is stored with intent values. The txn id is used to refer to the
   transaction record when there are conflicts and to make
   tie-breaking decisions on ordering between identical timestamps.
   Each node returns the timestamp used for the write (which is the
   original candidate timestamp in the absence of read/write conflicts);
   the client selects the maximum from amongst all write timestamps as the
   final commit timestamp.

2. Commit the transaction by updating its transaction record. The value
   of the commit entry contains the candidate timestamp (increased as
   necessary to accommodate any latest read timestamps). Note that the
   transaction is considered fully committed at this point and control
   may be returned to the client.

   In the case of an SI transaction, a commit timestamp which was
   increased to accommodate concurrent readers is perfectly
   acceptable and the commit may continue. For SSI transactions,
   however, a gap between candidate and commit timestamps
   necessitates transaction restart (note: restart is different than
   abort--see below).

   After the transaction is committed, all written intents are upgraded
   in parallel by removing the “intent” flag. The transaction is
   considered fully committed before this step and does not wait for
   it to return control to the transaction coordinator.

In the absence of conflicts, this is the end. Nothing else is necessary
to ensure the correctness of the system.

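The two phases can be condensed into a coordinator-side sketch. This is
schematic Go against hypothetical interfaces (`beginTxnRecord`, `writeIntent`,
`commitTxnRecord`, and `resolveIntents` are stand-ins, not real CockroachDB
APIs); the point is the timestamp bookkeeping: the commit timestamp is the
maximum timestamp returned by any intent write, and for SSI a pushed timestamp
forces a restart.

```go
package txnsketch

import "errors"

// kv stands in for the distributed calls made by the coordinating gateway.
type kv interface {
	beginTxnRecord(txnID string, ts int64) error                 // write the PENDING record
	writeIntent(txnID, key, val string, ts int64) (int64, error) // returns the timestamp actually used
	commitTxnRecord(txnID string, commitTS int64) error          // flip the record to COMMITTED
	resolveIntents(txnID string, keys []string)                  // asynchronous intent upgrade/cleanup
}

var errRestart = errors.New("SSI restart: commit timestamp was pushed")

// runTxn sketches the two-phase flow: write the transaction record and intents
// (phase 1), then commit the transaction record (phase 2). In the real system
// the record and intents are written in parallel, not sequentially.
func runTxn(store kv, txnID string, writes map[string]string, candidateTS int64, serializable bool) error {
	if err := store.beginTxnRecord(txnID, candidateTS); err != nil {
		return err
	}
	commitTS := candidateTS
	keys := make([]string, 0, len(writes))
	for k, v := range writes {
		ts, err := store.writeIntent(txnID, k, v, candidateTS)
		if err != nil {
			return err
		}
		if ts > commitTS {
			commitTS = ts // a conflict pushed this write to a later timestamp
		}
		keys = append(keys, k)
	}
	// An SSI transaction cannot silently accept a pushed commit timestamp.
	if serializable && commitTS > candidateTS {
		return errRestart
	}
	// Committing the record makes the transaction visible; intent resolution
	// happens afterwards and is not required for correctness.
	if err := store.commitTxnRecord(txnID, commitTS); err != nil {
		return err
	}
	go store.resolveIntents(txnID, keys)
	return nil
}
```
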
**Conflict Resolution**

Things get more interesting when a reader or writer encounters an intent
record or newly-committed value in a location that it needs to read or
write. This is a conflict, usually causing one of the transactions to
abort or restart depending on the type of conflict.

***Transaction restart:***

This is the usual (and more efficient) type of behaviour and is used
except when the transaction was aborted (for instance by another
transaction).
In effect, that reduces to two cases; the first being the one outlined
above: an SSI transaction that finds upon attempting to commit that
its commit timestamp has been pushed. The second case involves a transaction
actively encountering a conflict, that is, one of its readers or writers
encounters data that necessitates conflict resolution
(see transaction interactions below).

When a transaction restarts, it changes its priority and/or moves its
timestamp forward depending on data tied to the conflict, and
begins anew reusing the same txn id. The prior run of the transaction might
have written some write intents, which need to be deleted before the
transaction commits, so as to not be included as part of the transaction.
These stale write intent deletions are done during the reexecution of the
transaction, either implicitly, through writing new intents to
the same keys as part of the reexecution of the transaction, or explicitly,
by cleaning up stale intents that are not part of the reexecution of the
transaction. Since most transactions will end up writing to the same keys,
the explicit cleanup run just before committing the transaction is usually
a NOOP.

***Transaction abort:***

This is the case in which a transaction, upon reading its transaction
record, finds that it has been aborted. In this case, the transaction
can not reuse its intents; it returns control to the client before
cleaning them up (other readers and writers would clean up dangling
intents as they encounter them) but will make an effort to clean up
after itself. The next attempt (if applicable) then runs as a new
transaction with **a new txn id**.

***Transaction interactions:***

There are several scenarios in which transactions interact:

- **Reader encounters write intent or value with newer timestamp far
  enough in the future**: This is not a conflict. The reader is free
  to proceed; after all, it will be reading an older version of the
  value and so does not conflict. Recall that the write intent may
  be committed with a later timestamp than its candidate; it will
  never commit with an earlier one. **Side note**: if an SI transaction
  reader finds an intent with a newer timestamp which the reader’s own
  transaction has written, the reader always returns that intent's value.

- **Reader encounters write intent or value with newer timestamp in the
  near future:** In this case, we have to be careful. The newer
  intent may, in absolute terms, have happened in our read's past if
  the clock of the writer is ahead of the node serving the values.
  In that case, we would need to take this value into account, but
  we just don't know. Hence the transaction restarts, using instead
  a future timestamp (but remembering a maximum timestamp used to
  limit the uncertainty window to the maximum clock offset). In fact,
  this is optimized further; see the details under "choosing a
  timestamp" below.

- **Reader encounters write intent with older timestamp**: the reader
  must follow the intent’s transaction id to the transaction record.
  If the transaction has already been committed, then the reader can
  just read the value. If the write transaction has not yet been
  committed, then the reader has two options. If the write conflict
  is from an SI transaction, the reader can *push that transaction's
  commit timestamp into the future* (and consequently not have to
  read it). This is simple to do: the reader just updates the
  transaction’s commit timestamp to indicate that when/if the
  transaction does commit, it should use a timestamp *at least* as
  high. However, if the write conflict is from an SSI transaction,
  the reader must compare priorities. If the reader has the higher priority,
  it pushes the transaction’s commit timestamp (that
  transaction will then notice its timestamp has been pushed, and
  restart). If it has the lower or same priority, it retries itself using as
  a new priority `max(new random priority, conflicting txn’s
  priority - 1)` (see the sketch after this list).

- **Writer encounters uncommitted write intent**:
  If the other write intent has been written by a transaction with a lower
  priority, the writer aborts the conflicting transaction. If the write
  intent has a higher or equal priority the transaction retries, using as a new
  priority `max(new random priority, conflicting txn’s priority - 1)`;
  the retry occurs after a short, randomized backoff interval.

- **Writer encounters newer committed value**:
  The committed value could also be an unresolved write intent made by a
  transaction that has already committed. The transaction restarts. On restart,
  the same priority is reused, but the candidate timestamp is moved forward
  to the encountered value's timestamp.

- **Writer encounters more recently read key**:
  The *read timestamp cache* is consulted on each write at a node. If the write’s
  candidate timestamp is earlier than the low water mark on the cache itself
  (i.e. its last evicted timestamp) or if the key being written has a read
  timestamp later than the write’s candidate timestamp, this later timestamp
  value is returned with the write. A new timestamp forces a transaction
  restart only if it is serializable.

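The retry rule referenced above is tiny but worth seeing concretely; a minimal
sketch, assuming integer priorities:

```go
package main

import (
	"fmt"
	"math/rand"
)

// retryPriority computes the priority a transaction adopts when it retries
// after losing a conflict: a fresh random priority, but never less than one
// below the priority of the transaction it conflicted with. Repeated losses
// therefore ratchet the retrying transaction's priority upward, which is what
// prevents starvation of long-running transactions.
func retryPriority(conflictingPriority int32) int32 {
	p := rand.Int31()
	if floor := conflictingPriority - 1; p < floor {
		p = floor
	}
	return p
}

func main() {
	fmt.Println(retryPriority(1 << 30)) // at least 1<<30 - 1, regardless of the random draw
}
```
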
**Transaction management**

Transactions are managed by the client proxy (or gateway in SQL Azure
parlance). Unlike in Spanner, writes are not buffered but are sent
directly to all implicated ranges. This allows the transaction to abort
quickly if it encounters a write conflict. The client proxy keeps track
of all written keys in order to resolve write intents asynchronously upon
transaction completion. If a transaction commits successfully, all intents
are upgraded to committed. In the event a transaction is aborted, all written
intents are deleted. The client proxy doesn’t guarantee it will resolve intents.

In the event the client proxy restarts before the pending transaction is
committed, the dangling transaction would continue to "live" until
aborted by another transaction. Transactions periodically heartbeat
their transaction record to maintain liveness.
Transactions encountered by readers or writers with dangling intents
which haven’t been heartbeat within the required interval are aborted.
In the event the proxy restarts after a transaction commits but before
the asynchronous resolution is complete, the dangling intents are upgraded
when encountered by future readers and writers and the system does
not depend on their timely resolution for correctness.

An exploration of retries with contention and abort times with abandoned
transactions is
[here](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing).

**Transaction Records**

Please see [pkg/roachpb/data.proto](https://github.com/cockroachdb/cockroach/blob/master/pkg/roachpb/data.proto) for the up-to-date structures, the best entry point being `message Transaction`.

**Pros**

- No requirement for reliable code execution to prevent stalled 2PC
  protocol.
- Readers never block with SI semantics; with SSI semantics, they may
  abort.
- Lower latency than traditional 2PC commit protocol (w/o contention)
  because second phase requires only a single write to the
  transaction record instead of a synchronous round to all
  transaction participants.
- Priorities avoid starvation for arbitrarily long transactions and
  always pick a winner from between contending transactions (no
  mutual aborts).
- Writes not buffered at client; writes fail fast.
- No read-locking overhead required for *serializable* SI (in contrast
  to other SSI implementations).
- Well-chosen (i.e. less random) priorities can flexibly give
  probabilistic guarantees on latency for arbitrary transactions
  (for example: make OLTP transactions 10x less likely to abort than
  low priority transactions, such as asynchronously scheduled jobs).

**Cons**

- Reads from non-lease holder replicas still require a ping to the lease holder
  to update the *read timestamp cache*.
- Abandoned transactions may block contending writers for up to the
  heartbeat interval, though average wait is likely to be
  considerably shorter (see [graph in link](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing)).
  This is likely considerably more performant than detecting and
  restarting 2PC in order to release read and write locks.
- Behavior different than other SI implementations: no first writer
  wins, and shorter transactions do not always finish quickly.
  Element of surprise for OLTP systems may be a problematic factor.
- Aborts can decrease throughput in a contended system compared with
  two-phase locking. Aborts and retries increase read and write
  traffic, increase latency and decrease throughput.

**Choosing a Timestamp**

A key challenge of reading data in a distributed system with clock offset
is choosing a timestamp guaranteed to be greater than the latest
timestamp of any committed transaction (in absolute time). No system can
claim consistency and fail to read already-committed data.

Accomplishing consistency for transactions (or just single operations)
accessing a single node is easy. The timestamp is assigned by the node
itself, so it is guaranteed to be at a greater timestamp than all the
existing timestamped data on the node.

For multiple nodes, the timestamp of the node coordinating the
transaction `t` is used. In addition, a maximum timestamp `t+ε` is
supplied to provide an upper bound on timestamps for already-committed
data (`ε` is the maximum clock offset). As the transaction progresses, any
data read which have timestamps greater than `t` but less than `t+ε`
cause the transaction to abort and retry with the conflicting timestamp
t<sub>c</sub>, where t<sub>c</sub> \> t. The maximum timestamp `t+ε` remains
the same. This implies that transaction restarts due to clock uncertainty
can only happen on a time interval of length `ε`.

We apply another optimization to reduce the restarts caused
by uncertainty. Upon restarting, the transaction not only takes
into account t<sub>c</sub>, but also the timestamp of the node at the time
of the uncertain read, t<sub>node</sub>. The larger of those two timestamps,
t<sub>c</sub> and t<sub>node</sub> (likely equal to the latter), is used
to increase the read timestamp. Additionally, the conflicting node is
marked as “certain”. Then, for future reads to that node within the
transaction, we set `MaxTimestamp = Read Timestamp`, preventing further
uncertainty restarts.

Correctness follows from the fact that we know that at the time of the read,
there exists no version of any key on that node with a higher timestamp than
t<sub>node</sub>. Upon a restart caused by the node, if the transaction
encounters a key with a higher timestamp, it knows that in absolute time,
the value was written after t<sub>node</sub> was obtained, i.e. after the
uncertain read. Hence the transaction can move forward reading an older version
of the data (at the transaction's timestamp). This limits the time uncertainty
restarts attributed to a node to at most one. The tradeoff is that we might
pick a timestamp larger than the optimal one (> highest conflicting timestamp),
resulting in the possibility of a few more conflicts.

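The uncertainty rules above can be condensed into a small sketch, assuming
scalar timestamps: a value in `(t, t+ε]` forces a restart at a higher read
timestamp, and once a node has been marked certain its reads no longer cause
uncertainty restarts.

```go
package main

import "fmt"

// txnReader tracks the read timestamp, the uncertainty ceiling t+ε, and the
// set of nodes already observed (and hence "certain").
type txnReader struct {
	readTS       int64
	maxTS        int64 // readTS + maximum clock offset ε at transaction start
	certainNodes map[string]bool
}

// observe is called when a read on nodeID encounters a committed value at
// valueTS. It reports whether the transaction must restart, and if so the new
// read timestamp to restart at.
func (t *txnReader) observe(nodeID string, valueTS, nodeNowTS int64) (restart bool, newReadTS int64) {
	if valueTS <= t.readTS || t.certainNodes[nodeID] || valueTS > t.maxTS {
		// Either already visible, the node is already marked certain, or the
		// value is far enough in the future to be safely ignored.
		return false, t.readTS
	}
	// Uncertain: the value may have been committed before our read in absolute
	// time. Restart at the larger of the value's timestamp and the node's
	// current time, and mark the node certain so it causes at most one restart.
	newReadTS = valueTS
	if nodeNowTS > newReadTS {
		newReadTS = nodeNowTS
	}
	t.certainNodes[nodeID] = true
	return true, newReadTS
}

func main() {
	r := &txnReader{readTS: 100, maxTS: 110, certainNodes: map[string]bool{}}
	fmt.Println(r.observe("n2", 105, 108)) // true 108: uncertain, restart at the node's time
	fmt.Println(r.observe("n2", 107, 109)) // false 100: n2 is now certain
	fmt.Println(r.observe("n3", 150, 151)) // false 100: beyond t+ε, definitely in our future
}
```
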
We expect retries will be rare, but this assumption may need to be
revisited if retries become problematic. Note that this problem does not
apply to historical reads. An alternate approach which does not require
retries makes a round to all node participants in advance and
chooses the highest reported node wall time as the timestamp. However,
knowing which nodes will be accessed in advance is difficult and
potentially limiting. Cockroach could also potentially use a global
clock (Google did this with [Percolator](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf)),
which would be feasible for smaller, geographically-proximate clusters.

# Strict Serializability (Linearizability)

Roughly speaking, the gap between <i>strict serializability</i> (which we use
interchangeably with <i>linearizability</i>) and CockroachDB's default
isolation level (<i>serializable</i>) is that with linearizable transactions,
causality is preserved. That is, if one transaction (say, creating a posting
for a user) waits for its predecessor (creating the user in the first place)
to complete, one would hope that the logical timestamp assigned to the former
is larger than that of the latter.
In practice, in distributed databases this may not hold; the reason typically
being that clocks across a distributed system are not perfectly synchronized
and the "later" transaction touches a part of the keyspace disjoint from that
on which the first transaction ran, leaving clocks with disjoint information
to decide on the commit timestamps.

In practice, in CockroachDB many transactional workloads are actually
linearizable, though the precise conditions are too involved to outline
here.

Causality is typically not required for many transactions, and so it is
advantageous to pay for it only when it *is* needed. CockroachDB implements
this via <i>causality tokens</i>: when committing a transaction, a causality
token can be retrieved and passed to the next transaction, ensuring that these
two transactions get assigned increasing logical timestamps.

Additionally, as better-synchronized clocks become a standard commodity offered
by cloud providers, CockroachDB can provide global linearizability by doing
much the same as [Google's
Spanner](http://research.google.com/archive/spanner.html) does: wait out the
maximum clock offset after committing, but before returning to the client.

See the blog post below for much more in-depth information.

https://www.cockroachlabs.com/blog/living-without-atomic-clocks/

# Logical Map Content

Logically, the map contains a series of reserved system key/value
pairs preceding the actual user data (which is managed by the SQL
subsystem).

- `\x02<key1>`: Range metadata for range ending `\x03<key1>`. This is a "meta1" key.
- ...
- `\x02<keyN>`: Range metadata for range ending `\x03<keyN>`. This is a "meta1" key.
- `\x03<key1>`: Range metadata for range ending `<key1>`. This is a "meta2" key.
- ...
- `\x03<keyN>`: Range metadata for range ending `<keyN>`. This is a "meta2" key.
- `\x04{desc,node,range,store}-idgen`: ID generation oracles for various component types.
- `\x04status-node-<varint encoded Node ID>`: Node runtime metadata.
- `\x04tsd<key>`: Time-series data key.
- `<key>`: A user key. In practice, these keys are managed by the SQL
  subsystem, which employs its own key anatomy.

# Stores and Storage

Nodes contain one or more stores. Each store should be placed on a unique disk.
Internally, each store contains a single instance of RocksDB with a block cache
shared amongst all of the stores in a node. These stores in turn hold
a collection of range replicas. More than one replica for a range will never
be placed on the same store or even the same node.

Early on, when a cluster is first initialized, the few default starting ranges
will only have a single replica, but as soon as other nodes are available they
will replicate to them until they've reached their desired replication factor,
the default being 3.

Zone configs can be used to control a range's replication factor and add
constraints as to where the range's replicas can be located. When there is a
change in a range's zone config, the range will up- or down-replicate to the
appropriate number of replicas and move its replicas to the appropriate stores
based on the zone config's constraints.

# Self Repair

If a store has not been heard from (gossiped its descriptor) in some time,
the default setting being 5 minutes, the cluster will consider this store to be
dead. When this happens, all replicas located on that store are considered
unavailable and removed. The affected ranges will then upreplicate
themselves to other available stores until their desired replication factor is
again met. If 50% or more of the replicas are unavailable at the same time,
there is no quorum and the whole range will be considered unavailable until
more than 50% of the replicas are again available.

# Rebalancing

As more data are added to the system, some stores may grow faster than others.
To combat this and to spread the overall load across the full cluster, replicas
will be moved between stores while maintaining the desired replication factor.
The heuristics used to perform this rebalancing include:

- the number of replicas per store
- the total size of the data used per store
- free space available per store

In the future, some other factors that might be considered include:

- cpu/network load per store
- ranges that are used together often in queries
- number of active ranges per store
- number of range leases held per store

# Range Metadata

The default approximate size of a range is 64M (2\^26 B). In order to
support 1P (2\^50 B) of logical data, metadata is needed for roughly
2\^(50 - 26) = 2\^24 ranges. A reasonable upper bound on range metadata
size is roughly 256 bytes (3\*12 bytes for the triplicated node
locations and 220 bytes for the range key itself). 2\^24 ranges \* 2\^8
B would require roughly 4G (2\^32 B) to store--too much to duplicate
between machines. Our conclusion is that range metadata must be
distributed for large installations.

To keep key lookups relatively fast in the presence of distributed metadata,
we store all the top-level metadata in a single range (the first range). These
top-level metadata keys are known as *meta1* keys, and are prefixed such that
they sort to the beginning of the key space. Given the metadata size of 256
bytes given above, a single 64M range would support 64M/256B = 2\^18 ranges,
which gives a total storage of 64M \* 2\^18 = 16T. To support the 1P quoted
above, we need two levels of indirection, where the first level addresses the
second, and the second addresses user data. With two levels of indirection, we
can address 2\^(18 + 18) = 2\^36 ranges; each range addresses 2\^26 B, and
altogether we address 2\^(36+26) B = 2\^62 B = 4E of user data.

For a given user-addressable `key1`, the associated *meta1* record is found
at the successor key to `key1` in the *meta1* space. Since the *meta1* space
is sparse, the successor key is defined as the next key which is present. The
*meta1* record identifies the range containing the *meta2* record, which is
found using the same process. The *meta2* record identifies the range
containing `key1`, which is again found the same way (see examples below).

Concretely, metadata keys are prefixed by `\x02` (meta1) and `\x03`
(meta2); the prefixes `\x02` and `\x03` provide for the desired
sorting behaviour. Thus, `key1`'s *meta1* record will reside at the
successor key to `\x02<key1>`.

Note: we append the end key of each range to meta{1,2} records because
the RocksDB iterator only supports a Seek() interface which acts as a
Ceil(). Using the start key of the range would cause Seek() to find the
key *after* the meta indexing record we’re looking for, which would
result in having to back the iterator up, an option which is both less
efficient and not available in all cases.

The following example shows the directory structure for a map with
three ranges worth of data. Ellipses indicate additional key/value
pairs to fill an entire range of data. For clarity, the examples use
`meta1` and `meta2` to refer to the prefixes `\x02` and `\x03`. Except
for the fact that splitting ranges requires updates to the range
metadata with knowledge of the metadata layout, the range metadata
itself requires no special treatment or bootstrapping.

**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
`dcrama3:8000`)

- `meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `meta2<lastkey0>`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `meta2<lastkey1>`: `dcrama4:8000`, `dcrama5:8000`, `dcrama6:8000`
- `meta2\xff`: `dcrama7:8000`, `dcrama8:8000`, `dcrama9:8000`
- ...
- `<lastkey0>`: `<lastvalue0>`

**Range 1** (located on servers `dcrama4:8000`, `dcrama5:8000`,
`dcrama6:8000`)

- ...
- `<lastkey1>`: `<lastvalue1>`

**Range 2** (located on servers `dcrama7:8000`, `dcrama8:8000`,
`dcrama9:8000`)

- ...
- `<lastkey2>`: `<lastvalue2>`

Consider a simpler example of a map containing less than a single
range of data. In this case, all range metadata and all data are
located in the same range:

**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
`dcrama3:8000`)

- `meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `meta2\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `<key0>`: `<value0>`
- ...

Finally, a map large enough to need both levels of indirection would
look like (note that instead of showing range replicas, this
example is simplified to just show range indexes):

**Range 0**

- `meta1<lastkeyN-1>`: Range 0
- `meta1\xff`: Range 1
- `meta2<lastkey1>`: Range 1
- `meta2<lastkey2>`: Range 2
- `meta2<lastkey3>`: Range 3
- ...
- `meta2<lastkeyN-1>`: Range 262143

**Range 1**

- `meta2<lastkeyN>`: Range 262144
- `meta2<lastkeyN+1>`: Range 262145
- ...
- `meta2\xff`: Range 500,000
- ...
- `<lastkey1>`: `<lastvalue1>`

**Range 2**

- ...
- `<lastkey2>`: `<lastvalue2>`

**Range 3**

- ...
- `<lastkey3>`: `<lastvalue3>`

**Range 262144**

- ...
- `<lastkeyN>`: `<lastvalueN>`

**Range 262145**

- ...
- `<lastkeyN+1>`: `<lastvalueN+1>`

Note that the choice of range `262144` is just an approximation. The
actual number of ranges addressable via a single metadata range is
dependent on the size of the keys. If efforts are made to keep key sizes
small, the total number of addressable ranges would increase and vice
versa.

From the examples above it’s clear that key location lookups require at
most three reads to get the value for `<key>`:

1. lower bound of `meta1<key>`
2. lower bound of `meta2<key>`
3. `<key>`

For small maps, the entire lookup is satisfied in a single RPC to Range 0. Maps
containing less than 16T of data would require two lookups. Clients cache both
levels of range metadata, and we expect that data locality for individual
clients will be high. Clients may end up with stale cache entries. If, on a
lookup, the range consulted does not match the client’s expectations, the
client evicts the stale entries and possibly does a new lookup.

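The lookup steps can be sketched with a sorted slice standing in for the
distributed meta ranges (an assumption for illustration; the real client uses a
range descriptor cache and RPCs): the addressing record for a key is the
successor key within the corresponding prefix.

```go
package main

import (
	"fmt"
	"sort"
)

// metaKey prefixes a key with \x02 (meta1) or \x03 (meta2), mirroring the
// layout above; each addressing record is stored at its range's *end* key.
func metaKey(level byte, key string) string {
	return string([]byte{level}) + key
}

// successor returns the first key in the sorted index >= k: the same
// "lower bound" / Seek()-style lookup used to find an addressing record.
func successor(index []string, k string) (string, bool) {
	i := sort.SearchStrings(index, k)
	if i == len(index) {
		return "", false // would fall off the end; the \xff sentinel prevents this
	}
	return index[i], true
}

func main() {
	// One meta1 record covering the whole meta2 space, and two meta2 records
	// covering user keys up to "m" and up to "\xff".
	meta1 := []string{metaKey(0x02, "\xff")}
	meta2 := []string{metaKey(0x03, "m"), metaKey(0x03, "\xff")}

	userKey := "banana"
	m1, _ := successor(meta1, metaKey(0x02, userKey)) // step 1: find the meta2 range
	m2, _ := successor(meta2, metaKey(0x03, userKey)) // step 2: find the data range
	fmt.Printf("meta1 record %q, meta2 record %q\n", m1, m2)
	// The meta2 record found (the range ending at "m") names the replicas
	// holding "banana"; step 3 reads the key from that range.
}
```
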
# Raft - Consistency of Range Replicas

Each range is configured to consist of three or more replicas, as specified by
their ZoneConfig. The replicas in a range maintain their own instance of a
distributed consensus algorithm. We use the [*Raft consensus algorithm*](https://raftconsensus.github.io)
as it is simpler to reason about and includes a reference implementation
covering important details.
[ePaxos](https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf) has
promising performance characteristics for WAN-distributed replicas, but
it does not guarantee a consistent ordering between replicas.

Raft elects a relatively long-lived leader which must be involved to
propose commands. It heartbeats followers periodically and keeps their logs
replicated. In the absence of heartbeats, followers become candidates
after randomized election timeouts and proceed to hold new leader
elections. Cockroach weights random timeouts such that the replicas with
shorter round trip times to peers are more likely to hold elections
first (not implemented yet). Only the Raft leader may propose commands;
followers will simply relay commands to the last known leader.

Our Raft implementation was developed together with CoreOS, but adds an extra
layer of optimization to account for the fact that a single Node may have
millions of consensus groups (one for each Range). Areas of optimization
are chiefly coalesced heartbeats (so that the number of nodes dictates the
number of heartbeats as opposed to the much larger number of ranges) and
batch processing of requests.
Future optimizations may include two-phase elections and quiescent ranges
(i.e. stopping traffic completely for inactive ranges).

# Range Leases

As outlined in the Raft section, the replicas of a Range are organized as a
Raft group and execute commands from their shared commit log. Going through
Raft is an expensive operation though, and there are tasks which should only be
carried out by a single replica at a time (as opposed to all of them).
In particular, it is desirable to serve authoritative reads from a single
Replica (ideally from more than one, but that is far more difficult).

For these reasons, Cockroach introduces the concept of **Range Leases**:
a lease held for a slice of (database, i.e. hybrid logical) time.
A replica establishes itself as owning the lease on a range by committing
a special lease acquisition log entry through Raft. The log entry contains
the replica node's epoch from the node liveness table--a system
table containing an epoch and an expiration time for each node. A node is
responsible for continuously updating the expiration time for its entry
in the liveness table. Once the lease has been committed through Raft,
the replica becomes the lease holder as soon as it applies the lease
acquisition command, guaranteeing that when it uses the lease it has
already applied all prior writes on the replica and can see them locally.

To prevent two nodes from acquiring the lease, the requestor includes a copy
of the lease that it believes to be valid at the time it requests the lease.
If that lease is still valid when the new lease is applied, it is granted,
or another lease is granted in the interim and the requested lease is
ignored. A lease can move from node A to node B only after node A's
liveness record has expired and its epoch has been incremented.

Note: range leases for ranges within the node liveness table keyspace and
all ranges that precede it, including meta1 and meta2, are not managed using
the above mechanism to prevent circular dependencies.

A replica holding a lease at a specific epoch can use the lease as long as
the node epoch hasn't changed and the expiration time hasn't passed.
The replica holding the lease may satisfy reads locally, without incurring the
overhead of going through Raft, and is in charge of or involved in handling
range-specific maintenance tasks such as splitting, merging and rebalancing.

All reads and writes are generally addressed to the replica holding
the lease; if none does, any replica may be addressed, causing it to try
to obtain the lease synchronously. Requests received by a non-lease holder
(for the HLC timestamp specified in the request's header) fail with an
error pointing at the replica's last known lease holder. These requests
are retried transparently with the updated lease by the gateway node and
never reach the client.

Since reads bypass Raft, a new lease holder will, among other things, ascertain
that its timestamp cache does not report timestamps smaller than the previous
lease holder's (so that it's compatible with reads which may have occurred on
the former lease holder). This is accomplished by letting leases enter
a <i>stasis period</i> (which is just the expiration minus the maximum clock
offset) before the actual expiration of the lease, so that all the next lease
holder has to do is set the low water mark of the timestamp cache to its
new lease's start time.

As a lease enters its stasis period, no more reads or writes are served, which
is undesirable. However, this would only happen in practice if a node became
unavailable. In almost all practical situations, no unavailability results
since leases are usually long-lived (and/or eagerly extended, which can avoid
the stasis period) or proactively transferred away from the lease holder, which
can also avoid the stasis period by promising not to serve any further reads
until the next lease goes into effect.

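A condensed sketch of the epoch-based validity rule, assuming scalar HLC
timestamps and ignoring the special-cased liveness and meta ranges: a lease is
usable for a request timestamp only if the owner's liveness epoch still matches
and the timestamp falls before the start of the stasis period (expiration minus
the maximum clock offset).

```go
package main

import "fmt"

// lease names an owning replica by node and pins that node's liveness epoch at
// acquisition time.
type lease struct {
	nodeID     string
	epoch      int64
	expiration int64 // expiration of the owner's liveness record
}

// usable reports whether the lease may serve a request at timestamp ts, given
// the node's current liveness epoch and the maximum clock offset. Requests in
// the stasis window [expiration-maxOffset, expiration) are rejected so a new
// lease holder can safely start its timestamp cache at its lease start time.
func (l lease) usable(ts, currentEpoch, maxOffset int64) bool {
	return currentEpoch == l.epoch && ts < l.expiration-maxOffset
}

func main() {
	l := lease{nodeID: "n1", epoch: 7, expiration: 1000}
	fmt.Println(l.usable(980, 7, 10)) // true: before the stasis period
	fmt.Println(l.usable(995, 7, 10)) // false: inside the stasis period
	fmt.Println(l.usable(980, 8, 10)) // false: the node's epoch has advanced
}
```
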
## Colocation with Raft leadership

The range lease is completely separate from Raft leadership, and so without
further efforts, Raft leadership and the range lease might not be held by the
same Replica. Since it's expensive to not have these two roles colocated (the
lease holder has to forward each proposal to the leader, adding costly RPC
round-trips), each lease renewal or transfer also attempts to colocate them.
In practice, that means that the mismatch is rare and self-corrects quickly.

## Command Execution Flow

This subsection describes how a lease holder replica processes a
read/write command in more detail. Each command specifies (1) a key
(or a range of keys) that the command accesses and (2) the ID of the
range which the key(s) belong to. When receiving a command, a node
looks up a range by the specified Range ID and checks if the range is
still responsible for the supplied keys. If any of the keys do not
belong to the range, the node returns an error so that the client will
retry and send a request to the correct range.

When all the keys belong to the range, the node attempts to
process the command. If the command is an inconsistent read-only
command, it is processed immediately. If the command is a consistent
read or a write, the command is executed when both of the following
conditions hold:

- The range replica has a range lease.
- There are no other running commands whose keys overlap with
  the submitted command and cause a read/write conflict.

When the first condition is not met, the replica attempts to acquire
a lease or returns an error so that the client will redirect the
command to the current lease holder. The second condition guarantees that
consistent read/write commands for a given key are sequentially
executed.

When the above two conditions are met, the lease holder replica processes the
command. Consistent reads are processed on the lease holder immediately.
Write commands are committed into the Raft log so that every replica
will execute the same commands. All commands produce deterministic
results so that the range replicas maintain consistent state among them.

When a write command completes, all the replicas update their response
cache to ensure idempotency. When a read command completes, the lease holder
replica updates its timestamp cache to keep track of the latest read
for a given key.

There is a chance that a range lease expires while a command is
being executed. Before executing a command, each replica checks if the
replica proposing the command still holds a valid lease. When the lease has
expired, the command will be rejected by the replica.

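The gating of consistent commands can be sketched as a simple check, with a set
of in-flight keys standing in for the real command queue (which is keyed on key
spans and distinguishes reads from writes); this is an illustrative assumption,
not the actual data structure.

```go
package main

import "fmt"

type replica struct {
	holdsLease bool
	inFlight   map[string]bool // keys touched by commands still running
}

// canExecute reports whether a consistent read or write touching the given
// keys may run now: the replica must hold the range lease and no overlapping
// command may still be in flight.
func (r *replica) canExecute(keys []string) bool {
	if !r.holdsLease {
		return false // redirect to the lease holder (or try to acquire the lease)
	}
	for _, k := range keys {
		if r.inFlight[k] {
			return false // wait: overlapping commands execute sequentially
		}
	}
	return true
}

func main() {
	r := &replica{holdsLease: true, inFlight: map[string]bool{"a": true}}
	fmt.Println(r.canExecute([]string{"b", "c"})) // true
	fmt.Println(r.canExecute([]string{"a"}))      // false: conflicts with a running command
}
```
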
# Splitting / Merging Ranges

Nodes split or merge ranges based on whether they exceed maximum or
minimum thresholds for capacity or load. Ranges exceeding maximums for
either capacity or load are split; ranges below minimums for *both*
capacity and load are merged.

Ranges maintain the same accounting statistics as accounting key
prefixes. These boil down to a time series of data points with minute
granularity, covering everything from number of bytes to read/write queue sizes.
Arbitrary distillations of the accounting stats can be determined as the
basis for splitting / merging. Two sensible metrics for use with
split/merge are range size in bytes and IOps. A good metric for
rebalancing a replica from one node to another would be total read/write
queue wait times. These metrics are gossipped, with each range / node
passing along relevant metrics if they’re in the bottom or top of the
range it’s aware of.

A range finding itself exceeding either capacity or load threshold
splits. To this end, the range lease holder computes an appropriate split key
candidate and issues the split through Raft. In contrast to splitting,
merging requires a range to be below the minimum threshold for both
capacity *and* load. A range being merged chooses the smaller of the
ranges immediately preceding and succeeding it.

Splitting, merging, rebalancing and recovering all follow the same basic
algorithm for moving data between roach nodes. New target replicas are
created and added to the replica set of the source range. Then each new
replica is brought up to date by either replaying the log in full or
copying a snapshot of the source replica data and then replaying the log
from the timestamp of the snapshot to catch up fully. Once the new
replicas are fully up to date, the range metadata is updated and old,
source replica(s) deleted if applicable.

**Coordinator** (lease holder replica)

```
if splitting
  SplitRange(split_key): splits happen locally on range replicas and
  only after being completed locally, are moved to new target replicas.
else if merging
  Choose new replicas on same servers as target range replicas;
  add to replica set.
else if rebalancing || recovering
  Choose new replica(s) on least loaded servers; add to replica set.
```

**New Replica**

*Bring replica up to date:*

```
if all info can be read from replicated log
  copy replicated log
else
  snapshot source replica
  send successive ReadRange requests to source replica
  referencing snapshot

if merging
  combine ranges on all replicas
else if rebalancing || recovering
  remove old range replica(s)
```

Nodes split ranges when the total data in a range exceeds a
configurable maximum threshold. Similarly, ranges are merged when the
total data falls below a configurable minimum threshold.

**TBD: flesh this out**: Especially for merges (but also rebalancing) we have a
range disappearing from the local node; that range needs to disappear
gracefully, with a smooth handoff of operation to the new owner of its data.

Ranges are rebalanced if a node determines its load or capacity is one
of the worst in the cluster based on gossipped load stats. A node with
spare capacity is chosen in the same datacenter and a special-case split
is done which simply duplicates the data 1:1 and resets the range
configuration metadata.

# Node Allocation (via Gossip)
998
999
New nodes must be allocated when a range is split. Instead of requiring
1000
every node to know about the status of all or even a large number
1001
of peer nodes --or-- alternatively requiring a specialized curator or
1002
master with sufficiently global knowledge, we use a gossip protocol to
1003
efficiently communicate only interesting information between all of the
1004
nodes in the cluster. What’s interesting information? One example would
1005
be whether a particular node has a lot of spare capacity. Each node,
1006
when gossiping, compares each topic of gossip to its own state. If its
1007
own state is somehow “more interesting” than the least interesting item
1008
in the topic it’s seen recently, it includes its own state as part of
1009
the next gossip session with a peer node. In this way, a node with
1010
capacity sufficiently in excess of the mean quickly becomes discovered
1011
by the entire cluster. To avoid piling onto outliers, nodes from the
1012
high capacity set are selected at random for allocation.
1013

The gossip protocol itself contains two primary components:

- **Peer Selection**: each node maintains up to N peers with which it
  regularly communicates. It selects peers with an eye towards
  maximizing fanout. A peer node which itself communicates with an
  array of otherwise unknown nodes will be selected over one which
  communicates with a set containing significant overlap. Each time
  gossip is initiated, each node’s set of peers is exchanged. Each
  node is then free to incorporate the other’s peers as it sees fit.
  To avoid any node suffering from excess incoming requests, a node
  may refuse to answer a gossip exchange. Each node is biased
  towards answering requests from nodes without significant overlap
  and refusing requests otherwise.

  Peers are efficiently selected using a heuristic as described in
  [Agarwal & Trachtenberg (2006)](https://drive.google.com/file/d/0B9GCVTp_FHJISmFRTThkOEZSM1U/edit?usp=sharing).

  **TBD**: how to avoid partitions? Need to work out a simulation of
  the protocol to tune the behavior and see empirically how well it
  works.

- **Gossip Selection**: what to communicate. Gossip is divided into
  topics. Load characteristics (capacity per disk, CPU load, and
  state [e.g. draining, ok, failure]) are used to drive node
  allocation. Range statistics (range read/write load, missing
  replicas, unavailable ranges) and network topology (inter-rack
  bandwidth/latency, inter-datacenter bandwidth/latency, subnet
  outages) are used for determining when to split ranges, when to
  recover replicas vs. wait for network connectivity, and for
  debugging / sysops. In all cases, a set of minimums and a set of
  maximums is propagated; each node applies its own view of the
  world to augment the values. Each minimum and maximum value is
  tagged with the reporting node and other accompanying contextual
  information. Each topic of gossip has its own protobuf to hold the
  structured data. The number of items of gossip in each topic is
  limited by a configurable bound.

  For efficiency, nodes assign each new item of gossip a sequence
  number and keep track of the highest sequence number each peer
  node has seen. Each round of gossip communicates only the delta
  containing new items (see the sketch after this list).
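
The following is a minimal sketch of that delta exchange, assuming hypothetical types and field names (the real implementation lives in `pkg/gossip` and differs in detail): each node stamps new items with a monotonically increasing sequence number and only sends a peer the items that peer has not yet seen.

```go
// Minimal sketch of delta-based gossip; types and names are illustrative
// assumptions, not the actual pkg/gossip implementation.
package gossip

// Info is one item of gossip within a topic.
type Info struct {
	Key    string
	Value  []byte
	SeqNo  int64 // sequence number assigned by the originating node
	NodeID int32 // reporting node
}

// store tracks local infos and the highest sequence number each peer has seen.
type store struct {
	infos    map[string]Info
	peerSeen map[int32]int64 // peer node ID -> highest seq no known to that peer
	nextSeq  int64
}

func newStore() *store {
	return &store{infos: map[string]Info{}, peerSeen: map[int32]int64{}}
}

// add records a new local item of gossip and stamps it with a fresh sequence number.
func (s *store) add(key string, value []byte, nodeID int32) {
	s.nextSeq++
	s.infos[key] = Info{Key: key, Value: value, SeqNo: s.nextSeq, NodeID: nodeID}
}

// delta returns only the items the given peer has not yet seen, so each round
// of gossip communicates new information rather than the whole topic.
func (s *store) delta(peerID int32) []Info {
	seen := s.peerSeen[peerID]
	var out []Info
	for _, info := range s.infos {
		if info.SeqNo > seen {
			out = append(out, info)
		}
	}
	return out
}
```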

# Node and Cluster Metrics

Every component of the system is responsible for exporting interesting
metrics about itself. These could be histograms, throughput counters, or
gauges.

These metrics are exported for external monitoring systems (such as Prometheus)
via an HTTP endpoint, but CockroachDB also implements an internal timeseries
database which is stored in the replicated key-value map.

Time series are stored at Store granularity and allow the admin dashboard
to efficiently gain visibility into a universe of information at the Cluster,
Node or Store level. A [periodic background process](RFCS/20160901_time_series_culling.md)
culls older timeseries data, downsampling and eventually discarding it.

# Zones

Zones provide a method for configuring the replication of portions of the
keyspace. Zone values specify a protobuf containing
the datacenters from which replicas must be chosen for ranges that fall
under the zone.

See
[pkg/config/zone.proto](https://github.com/cockroachdb/cockroach/blob/master/pkg/config/zone.proto)
for the up-to-date data structures used; the best entry point is
`message ZoneConfig`.
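
As a purely illustrative sketch of the kind of information a zone carries (the names below are assumptions for exposition; the authoritative definition is `message ZoneConfig` in `pkg/config/zone.proto`):

```go
// Purely illustrative sketch of a zone configuration; field names are
// assumptions for exposition, not the actual ZoneConfig message.
package config

// ReplicaAttrs lists required attributes (e.g. datacenter, disk type)
// for one replica of ranges falling under the zone.
type ReplicaAttrs struct {
	Attrs []string // e.g. {"us-east-1a", "ssd"}
}

// IllustrativeZoneConfig conveys where replicas may be placed for the
// portion of the keyspace covered by the zone.
type IllustrativeZoneConfig struct {
	// One entry per desired replica; the allocator must pick a node
	// matching each entry's attributes.
	ReplicaPlacement []ReplicaAttrs
}
```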

If zones are modified in situ, each node verifies the
existing zones for its ranges against the zone configuration. If
it discovers differences, it reconfigures ranges in the same way
that it rebalances away from busy nodes, via a special-case 1:1
split to a duplicate range comprising the new configuration.

# SQL

Each node in a cluster can accept SQL client connections. CockroachDB
supports the PostgreSQL wire protocol, to enable reuse of native
PostgreSQL client drivers. Connections using SSL and authenticated
using client certificates are supported and even encouraged over
unencrypted (insecure) and password-based connections.

Each connection is associated with a SQL session which holds the
server-side state of the connection. Over the lifespan of a session
the client can send SQL to open/close transactions, issue statements
or queries, or configure session parameters, much like with any other
SQL database.

## Language support

CockroachDB also attempts to emulate the flavor of SQL supported by
PostgreSQL, although it diverges in significant ways:

- CockroachDB exclusively implements MVCC-based consistency for
  transactions, and thus only supports SQL's isolation levels SNAPSHOT
  and SERIALIZABLE. The other traditional SQL isolation levels are
  internally mapped to either SNAPSHOT or SERIALIZABLE.

- CockroachDB implements its own [SQL type system](RFCS/20160203_typing.md)
  which only supports a limited form of implicit coercions between
  types compared to PostgreSQL. The rationale is to keep the
  implementation simple and efficient, capitalizing on the observation
  that 1) most SQL code in clients is automatically generated with
  coherent typing already and 2) existing SQL code for other databases
  will need to be massaged for CockroachDB anyway.

## SQL architecture

Client connections over the network are handled in each node by a
pgwire server process (goroutine). This handles the stream of incoming
commands and sends back responses including query/statement results.
The pgwire server also handles pgwire-level prepared statements,
binding prepared statements to arguments and looking up prepared
statements for execution.

Meanwhile the state of a SQL connection is maintained by a Session
object and a monolithic `planner` object (one per connection) which
coordinates execution between the session, the current SQL transaction
state and the underlying KV store.

Upon receiving a query/statement (either directly or via an execute
command for a previously prepared statement) the pgwire server forwards
the SQL text to the `planner` associated with the connection. The SQL
code is then transformed into a SQL query plan.
The query plan is implemented as a tree of objects which describe the
high-level data operations needed to resolve the query, for example
"join", "index join", "scan", "group", etc.

The query plan objects currently also embed the run-time state needed
for the execution of the query plan. Once the SQL query plan is ready,
methods on these objects then carry the execution out in the fashion
of "generators" in other programming languages: each node *starts* its
child nodes and from that point forward each child node serves as a
*generator* for a stream of result rows, which the parent node can
consume and transform incrementally and present to its own parent node
also as a generator.

The top-level planner consumes the data produced by the top node of
the query plan and returns it to the client via pgwire.
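
As a rough illustration of this generator model, here is a minimal sketch of what such a plan-node interface could look like; the method names and signatures are simplified assumptions, not the exact `planNode` interface in the source tree.

```go
// Minimal sketch of a generator-style plan node, assuming hypothetical method
// names; the actual planNode interface in the CockroachDB sources differs.
package sql

// Datum is a single SQL value; a row is a slice of datums.
type Datum interface{}

// planNode is one node of the query plan tree. A parent node drives its
// children by repeatedly asking for the next row, generator style.
type planNode interface {
	// Start initializes the node and, recursively, its children.
	Start() error
	// Next advances to the next result row, returning false when exhausted.
	Next() (bool, error)
	// Values returns the current row; valid only after Next returned true.
	Values() []Datum
}

// drain shows how a consumer (e.g. the top-level planner) pulls all rows
// from the top node of a plan and hands them to a callback such as pgwire.
func drain(n planNode, emit func([]Datum) error) error {
	if err := n.Start(); err != nil {
		return err
	}
	for {
		ok, err := n.Next()
		if err != nil || !ok {
			return err
		}
		if err := emit(n.Values()); err != nil {
			return err
		}
	}
}
```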

## Data mapping between the SQL model and KV

Every SQL table has a primary key in CockroachDB. (If a table is created
without one, an implicit primary key is provided automatically.)
The table identifier, followed by the value of the primary key for
each row, is encoded as the *prefix* of a key in the underlying KV
store.

Each remaining column or *column family* in the table is then encoded
as a value in the underlying KV store, and the column/family identifier
is appended as a *suffix* to the KV key.

For example:

- after table `customers` is created in a database `mydb` with a
  primary key column `name` and normal columns `address` and `URL`, the KV pairs
  to store the schema would be:

| Key                           | Values |
| ----------------------------- | ------ |
| `/system/databases/mydb/id`   | 51     |
| `/system/tables/customers/id` | 42     |
| `/system/desc/51/42/address`  | 69     |
| `/system/desc/51/42/url`      | 66     |

(The numeric values on the right are chosen arbitrarily for the
example; the structure of the schema keys on the left is simplified
for the example and subject to change.) Each database/table/column
name is mapped to a spontaneously generated identifier, so as to
simplify renames.

Then for a single row in this table:

| Key               | Values                           |
| ----------------- | -------------------------------- |
| `/51/42/Apple/69` | `1 Infinite Loop, Cupertino, CA` |
| `/51/42/Apple/66` | `http://apple.com/`              |

Each key has the table prefix `/51/42` followed by the primary key
prefix `/Apple` followed by the column/family suffix (`/66`,
`/69`). The KV value is directly encoded from the SQL value.

Efficient storage for the keys is guaranteed by the underlying RocksDB engine
by means of prefix compression.

Finally, for SQL indexes, the KV key is formed using the SQL value of the
indexed columns, and the KV value is the KV key prefix of the rest of
the indexed row.
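
A schematic sketch of how the keys in the example above could be assembled; the real encoding used by the SQL layer is an order-preserving binary format, not the readable strings shown here, and the helper below is purely illustrative:

```go
// Schematic sketch of the SQL-to-KV key mapping described above, using a
// readable string encoding for clarity; the real encoding is an
// order-preserving binary format.
package main

import "fmt"

// rowKVKey builds the key for one column (or column family) of one row:
// /<databaseID>/<tableID>/<primary key>/<columnID>.
func rowKVKey(tableID, columnID int, primaryKey string) string {
	const databaseID = 51 // arbitrary example identifier, as in the tables above
	return fmt.Sprintf("/%d/%d/%s/%d", databaseID, tableID, primaryKey, columnID)
}

func main() {
	// Reproduces the example row of the customers table (table ID 42):
	fmt.Println(rowKVKey(42, 69, "Apple")) // key for the address column
	fmt.Println(rowKVKey(42, 66, "Apple")) // key for the URL column
}
```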

## Distributed SQL

Dist-SQL is a new execution framework being developed as of Q3 2016 with the
goal of distributing the processing of SQL queries.
See the [Distributed SQL RFC](RFCS/20160421_distributed_sql.md)
for a detailed design of the subsystem; this section will serve as a summary.

Distributing the processing is desirable for multiple reasons:
- Remote-side filtering: when querying for a set of rows that match a filtering
  expression, instead of querying all the keys in certain ranges and processing
  the filters after receiving the data on the gateway node over the network,
  we'd like the filtering expression to be processed by the lease holder or
  remote node, saving on network traffic and related processing.
- For statements like `UPDATE .. WHERE` and `DELETE .. WHERE` we want to
  perform the query and the updates on the node which has the data (as opposed
  to receiving results at the gateway over the network, and then performing the
  update or deletion there, which involves additional round-trips).
- Parallelize SQL computation: when significant computation is required, we
  want to distribute it to multiple nodes, so that it scales with the amount of
  data involved. This applies to `JOIN`s, aggregation, sorting.

The approach we took was originally inspired by
[Sawzall](https://cloud.google.com/dataflow/model/programming-model) - a
project by Rob Pike et al. at Google that proposes a "shell" (high-level
language interpreter) to ease the exploitation of MapReduce. It provides a
clear separation between "local" processes which process a limited amount of
data and distributed computations, which are abstracted away behind a
restricted set of conceptual constructs.

To run SQL statements in a distributed fashion, we introduce a couple of concepts:
- _logical plan_ - similar on the surface to the `planNode` tree described in
  the [SQL](#sql) section, it represents the abstract (non-distributed) data flow
  through computation stages.
- _physical plan_ - a physical plan is conceptually a mapping of the _logical
  plan_ nodes to CockroachDB nodes. Logical plan nodes are replicated and
  specialized depending on the cluster topology. The components of the physical
  plan are scheduled and run on the cluster.

## Logical planning

The logical plan is made up of _aggregators_. Each _aggregator_ consumes an
_input stream_ of rows (or multiple streams for joins) and produces an _output
stream_ of rows. Both the input and the output streams have a set schema. The
streams are a logical concept and might not map to a single data stream in the
actual computation. Aggregators will be potentially distributed when converting
the *logical plan* to a *physical plan*; to express what distribution and
parallelization is allowed, an aggregator defines a _grouping_ on the data that
flows through it, expressing which rows need to be processed on the same node
(this mechanism constrains rows matching in a subset of columns to be
processed on the same node). This concept is useful for aggregators that need
to see some set of rows for producing output - e.g. the SQL aggregation
functions. An aggregator with no grouping is a special but important case in
which we are not aggregating multiple pieces of data, but we may be filtering,
transforming, or reordering individual pieces of data.

Special **table reader** aggregators with no inputs are used as data sources; a
table reader can be configured to output only certain columns, as needed.
A special **final** aggregator with no outputs is used for the results of the
query/statement.

To reflect the result ordering that a query has to produce, some aggregators
(`final`, `limit`) are configured with an **ordering requirement** on the input
stream (a list of columns with corresponding ascending/descending
requirements). Some aggregators (like `table readers`) can guarantee a certain
ordering on their output stream, called an **ordering guarantee**. All
aggregators have an associated **ordering characterization** function
`ord(input_order) -> output_order` that maps `input_order` (an ordering
guarantee on the input stream) into `output_order` (an ordering guarantee for
the output stream) - meaning that if the rows in the input stream are ordered
according to `input_order`, then the rows in the output stream will be ordered
according to `output_order`.

The ordering guarantee of the table readers along with the characterization
functions can be used to propagate ordering information across the logical plan.
When there is a mismatch (an aggregator has an ordering requirement that is not
matched by a guarantee), we insert a **sorting aggregator**.
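
A minimal sketch of this propagation, assuming hypothetical types: the ordering guarantee is threaded through each aggregator's characterization function, and a `SORT` aggregator is spliced in whenever a requirement is not satisfied.

```go
// Minimal sketch of propagating ordering guarantees through a logical plan and
// inserting a sorting aggregator on mismatch; all types are hypothetical.
package distsql

// ColumnOrder is one column index with a direction (true = descending).
type ColumnOrder struct {
	ColIdx     int
	Descending bool
}

// Ordering is a list of column orderings, most significant first.
type Ordering []ColumnOrder

// Aggregator is one node in a (linear, for simplicity) logical plan.
type Aggregator struct {
	Name         string
	Requirement  Ordering                // required input ordering, if any
	Characterize func(Ordering) Ordering // ord(input_order) -> output_order
}

// satisfies reports whether guarantee covers the required ordering as a prefix.
func satisfies(guarantee, required Ordering) bool {
	if len(guarantee) < len(required) {
		return false
	}
	for i := range required {
		if guarantee[i] != required[i] {
			return false
		}
	}
	return true
}

// propagate walks a chain of aggregators, threading the ordering guarantee
// through each characterization function and splicing in a SORT aggregator
// whenever a requirement is not met.
func propagate(source Ordering, plan []Aggregator) []Aggregator {
	var out []Aggregator
	guarantee := source
	for _, agg := range plan {
		if len(agg.Requirement) > 0 && !satisfies(guarantee, agg.Requirement) {
			req := agg.Requirement
			out = append(out, Aggregator{
				Name:         "SORT",
				Characterize: func(Ordering) Ordering { return req },
			})
			guarantee = req
		}
		out = append(out, agg)
		if agg.Characterize != nil {
			guarantee = agg.Characterize(guarantee)
		}
	}
	return out
}
```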

### Types of aggregators

- `TABLE READER` is a special aggregator, with no input stream. It's configured
  with spans of a table or index and the schema that it needs to read.
  Like every other aggregator, it can be configured with a programmable output
  filter.
- `JOIN` performs a join on two streams, with equality constraints between
  certain columns. The aggregator is grouped on the columns that are
  constrained to be equal.
- `JOIN READER` performs point-lookups for rows with the keys indicated by the
  input stream. It can do so by performing (potentially remote) KV reads, or by
  setting up remote flows.
- `SET OPERATION` takes several inputs and performs set arithmetic on them
  (union, difference).
- `AGGREGATOR` is the one that does "aggregation" in the SQL sense. It groups
  rows and computes an aggregate for each group. The group is configured using
  the group key. `AGGREGATOR` can be configured with one or more aggregation
  functions:
  - `SUM`
  - `COUNT`
  - `COUNT DISTINCT`
  - `DISTINCT`

  An optional output filter has access to the group key and all the
  aggregated values (i.e. it can use even values that are not ultimately
  outputted).
- `SORT` sorts the input according to a configurable set of columns.
  This is a no-grouping aggregator, hence it can be distributed arbitrarily to
  the data producers. This means that it doesn't produce a global ordering;
  instead it just guarantees an intra-stream ordering on each physical output
  stream. The global ordering, when needed, is achieved by an input
  synchronizer of a grouped processor (such as `LIMIT` or `FINAL`).
- `LIMIT` is a single-group aggregator that stops after reading a given number
  of input rows.
- `FINAL` is a single-group aggregator, scheduled on the gateway, that collects
  the results of the query. This aggregator will be hooked up to the pgwire
  connection to the client.

## Physical planning

Logical plans are transformed into physical plans in a *physical planning
phase*. See the [corresponding
section](RFCS/20160421_distributed_sql.md#from-logical-to-physical) of the Distributed SQL RFC
for details. To summarize, each aggregator is planned as one or more
*processors*, which we distribute starting from the data layout - `TABLE
READER`s have multiple instances, split according to the ranges - each instance
is planned on the lease holder of the relevant range. From that point on,
subsequent processors are generally either colocated with their inputs, or
planned as singletons, usually on the final destination node.

### Processors

When turning a _logical plan_ into a _physical plan_, its nodes are turned into
_processors_. Processors are generally made up of three components:

![Processor](RFCS/images/distributed_sql_processor.png?raw=true "Processor")

1. The *input synchronizer* merges the input streams into a single stream of
   data. Types:
   * single-input (pass-through)
   * unsynchronized: passes rows from all input streams, arbitrarily
     interleaved.
   * ordered: the input physical streams have an ordering guarantee (namely the
     guarantee of the corresponding logical stream); the synchronizer is careful
     to interleave the streams so that the merged stream has the same guarantee.

2. The *data processor* core implements the data transformation or aggregation
   logic (and in some cases performs KV operations).

3. The *output router* splits the data processor's output to multiple streams;
   types:
   * single-output (pass-through)
   * mirror: every row is sent to all output streams
   * hashing: each row goes to a single output stream, chosen according
     to a hash function applied on certain elements of the data tuples
     (see the sketch after this list).
   * by range: the router is configured with range information (relating to a
     certain table) and is able to send rows to the nodes that are lease holders for
     the respective ranges (useful for `JoinReader` nodes (taking index values
     to the node responsible for the PK) and `INSERT` (taking new rows to their
     lease holder-to-be)).
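
As a rough illustration of the hashing router type above (the types are assumptions for exposition): equal values in the key columns always hash to the same output stream, so a downstream grouped processor sees all matching rows.

```go
// Minimal sketch of a hashing output router; it picks one output stream per
// row by hashing the values of selected columns. Types are illustrative.
package main

import (
	"fmt"
	"hash/fnv"
)

// hashRouter routes each row to one of n output streams based on a hash of
// the configured key columns, so that equal keys always go to the same stream.
type hashRouter struct {
	keyCols []int           // indices of the columns to hash
	outputs []chan []string // one channel per output stream (illustrative row type)
}

func (r *hashRouter) route(row []string) {
	h := fnv.New32a()
	for _, c := range r.keyCols {
		h.Write([]byte(row[c]))
		h.Write([]byte{0}) // separator between column values
	}
	bucket := int(h.Sum32()) % len(r.outputs)
	r.outputs[bucket] <- row
}

func main() {
	outs := []chan []string{make(chan []string, 1), make(chan []string, 1)}
	r := &hashRouter{keyCols: []int{0}, outputs: outs}
	r.route([]string{"customer-42", "2015-06-01", "99"})
	for i, ch := range outs {
		select {
		case row := <-ch:
			fmt.Println("stream", i, "got", row)
		default:
		}
	}
}
```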

To illustrate with an example from the Distributed SQL RFC, the query:
```
TABLE Orders (OId INT PRIMARY KEY, CId INT, Value DECIMAL, Date DATE)

SELECT CID, SUM(VALUE) FROM Orders
WHERE DATE > 2015
GROUP BY CID
ORDER BY 1 - SUM(Value)
```

produces the following logical plan:

![Logical plan](RFCS/images/distributed_sql_logical_plan.png?raw=true "Logical Plan")

The logical plan above could be transformed into either one of the following
physical plans:

![Physical plan](RFCS/images/distributed_sql_physical_plan.png?raw=true "Physical Plan")

or

![Alternate physical plan](RFCS/images/distributed_sql_physical_plan_2.png?raw=true "Alternate physical Plan")

## Execution infrastructure

Once a physical plan has been generated, the system needs to divvy it up
between the nodes and send it around for execution. Each node is responsible
for locally scheduling data processors and input synchronizers. Nodes also
communicate with each other for connecting output routers to input
synchronizers through a streaming interface.

### Creating a local plan: the `ScheduleFlows` RPC

Distributed execution starts with the gateway making a request to every node
that's supposed to execute part of the plan, asking the node to schedule the
sub-plan(s) it's responsible for (except for "on-the-fly" flows, see design
doc). A node might be responsible for multiple disparate pieces of the overall
DAG - let's call each of them a *flow*. A flow is described by the sequence of
physical plan nodes in it, the connections between them (input synchronizers,
output routers) plus identifiers for the input streams of the top node in the
plan and the output streams of the (possibly multiple) bottom nodes. A node
might be responsible for multiple heterogeneous flows. More commonly, when a
node is the lease holder for multiple ranges from the same table involved in
the query, it will run a `TableReader` configured with all the spans to be
read across all the ranges local to the node.

A node therefore implements a `ScheduleFlows` RPC which takes a set of flows,
sets up the input and output [mailboxes](#mailboxes), creates the local
processors and starts their execution.

### Local scheduling of flows

The simplest way to schedule the different processors locally on a node is
concurrently: each data processor, synchronizer and router runs as a goroutine,
with channels between them. The channels are buffered to synchronize producers
and consumers to a controllable degree.
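
A minimal, self-contained sketch of this local scheduling model (illustrative stages, not the actual processor code): each stage is a goroutine, and buffered channels bound how far producers can run ahead of consumers.

```go
// Minimal sketch of local flow scheduling: each processor runs as a goroutine
// and buffered channels connect them. Stage names are illustrative.
package main

import (
	"fmt"
	"strings"
	"sync"
)

func main() {
	// Buffered channels bound how far a producer can run ahead of its consumer.
	source := make(chan string, 16)
	transformed := make(chan string, 16)

	var wg sync.WaitGroup
	wg.Add(2)

	// "Table reader" stage: emits rows then closes its output.
	go func() {
		defer wg.Done()
		defer close(source)
		for _, row := range []string{"apple", "banana", "cherry"} {
			source <- row
		}
	}()

	// "Data processor" stage: consumes rows, transforms them, forwards them.
	go func() {
		defer wg.Done()
		defer close(transformed)
		for row := range source {
			transformed <- strings.ToUpper(row)
		}
	}()

	// "Final" stage: the consumer on the gateway.
	for row := range transformed {
		fmt.Println(row)
	}
	wg.Wait()
}
```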

### Mailboxes

Flows on different nodes communicate with each other over gRPC streams. To
allow the producer and the consumer to start at different times,
`ScheduleFlows` creates named mailboxes for all the input and output streams.
These mailboxes will hold some number of tuples in an internal queue until
a gRPC stream is established for transporting them. From that moment on, gRPC
flow control is used to synchronize the producer and consumer. A gRPC stream is
established by the consumer using the `StreamMailbox` RPC, taking a mailbox id
(the same one that's already been used in the flows passed to `ScheduleFlows`).
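
A schematic sketch of a mailbox, using a channel in place of a gRPC stream (the types and method names are assumptions for exposition): rows are buffered until the consumer's stream is established and then drained in order.

```go
// Schematic sketch of a mailbox; the real implementation uses gRPC streams
// and gRPC flow control rather than channels. Types are illustrative.
package distsql

// row is an illustrative stand-in for one tuple flowing between processors.
type row []string

// mailbox buffers rows produced by a local flow until the remote consumer
// establishes a stream (in the real system, via the StreamMailbox RPC,
// identified by a mailbox id) and starts draining it.
type mailbox struct {
	id   string
	rows chan row // bounded buffer: producers block once it fills up
}

func newMailbox(id string, capacity int) *mailbox {
	return &mailbox{id: id, rows: make(chan row, capacity)}
}

// push is called by the producing flow for each output row; it blocks when
// the buffer is full, a simplified stand-in for flow control.
func (m *mailbox) push(r row) { m.rows <- r }

// close signals that the producer is done.
func (m *mailbox) close() { close(m.rows) }

// drain is called once the consumer's stream is established; send forwards
// each row over the (conceptual) gRPC stream.
func (m *mailbox) drain(send func(row) error) error {
	for r := range m.rows {
		if err := send(r); err != nil {
			return err
		}
	}
	return nil
}
```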

A diagram of a simple query using mailboxes for its execution:
![Mailboxes](RFCS/images/distributed_sql_mailboxes.png?raw=true)

## A complex example: Daily Promotion

To give a visual intuition of all the concepts presented, we draw the physical
plan of a relatively involved query. The point of the query is to help with a
promotion that goes out daily, targeting customers that have spent over $1000
in the last year. We'll insert into the `DailyPromotion` table rows
representing each such customer and the sum of her recent orders.

```SQL
TABLE DailyPromotion (
  Email TEXT,
  Name TEXT,
  OrderCount INT
)

TABLE Customers (
  CustomerID INT PRIMARY KEY,
  Email TEXT,
  Name TEXT
)

TABLE Orders (
  CustomerID INT,
  Date DATETIME,
  Value INT,

  PRIMARY KEY (CustomerID, Date),
  INDEX date (Date)
)

INSERT INTO DailyPromotion
(SELECT c.Email, c.Name, os.OrderCount FROM
  Customers AS c
  INNER JOIN
  (SELECT CustomerID, COUNT(*) as OrderCount FROM Orders
    WHERE Date >= '2015-01-01'
    GROUP BY CustomerID HAVING SUM(Value) >= 1000) AS os
  ON c.CustomerID = os.CustomerID)
```

A possible physical plan:
![Physical plan](RFCS/images/distributed_sql_daily_promotion_physical_plan.png?raw=true)