# About

This document is an updated version of the original design documents
by Spencer Kimball from early 2014.

# Overview

CockroachDB is a distributed SQL database. The primary design goals
are **scalability**, **strong consistency** and **survivability**
(hence the name). CockroachDB aims to tolerate disk, machine, rack, and
even **datacenter failures** with minimal latency disruption and **no
manual intervention**. CockroachDB nodes are symmetric; a design goal is
**homogeneous deployment** (one binary) with minimal configuration and
no required external dependencies.

The entry point for database clients is the SQL interface. Every node
in a CockroachDB cluster can act as a client SQL gateway. A SQL
gateway transforms and executes client SQL statements to key-value
(KV) operations, which the gateway distributes across the cluster as
necessary and returns results to the client. CockroachDB implements a
**single, monolithic sorted map** from key to value where both keys
and values are byte strings.

The KV map is logically composed of smaller segments of the keyspace
called ranges. Each range is backed by data stored in a local KV
storage engine (we use [RocksDB](http://rocksdb.org/), a variant of
LevelDB). Range data is replicated to a configurable number of
additional CockroachDB nodes. Ranges are merged and split to maintain
a target size, by default `64M`. The relatively small size facilitates
quick repair and rebalancing to address node failures, new capacity
and even read/write load. However, the size must be balanced against
the pressure on the system from having more ranges to manage.

CockroachDB achieves horizontal scalability:
- adding more nodes increases the capacity of the cluster by the
  amount of storage on each node (divided by a configurable
  replication factor), theoretically up to 4 exabytes (4E) of logical
  data;
- client queries can be sent to any node in the cluster, and queries
  can operate independently (without conflicts), meaning that overall
  throughput scales linearly with the number of nodes in the cluster;
- queries are distributed (ref: distributed SQL) so that the overall
  throughput of single queries can be increased by adding more nodes.

CockroachDB achieves strong consistency:
- uses a distributed consensus protocol for synchronous replication of
  data in each key value range. We’ve chosen to use the [Raft
  consensus algorithm](https://raftconsensus.github.io); all consensus
  state is stored in RocksDB.
- single or batched mutations to a single range are mediated via the
  range's Raft instance. Raft guarantees ACID semantics.
- logical mutations which affect multiple ranges employ distributed
  transactions for ACID semantics. CockroachDB uses an efficient
  **non-locking distributed commit** protocol.

CockroachDB achieves survivability:
- range replicas can be co-located within a single datacenter for low
  latency replication and survive disk or machine failures. They can
  be distributed across racks to survive some network switch failures.
- range replicas can be located in datacenters spanning increasingly
  disparate geographies to survive ever-greater failure scenarios from
  datacenter power or networking loss to regional power failures
  (e.g. `{ US-East-1a, US-East-1b, US-East-1c }`, `{ US-East, US-West,
  Japan }`, `{ Ireland, US-East, US-West }`, `{ Ireland, US-East,
  US-West, Japan, Australia }`).

CockroachDB provides [snapshot isolation](http://en.wikipedia.org/wiki/Snapshot_isolation) (SI) and
serializable snapshot isolation (SSI) semantics, allowing **externally
consistent, lock-free reads and writes**--both from a historical
snapshot timestamp and from the current wall clock time. SI provides
lock-free reads and writes but still allows write skew. SSI eliminates
write skew, but introduces a performance hit in the case of a
contentious system. SSI is the default isolation; clients must
consciously decide to trade correctness for performance. CockroachDB
implements [a limited form of linearizability](#linearizability),
providing ordering for any observer or chain of observers.

Similar to
[Spanner](http://static.googleusercontent.com/media/research.google.com/en/us/archive/spanner-osdi2012.pdf)
directories, CockroachDB allows configuration of arbitrary zones of data.
This allows replication factor, storage device type, and/or datacenter
location to be chosen to optimize performance and/or availability.
Unlike Spanner, zones are monolithic and don’t allow movement of fine
grained data on the level of entity groups.

# Architecture

CockroachDB implements a layered architecture. The highest level of
abstraction is the [*SQL layer*](#sql),
which provides familiar relational concepts
such as schemas, tables, columns, and indexes. The SQL layer
in turn depends on the [distributed key value store](#key-value-api),
which handles the details of range addressing to provide the abstraction
of a single, monolithic key value store. The distributed KV store
communicates with any number of physical cockroach nodes. Each node
contains one or more stores, one per physical device.

![Architecture](media/architecture.png)

Each store contains potentially many ranges, the lowest-level unit of
key-value data. Ranges are replicated using the Raft consensus protocol.
The diagram below is a blown-up version of stores from four of the five
nodes in the previous diagram. Each range is replicated three ways using
Raft. The color coding shows associated range replicas.

![Ranges](media/ranges.png)

Each physical node exports a RoachNode service. Each RoachNode exports
one or more key ranges. RoachNodes are symmetric. Each has the same
binary and assumes identical roles.

Nodes and the ranges they provide access to can be arranged with various
physical network topologies to make trade-offs between reliability and
performance. For example, a triplicated (3-way replica) range could have
each replica located on different:

- disks within a server to tolerate disk failures.
- servers within a rack to tolerate server failures.
- servers on different racks within a datacenter to tolerate rack power/network failures.
- servers in different datacenters to tolerate large scale network or power outages.

Up to `F` failures can be tolerated, where the total number of replicas `N = 2F + 1` (e.g. with 3x replication, one failure can be tolerated; with 5x replication, two failures, and so on).

# Cockroach Client

In order to support diverse client usage, Cockroach clients connect to
any node via HTTPS using protocol buffers or JSON. The connected node
proxies the client's work, including key lookups and write buffering.

# Keys

Cockroach keys are arbitrary byte arrays. If textual data is used in
keys, UTF-8 encoding is recommended (this helps with cleaner display of
values in debugging tools). User-supplied keys are encoded using an
ordered code. System keys are either prefixed with null characters (`\0`
or `\0\0`) for system tables, or take the form of
`<user-key><system-suffix>` to sort user-key-range specific system
keys immediately after the user keys they refer to. Null characters are
used in system key prefixes to guarantee that they sort first.
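
As a small illustration of that ordering (the keys below are made up for
the example, not real system keys), byte-wise comparison places
null-prefixed system keys first and suffixed system keys immediately after
the user keys they refer to:

```go
package main

import (
	"bytes"
	"fmt"
)

func main() {
	systemKey := []byte("\x00\x00meta1\xff") // system key, null-prefixed
	userKey := []byte("db1:user:42")         // user key (hypothetical)
	suffixedKey := []byte("db1:user:42txn-") // <user-key><system-suffix>

	// System keys sort before all user keys; suffixed system keys sort
	// immediately after the user key they refer to.
	fmt.Println(bytes.Compare(systemKey, userKey) < 0)   // true
	fmt.Println(bytes.Compare(userKey, suffixedKey) < 0) // true
}
```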

# Versioned Values

Cockroach maintains historical versions of values by storing them with
associated commit timestamps. Reads and scans can specify a snapshot
time to return the most recent writes prior to the snapshot timestamp.
Older versions of values are garbage collected by the system during
compaction according to a user-specified expiration interval. In order
to support long-running scans (e.g. for MapReduce), all versions have a
minimum expiration.

Versioned values are supported via modifications to RocksDB to record
commit timestamps and GC expirations per key.
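
As a rough illustration (not the actual storage format, and with
timestamps simplified to integers), a versioned read returns the newest
version at or below the snapshot timestamp:

```go
package mvcc

// version is a simplified stand-in for a versioned value; the real system
// stores versions in RocksDB keyed by (key, commit timestamp).
type version struct {
	timestamp int64 // commit timestamp (simplified)
	value     []byte
}

// readAt assumes versions are sorted newest-first and returns the most
// recent value written at or before the snapshot timestamp, or nil if no
// version is visible at that snapshot.
func readAt(versions []version, snapshot int64) []byte {
	for _, v := range versions {
		if v.timestamp <= snapshot {
			return v.value
		}
	}
	return nil
}
```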

# Lock-Free Distributed Transactions

Cockroach provides distributed transactions without locks. Cockroach
transactions support two isolation levels:

- snapshot isolation (SI) and
- *serializable* snapshot isolation (SSI).

*SI* is simple to implement, highly performant, and correct for all but a
handful of anomalous conditions (e.g. write skew). *SSI* requires just a touch
more complexity, is still highly performant (less so with contention), and has
no anomalous conditions. Cockroach’s SSI implementation is based on ideas from
the literature and some possibly novel insights.

SSI is the default level, with SI provided for application developers
who are certain enough of their need for performance and the absence of
write skew conditions to consciously elect to use it. In a lightly
contended system, our implementation of SSI is just as performant as SI,
requiring no locking or additional writes. With contention, our
implementation of SSI still requires no locking, but will end up
aborting more transactions. Cockroach’s SI and SSI implementations
prevent starvation scenarios even for arbitrarily long transactions.

See the [Cahill paper](https://drive.google.com/file/d/0B9GCVTp_FHJIcEVyZVdDWEpYYXVVbFVDWElrYUV0NHFhU2Fv/edit?usp=sharing)
for one possible implementation of SSI. This is another [great paper](http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf).
For a discussion of SSI implemented by preventing read-write conflicts
(in contrast to detecting them, called write-snapshot isolation), see
the [Yabandeh paper](https://drive.google.com/file/d/0B9GCVTp_FHJIMjJ2U2t6aGpHLTFUVHFnMTRUbnBwc2pLa1RN/edit?usp=sharing),
which is the source of much inspiration for Cockroach’s SSI.

Both SI and SSI require that the outcome of reads must be preserved, i.e.
a write of a key at a lower timestamp than a previous read must not succeed. To
this end, each range maintains a bounded *in-memory* cache from key range to
the latest timestamp at which it was read.

Most updates to this *timestamp cache* correspond to keys being read, though
the timestamp cache also protects the outcome of some writes (notably range
deletions) which consequently must also populate the cache. The cache’s entries
are evicted oldest timestamp first, updating the low water mark of the cache
appropriately.
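
A simplified sketch of the timestamp cache idea follows; the real cache
keeps key *spans* rather than single keys and has a more involved eviction
policy, so treat names and structure here as illustrative only:

```go
package tscache

// timestampCache remembers the latest timestamp at which each key was read,
// together with a low water mark that rises as old entries are evicted.
type timestampCache struct {
	lowWater int64            // every key is assumed read at least at this timestamp
	latest   map[string]int64 // key -> latest read timestamp
}

// readTimestamp returns the latest timestamp at which key is known to have
// been read; a write below this timestamp must be pushed above it.
func (tc *timestampCache) readTimestamp(key string) int64 {
	if ts, ok := tc.latest[key]; ok && ts > tc.lowWater {
		return ts
	}
	return tc.lowWater
}

// add records that key was read at ts.
func (tc *timestampCache) add(key string, ts int64) {
	if ts > tc.latest[key] {
		tc.latest[key] = ts
	}
}
```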

Each Cockroach transaction is assigned a random priority and a
"candidate timestamp" at start. The candidate timestamp is the
provisional timestamp at which the transaction will commit, and is
chosen as the current clock time of the node coordinating the
transaction. This means that a transaction without conflicts will
usually commit with a timestamp that, in absolute time, precedes the
actual work done by that transaction.

In the course of coordinating a transaction between one or more
distributed nodes, the candidate timestamp may be increased, but will
never be decreased. The core difference between the two isolation levels
SI and SSI is that the former allows the transaction's candidate
timestamp to increase and the latter does not.

**Hybrid Logical Clock**

Each cockroach node maintains a hybrid logical clock (HLC) as discussed
in the [Hybrid Logical Clock paper](http://www.cse.buffalo.edu/tech-reports/2014-04.pdf).
HLC time uses timestamps which are composed of a physical component (thought of
as, and always close to, local wall time) and a logical component (used to
distinguish between events with the same physical component). It allows us to
track causality for related events similar to vector clocks, but with less
overhead. In practice, it works much like other logical clocks: When events
are received by a node, it informs the local HLC about the timestamp supplied
with the event by the sender, and when events are sent a timestamp generated by
the local HLC is attached.

For a more in-depth description of HLC please read the paper. Our
implementation is [here](https://github.com/cockroachdb/cockroach/blob/master/util/hlc/hlc.go).

Cockroach picks a Timestamp for a transaction using HLC time. Throughout this
document, *timestamp* always refers to the HLC time which is a singleton
on each node. The HLC is updated by every read/write event on the node, and
the HLC time >= wall time. A read/write timestamp received in a cockroach request
from another node is not only used to version the operation, but also updates
the HLC on the node. This is useful in guaranteeing that all data read/written
on a node is at a timestamp < next HLC time.
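
A condensed sketch of those two update rules (generating a timestamp for a
local or send event, and forwarding the clock on receipt) is shown below;
the real implementation lives in `util/hlc` and handles more edge cases, so
the types and method names here are only illustrative:

```go
package hlc

// timestamp is an HLC timestamp: a physical wall-time component plus a
// logical counter that orders events sharing the same wall time.
type timestamp struct {
	wallTime int64
	logical  int32
}

// clock is a minimal HLC. physicalNow would typically read the local wall
// clock, e.g. time.Now().UnixNano().
type clock struct {
	physicalNow func() int64
	latest      timestamp
}

// now returns a timestamp for a local or send event; HLC time never lags
// behind wall time and never moves backwards.
func (c *clock) now() timestamp {
	if pt := c.physicalNow(); pt > c.latest.wallTime {
		c.latest = timestamp{wallTime: pt}
	} else {
		c.latest.logical++
	}
	return c.latest
}

// update folds in a timestamp received with an event from another node, so
// that the local clock reads at least as high as anything observed remotely.
func (c *clock) update(remote timestamp) timestamp {
	pt := c.physicalNow()
	switch {
	case pt > c.latest.wallTime && pt > remote.wallTime:
		c.latest = timestamp{wallTime: pt}
	case remote.wallTime > c.latest.wallTime:
		c.latest = timestamp{wallTime: remote.wallTime, logical: remote.logical + 1}
	case remote.wallTime == c.latest.wallTime:
		if remote.logical > c.latest.logical {
			c.latest.logical = remote.logical
		}
		c.latest.logical++
	default: // local wall time is ahead of both
		c.latest.logical++
	}
	return c.latest
}
```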

**Transaction execution flow**

Transactions are executed in two phases:

1. Start the transaction by selecting a range which is likely to be
   heavily involved in the transaction and writing a new transaction
   record to a reserved area of that range with state "PENDING". In
   parallel write an "intent" value for each datum being written as part
   of the transaction. These are normal MVCC values, with the addition of
   a special flag (i.e. “intent”) indicating that the value may be
   committed after the transaction itself commits. In addition,
   the transaction id (unique and chosen at tx start time by client)
   is stored with intent values. The txn id is used to refer to the
   transaction record when there are conflicts and to make
   tie-breaking decisions on ordering between identical timestamps.
   Each node returns the timestamp used for the write (which is the
   original candidate timestamp in the absence of read/write conflicts);
   the client selects the maximum from amongst all write timestamps as the
   final commit timestamp.

2. Commit the transaction by updating its transaction record. The value
   of the commit entry contains the candidate timestamp (increased as
   necessary to accommodate any latest read timestamps). Note that the
   transaction is considered fully committed at this point and control
   may be returned to the client.

   In the case of an SI transaction, a commit timestamp which was
   increased to accommodate concurrent readers is perfectly
   acceptable and the commit may continue. For SSI transactions,
   however, a gap between candidate and commit timestamps
   necessitates transaction restart (note: restart is different than
   abort--see below).

   After the transaction is committed, all written intents are upgraded
   in parallel by removing the “intent” flag. The transaction is
   considered fully committed before this step and does not wait for
   it to return control to the transaction coordinator.

In the absence of conflicts, this is the end. Nothing else is necessary
to ensure the correctness of the system.
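
A highly simplified sketch of that two-phase flow follows. The helper
functions are hypothetical stand-ins for the real KV/Raft operations; only
the ordering of steps is meant to be illustrative:

```go
package txnflow

// Hypothetical stand-ins for the real operations.
func writeTxnRecord(txnID, status string, ts int64)               {}
func writeIntent(txnID, key string, value []byte, ts int64) int64 { return ts } // may return a pushed timestamp
func resolveIntents(txnID string, keys []string)                  {}

// runTxn shows the ordering of the two phases for a committing transaction.
func runTxn(txnID string, candidateTS int64, writes map[string][]byte) {
	// Phase 1: a PENDING transaction record plus one intent per written key
	// (written in parallel in the real system). Each write reports the
	// timestamp it actually used; the client keeps the maximum.
	writeTxnRecord(txnID, "PENDING", candidateTS)
	commitTS := candidateTS
	keys := make([]string, 0, len(writes))
	for key, value := range writes {
		if ts := writeIntent(txnID, key, value, candidateTS); ts > commitTS {
			commitTS = ts
		}
		keys = append(keys, key)
	}

	// Phase 2: commit by updating the transaction record. For SSI, a commit
	// timestamp above the candidate forces a restart instead; SI may simply
	// commit at the pushed timestamp.
	writeTxnRecord(txnID, "COMMITTED", commitTS)

	// The transaction is fully committed at this point; intents are
	// upgraded (their "intent" flag removed) asynchronously.
	go resolveIntents(txnID, keys)
}
```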

**Conflict Resolution**

Things get more interesting when a reader or writer encounters an intent
record or newly-committed value in a location that it needs to read or
write. This is a conflict, usually causing either of the transactions to
abort or restart depending on the type of conflict.

***Transaction restart:***

This is the usual (and more efficient) type of behaviour and is used
except when the transaction was aborted (for instance by another
transaction).
In effect, that reduces to two cases; the first being the one outlined
above: An SSI transaction that finds upon attempting to commit that
its commit timestamp has been pushed. The second case involves a transaction
actively encountering a conflict, that is, one of its readers or writers
encounters data that necessitates conflict resolution
(see transaction interactions below).

When a transaction restarts, it changes its priority and/or moves its
timestamp forward depending on data tied to the conflict, and
begins anew reusing the same txn id. The prior run of the transaction might
have written some write intents, which need to be deleted before the
transaction commits, so as to not be included as part of the transaction.
These stale write intent deletions are done during the reexecution of the
transaction, either implicitly, through writing new intents to
the same keys as part of the reexecution of the transaction, or explicitly,
by cleaning up stale intents that are not part of the reexecution of the
transaction. Since most transactions will end up writing to the same keys,
the explicit cleanup run just before committing the transaction is usually
a NOOP.

***Transaction abort:***

This is the case in which a transaction, upon reading its transaction
record, finds that it has been aborted. In this case, the transaction
can not reuse its intents; it returns control to the client before
cleaning them up (other readers and writers would clean up dangling
intents as they encounter them) but will make an effort to clean up
after itself. The next attempt (if applicable) then runs as a new
transaction with **a new txn id**.

***Transaction interactions:***

There are several scenarios in which transactions interact:

- **Reader encounters write intent or value with newer timestamp far
  enough in the future**: This is not a conflict. The reader is free
  to proceed; after all, it will be reading an older version of the
  value and so does not conflict. Recall that the write intent may
  be committed with a later timestamp than its candidate; it will
  never commit with an earlier one. **Side note**: if an SI transaction
  reader finds an intent with a newer timestamp which the reader’s own
  transaction has written, the reader always returns that intent's value.

- **Reader encounters write intent or value with newer timestamp in the
  near future:** In this case, we have to be careful. The newer
  intent may, in absolute terms, have happened in our read's past if
  the clock of the writer is ahead of the node serving the values.
  In that case, we would need to take this value into account, but
  we just don't know. Hence the transaction restarts, using instead
  a future timestamp (but remembering a maximum timestamp used to
  limit the uncertainty window to the maximum clock skew). In fact,
  this is optimized further; see the details under "Choosing a
  Timestamp" below.

- **Reader encounters write intent with older timestamp**: the reader
  must follow the intent’s transaction id to the transaction record.
  If the transaction has already been committed, then the reader can
  just read the value. If the write transaction has not yet been
  committed, then the reader has two options. If the write conflict
  is from an SI transaction, the reader can *push that transaction's
  commit timestamp into the future* (and consequently not have to
  read it). This is simple to do: the reader just updates the
  transaction’s commit timestamp to indicate that when/if the
  transaction does commit, it should use a timestamp *at least* as
  high. However, if the write conflict is from an SSI transaction,
  the reader must compare priorities. If the reader has the higher priority,
  it pushes the transaction’s commit timestamp (that
  transaction will then notice its timestamp has been pushed, and
  restart). If it has the lower or same priority, it retries itself using as
  a new priority `max(new random priority, conflicting txn’s
  priority - 1)`.

- **Writer encounters uncommitted write intent**:
  If the other write intent has been written by a transaction with a lower
  priority, the writer aborts the conflicting transaction. If the write
  intent has a higher or equal priority the transaction retries, using as a new
  priority *max(new random priority, conflicting txn’s priority - 1)*
  (see the sketch after this list); the retry occurs after a short,
  randomized backoff interval.

- **Writer encounters newer committed value**:
  The committed value could also be an unresolved write intent made by a
  transaction that has already committed. The transaction restarts. On restart,
  the same priority is reused, but the candidate timestamp is moved forward
  to the encountered value's timestamp.

- **Writer encounters more recently read key**:
  The *read timestamp cache* is consulted on each write at a node. If the write’s
  candidate timestamp is earlier than the low water mark on the cache itself
  (i.e. its last evicted timestamp) or if the key being written has a read
  timestamp later than the write’s candidate timestamp, this later timestamp
  value is returned with the write. A new timestamp forces a transaction
  restart only if the transaction is serializable (SSI).
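
A small sketch of the retry-priority rule used in the reader and writer
cases above: the losing transaction retries with a priority just below the
winner's, so repeated losses ratchet its priority upward and prevent
starvation. The function name is illustrative, not the actual API:

```go
package txnpriority

import "math/rand"

// retryPriority implements max(new random priority, conflicting txn's
// priority - 1) for a transaction that must retry after losing a conflict.
func retryPriority(conflictingPriority int32) int32 {
	newPriority := rand.Int31() // fresh random priority
	if floor := conflictingPriority - 1; floor > newPriority {
		return floor
	}
	return newPriority
}
```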

**Transaction management**

Transactions are managed by the client proxy (or gateway in SQL Azure
parlance). Unlike in Spanner, writes are not buffered but are sent
directly to all implicated ranges. This allows the transaction to abort
quickly if it encounters a write conflict. The client proxy keeps track
of all written keys in order to resolve write intents asynchronously upon
transaction completion. If a transaction commits successfully, all intents
are upgraded to committed. In the event a transaction is aborted, all written
intents are deleted. The client proxy doesn’t guarantee it will resolve intents.

In the event the client proxy restarts before the pending transaction is
committed, the dangling transaction would continue to "live" until
aborted by another transaction. Transactions periodically heartbeat
their transaction record to maintain liveness.
Transactions encountered by readers or writers through dangling intents,
and which haven’t been heartbeat within the required interval, are aborted.
In the event the proxy restarts after a transaction commits but before
the asynchronous resolution is complete, the dangling intents are upgraded
when encountered by future readers and writers and the system does
not depend on their timely resolution for correctness.

An exploration of retries with contention and abort times with abandoned
transactions is
[here](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing).

**Transaction Records**

Please see [roachpb/data.proto](https://github.com/cockroachdb/cockroach/blob/master/roachpb/data.proto) for the up-to-date structures, the best entry point being `message Transaction`.

**Pros**

- No requirement for reliable code execution to prevent stalled 2PC
  protocol.
- Readers never block with SI semantics; with SSI semantics, they may
  abort.
- Lower latency than traditional 2PC commit protocol (w/o contention)
  because second phase requires only a single write to the
  transaction record instead of a synchronous round to all
  transaction participants.
- Priorities avoid starvation for arbitrarily long transactions and
  always pick a winner from between contending transactions (no
  mutual aborts).
- Writes not buffered at client; writes fail fast.
- No read-locking overhead required for *serializable* SI (in contrast
  to other SSI implementations).
- Well-chosen (i.e. less random) priorities can flexibly give
  probabilistic guarantees on latency for arbitrary transactions
  (for example: make OLTP transactions 10x less likely to abort than
  low priority transactions, such as asynchronously scheduled jobs).

**Cons**

- Reads from non-lease holder replicas still require a ping to the lease holder
  to update the *read timestamp cache*.
- Abandoned transactions may block contending writers for up to the
  heartbeat interval, though average wait is likely to be
  considerably shorter (see [graph in link](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing)).
  This is likely considerably more performant than detecting and
  restarting 2PC in order to release read and write locks.
- Behavior different than other SI implementations: no first writer
  wins, and shorter transactions do not always finish quickly. This
  element of surprise may be a problematic factor for OLTP systems.
- Aborts can decrease throughput in a contended system compared with
  two phase locking. Aborts and retries increase read and write
  traffic, increase latency and decrease throughput.

**Choosing a Timestamp**

A key challenge of reading data in a distributed system with clock skew
is choosing a timestamp guaranteed to be greater than the latest
timestamp of any committed transaction (in absolute time). No system can
claim consistency and fail to read already-committed data.

Accomplishing consistency for transactions (or just single operations)
accessing a single node is easy. The timestamp is assigned by the node
itself, so it is guaranteed to be at a greater timestamp than all the
existing timestamped data on the node.

For multiple nodes, the timestamp of the node coordinating the
transaction `t` is used. In addition, a maximum timestamp `t+ε` is
supplied to provide an upper bound on timestamps for already-committed
data (`ε` is the maximum clock skew). As the transaction progresses, any
data read which have timestamps greater than `t` but less than `t+ε`
cause the transaction to abort and retry with the conflicting timestamp
t<sub>c</sub>, where t<sub>c</sub> \> t. The maximum timestamp `t+ε` remains
the same. This implies that transaction restarts due to clock uncertainty
can only happen on a time interval of length `ε`.

We apply another optimization to reduce the restarts caused
by uncertainty. Upon restarting, the transaction not only takes
into account t<sub>c</sub>, but the timestamp of the node at the time
of the uncertain read t<sub>node</sub>. The larger of those two timestamps
t<sub>c</sub> and t<sub>node</sub> (likely equal to the latter) is used
to increase the read timestamp. Additionally, the conflicting node is
marked as “certain”. Then, for future reads to that node within the
transaction, we set `MaxTimestamp = Read Timestamp`, preventing further
uncertainty restarts.

Correctness follows from the fact that we know that at the time of the read,
there exists no version of any key on that node with a higher timestamp than
t<sub>node</sub>. Upon a restart caused by the node, if the transaction
encounters a key with a higher timestamp, it knows that in absolute time,
the value was written after t<sub>node</sub> was obtained, i.e. after the
uncertain read. Hence the transaction can move forward reading an older version
of the data (at the transaction's timestamp). This limits the time uncertainty
restarts attributed to a node to at most one. The tradeoff is that we might
pick a timestamp larger than the optimal one (> highest conflicting timestamp),
resulting in the possibility of a few more conflicts.
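
Under these rules, the per-value uncertainty check a read performs can be
sketched as follows (timestamps simplified to integers; names are
illustrative only):

```go
package uncertainty

// decision classifies a value encountered by a read with read timestamp
// readTS and uncertainty limit maxTS (= readTS + maximum clock offset).
type decision int

const (
	visible         decision = iota // value timestamp <= readTS: visible to the read
	ignore                          // value timestamp > maxTS: definitely in the future, skip it
	restartRequired                 // in (readTS, maxTS]: ambiguous, restart at a higher timestamp
)

func checkValue(valueTS, readTS, maxTS int64) decision {
	switch {
	case valueTS <= readTS:
		return visible
	case valueTS <= maxTS:
		return restartRequired
	default:
		return ignore
	}
}
```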

We expect retries will be rare, but this assumption may need to be
revisited if retries become problematic. Note that this problem does not
apply to historical reads. An alternate approach which does not require
retries makes a round to all node participants in advance and
chooses the highest reported node wall time as the timestamp. However,
knowing which nodes will be accessed in advance is difficult and
potentially limiting. Cockroach could also potentially use a global
clock (Google did this with [Percolator](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf)),
which would be feasible for smaller, geographically-proximate clusters.

# Strict Serializability (Linearizability)

Roughly speaking, the gap between <i>strict serializability</i> (which we use
interchangeably with <i>linearizability</i>) and CockroachDB's default
isolation level (<i>serializable</i>) is that with linearizable transactions,
causality is preserved. That is, if one transaction (say, creating a posting
for a user) waits for its predecessor (creating the user in the first place)
to complete, one would hope that the logical timestamp assigned to the former
is larger than that of the latter.
In distributed databases this may not hold: clocks across the system are not
perfectly synchronized, and if the "later" transaction touches a part of the
keyspace disjoint from that of the first, the commit timestamps are decided
by clocks with disjoint information.

In practice, in CockroachDB many transactional workloads are actually
linearizable, though the precise conditions are too involved to outline them
here.

Causality is not required for many transactions, and so it is
advantageous to pay for it only when it *is* needed. CockroachDB implements
this via <i>causality tokens</i>: When committing a transaction, a causality
token can be retrieved and passed to the next transaction, ensuring that these
two transactions get assigned increasing logical timestamps.
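
Purely as an illustration of the idea (the names below are hypothetical,
not the actual client API), a causality token boils down to carrying one
transaction's commit timestamp into the start of the next:

```go
package causality

// token carries the commit timestamp of a finished transaction.
type token struct {
	commitTS int64
}

// beginTimestamp picks a starting timestamp for a transaction that must be
// causally ordered after the transaction the token came from.
func beginTimestamp(localHLCNow int64, t token) int64 {
	if t.commitTS >= localHLCNow {
		return t.commitTS + 1
	}
	return localHLCNow
}
```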

Additionally, as better synchronized clocks become a standard commodity offered
by cloud providers, CockroachDB can provide global linearizability by doing
much the same as [Google's
Spanner](http://research.google.com/archive/spanner.html) does: wait out the
maximum clock offset after committing, but before returning to the client.

See the blog post below for much more in-depth information.

https://www.cockroachlabs.com/blog/living-without-atomic-clocks/

# Logical Map Content

Logically, the map contains a series of reserved system key/value
pairs preceding the actual user data (which is managed by the SQL
subsystem).

- `\x02<key1>`: Range metadata for range ending `\x03<key1>`. This is a "meta1" key.
- ...
- `\x02<keyN>`: Range metadata for range ending `\x03<keyN>`. This is a "meta1" key.
- `\x03<key1>`: Range metadata for range ending `<key1>`. This is a "meta2" key.
- ...
- `\x03<keyN>`: Range metadata for range ending `<keyN>`. This is a "meta2" key.
- `\x04{desc,node,range,store}-idegen`: ID generation oracles for various component types.
- `\x04status-node-<varint encoded Store ID>`: Store runtime metadata.
- `\x04tsd<key>`: Time-series data key.
- `<key>`: A user key. In practice, these keys are managed by the SQL
  subsystem, which employs its own key anatomy.

# Node Storage

Nodes maintain a separate instance of RocksDB for each disk. Each
RocksDB instance hosts any number of ranges. RPCs arriving at a
RoachNode are multiplexed based on the disk name to the appropriate
RocksDB instance. A single instance per disk is used to avoid
contention. If every range maintained its own RocksDB, global management
of available cache memory would be impossible and writers for each range
would compete for non-contiguous writes to multiple RocksDB logs.

In addition to the key/value pairs of the range itself, various range
metadata is maintained.

- participating replicas

- consensus metadata

- split/merge activity

A really good reference on tuning Linux installations with RocksDB is
[here](http://docs.basho.com/riak/latest/ops/advanced/backends/leveldb/).

# Range Metadata

The default approximate size of a range is 64M (2\^26 B). In order to
support 1P (2\^50 B) of logical data, metadata is needed for roughly
2\^(50 - 26) = 2\^24 ranges. A reasonable upper bound on range metadata
size is roughly 256 bytes (3\*12 bytes for the triplicated node
locations and 220 bytes for the range key itself). 2\^24 ranges \* 2\^8
B would require roughly 4G (2\^32 B) to store--too much to duplicate
between machines. Our conclusion is that range metadata must be
distributed for large installations.

To keep key lookups relatively fast in the presence of distributed metadata,
we store all the top-level metadata in a single range (the first range). These
top-level metadata keys are known as *meta1* keys, and are prefixed such that
they sort to the beginning of the key space. Given the metadata size of 256
bytes given above, a single 64M range would support 64M/256B = 2\^18 ranges,
which gives a total storage of 64M \* 2\^18 = 16T. To support the 1P quoted
above, we need two levels of indirection, where the first level addresses the
second, and the second addresses user data. With two levels of indirection, we
can address 2\^(18 + 18) = 2\^36 ranges; each range addresses 2\^26 B, and
altogether we address 2\^(36+26) B = 2\^62 B = 4E of user data.

For a given user-addressable `key1`, the associated *meta1* record is found
at the successor key to `key1` in the *meta1* space. Since the *meta1* space
is sparse, the successor key is defined as the next key which is present. The
*meta1* record identifies the range containing the *meta2* record, which is
found using the same process. The *meta2* record identifies the range
containing `key1`, which is again found the same way (see examples below).

Concretely, metadata keys are prefixed by `\0\0meta{1,2}`; the two null
characters provide for the desired sorting behaviour. Thus, `key1`'s
*meta1* record will reside at the successor key to `\0\0meta1<key1>`.

Note: we append the end key of each range to meta{1,2} records because
the RocksDB iterator only supports a Seek() interface which acts as a
Ceil(). Using the start key of the range would cause Seek() to find the
key *after* the meta indexing record we’re looking for, which would
result in having to back the iterator up, an option which is both less
efficient and not available in all cases.

The following example shows the directory structure for a map with
three ranges worth of data. Ellipses indicate additional key/value pairs to
fill an entire range of data. Except for the fact that splitting ranges
requires updates to the range metadata with knowledge of the metadata layout,
the range metadata itself requires no special treatment or bootstrapping.

**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
`dcrama3:8000`)

- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2<lastkey0>`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2<lastkey1>`: `dcrama4:8000`, `dcrama5:8000`, `dcrama6:8000`
- `\0\0meta2\xff`: `dcrama7:8000`, `dcrama8:8000`, `dcrama9:8000`
- ...
- `<lastkey0>`: `<lastvalue0>`

**Range 1** (located on servers `dcrama4:8000`, `dcrama5:8000`,
`dcrama6:8000`)

- ...
- `<lastkey1>`: `<lastvalue1>`

**Range 2** (located on servers `dcrama7:8000`, `dcrama8:8000`,
`dcrama9:8000`)

- ...
- `<lastkey2>`: `<lastvalue2>`

Consider a simpler example of a map containing less than a single
range of data. In this case, all range metadata and all data are
located in the same range:

**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
`dcrama3:8000`)

- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `<key0>`: `<value0>`
- `...`

Finally, a map large enough to need both levels of indirection would
look like (note that instead of showing range replicas, this
example is simplified to just show range indexes):

**Range 0**

- `\0\0meta1<lastkeyN-1>`: Range 0
- `\0\0meta1\xff`: Range 1
- `\0\0meta2<lastkey1>`: Range 1
- `\0\0meta2<lastkey2>`: Range 2
- `\0\0meta2<lastkey3>`: Range 3
- ...
- `\0\0meta2<lastkeyN-1>`: Range 262143

**Range 1**

- `\0\0meta2<lastkeyN>`: Range 262144
- `\0\0meta2<lastkeyN+1>`: Range 262145
- ...
- `\0\0meta2\xff`: Range 500,000
- ...
- `<lastkey1>`: `<lastvalue1>`

**Range 2**

- ...
- `<lastkey2>`: `<lastvalue2>`

**Range 3**

- ...
- `<lastkey3>`: `<lastvalue3>`

**Range 262144**

- ...
- `<lastkeyN>`: `<lastvalueN>`

**Range 262145**

- ...
- `<lastkeyN+1>`: `<lastvalueN+1>`

Note that the choice of range `262144` is just an approximation. The
actual number of ranges addressable via a single metadata range is
dependent on the size of the keys. If efforts are made to keep key sizes
small, the total number of addressable ranges would increase and vice
versa.

From the examples above it’s clear that key location lookups require at
most three reads to get the value for `<key>` (see the sketch after this
list):

1. lower bound of `\0\0meta1<key>`
2. lower bound of `\0\0meta2<key>`
3. `<key>`.
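
A sketch of those three steps, treating each metadata level as an
in-memory map keyed by its full metadata key; in the real system each
lookup is an RPC to the range found in the previous step, and all names
here are illustrative:

```go
package addressing

import "sort"

// lowerBound returns the first metadata key >= key, i.e. the "successor" in
// the sparse meta keyspace. The meta spaces always end in a \xff entry, so a
// successor exists for any addressable key.
func lowerBound(sortedMetaKeys []string, key string) string {
	return sortedMetaKeys[sort.SearchStrings(sortedMetaKeys, key)]
}

// lookupReplicas resolves the replicas holding key with at most three reads.
func lookupReplicas(meta1, meta2 map[string][]string, key string) []string {
	meta2Replicas := meta1[lowerBound(sortedKeys(meta1), "\x00\x00meta1"+key)] // 1. meta1 lookup
	_ = meta2Replicas // the meta2 read below is addressed to these replicas
	dataReplicas := meta2[lowerBound(sortedKeys(meta2), "\x00\x00meta2"+key)] // 2. meta2 lookup
	return dataReplicas                                                       // 3. read <key> from one of these replicas
}

func sortedKeys(m map[string][]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}
```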

For small maps, the entire lookup is satisfied in a single RPC to Range 0. Maps
containing less than 16T of data would require two lookups. Clients cache both
levels of range metadata, and we expect that data locality for individual
clients will be high. Clients may end up with stale cache entries. If on a
lookup, the range consulted does not match the client’s expectations, the
client evicts the stale entries and possibly does a new lookup.

# Raft - Consistency of Range Replicas

Each range is configured to consist of three or more replicas, as specified by
their ZoneConfig. The replicas in a range maintain their own instance of a
distributed consensus algorithm. We use the [*Raft consensus algorithm*](https://raftconsensus.github.io)
as it is simpler to reason about and includes a reference implementation
covering important details.
[ePaxos](https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf) has
promising performance characteristics for WAN-distributed replicas, but
it does not guarantee a consistent ordering between replicas.

Raft elects a relatively long-lived leader which must be involved to
propose commands. It heartbeats followers periodically and keeps their logs
replicated. In the absence of heartbeats, followers become candidates
after randomized election timeouts and proceed to hold new leader
elections. Cockroach weights random timeouts such that the replicas with
shorter round trip times to peers are more likely to hold elections
first (not implemented yet). Only the Raft leader may propose commands;
followers will simply relay commands to the last known leader.

Our Raft implementation was developed together with CoreOS, but adds an extra
layer of optimization to account for the fact that a single Node may have
millions of consensus groups (one for each Range). Areas of optimization
are chiefly coalesced heartbeats (so that the number of nodes dictates the
number of heartbeats as opposed to the much larger number of ranges) and
batch processing of requests.
Future optimizations may include two-phase elections and quiescent ranges
(i.e. stopping traffic completely for inactive ranges).

# Range Leases

As outlined in the Raft section, the replicas of a Range are organized as a
Raft group and execute commands from their shared commit log. Going through
Raft is an expensive operation though, and there are tasks which should only be
carried out by a single replica at a time (as opposed to all of them).
In particular, it is desirable to serve authoritative reads from a single
Replica (ideally from more than one, but that is far more difficult).

For these reasons, Cockroach introduces the concept of **Range Leases**:
This is a lease held for a slice of (database, i.e. hybrid logical) time and is
established by committing a special log entry through Raft containing the
interval the lease is going to be active on, along with the Node:RaftID
combination that uniquely describes the requesting replica. Reads and writes
must generally be addressed to the replica holding the lease; if none does, any
replica may be addressed, causing it to try to obtain the lease synchronously.
Requests received by a non-lease holder (for the HLC timestamp specified in the
request's header) fail with an error pointing at the replica's last known
lease holder. These requests are retried transparently with the updated lease by the
gateway node and never reach the client.

The replica holding the lease is in charge of or involved in handling
Range-specific maintenance tasks such as

* gossiping the sentinel and/or first range information
* splitting, merging and rebalancing

and, very importantly, may satisfy reads locally, without incurring the
overhead of going through Raft.

Since reads bypass Raft, a new lease holder will, among other things, ascertain
that its timestamp cache does not report timestamps smaller than the previous
lease holder's (so that it's compatible with reads which may have occurred on
the former lease holder). This is accomplished by letting leases enter
a <i>stasis period</i> (which is just the expiration minus the maximum clock
offset) before the actual expiration of the lease, so that all the next lease
holder has to do is set the low water mark of the timestamp cache to its
new lease's start time.

As a lease enters its stasis period, no more reads or writes are served, which
is undesirable. However, this would only happen in practice if a node became
unavailable. In almost all practical situations, no unavailability results
since leases are usually long-lived (and/or eagerly extended, which can avoid
the stasis period) or proactively transferred away from the lease holder, which
can also avoid the stasis period by promising not to serve any further reads
until the next lease goes into effect.
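
A sketch of the stasis rule under these assumptions (field names are
illustrative and timestamps simplified to integers): a lease only serves
requests with timestamps below `expiration - max offset`, so the next lease
holder merely bumps its timestamp cache's low water mark to its lease's
start time:

```go
package lease

// rangeLease is a simplified range lease: it is valid for database (HLC)
// timestamps in [start, expiration), but stops serving requests once it
// enters the stasis period [expiration - maxOffset, expiration).
type rangeLease struct {
	start      int64
	expiration int64
}

// covers reports whether a request at timestamp ts may be served under this
// lease, given the cluster's maximum clock offset.
func (l rangeLease) covers(ts, maxOffset int64) bool {
	return ts >= l.start && ts < l.expiration-maxOffset
}
```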

## Colocation with Raft leadership

The range lease is completely separate from Raft leadership, and so without
further efforts, Raft leadership and the Range lease might not be held by the
same Replica. Since it's expensive to not have these two roles colocated (the
lease holder has to forward each proposal to the leader, adding costly RPC
round-trips), each lease renewal or transfer also attempts to colocate them.
In practice, that means that the mismatch is rare and self-corrects quickly.

## Command Execution Flow

This subsection describes how a lease holder replica processes a read/write
command in more detail. Each command specifies (1) a key (or a range
of keys) that the command accesses and (2) the ID of a range which the
key(s) belongs to. When receiving a command, a RoachNode looks up a
range by the specified Range ID and checks if the range is still
responsible for the supplied keys. If any of the keys do not belong to the
range, the RoachNode returns an error so that the client will retry
and send a request to the correct range.

When all the keys belong to the range, the RoachNode attempts to
process the command. If the command is an inconsistent read-only
command, it is processed immediately. If the command is a consistent
read or a write, the command is executed when both of the following
conditions hold:

- The range replica has a range lease.
- There are no other running commands whose keys overlap with
  the submitted command and cause a read/write conflict.

When the first condition is not met, the replica attempts to acquire
a lease or returns an error so that the client will redirect the
command to the current lease holder. The second condition guarantees that
consistent read/write commands for a given key are sequentially
executed.

When the above two conditions are met, the lease holder replica processes the
command. Consistent reads are processed on the lease holder immediately.
Write commands are committed into the Raft log so that every replica
will execute the same commands. All commands produce deterministic
results so that the range replicas keep consistent states among them.

When a write command completes, all the replicas update their response
cache to ensure idempotency. When a read command completes, the lease holder
replica updates its timestamp cache to keep track of the latest read
for a given key.

There is a chance that a range lease expires while a command is being
executed. Before executing a command, each replica checks whether the
replica proposing the command still holds the lease. If the lease has
expired, the command is rejected by the replica.

# Splitting / Merging Ranges

RoachNodes split or merge ranges based on whether they exceed maximum or
minimum thresholds for capacity or load. Ranges exceeding maximums for
either capacity or load are split; ranges below minimums for *both*
capacity and load are merged.

Ranges maintain the same accounting statistics as accounting key
prefixes. These boil down to a time series of data points with minute
granularity, covering everything from number of bytes to read/write queue sizes.
Arbitrary distillations of the accounting stats can be determined as the
basis for splitting / merging. Two sensible metrics for use with
split/merge are range size in bytes and IOps. A good metric for
rebalancing a replica from one node to another would be total read/write
queue wait times. These metrics are gossipped, with each range / node
passing along relevant metrics if they’re in the bottom or top of the
range it’s aware of.

A range finding itself exceeding either capacity or load threshold
splits. To this end, the range lease holder computes an appropriate split key
candidate and issues the split through Raft. In contrast to splitting,
merging requires a range to be below the minimum threshold for both
capacity *and* load. A range being merged chooses the smaller of the
ranges immediately preceding and succeeding it.

Splitting, merging, rebalancing and recovering all follow the same basic
algorithm for moving data between roach nodes. New target replicas are
created and added to the replica set of the source range. Then each new
replica is brought up to date by either replaying the log in full or
copying a snapshot of the source replica data and then replaying the log
from the timestamp of the snapshot to catch up fully. Once the new
replicas are fully up to date, the range metadata is updated and old,
source replica(s) deleted if applicable.

**Coordinator** (lease holder replica)

```
if splitting
  SplitRange(split_key): splits happen locally on range replicas and
  only after being completed locally, are moved to new target replicas.
else if merging
  Choose new replicas on same servers as target range replicas;
  add to replica set.
else if rebalancing || recovering
  Choose new replica(s) on least loaded servers; add to replica set.
```

**New Replica**

*Bring replica up to date:*

```
if all info can be read from replicated log
  copy replicated log
else
  snapshot source replica
  send successive ReadRange requests to source replica
  referencing snapshot

if merging
  combine ranges on all replicas
else if rebalancing || recovering
  remove old range replica(s)
```

RoachNodes split ranges when the total data in a range exceeds a
configurable maximum threshold. Similarly, ranges are merged when the
total data falls below a configurable minimum threshold.

**TBD: flesh this out**: Especially for merges (but also rebalancing) we have a
range disappearing from the local node; that range needs to disappear
gracefully, with a smooth handoff of operation to the new owner of its data.

Ranges are rebalanced if a node determines its load or capacity is one
of the worst in the cluster based on gossipped load stats. A node with
spare capacity is chosen in the same datacenter and a special-case split
is done which simply duplicates the data 1:1 and resets the range
configuration metadata.

# Node Allocation (via Gossip)

New nodes must be allocated when a range is split. Instead of requiring
every RoachNode to know about the status of all or even a large number
of peer nodes --or-- alternatively requiring a specialized curator or
master with sufficiently global knowledge, we use a gossip protocol to
efficiently communicate only interesting information between all of the
nodes in the cluster. What’s interesting information? One example would
be whether a particular node has a lot of spare capacity. Each node,
when gossiping, compares each topic of gossip to its own state. If its
own state is somehow “more interesting” than the least interesting item
in the topic it’s seen recently, it includes its own state as part of
the next gossip session with a peer node. In this way, a node with
capacity sufficiently in excess of the mean quickly becomes discovered
by the entire cluster. To avoid piling onto outliers, nodes from the
high capacity set are selected at random for allocation.

The gossip protocol itself contains two primary components:

- **Peer Selection**: each node maintains up to N peers with which it
  regularly communicates. It selects peers with an eye towards
  maximizing fanout. A peer node which itself communicates with an
  array of otherwise unknown nodes will be selected over one which
  communicates with a set containing significant overlap. Each time
  gossip is initiated, each node’s set of peers is exchanged. Each
  node is then free to incorporate the other’s peers as it sees fit.
  To avoid any node suffering from excess incoming requests, a node
  may refuse to answer a gossip exchange. Each node is biased
  towards answering requests from nodes without significant overlap
  and refusing requests otherwise.

  Peers are efficiently selected using a heuristic as described in
  [Agarwal & Trachtenberg (2006)](https://drive.google.com/file/d/0B9GCVTp_FHJISmFRTThkOEZSM1U/edit?usp=sharing).

  **TBD**: how to avoid partitions? Need to work out a simulation of
  the protocol to tune the behavior and see empirically how well it
  works.

- **Gossip Selection**: what to communicate. Gossip is divided into
  topics. Load characteristics (capacity per disk, cpu load, and
  state [e.g. draining, ok, failure]) are used to drive node
  allocation. Range statistics (range read/write load, missing
  replicas, unavailable ranges) and network topology (inter-rack
  bandwidth/latency, inter-datacenter bandwidth/latency, subnet
  outages) are used for determining when to split ranges, when to
  recover replicas vs. wait for network connectivity, and for
  debugging / sysops. In all cases, a set of minimums and a set of
  maximums is propagated; each node applies its own view of the
  world to augment the values. Each minimum and maximum value is
  tagged with the reporting node and other accompanying contextual
  information. Each topic of gossip has its own protobuf to hold the
  structured data. The number of items of gossip in each topic is
  limited by a configurable bound.

  For efficiency, nodes assign each new item of gossip a sequence
  number and keep track of the highest sequence number each peer
  node has seen. Each round of gossip communicates only the delta
  containing new items.

# Node Accounting

The gossip protocol discussed in the previous section is useful to
quickly communicate fragments of important information in a
decentralized manner. However, complete accounting for each node is also
stored to a central location, available to any dashboard process. This
is done using the map itself. Each node periodically writes its state to
the map with keys prefixed by `\0node`, similar to the first level of
range metadata, but with a ‘`node`’ suffix. Each value is a protobuf
containing the full complement of node statistics--everything
communicated normally via the gossip protocol plus other useful, but
non-critical data.

The range containing the first key in the node accounting table is
responsible for gossiping the total count of nodes. This total count is
used by the gossip network to most efficiently organize itself. In
particular, the maximum number of hops for gossipped information to take
before reaching a node is given by `ceil(log(node count) / log(max
fanout)) + 1`.
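
For instance, a quick computation of that bound (a worked example, not code
from the repository):

```go
package gossip

import "math"

// maxHops returns the maximum number of gossip hops needed for a piece of
// information to reach every node: ceil(log(nodeCount)/log(maxFanout)) + 1.
func maxHops(nodeCount, maxFanout int) int {
	return int(math.Ceil(math.Log(float64(nodeCount))/math.Log(float64(maxFanout)))) + 1
}

// With 1,000 nodes and a fanout of 8, for example,
// maxHops(1000, 8) == ceil(3.32) + 1 == 5.
```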
1003
1004
# Key-prefix Accounting and Zones
1006
Arbitrarily fine-grained accounting is specified via
1007
key prefixes. Key prefixes can overlap, as is necessary for capturing
1008
hierarchical relationships. For illustrative purposes, let’s say keys
1009
specifying rows in a set of databases have the following format:
1010
1011
`<db>:<table>:<primary-key>[:<secondary-key>]`
1012
1013
In this case, we might collect accounting with
1014
key prefixes:
1015
1016
`db1`, `db1:user`, `db1:order`,
1017
1018
Accounting is kept for the entire map by default.
1019
1020

## Accounting

To keep accounting for a range defined by a key prefix, an entry is
created in the accounting system table. The format of accounting table
keys is:

`\0acct<key-prefix>`

In practice, we assume each RoachNode is capable of caching the entire
accounting table, as it is likely to be relatively small.

Accounting is kept for key prefix ranges with eventual consistency for
efficiency. There are two types of values which comprise accounting:
counts and occurrences, for lack of better terms. Counts describe
system state, such as the total number of bytes, rows, etc. Occurrences
include transient performance and load metrics. Both types of
accounting are captured as time series with minute granularity. The
length of time accounting metrics are kept is configurable. Below are
examples of each type of accounting value.

**System State Counters/Performance**

- Count of items (e.g. rows)
- Total bytes
- Total key bytes
- Total value length
- Queued message count
- Queued message total bytes
- Count of values \< 16B
- Count of values \< 64B
- Count of values \< 256B
- Count of values \< 1K
- Count of values \< 4K
- Count of values \< 16K
- Count of values \< 64K
- Count of values \< 256K
- Count of values \< 1M
- Count of values \> 1M
- Total bytes of accounting

**Load Occurrences**

- Get op count
- Get total MB
- Put op count
- Put total MB
- Delete op count
- Delete total MB
- Delete range op count
- Delete range total MB
- Scan op count
- Scan op MB
- Split count
- Merge count

Because accounting information is kept as time series and over many
possible metrics of interest, the data can become voluminous. Accounting
data are stored in the map near the key prefix described, in order to
distribute load (for both aggregation and storage).

Accounting keys for system state have the form:
`<key-prefix>|acctd<metric-name>*`. Notice the leading ‘pipe’
character. It’s meant to sort the root-level accounting AFTER any other
system tables. These keys must increment the same underlying values, as
they are permanent counts and not transient activity. Logic at the
RoachNode takes care of snapshotting the value into an appropriately
suffixed (e.g. with timestamp hour) multi-value time series entry.

Keys for perf/load metrics have the form:
`<key-prefix>acctd<metric-name><hourly-timestamp>`.

`<hourly-timestamp>`-suffixed accounting entries are multi-valued,
containing a varint64 entry for each minute with activity during the
specified hour.
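
To make the two key shapes concrete, here is a simplified sketch (the
helper names are hypothetical and the encoding is illustrative only;
the real keys use a binary key encoding rather than plain string
concatenation):

```go
package main

import (
	"fmt"
	"time"
)

// systemStateKey builds a key of the form <key-prefix>|acctd<metric-name>
// for a permanent counter such as "total-bytes".
func systemStateKey(keyPrefix, metric string) string {
	return keyPrefix + "|acctd" + metric
}

// loadMetricKey builds a key of the form
// <key-prefix>acctd<metric-name><hourly-timestamp> for a transient load
// metric such as "get-op-count", bucketed by hour.
func loadMetricKey(keyPrefix, metric string, t time.Time) string {
	hour := t.UTC().Truncate(time.Hour)
	return fmt.Sprintf("%sacctd%s%d", keyPrefix, metric, hour.Unix())
}

func main() {
	fmt.Println(systemStateKey("db1:user", "total-bytes"))
	fmt.Println(loadMetricKey("db1:user", "get-op-count", time.Now()))
}
```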

To efficiently keep accounting over large key ranges, the task of
aggregation must be distributed. If activity occurs within the same
range as the key prefix for accounting, the updates are made as part
of the consensus write. If the ranges differ, then a message is sent
to the parent range to increment the accounting. If upon receiving the
message, the parent range also does not include the key prefix, it in
turn forwards it to its parent or left child in the balanced binary
tree which is maintained to describe the range hierarchy. This limits
the number of messages before an update is visible at the root to
`2*log N`, where `N` is the number of ranges in the key prefix.

## Zones

Zones are stored in the map with keys prefixed by `\0zone` followed by
the key prefix to which the zone configuration applies. Zone values
specify a protobuf containing the datacenters from which replicas for
ranges which fall under the zone must be chosen.

Please see [config/config.proto](https://github.com/cockroachdb/cockroach/blob/master/config/config.proto) for up-to-date data structures used, the best entry point being `message ZoneConfig`.

If zones are modified in situ, each RoachNode verifies the existing
zones for its ranges against the zone configuration. If it discovers
differences, it reconfigures ranges in the same way that it rebalances
away from busy nodes, via a special-case 1:1 split to a duplicate range
comprising the new configuration.
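
As an illustration of how a zone configuration might be resolved for a
key, here is a simplified sketch. The `zoneConfig` struct and the lookup
function are hypothetical stand-ins for the real `ZoneConfig` protobuf
and its lookup logic; only the longest-matching-prefix idea comes from
the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// zoneConfig is a simplified stand-in for the ZoneConfig protobuf: it
// lists the datacenters from which replicas must be chosen.
type zoneConfig struct {
	Datacenters []string
}

// zoneFor returns the zone whose key prefix is the longest match for the
// given key, mimicking the `\0zone<key-prefix>` layout described above.
func zoneFor(key string, zones map[string]zoneConfig) (zoneConfig, bool) {
	var best string
	var found bool
	for prefix := range zones {
		if strings.HasPrefix(key, prefix) && len(prefix) >= len(best) {
			best, found = prefix, true
		}
	}
	return zones[best], found
}

func main() {
	zones := map[string]zoneConfig{
		"":         {Datacenters: []string{"us-east-1a", "us-east-1b", "us-east-1c"}},
		"db1:user": {Datacenters: []string{"us-east", "us-west", "japan"}},
	}
	cfg, _ := zoneFor("db1:user:42", zones)
	fmt.Println(cfg.Datacenters) // [us-east us-west japan]
}
```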

# SQL

Each node in a cluster can accept SQL client connections. CockroachDB
supports the PostgreSQL wire protocol, to enable reuse of native
PostgreSQL client drivers. Connections using SSL and authenticated
using client certificates are supported and even encouraged over
unencrypted (insecure) and password-based connections.

Each connection is associated with a SQL session which holds the
server-side state of the connection. Over the lifespan of a session
the client can send SQL to open/close transactions, issue statements
or queries, or configure session parameters, much like with any other
SQL database.
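
Because the PostgreSQL wire protocol is used, any standard PostgreSQL
driver can connect. As a hedged sketch (the address, port, user,
database name and SSL mode are assumptions for this example), a Go
client might connect as follows:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // standard PostgreSQL driver, reused as-is
)

func main() {
	// Connection parameters are assumptions for this example; any node in
	// the cluster can serve as the SQL gateway.
	db, err := sql.Open("postgres",
		"postgres://root@localhost:26257/mydb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var one int
	if err := db.QueryRow("SELECT 1").Scan(&one); err != nil {
		log.Fatal(err)
	}
	fmt.Println("connected, SELECT 1 returned", one)
}
```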

## Language support

CockroachDB attempts to emulate the flavor of SQL supported by
PostgreSQL, although it diverges in significant ways:

- CockroachDB exclusively implements MVCC-based consistency for
transactions, and thus only supports SQL's isolation levels SNAPSHOT
and SERIALIZABLE. The other traditional SQL isolation levels are
internally mapped to either SNAPSHOT or SERIALIZABLE.

- CockroachDB implements its own [SQL type system](RFCS/typing.md)
which only supports a limited form of implicit coercions between
types compared to PostgreSQL. The rationale is to keep the
implementation simple and efficient, capitalizing on the observation
that 1) most SQL code in clients is automatically generated with
coherent typing already and 2) existing SQL code for other databases
will need to be massaged for CockroachDB anyway.

## SQL architecture

Client connections over the network are handled in each node by a
pgwire server process (goroutine). This server handles the stream of
incoming commands and sends back responses including query/statement
results. The pgwire server also handles pgwire-level prepared
statements, binding prepared statements to arguments and looking up
prepared statements for execution.

Meanwhile the state of a SQL connection is maintained by a Session
object and a monolithic `planner` object (one per connection) which
coordinates execution between the session, the current SQL transaction
state and the underlying KV store.

Upon receiving a query/statement (either directly or via an execute
command for a previously prepared statement) the pgwire server forwards
the SQL text to the `planner` associated with the connection. The SQL
code is then transformed into a SQL query plan.
The query plan is implemented as a tree of objects which describe the
high-level data operations needed to resolve the query, for example
"join", "index join", "scan", "group", etc.

The query plan objects currently also embed the run-time state needed
for the execution of the query plan. Once the SQL query plan is ready,
methods on these objects then carry the execution out in the fashion
of "generators" in other programming languages: each node *starts* its
children nodes and from that point forward each child node serves as a
*generator* for a stream of result rows, which the parent node can
consume and transform incrementally and present to its own parent node
also as a generator.

The top-level planner consumes the data produced by the top node of
the query plan and returns it to the client via pgwire.
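
As a minimal sketch of the generator pattern described above (the
interface and node names below are hypothetical, not the actual plan
node implementation), a plan node tree might look like this:

```go
package main

import "fmt"

// planNode is a hypothetical sketch of the generator-style interface
// described above: a parent starts its children, then repeatedly pulls
// rows from them.
type planNode interface {
	Start() error
	Next() (row []string, ok bool, err error)
}

// scanNode produces a fixed set of rows, standing in for a KV scan.
type scanNode struct {
	rows [][]string
	pos  int
}

func (s *scanNode) Start() error { return nil }
func (s *scanNode) Next() ([]string, bool, error) {
	if s.pos >= len(s.rows) {
		return nil, false, nil
	}
	r := s.rows[s.pos]
	s.pos++
	return r, true, nil
}

// filterNode consumes rows from its child incrementally and only passes
// through the ones matching a predicate, acting itself as a generator.
type filterNode struct {
	child planNode
	pred  func([]string) bool
}

func (f *filterNode) Start() error { return f.child.Start() }
func (f *filterNode) Next() ([]string, bool, error) {
	for {
		row, ok, err := f.child.Next()
		if err != nil || !ok {
			return nil, false, err
		}
		if f.pred(row) {
			return row, true, nil
		}
	}
}

func main() {
	plan := &filterNode{
		child: &scanNode{rows: [][]string{{"Apple"}, {"Banana"}}},
		pred:  func(r []string) bool { return r[0] == "Apple" },
	}
	_ = plan.Start()
	for {
		row, ok, _ := plan.Next()
		if !ok {
			break
		}
		fmt.Println(row)
	}
}
```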

## Data mapping between the SQL model and KV

Every SQL table has a primary key in CockroachDB. (If a table is created
without one, an implicit primary key is provided automatically.)
The table identifier, followed by the value of the primary key for
each row, are encoded as the *prefix* of a key in the underlying KV
store.

Each remaining column or *column family* in the table is then encoded
as a value in the underlying KV store, and the column/family identifier
is appended as *suffix* to the KV key.

For example:

- after table `customers` is created in a database `mydb` with a
primary key column `name` and normal columns `address` and `URL`, the KV pairs
to store the schema would be:

| Key                           | Value |
| ----------------------------- | ----- |
| `/system/databases/mydb/id`   | 51    |
| `/system/tables/customers/id` | 42    |
| `/system/desc/51/42/address`  | 69    |
| `/system/desc/51/42/url`      | 66    |

(The numeric values on the right are chosen arbitrarily for the
example; the structure of the schema keys on the left is simplified
for the example and subject to change.) Each database/table/column
name is mapped to an automatically generated identifier, so as to
simplify renames.

Then for a single row in this table:

| Key               | Value                            |
| ----------------- | -------------------------------- |
| `/51/42/Apple/69` | `1 Infinite Loop, Cupertino, CA` |
| `/51/42/Apple/66` | `http://apple.com/`              |

Each key has the table prefix `/51/42` followed by the primary key
prefix `/Apple` followed by the column/family suffix (`/66`,
`/69`). The KV value is directly encoded from the SQL value.

Efficient storage for the keys is guaranteed by the underlying RocksDB
engine by means of prefix compression.

Finally, for SQL indexes, the KV key is formed using the SQL value of
the indexed columns, and the KV value is the KV key prefix of the rest
of the indexed row.
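
To tie this together, here is a hedged sketch of the row encoding
described above. It uses plain string concatenation for readability
(the real encoding uses an order-preserving binary format), and it
reuses the arbitrary identifiers from the example.

```go
package main

import "fmt"

// encodeRow maps one SQL row to KV pairs following the scheme above:
// <table-prefix>/<primary-key>/<column-ID> -> column value.
// String concatenation stands in for the real order-preserving encoding.
func encodeRow(tablePrefix, primaryKey string, columns map[int]string) map[string]string {
	kvs := make(map[string]string)
	for colID, value := range columns {
		key := fmt.Sprintf("%s/%s/%d", tablePrefix, primaryKey, colID)
		kvs[key] = value
	}
	return kvs
}

func main() {
	// Table prefix /51/42 and column IDs 69 (address) and 66 (url) reuse
	// the arbitrary identifiers from the schema example above.
	kvs := encodeRow("/51/42", "Apple", map[int]string{
		69: "1 Infinite Loop, Cupertino, CA",
		66: "http://apple.com/",
	})
	for k, v := range kvs {
		fmt.Println(k, "->", v)
	}
}
```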

# References

[0]: http://rocksdb.org/
[1]: https://github.com/google/leveldb
[2]: https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf
[3]: http://research.google.com/archive/spanner.html
[4]: http://research.google.com/pubs/pub36971.html
[5]: https://github.com/cockroachdb/cockroach/tree/master/sql
[7]: https://godoc.org/github.com/cockroachdb/cockroach/kv
[8]: https://github.com/cockroachdb/cockroach/tree/master/kv
[9]: https://godoc.org/github.com/cockroachdb/cockroach/server
[10]: https://github.com/cockroachdb/cockroach/tree/master/server
[11]: https://godoc.org/github.com/cockroachdb/cockroach/storage
[12]: https://github.com/cockroachdb/cockroach/tree/master/storage