# About

This document is an updated version of the original design documents
by Spencer Kimball from early 2014.

# Overview

CockroachDB is a distributed SQL database. The primary design goals
are **scalability**, **strong consistency** and **survivability**
(hence the name). CockroachDB aims to tolerate disk, machine, rack, and
even **datacenter failures** with minimal latency disruption and **no
manual intervention**. CockroachDB nodes are symmetric; a design goal is
**homogeneous deployment** (one binary) with minimal configuration and
no required external dependencies.

The entry point for database clients is the SQL interface. Every node
in a CockroachDB cluster can act as a client SQL gateway. A SQL
gateway transforms client SQL statements into key-value (KV)
operations, which it executes and distributes across the cluster as
necessary, returning results to the client. CockroachDB implements a
**single, monolithic sorted map** from key to value where both keys
and values are byte strings.

The KV map is logically composed of smaller segments of the keyspace
called ranges. Each range is backed by data stored in a local KV
storage engine (we use [RocksDB](http://rocksdb.org/), a variant of
LevelDB). Range data is replicated to a configurable number of
additional CockroachDB nodes. Ranges are merged and split to maintain
a target size, by default `64M`. The relatively small size facilitates
quick repair and rebalancing to address node failures, new capacity
and even read/write load. However, the size must be balanced against
the pressure on the system from having more ranges to manage.

CockroachDB achieves horizontal scalability:
- adding more nodes increases the capacity of the cluster by the
  amount of storage on each node (divided by a configurable
  replication factor), theoretically up to 4 exabytes (4E) of logical
  data;
- client queries can be sent to any node in the cluster, and queries
  can operate independently (w/o conflicts), meaning that overall
  throughput scales linearly with the number of nodes in the cluster;
- queries are distributed (ref: distributed SQL) so that the overall
  throughput of single queries can be increased by adding more nodes.

CockroachDB achieves strong consistency:
- uses a distributed consensus protocol for synchronous replication of
  data in each key value range. We’ve chosen to use the [Raft
  consensus algorithm](https://raftconsensus.github.io); all consensus
  state is stored in RocksDB.
- single or batched mutations to a single range are mediated via the
  range's Raft instance. Raft guarantees ACID semantics.
- logical mutations which affect multiple ranges employ distributed
  transactions for ACID semantics. CockroachDB uses an efficient
  **non-locking distributed commit** protocol.

CockroachDB achieves survivability:
- range replicas can be co-located within a single datacenter for low
  latency replication and survive disk or machine failures. They can
  be distributed across racks to survive some network switch failures.
- range replicas can be located in datacenters spanning increasingly
  disparate geographies to survive ever-greater failure scenarios from
  datacenter power or networking loss to regional power failures
  (e.g. `{ US-East-1a, US-East-1b, US-East-1c }`, `{ US-East, US-West,
  Japan }`, `{ Ireland, US-East, US-West }`, `{ Ireland, US-East,
  US-West, Japan, Australia }`).

CockroachDB provides [snapshot isolation](http://en.wikipedia.org/wiki/Snapshot_isolation) (SI) and
serializable snapshot isolation (SSI) semantics, allowing **externally
consistent, lock-free reads and writes**--both from a historical
snapshot timestamp and from the current wall clock time. SI provides
lock-free reads and writes but still allows write skew. SSI eliminates
write skew, but introduces a performance hit in the case of a
contentious system. SSI is the default isolation; clients must
consciously decide to trade correctness for performance. CockroachDB
implements [a limited form of linearizability](#linearizability),
providing ordering for any observer or chain of observers.

Similar to
[Spanner](http://static.googleusercontent.com/media/research.google.com/en/us/archive/spanner-osdi2012.pdf)
directories, CockroachDB allows configuration of arbitrary zones of data.
This allows replication factor, storage device type, and/or datacenter
location to be chosen to optimize performance and/or availability.
Unlike Spanner, zones are monolithic and don’t allow movement of
fine-grained data on the level of entity groups.

# Architecture
86
87
CockroachDB implements a layered architecture. The highest level of
88
abstraction is the SQL layer (currently unspecified in this document).
89
It depends directly on the [*SQL layer*](#sql),
90
which provides familiar relational concepts
91
such as schemas, tables, columns, and indexes. The SQL layer
92
in turn depends on the [distributed key value store](#key-value-api),
93
which handles the details of range addressing to provide the abstraction
94
of a single, monolithic key value store. The distributed KV store
95
communicates with any number of physical cockroach nodes. Each node
96
contains one or more stores, one per physical device.
97
98
![Architecture](media/architecture.png)
99
100
Each store contains potentially many ranges, the lowest-level unit of
101
key-value data. Ranges are replicated using the Raft consensus protocol.
102
The diagram below is a blown up version of stores from four of the five
103
nodes in the previous diagram. Each range is replicated three ways using
104
raft. The color coding shows associated range replicas.
105
106
![Ranges](media/ranges.png)
107
108
Each physical node exports two RPC-based key value APIs: one for
109
external clients and one for internal clients (exposing sensitive
110
operational features). Both services accept batches of requests and
111
return batches of responses. Nodes are symmetric in capabilities and
112
exported interfaces; each has the same binary and may assume any
113
role.
114
115
Nodes and the ranges they provide access to can be arranged with various
116
physical network topologies to make trade offs between reliability and
117
performance. For example, a triplicated (3-way replica) range could have
118
each replica located on different:
119
120
- disks within a server to tolerate disk failures.
121
- servers within a rack to tolerate server failures.
122
- servers on different racks within a datacenter to tolerate rack power/network failures.
123
- servers in different datacenters to tolerate large scale network or power outages.
124
125
Up to `F` failures can be tolerated, where the total number of replicas `N = 2F + 1` (e.g. with 3x replication, one failure can be tolerated; with 5x replication, two failures, and so on).
126
# Cockroach Client

In order to support diverse client usage, Cockroach clients connect to
any node via HTTPS using protocol buffers or JSON. The connected node
proxies the client's work, including key lookups and write buffering.

# Keys

Cockroach keys are arbitrary byte arrays. Keys come in two flavors:
system keys and table data keys. System keys are used by Cockroach for
internal data structures and metadata. Table data keys contain SQL
table data (as well as index data). System and table data keys are
prefixed in such a way that all system keys sort before any table data
keys.

System keys come in several subtypes:

- **Global** keys store cluster-wide data such as the "meta1" and
  "meta2" keys as well as various other system-wide keys such as the
  node and store ID allocators.
- **Store local** keys are used for unreplicated store metadata
  (e.g. the `StoreIdent` structure). "Unreplicated" indicates that
  these values are not replicated across multiple stores because the
  data they hold is tied to the lifetime of the store they are
  present on.
- **Range local** keys store range metadata that is associated with a
  global key. Range local keys have a special prefix followed by a
  global key and a special suffix. For example, transaction records
  are range local keys which look like:
  `\x01k<global-key>txn-<txnID>`.
- **Replicated Range ID local** keys store range metadata that is
  present on all of the replicas for a range. These keys are updated
  via Raft operations. Examples include the range lease state and
  abort cache entries.
- **Unreplicated Range ID local** keys store range metadata that is
  local to a replica. The primary examples of such keys are the Raft
  state and Raft log.

Table data keys are used to store all SQL data. Table data keys
contain internal structure as described in the section on [mapping
data between the SQL model and
KV](#data-mapping-between-the-sql-model-and-kv).

# Versioned Values

Cockroach maintains historical versions of values by storing them with
associated commit timestamps. Reads and scans can specify a snapshot
time to return the most recent writes prior to the snapshot timestamp.
Older versions of values are garbage collected by the system during
compaction according to a user-specified expiration interval. In order
to support long-running scans (e.g. for MapReduce), all versions have a
minimum expiration.

Versioned values are supported via modifications to RocksDB to record
commit timestamps and GC expirations per key.

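The following is an illustrative sketch only (not the actual storage-layer
encoding) of how versioned values can be laid out: each version of a user key
is stored under the key plus an inverted timestamp, so that a forward scan
encounters the newest version of a key first.

```go
// Illustrative sketch; the real MVCC key encoding differs in detail.
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// mvccKey appends an inverted wall time to the user key so that, for a fixed
// key, newer versions sort before older ones in a forward scan.
func mvccKey(key []byte, wallTime uint64) []byte {
	out := append([]byte{}, key...)
	var ts [8]byte
	binary.BigEndian.PutUint64(ts[:], math.MaxUint64-wallTime)
	return append(out, ts[:]...)
}

func main() {
	// Two versions of key "a": the version written at time 200 sorts first.
	fmt.Printf("%x\n", mvccKey([]byte("a"), 200))
	fmt.Printf("%x\n", mvccKey([]byte("a"), 100))
}
```
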
# Lock-Free Distributed Transactions

Cockroach provides distributed transactions without locks. Cockroach
transactions support two isolation levels:

- snapshot isolation (SI) and
- *serializable* snapshot isolation (SSI).

*SI* is simple to implement, highly performant, and correct for all but a
handful of anomalous conditions (e.g. write skew). *SSI* requires just a touch
more complexity, is still highly performant (less so with contention), and has
no anomalous conditions. Cockroach’s SSI implementation is based on ideas from
the literature and some possibly novel insights.

SSI is the default level, with SI provided for application developers
who are certain enough of their need for performance and the absence of
write skew conditions to consciously elect to use it. In a lightly
contended system, our implementation of SSI is just as performant as SI,
requiring no locking or additional writes. With contention, our
implementation of SSI still requires no locking, but will end up
aborting more transactions. Cockroach’s SI and SSI implementations
prevent starvation scenarios even for arbitrarily long transactions.

See the [Cahill paper](https://drive.google.com/file/d/0B9GCVTp_FHJIcEVyZVdDWEpYYXVVbFVDWElrYUV0NHFhU2Fv/edit?usp=sharing)
for one possible implementation of SSI. See also this [great paper](http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf).
For a discussion of SSI implemented by preventing read-write conflicts
(in contrast to detecting them, called write-snapshot isolation), see
the [Yabandeh paper](https://drive.google.com/file/d/0B9GCVTp_FHJIMjJ2U2t6aGpHLTFUVHFnMTRUbnBwc2pLa1RN/edit?usp=sharing),
which is the source of much inspiration for Cockroach’s SSI.

Both SI and SSI require that the outcome of reads be preserved, i.e.
a write of a key at a lower timestamp than a previous read must not succeed. To
this end, each range maintains a bounded *in-memory* cache from key range to
the latest timestamp at which it was read.

Most updates to this *timestamp cache* correspond to keys being read, though
the timestamp cache also protects the outcome of some writes (notably range
deletions), which consequently must also populate the cache. The cache’s entries
are evicted oldest timestamp first, updating the low water mark of the cache
appropriately.

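A minimal sketch of the idea (the names and the brute-force interval scan are
illustrative only; the real cache is an interval structure with bounded
memory):

```go
// Illustrative timestamp cache sketch; not CockroachDB's actual types.
package main

import "fmt"

type span struct{ start, end string } // [start, end)

type timestampCache struct {
	lowWater int64          // all reads below this timestamp are assumed to have happened
	reads    map[span]int64 // latest read timestamp per key span
}

// getMaxRead returns the latest timestamp at which any key overlapping
// [start, end) was read, falling back to the low water mark.
func (tc *timestampCache) getMaxRead(start, end string) int64 {
	maxTS := tc.lowWater
	for sp, ts := range tc.reads {
		if sp.start < end && start < sp.end && ts > maxTS {
			maxTS = ts
		}
	}
	return maxTS
}

func main() {
	tc := &timestampCache{lowWater: 5, reads: map[span]int64{{"a", "c"}: 10}}
	// A write to "b" at timestamp 7 conflicts with the read at 10 and must
	// be pushed to a timestamp above 10.
	fmt.Println(tc.getMaxRead("b", "b\x00")) // 10
}
```
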
Each Cockroach transaction is assigned a random priority and a
"candidate timestamp" at start. The candidate timestamp is the
provisional timestamp at which the transaction will commit, and is
chosen as the current clock time of the node coordinating the
transaction. This means that a transaction without conflicts will
usually commit with a timestamp that, in absolute time, precedes the
actual work done by that transaction.

In the course of coordinating a transaction between one or more
distributed nodes, the candidate timestamp may be increased, but will
never be decreased. The core difference between the two isolation levels
SI and SSI is that the former allows the transaction's candidate
timestamp to increase and the latter does not.

**Hybrid Logical Clock**

Each cockroach node maintains a hybrid logical clock (HLC) as discussed
in the [Hybrid Logical Clock paper](http://www.cse.buffalo.edu/tech-reports/2014-04.pdf).
HLC time uses timestamps which are composed of a physical component (thought of
as, and always close to, local wall time) and a logical component (used to
distinguish between events with the same physical component). It allows us to
track causality for related events similar to vector clocks, but with less
overhead. In practice, it works much like other logical clocks: when events
are received by a node, it informs the local HLC about the timestamp supplied
with the event by the sender, and when events are sent, a timestamp generated by
the local HLC is attached.

For a more in-depth description of HLC, please read the paper. Our
implementation is [here](https://github.com/cockroachdb/cockroach/blob/master/util/hlc/hlc.go).

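For orientation, here is a condensed sketch of the HLC update rules from the
paper; it is not the `util/hlc` implementation linked above.

```go
// Condensed sketch of the HLC send/receive rules from the HLC paper.
package main

import (
	"fmt"
	"time"
)

type Timestamp struct {
	WallTime int64 // physical component, nanoseconds
	Logical  int32 // logical component, breaks ties within a wall time
}

type HLC struct {
	now func() int64 // physical clock source
	ts  Timestamp    // last timestamp handed out or observed
}

// Now returns a timestamp for a locally generated event (e.g. sending).
func (c *HLC) Now() Timestamp {
	if pt := c.now(); pt > c.ts.WallTime {
		c.ts = Timestamp{WallTime: pt}
	} else {
		c.ts.Logical++
	}
	return c.ts
}

// Update incorporates a timestamp received from another node so that
// subsequent local timestamps are causally after it.
func (c *HLC) Update(remote Timestamp) Timestamp {
	pt := c.now()
	switch {
	case pt > c.ts.WallTime && pt > remote.WallTime:
		c.ts = Timestamp{WallTime: pt} // physical clock dominates; logical resets
	case remote.WallTime > c.ts.WallTime:
		c.ts = Timestamp{WallTime: remote.WallTime, Logical: remote.Logical + 1}
	case c.ts.WallTime > remote.WallTime:
		c.ts.Logical++
	default: // equal wall times: advance past both logical components
		if remote.Logical > c.ts.Logical {
			c.ts.Logical = remote.Logical
		}
		c.ts.Logical++
	}
	return c.ts
}

func main() {
	c := &HLC{now: func() int64 { return time.Now().UnixNano() }}
	fmt.Println(c.Now(), c.Update(Timestamp{WallTime: 1 << 62, Logical: 3}))
}
```
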
Cockroach picks a Timestamp for a transaction using HLC time. Throughout this
document, *timestamp* always refers to the HLC time, which is a singleton
on each node. The HLC is updated by every read/write event on the node, and
the HLC time >= wall time. A read/write timestamp received in a cockroach request
from another node is not only used to version the operation, but also updates
the HLC on the node. This is useful in guaranteeing that all data read/written
on a node is at a timestamp < next HLC time.

**Transaction execution flow**

Transactions are executed in two phases:

1. Start the transaction by selecting a range which is likely to be
   heavily involved in the transaction and writing a new transaction
   record to a reserved area of that range with state "PENDING". In
   parallel, write an "intent" value for each datum being written as part
   of the transaction. These are normal MVCC values, with the addition of
   a special flag (i.e. “intent”) indicating that the value may be
   committed after the transaction itself commits. In addition,
   the transaction id (unique and chosen at tx start time by client)
   is stored with intent values. The txn id is used to refer to the
   transaction record when there are conflicts and to make
   tie-breaking decisions on ordering between identical timestamps.
   Each node returns the timestamp used for the write (which is the
   original candidate timestamp in the absence of read/write conflicts);
   the client selects the maximum from amongst all write timestamps as the
   final commit timestamp.

2. Commit the transaction by updating its transaction record. The value
   of the commit entry contains the candidate timestamp (increased as
   necessary to accommodate any latest read timestamps). Note that the
   transaction is considered fully committed at this point and control
   may be returned to the client.

   In the case of an SI transaction, a commit timestamp which was
   increased to accommodate concurrent readers is perfectly
   acceptable and the commit may continue. For SSI transactions,
   however, a gap between candidate and commit timestamps
   necessitates transaction restart (note: restart is different than
   abort--see below).

   After the transaction is committed, all written intents are upgraded
   in parallel by removing the “intent” flag. The transaction is
   considered fully committed before this step and does not wait for
   it to return control to the transaction coordinator.

In the absence of conflicts, this is the end. Nothing else is necessary
to ensure the correctness of the system.

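To make the two phases concrete, here is a self-contained toy of the
coordinator-side flow under stated simplifications: in-memory maps stand in
for ranges and the transaction record, and only the read-timestamp push is
modeled. None of these names are CockroachDB's actual code.

```go
// Toy of the two-phase commit flow described above; illustrative only.
package main

import "fmt"

type isolation int

const (
	SI isolation = iota
	SSI
)

type txn struct {
	id          string
	isolation   isolation
	candidateTS int64
	status      string // "PENDING", "COMMITTED"
}

var (
	txnRecords = map[string]*txn{}
	intents    = map[string]string{} // key -> txn id owning the intent
	readTS     = map[string]int64{}  // key -> latest read timestamp (timestamp cache)
)

// writeIntent lays down an intent and returns the timestamp used, pushed
// above any later read of the key (see the timestamp cache above).
func writeIntent(t *txn, key string) int64 {
	ts := t.candidateTS
	if r, ok := readTS[key]; ok && r >= ts {
		ts = r + 1
	}
	intents[key] = t.id
	return ts
}

func runTxn(t *txn, keys []string) error {
	// Phase 1: PENDING transaction record plus an intent per written key.
	t.status = "PENDING"
	txnRecords[t.id] = t
	commitTS := t.candidateTS
	for _, k := range keys {
		if ts := writeIntent(t, k); ts > commitTS {
			commitTS = ts
		}
	}
	// SSI cannot commit at a timestamp later than its candidate; it restarts.
	if t.isolation == SSI && commitTS > t.candidateTS {
		return fmt.Errorf("restart: commit ts %d > candidate ts %d", commitTS, t.candidateTS)
	}
	// Phase 2: mark COMMITTED; intent resolution happens asynchronously.
	t.candidateTS = commitTS
	t.status = "COMMITTED"
	return nil
}

func main() {
	readTS["a"] = 10 // someone read "a" at timestamp 10
	err := runTxn(&txn{id: "t1", isolation: SSI, candidateTS: 7}, []string{"a", "b"})
	fmt.Println(err) // the SSI transaction must restart: its write to "a" was pushed
}
```
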
**Conflict Resolution**

Things get more interesting when a reader or writer encounters an intent
record or newly-committed value in a location that it needs to read or
write. This is a conflict, usually causing one of the transactions to
abort or restart depending on the type of conflict.

***Transaction restart:***

This is the usual (and more efficient) type of behaviour and is used
except when the transaction was aborted (for instance by another
transaction).
In effect, that reduces to two cases; the first being the one outlined
above: an SSI transaction that finds upon attempting to commit that
its commit timestamp has been pushed. The second case involves a transaction
actively encountering a conflict, that is, one of its readers or writers
encounters data that necessitates conflict resolution
(see transaction interactions below).

When a transaction restarts, it changes its priority and/or moves its
timestamp forward depending on data tied to the conflict, and
begins anew, reusing the same txn id. The prior run of the transaction might
have written some write intents, which need to be deleted before the
transaction commits, so as to not be included as part of the transaction.
These stale write intent deletions are done during the reexecution of the
transaction, either implicitly, through writing new intents to
the same keys as part of the reexecution of the transaction, or explicitly,
by cleaning up stale intents that are not part of the reexecution of the
transaction. Since most transactions will end up writing to the same keys,
the explicit cleanup run just before committing the transaction is usually
a no-op.

***Transaction abort:***

This is the case in which a transaction, upon reading its transaction
record, finds that it has been aborted. In this case, the transaction
can not reuse its intents; it returns control to the client before
cleaning them up (other readers and writers would clean up dangling
intents as they encounter them) but will make an effort to clean up
after itself. The next attempt (if applicable) then runs as a new
transaction with **a new txn id**.

***Transaction interactions:***

There are several scenarios in which transactions interact:

- **Reader encounters write intent or value with newer timestamp far
  enough in the future**: This is not a conflict. The reader is free
  to proceed; after all, it will be reading an older version of the
  value and so does not conflict. Recall that the write intent may
  be committed with a later timestamp than its candidate; it will
  never commit with an earlier one. **Side note**: if an SI transaction
  reader finds an intent with a newer timestamp which the reader’s own
  transaction has written, the reader always returns that intent's value.

- **Reader encounters write intent or value with newer timestamp in the
  near future:** In this case, we have to be careful. The newer
  intent may, in absolute terms, have happened in our read's past if
  the clock of the writer is ahead of the node serving the values.
  In that case, we would need to take this value into account, but
  we just don't know. Hence the transaction restarts, using instead
  a future timestamp (but remembering a maximum timestamp used to
  limit the uncertainty window to the maximum clock skew). In fact,
  this is optimized further; see the details under "Choosing a
  Timestamp" below.

- **Reader encounters write intent with older timestamp**: the reader
  must follow the intent’s transaction id to the transaction record.
  If the transaction has already been committed, then the reader can
  just read the value. If the write transaction has not yet been
  committed, then the reader has two options. If the write conflict
  is from an SI transaction, the reader can *push that transaction's
  commit timestamp into the future* (and consequently not have to
  read it). This is simple to do: the reader just updates the
  transaction’s commit timestamp to indicate that when/if the
  transaction does commit, it should use a timestamp *at least* as
  high. However, if the write conflict is from an SSI transaction,
  the reader must compare priorities. If the reader has the higher priority,
  it pushes the transaction’s commit timestamp (that
  transaction will then notice its timestamp has been pushed, and
  restart). If it has the lower or same priority, it retries itself using as
  a new priority `max(new random priority, conflicting txn’s
  priority - 1)`.

- **Writer encounters uncommitted write intent**:
  If the other write intent has been written by a transaction with a lower
  priority, the writer aborts the conflicting transaction. If the write
  intent has a higher or equal priority the transaction retries, using as a new
  priority *max(new random priority, conflicting txn’s priority - 1)*;
  the retry occurs after a short, randomized backoff interval (see the
  sketch of this priority computation after this list).

- **Writer encounters newer committed value**:
  The committed value could also be an unresolved write intent made by a
  transaction that has already committed. The transaction restarts. On restart,
  the same priority is reused, but the candidate timestamp is moved forward
  to the encountered value's timestamp.

- **Writer encounters more recently read key**:
  The *read timestamp cache* is consulted on each write at a node. If the write’s
  candidate timestamp is earlier than the low water mark on the cache itself
  (i.e. its last evicted timestamp) or if the key being written has a read
  timestamp later than the write’s candidate timestamp, this later timestamp
  value is returned with the write. A new timestamp forces a transaction
  restart only if the transaction is serializable (SSI).

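As referenced in the bullets above, a minimal sketch of the retry-priority
rule (illustrative only):

```go
// Illustrative only: how a transaction that lost a conflict picks its new
// priority, max(new random priority, conflicting txn's priority - 1).
package main

import (
	"fmt"
	"math/rand"
)

func retryPriority(conflictingPriority int32) int32 {
	newPriority := rand.Int31() // fresh random priority
	if p := conflictingPriority - 1; p > newPriority {
		return p
	}
	return newPriority
}

func main() {
	// A retry against a very high priority transaction ends up with a priority
	// just below it, so it will win against most other transactions it meets.
	fmt.Println(retryPriority(1 << 30))
}
```
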
**Transaction management**

Transactions are managed by the client proxy (or gateway in SQL Azure
parlance). Unlike in Spanner, writes are not buffered but are sent
directly to all implicated ranges. This allows the transaction to abort
quickly if it encounters a write conflict. The client proxy keeps track
of all written keys in order to resolve write intents asynchronously upon
transaction completion. If a transaction commits successfully, all intents
are upgraded to committed. In the event a transaction is aborted, all written
intents are deleted. The client proxy doesn’t guarantee it will resolve intents.

In the event the client proxy restarts before the pending transaction is
committed, the dangling transaction would continue to "live" until
aborted by another transaction. Transactions periodically heartbeat
their transaction record to maintain liveness.
Transactions encountered by readers or writers with dangling intents
whose transaction record hasn’t been heartbeat within the required
interval are aborted.
In the event the proxy restarts after a transaction commits but before
the asynchronous resolution is complete, the dangling intents are upgraded
when encountered by future readers and writers and the system does
not depend on their timely resolution for correctness.

An exploration of retries with contention and abort times with abandoned
transactions is
[here](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing).

**Transaction Records**

Please see [roachpb/data.proto](https://github.com/cockroachdb/cockroach/blob/master/roachpb/data.proto) for the up-to-date structures, the best entry point being `message Transaction`.

**Pros**

- No requirement for reliable code execution to prevent stalled 2PC
  protocol.
- Readers never block with SI semantics; with SSI semantics, they may
  abort.
- Lower latency than traditional 2PC commit protocol (w/o contention)
  because the second phase requires only a single write to the
  transaction record instead of a synchronous round to all
  transaction participants.
- Priorities avoid starvation for arbitrarily long transactions and
  always pick a winner from between contending transactions (no
  mutual aborts).
- Writes not buffered at client; writes fail fast.
- No read-locking overhead required for *serializable* SI (in contrast
  to other SSI implementations).
- Well-chosen (i.e. less random) priorities can flexibly give
  probabilistic guarantees on latency for arbitrary transactions
  (for example: make OLTP transactions 10x less likely to abort than
  low priority transactions, such as asynchronously scheduled jobs).

**Cons**

- Reads from non-lease holder replicas still require a ping to the lease holder
  to update the *read timestamp cache*.
- Abandoned transactions may block contending writers for up to the
  heartbeat interval, though average wait is likely to be
  considerably shorter (see [graph in link](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing)).
  This is likely considerably more performant than detecting and
  restarting 2PC in order to release read and write locks.
- Behavior different than other SI implementations: no first writer
  wins, and shorter transactions do not always finish quickly. The
  element of surprise may be a problematic factor for OLTP systems.
- Aborts can decrease throughput in a contended system compared with
  two-phase locking. Aborts and retries increase read and write
  traffic, increase latency and decrease throughput.

**Choosing a Timestamp**

A key challenge of reading data in a distributed system with clock skew
is choosing a timestamp guaranteed to be greater than the latest
timestamp of any committed transaction (in absolute time). No system can
claim consistency and fail to read already-committed data.

Accomplishing consistency for transactions (or just single operations)
accessing a single node is easy. The timestamp is assigned by the node
itself, so it is guaranteed to be at a greater timestamp than all the
existing timestamped data on the node.

For multiple nodes, the timestamp of the node coordinating the
transaction `t` is used. In addition, a maximum timestamp `t+ε` is
supplied to provide an upper bound on timestamps for already-committed
data (`ε` is the maximum clock skew). As the transaction progresses, any
data read which have timestamps greater than `t` but less than `t+ε`
cause the transaction to abort and retry with the conflicting timestamp
t<sub>c</sub>, where t<sub>c</sub> \> t. The maximum timestamp `t+ε` remains
the same. This implies that transaction restarts due to clock uncertainty
can only happen on a time interval of length `ε`.

We apply another optimization to reduce the restarts caused
by uncertainty. Upon restarting, the transaction not only takes
into account t<sub>c</sub>, but the timestamp of the node at the time
of the uncertain read t<sub>node</sub>. The larger of those two timestamps
t<sub>c</sub> and t<sub>node</sub> (likely equal to the latter) is used
to increase the read timestamp. Additionally, the conflicting node is
marked as “certain”. Then, for future reads to that node within the
transaction, we set `MaxTimestamp = Read Timestamp`, preventing further
uncertainty restarts.

Correctness follows from the fact that we know that at the time of the read,
there exists no version of any key on that node with a higher timestamp than
t<sub>node</sub>. Upon a restart caused by the node, if the transaction
encounters a key with a higher timestamp, it knows that in absolute time,
the value was written after t<sub>node</sub> was obtained, i.e. after the
uncertain read. Hence the transaction can move forward reading an older version
of the data (at the transaction's timestamp). This limits the time uncertainty
restarts attributed to a node to at most one. The tradeoff is that we might
pick a timestamp larger than the optimal one (> highest conflicting timestamp),
resulting in the possibility of a few more conflicts.

We expect retries will be rare, but this assumption may need to be
revisited if retries become problematic. Note that this problem does not
apply to historical reads. An alternate approach which does not require
retries makes a round to all node participants in advance and
chooses the highest reported node wall time as the timestamp. However,
knowing which nodes will be accessed in advance is difficult and
potentially limiting. Cockroach could also potentially use a global
clock (Google did this with [Percolator](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf)),
which would be feasible for smaller, geographically-proximate clusters.

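A compact illustration of the uncertainty rule above, with hypothetical names:
a read at timestamp `t` with maximum timestamp `t+ε` must restart for values
in the interval `(t, t+ε]`, unless the serving node has already been marked
certain.

```go
// Illustrative uncertainty-interval check; all names here are hypothetical.
package main

import "fmt"

type read struct {
	ts           int64            // transaction read timestamp t
	maxTS        int64            // upper bound t+ε (ε = max clock skew)
	certainNodes map[int]struct{} // nodes whose clocks we've already observed
}

// mustRestart reports whether a value with the given commit timestamp,
// served by nodeID, falls into the uncertainty window (t, t+ε].
func (r *read) mustRestart(valueTS int64, nodeID int) bool {
	if _, ok := r.certainNodes[nodeID]; ok {
		return false // MaxTimestamp was lowered to the read timestamp for this node
	}
	return valueTS > r.ts && valueTS <= r.maxTS
}

func main() {
	r := &read{ts: 100, maxTS: 110, certainNodes: map[int]struct{}{}}
	fmt.Println(r.mustRestart(105, 1)) // true: inside the uncertainty interval
	fmt.Println(r.mustRestart(120, 1)) // false: definitely in the future
	r.certainNodes[1] = struct{}{}
	fmt.Println(r.mustRestart(105, 1)) // false: node 1 already marked certain
}
```
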
# Strict Serializability (Linearizability)

Roughly speaking, the gap between <i>strict serializability</i> (which we use
interchangeably with <i>linearizability</i>) and CockroachDB's default
isolation level (<i>serializable</i>) is that with linearizable transactions,
causality is preserved. That is, if one transaction (say, creating a posting
for a user) waits for its predecessor (creating the user in the first place)
to complete, one would hope that the logical timestamp assigned to the former
is larger than that of the latter.
In practice, in distributed databases this may not hold. The reason is
typically that clocks across a distributed system are not perfectly
synchronized, and the "later" transaction touches a part of the keyspace
disjoint from that of the first transaction, so that clocks with disjoint
information decide the commit timestamps.

In practice, in CockroachDB many transactional workloads are actually
linearizable, though the precise conditions are too involved to outline
here.

Causality is typically not required for many transactions, and so it is
advantageous to pay for it only when it *is* needed. CockroachDB implements
this via <i>causality tokens</i>: when committing a transaction, a causality
token can be retrieved and passed to the next transaction, ensuring that these
two transactions get assigned increasing logical timestamps.

Additionally, as better synchronized clocks become a standard commodity offered
by cloud providers, CockroachDB can provide global linearizability by doing
much the same that [Google's
Spanner](http://research.google.com/archive/spanner.html) does: wait out the
maximum clock offset after committing, but before returning to the client.

See the blog post below for much more in-depth information.

https://www.cockroachlabs.com/blog/living-without-atomic-clocks/

# Logical Map Content

Logically, the map contains a series of reserved system key/value
pairs preceding the actual user data (which is managed by the SQL
subsystem).

- `\x02<key1>`: Range metadata for range ending `\x03<key1>`. This is a "meta1" key.
- ...
- `\x02<keyN>`: Range metadata for range ending `\x03<keyN>`. This is a "meta1" key.
- `\x03<key1>`: Range metadata for range ending `<key1>`. This is a "meta2" key.
- ...
- `\x03<keyN>`: Range metadata for range ending `<keyN>`. This is a "meta2" key.
- `\x04{desc,node,range,store}-idgen`: ID generation oracles for various component types.
- `\x04status-node-<varint encoded Node ID>`: Node runtime metadata.
- `\x04tsd<key>`: Time-series data key.
- `<key>`: A user key. In practice, these keys are managed by the SQL
  subsystem, which employs its own key anatomy.

# Node Storage

Nodes maintain a separate instance of RocksDB for each store (physical
or virtual storage device). Each RocksDB instance hosts any number of
replicas. RPCs arriving at a node are routed based on the store ID to
the appropriate RocksDB instance. A single instance per store is used
to avoid contention. If every range maintained its own RocksDB, global
management of available cache memory would be impossible and writers
for each range would compete for non-contiguous writes to multiple
RocksDB logs.

In addition to the key/value pairs of the range itself, various range
metadata is maintained.

- participating replicas

- consensus metadata

- split/merge activity

A really good reference on tuning Linux installations with RocksDB is
[here](http://docs.basho.com/riak/latest/ops/advanced/backends/leveldb/).

# Range Metadata

The default approximate size of a range is 64M (2\^26 B). In order to
support 1P (2\^50 B) of logical data, metadata is needed for roughly
2\^(50 - 26) = 2\^24 ranges. A reasonable upper bound on range metadata
size is roughly 256 bytes (3\*12 bytes for the triplicated node
locations and 220 bytes for the range key itself). 2\^24 ranges \* 2\^8
B would require roughly 4G (2\^32 B) to store--too much to duplicate
between machines. Our conclusion is that range metadata must be
distributed for large installations.

To keep key lookups relatively fast in the presence of distributed metadata,
we store all the top-level metadata in a single range (the first range). These
top-level metadata keys are known as *meta1* keys, and are prefixed such that
they sort to the beginning of the key space. Given the metadata size of 256
bytes given above, a single 64M range would support 64M/256B = 2\^18 ranges,
which gives a total storage of 64M \* 2\^18 = 16T. To support the 1P quoted
above, we need two levels of indirection, where the first level addresses the
second, and the second addresses user data. With two levels of indirection, we
can address 2\^(18 + 18) = 2\^36 ranges; each range addresses 2\^26 B, and
altogether we address 2\^(36+26) B = 2\^62 B = 4E of user data.

For a given user-addressable `key1`, the associated *meta1* record is found
at the successor key to `key1` in the *meta1* space. Since the *meta1* space
is sparse, the successor key is defined as the next key which is present. The
*meta1* record identifies the range containing the *meta2* record, which is
found using the same process. The *meta2* record identifies the range
containing `key1`, which is again found the same way (see examples below).

Concretely, metadata keys are prefixed by `\0\0meta{1,2}`; the two null
characters provide for the desired sorting behaviour. Thus, `key1`'s
*meta1* record will reside at the successor key to `\0\0meta1<key1>`.

Note: we append the end key of each range to meta{1,2} records because
the RocksDB iterator only supports a Seek() interface which acts as a
Ceil(). Using the start key of the range would cause Seek() to find the
key *after* the meta indexing record we’re looking for, which would
result in having to back the iterator up, an option which is both less
efficient and not available in all cases.

The following example shows the directory structure for a map with
three ranges worth of data. Ellipses indicate additional key/value pairs to
fill an entire range of data. Except for the fact that splitting ranges
requires updates to the range metadata with knowledge of the metadata layout,
the range metadata itself requires no special treatment or bootstrapping.

**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
`dcrama3:8000`)

- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2<lastkey0>`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2<lastkey1>`: `dcrama4:8000`, `dcrama5:8000`, `dcrama6:8000`
- `\0\0meta2\xff`: `dcrama7:8000`, `dcrama8:8000`, `dcrama9:8000`
- ...
- `<lastkey0>`: `<lastvalue0>`

**Range 1** (located on servers `dcrama4:8000`, `dcrama5:8000`,
`dcrama6:8000`)

- ...
- `<lastkey1>`: `<lastvalue1>`

**Range 2** (located on servers `dcrama7:8000`, `dcrama8:8000`,
`dcrama9:8000`)

- ...
- `<lastkey2>`: `<lastvalue2>`

Consider a simpler example of a map containing less than a single
range of data. In this case, all range metadata and all data are
located in the same range:

**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
`dcrama3:8000`)

- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `<key0>`: `<value0>`
- `...`

Finally, a map large enough to need both levels of indirection would
look like (note that instead of showing range replicas, this
example is simplified to just show range indexes):

**Range 0**

- `\0\0meta1<lastkeyN-1>`: Range 0
- `\0\0meta1\xff`: Range 1
- `\0\0meta2<lastkey1>`: Range 1
- `\0\0meta2<lastkey2>`: Range 2
- `\0\0meta2<lastkey3>`: Range 3
- ...
- `\0\0meta2<lastkeyN-1>`: Range 262143

**Range 1**

- `\0\0meta2<lastkeyN>`: Range 262144
- `\0\0meta2<lastkeyN+1>`: Range 262145
- ...
- `\0\0meta2\xff`: Range 500,000
- ...
- `<lastkey1>`: `<lastvalue1>`

**Range 2**

- ...
- `<lastkey2>`: `<lastvalue2>`

**Range 3**

- ...
- `<lastkey3>`: `<lastvalue3>`

**Range 262144**

- ...
- `<lastkeyN>`: `<lastvalueN>`

**Range 262145**

- ...
- `<lastkeyN+1>`: `<lastvalueN+1>`

Note that the choice of range `262144` is just an approximation. The
actual number of ranges addressable via a single metadata range is
dependent on the size of the keys. If efforts are made to keep key sizes
small, the total number of addressable ranges would increase and vice
versa.

From the examples above it’s clear that key location lookups require at
most three reads to get the value for `<key>`:

1. lower bound of `\0\0meta1<key>`
2. lower bound of `\0\0meta2<key>`
3. `<key>`

For small maps, the entire lookup is satisfied in a single RPC to Range 0. Maps
containing less than 16T of data would require two lookups. Clients cache both
levels of range metadata, and we expect that data locality for individual
clients will be high. Clients may end up with stale cache entries. If on a
lookup, the range consulted does not match the client’s expectations, the
client evicts the stale entries and possibly does a new lookup.

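A schematic sketch of the two-level lookup (purely illustrative; the real
addressing code also caches range descriptors and handles retries and stale
entries):

```go
// Purely illustrative two-level addressing sketch over sorted in-memory
// slices standing in for the meta1 and meta2 key spaces.
package main

import (
	"fmt"
	"sort"
)

// rangeAddr maps a meta key (the *end* key of the addressed range, as noted
// above) to the "location" of that range.
type rangeAddr struct {
	endKey string
	target string
}

// successor returns the first entry whose end key is >= key, mirroring the
// Seek()/Ceil() behaviour of the underlying RocksDB iterator.
func successor(index []rangeAddr, key string) string {
	i := sort.Search(len(index), func(i int) bool { return index[i].endKey >= key })
	if i == len(index) {
		return "" // not reachable when the index ends with the \xff sentinel
	}
	return index[i].target
}

func main() {
	meta1 := []rangeAddr{{"\xff", "range holding meta2 for all keys"}}
	meta2 := []rangeAddr{{"m", "range A"}, {"\xff", "range B"}}

	key := "apple"
	meta2Range := successor(meta1, key)                      // 1. lower bound of \0\0meta1<key>
	dataRange := successor(meta2, key)                       // 2. lower bound of \0\0meta2<key>
	fmt.Println(meta2Range, "->", dataRange, "-> read", key) // 3. <key>
}
```
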
# Raft - Consistency of Range Replicas

Each range is configured to consist of three or more replicas, as specified by
their ZoneConfig. The replicas in a range maintain their own instance of a
distributed consensus algorithm. We use the [*Raft consensus algorithm*](https://raftconsensus.github.io)
as it is simpler to reason about and includes a reference implementation
covering important details.
[ePaxos](https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf) has
promising performance characteristics for WAN-distributed replicas, but
it does not guarantee a consistent ordering between replicas.

Raft elects a relatively long-lived leader which must be involved to
propose commands. It heartbeats followers periodically and keeps their logs
replicated. In the absence of heartbeats, followers become candidates
after randomized election timeouts and proceed to hold new leader
elections. Cockroach weights random timeouts such that the replicas with
shorter round trip times to peers are more likely to hold elections
first (not implemented yet). Only the Raft leader may propose commands;
followers will simply relay commands to the last known leader.

Our Raft implementation was developed together with CoreOS, but adds an extra
layer of optimization to account for the fact that a single Node may have
millions of consensus groups (one for each Range). Areas of optimization
are chiefly coalesced heartbeats (so that the number of nodes dictates the
number of heartbeats as opposed to the much larger number of ranges) and
batch processing of requests.
Future optimizations may include two-phase elections and quiescent ranges
(i.e. stopping traffic completely for inactive ranges).

# Range Leases

As outlined in the Raft section, the replicas of a Range are organized as a
Raft group and execute commands from their shared commit log. Going through
Raft is an expensive operation though, and there are tasks which should only be
carried out by a single replica at a time (as opposed to all of them).
In particular, it is desirable to serve authoritative reads from a single
Replica (ideally from more than one, but that is far more difficult).

For these reasons, Cockroach introduces the concept of **Range Leases**:
This is a lease held for a slice of (database, i.e. hybrid logical) time and is
established by committing a special log entry through Raft containing the
interval the lease is going to be active on, along with the Node:RaftID
combination that uniquely describes the requesting replica. Reads and writes
must generally be addressed to the replica holding the lease; if none does, any
replica may be addressed, causing it to try to obtain the lease synchronously.
Requests received by a non-lease holder (for the HLC timestamp specified in the
request's header) fail with an error pointing at the replica's last known
lease holder. These requests are retried transparently with the updated lease by the
gateway node and never reach the client.

The replica holding the lease is in charge of or involved in handling
Range-specific maintenance tasks such as

* gossiping the sentinel and/or first range information
* splitting, merging and rebalancing

and, very importantly, may satisfy reads locally, without incurring the
overhead of going through Raft.

Since reads bypass Raft, a new lease holder will, among other things, ascertain
that its timestamp cache does not report timestamps smaller than the previous
lease holder's (so that it's compatible with reads which may have occurred on
the former lease holder). This is accomplished by letting leases enter
a <i>stasis period</i> (which is just the expiration minus the maximum clock
offset) before the actual expiration of the lease, so that all the next lease
holder has to do is set the low water mark of the timestamp cache to its
new lease's start time.

As a lease enters its stasis period, no more reads or writes are served, which
is undesirable. However, this would only happen in practice if a node became
unavailable. In almost all practical situations, no unavailability results
since leases are usually long-lived (and/or eagerly extended, which can avoid
the stasis period) or proactively transferred away from the lease holder, which
can also avoid the stasis period by promising not to serve any further reads
until the next lease goes into effect.

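A small sketch of the stasis rule, with illustrative names: a lease only
covers timestamps earlier than its expiration minus the maximum clock offset.

```go
// Illustrative sketch of the lease stasis rule; names are not CockroachDB's.
package main

import "fmt"

type lease struct {
	start      int64
	expiration int64
}

// covers reports whether a request at HLC timestamp ts may be served under
// this lease. The last maxOffset of the lease is the stasis period: requests
// in that window are refused so that the next lease holder only needs to set
// its timestamp cache low water mark to its own lease start time.
func (l lease) covers(ts, maxOffset int64) bool {
	return ts >= l.start && ts < l.expiration-maxOffset
}

func main() {
	l := lease{start: 100, expiration: 200}
	fmt.Println(l.covers(150, 10)) // true
	fmt.Println(l.covers(195, 10)) // false: inside the stasis period
}
```
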
## Colocation with Raft leadership

The range lease is completely separate from Raft leadership, and so without
further efforts, Raft leadership and the Range lease might not be held by the
same Replica. Since it's expensive to not have these two roles colocated (the
lease holder has to forward each proposal to the leader, adding costly RPC
round-trips), each lease renewal or transfer also attempts to colocate them.
In practice, that means that the mismatch is rare and self-corrects quickly.

## Command Execution Flow

This subsection describes how a lease holder replica processes a
read/write command in more detail. Each command specifies (1) a key
(or a range of keys) that the command accesses and (2) the ID of a
range which the key(s) belongs to. When receiving a command, a node
looks up a range by the specified Range ID and checks if the range is
still responsible for the supplied keys. If any of the keys do not
belong to the range, the node returns an error so that the client will
retry and send a request to the correct range.

When all the keys belong to the range, the node attempts to
process the command. If the command is an inconsistent read-only
command, it is processed immediately. If the command is a consistent
read or a write, the command is executed when both of the following
conditions hold:

- The range replica has a range lease.
- There are no other running commands whose keys overlap with
  the submitted command and cause a read/write conflict.

When the first condition is not met, the replica attempts to acquire
a lease or returns an error so that the client will redirect the
command to the current lease holder. The second condition guarantees that
consistent read/write commands for a given key are sequentially
executed.

When the above two conditions are met, the lease holder replica processes the
command. Consistent reads are processed on the lease holder immediately.
Write commands are committed into the Raft log so that every replica
will execute the same commands. All commands produce deterministic
results so that the range replicas keep consistent states among them.

When a write command completes, all the replicas update their response
cache to ensure idempotency. When a read command completes, the lease holder
replica updates its timestamp cache to keep track of the latest read
for a given key.

There is a chance that a range lease expires while a command is being
executed. Before executing a command, each replica checks if the replica
proposing the command still holds the lease. If the lease has expired,
the command is rejected by the replica.

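A schematic rendering of the checks above (all names are illustrative
stand-ins, not the actual code):

```go
// Schematic decision flow for a command arriving at a replica.
package main

import (
	"errors"
	"fmt"
)

type command struct {
	rangeID    int
	key        string
	consistent bool
	write      bool
}

type replica struct {
	rangeID            int
	startKey, endKey   string // [startKey, endKey)
	holdsLease         bool
	overlappingCommand bool
}

func (r *replica) execute(c command) error {
	// 1. The range must still be responsible for the key.
	if c.rangeID != r.rangeID || c.key < r.startKey || c.key >= r.endKey {
		return errors.New("range key mismatch: retry against the correct range")
	}
	// 2. Inconsistent reads are served immediately.
	if !c.consistent && !c.write {
		return nil
	}
	// 3. Consistent reads and writes need the lease and no overlapping command.
	if !r.holdsLease {
		return errors.New("not lease holder: redirect to current lease holder")
	}
	if r.overlappingCommand {
		return errors.New("overlapping command in flight: wait")
	}
	// Writes are proposed through Raft so that every replica applies them in
	// the same order; consistent reads are served locally by the lease holder.
	return nil
}

func main() {
	r := &replica{rangeID: 7, startKey: "a", endKey: "m", holdsLease: true}
	fmt.Println(r.execute(command{rangeID: 7, key: "c", consistent: true, write: true}))
}
```
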
# Splitting / Merging Ranges

Nodes split or merge ranges based on whether they exceed maximum or
minimum thresholds for capacity or load. Ranges exceeding maximums for
either capacity or load are split; ranges below minimums for *both*
capacity and load are merged.

Ranges maintain the same accounting statistics as accounting key
prefixes. These boil down to a time series of data points with minute
granularity, covering everything from number of bytes to read/write
queue sizes. Arbitrary distillations of the accounting stats can be
determined as the basis for splitting / merging. Two sensible metrics for
use with split/merge are range size in bytes and IOps. A good metric for
rebalancing a replica from one node to another would be total read/write
queue wait times. These metrics are gossipped, with each range / node
passing along relevant metrics if they’re in the bottom or top of the
range it’s aware of.

A range finding itself exceeding either capacity or load threshold
splits. To this end, the range lease holder computes an appropriate split key
candidate and issues the split through Raft. In contrast to splitting,
merging requires a range to be below the minimum threshold for both
capacity *and* load. A range being merged chooses the smaller of the
ranges immediately preceding and succeeding it.

Splitting, merging, rebalancing and recovering all follow the same basic
algorithm for moving data between roach nodes. New target replicas are
created and added to the replica set of the source range. Then each new
replica is brought up to date by either replaying the log in full or
copying a snapshot of the source replica data and then replaying the log
from the timestamp of the snapshot to catch up fully. Once the new
replicas are fully up to date, the range metadata is updated and the old
source replica(s) are deleted if applicable.

**Coordinator** (lease holder replica)

```
if splitting
  SplitRange(split_key): splits happen locally on range replicas and
  only after being completed locally, are moved to new target replicas.
else if merging
  Choose new replicas on same servers as target range replicas;
  add to replica set.
else if rebalancing || recovering
  Choose new replica(s) on least loaded servers; add to replica set.
```

**New Replica**

*Bring replica up to date:*

```
if all info can be read from replicated log
  copy replicated log
else
  snapshot source replica
  send successive ReadRange requests to source replica
  referencing snapshot

if merging
  combine ranges on all replicas
else if rebalancing || recovering
  remove old range replica(s)
```

Nodes split ranges when the total data in a range exceeds a
configurable maximum threshold. Similarly, ranges are merged when the
total data falls below a configurable minimum threshold.

**TBD: flesh this out**: Especially for merges (but also rebalancing) we have a
range disappearing from the local node; that range needs to disappear
gracefully, with a smooth handoff of operation to the new owner of its data.

Ranges are rebalanced if a node determines its load or capacity is one
of the worst in the cluster based on gossipped load stats. A node with
spare capacity is chosen in the same datacenter and a special-case split
is done which simply duplicates the data 1:1 and resets the range
configuration metadata.

# Node Allocation (via Gossip)

New nodes must be allocated when a range is split. Instead of requiring
every node to know about the status of all or even a large number
of peer nodes, or alternatively requiring a specialized curator or
master with sufficiently global knowledge, we use a gossip protocol to
efficiently communicate only interesting information between all of the
nodes in the cluster. What’s interesting information? One example would
be whether a particular node has a lot of spare capacity. Each node,
when gossiping, compares each topic of gossip to its own state. If its
own state is somehow “more interesting” than the least interesting item
in the topic it’s seen recently, it includes its own state as part of
the next gossip session with a peer node. In this way, a node with
capacity sufficiently in excess of the mean quickly becomes discovered
by the entire cluster. To avoid piling onto outliers, nodes from the
high capacity set are selected at random for allocation.

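A toy rendering of the “more interesting than the least interesting item”
rule (the real protocol, its topics, and its bounds differ):

```go
// Toy illustration of deciding whether to gossip our own value for a topic.
package main

import (
	"fmt"
	"sort"
)

// shouldGossip reports whether our own value (say, spare capacity) beats the
// least interesting of the bounded set of values we've seen for this topic.
func shouldGossip(seen []float64, own float64, bound int) bool {
	if len(seen) < bound {
		return true // topic not full yet; always worth sharing
	}
	sort.Float64s(seen)
	return own > seen[0] // would replace the least interesting item
}

func main() {
	seen := []float64{0.2, 0.5, 0.7} // spare-capacity fractions gossiped by peers
	fmt.Println(shouldGossip(seen, 0.9, 3)) // true: we are an outlier worth gossiping
	fmt.Println(shouldGossip(seen, 0.1, 3)) // false
}
```
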
The gossip protocol itself contains two primary components:

- **Peer Selection**: each node maintains up to N peers with which it
  regularly communicates. It selects peers with an eye towards
  maximizing fanout. A peer node which itself communicates with an
  array of otherwise unknown nodes will be selected over one which
  communicates with a set containing significant overlap. Each time
  gossip is initiated, each node’s set of peers is exchanged. Each
  node is then free to incorporate the other’s peers as it sees fit.
  To avoid any node suffering from excess incoming requests, a node
  may refuse to answer a gossip exchange. Each node is biased
  towards answering requests from nodes without significant overlap
  and refusing requests otherwise.

  Peers are efficiently selected using a heuristic as described in
  [Agarwal & Trachtenberg (2006)](https://drive.google.com/file/d/0B9GCVTp_FHJISmFRTThkOEZSM1U/edit?usp=sharing).

  **TBD**: how to avoid partitions? Need to work out a simulation of
  the protocol to tune the behavior and see empirically how well it
  works.

- **Gossip Selection**: what to communicate. Gossip is divided into
  topics. Load characteristics (capacity per disk, cpu load, and
  state [e.g. draining, ok, failure]) are used to drive node
  allocation. Range statistics (range read/write load, missing
  replicas, unavailable ranges) and network topology (inter-rack
  bandwidth/latency, inter-datacenter bandwidth/latency, subnet
  outages) are used for determining when to split ranges, when to
  recover replicas vs. wait for network connectivity, and for
  debugging / sysops. In all cases, a set of minimums and a set of
  maximums is propagated; each node applies its own view of the
  world to augment the values. Each minimum and maximum value is
  tagged with the reporting node and other accompanying contextual
  information. Each topic of gossip has its own protobuf to hold the
  structured data. The number of items of gossip in each topic is
  limited by a configurable bound.

  For efficiency, nodes assign each new item of gossip a sequence
  number and keep track of the highest sequence number each peer
  node has seen. Each round of gossip communicates only the delta
  containing new items.

# Node and Cluster Metrics

Every component of the system is responsible for exporting interesting
metrics about itself. These could be histograms, throughput counters, or
gauges.

These metrics are exported for external monitoring systems (such as Prometheus)
via an HTTP endpoint, but CockroachDB also implements an internal timeseries
database which is stored in the replicated key-value map.

Time series are stored at Store granularity and allow the admin dashboard
to efficiently gain visibility into a universe of information at the Cluster,
Node or Store level. A [periodic background process](RFCS/time_series_culling.md)
culls older timeseries data, downsampling and eventually discarding it.

# Key-prefix Accounting and Zones
1031
Arbitrarily fine-grained accounting is specified via
1032
key prefixes. Key prefixes can overlap, as is necessary for capturing
1033
hierarchical relationships. For illustrative purposes, let’s say keys
1034
specifying rows in a set of databases have the following format:
1035
1036
`<db>:<table>:<primary-key>[:<secondary-key>]`
1037
1038
In this case, we might collect accounting with
1039
key prefixes:
1040
1041
`db1`, `db1:user`, `db1:order`,
1042
1043
Accounting is kept for the entire map by default.
1044
1045
## Accounting
1046
to keep accounting for a range defined by a key prefix, an entry is created in
1047
the accounting system table. The format of accounting table keys is:
1048
1049
`\0acct<key-prefix>`
1050
1051
In practice, we assume each node is capable of caching the
1052
entire accounting table as it is likely to be relatively small.
1053
1054
Accounting is kept for key prefix ranges with eventual consistency for
1055
efficiency. There are two types of values which comprise accounting:
1056
counts and occurrences, for lack of better terms. Counts describe
1057
system state, such as the total number of bytes, rows,
1058
etc. Occurrences include transient performance and load metrics. Both
1059
types of accounting are captured as time series with minute
1060
granularity. The length of time accounting metrics are kept is
1061
configurable. Below are examples of each type of accounting value.
1062
1063
**System State Counters/Performance**

- Count of items (e.g. rows)
- Total bytes
- Total key bytes
- Total value length
- Queued message count
- Queued message total bytes
- Count of values \< 16B
- Count of values \< 64B
- Count of values \< 256B
- Count of values \< 1K
- Count of values \< 4K
- Count of values \< 16K
- Count of values \< 64K
- Count of values \< 256K
- Count of values \< 1M
- Count of values \> 1M
- Total bytes of accounting

**Load Occurrences**

- Get op count
- Get total MB
- Put op count
- Put total MB
- Delete op count
- Delete total MB
- Delete range op count
- Delete range total MB
- Scan op count
- Scan op MB
- Split count
- Merge count

Because accounting information is kept as time series and over many
possible metrics of interest, the data can become voluminous. Accounting
data are stored in the map near the key prefix described, in order to
distribute load (for both aggregation and storage).

Accounting keys for system state have the form:
`<key-prefix>|acctd<metric-name>*`. Notice the leading ‘pipe’
character. It’s meant to sort the root level account AFTER any other
system tables. Updates must increment the same underlying values
because these are permanent counts, not transient activity. Logic at
the node takes care of snapshotting the value into an appropriately
suffixed (e.g. with timestamp hour) multi-value time series entry.

Keys for perf/load metrics have the form:
`<key-prefix>acctd<metric-name><hourly-timestamp>`.

`<hourly-timestamp>`-suffixed accounting entries are multi-valued,
containing a varint64 entry for each minute with activity during the
specified hour.

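The sketch below shows how such keys and multi-valued entries might be
assembled. It follows the key format above, but the helper names, the choice
of a Unix hour for `<hourly-timestamp>`, and the exact value layout are
assumptions made for illustration.

```go
// Illustrative sketch of accounting key construction and per-minute
// value encoding; helper names and wire layout are hypothetical.
package acct

import (
	"encoding/binary"
	"fmt"
	"time"
)

// loadMetricKey builds a perf/load accounting key of the form
// <key-prefix>acctd<metric-name><hourly-timestamp>.
func loadMetricKey(keyPrefix, metric string, t time.Time) []byte {
	hour := t.UTC().Truncate(time.Hour).Unix()
	return []byte(fmt.Sprintf("%sacctd%s%d", keyPrefix, metric, hour))
}

// encodeMinutes encodes one varint64 per minute of the hour; a real
// implementation might record only the minutes with activity.
func encodeMinutes(perMinute [60]int64) []byte {
	buf := make([]byte, 0, 60*binary.MaxVarintLen64)
	tmp := make([]byte, binary.MaxVarintLen64)
	for _, v := range perMinute {
		n := binary.PutVarint(tmp, v)
		buf = append(buf, tmp[:n]...)
	}
	return buf
}
```
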
To efficiently keep accounting over large key ranges, the task of
aggregation must be distributed. If activity occurs within the same
range as the key prefix for accounting, the updates are made as part
of the consensus write. If the ranges differ, then a message is sent
to the parent range to increment the accounting. If, upon receiving the
message, the parent range also does not include the key prefix, it in
turn forwards the update to its parent or left child in the balanced
binary tree which is maintained to describe the range hierarchy. This
limits the number of messages before an update is visible at the root
to `2*log N`, where `N` is the number of ranges in the key prefix.

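A rough sketch of the local-vs-forward decision described above, using
hypothetical range and callback types:

```go
// Hypothetical sketch of the aggregation routing decision; the range
// descriptor and callbacks here are invented for illustration.
package acctagg

import "bytes"

// rangeDesc describes the key span a range is responsible for.
type rangeDesc struct {
	startKey, endKey []byte
}

func (r rangeDesc) contains(key []byte) bool {
	return bytes.Compare(key, r.startKey) >= 0 && bytes.Compare(key, r.endKey) < 0
}

// applyIncrement applies an accounting delta locally when this range
// owns the accounting key prefix, and otherwise forwards it toward the
// range that does (its parent in the range hierarchy).
func applyIncrement(r rangeDesc, acctKey []byte, delta int64,
	applyLocally func([]byte, int64), forwardToParent func([]byte, int64)) {
	if r.contains(acctKey) {
		// Same range: fold the update into the consensus write.
		applyLocally(acctKey, delta)
		return
	}
	// Different range: send a message up the hierarchy; repeated
	// forwarding is bounded by roughly 2*log N hops.
	forwardToParent(acctKey, delta)
}
```
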
## Zones

Zones are stored in the map with keys prefixed by
`\0zone` followed by the key prefix to which the zone
configuration applies. Zone values specify a protobuf containing
the datacenters from which replicas for ranges which fall under
the zone must be chosen.

Please see [config/config.proto](https://github.com/cockroachdb/cockroach/blob/master/config/config.proto)
for the up-to-date data structures used; the best entry point is
`message ZoneConfig`.

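As a rough illustration of the kind of information a zone carries, the sketch
below pairs a key prefix with replica placement constraints. The field names
are invented for this example; `message ZoneConfig` in config/config.proto is
the authoritative definition.

```go
// Hypothetical, simplified view of a zone configuration; the real
// definition is the ZoneConfig protobuf in config/config.proto.
package zones

// ZoneSketch describes where replicas for ranges under KeyPrefix may
// be placed. Field names here are illustrative only. The real value is
// stored in the KV map under "\x00zone" + KeyPrefix as a protobuf.
type ZoneSketch struct {
	KeyPrefix     string   // e.g. "db1"; ranges under this prefix are covered
	ReplicaAttrs  []string // datacenter/attribute constraints for replicas
	RangeMinBytes int64    // split/merge thresholds for ranges in this zone
	RangeMaxBytes int64
}
```
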
If zones are modified in situ, each node verifies the
existing zones for its ranges against the zone configuration. If
it discovers differences, it reconfigures ranges in the same way
that it rebalances away from busy nodes, via a special-case 1:1
split into a duplicate range carrying the new configuration.

# SQL

Each node in a cluster can accept SQL client connections. CockroachDB
supports the PostgreSQL wire protocol, to enable reuse of native
PostgreSQL client drivers. Connections using SSL and authenticated
using client certificates are supported and even encouraged over
unencrypted (insecure) and password-based connections.

Each connection is associated with a SQL session which holds the
server-side state of the connection. Over the lifespan of a session
the client can send SQL to open/close transactions, issue statements
or queries, or configure session parameters, much like with any other
SQL database.

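For example, a Go client can connect using any standard PostgreSQL driver;
the sketch below uses the `lib/pq` driver through `database/sql`. The address,
database, user and certificate paths are placeholders.

```go
// Connecting to a CockroachDB node with a standard PostgreSQL driver.
// The address, user and certificate paths below are placeholders.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // registers the "postgres" driver
)

func main() {
	// A certificate-authenticated connection is preferred over an
	// insecure or password-based one.
	db, err := sql.Open("postgres",
		"postgres://myuser@localhost:26257/mydb?sslmode=verify-full"+
			"&sslcert=certs/client.myuser.crt&sslkey=certs/client.myuser.key"+
			"&sslrootcert=certs/ca.crt")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var one int
	if err := db.QueryRow("SELECT 1").Scan(&one); err != nil {
		log.Fatal(err)
	}
	fmt.Println("connected, SELECT 1 returned", one)
}
```
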
## Language support

CockroachDB also attempts to emulate the flavor of SQL supported by
PostgreSQL, although it diverges in significant ways:

- CockroachDB exclusively implements MVCC-based consistency for
transactions, and thus only supports SQL's isolation levels SNAPSHOT
and SERIALIZABLE. The other traditional SQL isolation levels are
internally mapped to either SNAPSHOT or SERIALIZABLE (see the sketch
after this list).

- CockroachDB implements its own [SQL type system](RFCS/typing.md)
which only supports a limited form of implicit coercions between
types compared to PostgreSQL. The rationale is to keep the
implementation simple and efficient, capitalizing on the observation
that 1) most SQL code in clients is automatically generated with
coherent typing already and 2) existing SQL code for other databases
will need to be massaged for CockroachDB anyways.

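A minimal sketch of such a mapping, assuming one plausible choice of upgrades;
the exact mapping used internally may differ.

```go
// Sketch of mapping traditional SQL isolation levels onto the two
// levels CockroachDB supports; the specific upgrades chosen here are
// illustrative, not necessarily the ones used internally.
package isolation

type Level int

const (
	Snapshot Level = iota
	Serializable
)

// normalize upgrades any requested level to SNAPSHOT or SERIALIZABLE.
func normalize(requested string) Level {
	switch requested {
	case "READ UNCOMMITTED", "READ COMMITTED", "SNAPSHOT":
		return Snapshot
	default: // REPEATABLE READ, SERIALIZABLE, and anything else
		return Serializable
	}
}
```
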
## SQL architecture

Client connections over the network are handled in each node by a
pgwire server process (goroutine). This handles the stream of incoming
commands and sends back responses including query/statement results.
The pgwire server also handles pgwire-level prepared statements,
binding prepared statements to arguments and looking up prepared
statements for execution.

Meanwhile the state of a SQL connection is maintained by a Session
object and a monolithic `planner` object (one per connection) which
coordinates execution between the session, the current SQL transaction
state and the underlying KV store.

Upon receiving a query/statement (either directly or via an execute
command for a previously prepared statement) the pgwire server forwards
the SQL text to the `planner` associated with the connection. The SQL
code is then transformed into a SQL query plan.

The query plan is implemented as a tree of objects which describe the
high-level data operations needed to resolve the query, for example
"join", "index join", "scan", "group", etc.

The query plan objects currently also embed the run-time state needed
for the execution of the query plan. Once the SQL query plan is ready,
methods on these objects then carry the execution out in the fashion
of "generators" in other programming languages: each node *starts* its
children nodes and from that point forward each child node serves as a
*generator* for a stream of result rows, which the parent node can
consume and transform incrementally and present to its own parent node
also as a generator.

The top-level planner consumes the data produced by the top node of
the query plan and returns it to the client via pgwire.

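The generator pattern can be pictured with an interface along the following
lines. This is a simplified sketch; the actual interface in the sql package
uses different names and details.

```go
// Simplified sketch of a generator-style plan node interface; the
// actual interface in the sql package differs in names and details.
package plan

// Datum stands in for a single SQL value.
type Datum interface{}

// planNode is implemented by every node of the query plan tree.
type planNode interface {
	// Start initializes this node and, recursively, its children.
	Start() error
	// Next advances to the next result row, returning false when the
	// stream is exhausted.
	Next() (bool, error)
	// Values returns the current row; valid only after Next returned true.
	Values() []Datum
	// Close releases resources held by this node and its children.
	Close()
}

// drain shows how a parent (here, the top-level consumer) pulls rows
// incrementally from its child node.
func drain(n planNode, emit func([]Datum)) error {
	if err := n.Start(); err != nil {
		return err
	}
	defer n.Close()
	for {
		ok, err := n.Next()
		if err != nil || !ok {
			return err
		}
		emit(n.Values())
	}
}
```
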
## Data mapping between the SQL model and KV

Every SQL table has a primary key in CockroachDB. (If a table is created
without one, an implicit primary key is provided automatically.)
The table identifier, followed by the value of the primary key for
each row, is encoded as the *prefix* of a key in the underlying KV
store.

Each remaining column or *column family* in the table is then encoded
as a value in the underlying KV store, and the column/family identifier
is appended as *suffix* to the KV key.

For example:

- after table `customers` is created in a database `mydb` with a
primary key column `name` and normal columns `address` and `URL`, the KV pairs
to store the schema would be:

| Key                          | Values |
| ---------------------------- | ------ |
| `/system/databases/mydb/id`  | 51     |
| `/system/tables/customer/id` | 42     |
| `/system/desc/51/42/address` | 69     |
| `/system/desc/51/42/url`     | 66     |

(The numeric values on the right are chosen arbitrarily for the
example; the structure of the schema keys on the left is simplified
for the example and subject to change.) Each database/table/column
name is mapped to a spontaneously generated identifier, so as to
simplify renames.

Then for a single row in this table:

| Key               | Values                           |
| ----------------- | -------------------------------- |
| `/51/42/Apple/69` | `1 Infinite Loop, Cupertino, CA` |
| `/51/42/Apple/66` | `http://apple.com/`              |

Each key has the table prefix `/51/42` followed by the primary key
prefix `/Apple` followed by the column/family suffix (`/66`,
`/69`). The KV value is directly encoded from the SQL value.

Efficient storage for the keys is guaranteed by the underlying RocksDB engine
by means of prefix compression.

Finally, for SQL indexes, the KV key is formed using the SQL value of the
indexed columns, and the KV value is the KV key prefix that addresses the
rest of the indexed row (its primary key encoding).

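To make the mapping concrete, the sketch below assembles keys in the shape
used by the example above. It is purely illustrative: the real encoding uses
order-preserving binary encodings of IDs and SQL values rather than
`/`-separated strings.

```go
// Illustrative construction of the example keys above; the real
// encoding is an order-preserving binary format, not slash-separated
// strings.
package sqlkv

import "fmt"

const (
	dbID      = 51 // "mydb"      (arbitrary, as in the example)
	tableID   = 42 // "customers"
	addressID = 69 // column "address"
	urlID     = 66 // column "URL"
)

// rowKey builds a key of the form /<db>/<table>/<primary-key>/<column>.
func rowKey(pk string, columnID int) string {
	return fmt.Sprintf("/%d/%d/%s/%d", dbID, tableID, pk, columnID)
}

// indexKey builds a secondary-index key; its KV value would hold the
// primary-key prefix of the indexed row.
func indexKey(indexID int, indexedValue string) string {
	return fmt.Sprintf("/%d/%d/%d/%s", dbID, tableID, indexID, indexedValue)
}

func Example() {
	fmt.Println(rowKey("Apple", addressID)) // /51/42/Apple/69
	fmt.Println(rowKey("Apple", urlID))     // /51/42/Apple/66
}
```
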