1
# About
2
This document is an updated version of the original design documents
3
by Spencer Kimball from early 2014.
4
5
# Overview
6
7
CockroachDB is a distributed SQL database. The primary design goals
8
are **scalability**, **strong consistency** and **survivability**. A
further design goal is **homogeneous deployment** (one binary) with minimal configuration and
no required external dependencies.
14
15
The entry point for database clients is the SQL interface. Every node
16
in a CockroachDB cluster can act as a client SQL gateway. A SQL
17
gateway transforms and executes client SQL statements to key-value
18
(KV) operations, which the gateway distributes across the cluster as
19
necessary and returns results to the client. CockroachDB implements a
20
**single, monolithic sorted map** from key to value where both keys
21
and values are byte strings.
22
23
The KV map is logically composed of smaller segments of the keyspace called
24
ranges. Each range is backed by data stored in a local KV storage engine (we
25
use [RocksDB](http://rocksdb.org/), a variant of
26
[LevelDB](https://github.com/google/leveldb)). Range data is replicated to a
27
configurable number of additional CockroachDB nodes. Ranges are merged and
28
split to maintain a target size, by default `64M`. The relatively small size
29
facilitates quick repair and rebalancing to address node failures, new capacity
30
and even read/write load. However, the size must be balanced against the
31
pressure on the system from having more ranges to manage.
32
33
CockroachDB achieves horizontal scalability:
34
- adding more nodes increases the capacity of the cluster by the
35
amount of storage on each node (divided by a configurable
36
replication factor), theoretically up to 4 exabytes (4E) of logical
37
data;
38
- client queries can be sent to any node in the cluster, and queries
39
can operate independently (w/o conflicts), meaning that overall
40
throughput is a linear factor of the number of nodes in the cluster.

CockroachDB achieves strong consistency:
- uses a distributed consensus protocol for synchronous replication of
46
data in each key value range. We’ve chosen to use the [Raft
47
consensus algorithm](https://raftconsensus.github.io); all consensus
48
state is stored in RocksDB.
49
- single or batched mutations to a single range are mediated via the
50
range's Raft instance. Raft guarantees ACID semantics.
51
- logical mutations which affect multiple ranges employ distributed
52
transactions for ACID semantics. CockroachDB uses an efficient
53
**non-locking distributed commit** protocol.
54
55
CockroachDB achieves survivability:
56
- range replicas can be co-located within a single datacenter for low
57
latency replication and survive disk or machine failures. They can
be located across racks to additionally survive rack failures.
- range replicas can be located in datacenters spanning increasingly
60
disparate geographies to survive ever-greater failure scenarios from
61
datacenter power or networking loss to regional power failures
(e.g. `{ US-East, US-West }`, `{ US-East, US-West,
Japan }`, `{ Ireland, US-East, US-West}`, `{ Ireland, US-East,
64
US-West, Japan, Australia }`).
66
CockroachDB provides [snapshot isolation](http://en.wikipedia.org/wiki/Snapshot_isolation) (SI) and
67
serializable snapshot isolation (SSI) semantics, allowing **externally
68
consistent, lock-free reads and writes**--both from a historical
69
snapshot timestamp and from the current wall clock time. SI provides
70
lock-free reads and writes but still allows write skew. SSI eliminates
71
write skew, but introduces a performance hit in the case of a
72
contentious system. SSI is the default isolation; clients must
consciously decide to trade correctness for performance. CockroachDB
implements [a limited form of linearizability](#linearizability),
75
providing ordering for any observer or chain of observers.
76
77
Similar to
78
[Spanner](http://static.googleusercontent.com/media/research.google.com/en/us/archive/spanner-osdi2012.pdf)
directories, CockroachDB allows configuration of arbitrary zones of data.
This allows replication factor, storage device type, and/or datacenter
81
location to be chosen to optimize performance and/or availability.
82
Unlike Spanner, zones are monolithic and don’t allow movement of fine
83
grained data on the level of entity groups.
84
85
# Architecture
86
CockroachDB implements a layered architecture. The highest level of
abstraction is the [*SQL layer*](#sql),
which provides familiar relational concepts
such as schemas, tables, columns, and indexes. The SQL layer
92
in turn depends on the [distributed key value store](#key-value-api),
93
which handles the details of range addressing to provide the abstraction
94
of a single, monolithic key value store. The distributed KV store
95
communicates with any number of physical cockroach nodes. Each node
96
contains one or more stores, one per physical device.
97
98

99
100
Each store contains potentially many ranges, the lowest-level unit of
101
key-value data. Ranges are replicated using the Raft consensus protocol.
102
The diagram below is a blown-up version of stores from four of the five
nodes in the previous diagram. Each range is replicated three ways using
Raft. The color coding shows associated range replicas.
105
106

107
108
Each physical node exports two RPC-based key value APIs: one for
109
external clients and one for internal clients (exposing sensitive
110
operational features). Both services accept batches of requests and
111
return batches of responses. Nodes are symmetric in capabilities and
112
exported interfaces; each has the same binary and may assume any
113
role.
114
115
Nodes and the ranges they provide access to can be arranged with various
116
physical network topologies to make trade offs between reliability and
117
performance. For example, a triplicated (3-way replica) range could have
118
each replica located on different:
119
120
- disks within a server to tolerate disk failures.
121
- servers within a rack to tolerate server failures.
122
- servers on different racks within a datacenter to tolerate rack power/network failures.
123
- servers in different datacenters to tolerate large scale network or power outages.
124
125
Up to `F` failures can be tolerated, where the total number of replicas `N = 2F + 1` (e.g. with 3x replication, one failure can be tolerated; with 5x replication, two failures, and so on).
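To make the arithmetic concrete, here is a minimal Go sketch of the quorum math implied by `N = 2F + 1`; the function names are ours, not CockroachDB's:

```go
package main

import "fmt"

// For a replication factor n, a consensus group needs a majority
// (quorum) of replicas to agree, so it tolerates f = (n-1)/2 failures.
func quorum(n int) int    { return n/2 + 1 }
func tolerated(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("replicas=%d quorum=%d tolerated failures=%d\n",
			n, quorum(n), tolerated(n))
	}
}
```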
126
127
# Keys
128
129
Cockroach keys are arbitrary byte arrays. Keys come in two flavors:
130
system keys and table data keys. System keys are used by Cockroach for
131
internal data structures and metadata. Table data keys contain SQL
132
table data (as well as index data). System and table data keys are
133
prefixed in such a way that all system keys sort before any table data
134
keys.
135
136
System keys come in several subtypes:
137
138
- **Global** keys store cluster-wide data such as the "meta1" and
139
"meta2" keys as well as various other system-wide keys such as the
140
node and store ID allocators.
141
- **Store local** keys are used for unreplicated store metadata
142
(e.g. the `StoreIdent` structure). "Unreplicated" indicates that
143
these values are not replicated across multiple stores because the
144
data they hold is tied to the lifetime of the store they are
145
present on.
146
- **Range local** keys store range metadata that is associated with a
147
global key. Range local keys have a special prefix followed by a
148
global key and a special suffix. For example, transaction records
149
are range local keys which look like:
150
`\x01k<global-key>txn-<txnID>`.
151
- **Replicated Range ID local** keys store range metadata that is
152
present on all of the replicas for a range. These keys are updated
153
via Raft operations. Examples include the range lease state and
154
abort cache entries.
155
- **Unreplicated Range ID local** keys store range metadata that is
156
local to a replica. The primary examples of such keys are the Raft
157
state and Raft log.
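As an illustration of the range-local key shape mentioned above (`\x01k<global-key>txn-<txnID>`), the following hypothetical Go snippet assembles such a key. The byte prefixes mirror the example in the text, but the real encoding lives in the keys package and differs in detail:

```go
package main

import "fmt"

// makeRangeLocalTxnKey builds a key of the shape described above: a
// range-local prefix, the anchored global key, a "txn-" suffix and the
// transaction ID. This is an illustration, not the exact encoding.
func makeRangeLocalTxnKey(globalKey []byte, txnID string) []byte {
	k := append([]byte{0x01, 'k'}, globalKey...)
	k = append(k, []byte("txn-")...)
	return append(k, txnID...)
}

func main() {
	key := makeRangeLocalTxnKey([]byte("/table/42/row7"), "example-txn-uuid")
	fmt.Printf("%q\n", key)
}
```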
158
159
Table data keys are used to store all SQL data. Table data keys
160
contain internal structure as described in the section on [mapping
161
data between the SQL model and
162
KV](#data-mapping-between-the-sql-model-and-kv).
163
164
# Versioned Values
165
166
Cockroach maintains historical versions of values by storing them with
167
associated commit timestamps. Reads and scans can specify a snapshot
168
time to return the most recent writes prior to the snapshot timestamp.
169
Older versions of values are garbage collected by the system during
170
compaction according to a user-specified expiration interval. In order
171
to support long-running scans (e.g. for MapReduce), all versions have a
172
minimum expiration.
173
174
Versioned values are supported via modifications to RocksDB to record
175
commit timestamps and GC expirations per key.
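The following is a simplified Go sketch of what a versioned read at a snapshot timestamp looks like, with an in-memory slice standing in for RocksDB; types and names are illustrative only:

```go
package main

import (
	"fmt"
	"sort"
)

// version is one committed value of a key at a commit timestamp.
type version struct {
	ts    int64 // commit timestamp (illustrative; real timestamps are HLC values)
	value string
}

// readAt returns the most recent version with ts <= snapshot, mirroring
// how a read at a snapshot timestamp skips newer versions.
func readAt(versions []version, snapshot int64) (string, bool) {
	sort.Slice(versions, func(i, j int) bool { return versions[i].ts < versions[j].ts })
	for i := len(versions) - 1; i >= 0; i-- {
		if versions[i].ts <= snapshot {
			return versions[i].value, true
		}
	}
	return "", false
}

func main() {
	history := []version{{ts: 10, value: "v1"}, {ts: 20, value: "v2"}, {ts: 30, value: "v3"}}
	v, ok := readAt(history, 25)
	fmt.Println(v, ok) // v2 true
}
```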
176
177
# Lock-Free Distributed Transactions
178
179
Cockroach provides distributed transactions without locks. Cockroach
180
transactions support two isolation levels:
181
182
- snapshot isolation (SI) and
183
- *serializable* snapshot isolation (SSI).
184
185
*SI* is simple to implement, highly performant, and correct for all but a
186
handful of anomalous conditions (e.g. write skew). *SSI* requires just a touch
187
more complexity, is still highly performant (less so with contention), and has
188
no anomalous conditions. Cockroach’s SSI implementation is based on ideas from
189
the literature and some possibly novel insights.
190
191
SSI is the default level, with SI provided for application developers
192
who are certain enough of their need for performance and the absence of
193
write skew conditions to consciously elect to use it. In a lightly
194
contended system, our implementation of SSI is just as performant as SI,
195
requiring no locking or additional writes. With contention, our
196
implementation of SSI still requires no locking, but will end up
197
aborting more transactions. Cockroach’s SI and SSI implementations
198
prevent starvation scenarios even for arbitrarily long transactions.
199
200
See the [Cahill paper](https://drive.google.com/file/d/0B9GCVTp_FHJIcEVyZVdDWEpYYXVVbFVDWElrYUV0NHFhU2Fv/edit?usp=sharing)
201
for one possible implementation of SSI. This is another [great paper](http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf).
202
For a discussion of SSI implemented by preventing read-write conflicts
203
(in contrast to detecting them, called write-snapshot isolation), see
204
the [Yabandeh paper](https://drive.google.com/file/d/0B9GCVTp_FHJIMjJ2U2t6aGpHLTFUVHFnMTRUbnBwc2pLa1RN/edit?usp=sharing),
205
which is the source of much inspiration for Cockroach’s SSI.
206
207
Both SI and SSI require that the outcome of reads must be preserved, i.e.
208
a write of a key at a lower timestamp than a previous read must not succeed. To
209
this end, each range maintains a bounded *in-memory* cache from key range to
210
the latest timestamp at which it was read.
211
212
Most updates to this *timestamp cache* correspond to keys being read, though
213
the timestamp cache also protects the outcome of some writes (notably range
214
deletions) which consequently must also populate the cache. The cache’s entries
215
are evicted oldest timestamp first, updating the low water mark of the cache
216
appropriately.
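A toy version of such a timestamp cache might look like the following Go sketch. The real cache is an interval structure keyed by key *ranges*; this per-key map is only meant to show the low water mark behaviour:

```go
package main

import "fmt"

// tsCache records the latest read timestamp per key and keeps a low
// water mark that stands in for evicted entries.
type tsCache struct {
	lowWater int64
	reads    map[string]int64
}

func newTSCache(lowWater int64) *tsCache {
	return &tsCache{lowWater: lowWater, reads: make(map[string]int64)}
}

// recordRead remembers that key was read at ts.
func (c *tsCache) recordRead(key string, ts int64) {
	if ts > c.reads[key] {
		c.reads[key] = ts
	}
}

// latestRead returns the timestamp a writer must exceed for this key.
func (c *tsCache) latestRead(key string) int64 {
	if ts, ok := c.reads[key]; ok && ts > c.lowWater {
		return ts
	}
	return c.lowWater
}

func main() {
	c := newTSCache(5)
	c.recordRead("a", 17)
	fmt.Println(c.latestRead("a"), c.latestRead("b")) // 17 5
}
```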
217
218
Each Cockroach transaction is assigned a random priority and a
219
"candidate timestamp" at start. The candidate timestamp is the
220
provisional timestamp at which the transaction will commit, and is
221
chosen as the current clock time of the node coordinating the
222
transaction. This means that a transaction without conflicts will
223
usually commit with a timestamp that, in absolute time, precedes the
224
actual work done by that transaction.
225
226
In the course of coordinating a transaction between one or more
227
distributed nodes, the candidate timestamp may be increased, but will
never be decreased. The core difference between the two isolation levels
SI and SSI is that the former allows the transaction's candidate
230
timestamp to increase and the latter does not.

**Hybrid Logical Clock (HLC)**

Each cockroach node maintains a hybrid logical clock (HLC) as discussed
in the [Hybrid Logical Clock paper](http://www.cse.buffalo.edu/tech-reports/2014-04.pdf).
236
HLC time uses timestamps which are composed of a physical component (thought of
237
as, and always close to, local wall time) and a logical component (used to
238
distinguish between events with the same physical component). It allows us to
239
track causality for related events similar to vector clocks, but with less
240
overhead. In practice, it works much like other logical clocks: When events
241
are received by a node, it informs the local HLC about the timestamp supplied
242
with the event by the sender, and when events are sent a timestamp generated by
243
the local HLC is attached.
244
245
For a more in depth description of HLC please read the paper. Our
246
implementation is [here](https://github.com/cockroachdb/cockroach/blob/master/pkg/util/hlc/hlc.go).
247
248
Cockroach picks a Timestamp for a transaction using HLC time. Throughout this
249
document, *timestamp* always refers to the HLC time which is a singleton
250
on each node. The HLC is updated by every read/write event on the node, and
the HLC time >= wall time. A read/write timestamp received in a request
from another node is not only used to version the operation, but also updates
253
the HLC on the node. This is useful in guaranteeing that all data read/written
254
on a node is at a timestamp < next HLC time.
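The sketch below shows, in simplified Go, how an HLC with a physical and a logical component can be advanced on local events and on receipt of remote timestamps; it is a reduced model of the implementation linked above, not a copy of it:

```go
package main

import (
	"fmt"
	"time"
)

// Timestamp is an HLC value: wall time plus a logical counter used to
// order events that share the same physical component.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// Clock is a toy hybrid logical clock.
type Clock struct {
	now     func() int64 // physical clock source
	current Timestamp
}

// Now returns a timestamp for a locally generated event.
func (c *Clock) Now() Timestamp {
	if wall := c.now(); wall > c.current.WallTime {
		c.current = Timestamp{WallTime: wall}
	} else {
		c.current.Logical++
	}
	return c.current
}

// Update folds in a timestamp received from another node so that the
// local clock never runs behind observed events.
func (c *Clock) Update(remote Timestamp) Timestamp {
	wall := c.now()
	switch {
	case wall > c.current.WallTime && wall > remote.WallTime:
		c.current = Timestamp{WallTime: wall}
	case remote.WallTime > c.current.WallTime:
		c.current = Timestamp{WallTime: remote.WallTime, Logical: remote.Logical + 1}
	case remote.WallTime == c.current.WallTime && remote.Logical >= c.current.Logical:
		c.current.Logical = remote.Logical + 1
	default:
		c.current.Logical++
	}
	return c.current
}

func main() {
	c := &Clock{now: func() int64 { return time.Now().UnixNano() }}
	fmt.Println(c.Now())
	fmt.Println(c.Update(Timestamp{WallTime: time.Now().Add(time.Second).UnixNano()}))
}
```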

**Transaction execution flow**

Transactions are executed in two phases:
1. Start the transaction by selecting a range which is likely to be
261
heavily involved in the transaction and writing a new transaction
262
record to a reserved area of that range with state "PENDING". In
263
parallel write an "intent" value for each datum being written as part
264
of the transaction. These are normal MVCC values, with the addition of
265
a special flag (i.e. “intent”) indicating that the value may be
committed after the transaction itself commits. In addition, the
transaction id (unique and chosen at the txn start time by the client)
is stored with intent values. The txn id is used to refer to the
269
transaction record when there are conflicts and to make
tie-breaking decisions on ordering between identical timestamps.
Each node returns the timestamp used for the write (which is the
original candidate timestamp in the absence of read/write conflicts);
273
the client selects the maximum from amongst all write timestamps as the
274
final commit timestamp.
276
2. Commit the transaction by updating its transaction record. The value
277
of the commit entry contains the candidate timestamp (increased as
278
necessary to accommodate any latest read timestamps). Note that the
279
transaction is considered fully committed at this point and control
280
may be returned to the client.
281
282
In the case of an SI transaction, a commit timestamp which was
283
increased to accommodate concurrent readers is perfectly
284
acceptable and the commit may continue. For SSI transactions,
285
however, a gap between candidate and commit timestamps
286
necessitates transaction restart (note: restart is different than
287
abort--see below).
288
289
After the transaction is committed, all written intents are upgraded
290
in parallel by removing the “intent” flag. The transaction is
291
considered fully committed before this step and does not wait for
292
it to return control to the transaction coordinator.
293
294
In the absence of conflicts, this is the end. Nothing else is necessary
295
to ensure the correctness of the system.
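The following Go sketch caricatures the two phases above from the coordinator's point of view: intents accumulate and may push the candidate timestamp forward, and the commit is a single update of the transaction record, after which intent resolution can proceed asynchronously. All types and names are invented for illustration:

```go
package main

import "fmt"

type txnStatus int

const (
	pending txnStatus = iota
	committed
	aborted
)

type intent struct {
	key, value string
	ts         int64
}

type txn struct {
	id      string
	status  txnStatus
	ts      int64 // candidate timestamp; only ever moves forward
	intents []intent
}

// write lays down an intent and bumps the candidate timestamp if the
// range reports a higher write timestamp (e.g. due to the timestamp cache).
func (t *txn) write(key, value string, writeTS int64) {
	t.intents = append(t.intents, intent{key: key, value: value, ts: writeTS})
	if writeTS > t.ts {
		t.ts = writeTS
	}
}

// commit flips the transaction record; intent resolution can then
// happen asynchronously, after control has returned to the client.
func (t *txn) commit() {
	t.status = committed
	for i := range t.intents {
		t.intents[i].ts = t.ts // resolved versions carry the commit timestamp
	}
}

func main() {
	t := &txn{id: "example", status: pending, ts: 100}
	t.write("a", "1", 100)
	t.write("b", "2", 105) // pushed by a conflicting read
	t.commit()
	fmt.Printf("committed at %d with %d intents to resolve\n", t.ts, len(t.intents))
}
```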
296
297
**Conflict Resolution**
298
299
Things get more interesting when a reader or writer encounters an intent
300
record or newly-committed value in a location that it needs to read or
301
write. This is a conflict, usually causing either of the transactions to
302
abort or restart depending on the type of conflict.
303
304
***Transaction restart:***
305
306
This is the usual (and more efficient) type of behaviour and is used
307
except when the transaction was aborted (for instance by another
308
transaction).
309
In effect, that reduces to two cases; the first being the one outlined
310
above: An SSI transaction that finds upon attempting to commit that
311
its commit timestamp has been pushed. The second case involves a transaction
312
actively encountering a conflict, that is, one of its readers or writers
313
encounter data that necessitate conflict resolution
314
(see transaction interactions below).

When a transaction restarts, it changes its priority and/or moves its
timestamp forward depending on data tied to the conflict, and
begins anew reusing the same txn id. The prior run of the transaction might
have written some write intents, which need to be deleted before the
320
transaction commits, so as to not be included as part of the transaction.
321
These stale write intent deletions are done during the reexecution of the
transaction, either implicitly, through writing new intents to
the same keys as part of the reexecution of the transaction, or explicitly,
324
by cleaning up stale intents that are not part of the reexecution of the
325
transaction. Since most transactions will end up writing to the same keys,
326
the explicit cleanup run just before committing the transaction is usually
327
a NOOP.
328
329
***Transaction abort:***
330
331
This is the case in which a transaction, upon reading its transaction
332
record, finds that it has been aborted. In this case, the transaction
333
can not reuse its intents; it returns control to the client before
334
cleaning them up (other readers and writers would clean up dangling
335
intents as they encounter them) but will make an effort to clean up
336
after itself. The next attempt (if applicable) then runs as a new
337
transaction with **a new txn id**.

***Transaction interactions:***

There are several scenarios in which transactions interact:
342
343
- **Reader encounters write intent or value with newer timestamp far
344
enough in the future**: This is not a conflict. The reader is free
345
to proceed; after all, it will be reading an older version of the
346
value and so does not conflict. Recall that the write intent may
347
be committed with a later timestamp than its candidate; it will
348
never commit with an earlier one. **Side note**: if a SI transaction
349
reader finds an intent with a newer timestamp which the reader’s own
transaction has written, the reader returns that intent's value.
351
352
- **Reader encounters write intent or value with newer timestamp in the
353
near future:** In this case, we have to be careful. The newer
354
intent may, in absolute terms, have happened in our read's past if
355
the clock of the writer is ahead of the node serving the values.
356
In that case, we would need to take this value into account, but
357
we just don't know. Hence the transaction restarts, using instead
358
a future timestamp (but remembering a maximum timestamp used to
359
limit the uncertainty window to the maximum clock skew). In fact,
360
this is optimized further; see the details under "choosing a
timestamp" below.
362
363
- **Reader encounters write intent with older timestamp**: the reader
must follow the intent’s transaction id to the transaction record.
If the transaction has already been committed, then the reader can
366
just read the value. If the write transaction has not yet been
367
committed, then the reader has two options. If the write conflict
368
is from an SI transaction, the reader can *push that transaction's
369
commit timestamp into the future* (and consequently not have to
370
read it). This is simple to do: the reader just updates the
371
transaction’s commit timestamp to indicate that when/if the
372
transaction does commit, it should use a timestamp *at least* as
373
high. However, if the write conflict is from an SSI transaction,
374
the reader must compare priorities. If the reader has the higher priority,
375
it pushes the transaction’s commit timestamp (that
transaction will then notice its timestamp has been pushed, and
restart). If it has the lower or same priority, it retries itself using
as a new priority *max(new random priority, conflicting txn’s priority - 1)*.

- **Writer encounters uncommitted write intent**:
If the other write intent has been written by a transaction with a lower
priority, the writer aborts the conflicting transaction. If the write
384
intent has a higher or equal priority the transaction retries, using as a new
385
priority *max(new random priority, conflicting txn’s priority - 1)*;
386
the retry occurs after a short, randomized backoff interval.
388
- **Writer encounters newer committed value**:
389
The committed value could also be an unresolved write intent made by a
390
transaction that has already committed. The transaction restarts. On restart,
391
the same priority is reused, but the candidate timestamp is moved forward
392
to the encountered value's timestamp.

- **Writer encounters more recently read key**:
The *read timestamp cache* is consulted on each write at a node. If the write’s
candidate timestamp is earlier than the low water mark on the cache itself
397
(i.e. its last evicted timestamp) or if the key being written has a read
398
timestamp later than the write’s candidate timestamp, this later timestamp
399
value is returned with the write. A new timestamp forces a transaction
400
restart only if it is serializable.
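The priority rule quoted in the writer/writer case above, *max(new random priority, conflicting txn's priority - 1)*, can be written down directly; the following Go fragment is only a restatement of that formula with invented names:

```go
package main

import (
	"fmt"
	"math/rand"
)

// retryPriority implements the retry rule: a transaction repeatedly
// losing to the same opponent eventually ratchets its priority up to
// just below the opponent's, so a winner is always picked.
func retryPriority(rng *rand.Rand, conflicting int32) int32 {
	p := rng.Int31()
	if c := conflicting - 1; c > p {
		return c
	}
	return p
}

func main() {
	rng := rand.New(rand.NewSource(1))
	fmt.Println(retryPriority(rng, 1<<30))
}
```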
402
**Transaction management**
403
404
Transactions are managed by the client proxy (or gateway in SQL Azure
405
parlance). Unlike in Spanner, writes are not buffered but are sent
406
directly to all implicated ranges. This allows the transaction to abort
407
quickly if it encounters a write conflict. The client proxy keeps track
408
of all written keys in order to resolve write intents asynchronously upon
409
transaction completion. If a transaction commits successfully, all intents
410
are upgraded to committed. In the event a transaction is aborted, all written
411
intents are deleted. The client proxy doesn’t guarantee it will resolve intents.

In the event the client proxy restarts before the pending transaction is
committed, the dangling transaction would continue to "live" until
415
aborted by another transaction. Transactions periodically heartbeat
416
their transaction record to maintain liveness.
417
Transactions encountered by readers or writers with dangling intents
418
which haven’t been heartbeat within the required interval are aborted.
423
424
An exploration of retries with contention and abort times with abandoned
425
transactions is
426
[here](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing).
427

**Transaction Records**

Please see [pkg/roachpb/data.proto](https://github.com/cockroachdb/cockroach/blob/master/pkg/roachpb/data.proto) for the up-to-date structures, the best entry point being `message Transaction`.
431
432
**Pros**
433
434
- No requirement for reliable code execution to prevent stalled 2PC
435
protocol.
436
- Readers never block with SI semantics; with SSI semantics, they may
437
abort.
438
- Lower latency than traditional 2PC commit protocol (w/o contention)
439
because second phase requires only a single write to the
transaction record instead of a synchronous round to all
transaction participants.
442
- Priorities avoid starvation for arbitrarily long transactions and
443
always pick a winner from between contending transactions (no
444
mutual aborts).
445
- Writes not buffered at client; writes fail fast.
446
- No read-locking overhead required for *serializable* SI (in contrast
447
to other SSI implementations).
448
- Well-chosen (i.e. less random) priorities can flexibly give
449
probabilistic guarantees on latency for arbitrary transactions
450
(for example: make OLTP transactions 10x less likely to abort than
451
low priority transactions, such as asynchronously scheduled jobs).
452
453
**Cons**
454
457
- Abandoned transactions may block contending writers for up to the
458
heartbeat interval, though average wait is likely to be
459
considerably shorter (see [graph in link](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing)).
460
This is likely considerably more performant than detecting and
461
restarting 2PC in order to release read and write locks.
462
- Behavior different than other SI implementations: no first writer
463
wins, and shorter transactions do not always finish quickly.
464
Element of surprise for OLTP systems may be a problematic factor.
465
- Aborts can decrease throughput in a contended system compared with
466
two phase locking. Aborts and retries increase read and write
467
traffic, increase latency and decrease throughput.
468
469
**Choosing a Timestamp**
470
471
A key challenge of reading data in a distributed system with clock skew
472
is choosing a timestamp guaranteed to be greater than the latest
473
timestamp of any committed transaction (in absolute time). No system can
474
claim consistency and fail to read already-committed data.
475

Accomplishing consistency for transactions (or just single operations)
accessing a single node is easy. The timestamp is assigned by the node
478
itself, so it is guaranteed to be at a greater timestamp than all the
479
existing timestamped data on the node.
480
481
For multiple nodes, the timestamp of the node coordinating the
482
transaction `t` is used. In addition, a maximum timestamp `t+ε` is
483
supplied to provide an upper bound on timestamps for already-committed
484
data (`ε` is the maximum clock skew). As the transaction progresses, any
485
data read which have timestamps greater than `t` but less than `t+ε`
486
cause the transaction to abort and retry with the conflicting timestamp
t<sub>c</sub>, where t<sub>c</sub> > t. The maximum timestamp `t+ε` remains
the same. This implies that transaction restarts due to clock uncertainty
489
can only happen on a time interval of length `ε`.

We apply another optimization to reduce the restarts caused
by uncertainty. Upon restart, the transaction not only takes
into account t<sub>c</sub>, but the timestamp of the node at the time
494
of the uncertain read t<sub>node</sub>. The larger of those two timestamps
495
t<sub>c</sub> and t<sub>node</sub> (likely equal to the latter) is used
496
to increase the read timestamp. Additionally, the conflicting node is
497
marked as “certain”. Then, for future reads to that node within the
498
transaction, we set `MaxTimestamp = Read Timestamp`, preventing further
499
uncertainty restarts.
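A simplified model of this uncertainty check follows; it omits the detail that the restart timestamp is the larger of t<sub>c</sub> and t<sub>node</sub>, and the names are ours:

```go
package main

import "fmt"

// txnReader sketches the read-side check: values in the window
// (readTS, maxTS] may, in absolute time, predate the read, so the
// transaction must restart at a higher timestamp. Once a node has
// caused such a restart it is marked "certain" and cannot cause another.
type txnReader struct {
	readTS  int64
	maxTS   int64
	certain map[string]bool
}

func (t *txnReader) check(node string, valueTS int64) (restartAt int64, restart bool) {
	if t.certain[node] {
		return 0, false
	}
	if valueTS > t.readTS && valueTS <= t.maxTS {
		t.certain[node] = true
		return valueTS, true // caller restarts with a read timestamp >= valueTS
	}
	return 0, false
}

func main() {
	r := &txnReader{readTS: 100, maxTS: 110, certain: map[string]bool{}}
	fmt.Println(r.check("n1", 105)) // 105 true: restart
	fmt.Println(r.check("n1", 108)) // 0 false: n1 is now "certain"
}
```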
500
501
Correctness follows from the fact that we know that at the time of the read,
502
there exists no version of any key on that node with a higher timestamp than
504
encounters a key with a higher timestamp, it knows that in absolute time,
505
the value was written after t<sub>node</sub> was obtained, i.e. after the
506
uncertain read. Hence the transaction can move forward reading an older version
507
of the data (at the transaction's timestamp). This limits the time uncertainty
508
restarts attributed to a node to at most one. The tradeoff is that we might
509
pick a timestamp larger than the optimal one (> highest conflicting timestamp),
510
resulting in the possibility of a few more conflicts.
511
512
We expect retries will be rare, but this assumption may need to be
513
revisited if retries become problematic. Note that this problem does not
514
apply to historical reads. An alternate approach which does not require
515
retries makes a round to all node participants in advance and
516
chooses the highest reported node wall time as the timestamp. However,
517
knowing which nodes will be accessed in advance is difficult and
518
potentially limiting. Cockroach could also potentially use a global
519
clock (Google did this with [Percolator](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf)),
520
which would be feasible for smaller, geographically-proximate clusters.
522
# Strict Serializability (Linearizability)
523
524
Roughly speaking, the gap between <i>strict serializability</i> (which we use
525
interchangeably with <i>linearizability</i>) and CockroachDB's default
526
isolation level (<i>serializable</i>) is that with linearizable transactions,
527
causality is preserved. That is, if one transaction (say, creating a posting
528
for a user) waits for its predecessor (creating the user in the first place)
529
to complete, one would hope that the logical timestamp assigned to the former
530
is larger than that of the latter.
531
In practice, in distributed databases this may not hold, the reason typically
532
being that clocks across a distributed system are not perfectly synchronized
533
and the "later" transaction touches a part disjoint from that on which the
534
first transaction ran, resulting in clocks with disjoint information to decide
535
on the commit timestamps.
536
537
In practice, in CockroachDB many transactional workloads are actually
538
linearizable, though the precise conditions are too involved to outline them
539
here.
540
541
Causality is typically not required for many transactions, and so it is
542
advantageous to pay for it only when it *is* needed. CockroachDB implements
543
this via <i>causality tokens</i>: When committing a transaction, a causality
544
token can be retrieved and passed to the next transaction, ensuring that these
545
two transactions get assigned increasing logical timestamps.
546
547
Additionally, as better synchronized clocks become a standard commodity offered
548
by cloud providers, CockroachDB can provide global linearizability by doing
549
much the same that [Google's
550
Spanner](http://research.google.com/archive/spanner.html) does: wait out the
551
maximum clock offset after committing, but before returning to the client.
552
553
See the blog post below for much more in-depth information.
554
555
https://www.cockroachlabs.com/blog/living-without-atomic-clocks/

# Logical Map Content

Logically, the map contains a series of reserved system key/value
560
pairs preceding the actual user data (which is managed by the SQL
561
subsystem).
- `\x02<key1>`: Range metadata for range ending `\x03<key1>`. This is a "meta1" key.
- ...
- `\x02<keyN>`: Range metadata for range ending `\x03<keyN>`. This is a "meta1" key.
- `\x03<key1>`: Range metadata for range ending `<key1>`. This is a "meta2" key.
- ...
- `\x03<keyN>`: Range metadata for range ending `<keyN>`. This is a "meta2" key.
569
- `\x04{desc,node,range,store}-idgen`: ID generation oracles for various component types.
570
- `\x04status-node-<varint encoded Node ID>`: Node runtime metadata.
571
- `\x04tsd<key>`: Time-series data key.
572
- `<key>`: A user key. In practice, these keys are managed by the SQL
573
subsystem, which employs its own key anatomy.
575
# Stores and Storage
576
577
Nodes contain one or more stores. Each store should be placed on a unique disk.
578
Internally, each store contains a single instance of RocksDB with a block cache
579
shared amongst all of the stores in a node. These stores in turn hold
a collection of range replicas. More than one replica for a range will never
581
be placed on the same store or even the same node.
582
583
Early on, when a cluster is first initialized, the few default starting ranges
584
will only have a single replica, but as soon as other nodes are available they
585
will replicate to them until they've reached their desired replication factor,
586
the default being 3.
587
588
Zone configs can be used to control a range's replication factor and add
589
constraints as to where the range's replicas can be located. When there is a
590
change in a range's zone config, the range will up or down replicate to the
591
appropriate number of replicas and move its replicas to the appropriate stores
592
based on zone config's constraints.
593
594
# Self Repair
595
596
If a store has not been heard from (gossiped its descriptor) in some time,
597
the default setting being 5 minutes, the cluster will consider this store to be
598
dead. When this happens, all ranges that have replicas on that store are
599
determined to be unavailable and removed. These ranges will then upreplicate
600
themselves to other available stores until their desired replication factor is
601
again met. If 50% or more of the replicas are unavailable at the same time,
there is no quorum and the whole range will be considered unavailable until
more than 50% of the replicas are again available.
604
605
# Rebalancing
606
607
As more data are added to the system, some stores may grow faster than others.
608
To combat this and to spread the overall load across the full cluster, replicas
609
will be moved between stores maintaining the desired replication factor. The
610
heuristics used to perform this rebalancing include:
611
612
- the number of replicas per store
613
- the total size of the data used per store
614
- free space available per store
615
616
In the future, some other factors that might be considered include:
617
618
- cpu/network load per store
619
- ranges that are used together often in queries
620
- number of active ranges per store
621
- number of range leases held per store
622
623
# Range Metadata
624
625
The default approximate size of a range is 64M (2\^26 B). In order to
626
support 1P (2\^50 B) of logical data, metadata is needed for roughly
627
2\^(50 - 26) = 2\^24 ranges. A reasonable upper bound on range metadata
size is 256 bytes (3\*12 bytes for the triplicated node locations and
220 bytes for the range key itself). 2\^24 ranges \* 2\^8
B would require roughly 4G (2\^32 B) to store--too much to duplicate
631
between machines. Our conclusion is that range metadata must be
632
distributed for large installations.
633
634
To keep key lookups relatively fast in the presence of distributed metadata,
635
we store all the top-level metadata in a single range (the first range). These
636
top-level metadata keys are known as *meta1* keys, and are prefixed such that
637
they sort to the beginning of the key space. Given the metadata size of 256
638
bytes given above, a single 64M range would support 64M/256B = 2\^18 ranges,
which gives a total storage of 64M \* 2\^18 = 16T. To support the 4E noted
above, we need two levels of indirection, where the first level addresses the
641
second, and the second addresses user data. With two levels of indirection, we
642
can address 2\^(18 + 18) = 2\^36 ranges; each range addresses 2\^26 B, and
643
altogether we address 2\^(36+26) B = 2\^62 B = 4E of user data.
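For reference, the arithmetic above can be reproduced with a few constants (64M per range, roughly 256 B of metadata per range, two levels of indirection); this is just the calculation, not production code:

```go
package main

import "fmt"

func main() {
	const (
		rangeSize  = int64(1) << 26 // 64 MiB per range
		metaRecord = int64(256)     // ~256 B of metadata per range
	)
	fanout := rangeSize / metaRecord // ranges addressable by one metadata range: 2^18
	ranges := fanout * fanout        // two levels of indirection: 2^36 ranges
	fmt.Printf("fanout=%d ranges=%d addressable bytes=%d\n",
		fanout, ranges, ranges*rangeSize) // ~4E of user data
}
```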
644
645
For a given user-addressable `key1`, the associated *meta1* record is found
646
at the successor key to `key1` in the *meta1* space. Since the *meta1* space
647
is sparse, the successor key is defined as the next key which is present. The
648
*meta1* record identifies the range containing the *meta2* record, which is
649
found using the same process. The *meta2* record identifies the range
650
containing `key1`, which is again found the same way (see examples below).
652
Concretely, metadata keys are prefixed by `\x02` (meta1) and `\x03`
653
(meta2); the prefixes `\x02` and `\x03` provide for the desired
654
sorting behaviour. Thus, `key1`'s *meta1* record will reside at the
655
successor key to `\x02<key1>`.

Note: we append the end key of each range to meta{1,2} records because
the RocksDB iterator only supports a Seek() interface which acts as a
659
Ceil(). Using the start key of the range would cause Seek() to find the
660
key *after* the meta indexing record we’re looking for, which would
661
result in having to back the iterator up, an option which is both less
662
efficient and not available in all cases.
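A hypothetical sketch of how the `\x02`/`\x03`-prefixed lookup keys are formed is shown below; the actual helpers in the keys package also deal with key encoding details that are elided here:

```go
package main

import "fmt"

// Prefixes from the text: meta1 records live under \x02, meta2 under \x03.
const (
	meta1Prefix = "\x02"
	meta2Prefix = "\x03"
)

// metaKeys returns the keys at which the meta1 and meta2 records for a
// user key are sought; the lookup itself is a Seek() (a Ceil()) to the
// next present key at or after these, per the note above.
func metaKeys(userKey string) (meta1, meta2 string) {
	return meta1Prefix + userKey, meta2Prefix + userKey
}

func main() {
	m1, m2 := metaKeys("/table/51/42/Apple")
	fmt.Printf("meta1 lookup at %q\nmeta2 lookup at %q\n", m1, m2)
}
```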
663
664
The following example shows the directory structure for a map with
665
three ranges worth of data. Ellipses indicate additional key/value
666
pairs to fill an entire range of data. For clarity, the examples use
667
`meta1` and `meta2` to refer to the prefixes `\x02` and `\x03`. Except
668
for the fact that splitting ranges requires updates to the range
669
metadata with knowledge of the metadata layout, the range metadata
670
itself requires no special treatment or bootstrapping.
671
672
**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
673
`dcrama3:8000`)
674
675
- `meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
676
- `meta2<lastkey0>`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
677
- `meta2<lastkey1>`: `dcrama4:8000`, `dcrama5:8000`, `dcrama6:8000`
678
- `meta2\xff`: `dcrama7:8000`, `dcrama8:8000`, `dcrama9:8000`
679
- ...
680
- `<lastkey0>`: `<lastvalue0>`
681
682
**Range 1** (located on servers `dcrama4:8000`, `dcrama5:8000`,
683
`dcrama6:8000`)
684
685
- ...
686
- `<lastkey1>`: `<lastvalue1>`
687
688
**Range 2** (located on servers `dcrama7:8000`, `dcrama8:8000`,
689
`dcrama9:8000`)
690
691
- ...
692
- `<lastkey2>`: `<lastvalue2>`
693
694
Consider a simpler example of a map containing less than a single
695
range of data. In this case, all range metadata and all data are
696
located in the same range:
697
698
**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
699
`dcrama3:8000`)
700
701
- `meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
702
- `meta2\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
703
- `<key0>`: `<value0>`
704
- `...`
705
706
Finally, a map large enough to need both levels of indirection would
707
look like (note that instead of showing range replicas, this
708
example is simplified to just show range indexes):
709
710
**Range 0**
711
712
- `meta1<lastkeyN-1>`: Range 0
713
- `meta1\xff`: Range 1
714
- `meta2<lastkey1>`: Range 1
715
- `meta2<lastkey2>`: Range 2
716
- `meta2<lastkey3>`: Range 3
- ...
- `meta2<lastkeyN-1>`: Range 262143

**Range 1**

- `meta2<lastkeyN>`: Range 262144
- `meta2\xff`: Range 262145
- ...
727
- `<lastkey1>`: `<lastvalue1>`
728
729
**Range 2**
730
731
- ...
732
- `<lastkey2>`: `<lastvalue2>`
733
734
**Range 3**
735
736
- ...
737
- `<lastkey3>`: `<lastvalue3>`
738
739
**Range 262144**
740
741
- ...
742
- `<lastkeyN>`: `<lastvalueN>`
743
744
**Range 262145**
745
746
- ...
747
- `<lastkeyN+1>`: `<lastvalueN+1>`
748
749
Note that the choice of range `262144` is just an approximation. The
750
actual number of ranges addressable via a single metadata range is
751
dependent on the size of the keys. If efforts are made to keep key sizes
752
small, the total number of addressable ranges would increase and vice
753
versa.
754
755
From the examples above it’s clear that key location lookups require at
756
most three reads to get the value for `<key>`:
757

1. lower bound of `meta1<key>`
2. lower bound of `meta2<key>`,
3. `<key>`.
761
762
For small maps, the entire lookup is satisfied in a single RPC to Range 0. Maps
763
containing less than 16T of data would require two lookups. Clients cache both
764
levels of range metadata, and we expect that data locality for individual
765
clients will be high. Clients may end up with stale cache entries. If on a
766
lookup, the range consulted does not match the client’s expectations, the
767
client evicts the stale entries and possibly does a new lookup.
768
# Raft - Consistency of Range Replicas

Each range is configured to consist of three or more replicas, as specified by
772
their ZoneConfig. The replicas in a range maintain their own instance of a
773
distributed consensus algorithm. We use the [*Raft consensus algorithm*](https://raftconsensus.github.io)
because it is simple to reason about and includes a reference
implementation covering important details.

[ePaxos](https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf) has
777
promising performance characteristics for WAN-distributed replicas, but
778
it does not guarantee a consistent ordering between replicas.
779
780
Raft elects a relatively long-lived leader which must be involved to
propose commands. It heartbeats followers periodically and keeps their logs
replicated. In the absence of heartbeats, followers become candidates
783
after randomized election timeouts and proceed to hold new leader
784
elections. Cockroach weights random timeouts such that the replicas with
785
shorter round trip times to peers are more likely to hold elections
786
first (not implemented yet). Only the Raft leader may propose commands;
787
followers will simply relay commands to the last known leader.
789
Our Raft implementation was developed together with CoreOS, but adds an extra
790
layer of optimization to account for the fact that a single Node may have
791
millions of consensus groups (one for each Range). Areas of optimization
792
are chiefly coalesced heartbeats (so that the number of nodes dictates the
793
number of heartbeats as opposed to the much larger number of ranges) and
794
batch processing of requests.
795
Future optimizations may include two-phase elections and quiescent ranges
796
(i.e. stopping traffic completely for inactive ranges).
797
# Range Leases
800
As outlined in the Raft section, the replicas of a Range are organized as a
801
Raft group and execute commands from their shared commit log. Going through
802
Raft is an expensive operation though, and there are tasks which should only be
803
carried out by a single replica at a time (as opposed to all of them).
804
In particular, it is desirable to serve authoritative reads from a single
805
Replica (ideally from more than one, but that is far more difficult).

For these reasons, CockroachDB introduces the concept of **Range Leases**:
This is a lease held for a slice of (database, i.e. hybrid logical) time and is
809
established by committing a special log entry through Raft containing the
interval for which the lease is to be valid, along with the Node:RaftID
combination that uniquely describes the requesting replica. Reads and writes
812
must generally be addressed to the replica holding the lease; if none does, any
813
replica may be addressed, causing it to try to obtain the lease synchronously.

Requests received by a non-lease holder (for the HLC timestamp specified
in the request's header) fail with an error pointing at the replica's last known
lease holder. These requests are retried transparently with the updated lease by the
817
gateway node and never reach the client.
818
819
The replica holding the lease is in charge of, or involved in, handling
820
Range-specific maintenance tasks such as
821
822
* gossiping the sentinel and/or first range information
823
* splitting, merging and rebalancing
824
825
and, very importantly, may satisfy reads locally, without incurring the
826
overhead of going through Raft.
827
828
Since reads bypass Raft, a new lease holder will, among other things, ascertain
829
that its timestamp cache does not report timestamps smaller than the previous
830
lease holder's (so that it's compatible with reads which may have occurred on
831
the former lease holder). This is accomplished by letting leases enter
832
a <i>stasis period</i> (which is just the expiration minus the maximum clock
833
offset) before the actual expiration of the lease, so that all the next lease
834
holder has to do is set the low water mark of the timestamp cache to its
835
new lease's start time.
836
837
As a lease enters its stasis period, no more reads or writes are served, which
838
is undesirable. However, this would only happen in practice if a node became
839
unavailable. In almost all practical situations, no unavailability results
840
since leases are usually long-lived (and/or eagerly extended, which can avoid
841
the stasis period) or proactively transferred away from the lease holder, which
842
can also avoid the stasis period by promising not to serve any further reads
843
until the next lease goes into effect.
844
845
## Colocation with Raft leadership

The range lease is completely separate from Raft leadership, and so without
further efforts, Raft leadership and the Range lease might not be held by the
849
same Replica. Since it's expensive to not have these two roles colocated (the
850
lease holder has to forward each proposal to the leader, adding costly RPC
851
round-trips), each lease renewal or transfer also attempts to colocate them.
852
In practice, that means that the mismatch is rare and self-corrects quickly.

## Command Execution Flow

This subsection describes how a lease holder replica processes a
857
read/write command in more detail. Each command specifies (1) a key
858
(or a range of keys) that the command accesses and (2) the ID of a
859
range which the key(s) belongs to. When receiving a command, a node
860
looks up a range by the specified Range ID and checks if the range is
861
still responsible for the supplied keys. If any of the keys do not
862
belong to the range, the node returns an error so that the client will
863
retry and send a request to a correct range.

When all the keys belong to the range, the node attempts to
process the command. If the command is an inconsistent read-only
867
command, it is processed immediately. If the command is a consistent
868
read or a write, the command is executed when both of the following
869
conditions hold:
870
- The range replica has a range lease.
- There are no other running commands whose keys overlap with
873
the submitted command and cause read/write conflict.
874
875
When the first condition is not met, the replica attempts to acquire
876
a lease or returns an error so that the client will redirect the
command to the current lease holder. When the second condition is not met,
the command waits until the preceding, overlapping commands have finished
executing.

When the above two conditions are met, the lease holder replica processes the
882
command. Consistent reads are processed on the lease holder immediately.
Write commands are committed into the Raft log so that every replica
will execute the same commands. All commands produce deterministic
885
results so that the range replicas keep consistent states among them.
886
887
When a write command completes, all the replicas update their response
cache. When a read command completes, the lease holder
replica updates its timestamp cache to keep track of the latest read
890
for a given key.
891

There is a chance that a lease expires while a command is being
executed. Before executing a command, each replica checks if the replica
proposing the command still holds the lease. If the lease has
expired, the command is rejected by the replica.
896
897
# Splitting / Merging Ranges

Nodes split or merge ranges based on whether they exceed maximum or
minimum thresholds for capacity or load. Ranges exceeding maximums for
902
either capacity or load are split; ranges below minimums for *both*
903
capacity and load are merged.
904
905
Ranges maintain the same accounting statistics as accounting key
906
prefixes. These boil down to a time series of data points with minute
907
granularity. Everything from number of bytes to read/write queue sizes.
908
Arbitrary distillations of the accounting stats can be determined as the
basis for splitting / merging. Initial candidates for
split/merge are range size in bytes and IOps. A good metric for
911
rebalancing a replica from one node to another would be total read/write
912
queue wait times. These metrics are gossipped, with each range / node
913
passing along relevant metrics if they’re in the bottom or top of the
914
range it’s aware of.
915
916
A range finding itself exceeding either capacity or load threshold
splits. To this end, the range lease holder computes an appropriate split key
candidate and issues the split through Raft. In contrast to splitting,
919
merging requires a range to be below the minimum threshold for both
920
capacity *and* load. A range being merged chooses the smaller of the
921
ranges immediately preceding and succeeding it.
922
923
Splitting, merging, rebalancing and recovering all follow the same basic
924
algorithm for moving data between roach nodes. New target replicas are
925
created and added to the replica set of source range. Then each new
926
replica is brought up to date by either replaying the log in full or
927
copying a snapshot of the source replica data and then replaying the log
928
from the timestamp of the snapshot to catch up fully. Once the new
929
replicas are fully up to date, the range metadata is updated and old,
930
source replica(s) deleted if applicable.
931
**Coordinator** (lease holder replica)
934
```
if splitting
  SplitRange(split_key): splits happen locally on range replicas and
  only after being completed locally, are moved to new target replicas.
else if merging
  Choose new replicas on same servers as target range replicas;
  add to replica set.
else if rebalancing || recovering
  Choose new replica(s) on least loaded servers; add to replica set.
```
944
945
**New Replica**
946
947
*Bring replica up to date:*
948
949
```
if all info can be read from replicated log
  copy replicated log
else
  snapshot source replica
  send successive ReadRange requests to source replica
  referencing snapshot

if merging
  combine ranges on all replicas
else if rebalancing || recovering
  remove old range replica(s)
```
962
Nodes split ranges when the total data in a range exceeds a
configurable maximum threshold. Similarly, ranges are merged when the
965
total data falls below a configurable minimum threshold.
966
967
**TBD: flesh this out**: Especially for merges (but also rebalancing) we have a
968
range disappearing from the local node; that range needs to disappear
969
gracefully, with a smooth handoff of operation to the new owner of its data.
970
971
Ranges are rebalanced if a node determines its load or capacity is one
972
of the worst in the cluster based on gossipped load stats. A node with
973
spare capacity is chosen in the same datacenter and a special-case split
974
is done which simply duplicates the data 1:1 and resets the range
975
configuration metadata.
976
977
# Node Allocation (via Gossip)
978
979
New nodes must be allocated when a range is split. Instead of requiring
every node to know about the state of all or even a large number
of peer nodes --or-- alternatively requiring a specialized curator or
982
master with sufficiently global knowledge, we use a gossip protocol to
983
efficiently communicate only interesting information between all of the
984
nodes in the cluster. What’s interesting information? One example would
985
be whether a particular node has a lot of spare capacity. Each node,
986
when gossiping, compares each topic of gossip to its own state. If its
987
own state is somehow “more interesting” than the least interesting item
988
in the topic it’s seen recently, it includes its own state as part of
989
the next gossip session with a peer node. In this way, a node with
990
capacity sufficiently in excess of the mean quickly becomes discovered
991
by the entire cluster. To avoid piling onto outliers, nodes from the
992
high capacity set are selected at random for allocation.
993
994
The gossip protocol itself contains two primary components:
995
996
- **Peer Selection**: each node maintains up to N peers with which it
997
regularly communicates. It selects peers with an eye towards
998
maximizing fanout. A peer node which itself communicates with an
999
array of otherwise unknown nodes will be selected over one which
1000
communicates with a set containing significant overlap. Each time
1001
gossip is initiated, each nodes’ set of peers is exchanged. Each
1002
node is then free to incorporate the other’s peers as it sees fit.
1003
To avoid any node suffering from excess incoming requests, a node
1004
may refuse to answer a gossip exchange. Each node is biased
1005
towards answering requests from nodes without significant overlap
1006
and refusing requests otherwise.
1007
1008
Peers are efficiently selected using a heuristic as described in
1009
[Agarwal & Trachtenberg (2006)](https://drive.google.com/file/d/0B9GCVTp_FHJISmFRTThkOEZSM1U/edit?usp=sharing).
1010
1011
**TBD**: how to avoid partitions? Need to work out a simulation of
1012
the protocol to tune the behavior and see empirically how well it
1013
works.
1014
1015
- **Gossip Selection**: what to communicate. Gossip is divided into
1016
topics. Load characteristics (capacity per disk, cpu load, and
1017
state [e.g. draining, ok, failure]) are used to drive node
1018
allocation. Range statistics (range read/write load, missing
1019
replicas, unavailable ranges) and network topology (inter-rack
1020
bandwidth/latency, inter-datacenter bandwidth/latency, subnet
1021
outages) are used for determining when to split ranges, when to
1022
recover replicas vs. wait for network connectivity, and for
1023
debugging / sysops. In all cases, a set of minimums and a set of
1024
maximums is propagated; each node applies its own view of the
1025
world to augment the values. Each minimum and maximum value is
1026
tagged with the reporting node and other accompanying contextual
1027
information. Each topic of gossip has its own protobuf to hold the
1028
structured data. The number of items of gossip in each topic is
1029
limited by a configurable bound.
1030
1031
For efficiency, nodes assign each new item of gossip a sequence
1032
number and keep track of the highest sequence number each peer
1033
node has seen. Each round of gossip communicates only the delta
1034
containing new items.
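A minimal sketch of this delta mechanism, assuming a per-item sequence number and a per-peer high water mark (names invented for illustration):

```go
package main

import "fmt"

// gossipInfo is one item of gossip tagged with a local sequence number.
type gossipInfo struct {
	key, val string
	seq      int64
}

// store keeps gossip items and the highest sequence number each peer is
// known to have seen, so that a round only ships the delta.
type store struct {
	seq      int64
	items    map[string]gossipInfo
	peerSeen map[string]int64
}

func newStore() *store {
	return &store{items: map[string]gossipInfo{}, peerSeen: map[string]int64{}}
}

func (s *store) add(key, val string) {
	s.seq++
	s.items[key] = gossipInfo{key: key, val: val, seq: s.seq}
}

// deltaFor returns items the peer has not seen and advances its high water mark.
func (s *store) deltaFor(peer string) []gossipInfo {
	var out []gossipInfo
	for _, info := range s.items {
		if info.seq > s.peerSeen[peer] {
			out = append(out, info)
		}
	}
	s.peerSeen[peer] = s.seq
	return out
}

func main() {
	s := newStore()
	s.add("node1:capacity", "900GiB")
	fmt.Println(len(s.deltaFor("peerA"))) // 1
	fmt.Println(len(s.deltaFor("peerA"))) // 0: nothing new
}
```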
1035
1036
# Node and Cluster Metrics
1037
1038
Every component of the system is responsible for exporting interesting
1039
metrics about itself. These could be histograms, throughput counters, or
1040
gauges.
1041
1042
These metrics are exported for external monitoring systems (such as Prometheus)
1043
via a HTTP endpoint, but CockroachDB also implements an internal timeseries
1044
database which is stored in the replicated key-value map.
1045
1046
Time series are stored at Store granularity and allow the admin dashboard
1047
to efficiently gain visibility into a universe of information at the Cluster,
1048
Node or Store level. A [periodic background process](RFCS/time_series_culling.md)
1049
culls older timeseries data, downsampling and eventually discarding it.

# Key-prefix Accounting and Zones

Arbitrarily fine-grained accounting is specified via
key prefixes. Key prefixes can overlap, as is necessary for capturing
1055
hierarchical relationships. For illustrative purposes, let’s say keys
1056
specifying rows in a set of databases have the following format:
1057
1058
`<db>:<table>:<primary-key>[:<secondary-key>]`
1059
In this case, we might collect accounting with the following
key prefixes:
1062
1063
`db1`, `db1:user`, `db1:order`.
1064
1065
Accounting is kept for the entire map by default.
1066
1067
## Accounting
1068
To keep accounting for a range defined by a key prefix, an entry is created in
1069
the accounting system table. The format of accounting table keys is:
1070
1071
`\0acct<key-prefix>`
1072
1076
Accounting is kept for key prefix ranges with eventual consistency for
1077
efficiency. There are two types of values which comprise accounting:
1078
counts and occurrences, for lack of better terms. Counts describe
1079
system state, such as the total number of bytes, rows,
1080
etc. Occurrences include transient performance and load metrics. Both
1081
types of accounting are captured as time series with minute
1082
granularity. The length of time accounting metrics are kept is
1083
configurable. Below are examples of each type of accounting value.
1084
1085
**System State Counters/Performance**
1086
1087
- Count of items (e.g. rows)
1088
- Total bytes
1089
- Total key bytes
1090
- Total value length
1091
- Queued message count
1092
- Queued message total bytes
1093
- Count of values \< 16B
1094
- Count of values \< 64B
1095
- Count of values \< 256B
1096
- Count of values \< 1K
1097
- Count of values \< 4K
1098
- Count of values \< 16K
1099
- Count of values \< 64K
1100
- Count of values \< 256K
1101
- Count of values \< 1M
1102
- Count of values \> 1M
1103
- Total bytes of accounting
1104
1105
1106
**Load Occurrences**
1107
1108
- Get op count
1109
- Get total MB
1110
- Put op count
1111
- Put total MB
1112
- Delete op count
1113
- Delete total MB
1114
- Delete range op count
1115
- Delete range total MB
1116
- Scan op count
1117
- Scan op MB
1118
- Split count
1119
- Merge count
1120
1121
Because accounting information is kept as time series and over many
1122
possible metrics of interest, the data can become numerous. Accounting
1123
data are stored in the map near the key prefix described, in order to
1124
distribute load (for both aggregation and storage).
1125
1126
Accounting keys for system state have the form:
1127
`<key-prefix>|acctd<metric-name>*`. Notice the leading ‘pipe’
1128
character. It’s meant to sort the root level account AFTER any other
1129
system tables. They must increment the same underlying values as they
1130
are permanent counts, and not transient activity. Logic at the
node takes care of snapshotting the value into an appropriately
suffixed (e.g. with timestamp hour) multi-value time series entry.
1133
1134
Keys for perf/load metrics:
1135
`<key-prefix>acctd<metric-name><hourly-timestamp>`.
1136
1137
`<hourly-timestamp>`-suffixed accounting entries are multi-valued,
1138
containing a varint64 entry for each minute with activity during the
1139
specified hour.
1140
1141
To efficiently keep accounting over large key ranges, the task of
1142
aggregation must be distributed. If activity occurs within the same
1143
range as the key prefix for accounting, the updates are made as part
1144
of the consensus write. If the ranges differ, then a message is sent
1145
to the parent range to increment the accounting. If upon receiving the
1146
message, the parent range also does not include the key prefix, it in
1147
turn forwards it to its parent or left child in the balanced binary
1148
tree which is maintained to describe the range hierarchy. This limits
1149
the number of messages before an update is visible at the root to `2*log N`,
1150
where `N` is the number of ranges in the key prefix.
1151
1152
## Zones
1153
Zones are stored in the map with keys prefixed by
1154
`\0zone` followed by the key prefix to which the zone
1155
configuration applies. Zone values specify a protobuf containing
1156
the datacenters from which replicas for ranges which fall under
1157
the zone must be chosen.
1158
1159
Please see [pkg/config/config.proto](https://github.com/cockroachdb/cockroach/blob/master/pkg/config/config.proto) for up-to-date data structures used, the best entry point being `message ZoneConfig`.

If zones are modified in situ, each node verifies the
existing zones for its ranges against the zone configuration. If
1163
it discovers differences, it reconfigures ranges in the same way
1164
that it rebalances away from busy nodes, via special-case 1:1
1165
split to a duplicate range comprising the new configuration.
1166
1167
# SQL
1168
1169
Each node in a cluster can accept SQL client connections. CockroachDB
1170
supports the PostgreSQL wire protocol, to enable reuse of native
1171
PostgreSQL client drivers. Connections using SSL and authenticated
1172
using client certificates are supported and even encouraged over
1173
unencrypted (insecure) and password-based connections.
1174
1175
Each connection is associated with a SQL session which holds the
1176
server-side state of the connection. Over the lifespan of a session
1177
the client can send SQL to open/close transactions, issue statements
1178
or queries or configure session parameters, much like with any other
1179
SQL database.
1180
1181
## Language support
1182
1183
CockroachDB also attempts to emulate the flavor of SQL supported by
1184
PostgreSQL, although it also diverges in significant ways:
1185
1186
- CockroachDB exclusively implements MVCC-based consistency for
1187
transactions, and thus only supports SQL's isolation levels SNAPSHOT
1188
and SERIALIZABLE. The other traditional SQL isolation levels are
1189
internally mapped to either SNAPSHOT or SERIALIZABLE.
1190
1191
- CockroachDB implements its own [SQL type system](RFCS/typing.md)
1192
which only supports a limited form of implicit coercions between
1193
types compared to PostgreSQL. The rationale is to keep the
1194
implementation simple and efficient, capitalizing on the observation
1195
that 1) most SQL code in clients is automatically generated with
1196
coherent typing already and 2) existing SQL code for other databases
1197
will need to be massaged for CockroachDB anyways.
1198
1199
## SQL architecture
1200
1201
Client connections over the network are handled in each node by a
1202
pgwire server process (goroutine). This handles the stream of incoming
1203
commands and sends back responses including query/statement results.
1204
The pgwire server also handles pgwire-level prepared statements,
1205
binding prepared statements to arguments and looking up prepared
1206
statements for execution.
1207
1208
Meanwhile the state of a SQL connection is maintained by a Session
1209
object and a monolithic `planner` object (one per connection) which
1210
coordinates execution between the session, the current SQL transaction
1211
state and the underlying KV store.
1212
1213
Upon receiving a query/statement (either directly or via an execute
1214
command for a previously prepared statement) the pgwire server forwards
1215
the SQL text to the `planner` associated with the connection. The SQL
1216
code is then transformed into a SQL query plan.
1217
The query plan is implemented as a tree of objects which describe the
1218
high-level data operations needed to resolve the query, for example
1219
"join", "index join", "scan", "group", etc.
1220
1221
The query plan objects currently also embed the run-time state needed
1222
for the execution of the query plan. Once the SQL query plan is ready,
1223
methods on these objects then carry the execution out in the fashion
1224
of "generators" in other programming languages: each node *starts* its
1225
children nodes and from that point forward each child node serves as a
1226
*generator* for a stream of result rows, which the parent node can
1227
consume and transform incrementally and present to its own parent node
1228
also as a generator.
1229
1230
The top-level planner consumes the data produced by the top node of
1231
the query plan and returns it to the client via pgwire.
1232
1233
## Data mapping between the SQL model and KV
1234
1235
Every SQL table has a primary key in CockroachDB. (If a table is created
1236
without one, an implicit primary key is provided automatically.)
1237
The table identifier, followed by the value of the primary key for
1238
each row, are encoded as the *prefix* of a key in the underlying KV
1239
store.
1240
1241
Each remaining column or *column family* in the table is then encoded
1242
as a value in the underlying KV store, and the column/family identifier
1243
is appended as *suffix* to the KV key.
1244
1245
For example:
1246
1247
- after table `customers` is created in a database `mydb` with a
1248
primary key column `name` and normal columns `address` and `URL`, the KV pairs
1249
to store the schema would be:
1250
1251
| Key | Values |
1252
| ---------------------------- | ------ |
1253
| `/system/databases/mydb/id` | 51 |
1254
| `/system/tables/customer/id` | 42 |
1255
| `/system/desc/51/42/address` | 69 |
1256
| `/system/desc/51/42/url` | 66 |
1257
1258
(The numeric values on the right are chosen arbitrarily for the
1259
example; the structure of the schema keys on the left is simplified
1260
for the example and subject to change.) Each database/table/column
1261
name is mapped to a spontaneously generated identifier, so as to
1262
simplify renames.
1263
1264
Then for a single row in this table:
1265
1266
| Key | Values |
1267
| ----------------- | -------------------------------- |
1268
| `/51/42/Apple/69` | `1 Infinite Loop, Cupertino, CA` |
| `/51/42/Apple/66` | `http://apple.com/` |
1271
Each key has the table prefix `/51/42` followed by the primary key
1272
prefix `/Apple` followed by the column/family suffix (`/66`,
1273
`/69`). The KV value is directly encoded from the SQL value.
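A rough Go illustration of this key anatomy follows; real keys use an order-preserving binary encoding and table descriptors rather than plain strings, so this is only meant to show the prefix/suffix structure:

```go
package main

import "fmt"

// rowKeys sketches the mapping above: every column (or column family)
// of a row becomes one KV pair whose key is
// /<databaseID>/<tableID>/<primary key>/<columnID>. The IDs mirror the
// example (51, 42, 69, 66).
func rowKeys(dbID, tableID int, pk string, cols map[int]string) map[string]string {
	kvs := make(map[string]string, len(cols))
	for colID, val := range cols {
		key := fmt.Sprintf("/%d/%d/%s/%d", dbID, tableID, pk, colID)
		kvs[key] = val
	}
	return kvs
}

func main() {
	kvs := rowKeys(51, 42, "Apple", map[int]string{
		69: "1 Infinite Loop, Cupertino, CA",
		66: "http://apple.com/",
	})
	for k, v := range kvs {
		fmt.Println(k, "=>", v)
	}
}
```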
1274
1275
Efficient storage for the keys is guaranteed by the underlying RocksDB engine
1276
by means of prefix compression.
1277
1278
Finally, for SQL indexes, the KV key is formed using the SQL value of the
1279
indexed columns, and the KV value is the KV key prefix of the rest of
1280
the indexed row.