# About

This document is an updated version of the original design documents
by Spencer Kimball from early 2014.

# Overview

CockroachDB is a distributed SQL database. The primary design goals
are **scalability**, **strong consistency** and **survivability**.
CockroachDB aims to tolerate disk, machine, rack and even datacenter
failures with minimal latency disruption and no manual intervention.
Nodes are symmetric; a design goal is
**homogeneous deployment** (one binary) with minimal configuration and
no required external dependencies.

The entry point for database clients is the SQL interface. Every node
in a CockroachDB cluster can act as a client SQL gateway. A SQL
gateway transforms and executes client SQL statements to key-value
(KV) operations, which the gateway distributes across the cluster as
necessary and returns results to the client. CockroachDB implements a
**single, monolithic sorted map** from key to value where both keys
and values are byte strings.

The KV map is logically composed of smaller segments of the keyspace
called ranges. Each range is backed by data stored in a local KV
storage engine (we use [RocksDB](http://rocksdb.org/), a variant of
LevelDB). Range data is replicated to a configurable number of
additional CockroachDB nodes. Ranges are merged and split to maintain
a target size, by default `64M`. The relatively small size facilitates
quick repair and rebalancing to address node failures, new capacity
and even read/write load. However, the size must be balanced against
the pressure on the system from having more ranges to manage.

CockroachDB achieves horizontal scalability:
- adding more nodes increases the capacity of the cluster by the
  amount of storage on each node (divided by a configurable
  replication factor), theoretically up to 4 exabytes (4E) of logical
  data;
- client queries can be sent to any node in the cluster, and queries
  can operate independently (w/o conflicts), meaning that overall
  throughput scales linearly with the number of nodes in the cluster.

CockroachDB achieves strong consistency:
- uses a distributed consensus protocol for synchronous replication of
  data in each key value range. We’ve chosen to use the [Raft
  consensus algorithm](https://raftconsensus.github.io); all consensus
  state is stored in RocksDB.
- single or batched mutations to a single range are mediated via the
  range's Raft instance. Raft guarantees ACID semantics.
- logical mutations which affect multiple ranges employ distributed
  transactions for ACID semantics. CockroachDB uses an efficient
  **non-locking distributed commit** protocol.

CockroachDB achieves survivability:
- range replicas can be co-located within a single datacenter for low
  latency replication and survive disk or machine failures. They can
  be located across racks to survive rack power or network switch
  failures.
- range replicas can be located in datacenters spanning increasingly
  disparate geographies to survive ever-greater failure scenarios from
  datacenter power or networking loss to regional power failures
  (e.g. `{ US-East-1a, US-East-1b, US-East-1c }`, `{ US-East, US-West,
  Japan }`, `{ Ireland, US-East, US-West }`, `{ Ireland, US-East,
  US-West, Japan, Australia }`).

CockroachDB provides [snapshot isolation](http://en.wikipedia.org/wiki/Snapshot_isolation) (SI) and
serializable snapshot isolation (SSI) semantics, allowing **externally
consistent, lock-free reads and writes**--both from a historical
snapshot timestamp and from the current wall clock time. SI provides
lock-free reads and writes but still allows write skew. SSI eliminates
write skew, but introduces a performance hit in the case of a
contentious system. SSI is the default isolation; clients must
consciously decide to trade correctness for performance. CockroachDB
implements [a limited form of linearizability](#linearizability),
providing ordering for any observer or chain of observers.

Similar to
[Spanner](http://static.googleusercontent.com/media/research.google.com/en/us/archive/spanner-osdi2012.pdf)
directories, CockroachDB allows configuration of arbitrary zones of data.
This allows replication factor, storage device type, and/or datacenter
location to be chosen to optimize performance and/or availability.
Unlike Spanner, zones are monolithic and don’t allow movement of fine
grained data on the level of entity groups.

# Architecture

CockroachDB implements a layered architecture. The highest level of
abstraction is the SQL layer (currently unspecified in this document).
It depends directly on the [*SQL layer*](#sql),
which provides familiar relational concepts
such as schemas, tables, columns, and indexes. The SQL layer
in turn depends on the [distributed key value store](#key-value-api),
which handles the details of range addressing to provide the abstraction
of a single, monolithic key value store. The distributed KV store
communicates with any number of physical cockroach nodes. Each node
contains one or more stores, one per physical device.

*(Figure: architecture diagram showing the SQL layer, the distributed KV store, nodes and stores.)*

Each store contains potentially many ranges, the lowest-level unit of
key-value data. Ranges are replicated using the Raft consensus protocol.
The diagram below is a blown up version of stores from four of the five
nodes in the previous diagram. Each range is replicated three ways using
raft. The color coding shows associated range replicas.

*(Figure: color-coded range replicas spread across the stores of four nodes.)*

Each physical node exports two RPC-based key value APIs: one for
external clients and one for internal clients (exposing sensitive
operational features). Both services accept batches of requests and
return batches of responses. Nodes are symmetric in capabilities and
exported interfaces; each has the same binary and may assume any
role.

Nodes and the ranges they provide access to can be arranged with various
physical network topologies to make trade offs between reliability and
performance. For example, a triplicated (3-way replica) range could have
each replica located on different:

- disks within a server to tolerate disk failures.
- servers within a rack to tolerate server failures.
- servers on different racks within a datacenter to tolerate rack power/network failures.
- servers in different datacenters to tolerate large scale network or power outages.

Up to `F` failures can be tolerated, where the total number of replicas `N = 2F + 1` (e.g. with 3x replication, one failure can be tolerated; with 5x replication, two failures, and so on).

# Cockroach Client

In order to support diverse client usage, Cockroach clients connect to
any node via HTTPS using protocol buffers or JSON. The connected node
proxies the client's work, including key lookups and write buffering.

# Keys

Cockroach keys are arbitrary byte arrays. If textual data is used in
keys, utf8 encoding is recommended (this helps for cleaner display of
values in debugging tools). User-supplied keys are encoded using an
ordered code. System keys are either prefixed with null characters (`\0`
or `\0\0`) for system tables, or take the form of
`<user-key><system-suffix>` to sort user-key-range specific system
keys immediately after the user keys they refer to. Null characters are
used in system key prefixes to guarantee that they sort first.

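To illustrate the resulting sort order, here is a toy sketch with made-up keys (it uses plain byte-wise comparison and is not CockroachDB's actual encoding routine):

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

func main() {
	// Made-up keys: a null-prefixed system key, two user keys, and a
	// hypothetical range-local key of the form <user-key><system-suffix>.
	keys := [][]byte{
		[]byte("banana"),
		[]byte("\x00\x00meta1"),
		append([]byte("apple"), "!txn-suffix"...),
		[]byte("apple"),
	}
	sort.Slice(keys, func(i, j int) bool {
		return bytes.Compare(keys[i], keys[j]) < 0
	})
	for _, k := range keys {
		fmt.Printf("%q\n", k)
	}
	// Resulting order: the \0-prefixed system key sorts first,
	// "apple!txn-suffix" sorts immediately after the user key "apple"
	// it refers to, and "banana" follows.
}
```
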
# Versioned Values

Cockroach maintains historical versions of values by storing them with
associated commit timestamps. Reads and scans can specify a snapshot
time to return the most recent writes prior to the snapshot timestamp.
Older versions of values are garbage collected by the system during
compaction according to a user-specified expiration interval. In order
to support long-running scans (e.g. for MapReduce), all versions have a
minimum expiration.

Versioned values are supported via modifications to RocksDB to record
commit timestamps and GC expirations per key.

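As a sketch of the read semantics described above (simplified integer timestamps and an in-memory slice standing in for the actual RocksDB-backed storage):

```go
package main

import "fmt"

// version is one historical value of a key, tagged with its commit timestamp.
// Timestamps are simplified to int64 here; CockroachDB uses HLC timestamps.
type version struct {
	ts    int64
	value string
}

// readAt returns the most recent version with commit timestamp <= snapTS,
// mirroring "return the most recent writes prior to the snapshot timestamp".
// versions must be sorted by ascending timestamp.
func readAt(versions []version, snapTS int64) (string, bool) {
	for i := len(versions) - 1; i >= 0; i-- {
		if versions[i].ts <= snapTS {
			return versions[i].value, true
		}
	}
	return "", false // no version visible at this snapshot
}

func main() {
	history := []version{{ts: 10, value: "v1"}, {ts: 20, value: "v2"}, {ts: 35, value: "v3"}}
	v, _ := readAt(history, 25)
	fmt.Println(v) // "v2": the latest write at or before timestamp 25
}
```
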
# Lock-Free Distributed Transactions

Cockroach provides distributed transactions without locks. Cockroach
transactions support two isolation levels:

- snapshot isolation (SI) and
- *serializable* snapshot isolation (SSI).

*SI* is simple to implement, highly performant, and correct for all but a
handful of anomalous conditions (e.g. write skew). *SSI* requires just a touch
more complexity, is still highly performant (less so with contention), and has
no anomalous conditions. Cockroach’s SSI implementation is based on ideas from
the literature and some possibly novel insights.

SSI is the default level, with SI provided for application developers
who are certain enough of their need for performance and the absence of
write skew conditions to consciously elect to use it. In a lightly
contended system, our implementation of SSI is just as performant as SI,
requiring no locking or additional writes. With contention, our
implementation of SSI still requires no locking, but will end up
aborting more transactions. Cockroach’s SI and SSI implementations
prevent starvation scenarios even for arbitrarily long transactions.

See the [Cahill paper](https://drive.google.com/file/d/0B9GCVTp_FHJIcEVyZVdDWEpYYXVVbFVDWElrYUV0NHFhU2Fv/edit?usp=sharing)
for one possible implementation of SSI. Another [great paper](http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf)
in this area describes Calvin.
For a discussion of SSI implemented by preventing read-write conflicts
(in contrast to detecting them, called write-snapshot isolation), see
the [Yabandeh paper](https://drive.google.com/file/d/0B9GCVTp_FHJIMjJ2U2t6aGpHLTFUVHFnMTRUbnBwc2pLa1RN/edit?usp=sharing),
which is the source of much inspiration for Cockroach’s SSI.

Both SI and SSI require that the outcome of reads must be preserved, i.e.
a write of a key at a lower timestamp than a previous read must not succeed. To
this end, each range maintains a bounded *in-memory* cache from key range to
the latest timestamp at which it was read.

Most updates to this *timestamp cache* correspond to keys being read, though
the timestamp cache also protects the outcome of some writes (notably range
deletions) which consequently must also populate the cache. The cache’s entries
are evicted oldest timestamp first, updating the low water mark of the cache
appropriately.

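A minimal sketch of the timestamp cache idea follows. It tracks single keys rather than key ranges and uses illustrative names, so it only conveys the read-protection rule, not the actual implementation:

```go
package main

import "fmt"

// tsCache records the latest timestamp at which each key was read, plus a low
// water mark that stands in for entries that have been evicted.
type tsCache struct {
	lowWater int64
	reads    map[string]int64
}

func newTSCache() *tsCache { return &tsCache{reads: map[string]int64{}} }

// noteRead records a read of key at timestamp ts.
func (c *tsCache) noteRead(key string, ts int64) {
	if ts > c.reads[key] {
		c.reads[key] = ts
	}
}

// canWrite reports whether a write at timestamp ts may proceed without
// rewriting history under a previous read; if not, the write must move to a
// higher timestamp.
func (c *tsCache) canWrite(key string, ts int64) bool {
	latest := c.lowWater
	if r, ok := c.reads[key]; ok && r > latest {
		latest = r
	}
	return ts > latest
}

func main() {
	c := newTSCache()
	c.noteRead("a", 100)
	fmt.Println(c.canWrite("a", 90))  // false: a read already happened at 100
	fmt.Println(c.canWrite("a", 101)) // true
}
```
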
Each Cockroach transaction is assigned a random priority and a
"candidate timestamp" at start. The candidate timestamp is the
provisional timestamp at which the transaction will commit, and is
chosen as the current clock time of the node coordinating the
transaction. This means that a transaction without conflicts will
usually commit with a timestamp that, in absolute time, precedes the
actual work done by that transaction.

In the course of coordinating a transaction between one or more
distributed nodes, the candidate timestamp may be increased, but will
never be decreased. The core difference between the two isolation levels
SI and SSI is that the former allows the transaction's candidate
timestamp to increase and the latter does not.

**Hybrid Logical Clock**

Each cockroach node maintains a hybrid logical clock (HLC) as discussed
in the [Hybrid Logical Clock paper](http://www.cse.buffalo.edu/tech-reports/2014-04.pdf).
HLC time uses timestamps which are composed of a physical component (thought of
as, and always close to, local wall time) and a logical component (used to
distinguish between events with the same physical component). It allows us to
track causality for related events similar to vector clocks, but with less
overhead. In practice, it works much like other logical clocks: when events
are received by a node, it informs the local HLC about the timestamp supplied
with the event by the sender, and when events are sent, a timestamp generated by
the local HLC is attached.

For a more in-depth description of HLC please read the paper. Our
implementation is [here](https://github.com/cockroachdb/cockroach/blob/master/util/hlc/hlc.go).

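The update rules from the cited paper can be sketched roughly as follows (a simplified illustration only; the linked implementation additionally enforces monotonicity and bounds on clock offset):

```go
package main

import (
	"fmt"
	"time"
)

// Timestamp is an HLC time: a wall-clock component and a logical counter.
type Timestamp struct {
	WallTime int64 // physical component, close to local wall time
	Logical  int32 // breaks ties between events with equal WallTime
}

// HLC is a hybrid logical clock fed by a physical clock source.
type HLC struct {
	now     func() int64
	current Timestamp
}

// Now returns a timestamp for a local or send event.
func (c *HLC) Now() Timestamp {
	if pt := c.now(); pt > c.current.WallTime {
		c.current = Timestamp{WallTime: pt}
	} else {
		c.current.Logical++
	}
	return c.current
}

// Update folds in a timestamp received from another node, so that the local
// clock never runs behind timestamps it has observed.
func (c *HLC) Update(remote Timestamp) Timestamp {
	pt := c.now()
	switch {
	case pt > c.current.WallTime && pt > remote.WallTime:
		c.current = Timestamp{WallTime: pt}
	case remote.WallTime > c.current.WallTime:
		c.current = Timestamp{WallTime: remote.WallTime, Logical: remote.Logical + 1}
	case c.current.WallTime > remote.WallTime:
		c.current.Logical++
	default: // equal wall times: advance past both logical counters
		if remote.Logical > c.current.Logical {
			c.current.Logical = remote.Logical
		}
		c.current.Logical++
	}
	return c.current
}

func main() {
	c := &HLC{now: func() int64 { return time.Now().UnixNano() }}
	fmt.Println(c.Now())
	fmt.Println(c.Update(Timestamp{WallTime: time.Now().Add(time.Second).UnixNano(), Logical: 4}))
}
```
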
Cockroach picks a Timestamp for a transaction using HLC time. Throughout this
document, *timestamp* always refers to the HLC time, which is a singleton
on each node. The HLC is updated by every read/write event on the node, and
the HLC time >= wall time. A read/write timestamp received in a cockroach command
from another node is not only used to version the operation, but also updates
the HLC on the node. This is useful in guaranteeing that all data read/written
on a node is at a timestamp < next HLC time.

**Transaction execution flow**

Transactions are executed in two phases:

1. Start the transaction by selecting a range which is likely to be
   heavily involved in the transaction and writing a new transaction
   record to a reserved area of that range with state "PENDING". In
   parallel write an "intent" value for each datum being written as part
   of the transaction. These are normal MVCC values, with the addition of
   a special flag (i.e. “intent”) indicating that the value may be
   committed after conflicts are resolved. In addition, the transaction id
   (unique and chosen at txn start time by the client)
   is stored with intent values. The txn id is used to refer to the
   transaction record when there are conflicts and to make
   tie-breaking decisions on ordering between identical timestamps.
   Each node returns the timestamp used for the write (which is the
   original candidate timestamp in the absence of read/write conflicts);
   the client selects the maximum from amongst all write timestamps as the
   final commit timestamp.

2. Commit the transaction by updating its transaction record. The value
   of the commit entry contains the candidate timestamp (increased as
   necessary to accommodate any latest read timestamps). Note that the
   transaction is considered fully committed at this point and control
   may be returned to the client.

   In the case of an SI transaction, a commit timestamp which was
   increased to accommodate concurrent readers is perfectly
   acceptable and the commit may continue. For SSI transactions,
   however, a gap between candidate and commit timestamps
   necessitates transaction restart (note: restart is different than
   abort--see below).

   After the transaction is committed, all written intents are upgraded
   in parallel by removing the “intent” flag. The transaction is
   considered fully committed before this step and does not wait for
   it to return control to the transaction coordinator.

In the absence of conflicts, this is the end. Nothing else is necessary
to ensure the correctness of the system.

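The commit-timestamp decision in step 2 can be summarized as follows (a schematic sketch with made-up types, not the coordinator's actual code path):

```go
package main

import "fmt"

type isolation int

const (
	SI isolation = iota
	SSI
)

// decideCommit applies the rule from step 2: the final commit timestamp is the
// maximum timestamp returned by any write; SI tolerates a pushed timestamp,
// while SSI must restart when the commit timestamp exceeds the candidate.
func decideCommit(iso isolation, candidateTS int64, writeTS []int64) (commitTS int64, restart bool) {
	commitTS = candidateTS
	for _, ts := range writeTS {
		if ts > commitTS {
			commitTS = ts
		}
	}
	restart = iso == SSI && commitTS > candidateTS
	return commitTS, restart
}

func main() {
	// One write was pushed to timestamp 15, past the candidate timestamp 10.
	fmt.Println(decideCommit(SI, 10, []int64{10, 15}))  // 15 false: commit proceeds
	fmt.Println(decideCommit(SSI, 10, []int64{10, 15})) // 15 true: transaction restarts
}
```
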
**Conflict Resolution**

Things get more interesting when a reader or writer encounters an intent
record or newly-committed value in a location that it needs to read or
write. This is a conflict, usually causing either of the transactions to
abort or restart depending on the type of conflict.

***Transaction restart:***

This is the usual (and more efficient) type of behaviour and is used
except when the transaction was aborted (for instance by another
transaction).
In effect, that reduces to two cases; the first being the one outlined
above: an SSI transaction that finds, upon attempting to commit, that
its commit timestamp has been pushed. The second case involves a transaction
actively encountering a conflict, that is, one of its readers or writers
encounters data that necessitates conflict resolution
(see transaction interactions below).

A restarted transaction keeps its txn id and, unlike an aborted one, can
reuse its intents. During its earlier attempt, however, it may
have written some write intents, which need to be deleted before the
transaction commits, so as to not be included as part of the transaction.
These stale write intent deletions are done during the reexecution of the
transaction, either implicitly, through writing new intents to
the same keys as part of the reexecution of the transaction, or explicitly,
by cleaning up stale intents that are not part of the reexecution of the
transaction. Since most transactions will end up writing to the same keys,
the explicit cleanup run just before committing the transaction is usually
a NOOP.

***Transaction abort:***

This is the case in which a transaction, upon reading its transaction
record, finds that it has been aborted. In this case, the transaction
can not reuse its intents; it returns control to the client before
cleaning them up (other readers and writers would clean up dangling
intents as they encounter them) but will make an effort to clean up
after itself. The next attempt (if applicable) then runs as a new
transaction with **a new txn id**.

***Transaction interactions:***

There are several scenarios in which transactions interact:

- **Reader encounters write intent or value with newer timestamp far
  enough in the future**: This is not a conflict. The reader is free
  to proceed; after all, it will be reading an older version of the
  value and so does not conflict. Recall that the write intent may
  be committed with a later timestamp than its candidate; it will
  never commit with an earlier one. **Side note**: if a SI transaction
  reader finds an intent with a newer timestamp which the reader’s own
  transaction has written, the reader always returns that intent's value.

- **Reader encounters write intent or value with newer timestamp in the
  near future:** In this case, we have to be careful. The newer
  intent may, in absolute terms, have happened in our read's past if
  the clock of the writer is ahead of the node serving the values.
  In that case, we would need to take this value into account, but
  we just don't know. Hence the transaction restarts, using instead
  a future timestamp (but remembering a maximum timestamp used to
  limit the uncertainty window to the maximum clock skew). In fact,
  this is optimized further; see the details under "choosing a time
  stamp" below.

- **Reader encounters write intent with older timestamp**: the reader
  must follow the intent's transaction id to the transaction record.
  If the transaction has already been committed, then the reader can
  just read the value. If the write transaction has not yet been
  committed, then the reader has two options. If the write conflict
  is from an SI transaction, the reader can *push that transaction's
  commit timestamp into the future* (and consequently not have to
  read it). This is simple to do: the reader just updates the
  transaction’s commit timestamp to indicate that when/if the
  transaction does commit, it should use a timestamp *at least* as
  high. However, if the write conflict is from an SSI transaction,
  the reader must compare priorities. If the reader has the higher priority,
  it pushes the transaction’s commit timestamp (that
  transaction will then notice its timestamp has been pushed when it
  tries to commit, and restart). If the reader has the lower or equal
  priority, it retries after a short, randomized backoff interval.

- **Writer encounters uncommitted write intent**:
  If the other write intent has been written by a transaction with a lower
  priority, the writer aborts the conflicting transaction. If the write
  intent has a higher or equal priority the transaction retries, using as a new
  priority *max(new random priority, conflicting txn’s priority - 1)*;
  the retry occurs after a short, randomized backoff interval.

- **Writer encounters newer committed value**:
  The committed value could also be an unresolved write intent made by a
  transaction that has already committed. The transaction restarts. On restart,
  the same priority is reused, but the candidate timestamp is moved forward
  to the encountered value's timestamp.

- **Writer encounters more recently read key**:
  The *read timestamp cache* is consulted on each write at a node. If the write's
  candidate timestamp is earlier than the low water mark on the cache itself
  (i.e. its last evicted timestamp) or if the key being written has a read
  timestamp later than the write’s candidate timestamp, this later timestamp
  value is returned with the write. A new timestamp forces a transaction
  restart only if it is serializable.

**Transaction management**

Transactions are managed by the client proxy (or gateway in SQL Azure
parlance). Unlike in Spanner, writes are not buffered but are sent
directly to all implicated ranges. This allows the transaction to abort
quickly if it encounters a write conflict. The client proxy keeps track
of all written keys in order to resolve write intents asynchronously upon
transaction completion. If a transaction commits successfully, all intents
are upgraded to committed. In the event a transaction is aborted, all written
intents are deleted. The client proxy doesn’t guarantee it will resolve intents.

In the event the client proxy restarts before the pending transaction is
committed, the dangling transaction would continue to "live" until
aborted by another transaction. Transactions periodically heartbeat
their transaction record to maintain liveness.
Transactions encountered by readers or writers with dangling intents
which haven’t been heartbeat within the required interval are aborted.

An exploration of retries with contention and abort times with abandoned
transactions is
[here](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing).

Please see [roachpb/data.proto](https://github.com/cockroachdb/cockroach/blob/master/roachpb/data.proto) for the up-to-date structures, the best entry point being `message Transaction`.

**Pros**

- No requirement for reliable code execution to prevent stalled 2PC
  protocol.
- Readers never block with SI semantics; with SSI semantics, they may
  abort.
- Lower latency than traditional 2PC commit protocol (w/o contention)
  because the second phase requires only a single write to the
  transaction record instead of a synchronous round to all
  transaction participants.
- Priorities avoid starvation for arbitrarily long transactions and
  always pick a winner from between contending transactions (no
  mutual aborts).
- Writes not buffered at client; writes fail fast.
- No read-locking overhead required for *serializable* SI (in contrast
  to other SSI implementations).
- Well-chosen (i.e. less random) priorities can flexibly give
  probabilistic guarantees on latency for arbitrary transactions
  (for example: make OLTP transactions 10x less likely to abort than
  low priority transactions, such as asynchronously scheduled jobs).

**Cons**

- Abandoned transactions may block contending writers for up to the
  heartbeat interval, though average wait is likely to be
  considerably shorter (see [graph in link](https://docs.google.com/document/d/1kBCu4sdGAnvLqpT-_2vaTbomNmX3_saayWEGYu1j7mQ/edit?usp=sharing)).
  This is likely considerably more performant than detecting and
  restarting 2PC in order to release read and write locks.
- Behavior different than other SI implementations: no first writer
  wins, and shorter transactions do not always finish quickly.
  Element of surprise for OLTP systems may be a problematic factor.
- Aborts can decrease throughput in a contended system compared with
  two-phase locking. Aborts and retries increase read and write
  traffic, increase latency and decrease throughput.

**Choosing a Timestamp**

A key challenge of reading data in a distributed system with clock skew
is choosing a timestamp guaranteed to be greater than the latest
timestamp of any committed transaction (in absolute time). No system can
claim consistency and fail to read already-committed data.

Accomplishing consistency for transactions (or just single operations)
accessing a single node is easy. The timestamp is assigned by the node
itself, so it is guaranteed to be at a greater timestamp than all the
existing timestamped data on the node.

For multiple nodes, the timestamp of the node coordinating the
transaction `t` is used. In addition, a maximum timestamp `t+ε` is
supplied to provide an upper bound on timestamps for already-committed
data (`ε` is the maximum clock skew). As the transaction progresses, any
data read which have timestamps greater than `t` but less than `t+ε`
cause the transaction to abort and retry with the conflicting timestamp
t<sub>c</sub> as the new read timestamp; the maximum timestamp `t+ε` remains
the same. This implies that transaction restarts due to clock uncertainty
can only happen on a time interval of length `ε`.

We apply another optimization to reduce the restarts caused
by uncertainty. Upon restart, the transaction not only takes
into account t<sub>c</sub>, but the timestamp of the node at the time
of the uncertain read t<sub>node</sub>. The larger of those two timestamps
t<sub>c</sub> and t<sub>node</sub> (likely equal to the latter) is used
to increase the read timestamp. Additionally, the conflicting node is
marked as “certain”. Then, for future reads to that node within the
transaction, we set `MaxTimestamp = Read Timestamp`, preventing further
uncertainty restarts.

Correctness follows from the fact that we know that at the time of the read,
there exists no version of any key on that node with a higher timestamp than
t<sub>node</sub>. Upon a restart caused by the node, if the transaction
encounters a key with a higher timestamp, it knows that in absolute time,
the value was written after t<sub>node</sub> was obtained, i.e. after the
uncertain read. Hence the transaction can move forward reading an older version
of the data (at the transaction's timestamp). This limits the time uncertainty
restarts attributed to a node to at most one. The tradeoff is that we might
pick a timestamp larger than the optimal one (> highest conflicting timestamp),
resulting in the possibility of a few more conflicts.

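A sketch of the uncertainty-interval check described above (illustrative types only; the real system tracks this per node and per transaction):

```go
package main

import "fmt"

// uncertainty tracks a transaction's read timestamp and the upper bound t+ε on
// timestamps of possibly-already-committed data.
type uncertainty struct {
	readTS int64 // t
	maxTS  int64 // t+ε, with ε the maximum clock skew
}

// check classifies a value observed at valueTS: values at or above maxTS are
// safely in the future and ignored; values in (readTS, maxTS) force an
// uncertainty restart at a higher read timestamp.
func (u uncertainty) check(valueTS int64) (restart bool, newReadTS int64) {
	if valueTS > u.readTS && valueTS < u.maxTS {
		return true, valueTS // retry with the conflicting timestamp as the new read timestamp
	}
	return false, u.readTS
}

func main() {
	u := uncertainty{readTS: 100, maxTS: 110} // ε = 10
	fmt.Println(u.check(105)) // true 105: inside the uncertainty window
	fmt.Println(u.check(120)) // false 100: definitely in the future, ignore
	fmt.Println(u.check(90))  // false 100: ordinary visible version
}
```
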
We expect retries will be rare, but this assumption may need to be
revisited if retries become problematic. Note that this problem does not
apply to historical reads. An alternate approach which does not require
retries makes a round to all node participants in advance and
chooses the highest reported node wall time as the timestamp. However,
knowing which nodes will be accessed in advance is difficult and
potentially limiting. Cockroach could also potentially use a global
clock (Google did this with [Percolator](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf)),
which would be feasible for smaller, geographically-proximate clusters.

# Strict Serializability (Linearizability)

Roughly speaking, the gap between <i>strict serializability</i> (which we use
interchangeably with <i>linearizability</i>) and CockroachDB's default
isolation level (<i>serializable</i>) is that with linearizable transactions,
causality is preserved. That is, if one transaction (say, creating a posting
for a user) waits for its predecessor (creating the user in the first place)
to complete, one would hope that the logical timestamp assigned to the former
is larger than that of the latter.
In practice, in distributed databases this may not hold, the reason typically
being that clocks across a distributed system are not perfectly synchronized
and the "later" transaction touches a part disjoint from that on which the
first transaction ran, resulting in clocks with disjoint information to decide
on the commit timestamps.

In practice, in CockroachDB many transactional workloads are actually
linearizable, though the precise conditions are too involved to outline them
here.

Causality is typically not required for many transactions, and so it is
advantageous to pay for it only when it *is* needed. CockroachDB implements
this via <i>causality tokens</i>: when committing a transaction, a causality
token can be retrieved and passed to the next transaction, ensuring that these
two transactions get assigned increasing logical timestamps.

Additionally, as better synchronized clocks become a standard commodity offered
by cloud providers, CockroachDB can provide global linearizability by doing
much the same as [Google's
Spanner](http://research.google.com/archive/spanner.html) does: wait out the
maximum clock offset after committing, but before returning to the client.

See the blog post below for much more in-depth information.

https://www.cockroachlabs.com/blog/living-without-atomic-clocks/

# Logical Map Content

Logically, the map contains a series of reserved system key/value
pairs preceding the actual user data (which is managed by the SQL
subsystem).

- `\x02<key1>`: Range metadata for range ending `\x03<key1>`. This is a "meta1" key.
- ...
- `\x02<keyN>`: Range metadata for range ending `\x03<keyN>`. This is a "meta1" key.
- `\x03<key1>`: Range metadata for range ending `<key1>`. This is a "meta2" key.
- ...
- `\x03<keyN>`: Range metadata for range ending `<keyN>`. This is a "meta2" key.
- `\x04{desc,node,range,store}-idegen`: ID generation oracles for various component types.
- `\x04status-node-<varint encoded Store ID>`: Store runtime metadata.
- `\x04tsd<key>`: Time-series data key.
- `<key>`: A user key. In practice, these keys are managed by the SQL
  subsystem, which employs its own key anatomy.

# Node Storage

Nodes maintain a separate instance of RocksDB for each store (physical
or virtual storage device). Each RocksDB instance hosts any number of
replicas. RPCs arriving at a node are routed based on the store ID to
the appropriate RocksDB instance. A single instance per store is used
to avoid contention. If every range maintained its own RocksDB, global
management of available cache memory would be impossible and writers
for each range would compete for non-contiguous writes to multiple
RocksDB logs.

In addition to the key/value pairs of the range itself, various range
metadata is maintained.

- participating replicas

- consensus metadata

- split/merge activity

A really good reference on tuning Linux installations with RocksDB is
[here](http://docs.basho.com/riak/latest/ops/advanced/backends/leveldb/).

# Range Metadata

The default approximate size of a range is 64M (2\^26 B). In order to
support 1P (2\^50 B) of logical data, metadata is needed for roughly
2\^(50 - 26) = 2\^24 ranges. A reasonable upper bound on range metadata
size is 256 bytes. 2\^24 ranges \* 2\^8
B would require roughly 4G (2\^32 B) to store--too much to duplicate
between machines. Our conclusion is that range metadata must be
distributed for large installations.

To keep key lookups relatively fast in the presence of distributed metadata,
we store all the top-level metadata in a single range (the first range). These
top-level metadata keys are known as *meta1* keys, and are prefixed such that
they sort to the beginning of the key space. Given the metadata size of 256
bytes given above, a single 64M range would support 64M/256B = 2\^18 ranges,
which gives a total storage of 64M \* 2\^18 = 16T. To support the 1P quoted
above, we need two levels of indirection, where the first level addresses the
second, and the second addresses user data. With two levels of indirection, we
can address 2\^(18 + 18) = 2\^36 ranges; each range addresses 2\^26 B, and
altogether we address 2\^(36+26) B = 2\^62 B = 4E of user data.

For a given user-addressable `key1`, the associated *meta1* record is found
at the successor key to `key1` in the *meta1* space. Since the *meta1* space
is sparse, the successor key is defined as the next key which is present. The
*meta1* record identifies the range containing the *meta2* record, which is
found using the same process. The *meta2* record identifies the range
containing `key1`, which is again found the same way (see examples below).

Concretely, metadata keys are prefixed by `\0\0meta{1,2}`; the two null
characters provide for the desired sorting behaviour. Thus, `key1`'s
*meta1* record will reside at the successor key to `\0\0meta1<key1>`.

Note: the metadata for a range is keyed by the range's end key rather than
its start key, because
the RocksDB iterator only supports a Seek() interface which acts as a
Ceil(). Using the start key of the range would cause Seek() to find the
key *after* the meta indexing record we’re looking for, which would
result in having to back the iterator up, an option which is both less
efficient and not available in all cases.

The following example shows the directory structure for a map with
three ranges worth of data. Ellipses indicate additional key/value pairs to
fill an entire range of data. Except for the fact that splitting ranges
requires updates to the range metadata with knowledge of the metadata layout,
the range metadata itself requires no special treatment or bootstrapping.

**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
`dcrama3:8000`)

- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2<lastkey0>`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2<lastkey1>`: `dcrama4:8000`, `dcrama5:8000`, `dcrama6:8000`
- `\0\0meta2\xff`: `dcrama7:8000`, `dcrama8:8000`, `dcrama9:8000`
- ...
- `<lastkey0>`: `<lastvalue0>`

**Range 1** (located on servers `dcrama4:8000`, `dcrama5:8000`,
`dcrama6:8000`)

- ...
- `<lastkey1>`: `<lastvalue1>`

**Range 2** (located on servers `dcrama7:8000`, `dcrama8:8000`,
`dcrama9:8000`)

- ...
- `<lastkey2>`: `<lastvalue2>`

Consider a simpler example of a map containing less than a single
range of data. In this case, all range metadata and all data are
located in the same range:

**Range 0** (located on servers `dcrama1:8000`, `dcrama2:8000`,
`dcrama3:8000`)

- `\0\0meta1\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `\0\0meta2\xff`: `dcrama1:8000`, `dcrama2:8000`, `dcrama3:8000`
- `<key0>`: `<value0>`
- `...`

Finally, a map large enough to need both levels of indirection would
look like (note that instead of showing range replicas, this
example is simplified to just show range indexes):

**Range 0**

- `\0\0meta1<lastkeyN-1>`: Range 0
- `\0\0meta1\xff`: Range 1
- `\0\0meta2<lastkey1>`: Range 1
- `\0\0meta2<lastkey2>`: Range 2
- `\0\0meta2<lastkey3>`: Range 3
- ...
- `\0\0meta2<lastkeyN-1>`: Range 262143

**Range 1**

- `\0\0meta2<lastkeyN>`: Range 262144
- `\0\0meta2<lastkeyN+1>`: Range 262145
- ...
- `\0\0meta2\xff`: Range 500,000
- ...
- `<lastkey1>`: `<lastvalue1>`

**Range 2**

- ...
- `<lastkey2>`: `<lastvalue2>`

**Range 3**

- ...
- `<lastkey3>`: `<lastvalue3>`

**Range 262144**

- ...
- `<lastkeyN>`: `<lastvalueN>`

**Range 262145**

- ...
- `<lastkeyN+1>`: `<lastvalueN+1>`

Note that the choice of range `262144` is just an approximation. The
actual number of ranges addressable via a single metadata range is
dependent on the size of the keys. If efforts are made to keep key sizes
small, the total number of addressable ranges would increase and vice
versa.

From the examples above it’s clear that key location lookups require at
most three reads to get the value for `<key>`:

1. lower bound of `\0\0meta1<key>`,
2. lower bound of `\0\0meta2<key>`,
3. `<key>`.

For small maps, the entire lookup is satisfied in a single RPC to Range 0. Maps
containing less than 16T of data would require two lookups. Clients cache both
levels of range metadata, and we expect that data locality for individual
clients will be high. Clients may end up with stale cache entries. If on a
lookup, the range consulted does not match the client’s expectations, the
client evicts the stale entries and possibly does a new lookup.

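A toy sketch of the two-level addressing lookup (in-memory sorted maps standing in for the meta ranges; the real client issues RPCs and caches both meta levels):

```go
package main

import (
	"fmt"
	"sort"
)

// metaIndex maps a range's end key to the "range" (here, just a name) that
// covers keys up to and including that end key, mirroring meta1/meta2 records.
type metaIndex map[string]string

// lookup finds the successor (lower-bound) entry for key in the index.
// It assumes key sorts at or below the maximum end key ("\xff" here).
func (m metaIndex) lookup(key string) string {
	ends := make([]string, 0, len(m))
	for k := range m {
		ends = append(ends, k)
	}
	sort.Strings(ends)
	i := sort.SearchStrings(ends, key) // first end key >= key
	return m[ends[i]]
}

func main() {
	meta1 := metaIndex{"\xff": "meta2-range"}             // top level, fits in the first range
	meta2 := metaIndex{"m": "range-A", "\xff": "range-B"} // second level: user ranges
	key := "customer/apple"

	// At most three reads: meta1, then meta2, then the user range itself.
	m2Range := meta1.lookup(key)          // which range holds the meta2 record
	userRange := meta2.lookup(key)        // which range holds the user key
	fmt.Println(m2Range, "->", userRange) // meta2-range -> range-A
}
```
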
# Raft - Consistency of Range Replicas

Each range is configured to consist of three or more replicas, as specified by
their ZoneConfig. The replicas in a range maintain their own instance of a
distributed consensus algorithm. We use the [*Raft consensus algorithm*](https://raftconsensus.github.io)
as it is simpler to reason about and comes with a reference implementation
covering important details.
[ePaxos](https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf) has
promising performance characteristics for WAN-distributed replicas, but
it does not guarantee a consistent ordering between replicas.

Raft elects a relatively long-lived leader which must be involved to
propose commands. It heartbeats followers periodically and keeps their logs
replicated. In the absence of heartbeats, followers become candidates
after randomized election timeouts and proceed to hold new leader
elections. Cockroach weights random timeouts such that the replicas with
shorter round trip times to peers are more likely to hold elections
first (not implemented yet). Only the Raft leader may propose commands;
followers will simply relay commands to the last known leader.

Our Raft implementation was developed together with CoreOS, but adds an extra
layer of optimization to account for the fact that a single Node may have
millions of consensus groups (one for each Range). Areas of optimization
are chiefly coalesced heartbeats (so that the number of nodes dictates the
number of heartbeats as opposed to the much larger number of ranges) and
batch processing of requests.
Future optimizations may include two-phase elections and quiescent ranges
(i.e. stopping traffic completely for inactive ranges).

# Range Leases

As outlined in the Raft section, the replicas of a Range are organized as a
Raft group and execute commands from their shared commit log. Going through
Raft is an expensive operation though, and there are tasks which should only be
carried out by a single replica at a time (as opposed to all of them).
In particular, it is desirable to serve authoritative reads from a single
Replica (ideally from more than one, but that is far more difficult).

For these reasons, Cockroach introduces the concept of **Range Leases**:
This is a lease held for a slice of (database, i.e. hybrid logical) time and is
established by committing a special log entry through Raft containing the
interval the lease is going to be active on, along with a node and replica ID
combination that uniquely describes the requesting replica. Reads and writes
must generally be addressed to the replica holding the lease; if none does, any
replica may be addressed, causing it to try to obtain the lease synchronously.
Requests received by a non-lease holder (for the HLC timestamp specified in
the request's header) fail with an error pointing at the replica's last known
lease holder. These requests are retried transparently with the updated lease by the
gateway node and never reach the client.

The replica holding the lease is in charge of or involved in handling
Range-specific maintenance tasks such as

* gossiping the sentinel and/or first range information
* splitting, merging and rebalancing

and, very importantly, may satisfy reads locally, without incurring the
overhead of going through Raft.

Since reads bypass Raft, a new lease holder will, among other things, ascertain
that its timestamp cache does not report timestamps smaller than the previous
lease holder's (so that it's compatible with reads which may have occurred on
the former lease holder). This is accomplished by letting leases enter
a <i>stasis period</i> (which is just the expiration minus the maximum clock
offset) before the actual expiration of the lease, so that all the next lease
holder has to do is set the low water mark of the timestamp cache to its
new lease's start time.

As a lease enters its stasis period, no more reads or writes are served, which
is undesirable. However, this would only happen in practice if a node became
unavailable. In almost all practical situations, no unavailability results
since leases are usually long-lived (and/or eagerly extended, which can avoid
the stasis period) or proactively transferred away from the lease holder, which
can also avoid the stasis period by promising not to serve any further reads
until the next lease goes into effect.

## Colocation with Raft leadership

The range lease is completely separate from Raft leadership, and so without
further efforts, Raft leadership and the Range lease might not be held by the
same Replica. Since it's expensive to not have these two roles colocated (the
lease holder has to forward each proposal to the leader, adding costly RPC
round-trips), each lease renewal or transfer also attempts to colocate them.
In practice, that means that the mismatch is rare and self-corrects quickly.

## Command Execution Flow

This subsection describes how a lease holder replica processes a
read/write command in more detail. Each command specifies (1) a key
(or a range of keys) that the command accesses and (2) the ID of a
range which the key(s) belongs to. When receiving a command, a node
looks up a range by the specified Range ID and checks if the range is
still responsible for the supplied keys. If any of the keys do not
belong to the range, the node returns an error so that the client will
retry and send a request to a correct range.

When all the keys belong to the range, the node attempts to
process the command. If the command is an inconsistent read-only
command, it is processed immediately. If the command is a consistent
read or a write, the command is executed when both of the following
conditions hold:

- The range replica has a range lease.
- There are no other running commands whose keys overlap with
  the submitted command and cause read/write conflict.

When the first condition is not met, the replica attempts to acquire
a lease or returns an error so that the client will redirect the
command to the current lease holder. When the second condition is not
met, the command waits until the overlapping commands are finished.

When the above two conditions are met, the lease holder replica processes the
command. Consistent reads are processed on the lease holder immediately.
Write commands are committed to the Raft consensus log so that every replica
will execute the same commands. All commands produce deterministic
results so that the range replicas keep consistent states among them.

When a write command completes, all the replicas update their response
cache to ensure idempotency. When a read command completes, the lease holder
replica updates its timestamp cache to keep track of the latest read
for a given key.

There is a chance that a range lease expires while a command is being
executed. Before executing a command, each replica checks if a replica
proposing the command still has the lease. When the lease has
expired, the command will be rejected by the replica.

# Splitting / Merging Ranges

Nodes split or merge ranges based on whether they exceed maximum or
minimum thresholds for capacity or load. Ranges exceeding maximums for
either capacity or load are split; ranges below minimums for *both*
capacity and load are merged.

Ranges maintain the same accounting statistics as accounting key
prefixes. These boil down to a time series of data points with minute
granularity, covering everything from number of bytes to read/write queue sizes.
Arbitrary distillations of the accounting stats can be determined as the
basis for splitting / merging. The most likely candidates to drive
split/merge are range size in bytes and IOps. A good metric for
rebalancing a replica from one node to another would be total read/write
queue wait times. These metrics are gossipped, with each range / node
passing along relevant metrics if they’re in the bottom or top of the
range it’s aware of.

A range finding itself exceeding either capacity or load threshold
splits. To this end, the range lease holder computes an appropriate split key
candidate and issues the split through Raft. In contrast to splitting,
merging requires a range to be below the minimum threshold for both
capacity *and* load. A range being merged chooses the smaller of the
ranges immediately preceding and succeeding it.

Splitting, merging, rebalancing and recovering all follow the same basic
algorithm for moving data between roach nodes. New target replicas are
created and added to the replica set of the source range. Then each new
replica is brought up to date by either replaying the log in full or
copying a snapshot of the source replica data and then replaying the log
from the timestamp of the snapshot to catch up fully. Once the new
replicas are fully up to date, the range metadata is updated and old,
source replica(s) deleted if applicable.

**Coordinator** (lease holder replica)

```
if splitting
  SplitRange(split_key): splits happen locally on range replicas and
  only after being completed locally, are moved to new target replicas.
else if merging
  Choose new replicas on same servers as target range replicas;
  add to replica set.
else if rebalancing || recovering
  Choose new replica(s) on least loaded servers; add to replica set.
```

**New Replica**

*Bring replica up to date:*

```
if all info can be read from replicated log
  copy replicated log
else
  snapshot source replica
  send successive ReadRange requests to source replica
  referencing snapshot

if merging
  combine ranges on all replicas
else if rebalancing || recovering
  remove old range replica(s)
```

Nodes split ranges when the total data in a range exceeds a
configurable maximum threshold. Similarly, ranges are merged when the
total data falls below a configurable minimum threshold.

**TBD: flesh this out**: Especially for merges (but also rebalancing) we have a
range disappearing from the local node; that range needs to disappear
gracefully, with a smooth handoff of operation to the new owner of its data.

Ranges are rebalanced if a node determines its load or capacity is one
of the worst in the cluster based on gossipped load stats. A node with
spare capacity is chosen in the same datacenter and a special-case split
is done which simply duplicates the data 1:1 and resets the range
configuration metadata.

# Node Allocation (via Gossip)

New nodes must be allocated when a range is split. Instead of requiring
every node to know about the status of all or even a large number
of peer nodes --or-- alternatively requiring a specialized curator or
master with sufficiently global knowledge, we use a gossip protocol to
efficiently communicate only interesting information between all of the
nodes in the cluster. What’s interesting information? One example would
be whether a particular node has a lot of spare capacity. Each node,
when gossiping, compares each topic of gossip to its own state. If its
own state is somehow “more interesting” than the least interesting item
in the topic it’s seen recently, it includes its own state as part of
the next gossip session with a peer node. In this way, a node with
capacity sufficiently in excess of the mean quickly becomes discovered
by the entire cluster. To avoid piling onto outliers, nodes from the
high capacity set are selected at random for allocation.

The gossip protocol itself contains two primary components:

- **Peer Selection**: each node maintains up to N peers with which it
  regularly communicates. It selects peers with an eye towards
  maximizing fanout. A peer node which itself communicates with an
  array of otherwise unknown nodes will be selected over one which
  communicates with a set containing significant overlap. Each time
  gossip is initiated, each node’s set of peers is exchanged. Each
  node is then free to incorporate the other’s peers as it sees fit.
  To avoid any node suffering from excess incoming requests, a node
  may refuse to answer a gossip exchange. Each node is biased
  towards answering requests from nodes without significant overlap
  and refusing requests otherwise.

  Peers are efficiently selected using a heuristic as described in
  [Agarwal & Trachtenberg (2006)](https://drive.google.com/file/d/0B9GCVTp_FHJISmFRTThkOEZSM1U/edit?usp=sharing).

  **TBD**: how to avoid partitions? Need to work out a simulation of
  the protocol to tune the behavior and see empirically how well it
  works.

- **Gossip Selection**: what to communicate. Gossip is divided into
  topics. Load characteristics (capacity per disk, cpu load, and
  state [e.g. draining, ok, failure]) are used to drive node
  allocation. Range statistics (range read/write load, missing
  replicas, unavailable ranges) and network topology (inter-rack
  bandwidth/latency, inter-datacenter bandwidth/latency, subnet
  outages) are used for determining when to split ranges, when to
  recover replicas vs. wait for network connectivity, and for
  debugging / sysops. In all cases, a set of minimums and a set of
  maximums is propagated; each node applies its own view of the
  world to augment the values. Each minimum and maximum value is
  tagged with the reporting node and other accompanying contextual
  information. Each topic of gossip has its own protobuf to hold the
  structured data. The number of items of gossip in each topic is
  limited by a configurable bound.

  For efficiency, nodes assign each new item of gossip a sequence
  number and keep track of the highest sequence number each peer
  node has seen. Each round of gossip communicates only the delta
  containing new items.

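A small sketch of the sequence-number delta mechanism from the last paragraph (illustrative types only):

```go
package main

import "fmt"

// gossipItem is one piece of gossip, tagged with a locally assigned sequence number.
type gossipItem struct {
	seq   int64
	topic string
	value string
}

// deltaFor returns only the items the peer has not yet seen, given the highest
// sequence number acknowledged by that peer.
func deltaFor(items []gossipItem, peerHighWater int64) []gossipItem {
	var out []gossipItem
	for _, it := range items {
		if it.seq > peerHighWater {
			out = append(out, it)
		}
	}
	return out
}

func main() {
	items := []gossipItem{
		{seq: 1, topic: "capacity", value: "node1: 80% free"},
		{seq: 2, topic: "capacity", value: "node2: 10% free"},
		{seq: 3, topic: "load", value: "node3: high read load"},
	}
	// The peer has already seen everything up to sequence number 2,
	// so only the seq-3 item is sent this round.
	fmt.Println(deltaFor(items, 2))
}
```
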
# Node and Cluster Metrics

Every component of the system is responsible for exporting interesting
metrics about itself. These could be histograms, throughput counters, or
gauges.

These metrics are exported for external monitoring systems (such as Prometheus)
via an HTTP endpoint, but CockroachDB also implements an internal timeseries
database which is stored in the replicated key-value map.

Time series are stored at Store granularity and allow the admin dashboard
to efficiently gain visibility into a universe of information at the Cluster,
Node or Store level. A [periodic background process](RFCS/time_series_culling.md)
culls older timeseries data, downsampling and eventually discarding it.

# Key-prefix Accounting and Zones

Arbitrarily fine-grained accounting is specified via
key prefixes. Key prefixes can overlap, as is necessary for capturing
hierarchical relationships. For illustrative purposes, let’s say keys
specifying rows in a set of databases have the following format:

`<db>:<table>:<primary-key>[:<secondary-key>]`

In this case, accounting might be kept for the following
key prefixes:

`db1`, `db1:user`, `db1:order`

Accounting is kept for the entire map by default.

## Accounting

To keep accounting for a range defined by a key prefix, an entry is created in
the accounting system table. The format of accounting table keys is:

`\0acct<key-prefix>`

Accounting is kept for key prefix ranges with eventual consistency for
efficiency. There are two types of values which comprise accounting:
counts and occurrences, for lack of better terms. Counts describe
system state, such as the total number of bytes, rows,
etc. Occurrences include transient performance and load metrics. Both
types of accounting are captured as time series with minute
granularity. The length of time accounting metrics are kept is
configurable. Below are examples of each type of accounting value.

**System State Counters/Performance**

- Count of items (e.g. rows)
- Total bytes
- Total key bytes
- Total value length
- Queued message count
- Queued message total bytes
- Count of values \< 16B
- Count of values \< 64B
- Count of values \< 256B
- Count of values \< 1K
- Count of values \< 4K
- Count of values \< 16K
- Count of values \< 64K
- Count of values \< 256K
- Count of values \< 1M
- Count of values \> 1M
- Total bytes of accounting

**Load Occurrences**

- Get op count
- Get total MB
- Put op count
- Put total MB
- Delete op count
- Delete total MB
- Delete range op count
- Delete range total MB
- Scan op count
- Scan op MB
- Split count
- Merge count

Because accounting information is kept as time series and over many
possible metrics of interest, the data can become numerous. Accounting
data are stored in the map near the key prefix described, in order to
distribute load (for both aggregation and storage).

Accounting keys for system state have the form:
`<key-prefix>|acctd<metric-name>*`. Notice the leading ‘pipe’
character. It’s meant to sort the root level account AFTER any other
system tables. They must increment the same underlying values as they
are permanent counts, and not transient activity. Logic at the
node takes care of snapshotting the value into an appropriately
suffixed (e.g. with timestamp hour) multi-value time series entry.

Keys for perf/load metrics:
`<key-prefix>acctd<metric-name><hourly-timestamp>`.

`<hourly-timestamp>`-suffixed accounting entries are multi-valued,
containing a varint64 entry for each minute with activity during the
specified hour.

To efficiently keep accounting over large key ranges, the task of
aggregation must be distributed. If activity occurs within the same
range as the key prefix for accounting, the updates are made as part
of the consensus write. If the ranges differ, then a message is sent
to the parent range to increment the accounting. If upon receiving the
message, the parent range also does not include the key prefix, it in
turn forwards it to its parent or left child in the balanced binary
tree which is maintained to describe the range hierarchy. This limits
the number of messages before an update is visible at the root to `2*log N`,
where `N` is the number of ranges in the key prefix.

## Zones

Zones are stored in the map with keys prefixed by
`\0zone` followed by the key prefix to which the zone
configuration applies. Zone values specify a protobuf containing
the datacenters from which replicas for ranges which fall under
the zone must be chosen.

Please see [config/config.proto](https://github.com/cockroachdb/cockroach/blob/master/config/config.proto) for up-to-date data structures used, the best entry point being `message ZoneConfig`.

If zones are modified in situ, each node verifies the
existing zones for its ranges against the zone configuration. If
it discovers differences, it reconfigures ranges in the same way
that it rebalances away from busy nodes, via a special-case 1:1
split to a duplicate range comprising the new configuration.

# SQL

Each node in a cluster can accept SQL client connections. CockroachDB
supports the PostgreSQL wire protocol, to enable reuse of native
PostgreSQL client drivers. Connections using SSL and authenticated
using client certificates are supported and even encouraged over
unencrypted (insecure) and password-based connections.

Each connection is associated with a SQL session which holds the
server-side state of the connection. Over the lifespan of a session
the client can send SQL to open/close transactions, issue statements
or queries or configure session parameters, much like with any other
SQL database.

## Language support

CockroachDB also attempts to emulate the flavor of SQL supported by
PostgreSQL, although it also diverges in significant ways:

- CockroachDB exclusively implements MVCC-based consistency for
  transactions, and thus only supports SQL's isolation levels SNAPSHOT
  and SERIALIZABLE. The other traditional SQL isolation levels are
  internally mapped to either SNAPSHOT or SERIALIZABLE.

- CockroachDB implements its own [SQL type system](RFCS/typing.md)
  which only supports a limited form of implicit coercions between
  types compared to PostgreSQL. The rationale is to keep the
  implementation simple and efficient, capitalizing on the observation
  that 1) most SQL code in clients is automatically generated with
  coherent typing already and 2) existing SQL code for other databases
  will need to be massaged for CockroachDB anyway.

## SQL architecture

Client connections over the network are handled in each node by a
pgwire server process (goroutine). This handles the stream of incoming
commands and sends back responses including query/statement results.
The pgwire server also handles pgwire-level prepared statements,
binding prepared statements to arguments and looking up prepared
statements for execution.

Meanwhile the state of a SQL connection is maintained by a Session
object and a monolithic `planner` object (one per connection) which
coordinates execution between the session, the current SQL transaction
state and the underlying KV store.

Upon receiving a query/statement (either directly or via an execute
command for a previously prepared statement) the pgwire server forwards
the SQL text to the `planner` associated with the connection. The SQL
code is then transformed into a SQL query plan.
The query plan is implemented as a tree of objects which describe the
high-level data operations needed to resolve the query, for example
"join", "index join", "scan", "group", etc.

The query plan objects currently also embed the run-time state needed
for the execution of the query plan. Once the SQL query plan is ready,
methods on these objects then carry the execution out in the fashion
of "generators" in other programming languages: each node *starts* its
children nodes and from that point forward each child node serves as a
*generator* for a stream of result rows, which the parent node can
consume and transform incrementally and present to its own parent node
also as a generator.

The top-level planner consumes the data produced by the top node of
the query plan and returns it to the client via pgwire.

## Data mapping between the SQL model and KV

Every SQL table has a primary key in CockroachDB. (If a table is created
without one, an implicit primary key is provided automatically.)
The table identifier, followed by the value of the primary key for
each row, are encoded as the *prefix* of a key in the underlying KV
store.

Each remaining column or *column family* in the table is then encoded
as a value in the underlying KV store, and the column/family identifier
is appended as *suffix* to the KV key.

For example:

- after table `customers` is created in a database `mydb` with a
  primary key column `name` and normal columns `address` and `URL`, the KV pairs
  to store the schema would be:

| Key                          | Values |
| ---------------------------- | ------ |
| `/system/databases/mydb/id`  | 51     |
| `/system/tables/customer/id` | 42     |
| `/system/desc/51/42/address` | 69     |
| `/system/desc/51/42/url`     | 66     |

(The numeric values on the right are chosen arbitrarily for the
example; the structure of the schema keys on the left is simplified
for the example and subject to change.) Each database/table/column
name is mapped to a spontaneously generated identifier, so as to
simplify renames.

Then for a single row in this table:

| Key               | Values                           |
| ----------------- | -------------------------------- |
| `/51/42/Apple/69` | `1 Infinite Loop, Cupertino, CA` |
| `/51/42/Apple/66` | `http://apple.com/`              |

Each key has the table prefix `/51/42` followed by the primary key
prefix `/Apple` followed by the column/family suffix (`/66`,
`/69`). The KV value is directly encoded from the SQL value.

Efficient storage for the keys is guaranteed by the underlying RocksDB engine
by means of prefix compression.

Finally, for SQL indexes, the KV key is formed using the SQL value of the
indexed columns, and the KV value is the KV key prefix of the rest of
the indexed row.

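A schematic sketch of the mapping shown in the example tables above (plain string formatting standing in for the real ordered binary key encoding):

```go
package main

import "fmt"

// rowKey builds the KV key for one column of one row: the table prefix
// (database and table IDs), the primary key value, and the column ID suffix.
func rowKey(dbID, tableID int, primaryKey string, columnID int) string {
	return fmt.Sprintf("/%d/%d/%s/%d", dbID, tableID, primaryKey, columnID)
}

func main() {
	// Mirrors the example: database "mydb" = 51, table "customers" = 42,
	// row with primary key "Apple", columns address = 69 and url = 66.
	fmt.Println(rowKey(51, 42, "Apple", 69), "=>", "1 Infinite Loop, Cupertino, CA")
	fmt.Println(rowKey(51, 42, "Apple", 66), "=>", "http://apple.com/")
}
```
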