.. _appendix-resiliency:
=================
Resiliency Issues
=================
CrateDB uses Elasticsearch for data distribution and replication. Most of the
resiliency issues `exist in the Elasticsearch layer
<https://www.elastic.co/guide/en/elasticsearch/resiliency/current/>`_ and can
be tested by `Jepsen <https://github.com/jepsen-io/jepsen/tree/master/crate>`_.
.. rubric:: Table of contents
.. contents::
:local:
Known issues
============
Retry of updates causes double execution
----------------------------------------
.. raw:: html
<table class="summary">
<tr>
<td>Status</td>
<td>Work ongoing (<a
href="https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html#_better_request_retry_mechanism_when_nodes_are_disconnected_status_ongoing"
>More info</a>)
</td>
</tr>
<tr>
<td>Severity</td>
<td>Moderate</td>
</tr>
<tr>
<td>Likelihood</td>
<td>Very rare</td>
</tr>
<tr>
<td>Cause</td>
<td>Network issues, unresponsive nodes</td>
</tr>
<tr>
<td>Workloads</td>
<td>Non-idempotent writes</td>
</tr>
</table>
.. rubric:: Scenario
A node with a primary shard receives an update, writes it to disk, but goes
offline before sending a confirmation back to the node coordinating the
update. When the node comes back online, it receives a retry of the update
and executes it again.
.. rubric:: Consequence
Incorrect data for non-idempotent writes.
For example:

- A double insert on a table without an explicit primary key would be
  executed twice and result in duplicate data.
- A double update would incorrectly increment the row version number twice.
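One way to guard against this, sketched below with a hypothetical table, is
to make writes idempotent by declaring an explicit primary key: a retried
``INSERT`` of the same row is then rejected with a duplicate-key error
instead of being stored twice.

.. code-block:: sql

    -- Without an explicit primary key, every INSERT creates a new row,
    -- so a retried INSERT silently duplicates the data.
    CREATE TABLE events_unsafe (payload TEXT);

    -- With an explicit primary key, retrying the same INSERT fails with
    -- a duplicate-key error instead of creating a second row.
    CREATE TABLE events (id TEXT PRIMARY KEY, payload TEXT);
    INSERT INTO events (id, payload) VALUES ('evt-0001', 'example payload');
    -- Running the same INSERT again raises an error instead of
    -- inserting a duplicate row.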
Fixed issues
============
.. _resiliency_cluster_partitions_cause_lost_cluster_updates:
Repeated cluster partitions can cause lost cluster updates
----------------------------------------------------------
.. raw:: html
<style>
table.summary td {
border: 1px solid black;
padding: 5px;
}
</style>
<table class="summary">
<tr>
<td>Status</td>
<td>Fixed in CrateDB v4.0 (<a
href="https://github.com/elastic/elasticsearch/issues/32006">#32006</a>, <a
href="https://github.com/elastic/elasticsearch/issues/32171">#32171</a>)
</td>
</tr>
<tr>
<td>Severity</td>
<td>Moderate</td>
</tr>
<tr>
<td>Likelihood</td>
<td>Very rare</td>
</tr>
<tr>
<td>Cause</td>
<td>Network issues, unresponsive nodes</td>
</tr>
<tr>
<td>Workloads</td>
<td>All</td>
</tr>
</table>
.. rubric:: Scenario
A cluster is partitioned and a new master is elected on the side that has
quorum. The cluster is repaired and simultaneously a change is made to the
cluster state. The cluster is partitioned again before the new master node has
a chance to publish the new cluster state and the partition the master lands on
does not have quorum.
.. rubric:: Consequence
The node steps down as master and the uncommunicated state changes are lost.
Cluster state is very important and contains information like shard location,
schemas, and so on. Lost cluster state updates can cause data loss, reset
settings, and problems with table structures.
.. rubric:: Partially fixed
This problem is mostly fixed by `#20384
<https://github.com/elastic/elasticsearch/pull/20384>`_ (CrateDB v2.0.x),
which uses committed cluster state updates during the master election process.
This does not fully solve the rare problem but considerably reduces the chance
of occurrence: if the second partition happens concurrently with a cluster
state update and blocks the cluster state commit message from reaching a
majority of nodes, the in-flight update may be lost. If the now-isolated
master can still acknowledge the cluster state update to the client, this
results in the loss of an acknowledged change.
.. _resiliency_ambiguous_row_versions:
Version number representing ambiguous row versions
--------------------------------------------------
.. raw:: html
<table class="summary">
<tr>
<td>Status</td>
<td>Fixed in CrateDB v4.0 (<a
href="https://github.com/elastic/elasticsearch/issues/19269">#19269</a>, <a
href="https://github.com/elastic/elasticsearch/issues/10708">#10708</a>)
</td>
</tr>
<tr>
<td>Severity</td>
<td>Significant</td>
</tr>
<tr>
<td>Likelihood</td>
<td>Very rare</td>
</tr>
<tr>
<td>Cause</td>
<td>Network issues, unresponsive nodes</td>
</tr>
<tr>
<td>Workloads</td>
<td>Versioned reads with replicated tables while writing.</td>
</tr>
</table>
.. rubric:: Scenario
A client is writing to a primary shard. The node holding the primary shard is
partitioned from the cluster. It usually takes between 30 and 60 seconds
(depending on `ping configuration
<https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html>`_)
before the master node notices the partition. During this time, the same row is
updated on both the primary shard (partitioned) and a replica shard (not
partitioned).
.. rubric:: Consequence
There are two different versions of the same row carrying the same version
number. When the primary shard rejoins the cluster and its data is
replicated, the update that was made on the replica shard is lost, but the
surviving row keeps the version number of the lost update. This breaks
`Optimistic Concurrency Control
<https://crate.io/docs/reference/sql/occ.html>`_.
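For illustration, this is roughly what an OCC round trip looks like, assuming
a hypothetical table ``users``: the client reads the row version, then makes
the update conditional on it, so a concurrent change makes the update match
zero rows instead of silently overwriting.

.. code-block:: sql

    -- Read the current version of the row.
    SELECT _version, name FROM users WHERE id = 1;

    -- Update only if the row still carries the version that was read; if
    -- a concurrent writer changed the row, zero rows are updated.
    UPDATE users SET name = 'Alice' WHERE id = 1 AND _version = 2;

The issue described above undermines exactly this guarantee: two divergent
copies of the same row can carry the same ``_version``, so the conditional
update may succeed against unexpected content.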
.. _resiliency_replicas_fall_out_of_sync:
Replicas can fall out of sync when a primary shard fails
--------------------------------------------------------
.. raw:: html
<table class="summary">
<tr>
<td>Status</td>
<td>Fixed in CrateDB v4.0 (<a
href="https://github.com/elastic/elasticsearch/issues/10708">#10708</a>)
</td>
</tr>
<tr>
<td>Severity</td>
<td>Modest</td>
</tr>
<tr>
<td>Likelihood</td>
<td>Rare</td>
</tr>
<tr>
<td>Cause</td>
<td>Primary fails and in-flight writes are only written to a subset of its replicas</td>
</tr>
<tr>
<td>Workloads</td>
<td>Writes on replicated table</td>
</tr>
</table>
.. rubric:: Scenario
When a primary shard fails, a replica shard will be promoted to be the primary
shard. If there is more than one replica shard, it is possible for the remaining
replicas to be out of sync with the new primary shard. This is caused by
operations that were in-flight when the primary shard failed and may not have
been processed on all replica shards. Before this fix, such discrepancies were
not repaired on primary promotion but only when replica shards were relocated
(e.g., from hot to cold nodes), which meant that the length of time replicas
could be out of sync with the primary shard was unbounded.
.. rubric:: Consequence
Stale data may be read from replicas.
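A rough diagnostic sketch, assuming a table named ``my_table``: shard-level
metadata in ``sys.shards`` can be used to compare the copies of each shard,
and copies whose document counts stay divergent for a prolonged period may be
out of sync.

.. code-block:: sql

    -- Compare the copies of each shard; "primary" = TRUE marks the
    -- primary copy, the remaining rows for the same id are its replicas.
    SELECT id, "primary", state, num_docs
    FROM sys.shards
    WHERE table_name = 'my_table'
    ORDER BY id, "primary" DESC;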
Loss of rows due to network partition
-------------------------------------
.. raw:: html
<table class="summary">
<tr>
<td>Status</td>
<td>Fixed in CrateDB v2.0.x (<a
href="https://github.com/elastic/elasticsearch/issues/7572">#7572</a>, <a
href="https://github.com/elastic/elasticsearch/issues/14252">#14252</a>)
</td>
</tr>
<tr>
<td>Severity</td>
<td>Significant</td>
</tr>
<tr>
<td>Likelihood</td>
<td>Very rare</td>
</tr>
<tr>
<td>Cause</td>
<td>Single node isolation</td>
</tr>
<tr>
<td>Workloads</td>
<td>Writes on replicated table</td>
</tr>
</table>
.. rubric:: Scenario
A node with a primary shard is partitioned from the cluster. The node continues
to accept writes until it notices the network partition. In the meantime,
another shard has been elected as the primary. Eventually, the partitioned node
rejoins the cluster.
.. rubric:: Consequence
Data that was written to the original primary shard on the partitioned node is
lost as data from the newly elected primary shard replaces it when it rejoins
the cluster.
The risk window depends on your `ping configuration
<https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html>`_.
The default configuration of a 30 second ping timeout with three retries
corresponds to a 90 second risk window. However, it is very rare for a node to
lose connectivity within the cluster but maintain connectivity with clients.
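A partial mitigation, sketched here for a hypothetical table ``my_table``, is
to require all shard copies to be active before writes proceed; this narrows,
but does not close, the window in which an acknowledged write exists on only
one side of a partition.

.. code-block:: sql

    -- Require all shard copies (primary plus replicas) to be active
    -- before a write operation proceeds; the default only requires the
    -- primary to be active.
    ALTER TABLE my_table SET ("write.wait_for_active_shards" = 'all');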
Dirty reads caused by bad primary handover
------------------------------------------
.. raw:: html
<table class="summary">
<tr>
<td>Status</td>
<td>Fixed in CrateDB v2.0.x (<a
href="https://github.com/elastic/elasticsearch/pull/15900">#15900</a>, <a
href="https://github.com/elastic/elasticsearch/issues/12573">#12573</a>)
</td>
</tr>
<tr>
<td>Severity</td>
<td>Moderate</td>
</tr>
<tr>
<td>Likelihood</td>
<td>Rare</td>
</tr>
<tr>
<td>Cause</td>
<td>Race condition</td>
</tr>
<tr>
<td>Workloads</td>
<td>Reads</td>
</tr>
</table>
.. rubric:: Scenario
During a primary handover, there is a small risk window when a shard can find
out it has been elected as the new primary before the old primary shard notices
that it is no longer the primary.
A primary handover can happen in the following scenarios:
- A shard is relocated and then elected as the new primary, as two separate but
sequential actions. Relocating a shard means creating a new shard and then
deleting the old shard.
- An existing replica shard gets promoted to primary because the primary shard
was partitioned from the cluster.
.. rubric:: Consequence
Writes that occur on the new primary during the risk window will not be
replicated to the old shard (which still believes it is the primary) so any
subsequent reads on the old shard may return incorrect data.
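A primary handover of the first kind can be observed while it happens. The
following sketch uses the ``sys.shards`` table:

.. code-block:: sql

    -- Shard copies that are currently being moved or rebuilt; a primary
    -- handover is in progress while copies are in these routing states.
    SELECT table_name, id, "primary", routing_state, relocating_node
    FROM sys.shards
    WHERE routing_state IN ('RELOCATING', 'INITIALIZING');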
Changes overwritten by old data, causing data loss
--------------------------------------------------
.. raw:: html
<table class="summary">
<tr>
<td>Status</td>
<td>Fixed in CrateDB v2.0.x (<a
href="https://github.com/elastic/elasticsearch/issues/14671">#14671</a>)
</td>
</tr>
<tr>
<td>Severity</td>
<td>Significant</td>
</tr>
<tr>
<td>Likelihood</td>
<td>Very rare</td>
</tr>
<tr>
<td>Cause</td>
<td>Network problems</td>
</tr>
<tr>
<td>Workloads</td>
<td>Writes</td>
</tr>
</table>
.. rubric:: Scenario
A node with a primary shard that contains newly written data is partitioned
from the cluster.
.. rubric:: Consequence
CrateDB prefers old data over no data, and so promotes a shard with stale
data to be the new primary. The data on the original primary shard is lost.
Even if the node with the original primary shard rejoins the cluster, CrateDB
has no way of distinguishing correct from incorrect data, so its newer data
is replaced with data from the new primary shard.
Make table creation resilient to closing and full cluster crashes
-----------------------------------------------------------------
.. raw:: html
<table class="summary">
<tr>
<td>Status</td>
<td>
Fixed by the following changes.
Table recovery: <a href="https://github.com/elastic/elasticsearch/issues/9126">#9126</a>,
reopening tables: <a href="https://github.com/elastic/elasticsearch/issues/14739">#14739</a>,
allocation IDs: <a href="https://github.com/elastic/elasticsearch/issues/15281">#15281</a>
</td>
</tr>
<tr>
<td>Severity</td>
<td>Modest</td>
</tr>
<tr>
<td>Likelihood</td>
<td>Very Rare</td>
</tr>
<tr>
<td>Cause</td>
<td>Either the cluster fails while recovering a table or
the table is closed during shard creation.
</td>
</tr>
<tr>
<td>Workloads</td>
<td>Table creation</td>
</tr>
</table>
.. rubric:: Scenario
Recovering a table requires a quorum of shard copies to be available to
:ref:`allocate <gloss-shard-allocation>` a primary. This means that a
primary cannot be assigned if the cluster dies before enough shards have been
allocated. The same happens if a table is closed before enough shard copies
were started, making it impossible to reopen the table. Allocation IDs solve
this issue by tracking allocated shard copies in the cluster. This makes it
possible to safely recover a table in the presence of a single shard copy.
Allocation IDs can also distinguish the situation where a table has been
created but none of its shards have been started. If such a table was
inadvertently closed before at least one shard could be started, a fresh shard
will be allocated upon reopening the table.
.. rubric:: Consequence
The primary shard of the table cannot be assigned or a closed table cannot be
re-opened.
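On CrateDB versions that provide the ``sys.allocations`` table, the reason a
primary cannot be assigned can be inspected through SQL, as in this minimal
sketch:

.. code-block:: sql

    -- List shards that could not be allocated, together with the
    -- explanation the cluster gives for leaving them unassigned.
    SELECT table_name, shard_id, current_state, explanation
    FROM sys.allocations
    WHERE current_state = 'UNASSIGNED';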
Unaware master accepts cluster updates
--------------------------------------
.. raw:: html
<table class="summary">
<tr>
<td>Status</td>
<td>Fixed in CrateDB v2.0.x (<a
href="https://github.com/elastic/elasticsearch/issues/13062">#13062</a>)
</td>
</tr>
<tr>
<td>Severity</td>
<td>Moderate</td>
</tr>
<tr>
<td>Likelihood</td>
<td>Very rare</td>
</tr>
<tr>
<td>Cause</td>
<td>Network problems</td>
</tr>
<tr>
<td>Workloads</td>
<td>DDL statements</td>
</tr>
</table>
.. rubric:: Scenario
If a master has lost quorum (i.e. the number of nodes it is in communication
with has fallen below the configured minimum) it should step down as master and
stop answering requests to perform cluster updates. There is a small risk
window between losing quorum and noticing that quorum has been lost, depending
on your `ping configuration
<https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-discovery-zen.html>`_.
.. rubric:: Consequence
If a cluster update request is made to the node between losing quorum and
noticing the loss of quorum, that request will be confirmed. However, those
updates will be lost because the node will not be able to perform a successful
cluster update.
Cluster state is very important and contains information like shard location,
schemas, and so on. Lost cluster state updates can cause data loss, reset
settings, and problems with table structures.
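Two related sketches, assuming a three-node cluster running a pre-4.0 version
with Zen discovery: the first shows how to ask a node which master it
currently sees; the second raises the election quorum so that an isolated
node cannot keep acting as master.

.. code-block:: sql

    -- During a partition, nodes on different sides may report different
    -- (or no) master nodes.
    SELECT name, master_node FROM sys.cluster;

    -- Require a quorum of two master-eligible nodes (for a three-node
    -- cluster) before a node may be elected or remain master.
    SET GLOBAL PERSISTENT "discovery.zen.minimum_master_nodes" = 2;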