<a href="https://colab.research.google.com/github/fbeilstein/dbms/blob/master/DB_lecture_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**<font color=red>Replication and Consistency</font>**

**Consistency** is needed to understand consensus and atomic commitment algorithms. **Consistency models** explain visibility semantics and behavior of the system in the presence of multiple copies of data.


**Fault tolerance** is a property of a system that can continue operating correctly in the presence of failures of its components. 

Making a system fault-tolerant is difficult. The primary goal is to remove a single point of failure from the system and make sure that we have redundancy in mission-critical components. Usually, redundancy is entirely transparent for the user.

In primary/replica databases failover can be done explicitly, by promoting a replica to become a new master. Other systems do not require explicit reconfiguration and ensure consistency by collecting responses from multiple participants during read and write queries.


**Data replication** is a way of introducing redundancy by maintaining multiple copies of data in the system. 
Important for: multidatacenter deployments, georeplication

Updating multiple copies of data atomically is a problem equivalent to consensus, so it might be quite costly to do this for every operation in the database. 
It is acceptable for data to only **look** consistent from the user’s perspective, which allows some degree of divergence between participants.


We care most about three events: write, replica update, and read. These operations trigger a sequence of events initiated by the client. In some cases, updating replicas can happen after the write has finished from the client's perspective, but this does not change the fact that the client has to be able to observe operations in a particular order.

**<font color=red>Availability</font>**


Intermittent failures should not impact availability: from the user’s perspective, the system as a whole has to continue operating as if nothing has happened.


To make the system highly available, we need to design it in a way that allows handling failures or unavailability of one or more participants gracefully. For that, we need to introduce redundancy and replication. However, as soon as we add redundancy, we face the problem of keeping several copies of data in sync and have to implement recovery mechanisms.

**<font color=red>Shared Memory</font>**


For a client, the distributed system storing the data acts as if it was a single-node system (transparency). 


A single unit of storage, accessible by read or write operations, is usually called a **register**. We can view **shared memory** in a distributed database as an array of such registers.

We identify every operation by its **invocation** and **completion** events. We define an operation as **failed** if the process that invoked it crashes before it completes. 

If **both** invocation and completion events for one operation happen before the other operation is **invoked**, we say that this operation **precedes** the other one, and these two operations are **sequential**. Otherwise, we say that they are **concurrent**.

$$
\begin{align}
    &\dots \dots \dots (i)\rule{2cm}{0.4pt}(c)\dots \dots \dots \dots \dots \dots (A)\\
    &\dots \dots \dots \dots \dots \dots \dots \dots \dots (i)\rule{1.3cm}{0.4pt}(c)  \dots (B)\\
    &\dots \dots \dots \dots \dots (i)\rule{2cm}{0.4pt}(c) \dots \dots \dots \dots (C)\\
    &\dots \dots \dots \dots (i)\rule{0.6cm}{0.4pt}(c) \dots \dots \dots \dots \dots \dots \dots (D)\\
\end{align}
$$

* (A) precedes (B)
* (A) concurrent with (C)
* (A) concurrent with (D)
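
The precedence rule above can be checked mechanically. Here is a minimal sketch, assuming each operation is modeled by two hypothetical timestamps, `invoked` and `completed`; the numbers are made up and only roughly mirror the diagram.

```python
from typing import NamedTuple

class Op(NamedTuple):
    name: str
    invoked: float    # time of the invocation event (illustrative units)
    completed: float  # time of the completion event

def precedes(a: Op, b: Op) -> bool:
    """a precedes b if a completes before b is invoked."""
    return a.completed < b.invoked

def concurrent(a: Op, b: Op) -> bool:
    """Two operations are concurrent if neither one precedes the other."""
    return not precedes(a, b) and not precedes(b, a)

# Made-up timestamps that roughly follow the diagram above
A = Op("A", 3, 7)
B = Op("B", 9, 12)
C = Op("C", 5, 10)
D = Op("D", 4, 5)

print(precedes(A, B), concurrent(A, C), concurrent(A, D))
```

Note that (D) lies entirely inside (A), and the two are still concurrent: overlap of any kind is enough.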


Multiple readers or writers can access the register simultaneously. Read and write operations on registers are not immediate and take some time. Concurrent read/write operations performed by different processes are not serial: depending on how registers behave when operations overlap, they might be ordered differently and may produce different results. Depending on how the register behaves in the presence of concurrent operations, we distinguish among three **types of registers**:

* **Safe** 
Despite the name, a read returns an essentially arbitrary value within the range of the register if a concurrent write operation is in progress. Values may appear to flicker.
* **Regular**
Read operation can return only the value written by the most recent completed write or the value written by the write operation that overlaps with the current read. In this case, the system has some notion of order, but write results are not visible to all the readers simultaneously (for example, this may happen in a replicated database, where the master accepts writes and replicates them to workers serving reads).
* **Atomic**
Atomic registers guarantee **linearizability**: every write operation has a single moment before which every read operation returns an old value and after which every read operation returns a new one. Atomicity is a fundamental property that simplifies reasoning about the system state.

**<font color=red>Ordering</font>**

When we see a sequence of events, we have some intuition about their execution
order. However, in a distributed system it’s not always that easy, because it’s hard to know when exactly something has happened and have this information available instantly across the cluster. Each participant may have its view of the state, so we have to look at every operation and define it in terms of its invocation and completion events and describe the operation bounds.

Read and write can overlap

Process 1: register_1 = 25 (write)

Process 2: read register_1;  read register_1

Consider different outcomes (write before reads, etc).
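
One way to enumerate the outcomes, as a sketch only. It assumes the register initially holds a hypothetical old value `0`, that both reads overlap with the write of `25`, and that the second read starts after the first one completes. The difference between a regular and an atomic register then shows up in a single pair.

```python
from itertools import product

OLD, NEW = 0, 25  # OLD is an assumed initial value; NEW is the value being written

def regular_outcomes():
    # A regular register lets every overlapping read return either value independently.
    return set(product([OLD, NEW], repeat=2))

def atomic_outcomes():
    # An atomic register never goes back: once NEW has been read, OLD cannot follow.
    return {(r1, r2) for (r1, r2) in regular_outcomes()
            if not (r1 == NEW and r2 == OLD)}

print(sorted(regular_outcomes()))
print(sorted(atomic_outcomes()))
```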


There’s no simple answer to what should happen if we have just one copy of data. In a replicated system, we have more combinations of possible states, and it can get even more complicated when we have multiple processes reading and writing the data.

potential difficulties:
* Operations may overlap.
* Effects of the nonoverlapping calls might not be visible immediately.


To reason about the operation order and have nonambiguous descriptions of possible outcomes, we have to define **consistency models**. 

Terminology between concurrent and distributed systems overlaps, but we can’t directly apply most of the concurrent algorithms, because of differences in communication patterns, performance, and reliability.

**<font color=red>Consistency Models</font>**


Since operations on shared memory registers are allowed to overlap, we should define clear semantics: what happens if multiple clients read or modify different copies of data simultaneously or within a short period. 


**Consistency models** provide different semantics and guarantees. You can think of a consistency model as a contract between the participants: what each replica has to do to satisfy the required semantics, and what users can expect when issuing read and write operations.


* consistency (state): acceptable invariants, allowable relationships between copies
* consistency (operation): constraint on the order of operations


Without a global clock, it is difficult to give distributed operations a precise and deterministic order. Synchronization is very time-consuming.


We’ll be able to limit the number of possible histories by either positioning dependent writes after one another or defining a point at which the new value is propagated.

**<font color=red>Strict Consistency</font>**

Strict consistency is the equivalent of complete replication transparency: any write by any process is instantly available for the subsequent reads by any process. It involves the concept of a global clock and, if there was a write(x, 1) at instant t1, any read(x) will return a newly written value 1 at any instant t2 > t1.

Unfortunately, this is just a theoretical model, and it’s impossible to implement.

**<font color=red>Linearizability</font>**

Linearizability is the strongest single-object, single-operation consistency model. Under this model, effects of the write become visible to all readers exactly once at some point in time between its start and end, and no client can observe state transitions or side effects of partial (i.e., unfinished, still in-flight) or incomplete (i.e., interrupted before completion) write operations.


Concurrent operations are represented as one of the possible sequential histories for which visibility properties hold. There is some indeterminism in linearizability, as there may exist more than one way in which the events can be ordered.


If two operations overlap, they may take effect in any order. All read operations that occur after write operation completion can observe the effects of this operation. As soon as a single read operation returns a particular value, all reads that come after it return the value at least as recent as the one it returns.


There is some flexibility in terms of the order in which concurrent events occur in a global history. Every **read** of the shared value should return the **latest** value written to this shared variable preceding this read, **or** the value of a write that **overlaps** with this read. 

Linearizable **write** access to a shared variable also implies **mutual exclusion**: between the two concurrent writes, only one can go first.

Even though operations are concurrent and have some overlap, their effects become visible in a way that makes them appear sequential. No operation happens instantaneously, but still appears to be atomic.


$$
\begin{align}
    &\dots \dots \dots (i)\overset{\text{write: }1\rightarrow x}{\rule{4cm}{0.4pt}}(c)\dots \dots \dots \dots \dots \dots \dots\dots \dots  (W_1)\\
    &\dots \dots \dots \dots \dots (i)\overset{\text{write: }2\rightarrow x}{\rule{6cm}{0.4pt}}(c) \dots \dots \dots \dots (W_2)\\
    &\dots \dots \dots \dots \dots \dots \dots (i)\overset{\text{read}}{\rule{0.6cm}{0.4pt}}(c) \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots (R_1)\\
    &\dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots (i)\overset{\text{read}}{\rule{0.6cm}{0.4pt}}(c) \dots \dots \dots \dots \dots (R_2)\\
    &\dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots  (i)\overset{\text{read}}{\rule{0.6cm}{0.4pt}}(c)  \dots (R_3)\\
\end{align}
$$

* Initial value $x = 0$
* $R_1$ can read $0,1,2$
* $R_2$ can read $1,2$
* $R_3$ can read $2$

**<font color=red>Linearization point</font>**


One of the most important traits of linearizability is visibility: once the operation is complete, everyone must see it, and the system can’t “travel back in time,” reverting it or making it invisible for some participants. 


This consistency model is best explained in terms of atomic (i.e., uninterruptible, indivisible) operations. Operations do not have to be atomic, but their effects have to become visible at some point in time, creating the illusion that they were instantaneous. This moment is called a **linearization point**.

$$
\begin{align}
    &\dots \dots \dots \dots \dots (i)\rule{3cm}{0.4pt}|\rule{3cm}{0.4pt}(c) \dots \dots \dots \dots (W)\\
    &\dots \dots \dots \dots \dots \dots \dots (i)\overset{\text{read}}{\rule{0.6cm}{0.4pt}}(c) \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots (R_1)\\
    &\dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots (i)\overset{\text{read}}{\rule{0.6cm}{0.4pt}}(c) \dots \dots \dots \dots \dots (R_2)\\
\end{align}
$$ 

* $W$ contains linearization point
* $R_1$ reads some old data
* $R_2$ reads data W or some more recent data

A visible value should remain stable until the next one becomes visible after
it, and the register should not alternate between the two recent states.

The linearization point serves as a cutoff, after which operation effects become visible. We can implement it by using locks to guard a critical section, atomic read/write, or read-modify-write primitives.

**Note:**
Most programming languages these days offer atomic primitives that allow atomic write and compare-and-swap (CAS) operations. 
* atomic write -> just write the value unconditionally
* atomic CAS -> move from one value to the next only when the previous value is unchanged (think about the A->B->A problem: unchanged value != no changes)
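
A toy illustration of the CAS retry loop. Python has no compare-and-swap on plain variables, so this sketch emulates one with a lock; the class `AtomicCell` and its methods are hypothetical names, not a standard library API.

```python
import threading

class AtomicCell:
    """Toy atomic cell; a lock emulates the hardware CAS instruction."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Set the value to `new` only if it still equals `expected`."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def increment(cell):
    # Classic CAS loop: re-read and retry if another thread won the race.
    while True:
        current = cell.load()
        if cell.compare_and_swap(current, current + 1):
            return current + 1

counter = AtomicCell(0)
threads = [threading.Thread(target=lambda: [increment(counter) for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.load())
```

No increment is lost, because a thread that read a stale value fails its CAS and retries.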

**<font color=red>Cost of linearizability</font>**

Many systems avoid implementing linearizability today. Even CPUs do not offer linearizability when accessing main memory by default. This has happened because synchronization instructions are expensive, slow, and involve cross-node CPU traffic and cache invalidations. However, it is possible to implement linearizability using low-level primitives.


In concurrent programming, you can use compare-and-swap operations to introduce
linearizability. Many algorithms work by preparing results and then using CAS for swapping pointers and publishing them. 


In distributed systems, linearizability requires coordination and ordering. It can be implemented using **consensus**: clients interact with a replicated store using messages, and the consensus module is responsible for ensuring that applied operations are consistent and identical across the cluster. Each write operation will appear instantaneously, exactly once at some point between its invocation and completion events.


A system in which all objects are linearizable is also linearizable. This is a very useful property, but even though operations on two independent objects are linearizable, operations that involve both objects have to rely on additional synchronization means.

**<font color=red>Reusable Infrastructure for Linearizability</font>**


Reusable Infrastructure for Linearizability (RIFL) is a mechanism for implementing linearizable [remote procedure calls (RPCs)](https://en.wikipedia.org/wiki/Remote_procedure_call). 

RIFL solves the problem of "exactly-once delivery": each operation should be performed exactly once, and client or server crashes should be handled.

Consider: client C1 writes value V1, but doesn’t receive an acknowledgment. Meanwhile, client C2 writes value V2. If C1 retries its operation and
successfully writes V1, the write of C2 would be lost. To avoid this, the system needs to prevent repeated execution of retried operations. 

**RIFL**

* To assign client IDs, RIFL uses leases: unique identifiers issued by a system-wide service.
* In RIFL, messages are uniquely identified with the client ID and a client-local monotonically increasing sequence number.
* Along with the actual data records, **completion objects** are stored in a durable storage. When the client retries the operation due to server crash, instead of reapplying it, RIFL checks for a completion object, indicating that the operation it’s associated with has already been executed, and returns its result.

Clients have to periodically renew their leases to signal their liveness. If the client fails to renew its lease, it is marked as crashed and all the data associated with its lease is garbage collected. Leases have a limited lifetime to make sure that operations that belong to the failed process won’t be retained in the log forever. 

If the failed client tries to continue operation using an expired lease, its results will not be committed and the client will have to start from scratch.

If the server crashes before it can acknowledge the write, the client may attempt to retry this operation without knowing that it has already been applied. 

**completion objects**

Creating a completion object should be atomic with the mutation of the data record it is associated with.

The completion object should exist until either the issuing client promises it won’t retry the operation associated with it, or until the server detects a client crash, in which case all completion objects associated with it can be safely removed.
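
The idea can be sketched with a toy in-memory server. This is not RIFL's actual API: the class `Server`, its methods, and the shape of the result are hypothetical, and real durability and atomicity are only indicated by a comment.

```python
class Server:
    """Toy single-node store that remembers results of executed RPCs."""
    def __init__(self):
        self.data = {}
        self.completions = {}  # (client_id, seq) -> result of the finished operation

    def write(self, client_id, seq, key, value):
        rpc_id = (client_id, seq)
        if rpc_id in self.completions:
            # Retry of an already executed RPC: return the saved result, do not reapply.
            return self.completions[rpc_id]
        # In a real system these two updates have to be made durable atomically.
        self.data[key] = value
        result = ("ok", key, value)
        self.completions[rpc_id] = result
        return result

    def expire_lease(self, client_id):
        # The client is considered crashed: garbage collect its completion objects.
        self.completions = {k: v for k, v in self.completions.items()
                            if k[0] != client_id}

server = Server()
server.write("C1", 1, "x", "V1")          # acknowledgment is lost on the way back
server.write("C2", 1, "x", "V2")
retry = server.write("C1", 1, "x", "V1")  # C1 retries the same (client ID, sequence number)
print(server.data["x"], retry)
```

The retry returns the saved result of the first execution and leaves the value written by C2 in place.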


The advantage of RIFL is that, by guaranteeing that the RPC cannot be executed more than once, an operation can be made linearizable by ensuring that its results are made visible atomically, and most of its implementation details are independent from the underlying storage system.

**<font color=red>Sequential Consistency</font>**


Achieving linearizability might be too expensive, but it is possible to relax the model, while still providing rather strong consistency guarantees. 

**Sequential consistency** allows ordering operations as if they were executed in some sequential order, while requiring operations of each individual process to be executed in the same order they were performed by the process. Order of execution between processes is undefined, as there’s no shared notion of time.


* All write operations propagating from the same process appear in the order they were submitted by this process. 

* Operations propagating from different sources may be ordered arbitrarily, but this order will be consistent from the readers’ perspective. This means operations can be ordered in different ways (depending on the arrival order, or even arbitrarily in case two writes arrive simultaneously), but **all** processes **observe** the operations in the same order.

* Processes can observe operations executed by other participants in the order consistent with their own history, but this view can be arbitrarily stale from the global perspective. Stale reads can be explained, for example, by replica divergence: even though writes propagate to different replicas in the same order, they can arrive there at different times.



$$
\begin{align}
    & \dots (i)\overset{\text{write: }1\rightarrow x}{\rule{2cm}{0.4pt}}(c)\dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots  (W_1)\\
    &\dots \dots \dots \dots \dots \dots (i)\overset{\text{write: }2\rightarrow x}{\rule{2cm}{0.4pt}}(c) \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots  (W_2)\\
    &\dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots (i)\overset{\text{read: 1}}{\rule{1cm}{0.4pt}}(c) \dots \dots (i)\overset{\text{read: 2}}{\rule{1cm}{0.4pt}}(c) \dots \dots \dots \dots \dots \dots \dots (R_1)\\
    &\dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots (i)\overset{\text{read: 1}}{\rule{1cm}{0.4pt}}(c) \dots \dots (i)\overset{\text{read: 2}}{\rule{1cm}{0.4pt}}(c) \dots (R_2)\\
\end{align}
$$


The main difference with **linearizability** is the absence of globally enforced time bounds. Under linearizability, an operation has to become effective within its wallclock time bounds. By the time the write $W_1$ operation completes, its results have to be applied, and every reader should be able to see the value at least as recent as one written by $W_1$. Similarly, after a read operation $R_1$ returns, any read operation that happens after it should return the value that $R_1$ has seen or a later value.


**Sequential consistency** relaxes this requirement: an operation’s results can become visible after its completion, as long as the order is consistent from the individual processes' perspective. Same-origin writes can’t “jump” over each other: their program order, relative to their own executing process, has to be preserved. The other restriction is that the order in which operations have appeared must be consistent for all
readers.

**For embedders**

Similar to linearizability, modern CPUs do not guarantee sequential consistency by default and, since the processor can reorder instructions, we should use memory barriers (also called fences) to make sure that writes become visible to concurrently running threads in order.

**<font color=red>Causal Consistency</font>**

Even though having a global operation order is often unnecessary, it might be necessary to establish order between **some** operations. Under the **causal consistency** model, all processes have to see causally related operations in the same order. Concurrent writes with no causal relationship can be observed in a different order by different processors.

**no causal consistency**
$$
\begin{align}
    & \dots (i)\overset{\text{write: }1\rightarrow x}{\rule{2cm}{0.4pt}}(c)\dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots  (W_1)\\
    &\dots \dots \dots \dots \dots \dots (i)\overset{\text{write: }2\rightarrow x}{\rule{2cm}{0.4pt}}(c) \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots  (W_2)\\
    &\dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots (i)\overset{\text{read: 1}}{\rule{1cm}{0.4pt}}(c) \dots \dots (i)\overset{\text{read: 2}}{\rule{1cm}{0.4pt}}(c) \dots \dots \dots \dots \dots \dots \dots (R_1)\\
    &\dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots (i)\overset{\text{read: 2}}{\rule{1cm}{0.4pt}}(c) \dots \dots (i)\overset{\text{read: 1}}{\rule{1cm}{0.4pt}}(c) \dots \dots \dots \dots \dots \dots \dots  (R_2)\\
\end{align}
$$

To avoid this problem, in addition to the written value we now have to specify a logical clock value and keep track of it. This establishes a causal order between operations. Even if the latter write propagates faster than the former one, it isn’t made visible until all of its dependencies arrive, and the event order is reconstructed from their logical timestamps. 


$$
\begin{align}
    & \dots (i)\overset{\text{write: }1\rightarrow x,t_1}{\rule{2cm}{0.4pt}}(c)\dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots  (W_1)\\
    &\dots \dots \dots \dots \dots \dots (i)\overset{\text{write: }2,t_2\rightarrow x}{\rule{2cm}{0.4pt}}(c) \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots  (W_2)\\
    &\dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots (i)\overset{\text{read: 1}}{\rule{1cm}{0.4pt}}(c) \dots \dots (i)\overset{\text{read: 2}}{\rule{1cm}{0.4pt}}(c) \dots \dots \dots \dots \dots \dots \dots (R_1)\\
    &\dots \dots \dots \dots \dots \dots \dots \dots \dots \dots \dots (i)\overset{\text{read: 1}}{\rule{1cm}{0.4pt}}(c) \dots \dots (i)\overset{\text{read: 2}}{\rule{1cm}{0.4pt}}(c) \dots \dots \dots \dots \dots \dots \dots  (R_2)\\
\end{align}
$$


**Example:** an online forum. You post something, someone sees your post and responds to it, and so on, forming a tree of responses. A causally consistent system guarantees that if you see a message, you have already seen the message it responds to.


In a **causally consistent** system, we get **session guarantees** for the application, ensuring the view of the database is consistent with its own actions, even if it executes read and write requests against different, potentially inconsistent, servers:
* monotonic reads
* monotonic writes
* read-your-writes
* writes-follow-reads

(more on this later)

Causal consistency can be implemented using logical clocks and sending context metadata with every message, summarizing which operations logically precede the current one. When the update is received from the server, it contains the latest version of the context. Any operation can be processed only if all
operations preceding it have already been applied. Messages for which contexts do not match are buffered on the server as it is too early to deliver them.
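
A minimal sketch of this buffering, under simplifying assumptions: every message carries an explicit set of IDs it depends on (a stand-in for the context metadata), and the `Replica` class is a hypothetical name used only for illustration.

```python
class Replica:
    """Applies a message only after all of its dependencies were applied."""
    def __init__(self):
        self.applied = []  # message IDs in the order they became visible
        self.buffer = []   # messages that arrived too early

    def receive(self, msg_id, deps):
        self.buffer.append((msg_id, frozenset(deps)))
        self._drain()

    def _drain(self):
        # Keep scanning the buffer until no buffered message can be delivered.
        progress = True
        while progress:
            progress = False
            for msg in list(self.buffer):
                msg_id, deps = msg
                if deps <= set(self.applied):
                    self.applied.append(msg_id)
                    self.buffer.remove(msg)
                    progress = True

replica = Replica()
replica.receive("reply", {"post"})  # arrives first, but depends on "post"
early = list(replica.applied)       # nothing is visible yet
replica.receive("post", set())      # the dependency arrives, both get delivered
print(early, replica.applied)
```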


**Examples.** The following systems implement causality through a library (implemented as a frontend server that users connect to) and track dependencies to ensure consistency:
* [Clusters of Order-Preserving Servers (COPS)](https://www.cs.cmu.edu/~dga/papers/cops-sosp2011.pdf) tracks dependencies through key versions 
* [Eiger](https://www.cs.cmu.edu/~dga/papers/eiger-nsdi2013.pdf) establishes operation dependencies

They detect and handle conflicts: in COPS, this is done by checking the key order and using application-specific functions, while Eiger implements the last-write-wins rule (same as Apache Cassandra).

**<font color=red>Vector clocks</font>**


Establishing **causal order** allows the system to reconstruct the sequence of events even if messages are delivered out of order, fill the gaps between the messages, and avoid publishing operation results in case some messages are still missing. 

Many databases, for example, [Dynamo](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) and [Riak](https://riak.com/posts/technical/why-vector-clocks-are-hard/), use vector clocks for establishing causal order.


A vector clock is a structure for 
* establishing a **partial order** between the events
* detecting and resolving divergence between the event chains. 

With vector clocks we can 
* simulate common time
* keep track of global state
* represent asynchronous events as synchronous ones 

Method:
* processes maintain vectors of logical clocks, with one clock per process

process | clock
---|---
P1 (this) | 3
P2 | 1
P3 | 12

* every clock starts at the initial value
* every process increments its own clock every time a new event occurs
* when receiving clock vectors from other processes, a process updates its local vector to the highest clock values per process from the received vectors

Values in clock vectors help us establish the causal relationship between operations.

![img](https://upload.wikimedia.org/wikipedia/commons/thumb/5/55/Vector_Clock.svg/640px-Vector_Clock.svg.png)
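
The method above fits in a few lines. This is a sketch, not a production implementation: the class and function names are hypothetical, and it assumes that receiving a message counts as a local event.

```python
class VectorClock:
    def __init__(self, owner, processes):
        self.owner = owner
        self.clock = {p: 0 for p in processes}  # every clock starts at 0

    def tick(self):
        """Local event: increment our own entry and return a snapshot."""
        self.clock[self.owner] += 1
        return dict(self.clock)

    def merge(self, received):
        """On receive: take the per-process maximum, then count the receive event."""
        for p, c in received.items():
            self.clock[p] = max(self.clock.get(p, 0), c)
        return self.tick()

def happened_before(a, b):
    """a -> b if a is less than or equal to b in every entry and they differ."""
    keys = set(a) | set(b)
    return a != b and all(a.get(k, 0) <= b.get(k, 0) for k in keys)

def concurrent(a, b):
    return a != b and not happened_before(a, b) and not happened_before(b, a)

p1 = VectorClock("P1", ["P1", "P2"])
p2 = VectorClock("P2", ["P1", "P2"])
e1 = p1.tick()     # event on P1
e2 = p2.tick()     # independent event on P2
e3 = p2.merge(e1)  # P2 receives a message carrying e1
print(e1, e2, e3)
```

Here `e1` happened before `e3`, while `e1` and `e2` are concurrent: neither vector dominates the other, which is exactly why the order is only partial.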

To implement causal consistency, we have to store causal history, add garbage collection, and ask the user to reconcile divergent histories in case of a conflict. Vector clocks can tell you that a conflict has occurred, but do not propose exactly how to resolve it, since resolution semantics are often application-specific. Because of that, some eventually consistent databases, for example, **Apache Cassandra**, do not order operations causally and use the last-write-wins rule for conflict resolution instead.

**<font color=red>Eventual Consistency</font>**

Synchronization is expensive, both in multiprocessor programming and in distributed systems. We can relax consistency guarantees and use models that allow some divergence between the nodes. 

**Sequential consistency** allows reads to be propagated at different speeds.
Under **eventual consistency**, updates propagate through the system asynchronously. Formally, it states that if there are no additional updates performed against the data item, eventually all accesses return the latest written value. In case of a conflict, the notion of latest value might change, as the values from diverged replicas are reconciled using a conflict resolution strategy, such as last-write-wins or using vector clocks.


*Eventually* is an interesting term to describe value propagation, since it specifies no hard time bound in which it has to happen. If the delivery service provides nothing more than an “eventually” guarantee, it does not sound like something we can rely upon. However, in practice, this works well, and many databases these days are described as eventually consistent (for example, Cassandra).

**<font color=red>Session Models</font>**


Thinking about consistency in terms of value propagation is useful for database
developers, since it helps to understand and impose required data invariants, but some things are easier understood and explained from the client point of view. 

**Session models** help to reason about the state of the distributed system from the client perspective: how each client observes the state of the system while issuing read and write operations.

* We still assume that each client’s operations are sequential. 
* Session models make no assumptions about operations made by **different** processes (clients) or from the different logical session BUT the same guarantees have to hold for **every** process in the system.


In a distributed system, clients often can connect to any available replica and, if the results of the recent write against one replica did not propagate to the other one, the client might not be able to observe the state change it has made.

**session models**
* **read-own-writes**: every read operation following a write, on **the same or any other** replica, has to observe the updated value.
* **monotonic reads**: if read(x) has observed the value V, the following reads have to observe a value at least as recent as V.
* **monotonic writes**: values originating from the same client appear to all other processes in the order this client has executed them.
* **writes-follow-reads** (sometimes referred to as session causality): writes are ordered after writes that were observed by previous read operations.
* **Pipelined RAM (PRAM) consistency**: monotonic reads + monotonic writes + read-own-writes
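
One possible way to enforce read-own-writes and monotonic reads on the client side, as a sketch. It assumes a single writer and integer versions assigned by the replica that accepts the write; `Replica` and `Session` are hypothetical names.

```python
class Replica:
    def __init__(self):
        self.version, self.value = 0, None

    def apply(self, version, value):
        # Replication: accept only newer versions.
        if version > self.version:
            self.version, self.value = version, value

class Session:
    """Remembers the newest version this client has written or read."""
    def __init__(self):
        self.seen = 0

    def write(self, replica, value):
        replica.apply(replica.version + 1, value)
        self.seen = max(self.seen, replica.version)

    def read(self, replica):
        if replica.version < self.seen:
            # Serving this read would violate the session guarantees.
            raise RuntimeError("replica is stale for this session; retry elsewhere")
        self.seen = replica.version
        return replica.value

r1, r2 = Replica(), Replica()
session = Session()
session.write(r1, "a")  # the write reaches r1 only
r2.apply(1, "a")        # replication catches up later
print(session.read(r2))
```

A read against `r2` issued before replication catches up would raise instead of silently returning the old state.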







PRAM guarantees that write operations originating from one process will propagate in the order they were executed by this process (BUT != sequential
consistency, because writes from different processes can be observed in different order). 

<font color="red">**Tunable Consistency**</font>

Most NoSQL systems are **eventually consistent**. Eventually consistent systems are sometimes described in CAP terms: you can trade availability for consistency or vice versa. From the server-side perspective, eventually consistent systems usually implement **tunable consistency**, where data is replicated, read, and written using three variables:
* **Replication Factor N** Number of nodes that will store a copy of data.
* **Write Consistency W** Number of nodes that have to acknowledge a write for it to succeed.
* **Read Consistency R** Number of nodes that have to respond to a read operation for it to succeed.


If we choose consistency levels such that R + W > N, the system can guarantee returning the most recently written value, because there’s always an overlap between the read and write sets (**pigeonhole principle**). 

N = 5 replicas (1,2,3,4,5), W = 3, R = 3; A and B are clients

A -write-> {1,2,3} (do not forget to write the **version** of the data!)

B <-read- {2,4,5} (at least one replica, here 2, will contain the recent data)
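
The overlap claim is easy to verify by brute force for small clusters. A sketch: `always_overlap` and `read_latest` are hypothetical helper names, and replies are modeled as `(version, value)` pairs.

```python
from itertools import combinations

def always_overlap(n, w, r):
    """True if every write set of size w intersects every read set of size r."""
    nodes = range(1, n + 1)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

def read_latest(responses):
    """Pick the value with the highest version among (version, value) replies."""
    return max(responses, key=lambda reply: reply[0])[1]

print(always_overlap(5, 3, 3))  # R + W = 6 > N = 5
print(always_overlap(5, 2, 3))  # R + W = 5, not greater than N
print(read_latest([(2, "new"), (1, "old"), (1, "old")]))
```

The version is what lets the reader pick the fresh value among the replies; the overlap alone only guarantees that the fresh value is among them.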


How do you tune the system for heavy-read and heavy-write?

<font color="red">**Quorums**</font>


A consistency level that consists of floor(N/2) + 1 nodes is called a **quorum**, a majority of nodes. 


In the case of a network partition or node failures, in a system with 2f + 1
nodes, live nodes can continue accepting writes or reads, if up to f nodes are unavailable, until the rest of the cluster is available again. In other words, such systems can tolerate at most f node failures.
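
The arithmetic, as a small sketch with hypothetical helper names:

```python
def quorum_size(n):
    """Majority of n nodes: floor(n / 2) + 1."""
    return n // 2 + 1

def tolerated_failures(n):
    """How many nodes may be down while a quorum is still reachable."""
    return n - quorum_size(n)

for n in (3, 4, 5, 7):
    print(n, quorum_size(n), tolerated_failures(n))
```

Note that going from 3 to 4 nodes does not let the system survive more failures, which is why clusters are usually sized as 2f + 1.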


Reading and writing using quorums does not guarantee monotonicity in cases of incomplete writes. If some write operation has failed after writing a value to one replica out of three, a quorum read can return either the result of the incomplete operation or the old value, depending on which replicas are contacted. Since subsequent same-value reads are not required to contact the same replicas, the values they return can alternate. To achieve read monotonicity (at the cost of availability), we have to use blocking read-repair.

<font color="red">**Witness Replicas**</font>


Storing many copies and sending data to many replicas may be costly. We can reduce storage costs by using a concept called [witness replicas](http://www2.cs.uh.edu/~paris/MYPAPERS/Icdcs86.pdf).

process | data version | state
---|---|---
P1 | 12 | OK
P2 | 15 | OK
P3 | 12 | FAIL
P4 | 15 | FAIL
P5 | 15 | FAIL

We have data, but no quorum.


Witness:
* stores only version of data to participate in quorum
* may be **temporarily** upgraded to store data


There are several implementations of this approach; for example, [Spanner](https://cloud.google.com/blog/topics/developers-practitioners/demystifying-cloud-spanner-multi-region-configurations) and Apache Cassandra.

<font color="red">**Strong Eventual Consistency and CRDTs**</font>



Strong eventual consistency is a middle ground between eventual consistency and linearizability. Under this model, updates are allowed to propagate to servers late or out of order, but when all updates finally propagate to target nodes, conflicts between them can be resolved and they can be merged to produce the same valid state.


One of the most prominent examples of such an approach is Conflict-Free Replicated Data Types (CRDTs), implemented, for example, in Redis. CRDTs are data structures designed so that operations on them can be applied in any order without changing the result.
This makes CRDTs useful in eventually consistent systems, since replica states in such systems are allowed to temporarily diverge. Replicas can execute operations locally, without prior synchronization with other nodes, and operations eventually propagate to all other replicas, potentially out of order. CRDTs allow us to reconstruct the complete system state from local individual states or operation sequences.


The simplest example of CRDTs is operation-based Commutative Replicated Data Types (CmRDTs). CmRDTs require the allowed operations to be:
* **Side-effect free** Their application does not change the system state.
* **Commutative** Argument order does not matter: x • y = y • x. In other words, it doesn’t matter whether x is merged with y, or y is merged with x.
* **Causally ordered** Their successful delivery depends on the precondition, which ensures that the system has reached the state the operation can be applied to.

For example, we could implement a grow-only counter. Each server can hold a state vector consisting of last known counter updates from all other participants, initialized with zeros. Each server is only allowed to modify its own value in the vector. When updates are propagated, the function merge(state1, state2) merges the states from the two servers.
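
A minimal sketch of such a counter. It assumes `merge` takes the element-wise maximum of the two vectors (the usual choice for grow-only counters); the class name `GCounter` is hypothetical.

```python
class GCounter:
    def __init__(self, node_id, nodes):
        self.node_id = node_id
        self.state = {n: 0 for n in nodes}  # one slot per participant, starting at zero

    def increment(self, amount=1):
        # A server is only allowed to modify its own slot.
        self.state[self.node_id] += amount

    def value(self):
        return sum(self.state.values())

    def merge(self, other):
        # Element-wise maximum: commutative, associative, and idempotent.
        for n, c in other.state.items():
            self.state[n] = max(self.state.get(n, 0), c)

a = GCounter("A", ["A", "B"])
b = GCounter("B", ["A", "B"])
a.increment()
a.increment()
b.increment()
a.merge(b)
b.merge(a)
print(a.value(), b.value())
```

Merging the same state twice changes nothing, so duplicated or reordered update messages are harmless.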


It is possible to produce a Positive-Negative-Counter (PN-Counter) that supports both increments and decrements by using payloads consisting of two vectors: P, which nodes use for increments, and N, where they store decrements. 
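
A sketch of the PN-Counter under the same assumption (element-wise maximum on merge); the class name and method names are hypothetical.

```python
class PNCounter:
    def __init__(self, node_id, nodes):
        self.node_id = node_id
        self.p = {n: 0 for n in nodes}  # increments per node
        self.n = {n: 0 for n in nodes}  # decrements per node

    def increment(self, amount=1):
        self.p[self.node_id] += amount

    def decrement(self, amount=1):
        # A decrement is recorded as growth of the N vector, so both vectors only grow.
        self.n[self.node_id] += amount

    def value(self):
        return sum(self.p.values()) - sum(self.n.values())

    def merge(self, other):
        for k, c in other.p.items():
            self.p[k] = max(self.p.get(k, 0), c)
        for k, c in other.n.items():
            self.n[k] = max(self.n.get(k, 0), c)

x = PNCounter("A", ["A", "B"])
y = PNCounter("B", ["A", "B"])
x.increment(5)
y.decrement(2)
x.merge(y)
y.merge(x)
print(x.value(), y.value())
```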


To save and replicate values, we can use **registers**. The simplest version of the register is the last-write-wins register (LWW register), which stores a unique, globally ordered timestamp attached to each value to resolve conflicts. In case of a conflicting write, we preserve only the one with the larger timestamp. 
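
A sketch of an LWW register. It assumes timestamps are `(counter, node_id)` tuples, which makes them unique and totally ordered; this encoding and the class name are illustrative choices, not something the model prescribes.

```python
class LWWRegister:
    def __init__(self):
        self.timestamp = None
        self.value = None

    def write(self, timestamp, value):
        # Keep only the write with the larger timestamp.
        if self.timestamp is None or timestamp > self.timestamp:
            self.timestamp, self.value = timestamp, value

    def merge(self, other):
        if other.timestamp is not None:
            self.write(other.timestamp, other.value)

left, right = LWWRegister(), LWWRegister()
left.write((1, "A"), "x")
right.write((2, "B"), "y")
left.merge(right)
right.merge(left)
print(left.value, right.value)
```

Both replicas converge to the same value, but the write with the smaller timestamp is silently discarded.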


Another example of CRDTs is an unordered grow-only set (G-Set). Each node maintains its local state and can append elements to it. Adding elements produces a valid set. Merging two sets is also a commutative operation. 


Similar to counters, we can use two sets to support both additions and removals. In this case, we have to preserve an invariant: only the values contained in the addition set can be added into the removal set. To reconstruct the current state of the set, all elements contained in the removal set are subtracted from the addition set.
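
A sketch of such a two-set construction (often called a two-phase set); the class name is hypothetical.

```python
class TwoPhaseSet:
    def __init__(self):
        self.added = set()
        self.removed = set()

    def add(self, item):
        self.added.add(item)

    def remove(self, item):
        # Invariant: only values present in the addition set may be removed.
        if item in self.added:
            self.removed.add(item)

    def elements(self):
        # Current state: addition set minus removal set.
        return self.added - self.removed

    def merge(self, other):
        # Both underlying sets are grow-only, so merging is a plain union.
        self.added |= other.added
        self.removed |= other.removed

s1, s2 = TwoPhaseSet(), TwoPhaseSet()
s1.add("a")
s1.add("b")
s2.add("c")
s1.remove("a")
s1.merge(s2)
s2.merge(s1)
print(sorted(s1.elements()), sorted(s2.elements()))
```

In this construction a removed element cannot come back: adding it again leaves it in both sets, and the subtraction still hides it.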


There are quite a few possibilities CRDTs provide us with, and we can see more data stores using this concept to provide Strong Eventual Consistency (SEC). This is a powerful concept that we can add to our arsenal of tools for building fault-tolerant distributed systems.