## Key Value Store
A key-value store is a non-relational database that supports two operations, `get(key)` and `put(key, value)`, to retrieve and store key-value pairs. Examples include Redis, Memcached and Amazon Dynamo.

The key needs to be hashable and the value can be any object. Some design considerations:
- should be scalable
- choice between consistency and availability (tunable consistency)
- low latency

## Single Server Store
Implementing a single-server store is relatively simple: use an in-memory hash table. However, memory is limited and will eventually fill up. One way to mitigate this is to keep an LRU cache in memory and evict the least recently used key-value pairs to disk. Additionally, we can compress values.

If a key-value pair is not in memory, the `get` operation becomes slow since it has to fetch the data from disk. To support large amounts of data, we need to shift to a distributed approach.
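
The LRU idea above can be sketched as follows. This is a minimal illustration, not a production design: the class name `SingleServerStore` and the `disk` dictionary (a stand-in for a real on-disk structure) are assumptions made for the example.

```python
from collections import OrderedDict


class SingleServerStore:
    """In-memory key-value store that spills least recently used pairs to 'disk'."""

    def __init__(self, capacity):
        self.capacity = capacity      # max number of pairs kept in memory
        self.memory = OrderedDict()   # ordered from least to most recently used
        self.disk = {}                # assumption: stand-in for an on-disk structure

    def put(self, key, value):
        self.disk.pop(key, None)      # drop any older spilled copy of this key
        self.memory[key] = value
        self.memory.move_to_end(key)  # mark as most recently used
        self._evict_if_needed()

    def get(self, key):
        if key in self.memory:        # fast path: served from memory
            self.memory.move_to_end(key)
            return self.memory[key]
        if key in self.disk:          # slow path: promote the pair back to memory
            value = self.disk.pop(key)
            self.put(key, value)
            return value
        return None

    def _evict_if_needed(self):
        # Move the least recently used pairs to disk until we fit in memory again.
        while len(self.memory) > self.capacity:
            old_key, old_value = self.memory.popitem(last=False)
            self.disk[old_key] = old_value
```

With `capacity=2`, storing three keys pushes the oldest one to `disk`, and a later `get` on that key takes the slow path and brings it back into memory.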

## Distributed Key-Value Store
Key-value pairs are distributed over multiple machines. In a distributed system the CAP theorem applies: when a network partition occurs, we can guarantee either consistency or availability, but not both. As an example, consider a distributed system consisting of 3 nodes $n_1$, $n_2$ and $n_3$, where $n_3$ goes down.  
**Consistency:** if we choose consistency, we block reads and writes on $n_1$ and $n_2$ until $n_3$ is back online. In the meantime, the service returns an error.  
**Availability:** if we choose availability, the system keeps accepting reads, even though it might return stale data. $n_1$ and $n_2$ also keep accepting writes, and the data is synced to $n_3$ once the network partition is resolved.

### Partitioning
One server can't possibly store all the data, so we split the data into smaller partitions and each node stores a fraction of the total. Consistent hashing allows us to:
- spread data evenly across multiple nodes
- minimise data movement when nodes are added/removed.

<img src="images/key_value_consistent_hash.png" />

Any node can receive a request from the client. The node receiving the request is referred to as the coordinator. The coordinator hashes the key and then forwards the request to the node that owns it.

The number of virtual nodes for a server is proportional to the server's capacity: servers with higher capacity are assigned more virtual nodes.
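
A small sketch of a hash ring with virtual nodes may make the lookup concrete. The names `HashRing`, `ring_hash` and the `vnodes` parameter are hypothetical, and MD5 is an arbitrary choice of hash function for the example; a coordinator would call something like `lookup` to decide where to forward a request.

```python
import bisect
import hashlib


def ring_hash(text):
    """Map a string to a position on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self):
        self.positions = []   # sorted positions of all virtual nodes
        self.owner = {}       # position -> server name

    def add_server(self, server, vnodes=100):
        # A higher-capacity server can be given a larger vnodes value.
        for i in range(vnodes):
            pos = ring_hash(f"{server}#{i}")
            if pos in self.owner:   # extremely unlikely collision: skip it
                continue
            bisect.insort(self.positions, pos)
            self.owner[pos] = server

    def remove_server(self, server):
        self.positions = [p for p in self.positions if self.owner[p] != server]
        self.owner = {p: s for p, s in self.owner.items() if s != server}

    def lookup(self, key):
        """Walk clockwise from the key's position to the first virtual node."""
        if not self.positions:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self.positions, ring_hash(key))
        return self.owner[self.positions[idx % len(self.positions)]]
```

Adding a server only reassigns the keys that now fall just before one of its virtual nodes; every other key keeps its previous owner, which is the "minimal data movement" property mentioned above.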

### Replication
To achieve high availability and reliability, data is replicated asynchronously over $N$ servers, where $N$ is a configurable parameter called the *replication factor*. For example, if the replication factor is set to 3, here is how a `put` operation works:  

<img src="images/key_value_replication.png"/>

For better reliability, replicas are placed in distinct data centers, and data centers are connected through high-speed networks.

### Consistency
Let there be a distributed system with replication factor $N$. For a write operation to be successful, it must be acknowledged by $W$ number of replicas. Similarly, for a read operation to be successful, it must be acknowledged by $R$ number of replicas.  

As an example, setting $W=1$ means we get fast writes since acknowledgement from just one node is enough. The system also has high availability (at the cost of consistency). Configuration of $W$, $R$ and $N$ is a typical tradeoff between latency and consistency. Some configurations:
- $R=1$ and $W=N$: the system is optimized for fast reads.
- $W=1$ and $R=N$: the system is optimized for fast writes.
- $W+R>N$: strong consistency is guaranteed, since every read quorum overlaps every write quorum (usually $N=3, W=R=2$).
- $W+R\le N$: strong consistency is not guaranteed.
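
The overlap argument behind these configurations can be illustrated with a toy model. Everything here is an assumption made for the example: replicas are plain objects, a write reaches only the first `w` replicas, and a read deliberately asks the last `r` replicas, which is the worst case for overlap.

```python
def quorums_overlap(n, w, r):
    """True when every read quorum shares at least one replica with every write quorum."""
    return w + r > n


class Replica:
    def __init__(self):
        self.version = 0    # version 0 means "nothing written yet"
        self.value = None


def quorum_write(replicas, w, value, version):
    """Update only the first w replicas, standing in for the w acknowledgements."""
    for replica in replicas[:w]:
        replica.version, replica.value = version, value


def quorum_read(replicas, r):
    """Ask the last r replicas (r >= 1) and return the value with the newest version."""
    contacted = replicas[-r:]
    newest = max(contacted, key=lambda rep: rep.version)
    return newest.value
```

With $N=3, W=R=2$ the read always touches at least one replica that saw the write, so it returns the new value. With $W=R=1$ the read can land on a replica that has not been updated yet and return stale data.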

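**Strong consistency:** every read returns the most recent successful write. With the quorum scheme above this requires $W+R>N$; $W=N$ is the strictest setting, since every replica must acknowledge a write. Additionally, concurrent writes to the same key should not be allowed.  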
**Weak consistency:** subsequent read operations may not see the most updated value.

<img src="images/key_value_inconsistent.png" />

**Eventual consistency:** this is a specific form of weak consistency. Given enough time, all updates are propagated, and all replicas are consistent.

### Inconsistency Resolution
Suppose there are three nodes $A$, $B$ and $C$ with $W = 1, R = 1$. When $A$ receives a write request, it writes to its local storage and immediately responds with success. It also asks $B$ and $C$ to update their copies. However, if $B$ receives a read request before it has applied the update, it responds with stale data. If $R = 2$, $B$ would also consult either $A$ or $C$, which may already hold the updated value. Now there are two values for the same key.

One way to resolve the inconsistency is to store a timestamp with every write operation. On a single machine, ordering events is easy: record a timestamp for each event. The same physical clock is used for all timestamps, providing a global view of time.

When we move to a distributed system, we cannot rely on this property anymore, because we cannot guarantee that every machine's clock shows exactly the same time. Therefore, relying on the latest timestamp to resolve inconsistencies (last write wins) is not the best approach. Let's try logical clocks.

**Lamport Clock:** we associate a Lamport clock with each key. The Lamport clock of an event $A$ is a counter value $LC(A) = n$. If $A\to B$ ($A$ happens before $B$) then $LC(A) \lt LC(B)$. The diagram below shows an example:  
<img src="images/lamport_clock_1.png" width=350 height=auto />

Now consider the diagram below, with events marked $a$ and $b$. We cannot say $a\to b$ since there is no path from $a$ to $b$, yet their Lamport clock values satisfy $LC(a) \lt LC(b)$. We conclude that $LC(A) \lt LC(B)$ does not imply $A \to B$. Thus Lamport clocks are not very helpful for resolving inconsistencies.  
<img src="images/lamport_clock_2.png" width=500 height=auto />
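
The limitation can be reproduced with a few lines of code. This is a sketch of the usual Lamport clock rules (increment on a local event or send, take the maximum plus one on receive); the class and method names are chosen for the example.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event or message send: advance the counter."""
        self.time += 1
        return self.time

    def receive(self, sender_time):
        """Message receive: move past both our own time and the sender's."""
        self.time = max(self.time, sender_time) + 1
        return self.time


# Two processes that never exchange a message before events a and b.
p1, p2 = LamportClock(), LamportClock()
a = p1.tick()   # event a on p1
p2.tick()
b = p2.tick()   # event b on p2: LC(a) < LC(b), yet a and b are unrelated

# A real happens-before relation: p1 sends, p2 receives.
sent = p1.tick()
received = p2.receive(sent)
```

Here `a < b` holds even though neither event could have influenced the other, while `sent < received` reflects a genuine causal link. The counter alone cannot tell the two situations apart.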

**Vector Clock:** vector clocks, on the other hand, have the following property: $A\to B \iff VC(A)\lt VC(B)$. The diagram below shows vector clocks associated with different events:  
<img src="images/vector_clock.png" width=700 height=auto />

If the vector clocks from two different nodes can be compared (blue or pink scenario), one value descends from the other and the inconsistency can be resolved. However, if the events are concurrent, both values are sent to the client and the client has to make the decision (possibly by using timestamps). After the conflict is resolved, the clocks are merged and the new value is written back. The merged clock is a descendant of both of the previous parallel clocks.
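
A minimal sketch of comparing and merging vector clocks, assuming each clock is a dictionary from node name to counter and that a missing entry counts as zero. The function names are hypothetical.

```python
def vc_increment(clock, node):
    """Return a copy of clock with node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def vc_descends(a, b):
    """True if a has seen everything b has (a >= b in every component)."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def vc_compare(a, b):
    """Classify the relation between the events stamped a and b."""
    a_ge_b, b_ge_a = vc_descends(a, b), vc_descends(b, a)
    if a_ge_b and b_ge_a:
        return "equal"
    if b_ge_a:
        return "before"      # a happened before b
    if a_ge_b:
        return "after"       # b happened before a
    return "concurrent"      # neither descends from the other: a conflict


def vc_merge(a, b):
    """Component-wise maximum; the result descends from both inputs."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}
```

For example, starting from `{"A": 1}`, an update handled by `B` and another handled by `C` produce clocks that compare as `"concurrent"`. After the client picks a value, merging the two clocks and incrementing the coordinator's entry yields a clock that both conflicting versions compare as `"before"`.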

[explanation](https://riak.com/posts/technical/vector-clocks-revisited/index.html?p=9545.html)

### Failure Detection
To know which nodes are up, every node sends heartbeats to every other node in the system. However, this is inefficient when the system has many servers, since it requires $O(n^2)$ heartbeats.

**Gossip Protocol:** is a decentralized peer-to-peer communication technique to transmit messages in an enormous distributed system. Gossip protocol is simple in concept. Each node sends out some data to a set of other nodes. Data propagates through the system node by node like a virus. Eventually data propagates to every node in the system. It's a way for nodes to build a global map from limited local interactions.

Every node also supports the following two REST operations: i) a GET that returns information about all nodes, and ii) a POST that updates node information with data sent by another node (through a heartbeat). Each node also runs a timer task which:
- increments its own heartbeat counter
- moves peers whose last heartbeat is older than the threshold to the “suspected” state
- removes peers in the suspected state that still have not sent a heartbeat within a further threshold
- randomly picks $n$ peers to send this node’s membership list to
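
The timer task can be sketched as below. This is an illustration under several assumptions: the class `GossipNode`, the thresholds `suspect_after` and `remove_after`, and the explicit `now` argument are all invented for the example, `merge` stands in for the POST handler, and networking is left out entirely.

```python
import random


class GossipNode:
    def __init__(self, name, suspect_after=5, remove_after=10):
        self.name = name
        self.suspect_after = suspect_after
        self.remove_after = remove_after
        # peer name -> [heartbeat counter, local time the counter last increased]
        self.members = {name: [0, 0]}
        self.suspected = set()

    def merge(self, remote_members, now):
        """Handle a membership list pushed by another node."""
        for peer, (heartbeat, _) in remote_members.items():
            known = self.members.get(peer)
            if known is None or heartbeat > known[0]:
                self.members[peer] = [heartbeat, now]   # fresher heartbeat seen
                self.suspected.discard(peer)

    def tick(self, now, fanout=2):
        """Timer task; returns the peers that should receive our membership list."""
        me = self.members[self.name]
        me[0] += 1                                      # increment own heartbeat
        me[1] = now
        for peer, (_, last_seen) in list(self.members.items()):
            if peer == self.name:
                continue
            silence = now - last_seen
            if peer in self.suspected and silence > self.remove_after:
                del self.members[peer]                  # suspected for too long
                self.suspected.discard(peer)
            elif silence > self.suspect_after:
                self.suspected.add(peer)
        peers = [p for p in self.members if p != self.name]
        return random.sample(peers, min(fanout, len(peers)))
```

A caller would send `self.members` to each peer returned by `tick`, and the receiving node would pass it to `merge`. A real implementation would also need to stop a removed peer from being re-added by stale gossip, which this sketch does not handle.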

### Hinted Handoff
When a server goes down, a strongly consistent system blocks reads and writes. An available system instead chooses the first $W$ healthy servers for writes and the first $R$ healthy servers for reads on the hash ring. Offline servers are ignored. This is called a *sloppy quorum*.

If a server is unavailable due to network or server failures, another server processes its requests temporarily. When the failed server comes back up, the changes are pushed back to it to restore data consistency. This process is called *hinted handoff*.

To elaborate, let's say a write to node $A$ is replicated to $B$ and $C$, and $B$ is temporarily unavailable. The system would choose the next available node $D$ on the ring to handle $B$'s requests in the interim. In addition to the key-value pair, $D$ would also store the originally intended node in a separate queue. When $B$ comes back online, $D$ would drain its queue and send all the writes to $B$. Without hinted handoff, $B$ would be completely out of sync when it comes back online, and it would have to wait for a heavy *anti-entropy* process (*Merkle tree* comparison) or a read repair to get its data back. Hinted handoff "pushes" the missed data to $B$ as soon as it returns.
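
The scenario above can be sketched in a few lines. The `Node` class and the `replicate` and `hand_off` functions are hypothetical, the ring is just a list in ring order, and failure detection is reduced to an `online` flag.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.data = {}
        self.hints = []   # queue of (intended node name, key, value)


def replicate(ring, preference, key, value):
    """Write to each preferred replica, or leave a hint on the next healthy node."""
    # Candidate stand-ins: walk the ring starting after the last preferred node.
    start = ring.index(preference[-1]) + 1
    walk = ring[start:] + ring[:start]
    fallbacks = [n for n in walk if n.online and n not in preference]
    for target in preference:
        if target.online:
            target.data[key] = value
        elif fallbacks:
            stand_in = fallbacks.pop(0)
            stand_in.data[key] = value                        # serve the key meanwhile
            stand_in.hints.append((target.name, key, value))  # remember the real owner


def hand_off(holder, recovered):
    """Drain the hints that holder kept for the recovered node."""
    remaining = []
    for intended, key, value in holder.hints:
        if intended == recovered.name and recovered.online:
            recovered.data[key] = value
        else:
            remaining.append((intended, key, value))
    holder.hints = remaining
```

With a ring of `A, B, C, D` and `B` offline, a write intended for `A, B, C` lands on `A`, `C` and `D`, and `D` records a hint for `B`. Once `B` is back, `hand_off(D, B)` replays the hint and empties the queue.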

## Read Write Flow
### Write Path 
![Write Path](images/write_path_kv.png)

A write request received by a node has to be written to disk for persistence. However, writing to disk is slow. To speed this up, data is first written to an in-memory data structure (the *memtable*). Once the size of the memtable exceeds a set threshold, its key-value pairs are flushed to disk and new data is written to a new memtable. To safeguard against losing the data held in the memtable, a write-ahead log (WAL) is maintained. The WAL is an append-only sequence of the commands received by the node. When the node goes down and is brought back up, it can replay the WAL (from the last known checkpoint) to recreate its state.
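
A rough sketch of this write path, under simplifying assumptions: the class name `StorageNode`, the JSON file formats and the file names are invented for the example, the memtable size is measured in number of keys, and truncating the log after a flush plays the role of the checkpoint.

```python
import json
import os


class StorageNode:
    def __init__(self, directory, memtable_limit=1000):
        self.directory = directory
        self.wal_path = os.path.join(directory, "wal.log")
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self._replay_wal()   # recover writes that were never flushed

    def segment_paths(self):
        names = sorted(n for n in os.listdir(self.directory) if n.startswith("segment-"))
        return [os.path.join(self.directory, n) for n in names]

    def _replay_wal(self):
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as wal:
            for line in wal:
                key, value = json.loads(line)
                self.memtable[key] = value

    def put(self, key, value):
        # 1. Append to the write-ahead log and force it to disk.
        with open(self.wal_path, "a") as wal:
            wal.write(json.dumps([key, value]) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        # 2. Apply the write to the memtable.
        self.memtable[key] = value
        # 3. Flush once the memtable exceeds the threshold.
        if len(self.memtable) > self.memtable_limit:
            self._flush()

    def _flush(self):
        number = len(self.segment_paths())
        path = os.path.join(self.directory, f"segment-{number:06d}.json")
        with open(path, "w") as segment:
            json.dump(sorted(self.memtable.items()), segment)   # sorted by key
        self.memtable = {}                  # start a new, empty memtable
        open(self.wal_path, "w").close()    # flushed entries no longer need replay
```

Creating a second `StorageNode` on the same directory simulates a restart: the constructor replays whatever is left in `wal.log` and rebuilds the memtable.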

**SSTable:** a Sorted Strings Table is a format used to store key-value pairs persistently. It has the following properties:
- key-value pairs are stored sorted by key.
- once data is written to disk, it is not modified. Immutability allows multi-threaded access without locks. It also improves cacheability.

The diagram below shows how SSTable segments are created and how multiple segments are compacted:  
![SSTable](images/sstable.png)
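
Because every segment is sorted by key, compaction is essentially a merge of sorted lists. The sketch below assumes that segments are lists of `(key, value)` pairs passed oldest first and that the newest value for a key wins; deletions (tombstones) are not handled, and the function name `compact` is chosen for the example.

```python
import heapq


def compact(segments):
    """Merge sorted segments (oldest first) into one sorted segment."""
    # Tag each entry with the negated segment index so that, for equal keys,
    # the entry from the newest segment sorts first.
    tagged = (
        [(key, -age, value) for key, value in segment]
        for age, segment in enumerate(segments)
    )
    merged = []
    for key, _, value in heapq.merge(*tagged):
        if not merged or merged[-1][0] != key:   # keep only the newest entry per key
            merged.append((key, value))
    return merged
```

`heapq.merge` reads its inputs lazily, so the same approach works with file-backed iterators where the segments do not fit in memory.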

### Read Path
![Read Path](images/read_path_kv.png)

To look up a key, we first check the memtable. If the key is not present in memory, we need to consult the SSTable files on disk. But which file contains the key? We can use a *Bloom filter* for this. A Bloom filter can tell us that a key is definitely **not** present in a file (it may occasionally report a false positive, but never a false negative), so we can skip files that cannot contain the key. We start with the latest file and work back towards older ones.
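
The read path can be sketched as follows. The `BloomFilter` here is a deliberately simple version (a list of booleans and salted SHA-256 hashes), segments are modelled as in-memory dictionaries paired with their filter, and all names and sizes are assumptions for the example.

```python
import hashlib


class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions by salting the key with a seed.
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(key))


def make_segment(pairs):
    """Build a (bloom filter, table) pair standing in for one SSTable file."""
    bloom = BloomFilter()
    for key in pairs:
        bloom.add(key)
    return bloom, dict(pairs)


def read(key, memtable, segments):
    """segments: list of (bloom filter, table) pairs ordered oldest to newest."""
    if key in memtable:                          # 1. check memory first
        return memtable[key]
    for bloom, table in reversed(segments):      # 2. newest segment first
        if not bloom.might_contain(key):
            continue                             # skip files that cannot hold the key
        if key in table:                         # a false positive falls through here
            return table[key]
    return None
```

Searching from newest to oldest means the first match is the most recent value, so an older copy of the same key in an earlier segment is never returned.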

### BTree vs SSTable
SSTables perform sequential I/O, whereas B-Trees perform random I/O. Thus, writes to an SSTable are fast. Reads are slower because we consult the memtable first and then possibly several files.
<div style="display:inline-block" >
    
| Feature               | SSTable (LSM-Tree)            | B-Tree (RDBMS)                                  |
|-----------------------|-------------------------------|-------------------------------------------------|
| Primary Goal          | Write Throughput (Big Data)   | Read Speed (OLTP)                               |
| Write Pattern         | Append-Only (Sequential)      | Update-in-Place (Random)                        |
| Write Cost            | Low (Fast)                    | High (Requires finding the page first)          |
| Read Cost             | Higher (Check multiple files) | Low (Single path to data)                       |
| Space Efficiency      | High (Excellent compression)  | Medium (Page fragmentation)                     |
| Hardware Friendliness | Great for HDDs & SSDs         | Better on SSDs (which handle random I/O better) |
</div>