- Brainstorming
- 1. Overview
- 2. Data Models
- 3. Storage and Retrieval
- 4. Replication
- 5. Partitioning
- 6. Transactions
- 7. Consistency & Consensus
- References
Reliable
- Making system work correctly, even when the faults occur
Scalable
- Having strategies for keeping performance good, even when load increases
Maintainable
- Making life better for the engineering and operations teams who need to work with the system
Data system
Why distributed data?
- Scalability
- data volume
- read/write load
- High availability
- redundancy
- Latency
- geographic location
Building a reliable system from unreliable components
- The request was lost
- The remote node was down
- The response was lost
Network Congestion
Timestamp for ordering events
- Database writes can disappear
- a lagging clock is unable to overwrite values previously written by a node with a fast clock
Process Pauses
- If the clocks are out of sync by more than a few seconds? (time-of-day clocks)
- What if there is an unexpected pause in the execution of the program?
- GC STW
- Waiting for a slow disk I/O
- Paging
Incorrect implementation of a distributed lock
Fencing Token
Relational model, document model, graph model
Relational model
Document model
Graph model
How we can store the data that we're given, and how we can find it again when we're asked for
Hash index
Compaction
Pros
- Sequential write operations, which are much faster than random writes
- Concurrency and crash recovery are much simpler if segment files are append-only
Cons
- The hash table must fit in memory, so it's not suitable for a very large number of keys
- Range queries are not efficient
Log-structured merge-tree
Sorted String Table
- Merging segments, like mergesort
- Keep the value from the most recent segment and discard the values in older segments, log-structed
- O(logn) search, less I/O bandwidth
Bloom filters
- TODO
B-tree with n keys always has a depth of O(logn)
B-tree
- K-V pairs sorted by key, which allows efficient lookups and queries
- B-trees breaks the database down into pages (fixed size block)
Growing on disk
P.S : A four-level tree for 4KB pages with a branching factor of 500 can store up to 256TB
WAL
Write-ahead log, also known as redo log
- An append-only file
- Every B-tree modifications must be written before it can be applied to the pages
Why WAL?
Because some operations require several different pages to be overwritten, it's a dangerous operation if the database crashes after only some of the pages have been written
All of the difficulty in replication lies in handling changes to replicated data
Why Replica?
- Increase availability
- Increase read throughput
- Reduce latency
Partition vs Replica
Single-leader
Sync VS Asyns
Reading your own writes
Monotonic reads
Consistent prefix reads
Multi-leader across datacenters
Write conflicts
Topologies
Leaderless replication
Quorum reads & writes
Concurrent writes
Merging concurrently written values
- version1 : [milk]
- version2 : [eggs]
- version3 : [milk, flour]
- version4 : [eggs, milk, ham]
- version5 : [milk, flour, eggs, bacon]
Happens-before
Partition w/ replication
Partitioning by Key Range
Partitioning by Hash
Document-based
Term-based
Fixed Number of Partitions
Differeny Ways of Request Routing
Coordination Service
Race Condition
Dirty Read
Atomicity avoids inconsistent state
Read Committed
MVCC
- An update is internally translated into a delete and a create
- At some later time, GC process in the database removes any rows marked for deletion and frees their space
2PL
- After a transaction has acquired the lock, it must continue to hold the lock until the end of the transaction (commit or abort)
Index-Range Locks Example
- Approach 1: Database attach a shared lock to index room_id for room 123
- Approach 2: Database attach a shared lock to a range of values in index start_time and end_time with the time period of noon to 1pm
SSI: Detecting stale MVCC reads
- When the transaction wants to commit (not abort immediately), the database checks whether any of the ignored writes have now been committed. If so, the transaction must be aborted (optimistic)
SSI: Detecting writes that affect prior reads
Read Skew
Dirty Writes
Write Skew Example
The hospital tries to have serveral doctors on call at any one time, but it absolutely must have at least one doctor on call
- Since the database is using snapshot isolation, both checks return 2, so both transactions proceed to the next stage
Explicit lock
- SELECT FOR UPDATE tells the database to lock all rows returned by this query. So, we could make the transaction safe and avoid write skew
Two-Phase Commit
- The application begins a single-node transaction on each of the nodes (what if node crashes or request timeout?)
- The coordinator sends a prepare request to all nodes (what if node crashes or request timeout?)
- When a node receives the prepare request, it makes sure that it can definitely commit the tx under all circumstances
- The coordinator makes a definitely decision, writes it to its log on disk
Coordinator Failure
- This is why the coordinator must write its commit / abort decision to a transaction log on disk before phase 2
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
- How databases scale writes: The power of the log
- What are Bloom Filters? - Hashing
- Merge sort in 3 minutes
- Bloom Filters by Example
- An Illustrated Proof of the CAP Theorem