Personal solution to MIT's 6.824 Distributed Systems course labs, Spring 2020. Course channel on YouTube: link
The implementation was run against the tests provided with the labs, which simulate a distributed environment (host failures, connection issues, etc.), but it was not run in a real distributed environment.
- Lab 1: MapReduce [Done]
- Lab 2: Raft [Done]
- Part A: Leader Election
- Part B: Append New Log Entries
- Part C: State Persistence
- Lab 3: Fault-tolerant Key/Value Service [Done]
- Lab 4: Sharded Key/Value Service [Incomplete - some tests failing]
A simple implementation of MapReduce.
An implementation of the Raft consensus algorithm.
Some issues I encountered while implementing this lab:
Tip #1: Do not check for a timeout using a sleep, as this can lead to unexpected timeouts:
```go
// The pattern the tip warns against: a fixed deadline polled with sleeps.
timeoutTime := time.Now().Add(RANDOM_TIME)
for {
    if time.Now().After(timeoutTime) {
        // start a new election
    }
    time.Sleep(sleepPeriod)
}
```
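One alternative that avoids polling with a sleep is to drive the election timer with `select` on a channel, so that hearing from the leader resets the timer. The sketch below is illustrative only; `heartbeatCh`, `startElection`, and the timeout range are assumed names and values, not necessarily what this repository does:

```go
package raft

import (
    "math/rand"
    "time"
)

// Illustrative skeleton; the field and method names are assumptions.
type Raft struct {
    heartbeatCh chan struct{} // signalled on AppendEntries or a granted vote
}

func (rf *Raft) startElection() {
    // become a candidate and request votes from peers (omitted)
}

// electionTicker picks a fresh randomized timeout each round and only starts
// an election if nothing was heard from the leader within that window.
func (rf *Raft) electionTicker() {
    for {
        timeout := time.Duration(300+rand.Intn(200)) * time.Millisecond
        select {
        case <-time.After(timeout):
            rf.startElection() // timed out without hearing from a leader
        case <-rf.heartbeatCh:
            // heard from the leader (or granted a vote): loop, resetting the timer
        }
    }
}
```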
Uses the Raft implementation from Lab 2 to build a fault-tolerant key/value store. Raft elects a leader and replicates operations for fault tolerance.
Notes:
- The number of nodes can be any odd number (so a majority quorum can always be formed), not necessarily 5.
- Both read and write operations go through the leader; a quorum must be formed before the leader responds with success (see the sketch after these notes). The implementation does not support eventually consistent reads.
- Sharding the data is covered in Lab 4.
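As a rough sketch of how a write flows through the service (the names `Op`, `KVServer`, `Starter`, and the applier goroutine are illustrative assumptions; only `Start` mirrors the Lab 2 Raft interface):

```go
package kvraft

import "sync"

// Sketch only; the names below are illustrative, not necessarily this repo's.
type Op struct {
    Kind  string // "Get", "Put" or "Append"
    Key   string
    Value string
}

// Starter is the slice of the Lab 2 Raft API the server needs.
type Starter interface {
    Start(command interface{}) (index int, term int, isLeader bool)
}

type KVServer struct {
    mu      sync.Mutex
    rf      Starter
    data    map[string]string     // the replicated key/value state
    waiters map[int]chan struct{} // one channel per pending log index
}

// Put proposes the operation to Raft and reports success only after a quorum
// has replicated the entry and the applier goroutine has applied it to data.
func (kv *KVServer) Put(key, value string) bool {
    index, _, isLeader := kv.rf.Start(Op{Kind: "Put", Key: key, Value: value})
    if !isLeader {
        return false // the client library retries another server
    }
    kv.mu.Lock()
    done := make(chan struct{})
    kv.waiters[index] = done
    kv.mu.Unlock()
    // A real implementation also needs a timeout and duplicate detection here.
    <-done // closed by the applier once the entry at `index` is applied
    return true
}

// The applier goroutine (not shown) reads Raft's apply channel, applies each
// committed Op to kv.data, and closes kv.waiters[commandIndex].
```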
Why does a read operation have to be replicated rather than simply returning the value from the leader's KV store? As mentioned in section 8 of the Raft paper:
> Read-only operations can be handled without writing anything into the log. However, with no additional measures, this would run the risk of returning stale data, since the leader responding to the request might have been superseded by a newer leader of which it is unaware.
It is worth noting that distributed databases like DynamoDB allow both eventually consistent and strongly consistent reads (see: Read Consistency). However, the implementation in this assignment only provides strongly consistent reads.
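Concretely, a Get is submitted to the Raft log just like a Put, and the value is read from the local map only after that log entry has been applied. Continuing the sketch above (same caveats about assumed names):

```go
// Get goes through the log as well; returning kv.data[key] immediately would
// risk the stale read described in the quoted paragraph.
func (kv *KVServer) Get(key string) (string, bool) {
    index, _, isLeader := kv.rf.Start(Op{Kind: "Get", Key: key})
    if !isLeader {
        return "", false
    }
    kv.mu.Lock()
    done := make(chan struct{})
    kv.waiters[index] = done
    kv.mu.Unlock()
    <-done // once this entry commits, this server's state is at least this fresh
    kv.mu.Lock()
    defer kv.mu.Unlock()
    return kv.data[key], true
}
```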
An extension to Lab 3: a large dataset will not fit on a single host, so the data is sharded across N shards.
Notes:
- Each shard's data is replicated across the hosts that serve it.
- A client library determines which shard to contact for a given key (see the sketch after these notes).
- Hosts within the same shard use Raft for fault tolerance.
- A shard responds with success to a read/write once its leader forms a quorum.
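As an illustration of the client-side routing mentioned in the notes, a key can be mapped to a shard with a simple, deterministic function that clients and servers share. The function and constant below follow the shape of the course skeleton (`key2shard`, `NShards`), but the exact values are assumptions:

```go
package shardkv

// NShards is the fixed number of shards; the real value comes from the lab's
// configuration service.
const NShards = 10

// key2shard picks the shard responsible for a key. Every client and server
// must agree on this mapping.
func key2shard(key string) int {
    shard := 0
    if len(key) > 0 {
        shard = int(key[0]) // route on the first byte of the key
    }
    return shard % NShards
}
```

The client library uses this mapping to decide which shard's replica group to contact for a given key.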