Recovery of WAL may see an incomplete set of logs #537
Tablet servers track the active set of WALs (write-ahead logs) in ZooKeeper. When a tablet server dies, all WALs listed in ZooKeeper are used for recovery. Tablet servers determine which WALs are active based on which tablets reference them. If a tablet server allocates three WALs over time, W1, W2, and W3, it's possible that tablets only reference W1 and W3. If that tablet server dies, only W1 and W3 would be used for recovery. However, W2 may contain information that is important to some tablets. Consider the following data.
So if the tablet server dies and only W1 and W3 are used for recovery, tablet T1 will bring back the deleted rowX:colY. This happens because it does not see the data in W2 during recovery. If the data in W2 had been seen during recovery, the tablet would know that it had minor compacted and that no data needed to be recovered.
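The recovery behavior described above can be illustrated with a hypothetical, simplified model. It assumes recovery replays a tablet's mutations that follow the last minor compaction it can see in the logs. The `LogEvent` type, the `"mutation"`/`"minc_finish"` event kinds, the `recover` helper, and the log contents are all illustrative assumptions. They are not Accumulo's real log format and not the data from this issue. In this model, T2 references W1 and W3, which is why only those two would be listed in ZooKeeper.

```python
# Hypothetical, simplified model of per-tablet WAL recovery.
# Not Accumulo's implementation; event kinds and data are illustrative.
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    tablet: str
    kind: str          # assumed kinds: "mutation" or "minc_finish"
    payload: str = ""


def recover(tablet, logs):
    """Return the mutations a tablet would replay from the given WALs.

    `logs` is the list of WALs visible to recovery, in the order they were
    written; each WAL is a list of LogEvent. Anything at or before the
    tablet's last minor compaction is assumed to already be in a file.
    """
    events = [e for wal in logs for e in wal if e.tablet == tablet]
    last_minc = -1
    for i, e in enumerate(events):
        if e.kind == "minc_finish":
            last_minc = i
    return [e.payload for e in events[last_minc + 1:] if e.kind == "mutation"]


W1 = [LogEvent("T1", "mutation", "put rowX:colY"), LogEvent("T2", "mutation", "put a")]
W2 = [LogEvent("T1", "mutation", "delete rowX:colY"), LogEvent("T1", "minc_finish")]
W3 = [LogEvent("T2", "mutation", "put b")]

print(recover("T1", [W1, W2, W3]))  # [] : T1 sees its minor compaction
print(recover("T1", [W1, W3]))      # ['put rowX:colY'] : W2 is skipped
```

With W2 missing, T1 never sees its minor compaction marker, so it replays the old put without the delete that followed it. If later compactions have already dropped that delete, the row reappears.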
I discovered this issue while looking into and discussing #535 with @ctubbsii. This bug only affects Accumulo 1.8.0 and later. It results from the change in 1.8.0 to track WALs per tablet server instead of per tablet. Before 1.8.0, tablet T1 would have had no WALs associated with it after minor compacting.
This issue was referenced on Jun 21, 2018.
One possible fix is to make the tablet server keep every WAL that is as new as or newer than the oldest referenced WAL. For example, if a tserver has four closed WALs W1, W2, W3, and W4, where W4 is the oldest and W1 and W3 are referenced, then W1, W2, and W3 should be kept in ZooKeeper. In this case W4 can safely be garbage collected, but W2 cannot.
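The proposed retention rule can be sketched as a small function. It is a minimal illustration under stated assumptions, not Accumulo's code. The `split_wals` name is hypothetical. It assumes the closed WALs are given as a list ordered oldest first, and it treats "no referenced WAL" as meaning every closed WAL may be collected, which the proposal does not state. The currently open WAL is not modeled.

```python
# Hypothetical sketch of the proposed rule: keep every closed WAL from the
# oldest referenced one onward; only WALs older than that are GC candidates.
def split_wals(wals_oldest_first, referenced):
    """Return (keep, gc) lists for closed WALs ordered oldest first."""
    for i, wal in enumerate(wals_oldest_first):
        if wal in referenced:
            # Keep this WAL and everything newer, referenced or not.
            return list(wals_oldest_first[i:]), list(wals_oldest_first[:i])
    # Assumption: if no closed WAL is referenced, all of them can be collected.
    return [], list(wals_oldest_first)


keep, gc = split_wals(["W4", "W3", "W2", "W1"], {"W1", "W3"})
print(keep)  # ['W3', 'W2', 'W1']
print(gc)    # ['W4']
```

Unlike tracking only the referenced WALs, this keeps W2 in the recovery set even though no tablet references it. The cost is retaining some WALs that may turn out to be unnecessary.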