Seeing more data loss when running continuous ingest with agitation. #449
The above command sleeps for a random 1 to 10 minutes between rounds of killing tserver processes, and each round kills 1 to 3 tablet servers at random. I think I have run with a non-random period in the past, but I am not sure. I suspect the random period may be uncovering some new bugs.
After running verify I saw the following UNDEFINED count, which indicates data was lost. This count is much lower than what I saw before #441 (~600K). I am going to attempt to track down the cause.
I determined the cause of this bug. It happens when a tablet is unloaded from a tserver and then later comes back. After it comes back, if data is written and no minor compaction occurs before the tserver dies, then that data may not be seen. Below is a detailed example; the seq# values in it are WAL seq#.
During recovery of TabletA, WAL1 and WAL2 are examined to find the last compaction to finish and its seq#. The last compaction to finish has a seq# of 52, so recovery looks for data with seq# >= 52. However, the data in WAL2 has a lower seq# and is therefore ignored.
Before 1.8.0, WAL1 would never have been included in the recovery and this problem would not have occurred. However, the changes made in 1.8.0 (moving from tracking WAL refs per tablet to per tablet server) bring WAL1 into the recovery process.
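To make the failure mode concrete, here is a toy model of the recovery filter described above. It is only a sketch and not Accumulo's actual recovery code: the `LogEvent` type, the `recover` function, the event kinds, and every seq# other than 52 are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEvent:
    """Hypothetical, simplified WAL entry for a single tablet."""
    wal: str           # which log the entry came from
    kind: str          # "COMPACTION_FINISH" or "MUTATION"
    seq: int           # WAL seq#
    payload: str = ""  # stand-in for the mutation data


def recover(events):
    """Replay mutations whose seq# is >= the last finished compaction's seq#."""
    finished = [e.seq for e in events if e.kind == "COMPACTION_FINISH"]
    last_finished = max(finished, default=-1)  # -1: no compaction seen
    return [e.payload for e in events
            if e.kind == "MUTATION" and e.seq >= last_finished]


# WAL1: written before the tablet was unloaded; its last compaction finished at seq# 52.
wal1 = [
    LogEvent("WAL1", "MUTATION", 51, "old-row"),
    LogEvent("WAL1", "COMPACTION_FINISH", 52),
]
# WAL2: written after the tablet came back, with lower seq# (values assumed)
# and no minor compaction before the tserver died.
wal2 = [
    LogEvent("WAL2", "MUTATION", 3, "new-row-a"),
    LogEvent("WAL2", "MUTATION", 4, "new-row-b"),
]

recovered_both = recover(wal1 + wal2)  # 1.8.0 behavior: WAL1 is part of recovery
recovered_wal2_only = recover(wal2)    # pre-1.8.0 behavior: WAL1 is not included
```

In this model `recovered_both` is empty, because the seq# 52 threshold taken from WAL1 filters out both WAL2 mutations, while `recovered_wal2_only` contains both rows.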
Notes from tracking down the bug: accumulo-449.txt