Seeing data loss when running continuous ingest with agitation #441
I did not get a chance to run continuous ingest with agitation before 1.9.0 was released. I ran it after the release and saw some data loss. I have tracked the bug down: it happens when a closed write-ahead log (WAL) is referenced only by tablets that are minor compacting. When this happens, the tablet server may prematurely mark the write-ahead log as unreferenced in ZooKeeper. The following is an example of this bug:
In the example above, the tablet has data in WAL1; however, because the tablet server marked WAL1 as unreferenced, the master did not assign it to the tablet. The tablet server should mark WAL1 as unreferenced only after the minor compaction finishes, not after it starts.
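To make the ordering concrete, here is a minimal toy model of the reference tracking described above. It is not Accumulo's code: the class `WalTracker`, its methods, and the `buggy` flag are hypothetical names invented for illustration. It only shows the difference between dropping a tablet's WAL reference when a minor compaction starts versus when it finishes.

```python
class WalTracker:
    """Toy model (hypothetical, not Accumulo's API) of closed-WAL references."""

    def __init__(self, buggy=False):
        self.buggy = buggy          # True: release references at compaction start
        self.refs = {}              # WAL name -> set of tablet ids referencing it
        self.unreferenced = set()   # WALs that have been marked unreferenced

    def add_reference(self, wal, tablet):
        # The tablet has unflushed data recorded in this WAL.
        self.refs.setdefault(wal, set()).add(tablet)

    def minor_compaction_started(self, tablet):
        # The data is not yet durable in a file, so the WAL is still needed.
        # The buggy variant releases the reference here anyway.
        if self.buggy:
            self._release(tablet)

    def minor_compaction_finished(self, tablet):
        # The data is now in a file, so the reference can safely be dropped.
        self._release(tablet)

    def _release(self, tablet):
        for wal, tablets in self.refs.items():
            tablets.discard(tablet)
            if not tablets:
                self.unreferenced.add(wal)


correct = WalTracker()
correct.add_reference("WAL1", "tablet-A")
correct.minor_compaction_started("tablet-A")
wal1_safe_during_compaction = "WAL1" not in correct.unreferenced

buggy = WalTracker(buggy=True)
buggy.add_reference("WAL1", "tablet-A")
buggy.minor_compaction_started("tablet-A")
wal1_lost_during_compaction = "WAL1" in buggy.unreferenced
```

In the buggy variant, a tablet that is unloaded or whose server dies mid-compaction would be reassigned without WAL1, which matches the data loss described in this issue.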
This bug probably exists in 1.8.0 and later.
Below are some notes I took while debugging this. There were ~600K missing rows out of 24 billion. I focused on one missing row.
The following are selected log messages from the tserver on worker 3.
Later the tablet is loaded on worker 4 and even though the tablet has data in wal