Network issue during WAL creation results in a 'missing' WAL during recovery #949
As far as I can tell this is what's happening...
When a new WAL is created, the tserver writes the header and OPEN mark and then writes the new WAL marker to ZooKeeper for the master. If a network interruption occurs after the marker has been written but before the tserver learns that the write succeeded, the tserver treats the creation as failed and deletes the WAL from HDFS, leaving an orphaned entry in the metadata table. Accumulo then cannot proceed with ingest for the associated tablets without manual intervention, because it believes a WAL is missing.
This was observed twice in the last three weeks on a moderately sized cluster. Why does the tserver delete the WAL at all? Shouldn't the GC do this? Maybe the tserver should only delete the WAL if the failure did not happen on the ZooKeeper step?
Note: this only seems to happen in rare circumstances when the cluster is under heavy load.
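To make the suspected sequence concrete, here is a minimal in-memory sketch. It is not Accumulo code: `FlakyMarkerStore`, `createWalDeleteOnFailure`, `createWalVerifyBeforeDelete`, and `isOrphaned` are hypothetical names made up for illustration, a `Set<String>` stands in for HDFS, and the "lost acknowledgement" is simulated by throwing after the write has already been applied. It only models my understanding of the behavior described above and one possible alternative.

```java
import java.util.HashSet;
import java.util.Set;

public class WalCreationSketch {

    /** Hypothetical stand-in for ZooKeeper: a write can land and the ack can still be lost. */
    static class FlakyMarkerStore {
        final Set<String> markers = new HashSet<>();
        boolean loseNextAck = false;

        void addMarker(String wal) throws Exception {
            markers.add(wal); // the write is applied on the "server" side
            if (loseNextAck) {
                loseNextAck = false;
                // the caller sees a failure even though the marker exists
                throw new Exception("connection lost before ack");
            }
        }

        boolean hasMarker(String wal) {
            return markers.contains(wal);
        }
    }

    /** Models the behavior described in this report: any failure deletes the WAL file. */
    static void createWalDeleteOnFailure(String wal, Set<String> hdfs, FlakyMarkerStore zk) {
        hdfs.add(wal); // header and OPEN mark written
        try {
            zk.addMarker(wal);
        } catch (Exception e) {
            hdfs.remove(wal); // file removed, marker may still exist
        }
    }

    /** Assumed alternative: delete only when the marker is confirmed absent. */
    static void createWalVerifyBeforeDelete(String wal, Set<String> hdfs, FlakyMarkerStore zk) {
        hdfs.add(wal);
        try {
            zk.addMarker(wal);
        } catch (Exception e) {
            if (!zk.hasMarker(wal)) {
                hdfs.remove(wal);
            }
            // otherwise keep the file; the marker and the WAL stay consistent
        }
    }

    /** True when a marker references a WAL file that no longer exists. */
    static boolean isOrphaned(String wal, Set<String> hdfs, FlakyMarkerStore zk) {
        return zk.hasMarker(wal) && !hdfs.contains(wal);
    }

    public static void main(String[] args) {
        Set<String> hdfs = new HashSet<>();
        FlakyMarkerStore zk = new FlakyMarkerStore();

        zk.loseNextAck = true;
        createWalDeleteOnFailure("wal-1", hdfs, zk);
        System.out.println("delete-on-failure orphaned: " + isOrphaned("wal-1", hdfs, zk));

        zk.loseNextAck = true;
        createWalVerifyBeforeDelete("wal-2", hdfs, zk);
        System.out.println("verify-before-delete orphaned: " + isOrphaned("wal-2", hdfs, zk));
    }
}
```

In this toy model the first call leaves a marker with no file behind it, and the second does not. In a real cluster the verification read could itself fail during the same network interruption, in which case leaving the file in place for the GC seems like the safer default.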