Network issue during WAL creation results in a 'missing' WAL during recovery #949

Closed
andrewglowacki opened this issue Feb 9, 2019 · 6 comments

Comments

@andrewglowacki
Contributor

commented Feb 9, 2019

As far as I can tell this is what's happening...

When a new WAL is created, after the header and OPEN mark are written, the new WAL marker is written to ZooKeeper for the master. If a network interruption causes the marker to be written without the tserver learning that the write succeeded, the tserver deletes the WAL from HDFS, leaving an orphaned entry in the metadata table. Accumulo then cannot proceed with ingest for the associated tablets without manual intervention, because it thinks a WAL is missing.

This was observed twice in the last three weeks on a moderately sized cluster. Why does the tserver delete the WAL? Shouldn't the GC do this? Maybe the tserver should delete the WAL only when the failure did not occur in the ZooKeeper step?
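
To make the question above concrete, here is a minimal, self-contained simulation of the sequence. It is not Accumulo code: `WalCleanupSketch`, `FlakyStore`, `createWalNaive`, `createWalCautious`, and `hasOrphan` are hypothetical names, and the in-memory sets only stand in for ZooKeeper and HDFS. The naive path deletes the WAL file on any failure; the cautious path deletes it only when the marker is confirmed absent.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical simulation; none of these types are Accumulo or ZooKeeper APIs.
class WalCleanupSketch {

  /** Thrown when the client cannot tell whether its write was applied. */
  static class ConnectionLostException extends Exception {}

  /** Stand-in for the marker store (ZooKeeper in the issue). */
  static class FlakyStore {
    final Set<String> markers = new HashSet<>();
    boolean dropNextAck = false;

    void create(String wal) throws ConnectionLostException {
      markers.add(wal); // the write is applied on the server side
      if (dropNextAck) {
        dropNextAck = false;
        throw new ConnectionLostException(); // the ack never reaches the client
      }
    }

    boolean exists(String wal) {
      return markers.contains(wal);
    }
  }

  /** Naive policy: any failure while writing the marker deletes the WAL file. */
  static void createWalNaive(String wal, FlakyStore store, Set<String> files) {
    files.add(wal); // header and OPEN mark written
    try {
      store.create(wal);
    } catch (ConnectionLostException e) {
      files.remove(wal); // the marker may still exist, so it can be orphaned
    }
  }

  /** Cautious policy: delete the file only if the marker is confirmed absent. */
  static void createWalCautious(String wal, FlakyStore store, Set<String> files) {
    files.add(wal);
    try {
      store.create(wal);
    } catch (ConnectionLostException e) {
      if (!store.exists(wal)) {
        files.remove(wal);
      }
      // Otherwise keep the file so the marker and the file stay consistent.
    }
  }

  /** True when some marker points at a file that no longer exists. */
  static boolean hasOrphan(FlakyStore store, Set<String> files) {
    for (String marker : store.markers) {
      if (!files.contains(marker)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    FlakyStore s1 = new FlakyStore();
    s1.dropNextAck = true;
    Set<String> f1 = new HashSet<>();
    createWalNaive("wal-1", s1, f1);
    System.out.println("naive orphan: " + hasOrphan(s1, f1));

    FlakyStore s2 = new FlakyStore();
    s2.dropNextAck = true;
    Set<String> f2 = new HashSet<>();
    createWalCautious("wal-1", s2, f2);
    System.out.println("cautious orphan: " + hasOrphan(s2, f2));
  }
}
```

In a real cluster the existence check can itself fail; in that case leaving the file in place for the GC would be the more conservative choice, though that is a design assumption rather than something established in this thread.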

Note: this only seems to happen in rare circumstances when the cluster is under heavy load.

Version 1.9.2

@andrewglowacki

Contributor Author

commented Feb 12, 2019

After further investigation, it looks like the ZooKeeper exception was NoExists (after several connection losses with retries). I can contribute the fix I made locally if necessary; however, testing it seems like it would be difficult.
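
If "NoExists" here refers to a node-already-exists error on a retried create (an assumption; the thread does not spell it out), the sequence could look like the sketch below. The first create is applied but its acknowledgement is lost, so the retry finds the node already present. All names (`IdempotentCreateSketch`, `FlakyStore`, `createWithRetry`, and the two exception classes) are hypothetical and defined locally; this is not the ZooKeeper client API and not the fix mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical simulation; the exception types are local, not ZooKeeper's.
class IdempotentCreateSketch {

  static class ConnectionLostException extends Exception {}

  static class AlreadyExistsException extends Exception {}

  /** In-memory stand-in for a store whose acknowledgements can be lost. */
  static class FlakyStore {
    final Map<String, String> nodes = new HashMap<>();
    int acksToDrop = 0;

    void create(String path, String data)
        throws ConnectionLostException, AlreadyExistsException {
      if (nodes.containsKey(path)) {
        throw new AlreadyExistsException();
      }
      nodes.put(path, data); // applied on the server side
      if (acksToDrop > 0) {
        acksToDrop--;
        throw new ConnectionLostException(); // the client never sees the ack
      }
    }

    String read(String path) {
      return nodes.get(path);
    }
  }

  /** Returns true if the node is known to hold our data when the call returns. */
  static boolean createWithRetry(FlakyStore store, String path, String data,
      int maxAttempts) {
    boolean sawConnectionLoss = false;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      try {
        store.create(path, data);
        return true;
      } catch (ConnectionLostException e) {
        sawConnectionLoss = true; // outcome unknown, so try again
      } catch (AlreadyExistsException e) {
        // After a lost ack, "already exists" may be our own earlier write.
        return sawConnectionLoss && data.equals(store.read(path));
      }
    }
    return false;
  }

  public static void main(String[] args) {
    FlakyStore store = new FlakyStore();
    store.acksToDrop = 1;
    boolean ok = createWithRetry(store, "/wals/wal-1", "OPEN", 3);
    System.out.println("marker recorded: " + ok);
  }
}
```

Comparing the stored data is only one possible way to recognize an earlier write as one's own; whether it is sufficient depends on what else may create the same node.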

@keith-turner

Contributor

commented Feb 12, 2019

@andrewglowacki I would be interested in seeing the changes.

@jzgithub1

Contributor

commented Mar 7, 2019

You might want to check that NTP (Network Time Protocol) is properly installed across the cluster.
I am trying to replicate this issue on my cluster as well.

@jzgithub1

Contributor

commented Mar 7, 2019

Should I work on this issue in Accumulo 1.9.2 or Accumulo 2.0 alpha 2?

@keith-turner

Contributor

commented Mar 7, 2019

@jzgithub1 I think #1005 is for this issue.

@ctubbsii

Member

commented Mar 25, 2019

@keith-turner You self-assigned this. Any update on this? Or should this be bumped from the 1.9.3 release?

keith-turner added a commit to keith-turner/accumulo that referenced this issue Mar 26, 2019

keith-turner added a commit to keith-turner/accumulo that referenced this issue Mar 26, 2019

keith-turner added a commit that referenced this issue Mar 26, 2019

@ctubbsii ctubbsii added this to Done in 1.9.3 Jun 14, 2019

4 participants