GCS Hadoop Connector can't recover from Tablet Server failure. #428

Closed
adamresson opened this issue Apr 17, 2018 · 5 comments
Labels
question This issue describes a user question and should be answered; it may be a possible bug.

Comments

@adamresson

I was attempting to use the GCS Connector (https://github.com/GoogleCloudPlatform/bigdata-interop/tree/master/gcs) to back Accumulo on GCP.

All pretty straightforward: I just pointed instance.volumes at my bucket, gs://<bucketname>/accumulo.

One of my Tablet Servers OOMed, and when I try to recover it, I end up getting the following error:

Failed to initiate log sort gs://<bucketname>/accumulo/wal/accumulo-gcs-w-1+9997/aa166493-637e-48e8-a9a6-3655dfb59a6c
	java.lang.IllegalStateException: Don't know how to recover a lease for com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
		at org.apache.accumulo.server.master.recovery.HadoopLogCloser.close(HadoopLogCloser.java:70)
		at org.apache.accumulo.master.recovery.RecoveryManager$LogSortTask.run(RecoveryManager.java:96)
		at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
		at java.util.concurrent.FutureTask.run(FutureTask.java:266)
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
		at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
		at java.lang.Thread.run(Thread.java:748)

I traced that back to HadoopLogCloser. Since GoogleHadoopFileSystem is a plain FileSystem, not a DistributedFileSystem, LocalFileSystem, or RawLocalFileSystem, it bottoms out here:

throw new IllegalStateException("Don't know how to recover a lease for " + ns.getClass().getName());

I'm not sure what to do from here. I was also told that Azure should work with Accumulo with no problem, but it looks like NativeAzureFileSystem similarly doesn't extend any of these classes, so I presume it would hit the same issue.
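
To illustrate the type check described above, here is a minimal, self-contained sketch. It does not use the real Hadoop or Accumulo classes: every class name ending in `StandIn`, the class `LogCloserDispatchSketch`, and the returned strings are hypothetical placeholders, and the real HadoopLogCloser does more work in each branch. Only the exception type and message come from the stack trace above.

```java
// Hypothetical stand-ins for the Hadoop FileSystem classes named above.
// They exist only so this sketch compiles without Hadoop on the classpath.
abstract class FileSystemStandIn {}
class DistributedFileSystemStandIn extends FileSystemStandIn {}
class LocalFileSystemStandIn extends FileSystemStandIn {}
class RawLocalFileSystemStandIn extends FileSystemStandIn {}
class GoogleHadoopFileSystemStandIn extends FileSystemStandIn {}

class LogCloserDispatchSketch {
  // Assumed shape of the dispatch: known file system types are handled,
  // anything else falls through to the exception seen in the stack trace.
  static String close(FileSystemStandIn ns) {
    if (ns instanceof DistributedFileSystemStandIn) {
      return "recover lease"; // placeholder for HDFS lease recovery
    } else if (ns instanceof LocalFileSystemStandIn
        || ns instanceof RawLocalFileSystemStandIn) {
      return "nothing to recover"; // placeholder for the local file system case
    }
    throw new IllegalStateException(
        "Don't know how to recover a lease for " + ns.getClass().getName());
  }

  public static void main(String[] args) {
    System.out.println(close(new DistributedFileSystemStandIn()));
    try {
      close(new GoogleHadoopFileSystemStandIn());
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Any FileSystem subclass outside the recognized set, which would include an Azure one if it is equally unrelated, ends up in the final `throw`.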

@keith-turner
Contributor

keith-turner commented Apr 17, 2018

> I'm not sure what to do from here.

I think the following needs to be done.

  • Determine how one process can close a file using GCSC such that any other processes writing to the file will fail. Then implement this as an org.apache.accumulo.server.master.recovery.LogCloser.
  • Put this implementation on the Accumulo classpath.
  • Set the Accumulo property master.walog.closer.implementation to the implementation's class name.
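
The first step above can be sketched with a toy model. This is not the GCS connector API and not the real LogCloser interface: `ToyStore`, `LogCloserStandIn`, and `SealingLogCloser` are hypothetical names, the store is an in-memory map, and the return convention (0 meaning the file is closed and no further wait is needed) is an assumption. The sketch only shows the property a real implementation would have to guarantee: once the closer has run, any other writer's append fails.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for a file store; NOT the GCS connector API.
class ToyStore {
  private final Map<String, StringBuilder> files = new HashMap<>();
  private final Set<String> sealed = new HashSet<>();

  // Appends data unless the path has been sealed by a closer.
  synchronized void append(String path, String data) throws IOException {
    if (sealed.contains(path)) {
      throw new IOException("File is closed for writing: " + path);
    }
    files.computeIfAbsent(path, p -> new StringBuilder()).append(data);
  }

  // Marks the path as closed so later appends are rejected.
  synchronized void seal(String path) {
    sealed.add(path);
  }

  synchronized String read(String path) {
    StringBuilder sb = files.get(path);
    return sb == null ? "" : sb.toString();
  }
}

// Hypothetical simplification of a log closer contract.
// Assumption: the result is the time in milliseconds to wait before retrying,
// with 0 meaning the file is closed.
interface LogCloserStandIn {
  long close(ToyStore store, String path) throws IOException;
}

class SealingLogCloser implements LogCloserStandIn {
  @Override
  public long close(ToyStore store, String path) {
    store.seal(path); // after this, writers still holding the path fail
    return 0;
  }
}
```

A real implementation would replace `seal` with whatever mechanism the file system offers for fencing out a dead writer, and its class name would be the value given to master.walog.closer.implementation.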

I am really curious about the overall write ahead log behavior with GCSC. It would be interesting to intentionally kill tablet servers and see if the data recovers correctly.

@adamresson
Author

Looking at GCS, that WAL file doesn't even exist, so I'm not sure where the reference to it is being stored across restarts. Is it somewhere in ZooKeeper I can wipe out, at least to temporarily work around this issue?

@keith-turner
Contributor

> Is it somewhere in zookeeper I can wipe out, at least to temporarily negate this issue?

Try looking in /accumulo/<uuid>/root_tablet/walogs in ZooKeeper.

@ctubbsii
Member

I'm not sure there's a general solution in Accumulo we can provide for this scenario. We can't really know how to react to unknown FileSystem implementations. If it's not one of the ones we support, the user must provide a LogCloser implementation for that specific FileSystem.

This is really more of a question than an issue. I'm inclined to close it as answered. Does that seem reasonable?

@ctubbsii added the "question" label Apr 18, 2018
@adamresson
Author

Yup, makes sense, @ctubbsii. Closing.

3 participants