-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HBASE-28271 Infinite waiting on lock acquisition by snapshot can result in unresponsive master #5603
Conversation
…lt in unresponsive master
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
@@ -129,6 +132,8 @@ public TakeSnapshotHandler(SnapshotDescription snapshot, final MasterServices ma | |||
"Taking " + snapshot.getType() + " snapshot on table: " + snapshotTable, false, true); | |||
this.snapshotManifest = | |||
SnapshotManifest.create(conf, rootFs, workingDir, snapshot, monitor, status); | |||
this.lockAcquireTimeoutMs = | |||
conf.getLong(HBASE_SNAPSHOT_MASTER_LOCK_ACQUIRE_TIMEOUT, 5 * 60 * 1000L); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would 5 minutes be a little long ? How about 60 seconds ? As I mentioned before, the client will time out after 60 seconds by default, so there seems to be no difference between waiting 60 seconds and waiting 5 minutes? What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, will make this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
} else { | ||
LOG.error("Master lock could not be acquired in {} ms", lockAcquireTimeoutMs); | ||
throw new IOException("Master lock could not be acquired"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think retrying makes much sense in this scenario, so I would prefer to throw a DoNotRetryIOException like HBaseSnapshotException or SnapshotCreationException. What do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, regardless of the Exception type here, it does eventually throw SnapshotCreationException
:
try {
handler.prepare();
this.executorService.submit(handler);
this.snapshotHandlers.put(TableName.valueOf(snapshot.getTable()), handler);
} catch (Exception e) {
// cleanup the working directory by trying to delete it from the fs.
Path workingDir = SnapshotDescriptionUtils.getWorkingSnapshotDir(snapshot, rootDir,
master.getConfiguration());
FileSystem workingDirFs = workingDir.getFileSystem(master.getConfiguration());
try {
if (!workingDirFs.delete(workingDir, true)) {
LOG.error("Couldn't delete working directory (" + workingDir + " for snapshot:"
+ ClientSnapshotDescriptionUtils.toString(snapshot));
}
} catch (IOException e1) {
LOG.error("Couldn't delete working directory (" + workingDir + " for snapshot:"
+ ClientSnapshotDescriptionUtils.toString(snapshot));
}
// fail the snapshot
throw new SnapshotCreationException("Could not build snapshot handler", e,
ProtobufUtil.createSnapshotDesc(snapshot));
}
but sure i can still make it throw DoNotRetryIOException.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logs:
2024-01-03T15:17:21,015 ERROR [zk-snapshot {}] snapshot.TakeSnapshotHandler(163): Master lock could not be acquired in 0 ms
2024-01-03T15:17:21,016 DEBUG [PEWorker-3 {}] locking.LockProcedure(243): UNLOCKED pid=25, state=RUNNABLE, hasLock=true; org.apache.hadoop.hbase.master.locking.LockProcedure, tableName=TestSnapshotProcedureWithLockTimeout, type=EXCLUSIVE
2024-01-03T15:17:21,016 ERROR [zk-snapshot {}] procedure.TestSnapshotProcedureWithLockTimeout$1(103): zk snapshot failed
org.apache.hadoop.hbase.snapshot.SnapshotCreationException: Could not build snapshot handler
at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.snapshotTable(SnapshotManager.java:678) ~[classes/:?]
at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.snapshotEnabledTable(SnapshotManager.java:646) ~[classes/:?]
at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.takeSnapshotInternal(SnapshotManager.java:762) ~[classes/:?]
at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.takeSnapshot(SnapshotManager.java:707) ~[classes/:?]
at org.apache.hadoop.hbase.master.procedure.TestSnapshotProcedureWithLockTimeout$1.run(TestSnapshotProcedureWithLockTimeout.java:101) ~[test-classes/:?]
Caused by: java.io.IOException: Master lock could not be acquired
at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.prepare(TakeSnapshotHandler.java:164) ~[classes/:?]
at org.apache.hadoop.hbase.master.snapshot.EnabledTableSnapshotHandler.prepare(EnabledTableSnapshotHandler.java:61) ~[classes/:?]
at org.apache.hadoop.hbase.master.snapshot.EnabledTableSnapshotHandler.prepare(EnabledTableSnapshotHandler.java:47) ~[classes/:?]
at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.snapshotTable(SnapshotManager.java:659) ~[classes/:?]
... 4 more
Eventually it is SnapshotCreationException
that gets thrown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With updated patch, it looks like this:
org.apache.hadoop.hbase.snapshot.SnapshotCreationException: Could not build snapshot handler
at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.snapshotTable(SnapshotManager.java:678) ~[classes/:?]
at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.snapshotEnabledTable(SnapshotManager.java:646) ~[classes/:?]
at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.takeSnapshotInternal(SnapshotManager.java:762) ~[classes/:?]
at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.takeSnapshot(SnapshotManager.java:707) ~[classes/:?]
at org.apache.hadoop.hbase.master.procedure.TestSnapshotProcedureWithLockTimeout$1.run(TestSnapshotProcedureWithLockTimeout.java:105) ~[test-classes/:?]
Caused by: org.apache.hadoop.hbase.DoNotRetryIOException: Master lock could not be acquired
at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.prepare(TakeSnapshotHandler.java:165) ~[classes/:?]
at org.apache.hadoop.hbase.master.snapshot.EnabledTableSnapshotHandler.prepare(EnabledTableSnapshotHandler.java:61) ~[classes/:?]
at org.apache.hadoop.hbase.master.snapshot.EnabledTableSnapshotHandler.prepare(EnabledTableSnapshotHandler.java:47) ~[classes/:?]
at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.snapshotTable(SnapshotManager.java:659) ~[classes/:?]
... 4 more
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
import org.apache.hadoop.hbase.shaded.protobuf.generated.SnapshotProtos; | ||
|
||
@Category({ MasterTests.class, MediumTests.class }) | ||
public class TestSnapshotProcedureWithLockTimeout { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little confused about that scenario this unit test is for. Would you mind explaining a little bit ? Thanks. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose is to test with master lock timeout value small enough that we can ensure snapshot will not be created. Let me add some comments to specify this.
Overall, LGTM, only a few details that I think we can discuss a little more. Thanks for the great work ! |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
LGTM ! |
Thanks @frostruan for the review. Let me merge this in 24 hr unless there are any objections. |
hbase-server/src/main/java/org/apache/hadoop/hbase/master/locking/LockManager.java
Show resolved
Hide resolved
hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/TakeSnapshotHandler.java
Show resolved
Hide resolved
SnapshotTestingUtils.confirmSnapshotValid(TEST_UTIL, snapshotOnSameTableProto, TABLE_NAME, | ||
CF); | ||
throw new AssertionError("should not have created snapshot " + snapshotOnSameTableProto); | ||
} catch (AssertionError e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we do some refactoring here, catching AssertionError and asserting again looks weird...
And since we use EnvironmentEdgeManager to calculate the deadline, I think you can inject your own EnvironmentEdge for this test and make the behavior stable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually dealing with EnvironmentEdgeManager is tricky here, because the way we can ensure test is unable to create snapshot is by the main thread unable to take master lock here:
@Override
public TakeSnapshotHandler prepare() throws Exception {
super.prepare();
// after this, you should ensure to release this lock in case of exceptions
if (this.tableLock.tryAcquire(this.lockAcquireTimeoutMs)) {
try {
this.htd = loadTableDescriptor(); // check that .tableinfo is present
} catch (Exception e) {
this.tableLock.release();
throw e;
}
} else {
LOG.error("Master lock could not be acquired in {} ms", lockAcquireTimeoutMs);
throw new DoNotRetryIOException("Master lock could not be acquired");
}
return this;
}
But for test, let me make it look better because throwing as well as catching same type of Exception does not look good here as you mentioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to make it clear: f758519
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what we want here is let this.tableLock.tryAcquire throw timeout exception?
You can see the code in EnvironmentEdgeManagerTestHelper, by creating an Exception, you get the full stacktrace so you can determine in which method you are calling the EnvironmentEdge.currentTime, so here I think we can just hack the calls from LockManager.tryAcquire, to make sure that we will not enter the while loop and fail immediately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would still not be sufficient because we want tryAcquire to return false for our test. It returns false only if main thread exits faster before LockProcedure could be completed. So regardless of our hack of EnvironmentEdgeManager, main thread (i.e. execution of tryAcquire) can still complete (even without entering while loop) after LockProcedure completes execution. If that happens, this if condition would not be true and hence, our test would fail:
if (!proc.isLocked()) {
LOG.info("Timed out waiting to acquire procedure lock: " + proc.toString());
release();
return false;
}
With current retries in place for the test, i have run the test so many times in the loop and i haven't seen it fail anytime so i don't think this would be flaky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I got your point, but I think this could make it even easier to make the lock failure, as you can submit a LockProcedure first, but taking snapshot...
Anyway, this has been blocked here for a long time, let's get this in and improve later...
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
@Apache9 I have addressed all comments, are we good to go here after build results? |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
) { | ||
LOG.debug("Expected behaviour - snapshot could not be created"); | ||
} else { | ||
throw new UncheckedIOException(new IOException(e)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is in a test, why not just throw the exception out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
} | ||
|
||
if (snapshotCreated) { | ||
throw new UncheckedIOException(new IOException("Snapshot created successfully")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto. Why we need to wrap it with UncheckedIOException?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
SnapshotTestingUtils.confirmSnapshotValid(TEST_UTIL, snapshotOnSameTableProto, TABLE_NAME, | ||
CF); | ||
throw new AssertionError("should not have created snapshot " + snapshotOnSameTableProto); | ||
} catch (AssertionError e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what we want here is let this.tableLock.tryAcquire throw timeout exception?
You can see the code in EnvironmentEdgeManagerTestHelper, by creating an Exception, you get the full stacktrace so you can determine in which method you are calling the EnvironmentEdge.currentTime, so here I think we can just hack the calls from LockManager.tryAcquire, to make sure that we will not enter the while loop and fail immediately.
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
Thanks for the reviews @Apache9 @frostruan. Since i have both of your +1s, let me merge by EOD. |
…lt in unresponsive master (#5603) Signed-off-by: Hui Ruan <huiruan@apache.org> Signed-off-by: Duo Zhang <zhangduo@apache.org>
…lt in unresponsive master (#5603) Signed-off-by: Hui Ruan <huiruan@apache.org> Signed-off-by: Duo Zhang <zhangduo@apache.org>
…lt in unresponsive master (#5603) Signed-off-by: Hui Ruan <huiruan@apache.org> Signed-off-by: Duo Zhang <zhangduo@apache.org>
…lt in unresponsive master (#5603) Signed-off-by: Hui Ruan <huiruan@apache.org> Signed-off-by: Duo Zhang <zhangduo@apache.org>
Jira: HBASE-28271