-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HBASE-24562: Stabilize master startup with meta replicas enabled #1903
Conversation
make meta replication async catch and log exception instead of letting it crash master add test
🎊 +1 overall
This message was automatically generated. |
@@ -355,4 +362,58 @@ public void testShutdownOfReplicaHolder() throws Exception { | |||
assertNotEquals(3, i); | |||
} | |||
} | |||
|
|||
@Test | |||
public void testFailedReplicaAssigment() throws InterruptedException, IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the main purpose of the change is to allow Master to complete initialisation in case of meta replica failures, shouldn't this test assert that HMaster.finishActiveMasterInitialization completes properly even with the induced replicas assignment failures? I can see we are asserting that replicas assignment fails, which is not what we actually want to safeguard from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback @wchevreuil ! I added a few extra checks to show that the partial meta replica assignment had happened and master is still active and running.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have one question about the test case, do not see any exceptions thrown out during assigning meta replica regions, how do we make sure the following code is working.
try {
metaBootstrap.assignMetaReplicas();
} catch (HBaseIOException e){
If we can make sure that the test can compile with or without the changes in non-test modules, it will be great as we can verify that without those changes, the test will fail. Seems with current implementation, it is not possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BukrosSzabolcs Your explain about how the test works addressed my concerns, thanks.
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
metaBootstrap.assignMetaReplicas(); | ||
try { | ||
metaBootstrap.assignMetaReplicas(); | ||
} catch (HBaseIOException e){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assignMetaReplicas() throws IOException, InterruptedException, and KeeperException. Why only HBaseIOException is processed here? Should it catch all exceptions and totally ignore all exceptions caused by assignMetaReplicas()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. The idea was, that as far as I can tell InterruptedException
is never actually thrown and KeeperException represent a serious issue outside the scope of meta replication, that we should not hide. So I skipped those. Do you think some of them should be included?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of KeeperException has been handled inside the methods when I looked through the code path. IMO, if we do not want meta replicas to prevent master to become active, we should just catch and log it, let the other critical modules to decide if it needs to abort/block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checked other places in the code and I see your point. I'll add KeeperException to the catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void assignMetaReplicas()
throws IOException, InterruptedException, KeeperException {
I think there is no InterruptedException, we can remove it from the prototype. In your patch, only HBaseIOException is handled, how about other IOException? Can we just catch IOException as well? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is the last comment I have, +1 otherwise.
hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java
Show resolved
Hide resolved
@@ -355,4 +362,58 @@ public void testShutdownOfReplicaHolder() throws Exception { | |||
assertNotEquals(3, i); | |||
} | |||
} | |||
|
|||
@Test | |||
public void testFailedReplicaAssigment() throws InterruptedException, IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have one question about the test case, do not see any exceptions thrown out during assigning meta replica regions, how do we make sure the following code is working.
try {
metaBootstrap.assignMetaReplicas();
} catch (HBaseIOException e){
If we can make sure that the test can compile with or without the changes in non-test modules, it will be great as we can verify that without those changes, the test will fail. Seems with current implementation, it is not possible.
@huaxiangsun About your last question. Well, yes, the test is not obvious. If you check the code you can see that setting a Procedure on the master replica node just before we would call |
extract common assign code add docs to new methods
🎊 +1 overall
This message was automatically generated. |
also catch KeeperException
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please address the last comment I have, +1 otherwise.
catch IOException instead of HbaseIOException
@huaxiangsun Made the last change requested. Thanks a lot for your feedback! |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
Next time please fix the checkstyle issues before merging? We do not count checkstyle as -1 only because sometimes it is a bit hard to fix all the problems and also sometimes the violations are not introduced by the patch. But here the checkstyle issues are introduced by the PR... And please do not comment out a line of code? Just remove it. I see this in patch
|
Signed-off-by: Wellington Chevreuil <wchevreuil@apache.org> Signed-off-by: Huaxiang Sun <huaxiangsun@apache.com> (cherry picked from commit 8cdb2cc)
Signed-off-by: Wellington Chevreuil <wchevreuil@apache.org> Signed-off-by: Huaxiang Sun <huaxiangsun@apache.com> (cherry picked from commit 8cdb2cc)
…che#1903) Signed-off-by: Wellington Chevreuil <wchevreuil@apache.org> Signed-off-by: Huaxiang Sun <huaxiangsun@apache.com>
This is related to HBASE-21624 .
I created a separate ticket because in the original one a "complete solution for meta replicas" was requested and this is not one. I'm just trying to make master startup more stable by making assigning meta replicas asynchronous and preventing a potential assignment failure from crashing master.
The idea is that starting master with less or even no meta replicas assigned is preferable to not having a running master.