-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HBASE-28050 RSProcedureDispatcher to fail-fast for krb auth failures #5391
Conversation
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
@@ -306,6 +307,10 @@ private boolean scheduleForRetry(IOException e) { | |||
serverName, e.toString(), numberOfAttemptsSoFar); | |||
return false; | |||
} | |||
if (isSaslError(e)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here we need to follow the same pattern with CallQueueTooBigException, only if this is the first try, we can make sure that the task has not been sent yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good @Apache9, addressed in the latest revision
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
@@ -347,7 +348,7 @@ public void operationComplete(ChannelFuture future) throws Exception { | |||
private void sendRequest0(Call call, HBaseRpcController hrc) throws IOException { | |||
assert eventLoop.inEventLoop(); | |||
if (reloginInProgress) { | |||
throw new IOException("Can not send request because relogin is in progress."); | |||
throw new IOException(HConstants.RELOGIN_IS_IN_PROGRESS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is weird, please don't put these kinds of constants in HConstants. There are too many unrelated concerns there already.
Public static string constant in some other file, even this one, is preferred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NettyRpcConnection is package private, hence can't be accessed from hbase-server
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure what is the best place to keep this, anywhere else in hbase-common would also work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not put it in HConstants, it is IA.Public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i understand but i am not sure what is the best place to keep this in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, this is now taken care of
@@ -306,6 +308,10 @@ private boolean scheduleForRetry(IOException e) { | |||
serverName, e.toString(), numberOfAttemptsSoFar); | |||
return false; | |||
} | |||
if (isSaslError(e) && numberOfAttemptsSoFar == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it matter how many attempts we have had so far if now we are getting a SASL error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duo's point is that, if some attempts were already successful, it might have triggered region transition already and we might be in the middle of another sub-procedure when we encounter this.
I also think num of attempts should not matter as we won't be able to make any progress anyways, but then it takes our discussion back to the parent Jira HBASE-28048
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly as @virajjasani said, we need to make sure that the remote procedure has not been sent to the rs then we are safe to quit and choose another rs, otherwise the only safe way is to rely on SCP to tell us the rs is dead so we are safe to quit here.
So if we hit another error, like connection timed out the first time, then we are not sure whether we have already send the procedure to the rs, then no matter what the exceptions are in the following retries, we are not safe to quit.
In the real world, if authentication is not configured correctly, then it is likely we will get sasl error at the first try, so the code is enough to cover the problem here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@virajjasani Better to introduce a method may be called
boolean hasNotReachedRegionServerYet(IOException e)
In this method we could test for both sasl error and call queue too big, and also other exception types in the future. So we do not need to add extra condition every time when we want to add new exception type tests in scheduleForRetry method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
@@ -330,6 +322,73 @@ private boolean scheduleForRetry(IOException e) { | |||
return true; | |||
} | |||
|
|||
private boolean unableToConnectToServerInFirstAttempt(IOException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean we could extract a method for testing exception only, and then we test numberOfAttemptsSoFar outside the method...
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
// This exception is thrown in the rpc framework, where we can make sure that the call has not | ||
// been executed yet, so it is safe to mark it as fail. Especially for open a region, we'd | ||
// better choose another region server. | ||
// Notice that, it is safe to quit only if this is the first time we send request to region |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better move this block of comments to the if condition in the caller method? I mean the section start from 'Notice that, it is safe blabla'. The numberOfAttemptsSoFar == 0 test is there.
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
Show resolved
Hide resolved
) { | ||
return true; | ||
} | ||
// check 4 level of cause |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a for loop here? And why only test 4 levels?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's based on the examples we have seen so far, e.g.
procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=0, retrying...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could just use a loop to get the cause until cause is null, to check all the exceptions on chain. And we also need to handle RemoteException specially, to unwrap it instead of just calling getCause?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
handle RemoteException specially, to unwrap it instead of just calling getCause
yes, that is taken care of:
private boolean isThrowableOfTypeSasl(Throwable cause) {
if (cause instanceof IOException) {
IOException unwrappedException = unwrapException((IOException) cause);
return unwrappedException instanceof SaslException
|| (unwrappedException.getMessage() != null && unwrappedException.getMessage()
.contains(RpcConnectionConstants.RELOGIN_IS_IN_PROGRESS));
}
return false;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean after unwraping, you still need to go back to the get cause loop, not only test one time...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is in the loop
while (true) {
cause = cause.getCause();
if (cause == null) {
return false;
}
if (isThrowableOfTypeSasl(cause)) {
return true;
}
}
isThrowableOfTypeSasl
does the unwrap and checks for type of exception.
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java
Outdated
Show resolved
Hide resolved
return isSaslError(cause); | ||
} | ||
|
||
private boolean isSaslError(Throwable cause) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not use the same method name here, as IOException is also a Throwable, although this is valid in Java, but it will confuse the developers.
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
private boolean isSaslError(IOException e) { | ||
if ( | ||
e instanceof SaslException || (e.getMessage() != null | ||
&& e.getMessage().contains(RpcConnectionConstants.RELOGIN_IS_IN_PROGRESS)) | ||
) { | ||
return true; | ||
} | ||
Throwable cause = e; | ||
while (true) { | ||
cause = cause.getCause(); | ||
if (cause == null) { | ||
return false; | ||
} | ||
if (isThrowableOfTypeSasl(cause)) { | ||
return true; | ||
} | ||
} | ||
} | ||
|
||
private boolean isThrowableOfTypeSasl(Throwable cause) { | ||
if (cause instanceof IOException) { | ||
IOException unwrappedException = unwrapException((IOException) cause); | ||
return unwrappedException instanceof SaslException | ||
|| (unwrappedException.getMessage() != null && unwrappedException.getMessage() | ||
.contains(RpcConnectionConstants.RELOGIN_IS_IN_PROGRESS)); | ||
} | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the unwrapping be simplified, since unwrapException
won't do anything unless it's a RemoteException
?
private boolean isSaslError(IOException e) { | |
if ( | |
e instanceof SaslException || (e.getMessage() != null | |
&& e.getMessage().contains(RpcConnectionConstants.RELOGIN_IS_IN_PROGRESS)) | |
) { | |
return true; | |
} | |
Throwable cause = e; | |
while (true) { | |
cause = cause.getCause(); | |
if (cause == null) { | |
return false; | |
} | |
if (isThrowableOfTypeSasl(cause)) { | |
return true; | |
} | |
} | |
} | |
private boolean isThrowableOfTypeSasl(Throwable cause) { | |
if (cause instanceof IOException) { | |
IOException unwrappedException = unwrapException((IOException) cause); | |
return unwrappedException instanceof SaslException | |
|| (unwrappedException.getMessage() != null && unwrappedException.getMessage() | |
.contains(RpcConnectionConstants.RELOGIN_IS_IN_PROGRESS)); | |
} | |
return false; | |
} | |
private boolean isSaslError(IOException e) { | |
Throwable cause = e; | |
while (true) { | |
if (cause instanceof IOException) { | |
IOException unwrappedCause = unwrapException((IOException) cause); | |
if ( | |
unwrappedCause instanceof SaslException || (unwrappedCause.getMessage() != null | |
&& unwrappedCause.getMessage().contains(RpcConnectionConstants.RELOGIN_IS_IN_PROGRESS)) | |
) { | |
return true; | |
} | |
} | |
cause = cause.getCause(); | |
if (cause == null) { | |
return false; | |
} | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
much better, let me update the PR
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
Thanks for the reviews! Let me merge and backport this today. |
Thank you for the reviews @Apache9 @mnpoonia @d-c-manning @apurtell !! |
…5391) Signed-off-by: Duo Zhang <zhangduo@apache.org> Signed-off-by: Andrew Purtell <apurtell@apache.org> Signed-off-by: Aman Poonia <aman.poonia.29@gmail.com> Signed-off-by: David Manning <david.manning@salesforce.com>
…5391) Signed-off-by: Duo Zhang <zhangduo@apache.org> Signed-off-by: Andrew Purtell <apurtell@apache.org> Signed-off-by: Aman Poonia <aman.poonia.29@gmail.com> Signed-off-by: David Manning <david.manning@salesforce.com>
…5391) Signed-off-by: Duo Zhang <zhangduo@apache.org> Signed-off-by: Andrew Purtell <apurtell@apache.org> Signed-off-by: Aman Poonia <aman.poonia.29@gmail.com> Signed-off-by: David Manning <david.manning@salesforce.com>
…5391) Signed-off-by: Duo Zhang <zhangduo@apache.org> Signed-off-by: Andrew Purtell <apurtell@apache.org> Signed-off-by: Aman Poonia <aman.poonia.29@gmail.com> Signed-off-by: David Manning <david.manning@salesforce.com>
…pache#5391) Signed-off-by: Duo Zhang <zhangduo@apache.org> Signed-off-by: Andrew Purtell <apurtell@apache.org> Signed-off-by: Aman Poonia <aman.poonia.29@gmail.com> Signed-off-by: David Manning <david.manning@salesforce.com>
…pache#5391) Signed-off-by: Duo Zhang <zhangduo@apache.org> Signed-off-by: Andrew Purtell <apurtell@apache.org> Signed-off-by: Aman Poonia <aman.poonia.29@gmail.com> Signed-off-by: David Manning <david.manning@salesforce.com> (cherry picked from commit 597da71) Change-Id: I7a6dc36ebc525397237fe8f0250a43d7213c9d17
No description provided.