HDDS-4068. Client should not retry same OM on network connection failure #1324
Conversation
Overall LGTM, one minor question.
return RPC.getProtocolProxy(OzoneManagerProtocolPB.class, omVersion,
    omAddress, ugi, hadoopConf, NetUtils.getDefaultSocketFactory(
        hadoopConf), (int) OmUtils.getOMClientRpcTimeOut(conf),
    connectionRetryPolicy).getProxy();
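For context, a hedged sketch of how the connectionRetryPolicy passed above might be built; the construction is an assumption based on the discussion below, not a quote from the patch.

// Assumed construction of connectionRetryPolicy (illustration only): zero
// failovers and zero retries at the RPC connection layer, so a connection
// failure surfaces immediately instead of being retried against the same OM.
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;

final class OmConnectionRetryPolicySketch {  // hypothetical helper name
  static RetryPolicy newConnectionRetryPolicy() {
    return RetryPolicies.failoverOnNetworkException(0);
  }
}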
One question: can we use RetryPolicy TRY_ONCE_THEN_FAIL here?
Because in this failoverOnNetworkException policy we also set the retry count to zero and maxFailovers to zero.
It would be the same as the current one, right?
Would it suffice to add a comment to explain the retry policy?
Yes. Since failoverOnNetworkException uses TRY_ONCE_THEN_FAIL as its fallback policy and maxFailovers is zero, it behaves like TRY_ONCE_THEN_FAIL: shouldRetry will fail in the block below, I think.
if (failovers >= maxFailovers) {
  return new RetryAction(RetryAction.RetryDecision.FAIL, 0,
      "failovers (" + failovers + ") exceeded maximum allowed ("
      + maxFailovers + ")");
}
So, effectively we are using it the same way as TRY_ONCE_THEN_FAIL in this scenario.
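For illustration, a hedged sketch (not part of the patch; the class name and main method are made up) showing that with maxFailovers set to zero both policies return a FAIL decision on the first connection exception:

// Hedged illustration only: compares the decisions of the two policies
// discussed above for a single connection failure.
import java.net.ConnectException;
import org.apache.hadoop.io.retry.RetryPolicies;
import org.apache.hadoop.io.retry.RetryPolicy;
import org.apache.hadoop.io.retry.RetryPolicy.RetryAction;

public class RetryPolicyEquivalenceCheck {  // hypothetical name
  public static void main(String[] args) throws Exception {
    RetryPolicy failoverPolicy =
        RetryPolicies.failoverOnNetworkException(
            RetryPolicies.TRY_ONCE_THEN_FAIL, /* maxFailovers */ 0);
    RetryPolicy tryOnceThenFail = RetryPolicies.TRY_ONCE_THEN_FAIL;

    Exception connectFailure = new ConnectException("Connection refused");
    // retries = 0, failovers = 0: the "failovers >= maxFailovers" check quoted
    // above fires immediately, matching TRY_ONCE_THEN_FAIL's decision.
    RetryAction a = failoverPolicy.shouldRetry(connectFailure, 0, 0, true);
    RetryAction b = tryOnceThenFail.shouldRetry(connectFailure, 0, 0, true);
    System.out.println(a.action + " / " + b.action);  // expected: FAIL / FAIL
  }
}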
Force-pushed db3d2b1 to ef0dcf3, then ef0dcf3 to a5aca6f.
+1 LGTM.
Thanks @bharatviswa504 for reviewing the patch. Checked with @vivekratnavel that it is safe to commit since the coverage CI failed only during upload.
* master: (26 commits)
  HDDS-4167. Acceptance test logs missing if fails during cluster startup (apache#1366)
  HDDS-4121. Implement OmMetadataMangerImpl#getExpiredOpenKeys. (apache#1351)
  HDDS-3867. Extend the chunkinfo tool to display information from all nodes in the pipeline. (apache#1154)
  HDDS-4077. Incomplete OzoneFileSystem statistics (apache#1329)
  HDDS-3903. OzoneRpcClient support batch rename keys. (apache#1150)
  HDDS-4151. Skip the inputstream while offset larger than zero in s3g (apache#1354)
  HDDS-4147. Add OFS to FileSystem META-INF (apache#1352)
  HDDS-4137. Turn on the verbose mode of safe mode check on testlib (apache#1343)
  HDDS-4146. Show the ScmId and ClusterId in the scm web ui. (apache#1350)
  HDDS-4145. Bump version to 1.1.0-SNAPSHOT on master (apache#1349)
  HDDS-4109. Tests in TestOzoneFileSystem should use the existing MiniOzoneCluster (apache#1316)
  HDDS-4149. Implement OzoneFileStatus#toString (apache#1356)
  HDDS-4153. Increase default timeout in kubernetes tests (apache#1357)
  HDDS-2411. add a datanode chunk validator fo datanode chunk generator (apache#1312)
  HDDS-4140. Auto-close /pending pull requests after 21 days of inactivity (apache#1344)
  HDDS-4152. Archive container logs for kubernetes check (apache#1355)
  HDDS-4056. Convert OzoneAdmin to pluggable model (apache#1285)
  HDDS-3972. Add option to limit number of items displaying through ldb tool. (apache#1206)
  HDDS-4068. Client should not retry same OM on network connection failure (apache#1324)
  HDDS-4062. Non rack aware pipelines should not be created if multiple racks are alive. (apache#1291)
  ...
What changes were proposed in this pull request?
On connection failure to an OM, the client currently retries the same OM 10 times before failing over to the next OM. This is not optimal: the client should fail over to the next OM after a single connection exception. If the connection exception was spurious, the OM HA failover logic will lead the client back to that OM after it has tried the other OMs first.
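For illustration, a hedged, self-contained sketch of the intended client behavior; this is not Ozone's actual failover code, and the names (OmFailoverSketch, tryEachOmOnce, OmCall) are made up:

// Hedged illustration of "try each OM once, fail over on connection error,
// and come back around to a spuriously failed OM on a later pass".
import java.io.IOException;
import java.util.List;

final class OmFailoverSketch {
  interface OmCall<T> {
    T invoke(String omAddress) throws IOException;  // one attempt against one OM
  }

  // Try each OM address once per pass; on a connection failure move to the
  // next OM instead of retrying the same one. A spuriously failing OM gets
  // another chance on the next pass through the list (capped by maxPasses).
  static <T> T tryEachOmOnce(List<String> omAddresses, OmCall<T> call,
      int maxPasses) throws IOException {
    IOException lastFailure = null;
    for (int pass = 0; pass < maxPasses; pass++) {
      for (String omAddress : omAddresses) {
        try {
          return call.invoke(omAddress);
        } catch (IOException e) {
          lastFailure = e;  // fail over immediately, no per-OM retries
        }
      }
    }
    throw lastFailure != null ? lastFailure
        : new IOException("no OM address configured");
  }
}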
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-4068
How was this patch tested?
Tested manually on a docker cluster.