
HDDS-4068. Client should not retry same OM on network connection failure #1324

Merged: 2 commits into apache:master from the HDDS-4068 branch, Aug 26, 2020

Conversation

hanishakoneru (Contributor)

What changes were proposed in this pull request?

On a connection failure to an OM, the client currently retries the same OM 10 times before failing over to the next OM. This is not optimal. The client should fail over to the next OM after a single connection exception. If the connection exception was spurious, the OM HA failover logic will lead the client back to that OM after it has tried the other OMs.
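
For illustration only (a minimal sketch, not the exact code in this patch; the class and method names below are invented for the example), Hadoop's RetryPolicies.failoverOnNetworkException can be configured with zero failovers so that the RPC layer gives up on the current OM after the first connection exception, leaving failover to the next OM to the OM HA proxy provider:

    import org.apache.hadoop.io.retry.RetryPolicies;
    import org.apache.hadoop.io.retry.RetryPolicy;

    public class OmConnectionRetryPolicySketch {
      // Hypothetical helper: build a connection retry policy that does not
      // retry the same OM. maxFailovers = 0 with a TRY_ONCE_THEN_FAIL fallback
      // means the first connection failure is surfaced immediately instead of
      // being retried 10 times on the same OM.
      public static RetryPolicy noRetryOnSameOm() {
        return RetryPolicies.failoverOnNetworkException(
            RetryPolicies.TRY_ONCE_THEN_FAIL, 0);
      }
    }

Such a policy would then be passed as the connectionRetryPolicy when building the OM protocol proxy (see the RPC.getProtocolProxy call quoted in the review below).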

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4068

How was this patch tested?

Tested manually on a docker cluster.

@bharatviswa504 (Contributor) left a comment:

Overall LGTM, one minor question.

return RPC.getProtocolProxy(OzoneManagerProtocolPB.class, omVersion,
    omAddress, ugi, hadoopConf, NetUtils.getDefaultSocketFactory(
        hadoopConf), (int) OmUtils.getOMClientRpcTimeOut(conf),
    connectionRetryPolicy).getProxy();
@bharatviswa504 (Contributor) commented on the diff:
One question: can we use RetryPolicy TRY_ONCE_THEN_FAIL here?

Because in this failoverOnNetworkException, we also set the retry count to zero and maxFailovers to zero.

@hanishakoneru (Contributor Author) replied:

It would be the same as the current one, right?
Would it suffice to add a comment to explain the retry policy?

@bharatviswa504 (Contributor), Aug 25, 2020:

Yes. Since failoverOnNetworkException uses TRY_ONCE_THEN_FAIL as the fallback and maxFailovers is zero, it behaves like TRY_ONCE_THEN_FAIL; shouldRetry will fail in the check below, I think.

      if (failovers >= maxFailovers) {
        return new RetryAction(RetryAction.RetryDecision.FAIL, 0,
            "failovers (" + failovers + ") exceeded maximum allowed ("
            + maxFailovers + ")");
      }

So, effectively we are using it the same way as TRY_ONCE_THEN_FAIL in this scenario.
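
For illustration, a small standalone check of that equivalence (the class name is hypothetical; only Hadoop's public RetryPolicies API is used): with maxFailovers = 0, the first connection exception already trips the failovers >= maxFailovers check quoted above, so both policies decide FAIL on the first attempt.

    import java.net.ConnectException;
    import org.apache.hadoop.io.retry.RetryPolicies;
    import org.apache.hadoop.io.retry.RetryPolicy;

    public class RetryDecisionComparison {
      public static void main(String[] args) throws Exception {
        // failoverOnNetworkException with a TRY_ONCE_THEN_FAIL fallback and
        // zero allowed failovers, as discussed above.
        RetryPolicy failover = RetryPolicies.failoverOnNetworkException(
            RetryPolicies.TRY_ONCE_THEN_FAIL, 0);
        RetryPolicy tryOnce = RetryPolicies.TRY_ONCE_THEN_FAIL;

        Exception e = new ConnectException("Connection refused");
        // retries = 0, failovers = 0: both policies decide FAIL.
        System.out.println(failover.shouldRetry(e, 0, 0, true).action);
        System.out.println(tryOnce.shouldRetry(e, 0, 0, true).action);
      }
    }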

@hanishakoneru force-pushed the HDDS-4068 branch 2 times, most recently from db3d2b1 to ef0dcf3 on August 25, 2020 04:11
@bharatviswa504 (Contributor) left a comment:

+1 LGTM.

@hanishakoneru (Contributor Author):

Thanks @bharatviswa504 for reviewing the patch. Checked with @vivekratnavel that it is safe to commit since the coverage CI failed only during upload.
Will merge the PR soon.

@hanishakoneru hanishakoneru merged commit 9292b39 into apache:master Aug 26, 2020
rakeshadr pushed a commit to rakeshadr/hadoop-ozone that referenced this pull request on Sep 3, 2020
errose28 added a commit to errose28/ozone that referenced this pull request on Sep 11, 2020:
* master: (26 commits)
  HDDS-4167. Acceptance test logs missing if fails during cluster startup (apache#1366)
  HDDS-4121. Implement OmMetadataMangerImpl#getExpiredOpenKeys. (apache#1351)
  HDDS-3867. Extend the chunkinfo tool to display information from all nodes in the pipeline. (apache#1154)
  HDDS-4077. Incomplete OzoneFileSystem statistics (apache#1329)
  HDDS-3903. OzoneRpcClient support batch rename keys. (apache#1150)
  HDDS-4151. Skip the inputstream while offset larger than zero in s3g (apache#1354)
  HDDS-4147. Add OFS to FileSystem META-INF (apache#1352)
  HDDS-4137. Turn on the verbose mode of safe mode check on testlib (apache#1343)
  HDDS-4146. Show the ScmId and ClusterId in the scm web ui. (apache#1350)
  HDDS-4145. Bump version to 1.1.0-SNAPSHOT on master (apache#1349)
  HDDS-4109. Tests in TestOzoneFileSystem should use the existing MiniOzoneCluster (apache#1316)
  HDDS-4149. Implement OzoneFileStatus#toString (apache#1356)
  HDDS-4153. Increase default timeout in kubernetes tests (apache#1357)
  HDDS-2411. add a datanode chunk validator fo datanode chunk generator (apache#1312)
  HDDS-4140. Auto-close /pending pull requests after 21 days of inactivity (apache#1344)
  HDDS-4152. Archive container logs for kubernetes check (apache#1355)
  HDDS-4056. Convert OzoneAdmin to pluggable model (apache#1285)
  HDDS-3972. Add option to limit number of items displaying through ldb tool. (apache#1206)
  HDDS-4068. Client should not retry same OM on network connection failure (apache#1324)
  HDDS-4062. Non rack aware pipelines should not be created if multiple racks are alive. (apache#1291)
  ...
@hanishakoneru deleted the HDDS-4068 branch on December 1, 2020 21:30