
HDDS-4068. Client should not retry same OM on network connection failure #1324

Merged: 2 commits into apache:master from the HDDS-4068 branch, Aug 26, 2020

Conversation

hanishakoneru (Contributor)

What changes were proposed in this pull request?

On a connection failure to an OM, the client currently retries the same OM 10 times before failing over to the next OM. This is not optimal. The client should fail over to the next OM after a single connection exception. If the connection exception was spurious, the OM HA failover logic will lead the client back to that OM after it has tried the other OMs.
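
For illustration only (a minimal sketch, not the exact code in this patch; the class and method names below are invented for the example), Hadoop's RetryPolicies.failoverOnNetworkException can be configured with zero failovers so that the RPC layer gives up on the current OM after the first connection exception, leaving failover to the next OM to the OM HA proxy provider:

    import org.apache.hadoop.io.retry.RetryPolicies;
    import org.apache.hadoop.io.retry.RetryPolicy;

    public class OmConnectionRetryPolicySketch {
      // Hypothetical helper: build a connection retry policy that does not
      // retry the same OM. maxFailovers = 0 with a TRY_ONCE_THEN_FAIL fallback
      // means the first connection failure is surfaced immediately instead of
      // being retried 10 times on the same OM.
      public static RetryPolicy noRetryOnSameOm() {
        return RetryPolicies.failoverOnNetworkException(
            RetryPolicies.TRY_ONCE_THEN_FAIL, 0);
      }
    }

Such a policy would then be passed as the connectionRetryPolicy when building the OM protocol proxy (see the RPC.getProtocolProxy call quoted in the review below).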

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4068

How was this patch tested?

Tested manually on a docker cluster.

@bharatviswa504 (Contributor) left a comment:

Overall LGTM, one minor question.

return RPC.getProtocolProxy(OzoneManagerProtocolPB.class, omVersion,
    omAddress, ugi, hadoopConf, NetUtils.getDefaultSocketFactory(
        hadoopConf), (int) OmUtils.getOMClientRpcTimeOut(conf),
    connectionRetryPolicy).getProxy();
@bharatviswa504 (Contributor) commented on the diff:
One question: can we use RetryPolicy TRY_ONCE_THEN_FAIL here?

Because in this failoverOnNetworkException, we also set the retry count to zero and maxFailovers to zero.

@hanishakoneru (Contributor Author) replied:

It would be the same as the current one, right?
Would it suffice to add a comment to explain the retry policy?

@bharatviswa504 (Contributor), Aug 25, 2020:

Yes. Since failoverOnNetworkException uses TRY_ONCE_THEN_FAIL as the fallback and maxFailovers is zero, it behaves like TRY_ONCE_THEN_FAIL; shouldRetry will fail in the check below, I think.

      if (failovers >= maxFailovers) {
        return new RetryAction(RetryAction.RetryDecision.FAIL, 0,
            "failovers (" + failovers + ") exceeded maximum allowed ("
            + maxFailovers + ")");
      }

So, effectively we are using it the same way as TRY_ONCE_THEN_FAIL in this scenario.
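
For illustration, a small standalone check of that equivalence (the class name is hypothetical; only Hadoop's public RetryPolicies API is used): with maxFailovers = 0, the first connection exception already trips the failovers >= maxFailovers check quoted above, so both policies decide FAIL on the first attempt.

    import java.net.ConnectException;
    import org.apache.hadoop.io.retry.RetryPolicies;
    import org.apache.hadoop.io.retry.RetryPolicy;

    public class RetryDecisionComparison {
      public static void main(String[] args) throws Exception {
        // failoverOnNetworkException with a TRY_ONCE_THEN_FAIL fallback and
        // zero allowed failovers, as discussed above.
        RetryPolicy failover = RetryPolicies.failoverOnNetworkException(
            RetryPolicies.TRY_ONCE_THEN_FAIL, 0);
        RetryPolicy tryOnce = RetryPolicies.TRY_ONCE_THEN_FAIL;

        Exception e = new ConnectException("Connection refused");
        // retries = 0, failovers = 0: both policies decide FAIL.
        System.out.println(failover.shouldRetry(e, 0, 0, true).action);
        System.out.println(tryOnce.shouldRetry(e, 0, 0, true).action);
      }
    }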

@hanishakoneru force-pushed the HDDS-4068 branch 2 times, most recently from db3d2b1 to ef0dcf3 on August 25, 2020 04:11
@bharatviswa504 (Contributor) left a comment:

+1 LGTM.

@hanishakoneru (Contributor Author):

Thanks @bharatviswa504 for reviewing the patch. Checked with @vivekratnavel that it is safe to commit since the coverage CI failed only during upload.
Will merge the PR soon.

@hanishakoneru hanishakoneru merged commit 9292b39 into apache:master Aug 26, 2020
rakeshadr pushed a commit to rakeshadr/hadoop-ozone that referenced this pull request on Sep 3, 2020
errose28 added a commit to errose28/ozone that referenced this pull request on Sep 11, 2020:
* master: (26 commits)
  HDDS-4167. Acceptance test logs missing if fails during cluster startup (apache#1366)
  HDDS-4121. Implement OmMetadataMangerImpl#getExpiredOpenKeys. (apache#1351)
  HDDS-3867. Extend the chunkinfo tool to display information from all nodes in the pipeline. (apache#1154)
  HDDS-4077. Incomplete OzoneFileSystem statistics (apache#1329)
  HDDS-3903. OzoneRpcClient support batch rename keys. (apache#1150)
  HDDS-4151. Skip the inputstream while offset larger than zero in s3g (apache#1354)
  HDDS-4147. Add OFS to FileSystem META-INF (apache#1352)
  HDDS-4137. Turn on the verbose mode of safe mode check on testlib (apache#1343)
  HDDS-4146. Show the ScmId and ClusterId in the scm web ui. (apache#1350)
  HDDS-4145. Bump version to 1.1.0-SNAPSHOT on master (apache#1349)
  HDDS-4109. Tests in TestOzoneFileSystem should use the existing MiniOzoneCluster (apache#1316)
  HDDS-4149. Implement OzoneFileStatus#toString (apache#1356)
  HDDS-4153. Increase default timeout in kubernetes tests (apache#1357)
  HDDS-2411. add a datanode chunk validator fo datanode chunk generator (apache#1312)
  HDDS-4140. Auto-close /pending pull requests after 21 days of inactivity (apache#1344)
  HDDS-4152. Archive container logs for kubernetes check (apache#1355)
  HDDS-4056. Convert OzoneAdmin to pluggable model (apache#1285)
  HDDS-3972. Add option to limit number of items displaying through ldb tool. (apache#1206)
  HDDS-4068. Client should not retry same OM on network connection failure (apache#1324)
  HDDS-4062. Non rack aware pipelines should not be created if multiple racks are alive. (apache#1291)
  ...
@hanishakoneru deleted the HDDS-4068 branch on December 1, 2020 21:30