Skip to content

Comments

HDDS-9323. Apply expiry of excluded datanodes to writing Ratis keys#5530

Merged
adoroszlai merged 11 commits intoapache:masterfrom
DaveTeng0:HDDS-9323
Nov 21, 2023
Merged

HDDS-9323. Apply expiry of excluded datanodes to writing Ratis keys#5530
adoroszlai merged 11 commits intoapache:masterfrom
DaveTeng0:HDDS-9323

Conversation

@DaveTeng0
Copy link
Contributor

What changes were proposed in this pull request?

Currently it is possible that a long lived client can add most or all nodes of a small cluster to its exclude list, and further writes using that client instance will fail. This PR add a timeout for RATIS key to remove datanodes from the exclude list so that they can be retried later.
(For EC key, this mechanism exists and is configured to 10 minutes by default.)

This improvement is especially relevant for S3 gateway, which uses a persistent Ozone client to connect to the cluster.

And there is another part of improvement that allows the write to fall back to datanodes in the exclude list if that is all available. This would be implemented as a retry from the client based on the server's initial error response.

This part of work is separated into another PR: #5514

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9323

How was this patch tested?

unit test

Copy link
Contributor

@sumitagrawl sumitagrawl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DaveTeng0 Thanks for working over this, have minor comments, please check

DaveTeng0 added 2 commits November 7, 2023 21:06
…ethod in BlockOutputStreamEntryPool to use constructor of ExcludeList with clock
Copy link
Contributor

@sumitagrawl sumitagrawl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DaveTeng0 Give few review comments

@adoroszlai
Copy link
Contributor

@DaveTeng0 please check integration test failures:

org.apache.hadoop.ozone.client.rpc.TestFailureHandlingByClient
org.apache.hadoop.ozone.client.rpc.TestFailureHandlingByClientFlushDelay

https://github.com/DaveTeng0/ozone/actions/runs/6818659142/job/18545006173

@DaveTeng0
Copy link
Contributor Author

@DaveTeng0 please check integration test failures:

org.apache.hadoop.ozone.client.rpc.TestFailureHandlingByClient
org.apache.hadoop.ozone.client.rpc.TestFailureHandlingByClientFlushDelay

https://github.com/DaveTeng0/ozone/actions/runs/6818659142/job/18545006173

yup thanks attila! These two tests failure are on my radar!

@DaveTeng0
Copy link
Contributor Author

@adoroszlai adoroszlai changed the title HDDS-9323. Better datanode exclude list handling for long-lived clients HDDS-9323. Apply expiry of excluded datanodes to writing Ratis keys Nov 18, 2023
Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should apply the same in streaming write:

Copy link
Contributor

@sumitagrawl sumitagrawl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DaveTeng0 Have minor comment

ExcludeList createExcludeList() {
return new ExcludeList();
return new ExcludeList(getConfig().getExcludeNodesExpiryTime(),
Clock.system(ZoneOffset.UTC));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new use createExcludeList() for constructor BlockOutputStreamEntryPool() at line 140 as used.
https://github.com/apache/ozone/pull/5530/files#diff-db8e4914806b73e8ec37c15ffd146fc70bd1fcba631df9a314c7389af64a895dL140

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah updated.

@DaveTeng0
Copy link
Contributor Author

I think we should apply the same in streaming write:

oh yes you're right! updated.

@adoroszlai
Copy link
Contributor

@DaveTeng0 please merge master into your branch, it's 106 commits behind, and there is a conflict

@DaveTeng0
Copy link
Contributor Author

@DaveTeng0 please merge master into your branch, it's 106 commits behind, and there is a conflict

oh sure! working on it!

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @DaveTeng0 for updating the patch, LGTM.

@adoroszlai adoroszlai merged commit cddc571 into apache:master Nov 21, 2023
@adoroszlai
Copy link
Contributor

Thanks @DaveTeng0 for the patch, @sumitagrawl for the review.

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Feb 1, 2024
…Ratis keys (apache#5530)

(cherry picked from commit cddc571)
Change-Id: I4a2e1d12c1bfb6593a43850869a9b24dabcb1d2a
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants