HDDS-9823. Pipeline failure should trigger heartbeat immediately #5725
Merged
I think this case is triggered very frequently. Please check that it does not overload SCM with heartbeats (HB) because of continuous HB triggers.
Yes, you are right. I saw that `ContainerStateMachine#notifyFollowerSlowness` will get triggered continuously as long as the follower is still unreachable by the leader. Thanks for catching it.

I think we can add deduplication logic to trigger the heartbeat only for the first `triggerPipelineClose` for that particular pipeline. Maybe a concurrent set that stores the IDs of pipelines to be closed by SCM; a pipeline ID can be removed from the set in `ClosePipelineCommandHandler` when SCM sends back the pipeline close command.

Let me think about this further.
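The deduplication idea above could be sketched roughly as follows. This is only an illustration, not the code in this PR: `PipelineCloseDeduplicator`, its method names, and the use of `UUID` as the pipeline ID type are all hypothetical, and the heartbeat trigger is modelled as a plain `Runnable`.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch; these are not the actual Ozone classes. */
class PipelineCloseDeduplicator {
  // IDs of pipelines whose close action was reported to SCM
  // but for which no close pipeline command has come back yet.
  private final Set<UUID> pendingClose = ConcurrentHashMap.newKeySet();
  // Stand-in for whatever triggers an immediate heartbeat (assumption).
  private final Runnable heartbeatTrigger;

  PipelineCloseDeduplicator(Runnable heartbeatTrigger) {
    this.heartbeatTrigger = heartbeatTrigger;
  }

  /**
   * Called on every pipeline close trigger. Set#add is atomic and returns
   * true only for the first insertion, so only that call fires a heartbeat.
   */
  boolean onPipelineCloseTriggered(UUID pipelineId) {
    if (pendingClose.add(pipelineId)) {
      heartbeatTrigger.run();
      return true;
    }
    return false;
  }

  /** Called when SCM's close pipeline command for this pipeline is handled. */
  void onClosePipelineCommand(UUID pipelineId) {
    pendingClose.remove(pipelineId);
  }
}
```

With this shape, repeated `notifyFollowerSlowness` callbacks for the same pipeline would result in a single immediate heartbeat until the close command arrives.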
Thanks @ivandika3 for the proposal. Just curious: when you say the failed pipeline ID could be removed from the set in `ClosePipelineCommandHandler`, do you mean the `raftGids` concurrent set in `XceiverServerRatis`? (https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/XceiverServerRatis.java#L758-L760)
Thanks @DaveTeng0 for checking this out. I was previously thinking of another concurrent set, similar to `raftGids`, that would contain the `RaftGroupId`s of the in-flight pipelines pending close, i.e. the datanode has triggered the heartbeat containing the DN close pipeline action to SCM, but has not yet received the close pipeline command from SCM.

The idea is to prevent excessive heartbeat triggers, since the `ContainerStateMachine#notifyFollowerSlowness` hook gets triggered on every follower health check by the leader (see https://github.com/apache/ratis/blob/05db67929a5b06ce964eda6627d44cd153cc2bce/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L1285), which might happen every heartbeat (< 150ms).

We can change `raftGids` to store extra information about whether the pipeline for the `RaftGroupId` is pending close.
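As a rough sketch of that alternative, the set of group IDs could become a map from group ID to a small per-pipeline state object. Everything here is hypothetical (`ActivePipelines`, `PipelineState`, the method names, and `UUID` standing in for `RaftGroupId`); it only illustrates how a pending-close flag could gate the immediate heartbeat.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch; UUID stands in for RaftGroupId. */
class ActivePipelines {
  /** Extra per-pipeline information kept next to the group ID. */
  static final class PipelineState {
    final AtomicBoolean pendingClose = new AtomicBoolean(false);
  }

  private final Map<UUID, PipelineState> groups = new ConcurrentHashMap<>();

  /** Registers a pipeline when its Raft group is added. */
  void addGroup(UUID groupId) {
    groups.putIfAbsent(groupId, new PipelineState());
  }

  /** Forgets a pipeline, e.g. once the close command has been handled. */
  void removeGroup(UUID groupId) {
    groups.remove(groupId);
  }

  /**
   * Returns true only for the first caller that marks a known pipeline as
   * pending close; that caller would be the one to trigger the heartbeat.
   */
  boolean markPendingClose(UUID groupId) {
    PipelineState state = groups.get(groupId);
    return state != null && state.pendingClose.compareAndSet(false, true);
  }

  boolean isPendingClose(UUID groupId) {
    PipelineState state = groups.get(groupId);
    return state != null && state.pendingClose.get();
  }
}
```

Keeping the flag in the same structure as the group IDs avoids a second collection that would have to be kept in sync when groups are added or removed.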
@sumitagrawl I have updated `XceiverServerRatis` to keep track of active pipelines and their relevant information (i.e. whether the pipeline is pending close and whether the current datanode is the leader of the pipeline).

I tested manually by shutting down one of the datanodes in an active pipeline. The leader datanode triggered the pipeline close immediately due to the `notifyFollowerSlowness` hook, while the subsequent pipeline close actions were sent with the next regular heartbeats.

SCM pipeline close action log (entries separated by the 30s heartbeat interval), received from the pipeline leader DN
The DN was restarted at 14:16:53, which may be why SCM received multiple heartbeats from the same DN around that time.
DN log of pipeline close due to the follower (triggered multiple times within a single heartbeat interval)
@ivandika3
The current code fixes the frequent retry, but we still need to retry at some interval if SCM is down or unable to handle the request. Maybe we need to send the action again with the regular HB if the pipeline is still active.
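One possible reading of this suggestion is a time-gated retry: the first trigger fires immediately, and another is allowed only after an interval has elapsed without the pipeline being closed. The sketch below is an assumption-laden illustration, not the PR's implementation; `RetryingCloseTrigger`, its methods, the millisecond clock, and the `UUID` pipeline ID are all hypothetical.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a time-gated retry for pipeline close triggers. */
class RetryingCloseTrigger {
  // Last time (in ms) a heartbeat was triggered for each pending pipeline.
  private final Map<UUID, Long> lastTriggerMillis = new ConcurrentHashMap<>();
  private final long retryIntervalMillis;
  private final LongSupplier clock; // injectable for testing (assumption)

  RetryingCloseTrigger(long retryIntervalMillis, LongSupplier clock) {
    this.retryIntervalMillis = retryIntervalMillis;
    this.clock = clock;
  }

  /**
   * Returns true for the first trigger of a pipeline, and again once the
   * retry interval has passed without the pipeline being closed.
   */
  boolean shouldTriggerHeartbeat(UUID pipelineId) {
    long now = clock.getAsLong();
    boolean[] fire = {false};
    // compute() runs atomically per key, so concurrent callers cannot
    // both decide to fire for the same interval.
    lastTriggerMillis.compute(pipelineId, (id, last) -> {
      if (last == null || now - last >= retryIntervalMillis) {
        fire[0] = true;
        return now;
      }
      return last;
    });
    return fire[0];
  }

  /** Called once SCM's close command for the pipeline has been handled. */
  void onPipelineClosed(UUID pipelineId) {
    lastTriggerMillis.remove(pipelineId);
  }
}
```

If the retry should instead ride on the regular heartbeat rather than trigger an extra one, the same timestamp check could decide whether the close action is re-attached to the next heartbeat.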