HDDS-9823. Pipeline failure should trigger heartbeat immediately #5725
Conversation
@SaketaChalamchala can you please take a look?
@DaveTeng0 can you please take a look?
sumitagrawl left a comment
@ivandika3 Thanks for working on this. Please check the occurrence and impact of enabling the trigger every time, in case it has a performance impact.
  triggerPipelineClose(groupId, msg,
-     ClosePipelineInfo.Reason.PIPELINE_FAILED, false);
+     ClosePipelineInfo.Reason.PIPELINE_FAILED);
I think the trigger for this case is very frequent. Please check that this does not overload SCM with heartbeats due to the continuous HB trigger.
Yes, you are right. I saw that ContainerStateMachine#notifyFollowerSlowness gets triggered continuously as long as the follower remains unreachable from the leader. Thanks for catching it.
I think we can add deduplication logic so that the heartbeat is triggered only for the first triggerPipelineClose for a particular pipeline, e.g. a concurrent set that stores the pipeline IDs to be closed by SCM; a pipeline ID can then be removed from the set in ClosePipelineCommandHandler when SCM sends back the close pipeline command.
Let me think about this further.
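A minimal sketch of that deduplication idea, assuming an illustrative PipelineCloseDeduplicator class and a generic triggerHeartbeat hook rather than the actual Ozone types:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative sketch only; the class, field, and method names are placeholders, not the Ozone API. */
class PipelineCloseDeduplicator {
  private final Set<String> pendingClose = ConcurrentHashMap.newKeySet();
  private final Runnable triggerHeartbeat;

  PipelineCloseDeduplicator(Runnable triggerHeartbeat) {
    this.triggerHeartbeat = triggerHeartbeat;
  }

  /** Called on every pipeline failure; fires the heartbeat only for the first call per pipeline. */
  void onPipelineFailure(String pipelineId) {
    if (pendingClose.add(pipelineId)) {   // add() returns true only for the first caller
      triggerHeartbeat.run();
    }
  }

  /** Called when SCM's close pipeline command is handled, e.g. from ClosePipelineCommandHandler. */
  void onClosePipelineCommand(String pipelineId) {
    pendingClose.remove(pipelineId);
  }
}
```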
Thanks @ivandika3 for the proposal. Just curious, when you say we could get that failed pipeline ID removed from the set in ClosePipelineCommandHandler, do you mean the 'raftGids' concurrent set in XceiverServerRatis? (https://github.com/apache/ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/XceiverServerRatis.java#L758-L760)
Thanks @DaveTeng0 for checking this out. I was previously thinking of another concurrent set, similar to raftGids, that would contain the RaftGroupIds of in-flight pipelines pending close, i.e. the datanode has triggered a heartbeat containing the DN close pipeline action to SCM, but the DN has not yet received the close pipeline command from SCM.
The idea is to prevent excessive heartbeat triggers, since the ContainerStateMachine#notifyFollowerSlowness hook gets triggered on every follower health check by the leader (see https://github.com/apache/ratis/blob/05db67929a5b06ce964eda6627d44cd153cc2bce/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L1285), which might happen every Raft heartbeat (< 150ms).
We can change raftGids to store extra information about whether the pipeline for a RaftGroupId is pending close.
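A rough sketch of that refinement, with a hypothetical ActivePipelineContext value type; the names and structure are illustrative, not the merged code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Illustrative sketch: raftGids becomes a map from group id to per-pipeline state. */
class ActivePipelines {
  static final class ActivePipelineContext {
    volatile boolean pendingClose;   // close action already sent to SCM via heartbeat
    volatile boolean isLeader;       // whether this datanode is the pipeline leader
  }

  private final ConcurrentMap<String, ActivePipelineContext> activePipelines =
      new ConcurrentHashMap<>();

  void addGroup(String raftGroupId) {
    activePipelines.putIfAbsent(raftGroupId, new ActivePipelineContext());
  }

  void removeGroup(String raftGroupId) {
    activePipelines.remove(raftGroupId);
  }

  /** Returns true only on the transition into the pending-close state, so the caller
   *  triggers the immediate heartbeat at most once per failed pipeline. */
  boolean markPendingClose(String raftGroupId) {
    ActivePipelineContext ctx = activePipelines.get(raftGroupId);
    if (ctx == null) {
      return false;
    }
    synchronized (ctx) {
      if (ctx.pendingClose) {
        return false;
      }
      ctx.pendingClose = true;
      return true;
    }
  }
}
```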
@sumitagrawl I have updated XceiverServerRatis to keep track of active pipelines and their relevant information (i.e. whether the pipeline is pending close and whether the current datanode is the pipeline leader).
I tested manually by shutting down one of the datanodes in an active pipeline. The leader datanode triggered the pipeline close immediately via the notifyFollowerSlowness hook, while the subsequent pipeline close actions were sent in the following regular heartbeats.
SCM pipeline action close log (separated by the 30s heartbeat interval) received from the pipeline leader DN:
2023-12-20 14:14:53,962 [scm1-EventQueue-PipelineActionsForPipelineActionHandler] INFO org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler: Received pipeline action CLOSE for PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a from datanode a3fcdd27-8244-4b7a-840c-037dac8c6337. Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 302697ms
2023-12-20 14:15:23,963 [scm1-EventQueue-PipelineActionsForPipelineActionHandler] INFO org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler: Received pipeline action CLOSE for PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a from datanode a3fcdd27-8244-4b7a-840c-037dac8c6337. Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 302697ms
2023-12-20 14:15:53,960 [scm1-EventQueue-PipelineActionsForPipelineActionHandler] INFO org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler: Received pipeline action CLOSE for PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a from datanode a3fcdd27-8244-4b7a-840c-037dac8c6337. Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 302697ms
2023-12-20 14:16:23,960 [scm1-EventQueue-PipelineActionsForPipelineActionHandler] INFO org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler: Received pipeline action CLOSE for PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a from datanode a3fcdd27-8244-4b7a-840c-037dac8c6337. Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 302697ms
2023-12-20 14:16:52,409 [scm1-EventQueue-PipelineActionsForPipelineActionHandler] INFO org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler: Received pipeline action CLOSE for PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a from datanode a3fcdd27-8244-4b7a-840c-037dac8c6337. Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 302697ms
2023-12-20 14:16:52,413 [scm1-EventQueue-PipelineActionsForPipelineActionHandler] INFO org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler: Received pipeline action CLOSE for PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a from datanode a3fcdd27-8244-4b7a-840c-037dac8c6337. Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 302697ms
2023-12-20 14:16:52,595 [scm1-EventQueue-PipelineActionsForPipelineActionHandler] INFO org.apache.hadoop.hdds.scm.pipeline.PipelineActionHandler: Received pipeline action CLOSE for PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a from datanode a3fcdd27-8244-4b7a-840c-037dac8c6337. Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 302697ms
The DN was restarted at 14:16:53, which may be why SCM received multiple heartbeats from the same DN around that time.
DN pipeline close log due to follower slowness (triggered multiple times within a single heartbeat interval):
2023-12-20 14:14:53,956 [a3fcdd27-8244-4b7a-840c-037dac8c6337@group-661AE7EEF57A->a2289e91-7fe6-49e8-ae84-fb33f39719d0-GrpcLogAppender-LogAppenderDaemon] ERROR org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: pipeline Action CLOSE on pipeline PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a.Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 302697ms
2023-12-20 14:15:02,517 [a3fcdd27-8244-4b7a-840c-037dac8c6337@group-661AE7EEF57A->a2289e91-7fe6-49e8-ae84-fb33f39719d0-GrpcLogAppender-LogAppenderDaemon] ERROR org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: pipeline Action CLOSE on pipeline PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a.Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 311262ms
2023-12-20 14:15:09,112 [a3fcdd27-8244-4b7a-840c-037dac8c6337@group-661AE7EEF57A->a2289e91-7fe6-49e8-ae84-fb33f39719d0-GrpcLogAppender-LogAppenderDaemon] ERROR org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: pipeline Action CLOSE on pipeline PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a.Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 317858ms
2023-12-20 14:15:17,303 [a3fcdd27-8244-4b7a-840c-037dac8c6337@group-661AE7EEF57A->a2289e91-7fe6-49e8-ae84-fb33f39719d0-GrpcLogAppender-LogAppenderDaemon] ERROR org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: pipeline Action CLOSE on pipeline PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a.Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 326049ms
2023-12-20 14:15:21,319 [a3fcdd27-8244-4b7a-840c-037dac8c6337@group-661AE7EEF57A->a2289e91-7fe6-49e8-ae84-fb33f39719d0-GrpcLogAppender-LogAppenderDaemon] ERROR org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: pipeline Action CLOSE on pipeline PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a.Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 330065ms
2023-12-20 14:15:26,097 [a3fcdd27-8244-4b7a-840c-037dac8c6337@group-661AE7EEF57A->a2289e91-7fe6-49e8-ae84-fb33f39719d0-GrpcLogAppender-LogAppenderDaemon] ERROR org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: pipeline Action CLOSE on pipeline PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a.Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 334843ms
2023-12-20 14:15:33,563 [a3fcdd27-8244-4b7a-840c-037dac8c6337@group-661AE7EEF57A->a2289e91-7fe6-49e8-ae84-fb33f39719d0-GrpcLogAppender-LogAppenderDaemon] ERROR org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: pipeline Action CLOSE on pipeline PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a.Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 342308ms
2023-12-20 14:15:38,610 [a3fcdd27-8244-4b7a-840c-037dac8c6337@group-661AE7EEF57A->a2289e91-7fe6-49e8-ae84-fb33f39719d0-GrpcLogAppender-LogAppenderDaemon] ERROR org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: pipeline Action CLOSE on pipeline PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a.Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 347356ms
2023-12-20 14:15:47,337 [a3fcdd27-8244-4b7a-840c-037dac8c6337@group-661AE7EEF57A->a2289e91-7fe6-49e8-ae84-fb33f39719d0-GrpcLogAppender-LogAppenderDaemon] ERROR org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: pipeline Action CLOSE on pipeline PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a.Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 356082ms
2023-12-20 14:15:54,298 [a3fcdd27-8244-4b7a-840c-037dac8c6337@group-661AE7EEF57A->a2289e91-7fe6-49e8-ae84-fb33f39719d0-GrpcLogAppender-LogAppenderDaemon] ERROR org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: pipeline Action CLOSE on pipeline PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a.Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 363044ms
2023-12-20 14:15:58,898 [a3fcdd27-8244-4b7a-840c-037dac8c6337@group-661AE7EEF57A->a2289e91-7fe6-49e8-ae84-fb33f39719d0-GrpcLogAppender-LogAppenderDaemon] ERROR org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: pipeline Action CLOSE on pipeline PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a.Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 367643ms
2023-12-20 14:16:04,575 [a3fcdd27-8244-4b7a-840c-037dac8c6337@group-661AE7EEF57A->a2289e91-7fe6-49e8-ae84-fb33f39719d0-GrpcLogAppender-LogAppenderDaemon] ERROR org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: pipeline Action CLOSE on pipeline PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a.Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 373321ms
2023-12-20 14:16:11,602 [a3fcdd27-8244-4b7a-840c-037dac8c6337@group-661AE7EEF57A->a2289e91-7fe6-49e8-ae84-fb33f39719d0-GrpcLogAppender-LogAppenderDaemon] ERROR org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: pipeline Action CLOSE on pipeline PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a.Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 380348ms
2023-12-20 14:16:19,886 [a3fcdd27-8244-4b7a-840c-037dac8c6337@group-661AE7EEF57A->a2289e91-7fe6-49e8-ae84-fb33f39719d0-GrpcLogAppender-LogAppenderDaemon] ERROR org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: pipeline Action CLOSE on pipeline PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a.Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 388632ms
2023-12-20 14:16:24,493 [a3fcdd27-8244-4b7a-840c-037dac8c6337@group-661AE7EEF57A->a2289e91-7fe6-49e8-ae84-fb33f39719d0-GrpcLogAppender-LogAppenderDaemon] ERROR org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: pipeline Action CLOSE on pipeline PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a.Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 393239ms
2023-12-20 14:16:31,369 [a3fcdd27-8244-4b7a-840c-037dac8c6337@group-661AE7EEF57A->a2289e91-7fe6-49e8-ae84-fb33f39719d0-GrpcLogAppender-LogAppenderDaemon] ERROR org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: pipeline Action CLOSE on pipeline PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a.Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 400114ms
2023-12-20 14:16:37,345 [a3fcdd27-8244-4b7a-840c-037dac8c6337@group-661AE7EEF57A->a2289e91-7fe6-49e8-ae84-fb33f39719d0-GrpcLogAppender-LogAppenderDaemon] ERROR org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: pipeline Action CLOSE on pipeline PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a.Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 406091ms
2023-12-20 14:16:45,904 [a3fcdd27-8244-4b7a-840c-037dac8c6337@group-661AE7EEF57A->a2289e91-7fe6-49e8-ae84-fb33f39719d0-GrpcLogAppender-LogAppenderDaemon] ERROR org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: pipeline Action CLOSE on pipeline PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a.Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 414649ms
2023-12-20 14:16:51,766 [a3fcdd27-8244-4b7a-840c-037dac8c6337@group-661AE7EEF57A->a2289e91-7fe6-49e8-ae84-fb33f39719d0-GrpcLogAppender-LogAppenderDaemon] ERROR org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis: pipeline Action CLOSE on pipeline PipelineID=38c214e0-9a5e-4b89-b114-661ae7eef57a.Reason : a3fcdd27-8244-4b7a-840c-037dac8c6337 has not seen follower/s a2289e91-7fe6-49e8-ae84-fb33f39719d0 for 420512ms
@ivandika3
The current code fixes the frequent retries, but we still need a retry after a certain time interval if SCM is down or unable to handle the request. Maybe we need to send the close action again during the regular HB if the pipeline is still active.
sumitagrawl left a comment
LGTM
Thanks @ivandika3 for the patch, @sumitagrawl for the review.
What changes were proposed in this pull request?
XceiverServerRatis#handlePipelineFailure is called in ContainerStateMachine (CSM) failure scenarios.
Currently, XceiverServerRatis#handlePipelineFailure does not trigger a heartbeat to SCM immediately. Instead, it waits until the next heartbeat (default 60s) to send the pipeline close action. This may cause SCM to keep allocating blocks on these "failed" pipelines during that window, which can impact clients writing to those blocks.
To minimize the impact on clients and on the datanodes in the failed pipeline, I suggest that the datanode trigger the heartbeat immediately for every close pipeline action raised due to pipeline failure.
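A small end-to-end sketch of that proposal, reusing the PipelineCloseDeduplicator sketched earlier in this thread; queueClosePipelineAction and triggerHeartbeat stand in for the datanode's real action queue and heartbeat trigger, and none of these names are guaranteed to match the merged code:

```java
/** Illustrative sketch only. */
class PipelineFailureHandler {
  private final PipelineCloseDeduplicator dedup;

  PipelineFailureHandler(Runnable triggerHeartbeat) {
    this.dedup = new PipelineCloseDeduplicator(triggerHeartbeat);
  }

  /** Invoked from failure callbacks such as ContainerStateMachine#notifyFollowerSlowness. */
  void handlePipelineFailure(String pipelineId, String reason) {
    queueClosePipelineAction(pipelineId, reason);  // close action rides along with the heartbeat
    dedup.onPipelineFailure(pipelineId);           // triggers an immediate heartbeat only once per pipeline
  }

  private void queueClosePipelineAction(String pipelineId, String reason) {
    // placeholder for queuing the close pipeline action into the heartbeat payload
  }
}
```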
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-9823
How was this patch tested?
Existing tests.
Clean CI run: https://github.com/ivandika3/ozone/actions/runs/7084351468