HDDS-9735. Datanodes to retry close pipeline commands until pipeline is removed. #5643

Merged
sodonnel merged 15 commits into apache:master from SaketaChalamchala:HDDS-9735
Nov 22, 2023

Conversation

@SaketaChalamchala
Contributor

What changes were proposed in this pull request?

The datanode does not retry the close pipeline command often enough. If the SCM has just been restarted and leader election has not yet completed, the SCM might drop the close pipeline request.
In this case, the datanode does not resend the close pipeline action, and the pipeline remains open until a close command is sent manually.
If the datanode triggered the close because the pipeline is bad, new writes continue to be directed to this pipeline and continue to fail, which slows writes down.

The proposed change ensures that close pipeline actions are not removed from the pending pipeline action queue until the corresponding pipeline has been removed from the datanode.
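The retry behaviour described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the class `PendingCloseActions`, its methods, and the use of plain `String` pipeline IDs are all hypothetical stand-ins for the datanode's real pending pipeline action queue.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified stand-in for the pending pipeline action queue. */
final class PendingCloseActions {
  // Pipeline IDs with a pending close action, in insertion order, no duplicates.
  private final Set<String> pending = new LinkedHashSet<>();

  /** Queues a close action for the given pipeline (assumed ID type: String). */
  synchronized void addCloseAction(String pipelineId) {
    pending.add(pipelineId);
  }

  /**
   * Returns up to maxLimit close actions to send with the next heartbeat.
   * Actions whose pipeline is no longer present on the datanode are dropped;
   * all others stay queued, so they are sent again on later heartbeats.
   */
  synchronized List<String> getActionsToSend(Set<String> pipelinesOnDatanode,
      int maxLimit) {
    // Drop only the actions whose pipeline has already been removed.
    pending.removeIf(id -> !pipelinesOnDatanode.contains(id));

    List<String> toSend = new ArrayList<>();
    for (String id : pending) {
      if (toSend.size() >= maxLimit) {
        break;
      }
      toSend.add(id);
    }
    return toSend;
  }

  synchronized int size() {
    return pending.size();
  }
}
```

In this sketch, sending an action does not dequeue it. An action leaves the queue only once its pipeline no longer appears among the pipelines present on the datanode, so a request dropped by the SCM is simply repeated with a later heartbeat.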

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-9735

How was this patch tested?

Unit Test.

@errose28 errose28 self-requested a review November 21, 2023 01:00
Contributor

@sodonnel sodonnel left a comment

This change LGTM. I am a little concerned that we retrieve the pipelineList for each action, as it is a newly formed list of new objects each time getPipelineReport() is called (see XceiverServerRatis), and then we have to scan that list for each pipelineAction. However, I think these lists should be small, so performance should not be a concern.
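If these lists ever did grow large, one possible way to avoid the repeated retrieval and scan would be to fetch the report once and index it by ID. The sketch below only illustrates that idea and is not part of this PR: `PipelineActionFilter`, `retainActive`, and the `Supplier` standing in for `getPipelineReport()` are hypothetical, and pipeline IDs are modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical helper: filters actions against a single report snapshot. */
final class PipelineActionFilter {
  private PipelineActionFilter() {
  }

  /**
   * Returns the action pipeline IDs that are still present in the report.
   * The report supplier (an assumed stand-in for getPipelineReport()) is
   * called exactly once, and membership checks use a hash set instead of
   * scanning the list for every action.
   */
  static List<String> retainActive(List<String> actionPipelineIds,
      Supplier<List<String>> pipelineReport) {
    Set<String> reported = new HashSet<>(pipelineReport.get());
    List<String> active = new ArrayList<>();
    for (String id : actionPipelineIds) {
      if (reported.contains(id)) {
        active.add(id);
      }
    }
    return active;
  }
}
```

For small lists, as the comment expects, the simpler per-action scan is likely just as good.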

@adoroszlai
Contributor

TestStateContext.testActionAPIs  Time elapsed: 0.026 s  <<< ERROR!
java.lang.NullPointerException
	at org.apache.hadoop.ozone.container.common.statemachine.StateContext.getPendingPipelineAction(StateContext.java:594)
	at org.apache.hadoop.ozone.container.common.statemachine.TestStateContext.testActionAPIs(TestStateContext.java:468)

seems related

@SaketaChalamchala
Contributor Author

Thanks for the review @sodonnel and @adoroszlai.
Fixed the failing test.

@sodonnel sodonnel merged commit cf47339 into apache:master Nov 22, 2023
jojochuang pushed a commit to jojochuang/ozone that referenced this pull request Feb 1, 2024
…il pipeline is removed. (apache#5643)

(cherry picked from commit cf47339)
Change-Id: Ic1ece669a42fd1be22a0ae52df0e66f666bbb6c4