HDDS-2281. ContainerStateMachine#handleWriteChunk should ignore close container exception #54

bshashikant · 2019-10-18T12:15:30Z

Because write chunk operations run in parallel on a datanode, a writeChunk issued as part of writeStateMachineData may fail with CloseContainerException. This causes a log append failure in Ratis, which in turn triggers a pipeline close action on the datanode and results in frequent destruction of pipelines in the system.

Currently, ContainerStateMachine#applyTransaction ignores the close container exception. Similarly, the ContainerStateMachine#handleWriteChunk call should also ignore the close container exception.
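The intended behaviour can be illustrated with a minimal, self-contained sketch. It is not the patch itself: `WriteChunkSketch`, the `Result` enum, and `onWriteChunkResult` are hypothetical names chosen for illustration, and only the `stateMachineHealthy` flag mirrors a field visible in the diff under review. The assumption is that a write which fails only because the container was closed should not be treated as a state machine failure.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for write-chunk result handling; not the actual Ozone code.
class WriteChunkSketch {
  // Assumed result codes, for illustration only.
  enum Result { SUCCESS, CLOSED_CONTAINER_IO, IO_EXCEPTION }

  final AtomicBoolean stateMachineHealthy = new AtomicBoolean(true);
  final AtomicLong numWriteDataFails = new AtomicLong();

  /** Returns true if the log append may proceed, false if it must fail. */
  boolean onWriteChunkResult(Result result) {
    if (result == Result.SUCCESS || result == Result.CLOSED_CONTAINER_IO) {
      // A closed container is an expected race with parallel writes: ignore it.
      return true;
    }
    // Any other failure is counted and marks the state machine unhealthy.
    numWriteDataFails.incrementAndGet();
    stateMachineHealthy.set(false);
    return false;
  }
}
```

In this sketch a `CLOSED_CONTAINER_IO` result leaves both the failure counter and the health flag untouched, so a concurrent close cannot by itself trigger a pipeline close action.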

The patch was tested by adding a unit test that allocates a container and writes to it from multiple threads in parallel, while one thread closes the container at a random point. The test verifies that closing the container does not mark the stateMachine unhealthy, that new snapshots can still be taken, and that the pipeline does not halt.


anuengineer · 2019-10-21T16:57:43Z

@bshashikant can we please fill in the JIRA template? That helps people who read this JIRA understand what it is about. I was reading the JIRA description and the patch and could not make head or tail of it.

@mukul1987 when you commit or review, can you please comment on this?

anuengineer · 2019-10-21T16:58:34Z

...a/org/apache/hadoop/ozone/container/common/transport/server/ratis/ContainerStateMachine.java

+            metrics.incNumWriteDataFails();
+            // write chunks go in parallel. It's possible that one write chunk
+            // see the stateMachine is marked unhealthy by other parallel thread.
+            stateMachineHealthy.set(false);


So question; if a thread has marked the container as unhealthy why should a write be successful at all ?

I know this patch is merged, but I have no way of understanding what this means -- so I appreciate some comments or feedback that explains what happens here?

> So question; if a thread has marked the container as unhealthy why should a write be successful at all ?

If a container is marked unhealthy, the write will be marked as failed and the log append will fail in Ratis. This code is just incrementing the failure count metric and marking the stateMachine for the pipeline unhealthy so that no new Ratis snapshots can be taken.
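The relationship between the health flag and snapshots described in this reply can be sketched as follows. This is a simplified illustration under stated assumptions, not the ContainerStateMachine implementation: `SnapshotGateSketch`, `applied`, and the use of `IllegalStateException` are hypothetical, and the sketch only assumes that an unhealthy state machine refuses to take a snapshot.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a health-gated snapshot; names are illustrative only.
class SnapshotGateSketch {
  final AtomicBoolean stateMachineHealthy = new AtomicBoolean(true);
  private volatile long lastAppliedIndex = 0L;

  // Records the index of the most recently applied log entry.
  void applied(long index) {
    lastAppliedIndex = index;
  }

  /** Returns the index captured by the snapshot, or throws if unhealthy. */
  long takeSnapshot() {
    if (!stateMachineHealthy.get()) {
      // Assumed behaviour: never snapshot state that may be inconsistent.
      throw new IllegalStateException(
          "state machine is unhealthy, refusing to take a snapshot");
    }
    return lastAppliedIndex;
  }
}
```

Once `stateMachineHealthy` is set to false in this sketch, every later `takeSnapshot()` call fails, which is why ignoring an expected close container failure matters for keeping snapshots available.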

> I know this patch is merged, but I have no way of understanding what this means -- so I appreciate some comments or feedback that explains what happens here?

Updated the description to add clarity to this.

How will an unhealthy pipeline be recovered? I got a lot of exceptions because the pipeline is marked as unhealthy and remains in the unhealthy state...

bshashikant added 2 commits October 11, 2019 01:53

HDDS-2281. ContainerStateMachine#handleWriteChunk should ignore close…

a4013be

… container exception.

Addressed review comments.

49e0974

bshashikant requested a review from mukul1987 October 18, 2019 12:15

bshashikant changed the title ~~Hdds 2281. ContainerStateMachine#handleWriteChunk should ignore close container exception~~ HDDS-2281. ContainerStateMachine#handleWriteChunk should ignore close container exception Oct 18, 2019

mukul1987 approved these changes Oct 20, 2019

mukul1987 merged commit bfaa640 into apache:master Oct 20, 2019

anuengineer reviewed Oct 21, 2019


anuengineer Oct 21, 2019

anuengineer Oct 21, 2019

bshashikant Oct 22, 2019

bshashikant Oct 22, 2019

elek Oct 28, 2019
