HDDS-2281. ContainerStateMachine#handleWriteChunk should ignore close container exception #54
@bshashikant can we please fill in the JIRA template? That helps people who read this JIRA understand what it is about. I read the JIRA description and the patch and was not able to make head or tail of it. @mukul1987 when you commit or review, can you please comment on this?
metrics.incNumWriteDataFails();
// Write chunks run in parallel. It's possible that one write chunk
// sees the stateMachine already marked unhealthy by another parallel thread.
stateMachineHealthy.set(false);
So, a question: if a thread has marked the container as unhealthy, why should a write succeed at all?
I know this patch is merged, but I have no way of understanding what this means -- so I appreciate some comments or feedback that explains what happens here?
So, a question: if a thread has marked the container as unhealthy, why should a write succeed at all?
If a container is marked unhealthy, the write will be marked as failed and the log append will fail in Ratis. This code is just incrementing the failure count metric and marking the stateMachine for the pipeline unhealthy so that no new Ratis snapshots can be taken.
I know this patch is merged, but I have no way of understanding what this means -- so I appreciate some comments or feedback that explains what happens here?
Updated the description to add clarity to this.
How will an unhealthy pipeline be recovered? I got a lot of exceptions because the pipeline is marked as unhealthy and remains in the unhealthy state...
Because write chunks execute in parallel on a datanode, a writeChunk issued as part of writeStateMachineData may fail with CloseContainerException. This causes a log append failure in Ratis, which triggers a pipeline close action on the datanode and results in frequent destruction of pipelines in the system.
Currently, ContainerStateMachine#applyTransaction ignores the close container exception. Similarly, ContainerStateMachine#handleWriteChunk should also ignore the close container exception.
The patch was tested by adding a unit test that allocates a container and writes to it from multiple threads in parallel while one thread closes the container at a random point. The test verifies that the close does not cause the stateMachine to be marked unhealthy, that new snapshots can still be taken, and that the pipeline does not halt.
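
For readers trying to follow the discussion, here is a minimal, self-contained sketch of the behaviour described above. It is not the actual Ozone code: the `WriteChunkSketch` class, the `Result` enum, and the `runParallelWrites` helper are hypothetical stand-ins, and the assumption that a close container result lets the log append proceed is taken from the description, not from the patch itself. Only the names `stateMachineHealthy` and the fail counter mirror the snippet quoted in the review.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

public class WriteChunkSketch {
  /** Hypothetical result codes; not the actual Ozone protocol enum. */
  public enum Result { SUCCESS, CONTAINER_CLOSED, IO_ERROR }

  private final AtomicBoolean containerOpen = new AtomicBoolean(true);
  private final AtomicBoolean stateMachineHealthy = new AtomicBoolean(true);
  private final AtomicLong numWriteDataFails = new AtomicLong();

  /** Simulated chunk write: rejected once the container has been closed. */
  public Result writeChunk() {
    return containerOpen.get() ? Result.SUCCESS : Result.CONTAINER_CLOSED;
  }

  public void closeContainer() {
    containerOpen.set(false);
  }

  /** Returns true if the log append may proceed, false if it should fail. */
  public boolean handleWriteChunkResult(Result result) {
    if (result == Result.SUCCESS || result == Result.CONTAINER_CLOSED) {
      // Assumption: a closed container is an expected race with a parallel
      // close, so it is ignored here instead of failing the log append.
      return true;
    }
    // Any other failure is counted and marks the state machine unhealthy,
    // which in the discussion above blocks further snapshots.
    numWriteDataFails.incrementAndGet();
    stateMachineHealthy.set(false);
    return false;
  }

  public boolean isHealthy() {
    return stateMachineHealthy.get();
  }

  public long getNumWriteDataFails() {
    return numWriteDataFails.get();
  }

  /** Runs parallel writers plus one closer; returns the final health flag. */
  public static boolean runParallelWrites(int writers, int writesPerThread) {
    WriteChunkSketch sm = new WriteChunkSketch();
    ExecutorService pool = Executors.newFixedThreadPool(writers + 1);
    try {
      List<Future<?>> futures = new ArrayList<>();
      Runnable writer = () -> {
        for (int j = 0; j < writesPerThread; j++) {
          sm.handleWriteChunkResult(sm.writeChunk());
        }
      };
      for (int i = 0; i < writers; i++) {
        futures.add(pool.submit(writer));
      }
      Runnable closer = sm::closeContainer;
      futures.add(pool.submit(closer));
      for (Future<?> f : futures) {
        f.get();
      }
      return sm.isHealthy();
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      pool.shutdown();
    }
  }
}
```

In this sketch, `WriteChunkSketch.runParallelWrites(4, 1000)` returns true regardless of when the closer thread runs, because only results other than `SUCCESS` and `CONTAINER_CLOSED` flip the health flag. That is the property the unit test in the description is said to verify.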