HDDS-6083. EC: Fix flakyness of tests around nodefailures.#2911

Closed
fapifta wants to merge 3 commits into apache:HDDS-3816-ec from fapifta:HDDS-6083

Conversation

Contributor

@fapifta fapifta commented Dec 11, 2021

What changes were proposed in this pull request?

The problem is with the selection of failing nodes in TestOzoneECClient#testNodeFailuresWhileWriting.
If we write only two chunks after a failure, and we select, for example, node 3 in the block group to fail, then the write will not fail, as we do not actually write to node 3 in this case.
Similarly, if we write one chunk after a failure, and we select node 2 or node 3 to fail, the write will not fail either, as we do not actually write to these nodes.

The selection logic itself is simple: we get the key set of the storages map from the MockXceiverClientFactory, which maps DatanodeDetails to MockDatanodeStorage instances, then we iterate over the key set and set numFailureToInject nodes to fail.
I have modified this logic to sort the key set and put the nodes that we will not write to at the back of the list.
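The modified selection logic might look roughly like the following. This is an illustrative sketch, not the actual Ozone test code: the class and method names (FailingNodeSelection, selectNodesToFail) are hypothetical, plain UUIDs stand in for DatanodeDetails, and it assumes the node's block-group index lives in the most significant bits of its UUID, as described above.

```java
import java.util.*;

public class FailingNodeSelection {

  // Choose numFailuresToInject nodes to fail, but only among the first
  // writtenNodeCount nodes of the block group, i.e. the ones the test
  // actually writes to.
  static List<UUID> selectNodesToFail(
      Set<UUID> nodes, int writtenNodeCount, int numFailuresToInject) {
    List<UUID> sorted = new ArrayList<>(nodes);
    // Sorting by UUID orders nodes by their block-group index, so the
    // nodes we never write to end up at the back of the list.
    Collections.sort(sorted);
    // Restrict the random pick to nodes that are actually written to.
    List<UUID> candidates =
        new ArrayList<>(sorted.subList(0, writtenNodeCount));
    Collections.shuffle(candidates);
    return new ArrayList<>(candidates.subList(0, numFailuresToInject));
  }

  public static void main(String[] args) {
    Set<UUID> group = new HashSet<>();
    for (int i = 0; i < 5; i++) {
      group.add(new UUID(i, 0)); // index encoded in the most significant bits
    }
    // Only 2 nodes are written to; the single injected failure must hit
    // one of them.
    System.out.println(selectNodesToFail(group, 2, 1));
  }
}
```

Sorting pushes the nodes we never write to behind the written-to ones, while the shuffle keeps the selection among the written-to nodes random across runs.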

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-6083

How was this patch tested?

I put the test in a loop and ran it 1000 times, evaluating the nodes selected to fail. I saw every node that we write to get selected, but none of the nodes that we do not write to, when the testNodeFailuresWhileWriting method is called with parameters (2, 1), (2, 2), (1, 1), and (1, 2). I also ran all tests in TestOzoneECClient.

Contributor


How about changing MockXceiverClientFactory#storage to a LinkedHashMap instead of a HashMap?
That should maintain the ordering.
If that does not solve it, then I would suggest moving this sorting logic into the getStorages method, so that anybody else invoking it will also get the nodes in order.
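For reference, here is a minimal sketch of the difference the reviewer is pointing at: a LinkedHashMap iterates its keys in insertion order, while a plain HashMap gives no ordering guarantee. The "dn-*" keys below are stand-ins for DatanodeDetails, not the actual test code.

```java
import java.util.*;

public class OrderedStorageMap {

  // Returns the key set of an insertion-ordered map as a list; with a
  // LinkedHashMap the order is deterministic, unlike with a HashMap.
  static List<String> orderedKeys() {
    Map<String, String> storage = new LinkedHashMap<>();
    storage.put("dn-1", "storage-1");
    storage.put("dn-2", "storage-2");
    storage.put("dn-3", "storage-3");
    return new ArrayList<>(storage.keySet());
  }

  public static void main(String[] args) {
    System.out.println(orderedKeys()); // prints [dn-1, dn-2, dn-3]
  }
}
```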

Contributor Author


Do we want to maintain the ordering? :)
With the randomness here, I believe we gain that some test runs exercise parity failures while others exercise data failures. Flaky results might still be a problem; I think it is more tedious to test all possible scenarios, which is why I chose this way of fixing it.
Though I think it is still not optimal, as it relies on the internal representation where we store the index of the DN in the most significant bits of the DataNode's UUID. So I am open to any better idea, but if we order the nodes and always select the first n to fail in this method, that might hide some of the possible issues that we could uncover more easily with a proper random selection.

Contributor


As we discussed, let's just hold this JIRA until we commit HDDS-6036.
HDDS-6036 has been refactored to provide more flexibility to inject failures into specific nodes of the block group.

Contributor Author


Yep, moving this one to draft; I will probably force-push a new version later on, after #2910 is merged. Thank you!

@fapifta fapifta marked this pull request as draft December 15, 2021 22:01
@fapifta fapifta closed this Feb 2, 2022