HDDS-6083. EC: Fix flakyness of tests around nodefailures.#2911
HDDS-6083. EC: Fix flakyness of tests around nodefailures.#2911fapifta wants to merge 3 commits intoapache:HDDS-3816-ecfrom
Conversation
There was a problem hiding this comment.
How about changing the MockXceiverClientFactory#storage to LinkedHashMap instead of HashMap?
That should maintain the ordering.
If that does not solve, then I would suggest to move this sorting logic to getStorages method, so that anybody else invoke also will get in order.
There was a problem hiding this comment.
Do we want to maintain the ordering? :)
With the randomness here, I believe we gain that there are test runs that tests parity failures, and there are test runs that tests data failures. Flaky results still might be a problem, I think it is more tedious to test all possible scenarios, that is why I chose this way of fixing it.
Though I think it is still not optimal, as it relies on the internal representation where we store the index of the DN in the most significant bits of the uuid of the DataNode, so I am open to any better idea, but if we order the nodes, and always select the first n to fail in this method, that might hide some of the possible issues that we can uncover easier with a proper random selection.
There was a problem hiding this comment.
yepp, moving this one to draft, and probably I will force-push a new version later on after #2910 is merged. Thank you!
What changes were proposed in this pull request?
The problem is with the selection of failing nodes in TestOzoneECClient#testNodeFailuresWhileWriting.
If we write two chunks only after a failure, and If we select for example node3 in the block group to fail, then the write will not fail, as we don't actually write to node3 in this case.
Similarly, if we write one chunk after a failure, and if we select node 2 or node 3 to fail, then as we don't actually will write to these nodes the write won't fail.
The selection logic itself is easy, we get the keyset of the storages map from the MockXCieverClientFactory which maps DataNodeDetails to MockDataNodeStorages, and we go over the keyset, and set numFailureToInject nodes to fail.
I have modified this logic to sort the keyset and put the nodes that we won't write to at the back of the list.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-6083
How was this patch tested?
Put up a loop, and ran the test 1000 times, evaluated the nodes selected to fail, and I saw all the nodes that we write to as selected, but none of the nodes that we won't write to, in case when testNodeFailuresWhileWriting method is called with parameters (2, 1), (2, 2), (1, 1), (1, 2). Also ran all tests in TestOzoneECClient.