HDDS-6083. EC: Fix flakyness of tests around nodefailures.#2911

Closed
fapifta wants to merge 3 commits into apache:HDDS-3816-ec from fapifta:HDDS-6083

Conversation

Contributor

@fapifta fapifta commented Dec 11, 2021

What changes were proposed in this pull request?

The problem is with the selection of failing nodes in TestOzoneECClient#testNodeFailuresWhileWriting.
If we write only two chunks after a failure, and we select, for example, node 3 in the block group to fail, then the write will not fail, as we do not actually write to node 3 in this case.
Similarly, if we write one chunk after a failure, and we select node 2 or node 3 to fail, the write will not fail either, as we do not actually write to these nodes.

The selection logic itself is simple: we get the key set of the storages map from the MockXceiverClientFactory, which maps DatanodeDetails to MockDatanodeStorage instances, then we iterate over the key set and set numFailureToInject nodes to fail.
I have modified this logic to sort the key set and put the nodes that we will not write to at the back of the list.
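The modified selection logic might look roughly like the following. This is an illustrative sketch, not the actual Ozone test code: the class and method names (FailingNodeSelection, selectNodesToFail) are hypothetical, plain UUIDs stand in for DatanodeDetails, and it assumes the node's block-group index lives in the most significant bits of its UUID, as described above.

```java
import java.util.*;

public class FailingNodeSelection {

  // Choose numFailuresToInject nodes to fail, but only among the first
  // writtenNodeCount nodes of the block group, i.e. the ones the test
  // actually writes to.
  static List<UUID> selectNodesToFail(
      Set<UUID> nodes, int writtenNodeCount, int numFailuresToInject) {
    List<UUID> sorted = new ArrayList<>(nodes);
    // Sorting by UUID orders nodes by their block-group index, so the
    // nodes we never write to end up at the back of the list.
    Collections.sort(sorted);
    // Restrict the random pick to nodes that are actually written to.
    List<UUID> candidates =
        new ArrayList<>(sorted.subList(0, writtenNodeCount));
    Collections.shuffle(candidates);
    return new ArrayList<>(candidates.subList(0, numFailuresToInject));
  }

  public static void main(String[] args) {
    Set<UUID> group = new HashSet<>();
    for (int i = 0; i < 5; i++) {
      group.add(new UUID(i, 0)); // index encoded in the most significant bits
    }
    // Only 2 nodes are written to; the single injected failure must hit
    // one of them.
    System.out.println(selectNodesToFail(group, 2, 1));
  }
}
```

Sorting pushes the nodes we never write to behind the written-to ones, while the shuffle keeps the selection among the written-to nodes random across runs.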

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-6083

How was this patch tested?

I put the test in a loop and ran it 1000 times, evaluating the nodes selected to fail. I saw every node that we write to get selected, but none of the nodes that we do not write to, when the testNodeFailuresWhileWriting method is called with parameters (2, 1), (2, 2), (1, 1), and (1, 2). I also ran all tests in TestOzoneECClient.

Contributor


How about changing MockXceiverClientFactory#storage to a LinkedHashMap instead of a HashMap?
That should maintain the ordering.
If that does not solve it, then I would suggest moving this sorting logic into the getStorages method, so that anybody else invoking it will also get the nodes in order.
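For reference, here is a minimal sketch of the difference the reviewer is pointing at: a LinkedHashMap iterates its keys in insertion order, while a plain HashMap gives no ordering guarantee. The "dn-*" keys below are stand-ins for DatanodeDetails, not the actual test code.

```java
import java.util.*;

public class OrderedStorageMap {

  // Returns the key set of an insertion-ordered map as a list; with a
  // LinkedHashMap the order is deterministic, unlike with a HashMap.
  static List<String> orderedKeys() {
    Map<String, String> storage = new LinkedHashMap<>();
    storage.put("dn-1", "storage-1");
    storage.put("dn-2", "storage-2");
    storage.put("dn-3", "storage-3");
    return new ArrayList<>(storage.keySet());
  }

  public static void main(String[] args) {
    System.out.println(orderedKeys()); // prints [dn-1, dn-2, dn-3]
  }
}
```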

Contributor Author


Do we want to maintain the ordering? :)
With the randomness here, I believe we gain that some test runs exercise parity failures while others exercise data failures. Flaky results might still be a problem; I think it is more tedious to test all possible scenarios, which is why I chose this way of fixing it.
Though I think it is still not optimal, as it relies on the internal representation where we store the index of the DN in the most significant bits of the DataNode's UUID. So I am open to any better idea, but if we order the nodes and always select the first n to fail in this method, that might hide some of the possible issues that we could uncover more easily with a proper random selection.

Contributor


As we discussed, let's just hold this JIRA until we commit HDDS-6036.
HDDS-6036 has been refactored to provide more flexibility to inject failures into specific nodes of the block group.

Contributor Author


Yep, moving this one to draft; I will probably force-push a new version later on, after #2910 is merged. Thank you!

@fapifta fapifta marked this pull request as draft December 15, 2021 22:01
@fapifta fapifta closed this Feb 2, 2022