Revert #3737's EOFException changes by ctubbsii · Pull Request #3789 · apache/accumulo

ctubbsii · 2023-09-28T20:13:10Z

This commit partially reverts b1b2557 from #3737, to restore the previous infinite retry behavior when an EOFException occurs in the transport. While this EOFException can indicate a fatal error that should not be retried, it also occurs during transient network failures, where a retry should occur. There is no easy way to detect which of these two cases is occurring. Since transient network failures are routine, and the fatal error behind the issue that #3737 tried to address has a workaround via configuration, this commit restores the previous behavior, which assumes the exception is caused by a transient issue.

This fixes #3762

This fixes the issue in #3762, where ManagerRepairsDualAssignmentIT intentionally triggered a transient network failure, which caused a scan to fail and, in turn, the IT as a whole to fail.
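
For illustration, the restored policy can be sketched roughly as follows. This is a minimal sketch using only the JDK, not the Accumulo code: the class `EofRetrySketch`, the methods `causedByEof` and `retryOnEof`, and the fixed sleep interval are all hypothetical names and choices made for this example. It shows only the idea that an EOFException anywhere in the cause chain is assumed to be transient and retried without limit.

```java
import java.io.EOFException;
import java.util.concurrent.Callable;

// Hypothetical sketch; none of these names come from Accumulo or Thrift.
public class EofRetrySketch {

  /** Returns true if an EOFException appears anywhere in the cause chain. */
  public static boolean causedByEof(Throwable t) {
    // Bound the walk so a cyclic cause chain cannot loop forever.
    for (int depth = 0; t != null && depth < 32; depth++) {
      if (t instanceof EOFException) {
        return true;
      }
      t = t.getCause();
    }
    return false;
  }

  /** Retries op, with no attempt limit, for as long as it fails because of an EOFException. */
  public static <T> T retryOnEof(Callable<T> op, long sleepMillis) throws Exception {
    while (true) {
      try {
        return op.call();
      } catch (Exception e) {
        if (!causedByEof(e)) {
          throw e; // not the assumed-transient case; surface it to the caller
        }
        Thread.sleep(sleepMillis); // assumed transient: wait, then try again
      }
    }
  }

  public static void main(String[] args) throws Exception {
    int[] attempts = {0};
    // Simulate a transport that fails twice with EOF and then recovers.
    String result = retryOnEof(() -> {
      if (++attempts[0] < 3) {
        throw new RuntimeException("transport failed", new EOFException());
      }
      return "ok";
    }, 10L);
    System.out.println(result + " after " + attempts[0] + " attempts"); // prints "ok after 3 attempts"
  }
}
```

The trade-off is visible in the loop: if the EOFException is actually caused by a fatal condition, this sketch never returns, which is why the configuration workaround matters.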


ctubbsii · 2023-09-28T20:15:23Z

@EdColeman I think this fix supersedes #3771 and #3776

EdColeman · 2023-09-29T00:32:47Z

I have no issue with this superseding #3771 and #3776, as long as the test becomes stable.
(#3771 should be closed in favor of #3776 anyway.)

We may want to consider a specific test that can create and test for the condition, rather than hoping to hit it by chance in ManagerRepairsDualAssignmentIT. Changing how the test kills the tserver and checks that it has been reported dead, as is done in #3776, may be better - but that would mask hitting the thrift change.

ctubbsii · 2023-09-29T01:03:21Z

> I have no issue with this superseding #3771 and #3776, as long as the test becomes stable. (#3771 should be closed in favor of #3776 anyway.)

Okay, I'll go ahead and merge it and close the others, then.

> We may want to consider a specific test that can create and test for the condition, rather than hoping to hit it by chance in ManagerRepairsDualAssignmentIT. Changing how the test kills the tserver and checks that it has been reported dead, as is done in #3776, may be better - but that would mask hitting the thrift change.

I had similar thoughts myself. For now, I'm inclined to leave the IT as it is, and close the other issues, rather than modify it in a way that would mask this issue and create a dedicated test for it. I think this issue basically was just an extension of the code reviews for #3737, resulting in us changing our mind on a portion of that change prior to a release of it. We're just slightly rolling back to the previous status quo, where everything was fine. So, I'm not terribly inclined to do much more than just roll that one small change back out.

ivakegg · 2023-10-03T15:10:13Z

So I guess now, if a client hits the EOFException because the max message size was exceeded, it will retry indefinitely, which was one of the symptoms we were trying to avoid. Given that we can now set the max message size along with the max frame size, this can perhaps be avoided. However, that is not really a good scenario.
On the flip side, not retrying on an EOFException caused by a network or datanode failure could also be problematic. I need to scan some running systems to see how often this happens and what the implications are.

ctubbsii · 2023-10-03T20:59:15Z

I also think we should just set the max message size to the max possible, at least by default, so users don't hit this. It may be possible to supply a patch upstream to force the max message size limit to appear as a different exception type than EOFException, so they can be distinguished from other types of transient network errors. It's a little weird that they throw EOFException for that scenario in the first place.
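
To make the upstream idea concrete, here is a rough sketch of how a client could decide between retrying and failing if the size limit were reported with its own exception type. Everything here is an assumption for illustration: `MessageSizeExceededException` does not exist in Thrift or Accumulo, and `TransportErrorSketch`, `Action`, and `classify` are hypothetical names. Treating every unrecognized error as fatal is also just a simplification for this sketch.

```java
import java.io.EOFException;
import java.io.IOException;

// Hypothetical sketch; none of these names come from Accumulo or Thrift.
public class TransportErrorSketch {

  /** Hypothetical dedicated type for "message larger than the configured limit". */
  public static class MessageSizeExceededException extends IOException {
    public MessageSizeExceededException(String msg) {
      super(msg);
    }
  }

  public enum Action { RETRY, FAIL }

  /** Walks the cause chain and picks an action based on the first recognized type. */
  public static Action classify(Throwable t) {
    // Bound the walk so a cyclic cause chain cannot loop forever.
    for (int depth = 0; t != null && depth < 32; depth++) {
      if (t instanceof MessageSizeExceededException) {
        return Action.FAIL; // retrying cannot help: the same message would be rejected again
      }
      if (t instanceof EOFException) {
        return Action.RETRY; // assumed to be a transient network failure
      }
      t = t.getCause();
    }
    return Action.FAIL; // other errors are out of scope for this sketch
  }

  public static void main(String[] args) {
    Throwable transientFailure = new RuntimeException(new EOFException());
    Throwable sizeFailure =
        new RuntimeException(new MessageSizeExceededException("message too large"));
    System.out.println(classify(transientFailure)); // prints "RETRY"
    System.out.println(classify(sizeFailure)); // prints "FAIL"
  }
}
```

With a distinct type like this, the infinite retry could stay in place for genuine EOF conditions while the size-limit case fails fast.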

ctubbsii requested review from EdColeman and dlmarion September 28, 2023 20:13

ctubbsii self-assigned this Sep 28, 2023

ctubbsii linked an issue Sep 28, 2023 that may be closed by this pull request

Broken or Flaky test: ManagerRepairsDualAssignmentIT #3762

Closed

ctubbsii merged commit 8b31b4f into apache:2.1 Sep 29, 2023

ctubbsii deleted the fix-3762-restore-eofexception-behavior branch September 29, 2023 01:05

This was referenced Sep 29, 2023

Count dead tservers to continue in ManagerRepairsDualAssignmentIT #3776

Closed

update test to handle connect error when tserver killed #3771

Closed

ctubbsii modified the milestones: 3.1.0, 2.1.3 Jul 12, 2024
