Skip to content

Comments

Revert #3737's EOFException changes#3789

Merged
ctubbsii merged 1 commit intoapache:2.1from
ctubbsii:fix-3762-restore-eofexception-behavior
Sep 29, 2023
Merged

Revert #3737's EOFException changes#3789
ctubbsii merged 1 commit intoapache:2.1from
ctubbsii:fix-3762-restore-eofexception-behavior

Conversation

@ctubbsii
Copy link
Member

This commit partially reverts b1b2557 from #3737, to restore the previous infinite retry behavior when an EOFException occurs in the transport. While this EOFException can be an indicator of a fatal exception that should not be retried, it also occurs during transient network failures, where a retry should occur. It is not possible to easily detect which of these two cases is ocurring. Since transient network failures are routine, and the fatal exception that caused the issue that #3737 tried to address has a workaround via configuration, this commit restores the previous behavior that assumes the exception is caused by a transient issue.

This fixes #3762

This fixes the issue in #3762, where the ManagerRepairsDualAssignmentIT intentionally triggered a transient network failure, and therefore triggered a scan to fail, causing the IT as a whole to fail.

This commit partially reverts b1b2557
from apache#3737, to restore the previous infinite retry behavior when an
EOFException occurs in the transport. While this EOFException can be an
indicator of a fatal exception that should not be retried, it also
occurs during transient network failures, where a retry should occur. It
is not possible to easily detect which of these two cases is ocurring.
Since transient network failures are routine, and the fatal exception
that caused the issue that apache#3737 tried to address has a workaround via
configuration, this commit restores the previous behavior that assumes
the exception is caused by a transient issue.

This fixes apache#3762

This fixes the issue in apache#3762, where the ManagerRepairsDualAssignmentIT
intentionally triggered a transient network failure, and therefore
triggered a scan to fail, causing the IT as a whole to fail.
@ctubbsii ctubbsii self-assigned this Sep 28, 2023
@ctubbsii ctubbsii linked an issue Sep 28, 2023 that may be closed by this pull request
@ctubbsii
Copy link
Member Author

@EdColeman I think this fix supersedes #3771 and #3776

@EdColeman
Copy link
Contributor

I have no issue with this superseding the #3771 and #3776. As long as the test becomes stable.
(#3771 should be closed in favor of #3776 anyway.)

We may want to consider a specific test that can create / test for the condition rather than hoping to hit it by chance in ManagerRepairsDualAssignmentIT Changing the way killing and testing that it has been reported dead may be better as is done in #3776 - but that would mask hitting the thrift change.

@ctubbsii
Copy link
Member Author

I have no issue with this superseding the #3771 and #3776. As long as the test becomes stable. (#3771 should be closed in favor of #3776 anyway.)

Okay, I'll go ahead and merge it and close the others, then.

We may want to consider a specific test that can create / test for the condition rather than hoping to hit it by chance in ManagerRepairsDualAssignmentIT Changing the way killing and testing that it has been reported dead may be better as is done in #3776 - but that would mask hitting the thrift change.

I had similar thoughts myself. For now, I'm inclined to leave the IT as it is, and close the other issues, rather than modify it in a way that would mask this issue and create a dedicated test for it. I think this issue basically was just an extension of the code reviews for #3737, resulting in us changing our mind on a portion of that change prior to a release of it. We're just slightly rolling back to the previous status quo, where everything was fine. So, I'm not terribly inclined to do much more than just roll that one small change back out.

@ctubbsii ctubbsii merged commit 8b31b4f into apache:2.1 Sep 29, 2023
@ctubbsii ctubbsii deleted the fix-3762-restore-eofexception-behavior branch September 29, 2023 01:05
@ivakegg
Copy link
Contributor

ivakegg commented Oct 3, 2023

So I guess now if a client hits the EOFException because the max message size was exceeded, then it will retry indefinitly which was one of the symptoms we were trying to avoid. Given we can now set the max size along with the max frame size this can perhaps be avoided. However that is not really a good scenario.
On the flip side, not retrying on an EOFException because a network or datanode failure could also be problematic. I need to scan some running systems to see how often this happens and the implications thereof.

@ctubbsii
Copy link
Member Author

ctubbsii commented Oct 3, 2023

I also think we should just set the max message size to the max possible, at least by default, so users don't hit this. It may be possible to supply a patch upstream to force the max message size limit to appear as a different exception type than EOFException, so they can be distinguished from other types of transient network errors. It's a little weird that they throw EOFException for that scenario in the first place.

@ctubbsii ctubbsii modified the milestones: 3.1.0, 2.1.3 Jul 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Broken or Flaky test: ManagerRepairsDualAssignmentIT

3 participants