Revert #3737's EOFException changes #3789
This commit partially reverts b1b2557 from apache#3737, to restore the previous infinite retry behavior when an EOFException occurs in the transport. While this EOFException can be an indicator of a fatal exception that should not be retried, it also occurs during transient network failures, where a retry should occur. It is not easy to detect which of these two cases is occurring. Since transient network failures are routine, and the fatal exception behind the issue that apache#3737 tried to address has a workaround via configuration, this commit restores the previous behavior, which assumes the exception is caused by a transient issue.
This fixes apache#3762, where the ManagerRepairsDualAssignmentIT intentionally triggered a transient network failure, which caused a scan to fail and, in turn, the IT as a whole to fail.
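For illustration only, the restored behavior can be sketched as an unbounded retry loop that treats every EOFException as transient. This is not the code touched by this change: `EofRetrySketch`, `TransportCall`, `retryOnEof`, `flakyCall`, and the backoff parameters are hypothetical names chosen for the sketch, and it uses only the JDK.

```java
import java.io.EOFException;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;

public class EofRetrySketch {

  /** Hypothetical stand-in for an RPC that can fail at the transport layer. */
  interface TransportCall<T> {
    T call() throws IOException;
  }

  /**
   * Retries indefinitely on EOFException, assuming the failure is transient.
   * Any other IOException propagates to the caller unchanged.
   */
  static <T> T retryOnEof(TransportCall<T> op, long initialBackoffMillis, long maxBackoffMillis)
      throws IOException, InterruptedException {
    long backoff = initialBackoffMillis;
    while (true) {
      try {
        return op.call();
      } catch (EOFException e) {
        // Cannot tell a dropped connection from a fatal condition here,
        // so assume it is transient: wait, then try again.
        Thread.sleep(backoff);
        backoff = Math.min(backoff * 2, maxBackoffMillis);
      }
    }
  }

  /** Simulates a call that hits EOFException a fixed number of times before succeeding. */
  static TransportCall<String> flakyCall(int failures, AtomicInteger attempts) {
    return () -> {
      if (attempts.incrementAndGet() <= failures) {
        throw new EOFException("simulated dropped connection");
      }
      return "scan complete";
    };
  }

  public static void main(String[] args) throws Exception {
    AtomicInteger attempts = new AtomicInteger();
    String result = retryOnEof(flakyCall(2, attempts), 1, 8);
    // prints "scan complete after 3 attempts"
    System.out.println(result + " after " + attempts.get() + " attempts");
  }
}
```

The trade-off is visible in the loop: if the EOFException is actually caused by a fatal condition, the loop never exits, which is why the configuration workaround matters.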
@EdColeman I think this fix supersedes #3771 and #3776
I have no issue with this superseding #3771 and #3776, as long as the test becomes stable. We may want to consider a specific test that can create / test for the condition rather than hoping to hit it by chance in
Okay, I'll go ahead and merge it and close the others, then.
I had similar thoughts myself. For now, I'm inclined to leave the IT as it is, and close the other issues, rather than modify it in a way that would mask this issue and create a dedicated test for it. I think this issue basically was just an extension of the code reviews for #3737, resulting in us changing our mind on a portion of that change prior to a release of it. We're just slightly rolling back to the previous status quo, where everything was fine. So, I'm not terribly inclined to do much more than just roll that one small change back out.
So I guess now, if a client hits the EOFException because the max message size was exceeded, it will retry indefinitely, which was one of the symptoms we were trying to avoid. Given we can now set the max message size along with the max frame size, this can perhaps be avoided. However, that is not really a good scenario.
I also think we should just set the max message size to the max possible, at least by default, so users don't hit this. It may be possible to supply a patch upstream to force the max message size limit to appear as a different exception type than EOFException, so they can be distinguished from other types of transient network errors. It's a little weird that they throw EOFException for that scenario in the first place.
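As a rough sketch of what a distinct exception type would allow, the classifier below fails fast on a size-limit error while still retrying a plain EOFException. Everything here is hypothetical: `MessageSizeExceededException` does not exist upstream, and `FailureClassifierSketch`, `Action`, and `classify` are names invented for the example.

```java
import java.io.EOFException;
import java.io.IOException;

public class FailureClassifierSketch {

  /** Hypothetical exception type; nothing with this name exists upstream. */
  static class MessageSizeExceededException extends IOException {
    MessageSizeExceededException(String message) {
      super(message);
    }
  }

  enum Action { RETRY, FAIL }

  /** Decides what a client could do once the two failure modes are distinguishable. */
  static Action classify(IOException e) {
    if (e instanceof MessageSizeExceededException) {
      // Resending the same oversized message cannot succeed, so stop.
      return Action.FAIL;
    }
    if (e instanceof EOFException) {
      // Assumed to be a transient network failure.
      return Action.RETRY;
    }
    // Other I/O errors are outside the scope of this sketch; fail fast.
    return Action.FAIL;
  }

  public static void main(String[] args) {
    System.out.println(classify(new EOFException("connection dropped")));         // prints RETRY
    System.out.println(classify(new MessageSizeExceededException("too large"))); // prints FAIL
  }
}
```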