Use custom Transport Factory to set Transport message and frame size #3737
Use a custom TFramedTransport.Factory implementation so that when getTransport is called, the frame and message size are set on the underlying configuration. Fixes apache#3731
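The idea in the description can be sketched roughly as follows. This is only an illustration under stated assumptions: `Config`, `BaseTransport`, `FramedTransport`, `FramedFactory`, and `SizedFramedFactory` are hypothetical stand-ins written for this sketch, not the real Thrift or Accumulo classes, and the default limit of 16 is made up. It shows a factory subclass that writes both limits onto the base transport's configuration inside `getTransport`, before delegating to the parent factory.

```java
// Stand-in types only: none of these are the real Thrift or Accumulo classes.
final class FramedFactorySketch {

  /** Hypothetical per-transport limits; the defaults are illustrative values. */
  static final class Config {
    int maxFrameSize = 16;
    int maxMessageSize = 16;
  }

  /** Hypothetical base transport (think: a socket) that owns its Config. */
  static final class BaseTransport {
    final Config config = new Config();
  }

  /** Hypothetical framed wrapper that enforces the limits of the wrapped transport. */
  static final class FramedTransport {
    private final BaseTransport base;

    FramedTransport(BaseTransport base) {
      this.base = base;
    }

    /** Rejects a frame that exceeds either limit on the underlying configuration. */
    void checkFrame(int size) {
      if (size > base.config.maxFrameSize) {
        throw new IllegalStateException("frame of " + size + " bytes exceeds maxFrameSize");
      }
      if (size > base.config.maxMessageSize) {
        throw new IllegalStateException("frame of " + size + " bytes exceeds maxMessageSize");
      }
    }
  }

  /** Default factory: wraps the base transport and leaves its limits untouched. */
  static class FramedFactory {
    FramedTransport getTransport(BaseTransport base) {
      return new FramedTransport(base);
    }
  }

  /** Custom factory: applies the configured size to the base transport, then wraps it. */
  static final class SizedFramedFactory extends FramedFactory {
    private final int maxSize;

    SizedFramedFactory(int maxSize) {
      this.maxSize = maxSize;
    }

    @Override
    FramedTransport getTransport(BaseTransport base) {
      // Set both limits on the underlying configuration before wrapping.
      base.config.maxFrameSize = maxSize;
      base.config.maxMessageSize = maxSize;
      return super.getTransport(base);
    }
  }

  public static void main(String[] args) {
    BaseTransport socket = new BaseTransport();
    FramedTransport framed = new SizedFramedFactory(1024).getTransport(socket);
    framed.checkFrame(512); // accepted because both limits were raised to 1024
    System.out.println("maxFrameSize=" + socket.config.maxFrameSize
        + ", maxMessageSize=" + socket.config.maxMessageSize);
  }
}
```

Setting the limits in `getTransport` rather than in a constructor means every transport produced by the factory carries the configured sizes, no matter how the base transport was created.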
Created as
Kicked off full IT build.
Full IT build passed.
cshannon
left a comment
LGTM, nice find. I hadn't looked at the Thrift code in a while and only saw references to maxFrameSize being set in the constructor, so I didn't think maxMessageSize would be easily configurable, but this is a great solution.
This is a great find. Thank you @dlmarion. I would like to point out that this only resolves half of the issue. The other half is that when this exception is thrown on the client side of the scan, the underlying logic will retry indefinitely, leaving the process hung. Granted, this PR will avoid the problem for the most part, but I worry that some other situation will cause the scan (or batch writer) to hang indefinitely.
Still looking at this this morning. Will be done by noon (EDT), and we can merge after if there are no other issues.
I merged 2.1 into this PR's branch, just to get it up-to-date before I finish reviewing.
* Put factory in own class
* Remove direct references to TFramedTransport.Factory, so it's easier to see the new one is used everywhere
* Fix spelling of transport
* Fix formatting/comments in ThriftMaxFrameSizeIT
ctubbsii
left a comment
Overall, this looks like a good set of changes. However, I could not find a good way to test any of it.
core/src/test/java/org/apache/accumulo/core/rpc/ThriftUtilTest.java
test/src/main/java/org/apache/accumulo/test/functional/ThriftMaxFrameSizeIT.java
ctubbsii
left a comment
So, I think the changes to the IT aren't good, for the reasons I give below. I'm going to spend a little time trying to polish it up, but I do think the non-test stuff is pretty much good.
core/src/main/java/org/apache/accumulo/core/clientImpl/TabletServerBatchReaderIterator.java
core/src/main/java/org/apache/accumulo/core/rpc/AccumuloTFramedTransportFactory.java
core/src/test/java/org/apache/accumulo/core/rpc/ThriftUtilTest.java
test/src/main/java/org/apache/accumulo/test/functional/ThriftMaxFrameSizeIT.java
* Add comment about upstream issue
* Check the exception message to make sure it's caused by the frame size
* Remove unneeded dependency declaration
* Refactor ThriftMaxFrameSizeIT to cover testing messages bigger and smaller than the configured value
Okay, I updated the IT to test for messages that are smaller than the configured amount (which should work), but also included the checks that @dlmarion had to verify it failed if the message was larger than the configured amount. So, we have both now. But, there are a few remaining issues:
I wrote a test using what I think is the correct stack on the client side to see the number of bytes that it ends up writing for a Mutation. It's not exhibiting the symptoms described above. Test:
I think it's related to the issue of the max message size counter not being reset when the transport is reused for some new messages. So, you won't see it for only a single message. I'm not sure how to verify this at the moment, though.
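If that hypothesis is right, the effect can be modeled with a small sketch. Everything here is an assumption made for illustration: `BudgetedTransport`, `consume`, `resetBudget`, and `messagesRead` are hypothetical names, not Thrift's actual classes or methods, and the sizes are arbitrary. The model keeps a "remaining bytes" budget that is only refilled when a reset is called at a message boundary.

```java
// A toy model of a per-message byte budget on a reused transport.
// All names are hypothetical; this is not Thrift's implementation.
final class MessageBudgetSketch {

  static final class BudgetedTransport {
    private final long maxMessageSize;
    private long remaining;

    BudgetedTransport(long maxMessageSize) {
      this.maxMessageSize = maxMessageSize;
      this.remaining = maxMessageSize;
    }

    /** Counts bytes against the budget and fails once the budget is exhausted. */
    void consume(long numBytes) {
      if (numBytes > remaining) {
        throw new IllegalStateException("max message size reached");
      }
      remaining -= numBytes;
    }

    /** What should happen at every message boundary. */
    void resetBudget() {
      remaining = maxMessageSize;
    }
  }

  /** Returns how many messages of the given size are read before a failure. */
  static int messagesRead(long maxMessageSize, long messageSize, int count,
      boolean resetBetweenMessages) {
    BudgetedTransport transport = new BudgetedTransport(maxMessageSize);
    int read = 0;
    for (int i = 0; i < count; i++) {
      if (resetBetweenMessages) {
        transport.resetBudget();
      }
      try {
        transport.consume(messageSize);
      } catch (IllegalStateException e) {
        break; // budget exhausted even though each message is under the limit
      }
      read++;
    }
    return read;
  }

  public static void main(String[] args) {
    // Each message is 40 bytes, well under the 100-byte limit.
    System.out.println("single message, no reset: " + messagesRead(100, 40, 1, false));
    System.out.println("five messages, no reset:   " + messagesRead(100, 40, 5, false));
    System.out.println("five messages, with reset: " + messagesRead(100, 40, 5, true));
  }
}
```

In this model a single message always succeeds, which would be consistent with a one-message test not showing the symptom, while a reused transport without resets fails on the third 40-byte message.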
…pache#3737)

Use a custom TFramedTransport.Factory implementation so that when getTransport is called, the frame and message size are set on the underlying configuration. This is a workaround for https://issues.apache.org/jira/browse/THRIFT-5732

Fixes apache#3731

Also:

* Throw EOFException when TTransportException type is END_OF_FILE
* Refactor ThriftMaxFrameSizeIT to cover testing messages bigger and smaller than the configured value and also to use separate mini dirs for each test
* Include stack trace in TabletServerBatchWriter log message for debugging
* Add default value for timeout.factor in Wait class to avoid error message and 24 timeout default in IDEs when the system property isn't set

---------

Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>
This commit partially reverts b1b2557 from apache#3737 to restore the previous infinite retry behavior when an EOFException occurs in the transport. While this EOFException can indicate a fatal exception that should not be retried, it also occurs during transient network failures, where a retry should occur. It is not easy to detect which of these two cases is occurring. Since transient network failures are routine, and the fatal exception behind the issue that apache#3737 tried to address has a workaround via configuration, this commit restores the previous behavior, which assumes the exception is caused by a transient issue.

This fixes apache#3762, where the ManagerRepairsDualAssignmentIT intentionally triggered a transient network failure, which caused a scan to fail and, in turn, the IT as a whole to fail.
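The trade-off described in this commit message can be illustrated with a minimal sketch. It assumes a hypothetical `Rpc` interface and `callWithRetry` helper invented for this example; it is not the Accumulo client code, and it omits the waiting between attempts that a real client would need.

```java
import java.io.EOFException;

// Hypothetical sketch of "treat EOFException as transient and retry".
final class EofRetrySketch {

  /** Hypothetical RPC call that may fail with an EOFException. */
  interface Rpc<T> {
    T call() throws EOFException;
  }

  /**
   * Retries until the call succeeds. The loop cannot tell a transient network
   * failure from a fatal condition, so it assumes the failure is transient.
   * If the cause is actually fatal, this loop never returns.
   */
  static <T> T callWithRetry(Rpc<T> rpc) {
    while (true) {
      try {
        return rpc.call();
      } catch (EOFException e) {
        // Assumed transient: fall through and try again.
      }
    }
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Simulated transient failure: the first two attempts hit EOF.
    String result = callWithRetry(() -> {
      calls[0]++;
      if (calls[0] <= 2) {
        throw new EOFException("connection closed");
      }
      return "scan result";
    });
    System.out.println(result + " after " + calls[0] + " attempts");
  }
}
```

The same loop is what leaves a client hung when the EOFException has a non-transient cause, which is why the configuration workaround matters for that case.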