-
Notifications
You must be signed in to change notification settings - Fork 444
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TabletServerBatchReaderIterator gets into infinite loop #3731
Comments
We are also seeing this with the BatchWriter. If one large mutation (> 49 * 1024 * 1024 bytes) or several mutations that add up to that same size or greater are sent, we enter an infinite loop. The BatchWriter client receives a “java.io.IOException: org.apache.thrift.transport.TTransportException: Socket is closed by peer.” here: https://github.com/apache/accumulo/blob/main/core/src/main/java/org/apache/accumulo/core/clientImpl/TabletServerBatchWriter.java#L897 For what it’s worth, Thrift 0.14 seems to have added this max message size check (different from the max frame size check): https://issues.apache.org/jira/browse/THRIFT-5021 A possible additional improvement could be to have the BatchWriter break its set of mutations into chunks that are smaller than Thrift’s max message size (while also respecting all other constraints). This doesn’t address the case where a single mutations exceeds the max message size limit. I’m not sure the best solution here. Possibly throwing a |
@cshannon previously looked into #3134, which is related to Thrift's max message stuff, and may have some insight. Upon a first pass investigating this, I can see that Thrift does not have an API for configuring the max message size. It has a hard-coded value of 100MB. The max frame size is 16MB, but that is configurable. I'm not sure why the max message size isn't configurable, but it seems like a patch to Thrift would be needed to do much about this. We can try to break up the Mutations based on the max message size, or throw some client side error that isn't going to retry forever, but that's going to take a lot of wiring up of the RPC layer up into the Accumulo client library, and I'm not quite comfortable with tightly coupling those layers like that. It might be better to just impose a strict limit on the client side for the max message, using our knowledge of the hard-coded Thrift value... and work with upstream Thrift to change it or make it configurable. However, I think if you set the BatchWriter maxMemory to something less than 100MB (might need to account for some overhead), you can probably avoid hitting this issue. |
It looks like the problem occurred in 0.14.0 with commit apache/thrift@63213c1 (apache/thrift#2191; THRIFT-5237). The change added a TConfiguration object and a TEndpointTransport base class for several TTransports, including the TFramedTransport that we use. However, they didn't provide new constructors to be able to pass in a TConfiguration in the child classes, and not all of them have a way to customize the TConfiguration they use. |
In #3134, If I recall correctly we were not applying our setting correctly to I found some docs here: It says that all current implementations just send 1 frame at a time so the values are related and will essentially pick the smaller one I guess. So I think we should really be applying our max message size setting in our configuration to both max message and max frame size in thrift and that would likely solve the issue, but as @ctubbsii pointed out, it doesn't look like we can currently do that without a patch to make it configurable. |
We could put an upper-bound on our own configuration, so you can't set the max frame size bigger than the hard-coded limit. I'm not sure what configuration triggered this, though. There's still the issue of the fact that it's going to cause an exception that isn't going to fall out... mostly because it's hard to distinguish the error from transient transport exceptions, except for the e.getMessage() string, which is probably a bit fragile. |
Use a custom TFramedTransport.Factory implementation so that when getTransport is called, the frame and message size are set on the underlying configuration. Fixes apache#3731
I created https://issues.apache.org/jira/browse/THRIFT-5732 upstream. |
…pache#3737) Use a custom TFramedTransport.Factory implementation so that when getTransport is called, the frame and message size are set on the underlying configuration. This is a workaround for https://issues.apache.org/jira/browse/THRIFT-5732 Fixes apache#3731 Also: * Throw EOFException when TTransportException type is END_OF_FILE * Refactor ThriftMaxFrameSizeIT to cover testing messages bigger and smaller than the configured value and also to use separate mini dirs for each test * Include stack trace in TabletServerBatchWriter log message for debugging * Add default value for timeout.factor in Wait class to avoid error message and 24 timeout default in IDEs when the system property isn't set --------- Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>
When TabletServerBatchReaderIterator gets the following exception:
org.apache.thrift.transport.TTransportException: MaxMessageSize reached
at org.apache.accumulo.core.clientImpl.TabletServerBatchReaderIterator.doLookup:913
at org.apache.accumulo.core.clientImpl.TabletServerBatchReaderIterator$QueryTask.run:401
this failed scan will be added to the failures and the processFailures routine will requeue
the scan. Given the tserver will simply create the same results for the scan, this will happen
indefinitly.
Accumulo 2.1.2
OS CenOS 7.7
Java 11
The text was updated successfully, but these errors were encountered: