Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TabletServerBatchReaderIterator gets into infinite loop #3731

Closed
ivakegg opened this issue Aug 28, 2023 · 6 comments · Fixed by #3737
Closed

TabletServerBatchReaderIterator gets into infinite loop #3731

ivakegg opened this issue Aug 28, 2023 · 6 comments · Fixed by #3737
Labels
bug This issue has been verified to be a bug.
Milestone

Comments

@ivakegg
Copy link
Contributor

ivakegg commented Aug 28, 2023

When TabletServerBatchReaderIterator gets the following exception:

org.apache.thrift.transport.TTransportException: MaxMessageSize reached
at org.apache.accumulo.core.clientImpl.TabletServerBatchReaderIterator.doLookup:913
at org.apache.accumulo.core.clientImpl.TabletServerBatchReaderIterator$QueryTask.run:401

this failed scan will be added to the failures and the processFailures routine will requeue
the scan. Given the tserver will simply create the same results for the scan, this will happen
indefinitly.

Accumulo 2.1.2
OS CenOS 7.7
Java 11

  1. a MaxMessageSize reached issue should probably not be retried indefinitly.
  2. we need a way to set the thrift max message size.
@ivakegg ivakegg added the bug This issue has been verified to be a bug. label Aug 28, 2023
@jsduncan2
Copy link

jsduncan2 commented Aug 28, 2023

We are also seeing this with the BatchWriter. If one large mutation (> 49 * 1024 * 1024 bytes) or several mutations that add up to that same size or greater are sent, we enter an infinite loop. The BatchWriter client receives a “java.io.IOException: org.apache.thrift.transport.TTransportException: Socket is closed by peer.” here: https://github.com/apache/accumulo/blob/main/core/src/main/java/org/apache/accumulo/core/clientImpl/TabletServerBatchWriter.java#L897

For what it’s worth, Thrift 0.14 seems to have added this max message size check (different from the max frame size check): https://issues.apache.org/jira/browse/THRIFT-5021

A possible additional improvement could be to have the BatchWriter break its set of mutations into chunks that are smaller than Thrift’s max message size (while also respecting all other constraints).

This doesn’t address the case where a single mutations exceeds the max message size limit. I’m not sure the best solution here. Possibly throwing a MutationsRejectedException?

@ctubbsii
Copy link
Member

@cshannon previously looked into #3134, which is related to Thrift's max message stuff, and may have some insight.

Upon a first pass investigating this, I can see that Thrift does not have an API for configuring the max message size. It has a hard-coded value of 100MB. The max frame size is 16MB, but that is configurable. I'm not sure why the max message size isn't configurable, but it seems like a patch to Thrift would be needed to do much about this.

We can try to break up the Mutations based on the max message size, or throw some client side error that isn't going to retry forever, but that's going to take a lot of wiring up of the RPC layer up into the Accumulo client library, and I'm not quite comfortable with tightly coupling those layers like that. It might be better to just impose a strict limit on the client side for the max message, using our knowledge of the hard-coded Thrift value... and work with upstream Thrift to change it or make it configurable.

However, I think if you set the BatchWriter maxMemory to something less than 100MB (might need to account for some overhead), you can probably avoid hitting this issue.

@ctubbsii
Copy link
Member

ctubbsii commented Aug 29, 2023

It looks like the problem occurred in 0.14.0 with commit apache/thrift@63213c1 (apache/thrift#2191; THRIFT-5237). The change added a TConfiguration object and a TEndpointTransport base class for several TTransports, including the TFramedTransport that we use. However, they didn't provide new constructors to be able to pass in a TConfiguration in the child classes, and not all of them have a way to customize the TConfiguration they use.

@cshannon
Copy link
Contributor

In #3134, If I recall correctly we were not applying our setting correctly to maxFrameSize so Thrift was hanging instead of throwing an error like it should after changes. I didn't touch anything with max message size in thrift. It's a bit confusing as there is maxFrameSize and maxMessageSize is used interchangeably a lot but they are different. We actually call it maxMessageSize in our code and our configuration even though we apply it to maxFrameSize in thrift and the default is set to 1 GB.

I found some docs here:
https://github.com/apache/thrift/blob/v0.17.0/doc/specs/thrift-tconfiguration.md#maxmessagesize and https://github.com/apache/thrift/blob/v0.17.0/doc/specs/thrift-tconfiguration.md#maxframesize-vs-maxmessagesize

It says that all current implementations just send 1 frame at a time so the values are related and will essentially pick the smaller one I guess. So I think we should really be applying our max message size setting in our configuration to both max message and max frame size in thrift and that would likely solve the issue, but as @ctubbsii pointed out, it doesn't look like we can currently do that without a patch to make it configurable.

@ctubbsii
Copy link
Member

We could put an upper-bound on our own configuration, so you can't set the max frame size bigger than the hard-coded limit. I'm not sure what configuration triggered this, though. There's still the issue of the fact that it's going to cause an exception that isn't going to fall out... mostly because it's hard to distinguish the error from transient transport exceptions, except for the e.getMessage() string, which is probably a bit fragile.

dlmarion added a commit to dlmarion/accumulo that referenced this issue Aug 29, 2023
Use a custom TFramedTransport.Factory implementation so that when getTransport
is called, the frame and message size are set on the underlying configuration.

Fixes apache#3731
@ctubbsii
Copy link
Member

ctubbsii commented Sep 1, 2023

I created https://issues.apache.org/jira/browse/THRIFT-5732 upstream.

@asfgit asfgit closed this as completed in b1b2557 Sep 1, 2023
EdColeman pushed a commit to EdColeman/accumulo that referenced this issue Sep 11, 2023
…pache#3737)

Use a custom TFramedTransport.Factory implementation so that when getTransport
is called, the frame and message size are set on the underlying configuration.

This is a workaround for https://issues.apache.org/jira/browse/THRIFT-5732

Fixes apache#3731

Also:
* Throw EOFException when TTransportException type is END_OF_FILE
* Refactor ThriftMaxFrameSizeIT to cover testing messages bigger and
  smaller than the configured value and also to use separate mini dirs
  for each test
* Include stack trace in TabletServerBatchWriter log message for debugging
* Add default value for timeout.factor in Wait class to avoid error message
  and 24 timeout default in IDEs when the system property isn't set

---------

Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>
@ctubbsii ctubbsii modified the milestones: 3.1.0, 2.1.3 Jul 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue has been verified to be a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants