Skip to content

Conversation

@dlmarion
Copy link
Contributor

Use a custom TFramedTransport.Factory implementation so that when getTransport is called, the frame and message size are set on the underlying configuration.

Fixes #3731

Use a custom TFramedTransport.Factory implementation so that when getTransport
is called, the frame and message size are set on the underlying configuration.

Fixes apache#3731
@dlmarion
Copy link
Contributor Author

Created as draft because I think a test needs to be added to confirm that the message and frame sizes are getting set. While looking at this I found several things:

  1. Not all Thrift server types use the framed transport

  2. According to this all current implementations send one message in one frame, which begs the question why are we using framed transports.

  3. The Accumulo servers have properties for their max message size, which falls back to the general max message size when not defined. The remaining places where Thrift connections are created are using a maxMessageSize of Integer.MAX_VALUE. As mentioned, the Accumulo max message properties were not setting the Thrift maxMessageSize, they were setting the maxFrameSize. The default Thrift maxMessageSize is 100MB, so any frame size over that will get lowered to 100MB. This change sets the Thrift maxMessageSize, I'm unsure what will happen when the client and server have different values.

  4. Ideally, we should get reduce the max message size Accumulo properties to 1 and default it to a large value (2GB?).

@dlmarion dlmarion linked an issue Aug 29, 2023 that may be closed by this pull request
@dlmarion
Copy link
Contributor Author

Kicked off full IT build

@dlmarion
Copy link
Contributor Author

Full IT build passed.

@dlmarion dlmarion marked this pull request as ready for review August 30, 2023 11:26
Copy link
Contributor

@cshannon cshannon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice find. I hadn't looked at the Thrift code in a while and only saw references to maxFrameSize when setting in the constructor so didn't think maxMessageSize would be easily configurable but this is a great solution.

@ivakegg
Copy link
Contributor

ivakegg commented Aug 30, 2023

This is a great find. Thank you @dlmarion. I would like to point out that this only resolves half of the issue. The other side of this issue is that the when this exception is thrown on the client side of the scan, the underlying logic will retry indefinitely leaving the process hung. Granted that this PR will fix avoid the problem for the most part, but I worry that some other situation will cause the scan (or batch writer) to hang indefinitly.

@dlmarion
Copy link
Contributor Author

@ivakegg - I overlooked that part of the issue in #3731, I'll look at that also.

@dlmarion dlmarion requested a review from ivakegg August 30, 2023 14:27
@dlmarion
Copy link
Contributor Author

@ivakegg - I overlooked that part of the issue in #3731, I'll look at that also.

Looks like when it's thrown from here the type is set to END_OF_FILE. aed8516 throws a subclass of IOException in this case, which is handled differently in the calling code so that it does not retry.

@ctubbsii
Copy link
Member

Still looking at this this morning. Will be done by noon (EDT), and we can merge after if there's no other issues.

@ctubbsii
Copy link
Member

I merged 2.1 into this PR's branch, just to get it up-to-date before I finish reviewing.

* Put factory in own class
* Remove direct references to TFramedTransport.Factory, so it's easier
  to see the new one is used everywhere
* Fix spelling of transport
* Fix formatting/comments in ThriftMaxFrameSizeIT
Copy link
Member

@ctubbsii ctubbsii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this looks like a good set of changes. However, I could not find a good way to test any of it.

Copy link
Member

@ctubbsii ctubbsii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I think the changes to the IT aren't good, for the reasons I say below. I'm going to spend a little time trying to polish it up, but I do think the non-test stuff is pretty much good.

* Add comment about upstream issue
* Check the exception message to make sure it's caused by the frame size
* Remove unneeded dependency declaration
* Refactor ThriftMaxFrameSizeIT to cover testing messages bigger and
  smaller than the configured value
@ctubbsii
Copy link
Member

ctubbsii commented Sep 1, 2023

Okay, I updated the IT to test for messages that are smaller than the configured amount (which should work), but also included the checks that @dlmarion had to verify it failed if the message was larger than the configured amount. So, we have both now. But, there are a few remaining issues:

  • The ssl case does not seem to have the same failure characteristics as the other server types. In some cases, when a message is sent that is larger than what is configured, it still works. I'm not sure why. In other cases, it seemed to not work, like we'd expect, but it would fail with a different message. I had to refactor the test to ensure MiniAccumuloCluster gave it separate directories so this can be debugged further. I don't think it's critical, though. For now, I've ignored the ssl case in the IT.
  • The more interesting thing is that if I reduce the configured max size to a lower amount (say, 20MB), the tests that send a smaller message (even if it is substantially smaller than 20MB) seem to fail with messages in the tserver logs saying that it received a message that was larger. I'm not sure how, since I never sent one that big, and I don't expect Accumulo to natively be sending itself large messages from another server. This one doesn't make sense. I got the IT to pass with 80MB... I haven't found the precise threshold where it stops working, though. Hopefully, having separate directories for the tests now will make it easier to troubleshoot each one separately.

@ctubbsii ctubbsii merged commit b1b2557 into apache:2.1 Sep 1, 2023
@dlmarion dlmarion deleted the create-tconf branch September 1, 2023 19:31
@dlmarion
Copy link
Contributor Author

dlmarion commented Sep 1, 2023

I wrote a test using what I think is the correct stack on the client side to see the number of bytes that it ends up writing for a Mutation. It's not exhibiting the symptoms described above. Test:

  @Test
  public void testThriftMessageSize() throws Exception {
    ByteBuffer buf = ByteBuffer.allocate(MB10);
    
    ByteBufferOutputStream bbos = new ByteBufferOutputStream(buf);
    
    TIOStreamTransport stream = new TIOStreamTransport(bbos);
    
    AccumuloTFramedTransportFactory factory = new AccumuloTFramedTransportFactory(Integer.MAX_VALUE);
    TTransport framedTransport = factory.getTransport(stream);
    assertEquals(framedTransport.getConfiguration().getMaxFrameSize(), Integer.MAX_VALUE);
    assertEquals(framedTransport.getConfiguration().getMaxMessageSize(), Integer.MAX_VALUE);
    
    TCompactProtocol protocol = new TCompactProtocol(framedTransport);
    
    Mutation m = new Mutation(new Text(FastFormat.toZeroPaddedString(0 + 0, 10, 10, "row_".getBytes(UTF_8))));
    byte[] value = new byte[MB1];
    Arrays.fill(value, (byte) 1);
    m.put(new Text("colf_colf"), new Text(FastFormat.toZeroPaddedString(0, 7, 10, "colf_".getBytes(UTF_8))),
        new ColumnVisibility(), System.currentTimeMillis(), new Value(value, true));

    TMutation tmut = m.toThrift();
    tmut.write(protocol);
    
    framedTransport.flush();
    assertEquals(MB1, buf.position());
    buf.flip();
    assertEquals(0, buf.position());
    assertEquals(MB1, buf.limit());

  }

@ctubbsii
Copy link
Member

I think it's related to the issue of the max message size counter not being reset when the transport is reused for some new messages. So, you won't see it for only a single message. I'm not sure how to verify this at the moment, though.

EdColeman pushed a commit to EdColeman/accumulo that referenced this pull request Sep 11, 2023
…pache#3737)

Use a custom TFramedTransport.Factory implementation so that when getTransport
is called, the frame and message size are set on the underlying configuration.

This is a workaround for https://issues.apache.org/jira/browse/THRIFT-5732

Fixes apache#3731

Also:
* Throw EOFException when TTransportException type is END_OF_FILE
* Refactor ThriftMaxFrameSizeIT to cover testing messages bigger and
  smaller than the configured value and also to use separate mini dirs
  for each test
* Include stack trace in TabletServerBatchWriter log message for debugging
* Add default value for timeout.factor in Wait class to avoid error message
  and 24 timeout default in IDEs when the system property isn't set

---------

Co-authored-by: Christopher Tubbs <ctubbsii@apache.org>
ctubbsii added a commit to ctubbsii/accumulo that referenced this pull request Sep 28, 2023
This commit partially reverts b1b2557
from apache#3737, to restore the previous infinite retry behavior when an
EOFException occurs in the transport. While this EOFException can be an
indicator of a fatal exception that should not be retried, it also
occurs during transient network failures, where a retry should occur. It
is not possible to easily detect which of these two cases is ocurring.
Since transient network failures are routine, and the fatal exception
that caused the issue that apache#3737 tried to address has a workaround via
configuration, this commit restores the previous behavior that assumes
the exception is caused by a transient issue.

This fixes apache#3762

This fixes the issue in apache#3762, where the ManagerRepairsDualAssignmentIT
intentionally triggered a transient network failure, and therefore
triggered a scan to fail, causing the IT as a whole to fail.
ctubbsii added a commit to ctubbsii/accumulo that referenced this pull request Sep 28, 2023
This commit partially reverts b1b2557
from apache#3737, to restore the previous infinite retry behavior when an
EOFException occurs in the transport. While this EOFException can be an
indicator of a fatal exception that should not be retried, it also
occurs during transient network failures, where a retry should occur. It
is not possible to easily detect which of these two cases is ocurring.
Since transient network failures are routine, and the fatal exception
that caused the issue that apache#3737 tried to address has a workaround via
configuration, this commit restores the previous behavior that assumes
the exception is caused by a transient issue.

This fixes apache#3762

This fixes the issue in apache#3762, where the ManagerRepairsDualAssignmentIT
intentionally triggered a transient network failure, and therefore
triggered a scan to fail, causing the IT as a whole to fail.
ctubbsii added a commit that referenced this pull request Sep 29, 2023
This commit partially reverts b1b2557
from #3737, to restore the previous infinite retry behavior when an
EOFException occurs in the transport. While this EOFException can be an
indicator of a fatal exception that should not be retried, it also
occurs during transient network failures, where a retry should occur. It
is not possible to easily detect which of these two cases is ocurring.
Since transient network failures are routine, and the fatal exception
that caused the issue that #3737 tried to address has a workaround via
configuration, this commit restores the previous behavior that assumes
the exception is caused by a transient issue.

This fixes #3762

This fixes the issue in #3762, where the ManagerRepairsDualAssignmentIT
intentionally triggered a transient network failure, and therefore
triggered a scan to fail, causing the IT as a whole to fail.
@ctubbsii ctubbsii modified the milestones: 3.1.0, 2.1.3 Jul 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TabletServerBatchReaderIterator gets into infinite loop

4 participants