Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make sure maxMessageSize is correctly set when Thrift is configured to use a non blocking server #3134

Merged
merged 4 commits into from Jan 11, 2023

Conversation

cshannon
Copy link
Contributor

@cshannon cshannon commented Dec 17, 2022

TServerUtils was not correctly setting the max frame size value on the constructor when creating a TNonblockingServerSocket leading to failures if the frame size exceeded the default.

use a non blocking server

TServerUtils was not correctly setting the max frame size value on the
constructor when creating a TNonblockingServerSocket leading to failures
if the frame size exceeded the default.

This fixes apache#3094
@cshannon
Copy link
Contributor Author

cshannon commented Dec 17, 2022

For the testing I went ahead and marked the integration test as SunnyDay but I can take that off if we don't want to make those tests take longer. The test seems to take about 40-45 seconds to test the max frame size against the different Thrift configurations.

Also note I added a test for all Thrift types except for SASL as SASL is a giant pain to setup and the main issue was when non blocking server is used which is covered by the tests and corresponds to the type of ThriftServerType.CUSTOM_HS_HA (default) and ThriftServerType.THREADPOOL. Also a test isn't really necessary anyways when using SSL or SASL as both SSL and SASL just use ThriftUtil.transportFactory() to create the transport factory and that utility uses our default factory which sets the length to Integer.MAX_VALUE and isn't configurable as it doesn't read the max message size property.

Copy link
Contributor

@Manno15 Manno15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good and I did confirm this fixes the frame size error as expected. I didn't look too closely at the IT so I hope others will give a more thorough review there.

@cshannon
Copy link
Contributor Author

The changes look good and I did confirm this fixes the frame size error as expected. I didn't look too closely at the IT so I hope others will give a more thorough review there.

Thanks, the test is pretty simple it just sends a large mutation that is bigger than the default thrift frame size. Before the changes if you run the test you will see the tests that use a non blocking server type hang and in the tserver logs there will be thrift errors printing the max frame size message.

@cshannon
Copy link
Contributor Author

The full IT passed except for 1 failure which was caused by #3106

I also removed the comment from the description about this fixing #3094 was that appears to be a bit different as that is a different error and was reported against 1.10.x version which this doesn't apply to. When this is merged the commit message needs to be updated to drop that same comment.

Copy link
Member

@ctubbsii ctubbsii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this is a bug in 1.10, or if it only affected us after the upgrade to Thrift 0.17. If it affects 1.10, it would be good to backport, along with the integration test. However, we're not using JUnit5 in 1.10, so we'd have to write the IT a bit differently.

@ctubbsii ctubbsii added this to In progress in 2.1.1 via automation Dec 17, 2022
@cshannon
Copy link
Contributor Author

I'm wondering if this is a bug in 1.10, or if it only affected us after the upgrade to Thrift 0.17. If it affects 1.10, it would be good to backport, along with the integration test. However, we're not using JUnit5 in 1.10, so we'd have to write the IT a bit differently.

Initially I thought this didn't apply but it does (my initial search didn't show the non blocking server being used). The test should be easy to convert to JUnit 4. My preference would be to keep this PR as it is for 2.1.1 so we can keep using JUnit 5 and merge it forward to main and i can create another PR for 1.10 that has the same fix but I'll fix the test so it's JUnit 4 which I'll do now.

@cshannon
Copy link
Contributor Author

I'm wondering if this is a bug in 1.10, or if it only affected us after the upgrade to Thrift 0.17. If it affects 1.10, it would be good to backport, along with the integration test. However, we're not using JUnit5 in 1.10, so we'd have to write the IT a bit differently.

Initially I thought this didn't apply but it does (my initial search didn't show the non blocking server being used). The test should be easy to convert to JUnit 4. My preference would be to keep this PR as it is for 2.1.1 so we can keep using JUnit 5 and merge it forward to main and i can create another PR for 1.10 that has the same fix but I'll fix the test so it's JUnit 4 which I'll do now.

@ctubbsii - I updated based on your comments and things should be good now. I went to backport this to 1.10 and when I did I just noticed that the maxFrameSize option doesn't exist and wasn't added until this PR: apache/thrift#2533

So this doesn't actually apply to 1.10 ....or at least if it is the same bug we can't apply this fix. I'm now wondering if the issue with #3094 is not easily fixable do to this option not being added yet. The answer might simply be to upgrade the version of thrift which supports the option and to a version where we set it which would imply to upgrade to Accumulo 2.1.1 when we release the fix.

Copy link
Contributor

@dlmarion dlmarion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catching up on this, so it appears that a new constructor was added to TNonblockingServerSocket that allows us to set the frame size, is that right? Good find!

@cshannon
Copy link
Contributor Author

Catching up on this, so it appears that a new constructor was added to TNonblockingServerSocket that allows us to set the frame size, is that right? Good find!

Yeah that is correct, that's likely why we didn't have it initially as it didn't exist and also why we can't backport it to 1.10.x right now as the version is too old.

@ctubbsii
Copy link
Member

I'm still a little confused about when this max message size set in the constructor would apply, as opposed to wherever we're setting it before this change. Presumably, wherever we're currently setting the max message is also needed... I don't think we were just ignoring the max message configuration option before. As I understand it, this is just additional places where we need to set it... but I don't understand the difference between these places and the old places we set it.

@cshannon
Copy link
Contributor Author

cshannon commented Dec 19, 2022

I'm still a little confused about when this max message size set in the constructor would apply, as opposed to wherever we're setting it before this change. Presumably, wherever we're currently setting the max message is also needed... I don't think we were just ignoring the max message configuration option before. As I understand it, this is just additional places where we need to set it... but I don't understand the difference between these places and the old places we set it.

The issue is the non blocking server implementation. The previous way works for a blocking server but didn't work correctly for the non blocking server implementation. From what I can tell it has never worked for TNonblockingServerSocket until this fix.

@cshannon
Copy link
Contributor Author

cshannon commented Dec 21, 2022

After talking to @ctubbsii about this a bit more I'm going to do some more research before merging this to make sure it's correct and to make sure all the code paths are properly set. I'm going to look more into why it doesn't always seem to be an issue and also make sure there's nothing to do for 1.10. I will get back to it right after the new year.

@cshannon
Copy link
Contributor Author

cshannon commented Jan 5, 2023

I took a look at this a bit today to refresh my memory after a couple weeks and at first glance it is odd that it wasn't working before as the maxMessageSize was being configured on the transport factory already. So I will dig into it more and see what's going on and why that setting doesn't work but adding the constructor param seems to fix it.

@cshannon
Copy link
Contributor Author

cshannon commented Jan 6, 2023

@ctubbsii and @dlmarion - Ok so I finally figured out what happened. Everything was working with the maxFrameSize setting being configured on the transport factory until version 0.17.0. The test I just added passes fine without modification to TServerUtils with version 0.16.0. The reason it broke in 0.17.0 was due to the change that is part of apache/thrift#2533 . Specifically this commit introduced the maxFrameSize option as part of the constructor apache/thrift@66ac7b4 and is necessary to configure.

My fix works but this behavior change is obviously a breaking change and I'm not sure whether or not it's something that was intended or should maybe be addressed in Thrift itself at some point.

TLDR version:

My commit in this PR fixes the issue and it only applies to version 2.1.0 since no other version uses Thrift 0.17.0. So this is a new regression and does not apply to previous versions, like 1.10. Also, when we configure the settings for our custom non blocking server we do still need to keep our existing maxFrameSize settings on both the transport factory configuration and the maxReadBufferBytes property on the options as shown here otherwise it breaks further downstream, even if we set the constructor args.

options.protocolFactory(protocolFactory);
options.transportFactory(ThriftUtil.transportFactory(maxMessageSize));
options.maxReadBufferBytes = maxMessageSize;

Detailed Version:

Why it breaks

The root cause is the changes inside of AbstractNonblockingServer. This commit changed how max frame size is checked during reads, here is version 0.16.0. Previously, before 0.17.0, frame size was compared against MAX_READ_BUFFER_BYTES which comes from maxReadBufferBytes and is set by us inside TServerUtils

But in version 0.17.0, this has been refactored now and instead of reading the value from MAX_READ_BUFFER_BYTES, this is read from trans_.getMaxFrameSize() as shown here in version 0.17.0. This value is configured by the new constructor argument which we were not setting so it was always the default value and caused the failure.

Why our existing config doesn't work

Now, what confused me initially is that we also set the value on the configuration of the factory so I was unsure why there were two ways to configure the setting and why ours didn't work. It turns out the reason is they are different configuration objects used for different things.

The maxFrameSize that is now set by the constructor argument is applied to the trans_ transport which is passed to FrameBuffer as a constructor arg and configured back in TNonblockingServerSocket on line 137.

However, the maxFrameSize set on the configuration for the factory we create is applied to different configurations and transports as the factory that we configure doesn't get used to create trans_ transport but is used to create the inTrans_ and outTrans_ transport objects in FrameBuffer as shown here.

So there are multiple transports configured in FrameBuffer and the constructor arg configuration applies to one of them and the factory configuration applies to the others.

Why we need to keep existing config as well

As I said above, we still need to keep the existing configuration on the transport factory. The new constructor setting will allow the read to get further but if we don't set the factory config then we get errors downstream here when reading a frame.

Furthermore, I think we need to keep setting the maxFrameSize configuration on options.maxReadBufferBytes because if the frame size is larger than that value then thrift just hangs and spins forever with no error (which is even worse). This happened when I was testing not setting it. This is because it's waiting for more memory to free up but it won't ever be free since the frame to be read is less than the configured max frame size so it tries to read it but it is larger than the max read bytes so it never finishes.

@dlmarion
Copy link
Contributor

dlmarion commented Jan 6, 2023

Great write-up! I think this is good to merge after the RPC timeout conversation thread is resolved.

@cshannon cshannon requested a review from ctubbsii January 6, 2023 16:25
@cshannon
Copy link
Contributor Author

cshannon commented Jan 6, 2023

@dlmarion and @ctubbsii - I looked into the RPC timeout a bit more before making the change to apply it to the server socket to fix issue #3153 and I wanted to get some feedback before I did it. As I said in my last comment it's pretty trivial to update the code to set the timeout on the server socket. However, now I'm not sure if we want to do it and what the side effects might be.

Right now the value seems to be primarily intended to be used as a client side timeout (defaulting to 120 seconds) and not as a server side timeout. This does in fact work and is applied correctly to thrift clients. It is configured in the ClientContext in the constructor and set as part of the timeoutSupplier and can be accessed using a getter. I verified with some testing that the clients time out properly when setting the value. So one potential issue I see is do we actually want to re-use this same property for a server timeout when it's been used for the client timeout and in some cases you may want to make them different.

A second issue is if a timeout is applied to the server side socket as well I am a little concerned of potential consequences and changing behavior on the client side if suddenly the server has a timeout value. If the server side does timeout then the client should timeout eventually on its side as well out as long as there's an actual timeout value set (default is 120s) and in that case I think that behavior would be fine I would think and would protect the server.

However, I noticed there are other cases where the client sets no timeout such as for manager operations. For a case like this the client would potentially hang waiting forever if the server side timed out. I actually first was alerted to this behavior when doing some testing to verify the server config worked and was seeing the client side just hang.

So I just wanted to get people's thoughts on it before I proceeded.

@cshannon
Copy link
Contributor Author

I'm going to go ahead and merge this and leave the RPC timeout issue (#3153) to investigate and can be a follow on PR.

@cshannon cshannon merged commit eb033ce into apache:2.1 Jan 11, 2023
2.1.1 automation moved this from In progress to Done Jan 11, 2023
@ctubbsii
Copy link
Member

@cshannon This merge caused TServerUtilsTest to start failing. It looks like it might be an easy fix, but I haven't looked at it too closely yet.

@cshannon
Copy link
Contributor Author

@ctubbsii - Odd, when I looked at the output in CI the errors made me immediately think it was environmental because it had to do with things like expected port numbers being wrong. Sure enough I just ran the test against both 2.1 and main branches in IntelliJ and on the command line using maven and the test passes for me. I even configured InteillJ to run the test in a loop 100 times and it didn't break. So I'm not sure what is up but I can try and take a closer look tomorrow.

@ctubbsii
Copy link
Member

@ctubbsii - Odd, when I looked at the output in CI the errors made me immediately think it was environmental because it had to do with things like expected port numbers being wrong.

It could just be a coincidence then, perhaps related to a change in my build environment due to a system update. I can also investigate further.

@ctubbsii
Copy link
Member

Odd. I can't reproduce this locally. Never mind, then. If I see it again, I'll consider opening a new ticket.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
2.1.1
Done
Development

Successfully merging this pull request may close these issues.

None yet

4 participants