Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 2787: TestPerChannelBookieClient.testEpollChannelTcpUserTimeout fails #2801

Merged
merged 1 commit into from Oct 5, 2021

Conversation

eolivelli
Copy link
Contributor

… fails

- Fixes apache#2787
- On some Linux systems the value 1234 is rounded to 1236, this patch simply uses 1236 as value in order to prevent failures
@eolivelli eolivelli requested a review from fpj September 21, 2021 09:59
@eolivelli eolivelli added this to the 4.15.0 milestone Sep 21, 2021
@eolivelli eolivelli self-assigned this Sep 21, 2021
@eolivelli
Copy link
Contributor Author

eolivelli commented Sep 21, 2021

@@ -314,7 +314,7 @@ public void testEpollChannelTcpUserTimeout() throws Exception {
EventLoopGroup eventLoopGroup = new EpollEventLoopGroup();
OrderedExecutor executor = getOrderedSafeExecutor();
ClientConfiguration conf = new ClientConfiguration();
int tcpUserTimeout = 1234;
int tcpUserTimeout = 1236; // this value may be rounded on some Linux implementations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiosity, is it documented anywhere? How did you find that this rounding is going, @eolivelli ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some references

https://datatracker.ietf.org/doc/html/rfc5482

I didn't find more

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, that's the RFC, but where did you see that some linux implementations round it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have this setup:

mvn -v
Maven home: /usr/share/maven
Java version: 1.8.0_292, vendor: Private Build, runtime: /usr/lib/jvm/java-8-openjdk-amd64/jre
Default locale: en, platform encoding: UTF-8
OS name: "linux", version: "4.15.0-64-generic", arch: "amd64", family: "unix"

uname -a
Linux jenkins-pulsar-basic-runner-324 4.15.0-64-generic #73-Ubuntu SMP Thu Sep 12 13:16:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

The problem always reproduces on our Jenkins slaves at DataStax, basically every test run starts a new VM, installs Ubuntu and Java and then builds bookkeeper from the master ASF branch.
it is 100% reproducible on those machines.

I had not time to investigate, because I should learn more about the machines.
We should check that kernel version + Java version....

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eolivelli do you know if with this change the test is passing consistently in your environment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test was born a few weeks ago.
It passes on GH CI and it fails on Datastax CI.

Let me paste the results of the tests on this branch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dlg99 is it possible that 1234 is corrupted to 1236 due an endianness problem ?
but 1236 is not changed to another value

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I do not know.
I experimented with endianness conversion, neither 1234 nor 1236 seem problematic or radically different, even if I add casting to unsigned int.
OTOH, TCP_USER_TIMEOUT is defined as a 15bit unsigned in the RFC but I don't think 15 vs 16bit matters for either value.

but extra 2ms timeout making the test pass makes even less sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other cure for this problem is to revert the change at all.

@RaulGracia said that in the test environments this feature is a good enhancement and fixes a problem.

So probably we can change the test to set a value that is likely to be used in production.
If Netty/Linux or other parts om the stack do not preserve every value it is not a BK problem.
We can open a ticket on Netty, I am pretty sure that we cannot so anything in BK.

I want to fix this test that is consistently failing and it is blocking some CI pipeline.

@eolivelli
Copy link
Contributor Author

The tests passed on this branch on DataStax CI

[INFO] Running org.apache.bookkeeper.proto.TestPerChannelBookieClient
[INFO] Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.099 s - in org.apache.bookkeeper.proto.TestPerChannelBookieClient

Copy link
Contributor

@RaulGracia RaulGracia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change makes no difference to the test logic and if it solves the corner case reported, I'm fine with it.

Copy link
Contributor

@dlg99 dlg99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to understand the rootcause but if it unblocks right now it is ok to go.
Please add an issue to investigate/follow up later

@eolivelli eolivelli merged commit 50b5c1e into apache:master Oct 5, 2021
@eolivelli eolivelli deleted the fix/epool-test branch October 5, 2021 08:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TestPerChannelBookieClient.testEpollChannelTcpUserTimeout fails
4 participants