New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue 2787: TestPerChannelBookieClient.testEpollChannelTcpUserTimeout fails #2801
Conversation
eolivelli
commented
Sep 21, 2021
- Fixes TestPerChannelBookieClient.testEpollChannelTcpUserTimeout fails #2787
- On some Linux systems the value 1234 is rounded to 1236, this patch simply uses 1236 as value in order to prevent failures
… fails - Fixes apache#2787 - On some Linux systems the value 1234 is rounded to 1236, this patch simply uses 1236 as value in order to prevent failures
@@ -314,7 +314,7 @@ public void testEpollChannelTcpUserTimeout() throws Exception { | |||
EventLoopGroup eventLoopGroup = new EpollEventLoopGroup(); | |||
OrderedExecutor executor = getOrderedSafeExecutor(); | |||
ClientConfiguration conf = new ClientConfiguration(); | |||
int tcpUserTimeout = 1234; | |||
int tcpUserTimeout = 1236; // this value may be rounded on some Linux implementations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, is it documented anywhere? How did you find that this rounding is going, @eolivelli ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that's the RFC, but where did you see that some linux implementations round it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have this setup:
mvn -v
Maven home: /usr/share/maven
Java version: 1.8.0_292, vendor: Private Build, runtime: /usr/lib/jvm/java-8-openjdk-amd64/jre
Default locale: en, platform encoding: UTF-8
OS name: "linux", version: "4.15.0-64-generic", arch: "amd64", family: "unix"
uname -a
Linux jenkins-pulsar-basic-runner-324 4.15.0-64-generic #73-Ubuntu SMP Thu Sep 12 13:16:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
The problem always reproduces on our Jenkins slaves at DataStax, basically every test run starts a new VM, installs Ubuntu and Java and then builds bookkeeper from the master ASF branch.
it is 100% reproducible on those machines.
I had not time to investigate, because I should learn more about the machines.
We should check that kernel version + Java version....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eolivelli do you know if with this change the test is passing consistently in your environment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another suspect is that TCP_USER_TIMEOUT (tcpUserTimeout feeds into it) is unsigned int
but netty's interface to jni passes it as it is jint: https://github.com/netty/netty/blob/23405e2000427e9cad104913e7071d8d08e91a3c/transport-native-epoll/src/main/c/netty_epoll_linuxsocket.c#L144-L146
i.e. for multicast netty casts parameter to u_int
: https://github.com/netty/netty/blob/23405e2000427e9cad104913e7071d8d08e91a3c/transport-native-epoll/src/main/c/netty_epoll_linuxsocket.c#L78-L81
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test was born a few weeks ago.
It passes on GH CI and it fails on Datastax CI.
Let me paste the results of the tests on this branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dlg99 is it possible that 1234 is corrupted to 1236 due an endianness problem ?
but 1236 is not changed to another value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH I do not know.
I experimented with endianness conversion, neither 1234 nor 1236 seem problematic or radically different, even if I add casting to unsigned int.
OTOH, TCP_USER_TIMEOUT is defined as a 15bit unsigned in the RFC but I don't think 15 vs 16bit matters for either value.
but extra 2ms timeout making the test pass makes even less sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other cure for this problem is to revert the change at all.
@RaulGracia said that in the test environments this feature is a good enhancement and fixes a problem.
So probably we can change the test to set a value that is likely to be used in production.
If Netty/Linux or other parts om the stack do not preserve every value it is not a BK problem.
We can open a ticket on Netty, I am pretty sure that we cannot so anything in BK.
I want to fix this test that is consistently failing and it is blocking some CI pipeline.
The tests passed on this branch on DataStax CI
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change makes no difference to the test logic and if it solves the corner case reported, I'm fine with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to understand the rootcause but if it unblocks right now it is ok to go.
Please add an issue to investigate/follow up later