HADOOP-17552. Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang #2727
…et.SocketTimeoutException due to the mistaken usage of the rpcTimeout configuration
@functioner Good catch, and thanks for the contribution. |
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
@ferhui Thanks! |
@functioner see TestRPC#testClientRpcTimeout |
@ferhui If this is expected, then the client should be able to automatically request a new connection rather than blindly waiting for the problematic connection it currently uses. Otherwise this endless waiting is meaningless, because now it's impossible for the server to add this connection to its Reader. In this case, timeout is better, because the user (client) at least can be aware of this issue. |
@ferhui See TestRPC.java line 1419-1426 (within testClientRpcTimeout)
According to the comment there, I think the correct behavior is to make the effective rpc-timeout a multiple of the ping interval. |
In this case, we can set ipc.client.rpc-timeout.ms to a suitable value and resolve it. Let us hear the thoughts of the author (@iwasakims) first and then resolve it carefully. |
Agree. |
@functioner @ferhui, I made the timeout configurable by introducing |
I think the long default timeout (triggered by tcp_retries2) surfaced as an RM-HA (YARN) issue like YARN-2578 rather than as an HDFS issue. Maybe the NN-HA client proxy, which injects another timeout, worked as a mitigation. |
From the description of HADOOP-17552:
@functioner I think the issue was surfaced by a half-closed TCP connection (connection loss without an RST packet) caused by a hardware issue such as a power fault. What kind of fault injection caused this? |
@iwasakims In Server.java, the socket channel is accepted in line 1400, and then the fault (IOException) is injected in line 1402 (1403 or 1404 will also work).
The basic idea of this fault injection is to let the server accept the connection in line 1400 but prevent the connection from being added to a Reader, so that the server never sees the data sent by this client. The injected IOException is swallowed in line 1350. |
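To make the effect of such a fault concrete, here is a minimal standalone sketch using plain `java.net` sockets. It is not Hadoop code: the class `SilentPeerDemo` and its method are hypothetical, and the "server" simply accepts a connection and then never reads from or writes to it, which only approximates a connection that is accepted but never handed to a Reader. With a positive socket read timeout the client notices the problem; with a timeout of 0 the same read would block indefinitely.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical demo, not Hadoop code: a peer that accepts but never answers. */
final class SilentPeerDemo {

  /**
   * Connects to a local server that accepts the connection and then ignores it,
   * and tries to read one byte using the given socket read timeout.
   * Returns "timeout" if the read timed out, otherwise a short description.
   */
  static String readFromSilentPeer(int soTimeoutMillis) {
    try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
         Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
         Socket accepted = server.accept()) {
      // The accepted socket is held open but never read from or written to.
      // A timeout of 0 would mean "block forever" for the read below.
      client.setSoTimeout(soTimeoutMillis);
      int b = client.getInputStream().read();
      return "read returned " + b;
    } catch (SocketTimeoutException e) {
      return "timeout";
    } catch (IOException e) {
      return "io error: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    // Use a short, positive timeout so the demo terminates.
    System.out.println(readFromSilentPeer(200));
  }
}
```

Calling `readFromSilentPeer(0)` is deliberately not shown in `main`, because under these assumptions it would never return.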
@iwasakims The current codebase can pass this test, but in the wrong way. The current behavior is always swallowing the SocketTimeoutException in line 548:
According to the comment in that test case, it "should not time out because effective rpc-timeout is multiple of ping interval: 1600 (= 800 * (1000 / 800 + 1))". That does not mean it should never time out. |
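The arithmetic quoted from the test comment relies on integer division, which is easy to misread. The sketch below only reproduces that expression for the quoted values (rpc-timeout 1000 ms, ping interval 800 ms); the class and method names are hypothetical, and the expression is copied from the comment rather than taken from `Client.java`.

```java
/** Hypothetical helper that evaluates the expression quoted in the test comment. */
final class EffectiveTimeoutArithmetic {

  /** pingInterval * (rpcTimeout / pingInterval + 1), using integer division. */
  static int effectiveTimeout(int rpcTimeoutMs, int pingIntervalMs) {
    return pingIntervalMs * (rpcTimeoutMs / pingIntervalMs + 1);
  }

  public static void main(String[] args) {
    // 1000 / 800 == 1 with integer division, so this is 800 * 2.
    System.out.println(effectiveTimeout(1000, 800)); // prints 1600
  }
}
```

In other words, with these values the call is expected to survive the configured 1000 ms but still fail once 1600 ms have elapsed.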
The SocketTimeoutException is thrown on |
Oh, I think the current fix I propose does not consider the case of
@iwasakims Do you agree? |
@functioner Just update the value of |
@iwasakims Thanks for your explanation. Now I understand your design. I think we should update the description for hadoop/hadoop-common-project/hadoop-common/src/main/resources/core-default.xml Lines 2384 to 2392 in 1f1a1ef
For the default value of @iwasakims I'm going to update the description of |
@iwasakims I've updated the description of You can double-check whether the description is unambiguous, and whether there exists other files requiring similar update. |
I guess many users may not change the default configuration. If they encounter this endless hang issue, it may take a while for them to figure out what's wrong, even with the clarification on the parameter description. Therefore, I think we can consider changing the default value of @iwasakims I advocate for setting a reasonable default |
@functioner Thanks for comments. |
Considering it affects the failover time (for the NM to reconnect to the new active RM) of YARN-2578, 60 or 120 sounds reasonable. Since 60 (= default ipc.ping.interval) effectively disables IPC ping, how about using 120? |
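The reasoning about 60 versus 120 can be sketched with a small model. It assumes, as a simplification inferred from this thread rather than a copy of `Client.java`, that the socket read timeout is the smaller of the ping interval and the rpc-timeout, that the wait time grows by that amount on each read timeout, and that a ping is sent only if the accumulated wait is still below the rpc-timeout. `PingTimeoutModel` and `simulate` are hypothetical names.

```java
/** Hypothetical model of the ping and rpc-timeout interaction; not Hadoop code. */
final class PingTimeoutModel {

  /**
   * Returns {effectiveTimeoutMs, pingsSent} for a call that never gets a response.
   * Assumes rpcTimeoutMs > 0 and pingIntervalMs > 0.
   */
  static int[] simulate(int rpcTimeoutMs, int pingIntervalMs) {
    // Assumption: each blocking read waits for the smaller of the two values.
    int soTimeout = Math.min(pingIntervalMs, rpcTimeoutMs);
    int waited = 0;
    int pings = 0;
    while (true) {
      waited += soTimeout;            // one read timed out
      if (waited >= rpcTimeoutMs) {   // rpc-timeout reached: give up
        return new int[] {waited, pings};
      }
      pings++;                        // otherwise send a ping and keep waiting
    }
  }

  public static void main(String[] args) {
    int[] sixty = simulate(60_000, 60_000);
    int[] oneTwenty = simulate(120_000, 60_000);
    System.out.println("60s:  timeout=" + sixty[0] + " pings=" + sixty[1]);
    System.out.println("120s: timeout=" + oneTwenty[0] + " pings=" + oneTwenty[1]);
  }
}
```

Under this model a 60 s rpc-timeout expires before any ping is sent, while 120 s leaves room for one ping, which matches the argument for choosing 120.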
It makes sense to me. Then let's try with 120s and see what happens (e.g., result of test). |
If the current title is not appropriate or concise, what title would be better? Any idea? I can change the title with your suggestions. |
How about "Change ipc.client.rpc-timeout.ms from 0 to 120000 by default to avoid potential hang" |
Sounds good. Done. |
@functioner According to CI results, TestIPC#testClientGetTimeout fails. It is related, please check. |
It fails at line 1459: hadoop/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestIPC.java Lines 1456 to 1460 in b4985c1
hadoop/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java Lines 237 to 258 in b4985c1
Before we change the default rpcTimeout: After we change the default rpcTimeout=120000: Conclusion: |
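As a rough illustration of why a test that expects `-1` from a timeout lookup would start failing once the default rpc-timeout becomes positive, here is a standalone sketch. It is an assumption about the shape of the logic, not the referenced `Client.java` lines: the class `GetTimeoutSketch`, the `Map`-based configuration, the `ipc.client.ping` key, and the passed-in default are all stand-ins chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a timeout lookup; not the actual Hadoop implementation. */
final class GetTimeoutSketch {
  static final String RPC_TIMEOUT_KEY = "ipc.client.rpc-timeout.ms";
  static final String PING_KEY = "ipc.client.ping";           // assumed key name
  static final String PING_INTERVAL_KEY = "ipc.ping.interval";

  /**
   * Assumed logic: a positive rpc-timeout wins; otherwise the ping interval is
   * used when ping is disabled; otherwise -1 signals "no timeout".
   */
  static int getTimeout(Map<String, String> conf, int defaultRpcTimeoutMs) {
    int rpcTimeout = Integer.parseInt(
        conf.getOrDefault(RPC_TIMEOUT_KEY, String.valueOf(defaultRpcTimeoutMs)));
    if (rpcTimeout > 0) {
      return rpcTimeout;
    }
    boolean pingEnabled = Boolean.parseBoolean(conf.getOrDefault(PING_KEY, "true"));
    if (!pingEnabled) {
      return Integer.parseInt(conf.getOrDefault(PING_INTERVAL_KEY, "60000"));
    }
    return -1;
  }

  public static void main(String[] args) {
    Map<String, String> conf = new HashMap<>();
    System.out.println(getTimeout(conf, 0));        // old default
    System.out.println(getTimeout(conf, 120_000));  // new default
    conf.put(RPC_TIMEOUT_KEY, "0");                 // explicit override
    System.out.println(getTimeout(conf, 120_000));
  }
}
```

Under these assumptions, an empty configuration yields `-1` only while the default is 0; with a default of 120000, a test would have to set the rpc-timeout key to 0 explicitly to see `-1` again.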
@functioner That's OK |
@functioner The |
@functioner As @iwasakims said, you can add |
It seems it doesn't work. The obtained timeout is still 120000. Any idea? |
hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/ipc/TestIPC.java
Are we ready to merge? @ferhui @iwasakims |
@functioner You should address the checkstyle warning. I think we don't need the comment.
|
+1. Thanks, @functioner and @ferhui.
…fault to avoid potential hang. (apache#2727)
I proposed a fix for HADOOP-17552.