-
Notifications
You must be signed in to change notification settings - Fork 8.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HDFS-16853. IPC shutdown failures. #5366
HDFS-16853. IPC shutdown failures. #5366
Conversation
…LeaseRecoveryAfterNameNodeRestart failed
Trying to address the problem 1. MUST NOT submit into blocking queue while closing 2. MUST NOT call queue.put() in synchronous block. This design doesn't quite stop (2), though it should detect and warn if the problem surfaces. "Possible overlap in queue shutdown and request" No new tests; Change-Id: I2c7d3752bc8c4ab852015f53a5ef768d737efb2f
@virajjasani you were near this code...what do you think? @ZanderXu's core patch does the cleanup, but there's still a small window of possible overlap which I can't see how to get rid of through synchronized() blocks. I've got detection, but maybe some semaphore or similar needs to get involved so as to actually block cleanup while other threads are submitting work. dangerous though |
@xkrogen thoughts? |
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments, otherwise changes look good.
Not relevant mostly for this change but I wonder why we do not use running
in addCall:
diff --git a/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java b/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
index c0f90d98bc6..c0bef116e73 100644
--- a/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
+++ b/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
@@ -474,7 +474,7 @@ private void touch() {
* @return true if the call was added.
*/
private synchronized boolean addCall(Call call) {
- if (shouldCloseConnection.get())
+ if (shouldCloseConnection.get() || !running.get())
return false;
calls.put(call.id, call);
notify();
*/ | ||
private synchronized SynchronousQueue<Pair<Call, ResponseBuffer>> acquireActiveRequestQueue() { | ||
if (shouldCloseConnection.get() || !running.get()) { | ||
LOG.debug("IPC client is stopped"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: LOG.debug("IPC client {} is stopped", this)
would be great
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i was thinking that myself
* @return the queue or null. | ||
*/ | ||
private synchronized SynchronousQueue<Pair<Call, ResponseBuffer>> acquireActiveRequestQueue() { | ||
if (shouldCloseConnection.get() || !running.get()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not need additional synchronization on putLock
object while accessing running
correct?
Edit: This would be too much for accessing atomic boolean.
// or it has happened but the finally {} clause has not been invoked (good). | ||
// without knowing which, print a warning message so at least logs on | ||
// a deadlock are meaningful. | ||
LOG.warn("Possible overlap in queue shutdown and request"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also print queueReservations.get()
with this log
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. with the design of owens patch you should just have 1 thread in put() and one worker, but good to show.
+explicitly raising an exception if a call is made while closed not sure about the exception raising but when in, ipc tests fail, including the one on socket retries. Test problem? real problem? This gets complex fast, to complex for a last-minute fix. Change-Id: I86e738d22a927e36ac49c6c5d3e0fdb185416259 proposing: revert the HADOOP-18324 patch from 3.3.5
latest revision has a connection exception raised when queuing on a closed connection, rather than a silent no-op. this highlights that tests expecting socket retries to work now fail as stuff is going through the closed connection. Either the production code is wrong, or the mockito-based test isn't doing the right thing. it's too dangerous to try and get a fix in here for an RC, so I am proposing rolling back the big change and targeting the 3.3.6 release. |
see #5369 |
IMO that is reasonable thing to do given we are still not aware whether the impact is only limited to UTs. |
💔 -1 overall
This message was automatically generated. |
owen actually fixed this properly |
Description of PR
Extension of #5162
Trying to address the problem
This design doesn't quite stop (2), though it should
detect and warn if the problem surfaces.
"Possible overlap in queue shutdown and request"
No new tests;
How was this patch tested?
For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?