HDFS-16853. IPC shutdown failures. #5366

Closed


steveloughran
Contributor

Description of PR

Extension of #5162

This PR tries to address two problems:

  1. MUST NOT submit into the blocking queue while closing.
  2. MUST NOT call queue.put() in a synchronized block.

This design doesn't quite stop (2), though it should
detect the problem if it surfaces and log a warning:
"Possible overlap in queue shutdown and request"

No new tests.
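
The two rules and the detection step can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patch itself: ReservingSender, send(), take() and overlapWarnings are hypothetical names, the reservation counter only loosely mirrors the queueReservations counter mentioned in the review, and offer() with a timeout stands in for queue.put() so that the sketch cannot hang.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the IPC connection; not the real Client code.
class ReservingSender {
  private final SynchronousQueue<String> queue = new SynchronousQueue<>();
  private final AtomicBoolean closing = new AtomicBoolean(false);
  private final AtomicInteger reservations = new AtomicInteger();
  final AtomicInteger overlapWarnings = new AtomicInteger();

  // Rule 1: refuse new work once closing. Only this short check holds the lock.
  private synchronized SynchronousQueue<String> acquire() {
    if (closing.get()) {
      return null;
    }
    reservations.incrementAndGet();
    return queue;
  }

  // Rule 2: the blocking hand-off runs outside any synchronized block.
  boolean send(String request, long timeoutMillis) {
    SynchronousQueue<String> q = acquire();
    if (q == null) {
      return false;
    }
    try {
      return q.offer(request, timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      reservations.decrementAndGet();
    }
  }

  // Consumer side, standing in for the thread that writes requests out.
  String take(long timeoutMillis) {
    try {
      return queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return null;
    }
  }

  // close() cannot evict a sender already past acquire(); it can only notice it.
  synchronized void close() {
    closing.set(true);
    int inFlight = reservations.get();
    if (inFlight > 0) {
      overlapWarnings.incrementAndGet();
      System.err.println(
          "Possible overlap in queue shutdown and request; reservations=" + inFlight);
    }
  }
}
```

A sender that has passed acquire() but not yet reached the finally clause is exactly the window the warning is meant to expose: close() sees a non-zero reservation count and logs it, but does not wait for the sender.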

How was this patch tested?

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

ZanderXu and others added 3 commits February 7, 2023 17:28
Change-Id: I2c7d3752bc8c4ab852015f53a5ef768d737efb2f
@steveloughran
Contributor Author

@virajjasani you were near this code... what do you think? @ZanderXu's core patch does the cleanup, but there's still a small window of possible overlap which I can't see how to get rid of through synchronized blocks. I've got detection, but maybe a semaphore or similar needs to get involved so as to actually block cleanup while other threads are submitting work. That is dangerous, though.
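
One way to picture the semaphore idea is the sketch below. It is not part of this PR and every name in it (PermitGate, enter, exit, MAX_SUBMITTERS) is hypothetical. Each submitter holds one permit while it is handing work over, and close() tries to collect every permit, so it can only complete once no submitter is inside. The timeout on close() is there because of the danger noted above: a submitter stuck in a blocking put() would otherwise block cleanup indefinitely.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical gate; an assumption for illustration, not code from the patch.
class PermitGate {
  private static final int MAX_SUBMITTERS = 1024; // arbitrary upper bound
  private final Semaphore permits = new Semaphore(MAX_SUBMITTERS);
  private volatile boolean closed;

  // Returns true, holding one permit, if the caller may submit work.
  boolean enter() {
    if (closed || !permits.tryAcquire()) {
      return false;
    }
    if (closed) {
      // close() started between the first check and the acquire: back out.
      permits.release();
      return false;
    }
    return true;
  }

  // Must be called in a finally clause by every caller whose enter() returned true.
  void exit() {
    permits.release();
  }

  // Waits for in-flight submitters to leave; returns false if they did not in time.
  boolean close(long timeoutMillis) {
    closed = true;
    try {
      // On success the permits are never released, so the gate stays shut.
      return permits.tryAcquire(MAX_SUBMITTERS, timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

A caller would wrap the blocking hand-off as: if (gate.enter()) { try { /* put */ } finally { gate.exit(); } }. Whether the extra blocking in close() is acceptable for the IPC client is exactly the open question raised in the comment.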

@steveloughran
Contributor Author

@xkrogen thoughts?

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 51s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 48m 22s trunk passed
+1 💚 compile 25m 20s trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 compile 21m 38s trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 checkstyle 1m 7s trunk passed
+1 💚 mvnsite 1m 38s trunk passed
+1 💚 javadoc 1m 8s trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 javadoc 0m 41s trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 spotbugs 2m 40s trunk passed
+1 💚 shadedclient 28m 10s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 7s the patch passed
+1 💚 compile 24m 30s the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 javac 24m 30s the patch passed
+1 💚 compile 21m 41s the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 javac 21m 41s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 1m 0s the patch passed
+1 💚 mvnsite 1m 35s the patch passed
+1 💚 javadoc 0m 58s the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 javadoc 0m 41s the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 spotbugs 2m 41s the patch passed
+1 💚 shadedclient 28m 28s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 18m 31s hadoop-common in the patch passed.
+1 💚 asflicense 1m 6s The patch does not generate ASF License warnings.
233m 44s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5366/1/artifact/out/Dockerfile
GITHUB PR #5366
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 8ee5018a3bdc 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / d7a515e
Default Java Private Build-1.8.0_352-8u352-ga-1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5366/1/testReport/
Max. process+thread count 1235 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5366/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

Contributor

@virajjasani left a comment

Some minor comments; otherwise the changes look good.

This is mostly not relevant to this change, but I wonder why we do not check running in addCall:

diff --git a/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java b/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
index c0f90d98bc6..c0bef116e73 100644
--- a/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
+++ b/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java
@@ -474,7 +474,7 @@ private void touch() {
      * @return true if the call was added.
      */
     private synchronized boolean addCall(Call call) {
-      if (shouldCloseConnection.get())
+      if (shouldCloseConnection.get() || !running.get())
         return false;
       calls.put(call.id, call);
       notify();

*/
private synchronized SynchronousQueue<Pair<Call, ResponseBuffer>> acquireActiveRequestQueue() {
if (shouldCloseConnection.get() || !running.get()) {
LOG.debug("IPC client is stopped");
Contributor

nit: LOG.debug("IPC client {} is stopped", this) would be great

Contributor Author

I was thinking that myself.

* @return the queue or null.
*/
private synchronized SynchronousQueue<Pair<Call, ResponseBuffer>> acquireActiveRequestQueue() {
if (shouldCloseConnection.get() || !running.get()) {
Contributor

@virajjasani Feb 7, 2023

We do not need additional synchronization on the putLock object while accessing running, correct?

Edit: That would be too much just for reading an atomic boolean.

// or it has happened but the finally {} clause has not been invoked (good).
// without knowing which, print a warning message so at least logs on
// a deadlock are meaningful.
LOG.warn("Possible overlap in queue shutdown and request");
Contributor

We can also print queueReservations.get() with this log

Contributor Author

OK. With the design of Owen's patch you should just have one thread in put() and one worker, but it is good to show.

+explicitly raising an exception if a call is made while closed

Not sure about raising the exception: with it in, the IPC tests fail,
including the one on socket retries. Test problem? Real problem?

This gets complex fast, too complex for a last-minute fix.

Change-Id: I86e738d22a927e36ac49c6c5d3e0fdb185416259
proposing: revert the HADOOP-18324 patch from 3.3.5
@steveloughran
Contributor Author

The latest revision raises a connection exception when queuing on a closed connection, rather than treating it as a silent no-op.

This highlights that tests which expect socket retries to work now fail, because requests are going through the closed connection. Either the production code is wrong, or the Mockito-based test isn't doing the right thing.
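
The difference between the two behaviours can be sketched as below. This is an illustration only: ClosableSender and its methods are hypothetical, and java.net.ConnectException is an assumed stand-in for whichever connection exception the revision actually raises.

```java
import java.net.ConnectException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sender contrasting the two behaviours; not the real IPC client.
class ClosableSender {
  private final AtomicBoolean closed = new AtomicBoolean(false);
  private final List<String> sent = new ArrayList<>();

  // Silent no-op: a request on a closed connection is dropped without an error.
  synchronized boolean sendQuietly(String request) {
    if (closed.get()) {
      return false;
    }
    sent.add(request);
    return true;
  }

  // Fail loudly: a request on a closed connection raises an exception.
  synchronized void sendOrFail(String request) throws ConnectException {
    if (closed.get()) {
      throw new ConnectException("connection is closed; rejected request: " + request);
    }
    sent.add(request);
  }

  void close() {
    closed.set(true);
  }

  synchronized int sentCount() {
    return sent.size();
  }
}
```

With the quiet variant a caller that keeps reusing a closed connection never finds out; with the failing variant the same caller gets an exception on every attempt, which is the kind of behaviour change that would surface in retry tests.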

It's too dangerous to try and get a fix in here for an RC, so I am proposing rolling back the big change and targeting the 3.3.6 release.

@steveloughran
Contributor Author

see #5369

@virajjasani
Contributor

> It's too dangerous to try and get a fix in here for an RC, so I am proposing rolling back the big change and targeting the 3.3.6 release.

IMO that is a reasonable thing to do, given that we still do not know whether the impact is limited to UTs.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 49s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
+1 💚 test4tests 0m 0s The patch appears to include 1 new or modified test files.
_ trunk Compile Tests _
+1 💚 mvninstall 46m 9s trunk passed
+1 💚 compile 25m 13s trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 compile 21m 46s trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 checkstyle 1m 7s trunk passed
+1 💚 mvnsite 1m 38s trunk passed
+1 💚 javadoc 1m 7s trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 javadoc 0m 40s trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 spotbugs 2m 38s trunk passed
+1 💚 shadedclient 28m 5s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 1m 6s the patch passed
+1 💚 compile 24m 36s the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 javac 24m 36s the patch passed
+1 💚 compile 21m 28s the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 javac 21m 28s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
-0 ⚠️ checkstyle 0m 59s /results-checkstyle-hadoop-common-project_hadoop-common.txt hadoop-common-project/hadoop-common: The patch generated 1 new + 127 unchanged - 0 fixed = 128 total (was 127)
+1 💚 mvnsite 1m 36s the patch passed
+1 💚 javadoc 0m 59s the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04
+1 💚 javadoc 0m 40s the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08
+1 💚 spotbugs 2m 41s the patch passed
+1 💚 shadedclient 28m 25s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 18m 52s /patch-unit-hadoop-common-project_hadoop-common.txt hadoop-common in the patch passed.
+1 💚 asflicense 0m 52s The patch does not generate ASF License warnings.
231m 1s
Reason Tests
Failed junit tests hadoop.ipc.TestRPCWaitForProxy
hadoop.ipc.TestSaslRPC
hadoop.ipc.TestRPC
hadoop.ipc.TestIPC
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5366/2/artifact/out/Dockerfile
GITHUB PR #5366
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux c021f31b2fff 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / e0bbcbf
Default Java Private Build-1.8.0_352-8u352-ga-1~20.04-b08
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5366/2/testReport/
Max. process+thread count 3165 (vs. ulimit of 5500)
modules C: hadoop-common-project/hadoop-common U: hadoop-common-project/hadoop-common
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5366/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@steveloughran steveloughran marked this pull request as draft February 9, 2023 13:31
@steveloughran
Contributor Author

Owen actually fixed this properly.

4 participants