@cxzl25
Contributor

@cxzl25 cxzl25 commented Apr 27, 2023

Description of PR

NodeManager has two dispatchers. This PR adds metrics to observe the dispatcher queue size.

How was this patch tested?

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@slfan1989
Contributor

@cxzl25 Thank you very much for your contribution! Can you explain why we are adding this metric?

@slfan1989 slfan1989 self-requested a review April 27, 2023 06:01

eventQueueMetricExecutor = new ScheduledThreadPoolExecutor(1,
    new ThreadFactoryBuilder().setDaemon(true)
        .setNameFormat("EventQueueSizeMetricThread").build());
Contributor

code looks good, 5 chars
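
For readers following the review, the idea behind the snippet above can be sketched with plain JDK classes: a single daemon thread periodically copies the queue size into a gauge. This is a minimal sketch, not the PR's code. The class `EventQueueSizeSampler`, its methods, and the 1000 ms period are hypothetical, and a plain `ThreadFactory` lambda stands in for Guava's `ThreadFactoryBuilder`.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sampler (not the PR's class): copies queue.size() into a gauge.
class EventQueueSizeSampler implements AutoCloseable {
  private final BlockingQueue<?> queue;
  private final AtomicInteger gauge = new AtomicInteger();
  private final ScheduledThreadPoolExecutor executor;

  EventQueueSizeSampler(BlockingQueue<?> queue, long periodMs) {
    this.queue = queue;
    // Daemon thread, so the sampler never blocks JVM shutdown.
    this.executor = new ScheduledThreadPoolExecutor(1, r -> {
      Thread t = new Thread(r, "EventQueueSizeMetricThread");
      t.setDaemon(true);
      return t;
    });
    executor.scheduleAtFixedRate(this::sampleOnce, periodMs, periodMs,
        TimeUnit.MILLISECONDS);
  }

  // One sampling step; also callable directly, which keeps tests deterministic.
  void sampleOnce() {
    gauge.set(queue.size());
  }

  int lastObservedSize() {
    return gauge.get();
  }

  @Override
  public void close() {
    executor.shutdownNow();
  }

  public static void main(String[] args) {
    BlockingQueue<String> events = new LinkedBlockingQueue<>();
    try (EventQueueSizeSampler sampler = new EventQueueSizeSampler(events, 1000)) {
      events.add("e1");
      events.add("e2");
      sampler.sampleOnce();
      System.out.println("queue size gauge = " + sampler.lastObservedSize());
    }
  }
}
```

Sampling on a separate thread keeps the dispatcher's hot path untouched; the cost is that the gauge can lag the real queue size by up to one period.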

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 37s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 39m 51s trunk passed
+1 💚 compile 1m 31s trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 compile 1m 25s trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 checkstyle 0m 41s trunk passed
+1 💚 mvnsite 0m 50s trunk passed
+1 💚 javadoc 0m 49s trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 0m 38s trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 spotbugs 1m 39s trunk passed
+1 💚 shadedclient 21m 12s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 39s the patch passed
+1 💚 compile 1m 22s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javac 1m 22s the patch passed
+1 💚 compile 1m 18s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 javac 1m 18s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 26s the patch passed
+1 💚 mvnsite 0m 38s the patch passed
+1 💚 javadoc 0m 32s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 0m 29s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 spotbugs 1m 28s the patch passed
+1 💚 shadedclient 20m 45s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 23m 57s hadoop-yarn-server-nodemanager in the patch passed.
+1 💚 asflicense 0m 38s The patch does not generate ASF License warnings.
122m 39s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5599/1/artifact/out/Dockerfile
GITHUB PR #5599
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux e8c41707527d 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / fcbeede
Default Java Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5599/1/testReport/
Max. process+thread count 560 (vs. ulimit of 5500)
modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5599/1/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@cxzl25
Contributor Author

cxzl25 commented Apr 27, 2023

Can you explain why we are adding this metric?

Because I found several NMs in the cluster whose heartbeat with the RM is normal but which have not started containers, and the event queue has a backlog of many events.
So, similar to YARN-10771, I would like to add NM metrics that make it convenient to observe the health status of the NM.

@slfan1989
Contributor

slfan1989 commented Apr 27, 2023

the event queue has a backlog of many events.

What events are included? In my opinion, NM dispatcher queue accumulation is rare.

@cxzl25
Contributor Author

cxzl25 commented Apr 27, 2023

What events are included? In my opinion, NM dispatcher queue accumulation is rare.

We found that the dispatcher was stuck on an RPC, waiting for the NN to return a result, while events kept accumulating.
The connection to the NN looked normal (checked with tcpdump), and the NN was healthy.
I used tcpkill to kill the connection. The NM then reconnected to the NN and resumed work without a restart.

"AsyncDispatcher event handler Dispatcher" #361 prio=5 os_prio=0 tid=0x00007f1f91e58000 nid=0x3ece in Object.wait() [0x00007f1f10027000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at org.apache.hadoop.ipc.Client.call(Client.java:1448)
        - locked <0x000000074415df28> (a org.apache.hadoop.ipc.Client$Call)
        at org.apache.hadoop.ipc.Client.call(Client.java:1394)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:238)
        at com.sun.proxy.$Proxy28.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:818)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
        at com.sun.proxy.$Proxy29.getFileInfo(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2073)
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1285)
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1281)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1297)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:195)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:321)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:456)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:68)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:199)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:115)
        at java.lang.Thread.run(Thread.java:745)
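
The stack trace shows the single dispatcher thread parked inside a handler. The effect on the queue can be illustrated with a small, self-contained sketch: while one handler blocks, every later event stays queued. This is only an illustration of the mechanism and not YARN's `AsyncDispatcher`; the class `BlockedDispatcherDemo` and its method are hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo: one consumer thread, one handler that never returns
// until released, and a backlog that grows behind it.
class BlockedDispatcherDemo {

  /** Enqueues extraEvents while the handler is stuck; returns the backlog size. */
  static int backlogWhileBlocked(int extraEvents) {
    BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    CountDownLatch handlerEntered = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);

    Thread dispatcher = new Thread(() -> {
      try {
        while (true) {
          queue.take().run(); // events are handled strictly one at a time
        }
      } catch (InterruptedException e) {
        // interrupted: stop dispatching
      }
    }, "demo-dispatcher");
    dispatcher.setDaemon(true);
    dispatcher.start();

    // First event: stands in for the handler waiting on an RPC reply.
    queue.add(() -> {
      handlerEntered.countDown();
      try {
        release.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    try {
      handlerEntered.await(); // the dispatcher is now stuck inside the handler
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for handler", e);
    }

    for (int i = 0; i < extraEvents; i++) {
      queue.add(() -> { }); // nobody can consume these yet
    }
    int backlog = queue.size();

    release.countDown();    // unblock the handler
    dispatcher.interrupt(); // and stop the demo thread
    return backlog;
  }

  public static void main(String[] args) {
    System.out.println("backlog = " + backlogWhileBlocked(5));
  }
}
```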

@slfan1989
Contributor

@cxzl25 Thanks for the explanation. Have we observed any changes in CPU usage after adding this metric? I noticed that the collection frequency is set to once per second. I have some doubts that network congestion could be the cause of this issue. Is the number of tcp connections normal?

@cxzl25
Contributor Author

cxzl25 commented Apr 27, 2023

Have we observed any changes in CPU usage after adding this metric? I noticed that the collection frequency is set to once per second.

I followed YARN-10771 when setting the interval.

I have some doubts that network congestion could be the cause of this issue. Is the number of tcp connections normal?

I checked the two machines in the cluster that had this problem. There was no obvious abnormality at the network level, and the number of connections did not fluctuate much.
ipc.Client has no timeout option in synchronous mode, so the NM's AsyncDispatcher kept waiting here.
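
One generic way to keep a caller from waiting forever on a blocking call is to run it on a helper thread and bound the wait with `Future.get(timeout)`. The sketch below uses only JDK classes. It is not Hadoop's `ipc.Client` API and not something proposed in this PR; `BoundedCall`, `callWithTimeout`, and the fallback value are hypothetical names for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: bounds how long the caller waits for a blocking call.
class BoundedCall {
  private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
    Thread t = new Thread(r, "bounded-call");
    t.setDaemon(true);
    return t;
  });

  /** Returns the call's result, or fallback if it does not finish within timeoutMs. */
  static <T> T callWithTimeout(Callable<T> call, long timeoutMs, T fallback) {
    Future<T> future = POOL.submit(call);
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // interrupt the stuck worker; the caller moves on
      return fallback;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      future.cancel(true);
      return fallback;
    } catch (ExecutionException e) {
      throw new IllegalStateException("call failed", e.getCause());
    }
  }

  public static void main(String[] args) {
    String fast = callWithTimeout(() -> "status", 1000, "timed-out");
    String slow = callWithTimeout(() -> {
      Thread.sleep(60_000); // stands in for a reply that never arrives
      return "status";
    }, 100, "timed-out");
    System.out.println(fast + " / " + slow);
  }
}
```

The trade-off is an extra thread per in-flight call, and cancellation only helps if the blocked code responds to interruption.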

@cxzl25
Contributor Author

cxzl25 commented Apr 27, 2023

BTW I noticed that the NM checks the permission settings of RemoteLogDir when scheduling each app, which generates a lot of RPCs. In my environment, I see more than 10 million such calls a day.

I was wondering if it is possible to check only once per NM?
See LogAggregationFileController#verifyAndCreateRemoteLogDir:

public void verifyAndCreateRemoteLogDir() {
  // Checking the existence of the TLD
  FileSystem remoteFS = null;
  try {
    remoteFS = getFileSystem(conf);
  } catch (IOException e) {
    throw new YarnRuntimeException(
        "Unable to get Remote FileSystem instance", e);
  }
  boolean remoteExists = true;
  Path remoteRootLogDir = getRemoteRootLogDir();
  try {
    FsPermission perms =
        remoteFS.getFileStatus(remoteRootLogDir).getPermission();
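
The "check only once per NM" idea could look roughly like the following guard around an expensive verification. This is a sketch under assumptions, not a patch to `LogAggregationFileController`: the class `OncePerProcessCheck` is hypothetical, and the `Runnable` stands in for the remote permission check.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical guard: runs an expensive check until it succeeds once,
// then turns later calls into cheap no-ops.
class OncePerProcessCheck {
  private final AtomicBoolean verified = new AtomicBoolean(false);
  private final Runnable expensiveCheck;

  OncePerProcessCheck(Runnable expensiveCheck) {
    this.expensiveCheck = expensiveCheck;
  }

  void verifyOnce() {
    if (verified.get()) {
      return; // fast path: already verified, no remote call
    }
    synchronized (this) {
      if (verified.get()) {
        return;
      }
      // If the check throws, verified stays false and the next call retries.
      expensiveCheck.run();
      verified.set(true);
    }
  }

  public static void main(String[] args) {
    AtomicInteger remoteCalls = new AtomicInteger();
    OncePerProcessCheck check =
        new OncePerProcessCheck(() -> remoteCalls.incrementAndGet());
    for (int app = 0; app < 1000; app++) {
      check.verifyOnce(); // one call per application init
    }
    System.out.println("remote calls = " + remoteCalls.get());
  }
}
```

A guard like this trades freshness for fewer RPCs: if the remote directory is removed or its permissions change after the first successful check, the process would not notice until it restarts.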

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 35s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 40m 56s trunk passed
+1 💚 compile 1m 34s trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 compile 1m 29s trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 checkstyle 0m 37s trunk passed
+1 💚 mvnsite 0m 50s trunk passed
+1 💚 javadoc 0m 47s trunk passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 0m 36s trunk passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 spotbugs 1m 46s trunk passed
+1 💚 shadedclient 21m 2s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 38s the patch passed
+1 💚 compile 1m 28s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javac 1m 28s the patch passed
+1 💚 compile 1m 24s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 javac 1m 24s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 checkstyle 0m 28s the patch passed
+1 💚 mvnsite 0m 41s the patch passed
+1 💚 javadoc 0m 34s the patch passed with JDK Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1
+1 💚 javadoc 0m 31s the patch passed with JDK Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
+1 💚 spotbugs 1m 40s the patch passed
+1 💚 shadedclient 24m 40s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 24m 20s hadoop-yarn-server-nodemanager in the patch passed.
+1 💚 asflicense 0m 37s The patch does not generate ASF License warnings.
127m 53s
Subsystem Report/Notes
Docker ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5599/2/artifact/out/Dockerfile
GITHUB PR #5599
Optional Tests dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets
uname Linux 3851f3d8efd6 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / aa60c6b
Default Java Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
Multi-JDK versions /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.18+10-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u362-ga-0ubuntu1~20.04.1-b09
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5599/2/testReport/
Max. process+thread count 556 (vs. ulimit of 5500)
modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5599/2/console
versions git=2.25.1 maven=3.6.3 spotbugs=4.2.2
Powered by Apache Yetus 0.14.0 https://yetus.apache.org

This message was automatically generated.

@slfan1989
Contributor

BTW I noticed that the NM checks the permission settings of RemoteLogDir when scheduling each app, which generates a lot of RPCs. In my environment, I see more than 10 million such calls a day.

Thanks for the feedback. I need some time to read this part of the code.
Let's complete this PR first.

@cxzl25
Contributor Author

cxzl25 commented May 8, 2023

@slfan1989 Could you help merge this PR? Thanks.

@slfan1989
Contributor

slfan1989 commented May 10, 2023

@cxzl25 Thanks for the contribution! Can we add a switch configuration (in YarnConfiguration)? In general, we might not need to collect scheduler metrics for the NodeManager (NM).

@cxzl25
Contributor Author

cxzl25 commented May 17, 2023

Can we add a switch configuration (in YarnConfiguration)?

Can we reuse an existing configuration, namely the item yarn.nodemanager.dispatcher.metric.enable introduced in YARN-10846?
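
Gating the metric on that key could be sketched as below. To stay self-contained, `java.util.Properties` stands in for Hadoop's `Configuration`. The key is the one quoted above; the class `DispatcherMetricSwitch` and the default of `false` are assumptions made for this illustration and are not taken from YARN-10846.

```java
import java.util.Properties;

// Hypothetical switch: decides whether the queue-size metric thread is started.
class DispatcherMetricSwitch {
  // Key as quoted in this thread.
  static final String KEY = "yarn.nodemanager.dispatcher.metric.enable";
  // Assumed default for this sketch only.
  static final boolean ASSUMED_DEFAULT = false;

  static boolean isEnabled(Properties conf) {
    String raw = conf.getProperty(KEY);
    return raw == null ? ASSUMED_DEFAULT : Boolean.parseBoolean(raw.trim());
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println("unset   -> " + isEnabled(conf));
    conf.setProperty(KEY, "true");
    System.out.println("enabled -> " + isEnabled(conf));
    // A caller would start the sampling executor only when isEnabled(conf) is true.
  }
}
```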

@github-actions
Contributor

We're closing this stale PR because it has been open for 100 days with no activity. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you feel like this was a mistake, or you would like to continue working on it, please feel free to re-open it and ask for a committer to remove the stale tag and review again.
Thanks all for your contribution.

@github-actions github-actions bot added the Stale label Oct 22, 2025
@github-actions github-actions bot closed this Oct 23, 2025