YARN-11476. Add NodeManager metric for event queue size of dispatcher #5599
Conversation
@cxzl25 Thank you very much for your contribution! Can you explain why we are adding this metric?
eventQueueMetricExecutor = new ScheduledThreadPoolExecutor(1,
    new ThreadFactoryBuilder().setDaemon(true)
        .setNameFormat("EventQueueSizeMetricThread").build());
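For context, here is a minimal stand-alone sketch of how such an executor could sample the queue size once per second and expose it as a gauge. The queue, gauge, and class names are placeholders for illustration (a plain lambda stands in for Guava's ThreadFactoryBuilder); this is not the exact patch code.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: a daemon thread samples an event queue once per second and stores
// its size in a gauge-like field. The queue and gauge are placeholders, not
// the actual NodeManager dispatcher or metrics classes.
public class EventQueueSizeMetricSketch {
  private final BlockingQueue<Object> eventQueue = new LinkedBlockingQueue<>();
  private final AtomicInteger eventQueueSizeGauge = new AtomicInteger();

  public void startMetricThread() {
    ScheduledThreadPoolExecutor eventQueueMetricExecutor =
        new ScheduledThreadPoolExecutor(1, runnable -> {
          Thread t = new Thread(runnable, "EventQueueSizeMetricThread");
          t.setDaemon(true);
          return t;
        });
    // Sample the queue size at a fixed 1-second interval.
    eventQueueMetricExecutor.scheduleAtFixedRate(
        () -> eventQueueSizeGauge.set(eventQueue.size()),
        0, 1, TimeUnit.SECONDS);
  }

  public int currentQueueSizeMetric() {
    return eventQueueSizeGauge.get();
  }
}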
code looks good, 5 chars
💔 -1 overall
This message was automatically generated.
Because I found that there are several NMs in the cluster whose heartbeat with the RM is normal, but the NM has not started any containers, and the event queue has a backlog of many events.
What events are included? In my opinion, NM dispatcher queue accumulation is rare.
We found that it was stuck on an RPC waiting for the NameNode result to return, and the event queue kept growing:
"AsyncDispatcher event handler Dispatcher" #361 prio=5 os_prio=0 tid=0x00007f1f91e58000 nid=0x3ece in Object.wait() [0x00007f1f10027000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at org.apache.hadoop.ipc.Client.call(Client.java:1448)
- locked <0x000000074415df28> (a org.apache.hadoop.ipc.Client$Call)
at org.apache.hadoop.ipc.Client.call(Client.java:1394)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:238)
at com.sun.proxy.$Proxy28.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:818)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy29.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2073)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1285)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1281)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1297)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:321)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:456)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:68)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:199)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:115)
at java.lang.Thread.run(Thread.java:745)
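For illustration, here is a simplified toy model of why one blocked handler backs up the whole queue: a single dispatch thread drains the queue, so a handler stuck on a remote call stalls every event behind it. This is a sketch only, not the actual AsyncDispatcher implementation.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Toy model of a single-threaded dispatcher: events are Runnables drained by
// one handler thread. If a handler blocks (e.g. on a NameNode RPC), no further
// events are processed and queueSize() keeps growing, as observed above.
public class SingleThreadDispatcherSketch {
  private final BlockingQueue<Runnable> eventQueue = new LinkedBlockingQueue<>();

  public void start() {
    Thread handlerThread = new Thread(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          eventQueue.take().run();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    }, "AsyncDispatcher event handler (sketch)");
    handlerThread.setDaemon(true);
    handlerThread.start();
  }

  public void post(Runnable event) {
    eventQueue.add(event);
  }

  public int queueSize() {
    return eventQueue.size();
  }
}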
@cxzl25 Thanks for the explanation. Have we observed any changes in CPU usage after adding this metric? I noticed that the collection frequency is set to once per second. I also suspect that network congestion could be the cause of this issue. Is the number of TCP connections normal?
I referred to YARN-10771 to set the collection interval.
I checked the two machines in the cluster that had this problem; there was no obvious abnormality at the network level, and the number of connections did not fluctuate much.
BTW, I noticed that the NM checks the permission settings of the remote log directory (Lines 347 to 360 in 0ac443b). I was wondering if it is possible to check only once per NM?
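As a rough sketch of what a once-per-NM check could look like, the result could be memoized with an atomic flag. The class and method names below are illustrative, not the actual LogAggregationService code.

import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: perform an expensive remote-directory permission check
// only once per NodeManager process instead of on every application init.
public class RemoteLogDirCheckSketch {
  private final AtomicBoolean remoteDirVerified = new AtomicBoolean(false);

  public void verifyAndCreateRemoteLogDirOnce() {
    // Only the first caller wins the compareAndSet and performs the check;
    // later application-init events skip the NameNode round trip.
    if (remoteDirVerified.compareAndSet(false, true)) {
      // ... getFileStatus / permission check against the remote log dir ...
    }
  }
}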
💔 -1 overall
This message was automatically generated.
Thanks for the feedback, I need some time to read this part of the code.
@slfan1989 Can you help merge this PR? Thanks.
@cxzl25 Thanks for the contribution! Can we add a switch configuration (in the YARN configuration)? In general, we might not need to collect dispatcher metrics for the NodeManager (NM).
Can we reuse this configuration? The configuration item
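As a hedged illustration of what such a switch might look like, the collection could be gated behind a boolean property. The key name below is made up for the example, not a confirmed YARN configuration item.

import org.apache.hadoop.conf.Configuration;

// Sketch of gating metric collection behind a boolean switch. The property
// name is hypothetical, used only to illustrate the idea of a toggle.
public class DispatcherMetricSwitchSketch {
  static final String DISPATCHER_METRIC_ENABLED_KEY =
      "yarn.nodemanager.dispatcher-metrics.enabled";

  public static boolean dispatcherMetricsEnabled(Configuration conf) {
    return conf.getBoolean(DISPATCHER_METRIC_ENABLED_KEY, false);
  }
}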
We're closing this stale PR because it has been open for 100 days with no activity. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
Description of PR
The NodeManager has two dispatchers; we can add metrics to observe the dispatcher event queue size.
How was this patch tested?
For code changes:
If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?