YARN-11476. Add NodeManager metric for event queue size of dispatcher #5599
Conversation
@cxzl25 Thank you very much for your contribution! Can you explain why we are adding this metric?
eventQueueMetricExecutor = new ScheduledThreadPoolExecutor(1,
    new ThreadFactoryBuilder().setDaemon(true)
        .setNameFormat("EventQueueSizeMetricThread").build());
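For context, here is a minimal stand-alone sketch of how such an executor could sample the queue size once per second and expose it as a gauge. The queue, gauge, and class names are placeholders for illustration (a plain lambda stands in for Guava's ThreadFactoryBuilder); this is not the exact patch code.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: a daemon thread samples an event queue once per second and stores
// its size in a gauge-like field. The queue and gauge are placeholders, not
// the actual NodeManager dispatcher or metrics classes.
public class EventQueueSizeMetricSketch {
  private final BlockingQueue<Object> eventQueue = new LinkedBlockingQueue<>();
  private final AtomicInteger eventQueueSizeGauge = new AtomicInteger();

  public void startMetricThread() {
    ScheduledThreadPoolExecutor eventQueueMetricExecutor =
        new ScheduledThreadPoolExecutor(1, runnable -> {
          Thread t = new Thread(runnable, "EventQueueSizeMetricThread");
          t.setDaemon(true);
          return t;
        });
    // Sample the queue size at a fixed 1-second interval.
    eventQueueMetricExecutor.scheduleAtFixedRate(
        () -> eventQueueSizeGauge.set(eventQueue.size()),
        0, 1, TimeUnit.SECONDS);
  }

  public int currentQueueSizeMetric() {
    return eventQueueSizeGauge.get();
  }
}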
code looks good, 5 chars
💔 -1 overall
This message was automatically generated.
Because I found that there are several NMs in the cluster whose heartbeat with the RM is normal, but the NM has not started any containers, and the event queue has a backlog of many events.
What events are included? In my opinion, NM dispatcher queue accumulation is rare.
We found that it was stuck on an RPC waiting for the NameNode result to return, and the event queue kept growing:
"AsyncDispatcher event handler Dispatcher" #361 prio=5 os_prio=0 tid=0x00007f1f91e58000 nid=0x3ece in Object.wait() [0x00007f1f10027000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at org.apache.hadoop.ipc.Client.call(Client.java:1448)
- locked <0x000000074415df28> (a org.apache.hadoop.ipc.Client$Call)
at org.apache.hadoop.ipc.Client.call(Client.java:1394)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:238)
at com.sun.proxy.$Proxy28.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:818)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
at com.sun.proxy.$Proxy29.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2073)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1285)
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1281)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1297)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.verifyAndCreateRemoteLogDir(LogAggregationService.java:195)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:321)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:456)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:68)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:199)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:115)
at java.lang.Thread.run(Thread.java:745)
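For illustration, here is a simplified toy model of why one blocked handler backs up the whole queue: a single dispatch thread drains the queue, so a handler stuck on a remote call stalls every event behind it. This is a sketch only, not the actual AsyncDispatcher implementation.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Toy model of a single-threaded dispatcher: events are Runnables drained by
// one handler thread. If a handler blocks (e.g. on a NameNode RPC), no further
// events are processed and queueSize() keeps growing, as observed above.
public class SingleThreadDispatcherSketch {
  private final BlockingQueue<Runnable> eventQueue = new LinkedBlockingQueue<>();

  public void start() {
    Thread handlerThread = new Thread(() -> {
      while (!Thread.currentThread().isInterrupted()) {
        try {
          eventQueue.take().run();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    }, "AsyncDispatcher event handler (sketch)");
    handlerThread.setDaemon(true);
    handlerThread.start();
  }

  public void post(Runnable event) {
    eventQueue.add(event);
  }

  public int queueSize() {
    return eventQueue.size();
  }
}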
@cxzl25 Thanks for the explanation. Have we observed any changes in CPU usage after adding this metric? I noticed that the collection frequency is set to once per second. I also suspect that network congestion could be the cause of this issue. Is the number of TCP connections normal?
I referred to YARN-10771 to set the collection interval.
I checked the two machines in the cluster that had this problem; there was no obvious abnormality at the network level, and the number of connections did not fluctuate much.
BTW, I noticed that the NM checks the permission settings of the remote log directory (Lines 347 to 360 in 0ac443b). I was wondering if it is possible to check only once per NM?
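As a rough sketch of what a once-per-NM check could look like, the result could be memoized with an atomic flag. The class and method names below are illustrative, not the actual LogAggregationService code.

import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: perform an expensive remote-directory permission check
// only once per NodeManager process instead of on every application init.
public class RemoteLogDirCheckSketch {
  private final AtomicBoolean remoteDirVerified = new AtomicBoolean(false);

  public void verifyAndCreateRemoteLogDirOnce() {
    // Only the first caller wins the compareAndSet and performs the check;
    // later application-init events skip the NameNode round trip.
    if (remoteDirVerified.compareAndSet(false, true)) {
      // ... getFileStatus / permission check against the remote log dir ...
    }
  }
}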
💔 -1 overall
This message was automatically generated.
Thanks for the feedback, I need some time to read this part of the code.
@slfan1989 Can you help merge this PR? Thanks.
@cxzl25 Thanks for the contribution! Can we add a switch configuration (in the YARN configuration)? In general, we might not need to collect dispatcher metrics for the NodeManager (NM).
Can we reuse this configuration? The configuration item
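As a hedged illustration of what such a switch might look like, the collection could be gated behind a boolean property. The key name below is made up for the example, not a confirmed YARN configuration item.

import org.apache.hadoop.conf.Configuration;

// Sketch of gating metric collection behind a boolean switch. The property
// name is hypothetical, used only to illustrate the idea of a toggle.
public class DispatcherMetricSwitchSketch {
  static final String DISPATCHER_METRIC_ENABLED_KEY =
      "yarn.nodemanager.dispatcher-metrics.enabled";

  public static boolean dispatcherMetricsEnabled(Configuration conf) {
    return conf.getBoolean(DISPATCHER_METRIC_ENABLED_KEY, false);
  }
}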
We're closing this stale PR because it has been open for 100 days with no activity. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
Description of PR
The NodeManager has two dispatchers; we can add metrics to observe the dispatcher event queue size.
How was this patch tested?
For code changes:
If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?