-
Notifications
You must be signed in to change notification settings - Fork 8.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
YARN-11360: Add number of decommissioning/shutdown nodes to YARN cluster metrics. #5060
Conversation
💔 -1 overall
This message was automatically generated. |
@cnauroth this Junit Test error is caused by YARN-11342 (#5005), I submitted a fix patch YARN-11357(#5055). Can you help review this pr(#5055)? Thank you very much! |
@slfan1989 , thanks for notifying me of the test failure! I gave +1 on your patch, but I'd also like to see one more code review from someone who has spent more time than me in the federation code. |
@cnauroth Thank you very much for your help reviewing the code! |
💔 -1 overall
This message was automatically generated. |
Regarding Checkstyle -0, TopCLI.java has 134 pre-existing Checkstyle violations. For consistency, my patch is continuing one of the class's patterns that triggers a Checkstyle violation. I'm not planning to address this right now. We could do a big style cleanup later, but I don't want to mix it with real logic changes here. Pre-commit tests also reported new compiler warnings, pasted below. I don't understand why this is happening. It looks unrelated to anything I'm changing in this patch, but I also can't reproduce the warnings locally.
|
@mikaylakonst , as you suggested, I've updated the patch so that it also includes number of shutdown nodes. @abmodi , thank you for the review. FYI, I pushed up a change to add one more property after you approved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM +1 (non-binding)
[nit] @cnauroth, may be you want to uncheck the below in PR description
|
Thanks for the suggestion. I made the update. Thanks also for the code review and LGTM! |
I have committed this to trunk, branch-3.3 and branch-3.2 (after resolving a minor merge conflict). @mikaylakonst , @ashutoshcipher and @abmodi , thank you for the code reviews. |
Description of PR
YARN cluster metrics expose counts of NodeManagers in various states including active and decommissioned. However, these metrics don't expose NodeManagers that are currently in the process of decommissioning. This can look a little spooky to a consumer of these metrics. First, the node drops out of the active count, so it seems like a node just vanished. Then, later (possibly hours later with consideration of graceful decommission), it comes back into existence in the decommissioned count.
This issue tracks adding the decommissioning count to the metrics ResourceManager RPC. We're also adding the shutdown node count. This also enables exposing it in the
yarn top
output. These metrics are already visible through the REST API, so there isn't any change required there.How was this patch tested?
The patch adds new unit tests for the ResourceManager RPC, correct merging of the metric through the router service and
yarn top
. I also tested successfully in a live cluster.For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?