New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-31611][YARN] Register NettyMemoryMetrics into Node Manager's metrics system #28416
Conversation
ok to test |
@@ -242,7 +240,6 @@ public ShuffleMetrics() { | |||
allMetrics.put("registeredExecutorsSize", | |||
(Gauge<Integer>) () -> blockManager.getRegisteredExecutorsSize()); | |||
allMetrics.put("numActiveConnections", activeConnections); | |||
allMetrics.put("numRegisteredConnections", registeredConnections); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ur, this might remove this metric from the other resource managers like Mesos
. Are you sure about the side-effect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you need to fix some YARN issue, I'd like to recommend to remove from YARN side and leave this shared class AS-IS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line looks suspicious and confusing since numRegisteredConnections
is registered twice for YarnShuffleService
. After digging into it, I find it's the Counter from TransportContext that's really counting the numbers. That created here in ExternalBlockHandler#ShuffleMetrics
is never used anywhere.
I think it's misleading for people who want to add new metrics. That said, I'm ok to leave it as is if you still suggest so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are worried about the side-effect, I can add UTs for all resource managers to ensure all expected metrics are registered. Then maybe it's worth a separate PR. What do you think ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please create a removal PR first for the following. That will make the review easier~
After digging into it, I find it's the Counter from TransportContext that's really counting the numbers. That created here in ExternalBlockHandler#ShuffleMetrics is never used anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will review there if that is really never exposed to any one or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun #28457 has been created. Please help review, thanks.
common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
Show resolved
Hide resolved
common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
Show resolved
Hide resolved
Test build #122149 has finished for PR 28416 at commit
|
#28457 is merged now. Could you rebase this PR? |
@dongjoon-hyun thanks a lot. I've rebased the PR and updated the description. |
Test build #122423 has finished for PR 28416 at commit
|
"registerExecutorRequestLatencyMillis" | ||
"registerExecutorRequestLatencyMillis", | ||
"shuffle-server.usedDirectMemory", | ||
"shuffle-server.usedHeapMemory" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently, the naming looks inconsistent from the others. Do you think we have a better way? For example, do we need shuffle-server.
prefix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metrics from NettyMemoryMetrics
are both collected at server and client sides. That's why they have the module name (shuffle-server
at server side) as prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Merged to master for Apache Spark 3.1.0.
BTW, @manuzhang . You can add
If you add that, your commit will be listed together here. |
@dongjoon-hyun thanks for the nice advice. That's unexpected but somehow I got the emails mixed up. Thanks. |
What changes were proposed in this pull request?
Register
NettyMemoryMetrics
into Node Manager's metrics system throughYarnShuffleServiceMetrics
.Why are the changes needed?
Such that
NettyMemoryMetrics
can be exposed through Node Manager's JMX.Does this PR introduce any user-facing change?
No.
How was this patch tested?
Update UT to ensure NettyMemoryMetrics are registered into Node Manager's metrics system.