Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-31611][YARN] Register NettyMemoryMetrics into Node Manager's metrics system #28416

Closed
wants to merge 1 commit into from

Conversation

manuzhang
Copy link
Contributor

@manuzhang manuzhang commented Apr 30, 2020

What changes were proposed in this pull request?

Register NettyMemoryMetrics into Node Manager's metrics system through YarnShuffleServiceMetrics.

  • usedDirectMemory
  • usedHeapMemory

Why are the changes needed?

Such that NettyMemoryMetrics can be exposed through Node Manager's JMX.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Update UT to ensure NettyMemoryMetrics are registered into Node Manager's metrics system.

@dongjoon-hyun
Copy link
Member

ok to test

@@ -242,7 +240,6 @@ public ShuffleMetrics() {
allMetrics.put("registeredExecutorsSize",
(Gauge<Integer>) () -> blockManager.getRegisteredExecutorsSize());
allMetrics.put("numActiveConnections", activeConnections);
allMetrics.put("numRegisteredConnections", registeredConnections);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ur, this might remove this metric from the other resource managers like Mesos. Are you sure about the side-effect?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you need to fix some YARN issue, I'd like to recommend to remove from YARN side and leave this shared class AS-IS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line looks suspicious and confusing since numRegisteredConnections is registered twice for YarnShuffleService. After digging into it, I find it's the Counter from TransportContext that's really counting the numbers. That created here in ExternalBlockHandler#ShuffleMetrics is never used anywhere.

I think it's misleading for people who want to add new metrics. That said, I'm ok to leave it as is if you still suggest so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are worried about the side-effect, I can add UTs for all resource managers to ensure all expected metrics are registered. Then maybe it's worth a separate PR. What do you think ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please create a removal PR first for the following. That will make the review easier~

After digging into it, I find it's the Counter from TransportContext that's really counting the numbers. That created here in ExternalBlockHandler#ShuffleMetrics is never used anywhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will review there if that is really never exposed to any one or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongjoon-hyun #28457 has been created. Please help review, thanks.

@SparkQA
Copy link

SparkQA commented May 1, 2020

Test build #122149 has finished for PR 28416 at commit 47a8954.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

#28457 is merged now. Could you rebase this PR?

@manuzhang
Copy link
Contributor Author

@dongjoon-hyun thanks a lot. I've rebased the PR and updated the description.

@SparkQA
Copy link

SparkQA commented May 8, 2020

Test build #122423 has finished for PR 28416 at commit f9241be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"registerExecutorRequestLatencyMillis"
"registerExecutorRequestLatencyMillis",
"shuffle-server.usedDirectMemory",
"shuffle-server.usedHeapMemory"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently, the naming looks inconsistent from the others. Do you think we have a better way? For example, do we need shuffle-server. prefix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metrics from NettyMemoryMetrics are both collected at server and client sides. That's why they have the module name (shuffle-server at server side) as prefix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Merged to master for Apache Spark 3.1.0.

@dongjoon-hyun
Copy link
Member

BTW, @manuzhang . You can add tianlzhang@ebay.com to your GitHub Action additional email. Currently, you are using two emails differently.

If you add that, your commit will be listed together here.

@manuzhang
Copy link
Contributor Author

@dongjoon-hyun thanks for the nice advice. That's unexpected but somehow I got the emails mixed up. Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants