Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Log Improvement] Output the registering/lost/exclude nodes in log #148

Merged
merged 4 commits into from Aug 10, 2022

Conversation

zuston
Copy link
Member

@zuston zuston commented Aug 9, 2022

What changes were proposed in this pull request?

[Log Improvement] Output the registering/lost/exclude nodes in log

Why are the changes needed?

It's hard to find the alive nodes in coordinator side, and only total alive nodes number can be shown in coordinator server metrics, this is not enough. So it's necessary to output the registering/lost/exclude nodes in log file.

Does this PR introduce any user-facing change?

No

How was this patch tested?

No need.

@codecov-commenter
Copy link

codecov-commenter commented Aug 9, 2022

Codecov Report

Merging #148 (7da7c65) into master (04cbdbb) will increase coverage by 0.14%.
The diff coverage is 86.95%.

@@             Coverage Diff              @@
##             master     #148      +/-   ##
============================================
+ Coverage     57.20%   57.35%   +0.14%     
- Complexity     1200     1207       +7     
============================================
  Files           150      150              
  Lines          8185     8209      +24     
  Branches        773      775       +2     
============================================
+ Hits           4682     4708      +26     
  Misses         3257     3257              
+ Partials        246      244       -2     
Impacted Files Coverage Δ
...ache/uniffle/coordinator/SimpleClusterManager.java 91.07% <81.25%> (+2.77%) ⬆️
...rg/apache/uniffle/coordinator/CoordinatorConf.java 96.39% <100.00%> (+0.24%) ⬆️
...org/apache/uniffle/server/ShuffleFlushManager.java 76.83% <0.00%> (-1.70%) ⬇️
...apache/uniffle/coordinator/CoordinatorMetrics.java 93.93% <0.00%> (+0.18%) ⬆️
...pache/spark/shuffle/writer/WriteBufferManager.java 83.45% <0.00%> (+1.23%) ⬆️
.../apache/uniffle/coordinator/ClientConfManager.java 92.95% <0.00%> (+1.40%) ⬆️
...e/uniffle/server/storage/SingleStorageManager.java 67.21% <0.00%> (+1.63%) ⬆️

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

@@ -99,6 +101,13 @@ void nodesCheck() {
}
}
}
if (!deleteIds.isEmpty() || outputAliveServerCount % 30 == 0) {
LOG.info("Alive servers number: {}, ids: {}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this? We have metrics tell us how many there are alive servers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Number is not enough. I want to know which servers are lost from the coordinator side.

And the number shown in log will be easy to find.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I got it. Could 30 be configOption?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.
Changelog

  1. Introduce rss.coordinator.server.periodic.output.interval.times, default value = 30, this means it will output every 5min.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you judge deleteIds.isEmpty()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when deleteIds is not empty, it means the server list changed, we should output the latest alive servers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.

} catch (FileNotFoundException fileNotFoundException) {
excludeNodes = Sets.newConcurrentHashSet();
} catch (Exception e) {
LOG.warn("Error when updating exclude nodes, the exclude nodes file path: " + path, e);
}
int newlyExcludeNodesNumber = excludeNodes.size();
if (newlyExcludeNodesNumber != originalExcludeNodesNumber) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

@@ -47,6 +47,12 @@ public class CoordinatorConf extends RssBaseConf {
.longType()
.defaultValue(30 * 1000L)
.withDescription("timeout if can't get heartbeat from shuffle server");
public static final ConfigOption<Long> COORDINATOR_NODES_PERIODIC_OUTPUT_INTERVAL_TIMES = ConfigOptions
.key("rss.coordinator.server.periodic.output.interval.times")
.longType()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add checkValue?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I will add check.

Copy link
Contributor

@jerqi jerqi Aug 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add some document?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's OK. But currently the file of doc/coordinator_guide.md is empty and the README about coordinator conf is missing lots of params.

So do i add this conf into README?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you migrate the cooridinator's document from README to doc/coordinator_guide.md and add the document of this config option in the doc/coordinator_guide.md in a new pr?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I'm glad to do this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's merge this pr first.

@@ -99,6 +101,13 @@ void nodesCheck() {
}
}
}
if (!deleteIds.isEmpty() || outputAliveServerCount % 30 == 0) {
LOG.info("Alive servers number: {}, ids: {}",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you judge deleteIds.isEmpty()?

Copy link
Contributor

@jerqi jerqi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @zuston

@jerqi jerqi merged commit 2874c3e into apache:master Aug 10, 2022
@zuston
Copy link
Member Author

zuston commented Aug 10, 2022

Thanks for your review @jerqi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants