New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Log Improvement] Output the registering/lost/exclude nodes in log #148
Conversation
Codecov Report
@@ Coverage Diff @@
## master #148 +/- ##
============================================
+ Coverage 57.20% 57.35% +0.14%
- Complexity 1200 1207 +7
============================================
Files 150 150
Lines 8185 8209 +24
Branches 773 775 +2
============================================
+ Hits 4682 4708 +26
Misses 3257 3257
+ Partials 246 244 -2
📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more |
@@ -99,6 +101,13 @@ void nodesCheck() { | |||
} | |||
} | |||
} | |||
if (!deleteIds.isEmpty() || outputAliveServerCount % 30 == 0) { | |||
LOG.info("Alive servers number: {}, ids: {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this? We have metrics tell us how many there are alive servers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Number is not enough. I want to know which servers are lost from the coordinator side.
And the number shown in log will be easy to find.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I got it. Could 30
be configOption?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
Changelog
- Introduce
rss.coordinator.server.periodic.output.interval.times
, default value = 30, this means it will output every 5min.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you judge deleteIds.isEmpty()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when deleteIds is not empty, it means the server list changed, we should output the latest alive servers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
} catch (FileNotFoundException fileNotFoundException) { | ||
excludeNodes = Sets.newConcurrentHashSet(); | ||
} catch (Exception e) { | ||
LOG.warn("Error when updating exclude nodes, the exclude nodes file path: " + path, e); | ||
} | ||
int newlyExcludeNodesNumber = excludeNodes.size(); | ||
if (newlyExcludeNodesNumber != originalExcludeNodesNumber) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
@@ -47,6 +47,12 @@ public class CoordinatorConf extends RssBaseConf { | |||
.longType() | |||
.defaultValue(30 * 1000L) | |||
.withDescription("timeout if can't get heartbeat from shuffle server"); | |||
public static final ConfigOption<Long> COORDINATOR_NODES_PERIODIC_OUTPUT_INTERVAL_TIMES = ConfigOptions | |||
.key("rss.coordinator.server.periodic.output.interval.times") | |||
.longType() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add checkValue
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I will add check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add some document?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's OK. But currently the file of doc/coordinator_guide.md is empty and the README about coordinator conf is missing lots of params.
So do i add this conf into README?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you migrate the cooridinator's document from README to doc/coordinator_guide.md and add the document of this config option in the doc/coordinator_guide.md in a new pr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I'm glad to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, let's merge this pr first.
@@ -99,6 +101,13 @@ void nodesCheck() { | |||
} | |||
} | |||
} | |||
if (!deleteIds.isEmpty() || outputAliveServerCount % 30 == 0) { | |||
LOG.info("Alive servers number: {}, ids: {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you judge deleteIds.isEmpty()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @zuston
Thanks for your review @jerqi |
What changes were proposed in this pull request?
[Log Improvement] Output the registering/lost/exclude nodes in log
Why are the changes needed?
It's hard to find the alive nodes in coordinator side, and only total alive nodes number can be shown in coordinator server metrics, this is not enough. So it's necessary to output the registering/lost/exclude nodes in log file.
Does this PR introduce any user-facing change?
No
How was this patch tested?
No need.